Posted on August 18, 2022
It also has a large ecosystem of plugins supporting everything from adding maps and visualizations, to extending SQLite with custom SQL functions.
It looks like this:
In : import ibis In : ibis.options.interactive = True # enable interactive mode In : con = ibis.sqlite.connect("legislators.db") # connect to a database In : legislators = con.tables["legislators"] # access tables In : legislators.groupby("bio_gender").count() # query using a dataframe-like API Out: bio_gender count 0 F 399 1 M 12195
For users less familiar with SQL (like myself), having a dataframe-like API can enable better usage of existing data tools. Without Ibis I'd be more prone to writing simple SELECT statements only to extract the data I cared about, then analyze it locally using a more familiar tool like pandas. With ibis I can run more of my queries in the backing database itself, improving execution time and reducing data transfer.
Ibis supports a large number of backends and operations. As such, its internals can get a bit complicated. To help onboard myself to the project, I decided to write a new tiny backend linking Ibis and Datasette. This is something I've wanted for a while - I'm more comfortable in a terminal than a web UI, but I wanted to explore all the interesting open datasets Simon and team have put together.
The project is called ibis-datasette (repo). It can be installed using pip:
$ pip install ibis-datasette
Using it, you can connect ibis to any datasette server by passing in the full URL. For example, here we connect to the congress-legislators datasette demo, and run the same query as we did above:
In : import ibis In : ibis.options.interactive = True In : con = ibis.datasette.connect( ...: "https://congress-legislators.datasettes.com/legislators" ...: ) In : legislators = con.tables["legislators"] In : legislators.groupby("bio_gender").count() Out: bio_gender count 0 F 399 1 M 12195
Even though we're executing on a different backend with a different protocol, the user-facing code is the same, only the connect call is different.
Of course ibis can run more complicated queries.
For example, here we learn that Jeannette Rankin was the first female US representative, elected in 1917 in Montana.
In : terms = con.tables["legislator_terms"] In : first_female_rep = ( ...: legislators ...: .join(terms, legislators.id == terms.legislator_id) ...: .filter(lambda _: _.bio_gender == "F") ...: .select("name", "state", "start") ...: .sort_by("start") ...: .limit(1) ...: ) In : first_female_rep Out: name state start 0 Jeannette Rankin MT 1917-04-02
For an even more complicated query, here we compute the percentage of female US representatives per decade, filtering out the ~140 years of no representation:
In : percent_female_by_decade = ( ...: legislators ...: .join(terms, legislators.id == terms.legislator_id) ...: .select("bio_gender", "start") ...: .mutate( ...: decade=lambda _: (ibis.date(_.start).year() / 10).cast("int32") * 10 ...: ) ...: .group_by("decade") ...: .aggregate( ...: n_female=lambda _: (_.bio_gender == "F").sum(), ...: n_total=lambda _: _.count() ...: ) ...: .mutate( ...: percent_female=lambda _: 100 * (_.n_female / _.n_total) ...: ) ...: .filter(lambda _: _.percent_female > 0) ...: .select("decade", "percent_female") ...: ) In : percent_female_by_decade Out: decade percent_female 0 1910 0.040584 1 1920 0.883179 2 1930 1.608363 3 1940 1.845166 4 1950 3.030303 5 1960 2.718287 6 1970 3.592073 7 1980 4.977188 8 1990 10.830922 9 2000 15.865783 10 2010 20.196641 11 2020 27.789047
For the curious, you can see the generated SQL query using the ibis.show_sql function:
In : ibis.show_sql(percent_female_by_decade) SELECT t0.decade, t0.percent_female FROM ( SELECT t1.decade AS decade, t1.n_female AS n_female, t1.n_total AS n_total, t1.percent_female AS percent_female FROM ( SELECT t2.decade AS decade, t2.n_female AS n_female, t2.n_total AS n_total, ( t2.n_female / CAST(t2.n_total AS REAL) ) * 100 AS percent_female FROM ( SELECT t3.decade AS decade, SUM(CAST(t3.bio_gender = 'F' AS INTEGER)) AS n_female, COUNT('*') AS n_total FROM ( SELECT t4.bio_gender AS bio_gender, t4.start AS start, CAST(CAST(STRFTIME('%Y', DATE(t4.start)) AS INTEGER) / CAST(10 AS REAL) AS INTEGER) * 10 AS decade FROM ( SELECT bio_gender, start FROM main.legislators AS t5 JOIN main.legislator_terms AS t6 ON t5.id = t6.legislator_id ) AS t4 ) AS t3 GROUP BY t3.decade ) AS t2 ) AS t1 WHERE t1.percent_female > 0 ) AS t0
I wouldn't want to write all that by hand!
But then again, I'm not a SQL programmer. One benefit of Ibis is that it allows more seamless interoperation between tools. I didn't have to handwrite the above query, but can now share it with SQL users without requiring them to use Python.
Completing the loop, here's a static datasette link for the full query.
ibis-datasette has been a fun ~1-day hack, and I hope it remains a small and simple side project. It was definitely a good learning experience. That said, there are a couple known warts:
I wouldn't recommend using ibis-datasette for serious work, but I've found it a useful tool for exploring public datasette instances.
I would recommend using ibis and datasette for serious work though. They're both excellent, mature libraries, bringing some user friendliness to SQL database work.