Posted on August 18, 2022
Datasette is an excellent tool for exploring and publishing data. Given an existing SQLite database, it provides:
It also has a large ecosystem of plugins supporting everything from adding maps and visualizations, to extending SQLite with custom SQL functions.
For more information, I encourage you to watch this intro video. Or just start poking around the examples. The UI lends itself well to self-guided exploration.
At my new day job I work on another SQL-adjacent tool called Ibis. Ibis provides a consistent dataframe-like API for querying data using a number of SQL (and non-SQL) backends.
It looks like this:
In [1]: import ibis
In [2]: ibis.options.interactive = True # enable interactive mode
In [3]: con = ibis.sqlite.connect("legislators.db") # connect to a database
In [4]: legislators = con.tables["legislators"] # access tables
In [5]: legislators.groupby("bio_gender").count() # query using a dataframe-like API
Out[5]:
bio_gender count
0 F 399
1 M 12195
For users less familiar with SQL (like myself), having a dataframe-like API can enable better usage of existing data tools. Without Ibis I'd be more prone to writing simple SELECT statements only to extract the data I cared about, then analyze it locally using a more familiar tool like pandas. With ibis I can run more of my queries in the backing database itself, improving execution time and reducing data transfer.
Ibis supports a large number of backends and operations. As such, its internals can get a bit complicated. To help onboard myself to the project, I decided to write a new tiny backend linking Ibis and Datasette. This is something I've wanted for a while - I'm more comfortable in a terminal than a web UI, but I wanted to explore all the interesting open datasets Simon and team have put together.
The project is called ibis-datasette (repo). It can be installed using pip:
$ pip install ibis-datasette
Using it, you can connect ibis to any datasette server by passing in the full URL. For example, here we connect to the congress-legislators datasette demo, and run the same query as we did above:
In [1]: import ibis
In [2]: ibis.options.interactive = True
In [3]: con = ibis.datasette.connect(
...: "https://congress-legislators.datasettes.com/legislators"
...: )
In [4]: legislators = con.tables["legislators"]
In [5]: legislators.groupby("bio_gender").count()
Out[5]:
bio_gender count
0 F 399
1 M 12195
Even though we're executing on a different backend with a different protocol, the user-facing code is the same, only the connect call is different.
Of course ibis can run more complicated queries.
For example, here we learn that Jeannette Rankin was the first female US representative, elected in 1917 in Montana.
In [6]: terms = con.tables["legislator_terms"]
In [7]: first_female_rep = (
...: legislators
...: .join(terms, legislators.id == terms.legislator_id)
...: .filter(lambda _: _.bio_gender == "F")
...: .select("name", "state", "start")
...: .sort_by("start")
...: .limit(1)
...: )
In [8]: first_female_rep
Out[8]:
name state start
0 Jeannette Rankin MT 1917-04-02
For an even more complicated query, here we compute the percentage of female US representatives per decade, filtering out the ~140 years of no representation:
In [9]: percent_female_by_decade = (
...: legislators
...: .join(terms, legislators.id == terms.legislator_id)
...: .select("bio_gender", "start")
...: .mutate(
...: decade=lambda _: (ibis.date(_.start).year() / 10).cast("int32") * 10
...: )
...: .group_by("decade")
...: .aggregate(
...: n_female=lambda _: (_.bio_gender == "F").sum(),
...: n_total=lambda _: _.count()
...: )
...: .mutate(
...: percent_female=lambda _: 100 * (_.n_female / _.n_total)
...: )
...: .filter(lambda _: _.percent_female > 0)
...: .select("decade", "percent_female")
...: )
In [10]: percent_female_by_decade
Out[10]:
decade percent_female
0 1910 0.040584
1 1920 0.883179
2 1930 1.608363
3 1940 1.845166
4 1950 3.030303
5 1960 2.718287
6 1970 3.592073
7 1980 4.977188
8 1990 10.830922
9 2000 15.865783
10 2010 20.196641
11 2020 27.789047
For the curious, you can see the generated SQL query using the ibis.show_sql function:
In [11]: ibis.show_sql(percent_female_by_decade)
SELECT
t0.decade,
t0.percent_female
FROM (
SELECT
t1.decade AS decade,
t1.n_female AS n_female,
t1.n_total AS n_total,
t1.percent_female AS percent_female
FROM (
SELECT
t2.decade AS decade,
t2.n_female AS n_female,
t2.n_total AS n_total,
(
t2.n_female / CAST(t2.n_total AS REAL)
) * 100 AS percent_female
FROM (
SELECT
t3.decade AS decade,
SUM(CAST(t3.bio_gender = 'F' AS INTEGER)) AS n_female,
COUNT('*') AS n_total
FROM (
SELECT
t4.bio_gender AS bio_gender,
t4.start AS start,
CAST(CAST(STRFTIME('%Y', DATE(t4.start)) AS INTEGER) / CAST(10 AS REAL) AS INTEGER) * 10 AS decade
FROM (
SELECT
bio_gender,
start
FROM main.legislators AS t5
JOIN main.legislator_terms AS t6
ON t5.id = t6.legislator_id
) AS t4
) AS t3
GROUP BY
t3.decade
) AS t2
) AS t1
WHERE
t1.percent_female > 0
) AS t0
I wouldn't want to write all that by hand!
But then again, I'm not a SQL programmer. One benefit of Ibis is that it allows more seamless interoperation between tools. I didn't have to handwrite the above query, but can now share it with SQL users without requiring them to use Python.
Completing the loop, here's a static datasette link for the full query.
ibis-datasette has been a fun ~1-day hack, and I hope it remains a small and simple side project. It was definitely a good learning experience. That said, there are a couple known warts:
I wouldn't recommend using ibis-datasette for serious work, but I've found it a useful tool for exploring public datasette instances.
I would recommend using ibis and datasette for serious work though. They're both excellent, mature libraries, bringing some user friendliness to SQL database work.
Interested in ibis or ibis-datasette? Please feel free to reach out on github or twitter.