Ibis-Datasette

Posted on August 18, 2022

Datasette

Datasette is an excellent tool for exploring and publishing data. Given an existing SQLite database, it provides:

A nice web UI for exploring the dataset
A flexible JSON API for querying the dataset using SQL (SELECT statements only)
Tooling for publishing that data on a number of common cloud platforms.

It also has a large ecosystem of plugins supporting everything from adding maps and visualizations, to extending SQLite with custom SQL functions.

For more information, I encourage you to watch this intro video. Or just start poking around the examples. The UI lends itself well to self-guided exploration.

Ibis

At my new day job I work on another SQL-adjacent tool called Ibis. Ibis provides a consistent dataframe-like API for querying data using a number of SQL (and non-SQL) backends.

It looks like this:

In [1]: import ibis

In [2]: ibis.options.interactive = True  # enable interactive mode

In [3]: con = ibis.sqlite.connect("legislators.db")  # connect to a database

In [4]: legislators = con.tables["legislators"]  # access tables

In [5]: legislators.groupby("bio_gender").count()  # query using a dataframe-like API
Out[5]:
  bio_gender  count
0          F    399
1          M  12195

For users less familiar with SQL (like myself), having a dataframe-like API can enable better usage of existing data tools. Without Ibis I'd be more prone to writing simple SELECT statements only to extract the data I cared about, then analyze it locally using a more familiar tool like pandas. With ibis I can run more of my queries in the backing database itself, improving execution time and reducing data transfer.

Datasette & Ibis

Ibis supports a large number of backends and operations. As such, its internals can get a bit complicated. To help onboard myself to the project, I decided to write a new tiny backend linking Ibis and Datasette. This is something I've wanted for a while - I'm more comfortable in a terminal than a web UI, but I wanted to explore all the interesting open datasets Simon and team have put together.

The project is called ibis-datasette (repo). It can be installed using pip:

$ pip install ibis-datasette

Using it, you can connect ibis to any datasette server by passing in the full URL. For example, here we connect to the congress-legislators datasette demo, and run the same query as we did above:

In [1]: import ibis

In [2]: ibis.options.interactive = True

In [3]: con = ibis.datasette.connect(
   ...:    "https://congress-legislators.datasettes.com/legislators"
   ...: )

In [4]: legislators = con.tables["legislators"]

In [5]: legislators.groupby("bio_gender").count()
Out[5]:
  bio_gender  count
0          F    399
1          M  12195

Even though we're executing on a different backend with a different protocol, the user-facing code is the same, only the connect call is different.

Of course ibis can run more complicated queries.

For example, here we learn that Jeannette Rankin was the first female US representative, elected in 1917 in Montana.

In [6]: terms = con.tables["legislator_terms"]

In [7]: first_female_rep = (
   ...:    legislators
   ...:    .join(terms, legislators.id == terms.legislator_id)
   ...:    .filter(lambda _: _.bio_gender == "F")
   ...:    .select("name", "state", "start")
   ...:    .sort_by("start")
   ...:    .limit(1)
   ...: )

In [8]: first_female_rep
Out[8]:
               name state       start
0  Jeannette Rankin    MT  1917-04-02

For an even more complicated query, here we compute the percentage of female US representatives per decade, filtering out the ~140 years of no representation:

In [9]: percent_female_by_decade = (
   ...:     legislators
   ...:     .join(terms, legislators.id == terms.legislator_id)
   ...:     .select("bio_gender", "start")
   ...:     .mutate(
   ...:         decade=lambda _: (ibis.date(_.start).year() / 10).cast("int32") * 10
   ...:     )
   ...:     .group_by("decade")
   ...:     .aggregate(
   ...:         n_female=lambda _: (_.bio_gender == "F").sum(),
   ...:         n_total=lambda _: _.count()
   ...:     )
   ...:     .mutate(
   ...:         percent_female=lambda _: 100 * (_.n_female / _.n_total)
   ...:     )
   ...:     .filter(lambda _: _.percent_female > 0)
   ...:     .select("decade", "percent_female")
   ...: )

In [10]: percent_female_by_decade
Out[10]:
    decade  percent_female
0     1910        0.040584
1     1920        0.883179
2     1930        1.608363
3     1940        1.845166
4     1950        3.030303
5     1960        2.718287
6     1970        3.592073
7     1980        4.977188
8     1990       10.830922
9     2000       15.865783
10    2010       20.196641
11    2020       27.789047

For the curious, you can see the generated SQL query using the ibis.show_sql function:

In [11]: ibis.show_sql(percent_female_by_decade)
SELECT
  t0.decade,
  t0.percent_female
FROM (
  SELECT
    t1.decade AS decade,
    t1.n_female AS n_female,
    t1.n_total AS n_total,
    t1.percent_female AS percent_female
  FROM (
    SELECT
      t2.decade AS decade,
      t2.n_female AS n_female,
      t2.n_total AS n_total,
      (
        t2.n_female / CAST(t2.n_total AS REAL)
      ) * 100 AS percent_female
    FROM (
      SELECT
        t3.decade AS decade,
        SUM(CAST(t3.bio_gender = 'F' AS INTEGER)) AS n_female,
        COUNT('*') AS n_total
      FROM (
        SELECT
          t4.bio_gender AS bio_gender,
          t4.start AS start,
          CAST(CAST(STRFTIME('%Y', DATE(t4.start)) AS INTEGER) / CAST(10 AS REAL) AS INTEGER) * 10 AS decade
        FROM (
          SELECT
            bio_gender,
            start
          FROM main.legislators AS t5
          JOIN main.legislator_terms AS t6
            ON t5.id = t6.legislator_id
        ) AS t4
      ) AS t3
      GROUP BY
        t3.decade
    ) AS t2
  ) AS t1
  WHERE
    t1.percent_female > 0
) AS t0

I wouldn't want to write all that by hand!

But then again, I'm not a SQL programmer. One benefit of Ibis is that it allows more seamless interoperation between tools. I didn't have to handwrite the above query, but can now share it with SQL users without requiring them to use Python.

Completing the loop, here's a static datasette link for the full query.

Wrapping Up

ibis-datasette has been a fun ~1-day hack, and I hope it remains a small and simple side project. It was definitely a good learning experience. That said, there are a couple known warts:

Ibis makes heavy use of sqlalchemy for both SQL generation and execution. This meant that I had to write both a dbapi and SQLAlchemy backend for datasette to get everything hooked up properly, even though it's just a thin wrapper around the existing sqlite backend. In the future it might be good to separate SQL generation from execution in ibis to simplify this process. This may also open up opportunities for further optimization, since we may be able to make use of more efficient database APIs instead of relying on the generic dbapi 2.0 spec.
Datasette's JSON API doesn't expose a way to provide non-string parameters for parametrized queries, while SQLAlchemy generates lots of parametrized queries. For now I'm hacking around this with some terrible string processing; since it's only for numeric values on an immutable database, the chance of a Bobby Tables incident is low. It still feels wrong though. In the future we should be able to avoid this issue entirely by handling parametrization entirely in ibis (or ibis-datasette).

I wouldn't recommend using ibis-datasette for serious work, but I've found it a useful tool for exploring public datasette instances.

I would recommend using ibis and datasette for serious work though. They're both excellent, mature libraries, bringing some user friendliness to SQL database work.

Interested in ibis or ibis-datasette? Please feel free to reach out on github or twitter.