Posted on August 24, 2022
Note: This blogpost is available as an interactive notebook here.
In the previous post we introduced ibis-datasette. To recap:
In this post, we'll use ibis and ibis-datasette to explore the ScotRail datasette.
This datasette is super fun to play around with. It's composed of ~2400 different audioclips (and transcriptions) from Scottish train operator ScotRail's automated station announcements.
If you haven't seen it, I encourage you to read Simon Willison's excellent blogpost on putting this datasette together, and some interesting queries to try (we'll be replicating one of these below).
While you can use the datasette UI directly, I wanted to use ibis and the full power of Python to explore and build some interesting things.
---
First we start with some imports and initialization.
Here we:
In [1]: import ibis
...: from ibis import _
In [2]: ibis.options.interactive = True
In [3]: import pandas as pd
...: pd.set_option('max_colwidth', 400)
Next we need to connect to the datasette. This is done by passing the full URL to ibis.datasette.connect:
In [4]: con = ibis.datasette.connect("https://scotrail.datasette.io/scotrail")
Once connected, we can start poking around.
The first thing I usually do when exploring a new dataset is examine the tables and schemas.
In [5]: con.list_tables()
Out[5]:
['announcements',
'announcements_fts',
'announcements_fts_config',
'announcements_fts_data',
'announcements_fts_docsize',
'announcements_fts_idx']
In [6]: con.tables.announcements.schema()
Out[6]:
ibis.Schema {
File string
Transcription string
Category string
mp3 string
Notes string
Timestamp string
NRE_ID string
}
In [7]: con.tables.announcements.head()
Out[7]:
File Transcription Category ... Notes Timestamp NRE_ID
0 0031 I am sorry to announce that the Apology ...
1 0085 We are sorry to announce that the Apology ... Most frequently used file
2 1339 We are sorry to announce that Apology ...
3 1488 we apologise for the inconvenience caused Apology ...
4 1524 Apologies to customers... Apology ...
[5 rows x 7 columns]
The main table is announcements, the most interesting columns of which are:
Since we're going to be accessing this table a lot below, lets save it to a shorter local variable name:
In [8]: t = con.tables.announcements
To get a better sense of the scale of data we're working with, lets take a closer look at the Category column.
I want to know how many categories there are, and how the audio clips are distributed across these categories.
To do this, we can use:
In [9]: category_counts = (
...: t.group_by("Category")
...: .count()
...: .sort_by(ibis.desc("count"))
...: )
...:
...: category_counts
Out[9]:
Category count
0 Destination 1271
1 Reason 421
2 Time 161
3 Passenger information 153
4 Number 102
5 Train operating company 76
6 Platform information 67
7 Conjoining 66
8 Weather 30
9 Safety 15
10 Train formation 14
11 Special train 12
12 Operational 10
13 Apology 8
14 Fare information 7
15 Platform ID 7
16 Heritage Railway 6
17 Number combo 4
18 Non-vocal 3
19 Strathclyde metro 3
20 Request stop 2
21 Station 1
22 Train operating company 1
Here we can see there are 23 categories, with 90% of the audio clips falling into the first 6. A few categories to highlight:
The Reason category is the most fun to look through. There are all sorts of reasons a train might be cancelled, from "Sheep on the railway" to "A wartime bomb near the railway".
One reoccuring reason is theft (err, "attempted theft") of various things. Lets find all reasons involving "theft".
This can be done by using .filter() to filter rows based on a predicate. Here we need two predicates:
In [10]: thefts = t.filter((_.Category == "Reason") & _.Transcription.contains("theft"))
...:
...: thefts
Out[10]:
File Transcription ... Timestamp NRE_ID
0 0969 Attempted theft of overhead line electrification equipment ...
1 0970 Attempted theft of overhead line electrification equipment earlier today ...
2 0971 Attempted theft of overhead line electrification equipment yesterday ...
3 0972 Attempted theft of railway equipment ...
4 0973 Attempted theft of railway equipment earlier today ...
5 0974 Attempted theft of railway equipment yesterday ...
6 0975 Attempted theft of signalling cables ...
7 0976 Attempted theft of signalling cables earlier today ...
8 0977 Attempted theft of signalling cables yesterday ...
9 0978 Attempted theft of third rail electrification equipment ...
10 0979 Attempted theft of third rail electrification equipment earlier today ...
11 0980 Attempted theft of third rail electrification equipment yesterday ...
[12 rows x 7 columns]
All of these rows also include a link to an mp3 file containing that clip. To play a clip in a jupyter notebook, we can make use of IPython.display.Audio. For example, lets play the first clip from above:
In [12]: from IPython.display import Audio
...:
...: mp3_url = thefts.limit(1).execute().mp3.iloc[0]
...:
...: Audio(mp3_url)
In his blogpost Simon wrote up a SQL query for generating a Random apology by combining a few random rows from different categories above. It generates surprisingly coherent sentences, you can see the datasette version here.
If you're interested you can click "show" at the top to see the full SQL query - it's readable, but a bit long.
I wanted to reproduce the same query using ibis. Since ibis is just a Python library, you can make use of things like functions to abstract away some of the repetitiveness in the SQL query above.
Here's what I came up with:
In [12]: def random(category):
...: """Select a random row from a given category"""
...: return (
...: t.filter(_.Category == category)
...: .sort_by(ibis.random())
...: .select("Transcription", "mp3")
...: .limit(1)
...: )
...:
...: def phrase(text):
...: """Select a row with a specific transcription"""
...: return (
...: t.filter(_.Transcription == text)
...: .select("Transcription", "mp3")
...: .limit(1)
...: )
...:
...: query = ibis.union(
...: random("Apology"),
...: random("Train operating company"),
...: random("Destination"),
...: phrase("has been cancelled"),
...: phrase("due to"),
...: random("Reason"),
...: )
Since the query selects random rows, if you run the cell below multiple times, you should see different results every time:
In [13]: query.execute()
Out[13]:
Transcription mp3
0 I am sorry to announce that the https://github.com/matteason/scotrail-announcements-june-2022/raw/main/announcements/1529.mp3
1 Southeastern Trains https://github.com/matteason/scotrail-announcements-june-2022/raw/main/announcements/0201.mp3
2 Dawlish https://github.com/matteason/scotrail-announcements-june-2022/raw/main/announcements/0702.mp3
3 has been cancelled https://github.com/matteason/scotrail-announcements-june-2022/raw/main/announcements/0340.mp3
4 due to https://github.com/matteason/scotrail-announcements-june-2022/raw/main/announcements/1528.mp3
5 A person being hit by a train https://github.com/matteason/scotrail-announcements-june-2022/raw/main/announcements/0834.mp3
If we wanted to do all computation in the backend, we could use group_concat (docs) to then concatenate the Transcription rows together, returning a single string:
In [14]: random_apology = query.Transcription.group_concat(" ")
...:
...: random_apology
Out[14]: 'we apologise for the inconvenience caused East Midlands Cartsdyke has
been cancelled due to A train not stopping in the correct position at a station'
Note that the full query above is translated to SQL and executed on the datasette server, no computation is happening locally.
If you want to see the generated SQL, you can use the ibis.show_sql function. It's much longer than the Python code that generated it:
In [15]: ibis.show_sql(random_apology)
WITH anon_1 AS (
SELECT
t1."Transcription" AS "Transcription",
t1.mp3 AS mp3
FROM (
SELECT
t2."File" AS "File",
t2."Transcription" AS "Transcription",
t2."Category" AS "Category",
t2.mp3 AS mp3,
t2."Notes" AS "Notes",
t2."Timestamp" AS "Timestamp",
t2."NRE_ID" AS "NRE_ID"
FROM main.announcements AS t2
WHERE
t2."Category" = 'Apology'
ORDER BY
RANDOM()
) AS t1
LIMIT 1
OFFSET 0
), anon_2 AS (
SELECT
t1."Transcription" AS "Transcription",
t1.mp3 AS mp3
FROM (
SELECT
t2."File" AS "File",
t2."Transcription" AS "Transcription",
t2."Category" AS "Category",
t2.mp3 AS mp3,
t2."Notes" AS "Notes",
t2."Timestamp" AS "Timestamp",
t2."NRE_ID" AS "NRE_ID"
FROM main.announcements AS t2
WHERE
t2."Category" = 'Train operating company'
ORDER BY
RANDOM()
) AS t1
LIMIT 1
OFFSET 0
), anon_3 AS (
SELECT
t1."Transcription" AS "Transcription",
t1.mp3 AS mp3
FROM (
SELECT
t2."File" AS "File",
t2."Transcription" AS "Transcription",
t2."Category" AS "Category",
t2.mp3 AS mp3,
t2."Notes" AS "Notes",
t2."Timestamp" AS "Timestamp",
t2."NRE_ID" AS "NRE_ID"
FROM main.announcements AS t2
WHERE
t2."Category" = 'Destination'
ORDER BY
RANDOM()
) AS t1
LIMIT 1
OFFSET 0
), anon_4 AS (
SELECT
t1."Transcription" AS "Transcription",
t1.mp3 AS mp3
FROM (
SELECT
t2."File" AS "File",
t2."Transcription" AS "Transcription",
t2."Category" AS "Category",
t2.mp3 AS mp3,
t2."Notes" AS "Notes",
t2."Timestamp" AS "Timestamp",
t2."NRE_ID" AS "NRE_ID"
FROM main.announcements AS t2
WHERE
t2."Transcription" = 'has been cancelled'
) AS t1
LIMIT 1
OFFSET 0
), anon_5 AS (
SELECT
t1."Transcription" AS "Transcription",
t1.mp3 AS mp3
FROM (
SELECT
t2."File" AS "File",
t2."Transcription" AS "Transcription",
t2."Category" AS "Category",
t2.mp3 AS mp3,
t2."Notes" AS "Notes",
t2."Timestamp" AS "Timestamp",
t2."NRE_ID" AS "NRE_ID"
FROM main.announcements AS t2
WHERE
t2."Transcription" = 'due to'
) AS t1
LIMIT 1
OFFSET 0
), anon_6 AS (
SELECT
t1."Transcription" AS "Transcription",
t1.mp3 AS mp3
FROM (
SELECT
t2."File" AS "File",
t2."Transcription" AS "Transcription",
t2."Category" AS "Category",
t2.mp3 AS mp3,
t2."Notes" AS "Notes",
t2."Timestamp" AS "Timestamp",
t2."NRE_ID" AS "NRE_ID"
FROM main.announcements AS t2
WHERE
t2."Category" = 'Reason'
ORDER BY
RANDOM()
) AS t1
LIMIT 1
OFFSET 0
)
SELECT
GROUP_CONCAT(t0."Transcription", ' ') AS tmp
FROM (
SELECT
anon_1."Transcription" AS "Transcription",
anon_1.mp3 AS mp3
FROM anon_1
UNION ALL
SELECT
anon_2."Transcription" AS "Transcription",
anon_2.mp3 AS mp3
FROM anon_2
UNION ALL
SELECT
anon_3."Transcription" AS "Transcription",
anon_3.mp3 AS mp3
FROM anon_3
UNION ALL
SELECT
anon_4."Transcription" AS "Transcription",
anon_4.mp3 AS mp3
FROM anon_4
UNION ALL
SELECT
anon_5."Transcription" AS "Transcription",
anon_5.mp3 AS mp3
FROM anon_5
UNION ALL
SELECT
anon_6."Transcription" AS "Transcription",
anon_6.mp3 AS mp3
FROM anon_6
) AS t0
However, we're only using ibis to push the bulk of the computation to the backend. We don't need to handle everything in SQL, only enough to reduce the size of the results to something reasonable to return from the datasette server.
We also have access to the full Python ecosystem to process results. This lets us do some things that wouldn't be possible in SQL alone, like concatenating mp3 files :).
The ipywidgets library provides support for building simple UIs in Python, with the rendering handled by the notebook. This is nice for me, as I am not a web engineer - I'm a novice at best at javascript/html. However, I do know how to write Python.
Below we hack together a quick UI with ipywidgets to make a button for generating a random apology, complete with a merged mp3 file so you can listen to your work. You don't really need to understand this code, it has nothing to do with ibis or ibis-datasette itself.
Clicking the button will pull generate a new random apology, download and merge the mp3 files, and display both the apology sentence and merged mp3.
Obviously this can't run for real in a static blogpost (it's most interesting in the interactive notebook on mybinder). To work around that, we include a single generated audio clip below.
In [16]: import tempfile
...: import os
...: import pydub
...: import httpx
...: import ipywidgets
...: from IPython.display import Audio, display
...:
...: output = ipywidgets.Output()
...: button = ipywidgets.Button(description='Random Apology', icon="repeat")
...: UI = ipywidgets.VBox([button, output])
...:
...:
...: def concatenate_mp3s(urls: list[str]) -> bytes:
...: with httpx.Client(follow_redirects=True) as client, tempfile.TemporaryDirectory() as tempdir:
...: output = None
...: for i, url in enumerate(urls):
...: path = os.path.join(tempdir, f"part{i}.mp3")
...: with open(path, "wb") as f:
...: resp = client.get(url)
...: resp.raise_for_status()
...: f.write(resp.content)
...: part = pydub.AudioSegment.from_mp3(path)
...: if output is None:
...: output = part
...: else:
...: output = output + part
...: out_path = os.path.join(tempdir, "output.mp3")
...: output.export(out_path, format="mp3")
...: with open(out_path, "rb") as f:
...: return f.read()
...:
...:
...: @button.on_click
...: def on_click(*args):
...: output.clear_output()
...: result = query.execute()
...: msg = " ".join(result.Transcription)
...: mp3 = concatenate_mp3s(result.mp3)
...: with output:
...: print(msg)
...: display(Audio(mp3))
...:
...:
...: UI
I am sorry to announce that the Midland Mainline Turbostar service from
Mitcham Eastfields has been cancelled due to The train conductor being
taken ill
datasette makes it easier to publish accessible open data on the web, with a UI exposed for writing SQL queries. However, not everyone is extremely SQL literate (myself included). ibis and ibis-datasette let Python programmers access this same data resource, but through a familiar dataframe-like interface.
Interested in ibis or ibis-datasette? Please feel free to reach out on github or twitter.