Conda Repodata
This example benchmarks using different JSON libraries to parse and query the current_repodata.json file from conda-forge. This is a medium-sized (~14 MiB) JSON file containing nested metadata about every package on conda-forge.
The following libraries are compared:
- json (standard library)
- ujson
- orjson
- simdjson
- msgspec
This benchmark measures how long it takes each library to decode the current_repodata.json file, extract the name and size of each package, and determine the top 10 packages by file size.
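The file maps package filenames to metadata entries under a top-level "packages" key. A rough sketch of the shape the queries below rely on (the filename key and values are made up for illustration, and real entries carry many more fields):

# Abridged structure of current_repodata.json (hypothetical values)
{
    "info": {"subdir": "noarch"},
    "packages": {
        "demo-package-1.0-0.tar.bz2": {
            "name": "demo-package",
            "size": 1024,
            # ... version, depends, sha256, and other metadata fields
        },
        # ... one entry per package file on the channel
    },
}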
Results
$ python query_repodata.py
json: 139.14 ms
ujson: 124.91 ms
orjson: 91.69 ms
simdjson: 66.40 ms
msgspec: 25.73 ms
Commentary
All of these are fairly quick; the choice of library likely doesn't matter much for simple scripts working on small- to medium-sized data.
While orjson is faster than json, the difference between them is only ~30%. Creating Python objects dominates the execution time of any well-optimized decoding library. How fast the underlying JSON parser is matters, but JSON optimizations can only get you so far if you're still creating a new Python object for every node in the JSON document.
simdjson is much more performant. This is partly due to the SIMD optimizations it uses, but mostly it's due to not creating so many Python objects. simdjson first parses a JSON blob into a proxy object. It then lazily creates Python objects as needed as different fields are accessed. This means you only pay the cost of creating Python objects for the fields you use; a query that only accesses a few fields runs much faster since far fewer Python objects are created. The downside is that every attribute access results in some indirection as new objects are created.
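To make the laziness concrete, here is a minimal sketch using the same Parser().parse call as the source below (the JSON blob is made up):

import simdjson

parser = simdjson.Parser()
doc = parser.parse(b'{"packages": {"demo-1.0-0.tar.bz2": {"name": "demo", "size": 1024}}}')
# Only the fields touched below are materialized as Python objects; the
# rest of the document stays in simdjson's internal parsed representation.
for pkg in doc["packages"].values():
    print(pkg["name"], pkg["size"])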
msgspec is the fastest option tested. It relies on defining a known schema beforehand. We don't define the schema for the entire structure, only for the fields we access. Only fields that are part of the schema are decoded, with a new Python object created for each. This allocates the same number of objects as simdjson, but does it all at once, avoiding indirection costs later on during use. See this performance tip for more information.
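A minimal sketch of this schema-driven decoding (the blob is made up; note that the extra "md5" field is skipped entirely since it isn't part of the schema):

import msgspec

class Package(msgspec.Struct):
    name: str
    size: int

class RepoData(msgspec.Struct):
    packages: dict[str, Package]

blob = b'{"packages": {"demo-1.0-0.tar.bz2": {"name": "demo", "size": 1024, "md5": "ignored"}}}'
repo_data = msgspec.json.decode(blob, type=RepoData)
print(repo_data.packages["demo-1.0-0.tar.bz2"].size)  # 1024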
Source
The full example source is listed below.
import json
import time

import msgspec
import orjson
import requests
import simdjson
import ujson

def query_msgspec(data: bytes) -> list[tuple[int, str]]:
# Use Struct types to define the JSON schema. For efficiency we only define
# the fields we actually need.
class Package(msgspec.Struct):
name: str
size: int
class RepoData(msgspec.Struct):
packages: dict[str, Package]
# Decode the data as a `RepoData` type
repo_data = msgspec.json.decode(data, type=RepoData)
# Sort packages by `size`, and return the top 10
return sorted(
((p.size, p.name) for p in repo_data.packages.values()), reverse=True
)[:10]

def query_orjson(data: bytes) -> list[tuple[int, str]]:
repo_data = orjson.loads(data)
return sorted(
((p["size"], p["name"]) for p in repo_data["packages"].values()), reverse=True
)[:10]

def query_json(data: bytes) -> list[tuple[int, str]]:
repo_data = json.loads(data)
return sorted(
((p["size"], p["name"]) for p in repo_data["packages"].values()), reverse=True
)[:10]

def query_ujson(data: bytes) -> list[tuple[int, str]]:
repo_data = ujson.loads(data)
return sorted(
((p["size"], p["name"]) for p in repo_data["packages"].values()), reverse=True
)[:10]

def query_simdjson(data: bytes) -> list[tuple[int, str]]:
repo_data = simdjson.Parser().parse(data)
return sorted(
((p["size"], p["name"]) for p in repo_data["packages"].values()), reverse=True
)[:10]

# Download the current_repodata.json file
resp = requests.get(
"https://conda.anaconda.org/conda-forge/noarch/current_repodata.json"
)
resp.raise_for_status()
data = resp.content

libraries = [
("json", query_json),
("ujson", query_ujson),
("orjson", query_orjson),
("simdjson", query_simdjson),
("msgspec", query_msgspec),
]

# Run the query with each JSON library, timing the execution
for lib, func in libraries:
start = time.perf_counter()
func(data)
stop = time.perf_counter()
print(f"{lib}: {(stop - start) * 1000:.2f} ms")