Conda Repodata

This example benchmarks several JSON libraries on parsing and querying the current_repodata.json file from conda-forge. This is a medium-sized (~14 MiB) JSON file containing nested metadata about every package on conda-forge.

The following libraries are compared: json, ujson, orjson, simdjson, and msgspec.

This benchmark measures how long it takes each library to decode the current_repodata.json file, extract the name and size of each package, and determine the top 10 packages by file size.
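To make the measured task concrete, here is a minimal sketch of the same query using only the standard library json module. The SAMPLE document and the top_packages helper are made up for illustration; the sketch assumes only the shape the benchmark relies on, a top-level "packages" mapping whose values carry "name" and "size" fields.

```python
import json

# A tiny, made-up document mimicking the shape the benchmark relies on:
# a top-level "packages" mapping whose values have "name" and "size" fields.
SAMPLE = json.dumps(
    {
        "info": {"subdir": "noarch"},
        "packages": {
            "a-1.0-0.tar.bz2": {"name": "a", "size": 300, "version": "1.0"},
            "b-2.0-0.tar.bz2": {"name": "b", "size": 100, "version": "2.0"},
            "c-0.1-0.tar.bz2": {"name": "c", "size": 200, "version": "0.1"},
        },
    }
).encode()


def top_packages(data: bytes, n: int = 10) -> list[tuple[int, str]]:
    """Return the `n` largest packages as (size, name) pairs, largest first."""
    repo_data = json.loads(data)
    pairs = ((p["size"], p["name"]) for p in repo_data["packages"].values())
    return sorted(pairs, reverse=True)[:n]


print(top_packages(SAMPLE, n=2))  # [(300, 'a'), (200, 'c')]
```

Every field of every package is decoded here, even though the query only reads two of them.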

Results

$ python query_repodata.py
json: 139.14 ms
ujson: 124.91 ms
orjson: 91.69 ms
simdjson: 66.40 ms
msgspec: 25.73 ms

Commentary

All of these are fairly quick; library choice likely doesn’t matter for simple scripts on small- to medium-sized data.
While orjson is faster than json, the difference between them is only ~30%. Creating Python objects dominates the execution time of any well-optimized decoding library. How fast the underlying JSON parser is matters, but JSON optimizations can only get you so far if you’re still creating a new Python object for every node in the JSON object.
simdjson is considerably faster. This is partly due to the SIMD optimizations it uses, but mostly it’s due to not creating so many Python objects. simdjson first parses a JSON blob into a proxy object. It then lazily creates Python objects as needed as different fields are accessed. This means you only pay the cost of creating Python objects for the fields you use; a query that only accesses a few fields runs much faster since not as many Python objects are created. The downside is that every attribute access results in some indirection as new objects are created.
msgspec is the fastest option tested. It relies on defining a known schema beforehand. We don’t define the schema for the entire structure, only for the fields we access. Only fields that are part of the schema are decoded, with a new Python object created for each. This allocates the same number of objects as simdjson, but does it all at once, avoiding indirection costs later on during use. See this performance tip for more information.

Source

The full example source can be found here.

import json
import time

import orjson
import requests
import simdjson
import ujson

import msgspec


def query_msgspec(data: bytes) -> list[tuple[int, str]]:
    # Use Struct types to define the JSON schema. For efficiency we only define
    # the fields we actually need.
    class Package(msgspec.Struct):
        name: str
        size: int

    class RepoData(msgspec.Struct):
        packages: dict[str, Package]

    # Decode the data as a `RepoData` type
    repo_data = msgspec.json.decode(data, type=RepoData)

    # Sort packages by `size`, and return the top 10
    return sorted(
        ((p.size, p.name) for p in repo_data.packages.values()), reverse=True
    )[:10]


def query_orjson(data: bytes) -> list[tuple[int, str]]:
    repo_data = orjson.loads(data)
    return sorted(
        ((p["size"], p["name"]) for p in repo_data["packages"].values()), reverse=True
    )[:10]


def query_json(data: bytes) -> list[tuple[int, str]]:
    repo_data = json.loads(data)
    return sorted(
        ((p["size"], p["name"]) for p in repo_data["packages"].values()), reverse=True
    )[:10]


def query_ujson(data: bytes) -> list[tuple[int, str]]:
    repo_data = ujson.loads(data)
    return sorted(
        ((p["size"], p["name"]) for p in repo_data["packages"].values()), reverse=True
    )[:10]


def query_simdjson(data: bytes) -> list[tuple[int, str]]:
    repo_data = simdjson.Parser().parse(data)
    return sorted(
        ((p["size"], p["name"]) for p in repo_data["packages"].values()), reverse=True
    )[:10]


# Download the current_repodata.json file
resp = requests.get(
    "https://conda.anaconda.org/conda-forge/noarch/current_repodata.json"
)
resp.raise_for_status()
data = resp.content

libraries = [
    ("json", query_json),
    ("ujson", query_ujson),
    ("orjson", query_orjson),
    ("simdjson", query_simdjson),
    ("msgspec", query_msgspec),
]

# Run the query with each JSON library, timing the execution
for lib, func in libraries:
    start = time.perf_counter()
    func(data)
    stop = time.perf_counter()
    print(f"{lib}: {(stop - start) * 1000:.2f} ms")
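The loop above times a single run per library, so the numbers can vary between invocations. One possible refinement, not part of the original example, is to repeat each query and report the fastest run. The best_of helper and the stand-in payload below are hypothetical and exist only so the sketch runs without downloading anything.

```python
import json
import time
from typing import Callable


def best_of(func: Callable[[bytes], object], data: bytes, repeats: int = 5) -> float:
    """Run `func(data)` `repeats` times and return the fastest time in seconds."""
    timings = []
    for _ in range(repeats):
        start = time.perf_counter()
        func(data)
        stop = time.perf_counter()
        timings.append(stop - start)
    return min(timings)


# Stand-in payload so the sketch is self-contained; in the benchmark this
# would be the downloaded `data` bytes and one of the `query_*` functions.
payload = json.dumps(
    {"packages": {str(i): {"name": f"p{i}", "size": i} for i in range(1000)}}
).encode()

elapsed = best_of(json.loads, payload)
print(f"json: {elapsed * 1000:.2f} ms")
```

Taking the minimum rather than the mean is a common choice for micro-benchmarks, since slower runs mostly reflect interference from other work on the machine.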