Parsing pyproject.toml

PEP 518 defined a new pyproject.toml configuration file that Python projects can use to configure:

  • Metadata (name, version, …)

  • Dependencies

  • Build systems

  • Additional development tools (black, mypy, pytest, … all support pyproject.toml files for configuration).

The format was defined in a series of Python Enhancement Proposals (PEPs), which also serve as the main documentation for the file schema.

  • PEP 517: A build-system independent format for source trees

  • PEP 518: Specifying minimum build system requirements for Python projects

  • PEP 621: Storing project metadata in pyproject.toml

Here we define a msgspec schema for parsing and validating a pyproject.toml file. This includes full schema definitions for all fields in the build-system and project tables, as well as an untyped table under tool.

The full example source can be found here.

from typing import Any

import msgspec


class Base(
    msgspec.Struct,
    omit_defaults=True,
    forbid_unknown_fields=True,
    rename="kebab",
):
    """A base class holding some common settings.

    - We set ``omit_defaults = True`` to omit any fields containing only their
      default value from the output when encoding.
    - We set ``forbid_unknown_fields = True`` to error nicely if an unknown
      field is present in the input TOML. This helps catch typo errors early,
      and is also required per PEP 621.
    - We set ``rename = "kebab"`` to rename all fields to use kebab case when
      encoding/decoding, as this is the convention used in pyproject.toml. For
      example, this will rename ``requires_python`` to ``requires-python``.
    """

    pass


class BuildSystem(Base):
    requires: list[str] = []
    build_backend: str | None = None
    backend_path: list[str] = []


class Readme(Base):
    file: str | None = None
    text: str | None = None
    content_type: str | None = None


class License(Base):
    file: str | None = None
    text: str | None = None


class Contributor(Base):
    name: str | None = None
    email: str | None = None


class Project(Base):
    name: str | None = None
    version: str | None = None
    description: str | None = None
    readme: str | Readme | None = None
    license: str | License | None = None
    authors: list[Contributor] = []
    maintainers: list[Contributor] = []
    keywords: list[str] = []
    classifiers: list[str] = []
    urls: dict[str, str] = {}
    requires_python: str | None = None
    dependencies: list[str] = []
    optional_dependencies: dict[str, list[str]] = {}
    scripts: dict[str, str] = {}
    gui_scripts: dict[str, str] = {}
    entry_points: dict[str, dict[str, str]] = {}
    dynamic: list[str] = []


class PyProject(Base):
    build_system: BuildSystem | None = None
    project: Project | None = None
    tool: dict[str, dict[str, Any]] = {}


def decode(data: bytes | str) -> PyProject:
    """Decode a ``pyproject.toml`` file from TOML"""
    return msgspec.toml.decode(data, type=PyProject)


def encode(msg: PyProject) -> bytes:
    """Encode a ``PyProject`` object to TOML"""
    return msgspec.toml.encode(msg)
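
As a rough illustration of what rename="kebab" does to the field names above, the mapping can be approximated in plain Python. Note that to_kebab is a hypothetical stand-in written for this example; it is not msgspec's own implementation and may differ from it on unusual names.

```python
def to_kebab(field_name: str) -> str:
    """Approximate the kebab-case renaming: underscores become hyphens."""
    return field_name.replace("_", "-")


# Attribute names from the structs above and the TOML keys they correspond to.
for attr in ("build_backend", "requires_python", "optional_dependencies"):
    print(f"{attr} -> {to_kebab(attr)}")
```

Single-word fields such as name or version are unaffected, which is why only a few keys in the schema differ from their attribute names.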

Here we use the schema, saved as a module named pyproject, to load the pyproject.toml for Starlette:

In [1]: import pyproject

In [2]: import urllib.request

In [3]: url = "https://raw.githubusercontent.com/encode/starlette/master/pyproject.toml"

In [4]: with urllib.request.urlopen(url) as f:
   ...:     data = f.read()

In [5]: result = pyproject.decode(data)  # decode the pyproject.toml

In [6]: result.build_system
Out[6]: BuildSystem(requires=['hatchling'], build_backend='hatchling.build', backend_path=[])

In [7]: result.project.name
Out[7]: 'starlette'

Note that this only validates that fields are of the proper type. It doesn’t check:

  • Whether strings like URLs or dependency specifiers are valid. Some of these could be handled using msgspec’s existing Constraints system, but not all of them.

  • Mutually exclusive field restrictions (for example, you can’t set both project.license.file and project.license.text). msgspec currently has no way of declaring these restrictions.
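
Both gaps can be partly covered by ordinary Python checks run after decoding. The following is a minimal sketch and not part of msgspec: check_exclusive and check_urls are hypothetical helper names, and SimpleNamespace stands in for a decoded License struct so that the snippet runs on its own. The URL check is deliberately loose, requiring only a scheme and a host.

```python
from types import SimpleNamespace
from urllib.parse import urlparse


def check_exclusive(table, first: str, second: str) -> None:
    """Raise ValueError if both ``first`` and ``second`` are set on ``table``."""
    # Fields such as ``project.license`` may be None or a plain string;
    # there is nothing to check in either case.
    if table is None or isinstance(table, str):
        return
    if getattr(table, first) is not None and getattr(table, second) is not None:
        raise ValueError(f"only one of {first!r} and {second!r} may be set")


def check_urls(urls: dict[str, str]) -> None:
    """Raise ValueError for any value that lacks a scheme or a host."""
    for label, url in urls.items():
        parts = urlparse(url)
        if not (parts.scheme and parts.netloc):
            raise ValueError(f"invalid URL for {label!r}: {url!r}")


# Stand-in for ``result.project.license`` after decoding (an assumption
# made so the example does not depend on the schema module).
license_table = SimpleNamespace(file="LICENSE", text=None)
check_exclusive(license_table, "file", "text")  # passes: only ``file`` is set
check_urls({"Homepage": "https://example.com"})  # passes
```

With the schema above, the same calls would be made on result.project.license and result.project.urls once decode has returned and result.project is known not to be None.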

Even with these caveats, the schemas here are still useful:

  • Since forbid_unknown_fields=True is configured, any unknown field in the input raises an error rather than being silently ignored. This is very useful for catching typos in configuration files.

  • Type errors in field values are also caught, with an error message that includes the path to the offending field.

  • Any downstream consumers of decode have a nice high-level object to work with, complete with type annotations. This plays well with tab-completion and tools like mypy or pyright, improving usability.

For example, here’s an invalid pyproject.toml.

[build-system]
requires = "hatchling"
build-backend = "hatchling.build"

[project]
name = "myproject"
version = "0.1.0"
description = "a super great library"
authors = [
    {name = "alice shmalice", email = "alice@company.com"}
]

Can you spot the error? Using the schemas defined above, msgspec can detect schema issues like this and raise a nice error message. In this case the issue is that build-system.requires should be an array of strings, not a single string:

In [1]: import pyproject

In [2]: with open("pyproject.toml", "rb") as f:
   ...:     invalid = f.read()

In [3]: pyproject.decode(invalid)
---------------------------------------------------------------------------
ValidationError                           Traceback (most recent call last)
Cell In [3], line 1
----> 1 pyproject.decode(invalid)
ValidationError: Expected `array`, got `str` - at `$.build-system.requires`