Parsing pyproject.toml
PEP 518 defined a new pyproject.toml configuration file that Python projects can use for configuring:
Metadata (name, version, …)
Dependencies
Build systems
Additional development tools (black, mypy, pytest, … all support pyproject.toml files for configuration).
The format was defined in a series of Python Enhancement Proposals (PEPs), which also serve as the main documentation for the file schema.
PEP 517: A build-system independent format for source trees
PEP 518: Specifying minimum build system requirements for Python projects
PEP 621: Storing project metadata in pyproject.toml
Here we define a msgspec schema for parsing and validating a pyproject.toml file. This includes full schema definitions for all fields in the build-system and project tables, as well as an untyped table under tool.
The full example source can be found here.
from typing import Any

import msgspec


class Base(
    msgspec.Struct,
    omit_defaults=True,
    forbid_unknown_fields=True,
    rename="kebab",
):
    """A base class holding some common settings.

    - We set ``omit_defaults = True`` to omit any fields containing only their
      default value from the output when encoding.
    - We set ``forbid_unknown_fields = True`` to error nicely if an unknown
      field is present in the input TOML. This helps catch typo errors early,
      and is also required per PEP 621.
    - We set ``rename = "kebab"`` to rename all fields to use kebab case when
      encoding/decoding, as this is the convention used in pyproject.toml. For
      example, this will rename ``requires_python`` to ``requires-python``.
    """

    pass


class BuildSystem(Base):
    requires: list[str] = []
    build_backend: str | None = None
    backend_path: list[str] = []


class Readme(Base):
    file: str | None = None
    text: str | None = None
    content_type: str | None = None


class License(Base):
    file: str | None = None
    text: str | None = None


class Contributor(Base):
    name: str | None = None
    email: str | None = None


class Project(Base):
    name: str | None = None
    version: str | None = None
    description: str | None = None
    readme: str | Readme | None = None
    license: str | License | None = None
    authors: list[Contributor] = []
    maintainers: list[Contributor] = []
    keywords: list[str] = []
    classifiers: list[str] = []
    urls: dict[str, str] = {}
    requires_python: str | None = None
    dependencies: list[str] = []
    optional_dependencies: dict[str, list[str]] = {}
    scripts: dict[str, str] = {}
    gui_scripts: dict[str, str] = {}
    entry_points: dict[str, dict[str, str]] = {}
    dynamic: list[str] = []


class PyProject(Base):
    build_system: BuildSystem | None = None
    project: Project | None = None
    tool: dict[str, dict[str, Any]] = {}


def decode(data: bytes | str) -> PyProject:
    """Decode a ``pyproject.toml`` file from TOML"""
    return msgspec.toml.decode(data, type=PyProject)


def encode(msg: PyProject) -> bytes:
    """Encode a ``PyProject`` object to TOML"""
    return msgspec.toml.encode(msg)
Here we use it to load the pyproject.toml for Starlette:
In [1]: import pyproject
In [2]: import urllib.request
In [3]: url = "https://raw.githubusercontent.com/encode/starlette/master/pyproject.toml"
In [4]: with urllib.request.urlopen(url) as f:
   ...:     data = f.read()
In [5]: result = pyproject.decode(data)  # decode the pyproject.toml
In [6]: result.build_system
Out[6]: BuildSystem(requires=['hatchling'], build_backend='hatchling.build', backend_path=[])
In [7]: result.project.name
Out[7]: 'starlette'
Note that this only validates that fields are of the proper type. It doesn’t check:
Whether strings like URLs or dependency specifiers are valid. Some of these could be handled using msgspec’s existing Constraints system, but not all of them.
Mutually exclusive field restrictions (for example, you can’t set both project.license.file and project.license.text). msgspec currently has no way of declaring these restrictions.
Even with these caveats, the schemas here are still useful:
Since forbid_unknown_fields=True is configured, any extra fields will raise a nice error message. This is very useful for catching typos in configuration files, as the misspelled field names won’t be silently ignored.
Type errors for fields will also be caught, with a nice error raised.
Any downstream consumers of decode have a nice high-level object to work with, complete with type annotations. This plays well with tab-completion and tools like mypy or pyright, improving usability.
For example, here’s an invalid pyproject.toml.
[build-system]
requires = "hatchling"
build-backend = "hatchling.build"
[project]
name = "myproject"
version = "0.1.0"
description = "a super great library"
authors = [
    {name = "alice shmalice", email = "alice@company.com"}
]
Can you spot the error? Using the schemas defined above, msgspec can detect schema issues like this, and raise a nice error message. In this case the issue is that build-system.requires should be an array of strings, not a single string:
In [1]: import pyproject
In [2]: with open("pyproject.toml", "rb") as f:
   ...:     invalid = f.read()
In [3]: pyproject.decode(invalid)
---------------------------------------------------------------------------
ValidationError Traceback (most recent call last)
Cell In [3], line 1
----> 1 pyproject.decode(invalid)
ValidationError: Expected `array`, got `str` - at `$.build-system.requires`