Features and constraints
Much of how the ETL is designed falls out of its design goals.
No special hardware
To ensure that members of the public can run and audit our code, we have designed the ETL to be a standalone Python program that operates on flat files and fetches what it needs on demand.
It should not need any special hardware or services, and individual ETL steps may use no more than 32GB memory.
It should be possible to run the ETL on MacOS, Linux and Windows (via WSL).
Public by default
All our data work is public by default; we only use private data sources when it is overwhelmingly in the public interest, or when the data is early-access and will shortly become publicly available.
Outputs as pure functions of inputs
To ensure our work is reproducible, we take our own snapshots of any upstream data that we use, meaning that if in future the upstream data provider changes their site, their data or their API, we can still build our datasets from "raw ingredients".
graph LR upstream --> snapshot --> etl --> catalog[on-disk catalog]
We secondly keep record all data dependencies in a directed graph or DAG (see YAML files from
dag/), and forbid steps from using any data as input that isn't explicitly declared as a dependency. This means that the result of any step is a pure function of its inputs.
Checksums for safe caching
We keep the ETL efficient to build by using a Merkle tree of MD5 checksums:
- Snapshots have a checksum available in their metadata.
- Datasets have a checksum of their inputs available in their metadata (the
When we ask the ETL to build something by running
etl <query>, it will only build things that are out of date. We can force a rebuild by passing the
Ready for data science
Previously, although we could chart data, it was very difficult to work with in Jupyter notebooks.
We have designed the ETL so that data is recorded at different stages of processing. The phase called
meadow is the version closest to the upstream provider, and the version called
garden is the best and most useful version of the data. We call data in
garden "ready for data science".