
Features and constraints

Much of how the ETL is designed follows from its design goals.

No special hardware

To ensure that members of the public can run and audit our code, we have designed the ETL to be a standalone Python program that operates on flat files and fetches what it needs on demand.

It should not need any special hardware or services, and individual ETL steps may use no more than 32 GB of memory.

It should be possible to run the ETL on macOS, Linux and Windows (via WSL).

Public by default

All our data work is public by default; we only use private data sources when it is overwhelmingly in the public interest, or when the data is early-access and will shortly become publicly available.

Outputs as pure functions of inputs

To ensure our work is reproducible, we take our own snapshots of any upstream data that we use, meaning that if in future the upstream data provider changes their site, their data or their API, we can still build our datasets from "raw ingredients".

```mermaid
graph LR
upstream --> snapshot --> etl --> catalog[on-disk catalog]
```
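
As an illustration only (this is not the ETL's actual snapshot API, and the provider name, URL and paths are hypothetical), a snapshot step amounts to fetching the upstream file once, saving it as a flat file, and recording metadata, including a checksum, so that later builds can start from the same "raw ingredients":

```python
import hashlib
import json
from pathlib import Path
from urllib.request import urlretrieve

# Hypothetical location inside the repository for one provider's snapshot.
SNAPSHOT_DIR = Path("snapshots/example_provider/2024-01-01")


def create_snapshot(upstream_url: str, filename: str) -> Path:
    """Download an upstream file once and record it alongside checksum metadata."""
    SNAPSHOT_DIR.mkdir(parents=True, exist_ok=True)
    local_path = SNAPSHOT_DIR / filename
    urlretrieve(upstream_url, local_path)  # fetch the raw ingredient on demand

    # An MD5 checksum lets later builds detect whether the snapshot has changed.
    md5 = hashlib.md5(local_path.read_bytes()).hexdigest()
    metadata = {"source_url": upstream_url, "md5": md5}
    local_path.with_name(filename + ".meta.json").write_text(json.dumps(metadata, indent=2))
    return local_path
```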

Secondly, we record all data dependencies in a directed acyclic graph, or DAG (see the YAML files in `dag/`), and forbid steps from using any data as input that isn't explicitly declared as a dependency. This means that the result of any step is a pure function of its inputs.
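
A minimal sketch of that rule in Python; the YAML layout and step names below are illustrative rather than copied from the real `dag/` files, but the idea is the same: a step may only read inputs it has declared.

```python
import yaml

# Illustrative DAG entry; the real step URIs live in the YAML files under dag/.
DAG_YAML = """
steps:
  data://garden/example/2024-01-01/example_dataset:
    - data://meadow/example/2024-01-01/example_dataset
  data://meadow/example/2024-01-01/example_dataset:
    - snapshot://example/2024-01-01/example_dataset.csv
"""

dag = yaml.safe_load(DAG_YAML)["steps"]


def load_dependency(step: str, requested: str) -> str:
    """Refuse to hand a step any input it did not declare as a dependency."""
    if requested not in dag.get(step, []):
        raise ValueError(f"{step} does not declare {requested} as a dependency")
    # In the real ETL this is where the declared dataset would be loaded.
    return requested
```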

Checksums for safe caching

We keep the ETL efficient to build by using a Merkle tree of MD5 checksums:

- Snapshots have a checksum available in their metadata.
- Datasets have a checksum of their inputs available in their metadata (the `source_checksum` field).

When we ask the ETL to build something by running `etl <query>`, it will only build things that are out of date. We can force a rebuild by passing the `--force` flag.
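
A rough sketch of how such caching can work (the function names here are ours, not the ETL's; only `source_checksum` comes from the metadata described above): a dataset's checksum is derived from the checksums of its inputs, and a step is rebuilt only when that value no longer matches the stored one.

```python
import hashlib
from typing import Optional


def input_checksum(dependency_checksums: list[str]) -> str:
    """Fold the checksums of a step's inputs into a single Merkle-style checksum."""
    md5 = hashlib.md5()
    for checksum in sorted(dependency_checksums):
        md5.update(checksum.encode())
    return md5.hexdigest()


def is_dirty(stored_source_checksum: Optional[str], dependency_checksums: list[str]) -> bool:
    """A step is out of date if it was never built, or if any of its inputs changed."""
    if stored_source_checksum is None:
        return True
    return stored_source_checksum != input_checksum(dependency_checksums)
```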

Ready for data science

Previously, although we could chart data, it was very difficult to work with in Jupyter notebooks.

We have designed the ETL so that data is recorded at different stages of processing. The `meadow` stage is the version closest to the upstream provider, and the `garden` stage is the best and most useful version of the data. We call data in `garden` "ready for data science".
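
For example, because the on-disk catalog is just flat files, a garden table can be pulled straight into a Jupyter notebook with pandas; the path below is hypothetical and assumes the table is saved in Feather format.

```python
import pandas as pd

# Hypothetical garden-level table inside the on-disk catalog built by the ETL.
df = pd.read_feather("data/garden/example/2024-01-01/example_dataset/example_table.feather")
df.head()
```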