This section presents a high-level view of our ETL processes.
The ETL has different types of steps (e.g.
grapher, among others). Some steps, can be understood under the same principle (e.g.
garden are both "Transform" steps within the ETL).This section does not cover all these steps in detail, but rather describes the very high-level meachanics of ETL.
In general, an ETL consists of three main processes:
- Extract: Downloads data from a source or multiple sources.
- Transform: Processes and combines the downloaded data so it is usable.
- Load: Moves the transformed data to a data repository (e.g. database).
Our ETL follows these same principles, with different names, and has the following three main building blocks:
- Snapshots: We download files from external sources and store them as snapshots.
- Datasets: Next, we curate the downloaded data to have clean datasets. These datasets can depend on snapshots, but also on other curated datasets.
- Grapher views: We adapt the curated datasets for Grapher, which are then loaded to Grapher.
As a consequence, nodes in our DAG can come in different flavours.
Snapshots are edge nodes in the computational graph. They represent a copy of a upstream data file ー preserving its original format ー at a particular point in time. That is, they are entry points to the ETL, and therefore don't have dependencies within the computational graph.
flowchart LR upstream((____)):::node -.->|copy| snapshot((____)):::node subgraph id [Snapshot] snapshot end classDef node fill:#002147,color:#002147
Snapshots are managed in the ETL using DVC, which allows tracking of different file versions and metadata.
Their URI begins with the prefix
snapshot://, therefore using the following format:
Old-style snapshots are managed by the walden codebase and begin with the prefix
There are still several snapshots using this channel.
Datasets are nodes in the computational graph and the main units of work in the ETL. They represent a transformation from one or more ingredients into a new (and potentially better and more useful) output.
Their URI begins with the prefix
data://, and use the following convention:
channel is used as the highest level of grouping and represents a stage of data curation. This will either be
namespace is typically a data provider, like
un, but it in cases where there are many data providers, it can describe the topic area instead (e.g.
The dataset URI is used to identify the code that builds the dataset and also to identify the output file on disk.
From the URI
data://garden/faostat/2022-05-17/faostat_fa we can derive where the code to produce the dataset is located:
In addition, it also tells us in which S3 bucket it is stored:
The Grapher codebase can only accept datasets that are in a particular shape:
(entity, year, variable, value)
They present a long format, where each row is a particular datapoint of a variable:
entity: Identifies the entity (typically a country).
year: Pinpoints the data point to a particular year
variable: Name of the variable.
value: Value of the variable.
However, datasets in the ETL are often in a very different shape instead. For example, they may have data broken down by gender, disease type, fish stock, or some other dimension. Therefore, we need a step that adapts the ETL dataset format into a Grapher friendly format: The grapher step.
Grapher steps are responsible for reshaping a dataset on disk into a grapher view. A single variable may fan out into a large number of grapher views.
These steps use the following URI convention:
Grapher views are still normal datasets, but they adapt the data to the way it must look when being inserted to MySQL. For each grapher view, there is a corresponding matching
grapher:// step automatically generated which does the actual insert to MySQL, if MySQL credentials have been configured.
You should have configured your MySQL credentials during your local Grapher environment setup.
The automatically generated step adapts the data for Grapher, including:
- Generating a Grapher variable for every combination of values of dimensions. For instance,
Death rate (sex=male, age group=60-70).
- Converting countries to
entity_idvalues that match the
entitiestable in MySQL.
- If the dataset has multiple sources, merge them into a single combined source to fit Grapher's data model for sources.