
Building datasets

You will learn more about the structure and design of the ETL in the next section.

Every step in the DAG has a URI. URIs allow us to uniquely identify any step (or node) throughout the whole ETL, and to reference existing datasets (and use them) when building a new one.

For example, the Human Development Reports (by the UNDP):

data://garden/un/2022-11-29/undp_hdr
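
Reading from left to right, this URI breaks down as follows (the labels are descriptive, based on the pattern of the URIs used throughout this section):

data://      step type
garden       channel
un           namespace (the data provider)
2022-11-29   version
undp_hdr     dataset short name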

Dry-runs

See what steps would be executed to build the undp_hdr dataset by running:

$ poetry run etl --dry-run data://garden/un/2022-11-29/undp_hdr
Detecting which steps need rebuilding...
OK (0s)
Running 4 steps:
1. snapshot://un/2022-11-29/undp_hdr.csv...
2. snapshot://un/2022-11-29/undp_hdr.xlsx...
3. data://meadow/un/2022-11-29/undp_hdr...
4. data://garden/un/2022-11-29/undp_hdr...

The first two steps are snapshot:// steps; when executed, they download upstream snapshots of the dataset into the data/snapshots/ folder. The last two are data:// steps, which generate local datasets in the data/ folder.
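
Once these steps are actually executed (not just dry-run), the downloaded snapshots should land on disk along paths mirroring their URIs, roughly like this (a sketch inferred from the URIs above, not verified output):

data/snapshots/un/2022-11-29/undp_hdr.csv
data/snapshots/un/2022-11-29/undp_hdr.xlsx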

meadow and garden channels

In the above example, you can identify different channels in the URIs: meadow and garden, each followed by the same string un/2022-11-29/undp_hdr. Channels represent different levels of curation of a dataset (in this example, the UNDP HDR dataset, version 2022-11-29).

garden datasets are good to go, whereas meadow datasets have not been curated enough to be used in production environments. We will explore these nuances later on.
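
The channel also shows up in the on-disk layout under data/: the garden files for this dataset live under data/garden/un/2022-11-29/undp_hdr/ (we list them at the end of this section), and the meadow files are assumed here to sit under the analogous path:

data/meadow/un/2022-11-29/undp_hdr/
data/garden/un/2022-11-29/undp_hdr/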

Note that you can omit the full URI of the step, in which case the query is regex-matched against all available steps:

$ poetry run etl --dry-run undp_hdr
Detecting which steps need rebuilding...
OK (0s)
Running 5 steps:
1. snapshot://un/2022-11-29/undp_hdr.csv...
2. snapshot://un/2022-11-29/undp_hdr.xlsx...
3. data://meadow/un/2022-11-29/undp_hdr...
4. data://garden/un/2022-11-29/undp_hdr...
5. data://grapher/un/2022-11-29/undp_hdr...

Note that an extra step is listed here, with the prefix data://grapher/, because it also matches the query (its URI contains the query text "undp_hdr").
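
Since the query is treated as a regex, you can also narrow it down. For example, something like the following should select only the garden step (a sketch that relies on the regex matching described above, not verified here):

$ poetry run etl --dry-run "garden/.*/undp_hdr"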

Generate the dataset

Now let's build the dataset by removing the --dry-run option:

$ poetry run etl data://garden/un/2022-11-29/undp_hdr
Detecting which steps need rebuilding...
OK (0s)
Running 4 steps:
1. snapshot://un/2022-11-29/undp_hdr.csv...
OK (5s)

2. snapshot://un/2022-11-29/undp_hdr.xlsx...
OK (5s)

3. data://meadow/un/2022-11-29/undp_hdr...
2023-04-19 22:28.41 [info     ] undp_hdr.start
2023-04-19 22:28.44 [info     ] undp_hdr.end
OK (5s)

4. data://garden/un/2022-11-29/undp_hdr...
2023-04-19 22:28.46 [info     ] undp_hdr.start
2023-04-19 22:28.47 [info     ] undp_hdr.harmonize_countries
2023-04-19 22:28.47 [info     ] undp_hdr.format_df
2023-04-19 22:28.47 [info     ] undp_hdr.dtypes
2023-04-19 22:28.47 [info     ] undp_hdr.sanity_check
2023-04-19 22:28.47 [info     ] undp_hdr.creating_table
2023-04-19 22:28.47 [info     ] undp_hdr.end
OK (3s)

Let's confirm that the dataset was built locally:

$ ls data/garden/un/2022-11-29/undp_hdr/
index.json
undp_hdr.feather
undp_hdr.meta.json
undp_hdr.parquet

Several files were built for the dataset: index.json contains metadata about the whole dataset, while the remaining three files all belong to a single data table, whose data is saved in both Feather and Parquet formats and whose metadata lives in undp_hdr.meta.json.
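
If you want a quick look at the table itself, one option is to read the Feather file directly with pandas (a sketch, not the ETL's own API; it assumes pandas and pyarrow are available in the poetry environment, which is typically the case since the ETL writes Feather files):

$ poetry run python -c "import pandas as pd; print(pd.read_feather('data/garden/un/2022-11-29/undp_hdr/undp_hdr.feather').head())"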

Parallel execution

You can pass the --workers flag (e.g. etl ... --workers 4) to run ETL steps in parallel. This is useful when rebuilding a large part of the ETL (e.g. after updating regions).
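
For example, to rebuild all steps matching undp_hdr with four parallel workers:

$ poetry run etl undp_hdr --workers 4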