Getting started
The ETL can be used by OWID staff or the general public to build a copy of our data catalog. Here's how to get set up.
Getting set up
The ETL is supported and regularly run on Linux, macOS, and Windows (via WSL).
Install dependencies
You will need Python 3.9+, basic build tools, and MySQL client libraries:
On macOS, we recommend using Homebrew to install dependencies.
Ensure you have the Xcode command line tools:
xcode-select --install
Then install Python 3.9+ and MySQL client:
brew install python mysql-client poetry
You then need to inform Python where to find MySQL by adding some lines to your ~/.zshrc file. Run brew info mysql-client to see what's needed.
For example, on an M1/M2 Mac, where Homebrew installs to /opt/homebrew, you would add:
export PATH="/opt/homebrew/opt/mysql-client/bin:$PATH"
export LDFLAGS="-L/opt/homebrew/opt/mysql-client/lib"
export CPPFLAGS="-I/opt/homebrew/opt/mysql-client/include"
On an Intel Mac, the paths will be slightly different.
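For example, on an Intel Mac, where Homebrew typically installs to /usr/local, the equivalent lines would look like this (check the output of brew info mysql-client for the exact paths on your machine):

# MySQL client paths for an Intel Mac (Homebrew under /usr/local)
export PATH="/usr/local/opt/mysql-client/bin:$PATH"
export LDFLAGS="-L/usr/local/opt/mysql-client/lib"
export CPPFLAGS="-I/usr/local/opt/mysql-client/include"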
Finally, check that you have the correct version of Python as your default:
which python3
It should say something like /usr/local/bin/python3 or /opt/homebrew/bin/python3. If not, you will have to change the PATH variable in your shell profile (e.g. ~/.bash_profile or ~/.zshrc).
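For example, on an Apple Silicon Mac you could prepend Homebrew's bin directory to your PATH in ~/.zshrc (adjust the path if Homebrew lives elsewhere on your machine):

# put Homebrew's python3 ahead of the system one
export PATH="/opt/homebrew/bin:$PATH"

Then restart your shell (or run source ~/.zshrc) and check which python3 again.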
On Ubuntu, you can install most things you need with apt:
sudo apt install python3-dev python3-virtualenv python3-setuptools mysql-client
However, the version of Poetry that ships with Ubuntu is too old, so we need to install a more recent version.
The recommended method is to run:
curl -sSL https://install.python-poetry.org | python3 -
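The installer places the poetry executable in ~/.local/bin by default; if your shell can't find it afterwards, something like the following should help (the path assumes the installer's defaults):

# make Poetry's default install location visible to your shell, e.g. in ~/.bashrc
export PATH="$HOME/.local/bin:$PATH"

# confirm that a recent version is installed
poetry --version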
On Windows, you will need to install WSL2 to get started, and you should use Ubuntu 22.04 as your Linux distribution. Then enter your Linux console and follow the Ubuntu instructions above.
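If WSL2 is not installed yet, running something like this from an administrator PowerShell prompt should set it up on recent versions of Windows (consult Microsoft's WSL documentation if your setup differs):

wsl --install -d Ubuntu-22.04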
Extra config for staff
OWID staff who want to add steps to the ETL will also need a ~/.aws/config file configured, so that they can use snapshots. The file itself will look like this:
[default]
aws_access_key_id = <access_key>
aws_secret_access_key = <secret_key>
Please ask someone in the #data-architecture Slack channel to help you get set up.
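Once you have been given keys, one way to create the file from the shell is shown below (it simply writes the snippet above; fill in the real values, or edit the file by hand if it already exists):

# create the config directory and file
mkdir -p ~/.aws
cat > ~/.aws/config <<'EOF'
[default]
aws_access_key_id = <access_key>
aws_secret_access_key = <secret_key>
EOF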
Check your environment
You can get started by using make to see available commands. The best way to check if your environment is healthy is to run:
make test
It will clone two submodules into the vendor/ folder, run poetry install, and then run all CI checks.
If make test succeeds, then you should be able to build any dataset you like, including the entire catalog. If it fails, feel free to raise a GitHub issue at https://github.com/owid/etl/issues, or OWID staff can also ask in the #tech-issues Slack channel.
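If you prefer to run the pieces yourself, the rough manual equivalent of what make test sets up (based on the description above) is:

# fetch the vendored submodules
git submodule update --init

# install the Python dependencies into the project's virtualenv
poetry install

The CI checks themselves are easiest to run via make; see the Makefile for the individual targets.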
Building datasets
Before continuing, activate your Python virtual environment by running:
$ source .venv/bin/activate
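You can confirm that the virtualenv is active by checking which interpreter your shell now picks up:

$ which python
# should print a path ending in .venv/bin/python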
Dry-runs
Every step in the dag has a URI. For example, Our World in Data's copy of the Maddison Project Database (GDP and population by country) has the URI:
data://garden/ggdc/2020-10-01/ggdc_maddison
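Step URIs are declared in the project's dag configuration; if you want to see where a given step is defined, a quick search from the repository root works (the exact layout of the dag files may differ from this sketch):

$ grep -rn "ggdc_maddison" dag*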
We can see what steps would be executed to build it by running:
$ etl --dry-run data://garden/ggdc/2020-10-01/ggdc_maddison
Detecting which steps need rebuilding...
Running 2 steps:
1. snapshot://ggdc/2020-10-01/ggdc_maddison.xlsx...
2. data://garden/ggdc/2020-10-01/ggdc_maddison...
The first step is a snapshot:// step, which when run will download an upstream snapshot of this dataset to the ~/.owid/walden folder.

The second step is a data:// step, which will generate a local dataset in the data/ folder of the top-level etl/ folder.
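Once steps have actually been run (rather than dry-run), you can see where each kind of output lands (assuming the default locations above):

$ ls ~/.owid/walden
$ ls data/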
Note that we can also omit the full path of the step, in which case it will do a regex match against all available steps:
$ etl --dry-run ggdc_maddison
Now let's build the dataset by removing the --dry-run option:
$ etl data://garden/ggdc/2020-10-01/ggdc_maddison
Detecting which steps need rebuilding...
Running 2 steps:
1. snapshot://ggdc/2020-10-01/ggdc_maddison.xlsx...
OK (2s)
2. data://garden/ggdc/2020-10-01/ggdc_maddison...
OK (4s)
Let's confirm that the dataset was built locally:
$ ls data/garden/ggdc/2020-10-01/ggdc_maddison/
index.json
maddison_gdp.feather
maddison_gdp.meta.json
maddison_gdp.parquet
Several files were built for the dataset. The first is index.json, which gives metadata about the whole dataset. The remaining three files all represent a single data table: its data is saved in both Feather and Parquet formats, and its metadata in the accompanying .meta.json file.
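You can peek at the dataset-level and table-level metadata directly, for example by pretty-printing the JSON from the shell:

$ python -m json.tool data/garden/ggdc/2020-10-01/ggdc_maddison/index.json
$ python -m json.tool data/garden/ggdc/2020-10-01/ggdc_maddison/maddison_gdp.meta.json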
Consuming data
Now that our data/ folder has a table built, we can try reading it. Let's run python and use pandas:
>>> import pandas as pd
>>> df = pd.read_feather('data/garden/ggdc/2020-10-01/ggdc_maddison/maddison_gdp.feather')
>>> df.head()
country year gdp_per_capita population gdp
0 Afghanistan 1820 NaN 3280000.0 NaN
1 Afghanistan 1870 NaN 4207000.0 NaN
2 Afghanistan 1913 NaN 5730000.0 NaN
3 Afghanistan 1950 1156.0 8150000.0 9.421400e+09
4 Afghanistan 1951 1170.0 8284000.0 9.692280e+09
We can see that this dataset provides three indicators (gdp, population and gdp_per_capita), reported by country and year.
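Since this is a plain pandas DataFrame, you can slice and aggregate it in the usual way. For example (assuming the country name below appears in the data, which it does for the Maddison database):

>>> df[df.country == "France"].tail()   # most recent rows for a single country
>>> df["year"].max()                    # latest year covered by the dataset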
All tables generated by the ETL can also be read and written using a wrapper around pandas, the Table class. If we read the table using that, it will also pick up the metadata that was in the .meta.json file.
>>> from owid.catalog import Table
>>> t = Table.read('data/garden/ggdc/2020-10-01/ggdc_maddison/maddison_gdp.feather')
>>> t.head()
gdp_per_capita population gdp
country year
Afghanistan 1820 NaN 3280000.0 NaN
1870 NaN 4207000.0 NaN
1913 NaN 5730000.0 NaN
1950 1156.0 8150000.0 9.421400e+09
1951 1170.0 8284000.0 9.692280e+09
In this case, we can see that it understood that the country and year columns were the primary key for this table, and put them in the index.
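You can inspect that structure, and the attached metadata, directly. The metadata attribute below reflects the owid.catalog API as we understand it and may vary between versions:

>>> t.index.names        # -> ['country', 'year']
>>> t.metadata           # table-level metadata loaded from the .meta.json file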