Regular updates

This is a work in progress

Things might change in the near future. To stay up to date with the latest changes, join the discussion on GitHub!

ETL was initially conceived to maintain and publish datasets that are updated once a year. This is still the most common kind of update that we do these days.

However, sometimes we need to update datasets more frequently. This is the case, for instance, for the COVID-19 dataset and a few others, which need weekly or monthly updates.

In such cases, the processing code remains the same, but the source data needs to be refreshed. Put simply, the ETL process is the same, but it runs on an updated snapshot of the data.

If you want to keep a dataset up-to-date with the latest data, follow the steps below.

Create the data pipeline using latest version

First, create the steps needed to build the dataset (i.e. snapshot, meadow, garden, etc.). Use version latest for all of them, so that each update reuses the same code instead of adding a duplicate copy.

Make sure to add these steps to the DAG. For instance, in the example below, we want to keep the cases_deaths dataset up-to-date with the latest data.

# WHO - Cases and deaths
data://meadow/covid/latest/cases_deaths:
  - snapshot://covid/latest/cases_deaths.csv
data://garden/covid/latest/cases_deaths:
  - data://meadow/covid/latest/cases_deaths
  - data://garden/regions/2023-01-01/regions
  - data://garden/wb/2024-03-11/income_groups
  - data://garden/demography/2024-07-15/population
data://grapher/covid/latest/cases_deaths:
  - data://garden/covid/latest/cases_deaths
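As a quick sanity check on the DAG entries above, the sketch below parses the step URIs and lists any step in a given namespace whose version is not latest. It is only an illustration: the helper names (parse_step, non_latest_steps) are hypothetical and not part of ETL, and the DAG is written out as a plain Python dictionary rather than loaded from the real DAG file.

```python
from typing import Dict, List, NamedTuple


class Step(NamedTuple):
    scheme: str     # "data" or "snapshot"
    channel: str    # "meadow", "garden", "grapher"; empty for snapshots
    namespace: str  # e.g. "covid"
    version: str    # e.g. "latest" or "2024-07-15"
    name: str       # e.g. "cases_deaths"


def parse_step(uri: str) -> Step:
    """Split a step URI such as data://garden/covid/latest/cases_deaths."""
    scheme, _, path = uri.partition("://")
    parts = path.split("/")
    if scheme == "snapshot":
        # Snapshot URIs have no channel: namespace/version/file
        namespace, version, name = parts
        return Step(scheme, "", namespace, version, name)
    channel, namespace, version, name = parts
    return Step(scheme, channel, namespace, version, name)


def non_latest_steps(dag: Dict[str, List[str]], namespace: str) -> List[str]:
    """Return every URI in `namespace` whose version is not 'latest'."""
    uris = set(dag)
    for dependencies in dag.values():
        uris.update(dependencies)
    offending = []
    for uri in uris:
        step = parse_step(uri)
        if step.namespace == namespace and step.version != "latest":
            offending.append(uri)
    return sorted(offending)


# The DAG entries from the example above, as a plain dictionary.
DAG = {
    "data://meadow/covid/latest/cases_deaths": [
        "snapshot://covid/latest/cases_deaths.csv",
    ],
    "data://garden/covid/latest/cases_deaths": [
        "data://meadow/covid/latest/cases_deaths",
        "data://garden/regions/2023-01-01/regions",
        "data://garden/wb/2024-03-11/income_groups",
        "data://garden/demography/2024-07-15/population",
    ],
    "data://grapher/covid/latest/cases_deaths": [
        "data://garden/covid/latest/cases_deaths",
    ],
}

if __name__ == "__main__":
    print(non_latest_steps(DAG, "covid"))
```

Dependencies from other namespaces (regions, income groups, population) keep their dated versions; only the steps of the regularly updated dataset itself are expected to use latest.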

Create the update script

Create an update script and save it in the scripts/ directory. It must be a bash script that runs the code needed to update the snapshot. In the example below, we use [].

scripts/update-covid-cases-deaths.sh
#!/bin/bash
#
#  update-covid-cases-deaths.sh
#
#  Update COVID-19 cases and deaths dataset data://grapher/covid/latest/cases_deaths
#

set -e

start_time=$(date +%s)

echo '--- Update COVID-19 cases and deaths'
cd /home/owid/etl
uv run python snapshots/covid/latest/cases_deaths.py

# Committing to master triggers ETL, which then runs the updated steps
echo '--- Commit and push changes'

git add .
git commit -m ":robot: update: covid-19 cases and deaths" || true
git push origin master -q || true

end_time=$(date +%s)

echo "--- Done! ($(($end_time - $start_time))s)"

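In the example above, replace the snapshot command on line 14 (the uv run python snapshots/covid/latest/cases_deaths.py call) with the one for your dataset. Optionally, edit the log message on line 12 and the commit message on line 20 so that they describe your update.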
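The control flow of the script is worth spelling out: with set -e, a failing snapshot command aborts the run, whereas the commit and push are followed by || true, so an update with no new data (nothing to commit) still finishes cleanly. The Python sketch below mirrors that logic for illustration only; the update function and its runner parameter are hypothetical, and the real update should remain the bash script.

```python
import subprocess
from typing import Callable, Sequence

REPO_DIR = "/home/owid/etl"  # same working directory as the bash script


def run_in_repo(cmd: Sequence[str]) -> int:
    """Run a command inside the ETL checkout and return its exit code."""
    return subprocess.run(list(cmd), cwd=REPO_DIR).returncode


def update(
    snapshot_script: str,
    commit_message: str,
    runner: Callable[[Sequence[str]], int] = run_in_repo,
) -> bool:
    """Refresh a snapshot and push it. Return True if a commit was made."""
    # Like `set -e`: a failing snapshot update stops everything.
    if runner(["uv", "run", "python", snapshot_script]) != 0:
        raise RuntimeError(f"snapshot update failed: {snapshot_script}")
    if runner(["git", "add", "."]) != 0:
        raise RuntimeError("git add failed")
    # Like `|| true`: a non-zero exit here (e.g. nothing to commit)
    # is tolerated rather than treated as an error.
    committed = runner(["git", "commit", "-m", commit_message]) == 0
    runner(["git", "push", "origin", "master", "-q"])
    return committed
```

Passing a fake runner makes it possible to dry-run the sequence without touching the repository, for example `update("snapshots/covid/latest/cases_deaths.py", "test", runner=lambda cmd: 0)`.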

Schedule the update in Buildkite

Finally, you need to schedule the regular update. To do so, go to Buildkite and edit the instructions in the file.

Simply add a step like the following, replacing <step> with the name used in your update script:

- label: "Update <step>"
  command:
    - "sudo su - owid -c 'bash /home/owid/etl/scripts/update-<step>.sh'"
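To avoid typos when filling in the template above, the step can be rendered from the script name. This is a minimal sketch: buildkite_step is a hypothetical helper, and it assumes the script lives at /home/owid/etl/scripts/update-<step>.sh, as in the template.

```python
def buildkite_step(step: str) -> str:
    """Render the Buildkite YAML entry for scripts/update-<step>.sh."""
    script = f"/home/owid/etl/scripts/update-{step}.sh"
    lines = [
        f'- label: "Update {step}"',
        "  command:",
        f"    - \"sudo su - owid -c 'bash {script}'\"",
    ]
    return "\n".join(lines)


if __name__ == "__main__":
    # Matches the example script scripts/update-covid-cases-deaths.sh
    print(buildkite_step("covid-cases-deaths"))
```

The printed text can then be pasted into the pipeline definition in Buildkite.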