Update data

This guide explains the general workflow to update a dataset that already exists in ETL. It assumes that you've created your working environment.

Quick guide

In a nutshell, these are the steps to follow:

  1. Initial setup
    • Use the ETL Dashboard in Wizard to create new versions of the steps (this will duplicate the code of the old steps).
    • Execute the newly created snapshot scripts, if any.
      • If any of them fail, don't fix them yet; that will be done later.
    • Commit files generated by the ETL dashboard and push them to the branch.
      • Note that, if any snapshot steps were not successfully executed, the Buildkite build will fail. Ignore this for now.
  2. Update and run the new steps
    • Adapt the code of the new steps and ensure ETL (e.g. etlr step-names --grapher) can execute them successfully.
    • Commit changes to the code.
  3. Update indicators and charts
    • Use Indicator Upgrader to update the charts (so they use the new variables instead of the old ones).
      • If needed, adapt existing charts or create new ones on the staging server.
    • Use Chart Diff to approve changes in charts and newly created charts.
  4. Archive unused steps
    • Use the ETL Dashboard to archive old steps (this will move old steps from the active dag to the archive dag).
  5. Submit your work for review
    • Commit all your final work and set your PR to be ready for review.
      • Select which commits need to be reviewed, omitting the very first one (so that the reviewer only sees the changes with respect to the old version of the steps).
    • Make further changes, if suggested by the reviewer.
  6. Publish your work: Once approved, merge the PR.
  7. After publishing:
    • Archive old grapher dataset(s).
    • Pull changes to explorers in Admin (if applicable).
    • Announce your update.

For simplicity, let's go through it with a real example: Assume you have to update the "Near-surface temperature anomaly" dataset, by the Met Office Hadley Centre.

This guide assumes you already have a working installation of etl, and that you use VSCode with the appropriate configuration and plugins.

1. Initial setup

  • Update your master and configuration:

    • Go to ETL master branch (by running git switch master), and ensure it's up-to-date in your local repository (git pull).
    • Ensure that, in your .env file, you have set STAGING=1.
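
    For example, in a terminal at the root of your local etl repository (a minimal sketch, assuming a standard setup):

      git switch master
      git pull
      # STAGING=1 in your .env makes local ETL runs and the Wizard use the staging database.
      grep STAGING .env
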
  • Create a draft PR and a temporary staging server

    • Create a PR with the following command (replace {short_name} with the short name of the dataset, e.g. temperature-anomaly):

      etl pr "{short_name}: update" data
      

      This will create a new git branch in your local repository with an empty commit, which will be pushed to remote. It will also create a draft pull request on GitHub, and a staging server.

    • Wait for a notification from @owidbot. It should take a few minutes, and will inform you that the staging server http://staging-site-update-{short_name} has been created.

  • Update steps using the ETL Dashboard:

    • Start the ETL Wizard by running:
      etlwiz
      

    Note

    Even though it is possible to access the Wizard from production or from a staging server, we recommend always using your local Wizard, i.e. one started from your local computer (but connecting to the staging database, thanks to STAGING=1).

    • Inside the Wizard, go to "Dashboard".
    • On the Steps table, select the grapher dataset you want to update, and click on "Add selected steps to the Operations list". In this case, the dataset has only 1 chart, so it will be an easy update.
    • Scroll down to the Operations list, and click on "Add all dependencies".
    • Click on "Remove non-updateable (e.g. population)" (although, for this simple example, it makes no difference).
    • Scroll down and expand the "Additional parameters to update steps" box, to deactivate the "Dry run" option.
    • Then click on "Update X steps" (in this case, X equals 6). This will create all the new ETL code files needed for the update, and write those steps in the dag (in this case, in the climate.yml dag file).
      Animation of how to update steps in the ETL Dashboard.
  • Execute new snapshots:

    • Try to execute any newly created snapshot scripts (see the example below).
      • If any of the scripts fail, don't fix them. This will be done later. For now, move to the next step.
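
      For the example dataset, this would look something like (the exact path depends on the steps created by the Dashboard):

        python snapshots/met_office_hadley_centre/2024-07-02/near_surface_temperature.py
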
  • Commit the new files:

    • Commit the newly generated files and push them to the new branch.
      • If any snapshot failed in the previous step, CI/CD checks will fail. Ignore this for now.

2. Update and run the new steps

Ensure that all snapshot scripts and ETL steps run successfully. Adapt the code if needed.

  • Edit the snapshot metadata files: Some modifications may be needed, for example, the date_published field may need to be manually updated.

    • For convenience (throughout the rest of the work), open the corresponding dag file in a tab (Cmd+P to open the Quick Open bar, then type climate.yml and press Enter).
    • To open a specific snapshot, go to the bottom of the dag, where the new steps are. Select the dag entry of one of the snapshots (without including the snapshot:// prefix), namely met_office_hadley_centre/2024-07-02/near_surface_temperature_global.csv, and then hit Cmd+C, Cmd+P, Cmd+V, Enter.

    Animation of the editing process of a snapshot.

    Note

    We should always have a quick look at the license URL, to ensure it has not changed (see our guide on sources' licenses).

  • Ensure the snapshot steps work:

    python snapshots/met_office_hadley_centre/2024-07-02/near_surface_temperature.py
    
  • Ensure the meadow, garden, and grapher steps work: Edit these steps and execute them. You can do that either one by one:

    etlr meadow/met_office_hadley_centre/2024-07-02/near_surface_temperature
    etlr garden/met_office_hadley_centre/2024-07-02/near_surface_temperature
    etlr grapher/met_office_hadley_centre/2024-07-02/near_surface_temperature
    etlr grapher/met_office_hadley_centre/2024-07-02/near_surface_temperature --grapher
    

    Or all at once:

    etlr near_surface_temperature --grapher
    

    Note

    Remember that, even though your ETL code is run locally, the database you are accessing is the one from the staging server (because of the STAGING=1 parameter in your .env file).

  • Commit your changes: You should now include the changes in the dag too.

    git add .
    git commit -m "Update snapshots and data steps"
    git push origin update-temperature-anomaly
    

3. Update indicators and charts

Upgrade indicators used in charts

After updating the data, it is time to update the affected charts! This involves migrating the charts that use the old indicators over to the new ones.

  • Update indicators using Indicator Upgrader:

    • Start Wizard:

      etlwiz
      

      And click on Indicator Upgrader.

    • By default, the new grapher dataset (which has no charts) and its corresponding old version (with one chart) should already be selected. Press Next.

    • Ensure the mapping from old to new indicators is correct. Press Next.
    • Ensure the list of affected charts is as expected. Press Update charts. This will update all the affected charts in the PR staging server.
    • If you have more datasets to update, simply refresh the page (Cmd+R) and, by default, the next new dataset will be selected.

    Indicator upgrade flow.

  • Do further chart changes: You can make any further changes to charts in your staging server, if needed.

Approve chart changes

Review all changes in existing charts, and also new charts.

  • Start Chart Diff in Wizard: A link will appear at the bottom of the page when you've submitted the changes in the Indicator Upgrader. Alternatively, you can select it from the Wizard menu in the sidebar.
  • Review the chart changes:

    • Inspect the changes in the charts, and approve them if everything looks good.
    • If you notice some issues, you can go back to the code and do further changes.
    • If you are not happy with the changes, you can reject them.

    Chart diff flow. You'll be shown any chart that you've changed in your staging server (either via indicator upgrader or manually in the admin) compared to production. Here, you need to approve and/or reject the differences.

4. Archive unused steps

After your update, the old steps are no longer relevant. Therefore, we move them to the archive dag. By doing this, we minimize the risk of using outdated steps by mistake.

  • Archive old steps using the ETL Dashboard:

    • Go to ETL Dashboard in your local Wizard.
    • On the Steps table, select the old step (the one that you have just updated, and that now should appear as "Archivable"), and click on "Add selected steps to the Operations list".
    • Scroll down to the Operations list, and click on "Add all dependencies".
    • Scroll down and expand the "Additional parameters to archive steps" box, to deactivate the "Dry run" option. You can keep "Include usages" activated (it will never archive a step that is used by a chart), but you may want to deactivate it if you created a step that is not yet used by any charts (to avoid archiving it).
    • Then click on "Archive X steps" (in this case, X equals 6).

    Archive ETL steps.

  • Sanity-check your archived steps: To ensure nothing has been archived by mistake, you can run etl d version-tracker.

  • Commit the changes in the dag files.

5. Submit your work for review

You have now completed the first iteration of your work. Time to get a second opinion on your changes!

Helping the reviewer see the actual changes

The default GitHub code diff ("Files changed") will show copied files as new files, so the reviewer will not be able to distinguish what was simply copied from what is genuinely new.

To help the reviewer, you can add the following link to your PR description:

https://github.com/owid/etl/pull/<pr_number>/files/<first_commit_hash>..HEAD
where <pr_number> is the number of your PR, and <first_commit_hash> is the hash of the first commit that copied the steps (get it from git log or copy it from GitHub). This will show the changes excluding that initial copying commit. (Note that if you rebase your branch, the link will stop working, since the commit hash will change.)
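
For example, one way to list the commits on your branch from oldest to newest, so you can copy the hash of the one that copied the steps (assuming the branch was created from master):

    git log --reverse --oneline master..HEAD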

Alternatively, if you're the reviewer, go to "Files changed" and select only relevant commits from the dropdown "Changes from all commits".

  • Ensure CI/CD checks have passed: In the GitHub page of the draft PR, check that all checks have a green tick.
    • If any of them has a red cross ❌:
      • Click on Details, to open Buildkite and get more details on the error(s).
      • Sometimes, retrying the check that failed fixes the problem. You can do this by clicking on the job that failed, and then clicking on Retry. If this does not solve the issue, ask for support.
  • Set the PR ready for review: If you see that "All checks have passed", the PR is ready for review.
    • Add a meaningful description, stating all the main changes you made, possibly linking to any relevant GitHub issues.
    • Click on Ready for review.
    • Finally, add a reviewer. If the PR is very long and you want to have multiple reviewers, specify in the description what each one should review.
  • Implement changes: Wait for the review, and implement any changes brought up by the reviewer that you consider applicable.

6. Publish your work

Share the result of your work with the world.

  • Once the PR is approved, click on "Edit" on the right of the PR title. You will see a dropdown to select the "base" of the PR. Change it to master, and confirm.
  • Click on "Squash and merge" and confirm.
    • After this, the code for the new steps will be merged into master. ETL will build the new steps in production, and, under the hood, all changes you made to charts on your staging server will be synced with the public charts.
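
If you prefer the command line, the same can be done with the GitHub CLI (a sketch; replace <pr_number> with the number of your PR):

    # Change the base branch of the PR to master.
    gh pr edit <pr_number> --base master
    # Squash-merge the PR.
    gh pr merge <pr_number> --squash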

7. After publishing

Archive old grapher datasets

For convenience, we should archive grapher datasets that have been replaced by new ones.

Note

This step is a bit cumbersome; feel free to skip it if you don't feel confident about it. There is an open issue to make this easier.

  • Go to the grapher dataset admin.
  • Search for the dataset (type "Near-surface"). Click on it.
  • Copy the dataset id from the URL (e.g. if the URL is https://admin.owid.io/admin/datasets/6520, the dataset id is 6520).
  • Access the production database (e.g. using DBeaver), search for the dataset with that id, and set isPrivate and isArchived to 1.
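
If you prefer to do this from the command line, here is a sketch of the equivalent update (assuming the grapher database has a datasets table; the host, user, and database names below are placeholders for your production credentials):

    mysql -h <production_host> -u <user> -p <grapher_database> \
      -e "UPDATE datasets SET isPrivate = 1, isArchived = 1 WHERE id = 6520;"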

Pull changes to explorers in Admin

If your work affects explorers, you can run etl explorer-update.

  • It may take a few minutes, and it will update all *-explorer.tsv files in your owid-content repository.
  • You can access the owid-content repository, and commit any useful changes (otherwise, you can revert them with git restore .).
  • Push those changes and create a new PR in owid-content.

After updating any TSV configuration in owid-content, make sure to pull the changes from the explorer admin site (upper-right button).
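
A minimal sketch of that workflow in a terminal (assuming owid-content is cloned next to etl; branch and commit names are just examples):

    etl explorer-update
    cd ../owid-content
    git status                 # inspect which *-explorer.tsv files changed
    # If the changes are not useful, revert them with: git restore .
    # Otherwise, commit and push them on a new branch, and open a PR in owid-content:
    git switch -c update-near-surface-temperature-explorers
    git add .
    git commit -m "Update explorers after near-surface temperature update"
    git push origin update-near-surface-temperature-explorers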

Wrap up

  • Close any relevant issues from the owid-issues or etl repositories.
  • If it's an important update, announce it on the Slack #article-and-data-updates channel.