Update data

This guide explains the general workflow to update a dataset that already exists in ETL. It assumes that you've created your working environment.

Quick guide

In a nutshell, these are the steps to follow:

  1. Initial setup
    • Use the ETL Dashboard in Wizard to create new versions of the steps (this will duplicate the code of the old steps).
    • Execute the newly created snapshot scripts, if any.
      • If any of them fail, don't fix them yet; that will be done later.
    • Commit files generated by the ETL dashboard and push them to the branch.
      • Note that, if any snapshot steps were not successfully executed, the Buildkite build will fail. Ignore this for now.
  2. Update and run the new steps
    • Adapt the code of the new steps and ensure ETL (e.g. etlr step-names --grapher) can execute them successfully.
    • Commit changes to the code.
  3. Update indicators and charts
    • Use Indicator Upgrader to update the charts (so they use the new variables instead of the old ones).
      • If needed, adapt existing charts or create new ones on the staging server.
    • Use Chart Diff to approve changes in charts and newly created charts.
  4. Archive unused steps
    • Use the ETL Dashboard to archive old steps (this will move old steps from the active dag to the archive dag).
  5. Submit your work for review
    • Commit all your final work and set your PR to be ready for review.
      • Select which commits need to be reviewed, omitting the very first one (so that the reviewer only sees the changes with respect to the old version of the steps).
    • Make further changes, if suggested by the reviewer.
  6. Publish your work: Once approved, merge the PR.
  7. After publishing:
    • Archive old grapher dataset(s).
    • Pull changes to explorers in Admin (if applicable).
    • Announce your update.

For simplicity, let's go through it with a real example: Assume you have to update the "Near-surface temperature anomaly" dataset, by the Met Office Hadley Centre.

This guide assumes you already have a working installation of etl, and that you use VSCode with the appropriate configuration and plugins.

1. Initial setup

  • Update your master and configuration:

    • Go to ETL master branch (by running git switch master), and ensure it's up-to-date in your local repository (git pull).
    • Ensure that, in your .env file, you have set STAGING=1.
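
    For example, in a terminal at the root of your local etl repository (a minimal sketch, assuming a standard setup):

      git switch master
      git pull
      # STAGING=1 in your .env makes local ETL runs and the Wizard use the staging database.
      grep STAGING .env
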
  • Create a draft PR and a temporary staging server

    • Create a PR with the following command (replace {short_name} with the short name of the dataset, e.g. temperature-anomaly):

      etl pr "{short_name}: update" data
      

      This will create a new git branch in your local repository with an empty commit, which will be pushed to remote. It will also create a draft pull request on GitHub, and a staging server.

    • Wait for a notification from @owidbot. It should take a few minutes, and will inform you that the staging server http://staging-site-update-{short_name} has been created.

  • Update steps using the ETL Dashboard:

    • Start the ETL Wizard by running:
      etlwiz
      

    Note

    Even though it is possible to access the Wizard from production or from a staging server, we recommend always using your local Wizard, i.e. one started from your local computer (but connecting to the staging database, thanks to STAGING=1).

    • Inside the Wizard, go to "Dashboard".
    • On the Steps table, select the grapher dataset you want to update, and click on "Add selected steps to the Operations list". In this case, the dataset has only 1 chart, so it will be an easy update.
    • Scroll down to the Operations list, and click on "Add all dependencies".
    • Click on "Remove non-updateable (e.g. population)" (although, for this simple example, it makes no difference).
    • Scroll down and expand the "Additional parameters to update steps" box, to deactivate the "Dry run" option.
    • Then click on "Update X steps" (in this case, X equals 6). This will create all the new ETL code files needed for the update, and write those steps in the dag (in this case, in the climate.yml dag file).
      Animation of how to update steps in the ETL Dashboard.
  • Execute new snapshots:

    • Try to execute any newly created snapshot scripts (see the example below).
      • If any of the scripts fail, don't fix them. This will be done later. For now, move to the next step.
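
      For the example dataset, this would look something like (the exact path depends on the steps created by the Dashboard):

        python snapshots/met_office_hadley_centre/2024-07-02/near_surface_temperature.py
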
  • Commit the new files:

    • Commit the newly generated files and push them to the new branch.
      • If any snapshot failed in the previous step, CI/CD checks will fail. Ignore this for now.

2. Update and run the new steps

Ensure that all snapshot scripts and ETL steps run successfully. Adapt the code if needed.

  • Edit the snapshot metadata files: Some modifications may be needed, for example, the date_published field may need to be manually updated.

    • For convenience (throughout the rest of the work), open the corresponding dag file in a tab (Cmd+P to open the Quick Open bar, then type climate.yml and press Enter).
    • To open a specific snapshot, go to the bottom of the dag, where the new steps are. Select the dag entry of one of the snapshots (without including the snapshot:// prefix), namely met_office_hadley_centre/2024-07-02/near_surface_temperature_global.csv, and then hit Cmd+C, Cmd+P, Cmd+V, Enter.

    Animation of the editing process of a snapshot.

    Note

    We should always have a quick look at the license URL, to ensure it has not changed (see our guide on sources' licenses).

  • Ensure the snapshot steps work:

    python snapshots/met_office_hadley_centre/2024-07-02/near_surface_temperature.py
    
  • Ensure the meadow, garden, and grapher steps work: Edit these steps and execute them. You can do that either one by one:

    etlr meadow/met_office_hadley_centre/2024-07-02/near_surface_temperature
    etlr garden/met_office_hadley_centre/2024-07-02/near_surface_temperature
    etlr grapher/met_office_hadley_centre/2024-07-02/near_surface_temperature
    etlr grapher/met_office_hadley_centre/2024-07-02/near_surface_temperature --grapher
    

    Or all at once:

    etlr near_surface_temperature --grapher
    

    Note

    Remember that, even though your ETL code is run locally, the database you are accessing is the one from the staging server (because of the STAGING=1 parameter in your .env file).

  • Commit your changes: You should now include the changes in the dag too.

    git add .
    git commit -m "Update snapshots and data steps"
    git push origin update-temperature-anomaly
    

3. Update indicators and charts

Upgrade indicators used in charts

After updating the data, it is time to update the affected charts! This involves migrating the charts that use the old indicators over to the new ones.

  • Update indicators using Indicator Upgrader:

    • Start Wizard:

      etlwiz
      

      And click on Indicator Upgrader.

    • By default, the new grapher dataset (which has no charts) and its corresponding old version (with one chart) should already be selected. Press Next.

    • Ensure the mapping from old to new indicators is correct. Press Next.
    • Ensure the list of affected charts is as expected. Press Update charts. This will update all the affected charts in the PR staging server.
    • If you have more datasets to update, simply refresh the page (Cmd+R) and, by default, the next new dataset will be selected.

    Indicator upgrade flow.

  • Do further chart changes: You can make any further changes to charts in your staging server, if needed.

Approve chart changes

Review all changes in existing charts, and also new charts.

  • Start Chart Diff in Wizard: A link will appear at the bottom of the page when you've submitted the changes in the Indicator Upgrader. Alternatively, you can select it from the Wizard menu in the sidebar.
  • Review the chart changes:

    • Inspect the changes in the charts, and approve them if everything looks good.
    • If you notice some issues, you can go back to the code and do further changes.
    • If you are not happy with the changes, you can reject them.

    Chart diff flow. You'll be shown any chart that you've changed in your staging server (either via indicator upgrader or manually in the admin) compared to production. Here, you need to approve and/or reject the differences.

4. Archive unused steps

After your update, the old steps are no longer relevant. Therefore, we move them to the archive dag. By doing this, we minimize the risk of using outdated steps by mistake.

  • Archive old steps using the ETL Dashboard:

    • Go to ETL Dashboard in your local Wizard.
    • On the Steps table, select the old step (the one that you have just updated, and that now should appear as "Archivable"), and click on "Add selected steps to the Operations list".
    • Scroll down to the Operations list, and click on "Add all dependencies".
    • Scroll down and expand the "Additional parameters to archive steps" box, to deactivate the "Dry run" option. You can keep "Include usages" activated (it will never archive a step that is used by a chart), but you may want to deactivate it if you created a step that is not yet used by any charts (to avoid archiving it).
    • Then click on "Archive X steps" (in this case, X equals 6).

    Archive ETL steps.

  • Sanity-check your archived steps: To ensure nothing has been archived by mistake, you can run etl d version-tracker.

  • Commit the changes in the dag files.

5. Submit your work for review

You have now completed the first iteration of your work. Time to get a second opinion on your changes!

Helping the reviewer see the actual changes

The default GitHub code diff ("Files changed") will show copied files as new files, so the reviewer will not be able to distinguish what was simply copied from what is genuinely new.

To help the reviewer, you can add the following link to your PR description:

https://github.com/owid/etl/pull/<pr_number>/files/<first_commit_hash>..HEAD
where <pr_number> is the number of your PR, and <first_commit_hash> is the hash of the first commit that copied the steps (get it from git log or copy it from GitHub). This will show the changes excluding that initial copying commit. (Note that if you rebase your branch, the link will stop working, since the commit hash will change.)
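
For example, one way to list the commits on your branch from oldest to newest, so you can copy the hash of the one that copied the steps (assuming the branch was created from master):

    git log --reverse --oneline master..HEAD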

Alternatively, if you're the reviewer, go to "Files changed" and select only relevant commits from the dropdown "Changes from all commits".

  • Ensure CI/CD checks have passed: In the GitHub page of the draft PR, check that all checks have a green tick.
    • If any of them has a red cross ❌:
      • Click on Details, to open Buildkite and get more details on the error(s).
      • Sometimes, retrying the check that failed fixes the problem. You can do this by clicking on the job that failed, and then clicking on Retry. If this does not solve the issue, ask for support.
  • Set the PR ready for review: If you see that "All checks have passed", the PR is ready for review.
    • Add a meaningful description, stating all the main changes you made, possibly linking to any relevant GitHub issues.
    • Click on Ready for review.
    • Finally, add a reviewer. If the PR is very long and you want to have multiple reviewers, specify in the description what each one should review.
  • Implement changes: Wait for the review, and implement any changes brought up by the reviewer that you consider applicable.

6. Publish your work

Share the result of your work with the world.

  • Once the PR is approved, click on "Edit" on the right of the PR title. You will see a dropdown to select the "base" of the PR. Change it to master, and confirm.
  • Click on "Squash and merge" and confirm.
    • After this, the code for the new steps will be merged into master. ETL will build the new steps in production, and, under the hood, all changes you made to charts on your staging server will be synced with the public charts.
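
If you prefer the command line, the same can be done with the GitHub CLI (a sketch; replace <pr_number> with the number of your PR):

    # Change the base branch of the PR to master.
    gh pr edit <pr_number> --base master
    # Squash-merge the PR.
    gh pr merge <pr_number> --squash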

7. After publishing

Archive old grapher datasets

For convenience, we should archive grapher datasets that have been replaced by new ones.

Note

This step is a bit cumbersome; feel free to skip it if you don't feel confident about it. There is an open issue to make this easier.

  • Go to the grapher dataset admin.
  • Search for the dataset (type "Near-surface"). Click on it.
  • Copy the dataset id from the URL (e.g. if the URL is https://admin.owid.io/admin/datasets/6520, the dataset id is 6520).
  • Access the production database (e.g. using DBeaver), search for the dataset with that id, and set isPrivate and isArchived to 1.
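
If you prefer to do this from the command line, here is a sketch of the equivalent update (assuming the grapher database has a datasets table; the host, user, and database names below are placeholders for your production credentials):

    mysql -h <production_host> -u <user> -p <grapher_database> \
      -e "UPDATE datasets SET isPrivate = 1, isArchived = 1 WHERE id = 6520;"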

Pull changes to explorers in Admin

If your work affects explorers, you can run etl explorer-update.

  • It may take a few minutes, and it will update all *-explorer.tsv files in your owid-content repository.
  • You can access the owid-content repository, and commit any useful changes (otherwise, you can revert them with git restore .).
  • Push those changes and create a new PR in owid-content.

After updating any TSV configuration in owid-content, make sure to pull the changes from the explorer admin site (upper-right button).
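
A minimal sketch of that workflow in a terminal (assuming owid-content is cloned next to etl; branch and commit names are just examples):

    etl explorer-update
    cd ../owid-content
    git status                 # inspect which *-explorer.tsv files changed
    # If the changes are not useful, revert them with: git restore .
    # Otherwise, commit and push them on a new branch, and open a PR in owid-content:
    git switch -c update-near-surface-temperature-explorers
    git add .
    git commit -m "Update explorers after near-surface temperature update"
    git push origin update-near-surface-temperature-explorers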

Wrap up

  • Close any relevant issues from the owid-issues or etl repositories.
  • If it's an important update, announce it on the Slack #article-and-data-updates channel.