etl¶
Run OWID's ETL client.
Create ETL step templates, compare different datasets, generate dependency visualisations, synchronise charts across different servers, import datasets from non-ETL OWID sources, improve your metadata, etc.
Note: For a UI experience, refer to the CLI `etlwiz`.
Usage:
etl [OPTIONS] COMMAND [ARGS]...
Options:
Name | Type | Description | Default |
---|---|---|---|
`--help` | boolean | Show this message and exit. | False |
Subcommands
- b: Run Backport tools.
- chart-gpt: Add FASTT suggestions by chatGPT to pending chart revisions in admin.
- chart-sync: Sync Grapher charts and revisions from an environment to the main environment.
- chart-upgrade: Generate chart revisions in Grapher using a `MAPPING_FILE` JSON file.
- chart-upgrade: Process and update metadata using chatGPT tool.
- compare: Compare two dataframes/tables/datasets in terms of their structure, values and metadata.
- d: Run development tools.
- diff: Compare all datasets from two catalogs and print out a summary of their differences.
- explorer-update: Update one or more explorer (tsv) files.
- graphviz: Generate a Graphviz DOT file to see all dependencies.
- harmonize: Generate a dictionary with the mapping of country names to OWID's canonical names.
- metadata-export: Export dataset, tables & indicator metadata in YAML format.
- metadata-migrate: Generate (or update) the metadata YAML in a Grapher step based on an existing chart.
- run: Generate datasets by running their corresponding ETL steps.
- update: Update one or more steps to their new version, if possible.
- variable-mapping-translate: Translate the variable mapping dictionary from one environment to another.
- variable-match: Match variable IDs from an old dataset to a new dataset's.
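For orientation, a typical session might look like the following sketch (the "maddison" pattern is just an illustrative example; each command is documented in full below):

```bash
# Preview, then build, the ETL steps matching a pattern
etl run maddison --dry-run
etl run maddison

# Check what changed in the local catalog relative to the remote one
etl diff REMOTE data/ --include maddison
```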
etl b¶
Run Backport tools.
Usage:
etl b [OPTIONS] COMMAND [ARGS]...
Options:
Name | Type | Description | Default |
---|---|---|---|
`--help` | boolean | Show this message and exit. | False |
Subcommands
- bulk: Backport in bulk.
- fasttrack: Create Fast-track ready spreadsheet from an existing dataset.
- migrate: Migrate existing dataset from MySQL into ETL.
- run: Backport datasets.
etl b bulk¶
Backport in bulk.
Usage:
etl b bulk [OPTIONS]
Options:
Name | Type | Description | Default |
---|---|---|---|
`--dataset-ids`, `-d` | integer | N/A | None |
`--dry-run` / `--no-dry-run` | boolean | Do not add dataset to a catalog on dry-run | False |
`--limit` | integer | N/A | 1000000 |
`--upload` / `--skip-upload` | boolean | Upload dataset to S3 | True |
`--force` / `--no-force` | boolean | Force overwrite even if checksums match | False |
`--prune` / `--no-prune` | boolean | Prune datasets from local snapshots that are not in DB anymore | False |
`--backport` / `--skip-backport` | boolean | Backport datasets, can be skipped if you only want to prune | True |
`--data-metadata` / `--skip-data-metadata` | boolean | Upload data & metadata JSON of variable to R2 | False |
`--all` / `--no-all` | boolean | Backport all datasets, even those without charts or archived | False |
`--workers` | integer | Thread workers to parallelize which steps need rebuilding (steps execution is not parallelized) | 1 |
`--help` | boolean | Show this message and exit. | False |
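No example is given above, so here is a hedged sketch (the dataset IDs are hypothetical, and we assume `-d` may be repeated, as the plural `--dataset-ids` suggests):

```bash
# Preview a bulk backport of two specific datasets without touching the catalog
etl b bulk -d 5205 -d 5546 --dry-run

# Backport whatever changed, prune stale local snapshots, and use 4 worker threads
etl b bulk --prune --workers 4
```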
etl b fasttrack¶
Create Fast-track ready spreadsheet from an existing dataset.
Installation:
- Add Google Sheets API and Google Drive API to your project in the Google Cloud Platform Console.
- Download the credentials as a JSON file and save it in the same directory as this notebook.
- Point the env variable `GOOGLE_APPLICATION_CREDENTIALS` to the credentials file.
- Share the Fast-track template with the service account email address (e.g. 937270026338-compute@developer.gserviceaccount.com).
Example:
ENV=.env.live etl b fasttrack --dataset-id 5546 --short-name democracy_lexical_index --no-backport
Usage:
etl b fasttrack [OPTIONS]
Options:
Name | Type | Description | Default |
---|---|---|---|
`--dataset-id` | integer | Dataset ID to migrate | _required_ |
`--short-name` | text | New short name to use, underscored dataset name by default | None |
`--backport` / `--no-backport` | boolean | Backport dataset before migrating | True |
`--recreate` / `--no-recreate` | boolean | Recreate the spreadsheet if it already exists | False |
`--help` | boolean | Show this message and exit. | False |
etl b migrate¶
Migrate existing dataset from MySQL into ETL.
It imports datasets into ETL from MySQL by creating:
- Ingest script that downloads data from S3 backport
- Garden step with YAML metadata
- Grapher step
Example:
ENV=.env.live etl b migrate --dataset-id 5205 --namespace covid --short-name hospital__and__icu --no-backport
Usage:
etl b migrate [OPTIONS]
Options:
Name | Type | Description | Default |
---|---|---|---|
`--dataset-id` | integer | Dataset ID to migrate | _required_ |
`--namespace` | text | New namespace | _required_ |
`--version` | text | New version | latest |
`--short-name` | text | New short name to use, underscored dataset name by default | None |
`--backport` / `--no-backport` | boolean | Backport dataset before migrating | True |
`--force` / `--no-force` | boolean | Force overwrite even if checksums match | False |
`--dry-run` / `--no-dry-run` | boolean | Do not add dataset to a catalog on dry-run | False |
`--upload` / `--skip-upload` | boolean | Upload dataset to S3 as snapshot | True |
`--help` | boolean | Show this message and exit. | False |
etl b run¶
Backport datasets.
Usage:
etl b run [OPTIONS]
Options:
Name | Type | Description | Default |
---|---|---|---|
`--dataset-id` | integer | N/A | _required_ |
`--force` / `--no-force` | boolean | Force overwrite even if checksums match | False |
`--dry-run` / `--no-dry-run` | boolean | Do not add dataset to a catalog on dry-run | False |
`--upload` / `--skip-upload` | boolean | Upload dataset to S3 as snapshot | True |
`--data-metadata` / `--skip-data-metadata` | boolean | Upload data & metadata JSON of variable to R2 | False |
`--help` | boolean | Show this message and exit. | False |
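A hedged sketch (the dataset ID is hypothetical, reused from the migrate example above):

```bash
# Preview the backport of a single dataset
etl b run --dataset-id 5205 --dry-run

# Backport it for real, skipping the S3 snapshot upload
etl b run --dataset-id 5205 --skip-upload
```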
etl chart-gpt¶
Add FASTT suggestions by chatGPT to pending chart revisions in admin.
This command gets a set of chart revision suggestions from the database, and queries chatGPT to get new suggestions for the title and subtitle of the chart. The new suggestions are then added to the database in an auxiliary table.
The new suggestions are available from the chart revision admin tool.
Usage:
etl chart-gpt [OPTIONS]
Options:
Name | Type | Description | Default |
---|---|---|---|
`-u`, `--user-id` | integer | ID of the user. Use this to only get the revisions created by this particular user. By default will get revisions from all users. | None |
`-i`, `--revision-id` | integer | ID of the revision. Use this to only get a specific review. By default will get revisions from all users. | None |
`-me`, `--only-mine` | boolean | Use this to only modify those revisions created using your user id. Make sure that your environment variable GPT_SAMPLE_SIZE is properly set. If set, the user-id value will be ignored. | False |
`-f`, `--overwrite` | boolean | Use this to overwrite existing suggestions. That is, existing gpt suggestions for the retrieved revisions will be replaced. | False |
`-s`, `--sample-size` | integer | Number of reviews sampled from chatGPT. | 3 |
`-n`, `--model-name` | choice (`gpt-3.5` \| `gpt-3.5-turbo` \| `gpt-4`) | Choose the chatGPT model version. By default uses gpt-3.5. | gpt-3.5 |
`-t`, `--system-prompt` | text | Path to a custom chatGPT system prompt. | None |
`--version` | boolean | Show the version and exit. | False |
`--help` | boolean | Show this message and exit. | False |
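A hedged sketch (the revision ID is hypothetical; all flags come from the table above):

```bash
# Ask GPT-4 for 5 FASTT suggestions per revision, only for revisions created by you
etl chart-gpt --only-mine --model-name gpt-4 --sample-size 5

# Regenerate suggestions for one specific revision, replacing the existing ones
etl chart-gpt --revision-id 42 --overwrite
```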
etl chart-sync¶
Sync Grapher charts and revisions from an environment to the main environment.
It syncs the charts and revisions from `SOURCE` to `TARGET`. This is especially useful for syncing work from staging servers to production.
`SOURCE` and `TARGET` can be either names of staging servers (e.g. "staging-site-mybranch") or paths to .env files. Use ".env.prod.write" as `TARGET` to sync to live.
- Note 1: The dataset (with the new chart's underlying indicators) from `SOURCE` must exist in `TARGET`. This means that you have to merge your work to master and wait for the ETL to finish running all steps.
- Note 2: Staging servers are destroyed 1 day after merging to master, so this script should be run before that, but after the dataset has been built by ETL in production.
Considerations on charts:
- Only published charts from staging are synced.
- New charts are synced as drafts in target (unless the `--publish` flag is used).
- Existing charts (with the same slug) are added as chart revisions in target. (Revisions could be pre-approved with the `--approve-revisions` flag.)
- You get a warning if the chart has been modified on live after the staging server was created.
- Deleted charts are not synced.
Considerations on chart revisions:
- Approved chart revisions on staging are automatically applied in target, assuming the chart has not been modified.
Considerations on tags:
- Tags are synced only for new charts; any edits to tags in existing charts are ignored.
Example 1: Run chart-sync in dry-run mode to see what charts will be updated
$ etl chart-sync staging-site-my-branch .env.prod.write --dry-run
Example 2: Run it for real
etl chart-sync staging-site-my-branch .env.prod.write
Example 3: Sync only one chart
etl chart-sync staging-site-my-branch .env.prod.write --chart-id 123 --dry-run
Example 4: Update charts directly without creating chart revisions (useful for large dataset updates like population)
etl chart-sync staging-site-my-branch .env.prod.write --approve-revisions
Usage:
etl chart-sync [OPTIONS] SOURCE TARGET
Options:
Name | Type | Description | Default |
---|---|---|---|
`--chart-id` | integer | Sync only the chart with this id. | None |
`--publish` / `--no-publish` | boolean | Automatically publish new charts. | False |
`--approve-revisions` / `--keep-revisions` | boolean | Directly update existing charts with approved revisions (i.e. skip chart revision). Useful for large updates. This still creates a chart revision if the target chart has been modified. | False |
`--staging-created-at` | text | Staging server UTC creation date. It is used to warn about charts that have been updated in production. Default is branch creation date. | None |
`--include` | text | Include only charts with variables whose catalogPath matches the provided string. | None |
`--exclude` | text | Exclude charts with variables whose catalogPath matches the provided string. | None |
`--errors` | choice (`raise` \| `warn`) | How to handle errors when syncing charts. 'warn' will skip the chart and continue. | raise |
`--dry-run` / `--no-dry-run` | boolean | Do not write to target database. | False |
`--help` | boolean | Show this message and exit. | False |
etl chart-upgrade¶
Generate chart revisions in Grapher using a `MAPPING_FILE` JSON file.
`MAPPING_FILE` is a JSON file mapping "old variables" to "new" ones. Typically old variables belong to a dataset that you want to deprecate and replace with a new one, which contains the "new variables".
Note 1: Make sure that you are connected to the database. By default, it connects to Grapher based on the environment file found in the project's root directory "path/to/etl/.env".
Note 2: You should use the default `--use-version` option value, unless you are aware of the changes in the backend.
Example:
/* file: variable-mapping.json */
{
2032: 147395,
2033: 147396
}
Usage:
etl chart-upgrade [OPTIONS] MAPPING_FILE
Options:
Name | Type | Description | Default |
---|---|---|---|
`--revision-reason`, `-r` | text | Assign a reason for the suggested chart revision. | None |
`-u`, `--use-version` | choice (`0` \| `1` \| `2`) | Choose the backend version to use. By default uses latest version. | 2 |
`--help` | boolean | Show this message and exit. | False |
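Given the variable-mapping.json example above, the corresponding invocation might look like this sketch (the revision reason is free text):

```bash
# Suggest chart revisions for all charts that use the old variable IDs in the mapping
etl chart-upgrade variable-mapping.json --revision-reason "Dataset update"
```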
etl chart-upgrade¶
Process and update metadata using chatGPT tool.
You can learn more about this tool in our official documentation.
Usage:
etl chart-upgrade [OPTIONS]
Options:
Name | Type | Description | Default |
---|---|---|---|
`--path-to-file` | text | Path to the metadata file. | None |
`--output-dir` | text | Path to save the new metadata file. | None |
`--overwrite` | boolean | Overwrite input file if set to True. Otherwise, save the new file in the output directory. | False |
`--model` | text | Name of the model. | gpt-3.5-turbo |
`--help` | boolean | Show this message and exit. | False |
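A hedged sketch; the metadata file path is hypothetical (it reuses the YAML path from the metadata-export example below):

```bash
# Rewrite a metadata YAML file in place using the default gpt-3.5-turbo model
etl chart-upgrade --path-to-file etl/steps/data/garden/ggdc/2020-10-01/ggdc_maddison.meta.yml --overwrite
```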
etl compare¶
Compare two dataframes/tables/datasets in terms of their structure, values and metadata.
Usage:
etl compare [OPTIONS] COMMAND [ARGS]...
Options:
Name | Type | Description | Default |
---|---|---|---|
`--absolute-tolerance` | float | The absolute tolerance for floating point comparisons. | 1e-08 |
`--relative-tolerance` | float | The relative tolerance for floating point comparisons. | 0.05 |
`--show-values` / `--hide-values` | boolean | Show a preview of the values where the dataframes are different. | False |
`--show-shared` / `--hide-shared` | boolean | Show the structural overlap of the two dataframes (shared columns, index columns and index values). | False |
`--truncate-lists-at` | integer | Print truncated lists if they are longer than the given length. | 20 |
`--help` | boolean | Show this message and exit. | False |
Subcommands
- dataframes: Compare `DATAFRAME1` with `DATAFRAME2`.
- grapher: Compare a dataset in the local database with the remote database.
- table: Compare a table in the local catalog with the analogous one in the remote catalog.
etl compare dataframes¶
Compare `DATAFRAME1` with `DATAFRAME2`.
It compares the columns, index columns and index values (row indices) as sets between the two dataframes and outputs the differences. Finally it compares the values of the overlapping columns and rows with the given threshold values for absolute and relative tolerance.
The exit code is:
- 0 if the dataframes are equal
- 1 if there is an error loading the dataframes
- 2 if the dataframes are structurally equal but are otherwise different
- 3 if the dataframes have different structure and/or different values
Usage:
etl compare dataframes [OPTIONS] DATAFRAME1 DATAFRAME2
Options:
Name | Type | Description | Default |
---|---|---|---|
`--help` | boolean | Show this message and exit. | False |
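Since the exit code encodes the result, the command is scriptable. A sketch, assuming the two arguments are paths to local files such as CSVs:

```bash
etl compare --show-values dataframes old.csv new.csv
echo "exit code: $?"  # 0 equal, 1 load error, 2 same structure but different values, 3 different structure/values
```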
etl compare grapher¶
Compare a dataset in the local database with the remote database.
It loads the dataset from `grapher/NAMESPACE/VERSION/DATASET`. It compares dataset and variables metadata, and optionally the values from S3 (use the `--values` flag for this). It does the comparison in the same way as the `etl-catalog` command.
The exit code is always 0 even if dataframes are different.
Examples:
etl compare --show-values grapher ggdc 2020-10-01 ggdc_maddison__2020_10_01 --values
Usage:
etl compare grapher [OPTIONS] NAMESPACE VERSION DATASET
Options:
Name | Type | Description | Default |
---|---|---|---|
`--remote-env` | path | Path to .env file with remote database credentials. | .env.prod |
`--local-env` | path | Path to .env file with local database credentials. | .env |
`--values` | boolean | Compare values from S3 (can be both CPU and memory heavy!). | False |
`--help` | boolean | Show this message and exit. | False |
etl compare table¶
Compare a table in the local catalog with the analogous one in the remote catalog.
The table in the local catalog is loaded from `CHANNEL/NAMESPACE/DATASET/{version}/TABLE`. The value for {version} is given by the option `--version`. If not given, the latest local version of the dataset is compared with the latest remote version of the same dataset.
It compares the columns, index columns and index values (row indices) as sets between the two dataframes and outputs the differences. Finally it compares the values of the overlapping columns and rows with the given threshold values for absolute and relative tolerance.
The exit code is:
- 0 if the tables are equal
- 1 if there is an error loading the tables
- 2 if the tables are structurally equal but are otherwise different
- 3 if the tables have different structure and/or different values
Usage:
etl compare table [OPTIONS] CHANNEL NAMESPACE DATASET TABLE
Options:
Name | Type | Description | Default |
---|---|---|---|
--version |
text | Version of catalog dataset to compare with. | None |
--debug |
boolean | Print debug information. | False |
--help |
boolean | Show this message and exit. | False |
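A hedged sketch reusing the Maddison dataset from the grapher example above (the table name is a hypothetical placeholder):

```bash
# Compare a local garden table against the remote catalog, pinning the version
etl compare table garden ggdc ggdc_maddison maddison_gdp --version 2020-10-01
```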
etl d¶
Run development tools.
Usage:
etl d [OPTIONS] COMMAND [ARGS]...
Options:
Name | Type | Description | Default |
---|---|---|---|
--help |
boolean | Show this message and exit. | False |
Subcommands
- d publish: Create a catalog-[channel].feather file inside etl/data with all tables in each channel.
- prune: Prune data without steps in the DAG.
- publish: Publish the generated data catalog to S3.
- run-python-step: Import and run a specific step of the ETL.
- version-tracker: Check all DAG dependencies (e.g. make sure no step is missing).
etl d d publish¶
Create a catalog-[channel].feather file inside etl/data with all tables in each channel.
This enables `catalog.find` to be aware of what datasets currently exist. So if, for example, you create a new dataset locally, you won't be able to find it in your local catalog unless you re-run reindex.
Usage:
etl d d publish [OPTIONS]
Options:
Name | Type | Description | Default |
---|---|---|---|
`--channel`, `-c` | choice (`garden` \| `meadow` \| `grapher` \| `backport` \| `open_numbers` \| `examples` \| `explorers`) | Reindex only the selected channel(s) (subfolders of data/) | ('garden', 'meadow', 'grapher', 'backport', 'open_numbers', 'examples', 'explorers') |
`--include` | text | Reindex only datasets matching pattern | None |
`--help` | boolean | Show this message and exit. | False |
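A sketch of reindexing a single channel:

```bash
# Rebuild catalog-garden.feather so that catalog.find sees newly built garden datasets
etl d d publish --channel garden
```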
etl d prune¶
Prune data without steps in the DAG.
Usage:
etl d prune [OPTIONS]
Options:
Name | Type | Description | Default |
---|---|---|---|
`--dag-path` | path | Path to DAG yaml file | /home/docs/checkouts/readthedocs.org/user_builds/owid-etl/checkouts/latest/dag/main.yml |
`--data-dir` | path | Path to data directory | /home/docs/checkouts/readthedocs.org/user_builds/owid-etl/checkouts/latest/data |
`--dry-run` | boolean | Only print files that would be deleted | False |
`--help` | boolean | Show this message and exit. | False |
etl d publish¶
Publish the generated data catalog to S3.
Usage:
etl d publish [OPTIONS]
Options:
Name | Type | Description | Default |
---|---|---|---|
`--dry-run` | boolean | Preview the datasets to sync without actually publishing them. | False |
`--private`, `-p` | boolean | Publish private catalog. | False |
`--bucket`, `-b` | text | Bucket name. | owid-catalog |
`--channel`, `-c` | choice (`garden` \| `meadow` \| `grapher` \| `backport` \| `open_numbers` \| `examples` \| `explorers`) | Publish only selected channel (subfolder of data/), push all by default. | ('garden', 'meadow', 'grapher', 'backport', 'open_numbers', 'examples', 'explorers') |
`--help` | boolean | Show this message and exit. | False |
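A sketch:

```bash
# Preview what would be pushed to the default owid-catalog bucket
etl d publish --dry-run

# Publish only the garden channel
etl d publish --channel garden
```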
etl d run-python-step¶
Import and run a specific step of the ETL.
Meant to be run as a subprocess by the main `etl` command. There is a fairly large overhead (~3s) from importing all packages again in the new subprocess.
Usage:
etl d run-python-step [OPTIONS] URI DEST_DIR
Options:
Name | Type | Description | Default |
---|---|---|---|
`--ipdb` | boolean | N/A | False |
`--help` | boolean | Show this message and exit. | False |
etl d version-tracker¶
Check all DAG dependencies (e.g. make sure no step is missing).
Run all version tracker sanity checks.
Usage:
etl d version-tracker [OPTIONS]
Options:
Name | Type | Description | Default |
---|---|---|---|
`--skip-db` | boolean | True to skip connecting to the database of the current environment. False to try to connect to the DB, to get a better informed picture of what steps may be missing or archivable. If not connected, all checks will be based purely on the content of the ETL dag. | False |
`--warn-on-archivable` | boolean | True to warn about archivable steps. By default this is False, because we currently have many archivable steps. | False |
`--help` | boolean | Show this message and exit. | False |
etl diff¶
Compare all datasets from two catalogs and print out a summary of their differences.
Compare all the datasets from the catalog in `PATH_A` with all the datasets in the catalog in `PATH_B`. The catalog paths point to the `data/` folder with all the datasets (it contains a `catalog.meta.json` file).
You can also use a path to a dataset.
Note that you can use the keyword "REMOTE" as the path if you want to run a comparison with the remote catalog.
This tool is useful as a quick way to see what has changed in the catalog and whether our updates have any unexpected side effects.
Note: This command differs from `etl compare` in that it compares all the datasets and not two specific ones.
How does it work?
It uses source checksums to find candidates for comparison. The source checksum includes all files used to generate the dataset and should be sufficient to find changed datasets; note that we are not using checksums of the output files themselves. So if you change core ETL code or one of the dependencies (e.g. a change in owid-datautils-py, or a library version update), the change won't be detected. In cases like these you should increment the ETL version, which is added to all source checksums (not implemented yet).
Example 1: Compare the remote catalog with a local one
$ etl diff REMOTE data/ --include maddison
Example 2: Compare two local catalogs
$ etl diff other-data/ data/ --include maddison
Usage:
etl diff [OPTIONS] PATH_A PATH_B
Options:
Name | Type | Description | Default |
---|---|---|---|
`--channel`, `-c` | choice (`garden` \| `meadow` \| `grapher` \| `backport` \| `open_numbers` \| `examples` \| `explorers`) | Compare only selected channel (subfolder of data/). | ['garden', 'meadow', 'grapher'] |
`--include` | text | Compare only datasets matching pattern. | None |
`--cols` | text | Compare only columns matching pattern. | None |
`--exclude`, `-e` | text | Exclude datasets matching pattern. | None |
`--verbose`, `-v` | boolean | Print more detailed differences. | False |
`--snippet` | boolean | Print a code snippet for loading both tables, useful for debugging in a notebook. | False |
`--workers`, `-w` | integer | Use multiple threads. | 1 |
`--help` | boolean | Show this message and exit. | False |
etl explorer-update¶
Update one or more explorer (tsv) files.
This command will update the content of one or more explorer (tsv) files, with the following logic:
- If it is a file-based explorer, ensure URLs to the data catalog point to the latest version of the data.
- If it is an indicator-based explorer, ensure variable ids correspond to the latest versions of the variables.
Usage:
etl explorer-update [OPTIONS] [EXPLORER_NAMES]...
Options:
Name | Type | Description | Default |
---|---|---|---|
`--explorers-dir` | text | Path to explorer files. | /home/docs/checkouts/readthedocs.org/user_builds/owid-etl/checkouts/owid-content/explorers |
`--dry-run` | boolean | Do not write to explorer files, simply print potential changes. | False |
`--help` | boolean | Show this message and exit. | False |
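A hedged sketch (the explorer name is hypothetical):

```bash
# Print the changes that would be made to the co2 explorer, without writing the tsv
etl explorer-update co2 --dry-run
```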
etl graphviz¶
Generate a Graphviz DOT file to see all dependencies.
Saves the output as a file at `OUTPUT_FILE`.
Usage:
etl graphviz [OPTIONS] OUTPUT_FILE
Options:
Name | Type | Description | Default |
---|---|---|---|
`--filter` | text | Filter the DAG by regex | None |
`--targets` | boolean | Show target nodes. | False |
`--help` | boolean | Show this message and exit. | False |
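A sketch; the second command uses Graphviz's standard `dot` tool to render the output (the filter pattern is illustrative):

```bash
# Write the dependency graph for steps matching "maddison", then render it to PNG
etl graphviz dag.dot --filter maddison
dot -Tpng dag.dot -o dag.png
```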
etl harmonize¶
Generate a dictionary with the mapping of country names to OWID's canonical names.
Harmonize the country names in `COLUMN` of a `DATA_FILE` (CSV or feather) and save the mapping to `OUTPUT_FILE` as a JSON file. The harmonization process is done according to OWID's canonical country names.
The harmonization process is done interactively, where the user is prompted with a list of ambiguous country names and asked to select the correct country name from a list of suggestions (ranked by similarity).
When the mapping is ambiguous, you can:
- Choose the [custom] option to enter a custom name.
- Press `Ctrl-C` to exit and save the partially complete mapping.
If a mapping file already exists, the tool will resume where the mapping file left off.
Usage:
etl harmonize [OPTIONS] DATA_FILE COLUMN OUTPUT_FILE
Options:
Name | Type | Description | Default |
---|---|---|---|
`--institution`, `-i` | text | Append '(institution)' to countries | None |
`--num-suggestions`, `-n` | integer | Number of suggestions to show per entity | 5 |
`--help` | boolean | Show this message and exit. | False |
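A hedged sketch with hypothetical file and column names:

```bash
# Interactively map the values in the "country" column of data.csv to OWID canonical names
etl harmonize data.csv country countries_mapping.json -n 10
```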
etl metadata-export¶
Export dataset, tables & indicator metadata in YAML format.
Given a `DATASET_PATH`, load the corresponding dataset and export its metadata in YAML format (including table and indicator metadata). The metadata file can later be edited manually. If the output YAML already exists, it will be updated with new values.
When can this be useful?
- When some metadata fields have been created dynamically in the code and you want to see the final result.
- To prefill the YAML metadata file with the list of indicators and tables in the dataset. Note that, when first created, an ETL step is not yet aware of the columns of the tables of the dataset. It only knows them once you've executed the step.
Example 1: Save to YAML file etl/steps/data/garden/ggdc/2020-10-01/ggdc_maddison.meta.yml
etl metadata-export data/garden/ggdc/2020-10-01/ggdc_maddison
Example 2: Show output instead of saving the file
etl metadata-export data/garden/ggdc/2020-10-01/ggdc_maddison --show
Usage:
etl metadata-export [OPTIONS] DATASET_PATH
Options:
Name | Type | Description | Default |
---|---|---|---|
`-o`, `--output` | path | Save output into YAML file. If not specified, save to *.meta.yml | None |
`--show` / `--no-show` | boolean | Show output instead of saving it into a file. | False |
`--decimals` | text | Add display.numDecimalPlaces to all numeric variables. Use an integer, or `auto` for autodetection. Disable with `no`. | auto |
`--help` | boolean | Show this message and exit. | False |
etl metadata-migrate¶
Generate (or update) the metadata YAML in a Grapher step based on an existing chart.
This process pre-fills the indicator with all available metadata from the existing dataset (in the old format) and adds grapher configuration taken from the chart config (accessed via its chart slug).
Fields that are missing will be prefixed with 'TBD' and should either be filled in manually or removed. The description field needs to be restructured into new fields. This step could potentially be automated by ChatGPT in the future.
Note: It is designed for use with the --chart-slug option. The use of --uri in conjunction with other options has not been as thoroughly tested.
Example 1: Show generated YAML in console
STAGING=mojmir etl metadata-migrate --chart-slug political-regime --show
Example 2: Create YAML file in Grapher step
STAGING=mojmir etl metadata-migrate --chart-slug political-regime
Usage:
etl metadata-migrate [OPTIONS]
Options:
Name | Type | Description | Default |
---|---|---|---|
`--chart-slug`, `-c` | text | Slug of the chart to generate metadata for. Example: 'human-rights-index-vdem'. | None |
`--uri`, `-u` | text | URI of the dataset to generate metadata for. Example: 'happiness/2023-03-20/happiness'. | None |
`--cols`, `-c` | text | Only generate metadata for columns matching pattern. | None |
`--table-name`, `-t` | text | Table to select. | None |
`--run-etl` / `--no-run-etl` | boolean | Refresh ETL for the given dataset. | True |
`--show`, `-s` / `--no-show` | boolean | Show output instead of saving it into a file. | False |
`--help` | boolean | Show this message and exit. | False |
etl run¶
Generate datasets by running their corresponding ETL steps.
Run all ETL steps in the DAG matching the value of `STEPS`. A match is a step whose URI contains any of the words in `STEPS`.
Example 1: Run steps matching "mars" in the DAG file:
$ etl run mars
Example 2: Run those steps that match "mars" or "prio":
$ etl run mars prio
Example 3: If you only want to preview what would be executed, use the `--dry-run` flag:
$ etl run mars prio --dry-run
Usage:
etl run [OPTIONS] [STEPS]...
Options:
Name | Type | Description | Default |
---|---|---|---|
`--dry-run` | boolean | Preview the steps without actually running them. | False |
`--force`, `-f` | boolean | Re-run the steps even if they appear done and up-to-date | False |
`--private`, `-p` | boolean | Run private steps. | False |
`--grapher`, `-g` / `--no-grapher`, `-ng` | boolean | Upsert datasets from grapher channel to DB (OWID staff only, DB access required) | False |
`--ipdb` | boolean | Run the debugger on uncaught exceptions. | False |
`--backport`, `-b` | boolean | Add steps for backporting OWID datasets. | False |
`--downstream`, `-d` | boolean | Include downstream dependencies (steps that depend on the included steps). | False |
`--only`, `-o` | boolean | Only run the selected step (no upstream or downstream dependencies). Overrides the `--downstream` option. | False |
`--exclude`, `-e` | text | Comma-separated patterns to exclude | None |
`--dag-path` | path | Path to DAG yaml file | /home/docs/checkouts/readthedocs.org/user_builds/owid-etl/checkouts/latest/dag/main.yml |
`--workers`, `-w` | integer | Parallelize execution of steps. | 1 |
`--use-threads`, `-t` / `--no-threads`, `-nt` | boolean | Use threads when checking dirty steps and upserting to MySQL. Turn off when debugging. | True |
`--strict`, `-s` / `--no-strict`, `-ns` | boolean | Force strict or lax validation on DAG steps (e.g. checks for primary keys in data steps). | None |
`--watch`, `-w` | boolean | Run ETL infinitely and update changed files. | False |
`--help` | boolean | Show this message and exit. | False |
etl update¶
Update one or more steps to their new version, if possible.
This tool lets you update one or more snapshots or data steps to a new version. It will:
- Create new folders and files for each of the steps.
- Add the new steps to the dag, with the same header comments as their current version.
Notes:
- The user will be asked for confirmation before updating each step, and in situations where there is some ambiguity.
- If new snapshots are created that are not used by any steps, they are added to a temporary dag (temp.yml). These steps are then removed from the temporary dag as soon as they are used by an active step.
- All dependencies of new steps will be assumed to use their latest version possible.
- Steps whose version is already equal to the new version will be skipped.
Examples:
Note: Remove the --dry-run if you want to actually execute the updates in the examples below (but then remember to revert changes).
- To update a single snapshot to the new version:
  $ etl update snapshot://animal_welfare/2023-10-24/fur_laws.xlsx --dry-run
  Note that, since no steps are using this snapshot, the new snapshot will be added to the temporary dag.
- To update not only that snapshot, but also the steps that use it:
  $ etl update snapshot://animal_welfare/2023-10-24/fur_laws.xlsx --include-usages --dry-run
- To update all dependencies of the climate change impacts explorer:
  $ etl update data://explorers/climate/latest/climate_change_impacts --include-dependencies --dry-run
  Note that the code of the explorers step itself will not be updated (since it has version "latest"), but its dependencies will be updated in the dag.
Usage:
etl update [OPTIONS] [STEPS]...
Options:
Name | Type | Description | Default |
---|---|---|---|
`--step-version-new` | text | New version for step. | 2024-04-27 |
`--include-dependencies` | boolean | Also update steps that are direct dependencies of the given steps. | False |
`--include-usages` | boolean | Also update steps that are directly using the given steps. | False |
`--dry-run` | boolean | Do not write to dag or create step files. | False |
`--interactive` / `--non-interactive` | boolean | Skip user interactions (for confirmation and when there is ambiguity). | False |
`--help` | boolean | Show this message and exit. | False |
etl variable-mapping-translate¶
Translate the variable mapping dictionary from one environment to another.
Generate the equivalent variable mapping file for the new DB. This is needed because the variable IDs for the same variables may differ between environments (local, staging or production). If you have the variable mapping for one of the environments, you can easily obtain the equivalent for another environment using this command.
A common use case is when you have the mapping for your local environment and wish to have the equivalent for the production environment. Instead of creating the mapping again for the production environment, simply run this command, which will 'translate' the mapping you found for your local environment into one that is consistent with the production environment's variable IDs.
Usage:
etl variable-mapping-translate [OPTIONS]
Options:
Name | Type | Description | Default |
---|---|---|---|
`-e1`, `--env-file-1` | text | Path to the configuration file for connection 1. This file should contain all the environment variables required to connect to the SQL database. Should be in the format of a .env file. | _required_ |
`-e2`, `--env-file-2` | text | Path to the configuration file for connection 2. This file should contain all the environment variables required to connect to the SQL database. Should be in the format of a .env file. | _required_ |
`-m1`, `--mapping-file-1` | text | Path to the JSON file containing the variable mapping from connection 1. This file should have been previously created and curated by the user. See command `etl variable-match` to create this file. | _required_ |
`-m2`, `--mapping-file-2` | text | Path where to store the newly generated variable mapping from connection 2. | _required_ |
`--help` | boolean | Show this message and exit. | False |
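A hedged sketch (the mapping file names are hypothetical; the .env file paths follow the conventions used elsewhere in this document):

```bash
# Translate a mapping curated against the local DB into one valid for production
etl variable-mapping-translate -e1 .env -e2 .env.prod.write \
  -m1 variable-mapping.local.json -m2 variable-mapping.prod.json
```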
etl variable-match¶
Match variable IDs from an old dataset to a new dataset's.
After a dataset has been uploaded to OWID's MySQL database, we need to pair new variable IDs with the old ones, so that all graphs update properly.
If the variable names are identical, the task is trivial: find the indexes of old variables and map them to the indexes of the identical variables in the new dataset. However, if variable names have changed (or the number of variables has changed), the pairing may need to be done manually. This CLI helps in either scenario.
Usage:
etl variable-match [OPTIONS]
Options:
Name | Type | Description | Default |
---|---|---|---|
`-f`, `--output-file` | text | Path to output JSON file. | _required_ |
`-old`, `--old-dataset-name` | text | Old dataset name (as defined in grapher). | _required_ |
`-new`, `--new-dataset-name` | text | New dataset name (as defined in grapher). | _required_ |
`-s`, `--similarity-name` | text | Name of similarity function to use when fuzzy matching variables. Available methods: token_set_ratio, token_sort_ratio, partial_ratio, partial_token_set_ratio, partial_token_sort_ratio, ratio. | partial_ratio |
`-a`, `--add-identical-pairs` | boolean | If given, add variables with identical names in both datasets to the comparison. If not given, omit such variables and assume they should be paired. | False |
`-m`, `--max-suggestions` | integer | Number of suggestions to show per old variable. That is, for every old variable at most [--max-suggestions] suggestions will be listed. | 10 |
`--help` | boolean | Show this message and exit. | False |
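A hedged sketch (the dataset names are hypothetical):

```bash
# Interactively pair old and new variable IDs, writing the result to variable-mapping.json
etl variable-match -f variable-mapping.json \
  -old "Maddison Project Database (old)" -new "Maddison Project Database 2020" \
  -s token_set_ratio
```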