Skip to content

Private datasets

While most of the data at OWID is publicly available, some datasets are added to our catalog with some restrictions. These include datasets that are not redistributable, or that are not meant to be shared with the public. This can happen due to a strict license by the data provider, or because the data is still in a draft stage and not ready for public consumption.

Various privacy configurations are available:

  • Disable data downloading options on Grapher.
  • Disable public access to the original file (snapshot).
  • Hide the dataset from our public catalog (accessible via owid-catalog-py).

In the following, we explain how to create private steps in the ETL pipeline and how to run them.

Create a private step

Make your dataset completely private

  • Snapshot: Set meta.is_public to false in the snapshot DVC file.
  • Meadow, Garden, Grapher: Use data-private:// prefix in the step name in the DAG. Set dataset.non_redistributable to true in the dataset garden metadata.

Snapshot

To create a private snapshot step, set the meta.is_public property in the snapshot .dvc file to false:

meta:
  is_public: false

  origin:
    # Data product / Snapshot
    title: World Population Prospects
    ...

This will prevent the file to be publicly accessible without the appropriate credentials.

Meadow, Garden, Grapher

Creating a private data step means that the data will not be listed in the public catalog, and therefore will not be accessible via owid-catalog-py.

To create a private data step (meadow, garden or grapher) simply use data-private prefix in the step name in the DAG. For example, the step grapher/ihme_gbd/2024-06-10/leading_causes_deaths (this is from health.yml) is private:

# IHME GBD Leading cause of  deaths - update
data-private://meadow/ihme_gbd/2024-06-10/cause_hierarchy:
  - snapshot-private://ihme_gbd/2024-06-10/cause_hierarchy.csv
data-private://garden/ihme_gbd/2024-06-10/leading_causes_deaths:
  - data-private://garden/ihme_gbd/2024-05-20/gbd_cause
  - data-private://meadow/ihme_gbd/2024-06-10/cause_hierarchy
data-private://grapher/ihme_gbd/2024-06-10/leading_causes_deaths:
  - data-private://garden/ihme_gbd/2024-06-10/leading_causes_deaths

Make the data non-downloadable

To make the data non-downloadable on Grapher, set the non_redistributable property in the dataset metadata (typically the garden metadata yaml file) to true:

dataset:
  non_redistributable: true

Running private ETL

To run a private step, you need to use the --private flag. Otherwise, private steps are not detected by etl command:

etl run run [step-name] --private

Bringing private data to public

If you want to make a private step public simply follow the steps below:

  • In the DAG: Replace data-private:// prefix with data://.
  • In the snapshot DVC file: Set meta.is_public to true (or simply remove this property).
  • (Optional) Allow for Grapher downloads: Set dataset.non_redistributable to false in the dataset garden metadata (or simply remove this property).

After this, re-run the snapshot step and commit your changes.