Setting up the development environment

This document explains all the necessary steps to set up your environment and work with this project correctly.

Perhaps you want to set up the environment to help us out, or to learn how we work, or because you want to set up a similar workflow. In any case, we appreciate the time you are taking here 😀.

Python

This project uses Python for most of its processes. We have tested the code in Python 3.9 and 3.10. We recommend creating a virtual environment and installing all dependencies there. Something like:

# Create
python -m venv venv

# Activate
. venv/bin/activate

Download the project

First thing is to download the project. If you just want to run the code, clone it from the official repository:

git clone https://github.com/owid/covid-19-data.git

Note that the project is quite significant in size, so you may want to use a shallow clone:

git clone --depth 1 --no-single-branch https://github.com/owid/covid-19-data.git

If you want to contribute consider forking the repository instead.

Install project library

This project is built around the python library cowidev, which we are developing to help us maintain and improve our COVID-19 data pipeline. We recommend using pip in editable mode. For this, you need to be in scripts/ folder, next to the setup.py file:

cd scripts
pip install -e .

If the installation went well, running cowid in your terminal will execute but raise an EnvironmentError error.

Set environment variables

To run the pipeline, you need to create three environment variables: OWID_COVID_PROJECT_DIR, OWID_COVID_CONFIG and OWID_COVID_SECRETS. The last two variables point to files that we will create in the following sections.

Variable

Description

OWID_COVID_PROJECT_DIR

Path to the local project directory, e.g. /Users/username/projects/covid-19-data/

OWID_COVID_CONFIG

Path to the data pipeline configuration file. This file provides the default configuration values for the pipeline. Our team uses config.yaml, which you can use and adapt to your needs.

OWID_COVID_SECRETS

Path to the data pipeline secrets file.

You need to add these variables to your shell config file (i.e. .bashrc, .bash_profile or .zshrc), e.g.:

export OWID_COVID_PROJECT_DIR=/Users/username/projects/covid-19-data
export OWID_COVID_CONFIG=${OWID_COVID_PROJECT_DIR}/scripts/config.yaml
export OWID_COVID_SECRETS=${OWID_COVID_PROJECT_DIR}/scripts/secrets.yaml

Note that this is an example and you are free to choose other paths as long as they point to the correct files. More on the config.yaml and secrets.yaml file below.

Configuration file

The configuration file is required to run the COVID-19 vaccination and testing data pipelines (might be extended to other pipelines). Please find below a sample with its structure. You can also check the one we use.

Note that all fields are required, even if left empty.

execution:
  parallel:  # Use parallelization (bool)
  njobs:  # Number of threads when parallel=True (int)

pipeline:
  # Vaccination data pipeline
  vaccinations:
    # Get step
    get:
      countries:  # Countries to collect data for (list)
      skip_countries:  # Countries to skip data collection for (list)
    # Process step
    process:
      skip_complete:  # Countries to skip data processing (list)
      skip_monotonic_check:
      skip_anomaly_check:  # Skip anomaly checks for these countries, dates and metrics (dict)
        Australia:  # Country name, Australia left as an example (list)
          - date:  # Date to avoid check for (str YYYY-MM-DD)
            metrics:  # Metric to avoid check for (str)
    # Generate step
    generate:
    # Export step
    export:

  # Testing data pipeline
  testing:
    # Get step
    get:
      countries:  # Countries to collect data for (list)
      skip_countries:  # Countries to skip data collection for (list)
    # Process step
    process:
    # Generate step
    generate:
    # Export step
    export:
  
  # Hospitalization data pipeline
  hospitalizations:
  # Generate step
    generate:
      # Countries to include
      countries:
      # Countries to skip
      skip_countries:

Secrets file

We use the secrets file to update internal flows with the pipeline’s output (fields vax and test). While there are many fields, contributors may only need set one field: google.clients_secrets, which is needed to interact with Google Drive / Google Sheets based sources (more on how to get it here).

Note that this file is not shared, as it may contain sensitive data.

# Google configuration (dict)
google:
  client_secrets:  # Path to google client_secrets.json file
  mail:  # Email (str), OPTIONAL

scraperapi:
  token:  # Token for https://www.scraperapi.com/ services (free plan)
  
slack:
  token:  # Token to send messages to slack

# Vaccination configuration (dict), OPTIONAL
vaccinations:
  post:  # OWID Vaccination internal post link (str)
  sheet_id:  # OWID Vaccination internal spredsheet ID, where manual imports happen (str)

# Testing configuration (dict), OPTIONAL
testing:
  post:  # OWID Testing internal post link (str)
  sheet_id:  # OWID Testing internal spredsheet ID, where manual imports happen (str)
  sheet_id_attempted:  # OWID Extra Testing internal spredsheet ID, where attempted countries are listed (str)

# Twitter configuration (dict), OPTIONAL
twitter:
  consumer_key:  # Consumer key (str)
  consumer_secret:  # Consumer secret (str)
  access_secret:  # Access secret (str)
  access_token:  # Acces token (str)

How can I get the google client_secrets.json file?

The value of google.client_secrets should point to the JSON file downloaded from Google Cloud Platform that contains your personal Google credentials. To obtain it, you can follow gsheets documentation:

Log into the Google Developers Console with the Google account whose spreadsheets you want to access. Create (or select) a project and enable the Drive API and Sheets API (under Google Apps APIs).

Go to the Credentials for your project and create New credentials > OAuth client ID > of type Other. In the list of your OAuth 2.0 client IDs click Download JSON for the Client ID you just created.

We recommend saving the downloaded file in a safe directory, with a simplified name, e.g. ~/.config/owid/client_secrets.json.

What is scraperapi.token?

Scraper API is a service with a friendly proxy that allows you to access any HTML without being blocked. Four our pipeline you need to register and get their TOKEN. The free plan should be OK!

Verify installation

Once you have installed the library, configured the configuration and secrets files accordingly along with the environment variables you should be able to run:

~ cowid --help
Usage: cowid [OPTIONS] COMMAND [ARGS]...

  COVID-19 Data pipeline tool by Our World in Data.

Options:
  --parallel / --no-parallel  Parallelize process.  [default: parallel]
  --n-jobs INTEGER            Number of threads to use.  [default: -2]
  -S, --server / --no-server  Only critical log and final message to slack.
                              [default: no-server]
  --help                      Show this message and exit.

Commands:
  megafile    COVID-19 data integration pipeline (former megafile)
  test        COVID-19 Testing data pipeline.
  vax         COVID-19 Vaccination data pipeline.
  hosp        COVID-19 Hospitalization data pipeline.
  jhu         COVID-19 Cases/Deaths data pipeline.
  variants    COVID-19 Variants data pipeline.
  xm          COVID-19 Excess Mortality data pipeline.
  gmobility   Google Mobility data pipeline.
  oxcgrt      COVID-19 stringency index (by OxCGRT) data pipeline.
  decoupling  COVID-19 Decoupling data pipeline.
  sweden      COVID-19 Sweden data pipeline.
  uk-nations  COVID-19 UK Nations data pipeline.
  check       COVID-19 data pipeline checks.

Questions?

Raise an issue, we are happy to help!