Setting up the development environment¶
This document explains all the necessary steps to set up your environment and work with this project correctly.
Perhaps you want to set up the environment to help us out, or to learn how we work, or because you want to set up a similar workflow. In any case, we appreciate the time you are taking here 😀.
Python¶
This project uses Python for most of its processes. We have tested the code in Python 3.9 and 3.10. We recommend creating a virtual environment and installing all dependencies there. Something like:
# Create
python -m venv venv
# Activate
. venv/bin/activate
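To confirm that the virtual environment is active and uses one of the tested Python versions, you can check the interpreter version (an optional sanity check):
# Should report Python 3.9.x or 3.10.x
python --version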
Download the project¶
The first step is to download the project. If you just want to run the code, clone it from the official repository:
git clone https://github.com/owid/covid-19-data.git
Note that the project is quite large, so you may want to use a shallow clone:
git clone --depth 1 --no-single-branch https://github.com/owid/covid-19-data.git
If you want to contribute, consider forking the repository instead.
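If you do fork, the steps below are the same except that you clone your fork and, optionally, keep the official repository as a second remote to pull updates from. For example (your-username is a placeholder for your GitHub handle, and the remote name upstream is just a convention):
# Clone your fork (your-username is a placeholder)
git clone https://github.com/your-username/covid-19-data.git
cd covid-19-data
# Optionally keep the official repository as a remote to stay up to date
git remote add upstream https://github.com/owid/covid-19-data.git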
Install project library¶
This project is built around the Python library cowidev, which we are developing to help us maintain and improve our COVID-19 data pipeline. We recommend installing it with pip in editable mode. To do so, you need to be in the scripts/ folder, next to the setup.py file:
cd scripts
pip install -e .
If the installation went well, running cowid in your terminal will execute but raise an EnvironmentError. This is expected at this stage, before the environment variables described in the next section are set.
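You can also ask pip to confirm that the package was registered (an optional check; the package name cowidev comes from setup.py):
# Should list cowidev along with its version and the local source location
pip show cowidev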
Set environment variables¶
To run the pipeline, you need to create three environment variables: OWID_COVID_PROJECT_DIR, OWID_COVID_CONFIG and OWID_COVID_SECRETS. The last two variables point to files that we will create in the following sections.
| Variable | Description |
|---|---|
| OWID_COVID_PROJECT_DIR | Path to the local project directory, e.g. /Users/username/projects/covid-19-data. |
| OWID_COVID_CONFIG | Path to the data pipeline configuration file. This file provides the default configuration values for the pipeline. |
| OWID_COVID_SECRETS | Path to the data pipeline secrets file. |
You need to add these variables to your shell config file (.bashrc, .bash_profile or .zshrc, depending on your shell), e.g.:
export OWID_COVID_PROJECT_DIR=/Users/username/projects/covid-19-data
export OWID_COVID_CONFIG=${OWID_COVID_PROJECT_DIR}/scripts/config.yaml
export OWID_COVID_SECRETS=${OWID_COVID_PROJECT_DIR}/scripts/secrets.yaml
Note that this is an example and you are free to choose other paths as long as they point to the correct files. More on the config.yaml and secrets.yaml files below.
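After reloading your shell configuration, you can quickly check that the variables are set, for instance:
# Reload the shell configuration (adjust the file name to your shell)
source ~/.zshrc
# Print the variables; each should echo the path you configured
echo $OWID_COVID_PROJECT_DIR
echo $OWID_COVID_CONFIG
echo $OWID_COVID_SECRETS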
Configuration file¶
The configuration file is required to run the COVID-19 vaccination and testing data pipelines (might be extended to other pipelines). Please find below a sample with its structure. You can also check the one we use.
Note that all fields are required, even if left empty.
execution:
  parallel: # Use parallelization (bool)
  njobs: # Number of threads when parallel=True (int)
pipeline:
  # Vaccination data pipeline
  vaccinations:
    # Get step
    get:
      countries: # Countries to collect data for (list)
      skip_countries: # Countries to skip data collection for (list)
    # Process step
    process:
      skip_complete: # Countries to skip data processing (list)
      skip_monotonic_check: # Skip monotonicity checks for these countries, dates and metrics (dict)
      skip_anomaly_check: # Skip anomaly checks for these countries, dates and metrics (dict)
        Australia: # Country name, Australia left as an example (list)
          - date: # Date to avoid check for (str YYYY-MM-DD)
            metrics: # Metric to avoid check for (str)
    # Generate step
    generate:
    # Export step
    export:
  # Testing data pipeline
  testing:
    # Get step
    get:
      countries: # Countries to collect data for (list)
      skip_countries: # Countries to skip data collection for (list)
    # Process step
    process:
    # Generate step
    generate:
    # Export step
    export:
  # Hospitalization data pipeline
  hospitalizations:
    # Generate step
    generate:
      # Countries to include
      countries:
      # Countries to skip
      skip_countries:
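As an illustration only, a filled-in configuration might look like the following. The country names, date and metric are placeholders, not values you must use; adapt them to the sources you work on (njobs of -2 mirrors the CLI default shown later).
execution:
  parallel: True
  njobs: -2
pipeline:
  vaccinations:
    get:
      countries:
        - australia
      skip_countries:
    process:
      skip_complete:
      skip_monotonic_check:
      skip_anomaly_check:
        Australia:
          - date: 2021-08-01
            metrics: total_vaccinations
    generate:
    export:
  testing:
    get:
      countries:
        - australia
      skip_countries:
    process:
    generate:
    export:
  hospitalizations:
    generate:
      countries:
      skip_countries: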
Secrets file¶
We use the secrets file to update internal flows with the pipeline’s output (fields vax and test). While there are many fields, contributors may only need to set one field: google.client_secrets, which is needed to interact with Google Drive / Google Sheets based sources (more on how to get it below).
Note that this file is not shared, as it may contain sensitive data.
# Google configuration (dict)
google:
  client_secrets: # Path to google client_secrets.json file
  mail: # Email (str), OPTIONAL
scraperapi:
  token: # Token for https://www.scraperapi.com/ services (free plan)
slack:
  token: # Token to send messages to slack
# Vaccination configuration (dict), OPTIONAL
vaccinations:
  post: # OWID Vaccination internal post link (str)
  sheet_id: # OWID Vaccination internal spreadsheet ID, where manual imports happen (str)
# Testing configuration (dict), OPTIONAL
testing:
  post: # OWID Testing internal post link (str)
  sheet_id: # OWID Testing internal spreadsheet ID, where manual imports happen (str)
  sheet_id_attempted: # OWID Extra Testing internal spreadsheet ID, where attempted countries are listed (str)
# Twitter configuration (dict), OPTIONAL
twitter:
  consumer_key: # Consumer key (str)
  consumer_secret: # Consumer secret (str)
  access_secret: # Access secret (str)
  access_token: # Access token (str)
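For an external contributor who only needs access to Google Drive / Google Sheets based sources, a minimal secrets file might therefore look like this (the path is an example; fill in the other tokens only if the pipelines you run need them):
google:
  client_secrets: /Users/username/.config/owid/client_secrets.json
  mail:
scraperapi:
  token:
slack:
  token: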
How can I get the google client_secrets.json file?¶
The value of google.client_secrets should point to the JSON file downloaded from Google Cloud Platform that contains your personal Google credentials. To obtain it, you can follow the gsheets documentation:
1. Log into the Google Developers Console with the Google account whose spreadsheets you want to access. Create (or select) a project and enable the Drive API and Sheets API (under Google Apps APIs).
2. Go to the Credentials for your project and create New credentials > OAuth client ID > of type Other. In the list of your OAuth 2.0 client IDs click Download JSON for the Client ID you just created.
We recommend saving the downloaded file in a safe directory, with a simplified name, e.g. ~/.config/owid/client_secrets.json.
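To check that the downloaded file works, you can try authorizing with the gsheets library directly. This is a minimal sketch, not part of cowidev: the storage.json path is an arbitrary location where gsheets caches the OAuth token, and the first run opens a browser window asking you to grant access.
from gsheets import Sheets

# client_secrets.json: the credentials file downloaded from Google Cloud Platform
# storage.json: created by gsheets to cache the OAuth token after the first login
sheets = Sheets.from_files(
    "~/.config/owid/client_secrets.json",
    "~/.config/owid/storage.json",
)
print(sheets)  # prints a Sheets object if authorization succeeded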
What is scraperapi.token?¶
Scraper API is a proxy service that lets you fetch HTML pages without being blocked. For our pipeline you need to register and get a token. The free plan should be OK!
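To give an idea of how the service is used: a request is sent to the Scraper API endpoint with your token and the target URL as parameters. The snippet below is a generic sketch of that pattern, not cowidev code; the token and URL are placeholders.
import requests

SCRAPERAPI_TOKEN = "your-token"       # placeholder: the token from your secrets file
TARGET_URL = "https://example.com/"   # placeholder: the page you want to fetch

# Scraper API acts as a proxy: call its endpoint and pass the real URL as a parameter
response = requests.get(
    "http://api.scraperapi.com",
    params={"api_key": SCRAPERAPI_TOKEN, "url": TARGET_URL},
)
print(response.status_code, len(response.text))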
Verify installation¶
Once you have installed the library, set the environment variables, and created the configuration and secrets files, you should be able to run:
~ cowid --help
Usage: cowid [OPTIONS] COMMAND [ARGS]...
COVID-19 Data pipeline tool by Our World in Data.
Options:
--parallel / --no-parallel Parallelize process. [default: parallel]
--n-jobs INTEGER Number of threads to use. [default: -2]
-S, --server / --no-server Only critical log and final message to slack.
[default: no-server]
--help Show this message and exit.
Commands:
megafile COVID-19 data integration pipeline (former megafile)
test COVID-19 Testing data pipeline.
vax COVID-19 Vaccination data pipeline.
hosp COVID-19 Hospitalization data pipeline.
jhu COVID-19 Cases/Deaths data pipeline.
variants COVID-19 Variants data pipeline.
xm COVID-19 Excess Mortality data pipeline.
gmobility Google Mobility data pipeline.
oxcgrt COVID-19 stringency index (by OxCGRT) data pipeline.
decoupling COVID-19 Decoupling data pipeline.
sweden COVID-19 Sweden data pipeline.
uk-nations COVID-19 UK Nations data pipeline.
check COVID-19 data pipeline checks.
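From there you can inspect any individual pipeline in the same way, e.g.:
cowid vax --help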
Questions?¶
Raise an issue; we are happy to help!