Data pipeline#

We produce our dataset with our dedicated library, cowidev, which we develop continuously. The library provides the command-line tool cowid, which simplifies:

  1. Running several sub-processes (or pipelines) that generate intermediate datasets.

  2. Jointly processing and merging all these intermediate datasets into the final and complete dataset.

The dataset is updated multiple times a day (at least at 06:00 and 18:00 UTC), each time using the most recently generated intermediate datasets.
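The two-stage idea above (independent intermediate datasets, then one joint merge) can be illustrated with a minimal sketch. It assumes each intermediate dataset is a list of rows keyed by `location` and `date`; the function name, the row layout and the sample values are hypothetical and are not taken from cowidev, whose actual merge logic may differ.

```python
from collections import defaultdict


def merge_intermediate(datasets):
    """Outer-join rows from several intermediate datasets on (location, date)."""
    merged = defaultdict(dict)
    for rows in datasets:
        for row in rows:
            key = (row["location"], row["date"])
            # Columns from later datasets are added to the same combined row.
            merged[key].update(row)
    # Return the combined rows ordered by location, then date.
    return [merged[key] for key in sorted(merged)]


# Made-up rows, for illustration only.
vaccinations = [
    {"location": "A", "date": "2022-01-01", "total_vaccinations": 10},
]
hospital = [
    {"location": "A", "date": "2022-01-01", "icu_patients": 2},
    {"location": "B", "date": "2022-01-01", "icu_patients": 5},
]

full = merge_intermediate([vaccinations, hospital])
for row in full:
    print(row)
```

A location that appears in only one intermediate dataset still gets a row in the result; it simply lacks the columns that the other datasets would have contributed.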

Overview#

The dataset pipeline is built from several pipelines, which are executed independently and whose outputs are combined in a final step. The complexity of the pipelines varies. For instance, for vaccinations, testing and hospitalization we collect, process and publish the data ourselves, whereas for cases and deaths we leave the collection step to the Johns Hopkins Coronavirus Resource Center and only transform and publish the data. Note that on 23 June 2022, we stopped adding new data points to our COVID-19 testing dataset (read more).

The table below lists all the constituent pipelines, along with their execution frequencies and tasks.

| Pipeline | Frequency | Tasks |
| --- | --- | --- |
| Vaccinations | Every weekday at 12:00 UTC | Collection, transformation, presentation |
| Testing | Phased out (read more) | Collection, transformation, presentation |
| Hospitalization & ICU | Daily at 06:00 and 18:00 UTC | Collection, transformation, presentation |
| Cases & Deaths (JHU) | Daily at 04:00, 10:00, 16:00 and 22:00 UTC | Transformation, presentation |
| Excess mortality | Weekly | Transformation, presentation |
| Variants | Daily at 20:00 UTC | Transformation, presentation |
| Reproduction rate | Daily | Presentation |
| Policy responses (OxCGRT) | Daily | Transformation, presentation |
| Public monitor (YouGov) | Weekly | Transformation, presentation |

You can find all the automation details in this file.
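To make the fixed UTC schedules concrete, here is a small sketch that computes the next scheduled run of a pipeline. The run hours are copied from the table above; the dictionary, its keys and the `next_run` function are hypothetical and say nothing about how the automation is actually configured.

```python
from datetime import datetime, timedelta, timezone

# Run hours (UTC) copied from the table; the dictionary itself is illustrative.
SCHEDULE_UTC = {
    "hospitalization": [6, 18],
    "jhu": [4, 10, 16, 22],
    "variants": [20],
}


def next_run(pipeline, now):
    """Return the first scheduled run strictly after `now` (a timezone-aware datetime)."""
    now = now.astimezone(timezone.utc)
    midnight = now.replace(hour=0, minute=0, second=0, microsecond=0)
    # Look through today's remaining slots first, then tomorrow's.
    for day_offset in (0, 1):
        for hour in sorted(SCHEDULE_UTC[pipeline]):
            candidate = midnight + timedelta(days=day_offset, hours=hour)
            if candidate > now:
                return candidate
    raise ValueError(f"no run hours configured for {pipeline!r}")


now = datetime(2022, 6, 1, 17, 30, tzinfo=timezone.utc)
print(next_run("jhu", now))              # 2022-06-01 22:00:00+00:00
print(next_run("hospitalization", now))  # 2022-06-01 18:00:00+00:00
```

Working in timezone-aware UTC datetimes avoids surprises when the machine running the pipelines is set to a different local time zone.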

Vaccinations pipeline#

The vaccinations pipeline is probably the most complete one: we scrape and extract data for each country in the dataset.

The pipeline is executed manually by @edomt or @lucasrodes every weekday (i.e. Monday to Friday) before 12:00 UTC.

Execution steps#

# Download/scrape data
cowid vax get

# Process/check data
cowid vax process

# Generate dataset
cowid vax generate

# Integrate into full dataset
cowid vax export
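Because each step consumes the output of the previous one, the four commands above are meant to run in order. The following is a minimal sketch of a wrapper that runs them sequentially and stops at the first failure. Only the `cowid vax ...` sub-commands come from this page; the `run_pipeline` function and its `runner` parameter are hypothetical helpers, not part of cowidev.

```python
import subprocess

# Sub-commands in the order listed above.
VAX_STEPS = ["get", "process", "generate", "export"]


def run_pipeline(pipeline, steps, runner=subprocess.run):
    """Run `cowid <pipeline> <step>` for each step, stopping at the first failure."""
    completed = []
    for step in steps:
        command = ["cowid", pipeline, step]
        # With check=True, subprocess.run raises CalledProcessError on a
        # non-zero exit code, so later steps are never started.
        runner(command, check=True)
        completed.append(step)
    return completed


# Example call (requires the cowid command to be installed):
# run_pipeline("vax", VAX_STEPS)
```

The `runner` argument exists only so the wrapper can be exercised without actually invoking `cowid`, for example by passing a function that records the commands it receives.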

See also

Intermediate dataset, including per-country files and data technical details.

Testing pipeline#

We scrape and process data for multiple countries, much as in the vaccinations pipeline. The pipeline is executed manually by @camapel on Mondays and Fridays.

Warning

On 23 June 2022, we stopped adding new data points to our COVID-19 testing dataset. We continue to update all other metrics in our COVID-19 dataset. You can read more here.

Execution steps#

# Download/scrape data
cowid testing get

Hospitalization & ICU pipeline#

We scrape and process the data much as we do for testing and vaccinations. The pipeline is run daily.

Execution steps#

# Download data & generate dataset
cowid hosp generate

# Update Grapher-ready files
cowid hosp grapher-io

# Update Grapher database
cowid hosp grapher-db

Cases & Deaths (JHU) pipeline#

We source case and death figures from the COVID-19 Data Repository by the Center for Systems Science and Engineering (CSSE) at Johns Hopkins University. We transform some of the variables and re-publish the dataset.

Execution steps#

# Download data
cowid jhu download

# Generate dataset
cowid jhu generate

# Update Grapher database
cowid jhu grapher-db

Excess Mortality pipeline#

The pipeline is manually executed once a week. The reported all-cause mortality data is from the Human Mortality Database (HMD) Short-term Mortality Fluctuations project and the World Mortality Dataset (WMD). Both sources are updated weekly. We also present estimates of excess deaths globally that are published by The Economist.

Execution steps#

# Download data and generate dataset
cowid xm generate

Variants pipeline#

We run this pipeline daily.

Execution steps#

# Download data and generate dataset
cowid variants generate

# Update Grapher-ready files
cowid variants grapher-io

Note

The data on variants and sequencing is no longer available to download. It is published by GISAID under a license that doesn’t allow us to redistribute it. Please visit the data publisher’s website for more details. If you want to use this data, you may want to register an account there.

Reproduction rate pipeline#

We source the data from crondonm/TrackingR/.

See also

Tracking R of COVID-19: A New Real-Time Estimation Using the Kalman Filter, by Francisco Arroyo, Francisco Bullano, Simas Kucinskas, and Carlos Rondón-Moreno

Policy responses (OxCGRT) pipeline#

# Get the data
cowid oxcgrt get

# Update Grapher files
cowid oxcgrt grapher-io

# Upload data to database
cowid oxcgrt grapher-db
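As a quick reference, the sub-commands listed in the sections above can be gathered into a single lookup table and printed as a dry-run plan. This is only an illustrative sketch: the sub-command names are copied from this page, while the `PIPELINE_STEPS` dictionary and the `command_plan` function are hypothetical and not part of cowidev.

```python
import shlex

# Sub-commands exactly as listed in the sections above.
PIPELINE_STEPS = {
    "vax": ["get", "process", "generate", "export"],
    "testing": ["get"],
    "hosp": ["generate", "grapher-io", "grapher-db"],
    "jhu": ["download", "generate", "grapher-db"],
    "xm": ["generate"],
    "variants": ["generate", "grapher-io"],
    "oxcgrt": ["get", "grapher-io", "grapher-db"],
}


def command_plan(pipeline):
    """Return the shell command lines for a pipeline without executing them."""
    return [shlex.join(["cowid", pipeline, step]) for step in PIPELINE_STEPS[pipeline]]


for line in command_plan("oxcgrt"):
    print(line)
```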

Public monitor (YouGov) pipeline#

Warning

The YouGov pipeline is under construction.