Snapshot step
This is the first step to implement when importing data from an upstream source. It consists of copying an external file into our platform, so that we have reliable and secure access to it.
Note that an external source may decide to delete the file at any time. Keeping our own copy also makes all ETL processes reproducible, since the file at the source may change (e.g. datapoints removed or added, field names changed, etc.).
The following diagram shows a case where we import a certain dataset into our snapshot catalog. Imagine the vertical axis is time (the lower node was published later). In this case, we are importing different versions of the same dataset.
flowchart LR
upstream1((___)):::node -.->|copy| snapshot1((v1)):::node_ss
upstream2((___)):::node -.->|copy| snapshot2((v2)):::node_ss
subgraph id0 [Upstream]
upstream1
upstream2
end
subgraph id [Snapshot]
snapshot1
snapshot2
end
classDef node fill:#002147,color:#002147
classDef node_ss fill:#002147,color:#fff
The snapshot step typically consists of a DVC file and a script that downloads the upstream data and saves it to our snapshot catalog. Snapshot files are located in the snapshots/ directory of the project.
Note that we need one DVC file per upstream data file; hence, if the source publishes a dataset as multiple files, we need multiple DVC files.
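As a rough illustration of what such a download script does, here is a minimal sketch that uses only the Python standard library. It is not the project's actual snapshot code: the function name `take_snapshot` and the shape of its return value are assumptions made for illustration. The returned `md5`, `size` and `path` keys simply mirror the kind of information a DVC file records about a stored file.

```python
import hashlib
import shutil
import urllib.request
from pathlib import Path


def take_snapshot(url: str, dest: Path) -> dict:
    """Copy the file at `url` to `dest` and describe the stored copy.

    Hypothetical helper, not part of the ETL codebase.
    """
    # Make sure the target folder (e.g. a dated snapshot folder) exists.
    dest.parent.mkdir(parents=True, exist_ok=True)

    # Stream the upstream file to disk without modifying its contents.
    with urllib.request.urlopen(url) as response, open(dest, "wb") as out:
        shutil.copyfileobj(response, out)

    # Record a checksum and size so later changes can be detected.
    data = dest.read_bytes()
    return {
        "md5": hashlib.md5(data).hexdigest(),
        "size": len(data),
        "path": dest.name,
    }
```

If a source publishes a dataset as several files, a function like this would be called once per file, matching the one-DVC-file-per-upstream-file rule above.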
Metadata
A Snapshot is a picture of a data product (e.g. a data CSV file) provided by an upstream data provider at a particular point in time.
It is the entrypoint to ETL. This is where we define metadata attributes of the data product and the particular snapshot. This is fundamental to ensure that the data is properly documented and that the metadata is propagated to the rest of the system.
The metadata in Snapshot consists mainly of one object: meta.origin. To learn more about it, please refer to the reference.
This metadata is captured in a DVC file (a YAML-formatted file), which contains all the snapshot metadata fields as key-value pairs.
Example of snapshots/animal_welfare/2023-09-08/fur_laws.pdf.dvc
This file specifies all the details of the upstream source file (including the link to download it, metadata fields, etc.). Filling in the fields of this file requires some manual work, as we are "translating" all the information that the source provides into our snapshot metadata format.
meta:
  origin:
    title: Fur banning
    producer: Fur Free Alliance
    citation_full: Overview national fur legislation, Fur Free Alliance (2023).
    url_main: https://www.furfreealliance.com/fur-bans/
    url_download:
      https://www.furfreealliance.com/wp-content/uploads/2023/04/Overview-national-fur-legislation-General-Provisions.pdf
    date_published: '2023-04-01'
    date_accessed: '2023-09-08'
    license:
      name: CC BY 4.0
  license:
    name: CC BY 4.0
  is_public: true
wdir: ../../../data/snapshots/animal_welfare/2023-09-08
outs:
- md5: e326e86b4c1225f688951df82a2f85af
  size: 178968
  path: fur_laws.pdf
Learn more in our reference.
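The `outs` entry in the example above records the MD5 checksum and the size in bytes of the stored file. The following minimal sketch shows how those two values can be used to check that a local copy is the file the DVC file describes. The helper `matches_outs_entry` is hypothetical and written only for illustration; in practice DVC performs its own integrity checks.

```python
import hashlib
from pathlib import Path


def matches_outs_entry(file_path: Path, entry: dict) -> bool:
    """Return True if the file's size and MD5 match a recorded `outs` entry.

    Hypothetical helper; `entry` is a dict with `md5` and `size` keys,
    as found under `outs` in a DVC file.
    """
    data = file_path.read_bytes()
    same_size = len(data) == entry["size"]
    same_hash = hashlib.md5(data).hexdigest() == entry["md5"]
    return same_size and same_hash


# Values copied from the example DVC file above.
fur_laws_entry = {
    "md5": "e326e86b4c1225f688951df82a2f85af",
    "size": 178968,
    "path": "fur_laws.pdf",
}
```

Calling `matches_outs_entry(Path("fur_laws.pdf"), fur_laws_entry)` on a local copy would return `False` if the file had been altered or replaced since the snapshot was taken.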