Grapher schema sync

The grapher chart-config schema is owned by the web team in owid-grapher and published at https://files.ourworldindata.org/schemas/grapher-schema.NNN.json. A published version can be mutated in place without a version bump (e.g. dumbbell plots landed directly in grapher-schema.010.json), so ETL needs a way to follow upstream changes.

ETL keeps a vendored pin of the upstream schema (a committed snapshot, the same model as a lockfile) and derives or checks everything else against it:

owid-grapher (web team)
    │  publishes / mutates grapher-schema.010.json
schemas/grapher-schema.010.json              ← vendored pin (the only upstream copy in ETL)
    ├─→ etl/collection/model/schema_types.py    generated from it    ← unit test (offline)
    ├─→ multidim/explorer-schema $refs           resolved against it  ← at runtime (offline)
    └─→ dataset-schema.json embedded block       checked against it   ← unit test (offline, enum-deep)

vendored ↔ live upstream                      ← scheduled workflow + integration tests

Everything inside the repo is machine-checked against the pin on every PR; only the pin itself can lag upstream, and that is what the scheduled workflow watches.
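
To illustrate what resolving a $ref "against the pin, offline" can look like, here is a minimal sketch that maps an upstream schema URL to an in-memory copy of the vendored file and then follows the JSON Pointer fragment. The function name resolve_ref, the vendored mapping, and the sample property are assumptions made for illustration; the resolver ETL actually uses may work differently.

```python
from typing import Any
from urllib.parse import unquote, urldefrag


def resolve_ref(ref: str, vendored: dict[str, Any]) -> Any:
    """Resolve a `$ref` using only vendored schemas (no network access).

    `vendored` maps a schema URL (without fragment) to its parsed JSON.
    Both the name and the mapping are hypothetical, for illustration only.
    """
    url, fragment = urldefrag(ref)
    if url not in vendored:
        raise KeyError(f"no vendored copy for {url}")

    node = vendored[url]
    pointer = unquote(fragment)
    if not pointer:
        return node  # the ref points at the whole document

    # Follow the JSON Pointer (RFC 6901) token by token.
    for raw in pointer.split("/")[1:]:
        token = raw.replace("~1", "/").replace("~0", "~")
        node = node[int(token)] if isinstance(node, list) else node[token]
    return node


# Toy stand-in for the vendored pin; the real file is much larger.
PIN_URL = "https://files.ourworldindata.org/schemas/grapher-schema.010.json"
VENDORED = {PIN_URL: {"properties": {"title": {"type": "string"}}}}

title_schema = resolve_ref(PIN_URL + "#/properties/title", VENDORED)
```

Because the lookup never leaves the process, a $ref to a schema that is not vendored fails immediately with a KeyError instead of silently reaching out to the network.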

The scheduled workflow

.github/workflows/sync-grapher-schema.yml runs on weekdays at 07:00 UTC and can also be started manually via workflow_dispatch. When everything is in sync, a run is a no-op that costs only two curl requests.

| Upstream event | Workflow action |
| --- | --- |
| Nothing changed | No-op |
| Pinned schema mutated in place | Refreshes the vendored copy, regenerates schema_types.py, and opens/updates a draft PR on the auto-sync-grapher-schema branch. The PR's own CI flags any remaining manual propagation. |
| New schema version published (grapher-schema.latest.json $id ≠ our pin) | Opens an issue (deduped by title) pointing at the version-bump procedure. Not auto-PR'd, since a bump needs judgment. |
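
The three cases above can be sketched as a small decision function. This is an illustration only: Action and decide are hypothetical names, comparing raw bytes is an assumption about how "mutated in place" is detected, and returning both actions when both conditions hold is a guess at behavior the table does not spell out.

```python
from enum import Enum


class Action(Enum):
    NOOP = "no-op"
    DRAFT_PR = "refresh vendored copy, regenerate types, open/update draft PR"
    ISSUE = "open deduplicated issue pointing at the version-bump procedure"


def decide(
    vendored: bytes,
    upstream_pinned: bytes,
    pinned_id: str,
    latest_id: str,
) -> list[Action]:
    """Map the two upstream fetches to workflow actions (hypothetical sketch)."""
    actions: list[Action] = []
    # In-place mutation: the live copy of the pinned version differs from ours.
    if vendored != upstream_pinned:
        actions.append(Action.DRAFT_PR)
    # Version bump: grapher-schema.latest.json advertises a different $id.
    if latest_id != pinned_id:
        actions.append(Action.ISSUE)
    return actions or [Action.NOOP]
```

Keeping the decision free of network and git side effects makes it easy to test offline; the fetching and PR creation would live around it.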

Failure modes and caveats

  • Red CI on a bot PR is by design, not a malfunction: test_grapher_config_schema_sync fails when the upstream change also needs manual propagation (see below). The failing test output is the todo list.
  • Branch force-update: if upstream changes again before a bot PR is merged, the next run force-updates the auto-sync-grapher-schema branch and can clobber manual commits on it. Finish and merge bot PRs promptly; if you need more time, move the work to your own branch.
  • Fetch or build failures: if files.ourworldindata.org is unreachable, or dependency installation or the generator fails, the job fails loudly in the Actions tab; nothing is skipped silently.
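  • Bot PRs are created with the default GITHUB_TOKEN, and GitHub does not start new workflow runs for events created with that token. Buildkite CI still triggers normally because it listens on an external webhook, but any GitHub-Actions-based CI would not run on these PRs.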

Completing a sync: the /sync-grapher-schema skill

The automatic part of a sync (vendored refresh + type regeneration) is committed by the workflow. The judgment part is guided by the internal Claude Code skill /sync-grapher-schema:

  • mirroring the upstream diff into the grapher_config block embedded in schemas/dataset-schema.json, preserving the deliberate ETL-side deviations (Jinja oneOf escape hatches, the extra WorldMap chart type, the ETL-only data/includedEntities properties);
  • adding $refs for genuinely new properties to schemas/multidim-schema.json / schemas/explorer-schema.json;
  • handling the rarer version bump (new grapher-schema.NNN): bumping DEFAULT_GRAPHER_SCHEMA in etl/config.py, updating $refs, re-vendoring.
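
For the version-bump case, the "updating $refs" step amounts to repointing every $ref from the old schema file to the new one. A minimal sketch of that rewrite follows; bump_refs is a hypothetical helper, not the skill's implementation, and grapher-schema.011.json is used purely as an example of a future version.

```python
from typing import Any


def bump_refs(node: Any, old: str, new: str) -> Any:
    """Return a copy of `node` with every `$ref` string repointed from `old` to `new`.

    Hypothetical helper: only string values under a `$ref` key are touched.
    """
    if isinstance(node, dict):
        return {
            key: value.replace(old, new)
            if key == "$ref" and isinstance(value, str)
            else bump_refs(value, old, new)
            for key, value in node.items()
        }
    if isinstance(node, list):
        return [bump_refs(item, old, new) for item in node]
    return node


# Toy fragment standing in for multidim-schema.json / explorer-schema.json.
BASE = "https://files.ourworldindata.org/schemas/"
before = {
    "properties": {
        "title": {"$ref": BASE + "grapher-schema.010.json#/properties/title"},
        "note": {"type": "string", "description": "mentions grapher-schema.010.json"},
    }
}
after = bump_refs(before, "grapher-schema.010.json", "grapher-schema.011.json")
```

Restricting the rewrite to $ref keys leaves descriptions and other prose that mention the old version untouched, so those can be reviewed by hand.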

The skill can also be run ad hoc, e.g. when the web team announces a change and you don't want to wait for the cron; in that case the skill performs the refresh itself. Alternatively, trigger the workflow manually via workflow_dispatch.

Never edit schema_types.py by hand

etl/collection/model/schema_types.py is fully generated by scripts/generate_schema_types.py; a unit test fails if it doesn't round-trip. Hand-written types belong in etl/collection/model/params.py.
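
A round-trip check of this kind boils down to regenerating the file in memory and comparing it with the committed text. The sketch below shows only the comparison; round_trip_diff is a hypothetical name, and how the regenerated text is obtained from scripts/generate_schema_types.py is not shown here.

```python
import difflib


def round_trip_diff(
    committed: str, regenerated: str, name: str = "schema_types.py"
) -> str:
    """Return a unified diff; an empty string means the file round-trips.

    Hypothetical helper: both arguments are full file contents as text.
    """
    return "".join(
        difflib.unified_diff(
            committed.splitlines(keepends=True),
            regenerated.splitlines(keepends=True),
            fromfile=f"{name} (committed)",
            tofile=f"{name} (regenerated)",
        )
    )


# Toy contents: a hand edit added a line that the generator does not produce.
diff = round_trip_diff("A = 1\nB = 2\n", "A = 1\n")
```

A test built on this can assert that the diff is empty and print it on failure, which tells the author exactly which hand edit or stale output to fix.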

Who does what

| Who | Responsibility |
| --- | --- |
| Web team | Nothing ETL-specific: publish the schema and announce changes, as they already do. |
| The workflow | Detect upstream changes; open the draft PR / issue. |
| Auto-assigned reviewer | Triage bot PRs: review the vendored diff; CI green → mark ready and merge; CI red → run /sync-grapher-schema on the branch (or delegate). |
| Anyone on the data team | Can run /sync-grapher-schema; picks up "new version" issues. |
| CI on every PR | Refuses internally-inconsistent states (schemas ↔ generated types ↔ embedded block). |

Drift guards reference

| Drift vector | Guard |
| --- | --- |
| schemas/*.json edited without regenerating types, or generated file hand-edited | tests/test_schema_types_generation.py (unit, offline) |
| Embedded grapher_config in dataset-schema.json out of sync with the vendored pin | test_grapher_config_schema_sync in tests/test_metadata_schemas.py (unit, offline, enum-deep) |
| Vendored pin stale vs live upstream (in-place mutation) | scheduled workflow → draft PR; test_vendored_grapher_schema_is_current (integration) |
| Upstream publishes a new schema version | scheduled workflow → issue; test_no_newer_grapher_schema_version (integration) |
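
As a rough picture of what an "enum-deep" comparison can involve, the sketch below walks two schemas, collects every enum by JSON path, and reports values that are missing or unexpected in the embedded copy while tolerating a list of deliberate deviations. All names (iter_enums, enum_drift, allowed_extra) and the toy schemas are assumptions; the real test also has to cope with structural differences such as the Jinja oneOf escape hatches, which this sketch ignores. It assumes enum values are hashable scalars.

```python
from typing import Any, Iterator, Optional


def iter_enums(node: Any, path: str = "") -> Iterator[tuple[str, tuple]]:
    """Yield (json_path, enum_values) for every `enum` list in a schema."""
    if isinstance(node, dict):
        if isinstance(node.get("enum"), list):
            yield path, tuple(node["enum"])
        for key, value in node.items():
            yield from iter_enums(value, f"{path}/{key}")
    elif isinstance(node, list):
        for index, item in enumerate(node):
            yield from iter_enums(item, f"{path}/{index}")


def enum_drift(
    pinned: Any,
    embedded: Any,
    allowed_extra: Optional[dict[str, set]] = None,
) -> dict[str, tuple[set, set]]:
    """Map path -> (values missing from embedded, unexpected values in embedded)."""
    allowed_extra = allowed_extra or {}
    embedded_enums = dict(iter_enums(embedded))
    drift: dict[str, tuple[set, set]] = {}
    for path, values in iter_enums(pinned):
        have = set(embedded_enums.get(path, ()))
        missing = set(values) - have
        extra = have - set(values) - allowed_extra.get(path, set())
        if missing or extra:
            drift[path] = (missing, extra)
    return drift


# Toy schemas: upstream gained "ScatterPlot"; ETL deliberately keeps "WorldMap".
pinned = {"properties": {"type": {"enum": ["LineChart", "ScatterPlot"]}}}
embedded = {"properties": {"type": {"enum": ["LineChart", "WorldMap"]}}}
drift = enum_drift(pinned, embedded, {"/properties/type": {"WorldMap"}})
```

Here drift names the path and the value still to be mirrored, which matches the idea that the failing test output doubles as the list of remaining manual work.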