Core Concepts

What is a Data Contract?

A data contract is a formal, versioned agreement between the team that produces data and the teams that consume it. It specifies exactly what schema, quality level, and semantics the data must conform to.


The Problem It Solves

In most organisations, software engineers own the systems that generate data. Data analysts and scientists own the pipelines that consume it. Without a formal agreement, engineers silently rename columns, change types, or drop tables, causing downstream breakages that can take days to diagnose.

Common Failure Modes

💥 Silent schema drift
A column is renamed; dashboards go blank overnight.
🕳️ Null explosions
A new code path omits a required field; ML models receive nulls.
📉 Type coercion bugs
An integer field becomes a string; aggregations silently return 0.
🔀 Undocumented semantics
"revenue" means net in one team and gross in another.
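
The first three failure modes are structural, so they can be caught mechanically by comparing incoming rows against an agreed schema. The following is a minimal sketch of that idea in plain Python, not ContractHQ's actual validator: the EXPECTED_SCHEMA mapping, the column names, and the find_violations helper are all hypothetical and exist only for illustration.

```python
from typing import Any

# Hypothetical agreed schema: column name -> (expected Python type, required?)
EXPECTED_SCHEMA: dict[str, tuple[type, bool]] = {
    "order_id": (int, True),
    "customer_id": (int, True),
    "revenue": (float, True),
    "coupon_code": (str, False),
}


def find_violations(rows: list[dict[str, Any]]) -> list[str]:
    """Return one human-readable message per structural violation."""
    violations: list[str] = []
    for index, row in enumerate(rows):
        # Schema drift: a renamed column shows up as an unexpected name here
        # and as a missing required column below.
        for column in sorted(row.keys() - EXPECTED_SCHEMA.keys()):
            violations.append(f"row {index}: unexpected column '{column}'")

        for column, (expected_type, required) in EXPECTED_SCHEMA.items():
            if column not in row:
                if required:
                    violations.append(
                        f"row {index}: missing required column '{column}'"
                    )
                continue

            value = row[column]
            if value is None:
                # Null explosion: a required field arrived empty.
                if required:
                    violations.append(
                        f"row {index}: null in required column '{column}'"
                    )
                continue

            # Type coercion bug: for example, an integer delivered as a string.
            if not isinstance(value, expected_type):
                violations.append(
                    f"row {index}: column '{column}' is "
                    f"{type(value).__name__}, expected {expected_type.__name__}"
                )
    return violations


# Illustrative batch: one clean row followed by one row per failure mode.
SAMPLE_ROWS = [
    {"order_id": 1, "customer_id": 7, "revenue": 19.99, "coupon_code": None},
    {"order_id": 2, "cust_id": 7, "revenue": 5.0},
    {"order_id": 3, "customer_id": None, "revenue": 5.0},
    {"order_id": "4", "customer_id": 9, "revenue": 5.0},
]

for message in find_violations(SAMPLE_ROWS):
    print(message)
```

The fourth failure mode, undocumented semantics, cannot be detected this way: a net and a gross revenue column have the same name and type, which is why a contract also carries human-readable definitions.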

The Producer-Consumer Model

A data contract makes the implicit agreement explicit. The producer commits to a schema and quality bar; consumers can rely on it.

producer-consumer model
  ┌───────────────┐         ┌─────────────────┐         ┌───────────────────┐
  │   Producer    │         │  Data Contract  │         │    Consumers      │
  │ (Eng / App)   │─signs──▶│  (YAML in Git)  │◀─relies─│ Analysts / ML /   │
  │               │         │                 │         │ Data Scientists   │
  └───────────────┘         └────────┬────────┘         └───────────────────┘
                                     │
                               ContractHQ
                            validates on every
                               PR & schedule

Anatomy of a Data Contract

A contract YAML file has four top-level sections:

metadata:

Identity fields: dataset name, owner, version, SLA, and tags.

schema:

Column-by-column field definitions with types and constraints.

quality:

Row-level checks: freshness, completeness, and custom SQL assertions.

semantics:

Human-readable definitions: what the data means, not just its shape.
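
To make the four sections concrete, here is a rough sketch of a contract expressed as a Python dict that mirrors the YAML layout, together with two simple quality checks. Only the four top-level section names come from this page. Every nested key (such as freshness_max_age_hours), every value, and the helper functions are assumptions made for illustration and should not be read as ContractHQ's real field names or API.

```python
from datetime import datetime, timedelta, timezone

# Illustrative contract. Only the four top-level keys are taken from the
# documentation; all nested keys and values are hypothetical.
CONTRACT = {
    "metadata": {
        "name": "orders",
        "owner": "checkout-team",
        "version": "1.2.0",
        "sla": "daily by 06:00 UTC",
        "tags": ["sales"],
    },
    "schema": {
        "order_id": {"type": "integer", "required": True},
        "revenue": {"type": "decimal", "required": True},
    },
    "quality": {
        "freshness_max_age_hours": 24,
        "completeness_min_ratio": 0.99,
    },
    "semantics": {
        "revenue": "Net revenue in EUR, after discounts and refunds.",
    },
}

REQUIRED_SECTIONS = ("metadata", "schema", "quality", "semantics")


def missing_sections(contract: dict) -> list[str]:
    """List the top-level sections a contract is missing, in canonical order."""
    return [name for name in REQUIRED_SECTIONS if name not in contract]


def is_fresh(last_loaded_at: datetime, now: datetime, max_age_hours: float) -> bool:
    """Freshness: the most recent load is no older than the allowed age."""
    return now - last_loaded_at <= timedelta(hours=max_age_hours)


def completeness_ratio(values: list) -> float:
    """Completeness: share of non-null values (an empty column counts as 1.0)."""
    if not values:
        return 1.0
    return sum(value is not None for value in values) / len(values)


# Example evaluation against the illustrative thresholds above.
now = datetime(2024, 1, 2, 6, 0, tzinfo=timezone.utc)
last_load = datetime(2024, 1, 1, 12, 0, tzinfo=timezone.utc)
quality = CONTRACT["quality"]

print(missing_sections(CONTRACT))
print(is_fresh(last_load, now, quality["freshness_max_age_hours"]))
print(completeness_ratio([10.0, None, 12.5, 8.0]) >= quality["completeness_min_ratio"])
```

Keeping thresholds in the quality section and definitions in the semantics section means the same file answers both "is this data usable today?" and "what does this column mean?".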

Data Contract vs. JSON Schema

| Aspect         | JSON Schema           | Data Contract                |
|----------------|-----------------------|------------------------------|
| Purpose        | Validate API payloads | Govern analytical datasets   |
| Where enforced | At request time       | At pipeline / PR time        |
| Versioning     | Manual / ad-hoc       | Semantic versioning built-in |
| Quality checks | Structural only       | Structural + statistical     |
| Ownership      | Not defined           | First-class owner field      |
ℹ️
ContractHQ extends the open-source Data Contract Specification (DCS) format, which means your contracts are portable across other tooling that supports DCS.
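
The "Semantic versioning built-in" row in the comparison above can be illustrated with a small sketch that derives a version bump from a schema change. The mapping used here is an assumed convention, not a documented ContractHQ rule: removing or retyping a column is treated as breaking (major), adding a column as backward-compatible (minor), and anything else as a patch. The classify_change and bump_version helpers are hypothetical.

```python
def classify_change(old: dict[str, str], new: dict[str, str]) -> str:
    """Classify a schema change, given column name -> type name mappings."""
    removed = old.keys() - new.keys()
    added = new.keys() - old.keys()
    retyped = {column for column in old.keys() & new.keys() if old[column] != new[column]}

    if removed or retyped:
        return "major"  # consumers reading the old columns would break
    if added:
        return "minor"  # purely additive, existing queries keep working
    return "patch"      # no structural change, e.g. a documentation fix


def bump_version(version: str, level: str) -> str:
    """Apply a major/minor/patch bump to a MAJOR.MINOR.PATCH version string."""
    major, minor, patch = (int(part) for part in version.split("."))
    if level == "major":
        return f"{major + 1}.0.0"
    if level == "minor":
        return f"{major}.{minor + 1}.0"
    if level == "patch":
        return f"{major}.{minor}.{patch + 1}"
    raise ValueError(f"unknown bump level: {level!r}")


# Example: renaming a column counts as a removal plus an addition, so it is breaking.
old_schema = {"order_id": "integer", "customer_id": "integer"}
new_schema = {"order_id": "integer", "cust_id": "integer"}

level = classify_change(old_schema, new_schema)
print(level, bump_version("1.4.2", level))
```

Under this convention a rename always forces a major version, which gives consumers an explicit signal before a dashboard or model input changes underneath them.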