What is a Data Contract?
A data contract is a formal, versioned agreement between the team that produces data and the teams that consume it. It specifies exactly what schema, quality level, and semantics the data must conform to.
The Problem It Solves
In most organisations, software engineers own the systems that generate data. Data analysts and scientists own the pipelines that consume it. Without a formal agreement, engineers silently rename columns, change types, or drop tables β causing downstream breakages that can take days to diagnose.
Common Failure Modes
The Producer-Consumer Model
A data contract makes the implicit agreement explicit. The producer commits to a schema and quality bar; consumers can rely on it.
βββββββββββββββββ βββββββββββββββββββ βββββββββββββββββββββ
β Producer β β Data Contract β β Consumers β
β (Eng / App) ββsignsβββΆβ (YAML in Git) βββreliesββ Analysts / ML / β
β β β β β Data Scientists β
βββββββββββββββββ ββββββββββ¬βββββββββ βββββββββββββββββββββ
β
ContractHQ
validates on every
PR & scheduleAnatomy of a Data Contract
A contract YAML file has four top-level sections:
metadata:Identity fields β dataset name, owner, version, SLA, and tags.
schema:Column-by-column field definitions with types and constraints.
quality:Row-level checks β freshness, completeness, custom SQL assertions.
semantics:Human-readable definitions β what the data means, not just its shape.
Data Contract vs. JSON Schema
| Aspect | JSON Schema | Data Contract |
|---|---|---|
| Purpose | Validate API payloads | Govern analytical datasets |
| Where enforced | At request time | At pipeline / PR time |
| Versioning | Manual / ad-hoc | Semantic versioning built-in |
| Quality checks | Structural only | Structural + statistical |
| Ownership | Not defined | First-class owner field |