Why this matters
Data contracts turn fragile, ad-hoc data flows into reliable, predictable interfaces. As a Data Engineer, you are accountable for resilience: preventing breaking schema changes, keeping data fresh, and making quality rules explicit. Contracts align producers and consumers so you get fewer surprises, less firefighting, and faster delivery.
- Real tasks this enables: agree on schemas and SLAs with source teams; catch breaking changes early; version and roll out schema updates safely; set and monitor data quality expectations.
- Outcome: stable pipelines, predictable delivery times, and trust in your datasets.
Who this is for
- Data Engineers who ingest from APIs, databases, or event streams.
- Data Product Owners who need clear, testable expectations.
Prerequisites
- Basic understanding of schemas (e.g., JSON, Avro, Parquet).
- Familiarity with batch or streaming pipelines.
- Basic observability concepts (freshness, completeness, SLAs).
Concept explained simply
A data contract is a written agreement between a data producer and data consumers that defines what data is delivered, how it looks, how reliable it is, and how changes happen. It is like an API contract but for data.
Mental model
Think of a contract as a promise with tests. The producer promises structure, meaning, and timeliness. The consumer promises how they will use the data and how they will adapt to versioned changes. Automated checks verify the promises continuously.
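For example, the freshness promise can be verified by a small automated check that runs on every load or message batch. The sketch below is illustrative Python (the 5-minute target mirrors the template further down; the function name is hypothetical):

from datetime import datetime, timedelta, timezone

FRESHNESS_TARGET = timedelta(minutes=5)  # the producer's promise from the contract

def is_fresh(latest_event_ts: datetime) -> bool:
    """Return True if the newest observed event is within the promised freshness window."""
    lag = datetime.now(timezone.utc) - latest_event_ts
    return lag <= FRESHNESS_TARGET

# A record produced 3 minutes ago keeps the promise; one from an hour ago breaks it.
print(is_fresh(datetime.now(timezone.utc) - timedelta(minutes=3)))  # True
print(is_fresh(datetime.now(timezone.utc) - timedelta(hours=1)))    # False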
Core components of a data contract
- Purpose and owner: why the dataset exists; who is accountable (producer) and who to contact.
- Schema: fields, types, nullability, constraints, primary keys or natural keys.
- Semantics: definitions and units (e.g., revenue currency, timezone, event semantics).
- Quality rules: freshness, completeness, uniqueness, referential integrity, valid values.
- SLAs/SLOs: delivery time goals (and windows), allowable failure rates, alerting policy.
- Change management: versioning strategy, deprecation period, communication channel, migration timeline.
- Access and privacy: sensitivity classification, PII handling, masking/anonymization rules.
- Lineage and dependencies: upstream systems and downstream critical consumers.
A minimal template (YAML):
name: orders_events
purpose: Order lifecycle events for analytics and fulfillment
owner: team-commerce-platform
contact: data-owners@company
schema:
  format: json
  version: 1.0.0
  primary_key: event_id
  fields:
    - name: event_id
      type: string
      required: true
    - name: order_id
      type: string
      required: true
    - name: event_type
      type: string
      allowed: [CREATED, PAID, SHIPPED, CANCELLED]
      required: true
    - name: event_ts
      type: timestamp
      timezone: UTC
      required: true
quality:
  freshness:
    target: <= 5 minutes from event_ts to topic
    alert_after: 10 minutes
  completeness:
    expected_rate: ">= 99.5% of app events"
  uniqueness:
    keys: [event_id]
slas:
  delivery: 99.9% of events available within 5 minutes
  oncall: "#data-oncall"
changes:
  versioning: semver
  deprecate_notice: 30 days
  breaking_requires: new major version and dual-publish window
access:
  sensitivity: internal
  pii: none
lineage:
  upstream: [commerce-service]
  downstream: [analytics-warehouse, fulfillment-dashboard]
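The schema block above maps almost one-to-one onto validation code. Below is a rough Python sketch that checks a single message against the required fields and allowed event_type values of this contract; the field names come from the template, while the function name and error handling are illustrative:

REQUIRED_FIELDS = {"event_id", "order_id", "event_type", "event_ts"}
ALLOWED_EVENT_TYPES = {"CREATED", "PAID", "SHIPPED", "CANCELLED"}

def validate_order_event(msg: dict) -> list:
    """Return the contract violations for one message (an empty list means it passes)."""
    violations = []
    missing = REQUIRED_FIELDS - msg.keys()
    if missing:
        violations.append(f"missing required fields: {sorted(missing)}")
    if msg.get("event_type") not in ALLOWED_EVENT_TYPES:
        violations.append(f"event_type not allowed: {msg.get('event_type')!r}")
    return violations

# Example usage:
print(validate_order_event({"event_id": "e1", "order_id": "o1",
                            "event_type": "PAID", "event_ts": "2024-06-01T12:00:00Z"}))  # []
print(validate_order_event({"order_id": "o1", "event_type": "REFUNDED"}))  # two violations

The freshness, completeness, and uniqueness rules from the quality block would run alongside this, typically over a window of messages rather than one at a time.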
Worked examples
Example 1 — Streaming topic (clickstream)
Scenario: Product team produces a clickstream Kafka topic.
- Schema: event_id (string), user_id (string), session_id (string), event (enum), url (string), ts (timestamp UTC).
- Keys: event_id unique; optionally partition by user_id.
- Quality: freshness <= 60s, uniqueness on event_id, allowed event values, url non-empty (see the check sketch below).
- SLAs: 99% of messages delivered end-to-end within 60s.
- Change: adding device_type is non-breaking if optional; dropping the event field is breaking → new major version.
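The quality rules in this example could be automated roughly as follows (a Python sketch: the allowed event names are hypothetical placeholders for the real enum, and a production system would keep deduplication state in a keyed store rather than in memory):

from datetime import datetime, timedelta, timezone

ALLOWED_EVENTS = {"page_view", "click", "scroll"}  # hypothetical enum values
MAX_LAG = timedelta(seconds=60)                    # freshness target from this example
seen_event_ids = set()                             # in-memory stand-in for a dedup store

def check_click_event(evt: dict) -> list:
    """Validate one clickstream event; evt["ts"] is assumed to be a timezone-aware datetime."""
    problems = []
    if evt.get("event") not in ALLOWED_EVENTS:
        problems.append("event value not in allowed enum")
    if not evt.get("url"):
        problems.append("url is empty")
    if evt["event_id"] in seen_event_ids:
        problems.append("duplicate event_id")
    seen_event_ids.add(evt["event_id"])
    if datetime.now(timezone.utc) - evt["ts"] > MAX_LAG:
        problems.append("stale event: lag exceeds the 60s target")
    return problems

# Example: a well-formed event produced just now passes all checks.
print(check_click_event({"event_id": "e1", "user_id": "u1", "session_id": "s1",
                         "event": "click", "url": "/home", "ts": datetime.now(timezone.utc)}))  # []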
Example 2 — Batch table (daily customer dimension)
Scenario: CRM exports a daily customers table to the lake.
- Schema: customer_id (PK); email nullable; country as an ISO 3166 country code.
- Quality: daily delivery by 06:00 UTC; completeness ≥ 99%; no duplicate customer_id (see the check sketch below).
- SLAs: if delivery is delayed beyond 06:15 UTC, page on-call and post a status note.
- Change: renaming email → primary_email requires a deprecation window and a backfilled alias column.
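These batch rules translate into a post-load check. Below is a sketch using pandas, assuming the export lands as Parquet; the path and the expected row count are placeholders you would replace with values agreed in the contract:

import pandas as pd

df = pd.read_parquet("lake/customers/dt=2024-06-01/")  # placeholder path for the daily export

violations = []

duplicate_ids = int(df["customer_id"].duplicated().sum())
if duplicate_ids:
    violations.append(f"{duplicate_ids} duplicate customer_id values")

expected_rows = 1_000_000  # hypothetical baseline agreed with the CRM team
completeness = len(df) / expected_rows
if completeness < 0.99:
    violations.append(f"completeness {completeness:.1%} is below the 99% target")

if violations:
    # Delivery-time monitoring (06:00 UTC) usually lives in the scheduler; rule violations fail the load here.
    raise ValueError("customers contract violated: " + "; ".join(violations))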
Example 3 — Third‑party API ingestion
Scenario: Marketing API provides campaign spends.
- Schema stability: the vendor may add fields at any time; ignore unknown fields, but strictly check the required keys campaign_id, date, and spend.
- Quality: reconcile total spend to within ±1% of the vendor dashboard daily (see the sketch below).
- Change: if the vendor announces removal of a field, add dual-source validation and pin the API version in the contract.
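A rough sketch of this "ignore the unknown, be strict about the known" approach is below (the required keys and the ±1% tolerance come from the example; the function names and reconciliation inputs are illustrative):

REQUIRED_KEYS = {"campaign_id", "date", "spend"}

def normalize_row(row: dict) -> dict:
    """Keep only the contracted keys (unknown vendor fields are dropped); fail if required keys are missing."""
    missing = REQUIRED_KEYS - row.keys()
    if missing:
        raise ValueError(f"vendor row missing required keys: {sorted(missing)}")
    return {key: row[key] for key in REQUIRED_KEYS}

def spend_reconciles(ingested_total: float, vendor_total: float, tolerance: float = 0.01) -> bool:
    """Return True if ingested spend is within ±1% of the vendor dashboard total."""
    if vendor_total == 0:
        return ingested_total == 0
    return abs(ingested_total - vendor_total) / vendor_total <= tolerance

# Example: an extra vendor field is dropped; a 0.5% spend difference is within tolerance.
print(normalize_row({"campaign_id": "c1", "date": "2024-06-01", "spend": 12.5, "new_field": "x"}))
print(spend_reconciles(ingested_total=10_050.0, vendor_total=10_000.0))  # True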
Step-by-step: Create your first data contract
- Identify the data product: define purpose, owner, and target consumers.
- Fix the schema: list fields with types and nullability; add keys and constraints.
- Define semantics: units, timezones, enumerations, and definitions.
- Set reliability targets: freshness windows, delivery SLAs, alerting thresholds.
- Add quality rules: uniqueness, completeness, valid ranges, and referential integrity.
- Plan change management: versioning policy, deprecation timeline, communication channel.
- Document access and privacy: sensitivity class and PII handling rules.
- Automate checks: turn key rules into validation tests in your pipelines (see the sketch below).
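Automating checks usually means wiring rules like the ones above into a pipeline task that fails loudly when the contract is broken. A generic, orchestrator-agnostic sketch (each check is assumed to be a zero-argument callable returning a list of violation messages):

def run_contract_checks(dataset_name: str, checks: list) -> None:
    """Run all contract checks for a dataset and fail the pipeline run if any rule is violated."""
    violations = []
    for check in checks:
        violations.extend(check())
    if violations:
        # In practice this failure would also trigger the alerting policy defined in the contract.
        raise RuntimeError(f"{dataset_name}: {len(violations)} contract violation(s): {violations}")

# Example wiring inside a pipeline step (the check functions are hypothetical):
# run_contract_checks("orders_events", [check_freshness, check_uniqueness, check_allowed_values])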
Checklist: Is your contract ready?
- Owner and contact are explicit.
- Schema includes types, nullability, and keys.
- Semantics specify units/timezone/enums.
- Freshness and delivery SLAs are measurable.
- Quality checks cover uniqueness and completeness.
- Versioning and deprecation windows are defined.
- Access classification is set.
- Automated validations exist or are planned.
Exercises
Do these hands-on tasks. You can compare with the solutions below each exercise.
Exercise 1 — Draft a minimal contract for an Orders topic
Create a one-page contract for a streaming topic named orders_events that captures order lifecycle events.
- Include: purpose, owner, schema with types, keys, allowed event values, freshness target, uniqueness, and a basic change policy.
- Keep it concise but testable.
Solution
name: orders_events
purpose: Order lifecycle events for analytics, fraud, and fulfillment
owner: team-commerce-platform
schema:
  format: json
  version: 1.0.0
  primary_key: event_id
  fields:
    - {name: event_id, type: string, required: true}
    - {name: order_id, type: string, required: true}
    - {name: event_type, type: string, allowed: [CREATED, PAID, SHIPPED, CANCELLED], required: true}
    - {name: event_ts, type: timestamp, timezone: UTC, required: true}
quality:
  freshness: {target: <= 5m, alert_after: 10m}
  uniqueness: {keys: [event_id]}
changes:
  versioning: semver
  deprecate_notice: 30 days
Exercise 2 — Plan a safe schema change
You need to add device_type (string, optional) and remove event_url from orders_events. Propose a plan that avoids breaking consumers.
Solution
- Bump minor version to 1.1.0 adding device_type as optional; communicate change.
- Mark event_url as deprecated in 1.1.0; announce removal after 30 days.
- Dual-publish or keep field present but unused during deprecation window.
- After the window, release 2.0.0 without event_url; keep 1.x available for a transition period.
- Provide mapping notes and tests validating both 1.x and 2.x during migration (see the sketch below).
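During the dual-publish window, consumers can validate each message against the contract version it declares. A rough sketch (the version header and the required-field sets are illustrative; in 2.0.0 event_url is gone and device_type remains optional):

# Required fields per contract version during the migration (illustrative).
SCHEMAS = {
    "1.1.0": {"event_id", "order_id", "event_type", "event_ts", "event_url"},  # event_url deprecated but still present
    "2.0.0": {"event_id", "order_id", "event_type", "event_ts"},               # event_url removed
}

def validate_versioned(msg: dict) -> list:
    """Validate a message against the schema of the contract version it declares."""
    version = msg.get("contract_version", "1.1.0")  # hypothetical version header
    required = SCHEMAS.get(version)
    if required is None:
        return [f"unknown contract version: {version}"]
    missing = required - msg.keys()
    return [f"v{version}: missing fields {sorted(missing)}"] if missing else []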
Self-check after exercises
- Are all required fields and constraints explicit?
- Can you measure freshness and uniqueness automatically?
- Would a consumer know exactly what to do during your deprecation window?
Common mistakes and how to self-check
- Vague SLAs (e.g., "as soon as possible"). Self-check: Is there a numeric target and alert condition?
- Implicit semantics (e.g., local time without timezone). Self-check: Is timezone/units written?
- Unversioned changes (rename fields silently). Self-check: Does every change map to a version bump?
- No deprecation period. Self-check: Is there a minimum notice window?
- Missing keys. Self-check: What guarantees uniqueness and idempotency?
- Overly strict rules on third-party data. Self-check: Are tolerances realistic and monitored?
Practical projects
- Contract retrofit: Pick one pipeline, write a contract, and add two automated checks (freshness and uniqueness).
- Versioned rollout: Simulate adding a new field via minor version and deprecating an old one via major version.
- Quality dashboard: Create a small dashboard or report showing SLA compliance and contract violations for one dataset.
Learning path
- Start: Data Contracts Basics (this page).
- Next: Data quality checks and monitoring (freshness, completeness, alerts).
- Then: Schema evolution strategies and backward/forward compatibility.
- Advanced: Data product SLAs and error budgets across batch and streaming.
Next steps
- Turn one of your existing pipelines into a contracted data product.
- Automate at least two rules from your contract.
- Take the quick test to validate your understanding.
Mini challenge
In 10 minutes, write a contract summary (max 12 lines) for a dataset you own. Include: purpose, 5 fields with types, one key, one freshness target, one uniqueness rule, and a change policy in one sentence. Share it with a teammate and ask: "What did you have to guess?" Refine until nothing is ambiguous.