Why this matters
As an Analytics Engineer, you turn raw data into reliable models others depend on. Schema tests and contract thinking prevent silent breaks when upstream teams change fields, types, or meanings. Real tasks you'll do:
- Define what a model guarantees: columns, types, keys, allowed values, and freshness.
- Write schema tests to catch broken primary keys, nulls where they shouldn't be, or unexpected category values.
- Negotiate safe, additive changes with data producers and communicate breaking changes clearly.
Concept explained simply
A schema is the shape of your dataset: column names, data types, and constraints. Schema tests verify those assumptions are true. Contract thinking means you treat your dataset like a product with a clear interface. Consumers rely on that interface; changes must be deliberate and safe.
Mental model
Think of your model as an API:
- Inputs: upstream tables and their contracts
- Interface: your columns, types, constraints, and SLAs
- Guarantees: tests that prove the interface still holds
- Change policy: what is additive (safe) vs breaking (requires coordination)
Common schema tests (open)
- Primary key integrity: unique and not null
- Type checks: column types are as expected (enforced or validated)
- Accepted values: category/enum columns only include known values
- Relationships: foreign keys match referenced dimension keys
- Freshness/latency: data updated within agreed time window
- Row-level expectations: e.g., totals >= 0, dates not in the future
Contract changes: additive vs breaking
- Additive (usually safe): adding a new column, adding a new enum value (if documented), adding an index, backfilling more history
- Breaking (requires coordination/versioning): renaming/removing columns, changing data types incompatibly, altering semantics (e.g., order_status meaning), changing primary keys
Worked examples
Example 1: Core orders model
Contract highlights:
- Primary key: order_id (unique, not null)
- Foreign key: customer_id references dim_customers.customer_id
- Types: order_total numeric, order_date date, order_status text
- Accepted values: order_status in {placed, paid, shipped, cancelled, returned}
- Freshness: max 2 hours delay for new orders (SLO)
Tests you would implement:
- not_null: order_id, customer_id, order_date
- unique: order_id
- relationship: customer_id -> dim_customers.customer_id
- accepted_values: order_status
- custom check: order_total >= 0
Example 2: Event ingestion with evolving enums
Upstream adds a new event_type "trial_started". Additive change, but consumers need to know. Update accepted_values test to include the new value and add a note in the contract under "change log". If consumers depend on exhaustive lists, they can pin a version or filter explicitly.
Example 3: Incompatible type change proposal
Producer wants to change customer_id from integer to string. This is breaking. Contract response:
- Introduce a new column customer_id_v2 (string) in parallel
- Backfill customer_id_v2
- Deprecate old column with a removal date
- Update downstream models to use customer_id_v2, then remove the old column after the deprecation window
Who this is for
- Analytics Engineers and BI Developers responsible for reliable models
- Data Engineers who want clear downstream guarantees
- Analysts who maintain semantic layers and dashboards
Prerequisites
- Basic SQL (SELECT, JOIN, GROUP BY)
- Familiarity with dimensional modeling (facts, dimensions)
- Understanding of data types and nullability
Learning path
- Review schema tests and contract basics (this lesson)
- Implement tests on a core model
- Define a dataset contract and share it with consumers
- Practice change management: additive vs breaking, versioning, deprecation
How to apply this at work
- Pick one high-impact model (e.g., orders) and write a one-page contract: purpose, owner, columns, tests, SLOs, change policy.
- Implement schema tests for PK, FKs, enums, and basic numeric guards.
- Automate: run tests in CI before deploy; block merges on failures that would break the contract.
Exercises
These mirror the graded exercises below. Do them here, then submit in the panel.
Exercise 1: Draft a dataset contract (orders)
Create a concise contract for a model named orders_core. Include: ownership, primary/foreign keys, column types with nullability, accepted values for order_status, freshness SLO, and change policy (additive vs breaking).
Checklist to self-verify
- Primary key defined and testable
- All business-critical columns have types and nullability
- Enums listed and justified
- Freshness has a numeric threshold and timezone
- Clear additive/breaking examples
Exercise 2: Write schema tests from the contract
Translate the contract into test definitions (e.g., tool-agnostic or dbt-style). Include unique/not_null, accepted_values, relationships, and one custom numeric check.
Tip
Start with the primary key, then enums, then relationships. Add freshness if your tool supports it; otherwise, sketch an approach.
Common mistakes and self-check
- Missing primary key tests: Self-check—can I prove uniqueness and not null for the PK?
- Enums not updated: Self-check—does my accepted_values list match real data in the last week?
- Assuming type safety: Self-check—do I validate types or rely only on warehouse coercion?
- No change policy: Self-check—if a column were removed tomorrow, would consumers know what happens?
- Freshness ignored: Self-check—do I monitor latency and define what "late" means?
Practical projects
- Project A: Write a contract and tests for a dimension (dim_customers). Add a small change log section.
- Project B: Simulate a breaking change (rename a column in a dev copy). Practice deprecation: add new column, backfill, update downstream, remove old.
- Project C: Add anomaly checks alongside schema tests (e.g., daily row_count drop > 25% triggers warn) and document that this is non-contract but monitored.
What good looks like (quality checklist)
- Contract fits on one page, with owner and version
- Primary key and foreign keys have tests
- Enums documented with rationale
- Freshness SLO stated (e.g., within 2 hours, UTC)
- Clear change policy with examples and deprecation window
Next steps
- Apply the same approach to your top 3 models
- Add a short "breaking change template" your team can reuse
- Set CI to fail on contract-breaking test failures; allow warnings for non-breaking anomalies
Mini challenge
Your upstream proposes removing order_status because a new status_code will replace it. Outline a safe plan:
- Add status_code while keeping order_status for 30 days
- Backfill status_code and map values from order_status
- Update accepted_values tests for status_code; mark order_status as deprecated
- Communicate removal date; after 30 days, remove order_status and update tests
Ready for the Quick Test?
Note: The quick test is available to everyone; only logged-in users will have their progress saved.