Why this matters
As a Data Platform Engineer, you ship data models, pipelines, and tables that other teams depend on. CI/CD for data assets prevents bad schemas, broken jobs, and low-quality data from reaching production. It shortens feedback loops, enables safe changes, and keeps data trustworthy.
Real tasks you will face:
- Run tests on pull requests for dbt models, SQL, and Spark jobs.
- Catch breaking schema changes before they land.
- Promote artifacts from dev to staging to prod with approvals.
- Trigger backfills safely and idempotently after deploy.
Concept explained simply
CI means automatic checks on every change; CD means safe, repeatable releases once those checks pass. For data, the key difference from application CI/CD is that your "runtime" is the data itself, so your pipeline must guard both the code and the datasets it touches.
Mental model
Think of each data asset (table, view, feature, job) as a versioned product. Your CI proves it can build, test, and read/write safely. Your CD promotes it gradually—first to staging with sample/backfilled data, then to production with guardrails and rollback options.
Key guardrails to adopt
- Fail-fast: lint SQL/YAML, compile models, and run unit tests on PRs.
- Data quality contracts: tests for schema, nulls, ranges, uniqueness, and freshness.
- Two-phase schema changes: additive first, remove later.
- Versioned artifacts: container images, dbt manifests, DAG files, and migration scripts.
- Idempotent jobs: safe to rerun without duplicating data (a minimal sketch follows this list).
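To make the idempotency guardrail concrete, here is a minimal sketch of a rerun-safe partition load using DuckDB (the same lightweight warehouse suggested for CI below); the table names, columns, and file path are illustrative assumptions, not a prescribed layout.

```python
# Minimal sketch: idempotent partition load (delete-then-insert in one transaction).
# Assumptions: a local DuckDB file, a raw.orders source, and a daily-partitioned
# analytics.orders_daily target -- all hypothetical names for illustration.
import duckdb

def load_partition(con: duckdb.DuckDBPyConnection, run_date: str) -> None:
    """Rebuild exactly one partition so reruns never duplicate rows."""
    con.execute("BEGIN TRANSACTION")
    # Drop whatever a previous (possibly failed) run wrote for this date.
    con.execute("DELETE FROM analytics.orders_daily WHERE order_date = ?", [run_date])
    # Re-derive the partition from the source table.
    con.execute(
        """
        INSERT INTO analytics.orders_daily
        SELECT order_date, customer_id, SUM(amount) AS revenue
        FROM raw.orders
        WHERE order_date = ?
        GROUP BY order_date, customer_id
        """,
        [run_date],
    )
    con.execute("COMMIT")

if __name__ == "__main__":
    con = duckdb.connect("warehouse.duckdb")  # assumed local dev warehouse
    load_partition(con, "2024-01-01")         # rerunning this call is safe by design
```

Because the delete and insert are scoped to one partition and wrapped in a transaction, a retried run converges to the same result instead of appending duplicates.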
CI/CD blueprint for data assets
- On pull request (CI):
- Static checks: format and lint (e.g., SQL style), validate YAML/JSON.
- Compile/build: dry-run dbt, parse DAGs, build containers.
- Unit tests: local DuckDB or small test warehouse with seed data.
- Data tests: run model tests on a small sample (row-limit) or synthetic data.
- Contract checks: schemas stable? no breaking column changes?
- Security: secret scanning and permission checks (no prod creds in CI).
- Artifacts: publish build outputs (image, manifest, DAG) to a registry.
- On merge to main (CD - staging):
- Deploy artifacts to staging environment.
- Migrate schemas (additive), seed sample data, run dbt tests.
- Run a limited backfill or replay to validate end-to-end.
- Create a release candidate and wait for approval.
- Promotion to production (CD - prod):
- Approval gate -> deploy within a change window.
- Run canary: subset of partitions/dates first.
- Monitor SLOs (freshness, success rates, test failures).
- Complete rollout; enable full backfill if needed.
- Rollback plan: previous artifact plus a data snapshot or table alias flip (see the sketch below).
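The rollback step can be as simple as repointing a stable name at the previous table version. Below is a minimal sketch of that "table alias flip", again using DuckDB; the view and versioned table names are hypothetical.

```python
# Minimal sketch: rollback by flipping a stable view back to the prior table version.
# Assumptions: releases write versioned tables (analytics.orders_v12, _v13, ...)
# and consumers only query the stable view analytics.orders -- hypothetical names.
import duckdb

def flip_alias(con: duckdb.DuckDBPyConnection, version: str) -> None:
    """Point the stable view at the requested table version."""
    con.execute(
        f"CREATE OR REPLACE VIEW analytics.orders AS "
        f"SELECT * FROM analytics.orders_{version}"
    )

if __name__ == "__main__":
    con = duckdb.connect("warehouse.duckdb")
    flip_alias(con, "v13")  # normal promotion: expose the newly released version
    flip_alias(con, "v12")  # rollback: consumers instantly see the previous data
```

Paired with redeploying the previous artifact, this gives a rollback that touches only metadata, not data.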
Worked examples
Example 1 — dbt model PR checks
- CI spins up an ephemeral warehouse or DuckDB.
- Run dependency install, dbt compile, and dbt run --select on the changed models with row limits (scripted in the sketch after this example).
- Execute dbt tests: unique, not_null, accepted_values.
- Validate contracts: no removed columns without a deprecation window.
- Publish manifest.json as a build artifact.
- PR status: red if tests fail, green if all pass.
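Below is a minimal sketch of the CI driver these steps describe, assuming sqlfluff and dbt are installed in the CI image, a dev/DuckDB profile is configured, and the previous production manifest has been downloaded for state comparison; it uses dbt build to combine the run and test steps, and all paths and selectors are illustrative.

```python
# Minimal sketch: run the PR checks in order and fail fast on the first red step.
# Assumptions: sqlfluff + dbt installed, a DuckDB/dev profile configured, and the
# prior production manifest placed in prod-artifacts/ for state comparison.
import subprocess
import sys

CHECKS = [
    ["sqlfluff", "lint", "models/"],                 # SQL style and formatting
    ["dbt", "deps"],                                 # install package dependencies
    ["dbt", "compile"],                              # prove the project still builds
    ["dbt", "build", "--select", "state:modified+",  # run + test only changed models
     "--state", "prod-artifacts/"],
]

def main() -> int:
    for cmd in CHECKS:
        print("Running:", " ".join(cmd))
        result = subprocess.run(cmd)
        if result.returncode != 0:
            return result.returncode  # fail fast: the PR goes red here
    return 0

if __name__ == "__main__":
    sys.exit(main())
```

In a real workflow, the same job would also upload target/manifest.json as the build artifact mentioned above.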
Example 2 — Streaming Spark job release
- CI builds the container image and runs unit tests on a micro-batch input (see the sketch after this example).
- Integration test: run the job against a synthetic Kafka topic in CI for a few messages.
- CD to staging: deploy the job with a small checkpoint and process 10 minutes of data.
- Quality gates: check the late-event rate, null ratio, and schema compatibility.
- Promote to production with a canary consumer group on 5% of traffic, then 100%.
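Here is a minimal sketch of the micro-batch unit test from this example, exercising the job's transform on a tiny in-memory DataFrame instead of Kafka; the SparkSession setup, column names, and transform logic are illustrative assumptions.

```python
# Minimal sketch: unit-test a streaming job's transform on a tiny in-memory
# micro-batch. Column names and the transform itself are hypothetical.
from pyspark.sql import DataFrame, SparkSession
from pyspark.sql import functions as F

def enrich_events(events: DataFrame) -> DataFrame:
    """Pure transform under test: cast the amount and flag late events."""
    return (
        events
        .withColumn("amount", F.col("amount").cast("double"))
        .withColumn(
            "is_late",
            F.col("event_ts").cast("timestamp") < F.col("watermark_ts").cast("timestamp"),
        )
    )

def test_enrich_events() -> None:
    spark = SparkSession.builder.master("local[1]").appName("ci-unit").getOrCreate()
    batch = spark.createDataFrame(
        [("42.5", "2024-01-01 00:00:00", "2024-01-01 00:05:00")],
        ["amount", "event_ts", "watermark_ts"],
    )
    row = enrich_events(batch).collect()[0]
    assert row["amount"] == 42.5   # schema/type gate
    assert row["is_late"] is True  # quality gate: late-event handling
    spark.stop()

if __name__ == "__main__":
    test_enrich_events()
    print("micro-batch unit test passed")
```

Keeping the transform a pure function of a DataFrame is what makes it testable in CI without a Kafka cluster.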
Example 3 — Safe schema change (breaking)
- Phase 1 (additive): add the new column, backfill it, and keep the old column.
- Publish dual-write transforms that populate both columns.
- Update downstream models to read the new column; keep backward compatibility.
- Phase 2 (removal): deprecate the old column, wait a release cycle, then remove it.
- CD enforces a warning on the PR and blocks removal without signed approval (a minimal guard is sketched below).
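Here is a minimal sketch of the removal guard CD could run, comparing schema snapshots and blocking any column that disappears without an approved deprecation entry; the snapshot files and their format are illustrative assumptions about how a team might record them.

```python
# Minimal sketch: block column removals that lack an approved deprecation.
# Assumptions: CI exports schema snapshots as {"table": ["col", ...]} JSON and
# approvals live in deprecations_approved.json as ["table.col", ...] -- both
# hypothetical conventions for illustration.
import json
import sys

def removed_columns(old: dict, new: dict) -> dict:
    """Columns present in the old snapshot but missing from the candidate one."""
    diff = {}
    for table, cols in old.items():
        missing = [c for c in cols if c not in new.get(table, [])]
        if missing:
            diff[table] = missing
    return diff

def main() -> int:
    old = json.load(open("schema_snapshots/prod.json"))
    new = json.load(open("schema_snapshots/candidate.json"))
    approved = set(json.load(open("deprecations_approved.json")))

    blocked = {
        table: [c for c in cols if f"{table}.{c}" not in approved]
        for table, cols in removed_columns(old, new).items()
    }
    blocked = {table: cols for table, cols in blocked.items() if cols}

    if blocked:
        print(f"Blocking release: columns removed without approval: {blocked}")
        return 1
    print("Contract check passed: no unapproved column removals.")
    return 0

if __name__ == "__main__":
    sys.exit(main())
```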
Design choices that matter
- Environments: at least dev, staging, prod with separate credentials.
- Data for CI: synthetic or sampled/obfuscated data to respect privacy.
- Partition-aware tests: validate fresh partitions, not the entire history (see the sketch after this list).
- Backfills: triggered jobs with guards (date ranges, dry-run preview).
- Observability: alert on test failures, freshness lag, and row-count deltas.
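To illustrate the partition-aware and observability points together, here is a minimal sketch that checks only the newest partition's freshness and compares its row count against the previous one; the table, column, and thresholds are illustrative assumptions.

```python
# Minimal sketch: partition-aware health check on the latest partition only.
# Assumptions: a DuckDB warehouse with analytics.orders_daily partitioned by a
# DATE column order_date; names and thresholds are hypothetical.
from datetime import date
import duckdb

def check_latest_partition(con, max_lag_days: int = 1, max_delta_pct: float = 50.0) -> list:
    issues = []
    latest, latest_rows, prev_rows = con.execute(
        """
        WITH counts AS (
            SELECT order_date, COUNT(*) AS n
            FROM analytics.orders_daily
            GROUP BY order_date
        )
        SELECT MAX(order_date),
               (SELECT n FROM counts ORDER BY order_date DESC LIMIT 1),
               (SELECT n FROM counts ORDER BY order_date DESC LIMIT 1 OFFSET 1)
        FROM counts
        """
    ).fetchone()

    # Freshness: alert if the newest partition lags behind today.
    if latest is None or (date.today() - latest).days > max_lag_days:
        issues.append(f"freshness lag: latest partition is {latest}")
    # Row-count delta: alert on a large swing versus the previous partition.
    if prev_rows and abs(latest_rows - prev_rows) / prev_rows * 100 > max_delta_pct:
        issues.append(f"row-count delta: {prev_rows} -> {latest_rows}")
    return issues

if __name__ == "__main__":
    con = duckdb.connect("warehouse.duckdb")
    problems = check_latest_partition(con)
    print(problems or "latest partition looks healthy")
```

Scoping checks to the newest partition keeps them fast enough to run after every load, which is what makes alerting on freshness and row-count deltas practical.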
Exercises
Do these after reading the examples. A short checklist is included to verify your work.
Exercise 1 — Draft a PR CI pipeline for a dbt repo
Write the sequence of CI steps that run on every pull request touching dbt models. Include linting, compilation, tests, and artifact handling.
Hints
- Use a small ephemeral database or DuckDB for speed.
- Focus on changed models only; enforce contracts.
- Publish manifest as an artifact.
Exercise 2 — Plan a safe breaking schema change
Describe a two-phase rollout for removing a column used by downstream dashboards, including deprecation and rollback steps.
Hints
- Add before remove; dual-write; deprecate; remove later.
- Use staging validation and canary in prod.
- Keep a fast rollback: table alias flip or previous artifact.
Self-check checklist
- Your CI plan has: lint, compile, unit tests, data tests, contract checks, artifact publish.
- Your schema change plan has: additive phase, deprecation notice, downstream migration, removal phase, and rollback.
Common mistakes and how to self-check
- Running tests on full datasets in CI. Self-check: Are tests limited to samples/changed models?
- Dropping columns in one release. Self-check: Is there a deprecation window with dual-read/write?
- No artifact versioning. Self-check: Can you redeploy a prior image/manifest in one click?
- Secrets in CI logs. Self-check: Are credentials masked and rotated?
- Non-idempotent backfills. Self-check: Can you rerun without duplicates?
Practical projects
- Create a PR CI workflow for a small dbt repo using DuckDB and run model tests on changed files.
- Package a batch job into a container, run unit tests, and publish the image to a registry.
- Implement a two-phase schema migration with a staging validation job and a production canary.
Who this is for
- Data Platform Engineers building shared data infrastructure and developer workflows.
Prerequisites
- Basic Git and pull request workflow.
- Familiarity with SQL and a modeling tool (e.g., dbt) or a scheduler (e.g., Airflow).
- Comfort with containers or virtual environments.
Learning path
- Start: CI fundamentals (linting, unit tests, artifacts).
- Next: Data quality tests and contracts.
- Then: CD with staging, approvals, and canaries.
- Finally: Advanced topics—backfills, rollbacks, and observability.
Next steps
- Automate your PR checks for one repo this week.
- Define your environment promotion policy and rollback plan.
- Add at least three data quality tests to a critical table.
Mini challenge
Pick one production table with frequent changes. Propose a CD plan that includes: staging validation with a 1-day backfill, a 10% prod canary for one partition, and automatic rollback on test failure. Keep it under 10 bullet points.
Quick Test
Take the short test below to check your understanding.