CI/CD for Data Pipelines

Learn CI/CD for data pipelines with explanations, exercises, and a quick test for Data Engineers.

Published: January 8, 2026 | Updated: January 8, 2026

Quick outline
  • Why this matters
  • Concept explained simply
  • Mental model
  • Worked examples
  • Checklist before you ship
  • Exercises
  • Common mistakes
  • Practical projects
  • Who this is for
  • Prerequisites
  • Learning path
  • Next steps
  • Mini challenge

Why this matters

As a Data Engineer, you ship changes that can affect dashboards, machine learning features, and critical reports. CI/CD makes your data pipelines reliable by:

  • Blocking schema-breaking changes before they reach production.
  • Automatically testing transformations and DAGs on every pull request.
  • Promoting artifacts safely across dev → stage → prod with data quality gates.
  • Providing fast rollbacks if a deployment impacts SLAs.

Concept explained simply

CI (Continuous Integration) automatically checks your code every time you push: it installs dependencies, lints, runs unit tests, validates DAGs/SQL, and compiles your project. CD (Continuous Delivery/Deployment) packages your pipeline, deploys it to environments, runs smoke tests with real infrastructure, and promotes only if checks pass.

Mental model

Imagine a factory assembly line with gates:

  • Gate 1 (CI): Is the part built correctly? (lint, unit tests, compile, DAG check)
  • Gate 2 (CD - Dev): Does it run on real machines? (container build, deploy to dev)
  • Gate 3 (CD - Stage): Does data look healthy on sample/limited scope? (smoke and data quality checks)
  • Gate 4 (CD - Prod): Limited canary, observe metrics, then full rollout. Rollback ready.

Worked examples

Example 1 — Minimal CI for a dbt + Python pipeline

This CI runs on every pull request:

name: ci
on: [pull_request]
jobs:
  build-test:
    runs-on: ubuntu-latest
    steps:
      - name: Checkout
        uses: actions/checkout@v4
      - name: Set up Python
        uses: actions/setup-python@v5
        with:
          python-version: '3.10'
      - name: Install dependencies
        run: |
          python -m pip install --upgrade pip
          pip install -r requirements.txt
      - name: Lint
        run: ruff check .
      - name: Unit tests
        run: pytest -q tests/unit
      - name: Validate dbt project
        run: |
          dbt deps
          dbt compile
      - name: Validate Airflow DAGs
        run: python -m pyflakes dags

What this catches: style errors, failing unit tests, SQL compile errors, and obvious DAG import errors before merging.
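The `pytest -q tests/unit` step assumes some unit tests already exist. As a minimal sketch of what such a file might contain, here is a test for a hypothetical transform (`normalize_email` is illustrative, not part of this guide's pipeline):

```python
# tests/unit/test_transforms.py — the kind of test the CI "Unit tests" step runs.

def normalize_email(raw: str) -> str:
    """Lowercase and strip whitespace so joins on email are deterministic."""
    return raw.strip().lower()

def test_normalize_email_strips_and_lowercases():
    assert normalize_email("  Alice@Example.COM ") == "alice@example.com"

def test_normalize_email_is_idempotent():
    # Applying the transform twice must give the same result as applying it once.
    once = normalize_email("Bob@x.io")
    assert normalize_email(once) == once
```

Fast, dependency-free tests like these are what make it cheap to run CI on every pull request.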

Example 2 — CD with environment promotion and data gates

This CD builds a Docker image, deploys to dev with smoke tests, gates on data quality in stage, then promotes to prod after a canary:

name: cd
on:
  push:
    branches: [main]
jobs:
  build:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Build image
        run: docker build -t registry/pipeline:${GITHUB_SHA} .
      - name: Push image
        run: docker push registry/pipeline:${GITHUB_SHA}
  deploy-dev:
    needs: build
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Deploy to dev
        run: ./infra/deploy.sh dev registry/pipeline:${GITHUB_SHA}
      - name: Dev smoke tests
        run: python checks/smoke.py --env dev
  deploy-stage:
    needs: deploy-dev
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Deploy to stage
        run: ./infra/deploy.sh stage registry/pipeline:${GITHUB_SHA}
      - name: Data quality gate
        run: python checks/data_quality.py --env stage --threshold 0.98
  deploy-prod:
    needs: deploy-stage
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Canary to prod (10%)
        run: ./infra/deploy.sh prod registry/pipeline:${GITHUB_SHA} --scope canary
      - name: Observe metrics
        run: python checks/observe.py --env prod --minutes 15 --error-rate-threshold 0.01
      - name: Full rollout
        run: ./infra/deploy.sh prod registry/pipeline:${GITHUB_SHA} --scope full

Key idea: each step is conditional on the previous passing, with simple numeric thresholds for quality gates.
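The `checks/data_quality.py` script is not shown in this guide; one minimal sketch of how such a gate could work is below. The hardcoded counts stand in for a real warehouse query, and the script signals failure through its exit code so the CD job stops:

```python
# Hypothetical sketch of checks/data_quality.py: compare a pass rate
# against a threshold and return a nonzero exit code to block promotion.
import argparse

def pass_rate(passed: int, total: int) -> float:
    """Fraction of rows passing quality checks (1.0 when there are no rows)."""
    return 1.0 if total == 0 else passed / total

def main(argv=None) -> int:
    parser = argparse.ArgumentParser()
    parser.add_argument("--env", required=True)
    parser.add_argument("--threshold", type=float, default=0.98)
    args = parser.parse_args(argv)

    # In a real gate these counts would come from the warehouse for args.env;
    # hardcoded here to keep the sketch self-contained.
    passed, total = 991, 1000
    rate = pass_rate(passed, total)
    print(f"[{args.env}] pass rate {rate:.3f} vs threshold {args.threshold}")
    return 0 if rate >= args.threshold else 1
```

Wired into CD as `python checks/data_quality.py --env stage --threshold 0.98`, the job fails whenever `main` returns nonzero, which is exactly what makes the threshold a gate rather than a warning.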

Example 3 — Safe schema change with backward compatibility

Scenario: adding a non-nullable column to a fact table.

  • Step 1: Add column as nullable with default; write code to populate it; keep consumers reading old schema.
  • Step 2: Backfill in batches; monitor null rate and row counts.
  • Step 3: Switch consumers to new column behind a feature flag; keep dual-write temporarily.
  • Step 4: After stability, enforce NOT NULL; remove old paths.
  • Rollback plan: If metrics dip, revert consumers to old column, stop backfill, and remove feature flag.
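Step 2's batched backfill can be sketched as follows, assuming a warehouse client that exposes an `execute(sql) -> rows_affected` callable (hypothetical), and noting that the `UPDATE ... LIMIT` form is MySQL-style and varies by warehouse:

```python
# Batched backfill: small transactions, with a natural pause point between
# batches where null rate and row counts can be monitored.
def backfill_column(execute, table: str, batch_size: int = 10_000) -> int:
    """Populate new_col in batches until no rows remain; returns rows updated."""
    total = 0
    while True:
        updated = execute(
            f"UPDATE {table} SET new_col = compute_value(old_col) "
            f"WHERE new_col IS NULL LIMIT {batch_size}"
        )
        total += updated
        if updated == 0:  # nothing left to backfill
            break
        # Here you would check null rate / row counts before the next batch.
    return total
```

Keeping each batch small is what makes the rollback plan workable: stopping the loop leaves the table in a consistent, partially backfilled state.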

Checklist before you ship

  • [ ] Git branch uses a clear naming convention (e.g., feature/, fix/).
  • [ ] Lint, unit tests, and project compilation pass in CI.
  • [ ] Data quality checks exist for critical models (row counts, null rates, freshness).
  • [ ] Deploy scripts are idempotent and can be re-run safely.
  • [ ] Secrets are injected via environment variables or secret manager, not hardcoded.
  • [ ] Rollback plan is documented and tested on a non-prod env.
  • [ ] Observability: basic metrics and alerts are configured (failures, latency, data volume).
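As one concrete example, the freshness item above can be implemented as a tiny check; the 24-hour lag is an arbitrary default, and in practice `latest_ts` would come from a warehouse query:

```python
from datetime import datetime, timedelta, timezone

def is_fresh(latest_ts: datetime, max_lag: timedelta = timedelta(hours=24)) -> bool:
    """True when the most recent data is within the allowed lag."""
    return datetime.now(timezone.utc) - latest_ts <= max_lag
```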

Exercises

Complete these hands-on tasks; a solution follows each exercise.

Exercise 1 — Write a minimal CI workflow

Create a YAML CI workflow that:

  • Runs on pull requests.
  • Sets up Python 3.10 and installs requirements.txt.
  • Runs a linter (ruff) and unit tests (pytest on tests/unit).
  • Validates a dbt project (dbt deps + dbt compile).
Solution:
name: ci
on: [pull_request]
jobs:
  test:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v5
        with:
          python-version: '3.10'
      - run: |
          python -m pip install --upgrade pip
          pip install -r requirements.txt
      - run: ruff check .
      - run: pytest -q tests/unit
      - run: |
          dbt deps
          dbt compile

Exercise 2 — Plan a safe prod rollout

Write a step-by-step promotion plan for a pipeline change that adds a new dimension column used by downstream dashboards. Include testing, data gates, canary scope, observation window, and rollback triggers.

Solution:
  1. Dev: deploy, run unit tests + dev smoke (sample run), verify logs.
  2. Stage: deploy, backfill small subset, validate row counts and null rate < 2%.
  3. Prod canary: enable on 10% of partitions; monitor for 15 minutes for failure rate < 1% and volume within ±5% of baseline.
  4. Full rollout: expand to 100% after metrics are stable.
  5. Rollback: if thresholds breach, revert image/tag, disable feature flag, and restore previous DAG version.
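The canary decision in steps 3–5 can be reduced to a small function. The thresholds mirror the plan above; in practice the metric values would come from your monitoring system rather than function arguments:

```python
# Promote only when failure rate and volume drift stay inside the thresholds;
# otherwise signal rollback.
def canary_decision(error_rate: float, volume: float, baseline_volume: float,
                    max_error_rate: float = 0.01, max_drift: float = 0.05) -> str:
    """Return 'promote' or 'rollback' for a canary observation window."""
    drift = abs(volume - baseline_volume) / baseline_volume
    if error_rate <= max_error_rate and drift <= max_drift:
        return "promote"
    return "rollback"
```

Encoding the decision as code, rather than a human judgment call, is what lets the CD pipeline make it automatically at the end of the observation window.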

Common mistakes and self-check

Skipping data checks in CI/CD

Fix: Add at least row count and null rate checks for critical tables. Treat thresholds as gates, not warnings.

Breaking consumers with schema changes

Fix: Use backward-compatible changes first (nullable + default), dual-write, then enforce constraints later.

No rollback path

Fix: Keep previous container tag and DAG version. Test rollback in stage before prod rollout.

Hardcoded secrets

Fix: Use environment variables injected by a secret manager. Never commit keys to the repo.
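A minimal sketch of this pattern in Python; `DB_PASSWORD` in the usage comment is an illustrative variable name:

```python
import os

def require_secret(name: str) -> str:
    """Return a secret injected via the environment; fail loudly if missing."""
    value = os.environ.get(name)
    if not value:
        raise RuntimeError(f"Missing required secret: {name}")
    return value

# Usage: password = require_secret("DB_PASSWORD")
```

Failing fast on a missing secret surfaces misconfiguration at deploy time instead of producing a confusing authentication error mid-run.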

Practical projects

  • Build-and-test: Create a repo with a small dbt model and a Python UDF. Add CI that lints, tests, and compiles.
  • Env promotion: Package a simple pipeline in Docker, deploy to dev and stage with a smoke test script.
  • Data gate: Write a small checker that fails if null rate exceeds 1% and wire it as a CD gate.

Who this is for

  • Aspiring and junior Data Engineers who want reliable deployments.
  • Analytics Engineers adding automation to transformations.
  • Engineers moving from ad-hoc scripts to production-grade pipelines.

Prerequisites

  • Basic Git (commit, branch, PR).
  • Comfort with Python or SQL-based transformations.
  • Familiarity with a workflow tool (e.g., Airflow) or a transformation tool (e.g., dbt).

Learning path

  1. Set up CI: lint + unit tests + compile checks.
  2. Add data checks: row counts, null rates, freshness for key tables.
  3. Introduce CD: build artifacts, deploy to dev, then stage.
  4. Add gates and canary release for prod.
  5. Document rollback and practice it in non-prod.

Next steps

  • Implement at least one CI workflow in your repository this week.
  • Add one data quality gate to your stage environment.
  • Take the quick test below to check your understanding.

Mini challenge

Your team needs to add a new required column to a high-traffic table. Design a CI/CD plan that avoids downtime. Include test coverage, promotion steps, canary, metrics to watch, and rollback. Write it in 6–10 bullet points.

