Why this matters
Analytics Engineers ship data models that power dashboards, ML features, and decisions. Continuous Integration (CI) automation makes sure every change is tested before it reaches production. It prevents broken models, failing dashboards, slow queries, and late-night hotfixes.
- Real tasks you will handle: run tests on every pull request, spin up temporary schemas, seed test data, run data quality checks, and block merges if tests fail.
- Impact: faster reviews, fewer regressions, consistent data quality, and confident releases.
Who this is for
- Analytics Engineers who maintain SQL/ELT pipelines and dbt or similar frameworks.
- Data/BI Engineers adding automated testing to their workflow.
- Data team leads standardizing quality gates in CI.
Prerequisites
- Basic Git (branches, pull requests, commits).
- Familiarity with SQL and your warehouse (e.g., Snowflake/BigQuery/Redshift/Databricks).
- Ability to run your project locally (e.g., dbt build/tests or equivalent SQL test scripts).
Concept explained simply
CI automation runs your tests automatically when code changes. Think of it as a robot reviewer: it checks your models, seeds data, runs tests, and reports pass/fail in minutes.
Mental model
Picture three lanes:
- Lane 1: Setup — install dependencies, set environment variables, create a temporary schema.
- Lane 2: Test — build only changed models, run tests and data checks, collect artifacts (logs).
- Lane 3: Teardown — drop temporary objects, clean up secrets, post results.
You want this to run on each pull request and on the main branch after merges.
Workflow overview (high level)
- Trigger: pull request opened/updated or commit to main.
- Prepare: checkout code, set environment variables, install CLI/tools.
- Isolate: create an ephemeral schema (often named with the branch).
- Seed: load small, deterministic test fixtures.
- Build & test: build models and run tests; stop on failures.
- Report: annotate PR with status; store logs and test results.
- Clean up: drop temporary schema to avoid clutter and cost.
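The flow above can be sketched as one shell script. The warehouse and dbt commands are stubbed with placeholder functions here so the ordering is runnable anywhere; the branch name and seed names are illustrative:

```shell
#!/usr/bin/env bash
set -euo pipefail

# Stubs standing in for your real warehouse CLI and dbt.
sql_cli() { echo "SQL: $1"; }
dbt_cmd() { echo "DBT: $*"; }

GIT_BRANCH="feature/new-orders-model"
SCHEMA="ci_${GIT_BRANCH//[^a-zA-Z0-9]/_}"          # Isolate: branch-named schema

sql_cli "CREATE SCHEMA IF NOT EXISTS ${SCHEMA};"    # Isolate
dbt_cmd seed --select customers_seed orders_seed    # Seed fixtures
dbt_cmd build --select state:modified+              # Build & test changed nodes
sql_cli "DROP SCHEMA IF EXISTS ${SCHEMA} CASCADE;"  # Clean up
```
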
Why ephemeral schemas?
They isolate test runs so your work cannot pollute shared environments. You can safely run destructive operations and parallel builds without conflicts.
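For example, two open pull requests map to two distinct schemas via a bash parameter expansion, so parallel CI runs cannot collide (branch names here are illustrative; the `ci_` prefix is a convention, not a requirement):

```shell
# Every non-alphanumeric character becomes an underscore, yielding a
# valid, unique identifier per branch in most warehouses.
schemas=()
for branch in "feature/orders-v2" "fix/null-customer-ids"; do
  schemas+=("ci_${branch//[^a-zA-Z0-9]/_}")
done
printf '%s\n' "${schemas[@]}"
# -> ci_feature_orders_v2
# -> ci_fix_null_customer_ids
```
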
Worked examples
Example 1: Minimal CI to run data model tests on pull requests
Goal: On each PR, install tools, create an ephemeral schema, run "build + tests", then clean up.
# ci.yml (conceptual YAML)
on:
  pull_request:
    branches: [ main ]
jobs:
  test:
    runs-on: linux
    steps:
      - name: Checkout
        run: git checkout $GIT_BRANCH
      - name: Env
        run: |
          export WAREHOUSE_USER=$CI_USER
          export WAREHOUSE_PASSWORD=$CI_PASSWORD
          export SCHEMA="ci_${GIT_BRANCH//[^a-zA-Z0-9]/_}"
          echo "SCHEMA=$SCHEMA" >> $GITHUB_ENV   # persist across steps
      - name: Create schema
        run: sql-cli "CREATE SCHEMA IF NOT EXISTS ${SCHEMA};"
      - name: Build & test (changed only)
        run: |
          dbt deps
          dbt build --select state:modified+ --defer --state prod-artifacts/
      - name: Drop schema
        if: always()
        run: sql-cli "DROP SCHEMA IF EXISTS ${SCHEMA} CASCADE;"
Notes: Use your project’s CLI (e.g., dbt) and a SQL CLI appropriate to your warehouse. State-based selection speeds up runs by testing only changed models and their downstream dependencies; point --state at the artifacts (manifest) from your last production run.
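To see why change-aware selection helps, here is a toy sketch (pure bash, with hypothetical model names) of what "a changed model plus its downstream dependents" means over a tiny dependency map — only three of four models need rebuilding:

```shell
# Toy DAG, child -> parent. state:modified+ means: the changed node
# plus everything downstream of it.
declare -A parent=( [stg_orders]=raw_orders [orders]=stg_orders [revenue]=orders [customers]=raw_customers )
changed="stg_orders"

selected=()
for node in "${!parent[@]}"; do
  # Walk up the ancestry; select the node if it, or any ancestor, changed.
  cur="$node"
  while [ -n "$cur" ]; do
    if [ "$cur" = "$changed" ]; then selected+=("$node"); break; fi
    cur="${parent[$cur]:-}"
  done
done
printf '%s\n' "${selected[@]}" | sort -u
# customers is untouched by the change, so it is not rebuilt.
```
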
Example 2: Seed small test fixtures and verify row-level expectations
# Seed and test using SQL
steps:
  - name: Seed fixtures
    run: dbt seed --select customers_seed orders_seed
  - name: Run data checks
    run: |
      # Use plain schema-qualified names here: Jinja ref() only resolves inside dbt,
      # not in a shell SQL CLI.
      # No null customer_id in orders
      sql-cli "SELECT COUNT(*) FROM ${SCHEMA}.orders WHERE customer_id IS NULL;" > nulls.txt
      if [ "$(cat nulls.txt)" -gt 0 ]; then exit 1; fi
      # Unique customer_id in customers
      sql-cli "SELECT COUNT(*) - COUNT(DISTINCT customer_id) FROM ${SCHEMA}.customers;" > dupes.txt
      if [ "$(cat dupes.txt)" -gt 0 ]; then exit 1; fi
These are small, fast checks that catch common issues early.
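The pattern above can be wrapped in a small helper so each gate is one line with a clear failure message. `run_check` is a hypothetical name, and the counts are hardcoded here where a real run would read them from the SQL CLI:

```shell
# run_check NAME COUNT — fail with a clear message when COUNT > 0.
run_check() {
  local name="$1" count="$2"
  if [ "$count" -gt 0 ]; then
    echo "FAIL: $name ($count offending rows)"
    return 1
  fi
  echo "PASS: $name"
}

# In CI the counts come from: sql-cli "SELECT COUNT(*) ..." > out.txt
run_check "no null customer_id" 0            # passes
run_check "unique customer_id" 3 || status=$?  # fails, captures exit code
```
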
Example 3: Branch-named schema isolation and teardown safety
# Safe teardown pattern: validate the branch name BEFORE building the schema
# name, because "ci_${GIT_BRANCH...}" is never empty and would pass a -z check.
if [ -z "$GIT_BRANCH" ]; then echo "Branch name missing"; exit 1; fi
SCHEMA="ci_${GIT_BRANCH//[^a-zA-Z0-9]/_}"
sql-cli "CREATE SCHEMA IF NOT EXISTS ${SCHEMA};"
# ... run builds/tests targeting ${SCHEMA} ...
sql-cli "DROP SCHEMA IF EXISTS ${SCHEMA} CASCADE;"
This prevents accidental drops in shared schemas and keeps environments clean.
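In plain shell, outside a CI runner's `if: always()` mechanism, the same "teardown runs even on failure" guarantee can be sketched with an EXIT trap. The subshell and `teardown_log` variable exist only for this demonstration, and the drop statement is echoed rather than executed:

```shell
# Run a deliberately failing pipeline in a subshell; the EXIT trap
# still performs teardown before the shell exits.
teardown_log=$(bash -c '
  SCHEMA="ci_demo_branch"
  teardown() { echo "DROP SCHEMA IF EXISTS ${SCHEMA} CASCADE;"; }
  trap teardown EXIT            # fires on success AND on failure
  echo "building models..."
  false                         # simulate a failing test step
' 2>&1) || true
echo "$teardown_log"
```
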
Step-by-step: Build a minimal CI pipeline
- Define triggers
  on:
    pull_request:
      branches: [ main ]
    push:
      branches: [ main ]
- Prepare environment
  steps:
    - name: Checkout
      run: git checkout $GIT_BRANCH
    - name: Setup tools
      run: |
        pip install dbt-core dbt-your-adapter
        install-sql-cli
- Ephemeral schema
  export SCHEMA="ci_${GIT_BRANCH//[^a-zA-Z0-9]/_}"
  sql-cli "CREATE SCHEMA IF NOT EXISTS ${SCHEMA};"
- Seed and build
  dbt deps
  # Only changed nodes and their children
  dbt build --select state:modified+ --defer --state prod-artifacts/
- Store artifacts
  mkdir -p artifacts
  cp -r target/* artifacts/
- Teardown
  sql-cli "DROP SCHEMA IF EXISTS ${SCHEMA} CASCADE;"
Tip: Make tests fast
- Use small fixtures (seed tables with dozens, not millions, of rows).
- Select changed nodes instead of rebuilding everything.
- Run cheap SQL assertions before heavier transformations.
Common mistakes and self-check
- Running full builds on every PR. Self-check: Does your CI take >15 minutes for small changes? If yes, switch to change-aware selection.
- No isolation. Self-check: Do test runs create objects in shared schemas? Use branch-based schemas.
- Unpinned dependencies. Self-check: Builds randomly fail after dependency updates? Pin versions to avoid surprises.
- Forgetting teardown. Self-check: Are there dozens of old CI schemas? Ensure teardown runs even on failure (always/try-finally).
- Secrets in logs. Self-check: Search logs for credentials; mask them and use environment variables.
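For the last point, a minimal sketch of a log scan, assuming the secret value is available as an environment variable in CI (the variable name and log line are illustrative):

```shell
WAREHOUSE_PASSWORD="s3cret-value"   # in real CI this comes from the secret store
ci_log="connecting to warehouse as ci_user password=s3cret-value"

# Fail the build if the raw secret value ever appears in captured logs.
if printf '%s' "$ci_log" | grep -qF "$WAREHOUSE_PASSWORD"; then
  echo "FAIL: credential leaked into logs"
  leak_found=1
else
  leak_found=0
fi
```

Most CI systems also mask registered secrets automatically, but a scan like this catches values that were never registered as secrets in the first place.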
Exercises
Do these in your sandbox repo or copy the snippets into a scratch file to reason through them.
Exercise 1 — Minimal PR pipeline with isolation
Write a CI snippet that:
- Creates a branch-named schema.
- Runs dependency install, seeds two fixtures, and builds only changed models with tests.
- Always drops the schema at the end.
Expected output:
- Logs show created schema and later dropped schema with the branch name.
- Build step runs with a change-aware selector (e.g., state:modified+).
- Exit code 0 on success; non-zero if tests fail.
Show solution
on:
  pull_request:
    branches: [ main ]
jobs:
  pr_tests:
    runs-on: linux
    steps:
      - name: Checkout
        run: git checkout $GIT_BRANCH
      - name: Prepare env
        run: |
          pip install dbt-core dbt-your-adapter
          install-sql-cli
          export SCHEMA="ci_${GIT_BRANCH//[^a-zA-Z0-9]/_}"
          echo "SCHEMA=$SCHEMA" >> $GITHUB_ENV
      - name: Create schema
        run: sql-cli "CREATE SCHEMA IF NOT EXISTS ${SCHEMA};"
      - name: Seed
        run: |
          dbt deps
          dbt seed --select customers_seed orders_seed
      - name: Build & test (changed)
        run: dbt build --select state:modified+ --defer --state prod-artifacts/
      - name: Teardown
        if: always()
        run: sql-cli "DROP SCHEMA IF EXISTS ${SCHEMA} CASCADE;"
Exercise 2 — Add two SQL quality gates
Add steps that fail the build if:
- order_date is in the future in the orders model.
- total_amount is negative in the orders model.
Expected output:
- If either condition is true, CI fails with a clear message.
- If both pass, CI proceeds and exits 0.
Show solution
- name: Data quality checks
  run: |
    # Schema-qualified names: Jinja ref() is not available outside dbt.
    sql-cli "SELECT COUNT(*) FROM ${SCHEMA}.orders WHERE order_date > CURRENT_DATE;" > future.txt
    if [ "$(cat future.txt)" -gt 0 ]; then echo "Fail: future order_date"; exit 1; fi
    sql-cli "SELECT COUNT(*) FROM ${SCHEMA}.orders WHERE total_amount < 0;" > negative.txt
    if [ "$(cat negative.txt)" -gt 0 ]; then echo "Fail: negative total_amount"; exit 1; fi
Checklist
- I can run tests on each PR automatically.
- My CI uses an ephemeral schema.
- I can fail fast on simple SQL assertions.
- My teardown runs even when tests fail.
Practical projects
- Starter: Add CI to an existing dbt project that runs seeds, build, and tests only on changed nodes.
- Intermediate: Implement schema isolation per branch, publish test artifacts, and add 3 SQL assertions.
- Advanced: Split pipeline into parallel jobs: unit tests, model builds, and data quality checks; enforce required status checks before merge.
Learning path
- Before this: Writing data tests (schema tests, SQL assertions) and environment management.
- Now: Automate tests in CI with isolation and fast feedback.
- Next: Add deployment gates, data contracts, and scheduled monitors in production.
Next steps
- Harden CI: pin dependencies and add caching to speed up runs.
- Add contract tests for critical models (columns, types, nullability).
- Track flakiness: quarantine or fix flaky tests and reduce nondeterminism in fixtures.
Mini challenge
Within 30 minutes, create a branch-based schema, run one build+test job, and add a single SQL assertion that fails on negative totals. Post the CI log snippet that shows the assertion and the final pass/fail.