Why this matters
Analytics Engineers ship code that transforms and tests data. Continuous Integration (CI) runs automated checks on every branch and pull request so that broken SQL, failing dbt tests, or leaking secrets never reach production.
- Catch breaking schema changes before merge.
- Run dbt compile/test, lint SQL, and validate data contracts on every PR.
- Speed up reviews with visible, reproducible CI logs and artifacts.
- Protect production by gating merges with required checks.
Who this is for
- Analytics Engineers and BI Developers working with Git.
- Data/Analytics teams adopting dbt, SQLFluff, Great Expectations, or similar tools.
- Anyone who wants safe, automated checks for data code.
Prerequisites
- Comfort with Git (branches, pull requests).
- Basic shell and YAML familiarity.
- Working knowledge of your data toolchain (e.g., dbt, SQLFluff).
- Access to a test data warehouse or a mocked profile.
Concept explained simply
CI is a robot that runs your project’s checks every time code changes. It installs dependencies, lints and tests, and reports pass/fail on your PR. If something breaks, the robot blocks the merge.
Mental model
Imagine a secure turnstile. Only people who pass all checks can go through. CI is the turnstile for your code: it lets in only code that compiles, passes tests, and respects style and contracts.
Core building blocks
- Triggers: When to run (push, pull_request, schedule).
- Runners: Where it runs (hosted runners or your own machines).
- Jobs & steps: the ordered tasks (checkout, setup, install, lint, test, upload artifacts).
- Environment: Variables, secrets, and profiles for connecting to warehouses.
- Caching: Speed up installs by reusing downloads or compiled state.
- Artifacts: Files you want to keep (logs, compiled models, coverage reports).
- Matrices: Run the same job across variations (Python versions, envs).
- Gates: Required checks that must pass before merging.
Worked examples
Example 1 — Minimal GitHub Actions for dbt + SQLFluff
Goal: Lint SQL and run dbt compile/test on pull requests that touch models or macros.
name: data-ci
on:
  pull_request:
    paths:
      - 'models/**'
      - 'macros/**'
      - 'dbt_project.yml'
      - 'packages.yml'
jobs:
  checks:
    runs-on: ubuntu-latest
    steps:
      - name: Checkout
        uses: actions/checkout@v4
      - name: Setup Python
        uses: actions/setup-python@v5
        with:
          python-version: '3.11'
      - name: Install deps
        run: |
          python -m pip install --upgrade pip
          pip install dbt-core sqlfluff sqlfluff-templater-dbt
          # Example: add your adapter, e.g., dbt-postgres or dbt-bigquery
          pip install dbt-postgres
      - name: Lint SQL
        run: |
          sqlfluff version
          sqlfluff lint models || { echo 'SQLFluff found issues'; exit 1; }
      - name: dbt deps
        run: dbt deps
      - name: dbt compile
        env:
          DBT_PROFILES_DIR: ./.ci/profiles
        run: dbt compile --profiles-dir $DBT_PROFILES_DIR
      - name: dbt test (modified only)
        env:
          DBT_PROFILES_DIR: ./.ci/profiles
        run: |
          dbt seed --profiles-dir $DBT_PROFILES_DIR --full-refresh || true
          # state:modified+ needs the manifest from a previous run; point --state at it
          dbt test --profiles-dir $DBT_PROFILES_DIR -s state:modified+ --state ./state
Tip: Use a safe test profile pointing to a non-production dataset.
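For reference, a hypothetical .ci/profiles/profiles.yml for a Postgres sandbox might look like the sketch below. The profile name, database, schema, and environment-variable names are all assumptions for your project; the credentials themselves come from CI secrets, read with dbt's env_var function.

```yaml
my_project:          # must match the profile name in dbt_project.yml
  target: ci
  outputs:
    ci:
      type: postgres
      host: "{{ env_var('CI_DB_HOST') }}"
      user: "{{ env_var('CI_DB_USER') }}"
      password: "{{ env_var('CI_DB_PASSWORD') }}"
      port: 5432
      dbname: analytics_ci    # sandbox database, never production
      schema: ci_checks
      threads: 4
```

Commit this file to the repo; only the environment variables it references are secret.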
Example 2 — GitLab CI variant with artifacts
stages:
  - lint
  - test

variables:
  PIP_CACHE_DIR: "$CI_PROJECT_DIR/.pip-cache"

cache:
  key: "$CI_JOB_NAME"
  paths:
    - .pip-cache/

lint_sql:
  stage: lint
  image: python:3.11-slim
  script:
    - pip install --upgrade pip
    - pip install sqlfluff sqlfluff-templater-dbt
    # Write the log so the artifact exists; surface it and fail the job on lint errors
    - sqlfluff lint models > sqlfluff.log 2>&1 || { cat sqlfluff.log; exit 1; }
  artifacts:
    when: always
    paths:
      - sqlfluff.log

run_dbt_tests:
  stage: test
  image: python:3.11-slim
  script:
    - pip install dbt-core dbt-postgres
    - dbt deps
    - dbt compile --profiles-dir .ci/profiles
    - dbt test --profiles-dir .ci/profiles
  artifacts:
    when: always
    paths:
      - target/
Example 3 — Speed-up with caching and a matrix
Run checks on multiple environments, caching Python packages for faster runs.
name: data-ci-matrix
on:
  pull_request:
jobs:
  checks:
    runs-on: ubuntu-latest
    strategy:
      matrix:
        python: ["3.10", "3.11"]
        env: ["staging", "dev"]
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v5
        with:
          python-version: ${{ matrix.python }}
      - name: Cache pip
        uses: actions/cache@v4
        with:
          path: ~/.cache/pip
          key: pip-${{ runner.os }}-${{ matrix.python }}-${{ hashFiles('**/requirements.txt') }}
      - name: Install
        run: |
          python -m pip install --upgrade pip
          pip install -r requirements.txt
      - name: Lint and test
        env:
          APP_ENV: ${{ matrix.env }}
          DBT_PROFILES_DIR: ./.ci/profiles
        run: |
          sqlfluff lint models
          dbt deps
          dbt compile --profiles-dir $DBT_PROFILES_DIR
          # state:modified+ needs the manifest from a previous run; point --state at it
          dbt test --profiles-dir $DBT_PROFILES_DIR -s state:modified+ --state ./state
Quality gates for data
- Fail on SQL lint errors.
- Fail if dbt tests fail (unique, not null, relationships).
- Optionally compare metrics vs. yesterday (artifact or lightweight script) and fail on big deltas.
- Block merge until required checks pass.
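The metrics-delta gate above can be sketched as a small Python script. The report shape assumed here (a top-level status field plus a metrics list with per-metric delta values) is a convention for illustration, not a standard format; adapt it to whatever your earlier CI step writes.

```python
import json
import sys


def gate_passes(report: dict, max_delta: float = 0.05) -> bool:
    """Return True when status is 'pass' and every metric delta is within max_delta."""
    if report.get("status") != "pass":
        return False
    metrics = report.get("metrics", [])
    return all(abs(m.get("delta", 0.0)) <= max_delta for m in metrics)


def run_gate(path: str = "quality_report.json") -> int:
    """Read a JSON report produced by an earlier CI step; return a shell exit code."""
    with open(path) as f:
        report = json.load(f)
    if gate_passes(report):
        print("quality gate: pass")
        return 0
    print("quality gate: FAIL", file=sys.stderr)
    return 1
```

A CI step can then run it and fail the job on a nonzero exit code, e.g. `python -c "import quality_gate, sys; sys.exit(quality_gate.run_gate())"`.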
Secrets and security
- Store warehouse credentials as CI secrets/variables; reference them as environment variables.
- Never echo secrets to logs; avoid verbose debug for connection steps.
- Use least-privileged, read-only or sandbox accounts for CI.
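In GitHub Actions, for example, secrets are mapped into a step's environment rather than written into the workflow; the secret names below are placeholders for whatever your repo defines:

```yaml
- name: dbt test
  env:
    # Values come from repo/org secrets; Actions masks them in logs
    CI_DB_USER: ${{ secrets.CI_DB_USER }}
    CI_DB_PASSWORD: ${{ secrets.CI_DB_PASSWORD }}
  run: dbt test --profiles-dir .ci/profiles
```

GitLab CI works the same way via masked CI/CD variables, which appear to jobs as environment variables automatically.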
Speed and reliability tips
- Cache pip and dbt packages. Reuse compiled state to run only modified tests where possible.
- Run lightweight checks first (lint, compile) to fail fast.
- Use a small test dataset or seeds; avoid hitting large production tables.
- Split jobs (lint vs. test) to parallelize.
Branch strategy with CI
- Require CI checks on PRs into main or production branches.
- Protect main: only merges via PR with passing checks.
- Optionally require code owners for sensitive models.
Exercises
Work through the tasks below, then run the Quick Test to check your understanding.
- Create a minimal CI workflow that lints SQL and runs dbt tests on PRs.
- Add caching and a data-quality gate that reads a small JSON report and fails if status != pass.
Checklist
- Workflow triggers on pull_request.
- Installs Python deps and your dbt adapter.
- Runs sqlfluff lint on models.
- Runs dbt deps, compile, and tests (preferably only modified code).
- Uses a safe DB profile and secrets.
- Implements caching for faster reruns.
- Includes at least one artifact (logs or target/).
Common mistakes and self-check
- Using production credentials in CI. Self-check: Are your credentials read-only and scoped to test data?
- Running full dbt test on every PR. Self-check: Are you selecting only changed models when possible?
- No caching. Self-check: Do reruns complete faster than first runs?
- Secrets printed in logs. Self-check: Search logs for connection strings or tokens; they should be masked.
- Unclear failures. Self-check: Do steps have descriptive names and clear errors?
Practical projects
- Project A: Set up CI for a dbt starter project with SQLFluff and required PR checks.
- Project B: Add a metrics delta check comparing today vs. yesterday; fail on large deviations.
- Project C: Create a matrix workflow that runs against dev and staging profiles.
Learning path
- Refresh Git branches and PRs.
- Write a minimal YAML workflow (checkout, setup Python).
- Add SQL linting and dbt compile/test.
- Introduce caching and artifacts.
- Add secrets and a safe profiles.yml for CI.
- Gate merges with required checks and branch protections.
Next steps
- Expand CI to include coverage for tests, freshness checks, and contract enforcement.
- Introduce scheduled CI runs for nightly data smoke tests.
- Pair CI with CD to auto-deploy after merge when checks pass.
Mini challenge
Extend your workflow to upload the dbt target/ directory as an artifact, and only run dbt test for models changed in the PR plus their children. Add a job condition so that the test job only runs if linting passes.