Working With CI For Data Projects

Learn how to use Continuous Integration (CI) for data projects, for free, with explanations, exercises, and a quick test aimed at Analytics Engineers.

Published: December 23, 2025 | Updated: December 23, 2025

Why this matters

Analytics Engineers ship code that transforms and tests data. Continuous Integration (CI) runs automated checks on every branch and pull request so that broken SQL, failing dbt tests, or accidentally committed secrets are caught before they reach production.

  • Catch breaking schema changes before merge.
  • Run dbt compile/test, lint SQL, and validate data contracts on every PR.
  • Speed up reviews with visible, reproducible CI logs and artifacts.
  • Protect production by gating merges with required checks.

Who this is for

  • Analytics Engineers and BI Developers working with Git.
  • Data/Analytics teams adopting dbt, SQLFluff, Great Expectations, or similar tools.
  • Anyone who wants safe, automated checks for data code.

Prerequisites

  • Comfort with Git (branches, pull requests).
  • Basic shell and YAML familiarity.
  • Working knowledge of your data toolchain (e.g., dbt, SQLFluff).
  • Access to a test data warehouse or a mocked profile.

Concept explained simply

CI is a robot that runs your project’s checks every time code changes. It installs dependencies, lints and tests, and reports pass/fail on your PR. If something breaks, the robot blocks the merge.

Mental model

Imagine a secure turnstile. Only people who pass all checks can go through. CI is the turnstile for your code: it lets in only code that compiles, passes tests, and respects style and contracts.

Core building blocks

  • Triggers: When to run (push, pull_request, schedule).
  • Runners: Where it runs (hosted runners or your own machines).
  • Jobs & steps: ordered tasks such as checkout, setup, install, lint, test, and artifact upload.
  • Environment: Variables, secrets, and profiles for connecting to warehouses.
  • Caching: Speed up installs by reusing downloads or compiled state.
  • Artifacts: Files you want to keep (logs, compiled models, coverage reports).
  • Matrices: Run the same job across variations (Python versions, envs).
  • Gates: Required checks that must pass before merging.

Worked examples

Example 1 — Minimal GitHub Actions for dbt + SQLFluff

Goal: Lint SQL and run dbt compile/test on pull requests that touch models or macros.

name: data-ci
on:
  pull_request:
    paths:
      - 'models/**'
      - 'macros/**'
      - 'dbt_project.yml'
      - 'packages.yml'

jobs:
  checks:
    runs-on: ubuntu-latest
    steps:
      - name: Checkout
        uses: actions/checkout@v4

      - name: Setup Python
        uses: actions/setup-python@v5
        with:
          python-version: '3.11'

      - name: Install deps
        run: |
          python -m pip install --upgrade pip
          pip install dbt-core sqlfluff sqlfluff-templater-dbt
          # Example: add your adapter, e.g., dbt-postgres or dbt-bigquery
          pip install dbt-postgres

      - name: Lint SQL
        run: |
          sqlfluff version
          sqlfluff lint models || { echo 'SQLFluff found issues'; exit 1; }

      - name: dbt deps
        run: dbt deps

      - name: dbt compile
        env:
          DBT_PROFILES_DIR: ./.ci/profiles
        run: dbt compile --profiles-dir $DBT_PROFILES_DIR

      - name: dbt test (modified only)
        env:
          DBT_PROFILES_DIR: ./.ci/profiles
        run: |
          # seeds are optional in CI; ignore failures if you have none
          dbt seed --profiles-dir $DBT_PROFILES_DIR --full-refresh || true
          # state:modified+ compares against artifacts from a previous run, so
          # dbt needs --state pointing at that run's manifest. ./prod-artifacts
          # is a placeholder for wherever you download or store it; without a
          # stored manifest, fall back to a plain `dbt test`.
          dbt test --profiles-dir $DBT_PROFILES_DIR -s state:modified+ --state ./prod-artifacts

Tip: Use a safe test profile pointing to a non-production dataset.
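
For reference, here is a minimal sketch of such a CI profile, assuming the Postgres adapter with credentials read from environment variables (the variable names, database, and schema below are placeholders):

# .ci/profiles/profiles.yml (sketch; the top-level key must match the
# `profile:` name in dbt_project.yml, and every value below is a placeholder)
my_project:
  target: ci
  outputs:
    ci:
      type: postgres
      host: "{{ env_var('CI_DB_HOST') }}"
      user: "{{ env_var('CI_DB_USER') }}"
      password: "{{ env_var('CI_DB_PASSWORD') }}"
      port: 5432
      dbname: analytics_ci
      schema: ci_checks
      threads: 4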

Example 2 — GitLab CI variant with artifacts

Goal: Run the same lint and test checks on GitLab, keeping the lint log and dbt's target/ directory as artifacts.
stages:
  - lint
  - test

variables:
  PIP_CACHE_DIR: "$CI_PROJECT_DIR/.pip-cache"

cache:
  key: "$CI_JOB_NAME"
  paths:
    - .pip-cache/

lint_sql:
  stage: lint
  image: python:3.11-slim
  script:
    - pip install --upgrade pip
    - pip install sqlfluff sqlfluff-templater-dbt
    # if your .sqlfluff uses the dbt templater, dbt-core and your adapter
    # must be installed too; tee keeps a copy of the output for the artifact
    # below, and pipefail preserves sqlfluff's exit code through the pipe
    - set -o pipefail
    - sqlfluff lint models | tee sqlfluff.log
  artifacts:
    when: always
    paths:
      - sqlfluff.log

run_dbt_tests:
  stage: test
  image: python:3.11-slim
  script:
    - pip install dbt-core dbt-postgres
    - dbt deps
    - dbt compile --profiles-dir .ci/profiles
    - dbt test --profiles-dir .ci/profiles
  artifacts:
    when: always
    paths:
      - target/

Example 3 — Speed-up with caching and a matrix

Run checks on multiple environments, caching Python packages for faster runs.

name: data-ci-matrix
on:
  pull_request:

jobs:
  checks:
    runs-on: ubuntu-latest
    strategy:
      matrix:
        python: ["3.10", "3.11"]
        env: ["staging", "dev"]
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v5
        with:
          python-version: ${{ matrix.python }}

      - name: Cache pip
        uses: actions/cache@v4
        with:
          path: ~/.cache/pip
          key: pip-${{ runner.os }}-${{ matrix.python }}-${{ hashFiles('**/requirements.txt') }}

      - name: Install
        run: |
          python -m pip install --upgrade pip
          pip install -r requirements.txt

      - name: Lint and test
        env:
          APP_ENV: ${{ matrix.env }}
          DBT_PROFILES_DIR: ./.ci/profiles
        run: |
          sqlfluff lint models
          dbt deps
          # --target selects the matrix environment; this assumes your CI
          # profiles.yml defines matching "staging" and "dev" targets
          dbt compile --profiles-dir $DBT_PROFILES_DIR --target $APP_ENV
          # as in Example 1, state:modified+ needs a previous run's manifest
          # via --state (./prod-artifacts is a placeholder path)
          dbt test --profiles-dir $DBT_PROFILES_DIR --target $APP_ENV -s state:modified+ --state ./prod-artifacts

Quality gates for data

  • Fail on SQL lint errors.
  • Fail if dbt tests fail (unique, not null, relationships).
  • Optionally compare metrics vs. yesterday (artifact or lightweight script) and fail on big deltas; a minimal gate sketch follows this list.
  • Block merge until required checks pass.
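
One rough way to wire such a gate, assuming an earlier step wrote a small report.json with a top-level status field (the file name and shape are illustrative, not a standard), is an extra step like this:

      - name: Data-quality gate
        run: |
          # report.json is assumed to be written by an earlier step; jq ships
          # on GitHub-hosted Ubuntu runners, and -e exits non-zero when the
          # expression is false, which fails the job
          test -f report.json || { echo 'report.json missing'; exit 1; }
          jq -e '.status == "pass"' report.json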

Secrets and security

  • Store warehouse credentials as CI secrets/variables and reference them as environment variables (see the sketch after this list).
  • Never echo secrets to logs; avoid verbose debug for connection steps.
  • Use least-privileged, read-only or sandbox accounts for CI.
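
In GitHub Actions, that looks roughly like the step below; the secret names are placeholders you would define under your repository's Actions secrets, and GitHub masks their values in logs:

      - name: dbt test
        env:
          # placeholder secret names; pair these with the env_var() lookups
          # in the CI profiles.yml sketched earlier
          CI_DB_USER: ${{ secrets.CI_DB_USER }}
          CI_DB_PASSWORD: ${{ secrets.CI_DB_PASSWORD }}
          DBT_PROFILES_DIR: ./.ci/profiles
        run: dbt test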

Speed and reliability tips

  • Cache pip and dbt packages (see the cache sketch after this list). Reuse compiled state to run only modified tests where possible.
  • Run lightweight checks first (lint, compile) to fail fast.
  • Use a small test dataset or seeds; avoid hitting large production tables.
  • Split jobs (lint vs. test) to parallelize.
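
For the dbt side, a hedged sketch: dbt deps installs into dbt_packages/ by default, so caching that directory keyed on packages.yml avoids re-downloading packages on reruns:

      - name: Cache dbt packages
        uses: actions/cache@v4
        with:
          # dbt deps installs into dbt_packages/ by default; the key changes
          # whenever packages.yml changes, invalidating stale caches
          path: dbt_packages
          key: dbt-pkgs-${{ runner.os }}-${{ hashFiles('packages.yml') }}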

Branch strategy with CI

  • Require CI checks on PRs into main or production branches.
  • Protect main: only merges via PR with passing checks.
  • Optionally require code owners for sensitive models (a sample CODEOWNERS file follows this list).
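
A minimal CODEOWNERS sketch (team names and paths are placeholders); GitHub requests these reviewers automatically on PRs that touch matching paths:

# .github/CODEOWNERS
# later rules take precedence; names below are placeholders
models/**          @your-org/analytics-engineering
models/finance/**  @your-org/finance-data-owners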

Exercises

These match the tasks in the Exercises section below. Finish them and run the Quick Test. Note: Everyone can take the test; only logged-in users get saved progress.

  1. Create a minimal CI workflow that lints SQL and runs dbt tests on PRs.
  2. Add caching and a data-quality gate that reads a small JSON report and fails if status != pass.

Checklist

  • Workflow triggers on pull_request.
  • Installs Python deps and your dbt adapter.
  • Runs sqlfluff lint on models.
  • Runs dbt deps, compile, and tests (preferably only modified code).
  • Uses a safe DB profile and secrets.
  • Implements caching for faster reruns.
  • Includes at least one artifact (logs or target/).

Common mistakes and self-check

  • Using production credentials in CI. Self-check: Are your credentials read-only and scoped to test data?
  • Running full dbt test on every PR. Self-check: Are you selecting only changed models when possible?
  • No caching. Self-check: Do reruns complete faster than first runs?
  • Secrets printed in logs. Self-check: Search logs for connection strings or tokens; they should be masked.
  • Unclear failures. Self-check: Do steps have descriptive names and clear errors?

Practical projects

  • Project A: Set up CI for a dbt starter project with SQLFluff and required PR checks.
  • Project B: Add a metrics delta check comparing today vs. yesterday; fail on large deviations.
  • Project C: Create a matrix workflow that runs against dev and staging profiles.

Learning path

  1. Refresh Git branches and PRs.
  2. Write a minimal YAML workflow (checkout, setup Python).
  3. Add SQL linting and dbt compile/test.
  4. Introduce caching and artifacts.
  5. Add secrets and a safe profiles.yml for CI.
  6. Gate merges with required checks and branch protections.

Next steps

  • Expand CI to include coverage for tests, freshness checks, and contract enforcement.
  • Introduce scheduled CI runs for nightly data smoke tests (see the trigger sketch below).
  • Pair CI with CD to auto-deploy after merge when checks pass.
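
A minimal sketch of such a nightly trigger; the cron time is arbitrary, and scheduled workflows run against the repository's default branch:

name: nightly-smoke
on:
  schedule:
    # cron times in GitHub Actions are UTC; this fires daily at 03:15
    - cron: '15 3 * * *'

jobs:
  smoke:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      # install dependencies as in Example 1, then run a small selection,
      # e.g. dbt source freshness or a handful of smoke tests
      - run: echo "dbt source freshness / smoke tests go here"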

Mini challenge

Extend your workflow to upload the dbt target/ directory as an artifact, and only run dbt test for models changed in the PR plus their children. Add a job condition so that the test job only runs if linting passes.

Practice Exercises

2 exercises to complete

Instructions

Create .github/workflows/data-ci.yml that:

  • Triggers on pull_request for models, macros, and dbt config files.
  • Checks out code, sets up Python 3.11, installs dbt-core + your adapter and SQLFluff.
  • Runs sqlfluff lint on models.
  • Runs dbt deps, compile, seed (optional), and test for modified code only.
  • Uses a safe CI profile via DBT_PROFILES_DIR (e.g., ./.ci/profiles).

Tip: If you don’t have a warehouse, point the profile to a local/mock or a small sandbox schema.

Expected Output

CI job passes: SQL lint shows 0 errors; dbt compile succeeds; dbt tests run with all passing for modified models.

Working With CI For Data Projects — Quick Test

Test your knowledge with 8 questions. Pass with 70% or higher.

