Who this is for
Analytics Engineers and BI Developers who use dbt to transform warehouse data and need reliable pipelines with clear SLAs for raw data arrival.
Prerequisites
- Basic dbt project setup and profiles configured
- Comfort with YAML and SQL
- Ability to run dbt commands from your terminal
Why this matters
In real teams, business metrics break when raw data stops arriving or arrives late. dbt sources document and test your upstream raw tables, while freshness checks tell you if data is on time. You will:
- Gate downstream model runs when raw data is stale
- Alert stakeholders when data lags SLAs
- Document raw lineage and ownership for fast troubleshooting
Typical tasks in the profession
- Declare raw data tables as sources and apply tests
- Set freshness thresholds (warn/error) and schedule checks
- Reference sources correctly in models using the `source()` function
- Optimize freshness checks with filters for partitioned data
Concept explained simply
A source in dbt is a raw table that dbt does not build but depends on. You declare it in YAML so you can reference it safely in models via {{ source('src_name','table_name') }}. A freshness check compares the most recent ingestion timestamp in that table to the current time. If it exceeds your thresholds, dbt warns or errors.
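Under the hood, a freshness check reduces to a single aggregate query. Conceptually (a sketch only; the exact SQL varies by adapter, and the table/column names follow the worked examples below) it looks like:

```sql
-- Conceptual sketch of a freshness check (not the adapter's exact SQL)
select
    max(_ingested_at) as max_loaded_at,   -- newest ingestion timestamp
    current_timestamp as snapshotted_at   -- "now" according to the warehouse
from raw.orders
```

dbt then compares the gap between `snapshotted_at` and `max_loaded_at` against your `warn_after` and `error_after` thresholds.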
Mental model
- Source YAML = contract for upstream data (where it lives, what it’s called, how fresh it should be)
- `loaded_at_field` = the column that best indicates when data landed in the warehouse (ingestion time)
- `warn_after` / `error_after` = SLA lines in the sand
- Freshness run = a quick health check before building models
Worked examples
Example 1 — Basic source with freshness
```yaml
version: 2

sources:
  - name: raw
    database: "{{ target.database }}"
    schema: raw
    tables:
      - name: orders
        description: Raw orders landed via ingestion service
        loaded_at_field: _ingested_at
        freshness:
          warn_after: {count: 12, period: hour}
          error_after: {count: 24, period: hour}
```
Run a check:
```shell
dbt source freshness --select source:raw.orders
```

If the newest `_ingested_at` is older than 12 hours, you get a warning; older than 24 hours, an error.
Example 2 — Filter for large/partitioned tables
For very large tables or late-arriving rows, limit the scan window so checks stay fast and fair:
```yaml
freshness:
  warn_after: {count: 6, period: hour}
  error_after: {count: 12, period: hour}
  filter: "_ingested_at >= (current_timestamp - interval '3 day')"
```
The filter restricts the rows considered in the freshness calculation.
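With a `filter`, the aggregate scan is bounded to a recent window. Conceptually (again a sketch using the example's names, not the adapter's exact SQL), the filter is appended to the `where` clause of the freshness query:

```sql
select
    max(_ingested_at) as max_loaded_at,
    current_timestamp as snapshotted_at
from raw.orders
where _ingested_at >= (current_timestamp - interval '3 day')
```

On partitioned tables, filtering on the partition column lets the warehouse prune partitions instead of scanning the full history.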
Example 3 — Using a source in a model
```sql
-- models/stg_orders.sql
with src as (
    select *
    from {{ source('raw', 'orders') }}
)
select * from src
```
Always use `source()` for raw tables (not `ref()`). This keeps lineage and documentation correct.
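At compile time, `source()` resolves to the fully qualified relation declared in the YAML. With the example configuration, and assuming a hypothetical target database named `analytics`, it would render roughly as:

```sql
-- {{ source('raw', 'orders') }} compiles to something like:
select * from analytics.raw.orders
```

Exact quoting of the identifiers varies by adapter, as noted in the gotchas below.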
Small gotchas to watch
- Use ingestion timestamps, not business event timestamps, for `loaded_at_field`
- Time zones: pick a consistent convention; your warehouse time is the reference
- Case sensitivity/quoting varies by adapter; align with how objects were created
How to implement (step-by-step)
- Create or update `models/sources.yml` with your source and tables
- Pick a `loaded_at_field` that truly reflects when data landed
- Set `warn_after` and `error_after` to match business SLAs
- Optionally add a `filter` to restrict the scan window
- Reference sources in models via `source()`
- Schedule `dbt source freshness` to run before model builds
- Fail the job on freshness errors; notify on warnings as needed
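The last two steps can be sketched as a gated two-command job. In recent dbt versions, `dbt source freshness` exits non-zero when any source crosses its `error_after` threshold, so the shell `&&` acts as the gate (a minimal sketch; your scheduler's syntax will differ):

```shell
# Run freshness checks first; if any source errors, skip the build
dbt source freshness && dbt build
```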
Common mistakes and self-checks
- Mistake: Using an event timestamp for `loaded_at_field`
  Fix: Use the ingestion/landing timestamp.
- Mistake: Thresholds too strict, constant false alarms
  Fix: Start with realistic SLAs; tune with historical arrival patterns.
- Mistake: Forgetting to run freshness checks in CI/scheduler
  Fix: Add a pre-step job: `dbt source freshness`.
- Mistake: No filter on massive partitions, slow checks
  Fix: Add a rolling-window `filter` (e.g., last 3–7 days).
- Mistake: Hardcoding database/schema that differ by environment
  Fix: Use `{{ target.database }}` and environment-appropriate schema configs.
Self-check
- Does your source YAML compile? Run `dbt parse`.
- Does the freshness check run within SLA time? If not, adjust the thresholds or the filter.
- Do models use `source()` (not `ref()`) for raw tables?
Exercises
Complete these to solidify the skill. The same tasks are listed below with solutions hidden. You can take the Quick Test afterwards.
Exercise 1 — Declare a source with freshness
Create models/sources.yml declaring a source raw with tables orders and customers. Use _ingested_at as loaded_at_field. Set warn at 12 hours and error at 24 hours. Then run a freshness check for orders.
- Expected: The YAML compiles; running freshness returns success, warning, or error depending on your data.
Solution
```yaml
version: 2

sources:
  - name: raw
    database: "{{ target.database }}"
    schema: raw
    tables:
      - name: orders
        loaded_at_field: _ingested_at
        freshness:
          warn_after: {count: 12, period: hour}
          error_after: {count: 24, period: hour}
      - name: customers
        loaded_at_field: _ingested_at
        freshness:
          warn_after: {count: 12, period: hour}
          error_after: {count: 24, period: hour}
```
```shell
dbt source freshness --select source:raw.orders
```

Exercise 2 — Add a filter for performance
For table orders, add a freshness filter to limit the scan to 3 days. Keep thresholds at warn 6h / error 12h. Run the freshness check again.
- Expected: The check completes quickly and only evaluates rows in the last 3 days.
Solution
```yaml
version: 2

sources:
  - name: raw
    database: "{{ target.database }}"
    schema: raw
    tables:
      - name: orders
        loaded_at_field: _ingested_at
        freshness:
          warn_after: {count: 6, period: hour}
          error_after: {count: 12, period: hour}
          filter: "_ingested_at >= (current_timestamp - interval '3 day')"
```
```shell
dbt source freshness --select source:raw.orders
```

Checklist before you move on
- I can declare sources in YAML and reference them with `source()`
- I can set realistic `warn_after` and `error_after` thresholds
- I can use `filter` to keep checks fast and fair
- I can schedule and interpret `dbt source freshness` results
Practical projects
- Project 1: Add freshness checks to two critical sources; set warn/error thresholds aligned with stakeholders; create a notification rule in your scheduler.
- Project 2: Build a staging model layer that references only `source()` nodes; ensure every model depends on a freshness check pre-step.
- Project 3: Create a small dashboard showing source freshness status over time using your BI tool of choice.
Learning path
- Before: Models and the `ref()` function; Configs and materializations
- Now: Sources and Freshness
- Next: Tests and documentation; Exposures and SLAs; Scheduling and CI
Next steps
- Set up a scheduled job: freshness check, then model run
- Share SLAs with stakeholders and adjust thresholds after a week of observations
- Take the Quick Test below to confirm mastery
Mini challenge
Your events source lands hourly but occasionally backfills last week's rows. Propose a freshness `filter` and thresholds that avoid false errors while still catching outages within 2 hours. Write the YAML snippet and one sentence explaining your choice.