Who this is for
Analytics Engineers and BI Developers who use dbt to transform warehouse data and need reliable pipelines with clear SLAs for raw data arrival.
Prerequisites
- Basic dbt project setup and profiles configured
- Comfort with YAML and SQL
- Ability to run dbt commands from your terminal
Why this matters
In real teams, business metrics break when raw data stops arriving or arrives late. dbt sources document and test your upstream raw tables, while freshness checks tell you if data is on time. You will:
- Gate downstream model runs when raw data is stale
- Alert stakeholders when data lags SLAs
- Document raw lineage and ownership for fast troubleshooting
Typical tasks in the profession
- Declare raw data tables as sources and apply tests
- Set freshness thresholds (warn/error) and schedule checks
- Reference sources correctly in models using the `source()` function
- Optimize freshness checks with filters for partitioned data
Concept explained simply
A source in dbt is a raw table that dbt does not build but depends on. You declare it in YAML so you can reference it safely in models via {{ source('src_name','table_name') }}. A freshness check compares the most recent ingestion timestamp in that table to the current time. If it exceeds your thresholds, dbt warns or errors.
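Under the hood, a freshness check reduces to a single aggregate query. Conceptually (a sketch only; the exact SQL varies by adapter, and the table/column names follow the worked examples below) it looks like:

```sql
-- Conceptual sketch of a freshness check (not the adapter's exact SQL)
select
    max(_ingested_at) as max_loaded_at,   -- newest ingestion timestamp
    current_timestamp as snapshotted_at   -- "now" according to the warehouse
from raw.orders
```

dbt then compares the gap between `snapshotted_at` and `max_loaded_at` against your `warn_after` and `error_after` thresholds.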
Mental model
- Source YAML = contract for upstream data (where it lives, what it’s called, how fresh it should be)
- `loaded_at_field` = the column that best indicates when data landed in the warehouse (ingestion time)
- `warn_after` / `error_after` = SLA lines in the sand
- Freshness run = a quick health check before building models
Worked examples
Example 1 — Basic source with freshness
```yaml
version: 2

sources:
  - name: raw
    database: "{{ target.database }}"
    schema: raw
    tables:
      - name: orders
        description: Raw orders landed via ingestion service
        loaded_at_field: _ingested_at
        freshness:
          warn_after: {count: 12, period: hour}
          error_after: {count: 24, period: hour}
```
Run a check:
```shell
dbt source freshness --select source:raw.orders
```

If the newest `_ingested_at` is older than 12 hours, you get a warning; older than 24 hours, an error.
Example 2 — Filter for large/partitioned tables
For very large tables or late-arriving rows, limit the scan window so checks stay fast and fair:
```yaml
freshness:
  warn_after: {count: 6, period: hour}
  error_after: {count: 12, period: hour}
  filter: "_ingested_at >= (current_timestamp - interval '3 day')"
```
The filter restricts the rows considered in the freshness calculation.
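With a `filter`, the aggregate scan is bounded to a recent window. Conceptually (again a sketch using the example's names, not the adapter's exact SQL), the filter is appended to the `where` clause of the freshness query:

```sql
select
    max(_ingested_at) as max_loaded_at,
    current_timestamp as snapshotted_at
from raw.orders
where _ingested_at >= (current_timestamp - interval '3 day')
```

On partitioned tables, filtering on the partition column lets the warehouse prune partitions instead of scanning the full history.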
Example 3 — Using a source in a model
```sql
-- models/stg_orders.sql
with src as (
    select *
    from {{ source('raw', 'orders') }}
)
select * from src
```
Always use `source()` for raw tables (not `ref()`). This keeps lineage and documentation correct.
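At compile time, `source()` resolves to the fully qualified relation declared in the YAML. With the example configuration, and assuming a hypothetical target database named `analytics`, it would render roughly as:

```sql
-- {{ source('raw', 'orders') }} compiles to something like:
select * from analytics.raw.orders
```

Exact quoting of the identifiers varies by adapter, as noted in the gotchas below.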
Small gotchas to watch
- Use ingestion timestamps, not business event timestamps, for `loaded_at_field`
- Time zones: pick a consistent convention; your warehouse time is the reference
- Case sensitivity/quoting varies by adapter; align with how objects were created
How to implement (step-by-step)
- Create or update `models/sources.yml` with your source and tables
- Pick a `loaded_at_field` that truly reflects when data landed
- Set `warn_after` and `error_after` to match business SLAs
- Optionally add a `filter` to restrict the scan window
- Reference sources in models via `source()`
- Schedule `dbt source freshness` to run before model builds
- Fail the job on freshness errors; notify on warnings as needed
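The last two steps can be sketched as a gated two-command job. In recent dbt versions, `dbt source freshness` exits non-zero when any source crosses its `error_after` threshold, so the shell `&&` acts as the gate (a minimal sketch; your scheduler's syntax will differ):

```shell
# Run freshness checks first; if any source errors, skip the build
dbt source freshness && dbt build
```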
Common mistakes and self-checks
- Mistake: Using an event timestamp for `loaded_at_field`
  Fix: Use the ingestion/landing timestamp.
- Mistake: Thresholds too strict, constant false alarms
  Fix: Start with realistic SLAs; tune with historical arrival patterns.
- Mistake: Forgetting to run freshness checks in CI/scheduler
  Fix: Add a pre-step job: `dbt source freshness`.
- Mistake: No filter on massive partitions, slow checks
  Fix: Add a rolling-window `filter` (e.g., last 3–7 days).
- Mistake: Hardcoding database/schema that differ by environment
  Fix: Use `{{ target.database }}` and environment-appropriate schema configs.
Self-check
- Does your source YAML compile? Run `dbt parse`.
- Does the freshness check run within SLA time? If not, adjust the thresholds or the filter.
- Do models use `source()` (not `ref()`) for raw tables?
Exercises
Complete these to solidify the skill. The same tasks are listed below with solutions hidden. You can take the Quick Test afterwards.
Exercise 1 — Declare a source with freshness
Create models/sources.yml declaring a source raw with tables orders and customers. Use _ingested_at as loaded_at_field. Set warn at 12 hours and error at 24 hours. Then run a freshness check for orders.
- Expected: The YAML compiles; running freshness returns success, warning, or error depending on your data.
Solution
```yaml
version: 2

sources:
  - name: raw
    database: "{{ target.database }}"
    schema: raw
    tables:
      - name: orders
        loaded_at_field: _ingested_at
        freshness:
          warn_after: {count: 12, period: hour}
          error_after: {count: 24, period: hour}
      - name: customers
        loaded_at_field: _ingested_at
        freshness:
          warn_after: {count: 12, period: hour}
          error_after: {count: 24, period: hour}
```
```shell
dbt source freshness --select source:raw.orders
```

Exercise 2 — Add a filter for performance
For table orders, add a freshness filter to limit the scan to 3 days. Keep thresholds at warn 6h / error 12h. Run the freshness check again.
- Expected: The check completes quickly and only evaluates rows in the last 3 days.
Solution
```yaml
version: 2

sources:
  - name: raw
    database: "{{ target.database }}"
    schema: raw
    tables:
      - name: orders
        loaded_at_field: _ingested_at
        freshness:
          warn_after: {count: 6, period: hour}
          error_after: {count: 12, period: hour}
          filter: "_ingested_at >= (current_timestamp - interval '3 day')"
```
```shell
dbt source freshness --select source:raw.orders
```

Checklist before you move on
- I can declare sources in YAML and reference them with `source()`
- I can set realistic `warn_after` and `error_after` thresholds
- I can use `filter` to keep checks fast and fair
- I can schedule and interpret `dbt source freshness` results
Practical projects
- Project 1: Add freshness checks to two critical sources; set warn/error thresholds aligned with stakeholders; create a notification rule in your scheduler.
- Project 2: Build a staging model layer that references only `source()` nodes; ensure every model depends on a freshness check pre-step.
- Project 3: Create a small dashboard showing source freshness status over time using your BI tool of choice.
Learning path
- Before: Models and the `ref()` function; Configs and materializations
- Now: Sources and Freshness
- Next: Tests and documentation; Exposures and SLAs; Scheduling and CI
Next steps
- Set up a scheduled job: freshness check, then model run
- Share SLAs with stakeholders and adjust thresholds after a week of observations
- Take the Quick Test below to confirm mastery
Mini challenge
Your events source lands hourly but occasionally backfills last week's rows. Propose a freshness `filter` and thresholds that avoid false errors while still catching outages within 2 hours. Write the YAML snippet and one sentence explaining your choice.