Who this is for
Analytics Engineers, BI Developers, and Data Leads who need clear, reusable documentation for raw data sources and an explicit ownership model to keep data reliable.
Prerequisites
- Basic SQL and understanding of your warehouse schemas.
- Familiarity with your ingestion tools (e.g., ELT/ETL connectors).
- Working knowledge of data modeling terms: grain, lineage, PII, SLA.
Why this matters
In real work, most downstream data issues come from unclear inputs. A good source definition and explicit ownership prevent firefighting by making:
- Ingestion changes predictable (owners approve and communicate).
- Breakages traceable (you know who to contact and what to check).
- Compliance manageable (PII and retention are documented at the source).
Typical tasks you will face:
- Document a new connector (e.g., Salesforce, Ads, Payments) with SLAs and grain.
- Decide who owns freshness when a table stops updating.
- Review a PR introducing new raw tables; verify naming, fields, and lineage are correct.
Concept explained simply
A source definition is a single page that answers: What is this data, where does it come from, how reliable is it, how often does it update, who owns it, and what happens if it changes or breaks?
Ownership means one clearly named person or team is accountable for a scope: they approve changes, maintain SLAs, and are the escalation point.
Mental model
Think of your data platform like a city’s water system. A source definition is the blueprint and service contract for an intake pipe: its capacity (volume), purity (quality), schedule (SLA), and the maintenance team (owner). If taps run dry, you check the intake pipe’s blueprint first.
Mini glossary
- Grain: the level of detail (e.g., one row per ad account per day).
- Freshness SLA: how soon after real-world events the data is expected to land (e.g., within 2 hours).
- Lineage: where data comes from and what processes move it.
- DRI: Directly Responsible Individual (single accountable owner).
What to include in a source definition
Use this minimal yet complete structure:
{
"name": "source_system.table (business name)",
"purpose": "What business question this source enables",
"grain": "1 row = ?",
"schema": "warehouse.schema.table",
"fields": [
{"name": "field", "type": "TYPE", "meaning": "plain-English" , "pii": "none|sensitive|restricted"}
],
"refresh": {"schedule": "cron/interval", "freshness_sla": "e.g., <= 2h", "late_threshold": "when to alert"},
"quality": {"checks": ["row_count not null", "id unique", "loaded_at recent"], "known_gaps": ["ads API lags up to 24h"]},
"lineage": {"source_system": "e.g., Salesforce", "ingestion": "e.g., Fivetran", "destination": "schema.table"},
"ownership": {"dri": "Name", "producer_team": "who creates data", "platform_team": "who runs pipelines", "analytics_contact": "AE/BI contact", "escalation": "how to page"},
"change_management": {"breaking_change_process": "how to propose/approve", "deprecation": "policy & timeline"},
"security": {"pii": true/false, "retention": "e.g., 24 months", "access_tier": "open/restricted"}
}
- Keep it in your documentation system near your source models.
- One source = one owner (DRI); many contributors allowed.
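A minimal sketch of how the quality and refresh entries become executable checks; each query returns rows only when a check fails. Here warehouse.schema.table, id, and loaded_at are placeholders from the template, and the interval syntax is Postgres/Snowflake-flavoured:
-- Volume check: the table should not be empty after a load
select count(*) as row_count
from warehouse.schema.table
having count(*) = 0;
-- Uniqueness check: lists any duplicated keys
select id, count(*) as n
from warehouse.schema.table
group by id
having count(*) > 1;
-- Freshness check: fires when the newest load is older than the SLA (2h here)
select max(loaded_at) as last_load
from warehouse.schema.table
having max(loaded_at) < current_timestamp - interval '2 hours';
Run these on the same schedule as the connector and route failures to the escalation contact named under ownership.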
Tip: Minimum viable source doc
- Grain
- Freshness SLA
- Lineage (system → tool → table)
- DRI + escalation
- PII flag
Ownership model (RACI-lite)
- DRI (Accountable) — single person named. Approves breaking changes, owns SLAs, signs off on deprecation.
- Responsible — engineers/operators who implement ingestion and fixes.
- Consulted — data consumers affected by changes (Finance, Marketing, Product).
- Informed — broader stakeholders notified of timelines and incidents.
Ownership boundaries
- Producer ownership: correctness of what leaves the source system (e.g., Salesforce field logic).
- Platform ownership: reliability of connectors, schedules, credentials.
- Analytics ownership: contracts, expectations, and downstream compatibility.
Change and incident workflow
- Propose: Open a change request with risk, timeline, and test plan.
- Review: DRI checks grain, fields, PII, SLA impact, and alerts plan.
- Communicate: Notify Consulted/Informed with effective date and backfill approach.
- Implement: Ship behind flags when possible (see the sketch after this list); monitor checks.
- Post-incident: Add known gaps, adjust SLAs or checks, document lessons.
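Flags are rarely available on raw tables themselves; a common lightweight substitute for the Implement step is a compatibility view that keeps the old field name alive through the notice period. A sketch with hypothetical names (raw_source.orders, with amount renamed to amount_usd by the producer):
-- Downstream models read from the view, so the old name keeps working until the deprecation date
create or replace view raw_source.orders_v as
select
  o.*,                      -- new schema, including amount_usd
  o.amount_usd as amount    -- deprecated alias; drop after the agreed notice period
from raw_source.orders as o;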
Worked examples
Example 1 — Marketing Ads Spend (daily)
{
"name": "ads_platform.ad_insights (Marketing Ad Spend)",
"purpose": "Daily spend and clicks by account and campaign",
"grain": "1 row = account_id, campaign_id, date",
"schema": "raw_marketing.ad_insights",
"fields": [
{"name": "date", "type": "DATE", "meaning": "reporting day", "pii": "none"},
{"name": "account_id", "type": "STRING", "meaning": "ads account", "pii": "none"},
{"name": "spend", "type": "NUMERIC", "meaning": "cost in account currency", "pii": "none"}
],
"refresh": {"schedule": "hourly", "freshness_sla": "<= 4h", "late_threshold": "6h"},
"quality": {"checks": ["date not null", "(account_id,campaign_id,date) unique", "loaded_at < now() - 6h"], "known_gaps": ["API backfills last 28 days only"]},
"lineage": {"source_system": "Facebook Ads API", "ingestion": "ELT connector", "destination": "raw_marketing.ad_insights"},
"ownership": {"dri": "A. Chen", "producer_team": "Marketing Ops", "platform_team": "Data Platform", "analytics_contact": "AE: J. Patel", "escalation": "#data-incidents"},
"change_management": {"breaking_change_process": "RFC + approval by DRI", "deprecation": "60-day notice"},
"security": {"pii": false, "retention": "36 months", "access_tier": "open"}
}
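The checks above map to two small SQL assertions worth automating first: grain uniqueness and the 6-hour late threshold. A hedged sketch (loaded_at is the connector's load timestamp referenced in the checks; each query returns rows only on failure):
-- Grain check: each (account_id, campaign_id, date) should appear exactly once
select account_id, campaign_id, date, count(*) as n
from raw_marketing.ad_insights
group by account_id, campaign_id, date
having count(*) > 1;
-- Late-data check: fires once the latest load crosses the 6h alert threshold
select max(loaded_at) as last_load
from raw_marketing.ad_insights
having max(loaded_at) < current_timestamp - interval '6 hours';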
Example 2 — CRM Opportunities (near-real time)
{
"name": "salesforce.opportunity (CRM Opportunities)",
"purpose": "Sales pipeline and bookings",
"grain": "1 row = opportunity_id",
"schema": "raw_crm.opportunity",
"fields": [
{"name": "opportunity_id", "type": "STRING", "meaning": "SFDC id", "pii": "none"},
{"name": "amount", "type": "NUMERIC", "meaning": "currency in USD", "pii": "none"},
{"name": "close_date", "type": "DATE", "meaning": "expected close", "pii": "none"}
],
"refresh": {"schedule": "15 min", "freshness_sla": "<= 30 min", "late_threshold": "45 min"},
"quality": {"checks": ["opportunity_id unique", "row_count delta >= -20%"], "known_gaps": ["deleted records appear after 24h"]},
"lineage": {"source_system": "Salesforce", "ingestion": "CDC", "destination": "raw_crm.opportunity"},
"ownership": {"dri": "M. Rivera", "producer_team": "RevOps", "platform_team": "Data Platform", "analytics_contact": "AE: D. Lin", "escalation": "pager rotation"},
"change_management": {"breaking_change_process": "RevOps change request + AE signoff", "deprecation": "30-day notice"},
"security": {"pii": false, "retention": "indefinite", "access_tier": "restricted"}
}
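The row_count delta check guards against CDC silently dropping a large slice of records. One way to express it, assuming a hypothetical audit table ops.row_count_log that the pipeline writes after each sync:
-- Alert when today's count drops more than 20% versus yesterday's logged count
with today as (
  select count(*) as cnt
  from raw_crm.opportunity
),
yesterday as (
  select row_count as cnt
  from ops.row_count_log            -- hypothetical: (table_name, captured_at, row_count)
  where table_name = 'raw_crm.opportunity'
    and captured_at >= current_date - 1
    and captured_at < current_date
  order by captured_at desc
  limit 1
)
select t.cnt as today_rows, y.cnt as yesterday_rows
from today t
cross join yesterday y
where (t.cnt - y.cnt) * 1.0 / y.cnt < -0.20;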
Example 3 — Finance Forecasts (monthly spreadsheets)
{
"name": "finance.forecast_uploads (Finance Forecast XLS)",
"purpose": "Budget vs actuals reporting",
"grain": "1 row = cost_center, month",
"schema": "raw_finance.forecast_uploads",
"fields": [
{"name": "month", "type": "DATE", "meaning": "first day of month", "pii": "none"},
{"name": "cost_center", "type": "STRING", "meaning": "department code", "pii": "none"},
{"name": "amount", "type": "NUMERIC", "meaning": "forecasted spend", "pii": "none"}
],
"refresh": {"schedule": "monthly", "freshness_sla": "by 5th of month", "late_threshold": "7th of month"},
"quality": {"checks": ["month not null", "amount >= 0"], "known_gaps": ["manual uploads may have header typos"]},
"lineage": {"source_system": "Manual XLS via S3 drop", "ingestion": "batch loader", "destination": "raw_finance.forecast_uploads"},
"ownership": {"dri": "F. Nwosu", "producer_team": "FP&A", "platform_team": "Data Platform", "analytics_contact": "AE: R. Gupta", "escalation": "email on-call list"},
"change_management": {"breaking_change_process": "Template RFC + sample file", "deprecation": "quarterly review"},
"security": {"pii": false, "retention": "60 months", "access_tier": "restricted"}
}
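Calendar-based SLAs like "by the 5th" need a check that stays quiet until the deadline passes. A sketch in Postgres/Snowflake-style SQL (date functions vary by warehouse); it returns a row only when the current month's upload is still missing after the 5th:
-- Monthly freshness check for the "by 5th of month" SLA
select 'raw_finance.forecast_uploads' as source, current_date as checked_on
from (
  select count(*) as cnt
  from raw_finance.forecast_uploads
  where month = date_trunc('month', current_date)
) t
where t.cnt = 0
  and extract(day from current_date) > 5;   -- escalate via paging once the 7th passes (late_threshold)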
Exercises
Exercise 1 — Draft a source definition for Product Events (Snowplow)
This mirrors the graded exercise below. Create a concise source definition for a Snowplow events table that powers product analytics.
Requirements
- Table: raw_product.snowplow_events
- Grain: one row per event (unique event_id)
- Important fields: event_id, collector_tstamp, user_id, event_name
- Refresh: streaming; SLA within 10 minutes
- Lineage: Snowplow collector → enrichment → warehouse
- Ownership: DRI = Product Data Lead; platform = Data Platform
- Security: user_id may be PII depending on implementation (mark as sensitive)
Tasks
- Write it using the template above (keep it brief).
- Add 3 quality checks you would monitor.
- Describe how you would alert when freshness exceeds 15 minutes.
Self-check checklist
- Grain states the unique key clearly.
- Freshness SLA and late threshold are explicit.
- PII classification for user_id is present.
- DRI is a single named person.
- At least one uniqueness or not-null check included.
Common mistakes and how to self-check
- No single owner: Multiple names means no one is accountable. Fix: pick one DRI and add consulted contacts.
- Missing grain: Without row-level definition, downstream joins break. Fix: state the composite key in plain English.
- Vague SLAs: "Hourly-ish" is not an SLA. Fix: specify a measurable threshold and alerting rule.
- Ignoring PII: Fields like emails slip through. Fix: mark PII and set access tier.
- No change process: Breaking changes shipped silently. Fix: add a lightweight RFC and notice period.
Self-audit in 3 minutes
- Can a new teammate tell when data is late and who to ping?
- Are unique keys and not-null checks written down?
- Would a field rename be caught by review before breaking models?
Practical projects
- Document 5 critical sources in your stack using the template; add owners and SLAs.
- Implement freshness and uniqueness checks for those sources; record known gaps.
- Create a one-page change/incident playbook and link it in every source doc.
Learning path
- Start with 1-2 high-impact sources (CRM, Payments) and write MVP docs.
- Add freshness checks and a clear DRI for each source.
- Expand fields documentation and security notes (PII, retention).
- Introduce a simple RFC template for breaking changes.
- Roll out to long tail sources; schedule periodic doc reviews.
Next steps
- Complete the exercise and compare with the solution.
- Take the quick test below to check your understanding.
- Pick one production source and ship a doc + checks today.
Mini challenge
In five sentences, describe how you would communicate a breaking field rename in a core source: who is informed, who approves, when it ships, and how you monitor afterward.