Why this matters
As a Data Architect, you are accountable for trustworthy data products. Data Quality SLAs turn vague expectations ("data should be good") into measurable, automatic commitments. They protect downstream analytics, ML models, and operational decisions by defining what “good” means, how it’s measured, and what happens when standards aren’t met.
- Real task: Specify timeliness and completeness for a finance fact table so month-end close runs on time.
- Real task: Agree with a source team on a duplicate rate threshold for events and an escalation path if breached.
- Real task: Instrument monitors and dashboards that quantify data quality health using SLIs and SLOs.
Who this is for
- Data Architects and Data Engineers designing reliable data products.
- Data Stewards and Product Managers formalizing data contracts.
- Analytics Engineers maintaining tests and monitors for BI.
Prerequisites
- Basic understanding of data models (tables, schemas, pipelines).
- Familiarity with common data quality dimensions: completeness, accuracy, timeliness, consistency, uniqueness, validity.
- Basic monitoring/alerting concepts.
Concept explained simply
A Data Quality SLA is a written agreement that defines minimum quality targets for a data product and what to do if targets aren’t met.
- SLI (Service Level Indicator): The measurement. Example: percent of rows with non-null email.
- SLO (Service Level Objective): The target. Example: >= 99.0% per daily load.
- SLA (Service Level Agreement): The commitment and response. Example: If SLO is missed twice in 30 days, notify data steward within 1 hour and prioritize fix within 2 business days.
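To make the three layers concrete, here is a minimal Python sketch; the function and variable names are illustrative, not from any particular monitoring library.

```python
# Minimal illustration of the SLI -> SLO -> SLA layering.
# All names are hypothetical; no specific monitoring tool is assumed.

def sli_completeness_pct(rows_with_non_null_email: int, total_rows: int) -> float:
    """SLI: the measurement itself (percent of rows with non-null email)."""
    return 100.0 * rows_with_non_null_email / total_rows

def slo_met(sli_value: float, target: float = 99.0) -> bool:
    """SLO: the target the SLI is compared against (>= 99.0% per daily load)."""
    return sli_value >= target

def sla_action(misses_in_30_days: int) -> str:
    """SLA: the committed response when the SLO is missed."""
    if misses_in_30_days >= 2:
        return "notify data steward within 1 hour; prioritize fix within 2 business days"
    return "no action required"

sli = sli_completeness_pct(9_890, 10_000)   # 98.9%
print(sli, slo_met(sli), sla_action(misses_in_30_days=2))
```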
Mental model: Guardrails + Thermometer + Playbook
Think of:
- Guardrails (SLOs): The bounds the system should stay within.
- Thermometer (SLIs): The numbers showing current health.
- Playbook (SLA): Who does what when things go wrong.
Key components of a Data Quality SLA
- Scope: Data product name, owners, consumers.
- Dimensions & SLIs: What you measure (e.g., completeness %, duplicate rate, freshness minutes).
- Targets (SLOs): Thresholds, periods, and compliance windows.
- Measurement method: Exact formulas, sampling rules, and source of truth.
- Frequency: When metrics are evaluated (per run, hourly, daily, rolling 30 days).
- Alerting: Warning vs critical thresholds; how alerts are grouped or latched.
- Escalation & roles: On-call teams, stewards, business owners, response SLAs.
- Exceptions: Planned maintenance, holidays, known data gaps.
- Error budget: How many failures are allowed before intervention (e.g., 0.5% of runs per month); see the quick calculation after this list.
- Reporting: Dashboard or report fields and where they appear (by name, not links).
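As a quick illustration of the error-budget idea (the run counts here are hypothetical):

```python
import math

# Hypothetical error budget: how many failed runs does a 99.5% monthly
# success SLO actually allow for an hourly pipeline?
runs_per_month = 30 * 24                               # ~720 runs
slo_target = 0.995                                     # 99.5% of runs must succeed
error_budget = math.floor(runs_per_month * (1 - slo_target))
print(error_budget)                                    # 3 failed runs per month
```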
Example SLI formulas
```text
completeness_pct    = 100 * (rows_with_required_fields / total_rows)
duplicate_rate_pct  = 100 * (duplicate_keys / total_rows)
freshness_delay_min = minutes(now_utc - last_successful_load_time)
validity_pct        = 100 * (rows_passing_domain_rules / total_rows)
```
Worked examples
Example 1: Finance Daily Sales Timeliness
- Scope: Table fact_daily_sales; consumers: FP&A and dashboards.
- SLI: freshness_delay_min (minutes past 06:00 UTC).
- SLO: On 99.5% of business days per month, the table is fully loaded by 06:00 UTC.
- Measurement: Compare load_end_time to 06:00 UTC on business days from the corporate calendar.
- SLA: If delay > 30 minutes twice in 7 days, page on-call; root cause analysis (RCA) within 2 business days.
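A sketch of how the Example 1 check could be computed; the datetimes are stand-ins for what a real run log would supply.

```python
from datetime import datetime, timezone

# Hypothetical freshness check for fact_daily_sales (Example 1).
DEADLINE_HOUR_UTC = 6

def freshness_delay_min(load_end_time: datetime) -> float:
    """SLI: minutes past 06:00 UTC on the load's day (negative means early)."""
    deadline = load_end_time.replace(hour=DEADLINE_HOUR_UTC, minute=0,
                                     second=0, microsecond=0)
    return (load_end_time - deadline).total_seconds() / 60

# load_end_time per business day would come from the pipeline run log;
# two fabricated observations shown here.
loads = [
    datetime(2024, 3, 1, 5, 42, tzinfo=timezone.utc),   # 18 minutes early
    datetime(2024, 3, 4, 6, 35, tzinfo=timezone.utc),   # 35 minutes late
]
on_time = sum(1 for t in loads if freshness_delay_min(t) <= 0)
print(f"on-time rate: {100 * on_time / len(loads):.1f}% (SLO: 99.5% of business days)")
```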
Example 2: Marketing Leads Completeness
- Scope: Table dim_lead from CRM; consumers: Growth analytics.
- SLI: completeness_pct for columns email, acquisition_source.
- SLO: completeness_pct >= 99.0% per daily load; warn below 99.2%, critical below 99.0%.
- Measurement: Required fields not null; exclude test leads from denominator.
- SLA: If critical breach occurs, suppress email campaigns until fixed; steward notified within 1 hour.
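A sketch of the Example 2 measurement, including the test-lead exclusion from the denominator; the row data and `is_test` flag are assumptions for illustration.

```python
# Hypothetical completeness check for dim_lead (Example 2).
leads = [
    {"email": "a@x.com", "acquisition_source": "ads", "is_test": False},
    {"email": None,      "acquisition_source": "ads", "is_test": False},
    {"email": "t@x.com", "acquisition_source": None,  "is_test": True},  # excluded
]
REQUIRED = ("email", "acquisition_source")

real = [r for r in leads if not r["is_test"]]       # test leads out of denominator
complete = [r for r in real if all(r[f] is not None for f in REQUIRED)]
completeness_pct = 100 * len(complete) / len(real)

WARN, CRITICAL = 99.2, 99.0                         # thresholds from the SLO
if completeness_pct < CRITICAL:
    status = "critical: suppress email campaigns, notify steward within 1 hour"
elif completeness_pct < WARN:
    status = "warn"
else:
    status = "ok"
print(f"{completeness_pct:.1f}% -> {status}")
```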
Example 3: Product Events Uniqueness
- Scope: events table; consumers: product analytics and attribution.
- SLI: duplicate_rate_pct based on event_id uniqueness per day.
- SLO: duplicate_rate_pct <= 0.2% on a 7-day rolling basis.
- Measurement: Count duplicates per day; average over last 7 days.
- SLA: If rolling average > 0.2%, enable dedupe job and open ticket to SDK team; fix within 3 business days.
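A sketch of the Example 3 rolling check; the per-day counts are fabricated and would normally come from a daily GROUP BY on event_id.

```python
# Hypothetical 7-day rolling duplicate-rate check for the events table (Example 3).
# Each tuple is (duplicate_keys, total_rows) for one day.
daily_counts = [
    (12, 10_000), (8, 9_500), (25, 11_000), (15, 10_200),
    (30, 12_000), (10, 9_800), (22, 10_500),
]
daily_rates = [100 * dupes / total for dupes, total in daily_counts]
rolling_avg = sum(daily_rates[-7:]) / len(daily_rates[-7:])

SLO_MAX_PCT = 0.2
if rolling_avg > SLO_MAX_PCT:
    print(f"breach: {rolling_avg:.3f}% > {SLO_MAX_PCT}% -> enable dedupe job, open SDK ticket")
else:
    print(f"ok: {rolling_avg:.3f}% (within {SLO_MAX_PCT}% SLO)")
```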
Design an SLA in 30 minutes
- Define scope: Name the data product, owners, main consumers, and business purpose.
- Pick 2–4 critical dimensions: Usually timeliness, completeness, and one domain-specific rule.
- Write SLIs precisely: Include formulas and filters (e.g., exclude internal test traffic).
- Set SLOs: Numeric thresholds and period (per run or rolling window).
- Specify measurement & frequency: When, how, and where results are computed.
- Alerting & escalation: Thresholds, who gets notified, response time, and RCA expectations.
- Document exceptions: Holidays, planned maintenance, known data gaps.
- Assign ownership: Producer owner, consumer steward, and on-call rotation.
Copyable mini-template (YAML)
```yaml
product: fact_daily_sales
owners:
  producer: data-platform
  steward: finance-analytics
consumers: ["fp&a", exec-dashboard]
scope: Daily sales for financial reporting.
slis:
  - name: freshness_delay_min
    formula: minutes(now_utc - last_successful_load_time)
    slo: "<= 0 minutes by 06:00 UTC on 99.5% of business days per month"
  - name: completeness_pct
    formula: 100 * (rows_with_required_fields / total_rows)
    required_fields: [order_id, amount, currency]
    slo: ">= 99.8% per daily run"
measurement:
  frequency: daily
  business_calendar: corp_calendar
alerting:
  warn: when freshness_delay_min > 0
  critical: when freshness_delay_min > 30 or completeness_pct < 99.8
escalation:
  notify: on-call-data, steward-finance
  response_sla: acknowledge 15m, RCA 2 business days
exceptions:
  planned_outage: month_end_maintenance
reporting:
  dashboard_fields: [run_date, freshness_delay_min, completeness_pct, status]
```
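One way to consume a spec like this, sketched with PyYAML; the file name and metric values are stand-ins, and the threshold logic is hardcoded here rather than parsed from the spec's alerting expressions.

```python
import yaml  # PyYAML: pip install pyyaml

# Hypothetical evaluator for the SLA spec above.
with open("fact_daily_sales_sla.yaml") as f:
    spec = yaml.safe_load(f)

# Stand-in metric values; a real evaluator would read these from monitors.
metrics = {"freshness_delay_min": 12, "completeness_pct": 99.9}

# Threshold logic mirrors the spec's alerting section (hardcoded for clarity).
if metrics["freshness_delay_min"] > 30 or metrics["completeness_pct"] < 99.8:
    status = "critical"
elif metrics["freshness_delay_min"] > 0:
    status = "warn"
else:
    status = "ok"

print(spec["product"], status)
if status != "ok":
    print("notify:", spec["escalation"]["notify"])
```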
Exercises
Do these exercises to practice.
- Exercise 1: Draft a concise SLA for a customer_360 table.
- Exercise 2: Calculate SLIs and determine pass/fail vs SLOs.
Checklist before you finalize an SLA
- SLIs have explicit formulas and filters.
- SLOs include thresholds and periods.
- Alerting includes warn and critical levels.
- Escalation names people/teams and timelines.
- Exceptions are documented.
- Ownership is clear.
Common mistakes and self-check
- Vague metrics: “Fresh often.” Self-check: Can someone compute it with no extra info?
- Missing period: Targets must say “per run,” “per month,” or “rolling 7 days.”
- No exception policy: Leads to false alarms during planned downtime.
- Alert spam: No warn vs critical separation causes desensitization. Add thresholds and grouping.
- Wrong denominator: Include or exclude test rows consistently.
- Owner ambiguity: If it fails, who responds? Name teams and timelines.
Practical projects
- Create an SLA spec (YAML or JSON) for one data product you own, including SLIs, SLOs, and escalation.
- Instrument two SLIs (e.g., completeness and freshness) and produce a daily status table with pass/fail.
- Simulate a breach and run your escalation playbook, then write a short RCA note.
Mini challenge
You have a pipeline that occasionally arrives 10 minutes late. Your new SLO is “By 07:00 UTC on 99% of business days.” Draft two alert thresholds and a short escalation note (3 lines). Keep it concise and testable.
Learning path
- Define SLIs and SLOs for one critical data product.
- Implement monitors and daily/rolling computations.
- Set up alerting and escalation with owners.
- Iterate using error budgets and post-incident RCAs.
Next steps
- Apply the template to two more data products with different dimensions.
- Consolidate SLA results into a single quality dashboard.
- Review SLAs quarterly with stakeholders and adjust targets as the system stabilizes.
Quick Test
Take the quick test to check your understanding.