Why this matters
Data that is late, broken, or ownerless erodes trust. Ownership and SLAs give your organization clear accountability and measurable expectations for data products. As a Data Engineer, you will define who owns a dataset or pipeline, agree on service levels, and respond when things go wrong.
- Real tasks you will do: assign data product owners, define SLIs/SLOs/SLAs, set escalation paths, document runbooks, and monitor SLA compliance.
- Impacts: reliable dashboards, predictable ML features, fewer fire drills, and faster incident resolution.
Concept explained simply
Think of a data pipeline as a service. A service has an owner, a promise (SLA), and a scoreboard (SLIs/SLOs) to check if the promise is kept.
- SLI (Service Level Indicator): a measurable metric (e.g., table freshness in minutes).
- SLO (Service Level Objective): the target for the SLI (e.g., freshness ≤ 60 minutes for 95% of loads per month).
- SLA (Service Level Agreement): the commitment and response plan communicated to stakeholders (e.g., daily sales table available by 06:00 UTC with 99.5% monthly success; if missed, notify finance by 06:15 UTC and run hotfix).
- Ownership: a clearly named person/team accountable for the service, including decisions, communication, and incident handling.
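To make the scoreboard idea concrete, here is a minimal sketch of checking a freshness SLI against an SLO. The `FreshnessSlo` record, the function names, and the sample month are assumptions made for illustration; they do not come from any particular monitoring tool.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FreshnessSlo:
    """Hypothetical SLO record: SLI threshold plus the required share of good loads."""
    max_freshness_min: float  # a load is "good" if its freshness is at or under this
    target_ratio: float       # e.g. 0.95 means 95% of loads must be good


def attainment(freshness_minutes: list[float], slo: FreshnessSlo) -> float:
    """Fraction of loads whose freshness SLI met the threshold."""
    if not freshness_minutes:
        raise ValueError("no loads observed in the window")
    good = sum(1 for minutes in freshness_minutes if minutes <= slo.max_freshness_min)
    return good / len(freshness_minutes)


def slo_met(freshness_minutes: list[float], slo: FreshnessSlo) -> bool:
    """True when the observed attainment reaches the SLO target."""
    return attainment(freshness_minutes, slo) >= slo.target_ratio


if __name__ == "__main__":
    slo = FreshnessSlo(max_freshness_min=60, target_ratio=0.95)
    month = [30.0] * 18 + [75.0, 90.0]  # made-up sample: 20 loads, 2 of them late
    print(attainment(month, slo), slo_met(month, slo))
```

The SLI is the per-load measurement, the SLO is the pair of numbers in `FreshnessSlo`, and the SLA is what you commit to doing when `slo_met` returns False.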
Glossary: Owner vs Steward vs Custodian; SLA vs SLO vs SLI
- Owner: accountable for outcomes and communication.
- Steward: ensures data meaning, definitions, and usage policy.
- Custodian: operates infrastructure (platform team) and access.
- SLI: metric you measure (e.g., delivery success rate).
- SLO: target for the SLI (e.g., 99.5% monthly).
- SLA: the public promise + what happens when it is not met.
Mental model
Use the contract-and-scoreboard model:
- Contract: Who owns it, what is promised, when to escalate, and how to communicate.
- Scoreboard: SLIs you track continuously with alerting and monthly reviews.
Worked examples
Example 1: Daily Finance Snapshot (batch pipeline)
- SLIs: freshness (time data is available), delivery success rate, row completeness.
- SLOs: available by 06:00 UTC on business days, with 99.5% monthly delivery success; completeness ≥ 99.9% of expected rows.
- SLA (commitment): If not available by 06:00 UTC, notify Finance channel by 06:15 UTC, provide ETA, and issue a backfill by 08:00 UTC.
- Escalation: page the on-call DE within 15 min of the breach; if unresolved after 45 min, escalate to the DE manager; a breach lasting 2 hours escalates to the Head of Data.
- Runbook action: check orchestrator run status, re-run failed task, validate row counts vs last good load, send incident update.
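The freshness and delivery-success SLIs from Example 1 could be computed along these lines. This is a sketch under stated assumptions: the arrival timestamps are supplied by you (for instance from orchestrator run history), they are timezone-aware, and the function names are invented for this example.

```python
from datetime import date, datetime, time, timezone
from typing import Optional

DEADLINE_UTC = time(6, 0)  # the 06:00 UTC target from the example SLO


def deadline_for(business_day: date) -> datetime:
    """The moment by which that day's snapshot must be available."""
    return datetime.combine(business_day, DEADLINE_UTC, tzinfo=timezone.utc)


def delivered_on_time(business_day: date, arrival: Optional[datetime]) -> bool:
    """arrival is a timezone-aware datetime, or None if the load never landed."""
    return arrival is not None and arrival <= deadline_for(business_day)


def monthly_success_rate(arrivals: dict[date, Optional[datetime]]) -> float:
    """Share of business days delivered on time; keys are the expected business days."""
    if not arrivals:
        raise ValueError("no business days in the window")
    on_time = sum(delivered_on_time(day, arrival) for day, arrival in arrivals.items())
    return on_time / len(arrivals)
```

Keying the input by expected business day, rather than by the runs that happened, means a day with no run at all still counts as a miss.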
Example 2: Streaming Click Events (near real-time)
- SLIs: end-to-end latency (p95), message loss rate, schema compatibility incidents.
- SLOs: p95 latency ≤ 120s during 08:00–22:00 local; message loss < 0.01% monthly; zero breaking schema changes.
- SLA: If p95 > 120s for > 10 min, notify Analytics Eng; if loss ≥ 0.01%, pause consumers and enable replay.
- Escalation: Platform on-call after 10 min sustained breach; joint call if breach lasts 30 min.
- Runbook: scale consumer group, check broker partitions, replay from last safe offset, verify lag drops.
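For the streaming example, the "p95 > 120s for > 10 min" condition can be sketched as below. It assumes you already collect one p95 value per one-minute window; the nearest-rank percentile method and the function names are choices made for this illustration, not something your metrics system necessarily uses.

```python
def p95(latencies_s: list[float]) -> float:
    """95th percentile by the nearest-rank method (no interpolation)."""
    if not latencies_s:
        raise ValueError("no latency samples")
    ordered = sorted(latencies_s)
    rank = (95 * len(ordered) + 99) // 100  # ceil(0.95 * n) in integer arithmetic
    return ordered[rank - 1]


def sustained_breach(
    p95_per_minute: list[float],
    threshold_s: float = 120.0,
    min_minutes: int = 10,
) -> bool:
    """True if p95 stayed above the threshold for more than min_minutes in a row."""
    streak = 0
    for value in p95_per_minute:
        streak = streak + 1 if value > threshold_s else 0
        if streak > min_minutes:
            return True
    return False
```

Requiring a consecutive streak keeps a single slow minute from paging anyone, which matches the "sustained breach" wording in the escalation rule.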
Example 3: ML Feature Store (hourly features)
- SLIs: feature freshness, materialization success rate, drift checks pass rate.
- SLOs: freshness ≤ 60 min for 99% of hourly runs; success rate ≥ 99.7%; drift alerts ≤ 2 per month.
- SLA: Missed freshness triggers fallback to previous hour features and alerts model owners.
- Escalation: Feature pipeline owner (primary), then ML engineer, then Data platform lead.
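The fallback in the SLA above might look like the following sketch. It assumes feature snapshots are keyed by the timezone-aware start of the hour they cover and that the caller sends the alert when the fallback flag is set; the `select_features` name and the snapshot layout are hypothetical.

```python
from datetime import datetime, timedelta


def select_features(now: datetime, snapshots: dict[datetime, dict]) -> tuple[dict, bool]:
    """Return (features, used_fallback) for the hour containing `now`.

    Serves the current hour's snapshot if it was materialized; otherwise
    falls back to the previous hour so the caller can alert model owners.
    """
    current_hour = now.replace(minute=0, second=0, microsecond=0)
    if current_hour in snapshots:
        return snapshots[current_hour], False
    previous_hour = current_hour - timedelta(hours=1)
    if previous_hour in snapshots:
        return snapshots[previous_hour], True
    raise LookupError("no snapshot for the current or previous hour")
```

Falling back only one hour is deliberate in this sketch: anything older raises instead of silently serving stale features.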
How to set ownership and SLAs (step-by-step)
- Identify the data product: name, purpose, criticality (Low/Medium/High).
- Assign a clear Owner (team + primary person) and backup.
- Define consumers and their needs (e.g., dashboard refresh time, training windows).
- Choose SLIs that matter: freshness, success rate, completeness, latency, schema stability.
- Set realistic SLO targets using past data; avoid 100% unless truly required.
- Write the SLA: commitment window, communication rules, escalation, and remediation.
- Operationalize: implement monitors, alerts, runbooks, and monthly review.
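When setting SLO targets, it helps to translate a percentage into the number of missed runs it tolerates. The helper below is a small sketch of that arithmetic; the function name is made up, and the target is passed as a string so the calculation stays exact.

```python
import math
from fractions import Fraction


def allowed_misses(target_percent: str, runs_in_window: int) -> int:
    """Whole number of failed runs a target tolerates over a window.

    target_percent is a string such as "99.5" so the arithmetic is exact
    rather than subject to binary floating-point rounding.
    """
    if runs_in_window < 0:
        raise ValueError("runs_in_window must be non-negative")
    target = Fraction(target_percent)
    if not 0 <= target <= 100:
        raise ValueError("target_percent must be between 0 and 100")
    return math.floor(runs_in_window * (100 - target) / 100)
```

For example, `allowed_misses("99.5", 22)` shows how many late days a 99.5% target leaves room for in a month of roughly 22 business days, and `allowed_misses("99.5", 1000)` does the same for a high-frequency job. Comparing that number with past misses is one way to judge whether a target is realistic.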
Copy-and-adapt SLA template
Data Product: finance_daily_snapshot
Owner: Data Engineering (Primary: A. Singh), Backup: J. Li
Consumers: Finance FP&A, Executive Dashboard
SLIs:
- Freshness: arrival time (UTC)
- Delivery success rate: % of days delivered on time
- Completeness: % of expected rows present
SLOs (monthly):
- Freshness: available by 06:00 UTC on business days, 99.5%
- Completeness: ≥ 99.9%
SLA (commitment):
- If freshness target missed: notify Finance by 06:15 UTC, share ETA, deliver backfill by 08:00 UTC.
Escalation:
- T+15 min: On-call DE
- T+45 min: DE Manager
- T+120 min: Head of Data
Runbook:
- Check orchestrator job, logs, upstream source status
- Re-run failed tasks; if source outage, switch to cached extract
- Validate counts and critical metrics; communicate status every 30 min
Review:
- Monthly: report SLA adherence, incidents, actions
Practical projects
- Project 1 — Ownership registry: Create a simple catalog (spreadsheet or metadata tool) listing data products with Owner, Backup, Criticality, SLA link, and contact channel.
- Project 2 — SLA monitors: Implement freshness and success-rate checks for one batch pipeline and one streaming pipeline. Add alerts and a weekly SLO report.
- Project 3 — Incident drill: Run a tabletop exercise for a simulated late delivery. Practice the runbook and communication updates; capture learnings.
Exercises
Complete these and compare with the solutions.
Exercise 1: Define SLIs/SLOs and an SLA for a daily pipeline
Scenario: Marketing needs the leads_enriched_daily table by 07:00 UTC on business days. Data comes from CRM and an enrichment API. The API occasionally rate-limits.
- Pick 3–4 SLIs that matter.
- Set monthly SLO targets that balance reliability and reality.
- Draft a 3–4 line SLA including who is notified, when, and how you recover.
Possible answer:
- SLIs: freshness (UTC availability), delivery success rate, enrichment failure rate, completeness (% leads with enrichment fields).
- SLOs: available by 07:00 UTC on 99.3% of business days; delivery success rate 99.3%; enrichment failure rate ≤ 2% monthly; completeness ≥ 98%.
- SLA: If the 07:00 UTC delivery is missed, notify Marketing Ops by 07:10 UTC and share an ETA; auto-retry the API up to 3x with exponential backoff; if it is still failing by 07:30 UTC, deliver a partial dataset flagged with enrichment_status, then backfill by 09:00 UTC.
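The retry step in this answer could be sketched as follows. `RateLimited`, `enrich_with_retry`, and the "enriched"/"pending" values for enrichment_status are all hypothetical; a real enrichment client will raise its own exception type, and "up to 3x" is read here as one initial call plus three retries.

```python
import time
from typing import Any, Callable, Optional


class RateLimited(Exception):
    """Stand-in for whatever error your API client raises when rate-limited."""


def enrich_with_retry(
    call: Callable[[], Any],
    max_retries: int = 3,
    base_delay_s: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> tuple[Optional[Any], str]:
    """Try once, then retry on RateLimited with delays of base, 2x base, 4x base, ...

    Returns (result, enrichment_status). When every attempt is rate-limited,
    the result is None and the status is "pending" so the row can be
    delivered as partial and backfilled later.
    """
    for attempt in range(max_retries + 1):
        try:
            return call(), "enriched"
        except RateLimited:
            if attempt == max_retries:
                break  # out of retries; fall through to the partial result
            sleep(base_delay_s * 2 ** attempt)
    return None, "pending"
```

Passing `sleep` as a parameter lets you test the backoff schedule without waiting in real time.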
Exercise 2: Design an escalation matrix
Scenario: A streaming pipeline’s p95 latency breaches target (> 120s) for 15 minutes during peak hours.
- Define who is paged first, second, and third, with time thresholds.
- Specify what each escalation level should check or do.
Possible answer:
- T+0 min: On-call Streaming DE — validate consumer lag, scale consumers +1, check broker health metrics.
- T+15 min sustained: Platform SRE — inspect partition hotspots, throttle offending producers, consider adding partitions.
- T+30 min sustained: Joint call with Analytics Eng lead — enable replay plan; communicate to stakeholders every 15 min.
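An escalation matrix like this one is easy to encode as data, so that alerting code and documentation cannot drift apart. The sketch below assumes the breach duration in minutes is supplied by your monitor; `ESCALATION` and `levels_to_page` are illustrative names.

```python
# (minutes of sustained breach, who is engaged at that point)
ESCALATION: list[tuple[int, str]] = [
    (0, "On-call Streaming DE"),
    (15, "Platform SRE"),
    (30, "Joint call with Analytics Eng lead"),
]


def levels_to_page(
    minutes_in_breach: float,
    matrix: list[tuple[int, str]] = ESCALATION,
) -> list[str]:
    """Everyone who should be engaged once the breach has lasted this long."""
    if minutes_in_breach < 0:
        raise ValueError("minutes_in_breach must be non-negative")
    return [who for threshold, who in matrix if minutes_in_breach >= threshold]
```

Returning every level reached, not only the latest one, reflects that earlier responders stay on the incident as it escalates.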
Quick checklist
- Owner and backup named with contact channel
- 3–5 SLIs chosen that reflect consumer value
- SLOs set using historical performance
- SLA includes communication and recovery steps
- Escalation matrix with time thresholds
- Runbook steps documented and tested
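Part of this checklist can be automated. The following sketch validates a contract record against the mechanical items (owner, backup, contact channel, SLI count, escalation thresholds, runbook). The `DataProductContract` fields are an assumed layout for illustration, and items that need human judgment, such as whether SLOs reflect historical performance, are left out.

```python
from dataclasses import dataclass


@dataclass
class DataProductContract:
    """Hypothetical record mirroring the fields of an SLA document."""
    name: str
    owner: str
    backup: str
    contact_channel: str
    slis: list[str]
    escalation: list[tuple[int, str]]  # (minutes after breach, who to engage)
    runbook_steps: list[str]


def contract_problems(contract: DataProductContract) -> list[str]:
    """Return a list of checklist violations; an empty list means none were found."""
    problems = []
    if not contract.owner.strip():
        problems.append("owner missing")
    if not contract.backup.strip():
        problems.append("backup missing")
    if not contract.contact_channel.strip():
        problems.append("contact channel missing")
    if not 3 <= len(contract.slis) <= 5:
        problems.append("expected 3 to 5 SLIs")
    thresholds = [minutes for minutes, _ in contract.escalation]
    if not thresholds:
        problems.append("escalation matrix missing")
    elif thresholds != sorted(set(thresholds)):
        problems.append("escalation thresholds must strictly increase")
    if not contract.runbook_steps:
        problems.append("runbook missing")
    return problems
```

A check like this could run whenever a contract is added or changed in your registry, so gaps are caught before an incident exposes them.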
Common mistakes and self-check
- Overpromising 100%: Aim for realistic SLOs. Self-check: Can you meet it given the worst week last quarter?
- Too many SLIs: Focus on 3–5 that consumers care about. Self-check: If this SLI is red, would you act?
- Ownerless data: Every product needs a single accountable owner. Self-check: Is there a named person and backup?
- No communication plan: Breaches happen; silence is worse. Self-check: Do you have a template for updates and cadence?
- Ignoring dependencies: Upstream SLAs matter. Self-check: Are upstream contracts documented and monitored?
- Set-and-forget SLOs: Review monthly. Self-check: Did you adjust targets after repeated breaches or sustained overperformance?
Mini challenge
Design SLIs/SLOs and an SLA for a weekly inventory_snapshot table used by supply chain on Mondays at 09:00 local time. Include an escalation path and a brief runbook.
Sample answer
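- SLIs: freshness (time the snapshot is available on Monday, local time); SKU completeness (% of expected SKUs present); reconciliation mismatch rate.
- SLOs: available by 09:00 local time on Mondays, 99.7% monthly on-time; SKU completeness ≥ 99.95%; mismatch rate ≤ 0.1%.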
- SLA: If late, notify Supply Chain by 09:10 with ETA; run reconciliation and publish "provisional" snapshot; backfill by 11:00.
- Escalation: T+10 min on-call DE; T+40 min DE manager; T+90 min Head of Ops.
- Runbook: verify upstream extracts, run reconciliation job, compare SKU counts to last week, publish provisional table with is_provisional flag.
Who this is for
- Data Engineers who own pipelines/tables and interface with business users.
- Team leads formalizing data product reliability.
Prerequisites
- Basic understanding of your orchestration tool and alerting stack.
- Ability to query logs/metrics and read pipeline run histories.
- Familiarity with the data products and their stakeholders.
Learning path
- List your top 5 critical data products with owners and consumers.
- Define SLIs and draft SLOs using 90 days of history.
- Write SLAs and an escalation matrix; review with stakeholders.
- Implement monitors and alerts; run an incident drill.
- Review monthly; adjust targets and runbooks.
Next steps
- Apply the SLA template to one batch and one streaming pipeline this week.
- Schedule a 30-minute review with consumers to align on SLOs.
- Take the quick test below to confirm understanding.