Why this matters
As a Data Platform Engineer, you run shared services (ingestion, storage, compute, orchestration, catalogs) that many teams depend on. Clear SLAs and reliability goals let you:
- Set expectations: what users can rely on and when.
- Prioritize engineering work: what to harden first.
- Detect and react to incidents before users do.
- Balance speed of change with stability using error budgets.
Real tasks you will do:
- Define SLOs for ingestion latency, batch data freshness, and job success rate.
- Implement SLIs via metrics and logs, wire alerts, and favor multi-window burn-rate alerting.
- Negotiate SLAs with stakeholders and document RTO/RPO for critical platform components.
- Report reliability weekly and adjust targets based on actual usage and pain points.
Concepts explained simply
- SLI (Service Level Indicator): the thing you measure. Example: percent of successful API requests, end-to-end pipeline lag, data freshness at 07:00 UTC.
- SLO (Service Level Objective): the target for the SLI over a time window. Example: 99.9% of API requests under 300 ms in a 30-day window.
- SLA (Service Level Agreement): a formal promise to customers, often with credits/penalties. Usually built from SLOs and more conservative.
- Error budget: 100% minus the SLO target. If SLO is 99.9% availability, error budget is 0.1% unavailability in the window.
- RTO/RPO (Disaster Recovery goals): RTO is how fast you must recover service; RPO is how much data you can afford to lose.
- Golden signals: latency, traffic, errors, saturation. For data platforms add freshness, completeness, job success rate, and schema stability.
Mental model
Think of reliability as a budget you spend. New releases, migrations, and peak loads “spend” the error budget. If you overspend, slow down changes and improve stability until the budget recovers.
How much downtime does 99.9% allow?
- 99.5% ≈ 3 h 39 m per month
- 99.9% ≈ 43 m 49 s per month
- 99.99% ≈ 4 m 23 s per month
These are rough, month-level approximations you can use to sanity-check goals.
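To verify these figures, or compute your own for a different window, here is a minimal sketch assuming an average month of about 30.44 days (the helper name and target list are illustrative):

```python
# Allowed downtime for a given SLO target over a window (illustrative helper).
def allowed_downtime_minutes(slo_percent: float, window_days: float = 30.44) -> float:
    """Minutes of downtime the error budget allows in the window."""
    error_budget_fraction = 1.0 - slo_percent / 100.0
    return error_budget_fraction * window_days * 24 * 60

for target in (99.5, 99.9, 99.99):
    minutes = allowed_downtime_minutes(target)
    print(f"{target}% -> {int(minutes)} m {round((minutes % 1) * 60)} s per month")
# 99.5%  -> ~219 m (about 3 h 39 m)
# 99.9%  -> ~43 m 50 s
# 99.99% -> ~4 m 23 s
```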
How to set platform SLOs
- Map critical user journeys: ingestion → processing → storage → serving. Identify who is impacted if each step fails.
- Choose SLIs per service: availability, latency, throughput, data freshness, job success, data quality pass rate.
- Pick targets and windows: monthly (28–31 days) is common; weekly for fast-moving services.
- Define measurement: write the exact query on metrics/logs. Example SLI: successful_requests / total_requests, counting any response with status < 500 as successful (see the sketch after this list).
- Alert on burn rates: short window to catch fast burns, long window to catch slow leaks.
- Set an error budget policy: release freezes, rollback criteria, incident review triggers.
- Review and report: publish reliability weekly, adjust targets only with evidence.
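To make "define measurement" concrete, here is a minimal sketch of the availability SLI above computed from raw request records. In practice this is usually a query in your metrics or logging system; the record shape and field name here are assumptions for illustration.

```python
# Availability SLI: successful_requests / total_requests, counting any
# response with status < 500 as successful (4xx are client errors).
# Records as dicts with a "status" field are an assumption for illustration.
from typing import Iterable, Mapping

def availability_sli(requests: Iterable[Mapping[str, int]]) -> float:
    total = 0
    good = 0
    for record in requests:
        total += 1
        if record["status"] < 500:
            good += 1
    return 1.0 if total == 0 else good / total

sample = [{"status": 200}, {"status": 404}, {"status": 503}, {"status": 201}]
print(f"availability = {availability_sli(sample):.3f}")  # 3/4 = 0.750
```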
Example burn-rate alerts
- Page immediately if 2-hour burn-rate > 14.4 (consumes a day’s budget in 100 minutes).
- Page if 6-hour burn-rate > 6.
- Ticket if 3-day burn-rate > 2.
Adjust the numbers to your SLO and on-call capacity; a minimal evaluation sketch follows.
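Burn rate is the observed error ratio divided by the error budget ratio (1 minus the SLO target). The sketch below implements the multi-window policy above; the per-window error ratios are assumed to come from your metrics backend.

```python
# Multi-window burn-rate check. Burn rate = observed error ratio / budget ratio.
# The per-window error ratios are assumed to come from your metrics backend.
SLO_TARGET = 0.999                   # 99.9% availability
BUDGET_RATIO = 1.0 - SLO_TARGET      # 0.001

def burn_rate(error_ratio: float) -> float:
    return error_ratio / BUDGET_RATIO

def evaluate(error_ratio_by_window_h: dict) -> list:
    """Return (severity, window_hours, burn_rate) for each policy breach."""
    policy = [("page", 2, 14.4), ("page", 6, 6.0), ("ticket", 72, 2.0)]
    breaches = []
    for severity, window_h, threshold in policy:
        rate = burn_rate(error_ratio_by_window_h[window_h])
        if rate > threshold:
            breaches.append((severity, window_h, round(rate, 1)))
    return breaches

# A fast burn visible in the 2 h window but not (yet) in the longer windows:
print(evaluate({2: 0.02, 6: 0.004, 72: 0.001}))   # [('page', 2, 20.0)]
```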
Worked examples
1) Ingestion API availability
- SLI: availability = successful_requests / total_requests, where responses with status < 500 count as successful (4xx are client errors and do not count against availability).
- SLO: 99.9% monthly.
- Alerting: burn-rate policy on 2 h, 6 h, 3 d windows.
- Implementation: health checks, autoscaling, multi-AZ, idempotent writes with retries/backoff.
What downtime fits 99.9%?
About 43 minutes per 30-day month. Plan maintenance windows within the budget, or declare them as planned outages and exclude them from the SLO if your policy allows.
2) Batch data freshness
- SLI: percent of critical tables whose latest partition date equals current_date() by 07:00 UTC (see the measurement sketch after this example).
- SLO: ≥ 95% of critical tables fresh by 07:00 UTC across the last 30 days.
- Alerting: page if today’s projection at 06:30 falls below 95%; ticket if 7-day rolling < 98%.
- Implementation: DAG-level retries, backfills, resource reservations, data quality gates before publish.
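A minimal sketch of the 07:00 UTC freshness check; the table-to-partition map and the table names are hypothetical stand-ins for a catalog or warehouse query.

```python
# Freshness SLI: share of critical tables whose latest partition equals
# today's date at the 07:00 UTC checkpoint. The partition map below is a
# hypothetical stand-in for a catalog or warehouse query.
from datetime import date

def freshness_sli(latest_partition_by_table: dict, today: date) -> float:
    fresh = sum(1 for d in latest_partition_by_table.values() if d == today)
    return fresh / len(latest_partition_by_table)

today = date(2024, 5, 20)                  # "current_date()" at the check
partitions = {
    "orders":   date(2024, 5, 20),
    "payments": date(2024, 5, 20),
    "sessions": date(2024, 5, 19),         # late: newest partition is yesterday's
}
print(f"fresh = {freshness_sli(partitions, today):.0%}")   # 67% -> page, per the policy
```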
3) Streaming pipeline lag
- SLI: consumer_lag_p99 in seconds.
- SLO: p99 lag ≤ 120 s for ≥ 99% of minutes per month (a good-minutes measurement sketch follows this example).
- Alerting: page if lag burn-rate indicates budget depletion within 4 hours.
- Implementation: backpressure, partition rebalancing, autoscaling consumers, schema evolution tests.
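The lag SLO above is easiest to track as good minutes: a minute counts as good when p99 consumer lag stays at or under 120 s. A minimal sketch, assuming per-minute p99 lag samples already come from your monitoring system:

```python
# "Good minutes" for the streaming lag SLO: a minute is good when the p99
# consumer lag is <= 120 s. Per-minute p99 samples are assumed as input.
LAG_THRESHOLD_S = 120
TARGET_GOOD_MINUTE_RATIO = 0.99

def good_minute_ratio(p99_lag_per_minute_s: list) -> float:
    good = sum(1 for lag in p99_lag_per_minute_s if lag <= LAG_THRESHOLD_S)
    return good / len(p99_lag_per_minute_s)

samples = [15, 40, 95, 180, 60, 30, 240, 50, 45, 20]    # seconds, one per minute
ratio = good_minute_ratio(samples)
print(f"good minutes: {ratio:.0%} (target >= {TARGET_GOOD_MINUTE_RATIO:.0%})")
# 8/10 = 80% in this sample -> the monthly budget is burning fast
```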
4) Metastore DR goals
- RTO: 30 minutes. RPO: 5 minutes.
- Implementation: cross-region replicas, point-in-time recovery, automated failover playbooks, quarterly restore drills (drill sketch below).
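Restore drills are easier to compare quarter over quarter if the drill records RTO and RPO itself. A minimal sketch; run_restore and replication_lag_seconds are hypothetical placeholders for your own tooling.

```python
# Restore drill sketch: time the recovery (RTO) and record how much data the
# restored copy is missing (RPO). `run_restore` and `replication_lag_seconds`
# are hypothetical placeholders for your own tooling.
import time

RTO_TARGET_S = 30 * 60   # 30 minutes
RPO_TARGET_S = 5 * 60    # 5 minutes

def run_restore() -> None:
    time.sleep(1)        # placeholder for the actual restore procedure

def replication_lag_seconds() -> float:
    return 90.0          # placeholder: lag of the restored copy vs. the source

start = time.monotonic()
run_restore()
rto_s = time.monotonic() - start
rpo_s = replication_lag_seconds()
print(f"RTO {rto_s:.0f}s (target {RTO_TARGET_S}s), RPO {rpo_s:.0f}s (target {RPO_TARGET_S}s)")
print("PASS" if rto_s <= RTO_TARGET_S and rpo_s <= RPO_TARGET_S else "FAIL")
```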
Choosing good targets
- Start with what users feel: fresh dashboards by start-of-day, stable schemas during business hours.
- Prefer few meaningful SLOs over many noisy ones.
- Include maintenance policy: what counts in/out of SLO; communicate planned downtime early.
Implementation patterns that raise reliability
- Redundancy: multi-AZ/region, hot standbys.
- Safe deploys: blue/green, canaries, feature flags, automatic rollback.
- Robustness: idempotent operations, retries with jitter (sketched after this list), circuit breakers, bulkheads.
- Observability: RED/USE + data freshness, data quality pass rate, lineage impact.
- Capacity: autoscaling, quotas, load shedding to protect the platform.
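One of the robustness patterns above, retries with exponential backoff and full jitter, as a minimal sketch; the retried operation is a placeholder, and retries are only safe when the underlying writes are idempotent.

```python
# Retry with exponential backoff and full jitter. The operation being retried
# (and the exception type worth retrying) are placeholders for your client code.
import random
import time

def retry_with_jitter(operation, attempts: int = 5,
                      base_delay_s: float = 0.2, max_delay_s: float = 10.0):
    for attempt in range(attempts):
        try:
            return operation()
        except Exception:
            if attempt == attempts - 1:
                raise                                   # attempts exhausted: surface the error
            cap = min(max_delay_s, base_delay_s * (2 ** attempt))
            time.sleep(random.uniform(0, cap))          # full jitter up to the exponential cap

# Usage (illustrative): retry_with_jitter(lambda: client.write(batch))
```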
Who this is for
- Data Platform Engineers owning shared services and SLAs.
- Data Engineers who build critical pipelines for many stakeholders.
Prerequisites
- Basic understanding of data pipelines (batch and streaming).
- Familiarity with metrics, logging, and alerting concepts.
- Some experience operating services in production is helpful.
Learning path
- Learn SLI/SLO/SLA, error budgets, RTO/RPO (this lesson).
- Instrument SLIs on one platform service.
- Define alert policy and on-call runbooks.
- Pilot an error budget policy with your team.
- Roll out SLOs to additional services.
Common mistakes and self-check
- Mistake: SLOs that users don’t care about. Self-check: Can a user feel the miss? If not, reconsider.
- Mistake: Too many SLOs. Self-check: Keep 2–4 per service.
- Mistake: No defined measurement. Self-check: Can you write the exact query to compute the SLI today?
- Mistake: Alerting only on absolute thresholds. Self-check: Add burn-rate alerts to protect the monthly target.
- Mistake: Changing targets too often. Self-check: Review monthly/quarterly, change only with evidence.
Practical projects
- Project 1: Build a freshness SLO for 10 critical tables. Dashboard: percent fresh by 07:00 UTC, daily trend, and error budget burn.
- Project 2: API availability SLO with burn-rate alerts. Simulate failures to validate alerts and runbook.
- Project 3: Define RTO/RPO for the metadata catalog; perform a restore drill and record recovery time.
Exercises
Do these now. You can check solutions in the expandable blocks.
Exercise 1: Data catalog SLOs
Define SLIs and SLOs for a shared data catalog used by analysts and pipelines.
- Pick 3 SLIs (availability, latency, metadata freshness, etc.).
- Propose monthly SLO targets and how to measure each.
- Write one burn-rate alert for fast failure and one for slow burn.
Show solution
Example answer:
- SLI1 availability = successful_reads / total_reads; SLO = 99.9% monthly; measured from per-minute HTTP status counts (status < 500 counted as successful).
- SLI2 search_latency_p95 in ms; SLO = p95 ≤ 300 ms for 99% of minutes monthly.
- SLI3 metadata_freshness = percent of entities synced within 10 minutes of source change; SLO = ≥ 98% monthly.
- Alerts: page if 2 h burn-rate > 14; ticket if 3 d burn-rate > 2.
Exercise 2: Error budget math
You set an SLO of 99.9% for the ingestion API over 30 days. In the first week you had 14 minutes of downtime.
- How much monthly budget remains (in minutes)?
- Do you restrict releases next week? Why?
Show solution
Monthly budget for 99.9% over a 30-day window ≈ 43.2 minutes (0.1% of 43,200 minutes). After 14 minutes of downtime, ≈ 29.2 minutes remain.
Decision: monitor burn-rate. If the week’s burn is elevated (e.g., incidents tied to new changes), slow or freeze releases until the trend stabilizes. If downtime was due to a one-off external event and burn-rate is normal, proceed cautiously with canaries.
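The same arithmetic, plus a simple pace check, as a sketch:

```python
# Error budget arithmetic for a 99.9% SLO over a 30-day window.
WINDOW_MIN = 30 * 24 * 60            # 43,200 minutes in the window
budget_min = 0.001 * WINDOW_MIN      # 43.2 minutes of allowed downtime
spent_min = 14.0                     # downtime in week 1

remaining_min = budget_min - spent_min            # 29.2 minutes left
projected_min = spent_min * (30 / 7)              # ~60 minutes if the pace continues
print(f"remaining: {remaining_min:.1f} min, projected at this pace: {projected_min:.1f} min")
```

At week 1's pace the month would end around 60 minutes of downtime, well over the 43.2-minute budget, which is exactly the signal to watch the burn rate and tighten release controls if incidents are change-related.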
Self-check checklist
- I can define SLIs that map to user pain.
- I can write the exact query to compute each SLI.
- I chose sensible time windows and burn-rate alerts.
- I documented maintenance and exclusion policies.
- I have an error budget policy tied to release decisions.
Mini challenge
You run a multi-tenant Spark platform. Users complain about jobs waiting too long to start and occasional failed jobs due to capacity. Propose 3 SLIs and monthly SLOs that would address this, and list two engineering changes to meet them. Keep it to 6–8 lines.
Hint
Think: queue wait time, job success rate, executor allocation latency, cluster saturation.
Next steps
- Instrument one SLI today (start with freshness or availability).
- Draft an error budget policy with your team.
- Schedule a quarterly DR drill for one critical component.