Why this matters
As an ETL Developer, you are responsible for data being where it should be, when it should be, and in the shape users expect. Service Level Agreements (SLAs) define those promises; monitoring proves you meet them and alerts you when you do not. This directly affects whether dashboards are ready by 08:00, ML features stay fresh, and downstream jobs run on time.
- Real tasks include: defining “on-time by 07:00” for daily loads, setting alert thresholds for late partitions, tracking success rate across a month, and writing runbooks for fast incident resolution.
- Good SLAs and monitoring reduce surprise outages, speed up root-cause finding, and keep business stakeholders confident in your pipelines.
Who this is for
- ETL Developers who schedule and operate batch or streaming pipelines.
- Data engineers who need clear targets for data freshness, completeness, and reliability.
- Analyst engineers or platform engineers who collaborate on data reliability.
Prerequisites
- Basic understanding of ETL/ELT jobs and orchestrators (e.g., DAGs, tasks, retries).
- Ability to read logs/metrics from jobs and data stores.
- Familiarity with data quality checks (row counts, null rates, schema checks).
Concept explained simply
Think of an SLA as a promise to your users: what they get and by when. To keep that promise, you watch signals (monitoring) that tell you if things are healthy. If a promise is at risk, alerts ping you so you can fix it before users feel pain.
Mental model
- Promises: What the user cares about (e.g., “Sales dashboard updated by 07:00, 99% of days”).
- Sensors: Metrics that show health (e.g., job duration, data freshness, consumer lag).
- Guardrails: Alert rules, escalation, and runbooks that trigger fast, correct action.
Key definitions
- SLI (Service Level Indicator): A measurement (e.g., data freshness in minutes).
- SLO (Service Level Objective): Target for an SLI (e.g., p95 freshness ≤ 30 minutes, monthly).
- SLA (Service Level Agreement): A user-facing promise. Often includes SLOs and consequences if missed.
- RTO/RPO: Recovery Time Objective and Recovery Point Objective, i.e., how quickly you must recover from an outage and how much data loss is tolerable.
- Failure budget (often called an error budget): Allowed margin for being out of SLO (e.g., 1% of days can be late).
Designing SLAs for ETL pipelines
- Common SLOs for batch:
- On-time completion: e.g., 99% of runs complete by 07:00 local in a calendar month.
- Success rate: e.g., 99.5% task success across the DAG, monthly.
- Freshness: e.g., max source-to-warehouse freshness p95 ≤ 60 minutes.
- Data completeness/quality: e.g., row count within ±2% of expected; null rate < 0.1% on key columns.
- Common SLOs for streaming:
- End-to-end latency: e.g., p95 ≤ 2 minutes, 99% of the time.
- Backlog/consumer lag: e.g., lag < 10k messages for 99% of the time.
- Throughput: e.g., sustain ≥ 5k msgs/min for 99% of the time.
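Targets like the ones above are easiest to keep honest when they live in version control as data rather than in a wiki. Below is a minimal sketch, assuming a hypothetical in-house format; the SLO dataclass, field names, and pipeline names are illustrative, not any particular tool's schema.

```python
# slos.py -- illustrative SLO declarations for two pipelines (hypothetical schema).
from dataclasses import dataclass

@dataclass(frozen=True)
class SLO:
    name: str        # human-readable objective, e.g. "on_time_completion"
    target: float    # fraction of runs/time that must meet the threshold
    threshold: str   # the measurable condition, kept as text for reporting
    window: str      # evaluation window ("monthly", "daily", ...)

SALES_DAILY_BATCH = [
    SLO("on_time_completion", 0.99,  "finished by 07:00 local",          "monthly"),
    SLO("task_success_rate",  0.995, "all DAG tasks succeed",            "monthly"),
    SLO("completeness",       1.0,   "row count within ±2% of expected", "daily"),
]

CLICKSTREAM_STREAMING = [
    SLO("end_to_end_latency", 0.99, "p95 <= 2 minutes",       "daily"),
    SLO("consumer_lag",       0.99, "lag < 10k messages",     "daily"),
    SLO("throughput",         0.99, "sustain >= 5k msgs/min", "daily"),
]
```

Keeping the declarations as plain data also lets reporting jobs and alert rules read the same targets instead of duplicating thresholds.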
Useful formulas
- On-time rate (%) = on_time_runs / total_runs × 100
- Freshness (minutes) = data_arrival_time_in_warehouse − source_event_time
- Availability (%) = successful_runs / total_runs × 100
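These formulas map directly onto a few helper functions. A minimal sketch; the function names are illustrative.

```python
from datetime import datetime

def on_time_rate(on_time_runs: int, total_runs: int) -> float:
    """On-time rate (%) = on_time_runs / total_runs × 100."""
    return 100.0 * on_time_runs / total_runs if total_runs else 0.0

def freshness_minutes(arrival_time: datetime, source_event_time: datetime) -> float:
    """Freshness (minutes) = warehouse arrival time − source event time."""
    return (arrival_time - source_event_time).total_seconds() / 60.0

def availability(successful_runs: int, total_runs: int) -> float:
    """Availability (%) = successful_runs / total_runs × 100."""
    return 100.0 * successful_runs / total_runs if total_runs else 0.0

# Example: 29 of 30 daily runs finished before the deadline.
print(on_time_rate(29, 30))   # ~96.7%, which already misses a 99% on-time SLO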
Runbook template (copy and adapt)
- Impact: Who is affected and how (dashboards, SLAs).
- Symptoms: What you see (alerts, logs, metrics).
- Immediate actions: Stop-gap steps (retry task, backfill yesterday, disable downstream).
- Diagnosis: How to find the root cause (check recent code changes, resource limits, upstream status).
- Resolution: Step-by-step fix (rollback, increase retries, reprocess, reindex).
- Post-incident: Timeline, contributing factors, permanent fix, SLO review.
Monitoring toolkit (tool-agnostic)
- Metrics: Task duration, success rate, schedule delay, freshness, backlog/lag, throughput, error counts.
- Logs: Structured logs with correlation/job IDs for each run.
- Events: Job start/finish, retries, state transitions.
- Alerts: Severity levels with clear thresholds and routing (info, warn, critical).
- Dashboards: One “at-a-glance” overview + deep-dive per pipeline.
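For the logs item above, the point is that every run emits machine-parsable records carrying a correlation/run ID so metrics, events, and alerts can be joined per run. A minimal sketch using only the standard library; the field names and the make_run_logger helper are illustrative, not a required schema.

```python
import json
import logging
import sys
import uuid
from datetime import datetime, timezone

def make_run_logger(pipeline: str) -> logging.LoggerAdapter:
    """Build a logger whose records all carry the same run (correlation) ID."""
    handler = logging.StreamHandler(sys.stdout)
    handler.setFormatter(logging.Formatter("%(message)s"))
    logger = logging.getLogger(pipeline)
    logger.addHandler(handler)
    logger.setLevel(logging.INFO)
    return logging.LoggerAdapter(logger, {"run_id": str(uuid.uuid4())})

def log_event(log: logging.LoggerAdapter, event: str, **fields) -> None:
    """Emit one structured JSON line per event (job_start, retry, task_finish, ...)."""
    record = {
        "ts": datetime.now(timezone.utc).isoformat(),
        "run_id": log.extra["run_id"],
        "event": event,
        **fields,
    }
    log.info(json.dumps(record))

log = make_run_logger("sales_daily")
log_event(log, "job_start")
log_event(log, "task_finish", task="load_orders", rows=9_850_000, duration_s=412)
```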
Alert design tips
- Alert on symptoms that matter to users (missed 07:00 SLA), not on every minor fluctuation.
- Add duration to thresholds (e.g., “p95 latency > 2 min for 5 consecutive minutes”).
- Deduplicate and group alerts by run/pipeline to reduce noise.
- Every alert must map to a runbook action. If not actionable, downgrade to dashboard-only.
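A minimal sketch of the "add duration to thresholds" tip above. In practice this logic lives in your monitoring tool's rule language, so the class and names here are purely illustrative.

```python
from datetime import datetime, timedelta
from typing import Optional

class SustainedAlert:
    """Fire only after the threshold has been breached continuously for `duration`."""

    def __init__(self, threshold: float, duration: timedelta):
        self.threshold = threshold
        self.duration = duration
        self.breach_started: Optional[datetime] = None

    def observe(self, ts: datetime, value: float) -> bool:
        if value <= self.threshold:
            self.breach_started = None       # breach cleared: reset the clock
            return False
        if self.breach_started is None:
            self.breach_started = ts         # first breaching sample
        return ts - self.breach_started >= self.duration

# p95 latency must exceed 2.0 minutes for 5 consecutive minutes before paging.
alert = SustainedAlert(threshold=2.0, duration=timedelta(minutes=5))
start = datetime(2024, 3, 1, 7, 0)
for minute, p95 in enumerate([2.5, 2.6, 2.4, 2.7, 2.5, 2.6]):
    if alert.observe(start + timedelta(minutes=minute), p95):
        print("page on-call")                # fires on the sixth breaching sample
```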
Worked examples
Example 1 — Daily batch SLA for a sales dashboard
- SLOs:
- On-time: 99% of days complete by 07:00 local.
- Success rate: 99.5% monthly.
- Completeness: row count within ±2% of prior 7-day mean.
- Monitoring:
- SLIs: job end time, task success ratio, row count delta, schema drift flag.
- Dashboard panels: on-time heatmap by day; rolling success rate; completeness trend.
- Alerts (see the sketch below):
- Warn: run not started by 06:45.
- Critical: run not finished by 07:05, or the completeness check breached.
- Runbook: Retry the failed task; if the source is late, notify stakeholders and trigger a backfill in the 10:00–11:00 window.
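A minimal sketch of the deadline check behind Example 1's warn/critical alerts. The run-state flags would come from your orchestrator's metadata API, and the function name and thresholds are illustrative; the completeness breach would be a separate check against the rolling row-count baseline.

```python
from datetime import datetime, time
from typing import Optional

def check_sales_dashboard_sla(run_started: bool, run_finished: bool,
                              now: datetime) -> Optional[str]:
    """Return 'warn', 'critical', or None for the 07:00 sales dashboard SLA."""
    local = now.time()
    if local >= time(7, 5) and not run_finished:
        return "critical"     # 07:05 and still not finished: users will notice
    if local >= time(6, 45) and not run_started:
        return "warn"         # run has not even started by 06:45
    return None

# Example: it is 06:50 and the DAG has not started yet -> "warn".
print(check_sales_dashboard_sla(run_started=False, run_finished=False,
                                now=datetime(2024, 3, 1, 6, 50)))
```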
Example 2 — Incremental warehouse load (hourly)
- SLOs:
- Freshness p95 ≤ 30 minutes, daily.
- Availability ≥ 99.9% monthly.
- Monitoring:
- SLIs: freshness per table; job duration; late partitions count.
- Alerts (see the sketch below):
- Warn: freshness p95 > 30 min for 2 consecutive hours.
- Critical: an hourly partition missing for 3 consecutive hours.
- Runbook: Run the catch-up DAG for the missed hours; verify upstream CDC lag; scale workers if queue depth is high.
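A sketch of the per-table freshness SLI from Example 2, assuming a warehouse client with a standard DB-API cursor and event timestamps stored as timezone-aware UTC; the table list, column name, and query are illustrative.

```python
from datetime import datetime, timezone
from typing import List

FRESHNESS_SLO_MINUTES = 30
TABLES = ["orders_incremental", "customers_incremental"]   # illustrative allowlist

def table_freshness_minutes(cursor, table: str) -> float:
    """Freshness = now minus the latest source event time already loaded, in minutes."""
    cursor.execute(f"SELECT MAX(event_time) FROM {table}")  # table names come from TABLES only
    (max_event_time,) = cursor.fetchone()
    return (datetime.now(timezone.utc) - max_event_time).total_seconds() / 60.0

def late_tables(cursor) -> List[str]:
    """Tables currently breaching the 30-minute freshness target."""
    return [t for t in TABLES
            if table_freshness_minutes(cursor, t) > FRESHNESS_SLO_MINUTES]
```

The monitoring backend would aggregate these samples into the daily p95 and feed a sustained-duration rule like the one sketched under the alert design tips.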
Example 3 — Streaming pipeline
- SLOs:
- End-to-end p95 latency ≤ 2 minutes for 99% of the day.
- Consumer lag < 10k messages, 99% of the day.
- Monitoring:
- SLIs: p95/99 latency, lag, error rate, throughput.
- Alerts (see the sketch below):
- Warn: p95 latency > 2 min sustained for 5 minutes.
- Critical: lag > 50k messages for 10 minutes, or error rate > 2% sustained for 5 minutes.
- Runbook: Scale consumers; throttle producers; replay from checkpoint if necessary; verify schema compatibility.
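A minimal sketch of mapping consumer lag to a severity for Example 3. Fetching per-partition lag is left to your client or metrics exporter, so the input dict and thresholds are assumptions taken from the SLOs above.

```python
from typing import Dict, Optional

def classify_lag(lag_per_partition: Dict[int, int],
                 warn_total: int = 10_000,
                 critical_total: int = 50_000) -> Optional[str]:
    """Map total consumer-group lag (in messages) to an alert severity."""
    total = sum(lag_per_partition.values())
    if total > critical_total:
        return "critical"
    if total > warn_total:
        return "warn"
    return None

# Per-partition lag as reported by your metrics exporter (values illustrative).
print(classify_lag({0: 4_000, 1: 9_500, 2: 1_200}))   # -> "warn" (total 14,700)
```

Combine this with a sustained-duration check (like the SustainedAlert sketch earlier) so the 10-minute condition on the critical alert is enforced and brief spikes do not page anyone.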
Hands-on: Build your SLA + Monitoring plan
- Define user impact: Who needs the data and when?
- Pick 2–4 SLIs that represent user pain (freshness, on-time, completeness, success rate).
- Set SLO targets and a failure budget (e.g., 99% on-time over a 30-day month leaves only ~0.3 late days, roughly one per quarter; see the budget sketch after the checklist).
- Decide alert thresholds and durations; map each to a runbook action.
- Create a single dashboard view for the pipeline.
- Simulate a failure (dry run): confirm alerts fire and runbook resolves it.
Checklist:
- [ ] Clear owner/on-call rotation documented
- [ ] SLIs measurable from existing logs/metrics
- [ ] SLOs realistic but challenging
- [ ] Alerts actionable with runbooks
- [ ] Dashboard reviewed with stakeholders
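To size the failure budget in the steps above, it helps to make the arithmetic explicit. A tiny sketch with illustrative numbers:

```python
def failure_budget_days(slo_target: float, days_in_window: int) -> float:
    """How many late days a given on-time SLO leaves within the window."""
    return (1.0 - slo_target) * days_in_window

# A 99% on-time target leaves ~0.3 late days per 30-day month,
# i.e. roughly one late day per quarter.
print(round(failure_budget_days(0.99, 30), 1))   # 0.3
print(round(failure_budget_days(0.99, 90), 1))   # 0.9
```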
Exercises
Exercise 1 — Draft an SLA + Monitoring spec
Scenario: A daily “Orders” pipeline ingests source files between 02:00 and 05:30 and must update the dashboard by 07:00. Historical volume: 8–12M rows/day. Source delays occur roughly once a week.
- Write SLOs (on-time, success rate, completeness, freshness).
- List SLIs and alert thresholds (warn vs critical).
- Outline a short runbook.
Need a hint?
- Consider a failure budget: how many days late per month?
- Use time-based thresholds with a grace period.
- Completeness can use a rolling baseline.
Exercise 2 — Tune noisy alerts
Scenario: Your hourly incremental job triggers 20+ alerts overnight due to brief spikes in freshness (p95 35–40 min) that auto-resolve within 5 minutes. Stakeholders were not impacted.
- Propose threshold changes to reduce noise but protect user impact.
- Describe grouping/deduplication and escalation rules.
Need a hint?
- Add a sustained duration requirement to alerts.
- Group multiple similar alerts into one incident per run/hour.
- Keep a critical threshold that reflects true impact.
Common mistakes and how to self-check
- Too many SLIs: Keep 2–4 that map directly to user value.
- No duration on thresholds: Use sustained breaches (e.g., 10 minutes) to avoid noise.
- Alerting on non-actionable metrics: If no runbook action exists, downgrade to dashboard-only.
- Unrealistic SLOs: Co-design with stakeholders; review monthly against reality.
- No owner: Every pipeline and alert must have a named owner and backup.
Self-check
- Can you explain who suffers if an SLO is missed and how?
- Can you show where each SLI is measured on your dashboard?
- Does every alert have a first action and an escalation path?
- Have you tested the runbook in a simulation or backfill dry run?
Practical projects
- Project 1: Instrument a batch pipeline to emit SLIs (on-time, success rate, row counts) and build a single overview dashboard.
- Project 2: Implement alert rules with warn/critical tiers and a deduplication window; link each to a runbook.
- Project 3: Add data quality gates (null rate, unique keys, expected range checks) that can block downstream tasks when checks fail (see the sketch below).
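A minimal sketch of the quality gates in Project 3, written as a plain function an orchestrator task could call and fail on. The counts are assumed to come from cheap aggregate queries on the freshly loaded table, and the thresholds and column semantics are illustrative.

```python
def quality_gate(row_count: int, null_key_count: int, distinct_key_count: int,
                 expected_min: int = 8_000_000, expected_max: int = 12_000_000,
                 max_null_rate: float = 0.001) -> None:
    """Raise ValueError (failing the calling task) when any gate is breached."""
    if not (expected_min <= row_count <= expected_max):
        raise ValueError(f"row count {row_count} outside expected range")
    if row_count and null_key_count / row_count > max_null_rate:
        raise ValueError(f"null rate {null_key_count / row_count:.2%} on key column too high")
    if distinct_key_count != row_count - null_key_count:
        raise ValueError("duplicate keys detected")

# Counts would come from aggregate queries, e.g. COUNT(*), a null count on the key,
# and COUNT(DISTINCT key) on the loaded table (values below are illustrative).
quality_gate(row_count=9_850_000, null_key_count=120, distinct_key_count=9_849_880)
```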
Mini challenge
You manage two pipelines that feed the CFO’s 08:00 finance dashboard. Pipeline A is daily with an on-time SLO of 99% by 06:30. Pipeline B is hourly with freshness p95 ≤ 20 minutes. Last week, A was late once (finished 06:55) and B had p95 freshness 25 minutes for 30 minutes at 07:00. Decide: which alerts should be critical vs warn, and what you would communicate to stakeholders.
Considerations
- Was user impact real at 08:00? If yes, the alert for the impacting pipeline should be critical.
- Use failure budget context: check whether a single late day still fits within the monthly budget.
- For B, the sustained breach and its proximity to 08:00 both matter.
Learning path
- Before: Understand job scheduling, dependencies, and retries.
- Now: Define SLAs/SLOs, instrument SLIs, and create actionable alerts.
- Next: Strengthen incident response (backfills, SLAs for upstream/downstream, capacity planning).
Next steps
- Finalize an SLA for one of your pipelines and review with stakeholders.
- Deploy one dashboard panel per SLI and test alert firing in a safe environment.
- Take the quick test below to confirm understanding.
Quick Test
Answer a few questions to check your understanding.