Managing SLAs And Freshness

Learn how to manage SLAs and data freshness for free, with explanations, exercises, and a quick test (for Analytics Engineers).

Published: December 23, 2025 | Updated: December 23, 2025

Who this is for

You work with data pipelines and business stakeholders who expect data to be ready by a clear time. You want practical methods to define, monitor, and meet SLAs while keeping datasets fresh and trustworthy.

Prerequisites

  • Basic understanding of batch pipelines (e.g., orchestrators like Airflow/Dagster or scheduled dbt runs).
  • Comfort with SQL and reading timestamps.
  • Familiarity with dataset lineage and dependencies.

Learning path

  • First: scheduling basics and dependency management.
  • Then: this lesson on SLAs, SLOs, and freshness.
  • Next: alerting, incident response, and backfills.

Why this matters

Analytics Engineers are often asked questions like: "Will finance have the sales report by 07:00?", "How old is the data in the dashboard?", or "Why is the table late today?" Managing SLAs and freshness lets you:

  • Guarantee availability of critical datasets before business cutoffs (e.g., daily executive dashboards).
  • Quantify staleness and detect when pipelines silently serve old data.
  • Reduce on-call noise with clear, meaningful alerts and runbooks.

Concept explained simply

  • SLA (Service Level Agreement): The agreed deadline or level of service (e.g., "The Finance Sales Mart is ready by 07:00 UTC on weekdays").
  • SLO (Service Level Objective): The performance target for consistency (e.g., "Meet the 07:00 deadline on 99% of weekdays each quarter").
  • SLI (Service Level Indicator): The measurement you observe (e.g., "Age of newest record" or "Job completion timestamp").
  • Freshness: How old your data is. A common metric is: freshness = now() - max(source_timestamp).
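  • Latency: Time from an event happening to it appearing in the analytics table. It is measured per record, while freshness describes the age of the newest record in the table, so the two often move together.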
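
The freshness and latency definitions above can be sketched in a few lines of Python. This is a minimal illustration, not a production check: the function names and the sample timestamps are hypothetical, and in practice `max(source_timestamp)` would come from a SQL query against the table.

```python
from datetime import datetime, timedelta, timezone


def freshness(source_timestamps, now):
    """Age of the newest record: now - max(source_timestamp)."""
    return now - max(source_timestamps)


def latency(event_ts, loaded_ts):
    """Time from the event happening to it appearing in the table."""
    return loaded_ts - event_ts


# Hypothetical sample data: two orders, checked at 07:00 UTC.
now = datetime(2025, 12, 23, 7, 0, tzinfo=timezone.utc)
order_created_at = [
    datetime(2025, 12, 23, 5, 10, tzinfo=timezone.utc),
    datetime(2025, 12, 23, 6, 20, tzinfo=timezone.utc),
]

age = freshness(order_created_at, now)
print(age)  # 0:40:00
```

Using timezone-aware UTC timestamps on both sides of the subtraction avoids off-by-hours errors when sources and the warehouse run in different time zones.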

Mental model

Think of your pipeline as a relay race with a finish line at the SLA time. Each stage (extract, transform, load) needs enough time and a small buffer. Freshness is the stopwatch showing how far behind real time the baton currently is.

Worked examples

Example 1 — Daily sales mart due 07:00 UTC
  • SLA: Sales_Mart is updated by 07:00 UTC, Mon–Fri.
  • Upstream: OLTP export completes by ~06:25 UTC (p95).
  • Transform time: 20 minutes (p95).
  • Buffer: 15 minutes.

Plan: Schedule start at 06:25 UTC. Expected finish ~06:45, buffer until 07:00. Freshness SLI: now() - max(order_created_at) <= 60 minutes at 07:00.
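
The arithmetic behind this plan can be checked with a short sketch. The variable names and the calendar date are assumptions for illustration; the times and durations are the ones given in Example 1.

```python
from datetime import datetime, timedelta, timezone

# Values from Example 1 (the date itself is arbitrary).
sla = datetime(2025, 12, 23, 7, 0, tzinfo=timezone.utc)
upstream_ready_p95 = datetime(2025, 12, 23, 6, 25, tzinfo=timezone.utc)
transform_p95 = timedelta(minutes=20)

planned_start = upstream_ready_p95               # start as soon as upstream lands
expected_finish = planned_start + transform_p95  # 06:45
buffer = sla - expected_finish                   # 15 minutes of slack
latest_safe_start = sla - transform_p95          # 06:40: starting later risks the SLA

print(f"finish {expected_finish:%H:%M}, buffer {buffer}, latest safe start {latest_safe_start:%H:%M}")
# finish 06:45, buffer 0:15:00, latest safe start 06:40
```

The latest safe start is a useful number to keep next to the schedule, because it marks the point after which a p95 run can no longer finish on time.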

Policy snippet
SLA: 07:00 UTC daily (Mon–Fri)
SLO: 99% on-time per quarter
SLI:
  - Job finish time
  - Freshness at 07:00: now - max(order_created_at) <= 60m
Alerts:
  - Warn at 06:40 if job not started (latest safe start: 07:00 minus the 20-minute p95 transform)
  - Page at 07:01 if freshness > 60m or job unfinished
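
A policy like this one could be evaluated by a small function that runs on a schedule. The sketch below is one possible shape, not a reference implementation: `alert_level` and its arguments are hypothetical names, and the warn time is assumed to be the latest safe start (SLA minus the p95 transform time), so substitute whatever warn time your own policy uses.

```python
from datetime import datetime, timedelta, timezone

SLA = datetime(2025, 12, 23, 7, 0, tzinfo=timezone.utc)
FRESHNESS_LIMIT = timedelta(minutes=60)
PAGE_AT = SLA + timedelta(minutes=1)
# Assumption: warn once the latest safe start (SLA minus p95 transform) has passed.
WARN_AT = SLA - timedelta(minutes=20)


def alert_level(now, job_started, job_finished, freshness):
    """Return "page", "warn", or "ok" for a single check."""
    breached = (not job_finished) or freshness > FRESHNESS_LIMIT
    if now >= PAGE_AT and breached:
        return "page"
    if now >= WARN_AT and not job_started:
        return "warn"
    return "ok"


check_time = datetime(2025, 12, 23, 6, 55, tzinfo=timezone.utc)
print(alert_level(check_time, False, False, timedelta(minutes=90)))  # warn
print(alert_level(PAGE_AT, True, False, timedelta(minutes=30)))      # page
print(alert_level(PAGE_AT, True, True, timedelta(minutes=30)))       # ok
```

Note that the page condition checks both the job state and the freshness value, so a job that finished on stale input still pages.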

Example 2 — Five-minute micro-batch with tight freshness
  • Requirement: Dashboard shows events no older than 15 minutes during business hours.
  • Approach: Run every 5 minutes; each run processes the last 10 minutes of events, so consecutive windows overlap and late-arriving events are still picked up.
  • SLI: max(now() - max(event_ts)) over business hours <= 15 minutes.
  • Alert: page if freshness > 20 minutes for 2 consecutive checks.

Runbook: scale workers, reduce batch size, or switch to streaming path during spikes.
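
The "2 consecutive checks" rule in this example keeps a single slow run from paging anyone. A minimal sketch, assuming freshness readings are collected oldest-first in a list (the function name `should_page` is hypothetical):

```python
from datetime import timedelta


def should_page(freshness_checks, threshold=timedelta(minutes=20), consecutive=2):
    """True if the most recent `consecutive` checks all exceed the threshold."""
    if len(freshness_checks) < consecutive:
        return False
    return all(f > threshold for f in freshness_checks[-consecutive:])


minutes = lambda n: timedelta(minutes=n)

print(should_page([minutes(12), minutes(22), minutes(14)]))  # False: recovered
print(should_page([minutes(12), minutes(21), minutes(25)]))  # True: two in a row
```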

Example 3 — Cross-team dependency and cascade
  • Marketing_Mart depends on Ads_Spend and Web_Sessions.
  • Upstream SLAs: Ads_Spend 06:15, Web_Sessions 06:20; your transform needs 15 minutes.

Critical path: start when both upstreams are ready (06:20). Finish by ~06:35. SLA for Marketing_Mart can be 06:45 with 10 minutes buffer.
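
The critical-path reasoning generalizes to any number of upstreams: the start time is the latest upstream ready time. A small sketch using the numbers from Example 3 (the `utc` helper and the calendar date are assumptions for illustration):

```python
from datetime import datetime, timedelta, timezone


def utc(hour, minute):
    """Hypothetical helper: a UTC timestamp on an arbitrary example date."""
    return datetime(2025, 12, 23, hour, minute, tzinfo=timezone.utc)


upstream_ready = {"Ads_Spend": utc(6, 15), "Web_Sessions": utc(6, 20)}
transform = timedelta(minutes=15)
buffer = timedelta(minutes=10)

start = max(upstream_ready.values())                     # wait for the slowest upstream
blocking = max(upstream_ready, key=upstream_ready.get)   # which upstream sets the start
finish = start + transform
sla = finish + buffer

print(f"{blocking} gates the start at {start:%H:%M}; finish {finish:%H:%M}; SLA {sla:%H:%M}")
# Web_Sessions gates the start at 06:20; finish 06:35; SLA 06:45
```

Knowing which upstream gates the start tells you whose SLA change would force you to renegotiate your own.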

Design SLAs and freshness in 8 steps

Step 1. Identify business cutoffs (who needs what by when?).

Step 2. Map dependencies and p95 durations for each stage.

Step 3. Choose SLIs (e.g., job finish time, freshness formula).

Step 4. Set SLA (deadline) and SLO (on-time target).

Step 5. Add buffer (typically 10–25% of critical path time).

Step 6. Implement freshness checks and data tests.

Step 7. Define alerts: warn before breach, page at breach, include runbook link.

Step 8. Document ownership, escalation, and holiday changes.
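
Steps 2 and 5 depend on having a p95 duration to work from. The sketch below computes a nearest-rank p95 from a list of past runtimes and derives the 10–25% buffer range. The runtimes are made-up sample data, and nearest-rank is only one of several percentile definitions, so your orchestrator or warehouse may report a slightly different value.

```python
import math


def p95(values):
    """Nearest-rank 95th percentile: the smallest value covering 95% of runs."""
    ordered = sorted(values)
    rank = math.ceil(len(ordered) * 95 / 100)
    return ordered[rank - 1]


# Hypothetical runtimes in minutes for the last 20 runs (one outlier).
runtimes = [14, 15, 15, 16, 16, 16, 17, 17, 17, 17,
            18, 18, 18, 18, 19, 19, 19, 20, 20, 31]

critical_path = p95(runtimes)
buffer_low = critical_path * 10 / 100
buffer_high = critical_path * 25 / 100

print(critical_path, buffer_low, buffer_high)  # 20 2.0 5.0
```

The single 31-minute outlier does not move the p95, which is why p95 is a steadier planning input than the maximum.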

Monitoring and alerting patterns

  • Pre-breach warnings: If a job hasn’t started near the latest safe start time.
  • Freshness guards: Compare now() vs max(source timestamp) on a schedule.
  • Dual checks: Job success does not guarantee data freshness; verify both.
  • Silence policies: Reduce noise during known maintenance windows.
  • Backfill controls: Run backfills at lower priority and do not page on them the way you would for day-of SLA jobs.
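
The dual-check and silence-policy patterns combine naturally into one decision. This is a rough sketch with hypothetical function names; the maintenance window and timestamps are invented for the example.

```python
from datetime import datetime, timedelta, timezone


def is_healthy(job_succeeded, newest_source_ts, now, max_age):
    """Dual check: the job succeeded AND the data it produced is fresh."""
    return job_succeeded and (now - newest_source_ts) <= max_age


def is_silenced(now, maintenance_windows):
    """True if `now` falls inside any (start, end) maintenance window."""
    return any(start <= now < end for start, end in maintenance_windows)


now = datetime(2025, 12, 23, 7, 0, tzinfo=timezone.utc)
# The job reported success, but the newest source row is from yesterday.
stale_ts = datetime(2025, 12, 22, 6, 20, tzinfo=timezone.utc)
window = (
    datetime(2025, 12, 23, 2, 0, tzinfo=timezone.utc),
    datetime(2025, 12, 23, 3, 0, tzinfo=timezone.utc),
)

healthy = is_healthy(True, stale_ts, now, timedelta(minutes=60))
should_alert = (not healthy) and not is_silenced(now, [window])
print(healthy, should_alert)  # False True
```

Here the job succeeded yet the alert still fires, which is the silent-staleness case the dual check is meant to catch.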

Runbook: when an SLA is at risk

  • 1) Check orchestrator queue/backlog. If queued, scale or re-prioritize.
  • 2) Check upstream data arrival and watermark movement.
  • 3) Estimate finish time vs SLA; decide whether to skip non-critical steps.
  • 4) Communicate ETA to stakeholders with a timestamped update.
  • 5) If breach occurs, document cause, impact, and prevention action.
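
Steps 3 and 4 of the runbook can be partly scripted. The sketch below estimates the finish time from the p95 durations of the remaining steps, decides whether skipping non-critical steps would save the SLA, and formats a timestamped update. The step names, durations, and the `triage` function are hypothetical.

```python
from datetime import datetime, timedelta, timezone


def triage(now, sla, remaining_steps):
    """remaining_steps: list of (name, p95_minutes, is_critical) tuples."""
    full = now + timedelta(minutes=sum(m for _, m, _ in remaining_steps))
    critical = now + timedelta(minutes=sum(m for _, m, c in remaining_steps if c))
    if full <= sla:
        return "run all steps", full
    if critical <= sla:
        return "skip non-critical steps", critical
    # Even the critical-only path misses the deadline; ETA assumes the skip.
    return "SLA breach likely", critical


now = datetime(2025, 12, 23, 6, 50, tzinfo=timezone.utc)
sla = datetime(2025, 12, 23, 7, 0, tzinfo=timezone.utc)
steps = [("build_fact_sales", 8, True), ("refresh_docs", 6, False)]

decision, eta = triage(now, sla, steps)
update = f"[{now:%H:%M} UTC] Status: {decision}. ETA {eta:%H:%M} UTC (SLA {sla:%H:%M} UTC)."
print(update)
# [06:50 UTC] Status: skip non-critical steps. ETA 06:58 UTC (SLA 07:00 UTC).
```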

Exercises

Do these now. The quick test references them.

  1. Exercise 1 — Draft an SLA + freshness plan (daily mart)
    See the instructions under Practice Exercises below.
  2. Exercise 2 — Triage a freshness breach
    Practice a realistic incident response.

Checklist — before you call it "done"

  • Business deadline and owner documented.
  • Critical path time measured (p95) and buffer added.
  • SLIs defined and queryable (finish time, freshness).
  • Alerts: warn-before, page-on-breach, and clear runbook steps.
  • Backfill policy and maintenance window defined.
  • Post-incident review template prepared.

Common mistakes and how to self-check

  • Mistake: Equating job success with data freshness. Self-check: Inspect max(source timestamp) and row counts per partition.
  • Mistake: No buffer time. Self-check: Confirm that scheduled start + p95 runtime + upstream variance still lands before the SLA deadline.
  • Mistake: Paging for non-critical datasets. Self-check: Label severity and route alerts appropriately.
  • Mistake: One-size-fits-all SLAs. Self-check: Tailor to business criticality and seasonality.
  • Mistake: Unclear ownership. Self-check: Ensure on-call rotation and escalation are documented.
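
The first self-check, row counts per partition, can be automated once the counts are available. A minimal sketch, assuming daily partitions and a dictionary of counts keyed by date (in practice the counts would come from a `GROUP BY` query; the numbers here are invented):

```python
from datetime import date, timedelta


def missing_partitions(row_counts, start, end):
    """Return the dates in [start, end] that have no rows at all."""
    days = (end - start).days + 1
    expected = [start + timedelta(days=i) for i in range(days)]
    return [d for d in expected if row_counts.get(d, 0) == 0]


# Hypothetical counts: the job "succeeded" today but loaded nothing for Dec 23.
row_counts = {date(2025, 12, 21): 1200, date(2025, 12, 22): 1180}

gaps = missing_partitions(row_counts, date(2025, 12, 21), date(2025, 12, 23))
print(gaps)  # [datetime.date(2025, 12, 23)]
```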

Practical projects

  • Project 1: Pick one critical dataset. Write its SLA/SLO/SLIs, implement a freshness query, and add alerts.
  • Project 2: Measure p50/p95 runtimes for each task in a DAG; redesign schedule to add a 15% buffer.
  • Project 3: Create a breach runbook template and test it with a simulated delay.

Mini challenge

Your team promises: "The Marketing dashboard is updated by 06:45 UTC (99% of days)." Upstreams deliver at 06:20 (p95) and 06:28 (p95). Your transform takes 12 minutes (p95). Propose a start time, buffer, and two SLIs. Write a 2–3 sentence answer.

Next steps

  • Automate SLI queries and pipe them into alerts.
  • Introduce severity levels and on-call rotations for critical pipelines.
  • Review SLAs quarterly; adjust for growth and seasonality.
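
One SLI worth automating for the quarterly review is SLO attainment itself. The sketch below assumes a quarter of roughly 65 weekdays (13 weeks of 5 days) and a list of on-time flags, one per SLA day; `slo_report` is a hypothetical helper.

```python
import math


def slo_report(on_time_flags, target_pct=99):
    """Summarize on-time delivery against an SLO target given in percent."""
    total = len(on_time_flags)
    on_time = sum(on_time_flags)
    required = math.ceil(total * target_pct / 100)  # on-time days needed
    return {
        "on_time_rate": on_time / total,
        "allowed_misses": total - required,
        "slo_met": on_time >= required,
    }


# Assumed quarter: 65 weekdays with a single late delivery.
quarter = [True] * 64 + [False]
report = slo_report(quarter)
print(report["allowed_misses"], report["slo_met"])  # 0 False
```

With about 65 SLA days per quarter, a 99% target leaves no room for a single miss, which is worth knowing before committing to that number.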

Practice Exercises

Instructions

Context: You maintain a daily Orders_Mart used by Finance by 07:30 UTC. Upstream OLTP export completes by 06:55 UTC (p95). Your transformations take 25 minutes (p95). You want at least 10 minutes of buffer. Define:

  • A clear SLA and SLO.
  • Two SLIs: one for job readiness, one for freshness.
  • A safe schedule (start time) and buffer rationale.
  • Alert rules: warn-before and page-on-breach.
  • The first three runbook steps to take if the job is still running at 07:25 UTC.

Expected Output
A concise policy (5–10 bullets) and a small config or pseudo-YAML showing the freshness threshold and alerting windows.

Managing SLAs And Freshness — Quick Test

Test your knowledge with 8 questions. Pass with 70% or higher.
