Managing SLAs and Freshness: Orchestration Basics for Analytics Engineers

Who this is for

You work with data pipelines and business stakeholders who expect data to be ready by a clear time. You want practical methods to define, monitor, and meet SLAs while keeping datasets fresh and trustworthy.

Prerequisites

Basic understanding of batch pipelines (e.g., orchestrators like Airflow/Dagster or scheduled dbt runs).
Comfort with SQL and reading timestamps.
Familiarity with dataset lineage and dependencies.

Learning path

First: scheduling basics and dependency management.
Then: this lesson on SLAs, SLOs, and freshness.
Next: alerting, incident response, and backfills.

Why this matters

Analytics Engineers are often asked questions like: "Will finance have the sales report by 07:00?", "How old is the data in the dashboard?", or "Why is the table late today?" Managing SLAs and freshness lets you:

Guarantee availability of critical datasets before business cutoffs (e.g., daily executive dashboards).
Quantify staleness and detect when pipelines silently serve old data.
Reduce on-call noise with clear, meaningful alerts and runbooks.

Concept explained simply

SLA (Service Level Agreement): The agreed deadline or level of service (e.g., "The Finance Sales Mart is ready by 07:00 UTC on weekdays").
SLO (Service Level Objective): The performance target for consistency (e.g., "Meet the 07:00 deadline on 99% of weekdays each quarter").
SLI (Service Level Indicator): The measurement you observe (e.g., "Age of newest record" or "Job completion timestamp").
Freshness: How old your data is. A common metric is: freshness = now() - max(source_timestamp).
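Latency: Time from an event happening to it appearing in the analytics table. It overlaps with freshness: latency is measured per event, while freshness describes how far the table as a whole lags behind now.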
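The freshness formula above can be sketched in a few lines of Python. The timestamps and the 60-minute threshold are illustrative assumptions, not values from a real pipeline.

```python
from datetime import datetime, timedelta, timezone


def freshness(now: datetime, newest_source_ts: datetime) -> timedelta:
    """Age of the newest record: now - max(source_timestamp)."""
    return now - newest_source_ts


# Hypothetical values for illustration only.
now = datetime(2024, 3, 4, 7, 0, tzinfo=timezone.utc)
newest = datetime(2024, 3, 4, 6, 12, tzinfo=timezone.utc)

age = freshness(now, newest)               # 48 minutes
is_fresh = age <= timedelta(minutes=60)    # compare against the agreed limit
```

Keeping both timestamps timezone-aware (UTC here) avoids subtracting a local time from a UTC time by accident.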

Mental model

Think of your pipeline as a relay race with a finish line at the SLA time. Each stage (extract, transform, load) needs enough time and a small buffer. Freshness is the stopwatch showing how far behind real time the baton currently is.

Worked examples

Example 1 — Daily sales mart due 07:00 UTC

SLA: Sales_Mart is updated by 07:00 UTC, Mon–Fri.
Upstream: OLTP export completes by ~06:25 UTC (p95).
Transform time: 20 minutes (p95).
Buffer: 15 minutes.

Plan: Schedule start at 06:25 UTC. Expected finish ~06:45, buffer until 07:00. Freshness SLI: now() - max(order_created_at) <= 60 minutes at 07:00.
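The arithmetic behind this plan can be checked with a short sketch. The calendar date is an arbitrary assumption; the times and durations are the ones listed above.

```python
from datetime import datetime, timedelta, timezone

# Times from Example 1; the date itself is arbitrary.
sla = datetime(2024, 3, 4, 7, 0, tzinfo=timezone.utc)
upstream_ready_p95 = datetime(2024, 3, 4, 6, 25, tzinfo=timezone.utc)
transform_p95 = timedelta(minutes=20)

start = upstream_ready_p95                 # start as soon as upstream is ready
expected_finish = start + transform_p95    # 06:45 UTC
buffer = sla - expected_finish             # 15 minutes of slack
latest_safe_start = sla - transform_p95    # 06:40 UTC: starting later risks a breach
```

The latest safe start is a useful number to keep: a job that has not started by then is unlikely to finish before the deadline at p95 runtime.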

Policy snippet
SLA: 07:00 UTC daily (Mon–Fri)
SLO: 99% on-time per quarter
SLI:
  - Job finish time
  - Freshness at 07:00: now - max(order_created_at) <= 60m
Alerts:
  - Warn at 06:40 if job not started (latest safe start: 07:00 minus the 20-minute p95 transform)
  - Page at 07:01 if freshness > 60m or job unfinished
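One way such a policy might be evaluated on each check is sketched below. The function name `evaluate_alert` and its parameters are hypothetical, and the warn time is derived here as the latest safe start (SLA minus p95 transform time), which is an assumption of this sketch.

```python
from datetime import datetime, timedelta, timezone


def evaluate_alert(now, job_started, job_finished, data_age,
                   warn_at, page_at, max_age):
    """Return 'page', 'warn', or 'ok' for a single policy check."""
    # Page once the deadline has passed and either signal is bad.
    if now >= page_at and (not job_finished or data_age > max_age):
        return "page"
    # Warn earlier if the job has not even started.
    if now >= warn_at and not job_started:
        return "warn"
    return "ok"


def utc(hour, minute):
    """Helper: a UTC time on an arbitrary example date."""
    return datetime(2024, 3, 4, hour, minute, tzinfo=timezone.utc)


sla = utc(7, 0)
policy = dict(
    warn_at=sla - timedelta(minutes=20),   # latest safe start (assumption)
    page_at=sla + timedelta(minutes=1),    # 07:01, as in the policy
    max_age=timedelta(minutes=60),
)

late_start = evaluate_alert(utc(6, 41), False, False, timedelta(minutes=90), **policy)
unfinished = evaluate_alert(utc(7, 1), True, False, timedelta(minutes=90), **policy)
stale = evaluate_alert(utc(7, 1), True, True, timedelta(minutes=75), **policy)
healthy = evaluate_alert(utc(7, 1), True, True, timedelta(minutes=30), **policy)
```

Note that `stale` pages even though the job finished: the freshness SLI is checked independently of job status.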

Example 2 — Five-minute micro-batch with tight freshness

Requirement: Dashboard shows events no older than 15 minutes during business hours.
Approach: Run every 5 minutes; each run processes the last 10 minutes of events, so consecutive runs overlap and a late event is picked up by the next run.
SLI: max(now() - max(event_ts)) over business hours <= 15 minutes.
Alert: page if freshness > 20 minutes for 2 consecutive checks.

Runbook: scale workers, reduce batch size, or switch to streaming path during spikes.
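The "2 consecutive checks" rule in the alert can be sketched as follows. The function name `should_page` and the sample readings are illustrative assumptions.

```python
from datetime import timedelta


def should_page(freshness_checks, threshold=timedelta(minutes=20), consecutive=2):
    """True if the most recent `consecutive` checks all exceed `threshold`."""
    if len(freshness_checks) < consecutive:
        return False
    return all(age > threshold for age in freshness_checks[-consecutive:])


# Hypothetical freshness readings, oldest first, in minutes.
readings = [timedelta(minutes=m) for m in (12, 22, 14, 21, 23)]

single_spike = should_page(readings[:2])   # 12, 22 -> one bad check only
recovered = should_page(readings[:4])      # 14, 21 -> still only one in a row
sustained = should_page(readings)          # 21, 23 -> two in a row
```

Requiring consecutive breaches trades a few minutes of detection delay for fewer pages on one-off spikes.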

Example 3 — Cross-team dependency and cascade

Marketing_Mart depends on Ads_Spend and Web_Sessions.
Upstream SLAs: Ads_Spend 06:15, Web_Sessions 06:20; your transform needs 15 minutes.

Critical path: start when both upstreams are ready (06:20). Finish by ~06:35. SLA for Marketing_Mart can be 06:45 with 10 minutes buffer.
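The critical-path reasoning above reduces to "start at the latest upstream, add the transform time, compare with the SLA". A minimal sketch, with the hypothetical helper name `plan` and an arbitrary date:

```python
from datetime import datetime, timedelta, timezone


def plan(upstream_ready, transform, sla):
    """Return (start, finish, buffer) for a job that waits on all upstreams."""
    start = max(upstream_ready.values())   # the slowest upstream sets the start
    finish = start + transform
    return start, finish, sla - finish


def utc(hour, minute):
    """Helper: a UTC time on an arbitrary example date."""
    return datetime(2024, 3, 4, hour, minute, tzinfo=timezone.utc)


start, finish, buffer = plan(
    upstream_ready={"Ads_Spend": utc(6, 15), "Web_Sessions": utc(6, 20)},
    transform=timedelta(minutes=15),
    sla=utc(6, 45),
)
```

A negative buffer from this calculation would mean the proposed SLA cannot be met even when everything runs at its usual speed.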

Design SLAs and freshness in 8 steps

Step 1. Identify business cutoffs (who needs what by when?).

Step 2. Map dependencies and p95 durations for each stage.

Step 3. Choose SLIs (e.g., job finish time, freshness formula).

Step 4. Set SLA (deadline) and SLO (on-time target).

Step 5. Add buffer (typically 10–25% of critical path time).

Step 6. Implement freshness checks and data tests.

Step 7. Define alerts: warn before breach, page at breach, include runbook link.

Step 8. Document ownership, escalation, and holiday changes.
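Steps 2 and 5 can be sketched with a nearest-rank percentile over observed runtimes. The runtime samples below are invented for illustration, and the 15% buffer is just one value inside the 10–25% range mentioned in Step 5.

```python
import math


def percentile(samples, p):
    """Nearest-rank percentile; p is an integer from 1 to 100."""
    ordered = sorted(samples)
    rank = math.ceil(p * len(ordered) / 100)
    return ordered[rank - 1]


# Hypothetical runtimes (minutes) from the last 20 runs of one pipeline.
runtimes = [14, 15, 15, 16, 16, 16, 17, 17, 17, 18,
            18, 18, 19, 19, 20, 20, 21, 22, 24, 31]

p50 = percentile(runtimes, 50)                 # typical run
p95 = percentile(runtimes, 95)                 # slow-but-normal run
buffer_pct = 15                                # assumption: middle of 10-25%
buffer_minutes = math.ceil(p95 * buffer_pct / 100)
time_budget = p95 + buffer_minutes             # minutes to reserve before the SLA
```

Planning on p95 rather than the maximum keeps one unusually slow run (31 minutes here) from inflating the schedule, while the buffer absorbs part of that tail.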

Monitoring and alerting patterns

Pre-breach warnings: Warn if a job has not started by the latest safe start time (SLA deadline minus p95 runtime).
Freshness guards: Compare now() vs max(source timestamp) on a schedule.
Dual checks: Job success does not guarantee data freshness; verify both.
Silence policies: Reduce noise during known maintenance windows.
Backfill controls: Run backfills at lower priority, and do not page on them the way you page on day-of SLA jobs.
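The freshness guard and dual check can be sketched against an in-memory SQLite table. The table name `orders`, the column `order_created_at`, and the sample rows are assumptions made for this example; a warehouse query would follow the same shape.

```python
import sqlite3
from datetime import datetime, timedelta, timezone

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (order_id INTEGER, order_created_at TEXT)")
conn.executemany(
    "INSERT INTO orders VALUES (?, ?)",
    [(1, "2024-03-03T22:10:00+00:00"), (2, "2024-03-03T23:40:00+00:00")],
)


def newest_timestamp(conn):
    """max(source timestamp); ISO-8601 strings in one offset sort correctly."""
    (value,) = conn.execute("SELECT MAX(order_created_at) FROM orders").fetchone()
    return datetime.fromisoformat(value) if value else None


def dual_check(job_succeeded, newest, now, max_age):
    """Report job status and data freshness separately, then combined."""
    fresh = newest is not None and (now - newest) <= max_age
    return {"job_ok": job_succeeded, "fresh": fresh,
            "healthy": job_succeeded and fresh}


now = datetime(2024, 3, 4, 7, 0, tzinfo=timezone.utc)
result = dual_check(True, newest_timestamp(conn), now, timedelta(minutes=60))
```

Here the job reports success, but the newest row is more than seven hours old, so `healthy` is false. This is the silent-staleness case the dual check exists to catch.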

Runbook: when an SLA is at risk

1) Check orchestrator queue/backlog. If queued, scale or re-prioritize.
2) Check upstream data arrival and watermark movement.
3) Estimate finish time vs SLA; decide whether to skip non-critical steps.
4) Communicate ETA to stakeholders with a timestamped update.
5) If breach occurs, document cause, impact, and prevention action.
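Step 3 of the runbook can be approximated with a small estimate. This is a rough sketch that assumes the remaining tasks run one after another and that each has a known p95 duration; the task names are hypothetical.

```python
from datetime import datetime, timedelta, timezone


def estimate_finish(now, remaining_tasks, sla):
    """Return (eta, skipped_task_names).

    remaining_tasks: list of (name, p95_duration, is_critical), run sequentially.
    Non-critical tasks are dropped only if the full plan would miss the SLA.
    """
    eta = now + sum((d for _, d, _ in remaining_tasks), timedelta())
    if eta <= sla:
        return eta, []
    skipped = [name for name, _, critical in remaining_tasks if not critical]
    eta = now + sum((d for _, d, critical in remaining_tasks if critical), timedelta())
    return eta, skipped


now = datetime(2024, 3, 4, 6, 42, tzinfo=timezone.utc)
sla = datetime(2024, 3, 4, 7, 0, tzinfo=timezone.utc)
tasks = [
    ("build_facts", timedelta(minutes=10), True),
    ("refresh_docs", timedelta(minutes=6), False),
    ("publish_mart", timedelta(minutes=5), True),
]

eta, skipped = estimate_finish(now, tasks, sla)
```

With all three tasks the estimate is 07:03, past the deadline; skipping `refresh_docs` brings it to 06:57. If the returned ETA is still after the SLA, the next runbook step (communicating the ETA) applies.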

Exercises

Do these now.

Exercise 1 — Draft an SLA + freshness plan (daily mart)
Draft an SLA and freshness plan for a daily orders mart.
Exercise 2 — Triage a freshness breach
Practice a realistic incident response.

Checklist — before you call it "done"

Business deadline and owner documented.
Critical path time measured (p95) and buffer added.
SLIs defined and queryable (finish time, freshness).
Alerts: warn-before, page-on-breach, and clear runbook steps.
Backfill policy and maintenance window defined.
Post-incident review template prepared.

Common mistakes and how to self-check

Mistake: Equating job success with data freshness. Self-check: Inspect max(source timestamp) and row counts per partition.
Mistake: No buffer time. Self-check: Confirm that p95 runtime plus upstream variance fits between the scheduled start and the SLA deadline, with slack left over.
Mistake: Paging for non-critical datasets. Self-check: Label severity and route alerts appropriately.
Mistake: One-size-fits-all SLAs. Self-check: Tailor to business criticality and seasonality.
Mistake: Unclear ownership. Self-check: Ensure on-call rotation and escalation are documented.

Practical projects

Project 1: Pick one critical dataset. Write its SLA/SLO/SLIs, implement a freshness query, and add alerts.
Project 2: Measure p50/p95 runtimes for each task in a DAG; redesign schedule to add a 15% buffer.
Project 3: Create a breach runbook template and test it with a simulated delay.

Mini challenge

Your team promises: "The Marketing dashboard is updated by 06:45 UTC (99% of days)." Upstreams deliver at 06:20 (p95) and 06:28 (p95). Your transform takes 12 minutes (p95). Propose a start time, buffer, and two SLIs. Write a 2–3 sentence answer.

Next steps

Automate SLI queries and pipe them into alerts.
Introduce severity levels and on-call rotations for critical pipelines.
Review SLAs quarterly; adjust for growth and seasonality.
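For the quarterly review, SLO attainment can be computed from a list of on-time flags. The sketch below assumes roughly 65 weekdays in a quarter (13 weeks of 5 days) and uses invented results; `slo_report` is a hypothetical helper name.

```python
def slo_report(on_time_flags, target_pct=99):
    """Summarize on-time delivery against an SLO target given in whole percent."""
    total = len(on_time_flags)
    on_time = sum(on_time_flags)
    return {
        "on_time": on_time,
        "total": total,
        # Integer comparison avoids floating-point rounding near the target.
        "met": on_time * 100 >= target_pct * total,
        "allowed_misses": total * (100 - target_pct) // 100,
    }


# Hypothetical quarter: 64 on-time weekdays and 1 late delivery.
quarter = [True] * 64 + [False]
report = slo_report(quarter)
relaxed = slo_report(quarter, target_pct=98)
```

Under these assumptions a 99% target over 65 weekdays leaves room for zero misses, so a single late day breaks the SLO, while a 98% target tolerates one. That is worth knowing before promising a number to stakeholders.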

