Scheduling and SLAs for Orchestration and Scheduling in Data Engineering

Why this matters

In real teams, data is expected on time: dashboards before business opens, feeds for partners by strict cutoffs, and machine learning features refreshed within a fixed window. Getting the schedule and the SLA right prevents noisy alerts, missed deadlines, and broken trust. Typical commitments:

Deliver a daily sales report by 07:00 local time.
Refresh feature tables every 15 minutes with 99% on-time delivery.
Run month-end jobs across time zones and holidays without surprises.

Who this is for

Data engineers who run recurring pipelines and need predictable, on-time delivery.
Analytics engineers and platform engineers aligning data delivery with business hours.
Team leads defining reliability targets and on-call policies.

Prerequisites

Basic understanding of a workflow orchestrator (e.g., Airflow, Dagster, or Prefect).
Familiarity with cron-like schedules and time zones.
Knowledge of your pipelines’ average and p95 runtimes.

Concept explained simply

Scheduling decides when a pipeline starts. An SLA (Service Level Agreement) is the promise about when outputs will be ready for consumers. They’re related but not the same: schedule is a trigger; SLA is the delivery promise.

Mental model

Think of a commuter train. The timetable (schedule) says when the train departs. The commitment to arrive before 08:30 is the SLA. You choose a departure time that normally arrives early enough—even when there’s typical delay (buffer for variability). The SLI is the measurement (e.g., % of trains that arrived by 08:30). The SLO sets a target for that measurement (e.g., 98% of days).
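The SLI and SLO in the train analogy can be sketched in a few lines. This is a minimal illustration, not a production metric: the arrival times are made up, and the function name on_time_sli is hypothetical.

```python
from datetime import time


def on_time_sli(ready_times, deadline):
    """Return the fraction of deliveries that were ready at or before the deadline."""
    if not ready_times:
        raise ValueError("need at least one delivery to compute an SLI")
    on_time = sum(1 for t in ready_times if t <= deadline)
    return on_time / len(ready_times)


# Hypothetical arrival times for five days (local time).
arrivals = [time(8, 21), time(8, 25), time(8, 34), time(8, 19), time(8, 30)]

sli = on_time_sli(arrivals, deadline=time(8, 30))  # 4 of 5 arrivals are on time
slo_target = 0.98                                  # the 98% target from the analogy

print(f"SLI = {sli:.0%}, SLO met: {sli >= slo_target}")
```

With one late arrival out of five, the measured SLI is 80%, which falls short of the 98% SLO.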

Key terms you’ll use

SLA: A clear promise to data consumers (e.g., “Daily sales table is ready by 07:00 America/New_York”).
SLI: The metric you track (e.g., “% of days the table is ready by 07:00”).
SLO: The target for the SLI (e.g., “≥ 99% on-time days each month”).
Schedule: When the job is triggered (cron, event, or hybrid).
Catchup/Backfill: Catchup automatically runs intervals the scheduler missed; backfill reprocesses historical dates on demand. Both need idempotent tasks to be safe.
Data availability window: When upstream data lands; you can’t start before this.
Buffers: Time added to absorb variability (use p95 or p99 durations, not averages).
Concurrency: Limits and queues; ensure required capacity exists at the scheduled time.
Time zones & DST: Specify the timezone; test transitions.
Business calendars: Skip weekends/holidays if the business is closed or SLAs differ.
Alerting: Warnings before the SLA; hard alerts when the SLA is missed; clear ownership and escalation.
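To see why the Buffers entry recommends p95 over averages, here is a small sketch using the nearest-rank percentile method. The runtimes are invented for illustration, and percentile_nearest_rank is a hypothetical helper (other percentile definitions interpolate and can give slightly different values).

```python
from statistics import mean


def percentile_nearest_rank(values, pct):
    """Nearest-rank percentile for an integer pct in 1..100."""
    if not values:
        raise ValueError("values must not be empty")
    ordered = sorted(values)
    # Integer ceiling of pct * n / 100 avoids floating-point rounding surprises.
    rank = (pct * len(ordered) + 99) // 100
    return ordered[max(rank, 1) - 1]


# Hypothetical runtimes in minutes for the last 20 runs.
durations_min = [12, 13, 12, 14, 13, 12, 15, 13, 12, 14,
                 13, 12, 13, 14, 12, 13, 19, 13, 12, 18]

avg = mean(durations_min)                         # 13.45
p50 = percentile_nearest_rank(durations_min, 50)  # 13
p95 = percentile_nearest_rank(durations_min, 95)  # 18

print(f"mean={avg:.2f} min, p50={p50} min, p95={p95} min")
```

Planning against the 13-minute typical run would leave the two slow runs (18 and 19 minutes) uncovered; planning against p95 covers all but the slowest.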

Worked examples

Example 1 — Daily report SLA with buffer

Scenario: SLA: “Report ready by 07:00 America/New_York.” Upstream completes by 06:10 ±5 min. Pipeline p95 runtime: 18 min.

Earliest safe start: 06:15 (wait past typical lateness).
Finish by: 06:33 at p95.
Buffer before SLA: 27 minutes.
Schedule: 06:15 local daily (cron: 15 6 * * * with tz=America/New_York).
Alerts: Soft warn at 06:45 if not finished; hard SLA miss at 07:00 escalates to on-call.
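The arithmetic in Example 1 can be checked with a short script. This is a sketch under the example's own numbers; the helper name plan_start and the calendar date are arbitrary choices for illustration.

```python
from datetime import datetime, timedelta
from zoneinfo import ZoneInfo

NY = ZoneInfo("America/New_York")


def plan_start(upstream_expected, upstream_jitter, p95_runtime, sla_deadline):
    """Return (start, p95 finish, buffer before the SLA) for a scheduled run."""
    start = upstream_expected + upstream_jitter  # wait past typical lateness
    finish_p95 = start + p95_runtime
    buffer = sla_deadline - finish_p95
    if buffer < timedelta(0):
        raise ValueError("p95 finish is after the SLA deadline")
    return start, finish_p95, buffer


# The date is arbitrary; only the wall-clock times come from the example.
start, finish, buffer = plan_start(
    upstream_expected=datetime(2024, 1, 16, 6, 10, tzinfo=NY),
    upstream_jitter=timedelta(minutes=5),
    p95_runtime=timedelta(minutes=18),
    sla_deadline=datetime(2024, 1, 16, 7, 0, tzinfo=NY),
)

print(start.strftime("%H:%M"), finish.strftime("%H:%M"), buffer)
# start 06:15, p95 finish 06:33, buffer 0:27:00
```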

Example 2 — DST-safe scheduling

Scenario: The SLA is business-local: “Table ready by 08:00 Europe/London” (honoring DST). Set the orchestrator timezone to Europe/London and test on the DST change days.

Do not schedule in UTC and convert mentally; a fixed UTC cron shifts by an hour in local time whenever DST changes.
Document the timezone in the SLA text.
Run canaries the week before DST to verify timings.
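The following sketch shows why a fixed UTC schedule drifts against a business-local deadline. It uses Python's standard zoneinfo module; the helper name deadline_in_utc is hypothetical, and the dates are the 2024 spring transition in the UK.

```python
from datetime import datetime, timezone
from zoneinfo import ZoneInfo

LONDON = ZoneInfo("Europe/London")


def deadline_in_utc(year, month, day, hour=8, minute=0):
    """Return the UTC instant of a local Europe/London wall-clock deadline."""
    local_deadline = datetime(year, month, day, hour, minute, tzinfo=LONDON)
    return local_deadline.astimezone(timezone.utc)


# UK clocks moved forward one hour on 2024-03-31.
day_before = deadline_in_utc(2024, 3, 30)
dst_day = deadline_in_utc(2024, 3, 31)

print(day_before.isoformat())  # 2024-03-30T08:00:00+00:00
print(dst_day.isoformat())     # 2024-03-31T07:00:00+00:00
```

The same 08:00 local deadline falls an hour earlier in UTC after the switch, so a job pinned to a UTC time would suddenly have an hour less before the deadline.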

Example 3 — SLO and error budget

SLO: “≥ 98% of weekdays delivered by 07:00 this quarter.”

Weekdays in a typical quarter ≈ 65.
Allowed late days (error budget) = 2% of 65 = 1.3, rounded down to 1. One late day uses up the budget; a second late day breaches the SLO.
Action: On first late day, create a post-incident note; on second, trigger reliability backlog (optimizations).
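The error-budget arithmetic above can be expressed as a small helper. This is a sketch that assumes the SLO is given as a whole-number percentage; the function names are illustrative.

```python
def allowed_misses(total_periods, slo_percent):
    """Whole periods that may be late without breaching the SLO (rounded down)."""
    if not 0 <= slo_percent <= 100:
        raise ValueError("slo_percent must be between 0 and 100")
    # Integer arithmetic avoids floating-point rounding on values like 1.3.
    return total_periods * (100 - slo_percent) // 100


def budget_remaining(total_periods, slo_percent, misses_so_far):
    """Negative values mean the SLO is already breached."""
    return allowed_misses(total_periods, slo_percent) - misses_so_far


print(allowed_misses(65, 98))       # 1
print(budget_remaining(65, 98, 1))  # 0  (budget used up)
print(budget_remaining(65, 98, 2))  # -1 (SLO breached)
```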

How to implement this in practice

Write the SLA in plain language
- Deliverable + Time + Timezone + Days + Owner + Contact.
- Example: “Daily sales table analytics.sales_daily is ready by 07:00 America/New_York on weekdays. Owner: Data Platform. Contact: #data-oncall.”
Measure current performance (SLIs)
- Record start/end timestamps, upstream availability time, duration p50/p95.
- Validate at least two weeks of data before committing SLOs.
Choose schedule type
- Cron-based: predictable cycles (daily, hourly).
- Event-based: trigger when data lands (sensors), with a guardrail timeout.
- Hybrid: sensor + backstop cron (start anyway if sensor hasn’t fired by X).
Add buffers and capacity
- Use p95 duration + upstream lateness to set start time.
- Confirm cluster slots/concurrency at that time block.
Configure alerting
- Warn before SLA (T-10 min), hard alert at SLA time, include run id and owner.
- Suppress duplicate pages during retries; page once, then update status.
Handle catchup/backfill
- Enable catchup for historical gaps; ensure idempotency.
- Document how SLA is interpreted for backfills (usually excluded).
Test across edge cases
- DST transitions, month-end spikes, upstream delays.
- Chaos test: inject 20% slowdown; confirm you still meet SLA or get timely alerts.
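The hybrid schedule type from the steps above (sensor plus backstop) comes down to a small decision rule. The sketch below is orchestrator-agnostic and purely illustrative: the names Decision and hybrid_trigger are hypothetical, and a real orchestrator would evaluate this inside its own sensor or polling loop.

```python
from datetime import datetime
from enum import Enum


class Decision(Enum):
    WAIT = "wait"
    START_ON_EVENT = "start_on_event"
    START_ON_BACKSTOP = "start_on_backstop"


def hybrid_trigger(now, data_landed, backstop_at):
    """Start when data lands; otherwise start anyway once the backstop time passes."""
    if data_landed:
        return Decision.START_ON_EVENT
    if now >= backstop_at:
        return Decision.START_ON_BACKSTOP
    return Decision.WAIT


# Hypothetical backstop: start at 06:30 even if the sensor has not fired.
backstop = datetime(2024, 1, 16, 6, 30)

print(hybrid_trigger(datetime(2024, 1, 16, 6, 5), True, backstop))    # data arrived
print(hybrid_trigger(datetime(2024, 1, 16, 6, 20), False, backstop))  # keep waiting
print(hybrid_trigger(datetime(2024, 1, 16, 6, 30), False, backstop))  # backstop fires
```

Returning which path fired (event or backstop) makes it easy to annotate the run, so a backstop start on incomplete data is visible in later SLA reviews.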

Exercises

Attempt each exercise here, then check the sample solution.

Exercise 1 — Pick a schedule and SLA guards

Scenario: SLA: “Daily orders dashboard ready by 08:30 America/New_York (Mon–Fri).” Upstream feed completes by 08:00 ±3 min. Pipeline p95 runtime: 18 min. Define:

Start time and cron (with timezone) that meets the SLA with buffer.
Soft warning and hard miss alert times.
State how you will handle a one-off 10-minute upstream delay.

Sample solution

Schedule: Start 08:05 local. Cron: 5 8 * * 1-5 with tz=America/New_York.

Expected finish at p95 ≈ 08:23 → 7-minute buffer.
Alerts: Soft warn at 08:25 if not finished; hard SLA miss at 08:30 escalates.
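One-off delay: If upstream lands 10 minutes late (08:10), the run must wait for the data (for example via a sensor or a retry) rather than start at 08:05 on incomplete input. Starting at 08:10, p95 finish ≈ 08:28: still within the SLA, though past the 08:25 soft warning. If the run passes 08:30, log an SLA miss and annotate the run with the upstream delay as the cause.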

Exercise 2 — Configure SLA monitoring (pseudo-Airflow)

Write a minimal configuration to: run at 06:00 Europe/London on weekdays, set a task-level SLA of 20 minutes for the transform step, disable catchup, and send an alert on SLA miss.

Sample solution

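# Sketch for Airflow 2.4+ (2.x only: task-level sla and sla_miss_callback are not available in Airflow 3).
from datetime import timedelta

import pendulum
from airflow import DAG
from airflow.operators.empty import EmptyOperator


def notify_sla_miss(dag, task_list, blocking_task_list, slas, blocking_tis):
    # Placeholder: replace with your alerting integration (e.g., post to #data-oncall).
    print(f"SLA miss in {dag.dag_id}: {task_list}")


with DAG(
    dag_id="sales_daily",
    schedule="0 6 * * 1-5",  # 06:00 on weekdays
    # The cron is evaluated in start_date's timezone; the date itself is a placeholder.
    start_date=pendulum.datetime(2024, 1, 1, tz="Europe/London"),
    catchup=False,
    sla_miss_callback=notify_sla_miss,
) as dag:
    # EmptyOperator stands in for the real extract/transform/load tasks.
    extract = EmptyOperator(task_id="extract")
    # In Airflow 2.x the SLA is measured from the scheduled DAG run start, not from task start.
    transform = EmptyOperator(task_id="transform", sla=timedelta(minutes=20))
    load = EmptyOperator(task_id="load")

    extract >> transform >> load

Key points: weekday cron, explicit timezone (carried by start_date), catchup disabled for this business deliverable, SLA attached to the critical task, and an sla_miss_callback handler.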

Checklist before you submit

Timezone is explicit and matches the SLA text.
Start time considers upstream availability and p95 runtime.
Alerts have both a soft warning and a hard miss with escalation.
Catchup/backfill policy is stated.
Owner/contact are clear.

Common mistakes and self-check

Using average runtime instead of p95: Leads to frequent near-misses. Self-check: Compare last 14 days p50 vs p95; buffer to p95.
Ignoring timezone/DST: Jobs slip by an hour twice a year. Self-check: Simulate the DST switch week in a test env.
No upstream window: Starting before data lands. Self-check: Track and chart upstream arrival times.
Over-alerting: Pages on transient retries. Self-check: Ensure one page per incident with updates, not per retry.
Unclear ownership: Alerts with no responders. Self-check: SLA text includes owner and contact.
Unlimited catchup: Massive backfill storms. Self-check: Throttle catchup and set concurrency limits.
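The over-alerting self-check (one page per incident, with updates rather than repeat pages) can be sketched as a tiny deduplicating notifier. The class and method names here are hypothetical, and a real setup would usually rely on the paging tool's own deduplication key instead.

```python
class IncidentPager:
    """Pages once per incident id; later notifications become status updates."""

    def __init__(self):
        self.paged_incidents = set()
        self.log = []  # (kind, incident_id, message) tuples, kept for inspection

    def notify(self, incident_id, message):
        kind = "update" if incident_id in self.paged_incidents else "page"
        self.paged_incidents.add(incident_id)
        self.log.append((kind, incident_id, message))
        return kind


pager = IncidentPager()
# Use the run id as the incident id so retries of the same run share one incident.
first = pager.notify("sales_daily/2024-01-16", "SLA missed at 07:00")
retry = pager.notify("sales_daily/2024-01-16", "retry 1 still running")
other = pager.notify("sales_daily/2024-01-17", "SLA missed at 07:00")

print(first, retry, other)  # page update page
```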

Practical projects

Project A: Convert three business deliverables into SLAs/SLOs/SLIs. Implement schedules and alerts; produce a one-page runbook.
Project B: Build a hybrid trigger: event-based start with a cron backstop and a guardrail timeout. Prove it survives a missed event.
Project C: Reliability tune-up: reduce p95 by 30% for one pipeline (parallelism, caching, pruning). Show improved on-time SLI.

Learning path

Before this: Basics of your orchestrator, task dependencies, idempotent transformations.
This topic: Translate business needs to schedules and SLAs; implement alerts; plan buffers and capacity.
Next: Advanced dependency management, backfill strategies at scale, incident response and postmortems.

Next steps

Standardize your team’s SLA template and adopt it for top 5 pipelines.
Automate SLI collection and a weekly reliability report.
Rehearse a DST cutover and a delayed-upstream scenario.

Mini challenge

Business asks: “Marketing leads table must be ready by 09:00 America/Los_Angeles on weekdays.” Upstream CRM export lands by 08:20 ±8 min; pipeline p95 is 22 minutes. Design:

Schedule (cron + timezone) and start time.
Soft and hard alert times.
One sentence SLA, one SLO, and one SLI.

One possible approach

Start 08:30 local (after typical lateness), cron 30 8 * * 1-5 tz=America/Los_Angeles.
p95 finish ≈ 08:52 → 8-minute buffer.
Warn at 08:55; hard miss at 09:00; escalate to on-call.
SLA: “Leads table ready by 09:00 America/Los_Angeles on weekdays.” SLO: “≥ 99% of weekdays on time each month.” SLI: “% of weekdays the table is ready by 09:00.”
