Who this is for
- MLOps and data platform engineers who run ML training, batch feature jobs, and offline inference.
- Data engineers responsible for data freshness and reliability of scheduled pipelines.
- Analysts or ML scientists who want to translate business deadlines into robust schedules and SLAs.
Prerequisites
- Basic familiarity with an orchestrator (e.g., understanding of DAGs/flows, tasks, retries, sensors).
- Comfort with cron syntax and timezones (UTC vs local).
- Understanding of upstream/downstream data dependencies and idempotent jobs.
Why this matters
In the MLOps role, you are accountable not just for building pipelines, but for when they deliver value. Marketing needs fresh predictions by 08:00. Finance needs month-end scoring by noon on the first of the month. Model retraining must complete before dashboards refresh. Scheduling and SLA management let you guarantee these outcomes, absorb delays safely, and communicate clearly when the system will recover.
Real tasks you will do
- Convert a business deadline into a cron schedule with buffers, retries, and catchup.
- Define SLIs (what you measure), SLOs (the target), and SLAs (the commitment) for ML pipelines.
- Handle backfills, holidays, daylight saving time, and upstream slips without breaking the contract.
- Design alerting and escalation for SLA misses with clear runbooks.
Concept explained simply
Scheduling decides when a pipeline should run. SLA management ensures it finishes in time for someone who depends on it. You measure reality (SLIs), set targets (SLOs), and commit a promise (SLA) that includes what happens if you miss.
Definitions at a glance
- SLI (Service Level Indicator): A metric. Example: “Pipeline completed by 07:30 local.”
- SLO (Service Level Objective): Target for the SLI. Example: “≥ 99% of weekdays.”
- SLA (Service Level Agreement): A formal commitment that may define escalation or fallback. Example: “If missed, alert within 5 minutes and publish last good model.”
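The three layers can be made concrete with a small sketch: a hypothetical run log, the SLI computed from it, and a check against the SLO (all names and numbers here are illustrative, not from a real system).

```python
def on_time_rate(runs):
    """SLI: fraction of runs that completed by the deadline.

    `runs` is one boolean per weekday run
    (True = completed by 07:30 local).
    """
    return sum(runs) / len(runs)

# Hypothetical month: 20 weekday runs, 1 miss.
runs = [True] * 19 + [False]
sli = on_time_rate(runs)   # 0.95
slo_met = sli >= 0.99      # False: the SLA's escalation/fallback clause applies
```

The SLA is the part that cannot be computed: it is the agreed consequence (alert within 5 minutes, publish the last good model) that fires when `slo_met` is false.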
Mental model
Imagine a commuter train. The timetable is your schedule. How often trains run on time is your SLO. The promise to passengers that a train will arrive, or that you'll provide an alternative, is your SLA. Buffers and detours (retries, catchup, fallbacks) protect the promise when things go wrong.
Key concepts you will use
- Time-based vs event-based triggers (cron vs data-available sensors).
- Windows: daily, weekly, monthly runs; rolling vs fixed windows; backfills.
- Catchup: running missed periods after downtime.
- Concurrency and pools: limit parallelism to avoid overload.
- Timeouts and retries with backoff; max active runs per pipeline.
- Calendars and timezones: holidays, DST; prefer UTC for storage and internal scheduling.
- Idempotency and safe reruns: design to rerun without duplicate effects.
- Alerting: SLA miss alerts, runtime breach alerts, data freshness alerts.
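Idempotency is the property that makes safe reruns, retries, and catchup possible. A minimal sketch, where a dict stands in for a partitioned table (names are illustrative):

```python
def write_partition(store, partition_key, rows):
    """Overwrite the whole partition so a rerun replaces data instead of appending."""
    store[partition_key] = list(rows)

store = {}
write_partition(store, "2024-01-08", [1, 2, 3])
write_partition(store, "2024-01-08", [1, 2, 3])  # rerun: same state, no duplicates
```

An append-based write would double the rows on the second call; partition overwrite is what lets retries and backfills run without a cleanup step.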
Worked examples
Example 1 — Daily model retraining for 07:30 dashboard
- Business need: Dashboard at 07:30 local, Mon–Fri.
- Data readiness: Features by 06:00 (p95).
- Durations: Train 30m, Evaluate 5m, Register 5m, Publish 5m.
- Plan: Start 06:30 (buffer 30m for late data), expected finish 07:15, 15m slack.
- Cron: 30 6 * * 1-5 (schedule in UTC if that is the ops standard; otherwise set the local timezone in the orchestrator).
- Retries: 2 attempts, 5m backoff; per-task timeout 20m; pipeline timeout 60m.
- SLA: Complete by 07:30 on ≥99% weekdays; if miss, alert + auto-publish last good model.
- Catchup: Enabled; backfill for missed days during outages.
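The plan above is just arithmetic over the stated durations; a quick sanity check with the standard library (the specific date is arbitrary):

```python
from datetime import datetime, timedelta

start = datetime(2024, 1, 8, 6, 30)      # 06:30 start, after upstream p95 + 30m buffer
deadline = datetime(2024, 1, 8, 7, 30)   # dashboard refresh
durations_min = [30, 5, 5, 5]            # train, evaluate, register, publish

finish = start + timedelta(minutes=sum(durations_min))
slack_min = (deadline - finish).total_seconds() / 60
# finish is 07:15, leaving 15 minutes of slack before the deadline
```

If the slack comes out negative, the plan is infeasible and you must start earlier, shorten the pipeline, or renegotiate the deadline before promising an SLA.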
Example 2 — Weekly feature rebuild with DST and holidays
- Need: Recompute features every Sun 02:00 UTC to avoid business hours.
- Durations: 90m typical; large data weeks 150m.
- Plan: Start 02:00 UTC; SLA complete by 05:00 UTC with 2-hour buffer.
- Calendar: UTC avoids DST confusion; skip on maintenance holidays and run Monday 02:00 UTC instead.
- Cron: 0 2 * * 0 (UTC). Holiday skip implemented via calendar condition check.
- Backfill policy: Allow ranged backfills per week; ensure idempotent overwrite of partitions.
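The "skip and run Monday instead" rule reduces to a tiny calendar check. A sketch, with a hypothetical holiday set (a real setup would load this from a shared calendar):

```python
from datetime import date, timedelta

MAINTENANCE_HOLIDAYS = {date(2024, 3, 31)}  # illustrative calendar entry

def effective_run_date(scheduled_sunday: date) -> date:
    """Shift a Sunday run to Monday when it falls on a maintenance holiday."""
    if scheduled_sunday in MAINTENANCE_HOLIDAYS:
        return scheduled_sunday + timedelta(days=1)
    return scheduled_sunday
```

The cron stays at `0 2 * * 0` UTC; the condition check runs first thing in the pipeline and either proceeds or defers, so the schedule itself never needs editing around holidays.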
Example 3 — Monthly batch inference dependent on external dataset
- Need: Score on the first business day, by 12:00 local.
- Dependency: External dataset delivered between 03:00–06:00 local.
- Trigger: Time + external-availability sensor polling every 30m until the latest safe start (09:30).
- Plan: Start immediately when data is present; latest safe start 09:30; runtime 120m; SLA 12:00.
- Fallback: If data absent by 09:30, trigger escalation and use prior-month model to produce a provisional report.
- Catchup: Disabled for the monthly job; use a manual backfill runbook to avoid accidental late publication.
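The hybrid trigger reduces to a three-way decision at every sensor poll; a sketch of that logic (times and names are illustrative):

```python
from datetime import datetime

def poll_decision(now, data_present, latest_safe_start):
    """What the pipeline should do at each 30m sensor poll."""
    if data_present:
        return "run"
    if now >= latest_safe_start:
        return "escalate_and_fallback"  # prior-month model, provisional report
    return "wait"

cutoff = datetime(2024, 2, 1, 9, 30)
poll_decision(datetime(2024, 2, 1, 5, 0), True, cutoff)    # data arrived early: run
poll_decision(datetime(2024, 2, 1, 9, 30), False, cutoff)  # cutoff hit: fallback
```

The key design choice is that the fallback is a decision baked into the trigger, not a human judgment call made under deadline pressure.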
Step-by-step: From deadline to robust schedule
- Clarify dependency and deadline: Who consumes the output? Exact time window? Timezone?
- Measure reality: Collect durations (p50, p95), upstream arrival p95, and failure rates.
- Choose trigger(s): Pure cron, event-based, or hybrid (cron + sensor with timeout).
- Set buffers: Start after upstream p95 and finish before consumer deadline with slack.
- Define SLIs/SLOs/SLA: Completion time, data freshness, success rate; targets and commitments.
- Reliability controls: Retries, timeouts, concurrency limits, resource pools, idempotency checks.
- Backfills and catchup: Enable for periodic gaps; document runbook and safe ranges.
- Alerting and escalation: Who, how fast, what fallback, and when to page vs notify.
- Document in a runbook: One page with schedule, SLA, contacts, metrics, and procedures.
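The measurement and buffer steps above amount to computing a feasible start window from the percentiles you collected. A minimal helper, assuming minutes as the duration unit (parameter names are illustrative):

```python
from datetime import datetime, timedelta

def start_window(upstream_ready_p95, deadline, pipeline_p95_min, slack_min=15):
    """Earliest and latest safe start for one run.

    Start no earlier than upstream p95 arrival, and no later than
    the deadline minus pipeline p95 minus slack.
    """
    latest = deadline - timedelta(minutes=pipeline_p95_min + slack_min)
    if latest < upstream_ready_p95:
        raise ValueError("no feasible window: add capacity or renegotiate the deadline")
    return upstream_ready_p95, latest

earliest, latest = start_window(
    upstream_ready_p95=datetime(2024, 1, 8, 6, 0),
    deadline=datetime(2024, 1, 8, 7, 30),
    pipeline_p95_min=45,
)
# window is 06:00 to 06:30; any start inside it protects the SLA
```

Raising instead of silently clamping is deliberate: an infeasible window is a contract problem to escalate, not a scheduling detail to paper over.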
Monitoring and alerts
- Core SLIs: On-time completion rate; end-to-end latency; data freshness at publish time; failure rate; retry count.
- Alert routes: Info (Slack/email), Warning (on-call during business hours), Critical (pager when SLA at risk).
- Alert rules: Predictive alerts when runtime trend suggests SLA risk; immediate alert on dependency sensor timeout.
Example alert thresholds
- Warning if runtime exceeds p95 for 3 consecutive days.
- Critical when 20 minutes remain before the deadline and the remaining tasks’ p95 exceeds the time left.
- Critical if external data not arrived by latest safe start time.
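The three threshold rules above can be combined into one classification function that an alerting job evaluates each poll (thresholds mirror the bullets; all parameter names are illustrative):

```python
def alert_level(minutes_to_deadline, remaining_p95_min,
                days_runtime_over_p95, data_arrived, past_latest_safe_start):
    """Map current pipeline state to an alert severity."""
    if not data_arrived and past_latest_safe_start:
        return "critical"   # external data missed the cutoff
    if minutes_to_deadline <= 20 and remaining_p95_min > minutes_to_deadline:
        return "critical"   # remaining work no longer fits before the deadline
    if days_runtime_over_p95 >= 3:
        return "warning"    # runtime trend suggests growing SLA risk
    return "ok"

alert_level(20, 35, 0, True, False)   # 35m of work, 20m left: "critical"
alert_level(120, 30, 3, True, False)  # trend breach only: "warning"
```

Ordering matters: the hard SLA-risk conditions are checked before the trend rule so a critical state is never downgraded to a warning.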
Capacity and backlog management
- Limit max active runs to prevent stampedes during catchup.
- Use pools/queues for heavy tasks (GPU training, large joins).
- Pause lower-priority backfills when at risk of SLA for current period.
- Prefer UTC for scheduling; convert to local for communications.
Backfill strategy
- Run newest periods first to restore current SLA; then backfill older gaps.
- Use idempotent writes and partition overwrite to avoid duplicates.
- Throttle concurrency to protect shared warehouses and clusters.
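The "newest first" rule is a one-line sort, and throttling falls out of batching; a sketch (the batch size stands in for your pool limit):

```python
from datetime import date

def backfill_batches(missing_dates, max_parallel=2):
    """Newest periods first to restore the current SLA, in throttled batches."""
    ordered = sorted(missing_dates, reverse=True)
    return [ordered[i:i + max_parallel] for i in range(0, len(ordered), max_parallel)]

gaps = [date(2024, 1, d) for d in (3, 5, 4, 6)]
backfill_batches(gaps)
# batches: Jan 6 and 5 first, then Jan 4 and 3
```

Combined with idempotent partition overwrites, each batch can be rerun safely if a backfill run fails midway.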
Exercises
Do these, then check your answers in the solutions below.
Exercise 1 — Design a daily schedule and SLA
Business request: “Our sales dashboard must be ready by 08:00 local, Tue–Sat. Upstream features arrive by 06:00 (p95). Training takes 45 minutes p95; evaluation/publish 10 minutes. We want high reliability and a sensible buffer.”
- Pick a start time, cron, and timezone choice.
- Set SLIs/SLOs/SLA (completion time, reliability target).
- Define retries/timeouts and a fallback for misses.
- Choose catchup and backfill policies.
Exercise 2 — Triage recurring SLA misses
Scenario: Your weekly feature job (Sun 02:00 UTC) often finishes at 05:40. SLA is 05:00. Runtime p95 is now 210 minutes (previously 150). You must reduce risk and restore SLA reliability.
- Propose scheduling and resource changes.
- Adjust alerts to be predictive, not last-minute.
- Decide whether to move the start time, split tasks, or change pools.
Checklist before you ship
- [ ] Confirm timezone and holiday calendar.
- [ ] Confirm upstream readiness p95 and latest safe start.
- [ ] Ensure idempotent writes and a clear backfill plan.
- [ ] Define SLIs, SLOs, and SLA escalation steps.
- [ ] Test retries/timeouts and simulate an upstream delay.
Common mistakes and self-check
- Too-tight SLOs: No buffer between expected finish and deadline. Self-check: Do you have at least one task’s p95 worth of slack?
- Ignoring upstream SLAs: Scheduling before data is reliably ready. Self-check: Is your start time set after upstream p95 plus a sensor timeout?
- Disabled catchup causing silent data gaps. Self-check: Can you backfill last week’s dates safely?
- Timezone/DST drift. Self-check: Are you using UTC internally and converting only for communication?
- Unbounded parallel backfills choking resources. Self-check: Have you set max active runs and pool limits?
Practical projects
- Implement a daily training pipeline with cron + data-available sensor and a 20-minute SLA buffer. Include retries and a last-good-model fallback.
- Create a weekly feature rebuild with a holiday calendar and ranged backfill CLI/runbook.
- Add SLA miss and predictive runtime alerts using p95 thresholds and latest safe start logic.
Learning path
- Before this: Orchestration fundamentals, data dependencies, idempotent job design.
- Now: Scheduling and SLA Management (this page).
- Next: Monitoring and Alerting, Cost-aware scalability, and Incident response runbooks.
Next steps
- Take the Quick Test to validate your understanding.
- Pick one Practical project and implement it in a sandbox environment.
Mini challenge
You run two daily jobs: A) Model retrain due 07:30; B) Batch inference due 08:30 and depends on A. Yesterday A finished at 07:55 and B missed its SLA. Propose a plan to protect both SLAs: scheduling buffers, alert thresholds, and fallback for B when A is late. Write your plan in 5–7 bullet points and include one predictive alert rule.