Who this is for
- MLOps Engineers building scheduled or event-driven ML pipelines.
- Data Engineers and ML Engineers who need reliable batch or streaming workflows.
- Anyone owning operational SLAs for training, feature pipelines, or inference batch jobs.
Prerequisites
- Basic familiarity with a workflow/orchestration tool (e.g., Airflow/Prefect/Dagster/Kubeflow).
- Comfort with logs, metrics, and alerts in your stack.
- Understanding of your pipeline’s typical runtimes and failure modes.
Why this matters
In production, tasks fail, APIs slow down, and jobs overrun. Thoughtful retries, timeouts, and SLAs keep pipelines reliable, predictable, and cheaper to run. You’ll use these to:
- Reduce flakiness from transient errors without waking up the on-call.
- Prevent jobs from running forever and burning compute.
- Alert the right people quickly when a deadline is at risk.
Concept explained simply
- Retries: Try again when a failure might be temporary. Add backoff so each attempt waits longer than the last, and add jitter to avoid synchronized retry storms.
- Timeouts: A time limit that stops a task that is slow or stuck. Timeouts can apply per attempt, per task, or to the whole workflow.
- SLAs: A deadline by which a task/workflow should complete. If breached, notify or take fallback action.
Mental model: three protective fences
- Inner fence (Timeout): "Don’t run longer than X." Prevents runaway compute.
- Middle fence (Retries + Backoff + Jitter): "If it fails, try again, smarter and slower." Handles flakiness.
- Outer fence (SLA): "All of this must finish by Y." Aligns runtime with business needs.
Key terms (quick reference)
- Exponential backoff: Waits grow geometrically (e.g., 15s, 30s, 60s...).
- Jitter: Randomizes each wait so retries don’t fire in lockstep (the thundering-herd problem).
- Non-retryable errors: Fail fast errors (e.g., validation errors) that retries won’t fix.
- Per-try timeout vs. overall timeout: Limit for each attempt vs. total run limit for a task or DAG.
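To make the backoff and jitter terms concrete, here is a minimal Python sketch of how a retry wait could be computed; the function and parameter names (base_delay, factor, max_delay) are illustrative and mirror the settings listed in the next section, not any particular orchestrator's API.

import random

def backoff_delay(attempt, base_delay=15.0, factor=2.0, max_delay=120.0):
    # attempt is 1 for the first retry, 2 for the second, and so on.
    # Grow the wait geometrically, but never beyond max_delay.
    capped = min(max_delay, base_delay * factor ** (attempt - 1))
    # Full jitter: pick a random point in [0, capped] so clients desynchronize.
    return random.uniform(0, capped)

# Example: waits for three retries (values vary run to run because of jitter).
print([round(backoff_delay(a), 1) for a in (1, 2, 3)])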
Key settings and patterns
- Retries: retries=N; backoff=exponential; base_delay; max_delay; jitter on; non_retryable=[...].
- Timeouts: task_timeout; overall_run_timeout; graceful_cancel hooks for external resources.
- SLAs: SLA >= P95 runtime + buffer (10–30%). Alert on breach; don’t auto-kill unless it’s safe.
- Idempotency: Ensure safe re-runs (dedupe keys, upserts, checkpoints, external job cancellation).
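As one example of how these settings map onto a real orchestrator, here is a minimal sketch assuming Apache Airflow 2.4+; the extract_features callable and the concrete durations are illustrative, and Prefect, Dagster, and Kubeflow expose equivalent knobs under different names.

from datetime import datetime, timedelta
from airflow import DAG
from airflow.operators.python import PythonOperator

def extract_features():
    ...  # hypothetical task body

with DAG(
    dag_id="daily_features",
    start_date=datetime(2024, 1, 1),
    schedule="@daily",
    dagrun_timeout=timedelta(minutes=60),        # overall run timeout for the whole DAG
) as dag:
    PythonOperator(
        task_id="extract_features",
        python_callable=extract_features,
        retries=3,                                # retry attempts after the first try
        retry_delay=timedelta(seconds=20),        # base delay between attempts
        retry_exponential_backoff=True,           # waits grow between attempts
        max_retry_delay=timedelta(minutes=2),     # cap on the wait
        execution_timeout=timedelta(minutes=8),   # per-try timeout
        sla=timedelta(minutes=45),                # alert-only deadline; a miss notifies, it does not kill the task
    )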
Worked examples
Example 1: Flaky warehouse read
Scenario: Feature extraction occasionally fails with transient 502s.
- Retries: 3 attempts, exponential backoff base 20s, factor 2, max 2m, jitter enabled.
- Timeouts: Per-try query timeout 8m (normal 3–5m).
- SLA: Daily pipeline SLA 45m (P95 total 35m + buffer 10m).
- Idempotency: Use a partition key (date) and upsert, so partial writes don’t duplicate.
Why this works
Short backoff handles brief outages; jitter avoids retry storms; per-try timeout keeps one bad query from burning the SLA; idempotency makes retries safe.
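A sketch of what this plan could look like as a retry loop owned by the task itself; run_feature_query and upsert_partition are hypothetical stand-ins for your warehouse client and sink, and the per-try timeout is assumed to be enforced client-side via the timeout_s argument.

import random
import time

class TransientWarehouseError(Exception):
    """Stand-in for transient failures such as HTTP 502 from the warehouse."""

def run_feature_query(run_date, timeout_s):
    ...  # hypothetical: runs the extraction query with a client-side timeout

def upsert_partition(run_date, rows):
    ...  # hypothetical: idempotent write keyed by the date partition

def extract_features(run_date, attempts=3, base=20.0, factor=2.0, cap=120.0):
    for attempt in range(1, attempts + 1):
        try:
            rows = run_feature_query(run_date, timeout_s=8 * 60)  # per-try timeout 8m
            upsert_partition(run_date, rows)  # safe to re-run: same key, same result
            return
        except TransientWarehouseError:
            if attempt == attempts:
                raise  # retries exhausted; let the orchestrator mark the task failed
            wait = random.uniform(0, min(cap, base * factor ** (attempt - 1)))
            time.sleep(wait)  # exponential backoff with full jitter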
Example 2: Model training can overrun
Scenario: Training usually takes 18–25m but can spike to 50m due to data skew.
- Retry policy: 1 retry only for known transient errors (e.g., spot instance preemption). Non-retryable: bad hyperparams, ValueError.
- Timeouts: Per-try timeout 30m; overall workflow timeout 60m.
- SLA: 75m for the whole DAG (P95 ~55–60m + 15m buffer).
- Safety: Checkpoint every 5m; on timeout, send a cancel request to the training service.
Why this works
Timeout caps cost; limited retries avoid doubling long runs; cancel prevents orphaned GPU jobs.
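A minimal sketch of the timeout-plus-cancel pattern, assuming training runs on an external service you poll; submit_training_job, job_status, and cancel_job are hypothetical client calls, and the 5m checkpointing is assumed to happen inside the training job itself.

import time

PER_TRY_TIMEOUT_S = 30 * 60  # 30m per-try cap from the example

def submit_training_job(config):
    ...  # hypothetical: returns a job_id from the training service

def job_status(job_id):
    ...  # hypothetical: returns "running", "succeeded", or "failed"

def cancel_job(job_id):
    ...  # hypothetical: stops the remote job so no orphaned GPUs keep running

def run_training(config, poll_s=30):
    job_id = submit_training_job(config)
    deadline = time.monotonic() + PER_TRY_TIMEOUT_S
    while time.monotonic() < deadline:
        status = job_status(job_id)
        if status == "succeeded":
            return job_id
        if status == "failed":
            raise RuntimeError("training failed")  # orchestrator decides whether to retry
        time.sleep(poll_s)
    cancel_job(job_id)  # per-try timeout reached: clean up the external job first
    raise TimeoutError("training exceeded the per-try timeout")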
Example 3: External API with rate limits
Scenario: Embedding API sometimes returns 429 or 5xx.
- Retries: Up to 4; exponential backoff base 10s, cap 90s, full jitter. Honor Retry-After when present.
- Timeout: Per call 20s; batch task timeout 25m.
- SLA: Batch must finish by 40m; fallback to cached embeddings if SLA risk is high.
- Concurrency: Limit parallel calls to respect provider limits.
Why this works
Backoff + jitter protect both your job and the API. Timeout + fallback maintain user-facing reliability.
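A hedged sketch of the retry loop for this case using the requests library; the endpoint URL and payload shape are placeholders, attempts counts total tries, and the Retry-After handling assumes the header carries a number of seconds.

import random
import time
import requests

API_URL = "https://api.example.com/embed"  # placeholder endpoint

def embed_batch(texts, attempts=4, base=10.0, cap=90.0):
    for attempt in range(1, attempts + 1):
        resp = requests.post(API_URL, json={"inputs": texts}, timeout=20)  # per-call timeout 20s
        if resp.ok:
            return resp.json()
        retryable = resp.status_code == 429 or resp.status_code >= 500
        if not retryable or attempt == attempts:
            resp.raise_for_status()  # fail fast on non-retryable errors or when exhausted
        retry_after = resp.headers.get("Retry-After")
        if retry_after is not None:
            wait = float(retry_after)  # honor the server's hint when present
        else:
            wait = random.uniform(0, min(cap, base * 2 ** (attempt - 1)))  # full jitter, capped at 90s
        time.sleep(wait)

Concurrency limiting is not shown here; a semaphore or your orchestrator's pool/concurrency settings would sit around calls to embed_batch.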
Design recipe (apply in your stack)
- Measure: Collect P50/P95 runtimes and top failure modes for each task.
- Choose Timeouts: Per-try timeout slightly above P95; set overall run timeout to cap total cost.
- Define Retries: 2–4 attempts, exponential backoff with jitter, cap max delay, specify non-retryable errors.
- Set SLA: P95 of total runtime + 10–30% buffer. Define alert path and a safe fallback.
- Make Idempotent: Use unique keys, upserts, checkpoints, and cleanup hooks.
- Test: Run chaos tests: inject 5xx responses, add latency, and simulate hangs; verify that alerts fire and cleanup runs.
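The "Measure" and "Set SLA" steps can be a few lines once you export run durations; this sketch uses Python's statistics module with illustrative runtime histories in minutes, where the task history drives the per-try timeout and the whole-pipeline history drives the SLA.

import statistics

def p95(durations_minutes):
    # statistics.quantiles with n=100 yields 99 cut points; index 94 is the 95th percentile.
    return statistics.quantiles(durations_minutes, n=100)[94]

task_runs = [22, 24, 25, 23, 31, 26, 24, 35, 27, 25, 28, 24]        # illustrative task runtimes
pipeline_runs = [33, 36, 35, 34, 44, 38, 36, 49, 40, 37, 41, 36]    # illustrative whole-DAG runtimes

per_try_timeout = round(p95(task_runs) * 1.1)   # slightly above the task's P95
dag_sla = round(p95(pipeline_runs) * 1.2)       # P95 + ~20% buffer (within the 10–30% range)
print(per_try_timeout, dag_sla)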
Common mistakes and self-check
- Retry storm: Missing jitter and caps cause synchronized retries. Fix: enable jitter and max delay; limit concurrency.
- Non-idempotent side effects: Duplicated inserts on retry. Fix: dedupe keys, upserts, transactional writes.
- Wrong timeout level: Only task timeout set; workflow still runs forever. Fix: set both task and overall run timeouts.
- Retrying on logic errors: Wastes time. Fix: mark non-retryable exceptions (e.g., validation errors).
- Over-tight SLA: Constant false alarms. Fix: base on P95 with buffer and review monthly.
- No cleanup: External job keeps running after timeout. Fix: implement cancel/cleanup handlers.
Self-check checklist
- Can you show P95 runtimes used to pick SLA?
- Do retries use exponential backoff with jitter and a max cap?
- Are non-retryable exceptions explicitly listed?
- Are both per-try and overall timeouts defined?
- Is the task idempotent and/or does it dedupe?
- Is there a cleanup/cancel path for external resources?
- Did you run a failure simulation and receive the expected alerts?
Exercises
Complete these tasks. You can check the suggested solutions below each exercise.
Exercise 1: Design a retry/backoff plan under an SLA
Nightly feature extraction reads from the warehouse and writes to a partitioned table. Typical pipeline time is 25m; the feature step averages 8m and sometimes fails early with 502. The whole pipeline SLA is 40m. Propose: retries count, backoff policy (base, factor, cap, jitter), per-try timeout, and show worst-case added delay so the SLA is still met.
Show solution
Sample plan:
- Retries: 3 total attempts (1 initial + 2 retries)
- Backoff: exponential, base 20s, factor 2, max 2m, full jitter
- Per-try timeout: 10m for the feature step (normal 8m)
- Worst-case timing overhead:
  Attempt 1 fails early (~1m) -> wait ~20s
  Attempt 2 runs the full 10m -> wait ~40s
  Attempt 3 runs the full 10m
  Total step time ≈ 1m + 0.33m + 10m + 0.67m + 10m ≈ 22m
  Other pipeline steps ~17m -> total ≈ 39m < 40m SLA
- Idempotency: Upsert by (run_date, entity_id). Non-retryable: schema/validation errors.
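If you want to sanity-check the worst-case arithmetic above, a few lines of Python reproduce it; the durations are the ones assumed in the sample plan, with backoff waits taken at their nominal (un-jittered) values.

early_fail = 1.0        # attempt 1 fails early, in minutes
wait1 = 20 / 60         # ~20s backoff after attempt 1
full_try = 10.0         # attempts 2 and 3 run to the 10m per-try timeout
wait2 = 40 / 60         # ~40s backoff after attempt 2
other_steps = 17.0      # the rest of the pipeline

worst_case = early_fail + wait1 + full_try + wait2 + full_try + other_steps
print(round(worst_case, 1))  # 39.0 -> under the 40m SLA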
Exercise 2: Pseudo-config for retries, timeouts, and SLA
Write a pseudo-configuration for a task that calls an external API. Requirements: 4 total attempts, exponential backoff factor 2, base delay 15s, max delay 2m, jitter on; per-try timeout 25s; task-level overall timeout 8m; non-retryable on ValueError and 4xx except 429; idempotency key per batch; SLA for the DAG at 60m with alert-only behavior.
Show solution
task "embed_batch":
idempotency_key: batch_id
retries:
total_attempts: 4
backoff: exponential
base_delay: 15s
factor: 2
max_delay: 2m
jitter: true
non_retryable: [ValueError, Http4xxExcept429]
timeouts:
per_try: 25s
task_overall: 8m
on_timeout: cancel_external_job()
behavior:
respect_retry_after: true
concurrency_limit: 5
dag_sla:
deadline: 60m
on_breach: alert("mlops-oncall")
Practical projects
- Harden a real pipeline: add per-try timeouts, retries with jitter, and SLA alerts. Validate with chaos tests that inject 5xx and latency.
- External-job cleanup: integrate cancel hooks so timeouts gracefully stop jobs on your compute/service.
- Idempotent writes: refactor a non-idempotent sink to upserts keyed by date/entity to make retries safe.
Learning path
- Instrument runtimes and error types; compute P95.
- Add timeouts, retries with jitter and caps; specify non-retryables.
- Set SLA = P95 + buffer; define alert and fallback.
- Implement idempotency and cleanup hooks.
- Run failure simulations; refine limits monthly.
Next steps
- Apply this to one production DAG this week.
- Add dashboards for runtime percentiles, retry counts, and SLA breaches.
- Document your policies and share with the team.
Mini challenge
Your nightly DAG has three tasks: extract P95=7m, transform P95=9m, train P95=20m. Propose per-try timeouts, retries, and a DAG SLA with a buffer, and describe your non-retryable errors. Keep the total SLA realistic and alert-only.