Why this matters
In ML training and batch pipelines, things fail: object storage hiccups, flaky APIs, schema changes, and quota limits. Without good failure handling, your models go stale, SLAs are missed, and teams lose trust in automation. Robust retries and clear alerts keep retraining reliable and predictable.
- Nightly retraining fails due to transient storage errors. Smart retries finish the job without waking anyone up.
- Feature generation runs late. An alert with context allows quick triage and minimizes data drift.
- A training step re-runs after a worker restart. Idempotent design prevents duplicate models or double billing.
Who this is for
- MLOps engineers building or maintaining training and batch pipelines.
- Data engineers orchestrating feature pipelines.
- ML engineers who own model retraining jobs.
Prerequisites
- Basic Python or another scripting language.
- Familiarity with workflow orchestrator concepts (e.g., Airflow, Prefect, Dagster).
- Understanding of your storage, compute, and messaging services’ limits/timeouts.
Concept explained simply
Failures are either transient (usually recover if retried) or permanent (need a code/config/data fix). Retries are an automatic way to give transient issues time to resolve. Alerts tell humans when automation cannot proceed safely or when SLOs are threatened.
Mental model
Imagine your pipeline as a train line. Switches and signals (policies) keep trains safe. If a track is briefly blocked (transient), trains slow down and try again later (backoff). If the track is broken (permanent), trains stop at a station (fail fast) and the control room (alerts) dispatches maintenance with a runbook.
Failure taxonomy
- Transient: network blips, brief rate limits, temporary 5xx, short quota hiccups.
- Permanent: bad credentials, code bugs, schema mismatch, missing partition, wrong path.
- Unknown: treat conservatively; retry a few times then escalate (see the classification sketch after this list).
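To make the taxonomy concrete, here is a minimal classification sketch in Python. It assumes HTTP-style status codes; the function name classify_status and the thresholds are illustrative, not a library API, so adapt them to the error types your services actually return.
def classify_status(status_code: int) -> str:
    # Assumption: 429 and 5xx usually clear up on retry; other 4xx need a human fix.
    if status_code == 429 or 500 <= status_code <= 599:
        return "transient"  # rate limits, temporary server-side errors
    if 400 <= status_code <= 499:
        return "permanent"  # bad credentials, missing path, schema/contract problems
    return "unknown"        # be conservative: retry a few times, then escalate
For example, classify_status(503) returns "transient" while classify_status(403) returns "permanent".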
Defaults that work well in practice
- Exponential backoff with jitter: delay = min(max_delay, base * 2^attempt) + random_jitter (illustrated in the sketch after this list)
- Max attempts: 3–6 for batch jobs; fewer for fast SLAs.
- Global timeout per task and per run to avoid infinite retries.
- Idempotency keys (run_id) to make retries safe.
- Alerts only on actionable states; group/aggregate to reduce noise.
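To see what the backoff formula above produces in practice, the short sketch below prints a delay schedule; base, max_delay, and the 25% jitter cap are illustrative defaults, not required values.
import random

def backoff_delay(attempt: int, base: float = 1.0, max_delay: float = 20.0) -> float:
    # delay = min(max_delay, base * 2^attempt) + random_jitter
    capped = min(max_delay, base * (2 ** attempt))
    return capped + random.uniform(0, capped * 0.25)

for attempt in range(5):
    print(f"attempt={attempt} delay={backoff_delay(attempt):.2f}s")
The delays grow roughly 1s, 2s, 4s, 8s, 16s plus jitter, and are capped at max_delay.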
Runbook skeleton for alerts
- What failed, when, correlation/run_id
- Last logs and error snippet
- Is it safe to retry? Manual retry steps
- Escalation path if issue persists
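As a reference, an alert payload that follows this skeleton might look like the sketch below; every field name and value here is made up for illustration and should be adapted to your alerting tool.
alert = {
    "pipeline": "nightly-retrain",        # what failed
    "step": "feature_extraction",
    "failed_at": "2024-01-01T02:13:45Z",  # when
    "run_id": "2024-01-01T00-00-00Z",     # correlation ID
    "error_snippet": "TransientError: 503 Service Unavailable",
    "last_log_lines": ["attempt=5 transient='503 Service Unavailable' sleeping=18.73s"],
    "safe_to_retry": True,
    "manual_retry": "re-trigger the run with the same run_id (outputs are idempotent)",
    "escalation": "page the data-platform on-call if the failure persists across runs",
}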
Key components
- Error classification: decide retryable vs not.
- Retry policy: backoff, jitter, max attempts, cutoff time.
- Idempotency: unique run IDs, checkpoints, safe re-runs.
- Timeouts: per-try and overall to prevent hanging tasks.
- Circuit breakers: stop hammering a failing dependency (a minimal sketch follows this list).
- Alerting: severity levels, deduplication, escalation windows.
- Observability: metrics like success rate, retry count, time-to-recovery.
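Circuit breakers are not covered in the worked examples below, so here is a minimal in-process sketch: after a configurable number of consecutive failures the breaker opens and short-circuits calls until a cooldown passes. The class, the CircuitOpenError name, and the thresholds are assumptions for illustration; orchestrators and service meshes often provide this for you.
import time

class CircuitOpenError(Exception):
    pass

class CircuitBreaker:
    def __init__(self, failure_threshold: int = 5, cooldown_s: float = 60.0):
        self.failure_threshold = failure_threshold
        self.cooldown_s = cooldown_s
        self.consecutive_failures = 0
        self.opened_at = None

    def call(self, func, *args, **kwargs):
        # While open, refuse calls until the cooldown has passed.
        if self.opened_at is not None:
            if time.time() - self.opened_at < self.cooldown_s:
                raise CircuitOpenError("dependency marked unhealthy, skipping call")
            self.opened_at = None  # half-open: allow one trial call through
        try:
            result = func(*args, **kwargs)
        except Exception:
            self.consecutive_failures += 1
            if self.consecutive_failures >= self.failure_threshold:
                self.opened_at = time.time()
            raise
        self.consecutive_failures = 0
        return result
A retry wrapper can treat CircuitOpenError as permanent for the current run, so the task fails fast instead of hammering an unhealthy dependency.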
Worked examples
Example 1: Retrying object storage reads with backoff and jitter
Scenario: Feature extraction step reads a large parquet file. Occasionally the read fails with HTTP 503.
Approach
- Classify 5xx as transient; 4xx (except 429) as permanent.
- Retry up to 5 times with exponential backoff and jitter, then fail.
- Per-attempt timeout of 60s; overall timeout of 6 minutes.
import random, time

class TransientError(Exception):
    pass

class PermanentError(Exception):
    pass

def read_object():
    # pretend: sometimes returns 503
    if random.random() < 0.6:
        raise TransientError("503 Service Unavailable")
    return b"data"

def fetch_with_retry(max_attempts=5, base=1.0, max_delay=20.0):
    start = time.time()
    for attempt in range(1, max_attempts + 1):
        try:
            return read_object()
        except TransientError as e:
            if attempt == max_attempts:
                raise
            backoff = min(max_delay, base * (2 ** (attempt - 1)))
            jitter = random.uniform(0, backoff * 0.25)
            sleep_s = backoff + jitter
            print(f"attempt={attempt} transient='{e}' sleeping={sleep_s:.2f}s")
            time.sleep(sleep_s)
        except PermanentError:
            raise
        if time.time() - start > 360:  # overall cutoff
            raise TimeoutError("overall timeout exceeded")

res = fetch_with_retry()
Example 2: Idempotent training with run_id
Scenario: Training job may be rescheduled after a worker crash. We must avoid producing duplicate model versions.
Approach
- Generate a unique run_id per scheduled run (timestamp or orchestrator-provided).
- Write artifacts under a path including run_id.
- Before training, check if artifacts for run_id exist; if yes, skip or verify and mark success.
import os, json

RUN_ID = os.environ.get("RUN_ID", "2024-01-01T00-00-00Z")
ARTIFACT_DIR = f"models/{RUN_ID}"
META = f"{ARTIFACT_DIR}/meta.json"

os.makedirs(ARTIFACT_DIR, exist_ok=True)

if os.path.exists(META):
    print("Idempotent: artifacts already exist, marking as success.")
else:
    # train()
    model_path = f"{ARTIFACT_DIR}/model.bin"
    with open(model_path, "wb") as f:
        f.write(b"fake-model")
    with open(META, "w") as f:
        json.dump({"run_id": RUN_ID, "status": "ok"}, f)
    print("Training completed once.")
Example 3: Alert routing and escalation
Scenario: A single run fails once due to a transient error. We do not want to page on-call. If three consecutive runs fail or data is late beyond SLA, we escalate.
Policy sketch
- First failure: informational alert to team channel; auto-ack if next run succeeds.
- Three consecutive failures or lateness > 60 minutes: page primary on-call; include runbook snippet and correlation ID.
- Deduplicate alerts for the same run_id and root cause within 30 minutes.
{
  "rules": [
    {"condition": "run.failure_count == 1", "severity": "info", "notify": ["mlops-room"], "dedup_window": "30m"},
    {"condition": "run.consecutive_failures >= 3 or sla.late_minutes >= 60", "severity": "high", "notify": ["oncall"], "escalate_after": "15m"}
  ],
  "payload": ["pipeline", "step", "run_id", "error_type", "last_log_lines"]
}
Common mistakes and self-check
- Retrying permanent failures (e.g., 401/403, missing file path). Self-check: list your error codes; mark retryable vs not.
- No jitter. Self-check: confirm randomization prevents thundering herd after incidents.
- Infinite retries without cutoff. Self-check: verify per-task max attempts and overall run timeout.
- Non-idempotent steps. Self-check: can this step run twice without corruption? If not, add a run_id checkpoint.
- Alert spam on single failures. Self-check: simulate one transient failure; ensure no paging, only informative alert.
- Missing context in alerts. Self-check: can someone fix it using just the alert payload?
Practical projects
- Harden one pipeline step: add retries with backoff+jitter, timeouts, and idempotent outputs.
- Create an alert policy with severity rules and dedup windows for your nightly training job.
- Add metrics: retry_count, time_to_recovery, and success_rate to your monitoring and visualize weekly trends (a starter sketch follows this list).
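As a starting point for the metrics project, the sketch below keeps the three numbers in a plain Python dictionary; in a real pipeline you would emit them to your monitoring system instead. The record_run helper and its arguments are hypothetical, chosen only to show how the values relate.
import time
from collections import defaultdict
from typing import Optional

metrics = defaultdict(float)

def record_run(succeeded: bool, retries: int, first_failure_ts: Optional[float] = None):
    # retry_count: retries summed across runs; success_rate: succeeded / total runs.
    metrics["runs_total"] += 1
    metrics["retry_count"] += retries
    if succeeded:
        metrics["runs_succeeded"] += 1
        if first_failure_ts is not None:
            # time_to_recovery: seconds from the first failure to the next success
            metrics["time_to_recovery_s"] = time.time() - first_failure_ts
    metrics["success_rate"] = metrics["runs_succeeded"] / metrics["runs_total"]

record_run(succeeded=True, retries=2)  # example: a run that succeeded after 2 retries
print(dict(metrics))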
Exercises
These match the structured exercises below. Do them directly here or in your editor.
Exercise 1: Implement a safe retry wrapper with backoff and jitter
Goal: Wrap a function so that transient errors are retried with exponential backoff and jitter, with clear logs and a cutoff.
- Implement retry(func, max_attempts, base_delay, max_delay, retry_on)
- Simulate a function that fails twice before succeeding.
- Log attempts, sleep time, and final result.
- Ensure an overall timeout protects from hanging.
Hint
- Use random.uniform for jitter.
- Differentiate transient vs permanent exceptions.
Exercise 2: Design an alerting and escalation policy for a training pipeline
Goal: Produce a machine-readable policy (JSON/YAML-like) that routes alerts by severity, deduplicates within a window, and escalates sustained failures.
- Include: conditions, severity, notify targets, dedup window, escalation timing.
- Cover first-failure info and multi-failure paging.
- Include an example alert payload fields list.
Hint
- Use consecutive failures or lateness minutes as conditions.
- One high-severity rule should escalate if unacknowledged.
Exercise checklist
- Retry wrapper prints attempt logs and respects max attempts.
- Backoff grows and includes random jitter.
- Permanent errors fail fast (no extra retries).
- Alert policy has at least two severities and dedup logic.
- Escalation defined with a clear timer.
Mini challenge
Pick one current pipeline step. In under 30 minutes, add: (1) a per-try timeout, (2) exponential backoff with jitter, (3) an idempotency check using run_id. Then run a dry-run with induced failures to confirm the behavior.
Learning path
- Start: implement retries with backoff+jitter on one step.
- Add timeouts and overall run cutoff.
- Introduce idempotency keys and checkpoints.
- Define alert severity rules and dedup windows.
- Instrument metrics: retry_count, success_rate, time_to_recovery.
- Review weekly; tune thresholds and policies based on incidents.
Next steps
- Apply these patterns to all critical steps in your training and feature pipelines.
- Create a shared runbook template used by every high-severity alert.
- Schedule a game day to rehearse failure scenarios and measure time-to-recovery.