Why this matters
As a Data Engineer, you build and operate data pipelines that power analytics, ML features, and dashboards. Clear pipeline documentation tells others how things work, and runbooks tell on-call engineers exactly what to do when things break. This reduces downtime, accelerates onboarding, and keeps data trustworthy.
- Real tasks you will face: hand over a pipeline to another team; answer auditors about data lineage; debug failed jobs at 2 a.m.; estimate impact of a delayed feed; retire or migrate a pipeline safely.
Concept explained simply
Pipeline documentation is the “what and why”: purpose, inputs, outputs, schedule, dependencies, SLAs, data contracts, and owner. A runbook is the “how to act now”: symptoms, diagnosis steps, fixes, and escalation for incidents.
Mental model
Think of a pipeline like a commercial flight. The pipeline documentation is the flight plan (route, cargo, schedule, weather risks), while the runbook is the cockpit checklist and emergency procedures. Both are essential; one informs, the other guides action under pressure.
Core components you should include
Pipeline documentation checklist
- Name and short summary (one sentence)
- Business purpose and stakeholders
- Owners and on-call rotation
- Inputs (sources, schemas or links to schemas, contracts)
- Transformations (brief logic, key assumptions)
- Outputs (tables/topics/files, schemas or links to schemas)
- Schedule and SLAs (e.g., daily 02:00 UTC, SLA: 03:00 UTC)
- Dependencies (upstream/downstream)
- Operational characteristics (runtime, resource usage, cost notes)
- Quality checks (expectations, thresholds)
- Security/compliance considerations (PII handling, retention)
- Change history and versioning approach
Runbook checklist
- Scope: what this runbook covers
- Common symptoms and alerts (exact alert names)
- Immediate sanity checks (dashboards, commands)
- Decision tree (if X then Y)
- Remediation steps (exact commands/UI steps)
- Rollback procedure
- Escalation path (roles, contacts)
- Verification (how to confirm recovery)
- Post-incident notes (what to capture)
Copy-and-paste templates
{"name": "",
"summary": "",
"purpose": "",
"owners": ["", ""] ,
"inputs": [{"source": "", "schema": "", "contract": ""}],
"transformations": "",
"outputs": [{"target": "", "schema": "", "consumers": ["teams"]}],
"schedule": {"type": "batch", "cron": "0 2 * * *", "timezone": "UTC"},
"sla": "",
"dependencies": {"upstream": ["..."], "downstream": ["..."]},
"quality_checks": ["", "="] ,
"security": {"pii": false, "retention_days": 365},
"operations": {"avg_runtime_min": 18, "cost_note": ""},
"change_history": [{"date": "YYYY-MM-DD", "change": ""}]}
Runbook
-------
Scope:
Alerts:
Sanity checks:
Decision tree: if <symptom> → <action>; if <symptom> → <action>
Remediation:
Rollback:
Escalation:
Verification:
Post-incident:
Worked examples
Example 1 — Nightly batch customer metrics (Airflow + SQL)
Summary: Computes daily customer metrics for dashboards.
Inputs: postgres.sales.orders (D-1 complete by 01:00 UTC); crm.customers.
Transformations: Windowed aggregations on orders, join to customers on customer_id; filters refunded orders.
Outputs: dw.customer_daily_metrics (partitioned by dt).
Schedule/SLA: 02:00 UTC daily; SLA 03:00 UTC.
Dependencies: Upstream: orders ETL; Downstream: BI dashboards.
Quality checks: row_count within 5% of 7-day median; null_rate(customer_id)=0.
Runbook (incidents):
- Symptom: Airflow task metrics_sql failed with "lock timeout".
- Sanity: Check db locks: select blocked_locks...
- Fix: Retry task once; if persists, run VACUUM ANALYZE on temp table, then re-run.
- Escalate: DB on-call if lock exceeds 20 min.
- Verify: Row count and freshness checks green.
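The "row_count within 5% of 7-day median" check from Example 1 can be sketched as a small function. The counts and the 5% tolerance here are illustrative; in practice you would feed in real counts from the warehouse:

```python
from statistics import median

def row_count_ok(todays_count: int, last_7_counts: list[int],
                 tolerance: float = 0.05) -> bool:
    """True if today's row count is within `tolerance` of the 7-day median."""
    baseline = median(last_7_counts)
    return abs(todays_count - baseline) <= tolerance * baseline

# Illustrative daily counts; median of this history is 10_050.
history = [10_100, 9_950, 10_300, 10_050, 9_900, 10_200, 10_000]
print(row_count_ok(10_400, history))  # within 5% of the median → True
print(row_count_ok(8_000, history))   # ~20% below the median → False
```

A median baseline is more robust than a mean when one outage day distorts recent history.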
Example 2 — Streaming events to warehouse (Kafka → Spark Structured Streaming → Delta)
Summary: Ingests clickstream from Kafka topic web.clicks to bronze/silver Delta tables.
Inputs: Kafka brokers (SASL_SSL); topic: web.clicks; schema in schema registry.
Transformations: JSON parsing; PII hashing for email; late event watermark 10 min; sessionization by user_id.
Outputs: delta.bronze_clicks, delta.silver_sessions.
Schedule/SLA: Continuous; SLA end-to-end latency < 5 minutes p95.
Dependencies: Schema registry availability; S3 bucket permissions.
Quality checks: p95 latency < 5m; malformed record rate < 0.5%.
Runbook (incidents):
- Symptom: Lag increasing on web.clicks partitions.
- Sanity: Check consumer group lag per partition; verify broker health.
- Fix: Scale executors by +2; if malformed spikes, enable dead-letter queue and bump bad-record threshold to 1% temporarily.
- Rollback: Revert autoscaling to baseline.
- Verify: Lag trending down, p95 latency < 5m.
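The "p95 end-to-end latency < 5 minutes" SLA from Example 2 needs an agreed percentile definition. A minimal sketch using the nearest-rank method, with made-up latency samples standing in for your metrics store:

```python
import math

def p95(samples: list[float]) -> float:
    """95th percentile of latency samples, nearest-rank method."""
    ordered = sorted(samples)
    rank = math.ceil(0.95 * len(ordered))  # 1-based rank of the p95 sample
    return ordered[rank - 1]

# Illustrative end-to-end latencies in seconds: mostly fast, one straggler.
samples = [60.0] * 18 + [250.0, 400.0]
latency_p95 = p95(samples)
print(latency_p95 < 300)  # 250s p95 is under the 5-minute SLA → True
```

Note that a single 400s straggler does not breach a p95 SLA; that is exactly why the SLA is stated as a percentile rather than a maximum.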
Example 3 — ML feature pipeline (dbt + Feature Store)
Summary: Builds daily churn_features for model training and online serving.
Inputs: dw.customer_daily_metrics, support.tickets.
Transformations: dbt models feature_eng.sql; standardization and capping outliers; ensures no target leakage.
Outputs: fs.churn_features (offline), fs_online.churn_features (online).
Schedule/SLA: Daily 03:30 UTC; SLA 05:00 UTC (before training job at 05:15).
Quality checks: Training-serving skew < 1%; feature null-rate < 2%.
Runbook (incidents):
- Symptom: Online store missing latest partition.
- Sanity: Compare offline vs online counts for dt=today; check feature registry sync logs.
- Fix: Re-run sync_online_features with dt=today; if fails, backfill last 2 days.
- Escalation: ML platform on-call after 30 min.
- Verify: Online row_count within 1% of offline.
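The verification step in Example 3 ("online row_count within 1% of offline") is a simple relative-difference check. A sketch, with hypothetical counts:

```python
def within_pct(online: int, offline: int, pct: float = 0.01) -> bool:
    """True if the online count deviates from the offline count by at most `pct`."""
    if offline == 0:
        return online == 0  # avoid division by zero; both empty counts match
    return abs(online - offline) / offline <= pct

print(within_pct(99_500, 100_000))  # 0.5% gap → True, recovery confirmed
print(within_pct(97_000, 100_000))  # 3% gap → False, keep investigating
```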
Step-by-step: write your first pipeline doc
- Start with the one-line summary. Clarify purpose and key output.
- List inputs/outputs. Include table or topic names and schemas.
- Describe transformations briefly. Mention key logic and assumptions.
- Define schedule and SLA. When it runs and when it must be ready.
- Map dependencies. Note upstream sources and downstream consumers.
- Add operational notes. Runtime, costs, data quality checks.
- Create the runbook. Symptoms, checks, decision tree, remediation, verification, escalation.
Mini task: turn a chat message into a one-line summary
Message: “This job updates the 'orders_by_region' table every morning so Finance dashboards show yesterday's totals.”
Answer example: “Nightly batch job that aggregates orders by region for Finance dashboards (dw.orders_by_region, ready by 07:00 local).”
Exercises
Do these to build muscle memory. You can compare with solutions in the collapsible blocks.
Exercise 1 — Draft a minimal runbook from a brief incident
Incident brief: “Airflow DAG user_activity_daily failed on task load_to_warehouse. Error: S3 503 Slow Down on write to s3://prod-warehouse/tmp/. Retries exhausted. BI team waiting.”
- Write a runbook with: Scope, Alerts, Sanity checks, Decision tree, Remediation, Escalation, Verification.
Show solution
Scope: Failures writing to S3 during load_to_warehouse.
Alerts: Airflow task failure; S3 write error rate > 2%.
Sanity checks: Check AWS status dashboard; list S3 bucket metrics (5xx); confirm IAM role still valid.
Decision tree: If AWS incident ongoing → pause DAG 30 min; If only this task failing → check temp path quota; If IAM expired → refresh role session.
Remediation: Set Airflow task retry=5 with exponential backoff; switch tmp path to s3://prod-warehouse/tmp2/; re-run failed task; if still failing, redirect to local disk spill (ensure 100GB free) and re-run.
Escalation: Cloud platform on-call if 30 min no recovery.
Verification: Output table row_count and freshness checks green; BI dashboard updated.
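The remediation above sets retry=5 with exponential backoff. To make "exponential backoff" concrete, here is how the wait times grow; the 30-second base delay and 10-minute cap are illustrative choices, not defaults of any tool:

```python
def backoff_delays(retries: int, base_s: float = 30.0,
                   cap_s: float = 600.0) -> list[float]:
    """Delay before each retry attempt: base * 2**attempt, capped at cap_s."""
    return [min(base_s * (2 ** attempt), cap_s) for attempt in range(retries)]

print(backoff_delays(5))  # [30.0, 60.0, 120.0, 240.0, 480.0]
```

The cap matters for throttling errors like S3 "503 Slow Down": uncapped doubling can push recovery past your SLA, while no backoff at all keeps hammering the throttled service.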
Exercise 2 — Normalize a messy pipeline doc
Raw notes: “Runs hourly. Pulls events from topic clicks. Joins with users for country. Sometimes late data. Dashboard owners: Growth. Alert when latency high.”
- Rewrite using the provided template sections. Be explicit about inputs, outputs, SLA, and quality checks.
Show solution
Name/Summary: Hourly enrichment of click events with user country for Growth dashboard.
Inputs: kafka.topic web.clicks; users.dim (postgres). Schema in registry.
Transformations: Parse JSON; join on user_id; handle late data with 10 min watermark.
Outputs: dw.clicks_enriched_hourly partitioned by hour.
Schedule/SLA: Hourly at :10; SLA: :20 (p95 latency < 10m).
Dependencies: Schema registry and users.dim freshness < 24h.
Quality checks: null_rate(user_id)=0; join match rate >= 95%; malformed < 1%.
Owners: Data Platform; Growth as stakeholder.
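The "join match rate >= 95%" check from the solution above is a ratio of matched events to total events. A sketch with hypothetical counts:

```python
def join_match_rate(matched: int, total: int) -> float:
    """Fraction of click events that found a matching user row in the join."""
    return matched / total if total else 0.0

rate = join_match_rate(matched=9_720, total=10_000)
print(rate >= 0.95)  # 97.2% match rate clears the 95% threshold → True
```

A falling match rate is often the first visible symptom of a stale `users.dim`, which is why the doc also pins its freshness (< 24h) as a dependency.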
Checklist to self-evaluate your exercises:
- Is the summary one sentence?
- Are inputs/outputs named precisely?
- Is there a measurable SLA?
- Are quality checks specific (with thresholds)?
- Does the runbook include verification and escalation?
Common mistakes and self-check
- Vague SLAs (e.g., “morning”). Fix: state exact time or latency and timezone.
- No owners or escalation path. Fix: add team and on-call contact.
- Missing data contracts. Fix: specify schemas and expectations (types, nullability).
- Runbooks that say “check logs” only. Fix: add exact commands, dashboards, thresholds.
- Docs drift from reality. Fix: update on every material change; add change history.
5-minute self-check before you publish
- Can a new teammate rerun/backfill using your doc alone?
- Can on-call resolve top 2 incidents in < 15 minutes using the runbook?
- Are privacy/retention notes clear for PII?
- Do quality checks guard against the last outage you had?
Practical projects
- Document two existing pipelines (one batch, one streaming) using the template, and run a peer review.
- Create a runbook drill: simulate a common alert in a dev environment and time the recovery using the runbook.
- Set up a “doc completeness” checklist in your CI (even a simple text check) so every new pipeline PR includes docs and a runbook.
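The "doc completeness" CI check in the last project can start as a few lines of text matching: fail the build if a pipeline doc is missing required sections. The section names below are an assumption; use whatever headings your template mandates:

```python
# Section headings every pipeline doc must contain (assumed; match your template).
REQUIRED_SECTIONS = ["Summary", "Inputs", "Outputs", "Schedule", "SLA", "Runbook"]

def missing_sections(doc_text: str) -> list[str]:
    """Return required section headings not found in the doc text."""
    lowered = doc_text.lower()
    return [s for s in REQUIRED_SECTIONS if s.lower() not in lowered]

# Hypothetical doc that covers everything except the runbook.
sample = "Summary: ...\nInputs: ...\nOutputs: ...\nSchedule/SLA: ...\n"
print(missing_sections(sample))  # ['Runbook'] — the runbook section is missing
```

A substring check is crude (it would accept the word "runbook" anywhere in the text), but it is enough to stop entirely undocumented pipelines from merging, and you can tighten it later.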
Who this is for
- Data Engineers responsible for building and operating pipelines.
- Analytics Engineers and ML Engineers who own data flows.
- On-call responders who need clear, actionable steps.
Prerequisites
- Basic understanding of your orchestration tool (e.g., Airflow, dbt, Spark).
- Ability to read/write SQL and understand schemas.
- Familiarity with your monitoring/alerting stack.
Learning path
- Learn core pipeline components and SLAs.
- Write one complete pipeline doc using the template.
- Create a precise runbook for top two failure modes.
- Run a tabletop incident drill and refine the runbook.
- Adopt doc versioning and review with your team.
Next steps
- Pick one critical pipeline and bring its doc and runbook to “production-ready”.
- Schedule a 30-minute peer review and capture improvement actions.
- Add a recurring reminder (monthly) to review and update runbooks.
Mini challenge
In 10 minutes, write a one-paragraph pipeline doc for any small job you run today. Include summary, inputs, outputs, schedule, SLA, and one quality check.
Quick Test note
The Quick Test below is available to everyone. If you are logged in, your progress will be saved automatically; otherwise, you can still practice for free without saving.
Practice Exercises
2 exercises to complete
Instructions
Incident brief: “Airflow DAG user_activity_daily failed on task load_to_warehouse. Error: S3 503 Slow Down on write to s3://prod-warehouse/tmp/. Retries exhausted. BI team waiting.”
Create a runbook with the following sections: Scope, Alerts, Sanity checks, Decision tree, Remediation, Escalation, Verification.
Expected Output
A clear, one-page runbook covering the seven sections with concrete steps and thresholds.
Pipeline Documentation And Runbooks — Quick Test
Test your knowledge with 8 questions. Pass with 70% or higher.