Why this matters
As an ETL Developer, you hand over pipelines that others will operate, extend, or use for analytics. Clear limitations and assumptions prevent surprises, reduce on-call noise, and speed up incident response. They set expectations for data freshness, completeness, and behavior under edge cases.
- Stakeholders know what the pipeline guarantees and what it does not.
- Support teams can triage incidents faster with documented constraints.
- Future developers understand trade-offs and when to revisit them.
Concept explained simply
Definitions:
- Limitation: A known constraint of the system today. It is real, testable, and affects users (e.g., “Max 15 min freshness”).
- Assumption: A condition believed to be true that the design relies on, but is outside your full control (e.g., “Source table has stable primary keys”).
Mental model
Think of your pipeline like a bridge: limitations are the posted weight and speed limits; assumptions are the soil conditions and weather you designed for. Both must be visible on the sign before anyone drives across.
Quality criteria for good statements
- Specific: state measurable thresholds (e.g., “≤ 30 minutes lag” instead of “near real-time”).
- Testable: you could write a monitor for it.
- Contextual: include scope (which datasets, jobs, or time windows).
- Actionable: mention the mitigation or escalation path if relevant.
- Time-bound: if temporary, add an expiry or review date.
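The "testable" criterion means a limitation should translate directly into a monitor. As a minimal sketch (the 30-minute threshold and all names are illustrative, not from the source):

```python
from datetime import datetime, timedelta, timezone

# Hypothetical monitor for a stated limitation such as
# "end-to-end lag <= 30 minutes". Threshold and names are illustrative.
FRESHNESS_LIMIT = timedelta(minutes=30)

def check_freshness(last_load_ts: datetime, now: datetime) -> bool:
    """Return True if the stated freshness limitation still holds."""
    return now - last_load_ts <= FRESHNESS_LIMIT

now = datetime.now(timezone.utc)
print(check_freshness(now - timedelta(minutes=12), now))  # True: within limit
print(check_freshness(now - timedelta(minutes=45), now))  # False: breach, alert
```

If you cannot write a check like this for a statement, it is probably too vague to be a limitation.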
Copy-paste templates
Limitation template:
Limitation: [What is constrained]
Scope: [dataset/job/environment]
Metric/Threshold: [e.g., freshness ≤ X min, completeness ≥ Y%]
Reason: [tech/cost/policy]
Mitigation: [monitor, retry, manual step, contact]
Owner/Review: [team] — review by [YYYY-MM-DD]
Assumption template:
Assumption: [condition relied upon]
Evidence: [contract/SLA/observation/date]
Impact if false: [what breaks/how]
Detection: [monitor/alert/check]
Fallback: [safe behavior when violated]
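If your handover docs live in a repo, the same templates can be kept as structured records so monitors and review tooling can read them. A sketch with placeholder values (field names mirror the templates; everything else is illustrative):

```python
# Structured rendering of the limitation/assumption templates above.
# All values are placeholders, not real guarantees.
limitation = {
    "statement": "Freshness <= 30 min for orders_fact",
    "scope": "orders_fact, prod",
    "metric_threshold": "end-to-end lag <= 30 minutes",
    "reason": "source batch cadence",
    "mitigation": "freshness monitor alerts DataOps",
    "owner": "data-platform",
    "review_by": "2026-01-01",
}

assumption = {
    "statement": "Source table has stable primary keys",
    "evidence": "data contract v1.2",
    "impact_if_false": "duplicate or orphaned rows downstream",
    "detection": "daily key-uniqueness check",
    "fallback": "halt load and page on-call",
}
```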
Common categories
- Freshness and latency (e.g., maximum end-to-end lag)
- Completeness and duplication (e.g., late-arriving data policy, dedup key)
- Schema stability and drift policy
- Idempotency and retry behavior
- Backfill scope and reprocessing windows
- Cost/performance trade-offs (e.g., partitioning granularity)
- Source availability and SLAs
- Access/PII constraints and masking rules
Worked examples
Example 1 — Freshness and late data
Limitation:
Limitation: Orders_incremental job guarantees freshness ≤ 20 minutes under normal operation. Late events arriving >24 hours after event_time are dropped.
Reason: Source Kafka retention and cost limits on deep reprocessing.
Mitigation: A daily reconciliation compares source totals; anomalies >1% trigger an alert to DataOps.
Assumption:
Assumption: event_time reflects when the order was placed, not processing time. Evidence: source spec v2.1 signed 2025-06-03. If false: time-based aggregations will skew.
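The late-data rule in this example can be sketched as a simple partition step. Routing dropped events aside (instead of silently discarding them) lets the daily reconciliation count them; the function and field names are illustrative assumptions:

```python
from datetime import datetime, timedelta, timezone

# Sketch of the Example 1 rule: events arriving more than 24 hours after
# their event_time are dropped (here: set aside for reconciliation).
LATE_CUTOFF = timedelta(hours=24)

def split_late_events(events, arrival_ts):
    """Partition events into (kept, dropped) by the 24-hour late-data rule."""
    kept, dropped = [], []
    for e in events:
        if arrival_ts - e["event_time"] > LATE_CUTOFF:
            dropped.append(e)
        else:
            kept.append(e)
    return kept, dropped
```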
Example 2 — Backfill and idempotency
Limitation:
Limitation: Backfills supported only within the last 90 days for sales_fact. Older periods require manual request due to cold storage costs.
Assumption:
Assumption: Upstream CDC guarantees at-least-once delivery; the pipeline deduplicates on (order_id, event_time) to stay idempotent. If CDC switches to exactly-once, no change is needed. If delivery degrades to at-most-once, gaps may appear; reconciliation will detect them within 24 hours.
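The dedup that makes at-least-once delivery safe can be sketched as keeping one row per (order_id, event_time). Applying it twice yields the same result, which is the idempotency property the example relies on (row shape is illustrative):

```python
# Sketch of the Example 2 idempotent dedup: tolerate at-least-once
# delivery by keeping the first row seen per (order_id, event_time).
def dedup(rows):
    seen = set()
    out = []
    for r in rows:
        key = (r["order_id"], r["event_time"])
        if key not in seen:
            seen.add(key)
            out.append(r)
    return out
```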
Example 3 — Schema drift and nullability
Limitation:
Limitation: New columns from source ERP appear as nullable strings for up to 2 weeks before typed modeling is updated.
Mitigation: Model owners review weekly. Consumers must not rely on new columns until marked “promoted”.
Assumption:
Assumption: Primary keys do not change type. If a key type changes, ingestion fails fast and creates a P1 incident.
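The fail-fast behavior on key type changes could look like the sketch below; the expected-types mapping and exception class are illustrative assumptions, and raising here is what would surface the P1 incident:

```python
# Sketch of Example 3 fail-fast: if a primary key's type changes,
# stop ingestion immediately instead of loading bad data.
# EXPECTED_KEY_TYPES and SchemaViolation are illustrative.
EXPECTED_KEY_TYPES = {"order_id": int}

class SchemaViolation(Exception):
    pass

def validate_keys(row):
    for col, expected in EXPECTED_KEY_TYPES.items():
        if not isinstance(row[col], expected):
            raise SchemaViolation(
                f"{col}: expected {expected.__name__}, got {type(row[col]).__name__}"
            )
```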
How to document in your handover
- List critical user promises: freshness, completeness, schema stability.
- Walk each pipeline stage and note known constraints per stage.
- Record external dependencies and their SLAs (assumptions).
- Add detection/monitoring for each item (how you know it holds/breaks).
- Attach review dates for temporary constraints; assign owners.
Mini snippets you can reuse
- Late data policy: “Events older than X days are quarantined to bucket Y for manual review.”
- Retry policy: “Job retries 3 times with exponential backoff; on failure, alert channel Z.”
- Schema policy: “Unknown columns are ingested as raw JSON in column extras.”
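The retry-policy snippet above can be sketched in code; the alert hook is a placeholder, and the injectable `sleep` exists only so the backoff is testable:

```python
import time

# Sketch of the retry policy: 3 retries with exponential backoff,
# then alert and re-raise. The alert call is a placeholder assumption.
def run_with_retries(job, retries=3, base_delay=1.0, sleep=time.sleep):
    for attempt in range(retries + 1):
        try:
            return job()
        except Exception:
            if attempt == retries:
                # placeholder: notify alert channel Z before giving up
                raise
            sleep(base_delay * (2 ** attempt))  # 1s, 2s, 4s, ...
```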
Exercises
Do these now. You can compare with the solutions below. Tip: Write measurable thresholds.
Exercise 1 — Rewrite vague statements
Vague notes:
- Data is real-time.
- Sometimes duplicates happen.
- We can backfill if needed.
Task: Rewrite each into a testable limitation or assumption using the templates.
Possible rewrites:
- Limitation: Freshness ≤ 5 minutes for 95% of loads; worst-case ≤ 20 minutes during source maintenance windows.
- Limitation: Dedup uses (user_id, event_id); residual duplicate rate ≤ 0.05% per day measured by reconciliation.
- Limitation: Backfills supported for rolling 60 days via automated job; older periods require manual ticket and up to 48 hours lead time.
Exercise 2 — Classify from a scenario
Scenario: The CRM API can throttle to 200 req/min without notice. Your extractor batches pages of 500 records, and if throttled, it sleeps and resumes. Analysts expect same-day completeness for yesterday by 08:00 UTC.
Task: List at least 2 limitations and 2 assumptions, and add detection/mitigation where relevant.
Sample answers:
- Limitation: Extractor respects 200 req/min throttle; full extraction may take up to 3 hours for 10M records. Mitigation: progress metric and alert if ETA exceeds 6 hours.
- Limitation: Completeness for day D by 08:00 UTC, except during vendor incidents tracked by status feed.
- Assumption: Vendor throttle is stable at ≥ 200 req/min; detection: monitor HTTP 429 rates and moving average.
- Assumption: CRM “updated_at” is monotonic for incremental sync; if violated, late updates detected by 48-hour overlap window.
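The 429-rate detection in the sample answers could be sketched as a moving-window check; the window size and threshold are illustrative assumptions:

```python
from collections import deque

# Sketch of throttle detection: keep a moving window of recent HTTP
# status codes and flag when the 429 share exceeds a threshold.
class ThrottleMonitor:
    def __init__(self, window=100, max_429_ratio=0.2):
        self.codes = deque(maxlen=window)
        self.max_429_ratio = max_429_ratio

    def record(self, status_code):
        self.codes.append(status_code)

    def breached(self):
        if not self.codes:
            return False
        ratio = sum(1 for c in self.codes if c == 429) / len(self.codes)
        return ratio > self.max_429_ratio
```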
Self-check checklist
- Is each statement measurable (numbers, times, percentages)?
- Does it include scope (which datasets/jobs)?
- Is there a way to detect a breach (monitor/alert)?
- Is ownership and review date clear for temporary limits?
- Would a new engineer understand it without asking you?
Common mistakes and how to self-check
- Using vague terms like “near real-time” without numbers. Fix: add exact thresholds.
- Mixing limitations and assumptions. Fix: label them distinctly.
- Omitting detection. Fix: pair each item with a monitor.
- Not stating scope. Fix: name datasets/jobs and environments.
- Never revisiting temporary constraints. Fix: include review dates.
Practical projects
- Project 1: Take one of your pipelines. Add a Limitations & Assumptions section with at least 6 items across freshness, completeness, schema, and backfill.
- Project 2: Implement one monitor per item (synthetic freshness check, dedup rate gauge, schema drift alert). Include the alert channel in the doc.
- Project 3: Run a tabletop “assumption broken” drill. Document the fallback when the source SLA fails.
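For Project 2, the schema drift alert could start as simply as diffing observed columns against the documented schema; the column lists are illustrative:

```python
# Sketch of a schema drift check: compare observed columns against the
# documented schema and report additions/removals for alerting.
def schema_drift(documented, observed):
    added = sorted(set(observed) - set(documented))
    removed = sorted(set(documented) - set(observed))
    return {"added": added, "removed": removed}
```

A non-empty "added" list would feed the "new columns land in extras" policy; a non-empty "removed" list is usually alert-worthy on its own.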
Mini challenge
Draft three items for a new clickstream pipeline that ingests from S3 every 10 minutes, where files may arrive late and contain extra columns.
Possible answer:
- Limitation: Freshness ≤ 20 minutes for 95% of intervals; files older than 48 hours are quarantined.
- Limitation: New columns are captured in extras as strings for up to 14 days before promotion.
- Assumption: Filenames include event_date in UTC and are immutable after upload; detection: checksum mismatch alert.
Who this is for
- ETL/ELT Developers handing over pipelines to ops and analysts
- Data Engineers formalizing reliability expectations
- Analytics Engineers documenting downstream model guarantees
Prerequisites
- Basic ETL/ELT process understanding
- Familiarity with your pipeline scheduler and monitoring
- Awareness of upstream data contracts or SLAs
Learning path
- Identify user promises (freshness, completeness).
- Draft limitations and assumptions per dataset/job.
- Add detection, mitigation, and owners.
- Review with stakeholders; iterate.
- Set review dates and monitor alerts.
Next steps
- Integrate these items into your runbooks and READMEs.
- Add alert references so on-call can act quickly.
- Review quarterly; remove outdated constraints.