Why this matters
Broken or missing partitions quietly corrupt features, training sets, and downstream metrics. Knowing how to backfill safely lets you repair history without overloading shared systems or double counting data.
Mental model
Picture a grid of boxes (partitions) arranged by time. A scheduler fills the newest box each period. When something breaks, some boxes stay empty or wrong. Backfilling is selecting the affected boxes and refilling them carefully, respecting dependencies and resources.
Key terms
- Partition: A subset of data/work keyed by time or another dimension (e.g., date=2025-12-01).
- Catchup: Automatically running missed scheduled intervals since the last successful run.
- Backfill: Manually triggered runs for a historical range of partitions.
- Idempotency: Rerunning produces the same result without duplication.
- Concurrency/Rate limit: How many runs can execute in parallel.
- Late data: Data arriving after its intended partition time.
Worked examples
Example 1: Daily batch feature pipeline with safe backfill
Scenario: A daily feature job failed for a week. You need to backfill 7 days without overloading the warehouse.
- Freeze code and configs at the version that should produce the desired output.
- Choose the date range (e.g., 2025-03-01 to 2025-03-07).
- Limit concurrency to 2 and verify idempotency (upserts/overwrite).
- Run in ascending date order to honor dependencies.
Sample configuration snippet
# Pseudocode-style configuration
schedule: daily 02:00 UTC
catchup: true
backfill:
  start: 2025-03-01
  end: 2025-03-07
  concurrency: 2
  retries: 2
  mode: overwrite  # partition overwrite to ensure idempotent writes
Outcome: 7 runs, at most 2 in parallel, each overwriting its partition safely.
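The same plan can be expressed as a short Python sketch. It is illustrative only: run_partition is a hypothetical stand-in for whatever triggers one partition run in your orchestrator, and the two-worker pool approximates the concurrency cap.
# Python sketch: the 7-day plan above, with at most 2 runs in flight
from concurrent.futures import ThreadPoolExecutor
from datetime import date, timedelta

def run_partition(day: date) -> None:
    # Hypothetical: overwrite the partition for `day` (idempotent write).
    print(f"backfilling partition date={day.isoformat()}")

start, end = date(2025, 3, 1), date(2025, 3, 7)
days = [start + timedelta(days=i) for i in range((end - start).days + 1)]

# Submit in ascending date order; max_workers=2 mirrors concurrency: 2.
with ThreadPoolExecutor(max_workers=2) as pool:
    futures = [pool.submit(run_partition, d) for d in days]
    for f in futures:
        f.result()  # surface failures instead of silently skipping partitions
Dates are submitted oldest first, so even with two days in flight the earlier partitions start first.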
Example 2: Partitioned training data re-computation after logic change
Scenario: A feature definition changed. You must recompute the last 90 days to keep training data consistent.
- Create a new versioned output path (e.g., features_v2/date=YYYY-MM-DD).
- Backfill the last 90 days in parallel with strict resource limits.
- Switch downstream training to read features_v2 only after validation.
Command & guardrails
# Pseudocode commands
backfill --partitions 2024-12-01..2025-02-28 \
  --max-parallel 3 \
  --write-mode overwrite \
  --target features_v2
# Guardrails
- Validate row counts & null rates per partition
- Compare feature distribution drift vs. previous version
Outcome: New version fully populated; downstreams flip over atomically.
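The "flip over atomically" step can be a single metadata operation gated on validation. The sketch below is an illustration under assumptions: validate_partition and run_sql are hypothetical helpers, and it presumes your warehouse supports redefining a view (or an equivalent pointer swap) so downstreams never read a half-populated features_v2.
# Python sketch: gate the atomic switchover on per-partition validation
from datetime import date, timedelta

def validate_partition(day: date) -> bool:
    # Hypothetical: row counts, null rates, and drift checks for one features_v2 partition.
    return True

def run_sql(statement: str) -> None:
    # Hypothetical: execute a statement against the warehouse.
    print(statement)

start, end = date(2024, 12, 1), date(2025, 2, 28)
days = [start + timedelta(days=i) for i in range((end - start).days + 1)]

if all(validate_partition(d) for d in days):
    # Downstream readers keep querying `features`; only its definition changes.
    run_sql("CREATE OR REPLACE VIEW features AS SELECT * FROM features_v2")
else:
    raise RuntimeError("validation failed; downstreams stay on the previous version")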
Example 3: Handling late-arriving events without double counting
Scenario: Events for day D arrive up to 48 hours late. The daily job runs at 02:00 UTC. How do you avoid missing late data?
- Adopt a watermark delay: process day D at D+2 (schedule lag).
- Allow a small reprocessing window (e.g., rolling 3 days) with idempotent upserts.
- Periodic mini-backfills: re-run last 2–3 partitions nightly.
Idempotent write pattern
# Pseudocode write
MERGE INTO features AS t
USING staging AS s
ON t.entity_id = s.entity_id AND t.date = s.date
WHEN MATCHED THEN UPDATE SET ...
WHEN NOT MATCHED THEN INSERT (...)
Outcome: Late data is included; no double counting due to merge semantics.
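The watermark delay and rolling window reduce to a little date arithmetic. In the sketch below the constants mirror the policy above (process day D at D+2, re-merge the last 3 closed days); the window size and variable names are illustrative.
# Python sketch: partitions touched by tonight's run under the late-data policy
from datetime import datetime, timedelta, timezone

WATERMARK_DELAY_DAYS = 2   # process day D at D+2 so 48h-late events have landed
REPROCESS_WINDOW_DAYS = 3  # nightly mini-backfill re-merges the last 3 closed days

today = datetime.now(timezone.utc).date()
newest_closed_day = today - timedelta(days=WATERMARK_DELAY_DAYS)

# Oldest first, so each re-merge sees its upstream partitions already written.
partitions_to_merge = [
    newest_closed_day - timedelta(days=i)
    for i in reversed(range(REPROCESS_WINDOW_DAYS))
]
print([d.isoformat() for d in partitions_to_merge])
Each of these partitions is then rewritten with the MERGE pattern above, so re-running them nightly is safe.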
Who this is for
- MLOps Engineers operating batch or streaming ML pipelines.
- Data/ML Platform Engineers building orchestration and data reliability.
- Data Scientists who schedule and re-run partitioned workflows.
Prerequisites
- Basic Python or SQL for data pipelines.
- Familiarity with an orchestrator (e.g., conceptual knowledge of DAGs/jobs).
- Understanding of partitioned datasets (daily/hourly).
Learning path
- Grasp partitions and scheduling concepts.
- Practice backfill planning, idempotent writes, and concurrency limits.
- Add data quality checks and lineage to backfills.
- Automate with templates and parameterized backfill jobs.
How to design safe backfills
- Define the exact partition range and order.
- Choose write mode: overwrite or upsert/merge.
- Set concurrency based on system limits.
- Add validation checks per partition.
- Record run metadata and lineage for traceability (see the sketch after this list).
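A minimal sketch of these steps as a parameterized plan plus a per-partition run record. run_partition and validate_partition are hypothetical hooks, and the loop runs sequentially for brevity; how runs are triggered and where the metadata lands depend on your orchestrator and lineage store.
# Python sketch: a parameterized plan plus a per-partition run record
from dataclasses import dataclass
from datetime import date, datetime, timedelta, timezone

@dataclass
class BackfillPlan:
    start: date
    end: date
    write_mode: str = "overwrite"  # or "merge"
    max_parallel: int = 2
    code_version: str = "v1.0.0"   # pin one version for the whole range

    def partitions(self) -> list[date]:
        # Ascending order so upstream partitions exist before downstream reads.
        days = (self.end - self.start).days + 1
        return [self.start + timedelta(days=i) for i in range(days)]

def run_partition(day: date, write_mode: str) -> None:
    ...  # hypothetical: trigger one partition run

def validate_partition(day: date) -> bool:
    ...  # hypothetical: per-partition checks (see the checklist below)
    return True

plan = BackfillPlan(start=date(2025, 3, 1), end=date(2025, 3, 7))
run_log = []  # stand-in for a lineage/metadata store
for day in plan.partitions():
    run_partition(day, plan.write_mode)
    run_log.append({
        "partition": day.isoformat(),
        "write_mode": plan.write_mode,
        "code_version": plan.code_version,
        "validated": validate_partition(day),
        "finished_at": datetime.now(timezone.utc).isoformat(),
    })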
Validation checklist per partition (a code sketch follows the list)
- Row count within expected bounds
- No unexpected null spikes
- Key uniqueness maintained
- Feature distribution within drift thresholds
- Downstream read test passes
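These checks can be bundled into one per-partition function. The sketch below assumes the partition and its previous version are loaded as pandas DataFrames; the column names, bounds, and drift threshold are placeholders to tune per dataset, and the downstream read test would run separately against the serving path.
# Python sketch: per-partition validation as one function returning named checks
import pandas as pd

def validate_partition(df: pd.DataFrame, prev: pd.DataFrame) -> dict:
    return {
        # Row count within expected bounds (placeholder bounds).
        "row_count": 10_000 <= len(df) <= 1_000_000,
        # No unexpected null spikes in a key feature column (placeholder name).
        "null_rate": df["feature_x"].isna().mean() < 0.01,
        # Key uniqueness maintained on the merge keys.
        "unique_keys": not df.duplicated(subset=["entity_id", "date"]).any(),
        # Feature distribution within a simple drift threshold vs. the previous version.
        "drift": abs(df["feature_x"].mean() - prev["feature_x"].mean()) < 0.1,
    }

# Usage: block the partition (and alert) if any check fails.
# checks = validate_partition(df, prev)
# assert all(checks.values()), f"validation failed: {checks}"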
Common mistakes and self-check
- Running unlimited parallel backfills. Fix: set max concurrency and monitor throughput.
- Non-idempotent writes causing duplicates. Fix: use overwrite or merge semantics.
- Ignoring dependencies. Fix: order runs and ensure upstream partitions exist.
- No data validation. Fix: per-partition assertions and automated alerts.
- Mixing code versions mid-backfill. Fix: pin code/runtime version for the whole range.
Self-check prompts
- Can I safely re-run any partition without side effects?
- What is the exact range, order, and parallelism of my backfill?
- How will I verify correctness before switching downstreams?
Exercises
Hands-on tasks that mirror real operations. See the Quick Test at the end for a short knowledge check.
Exercise 1: Plan a 31-day backfill with dependencies
Context: A daily pipeline raw -> staging -> features missed December 2024. You must refill all three layers safely.
- Define partition range and run order.
- Pick write modes and concurrency.
- Define validations and a rollback plan.
Hints
- Process upstream to downstream per day.
- Cap concurrency to protect your warehouse.
- Compare row counts across layers for consistency.
Exercise 2: Late data and rolling reprocessing
Context: Events can arrive 48 hours late. Design a schedule that incorporates late data without full reprocessing.
- Choose a watermark delay.
- Define a rolling mini-backfill window.
- Specify idempotent merge logic.
Hints
- Use D+2 processing for day D.
- Re-run last 2–3 days nightly.
- Use merge keys: entity_id + date.
Practical projects
- Build a parameterized backfill job that accepts start_date, end_date, max_parallel, and write_mode. Include per-partition validations (a starter skeleton follows this list).
- Create a versioned features store (v1, v2) and a controlled switchover process with canary validations.
- Implement a late-data policy: watermark delay, rolling reprocess, and metrics dashboard for late-arrival rates.
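For the first project, a starter skeleton might look like the sketch below, assuming a plain argparse command-line interface; the flags mirror the parameters named above and the job body is left as a stub.
# Python sketch: argparse skeleton for the parameterized backfill project
import argparse
from datetime import date

def main() -> None:
    parser = argparse.ArgumentParser(description="Backfill a range of partitions")
    parser.add_argument("--start-date", type=date.fromisoformat, required=True)
    parser.add_argument("--end-date", type=date.fromisoformat, required=True)
    parser.add_argument("--max-parallel", type=int, default=2)
    parser.add_argument("--write-mode", choices=["overwrite", "merge"], default="overwrite")
    args = parser.parse_args()
    # TODO: enumerate partitions, run at most max_parallel in flight,
    # and validate each partition before marking it complete.
    print(args)

if __name__ == "__main__":
    main()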
Mini challenge
You discover feature drift for 2025-01-10 to 2025-01-14 only. Without touching other dates, design a targeted backfill plan that limits risk and proves correctness. Outline: range, concurrency, write mode, validations, and how you ensure downstreams see only corrected partitions.
Next steps
- Automate common backfill templates with safe defaults.
- Add lineage and audit tags to every partition write.
- Practice with mock outages and timed drills.