Who this is for
- MLOps engineers building training and batch feature pipelines.
- Data engineers responsible for data quality and historical corrections.
- ML practitioners who need reliable retraining after data or logic fixes.
Prerequisites
- Basic batch orchestration concepts (scheduling, retries, task dependencies).
- Comfort with SQL or dataframe transformations.
- Understanding of partitioned data (e.g., by date) and atomic writes.
Why this matters
In real ML systems you will need to:
- Recompute features for past days after a bug fix.
- Ingest late-arriving events without double counting.
- Rebuild training datasets when label definitions change.
- Retry failed jobs safely without corrupting data.
Backfills let you correct history. Idempotency ensures your pipeline can be run repeatedly (by schedule or retry) without producing inconsistent results.
Concept explained simply
Backfill: re-running your pipeline for a past time range or set of keys to fill gaps or apply new logic.
Idempotency: running the same job with the same inputs produces the same outputs, even if you run it many times. This lets you retry safely and backfill confidently.
Mental model
Think of your pipeline as a pure function per partition:
- Output(day) = DeterministicTransform(Inputs frozen for that day)
- Write results keyed by, e.g., (user_id, day), so a re-run overwrites the same rows instead of appending duplicates.
- Make writes atomic (write to a temp location, then swap) or use upsert/merge with a unique key plus a dedup rule; a minimal sketch follows this list.
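To make the mental model concrete, here is a minimal sketch of a keyed overwrite using PostgreSQL-style upsert. The table, column, and parameter names are illustrative, and a unique constraint on (user_id, date) is assumed.
-- Minimal idempotent write per partition (PostgreSQL-style upsert; illustrative names).
-- Assumes a unique constraint on (user_id, date).
INSERT INTO user_daily_features (user_id, date, daily_clicks)
SELECT user_id, event_date, COUNT(*)
FROM clean_events
WHERE event_date = :day               -- one partition at a time
GROUP BY user_id, event_date
ON CONFLICT (user_id, date)
DO UPDATE SET daily_clicks = EXCLUDED.daily_clicks;  -- rerun overwrites, never duplicates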
Core principles for safe backfills and idempotency
- Partitioned processing: operate on time partitions or key ranges.
- Deterministic transforms: fix code version, parameters, and input snapshot.
- Idempotency keys: unique keys like (entity_id, date) or (feature_name, date) prevent duplication.
- Atomic writes: write to a temp location then swap, or use MERGE/UPSERT into the target.
- Versioning: write to a new version (e.g., features_v2) when logic changes; switch readers after validation.
- Deduplication: define a rule for late/duplicate records (e.g., keep the latest by ingestion_time).
- Validation gates: row counts, checksums, sampling diffs before promoting results.
- Parameterization: pass start_date, end_date, or impacted keys into the backfill job.
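As one concrete instance of the versioning principle, the sketch below writes the new logic to its own table and repoints readers through a view after validation. All table and view names here are illustrative.
-- Versioned write, then an explicit reader switch (illustrative names).
CREATE TABLE user_features_v2 AS
SELECT user_id, date, daily_clicks
FROM staging_user_features;           -- output of the new feature logic

-- After validation passes, repoint consumers. They query the view, so the
-- switch is a single statement and is easy to roll back.
CREATE OR REPLACE VIEW user_features AS
SELECT * FROM user_features_v2;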
Worked examples
Example 1: Feature store backfill with late events
- Freeze input snapshot to a cutoff timestamp.
- Process partitions from start_date to end_date.
- Deduplicate events per (user_id, event_id), keeping the latest ingestion_time.
- Merge into user_daily_features keyed by (user_id, date), recording the run_id.
-- Pseudocode SQL (parameters: :start_date, :end_date, :cutoff_ts, :run_id)
WITH dedup AS (
  -- Keep only the latest copy of each event, per (user_id, event_id)
  SELECT *,
         ROW_NUMBER() OVER (
           PARTITION BY user_id, event_id
           ORDER BY ingestion_time DESC
         ) AS rn
  FROM raw_events
  WHERE event_date BETWEEN :start_date AND :end_date
    AND ingestion_time <= :cutoff_ts   -- frozen input snapshot
), clean AS (
  SELECT * FROM dedup WHERE rn = 1
), agg AS (
  -- One deterministic row per idempotency key (user_id, date)
  SELECT user_id, event_date AS date, COUNT(*) AS daily_clicks
  FROM clean
  GROUP BY user_id, event_date
)
MERGE INTO user_daily_features t
USING agg s
  ON t.user_id = s.user_id AND t.date = s.date   -- merge on the idempotency key
WHEN MATCHED THEN UPDATE SET
  daily_clicks = s.daily_clicks, run_id = :run_id, updated_at = CURRENT_TIMESTAMP
WHEN NOT MATCHED THEN INSERT (user_id, date, daily_clicks, run_id, updated_at)
  VALUES (s.user_id, s.date, s.daily_clicks, :run_id, CURRENT_TIMESTAMP);
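Because the MERGE is keyed on (user_id, date) and the inputs are frozen at :cutoff_ts, rerunning this job over the same date range updates the same rows with the same values. A retry after a partial failure converges to the identical final state rather than double counting.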
Example 2: Label backfill after a definition change
- Write to labels_v2 table to avoid mixing versions.
- Run a partitioned backfill for 2023-10-01 to 2023-12-31.
- Validate row counts and label distribution vs v1.
- Update training config to read labels_v2 after validation.
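A validation query for the third step might look like the sketch below. It assumes labels_v1 and labels_v2 share the schema (entity_id, date, label) with a 0/1 label; adjust names to your tables.
-- Compare per-day row counts and label rates between versions
-- (assumed schema: entity_id, date, label with label in {0, 1}).
SELECT COALESCE(v1.date, v2.date) AS date,
       v1.row_count AS v1_rows,
       v2.row_count AS v2_rows,
       v1.positive_rate AS v1_positive_rate,
       v2.positive_rate AS v2_positive_rate
FROM (
  SELECT date, COUNT(*) AS row_count, AVG(label) AS positive_rate
  FROM labels_v1
  WHERE date BETWEEN DATE '2023-10-01' AND DATE '2023-12-31'
  GROUP BY date
) v1
FULL OUTER JOIN (
  SELECT date, COUNT(*) AS row_count, AVG(label) AS positive_rate
  FROM labels_v2
  WHERE date BETWEEN DATE '2023-10-01' AND DATE '2023-12-31'
  GROUP BY date
) v2 ON v1.date = v2.date
ORDER BY 1;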
Example 3: Metric pipeline with atomic swap
- Write day D results to metrics_tmp.daily/D.
- Run validations (row counts, checksum of ids, null checks).
- On success, swap/rename metrics_tmp.daily/D → metrics.daily/D atomically.
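When the output lives in a warehouse table rather than on object storage, one way to get the same effect is a per-day validation gate followed by a transactional delete-and-insert. The sketch below assumes a transactional engine; metrics_daily and metrics_daily_tmp are illustrative names.
-- 1) Validation gate against the temp output for day :d (must pass before promotion).
SELECT COUNT(*) AS row_count,
       SUM(CASE WHEN metric_value IS NULL THEN 1 ELSE 0 END) AS null_metrics
FROM metrics_daily_tmp
WHERE date = :d;

-- 2) Promote atomically: readers see either the old day or the new day, never a mix.
BEGIN;
DELETE FROM metrics_daily WHERE date = :d;
INSERT INTO metrics_daily
SELECT * FROM metrics_daily_tmp WHERE date = :d;
COMMIT;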
Example 4: Key-focused backfill after a bug
- Identify the impacted keys (e.g., the user_ids touched by the bug) instead of reprocessing entire days.
- Freeze the input snapshot with a cutoff_ts, exactly as in Example 1.
- Recompute features only for those keys across the affected date range.
- MERGE on (user_id, date) so the rerun overwrites the bad rows in place; a sketch follows below.
- Validate just the changed rows (counts and spot-check diffs) before closing out the fix.
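A hedged sketch of the key-focused merge is below. Here impacted_users is an assumed staging table listing the affected user_ids, and the dedup step from Example 1 is omitted for brevity (apply it first in practice).
-- Recompute and overwrite rows only for impacted keys (illustrative names).
MERGE INTO user_daily_features t
USING (
    SELECT user_id, event_date AS date, COUNT(*) AS daily_clicks
    FROM raw_events
    WHERE user_id IN (SELECT user_id FROM impacted_users)  -- key-focused scope
      AND event_date BETWEEN :start_date AND :end_date
      AND ingestion_time <= :cutoff_ts                     -- frozen snapshot
    GROUP BY user_id, event_date
) s
ON t.user_id = s.user_id AND t.date = s.date
WHEN MATCHED THEN UPDATE SET
    daily_clicks = s.daily_clicks, run_id = :run_id, updated_at = CURRENT_TIMESTAMP
WHEN NOT MATCHED THEN INSERT (user_id, date, daily_clicks, run_id, updated_at)
    VALUES (s.user_id, s.date, s.daily_clicks, :run_id, CURRENT_TIMESTAMP);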
How to run a safe backfill (step-by-step)
- Scope it: list the exact partitions or keys to reprocess; never run unbounded over all history.
- Freeze inputs: fix a cutoff_ts and pin the code version and parameters so every partition sees the same snapshot.
- Pick the write strategy: MERGE/UPSERT on the idempotency key, or a versioned output (e.g., features_v2) with a reader switch after validation.
- Dry-run one partition and diff its output against current data.
- Run partition by partition with retries enabled; idempotent writes make each retry safe.
- Validate: row counts, checksums, null rates, and distribution checks against expectations.
- Promote atomically (swap or view switch) and watch downstream consumers.
- Record the run: parameters, run_id, validation results, and the rollback plan.
Exercises
Complete the tasks below, then compare with the solutions. A quick test awaits at the end.
Exercise 1: Design an idempotent backfill plan
Use this checklist while designing your plan:
- Start/end partitions or keys defined
- Input cutoff timestamp fixed
- Write strategy chosen (versioned or MERGE)
- Idempotency key identified
- Validation checks listed
- Rollback plan noted
Exercise 2: Write an idempotent upsert
Implement a MERGE/UPSERT that can be rerun without double counting. Use the checklist:
- Deduplication rule defined
- Deterministic grouping
- Merge key chosen
- Run identifier recorded
- Idempotent on re-run
Common mistakes and how to self-check
- No dedup rule: late duplicates inflate metrics. Self-check: rerun with a sample containing duplicates; verify identical output.
- Unbounded backfills: reprocess all history unnecessarily. Self-check: confirm impact analysis and partitions list before running.
- Non-atomic writes: partial data visible. Self-check: ensure temp + swap or transactional MERGE.
- Changing inputs mid-run: inconsistent results across days. Self-check: set and log a cutoff_ts; fail if source drift is detected.
- Lack of validation: promote bad data. Self-check: define must-pass checks before write promotion.
- Mixing versions: readers see both v1 and v2. Self-check: separate output paths or include a version column, then switch consumers explicitly.
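For the dedup and idempotency self-checks, a query like the sketch below (illustrative names) should return zero rows after any number of reruns:
-- The idempotency key must stay unique no matter how many times the job ran.
SELECT user_id, date, COUNT(*) AS copies
FROM user_daily_features
GROUP BY user_id, date
HAVING COUNT(*) > 1;  -- any result rows mean the write path is not idempotent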
Practical projects
- Build a backfillable feature pipeline: Create user_daily_features with MERGE, dedup, and per-day validations; add a parameterized backfill job.
- Label version migration: Produce labels_v1 and labels_v2 for one quarter; compare distributions; implement a safe switch-over.
- Late data simulator: Generate synthetic late events and verify your pipeline is idempotent by re-running the same partition multiple times.
Learning path
- First: Master partitioning, file/table layouts, and atomic write patterns.
- Then: Implement deduplication strategies and MERGE/UPSERT semantics.
- Next: Add validations and observability (counts, null rates, distribution checks).
- Finally: Orchestrate parameterized backfills and create promotion/rollback playbooks.
Next steps
- Turn one of the practical projects into a reusable template for your team.
- Document your backfill SOP: parameters, validations, and communication plan.
- Take the quick test below to confirm understanding.
Mini challenge
In 2–3 sentences, propose an idempotency key and write strategy for a sessionized feature built from clickstream events. Explain how a retry won’t duplicate sessions.
Quick test
Complete the quick test below to confirm your understanding.