Why this matters
Backfills and historical reconstruction ensure your features reflect what was truly knowable at the time a prediction or label occurred. In real work, you will:
- Fix feature bugs and re-materialize months of values without leaking future information.
- Recompute features after schema or logic changes while keeping training data consistent with production.
- Rebuild features when migrating to a feature store, or when new reference tables (e.g., SCD2 dimensions) must be joined correctly.
- Handle late-arriving events and fill gaps without corrupting models.
Concept explained simply
Backfill = recomputing features for a past time range. Historical reconstruction = recreating the exact feature values and joins that would have existed at a past timestamp. The core idea: compute features as of the event time, not when your job runs.
- Event time: when the event happened (e.g., click_time).
- Processing time: when your pipeline saw the event (often later).
- Point-in-time correctness: joins and aggregations exclude data after the event time.
- SCD2/time-travel: choose the dimension record valid at the event time.
- Watermark: a cutoff after which you stop waiting for late events and treat a window as complete.
- Idempotency: repeated backfills produce the same result.
Mental model
Imagine a timeline. For each training example or prediction time, draw a vertical line. Your features are functions of events strictly to the left of that line. Backfill = replay those functions over the timeline using the state of every table exactly as it was at each line, then write outputs without duplicates.
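As a minimal illustration of that rule, the point-in-time constraint can live directly in the join condition. This sketch uses illustrative names (feature_rows, raw_events) rather than any specific schema.
-- Count only events strictly before each as-of timestamp (strictly left of the line).
SELECT r.entity_id,
       r.as_of_ts,
       COUNT(e.event_ts) AS events_before_as_of
FROM feature_rows r
LEFT JOIN raw_events e
  ON e.entity_id = r.entity_id
 AND e.event_ts < r.as_of_ts
GROUP BY r.entity_id, r.as_of_ts;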
Core steps for a safe backfill
- Define the exact time range and grain (e.g., training rows' as_of_timestamp).
- Freeze code, configs, and reference snapshots for reproducibility.
- Choose a strategy:
  - Full: recompute all partitions in the range. Use when logic changes globally.
  - Incremental: recompute only missing or affected windows (e.g., since the watermark).
- Enforce point-in-time joins (time-as-of, SCD2, or time-travel) using the event timestamp.
- Ensure windowed aggregations exclude the current event and all future events.
- Deduplicate by stable keys (e.g., event_id) and event_timestamp.
- Design idempotent writes (overwrite the partition, or upsert on (entity_id, as_of_timestamp)); see the sketch after this list.
- Validate: row counts, nulls, leakage checks, distribution drift, spot-checks against hand-calculated examples.
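For the idempotent-write step above, one common shape is a MERGE keyed on the entity and as-of timestamp. The target and staging table names below are illustrative, and exact MERGE syntax varies by warehouse.
-- Re-running the backfill updates the same rows instead of appending duplicates.
MERGE INTO feature_values t
USING staged_backfill s
  ON t.entity_id = s.entity_id
 AND t.as_of_timestamp = s.as_of_timestamp
WHEN MATCHED THEN UPDATE SET value = s.value
WHEN NOT MATCHED THEN INSERT (entity_id, as_of_timestamp, value)
  VALUES (s.entity_id, s.as_of_timestamp, s.value);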
What if my reference table is SCD2?
Join on keys AND ensure the chosen record has valid_from <= event_time and (valid_to is null OR valid_to > event_time). This selects the version that existed at event_time.
What about late-arriving events?
Use a watermark to decide when to stop waiting. If late events arrive before the watermark, include them. If after, decide whether to re-backfill impacted windows or accept reduced completeness.
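One way to find the windows touched by late data, assuming each row carries an ingestion timestamp (ingested_at is an assumed column) and :watermark / :materialized_through are parameters you supply:
-- Rows that arrived after we stopped waiting, but whose event time falls in
-- windows already materialized; their dates are candidates to re-backfill.
SELECT DISTINCT DATE_TRUNC('day', event_time) AS impacted_partition
FROM events
WHERE ingested_at > :watermark              -- arrived after the cutoff
  AND event_time <= :materialized_through;  -- but belong to windows already written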
Worked examples
Example 1: 7-day rolling CTR per ad
Goal: For each impression at time t, compute clicks in (t-7d, t) / impressions in (t-7d, t).
-- Pseudo-SQL (point-in-time safe)
WITH w AS (
  SELECT
    ad_id,
    imp_id,
    imp_time AS as_of_ts,
    evt_type,
    SUM(CASE WHEN evt_type = 'impression' THEN 1 ELSE 0 END)
      OVER (PARTITION BY ad_id ORDER BY evt_time
            RANGE BETWEEN INTERVAL '7' DAY PRECEDING AND INTERVAL '1' SECOND PRECEDING) AS imps_7d,
    SUM(CASE WHEN evt_type = 'click' THEN 1 ELSE 0 END)
      OVER (PARTITION BY ad_id ORDER BY evt_time
            RANGE BETWEEN INTERVAL '7' DAY PRECEDING AND INTERVAL '1' SECOND PRECEDING) AS clicks_7d
  FROM ad_events  -- impressions and clicks unioned into one table; evt_time is the event timestamp,
                  -- imp_id/imp_time are populated only on impression rows
)
SELECT ad_id, imp_id, as_of_ts,
       CASE WHEN imps_7d = 0 THEN 0 ELSE clicks_7d * 1.0 / imps_7d END AS ctr_7d
FROM w
WHERE evt_type = 'impression';  -- emit one feature row per impression only
Note the window excludes the current impression and any future events.
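If ad_events can contain duplicate deliveries, the deduplication step from the checklist might run before the windowing above; evt_id and load_ts are assumed columns here.
-- Keep exactly one row per evt_id: the earliest-loaded copy wins.
SELECT *
FROM (
  SELECT e.*,
         ROW_NUMBER() OVER (PARTITION BY evt_id ORDER BY load_ts) AS rn
  FROM ad_events e
) d
WHERE rn = 1;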
Example 2: Daily active minutes per user (bug fix backfill)
Bug: the previous logic double-counted overlapping sessions. Fix the logic and re-materialize the past 90 days.
- Define range: last 90 days at daily grain per user.
- Recompute from raw sessions, merging overlaps first.
- Write to offline store partitioned by date, overwriting those partitions (idempotent).
- Validate against sampled users and compare distributions to prior runs.
-- Pseudo-SQL to merge overlaps then sum per day
WITH ordered AS (
SELECT user_id, start_ts, end_ts
FROM sessions
),
merged AS (
-- Implementation detail skipped: collapse overlapping intervals per user
),
by_day AS (
SELECT user_id,
DATE_TRUNC('day', ts) AS d,
SUM(active_minutes) AS act_min
FROM expand_intervals(merged) -- explode intervals into daily buckets
GROUP BY 1,2
)
SELECT user_id, d AS as_of_ts, act_min FROM by_day;
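The merge step skipped above can be implemented with the gaps-and-islands pattern. This is one possible sketch, assuming sessions(user_id, start_ts, end_ts) with no NULL timestamps.
-- Collapse overlapping sessions per user into non-overlapping intervals.
WITH flagged AS (
  SELECT user_id, start_ts, end_ts,
         CASE WHEN start_ts <= MAX(end_ts) OVER (
                     PARTITION BY user_id ORDER BY start_ts
                     ROWS BETWEEN UNBOUNDED PRECEDING AND 1 PRECEDING)
              THEN 0 ELSE 1 END AS is_new_island  -- 1 = starts a new merged interval
  FROM sessions
),
islands AS (
  SELECT user_id, start_ts, end_ts,
         SUM(is_new_island) OVER (PARTITION BY user_id ORDER BY start_ts) AS island_id
  FROM flagged
)
SELECT user_id, MIN(start_ts) AS start_ts, MAX(end_ts) AS end_ts
FROM islands
GROUP BY user_id, island_id;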
Example 3: Credit risk with SCD2 income table
Compute features at application_time using the income row valid at that time.
SELECT a.app_id, a.user_id, a.application_time AS as_of_ts,
inc.annual_income
FROM applications a
JOIN income_scd2 inc
ON inc.user_id = a.user_id
AND inc.valid_from <= a.application_time
AND (inc.valid_to IS NULL OR inc.valid_to > a.application_time);
-- Add rolling transaction features strictly before application_time
WITH tx AS (
SELECT user_id, txn_time, amount FROM transactions
), win AS (
SELECT a.app_id,
SUM(amount) FILTER (WHERE txn_time > a.application_time - INTERVAL '90' DAY
AND txn_time < a.application_time) AS spend_90d
FROM applications a
LEFT JOIN tx ON tx.user_id = a.user_id
GROUP BY a.app_id
)
SELECT app_id, COALESCE(spend_90d, 0) AS spend_90d FROM win;
This avoids leakage by selecting the correct SCD2 record and limiting transactions to before application_time.
Who this is for
- Machine Learning Engineers and Data Scientists building training sets and online features.
- Data Engineers responsible for feature pipelines and data quality.
Prerequisites
- Comfort with SQL window functions and joins.
- Basic understanding of feature stores (offline vs online) and event-time semantics.
- Familiarity with how your data warehouse handles partitions and idempotent writes.
Learning path
- Start: Event-time vs processing-time, point-in-time joins.
- Then: SCD2 joins and time-travel queries; rolling windows.
- Next: Idempotent upserts, partition strategies, and validations.
- Finally: Run a small backfill end-to-end and add automated checks.
Exercises
Exercise 1 — Point-in-time 7-day rolling spend (SQL)
Compute a 7-day rolling sum of transaction amounts per user for each training row time. Tables:
- transactions(user_id, event_timestamp, amount, event_id)
- feature_requests(user_id, as_of_timestamp)
Ensure you exclude events at or after as_of_timestamp.
Checklist
- Partition by user_id.
- Window: (as_of_timestamp - 7 days) to (as_of_timestamp - 1 second).
- Deduplicate by event_id if needed.
- Join transactions to requests using event-time constraints, not processing time.
Exercise 2 — Design a safe backfill plan
You fixed a bug that double-counted daily active days and must re-materialize the last 180 days. Draft a step-by-step plan including strategy, idempotency, watermarking, and validations.
Checklist
- Choose full vs incremental backfill and explain why.
- Specify partitions and write mode.
- Define leakage checks and sampling plan.
- Decide how to handle late events relative to watermark.
Common mistakes and how to self-check
- Using processing time for joins. Self-check: verify all joins constrain records to event_time or use time-travel AS OF.
- Including the current event in rolling features. Self-check: window ends at as_of_timestamp minus 1 second or excludes current row.
- Missing SCD2 filters. Self-check: ensure valid_from/valid_to conditions select the correct historical version.
- Non-idempotent writes causing duplicates. Self-check: unique key on (entity_id, as_of_timestamp) or overwrite partition semantics.
- No validation after backfill. Self-check: compare row counts, null rates, summary stats, and run leakage tests by shifting labels backward.
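A minimal validation query for the last self-check, comparing the new backfill against the previous materialization; features_v1, features_v2, and feature_value are illustrative names.
-- Row counts and null rates per day, old run vs new run, side by side.
SELECT COALESCE(n.d, o.d) AS d,
       o.row_cnt   AS old_rows,      n.row_cnt   AS new_rows,
       o.null_rate AS old_null_rate, n.null_rate AS new_null_rate
FROM (SELECT DATE_TRUNC('day', as_of_ts) AS d,
             COUNT(*) AS row_cnt,
             AVG(CASE WHEN feature_value IS NULL THEN 1.0 ELSE 0.0 END) AS null_rate
      FROM features_v2
      GROUP BY 1) n
FULL OUTER JOIN
     (SELECT DATE_TRUNC('day', as_of_ts) AS d,
             COUNT(*) AS row_cnt,
             AVG(CASE WHEN feature_value IS NULL THEN 1.0 ELSE 0.0 END) AS null_rate
      FROM features_v1
      GROUP BY 1) o
  ON n.d = o.d
ORDER BY 1;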
Practical projects
- Rebuild 90-day rolling engagement features for a public-style events table and validate with sampled users.
- Implement SCD2 time-as-of joins on a mock customer dimension and verify with hand-labeled cases.
- Create a backfill runner that processes date partitions with idempotent upserts and summary validation outputs.
Next steps
- Automate a reusable backfill template with parameters: time range, overwrite mode, and validation level.
- Add monitoring that flags drift/leakage if future backfills diverge from expectations.
Mini challenge
You have orders(order_id, user_id, order_time, product_id) and product_prices_scd2(product_id, price, valid_from, valid_to). Outline a point-in-time SQL that computes the price_at_order and a 30-day pre-order spend per user. Keep it to ~5–10 lines and ensure no leakage.