Why this matters
In real systems, data arrives late or out of order, or is corrected after the fact. If you ignore this, your aggregates drift, your labels become wrong, and your models degrade. Handling late data and reprocessing lets you reconcile truth over time without breaking dashboards or model lineage.
- You will re-run backfills after a bug fix or partner resend.
- You will update training sets and models when labels are corrected.
- You must keep results reproducible and auditable across versions.
Who this is for
- MLOps Engineers building batch or streaming pipelines.
- Data/ML Engineers responsible for reliable metrics and training data.
- Anyone who needs consistent versions across data, features, and models.
Prerequisites
- Basic understanding of data pipelines (batch or streaming).
- Familiarity with dataset versioning and immutable artifacts.
- Knowing event time vs processing time is helpful but not required.
Concept explained simply
Late data are records that belong to a past time but arrive now. Reprocessing is the controlled re-run that corrects past results using late or corrected data. To do this safely, we use:
- Event time windows with a lateness allowance (watermark).
- Idempotent, re-runnable jobs with deterministic inputs.
- Data and model versioning so each correction is traceable.
Mental model
Think of your data as a ledger. New facts can be appended late, and sometimes you discover an entry was wrong and must be amended. You never erase history; you add a new, corrected entry and publish a new versioned report or model that references the exact data snapshot used.
Key terms, briefly
- Event time: when the event actually happened.
- Processing time: when your system saw it.
- Lateness: difference between processing time and event time.
- Watermark: a moving cutoff time that says, “we consider earlier windows complete.”
- Backfill: re-running the pipeline for a historical range.
- Idempotent: re-running yields the same result.
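To make these terms concrete, here is a minimal sketch using only Python's standard datetime module; the timestamps and the 48-hour allowance are illustrative, not recommendations.

```python
from datetime import datetime, timedelta, timezone

ALLOWED_LATENESS = timedelta(hours=48)  # illustrative grace period

# A record carries its event time; processing time is when the system saw it.
event_time = datetime(2024, 11, 7, 23, 50, tzinfo=timezone.utc)
processing_time = datetime(2024, 11, 9, 6, 15, tzinfo=timezone.utc)
lateness = processing_time - event_time  # how far behind the event arrived

# A simple watermark: the latest event time seen so far, minus the allowance.
latest_event_seen = datetime(2024, 11, 9, 6, 0, tzinfo=timezone.utc)
watermark = latest_event_seen - ALLOWED_LATENESS

# Windows (here: whole days) whose end is before the watermark are treated as complete.
window_end = datetime(2024, 11, 8, 0, 0, tzinfo=timezone.utc)  # end of day 2024-11-07
is_final = window_end <= watermark

print(f"lateness={lateness}, watermark={watermark}, day 2024-11-07 final={is_final}")
```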
Core patterns you will use
- Watermarking windows: keep windows open for a grace period (e.g., 3 days) to absorb common delays before finalizing.
- Incremental recomputation: recompute only affected windows/partitions to save time and cost.
- Deduplication and ordering: stable keys and event-time ordering prevent double counting.
- Versioned truth: every correction becomes a new dataset version; models reference exact dataset versions.
- Immutable artifacts: never overwrite in place; publish a new version/tag and deprecate the old.
- Time-travel capable storage or snapshots: allow point-in-time reads when rebuilding training sets.
- Safe rollouts: publish new versions behind a flag; enable after validation; keep rollback path.
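As a minimal, file-based sketch of the idempotent-write and versioned-truth patterns above: each partition is written inside a versioned output, and an alias is moved only after validation. A production system would typically use a table format with snapshots; the paths, version strings, and helper names here are hypothetical.

```python
import json
from pathlib import Path

ROOT = Path("warehouse/daily_revenue")  # hypothetical output location

def write_partition(version: str, day: str, rows: list[dict]) -> Path:
    """Idempotently (re)write one day's partition inside a versioned output.

    Re-running with the same inputs overwrites only this partition of this
    version; other versions and other days are never touched.
    """
    part = ROOT / version / f"day={day}"
    part.mkdir(parents=True, exist_ok=True)
    (part / "data.json").write_text(json.dumps(rows, sort_keys=True))
    return part

def promote(version: str, alias: str = "latest") -> None:
    """Point the alias at a version only after validation has passed."""
    (ROOT / f"_alias_{alias}.txt").write_text(version)

# Re-running the same write is harmless, and older versions stay intact for audit.
write_partition("v2024-11-07.r1", "2024-11-07", [{"revenue": 100}])
write_partition("v2024-11-07.r2", "2024-11-07", [{"revenue": 130}])  # recompute with late events
promote("v2024-11-07.r2")
```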
Worked examples
Example 1 — Late events in daily revenue
Situation: Payment events can arrive up to 48 hours late. You publish daily revenue.
- Define the watermark: event time plus a 72h grace period (48h typical lateness plus a buffer).
- Daily job computes revenue for day D but marks it provisional until watermark passes.
- If late events for day D arrive within 72h, recompute only partition D.
- Publish dataset version: revenue_v2024-11-07.r2 (r2 = recomputation count).
- Downstream dashboards read the latest version by alias: revenue_latest.
Alternate approach: Two-phase publishing
Phase 1: Provisional (fast, likely incomplete). Phase 2: Final (after watermark). Consumers can choose speed vs. accuracy by selecting the alias.
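A rough sketch of the two-phase idea, assuming a 72h grace period; the event structure (event_date, amount) and the phase labels are made up for illustration.

```python
from datetime import date, datetime, timedelta, timezone

GRACE = timedelta(hours=72)  # 48h typical lateness plus a buffer (illustrative)

def publish_day(day: date, events: list[dict], now: datetime) -> dict:
    """Compute revenue for one day and label it provisional or final.

    The result becomes final once `now` is past the end of the day plus the
    grace period; before that, late events may still change it.
    """
    day_end = datetime.combine(day + timedelta(days=1), datetime.min.time(), tzinfo=timezone.utc)
    revenue = sum(e["amount"] for e in events if e["event_date"] == day)
    phase = "final" if now >= day_end + GRACE else "provisional"
    return {"day": day.isoformat(), "revenue": revenue, "phase": phase}

events = [
    {"event_date": date(2024, 11, 7), "amount": 100.0},
    {"event_date": date(2024, 11, 7), "amount": 30.0},  # arrived a day late
]
# Still provisional: the watermark for Nov 7 only passes at Nov 11, 00:00 UTC.
print(publish_day(date(2024, 11, 7), events, datetime(2024, 11, 9, 12, 0, tzinfo=timezone.utc)))
```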
Example 2 — Partner resend requires backfill
Situation: A partner re-sends one week of corrected events (Nov 1–7).
- Freeze a new input snapshot: partner_feed_v1.3_correction_2024-11-10.
- Backfill compute for partitions Nov 1–7 only.
- Write to new output dataset version: metrics_v2.1.
- Run validation: totals and record counts vs. prior version; diff report stored.
- Promote alias metrics_latest to v2.1 upon validation.
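A small sketch of the validation step: compare per-partition record counts and totals between the prior and the new version and keep the result as a diff report. The partition keys and values below are hypothetical.

```python
def diff_report(prior: dict[str, list[float]], new: dict[str, list[float]]) -> list[dict]:
    """Compare per-partition record counts and totals between two dataset versions.

    `prior` and `new` map a partition key (e.g. an event date) to its values.
    """
    report = []
    for partition in sorted(set(prior) | set(new)):
        old_rows, new_rows = prior.get(partition, []), new.get(partition, [])
        report.append({
            "partition": partition,
            "count_delta": len(new_rows) - len(old_rows),
            "total_delta": round(sum(new_rows) - sum(old_rows), 2),
        })
    return report

# Hypothetical totals before and after applying the partner's corrected feed.
prior = {"2024-11-01": [100.0, 20.0], "2024-11-02": [50.0]}
new = {"2024-11-01": [100.0, 20.0, 5.0], "2024-11-02": [55.0]}
for row in diff_report(prior, new):
    print(row)
```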
Checklist for safe backfill
- Idempotent writes: overwrite only target partitions in the new versioned output.
- Dedup keys validated for the corrected feed.
- Recorded lineage: input versions and job run IDs.
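One way to sketch the dedup step for a corrected resend: keep a single record per stable key, preferring the most recently ingested copy, so re-running the merge is idempotent. The field names (order_id, ingested_at, event_time) are assumptions for illustration.

```python
def dedup_latest(records: list[dict], key_fields: tuple = ("order_id",)) -> list[dict]:
    """Keep one record per stable key, preferring the most recently ingested copy."""
    best: dict = {}
    for rec in records:
        key = tuple(rec[f] for f in key_fields)
        if key not in best or rec["ingested_at"] > best[key]["ingested_at"]:
            best[key] = rec
    return sorted(best.values(), key=lambda r: r["event_time"])

# The corrected resend overlaps the original feed; the merge keeps one row per order.
merged = dedup_latest([
    {"order_id": "A1", "event_time": "2024-11-03T10:00Z", "ingested_at": "2024-11-03T12:00Z", "amount": 20},
    {"order_id": "A1", "event_time": "2024-11-03T10:00Z", "ingested_at": "2024-11-10T08:00Z", "amount": 25},
])
print(merged)  # only the corrected copy (amount=25) remains
```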
Example 3 — Corrected labels require model retrain
Situation: Labels for fraud were wrong for October; corrections arrive.
- Create corrected label dataset: labels_v3.0.
- Rebuild training set snapshot: training_data_v5.0 referencing labels_v3.0.
- Retrain model to produce model_v2.4 with metadata: inputs=[training_data_v5.0], code_commit=abc123.
- Run backtests on October–November; compare to model_v2.3.
- Canary deploy model_v2.4; promote on success. Keep rollback to v2.3.
Model lineage you must record
- Exact input dataset versions and time ranges.
- Feature definitions and their versions.
- Training code version and hyperparameters.
- Evaluation metrics and datasets used.
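A sketch of the lineage record that could accompany model_v2.4; every value below is a placeholder, and the JSON layout is one possible convention rather than a required format.

```python
import json
from pathlib import Path

# Everything needed to reproduce and audit the model_v2.4 training run.
lineage = {
    "model_version": "model_v2.4",
    "inputs": [{"dataset": "training_data_v5.0", "time_range": ["2024-10-01", "2024-11-30"]}],
    "feature_definitions": {"txn_velocity": "v7", "device_risk": "v3"},  # hypothetical features
    "code_commit": "abc123",
    "hyperparameters": {"learning_rate": 0.05, "max_depth": 6},          # placeholders
    "evaluation": {"dataset": "backtest_2024-10_to_11", "metrics": ["auc", "recall_at_1pct_fpr"]},
}

Path("model_v2.4.lineage.json").write_text(json.dumps(lineage, indent=2))
```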
Example 4 — Change data capture (CDC) updates
Situation: Upserts arrive for customer addresses; historical aggregates must reflect the address that was valid at each event's event time.
- Ingest CDC with valid_from and valid_to periods (slowly changing dimension Type 2 style).
- For each fact, join on address valid at event_time.
- Reprocess only affected partitions when CDC updates close/open intervals.
- Publish new dataset version with change manifest describing impacted partitions.
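A minimal point-in-time join in plain Python, assuming Type 2 validity intervals on the address dimension; the table contents and the half-open interval convention (valid_from inclusive, valid_to exclusive) are illustrative.

```python
from datetime import date

# Slowly changing dimension (Type 2): each address row carries a validity interval.
addresses = [
    {"customer": "c1", "city": "Lyon", "valid_from": date(2024, 1, 1), "valid_to": date(2024, 11, 5)},
    {"customer": "c1", "city": "Paris", "valid_from": date(2024, 11, 5), "valid_to": date(9999, 12, 31)},
]

def address_at(customer: str, event_time: date):
    """Return the address version that was valid when the event happened."""
    for row in addresses:
        if row["customer"] == customer and row["valid_from"] <= event_time < row["valid_to"]:
            return row
    return None

# A fact from before the move keeps the old address; a later one gets the new one.
print(address_at("c1", date(2024, 10, 20))["city"])  # Lyon
print(address_at("c1", date(2024, 11, 20))["city"])  # Paris
```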
How to implement reliably
- Decide lateness SLAs per source (e.g., 24–72h grace) and encode as watermarks.
- Partition by event date; store deterministic inputs and outputs with versions.
- Make jobs idempotent: deterministic grouping, stable keys, and overwriting only the target partitions.
- Track lineage: inputs, code commit, parameters, run ID, and output version.
- Validate: compare aggregates, counts, and sampled diffs before promotion.
- Communicate: publish release notes for each new version and deprecation timeline.
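As one way to sketch the validate-before-promote step, assuming a simple relative-change threshold on a headline aggregate; real acceptance criteria would also cover record counts, sampled diffs, and per-partition checks.

```python
def can_promote(prior_total: float, new_total: float, max_rel_change: float = 0.02) -> bool:
    """Gate promotion: block if the aggregate moved more than the agreed threshold.

    The 2% default is illustrative; real thresholds come from the lateness SLA
    and what downstream consumers can tolerate.
    """
    if prior_total == 0:
        return new_total == 0
    rel_change = abs(new_total - prior_total) / abs(prior_total)
    return rel_change <= max_rel_change

assert can_promote(10_000.0, 10_150.0)      # +1.5%: within threshold, safe to promote
assert not can_promote(10_000.0, 11_000.0)  # +10%: investigate before promoting
```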
Minimal metadata to store per run
- run_id, job_name, code_version
- input_versions and time ranges
- output_version and affected partitions
- validation summary and sign-off
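A sketch of that minimal run record as JSON; the field names follow the list above and all values are placeholders.

```python
import json
import uuid
from datetime import datetime, timezone

# One record per pipeline run, stored next to the output or in a run registry.
run_record = {
    "run_id": str(uuid.uuid4()),
    "job_name": "daily_revenue_backfill",
    "code_version": "abc123",  # git commit of the job code
    "input_versions": {"partner_feed": "v1.3_correction_2024-11-10"},
    "input_time_range": ["2024-11-01", "2024-11-07"],
    "output_version": "metrics_v2.1",
    "affected_partitions": [f"2024-11-0{d}" for d in range(1, 8)],
    "validation": {"status": "passed", "signed_off_by": "on-call data engineer"},
    "started_at": datetime.now(timezone.utc).isoformat(),
}
print(json.dumps(run_record, indent=2))
```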
Common mistakes and how to self-check
- Finalizing too early: choose watermarks based on evidence of lateness; monitor late arrival rate.
- Overwriting in place: always produce a new version; keep immutability for audit and rollback.
- Global reprocess: re-run only impacted partitions/windows; keep a manifest of changes.
- Ignoring dedup: ensure stable primary keys and idempotent merges.
- Model-data mismatch: model must reference exact dataset version used; store lineage.
- No validation: define acceptance thresholds; block promotion if exceeded.
Self-check
- Can you recreate yesterday’s published metrics byte-for-byte?
- Can you list all datasets and model versions impacted by a correction?
- Can you roll back without data loss or downtime?
Exercises
Do these to cement the concepts.
Exercise 1 — Plan for late clickstream events
Define a watermark, dedup strategy, and versioning scheme for a daily active users (DAU) metric where 5% of events arrive up to 36 hours late.
Exercise 2 — Labels correction reprocessing plan
Outline the data and model version changes when a label dataset is corrected for the last two weeks and the model must be retrained.
Exercise checklist
- Includes a grace period and when results turn from provisional to final.
- Specifies partitioning and which partitions to re-run.
- Defines dataset and model version naming.
- Lists validation metrics and promotion criteria.
Practical projects
- Build a mini pipeline that computes daily metrics with a 48h watermark and publishes provisional and final outputs with version tags.
- Create a backfill command that accepts a date range and generates a change manifest of impacted partitions.
- Implement a training pipeline that reads a versioned snapshot and writes model artifacts with complete lineage metadata.
Learning path
- Start: Event time vs processing time and watermarks.
- Next: Idempotent writes, partitioning, and dedup keys.
- Then: Versioned datasets and artifact lineage.
- Finally: Automating backfills and safe promotions.
Mini challenge
Create a simulation where 10% of events arrive 24–72 hours late. Produce a daily metric that first publishes provisional results, then finalizes after a watermark. Log every promotion with input and output versions, and generate a diff report comparing provisional vs final totals.
Next steps
- Add monitoring for late arrival rates and automatically adapt watermarks within safe bounds.
- Introduce change manifests to target reprocessing to affected partitions only.
- Standardize version naming and release notes across data and models.