
Scheduling Backfills for Partitioned Datasets

Learn scheduling backfills for partitioned datasets with explanations, exercises, and a quick test (for Data Platform Engineers).

Published: January 11, 2026 | Updated: January 11, 2026

Why this matters

In real platforms, jobs fail, sources arrive late, and pipelines drift. Backfills and partition-aware scheduling let you recover missing data safely without breaking SLAs or overloading systems. As a Data Platform Engineer, you will plan, throttle, and monitor backfills so that historical partitions are rebuilt correctly while daily pipelines keep running.

  • Recover missed days after outages.
  • Recompute partitions after logic or schema fixes.
  • Populate new tables from historical sources.
  • Handle late-arriving records and enforce data quality.

Concept explained simply

Think of your dataset as a calendar or a set of buckets (partitions). Each bucket contains data for a time window or an ID range. Scheduling partitions means deciding which buckets to fill, when to fill them, and how fast to fill them. Backfilling is filling old buckets you skipped or need to fix.
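To make "which buckets" concrete, here is a minimal sketch in Python that enumerates daily partition keys for a window; the dates are the ones used in Example 1 below.

    from datetime import date, timedelta

    def daily_partitions(start: date, end: date) -> list[str]:
        """Enumerate daily partition keys, inclusive of both boundaries."""
        days = (end - start).days + 1
        return [(start + timedelta(days=i)).isoformat() for i in range(days)]

    # Five skipped buckets, listed oldest-first:
    print(daily_partitions(date(2025, 4, 10), date(2025, 4, 14)))
    # ['2025-04-10', '2025-04-11', '2025-04-12', '2025-04-13', '2025-04-14']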

Mental model

  • Front of the stream: today’s partition (current schedule).
  • Backlog: older partitions waiting to be processed.
  • Valve: concurrency limit to control pressure on systems.
  • Gauge: data quality checks to verify correctness.

Quick glossary
  • Partition: A slice of data (e.g., daily YYYY-MM-DD or an ID range).
  • Backfill: Reprocessing one or many historical partitions.
  • Catchup: Scheduler option to run all missed schedules since last run.
  • Watermark: The most recent partition successfully processed.
  • Idempotent: Re-running a partition gives the same correct result (no duplicates).

Key concepts and guardrails

  • Choose the right partition grain: daily, hourly, monthly, or by ID range. Finer grains give safer, smaller retries.
  • Define clear boundaries: inclusive/exclusive start-end when selecting partitions.
  • Make each task idempotent: use MERGE/UPSERT or delete-and-reload scoped to the partition (see the sketch after this list).
  • Throttle concurrency: set a maximum number of active partitions to avoid overloading compute or sources.
  • Order of execution: oldest-first is safest for dependencies and watermarks.
  • Data quality gates: row counts, uniqueness, null-rate, and freshness checks per partition.
  • Late-arriving data strategy: periodic small reprocess windows or incremental merge.
  • Pause/resume plan: monitor and be ready to halt if anomalies occur.
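
As a sketch of the idempotency guardrail above, here is one way to scope a delete-and-reload to a single partition. The run_sql helper and the dt column are assumptions standing in for your warehouse client and partition key:

    def overwrite_partition(run_sql, table: str, partition: str, select_sql: str) -> None:
        """Delete-and-reload scoped to one partition, so re-runs give the same result.
        `partition` must come from a trusted, enumerated list (not user input)."""
        run_sql("BEGIN")
        run_sql(f"DELETE FROM {table} WHERE dt = '{partition}'")
        run_sql(f"INSERT INTO {table} SELECT * FROM ({select_sql}) src WHERE src.dt = '{partition}'")
        run_sql("COMMIT")

Running both statements in one transaction avoids a window where the partition is empty; a MERGE on stable keys achieves the same idempotency without the delete.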

How to think about watermarks

Treat the watermark like a bookmark: it marks the last confirmed partition. A backfill moves the bookmark backwards to re-read and correct earlier chapters, then advances it forward again.
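
A minimal sketch of the bookmark idea, using an in-memory dict; in production this would be a watermark table (see Practical projects below):

    watermarks: dict[str, str] = {}  # dataset -> last confirmed partition

    def advance_watermark(dataset: str, partition: str) -> None:
        """Move the bookmark forward only once a partition is confirmed good.
        ISO date strings compare correctly, so '>' means 'newer'."""
        current = watermarks.get(dataset)
        if current is None or partition > current:
            watermarks[dataset] = partition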

Worked examples

Example 1: Missed 5 daily partitions (YYYY-MM-DD)

Scenario: Ingestion failed for 2025-04-10 to 2025-04-14. Downstream warehouse has limited slots.

  • Partition list: 2025-04-10 ... 2025-04-14 (5 days).
  • Idempotency: For each day, DELETE WHERE dt = partition, then INSERT; or use MERGE on a unique key per day.
  • Concurrency: 2 parallel partitions; order oldest-first.
  • Checks: Compare counts with source; dedup key check; null-rate thresholds.
  • Outcome: Safe, controlled backfill that respects compute limits.
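
A minimal sketch of this plan; load_day is a placeholder for the real idempotent load of one day:

    from concurrent.futures import ThreadPoolExecutor

    PARTITIONS = ["2025-04-10", "2025-04-11", "2025-04-12",
                  "2025-04-13", "2025-04-14"]

    def load_day(partition: str) -> str:
        # Placeholder: DELETE WHERE dt = partition, then INSERT (or MERGE).
        print(f"backfilling {partition}")
        return partition

    # max_workers=2 is the valve: at most two partitions in flight.
    # map() submits in list order, so the oldest days start first.
    with ThreadPoolExecutor(max_workers=2) as pool:
        for done in pool.map(load_day, PARTITIONS):
            print(f"done: {done}")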

Example 2: Monthly aggregates after logic fix

Scenario: A bug in monthly revenue logic from 2024-01 to 2024-06.

  • Partition list: months Jan–Jun 2024.
  • Dependency: Ensure daily facts are correct first; then recompute monthly derived partitions.
  • Idempotency: Overwrite monthly partition or MERGE with stable keys (year, month, customer_id).
  • Concurrency: 1 partition at a time (aggregate is heavy).
  • Validation: Aggregate totals vs daily sums; change rate within expected bounds.
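
A hedged example of the validation step above: compare each recomputed monthly total against the sum of its daily facts. The table and column names (revenue_monthly, revenue_daily, dt) are assumptions, and DATE_TRUNC assumes a warehouse that supports it:

    VALIDATE_MONTHLY_SQL = """
    SELECT m.month,
           m.revenue      AS monthly_total,
           SUM(d.revenue) AS daily_sum
    FROM revenue_monthly m
    JOIN revenue_daily d
      ON DATE_TRUNC('month', d.dt) = m.month
    WHERE m.month BETWEEN '2024-01-01' AND '2024-06-01'
    GROUP BY m.month, m.revenue
    HAVING ABS(m.revenue - SUM(d.revenue)) > 0.01
    """
    # Any rows returned are months whose aggregate disagrees with its daily facts.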

Example 3: Backfill by ID ranges

Scenario: Historical customer profiles need enrichment. Table is not time-partitioned; use ID range partitioning.

  • Define ranges: [1–100000], [100001–200000], ... (see the enumeration sketch after this list).
  • Idempotency: MERGE on customer_id; deterministic transformations.
  • Concurrency: 3 ranges in parallel; throttle API calls to the enrichment service.
  • Checks: Count matched vs updated; sample manual verification.
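
The range enumeration referenced above, as a small sketch with an assumed chunk size of 100,000 IDs:

    def id_ranges(max_id: int, chunk: int = 100_000) -> list[tuple[int, int]]:
        """Split [1, max_id] into inclusive (lo, hi) ranges, one per backfill task."""
        return [(lo, min(lo + chunk - 1, max_id)) for lo in range(1, max_id + 1, chunk)]

    print(id_ranges(250_000))
    # [(1, 100000), (100001, 200000), (200001, 250000)]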

Example 4: Late-arriving data strategy

Run a rolling “last 3 days” mini-backfill nightly to capture late events. Keep it small to limit cost; ensure idempotency.
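
A small sketch of computing the rolling window, assuming "last 3 days" means today's partition plus the two before it:

    from datetime import date, timedelta

    def rolling_window(today: date, days: int = 3) -> list[str]:
        """Partitions for the nightly mini-backfill, oldest first."""
        return [(today - timedelta(days=i)).isoformat()
                for i in range(days - 1, -1, -1)]

    print(rolling_window(date(2025, 4, 14)))
    # ['2025-04-12', '2025-04-13', '2025-04-14']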

How to schedule backfills safely (step-by-step)

  1. Confirm partitioning: grain (daily/hourly/ID), and boundary rules (inclusive start, inclusive end).
  2. Compute partition list: explicitly enumerate partitions to run.
  3. Make tasks idempotent: partition-scoped DELETE+INSERT or MERGE; avoid blind appends.
  4. Throttle: set max active runs; respect source/API/warehouse limits (see the runner sketch after these steps).
  5. Order oldest-first: aligns with dependencies and watermarks.
  6. Retries and alerts: limited retries with backoff; alert on persistent failures.
  7. Data quality checks: pre-check source availability; post-check counts, duplicates, nulls.
  8. Dry run: test on 1–2 partitions; validate outputs.
  9. Monitor and pause if needed: watch runtime, error rates, cost, and queue depth.
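
Putting steps 2–6 together, a minimal sketch of a throttled, retrying runner; the limits and the process callable are assumptions to replace with your own task and capacity numbers:

    import time
    from concurrent.futures import ThreadPoolExecutor, as_completed

    MAX_ACTIVE = 2   # step 4: throttle concurrent partitions
    MAX_RETRIES = 3  # step 6: limited retries

    def run_with_retries(process, partition: str) -> None:
        """Retry one partition with exponential backoff, then fail loudly."""
        for attempt in range(1, MAX_RETRIES + 1):
            try:
                process(partition)
                return
            except Exception:
                if attempt == MAX_RETRIES:
                    raise  # step 6: alert on persistent failure
                time.sleep(2 ** attempt)  # backoff: 2s, 4s, ...

    def backfill(process, partitions: list[str]) -> None:
        ordered = sorted(partitions)  # step 5: oldest-first (ISO keys sort correctly)
        with ThreadPoolExecutor(max_workers=MAX_ACTIVE) as pool:
            futures = [pool.submit(run_with_retries, process, p) for p in ordered]
            for fut in as_completed(futures):
                fut.result()  # surface failures; per step 9, a real runner would pause here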

Templates you can reuse

Backfill run sheet
- Objective:
- Partition key & grain:
- Window & boundaries:
- Partition list:
- Concurrency limit:
- Idempotency method:
- DQ checks (pre/post):
- Alerts & rollback plan:

Exercises

Do these before the test.

Exercise 1 (ex1): Design a 30-day backfill safely

Dataset: daily events table with unique key (event_id, event_date). The source API has rate limits; warehouse slots are shared.

  • Decide concurrency, ordering, and idempotency approach.
  • List pre- and post-run data quality checks.
  • Describe a pause/resume plan and monitoring signals.

Exercise 2 (ex2): Enumerate partitions and set catchup

Backfill window: start=2025-03-28, end=2025-04-02, daily inclusive.

  • Enumerate the exact partitions in run order.
  • Explain what happens if catchup=false vs catchup=true while the daily schedule continues.

Checklist: Before you run a backfill

  • [ ] Partitions enumerated explicitly and reviewed.
  • [ ] Idempotency verified (MERGE/UPSERT or delete+insert scoped to partition).
  • [ ] Concurrency and rate limits set; downstream capacity confirmed.
  • [ ] Data quality gates configured; thresholds agreed.
  • [ ] Alerting in place; on-call aware.
  • [ ] Dry run passed on a small sample.
  • [ ] Pause/resume and rollback plan documented.

Common mistakes and self-check

  • Mistake: Appending during backfill, creating duplicates. Fix: Use partition-scoped overwrite or MERGE on stable keys.
  • Mistake: Overloading systems with high concurrency. Fix: Start low (1–3), then carefully increase.
  • Mistake: Off-by-one partition selection. Fix: Write start/end inclusive rules and list partitions explicitly.
  • Mistake: Ignoring dependencies. Fix: Ensure upstream partitions are ready; run oldest-first.
  • Mistake: No DQ checks. Fix: Add counts, uniqueness, null-rate, and freshness checks.

Self-check mini list
  • Can I re-run any partition without side effects?
  • Do I know exactly which partitions will run and in what order?
  • What signal will tell me to pause?

Practical projects

  • Build a reusable backfill runner that takes: dataset name, partition list, concurrency limit, and DQ checks.
  • Create a watermark table to track last successful partition per dataset, with timestamps and row counts.
  • Produce a backfill report summarizing partitions, durations, success/failure, and validation results.
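
A starting point for the watermark-table project above, as a hedged DDL sketch; all names are assumptions:

    WATERMARK_DDL = """
    CREATE TABLE IF NOT EXISTS backfill_watermarks (
        dataset        VARCHAR   NOT NULL,
        last_partition VARCHAR   NOT NULL,  -- e.g. '2025-04-14'
        row_count      BIGINT,
        updated_at     TIMESTAMP NOT NULL,
        PRIMARY KEY (dataset)
    )
    """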

Who this is for

  • Data Platform Engineers who operate orchestrators and data warehouses.
  • Data Engineers maintaining pipelines and analytics-ready tables.
  • Analytics Engineers responsible for reliable derived datasets.

Prerequisites

  • Basic scheduling/orchestration knowledge (DAGs, tasks, retries).
  • Familiarity with SQL (MERGE/UPSERT, partition filters).
  • Understanding of partitioning strategies and data quality checks.

Learning path

  • Before: Basics of orchestration, retries, and dependencies.
  • This lesson: Scheduling backfills for partitioned datasets safely.
  • After: Advanced dependency management, late data strategies, and cost-aware throttling.

Next steps

  • Finish Exercises 1–2 and run the Quick Test.
  • Draft a backfill run sheet for a real table you own.
  • Propose concurrency limits and DQ checks to your team for review.

Mini challenge

Write a short plan (5–8 bullets) to backfill the last 14 days for a daily table while production runs continue, including concurrency, idempotency, checks, and when to pause.

Practice Exercises

2 exercises to complete

Instructions

Dataset: daily events table with unique key (event_id, event_date). You must backfill the last 30 days. Source API has request limits; warehouse capacity is shared with other teams.

Tasks:

  • Choose a concurrency limit and run order, with reasoning.
  • Define an idempotent write strategy (e.g., MERGE or delete+insert per partition).
  • List at least 3 post-run data quality checks and concrete thresholds.
  • Describe monitoring signals that trigger a pause, and a resume plan.

Answer format:

  • Concurrency & order:
  • Idempotency method:
  • DQ checks:
  • Pause/resume plan:

Expected Output

A concise plan proposing low, controlled concurrency (e.g., 2–3), oldest-first ordering, idempotent MERGE or partition overwrite, specific DQ checks (counts within tolerance, duplicates=0, null-rate below a threshold), and clear pause triggers (error spikes, queue buildup, DQ failures).

Scheduling Backfills for Partitioned Datasets — Quick Test

Test your knowledge with 7 questions. Pass with 70% or higher.
