Why this matters
Incremental refresh lets you process only the data that changed instead of reloading everything. For a BI Developer, that means faster refreshes, fewer gateway timeouts, lower compute costs, and fresher dashboards.
- Daily sales dashboard updates within minutes, not hours.
- Month-end reports refresh reliably during peak usage.
- Large fact tables (tens/hundreds of millions of rows) become manageable.
Who this is for: BI Developers and analytics engineers building refreshable models and dashboards on large datasets.
Prerequisites: Basic SQL, comfort with data types and date/time functions, familiarity with your BI/ETL tool's filters and partitions.
Concept explained simply
Incremental refresh updates only the portion of data that could have changed since the last run (for example, the last 7 days), while keeping older data stable. It relies on a watermark (like the latest date loaded) and often uses partitions to isolate time slices.
Key terms
- Watermark: A stored value (date/time or ID) that marks the last successfully processed record or time.
- Partition: A physical or logical slice of data (often by day/month). Old partitions remain untouched; recent ones get refreshed.
- Incremental window: The slice of recent time you reprocess (e.g., last 7 days) to catch late-arriving data.
- Historical range: How far back you keep data (e.g., 5 years) without reprocessing it each time.
- Upsert: Insert new rows and update changed rows.
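To make the watermark idea concrete, here is a minimal Python sketch; the rows, IDs, and dates are invented for illustration:

```python
from datetime import datetime

# Hypothetical source rows: (id, modified_at)
rows = [
    (1, datetime(2024, 1, 1)),
    (2, datetime(2024, 1, 5)),
    (3, datetime(2024, 1, 9)),
]

watermark = datetime(2024, 1, 4)  # last successfully processed time

# Incremental extraction: only rows changed after the watermark
changed = [r for r in rows if r[1] > watermark]
print([r[0] for r in changed])  # [2, 3]

# Advance the watermark only after the load succeeds
watermark = max(r[1] for r in changed)
```

Everything older than the watermark is never touched again; each run only looks at what moved past it.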
Mental model
Imagine a bookshelf of monthly folders. Most folders are sealed (historical partitions). Only the newest folder is open (incremental window) where you keep adding and correcting pages. Incremental refresh just opens the last few folders and leaves the rest alone.
Worked examples
Example 1: Power BI-style incremental refresh policy
- Choose a date/time column with reliable values (e.g., OrderDate or ModifiedDate).
- Define the historical range (e.g., 5 years) and the incremental window (e.g., 7 days).
- Enable change detection on ModifiedDate if available so late updates are captured.
Result: Partitions are generated for historical periods; only the recent partitions are refreshed each time.
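A rough Python sketch of how a tool might derive such a partition plan from a 5-year historical range and a 7-day incremental window (the function name and date math are illustrative, not any tool's actual API):

```python
from datetime import date, timedelta

def refresh_plan(today, historical_years=5, incremental_days=7):
    """Monthly partitions over the whole historical range, plus the subset
    that overlaps the incremental window and must be refreshed each run."""
    start = date(today.year - historical_years, today.month, 1)
    months = []
    y, m = start.year, start.month
    while (y, m) <= (today.year, today.month):
        months.append(f"{y:04d}-{m:02d}")
        m += 1
        if m > 12:
            y, m = y + 1, 1
    window_start = today - timedelta(days=incremental_days)
    # Zero-padded keys sort lexicographically, so string comparison works
    hot = [p for p in months
           if p >= f"{window_start.year:04d}-{window_start.month:02d}"]
    return months, hot

months, hot = refresh_plan(date(2024, 6, 15))
print(len(months), hot)  # 61 ['2024-06']
```

Of 61 monthly partitions, only the one overlapping the 7-day window gets refreshed; the other 60 stay sealed.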
Foldability caution
Ensure your filters can be pushed to the source (query folding). Avoid complex custom steps that break folding; otherwise, the source might send all data and negate the performance benefits.
Example 2: SQL watermark incremental load (generic ETL)
Goal: Load only rows with ModifiedAt > last watermark, then upsert into the warehouse.
-- 1) Read current watermark
SELECT last_loaded_at FROM etl_control WHERE table_name = 'fact_sales';
-- 2) Pull only changed/new rows
WITH src AS (
    SELECT *
    FROM staging.sales
    WHERE ModifiedAt > :last_loaded_at
)
-- 3) Upsert
MERGE warehouse.fact_sales AS tgt
USING src
ON tgt.SalesID = src.SalesID
WHEN MATCHED AND tgt.ModifiedAt < src.ModifiedAt THEN
    UPDATE SET ...
WHEN NOT MATCHED THEN
    INSERT (...)
    VALUES (...);
-- 4) Advance watermark
UPDATE etl_control
SET last_loaded_at = (SELECT COALESCE(MAX(ModifiedAt), :last_loaded_at) FROM staging.sales)
WHERE table_name = 'fact_sales';
Tip: Wrap in a transaction; only advance the watermark after a successful commit.
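The same pattern can be run end to end with Python and SQLite; the table and column names mirror the SQL above, and SQLite's ON CONFLICT upsert stands in for MERGE (a sketch, not a production loader):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE etl_control (table_name TEXT PRIMARY KEY, last_loaded_at TEXT);
INSERT INTO etl_control VALUES ('fact_sales', '2024-01-03');
CREATE TABLE staging_sales (SalesID INTEGER, Amount REAL, ModifiedAt TEXT);
INSERT INTO staging_sales VALUES (1, 10.0, '2024-01-01'), (2, 25.0, '2024-01-05');
CREATE TABLE fact_sales (SalesID INTEGER PRIMARY KEY, Amount REAL, ModifiedAt TEXT);
""")

try:
    with con:  # one transaction: commit on success, rollback on error
        (wm,) = con.execute(
            "SELECT last_loaded_at FROM etl_control WHERE table_name='fact_sales'"
        ).fetchone()
        # Upsert only rows changed since the watermark
        con.execute("""
            INSERT INTO fact_sales (SalesID, Amount, ModifiedAt)
            SELECT SalesID, Amount, ModifiedAt FROM staging_sales
            WHERE ModifiedAt > ?
            ON CONFLICT(SalesID) DO UPDATE SET
                Amount = excluded.Amount, ModifiedAt = excluded.ModifiedAt
                WHERE fact_sales.ModifiedAt < excluded.ModifiedAt
        """, (wm,))
        # Advance the watermark inside the same transaction
        con.execute("""
            UPDATE etl_control
            SET last_loaded_at =
                (SELECT COALESCE(MAX(ModifiedAt), ?) FROM staging_sales)
            WHERE table_name = 'fact_sales'
        """, (wm,))
except sqlite3.Error:
    pass  # watermark unchanged: the whole transaction rolled back

print(con.execute("SELECT SalesID FROM fact_sales").fetchall())  # [(2,)]
```

Only the row with ModifiedAt past the watermark ('2024-01-05') lands in the target, and the watermark advances in the same commit, so a mid-load failure leaves both untouched.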
Example 3: Hash-based incremental when you lack ModifiedAt
- Create a hash of relevant columns (e.g., SHA of business keys and metrics).
- Compare incoming row hashes to stored hashes; update rows whose hash differs and insert rows that are new.
- Store the hash in the target to speed future comparisons.
-- Compute hash in staging
SELECT OrderID,
       CustomerID,
       TotalAmount,
       SHA2(CONCAT_WS('|', CustomerID, OrderDate, TotalAmount), 256) AS row_hash
FROM staging.orders_increment;
-- Upsert based on hash difference
MERGE warehouse.orders t
USING staging.orders_hashed s
ON t.OrderID = s.OrderID
WHEN MATCHED AND t.row_hash <> s.row_hash THEN UPDATE SET ... , row_hash = s.row_hash
WHEN NOT MATCHED THEN INSERT (...) VALUES (...);
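The comparison logic can be sketched in Python with hashlib; the order rows and the stored-hash lookup are hypothetical stand-ins for your target table:

```python
import hashlib

def row_hash(*cols):
    """Stable SHA-256 over the business columns, joined with a separator
    that should not occur inside the values themselves."""
    payload = "|".join("" if c is None else str(c) for c in cols)
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()

# Target state: OrderID -> previously stored row hash
stored = {101: row_hash(7, "2024-01-02", 99.50)}

# Incoming increment: (OrderID, CustomerID, OrderDate, TotalAmount)
incoming = [
    (101, 7, "2024-01-02", 120.00),  # amount changed -> hash differs -> update
    (102, 8, "2024-01-03", 40.00),   # unknown key -> insert
]

to_update = [r for r in incoming
             if r[0] in stored and stored[r[0]] != row_hash(*r[1:])]
to_insert = [r for r in incoming if r[0] not in stored]
print(len(to_update), len(to_insert))  # 1 1
```

Note the separator choice matters: without it, ("ab", "c") and ("a", "bc") would hash identically.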
Example 4: Partition maintenance
Keep old partitions read-only. Rebuild or refresh only partitions for the incremental window. Consider monthly partitions for large tables and daily for hot data.
Step-by-step setup (tool-agnostic)
- Pick a trustworthy change indicator: ModifiedAt, CreatedAt, or a hash.
- Define ranges: historical retention (e.g., 24 months) and incremental window (e.g., 7 days).
- Partition strategy: daily/weekly/monthly aligned to query patterns.
- Build incremental extraction: filter source by watermark or date range.
- Upsert logic: MERGE for updates, INSERT for new rows; handle deletes if needed (soft delete flag or CDC).
- Advance watermark only after successful load.
- Monitor: track rows processed, duration, partition counts, and failures.
Handling late-arriving data
Set the incremental window wide enough to catch typical delays (e.g., refresh the last 7–14 days). For extreme lateness, schedule periodic backfills of older partitions.
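One way to size the window from evidence rather than guesswork: measure how late rows actually arrive and take a high percentile with some headroom. A sketch with invented delay data:

```python
from datetime import timedelta

# Hypothetical observed arrival delays (days late) from recent loads
delays = [0, 1, 0, 2, 0, 5, 1, 0, 9, 12]

# Size the window to roughly the 95th percentile of lateness, plus headroom,
# but never below a 7-day floor
delays_sorted = sorted(delays)
p95 = delays_sorted[int(0.95 * (len(delays_sorted) - 1))]
window = timedelta(days=max(7, p95 + 2))
print(window.days)  # 11
```

Rows later than the chosen percentile are then handled by the periodic backfill rather than the regular incremental run.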
Exercises
Do these before the quick test. You can check solutions inline.
Exercise 1: Choose the right watermark
Dataset columns: OrderID, OrderDate, PostedDate, ModifiedAt (nullable), Status. Business rule: Orders can be corrected up to 10 days after posting. Pick the best column for incremental refresh and justify.
Hints
- Does the column always increase or can it change?
- Which column actually moves when a correction happens?
Solution
Use ModifiedAt if it is reliably populated on corrections; set the incremental window to at least 10 days. If ModifiedAt is unreliable, use PostedDate with a wider incremental window and periodic backfill.
Exercise 2: Write a safe watermark update
Write pseudocode that loads changed rows and updates the watermark only after a successful commit, with a retry on transient failure.
Hints
- Use a transaction for upsert + watermark update.
- Do not advance the watermark if the load fails midway.
Solution
begin transaction
    last_wm = read_watermark('fact_sales')
    changed = select * from src where ModifiedAt > last_wm
    upsert(changed)
    new_wm = max(changed.ModifiedAt) if changed else last_wm
    update_watermark('fact_sales', new_wm)
commit
on error:
    rollback  -- watermark stays at last_wm
    retry with backoff up to N times
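The retry-with-backoff part of that solution can be sketched as runnable Python; run_load is a placeholder for your transactional upsert plus watermark update, and the flaky stand-in simulates a transient failure:

```python
import random
import time

def load_with_retry(run_load, max_attempts=3, base_delay=0.01):
    """Retry a transactional load on transient failure with exponential
    backoff plus jitter. run_load() must commit the upsert and the
    watermark update together, or raise so both roll back."""
    for attempt in range(1, max_attempts + 1):
        try:
            return run_load()
        except Exception:
            if attempt == max_attempts:
                raise  # out of attempts: surface the failure
            time.sleep(base_delay * 2 ** (attempt - 1) + random.uniform(0, 0.01))

calls = []
def flaky_load():
    calls.append(1)
    if len(calls) < 3:
        raise RuntimeError("transient failure")  # simulated
    return "loaded"

result = load_with_retry(flaky_load)
print(result, len(calls))  # loaded 3
```

Because the watermark update lives inside run_load's transaction, a failed attempt leaves the watermark where it was, and the retry simply reprocesses the same window.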
Checklist: Did you...
- Pick a change indicator that reflects real updates?
- Set an incremental window that matches business latency?
- Ensure the filter can be pushed to the source (foldability)?
- Protect the watermark with transactions?
- Log row counts and duration?
Common mistakes and self-check
- Using CreatedAt instead of ModifiedAt: Misses updates. Self-check: Do late edits appear after refresh?
- Incremental window too small: Late-arriving data is lost. Self-check: Compare source vs model counts for recent days.
- Broken query folding: Source scans everything. Self-check: Measure source-side filtered rows vs total.
- Advancing watermark on failure: Data gaps. Self-check: Simulate failure; verify watermark unchanged.
- No strategy for deletes: Stale records remain. Self-check: Confirm how deletes are represented and propagated.
Practical projects
- Implement incremental refresh for a sales fact table with 3 years history and a 14-day incremental window. Measure refresh time before/after.
- Add change detection (ModifiedAt or hash) and prove updates propagate within the incremental window.
- Automate a monthly backfill job that refreshes the previous month’s partition to catch rare late changes.
Learning path
- Start: Understand watermarks, partitions, and upserts (this lesson).
- Next: Partitioning strategies and storage formats (daily vs monthly, hot vs cold data).
- Then: Change Data Capture (CDC) and handling deletes.
- Finally: Monitoring, alerting, and cost optimization.
Mini challenge
You inherit a 400M-row events table. It has EventTime (UTC) and no ModifiedAt. Late events arrive up to 5 days after occurrence. Design an incremental approach: choose partitioning, define the incremental window, and outline how you’ll detect changes without ModifiedAt. Write 5–7 steps. Then list two metrics you will monitor to prove it works.
Next steps
- Apply incremental refresh to one production dataset this week.
- Create a dashboard that tracks refresh duration, rows processed, and partitions touched.
- Take the quick test below to confirm your understanding.