Who this is for
ETL Developers, Data Engineers, and Analysts who design or maintain pipelines and need to choose between loading entire datasets and loading only the changes.
Prerequisites
- Basic SQL: SELECT, JOIN, INSERT, UPDATE
- Familiarity with ETL/ELT concepts
- Understanding of primary keys and timestamps
Why this matters
Choosing between a full reload and an incremental load determines the cost, speed, and reliability of your pipelines. In real projects you will:
- Load product catalogs nightly without downtime
- Refresh large fact tables hourly without missing late records
- Keep dimensions accurate using slowly changing dimension (SCD) patterns
- Recover gracefully from failures with idempotent, restartable jobs
Concept explained simply
Full reload: Drop or truncate the target and reload the entire dataset. Simple, consistent, but expensive and slow at scale.
Incremental load: Load only what changed since the last run (the delta). Fast, cost-efficient, but needs careful change detection and deduplication.
Mental model
Think of a bookshelf:
- Full reload: Replace the whole shelf with a fresh copy
- Incremental: Only add/replace the books that changed
Deep dive: How to detect changes
- Watermark: Use a column like last_updated to select rows greater than the last processed value
- CDC logs: Read change events (insert/update/delete) from the source
- Hash-diff: Compute a hash of business columns to detect actual content change
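A hash-diff is usually computed over the business columns in staging and compared with the hash stored on the target row. A minimal sketch, assuming a stg_customer staging table and an engine that provides MD5 and CONCAT_WS (the function, table, and column names here are illustrative and vary by platform):
-- Compute a content hash over the business columns in staging
SELECT
  customer_id,
  MD5(CONCAT_WS('|',
      COALESCE(first_name, ''),
      COALESCE(last_name, ''),
      COALESCE(email, ''))) AS hash_diff
FROM stg_customer;
A row whose hash differs from the target's stored hash has genuinely changed content, even if its timestamp was touched without a real change.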
How to choose between full and incremental
- Data volume and SLA: Big tables + tight windows favor incremental
- Source capabilities: If you have CDC or reliable timestamps, incremental is easier
- Cost and compute: Full reload can be costly at scale
- Schema volatility: Frequent schema changes may favor a simpler full reload until stable
- Late-arriving data: If common, plan incremental with a sliding window and upserts
- Deletions: Full reload handles them naturally; incremental needs explicit delete logic
- Data quality: Full reload guarantees a consistent snapshot; incremental needs idempotency checks
Fast decision checklist
- Is the table small (under a few million rows) and the load window generous? Consider full reload
- Do you have last_updated or CDC? Choose incremental
- Need near real-time data? Incremental
- Do deletes matter and are tracked? Incremental with delete handling
- Frequent schema changes with limited engineering time? Temporary full reload
Worked examples
Example 1: Small dimension (countries)
Context: 250 rows, updates rare, nightly refresh.
Choice: Full reload via TRUNCATE + INSERT SELECT.
Why: Simplicity outweighs negligible cost; guarantees consistency.
Example 2: Large sales fact (2B rows)
Context: Hourly updates, strict SLA, late-arriving events possible.
Choice: Incremental with watermark and partitioned upsert.
Why: A full reload cannot finish within the window; the incremental job loads the last hour plus a 24h sliding window to catch late data.
Example 3: Customer SCD Type 2
Context: Need history of attribute changes.
Choice: Incremental upsert creating new version rows when hash-diff changes.
Why: History must be maintained; a full reload would be wasteful and would make versioning every run complex.
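For reference, an SCD Type 2 load is often done in two steps: close the current version when the content hash changes, then insert a new current version. A minimal sketch, assuming a dim_customer table with hash_diff, valid_from, valid_to, and a boolean is_current flag (these names are illustrative):
-- Step 1: close current versions whose content changed
UPDATE dim_customer d
SET valid_to = CURRENT_DATE,
    is_current = FALSE
WHERE d.is_current = TRUE
  AND EXISTS (
    SELECT 1
    FROM stg_customer_changes s
    WHERE s.customer_id = d.customer_id
      AND s.hash_diff <> d.hash_diff);
-- Step 2: insert a new current version for changed or brand-new customers
INSERT INTO dim_customer (customer_id, first_name, last_name, email,
                          hash_diff, valid_from, valid_to, is_current)
SELECT s.customer_id, s.first_name, s.last_name, s.email,
       s.hash_diff, CURRENT_DATE, NULL, TRUE
FROM stg_customer_changes s
LEFT JOIN dim_customer d
  ON d.customer_id = s.customer_id AND d.is_current = TRUE
WHERE d.customer_id IS NULL OR d.hash_diff <> s.hash_diff;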
Example 4: Product catalog with deletes
Context: Products can be discontinued; source provides CDC delete events.
Choice: Incremental with CDC; issue soft delete (is_active=false) or hard delete in target.
Why: Reflect deletes without reloading everything.
Implementation patterns
Full reload patterns
- Truncate-Load: TRUNCATE target then INSERT SELECT from source
- Swap Tables: Load into a temp table, then do an atomic rename/swap to minimize downtime (see the sketch after this list)
- Partition Overwrite: Overwrite all partitions (or selected partitions)
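The table-swap pattern keeps the live table queryable while the fresh copy is built. A minimal sketch, assuming the engine supports CREATE TABLE AS and fast renames (exact syntax and atomicity guarantees vary by platform; table names are illustrative):
-- Build the new copy next to the live table
CREATE TABLE dim_country_new AS
SELECT id AS country_id, name AS country_name, iso AS iso_code
FROM stg_country;
-- Swap: retire the old table and promote the new one
ALTER TABLE dim_country RENAME TO dim_country_old;
ALTER TABLE dim_country_new RENAME TO dim_country;
DROP TABLE dim_country_old;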
Incremental patterns
- Watermark: last_updated > last_checkpoint
- CDC Stream: Apply inserts, updates, deletes in order
- Hash-Diff: Compare hash of business columns to detect changes
- Sliding Window: Reprocess a recent time window to catch late data
Idempotency and dedup
- Use MERGE/UPSERT for deterministic outcomes
- Deduplicate staging by primary key and latest timestamp (see the sketch after this list)
- Store checkpoints (e.g., max processed timestamp)
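Deduplication in staging typically keeps only the latest row per key before the MERGE runs. A minimal sketch using a window function, assuming a raw staging table stg_customer_raw that may hold several versions of the same customer_id (names are illustrative):
-- Keep only the most recent row per key before merging
CREATE TABLE stg_customer_changes AS
SELECT customer_id, first_name, last_name, email, hash_diff, last_updated
FROM (
  SELECT s.*,
         ROW_NUMBER() OVER (PARTITION BY customer_id
                            ORDER BY last_updated DESC) AS rn
  FROM stg_customer_raw s
) ranked
WHERE rn = 1;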
SQL snippets
Full reload (truncate-load)
-- Target: dim_country
TRUNCATE TABLE dim_country;
INSERT INTO dim_country (country_id, country_name, iso_code)
SELECT id, name, iso FROM stg_country;
Incremental (watermark + merge)
-- Load changed rows to staging_incremental first
-- SELECT * FROM source WHERE last_updated > :watermark
MERGE INTO tgt_customer t
USING stg_customer_changes s
ON t.customer_id = s.customer_id
WHEN MATCHED AND t.hash_diff <> s.hash_diff THEN
UPDATE SET
t.first_name = s.first_name,
t.last_name = s.last_name,
t.email = s.email,
t.hash_diff = s.hash_diff,
t.last_updated = s.last_updated
WHEN NOT MATCHED THEN
INSERT (customer_id, first_name, last_name, email, hash_diff, last_updated)
VALUES (s.customer_id, s.first_name, s.last_name, s.email, s.hash_diff, s.last_updated);
Handling deletes (CDC)
-- Assume stg_customer_cdc has op in ('I','U','D')
MERGE INTO tgt_customer t
USING stg_customer_cdc s
ON t.customer_id = s.customer_id
WHEN MATCHED AND s.op = 'D' THEN DELETE
WHEN MATCHED AND s.op = 'U' THEN UPDATE SET ...
WHEN NOT MATCHED AND s.op = 'I' THEN INSERT (...)
VALUES (...);
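If the target must keep deleted rows for auditing, the delete branch can flip a flag instead, as in Example 4. A minimal variant, assuming the target has an is_active column (an assumption; the earlier examples do not define it):
-- Soft delete: mark the row inactive instead of removing it
MERGE INTO tgt_customer t
USING stg_customer_cdc s
ON t.customer_id = s.customer_id
WHEN MATCHED AND s.op = 'D' THEN
  UPDATE SET t.is_active = FALSE, t.last_updated = s.last_updated;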
Sliding window reprocessing
-- Example: reload last 2 days of partitions
DELETE FROM fact_sales WHERE sale_date >= current_date - INTERVAL '2' DAY;
INSERT INTO fact_sales
SELECT * FROM stg_sales
WHERE sale_date >= current_date - INTERVAL '2' DAY;
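On engines that support partition overwrite (for example Spark SQL with dynamic partition overwrite enabled), the same window can be rewritten per partition instead of DELETE plus INSERT. A sketch, assuming fact_sales is partitioned by sale_date; syntax and overwrite behavior vary by engine:
-- Overwrite only the partitions present in the selected window
-- (needs dynamic partition overwrite; otherwise the whole table is replaced)
INSERT OVERWRITE TABLE fact_sales
SELECT * FROM stg_sales
WHERE sale_date >= date_sub(current_date(), 2);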
Exercises
These exercises mirror the tasks below. After completing them, open the Quick Test to check your understanding. Anyone can take the test; only logged-in users will have their progress saved.
Exercise 1: Choose the right strategy
For each scenario, decide Full reload or Incremental and justify in one sentence:
- A) 50k-row reference table, daily, no deletes
- B) 1.2B-row event table, hourly, late events common
- C) 30M-row orders, nightly, source provides CDC with deletes
Hints
- Think data volume vs load window
- Do you have reliable change markers?
- Do deletes need to be reflected?
Exercise 2: Write an incremental MERGE
Write a SQL MERGE to upsert from staging to target using a watermark on last_updated. Columns: id (PK), status, amount, last_updated, hash_diff. Update only when hash_diff changed.
Hints
- Use WHEN MATCHED THEN UPDATE with hash guard
- Use WHEN NOT MATCHED THEN INSERT
- Assume staging already filtered by last_updated > :watermark
Self-check checklist
- I can state two pros and two cons for full reload
- I can state two pros and two cons for incremental
- I can explain watermark vs CDC
- I can write a MERGE that is idempotent
- I know how to handle deletes and late-arriving data
Common mistakes and how to self-check
- No dedup in staging: Fix by selecting latest record per key before MERGE
- Trusting timestamps that are not updated on change: Use CDC or hash-diff
- Ignoring deletes: Incorporate delete logic (soft/hard) in incremental
- Too-narrow window for late data: Add sliding window reprocessing
- Non-idempotent loads: Use MERGE or overwrite-by-partition to allow safe retries
Quick self-audit
- Can your job re-run without double-counting? If not, add idempotency
- Do you store and validate the checkpoint? If not, persist and verify it
- Are you monitoring row counts vs expected? If not, add simple audits
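A simple audit can compare what the run loaded against what staging said should arrive. A minimal sketch, assuming an etl_audit table exists to record the counts (table and column names are illustrative):
-- Record staging vs target row counts for the processed window
INSERT INTO etl_audit (run_ts, table_name, staged_rows, loaded_rows)
SELECT
  CURRENT_TIMESTAMP,
  'fact_sales',
  (SELECT COUNT(*) FROM stg_sales
   WHERE sale_date >= current_date - INTERVAL '2' DAY),
  (SELECT COUNT(*) FROM fact_sales
   WHERE sale_date >= current_date - INTERVAL '2' DAY);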
Practical projects
- Build a pipeline that fully reloads two small dimensions nightly using swap tables
- Create an incremental load for a large fact with watermark, MERGE, and a 48h sliding window
- Implement CDC-driven upsert+delete for a customer table, including a soft delete flag and audit counts
Learning path
- Master full reload patterns (truncate-load, swap)
- Learn watermark-based incremental with MERGE
- Add CDC and delete handling
- Implement sliding windows and partition strategies
- Introduce audits, checkpoints, and monitoring
Mini challenge
You have a 500M-row orders table updated throughout the day. The source has reliable last_updated and occasional deletes via CDC. Design a job that:
- Runs hourly
- Captures updates/inserts
- Handles deletes
- Corrects late-arriving rows within 24h
Reveal a strong outline
- Stage rows filtered by last_updated > :watermark
- Deduplicate by id, keeping the row with max(last_updated)
- MERGE into the target: update when hash_diff changed, insert when missing
- Apply delete events
- Reprocess the last 24h of partitions each run
- Persist the next watermark as max(last_updated) from the processed records
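Persisting the watermark at the end of the run is what lets the next run pick up where this one stopped. A minimal sketch of that last step in the outline, assuming a one-row-per-table etl_checkpoint table and a deduplicated staging table stg_orders_changes (both names are illustrative):
-- Advance the checkpoint only after the MERGE has committed
UPDATE etl_checkpoint
SET last_watermark = (SELECT MAX(last_updated) FROM stg_orders_changes)
WHERE table_name = 'tgt_orders';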
Next steps
- Do the Quick Test below to validate your understanding
- Revisit your last pipeline and label each table as full or incremental with justification
- If you’re logged in, your test results and progress will be saved automatically; otherwise, you can still take the test for free