Why ETL Design Patterns matter for ETL Developers
ETL design patterns are proven, reusable ways to ingest, transform, and serve data reliably. As an ETL Developer, you can use these patterns to ship pipelines that are predictable, testable, and easy to operate. They reduce rework, control costs, and protect downstream analytics from broken or late data.
- Build robust staging layers that isolate ingestion from transformation.
- Use incremental loads and watermarks to move only what changed.
- Apply merge/upsert patterns to keep targets synchronized.
- Deduplicate records and handle late-arriving data safely.
- Model slowly changing dimensions (SCD) for accurate history.
- Manage reference data and lookups consistently across pipelines.
Who this is for
- ETL/ELT Developers building batch pipelines.
- Data Engineers maintaining data lakes/warehouses.
- Analytics Engineers who own dimension/fact modeling.
- Platform Engineers standardizing ingestion frameworks.
Prerequisites
- Comfortable with SQL (joins, window functions, basic DDL/DML).
- Basic data modeling: facts, dimensions, keys, and grain.
- Familiarity with batch processing and scheduling concepts.
- Knowing your target platform syntax (e.g., MERGE support) helps.
Learning path (practical roadmap)
- Staging Layer Design
Mini tasks
- Create landing and staging tables with minimal transformations.
- Add metadata columns: source_file, ingest_ts, batch_id.
- Make your staging loads idempotent (safe to rerun).
- Incremental Loads and Watermarks
Mini tasks
- Pick a reliable change column (e.g., updated_at or a monotonically increasing ID) to drive the high watermark.
- Implement filters using last successful watermark.
- Store and update the watermark atomically after success.
- Full Reload vs Incremental Strategy
Mini tasks
- Document when to use each (volume, SLA, change detection).
- Implement a full reload toggle for recovery/backfill.
- Upserts and Merge Patterns
Mini tasks
- Create a MERGE for a dimension table (Type 1 and Type 2 variants).
- Tune join keys and indexes to speed up merges.
- Deduplication Strategies
Mini tasks
- Use window functions to keep the latest record per business key.
- Design a deterministic tiebreak rule (e.g., updated_at DESC).
- Handling Late Arriving Data
Mini tasks
- Separate event_time from processed_time.
- Implement a reprocessing window (e.g., last 7 days of partitions).
- Use unknown dimension rows to preserve referential integrity.
- Slowly Changing Dimensions (SCD) Basics
Mini tasks
- Implement SCD Type 1 (overwrite) and Type 2 (history tracking).
- Set effective_from, effective_to, and current_flag correctly.
- Managing Reference Data and Lookups
Mini tasks
- Centralize common code-to-name mappings as conformed dimensions.
- Validate join coverage (audit for missing codes).
- Testing and Observability
Mini tasks
- Add row count checks before and after transforms.
- Track watermark movement and late-data rates over time.
Worked examples
Example 1: Staging pattern with metadata
-- Landing to staging (append-only, idempotent by batch_id)
INSERT INTO stg_orders (order_id, status, amount, updated_at, source_file, ingest_ts, batch_id)
SELECT order_id,
       status,
       CAST(amount AS DECIMAL(12,2)) AS amount,
       updated_at,
       :source_file AS source_file,
       CURRENT_TIMESTAMP AS ingest_ts,
       :batch_id AS batch_id
FROM landing_orders
WHERE batch_id = :batch_id;
Key idea: keep raw structure but normalize types; include source_file, ingest_ts, batch_id to audit and replay safely.
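The append is only rerun-safe if a failed or repeated batch is cleared first. A minimal sketch, assuming batch_id is carried into stg_orders as shown above: delete the batch, then re-run the same INSERT.
-- Idempotency guard: remove any rows left by a previous attempt of this batch
DELETE FROM stg_orders
WHERE batch_id = :batch_id;
-- ...then re-run the INSERT ... SELECT above with the same :batch_id
On platforms with partitioning, overwriting a batch-keyed partition achieves the same effect.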
Example 2: Incremental load with watermark
-- last_watermark is stored externally and passed as :last_wm
WITH new_rows AS (
    SELECT *
    FROM stg_orders
    WHERE updated_at > :last_wm
)
INSERT INTO ods_orders
SELECT * FROM new_rows;
-- After success: update stored watermark to MAX(updated_at) from new_rows
Use strictly greater than on a reliable change column. Advance the watermark only after a successful load.
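In practice the watermark lives in a small control table; a minimal sketch, assuming a table named etl_watermarks keyed by pipeline name (both names are illustrative):
-- One row per pipeline; the load reads last_watermark and passes it as :last_wm
CREATE TABLE etl_watermarks (
    pipeline_name   VARCHAR(100) PRIMARY KEY,
    last_watermark  TIMESTAMP NOT NULL,
    updated_at      TIMESTAMP NOT NULL
);
-- After the load succeeds, advance the watermark; keep the old value if nothing new arrived
UPDATE etl_watermarks
SET last_watermark = COALESCE(
        (SELECT MAX(updated_at) FROM stg_orders WHERE updated_at > :last_wm),
        :last_wm),
    updated_at = CURRENT_TIMESTAMP
WHERE pipeline_name = 'orders_incremental';
Running the UPDATE in the same transaction as the load, where the platform allows it, is what makes the advance atomic.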
Example 3: Upsert (MERGE) for Type 1 dimension
MERGE INTO dim_customer AS t
USING src_customer AS s
    ON t.customer_nk = s.customer_nk
WHEN MATCHED AND (
    COALESCE(t.email, '') <> COALESCE(s.email, '') OR
    COALESCE(t.tier, '')  <> COALESCE(s.tier, '')
) THEN UPDATE SET
    email      = s.email,
    tier       = s.tier,
    updated_at = CURRENT_TIMESTAMP
WHEN NOT MATCHED THEN INSERT (
    customer_nk, email, tier, created_at, updated_at
) VALUES (
    s.customer_nk, s.email, s.tier, CURRENT_TIMESTAMP, CURRENT_TIMESTAMP
);
Type 1 overwrites changed attributes in place.
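Most engines reject a MERGE when more than one source row matches the same target row, so collapse the source to one row per key first. A minimal sketch, assuming src_customer carries an updated_at change column (an assumption, not shown above):
-- Collapse the source to a single, latest row per natural key before merging
WITH src_dedup AS (
    SELECT customer_nk, email, tier,
           ROW_NUMBER() OVER (
               PARTITION BY customer_nk
               ORDER BY updated_at DESC
           ) AS rn
    FROM src_customer
)
SELECT customer_nk, email, tier
FROM src_dedup
WHERE rn = 1;
-- Materialize this result (temp table or view) and use it as the USING source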
Example 4: Dedup keep latest per key
WITH ranked AS (
    SELECT *,
           ROW_NUMBER() OVER (
               PARTITION BY order_id
               ORDER BY updated_at DESC, ingest_ts DESC
           ) AS rn
    FROM stg_orders
)
SELECT *
FROM ranked
WHERE rn = 1;
Deterministic ordering ensures consistent dedup. Adjust tiebreakers to your source semantics.
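The deduplicated set usually feeds the next layer rather than being queried ad hoc. A minimal sketch, assuming cleansed_orders carries the same business columns as stg_orders:
-- Materialize the latest version of each order into the cleansed layer used by Example 5
INSERT INTO cleansed_orders (order_id, status, amount, updated_at)
SELECT order_id, status, amount, updated_at
FROM (
    SELECT order_id, status, amount, updated_at,
           ROW_NUMBER() OVER (
               PARTITION BY order_id
               ORDER BY updated_at DESC, ingest_ts DESC
           ) AS rn
    FROM stg_orders
) ranked
WHERE rn = 1;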
Example 5: Handling late-arriving facts
-- Reprocess last 7 days of event_time partitions to catch late data
-- Clear the window first so reruns do not double-count
DELETE FROM fact_orders
WHERE event_date BETWEEN DATEADD(day, -7, CURRENT_DATE) AND CURRENT_DATE;
INSERT INTO fact_orders (...)
SELECT ...
FROM cleansed_orders
WHERE event_date BETWEEN DATEADD(day, -7, CURRENT_DATE) AND CURRENT_DATE;
Partition by event_time, not processed_time. Keep a rolling backfill window aligned to your lateness profile.
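If a fact can arrive before its dimension row, one way to preserve referential integrity (as the late-arriving-data tasks above suggest) is to map missing keys to an unknown dimension member and repair them later. A minimal sketch, assuming cleansed_orders carries the natural key customer_nk and dim_customer reserves surrogate_key -1 for the unknown member:
-- Resolve the customer surrogate key, falling back to the unknown member (-1)
SELECT o.order_id,
       o.amount,
       o.event_date,
       COALESCE(c.surrogate_key, -1) AS customer_sk
FROM cleansed_orders o
LEFT JOIN dim_customer c
       ON c.customer_nk = o.customer_nk
      AND c.current_flag = 1;
-- A later run can re-resolve rows where customer_sk = -1 once the dimension has arrived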
Example 6: SCD Type 2 (expire old, insert new)
-- Assume dim_customer fields: surrogate_key, customer_nk, email, tier,
-- effective_from, effective_to, current_flag
-- Step 1: close the current row when tracked attributes changed; insert brand-new keys
MERGE INTO dim_customer t
USING src_customer s
    ON t.customer_nk = s.customer_nk AND t.current_flag = 1
WHEN MATCHED AND (
    COALESCE(t.email, '') <> COALESCE(s.email, '') OR
    COALESCE(t.tier, '')  <> COALESCE(s.tier, '')
) THEN UPDATE SET
    t.effective_to = CURRENT_DATE - 1,
    t.current_flag = 0
WHEN NOT MATCHED BY TARGET THEN INSERT (
    customer_nk, email, tier, effective_from, effective_to, current_flag
) VALUES (
    s.customer_nk, s.email, s.tier, CURRENT_DATE, DATE '9999-12-31', 1
);
This MERGE is step one: it closes the prior current row for changed keys and inserts brand-new keys as current. Changed keys still need their new current row, which the second step below opens with a far-future effective_to.
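A minimal sketch of that second step, run right after the MERGE (ideally in the same transaction): open a fresh current row for every source key that no longer has one.
-- Step 2: open the new current version for keys whose prior row was just expired
INSERT INTO dim_customer (customer_nk, email, tier, effective_from, effective_to, current_flag)
SELECT s.customer_nk, s.email, s.tier, CURRENT_DATE, DATE '9999-12-31', 1
FROM src_customer s
WHERE NOT EXISTS (
    SELECT 1
    FROM dim_customer d
    WHERE d.customer_nk = s.customer_nk
      AND d.current_flag = 1
);
Example 7: Reference data lookup coverage check
Lookups stay trustworthy only if every incoming code resolves. A minimal sketch of the join-coverage audit mentioned in the roadmap, assuming a conformed lookup table named ref_order_status keyed by status_code (an illustrative name):
-- Incoming status codes with no match in the reference table
SELECT o.status, COUNT(*) AS missing_rows
FROM stg_orders o
LEFT JOIN ref_order_status r
       ON r.status_code = o.status
WHERE r.status_code IS NULL
GROUP BY o.status;
A non-empty result can fail the run or route rows to a quarantine table, depending on your policy.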
Drills and exercises
- Create a staging table for a new source and load two batches idempotently.
- Implement a watermark-based incremental load and simulate a rerun.
- Write a MERGE that updates Type 1 attributes and add a NOT MATCHED insert.
- Deduplicate a dataset with a window function and explain your tiebreakers.
- Design a 7-day backfill job for late data and define success checks.
- Model a small Type 2 dimension with effective dates and a current_flag.
- Create a reference lookup table and validate all codes are covered.
Common mistakes and debugging tips
- Using processed_time for business logic. Tip: base partitions and metrics on event_time; keep processed_time for ops.
- Watermark gaps or overlaps. Tip: filter with a strict greater-than; track the last and new max; log counts for each run (see the audit sketch after this list).
- Non-deterministic dedup. Tip: define explicit ORDER BY with stable tiebreakers like updated_at, ingest_ts, batch_id.
- MERGE keys not indexed. Tip: index/join on natural keys used in ON; pre-aggregate source to one row per key.
- Type 2 without closing old rows. Tip: set effective_to and current_flag correctly in one atomic operation.
- Late-arriving facts breaking FKs. Tip: use unknown dimension rows (e.g., -1) or delay fact load until dims are ready.
- Overusing full reloads. Tip: reserve for small tables, backfills, or repair; prefer incremental with checks.
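Several of these tips (count logging, watermark tracking, late-data rates) are easiest to act on with a per-run audit table. A minimal sketch, assuming an illustrative table named etl_run_audit that every pipeline writes once per run:
-- One row per pipeline run; backs count checks, watermark tracking, and late-data rates
CREATE TABLE etl_run_audit (
    batch_id          VARCHAR(50),
    pipeline_name     VARCHAR(100),
    run_ts            TIMESTAMP,
    rows_read         BIGINT,
    rows_written      BIGINT,
    watermark_before  TIMESTAMP,
    watermark_after   TIMESTAMP,
    late_rows         BIGINT
);
-- Example: record an incremental orders load after it commits
INSERT INTO etl_run_audit
VALUES (:batch_id, 'orders_incremental', CURRENT_TIMESTAMP,
        :rows_read, :rows_written, :last_wm, :new_wm, :late_rows);
Comparing rows_read with rows_written gives the before/after count check from the roadmap; trending late_rows shows whether the 7-day reprocessing window is still wide enough.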
Practical projects
- Event ingestion with incremental loads: ingest app events daily with watermark, dedup by (user_id, event_id).
- Customer dimension with SCD2: track tier and region; validate history with point-in-time queries.
- Order fact build: integrate late-arriving corrections via a 7-day backfill window.
Mini project: Orders to Dim–Fact mart
Goal: Build a small pipeline that ingests orders and customers, deduplicates, maintains a Type 2 customer dimension, and loads a daily fact_orders table.
- Sources: customers.csv (with updates), orders.json (can arrive late).
- Stages: landing_*, stg_*, cleansed_*.
- Patterns: watermark, dedup, MERGE (Type 2), late-data backfill.
Acceptance criteria
- Idempotent staging with batch_id and ingest_ts.
- Watermark advances only on success; no double-counting.
- dim_customer stores history with effective_from/to and current_flag.
- fact_orders uses event_date partitions and reprocesses last 7 days.
- Row counts and late-data rate logged per run.
Edge cases to test
- Two updates for the same customer in one batch.
- Duplicate orders across files with different timestamps.
- Orders referencing new customers arriving one day late.
Subskills
- Staging Layer Design
- Incremental Loads And Watermarks
- Full Reload Versus Incremental Strategy
- Upserts And Merge Patterns
- Deduplication Strategies
- Handling Late Arriving Data
- Slowly Changing Dimensions Basics
- Managing Reference Data And Lookups
Next steps
- Automate: add orchestration, retries, and alerts around these patterns.
- Harden: add data quality tests (count deltas, null checks, referential checks).
- Scale: pre-aggregate before MERGE; partition by event_date; index join keys.
- Extend: learn CDC ingestion and streaming-friendly variants of these patterns.