Why ETL Design Patterns matter for ETL Developers
ETL design patterns are proven, reusable ways to ingest, transform, and serve data reliably. As an ETL Developer, you can use these patterns to ship pipelines that are predictable, testable, and easy to operate. They reduce rework, control costs, and protect downstream analytics from broken or late data.
- Build robust staging layers that isolate ingestion from transformation.
- Use incremental loads and watermarks to move only what changed.
- Apply merge/upsert patterns to keep targets synchronized.
- Deduplicate records and handle late-arriving data safely.
- Model slowly changing dimensions (SCD) for accurate history.
- Manage reference data and lookups consistently across pipelines.
Who this is for
- ETL/ELT Developers building batch pipelines.
- Data Engineers maintaining data lakes/warehouses.
- Analytics Engineers who own dimension/fact modeling.
- Platform Engineers standardizing ingestion frameworks.
Prerequisites
- Comfortable with SQL (joins, window functions, basic DDL/DML).
- Basic data modeling: facts, dimensions, keys, and grain.
- Familiarity with batch processing and scheduling concepts.
- Knowing your target platform syntax (e.g., MERGE support) helps.
Learning path (practical roadmap)
- Staging Layer Design
Mini tasks
- Create landing and staging tables with minimal transformations.
- Add metadata columns: source_file, ingest_ts, batch_id.
- Make your staging loads idempotent (safe to rerun).
- Incremental Loads and Watermarks
Mini tasks
- Pick a reliable change column (e.g., updated_at or a monotonically increasing ID) to drive the high watermark.
- Implement filters using last successful watermark.
- Store and update the watermark atomically after success.
- Full Reload vs Incremental Strategy
Mini tasks
- Document when to use each (volume, SLA, change detection).
- Implement a full reload toggle for recovery/backfill.
- Upserts and Merge Patterns
Mini tasks
- Create a MERGE for a dimension table (Type 1 and Type 2 variants).
- Tune join keys and indexes to speed up merges.
- Deduplication Strategies
Mini tasks
- Use window functions to keep the latest record per business key.
- Design a deterministic tiebreak rule (e.g., updated_at DESC).
- Handling Late Arriving Data
Mini tasks
- Separate event_time from processed_time.
- Implement a reprocessing window (e.g., last 7 days of partitions).
- Use unknown dimension rows to preserve referential integrity.
- Slowly Changing Dimensions (SCD) Basics
Mini tasks
- Implement SCD Type 1 (overwrite) and Type 2 (history tracking).
- Set effective_from, effective_to, and current_flag correctly.
- Managing Reference Data and Lookups
Mini tasks
- Centralize common code-to-name mappings as conformed dimensions.
- Validate join coverage (audit for missing codes).
- Testing and Observability
Mini tasks
- Add row count checks before and after transforms.
- Track watermark movement and late-data rates over time.
Worked examples
Example 1: Staging pattern with metadata
-- Landing to staging (append-only, idempotent by batch_id)
INSERT INTO stg_orders (order_id, status, amount, updated_at, source_file, ingest_ts, batch_id)
SELECT order_id,
       status,
       CAST(amount AS DECIMAL(12,2)) AS amount,
       updated_at,
       :source_file AS source_file,
       CURRENT_TIMESTAMP AS ingest_ts,
       :batch_id AS batch_id
FROM landing_orders
WHERE batch_id = :batch_id;
Key idea: keep raw structure but normalize types; include source_file, ingest_ts, batch_id to audit and replay safely.
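The append is only rerun-safe if a failed or repeated batch is cleared first. A minimal sketch, assuming batch_id is carried into stg_orders as shown above: delete the batch, then re-run the same INSERT.
-- Idempotency guard: remove any rows left by a previous attempt of this batch
DELETE FROM stg_orders
WHERE batch_id = :batch_id;
-- ...then re-run the INSERT ... SELECT above with the same :batch_id
On platforms with partitioning, overwriting a batch-keyed partition achieves the same effect.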
Example 2: Incremental load with watermark
-- last_watermark is stored externally and passed as :last_wm
WITH new_rows AS (
    SELECT *
    FROM stg_orders
    WHERE updated_at > :last_wm
)
INSERT INTO ods_orders
SELECT * FROM new_rows;
-- After success: update stored watermark to MAX(updated_at) from new_rows
Use strictly greater than on a reliable change column. Advance the watermark only after a successful load.
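In practice the watermark lives in a small control table; a minimal sketch, assuming a table named etl_watermarks keyed by pipeline name (both names are illustrative):
-- One row per pipeline; the load reads last_watermark and passes it as :last_wm
CREATE TABLE etl_watermarks (
    pipeline_name   VARCHAR(100) PRIMARY KEY,
    last_watermark  TIMESTAMP NOT NULL,
    updated_at      TIMESTAMP NOT NULL
);
-- After the load succeeds, advance the watermark; keep the old value if nothing new arrived
UPDATE etl_watermarks
SET last_watermark = COALESCE(
        (SELECT MAX(updated_at) FROM stg_orders WHERE updated_at > :last_wm),
        :last_wm),
    updated_at = CURRENT_TIMESTAMP
WHERE pipeline_name = 'orders_incremental';
Running the UPDATE in the same transaction as the load, where the platform allows it, is what makes the advance atomic.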
Example 3: Upsert (MERGE) for Type 1 dimension
MERGE INTO dim_customer AS t
USING src_customer AS s
    ON t.customer_nk = s.customer_nk
WHEN MATCHED AND (
    COALESCE(t.email, '') <> COALESCE(s.email, '') OR
    COALESCE(t.tier, '')  <> COALESCE(s.tier, '')
) THEN UPDATE SET
    email      = s.email,
    tier       = s.tier,
    updated_at = CURRENT_TIMESTAMP
WHEN NOT MATCHED THEN INSERT (
    customer_nk, email, tier, created_at, updated_at
) VALUES (
    s.customer_nk, s.email, s.tier, CURRENT_TIMESTAMP, CURRENT_TIMESTAMP
);
Type 1 overwrites changed attributes in place.
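Most engines reject a MERGE when more than one source row matches the same target row, so collapse the source to one row per key first. A minimal sketch, assuming src_customer carries an updated_at change column (an assumption, not shown above):
-- Collapse the source to a single, latest row per natural key before merging
WITH src_dedup AS (
    SELECT customer_nk, email, tier,
           ROW_NUMBER() OVER (
               PARTITION BY customer_nk
               ORDER BY updated_at DESC
           ) AS rn
    FROM src_customer
)
SELECT customer_nk, email, tier
FROM src_dedup
WHERE rn = 1;
-- Materialize this result (temp table or view) and use it as the USING source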
Example 4: Dedup keep latest per key
WITH ranked AS (
    SELECT *,
           ROW_NUMBER() OVER (
               PARTITION BY order_id
               ORDER BY updated_at DESC, ingest_ts DESC
           ) AS rn
    FROM stg_orders
)
SELECT *
FROM ranked
WHERE rn = 1;
Deterministic ordering ensures consistent dedup. Adjust tiebreakers to your source semantics.
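The deduplicated set usually feeds the next layer rather than being queried ad hoc. A minimal sketch, assuming cleansed_orders carries the same business columns as stg_orders:
-- Materialize the latest version of each order into the cleansed layer used by Example 5
INSERT INTO cleansed_orders (order_id, status, amount, updated_at)
SELECT order_id, status, amount, updated_at
FROM (
    SELECT order_id, status, amount, updated_at,
           ROW_NUMBER() OVER (
               PARTITION BY order_id
               ORDER BY updated_at DESC, ingest_ts DESC
           ) AS rn
    FROM stg_orders
) ranked
WHERE rn = 1;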
Example 5: Handling late-arriving facts
-- Reprocess last 7 days of event_time partitions to catch late data
-- Clear the window first so reruns do not double-count
DELETE FROM fact_orders
WHERE event_date BETWEEN DATEADD(day, -7, CURRENT_DATE) AND CURRENT_DATE;
INSERT INTO fact_orders (...)
SELECT ...
FROM cleansed_orders
WHERE event_date BETWEEN DATEADD(day, -7, CURRENT_DATE) AND CURRENT_DATE;
Partition by event_time, not processed_time. Keep a rolling backfill window aligned to your lateness profile.
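If a fact can arrive before its dimension row, one way to preserve referential integrity (as the late-arriving-data tasks above suggest) is to map missing keys to an unknown dimension member and repair them later. A minimal sketch, assuming cleansed_orders carries the natural key customer_nk and dim_customer reserves surrogate_key -1 for the unknown member:
-- Resolve the customer surrogate key, falling back to the unknown member (-1)
SELECT o.order_id,
       o.amount,
       o.event_date,
       COALESCE(c.surrogate_key, -1) AS customer_sk
FROM cleansed_orders o
LEFT JOIN dim_customer c
       ON c.customer_nk = o.customer_nk
      AND c.current_flag = 1;
-- A later run can re-resolve rows where customer_sk = -1 once the dimension has arrived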
Example 6: SCD Type 2 (expire old, insert new)
-- Assume dim_customer fields: surrogate_key, customer_nk, email, tier,
-- effective_from, effective_to, current_flag
-- Step 1: close the current row when tracked attributes changed; insert brand-new keys
MERGE INTO dim_customer t
USING src_customer s
    ON t.customer_nk = s.customer_nk AND t.current_flag = 1
WHEN MATCHED AND (
    COALESCE(t.email, '') <> COALESCE(s.email, '') OR
    COALESCE(t.tier, '')  <> COALESCE(s.tier, '')
) THEN UPDATE SET
    t.effective_to = CURRENT_DATE - 1,
    t.current_flag = 0
WHEN NOT MATCHED BY TARGET THEN INSERT (
    customer_nk, email, tier, effective_from, effective_to, current_flag
) VALUES (
    s.customer_nk, s.email, s.tier, CURRENT_DATE, DATE '9999-12-31', 1
);
This MERGE is step one: it closes the prior current row for changed keys and inserts brand-new keys as current. Changed keys still need their new current row, which the second step below opens with a far-future effective_to.
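A minimal sketch of that second step, run right after the MERGE (ideally in the same transaction): open a fresh current row for every source key that no longer has one.
-- Step 2: open the new current version for keys whose prior row was just expired
INSERT INTO dim_customer (customer_nk, email, tier, effective_from, effective_to, current_flag)
SELECT s.customer_nk, s.email, s.tier, CURRENT_DATE, DATE '9999-12-31', 1
FROM src_customer s
WHERE NOT EXISTS (
    SELECT 1
    FROM dim_customer d
    WHERE d.customer_nk = s.customer_nk
      AND d.current_flag = 1
);
Example 7: Reference data lookup coverage check
Lookups stay trustworthy only if every incoming code resolves. A minimal sketch of the join-coverage audit mentioned in the roadmap, assuming a conformed lookup table named ref_order_status keyed by status_code (an illustrative name):
-- Incoming status codes with no match in the reference table
SELECT o.status, COUNT(*) AS missing_rows
FROM stg_orders o
LEFT JOIN ref_order_status r
       ON r.status_code = o.status
WHERE r.status_code IS NULL
GROUP BY o.status;
A non-empty result can fail the run or route rows to a quarantine table, depending on your policy.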
Drills and exercises
- Create a staging table for a new source and load two batches idempotently.
- Implement a watermark-based incremental load and simulate a rerun.
- Write a MERGE that updates Type 1 attributes and add a NOT MATCHED insert.
- Deduplicate a dataset with a window function and explain your tiebreakers.
- Design a 7-day backfill job for late data and define success checks.
- Model a small Type 2 dimension with effective dates and a current_flag.
- Create a reference lookup table and validate all codes are covered.
Common mistakes and debugging tips
- Using processed_time for business logic. Tip: base partitions and metrics on event_time; keep processed_time for ops.
- Watermark gaps or overlaps. Tip: filter with a strict greater-than; track the last and new max; log counts for each run (see the audit sketch after this list).
- Non-deterministic dedup. Tip: define explicit ORDER BY with stable tiebreakers like updated_at, ingest_ts, batch_id.
- MERGE keys not indexed. Tip: index/join on natural keys used in ON; pre-aggregate source to one row per key.
- Type 2 without closing old rows. Tip: set effective_to and current_flag correctly in one atomic operation.
- Late-arriving facts breaking FKs. Tip: use unknown dimension rows (e.g., -1) or delay fact load until dims are ready.
- Overusing full reloads. Tip: reserve for small tables, backfills, or repair; prefer incremental with checks.
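Several of these tips (count logging, watermark tracking, late-data rates) are easiest to act on with a per-run audit table. A minimal sketch, assuming an illustrative table named etl_run_audit that every pipeline writes once per run:
-- One row per pipeline run; backs count checks, watermark tracking, and late-data rates
CREATE TABLE etl_run_audit (
    batch_id          VARCHAR(50),
    pipeline_name     VARCHAR(100),
    run_ts            TIMESTAMP,
    rows_read         BIGINT,
    rows_written      BIGINT,
    watermark_before  TIMESTAMP,
    watermark_after   TIMESTAMP,
    late_rows         BIGINT
);
-- Example: record an incremental orders load after it commits
INSERT INTO etl_run_audit
VALUES (:batch_id, 'orders_incremental', CURRENT_TIMESTAMP,
        :rows_read, :rows_written, :last_wm, :new_wm, :late_rows);
Comparing rows_read with rows_written gives the before/after count check from the roadmap; trending late_rows shows whether the 7-day reprocessing window is still wide enough.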
Practical projects
- Event ingestion with incremental loads: ingest app events daily with watermark, dedup by (user_id, event_id).
- Customer dimension with SCD2: track tier and region; validate history with point-in-time queries.
- Order fact build: integrate late-arriving corrections via a 7-day backfill window.
Mini project: Orders to Dim–Fact mart
Goal: Build a small pipeline that ingests orders and customers, deduplicates, maintains a Type 2 customer dimension, and loads a daily fact_orders table.
- Sources: customers.csv (with updates), orders.json (can arrive late).
- Stages: landing_*, stg_*, cleansed_*.
- Patterns: watermark, dedup, MERGE (Type 2), late-data backfill.
Acceptance criteria
- Idempotent staging with batch_id and ingest_ts.
- Watermark advances only on success; no double-counting.
- dim_customer stores history with effective_from/to and current_flag.
- fact_orders uses event_date partitions and reprocesses last 7 days.
- Row counts and late-data rate logged per run.
Edge cases to test
- Two updates for the same customer in one batch.
- Duplicate orders across files with different timestamps.
- Orders referencing new customers arriving one day late.
Subskills
- Staging Layer Design
- Incremental Loads And Watermarks
- Full Reload Versus Incremental Strategy
- Upserts And Merge Patterns
- Deduplication Strategies
- Handling Late Arriving Data
- Slowly Changing Dimensions Basics
- Managing Reference Data And Lookups
Next steps
- Automate: add orchestration, retries, and alerts around these patterns.
- Harden: add data quality tests (count deltas, null checks, referential checks).
- Scale: pre-aggregate before MERGE; partition by event_date; index join keys.
- Extend: learn CDC ingestion and streaming-friendly variants of these patterns.