
ETL/ELT Patterns

Learn ETL/ELT patterns for Analytics Engineers for free: roadmap, examples, subskills, and a skill exam.

Published: December 23, 2025 | Updated: December 23, 2025

Why ETL/ELT patterns matter for Analytics Engineers

ETL/ELT patterns turn raw, messy inputs into reliable, analyzable data. Mastering them lets you ship stable models, keep dashboards fresh, recover from failures, and scale as sources and volumes grow. You will design raw/staging/curated layers, choose batch or micro-batch, implement safe upserts, handle schema evolution, manage PII, and build robust retries.

Who this is for

  • Analytics Engineers who build data models, marts, and pipelines feeding BI and ML.
  • Data Analysts moving toward production-grade data workflows.
  • Engineers who need durable patterns for reliability, governance, and speed.

Prerequisites

  • Comfortable with SQL joins, aggregates, window functions.
  • Basic CLI and version control (e.g., Git).
  • Familiarity with one warehouse (e.g., BigQuery, Snowflake, Redshift) or lakehouse SQL engines.

Learning path

  1. Model the layers (Raw → Staging → Curated)
    What to do
    • Create tables or schemas for raw (as-ingested), staging (cleaned/typed), and curated (business-ready, documented).
    • Define contracts: what arrives in each layer and how it’s validated.
  2. Choose movement cadence
    Batch vs micro-batch
    • Batch: hourly/daily loads for large datasets, simpler operations.
    • Micro-batch: small, frequent loads for fresher metrics; slightly more complex operations.
  3. Make loads idempotent
    Reruns must not duplicate or corrupt data
    • Use deterministic keys and upserts (MERGE).
    • Drive loads by watermarks or change data capture (CDC).
  4. Implement incremental patterns
    Process only new/changed rows
    • Use updated_at, sequence numbers, or CDC timestamps.
    • Test edge cases: late-arriving data, reordered events, deletes.
  5. Harden for reality
    Schema changes, errors, PII
    • Allow additive columns, safely cast types, and maintain upsert keys.
    • Mask/tokenize PII; apply role-based access and column-level policies.
    • Add retries with backoff; quarantine poison records.
  6. Prove reliability
    Validation and observability
    • Row-count deltas, freshness checks, duplicate checks, and null checks on keys (see the validation sketch after this list).
    • Alert on late or failed runs.
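
A minimal sketch of the checks in step 6, written against the curated.fact_orders table used in the worked examples below; the choice of order_id as the business key and the idea of alerting thresholds are assumptions you would tune per table.

-- Duplicate check: the business key should be unique (expect zero rows)
SELECT order_id, COUNT(*) AS copies
FROM curated.fact_orders
GROUP BY order_id
HAVING COUNT(*) > 1;

-- Null-key check: expect zero
SELECT COUNT(*) AS null_keys
FROM curated.fact_orders
WHERE order_id IS NULL;

-- Freshness check: alert if the newest row is older than your SLA
SELECT MAX(updated_at) AS latest_row
FROM curated.fact_orders;

-- Row-count delta: compare what was staged against what landed
SELECT
  (SELECT COUNT(*) FROM staging.orders)      AS staged_rows,
  (SELECT COUNT(*) FROM curated.fact_orders) AS curated_rows;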

Worked examples

1) Idempotent upsert (MERGE) for a batch load

-- Assumes staging.orders_batch has one latest row per order_id in this batch
MERGE INTO curated.orders AS t
USING staging.orders_batch AS s
ON t.order_id = s.order_id
WHEN MATCHED THEN UPDATE SET
  status = s.status,
  amount = s.amount,
  updated_at = s.updated_at
WHEN NOT MATCHED THEN INSERT (order_id, status, amount, updated_at)
VALUES (s.order_id, s.status, s.amount, s.updated_at);
-- Rerunning with the same s produces the same target state (idempotent).

2) Incremental insert using a watermark

-- Insert only rows newer than the current max updated_at in target
INSERT INTO curated.fact_orders (order_id, customer_id, amount, updated_at)
SELECT s.order_id, s.customer_id, s.amount, s.updated_at
FROM staging.orders s
WHERE s.updated_at > COALESCE((SELECT MAX(updated_at) FROM curated.fact_orders), TIMESTAMP '1970-01-01');

Why this works

It ensures only new or changed rows flow through. If late data arrives with an older timestamp, use CDC or add a grace window.
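
One way to add that grace window, sketched with the same table names as example 2, is to re-read a trailing slice of already-processed time and upsert instead of insert, so the overlap cannot create duplicates. The 30-minute window is illustrative, and interval syntax varies by warehouse.

-- Watermark minus a grace window, applied with MERGE so the overlap stays idempotent.
-- Assumes one latest row per order_id in the selected slice.
MERGE INTO curated.fact_orders AS t
USING (
  SELECT order_id, customer_id, amount, updated_at
  FROM staging.orders
  WHERE updated_at > COALESCE(
          (SELECT MAX(updated_at) FROM curated.fact_orders),
          TIMESTAMP '1970-01-01'
        ) - INTERVAL '30' MINUTE
) AS s
ON t.order_id = s.order_id
WHEN MATCHED THEN UPDATE SET
  customer_id = s.customer_id,
  amount = s.amount,
  updated_at = s.updated_at
WHEN NOT MATCHED THEN INSERT (order_id, customer_id, amount, updated_at)
VALUES (s.order_id, s.customer_id, s.amount, s.updated_at);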

3) CDC apply with deletes

-- staging.cdc_orders: order_id, op ('i','u','d'), updated_at, ...
WITH last_change AS (
  SELECT *, ROW_NUMBER() OVER (PARTITION BY order_id ORDER BY updated_at DESC) AS rn
  FROM staging.cdc_orders
)
MERGE INTO curated.orders t
USING (
  SELECT * FROM last_change WHERE rn = 1
) s
ON t.order_id = s.order_id
WHEN MATCHED AND s.op = 'd' THEN DELETE
WHEN MATCHED AND s.op IN ('u','i') THEN UPDATE SET
  status = s.status,
  amount = s.amount,
  updated_at = s.updated_at
-- Treat an unmatched update as an insert so late or replayed events still land
WHEN NOT MATCHED AND s.op IN ('i','u') THEN INSERT (order_id, status, amount, updated_at)
VALUES (s.order_id, s.status, s.amount, s.updated_at);

4) Deduplication with window functions

-- Keep the most recent record per natural key
WITH ranked AS (
  SELECT t.*, ROW_NUMBER() OVER (
    PARTITION BY customer_id
    ORDER BY updated_at DESC, _ingest_time DESC
  ) AS rn
  FROM staging.customers t
)
SELECT * FROM ranked WHERE rn = 1;

5) Safe handling of additive schema changes

-- New column discount_code appears in staging; keep pipeline running safely
INSERT INTO curated.fact_orders (order_id, amount, updated_at, discount_code)
SELECT order_id,
       CAST(amount AS DECIMAL(12,2)) AS amount,
       updated_at,
       CAST(NULLIF(discount_code, '') AS VARCHAR(50)) AS discount_code
FROM staging.orders;

Tip

Treat new columns as nullable with defaults initially; backfill later for completeness.
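
A sketch of that additive change on the target table, assuming the engine supports ALTER TABLE ... ADD COLUMN and an UPDATE ... FROM backfill (exact DDL and update syntax vary by warehouse); names mirror the example above.

-- Add the new column as nullable so existing rows and older loads remain valid
ALTER TABLE curated.fact_orders ADD COLUMN discount_code VARCHAR(50);

-- Optional backfill from staging once the source reliably populates the column
UPDATE curated.fact_orders AS t
SET discount_code = s.discount_code
FROM staging.orders AS s
WHERE t.order_id = s.order_id
  AND t.discount_code IS NULL
  AND s.discount_code IS NOT NULL;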

Quick drills

  • Sketch a Raw → Staging → Curated diagram for one source system and list contract checks at each layer.
  • Write a SQL MERGE for a table of your choice with a clear business key.
  • Add a watermark condition for incremental inserts; test reruns.
  • Create a dedup query using ROW_NUMBER for one entity.
  • Simulate a schema change by adding a new optional column and keep your pipeline green.
  • Define how you would mask or tokenize an email column.

Common mistakes and debugging tips

  • Missing idempotency: Reruns duplicate rows. Fix with MERGE or unique constraints on business keys.
  • Unstable watermarks: updated_at not reliable. Prefer CDC or a monotonically increasing sequence.
  • Dropping columns on change: Causes downstream breakage. Use additive changes and compatibility layers.
  • Ignoring deletes: Facts show stale state. Apply CDC delete ops or maintain tombstones.
  • Dedup by SELECT DISTINCT only: Doesn’t guarantee the newest record. Use window rank by business key and timestamp.
  • PII in raw dumps exposed widely: Restrict access, tokenize, or mask at ingestion; propagate policies downstream.
  • Endless retries on corrupt records: Add a quarantine table/stream for poison messages and alert (see the sketch below).
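
A minimal SQL-only sketch of the quarantine idea, assuming a staging.orders_quarantine table that mirrors staging.orders_batch plus a reject_reason column (both the table and the validation rules are illustrative); retries with backoff themselves belong in the orchestrator, not in SQL.

-- Divert rows that fail basic validation so the main load never blocks on them
INSERT INTO staging.orders_quarantine (order_id, status, amount, updated_at, reject_reason)
SELECT order_id, status, amount, updated_at,
       CASE
         WHEN order_id IS NULL THEN 'missing order_id'
         WHEN amount < 0       THEN 'negative amount'
         ELSE 'missing updated_at'
       END AS reject_reason
FROM staging.orders_batch
WHERE order_id IS NULL OR amount < 0 OR updated_at IS NULL;

-- Feed only the clean rows to the MERGE in worked example 1; review and alert on the quarantine table separately
SELECT *
FROM staging.orders_batch
WHERE order_id IS NOT NULL AND amount >= 0 AND updated_at IS NOT NULL;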

Practical projects

  • Build a mini medallion pipeline for one source: CSV or JSON → Raw → Staging (typed, deduped) → Curated (business marts) with freshness checks.
  • Implement micro-batch ingestion for a clickstream table every 5 minutes with a watermark and a 30-minute late-data grace period.
  • Add PII governance: Hash emails and restrict access to unhashed columns; prove with a row access or column masking policy.

Mini project: Retail orders to curated facts

Goal: Produce curated.fact_orders and curated.dim_customers fed by batch files plus CDC for updates/deletes.

Step 1 — Raw layer
  • Land raw.orders_YYYYMMDD.json and raw.customers_YYYYMMDD.json exactly as received.
  • Store metadata: file_name, load_time, row_count.
Step 2 — Staging layer
  • Parse types, standardize timestamps, and add _ingest_time.
  • Deduplicate customers by customer_id using ROW_NUMBER.
Step 3 — Incremental facts
  • Insert into fact_orders using updated_at watermark.
  • For CDC deletes, maintain a staging.cdc_orders table and MERGE deletes.
Step 4 — Hardening
  • Add freshness checks (should be <= 1 hour late) and duplicate checks on order_id.
  • Mask PII: store email_hash in curated; keep raw email restricted (see the hashing sketch after these steps).
Step 5 — Backfill
  • Simulate a reprocess of the last 7 days safely and show that reruns are idempotent.
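
A sketch of the masking step in Step 4, keeping only a hash of the email in curated while the raw address stays in restricted schemas. The SHA2 function name is an assumption and varies by warehouse (for example, BigQuery uses SHA256 and returns bytes), as are the table and column names.

-- The curated dimension carries only the hash; the raw email stays in restricted raw/staging schemas
INSERT INTO curated.dim_customers (customer_id, email_hash, updated_at)
SELECT
  customer_id,
  SHA2(LOWER(TRIM(email)), 256) AS email_hash,  -- normalize first so the same address always yields the same hash
  updated_at
FROM staging.customers;

Analysts can still join or count distinct customers on email_hash without ever seeing the raw address.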

Next steps

  • Automate validations: row counts, null checks on keys, duplicate keys.
  • Add orchestration: schedule batch and micro-batch with retries and alerts.
  • Document data contracts for each layer and monitor them continuously.

Subskills

  • Raw Staging Curated Layers
  • Batch Versus Micro Batch Basics
  • Idempotent Loads And Backfills
  • CDC Concepts And Change Tracking
  • Incremental Loading Strategies
  • Handling Schema Changes
  • Deduplication And Upserts
  • Event Time Versus Processing Time
  • Managing PII And Sensitive Data
  • Error Handling And Retry Patterns

Skill exam

The exam is available to everyone. If you are logged in, your progress and best score will be saved.

ETL/ELT Patterns — Skill Exam

Answer all questions. You can retake the exam; passing is 70%.

15 questions · 70% to pass
