
ETL/ELT Patterns

Learn ETL/ELT patterns for Analytics Engineers for free: roadmap, examples, subskills, and a skill exam.

Published: December 23, 2025 | Updated: December 23, 2025

Why ETL/ELT patterns matter for Analytics Engineers

ETL/ELT patterns turn raw, messy inputs into reliable, analyzable data. Mastering them lets you ship stable models, keep dashboards fresh, recover from failures, and scale as sources and volumes grow. You will design raw/staging/curated layers, choose batch or micro-batch, implement safe upserts, handle schema evolution, manage PII, and build robust retries.

Who this is for

  • Analytics Engineers who build data models, marts, and pipelines feeding BI and ML.
  • Data Analysts moving toward production-grade data workflows.
  • Engineers who need durable patterns for reliability, governance, and speed.

Prerequisites

  • Comfortable with SQL joins, aggregates, window functions.
  • Basic CLI and version control (e.g., Git).
  • Familiarity with one warehouse (e.g., BigQuery, Snowflake, Redshift) or lakehouse SQL engines.

Learning path

  1. Model the layers (Raw → Staging → Curated)
    What to do
    • Create tables or schemas for raw (as-ingested), staging (cleaned/typed), and curated (business-ready, documented).
    • Define contracts: what arrives in each layer and how it’s validated.
  2. Choose movement cadence
    Batch vs micro-batch
    • Batch: hourly/daily loads for large datasets, simpler operations.
    • Micro-batch: small, frequent loads for fresher metrics; slightly more complex operations.
  3. Make loads idempotent
    Reruns must not duplicate or corrupt data
    • Use deterministic keys and upserts (MERGE).
    • Drive loads by watermarks or change data capture (CDC).
  4. Implement incremental patterns
    Process only new/changed rows
    • Use updated_at, sequence numbers, or CDC timestamps.
    • Test edge cases: late-arriving data, reordered events, deletes.
  5. Harden for reality
    Schema changes, errors, PII
    • Allow additive columns, safely cast types, and maintain upsert keys.
    • Mask/tokenize PII; apply role-based access and column-level policies.
    • Add retries with backoff; quarantine poison records.
  6. Prove reliability
    Validation and observability
    • Row-count deltas, freshness checks, duplicate checks, and null checks on keys (see the validation sketch after this list).
    • Alert on late or failed runs.
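
A minimal sketch of the checks in step 6, written against the curated.fact_orders table used in the worked examples below; the choice of order_id as the business key and the idea of alerting thresholds are assumptions you would tune per table.

-- Duplicate check: the business key should be unique (expect zero rows)
SELECT order_id, COUNT(*) AS copies
FROM curated.fact_orders
GROUP BY order_id
HAVING COUNT(*) > 1;

-- Null-key check: expect zero
SELECT COUNT(*) AS null_keys
FROM curated.fact_orders
WHERE order_id IS NULL;

-- Freshness check: alert if the newest row is older than your SLA
SELECT MAX(updated_at) AS latest_row
FROM curated.fact_orders;

-- Row-count delta: compare what was staged against what landed
SELECT
  (SELECT COUNT(*) FROM staging.orders)      AS staged_rows,
  (SELECT COUNT(*) FROM curated.fact_orders) AS curated_rows;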

Worked examples

1) Idempotent upsert (MERGE) for a batch load

-- Assumes staging.orders_batch has one latest row per order_id in this batch
MERGE INTO curated.orders AS t
USING staging.orders_batch AS s
ON t.order_id = s.order_id
WHEN MATCHED THEN UPDATE SET
  status = s.status,
  amount = s.amount,
  updated_at = s.updated_at
WHEN NOT MATCHED THEN INSERT (order_id, status, amount, updated_at)
VALUES (s.order_id, s.status, s.amount, s.updated_at);
-- Rerunning with the same s produces the same target state (idempotent).

2) Incremental insert using a watermark

-- Insert only rows newer than the current max updated_at in target
INSERT INTO curated.fact_orders (order_id, customer_id, amount, updated_at)
SELECT s.order_id, s.customer_id, s.amount, s.updated_at
FROM staging.orders s
WHERE s.updated_at > COALESCE((SELECT MAX(updated_at) FROM curated.fact_orders), TIMESTAMP '1970-01-01');

Why this works

It ensures only new or changed rows flow through. If late data arrives with an older timestamp, use CDC or add a grace window.
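
One way to add that grace window, sketched with the same table names as example 2, is to re-read a trailing slice of already-processed time and upsert instead of insert, so the overlap cannot create duplicates. The 30-minute window is illustrative, and interval syntax varies by warehouse.

-- Watermark minus a grace window, applied with MERGE so the overlap stays idempotent.
-- Assumes one latest row per order_id in the selected slice.
MERGE INTO curated.fact_orders AS t
USING (
  SELECT order_id, customer_id, amount, updated_at
  FROM staging.orders
  WHERE updated_at > COALESCE(
          (SELECT MAX(updated_at) FROM curated.fact_orders),
          TIMESTAMP '1970-01-01'
        ) - INTERVAL '30' MINUTE
) AS s
ON t.order_id = s.order_id
WHEN MATCHED THEN UPDATE SET
  customer_id = s.customer_id,
  amount = s.amount,
  updated_at = s.updated_at
WHEN NOT MATCHED THEN INSERT (order_id, customer_id, amount, updated_at)
VALUES (s.order_id, s.customer_id, s.amount, s.updated_at);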

3) CDC apply with deletes

-- staging.cdc_orders: order_id, op ('i','u','d'), updated_at, ...
WITH last_change AS (
  SELECT *, ROW_NUMBER() OVER (PARTITION BY order_id ORDER BY updated_at DESC) AS rn
  FROM staging.cdc_orders
)
MERGE INTO curated.orders t
USING (
  SELECT * FROM last_change WHERE rn = 1
) s
ON t.order_id = s.order_id
WHEN MATCHED AND s.op = 'd' THEN DELETE
WHEN MATCHED AND s.op IN ('u','i') THEN UPDATE SET
  status = s.status,
  amount = s.amount,
  updated_at = s.updated_at
-- Treat an unmatched update as an insert so late or replayed events still land
WHEN NOT MATCHED AND s.op IN ('i','u') THEN INSERT (order_id, status, amount, updated_at)
VALUES (s.order_id, s.status, s.amount, s.updated_at);

4) Deduplication with window functions

-- Keep the most recent record per natural key
WITH ranked AS (
  SELECT t.*, ROW_NUMBER() OVER (
    PARTITION BY customer_id
    ORDER BY updated_at DESC, _ingest_time DESC
  ) AS rn
  FROM staging.customers t
)
SELECT * FROM ranked WHERE rn = 1;

5) Safe handling of additive schema changes

-- New column discount_code appears in staging; keep pipeline running safely
INSERT INTO curated.fact_orders (order_id, amount, updated_at, discount_code)
SELECT order_id,
       CAST(amount AS DECIMAL(12,2)) AS amount,
       updated_at,
       CAST(NULLIF(discount_code, '') AS VARCHAR(50)) AS discount_code
FROM staging.orders;

Tip

Treat new columns as nullable with defaults initially; backfill later for completeness.
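
A sketch of that additive change on the target table, assuming the engine supports ALTER TABLE ... ADD COLUMN and an UPDATE ... FROM backfill (exact DDL and update syntax vary by warehouse); names mirror the example above.

-- Add the new column as nullable so existing rows and older loads remain valid
ALTER TABLE curated.fact_orders ADD COLUMN discount_code VARCHAR(50);

-- Optional backfill from staging once the source reliably populates the column
UPDATE curated.fact_orders AS t
SET discount_code = s.discount_code
FROM staging.orders AS s
WHERE t.order_id = s.order_id
  AND t.discount_code IS NULL
  AND s.discount_code IS NOT NULL;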

Quick drills

  • Sketch a Raw → Staging → Curated diagram for one source system and list contract checks at each layer.
  • Write a SQL MERGE for a table of your choice with a clear business key.
  • Add a watermark condition for incremental inserts; test reruns.
  • Create a dedup query using ROW_NUMBER for one entity.
  • Simulate a schema change by adding a new optional column and keep your pipeline green.
  • Define how you would mask or tokenize an email column.

Common mistakes and debugging tips

  • Missing idempotency: Reruns duplicate rows. Fix with MERGE or unique constraints on business keys.
  • Unstable watermarks: updated_at not reliable. Prefer CDC or a monotonically increasing sequence.
  • Dropping columns on change: Causes downstream breakage. Use additive changes and compatibility layers.
  • Ignoring deletes: Facts show stale state. Apply CDC delete ops or maintain tombstones.
  • Dedup by SELECT DISTINCT only: Doesn’t guarantee the newest record. Use window rank by business key and timestamp.
  • PII in raw dumps exposed widely: Restrict access, tokenize, or mask at ingestion; propagate policies downstream.
  • Endless retries on corrupt records: Add a quarantine table/stream for poison messages and alert (see the sketch below).
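
A minimal SQL-only sketch of the quarantine idea, assuming a staging.orders_quarantine table that mirrors staging.orders_batch plus a reject_reason column (both the table and the validation rules are illustrative); retries with backoff themselves belong in the orchestrator, not in SQL.

-- Divert rows that fail basic validation so the main load never blocks on them
INSERT INTO staging.orders_quarantine (order_id, status, amount, updated_at, reject_reason)
SELECT order_id, status, amount, updated_at,
       CASE
         WHEN order_id IS NULL THEN 'missing order_id'
         WHEN amount < 0       THEN 'negative amount'
         ELSE 'missing updated_at'
       END AS reject_reason
FROM staging.orders_batch
WHERE order_id IS NULL OR amount < 0 OR updated_at IS NULL;

-- Feed only the clean rows to the MERGE in worked example 1; review and alert on the quarantine table separately
SELECT *
FROM staging.orders_batch
WHERE order_id IS NOT NULL AND amount >= 0 AND updated_at IS NOT NULL;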

Practical projects

  • Build a mini medallion pipeline for one source: CSV or JSON → Raw → Staging (typed, deduped) → Curated (business marts) with freshness checks.
  • Implement micro-batch ingestion for a clickstream table every 5 minutes with a watermark and a 30-minute late-data grace period.
  • Add PII governance: Hash emails and restrict access to unhashed columns; prove with a row access or column masking policy.

Mini project: Retail orders to curated facts

Goal: Produce curated.fact_orders and curated.dim_customers fed by batch files plus CDC for updates/deletes.

Step 1 — Raw layer
  • Land raw.orders_YYYYMMDD.json and raw.customers_YYYYMMDD.json exactly as received.
  • Store metadata: file_name, load_time, row_count.
Step 2 — Staging layer
  • Parse types, standardize timestamps, and add _ingest_time.
  • Deduplicate customers by customer_id using ROW_NUMBER.
Step 3 — Incremental facts
  • Insert into fact_orders using updated_at watermark.
  • For CDC deletes, maintain a staging.cdc_orders table and MERGE deletes.
Step 4 — Hardening
  • Add freshness checks (should be <= 1 hour late) and duplicate checks on order_id.
  • Mask PII: store email_hash in curated; keep raw email restricted (see the hashing sketch after these steps).
Step 5 — Backfill
  • Simulate a reprocess of the last 7 days safely and show that reruns are idempotent.
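
A sketch of the masking step in Step 4, keeping only a hash of the email in curated while the raw address stays in restricted schemas. The SHA2 function name is an assumption and varies by warehouse (for example, BigQuery uses SHA256 and returns bytes), as are the table and column names.

-- The curated dimension carries only the hash; the raw email stays in restricted raw/staging schemas
INSERT INTO curated.dim_customers (customer_id, email_hash, updated_at)
SELECT
  customer_id,
  SHA2(LOWER(TRIM(email)), 256) AS email_hash,  -- normalize first so the same address always yields the same hash
  updated_at
FROM staging.customers;

Analysts can still join or count distinct customers on email_hash without ever seeing the raw address.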

Next steps

  • Automate validations: row counts, null checks on keys, duplicate keys.
  • Add orchestration: schedule batch and micro-batch with retries and alerts.
  • Document data contracts for each layer and monitor them continuously.

Subskills

  • Raw Staging Curated Layers
  • Batch Versus Micro Batch Basics
  • Idempotent Loads And Backfills
  • CDC Concepts And Change Tracking
  • Incremental Loading Strategies
  • Handling Schema Changes
  • Deduplication And Upserts
  • Event Time Versus Processing Time
  • Managing PII And Sensitive Data
  • Error Handling And Retry Patterns

Skill exam

The exam is available to everyone. If you are logged in, your progress and best score will be saved.

ETL/ELT Patterns — Skill Exam

Answer all questions. You can retake the exam; passing is 70%.

15 questions · 70% to pass
