How to learn Window Functions For Dedup And SCD for SQL in Analytics Engineer for free

Why this matters

As an Analytics Engineer, you regularly need to remove duplicates from raw data and maintain Slowly Changing Dimensions (SCD) so analytics stays consistent and trustworthy. Window functions let you pick the right row deterministically and create clean, time-bounded histories for dimensions.

Real tasks you will do: deduplicate customers from multiple sources, keep the latest product attributes, create SCD Type 2 dimensions for user plans or order statuses, and build stable snapshots for reporting.
Impact: fewer flaky dashboards, easier debugging, and predictable business metrics.

Concept explained simply

Window functions compute values across a set of rows related to the current row, without collapsing them. The two stars for dedup and SCD are:

ROW_NUMBER() OVER (PARTITION BY key ORDER BY recency DESC): gives a 1-based rank per entity; keep rn = 1 to get the single best row.
LAG/LEAD(value) OVER (PARTITION BY key ORDER BY time): look at previous/next rows to mark changes and set validity ranges.

Mental model

Partition: who are we comparing within? (e.g., by customer_id)
Order: what defines newest/earliest or sequence? (e.g., updated_at DESC)
Selector: pick the winning row (ROW_NUMBER = 1), or compute start/end boundaries (LEAD to set valid_to)

Quick reference

Dedup single latest row: ROW_NUMBER + filter rn = 1
Handle ties: add deterministic tie-breakers in ORDER BY (e.g., source priority, id)
SCD2 from change log: LEAD(next_time) to set valid_to; last row gets an open-ended high date and is_current = true
SCD2 from snapshots: LAG to detect value changes between days; create ranges only where value changed

Worked examples

Example 1 — Deduplicate to the latest customer record

Goal: keep the single most recent record per customer, prefer source = 'crm' when timestamps tie.

SELECT customer_id, email, updated_at, source
FROM (
  SELECT c.*,
         ROW_NUMBER() OVER (
           PARTITION BY customer_id
           ORDER BY updated_at DESC,
                    CASE source WHEN 'crm' THEN 1 WHEN 'webform' THEN 2 ELSE 3 END,
                    email ASC
         ) AS rn
  FROM customers_raw c
) t
WHERE rn = 1;

Why it works: ROW_NUMBER ensures exactly one row per customer. ORDER BY defines clear precedence: newest timestamp, then source priority, then a stable tie-breaker.

Example 2 — First event per session

Goal: get the first event in each session for attribution.

SELECT session_id, event_id, event_time, event_type
FROM (
  SELECT e.*,
         ROW_NUMBER() OVER (
           PARTITION BY session_id
           ORDER BY event_time ASC, event_id ASC
         ) AS rn
  FROM events e
) t
WHERE rn = 1;

Note: include a deterministic tie-breaker like event_id to avoid non-deterministic results when timestamps tie.

Example 3 — Build an SCD Type 2 dimension from a status change log

Goal: generate valid_from/valid_to periods for order statuses.

WITH staged AS (
  SELECT
    order_id,
    status,
    status_time AS valid_from,
    LEAD(status_time) OVER (
      PARTITION BY order_id
      ORDER BY status_time
    ) AS next_time
  FROM orders_status_log
  QUALIFY status != LAG(status) OVER (
             PARTITION BY order_id ORDER BY status_time
           ) OR LAG(status) OVER (
             PARTITION BY order_id ORDER BY status_time
           ) IS NULL
), final AS (
  SELECT
    order_id,
    status,
    valid_from,
    COALESCE(next_time, TIMESTAMP '9999-12-31 23:59:59') AS valid_to,
    CASE WHEN next_time IS NULL THEN TRUE ELSE FALSE END AS is_current
  FROM staged
)
SELECT * FROM final;

Why it works: consecutive duplicate statuses are collapsed. LEAD gives the next start time as this row’s end time; the last interval stays open-ended and marked current.

Variant — SCD2 from daily snapshots (plan changes)

WITH with_prev AS (
  SELECT
    user_id,
    snapshot_date,
    plan_id,
    LAG(plan_id) OVER (
      PARTITION BY user_id ORDER BY snapshot_date
    ) AS prev_plan
  FROM user_plans_daily
), change_starts AS (
  SELECT
    user_id,
    plan_id,
    snapshot_date AS valid_from
  FROM with_prev
  WHERE prev_plan IS DISTINCT FROM plan_id  -- start a new range on change
), ranges AS (
  SELECT
    user_id,
    plan_id,
    valid_from,
    LEAD(valid_from) OVER (
      PARTITION BY user_id ORDER BY valid_from
    ) - INTERVAL '1 day' AS valid_to
  FROM change_starts
)
SELECT
  user_id,
  plan_id,
  valid_from,
  COALESCE(valid_to, DATE '9999-12-31') AS valid_to,
  CASE WHEN valid_to IS NULL THEN TRUE ELSE FALSE END AS is_current
FROM ranges;

Step-by-step: choose the right approach

Define the business key (e.g., customer_id, order_id). This is your PARTITION BY.
Define recency/sequence (updated_at, event_time). This is your ORDER BY.
Pick the function:
- Dedup to a single row: ROW_NUMBER
- Detect changes over time: LAG/LEAD
Make it deterministic: add tie-breakers (source priority, id).
Filter or build ranges: rn = 1 for dedup; LEAD to set valid_to for SCD2.
Validate: counts match expectations, no overlapping periods, is_current makes sense.

Exercises

These mirror the graded exercises below. Try them in your SQL environment. Expected outputs are provided for the sample data.

Exercise 1 — Deduplicate customers by latest update with source priority

Input (customers_raw):

customer_id | email            | updated_at           | source
----------- | ---------------- | -------------------- | -------
1           | alice@old.com    | 2024-07-20 10:00:00 | webform
1           | alice@x.com      | 2024-07-21 10:15:00 | crm
2           | bob@x.com        | 2024-06-11 09:00:00 | webform
2           | bob@x.com        | 2024-06-11 09:00:00 | crm
3           | carol@x.com      | 2024-06-12 12:30:00 | partner

Task: Return one row per customer_id keeping the most recent updated_at. If timestamps tie, prefer source priority crm > webform > partner. Columns: customer_id, email, updated_at, source.

Hints

Use ROW_NUMBER() OVER (PARTITION BY customer_id ORDER BY ...)
Order by updated_at DESC, then a CASE expression for source priority
Add a final stable tie-breaker (e.g., email ASC) if needed

Show solution

SELECT customer_id, email, updated_at, source
FROM (
  SELECT c.*,
         ROW_NUMBER() OVER (
           PARTITION BY customer_id
           ORDER BY updated_at DESC,
                    CASE source WHEN 'crm' THEN 1 WHEN 'webform' THEN 2 WHEN 'partner' THEN 3 ELSE 4 END,
                    email ASC
         ) AS rn
  FROM customers_raw c
) t
WHERE rn = 1;

Expected output

customer_id | email         | updated_at           | source
----------- | ------------- | -------------------- | ------
1           | alice@x.com   | 2024-07-21 10:15:00  | crm
2           | bob@x.com     | 2024-06-11 09:00:00  | crm
3           | carol@x.com   | 2024-06-12 12:30:00  | partner

Exercise 2 — Build an SCD2 orders dimension from a change log

Input (orders_status_log):

order_id | status   | status_time
-------- | -------- | ---------------------
101      | created  | 2024-01-05 08:00:00
101      | paid     | 2024-01-05 09:10:00
101      | shipped  | 2024-01-06 14:00:00
102      | created  | 2024-02-01 10:00:00
102      | created  | 2024-02-01 10:05:00

Task: Collapse consecutive duplicate statuses per order_id, then output order_id, status, valid_from, valid_to, is_current. Use a high date for open-ended valid_to.

Hints

Filter out consecutive duplicates using LAG(status)
Use LEAD(status_time) to set valid_to
Mark the last row per order_id as is_current = true

Show solution

WITH ordered AS (
  SELECT
    order_id,
    status,
    status_time,
    LAG(status) OVER (
      PARTITION BY order_id ORDER BY status_time
    ) AS prev_status
  FROM orders_status_log
), changes_only AS (
  SELECT order_id, status, status_time
  FROM ordered
  WHERE prev_status IS NULL OR status != prev_status
), scd AS (
  SELECT
    order_id,
    status,
    status_time AS valid_from,
    LEAD(status_time) OVER (
      PARTITION BY order_id ORDER BY status_time
    ) AS next_time
  FROM changes_only
)
SELECT
  order_id,
  status,
  valid_from,
  COALESCE(next_time, TIMESTAMP '9999-12-31 23:59:59') AS valid_to,
  CASE WHEN next_time IS NULL THEN TRUE ELSE FALSE END AS is_current
FROM scd
ORDER BY order_id, valid_from;

Expected output

order_id | status  | valid_from           | valid_to              | is_current
-------- | ------- | -------------------- | --------------------- | ----------
101      | created | 2024-01-05 08:00:00  | 2024-01-05 09:10:00   | false
101      | paid    | 2024-01-05 09:10:00  | 2024-01-06 14:00:00   | false
101      | shipped | 2024-01-06 14:00:00  | 9999-12-31 23:59:59   | true
102      | created | 2024-02-01 10:00:00  | 9999-12-31 23:59:59   | true

Self-check checklist

[ ] For dedup, COUNT(*) equals COUNT(DISTINCT key) after filtering rn = 1
[ ] ORDER BY includes deterministic tie-breakers
[ ] For SCD2, no overlapping periods for the same key
[ ] Exactly one is_current = true per key

Common mistakes and how to self-check

Using RANK instead of ROW_NUMBER for dedup. Ties with RANK can produce multiple rows per key. Fix: use ROW_NUMBER and deterministic ORDER BY.
Missing ORDER BY in window. Without it, “latest” is undefined. Always specify your recency column and tie-breakers.
Partitioning by the wrong key. Double-check the business entity you dedup (customer_id vs email).
Overlapping SCD periods. Ensure valid_from of the next row is exactly the boundary for the previous row’s valid_to.
Forgetting to collapse consecutive duplicates before building SCD. Use LAG to detect no-change rows and remove them.
Timezone drift. Align timestamps to a single timezone or use DATE boundaries consistently.

Quick validations

-- Dedup: verify one row per key
SELECT COUNT(*) AS total, COUNT(DISTINCT customer_id) AS unique_ids
FROM deduped;  -- should match

-- SCD: check overlaps
SELECT order_id
FROM orders_dim_scd2 d
JOIN orders_dim_scd2 n
  ON d.order_id = n.order_id
 AND d.valid_from < COALESCE(n.valid_to, TIMESTAMP '9999-12-31')
 AND COALESCE(d.valid_to, TIMESTAMP '9999-12-31') > n.valid_from
 AND (d.status, d.valid_from) <> (n.status, n.valid_from);
-- should return zero rows

Practical projects

Customer unification: merge CRM, web signups, and partner feeds; dedup to one trusted row per customer_id.
User plan SCD2: build a dimension tracking plan changes from daily snapshots; ensure no gaps and single is_current.
Product price history: generate SCD2 price bands from a price change log and validate overlaps.

Who this is for

Analytics Engineers and BI developers cleaning and modeling raw data
Data Analysts who need reliable “latest value” views
Data Engineers building curated marts for reporting

Prerequisites

Comfort with SELECT, WHERE, GROUP BY, and JOINs
Basics of window functions (PARTITION BY, ORDER BY)
Understanding of SCD Type 2 concepts (valid_from/valid_to, is_current)

Learning path

Review window function basics (ROW_NUMBER, RANK, LAG/LEAD)
Practice deterministic dedup with multiple tie-breakers
Build SCD2 from change logs (LEAD) and from snapshots (LAG)
Add data quality checks for overlaps and cardinality
Apply in your warehouse models and schedule validations

Next steps

Harden models with tests that ensure one current record per key
Add source priority mappings to centralize tie-break logic
Parameterize high-date and date boundary conventions for consistency

Mini challenge

You have a daily product catalog snapshot with columns (product_id, snapshot_date, price, currency). Build an SCD2 table tracking price changes only. Requirements: collapse consecutive same prices, set valid_to as day before next change, and flag is_current.

[ ] Use LAG(price) to detect changes
[ ] Create valid_from at change points
[ ] Set valid_to with LEAD(valid_from) - 1 day
[ ] Validate no overlaps and a single is_current per product

Check your knowledge

Take the quick test to lock in the concepts. Anyone can take it for free; only logged-in users will have their progress saved.

Menu

Window Functions For Dedup And SCD

Table of Contents

Why this matters

Concept explained simply

Mental model

Worked examples

Example 1 — Deduplicate to the latest customer record

Example 2 — First event per session

Example 3 — Build an SCD Type 2 dimension from a status change log

Step-by-step: choose the right approach

Exercises

Exercise 1 — Deduplicate customers by latest update with source priority

Expected output

Exercise 2 — Build an SCD2 orders dimension from a change log

Expected output

Self-check checklist

Common mistakes and how to self-check

Practical projects

Who this is for

Prerequisites

Learning path

Next steps

Mini challenge

Check your knowledge

Practice Exercises

Deduplicate customers by latest update with source priority

Instructions

Expected Output

Build an SCD2 orders dimension from a change log

Window Functions For Dedup And SCD — Quick Test

Have questions about Window Functions For Dedup And SCD?

AI Assistant