luvv to helpDiscover the Best Free Online Tools
Topic 4 of 8

Window Functions For Dedup And SCD

Learn Window Functions For Dedup And SCD for free with explanations, exercises, and a quick test (for Analytics Engineer).

Published: December 23, 2025 | Updated: December 23, 2025

Why this matters

As an Analytics Engineer, you regularly need to remove duplicates from raw data and maintain Slowly Changing Dimensions (SCD) so analytics stays consistent and trustworthy. Window functions let you pick the right row deterministically and create clean, time-bounded histories for dimensions.

  • Real tasks you will do: deduplicate customers from multiple sources, keep the latest product attributes, create SCD Type 2 dimensions for user plans or order statuses, and build stable snapshots for reporting.
  • Impact: fewer flaky dashboards, easier debugging, and predictable business metrics.

Concept explained simply

Window functions compute values across a set of rows related to the current row, without collapsing them. The two stars for dedup and SCD are:

  • ROW_NUMBER() OVER (PARTITION BY key ORDER BY recency DESC): gives a 1-based rank per entity; keep rn = 1 to get the single best row.
  • LAG/LEAD(value) OVER (PARTITION BY key ORDER BY time): look at previous/next rows to mark changes and set validity ranges.

Mental model

  • Partition: who are we comparing within? (e.g., by customer_id)
  • Order: what defines newest/earliest or sequence? (e.g., updated_at DESC)
  • Selector: pick the winning row (ROW_NUMBER = 1), or compute start/end boundaries (LEAD to set valid_to)
Quick reference
  • Dedup single latest row: ROW_NUMBER + filter rn = 1
  • Handle ties: add deterministic tie-breakers in ORDER BY (e.g., source priority, id)
  • SCD2 from change log: LEAD(next_time) to set valid_to; last row gets an open-ended high date and is_current = true
  • SCD2 from snapshots: LAG to detect value changes between days; create ranges only where value changed

Worked examples

Example 1 — Deduplicate to the latest customer record

Goal: keep the single most recent record per customer, prefer source = 'crm' when timestamps tie.

SELECT customer_id, email, updated_at, source
FROM (
  SELECT c.*,
         ROW_NUMBER() OVER (
           PARTITION BY customer_id
           ORDER BY updated_at DESC,
                    CASE source WHEN 'crm' THEN 1 WHEN 'webform' THEN 2 ELSE 3 END,
                    email ASC
         ) AS rn
  FROM customers_raw c
) t
WHERE rn = 1;

Why it works: ROW_NUMBER ensures exactly one row per customer. ORDER BY defines clear precedence: newest timestamp, then source priority, then a stable tie-breaker.

Example 2 — First event per session

Goal: get the first event in each session for attribution.

SELECT session_id, event_id, event_time, event_type
FROM (
  SELECT e.*,
         ROW_NUMBER() OVER (
           PARTITION BY session_id
           ORDER BY event_time ASC, event_id ASC
         ) AS rn
  FROM events e
) t
WHERE rn = 1;

Note: include a deterministic tie-breaker like event_id to avoid non-deterministic results when timestamps tie.

Example 3 — Build an SCD Type 2 dimension from a status change log

Goal: generate valid_from/valid_to periods for order statuses.

WITH staged AS (
  SELECT
    order_id,
    status,
    status_time AS valid_from,
    LEAD(status_time) OVER (
      PARTITION BY order_id
      ORDER BY status_time
    ) AS next_time
  FROM orders_status_log
  QUALIFY status != LAG(status) OVER (
             PARTITION BY order_id ORDER BY status_time
           ) OR LAG(status) OVER (
             PARTITION BY order_id ORDER BY status_time
           ) IS NULL
), final AS (
  SELECT
    order_id,
    status,
    valid_from,
    COALESCE(next_time, TIMESTAMP '9999-12-31 23:59:59') AS valid_to,
    CASE WHEN next_time IS NULL THEN TRUE ELSE FALSE END AS is_current
  FROM staged
)
SELECT * FROM final;

Why it works: consecutive duplicate statuses are collapsed. LEAD gives the next start time as this row’s end time; the last interval stays open-ended and marked current.

Variant — SCD2 from daily snapshots (plan changes)
WITH with_prev AS (
  SELECT
    user_id,
    snapshot_date,
    plan_id,
    LAG(plan_id) OVER (
      PARTITION BY user_id ORDER BY snapshot_date
    ) AS prev_plan
  FROM user_plans_daily
), change_starts AS (
  SELECT
    user_id,
    plan_id,
    snapshot_date AS valid_from
  FROM with_prev
  WHERE prev_plan IS DISTINCT FROM plan_id  -- start a new range on change
), ranges AS (
  SELECT
    user_id,
    plan_id,
    valid_from,
    LEAD(valid_from) OVER (
      PARTITION BY user_id ORDER BY valid_from
    ) - INTERVAL '1 day' AS valid_to
  FROM change_starts
)
SELECT
  user_id,
  plan_id,
  valid_from,
  COALESCE(valid_to, DATE '9999-12-31') AS valid_to,
  CASE WHEN valid_to IS NULL THEN TRUE ELSE FALSE END AS is_current
FROM ranges;

Step-by-step: choose the right approach

  1. Define the business key (e.g., customer_id, order_id). This is your PARTITION BY.
  2. Define recency/sequence (updated_at, event_time). This is your ORDER BY.
  3. Pick the function:
    • Dedup to a single row: ROW_NUMBER
    • Detect changes over time: LAG/LEAD
  4. Make it deterministic: add tie-breakers (source priority, id).
  5. Filter or build ranges: rn = 1 for dedup; LEAD to set valid_to for SCD2.
  6. Validate: counts match expectations, no overlapping periods, is_current makes sense.

Exercises

These mirror the graded exercises below. Try them in your SQL environment. Expected outputs are provided for the sample data.

Exercise 1 — Deduplicate customers by latest update with source priority

Input (customers_raw):

customer_id | email            | updated_at           | source
----------- | ---------------- | -------------------- | -------
1           | alice@old.com    | 2024-07-20 10:00:00 | webform
1           | alice@x.com      | 2024-07-21 10:15:00 | crm
2           | bob@x.com        | 2024-06-11 09:00:00 | webform
2           | bob@x.com        | 2024-06-11 09:00:00 | crm
3           | carol@x.com      | 2024-06-12 12:30:00 | partner

Task: Return one row per customer_id keeping the most recent updated_at. If timestamps tie, prefer source priority crm > webform > partner. Columns: customer_id, email, updated_at, source.

Hints
  • Use ROW_NUMBER() OVER (PARTITION BY customer_id ORDER BY ...)
  • Order by updated_at DESC, then a CASE expression for source priority
  • Add a final stable tie-breaker (e.g., email ASC) if needed
Show solution
SELECT customer_id, email, updated_at, source
FROM (
  SELECT c.*,
         ROW_NUMBER() OVER (
           PARTITION BY customer_id
           ORDER BY updated_at DESC,
                    CASE source WHEN 'crm' THEN 1 WHEN 'webform' THEN 2 WHEN 'partner' THEN 3 ELSE 4 END,
                    email ASC
         ) AS rn
  FROM customers_raw c
) t
WHERE rn = 1;

Expected output

customer_id | email         | updated_at           | source
----------- | ------------- | -------------------- | ------
1           | alice@x.com   | 2024-07-21 10:15:00  | crm
2           | bob@x.com     | 2024-06-11 09:00:00  | crm
3           | carol@x.com   | 2024-06-12 12:30:00  | partner

Exercise 2 — Build an SCD2 orders dimension from a change log

Input (orders_status_log):

order_id | status   | status_time
-------- | -------- | ---------------------
101      | created  | 2024-01-05 08:00:00
101      | paid     | 2024-01-05 09:10:00
101      | shipped  | 2024-01-06 14:00:00
102      | created  | 2024-02-01 10:00:00
102      | created  | 2024-02-01 10:05:00

Task: Collapse consecutive duplicate statuses per order_id, then output order_id, status, valid_from, valid_to, is_current. Use a high date for open-ended valid_to.

Hints
  • Filter out consecutive duplicates using LAG(status)
  • Use LEAD(status_time) to set valid_to
  • Mark the last row per order_id as is_current = true
Show solution
WITH ordered AS (
  SELECT
    order_id,
    status,
    status_time,
    LAG(status) OVER (
      PARTITION BY order_id ORDER BY status_time
    ) AS prev_status
  FROM orders_status_log
), changes_only AS (
  SELECT order_id, status, status_time
  FROM ordered
  WHERE prev_status IS NULL OR status != prev_status
), scd AS (
  SELECT
    order_id,
    status,
    status_time AS valid_from,
    LEAD(status_time) OVER (
      PARTITION BY order_id ORDER BY status_time
    ) AS next_time
  FROM changes_only
)
SELECT
  order_id,
  status,
  valid_from,
  COALESCE(next_time, TIMESTAMP '9999-12-31 23:59:59') AS valid_to,
  CASE WHEN next_time IS NULL THEN TRUE ELSE FALSE END AS is_current
FROM scd
ORDER BY order_id, valid_from;

Expected output

order_id | status  | valid_from           | valid_to              | is_current
-------- | ------- | -------------------- | --------------------- | ----------
101      | created | 2024-01-05 08:00:00  | 2024-01-05 09:10:00   | false
101      | paid    | 2024-01-05 09:10:00  | 2024-01-06 14:00:00   | false
101      | shipped | 2024-01-06 14:00:00  | 9999-12-31 23:59:59   | true
102      | created | 2024-02-01 10:00:00  | 9999-12-31 23:59:59   | true

Self-check checklist

  • [ ] For dedup, COUNT(*) equals COUNT(DISTINCT key) after filtering rn = 1
  • [ ] ORDER BY includes deterministic tie-breakers
  • [ ] For SCD2, no overlapping periods for the same key
  • [ ] Exactly one is_current = true per key

Common mistakes and how to self-check

  • Using RANK instead of ROW_NUMBER for dedup. Ties with RANK can produce multiple rows per key. Fix: use ROW_NUMBER and deterministic ORDER BY.
  • Missing ORDER BY in window. Without it, “latest” is undefined. Always specify your recency column and tie-breakers.
  • Partitioning by the wrong key. Double-check the business entity you dedup (customer_id vs email).
  • Overlapping SCD periods. Ensure valid_from of the next row is exactly the boundary for the previous row’s valid_to.
  • Forgetting to collapse consecutive duplicates before building SCD. Use LAG to detect no-change rows and remove them.
  • Timezone drift. Align timestamps to a single timezone or use DATE boundaries consistently.
Quick validations
-- Dedup: verify one row per key
SELECT COUNT(*) AS total, COUNT(DISTINCT customer_id) AS unique_ids
FROM deduped;  -- should match

-- SCD: check overlaps
SELECT order_id
FROM orders_dim_scd2 d
JOIN orders_dim_scd2 n
  ON d.order_id = n.order_id
 AND d.valid_from < COALESCE(n.valid_to, TIMESTAMP '9999-12-31')
 AND COALESCE(d.valid_to, TIMESTAMP '9999-12-31') > n.valid_from
 AND (d.status, d.valid_from) <> (n.status, n.valid_from);
-- should return zero rows

Practical projects

  • Customer unification: merge CRM, web signups, and partner feeds; dedup to one trusted row per customer_id.
  • User plan SCD2: build a dimension tracking plan changes from daily snapshots; ensure no gaps and single is_current.
  • Product price history: generate SCD2 price bands from a price change log and validate overlaps.

Who this is for

  • Analytics Engineers and BI developers cleaning and modeling raw data
  • Data Analysts who need reliable “latest value” views
  • Data Engineers building curated marts for reporting

Prerequisites

  • Comfort with SELECT, WHERE, GROUP BY, and JOINs
  • Basics of window functions (PARTITION BY, ORDER BY)
  • Understanding of SCD Type 2 concepts (valid_from/valid_to, is_current)

Learning path

  1. Review window function basics (ROW_NUMBER, RANK, LAG/LEAD)
  2. Practice deterministic dedup with multiple tie-breakers
  3. Build SCD2 from change logs (LEAD) and from snapshots (LAG)
  4. Add data quality checks for overlaps and cardinality
  5. Apply in your warehouse models and schedule validations

Next steps

  • Harden models with tests that ensure one current record per key
  • Add source priority mappings to centralize tie-break logic
  • Parameterize high-date and date boundary conventions for consistency

Mini challenge

You have a daily product catalog snapshot with columns (product_id, snapshot_date, price, currency). Build an SCD2 table tracking price changes only. Requirements: collapse consecutive same prices, set valid_to as day before next change, and flag is_current.

  • [ ] Use LAG(price) to detect changes
  • [ ] Create valid_from at change points
  • [ ] Set valid_to with LEAD(valid_from) - 1 day
  • [ ] Validate no overlaps and a single is_current per product

Check your knowledge

Take the quick test to lock in the concepts. Anyone can take it for free; only logged-in users will have their progress saved.

Practice Exercises

2 exercises to complete

Instructions

From table customers_raw(customer_id, email, updated_at, source), return one row per customer_id keeping the most recent updated_at. If timestamps tie, prefer source priority crm > webform > partner. Columns: customer_id, email, updated_at, source.

Sample input:

customer_id | email            | updated_at           | source
----------- | ---------------- | -------------------- | -------
1           | alice@old.com    | 2024-07-20 10:00:00 | webform
1           | alice@x.com      | 2024-07-21 10:15:00 | crm
2           | bob@x.com        | 2024-06-11 09:00:00 | webform
2           | bob@x.com        | 2024-06-11 09:00:00 | crm
3           | carol@x.com      | 2024-06-12 12:30:00 | partner
Expected Output
customer_id | email | updated_at | source 1 | alice@x.com | 2024-07-21 10:15:00 | crm 2 | bob@x.com | 2024-06-11 09:00:00 | crm 3 | carol@x.com | 2024-06-12 12:30:00 | partner

Window Functions For Dedup And SCD — Quick Test

Test your knowledge with 8 questions. Pass with 70% or higher.

8 questions70% to pass

Have questions about Window Functions For Dedup And SCD?

AI Assistant

Ask questions about this tool