Why this matters
As an Analytics Engineer, you regularly need to remove duplicates from raw data and maintain Slowly Changing Dimensions (SCD) so analytics stays consistent and trustworthy. Window functions let you pick the right row deterministically and create clean, time-bounded histories for dimensions.
- Real tasks you will do: deduplicate customers from multiple sources, keep the latest product attributes, create SCD Type 2 dimensions for user plans or order statuses, and build stable snapshots for reporting.
- Impact: fewer flaky dashboards, easier debugging, and predictable business metrics.
Concept explained simply
Window functions compute values across a set of rows related to the current row, without collapsing them. The two stars for dedup and SCD are:
- ROW_NUMBER() OVER (PARTITION BY key ORDER BY recency DESC): gives a 1-based rank per entity; keep rn = 1 to get the single best row.
- LAG/LEAD(value) OVER (PARTITION BY key ORDER BY time): look at previous/next rows to mark changes and set validity ranges.
Mental model
- Partition: who are we comparing within? (e.g., by customer_id)
- Order: what defines newest/earliest or sequence? (e.g., updated_at DESC)
- Selector: pick the winning row (ROW_NUMBER = 1), or compute start/end boundaries (LEAD to set valid_to)
Quick reference
- Dedup single latest row: ROW_NUMBER + filter rn = 1
- Handle ties: add deterministic tie-breakers in ORDER BY (e.g., source priority, id)
- SCD2 from change log: LEAD(next_time) to set valid_to; last row gets an open-ended high date and is_current = true
- SCD2 from snapshots: LAG to detect value changes between days; create ranges only where value changed
Worked examples
Example 1 — Deduplicate to the latest customer record
Goal: keep the single most recent record per customer, prefer source = 'crm' when timestamps tie.
SELECT customer_id, email, updated_at, source
FROM (
SELECT c.*,
ROW_NUMBER() OVER (
PARTITION BY customer_id
ORDER BY updated_at DESC,
CASE source WHEN 'crm' THEN 1 WHEN 'webform' THEN 2 ELSE 3 END,
email ASC
) AS rn
FROM customers_raw c
) t
WHERE rn = 1;
Why it works: ROW_NUMBER ensures exactly one row per customer. ORDER BY defines clear precedence: newest timestamp, then source priority, then a stable tie-breaker.
Example 2 — First event per session
Goal: get the first event in each session for attribution.
SELECT session_id, event_id, event_time, event_type
FROM (
SELECT e.*,
ROW_NUMBER() OVER (
PARTITION BY session_id
ORDER BY event_time ASC, event_id ASC
) AS rn
FROM events e
) t
WHERE rn = 1;
Note: include a deterministic tie-breaker like event_id to avoid non-deterministic results when timestamps tie.
Example 3 — Build an SCD Type 2 dimension from a status change log
Goal: generate valid_from/valid_to periods for order statuses.
WITH staged AS (
SELECT
order_id,
status,
status_time AS valid_from,
LEAD(status_time) OVER (
PARTITION BY order_id
ORDER BY status_time
) AS next_time
FROM orders_status_log
QUALIFY status != LAG(status) OVER (
PARTITION BY order_id ORDER BY status_time
) OR LAG(status) OVER (
PARTITION BY order_id ORDER BY status_time
) IS NULL
), final AS (
SELECT
order_id,
status,
valid_from,
COALESCE(next_time, TIMESTAMP '9999-12-31 23:59:59') AS valid_to,
CASE WHEN next_time IS NULL THEN TRUE ELSE FALSE END AS is_current
FROM staged
)
SELECT * FROM final;
Why it works: consecutive duplicate statuses are collapsed. LEAD gives the next start time as this row’s end time; the last interval stays open-ended and marked current.
Variant — SCD2 from daily snapshots (plan changes)
WITH with_prev AS (
SELECT
user_id,
snapshot_date,
plan_id,
LAG(plan_id) OVER (
PARTITION BY user_id ORDER BY snapshot_date
) AS prev_plan
FROM user_plans_daily
), change_starts AS (
SELECT
user_id,
plan_id,
snapshot_date AS valid_from
FROM with_prev
WHERE prev_plan IS DISTINCT FROM plan_id -- start a new range on change
), ranges AS (
SELECT
user_id,
plan_id,
valid_from,
LEAD(valid_from) OVER (
PARTITION BY user_id ORDER BY valid_from
) - INTERVAL '1 day' AS valid_to
FROM change_starts
)
SELECT
user_id,
plan_id,
valid_from,
COALESCE(valid_to, DATE '9999-12-31') AS valid_to,
CASE WHEN valid_to IS NULL THEN TRUE ELSE FALSE END AS is_current
FROM ranges;
Step-by-step: choose the right approach
- Define the business key (e.g., customer_id, order_id). This is your PARTITION BY.
- Define recency/sequence (updated_at, event_time). This is your ORDER BY.
- Pick the function:
- Dedup to a single row: ROW_NUMBER
- Detect changes over time: LAG/LEAD
- Make it deterministic: add tie-breakers (source priority, id).
- Filter or build ranges: rn = 1 for dedup; LEAD to set valid_to for SCD2.
- Validate: counts match expectations, no overlapping periods, is_current makes sense.
Exercises
These mirror the graded exercises below. Try them in your SQL environment. Expected outputs are provided for the sample data.
Exercise 1 — Deduplicate customers by latest update with source priority
Input (customers_raw):
customer_id | email | updated_at | source
----------- | ---------------- | -------------------- | -------
1 | alice@old.com | 2024-07-20 10:00:00 | webform
1 | alice@x.com | 2024-07-21 10:15:00 | crm
2 | bob@x.com | 2024-06-11 09:00:00 | webform
2 | bob@x.com | 2024-06-11 09:00:00 | crm
3 | carol@x.com | 2024-06-12 12:30:00 | partner
Task: Return one row per customer_id keeping the most recent updated_at. If timestamps tie, prefer source priority crm > webform > partner. Columns: customer_id, email, updated_at, source.
Hints
- Use ROW_NUMBER() OVER (PARTITION BY customer_id ORDER BY ...)
- Order by updated_at DESC, then a CASE expression for source priority
- Add a final stable tie-breaker (e.g., email ASC) if needed
Show solution
SELECT customer_id, email, updated_at, source
FROM (
SELECT c.*,
ROW_NUMBER() OVER (
PARTITION BY customer_id
ORDER BY updated_at DESC,
CASE source WHEN 'crm' THEN 1 WHEN 'webform' THEN 2 WHEN 'partner' THEN 3 ELSE 4 END,
email ASC
) AS rn
FROM customers_raw c
) t
WHERE rn = 1;
Expected output
customer_id | email | updated_at | source
----------- | ------------- | -------------------- | ------
1 | alice@x.com | 2024-07-21 10:15:00 | crm
2 | bob@x.com | 2024-06-11 09:00:00 | crm
3 | carol@x.com | 2024-06-12 12:30:00 | partner
Exercise 2 — Build an SCD2 orders dimension from a change log
Input (orders_status_log):
order_id | status | status_time
-------- | -------- | ---------------------
101 | created | 2024-01-05 08:00:00
101 | paid | 2024-01-05 09:10:00
101 | shipped | 2024-01-06 14:00:00
102 | created | 2024-02-01 10:00:00
102 | created | 2024-02-01 10:05:00
Task: Collapse consecutive duplicate statuses per order_id, then output order_id, status, valid_from, valid_to, is_current. Use a high date for open-ended valid_to.
Hints
- Filter out consecutive duplicates using LAG(status)
- Use LEAD(status_time) to set valid_to
- Mark the last row per order_id as is_current = true
Show solution
WITH ordered AS (
SELECT
order_id,
status,
status_time,
LAG(status) OVER (
PARTITION BY order_id ORDER BY status_time
) AS prev_status
FROM orders_status_log
), changes_only AS (
SELECT order_id, status, status_time
FROM ordered
WHERE prev_status IS NULL OR status != prev_status
), scd AS (
SELECT
order_id,
status,
status_time AS valid_from,
LEAD(status_time) OVER (
PARTITION BY order_id ORDER BY status_time
) AS next_time
FROM changes_only
)
SELECT
order_id,
status,
valid_from,
COALESCE(next_time, TIMESTAMP '9999-12-31 23:59:59') AS valid_to,
CASE WHEN next_time IS NULL THEN TRUE ELSE FALSE END AS is_current
FROM scd
ORDER BY order_id, valid_from;
Expected output
order_id | status | valid_from | valid_to | is_current
-------- | ------- | -------------------- | --------------------- | ----------
101 | created | 2024-01-05 08:00:00 | 2024-01-05 09:10:00 | false
101 | paid | 2024-01-05 09:10:00 | 2024-01-06 14:00:00 | false
101 | shipped | 2024-01-06 14:00:00 | 9999-12-31 23:59:59 | true
102 | created | 2024-02-01 10:00:00 | 9999-12-31 23:59:59 | true
Self-check checklist
- [ ] For dedup, COUNT(*) equals COUNT(DISTINCT key) after filtering rn = 1
- [ ] ORDER BY includes deterministic tie-breakers
- [ ] For SCD2, no overlapping periods for the same key
- [ ] Exactly one is_current = true per key
Common mistakes and how to self-check
- Using RANK instead of ROW_NUMBER for dedup. Ties with RANK can produce multiple rows per key. Fix: use ROW_NUMBER and deterministic ORDER BY.
- Missing ORDER BY in window. Without it, “latest” is undefined. Always specify your recency column and tie-breakers.
- Partitioning by the wrong key. Double-check the business entity you dedup (customer_id vs email).
- Overlapping SCD periods. Ensure valid_from of the next row is exactly the boundary for the previous row’s valid_to.
- Forgetting to collapse consecutive duplicates before building SCD. Use LAG to detect no-change rows and remove them.
- Timezone drift. Align timestamps to a single timezone or use DATE boundaries consistently.
Quick validations
-- Dedup: verify one row per key
SELECT COUNT(*) AS total, COUNT(DISTINCT customer_id) AS unique_ids
FROM deduped; -- should match
-- SCD: check overlaps
SELECT order_id
FROM orders_dim_scd2 d
JOIN orders_dim_scd2 n
ON d.order_id = n.order_id
AND d.valid_from < COALESCE(n.valid_to, TIMESTAMP '9999-12-31')
AND COALESCE(d.valid_to, TIMESTAMP '9999-12-31') > n.valid_from
AND (d.status, d.valid_from) <> (n.status, n.valid_from);
-- should return zero rows
Practical projects
- Customer unification: merge CRM, web signups, and partner feeds; dedup to one trusted row per customer_id.
- User plan SCD2: build a dimension tracking plan changes from daily snapshots; ensure no gaps and single is_current.
- Product price history: generate SCD2 price bands from a price change log and validate overlaps.
Who this is for
- Analytics Engineers and BI developers cleaning and modeling raw data
- Data Analysts who need reliable “latest value” views
- Data Engineers building curated marts for reporting
Prerequisites
- Comfort with SELECT, WHERE, GROUP BY, and JOINs
- Basics of window functions (PARTITION BY, ORDER BY)
- Understanding of SCD Type 2 concepts (valid_from/valid_to, is_current)
Learning path
- Review window function basics (ROW_NUMBER, RANK, LAG/LEAD)
- Practice deterministic dedup with multiple tie-breakers
- Build SCD2 from change logs (LEAD) and from snapshots (LAG)
- Add data quality checks for overlaps and cardinality
- Apply in your warehouse models and schedule validations
Next steps
- Harden models with tests that ensure one current record per key
- Add source priority mappings to centralize tie-break logic
- Parameterize high-date and date boundary conventions for consistency
Mini challenge
You have a daily product catalog snapshot with columns (product_id, snapshot_date, price, currency). Build an SCD2 table tracking price changes only. Requirements: collapse consecutive same prices, set valid_to as day before next change, and flag is_current.
- [ ] Use LAG(price) to detect changes
- [ ] Create valid_from at change points
- [ ] Set valid_to with LEAD(valid_from) - 1 day
- [ ] Validate no overlaps and a single is_current per product
Check your knowledge
Take the quick test to lock in the concepts. Anyone can take it for free; only logged-in users will have their progress saved.