Why this matters
Reusable SQL transformations let you deliver changes faster and with fewer bugs. Instead of rewriting the same logic (deduplication, type normalization, SCD handling), you build reliable templates you can apply across datasets and projects.
- Lower maintenance: fix a pattern once, use it everywhere.
- Consistency: same business rules across domains.
- Speed: onboard new data sources faster with proven building blocks.
Concept explained simply
A reusable SQL transformation is a repeatable pattern for common data tasks, parameterized by inputs like table names, key columns, and timestamps. You can implement it with views, CTEs, stored procedures, or templated SQL (e.g., placeholders you replace during deployment).
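For example, here is a tiny template and one possible rendering (the table and column names are illustrative, not from a real project):
-- Template (placeholders in braces are replaced at deploy time)
SELECT {key_cols}, MAX({ts_col}) AS last_seen
FROM {src_table}
GROUP BY {key_cols};

-- Rendered for one hypothetical dataset
SELECT user_id, MAX(event_time) AS last_seen
FROM analytics.events
GROUP BY user_id;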
Jargon decoder
- CTE: a named subquery using WITH; great for structuring steps.
- Idempotent: running the same logic twice yields the same result.
- SCD Type 2: keeps history of changes with start/end timestamps or current flags.
Mental model
Think “Lego bricks.” Each brick solves one focused task (e.g., dedupe by key), is deterministic, and exposes clear parameters. You assemble bricks into pipelines without rewriting the internals.
- Input contract: required columns and types.
- Parameters: keys, timestamps, filters, mode (full vs incremental).
- Output contract: schema, keys, and guarantees (unique keys, no nulls in PKs, etc.).
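One lightweight way to make these contracts explicit is a header comment at the top of each template; a sketch (names illustrative):
-- Template: dedupe_latest
-- Input contract : {src_table} has non-null {key_cols} and a {ts_col} TIMESTAMP
-- Parameters     : {src_table}, {key_cols}, {ts_col}
-- Output contract: same columns as input, exactly one row per {key_cols}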
Reusable transformation design checklist
- State the input schema, required columns, and assumptions.
- Handle nulls and type conversions explicitly (safe casts, COALESCE).
- Be deterministic (e.g., tie-breakers in ORDER BY).
- Expose parameters clearly (table names, key columns, timestamps).
- Emit stable column names and data types.
- Prefer idempotent patterns (can re-run without duplicates).
- Add data quality asserts where possible (e.g., check unique keys).
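A simple assert can be a query that returns rows only when a guarantee is violated; for example, a uniqueness check (placeholders as above, {output_table} illustrative):
-- Returns rows only if the unique-key guarantee is violated
SELECT {key_cols}, COUNT(*) AS duplicates
FROM {output_table}
GROUP BY {key_cols}
HAVING COUNT(*) > 1;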
Worked examples
Example 1 — Deduplicate by business key, keep latest by timestamp
Pattern: rank rows with ROW_NUMBER(), partitioned by the business key and ordered by timestamp descending, then keep rn = 1. A secondary tie-breaker keeps the result deterministic.
-- Parameters to replace: {src_table}, {key_cols}, {ts_col}
WITH ranked AS (
  SELECT
    *,
    ROW_NUMBER() OVER (
      PARTITION BY {key_cols}
      ORDER BY
        {ts_col} DESC,
        COALESCE(updated_at, {ts_col}) DESC,  -- assumes an updated_at column; drop if your source lacks one
        FARM_FINGERPRINT(TO_JSON_STRING(t)) ASC  -- stable tie-breaker over the whole row
    ) AS rn
  FROM {src_table} AS t
)
SELECT * EXCEPT(rn)
FROM ranked
WHERE rn = 1;
Notes:
- Use a stable tie-breaker when timestamps are equal.
- FARM_FINGERPRINT, TO_JSON_STRING, and EXCEPT(...) are BigQuery syntax; on other engines, use any stable deterministic expression (or include a natural surrogate key) and list the output columns explicitly.
Example 2 — Surrogate key via hash of business keys
Pattern: normalize (coalesce, trim, lowercase), then hash. The same inputs then produce the same key across loads.
-- Parameters: {src_table}, {business_keys...}
WITH normalized AS (
  SELECT
    TRIM(LOWER(COALESCE(CAST(customer_email AS STRING), ''))) AS customer_email_norm,
    TRIM(LOWER(COALESCE(CAST(customer_id AS STRING), ''))) AS customer_id_norm,
    *
  FROM {src_table}
), with_sk AS (
  SELECT
    TO_HEX(SHA256(CONCAT_WS('|', customer_id_norm, customer_email_norm))) AS customer_sk,
    *
  FROM normalized
)
SELECT * FROM with_sk;
Notes:
- Use a consistent delimiter and order of keys.
- Choose a stable hash (e.g., SHA256). For engines without CONCAT_WS, use CONCAT with an explicit delimiter.
Example 3 — SCD Type 2 from change events (current_flag + valid_from/to)
Pattern: hash the tracked columns, keep only rows where the hash differs from the previous row for that key, then derive validity intervals with LEAD.
-- Parameters: {src_table}, {key_col}, {ts_col}, {tracked_cols...}
WITH hashed AS (
  SELECT
    *,
    -- Change hash combines tracked columns to detect changes deterministically
    TO_HEX(SHA256(CONCAT_WS('|', {tracked_cols}))) AS change_hash
  FROM {src_table}
), changed AS (
  -- Keep only rows where a tracked column actually changed
  SELECT * FROM (
    SELECT
      *,
      LAG(change_hash) OVER (PARTITION BY {key_col} ORDER BY {ts_col}) AS prev_hash
    FROM hashed
  )
  WHERE prev_hash IS NULL OR prev_hash <> change_hash
), ranges AS (
  SELECT
    {key_col},
    {tracked_cols},
    {ts_col} AS valid_from,
    COALESCE(
      LEAD({ts_col}) OVER (PARTITION BY {key_col} ORDER BY {ts_col}),
      TIMESTAMP '9999-12-31 00:00:00'
    ) AS valid_to,
    LEAD({ts_col}) OVER (PARTITION BY {key_col} ORDER BY {ts_col}) IS NULL AS current_flag
  FROM changed
)
SELECT * FROM ranges;
Notes:
- Ensure input is already deduplicated by key and timestamp.
- Cast non-string tracked columns to STRING before hashing.
- Adjust timestamp types and literals per engine.
Reusable patterns you can adopt today
- Deduplicate by key + latest timestamp
- Surrogate key hashing with normalized text
- SCD Type 2 intervals with LEAD
- Type normalization (safe casts; default values; unit conversions)
- Date spine join (ensure complete daily series before metrics)
- Idempotent incremental MERGE by primary key and updated_at
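The last pattern deserves a sketch, since MERGE syntax varies by engine. A minimal idempotent upsert in BigQuery-style SQL (the tables and columns are illustrative placeholders):
MERGE analytics.customers AS tgt
USING (
  SELECT customer_id, email, country, updated_at
  FROM staging.customers_stg  -- assumed already deduplicated: one row per customer_id
) AS src
ON tgt.customer_id = src.customer_id
WHEN MATCHED AND src.updated_at > tgt.updated_at THEN
  UPDATE SET email = src.email, country = src.country, updated_at = src.updated_at
WHEN NOT MATCHED THEN
  INSERT (customer_id, email, country, updated_at)
  VALUES (src.customer_id, src.email, src.country, src.updated_at);
Re-running the same load is safe: matched rows update only when the source is strictly newer, and unmatched rows are inserted once.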
Exercises you will build now
Complete the two exercises below. A quick test is available to everyone; only logged-in users have their progress saved.
Exercise 1 — Reusable dedup by key and timestamp
Goal: write a reusable SQL snippet that keeps the latest record per (user_id) using event_time. Break ties deterministically.
Starter input (events)
-- events
user_id | event_time | source | payload
------- | -------------------- | ------ | -------
101 | 2024-01-01 10:00:00 | web | {...}
101 | 2024-01-01 10:00:00 | app | {...}
101 | 2024-01-02 09:00:00 | web | {...}
202 | 2024-02-01 08:30:00 | web | {...}
202 | 2024-02-01 08:30:00 | web | {...}
- Write a CTE that ranks records by (user_id, event_time desc, stable tie-breaker).
- Return only rn = 1 rows.
- Make the pattern easy to reuse by replacing literal names with placeholders in comments.
Exercise 2 — Surrogate key hashing template
Goal: create a stable surrogate key customer_sk from (customer_id, customer_email) with normalization.
Starter input (raw_customers)
-- raw_customers
customer_id | customer_email | country
----------- | --------------------- | -------
C-001 | Alice@example.com | US
c-001 | alice@example.com | US
C-002 | bob@example.com | CA
- Lowercase and trim text fields.
- COALESCE nulls to empty strings before hashing.
- Use a consistent delimiter and order of business keys.
Self-check after exercises
- Can you point to the deterministic tie-breaker in Exercise 1?
- Does your surrogate key produce the same value for logically identical rows?
- Did you document input/output columns and assumptions?
Common mistakes and how to catch them
- Missing tie-breakers: two rows share the same timestamp, causing nondeterministic picks. Fix by adding a stable ordering column.
- Hash instability: different normalization or key order creates different hashes for the same entity. Fix by consistent normalization and key ordering.
- Implicit casts: engine-dependent results. Fix by explicit, safe casts and defaults.
- Non-idempotent increments: MERGE without unique keys. Fix by asserting key uniqueness before merging.
- Unstated contracts: unclear required columns. Fix by documenting input/output schemas in comments.
Practical projects
- Build a library of 5 SQL templates: dedupe, SK hashing, SCD2, type normalization, date spine join.
- Create a sample retail mart where each template is applied to at least 2 source tables.
- Add a validation query per template (e.g., check unique keys, count gaps in date spine).
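As one example of a per-template validation, a gap check for a date spine (table and column names are illustrative):
-- The spine should cover every day between its min and max date
SELECT
  DATE_DIFF(MAX(spine_date), MIN(spine_date), DAY) + 1 AS expected_days,
  COUNT(DISTINCT spine_date) AS actual_days
FROM date_spine;  -- a gap exists when expected_days <> actual_days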
Who this is for
- Data Engineers building ELT pipelines in SQL-centric platforms.
- Analytics Engineers maintaining marts and dimensional models.
- Anyone standardizing repeatable SQL logic across teams.
Prerequisites
- Comfort with SQL SELECT, JOIN, GROUP BY, and window functions.
- Experience with CTEs and views.
- Basic understanding of dimensional modeling (keys, SCD).
Learning path
- Master window functions for ranking and change detection.
- Learn robust normalization and casting patterns.
- Implement SCD2 with LEAD/LAG and validate ranges.
- Package patterns as parameterized templates (document inputs/outputs).
- Adopt idempotent incremental loads with MERGE semantics.
Next steps
- Finish the exercises above (solutions provided in collapsible blocks).
- Take the quick test to check your understanding. Progress is saved if you are logged in.
- Start a small internal template library and share with your team.
Mini challenge
Create a reusable SQL snippet that builds a complete daily date spine between two dates and LEFT JOINs a fact table, filling missing days with zeros for a metric. Document parameters and outputs.