
Reusable SQL Transformations

Learn Reusable SQL Transformations for free with explanations, exercises, and a quick test (for Data Engineers).

Published: January 8, 2026 | Updated: January 8, 2026

Why this matters

Reusable SQL transformations let you deliver changes faster and with fewer bugs. Instead of rewriting the same logic (deduplication, type normalization, SCD handling), you build reliable templates you can apply across datasets and projects.

  • Lower maintenance: fix a pattern once, use it everywhere.
  • Consistency: same business rules across domains.
  • Speed: onboard new data sources faster with proven building blocks.

Concept explained simply

A reusable SQL transformation is a repeatable pattern for common data tasks, parameterized by inputs like table names, key columns, and timestamps. You can implement it with views, CTEs, stored procedures, or templated SQL (e.g., placeholders you replace during deployment).
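As a concrete sketch, one deployment style is a SQL file with placeholders that a deploy script substitutes before execution (the file and placeholder names here are illustrative, not tied to any specific tool):

```sql
-- Template file: dedupe_latest.sql (hypothetical name).
-- A deploy step replaces {src_table}, {key_col}, {ts_col} before running it.
CREATE OR REPLACE VIEW {src_table}_latest AS
SELECT * EXCEPT(rn)
FROM (
  SELECT
    *,
    ROW_NUMBER() OVER (PARTITION BY {key_col} ORDER BY {ts_col} DESC) AS rn
  FROM {src_table}
)
WHERE rn = 1;
```

The same template is registered once and stamped out as one view per source table.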

Jargon decoder
  • CTE: a named subquery using WITH; great for structuring steps.
  • Idempotent: running the same logic twice yields the same result.
  • SCD Type 2: keeps history of changes with start/end timestamps or current flags.

Mental model

Think “Lego bricks.” Each brick solves one focused task (e.g., dedupe by key), is deterministic, and exposes clear parameters. You assemble bricks into pipelines without rewriting the internals.

  • Input contract: required columns and types.
  • Parameters: keys, timestamps, filters, mode (full vs incremental).
  • Output contract: schema, keys, and guarantees (unique keys, no nulls in PKs, etc.).
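These contracts are easiest to enforce when they are written down next to the code. A minimal sketch, as a comment header on a template (names are illustrative):

```sql
-- Template : dedupe_latest (hypothetical)
-- Input    : {src_table} with non-null {key_cols} and a {ts_col} of type TIMESTAMP
-- Params   : {src_table}, {key_cols}, {ts_col}
-- Output   : same columns as input, exactly one row per {key_cols}
```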

Reusable transformation design checklist

  • State the input schema, required columns, and assumptions.
  • Handle nulls and type conversions explicitly (safe casts, COALESCE).
  • Be deterministic (e.g., tie-breakers in ORDER BY).
  • Expose parameters clearly (table names, key columns, timestamps).
  • Emit stable column names and data types.
  • Prefer idempotent patterns (can re-run without duplicates).
  • Add data quality asserts where possible (e.g., check unique keys).
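The last checklist item can often be a single query run right after the transformation. A sketch, assuming placeholder table and key names:

```sql
-- Data quality assert: returns zero rows when the unique-key guarantee holds;
-- any returned row is a contract violation to investigate.
SELECT {key_cols}, COUNT(*) AS duplicate_count
FROM {output_table}
GROUP BY {key_cols}
HAVING COUNT(*) > 1;
```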

Worked examples

Example 1 — Deduplicate by business key, keep latest by timestamp

Pattern: apply ROW_NUMBER() partitioned by the business key and ordered by the timestamp descending, then keep only rows where rn = 1. A secondary tie-breaker keeps the pick deterministic when timestamps collide.

-- Parameters to replace: {src_table}, {key_cols}, {ts_col}
WITH ranked AS (
  SELECT
    *,
    ROW_NUMBER() OVER (
      PARTITION BY {key_cols}
      ORDER BY {ts_col} DESC, COALESCE(updated_at, {ts_col}) DESC, TO_HEX(FARM_FINGERPRINT(TO_JSON_STRING(t))) ASC
    ) AS rn
  FROM {src_table} AS t
)
SELECT * EXCEPT(rn)
FROM ranked
WHERE rn = 1;

Notes:

  • Use a stable tie-breaker when timestamps are equal.
  • For engines without TO_HEX/FARM_FINGERPRINT, substitute any stable deterministic expression or a natural surrogate key.

Example 2 — Surrogate key via hash of business keys

Pattern: normalize, coalesce, trim, then hash. Ensures same inputs produce same key across loads.

-- Parameters: {src_table}, {business_keys...}
WITH normalized AS (
  SELECT
    TRIM(LOWER(COALESCE(CAST(customer_email AS STRING), ''))) AS customer_email_norm,
    TRIM(LOWER(COALESCE(CAST(customer_id AS STRING), '')))    AS customer_id_norm,
    *
  FROM {src_table}
), with_sk AS (
  SELECT
    TO_HEX(SHA256(CONCAT_WS('|', customer_id_norm, customer_email_norm))) AS customer_sk,
    *
  FROM normalized
)
SELECT * FROM with_sk;

Notes:

  • Use a consistent delimiter and order of keys.
  • Choose a stable hash (e.g., SHA256). For engines without CONCAT_WS, use CONCAT with an explicit delimiter.

Example 3 — SCD Type 2 from change events (current_flag + valid_from/to)

Pattern: hash the tracked columns per row, drop rows whose hash matches the previous row for the same key (no real change), then derive validity intervals by looking ahead to the next change.

-- Parameters: {src_table}, {key_col}, {ts_col}, {tracked_cols...}
WITH hashed AS (
  SELECT
    *,
    -- Change hash combines tracked columns to detect changes deterministically
    TO_HEX(SHA256(CONCAT_WS('|', {tracked_cols}))) AS change_hash
  FROM {src_table}
), changed AS (
  -- Drop rows where the tracked values did not change versus the previous row
  SELECT * EXCEPT(prev_hash)
  FROM (
    SELECT
      *,
      LAG(change_hash) OVER (PARTITION BY {key_col} ORDER BY {ts_col}) AS prev_hash
    FROM hashed
  )
  WHERE prev_hash IS NULL OR prev_hash <> change_hash
), ordered AS (
  SELECT
    *,
    LEAD({ts_col}) OVER (PARTITION BY {key_col} ORDER BY {ts_col}) AS next_ts
  FROM changed
), ranges AS (
  SELECT
    {key_col},
    {tracked_cols},
    {ts_col}                                           AS valid_from,
    COALESCE(next_ts, TIMESTAMP '9999-12-31 00:00:00') AS valid_to,
    next_ts IS NULL                                    AS current_flag
  FROM ordered
)
SELECT * FROM ranges;

Notes:

  • Ensure input is already deduplicated by key and timestamp.
  • Adjust timestamp types and literals per engine.
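A quick validation for the SCD2 output above, assuming it is materialized into a table (here called scd2_output, a placeholder) with the column names from the example:

```sql
-- Each key should have exactly one current row; any result rows signal a bug
SELECT {key_col}, COUNT(*) AS current_rows
FROM scd2_output
WHERE current_flag
GROUP BY {key_col}
HAVING COUNT(*) <> 1;
```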

Reusable patterns you can adopt today

  • Deduplicate by key + latest timestamp
  • Surrogate key hashing with normalized text
  • SCD Type 2 intervals with LEAD
  • Type normalization (safe casts; default values; unit conversions)
  • Date spine join (ensure complete daily series before metrics)
  • Idempotent incremental MERGE by primary key and updated_at
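The last pattern in the list can be sketched as a MERGE keyed on the primary key. This is an illustration with hypothetical table and column names ({tgt_table}, {staged_rows}, {key_col}, payload, updated_at); MERGE syntax varies by engine:

```sql
-- Idempotent incremental load: re-running with the same staged rows changes
-- nothing, because matched rows update only when the staged row is strictly newer.
MERGE INTO {tgt_table} AS t
USING {staged_rows} AS s              -- must be deduplicated to one row per {key_col}
ON t.{key_col} = s.{key_col}
WHEN MATCHED AND s.updated_at > t.updated_at THEN
  UPDATE SET payload = s.payload, updated_at = s.updated_at
WHEN NOT MATCHED THEN
  INSERT ({key_col}, payload, updated_at)
  VALUES (s.{key_col}, s.payload, s.updated_at);
```

Asserting key uniqueness on the staged rows first (see the checklist above) keeps the MERGE unambiguous.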

Exercises you will build now

Complete the two exercises below. A quick test is available to everyone; only logged-in users have their progress saved.

Exercise 1 — Reusable dedup by key and timestamp

Goal: write a reusable SQL snippet that keeps the latest record per (user_id) using event_time. Break ties deterministically.

Starter input (events)
-- events
user_id | event_time           | source | payload
------- | -------------------- | ------ | -------
101     | 2024-01-01 10:00:00  | web    | {...}
101     | 2024-01-01 10:00:00  | app    | {...}
101     | 2024-01-02 09:00:00  | web    | {...}
202     | 2024-02-01 08:30:00  | web    | {...}
202     | 2024-02-01 08:30:00  | web    | {...}

  • Write a CTE that ranks records by (user_id, event_time desc, stable tie-breaker).
  • Return only rn = 1 rows.
  • Make the pattern easy to reuse by replacing literal names with placeholders in comments.

Exercise 2 — Surrogate key hashing template

Goal: create a stable surrogate key customer_sk from (customer_id, customer_email) with normalization.

Starter input (raw_customers)
-- raw_customers
customer_id | customer_email        | country
----------- | --------------------- | -------
C-001       | Alice@example.com     | US
c-001       |   alice@example.com   | US
C-002       | bob@example.com       | CA

  • Lowercase and trim text fields.
  • COALESCE nulls to empty strings before hashing.
  • Use a consistent delimiter and order of business keys.

Self-check after exercises

  • Can you point to the deterministic tie-breaker in Exercise 1?
  • Does your surrogate key produce the same value for logically identical rows?
  • Did you document input/output columns and assumptions?

Common mistakes and how to catch them

  • Missing tie-breakers: two rows have same timestamp, causing nondeterministic picks. Fix by adding a stable ordering column.
  • Hash instability: different normalization or key order creates different hashes for the same entity. Fix by consistent normalization and key ordering.
  • Implicit casts: engine-dependent results. Fix by explicit, safe casts and defaults.
  • Non-idempotent increments: MERGE without unique keys. Fix by asserting key uniqueness before merging.
  • Unstated contracts: unclear required columns. Fix by documenting input/output schemas in comments.

Practical projects

  • Build a library of 5 SQL templates: dedupe, SK hashing, SCD2, type normalization, date spine join.
  • Create a sample retail mart where each template is applied to at least 2 source tables.
  • Add a validation query per template (e.g., check unique keys, count gaps in date spine).

Who this is for

  • Data Engineers building ELT pipelines in SQL-centric platforms.
  • Analytics Engineers maintaining marts and dimensional models.
  • Anyone standardizing repeatable SQL logic across teams.

Prerequisites

  • Comfort with SQL SELECT, JOIN, GROUP BY, and window functions.
  • Experience with CTEs and views.
  • Basic understanding of dimensional modeling (keys, SCD).

Learning path

  1. Master window functions for ranking and change detection.
  2. Learn robust normalization and casting patterns.
  3. Implement SCD2 with LEAD/LAG and validate ranges.
  4. Package patterns as parameterized templates (document inputs/outputs).
  5. Adopt idempotent incremental loads with MERGE semantics.

Next steps

  • Finish the exercises below (solutions provided in collapsible blocks).
  • Take the quick test to check your understanding. Progress is saved if you are logged in.
  • Start a small internal template library and share with your team.

Mini challenge

Create a reusable SQL snippet that builds a complete daily date spine between two dates and LEFT JOINs a fact table, filling missing days with zeros for a metric. Document parameters and outputs.

Practice Exercises

2 exercises to complete

Instructions

Write a deduplication SQL that returns only the latest row per user_id, using event_time to order and a deterministic tie-breaker when event_time is equal. Treat the table name and column names as parameters you could replace.

Input table: events(user_id, event_time, source, payload). Output should contain exactly one row per user_id — the latest by event_time; if ties, prefer by a stable expression.

Expected Output
One row per user_id with the latest event. For the sample in lesson: user_id 101 → 2024-01-02 09:00:00; user_id 202 → 2024-02-01 08:30:00 (one of duplicates chosen deterministically).

Reusable SQL Transformations — Quick Test

Test your knowledge with 8 questions. Pass with 70% or higher.

