Why this matters
Reusable SQL transformations let you deliver changes faster and with fewer bugs. Instead of rewriting the same logic (deduplication, type normalization, SCD handling), you build reliable templates you can apply across datasets and projects.
- Lower maintenance: fix a pattern once, use it everywhere.
- Consistency: same business rules across domains.
- Speed: onboard new data sources faster with proven building blocks.
Concept explained simply
A reusable SQL transformation is a repeatable pattern for common data tasks, parameterized by inputs like table names, key columns, and timestamps. You can implement it with views, CTEs, stored procedures, or templated SQL (e.g., placeholders you replace during deployment).
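For example, here is a tiny template and one possible rendering (the table and column names are illustrative, not from a real project):
-- Template (placeholders in braces are replaced at deploy time)
SELECT {key_cols}, MAX({ts_col}) AS last_seen
FROM {src_table}
GROUP BY {key_cols};

-- Rendered for one hypothetical dataset
SELECT user_id, MAX(event_time) AS last_seen
FROM analytics.events
GROUP BY user_id;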
Jargon decoder
- CTE: a named subquery using WITH; great for structuring steps.
- Idempotent: running the same logic twice yields the same result.
- SCD Type 2: keeps history of changes with start/end timestamps or current flags.
Mental model
Think “Lego bricks.” Each brick solves one focused task (e.g., dedupe by key), is deterministic, and exposes clear parameters. You assemble bricks into pipelines without rewriting the internals.
- Input contract: required columns and types.
- Parameters: keys, timestamps, filters, mode (full vs incremental).
- Output contract: schema, keys, and guarantees (unique keys, no nulls in PKs, etc.).
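One lightweight way to make these contracts explicit is a header comment at the top of each template; a sketch (names illustrative):
-- Template: dedupe_latest
-- Input contract : {src_table} has non-null {key_cols} and a {ts_col} TIMESTAMP
-- Parameters     : {src_table}, {key_cols}, {ts_col}
-- Output contract: same columns as input, exactly one row per {key_cols}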
Reusable transformation design checklist
- State the input schema, required columns, and assumptions.
- Handle nulls and type conversions explicitly (safe casts, COALESCE).
- Be deterministic (e.g., tie-breakers in ORDER BY).
- Expose parameters clearly (table names, key columns, timestamps).
- Emit stable column names and data types.
- Prefer idempotent patterns (can re-run without duplicates).
- Add data quality asserts where possible (e.g., check unique keys).
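A simple assert can be a query that returns rows only when a guarantee is violated; for example, a uniqueness check (placeholders as above, {output_table} illustrative):
-- Returns rows only if the unique-key guarantee is violated
SELECT {key_cols}, COUNT(*) AS duplicates
FROM {output_table}
GROUP BY {key_cols}
HAVING COUNT(*) > 1;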
Worked examples
Example 1 — Deduplicate by business key, keep latest by timestamp
Pattern: rank rows with ROW_NUMBER(), partitioned by the business key and ordered by timestamp descending, then keep rn = 1. A secondary tie-breaker keeps the result deterministic.
-- Parameters to replace: {src_table}, {key_cols}, {ts_col}
WITH ranked AS (
  SELECT
    *,
    ROW_NUMBER() OVER (
      PARTITION BY {key_cols}
      ORDER BY
        {ts_col} DESC,
        COALESCE(updated_at, {ts_col}) DESC,  -- assumes an updated_at column; drop if your source lacks one
        FARM_FINGERPRINT(TO_JSON_STRING(t)) ASC  -- stable tie-breaker over the whole row
    ) AS rn
  FROM {src_table} AS t
)
SELECT * EXCEPT(rn)
FROM ranked
WHERE rn = 1;
Notes:
- Use a stable tie-breaker when timestamps are equal.
- FARM_FINGERPRINT, TO_JSON_STRING, and EXCEPT(...) are BigQuery syntax; on other engines, use any stable deterministic expression (or include a natural surrogate key) and list the output columns explicitly.
Example 2 — Surrogate key via hash of business keys
Pattern: normalize (coalesce, trim, lowercase), then hash. The same inputs then produce the same key across loads.
-- Parameters: {src_table}, {business_keys...}
WITH normalized AS (
  SELECT
    TRIM(LOWER(COALESCE(CAST(customer_email AS STRING), ''))) AS customer_email_norm,
    TRIM(LOWER(COALESCE(CAST(customer_id AS STRING), ''))) AS customer_id_norm,
    *
  FROM {src_table}
), with_sk AS (
  SELECT
    TO_HEX(SHA256(CONCAT_WS('|', customer_id_norm, customer_email_norm))) AS customer_sk,
    *
  FROM normalized
)
SELECT * FROM with_sk;
Notes:
- Use a consistent delimiter and order of keys.
- Choose a stable hash (e.g., SHA256). For engines without CONCAT_WS, use CONCAT with an explicit delimiter.
Example 3 — SCD Type 2 from change events (current_flag + valid_from/to)
Pattern: hash the tracked columns, keep only rows where the hash differs from the previous row for that key, then derive validity intervals with LEAD.
-- Parameters: {src_table}, {key_col}, {ts_col}, {tracked_cols...}
WITH hashed AS (
  SELECT
    *,
    -- Change hash combines tracked columns to detect changes deterministically
    TO_HEX(SHA256(CONCAT_WS('|', {tracked_cols}))) AS change_hash
  FROM {src_table}
), changed AS (
  -- Keep only rows where a tracked column actually changed
  SELECT * FROM (
    SELECT
      *,
      LAG(change_hash) OVER (PARTITION BY {key_col} ORDER BY {ts_col}) AS prev_hash
    FROM hashed
  )
  WHERE prev_hash IS NULL OR prev_hash <> change_hash
), ranges AS (
  SELECT
    {key_col},
    {tracked_cols},
    {ts_col} AS valid_from,
    COALESCE(
      LEAD({ts_col}) OVER (PARTITION BY {key_col} ORDER BY {ts_col}),
      TIMESTAMP '9999-12-31 00:00:00'
    ) AS valid_to,
    LEAD({ts_col}) OVER (PARTITION BY {key_col} ORDER BY {ts_col}) IS NULL AS current_flag
  FROM changed
)
SELECT * FROM ranges;
Notes:
- Ensure input is already deduplicated by key and timestamp.
- Cast non-string tracked columns to STRING before hashing.
- Adjust timestamp types and literals per engine.
Reusable patterns you can adopt today
- Deduplicate by key + latest timestamp
- Surrogate key hashing with normalized text
- SCD Type 2 intervals with LEAD
- Type normalization (safe casts; default values; unit conversions)
- Date spine join (ensure complete daily series before metrics)
- Idempotent incremental MERGE by primary key and updated_at
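The last pattern deserves a sketch, since MERGE syntax varies by engine. A minimal idempotent upsert in BigQuery-style SQL (the tables and columns are illustrative placeholders):
MERGE analytics.customers AS tgt
USING (
  SELECT customer_id, email, country, updated_at
  FROM staging.customers_stg  -- assumed already deduplicated: one row per customer_id
) AS src
ON tgt.customer_id = src.customer_id
WHEN MATCHED AND src.updated_at > tgt.updated_at THEN
  UPDATE SET email = src.email, country = src.country, updated_at = src.updated_at
WHEN NOT MATCHED THEN
  INSERT (customer_id, email, country, updated_at)
  VALUES (src.customer_id, src.email, src.country, src.updated_at);
Re-running the same load is safe: matched rows update only when the source is strictly newer, and unmatched rows are inserted once.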
Exercises you will build now
Complete the two exercises below. A quick test is available to everyone; only logged-in users have their progress saved.
Exercise 1 — Reusable dedup by key and timestamp
Goal: write a reusable SQL snippet that keeps the latest record per (user_id) using event_time. Break ties deterministically.
Starter input (events)
-- events
user_id | event_time | source | payload
------- | -------------------- | ------ | -------
101 | 2024-01-01 10:00:00 | web | {...}
101 | 2024-01-01 10:00:00 | app | {...}
101 | 2024-01-02 09:00:00 | web | {...}
202 | 2024-02-01 08:30:00 | web | {...}
202 | 2024-02-01 08:30:00 | web | {...}
- Write a CTE that ranks records by (user_id, event_time desc, stable tie-breaker).
- Return only rn = 1 rows.
- Make the pattern easy to reuse by replacing literal names with placeholders in comments.
Exercise 2 — Surrogate key hashing template
Goal: create a stable surrogate key customer_sk from (customer_id, customer_email) with normalization.
Starter input (raw_customers)
-- raw_customers
customer_id | customer_email | country
----------- | --------------------- | -------
C-001 | Alice@example.com | US
c-001 | alice@example.com | US
C-002 | bob@example.com | CA
- Lowercase and trim text fields.
- COALESCE nulls to empty strings before hashing.
- Use a consistent delimiter and order of business keys.
Self-check after exercises
- Can you point to the deterministic tie-breaker in Exercise 1?
- Does your surrogate key produce the same value for logically identical rows?
- Did you document input/output columns and assumptions?
Common mistakes and how to catch them
- Missing tie-breakers: two rows share the same timestamp, causing nondeterministic picks. Fix by adding a stable ordering column.
- Hash instability: different normalization or key order creates different hashes for the same entity. Fix by consistent normalization and key ordering.
- Implicit casts: engine-dependent results. Fix by explicit, safe casts and defaults.
- Non-idempotent increments: MERGE without unique keys. Fix by asserting key uniqueness before merging.
- Unstated contracts: unclear required columns. Fix by documenting input/output schemas in comments.
Practical projects
- Build a library of 5 SQL templates: dedupe, SK hashing, SCD2, type normalization, date spine join.
- Create a sample retail mart where each template is applied to at least 2 source tables.
- Add a validation query per template (e.g., check unique keys, count gaps in date spine).
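As one example of a per-template validation, a gap check for a date spine (table and column names are illustrative):
-- The spine should cover every day between its min and max date
SELECT
  DATE_DIFF(MAX(spine_date), MIN(spine_date), DAY) + 1 AS expected_days,
  COUNT(DISTINCT spine_date) AS actual_days
FROM date_spine;  -- a gap exists when expected_days <> actual_days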
Who this is for
- Data Engineers building ELT pipelines in SQL-centric platforms.
- Analytics Engineers maintaining marts and dimensional models.
- Anyone standardizing repeatable SQL logic across teams.
Prerequisites
- Comfort with SQL SELECT, JOIN, GROUP BY, and window functions.
- Experience with CTEs and views.
- Basic understanding of dimensional modeling (keys, SCD).
Learning path
- Master window functions for ranking and change detection.
- Learn robust normalization and casting patterns.
- Implement SCD2 with LEAD/LAG and validate ranges.
- Package patterns as parameterized templates (document inputs/outputs).
- Adopt idempotent incremental loads with MERGE semantics.
Next steps
- Finish the exercises above (solutions provided in collapsible blocks).
- Take the quick test to check your understanding. Progress is saved if you are logged in.
- Start a small internal template library and share with your team.
Mini challenge
Create a reusable SQL snippet that builds a complete daily date spine between two dates and LEFT JOINs a fact table, filling missing days with zeros for a metric. Document parameters and outputs.