
SQL Based Transformations

Learn SQL Based Transformations for free with explanations, exercises, and a quick test (for ETL Developers).

Published: January 11, 2026 | Updated: January 11, 2026

Why this matters

As an ETL Developer, most business-ready data is produced by SQL transformations: cleaning raw inputs, joining dimensions, calculating metrics, deduplicating, and loading curated tables for analytics and downstream apps. Strong SQL transforms let you ship reliable data models, reduce pipeline complexity, and speed up delivery.

  • Create daily/weekly aggregates for product and finance
  • Join disparate sources to build customer 360 views
  • Deduplicate events and keep the latest truth
  • Apply incremental logic so pipelines run fast and cheaply

Who this is for

  • ETL Developers and Data Engineers building staging/marts
  • Analysts moving from ad-hoc queries to production-grade SQL
  • Anyone implementing ELT in a data warehouse

Prerequisites

  • Comfort with SELECT, WHERE, GROUP BY, and basic JOINs
  • Understanding of primary keys and basic data types
  • Familiarity with your warehouse dialect (PostgreSQL, Snowflake, BigQuery, etc.)

Concept explained simply

SQL-based transformations turn raw tables into clean, shaped data. Think of it as an assembly line: filter, shape, enrich, then publish.

Mental model: a 4-stage assembly line

1) Ingest: Read raw/staging tables; standardize types and basic formats.
2) Clean: Trim, case-normalize, parse dates, handle nulls (COALESCE).
3) Enrich: JOIN dimensions, compute metrics, use window functions for dedup and rollups.
4) Publish: Write incremental outputs to curated tables with MERGE/UPSERT.
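
A minimal sketch of the four stages chained together as CTEs. The table and column names here (raw_orders, dim_product, order_ts) are illustrative placeholders, not objects defined elsewhere on this page.

Sample SQL (assembly-line skeleton)
WITH ingested AS (       -- 1) Ingest: read raw/staging and standardize types
  SELECT
    CAST(order_id AS BIGINT)    AS order_id,
    CAST(order_ts AS TIMESTAMP) AS order_ts,
    product_code,
    amount
  FROM raw_orders
), cleaned AS (          -- 2) Clean: trim, normalize case, handle nulls
  SELECT
    order_id,
    order_ts,
    UPPER(TRIM(product_code)) AS product_code,
    COALESCE(amount, 0)       AS amount
  FROM ingested
), enriched AS (         -- 3) Enrich: join dimensions, compute metrics
  SELECT
    CAST(c.order_ts AS DATE) AS order_date,
    p.category,
    c.amount
  FROM cleaned c
  JOIN dim_product p ON p.product_code = c.product_code
)
-- 4) Publish: in a real pipeline this result feeds a MERGE/UPSERT into a curated table
SELECT order_date, category, SUM(amount) AS revenue
FROM enriched
GROUP BY order_date, category;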

Core operations snapshot
  • JOINs: inner to match; left to preserve base; anti-join to exclude
  • Aggregations: SUM, COUNT, AVG with GROUP BY; conditional via CASE
  • Window functions: ROW_NUMBER, RANK, SUM OVER for dedup/rollups
  • Data cleaning: TRIM, LOWER/UPPER, CAST/SAFE_CAST, COALESCE
  • Incremental patterns: watermarks, MERGE/UPSERT, dedup on keys
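
The worked examples below cover joins, aggregation, window functions, and incremental loads, but not the anti-join or conditional-aggregation patterns from this list, so here is a brief sketch reusing the customers and sales tables from Exercise 1.

Sample SQL (anti-join and conditional aggregation)
-- Anti-join: customers with no PAID order (NOT EXISTS is the most portable form).
SELECT c.customer_id
FROM customers c
WHERE NOT EXISTS (
  SELECT 1
  FROM sales s
  WHERE s.customer_id = c.customer_id
    AND s.status = 'PAID'
);

-- Conditional aggregation: split daily totals by status with CASE inside SUM.
SELECT
  order_date,
  SUM(CASE WHEN status = 'PAID'     THEN amount ELSE 0 END) AS paid_amount,
  SUM(CASE WHEN status = 'REFUNDED' THEN amount ELSE 0 END) AS refunded_amount
FROM sales
GROUP BY order_date;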

Worked examples

Example 1: Clean, join, and aggregate daily revenue

Goal: Sum paid sales by standardized city per day.

Sample SQL
WITH clean_customers AS (
  SELECT
    customer_id,
    UPPER(TRIM(city)) AS city_clean
  FROM customers
), paid_sales AS (
  SELECT order_id, customer_id, order_date, amount
  FROM sales
  WHERE status = 'PAID'
)
SELECT
  s.order_date,
  c.city_clean AS city,
  SUM(s.amount) AS total_paid
FROM paid_sales s
JOIN clean_customers c USING (customer_id)
GROUP BY 1, 2
ORDER BY 1, 2;

Example 2: Deduplicate to latest record per key

Goal: Keep the latest device status per device_id.

Sample SQL
WITH ranked AS (
  SELECT
    device_id,
    status,
    updated_at,
    ROW_NUMBER() OVER (PARTITION BY device_id ORDER BY updated_at DESC) AS rn
  FROM events
)
SELECT device_id, status, updated_at
FROM ranked
WHERE rn = 1;

Example 3: Incremental upsert with watermark

Goal: Only process rows newer than the last watermark and upsert into the target table.

Sample SQL (generic MERGE)
-- Assume last_watermark is provided by your orchestrator
WITH staged AS (
  SELECT *
  FROM staging_customers
  WHERE updated_at > :last_watermark
)
MERGE INTO customer_dim t
USING staged s
  ON t.customer_id = s.customer_id
WHEN MATCHED THEN UPDATE SET
  email = s.email,
  city = s.city,
  updated_at = s.updated_at
WHEN NOT MATCHED THEN INSERT (customer_id, email, city, updated_at)
VALUES (s.customer_id, s.email, s.city, s.updated_at);

Alternative (PostgreSQL upsert)
INSERT INTO customer_dim (customer_id, email, city, updated_at)
SELECT customer_id, email, city, updated_at
FROM staging_customers
WHERE updated_at > :last_watermark
ON CONFLICT (customer_id)
DO UPDATE SET
  email = EXCLUDED.email,
  city = EXCLUDED.city,
  updated_at = EXCLUDED.updated_at;

Hands-on exercises


Exercise 1: Join, clean, and aggregate

Your task: Produce daily revenue by standardized city for PAID orders.

Dataset
customers(customer_id, city)
1, 'New York'
2, 'new york'
3, 'Boston '
4, 'boston'
5, 'Chicago'

sales(order_id, customer_id, order_date, amount, status)
100, 1, '2025-01-01', 50, 'PAID'
101, 2, '2025-01-01', 30, 'REFUNDED'
102, 2, '2025-01-01', 20, 'PAID'
103, 3, '2025-01-02', 40, 'PAID'
104, 4, '2025-01-02', 60, 'PAID'
105, 5, '2025-01-02', 70, 'CANCELLED'
  • Trim and uppercase city
  • Filter status to PAID
  • Group by order_date and city
Expected result
order_date   | city      | total_paid
2025-01-01   | NEW YORK  | 70
2025-01-02   | BOSTON    | 100

Exercise 2: Latest record per device

Your task: Return the latest status per device.

Dataset
events(device_id, status, updated_at)
'A', 'online',      '2025-01-01 10:00'
'A', 'offline',     '2025-01-01 12:00'
'B', 'online',      '2025-01-01 09:00'
'B', 'maintenance', '2025-01-02 08:00'
'C', 'offline',     '2025-01-01 11:00'
  • Use ROW_NUMBER over device_id ordered by updated_at desc
  • Filter rn = 1
Expected result
device_id | status       | updated_at
A         | offline      | 2025-01-01 12:00
B         | maintenance  | 2025-01-02 08:00
C         | offline      | 2025-01-01 11:00

Exercise 3: Incremental upsert

Your task: Upsert changes since a watermark.

Dataset
Target table: customer_dim(customer_id, email, city, updated_at)
Existing rows:
1, 'alice@x.com', 'NY', '2025-01-01 00:00'
2, 'bob@x.com',   'SF', '2025-01-01 00:00'

Staging since :last_watermark = '2025-01-01 00:00'
2, 'bob_new@x.com', 'SF', '2025-01-03 00:00'
3, 'carol@x.com',   'LA', '2025-01-02 00:00'
  • Filter staging by watermark
  • Match on customer_id
  • Update email/city/updated_at on match; insert otherwise
Expected result
customer_id | email          | city | updated_at
1           | alice@x.com    | NY   | 2025-01-01 00:00
2           | bob_new@x.com  | SF   | 2025-01-03 00:00
3           | carol@x.com    | LA   | 2025-01-02 00:00

Common mistakes and self-check

  • Join explosions: Many-to-many joins multiply rows unexpectedly. Self-check: COUNT rows before and after the join; if the count grows unexpectedly, verify join keys and cardinality (see the sketch after this list).
  • Filtering after aggregation when it should happen before: Causes wrong totals. Self-check: Validate row counts and totals on intermediate CTEs.
  • Incorrect dedup logic: Using DISTINCT when the latest record is needed. Self-check: Use ROW_NUMBER with a clear ORDER BY on a reliable timestamp.
  • Null mishandling: SUM ignores NULLs but returns NULL when every input value is NULL, and any arithmetic with NULL yields NULL. Self-check: Wrap nullable values with COALESCE(value, 0).
  • Type drift: Joining string keys to integer keys. Self-check: CAST explicitly on both sides; compare sample keys.
  • Timezone surprises: Comparing UTC to local timestamps. Self-check: Normalize to UTC in staging; be explicit about time zone functions.
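
A minimal self-check sketch for the join-explosion and null-handling pitfalls above, reusing the customers and sales tables from Exercise 1.

Sample SQL (self-checks)
-- 1) Join-explosion check: the row count should not grow after joining to a one-row-per-key dimension.
SELECT
  (SELECT COUNT(*) FROM sales WHERE status = 'PAID') AS rows_before_join,
  (SELECT COUNT(*)
   FROM sales s
   JOIN customers c ON c.customer_id = s.customer_id
   WHERE s.status = 'PAID') AS rows_after_join;

-- 2) Duplicate-key check on the dimension side of the join.
SELECT customer_id, COUNT(*) AS copies
FROM customers
GROUP BY customer_id
HAVING COUNT(*) > 1;

-- 3) Null handling: COALESCE nullable amounts before summing.
SELECT SUM(COALESCE(amount, 0)) AS total_paid
FROM sales
WHERE status = 'PAID';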

Practical projects

  1. Daily Sales Mart
    • Staging: clean orders, normalize statuses
    • Transform: join product and customer dimensions
    • Aggregate: daily revenue, AOV, paid units
    • Publish: MERGE into fact_daily_sales
  2. Customer 360 Mini-Dimension
    • Deduplicate customer records by email
    • Compute first_order_date, last_order_date, lifetime_value (see the sketch after this list)
    • Bucket tenure using CASE
  3. Event Stream Snapshot
    • Use ROW_NUMBER to keep latest per device
    • Publish a current_state table refreshed incrementally
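
For project 2, here is a minimal sketch of the per-customer rollup, assuming a hypothetical orders(customer_id, order_date, amount, status) table and PostgreSQL-style date arithmetic; the tenure thresholds and labels are illustrative only.

Sample SQL (customer mini-dimension sketch)
WITH clean_orders AS (
  SELECT customer_id, order_date, COALESCE(amount, 0) AS amount
  FROM orders            -- hypothetical source table
  WHERE status = 'PAID'
)
SELECT
  customer_id,
  MIN(order_date) AS first_order_date,
  MAX(order_date) AS last_order_date,
  SUM(amount)     AS lifetime_value,
  CASE             -- tenure buckets based on how long ago the first order happened
    WHEN MIN(order_date) >= CURRENT_DATE - INTERVAL '90 days'  THEN 'new'
    WHEN MIN(order_date) >= CURRENT_DATE - INTERVAL '365 days' THEN 'establishing'
    ELSE 'long_standing'
  END AS tenure_bucket
FROM clean_orders
GROUP BY customer_id;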

Learning path

  • SQL basics refresher: queries, joins, aggregations
  • Window functions for dedup and rollups
  • Incremental processing: watermarks and upserts
  • Dimensional modeling and SCD patterns
  • Performance tuning: pruning, statistics, and partitioning

Next steps

  • Automate with your orchestrator and schedule incremental runs
  • Add tests for row counts, null checks, and key uniqueness (a sketch follows this list)
  • Document your models: inputs, logic, assumptions
  • Benchmark with realistic data volumes and optimize joins/filters
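
A minimal sketch of such tests against the customer_dim table from Example 3; both queries should return zero rows when the model is healthy.

Sample SQL (basic data tests)
-- Key uniqueness: duplicate customer_id values indicate a broken merge key.
SELECT customer_id, COUNT(*) AS copies
FROM customer_dim
GROUP BY customer_id
HAVING COUNT(*) > 1;

-- Null check: required columns should never be NULL.
SELECT *
FROM customer_dim
WHERE customer_id IS NULL
   OR updated_at IS NULL;

-- Row-count checks are usually done by comparing against the previous run in your orchestrator.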

Mini challenge

Create a customer_activity table with columns: customer_id, first_order_date, last_order_date, orders_30d, active_flag (1 if last order within 30 days). Use CTEs: clean_orders, per_customer_dates, recent_activity. Acceptance criteria: unique customer_id, correct counts for the provided test data, and active_flag computed from last_order_date.


SQL Based Transformations — Quick Test

Test your knowledge with 8 questions. Pass with 70% or higher.
