Why this matters
As an ETL Developer, you produce most business-ready data through SQL transformations: cleaning raw inputs, joining dimensions, calculating metrics, deduplicating, and loading curated tables for analytics and downstream apps. Strong SQL transforms let you ship reliable data models, reduce pipeline complexity, and speed up delivery.
- Create daily/weekly aggregates for product and finance
- Join disparate sources to build customer 360 views
- Deduplicate events and keep the latest truth
- Apply incremental logic so pipelines run fast and cheaply
Who this is for
- ETL Developers and Data Engineers building staging/marts
- Analysts moving from ad-hoc queries to production-grade SQL
- Anyone implementing ELT in a data warehouse
Prerequisites
- Comfort with SELECT, WHERE, GROUP BY, and basic JOINs
- Understanding of primary keys and basic data types
- Familiarity with your warehouse dialect (PostgreSQL, Snowflake, BigQuery, etc.)
Concept explained simply
SQL-based transformations turn raw tables into clean, shaped data. Think of it as an assembly line: filter, shape, enrich, then publish.
Mental model: a 4-stage assembly line
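Each stage maps onto a layer of the query: filter the raw rows you need, shape (standardize) their values, enrich by joining dimensions, then publish the curated result. A minimal skeleton of this layering, assuming hypothetical raw_orders and customers tables:
WITH filtered AS (
    -- 1. Filter: keep only the rows you need
    SELECT order_id, customer_id, order_date, amount
    FROM raw_orders
    WHERE status = 'PAID'
), shaped AS (
    -- 2. Shape: standardize types and values
    SELECT order_id, customer_id, order_date, CAST(amount AS DECIMAL(12, 2)) AS amount
    FROM filtered
), enriched AS (
    -- 3. Enrich: join dimensions
    SELECT s.order_date, UPPER(TRIM(c.city)) AS city, s.amount
    FROM shaped s
    JOIN customers c ON c.customer_id = s.customer_id
)
-- 4. Publish: aggregate and expose the curated result
SELECT order_date, city, SUM(amount) AS revenue
FROM enriched
GROUP BY order_date, city;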
Core operations snapshot
- JOINs: inner to match; left to preserve base; anti-join to exclude (sketched after this list)
- Aggregations: SUM, COUNT, AVG with GROUP BY; conditional via CASE
- Window functions: ROW_NUMBER, RANK, SUM OVER for dedup/rollups
- Data cleaning: TRIM, LOWER/UPPER, CAST/SAFE_CAST, COALESCE
- Incremental patterns: watermarks, MERGE/UPSERT, dedup on keys
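The anti-join named above is the one pattern in this list not shown in the worked examples below; a minimal sketch, assuming hypothetical orders and blocklist tables:
SELECT o.*
FROM orders o
LEFT JOIN blocklist b ON b.customer_id = o.customer_id
WHERE b.customer_id IS NULL;
An equivalent NOT EXISTS form often reads more clearly and avoids row fan-out when the excluded table is not unique on the join key.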
Worked examples
Example 1: Clean, join, and aggregate daily revenue
Goal: Sum paid sales by standardized city per day.
Sample SQL
WITH clean_customers AS (
    SELECT
        customer_id,
        UPPER(TRIM(city)) AS city_clean
    FROM customers
), paid_sales AS (
    SELECT order_id, customer_id, order_date, amount
    FROM sales
    WHERE status = 'PAID'
)
SELECT
    s.order_date,
    c.city_clean AS city,
    SUM(s.amount) AS total_paid
FROM paid_sales s
JOIN clean_customers c USING (customer_id)
GROUP BY 1, 2
ORDER BY 1, 2;
Example 2: Deduplicate to latest record per key
Goal: Keep the latest device status per device_id.
Sample SQL
WITH ranked AS (
    SELECT
        device_id,
        status,
        updated_at,
        ROW_NUMBER() OVER (PARTITION BY device_id ORDER BY updated_at DESC) AS rn
    FROM events
)
SELECT device_id, status, updated_at
FROM ranked
WHERE rn = 1;
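In dialects that support QUALIFY (for example Snowflake and BigQuery), the same dedup can be written without the wrapping CTE; a dialect-specific sketch:
SELECT device_id, status, updated_at
FROM events
QUALIFY ROW_NUMBER() OVER (PARTITION BY device_id ORDER BY updated_at DESC) = 1;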
Example 3: Incremental upsert with watermark
Goal: Only process rows newer than the last watermark and upsert into the target table.
Sample SQL (generic MERGE)
-- Assume last_watermark is provided by your orchestrator
MERGE INTO customer_dim t
USING (
    SELECT *
    FROM staging_customers
    WHERE updated_at > :last_watermark
) s
ON t.customer_id = s.customer_id
WHEN MATCHED THEN UPDATE SET
    email = s.email,
    city = s.city,
    updated_at = s.updated_at
WHEN NOT MATCHED THEN INSERT (customer_id, email, city, updated_at)
    VALUES (s.customer_id, s.email, s.city, s.updated_at);
Alternative (PostgreSQL upsert)
INSERT INTO customer_dim (customer_id, email, city, updated_at)
SELECT customer_id, email, city, updated_at
FROM staging_customers
WHERE updated_at > :last_watermark
ON CONFLICT (customer_id)
DO UPDATE SET
    email = EXCLUDED.email,
    city = EXCLUDED.city,
    updated_at = EXCLUDED.updated_at;
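After a successful run, the orchestrator typically advances the watermark to the maximum timestamp just processed; a minimal sketch using the same staging table and parameter as above:
SELECT MAX(updated_at) AS next_watermark
FROM staging_customers
WHERE updated_at > :last_watermark;
If the staged set is empty, MAX returns NULL and the previous watermark should be kept.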
Hands-on exercises
Exercise 1: Join, clean, and aggregate
Your task: Produce daily revenue by standardized city for PAID orders.
Dataset
customers(customer_id, city)
1, 'New York'
2, 'new york'
3, 'Boston '
4, 'boston'
5, 'Chicago'
sales(order_id, customer_id, order_date, amount, status)
100, 1, '2025-01-01', 50, 'PAID'
101, 2, '2025-01-01', 30, 'REFUNDED'
102, 2, '2025-01-01', 20, 'PAID'
103, 3, '2025-01-02', 40, 'PAID'
104, 4, '2025-01-02', 60, 'PAID'
105, 5, '2025-01-02', 70, 'CANCELLED'
- Trim and uppercase city
- Filter status to PAID
- Group by order_date and city
Expected result
order_date | city | total_paid
2025-01-01 | NEW YORK | 70
2025-01-02 | BOSTON | 100
Exercise 2: Latest record per device
Your task: Return the latest status per device.
Dataset
events(device_id, status, updated_at)
'A', 'online', '2025-01-01 10:00'
'A', 'offline', '2025-01-01 12:00'
'B', 'online', '2025-01-01 09:00'
'B', 'maintenance', '2025-01-02 08:00'
'C', 'offline', '2025-01-01 11:00'
- Use ROW_NUMBER() partitioned by device_id, ordered by updated_at descending
- Filter rn = 1
Expected result
device_id | status | updated_at
A | offline | 2025-01-01 12:00
B | maintenance | 2025-01-02 08:00
C | offline | 2025-01-01 11:00
Exercise 3: Incremental upsert
Your task: Upsert changes since a watermark.
Dataset
Target table: customer_dim(customer_id, email, city, updated_at)
Existing rows:
1, 'alice@x.com', 'NY', '2025-01-01 00:00'
2, 'bob@x.com', 'SF', '2025-01-01 00:00'
Staging since :last_watermark = '2025-01-01 00:00'
2, 'bob_new@x.com', 'SF', '2025-01-03 00:00'
3, 'carol@x.com', 'LA', '2025-01-02 00:00'
- Filter staging by watermark
- Match on customer_id
- Update email/city/updated_at on match; insert otherwise
Expected result
customer_id | email | city | updated_at
1 | alice@x.com | NY | 2025-01-01 00:00
2 | bob_new@x.com | SF | 2025-01-03 00:00
3 | carol@x.com | LA | 2025-01-02 00:00
Common mistakes and self-check
- Join explosions: Many-to-many joins multiply rows. Self-check: COUNT before and after the join (see the sketch after this list); if the count grows unexpectedly, verify join keys and cardinality.
- Filtering after aggregation when it should be before: Causes wrong totals. Self-check: Validate row counts and totals on intermediate CTEs.
- Incorrect dedup logic: Using DISTINCT when the latest record is needed. Self-check: Use ROW_NUMBER with a clear ORDER BY on a reliable timestamp.
- Null mishandling: Aggregates skip NULLs, but SUM returns NULL (not 0) when no non-null values remain, and arithmetic with NULL yields NULL. Self-check: Wrap with COALESCE(value, 0) where a zero default is intended.
- Type drift: Joining string keys to integer keys. Self-check: Explicit CAST on both sides; compare sample keys.
- Timezone surprises: Comparing UTC to local timestamps. Self-check: Normalize to UTC in staging; be explicit about time zone functions.
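A quick way to run the join self-check, using the sales and customers tables from Example 1:
SELECT
    (SELECT COUNT(*) FROM sales) AS rows_before,
    (SELECT COUNT(*)
     FROM sales s
     JOIN customers c ON c.customer_id = s.customer_id) AS rows_after;
If rows_after exceeds rows_before, customer_id is not unique on the customers side and the join is fanning out.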
Practical projects
- Daily Sales Mart
- Staging: clean orders, normalize statuses
- Transform: join product and customer dimensions
- Aggregate: daily revenue, AOV, paid units
- Publish: MERGE into fact_daily_sales
- Customer 360 Mini-Dimension
- Deduplicate customer records by email
- Compute first_order_date, last_order_date, lifetime_value
- Bucket tenure using CASE (see the sketch after this list)
- Event Stream Snapshot
- Use ROW_NUMBER to keep latest per device
- Publish a current_state table refreshed incrementally
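For the tenure bucketing step in the Customer 360 project, a minimal CASE sketch; the customer_360 table and months_since_first_order column are illustrative:
SELECT
    customer_id,
    CASE
        WHEN months_since_first_order < 6 THEN 'new'
        WHEN months_since_first_order < 24 THEN 'established'
        ELSE 'loyal'
    END AS tenure_bucket
FROM customer_360;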
Learning path
- SQL basics refresher: queries, joins, aggregations
- Window functions for dedup and rollups
- Incremental processing: watermarks and upserts
- Dimensional modeling and SCD patterns
- Performance tuning: pruning, statistics, and partitioning
Next steps
- Automate with your orchestrator and schedule incremental runs
- Add tests for row counts, null checks, and key uniqueness (see the sketch after this list)
- Document your models: inputs, logic, assumptions
- Benchmark with realistic data volumes and optimize joins/filters
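For the key-uniqueness test, a minimal sketch against the customer_dim table from Example 3 that should return zero rows when the model is healthy:
SELECT customer_id, COUNT(*) AS duplicate_rows
FROM customer_dim
GROUP BY customer_id
HAVING COUNT(*) > 1;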
Mini challenge
Create a customer_activity table with columns: customer_id, first_order_date, last_order_date, orders_30d, active_flag (1 if last order within 30 days). Use CTEs: clean_orders, per_customer_dates, recent_activity. Acceptance criteria: unique customer_id, correct counts for the provided test data, and active_flag computed from last_order_date.