Why this matters
As an ETL Developer, you produce most business-ready data through SQL transformations: cleaning raw inputs, joining dimensions, calculating metrics, deduplicating, and loading curated tables for analytics and downstream apps. Strong SQL transforms let you ship reliable data models, reduce pipeline complexity, and speed up delivery.
- Create daily/weekly aggregates for product and finance
- Join disparate sources to build customer 360 views
- Deduplicate events and keep the latest truth
- Apply incremental logic so pipelines run fast and cheaply
Who this is for
- ETL Developers and Data Engineers building staging/marts
- Analysts moving from ad-hoc queries to production-grade SQL
- Anyone implementing ELT in a data warehouse
Prerequisites
- Comfort with SELECT, WHERE, GROUP BY, and basic JOINs
- Understanding of primary keys and basic data types
- Familiarity with your warehouse dialect (PostgreSQL, Snowflake, BigQuery, etc.)
Concept explained simply
SQL-based transformations turn raw tables into clean, shaped data. Think of it as an assembly line: filter, shape, enrich, then publish.
Mental model: a 4-stage assembly line
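Each stage maps onto a layer of the query: filter the raw rows you need, shape (standardize) their values, enrich by joining dimensions, then publish the curated result. A minimal skeleton of this layering, assuming hypothetical raw_orders and customers tables:
WITH filtered AS (
    -- 1. Filter: keep only the rows you need
    SELECT order_id, customer_id, order_date, amount
    FROM raw_orders
    WHERE status = 'PAID'
), shaped AS (
    -- 2. Shape: standardize types and values
    SELECT order_id, customer_id, order_date, CAST(amount AS DECIMAL(12, 2)) AS amount
    FROM filtered
), enriched AS (
    -- 3. Enrich: join dimensions
    SELECT s.order_date, UPPER(TRIM(c.city)) AS city, s.amount
    FROM shaped s
    JOIN customers c ON c.customer_id = s.customer_id
)
-- 4. Publish: aggregate and expose the curated result
SELECT order_date, city, SUM(amount) AS revenue
FROM enriched
GROUP BY order_date, city;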
Core operations snapshot
- JOINs: inner to match; left to preserve base; anti-join to exclude (sketched after this list)
- Aggregations: SUM, COUNT, AVG with GROUP BY; conditional via CASE
- Window functions: ROW_NUMBER, RANK, SUM OVER for dedup/rollups
- Data cleaning: TRIM, LOWER/UPPER, CAST/SAFE_CAST, COALESCE
- Incremental patterns: watermarks, MERGE/UPSERT, dedup on keys
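The anti-join named above is the one pattern in this list not shown in the worked examples below; a minimal sketch, assuming hypothetical orders and blocklist tables:
SELECT o.*
FROM orders o
LEFT JOIN blocklist b ON b.customer_id = o.customer_id
WHERE b.customer_id IS NULL;
An equivalent NOT EXISTS form often reads more clearly and avoids row fan-out when the excluded table is not unique on the join key.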
Worked examples
Example 1: Clean, join, and aggregate daily revenue
Goal: Sum paid sales by standardized city per day.
Sample SQL
WITH clean_customers AS (
    SELECT
        customer_id,
        UPPER(TRIM(city)) AS city_clean
    FROM customers
), paid_sales AS (
    SELECT order_id, customer_id, order_date, amount
    FROM sales
    WHERE status = 'PAID'
)
SELECT
    s.order_date,
    c.city_clean AS city,
    SUM(s.amount) AS total_paid
FROM paid_sales s
JOIN clean_customers c USING (customer_id)
GROUP BY 1, 2
ORDER BY 1, 2;
Example 2: Deduplicate to latest record per key
Goal: Keep the latest device status per device_id.
Sample SQL
WITH ranked AS (
    SELECT
        device_id,
        status,
        updated_at,
        ROW_NUMBER() OVER (PARTITION BY device_id ORDER BY updated_at DESC) AS rn
    FROM events
)
SELECT device_id, status, updated_at
FROM ranked
WHERE rn = 1;
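In dialects that support QUALIFY (for example Snowflake and BigQuery), the same dedup can be written without the wrapping CTE; a dialect-specific sketch:
SELECT device_id, status, updated_at
FROM events
QUALIFY ROW_NUMBER() OVER (PARTITION BY device_id ORDER BY updated_at DESC) = 1;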
Example 3: Incremental upsert with watermark
Goal: Only process rows newer than the last watermark and upsert into the target table.
Sample SQL (generic MERGE)
-- Assume last_watermark is provided by your orchestrator
MERGE INTO customer_dim t
USING (
    SELECT *
    FROM staging_customers
    WHERE updated_at > :last_watermark
) s
ON t.customer_id = s.customer_id
WHEN MATCHED THEN UPDATE SET
    email = s.email,
    city = s.city,
    updated_at = s.updated_at
WHEN NOT MATCHED THEN INSERT (customer_id, email, city, updated_at)
    VALUES (s.customer_id, s.email, s.city, s.updated_at);
Alternative (PostgreSQL upsert)
INSERT INTO customer_dim (customer_id, email, city, updated_at)
SELECT customer_id, email, city, updated_at
FROM staging_customers
WHERE updated_at > :last_watermark
ON CONFLICT (customer_id)
DO UPDATE SET
    email = EXCLUDED.email,
    city = EXCLUDED.city,
    updated_at = EXCLUDED.updated_at;
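After a successful run, the orchestrator typically advances the watermark to the maximum timestamp just processed; a minimal sketch using the same staging table and parameter as above:
SELECT MAX(updated_at) AS next_watermark
FROM staging_customers
WHERE updated_at > :last_watermark;
If the staged set is empty, MAX returns NULL and the previous watermark should be kept.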
Hands-on exercises
Exercise 1: Join, clean, and aggregate
Your task: Produce daily revenue by standardized city for PAID orders.
Dataset
customers(customer_id, city)
1, 'New York'
2, 'new york'
3, 'Boston '
4, 'boston'
5, 'Chicago'
sales(order_id, customer_id, order_date, amount, status)
100, 1, '2025-01-01', 50, 'PAID'
101, 2, '2025-01-01', 30, 'REFUNDED'
102, 2, '2025-01-01', 20, 'PAID'
103, 3, '2025-01-02', 40, 'PAID'
104, 4, '2025-01-02', 60, 'PAID'
105, 5, '2025-01-02', 70, 'CANCELLED'
- Trim and uppercase city
- Filter status to PAID
- Group by order_date and city
Expected result
order_date | city | total_paid
2025-01-01 | NEW YORK | 70
2025-01-02 | BOSTON | 100
Exercise 2: Latest record per device
Your task: Return the latest status per device.
Dataset
events(device_id, status, updated_at)
'A', 'online', '2025-01-01 10:00'
'A', 'offline', '2025-01-01 12:00'
'B', 'online', '2025-01-01 09:00'
'B', 'maintenance', '2025-01-02 08:00'
'C', 'offline', '2025-01-01 11:00'
- Use ROW_NUMBER() partitioned by device_id, ordered by updated_at descending
- Filter rn = 1
Expected result
device_id | status | updated_at
A | offline | 2025-01-01 12:00
B | maintenance | 2025-01-02 08:00
C | offline | 2025-01-01 11:00
Exercise 3: Incremental upsert
Your task: Upsert changes since a watermark.
Dataset
Target table: customer_dim(customer_id, email, city, updated_at)
Existing rows:
1, 'alice@x.com', 'NY', '2025-01-01 00:00'
2, 'bob@x.com', 'SF', '2025-01-01 00:00'
Staging since :last_watermark = '2025-01-01 00:00'
2, 'bob_new@x.com', 'SF', '2025-01-03 00:00'
3, 'carol@x.com', 'LA', '2025-01-02 00:00'
- Filter staging by watermark
- Match on customer_id
- Update email/city/updated_at on match; insert otherwise
Expected result
customer_id | email | city | updated_at
1 | alice@x.com | NY | 2025-01-01 00:00
2 | bob_new@x.com | SF | 2025-01-03 00:00
3 | carol@x.com | LA | 2025-01-02 00:00
Common mistakes and self-check
- Join explosions: Many-to-many joins multiply rows. Self-check: COUNT before and after the join (see the sketch after this list); if the count grows unexpectedly, verify join keys and cardinality.
- Filtering after aggregation when it should be before: Causes wrong totals. Self-check: Validate row counts and totals on intermediate CTEs.
- Incorrect dedup logic: Using DISTINCT when the latest record is needed. Self-check: Use ROW_NUMBER with a clear ORDER BY on a reliable timestamp.
- Null mishandling: Aggregates skip NULLs, but SUM returns NULL (not 0) when no non-null values remain, and arithmetic with NULL yields NULL. Self-check: Wrap with COALESCE(value, 0) where a zero default is intended.
- Type drift: Joining string keys to integer keys. Self-check: Explicit CAST on both sides; compare sample keys.
- Timezone surprises: Comparing UTC to local timestamps. Self-check: Normalize to UTC in staging; be explicit about time zone functions.
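A quick way to run the join self-check, using the sales and customers tables from Example 1:
SELECT
    (SELECT COUNT(*) FROM sales) AS rows_before,
    (SELECT COUNT(*)
     FROM sales s
     JOIN customers c ON c.customer_id = s.customer_id) AS rows_after;
If rows_after exceeds rows_before, customer_id is not unique on the customers side and the join is fanning out.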
Practical projects
- Daily Sales Mart
- Staging: clean orders, normalize statuses
- Transform: join product and customer dimensions
- Aggregate: daily revenue, AOV, paid units
- Publish: MERGE into fact_daily_sales
- Customer 360 Mini-Dimension
- Deduplicate customer records by email
- Compute first_order_date, last_order_date, lifetime_value
- Bucket tenure using CASE (see the sketch after this list)
- Event Stream Snapshot
- Use ROW_NUMBER to keep latest per device
- Publish a current_state table refreshed incrementally
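For the tenure bucketing step in the Customer 360 project, a minimal CASE sketch; the customer_360 table and months_since_first_order column are illustrative:
SELECT
    customer_id,
    CASE
        WHEN months_since_first_order < 6 THEN 'new'
        WHEN months_since_first_order < 24 THEN 'established'
        ELSE 'loyal'
    END AS tenure_bucket
FROM customer_360;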
Learning path
- SQL basics refresher: queries, joins, aggregations
- Window functions for dedup and rollups
- Incremental processing: watermarks and upserts
- Dimensional modeling and SCD patterns
- Performance tuning: pruning, statistics, and partitioning
Next steps
- Automate with your orchestrator and schedule incremental runs
- Add tests for row counts, null checks, and key uniqueness (see the sketch after this list)
- Document your models: inputs, logic, assumptions
- Benchmark with realistic data volumes and optimize joins/filters
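For the key-uniqueness test, a minimal sketch against the customer_dim table from Example 3 that should return zero rows when the model is healthy:
SELECT customer_id, COUNT(*) AS duplicate_rows
FROM customer_dim
GROUP BY customer_id
HAVING COUNT(*) > 1;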
Mini challenge
Create a customer_activity table with columns: customer_id, first_order_date, last_order_date, orders_30d, active_flag (1 if last order within 30 days). Use CTEs: clean_orders, per_customer_dates, recent_activity. Acceptance criteria: unique customer_id, correct counts for the provided test data, and active_flag computed from last_order_date.