Why SQL matters for Analytics Engineers
SQL is the primary tool Analytics Engineers use to turn raw data into trustworthy, fast, and reusable datasets for BI and stakeholders. You will write transformations, build facts and dimensions, implement incremental pipelines, reconcile metrics, and tune performance so dashboards refresh reliably.
What you will learn
- Design clean staging, marts, facts, and dimensions using SQL.
- Write modular, reusable queries with clear grain and naming.
- Implement incremental loads, handle late-arriving data, and upsert safely.
- Use window functions for deduplication and Slowly Changing Dimensions (SCD).
- Debug metric mismatches and optimize for warehouse performance.
Who this is for
- Aspiring or current Analytics Engineers and BI Developers.
- Data Analysts moving toward modeling and data pipelines.
- Engineers who need reliable, production-grade SQL for analytics.
Prerequisites
- Basic SQL: SELECT, WHERE, GROUP BY, JOIN.
- Comfort with warehouse tables and views (any platform).
- Understanding of business metrics (e.g., revenue, active users).
Learning path
- Clean staging from raw: cast types, standardize names, filter bad records.
- Reusable SQL patterns: CTE layering, safe joins, clear grains, tests/validations.
- Window functions: row_number, lag/lead, rolling windows.
- Fact and dimension building: define grains, surrogate keys, conformed dimensions.
- Incremental loads + late data: watermarks, MERGE/UPSERT, reprocessing windows.
- Performance-aware querying: prune data early, avoid SELECT *, leverage partitioning/clustering.
- Debugging and reconciliation: compare sources vs marts, isolate filters and time zones.
Mini tasks to stay on track
- Create a staging query that standardizes timestamps and currencies.
- Write a ROW_NUMBER dedup CTE and validate counts before/after.
- Draft an incremental WHERE clause using a watermark parameter placeholder.
- Turn an inner join into a left anti-join and explain the difference.
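For the last task, here is a minimal anti-join sketch; it reuses the staging.orders and mart.orders_latest tables that appear in the worked examples below.
-- Inner join: returns only orders present in both tables.
-- Left anti-join: returns staged orders with no match in the mart.
SELECT s.order_id
FROM staging.orders s
LEFT JOIN mart.orders_latest t
  ON s.order_id = t.order_id
WHERE t.order_id IS NULL;
The difference: an inner join answers "what matched", while the anti-join answers "what is missing", which makes it the natural check for rows that were never loaded.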
Worked examples
Example 1 — Staging from raw to clean
Raw web events need consistent types and filters.
-- raw.events: event_id, user_id, event_name, event_ts (STRING), revenue (STRING), is_test (BOOL)
WITH cleaned AS (
SELECT
CAST(event_id AS STRING) AS event_id,
CAST(user_id AS STRING) AS user_id,
LOWER(TRIM(event_name)) AS event_name,
CAST(event_ts AS TIMESTAMP) AS event_ts,
CAST(revenue AS NUMERIC) AS revenue,
COALESCE(is_test, FALSE) AS is_test
FROM raw.events
WHERE event_id IS NOT NULL
)
SELECT *
FROM cleaned
WHERE is_test = FALSE
  AND event_ts >= TIMESTAMP '2023-01-01 00:00:00';
Why it matters: downstream joins will be stable and filters consistent.
Example 2 — Deduplicate with window functions
Keep the latest order record by updated_at.
WITH ranked AS (
SELECT o.*,
ROW_NUMBER() OVER (PARTITION BY order_id ORDER BY updated_at DESC) AS rn
FROM staging.orders o
)
SELECT *
FROM ranked
WHERE rn = 1;
Tip: choose the ordering column that reflects the most recent truth (updated_at or ingestion_ts).
Example 3 — SCD Type 2 with validity ranges
Track changing customer segments over time.
WITH dedup AS (
  SELECT DISTINCT customer_id, segment, segment_updated_at
  FROM staging.customer_segments
), ranges AS (
  SELECT customer_id,
         segment,
         segment_updated_at AS valid_from,
         LEAD(segment_updated_at) OVER (
           PARTITION BY customer_id ORDER BY segment_updated_at
         ) AS next_change
  FROM dedup
)
SELECT customer_id,
segment,
valid_from,
COALESCE(next_change, TIMESTAMP '9999-12-31 00:00:00') AS valid_to,
CASE WHEN next_change IS NULL THEN TRUE ELSE FALSE END AS is_current
FROM ranges;
Use the half-open validity window [valid_from, valid_to) to join facts to the segment that was current at the fact's timestamp.
Example 4 — Incremental upsert with MERGE
Load only new or changed rows using a watermark.
-- Assume target table: mart.orders_latest keyed by order_id
-- Watermark: last_loaded_at (max updated_at previously loaded)
MERGE INTO mart.orders_latest t
USING (
SELECT *
FROM staging.orders
WHERE updated_at > {{ last_loaded_at }}
) s
ON t.order_id = s.order_id
WHEN MATCHED AND t.updated_at < s.updated_at THEN
UPDATE SET amount = s.amount,
status = s.status,
updated_at = s.updated_at
WHEN NOT MATCHED THEN
INSERT (order_id, customer_id, amount, status, updated_at)
VALUES (s.order_id, s.customer_id, s.amount, s.status, s.updated_at);
Always compare timestamps to avoid overwriting newer data with older arrivals.
Example 5 — 7-day rolling revenue per day
WITH daily AS (
SELECT DATE(order_ts) AS d, SUM(amount) AS revenue
FROM fact.orders
GROUP BY 1
)
SELECT d,
SUM(revenue) OVER (
ORDER BY d
ROWS BETWEEN 6 PRECEDING AND CURRENT ROW
) AS rev_7d
FROM daily
ORDER BY d;
Window frames compute rolling metrics without self-joins. Note that a ROWS frame assumes one row per day, so fill calendar gaps before relying on it.
Drills and exercises
- Create a staging query that converts string timestamps to TIMESTAMP, trims text columns, and drops rows with null IDs.
- Write a ROW_NUMBER dedup on a table with duplicates; verify that total distinct keys before equals rows after.
- Build a fact table at order line grain and confirm its row count equals the number of lines, not orders.
- Write an incremental WHERE clause using updated_at and a placeholder watermark, then test it with two different watermark values.
- Implement a left anti-join to find source rows not present in the target.
- Optimize a slow query by removing SELECT *, adding predicate pushdown, and pre-aggregating before joins.
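For the last drill, a sketch of pre-aggregating before a join. It assumes fact.orders carries a customer_id column and that dim_customer has one current row per customer; both are assumptions for illustration.
-- Aggregate to the needed grain first, then join, so the join touches one row per
-- customer-day instead of every order row; filter early so partitions can be pruned.
WITH daily_rev AS (
  SELECT customer_id, DATE(order_ts) AS d, SUM(amount) AS revenue
  FROM fact.orders
  WHERE order_ts >= TIMESTAMP '2023-01-01 00:00:00'
  GROUP BY 1, 2
)
SELECT c.segment, r.d, SUM(r.revenue) AS revenue
FROM daily_rev r
JOIN dim_customer c
  ON c.customer_id = r.customer_id
GROUP BY 1, 2;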
Common mistakes and debugging tips
- Unclear grain: define the grain in a comment at the top of your query; verify with COUNT(DISTINCT key).
- Accidental many-to-many joins: inspect duplicates by grouping both keys and searching for counts > 1 (see the duplicate-key sketch after this list).
- Window misuse: apply window functions in a subquery before aggregation or ensure correct partitioning.
- Forgetting late-arriving updates: use MERGE with timestamp comparisons, not INSERT OVERWRITE without conditions.
- SELECT * in production models: select only needed columns to reduce IO and costs.
- Time zones: align timestamps before grouping; annotate assumptions in the model.
- COUNT(DISTINCT ...) everywhere: prefer approximate distinct or pre-aggregations when exactness isn’t needed.
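A minimal duplicate-key check for the many-to-many item above; run it against each side of a suspect join (staging.orders and order_id are taken from the examples, but any table and key work).
-- Any key with a count above 1 will fan out rows after a join on that key.
SELECT order_id, COUNT(*) AS n
FROM staging.orders
GROUP BY order_id
HAVING COUNT(*) > 1
ORDER BY n DESC;
Repeat the same check on the other side of the join; if both sides return rows, the join is many-to-many.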
Quick debugging checklist
- Validate row counts across stages (raw → staging → marts).
- Compare sums by key between source and target after each transformation.
- Check filters: status, date range, test data flags, time zones.
- Confirm join keys and null-handling; test inner vs left joins explicitly.
- Look for skewed partitions or un-pruned dates in execution plans.
Mini project: Orders mart with incremental loads
- Staging: Clean raw orders and customers (types, names, flags).
- Dedup: Keep the latest order per order_id using ROW_NUMBER.
- Dimensions: Build dim_customer (SCD2 on segment) with valid_from/valid_to.
- Fact: Build fact_order_line with line-level grain and a surrogate key (see the sketch after this list).
- Incremental load: MERGE into mart tables using updated_at watermark.
- Metrics: Daily revenue and 7d rolling revenue; reconcile against source with a validation query.
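For the fact step above, a sketch of a line-grain surrogate key. The staging.order_lines table and its columns are illustrative, and hash function names and return types vary by warehouse.
-- One row per (order_id, line_number); the hashed key gives a single stable join column.
SELECT
  MD5(CONCAT(CAST(order_id AS STRING), '|', CAST(line_number AS STRING))) AS order_line_key,
  order_id,
  line_number,
  product_id,
  quantity,
  amount
FROM staging.order_lines;
Validate the grain afterwards: COUNT(*) must equal COUNT(DISTINCT order_line_key).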
Acceptance criteria
- No duplicate order_id in the latest orders table.
- Fact row count equals number of order lines in the source for the loaded period.
- dim_customer has exactly one current record per customer.
- Incremental run loads only rows with updated_at after the watermark.
- Revenue difference between source and mart is within 0.1% for the same filter set.
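A validation query for the last criterion might look like the following sketch; it assumes both staging.orders and mart.fact_order_line expose order_ts and amount, which is not guaranteed by the steps above.
-- Daily revenue, source vs mart, with a relative difference per day.
WITH src AS (
  SELECT DATE(order_ts) AS d, SUM(amount) AS revenue
  FROM staging.orders
  GROUP BY 1
), tgt AS (
  SELECT DATE(order_ts) AS d, SUM(amount) AS revenue
  FROM mart.fact_order_line
  GROUP BY 1
)
SELECT s.d,
       s.revenue AS src_revenue,
       t.revenue AS mart_revenue,
       ABS(s.revenue - COALESCE(t.revenue, 0)) / NULLIF(s.revenue, 0) AS pct_diff
FROM src s
LEFT JOIN tgt t ON s.d = t.d
ORDER BY s.d;
Days where pct_diff exceeds 0.001 are the ones to investigate.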
Practical projects (expand as you learn)
- Sessionize web events using gaps-and-islands with window functions, then build a fact_session table (sketched after this list).
- Create a product dimension with Type 2 history (price tier changes) and join to a sales fact.
- Design a metric reconciliation dashboard model comparing source vs mart by day, country, and channel.
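For the sessionization project, a gaps-and-islands sketch: flag an event as a session start when the gap to the previous event exceeds 30 minutes, then running-sum the flags into a session number. The cleaned_events name is illustrative (for example, the output of the staging query in Example 1), and interval arithmetic syntax varies by warehouse.
WITH gaps AS (
  SELECT user_id, event_ts,
         LAG(event_ts) OVER (PARTITION BY user_id ORDER BY event_ts) AS prev_ts
  FROM cleaned_events
), flagged AS (
  SELECT user_id, event_ts,
         CASE WHEN prev_ts IS NULL
                OR event_ts > prev_ts + INTERVAL '30' MINUTE
              THEN 1 ELSE 0 END AS is_new_session
  FROM gaps
)
SELECT user_id, event_ts,
       SUM(is_new_session) OVER (
         PARTITION BY user_id ORDER BY event_ts
         ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW
       ) AS session_number
FROM flagged;
Concatenating user_id with session_number gives a session key for fact_session.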
Subskills
Master these areas to become production-ready:
- Building Transformations From Raw To Mart — create clean staging and conformed marts that are easy to join and trust.
- Writing Modular Reusable SQL Models — layer CTEs, document grain, and structure queries for reuse.
- Working With Incremental Loads — implement watermarks and MERGE patterns to process only new or changed data.
- Handling Late Arriving Data — design safe upserts and small reprocessing windows for delayed events.
- Window Functions For Dedup And SCD — use row_number, lag/lead, and validity ranges for history.
- Building Facts And Dimensions In SQL — define grains, surrogate keys, and star schemas.
- Debugging And Reconciling Metrics — find mismatches via filters, joins, and time coverage checks.
- Performance Aware Querying For Warehouses — prune data early, select only needed columns, and leverage partitioning.
Next steps
- Add basic tests: not null, unique, accepted values in key dimensions (see the sketch below).
- Practice building incremental and SCD models on a small public-like dataset.
- Learn orchestration and CI concepts to run and validate models on schedule.
- Pair with BI to confirm definitions and document metrics in your marts.
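For the basic tests in the first item, plain SQL checks work before any framework is involved; each query should return zero rows when the test passes (mart.orders_latest comes from Example 4, and the status list is illustrative).
-- Not null: the key must never be null.
SELECT * FROM mart.orders_latest WHERE order_id IS NULL;
-- Unique: one row per order_id.
SELECT order_id, COUNT(*) AS n FROM mart.orders_latest GROUP BY order_id HAVING COUNT(*) > 1;
-- Accepted values: status must come from a known list.
SELECT * FROM mart.orders_latest WHERE status NOT IN ('placed', 'shipped', 'delivered', 'cancelled');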