Why SQL matters for Analytics Engineers
SQL is the primary tool Analytics Engineers use to turn raw data into trustworthy, fast, and reusable datasets for BI and stakeholders. You will write transformations, build facts and dimensions, implement incremental pipelines, reconcile metrics, and tune performance so dashboards refresh reliably.
What you will learn
- Design clean staging, marts, facts, and dimensions using SQL.
- Write modular, reusable queries with clear grain and naming.
- Implement incremental loads, handle late-arriving data, and upsert safely.
- Use window functions for deduplication and Slowly Changing Dimensions (SCD).
- Debug metric mismatches and optimize for warehouse performance.
Who this is for
- Aspiring or current Analytics Engineers and BI Developers.
- Data Analysts moving toward modeling and data pipelines.
- Engineers who need reliable, production-grade SQL for analytics.
Prerequisites
- Basic SQL: SELECT, WHERE, GROUP BY, JOIN.
- Comfort with warehouse tables and views (any platform).
- Understanding of business metrics (e.g., revenue, active users).
Learning path
- Clean staging from raw: cast types, standardize names, filter bad records.
- Reusable SQL patterns: CTE layering, safe joins, clear grains, tests/validations.
- Window functions: row_number, lag/lead, rolling windows.
- Fact and dimension building: define grains, surrogate keys, conformed dimensions.
- Incremental loads + late data: watermarks, MERGE/UPSERT, reprocessing windows.
- Performance-aware querying: prune data early, avoid SELECT *, leverage partitioning/clustering.
- Debugging and reconciliation: compare sources vs marts, isolate filters and time zones.
Mini tasks to stay on track
- Create a staging query that standardizes timestamps and currencies.
- Write a ROW_NUMBER dedup CTE and validate counts before/after.
- Draft an incremental WHERE clause using a watermark parameter placeholder.
- Turn an inner join into a left anti-join and explain the difference.
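For the last task, here is a minimal anti-join sketch; it reuses the staging.orders and mart.orders_latest tables that appear in the worked examples below.
-- Inner join: returns only orders present in both tables.
-- Left anti-join: returns staged orders with no match in the mart.
SELECT s.order_id
FROM staging.orders s
LEFT JOIN mart.orders_latest t
  ON s.order_id = t.order_id
WHERE t.order_id IS NULL;
The difference: an inner join answers "what matched", while the anti-join answers "what is missing", which makes it the natural check for rows that were never loaded.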
Worked examples
Example 1 — Staging from raw to clean
Raw web events need consistent types and filters.
-- raw.events: event_id, user_id, event_name, event_ts (STRING), revenue (STRING), is_test (BOOL)
WITH cleaned AS (
SELECT
CAST(event_id AS STRING) AS event_id,
CAST(user_id AS STRING) AS user_id,
LOWER(TRIM(event_name)) AS event_name,
CAST(event_ts AS TIMESTAMP) AS event_ts,
CAST(revenue AS NUMERIC) AS revenue,
COALESCE(is_test, FALSE) AS is_test
FROM raw.events
WHERE event_id IS NOT NULL
)
SELECT *
FROM cleaned
WHERE is_test = FALSE
  AND event_ts >= TIMESTAMP '2023-01-01 00:00:00';
Why it matters: downstream joins will be stable and filters consistent.
Example 2 — Deduplicate with window functions
Keep the latest order record by updated_at.
WITH ranked AS (
SELECT o.*,
ROW_NUMBER() OVER (PARTITION BY order_id ORDER BY updated_at DESC) AS rn
FROM staging.orders o
)
SELECT *
FROM ranked
WHERE rn = 1;
Tip: choose the ordering column that reflects the most recent truth (updated_at or ingestion_ts).
Example 3 — SCD Type 2 with validity ranges
Track changing customer segments over time.
WITH dedup AS (
  SELECT DISTINCT customer_id, segment, segment_updated_at
  FROM staging.customer_segments
), ranges AS (
  SELECT customer_id,
         segment,
         segment_updated_at AS valid_from,
         LEAD(segment_updated_at) OVER (
           PARTITION BY customer_id ORDER BY segment_updated_at
         ) AS next_change
  FROM dedup
)
SELECT customer_id,
segment,
valid_from,
COALESCE(next_change, TIMESTAMP '9999-12-31 00:00:00') AS valid_to,
CASE WHEN next_change IS NULL THEN TRUE ELSE FALSE END AS is_current
FROM ranges;
Use the half-open validity window [valid_from, valid_to) to join facts to the segment that was current at the fact's timestamp.
Example 4 — Incremental upsert with MERGE
Load only new or changed rows using a watermark.
-- Assume target table: mart.orders_latest keyed by order_id
-- Watermark: last_loaded_at (max updated_at previously loaded)
MERGE INTO mart.orders_latest t
USING (
SELECT *
FROM staging.orders
WHERE updated_at > {{ last_loaded_at }}
) s
ON t.order_id = s.order_id
WHEN MATCHED AND t.updated_at < s.updated_at THEN
UPDATE SET amount = s.amount,
status = s.status,
updated_at = s.updated_at
WHEN NOT MATCHED THEN
INSERT (order_id, customer_id, amount, status, updated_at)
VALUES (s.order_id, s.customer_id, s.amount, s.status, s.updated_at);
Always compare timestamps to avoid overwriting newer data with older arrivals.
Example 5 — 7-day rolling revenue per day
WITH daily AS (
SELECT DATE(order_ts) AS d, SUM(amount) AS revenue
FROM fact.orders
GROUP BY 1
)
SELECT d,
SUM(revenue) OVER (
ORDER BY d
ROWS BETWEEN 6 PRECEDING AND CURRENT ROW
) AS rev_7d
FROM daily
ORDER BY d;
Window frames compute rolling metrics without self-joins. Note that a ROWS frame assumes one row per day, so fill calendar gaps before relying on it.
Drills and exercises
- Create a staging query that converts string timestamps to TIMESTAMP, trims text columns, and drops rows with null IDs.
- Write a ROW_NUMBER dedup on a table with duplicates; verify that total distinct keys before equals rows after.
- Build a fact table at order line grain and confirm its row count equals the number of lines, not orders.
- Write an incremental WHERE clause using updated_at and a placeholder watermark, then test it with two different watermark values.
- Implement a left anti-join to find source rows not present in the target.
- Optimize a slow query by removing SELECT *, adding predicate pushdown, and pre-aggregating before joins.
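For the last drill, a sketch of pre-aggregating before a join. It assumes fact.orders carries a customer_id column and that dim_customer has one current row per customer; both are assumptions for illustration.
-- Aggregate to the needed grain first, then join, so the join touches one row per
-- customer-day instead of every order row; filter early so partitions can be pruned.
WITH daily_rev AS (
  SELECT customer_id, DATE(order_ts) AS d, SUM(amount) AS revenue
  FROM fact.orders
  WHERE order_ts >= TIMESTAMP '2023-01-01 00:00:00'
  GROUP BY 1, 2
)
SELECT c.segment, r.d, SUM(r.revenue) AS revenue
FROM daily_rev r
JOIN dim_customer c
  ON c.customer_id = r.customer_id
GROUP BY 1, 2;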
Common mistakes and debugging tips
- Unclear grain: define the grain in a comment at the top of your query; verify with COUNT(DISTINCT key).
- Accidental many-to-many joins: inspect duplicates by grouping both keys and searching for counts > 1 (see the duplicate-key sketch after this list).
- Window misuse: apply window functions in a subquery before aggregation or ensure correct partitioning.
- Forgetting late-arriving updates: use MERGE with timestamp comparisons, not INSERT OVERWRITE without conditions.
- SELECT * in production models: select only needed columns to reduce IO and costs.
- Time zones: align timestamps before grouping; annotate assumptions in the model.
- COUNT(DISTINCT ...) everywhere: prefer approximate distinct or pre-aggregations when exactness isn’t needed.
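A minimal duplicate-key check for the many-to-many item above; run it against each side of a suspect join (staging.orders and order_id are taken from the examples, but any table and key work).
-- Any key with a count above 1 will fan out rows after a join on that key.
SELECT order_id, COUNT(*) AS n
FROM staging.orders
GROUP BY order_id
HAVING COUNT(*) > 1
ORDER BY n DESC;
Repeat the same check on the other side of the join; if both sides return rows, the join is many-to-many.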
Quick debugging checklist
- Validate row counts across stages (raw → staging → marts).
- Compare sums by key between source and target after each transformation.
- Check filters: status, date range, test data flags, time zones.
- Confirm join keys and null-handling; test inner vs left joins explicitly.
- Look for skewed partitions or un-pruned dates in execution plans.
Mini project: Orders mart with incremental loads
- Staging: Clean raw orders and customers (types, names, flags).
- Dedup: Keep the latest order per order_id using ROW_NUMBER.
- Dimensions: Build dim_customer (SCD2 on segment) with valid_from/valid_to.
- Fact: Build fact_order_line with line-level grain and a surrogate key (see the sketch after this list).
- Incremental load: MERGE into mart tables using updated_at watermark.
- Metrics: Daily revenue and 7d rolling revenue; reconcile against source with a validation query.
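For the fact step above, a sketch of a line-grain surrogate key. The staging.order_lines table and its columns are illustrative, and hash function names and return types vary by warehouse.
-- One row per (order_id, line_number); the hashed key gives a single stable join column.
SELECT
  MD5(CONCAT(CAST(order_id AS STRING), '|', CAST(line_number AS STRING))) AS order_line_key,
  order_id,
  line_number,
  product_id,
  quantity,
  amount
FROM staging.order_lines;
Validate the grain afterwards: COUNT(*) must equal COUNT(DISTINCT order_line_key).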
Acceptance criteria
- No duplicate order_id in the latest orders table.
- Fact row count equals number of order lines in the source for the loaded period.
- dim_customer has exactly one current record per customer.
- Incremental run loads only rows with updated_at after the watermark.
- Revenue difference between source and mart is within 0.1% for the same filter set.
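A validation query for the last criterion might look like the following sketch; it assumes both staging.orders and mart.fact_order_line expose order_ts and amount, which is not guaranteed by the steps above.
-- Daily revenue, source vs mart, with a relative difference per day.
WITH src AS (
  SELECT DATE(order_ts) AS d, SUM(amount) AS revenue
  FROM staging.orders
  GROUP BY 1
), tgt AS (
  SELECT DATE(order_ts) AS d, SUM(amount) AS revenue
  FROM mart.fact_order_line
  GROUP BY 1
)
SELECT s.d,
       s.revenue AS src_revenue,
       t.revenue AS mart_revenue,
       ABS(s.revenue - COALESCE(t.revenue, 0)) / NULLIF(s.revenue, 0) AS pct_diff
FROM src s
LEFT JOIN tgt t ON s.d = t.d
ORDER BY s.d;
Days where pct_diff exceeds 0.001 are the ones to investigate.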
Practical projects (expand as you learn)
- Sessionize web events using gaps-and-islands with window functions, then build a fact_session table (sketched after this list).
- Create a product dimension with Type 2 history (price tier changes) and join to a sales fact.
- Design a metric reconciliation dashboard model comparing source vs mart by day, country, and channel.
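For the sessionization project, a gaps-and-islands sketch: flag an event as a session start when the gap to the previous event exceeds 30 minutes, then running-sum the flags into a session number. The cleaned_events name is illustrative (for example, the output of the staging query in Example 1), and interval arithmetic syntax varies by warehouse.
WITH gaps AS (
  SELECT user_id, event_ts,
         LAG(event_ts) OVER (PARTITION BY user_id ORDER BY event_ts) AS prev_ts
  FROM cleaned_events
), flagged AS (
  SELECT user_id, event_ts,
         CASE WHEN prev_ts IS NULL
                OR event_ts > prev_ts + INTERVAL '30' MINUTE
              THEN 1 ELSE 0 END AS is_new_session
  FROM gaps
)
SELECT user_id, event_ts,
       SUM(is_new_session) OVER (
         PARTITION BY user_id ORDER BY event_ts
         ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW
       ) AS session_number
FROM flagged;
Concatenating user_id with session_number gives a session key for fact_session.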
Subskills
Master these areas to become production-ready:
- Building Transformations From Raw To Mart — create clean staging and conformed marts that are easy to join and trust.
- Writing Modular Reusable SQL Models — layer CTEs, document grain, and structure queries for reuse.
- Working With Incremental Loads — implement watermarks and MERGE patterns to process only new or changed data.
- Handling Late Arriving Data — design safe upserts and small reprocessing windows for delayed events.
- Window Functions For Dedup And SCD — use row_number, lag/lead, and validity ranges for history.
- Building Facts And Dimensions In SQL — define grains, surrogate keys, and star schemas.
- Debugging And Reconciling Metrics — find mismatches via filters, joins, and time coverage checks.
- Performance Aware Querying For Warehouses — prune data early, select only needed columns, and leverage partitioning.
Next steps
- Add basic tests: not null, unique, accepted values in key dimensions (see the sketch below).
- Practice building incremental and SCD models on a small public-like dataset.
- Learn orchestration and CI concepts to run and validate models on schedule.
- Pair with BI to confirm definitions and document metrics in your marts.
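For the basic tests in the first item, plain SQL checks work before any framework is involved; each query should return zero rows when the test passes (mart.orders_latest comes from Example 4, and the status list is illustrative).
-- Not null: the key must never be null.
SELECT * FROM mart.orders_latest WHERE order_id IS NULL;
-- Unique: one row per order_id.
SELECT order_id, COUNT(*) AS n FROM mart.orders_latest GROUP BY order_id HAVING COUNT(*) > 1;
-- Accepted values: status must come from a known list.
SELECT * FROM mart.orders_latest WHERE status NOT IN ('placed', 'shipped', 'delivered', 'cancelled');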