How to learn Unit Tests For Transformations for Testing Data Pipelines in Analytics Engineer for free

Why this matters

Analytics Engineers ship business logic as code. A small bug in a SQL transformation can silently skew KPIs, marketing spend, and decision-making. Unit tests for transformations let you validate logic on small, controlled inputs before data reaches production models.

Real tasks: deduplicate events correctly, compute metrics consistently, map categories with edge cases, and safely change logic without breaking downstream dashboards.
Outcomes: faster refactors, fewer regressions, and confidence to ship changes.

Concept explained simply

A unit test checks one transformation in isolation. You feed it a tiny, known input and assert the output matches expectations. Think of it as checking a recipe with a small sample before cooking for a crowd.

Mental model

Treat the transformation as a pure function: Output = Transform(Input).
Arrange → Act → Assert (AAA): Arrange fixtures (tiny tables), Act by running your SQL/logic, Assert results equal expected.
Deterministic and fast: fixed timestamps, no randomness, small data.

Typical unit test skeleton (SQL idea)

-- Arrange: create tiny input and expected tables (CTEs or temp tables)
with input as (
  select 1 as id, 'A' as status, '2024-01-01'::date as dt union all
  select 1, 'B', '2024-01-02'::date
),
expected as (
  select 1 as id, 'B' as status, '2024-01-02'::date as dt
),
-- Act: apply the transformation
actual as (
  select * from (
    select *, row_number() over(partition by id order by dt desc) as rn
    from input
  ) t where rn = 1
)
-- Assert: no differences between actual and expected
select count(*) as diff_count
from (
  select * from actual
  except
  select * from expected
  union all
  select * from expected
  except
  select * from actual
) d;
-- Test passes if diff_count = 0

Core patterns and templates

Equality of result sets: use EXCEPT (or MINUS) in both directions; assert zero rows.
Row-level assertions: sum/count comparisons for metrics; assert exact values.
Window logic: verify partitioning and ordering produce the intended row.
Type and rounding: cast to precise numeric and round before compare.
Time: freeze "today" via parameters or constants; avoid non-deterministic functions.

Template: symmetric difference assertion

with actual as (...), expected as (...)
select case when exists (
  (select * from actual except select * from expected)
  union all
  (select * from expected except select * from actual)
) then 'FAIL' else 'PASS' end as result;

Template: numeric tolerance check

with actual as (...), expected as (...)
select case when max(abs(a.val - e.val)) <= 0.00001 then 'PASS' else 'FAIL' end as result
from actual a
join expected e using(key);

Worked examples

Example 1 — Deduplicate to latest record

Goal: keep the most recent row per id.

with input as (
  select 10 id, 'pending' status, timestamp '2024-03-01 09:00' ts union all
  select 10, 'paid',    timestamp '2024-03-01 09:05' union all
  select 11, 'paid',    timestamp '2024-03-01 10:00'
),
expected as (
  select 10 id, 'paid' status, timestamp '2024-03-01 09:05' ts union all
  select 11, 'paid',    timestamp '2024-03-01 10:00'
),
actual as (
  select id, status, ts from (
    select *, row_number() over(partition by id order by ts desc) rn from input
  ) t where rn = 1
)
select count(*) as diffs from (
  select * from actual except select * from expected
  union all
  select * from expected except select * from actual
) d; -- expect 0

Example 2 — Bucket amounts with edge cases

Goal: map amounts to size buckets with rules: amount < 0 → 'refund', 0–99.99 → 'small', 100–499.99 → 'medium', ≥ 500 → 'large'.

with input as (
  select 1 id, -5.00 amount union all
  select 2, 0.00 union all
  select 3, 100.00 union all
  select 4, 500.00
),
expected as (
  select 1 id, 'refund' bucket union all
  select 2, 'small' union all
  select 3, 'medium' union all
  select 4, 'large'
),
actual as (
  select id,
         case
           when amount < 0 then 'refund'
           when amount < 100 then 'small'
           when amount < 500 then 'medium'
           else 'large'
         end as bucket
  from input
)
select count(*) as diffs from (
  select * from actual except select * from expected
  union all
  select * from expected except select * from actual
) d; -- expect 0

Example 3 — Compute daily revenue with rounding

Goal: convert cents to dollars and round to 2 decimals, grouped by date.

with input as (
  select date '2024-06-01' as dt, 12345 as cents union all
  select date '2024-06-01', 55 union all
  select date '2024-06-02', 100
),
expected as (
  select date '2024-06-01' dt, cast(123.45 + 0.55 as numeric(12,2)) revenue_usd union all
  select date '2024-06-02', cast(1.00 as numeric(12,2))
),
actual as (
  select dt, cast(round(sum(cents)/100.0, 2) as numeric(12,2)) as revenue_usd
  from input group by dt
)
select count(*) as diffs from (
  select * from actual except select * from expected
  union all
  select * from expected except select * from actual
) d; -- expect 0

Who this is for

Analytics Engineers building SQL transformations and models.
BI Developers who maintain business logic in views or marts.
Data Analysts who want safer refactors to shared logic.

Prerequisites

Comfortable with SQL (SELECT, JOIN, GROUP BY, window functions).
Basic understanding of your warehouse data types and casting.
Familiarity with how your transformations are executed in your stack.

Learning path

Start with simple equality checks between actual and expected result sets.
Add edge cases: nulls, empty sets, duplicates, boundary dates and amounts.
Introduce deterministic parameters for time and randomness.
Refactor transformations to be more testable (smaller, composable queries/CTEs).
Automate tests in your CI process.

Exercises

These mirror the exercises below. Build tiny inputs, run the transform, assert equality with expected. Keep each result set under ~10 rows.

Exercise 1 — Keep the latest status per order

Create a unit test that ensures your deduplication keeps the most recent status per order_id by updated_at timestamp.

See input and expected

-- Input rows
order_id | status   | updated_at
---------+----------+---------------------
101      | created  | 2024-04-01 08:00:00
101      | paid     | 2024-04-01 08:05:00
102      | shipped  | 2024-04-02 09:00:00

-- Expected rows
order_id | status   | updated_at
---------+----------+---------------------
101      | paid     | 2024-04-01 08:05:00
102      | shipped  | 2024-04-02 09:00:00

Exercise 2 — Category mapping with refunds

Write a unit test for a transformation that assigns categories: amount < 0 → 'refund', 0–49.99 → 'low', 50–199.99 → 'mid', ≥ 200 → 'high'.

See input and expected

-- Input rows
id | amount
---+--------
1  | -10.00
2  | 0.00
3  | 49.99
4  | 50.00
5  | 199.99
6  | 200.00

-- Expected rows
id | category
---+----------
1  | refund
2  | low
3  | low
4  | mid
5  | mid
6  | high

Exercise checklist

Arrange: Defined tiny input and expected result CTEs.
Act: Applied the transformation exactly once.
Assert: Used two-way EXCEPT to check equality.
Deterministic: No NOW(), RAND(), or non-fixed parameters.
Edge cases included: nulls, zero/negative, ties for ordering if relevant.

Common mistakes and self-check

Testing too much at once: If the test is long and fragile, split into smaller units.
Non-deterministic inputs: Replace NOW() with a fixed date; seed randomness.
Ignoring type/rounding: Cast decimals and round before compare.
Missing bidirectional check: One-way EXCEPT misses extra rows; always use both directions.
Skipping edge cases: Include nulls, duplicates, and boundary values.

Self-check prompt

If the input changed slightly (extra duplicate), would your test catch a regression?
Can you explain the rule your test enforces in one sentence?
Does the test fail if you intentionally break the logic?

Practical projects

Project A: Build a unit-tested customer lifecycle transformation (lead → signup → paid) with tie-breaking and date windows.
Project B: Create a unit-tested daily financials model: currency conversion, rounding, tax, and refunds.
Project C: Unit-test event sessionization using window functions, handling out-of-order events.

Mini challenge

Your metric model sums revenue for the last full calendar month. Write a unit test that fixes the reference date to 2024-08-15 and asserts the model aggregates 2024-07-01 to 2024-07-31 only, with the expected total of 12,345.67.

Quick Test note: The quick test is available to everyone. Only logged-in users will see saved progress.

Next steps

Turn these patterns into reusable test templates in your project.
Add unit tests for the riskiest transformations first (metrics and dedupes).
Run the Quick Test to check your understanding and lock in the concepts.

Menu

Unit Tests For Transformations

Table of Contents