
Data Transformations

Learn Data Transformations for ETL Developers for free: a roadmap, worked examples, subskills, and a skill exam.

Published: January 11, 2026 | Updated: January 11, 2026

What this skill covers

Data Transformations turn raw inputs into clean, modeled, and analytics-ready datasets. As an ETL Developer, you’ll design and implement robust SQL- and code-based logic that standardizes types, handles nulls, applies business rules, joins and aggregates data, computes derived metrics, and enriches datasets across sources. Done well, transformations are repeatable, testable, and idempotent.

Why it matters for ETL Developers

  • Trustworthy analytics: Consistent types and rules prevent silent errors.
  • Performance: Efficient joins and aggregations save time and cost.
  • Maintainability: Reusable components reduce duplication and speed up delivery.
  • Scalability: Clear, incremental, idempotent logic supports growing data volumes.

Who this is for

  • ETL/ELT developers building data pipelines.
  • Data engineers responsible for modeling and serving curated datasets.
  • Analytics engineers shaping data into marts and reports.

Prerequisites

  • SQL basics: SELECT, WHERE, GROUP BY, JOIN.
  • Comfort with data types: numeric, text, date/time, boolean.
  • Familiarity with files/tables and batch processing concepts.
  • Optional: Python or PySpark basics for code-driven transforms.

Learning path (practical roadmap)

  1. Standardize data types
    Goal: Enforce canonical types and formats.
    Do: CAST/CONVERT values; trim/normalize text; standardize dates and time zones.
  2. Handle nulls and defaults
    Goal: Make missing-data behavior explicit.
    Do: COALESCE/CASE for defaults; separate unknown vs not applicable.
  3. Joins and aggregations
    Goal: Combine and summarize without row explosions.
    Do: Choose the right join; deduplicate; group by; use window functions.
  4. Business rules
    Goal: Encode policy in data, not code paths.
    Do: Apply CASE logic and mapping tables; version rules as reference data.
  5. Derivations and calculations
    Goal: Compute metrics and flags that power analytics.
    Do: Build ratios, flags, and rolling aggregates with expressions and window functions.
  6. Data enrichment
    Goal: Integrate reference/master data safely and predictably.
    Do: Deduplicate lookup sources; join on stable keys; measure join coverage.
  7. Reusable components
    Goal: Make transformation logic shareable and testable.
    Do: Parameterize logic into views, CTE templates, or macros; add tests and documentation.
  8. Performance and correctness
    Goal: Ensure idempotency, incremental patterns, and reliable reruns.
    Do: Process only new/changed rows with upserts and parameterized run dates; see the sketch after this list.
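
For step 8, a minimal sketch of an idempotent incremental load, assuming an engine that supports MERGE (e.g., SQL Server, Snowflake, PostgreSQL 15+) and a hypothetical staging table stg_orders:

-- Upsert only new/changed rows; rerunning with the same inputs yields the same result
MERGE INTO fct_orders AS t
USING (
  SELECT order_id, customer_id, amount, order_date
  FROM stg_orders
  WHERE order_date >= DATE '2026-01-01' -- stand-in for a parameterized run date; avoid now()
) AS s
ON t.order_id = s.order_id
WHEN MATCHED THEN
  UPDATE SET amount = s.amount, order_date = s.order_date
WHEN NOT MATCHED THEN
  INSERT (order_id, customer_id, amount, order_date)
  VALUES (s.order_id, s.customer_id, s.amount, s.order_date);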

Worked examples

1) Data type standardization

-- Standardize types and formats for a customers staging table
WITH src AS (
  SELECT customer_id,
         TRIM(LOWER(email))         AS email_norm,
         REGEXP_REPLACE(phone, '[^0-9]', '') AS phone_digits, -- if regex available; else nested REPLACE
         created_at
  FROM raw_customers
)
SELECT CAST(customer_id AS BIGINT)            AS customer_id,
       email_norm                             AS email,
       phone_digits                           AS phone_e164_guess,
       CAST(created_at AS TIMESTAMP)          AS created_at_ts
FROM src;
Why this works
We normalize text and constrain types at the edge. If REGEXP_REPLACE is not available, replace characters with nested REPLACE calls. Keep transformations idempotent so reruns yield the same results.
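
If REGEXP_REPLACE is unavailable, a minimal nested-REPLACE fallback looks like this sketch; extend the nesting for any other characters that appear in your data:

-- Strip common phone punctuation without regex support
SELECT customer_id,
       REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(
         phone, '(', ''), ')', ''), '-', ''), ' ', ''), '+', '') AS phone_digits
FROM raw_customers;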

2) Handling nulls and defaults

-- Replace null loyalty_tier with 'basic'; preserve other values
SELECT customer_id,
       COALESCE(loyalty_tier, 'basic') AS loyalty_tier,
       CASE WHEN country IS NULL THEN 'UNKNOWN' ELSE country END AS country
FROM dim_customer;
Tip
Use different sentinel values for unknown vs not_applicable when they mean different things.
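
For example, a consumer account has no company name (not applicable), while a business account with a missing name is genuinely unknown. A sketch, assuming hypothetical account_type and company_name columns:

-- Distinguish data that is missing from data that cannot exist
SELECT customer_id,
       CASE
         WHEN account_type = 'consumer' THEN 'not_applicable' -- field does not apply
         WHEN company_name IS NULL      THEN 'unknown'        -- value failed to arrive
         ELSE company_name
       END AS company_name_clean
FROM dim_customer;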

3) Text and date transformations

-- Normalize casing and extract first day of month
SELECT order_id,
       UPPER(TRIM(state_code))                          AS state_code,
       LOWER(TRIM(email))                               AS email,
       DATE_TRUNC('month', order_date)                  AS month_start -- or build from YEAR/MONTH + day 1 per your engine
FROM fct_orders;
Try this
If DATE_TRUNC is unavailable, construct month_start from YEAR(order_date), MONTH(order_date), and day=1 using your engine's date constructor.
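
A PostgreSQL-style sketch of that construction (MAKE_DATE and EXTRACT names vary by engine):

-- Build month_start without DATE_TRUNC
SELECT order_id,
       MAKE_DATE(CAST(EXTRACT(YEAR FROM order_date) AS INT),
                 CAST(EXTRACT(MONTH FROM order_date) AS INT),
                 1) AS month_start
FROM fct_orders;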

4) Joins without row explosions

-- Deduplicate dimension to one row per key before joining
WITH dim_customer_one AS (
  SELECT customer_id,
         ANY_VALUE(country) AS country -- use MIN/MAX or window + ROW_NUMBER = 1 per engine
  FROM dim_customer
  GROUP BY customer_id
)
SELECT o.order_id,
       o.amount,
       d.country
FROM fct_orders o
LEFT JOIN dim_customer_one d
  ON o.customer_id = d.customer_id;
Debugging
If row counts inflate after a join, check for duplicate keys on the dimension side and deduplicate or choose the intended record with ROW_NUMBER() = 1.
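
A sketch of the ROW_NUMBER approach, assuming an updated_at column exists to define the intended record:

-- Keep exactly one row per customer_id: the most recently updated one
WITH ranked AS (
  SELECT customer_id,
         country,
         ROW_NUMBER() OVER (PARTITION BY customer_id ORDER BY updated_at DESC) AS rn
  FROM dim_customer
)
SELECT customer_id, country
FROM ranked
WHERE rn = 1;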

5) Business rules via mapping table

-- Replace free-text status with governed categories
WITH map AS (
  SELECT 'paid' AS raw_status, 'closed' AS std_status UNION ALL
  SELECT 'complete','closed' UNION ALL
  SELECT 'pending','open' UNION ALL
  SELECT 'awaiting','open'
)
SELECT o.order_id,
       COALESCE(m.std_status, 'other') AS status_category
FROM fct_orders o
LEFT JOIN map m
  ON LOWER(o.status) = m.raw_status;
Why mapping tables?
They keep rules data-driven and easy to extend without editing code. Store and version them like normal reference data.
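
A minimal sketch of persisting the mapping as reference data (table and column names are illustrative):

-- Store rules as data so they can be extended without code changes
CREATE TABLE ref_status_map (
  raw_status VARCHAR(50) PRIMARY KEY,
  std_status VARCHAR(20) NOT NULL
);

INSERT INTO ref_status_map (raw_status, std_status) VALUES
  ('paid', 'closed'),
  ('complete', 'closed'),
  ('pending', 'open'),
  ('awaiting', 'open');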

6) Derived metrics with window functions

-- 7-day rolling revenue per customer
SELECT customer_id,
       order_date,
       amount,
       SUM(amount) OVER (
         PARTITION BY customer_id
         ORDER BY order_date
         ROWS BETWEEN 6 PRECEDING AND CURRENT ROW
       ) AS revenue_7d
FROM fct_orders;
Notes
Use ROWS for equally spaced daily facts; use RANGE with date intervals where supported. Validate windows for gaps in dates.
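
Where RANGE with date intervals is supported (e.g., PostgreSQL 11+), a calendar-accurate variant looks like this sketch:

-- 7 calendar days (robust to gaps), rather than 7 physical rows
SELECT customer_id,
       order_date,
       amount,
       SUM(amount) OVER (
         PARTITION BY customer_id
         ORDER BY order_date
         RANGE BETWEEN INTERVAL '6 days' PRECEDING AND CURRENT ROW
       ) AS revenue_7d
FROM fct_orders;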

Drills and exercises

  • Cast mixed numeric strings to DECIMAL; count failures safely (invalid strings become NULL).
  • Normalize names: trim spaces, collapse multiple spaces, title-case or upper-case by your standard.
  • Build a 1-row-per-customer dimension and prove it with a uniqueness check (see the sketch after this list).
  • Create a mapping table that consolidates 5+ messy status values into 3 canonical categories.
  • Compute monthly active customers using window functions; validate against a control query.
  • Implement defaults for missing country and justify the sentinel values you choose.
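
For the uniqueness drill, a sketch of a control query that should return zero rows (using dim_customer_one from worked example 4):

-- Any output means duplicate keys remain
SELECT customer_id, COUNT(*) AS row_count
FROM dim_customer_one
GROUP BY customer_id
HAVING COUNT(*) > 1;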

Common mistakes and debugging tips

1) Turning a LEFT JOIN into an INNER JOIN by filtering in WHERE
Move filters on the right table into the ON clause when you need to keep unmatched left rows. Example: LEFT JOIN dim d ON f.k = d.k AND d.active = 1 (not WHERE d.active = 1).
2) Row explosions from non-unique dimensions
Deduplicate dimension keys with GROUP BY or ROW_NUMBER() before joining. Add a test that asserts one row per key.
3) Silent type coercions
Explicitly CAST and validate ranges. Count invalid casts by comparing the source string against the casted value, or use try-cast variants where available (see the sketch after this list).
4) Mixed time zones and date math
Normalize timestamps to a canonical zone (e.g., UTC) on ingestion; only convert for presentation. Document the standard.
5) Non-idempotent transformations
Avoid using now() or non-deterministic functions without controls. Parameterize run dates and use them consistently.
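
For mistake 3, a sketch of counting invalid casts, assuming an engine with TRY_CAST (SQL Server, Snowflake; BigQuery's equivalent is SAFE_CAST) and a hypothetical raw_orders.amount_str column:

-- Rows where the string is present but fails to cast are invalid
SELECT COUNT(*) AS invalid_amounts
FROM raw_orders
WHERE amount_str IS NOT NULL
  AND TRY_CAST(amount_str AS DECIMAL(12,2)) IS NULL;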

Mini project: Sales Orders Curated View

  1. Stage raw orders and customers; standardize types and timestamp zones.
  2. Normalize text fields (emails, states); enforce canonical casing.
  3. Implement a status mapping table and apply it.
  4. Enrich orders with customer country; deduplicate the dimension first.
  5. Compute derived metrics: revenue_7d, first_purchase_date, order_rank per customer.
  6. Package logic into a reusable view or parameterized SQL template.
  7. Add data quality checks: uniqueness of order_id, non-null critical fields, join coverage rate (see the sketch below).
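
A sketch of the three checks against the curated output (fct_orders_curated, as named in the deliverables):

-- 1) Uniqueness of order_id: expect zero rows
SELECT order_id, COUNT(*) AS dup_count
FROM fct_orders_curated
GROUP BY order_id
HAVING COUNT(*) > 1;

-- 2) Null rate on a critical field
SELECT AVG(CASE WHEN customer_id IS NULL THEN 1.0 ELSE 0.0 END) AS customer_id_null_rate
FROM fct_orders_curated;

-- 3) Join coverage: share of orders enriched with a country
SELECT AVG(CASE WHEN country IS NULL THEN 0.0 ELSE 1.0 END) AS country_coverage
FROM fct_orders_curated;
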
Deliverables checklist
  • DDL/DML script (or notebook) with clear steps and rerunnable logic.
  • A view/table fct_orders_curated with documented columns.
  • Quality report: row counts before/after, null rates, join coverage.

Practical project ideas

  • Marketing touchpoints model: Standardize campaign event data, unify UTM parameters, compute assisted conversions by channel.
  • Subscription metrics mart: Normalize plan tiers, compute MRR/ARR, churn/retention flags with window functions.
  • Product catalog harmonization: Merge SKUs across vendors, map categories with a governance table, detect duplicates.

Subskills

  • Data Type Standardization — Enforce consistent numeric, text, and datetime types; normalize formats.
  • Business Rules Implementation — Encode policies via CASE/mapping tables; make decisions data-driven.
  • Joins And Aggregations — Combine data without duplicates; summarize correctly; apply window functions.
  • Derivations And Calculations — Create metrics, ratios, flags, and rolling aggregates.
  • Handling Nulls And Defaults — Make missingness explicit with COALESCE and clear sentinels.
  • Text And Date Transformations — Trim, normalize, parse, and standardize dates/time zones.
  • Data Enrichment From Multiple Sources — Join reference/master data safely; validate coverage.
  • Building Reusable Transformation Components — Views, templates, macros, and tests for maintainable pipelines.

Next steps

  • Refactor one of your existing pipelines to use a mapping table and a reusable view.
  • Add three data quality tests (uniqueness, null checks, join coverage) to your curated dataset.
  • Practice an incremental pattern: process only new/changed rows and prove idempotency.

Skill exam

Data Transformations — Skill Exam

Test your understanding of Data Transformations. The exam is free, open to everyone, and can be retaken as many times as you like; logged-in users have their progress and best score saved, and recommendations are shown at the end based on your score.

12 questions, 70% to pass
