What this skill covers
Data Transformations turn raw inputs into clean, modeled, and analytics-ready datasets. As an ETL Developer, you’ll design and implement robust SQL- and code-based logic that standardizes types, handles nulls, applies business rules, joins and aggregates data, computes derived metrics, and enriches datasets across sources. Done well, transformations are repeatable, testable, and idempotent.
Why it matters for ETL Developers
- Trustworthy analytics: Consistent types and rules prevent silent errors.
- Performance: Efficient joins and aggregations save time and cost.
- Maintainability: Reusable components reduce duplication and speed up delivery.
- Scalability: Clear, incremental, idempotent logic supports growing data volumes.
Who this is for
- ETL/ELT developers building data pipelines.
- Data engineers responsible for modeling and serving curated datasets.
- Analytics engineers shaping data into marts and reports.
Prerequisites
- SQL basics: SELECT, WHERE, GROUP BY, JOIN.
- Comfort with data types: numeric, text, date/time, boolean.
- Familiarity with files/tables and batch processing concepts.
- Optional: Python or PySpark basics for code-driven transforms.
Learning path (practical roadmap)
- Standardize data types
  Goal: Enforce canonical types and formats.
  Do: CAST/CONVERT values; trim/normalize text; standardize dates and time zones.
- Handle nulls and defaults
  Goal: Make missing-data behavior explicit.
  Do: COALESCE/CASE for defaults; separate unknown vs not applicable.
- Joins and aggregations
  Goal: Combine and summarize without row explosions.
  Do: Choose the right join; deduplicate; group by; use window functions.
- Business rules
  Goal: Encode policy in data, not code paths.
- Derivations and calculations
  Goal: Compute metrics and flags that power analytics.
- Data enrichment
  Goal: Integrate reference/master data safely and predictably.
- Reusable components
  Goal: Parameterize logic into views, CTE templates, or macros; add tests and documentation.
- Performance and correctness
  Goal: Ensure idempotency, incremental patterns, and reliable reruns.
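The idempotency goal above can be sketched with a keyed upsert: rerunning the same batch converges to the same final state instead of duplicating rows. This minimal sketch uses SQLite's `INSERT OR REPLACE` (table and column names are illustrative; warehouses typically express the same idea with MERGE or `INSERT ... ON CONFLICT`):

```python
import sqlite3

# Idempotent load: the primary key makes reruns replace, not duplicate.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE curated_orders (order_id INTEGER PRIMARY KEY, amount REAL)")

batch = [(1, 10.0), (2, 25.5)]

def load(batch_rows):
    # Keyed replace: applying the same batch twice yields the same state.
    con.executemany(
        "INSERT OR REPLACE INTO curated_orders (order_id, amount) VALUES (?, ?)",
        batch_rows,
    )

load(batch)
load(batch)  # rerun the same batch: no duplicates, identical final state
rows = con.execute(
    "SELECT order_id, amount FROM curated_orders ORDER BY order_id"
).fetchall()
print(rows)  # [(1, 10.0), (2, 25.5)]
```

An incremental pattern adds a filter on new/changed keys before this load step; the replace semantics then make partial reruns safe.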
Worked examples
1) Data type standardization
-- Standardize types and formats for a customers staging table
WITH src AS (
    SELECT customer_id,
           TRIM(LOWER(email)) AS email_norm,
           REGEXP_REPLACE(phone, '[^0-9]', '') AS phone_digits, -- if regex available; else nested REPLACE
           created_at
    FROM raw_customers
)
SELECT CAST(customer_id AS BIGINT) AS customer_id,
       email_norm AS email,
       phone_digits AS phone_e164_guess,
       CAST(created_at AS TIMESTAMP) AS created_at_ts
FROM src;
Why this works: the staging CTE normalizes text once, so the final SELECT only casts; explicit CASTs enforce canonical types instead of relying on engine-specific implicit coercion.
2) Handling nulls and defaults
-- Replace null loyalty_tier with 'basic'; preserve other values
SELECT customer_id,
       COALESCE(loyalty_tier, 'basic') AS loyalty_tier,
       CASE WHEN country IS NULL THEN 'UNKNOWN' ELSE country END AS country
FROM dim_customer;
Tip: choose distinct sentinels for "unknown" vs "not applicable" so downstream consumers can tell the two kinds of missingness apart.
3) Text and date transformations
-- Normalize casing and extract first day of month
SELECT order_id,
       UPPER(TRIM(state_code)) AS state_code,
       LOWER(TRIM(email)) AS email,
       DATE_TRUNC('month', order_date) AS month_start -- or build from YEAR/MONTH + day 1 per your engine
FROM fct_orders;
Try this: add a quarter_start column the same way, then verify every month_start falls on day 1 of its month.
4) Joins without row explosions
-- Deduplicate dimension to one row per key before joining
WITH dim_customer_one AS (
    SELECT customer_id,
           ANY_VALUE(country) AS country -- use MIN/MAX or window + ROW_NUMBER = 1 per engine
    FROM dim_customer
    GROUP BY customer_id
)
SELECT o.order_id,
       o.amount,
       d.country
FROM fct_orders o
LEFT JOIN dim_customer_one d
  ON o.customer_id = d.customer_id;
Debugging: compare COUNT(*) on fct_orders before and after the join; if the joined count is higher, the dimension still has duplicate keys.
5) Business rules via mapping table
-- Replace free-text status with governed categories
WITH map AS (
    SELECT 'paid' AS raw_status, 'closed' AS std_status UNION ALL
    SELECT 'complete', 'closed' UNION ALL
    SELECT 'pending', 'open' UNION ALL
    SELECT 'awaiting', 'open'
)
SELECT o.order_id,
       COALESCE(m.std_status, 'other') AS status_category
FROM fct_orders o
LEFT JOIN map m
  ON LOWER(o.status) = m.raw_status;
Why mapping tables? They put policy in data: supporting a new raw status is a row insert, not a code change, and the mapping can be governed and audited like any other table.
6) Derived metrics with window functions
-- 7-day rolling revenue per customer
SELECT customer_id,
       order_date,
       amount,
       SUM(amount) OVER (
           PARTITION BY customer_id
           ORDER BY order_date
           ROWS BETWEEN 6 PRECEDING AND CURRENT ROW
       ) AS revenue_7d
FROM fct_orders;
Notes: ROWS BETWEEN 6 PRECEDING counts rows, not days; if customers have gaps or multiple orders per day, use a RANGE frame or join to a calendar spine for a true 7-day window.
Drills and exercises
- Cast mixed numeric strings to DECIMAL; count failures safely (invalid strings become NULL).
- Normalize names: trim spaces, collapse multiple spaces, title-case or upper-case by your standard.
- Build a 1-row-per-customer dimension and prove it with a uniqueness check.
- Create a mapping table that consolidates 5+ messy status values into 3 canonical categories.
- Compute monthly active customers using window functions; validate against a control query.
- Implement defaults for missing country and justify the sentinel values you choose.
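The first drill above can be sketched in Python before committing to engine-specific safe-cast syntax (the sample values and the comma-stripping rule are illustrative assumptions, not a standard):

```python
from decimal import Decimal, InvalidOperation

# Cast mixed numeric strings, counting failures instead of aborting the load.
raw = ["12.50", " 7 ", "N/A", "3,000", ""]

def safe_decimal(s):
    try:
        # Normalization choice (assumed): trim whitespace, strip thousands commas.
        return Decimal(s.strip().replace(",", ""))
    except (InvalidOperation, AttributeError):
        return None  # invalid strings become NULL-like, never an exception

values = [safe_decimal(s) for s in raw]
failures = sum(v is None for v in values)
print(values, failures)  # "N/A" and "" fail; the other three parse
```

The same pattern maps to SQL engines that offer TRY_CAST/SAFE_CAST: invalid strings become NULL, and a COUNT over the NULLs gives the failure rate.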
Common mistakes and debugging tips
1) Turning a LEFT JOIN into an INNER JOIN by filtering in WHERE: a predicate on the right-hand table discards the NULL rows from unmatched left rows; move the filter into the ON clause, or explicitly allow NULLs in WHERE.
2) Row explosions from non-unique dimensions: deduplicate to one row per key before joining, and prove uniqueness with a GROUP BY ... HAVING COUNT(*) > 1 check.
3) Silent type coercions: comparing strings to numbers or dates triggers engine-specific implicit casts; cast explicitly on both sides of the comparison.
4) Mixed time zones and date math: store timestamps in UTC, convert only at the presentation layer, and never do arithmetic across timestamps from different zones.
5) Non-idempotent transformations: use keyed upserts or full refreshes so rerunning a batch produces the same result instead of duplicated rows.
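Mistake (1) is easy to reproduce end to end. This sketch (illustrative table names, run on SQLite) shows how a WHERE filter on the right-hand table silently drops an unmatched order, while the same filter in the ON clause preserves it:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE orders (order_id INTEGER, customer_id INTEGER);
CREATE TABLE customers (customer_id INTEGER, country TEXT);
INSERT INTO orders VALUES (1, 10), (2, 99);  -- customer 99 has no dimension row
INSERT INTO customers VALUES (10, 'US');
""")

# WHERE on the right-hand table removes the NULL row: order 2 vanishes.
wrong = con.execute("""
    SELECT o.order_id FROM orders o
    LEFT JOIN customers c ON o.customer_id = c.customer_id
    WHERE c.country = 'US'
""").fetchall()

# The same predicate in ON keeps unmatched orders with a NULL country.
right = con.execute("""
    SELECT o.order_id FROM orders o
    LEFT JOIN customers c
      ON o.customer_id = c.customer_id AND c.country = 'US'
""").fetchall()

print(len(wrong), len(right))  # 1 2
```

Checking row counts of both variants against the bare left table is a quick regression test for this class of bug.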
Mini project: Sales Orders Curated View
- Stage raw orders and customers; standardize types and timestamp zones.
- Normalize text fields (emails, states); enforce canonical casing.
- Implement a status mapping table and apply it.
- Enrich orders with customer country; deduplicate the dimension first.
- Compute derived metrics: revenue_7d, first_purchase_date, order_rank per customer.
- Package logic into a reusable view or parameterized SQL template.
- Add data quality checks: uniqueness of order_id, non-null critical fields, join coverage rate.
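The three quality checks in the last step can each be expressed as a single aggregate query. A minimal sketch, assuming the curated table and column names from this project (the sample rows are illustrative):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE fct_orders_curated (order_id INTEGER, customer_id INTEGER, country TEXT);
INSERT INTO fct_orders_curated VALUES (1, 10, 'US'), (2, 11, NULL), (3, 12, 'DE');
""")

# Uniqueness: number of duplicated order_ids, expected 0.
dupes = con.execute("""
    SELECT COUNT(*) FROM (
      SELECT order_id FROM fct_orders_curated
      GROUP BY order_id HAVING COUNT(*) > 1)
""").fetchone()[0]

# Null rate on a critical field.
null_rate = con.execute("""
    SELECT AVG(CASE WHEN country IS NULL THEN 1.0 ELSE 0.0 END)
    FROM fct_orders_curated
""").fetchone()[0]

# Join coverage: share of orders that received the enrichment value.
coverage = con.execute("""
    SELECT AVG(CASE WHEN country IS NOT NULL THEN 1.0 ELSE 0.0 END)
    FROM fct_orders_curated
""").fetchone()[0]

print(dupes, round(null_rate, 2), round(coverage, 2))  # 0 0.33 0.67
```

Fail the pipeline when dupes is nonzero or coverage drops below an agreed threshold; these numbers also feed the quality report deliverable.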
Deliverables checklist
- DDL/DML script (or notebook) with clear steps and rerunnable logic.
- A view/table fct_orders_curated with documented columns.
- Quality report: row counts before/after, null rates, join coverage.
Practical project ideas
- Marketing touchpoints model: Standardize campaign event data, unify UTM parameters, compute assisted conversions by channel.
- Subscription metrics mart: Normalize plan tiers, compute MRR/ARR, churn/retention flags with window functions.
- Product catalog harmonization: Merge SKUs across vendors, map categories with a governance table, detect duplicates.
Subskills
- Data Type Standardization — Enforce consistent numeric, text, and datetime types; normalize formats.
- Business Rules Implementation — Encode policies via CASE/mapping tables; make decisions data-driven.
- Joins And Aggregations — Combine data without duplicates; summarize correctly; apply window functions.
- Derivations And Calculations — Create metrics, ratios, flags, and rolling aggregates.
- Handling Nulls And Defaults — Make missingness explicit with COALESCE and clear sentinels.
- Text And Date Transformations — Trim, normalize, parse, and standardize dates/time zones.
- Data Enrichment From Multiple Sources — Join reference/master data safely; validate coverage.
- Building Reusable Transformation Components — Views, templates, macros, and tests for maintainable pipelines.
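The "Building Reusable Transformation Components" subskill can start as simply as a parameterized SQL fragment, the same idea dbt macros formalize. A hypothetical sketch (function and column names are illustrative):

```python
# Reusable component: one function generates the standard text-normalization
# expression, so every model trims and lower-cases text the same way.
def normalize_text_sql(column, alias=None):
    """Return a SQL expression that trims and lower-cases a text column."""
    return f"LOWER(TRIM({column})) AS {alias or column}"

select = "SELECT " + ", ".join(
    [normalize_text_sql("email"), normalize_text_sql("state_code", "state")]
)
print(select)
# SELECT LOWER(TRIM(email)) AS email, LOWER(TRIM(state_code)) AS state
```

Centralizing the expression means a policy change (say, also collapsing internal whitespace) is edited once and every model that uses the component picks it up.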
Next steps
- Refactor one of your existing pipelines to use a mapping table and a reusable view.
- Add three data quality tests (uniqueness, null checks, join coverage) to your curated dataset.
- Practice an incremental pattern: process only new/changed rows and prove idempotency.
Skill exam
The exam is available to everyone.