What this skill covers
Data Transformations turn raw inputs into clean, modeled, and analytics-ready datasets. As an ETL Developer, you’ll design and implement robust SQL- and code-based logic that standardizes types, handles nulls, applies business rules, joins and aggregates data, computes derived metrics, and enriches datasets across sources. Done well, transformations are repeatable, testable, and idempotent.
Why it matters for ETL Developers
- Trustworthy analytics: Consistent types and rules prevent silent errors.
- Performance: Efficient joins and aggregations save time and cost.
- Maintainability: Reusable components reduce duplication and speed up delivery.
- Scalability: Clear, incremental, idempotent logic supports growing data volumes.
Who this is for
- ETL/ELT developers building data pipelines.
- Data engineers responsible for modeling and serving curated datasets.
- Analytics engineers shaping data into marts and reports.
Prerequisites
- SQL basics: SELECT, WHERE, GROUP BY, JOIN.
- Comfort with data types: numeric, text, date/time, boolean.
- Familiarity with files/tables and batch processing concepts.
- Optional: Python or PySpark basics for code-driven transforms.
Learning path (practical roadmap)
- Standardize data types
  Goal: Enforce canonical types and formats.
  Do: CAST/CONVERT values; trim/normalize text; standardize dates and time zones.
- Handle nulls and defaults
  Goal: Make missing-data behavior explicit.
  Do: COALESCE/CASE for defaults; separate unknown vs not applicable.
- Joins and aggregations
  Goal: Combine and summarize without row explosions.
  Do: Choose the right join; deduplicate; group by; use window functions.
- Business rules
  Goal: Encode policy in data, not code paths.
- Derivations and calculations
  Goal: Compute metrics and flags that power analytics.
- Data enrichment
  Goal: Integrate reference/master data safely and predictably.
- Reusable components
  Goal: Parameterize logic into views, CTE templates, or macros; add tests and documentation.
- Performance and correctness
  Goal: Ensure idempotency, incremental patterns, and reliable reruns.
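The idempotency goal above can be sketched with a keyed upsert: rerunning the same batch converges to the same final state instead of duplicating rows. This minimal sketch uses SQLite's `INSERT OR REPLACE` (table and column names are illustrative; warehouses typically express the same idea with MERGE or `INSERT ... ON CONFLICT`):

```python
import sqlite3

# Idempotent load: the primary key makes reruns replace, not duplicate.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE curated_orders (order_id INTEGER PRIMARY KEY, amount REAL)")

batch = [(1, 10.0), (2, 25.5)]

def load(batch_rows):
    # Keyed replace: applying the same batch twice yields the same state.
    con.executemany(
        "INSERT OR REPLACE INTO curated_orders (order_id, amount) VALUES (?, ?)",
        batch_rows,
    )

load(batch)
load(batch)  # rerun the same batch: no duplicates, identical final state
rows = con.execute(
    "SELECT order_id, amount FROM curated_orders ORDER BY order_id"
).fetchall()
print(rows)  # [(1, 10.0), (2, 25.5)]
```

An incremental pattern adds a filter on new/changed keys before this load step; the replace semantics then make partial reruns safe.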
Worked examples
1) Data type standardization
-- Standardize types and formats for a customers staging table
WITH src AS (
    SELECT customer_id,
           TRIM(LOWER(email)) AS email_norm,
           REGEXP_REPLACE(phone, '[^0-9]', '') AS phone_digits, -- if regex available; else nested REPLACE
           created_at
    FROM raw_customers
)
SELECT CAST(customer_id AS BIGINT) AS customer_id,
       email_norm AS email,
       phone_digits AS phone_e164_guess,
       CAST(created_at AS TIMESTAMP) AS created_at_ts
FROM src;
Why this works: the staging CTE normalizes text once, so the final SELECT only casts; explicit CASTs enforce canonical types instead of relying on engine-specific implicit coercion.
2) Handling nulls and defaults
-- Replace null loyalty_tier with 'basic'; preserve other values
SELECT customer_id,
       COALESCE(loyalty_tier, 'basic') AS loyalty_tier,
       CASE WHEN country IS NULL THEN 'UNKNOWN' ELSE country END AS country
FROM dim_customer;
Tip: choose distinct sentinels for "unknown" vs "not applicable" so downstream consumers can tell the two kinds of missingness apart.
3) Text and date transformations
-- Normalize casing and extract first day of month
SELECT order_id,
       UPPER(TRIM(state_code)) AS state_code,
       LOWER(TRIM(email)) AS email,
       DATE_TRUNC('month', order_date) AS month_start -- or build from YEAR/MONTH + day 1 per your engine
FROM fct_orders;
Try this: add a quarter_start column the same way, then verify every month_start falls on day 1 of its month.
4) Joins without row explosions
-- Deduplicate dimension to one row per key before joining
WITH dim_customer_one AS (
    SELECT customer_id,
           ANY_VALUE(country) AS country -- use MIN/MAX or window + ROW_NUMBER = 1 per engine
    FROM dim_customer
    GROUP BY customer_id
)
SELECT o.order_id,
       o.amount,
       d.country
FROM fct_orders o
LEFT JOIN dim_customer_one d
  ON o.customer_id = d.customer_id;
Debugging: compare COUNT(*) on fct_orders before and after the join; if the joined count is higher, the dimension still has duplicate keys.
5) Business rules via mapping table
-- Replace free-text status with governed categories
WITH map AS (
    SELECT 'paid' AS raw_status, 'closed' AS std_status UNION ALL
    SELECT 'complete', 'closed' UNION ALL
    SELECT 'pending', 'open' UNION ALL
    SELECT 'awaiting', 'open'
)
SELECT o.order_id,
       COALESCE(m.std_status, 'other') AS status_category
FROM fct_orders o
LEFT JOIN map m
  ON LOWER(o.status) = m.raw_status;
Why mapping tables? They put policy in data: supporting a new raw status is a row insert, not a code change, and the mapping can be governed and audited like any other table.
6) Derived metrics with window functions
-- 7-day rolling revenue per customer
SELECT customer_id,
       order_date,
       amount,
       SUM(amount) OVER (
           PARTITION BY customer_id
           ORDER BY order_date
           ROWS BETWEEN 6 PRECEDING AND CURRENT ROW
       ) AS revenue_7d
FROM fct_orders;
Notes: ROWS BETWEEN 6 PRECEDING counts rows, not days; if customers have gaps or multiple orders per day, use a RANGE frame or join to a calendar spine for a true 7-day window.
Drills and exercises
- Cast mixed numeric strings to DECIMAL; count failures safely (invalid strings become NULL).
- Normalize names: trim spaces, collapse multiple spaces, title-case or upper-case by your standard.
- Build a 1-row-per-customer dimension and prove it with a uniqueness check.
- Create a mapping table that consolidates 5+ messy status values into 3 canonical categories.
- Compute monthly active customers using window functions; validate against a control query.
- Implement defaults for missing country and justify the sentinel values you choose.
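The first drill above can be sketched in Python before committing to engine-specific safe-cast syntax (the sample values and the comma-stripping rule are illustrative assumptions, not a standard):

```python
from decimal import Decimal, InvalidOperation

# Cast mixed numeric strings, counting failures instead of aborting the load.
raw = ["12.50", " 7 ", "N/A", "3,000", ""]

def safe_decimal(s):
    try:
        # Normalization choice (assumed): trim whitespace, strip thousands commas.
        return Decimal(s.strip().replace(",", ""))
    except (InvalidOperation, AttributeError):
        return None  # invalid strings become NULL-like, never an exception

values = [safe_decimal(s) for s in raw]
failures = sum(v is None for v in values)
print(values, failures)  # "N/A" and "" fail; the other three parse
```

The same pattern maps to SQL engines that offer TRY_CAST/SAFE_CAST: invalid strings become NULL, and a COUNT over the NULLs gives the failure rate.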
Common mistakes and debugging tips
1) Turning a LEFT JOIN into an INNER JOIN by filtering in WHERE: a predicate on the right-hand table discards the NULL rows from unmatched left rows; move the filter into the ON clause, or explicitly allow NULLs in WHERE.
2) Row explosions from non-unique dimensions: deduplicate to one row per key before joining, and prove uniqueness with a GROUP BY ... HAVING COUNT(*) > 1 check.
3) Silent type coercions: comparing strings to numbers or dates triggers engine-specific implicit casts; cast explicitly on both sides of the comparison.
4) Mixed time zones and date math: store timestamps in UTC, convert only at the presentation layer, and never do arithmetic across timestamps from different zones.
5) Non-idempotent transformations: use keyed upserts or full refreshes so rerunning a batch produces the same result instead of duplicated rows.
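Mistake (1) is easy to reproduce end to end. This sketch (illustrative table names, run on SQLite) shows how a WHERE filter on the right-hand table silently drops an unmatched order, while the same filter in the ON clause preserves it:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE orders (order_id INTEGER, customer_id INTEGER);
CREATE TABLE customers (customer_id INTEGER, country TEXT);
INSERT INTO orders VALUES (1, 10), (2, 99);  -- customer 99 has no dimension row
INSERT INTO customers VALUES (10, 'US');
""")

# WHERE on the right-hand table removes the NULL row: order 2 vanishes.
wrong = con.execute("""
    SELECT o.order_id FROM orders o
    LEFT JOIN customers c ON o.customer_id = c.customer_id
    WHERE c.country = 'US'
""").fetchall()

# The same predicate in ON keeps unmatched orders with a NULL country.
right = con.execute("""
    SELECT o.order_id FROM orders o
    LEFT JOIN customers c
      ON o.customer_id = c.customer_id AND c.country = 'US'
""").fetchall()

print(len(wrong), len(right))  # 1 2
```

Checking row counts of both variants against the bare left table is a quick regression test for this class of bug.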
Mini project: Sales Orders Curated View
- Stage raw orders and customers; standardize types and timestamp zones.
- Normalize text fields (emails, states); enforce canonical casing.
- Implement a status mapping table and apply it.
- Enrich orders with customer country; deduplicate the dimension first.
- Compute derived metrics: revenue_7d, first_purchase_date, order_rank per customer.
- Package logic into a reusable view or parameterized SQL template.
- Add data quality checks: uniqueness of order_id, non-null critical fields, join coverage rate.
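The three quality checks in the last step can each be expressed as a single aggregate query. A minimal sketch, assuming the curated table and column names from this project (the sample rows are illustrative):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE fct_orders_curated (order_id INTEGER, customer_id INTEGER, country TEXT);
INSERT INTO fct_orders_curated VALUES (1, 10, 'US'), (2, 11, NULL), (3, 12, 'DE');
""")

# Uniqueness: number of duplicated order_ids, expected 0.
dupes = con.execute("""
    SELECT COUNT(*) FROM (
      SELECT order_id FROM fct_orders_curated
      GROUP BY order_id HAVING COUNT(*) > 1)
""").fetchone()[0]

# Null rate on a critical field.
null_rate = con.execute("""
    SELECT AVG(CASE WHEN country IS NULL THEN 1.0 ELSE 0.0 END)
    FROM fct_orders_curated
""").fetchone()[0]

# Join coverage: share of orders that received the enrichment value.
coverage = con.execute("""
    SELECT AVG(CASE WHEN country IS NOT NULL THEN 1.0 ELSE 0.0 END)
    FROM fct_orders_curated
""").fetchone()[0]

print(dupes, round(null_rate, 2), round(coverage, 2))  # 0 0.33 0.67
```

Fail the pipeline when dupes is nonzero or coverage drops below an agreed threshold; these numbers also feed the quality report deliverable.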
Deliverables checklist
- DDL/DML script (or notebook) with clear steps and rerunnable logic.
- A view/table fct_orders_curated with documented columns.
- Quality report: row counts before/after, null rates, join coverage.
Practical project ideas
- Marketing touchpoints model: Standardize campaign event data, unify UTM parameters, compute assisted conversions by channel.
- Subscription metrics mart: Normalize plan tiers, compute MRR/ARR, churn/retention flags with window functions.
- Product catalog harmonization: Merge SKUs across vendors, map categories with a governance table, detect duplicates.
Subskills
- Data Type Standardization — Enforce consistent numeric, text, and datetime types; normalize formats.
- Business Rules Implementation — Encode policies via CASE/mapping tables; make decisions data-driven.
- Joins And Aggregations — Combine data without duplicates; summarize correctly; apply window functions.
- Derivations And Calculations — Create metrics, ratios, flags, and rolling aggregates.
- Handling Nulls And Defaults — Make missingness explicit with COALESCE and clear sentinels.
- Text And Date Transformations — Trim, normalize, parse, and standardize dates/time zones.
- Data Enrichment From Multiple Sources — Join reference/master data safely; validate coverage.
- Building Reusable Transformation Components — Views, templates, macros, and tests for maintainable pipelines.
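The "Building Reusable Transformation Components" subskill can start as simply as a parameterized SQL fragment, the same idea dbt macros formalize. A hypothetical sketch (function and column names are illustrative):

```python
# Reusable component: one function generates the standard text-normalization
# expression, so every model trims and lower-cases text the same way.
def normalize_text_sql(column, alias=None):
    """Return a SQL expression that trims and lower-cases a text column."""
    return f"LOWER(TRIM({column})) AS {alias or column}"

select = "SELECT " + ", ".join(
    [normalize_text_sql("email"), normalize_text_sql("state_code", "state")]
)
print(select)
# SELECT LOWER(TRIM(email)) AS email, LOWER(TRIM(state_code)) AS state
```

Centralizing the expression means a policy change (say, also collapsing internal whitespace) is edited once and every model that uses the component picks it up.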
Next steps
- Refactor one of your existing pipelines to use a mapping table and a reusable view.
- Add three data quality tests (uniqueness, null checks, join coverage) to your curated dataset.
- Practice an incremental pattern: process only new/changed rows and prove idempotency.
Skill exam
The exam is available to everyone.