Why this matters
Derivations and calculations turn raw fields into business-ready metrics. As an ETL Developer, you will:
- Compute revenue, margins, discounts, and KPIs reliably.
- Standardize names, emails, and codes to enable joins and analytics.
- Create buckets (age groups, spend tiers) and flags (churn risk, VIP).
- Build rollups and window-calculated features (rolling averages, ranks).
- Handle nulls, rounding, and data types so numbers stay trustworthy.
Concept explained simply
Derivations are new columns created from existing data (e.g., net_revenue = qty * price * (1 - discount)). Calculations are the operations and rules you apply to produce those columns: arithmetic, conditionals, string/date functions, aggregates, and window functions.
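A minimal sketch of such a derivation in Postgres-style SQL (the orders table and its columns are hypothetical):

-- Derivation: a new column computed from existing ones.
SELECT
  order_id,
  qty * price * (1 - discount) AS net_revenue
FROM orders;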
Mental model
Think of your dataset as ingredients and your transformation as a recipe. Each derived column is a step that must be:
- Deterministic: same inputs lead to same outputs.
- Traceable: easy to explain how it was computed.
- Testable: you can check each step with small samples.
Core techniques you will use
- Arithmetic: addition, subtraction, multiplication, division; order of operations matters.
- Conditionals: CASE/IF for flags and buckets.
- String ops: trim, lower/upper, split, substring, regex where available.
- Date/time ops: date differences, truncation to day/week/month, time zone awareness.
- Aggregations: sums, counts, averages; grouping keys.
- Window functions: rolling averages, cumulative sums, ranks.
- Type handling: casting, decimal precision, safe integer division.
- Null handling: COALESCE/defaults, avoid dividing by zero.
- Rounding: financial values usually round half up to 2 decimals; be consistent.
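One query often combines several of these techniques. A hedged sketch in Postgres-style SQL (the orders table and the 100-unit bulk threshold are made-up assumptions):

-- Conditionals, null handling, casting, and rounding in one derivation step.
SELECT
  order_id,
  COALESCE(discount_pct, 0)                            AS discount_pct_filled,  -- explicit default for missing discounts
  CASE WHEN qty >= 100 THEN 'bulk' ELSE 'standard' END AS order_size_bucket,    -- conditional bucket
  ROUND(CAST(qty AS NUMERIC) * unit_price, 2)          AS gross_amount          -- cast before arithmetic, round once
FROM orders;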
Worked examples
Example 1 — Order economics
Inputs per order: qty, unit_price, discount_pct (a fraction, e.g., 0.10 for a 10% discount), cogs.
- revenue_before_discount = qty * unit_price
- discount_amount = revenue_before_discount * discount_pct
- net_revenue = revenue_before_discount - discount_amount
- gross_margin = net_revenue - cogs
- margin_pct = CASE WHEN net_revenue > 0 THEN gross_margin / net_revenue ELSE NULL END
Tip: Store monetary fields as decimals with sufficient precision to avoid floating-point surprises.
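A runnable sketch of these steps in Postgres-style SQL (the sample rows are invented; discount_pct is a fraction):

-- Order economics computed step by step on an inline sample.
WITH orders(order_id, qty, unit_price, discount_pct, cogs) AS (
  VALUES (1, 10, 25.00, 0.10, 150.00),
         (2,  2, 99.50, 0.00, 120.00)
),
step1 AS (
  SELECT *, qty * unit_price AS revenue_before_discount FROM orders
),
step2 AS (
  SELECT *, revenue_before_discount * discount_pct AS discount_amount FROM step1
),
step3 AS (
  SELECT *, revenue_before_discount - discount_amount AS net_revenue FROM step2
)
SELECT
  order_id,
  net_revenue,
  net_revenue - cogs AS gross_margin,
  CASE WHEN net_revenue > 0
       THEN (net_revenue - cogs) / net_revenue
       ELSE NULL
  END AS margin_pct
FROM step3;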
Example 2 — Churn risk flag
Inputs: last_purchase_date, today_date.
- days_since_purchase = today_date - last_purchase_date
- is_churn_risk = CASE WHEN days_since_purchase >= 90 THEN 1 ELSE 0 END
Make the cutoff explicit (e.g., 90) and keep it configurable in pipelines.
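A sketch of the flag in Postgres-style SQL (the customers table is hypothetical; the 90-day cutoff comes from the example and should live in configuration):

-- Churn-risk flag; assumes last_purchase_date is a DATE, so subtraction yields whole days.
SELECT
  customer_id,
  CURRENT_DATE - last_purchase_date AS days_since_purchase,
  CASE WHEN CURRENT_DATE - last_purchase_date >= 90 THEN 1 ELSE 0 END AS is_churn_risk
FROM customers;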
Example 3 — Standardize names and emails
- full_name_clean = TRIM(REGEXP_REPLACE(full_name, '\s+', ' '))
- email_normalized = LOWER(TRIM(email))
- email_domain = SPLIT_PART(email_normalized, '@', 2)
Small standardizations greatly improve join rates and deduplication.
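Expressed as a query in Postgres-style SQL (the customers table is hypothetical; the 'g' flag makes REGEXP_REPLACE collapse every whitespace run, not just the first):

-- Normalize names and emails before joining or deduplicating.
SELECT
  customer_id,
  TRIM(REGEXP_REPLACE(full_name, '\s+', ' ', 'g')) AS full_name_clean,
  LOWER(TRIM(email))                               AS email_normalized,
  SPLIT_PART(LOWER(TRIM(email)), '@', 2)           AS email_domain
FROM customers;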
Example 4 — Rolling 7-day average
Inputs: daily_sales(date, sales).
- rolling_avg_7d = AVG(sales) OVER (ORDER BY date ROWS BETWEEN 6 PRECEDING AND CURRENT ROW)
Window frames must be ordered and bounded. Early rows have smaller windows; document this behavior.
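A self-contained sketch in Postgres-style SQL (the sample rows are invented; the date column is renamed sale_date here to avoid a reserved word):

-- Rolling 7-day average with an explicit, bounded frame.
WITH daily_sales(sale_date, sales) AS (
  VALUES (DATE '2024-01-01', 100.0),
         (DATE '2024-01-02', 120.0),
         (DATE '2024-01-03',  90.0)
)
SELECT
  sale_date,
  sales,
  AVG(sales) OVER (
    ORDER BY sale_date
    ROWS BETWEEN 6 PRECEDING AND CURRENT ROW  -- current row plus the 6 prior rows
  ) AS rolling_avg_7d
FROM daily_sales;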
Data types, precision, and safety
- Decimals: use DECIMAL/NUMERIC for money; set scale (e.g., 2) consistently across stages.
- Integer division: cast to decimal before division to avoid truncation.
- Rounding: choose a policy (e.g., round half up) and apply it once, at the presentation layer or another agreed-upon layer.
- Null safety: COALESCE defaults thoughtfully; do not mask real data gaps.
- Time zones: convert timestamps to a canonical zone (e.g., UTC) before deriving date parts.
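A short sketch of these safety patterns in Postgres-style SQL (table and column names are hypothetical; created_at is assumed to be a time-zone-aware timestamp):

-- Safe casting, divide-by-zero protection, explicit defaults, and UTC-based date parts.
SELECT
  order_id,
  ROUND(CAST(line_total AS NUMERIC), 2)                   AS line_total_money,     -- money as NUMERIC, rounded once
  CAST(items_returned AS NUMERIC) / NULLIF(items_sold, 0) AS return_rate,          -- cast first; NULLIF avoids divide-by-zero
  COALESCE(discount_pct, 0)                               AS discount_pct_filled,  -- documented default, not a silent mask
  (created_at AT TIME ZONE 'UTC')::date                   AS created_date_utc      -- convert before deriving date parts
FROM orders;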
Self-checks you can automate
- Invariants: net_revenue ≥ 0; margin_pct ≤ 1, and ≥ 0 unless negative margins are allowed.
- Sum consistency: detail sums match aggregate tables within rounding tolerance.
- Monotonicity: cumulative totals never decrease.
- Coverage: flags/buckets sum to known totals; no unexpected NULLs after derivation.
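These checks can run as assertion queries where any returned row is a failure. A hedged sketch in Postgres-style SQL (the order_metrics table and its columns are hypothetical):

-- Each query should return zero rows; returned rows are violations to investigate.
SELECT order_id, net_revenue FROM order_metrics WHERE net_revenue < 0;        -- invariant
SELECT order_id, margin_pct  FROM order_metrics WHERE margin_pct > 1;         -- invariant
SELECT order_id              FROM order_metrics WHERE is_churn_risk IS NULL;  -- coverage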
Common mistakes (and how to avoid them)
- Integer division truncation: cast operands to decimal before dividing.
- Unbounded windows: with ORDER BY and no frame clause, the default frame runs from the start of the partition to the current row, turning your rolling average into a running average; always specify the frame.
- Silent nulls: operations with NULL return NULL; use COALESCE or CASE to control behavior.
- Double-discounting: apply discounts once and in the right order.
- Rounding at every step: round only at final output unless required, to avoid drift.
- Timezone drift: derive dates after converting to a consistent time zone.
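The window-frame pitfall is easy to reproduce. In the sketch below (Postgres-style SQL, hypothetical daily_sales table), the first column silently becomes a running average over all prior rows because no frame is given; the second is the intended rolling average:

-- Default frame vs. an explicit rolling frame.
SELECT
  sale_date,
  AVG(sales) OVER (ORDER BY sale_date)                       AS running_avg_to_date,  -- default: start of partition to current row
  AVG(sales) OVER (ORDER BY sale_date
                   ROWS BETWEEN 6 PRECEDING AND CURRENT ROW) AS rolling_avg_7d
FROM daily_sales;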
Exercises
Practice with small, deterministic datasets. The same tasks appear below in the interactive section with solutions.
- Exercise ex1: Compute net_revenue, gross_margin, and margin_pct for 3 orders. Then round money values to 2 decimals and express margin_pct as a percentage rounded to 2 decimal places.
- Exercise ex2: Build a 3-day rolling average of daily sales and label each day with its week number (ISO week).
- Exercise ex3: Derive churn_risk, ltv_bucket, and email_domain from a customers sample. Assume current_date = 2024-02-01.
Checklist before you submit
- Handled NULLs and divide-by-zero safely.
- Used consistent rounding for money and percentages.
- Documented any constants (e.g., 90 days churn cutoff).
- Window frames are explicit and ordered.
Mini challenge
Design derived fields for a subscription billing dataset: monthly_recurring_revenue (MRR), churn_flag, expansion_revenue, contraction_revenue, and net_revenue_retention by month. State your definitions, required inputs, and how you would validate them.
Who this is for
- ETL Developers and Data Engineers building reliable transformations.
- Analysts moving pipelines from notebooks to production.
- Anyone preparing for metric definitions and data contracts.
Prerequisites
- Comfort with SQL basics (SELECT, WHERE, GROUP BY) or a transformation tool.
- Understanding of data types (int, decimal, string, date/timestamp).
Learning path
- Start with simple arithmetic and conditional flags.
- Add string/date standardization and safe casting.
- Introduce aggregates and then window functions.
- Bake in validation checks and rounding policies.
- Optimize and document derivations as reusable macros/UDFs if your platform supports them.
Practical projects
- Order metrics mart: net_revenue, discounts, margins by day/product.
- Customer health table: churn risk, activity streaks, VIP tiering.
- Marketing attribution prep: cleaned channels, campaign parsing, week buckets.
Next steps
- Complete the exercises below and compare with the solutions.
- Take the Quick Test to confirm understanding.
- Apply these patterns to your current pipeline and add automated checks.
Ready to test yourself?
Take the Quick Test below.