Why this matters
Business logic is where raw data becomes decisions. As a Data Engineer, you encode rules that finance, product, and operations rely on: which transactions count as revenue, how to treat refunds, what defines an active user, how to keep historical truth. Getting this right makes dashboards trustworthy and machine learning features stable. Getting it wrong causes rework, confusion, and bad decisions.
Concept explained simply
Business logic is the set of clear, testable rules that transform raw data into meaningful outputs. In ETL/ELT, it lives in your transformations: SQL models, Spark jobs, or stored procedures that standardize definitions (e.g., “MRR”, “active user”, “churned customer”).
Mental model
Think in three layers:
- Inputs: clean, modeled sources (staging tables with standardized types and names).
- Rules: deterministic transformations that encode definitions. These should be idempotent and version-controlled.
- Checks: assertions that confirm the rules worked (row counts, referential integrity, boundary checks).
If you can re-run the pipeline and get the same result, your logic is reproducible — a prerequisite for trusting that it is also correct.
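The three layers can be sketched in a few lines of Python. The names here (raw_orders, rule_net_revenue) are invented for illustration; the point is the separation of input, deterministic rule, and check.

```python
def rule_net_revenue(rows):
    """Rule layer: a deterministic transformation (refunds subtract from sales)."""
    return sum(r["amount"] if r["type"] == "sale" else -r["amount"] for r in rows)

def check_non_negative(value):
    """Check layer: an assertion that confirms the rule's output is plausible."""
    assert value >= 0, "net revenue should not be negative"
    return value

# Input layer: cleaned, typed staging rows (hypothetical data)
raw_orders = [
    {"type": "sale", "amount": 100.0},
    {"type": "refund", "amount": 30.0},
]

# Re-running the same rule on the same input yields the same result: reproducible.
first = check_non_negative(rule_net_revenue(raw_orders))
second = check_non_negative(rule_net_revenue(raw_orders))
print(first)   # 70.0
print(second)  # 70.0
```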
Core patterns and rules
1) Deduplication and latest-record selection
Often multiple rows represent the same business entity. Choose a business key and pick the latest row by a trustworthy timestamp or version number.
-- Latest record per business key using a window
WITH ranked AS (
  SELECT *,
         ROW_NUMBER() OVER (
           PARTITION BY customer_id
           ORDER BY updated_at DESC, _ingested_at DESC
         ) AS rn
  FROM stg_customers
)
SELECT * FROM ranked WHERE rn = 1;
Notes: break ties deterministically; prefer server-side timestamps; add a tiebreaker like ingestion time.
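This pattern can be run end to end in SQLite, which supports window functions. A runnable sketch; the sample rows are invented:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE stg_customers (customer_id INTEGER, name TEXT,
                            updated_at TEXT, _ingested_at TEXT);
INSERT INTO stg_customers VALUES
  (1, 'old name', '2024-01-01', '2024-01-01T10:00'),
  (1, 'new name', '2024-01-02', '2024-01-02T10:00'),
  (2, 'only row', '2024-01-01', '2024-01-01T10:00');
""")
rows = conn.execute("""
WITH ranked AS (
  SELECT *,
         ROW_NUMBER() OVER (
           PARTITION BY customer_id
           ORDER BY updated_at DESC, _ingested_at DESC  -- deterministic tie-breaker
         ) AS rn
  FROM stg_customers
)
SELECT customer_id, name FROM ranked WHERE rn = 1 ORDER BY customer_id;
""").fetchall()
print(rows)  # [(1, 'new name'), (2, 'only row')]
```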
2) Slowly Changing Dimension (Type 2)
Keep full history of attribute changes with effective date ranges and a current flag.
-- Conceptual merge for SCD2
-- Keys: product_id; Attributes: price
-- Columns: effective_start, effective_end, is_current
-- New record starts when price changes; close previous record the day before.
3) Idempotency and late-arriving data
Idempotency: running the job twice yields the same outcome.
- Use deterministic upserts keyed by business keys + effective timestamps.
- Handle late-arriving facts by re-computing affected windows or using as-of joins.
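An idempotent upsert keyed by a business key can be sketched with SQLite's `INSERT ... ON CONFLICT`. The table and columns (dim_customer, status) are illustrative; the `WHERE` guard also drops late-arriving rows that are older than what we already have:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""CREATE TABLE dim_customer (
  customer_id INTEGER PRIMARY KEY,  -- business key
  status TEXT, updated_at TEXT)""")

def upsert(customer_id, status, updated_at):
    conn.execute("""
      INSERT INTO dim_customer VALUES (?, ?, ?)
      ON CONFLICT(customer_id) DO UPDATE SET
        status = excluded.status, updated_at = excluded.updated_at
      WHERE excluded.updated_at >= dim_customer.updated_at  -- ignore stale rows
    """, (customer_id, status, updated_at))

upsert(1, 'active', '2024-01-02')
upsert(1, 'active', '2024-01-02')    # rerun: same outcome (idempotent)
upsert(1, 'churned', '2024-01-01')   # late-arriving but older: ignored
result = conn.execute("SELECT status FROM dim_customer").fetchall()
print(result)  # [('active',)]
```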
4) Aggregations and calendars
Aggregations must align with a calendar: day, week, month, fiscal periods.
-- Robust monthly grouping via a date dimension
SELECT d.fiscal_month, SUM(f.amount) AS revenue
FROM fact_payments f
JOIN dim_date d ON f.event_date = d.date
GROUP BY d.fiscal_month;
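The query above can be exercised in SQLite. The fiscal calendar here (fiscal year rolling over in February) is invented purely to show why a date dimension beats `DATE_TRUNC`-style grouping when periods are not calendar months:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE dim_date (date TEXT PRIMARY KEY, fiscal_month TEXT);
INSERT INTO dim_date VALUES
  ('2024-01-31', 'FY24-M12'),   -- January belongs to the prior fiscal year
  ('2024-02-01', 'FY25-M01');
CREATE TABLE fact_payments (event_date TEXT, amount REAL);
INSERT INTO fact_payments VALUES ('2024-01-31', 100.0), ('2024-02-01', 50.0);
""")
rows = conn.execute("""
SELECT d.fiscal_month, SUM(f.amount) AS revenue
FROM fact_payments f
JOIN dim_date d ON f.event_date = d.date
GROUP BY d.fiscal_month
ORDER BY d.fiscal_month;
""").fetchall()
print(rows)  # [('FY24-M12', 100.0), ('FY25-M01', 50.0)]
```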
5) Time zones and currencies
- Store timestamps in UTC; convert for reporting at the edges.
- Normalize currency with a reliable rate table and document the conversion date (transaction date vs posting date).
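The "UTC in storage, local at the edges" rule matters because the same instant can fall on different calendar days. A small sketch with Python's zoneinfo (the event time is invented):

```python
from datetime import datetime, timezone
from zoneinfo import ZoneInfo

# What the warehouse stores: a UTC timestamp.
stored = datetime(2024, 3, 31, 23, 30, tzinfo=timezone.utc)

# Presentation-only conversion at the reporting edge.
reported = stored.astimezone(ZoneInfo("Europe/Berlin"))

# Same instant, different calendar day depending on zone:
print(stored.date().isoformat())    # 2024-03-31
print(reported.date().isoformat())  # 2024-04-01
```

If daily aggregates were grouped on local time, this event would move between days whenever the reporting zone changed.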
Worked examples
Example 1: Idempotent deduplication and daily active users
Input: stg_events(user_id, event_name, event_time_utc, _ingested_at)
Goal: Daily Active Users (DAU) by UTC day. A user counts once per day if any qualifying event occurs.
- Deduplicate by (user_id, event_time_utc, event_name) keeping the latest _ingested_at.
- Filter to qualifying events (e.g., login, page_view).
- Group by CAST(event_time_utc AS date) and count distinct user_id.
WITH base AS (
  SELECT *, ROW_NUMBER() OVER (
    PARTITION BY user_id, event_name, event_time_utc
    ORDER BY _ingested_at DESC
  ) AS rn
  FROM stg_events
), dedup AS (
  SELECT * FROM base WHERE rn = 1
), eligible AS (
  SELECT user_id, CAST(event_time_utc AS DATE) AS event_date
  FROM dedup
  WHERE event_name IN ('login', 'page_view')
)
SELECT event_date, COUNT(DISTINCT user_id) AS dau
FROM eligible
GROUP BY event_date;
Why it works: Window dedup is deterministic. Counting distinct after dedup ensures idempotency.
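The idempotency claim can be checked by running the query twice over data that contains a replayed row. A runnable SQLite version (using `DATE()` in place of `CAST(... AS DATE)`, which SQLite does not support; sample rows are invented):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE stg_events (user_id INTEGER, event_name TEXT,
                         event_time_utc TEXT, _ingested_at TEXT);
INSERT INTO stg_events VALUES
  (1, 'login',     '2024-05-01 08:00:00', '2024-05-01 08:01:00'),
  (1, 'login',     '2024-05-01 08:00:00', '2024-05-01 09:00:00'),  -- replayed row
  (1, 'page_view', '2024-05-01 12:00:00', '2024-05-01 12:01:00'),
  (2, 'purchase',  '2024-05-01 13:00:00', '2024-05-01 13:01:00');  -- not qualifying
""")
DAU_SQL = """
WITH base AS (
  SELECT *, ROW_NUMBER() OVER (
    PARTITION BY user_id, event_name, event_time_utc
    ORDER BY _ingested_at DESC) AS rn
  FROM stg_events
), dedup AS (SELECT * FROM base WHERE rn = 1),
eligible AS (
  SELECT user_id, DATE(event_time_utc) AS event_date
  FROM dedup WHERE event_name IN ('login', 'page_view')
)
SELECT event_date, COUNT(DISTINCT user_id) AS dau
FROM eligible GROUP BY event_date;
"""
first = conn.execute(DAU_SQL).fetchall()
second = conn.execute(DAU_SQL).fetchall()  # rerun gives the same answer
print(first)  # [('2024-05-01', 1)]
```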
Example 2: MRR with proration and refunds
Input: subscriptions(user_id, plan_price_usd, period_start, period_end, status), refunds(user_id, amount_usd, refund_date)
Goal: Monthly Recurring Revenue (MRR) per month with proration for partial months and refunds subtracted.
- Compute active days in each month for each subscription period.
- Allocate plan_price_usd proportionally to active days in the month.
- Subtract refunds that relate to the same month.
-- Sketch; relies on a date dimension for month boundaries
WITH spans AS (
  SELECT s.user_id, s.plan_price_usd,
         d.calendar_month, d.month_start, d.month_end,
         GREATEST(d.month_start, s.period_start) AS start_in_m,
         LEAST(d.month_end, s.period_end) AS end_in_m
  FROM subscriptions s
  JOIN dim_date_month d
    ON s.period_start <= d.month_end AND s.period_end >= d.month_start
  WHERE s.status = 'active'
), apportioned AS (
  SELECT user_id, calendar_month,
         plan_price_usd
           * (DATE_DIFF('day', start_in_m, end_in_m) + 1)
           / (DATE_DIFF('day', month_start, month_end) + 1) AS mrr_component
  FROM spans
), refunds_m AS (
  SELECT DATE_TRUNC('month', refund_date) AS calendar_month,
         SUM(amount_usd) AS refunds
  FROM refunds
  GROUP BY 1
)
SELECT a.calendar_month,
       SUM(a.mrr_component) - COALESCE(MAX(r.refunds), 0) AS mrr
FROM apportioned a
LEFT JOIN refunds_m r USING (calendar_month)
GROUP BY a.calendar_month;
Why it works: Aligning to a date dimension avoids off-by-one. Explicit proration and refunds make the logic auditable.
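The proration arithmetic can be checked in plain Python. A $31 plan active for 15 of January's 31 days should contribute $15 for that month (dates and price are invented):

```python
from datetime import date

def prorated_mrr(plan_price, start_in_m, end_in_m, month_start, month_end):
    """Allocate a plan price proportionally to active days within one month."""
    active_days = (end_in_m - start_in_m).days + 1
    days_in_month = (month_end - month_start).days + 1
    return plan_price * active_days / days_in_month

component = prorated_mrr(31.0,
                         date(2024, 1, 17), date(2024, 1, 31),  # span within month
                         date(2024, 1, 1), date(2024, 1, 31))   # month boundaries
print(component)  # 15.0
```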
Example 3: SCD Type 2 for product price
Input: stg_product_prices(product_id, price, valid_from_utc, _ingested_at)
Goal: Maintain dim_product_price_scd2 with effective ranges and current flag.
-- Collapse duplicate loads, then derive effective ranges
WITH changes AS (
  SELECT product_id, price, valid_from_utc
  FROM (
    SELECT product_id, price, valid_from_utc,
           ROW_NUMBER() OVER (PARTITION BY product_id, valid_from_utc
                              ORDER BY _ingested_at DESC) AS r
    FROM stg_product_prices
  ) x
  WHERE r = 1  -- keep the latest ingested row per (product_id, valid_from_utc)
), ranges AS (
  SELECT product_id, price,
         valid_from_utc AS effective_start,
         LEAD(valid_from_utc) OVER (PARTITION BY product_id
                                    ORDER BY valid_from_utc)
           - INTERVAL '1' SECOND AS effective_end
  FROM changes
)
SELECT product_id, price, effective_start,
       COALESCE(effective_end, TIMESTAMP '9999-12-31 23:59:59') AS effective_end,
       CASE WHEN effective_end IS NULL THEN TRUE ELSE FALSE END AS is_current
FROM ranges;
Why it works: One record per continuous price period with a sentinel end date for current rows enables as-of joins.
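An as-of join against such a dimension picks the price version whose effective range contains the fact's event time. A runnable SQLite sketch with invented data (fact_orders is a hypothetical fact table):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE dim_product_price_scd2 (product_id INTEGER, price REAL,
  effective_start TEXT, effective_end TEXT, is_current INTEGER);
INSERT INTO dim_product_price_scd2 VALUES
  (7, 10.0, '2024-01-01 00:00:00', '2024-02-29 23:59:59', 0),
  (7, 12.0, '2024-03-01 00:00:00', '9999-12-31 23:59:59', 1);
CREATE TABLE fact_orders (order_id INTEGER, product_id INTEGER, ordered_at TEXT);
INSERT INTO fact_orders VALUES (1, 7, '2024-02-15 09:00:00'),
                               (2, 7, '2024-03-02 09:00:00');
""")
rows = conn.execute("""
SELECT f.order_id, p.price
FROM fact_orders f
JOIN dim_product_price_scd2 p
  ON p.product_id = f.product_id
 AND f.ordered_at BETWEEN p.effective_start AND p.effective_end
ORDER BY f.order_id;
""").fetchall()
print(rows)  # [(1, 10.0), (2, 12.0)]
```

Each order matches exactly one price version, which is the property the sentinel end date and non-overlapping ranges guarantee.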
Exercises
Do these to lock in the concepts. The Quick Test is available to everyone; only logged-in users get saved progress.
- Exercise 1: Deduplicate events and compute DAU with a checklist below. See the exercise card for details.
- Exercise 2: Build an SCD2 dimension for customer status. See the exercise card for details.
Self-check checklist
- Business keys are defined and documented.
- Window functions or MERGE statements are deterministic (explicit tie-breakers).
- Aggregations align to a date/calendar dimension.
- All timestamps handled in UTC; conversions only at the edges.
- Each rule has at least one assertion (e.g., non-negative amounts).
Common mistakes and how to self-check
- Mixing business logic into orchestration. Fix: keep logic in transformations; orchestration only schedules/runs.
- Relying on dashboard formulas for core metrics. Fix: centralize definitions in data models.
- Non-idempotent merges (e.g., update without stable keys). Fix: use business keys + timestamps; test reruns.
- Ignoring late-arriving records. Fix: reprocess time windows or use as-of joins.
- Time zone drift. Fix: store UTC, convert for presentation only, document the convention.
- Ambiguous tie-breaking. Fix: add deterministic ORDER BY in window functions.
Practical projects
- Metrics mart: Build DAU/WAU/MAU from raw event logs with dedup, filters, and a date dimension.
- Revenue core: Create MRR, churn, and expansion metrics with proration, refunds, and currency normalization.
- History-aware dimensions: Implement SCD2 for customer status and product pricing; add as-of joins to facts.
Learning path
- Model inputs: clean staging with consistent names/types.
- Implement core rules: dedup, keys, time handling, idempotent upserts.
- Add data quality: row count deltas, null checks, domain constraints.
- Harden for change: parameterize calendars, currency rates, and time zones.
- Document: define each metric and rule in plain language next to code.
Who this is for
- Data Engineers implementing reliable transformations.
- Analytics Engineers formalizing metric definitions.
- Developers moving business rules out of apps and into data models.
Prerequisites
- Comfortable with SQL (CTEs, window functions, joins).
- Basic data modeling (staging, dimensions, facts).
- Familiarity with version control and code reviews.
Next steps
- Finish the exercises, then take the Quick Test.
- Pick one Practical Project and ship a first version.
- Add assertions to every critical model before you call it “done.”
Mini challenge
Given orders(order_id, customer_id, status, status_updated_at) where status can move backward due to system bugs, design a rule to compute the final daily status snapshot that is monotonic (never moves backward) per customer. Outline keys, tie-breakers, and how you will handle late updates. Write 3–5 bullets and one SQL or pseudo-SQL snippet to illustrate your approach.