Why this matters
As an ETL Developer, you transform business needs into reliable data loads. Clear requirements prevent costly rework, missed deadlines, and data trust issues. Typical tasks include creating load specs, defining data quality rules, choosing load strategies (full vs incremental), and aligning SLAs with the business.
- You are asked: "We need yesterday\'s sales by 7am." You must turn that into grain, schedule, incremental logic, and validation rules.
- Stakeholders change a KPI definition. You need to adjust transformations, mapping, and acceptance tests safely.
- A source system sends late records. You decide how to handle late-arriving data and reprocessing.
Concept explained simply
Business requirements describe outcomes: what decisions people need to make and when. ETL load requirements translate that into technical behavior: what data to load, how often, in which structure, with what quality and guarantees.
Mental model
Think of a funnel:
- Business Question → Metrics and Entities → Data Inputs → Transformations → Load Targets → Validation & SLAs.
At each step, write explicit, testable statements. If it can’t be tested, it isn’t a good requirement.
Business-to-Load translation checklist
- Purpose: What decision or report depends on this load?
- Scope: Which entities and attributes are in or out?
- Grain: One row equals what? (e.g., one order line per day)
- Latency: When must it be ready? (e.g., 7am daily)
- Freshness window: How far back do we reprocess or accept late data?
- Load pattern: Full, incremental, CDC, or streaming?
- Merge rules: Upsert keys, dedup rules, and SCD behavior (Type 1/2/3)
- Data quality: Required fields, validation thresholds, and tolerances
- Error handling: What to do on failures, retries, and partial loads
- Security & privacy: Masking, PII handling, and access
- Acceptance criteria: Exact tests to declare success
- Lineage & audit: What to log for traceability
Key terms you'll use
- Grain: The exact meaning of one row in the target.
- SLA: Service Level Agreement (e.g., load complete by 6am UTC).
- Latency: Time between source event and availability in target.
- Freshness window: The time period we accept late-arriving data.
- Incremental load: Only new/changed records are processed.
- CDC: Change Data Capture to detect inserts/updates/deletes.
- SCD: Slowly Changing Dimension strategy (Type 1 overwrite, Type 2 history).
- Idempotency: Re-running a job does not duplicate or corrupt data.
- DQ rules: Data quality validations with thresholds and actions.
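To make two of these terms concrete, incremental load and idempotency, here is a minimal sketch of a watermark-driven upsert. The row structure, the `order_id` key, and the in-memory target are assumptions for illustration only.

```python
# Minimal sketch of an idempotent incremental load: rows are selected by a
# watermark and upserted by key, so re-running the same batch does not
# duplicate or change the target.
from datetime import datetime

target = {}  # simulated target table keyed by business key

def incremental_load(source_rows, watermark):
    """Apply rows changed after the watermark; upsert by order_id."""
    batch = [r for r in source_rows if r["updated_at"] > watermark]
    for row in batch:
        target[row["order_id"]] = row  # upsert: insert or overwrite by key
    return max((r["updated_at"] for r in batch), default=watermark)

rows = [
    {"order_id": 1, "amount": 10.0, "updated_at": datetime(2024, 1, 2, 5)},
    {"order_id": 2, "amount": 25.0, "updated_at": datetime(2024, 1, 2, 6)},
]
wm = datetime(2024, 1, 1)
new_wm = incremental_load(rows, wm)
incremental_load(rows, wm)       # deliberate re-run of the same batch
assert len(target) == 2          # idempotent: no duplicates after the re-run
```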
From requirement to load spec — step-by-step
- Clarify the business outcome
- Ask: What decisions depend on this? Which KPI definitions are authoritative?
- Define scope and grain
- List entities, attributes, and the exact grain (e.g., one record per order_line_id per day).
- Select the load pattern
- Full vs incremental vs CDC vs streaming, based on volume, latency, and source capability.
- Write merge and dedup rules
- Business keys, natural vs surrogate keys, SCD type, and tie-breakers (latest timestamp, highest version); see the dedup sketch after this list.
- Set quality rules and thresholds
- Mandatory fields, referential integrity, allowed nulls, tolerance for late data, and what happens on thresholds.
- Define SLAs and operations
- Schedule, time windows, retries, alerts, idempotency, reprocessing approach.
- Specify acceptance criteria and tests
- Exact row counts, sample reconciliations, KPI parity checks, and audit expectations.
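The merge and dedup step usually reduces to a deterministic rule you can code and test. Here is a minimal sketch of the tie-breakers mentioned above (latest timestamp, then highest version); field names such as `customer_id` and `version` are illustrative.

```python
# Minimal dedup sketch: keep one row per business key, breaking ties by
# latest updated_at, then highest version.
from datetime import datetime

def dedup(rows, key_fields=("customer_id",)):
    best = {}
    for row in rows:
        key = tuple(row[f] for f in key_fields)
        rank = (row["updated_at"], row["version"])  # tie-breaker order
        if key not in best or rank > best[key][0]:
            best[key] = (rank, row)
    return [row for _, row in best.values()]

rows = [
    {"customer_id": 7, "email": "a@x.com", "updated_at": datetime(2024, 1, 1), "version": 1},
    {"customer_id": 7, "email": "b@x.com", "updated_at": datetime(2024, 1, 1), "version": 2},
]
assert dedup(rows)[0]["email"] == "b@x.com"  # higher version wins the tie
```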
See a compact example of a load spec
- Target: dw.fact_sales_daily (grain: order_line_id x sale_date)
- Pattern: Incremental from POS CDC
- Merge: Upsert by (order_line_id, sale_date). Cancellations and reversals are loaded as negative amounts.
- DQ: product_id must never be null (0% tolerance); rows with a null product_id are rejected to quarantine.
- Freshness: accept late data up to 3 days; reprocess prior 3 partitions daily.
- SLA: completed by 06:30 UTC, 2 retries, alert on failure.
- Acceptance: within ±0.5% of POS totals; 100% of CDC changes applied; audit log written.
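The acceptance line in this spec is directly testable. Here is a minimal sketch of the ±0.5% reconciliation check; the function and sample totals are illustrative.

```python
# Minimal sketch of the acceptance check above: loaded net sales must be
# within ±0.5% of the POS system's totals.
def within_tolerance(loaded_total: float, source_total: float, pct: float = 0.5) -> bool:
    if source_total == 0:
        return loaded_total == 0
    variance = abs(loaded_total - source_total) / abs(source_total) * 100
    return variance <= pct

assert within_tolerance(99_600.0, 100_000.0)        # 0.4% off: passes
assert not within_tolerance(99_000.0, 100_000.0)    # 1.0% off: fails
```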
Worked examples
Example 1: Daily Sales Fact
- Business: Retail wants yesterday's net sales by 7am to staff stores.
- Grain: order_line_id x sale_date.
- Pattern: Incremental + CDC; reversals as negative amounts.
- DQ: product_id not null; amount ≥ 0 except on flagged reversal rows; quarantine bad rows.
- Freshness: late records up to 3 days; re-run last 3 partitions daily.
- SLA: 07:00 local time, 3 retries with backoff.
- Acceptance: Net sales within ±0.5% of POS; row count change aligns with CDC change count.
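The freshness rule in Example 1 can be expressed as a small helper that decides which partitions each run rebuilds. A minimal sketch, assuming daily `sale_date` partitions and a 3-day late-arrival window.

```python
# Minimal sketch of Example 1's freshness policy: each daily run rebuilds the
# last 3 sale_date partitions so late-arriving records are picked up.
from datetime import date, timedelta

def partitions_to_reprocess(run_date: date, lookback_days: int = 3):
    """Yesterday plus the prior partitions inside the freshness window."""
    return [run_date - timedelta(days=d) for d in range(1, lookback_days + 1)]

print(partitions_to_reprocess(date(2024, 1, 10)))
# -> partitions for Jan 9, Jan 8, and Jan 7
```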
Example 2: Customer Dimension (SCD2)
- Business: Marketing needs historical address changes for cohort analysis.
- Grain: Customer surrogate key, SCD2 on address fields.
- Pattern: Incremental using source updated_at and natural key email+source_id.
- SCD: Type 2 on address, Type 1 on name corrections.
- DQ: email format must be valid; on duplicate natural keys, keep the row with the latest updated_at.
- Privacy: Mask PII columns in non-prod; restrict column-level access.
- Acceptance: 100% of changed addresses produce new SCD2 rows; active record count = distinct customers.
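Here is a minimal sketch of how Example 2's mixed SCD behavior could be applied row by row: Type 2 on address changes, Type 1 on name corrections. The in-memory dimension, natural-key format, and field names are illustrative.

```python
# Minimal SCD2 sketch: an address change closes the current row and opens a
# new one; a name correction overwrites in place (Type 1).
from datetime import date

dim = []  # simulated customer dimension rows

def apply_change(natural_key, change, effective: date):
    current = next((r for r in dim if r["key"] == natural_key and r["is_current"]), None)
    if current and "address" in change:          # Type 2: keep history
        current["is_current"] = False
        current["end_date"] = effective
        dim.append({**current, **change, "start_date": effective,
                    "end_date": None, "is_current": True})
    elif current and "name" in change:           # Type 1: overwrite
        current.update(change)
    elif not current:                            # brand-new customer
        dim.append({"key": natural_key, **change, "start_date": effective,
                    "end_date": None, "is_current": True})

apply_change("a@x.com|pos1", {"name": "Ada", "address": "1 Main St"}, date(2024, 1, 1))
apply_change("a@x.com|pos1", {"address": "9 Oak Ave"}, date(2024, 3, 1))
assert sum(r["is_current"] for r in dim) == 1 and len(dim) == 2
```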
Example 3: IoT Telemetry Stream
- Business: Ops needs device anomaly counts within 5 minutes.
- Grain: device_id x 1-minute window.
- Pattern: Streaming with micro-batch aggregation.
- Late data: accept events up to 10 minutes late (10-minute watermark).
- Idempotency: windowed upsert by (device_id, minute_bucket).
- Acceptance: anomaly counts match sampled raw events within ±1% for 10-minute windows.
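Example 3's windowing and lateness rules can be prototyped without a streaming engine. A minimal sketch that buckets events into 1-minute windows per device and drops events older than a 10-minute watermark; the event shape and target structure are illustrative.

```python
# Minimal sketch of Example 3's windowing rules: events are counted into
# 1-minute buckets per device; events arriving more than 10 minutes late
# are dropped by the watermark.
from datetime import datetime, timedelta

counts = {}  # simulated target keyed by (device_id, minute_bucket)

def ingest(events, now: datetime, allowed_lateness=timedelta(minutes=10)):
    watermark = now - allowed_lateness
    for ev in events:
        if ev["event_time"] < watermark:
            continue                                  # too late: ignored
        bucket = ev["event_time"].replace(second=0, microsecond=0)
        key = (ev["device_id"], bucket)
        counts[key] = counts.get(key, 0) + 1          # windowed aggregate

now = datetime(2024, 1, 1, 12, 0)
ingest([
    {"device_id": "d1", "event_time": datetime(2024, 1, 1, 11, 58, 30)},  # kept
    {"device_id": "d1", "event_time": datetime(2024, 1, 1, 11, 45, 0)},   # dropped
], now)
print(counts)  # one event counted in the ('d1', 11:58) window
```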
Exercises
Try these. Then compare with the solutions at the end of each exercise.
- Exercise 1: Translate a marketing request into a load spec. See details in the Exercises panel below.
- Exercise 2: Define acceptance criteria and reconciliation for a finance load. See details below.
Self-check checklist before viewing solutions
- Did you state the grain explicitly?
- Did you choose a load pattern and justify it?
- Are dedup and merge keys unambiguous?
- Do DQ rules have thresholds and actions?
- Are SLAs and retry/reprocess rules clear?
- Can each acceptance criterion be tested automatically?
Common mistakes and self-checks
- Vague grain: Fix by writing "One row equals ..." for every target table.
- Missing late-data policy: Always specify freshness window and reprocessing scope.
- Unclear merge keys: Define natural keys and how to resolve ties.
- Soft DQ wording: Replace "good quality" with precise thresholds and actions.
- No acceptance tests: Add count reconciliations, KPI parity checks, and audit checks.
- Ignoring idempotency: Define how reruns avoid duplicates or double counting.
Quick self-audit mini task
Pick one of your specs and find these 4 items. If missing, add them now:
- Grain statement
- Late-arriving window
- Exact merge key and tie-breaker
- At least 3 acceptance tests
Practical projects
- Build a Sales Daily Load: Create a spec and a stub pipeline that loads two days of sample sales with late-arriving adjustments. Include a reprocessing script.
- Design a Customer SCD2: Write the load spec and implement a test dataset showing two address changes and one name correction.
- Finance Revenue Reconciliation: Specify and simulate a monthly load with rounding rules and a variance threshold. Implement automated checks that fail the job when variance exceeds 0.3%, as sketched below.
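For the reconciliation project, the variance rule can be a single check that fails the run. A minimal sketch, assuming two rounded totals and the 0.3% threshold named above.

```python
# Minimal sketch: fail the run when loaded revenue diverges from the ledger
# by more than the 0.3% threshold. Rounding and totals are illustrative.
def check_variance(loaded: float, ledger: float, max_pct: float = 0.3) -> None:
    variance_pct = abs(round(loaded, 2) - round(ledger, 2)) / abs(ledger) * 100
    if variance_pct > max_pct:
        raise RuntimeError(f"Reconciliation failed: variance {variance_pct:.2f}% > {max_pct}%")

check_variance(1_000_500.00, 1_002_000.00)    # ~0.15% variance: passes silently
# check_variance(1_000_000.00, 1_010_000.00)  # ~0.99% variance: would raise
```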
Who this is for
- ETL Developers and Data Engineers aligning pipelines with business outcomes.
- Analysts moving into data platform work.
- Product/data PMs writing data requirements and acceptance criteria.
Prerequisites
- Basic SQL: joins, aggregates, window functions.
- Familiarity with data modeling concepts (dimensions, facts, SCD).
- Understanding of batch vs streaming processing.
Learning path
- This lesson: Turn business needs into load specs.
- Next: Source profiling and data contracts (define field-level expectations).
- Then: Mapping documents and transformation logic.
- Finally: Validation automation and observability (DQ checks, lineage, alerts).
Mini challenge
A stakeholder asks: "Can we get a weekly active users table by every Monday 8am?" Draft a 6-line mini spec stating grain, source, incremental logic, late-arriving policy, DQ rules (2 items), and acceptance criteria (2 items). Keep it testable.