
ETL Requirements And Source Analysis

Learn ETL Requirements And Source Analysis for ETL Developers for free: roadmap, examples, subskills, and a skill exam.

Published: January 11, 2026 | Updated: January 11, 2026

What you'll learn and why it matters for ETL Developers

ETL Requirements and Source Analysis is about turning business goals into precise, testable data movement rules. You'll learn to clarify what to load, from where, how often, in what shape, with what keys, and what to do when data is late or broken. Solid analysis prevents rework, failed nightly jobs, and unreliable dashboards. This skill unlocks faster delivery, fewer defects, and predictable operations.

Checklist: Before extracting any source
  • Confirm business questions and the grain of required outputs.
  • List all source systems, tables/files, and owners.
  • Profile data: volumes, nulls, distinct counts, value ranges.
  • Decide primary/foreign keys and surrogate keys if needed.
  • Define incremental logic and watermarks/CDC markers.
  • Set latency windows, cutoffs, and retry policy.
  • Document error rules, quarantine flows, and alerts.
  • Estimate load times vs. SLAs and backfill strategy.

Who this is for

  • Aspiring and junior ETL Developers who need a reliable method to move from vague requirements to runnable pipelines.
  • Analysts or DBAs transitioning into data engineering roles.
  • Engineers who want to reduce failed loads and unplanned maintenance.

Prerequisites

  • Comfort with SQL (SELECT, JOIN, GROUP BY, window functions).
  • Basic understanding of data modeling (dimensions, facts, grain).
  • Familiarity with batch scheduling concepts and data file formats (CSV/JSON/Parquet).

Learning path

  1. Clarify business requirements: define the questions, outputs, and success criteria.
  2. Profile sources: discover data shape, quality, and volumes.
  3. Map source-to-target: fields, transformations, keys, and constraints.
  4. Identify grain and keys: ensure unique rows and stable joins.
  5. Define incremental logic: watermarks, CDC, and reprocessing rules.
  6. Plan for latency and cutoffs: handle late-arriving and partial-day data safely.
  7. Set error handling: validation, quarantine, logging, alerting.
  8. Estimate loads and SLAs: volumes, throughput, windows, and backfills.

Worked examples

Example 1 — Profiling a source table

Goal: Understand data shape, quality, and growth for a table sales.order_items.

-- Row count and recency
SELECT COUNT(*) AS row_count,
       MIN(created_at) AS oldest,
       MAX(created_at) AS newest
FROM sales.order_items;

-- Null/blank checks for critical fields
SELECT SUM(CASE WHEN order_id IS NULL THEN 1 ELSE 0 END) AS null_order_id,
       SUM(CASE WHEN product_id IS NULL THEN 1 ELSE 0 END) AS null_product_id,
       SUM(CASE WHEN quantity <= 0 THEN 1 ELSE 0 END) AS bad_quantity
FROM sales.order_items;

-- Cardinality and outliers
SELECT COUNT(DISTINCT order_id) AS distinct_orders,
       COUNT(DISTINCT product_id) AS distinct_products,
       PERCENTILE_CONT(0.99) WITHIN GROUP (ORDER BY quantity) AS p99_qty
FROM sales.order_items;

Record key observations: null rates, extreme quantities, and latest timestamp. Use these to set validation rules and watermarks.
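
To round out the profile, check value frequencies on key dimensions. A minimal sketch (product_id is an example column; LIMIT syntax varies by engine):

-- Top 10 values for a dimension column; heavy skew or unexpected codes stand out here
SELECT product_id,
       COUNT(*) AS row_count
FROM sales.order_items
GROUP BY product_id
ORDER BY row_count DESC
LIMIT 10;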

Example 2 — Keys, grain, and source-to-target mapping

Scenario: Build a fact table at order item grain with one row per order_id + line_number.

# Target: fact_order_item (grain: order_id + line_number)
# Keys: natural composite key order_id + line_number; surrogate key: fact_item_sk
# Mapping (YAML-style doc)
columns:
  - name: fact_item_sk
    type: bigint
    derivation: surrogate_key(order_id, line_number)
  - name: order_id
    from: sales.order_items.order_id
  - name: line_number
    from: sales.order_items.line_number
  - name: product_sk
    from: dim_product.product_sk
    join: product_id = dim_product.product_id
  - name: quantity
    from: sales.order_items.quantity
    validation: quantity > 0
  - name: unit_price
    from: sales.order_items.unit_price
  - name: extended_price
    derivation: quantity * unit_price
  - name: load_ts
    derivation: current_timestamp()
constraints:
  - unique: [order_id, line_number]
  - not_null: [order_id, line_number, quantity, unit_price]

This mapping clarifies the grain, joins, transformations, and constraints—all testable.
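
Because the spec declares keys and constraints, you can verify them with simple checks before and after the load. A minimal sketch of a grain test and a referential check, reusing the table names from the mapping:

-- Grain test: should return zero rows if order_id + line_number is unique
SELECT order_id, line_number, COUNT(*) AS dup_count
FROM sales.order_items
GROUP BY order_id, line_number
HAVING COUNT(*) > 1;

-- Referential check: order items whose product is missing from dim_product
SELECT COUNT(*) AS orphan_items
FROM sales.order_items s
LEFT JOIN dim_product d ON s.product_id = d.product_id
WHERE d.product_id IS NULL;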

Example 3 — Incremental loads using watermark

Use updated_at as the driving column. Persist last_success_ts from previous runs.
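
A minimal sketch of the control table this example assumes (names match control.last_run used below; the seed date and types are illustrative, and IF NOT EXISTS support varies by engine):

-- One row per job, holding the high-water mark of the last successful run
CREATE TABLE IF NOT EXISTS control.last_run (
  job_name        text PRIMARY KEY,
  last_success_ts timestamp NOT NULL
);

-- Seed the watermark far enough back to cover the initial load
INSERT INTO control.last_run (job_name, last_success_ts)
VALUES ('fact_order_item', TIMESTAMP '1970-01-01 00:00:00');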

-- Assume we store last_success_ts in control.last_run
WITH params AS (
  SELECT last_success_ts FROM control.last_run WHERE job_name = 'fact_order_item'
),
changed AS (
  SELECT *
  FROM sales.order_items s, params p
  WHERE s.updated_at > p.last_success_ts
)
MERGE INTO dw.fact_order_item t
USING (
  SELECT order_id, line_number, product_id, quantity, unit_price,
         quantity * unit_price AS extended_price, updated_at
  FROM changed
) src
ON (t.order_id = src.order_id AND t.line_number = src.line_number)
WHEN MATCHED AND src.updated_at > t.updated_at THEN
  UPDATE SET product_id = src.product_id,
             quantity = src.quantity,
             unit_price = src.unit_price,
             extended_price = src.extended_price,
             updated_at = src.updated_at
WHEN NOT MATCHED THEN
  INSERT (order_id, line_number, product_id, quantity, unit_price, extended_price, updated_at)
  VALUES (src.order_id, src.line_number, src.product_id, src.quantity, src.unit_price, src.extended_price, src.updated_at);

-- After success, update the watermark
UPDATE control.last_run SET last_success_ts = NOW() WHERE job_name = 'fact_order_item';

Key decisions: strict > vs. >=, and which timestamp to use for the watermark update.
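
If you choose >= with a small overlap instead, deduplicate inside the extract so the MERGE sees one version per key. A minimal sketch, assuming a 5-minute overlap (the interval is illustrative):

-- Re-read a small overlap window, then keep only the latest version per key
WITH params AS (
  SELECT last_success_ts - interval '5 minutes' AS from_ts
  FROM control.last_run
  WHERE job_name = 'fact_order_item'
),
changed AS (
  SELECT s.*,
         ROW_NUMBER() OVER (PARTITION BY s.order_id, s.line_number
                            ORDER BY s.updated_at DESC) AS rn
  FROM sales.order_items s, params p
  WHERE s.updated_at >= p.from_ts
)
SELECT *
FROM changed
WHERE rn = 1;  -- feed this deduplicated set into the MERGE above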

Example 4 — Handling data latency and cutoffs

Daily batch runs at 01:00 with a T-1 cutoff; the extract re-reads one extra day so late arrivals land within a 2-day window.

-- Window captures yesterday plus late arrivals from the day before
WITH bounds AS (
  SELECT date_trunc('day', NOW() - interval '1 day') AS start_window,
         date_trunc('day', NOW()) AS end_window
)
SELECT *
FROM sales.order_items s, bounds b
WHERE s.created_at >= b.start_window - interval '1 day'  -- allow 1 extra day for late data
  AND s.created_at < b.end_window;                        -- up to midnight today

When you add late windows, ensure idempotence by MERGE/UPSERT instead of blind inserts.
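
If your platform lacks MERGE, delete-by-key then insert from a staged window gives the same idempotence. A minimal sketch using PostgreSQL-style DELETE ... USING; stg.order_items_window is an assumed staging table holding the extracted window:

BEGIN;

-- Remove any target rows for keys present in this window, then re-insert them
DELETE FROM dw.fact_order_item t
USING stg.order_items_window s
WHERE t.order_id = s.order_id
  AND t.line_number = s.line_number;

INSERT INTO dw.fact_order_item (order_id, line_number, product_id, quantity,
                                unit_price, extended_price, updated_at)
SELECT order_id, line_number, product_id, quantity,
       unit_price, quantity * unit_price, updated_at
FROM stg.order_items_window;

COMMIT;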

Example 5 — Error handling and quarantine

Define explicit error codes and log invalid records with context.

-- Error log table
CREATE TABLE IF NOT EXISTS ops.etl_error_log (
  job_name      text,
  batch_id      text,
  source_system text,
  record_key    text,
  error_code    text,
  error_msg     text,
  event_ts      timestamp default current_timestamp
);

-- Quarantine invalid records (missing order_id or non-positive quantity)
CREATE TABLE IF NOT EXISTS ops.quarantine_order_items AS
SELECT * FROM sales.order_items WHERE 1=0;

INSERT INTO ops.quarantine_order_items
SELECT *
FROM sales.order_items
WHERE order_id IS NULL OR quantity <= 0;

INSERT INTO ops.etl_error_log (job_name, batch_id, source_system, record_key, error_code, error_msg)
SELECT 'load_fact_order_item',
       '2026-01-11',
       'sales',
       COALESCE(CAST(order_id AS text), 'null') || '-' || COALESCE(CAST(line_number AS text), 'null'),
       CASE WHEN order_id IS NULL THEN 'ORD_NULL' ELSE 'QTY_NONPOS' END,
       CASE WHEN order_id IS NULL THEN 'order_id missing' ELSE 'quantity <= 0' END
FROM sales.order_items
WHERE order_id IS NULL OR quantity <= 0;

Keep error categories stable so alerts and dashboards remain meaningful over time.
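
Stable codes also make monitoring straightforward. A minimal sketch of a per-batch summary an alerting job could watch (the threshold is illustrative):

-- Error counts per batch and code; alert when a code exceeds its usual level
SELECT batch_id,
       error_code,
       COUNT(*) AS error_count
FROM ops.etl_error_log
WHERE job_name = 'load_fact_order_item'
GROUP BY batch_id, error_code
HAVING COUNT(*) > 100  -- example threshold; tune per error_code
ORDER BY batch_id DESC, error_count DESC;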

Example 6 — Estimating volume and SLA

Assume 50M rows/day, 500 rows/sec per single-threaded loader, and 4-way parallelism.

throughput_total = 500 rows/sec * 4 = 2000 rows/sec
load_time = 50,000,000 / 2000 = 25,000 sec ≈ 6.94 hours

SLA check: If your daily window is 6 hours, you need either higher parallelism,
faster I/O, filtering to incremental-only, or a different schedule.
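
Rather than guessing the daily volume, measure it from the source before committing to an SLA. A minimal sketch (the 14-day lookback is illustrative):

-- Daily arrival volumes for the last 14 days, to ground the throughput estimate
SELECT date_trunc('day', created_at) AS load_day,
       COUNT(*) AS rows_per_day
FROM sales.order_items
WHERE created_at >= NOW() - interval '14 days'
GROUP BY date_trunc('day', created_at)
ORDER BY load_day;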

Drills and quick exercises

  • List five business questions your target table must answer. Define the table grain that supports them.
  • Profile a real table: row count, null rates for 3 columns, and top 5 values for a dimension.
  • Create a source-to-target mapping for 8–12 columns, including at least 2 derived fields.
  • Write incremental SQL using a watermark and justify your choice of > vs. >=.
  • Propose cutoff and late-arrival rules for a daily job with 2-hour data delays.
  • Draft three error codes and show where they would be logged.
  • Estimate load duration for a dataset of your choice and identify at least two throughput levers.

Common mistakes and debugging tips

  • Unclear grain: Leads to duplicates or broken joins. Tip: Write the grain sentence: “One row per …, uniquely by …”.
  • Missing key strategy: Natural keys change; plan surrogate keys for stable references.
  • Loose incremental filters: Off-by-one on timestamps causes missed or duplicated rows. Tip: Choose strict > and store precise last_success_ts; or use >= with a small overlap + dedup.
  • No late-arrival plan: Late data silently never lands. Tip: Add a configurable late window and an upsert-based target.
  • All-or-nothing loads: A single bad record blocks the batch. Tip: Validate, quarantine, continue, and alert.
  • Underestimated volumes: Jobs exceed windows. Tip: Measure daily deltas, test throughput, and parallelize by partition keys (date, id ranges); see the sketch after this list.
  • Hidden business rules: Transformations live only in code. Tip: Maintain a mapping spec that reviewers can read before you build.
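
A minimal sketch of partitioned extraction for the parallelization tip above (the 4-way split, the key choice, and the dates are illustrative; MOD assumes a numeric order_id):

-- Slice by numeric key: worker i of 4 takes rows where MOD(order_id, 4) = i
SELECT *
FROM sales.order_items
WHERE MOD(order_id, 4) = 0;  -- worker 0 of 4

-- Or slice by date: each worker loads one day of the backfill window
SELECT *
FROM sales.order_items
WHERE created_at >= DATE '2026-01-10'
  AND created_at <  DATE '2026-01-11';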

Troubleshooting playbook
  • Duplicates in target: Check grain, unique constraints, and MERGE ON clause.
  • Missing rows: Validate watermark and source time zone shifts; diff counts by day (see the sketch after this list).
  • Slow job: Look for filters on non-indexed columns; push down projections; increase parallelism carefully.
  • Unexpected nulls: Re-check join keys and default values; validate referential completeness of dimensions.
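
A minimal sketch of the counts-by-day diff from the playbook, bucketing both sides on updated_at (quarantined rows will legitimately appear as differences):

-- Row counts per day in source vs. target; non-zero diffs point at missed or extra loads
WITH src AS (
  SELECT date_trunc('day', updated_at) AS d, COUNT(*) AS src_rows
  FROM sales.order_items
  GROUP BY date_trunc('day', updated_at)
),
tgt AS (
  SELECT date_trunc('day', updated_at) AS d, COUNT(*) AS tgt_rows
  FROM dw.fact_order_item
  GROUP BY date_trunc('day', updated_at)
)
SELECT COALESCE(src.d, tgt.d) AS load_day,
       COALESCE(src_rows, 0) AS src_rows,
       COALESCE(tgt_rows, 0) AS tgt_rows,
       COALESCE(src_rows, 0) - COALESCE(tgt_rows, 0) AS diff_rows
FROM src
FULL OUTER JOIN tgt ON src.d = tgt.d
ORDER BY load_day;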

Mini project: Order Analytics Foundation

Build a small pipeline that loads a fact_order_item table and two dimensions from mixed sources (CSV for product catalog, relational for orders).

  • Write requirements: business questions, outputs, grain, and acceptance criteria.
  • Profile sources (orders, order_items, products): counts, nulls, cardinality.
  • Create a mapping spec with keys, derived columns, and constraints.
  • Define incremental logic (updated_at watermark) and a 2-day late window.
  • Implement error handling with a quarantine table and error log.
  • Estimate load time for a mock daily volume; propose parallelization.

Acceptance criteria
  • Exactly one row per order_id + line_number in fact table.
  • All quantities positive; invalid rows quarantined and logged with error codes.
  • Incremental runs are idempotent and capture late arrivals within 2 days.
  • Total runtime estimate aligns with SLA window.

Practical projects

  • Customer 360 incremental load: Merge CRM, web events, and support tickets with clear grain per interaction.
  • Financial daily snapshots: Build a snapshot fact with strict cutoffs and late correction mechanisms.
  • Inventory SCD Type 2: Define keys, detect changes, and manage validity ranges.

Subskills

  • Understanding Business Requirements For Loads — Turn stakeholder goals into measurable outputs and acceptance criteria.
  • Source System Profiling — Discover data quality, volumes, distributions, and recency.
  • Source To Target Mapping — Document field-by-field transformations and constraints.
  • Identifying Keys And Grain — Ensure uniqueness and stable joins.
  • Defining Incremental Logic — Watermarks, CDC, and idempotent merges.
  • Handling Data Latency And Cutoffs — Safe windows for late-arriving data.
  • Defining Error Handling Rules — Validation, quarantine, logging, and alerting.
  • Estimating Load Volumes And SLAs — Throughput planning and parallelization.

Next steps

  • Work through the subskills below in order; each has focused tasks.
  • Complete the mini project and have a peer review your mapping specs.
  • Take the skill exam at the bottom of this page. Anyone can take it; only logged-in users have progress saved.

ETL Requirements And Source Analysis — Skill Exam

This exam checks your grasp of requirements gathering, source profiling, mapping, keys/grain, incremental logic, latency/cutoffs, error handling, and volume/SLA planning. You can take it for free; progress and results are saved only if you are logged in. Choose the best answer for each question. Passing score is 70%.

12 questions · 70% to pass
