Why this matters
Source system profiling is how ETL Developers uncover the real shape, quality, and behavior of data before building pipelines. It prevents broken jobs, bad dashboards, and last-minute fire drills.
- You estimate nulls, duplicates, and outliers to design the right transformations.
- You discover keys, watermarks, and change patterns for incremental loads.
- You validate assumptions from stakeholders against the actual data.
- You reduce rework by documenting quirks early (encodings, time zones, late-arriving data).
Real tasks you might face
- Confirm that customer_id is stable and unique across tables.
- Find reliable fields for CDC (e.g., updated_at, hash, version).
- Measure null and distinct rates to plan lookups and dimensions.
- Detect schema drift in JSON payloads and plan for optional fields.
Concept explained simply
Profiling means measuring and describing data as it truly is: structure, values, relationships, and change dynamics. Think of it as a health check-up before starting a training program: you’re not cleaning or transforming yet—you’re learning how the data behaves so you can design a safe, efficient ETL.
Mental model
Think of the source as a city. Profiling is drawing the city map: streets (schemas), traffic patterns (update frequency), bottlenecks (nulls, duplicates), and rules (constraints). With the map, you can plan the fastest route (ETL design) without getting stuck.
What to profile (the essentials)
- Schema and types: column names, data types, nested structures, enums (see the catalog query after this list).
- Completeness: null/blank rates per column.
- Uniqueness: distinct counts; candidate keys and composite keys.
- Validity: ranges, patterns (regex), domain checks, date/time sanity.
- Integrity: referential integrity between tables (FK-like relationships).
- Duplicates: exact or near-duplicates; business key vs surrogate key.
- Change signals: created_at/updated_at, version fields, CDC flags, soft deletes.
- Volumes and velocity: row counts, daily deltas, seasonality, burstiness.
- Text quirks: encodings, trimming issues, case sensitivity, special characters.
- Time: time zones, daylight saving changes, lag between systems.
- Access constraints: pagination limits, API rate limits, extraction windows.
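For the schema-and-types item, most relational sources expose a catalog you can query directly. A minimal sketch, assuming a PostgreSQL-compatible engine; the schema and table names (public.orders) are placeholders for your own source:
-- Column inventory for one table; schema/table names below are placeholders
SELECT column_name, data_type, is_nullable, character_maximum_length
FROM information_schema.columns
WHERE table_schema = 'public' AND table_name = 'orders'
ORDER BY ordinal_position;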
Helpful metrics to compute
- count, min, max, avg, stddev (for numerics)
- min/max date, % future dates, % invalid dates
- % nulls, % empty strings, % whitespace-only
- distinct count, top N frequent values
- % outliers (e.g., beyond 1.5 × IQR from the quartiles, or |z-score| > 3)
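A sketch of these metrics for one numeric column, assuming PostgreSQL and the stg_orders staging table used later in this lesson (swap in your own table and column):
-- Core numeric profile; q1/q3 feed the 1.5 × IQR outlier rule
SELECT COUNT(*) AS row_count,
       COUNT(total_amount) AS non_null_count,
       MIN(total_amount) AS min_val,
       MAX(total_amount) AS max_val,
       AVG(total_amount) AS avg_val,
       STDDEV(total_amount) AS stddev_val,
       COUNT(DISTINCT total_amount) AS distinct_vals,
       PERCENTILE_CONT(0.25) WITHIN GROUP (ORDER BY total_amount) AS q1,
       PERCENTILE_CONT(0.75) WITHIN GROUP (ORDER BY total_amount) AS q3
FROM stg_orders;
-- Top N frequent values for a text column (here the email domain)
SELECT split_part(email, '@', 2) AS domain, COUNT(*) AS cnt
FROM stg_orders GROUP BY 1 ORDER BY 2 DESC LIMIT 10;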
Step-by-step profiling workflow
- Inventory sources: list tables/files/endpoints and their schemas.
- Sample safely: full scan if possible; otherwise stratified or time-based samples (sampling sketch after this list).
- Compute core metrics: nulls, distincts, ranges, duplicates.
- Test relationships: join keys, parent/child counts, orphans.
- Probe change patterns: detect reliable watermarks and daily deltas.
- Document decisions: constraints, gotchas, and ETL implications.
- Validate with stakeholders: confirm business meaning of fields.
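When a full scan is too expensive, a time-based or random sample keeps profiling cheap. A sketch, assuming PostgreSQL (TABLESAMPLE works on physical tables) and the stg_orders table from the first worked example:
-- Time-based sample: profile only the most recent 30 days
SELECT COUNT(*) AS row_count, COUNT(DISTINCT order_id) AS distinct_orders
FROM stg_orders
WHERE order_ts >= NOW() - INTERVAL '30 days';
-- Random ~1% block sample for rough exploratory metrics
SELECT AVG(total_amount) AS avg_amount, MAX(total_amount) AS max_amount
FROM stg_orders TABLESAMPLE SYSTEM (1);
Before trusting a sample, compare a few of its metrics against a full scan; rare but critical issues hide in the tail (see the common mistakes section).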
Documentation template (copy/paste)
- Source object: name, owner, refresh cadence
- Keys: candidate keys, uniqueness test results
- Completeness: null/blank rates per column
- Validity: rules and % violations
- Relationships: joins tested + orphan counts
- Change strategy: full vs incremental watermark, late-arrival handling
- Operational: extraction window, rate limits, file size patterns
Worked examples
Example 1: CSV Orders file
orders_YYYYMMDD.csv with columns: order_id, customer_id, order_ts, total_amount, email.
Profiling approach
- Check uniqueness of order_id and duplicates across days (cross-day check sketched below).
- Null rates for customer_id and total_amount.
- Date sanity: no future timestamps; timezone awareness.
- Email pattern validity and common domains.
-- Example SQL (after loading the CSV to a staging table stg_orders)
SELECT COUNT(*) AS row_count,
       COUNT(DISTINCT order_id) AS distinct_orders,
       100.0 * SUM(CASE WHEN customer_id IS NULL THEN 1 ELSE 0 END) / COUNT(*) AS pct_null_customer,
       SUM(CASE WHEN total_amount < 0 THEN 1 ELSE 0 END) AS negative_amounts,
       SUM(CASE WHEN order_ts > NOW() THEN 1 ELSE 0 END) AS future_ts
FROM stg_orders;
SELECT email ~ '^[^@\s]+@[^@\s]+\.[^@\s]+$' AS email_valid, COUNT(*)
FROM stg_orders GROUP BY 1 ORDER BY 2 DESC;
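The cross-day duplicate check needs to know which daily file each row came from. A sketch, assuming a source_file column was added while staging the CSVs (a common but not universal convention):
-- order_id values that appear in more than one daily file
SELECT order_id,
       COUNT(DISTINCT source_file) AS files_seen,
       COUNT(*) AS total_rows
FROM stg_orders
GROUP BY order_id
HAVING COUNT(DISTINCT source_file) > 1
ORDER BY files_seen DESC
LIMIT 20;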
Example 2: OLTP tables (customers, orders, order_items)
Profiling approach
- Check referential integrity: every order has a valid customer and items.
- Find incremental watermark (updated_at vs created_at).
- Detect soft deletes (deleted_flag or deleted_at); a sketch follows the queries below.
-- Orphans
SELECT COUNT(*) AS orphans
FROM orders o LEFT JOIN customers c ON o.customer_id = c.customer_id
WHERE c.customer_id IS NULL;
-- Candidate watermark reliability
SELECT SUM(CASE WHEN updated_at < created_at THEN 1 ELSE 0 END) AS anomalies,
       SUM(CASE WHEN updated_at IS NULL THEN 1 ELSE 0 END) AS missing_updates
FROM orders;
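For the soft-delete check, a sketch assuming the source exposes a deleted_at column (substitute deleted_flag or a status value if that is what your system uses):
-- Share of soft-deleted rows and whether deletions are still happening;
-- recent deletes mean the pipeline must propagate them downstream
SELECT 100.0 * SUM(CASE WHEN deleted_at IS NOT NULL THEN 1 ELSE 0 END) / COUNT(*) AS pct_soft_deleted,
       MAX(deleted_at) AS latest_delete
FROM orders;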
-- Late-arriving orders: compare when a row landed with its business date
-- (loaded_at is an assumed staging/audit column; use your own ingestion timestamp)
SELECT DATE(order_ts) AS order_date,
       SUM(CASE WHEN DATE(loaded_at) > DATE(order_ts) THEN 1 ELSE 0 END) AS late_rows,
       COUNT(*) AS total_rows
FROM orders
GROUP BY 1 ORDER BY 1;
Example 3: JSON web events API
Endpoint returns events with fields: event_id, user_id, ts, attrs (JSON), session_id (optional), geo.country (nested).
Profiling approach
- Schema drift: list optional/missing fields and types over 7 days (key-frequency query below).
- Duplicates: detect repeated event_id across pages.
- Cardinality: top event types; high-cardinality attributes.
-- After landing to stg_events with the raw payload in a jsonb column j (PostgreSQL syntax)
SELECT COUNT(*) AS row_count,
       COUNT(DISTINCT (j->>'event_id')) AS distinct_events,
       100.0 * SUM(CASE WHEN j ? 'session_id' THEN 1 ELSE 0 END) / COUNT(*) AS pct_has_session
FROM stg_events;
-- Top event types
SELECT j->>'type' AS event_type, COUNT(*)
FROM stg_events GROUP BY 1 ORDER BY 2 DESC LIMIT 20;
-- Duplicate event_id check
SELECT (j->>'event_id') AS event_id, COUNT(*)
FROM stg_events GROUP BY 1 HAVING COUNT(*) > 1 ORDER BY 2 DESC;
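One way to quantify the schema drift mentioned above is to count how often each top-level key appears per day. A sketch, assuming PostgreSQL jsonb and ISO-8601 timestamps in ts:
-- Daily presence of each top-level key; keys that appear or disappear across days indicate drift
SELECT ((j->>'ts')::timestamptz)::date AS d,
       k AS field,
       COUNT(*) AS cnt
FROM stg_events, LATERAL jsonb_object_keys(j) AS k
GROUP BY 1, 2
ORDER BY 1, 3 DESC;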
Turning findings into ETL design
- Nulls: add defaulting, rejection rules, or enrichment steps.
- Keys: choose composite/business keys if single-column keys aren’t stable.
- Watermark: pick updated_at if reliable; otherwise use hash-diff (sketched after this list) or full reload by partition.
- Late data: design windowed upserts (e.g., reprocess the last 7 days).
- Drift: use schema-on-read or versioned contracts; make optional fields nullable.
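When no column is a trustworthy watermark, a hash-diff compares a fingerprint of each row against the previously loaded version. A minimal sketch, assuming PostgreSQL's md5/concat_ws and a hypothetical target table dw_orders that stores the last loaded row_hash per order:
-- New or changed rows by content hash; the hashed column list is illustrative
SELECT s.order_id
FROM (
  SELECT order_id,
         md5(concat_ws('|', customer_id::text, total_amount::text, email)) AS row_hash
  FROM stg_orders
) s
LEFT JOIN dw_orders t ON t.order_id = s.order_id
WHERE t.order_id IS NULL OR t.row_hash <> s.row_hash;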
Example decisions
- Deduplicate by (order_id, order_ts) keeping latest updated_at.
- Incremental load on updated_at with 2-hour overlap window.
- Reject rows with negative total_amount; log to a quarantine table.
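A sketch of the first two decisions combined, assuming PostgreSQL (DISTINCT ON keeps one row per key) and a hypothetical etl_watermarks control table that stores the last successful load point:
-- Pull rows changed since the last load minus a 2-hour overlap,
-- then keep only the latest version of each (order_id, order_ts)
WITH increment AS (
  SELECT *
  FROM orders
  WHERE updated_at >= (SELECT MAX(loaded_through)
                       FROM etl_watermarks
                       WHERE source_name = 'orders') - INTERVAL '2 hours'
)
SELECT DISTINCT ON (order_id, order_ts) *
FROM increment
ORDER BY order_id, order_ts, updated_at DESC;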
Exercises you can do now
These match the exercises below. Do them in your SQL engine or a notebook. Use small sample tables if needed.
Exercise 1: Find a trustworthy watermark
- Goal: Given a table sales with columns id, created_at, updated_at, status, and amount, determine whether updated_at is reliable for incremental loads.
- Deliverable: A short note and a query set that proves your choice.
Exercise 2: Validate keys and relationships
- Goal: For tables users(user_id, email) and orders(order_id, user_id, created_at), find candidate keys and orphan orders.
- Deliverable: Queries plus counts for uniqueness and orphans.
Self-check checklist
- I measured null, distinct, and range stats for every critical column.
- I tested referential integrity and reported orphan counts.
- I verified the watermark and proposed an overlap window.
- I documented at least 3 ETL decisions based on the data behavior.
Common mistakes and how to self-check
- Profiling only a tiny sample: can hide rare but critical issues. Self-check: compare metrics on full data vs sample.
- Trusting column names blindly: updated_at can be stale. Self-check: find rows where updated_at < created_at or never changes.
- Ignoring time zones: future timestamps or day shifts. Self-check: normalize to UTC in profiling queries.
- Not measuring duplicates: assume id is unique. Self-check: count duplicates explicitly and list top offenders.
- Skipping documentation: knowledge gets lost. Self-check: fill the profiling template and get sign-off.
Mini challenge
You receive daily parquet files for payments with fields: payment_id, user_id, amount, currency, paid_at, updated_at, status. You must design an incremental load that tolerates late-arriving corrections. In one paragraph, propose: the key(s), watermark, overlap window, and rules for negative or zero amounts.
Hint
Consider (payment_id) as key, updated_at as watermark with 1–2 day overlap, and quarantine invalid amounts with a reason code.
Who this is for
- Aspiring and working ETL Developers
- Data Engineers responsible for ingestion and modeling
- Analytics Engineers validating upstream data
Prerequisites
- Basic SQL (SELECT, JOIN, GROUP BY, window functions helpful)
- Familiarity with files (CSV/JSON/Parquet) and APIs
- Understanding of ETL/ELT concepts and incremental loads
Learning path
- Start: Source system profiling (this lesson)
- Next: Data quality rules and validations
- Then: Incremental loading and CDC patterns
- Finally: Dimensional modeling and serving layers
Practical projects
- Profile a public CSV and write a 1-page profiling report with ETL decisions.
- Load an events JSON file, detect schema drift, and plan a resilient schema-on-read.
- Create a small star schema from profiled sources and document the keys and SCD strategy.
Next steps
- Do the exercises below and compare with solutions.
- Take the Quick Test to check your understanding. The test is available to everyone; only logged-in users get saved progress.
- Apply the profiling template on a real dataset at work or in a sandbox.
Quick Test
Answer the questions below to validate your understanding.