Why this matters
Source system profiling is how ETL Developers uncover the real shape, quality, and behavior of data before building pipelines. It prevents broken jobs, bad dashboards, and last-minute fire drills.
- You estimate nulls, duplicates, and outliers to design the right transformations.
- You discover keys, watermarks, and change patterns for incremental loads.
- You validate assumptions from stakeholders against the actual data.
- You reduce rework by documenting quirks early (encodings, time zones, late-arriving data).
Real tasks you might face
- Confirm that customer_id is stable and unique across tables.
- Find reliable fields for CDC (e.g., updated_at, hash, version).
- Measure null and distinct rates to plan lookups and dimensions.
- Detect schema drift in JSON payloads and plan for optional fields.
Concept explained simply
Profiling means measuring and describing data as it truly is: structure, values, relationships, and change dynamics. Think of it as a health check-up before starting a training program: you’re not cleaning or transforming yet—you’re learning how the data behaves so you can design a safe, efficient ETL.
Mental model
Think of the source as a city. Profiling is drawing the city map: streets (schemas), traffic patterns (update frequency), bottlenecks (nulls, duplicates), and rules (constraints). With the map, you can plan the fastest route (ETL design) without getting stuck.
What to profile (the essentials)
- Schema and types: column names, data types, nested structures, enums (see the catalog query after this list).
- Completeness: null/blank rates per column.
- Uniqueness: distinct counts; candidate keys and composite keys.
- Validity: ranges, patterns (regex), domain checks, date/time sanity.
- Integrity: referential integrity between tables (FK-like relationships).
- Duplicates: exact or near-duplicates; business key vs surrogate key.
- Change signals: created_at/updated_at, version fields, CDC flags, soft deletes.
- Volumes and velocity: row counts, daily deltas, seasonality, burstiness.
- Text quirks: encodings, trimming issues, case sensitivity, special characters.
- Time: time zones, daylight saving changes, lag between systems.
- Access constraints: pagination limits, API rate limits, extraction windows.
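For the schema-and-types item, most relational sources expose a catalog you can query directly. A minimal sketch, assuming a PostgreSQL-compatible engine; the schema and table names (public.orders) are placeholders for your own source:
-- Column inventory for one table; schema/table names below are placeholders
SELECT column_name, data_type, is_nullable, character_maximum_length
FROM information_schema.columns
WHERE table_schema = 'public' AND table_name = 'orders'
ORDER BY ordinal_position;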
Helpful metrics to compute
- count, min, max, avg, stddev (for numerics)
- min/max date, % future dates, % invalid dates
- % nulls, % empty strings, % whitespace-only
- distinct count, top N frequent values
- % outliers (e.g., beyond 1.5 × IQR from the quartiles, or |z-score| > 3)
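A sketch of these metrics for one numeric column, assuming PostgreSQL and the stg_orders staging table used later in this lesson (swap in your own table and column):
-- Core numeric profile; q1/q3 feed the 1.5 × IQR outlier rule
SELECT COUNT(*) AS row_count,
       COUNT(total_amount) AS non_null_count,
       MIN(total_amount) AS min_val,
       MAX(total_amount) AS max_val,
       AVG(total_amount) AS avg_val,
       STDDEV(total_amount) AS stddev_val,
       COUNT(DISTINCT total_amount) AS distinct_vals,
       PERCENTILE_CONT(0.25) WITHIN GROUP (ORDER BY total_amount) AS q1,
       PERCENTILE_CONT(0.75) WITHIN GROUP (ORDER BY total_amount) AS q3
FROM stg_orders;
-- Top N frequent values for a text column (here the email domain)
SELECT split_part(email, '@', 2) AS domain, COUNT(*) AS cnt
FROM stg_orders GROUP BY 1 ORDER BY 2 DESC LIMIT 10;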
Step-by-step profiling workflow
- Inventory sources: list tables/files/endpoints and their schemas.
- Sample safely: full scan if possible; otherwise stratified or time-based samples (sampling sketch after this list).
- Compute core metrics: nulls, distincts, ranges, duplicates.
- Test relationships: join keys, parent/child counts, orphans.
- Probe change patterns: detect reliable watermarks and daily deltas.
- Document decisions: constraints, gotchas, and ETL implications.
- Validate with stakeholders: confirm business meaning of fields.
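When a full scan is too expensive, a time-based or random sample keeps profiling cheap. A sketch, assuming PostgreSQL (TABLESAMPLE works on physical tables) and the stg_orders table from the first worked example:
-- Time-based sample: profile only the most recent 30 days
SELECT COUNT(*) AS row_count, COUNT(DISTINCT order_id) AS distinct_orders
FROM stg_orders
WHERE order_ts >= NOW() - INTERVAL '30 days';
-- Random ~1% block sample for rough exploratory metrics
SELECT AVG(total_amount) AS avg_amount, MAX(total_amount) AS max_amount
FROM stg_orders TABLESAMPLE SYSTEM (1);
Before trusting a sample, compare a few of its metrics against a full scan; rare but critical issues hide in the tail (see the common mistakes section).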
Documentation template (copy/paste)
- Source object: name, owner, refresh cadence
- Keys: candidate keys, uniqueness test results
- Completeness: null/blank rates per column
- Validity: rules and % violations
- Relationships: joins tested + orphan counts
- Change strategy: full vs incremental watermark, late-arrival handling
- Operational: extraction window, rate limits, file size patterns
Worked examples
Example 1: CSV Orders file
orders_YYYYMMDD.csv with columns: order_id, customer_id, order_ts, total_amount, email.
Profiling approach
- Check uniqueness of order_id and duplicates across days (cross-day check sketched below).
- Null rates for customer_id and total_amount.
- Date sanity: no future timestamps; timezone awareness.
- Email pattern validity and common domains.
-- Example SQL (after loading the CSV to a staging table stg_orders)
SELECT COUNT(*) AS row_count,
       COUNT(DISTINCT order_id) AS distinct_orders,
       100.0 * SUM(CASE WHEN customer_id IS NULL THEN 1 ELSE 0 END) / COUNT(*) AS pct_null_customer,
       SUM(CASE WHEN total_amount < 0 THEN 1 ELSE 0 END) AS negative_amounts,
       SUM(CASE WHEN order_ts > NOW() THEN 1 ELSE 0 END) AS future_ts
FROM stg_orders;
SELECT email ~ '^[^@\s]+@[^@\s]+\.[^@\s]+$' AS email_valid, COUNT(*)
FROM stg_orders GROUP BY 1 ORDER BY 2 DESC;
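The cross-day duplicate check needs to know which daily file each row came from. A sketch, assuming a source_file column was added while staging the CSVs (a common but not universal convention):
-- order_id values that appear in more than one daily file
SELECT order_id,
       COUNT(DISTINCT source_file) AS files_seen,
       COUNT(*) AS total_rows
FROM stg_orders
GROUP BY order_id
HAVING COUNT(DISTINCT source_file) > 1
ORDER BY files_seen DESC
LIMIT 20;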
Example 2: OLTP tables (customers, orders, order_items)
Profiling approach
- Check referential integrity: every order has a valid customer and items.
- Find incremental watermark (updated_at vs created_at).
- Detect soft deletes (deleted_flag or deleted_at); a sketch follows the queries below.
-- Orphans
SELECT COUNT(*) AS orphans
FROM orders o LEFT JOIN customers c ON o.customer_id = c.customer_id
WHERE c.customer_id IS NULL;
-- Candidate watermark reliability
SELECT SUM(CASE WHEN updated_at < created_at THEN 1 ELSE 0 END) AS anomalies,
       SUM(CASE WHEN updated_at IS NULL THEN 1 ELSE 0 END) AS missing_updates
FROM orders;
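For the soft-delete check, a sketch assuming the source exposes a deleted_at column (substitute deleted_flag or a status value if that is what your system uses):
-- Share of soft-deleted rows and whether deletions are still happening;
-- recent deletes mean the pipeline must propagate them downstream
SELECT 100.0 * SUM(CASE WHEN deleted_at IS NOT NULL THEN 1 ELSE 0 END) / COUNT(*) AS pct_soft_deleted,
       MAX(deleted_at) AS latest_delete
FROM orders;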
-- Late-arriving orders: compare when a row landed with its business date
-- (loaded_at is an assumed staging/audit column; use your own ingestion timestamp)
SELECT DATE(order_ts) AS order_date,
       SUM(CASE WHEN DATE(loaded_at) > DATE(order_ts) THEN 1 ELSE 0 END) AS late_rows,
       COUNT(*) AS total_rows
FROM orders
GROUP BY 1 ORDER BY 1;
Example 3: JSON web events API
Endpoint returns events with fields: event_id, user_id, ts, attrs (JSON), session_id (optional), geo.country (nested).
Profiling approach
- Schema drift: list optional/missing fields and types over 7 days (key-frequency query below).
- Duplicates: detect repeated event_id across pages.
- Cardinality: top event types; high-cardinality attributes.
-- After landing to stg_events with the raw payload in a jsonb column j (PostgreSQL syntax)
SELECT COUNT(*) AS row_count,
       COUNT(DISTINCT (j->>'event_id')) AS distinct_events,
       100.0 * SUM(CASE WHEN j ? 'session_id' THEN 1 ELSE 0 END) / COUNT(*) AS pct_has_session
FROM stg_events;
-- Top event types
SELECT j->>'type' AS event_type, COUNT(*)
FROM stg_events GROUP BY 1 ORDER BY 2 DESC LIMIT 20;
-- Duplicate event_id check
SELECT (j->>'event_id') AS event_id, COUNT(*)
FROM stg_events GROUP BY 1 HAVING COUNT(*) > 1 ORDER BY 2 DESC;
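One way to quantify the schema drift mentioned above is to count how often each top-level key appears per day. A sketch, assuming PostgreSQL jsonb and ISO-8601 timestamps in ts:
-- Daily presence of each top-level key; keys that appear or disappear across days indicate drift
SELECT ((j->>'ts')::timestamptz)::date AS d,
       k AS field,
       COUNT(*) AS cnt
FROM stg_events, LATERAL jsonb_object_keys(j) AS k
GROUP BY 1, 2
ORDER BY 1, 3 DESC;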
Turning findings into ETL design
- Nulls: add defaulting, rejection rules, or enrichment steps.
- Keys: choose composite/business keys if single-column keys aren’t stable.
- Watermark: pick updated_at if reliable; otherwise use hash-diff (sketched after this list) or full reload by partition.
- Late data: design windowed upserts (e.g., reprocess the last 7 days).
- Drift: use schema-on-read or versioned contracts; make optional fields nullable.
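When no column is a trustworthy watermark, a hash-diff compares a fingerprint of each row against the previously loaded version. A minimal sketch, assuming PostgreSQL's md5/concat_ws and a hypothetical target table dw_orders that stores the last loaded row_hash per order:
-- New or changed rows by content hash; the hashed column list is illustrative
SELECT s.order_id
FROM (
  SELECT order_id,
         md5(concat_ws('|', customer_id::text, total_amount::text, email)) AS row_hash
  FROM stg_orders
) s
LEFT JOIN dw_orders t ON t.order_id = s.order_id
WHERE t.order_id IS NULL OR t.row_hash <> s.row_hash;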
Example decisions
- Deduplicate by (order_id, order_ts) keeping latest updated_at.
- Incremental load on updated_at with 2-hour overlap window.
- Reject rows with negative total_amount; log to a quarantine table.
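A sketch of the first two decisions combined, assuming PostgreSQL (DISTINCT ON keeps one row per key) and a hypothetical etl_watermarks control table that stores the last successful load point:
-- Pull rows changed since the last load minus a 2-hour overlap,
-- then keep only the latest version of each (order_id, order_ts)
WITH increment AS (
  SELECT *
  FROM orders
  WHERE updated_at >= (SELECT MAX(loaded_through)
                       FROM etl_watermarks
                       WHERE source_name = 'orders') - INTERVAL '2 hours'
)
SELECT DISTINCT ON (order_id, order_ts) *
FROM increment
ORDER BY order_id, order_ts, updated_at DESC;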
Exercises you can do now
These match the exercises below. Do them in your SQL engine or a notebook. Use small sample tables if needed.
Exercise 1: Find a trustworthy watermark
- Goal: Given a table sales with columns id, created_at, updated_at, status, and amount, determine whether updated_at is reliable for incremental loads.
- Deliverable: A short note and a query set that proves your choice.
Exercise 2: Validate keys and relationships
- Goal: For tables users(user_id, email) and orders(order_id, user_id, created_at), find candidate keys and orphan orders.
- Deliverable: Queries plus counts for uniqueness and orphans.
Self-check checklist
- I measured null, distinct, and range stats for every critical column.
- I tested referential integrity and reported orphan counts.
- I verified the watermark and proposed an overlap window.
- I documented at least 3 ETL decisions based on the data behavior.
Common mistakes and how to self-check
- Profiling only a tiny sample: can hide rare but critical issues. Self-check: compare metrics on full data vs sample.
- Trusting column names blindly: updated_at can be stale. Self-check: find rows where updated_at < created_at or never changes.
- Ignoring time zones: future timestamps or day shifts. Self-check: normalize to UTC in profiling queries.
- Not measuring duplicates: assume id is unique. Self-check: count duplicates explicitly and list top offenders.
- Skipping documentation: knowledge gets lost. Self-check: fill the profiling template and get sign-off.
Mini challenge
You receive daily parquet files for payments with fields: payment_id, user_id, amount, currency, paid_at, updated_at, status. You must design an incremental load that tolerates late-arriving corrections. In one paragraph, propose: the key(s), watermark, overlap window, and rules for negative or zero amounts.
Hint
Consider (payment_id) as key, updated_at as watermark with 1–2 day overlap, and quarantine invalid amounts with a reason code.
Who this is for
- Aspiring and working ETL Developers
- Data Engineers responsible for ingestion and modeling
- Analytics Engineers validating upstream data
Prerequisites
- Basic SQL (SELECT, JOIN, GROUP BY, window functions helpful)
- Familiarity with files (CSV/JSON/Parquet) and APIs
- Understanding of ETL/ELT concepts and incremental loads
Learning path
- Start: Source system profiling (this lesson)
- Next: Data quality rules and validations
- Then: Incremental loading and CDC patterns
- Finally: Dimensional modeling and serving layers
Practical projects
- Profile a public CSV and write a 1-page profiling report with ETL decisions.
- Load an events JSON file, detect schema drift, and plan a resilient schema-on-read.
- Create a small star schema from profiled sources and document the keys and SCD strategy.
Next steps
- Do the exercises below and compare with solutions.
- Take the Quick Test to check your understanding. The test is available to everyone; only logged-in users get saved progress.
- Apply the profiling template on a real dataset at work or in a sandbox.
Quick Test
Answer the questions below to validate your understanding.