Why this matters
In real ETL jobs, you must explain what ran, how many rows moved, how long it took, and why it failed. Logging and audit tables give you answers fast, support compliance, enable reliable re-runs, and help teams meet SLAs.
- Troubleshoot failures without guesswork.
- Prove data movement with row counts and checksums.
- Detect duplicates or missing rows and reconcile quickly.
- Enable safe reprocess and idempotent loads.
Concept explained simply
Think of two layers:
- Run logging: records each job and step execution (start/end, status, timings, errors).
- Audit tables: record what changed in the data (row counts, checksums, inserts/updates/deletes).
Mental model
Imagine a flight recorder. Your pipeline writes a small, consistent set of events that reconstructs what happened. Each run has a single correlation key (run_id) that ties every step and audit fact together.
Core entities to implement
- job_runs(job_name, run_id, started_at, ended_at, status, env, triggered_by, error_message)
- step_runs(run_id, step_name, started_at, ended_at, status, rows_read, rows_written, error_message)
- audit_facts(run_id, step_name, source, target, metric_type, metric_value, checksum, snapshot_at)
At minimum, capture: run_id, job_name/step_name, status, started_at, ended_at (duration can be derived from the two timestamps), row counts or metric_value, and an error_message field for failures.
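For example, duration does not need its own column; it can be computed on read (a sketch against the job_runs table defined in Example 1 below):
-- Derive duration from the start/end timestamps instead of storing it (sketch)
SELECT run_id, job_name, status, ended_at - started_at AS duration
FROM job_runs
ORDER BY started_at DESC;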
Design patterns that work
- Assign a run_id at job start. Insert a job_runs row with status="running" immediately.
- Each step inserts a step_runs row at start, updates at end.
- Commit target data before marking success. Only write success when the data change is durable.
- Store all timestamps in UTC so timelines line up across systems and steps.
- Keep audit_facts insert-only; upsert job_runs keyed by run_id and step_runs keyed by (run_id, step_name) so retries update the existing row (see the upsert sketch after this list).
- Keep PII out of logs; store only IDs, counts, and hashed checksums when needed.
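A minimal upsert sketch for starting (or restarting) a step, assuming the step_runs table defined in Example 1 below and PostgreSQL's ON CONFLICT syntax (other engines use MERGE or an equivalent):
-- Start a step: insert once, update in place on retry (PostgreSQL syntax)
INSERT INTO step_runs (run_id, step_name, status, started_at)
VALUES ('2024-11-01T00:00Z_acme_sales_daily', 'extract_sales', 'running', NOW())
ON CONFLICT (run_id, step_name)
DO UPDATE SET status='running', started_at=NOW(), ended_at=NULL,
              rows_read=NULL, rows_written=NULL, error_message=NULL;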
What is a checksum?
A lightweight fingerprint of a batch (e.g., an MD5/SHA hash of concatenated key columns, or a rolling sum over a numeric column) that helps detect mismatches across systems. It is not a security feature here, just a quick integrity signal.
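Two common flavors, sketched in PostgreSQL syntax with illustrative table and column names:
-- Lightweight integrity signals over a batch (sketch)
SELECT
  COUNT(*) AS row_count,                                                        -- basic control total
  SUM(amount) AS amount_total,                                                  -- rolling sum over a measure
  SUM(('x' || substr(md5(sale_id::text), 1, 8))::bit(32)::int) AS key_hash_sum  -- order-independent hash sum
FROM dw.stg_sales;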
Worked examples
Example 1 — Minimal logging schema (SQL)
-- Job and step run logging (example in SQL)
CREATE TABLE job_runs (
run_id VARCHAR(64) PRIMARY KEY,
job_name VARCHAR(200) NOT NULL,
env VARCHAR(50) NOT NULL,
triggered_by VARCHAR(100),
status VARCHAR(20) NOT NULL, -- running|success|failed|skipped
started_at TIMESTAMP NOT NULL,
ended_at TIMESTAMP,
error_message TEXT
);
CREATE TABLE step_runs (
run_id VARCHAR(64) NOT NULL,
step_name VARCHAR(200) NOT NULL,
status VARCHAR(20) NOT NULL,
started_at TIMESTAMP NOT NULL,
ended_at TIMESTAMP,
rows_read BIGINT,
rows_written BIGINT,
error_message TEXT,
PRIMARY KEY (run_id, step_name)
);
-- Start of job
INSERT INTO job_runs(run_id, job_name, env, triggered_by, status, started_at)
VALUES ('2024-11-01T00:00Z_acme_sales_daily', 'acme_sales_daily', 'prod', 'scheduler', 'running', NOW());
-- Start a step
INSERT INTO step_runs(run_id, step_name, status, started_at)
VALUES ('2024-11-01T00:00Z_acme_sales_daily', 'extract_sales', 'running', NOW());
-- End a step
UPDATE step_runs SET status='success', ended_at=NOW(), rows_read=125432, rows_written=125432
WHERE run_id='2024-11-01T00:00Z_acme_sales_daily' AND step_name='extract_sales';
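-- If the step fails instead, record the error and fail the job
-- (a sketch; the error text below is illustrative)
UPDATE step_runs SET status='failed', ended_at=NOW(), error_message='connection to source timed out'
WHERE run_id='2024-11-01T00:00Z_acme_sales_daily' AND step_name='extract_sales';
UPDATE job_runs SET status='failed', ended_at=NOW(), error_message='step extract_sales failed'
WHERE run_id='2024-11-01T00:00Z_acme_sales_daily';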
-- End of job
UPDATE job_runs SET status='success', ended_at=NOW()
WHERE run_id='2024-11-01T00:00Z_acme_sales_daily';
Example 2 — Audit facts with reconciliation
CREATE TABLE audit_facts (
run_id VARCHAR(64) NOT NULL,
step_name VARCHAR(200) NOT NULL,
source VARCHAR(200) NOT NULL,
target VARCHAR(200) NOT NULL,
metric_type VARCHAR(50) NOT NULL, -- source_count|target_count|source_checksum|target_checksum
metric_value BIGINT,
checksum VARCHAR(64),
snapshot_at TIMESTAMP NOT NULL,
PRIMARY KEY (run_id, step_name, metric_type)
);
-- Record counts from source and target
INSERT INTO audit_facts VALUES
('2024-11-01T00:00Z_acme_sales_daily','load_sales','dw.stg_sales','dw.fct_sales','source_count',125432,NULL,NOW());
INSERT INTO audit_facts VALUES
('2024-11-01T00:00Z_acme_sales_daily','load_sales','dw.stg_sales','dw.fct_sales','target_count',125432,NULL,NOW());
-- Quick reconciliation
SELECT s.run_id,
s.metric_value AS source_cnt,
t.metric_value AS target_cnt,
(t.metric_value - s.metric_value) AS diff
FROM audit_facts s
JOIN audit_facts t
ON s.run_id=t.run_id AND s.step_name=t.step_name
WHERE s.metric_type='source_count' AND t.metric_type='target_count';
If diff != 0, mark the step as failed and block downstream steps.
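One way to act on a mismatch, sketched against the tables above (actually blocking downstream steps is left to your orchestrator):
-- Fail any step whose source and target counts disagree (sketch)
UPDATE step_runs
SET status='failed', ended_at=NOW(), error_message='row count mismatch between source and target'
WHERE (run_id, step_name) IN (
  SELECT s.run_id, s.step_name
  FROM audit_facts s
  JOIN audit_facts t ON s.run_id=t.run_id AND s.step_name=t.step_name
  WHERE s.metric_type='source_count' AND t.metric_type='target_count'
    AND s.metric_value <> t.metric_value
);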
Example 3 — Hash-based change audit
-- Create a rolling checksum per batch to detect subtle mismatches
INSERT INTO audit_facts(run_id, step_name, source, target, metric_type, checksum, snapshot_at)
SELECT '2024-11-01T00:00Z_acme_sales_daily', 'merge_sales', 'dw.stg_sales', 'dw.fct_sales', 'source_checksum',
md5(string_agg(sale_id || '|' || amount::text || '|' || coalesce(updated_at::text,''), '|' ORDER BY sale_id)),
NOW()
FROM dw.stg_sales;
-- After the merge, record a checksum computed the same way over the target
INSERT INTO audit_facts(run_id, step_name, source, target, metric_type, checksum, snapshot_at)
SELECT '2024-11-01T00:00Z_acme_sales_daily', 'merge_sales', 'dw.stg_sales', 'dw.fct_sales', 'target_checksum',
       md5(string_agg(sale_id || '|' || amount::text || '|' || coalesce(updated_at::text,''), '|' ORDER BY sale_id)), NOW()
FROM dw.fct_sales;
-- Reconcile
SELECT a1.run_id, a1.checksum AS stg_hash, a2.checksum AS tgt_hash,
CASE WHEN a1.checksum=a2.checksum THEN 'ok' ELSE 'mismatch' END AS result
FROM audit_facts a1
JOIN audit_facts a2 ON a1.run_id=a2.run_id AND a1.step_name=a2.step_name
WHERE a1.metric_type='source_checksum' AND a2.metric_type='target_checksum';
Use hashes for detection only; investigate mismatches with targeted queries, such as the anti-join sketched below.
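For example, an anti-join can list staged rows that never reached the target (a sketch; sale_id is assumed to be the natural key):
-- Staged rows missing from the target (sketch)
SELECT s.sale_id
FROM dw.stg_sales s
LEFT JOIN dw.fct_sales f ON f.sale_id = s.sale_id
WHERE f.sale_id IS NULL;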
Who this is for
- ETL Developers who need reliable, debuggable pipelines.
- Data Engineers adding observability to batch or streaming jobs.
- Analytics Engineers who must reconcile ELT transformations.
Prerequisites
- Comfortable with SQL DDL/DML.
- Basic understanding of ETL/ELT steps.
- Access to a database or data warehouse to create tables.
Learning path
- Model the logging schema: job_runs, step_runs, audit_facts.
- Instrument one pipeline step-by-step (start, metrics, end).
- Add reconciliation checks and failure handling.
Implementation notes
- Reruns: keep the same run_id for retried attempts of the same scheduled run and add attempt_no if needed, or generate a new run_id and link it back via parent_run_id (see the sketch after this list).
- Indexes: (job_name, started_at), (status), and (run_id, step_name) for fast lookup.
- Retention: keep detailed step logs for 30–90 days; aggregate older data to daily summaries.
- Security: do not log PII or secrets. Prefer IDs and masked values.
- Performance: batch inserts and minimal columns; avoid logging per-row events unless truly necessary.
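A sketch of the rerun, index, and retention notes above in PostgreSQL syntax; attempt_no and parent_run_id are additions to the earlier DDL, not part of it:
-- Rerun tracking: new run_id per attempt, linked back to the original run
ALTER TABLE job_runs ADD COLUMN attempt_no INT NOT NULL DEFAULT 1;
ALTER TABLE job_runs ADD COLUMN parent_run_id VARCHAR(64) REFERENCES job_runs(run_id);
-- (Reusing the same run_id across attempts would instead require widening the
--  primary keys to (run_id, attempt_no) and (run_id, step_name, attempt_no).)
-- Indexes for common lookups (step_runs already has (run_id, step_name) as its primary key)
CREATE INDEX idx_job_runs_name_started ON job_runs (job_name, started_at);
CREATE INDEX idx_job_runs_status ON job_runs (status);
-- Simple retention: purge step-level detail older than 90 days
DELETE FROM step_runs WHERE started_at < NOW() - INTERVAL '90 days';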
Exercises
- Exercise ex1: Design and create minimal job_runs and step_runs tables, then simulate a successful 2-step run using inserts and updates.
- Exercise ex2: Build an audit_facts table and write a reconciliation query that fails the run when counts mismatch.
- Checklist: Did you capture run_id, start/end times, status, row counts, and error messages?
- Checklist: Are all timestamps in UTC?
- Checklist: Does success only write after target commits?
Common mistakes
- No correlation key: forgetting run_id makes logs impossible to join. Always generate and pass run_id to every step.
- Logging too late: if you only log at the end, you lose visibility during long steps. Log at start and end.
- Storing PII: logs should not contain raw customer data. Use IDs and checksums.
- Inconsistent timezones: mix of local times leads to confusing timelines. Use UTC everywhere.
- No indexes or retention: logs become slow and expensive. Add indexes and purge/archive on schedule.
- Success before commit: marking success prior to final commit can mislead. Commit first, then log success.
How to self-check
- Can you find all failed runs for a job in the last 7 days in under 1 second? (A sketch query follows this list.)
- Can you show row-count parity between source and target for any run?
- Can you explain the last failure in under 3 minutes using logs alone?
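For the first check, a query along these lines works, helped by the (job_name, started_at) and (status) indexes suggested earlier (the job name is illustrative):
-- Failed runs for one job in the last 7 days (sketch)
SELECT run_id, started_at, ended_at, error_message
FROM job_runs
WHERE job_name = 'acme_sales_daily'
  AND status = 'failed'
  AND started_at >= NOW() - INTERVAL '7 days'
ORDER BY started_at DESC;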
Practical projects
- Project A: Instrument one existing pipeline with job_runs and step_runs. Add indexes and retention policy.
- Project B: Add audit_facts with source_count/target_count and a checksum metric for a critical table.
- Project C: Create a simple dashboard (any BI tool) showing success rate, average duration by step, and failure reasons.
Mini challenge
You see: source_count=250,000; target_count=249,997; checksum mismatch. step_runs shows rows_read=250,000 and rows_written=249,997. What's the fastest way to diagnose?
One possible approach
- Query reject/dead-letter table or step error messages.
- Check constraints on the target (PK/unique). Look for the 3 lost rows as duplicates on the natural key in the source via GROUP BY ... HAVING COUNT(*) > 1 (sketched below).
- Verify merge conditions; ensure updates vs inserts are correct.
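A sketch of that duplicate check, assuming sale_id is the natural key:
-- Natural keys appearing more than once in the source (sketch)
SELECT sale_id, COUNT(*) AS cnt
FROM dw.stg_sales
GROUP BY sale_id
HAVING COUNT(*) > 1;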
Next steps
- Add alerts on failed runs or count mismatches.
- Automate reruns with attempt_no and idempotent load logic.
- Extend audit to include DQ checks (null-rate, referential integrity).