Why this matters
In real ETL jobs, you must explain what ran, how many rows moved, how long it took, and why it failed. Logging and audit tables give you answers fast, support compliance, enable reliable re-runs, and help teams meet SLAs.
- Troubleshoot failures without guesswork.
- Prove data movement with row counts and checksums.
- Detect duplicates or missing rows and reconcile quickly.
- Enable safe reprocess and idempotent loads.
Concept explained simply
Think of two layers:
- Run logging: records each job and step execution (start/end, status, timings, errors).
- Audit tables: record what changed in the data (row counts, checksums, inserts/updates/deletes).
Mental model
Imagine a flight recorder. Your pipeline writes a small, consistent set of events that reconstructs what happened. Each run has a single correlation key (run_id) that ties every step and audit fact together.
Core entities to implement
- job_runs(job_name, run_id, started_at, ended_at, status, env, triggered_by, error_message)
- step_runs(run_id, step_name, started_at, ended_at, status, rows_read, rows_written, error_message)
- audit_facts(run_id, step_name, source, target, metric_type, metric_value, checksum, snapshot_at)
At minimum, capture: run_id, job_name/step_name, status, started_at, ended_at (duration can be derived from the two timestamps), row counts or metric_value, and an error_message field for failures.
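For example, duration does not need its own column; it can be computed on read (a sketch against the job_runs table defined in Example 1 below):
-- Derive duration from the start/end timestamps instead of storing it (sketch)
SELECT run_id, job_name, status, ended_at - started_at AS duration
FROM job_runs
ORDER BY started_at DESC;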
Design patterns that work
- Assign a run_id at job start. Insert a job_runs row with status="running" immediately.
- Each step inserts a step_runs row at start, updates at end.
- Commit target data before marking success. Only write success when the data change is durable.
- Store all timestamps in UTC so timelines line up across systems and steps.
- Keep audit_facts insert-only; upsert job_runs keyed by run_id and step_runs keyed by (run_id, step_name) so retries update the existing row (see the upsert sketch after this list).
- Keep PII out of logs; store only IDs, counts, and hashed checksums when needed.
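A minimal upsert sketch for starting (or restarting) a step, assuming the step_runs table defined in Example 1 below and PostgreSQL's ON CONFLICT syntax (other engines use MERGE or an equivalent):
-- Start a step: insert once, update in place on retry (PostgreSQL syntax)
INSERT INTO step_runs (run_id, step_name, status, started_at)
VALUES ('2024-11-01T00:00Z_acme_sales_daily', 'extract_sales', 'running', NOW())
ON CONFLICT (run_id, step_name)
DO UPDATE SET status='running', started_at=NOW(), ended_at=NULL,
              rows_read=NULL, rows_written=NULL, error_message=NULL;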
What is a checksum?
A lightweight fingerprint of a batch (e.g., an MD5/SHA hash of concatenated key columns, or a rolling sum over a numeric column) that helps detect mismatches across systems. It is not a security feature here, just a quick integrity signal.
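Two common flavors, sketched in PostgreSQL syntax with illustrative table and column names:
-- Lightweight integrity signals over a batch (sketch)
SELECT
  COUNT(*) AS row_count,                                                        -- basic control total
  SUM(amount) AS amount_total,                                                  -- rolling sum over a measure
  SUM(('x' || substr(md5(sale_id::text), 1, 8))::bit(32)::int) AS key_hash_sum  -- order-independent hash sum
FROM dw.stg_sales;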
Worked examples
Example 1 — Minimal logging schema (SQL)
-- Job and step run logging (example in SQL)
CREATE TABLE job_runs (
run_id VARCHAR(64) PRIMARY KEY,
job_name VARCHAR(200) NOT NULL,
env VARCHAR(50) NOT NULL,
triggered_by VARCHAR(100),
status VARCHAR(20) NOT NULL, -- running|success|failed|skipped
started_at TIMESTAMP NOT NULL,
ended_at TIMESTAMP,
error_message TEXT
);
CREATE TABLE step_runs (
run_id VARCHAR(64) NOT NULL,
step_name VARCHAR(200) NOT NULL,
status VARCHAR(20) NOT NULL,
started_at TIMESTAMP NOT NULL,
ended_at TIMESTAMP,
rows_read BIGINT,
rows_written BIGINT,
error_message TEXT,
PRIMARY KEY (run_id, step_name)
);
-- Start of job
INSERT INTO job_runs(run_id, job_name, env, triggered_by, status, started_at)
VALUES ('2024-11-01T00:00Z_acme_sales_daily', 'acme_sales_daily', 'prod', 'scheduler', 'running', NOW());
-- Start a step
INSERT INTO step_runs(run_id, step_name, status, started_at)
VALUES ('2024-11-01T00:00Z_acme_sales_daily', 'extract_sales', 'running', NOW());
-- End a step
UPDATE step_runs SET status='success', ended_at=NOW(), rows_read=125432, rows_written=125432
WHERE run_id='2024-11-01T00:00Z_acme_sales_daily' AND step_name='extract_sales';
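-- If the step fails instead, record the error and fail the job
-- (a sketch; the error text below is illustrative)
UPDATE step_runs SET status='failed', ended_at=NOW(), error_message='connection to source timed out'
WHERE run_id='2024-11-01T00:00Z_acme_sales_daily' AND step_name='extract_sales';
UPDATE job_runs SET status='failed', ended_at=NOW(), error_message='step extract_sales failed'
WHERE run_id='2024-11-01T00:00Z_acme_sales_daily';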
-- End of job
UPDATE job_runs SET status='success', ended_at=NOW()
WHERE run_id='2024-11-01T00:00Z_acme_sales_daily';
Example 2 — Audit facts with reconciliation
CREATE TABLE audit_facts (
run_id VARCHAR(64) NOT NULL,
step_name VARCHAR(200) NOT NULL,
source VARCHAR(200) NOT NULL,
target VARCHAR(200) NOT NULL,
metric_type VARCHAR(50) NOT NULL, -- source_count|target_count|source_checksum|target_checksum
metric_value BIGINT,
checksum VARCHAR(64),
snapshot_at TIMESTAMP NOT NULL,
PRIMARY KEY (run_id, step_name, metric_type)
);
-- Record counts from source and target
INSERT INTO audit_facts VALUES
('2024-11-01T00:00Z_acme_sales_daily','load_sales','dw.stg_sales','dw.fct_sales','source_count',125432,NULL,NOW());
INSERT INTO audit_facts VALUES
('2024-11-01T00:00Z_acme_sales_daily','load_sales','dw.stg_sales','dw.fct_sales','target_count',125432,NULL,NOW());
-- Quick reconciliation
SELECT s.run_id,
s.metric_value AS source_cnt,
t.metric_value AS target_cnt,
(t.metric_value - s.metric_value) AS diff
FROM audit_facts s
JOIN audit_facts t
ON s.run_id=t.run_id AND s.step_name=t.step_name
WHERE s.metric_type='source_count' AND t.metric_type='target_count';
If diff != 0, mark the step as failed and block downstream steps.
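One way to act on a mismatch, sketched against the tables above (actually blocking downstream steps is left to your orchestrator):
-- Fail any step whose source and target counts disagree (sketch)
UPDATE step_runs
SET status='failed', ended_at=NOW(), error_message='row count mismatch between source and target'
WHERE (run_id, step_name) IN (
  SELECT s.run_id, s.step_name
  FROM audit_facts s
  JOIN audit_facts t ON s.run_id=t.run_id AND s.step_name=t.step_name
  WHERE s.metric_type='source_count' AND t.metric_type='target_count'
    AND s.metric_value <> t.metric_value
);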
Example 3 — Hash-based change audit
-- Create a rolling checksum per batch to detect subtle mismatches
INSERT INTO audit_facts(run_id, step_name, source, target, metric_type, checksum, snapshot_at)
SELECT '2024-11-01T00:00Z_acme_sales_daily', 'merge_sales', 'dw.stg_sales', 'dw.fct_sales', 'source_checksum',
md5(string_agg(sale_id || '|' || amount::text || '|' || coalesce(updated_at::text,''), '|' ORDER BY sale_id)),
NOW()
FROM dw.stg_sales;
-- After the merge, record a checksum computed the same way over the target
INSERT INTO audit_facts(run_id, step_name, source, target, metric_type, checksum, snapshot_at)
SELECT '2024-11-01T00:00Z_acme_sales_daily', 'merge_sales', 'dw.stg_sales', 'dw.fct_sales', 'target_checksum',
       md5(string_agg(sale_id || '|' || amount::text || '|' || coalesce(updated_at::text,''), '|' ORDER BY sale_id)), NOW()
FROM dw.fct_sales;
-- Reconcile
SELECT a1.run_id, a1.checksum AS stg_hash, a2.checksum AS tgt_hash,
CASE WHEN a1.checksum=a2.checksum THEN 'ok' ELSE 'mismatch' END AS result
FROM audit_facts a1
JOIN audit_facts a2 ON a1.run_id=a2.run_id AND a1.step_name=a2.step_name
WHERE a1.metric_type='source_checksum' AND a2.metric_type='target_checksum';
Use hashes for detection only; investigate mismatches with targeted queries, such as the anti-join sketched below.
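For example, an anti-join can list staged rows that never reached the target (a sketch; sale_id is assumed to be the natural key):
-- Staged rows missing from the target (sketch)
SELECT s.sale_id
FROM dw.stg_sales s
LEFT JOIN dw.fct_sales f ON f.sale_id = s.sale_id
WHERE f.sale_id IS NULL;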
Who this is for
- ETL Developers who need reliable, debuggable pipelines.
- Data Engineers adding observability to batch or streaming jobs.
- Analytics Engineers who must reconcile ELT transformations.
Prerequisites
- Comfortable with SQL DDL/DML.
- Basic understanding of ETL/ELT steps.
- Access to a database or data warehouse to create tables.
Learning path
- Model the logging schema: job_runs, step_runs, audit_facts.
- Instrument one pipeline step-by-step (start, metrics, end).
- Add reconciliation checks and failure handling.
Implementation notes
- Reruns: keep the same run_id for retried attempts of the same scheduled run and add attempt_no if needed, or generate a new run_id and link it back via parent_run_id (see the sketch after this list).
- Indexes: (job_name, started_at), (status), and (run_id, step_name) for fast lookup.
- Retention: keep detailed step logs for 30–90 days; aggregate older data to daily summaries.
- Security: do not log PII or secrets. Prefer IDs and masked values.
- Performance: batch inserts and minimal columns; avoid logging per-row events unless truly necessary.
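A sketch of the rerun, index, and retention notes above in PostgreSQL syntax; attempt_no and parent_run_id are additions to the earlier DDL, not part of it:
-- Rerun tracking: new run_id per attempt, linked back to the original run
ALTER TABLE job_runs ADD COLUMN attempt_no INT NOT NULL DEFAULT 1;
ALTER TABLE job_runs ADD COLUMN parent_run_id VARCHAR(64) REFERENCES job_runs(run_id);
-- (Reusing the same run_id across attempts would instead require widening the
--  primary keys to (run_id, attempt_no) and (run_id, step_name, attempt_no).)
-- Indexes for common lookups (step_runs already has (run_id, step_name) as its primary key)
CREATE INDEX idx_job_runs_name_started ON job_runs (job_name, started_at);
CREATE INDEX idx_job_runs_status ON job_runs (status);
-- Simple retention: purge step-level detail older than 90 days
DELETE FROM step_runs WHERE started_at < NOW() - INTERVAL '90 days';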
Exercises
- Exercise ex1: Design and create minimal job_runs and step_runs tables, then simulate a successful 2-step run using inserts and updates.
- Exercise ex2: Build an audit_facts table and write a reconciliation query that fails the run when counts mismatch.
- Checklist: Did you capture run_id, start/end times, status, row counts, and error messages?
- Checklist: Are all timestamps in UTC?
- Checklist: Does success only write after target commits?
Common mistakes
- No correlation key: forgetting run_id makes logs impossible to join. Always generate and pass run_id to every step.
- Logging too late: if you only log at the end, you lose visibility during long steps. Log at start and end.
- Storing PII: logs should not contain raw customer data. Use IDs and checksums.
- Inconsistent timezones: mix of local times leads to confusing timelines. Use UTC everywhere.
- No indexes or retention: logs become slow and expensive. Add indexes and purge/archive on schedule.
- Success before commit: marking success prior to final commit can mislead. Commit first, then log success.
How to self-check
- Can you find all failed runs for a job in the last 7 days in under 1 second? (A sketch query follows this list.)
- Can you show row-count parity between source and target for any run?
- Can you explain the last failure in under 3 minutes using logs alone?
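For the first check, a query along these lines works, helped by the (job_name, started_at) and (status) indexes suggested earlier (the job name is illustrative):
-- Failed runs for one job in the last 7 days (sketch)
SELECT run_id, started_at, ended_at, error_message
FROM job_runs
WHERE job_name = 'acme_sales_daily'
  AND status = 'failed'
  AND started_at >= NOW() - INTERVAL '7 days'
ORDER BY started_at DESC;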
Practical projects
- Project A: Instrument one existing pipeline with job_runs and step_runs. Add indexes and retention policy.
- Project B: Add audit_facts with source_count/target_count and a checksum metric for a critical table.
- Project C: Create a simple dashboard (any BI tool) showing success rate, average duration by step, and failure reasons.
Mini challenge
You see: source_count=250,000; target_count=249,997; checksum mismatch. step_runs shows rows_read=250,000 and rows_written=249,997. What's the fastest way to diagnose?
One possible approach
- Query reject/dead-letter table or step error messages.
- Check constraints on the target (PK/unique). Look for the 3 lost rows as duplicates on the natural key in the source via GROUP BY ... HAVING COUNT(*) > 1 (sketched below).
- Verify merge conditions; ensure updates vs inserts are correct.
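A sketch of that duplicate check, assuming sale_id is the natural key:
-- Natural keys appearing more than once in the source (sketch)
SELECT sale_id, COUNT(*) AS cnt
FROM dw.stg_sales
GROUP BY sale_id
HAVING COUNT(*) > 1;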
Next steps
- Add alerts on failed runs or count mismatches.
- Automate reruns with attempt_no and idempotent load logic.
- Extend audit to include DQ checks (null-rate, referential integrity).