What you’ll learn
Build data cleaning pipelines that are reproducible, traceable, and easy to debug. You will design simple run logs, structure your steps for determinism, and ship changes confidently.
Who this is for
- Aspiring and junior Data Analysts who want reliable, auditable cleaning processes
- Analysts who ship recurring reports and need consistent outputs
- Anyone moving from ad‑hoc notebooks to maintainable pipelines
Prerequisites
- Basic data cleaning skills (e.g., removing duplicates, fixing types)
- Familiarity with either spreadsheets, SQL, or Python/pandas
- Comfort with files and folders; basic version control is a plus
Why this matters
- Trust: Stakeholders ask “Where did this number come from?” Logging provides answers.
- Speed: Reproducibility turns a week of detective work into minutes of re-runs.
- Compliance: Many industries require data lineage and audit trails.
- Teamwork: Others can run and understand your pipeline without you.
Real tasks you’ll face:
- Explaining why last month’s report changed after a bug fix
- Re-running a backfill for the last 90 days consistently
- Handing over a project so another analyst can run it identically
Concept explained simply
Logging is writing down what happened: when a run started/ended, what data it used, counts, warnings, and outcomes. Reproducibility means running the same steps with the same inputs yields the same outputs. Together, they give you evidence and consistency.
Mental model
Imagine a bakery assembly line. Reproducibility is following the same recipe and oven settings each time. Logging is the clipboard that records batch time, temperature, and how many loaves you produced. If something goes wrong, you can trace it.
Core components of a reproducible cleaning pipeline
- Deterministic transforms: Same inputs produce the same outputs; avoid hidden randomness.
- Configurations: Externalize parameters (dates, paths, thresholds) in a config file or table.
- Version control: Track your pipeline code and config changes.
- Environment lock: Pin package versions or document tool versions.
- Logging: Record run metadata, row counts, warnings/errors, and output locations.
- Idempotency: Safe to re-run without duplicating results.
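Most of these ideas show up concretely in the worked examples below. Idempotency is worth a quick sketch on its own: the usual trick is to overwrite a keyed output rather than append to it. A minimal pandas sketch (paths and naming are illustrative, not part of the examples that follow):
import pandas as pd
from pathlib import Path

def write_clean(df: pd.DataFrame, run_date: str) -> Path:
    # One output file per run date: re-running the same date overwrites the
    # same file instead of appending duplicate rows.
    out = Path("data/clean") / f"customers_clean_{run_date}.csv"
    out.parent.mkdir(parents=True, exist_ok=True)
    df.to_csv(out, index=False)
    return out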
Worked examples
Example 1 — Python/pandas with a simple run log
Config (YAML):
run_date: "2025-01-15"
input_path: "data/raw/customers.csv"
output_path: "data/clean/customers_clean.csv"
drop_columns: ["misc_notes"]
valid_countries: ["US", "CA", "GB"]
Pipeline (Python):
import pandas as pd
import logging, time, json, hashlib
from pathlib import Path
# Basic file logger (create the logs/ folder first so the handler can open the file)
Path('logs').mkdir(exist_ok=True)
logging.basicConfig(filename='logs/run.log', level=logging.INFO,
                    format='%(asctime)s | %(levelname)s | %(message)s')
cfg = {
    "run_date": "2025-01-15",
    "input_path": "data/raw/customers.csv",
    "output_path": "data/clean/customers_clean.csv",
    "drop_columns": ["misc_notes"],
    "valid_countries": ["US", "CA", "GB"]
}
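# Note: the dict above duplicates the YAML config shown earlier. In a real
# pipeline you would load it from the file so code and config stay in sync,
# e.g. with PyYAML (an assumption; any config reader works):
#   import yaml
#   cfg = yaml.safe_load(Path("config.yml").read_text())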
start = time.time()
run_id = hashlib.md5(json.dumps(cfg, sort_keys=True).encode()).hexdigest()[:8]
logging.info(f"run_id={run_id} | start | cfg={json.dumps(cfg)}")
raw = pd.read_csv(cfg["input_path"]) # deterministic source
n_raw = len(raw)
# Deterministic transforms
if set(cfg["drop_columns"]).issubset(raw.columns):
    raw = raw.drop(columns=cfg["drop_columns"])
raw['email'] = raw['email'].str.strip().str.lower()        # normalize emails
raw = raw[raw['country'].isin(cfg['valid_countries'])]      # keep valid countries only
# Output
Path(cfg["output_path"]).parent.mkdir(parents=True, exist_ok=True)
raw.to_csv(cfg["output_path"], index=False)
# Log metrics
n_out = len(raw)
elapsed = round(time.time() - start, 2)
logging.info(f"run_id={run_id} | end | rows_in={n_raw} | rows_out={n_out} | seconds={elapsed}")
Sample log lines:
2025-01-15 10:00:00,412 | INFO | run_id=4fa2b9c1 | start | cfg={...}
2025-01-15 10:00:01,301 | INFO | run_id=4fa2b9c1 | end | rows_in=12034 | rows_out=11876 | seconds=0.89
Example 2 — SQL cleaning with an audit table
Create a cleaned table and append to a simple run log.
-- Clean table (idempotent: recreate or write to a dated partition)
DROP TABLE IF EXISTS clean_sales;
CREATE TABLE clean_sales AS
SELECT
  TRIM(LOWER(email)) AS email,
  CAST(order_value AS DECIMAL(12,2)) AS order_value,
  order_date
FROM raw_sales
WHERE email IS NOT NULL
  AND order_value >= 0;
-- Minimal run audit
CREATE TABLE IF NOT EXISTS run_log (
  run_timestamp TIMESTAMP,
  pipeline VARCHAR(100),
  row_count BIGINT,
  notes VARCHAR(255)
);
INSERT INTO run_log (run_timestamp, pipeline, row_count, notes)
SELECT CURRENT_TIMESTAMP, 'clean_sales', COUNT(*), 'daily refresh'
FROM clean_sales;
Tip: If you can’t overwrite, write to a dated table like clean_sales_2025_01_15 and use a view clean_sales pointing to the latest.
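A sketch of that pattern (view syntax varies by database; CREATE OR REPLACE VIEW works in Postgres, BigQuery, and Snowflake, among others):
CREATE TABLE clean_sales_2025_01_15 AS
SELECT TRIM(LOWER(email)) AS email,
       CAST(order_value AS DECIMAL(12,2)) AS order_value,
       order_date
FROM raw_sales
WHERE email IS NOT NULL AND order_value >= 0;

CREATE OR REPLACE VIEW clean_sales AS
SELECT * FROM clean_sales_2025_01_15;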
Example 3 — Spreadsheet/No‑code approach
- Create a Raw sheet (locked or read-only) and a Clean sheet referencing Raw with formulas.
- Record a Changelog sheet with columns: Date, Who, Change, Reason.
- Use formula-based cleaning to stay reproducible (e.g., TRIM, LOWER, IFERROR).
Example formulas:
Clean!A2 = LOWER(TRIM(Raw!A2)) // standardized email
Clean!B2 = IF(Raw!B2<0, NA(), Raw!B2) // guard invalid values
Manual logging: Add a new row to Changelog whenever you adjust formulas or validation rules. Keep a copy of the sheet version (File > Version history) and note it in the log.
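Example Changelog row:
2025-01-16 | (your name) | Wrapped Clean!B2 in IFERROR | Raw export contained text in the value column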
How to name, version, and structure pipelines
- Name consistently: use a prefix and step order, e.g., 01_extract.py, 02_clean.py, 03_export.py.
- Separate config: config.yml or a parameters table controls dates, paths, thresholds.
- Folders: data/raw, data/intermediate, data/clean, logs/, configs/.
- Version: commit code and config together with clear messages. Tag releases for major changes.
- Environment: pin package versions (e.g., requirements.txt) or record tool versions in a README.
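Put together, a project following these conventions might look like this (names are illustrative):
project/
  configs/
    config.yml
  data/
    raw/
    intermediate/
    clean/
  logs/
  01_extract.py
  02_clean.py
  03_export.py
  requirements.txt    # pinned versions, e.g. pandas==2.2.1
  README.md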
Minimal logging you should implement
- Timestamp of run start and end
- Run identifier (hash of config or a date+time stamp)
- Input source and output destination paths/tables
- Row counts in/out; number of dropped/filtered records
- Warnings/errors (e.g., unexpected nulls, schema drift)
- Version info (code/config version or commit hash)
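One way to capture all of these in a single machine-readable log line (a sketch; the field names and git lookup are suggestions, not requirements):
import json, logging, subprocess
from datetime import datetime, timezone

def git_commit() -> str:
    # Best effort: short commit hash, or "unknown" outside a git repo
    try:
        out = subprocess.run(["git", "rev-parse", "--short", "HEAD"],
                             capture_output=True, text=True, check=True)
        return out.stdout.strip()
    except Exception:
        return "unknown"

def log_run(run_id, cfg, rows_in, rows_out, warnings):
    logging.info(json.dumps({
        "finished_at": datetime.now(timezone.utc).isoformat(),
        "run_id": run_id,
        "input": cfg["input_path"],
        "output": cfg["output_path"],
        "rows_in": rows_in,
        "rows_out": rows_out,
        "dropped": rows_in - rows_out,
        "warnings": warnings,
        "code_version": git_commit(),
    }))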
Exercises
Complete these tasks to solidify the concepts before taking the quick test at the end.
Exercise 1 — Design a simple run log schema
Imagine a daily pipeline that cleans customers.csv and outputs customers_clean.csv. Define the columns your run log should have, and draft 2 example rows (for two days of runs) capturing useful metrics.
Exercise 2 — Convert an ad‑hoc notebook into a reproducible pipeline
Take a cleaning process you’ve done before. Break it into deterministic steps, externalize parameters to a config, add logging, and write a clear folder structure. Produce a sample log output.
Checklist (covers both exercises):
- I wrote a run log schema with at least: timestamp, run_id, input, output, rows_in, rows_out, warnings
- I created or described a config file with parameters
- My steps are numbered and deterministic
- I can re-run and get identical results on the same input
Common mistakes and self‑check
- Hidden randomness: Using sampling without a fixed seed. Self-check: set a seed or remove randomness; verify outputs match on re-run.
- Hard‑coded dates/paths: Burying parameters in code. Self-check: can you change a date in one place?
- No idempotency: Appending duplicates on re-runs. Self-check: re-run twice; counts shouldn’t double (see the sketch after this list).
- Logging only errors: Missing row counts and config snapshots. Self-check: can you answer “how many rows in/out?” for the last run?
- Overwriting raw data: Losing traceability. Self-check: raw files/tables are read-only and preserved.
- Unpinned environments: Different package versions cause drift. Self-check: record tool versions in a README or lock file.
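The re-run self-check is easy to automate. A minimal sketch, where clean_fn stands for whatever function runs your cleaning steps and returns a DataFrame (an assumed interface, not one defined above):
import pandas as pd

def assert_rerun_stable(clean_fn, input_path: str) -> None:
    # Idempotency/determinism check: cleaning the same input twice must
    # produce identical output (same rows, same values, same order).
    first = clean_fn(input_path)
    second = clean_fn(input_path)
    pd.testing.assert_frame_equal(first, second)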
Mini challenge
You receive a CSV of 25,000 transactions with occasional negative values and mixed‑case emails. Build a minimal pipeline that:
- Standardizes emails (lowercase, trimmed)
- Removes negative values
- Writes a log with run_id, rows_in, rows_out, and a count of removed rows
Hint
Hash the config for run_id or use a timestamp. Compute removed_rows as rows_in - rows_out and log it.
Practical projects
- Daily customer hygiene job: Clean and deduplicate customers, log counts and unique emails, export clean file.
- Marketing list validator: Validate domains and opt-in flags, log invalid rows and reasons, produce a validation report.
- Finance transaction standardizer: Normalize amounts and dates across sources, ensure idempotent merges, log source file hashes.
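For the source file hashes in the last project, a small helper is enough (a sketch; reads the whole file into memory, which is fine for modest file sizes):
import hashlib
from pathlib import Path

def file_sha256(path: str) -> str:
    # Hash the raw input file so the log proves exactly which file was processed
    return hashlib.sha256(Path(path).read_bytes()).hexdigest()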
Learning path and next steps
- Now: Implement minimal logging in your current cleaning task.
- Next: Add config files and idempotent outputs. Practice with the exercises here.
- Then: Introduce version control and environment pinning for full reproducibility.
When ready, take the quick test below to check your understanding.