Why this matters
As a Data Scientist, you will clean data, explore patterns, and ship models. Stakeholders need to trust that your results are repeatable. A solid notebook workflow reduces bugs, helps teammates reproduce your work, and makes it easy to move from exploration to production.
- Real tasks: rerunning an analysis after data refresh, handing your notebook to a teammate, or comparing model runs fairly.
- Risks without reproducibility: misleading metrics due to random splits, mysterious errors on another machine, or lost time reconciling versions.
Concept explained simply
Reproducibility means: if you rerun the notebook tomorrow or on another machine, you get the same results and can explain how they were produced.
Mental model
Think of your notebook as a recipe card. Good recipes list ingredients (versions, data paths), set the oven temperature (random seeds, configs), and produce the same dish each time (deterministic steps and saved outputs).
Checklist: A reproducible notebook should...
- Start with a single setup cell: imports, seed, config, and environment info.
- Use relative project paths (no local absolute paths like C:\Users\... or /Users/...).
- Make randomness explicit with a fixed seed.
- Run top-to-bottom without manual tweaking.
- Save key artifacts (figures, CSVs, metrics) with clear names and timestamps.
Minimal reproducible notebook template
Use this as the first cell of every notebook.
import os, sys, json, platform, random
import numpy as np
import pandas as pd
# 1) Configuration (override via environment variables if needed)
SEED = int(os.getenv("SEED", "42"))
DATA_DIR = os.getenv("DATA_DIR", "data")
ARTIFACTS_DIR = os.getenv("ARTIFACTS_DIR", "artifacts")
CONFIG = {
    "sample_frac": 1.0,  # keep 1.0 for full data
    "target": None,
}
# 2) Reproducibility
random.seed(SEED)
np.random.seed(SEED)
# 3) Directories
os.makedirs(DATA_DIR, exist_ok=True)
os.makedirs(ARTIFACTS_DIR, exist_ok=True)
# 4) Environment snapshot (print once for logging)
print(json.dumps({
    "python": sys.version.split()[0],
    "platform": platform.platform(),
    "pandas": pd.__version__,
    "numpy": np.__version__,
    "seed": SEED,
    "config": CONFIG,
    "data_dir": DATA_DIR,
    "artifacts_dir": ARTIFACTS_DIR
}, indent=2))
# 5) Helper utilities (optional)
def describe_df(df, name="df"):
    print(f"{name}: shape={df.shape}; columns={list(df.columns)[:6]}")
Why this template works
- All assumptions live in one place (CONFIG, SEED, paths).
- Seeds force deterministic results for random operations.
- Environment snapshot helps others match your setup.
- Relative directories keep paths portable across machines.
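A quick way to see the seed at work: the sketch below (assuming the setup cell above has already run; rng_check and gen are illustrative names) prints the same numbers on every rerun.
# Sanity check: the same seed yields the same draws on every rerun
rng_check = np.random.RandomState(SEED)
print(rng_check.normal(size=3))
# A local, seeded generator avoids relying on global state and can be passed around explicitly
gen = np.random.default_rng(SEED)
print(gen.integers(0, 10, size=3))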
Worked examples
Example 1: Clean synthetic data deterministically
# Synthetic dataset with missing values (deterministic)
rng = np.random.RandomState(SEED)
N = 100
raw = pd.DataFrame({
    "feature_a": rng.normal(0, 1, N),
    "feature_b": rng.uniform(0, 10, N),
    "label": rng.binomial(1, 0.3, N)
})
# Introduce some missingness deterministically
raw.loc[rng.choice(N, 10, replace=False), "feature_b"] = np.nan
# Clean: impute with median
clean = raw.copy()
median_b = clean["feature_b"].median()
clean["feature_b"].fillna(median_b, inplace=True)
describe_df(clean, "clean")
clean.to_csv(os.path.join(ARTIFACTS_DIR, "clean.csv"), index=False)
print("Saved:", os.path.join(ARTIFACTS_DIR, "clean.csv"))
Re-running yields the same median and the same saved file content.
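If you want to verify that claim, a small sketch (using only the standard library and the clean.csv path saved above) hashes the file so you can compare digests across reruns.
# Optional check: hash the saved CSV; the digest should be identical on every rerun
import hashlib
with open(os.path.join(ARTIFACTS_DIR, "clean.csv"), "rb") as f:
    digest = hashlib.md5(f.read()).hexdigest()
print("clean.csv md5:", digest)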
Example 2: Parameterized EDA summary
# Use CONFIG to control EDA behavior
EDA_COLS = ["feature_a", "feature_b"]
QUANT = float(os.getenv("EDA_QUANT", "0.95"))
summary = {
    col: {
        "mean": float(clean[col].mean()),
        "std": float(clean[col].std()),
        f"q{QUANT}": float(clean[col].quantile(QUANT))  # key follows the configured quantile
    } for col in EDA_COLS
}
with open(os.path.join(ARTIFACTS_DIR, "eda_summary.json"), "w") as f:
    json.dump(summary, f, indent=2)
print("EDA summary saved.")
Change EDA_QUANT to 0.9 and you get a different but equally reproducible run, with the parameter recorded rather than hidden.
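To make that provenance explicit, one option (a sketch, not part of the template; the keys and filename are illustrative) is to store the run parameters next to the results so each artifact records how it was produced.
# Sketch: save parameters alongside results so the artifact documents its own run
summary_with_provenance = {"params": {"quantile": QUANT, "seed": SEED}, "summary": summary}
with open(os.path.join(ARTIFACTS_DIR, f"eda_summary_q{QUANT}.json"), "w") as f:
    json.dump(summary_with_provenance, f, indent=2)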
Example 3: Reproducible train/test split without extra libraries
# Deterministic split using NumPy only
idx = np.arange(len(clean))
rng = np.random.RandomState(SEED)
rng.shuffle(idx)
split = int(0.8 * len(idx))
train_idx, test_idx = idx[:split], idx[split:]
train, test = clean.iloc[train_idx], clean.iloc[test_idx]
print(train.shape, test.shape)
train.to_csv(os.path.join(ARTIFACTS_DIR, "train.csv"), index=False)
test.to_csv(os.path.join(ARTIFACTS_DIR, "test.csv"), index=False)
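An equivalent deterministic split can be written with pandas alone; the sketch below uses .sample with a fixed random_state (it will not pick the same rows as the NumPy shuffle above, but it is just as repeatable).
# Alternative sketch: deterministic 80/20 split using pandas only
train_alt = clean.sample(frac=0.8, random_state=SEED)
test_alt = clean.drop(train_alt.index)
print(train_alt.shape, test_alt.shape)  # stable across reruns with the same SEED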
Optional: export notebook as a script or HTML
From a notebook cell, you can create a shareable artifact:
# Convert the current notebook (replace name.ipynb) to a script or HTML
# !jupyter nbconvert --to script name.ipynb
# !jupyter nbconvert --to html name.ipynb
These exports help reviewers read diffs and reproduce results without running every cell.
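You can also execute the notebook headlessly and strip outputs before committing; the commands below are a sketch using standard nbconvert options (replace the filenames with your own).
# Execute top-to-bottom without opening Jupyter, then clear outputs before committing
# !jupyter nbconvert --to notebook --execute name.ipynb --output name_executed.ipynb
# !jupyter nbconvert --clear-output --inplace name.ipynb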
Practical workflow (step-by-step)
- Create a project skeleton:
project/
├── data/        # raw or external data (read-only when possible)
├── artifacts/   # tables, plots, metrics you produce
├── notebooks/   # your .ipynb files
└── README.md    # what to run, in what order
- Start each notebook with the setup cell shown above.
- Use deterministic operations (e.g., random_state in .sample, fixed seeds); see the sketch after this list.
- Save outputs with informative names, e.g., artifacts/train_2026-01-01.csv (also illustrated in the sketch below).
- Before sharing, run Kernel → Restart & Run All to ensure top-to-bottom execution works.
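The sketch below illustrates the deterministic operations and dated artifact names mentioned above, reusing clean and train from the worked examples; the variable names and filename pattern are illustrative.
# Deterministic subsample plus a dated, informative artifact name
from datetime import date
sample_df = clean.sample(frac=0.2, random_state=SEED)  # repeatable with the same SEED
out_path = os.path.join(ARTIFACTS_DIR, f"train_{date.today().isoformat()}.csv")
train.to_csv(out_path, index=False)
print("Saved:", out_path)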
Mini tasks to practice
- Create a new notebook in notebooks/ and paste the setup cell.
- Generate a tiny synthetic dataset, clean it, and save artifacts/clean.csv.
- Restart & Run All. Confirm the same outputs are produced.
Exercises
Complete these in a fresh notebook. Use the setup cell template.
Exercise 1 — Create a reproducible notebook header (mirrors ex1)
- Make a setup cell with imports, SEED, directories, and environment snapshot.
- Generate 5 random numbers with NumPy and print them. Re-run the whole notebook and confirm the numbers do not change.
- Change SEED via environment variable (e.g., SEED=7) and confirm the numbers change accordingly.
Exercise 2 — Deterministic split and summary (mirrors ex2)
- Create a 100-row synthetic DataFrame with two features and a binary label.
- Perform an 80/20 split using a fixed seed and NumPy shuffle.
- Save train.csv and test.csv to artifacts/. Compute the mean of each feature on the train set and save the results to artifacts/train_summary.json.
Checklist: did you complete the exercises?
- Setup cell prints versions, seed, and config.
- Random sequences repeat with same seed.
- artifacts/train.csv and artifacts/test.csv exist and shapes are stable across runs.
- artifacts/train_summary.json contains numeric means.
Common mistakes and self-check
- Mistake: Using absolute file paths. Fix: Use relative paths rooted at the project directory (see the sketch after this list).
- Mistake: Forgetting seeds for random ops. Fix: Set both random.seed and np.random.seed.
- Mistake: Running cells out of order. Fix: Always Restart & Run All before sharing.
- Mistake: Hidden state (variables from previous runs). Fix: Keep setup cell idempotent and rerun from top.
- Mistake: Not recording environment versions. Fix: Print pandas/NumPy versions in the setup cell.
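For the path fix above, one sketch (assuming the project skeleton shown earlier, with notebooks living in notebooks/; PROJECT_ROOT is an illustrative name) roots every path at the project directory with pathlib.
# Resolve the project root so data/ and artifacts/ work from any machine with the same layout
from pathlib import Path
PROJECT_ROOT = Path.cwd().parent if Path.cwd().name == "notebooks" else Path.cwd()
data_path = PROJECT_ROOT / "data" / "raw.csv"
print(data_path)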
Self-check mini task
Delete all cell outputs, then Restart & Run All. If anything changes unexpectedly or fails, locate the first failing cell, add the required seeds or paths, and retry.
Mini challenge
Extend the Example 3 split by adding a baseline predictor: predict the majority class from the training set, then compute accuracy on the test set. Save metrics to artifacts/metrics.json. Ensure results are identical across reruns.
Who this is for
- Data Scientists and Analysts working with pandas/NumPy in notebooks.
- Anyone preparing notebooks for review, handoff, or productionization.
Prerequisites
- Comfort with Python basics and running Jupyter notebooks.
- Basic pandas and NumPy operations (loading data, indexing, simple transforms).
Learning path
- Adopt the setup cell template in all new notebooks.
- Practice deterministic data generation, cleaning, and splitting.
- Save and name artifacts consistently.
- Automate: Restart & Run All before commit; clear unnecessary heavy outputs.
Practical projects
- Reproducible EDA pack: a notebook that loads a CSV, prints an environment snapshot, computes summaries/plots, and saves them to artifacts/.
- Reproducible feature pipeline: create synthetic data, apply a deterministic transform chain, and export train/test datasets.
- Metrics dropbox: run the same notebook with two different seeds (via env var), compare the metrics.json files, and document the differences (a comparison sketch follows).
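For the metrics comparison, a small sketch (the two filenames are hypothetical; json is already imported in the setup cell) loads both files and reports the keys whose values differ.
# Compare two metrics files produced with different seeds
with open("artifacts/metrics_seed42.json") as f42, open("artifacts/metrics_seed7.json") as f7:
    m42, m7 = json.load(f42), json.load(f7)
diffs = {k: (m42[k], m7.get(k)) for k in m42 if m42[k] != m7.get(k)}
print("Differing metrics:", diffs)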
Next steps
- Apply the template to your current notebooks.
- Share one notebook with a teammate; ask them to Restart & Run All and confirm identical results.
- Take the Quick Test below to check your understanding.