Why this matters
As an Applied Scientist, your work must be trusted, reviewed, and reused. Reproducible notebooks and reports let teammates re-run your analysis, compare experiments, and audit decisions. Real tasks you will do:
- Share a model experiment so a teammate can re-run it and get the same metrics.
- Deliver a clean report (HTML/PDF) with methods, parameters, and results for stakeholders.
- Archive an analysis so future you can reproduce the exact figures after a dependency update.
- Hand off an experiment to engineering with clear inputs/outputs and a documented environment.
Concept explained simply
Reproducibility means another person (or future you) can run your work and get the same results using the same inputs, code, and environment.
Mental model
Think like a cookbook recipe:
- Ingredients: data, code, libraries, parameters, random seed.
- Instructions: ordered steps that are deterministic.
- Output: metrics, figures, tables that can be regenerated.
If any ingredient or step is missing or ambiguous, results drift.
Core building blocks
- Project structure: keep notebooks, data, configs, and outputs organized.
- Environment: pin dependencies (requirements.txt or environment.yml) and record versions.
- Data: use stable paths and note data versions/checksums; avoid manual edits.
- Determinism: set random seeds for Python, NumPy, and torch/sklearn when used (see the helper sketch after this list).
- Parameters: keep all tunables in one place (config cell or YAML/JSON).
- Modularity: put reusable logic into functions/modules; keep notebooks thin.
- Narrative: explain what, why, and how; label figures and tables.
- Export: generate a clean HTML/PDF report with hidden code if desired.
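For the determinism point, a single helper keeps all seeding in one place. A minimal sketch, assuming PyTorch is optional (set_global_seed is an illustrative name, not a standard API); note that scikit-learn has no global seed, so pass random_state to its estimators and splitters:

import os, random
import numpy as np

def set_global_seed(seed: int) -> None:
    # Seed Python's RNG, the hash seed, NumPy, and (if installed) PyTorch
    random.seed(seed)
    os.environ["PYTHONHASHSEED"] = str(seed)
    np.random.seed(seed)
    try:
        import torch  # optional dependency; skipped if not installed
        torch.manual_seed(seed)
        torch.cuda.manual_seed_all(seed)
    except ImportError:
        pass

set_global_seed(42)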
Step-by-step recipe to make any notebook reproducible
- Create a clear folder layout:
  project/
    data/
    notebooks/
    configs/
    reports/
    src/
    requirements.txt
- Add an environment file and freeze versions:
  # requirements.txt
  # Python 3.10.x (note the interpreter version here; pip cannot pin Python itself)
  numpy==1.26.*
  pandas==2.1.*
  matplotlib==3.8.*
  scikit-learn==1.3.*
  notebook==7.*
- Start the notebook with a Parameters cell and a Seeds cell:
  # Parameters
  DATA_PATH = "../data/train.csv"
  RANDOM_SEED = 42
  N_SPLITS = 5

  # Seeds
  import os, random
  import numpy as np
  random.seed(RANDOM_SEED)
  os.environ["PYTHONHASHSEED"] = str(RANDOM_SEED)
  np.random.seed(RANDOM_SEED)
- Have an Imports + Versions cell:
  import sys, platform
  import numpy as np, pandas as pd
  import sklearn
  print({
      "python": sys.version,
      "platform": platform.platform(),
      "numpy": np.__version__,
      "pandas": pd.__version__,
      "sklearn": sklearn.__version__,
  })
- Keep reusable code in src/ and import it. If that is not possible, define helper functions in one cell.
- Load data in one place and validate shapes, dtypes, and basic stats (a validation sketch follows this list).
- Produce outputs to reports/ with deterministic names (e.g., include a config hash or timestamp).
- Export a clean report:
jupyter nbconvert --to html --no-input notebooks/experiment.ipynb --output reports/experiment.html
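For the data-loading step, a small validation cell catches silent input drift before any modeling. A minimal sketch; the expected column set and the file path are placeholders for your own dataset:

import pandas as pd

DATA_PATH = "../data/train.csv"  # same value as the Parameters cell
df = pd.read_csv(DATA_PATH)

# Fail fast if the input file has changed shape or schema
expected_cols = {"age", "tenure", "monthly_charges"}  # placeholder schema
missing = expected_cols - set(df.columns)
assert not missing, f"Missing columns: {missing}"
assert len(df) > 0, "Empty dataset"

print(df.shape)
print(df.dtypes.value_counts())
print(df.isna().mean().sort_values(ascending=False).head())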
Worked examples
Example 1: EDA template notebook
# 0. Title: EDA for churn v1
# 1. Parameters
DATA_PATH = "../data/churn.csv"
RANDOM_SEED = 7
# 2. Seeds
import os, random, numpy as np
random.seed(RANDOM_SEED); os.environ["PYTHONHASHSEED"] = str(RANDOM_SEED); np.random.seed(RANDOM_SEED)
# 3. Imports + versions
import sys, platform
import numpy as np, pandas as pd, matplotlib.pyplot as plt, seaborn as sns
print({
"python": sys.version.split(" ")[0],
"numpy": np.__version__,
"pandas": pd.__version__
})
# 4. Load data
df = pd.read_csv(DATA_PATH)
print(df.shape, df.dtypes.value_counts())
# 5. Helper functions
def summarize(df):
    return {
        "rows": len(df),
        "cols": df.shape[1],
        "missing": df.isna().sum().sum()
    }
print(summarize(df))
# 6. Plots (deterministic order)
cat_cols = [c for c in df.columns if df[c].dtype == "object"]
for c in sorted(cat_cols):
    df[c].value_counts().head(10).plot.bar()
    plt.title(f"Top values: {c}")
    plt.tight_layout()
    plt.show()
# 7. Save artifacts
out = "../reports/eda_churn_summary.txt"
with open(out, "w") as f:
    f.write(str(summarize(df)))
print("Saved:", out)

Example 2: Parameterized training with a config file
# configs/churn_v1.yaml
seed: 123
features: ["age", "tenure", "monthly_charges"]
model: logistic_regression
C: 1.0
cv_splits: 5

# notebooks/train_churn.ipynb (key cells)
import os, random, numpy as np, yaml, hashlib
with open("../configs/churn_v1.yaml") as f:
    cfg = yaml.safe_load(f)
seed = cfg["seed"]
random.seed(seed); os.environ["PYTHONHASHSEED"] = str(seed); np.random.seed(seed)
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.metrics import roc_auc_score
X = pd.read_csv("../data/churn_features.csv")[cfg["features"]]
y = pd.read_csv("../data/churn_labels.csv")["churn"].astype(int)
model = LogisticRegression(C=cfg["C"], max_iter=200, n_jobs=1, random_state=seed)
auc = cross_val_score(model, X, y, cv=cfg["cv_splits"], scoring="roc_auc", n_jobs=1)
print("CV ROC AUC:", auc.mean())
# Save metrics and config hash
h = hashlib.sha256(repr(sorted(cfg.items())).encode()).hexdigest()[:8]
metrics_path = f"../reports/metrics_{h}.txt"
with open(metrics_path, "w") as f:
f.write(f"roc_auc={auc.mean():.4f}\nconfig_hash={h}\n")
print("Saved:", metrics_path)Example 3: Export a clean stakeholder report
- In Jupyter, tag non-essential code cells with "remove-input" (in JupyterLab/Notebook 7 use the Property Inspector to add cell tags; in classic Notebook, View → Cell Toolbar → Tags).
- Add a top Markdown cell describing goal, dataset, methods, and key results.
- Run the entire notebook from top to bottom (the command after these steps is a CLI equivalent).
- Export to HTML with hidden code:
jupyter nbconvert --to html \
  --TagRemovePreprocessor.enabled=True \
  --TagRemovePreprocessor.remove_input_tags='{"remove-input"}' \
  notebooks/churn_findings.ipynb --output ../reports/churn_findings.html
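Running top to bottom can also be done from the command line, which makes the export scriptable; a sketch using nbconvert's execute mode on the same notebook:

jupyter nbconvert --to notebook --execute --inplace notebooks/churn_findings.ipynb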
Common mistakes and self-check
- Missing seeds → Self-check: rerun twice; metrics should match within floating-point tolerance.
- Hidden global state (e.g., cached variables) → Self-check: restart the kernel and run all cells; results should be identical.
- Unpinned dependencies → Self-check: print versions; commit requirements.txt and verify teammates match.
- Hardcoded absolute paths → Self-check: try running from a new machine or a different folder; use relative paths or a config (see the path sketch after this list).
- Manual data edits → Self-check: document every transformation in code; keep raw data read-only.
- Mixed outputs in many places → Self-check: write all artifacts to a single reports/ directory with clear names.
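For the hardcoded-paths item, resolving everything from a single project root keeps notebooks portable. A minimal sketch, assuming the notebook runs from the notebooks/ folder of the layout above (PROJECT_ROOT, DATA_DIR, and REPORTS_DIR are illustrative names):

from pathlib import Path

# Assumes the current working directory is project/notebooks/
PROJECT_ROOT = Path.cwd().resolve().parent
DATA_DIR = PROJECT_ROOT / "data"
REPORTS_DIR = PROJECT_ROOT / "reports"
REPORTS_DIR.mkdir(parents=True, exist_ok=True)

data_path = DATA_DIR / "churn.csv"
print(data_path, data_path.exists())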
Exercises
Do these in order.
Exercise 1: Refactor a messy notebook into a reproducible template
Goal: add parameters, seeds, a versions printout, and deterministic outputs.
Checklist
- Add Parameters and Seeds cells
- Group imports and print library versions
- Move helper logic to one cell or src/
- Use relative paths under data/ and reports/
- Restart-and-run-all without errors
Exercise 2: Freeze the environment and export a report
Goal: create requirements.txt, rerun, and export HTML with hidden code.
Checklist
- Create requirements.txt with pinned versions (one way to generate it is shown after this checklist)
- Record Python and OS info in the notebook
- Tag non-essential cells to hide
- Export HTML to reports/
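One quick way to produce the pinned file is to freeze the current environment and then trim it down to the packages the notebook actually imports; a sketch:

pip freeze > requirements.txt
# then edit the file down to the direct dependencies you import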
Who this is for
- Applied Scientists running experiments and sharing findings
- Data Scientists moving from ad-hoc notebooks to robust workflows
- Researchers who need auditable, repeatable analyses
Prerequisites
- Basics of Python and Jupyter Notebooks
- Familiarity with pandas, NumPy, and at least one ML library
- Comfort with the command line
Practical projects
- Reproducible A/B uplift analysis: parameterized notebook reading two cohorts, outputs metrics and figures to reports/.
- Model selection report: compare 3 algorithms with the same seed and cross-validation setup; export a clean HTML summary.
- Data audit: notebook that validates schema, missingness, and basic metrics; saves a text summary and plots reproducibly.
Learning path
- Start with a tiny EDA notebook; add Parameters, Seeds, and Versions cells.
- Add a config file (YAML/JSON) and refactor magic numbers into it.
- Split reusable code into src/ and write minimal tests (sanity checks).
- Automate exporting to reports/ via nbconvert.
- Optional: orchestrate runs via a Makefile or a simple shell script (a minimal sketch follows).
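A minimal sketch of such a run script, assuming the layout and notebook name from the recipe above (run.sh is a hypothetical file name):

#!/usr/bin/env bash
# run.sh — execute the notebook end-to-end, then export a clean HTML report
set -euo pipefail

jupyter nbconvert --to notebook --execute --inplace notebooks/experiment.ipynb
jupyter nbconvert --to html --no-input notebooks/experiment.ipynb \
  --output-dir reports --output experiment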
Next steps
- Adopt a simple run script (make or bash) to execute notebooks end-to-end.
- Add lightweight data version notes (date snapshots or hashes); see the checksum sketch after this list.
- Standardize a team template for experiments and reports.
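For the data-version notes, recording a file checksum next to the metrics is usually enough. A minimal sketch (file_sha256 is an illustrative helper, not a library function):

import hashlib

def file_sha256(path, chunk_size=1 << 20):
    # Stream the file in chunks so large datasets never load fully into memory
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()

print("data_sha256:", file_sha256("../data/churn.csv")[:12])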
Mini challenge
Turn any existing notebook into a one-click report
- Add a top Parameters and Seeds cell.
- Create requirements.txt with pinned versions.
- Restart-and-run-all successfully.
- Hide non-essential code and export HTML to reports/.
Timebox: 45–60 minutes. Share the HTML and requirements.txt with a teammate to verify they can reproduce your results.