
Notebook Workflow And Reproducibility

Learn Notebook Workflow And Reproducibility for free with explanations, exercises, and a quick test (for Data Scientists).

Published: January 1, 2026 | Updated: January 1, 2026

Why this matters

As a Data Scientist, you will clean data, explore patterns, and ship models. Stakeholders need to trust that your results are repeatable. A solid notebook workflow reduces bugs, helps teammates reproduce your work, and makes it easy to move from exploration to production.

  • Real tasks: rerunning an analysis after data refresh, handing your notebook to a teammate, or comparing model runs fairly.
  • Risks without reproducibility: misleading metrics due to random splits, mysterious errors on another machine, or lost time reconciling versions.

Concept explained simply

Reproducibility means: if you rerun the notebook tomorrow or on another machine, you get the same results and can explain how they were produced.

Mental model

Think of your notebook as a recipe card. Good recipes list ingredients (versions, data paths), set the oven temperature (random seeds, configs), and produce the same dish each time (deterministic steps and saved outputs).

Checklist: A reproducible notebook does...
  • Start with a single setup cell: imports, seed, config, and environment info.
  • Use relative project paths (no local absolute paths like C:\Users\... or /Users/...).
  • Make randomness explicit with a fixed seed.
  • Run top-to-bottom without manual tweaking.
  • Save key artifacts (figures, CSVs, metrics) with clear names and timestamps (see the sketch right after this checklist).
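
For the last item, a minimal sketch of timestamped artifact naming. The stamped helper and the per-day date stamp are illustrative choices, not a fixed convention:

import os
from datetime import date

ARTIFACTS_DIR = "artifacts"
os.makedirs(ARTIFACTS_DIR, exist_ok=True)

def stamped(name, ext="csv"):
    # e.g., artifacts/train_2026-01-01.csv
    return os.path.join(ARTIFACTS_DIR, f"{name}_{date.today().isoformat()}.{ext}")

print(stamped("train"))  # use datetime.now() instead if you want per-run stamps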

Minimal reproducible notebook template

Use this as the first cell of every notebook.

import os, sys, json, platform, random
import numpy as np
import pandas as pd

# 1) Configuration (override via environment variables if needed)
SEED = int(os.getenv("SEED", "42"))
DATA_DIR = os.getenv("DATA_DIR", "data")
ARTIFACTS_DIR = os.getenv("ARTIFACTS_DIR", "artifacts")
CONFIG = {
    "sample_frac": 1.0,  # keep 1.0 for full data
    "target": None,
}

# 2) Reproducibility
random.seed(SEED)
np.random.seed(SEED)

# 3) Directories
os.makedirs(DATA_DIR, exist_ok=True)
os.makedirs(ARTIFACTS_DIR, exist_ok=True)

# 4) Environment snapshot (print once for logging)
print(json.dumps({
    "python": sys.version.split()[0],
    "platform": platform.platform(),
    "pandas": pd.__version__,
    "numpy": np.__version__,
    "seed": SEED,
    "config": CONFIG,
    "data_dir": DATA_DIR,
    "artifacts_dir": ARTIFACTS_DIR
}, indent=2))

# 5) Helper utilities (optional)
def describe_df(df, name="df"):
    print(f"{name}: shape={df.shape}; columns={list(df.columns)[:6]}")

Why this template works
  • All assumptions live in one place (CONFIG, SEED, paths).
  • Seeds force deterministic results for random operations.
  • Environment snapshot helps others match your setup.
  • Relative directories keep paths portable across machines.

Worked examples

Example 1: Clean synthetic data deterministically

# Synthetic dataset with missing values (deterministic)
rng = np.random.RandomState(SEED)
N = 100
raw = pd.DataFrame({
    "feature_a": rng.normal(0, 1, N),
    "feature_b": rng.uniform(0, 10, N),
    "label": rng.binomial(1, 0.3, N)
})
# Introduce some missingness deterministically
raw.loc[rng.choice(N, 10, replace=False), "feature_b"] = np.nan

# Clean: impute with median
clean = raw.copy()
median_b = clean["feature_b"].median()
clean["feature_b"].fillna(median_b, inplace=True)

describe_df(clean, "clean")
clean.to_csv(os.path.join(ARTIFACTS_DIR, "clean.csv"), index=False)
print("Saved:", os.path.join(ARTIFACTS_DIR, "clean.csv"))

Re-running yields the same median and the same saved file content.
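
You can verify that mechanically by hashing the saved file after each full run; identical hashes mean byte-identical output. A minimal sketch (the hashing step is an addition here, not part of the template):

import hashlib

with open(os.path.join(ARTIFACTS_DIR, "clean.csv"), "rb") as f:
    digest = hashlib.sha256(f.read()).hexdigest()
print("clean.csv sha256:", digest)  # should match across reruns with the same SEED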

Example 2: Parameterized EDA summary

# Use CONFIG to control EDA behavior
EDA_COLS = ["feature_a", "feature_b"]
QUANT = float(os.getenv("EDA_QUANT", "0.95"))

summary = {
    col: {
        "mean": float(clean[col].mean()),
        "std": float(clean[col].std()),
        "q95": float(clean[col].quantile(QUANT))
    } for col in EDA_COLS
}

with open(os.path.join(ARTIFACTS_DIR, "eda_summary.json"), "w") as f:
    json.dump(summary, f, indent=2)
print("EDA summary saved.")

Set EDA_QUANT=0.9 in the environment and rerun: you get a different but intentional result, and the key in eda_summary.json (q90 instead of q95) records which quantile was actually used.

Example 3: Reproducible train/test split without extra libraries

# Deterministic split using NumPy only
idx = np.arange(len(clean))
rng = np.random.RandomState(SEED)
rng.shuffle(idx)
split = int(0.8 * len(idx))
train_idx, test_idx = idx[:split], idx[split:]
train, test = clean.iloc[train_idx], clean.iloc[test_idx]

print(train.shape, test.shape)
train.to_csv(os.path.join(ARTIFACTS_DIR, "train.csv"), index=False)
test.to_csv(os.path.join(ARTIFACTS_DIR, "test.csv"), index=False)

Optional: export notebook as a script or HTML

From a notebook cell, you can create a shareable artifact:

# Convert the current notebook (replace name.ipynb) to a script or HTML
# !jupyter nbconvert --to script name.ipynb
# !jupyter nbconvert --to html name.ipynb

These exports help reviewers read diffs and reproduce results without running every cell.

Practical workflow (step-by-step)

  1. Create a project skeleton:
    project/
      ├── data/            # raw or external data (read-only when possible)
      ├── artifacts/       # tables, plots, metrics you produce
      ├── notebooks/       # your .ipynb files
      └── README.md        # what to run, in what order
  2. Start each notebook with the setup cell shown above.
  3. Use deterministic operations (e.g., random_state in .sample, fixed seeds); see the sketch after this list.
  4. Save outputs with informative names, e.g., artifacts/train_2026-01-01.csv.
  5. Before sharing, run Kernel → Restart & Run All to ensure top-to-bottom execution works.
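
For step 3, a minimal sketch of deterministic pandas operations. SEED would normally come from the setup cell, and df stands in for whatever DataFrame you are working with:

import numpy as np
import pandas as pd

SEED = 42  # normally defined once in the setup cell
df = pd.DataFrame({"x": np.arange(10)})

# Deterministic subsample: the same rows are selected on every run
subsample = df.sample(frac=0.5, random_state=SEED)

# Deterministic shuffle of the whole frame
shuffled = df.sample(frac=1.0, random_state=SEED).reset_index(drop=True)
print(subsample["x"].tolist(), shuffled["x"].tolist())
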
Mini tasks to practice
  • Create a new notebook in notebooks/ and paste the setup cell.
  • Generate a tiny synthetic dataset, clean it, and save artifacts/clean.csv.
  • Restart & Run All. Confirm the same outputs are produced.

Exercises

Complete these in a fresh notebook. Use the setup cell template.

Exercise 1 — Create a reproducible notebook header (mirrors Practice Exercise 1 below)

  • Make a setup cell with imports, SEED, directories, and environment snapshot.
  • Generate 5 random numbers with NumPy and print them. Re-run the whole notebook and confirm the numbers do not change.
  • Change SEED via environment variable (e.g., SEED=7) and confirm the numbers change accordingly.

Exercise 2 — Deterministic split and summary (mirrors Practice Exercise 2 below)

  • Create a 100-row synthetic DataFrame with two features and a binary label.
  • Perform an 80/20 split using a fixed seed and NumPy shuffle.
  • Save train.csv and test.csv to artifacts/. Compute and save mean of each feature on the train set to artifacts/train_summary.json.

Checklist: did you complete the exercises?
  • Setup cell prints versions, seed, and config.
  • Random sequences repeat with same seed.
  • artifacts/train.csv and artifacts/test.csv exist and shapes are stable across runs.
  • artifacts/train_summary.json contains numeric means.
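
If you want to check these items programmatically, a small verification cell like this sketch works (it assumes the artifacts/ layout above and a summary that maps feature names to numeric means):

import os, json

for name in ("train.csv", "test.csv", "train_summary.json"):
    path = os.path.join("artifacts", name)
    assert os.path.exists(path), f"missing {path}"

with open(os.path.join("artifacts", "train_summary.json")) as f:
    summary = json.load(f)
assert all(isinstance(v, (int, float)) for v in summary.values()), "means should be numeric"
print("All exercise artifacts present and numeric.")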

Common mistakes and self-check

  • Mistake: Using absolute file paths. Fix: Use relative paths rooted at project directory.
  • Mistake: Forgetting seeds for random ops. Fix: Set both random.seed and np.random.seed (see the helper sketch after this list).
  • Mistake: Running cells out of order. Fix: Always Restart & Run All before sharing.
  • Mistake: Hidden state (variables from previous runs). Fix: Keep setup cell idempotent and rerun from top.
  • Mistake: Not recording environment versions. Fix: Print pandas/NumPy versions in the setup cell.
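
For the seeding and hidden-state items, one convenient pattern is a tiny re-seeding helper you can call at the top of any cell that uses randomness; the name set_seed is a suggestion, not a standard API:

import random
import numpy as np

def set_seed(seed: int) -> None:
    # Re-seed both RNGs; calling it again restores the exact same state
    random.seed(seed)
    np.random.seed(seed)

set_seed(42)
a = np.random.rand(3)
set_seed(42)
b = np.random.rand(3)
assert (a == b).all()  # same seed, same numbers
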
Self-check mini task

Clear all cell outputs, then Restart & Run All. If anything fails or changes unexpectedly, locate the first problem cell, add the required seeds or paths, and retry.

Mini challenge

Extend the Example 3 split by adding a baseline predictor: predict the majority class from the training set, then compute accuracy on the test set. Save metrics to artifacts/metrics.json. Ensure results are identical across reruns.
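
Try it yourself first; here is one possible solution sketch, building on the train and test frames from Example 3 (the metric names in the JSON are illustrative):

# Majority-class baseline (assumes train, test, ARTIFACTS_DIR, json, os from earlier cells)
majority = int(train["label"].mode()[0])
accuracy = float((test["label"] == majority).mean())

metrics = {"baseline": "majority_class", "majority": majority, "accuracy": accuracy}
with open(os.path.join(ARTIFACTS_DIR, "metrics.json"), "w") as f:
    json.dump(metrics, f, indent=2)
print(metrics)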

Who this is for

  • Data Scientists and Analysts working with pandas/NumPy in notebooks.
  • Anyone preparing notebooks for review, handoff, or productionization.

Prerequisites

  • Comfort with Python basics and running Jupyter notebooks.
  • Basic pandas and NumPy operations (loading data, indexing, simple transforms).

Learning path

  1. Adopt the setup cell template in all new notebooks.
  2. Practice deterministic data generation, cleaning, and splitting.
  3. Save and name artifacts consistently.
  4. Automate: Restart & Run All before commit; clear unnecessary heavy outputs (see the command sketch below).
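
A headless equivalent of Restart & Run All, handy before a commit or in CI; nbconvert's --execute flag runs the notebook top-to-bottom and fails on the first error (the notebook path is a placeholder):

# Run the notebook end-to-end and save the executed copy
# !jupyter nbconvert --to notebook --execute --output executed.ipynb notebooks/analysis.ipynb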

Practical projects

  • Reproducible EDA pack: a notebook that loads a CSV, prints an environment snapshot, computes summaries/plots, and saves them to artifacts/.
  • Reproducible feature pipeline: create synthetic data, apply a deterministic transform chain, and export train/test datasets.
  • Metrics dropbox: run the same notebook with two different seeds (via env var), compare metrics.json files, and document differences (a comparison sketch follows).
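
For the last project, a minimal comparison sketch. The two filenames are hypothetical; they assume you rename each run's metrics.json after the run:

import json

with open("artifacts/metrics_seed42.json") as f:
    run_a = json.load(f)
with open("artifacts/metrics_seed7.json") as f:
    run_b = json.load(f)

# Report every key whose value differs between the two runs
for key in sorted(set(run_a) | set(run_b)):
    if run_a.get(key) != run_b.get(key):
        print(f"{key}: {run_a.get(key)} != {run_b.get(key)}")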

Next steps

  • Apply the template to your current notebooks.
  • Share one notebook with a teammate; ask them to Restart & Run All and confirm identical results.
  • Take the Quick Test below to check your understanding. Everyone can take it; only logged-in users will have their progress saved.

Practice Exercises

2 exercises to complete

Instructions

  1. Create a setup cell that defines SEED, DATA_DIR, ARTIFACTS_DIR, a CONFIG dict, sets both random and NumPy seeds, ensures directories exist, and prints an environment snapshot (Python, platform, pandas, NumPy, seed, config).
  2. Generate and print a NumPy array of 5 random numbers. Restart the kernel and Run All to confirm the same numbers appear.
  3. Override the seed by setting an environment variable (e.g., SEED=7) and confirm the numbers change accordingly.

Expected Output
1) A JSON-like environment printout with versions and seed; 2) The same 5 numbers on repeated full runs with the same seed; 3) Different numbers when SEED changes.

Notebook Workflow And Reproducibility — Quick Test

Test your knowledge with 10 questions. Pass with 70% or higher.

