Why this matters
As an NLP Engineer, you will run experiments that others must be able to repeat. Recruiters and teammates care that your results are trustworthy, that a model can be rebuilt on a clean machine, and that a bugfix does not silently change metrics. Reproducible workflows save time, reduce risk, and make collaboration smoother.
- Team task: Re-run a sentiment model from 3 months ago to compare with a new dataset.
- Production task: Patch a tokenizer bug without changing baseline metrics.
- Research task: Share a training recipe that yields the same numbers on a colleague’s laptop.
Concept explained simply
Reproducibility means another person (or future you) can re-run your steps and get the same results.
Mental model
Treat your project like a recipe: ingredients (data + exact package versions), a fixed oven setting (seeds and deterministic flags), and a written method (configs + commands). If any of those change, the cake tastes different.
Core building blocks
- Version control: Track code and configuration changes.
- Environment pinning: Freeze Python version and package versions.
- Data versioning: Store where data came from and its content hash.
- Randomness control: Set seeds and deterministic options in libraries.
- Config files: Keep hyperparameters and paths in one versioned YAML/JSON file.
- Pipelines: Use a repeatable run command and clear folder layout.
- Metadata logging: Save run id, commit hash, data hash, seed, metrics, and model path.
Recommended project layout
nlp-project/
  README.md
  env/
    requirements.txt   # or pyproject + lock
  data/
    raw/
    processed/
  configs/
    default.yaml
  src/
    train.py
    predict.py
    utils.py
  runs/
    2026-01-05_120000/
      metrics.json
      config.used.yaml
      model.pkl
Worked examples
Example 1: Pin environment + set seeds
- Pin versions: create requirements.txt with exact versions.
- Set seeds in code (random, numpy, and any ML library you use).
- Record the seed and the pinned versions in the run folder's metrics.json.
Seed snippet
import os, random, numpy as np
SEED = 13
os.environ["PYTHONHASHSEED"] = str(SEED)  # note: only affects hash randomization if set before the interpreter starts
random.seed(SEED)
np.random.seed(SEED)
# For torch users (optional):
# import torch
# torch.manual_seed(SEED)
# torch.cuda.manual_seed_all(SEED)
# torch.backends.cudnn.deterministic = True
# torch.backends.cudnn.benchmark = False
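To pin the environment, python -m pip freeze > env/requirements.txt captures exact versions. The sketch below records the interpreter and key package versions so they can be written into metrics.json next to the seed; the environment_snapshot name and the package list are illustrative, adjust them to your project.
import sys, platform
from importlib import metadata

def environment_snapshot(packages=("numpy", "scikit-learn")):
    # Capture interpreter and package versions so a run can be rebuilt later.
    versions = {"python": sys.version.split()[0], "platform": platform.platform()}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = "not installed"
    return versions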
Example 2: Data integrity with hashing
Hash your dataset file and store the hash in metrics.json. If the file changes, the hash changes.
Hash snippet
import hashlib

def file_sha256(path):
    h = hashlib.sha256()
    with open(path, 'rb') as f:
        for chunk in iter(lambda: f.read(8192), b''):
            h.update(chunk)
    return h.hexdigest()
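A usage sketch that stores the hash next to the run's other artifacts; the data path and run folder name are placeholders from the layout above.
import json
from pathlib import Path

data_hash = file_sha256("data/raw/toy.csv")
run_dir = Path("runs/2026-01-05_120000")  # example run folder
run_dir.mkdir(parents=True, exist_ok=True)
with open(run_dir / "metrics.json", "w") as f:
    json.dump({"data_hash": data_hash}, f, indent=2)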
Example 3: Config-driven runs
Put hyperparameters and paths into configs/default.yaml and have train.py read it. Store a copy of the used config alongside outputs so future runs know exactly what was used.
Minimal YAML example
seed: 13
data:
  path: data/raw/toy.csv
model:
  type: logistic_regression
  C: 1.0
  max_iter: 200
vectorizer:
  type: tfidf
  max_features: 5000
split:
  test_size: 0.2
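A minimal sketch of the config-reading side of train.py, assuming PyYAML is installed; the run folder name is a placeholder and the training logic itself is omitted.
import argparse, shutil
from pathlib import Path
import yaml  # PyYAML

def load_config(path):
    with open(path) as f:
        return yaml.safe_load(f)

if __name__ == "__main__":
    parser = argparse.ArgumentParser()
    parser.add_argument("--config", default="configs/default.yaml")
    args = parser.parse_args()

    cfg = load_config(args.config)
    run_dir = Path("runs") / "example_run"  # in practice, a timestamp
    run_dir.mkdir(parents=True, exist_ok=True)
    # Keep an exact copy of the config that produced this run.
    shutil.copy(args.config, run_dir / "config.used.yaml")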
Step-by-step: Make any NLP project reproducible in 60 minutes
- Create a clean repo and commit code and configs.
- Freeze environment (exact versions) and record Python version.
- Store raw data in data/raw and never mutate it. Derive processed data in data/processed.
- Add a dataset hash function; save hashes on each run.
- Add seeds and deterministic settings in one function.
- Move hyperparameters and paths into a YAML config.
- Create a single entry command (e.g., python src/train.py --config configs/default.yaml) that writes runs/ artifacts.
- Save metrics.json with: commit hash, config copy, data hash, seed, metrics, and model path.
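A minimal sketch of the metadata write, assuming the script runs inside a Git repository; the field names are illustrative, not a fixed schema.
import json, subprocess
from pathlib import Path

def git_commit_hash():
    # Returns the current commit, or "unknown" outside a Git repo.
    try:
        return subprocess.check_output(["git", "rev-parse", "HEAD"], text=True).strip()
    except (subprocess.CalledProcessError, FileNotFoundError):
        return "unknown"

def save_run_metadata(run_dir, seed, data_hash, metrics, model_path):
    record = {
        "commit": git_commit_hash(),
        "seed": seed,
        "data_hash": data_hash,
        "metrics": metrics,
        "model_path": str(model_path),
    }
    with open(Path(run_dir) / "metrics.json", "w") as f:
        json.dump(record, f, indent=2)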
Exercises
Exercise 1 — Reproducible NLP skeleton
Create a tiny text classification project that pins environment, uses a YAML config, sets seeds, hashes the dataset, trains a simple model, and writes metrics.json and model.pkl to runs/.
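A starting sketch for the training core, assuming scikit-learn and pandas are installed and that data/raw/toy.csv has text and label columns (both assumptions); wire it up to the config, seed, hash, and metadata helpers from the examples above.
import pickle
from pathlib import Path

import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline

SEED = 13
df = pd.read_csv("data/raw/toy.csv")
X_train, X_test, y_train, y_test = train_test_split(
    df["text"], df["label"], test_size=0.2, random_state=SEED
)
model = make_pipeline(
    TfidfVectorizer(max_features=5000),
    LogisticRegression(C=1.0, max_iter=200),
)
model.fit(X_train, y_train)
accuracy = accuracy_score(y_test, model.predict(X_test))

run_dir = Path("runs/example_run")
run_dir.mkdir(parents=True, exist_ok=True)
with open(run_dir / "model.pkl", "wb") as f:
    pickle.dump(model, f)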
Exercise 2 — Determinism check
Run the same config twice and verify identical metrics and model checksum. Change the seed and observe different results.
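A comparison sketch for the determinism check, reusing file_sha256 from Example 2; the run folder names are placeholders.
import json
from pathlib import Path

def compare_runs(run_a, run_b):
    # Metrics should match exactly for identical config, seed, and environment.
    metrics_a = json.loads((Path(run_a) / "metrics.json").read_text())
    metrics_b = json.loads((Path(run_b) / "metrics.json").read_text())
    same_metrics = metrics_a.get("metrics") == metrics_b.get("metrics")
    # Model files should be byte-identical, so their hashes should match too.
    same_model = file_sha256(Path(run_a) / "model.pkl") == file_sha256(Path(run_b) / "model.pkl")
    return same_metrics, same_model

print(compare_runs("runs/2026-01-05_120000", "runs/2026-01-05_121500"))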
Checklist before you start
- requirements.txt has exact versions
- configs/default.yaml exists and is used
- data/raw/toy.csv present
- train.py writes runs/<timestamp>
- metrics.json contains seed, data_hash, versions, and accuracy
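A minimal preflight sketch that automates the file checks above; the paths follow the recommended layout.
from pathlib import Path

REQUIRED = [
    "env/requirements.txt",
    "configs/default.yaml",
    "data/raw/toy.csv",
]

missing = [p for p in REQUIRED if not Path(p).exists()]
if missing:
    raise SystemExit(f"Missing before you start: {missing}")
print("Preflight check passed")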
Common mistakes and self-check
- Forgetting to pin versions. Self-check: pip freeze shows exact versions; commit the file.
- Changing raw data in place. Self-check: raw folder is read-only; processed data has its own folder.
- Relying on notebook state. Self-check: restart kernel and run all; or export to a script.
- Not saving configs. Self-check: runs folder contains config.used.yaml.
- Ignoring nondeterminism on GPU. Self-check: set deterministic flags and document hardware; expect small differences on some ops.
- No data hash. Self-check: metrics.json has data_hash; if the file changes, your script detects mismatch.
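A guard sketch for the last point, reusing file_sha256 from Example 2; where the expected hash lives (the config or a previous metrics.json) is up to you.
def check_data_hash(path, expected_hash):
    # Fail fast if the dataset on disk is not the one this run expects.
    actual = file_sha256(path)
    if expected_hash is not None and actual != expected_hash:
        raise ValueError(f"Data hash mismatch for {path}: expected {expected_hash}, got {actual}")
    return actual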
Practical projects
- Baseline Sentiment Classifier: TF-IDF + Logistic Regression with fully reproducible runs and ablation configs.
- News Topic Classifier: Add preprocessing steps (lowercase, stopwords) and prove reproducibility across OS.
- Text Similarity Pipeline: Evaluate TF-IDF cosine vs. simple embedding; log metrics and artifacts for each variant.
Who this is for
- Junior to mid-level NLP Engineers who need reliable experiments.
- Data Scientists moving from notebooks to production-ready workflows.
- Students building shareable, auditable projects.
Prerequisites
- Basic Python and command line.
- Familiarity with Git fundamentals (init, commit, branch).
- Intro ML knowledge (train/test split, metrics).
Learning path
- Start with environment pinning and seeds.
- Introduce config files and a single entry command.
- Add data hashing and run metadata.
- Refactor notebooks into scripts.
- Automate the pipeline with simple make-like commands.
Mini challenge
Take any old NLP notebook you have. In under 60 minutes, turn it into a reproducible script that produces a runs/ folder with config.used.yaml, metrics.json, and a model file. Aim to re-run it twice with identical metrics.
Next steps
- Generalize your scripts to accept multiple configs and run batches.
- Add pre-commit hooks to auto-format and catch common errors.
- Adopt a lightweight experiment tracker to compare runs locally.