Why this matters
In a real Data Scientist role, clear documentation is the backbone of trustworthy ML. It lets teammates reproduce your results, auditors verify decisions, and future-you understand what you did and why.
- Reproducibility: Run the same code, data, and settings to get the same results.
- Decision traceability: Explain why you chose Model A over Model B.
- Handover: Enable teammates to continue work without guesswork.
- Compliance and risk: Show data sources, consent, and limitations.
- Monitoring: Know what “good” looks like to spot drift and incidents.
Real tasks you will face
- Write an experiment log for each training run.
- Create a model card before deployment.
- Maintain a dataset factsheet with source and preprocessing.
- Record decisions after offline and A/B tests.
- Update monitoring docs when thresholds change.
Concept explained simply
Documentation of experiments and models is a structured way to capture what you tried, what happened, and what you shipped.
- Experiment log: A concise record per run (data snapshot, parameters, metrics, and the decision); a minimal code sketch follows this list.
- Dataset factsheet (data card): What the data is, where it came from, how it was cleaned, and known issues.
- Model card: What the model does, how it was trained, how to use it responsibly, and where it breaks.
- Decision record: Why you picked a model or threshold (trade-offs, risks, stakeholders).
- Versioning and lineage: IDs for data, code, model, and environment so others can reconstruct the exact pipeline.
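To make the experiment log concrete, here is a minimal sketch of one run captured as a structured record and written to JSON. The field names and the `runs/` folder are illustrative assumptions, not a required schema; the values mirror the churn example later in this lesson.

```python
import json
from dataclasses import dataclass, asdict
from pathlib import Path

@dataclass
class ExperimentLog:
    # One record per training run; field names are illustrative, not a fixed schema.
    experiment_id: str
    objective: str
    data_version: str
    split: str
    seed: int
    params: dict
    metrics: dict
    decision: str
    next_action: str

log = ExperimentLog(
    experiment_id="churn_lr_2025-07-14_01",
    objective="Improve recall at 90% precision",
    data_version="customer_events_v4 (2025-06)",
    split="time split 70/15/15",
    seed=42,
    params={"model": "LogisticRegression", "C": 1.0, "class_weight": "balanced"},
    metrics={"precision": 0.90, "recall": 0.58, "auroc": 0.87},
    decision="Choose C=1.0; best recall at required precision",
    next_action="Calibrate threshold on test; draft model card",
)

# Append-friendly storage: one JSON file per run under a runs/ folder.
Path("runs").mkdir(exist_ok=True)
Path(f"runs/{log.experiment_id}.json").write_text(json.dumps(asdict(log), indent=2))
```

Plain JSON (or YAML) per run keeps the log diff-able in version control and easy to aggregate into a results table later.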
Mental model
Think of documentation as three tools together:
- Flight recorder: Captures every run’s settings and outcomes.
- Recipe: The exact steps to recreate results.
- User manual: How to safely use the model in the real world.
What to document (checklist)
- Dataset
  - Name, version or date range
  - Source and license/consent notes
  - Splitting strategy and leakage protections
  - Preprocessing/feature engineering steps
  - Known biases, missingness, caveats
- Experiment run
  - Objective and hypothesis
  - Random seed(s), cross-validation strategy
  - Parameters and environment (packages, hardware)
  - Primary and secondary metrics (with confidence intervals if applicable)
  - Results table and decision
- Model artifact
  - Model version/ID and checksum (a lineage sketch follows this checklist)
  - Training data version/ID
  - Model card: intended use, limitations, ethics
- Deployment and monitoring
  - Thresholds and calibration method
  - Guardrail metrics and alert thresholds
  - Rollback plan and who to contact
- Governance and risk
  - Approval dates and reviewers
  - Change log
  - Retirement or retraining criteria
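The versioning and lineage items under "Model artifact" are the easiest to capture programmatically. A minimal sketch, assuming local artifact files and a git repository; the file paths are hypothetical:

```python
import hashlib
import json
import subprocess
from pathlib import Path

def sha256_of(path: str, chunk_size: int = 1 << 20) -> str:
    """Checksum a file in chunks so large datasets or models need not fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()

def git_commit() -> str:
    """Current commit hash, so the exact code version is traceable."""
    return subprocess.check_output(["git", "rev-parse", "HEAD"], text=True).strip()

# Hypothetical artifact paths; adapt to your project layout.
lineage = {
    "code_commit": git_commit(),
    "data_checksum": sha256_of("data/customer_events_v4.parquet"),
    "model_checksum": sha256_of("models/churn_lr.joblib"),
}
Path("runs").mkdir(exist_ok=True)
Path("runs/lineage.json").write_text(json.dumps(lineage, indent=2))
```

Storing these IDs next to the experiment log is what lets someone else reconstruct the exact pipeline later.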
Worked examples
Example 1: Experiment log entry (binary churn prediction)
Experiment ID: churn_lr_2025-07-14_01
Objective: Improve recall at 90% precision.
Data: customer_events_v4 (2025-06), train/val/test by time split (70/15/15)
Leakage controls: no post-churn events in features; only t-30d window used
Seed: 42 | CV: 5-fold stratified
Model: Logistic Regression
Params: C=[0.1, 0.5, 1.0], class_weight=balanced, solver=liblinear
Preprocessing: StandardScaler on numeric; one-hot on top 50 categories
Metrics (validation), threshold tuned to hold precision at 0.90:

| C   | Threshold | Precision | Recall | AUROC |
|-----|-----------|-----------|--------|-------|
| 0.1 | 0.62      | 0.90      | 0.52   | 0.86  |
| 0.5 | 0.60      | 0.90      | 0.56   | 0.87  |
| 1.0 | 0.58      | 0.90      | 0.58   | 0.87  |
Decision: Choose C=1.0; best recall at required precision.
Next: Calibrate threshold on test via precision-recall sweep; draft model card.
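A run like the one logged above could come from a pipeline along these lines. This is a sketch, not the original code: the file path, column names, and the use of `max_categories` to approximate "one-hot on top 50 categories" are assumptions.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, StratifiedKFold
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical feature table; in the logged run this would be the time-based training split.
df = pd.read_parquet("data/customer_events_v4_features.parquet")
numeric_cols = ["tenure_days", "events_30d", "avg_spend_30d"]   # illustrative names
categorical_cols = ["plan", "region", "device"]                 # illustrative names
X, y = df[numeric_cols + categorical_cols], df["churned"]

preprocess = ColumnTransformer([
    ("num", StandardScaler(), numeric_cols),
    # max_categories groups infrequent levels, approximating "one-hot on top 50 categories".
    ("cat", OneHotEncoder(handle_unknown="ignore", max_categories=50), categorical_cols),
])

pipeline = Pipeline([
    ("prep", preprocess),
    ("clf", LogisticRegression(class_weight="balanced", solver="liblinear", random_state=42)),
])

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
search = GridSearchCV(pipeline, {"clf__C": [0.1, 0.5, 1.0]}, scoring="roc_auc", cv=cv)
search.fit(X, y)

# The grid, CV scheme, seed, and scores are exactly what the experiment log should record.
print(search.best_params_, round(search.best_score_, 3))
```

Threshold selection at the required precision is then a separate, documented step on the validation split (a sketch appears later, in the workflow section).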
Example 2: Dataset factsheet (transactions_v2)
Name: transactions_v2 (2024-01 to 2024-12)
Source: Internal payments DB; aggregated daily
License/consent: Internal use; customer ToS section 3.2
Population: Active users in regions A, B, C
Preprocessing: Drop rows with >50% NaN; impute median for amount; log-transform amount
Known issues: Region C has seasonal spikes; missing merchant_category for ~8%
Intended use: Fraud detection training
Limitations: Not representative of region D; suspect merchant_category drift after Nov
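Parts of a factsheet like this can be generated from the data itself; the human-written parts (consent, intended use, limitations) still matter most. A minimal sketch, assuming the table loads into a pandas DataFrame with `date`, `region`, and `merchant_category` columns (illustrative names and path):

```python
import pandas as pd

df = pd.read_parquet("data/transactions_v2.parquet")  # hypothetical path

factsheet_stats = {
    "rows": len(df),
    "date_range": (str(df["date"].min()), str(df["date"].max())),
    # Missingness per column, e.g. to confirm the ~8% missing merchant_category noted above.
    "missing_share": df.isna().mean().round(3).to_dict(),
    "regions": sorted(df["region"].unique().tolist()),
}
print(factsheet_stats)
```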
Example 3: Model card (fraud_xgb_v3)
Model: fraud_xgb_v3 (XGBoost)
Intended use: Flag high-risk transactions for review
Out-of-scope: Blocking payments automatically without human review
Training data: transactions_v2 (2024-01..12); stratified by user_id
Metrics (test): AUROC=0.941; PR-AUC=0.312
Calibration: Platt scaling on validation set
Fairness: Evaluated FPR parity across regions; region C FPR +1.8pp vs A (monitor)
Limitations: Underperforms on new merchants; sensitive to amount outliers
Ethical considerations: False positives increase manual workload; human-in-the-loop required
Owner: Risk ML Team; Contact: risk-ml@company
Versioning: Code v1.8.2; Data v2; Model checksum a1b2c3
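Model cards are usually short Markdown files kept next to the model artifact so they travel with it. Here is a minimal sketch that renders one from a dictionary; the fields mirror the example above, and the output path is an assumption.

```python
from pathlib import Path

card = {
    "Model": "fraud_xgb_v3 (XGBoost)",
    "Intended use": "Flag high-risk transactions for review",
    "Out-of-scope": "Blocking payments automatically without human review",
    "Training data": "transactions_v2 (2024-01..12)",
    "Metrics (test)": "AUROC=0.941; PR-AUC=0.312",
    "Calibration": "Platt scaling on validation set",
    "Limitations": "Underperforms on new merchants; sensitive to amount outliers",
    "Owner": "Risk ML Team",
    "Versioning": "Code v1.8.2; Data v2; Model checksum a1b2c3",
}

# Render each field as a Markdown section so the card is diff-able and reviewable in PRs.
lines = ["# Model card: fraud_xgb_v3", ""]
for field, value in card.items():
    lines += [f"## {field}", value, ""]

Path("models").mkdir(exist_ok=True)
Path("models/fraud_xgb_v3_card.md").write_text("\n".join(lines))
```

Keeping the card in the same repository as the training code means review and change history come for free.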
Minimal templates you can copy
Template: Experiment log (Markdown)
# Experiment: <id> (date, time, author)
Objective:
Data: <name/version/date range> | Split: <method>
Leakage controls:
Seed(s): <...> | CV: <...>
Model:
Params:
Preprocessing:
Metrics (val/test):
Results table:
Decision:
Next action:
Template: Dataset factsheet
Name:
Source:
License/consent:
Population:
Collection period:
Preprocessing/feature engineering:
Quality notes (missingness, bias, drift risks):
Intended use and out-of-scope:
Limitations:
Template: Model card
Model name/version:
Intended use / Out-of-scope:
Training data summary:
Training procedure and environment:
Metrics (with definitions):
Calibration/threshold(s):
Fairness/robustness checks:
Known limitations and failure modes:
Ethical considerations and mitigations:
Owner/contact and review approvals:
Change log:
Workflow: from idea to shipped model
- Define objective and hypothesis (which metric will move and why).
- Prepare dataset factsheet and snapshot the data version.
- Plan experiments (seeds, CV, parameter ranges).
- Run experiments and fill an experiment log per run.
- Compare runs and create a decision record (trade-offs, risks).
- Write the model card draft and get review.
- Deploy with thresholds, guardrails, and monitoring plan documented (a threshold-selection sketch follows this list).
- Update documents with post-deployment findings and changes.
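For the deployment step, the operating threshold should be derived from a stated target and written down, not left at the 0.5 default. A minimal sketch that picks the lowest threshold meeting a target precision, assuming validation labels and model scores are already available (the toy arrays below are placeholders):

```python
import numpy as np
from sklearn.metrics import precision_recall_curve

def threshold_for_precision(y_true, scores, target_precision=0.90):
    """Lowest threshold whose validation precision meets the target; also returns recall there."""
    precision, recall, thresholds = precision_recall_curve(y_true, scores)
    # precision/recall have len(thresholds) + 1 entries; the last point has no threshold.
    ok = np.where(precision[:-1] >= target_precision)[0]
    if len(ok) == 0:
        raise ValueError("Target precision not reachable on this validation set")
    i = ok[0]  # thresholds are increasing, so this is the lowest qualifying threshold
    return float(thresholds[i]), float(recall[i])

# Toy values for illustration; in practice use validation labels and model scores.
y_val = np.array([0, 0, 1, 0, 1, 1, 0, 1])
p_val = np.array([0.1, 0.3, 0.4, 0.35, 0.8, 0.9, 0.2, 0.7])
threshold, recall_at_target = threshold_for_precision(y_val, p_val, target_precision=0.75)
print(threshold, recall_at_target)  # record both in the experiment log and the model card
```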
Common mistakes and how to self-check
- Only logging the best run. Fix: Log every run and the selection criteria.
- Forgetting seeds or package versions. Fix: Capture seed(s), Python/OS, and key library versions (see the environment-capture sketch after this list).
- Unclear data splits. Fix: Describe split method and leakage protections.
- Ambiguous metrics. Fix: Name metrics precisely (e.g., PR-AUC), include thresholds and intervals if used.
- No limitations section. Fix: Always state where the model may fail.
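For the seeds-and-versions mistake, a few standard-library lines at the start of each run capture most of what is needed; the package list below is illustrative:

```python
import json
import platform
import random
import sys
from importlib import metadata

SEED = 42
random.seed(SEED)  # in a real run, also seed numpy and your ML framework

def installed_version(name: str) -> str:
    """Version string if the package is installed, otherwise a clear marker."""
    try:
        return metadata.version(name)
    except metadata.PackageNotFoundError:
        return "not installed"

env = {
    "python": sys.version.split()[0],
    "os": platform.platform(),
    "seed": SEED,
    # Record the libraries that actually affect results; this list is illustrative.
    "packages": {name: installed_version(name) for name in ("numpy", "pandas", "scikit-learn")},
}
print(json.dumps(env, indent=2))  # store this alongside the experiment log
```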
Self-check prompts
- Can a teammate recreate my best result in under a day?
- Could an auditor see how data was collected and consented?
- Would a new PM understand why we shipped this model?
- Can on-call engineers find thresholds and rollback steps?
Exercises
Exercise 1: Log an experiment run clearly
Create a concise experiment log for a binary classifier. Include: objective, dataset and split, seeds/CV, parameters, preprocessing, metrics table, decision, and next action.
- Deliverable: a short Markdown experiment log.
Exercise 2: Draft a one-page model card
Write a model card for a logistic regression spam detector. Include: intended use/out-of-scope, data, training setup, metrics, calibration/threshold, fairness notes, limitations, ethics, owner/versioning.
- Deliverable: a one-page model card with clear sections.
Self-check before you submit:
- [ ] I included seeds, versions, and data snapshot IDs.
- [ ] Metrics and thresholds are defined and comparable.
- [ ] Decision criteria and trade-offs are explicit.
- [ ] Limitations, risks, and monitoring plan are present.
Mini challenge
Your best offline model has slightly worse recall but better precision than the baseline. Stakeholders care most about reducing false positives in a human-review workflow. Write 3 sentences for a decision record justifying which model to ship and what to monitor post-deployment. Keep it crisp and measurable.
Who this is for
- Data Scientists and ML Engineers who run experiments and ship models.
- Analysts transitioning to ML who need reproducible workflows.
- Students building portfolios that look professional.
Prerequisites
- Basic ML training workflow (train/validate/test, metrics).
- Familiarity with versioning (git) and environments (conda/venv or containers).
- Comfort with Markdown or simple text docs.
Learning path
- Start using the experiment log template for every run this week.
- Create a dataset factsheet for your main dataset.
- Draft a model card for your current best model and request peer review.
- Add a decision record after each comparison or A/B test.
- Integrate documentation checkpoints into your PR process.
Practical projects
- Reproduce a public notebook’s results and write your own experiment log narrating differences.
- Turn a classroom model into a production-ready artifact with a full model card and monitoring plan.
- Build a small internal "model registry" folder structure with logs, cards, and factsheets for two different models.
Next steps
- Adopt a consistent naming and versioning scheme across data, code, and models.
- Automate parts of logging (auto-capture env, params, and metrics) while keeping human-written decisions; a minimal tracking sketch follows this list.
- Run the Quick Test below to cement your understanding and find gaps.
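If you adopt an experiment tracker such as MLflow, the mechanical parts of a run log can be captured automatically while the decision text stays human-written. A minimal sketch; the run name, parameters, metrics, and tags mirror the churn example and are illustrative:

```python
import mlflow

with mlflow.start_run(run_name="churn_lr_2025-07-14_01"):
    # Auto-capture the mechanical parts: parameters, metrics, and lineage tags.
    mlflow.log_params({"model": "LogisticRegression", "C": 1.0, "class_weight": "balanced"})
    mlflow.log_metrics({"precision_at_threshold": 0.90, "recall": 0.58, "auroc": 0.87})
    mlflow.set_tag("data_version", "customer_events_v4 (2025-06)")
    # Keep the decision and its rationale human-written.
    mlflow.set_tag("decision", "Choose C=1.0; best recall at required precision")
```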