Why this matters
As a Machine Learning Engineer, you need your models to be trustworthy and fast to deploy. Training and evaluation in Continuous Integration (CI) helps you catch broken code, data issues, and performance regressions before they reach production. Real tasks you will handle:
- Run a small, fast training job on each pull request to check that code and data interfaces still work.
- Evaluate metrics on a fixed validation set and block merges if they drop below thresholds.
- Ensure runs are reproducible with seeded randomness and pinned dependencies.
- Publish artifacts (metrics, plots, small models) for teammates to review.
Who this is for
- Machine Learning Engineers wiring model checks into CI.
- Data Scientists preparing models for production.
- MLOps engineers building reliable, fast pipelines.
Prerequisites
- Basic Python and command-line skills.
- Comfort with git and pull requests.
- Understanding of train/validation/test splits and common metrics (accuracy, F1, AUC, RMSE).
Concept explained simply
CI runs quick, automated checks on every code change. For ML, that means a miniature training run plus evaluation that is deterministic and bounded by time. If metrics are good and tests pass, the change can be merged.
Mental model
Think of CI as a gate with three locks:
- Correctness lock: unit tests and data schema checks.
- Reproducibility lock: fixed seeds, fixed data slice, pinned dependencies, consistent hardware assumptions.
- Quality lock: performance metrics must meet or exceed thresholds.
Only when all three locks click does the gate open.
A minimal CI workflow (training + evaluation)
# Pseudo-CI (tool-agnostic) steps
# 1) Setup
- checkout repository
- setup python 3.10
- cache deps by lockfile hash
- install: pip install -r requirements.txt
# 2) Quick data stage
- fetch small, versioned sample data (e.g., 5k rows or 10% subset)
- validate schema (columns, dtypes); see the sketch after this outline
# 3) Training (fast)
- run: python train.py --data sample.csv --epochs 1 --seed 42 --limit 5000 --out model.bin
- timebox the job to roughly 5–10 minutes on CI
# 4) Evaluation
- run: python eval.py --model model.bin --data val.csv --metrics-json metrics.json
- parse metrics.json; fail if metrics below thresholds
# 5) Artifacts
- upload metrics.json and small artifacts (e.g., confusion_matrix.png)
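The schema check in step 2 can be a few lines of Python. A minimal sketch, assuming pandas is available; the column names and dtypes in EXPECTED are illustrative and should be adapted to your dataset, while sample.csv matches the outline above:

# check_schema.py: minimal schema gate (column names are illustrative)
import sys
import pandas as pd

EXPECTED = {"feature_a": "float64", "feature_b": "int64", "label": "int64"}

df = pd.read_csv("sample.csv")

missing = set(EXPECTED) - set(df.columns)
if missing:
    print(f"Fail: missing columns {sorted(missing)}")
    sys.exit(1)

wrong = {c: str(df[c].dtype) for c in EXPECTED if str(df[c].dtype) != EXPECTED[c]}
if wrong:
    print(f"Fail: unexpected dtypes {wrong}")
    sys.exit(1)

print("Pass: schema OK")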
Worked examples
Example 1: Classification gate with F1 threshold
Goal: Block merges if F1 < 0.78 on a fixed validation split.
# metrics.json (produced by eval.py)
{
  "precision": 0.80,
  "recall": 0.78,
  "f1": 0.79
}
# gate.py
import json
import sys

with open("metrics.json") as f:
    m = json.load(f)

if m["f1"] < 0.78:
    print("Fail: F1 below threshold 0.78")
    sys.exit(1)
print("Pass: F1 OK")
Result: CI fails if F1 regresses under 0.78.
Example 2: Regression gate with RMSE non-increase
Goal: Ensure RMSE is no more than 2% worse than the baseline.
# baseline_metrics.json
{ "rmse": 13.2 }
# current metrics
{ "rmse": 13.3 }
# gate_rmse.py
import json
import sys

baseline = json.load(open("baseline_metrics.json"))["rmse"]
current = json.load(open("metrics.json"))["rmse"]
allowed = baseline * 1.02  # up to 2% worse than baseline is tolerated
if current > allowed:
    print(f"Fail: RMSE {current} exceeds allowed {allowed:.2f}")
    sys.exit(1)
print(f"Pass: RMSE {current} within tolerance")
Result: Small noise allowed; clear regressions blocked.
Example 3: Fixing flaky CI runs
Symptoms: accuracy is 0.85 on one CI run and 0.78 on the next for the same code.
- Root causes: Non-deterministic data shuffling; unseeded model init; random train/valid split.
- Fixes: Set all seeds (NumPy, PyTorch/TF, Python); use a fixed validation index file; pin package versions.
# train.py flags
--seed 42 --fixed-val-index val_idx.txt --deterministic true
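The flags above are project-specific; inside train.py, the seeding itself can be a small helper. A minimal sketch, assuming PyTorch (drop the torch lines for other frameworks):

# seeds.py: call set_all_seeds() once at the start of train.py
import random

import numpy as np
import torch

def set_all_seeds(seed: int = 42) -> None:
    random.seed(seed)                 # Python stdlib RNG (shuffles, sampling)
    np.random.seed(seed)              # NumPy RNG (splits, initialization)
    torch.manual_seed(seed)           # PyTorch CPU RNG
    torch.cuda.manual_seed_all(seed)  # PyTorch GPU RNGs (no-op without CUDA)
    # Prefer deterministic kernels; warn instead of erroring on ops without them.
    torch.use_deterministic_algorithms(True, warn_only=True)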
Picking metrics and thresholds
- Classification: accuracy/F1/AUC. Favor F1 when classes are imbalanced.
- Regression: RMSE/MAE; pick one primary metric.
- Ranking: NDCG/MAP on a stable query set.
Threshold strategies:
- Absolute: F1 ≥ 0.78.
- Non-regression: F1 not lower than baseline by more than 1%.
- Composite: F1 ≥ 0.78 and latency ≤ 50 ms/batch (on CI hardware).
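A composite gate is just the two checks joined together, failing if either budget is violated. A minimal sketch, assuming eval.py writes both keys to metrics.json (the latency_ms_per_batch key is illustrative):

# composite_gate.py: enforce a quality threshold and a latency budget together
import json
import sys

with open("metrics.json") as f:
    m = json.load(f)

failures = []
if m["f1"] < 0.78:
    failures.append(f"F1 {m['f1']:.3f} below threshold 0.78")
if m["latency_ms_per_batch"] > 50:
    failures.append(f"latency {m['latency_ms_per_batch']:.1f} ms/batch above budget 50")

if failures:
    print("Fail: " + "; ".join(failures))
    sys.exit(1)
print("Pass: composite gate OK")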
Tip: Start conservative
Begin with non-regression gates to avoid blocking early improvements, then move to absolute thresholds once the baseline stabilizes.
Data and determinism
- Use a small, representative, versioned sample for CI (e.g., 1–10% of data).
- Freeze validation indices and store them in the repo (if small) or in a versioned artifact store; see the sketch after this list.
- Set seeds everywhere and prefer deterministic operations. Document any non-deterministic ops.
- Pin dependencies (requirements.txt with locked versions) to eliminate metric drift.
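Freezing validation indices can be as simple as committing a text file of row numbers and reading it everywhere. A minimal sketch with NumPy (val_idx.txt matches the flag used earlier; the dataset and sample sizes are illustrative):

# make_val_index.py: run once, commit val_idx.txt alongside the code
import numpy as np

rng = np.random.default_rng(42)  # seed the split itself so it can be regenerated
n_rows = 50_000                  # illustrative dataset size
val_idx = np.sort(rng.choice(n_rows, size=5_000, replace=False))
np.savetxt("val_idx.txt", val_idx, fmt="%d")

# In train.py and eval.py, load the same rows on every run:
val_idx = np.loadtxt("val_idx.txt", dtype=int)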
Speeding up CI training
- Train on a subset (rows or steps/epochs).
- Reduce model size or disable heavy augmentations in CI mode.
- Cache datasets and dependencies across CI runs.
- Skip heavy hyperparameter search; keep it for nightly jobs.
Pattern: CI-fast vs. Nightly-full
- CI-fast: minutes; subset data; 1 epoch; deterministic; gates only.
- Nightly-full: hours; full data; full training; richer reports and drift checks.
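One way to implement the split is a single switch on the CI environment variable, which many CI systems export as CI=true. A minimal sketch with illustrative values:

# config.py: fast profile on CI, full profile everywhere else
import os

IS_CI = os.environ.get("CI", "").lower() == "true"

CONFIG = {
    "rows_limit": 5_000 if IS_CI else None,  # subset the data on CI
    "epochs": 1 if IS_CI else 30,            # single epoch on CI
    "augmentations": not IS_CI,              # skip heavy augmentations on CI
    "hparam_search": not IS_CI,              # leave tuning to nightly jobs
}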
Exercises
Complete the exercise below, then use the checklist to self-verify. The Quick Test at the end is available to everyone.
Exercise 1: Design a metrics gate
Create a CI step that parses metrics.json and fails if F1 < 0.80 or if accuracy drops by more than 1% from a stored baseline. See the Exercises section below for full instructions and solution.
Exercise checklist
- Deterministic training flags are present (seed, fixed validation set).
- metrics.json is produced and archived on CI.
- Gate reads both current and baseline metrics.
- Absolute threshold and non-regression rules implemented.
- CI fails with clear, actionable error messages.
Common mistakes and how to self-check
- Unseeded randomness: If back-to-back CI runs vary by more than 1–2 percentage points, you likely missed a seed or a fixed validation split.
- Moving validation target: Ensure the validation set is stable across runs; changing it invalidates comparisons.
- Over-strict thresholds: If the gate blocks useful improvements, switch to non-regression or widen tolerance temporarily.
- Heavy CI jobs: If runs exceed 10–15 minutes, reduce data size, epochs, or artifacts produced.
- Opaque failures: Error messages should state the metric and threshold that failed.
Practical projects
- Project A: Build a CI-fast pipeline for a binary classifier with an F1 gate and a confusion matrix uploaded as an artifact.
- Project B: Add a non-regression gate comparing RMSE to a checked-in baseline file, with a 2% tolerance.
- Project C: Introduce a fairness sanity check (e.g., F1 parity within 5% across two groups on the validation slice) and fail CI on large gaps.
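For Project C, the parity check itself is only a few lines once per-row predictions are available. A sketch assuming scikit-learn and a hypothetical val_predictions.csv with label, prediction, and group columns, reading "within 5%" as an absolute F1 gap of 0.05:

# fairness_gate.py: fail CI if the per-group F1 gap exceeds 0.05
import sys

import pandas as pd
from sklearn.metrics import f1_score

df = pd.read_csv("val_predictions.csv")  # hypothetical file with per-row predictions
scores = {g: f1_score(part["label"], part["prediction"]) for g, part in df.groupby("group")}
gap = max(scores.values()) - min(scores.values())

print(f"Per-group F1: {scores}; gap: {gap:.3f}")
if gap > 0.05:
    print("Fail: F1 gap across groups exceeds 0.05")
    sys.exit(1)
print("Pass: F1 parity OK")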
Learning path
- Before this: Unit testing for ML code; data validation basics.
- You are here: Training and Evaluation in CI (fast, deterministic gates).
- Next: Model packaging and artifact versioning; deployment gating in CD; monitoring and drift alerts.
Next steps
- Implement a minimal gate on your current project with one primary metric.
- Stabilize seeds and validation indices.
- Add a baseline file and switch to non-regression rules.
- Run the Quick Test below to check understanding.
Mini challenge
In under 15 minutes of CI time, wire a gate that checks: (1) data schema match, (2) training completes with deterministic seed, (3) F1 ≥ 0.78 and not 1% worse than baseline, (4) metrics.json and a small plot are saved as artifacts. What is the single most time-consuming step, and how can you shave 30% off it without hurting reliability? Write down your answer and try it in your pipeline.