Why this matters
As a Machine Learning Engineer, you need your models to be trustworthy and fast to deploy. Training and evaluation in Continuous Integration (CI) helps you catch broken code, data issues, and performance regressions before they reach production. Real tasks you will handle:
- Run a small, fast training job on each pull request to check that code and data interfaces still work.
- Evaluate metrics on a fixed validation set and block merges if they drop below thresholds.
- Ensure runs are reproducible with seeded randomness and pinned dependencies.
- Publish artifacts (metrics, plots, small models) for teammates to review.
Who this is for
- Machine Learning Engineers wiring model checks into CI.
- Data Scientists preparing models for production.
- MLOps engineers building reliable, fast pipelines.
Prerequisites
- Basic Python and command-line skills.
- Comfort with git and pull requests.
- Understanding of train/validation/test splits and common metrics (accuracy, F1, AUC, RMSE).
Concept explained simply
CI runs quick, automated checks on every code change. For ML, that means a miniature training run plus evaluation that is deterministic and bounded by time. If metrics are good and tests pass, the change can be merged.
Mental model
Think of CI as a gate with three locks:
- Correctness lock: unit tests and data schema checks.
- Reproducibility lock: fixed seeds, fixed data slice, pinned dependencies, consistent hardware assumptions.
- Quality lock: performance metrics must meet or exceed thresholds.
Only when all three locks click does the gate open.
A minimal CI workflow (training + evaluation)
# Pseudo-CI (tool-agnostic) steps
# 1) Setup
- checkout repository
- setup python 3.10
- cache deps by lockfile hash
- install: pip install -r requirements.txt
# 2) Quick data stage
- fetch small, versioned sample data (e.g., 5k rows or 10% subset)
- validate schema (columns, dtypes); see the sketch after this outline
# 3) Training (fast)
- run: python train.py --data sample.csv --epochs 1 --seed 42 --limit 5000 --out model.bin
- timebox the job to roughly 5–10 minutes on CI
# 4) Evaluation
- run: python eval.py --model model.bin --data val.csv --metrics-json metrics.json
- parse metrics.json; fail if metrics below thresholds
# 5) Artifacts
- upload metrics.json and small artifacts (e.g., confusion_matrix.png)
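The schema check in step 2 can be a few lines of Python. A minimal sketch, assuming pandas is available; the column names and dtypes in EXPECTED are illustrative and should be adapted to your dataset, while sample.csv matches the outline above:

# check_schema.py: minimal schema gate (column names are illustrative)
import sys
import pandas as pd

EXPECTED = {"feature_a": "float64", "feature_b": "int64", "label": "int64"}

df = pd.read_csv("sample.csv")

missing = set(EXPECTED) - set(df.columns)
if missing:
    print(f"Fail: missing columns {sorted(missing)}")
    sys.exit(1)

wrong = {c: str(df[c].dtype) for c in EXPECTED if str(df[c].dtype) != EXPECTED[c]}
if wrong:
    print(f"Fail: unexpected dtypes {wrong}")
    sys.exit(1)

print("Pass: schema OK")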
Worked examples
Example 1: Classification gate with F1 threshold
Goal: Block merges if F1 < 0.78 on a fixed validation split.
# metrics.json (produced by eval.py)
{
  "precision": 0.80,
  "recall": 0.78,
  "f1": 0.79
}
# gate.py
import json
import sys

with open("metrics.json") as f:
    m = json.load(f)

if m["f1"] < 0.78:
    print("Fail: F1 below threshold 0.78")
    sys.exit(1)
print("Pass: F1 OK")
Result: CI fails if F1 regresses under 0.78.
Example 2: Regression gate with RMSE non-increase
Goal: Ensure RMSE is no more than 2% worse than the baseline.
# baseline_metrics.json
{ "rmse": 13.2 }
# current metrics
{ "rmse": 13.3 }
# gate_rmse.py
import json
import sys

baseline = json.load(open("baseline_metrics.json"))["rmse"]
current = json.load(open("metrics.json"))["rmse"]
allowed = baseline * 1.02  # up to 2% worse than baseline is tolerated
if current > allowed:
    print(f"Fail: RMSE {current} exceeds allowed {allowed:.2f}")
    sys.exit(1)
print(f"Pass: RMSE {current} within tolerance")
Result: Small noise allowed; clear regressions blocked.
Example 3: Fixing flaky CI runs
Symptoms: accuracy is 0.85 on one CI run and 0.78 on the next for the same code.
- Root causes: Non-deterministic data shuffling; unseeded model init; random train/valid split.
- Fixes: Set all seeds (NumPy, PyTorch/TF, Python); use a fixed validation index file; pin package versions.
# train.py flags
--seed 42 --fixed-val-index val_idx.txt --deterministic true
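The flags above are project-specific; inside train.py, the seeding itself can be a small helper. A minimal sketch, assuming PyTorch (drop the torch lines for other frameworks):

# seeds.py: call set_all_seeds() once at the start of train.py
import random

import numpy as np
import torch

def set_all_seeds(seed: int = 42) -> None:
    random.seed(seed)                 # Python stdlib RNG (shuffles, sampling)
    np.random.seed(seed)              # NumPy RNG (splits, initialization)
    torch.manual_seed(seed)           # PyTorch CPU RNG
    torch.cuda.manual_seed_all(seed)  # PyTorch GPU RNGs (no-op without CUDA)
    # Prefer deterministic kernels; warn instead of erroring on ops without them.
    torch.use_deterministic_algorithms(True, warn_only=True)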
Picking metrics and thresholds
- Classification: accuracy/F1/AUC. Favor F1 when classes are imbalanced.
- Regression: RMSE/MAE; pick one primary metric.
- Ranking: NDCG/MAP on a stable query set.
Threshold strategies:
- Absolute: F1 ≥ 0.78.
- Non-regression: F1 not lower than baseline by more than 1%.
- Composite: F1 ≥ 0.78 and latency ≤ 50 ms/batch (on CI hardware).
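A composite gate is just the two checks joined together, failing if either budget is violated. A minimal sketch, assuming eval.py writes both keys to metrics.json (the latency_ms_per_batch key is illustrative):

# composite_gate.py: enforce a quality threshold and a latency budget together
import json
import sys

with open("metrics.json") as f:
    m = json.load(f)

failures = []
if m["f1"] < 0.78:
    failures.append(f"F1 {m['f1']:.3f} below threshold 0.78")
if m["latency_ms_per_batch"] > 50:
    failures.append(f"latency {m['latency_ms_per_batch']:.1f} ms/batch above budget 50")

if failures:
    print("Fail: " + "; ".join(failures))
    sys.exit(1)
print("Pass: composite gate OK")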
Tip: Start conservative
Begin with non-regression gates to avoid blocking early improvements, then move to absolute thresholds once the baseline stabilizes.
Data and determinism
- Use a small, representative, versioned sample for CI (e.g., 1–10% of data).
- Freeze validation indices and store them in the repo (if small) or in a versioned artifact store; see the sketch after this list.
- Set seeds everywhere and prefer deterministic operations. Document any non-deterministic ops.
- Pin dependencies (requirements.txt with locked versions) to eliminate metric drift.
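Freezing validation indices can be as simple as committing a text file of row numbers and reading it everywhere. A minimal sketch with NumPy (val_idx.txt matches the flag used earlier; the dataset and sample sizes are illustrative):

# make_val_index.py: run once, commit val_idx.txt alongside the code
import numpy as np

rng = np.random.default_rng(42)  # seed the split itself so it can be regenerated
n_rows = 50_000                  # illustrative dataset size
val_idx = np.sort(rng.choice(n_rows, size=5_000, replace=False))
np.savetxt("val_idx.txt", val_idx, fmt="%d")

# In train.py and eval.py, load the same rows on every run:
val_idx = np.loadtxt("val_idx.txt", dtype=int)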
Speeding up CI training
- Train on a subset (rows or steps/epochs).
- Reduce model size or disable heavy augmentations in CI mode.
- Cache datasets and dependencies across CI runs.
- Skip heavy hyperparameter search; keep it for nightly jobs.
Pattern: CI-fast vs. Nightly-full
- CI-fast: minutes; subset data; 1 epoch; deterministic; gates only.
- Nightly-full: hours; full data; full training; richer reports and drift checks.
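One way to implement the split is a single switch on the CI environment variable, which many CI systems export as CI=true. A minimal sketch with illustrative values:

# config.py: fast profile on CI, full profile everywhere else
import os

IS_CI = os.environ.get("CI", "").lower() == "true"

CONFIG = {
    "rows_limit": 5_000 if IS_CI else None,  # subset the data on CI
    "epochs": 1 if IS_CI else 30,            # single epoch on CI
    "augmentations": not IS_CI,              # skip heavy augmentations on CI
    "hparam_search": not IS_CI,              # leave tuning to nightly jobs
}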
Exercises
Complete the exercise below, then use the checklist to self-verify. The Quick Test at the end is available to everyone.
Exercise 1: Design a metrics gate
Create a CI step that parses metrics.json and fails if F1 < 0.80 or if accuracy drops by more than 1% from a stored baseline. See the Exercises section below for full instructions and solution.
Exercise checklist
- Deterministic training flags are present (seed, fixed validation set).
- metrics.json is produced and archived on CI.
- Gate reads both current and baseline metrics.
- Absolute threshold and non-regression rules implemented.
- CI fails with clear, actionable error messages.
Common mistakes and how to self-check
- Unseeded randomness: If back-to-back CI runs vary by more than 1–2 percentage points, you likely missed a seed or a fixed validation split.
- Moving validation target: Ensure the validation set is stable across runs; changing it invalidates comparisons.
- Over-strict thresholds: If the gate blocks useful improvements, switch to non-regression or widen tolerance temporarily.
- Heavy CI jobs: If runs exceed 10–15 minutes, reduce data size, epochs, or artifacts produced.
- Opaque failures: Error messages should state the metric and threshold that failed.
Practical projects
- Project A: Build a CI-fast pipeline for a binary classifier with an F1 gate and a confusion matrix uploaded as an artifact.
- Project B: Add a non-regression gate comparing RMSE to a checked-in baseline file, with a 2% tolerance.
- Project C: Introduce a fairness sanity check (e.g., F1 parity within 5% across two groups on the validation slice) and fail CI on large gaps.
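For Project C, the parity check itself is only a few lines once per-row predictions are available. A sketch assuming scikit-learn and a hypothetical val_predictions.csv with label, prediction, and group columns, reading "within 5%" as an absolute F1 gap of 0.05:

# fairness_gate.py: fail CI if the per-group F1 gap exceeds 0.05
import sys

import pandas as pd
from sklearn.metrics import f1_score

df = pd.read_csv("val_predictions.csv")  # hypothetical file with per-row predictions
scores = {g: f1_score(part["label"], part["prediction"]) for g, part in df.groupby("group")}
gap = max(scores.values()) - min(scores.values())

print(f"Per-group F1: {scores}; gap: {gap:.3f}")
if gap > 0.05:
    print("Fail: F1 gap across groups exceeds 0.05")
    sys.exit(1)
print("Pass: F1 parity OK")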
Learning path
- Before this: Unit testing for ML code; data validation basics.
- You are here: Training and Evaluation in CI (fast, deterministic gates).
- Next: Model packaging and artifact versioning; deployment gating in CD; monitoring and drift alerts.
Next steps
- Implement a minimal gate on your current project with one primary metric.
- Stabilize seeds and validation indices.
- Add a baseline file and switch to non-regression rules.
- Run the Quick Test below to check understanding.
Mini challenge
In under 15 minutes of CI time, wire a gate that checks: (1) data schema match, (2) training completes with deterministic seed, (3) F1 ≥ 0.78 and not 1% worse than baseline, (4) metrics.json and a small plot are saved as artifacts. What is the single most time-consuming step, and how can you shave 30% off it without hurting reliability? Write down your answer and try it in your pipeline.