Why this matters
Prediction quality feedback loops turn model outputs and the real-world outcomes that follow them into learning. In a Machine Learning Engineer role, you will:
- Log every prediction so you can measure real-world precision/recall, calibration, and business impact.
- Collect delayed ground-truth labels (days or weeks later) and correctly join them to predictions.
- Detect performance drift, recalibrate thresholds, and decide when to retrain.
- Reduce bias from actioned predictions (selection bias) using exploration and propensity-aware evaluation.
- Close the loop end-to-end: prediction → action → outcome → learning → improved model.
Who this is for
- Machine Learning Engineers building or maintaining production models.
- Data Scientists responsible for post-deployment evaluation and monitoring.
- Product engineers integrating model decisions into user experiences.
Prerequisites
- Comfort with Python or SQL for data joins and metrics computation.
- Basic understanding of classification/regression metrics (AUC, precision/recall, MAE, calibration).
- Awareness of data drift and model versioning concepts.
Concept explained simply
A prediction quality feedback loop is a system that captures what the model predicted and what actually happened, then uses that information to measure quality and improve the model.
Mental model
Think of a circular flow: Log → Label → Join → Measure → Act. If you make this circle fast, reliable, and unbiased, your model steadily improves. If any segment breaks (e.g., missing IDs, delayed labels, biased labels), you fly blind.
Architecture of a production feedback loop
- Log predictions: Store prediction_id, entity_id (e.g., user_id, order_id), timestamp, model_version, input schema hash, prediction score/class.
- Log actions (optional): Record whether and how the prediction was used (e.g., flagged, discounted, ranked higher).
- Collect outcomes: Store eventual ground truth or proxy signals plus timestamps (e.g., chargeback, return, click, conversion).
- Define label windows: Specify how long you wait for the outcome (e.g., 45-day return window).
- Join logic: Deterministic join by prediction_id if available; otherwise by entity_id and time window with clear rules (see the sketch after this list).
- Compute metrics: Slice-aware metrics (by segment, version, time), calibration, threshold curves, business KPIs.
- Alerts: Thresholds for degradation (e.g., recall down 20% week-over-week) and label availability SLOs.
- Learning: Trigger recalibration or retraining; schedule backfills as late labels arrive.
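A minimal sketch of the label-window and join logic above in pandas, using hypothetical prediction and outcome tables whose column names follow the logging checklist below (a real pipeline would typically do this incrementally in SQL or a batch job):

```python
import numpy as np
import pandas as pd

# Hypothetical prediction and outcome logs; column names mirror the logging checklist below.
predictions = pd.DataFrame({
    "prediction_id": ["p1", "p2", "p3"],
    "entity_id": ["order-1", "order-2", "order-3"],
    "ts_pred": pd.to_datetime(["2025-01-01", "2025-01-02", "2025-06-01"], utc=True),
    "score": [0.91, 0.12, 0.55],
})
outcomes = pd.DataFrame({
    "entity_id": ["order-1", "order-3"],
    "ts_event": pd.to_datetime(["2025-02-10", "2024-12-30"], utc=True),
    "label": [1, 1],
})

LABEL_WINDOW = pd.Timedelta(days=90)
now = pd.Timestamp.now(tz="UTC")

# Fallback join on entity_id; a join on prediction_id is simpler and preferred when available.
joined = predictions.merge(outcomes, on="entity_id", how="left")

# Keep only outcomes inside (ts_pred, ts_pred + window]; this also guards against
# time leakage (outcomes recorded before the prediction, as for order-3 here).
in_window = (joined["ts_event"] > joined["ts_pred"]) & (
    joined["ts_event"] <= joined["ts_pred"] + LABEL_WINDOW
)
joined["label"] = joined["label"].where(in_window)

# Finalize: matured predictions with no in-window outcome become negatives;
# younger predictions stay pending (NaN) instead of being counted as negatives.
matured = joined["ts_pred"] + LABEL_WINDOW <= now
joined["final_label"] = np.where(matured & joined["label"].isna(), 0.0, joined["label"])
print(joined[["prediction_id", "score", "final_label"]])
```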
What to log (copy-paste checklist)
- prediction_id (UUID)
- entity_id(s) (user_id, item_id, order_id, etc.)
- ts_pred (UTC)
- model_version and feature_view/version
- raw prediction (score/probability or regression value)
- decision/threshold used (if any)
- action taken (if any) and policy/propensity (if bandit)
- request_id/session_id for traceability
- environment (prod/stage), region
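As one possible shape for a log row, here is a sketch using a Python dataclass; the field names mirror the checklist above but are illustrative, not a standard schema:

```python
import uuid
from dataclasses import asdict, dataclass, field
from datetime import datetime, timezone
from typing import Optional

@dataclass
class PredictionLogRecord:
    """One row in the prediction log; field names mirror the checklist above."""
    entity_id: str                      # e.g., user_id, order_id
    model_version: str
    feature_view_version: str
    score: float                        # raw probability or regression value
    prediction_id: str = field(default_factory=lambda: str(uuid.uuid4()))
    ts_pred: str = field(default_factory=lambda: datetime.now(timezone.utc).isoformat())
    threshold: Optional[float] = None   # decision threshold, if one was applied
    action: Optional[str] = None        # e.g., "blocked", "flagged"; None if not actioned
    propensity: Optional[float] = None  # probability the policy took this action (bandits)
    request_id: Optional[str] = None
    environment: str = "prod"
    region: Optional[str] = None

record = PredictionLogRecord(
    entity_id="order-123", model_version="fraud-v7",
    feature_view_version="fv-2025-01", score=0.87, threshold=0.8, action="flagged",
)
print(asdict(record))  # serialize and ship to your logging/event pipeline
```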
Worked examples
Example 1: Fraud detection with delayed chargebacks
Task: Predict fraud on transactions. True fraud labels appear 60–90 days later via chargebacks.
- Label window: Use 90 days; compute provisional metrics at 30/60 days and finalize at 90 days.
- Join: By prediction_id; if missing, join on transaction_id with ts_event between ts_pred and ts_pred + 90d.
- Metrics: Precision@threshold, recall, PR AUC, calibration; alert if finalized recall drops >10% month-over-month.
- Bias watch: If high-risk scores trigger auto-blocks, you may never observe true labels for blocked cases; incorporate exploration or post-authorization audits.
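A sketch of maturity-aware evaluation on synthetic data: metrics are computed only over predictions whose label window has at least partially elapsed, and only the 90-day view is treated as final. The delays, fraud rate, and threshold below are made-up illustrations.

```python
import numpy as np
import pandas as pd
from sklearn.metrics import precision_score, recall_score

rng = np.random.default_rng(0)
n = 5_000

# Synthetic prediction log: fraud labels (chargebacks) arrive 60-90 days after the prediction.
log = pd.DataFrame({
    "age_days": rng.integers(0, 180, n),            # how old each prediction is today
    "score": rng.random(n),
    "is_fraud": (rng.random(n) < 0.05).astype(int), # ground truth, revealed only by a chargeback
})
log["chargeback_delay"] = rng.integers(60, 91, n)
log["chargeback_seen"] = (log["is_fraud"] == 1) & (log["age_days"] >= log["chargeback_delay"])

def metrics_at_maturity(df: pd.DataFrame, min_age_days: int, threshold: float = 0.8) -> dict:
    """Metrics over predictions at least `min_age_days` old, treating 'no chargeback yet'
    as a provisional negative. Only the 90-day view is final."""
    matured = df[df["age_days"] >= min_age_days]
    y_true = matured["chargeback_seen"].astype(int)
    y_pred = (matured["score"] >= threshold).astype(int)
    return {
        "n": len(matured),
        "precision": round(precision_score(y_true, y_pred, zero_division=0), 3),
        "recall": round(recall_score(y_true, y_pred, zero_division=0), 3),
    }

for t in (30, 60, 90):
    print(f"maturity >= {t}d:", metrics_at_maturity(log, t))
```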
Example 2: Recommendations using click as proxy
Task: Rank items on a homepage. You get immediate clicks but delayed purchases.
- Primary proxy: Click within session; secondary: purchase within 7 days.
- Join: Use request_id + position + item_id to attribute clicks to a specific prediction.
- Metrics: CTR calibration, expected revenue uplift; run exploration (e.g., epsilon-greedy 2–5%) and store propensities for unbiased offline evaluation.
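A sketch of epsilon-greedy serving with logged propensities, plus an inverse propensity scoring (IPS) estimate of a counterfactual policy's CTR from the logged data; the click model and policy here are synthetic placeholders:

```python
import numpy as np

rng = np.random.default_rng(1)
EPSILON = 0.05                                   # 5% exploration traffic
N_ITEMS, N_REQUESTS = 10, 50_000

true_ctr = rng.uniform(0.01, 0.10, N_ITEMS)              # unknown in practice
model_scores = true_ctr + rng.normal(0, 0.02, N_ITEMS)   # imperfect model

logged = []
for _ in range(N_REQUESTS):
    if rng.random() < EPSILON:                   # explore: uniform random item
        item = rng.integers(N_ITEMS)
    else:                                        # exploit: model's top item
        item = int(model_scores.argmax())
    # Marginal probability the policy shows this item; log it alongside the action.
    propensity = EPSILON / N_ITEMS + (1 - EPSILON) * (item == model_scores.argmax())
    click = rng.random() < true_ctr[item]
    logged.append((item, propensity, click))

items, propensities, clicks = map(np.array, zip(*logged))

# IPS estimate of a *new* target policy (here: always show item 3) from the logged data.
target_item = 3
ips_weights = (items == target_item) / propensities
ips_ctr = float(np.mean(ips_weights * clicks))
print(f"IPS-estimated CTR for always showing item {target_item}: {ips_ctr:.4f}")
print(f"True CTR of item {target_item}: {true_ctr[target_item]:.4f}")
```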
Example 3: Credit risk with action-induced bias
Task: Approve/deny loans. If you deny, you never see repayment outcome.
- Problem: Missing-not-at-random labels create bias if you train only on approved applications.
- Mitigations: Randomized exploration band within safe ranges; use inverse propensity weighting or reject inference techniques; track policy propensities.
- Metrics: PSI drift on features, KS statistic stability, default rate calibration by score deciles.
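A sketch of a Population Stability Index (PSI) check between a reference (training-time) score distribution and the current production distribution; the 0.1 / 0.25 interpretation bands in the comment are common rules of thumb, not hard thresholds:

```python
import numpy as np

def psi(reference: np.ndarray, current: np.ndarray, n_bins: int = 10) -> float:
    """PSI between two samples, using decile edges taken from the reference distribution."""
    edges = np.quantile(reference, np.linspace(0, 1, n_bins + 1))
    edges[0], edges[-1] = -np.inf, np.inf            # catch values outside the reference range
    ref_frac = np.histogram(reference, bins=edges)[0] / len(reference)
    cur_frac = np.histogram(current, bins=edges)[0] / len(current)
    ref_frac = np.clip(ref_frac, 1e-6, None)         # avoid log(0)
    cur_frac = np.clip(cur_frac, 1e-6, None)
    return float(np.sum((cur_frac - ref_frac) * np.log(cur_frac / ref_frac)))

rng = np.random.default_rng(2)
scores_train = rng.beta(2, 5, 20_000)                # scores at training time
scores_now = rng.beta(2.5, 5, 20_000)                # slightly shifted production scores

value = psi(scores_train, scores_now)
print(f"PSI = {value:.3f}")   # rule of thumb: <0.1 stable, 0.1-0.25 watch, >0.25 investigate
```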
Designing metrics and windows
- Label latency: Define clear waiting periods (t_wait). Maintain rolling metrics at t_wait and finalize later.
- Attribution: Use unique prediction_id; if not possible, define unambiguous time and entity-based joins.
- Calibration: Track reliability curves and the Brier score; recalibrate (e.g., Platt scaling or isotonic regression) when the curves drift (see the sketch after this list).
- Slices: Always compute by cohort (region, device, new/returning, version) to localize regressions.
- Business alignment: Tie to downstream KPIs (e.g., refund rate, revenue per session) while still tracking pure model metrics.
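A sketch of calibration measurement and recalibration with scikit-learn's IsotonicRegression, on deliberately over-confident synthetic scores, comparing the Brier score before and after:

```python
import numpy as np
from sklearn.isotonic import IsotonicRegression
from sklearn.metrics import brier_score_loss

rng = np.random.default_rng(3)
n = 20_000

# Synthetic ground truth and deliberately over-confident raw scores.
p_true = rng.uniform(0.02, 0.6, n)
y = (rng.random(n) < p_true).astype(int)
scores = np.clip(p_true * 1.6, 0, 1)                 # miscalibrated model output

# Fit the recalibrator on a held-out calibration split, evaluate on the rest.
cal, holdout = slice(0, n // 2), slice(n // 2, n)
iso = IsotonicRegression(out_of_bounds="clip")
iso.fit(scores[cal], y[cal])
recalibrated = iso.predict(scores[holdout])

print("Brier before:", round(brier_score_loss(y[holdout], scores[holdout]), 4))
print("Brier after: ", round(brier_score_loss(y[holdout], recalibrated), 4))

# Quick reliability check on the holdout: mean predicted vs observed rate per score bin.
bin_idx = np.clip((recalibrated * 10).astype(int), 0, 9)
for b in range(10):
    mask = bin_idx == b
    if mask.any():
        print(f"bin {b}: predicted={recalibrated[mask].mean():.3f}, observed={y[holdout][mask].mean():.3f}")
```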
Implementation checklist
- Unique prediction_id generated and stored.
- Prediction/event schemas versioned and documented.
- Timestamps in UTC, ISO-8601 format, with timezone offsets explicit.
- Label window(s) defined and encoded in code/config.
- Join logic deterministic and test-covered.
- Late-arriving labels backfilled and affected metrics recomputed.
- Metrics computed per version, per slice, over time.
- Alert thresholds and runbooks documented.
- Exploration policy and propensities logged when actions affect outcomes.
- Data retention policy covers the maximum label delay.
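Label windows, join rules, alert thresholds, and retention can live in a small versioned config so the join and alerting code never hard-codes them. A sketch of what that might look like; the keys and values are illustrative, not a standard format:

```python
# Illustrative feedback-loop config; keys and values are examples only.
FEEDBACK_LOOP_CONFIG = {
    "model": "fraud-v7",
    "label_definition_version": "chargeback-v2",      # version the label logic itself
    "label_windows": {
        "provisional_days": [30, 60],
        "final_days": 90,
    },
    "join": {
        "primary_key": "prediction_id",
        "fallback_keys": ["transaction_id"],
        "require_ts_event_after_ts_pred": True,       # guard against time leakage
    },
    "alerts": {
        "recall_drop_pct_month_over_month": 10,
        "label_availability_min_pct": 85,
    },
    "retention_days": 180,                            # must exceed the maximum label delay
}

# Example sanity check: retention must cover the final label window.
assert FEEDBACK_LOOP_CONFIG["retention_days"] >= FEEDBACK_LOOP_CONFIG["label_windows"]["final_days"]
```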
Exercises
Do these before the quick test.
- Exercise 1 (Design): Create a feedback loop plan for returns prediction with a 45-day label window. See detailed instructions in the Exercises section below.
- Exercise 2 (SQL join): Join predictions to outcomes and compute Brier score and calibration bins.
- Exercise 3 (Bias): Propose a selection-bias-aware evaluation using exploration and inverse propensity weighting.
Pre-flight checklist before you implement
- Can you reconstruct an evaluation dataset using only logs?
- Do you know exactly when a label becomes final?
- Do you have a unique ID to match prediction to outcome?
- Do actions change the outcome probability, and is that logged?
- Is there an alert if label availability drops below target?
Common mistakes and how to self-check
- Missing IDs: Not logging prediction_id leads to ambiguous joins. Self-check: Pick 10 random predictions and trace to outcomes unambiguously.
- Time leakage: Joining outcomes that occurred before the prediction. Self-check: Validate ts_event ≥ ts_pred.
- Ignoring label delay: Treating unlabeled as negative. Self-check: Track label availability rate by cohort over time.
- Selection bias: Training/evaluating only on actioned cases. Self-check: Compare metrics on exploration vs policy traffic.
- Unstable definitions: Changing label definition without versioning. Self-check: Version label logic and keep a changelog.
- No slices: Looking at overall metrics only. Self-check: Add top-5 critical slices and alert on each.
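Two of these self-checks (time leakage, label availability by cohort) are easy to automate. A sketch against a hypothetical joined evaluation table with synthetic data:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(4)
n = 1_000

# Hypothetical joined table of predictions and (possibly missing) outcomes.
joined = pd.DataFrame({
    "ts_pred": pd.Timestamp("2025-01-01", tz="UTC")
    + pd.to_timedelta(rng.integers(0, 60, n), unit="D"),
    "cohort": rng.choice(["new", "returning"], n),
})
joined["ts_event"] = joined["ts_pred"] + pd.to_timedelta(rng.integers(1, 45, n), unit="D")
joined.loc[rng.random(n) < 0.2, "ts_event"] = pd.NaT   # 20% of labels not (yet) available

# Self-check 1: no time leakage -- every observed outcome happens after its prediction.
leaked = joined.dropna(subset=["ts_event"]).query("ts_event < ts_pred")
assert leaked.empty, f"{len(leaked)} outcomes precede their predictions"

# Self-check 2: label availability rate by cohort and prediction week.
joined["week"] = joined["ts_pred"].dt.tz_convert(None).dt.to_period("W")
availability = (
    joined.assign(has_label=joined["ts_event"].notna())
    .groupby(["cohort", "week"])["has_label"]
    .mean()
)
print(availability.round(2))
```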
Practical projects
- Build a mini feedback loop: Simulate predictions for 10k users, generate delayed labels, store logs, and compute weekly metrics and calibration curves.
- Propensity-aware evaluation: Implement epsilon-greedy ranking with logged propensities; estimate CTR with inverse propensity weighting.
- Recalibration pipeline: Detect calibration drift and apply Platt or isotonic calibration; compare Brier score before/after.
Learning path
- Solid prediction logging and schema versioning.
- Reliable label collection and windowing rules.
- Deterministic joins and slice-aware metrics dashboards.
- Alerts and runbooks for degradation and missing labels.
- Bias-aware evaluation (exploration, propensities) and retraining triggers.
Next steps
- Integrate label-latency-aware dashboards in your monitoring system.
- Add exploration to counter selection bias where actions alter outcomes.
- Automate recalibration and define retraining SLOs tied to metric thresholds.
Mini challenge
Your model’s weekly recall dropped 15% while label availability simultaneously dropped from 92% to 70%. Decide: Is this a true quality drop, a labeling issue, or both? Outline the top three checks you would run within 30 minutes to isolate the cause.