
Error Analysis And Slicing

Learn Error Analysis And Slicing for free with explanations, exercises, and a quick test (for Applied Scientists).

Published: January 7, 2026 | Updated: January 7, 2026

Who this is for

  • Applied Scientists who need to debug, improve, and ship reliable ML systems.
  • Data Scientists/ML Engineers responsible for A/B tests, fairness monitoring, or model iteration.
  • Analysts supporting product owners with targeted fixes instead of global tweaks.

Prerequisites

  • Comfort with basic ML metrics (precision/recall/F1, ROC/PR curves, MAE/MAPE, RMSE).
  • Ability to compute confusion matrices and residuals.
  • Basic data wrangling (group-bys, filters).

Why this matters

Aggregate metrics often hide problems. Error slicing reveals which users, contexts, or inputs your model fails on so you can ship precise fixes. Real tasks you will face:

  • Diagnose why an A/B uplift vanishes in certain countries or devices.
  • Find fairness gaps across demographic groups and decide thresholds or reweighting.
  • Prioritize labeling for the slices that drive the most business cost.
  • Create guardrails by monitoring key slices (e.g., high-risk cohorts) after launch.

Concept explained simply

Error analysis is the process of breaking model performance into meaningful slices to understand where and why it fails. Slicing means grouping examples by attributes (like country, device, text length, probability bucket, time window) and computing metrics per group.

Mental model: a microscope for your model

Think of the overall metric as a blurry photo. Slicing is zooming in with a microscope. Each zoom level (slice) reveals specific failure modes you can address with targeted changes: data collection, labeling policy, feature fixes, thresholds, or model architecture.

Practical workflow (step-by-step)

  1. Define the goal and costs: What metric matters, and what is expensive (FP vs FN, large residuals, latency)?
  2. Pick slicing dimensions: Start with user/context (geo, device, time), input features (length, language), and model outputs (score buckets, uncertainty).
  3. Compute per-slice metrics: For classification, use confusion counts, precision/recall/F1 at key thresholds, and PR-AUC; for regression, use MAE/RMSE and residual distributions (see the sketch after this checklist).
  4. Rank slices by impact: Combine error severity, slice size, and business cost.
  5. Hypothesize causes: Data coverage, label noise, drift, feature bugs, domain shift.
  6. Test interventions on the slice: Threshold tuning, data augmentation, labeling fixes, specialized model, or features.
  7. Verify and monitor: Re-evaluate per-slice on holdout; add slice alerts in production.
Quick checks before acting:
  • [ ] Before deciding a fix, did you check multiple metrics (not just accuracy/AUROC)?
  • [ ] Did you ensure sample sizes are adequate per slice (confidence intervals)?
  • [ ] Did you consider Simpson's paradox (aggregate vs per-slice disagreement)?
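
To make step 3 concrete, here is a minimal pandas sketch. It assumes an evaluation DataFrame with hypothetical columns y_true (0/1 label), y_score (predicted probability), and one slice column such as country; adapt the names to your data.

    import pandas as pd

    def per_slice_metrics(df: pd.DataFrame, slice_col: str, threshold: float = 0.5) -> pd.DataFrame:
        """Confusion counts and precision/recall/F1 for every value of slice_col."""
        labeled = df.assign(y_pred=(df["y_score"] >= threshold).astype(int))
        rows = []
        for slice_value, g in labeled.groupby(slice_col):
            tp = int(((g["y_pred"] == 1) & (g["y_true"] == 1)).sum())
            fp = int(((g["y_pred"] == 1) & (g["y_true"] == 0)).sum())
            fn = int(((g["y_pred"] == 0) & (g["y_true"] == 1)).sum())
            precision = tp / (tp + fp) if (tp + fp) else float("nan")
            recall = tp / (tp + fn) if (tp + fn) else float("nan")
            f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else float("nan")
            rows.append({slice_col: slice_value, "n": len(g), "tp": tp, "fp": fp, "fn": fn,
                         "precision": precision, "recall": recall, "f1": f1})
        return pd.DataFrame(rows).sort_values("f1")

    # Hypothetical usage: report = per_slice_metrics(eval_df, slice_col="country")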

Worked examples

Example 1 — Binary fraud classifier by country and score bucket

Setup: Global F1 looks fine (0.86), but refunds surge in Country A.

  • Slice by country and model score bucket (0–0.3, 0.3–0.7, 0.7–1.0).
  • Findings: In Country A, the 0.7–1.0 bucket shows precision 0.62 vs 0.88 global — many false positives.
  • Hypothesis: Merchant features differ; calibration off for Country A.
  • Action: Country-specific threshold or isotonic recalibration with local data. Validate with per-country PR at fixed recall.
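
A sketch of the recalibration idea using scikit-learn's IsotonicRegression, assuming a pandas DataFrame with hypothetical columns country, score, and y_true. A per-country threshold is the simpler alternative; either way, fit and tune on data held out from evaluation.

    import pandas as pd
    from sklearn.isotonic import IsotonicRegression

    def recalibrate_per_country(df: pd.DataFrame, country_col: str = "country",
                                score_col: str = "score", label_col: str = "y_true") -> pd.Series:
        """Fit one isotonic calibrator per country and return calibrated scores.
        In practice, fit on a separate calibration split and apply to fresh data."""
        calibrated = df[score_col].astype(float)
        for country, group in df.groupby(country_col):
            iso = IsotonicRegression(out_of_bounds="clip")
            iso.fit(group[score_col], group[label_col])
            calibrated.loc[group.index] = iso.predict(group[score_col])
        return calibrated

    # Hypothetical usage: holdout["score_cal"] = recalibrate_per_country(holdout)
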
Example 2 — Demand forecasting (regression) by season and store type

Setup: RMSE steady, yet stockouts occur in Urban stores during Peak season.

  • Slice by store_type × season.
  • Findings: Urban-Peak: MAE = 42 vs overall MAE = 18; residuals skew positive (underprediction).
  • Hypothesis: Event features missing (concerts, tourism).
  • Action: Add event/calendar features; consider specialized Urban-Peak model or feature interactions. Confirm with residual plots per slice.
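
A short pandas sketch of this slicing, assuming columns store_type, season, y_true, and y_pred; the sign of the mean residual separates under- from overprediction.

    import pandas as pd

    def slice_regression_report(df: pd.DataFrame) -> pd.DataFrame:
        """MAE and mean residual per store_type x season.
        A positive mean residual (y_true - y_pred) means underprediction."""
        out = df.assign(residual=df["y_true"] - df["y_pred"],
                        abs_err=(df["y_true"] - df["y_pred"]).abs())
        return (out.groupby(["store_type", "season"])
                   .agg(n=("residual", "size"),
                        mae=("abs_err", "mean"),
                        mean_residual=("residual", "mean"))
                   .sort_values("mae", ascending=False))
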
Example 3 — NER by entity type and text length

Setup: Overall F1 = 0.91, but customers report missed product names in long descriptions.

  • Slice by entity_type and text_length buckets.
  • Findings: PRODUCT in length > 512 tokens: recall drops to 0.72 (others > 0.9).
  • Hypothesis: Truncation or context window limits.
  • Action: Use sliding windows or long-context model. Re-check recall specifically for long texts.
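
The sliding-window action can be sketched in a few lines of plain Python; the window and stride values here are illustrative, not tuned.

    from typing import List

    def sliding_windows(tokens: List[str], window: int = 512, stride: int = 384) -> List[List[str]]:
        """Split a long token sequence into overlapping windows so entities near
        the truncation boundary still appear intact in at least one window."""
        if len(tokens) <= window:
            return [tokens]
        chunks, start = [], 0
        while start < len(tokens):
            chunks.append(tokens[start:start + window])
            if start + window >= len(tokens):
                break
            start += stride
        return chunks

    # Predictions on each chunk must be mapped back to document offsets and
    # de-duplicated in the overlap region before recomputing per-length recall.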

Useful slicing dimensions

  • User/context: geography, device, app version, time-of-day/day-of-week, channel.
  • Input characteristics: language, text length, image brightness, missing feature count.
  • Label/target strata: rare classes, long-tail categories, outlier ranges.
  • Model behavior: predicted probability buckets, uncertainty, calibration bins.
  • Business risk: high-value customers, safety-critical flows.
Pro tip: choose stable, explainable slices

Prefer slices that are interpretable, stable across time, and actionable (you can collect data or tune thresholds for them). Avoid extremely small slices without uncertainty checks.
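
One lightweight uncertainty check is the Wilson score interval for a per-slice proportion such as precision; a wide interval signals the slice is too small to act on. The counts in the usage comment are illustrative only.

    import math

    def wilson_interval(successes: int, n: int, z: float = 1.96) -> tuple:
        """Approximate 95% Wilson score interval for a proportion, e.g. per-slice
        precision = TP successes out of (TP + FP) trials. A wide interval means
        the slice is too small to support a confident decision."""
        if n == 0:
            return (0.0, 1.0)
        p = successes / n
        denom = 1 + z ** 2 / n
        center = (p + z ** 2 / (2 * n)) / denom
        half = z * math.sqrt(p * (1 - p) / n + z ** 2 / (4 * n ** 2)) / denom
        return (center - half, center + half)

    # Illustrative counts only: precision 0.62 from 62 true positives out of 100 flags
    # print(wilson_interval(62, 100))  # about (0.52, 0.71)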

Common mistakes and self-checks

  • Mistake: Relying on AUROC alone for imbalanced data. Fix: Add PR-AUC, precision/recall at business thresholds, per-slice confusion counts.
  • Mistake: Acting on tiny slices. Fix: Add confidence intervals or minimum slice size.
  • Mistake: One global threshold. Fix: Consider per-slice or per-segment thresholds when justified.
  • Mistake: Ignoring drift. Fix: Include time-based slices and rolling windows (see the sketch after this list).
  • Mistake: Overfitting to a slice. Fix: Validate on a held-out time or geography not used for tuning.
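
For the drift point above, a minimal sketch of a time-based slice: resample by week and watch the error rate, assuming hypothetical columns timestamp, y_true, and y_pred.

    import pandas as pd

    def weekly_error_rate(df: pd.DataFrame, ts_col: str = "timestamp") -> pd.DataFrame:
        """Resample predictions by calendar week so drift shows up as a trend
        in the error rate rather than a single aggregate number."""
        out = df.assign(error=(df["y_pred"] != df["y_true"]).astype(int))
        return (out.set_index(pd.to_datetime(out[ts_col]))
                   .resample("W")["error"]
                   .agg(["size", "mean"])
                   .rename(columns={"size": "n", "mean": "error_rate"}))

    # Hypothetical usage: print(weekly_error_rate(eval_df).tail(8))
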
Self-check checklist
  • [ ] Did I verify the slice failures reproduce on holdout data?
  • [ ] Are my slices orthogonal enough to isolate causes?
  • [ ] Did I evaluate business impact, not just metric deltas?
  • [ ] Did I consider fairness slices and report gaps with uncertainty?

Practical projects

  • Slice dashboard: Build a small notebook that groups metrics by 5–8 key slices (time, device, score bins, top classes) and highlights top-3 risky slices by cost × size.
  • Calibration clinic: For one problematic slice, plot reliability diagrams and recalibrate. Compare per-slice precision at fixed recall before/after (see the sketch after this list).
  • Fairness pass: Define protected-attribute slices (when available), compute TPR/FPR parity deltas, and propose mitigations with trade-offs.
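
For the calibration clinic, a sketch of the numbers behind a reliability diagram; compute it on the problematic slice and on the remainder, and compare.

    import numpy as np
    import pandas as pd

    def reliability_bins(y_true, y_score, n_bins: int = 10) -> pd.DataFrame:
        """The numbers behind a reliability diagram: mean predicted probability vs.
        observed positive rate per score bin. Large gaps indicate miscalibration."""
        df = pd.DataFrame({"y_true": y_true, "y_score": y_score})
        df["bin"] = pd.cut(df["y_score"], bins=np.linspace(0.0, 1.0, n_bins + 1), include_lowest=True)
        return (df.groupby("bin", observed=True)
                  .agg(n=("y_true", "size"),
                       mean_score=("y_score", "mean"),
                       observed_rate=("y_true", "mean")))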

Exercises you can do now

Exercise 1 — Compute per-slice metrics and propose actions

You have a binary classifier evaluated on two slices (same positive rate overall). Compute precision, recall, F1 for each slice and propose one targeted action.

Slice A ("Young"), N=500: TP=80, FP=40, FN=60, TN=320
Slice B ("Older"), N=500: TP=120, FP=30, FN=20, TN=330

Report: precision, recall, F1 for A and B; identify dominant error type in each (FP or FN); propose one action per slice.

Hints
  • Precision = TP / (TP + FP)
  • Recall = TP / (TP + FN)
  • F1 = 2 * P * R / (P + R)
Solution

Slice A: P = 80/(80+40)=0.667; R = 80/(80+60)=0.571; F1 ≈ 0.615. Dominant error: FN (60) vs FP (40). Action: lower threshold for A or add recall-oriented features/labels.

Slice B: P = 120/(120+30)=0.800; R = 120/(120+20)=0.857; F1 ≈ 0.828. Dominant error: FP (30) vs FN (20) — both are low. Action: maintain or consider slightly higher threshold to trim FPs if cost is high.
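
A quick way to verify these numbers:

    def prf(tp: int, fp: int, fn: int) -> tuple:
        precision = tp / (tp + fp)
        recall = tp / (tp + fn)
        return precision, recall, 2 * precision * recall / (precision + recall)

    print(prf(80, 40, 60))   # Slice A -> ~(0.667, 0.571, 0.615)
    print(prf(120, 30, 20))  # Slice B -> ~(0.800, 0.857, 0.828)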

Exercise 2 — Slice a regression error log

Compute MAE per slice for store_type × season and find the worst slice. Then suggest one fix.

Columns: store_type, season, y_true, y_pred
U, Peak, 210, 160
U, Peak, 190, 150
U, OffPeak, 120, 125
R, Peak, 140, 150
R, Peak, 160, 155
R, OffPeak, 90, 95
U, OffPeak, 110, 130
R, OffPeak, 100, 120
Hints
  • MAE is the mean of |y_true - y_pred| within each slice.
  • Group by store_type × season.
Solution

Absolute errors:

  • U, Peak: |210-160|=50; |190-150|=40 → MAE = (50+40)/2 = 45
  • U, OffPeak: |120-125|=5; |110-130|=20 → MAE = (5+20)/2 = 12.5
  • R, Peak: |140-150|=10; |160-155|=5 → MAE = 7.5
  • R, OffPeak: |90-95|=5; |100-120|=20 → MAE = 12.5

Worst slice: Urban-Peak (MAE = 45). Likely underprediction. Fix: add event or capacity features; consider a specialized model/interaction for Urban-Peak.
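
The same computation in pandas, using the rows from the exercise:

    import pandas as pd

    rows = [("U", "Peak", 210, 160), ("U", "Peak", 190, 150),
            ("U", "OffPeak", 120, 125), ("R", "Peak", 140, 150),
            ("R", "Peak", 160, 155), ("R", "OffPeak", 90, 95),
            ("U", "OffPeak", 110, 130), ("R", "OffPeak", 100, 120)]
    df = pd.DataFrame(rows, columns=["store_type", "season", "y_true", "y_pred"])
    mae = (df["y_true"] - df["y_pred"]).abs().groupby([df["store_type"], df["season"]]).mean()
    print(mae.sort_values(ascending=False))  # U/Peak = 45.0 tops the list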

Self-check
  • [ ] I computed metrics correctly per slice.
  • [ ] I identified the largest contributors to business cost (not just error rate).
  • [ ] I proposed slice-specific actions, not only global changes.

Mini challenge

Pick one of your models. Define 5 slices that matter for users or risk. Compute one precision/recall pair (or MAE) per slice and write one sentence for the riskiest slice: cause hypothesis + proposed fix. Keep it to 5 minutes.

Learning path

  • Before: Baseline evaluation and metric selection.
  • This subskill: Systematic slicing and error analysis.
  • Next: Experiment design and A/B readouts with slice-aware guardrails.
  • Later: Calibration, thresholding strategies, and fairness-aware evaluation.

Next steps

  • Create a reusable notebook/template for per-slice metrics with confidence intervals.
  • Add time-based slices to catch drift early.
  • Define 2–3 high-risk slices to monitor post-launch.

Quick Test

Take the short test to check your understanding: 8 questions, 70% or higher to pass.
