Why this matters
Error analysis and slicing help you move beyond one big number (like accuracy) to find exactly where your model fails and how to fix it. In day-to-day data science work, you will:
- Diagnose underperforming customer segments, cities, devices, or time windows.
- Check fairness by comparing error rates across sensitive groups (where allowed) and reduce harmful gaps.
- Detect model drift by monitoring metrics across weeks or campaigns.
- Prioritize fixes that reduce business cost (e.g., lower false negatives for fraud).
Concept explained simply
Error analysis is a structured way to ask: where are mistakes concentrated, how big are they, and why? Slicing means computing metrics on subgroups (slices) of data, like:
- Categorical: product_category = "A"
- Numeric buckets: age in [18-25], [26-40], [41-60], 60+
- Text/vision proxies: text_length buckets, image brightness buckets
- Time: by week/month, pre/post release (see the sketch after this list)
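To make this concrete, below is a minimal pandas sketch of how such slice keys might be defined; the DataFrame and its columns (product_category, age, text, timestamp) are hypothetical stand-ins.
```python
import pandas as pd

# Hypothetical data with one column per slice family discussed above.
df = pd.DataFrame({
    "product_category": ["A", "B", "A", "C"],
    "age": [22, 35, 51, 64],
    "text": ["short", "a much longer free-text field", "mid length", "x"],
    "timestamp": pd.to_datetime(["2024-01-05", "2024-02-10", "2024-02-20", "2024-03-01"]),
})

# Categorical slice: the column itself is the slice key.
df["slice_category"] = df["product_category"]

# Numeric buckets: explicit, stable bin edges.
df["slice_age"] = pd.cut(df["age"], bins=[17, 25, 40, 60, 120],
                         labels=["18-25", "26-40", "41-60", "60+"])

# Quality proxy: text length buckets.
df["slice_text_len"] = pd.cut(df["text"].str.len(), bins=[0, 10, 50, 10_000],
                              labels=["short", "medium", "long"])

# Time slice: calendar month.
df["slice_month"] = df["timestamp"].dt.to_period("M").astype(str)

print(df[[c for c in df.columns if c.startswith("slice_")]])
```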
Mental model
Think of your dataset as a city at night. Overall metrics are a skyline view; slicing is a flashlight you shine down each street. You look for dark spots (bad performance), decide if they're large and important, then choose the fastest path to add light (data, features, thresholds, or process changes).
What counts as a good slice?
- Actionable: you can collect data or change features for it.
- Meaningful size: enough examples to trust the metric.
- Stable definition: slice definition that persists over time.
How to run error analysis and slicing
- Pick primary metrics aligned to the problem (e.g., recall at fixed precision for fraud, MAE for regression). For imbalanced classification, prefer PR AUC or recall/precision over accuracy.
- List candidate slices: domain-driven (e.g., plan_type, geography), data-driven (bins/quantiles, time windows), and quality proxies (text length, image brightness).
- Set guardrails: a minimum slice size (e.g., n ≥ 200 for stable classification metrics, or ≥ 30 for a quick scan), and hold-out or cross-validation for the estimates.
- Compute per-slice metrics and uncertainty: the metric, the count, and a simple bootstrap 95% CI if possible (a sketch follows this list).
- Flag performance gaps: a large difference vs. the baseline with a CI that does not overlap it (or a practical threshold, e.g., a recall gap ≥ 0.10 on ≥ 5% of the data).
- Drill down: inspect confusion matrix/residuals for that slice, sample mispredictions, and feature attributions.
- Plan fixes: data (more examples, augmentation), features (new signals), training (rebalancing, loss weights), and policy (thresholds per context if allowed and transparent).
- Validate on fresh time or holdout slices. Re-check the rest of the system for regressions.
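A minimal sketch of the per-slice metric, minimum-size, and gap-flagging steps, assuming a DataFrame with hypothetical columns y_true and y_pred plus a slice column, and using recall as the example metric (swap in whatever metric fits your problem):
```python
import numpy as np
import pandas as pd
from sklearn.metrics import recall_score

rng = np.random.default_rng(0)

def slice_report(df, slice_col, y_true="y_true", y_pred="y_pred",
                 min_n=200, n_boot=1000, alpha=0.05):
    """Per-slice recall with a bootstrap 95% CI; slices below min_n are skipped."""
    rows = []
    for name, group in df.groupby(slice_col, observed=True):
        if len(group) < min_n:
            continue  # guardrail: too few examples to trust the metric
        yt, yp = group[y_true].to_numpy(), group[y_pred].to_numpy()
        point = recall_score(yt, yp, zero_division=0)
        # Bootstrap: resample rows within the slice with replacement.
        boot = []
        for _ in range(n_boot):
            idx = rng.integers(0, len(group), len(group))
            boot.append(recall_score(yt[idx], yp[idx], zero_division=0))
        lo, hi = np.quantile(boot, [alpha / 2, 1 - alpha / 2])
        rows.append({"slice": name, "n": len(group),
                     "recall": point, "ci_low": lo, "ci_high": hi})
    report = pd.DataFrame(rows)
    overall = recall_score(df[y_true], df[y_pred], zero_division=0)
    report["gap_vs_overall"] = report["recall"] - overall
    return report.sort_values("gap_vs_overall")
```
For example, slice_report(df, "slice_age") would return one row per age bucket, ranked by how far its recall falls below the overall value; a slice whose CI sits entirely below the overall recall is a strong candidate for drill-down.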
Thresholds and fairness cautions
Only apply per-slice thresholds if compliant with policy, regulation, and ethics review. Document decisions and monitor for unintended consequences.
Worked examples
Example 1: Credit default classifier
Overall ROC AUC = 0.92. At threshold 0.5, overall recall = 0.68. Slice by employment_type:
- Self-employed (n=800): TP=120, FP=60, FN=180, TN=440 → recall = 120/(120+180) = 0.40
- All others (n=7,200): TP=1,360, FP=320, FN=520, TN=5,000 → recall = 1,360/(1,360+520) ≈ 0.72
Gap = 0.32. Likely causes: different income patterns and missing features that capture income variability. Fixes: add features (e.g., income volatility), collect more labeled self-employed examples, and try class weighting plus calibration checks.
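The per-slice recall arithmetic can be checked directly from the counts above:
```python
# Recall = TP / (TP + FN): the share of actual defaulters the model catches.
def recall(tp, fn):
    return tp / (tp + fn)

print(recall(120, 180))             # self-employed slice -> 0.40
print(round(recall(1360, 520), 2))  # all other applicants -> 0.72
```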
Example 2: Image classifier (low-light)
Task: dog vs. not-dog. Brightness slices:
- Bright (brightness ≥ 40%, n=1,200): TP=540, FP=60, FN=60, TN=540 → precision = 0.90, recall = 0.90, F1 = 0.90
- Dark (brightness < 20%, n=500): TP=156, FP=44, FN=84, TN=216 → precision = 0.78, recall = 0.65, F1 ≈ 0.71
Fixes: low-light augmentation, exposure normalization, train-time photometric transforms. Validate on dark slice.
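A sketch of how the brightness buckets might be built, assuming images are stored as uint8 arrays with values in [0, 255]; the images array below is random stand-in data, and the 40%/20% cutoffs come from the slices above.
```python
import numpy as np

rng = np.random.default_rng(0)
images = rng.integers(0, 256, size=(10, 32, 32, 3), dtype=np.uint8)  # stand-in images

# Mean pixel intensity per image, normalized to [0, 1], as a simple brightness proxy.
brightness = images.reshape(len(images), -1).mean(axis=1) / 255.0
slice_label = np.where(brightness >= 0.40, "bright",
                       np.where(brightness < 0.20, "dark", "mid"))
print(list(zip(brightness.round(2), slice_label)))
```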
Example 3: House price regression
Overall MAE = 28k. Price deciles (by true price):
- Decile 1 (lowest): MAE β 12k
- ...
- Decile 10 (highest): MAE β 85k
The model underestimates luxury homes. Fixes: handle non-linearity (e.g., predict log price), add location-quality features, and try quantile regression or a heteroscedastic loss. Re-check MAE per decile after the changes.
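A sketch of the per-decile MAE computation, assuming arrays of true prices and predictions; y_true and y_pred below are randomly generated stand-ins just to make the snippet runnable.
```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
y_true = rng.lognormal(mean=12.5, sigma=0.6, size=5_000)  # stand-in sale prices
y_pred = y_true * rng.normal(1.0, 0.1, size=5_000)        # stand-in predictions

df = pd.DataFrame({"y_true": y_true, "y_pred": y_pred})
df["decile"] = pd.qcut(df["y_true"], q=10, labels=list(range(1, 11)))  # 1 = cheapest
mae_by_decile = (df.assign(abs_err=(df["y_true"] - df["y_pred"]).abs())
                   .groupby("decile", observed=True)["abs_err"].mean())
print(mae_by_decile)
```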
Example 4: Time-based drift
Monthly ROC AUC: Jan 0.89, Feb 0.88, Mar 0.77. March had a new marketing campaign with new traffic sources. Fixes: retrain including new distribution, add source feature, monitor weekly after deployment.
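A minimal sketch of the per-month metric behind this kind of drift check, assuming a scored DataFrame with hypothetical columns timestamp, y_true (0/1), and y_score:
```python
import pandas as pd
from sklearn.metrics import roc_auc_score

def monthly_auc(scored: pd.DataFrame) -> pd.Series:
    """ROC AUC per calendar month; each month must contain both classes."""
    by_month = scored.assign(month=scored["timestamp"].dt.to_period("M"))
    return (by_month.groupby("month")
                    .apply(lambda g: roc_auc_score(g["y_true"], g["y_score"]))
                    .rename("roc_auc"))
```
Tabulating this series month over month is usually enough to surface a drop like the one in March.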
Safety, fairness, and reliability
- Use minimum slice size and confidence intervals to avoid overreacting to noise.
- Consider sensitive attributes carefully; follow your org's policy and local regulations.
- Document gaps, decisions, and post-fix monitoring plans.
Exercises you can do now
These mirror the tasks in the Exercises section below. Do them here, then check the official solutions.
- EX1 (classification slices): Compute recall per slice and identify where to focus.
- EX2 (regression slices): Compare MAE across bins and propose a fix.
Checklist before moving on:
- [ ] Wrote down primary metric(s)
- [ ] Defined at least 5 candidate slices
- [ ] Set a minimum slice size
- [ ] Computed per-slice metrics and flagged gaps
- [ ] Proposed at least one data fix and one modeling fix
Common mistakes and self-check
- Relying on accuracy for imbalanced data → Self-check: compare PR AUC or recall@precision against accuracy; if the conclusions differ, switch your primary metric.
- Overfitting to tiny slices → Self-check: enforce n ≥ your minimum slice size; add CIs; re-validate on fresh data.
- Too many slices (p-hacking) → Self-check: pre-register your top slices; use practical thresholds; beware cherry-picking.
- Fixing metrics but hurting users → Self-check: simulate business costs; run an A/B test or an offline cost evaluation.
- Unclear ownership → Self-check: for each gap, note an owner, a fix, and a target date.
Practical projects
- Slice dashboard MVP: Build a small notebook that computes per-slice metrics, ranks worst gaps, and renders a simple HTML table. Include min-size filtering and CI via bootstrap.
- Drift watch: For any model output you have, compute the primary metric per week and alert if it drops by ≥ 0.05 for two consecutive weeks (a sketch follows this list).
- Fairness scan: Where policy allows, compare false negative rates across groups. Write a one-page memo with findings, limits, and next steps.
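For the Drift watch project, a minimal alerting sketch over a weekly metric series; the function name and the example series are stand-ins, while the 0.05 drop over two consecutive weeks comes from the project description above.
```python
import pandas as pd

def drift_alerts(weekly_metric: pd.Series, drop=0.05, consecutive=2) -> pd.Series:
    """Return the weeks that complete `consecutive` week-over-week drops of at least `drop`."""
    fell = weekly_metric.diff() <= -drop
    streak = fell.rolling(consecutive).sum() == consecutive
    return weekly_metric[streak]

weekly = pd.Series([0.89, 0.88, 0.82, 0.76, 0.78],
                   index=pd.date_range("2024-01-07", periods=5, freq="W"))
print(drift_alerts(weekly))  # flags the week that completes two consecutive >= 0.05 drops
```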
Who this is for, prerequisites, and learning path
Who this is for: Aspiring and practicing Data Scientists who train, evaluate, or maintain ML models.
Prerequisites: Basic Python/pandas or similar, understanding of classification/regression metrics, and familiarity with train/validation/test splits.
Learning path:
- Before: Metrics fundamentals, Confusion matrix, ROC/PR curves.
- Now: Error analysis and slicing.
- Next: Calibration, threshold tuning, cost-sensitive evaluation, and monitoring.
Next steps
- Automate per-slice metrics in your evaluation pipeline.
- Track 3-5 key slices as part of every model report.
- Schedule a re-check after fixes (fresh time window or holdout).
Mini challenge
Find two slices with opposite errors (one with high false negatives, one with high false positives). Propose one change that helps both, or explain why two separate strategies are needed. Keep it to 5-7 sentences.
Hint for the mini challenge
Look for shared root causes like distribution shift, missing features, or calibration drift. Consider loss weighting or new features first; use per-slice thresholds only with care.
Quick Test
The quick test below is available to everyone; only logged-in users get saved progress.