Why this matters
Error analysis and slicing help you move beyond one big number (like accuracy) to find exactly where your model fails and how to fix it. In day-to-day data science work, you will:
- Diagnose underperforming customer segments, cities, devices, or time windows.
- Check fairness by comparing error rates across sensitive groups (where allowed) and reduce harmful gaps.
- Detect model drift by monitoring metrics across weeks or campaigns.
- Prioritize fixes that reduce business cost (e.g., lower false negatives for fraud).
Concept explained simply
Error analysis is a structured way to ask: where are mistakes concentrated, how big are they, and why? Slicing means computing metrics on subgroups (slices) of data, like:
- Categorical: product_category = "A"
- Numeric buckets: age in [18-25], [26-40], [41-60], 60+
- Text/vision proxies: text_length buckets, image brightness buckets
- Time: by week/month, pre/post release (see the sketch after this list)
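To make this concrete, below is a minimal pandas sketch of how such slice keys might be defined; the DataFrame and its columns (product_category, age, text, timestamp) are hypothetical stand-ins.
```python
import pandas as pd

# Hypothetical data with one column per slice family discussed above.
df = pd.DataFrame({
    "product_category": ["A", "B", "A", "C"],
    "age": [22, 35, 51, 64],
    "text": ["short", "a much longer free-text field", "mid length", "x"],
    "timestamp": pd.to_datetime(["2024-01-05", "2024-02-10", "2024-02-20", "2024-03-01"]),
})

# Categorical slice: the column itself is the slice key.
df["slice_category"] = df["product_category"]

# Numeric buckets: explicit, stable bin edges.
df["slice_age"] = pd.cut(df["age"], bins=[17, 25, 40, 60, 120],
                         labels=["18-25", "26-40", "41-60", "60+"])

# Quality proxy: text length buckets.
df["slice_text_len"] = pd.cut(df["text"].str.len(), bins=[0, 10, 50, 10_000],
                              labels=["short", "medium", "long"])

# Time slice: calendar month.
df["slice_month"] = df["timestamp"].dt.to_period("M").astype(str)

print(df[[c for c in df.columns if c.startswith("slice_")]])
```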
Mental model
Think of your dataset as a city at night. Overall metrics are a skyline view; slicing is a flashlight you shine down each street. You look for dark spots (bad performance), decide if they're large and important, then choose the fastest path to add light (data, features, thresholds, or process changes).
What counts as a good slice?
- Actionable: you can collect data or change features for it.
- Meaningful size: enough examples to trust the metric.
- Stable definition: slice definition that persists over time.
How to run error analysis and slicing
- Pick primary metrics aligned to the problem (e.g., recall at fixed precision for fraud, MAE for regression). For imbalanced classification, prefer PR AUC or recall/precision over accuracy.
- List candidate slices: domain-driven (e.g., plan_type, geography), data-driven (bins/quantiles, time windows), and quality proxies (text length, image brightness).
- Set guardrails: a minimum slice size (e.g., n ≥ 200 for stable classification metrics, or ≥ 30 for a quick scan), and hold-out or cross-validation for the estimates.
- Compute per-slice metrics and uncertainty: the metric, the count, and a simple bootstrap 95% CI if possible (a sketch follows this list).
- Flag performance gaps: a large difference vs. the baseline with a CI that does not overlap it (or a practical threshold, e.g., a recall gap ≥ 0.10 on ≥ 5% of the data).
- Drill down: inspect confusion matrix/residuals for that slice, sample mispredictions, and feature attributions.
- Plan fixes: data (more examples, augmentation), features (new signals), training (rebalancing, loss weights), and policy (thresholds per context if allowed and transparent).
- Validate on fresh time or holdout slices. Re-check the rest of the system for regressions.
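A minimal sketch of the per-slice metric, minimum-size, and gap-flagging steps, assuming a DataFrame with hypothetical columns y_true and y_pred plus a slice column, and using recall as the example metric (swap in whatever metric fits your problem):
```python
import numpy as np
import pandas as pd
from sklearn.metrics import recall_score

rng = np.random.default_rng(0)

def slice_report(df, slice_col, y_true="y_true", y_pred="y_pred",
                 min_n=200, n_boot=1000, alpha=0.05):
    """Per-slice recall with a bootstrap 95% CI; slices below min_n are skipped."""
    rows = []
    for name, group in df.groupby(slice_col, observed=True):
        if len(group) < min_n:
            continue  # guardrail: too few examples to trust the metric
        yt, yp = group[y_true].to_numpy(), group[y_pred].to_numpy()
        point = recall_score(yt, yp, zero_division=0)
        # Bootstrap: resample rows within the slice with replacement.
        boot = []
        for _ in range(n_boot):
            idx = rng.integers(0, len(group), len(group))
            boot.append(recall_score(yt[idx], yp[idx], zero_division=0))
        lo, hi = np.quantile(boot, [alpha / 2, 1 - alpha / 2])
        rows.append({"slice": name, "n": len(group),
                     "recall": point, "ci_low": lo, "ci_high": hi})
    report = pd.DataFrame(rows)
    overall = recall_score(df[y_true], df[y_pred], zero_division=0)
    report["gap_vs_overall"] = report["recall"] - overall
    return report.sort_values("gap_vs_overall")
```
For example, slice_report(df, "slice_age") would return one row per age bucket, ranked by how far its recall falls below the overall value; a slice whose CI sits entirely below the overall recall is a strong candidate for drill-down.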
Thresholds and fairness cautions
Only apply per-slice thresholds if compliant with policy, regulation, and ethics review. Document decisions and monitor for unintended consequences.
Worked examples
Example 1: Credit default classifier
Overall ROC AUC = 0.92. At threshold 0.5, overall recall = 0.68. Slice by employment_type:
- Self-employed (n=800): TP=120, FP=60, FN=180, TN=440 → recall = 120/(120+180) = 0.40
- All others (n=7,200): TP=1,360, FP=320, FN=520, TN=5,000 → recall = 1,360/(1,360+520) ≈ 0.72
Gap = 0.32. Likely causes: different income patterns and missing features that capture income variability. Fixes: add features (e.g., income volatility), collect more labeled self-employed examples, and try class weighting plus calibration checks.
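The per-slice recall arithmetic can be checked directly from the counts above:
```python
# Recall = TP / (TP + FN): the share of actual defaulters the model catches.
def recall(tp, fn):
    return tp / (tp + fn)

print(recall(120, 180))             # self-employed slice -> 0.40
print(round(recall(1360, 520), 2))  # all other applicants -> 0.72
```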
Example 2: Image classifier (low-light)
Task: dog vs. not-dog. Brightness slices:
- Bright (brightness ≥ 40%, n=1,200): TP=540, FP=60, FN=60, TN=540 → precision = 0.90, recall = 0.90, F1 = 0.90
- Dark (brightness < 20%, n=500): TP=156, FP=44, FN=84, TN=216 → precision = 0.78, recall = 0.65, F1 ≈ 0.71
Fixes: low-light augmentation, exposure normalization, train-time photometric transforms. Validate on dark slice.
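A sketch of how the brightness buckets might be built, assuming images are stored as uint8 arrays with values in [0, 255]; the images array below is random stand-in data, and the 40%/20% cutoffs come from the slices above.
```python
import numpy as np

rng = np.random.default_rng(0)
images = rng.integers(0, 256, size=(10, 32, 32, 3), dtype=np.uint8)  # stand-in images

# Mean pixel intensity per image, normalized to [0, 1], as a simple brightness proxy.
brightness = images.reshape(len(images), -1).mean(axis=1) / 255.0
slice_label = np.where(brightness >= 0.40, "bright",
                       np.where(brightness < 0.20, "dark", "mid"))
print(list(zip(brightness.round(2), slice_label)))
```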
Example 3: House price regression
Overall MAE = 28k. Price deciles (by true price):
- Decile 1 (lowest): MAE β 12k
- ...
- Decile 10 (highest): MAE β 85k
The model underestimates luxury homes. Fixes: handle non-linearity (e.g., predict log price), add location-quality features, and try quantile regression or a heteroscedastic loss. Re-check MAE per decile after the changes.
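A sketch of the per-decile MAE computation, assuming arrays of true prices and predictions; y_true and y_pred below are randomly generated stand-ins just to make the snippet runnable.
```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
y_true = rng.lognormal(mean=12.5, sigma=0.6, size=5_000)  # stand-in sale prices
y_pred = y_true * rng.normal(1.0, 0.1, size=5_000)        # stand-in predictions

df = pd.DataFrame({"y_true": y_true, "y_pred": y_pred})
df["decile"] = pd.qcut(df["y_true"], q=10, labels=list(range(1, 11)))  # 1 = cheapest
mae_by_decile = (df.assign(abs_err=(df["y_true"] - df["y_pred"]).abs())
                   .groupby("decile", observed=True)["abs_err"].mean())
print(mae_by_decile)
```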
Example 4: Time-based drift
Monthly ROC AUC: Jan 0.89, Feb 0.88, Mar 0.77. March had a new marketing campaign with new traffic sources. Fixes: retrain including new distribution, add source feature, monitor weekly after deployment.
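A minimal sketch of the per-month metric behind this kind of drift check, assuming a scored DataFrame with hypothetical columns timestamp, y_true (0/1), and y_score:
```python
import pandas as pd
from sklearn.metrics import roc_auc_score

def monthly_auc(scored: pd.DataFrame) -> pd.Series:
    """ROC AUC per calendar month; each month must contain both classes."""
    by_month = scored.assign(month=scored["timestamp"].dt.to_period("M"))
    return (by_month.groupby("month")
                    .apply(lambda g: roc_auc_score(g["y_true"], g["y_score"]))
                    .rename("roc_auc"))
```
Tabulating this series month over month is usually enough to surface a drop like the one in March.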
Safety, fairness, and reliability
- Use minimum slice size and confidence intervals to avoid overreacting to noise.
- Consider sensitive attributes carefully; follow your org's policy and local regulations.
- Document gaps, decisions, and post-fix monitoring plans.
Exercises you can do now
These mirror the tasks in the Exercises section below. Do them here, then check the official solutions.
- EX1 (classification slices): Compute recall per slice and identify where to focus.
- EX2 (regression slices): Compare MAE across bins and propose a fix.
Checklist before moving on:
- [ ] Wrote down primary metric(s)
- [ ] Defined at least 5 candidate slices
- [ ] Set a minimum slice size
- [ ] Computed per-slice metrics and flagged gaps
- [ ] Proposed at least one data fix and one modeling fix
Common mistakes and self-check
- Relying on accuracy for imbalanced data → Self-check: compare PR AUC or recall@precision against accuracy; if the conclusions differ, switch your primary metric.
- Overfitting to tiny slices → Self-check: enforce n ≥ your minimum slice size; add CIs; re-validate on fresh data.
- Too many slices (p-hacking) → Self-check: pre-register your top slices; use practical thresholds; beware cherry-picking.
- Fixing metrics but hurting users → Self-check: simulate business costs; run an A/B test or an offline cost evaluation.
- Unclear ownership → Self-check: for each gap, note an owner, a fix, and a target date.
Practical projects
- Slice dashboard MVP: Build a small notebook that computes per-slice metrics, ranks worst gaps, and renders a simple HTML table. Include min-size filtering and CI via bootstrap.
- Drift watch: For any model output you have, compute the primary metric per week and alert if it drops by ≥ 0.05 for two consecutive weeks (a sketch follows this list).
- Fairness scan: Where policy allows, compare false negative rates across groups. Write a one-page memo with findings, limits, and next steps.
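For the Drift watch project, a minimal alerting sketch over a weekly metric series; the function name and the example series are stand-ins, while the 0.05 drop over two consecutive weeks comes from the project description above.
```python
import pandas as pd

def drift_alerts(weekly_metric: pd.Series, drop=0.05, consecutive=2) -> pd.Series:
    """Return the weeks that complete `consecutive` week-over-week drops of at least `drop`."""
    fell = weekly_metric.diff() <= -drop
    streak = fell.rolling(consecutive).sum() == consecutive
    return weekly_metric[streak]

weekly = pd.Series([0.89, 0.88, 0.82, 0.76, 0.78],
                   index=pd.date_range("2024-01-07", periods=5, freq="W"))
print(drift_alerts(weekly))  # flags the week that completes two consecutive >= 0.05 drops
```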
Who this is for, prerequisites, and learning path
Who this is for: Aspiring and practicing Data Scientists who train, evaluate, or maintain ML models.
Prerequisites: Basic Python/pandas or similar, understanding of classification/regression metrics, and familiarity with train/validation/test splits.
Learning path:
- Before: Metrics fundamentals, Confusion matrix, ROC/PR curves.
- Now: Error analysis and slicing.
- Next: Calibration, threshold tuning, cost-sensitive evaluation, and monitoring.
Next steps
- Automate per-slice metrics in your evaluation pipeline.
- Track 3-5 key slices as part of every model report.
- Schedule a re-check after fixes (fresh time window or holdout).
Mini challenge
Find two slices with opposite errors (one with high false negatives, one with high false positives). Propose one change that helps both, or explain why two separate strategies are needed. Keep it to 5-7 sentences.
Hint for the mini challenge
Look for shared root causes like distribution shift, missing features, or calibration drift. Consider loss weighting or new features first; use per-slice thresholds only with care.
Quick Test
The quick test below is available to everyone; only logged-in users get saved progress.