Who this is for
Applied Scientists and ML practitioners who build and evaluate models that affect people (recommendations, ranking, classification, content moderation, risk scoring, NLP, vision, and decision support).
Prerequisites
- Comfort with basic statistics (rates, proportions, confidence intervals).
- ML evaluation basics (confusion matrix, precision/recall, ROC/PR).
- Ability to slice data by subgroups (e.g., with Python/Pandas or spreadsheets).
Why this matters
In real products, a model that looks good overall can underperform badly for specific groups. As an Applied Scientist, you will be asked to:
- Audit models for group disparities before launch.
- Pick fairness metrics that match the harm you want to prevent.
- Tune thresholds or post-process predictions to reduce gaps.
- Write clear, reproducible fairness reports for stakeholders.
- Monitor fairness drift after deployment.
Concept explained simply
Bias and fairness evaluation checks whether a model’s errors or decisions differ systematically across groups (e.g., age, gender, region, device type, language). Think of it as reliability across slices.
Mental model
Picture a pipeline: Data → Labels → Model → Thresholds → Decisions → Impact. Gaps can appear at any step. Fairness is about defining the harm to avoid, choosing a compatible metric, and showing disparities with uncertainty.
Common fairness goal types
- Demographic Parity: Similar positive decision rates across groups (ignores labels). Use when access should be independent of outcome prevalence (e.g., exposure in recommendations).
- Equal Opportunity: Similar true positive rates (TPR) across groups. Use when missing true positives hurts (e.g., qualified candidates being overlooked).
- Equalized Odds: Similar TPR and FPR across groups. Stronger but harder to satisfy.
- Calibration/Predictive Parity: Same predicted score reliability across groups (e.g., among 0.7-score items, ~70% should be positive in each group).
- Error Parity (regression): Similar mean absolute error or error variance across groups.
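As a minimal sketch, the core rates behind these goal types can be computed directly from a per-group confusion matrix. The counts below are hypothetical, for illustration only:

```python
# Core per-group rates behind the fairness goals above.
# Confusion-matrix counts here are hypothetical.

def group_rates(tp, fn, fp, tn):
    """Selection rate, TPR, and FPR for one group."""
    n = tp + fn + fp + tn
    return {
        "selection_rate": (tp + fp) / n,  # Demographic Parity compares this
        "tpr": tp / (tp + fn),            # Equal Opportunity compares this
        "fpr": fp / (fp + tn),            # Equalized Odds adds this
    }

a = group_rates(tp=80, fn=20, fp=30, tn=170)
b = group_rates(tp=30, fn=20, fp=10, tn=90)
print(a["tpr"], b["tpr"])  # 0.8 0.6
```

Each goal type then compares one (or more) of these rates across groups, as a gap or a ratio.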
How to pick a metric
- If harm = unequal access/exposure → Demographic Parity.
- If harm = missing qualified positives → Equal Opportunity.
- If harm = unequal false alarms and misses → Equalized Odds.
- If harm = scores not meaning the same across groups → Calibration.
How to evaluate (step-by-step)
- Define groups: Choose sensitive or relevant slices (and key intersections, e.g., gender × age).
- Pick the harm and the matching metric (see list above).
- Compute metrics per group: e.g., selection rate, TPR, FPR, precision, calibration error, MAE.
- Compare disparities: gaps (difference) and ratios (e.g., four-fifths rule: lowest group’s rate ÷ highest group’s rate ≥ 0.8).
- Add uncertainty: bootstrap 95% CIs; flag gaps that are large and/or consistent.
- Mitigate: options include data balancing/relabeling, per-group thresholds, post-processing (e.g., equalized odds), or model retraining.
- Report and monitor: summarize metrics, disparities, CIs, mitigation steps, and a plan to track drift.
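The disparity and uncertainty steps above can be sketched with only the standard library. The decision lists here are synthetic (they mirror the selection counts used in Example 2 below), so this is an illustration, not a production recipe:

```python
import random

def bootstrap_ci_gap(outcomes_a, outcomes_b, n_boot=2000, seed=0):
    """95% bootstrap CI for the difference in positive-decision rates.

    outcomes_* are lists of 0/1 decisions (synthetic here)."""
    rng = random.Random(seed)
    gaps = []
    for _ in range(n_boot):
        # Resample each group with replacement and recompute the rate gap.
        ra = sum(rng.choices(outcomes_a, k=len(outcomes_a))) / len(outcomes_a)
        rb = sum(rng.choices(outcomes_b, k=len(outcomes_b))) / len(outcomes_b)
        gaps.append(ra - rb)
    gaps.sort()
    return gaps[int(0.025 * n_boot)], gaps[int(0.975 * n_boot)]

# Synthetic selection decisions: 120/200 positive vs 36/80 positive.
a = [1] * 120 + [0] * 80
b = [1] * 36 + [0] * 44
lo, hi = bootstrap_ci_gap(a, b)
print(f"gap CI: [{lo:.3f}, {hi:.3f}]")
```

If the interval excludes zero and the point gap exceeds your tolerance, flag the disparity; if the interval is wide, revisit the small-groups note below.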
Notes on small groups
If a group has very few samples, metrics can be unstable. Aggregate over time, widen intervals, or set a minimum sample size before enforcement.
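One way to operationalize this gate, with an illustrative (not prescriptive) sample-size floor:

```python
# Gate disparity checks on a minimum group size before enforcement.
MIN_N = 30  # hypothetical floor; set per product and metric variance

def stable_enough(group_counts):
    """Return which groups have enough samples to enforce a fairness gap."""
    return {g: n >= MIN_N for g, n in group_counts.items()}

print(stable_enough({"A": 200, "B": 12}))  # {'A': True, 'B': False}
```

Groups below the floor still get reported, but with wide intervals and no automated enforcement until more data accumulates.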
Worked examples
Example 1 — Equal Opportunity
We evaluate a binary classifier across two groups.
- Group A: TP=45, FN=15, FP=20, TN=120
- Group B: TP=18, FN=12, FP=10, TN=60
Walkthrough
- TPR_A = 45 / (45+15) = 0.75
- TPR_B = 18 / (18+12) = 0.60
- Equal Opportunity gap = |0.75 − 0.60| = 0.15 → likely unacceptable if tolerance is 0.05.
Mitigation options: lower B’s threshold to increase TPR_B, or retrain with better coverage for B.
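The walkthrough above can be checked in a few lines, using the 0.05 tolerance stated there:

```python
# Equal Opportunity check for Example 1 (counts from the text).
def tpr(tp, fn):
    """True positive rate from confusion-matrix counts."""
    return tp / (tp + fn)

tpr_a = tpr(45, 15)   # 0.75
tpr_b = tpr(18, 12)   # 0.60
gap = abs(tpr_a - tpr_b)
tolerance = 0.05
print(f"gap={gap:.2f}, pass={gap <= tolerance}")  # gap=0.15, pass=False
```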
Example 2 — Four-fifths rule (Demographic Parity)
- Group A: 200 candidates, 120 predicted positive → 0.60
- Group B: 80 candidates, 36 predicted positive → 0.45
Walkthrough
- Disparate Impact ratio = 0.45 / 0.60 = 0.75
- Since 0.75 < 0.80, this fails the four-fifths rule → flag.
- Mitigation: post-process to raise B’s selection rate (or reduce A’s), investigate data imbalance.
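The four-fifths check from this walkthrough, in code:

```python
# Disparate impact ratio for Example 2 (counts from the text).
rate_a = 120 / 200   # 0.60
rate_b = 36 / 80     # 0.45
ratio = min(rate_a, rate_b) / max(rate_a, rate_b)
print(f"ratio={ratio:.2f}, pass={ratio >= 0.80}")  # ratio=0.75, pass=False
```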
Example 3 — Regression error parity
- Group A: MAE = 4.2
- Group B: MAE = 6.0
Walkthrough
- Error gap = 1.8. If product tolerance is 1.0, this is too high.
- Check if higher error for B comes from fewer training samples or distribution shift; consider group-aware loss or more B data.
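A sketch of the error-parity check. The residuals below are synthetic, constructed only so the per-group MAEs match the example:

```python
# Error-parity check for Example 3; absolute errors are synthetic.
errors_a = [3.0, 5.4, 4.2]   # MAE = 4.2
errors_b = [5.0, 7.0, 6.0]   # MAE = 6.0

def mae(errors):
    """Mean absolute error over a group's residuals."""
    return sum(abs(e) for e in errors) / len(errors)

gap = abs(mae(errors_a) - mae(errors_b))
print(f"gap={gap:.1f}")  # gap=1.8, above a tolerance of 1.0
```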
Exercises (do these now)
These mirror the exercises below. Try them without peeking at the solutions.
- Compute group TPRs and Equal Opportunity gap from confusion counts.
- Apply the four-fifths rule to selection rates and decide pass/fail.
- Checklist before you look at solutions:
  - Did you compute per-group denominators correctly?
  - Did you compare the right rates (difference vs. ratio)?
  - Did you state a clear pass/fail rule?
Common mistakes and self-check
- Using overall accuracy only. Self-check: always include per-group metrics and disparities.
- Ignoring thresholds. Self-check: report fairness at the actual operating threshold(s).
- No uncertainty. Self-check: add bootstrap CIs and minimum sample checks.
- One-size-fits-all metric. Self-check: tie metric to the harm definition.
- Ignoring intersections. Self-check: audit key intersections where risk is highest.
- Over-correction. Self-check: check business constraints and measure utility impact.
Quick self-audit before shipping
- Defined target groups and intersections.
- Chosen fairness metric tied to a specific harm.
- Computed per-group metrics, disparities, and CIs.
- Mitigation tested and documented trade-offs.
- Monitoring plan with alert thresholds.
Practical projects
- Fairness slicing notebook: load predictions, compute per-group metrics, gaps, ratios, and bootstrap CIs.
- Threshold tuning sandbox: per-group thresholds to minimize Equalized Odds gap with minimal utility loss.
- Calibration by group: reliability plots and Brier score by group; fix with group-wise calibration.
- Data audit: check label imbalance, sampling bias, and missingness by group; propose fixes.
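For the calibration-by-group project, a minimal single-bucket check illustrates the Calibration/Predictive Parity idea from the definitions above. Scores and labels here are synthetic; a real audit would use many buckets per group:

```python
# Single-bucket calibration check (synthetic scores and labels).
# Among items scored near 0.7, roughly 70% should be positive in each group.
def bucket_calibration(scores, labels, lo=0.6, hi=0.8):
    """Observed positive rate among predictions scored in [lo, hi)."""
    pairs = [(s, y) for s, y in zip(scores, labels) if lo <= s < hi]
    return sum(y for _, y in pairs) / len(pairs)

scores = [0.65, 0.70, 0.72, 0.75, 0.68]
labels = [1, 1, 0, 1, 0]
print(bucket_calibration(scores, labels))  # 0.6 observed vs ~0.7 predicted
```

Run the same check per group; if one group's observed rate sits well below its bucket's mean score, its scores are over-confident and group-wise calibration is worth testing.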
Learning path
- Identify harms → pick fairness metric.
- Compute per-group metrics and disparities with CIs.
- Tune thresholds/post-process; document trade-offs.
- Improve data/labels; retrain.
- Calibrate and re-check fairness.
- Set up monitoring and governance (regular audits).
Mini challenge
Your classifier flags content for review. Group X has a much higher false positive rate than Group Y. Users report unfair takedowns.
What would you do?
- Measure Equalized Odds (FPR and TPR) and calibration by group.
- If FPR_X is higher, test per-group thresholds or post-processing to align FPR and TPR.
- Investigate training data and label quality for Group X; retrain if needed.
- Communicate changes and monitor post-launch.
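The per-group threshold step can be sketched as a grid search for the smallest threshold whose FPR stays under a target. Scores, labels, and the target below are all hypothetical:

```python
# Hypothetical sketch: choose a per-group threshold for Group X so its
# FPR roughly matches a target (e.g., Group Y's observed FPR).
def fpr_at(scores, labels, thresh):
    """FPR at a threshold: share of negatives scored at or above it."""
    negatives = [s for s, y in zip(scores, labels) if y == 0]
    return sum(s >= thresh for s in negatives) / len(negatives)

def threshold_for_fpr(scores, labels, target_fpr):
    """Smallest threshold on a 0.01 grid whose FPR does not exceed target."""
    for t in (i / 100 for i in range(101)):
        if fpr_at(scores, labels, t) <= target_fpr:
            return t
    return 1.0

# Synthetic Group X data: negatives tend to score high, inflating FPR.
scores_x = [0.9, 0.8, 0.7, 0.6, 0.55, 0.4, 0.3, 0.2]
labels_x = [1,   1,   0,   0,   0,    0,   1,   0]
t = threshold_for_fpr(scores_x, labels_x, target_fpr=0.2)
print(t, fpr_at(scores_x, labels_x, t))
```

In practice you would pick the target jointly with TPR (to avoid trading one gap for another) and re-measure both after the change.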