
Bias And Fairness Evaluation

Learn bias and fairness evaluation for free with explanations, exercises, and a quick test, aimed at Applied Scientists.

Published: January 7, 2026 | Updated: January 7, 2026

Who this is for

Applied Scientists and ML practitioners who build and evaluate models that affect people (recommendations, ranking, classification, content moderation, risk scoring, NLP, vision, and decision support).

Prerequisites

  • Comfort with basic statistics (rates, proportions, confidence intervals).
  • ML evaluation basics (confusion matrix, precision/recall, ROC/PR).
  • Ability to slice data by subgroups (e.g., with Python/Pandas or spreadsheets).

Why this matters

In real products, a model that looks good overall can underperform badly for specific groups. As an Applied Scientist, you will be asked to:

  • Audit models for group disparities before launch.
  • Pick fairness metrics that match the harm you want to prevent.
  • Tune thresholds or post-process predictions to reduce gaps.
  • Write clear, reproducible fairness reports for stakeholders.
  • Monitor fairness drift after deployment.

Concept explained simply

Bias and fairness evaluation checks if a model’s errors or decisions are inconsistent across groups (e.g., age, gender, region, device type, language). Think of it as reliability across slices.

Mental model

Picture a pipeline: Data → Labels → Model → Thresholds → Decisions → Impact. Gaps can appear at any step. Fairness is about defining the harm to avoid, choosing a compatible metric, and showing disparities with uncertainty.

Common fairness goal types

  • Demographic Parity: Similar positive decision rates across groups (ignores labels). Use when access should be independent of outcome prevalence (e.g., exposure in recommendations).
  • Equal Opportunity: Similar true positive rates (TPR) across groups. Use when missing true positives hurts (e.g., qualified candidates being overlooked).
  • Equalized Odds: Similar TPR and FPR across groups. Stronger but harder to satisfy.
  • Calibration/Predictive Parity: Same predicted score reliability across groups (e.g., among 0.7-score items, ~70% should be positive in each group).
  • Error Parity (regression): Similar mean absolute error or error variance across groups.

How to pick a metric
  • If harm = unequal access/exposure → Demographic Parity.
  • If harm = missing qualified positives → Equal Opportunity.
  • If harm = unequal false alarms and misses → Equalized Odds.
  • If harm = scores not meaning the same across groups → Calibration.

How to evaluate (step-by-step)

  1. Define groups: Choose sensitive or relevant slices (and key intersections, e.g., gender × age).
  2. Pick the harm and the matching metric (see list above).
  3. Compute metrics per group: e.g., selection rate, TPR, FPR, precision, calibration error, MAE.
  4. Compare disparities: gaps (difference) and ratios (e.g., four-fifths rule: lowest group rate / highest group rate ≥ 0.8).
  5. Add uncertainty: bootstrap 95% CIs; flag gaps that are large and/or consistent.
  6. Mitigate: options include data balancing/relabeling, per-group thresholds, post-processing (e.g., equalized odds), or model retraining.
  7. Report and monitor: summarize metrics, disparities, CIs, mitigation steps, and a plan to track drift.
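The per-group computations in steps 3–4 can be sketched in plain Python. The group names and confusion-matrix counts below are illustrative (they reuse the numbers from the worked examples), not output from a real model:

```python
# Per-group fairness metrics from confusion-matrix counts.
# Counts are illustrative, not from a real dataset.
counts = {
    "A": {"tp": 45, "fn": 15, "fp": 20, "tn": 120},
    "B": {"tp": 18, "fn": 12, "fp": 10, "tn": 60},
}

def rates(c):
    tpr = c["tp"] / (c["tp"] + c["fn"])           # true positive rate
    fpr = c["fp"] / (c["fp"] + c["tn"])           # false positive rate
    n = sum(c.values())
    selection = (c["tp"] + c["fp"]) / n           # positive decision rate
    return tpr, fpr, selection

tpr_a, fpr_a, sel_a = rates(counts["A"])
tpr_b, fpr_b, sel_b = rates(counts["B"])

eo_gap = abs(tpr_a - tpr_b)                       # Equal Opportunity gap
di_ratio = min(sel_a, sel_b) / max(sel_a, sel_b)  # disparate impact ratio
print(f"EO gap={eo_gap:.2f}, DI ratio={di_ratio:.2f}")
```

The same pattern extends to any slicing: group rows by the sensitive attribute, compute the rate per group, then report both the difference and the ratio.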

Notes on small groups

If a group has very few samples, metrics can be unstable. Aggregate over time, widen intervals, or set a minimum sample size before enforcement.
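A percentile bootstrap is one simple way to attach an interval to a per-group rate. This sketch uses synthetic per-example outcomes (1 = caught positive, 0 = missed positive) for a single group; the sample size and outcomes are assumptions for illustration:

```python
import random

random.seed(0)

# Synthetic outcomes for one group's positives: observed TPR = 45/60 = 0.75.
outcomes = [1] * 45 + [0] * 15

def bootstrap_ci(samples, n_boot=2000, alpha=0.05):
    """Percentile bootstrap confidence interval for a proportion."""
    stats = []
    for _ in range(n_boot):
        resample = [random.choice(samples) for _ in samples]
        stats.append(sum(resample) / len(resample))
    stats.sort()
    lo = stats[int(alpha / 2 * n_boot)]
    hi = stats[int((1 - alpha / 2) * n_boot) - 1]
    return lo, hi

lo, hi = bootstrap_ci(outcomes)
print(f"TPR 95% CI: [{lo:.2f}, {hi:.2f}]")
```

Notice how wide the interval is with only 60 positives; for much smaller groups it widens further, which is why a minimum sample size before enforcement is a sensible guard.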

Worked examples

Example 1 — Equal Opportunity

We evaluate a binary classifier across two groups.

  • Group A: TP=45, FN=15, FP=20, TN=120
  • Group B: TP=18, FN=12, FP=10, TN=60

Walkthrough
  • TPR_A = 45 / (45+15) = 0.75
  • TPR_B = 18 / (18+12) = 0.60
  • Equal Opportunity gap = |0.75 − 0.60| = 0.15 → likely unacceptable if tolerance is 0.05.

Mitigation options: lower B’s threshold to increase TPR_B, or retrain with better coverage for B.

Example 2 — Four-fifths rule (Demographic Parity)

  • Group A: 200 candidates, 120 predicted positive → 0.60
  • Group B: 80 candidates, 36 predicted positive → 0.45

Walkthrough
  • Disparate Impact ratio = 0.45 / 0.60 = 0.75
  • Since 0.75 < 0.80, this fails the four-fifths rule → flag.
  • Mitigation: post-process to raise B’s selection rate (or reduce A’s), investigate data imbalance.
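The four-fifths check from this example generalizes to any number of groups by comparing each group's selection rate to the highest one. A minimal helper (group names and counts mirror the example above):

```python
def four_fifths_check(candidates, positives, threshold=0.8):
    """Flag groups whose selection rate falls below `threshold` x the highest rate.

    Returns {group: (ratio_to_highest_rate, passes)}.
    """
    rates = {g: positives[g] / candidates[g] for g in candidates}
    ref = max(rates.values())
    return {g: (r / ref, r / ref >= threshold) for g, r in rates.items()}

result = four_fifths_check(
    candidates={"A": 200, "B": 80},
    positives={"A": 120, "B": 36},
)
print(result)  # B's ratio is 0.75 -> fails the four-fifths rule
```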

Example 3 — Regression error parity

  • Group A: MAE = 4.2
  • Group B: MAE = 6.0

Walkthrough
  • Error gap = 1.8. If product tolerance is 1.0, this is too high.
  • Check if higher error for B comes from fewer training samples or distribution shift; consider group-aware loss or more B data.
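Error parity for regression is just the same slicing applied to a continuous error metric. A sketch with made-up targets and predictions per group:

```python
def mae(y_true, y_pred):
    """Mean absolute error."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)

# Illustrative (target, prediction) data per group, not real model output.
groups = {
    "A": ([10.0, 12.0, 8.0], [11.0, 13.5, 9.0]),
    "B": ([10.0, 12.0, 8.0], [14.0, 6.0, 11.0]),
}
maes = {g: mae(y, p) for g, (y, p) in groups.items()}
gap = abs(maes["A"] - maes["B"])
print(maes, gap)
```

The same helper works for error variance or quantile errors; swap the metric, keep the per-group slicing.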

Exercises (do these now)

These mirror the exercises below. Try without peeking at solutions.

  1. Compute group TPRs and Equal Opportunity gap from confusion counts.
  2. Apply the four-fifths rule to selection rates and decide pass/fail.

  • Checklist before you look at solutions:
    • Did you compute per-group denominators correctly?
    • Did you compare the right rates (difference vs ratio)?
    • Did you state a clear pass/fail rule?

Common mistakes and self-check

  • Using overall accuracy only. Self-check: always include per-group metrics and disparities.
  • Ignoring thresholds. Self-check: report fairness at the actual operating threshold(s).
  • No uncertainty. Self-check: add bootstrap CIs and minimum sample checks.
  • One-size-fits-all metric. Self-check: tie metric to the harm definition.
  • Ignoring intersections. Self-check: audit key intersections where risk is highest.
  • Over-correction. Self-check: check business constraints and measure utility impact.

Quick self-audit before shipping
  • Defined target groups and intersections.
  • Chosen fairness metric tied to a specific harm.
  • Computed per-group metrics, disparities, and CIs.
  • Mitigation tested and documented trade-offs.
  • Monitoring plan with alert thresholds.

Practical projects

  • Fairness slicing notebook: load predictions, compute per-group metrics, gaps, ratios, and bootstrap CIs.
  • Threshold tuning sandbox: per-group thresholds to minimize Equalized Odds gap with minimal utility loss.
  • Calibration by group: reliability plots and Brier score by group; fix with group-wise calibration.
  • Data audit: check label imbalance, sampling bias, and missingness by group; propose fixes.
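The calibration-by-group project can start from a small binning helper like the one below. The equal-width binning scheme is one simple choice, not the only one; run it once per group and compare the resulting curves:

```python
from collections import defaultdict

def calibration_by_bin(scores, labels, n_bins=10):
    """Observed positive rate per score bin, for one group's predictions."""
    bins = defaultdict(lambda: [0, 0])  # bin index -> [positives, count]
    for s, y in zip(scores, labels):
        b = min(int(s * n_bins), n_bins - 1)
        bins[b][0] += y
        bins[b][1] += 1
    # map bin lower edge -> observed positive rate
    return {b / n_bins: pos / cnt for b, (pos, cnt) in sorted(bins.items())}

# Illustrative: among ten items scored 0.75, seven are positive -> observed rate 0.7.
result = calibration_by_bin([0.75] * 10, [1] * 7 + [0] * 3)
print(result)
```

A well-calibrated group has observed rates close to the bin's score; a group whose curve sits systematically above or below the others is a calibration-parity flag.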

Learning path

  1. Identify harms → pick fairness metric.
  2. Compute per-group metrics and disparities with CIs.
  3. Tune thresholds/post-process; document trade-offs.
  4. Improve data/labels; retrain.
  5. Calibrate and re-check fairness.
  6. Set up monitoring and governance (regular audits).

Mini challenge

Your classifier flags content for review. Group X has a much higher false positive rate than Group Y. Users report unfair takedowns.

What would you do?
  • Measure Equalized Odds (FPR and TPR) and calibration by group.
  • If FPR_X is higher, test per-group thresholds or post-processing to align FPR and TPR.
  • Investigate training data and label quality for Group X; retrain if needed.
  • Communicate changes and monitor post-launch.
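The per-group threshold test in the second bullet can be sketched as a simple grid sweep: for the over-flagged group, find the lowest threshold whose FPR stays at or below a target (e.g., the other group's FPR). The scores and labels below are synthetic; in practice this runs on held-out data:

```python
def fpr_at(scores, labels, thr):
    """False positive rate at decision threshold `thr` (labels: 1 = positive)."""
    fp = sum(1 for s, y in zip(scores, labels) if s >= thr and y == 0)
    neg = sum(1 for y in labels if y == 0)
    return fp / neg if neg else 0.0

def pick_threshold(scores, labels, target_fpr):
    """Lowest grid threshold whose FPR does not exceed the target."""
    grid = [i / 100 for i in range(101)]
    return min((t for t in grid if fpr_at(scores, labels, t) <= target_fpr),
               default=1.0)

# Synthetic scores/labels for group X.
scores_x = [0.9, 0.8, 0.7, 0.6, 0.2]
labels_x = [1, 1, 0, 0, 0]
thr_x = pick_threshold(scores_x, labels_x, target_fpr=0.34)
print(thr_x)
```

After picking a candidate threshold, re-check TPR too: raising the threshold to cut false positives also cuts true positives, so the trade-off must be measured, not assumed.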


Practice Exercises

2 exercises to complete

Instructions

You have a binary classifier evaluated on two groups:

  • Group A: TP=45, FN=15, FP=20, TN=120
  • Group B: TP=18, FN=12, FP=10, TN=60

Tasks:

  • Compute TPR for each group.
  • Compute the Equal Opportunity gap (absolute TPR difference).
  • State whether it meets a 0.05 tolerance.

Expected Output
TPR_A=0.75, TPR_B=0.60, EO_gap=0.15, Does not meet 0.05 tolerance.

Bias And Fairness Evaluation — Quick Test

Test your knowledge with 7 questions. Pass with 70% or higher.

