Why this matters
Point estimates like accuracy, precision, recall, F1, MAE, or AUC can look impressive, but they are still estimates. Confidence intervals (CIs) tell you how much your metric could vary if you repeated the evaluation on similar data. As a Data Scientist, you will:
- Decide if Model A truly beats Model B or if the difference is noise.
- Report performance to stakeholders with uncertainty, not just a single number.
- Set realistic expectations when monitoring performance drift in production.
- Choose sample sizes for tests to reach a desired margin of error.
Note about progress: The quick test on this page is available to everyone. Only logged-in users will see saved progress.
Concept explained simply
A confidence interval is a range around your metric estimate that likely contains the true value. For a 95% CI, if you repeated data collection many times, about 95% of such intervals would contain the true metric.
Mental model
Imagine shining a flashlight on your metric estimate. The center of the beam is your estimate; the beam’s width is the uncertainty. More data narrows the beam; noisier data widens it.
Key pieces
- Estimate: the metric you computed on your test data (e.g., accuracy = 0.82).
- Sampling variability: results change with different samples.
- CI construction: method to turn variability into a range (e.g., normal approximation, Wilson, bootstrap).
- Confidence level: usually 90%, 95%, or 99%.
Choosing a CI method
- Simple proportions (accuracy, precision, recall): if n is large and p is not near 0 or 1, use the normal approximation p ± z·sqrt(p(1−p)/n). If n is small or p is near 0/1, prefer the Wilson interval (both are sketched in code after this list).
- F1, AUC, regression metrics (MAE, RMSE): use bootstrap CIs (resample the dataset and recompute the metric many times). For AUC, DeLong or bootstrap is common.
- Comparing two models on the same test set: use paired methods (paired bootstrap, McNemar for accuracy) because predictions are correlated.
- Clustered data (e.g., multiple events per user): use cluster/bootstrap by group to avoid underestimating uncertainty.
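To make the proportion bullets concrete, here is a minimal Python sketch of the normal-approximation and Wilson intervals. The function names are my own choices, not something this lesson prescribes:

```python
import math

def normal_ci(successes: int, n: int, z: float = 1.96) -> tuple[float, float]:
    """Normal-approximation CI for a proportion: p ± z*sqrt(p(1-p)/n)."""
    p = successes / n
    se = math.sqrt(p * (1 - p) / n)
    return p - z * se, p + z * se

def wilson_ci(successes: int, n: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval: adjusted center and half-width,
    more stable for small n or p near 0/1."""
    p = successes / n
    denom = 1 + z**2 / n
    center = (p + z**2 / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z**2 / (4 * n**2)) / denom
    return center - half, center + half

# Small-n illustration: 8 successes out of 10
print(normal_ci(8, 10))  # upper bound exceeds 1.0 -- a sign the normal approximation is breaking down
print(wilson_ci(8, 10))  # stays inside [0, 1]
```

The Wilson center is pulled toward 0.5, which is what keeps the interval inside [0, 1] when n is small.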
Step-by-step: compute a confidence interval
- Define the metric and unit of resampling. Example: accuracy per sample; cluster by user if multiple events per user.
- Pick a CI method. Proportion metrics: normal/Wilson. Complex metrics: bootstrap. AUC: DeLong or bootstrap.
- Pick a confidence level. Commonly 95% (z ≈ 1.96), 90% (z ≈ 1.64), 99% (z ≈ 2.576).
- Compute the interval.
- Normal approx for proportion p: p ± z·sqrt(p(1−p)/n)
- Wilson interval for proportion p̂: center = (p̂ + z²/(2n)) / (1 + z²/n), half-width = z·sqrt(p̂(1−p̂)/n + z²/(4n²)) / (1 + z²/n). More stable at small n or extreme p̂.
- Bootstrap: resample with replacement B times, recompute the metric each time, and take percentiles (e.g., 2.5% and 97.5% for a 95% CI); a sketch follows this list.
- State assumptions and caveats. Independence? Class balance? Clustering? Report method and level.
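For the bootstrap step, a generic percentile-bootstrap helper can look like the sketch below. It is only an illustration: it assumes y_true and y_pred are equal-length arrays and metric_fn is any function of (y_true, y_pred); the name bootstrap_ci is mine.

```python
import numpy as np

def bootstrap_ci(y_true, y_pred, metric_fn, n_boot=2000, alpha=0.05, seed=0):
    """Percentile bootstrap CI: resample (y_true, y_pred) pairs with
    replacement, recompute the metric, and take the alpha/2 and
    1 - alpha/2 percentiles of the resulting distribution."""
    rng = np.random.default_rng(seed)
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    n = len(y_true)
    stats = []
    for _ in range(n_boot):
        idx = rng.integers(0, n, size=n)  # row indices, sampled with replacement
        stats.append(metric_fn(y_true[idx], y_pred[idx]))
    lo, hi = np.percentile(stats, [100 * alpha / 2, 100 * (1 - alpha / 2)])
    return lo, hi
```

Any metric that takes labels and predictions works, e.g. `bootstrap_ci(y_true, y_pred, lambda yt, yp: float(np.mean(yt == yp)))` for accuracy.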
Bootstrap tips (no code needed)
- Use at least 1000 resamples for a stable 95% CI; 2000–5000 is common.
- Always resample at the correct level (e.g., by user for user-level metrics); a cluster-bootstrap sketch follows this list.
- Report the bootstrap type (percentile is simplest).
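When observations are clustered, resample whole groups rather than rows. A minimal sketch, assuming a groups array that gives the cluster ID (e.g., user ID) for each row; the function name is illustrative:

```python
import numpy as np

def cluster_bootstrap_ci(y_true, y_pred, groups, metric_fn,
                         n_boot=2000, alpha=0.05, seed=0):
    """Percentile bootstrap that resamples whole groups (e.g., users)
    with replacement, then pools their rows before recomputing the metric."""
    rng = np.random.default_rng(seed)
    y_true, y_pred, groups = map(np.asarray, (y_true, y_pred, groups))
    unique_groups = np.unique(groups)
    rows_by_group = {g: np.flatnonzero(groups == g) for g in unique_groups}
    stats = []
    for _ in range(n_boot):
        sampled = rng.choice(unique_groups, size=len(unique_groups), replace=True)
        idx = np.concatenate([rows_by_group[g] for g in sampled])
        stats.append(metric_fn(y_true[idx], y_pred[idx]))
    return tuple(np.percentile(stats, [100 * alpha / 2, 100 * (1 - alpha / 2)]))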
Worked examples (with numbers)
Example 1: 95% CI for accuracy (normal approximation)
Suppose your classifier got 820 correct out of 1000 test samples.
- p̂ = 820/1000 = 0.82
- SE = sqrt(p̂(1−p̂)/n) = sqrt(0.82·0.18/1000) ≈ 0.01215
- 95% CI = 0.82 ± 1.96·0.01215 ≈ 0.82 ± 0.0238 → [0.796, 0.844]
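A quick check of this arithmetic (sketch only):

```python
import math

p_hat = 820 / 1000
se = math.sqrt(p_hat * (1 - p_hat) / 1000)    # ≈ 0.01215
print(p_hat - 1.96 * se, p_hat + 1.96 * se)   # ≈ 0.796, 0.844
```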
Example 2: 95% CI for precision with small-ish n (normal approximation shown; Wilson preferred)
TP = 90, predicted positives = 120 ⇒ p̂ = 0.75
- SE = sqrt(0.75·0.25/120) ≈ 0.0395
- 95% CI ≈ 0.75 ± 1.96·0.0395 ≈ 0.75 ± 0.077 → [0.673, 0.827]
- Note: the Wilson interval is a safer choice here because the denominator is only 120 predicted positives; see Exercise 2 to compute one yourself.
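For comparison, both intervals can be computed directly on these numbers (a sketch; the variable names are mine):

```python
import math

p_hat, n, z = 90 / 120, 120, 1.96

# Normal approximation (reproduces the numbers above)
se = math.sqrt(p_hat * (1 - p_hat) / n)       # ≈ 0.0395
print(p_hat - z * se, p_hat + z * se)         # ≈ 0.673, 0.827

# Wilson interval: adjusted center and half-width
denom = 1 + z**2 / n
center = (p_hat + z**2 / (2 * n)) / denom
half = z * math.sqrt(p_hat * (1 - p_hat) / n + z**2 / (4 * n**2)) / denom
print(center - half, center + half)           # center pulled toward 0.5
```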
Example 3: 95% CI for F1 (bootstrap)
Given predictions on a test set, suppose F1 = 0.65. You resample the test set with replacement 2000 times and recompute F1 each time.
- Bootstrap distribution of F1 has 2.5th percentile = 0.61 and 97.5th percentile = 0.69.
- 95% CI (percentile) = [0.61, 0.69]
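One way to run this bootstrap in Python, assuming binary labels and scikit-learn for the F1 computation (scikit-learn is my choice here, not something the lesson requires):

```python
import numpy as np
from sklearn.metrics import f1_score

def bootstrap_f1_ci(y_true, y_pred, n_boot=2000, alpha=0.05, seed=0):
    """Percentile bootstrap CI for F1: resample rows with replacement,
    recompute F1 each time, and take the outer percentiles."""
    rng = np.random.default_rng(seed)
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    n = len(y_true)
    scores = []
    for _ in range(n_boot):
        idx = rng.integers(0, n, size=n)
        scores.append(f1_score(y_true[idx], y_pred[idx]))
    return tuple(np.percentile(scores, [100 * alpha / 2, 100 * (1 - alpha / 2)]))
```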
Example 4: 95% CI for MAE (bootstrap)
On 500 samples, MAE = 12.3. Bootstrap (2000 resamples) yields percentiles 11.5 and 13.2.
- 95% CI = [11.5, 13.2]
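The same pattern applies to MAE. The snippet below fabricates synthetic data purely so the sketch runs end to end; it does not reproduce the numbers in this example.

```python
import numpy as np

rng = np.random.default_rng(0)
# Synthetic stand-in data; replace with your own labels and predictions.
y_true = rng.normal(100, 20, size=500)
y_pred = y_true + rng.normal(0, 15, size=500)

n, boot_maes = len(y_true), []
for _ in range(2000):
    idx = rng.integers(0, n, size=n)          # resample rows with replacement
    boot_maes.append(np.mean(np.abs(y_true[idx] - y_pred[idx])))
print(np.percentile(boot_maes, [2.5, 97.5]))  # percentile 95% CI for MAE
```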
Example 5: Is Model A better than Model B? (paired bootstrap, accuracy)
Both models are evaluated on the same 2000 samples: Acc(A) = 0.84, Acc(B) = 0.81. Run a paired bootstrap: resample the samples with replacement 5000 times, keeping each sample's predictions from both models together, and compute diff = Acc(A) − Acc(B) on each resample.
- Median diff = 0.030
- 95% CI for diff = [0.015, 0.045]
- Since 0 is not inside the CI, Model A is likely better.
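A paired bootstrap keeps both models' predictions for the same resampled rows, so their correlation is preserved. A sketch (names are mine), assuming pred_a and pred_b are label predictions aligned with y_true:

```python
import numpy as np

def paired_bootstrap_acc_diff_ci(y_true, pred_a, pred_b,
                                 n_boot=5000, alpha=0.05, seed=0):
    """Percentile CI for Acc(A) - Acc(B): resample the same rows for both
    models so their errors stay paired, then take the outer percentiles."""
    rng = np.random.default_rng(seed)
    y_true, pred_a, pred_b = map(np.asarray, (y_true, pred_a, pred_b))
    n = len(y_true)
    diffs = []
    for _ in range(n_boot):
        idx = rng.integers(0, n, size=n)
        diffs.append(np.mean(pred_a[idx] == y_true[idx])
                     - np.mean(pred_b[idx] == y_true[idx]))
    return tuple(np.percentile(diffs, [100 * alpha / 2, 100 * (1 - alpha / 2)]))
```

If the resulting interval excludes 0, as in the example above, the difference is unlikely to be noise.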
Exercises (practice what you learned)
These mirror the tasks below in the Exercises panel. Try them here first, then open the solution.
Exercise 1: 99% CI for accuracy (normal approximation)
Your model got 560 correct out of 700. Compute the 99% CI for accuracy using the normal approximation.
Hint
- p̂ = 560/700. Use z = 2.576 for 99%.
- SE = sqrt(p̂(1−p̂)/n)
Exercise 2: 95% CI for precision using Wilson interval
TP = 48 out of 60 predicted positives ⇒ p̂ = 0.8. Compute the 95% Wilson interval.
Hint
- Use z = 1.96; apply Wilson center and half-width formulas.
Self-check checklist
- I selected a CI method that matches my metric and data size.
- I computed or approximated the correct z-score for the chosen confidence level.
- I used the right denominator for the metric (e.g., predicted positives for precision).
- I stated any assumptions (independence, clustering) and method used.
Common mistakes and how to self-check
- Using normal CI for tiny n or extreme p: Switch to Wilson or exact methods.
- Ignoring correlation (same users, multiple events): Use cluster bootstrap to avoid too-narrow CIs.
- Wrong denominator (e.g., using total n for precision): Precision uses predicted positives; recall uses actual positives.
- Over-interpreting CIs: A 95% CI does not mean a 95% probability the true value lies inside this particular interval.
- Comparing models with independent methods: When both models are tested on the same data, use paired methods.
Fast self-audit
- Is the method named in your report (normal/Wilson/bootstrap)?
- Is the confidence level clearly stated (e.g., 95%)?
- Did you check class or group structure that may break independence?
- For comparisons, did you use a paired approach?
Practical projects
- Build a CI dashboard: given predictions and labels, compute metrics and 95% CIs (normal/Wilson/bootstrap). Allow switching confidence levels.
- Paired comparison tool: input two models’ predictions, output CI for the difference in accuracy and F1 via paired bootstrap.
- Cluster-aware evaluation: for user-event data, implement cluster bootstrap for recall@k and report CIs.
Who this is for
- Data Scientists and ML Engineers who report model performance.
- Analysts validating experiments or A/B tests for ML models.
- Students learning evaluation beyond point estimates.
Prerequisites
- Basic probability and proportions (p, n, z-scores).
- Understanding of classification/regression metrics.
- Comfort with spreadsheets or scripting to run bootstrap.
Learning path
- Refresh metric definitions (accuracy, precision, recall, F1, AUC, MAE).
- Learn normal and Wilson CIs for proportions.
- Practice bootstrap CIs for F1 and MAE.
- Do paired comparisons (bootstrap or McNemar for accuracy).
- Handle clustered data with cluster bootstrap.
Next steps
- Apply CIs to your current project metrics and add them to your reports.
- Automate CI computation in your evaluation pipeline.
- Take the quick test below to check understanding.
Mini challenge
You have a fraud model used per customer session (multiple sessions per customer). You need a 95% CI for recall. Outline steps for a cluster bootstrap CI by customer and list what you will report (assumptions, confidence level, number of resamples).
About the quick test
Answer short questions to check your understanding. Everyone can take it; only logged-in users will have results saved.