Why this matters
As a Data Scientist, you build models that inform product decisions, experiments, and forecasts. Statistical assumptions are the guardrails that keep your inferences valid, and diagnostics are how you verify those guardrails are holding up. Skipping them can lead to wrong conclusions, wasted budget, and faulty product changes.
- Real task: Validate an A/B test where group variances differ and traffic is time-dependent.
- Real task: Ship a regression model despite multicollinearity and outliers, without overstating feature importance.
- Real task: Check calibration and discrimination of a churn classifier before rollout.
Who this is for and prerequisites
Who this is for
- Early-career Data Scientists and Analysts who build and evaluate models.
- Engineers and Researchers who run experiments or predictive models.
Prerequisites
- Comfort with basic probability and distributions.
- Know linear and logistic regression at a basic level.
- Know hypothesis testing (t-test/ANOVA) basics.
Concept explained simply
Assumptions are the conditions under which a method's math holds. Diagnostics are tests, plots, and checks that tell you whether those conditions are approximately true for your data.
Mental model: Treat your analysis like a vehicle. Assumptions are the safety rules (seatbelt, speed limit). Diagnostics are the dashboard sensors (fuel gauge, engine light). You don't need perfection, but you must stay within safe ranges and know when to slow down or change route.
Assumption checklist by common methods
Linear regression (OLS)
- Linearity: Relationship between predictors and outcome is approximately linear.
- Independence: Errors are independent (no autocorrelation).
- Homoscedasticity: Constant error variance across predictions.
- Normality of errors: For valid t-tests/intervals with small samples.
- No high multicollinearity: Predictors not nearly linear combinations of each other.
- No high-influence anomalies: Outliers/leverage points not dominating fit.
Logistic regression
- Correct link and specification (logit for binary outcome).
- Independent observations (unless modeled otherwise).
- No extreme separation (or use remedies like regularization/Firth).
- Reasonable multicollinearity levels.
- Adequate calibration and discrimination.
t-tests and ANOVA
- Independence between observations.
- Normality of group residuals (especially at small n).
- Equal variances across groups (for classic tests; Welch handles inequality).
Time series models
- Stationarity (or modeled trends/seasonality).
- Residuals uncorrelated and roughly homoscedastic.
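For instance, a minimal sketch of both checks in Python, assuming statsmodels is available; the series here is a synthetic random walk standing in for real data:

```python
# Sketch: stationarity and residual-autocorrelation checks for a time series.
import numpy as np
from statsmodels.tsa.stattools import adfuller
from statsmodels.stats.diagnostic import acorr_ljungbox

rng = np.random.default_rng(0)
y = rng.standard_normal(200).cumsum()  # random walk: non-stationary by construction

# Augmented Dickey-Fuller: a small p-value is evidence of stationarity.
adf_stat, adf_p, *_ = adfuller(y)
print(f"ADF p-value: {adf_p:.3f}")  # expect a large p-value for this series

# Differencing is one common remedy; the differences stand in for model residuals.
residuals = np.diff(y)

# Ljung-Box: large p-values suggest no leftover autocorrelation in residuals.
print(acorr_ljungbox(residuals, lags=[1, 12], return_df=True))
```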
Diagnostics toolbox
- Plots: Residuals vs fitted, Scale–Location, QQ plot, leverage/Cook's distance, calibration curve, ROC, ACF/PACF.
- Tests: Breusch–Pagan/White (heteroscedasticity), Durbin–Watson (autocorrelation), Shapiro–Wilk (normality), Levene/Brown–Forsythe (variance), Hosmer–Lemeshow (calibration).
- Stats: VIF for multicollinearity; Brier score for calibration; AUC/PR for discrimination.
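A minimal sketch of how the tests and stats map to library calls, assuming Python with statsmodels and SciPy; the data and model below are synthetic stand-ins:

```python
# Sketch: core diagnostic tests on a fitted OLS model.
import numpy as np
import statsmodels.api as sm
from scipy import stats
from statsmodels.stats.diagnostic import het_breuschpagan
from statsmodels.stats.outliers_influence import variance_inflation_factor
from statsmodels.stats.stattools import durbin_watson

rng = np.random.default_rng(42)
X = sm.add_constant(rng.normal(size=(200, 2)))          # intercept + 2 predictors
y = X @ np.array([1.0, 2.0, -1.0]) + rng.normal(size=200)
model = sm.OLS(y, X).fit()
resid = model.resid

bp_stat, bp_p, _, _ = het_breuschpagan(resid, X)        # heteroscedasticity
dw = durbin_watson(resid)                               # autocorrelation; ~2 means none
sw_stat, sw_p = stats.shapiro(resid)                    # residual normality
vifs = [variance_inflation_factor(X, i)                 # multicollinearity (skip intercept)
        for i in range(1, X.shape[1])]
print(f"BP p={bp_p:.3f}  DW={dw:.2f}  Shapiro p={sw_p:.3f}  VIFs={np.round(vifs, 2)}")
```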
How to run diagnostics (practical steps)
- Fit baseline: Start with a simple, interpretable model. Save residuals/predicted values.
- Visual triage: Residuals vs fitted, QQ plot. Look for patterns/funnels/heavy tails.
- Targeted tests: Based on visuals, run heteroscedasticity tests, Durbin–Watson, Shapiro–Wilk, Levene, etc.
- Influence checks: Leverage, Cook's distance. Investigate data quality on flagged points.
- Collinearity: Compute VIF. Address with feature engineering or regularization.
- Model suitability: For classification, examine calibration and AUC/PR; for time series, check ACF/PACF of residuals.
- Remedies: Transform variables, add interactions, use robust/clustered SEs, regularize, or switch models. Re-run diagnostics.
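The sketch below covers the visual-triage and influence steps, reusing the toy `model` from the toolbox snippet; the 4/n cutoff for Cook's distance is a common rule of thumb, not a hard threshold:

```python
# Sketch: visual triage plus influence checks on a fitted statsmodels OLS result.
import matplotlib.pyplot as plt
import numpy as np
import statsmodels.api as sm

fitted, resid = model.fittedvalues, model.resid

fig, axes = plt.subplots(1, 2, figsize=(10, 4))
axes[0].scatter(fitted, resid, alpha=0.5)               # look for funnels or curvature
axes[0].axhline(0, color="red", linestyle="--")
axes[0].set(xlabel="Fitted", ylabel="Residual", title="Residuals vs fitted")
sm.qqplot(resid, line="45", fit=True, ax=axes[1])       # heavy tails bend at the ends
plt.tight_layout()
plt.show()

# Influence: flag points with Cook's distance above the 4/n rule of thumb.
cooks_d = model.get_influence().cooks_distance[0]
print("High-influence rows:", np.flatnonzero(cooks_d > 4 / len(cooks_d)))
```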
Worked examples
Example 1: Linear regression with issues
Scenario: Predicting revenue from ad spend and season. Diagnostics show: funnel-shaped residuals, Durbin–Watson = 1.1, VIF for two spend channels = 9.5, two points with Cook's D > 0.5.
- Interpretation: Heteroscedasticity, positive autocorrelation, multicollinearity, influential points.
- Remedies: log-transform revenue or use robust SEs; model autocorrelation (e.g., include lagged residuals or move to time-series regression); combine correlated channels or regularize; investigate and possibly winsorize or correct data issues for influential points.
- Re-check: After fixes, residuals random around zero, DW ≈ 2, VIF < 5.
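A hedged sketch of two of the remedies (log-transforming revenue plus HAC standard errors, which guard inference against both heteroscedasticity and autocorrelation); the column names and data are hypothetical stand-ins for the scenario:

```python
# Sketch: log-transformed response + Newey-West (HAC) standard errors.
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(7)
n = 150
spend = rng.uniform(1, 10, n)
df = pd.DataFrame({
    "spend_a": spend,
    "spend_b": spend + rng.normal(scale=0.5, size=n),   # two correlated channels
})
df["revenue"] = np.exp(0.5 + 0.2 * df["spend_a"] + rng.normal(scale=0.3, size=n))

# log(revenue) tames the funnel shape; the HAC covariance handles leftover
# heteroscedasticity and autocorrelation (maxlags is a judgment call).
fit = smf.ols("np.log(revenue) ~ spend_a + spend_b", data=df).fit(
    cov_type="HAC", cov_kwds={"maxlags": 2})
print(fit.summary().tables[1])
```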
Example 2: Two-sample test under unequal variances
Scenario: Compare conversion rates (as continuous proxy) between variants with differing variance. Levene's test p = 0.01.
- Interpretation: Variances unequal; classic pooled t-test invalid.
- Remedies: Use Welch's t-test. If heavy non-normality and small n, use Mann–Whitney as a robustness check.
- Decision: Report Welch's estimate and CI; confirm with a bootstrap CI.
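A minimal sketch of that decision path, with toy per-user outcomes standing in for the two variants:

```python
# Sketch: Welch's t-test, Mann-Whitney robustness check, and a bootstrap CI.
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
a = rng.normal(0.10, 0.02, 500)   # variant A: smaller variance
b = rng.normal(0.11, 0.05, 500)   # variant B: larger variance

print(f"Levene p = {stats.levene(a, b).pvalue:.3f}")             # unequal variances?
print(f"Welch  p = {stats.ttest_ind(a, b, equal_var=False).pvalue:.3f}")
print(f"M-W    p = {stats.mannwhitneyu(a, b).pvalue:.3f}")       # robustness check

# Percentile bootstrap CI for the difference in means.
boot = stats.bootstrap((a, b), lambda x, y: np.mean(x) - np.mean(y),
                       n_resamples=5000, vectorized=False,
                       method="percentile", random_state=0)
print("Bootstrap 95% CI:", boot.confidence_interval)
```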
Example 3: Logistic regression diagnostics
Scenario: Churn model. AUC = 0.83, Brier score = 0.17, calibration curve underpredicts high-risk customers. Some complete separation on a rare feature.
- Interpretation: Good discrimination, calibration drift at high risk, potential separation.
- Remedies: Apply calibration (Platt scaling or isotonic), consider Firth or L2 regularization for separation, review rare feature encoding.
- Re-check: Improved Brier, calibration curve close to diagonal; coefficients stable.
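A sketch of the discrimination and calibration checks with scikit-learn, run on synthetic, deliberately miscalibrated probabilities:

```python
# Sketch: AUC (discrimination), Brier score, and a reliability (calibration) curve.
import numpy as np
from sklearn.calibration import calibration_curve
from sklearn.metrics import brier_score_loss, roc_auc_score

rng = np.random.default_rng(3)
p_hat = rng.uniform(0, 1, 2000)                        # predicted probabilities
y_true = (rng.uniform(0, 1, 2000) < p_hat**1.3).astype(int)  # miscalibrated on purpose

print(f"AUC   = {roc_auc_score(y_true, p_hat):.3f}")
print(f"Brier = {brier_score_loss(y_true, p_hat):.3f}")

# A well-calibrated model puts observed frequency ~= mean prediction in each bin.
frac_pos, mean_pred = calibration_curve(y_true, p_hat, n_bins=10)
for mp, fp in zip(mean_pred, frac_pos):
    print(f"  predicted {mp:.2f} -> observed {fp:.2f}")
```

If the curve drifts from the diagonal, scikit-learn's CalibratedClassifierCV (sigmoid or isotonic) is one way to refit the probabilities.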
Hands-on exercises
Try the exercise below. Then compare with the provided solution.
- Checklist before you answer:
- State which assumptions are violated.
- List at least three concrete remedies.
- Mention how you would re-check after fixes.
Common mistakes and self-check
- Mistake: Treating normality of residuals as required for unbiased coefficients in OLS. Self-check: It's needed mainly for small-sample inference; exogeneity is key for unbiasedness.
- Mistake: Ignoring autocorrelation in time-ordered data. Self-check: Always examine residual ACF/Durbin–Watson when data are sequential.
- Mistake: Dropping variables due to high p-values without checking multicollinearity. Self-check: Inspect VIF first; consider regularization.
- Mistake: Optimizing AUC only, ignoring calibration. Self-check: Inspect calibration curves/Brier score.
- Mistake: Deleting outliers blindly. Self-check: Investigate data quality; prefer robust methods or justified winsorization.
Practical projects
- Retail demand regression: Diagnose and fix heteroscedasticity and multicollinearity; compare OLS vs. OLS with robust SE vs. Ridge.
- Churn classifier: Evaluate discrimination and calibration; apply calibration method and measure improvement.
- A/B analysis: Simulate non-constant variance and autocorrelation; compare classic t-test vs. Welch vs. block/cluster-robust SEs.
Learning path
- Review regression assumptions and residual plots.
- Learn heteroscedasticity and autocorrelation tests.
- Practice VIF and influence diagnostics; try regularization.
- Expand to classification calibration and time-series residual checks.
- Consolidate with a mini project and quick test.
Next steps
- Run diagnostics on one of your past analyses; document issues and fixes.
- Adopt a standard diagnostic checklist for every model.
- Compare conclusions with and without appropriate fixes.
Quick test
Take the quick test to check understanding.
Mini challenge
You inherit a model predicting weekly sales. Residual vs fitted shows a clear wave pattern; ACF has significant spikes at lags 1 and 52; VIFs are all below 3. In one paragraph, propose your next three actions and how you will verify improvements.