Why this matters
Linear models are the workhorse of applied machine learning. As a Data Scientist, you will use them to quickly establish baselines, interpret feature effects, and ship reliable predictions when data is limited or stakeholders need transparency.
- Forecasting: sales, demand, conversion rate (linear regression).
- Binary outcomes: churn, default, click/no-click (logistic regression).
- Operational analytics: fast models with clear coefficient-based explanations.
Who this is for
- Aspiring and junior Data Scientists who need solid, explainable baselines.
- Analysts moving into predictive modeling.
- Engineers who want practical ML foundations.
Prerequisites
- Comfort with basic algebra and averages/variances.
- Python or R familiarity helps, but examples are language-agnostic.
- Know train/validation/test splits and why they matter.
Concept explained simply
Linear regression predicts a continuous value by adding up weighted features plus an intercept. Example: price = 60 + 20×size.
Logistic regression predicts the probability of a class via a sigmoid of a linear score: probability = 1 / (1 + exp(−(w·x + b))). If probability ≥ threshold (often 0.5), predict class 1.
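To make the formulas concrete, here is a minimal sketch in Python (NumPy assumed; the weights and inputs are made up for illustration):

```python
import numpy as np

def predict_linear(x, w, b):
    # Linear regression: weighted sum of features plus an intercept
    return np.dot(x, w) + b

def predict_proba(x, w, b):
    # Logistic regression: sigmoid of the linear score gives P(class = 1)
    return 1.0 / (1.0 + np.exp(-(np.dot(x, w) + b)))

# Hypothetical weights: price = 60 + 20 * size (size in hundreds of sqft)
print(predict_linear(np.array([12.0]), np.array([20.0]), 60.0))   # 300.0

# Hypothetical logistic weights; classify at a 0.5 threshold
p = predict_proba(np.array([1.0, 2.0]), np.array([0.3, -0.5]), 0.1)
print(round(float(p), 3), int(p >= 0.5))                          # ~0.354, 0
```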
Regularization helps generalization by shrinking coefficients:
- L2 (Ridge): pushes coefficients toward zero smoothly; works well when there are many correlated features.
- L1 (Lasso): can drive some coefficients exactly to zero; useful for feature selection.
Mental model
- Regression: fit a straight line (or plane) that best passes through the cloud of points.
- Classification: fit a straight boundary that separates classes; the sigmoid maps distance from the boundary to probability.
- Regularization: a gentle elastic band (L2) or a sharp tug (L1) that prevents wild coefficients.
Worked examples
Example 1 — Univariate linear regression by hand
Dataset (x in hundreds of sqft, y in $k):
- (6, 180), (8, 220), (10, 260), (14, 340)
Means: x̄ = 9.5, ȳ = 250. Slope w1 = Σ(x−x̄)(y−ȳ) / Σ(x−x̄)² = 700/35 = 20 (the same ratio as cov(x,y)/var(x)). Intercept b = ȳ − w1×x̄ = 250 − 20×9.5 = 60.
Model: y = 60 + 20×x. Prediction for x = 12 → ŷ = 60 + 240 = 300.
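A quick sketch to verify the by-hand arithmetic (NumPy assumed):

```python
import numpy as np

x = np.array([6.0, 8.0, 10.0, 14.0])        # hundreds of sqft
y = np.array([180.0, 220.0, 260.0, 340.0])  # price in $k

# Least-squares slope and intercept for a single feature
w1 = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
b = y.mean() - w1 * x.mean()

print(w1, b)         # 20.0 60.0
print(b + w1 * 12)   # 300.0 — prediction for x = 12
```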
Example 2 — Logistic regression probability
Model: logit(p) = −4 + 0.1×usage_hours + 0.6×tickets.
For usage=20, tickets=3: logit = −4 + 2 + 1.8 = −0.2 → p = 1/(1+exp(0.2)) ≈ 0.45.
At threshold 0.5, predict class 0; with threshold 0.4, predict class 1.
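The same calculation as a short sketch (plain Python; only the coefficients from the example are used):

```python
import math

# Coefficients from the example model
intercept, w_usage, w_tickets = -4.0, 0.1, 0.6

logit = intercept + w_usage * 20 + w_tickets * 3   # -0.2
p = 1 / (1 + math.exp(-logit))                     # ≈ 0.45
print(round(p, 3))
print("class 1" if p >= 0.5 else "class 0")        # class 0 at threshold 0.5
print("class 1" if p >= 0.4 else "class 0")        # class 1 at threshold 0.4
```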
Example 3 — Ridge vs Lasso with correlated features
Suppose two highly correlated features (x1, x2). An unregularized model can show unstable coefficients, e.g., w = [50, −48].
- Ridge (L2) with a moderate penalty yields smaller, more stable coefficients, e.g., [10, 8].
- Lasso (L1) may set one to zero, e.g., [0, 17], effectively selecting features.
Both reduce variance, but Lasso also performs feature selection.
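A hedged sketch of this comparison with scikit-learn; the synthetic data and alpha values are assumptions, so the exact coefficients will differ from the illustrative numbers above:

```python
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge, Lasso

rng = np.random.default_rng(0)
n = 200
x1 = rng.normal(size=n)
x2 = x1 + rng.normal(scale=0.05, size=n)           # near-duplicate of x1
X = np.column_stack([x1, x2])
y = 2.0 * x1 + rng.normal(scale=0.5, size=n)

for name, model in [("OLS  ", LinearRegression()),
                    ("Ridge", Ridge(alpha=1.0)),
                    ("Lasso", Lasso(alpha=0.1))]:
    model.fit(X, y)
    print(name, np.round(model.coef_, 2))
# Typical pattern: OLS splits the effect unstably between x1 and x2,
# Ridge shrinks and shares it, Lasso often zeroes one of the two.
```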
How to fit and evaluate (step-by-step)
- Define target and features. Remove obvious leakage (future or post-outcome features).
- Split data: train/validation/test (e.g., 60/20/20). For time series, split chronologically.
- Scale numeric features (standardize), especially when using regularization.
- Fit a baseline model without regularization and record its metrics.
- Tune the regularization strength (alpha/λ) via cross-validation and compare validation metrics; a pipeline sketch follows this list.
- Check residual plots (regression) or calibration/PR curves (classification).
- Refit on train+validation with best hyperparameters, then evaluate on test once.
- Package model with the same preprocessing steps for deployment.
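Here is the pipeline sketch referenced above, covering scaling, tuning with cross-validation, and the single final test evaluation (scikit-learn assumed; X, y and the alpha grid are hypothetical):

```python
import numpy as np
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import Ridge
from sklearn.metrics import mean_absolute_error

# Hypothetical data: replace with your real features and target
rng = np.random.default_rng(42)
X = rng.normal(size=(500, 8))
y = X @ rng.normal(size=8) + rng.normal(scale=0.5, size=500)

# Hold out the test set once; tune on the rest with cross-validation
X_trainval, X_test, y_trainval, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)

pipe = Pipeline([("scale", StandardScaler()),   # scale before regularization
                 ("model", Ridge())])
search = GridSearchCV(pipe,
                      {"model__alpha": [0.01, 0.1, 1.0, 10.0]},
                      cv=5, scoring="neg_mean_absolute_error")
search.fit(X_trainval, y_trainval)              # refits the best pipeline on train+validation

print("best alpha:", search.best_params_)
print("test MAE :", mean_absolute_error(y_test, search.predict(X_test)))
```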
Common mistakes and self-check
- Not scaling features before L1/L2 → the penalty depends on feature scale rather than importance.
- Data leakage (using target or future info) → overly optimistic metrics.
- Using accuracy on imbalanced data → prefer Precision/Recall, PR AUC, or F1 (a short sketch follows this list).
- Interpreting raw coefficients without noting feature scale or encoding.
- Ignoring multicollinearity → unstable, high-variance coefficients.
- Forgetting the intercept or mishandling dummy variables (dummy trap).
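The short sketch referenced above, showing why accuracy can look good on imbalanced data while PR AUC tells a different story (synthetic data; the class balance is an assumption):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, average_precision_score

# Synthetic, heavily imbalanced problem (~5% positives)
X, y = make_classification(n_samples=5000, n_features=20, weights=[0.95],
                           random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

clf = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
proba = clf.predict_proba(X_te)[:, 1]

print("predict-all-negative accuracy:", round(1 - y_te.mean(), 3))   # already high
print("model accuracy:", round(accuracy_score(y_te, clf.predict(X_te)), 3))
print("model PR AUC  :", round(average_precision_score(y_te, proba), 3))
```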
Self-check
- Can you explain what a one-unit increase in a standardized feature means for the outcome?
- Have you checked for residual heteroscedasticity (variance vs. fitted values)? A plotting sketch follows this list.
- Are your predicted probabilities reasonably calibrated?
- Do coefficients change drastically across folds? Consider more regularization or feature pruning.
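The plotting sketch for the residual check (matplotlib and scikit-learn assumed; the data is synthetic with deliberately non-constant noise):

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression

# Synthetic data whose noise grows with x, to make the pattern visible
rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(300, 1))
y = 3 * X[:, 0] + rng.normal(scale=0.5 + 0.3 * X[:, 0])

model = LinearRegression().fit(X, y)
fitted = model.predict(X)
residuals = y - fitted

# A funnel shape here suggests heteroscedastic residuals
plt.scatter(fitted, residuals, alpha=0.5)
plt.axhline(0, color="red", linestyle="--")
plt.xlabel("Fitted values")
plt.ylabel("Residuals")
plt.show()
```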
Practical projects
- Regression: Predict weekly sales using price, promotions, and seasonality features. Evaluate with MAE and MAPE.
- Classification: Predict churn using activity metrics. Track PR AUC and calibration; choose an operating threshold for business cost.
- Regularization: High-dimensional (e.g., text-encoded) features → compare OLS, Ridge, and Lasso; plot coefficient paths as regularization increases.
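For the regularization project's coefficient paths, a minimal sketch (synthetic data; the alpha grid is illustrative):

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for a high-dimensional regression problem
X, y = make_regression(n_samples=200, n_features=30, n_informative=5,
                       noise=10.0, random_state=0)
X = StandardScaler().fit_transform(X)

alphas = np.logspace(-2, 2, 50)
coefs = [Lasso(alpha=a, max_iter=10000).fit(X, y).coef_ for a in alphas]

# One line per coefficient; stronger penalties drive more of them to exactly zero
plt.plot(alphas, coefs)
plt.xscale("log")
plt.xlabel("alpha (regularization strength)")
plt.ylabel("coefficient value")
plt.show()
```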
Exercises
Do these before the quick test. They mirror the graded exercises below.
ex1 — Compute predictions and MSE (linear regression)
Model: y = 5 + 2×x1 − 3×x2
Rows:
- r1: x1=4, x2=1, y=12
- r2: x1=0, x2=2, y=−1
- r3: x1=1.5, x2=−1, y=12
Task: Compute predictions for each row and the MSE across all three rows.
ex2 — Probability and threshold (logistic regression)
Model: logit(p) = −2 + 0.8×score − 1.2×is_premium + 0.02×age_scaled
Sample: score=3, is_premium=0, age_scaled=20
Task: Compute p and the prediction at thresholds 0.7 and 0.6.
Checklist:
- Did you show intermediate steps?
- Did you round probabilities sensibly (2–3 decimals)?
- Did you verify that the predicted class changes when the threshold changes?
Learning path
- Before: Data cleaning, feature encoding, train/validation/test splitting.
- Now: Linear and logistic regression with L1/L2, evaluation, and interpretation.
- Next: Nonlinear models (trees, ensembles), calibration, and model monitoring.
Mini challenge
You have 2,000 features with many near-duplicates and a small dataset. Build a robust baseline that generalizes well and surfaces key drivers.
- Which regularization will you start with and why?
- How will you tune the penalty and evaluate stability across folds?
- What metric(s) will you report if the positive class is rare?
Suggested approach
- Start with standardized features and Lasso to reduce dimensionality; compare to Ridge (sketched in code below).
- Use cross-validation to choose λ; track coefficient stability and PR AUC for rare positives.
- Report calibration and a decision threshold aligned with business cost.
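A code sketch of this approach (scikit-learn assumed; the synthetic data, class balance, and C grid are stand-ins for your real problem):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, StratifiedKFold

# Synthetic stand-in: small n, many redundant features, rare positives
X, y = make_classification(n_samples=400, n_features=500, n_informative=10,
                           n_redundant=200, weights=[0.9], random_state=0)

pipe = Pipeline([("scale", StandardScaler()),
                 ("model", LogisticRegression(penalty="l1", solver="liblinear"))])
search = GridSearchCV(pipe,
                      {"model__C": [0.01, 0.1, 1.0, 10.0]},  # C is inverse penalty strength
                      cv=StratifiedKFold(5),
                      scoring="average_precision")            # PR AUC for rare positives
search.fit(X, y)

n_nonzero = int(np.sum(search.best_estimator_.named_steps["model"].coef_ != 0))
print("best C:", search.best_params_, "| CV PR AUC:", round(search.best_score_, 3))
print("non-zero coefficients:", n_nonzero, "of", X.shape[1])
```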
Next steps
- Try polynomial features and interaction terms; compare performance and interpretability (a brief sketch follows this list).
- Plot learning curves to diagnose bias vs. variance and adjust regularization accordingly.
- Document model assumptions, metrics, and chosen threshold for stakeholders.
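The brief sketch referenced above, comparing a plain linear pipeline with one that adds degree-2 polynomial and interaction terms (synthetic data; degree and alpha are illustrative):

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import PolynomialFeatures, StandardScaler
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_score

X, y = make_regression(n_samples=300, n_features=5, noise=15.0, random_state=0)

linear = Pipeline([("scale", StandardScaler()),
                   ("model", Ridge(alpha=1.0))])
poly = Pipeline([("poly", PolynomialFeatures(degree=2, include_bias=False)),
                 ("scale", StandardScaler()),
                 ("model", Ridge(alpha=1.0))])

for name, pipe in [("linear only", linear), ("degree-2 + interactions", poly)]:
    scores = cross_val_score(pipe, X, y, cv=5, scoring="neg_mean_absolute_error")
    print(name, "CV MAE:", round(-scores.mean(), 2))
```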
Ready to test yourself?
Take the quick test below. It is available to everyone; only logged-in users get saved progress.