Why this matters
As an Applied Scientist, you must decide when a classical model (like logistic regression or gradient-boosted trees) beats a deep learning approach, and when the opposite is true. Making the right choice saves compute and time, reduces risk, and maximizes business impact.
- Ship a baseline fast for A/B tests and iterate safely.
- Hit latency and memory budgets in production.
- Choose models that match data type: tabular, text, images, time series, or multimodal.
- Balance accuracy with interpretability and maintenance costs.
Concept explained simply
Simple view: pick the smallest model family that can capture the needed patterns in your data under your constraints.
Rules of thumb
- Tabular data: start with linear models and tree ensembles. Deep learning rarely pays off unless you have very large datasets or multimodal inputs.
- Images and audio: start with transfer learning from pretrained deep models.
- Text: small data and tight latency → linear on TF-IDF or frozen embeddings. Moderate to large data → fine-tune a compact transformer.
- Time series: start with naïve/seasonal baselines and classical forecasting; add trees/boosting with engineered lags; reserve deep models for many related series and long horizons.
Quick selection framework
Constraints first:
- Latency/throughput budgets (e.g., P99 < 50 ms on CPU)
- Model size/memory (e.g., < 50 MB)
- Training time/budget
- Interpretability/regulatory needs
Data characteristics:
- Type: tabular, text, image, audio, time series
- Label volume/quality
- Imbalance/severity of errors
- Shift risk: domain drift, seasonality
Baseline ladder (see the sketch after this list):
- Uniform baselines: majority class, seasonal naïve, random
- Classical strong baselines: logistic/linear models, gradient boosting
- Transfer-learned baseline for unstructured data
Fair evaluation:
- Clear metric aligned to cost (e.g., PR AUC for rare positives)
- Stable splits or CV; 2–3 random seeds
- Budget-equalized tuning
Escalate capacity only with evidence:
- Show a measured accuracy gap and error analysis suggesting more capacity is needed
- Check the cost/latency impact and maintainability before committing
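To make the ladder concrete, here is a minimal sketch in Python with scikit-learn. It assumes a binary tabular task with a feature matrix `X` and labels `y` already loaded; the specific estimators and settings are illustrative, not prescriptive.

```python
# Baseline ladder sketch: uniform baseline, linear model, tree ensemble,
# all scored with PR AUC on the same stratified split.
# Assumes X (feature matrix) and y (0/1 labels) already exist.
from sklearn.dummy import DummyClassifier
from sklearn.ensemble import HistGradientBoostingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import average_precision_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42)

ladder = {
    "majority": DummyClassifier(strategy="most_frequent"),
    "logistic": make_pipeline(
        StandardScaler(),
        LogisticRegression(max_iter=1000, class_weight="balanced")),
    "gbt": HistGradientBoostingClassifier(random_state=42),
}
for name, model in ladder.items():
    model.fit(X_train, y_train)
    scores = model.predict_proba(X_val)[:, 1]
    print(f"{name}: PR AUC = {average_precision_score(y_val, scores):.3f}")
```

Escalate to a deeper model only if a rung of this ladder leaves a measured, business-relevant gap.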
When deep learning is worth it
- Unstructured inputs (images/audio/text) with pretrained models available
- Large labeled datasets or strong self-supervised representations
- Multimodal signals and complex feature interactions that tree ensembles miss
- Clear accuracy gap that matters to the business
When classical wins
- Structured/tabular data with limited samples
- Tight latency/memory budgets
- Need for simple explanations and calibration
- Rapid iteration and low maintenance risk
Worked examples
1) Customer churn (tabular, imbalanced)
- Setup: 120 features, 200k rows, 1:20 positive rate, weekly batch scoring.
- Constraints: interpretability helpful; latency not strict.
- Plan: start with logistic regression + class weights; then gradient-boosted trees (GBT). Metrics: PR AUC and recall at a fixed precision. Use stratified 5-fold CV; calibrate with isotonic regression if needed (sketch below).
- Decision: choose GBT if it beats logistic regression by a meaningful margin and survives calibration/shift checks. Consider deep tabular models only with millions of rows or strong nonlinear interactions the GBT misses.
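A minimal sketch of this plan, assuming `X` and `y` hold the 200k-row feature matrix and 0/1 churn labels; the estimators and fold counts mirror the plan above but are otherwise illustrative.

```python
# Churn plan sketch: class-weighted logistic regression vs. GBT under the
# same stratified 5-fold CV, scored by PR AUC; isotonic calibration on top.
from sklearn.calibration import CalibratedClassifierCV
from sklearn.ensemble import HistGradientBoostingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
models = {
    "logistic": make_pipeline(
        StandardScaler(),
        LogisticRegression(max_iter=1000, class_weight="balanced")),
    "gbt": HistGradientBoostingClassifier(random_state=0),
}
for name, model in models.items():
    # Identical folds for both families keeps the comparison fair.
    scores = cross_val_score(model, X, y, cv=cv, scoring="average_precision")
    print(f"{name}: PR AUC = {scores.mean():.3f} +/- {scores.std():.3f}")

# If the winner's probabilities feed a decision threshold, calibrate them.
calibrated = CalibratedClassifierCV(
    HistGradientBoostingClassifier(random_state=0), method="isotonic", cv=5)
calibrated.fit(X, y)
```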
2) Defect detection in images
- Setup: 2k labeled images, 224x224, minor class imbalance.
- Constraints: edge device, < 30ms per image, model size < 25MB.
- Plan: transfer learning with a compact pretrained CNN (e.g., a small residual network). Freeze most layers and train the head; then selectively unfreeze the top blocks. Use aggressive augmentation. Metrics: F1 at a fixed threshold, PR AUC. (Sketch below.)
- Decision: deep transfer learning beats classical HOG/SVM on accuracy here. Choose the smallest architecture that meets latency and accuracy targets; consider quantization after calibration.
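A sketch of the freeze-then-train-head step, assuming a recent torchvision (0.13+ for the weights API); `mobilenet_v3_small` stands in for "a compact pretrained CNN", and data loading is omitted.

```python
# Transfer-learning sketch: freeze a compact pretrained backbone and
# train a new 2-class head (defect / no defect).
import torch
import torch.nn as nn
from torchvision import models

model = models.mobilenet_v3_small(weights="DEFAULT")  # small, edge-friendly
for param in model.parameters():                      # freeze the backbone
    param.requires_grad = False
num_features = model.classifier[-1].in_features
model.classifier[-1] = nn.Linear(num_features, 2)     # new trainable head

optimizer = torch.optim.AdamW(model.classifier[-1].parameters(), lr=1e-3)
loss_fn = nn.CrossEntropyLoss()

def train_step(images, labels):
    # One supervised step on the head only; unfreeze top blocks later by
    # flipping requires_grad and adding their params to the optimizer.
    optimizer.zero_grad()
    loss = loss_fn(model(images), labels)
    loss.backward()
    optimizer.step()
    return loss.item()
```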
3) Text intent classification (small labeled set)
- Setup: 5k short texts, 30 intents, class imbalance.
- Constraints: API latency P99 < 10ms CPU.
- Plan: baseline TF-IDF + linear classifier (sketch below). Next: sentence embeddings (frozen) + linear. Fine-tune a compact transformer only if latency allows or the accuracy gap is large.
- Decision: if linear on embeddings meets accuracy and latency, prefer it. Otherwise, try a distilled transformer with ONNX export and CPU optimization.
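The TF-IDF baseline fits in a few lines; a sketch assuming `texts` (a list of strings) and `labels` are available, with scikit-learn's pipeline keeping the vectorizer leakage-safe inside CV.

```python
# Intent-classification baseline sketch: TF-IDF + linear model in one
# pipeline, scored with macro F1 so all 30 intents count equally.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline

pipeline = make_pipeline(
    TfidfVectorizer(ngram_range=(1, 2), min_df=2, sublinear_tf=True),
    LogisticRegression(max_iter=1000, class_weight="balanced"),
)
scores = cross_val_score(pipeline, texts, labels, cv=5, scoring="f1_macro")
print(f"TF-IDF + linear: macro F1 = {scores.mean():.3f} +/- {scores.std():.3f}")
```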
4) Time series forecasting (many series)
- Setup: 3k related demand series, hourly, strong weekly seasonality.
- Constraints: batch forecasting nightly; interpretability desired.
- Plan: seasonal naïve and classical models (ETS/ARIMA) per series (sketch below); then a global GBT with lag and holiday features; finally a deep sequence model if the global model underfits long-range dependencies.
- Decision: choose the simplest model that achieves the required MAPE; a global tree model with engineered features often wins before deep models are justified.
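A seasonal-naïve sketch for one series, assuming `series` is a 1-D NumPy array of hourly demand; the 168-hour season encodes the weekly pattern.

```python
# Seasonal-naive baseline sketch: forecast each hour with the value
# observed exactly one week (168 hours) earlier.
import numpy as np

SEASON = 24 * 7  # hourly data, weekly seasonality

def seasonal_naive(history, horizon):
    # Repeat the last observed season until the horizon is covered.
    last_season = history[-SEASON:]
    reps = int(np.ceil(horizon / SEASON))
    return np.tile(last_season, reps)[:horizon]

def mape(actual, forecast):
    # Note: MAPE is undefined for zero-demand hours; mask them or switch
    # metrics (e.g., WAPE) if your series contain zeros.
    actual, forecast = np.asarray(actual), np.asarray(forecast)
    return np.mean(np.abs((actual - forecast) / actual)) * 100

# Time-aware holdout: the last week is never seen when "fitting".
train, test = series[:-SEASON], series[-SEASON:]
print(f"seasonal naive MAPE: {mape(test, seasonal_naive(train, SEASON)):.1f}%")
```

Every candidate, classical or deep, must beat this number on the same holdout before it earns a place in the nightly batch.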
How to compare models fairly
- Fix the split: use the same train/val/test partitions (time-aware if needed).
- Use consistent preprocessing and leakage-safe pipelines.
- Tune with equal budgets (trials/time) across families.
- Report uncertainty: average over 2–3 seeds or CV folds; include bootstrap confidence intervals (see the sketch after this list).
- Calibrate probabilities (Platt or isotonic) when decisions are thresholded.
- Evaluate cost-aware metrics (e.g., recall at precision target; weighted loss).
- Check production fit: latency, memory, cold-start behavior, drift sensitivity.
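One way to report that uncertainty is a paired bootstrap on the shared test set; a sketch assuming NumPy arrays `y_test` plus predicted probabilities `scores_a` and `scores_b` from two models evaluated on identical rows.

```python
# Paired-bootstrap sketch: confidence interval on the PR AUC gap between
# two models scored on the same test rows.
import numpy as np
from sklearn.metrics import average_precision_score

rng = np.random.default_rng(0)
n = len(y_test)
deltas = []
for _ in range(1000):
    idx = rng.integers(0, n, size=n)  # resample test rows with replacement
    # Caveat: under extreme imbalance a resample may contain no positives;
    # guard or stratify the resampling in that case.
    deltas.append(
        average_precision_score(y_test[idx], scores_a[idx])
        - average_precision_score(y_test[idx], scores_b[idx]))
lo, hi = np.percentile(deltas, [2.5, 97.5])
print(f"PR AUC gap (A - B): 95% CI [{lo:.3f}, {hi:.3f}]")
```

If the interval straddles zero, the accuracy case for the heavier model is weak; decide on cost, latency, and maintainability instead.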
Quick fairness checklist
- Same data splits and feature sets
- No leakage from future data; target encodings computed within CV folds
- Comparable hyperparameter search budgets
- Metrics aligned with business costs
- Report both performance and resource usage
Common mistakes and how to self-check
- Jumping to deep learning before a strong classical baseline. Self-check: do you have a calibrated tree/linear model with tuned regularization?
- Using ROC AUC for rare events. Self-check: for prevalence < 5%, prefer PR AUC or cost-aware metrics (see the sketch after this list).
- Ignoring latency/memory. Self-check: have you measured P95/P99 on target hardware?
- Data leakage via target encoding or time splits. Self-check: are encodings computed within CV folds or time-safe windows?
- Overfitting hyperparameters. Self-check: is test untouched until final selection? Consider nested CV.
- Skipping calibration for decision thresholds. Self-check: does the reliability diagram track the diagonal?
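Two of these self-checks take only a few lines; a sketch assuming `y_true` (0/1 labels with rare positives) and `probs` (a model's predicted probabilities) on a held-out set.

```python
# Self-check sketch: ROC AUC vs. PR AUC on rare events, plus a text-mode
# reliability check (well-calibrated probabilities track the diagonal).
from sklearn.calibration import calibration_curve
from sklearn.metrics import average_precision_score, roc_auc_score

print(f"ROC AUC: {roc_auc_score(y_true, probs):.3f}")  # can look rosy
print(f"PR AUC:  {average_precision_score(y_true, probs):.3f}")  # stricter at low prevalence

frac_pos, mean_pred = calibration_curve(y_true, probs, n_bins=10)
for fp, mp in zip(frac_pos, mean_pred):
    # Large gaps between predicted and observed rates signal miscalibration.
    print(f"predicted {mp:.2f} -> observed {fp:.2f}")
```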
Exercises
Complete the tasks below. Compare your answers with the provided solutions.
- ex1: Design a model selection plan for an imbalanced tabular churn dataset.
- ex2: Choose an approach for an image defect detection system under tight latency.
Exercise checklist
- Constraints identified (latency, memory, interpretability)
- Data characteristics summarized (type, size, imbalance)
- Baseline(s) proposed and justified
- Metrics aligned to the problem
- Fair comparison plan (splits, tuning budget)
- Escalation criteria defined
Practical projects
- Fraud detection on tabular data: compare logistic regression, GBT, and a simple deep model; report PR AUC, latency, and calibration.
- News topic classifier: TF-IDF + linear vs. fine-tuned compact transformer; measure accuracy/latency trade-offs.
- Retail demand forecasting: seasonal naïve, ARIMA, global GBT with features; evaluate MAPE and weekly bias.
Mini challenge
You have 15k labeled customer support emails (short texts), 10 classes, need P99 < 15ms on CPU. Propose two candidate solutions with quick experiments you would run first, the metrics you would compare, and how you would choose a winner. Keep it to 6 bullet points.
Who this is for
- Applied Scientists making end-to-end decisions from data to deployment.
- ML Engineers needing fast, pragmatic model choices under constraints.
- Data Scientists moving from analysis to production modeling.
Prerequisites
- Comfort with Python or similar for ML workflows
- Understanding of supervised learning, CV, regularization
- Basic knowledge of tree ensembles and neural networks
Learning path
- Define constraints and success metrics for your problem.
- Build and evaluate classical baselines (linear, GBT).
- If unstructured data: add a transfer-learned deep baseline.
- Run equalized hyperparameter searches; calibrate and measure latency.
- Escalate capacity only with evidence; finalize and document trade-offs.
Quick test
Take the quick test to check your understanding.
Next steps
- Apply the selection framework to one of your current projects.
- Create a model card capturing constraints, baselines, and selection rationale.
- Set up monitoring for calibration, drift, and latency in production.