Why this matters
As an Applied Scientist, you are paid to solve a product problem, not to use a particular model. Choosing the right model family early saves weeks of iteration, keeps latency and costs in check, and increases your chance of shipping measurable impact.
- Real tasks: churn prediction, ticket routing, fraud scoring, demand forecasting, ranking search results, content moderation, anomaly detection.
- Real constraints: limited labels, skewed classes, small data vs. big data, strict latency, memory budgets, interpretability requirements.
Concept explained simply
Model choice is a matching problem: pair the structure of your data and your objective with a model family that fits them naturally under real-world constraints.
- Data form: tabular, text, images/video, time series, graphs, user–item events.
- Objective: classification, regression, ranking, forecasting, anomaly detection, generation.
- Constraints: data size, latency, memory, training time, interpretability, update frequency.
Cheat-sheet (use as a starting point, not a rule)
- Tabular (small–medium): Gradient-boosted decision trees (GBDT) or a regularized linear model/GLM. Consider GAMs for interpretability.
- Text classification: Linear (TF–IDF + Logistic Regression/Linear SVM) as a baseline; a fine-tuned Transformer when you need a higher ceiling and have enough data/budget.
- Images: Transfer learning with pretrained CNN/ViT; train from scratch only with lots of labeled images.
- Time series: Seasonal-naive baseline → ARIMA/Prophet (few series) → global models (GBDT or deep seq models) when many related series and covariates.
- Ranking/Recsys: Learning-to-Rank GBDT (LambdaMART) or two-tower retrieval + re-ranker; matrix factorization for explicit feedback.
- Anomaly detection: Isolation Forest, One-Class SVM, robust autoencoders; choose by data dimensionality and latency budget.
- Graphs: GNNs for relational patterns; start with engineered graph features into GBDT if labels are scarce.
Mental model: the 5-factor lens
- Fit-to-data: Does the model family naturally represent the data (e.g., sequences, images, sparse text)?
- Metric alignment: Can it optimize a proxy close to your business metric (e.g., AUC/logloss for conversion)?
- Bias–variance: Start simple; increase capacity when you have evidence (learning curves, error analysis) that you need it.
- Latency–cost: Favor models that meet your p95 latency and memory on target hardware.
- Interpretability–risk: Prefer simpler, explainable models in high-risk domains or when stakeholder trust is critical.
Quick rule-of-thumb: start with the simplest family likely to work, ship a strong baseline, then scale complexity only when it clearly improves your chosen metric under constraints.
Quick chooser by task
Tabular supervised (classification/regression)
- Baseline: Gradient-Boosted Trees. They handle non-linearities, missing values, and mixed types well.
- Small n, wide p, sparse: Regularized linear or linear + target/impact encoding.
- Interpretability needed: GAMs or shallow trees with monotonic constraints.
- When to go deep: very large data with complex interactions or when you have learned embeddings (e.g., from recsys), but expect longer iteration cycles.
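A minimal sketch of this GBDT baseline using scikit-learn's HistGradientBoostingClassifier, which handles missing values natively; the file and column names are placeholders for your own data.

```python
# Minimal GBDT baseline for tabular classification; a sketch, not a tuned model.
# Assumes a CSV with a binary "churn" column; file and column names are illustrative.
import pandas as pd
from sklearn.ensemble import HistGradientBoostingClassifier
from sklearn.model_selection import cross_val_score

df = pd.read_csv("customers.csv")                      # hypothetical file
X = pd.get_dummies(df.drop(columns=["churn"]))         # one-hot encode categoricals
y = df["churn"]

model = HistGradientBoostingClassifier(max_iter=300)   # handles NaNs natively
scores = cross_val_score(model, X, y, cv=5, scoring="roc_auc")
print(f"CV AUC: {scores.mean():.3f} +/- {scores.std():.3f}")
```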
Text / NLP
- Baseline: TF–IDF + Logistic Regression/Linear SVM; very fast, strong for many tasks.
- Upgrade: Fine-tune a lightweight transformer when more accuracy is needed and the latency budget allows (e.g., distilled models on CPU/GPU).
- Generative tasks: Sequence-to-sequence or instruction-tuned models; cache and retrieval-augment to control latency/cost.
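A minimal sketch of the TF–IDF + Logistic Regression baseline above; the tickets and labels are toy stand-ins for your data.

```python
# TF-IDF + Logistic Regression text baseline; the tickets below are toy stand-ins.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

texts = [
    "cannot log in to my account",
    "refund for a duplicate charge",
    "app crashes on startup",
    "update my billing address",
]
labels = ["auth", "billing", "bug", "billing"]

clf = make_pipeline(
    TfidfVectorizer(ngram_range=(1, 2)),                        # unigrams + bigrams
    LogisticRegression(class_weight="balanced", max_iter=1000), # weights for skewed classes
)
clf.fit(texts, labels)
print(clf.predict(["I was charged twice this month"]))
```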
Vision
- Baseline: Transfer learning with pretrained CNN/ViT; freeze backbone then unfreeze top layers.
- Edge latency: Quantize and prune; choose smaller backbones (MobileNet family).
- Data-scarce: Strong augmentations and mixup/cutmix before adding complexity.
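A transfer-learning sketch with a pretrained torchvision ResNet-18; the 5-class head and the choice to unfreeze only the top block are illustrative assumptions.

```python
# Transfer-learning sketch: pretrained ResNet-18, frozen backbone, new head.
import torch
import torch.nn as nn
from torchvision import models

model = models.resnet18(weights=models.ResNet18_Weights.DEFAULT)
for param in model.parameters():
    param.requires_grad = False                      # freeze the whole backbone
model.fc = nn.Linear(model.fc.in_features, 5)        # new head for a 5-class task

optimizer = torch.optim.AdamW(model.fc.parameters(), lr=1e-3)
# After the head converges, optionally unfreeze the top block and fine-tune:
for param in model.layer4.parameters():
    param.requires_grad = True
```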
Time series forecasting
- Baselines: Seasonal naive and last-value; they are hard to beat at short horizons.
- Few series: ARIMA/Prophet per series for interpretability.
- Many related series + covariates: Global models (GBDT on features or deep seq models like LSTM/Transformer).
- Operational notes: Use rolling-origin validation; guard against time leakage.
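A seasonal-naive sketch on synthetic daily data with a weekly cycle; in practice the generated series is replaced by your own.

```python
# Seasonal-naive baseline: forecast = value one season ago (weekly season assumed).
import numpy as np
import pandas as pd

idx = pd.date_range("2023-01-01", periods=120, freq="D")
sales = pd.Series(100 + 10 * np.sin(np.arange(120) * 2 * np.pi / 7), index=idx)

season = 7
forecast = sales.shift(season)                       # y_hat(t) = y(t - 7)
mape = (abs(sales - forecast) / sales).dropna().mean() * 100
print(f"Seasonal-naive MAPE: {mape:.1f}%")
```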
Recommendation / Ranking
- Retrieval: Two-tower / matrix factorization for candidate generation.
- Re-ranking: Learning-to-Rank GBDT (LambdaMART) with pairwise/listwise losses.
- Cold-start: Content features into GBDT; later add embeddings.
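A LambdaMART-style re-ranking sketch using LightGBM's lambdarank objective; the features, relevance grades, and query groups below are synthetic placeholders.

```python
# LambdaMART-style re-ranker via LightGBM's lambdarank objective (sketch).
import numpy as np
import lightgbm as lgb

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 20))          # query–item features
y = rng.integers(0, 4, size=1000)        # graded relevance labels (0–3)
group = [10] * 100                       # 100 queries, 10 candidates each

ranker = lgb.LGBMRanker(objective="lambdarank", n_estimators=200)
ranker.fit(X, y, group=group)
scores = ranker.predict(X[:10])          # rank one query's candidates by score
```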
Anomaly / Outlier detection
- Unsupervised: Isolation Forest for tabular data; One-Class SVM for low-to-mid-dimensional data.
- Semi-supervised: Autoencoder reconstruction error; adjust thresholds with a small labeled set.
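An Isolation Forest sketch on synthetic tabular data; the contamination rate is an assumption you should set from domain knowledge or a small labeled set.

```python
# Isolation Forest sketch for unsupervised tabular anomaly detection.
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(0)
X = rng.normal(size=(5000, 12))
X[:10] += 6                                          # inject a few obvious outliers

iso = IsolationForest(contamination=0.01, random_state=0).fit(X)
flags = iso.predict(X)                               # -1 = anomaly, 1 = normal
print(f"flagged {(flags == -1).sum()} points")
```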
Graphs / Structured prediction
- Start: Handcrafted graph features (degree, PageRank) into GBDT.
- Upgrade: Graph Neural Networks when edges carry strong signal and you have labels.
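A sketch of the graph-features-into-GBDT starting point, using networkx for degree and PageRank; the graph and the toy label are synthetic stand-ins.

```python
# Engineered graph features (degree, PageRank) fed into a GBDT.
import networkx as nx
import numpy as np
from sklearn.ensemble import HistGradientBoostingClassifier

G = nx.barabasi_albert_graph(500, 3, seed=0)
pagerank = nx.pagerank(G)
X = np.array([[G.degree(n), pagerank[n]] for n in G.nodes()])
y = np.array([int(G.degree(n) > 5) for n in G.nodes()])   # toy label for illustration

clf = HistGradientBoostingClassifier().fit(X, y)
```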
Worked examples
1) Churn prediction on tabular customer data
Context: 120k customers, 200 features (mixed numeric/categorical), monthly churn label. Need AUC > 0.80 and CPU-only inference under 10 ms.
- Baseline: GBDT with monotonic constraints on price-related features where sensible.
- Why: Mixed types, moderate size, strict latency; trees excel.
- Stretch: Calibrate probabilities (Platt/isotonic); try a GAM for interpretability if legal/compliance requires explanations.
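A sketch combining a monotonic constraint with isotonic calibration; that the price feature sits in column 0 and that churn risk should be non-decreasing in price are assumptions for illustration.

```python
# Monotonic constraint + isotonic calibration (sketch on synthetic data).
import numpy as np
from sklearn.calibration import CalibratedClassifierCV
from sklearn.ensemble import HistGradientBoostingClassifier

rng = np.random.default_rng(0)
X = rng.normal(size=(5000, 3))                       # col 0 = price (assumed)
y = (0.8 * X[:, 0] + rng.normal(size=5000) > 0).astype(int)

base = HistGradientBoostingClassifier(monotonic_cst=[1, 0, 0])  # +1: non-decreasing in price
model = CalibratedClassifierCV(base, method="isotonic", cv=5)
model.fit(X, y)
probs = model.predict_proba(X)[:, 1]                 # calibrated churn probabilities
```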
2) Ticket topic classification
Context: 30k labeled tickets, 20 classes, must run under 100 ms on CPU.
- Baseline: TF–IDF + Logistic Regression with class weights; add bigrams.
- Why: Strong baseline, low latency, fast iteration.
- Stretch: Distilled transformer fine-tuning if accuracy stalls; enforce max sequence length and quantization to meet SLA.
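If you reach the stretch step, dynamic quantization is one way to claw back CPU latency; a sketch using PyTorch's dynamic quantization, where the model is a stand-in rather than an actual distilled transformer.

```python
# Dynamic INT8 quantization sketch for CPU inference (PyTorch).
# The model is a stand-in; with a real distilled transformer, its Linear
# layers would be quantized the same way.
import torch
import torch.nn as nn

model = nn.Sequential(nn.Linear(768, 256), nn.ReLU(), nn.Linear(256, 20))
quantized = torch.quantization.quantize_dynamic(
    model, {nn.Linear}, dtype=torch.qint8            # quantize Linear layers only
)
with torch.no_grad():
    logits = quantized(torch.randn(1, 768))
```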
3) Demand forecasting for 100 stores
Context: 3 years of daily sales per store; want 6-week forecast; promotions and holidays known.
- Baselines: Seasonal naive; Prophet per store for interpretability.
- Global model: Engineer seasonal/covariate features and train a GBDT; evaluate via rolling-origin MAPE.
- Stretch: Sequence model if many cross-series effects; keep simple if gains are marginal.
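A rolling-origin evaluation sketch: train on the past, forecast the next 6 weeks, roll the origin forward. The "model" here is a naive stand-in, and the horizon and stride are assumptions to adapt.

```python
# Rolling-origin evaluation loop; swap the stand-in forecast for your GBDT.
import numpy as np
import pandas as pd

horizon, step = 42, 28                               # 6-week horizon, 4-week stride
y = pd.Series(np.random.default_rng(0).normal(100, 10, 3 * 365))

mapes = []
for origin in range(2 * 365, len(y) - horizon, step):
    train, test = y[:origin], y[origin:origin + horizon]
    forecast = pd.Series(train.tail(horizon).values, index=test.index)  # stand-in model
    mapes.append((abs(test - forecast) / test).mean() * 100)
print(f"mean MAPE over {len(mapes)} origins: {np.mean(mapes):.1f}%")
```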
4) Search ranking for an e-commerce site
Context: Click/conversion logs; need to optimize NDCG@10.
- Baseline: LambdaMART (GBDT) with query–item features; pairwise/listwise loss aligned with NDCG.
- Why: Direct metric alignment, strong tabular performance.
- Stretch: Two-stage system: approximate nearest-neighbor (ANN) retrieval → LTR re-rank; add query/item embeddings over time.
Playbook: from baseline to better
- Define task, metric, constraints (p95 latency, memory, training budget, update cadence).
- Baseline with the simplest likely-to-work family (see quick chooser). Log a decision record.
- Validate with correct CV (stratified for classification; time-aware for forecasting). Add sanity baselines.
- Error analysis: inspect failure patterns; decide if capacity or data/feature issues block progress.
- Upgrade only if evidence shows a gap: more capacity, pretrained models, better loss/metric alignment.
- Stress-test latency/cost on target hardware; consider quantization, distillation, or smaller backbones.
- Ship the simplest model that meets goals; monitor drift and recalibration needs.
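For the stress-test step, a quick sketch for profiling p95 single-request latency; the model and data are placeholders, and the measurement should run on the actual target hardware.

```python
# Profile p95 single-request inference latency (sketch).
import time
import numpy as np
from sklearn.linear_model import LogisticRegression

X = np.random.default_rng(0).normal(size=(1000, 50))
y = (X[:, 0] > 0).astype(int)
model = LogisticRegression().fit(X, y)

latencies_ms = []
for i in range(1000):
    row = X[i % len(X)].reshape(1, -1)               # single-row requests, as in serving
    start = time.perf_counter()
    model.predict_proba(row)
    latencies_ms.append((time.perf_counter() - start) * 1000)
print(f"p95 latency: {np.percentile(latencies_ms, 95):.2f} ms")
```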
Exercises
Exercise 1: Choose a model for real-time fraud scoring
You have streaming card transactions (millions/day), extreme class imbalance, and a 50 ms p95 latency budget on CPU. Pick a baseline family and explain how you will handle the imbalance and how you will update the model over time.
Write your plan, then compare with the solution in the exercise block below.
Exercise 2: Pick a text classifier under CPU latency
30k labeled support tickets, 20 classes, target accuracy 85%+, CPU 100 ms SLA. Propose baseline and stretch plan with latency controls.
Exercise 3: Multi-store forecasting strategy
100 stores, daily data for 3 years; predict 6 weeks ahead; promotions/holidays available. Choose baseline and global model approach; define validation.
Checklist: before you lock a model family
- My CV scheme matches deployment reality (e.g., no time leakage).
- My chosen loss approximates the business metric.
- I profiled p95 latency and memory on target hardware.
- I compared against a naive and a simple baseline.
- I have a rollback and monitoring plan.
Common mistakes and self-check
- Jumping to deep models without a baseline. Self-check: Do you have a strong linear/GBDT result to beat?
- Ignoring metric alignment. Self-check: Is your training loss correlated with your business metric in validation?
- Offline-only success. Self-check: Did you measure latency and memory on target hardware?
- Time leakage in forecasting. Self-check: Did you use rolling-origin splits and future-only features?
- Overfitting rare positives. Self-check: Plot precision–recall and calibrate probabilities.
Practical projects
- Tabular churn: Build GBDT baseline and a GAM variant. Acceptance: AUC improvement over logistic baseline and calibrated probabilities.
- Text routing: TF–IDF + Logistic baseline vs. distilled transformer. Acceptance: Meet 100 ms CPU p95 and +2–3% accuracy gain.
- Global forecasting: Rolling-origin evaluation with a seasonal-naive baseline vs. a GBDT global model. Acceptance: MAPE reduction and a leakage-free pipeline.
Who this is for and prerequisites
- For: Applied Scientists, ML Engineers, Data Scientists moving models to production.
- Prerequisites: Comfortable with supervised learning, validation schemes, basic optimization, and at least one ML toolkit.
Learning path
- Start: Build baselines for your primary data type (tabular/text/time-series).
- Then: Learn metric-specific losses (ranking, imbalanced classification, calibration).
- Next: Manage constraints (latency, memory) via distillation/quantization and efficient architectures.
- Finally: Experiment with advanced families only when justified by error analysis.
Mini challenge
You must classify harmful content in user comments. Constraint: CPU-only, 80 ms p95, 200k labeled examples, strong class imbalance. Draft a 2-stage approach (baseline and upgrade) that meets latency, and list 3 monitoring checks you will run after launch.
Next steps
- Write a one-page decision record for a current problem using the 5-factor lens.
- Take the Quick Test below to check your understanding.
- Pick one Practical Project and implement the baseline this week.