Why this skill matters for Data Scientists
Machine Learning (ML) algorithms are the tools you use to turn data into predictive insights. As a Data Scientist, mastering core algorithms lets you build baselines quickly, choose the right model family for a problem, reason about trade-offs (bias vs. variance, accuracy vs. interpretability), and iterate confidently under real-world constraints such as small datasets, noisy labels, or imbalanced classes.
What this unlocks in your daily work
- Rapid baselines that guide feature engineering and data collection.
- Strong model selection skills for classification, regression, and clustering tasks.
- Practical tuning to reach reliable performance with limited time and compute.
- Clear communication of model behavior, limitations, and risks to stakeholders.
Who this is for
- Aspiring and junior Data Scientists building practical modeling skills.
- Analysts/Engineers expanding into predictive modeling.
- Practitioners who need a structured refresher across major algorithm families.
Prerequisites
- Python basics and comfort with NumPy/Pandas.
- Basic probability and linear algebra (vectors, matrices, dot product).
- Familiarity with scikit-learn pipelines and train/test split.
If you’re missing some prerequisites
Do quick refreshers on: arrays and broadcasting (NumPy), data wrangling (Pandas), and the scikit-learn estimator API (fit/predict/transform, pipelines).
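If the estimator API feels unfamiliar, here is a minimal sketch of the fit/transform/predict pattern every example below relies on (toy data, nothing domain-specific):
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
rng = np.random.RandomState(0)
X = rng.randn(100, 3)                        # toy features
y = (X[:, 0] + X[:, 1] > 0).astype(int)     # toy labels
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)           # transformers: fit, then transform
clf = LogisticRegression().fit(X_scaled, y)  # estimators: fit, then predict
preds = clf.predict(X_scaled)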
Learning path (practical roadmap)
- Know the families: Linear models, trees/ensembles, SVM, kNN, Naive Bayes, clustering, dimensionality reduction, and a light touch of neural nets.
- Data-first preprocessing: Scaling, encoding, handling imbalance, and feature-leakage prevention (see the leakage-safe preprocessing sketch after this list).
- Baselines and metrics: Start simple, choose the right metric (ROC AUC, F1, MAE/RMSE), and do cross-validation.
- Controlled tuning: Grid/random search with sensible ranges; interpret hyperparameters.
- Reliability checks: Learning curves, calibration, stability across folds.
- Explainability: Coefficients, feature importance, partial dependence at a minimum.
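A minimal leakage-safe preprocessing sketch, assuming a tabular dataset with numeric and categorical columns (the column names below are hypothetical placeholders):
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
# Hypothetical column names; fitting expects a DataFrame containing them.
preprocess = ColumnTransformer([
    ("num", StandardScaler(), ["age", "tenure_months"]),
    ("cat", OneHotEncoder(handle_unknown="ignore"), ["plan"]),
])
model = Pipeline([("prep", preprocess), ("clf", LogisticRegression(max_iter=1000))])
# Because all transforms live inside the Pipeline, cross-validation refits them
# on each training fold, preventing leakage from validation data.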
Milestone checklist
- Can train and evaluate at least one model from each major family.
- Uses proper CV and reports uncertainty (mean ± std across folds).
- Understands when to prefer linear vs. tree-based vs. margin-based models.
- Can reduce dimensionality safely and explain the trade-offs.
Worked examples (copy, run, adapt)
1) Regularized Logistic Regression (classification baseline)
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.metrics import roc_auc_score
X, y = make_classification(n_samples=3000, n_features=20, n_informative=5,
                           weights=[0.7, 0.3], random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42, stratify=y)
pipe = Pipeline([
    ("scaler", StandardScaler()),
    ("clf", LogisticRegression(C=1.0, penalty="l2", solver="liblinear", max_iter=200))
])
pipe.fit(X_train, y_train)
proba = pipe.predict_proba(X_test)[:, 1]
print("ROC AUC:", roc_auc_score(y_test, proba))
# Tip: Lower C -> stronger regularization (simpler model, less variance)
2) Random Forest vs. Gradient Boosting (quick compare)
from sklearn.datasets import make_regression
from sklearn.model_selection import cross_val_score
from sklearn.ensemble import RandomForestRegressor, GradientBoostingRegressor
import numpy as np
X, y = make_regression(n_samples=3000, n_features=30, noise=10, random_state=0)
rf = RandomForestRegressor(n_estimators=300, random_state=0)
gb = GradientBoostingRegressor(random_state=0)
rf_scores = cross_val_score(rf, X, y, cv=5, scoring="neg_mean_absolute_error")
gb_scores = cross_val_score(gb, X, y, cv=5, scoring="neg_mean_absolute_error")
print("RF MAE:", -np.mean(rf_scores))
print("GB MAE:", -np.mean(gb_scores))
# Heuristic: RF handles many noisy features robustly; GB often wins with careful tuning.
3) SVM with RBF kernel (margin-based tuning)
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import GridSearchCV, StratifiedKFold
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline
from sklearn.svm import SVC
X, y = load_breast_cancer(return_X_y=True)
pipe = Pipeline([
    ("scaler", StandardScaler()),
    ("svm", SVC(kernel="rbf", probability=True))
])
param_grid = {
    "svm__C": [0.1, 1, 10],
    "svm__gamma": ["scale", 0.1, 0.01]
}
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
search = GridSearchCV(pipe, param_grid, cv=cv, scoring="roc_auc", n_jobs=-1)
search.fit(X, y)
print("Best params:", search.best_params_)
print("Best AUC:", search.best_score_)
# Larger C -> narrower margin (can overfit); larger gamma -> more local decision boundary.
4) k-Nearest Neighbors with scaling (distance matters)
from sklearn.datasets import load_iris
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline
from sklearn.model_selection import cross_val_score
import numpy as np
X, y = load_iris(return_X_y=True)
pipe = Pipeline([
    ("scaler", StandardScaler()),
    ("knn", KNeighborsClassifier(n_neighbors=5, weights="distance"))
])
scores = cross_val_score(pipe, X, y, cv=5)
print("CV accuracy:", np.mean(scores))
# Always scale for distance-based models; try odd k in small datasets to reduce ties.
5) Multinomial Naive Bayes (simple text baseline)
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline
from sklearn.model_selection import train_test_split
from sklearn.metrics import f1_score
texts = [
    "great product loved it", "bad quality not recommended",
    "excellent value would buy again", "terrible broken on arrival",
    "works perfectly as expected", "waste of money"
]
labels = [1, 0, 1, 0, 1, 0]
X_train, X_test, y_train, y_test = train_test_split(texts, labels, test_size=0.33, random_state=42, stratify=labels)
model = make_pipeline(CountVectorizer(), MultinomialNB())
model.fit(X_train, y_train)
preds = model.predict(X_test)
print("F1:", f1_score(y_test, preds))
# NB is fast and strong with bag-of-words; assumes feature independence.
6) KMeans, DBSCAN, and PCA (unsupervised toolkit)
from sklearn.datasets import make_blobs
from sklearn.cluster import KMeans, DBSCAN
from sklearn.decomposition import PCA
from sklearn.metrics import silhouette_score
import numpy as np
X, y_true = make_blobs(n_samples=1000, centers=4, cluster_std=1.0, random_state=0)
# PCA to 2 components (for visualization or as a speed-up before clustering)
pca = PCA(n_components=2, random_state=0)
X2 = pca.fit_transform(X)  # X2 is the 2-D projection, e.g., for a scatter plot
# KMeans with k=4
kmeans = KMeans(n_clusters=4, n_init=10, random_state=0)
labels_k = kmeans.fit_predict(X)
print("KMeans silhouette:", silhouette_score(X, labels_k))
# DBSCAN (density-based; discovers non-spherical clusters, can mark noise)
db = DBSCAN(eps=0.8, min_samples=10)
labels_d = db.fit_predict(X)
mask = labels_d != -1  # drop points DBSCAN labeled as noise
n_clusters = len(set(labels_d[mask]))
if n_clusters >= 2:  # silhouette_score requires at least two clusters
    print("DBSCAN silhouette:", silhouette_score(X[mask], labels_d[mask]))
else:
    print("DBSCAN found fewer than two clusters with these params.")
# Heuristic: scale features before clustering; tune eps/min_samples for DBSCAN.
Drills and quick exercises
- [ ] Re-run Example 1 with C in [0.01, 0.1, 1, 10]. Note how ROC AUC and calibration change.
- [ ] For Example 2, try 100 vs. 500 trees; for Gradient Boosting, also vary learning_rate (e.g., 0.05 vs. 0.1).
- [ ] In Example 3, add class_weight="balanced". Does it help minority recall?
- [ ] Remove the StandardScaler from Example 4 (no scaling) and observe the accuracy drop.
- [ ] In Example 6, standardize X before clustering; compare silhouettes.
- [ ] Add train/test leakage guard: confirm all scalers/encoders are inside a Pipeline.
Common mistakes and debugging tips
Data leakage
- Symptom: unrealistically high CV, low production performance.
- Fix: wrap preprocessing in Pipelines; split before scaling/encoding/selection.
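A minimal before/after sketch of the fix, on synthetic data:
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
X, y = make_classification(n_samples=500, random_state=0)
# Leaky: the scaler sees every row, including future validation folds.
X_leaky = StandardScaler().fit_transform(X)
leaky = cross_val_score(LogisticRegression(max_iter=500), X_leaky, y, cv=5)
# Safe: the scaler is refit inside each training fold.
pipe = Pipeline([("scaler", StandardScaler()), ("clf", LogisticRegression(max_iter=500))])
safe = cross_val_score(pipe, X, y, cv=5)
print(np.mean(leaky), np.mean(safe))  # the gap grows with small data / heavy preprocessing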
Wrong metric for the task
- Symptom: accuracy looks great but minority class suffers.
- Fix: for imbalance, prefer ROC AUC, PR AUC, or F1; monitor class-wise recall.
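A minimal sketch of the fix, on synthetic imbalanced data:
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report, average_precision_score
from sklearn.model_selection import train_test_split
X, y = make_classification(n_samples=2000, weights=[0.9, 0.1], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, stratify=y, random_state=0)
clf = LogisticRegression(max_iter=500).fit(X_tr, y_tr)
print(classification_report(y_te, clf.predict(X_te)))  # per-class precision/recall/F1
print("PR AUC:", average_precision_score(y_te, clf.predict_proba(X_te)[:, 1]))  # robust under imbalance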
Overfitting during tuning
- Symptom: best CV score is from extreme hyperparameters.
- Fix: use nested CV or a validation split; constrain search ranges; inspect learning curves.
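A minimal nested-CV sketch (reusing the breast-cancer data from Example 3); the inner loop tunes, the outer loop estimates how well the whole tuning procedure generalizes:
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import GridSearchCV, StratifiedKFold, cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC
X, y = load_breast_cancer(return_X_y=True)
pipe = Pipeline([("scaler", StandardScaler()), ("svm", SVC())])
inner = GridSearchCV(pipe, {"svm__C": [0.1, 1, 10]}, cv=3, scoring="roc_auc")
outer = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
scores = cross_val_score(inner, X, y, cv=outer, scoring="roc_auc")
print(f"Nested CV AUC: {scores.mean():.3f} ± {scores.std():.3f}")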
Poor scaling with distance-based models
- Symptom: kNN/SVM unstable across runs.
- Fix: standardize/normalize features; check outliers; use robust scalers if needed.
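If outliers are the culprit, one option is to swap in RobustScaler; a minimal sketch:
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import RobustScaler
# RobustScaler centers on the median and scales by the IQR, so a few extreme
# values distort distances far less than with mean/std standardization.
pipe = Pipeline([("scaler", RobustScaler()),
                 ("knn", KNeighborsClassifier(n_neighbors=5))])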
Ignoring uncertainty
- Symptom: single number reported as performance.
- Fix: report mean ± std across folds; add confidence intervals for key metrics.
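A one-line pattern for the fix (the fold scores here are hypothetical):
import numpy as np
scores = np.array([0.81, 0.84, 0.79, 0.83, 0.82])  # hypothetical per-fold AUCs
print(f"AUC: {scores.mean():.3f} ± {scores.std():.3f} across {len(scores)} folds")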
Mini project: Churn prediction baseline to shortlist model families
Goal: Build a reliable baseline and compare three families (linear, tree-based, and margin-based) for a binary churn problem. Keep it fast, reproducible, and explainable.
- Data split: train/validation/test with stratification.
- Pipeline: numerical scaling + categorical encoding inside a single Pipeline.
- Baselines: LogisticRegression (with class_weight), RandomForest, and SVC (RBF) with light tuning (a starter harness is sketched after this list).
- Metrics: ROC AUC (primary), F1 and calibration (secondary). Use cross-validation on train.
- Model cards: for each candidate, log best params, mean ± std metrics, top features/importance.
- Decision: pick the simplest model within 1–2% of the best score.
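A hedged starter harness, using a synthetic stand-in for the churn data (swap in your real features and the ColumnTransformer-style preprocessing from earlier):
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC
X, y = make_classification(n_samples=4000, n_features=25, weights=[0.8, 0.2], random_state=42)
candidates = {
    "logreg": Pipeline([("scaler", StandardScaler()),
                        ("clf", LogisticRegression(class_weight="balanced", max_iter=500))]),
    "rf": RandomForestClassifier(n_estimators=300, random_state=42),
    "svc": Pipeline([("scaler", StandardScaler()), ("clf", SVC(kernel="rbf"))]),
}
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
for name, model in candidates.items():
    scores = cross_val_score(model, X, y, cv=cv, scoring="roc_auc")
    print(f"{name}: {scores.mean():.3f} ± {scores.std():.3f}")
# Decision rule from the brief: pick the simplest model within 1-2% of the best mean AUC.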
Acceptance criteria
- Reproducible code with fixed random_state where applicable.
- No leakage (all transforms inside Pipelines).
- Report mean ± std across folds and final test metrics.
- Brief note on risks and failure modes (e.g., drift, imbalance).
Stretch goals
- Add probability calibration (Platt scaling or isotonic regression) and compare Brier scores (see the sketch after this list).
- Use SHAP or permutation importance to validate feature influence (if available).
- Try Gradient Boosting or XGBoost-style alternatives and compare compute vs. gains.
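For the calibration stretch goal, a minimal sketch using CalibratedClassifierCV ("sigmoid" is Platt scaling; "isotonic" is the nonparametric alternative):
from sklearn.calibration import CalibratedClassifierCV
from sklearn.datasets import make_classification
from sklearn.metrics import brier_score_loss
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC
X, y = make_classification(n_samples=3000, weights=[0.7, 0.3], random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42, stratify=y)
calibrated = CalibratedClassifierCV(SVC(kernel="rbf"), method="sigmoid", cv=5)
calibrated.fit(X_train, y_train)
proba = calibrated.predict_proba(X_test)[:, 1]
print("Brier score:", brier_score_loss(y_test, proba))  # lower is better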
Next steps
- Practice targeted: pick one dataset and cycle through 3 model families with a fixed evaluation harness.
- Start building a personal template: data split, pipeline, CV, logging, plots.
- Move to model evaluation depth: threshold selection, calibration, confidence, and monitoring.
Subskills
- Linear Models: Regression & Classification
- Regularization: Ridge, Lasso, Elastic Net
- Tree-Based Models: Random Forest, Gradient Boosting
- Support Vector Machines: Basics
- Nearest Neighbors: Basics
- Naive Bayes: Basics
- Clustering: K-Means, Hierarchical, DBSCAN (Basics)
- Dimensionality Reduction: PCA, UMAP (Basics)
- Neural Network Basics