Who this is for
Data science learners who want reliable, high-performing models for classification or regression without excessive feature engineering. Useful for tabular data problems such as churn, fraud, credit risk, pricing, and demand forecasting.
Prerequisites
- Basic Python and scikit-learn usage (fit, predict, train/test split)
- Understanding of decision trees (splits, depth, overfitting)
- Familiarity with metrics (Accuracy, ROC AUC, Precision/Recall)
Why this matters
In real data science work, you often face messy, mixed-type tabular data. Tree ensembles like Random Forest (RF) and Gradient Boosting (GBM) deliver strong performance with minimal preprocessing, handle non-linearities and interactions automatically, and provide feature importance for interpretability. They are go-to baselines and production workhorses for:
- Customer churn prediction and retention targeting
- Fraud detection with imbalanced data
- Credit scoring and risk ranking
- Pricing and demand forecasting with complex effects
Concept explained simply
Decision trees split data into regions to make predictions. They are easy to interpret but can overfit. Ensembles combine many trees to reduce error:
- Random Forest: Builds many deep trees on bootstrapped samples and randomly selects features at each split. Votes/averages predictions. It reduces variance (stabilizes predictions) and is robust with minimal tuning.
- Gradient Boosting: Builds trees sequentially. Each new tree focuses on the current errors of the model. With small learning_rate and many shallow trees, it achieves high accuracy. It reduces bias while controlling variance.
Mental model
Think of Random Forest as a panel of independent judges (trees) voting. Each judge looks at slightly different evidence (sampled rows and features), so group decisions are stable. Gradient Boosting is like iterative coaching: after each practice, the coach adds targeted advice (a new tree) to fix specific mistakes. Many small improvements compound into a strong model.
Core knobs you will tune
- For Random Forest (sklearn):
  - n_estimators: number of trees (e.g., 200–1000). More trees → better stability, slower training.
  - max_depth: tree depth. None or a large depth often works; use min_samples_leaf to regularize.
  - max_features: features considered per split (e.g., 'sqrt' for classification). Lower → more decorrelation between trees.
  - min_samples_leaf: minimum samples per leaf. Larger → smoother, less overfit (e.g., 5–50).
  - class_weight (classification): handle imbalance (e.g., 'balanced').
- For Gradient Boosting (sklearn HistGradientBoosting or GradientBoosting):
  - learning_rate: size of each step (e.g., 0.02–0.1). Lower → needs more trees, often better generalization.
  - max_depth or max_leaf_nodes: controls tree complexity. Shallow trees (depth 3–6) are common.
  - n_estimators: number of boosting rounds (e.g., 200–1000).
  - subsample: row subsampling per tree (e.g., 0.7–0.9) to reduce overfitting.
  - early_stopping (HistGradientBoosting): stop when the validation score stops improving.
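To see how these knobs are tuned in practice, here is a minimal sketch using RandomizedSearchCV. It is an illustration under assumptions, not a recipe: the synthetic dataset, the parameter ranges, and n_iter=10 are arbitrary choices you would adapt to your own data and time budget.
from scipy.stats import randint, uniform
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, HistGradientBoostingClassifier
from sklearn.model_selection import RandomizedSearchCV, train_test_split
X, y = make_classification(n_samples=5000, n_features=20, n_informative=4, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)
# Random Forest: tune the regularizing knobs; keep n_estimators fixed and reasonably large.
rf_search = RandomizedSearchCV(
    RandomForestClassifier(n_estimators=300, n_jobs=-1, random_state=0),
    param_distributions={'min_samples_leaf': randint(1, 50),
                         'max_features': ['sqrt', 0.3, 0.5]},
    n_iter=10, scoring='roc_auc', cv=3, random_state=0)
rf_search.fit(X_tr, y_tr)
# Gradient boosting: learning_rate and tree size dominate; early stopping picks the number of rounds.
gbm_search = RandomizedSearchCV(
    HistGradientBoostingClassifier(early_stopping=True, random_state=0),
    param_distributions={'learning_rate': uniform(0.02, 0.08),
                         'max_depth': randint(3, 7)},
    n_iter=10, scoring='roc_auc', cv=3, random_state=0)
gbm_search.fit(X_tr, y_tr)
print('RF best:', rf_search.best_params_, '| GBM best:', gbm_search.best_params_)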
Worked examples
Example 1 — Strong baseline with Random Forest (classification)
- Create synthetic data: 20 features, 2 informative, 10k rows. Split into train and held-out test sets.
- Train RandomForestClassifier with n_estimators=500, max_features='sqrt', min_samples_leaf=5.
- Evaluate ROC AUC on the held-out set; inspect feature importances.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score
X, y = make_classification(n_samples=10000, n_features=20, n_informative=2,
n_redundant=5, random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=42)
rf = RandomForestClassifier(n_estimators=500, max_features='sqrt',
min_samples_leaf=5, random_state=42, n_jobs=-1)
rf.fit(X_tr, y_tr)
proba = rf.predict_proba(X_te)[:,1]
print('ROC AUC:', roc_auc_score(y_te, proba))
print('Top features:', rf.feature_importances_.argsort()[::-1][:5])
What to look for: Stable AUC, low variance across runs, and a few features dominating importance.
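To quantify "low variance across runs", one simple check is to refit with a few different seeds (continuing the code above; three seeds and a smaller forest are used here only to keep runtime short):
# Refit the same configuration with a few seeds and compare held-out AUCs.
for seed in (0, 1, 2):
    rf_seed = RandomForestClassifier(n_estimators=200, max_features='sqrt',
                                     min_samples_leaf=5, random_state=seed, n_jobs=-1)
    rf_seed.fit(X_tr, y_tr)
    print('seed', seed, 'AUC:', roc_auc_score(y_te, rf_seed.predict_proba(X_te)[:, 1]))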
Example 2 — Handling class imbalance with RF
- Simulate a 5% positive class via weights=[0.95, 0.05] in make_classification.
- Use class_weight='balanced'; compare PR AUC with and without balancing.
from sklearn.metrics import average_precision_score
X, y = make_classification(n_samples=20000, n_features=30, n_informative=5,
weights=[0.95, 0.05], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)
rf_bal = RandomForestClassifier(n_estimators=600, min_samples_leaf=10,
class_weight='balanced', random_state=0, n_jobs=-1)
rf_bal.fit(X_tr, y_tr)
proba_bal = rf_bal.predict_proba(X_te)[:,1]
rf_unbal = RandomForestClassifier(n_estimators=600, min_samples_leaf=10,
random_state=0, n_jobs=-1)
rf_unbal.fit(X_tr, y_tr)
proba_unbal = rf_unbal.predict_proba(X_te)[:,1]
print('PR AUC balanced:', average_precision_score(y_te, proba_bal))
print('PR AUC unbalanced:', average_precision_score(y_te, proba_unbal))
Expected: The balanced model typically lifts recall on the rare positives; check whether PR AUC improves as well.
Example 3 — Gradient Boosting with early stopping (HistGradientBoosting)
- Use HistGradientBoostingClassifier with learning_rate=0.05, max_depth=4, early_stopping=True.
- Compare AUC against the Random Forest from Example 1.
from sklearn.ensemble import HistGradientBoostingClassifier  # no experimental import needed in scikit-learn >= 1.0
X, y = make_classification(n_samples=10000, n_features=20, n_informative=4,
random_state=7)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=7)
gbm = HistGradientBoostingClassifier(learning_rate=0.05, max_depth=4,
early_stopping=True, validation_fraction=0.2,
random_state=7)
gbm.fit(X_tr, y_tr)
proba = gbm.predict_proba(X_te)[:,1]
print('GBM ROC AUC:', roc_auc_score(y_te, proba))
Expected: GBM often edges out RF on clean, informative features when tuned moderately.
Hands-on exercises
These mirror the graded exercises below. Aim to complete them, then check your work against the solutions.
- Exercise 1 — Baseline Random Forest
- Build a RF baseline on a synthetic binary dataset. Target: validation ROC AUC ≥ 0.85.
- Exercise 2 — RF vs GBM tuning
- Run basic hyperparameter search for RF and GBM. Report best model and ROC AUC.
- Exercise 3 — Interpretability
- Compute permutation importances and partial dependence (or 1D ICE) for top features. Write 2 insights.
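For Exercise 3, here is a minimal, self-contained sketch; the synthetic dataset, scoring='roc_auc', n_repeats=10, and plotting the top two features are illustrative choices, not the graded solution.
import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance, PartialDependenceDisplay
from sklearn.model_selection import train_test_split
X, y = make_classification(n_samples=10000, n_features=20, n_informative=4, random_state=1)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=1)
rf = RandomForestClassifier(n_estimators=300, min_samples_leaf=5,
                            random_state=1, n_jobs=-1).fit(X_tr, y_tr)
# Permutation importance: shuffle one feature at a time and measure the drop in AUC.
perm = permutation_importance(rf, X_te, y_te, scoring='roc_auc',
                              n_repeats=10, random_state=1, n_jobs=-1)
top = np.argsort(perm.importances_mean)[::-1][:3]
print('Top features by permutation importance:', top)
# 1D partial dependence for the two most important features.
PartialDependenceDisplay.from_estimator(rf, X_te, features=top[:2].tolist())
plt.show()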
Checklist before you move on
- You can explain the difference between variance reduction (RF) and bias reduction (GBM).
- You know which parameters to tune first for each method.
- You can handle class imbalance with class_weight or appropriate metrics.
- You can extract and sanity-check feature importance.
- You know when to use early stopping and small learning_rate in GBM.
Common mistakes and self-check
- Too deep GBM trees with a high learning rate: leads to overfitting. Self-check: validation AUC improves early, then degrades. Fix: smaller max_depth, lower learning_rate, enable early stopping.
- Feature scaling for GBM: not required. Trees don't need scaling, but ensure consistent encodings and missing-value handling.
- Using accuracy on imbalanced data: Misleading. Self-check: compare with PR AUC or recall at K. Fix: pick metrics matching business goal.
- Unstable importance interpretation: RF split-based importance can be biased toward high-cardinality features. Self-check: compare with permutation importance.
- Too few trees: high variance. Self-check: results fluctuate across random seeds. Fix: increase n_estimators.
Practical projects
- Customer churn: RF baseline → GBM tuned model. Deliver a calibration curve and top-5 feature insights.
- Fraud detection: Build PR AUC–focused GBM, threshold for top 1% alerts. Compare cost savings vs RF.
- Credit risk: Train RF then GBM, produce risk scores, monotonic partial dependence for key financial features.
Learning path
- 1) Review decision trees and impurity metrics
- 2) Fit Random Forest baseline and learn importance types
- 3) Fit Gradient Boosting with early stopping
- 4) Tune core hyperparameters; select metrics aligned to the business
- 5) Interpret model: permutation importances, partial dependence
- 6) Package model: inference pipeline, input validation, monitoring plan
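For step 6, a minimal packaging sketch (reusing the train/test split from Example 3; joblib persistence, the SimpleImputer stand-in for real preprocessing, and the file name are assumptions for illustration):
from joblib import dump, load
from sklearn.ensemble import HistGradientBoostingClassifier
from sklearn.impute import SimpleImputer
from sklearn.pipeline import make_pipeline
# Bundle preprocessing and the model so inference sees a single object.
# SimpleImputer is only a placeholder for whatever preprocessing your real data needs.
pipe = make_pipeline(SimpleImputer(strategy='median'),
                     HistGradientBoostingClassifier(learning_rate=0.05, random_state=0))
pipe.fit(X_tr, y_tr)
dump(pipe, 'gbm_pipeline.joblib')  # persist the fitted pipeline
loaded = load('gbm_pipeline.joblib')
X_new = X_te[:5]
assert X_new.shape[1] == X_tr.shape[1], 'unexpected feature count'  # basic input validation
print(loaded.predict_proba(X_new)[:, 1])  # scores for monitoring or downstream decisions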
Next steps
- Try monotonic constraints or categorical handling with advanced gradient boosters (conceptually). Keep focus on core principles here.
- Practice with real tabular datasets and compare RF vs GBM under time limits.
- Add calibration (Platt or isotonic) if probabilities will drive decisions.
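For the calibration step, a hedged sketch using CalibratedClassifierCV (reusing the train/test split from Example 3; method='isotonic', cv=5, and n_bins=10 are illustrative choices):
import matplotlib.pyplot as plt
from sklearn.calibration import CalibratedClassifierCV, CalibrationDisplay
from sklearn.ensemble import RandomForestClassifier
# Fit and calibrate the classifier via cross-validation on the training data.
cal_rf = CalibratedClassifierCV(
    RandomForestClassifier(n_estimators=300, min_samples_leaf=5, random_state=42, n_jobs=-1),
    method='isotonic', cv=5)
cal_rf.fit(X_tr, y_tr)
# Reliability curve: predicted probability vs observed frequency on held-out data.
CalibrationDisplay.from_estimator(cal_rf, X_te, y_te, n_bins=10)
plt.show()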
Mini challenge
You have 2 hours to ship a model predicting a binary outcome on a noisy tabular dataset. What do you choose and why?
Suggested approach
- Start with RF (fast, robust). Run a quick CV to set min_samples_leaf and n_estimators.
- If time remains, train a GBM with a small learning_rate, max_depth=3–5, and early stopping.
- Pick the better validation AUC; sanity-check with permutation importance.