Why this matters
Feature selection reduces noise, speeds up training, improves generalization, and clarifies what drives predictions. As a Data Scientist, you will:
- Ship models faster by trimming dozens/hundreds of weak features.
- Fight overfitting and leakage using only informative, safe predictors.
- Explain models to stakeholders with a concise, credible feature set.
Real tasks you will face
- Cut a 10,000-word text vocabulary down to the top 2,000 terms.
- Decide which of 80 tabular fields to keep in a credit risk model.
- Choose between L1-regularized selection vs. tree-based importances.
Concept explained simply
Feature selection is choosing a subset of existing features that best predict the target. It differs from feature extraction, which creates new features (e.g., PCA).
- Filter methods: score each feature individually (e.g., correlation, chi-square, mutual information) and keep the best.
- Wrapper methods: test subsets with a model (e.g., RFE/RFECV).
- Embedded methods: selection happens inside training (e.g., L1/Lasso, tree-based importance).
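Below is a minimal sketch of all three families using scikit-learn, assuming a generic numeric matrix X and target y (the synthetic data and the choice of k = 10 are illustrative, not taken from any example in this lesson):

```python
# Sketch of the three selection families with scikit-learn (illustrative only).
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, mutual_info_classif, RFE, SelectFromModel
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=500, n_features=30, n_informative=8, random_state=0)

# Filter: score each feature on its own, keep the top 10.
filter_sel = SelectKBest(mutual_info_classif, k=10).fit(X, y)

# Wrapper: repeatedly fit a model and drop the weakest features.
wrapper_sel = RFE(LogisticRegression(max_iter=1000), n_features_to_select=10).fit(X, y)

# Embedded: an L1 penalty zeroes out weak coefficients during training.
l1_model = LogisticRegression(penalty="l1", solver="liblinear", C=0.1)
embedded_sel = SelectFromModel(l1_model).fit(X, y)

print(filter_sel.get_support().sum(),
      wrapper_sel.get_support().sum(),
      embedded_sel.get_support().sum())
```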
Mental model: signal vs. budget
Imagine each feature requests a slice of your model's attention budget. You keep features that bring signal and drop those that bring mostly noise or duplicate information.
Quick rules you can trust
- Start simple: baseline model with all safe features. Measure before you prune.
- Remove data leaks first. Never run selection against the target on the full dataset before splitting.
- Check redundancy: drop one of any highly correlated twins.
- Validate with cross-validation and a pipeline so selection happens inside CV folds.
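As a minimal sketch of the last rule, here is selection wrapped in a pipeline so it is re-fit on each training fold only (the synthetic data and k = 15 are assumptions for illustration):

```python
# Selection inside a Pipeline: the selector is re-fit on each training fold only.
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline

X, y = make_classification(n_samples=1000, n_features=50, n_informative=10, random_state=0)

pipe = Pipeline([
    ("select", SelectKBest(f_classif, k=15)),   # fitted inside each fold
    ("model", LogisticRegression(max_iter=1000)),
])

scores = cross_val_score(pipe, X, y, cv=5, scoring="roc_auc")
print(scores.mean())  # selection never saw the validation fold
```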
Practical workflow
- Define objective and metric (e.g., AUC for churn, RMSE for price).
- Split data (train/validation/test) or set up cross-validation. For time series, use time-aware splits.
- Baseline model with minimal preprocessing. Record metric.
- Basic pruning:
- Remove leakage and post-outcome features (e.g., refund_issued_after_label).
- Drop near-constant features (very low variance).
- Handle duplicates and highly correlated pairs (keep the most interpretable/complete).
- Filter selection:
- Regression: Pearson/Spearman correlation, mutual information.
- Classification: ANOVA F-test (numeric features), chi-square (non-negative counts/frequencies), mutual information (numeric or discrete).
- Embedded selection:
- L1 models (Lasso/L1-logistic) to zero out weak coefficients.
- Tree-based models to get robust importance ranks.
- Wrapper refinement:
- Use RFECV (recursive feature elimination with CV) to refine K.
- Evaluate and lock: choose the smallest feature set that matches or improves performance; check stability across folds and seeds.
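The workflow above, minus the data-specific leakage checks, condenses into a sketch like the following (the thresholds, k values, and synthetic data are illustrative assumptions, not a definitive recipe):

```python
# Condensed workflow sketch: prune, filter, cross-check with an embedded method, refine with RFECV.
import numpy as np
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, mutual_info_classif, RFECV, SelectFromModel
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X_arr, y = make_classification(n_samples=2000, n_features=40, n_informative=12, random_state=0)
X = pd.DataFrame(X_arr, columns=[f"f{i}" for i in range(40)])

# Basic pruning: drop near-constant features.
X = X.loc[:, X.var() > 1e-4]

# Basic pruning: drop one of each highly correlated pair (|r| > 0.95).
corr = X.corr().abs()
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
to_drop = [c for c in upper.columns if (upper[c] > 0.95).any()]
X = X.drop(columns=to_drop)

# Filter + embedded selection inside a pipeline, evaluated with CV.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
pipe = Pipeline([
    ("scale", StandardScaler()),
    ("filter", SelectKBest(mutual_info_classif, k=20)),
    ("embedded", SelectFromModel(LogisticRegression(penalty="l1", solver="liblinear", C=0.5))),
    ("model", LogisticRegression(max_iter=1000)),
])
print("AUC:", cross_val_score(pipe, X, y, cv=cv, scoring="roc_auc").mean())

# Wrapper refinement: RFECV suggests the number of features to keep.
rfecv = RFECV(LogisticRegression(max_iter=1000), cv=cv, scoring="roc_auc")
rfecv.fit(StandardScaler().fit_transform(X), y)
print("Optimal number of features:", rfecv.n_features_)
```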
Worked examples
Example 1 — Churn classification (tabular, mixed types)
- Data: 30 features (age, tenure, avg_session_time, plan_type, num_support_tickets, etc.).
- Steps:
- Remove leakage (e.g., "cancellation_request_date").
- Encode categoricals (fit any target encoding inside CV folds only). Scale numeric features when a test or model requires it.
- Filter: ANOVA F-test for numeric vs. churn; mutual information for mixed types.
- Redundancy: if tenure and months_active correlate at 0.95, keep one.
- Embedded: L1-logistic to zero weak features; compare with tree-based importance.
- Result: kept 14 features; AUC improved from 0.812 to 0.826. Training time dropped ~30%.
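A sketch of the filter-plus-embedded cross-check used above, on a synthetic stand-in for the churn table (the real column names and the cutoff of 14 features are assumptions here):

```python
# Cross-check a mutual-information ranking against L1-logistic coefficients.
import numpy as np
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.feature_selection import mutual_info_classif
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

# Stand-in for the encoded churn table; real columns like tenure or plan_type_* are assumptions.
X_arr, y = make_classification(n_samples=3000, n_features=30, n_informative=10, random_state=1)
X = pd.DataFrame(X_arr, columns=[f"feat_{i}" for i in range(30)])

# Filter view: mutual information between each feature and churn.
mi = pd.Series(mutual_info_classif(X, y, random_state=1), index=X.columns)

# Embedded view: absolute L1-logistic coefficients on scaled features.
l1 = LogisticRegression(penalty="l1", solver="liblinear", C=0.3)
l1.fit(StandardScaler().fit_transform(X), y)
coef = pd.Series(np.abs(l1.coef_[0]), index=X.columns)

top_mi = set(mi.nlargest(14).index)
top_l1 = set(coef[coef > 0].nlargest(14).index)
print("Kept by both methods:", sorted(top_mi & top_l1))
```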
Example 2 — House price regression
- Data: 55 numeric features (size, rooms, age, distances, engineered ratios).
- Steps:
- Filter: Pearson correlation with price; examine top 20.
- Check multicollinearity: VIF; drop one of area, rooms, area_per_room as needed.
- Embedded: LassoCV selects ~12 non-zero features.
- Wrapper: RFECV with gradient boosting confirms 10–14 features is optimal.
- Result: RMSE fell from 41.2k to 38.9k while keeping 13 features. Interpretability improved.
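The embedded and wrapper steps of this example look roughly like the sketch below (synthetic regression data stands in for the housing table; the 55 features and the RMSE scoring mirror the description above):

```python
# LassoCV for embedded selection, then RFECV with gradient boosting to confirm K.
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.feature_selection import RFECV
from sklearn.linear_model import LassoCV
from sklearn.model_selection import KFold
from sklearn.preprocessing import StandardScaler

X, y = make_regression(n_samples=800, n_features=55, n_informative=12, noise=10.0, random_state=0)
X_scaled = StandardScaler().fit_transform(X)

# Embedded: LassoCV tunes its own regularization and zeroes out weak coefficients.
lasso = LassoCV(cv=5, random_state=0).fit(X_scaled, y)
print("Lasso kept", (lasso.coef_ != 0).sum(), "features")

# Wrapper: RFECV with a tree ensemble gives a second opinion on the optimal K.
cv = KFold(n_splits=5, shuffle=True, random_state=0)
rfecv = RFECV(GradientBoostingRegressor(random_state=0), step=5,
              cv=cv, scoring="neg_root_mean_squared_error")
rfecv.fit(X, y)
print("RFECV optimal K:", rfecv.n_features_)
```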
Example 3 — Text classification (bag-of-words)
- Data: 50k vocabulary. Binary labels.
- Steps:
- Prune rare terms (minimum document frequency): remove tokens appearing in fewer than 5 documents.
- Filter: chi-square to keep top 2,500 tokens.
- Model: linear SVM or logistic on selected features.
- Result: Accuracy stable, training 5x faster, memory usage down sharply.
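A sketch of this text pipeline, scaled down to a two-class subset of 20 Newsgroups as a stand-in corpus (downloaded on first use; min_df=5 matches the pruning step above, while k=1000 is scaled down from the 2,500 tokens used in the example):

```python
# Bag-of-words: prune rare terms, keep top tokens by chi-square, fit a linear model.
from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC

# Two-class subset of 20 Newsgroups stands in for the 50k-vocabulary corpus.
data = fetch_20newsgroups(subset="train", categories=["sci.med", "rec.autos"],
                          remove=("headers", "footers", "quotes"))

pipe = Pipeline([
    ("bow", CountVectorizer(min_df=5)),      # drop terms appearing in fewer than 5 docs
    ("chi2", SelectKBest(chi2, k=1000)),     # keep the top tokens by chi-square
    ("clf", LinearSVC()),
])

print(cross_val_score(pipe, data.data, data.target, cv=5, scoring="accuracy").mean())
```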
Evaluating your selection
- Always evaluate inside cross-validation with a pipeline so selection uses only training folds.
- Never touch the test set until the very end.
- For time series, use expanding or rolling splits; do not shuffle.
- Check stability: run with different seeds; measure overlap of selected features.
- Prefer the smallest set that achieves within ~1% of the best score.
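One way to run the stability check is sketched below: refit the selector under several seeds and measure how much the selected sets overlap (the random-forest selector, the top-15 cutoff, and Jaccard overlap are illustrative choices):

```python
# Stability check: rerun selection under different seeds and compare the selected sets.
from itertools import combinations
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectFromModel

X, y = make_classification(n_samples=2000, n_features=40, n_informative=10, random_state=0)

selected_sets = []
for seed in range(5):
    sel = SelectFromModel(
        RandomForestClassifier(n_estimators=200, random_state=seed),
        max_features=15, threshold=-np.inf,   # keep exactly the top 15 by importance
    ).fit(X, y)
    selected_sets.append(set(np.flatnonzero(sel.get_support())))

# Pairwise Jaccard overlap: values near 1.0 mean the selection is stable across seeds.
overlaps = [len(a & b) / len(a | b) for a, b in combinations(selected_sets, 2)]
print("Mean Jaccard overlap:", round(float(np.mean(overlaps)), 3))
```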
Exercises
Complete these tasks, then compare with the provided solutions.
Exercise 1 — Design a robust selection plan (classification)
Scenario: You have a churn dataset with 12,000 rows, 30 features (mixed numeric/categorical). Goal: maximize AUC while keeping the feature set lean and leak-free.
- Write a step-by-step selection plan you will execute (from baseline to final lock-in).
- Specify which tests/methods you will use (e.g., ANOVA, MI, L1, tree importances, RFECV).
- Propose how you will detect redundancy and leakage.
- Define your stopping rule (how you decide K).
Submission checklist
- Clear numbered plan with 6–10 steps.
- Leakage checks included.
- Evaluation protocol describes CV and metric.
- Decision rule for K stated.
Exercise 2 — Compare Lasso vs. tree-based selection (regression)
Scenario: You model energy consumption with 5,000 rows and 40 numeric features. You will compare two paths:
- LassoCV to select features via non-zero coefficients.
- Gradient boosting importances followed by selecting the top K and validating.
Design the comparison: how you choose K for the tree path, how you ensure fairness, and how you decide the winner if RMSE differences are small.
Submission checklist
- Both paths evaluated under the same CV and preprocessing.
- Method for choosing K described (e.g., RFECV or CV curve).
- Decision rule when RMSE is within 1% stated (prefer simpler).
Common mistakes and self-check
- Leakage: selecting features using the full dataset before splitting. Self-check: Is selection in a pipeline within CV?
- Over-reliance on a single method: only correlation or only tree importance. Self-check: Did you triangulate with at least two methods?
- Ignoring multicollinearity. Self-check: Did you review high correlations/VIF and prune duplicates?
- Using test set during selection. Self-check: Is test untouched until final?
- Dropping rare but crucial features. Self-check: Did performance worsen after removal? If so, reconsider.
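The first self-check can be made concrete with pure-noise data: selecting on the full dataset produces an impressive-looking but fake score, while the pipelined version stays near chance (a sketch; the data shape and k are arbitrary):

```python
# Wrong vs. right: feature selection outside vs. inside the cross-validation loop.
import numpy as np
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline

# Pure noise: there is no real signal, so an honest AUC should sit near 0.5.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2000))
y = rng.integers(0, 2, size=200)

# WRONG: the selector sees the target for all rows, including future validation folds.
X_leaky = SelectKBest(f_classif, k=20).fit_transform(X, y)
leaky = cross_val_score(LogisticRegression(max_iter=1000), X_leaky, y, cv=5, scoring="roc_auc")
print("Leaky CV AUC:", leaky.mean())     # typically far above 0.5

# RIGHT: selection is refit inside each training fold via a pipeline.
pipe = Pipeline([("select", SelectKBest(f_classif, k=20)),
                 ("model", LogisticRegression(max_iter=1000))])
honest = cross_val_score(pipe, X, y, cv=5, scoring="roc_auc")
print("Honest CV AUC:", honest.mean())   # hovers around 0.5
```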
Practical projects
- Credit risk mini-model: from 60 tabular features, ship a lean model with 15–20 features, documented selection rationale.
- Topic classifier: reduce a 30k-term vocabulary to the best 3k features and benchmark linear vs. tree models.
- Demand forecasting: time-series-aware selection with expanding CV; compare lag feature sets of sizes 10, 20, and 40.
Quick test
Take the quick test to check your understanding.
Mini challenge
Take any recent model you built and cut its features by 30–50% without losing more than 1% of the primary metric. Document the final set and how you chose it. Tip: combine one filter + one embedded method, then validate with RFECV.
Who this is for
- Data Scientists and ML Engineers building models on tabular or text data.
- Analysts transitioning to predictive modeling and seeking reliable selection patterns.
Prerequisites
- Basic supervised learning (classification, regression).
- Train/validation/test split and cross-validation.
- Familiarity with scaling/encoding and model evaluation metrics.
Learning path
- Review selection families: filter, wrapper, embedded.
- Practice a baseline-first workflow with proper CV.
- Apply filter methods and prune redundancy.
- Cross-check with embedded (L1 or trees).
- Refine K with RFECV; check stability; lock the set.
Next steps
- Automate your selection inside a pipeline for reproducibility.
- Track stability across seeds and data slices.
- Move to advanced selection for high-dimensional data (e.g., text, genomics).