Why this matters
Text datasets often have tens or hundreds of thousands of n-gram features. Good feature selection makes classical NLP models faster, more accurate, and easier to interpret.
- Production speed: Smaller vocabularies reduce memory and latency for spam filters, intent classifiers, or routing systems.
- Generalization: Removing noisy tokens reduces overfitting, especially with small labeled datasets.
- Interpretability: You can show stakeholders which words drive decisions, useful for compliance and debugging.
Concept explained simply
Feature selection for text means choosing which words, n-grams, or hand-crafted features to keep in your model. The goal is to keep signals that help predict labels and drop tokens that add noise.
Filter vs Wrapper vs Embedded (quick reference)
- Filter methods: Score features without fitting a final model. Examples: document-frequency filters (min_df, max_df), chi-square, mutual information, information gain.
- Wrapper methods: Use a predictive model to search for a good subset. Example: Recursive Feature Elimination (RFE). Costly for large vocabularies.
- Embedded methods: The model intrinsically selects features. Examples: L1-regularized logistic regression (lasso), linear SVMs with feature weights, tree-based models with feature importance.
Mental model
Imagine every token as a dial. Helpful tokens “move with” the label across documents. Filters measure how tightly a dial moves with the label. Embedded methods let the model twist the dials and set many to zero. The best subset is the fewest dials that still let you predict well.
Quick recipe (practical steps)
- Vectorize: Tokenize, pick n-grams, and start with TF-IDF or count vectors.
- Frequency filters: Remove ultra-rare tokens with min_df and remove ultra-common tokens with max_df to drop boilerplate.
- Filter selection: Use chi-square or mutual information to keep top-k features. For multiclass, score per class and keep the union.
- Embedded pruning: Fit an L1-penalized linear model to further zero-out weak features.
- Evaluate properly: Select features inside each cross-validation fold to avoid leakage. Track F1/ROC-AUC and runtime.
- Stability check: Compare selected sets across folds (e.g., Jaccard overlap). Prefer stable subsets with similar performance.
Minimal scikit-learn style pipeline (conceptual)
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.model_selection import cross_val_score
pipe = Pipeline([
    ("tfidf", TfidfVectorizer(ngram_range=(1, 2), min_df=3, max_df=0.9)),
    ("chi", SelectKBest(chi2, k=20000)),
    ("clf", LogisticRegression(penalty="l2", max_iter=1000)),
])
# IMPORTANT: selection happens inside CV folds
scores = cross_val_score(pipe, X_text, y, cv=5, scoring="f1_macro")
print(scores.mean())
Swap the L2 penalty for an L1 penalty to embed selection in the classifier itself, or reduce k to speed up training.
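To follow that note, here is a minimal sketch of the same pipeline with embedded L1 selection in place of the chi-square step; the liblinear solver and C=1.0 are illustrative assumptions to tune on your data.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

# Embedded selection: the L1 penalty drives many coefficients to exactly zero.
l1_pipe = Pipeline([
    ("tfidf", TfidfVectorizer(ngram_range=(1, 2), min_df=3, max_df=0.9)),
    ("clf", LogisticRegression(penalty="l1", solver="liblinear", C=1.0, max_iter=1000)),
])
# After fitting, count the surviving features:
# l1_pipe.fit(X_text, y)
# n_kept = (l1_pipe.named_steps["clf"].coef_ != 0).any(axis=0).sum()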
Worked examples
1) Spam detection (binary classification)
Start with unigram+bigram TF-IDF. Apply min_df=3 to remove typos and rare IDs. Use chi-square to select top 20k features. Evaluate with 5-fold CV. Expect top tokens like “free”, “win”, “offer”. Add L1 logistic regression for embedded pruning. Compare speed and F1 versus no selection.
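A sketch of that comparison, assuming X_text and y hold the raw texts and labels (the helper name spam_pipeline is illustrative); cross_validate reports macro-F1 and per-fold fit time, so you can see quality and speed side by side.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_validate
from sklearn.pipeline import Pipeline

def spam_pipeline(select_k=None):
    # Build the same pipeline with or without the chi-square step.
    steps = [("tfidf", TfidfVectorizer(ngram_range=(1, 2), min_df=3))]
    if select_k is not None:
        # k must not exceed the vocabulary size; lower it for small corpora.
        steps.append(("chi", SelectKBest(chi2, k=select_k)))
    steps.append(("clf", LogisticRegression(max_iter=1000)))
    return Pipeline(steps)

for name, pipe in [("no selection", spam_pipeline()), ("chi2 top 20k", spam_pipeline(20000))]:
    res = cross_validate(pipe, X_text, y, cv=5, scoring="f1_macro")
    print(name, res["test_score"].mean(), res["fit_time"].mean())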
What’s happening under the hood?
Chi-square checks whether the presence of a token is independent of the spam label. Tokens that co-occur unusually often with spam get higher scores.
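To make that concrete, here is a small sketch with hypothetical counts for the token "free"; it uses a full 2x2 contingency test from scipy, while sklearn's chi2 scorer uses a related but simpler computation over per-class feature counts.
from scipy.stats import chi2_contingency

# Hypothetical counts: "free" appears in 300 of 1,000 spam documents
# but only 20 of 4,000 ham documents.
# rows = [spam, ham], columns = [contains "free", does not contain "free"]
table = [[300, 700],
         [20, 3980]]
chi2_stat, p_value, dof, expected = chi2_contingency(table, correction=False)
print(chi2_stat, p_value)  # large statistic, tiny p-value: presence is far from independent of the label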
2) Sentiment analysis (multiclass)
Try unigram and bigram features. Use max_df=0.95 to drop boilerplate like “the”, “and”. Use mutual information or chi-square per class, keep the union of top 5k per class. Validate that top negatives include “awful”, “refund”, and top positives include “amazing”, “delighted”.
Tip: balance and per-class selection
If classes are imbalanced, per-class selection ensures minority-class indicators are not drowned out by majority-class features.
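One way to sketch per-class selection with a union, assuming X_counts is a sparse count matrix and y holds the class labels; the helper name and k_per_class value are illustrative.
import numpy as np
from sklearn.feature_selection import chi2

def per_class_union(X_counts, y, k_per_class=5000):
    # Score each class one-vs-rest, keep the top-k feature indices per class, take the union.
    keep = set()
    for label in np.unique(y):
        scores, _ = chi2(X_counts, (np.asarray(y) == label).astype(int))
        scores = np.nan_to_num(scores)  # guard against NaN scores from degenerate features
        keep.update(np.argsort(scores)[::-1][:k_per_class].tolist())
    return np.array(sorted(keep))

# Usage: X_selected = X_counts[:, per_class_union(X_counts, y, k_per_class=5000)]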
3) Topic classification (news)
Use min_df=5, bigrams enabled. Start with no selection to get a baseline. Then apply SelectKBest(chi2, k=30000). Finally, try an L1 linear SVM (a sketch follows below). Observe accuracy changes and time-to-fit. Expect interpretable top features per topic like “stock market”, “climate change”.
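A sketch of that L1 linear SVM step using SelectFromModel, so selection stays inside the pipeline; the C value is an assumption to tune.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.feature_selection import SelectFromModel
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC

# penalty="l1" requires dual=False; the sparse coefficients decide which features survive.
topic_pipe = Pipeline([
    ("tfidf", TfidfVectorizer(ngram_range=(1, 2), min_df=5)),
    ("l1_select", SelectFromModel(LinearSVC(penalty="l1", dual=False, C=0.5))),
    ("clf", LinearSVC(dual=False)),
])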
Exercises (hands-on)
These mirror the exercises below. Do them here, then submit your answers in the exercise blocks.
Exercise 1: Rank tokens by chi-square
Documents (lowercased, tokenized by whitespace):
- spam: "win a free prize now"
- ham: "see you at lunch"
- spam: "limited time offer win cash"
- ham: "project meeting schedule"
- spam: "claim your free offer now"
- ham: "let's meet for lunch"
Use intuition to pick the top 3 spam-associated tokens by chi-square-style reasoning. Output the top 3 tokens.
Hints
- Focus on tokens that appear mostly in spam and rarely in ham.
- Tokens appearing in both classes get lower scores.
Expected solution (peek only after trying)
Likely top 3: ["win", "free", "offer"]. "now" also appears only in spam (twice) and ties with them; "limited" and "claim" are weaker runners-up.
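You can check the intuition with a short sketch that runs sklearn's chi2 on these six documents; note that chi-square measures association, not direction, so "lunch" (a ham cue) also scores highly.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_selection import chi2

docs = [
    "win a free prize now",
    "see you at lunch",
    "limited time offer win cash",
    "project meeting schedule",
    "claim your free offer now",
    "let's meet for lunch",
]
labels = [1, 0, 1, 0, 1, 0]  # 1 = spam, 0 = ham

vec = CountVectorizer()
X = vec.fit_transform(docs)
scores, _ = chi2(X, labels)

ranked = sorted(zip(vec.get_feature_names_out(), scores), key=lambda t: -t[1])
print(ranked[:6])
# "win", "free", "offer", "now" (spam-only) and "lunch" (ham-only) all tie at the top;
# chi-square scores association with the label, not its direction.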
Exercise 2: Design a selection plan
You have 10k documents, 8 classes, and latency constraints. Propose min_df/max_df, a filter method (chi-square or MI), a k to start with, and whether you will add an embedded L1 step. Justify your picks in 5 sentences maximum.
Hints
- Use per-class selection to protect minority classes.
- Start with a generous k and trim with L1 if latency is tight.
Expected solution (example)
min_df=3, max_df=0.95; per-class chi-square keeping 5k per class and taking the union (capped at 40k), then L1 logistic regression to prune further. Re-tune k with CV and measure latency.
Build-and-check checklist
- Baseline recorded with no selection.
- min_df and max_df tuned to remove rare and boilerplate tokens.
- Feature selection performed inside each CV fold (no leakage).
- k tuned via validation; report performance vs k.
- Stability of selected features checked across folds (see the sketch after this checklist).
- Interpretability: Reviewed top features per class for face validity.
- Latency/memory measured before and after selection.
- All steps captured in a single reproducible pipeline.
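A sketch of the stability check from the checklist, assuming X_text and y are lists of texts and labels as before; it refits the vectorizer and selector on each training fold and compares the selected vocabularies with pairwise Jaccard overlap.
from itertools import combinations
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.model_selection import StratifiedKFold

selected_sets = []
for train_idx, _ in StratifiedKFold(n_splits=5, shuffle=True, random_state=0).split(X_text, y):
    fold_texts = [X_text[i] for i in train_idx]
    fold_labels = [y[i] for i in train_idx]
    vec = TfidfVectorizer(ngram_range=(1, 2), min_df=3, max_df=0.9)
    X_fold = vec.fit_transform(fold_texts)
    selector = SelectKBest(chi2, k=min(20000, X_fold.shape[1])).fit(X_fold, fold_labels)
    names = vec.get_feature_names_out()[selector.get_support()]
    selected_sets.append(set(names))

# Pairwise Jaccard overlap between the per-fold selected vocabularies
for a, b in combinations(range(len(selected_sets)), 2):
    overlap = len(selected_sets[a] & selected_sets[b]) / len(selected_sets[a] | selected_sets[b])
    print(a, b, overlap)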
Common mistakes and self-check
- Leakage: Selecting features on the full dataset before CV. Fix: Wrap selection in the pipeline.
- Over-pruning: Choosing k so small that recall collapses. Fix: Compare metrics vs baseline; inspect class-wise errors.
- Ignoring minority classes: Global top-k misses minority cues. Fix: Per-class selection and union.
- Using TF-IDF with chi-square carelessly: chi-square assumes non-negative, count-like features. TF-IDF is non-negative, so sklearn's chi2 will run, but the test is designed for counts. Fix: score on count or binary vectors, or treat TF-IDF-based chi-square scores as a heuristic.
- Assuming stopword removal is enough: It is not feature selection. Fix: Still apply statistical selection.
Practical projects
- News topic classifier: Compare no selection vs chi-square top-30k vs L1 logistic regression. Report accuracy, F1, latency, and top features per topic.
- Support ticket routing: Build a triage model. Use min_df/max_df, chi-square per class, then optional L1. Evaluate macro-F1 and feature stability.
- Toxic comment detection: Tune selection to achieve a 2x speedup with minimal F1 loss. Document the trade-off curve.
Who this is for
- Junior to mid-level NLP engineers moving from notebooks to production pipelines.
- Data scientists working with sparse text models who need speed and interpretability.
Prerequisites
- Comfort with tokenization and n-grams.
- Understanding of bag-of-words/TF-IDF.
- Basic ML metrics (accuracy, precision/recall, F1, ROC-AUC).
Learning path
- Tokenization and normalization basics.
- Vectorization (count, TF-IDF, n-grams).
- Feature selection (this page): frequency filters, chi-square/MI, L1 models.
- Modeling with linear classifiers or SVMs.
- Evaluation with leakage-safe CV and error analysis.
- Scaling and deployment with compact vocabularies.
Next steps
- Explore dimensionality reduction for text (e.g., SVD) and compare it to feature selection (a sketch follows this list).
- Try different regularizers (L1 vs L2) and observe sparsity/performance trade-offs.
- Automate selection tuning with simple sweeps over k and min_df/max_df.
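For the SVD comparison, a minimal sketch under the same X_text/y assumptions; TruncatedSVD builds dense combinations of all features rather than keeping a subset, so interpretability differs even when scores are similar.
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline

# Dimensionality reduction: keep 300 dense SVD components.
svd_pipe = Pipeline([
    ("tfidf", TfidfVectorizer(ngram_range=(1, 2), min_df=3, max_df=0.9)),
    ("svd", TruncatedSVD(n_components=300, random_state=0)),
    ("clf", LogisticRegression(max_iter=1000)),
])
# Feature selection: keep 20k of the original sparse features.
chi_pipe = Pipeline([
    ("tfidf", TfidfVectorizer(ngram_range=(1, 2), min_df=3, max_df=0.9)),
    ("chi", SelectKBest(chi2, k=20000)),
    ("clf", LogisticRegression(max_iter=1000)),
])
for name, pipe in [("svd 300", svd_pipe), ("chi2 20k", chi_pipe)]:
    print(name, cross_val_score(pipe, X_text, y, cv=5, scoring="f1_macro").mean())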
Mini challenge
Constraint: Reduce vocabulary by 60% while keeping macro-F1 within 1 point of baseline on a multiclass dataset. Deliverables: chosen min_df/max_df, selection method, k, final metrics, top 10 features per class, latency before/after.
Quick Test
Take the quick test below to check your understanding.