Why this skill matters for an NLP Engineer
Feature engineering is how classical NLP models understand text. While deep learning can learn features automatically, classical pipelines are still faster to build, easier to interpret, cheaper to run, and hard to beat as baselines. As an NLP Engineer, mastering feature engineering lets you ship reliable models for classification, retrieval, deduplication, and rule-augmented systems, often in hours rather than weeks.
What this unlocks in your day-to-day
- Ship strong baselines using TF–IDF and linear models.
- Build fast semantic search with cosine similarity.
- Create robust spam/toxicity filters with char n-grams.
- Augment models with rule-based features for interpretability.
- Control memory and speed using sparse vectors and hashing.
What you will be able to do
- Turn raw text into effective sparse features (BoW, TF–IDF, n-grams).
- Combine rule-based and statistical features in a single pipeline.
- Use topic models (LDA/NMF) as compact semantic features.
- Compute and use similarity scores (cosine, Jaccard) safely on sparse data.
- Reduce dimensionality and overfitting with feature selection.
- Handle memory at scale with CSR matrices and the hashing trick.
Who this is for
- Aspiring NLP Engineers building text classifiers, rankers, and deduplication tools.
- Data Scientists who need fast, interpretable baselines or low-latency production models.
- Backend/ML Engineers integrating lightweight NLP into existing products.
Prerequisites
- Python basics and ability to run notebooks or scripts.
- Intro ML concepts (train/test split, cross-validation, overfitting).
- Familiarity with scikit-learn pipelines is helpful but not required.
Learning path
- Start with Bag-of-Words and TF–IDF: produce sparse matrices; understand tokenization, min_df/max_df, and stopword handling.
- Add n-grams and character features: capture phrases, misspellings, and subword patterns; control feature growth.
- Introduce topic features: use LDA or NMF to create compact semantic vectors; choose the topic count by validation.
- Compute similarity for retrieval and deduplication: normalize vectors and use cosine similarity safely on sparse matrices.
- Feature selection and sparse handling: apply chi-square or mutual information; use CSR, float32, and hashing to manage memory.
- Baseline modeling discipline: build, log, and compare baselines with clear metrics and no leakage.
Worked examples
1) BoW baseline with Logistic Regression
Goal: Build a quick text classifier using word counts.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.model_selection import cross_val_score
texts = [
    "great product works well", "terrible quality broke", "love it amazing",
    "worst purchase ever", "not bad quite useful", "awful and disappointing"
]
y = [1, 0, 1, 0, 1, 0]  # 1=positive, 0=negative
pipe = make_pipeline(
    CountVectorizer(ngram_range=(1, 1), min_df=1, stop_words='english'),
    LogisticRegression(max_iter=1000)
)
scores = cross_val_score(pipe, texts, y, cv=3, scoring='f1')
print(round(scores.mean(), 3))
Tip: Keep this as your sanity-check baseline before adding complexity.
2) TF–IDF with word and char n-grams
Goal: Improve robustness to typos and short texts (e.g., spam detection).
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import SGDClassifier
from scipy import sparse
import numpy as np
texts = ["Win $$$ now!!!", "Limited offfer", "Hello friend", "Meeting schedule"]
y = [1, 1, 0, 0]
word_tf = TfidfVectorizer(analyzer='word', ngram_range=(1,2), min_df=1)
char_tf = TfidfVectorizer(analyzer='char', ngram_range=(3,5), min_df=1)
Xw = word_tf.fit_transform(texts)
Xc = char_tf.fit_transform(texts)
X = sparse.hstack([Xw, Xc]).tocsr().astype(np.float32)
clf = SGDClassifier(loss='log_loss', max_iter=1000, random_state=0)
clf.fit(X, y)
new = ["W1n $$$ now"]
X_new = sparse.hstack([word_tf.transform(new), char_tf.transform(new)]).tocsr().astype(np.float32)  # transform with both vectorizers, then stack, to match training features
print(clf.predict(X_new)[0])
Tip: Character n-grams often boost recall on noisy user text.
3) Topic features (LDA) + classifier
Goal: Add low-dimensional semantic features.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
import numpy as np
corpus = [
    "football match team score", "economy stocks market rise", "goal scored player",
    "interest rates inflation", "coach tactics game", "investors portfolio risk"
]
y = [1, 0, 1, 0, 1, 0] # 1=sports, 0=finance
cv = CountVectorizer(min_df=1, stop_words='english')
X_counts = cv.fit_transform(corpus)
lda = LatentDirichletAllocation(n_components=5, random_state=0)
X_topics = lda.fit_transform(X_counts).astype(np.float32)
X_train, X_test, y_train, y_test = train_test_split(X_topics, y, test_size=0.33, random_state=0)
clf = LogisticRegression(max_iter=1000)
clf.fit(X_train, y_train)
print(clf.score(X_test, y_test))
Tip: Use validation to choose n_components; set random_state for reproducibility.
4) Cosine similarity for duplicate detection
Goal: Find near-duplicate questions or titles.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity
questions = [
    "How to reset my password?",
    "What is the way to change password?",
    "How to download invoice?"
]
vec = TfidfVectorizer(min_df=1, stop_words='english', ngram_range=(1,2))
X = vec.fit_transform(questions)
sim = cosine_similarity(X)
print(sim.round(2)) # high score between similar questions
Tip: TfidfVectorizer L2-normalizes rows by default, so cosine similarity reduces to a dot product on these vectors.
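Jaccard similarity over token sets is a lighter alternative for very short strings and exact-overlap checks. A minimal sketch, assuming whitespace tokenization is good enough for your text:
def jaccard(a: str, b: str) -> float:
    # token-set Jaccard: |A & B| / |A | B|
    sa, sb = set(a.lower().split()), set(b.lower().split())
    return len(sa & sb) / len(sa | sb) if (sa | sb) else 0.0
print(round(jaccard("how to reset my password", "what is the way to change password"), 2))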
5) Feature selection with chi-square
Goal: Reduce overfitting and speed up training.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.feature_selection import chi2, SelectKBest
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
texts = ["red apple", "green apple", "blue sky", "red sky", "green field", "blue ocean"]
y = [1, 1, 0, 0, 0, 0]  # 1=fruit-related
pipe = Pipeline([
    ("tfidf", TfidfVectorizer(ngram_range=(1, 2))),
    ("sel", SelectKBest(chi2, k=5)),  # chi2 requires non-negative features
    ("clf", LogisticRegression(max_iter=1000))
])
pipe.fit(texts, y)
print("Selected features:", pipe.named_steps["sel"].get_support().sum())
Tip: Do not standardize or mean-center before chi-square (keeps features non-negative).
6) Memory-safe sparse vectors and hashing
Goal: Keep RAM in check at scale.
from sklearn.feature_extraction.text import HashingVectorizer
from sklearn.linear_model import SGDClassifier
import numpy as np
vec = HashingVectorizer(n_features=2**18, alternate_sign=False, norm='l2')
clf = SGDClassifier(loss='hinge', random_state=0)
# HashingVectorizer is stateless, so it fits a streaming workflow; here we fit one small batch
texts = ["sample text one", "another sample", "more text"]
X = vec.transform(texts).astype(np.float32)
clf.fit(X, [0, 1, 0])
print("nnz per row:", [row.getnnz() for row in X])
Tip: Use alternate_sign=False to keep features non-negative when using chi-square later.
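The example above fits a single small batch; at real scale the same stateless vectorizer supports true streaming with partial_fit. A sketch, where iter_batches is a stand-in for your own chunked data loader:
from sklearn.feature_extraction.text import HashingVectorizer
from sklearn.linear_model import SGDClassifier
import numpy as np
vec = HashingVectorizer(n_features=2**18, alternate_sign=False, norm='l2')
clf = SGDClassifier(loss='log_loss', random_state=0)
def iter_batches():
    # placeholder generator; in practice read chunks from disk or a queue
    yield ["win money fast", "meeting notes attached"], [1, 0]
    yield ["limited offer now", "project update"], [1, 0]
for batch_texts, batch_y in iter_batches():
    Xb = vec.transform(batch_texts).astype(np.float32)     # no fit step: hashing is stateless
    clf.partial_fit(Xb, batch_y, classes=np.array([0, 1]))  # incremental update per batch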
Drills and quick exercises
- Create a CountVectorizer baseline with and without stopwords; compare F1 via cross-validation.
- Add word bigrams; record changes in validation score and training time.
- Switch to character 3–5 grams; test on noisy, misspelled samples you craft.
- Try max_df=0.9 and min_df=2; note effects on feature count and accuracy.
- Apply SelectKBest(chi2) with k=500, 2000; plot score vs k (a sketch of the sweep follows this list).
- Replace TF–IDF with HashingVectorizer; keep the same classifier; compare speed and metrics.
- Compute cosine similarity between 5 support queries; manually inspect top-1 neighbor quality.
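For the SelectKBest drill, one way to run the k sweep; 20 Newsgroups is used here only as a stand-in labeled corpus, and the k values are the ones from the drill:
from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
data = fetch_20newsgroups(subset='train', categories=['sci.space', 'rec.autos'],
                          remove=('headers', 'footers', 'quotes'))
for k in [500, 2000]:
    pipe = Pipeline([
        ("tfidf", TfidfVectorizer(ngram_range=(1, 2), min_df=2)),
        ("sel", SelectKBest(chi2, k=k)),
        ("clf", LogisticRegression(max_iter=1000))
    ])
    score = cross_val_score(pipe, data.data, data.target, cv=5, scoring='f1_macro').mean()
    print(k, round(score, 3))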
Common mistakes and debugging tips
Exploding feature space with wide n-grams
Symptom: Slow training, memory errors. Fix: Limit ngram_range, raise min_df, cap vocabulary, or use hashing.
Data leakage during feature selection
Symptom: Great validation, poor test. Fix: Fit vectorizer and selector only on training folds; use Pipeline to enforce.
Using dense arrays accidentally
Symptom: Memory spike. Fix: Keep data in CSR; avoid toarray(); prefer float32; check X.format and X.dtype.
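A quick check you can drop into a notebook to confirm you are still sparse and to see the cost of going dense (toy matrix for illustration):
from sklearn.feature_extraction.text import TfidfVectorizer
import numpy as np
X = TfidfVectorizer().fit_transform(["some sample text", "more sample text", "text again"])
print(X.format, X.dtype)  # expect 'csr' and float64 by default
sparse_bytes = X.data.nbytes + X.indices.nbytes + X.indptr.nbytes
dense_bytes = int(np.prod(X.shape)) * X.dtype.itemsize
print("sparse bytes:", sparse_bytes, "| dense bytes:", dense_bytes)
X = X.astype(np.float32)  # halves the memory used by stored values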
Ignoring text normalization
Symptom: Duplicate tokens ("Apple" vs "apple"). Fix: Lowercase, strip accents; use analyzer defaults or custom preprocessor.
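A minimal sketch of the built-in normalization options; strip_accents='unicode' is one reasonable setting, not the only one:
from sklearn.feature_extraction.text import TfidfVectorizer
vec = TfidfVectorizer(lowercase=True, strip_accents='unicode')
vec.fit(["Apple pie", "apple Pie", "Crème brûlée"])
print(vec.get_feature_names_out())  # ['apple' 'brulee' 'creme' 'pie']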
Misusing chi-square on negative features
Symptom: Errors or nonsense rankings. Fix: Use TF–IDF/Count (non-negative) before chi2; avoid mean-centering.
Unstable topic models
Symptom: Fluctuating topics. Fix: Set random_state, increase data, tune n_components, and use validation to judge utility.
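One practical way to pick n_components is downstream validation rather than eyeballing topics; a sketch reusing the toy corpus from worked example 3 (candidate values are arbitrary, and scores on six documents are only illustrative):
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
corpus = ["football match team score", "economy stocks market rise", "goal scored player",
          "interest rates inflation", "coach tactics game", "investors portfolio risk"]
y = [1, 0, 1, 0, 1, 0]  # 1=sports, 0=finance
for n in [2, 5, 10]:
    pipe = Pipeline([
        ("counts", CountVectorizer(stop_words='english')),
        ("lda", LatentDirichletAllocation(n_components=n, random_state=0)),
        ("clf", LogisticRegression(max_iter=1000))
    ])
    score = cross_val_score(pipe, corpus, y, cv=3, scoring='accuracy').mean()
    print(n, round(score, 3))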
Mini project: FAQ matcher with hybrid features
Build a small system that matches user questions to a curated FAQ list.
- Prepare 50–200 FAQ entries with short answers; craft 100+ user queries (some noisy).
- Represent FAQs and queries using TF–IDF with word bigrams and char 3–5 grams.
- Compute cosine similarity; return top-3 matches. Add rule-based boosts (e.g., regex for order IDs, emails); a sketch of the matching core follows this list.
- Evaluate with MRR@3 or Recall@3 on a held-out set.
- Reduce feature size via min_df and chi2; maintain or improve metrics.
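A sketch of the matching core under these requirements; the FAQ/query strings, the hard-coded order-tracking index, and the 0.2 boost are all illustrative placeholders:
import re
import numpy as np
from scipy import sparse
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity
faqs = ["How do I reset my password?", "Where can I download my invoice?", "How do I track my order?"]
queries = ["forgot pasword, how to reset", "need invoice pdf", "where is order #A1234"]
gold = [0, 1, 2]  # index of the correct FAQ for each query
word_vec = TfidfVectorizer(analyzer='word', ngram_range=(1, 2))
char_vec = TfidfVectorizer(analyzer='char', ngram_range=(3, 5))
F = sparse.hstack([word_vec.fit_transform(faqs), char_vec.fit_transform(faqs)]).tocsr()
Q = sparse.hstack([word_vec.transform(queries), char_vec.transform(queries)]).tocsr()
sim = cosine_similarity(Q, F)
# rule-based boost: queries mentioning an order ID lean toward the order-tracking FAQ
order_id = re.compile(r"#\w+")
for i, q in enumerate(queries):
    if order_id.search(q):
        sim[i, 2] += 0.2  # arbitrary illustrative boost
ranks = np.argsort(-sim, axis=1)[:, :3]  # top-3 FAQ indices per query
mrr = np.mean([1.0 / (list(r).index(g) + 1) if g in r else 0.0 for r, g in zip(ranks, gold)])
print("MRR@3:", round(mrr, 3))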
Acceptance criteria
- Top-1 correct match in at least 60% of test queries; Top-3 in 85%.
- Latency under 50 ms per query on a laptop (single thread).
- Clear README describing features and evaluation.
Practical projects
- News topic classifier: TF–IDF + Logistic Regression; ablate n-grams and chi2.
- Near-duplicate tweet detector: char n-grams + cosine similarity; evaluate with precision@k.
- Support ticket router: hybrid features (rules + TF–IDF + topics) to assign team labels.
Next steps
- Wrap your vectorization and models into reproducible scikit-learn Pipelines.
- Practice ablations: change one feature at a time; log metrics, size, and latency.
- Explore weak supervision: combine rule-based labels with classical features to bootstrap datasets.
Skill exam
Test what you learned.