Why this skill matters for NLP Engineers
NLP Foundations give you the mental models and practical habits to turn raw text into signals that models can learn from. As an NLP Engineer, you will clean and normalize text, choose tokenization strategies, build vocabularies or subword units, represent text with embeddings, reason about sequences, and evaluate models without leaking information. These fundamentals determine model quality, latency, and maintainability in production.
What you will be able to do
- Design a robust text preprocessing pipeline for varied languages and domains.
- Choose and justify tokenization (whitespace, wordpiece/BPE, sentencepiece) for a task.
- Build and manage vocabularies, including subwords and OOV handling.
- Explain and apply embedding concepts (static vs contextual, pooling).
- Reason about sequences (n-grams, RNN/Transformer intuition) and set up data for them.
- Evaluate fairly, avoid data leakage, and run reproducible experiments.
Who this is for
- Aspiring or junior NLP/ML engineers who need a practical baseline.
- Data scientists transitioning to text problems.
- Software engineers integrating NLP features into products.
Prerequisites
- Comfort with Python basics (functions, lists, dicts, file I/O).
- High-level understanding of ML (train/validation/test split, overfitting).
- Basic linear algebra intuition (vectors, dot product) helpful but not required.
Learning path (roadmap)
Text normalization and cleaning
Unify casing, whitespace, Unicode, punctuation. Decide what to keep (e.g., emojis for sentiment) vs remove.
Mini task: Write a function that lowercases, strips accents, collapses whitespace, and keeps hashtags/@usernames.
Tokenization choices
Compare whitespace/split tokenizers, rule-based, and subword approaches. Align choice with task and deployment constraints.
Mini task: Tokenize the same sentence with three strategies and count OOVs vs vocabulary.
Vocabulary and subwords
Build a vocabulary with frequency thresholds. Explore BPE/WordPiece intuition to reduce OOVs.
Mini task: Simulate a few BPE merges on a tiny corpus and inspect the merges.
Embedding concepts
Static (Word2Vec/GloVe) vs contextual (Transformers). Pooling: mean, max, CLS. Understand trade-offs and dimensionality.
Mini task: Implement simple average word vectors with random init to prototype a baseline classifier.
Sequence modeling intuition
From n-grams to RNNs and Transformers. Padding, masking, and sequence length choices.
Mini task: Build a character bigram model and compute sentence likelihoods.
Evaluation without traps
Create stratified splits, avoid leakage, and select metrics (accuracy, F1, ROUGE, BLEU) suited to the task.
Mini task: Show why a random split can leak when duplicates exist; deduplicate first, then split.
Reproducible workflows
Set seeds, log configs and hashes, record dataset versions, and structure folders for stable runs.
Mini task: Save preprocessing params (regexes, vocab cutoffs) to a JSON alongside a run ID.
Worked examples
1) Robust text cleaning
Keep informative symbols (like emojis) while normalizing consistently.
import re, unicodedata
def clean_text(s: str) -> str:
    # NFKC normalizes width/compatibility chars
    s = unicodedata.normalize('NFKC', s)
    # Lowercase but keep emojis, hashtags and mentions
    s = s.lower()
    # Replace URLs and emails with placeholders
    s = re.sub(r'https?://\S+|www\.\S+', ' <url> ', s)
    s = re.sub(r'[\w\.-]+@[\w\.-]+', ' <email> ', s)
    # Collapse whitespace
    s = re.sub(r'\s+', ' ', s).strip()
    return s
print(clean_text('Email me at John.Doe@example.com! Visit https://ex.am/ple 🚀'))
Why this works
Normalization unifies variants. Placeholders preserve signal without leaking raw PII or noise. Lowercasing reduces sparsity for many languages.
2) Tokenization trade-offs
text = "Unbelievably strong performances! Re-engineered, reimagined."
# Whitespace
ws_tokens = text.split()
# Simple rule-based split on words and hyphenated parts
import re
rule_tokens = re.findall(r"[A-Za-z]+(?:-[A-Za-z]+)?|\d+|[^\w\s]", text)
print('Whitespace:', ws_tokens)
print('Rule-based:', rule_tokens)
Takeaway
Whitespace is fast but naive. Rule-based keeps hyphenated forms as single tokens. Subwords go further by splitting rare words into frequent pieces.
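Before reaching for subwords, a plain word vocabulary with a frequency cutoff is the usual baseline. A minimal sketch, building such a vocabulary and measuring the OOV rate on a held-out line; the toy corpus, the cutoff of 2, and the <unk> token are made up for illustration:
from collections import Counter

train = ["the cat sat on the mat", "the dog sat on the rug"]
val = ["a cat slept on the sofa"]

# Count token frequencies on the training split only
freqs = Counter(tok for line in train for tok in line.split())

# Keep tokens seen at least min_freq times; everything else maps to <unk>
min_freq = 2
vocab = {'<unk>': 0}
for tok, c in freqs.items():
    if c >= min_freq:
        vocab[tok] = len(vocab)

# Measure OOV rate on the validation split
val_tokens = [tok for line in val for tok in line.split()]
oov = sum(tok not in vocab for tok in val_tokens)
print('Vocab size:', len(vocab))
print('OOV rate:', round(oov / len(val_tokens), 2))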
3) Tiny BPE intuition (manual)
from collections import Counter
# Toy corpus: each word split into characters (end-of-word markers omitted for simplicity)
words = [list("low"), list("er"), list("new"), list("low"), list("est")]
# Count adjacent symbol pairs (simplified for illustration)
def get_pairs(words):
    pairs = Counter()
    for w in words:
        for a,b in zip(w, w[1:]):
            pairs[(a,b)] += 1
    return pairs
pairs = get_pairs(words)
print('Top pairs:', pairs.most_common(3))
# Suppose we merge ('l','o') -> 'lo'
merged = []
for w in words:
    i = 0; tmp = []
    while i < len(w):
        if i+1 < len(w) and (w[i], w[i+1]) == ('l','o'):
            tmp.append('lo'); i += 2
        else:
            tmp.append(w[i]); i += 1
    merged.append(tmp)
print('After merge:', merged)
Takeaway
BPE iteratively merges frequent pairs, reducing OOVs while keeping a compact vocabulary. Real implementations learn many merges on large corpora.
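To see the iteration, here is a minimal sketch of the full merge loop on the same toy words, assuming we simply merge the single most frequent pair each round and stop after three merges (real BPE also tracks word frequencies and end-of-word markers):
from collections import Counter

def get_pairs(words):
    pairs = Counter()
    for w in words:
        for a, b in zip(w, w[1:]):
            pairs[(a, b)] += 1
    return pairs

def merge_pair(words, pair):
    # Replace every occurrence of the chosen pair with its concatenation
    merged = []
    for w in words:
        out, i = [], 0
        while i < len(w):
            if i + 1 < len(w) and (w[i], w[i + 1]) == pair:
                out.append(w[i] + w[i + 1]); i += 2
            else:
                out.append(w[i]); i += 1
        merged.append(out)
    return merged

words = [list("low"), list("er"), list("new"), list("low"), list("est")]
merges = []
for _ in range(3):                      # learn three merges
    pairs = get_pairs(words)
    if not pairs:
        break
    best = pairs.most_common(1)[0][0]   # most frequent adjacent pair
    merges.append(best)
    words = merge_pair(words, best)
print('Learned merges:', merges)
print('Final words:', words)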
4) Simple bag-of-words embedding baseline
import numpy as np
corpus = [
    ("great product works flawlessly", 1),
    ("terrible quality broke immediately", 0),
    ("works great and great support", 1)
]
# Build vocab
vocab = {}
for text, _ in corpus:
    for tok in text.split():
        if tok not in vocab: vocab[tok] = len(vocab)
# TF vectors
X = []
y = []
for text, label in corpus:
    vec = np.zeros(len(vocab))
    for tok in text.split():
        vec[vocab[tok]] += 1
    X.append(vec); y.append(label)
X = np.stack(X); y = np.array(y)
# Simple linear classifier via closed-form ridge (demo-only)
lam = 1e-1
w = np.linalg.pinv(X.T @ X + lam*np.eye(X.shape[1])) @ X.T @ y
# Predict
pred = (X @ w >= 0.5).astype(int)
print('Predictions:', pred.tolist())
Takeaway
Even simple bag-of-words embeddings can be strong baselines. Start simple; add complexity only if baselines underperform.
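The embedding mini task above suggests a second baseline: give each vocabulary word a random static vector and mean-pool over its tokens. A minimal sketch, reusing the same toy corpus and demo-only ridge solution; the embedding dimension and seed are arbitrary choices:
import numpy as np

corpus = [
    ("great product works flawlessly", 1),
    ("terrible quality broke immediately", 0),
    ("works great and great support", 1)
]
vocab = {}
for text, _ in corpus:
    for tok in text.split():
        vocab.setdefault(tok, len(vocab))

rng = np.random.default_rng(0)
dim = 16
E = rng.normal(size=(len(vocab), dim))   # random static embeddings, one row per word

def embed(text):
    # Mean-pool the word vectors of in-vocabulary tokens
    ids = [vocab[tok] for tok in text.split() if tok in vocab]
    return E[ids].mean(axis=0) if ids else np.zeros(dim)

X = np.stack([embed(text) for text, _ in corpus])
y = np.array([label for _, label in corpus])

lam = 1e-1                                # same demo-only ridge as above
w = np.linalg.pinv(X.T @ X + lam * np.eye(dim)) @ X.T @ y
print('Predictions:', ((X @ w) >= 0.5).astype(int).tolist())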
5) Character bigram sequence model
from collections import defaultdict
text = "hello there"
text = text.replace(' ', '_') # underscore as space token
counts = defaultdict(int)
context = defaultdict(int)
for a,b in zip(text, text[1:]):
    counts[(a,b)] += 1
    context[a] += 1
def prob(a,b,alpha=1.0): # Laplace smoothing
    return (counts[(a,b)] + alpha) / (context[a] + alpha*len(set(text)))
import math
sentence = "he_lo"
logp = 0.0
for a,b in zip(sentence, sentence[1:]):
    logp += math.log(prob(a,b))
print('Log-prob:', round(logp, 3))
Takeaway
N-gram models expose sequence dependence and smoothing. This intuition transfers to understanding attention masks and context windows in Transformers.
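Padding and masking, mentioned in the sequence-modeling step, can be shown without any framework: pad variable-length token-ID sequences to a common length and keep a boolean mask so pooling (or attention) ignores the padding. A minimal NumPy sketch with made-up token IDs and 0 reserved as the pad ID:
import numpy as np

# Toy token-ID sequences of different lengths (0 is reserved for padding)
seqs = [[5, 2, 9], [7, 1], [3, 8, 4, 6]]
max_len = max(len(s) for s in seqs)

ids = np.zeros((len(seqs), max_len), dtype=int)
mask = np.zeros((len(seqs), max_len), dtype=bool)
for i, s in enumerate(seqs):
    ids[i, :len(s)] = s
    mask[i, :len(s)] = True

print(ids)
print(mask)

# Example: masked mean over (fake) per-token vectors, ignoring padded positions
vecs = np.random.default_rng(0).normal(size=(len(seqs), max_len, 4))
pooled = (vecs * mask[:, :, None]).sum(axis=1) / mask.sum(axis=1, keepdims=True)
print(pooled.shape)   # (3, 4): one pooled vector per sequence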
Drills and exercises
- Normalize a multilingual sample using NFKC and verify no accidental information removal (e.g., currency symbols).
- Tokenize 10 tricky sentences (hyphens, emojis, hashtags) with two methods and compare token counts and OOV rate.
- Build a vocabulary with a min-frequency cutoff; measure coverage on a validation file.
- Implement mean-pooled embeddings from randomly initialized word vectors; classify 20 sentences with logistic regression.
- Create a stratified train/val/test split that avoids near-duplicates by hashing normalized text (see the sketch after this list).
- Run the same preprocessing twice with seeds set; verify identical outputs and saved config JSON.
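For the split drill above, a minimal sketch that hashes normalized text to drop duplicates before a stratified split; the toy data, normalization rule, and 80/20 ratio are illustrative choices, not a prescription:
import hashlib
import random

data = [
    ("Great product!", 1), ("great   product!", 1),   # duplicates after normalization
    ("Broke after a day", 0), ("Works fine", 1), ("Awful support", 0), ("Love it", 1),
]

def norm(s):
    return " ".join(s.lower().split())

# 1) Deduplicate on a hash of the normalized text
seen, unique = set(), []
for text, label in data:
    h = hashlib.sha1(norm(text).encode()).hexdigest()
    if h not in seen:
        seen.add(h)
        unique.append((text, label))

# 2) Stratified split: shuffle within each label, then slice
random.seed(0)
by_label = {}
for text, label in unique:
    by_label.setdefault(label, []).append((text, label))

train, test = [], []
for label, items in by_label.items():
    random.shuffle(items)
    cut = max(1, int(0.8 * len(items)))
    train += items[:cut]
    test += items[cut:]
print(len(unique), 'unique;', len(train), 'train /', len(test), 'test')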
Common mistakes and debugging tips
- Over-cleaning text: Removing punctuation/emojis that carry sentiment. Tip: Review feature importance or attention patterns; retain informative symbols.
- Inconsistent tokenization between train and inference: Different libraries or versions. Tip: Serialize tokenizer artifacts and test round-trip on a small held-out set.
- Leaky splits: Deduping after splitting. Tip: Deduplicate first; split by group IDs (e.g., user, thread) when correlation exists.
- Wrong metric for task: Optimizing accuracy on imbalanced classes. Tip: Use F1 or AUROC; inspect per-class metrics.
- Ignoring sequence length effects: Truncating key tokens. Tip: Plot token positions of labels; adjust max length or use sliding windows.
- Non-reproducible runs: Seeds set only for one library. Tip: Set seeds across Python, NumPy, and your ML framework; log versions and hashes.
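Tying the last tip to the reproducible-workflows step, a minimal sketch that sets seeds and saves a config JSON under a run ID; the config fields are illustrative, and a framework seed call (e.g., torch.manual_seed) would be added if one is in use:
import json
import random
import sys
import time
import numpy as np

def set_seeds(seed):
    random.seed(seed)        # Python's built-in RNG
    np.random.seed(seed)     # NumPy's global RNG
    # torch.manual_seed(seed)  # add your framework's seed call here if you use one

config = {
    "run_id": time.strftime("%Y%m%d-%H%M%S"),
    "seed": 13,
    "python": sys.version.split()[0],
    "numpy": np.__version__,
    "lowercase": True,
    "url_pattern": r"https?://\S+|www\.\S+",
    "min_token_freq": 2,
}
set_seeds(config["seed"])

with open(f"run_{config['run_id']}_config.json", "w") as f:
    json.dump(config, f, indent=2)
print("Saved config for run", config["run_id"])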
Mini project: Sentiment-on-support-tickets
Goal: Build a baseline sentiment classifier for short support tickets.
- Collect or simulate 500 short tickets with positive/negative labels.
- Preprocess: normalize (NFKC), lowercase, keep emojis/hashtags and replace URLs/emails with placeholders.
- Tokenize: compare rule-based vs subword tokenizer; pick one based on OOV rate and speed.
- Vocabulary: choose min-frequency threshold; document OOV rate.
- Embedding: start with bag-of-words; add mean-pooled random embeddings as a second baseline.
- Evaluation: stratified split, no duplicates. Report accuracy, precision, recall, F1.
- Reproducibility: save config JSON (cleaning rules, tokenizer params, seed) and a run ID.
Acceptance criteria
- Clear README describing decisions and trade-offs.
- Re-runnable pipeline producing identical splits and metrics (± floating-point noise).
- Better than majority baseline on F1.
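To make the last criterion concrete, the majority baseline predicts the most frequent class for every ticket; a quick sketch computing its F1 by hand on hypothetical labels (a library metric such as scikit-learn's f1_score would give the same number):
from collections import Counter

y_true = [1, 1, 1, 0, 0, 1, 0, 1]              # hypothetical labels
majority = Counter(y_true).most_common(1)[0][0]
y_pred = [majority] * len(y_true)              # predict the majority class everywhere

tp = sum(p == 1 and t == 1 for p, t in zip(y_pred, y_true))
fp = sum(p == 1 and t == 0 for p, t in zip(y_pred, y_true))
fn = sum(p == 0 and t == 1 for p, t in zip(y_pred, y_true))
precision = tp / (tp + fp) if tp + fp else 0.0
recall = tp / (tp + fn) if tp + fn else 0.0
f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
print('Majority-baseline F1:', round(f1, 3))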
Practical projects (beyond the mini)
- Keyword-highlighting for customer chats: rule-based tokens + TF-IDF weighting, evaluated by precision/recall against human tags.
- Duplicate question detector: normalize, character n-grams, cosine similarity with threshold tuning.
- Emoji-aware sentiment baseline: emoji lexicon + token model; ablation to show emoji impact.
Subskills
- Text Processing Basics: Build predictable normalization routines that keep task-relevant symbols while reducing noise.
- Tokenization Concepts: Compare and select tokenization strategies aligned with task and deployment constraints.
- Vocabulary And Subwords: Construct vocabularies, apply frequency cutoffs, and reduce OOVs with subword units.
- Embeddings Concepts: Understand static vs contextual embeddings and pooling strategies.
- Sequence Modeling Intuition: Grasp n-grams to Transformers, masking, and length trade-offs.
- Evaluation Pitfalls For NLP: Choose proper metrics, stratify splits, avoid duplicates and leakage.
- Data Leakage Awareness: Identify and prevent content, temporal, and cross-split leakage.
- Reproducible NLP Workflows: Seed control, config logging, dataset versioning, and file structure.
Next steps
- Pick one practical project and complete it end-to-end with saved configs.
- Attempt the Skill Exam below to check your understanding.
- Move on to the next skill focusing on applied language modeling and fine-tuning.