What you will learn
- Create a clean, reproducible text preprocessing pipeline
- Tokenize text, handle casing, punctuation, and Unicode safely
- Apply stopword removal, stemming vs. lemmatization intentionally
- Segment sentences and build n-grams
- Build a small vocabulary, handle OOV, and apply padding/truncation
- Design rules for URLs, emails, mentions, emojis, and numbers
Why this matters
As an NLP Engineer, you will routinely: prepare raw text for training, deploy consistent preprocessing in production, and debug model failures caused by broken tokenization or mismatched cleaning rules. Good text processing improves signal-to-noise, makes models more robust, and reduces leakage between train and test data.
- Real task: Clean and tokenize product reviews for a sentiment classifier
- Real task: Normalize user messages before entity extraction
- Real task: Build a consistent pipeline shared by training and inference
Concept explained simply
Text processing is a pipeline that makes messy text model-ready. You decide what to keep, what to transform, and what to discard, based on the task.
Mental model
Imagine a car wash with stations. Each station is a transformation: normalize Unicode, casefold, replace URLs with placeholders, tokenize, remove or keep certain tokens, then package the result as fixed-length sequences.
Core concepts
- Unicode normalization: Convert equivalent characters to a standard form (e.g., NFKC) so comparisons are reliable.
- Case handling: lower() or casefold() to reduce variation; keep case only if it carries meaning for your task.
- Tokenization: Split text into tokens (words, subwords, or symbols). Consistency is crucial.
- Stopwords: Common words; remove only if they hurt the task. Keep negations like "not" and "no" for sentiment.
- Stemming vs. Lemmatization: Stemming chops endings; lemmatization uses vocabulary/grammar to return base forms.
- Sentence segmentation: Split text into sentences for tasks that are sentence-aware.
- N-grams: Combine consecutive tokens to capture short phrases.
- Special tokens: Use placeholders like <URL>, <EMAIL>, <USER>, <NUM>, and structural tokens like <PAD>, <UNK>, <BOS>, <EOS>.
- Padding/Truncation: Format sequences to fixed length for batching.
Typical safe pipeline order
- Unicode normalize (e.g., NFKC)
- Casefold/lowercase
- Replace patterns (URLs, emails, handles, numbers) with placeholders if needed
- Tokenize
- Filter tokens (punctuation, optionally stopwords; keep negations)
- Optionally lemmatize or stem
- Build n-grams (if using classic features)
- Map to ids, add <BOS>/<EOS>, handle OOV, pad/truncate (a minimal sketch of the whole pipeline follows this list)
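A minimal sketch of this order in Python, assuming simple regex-based replacements and whitespace-plus-punctuation tokenization. The placeholder names, patterns, and stopword list are illustrative choices, not fixed conventions:

```python
import re
import unicodedata

# Illustrative patterns and placeholder names; adapt them to your data.
URL_RE = re.compile(r"https?://\S+")
EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")
NEGATIONS = {"not", "no", "never"}
STOPWORDS = {"the", "a", "an", "and", "is", "to"} - NEGATIONS  # always keep negations

def preprocess(text, remove_stopwords=False):
    text = unicodedata.normalize("NFKC", text)   # 1. Unicode normalize
    text = text.casefold()                       # 2. casefold
    text = URL_RE.sub(" <URL> ", text)           # 3. replace patterns with placeholders
    text = EMAIL_RE.sub(" <EMAIL> ", text)
    # 4. tokenize: placeholders, words (with apostrophes), or single symbols
    tokens = re.findall(r"<\w+>|\w+(?:'\w+)?|[^\w\s]", text)
    # 5. filter: drop punctuation-only tokens (emoticons need an extra rule; see Exercise 1)
    tokens = [t for t in tokens if any(c.isalnum() for c in t)]
    if remove_stopwords:
        tokens = [t for t in tokens if t not in STOPWORDS]
    # 6./7. lemmatize/stem and build n-grams here if your features need them
    return tokens

print(preprocess("Visit https://example.com and mail Bob@example.com!"))
# ['visit', '<URL>', 'and', 'mail', '<EMAIL>']
```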
Worked examples
Example 1: Normalization + tokenization
Input: Café Müller’s email: Alice@example.com — wow!!! 😊
- Unicode normalize (NFKC), then strip diacritics (decompose with NFKD and drop combining marks) so "Café" becomes "Cafe"
- Casefold: cafe muller’s email: alice@example.com — wow!!! 😊
- Replace email: cafe muller’s email: <EMAIL> — wow!!! 😊
- Tokenize; drop punctuation runs like "!!!" and the stray "—"; keep the emoji
Result tokens: ["cafe", "muller’s", "email", ":", "<EMAIL>", "wow", "😊"] (a normalization sketch follows)
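A small sketch of that normalization step in Python. The accent-stripping helper (NFKD plus dropping combining marks) is one common approach and an assumption here, since plain NFKC would keep the accents:

```python
import unicodedata

def strip_accents(text):
    # Decompose characters (NFKD), then drop the combining marks.
    decomposed = unicodedata.normalize("NFKD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))

raw = "Café Müller’s email: Alice@example.com — wow!!! 😊"
print(strip_accents(raw).casefold())
# cafe muller’s email: alice@example.com — wow!!! 😊
```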
Example 2: Stopwords + lemmatization vs. stemming
Input: The runner was running faster than the runners
- No stopword removal, lemmatization: ["the", "runner", "be", "run", "fast", "than", "the", "runner"]
- With stopword removal (keep negations), lemmatization: ["runner", "run", "fast", "runner"]
- With stemming (Porter-style, rough): ["the", "runner", "wa", "run", "faster", "than", "the", "runner"] (Note: stemming can produce non-words like "wa")
Pick based on your task: lemmatization preserves meaning better; stemming is faster but can produce non-words (a comparison is sketched below).
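A quick comparison using NLTK as one possible toolkit (assumes `nltk` is installed and the WordNet data downloaded; exact outputs may vary slightly across versions):

```python
# pip install nltk; then: python -c "import nltk; nltk.download('wordnet')"
from nltk.stem import PorterStemmer, WordNetLemmatizer

tokens = ["the", "runner", "was", "running", "faster", "than", "the", "runners"]

stemmer = PorterStemmer()
print([stemmer.stem(t) for t in tokens])
# ['the', 'runner', 'wa', 'run', 'faster', 'than', 'the', 'runner']

# Lemmatization needs part-of-speech information to work well; supplied by hand here.
lemmatizer = WordNetLemmatizer()
print(lemmatizer.lemmatize("was", pos="v"),      # be
      lemmatizer.lemmatize("running", pos="v"),  # run
      lemmatizer.lemmatize("runners", pos="n"),  # runner
      lemmatizer.lemmatize("faster", pos="a"))   # fast
```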
Example 3: Sentence segmentation, n-grams, padding
Input: "i love clean data. models love clean text."
- Sentences: ["i love clean data.", "models love clean text."]
- Tokens (lowercased): ["i","love","clean","data","."] and ["models","love","clean","text","."]
- Remove punctuation-only tokens; bigrams from first sentence: ["i love","love clean","clean data"]
- Map unigrams to ids and add <BOS>/<EOS>. Suppose ids: <PAD>=0, <UNK>=1, <BOS>=2, <EOS>=3, i=10, love=11, clean=12, data=13
- Sequence (max len 8): [2, 10, 11, 12, 13, 3, 0, 0] (a code sketch of these steps follows)
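A sketch of these steps in plain Python; the naive period-based sentence splitter and the hand-written id table mirror the toy example and are not production choices:

```python
text = "i love clean data. models love clean text."

# Naive sentence segmentation on '.' -- fine for this toy input only.
sentences = [s.strip() + "." for s in text.split(".") if s.strip()]

def tokenize(sentence):
    # Lowercase, drop the final period, split on whitespace.
    return sentence.lower().rstrip(".").split()

first = tokenize(sentences[0])                       # ['i', 'love', 'clean', 'data']
bigrams = [f"{a} {b}" for a, b in zip(first, first[1:])]
print(bigrams)                                       # ['i love', 'love clean', 'clean data']

# Id table mirroring the example: <PAD>=0, <UNK>=1, <BOS>=2, <EOS>=3.
vocab = {"<PAD>": 0, "<UNK>": 1, "<BOS>": 2, "<EOS>": 3,
         "i": 10, "love": 11, "clean": 12, "data": 13}

def encode(tokens, vocab, max_len=8):
    ids = [vocab["<BOS>"]] + [vocab.get(t, vocab["<UNK>"]) for t in tokens] + [vocab["<EOS>"]]
    ids = ids[:max_len]                                   # truncate if too long
    return ids + [vocab["<PAD>"]] * (max_len - len(ids))  # pad if too short

print(encode(first, vocab))                          # [2, 10, 11, 12, 13, 3, 0, 0]
```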
Hands-on exercises
Do these right here, then check solutions. They mirror the graded exercises below.
Exercise 1: Build a tiny cleaner for social text
Goal: Normalize and tokenize a short message. Keep emojis and emoticons; replace URLs, emails, and @mentions with <URL>, <EMAIL>, and <USER>; mark hashtags with <HASHTAG>.
Input: OMG!!! Check https://t.co/AbC123 and email me at USER+test@Example.COM :) #AI @john_doe
Expected tokens: ["omg", "check", "<URL>", "and", "email", "me", "at", "<EMAIL>", ":)", "<HASHTAG>", "ai", "<USER>"]
Hint
- Casefold first, then regex replacements
- URL regex can be simplified to something starting with http(s) and non-space chars
- Keep emoticons like ":)" as tokens
Solution
- NFKC normalize
- Casefold: "omg!!! check https://t.co/abc123 and email me at user+test@example.com :) #ai @john_doe"
- Replace patterns: URL → <URL>; email → <EMAIL>; @user → <USER>; #ai → "<HASHTAG> ai"
- Tokenize on whitespace and split off heavy punctuation; keep ":)"
- Drop punctuation-only tokens like "!!!"
- Final: ["omg", "check", "<URL>", "and", "email", "me", "at", "<EMAIL>", ":)", "<HASHTAG>", "ai", "<USER>"] (one possible implementation is sketched below)
Exercise 2: Build a toy vocabulary and encode
Corpus (after lowercase and simple tokenization):
- "i love nlp and i love clean data"
- "clean data leads to better models"
- "nlp models need clean text"
Make a top-5 vocabulary by frequency, breaking ties alphabetically. Special ids: <PAD>=0, <UNK>=1, <BOS>=2, <EOS>=3; vocabulary words start at id 4. Encode "i love nlp models" with <BOS>/<EOS>, map OOV tokens to <UNK>, and pad to length 8.
Expected ids: [2, 6, 7, 1, 8, 3, 0, 0]
Hint
- Compute counts; tie-break alphabetically
- Unknown tokens map to <UNK>=1
- Don’t forget to pad with <PAD>=0
Solution
- Counts: clean:3; i:2; love:2; data:2; nlp:2; models:2; others:1
- Top-5 with tie-break: [clean, data, i, love, models]
- Ids: <PAD>=0, <UNK>=1, <BOS>=2, <EOS>=3, clean=4, data=5, i=6, love=7, models=8
- Sentence tokens: [i, love, nlp, models] → [6, 7, 1, 8] (nlp is OOV, so it maps to <UNK>=1)
- Add <BOS>/<EOS> and pad to len 8: [2, 6, 7, 1, 8, 3, 0, 0] (a code sketch follows)
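A sketch of the counting and encoding steps; the sort key implements the frequency-then-alphabetical tie-break described above:

```python
from collections import Counter

corpus = [
    "i love nlp and i love clean data",
    "clean data leads to better models",
    "nlp models need clean text",
]

counts = Counter(tok for line in corpus for tok in line.split())

# Top-5 by frequency, ties broken alphabetically.
top5 = sorted(counts, key=lambda w: (-counts[w], w))[:5]
print(top5)  # ['clean', 'data', 'i', 'love', 'models']

specials = ["<PAD>", "<UNK>", "<BOS>", "<EOS>"]
vocab = {tok: i for i, tok in enumerate(specials + top5)}

def encode(tokens, vocab, max_len=8):
    ids = [vocab["<BOS>"]] + [vocab.get(t, vocab["<UNK>"]) for t in tokens] + [vocab["<EOS>"]]
    return (ids + [vocab["<PAD>"]] * max_len)[:max_len]   # pad, then cut to max_len

print(encode("i love nlp models".split(), vocab))  # [2, 6, 7, 1, 8, 3, 0, 0]
```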
Completion checklist
- You applied Unicode normalization and casefolding
- You used placeholders for URLs/emails/handles
- You chose a consistent tokenization and punctuation policy
- You built a frequency-based vocabulary with clear tie-breaks
- You encoded sequences with special tokens, OOV, and padding
Common mistakes and self-checks
- Removing all stopwords, including negations. Self-check: Ensure "not", "no", "never" remain for sentiment tasks.
- Skipping Unicode normalization. Self-check: Compare string lengths before/after; test "Café" (precomposed é) against "Café" written as "e" plus a combining accent (U+0301).
- Different pipelines for train and inference. Self-check: Centralize the exact same functions; run a round-trip test on a sample string.
- Greedy regex that deletes useful text (e.g., removing all punctuation kills emoticons). Self-check: Unit-test on edge cases like ":)", "C++", "e-mail".
- Dropping numbers blindly. Self-check: For finance or logs, keep numbers or map them to <NUM>.
- Over-stemming causing meaning loss. Self-check: Compare stemming vs lemmatization on a small dev set.
- Not padding/truncating consistently. Self-check: Verify all batches have identical sequence lengths (a small test sketch follows this list).
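Two small checks along these lines, written pytest-style; `my_text_pipeline.preprocess` is a hypothetical shared module name, and the assertions assume your pipeline follows the recommendations above (NFKC normalization, kept negations, kept emoticons):

```python
# Hypothetical shared module: training and inference must import this same function
# rather than re-implementing the cleaning rules in two places.
from my_text_pipeline import preprocess  # assumed project module

def test_edge_cases():
    # Pin behaviour on tricky inputs so silent changes get caught in CI.
    assert preprocess("Café") == preprocess("Cafe\u0301")  # precomposed vs combining accent
    assert "not" in preprocess("not good")                 # negations survive stopword removal
    assert ":)" in preprocess("thanks :)")                 # emoticons survive punctuation rules

def check_batch_lengths(batches):
    # Call this on your batched id sequences to confirm padding/truncation is consistent.
    for batch in batches:
        lengths = {len(seq) for seq in batch}
        assert len(lengths) == 1, f"inconsistent sequence lengths: {lengths}"
```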
Practical projects
- Build a tweet-cleaning pipeline with placeholders and emoji preservation; evaluate on a tiny sentiment dataset
- Preprocess product reviews and train a bag-of-words logistic regression baseline; compare with/without bigrams
- Prepare a small NER dataset: sentence segmentation, token alignment, and consistent casing policy
Mini challenge
Design a preprocessing policy for: "RT @Brand: New Café deals at https://shop.example.com! 50% OFF!!! #CoffeeLovers ☕️"
- Decide what to keep, replace, and remove
- State the order of steps and justify each
- Produce final tokens and, if applicable, ids with special tokens
One possible answer
- Keep the emoji, replace the URL with <URL>, replace @Brand with <USER>, convert the hashtag to ["<HASHTAG>", "coffeelovers"], and drop "RT" if it is not informative
- Order: NFKC → casefold → replacements → tokenize → remove punctuation-only tokens → map to ids
- Tokens: ["<USER>", "new", "cafe", "deals", "at", "<URL>", "50%", "off", "<HASHTAG>", "coffeelovers", "☕️"] (decide a policy for "50%": keep it or map it to <NUM>)
Who this is for
- Beginners in NLP building their first preprocessing pipeline
- Data scientists moving from general ML to text tasks
- Engineers deploying NLP models who need reproducible text handling
Prerequisites
- Basic Python familiarity (lists, strings)
- Comfort with simple regex patterns
- Awareness of Unicode and character encodings
Learning path
- This subskill: Text Processing Basics
- Next: Feature extraction (bag-of-words, TF-IDF) and vectorization
- Then: Subword tokenizers (WordPiece/BPE) and embeddings
- Finally: Packaging preprocessing for training and inference with tests
Next steps
- Write unit tests for your pipeline with tricky edge cases
- Benchmark the impact of each step (e.g., stopwords on/off, stem vs lemma)
- Prepare a small vocabulary and export mappings for reuse