
Text Processing Basics

Learn Text Processing Basics for free with explanations, exercises, and a quick test (for NLP Engineers).

Published: January 5, 2026 | Updated: January 5, 2026

What you will learn

  • Create a clean, reproducible text preprocessing pipeline
  • Tokenize text, handle casing, punctuation, and Unicode safely
  • Apply stopword removal, stemming vs. lemmatization intentionally
  • Segment sentences and build n-grams
  • Build a small vocabulary, handle OOV, and apply padding/truncation
  • Design rules for URLs, emails, mentions, emojis, and numbers

Why this matters

As an NLP Engineer, you will routinely: prepare raw text for training, deploy consistent preprocessing in production, and debug model failures caused by broken tokenization or mismatched cleaning rules. Good text processing improves signal-to-noise, makes models more robust, and reduces leakage between train and test data.

  • Real task: Clean and tokenize product reviews for a sentiment classifier
  • Real task: Normalize user messages before entity extraction
  • Real task: Build a consistent pipeline shared by training and inference

Concept explained simply

Text processing is a pipeline that makes messy text model-ready. You decide what to keep, what to transform, and what to discard, based on the task.

Mental model

Imagine a car wash with stations. Each station is a transformation: normalize Unicode, casefold, replace URLs with placeholders, tokenize, remove or keep certain tokens, then package the result as fixed-length sequences.

Core concepts

  • Unicode normalization: Convert equivalent characters to a standard form (e.g., NFKC) so comparisons are reliable.
  • Case handling: lower() or casefold() to reduce variation; keep case only if it carries meaning for your task.
  • Tokenization: Split text into tokens (words, subwords, or symbols). Consistency is crucial.
  • Stopwords: Common words; remove only if they hurt the task. Keep negations like "not" and "no" for sentiment.
  • Stemming vs. Lemmatization: Stemming chops endings; lemmatization uses vocabulary/grammar to return base forms.
  • Sentence segmentation: Split text into sentences for tasks that are sentence-aware.
  • N-grams: Combine consecutive tokens to capture short phrases.
  • Special tokens: Use placeholders like <URL>, <EMAIL>, <USER>, <NUM>, and structural tokens like <PAD>, <UNK>, <BOS>, <EOS>.
  • Padding/Truncation: Format sequences to fixed length for batching.

Typical safe pipeline order (sketched in Python after the list)
  1. Unicode normalize (e.g., NFKC)
  2. Casefold/lowercase
  3. Replace patterns (URLs, emails, handles, numbers) with placeholders if needed
  4. Tokenize
  5. Filter tokens (punctuation, optionally stopwords; keep negations)
  6. Optionally lemmatize or stem
  7. Build n-grams (if using classic features)
  8. Map to ids, add <BOS>/<EOS>, handle OOV, pad/truncate
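
Here is a minimal Python sketch of steps 1–5, assuming simplified regex patterns and the placeholder names used in this lesson (<URL>, <EMAIL>, <USER>, <NUM>); production-grade URL and email patterns are considerably more involved.

    import re
    import unicodedata

    # Simplified, assumed patterns -- real URL/email regexes are more involved.
    URL_RE = re.compile(r"https?://\S+")
    EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")
    USER_RE = re.compile(r"@\w+")
    NUM_RE = re.compile(r"\b\d+(?:\.\d+)?\b")

    def preprocess(text: str) -> list[str]:
        text = unicodedata.normalize("NFKC", text)   # 1. Unicode normalize
        text = text.casefold()                       # 2. casefold
        text = URL_RE.sub("<URL>", text)             # 3. replace patterns
        text = EMAIL_RE.sub("<EMAIL>", text)
        text = USER_RE.sub("<USER>", text)
        text = NUM_RE.sub("<NUM>", text)
        # 4. tokenize: placeholders, words (with apostrophes), punctuation runs
        tokens = re.findall(r"<\w+>|\w+(?:['’]\w+)?|[^\w\s]+", text)
        # 5. drop punctuation-only tokens; note this blunt filter also drops
        # emoticons and emoji -- the Exercise 1 sketch below keeps them
        return [t for t in tokens if re.search(r"\w", t)]

    print(preprocess("Visit https://example.com or mail bob@example.com!"))
    # ['visit', '<URL>', 'or', 'mail', '<EMAIL>']

Steps 6–8 (stemming/lemmatization, n-grams, and id mapping with padding) are sketched in the worked examples below.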

Worked examples

Example 1: Normalization + tokenization

Input: Café Müller’s email: Alice@example.com — wow!!! 😊

  1. Unicode normalize (NFKC); to match the output below, also strip diacritics (NFKD, then drop combining marks), since NFKC alone keeps "é"
  2. Casefold: cafe muller’s email: alice@example.com — wow!!! 😊
  3. Replace email: cafe muller’s email: <EMAIL> — wow!!! 😊
  4. Tokenize; remove standalone punctuation like !!!; keep emoji

Result tokens: ["cafe", "muller’s", "email", ":", "<EMAIL>", "—", "wow", "😊"]. Then remove punctuation-only tokens like ":" and "—" if your task allows. Final: ["cafe", "muller’s", "email", "<EMAIL>", "wow", "😊"]
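
Note that getting "cafe" from "Café" requires the accent-stripping step assumed in step 1 above; NFKC plus casefold alone yields "café". A minimal sketch of one common approach (NFKD decomposition, then dropping combining marks):

    import unicodedata

    def strip_diacritics(text: str) -> str:
        # Decompose accented characters (NFKD), then drop the combining marks.
        decomposed = unicodedata.normalize("NFKD", text)
        return "".join(ch for ch in decomposed if not unicodedata.combining(ch))

    print(strip_diacritics("Café Müller").casefold())  # -> cafe muller

Whether to strip diacritics is a policy decision: it merges "résumé" and "resume", which may or may not be what your task wants.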

Example 2: Stopwords + lemmatization vs. stemming

Input: The runner was running faster than the runners

  • No stopword removal, lemmatization: ["the", "runner", "be", "run", "fast", "than", "the", "runner"]
  • With stopword removal (keep negations), lemmatization: ["runner", "run", "fast", "runner"]
  • With stemming (Porter-style, rough): ["the", "runner", "wa", "run", "faster", "than", "the", "runner"] (Note: stemming can produce non-words like "wa")

Pick based on needs: lemmatization preserves meaning better; stemming is faster.
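
Here is a sketch of both using NLTK, assuming the package is installed and the WordNet data has been downloaded. Note that WordNet lemmatization is POS-sensitive:

    # One-time setup: pip install nltk, then in Python:
    # import nltk; nltk.download("wordnet")
    from nltk.stem import PorterStemmer, WordNetLemmatizer

    tokens = ["the", "runner", "was", "running", "faster", "than", "the", "runners"]

    stemmer = PorterStemmer()
    print([stemmer.stem(t) for t in tokens])
    # ['the', 'runner', 'wa', 'run', 'faster', 'than', 'the', 'runner']

    # pos="v" treats each token as a verb ("was" -> "be", "running" -> "run");
    # tokens WordNet cannot resolve are returned unchanged.
    lemmatizer = WordNetLemmatizer()
    print([lemmatizer.lemmatize(t, pos="v") for t in tokens])
    # ['the', 'runner', 'be', 'run', 'faster', 'than', 'the', 'runner']
    print(lemmatizer.lemmatize("faster", pos="a"))  # 'fast' (as an adjective)

A full-pipeline lemmatizer (e.g., spaCy) tags parts of speech for you, which is how the fully POS-aware output in the first bullet above is obtained.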

Example 3: Sentence segmentation, n-grams, padding

Input: "i love clean data. models love clean text."

  1. Sentences: ["i love clean data.", "models love clean text."]
  2. Tokens (lowercased): ["i","love","clean","data","."] and ["models","love","clean","text","."]
  3. Remove punctuation-only tokens; bigrams from first sentence: ["i love","love clean","clean data"]
  4. Map unigrams to ids and add <BOS>/<EOS>. Suppose ids: <PAD>=0, <UNK>=1, <BOS>=2, <EOS>=3, i=10, love=11, clean=12, data=13
  5. Sequence (max len 8): [2, 10, 11, 12, 13, 3, 0, 0]
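
A minimal sketch of steps 3–5 in Python, using the ids from this example; the helper names bigrams and encode are illustrative, not a library API:

    # Toy ids from the example above; real vocabularies are built from counts.
    vocab = {"<PAD>": 0, "<UNK>": 1, "<BOS>": 2, "<EOS>": 3,
             "i": 10, "love": 11, "clean": 12, "data": 13}

    def bigrams(tokens: list[str]) -> list[str]:
        # Pair each token with its successor: ["i", "love"] -> ["i love"]
        return [f"{a} {b}" for a, b in zip(tokens, tokens[1:])]

    def encode(tokens: list[str], max_len: int) -> list[int]:
        ids = [vocab["<BOS>"]]
        ids += [vocab.get(t, vocab["<UNK>"]) for t in tokens]  # OOV -> <UNK>
        ids.append(vocab["<EOS>"])
        ids = ids[:max_len]                                    # truncate...
        ids += [vocab["<PAD>"]] * (max_len - len(ids))         # ...then pad
        return ids

    print(bigrams(["i", "love", "clean", "data"]))
    # ['i love', 'love clean', 'clean data']
    print(encode(["i", "love", "clean", "data"], max_len=8))
    # [2, 10, 11, 12, 13, 3, 0, 0]

One design note: truncating after appending <EOS> can cut <EOS> off a long sequence; many pipelines truncate the token body first and then append <EOS>.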

Hands-on exercises

Do these right here, then check solutions. They mirror the graded exercises below.

Exercise 1: Build a tiny cleaner for social text

Goal: Normalize and tokenize a short message. Keep emojis; replace URLs with <URL>, emails with <EMAIL>, @handles with <USER>, and turn hashtags like #ai into two tokens ["<HASHTAG>", "ai"]. Remove standalone punctuation (e.g., "!!!"). Steps: Unicode NFKC, casefold, replace patterns, tokenize, filter punctuation.

Input: OMG!!! Check https://t.co/AbC123 and email me at USER+test@Example.COM :) #AI @john_doe

Expected tokens: ["omg", "check", "<URL>", "and", "email", "me", "at", "<EMAIL>", ":)", "<HASHTAG>", "ai", "<USER>"]

Hint
  • Casefold first, then regex replacements
  • URL regex can be simplified to something starting with http(s) and non-space chars
  • Keep emoticons like ":)" as tokens
Show solution
  1. NFKC normalize
  2. Casefold: "omg!!! check https://t.co/abc123 and email me at user+test@example.com :) #ai @john_doe"
  3. Replace patterns: URL → <URL>; email → <EMAIL>; @user → <USER>; #ai → "<HASHTAG> ai"
  4. Tokenize on whitespace and split heavy punctuation; keep ":)"
  5. Drop punctuation-only tokens like "!!!"
  6. Final: ["omg", "check", "<URL>", "and", "email", "me", "at", "<EMAIL>", ":)", "<HASHTAG>", "ai", "<USER>"]
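
A sketch of one way to implement this cleaner; the regexes, the emoticon allowlist, and the trailing-punctuation policy are simplified assumptions tuned to this exercise:

    import re
    import unicodedata

    URL_RE = re.compile(r"https?://\S+")
    EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")
    USER_RE = re.compile(r"@\w+")
    HASHTAG_RE = re.compile(r"#(\w+)")
    EMOTICONS = {":)", ":(", ";)", ":D"}  # tiny allowlist; extend as needed

    def clean_social(text: str) -> list[str]:
        text = unicodedata.normalize("NFKC", text).casefold()
        text = URL_RE.sub("<URL>", text)
        text = EMAIL_RE.sub("<EMAIL>", text)
        text = USER_RE.sub("<USER>", text)
        text = HASHTAG_RE.sub(r"<HASHTAG> \1", text)  # "#ai" -> "<HASHTAG> ai"
        kept = []
        for tok in text.split():
            tok = tok.strip("!.,?")                   # "omg!!!" -> "omg"
            if not tok:
                continue                              # pure punctuation dropped
            # Keep placeholders/words, allowlisted emoticons, and non-ASCII
            # tokens (emoji survive this last check).
            if tok in EMOTICONS or re.search(r"\w", tok) or not tok.isascii():
                kept.append(tok)
        return kept

    msg = ("OMG!!! Check https://t.co/AbC123 and email me at "
           "USER+test@Example.COM :) #AI @john_doe")
    print(clean_social(msg))
    # ['omg', 'check', '<URL>', 'and', 'email', 'me', 'at', '<EMAIL>',
    #  ':)', '<HASHTAG>', 'ai', '<USER>']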

Exercise 2: Build a toy vocabulary and encode

Corpus (after lowercase and simple tokenization):

  • "i love nlp and i love clean data"
  • "clean data leads to better models"
  • "nlp models need clean text"

Make a top-5 vocabulary by frequency, breaking ties alphabetically. Special ids: <PAD>=0, <UNK>=1, <BOS>=2, <EOS>=3. Assign ids to top-5 starting from 4. Then encode the sentence "I love NLP models" with <BOS>/<EOS>, length 8 with padding.

Expected ids: [2, 6, 7, 1, 8, 3, 0, 0]

Hint
  • Compute counts; tie-break alphabetically
  • Unknown tokens map to <UNK>=1
  • Don’t forget to pad
Show solution
  1. Counts: clean:3; i:2; love:2; data:2; nlp:2; models:2; all other tokens: 1 each
  2. Top-5 with tie-break: [clean, data, i, love, models]
  3. Ids: <PAD>=0, <UNK>=1, <BOS>=2, <EOS>=3, clean=4, data=5, i=6, love=7, models=8
  4. Sentence tokens: [i, love, nlp, models] → [6, 7, 1, 8]
  5. Add <BOS>/<EOS> and pad to len 8: [2, 6, 7, 1, 8, 3, 0, 0]
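
A sketch of the counting, tie-breaking, and encoding logic with Python's Counter; ids follow the exercise's convention:

    from collections import Counter

    corpus = [
        "i love nlp and i love clean data",
        "clean data leads to better models",
        "nlp models need clean text",
    ]
    counts = Counter(tok for line in corpus for tok in line.split())

    # Sort by descending count, then alphabetically to break ties.
    top5 = sorted(counts, key=lambda t: (-counts[t], t))[:5]
    print(top5)  # ['clean', 'data', 'i', 'love', 'models']

    vocab = {"<PAD>": 0, "<UNK>": 1, "<BOS>": 2, "<EOS>": 3}
    vocab.update({tok: i for i, tok in enumerate(top5, start=4)})

    tokens = "i love nlp models".split()
    ids = [vocab["<BOS>"]]
    ids += [vocab.get(t, vocab["<UNK>"]) for t in tokens]  # nlp -> <UNK> = 1
    ids.append(vocab["<EOS>"])
    ids += [vocab["<PAD>"]] * (8 - len(ids))
    print(ids)  # [2, 6, 7, 1, 8, 3, 0, 0]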

Completion checklist

  • You applied Unicode normalization and casefolding
  • You used placeholders for URLs/emails/handles
  • You chose a consistent tokenization and punctuation policy
  • You built a frequency-based vocabulary with clear tie-breaks
  • You encoded sequences with special tokens, OOV, and padding

Common mistakes and self-checks

  • Removing all stopwords, including negations. Self-check: Ensure "not", "no", "never" remain for sentiment tasks.
  • Skipping Unicode normalization. Self-check: Compare string lengths before/after normalizing; test precomposed "Café" (é as U+00E9) against its decomposed form (e + combining U+0301) — they should compare equal after normalization.
  • Different pipelines for train and inference. Self-check: Centralize the exact same functions; run a round-trip test on a sample string.
  • Greedy regex that deletes useful text (e.g., removing all punctuation kills emoticons). Self-check: Unit-test on edge cases like ":)", "C++", "e-mail" (a test sketch follows this list).
  • Dropping numbers blindly. Self-check: For finance or logs, keep numbers or map them to <NUM>.
  • Over-stemming causing meaning loss. Self-check: Compare stemming vs lemmatization on a small dev set.
  • Not padding/truncating consistently. Self-check: Verify all batches have identical lengths.
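
For the unit-testing self-checks above, a few pytest-style assertions against the clean_social() sketch from Exercise 1 (a hypothetical helper from this lesson, not a library API) might look like:

    # Assumes clean_social() from the Exercise 1 sketch is importable.
    def test_edge_cases():
        assert ":)" in clean_social("thanks :)")
        assert "c++" in clean_social("I code in C++")
        assert "e-mail" in clean_social("send an e-mail")

    def test_frozen_sample():
        # Freeze one expected output so any pipeline change is caught in
        # review, and share the same function between training and inference.
        sample = "Check https://example.com :) #AI @someone"
        expected = ["check", "<URL>", ":)", "<HASHTAG>", "ai", "<USER>"]
        assert clean_social(sample) == expected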

Practical projects

  • Build a tweet-cleaning pipeline with placeholders and emoji preservation; evaluate on a tiny sentiment dataset
  • Preprocess product reviews and train a bag-of-words logistic regression baseline; compare with/without bigrams
  • Prepare a small NER dataset: sentence segmentation, token alignment, and consistent casing policy

Mini challenge

Design a preprocessing policy for: "RT @Brand: New Café deals at https://shop.example.com! 50% OFF!!! #CoffeeLovers ☕️"

  • Decide what to keep, replace, and remove
  • State the order of steps and justify each
  • Produce final tokens and, if applicable, ids with special tokens
One possible answer
  • Keep the emoji, replace the URL with <URL>, replace @Brand with <USER>, convert the hashtag to ["<HASHTAG>", "coffeelovers"], drop "RT" if not informative
  • Order: NFKC → casefold → replacements → tokenize → remove punctuation-only tokens → map to ids
  • Tokens: ["<USER>", "new", "cafe", "deals", "at", "<URL>", "50%", "off", "<HASHTAG>", "coffeelovers", "☕️"] (decide a policy for "50%": keep it or map it to <NUM>)

Who this is for

  • Beginners in NLP building their first preprocessing pipeline
  • Data scientists moving from general ML to text tasks
  • Engineers deploying NLP models who need reproducible text handling

Prerequisites

  • Basic Python familiarity (lists, strings)
  • Comfort with simple regex patterns
  • Awareness of Unicode and character encodings

Learning path

  1. This subskill: Text Processing Basics
  2. Next: Feature extraction (bag-of-words, TF-IDF) and vectorization
  3. Then: Subword tokenizers (WordPiece/BPE) and embeddings
  4. Finally: Packaging preprocessing for training and inference with tests

Next steps

  • Write unit tests for your pipeline with tricky edge cases
  • Benchmark the impact of each step (e.g., stopwords on/off, stem vs lemma)
  • Prepare a small vocabulary and export mappings for reuse

Note on progress: The quick test is available to everyone. Only logged-in users will have their test progress saved.

Practice Exercises

2 exercises to complete

Instructions

Normalize and tokenize the message. Keep emojis; replace URLs with <URL>, emails with <EMAIL>, @handles with <USER>, and convert hashtags like #ai into two tokens ["<HASHTAG>", "ai"]. Remove standalone punctuation (e.g., "!!!"). Order: Unicode NFKC → casefold → replacements → tokenize → remove punctuation-only tokens.

Input: OMG!!! Check https://t.co/AbC123 and email me at USER+test@Example.COM :) #AI @john_doe

Expected Output
["omg", "check", "<URL>", "and", "email", "me", "at", "<EMAIL>", ":)", "<HASHTAG>", "ai", "<USER>"]

Text Processing Basics — Quick Test

Test your knowledge with 10 questions. Pass with 70% or higher.

