
Stopwords And Lemmatization Basics

Learn Stopwords And Lemmatization Basics for free with explanations, exercises, and a quick test (for NLP Engineers).

Published: January 5, 2026 | Updated: January 5, 2026

Why this matters

As an NLP Engineer, your models depend on clean, meaningful text. Stopwords removal reduces noise and model size; lemmatization reduces word variants to a common base (lemma) so models learn patterns more efficiently. You will use these steps in tasks like sentiment analysis, topic modeling, search relevance, question answering, and intent classification.

  • Production search: remove filler words (the, and) to improve recall/precision.
  • Classification: lemmatize running and ran to run for better generalization.
  • Topic modeling: reduce sparsity by normalizing tokens.

Who this is for

  • Beginners learning NLP pipelines.
  • Engineers moving from basic tokenization to effective normalization.
  • Data scientists wanting cleaner features for classical ML or embeddings.

Prerequisites

  • Basic Python familiarity (lists, strings) or readiness to read pseudocode.
  • Know what tokens are and how tokenization works.
  • High-level idea of POS tags (noun, verb, adjective).

Concept explained simply

Stopwords are common words that often add little meaning in aggregate analyses (the, a, is). Removing them can reduce noise. Lemmatization maps inflected forms to a dictionary base form (better than naive stemming). Examples: studies → study, was → be, mice → mouse.

When to remove stopwords
  • Good for: topic modeling, TF-IDF features, small datasets with many fillers.
  • Be careful with: sentiment (not, never), quoted titles where stopwords carry meaning (the to in "To be or not to be"), and sequence tasks where syntax matters.
Lemmatization vs stemming
  • Stemming: fast, rule-based chops (studies → studi). Can harm meaning.
  • Lemmatization: uses vocabulary and POS (studies → study). Slower, more accurate.
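
Here is a minimal sketch of the contrast using NLTK (one lemmatizer among many; any stemmer/lemmatizer pair would illustrate the same point):

```python
import nltk
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer, WordNetLemmatizer

nltk.download("stopwords", quiet=True)
nltk.download("wordnet", quiet=True)

stemmer = PorterStemmer()
lemmatizer = WordNetLemmatizer()

# (word, WordNet POS): "v" = verb, "n" = noun
for word, pos in [("studies", "v"), ("was", "v"), ("mice", "n"), ("running", "v")]:
    print(f"{word}: stem={stemmer.stem(word)!r}, lemma={lemmatizer.lemmatize(word, pos=pos)!r}")
# studies: stem='studi', lemma='study'
# was: stem='wa', lemma='be'
# mice: stem='mice', lemma='mouse'
# running: stem='run', lemma='run'

# NLTK's default English list includes "not" -- override it for sentiment tasks.
print("not" in set(stopwords.words("english")))  # True
```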

Mental model

Imagine your text as a signal with background hum. Stopwords are the hum; lemmatization merges similar notes. Together, they make the melody (meaningful tokens) clearer for your model. Your goal: maximize meaning per token while minimizing noise.

Key choices you will make

  • Custom vs default stopword list: tailor to domain (keep not for sentiment; maybe remove chapter in ebooks).
  • Where to lemmatize: before or after stopword removal; usually POS-tag (and lemmatize) on the full sentence first so the tagger keeps its context.
  • Handle negations: often keep not, no, never; consider combining not + word into a single token (not_good), as in the sketch below.
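
A tiny helper shows one way to implement that merge. NEGATORS and merge_negations are hypothetical names, and real pipelines usually add scope rules (e.g., stop the merge at punctuation):

```python
# Hypothetical helper: fuse a negator with the token that follows it.
NEGATORS = {"not", "no", "never"}

def merge_negations(tokens: list[str]) -> list[str]:
    """['service', 'was', 'not', 'good'] -> ['service', 'was', 'not_good']"""
    out, i = [], 0
    while i < len(tokens):
        if tokens[i] in NEGATORS and i + 1 < len(tokens):
            out.append(tokens[i] + "_" + tokens[i + 1])  # e.g., not_good
            i += 2
        else:
            out.append(tokens[i])
            i += 1
    return out

print(merge_negations(["service", "was", "not", "good"]))
# ['service', 'was', 'not_good']
```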

Worked examples

Example 1: Generic news sentence

Input: "The economists were analyzing the reports and were not surprised."

  • Stopwords removed (keep not): [economists, analyzing, reports, not, surprised]
  • Lemmatized with POS: [economist, analyze, report, not, surprise]
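
Here is one way to reproduce this example end to end with NLTK. Treat it as a sketch: resource names vary slightly across NLTK versions (newer releases may also need punkt_tab and averaged_perceptron_tagger_eng), and exact lemmas depend on the tagger.

```python
import nltk
from nltk import pos_tag, word_tokenize
from nltk.corpus import stopwords, wordnet
from nltk.stem import WordNetLemmatizer

for pkg in ("punkt", "averaged_perceptron_tagger", "stopwords", "wordnet"):
    nltk.download(pkg, quiet=True)

def to_wordnet_pos(treebank_tag: str) -> str:
    """Map Penn Treebank tags (NNS, VBG, ...) to WordNet POS constants."""
    return {"J": wordnet.ADJ, "V": wordnet.VERB, "R": wordnet.ADV}.get(
        treebank_tag[0], wordnet.NOUN
    )

sentence = "The economists were analyzing the reports and were not surprised."
stop = set(stopwords.words("english")) - {"not"}  # policy choice: keep negation
lemmatizer = WordNetLemmatizer()

tagged = pos_tag(word_tokenize(sentence.lower()))  # tag first, with full context
kept = [(tok, tag) for tok, tag in tagged if tok.isalpha() and tok not in stop]
print([lemmatizer.lemmatize(tok, to_wordnet_pos(tag)) for tok, tag in kept])
# expected: ['economist', 'analyze', 'report', 'not', 'surprise']
```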
Example 2: Sentiment sentence with negation

Input: "I am not happy with the slow service."

  • Stopwords to remove: I, am, the
  • Keep: not
  • After removal: [not, happy, slow, service]
  • Lemmas: [not, happy, slow, service]
  • Note: Keeping not preserves polarity.
Example 3: Ecommerce search query

Input: "Best running shoes for men"

  • Stopwords removed (domain-aware): for is usually safe to drop; whether to keep best is a judgment call.
  • Tokens: [best, running, shoes, men]
  • Lemmas: [good, run, shoe, man], though best may stay best, depending on the lemmatizer.
  • Decision: Sometimes keep best for ranking signals, even if frequent.
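
Because lemmatizers differ, it is worth probing yours before committing to a policy. A quick check with NLTK's WordNet lemmatizer (results from other lemmatizers may differ, which is the point):

```python
import nltk
from nltk.stem import WordNetLemmatizer

nltk.download("wordnet", quiet=True)
wnl = WordNetLemmatizer()

# Print each query term's lemma under adjective, verb, and noun readings.
for word in ["best", "running", "shoes", "men"]:
    print(word, {pos: wnl.lemmatize(word, pos=pos) for pos in ("a", "v", "n")})
```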

Steps to apply in a project

  1. Tokenize: split text into words while handling punctuation.
  2. Lowercase: standardize casing unless casing carries meaning (e.g., Named Entities).
  3. Decide stopword policy: start with a default list, then add/remove based on validation.
  4. POS tagging: tag each token as noun/verb/etc.
  5. Lemmatize with POS: use the tag to select the correct lemma.
  6. Negation handling: ensure not/no are retained; optionally join with next adjective/verb.
  7. Evaluate: run an A/B on downstream task metrics (accuracy, F1, NDCG) to validate choices.
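
As a sketch of steps 1-6 in one place (step 7, evaluation, happens downstream), here is a spaCy version. It assumes spaCy 3.x with the en_core_web_sm model installed; spaCy bundles tokenization, tagging, POS-aware lemmatization, and a default stopword list.

```python
import spacy

nlp = spacy.load("en_core_web_sm")  # bundles steps 1, 4, 5
nlp.vocab["not"].is_stop = False    # steps 3/6: task-specific policy, keep negation

def normalize(text: str) -> list[str]:
    doc = nlp(text)
    return [
        tok.lemma_.lower()                   # steps 2, 5: lowercased, POS-aware lemma
        for tok in doc
        if tok.is_alpha and not tok.is_stop  # step 3: drop stopwords and punctuation
    ]

print(normalize("The economists were analyzing the reports and were not surprised."))
# roughly: ['economist', 'analyze', 'report', 'not', 'surprise']
```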

Exercises you can do now

These mirror the graded exercises below.

  1. Exercise 1: Remove stopwords (but keep not) from the sentence: "The movie was not as exciting as the trailer, but it was well-acted." Then list the resulting tokens.
  2. Exercise 2: Lemmatize the sentence: "The leaves were lying on the ground and the children were better at finding them." Use POS to guide lemmas.
  • Checklist: Did you keep negations? Did you use POS for verbs vs nouns? Did you adjust the stopword list to the task?

Common mistakes and self-check

  • Removing all stopwords blindly: loses meaning (e.g., not). Self-check: scan your list; ensure negations remain.
  • Lemmatizing without POS: wrong lemmas (lying → lie as verb vs lying as adjective). Self-check: confirm part-of-speech for ambiguous words.
  • Double-normalizing: lemmatizing after stemming produces odd tokens. Self-check: choose either stemming or lemmatization, not both (usually lemmatization).
  • Over-aggressive domain stopwords: removing words like best or free that matter in search. Self-check: compare retrieval metrics before/after.
Quick self-audit mini-check
  • Sample 10 outputs and verify not, never, no are preserved.
  • Spot-check 5 verbs in past tense; confirm they lemmatize to base verb.
  • Ensure numbers, emojis, or entities follow your task policy.
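
The first audit item is easy to script. In this sketch, normalize is a deliberately flawed stand-in whose stopword list swallows never, so the audit has something to catch; swap in your real pipeline (all names here are hypothetical):

```python
NEGATORS = {"not", "no", "never"}
STOP = {"the", "was", "a", "is", "never"}  # buggy on purpose: drops "never"

def normalize(text: str) -> list[str]:
    words = [w.strip(".,!?") for w in text.lower().split()]
    return [w for w in words if w and w not in STOP]

def audit_negations(samples: list[str]) -> None:
    """Warn whenever a negation present in the input is missing from the output."""
    for text in samples:
        words = {w.strip(".,!?") for w in text.lower().split()}
        lost = (words & NEGATORS) - set(normalize(text))
        if lost:
            print(f"WARNING: dropped {sorted(lost)} in: {text!r}")

audit_negations(["The service was not good.", "Never again!", "Great value."])
# WARNING: dropped ['never'] in: 'Never again!'
```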

Practical projects

  • Build a small sentiment classifier on 1,000 reviews. Compare three pipelines: baseline, baseline + stopwords, baseline + stopwords + lemmatization. Report F1 and show which tokens were most informative.
  • Create a mini search index of 500 product titles. Try different stopword lists and see which set yields the best top-5 precision on 20 test queries.
  • Topic modeling (LDA) on news headlines. Run with and without lemmatization; inspect topic coherence and example top words.
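
For the first project, a scaffold like the following makes the comparison concrete (scikit-learn, with toy data standing in for the 1,000 labeled reviews). Note that sklearn's built-in english stopword list removes "not", which is exactly the kind of policy choice the comparison should surface:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline

# Toy stand-in for the 1,000 labeled reviews (0 = negative, 1 = positive).
texts = ["not good at all", "really great service", "terrible, never again",
         "loved every minute", "not worth the money", "excellent quality"] * 50
labels = [0, 1, 0, 1, 0, 1] * 50

pipelines = {
    "baseline": TfidfVectorizer(),
    "+stopwords": TfidfVectorizer(stop_words="english"),  # caution: removes "not"
    # "+lemmas": add a custom tokenizer/analyzer that lemmatizes (e.g., NLTK or spaCy)
}

for name, vectorizer in pipelines.items():
    model = make_pipeline(vectorizer, LogisticRegression(max_iter=1000))
    scores = cross_val_score(model, texts, labels, cv=5, scoring="f1")
    print(f"{name}: mean F1 = {scores.mean():.3f}")
```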

Learning path

  • Before this: tokenization, basic text cleaning.
  • Now: stopwords and lemmatization (this page).
  • Next: handling punctuation and numbers, subword tokenization, phrase detection (bigrams), and domain-specific normalization rules.

Next steps

  • Take the quick test below to confirm your understanding.
  • Optional: re-run an old model with new preprocessing and compare metrics.


Mini challenge

Design a stopword policy for sarcasm detection on tweets. Which words would you keep that are often removed? List five examples and explain why each one should stay (e.g., not, totally, literally, just, actually).

Quick Test

When you are ready, answer the questions to check your understanding.

Practice Exercises

2 exercises to complete

Instructions

Remove stopwords from the sentence below, but keep negations: not, no, never. Lowercase the text and keep punctuation-only tokens out.

Sentence: "The movie was not as exciting as the trailer, but it was well-acted."

  • Step 1: Tokenize and lowercase.
  • Step 2: Remove standard stopwords except not.
  • Step 3: Output the final token list.
Expected Output
[movie, not, exciting, trailer, well-acted]
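
One possible solution sketch in plain Python, with a hand-picked stopword list so the output is deterministic:

```python
# Hand-picked stopword list keeps the output predictable for this sentence.
STOP = {"the", "was", "as", "but", "it"}
KEEP = {"not", "no", "never"}  # negations survive even if a list contains them

sentence = "The movie was not as exciting as the trailer, but it was well-acted."
tokens = [w.strip(".,!?;:") for w in sentence.lower().split()]           # steps 1-2
result = [t for t in tokens if t and (t in KEEP or t not in STOP)]       # step 3
print(result)
# ['movie', 'not', 'exciting', 'trailer', 'well-acted']
```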

Stopwords And Lemmatization Basics — Quick Test

Test your knowledge with 6 questions. Pass with 70% or higher.

