
Text Cleaning

Learn Text Cleaning for free with explanations, exercises, and a quick test (for Data Analysts).

Published: December 19, 2025 | Updated: December 19, 2025

Who this is for

Data Analysts who work with survey responses, support tickets, product names, and any free-text fields that need to be standardized before analysis or modeling.

Prerequisites

  • Basic familiarity with spreadsheets or a scripting language (e.g., SQL or Python).
  • Comfort with simple regular expressions (find/replace patterns).
  • Understanding of your downstream task (dashboards, segmentation, NLP features, matching).

Why this matters

Real analyst tasks depend on clean text:

  • Consolidating messy product names for accurate sales aggregation.
  • Cleaning survey comments to analyze themes and sentiment.
  • Removing emails, URLs, and IDs to protect privacy while keeping meaning.
  • Normalizing casing, punctuation, and Unicode so joins, deduping, and keyword filters actually work.
Example work outcomes
  • Reduce the count of "unique" product names from 1,200 to under 200 by canonicalizing text.
  • Improve keyword recall in ticket routing by normalizing dashes, quotes, and accents.
  • Prevent broken joins caused by non‑breaking spaces and zero‑width characters.

Concept explained simply

Text cleaning is turning many messy ways of writing the same thing into a consistent, comparable form. You make small, repeatable adjustments (case, spaces, punctuation, accents, hidden characters) so downstream steps are reliable.

Mental model: a pipeline of safe filters

Imagine your text flows through labeled pipes. Each pipe does one simple, reversible or clearly documented change. You choose pipes based on your task:

  • Always safe: trim whitespace, collapse multiple spaces, normalize Unicode, standardize hyphens/quotes, remove zero‑width chars, fix encoding glitches.
  • Task‑dependent: lowercasing, removing diacritics, removing stopwords, replacing numbers, stripping punctuation.
  • Privacy: replace emails/URLs/IDs with placeholders like <EMAIL>, <URL>, <ID>.

Keep the pipeline deterministic and documented so results are reproducible.
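The "pipeline of safe filters" idea can be sketched in a few lines of Python. This is a minimal illustration, not a library: the function names and the exact ordering of pipes are choices you would document for your own project.

```python
import re
import unicodedata

def normalize_unicode(text: str) -> str:
    # NFKC folds compatibility characters (incl. non-breaking space -> space);
    # then drop zero-width characters entirely
    text = unicodedata.normalize("NFKC", text)
    return re.sub(r"[\u200B-\u200D\uFEFF]", "", text)

def standardize_punct(text: str) -> str:
    text = re.sub(r"[\u201C\u201D]", '"', text)   # smart double quotes -> "
    text = re.sub(r"[\u2018\u2019]", "'", text)   # smart single quotes -> '
    return re.sub(r"[\u2013\u2014]", "-", text)   # en/em dashes -> hyphen

def collapse_whitespace(text: str) -> str:
    # Any run of whitespace (tabs, newlines, spaces) -> one space, then trim
    return re.sub(r"\s+", " ", text).strip()

# "Always safe" pipes, applied in a fixed, documented order
PIPELINE = [normalize_unicode, standardize_punct, collapse_whitespace]

def clean(text: str) -> str:
    for step in PIPELINE:
        text = step(text)
    return text

print(clean("  Café\u00A0—\u200Btest  "))  # Café -test
```

Because each pipe is a plain function, you can add task-dependent pipes (lowercasing, diacritic stripping) to `PIPELINE` per project while keeping the safe core unchanged.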

Worked examples

Example A: Clean survey comments for topic analysis

Input: " Loved the product!!! Visit https://shop.example.com NOW "

  1. Trim and collapse whitespace.
  2. Normalize Unicode (quotes, dashes) and punctuation runs ("!!!" → "!").
  3. Lowercase (task‑dependent; safe for topic analysis).
  4. Replace URLs with <URL>.
Result
loved the product! visit <url> now
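The four steps of Example A translate directly into regex substitutions. A sketch using Python's `re` module (the ordering mirrors the numbered steps above):

```python
import re

text = " Loved the product!!! Visit https://shop.example.com NOW "

text = re.sub(r"\s+", " ", text).strip()        # 1. trim + collapse whitespace
text = re.sub(r"([!?.])\1+", r"\1", text)        # 2. punctuation runs: "!!!" -> "!"
text = text.lower()                              # 3. lowercase (task-dependent)
text = re.sub(r"https?://\S+|www\.\S+", "<url>", text)  # 4. URLs -> placeholder

print(text)  # loved the product! visit <url> now
```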

Example B: Standardize product names

Input: "Café—Mocha® "

  1. Trim and collapse spaces.
  2. Normalize Unicode and strip diacritics (Café → Cafe) when your matching is accent‑insensitive.
  3. Standardize dashes to hyphen; remove trademark symbols.
  4. Lowercase for canonical comparison (keep a display version if needed).
Result
cafe-mocha
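Diacritic stripping (step 2 above) is commonly done with Unicode decomposition: NFD splits "é" into "e" plus a combining accent, and the combining marks are then dropped. A sketch with Python's standard `unicodedata` module:

```python
import unicodedata

def strip_diacritics(text: str) -> str:
    # Decompose (NFD) so accents become separate combining marks, then drop them
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))

print(strip_diacritics("Café—Mocha"))  # Cafe—Mocha (the dash is untouched)
```

Note this only removes accents; dash standardization and symbol removal (®) are separate pipes.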

Example C: Remove sensitive bits but keep meaning

Input: "Email me at John.Doe+promo@example.com — thanks"

  1. Normalize Unicode dash to hyphen.
  2. Replace email with <EMAIL> to preserve structure without exposing data.
  3. Lowercase and trim.
Result
email me at <email> - thanks

How to do it step by step

  1. Unicode normalize: Use NFC/NFKC, remove zero‑width and non‑breaking spaces.
  2. Whitespace: Trim; replace any run of whitespace with a single space.
  3. Quotes/dashes: Convert smart quotes/dashes to plain " and -.
  4. PII placeholders: Replace emails, URLs, and obvious IDs with <EMAIL>, <URL>, <ID>.
  5. Case/accents (if appropriate): lowercase; strip diacritics for matching.
  6. Punctuation (task‑dependent): normalize or keep; avoid removing punctuation that encodes meaning (e.g., "C++").
  7. Numbers (task‑dependent): keep, standardize formats, or replace with <NUM> for modeling.
  8. Document: Save your pipeline steps and examples.
Copy-ready patterns
- Collapse whitespace:  regex \s+  → " "
- Normalize punctuation runs:  ([!?\.])\1+  → \1
- Emails:  [A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}  → <EMAIL>
- URLs:  https?://\S+|www\.\S+  → <URL>
- Zero-width:  [\u200B-\u200D\uFEFF]  → ""
- Non-breaking space:  \u00A0  → " "
- Smart quotes:  [“”]→" , [‘’]→'
- Dashes:  [–—]→-
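The copy-ready patterns above can be bundled into one ordered rule list. A sketch in Python (the `RULES`/`apply_rules` names are illustrative; note whitespace collapse runs last so earlier replacements cannot leave double spaces):

```python
import re

# (pattern, replacement) pairs, applied in order
RULES = [
    (re.compile(r"[\u200B-\u200D\uFEFF]"), ""),        # zero-width chars
    (re.compile(r"\u00A0"), " "),                       # non-breaking space
    (re.compile(r"[\u201C\u201D]"), '"'),               # smart double quotes
    (re.compile(r"[\u2018\u2019]"), "'"),               # smart single quotes
    (re.compile(r"[\u2013\u2014]"), "-"),               # en/em dashes
    (re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}"), "<EMAIL>"),
    (re.compile(r"https?://\S+|www\.\S+"), "<URL>"),
    (re.compile(r"([!?.])\1+"), r"\1"),                 # punctuation runs
    (re.compile(r"\s+"), " "),                          # collapse whitespace (last)
]

def apply_rules(text: str) -> str:
    for pattern, repl in RULES:
        text = pattern.sub(repl, text)
    return text.strip()

print(apply_rules("Mail me\u00A0at a@b.com!!!"))  # Mail me at <EMAIL>!
```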

Exercises

These mirror the exercises below. Complete them in your preferred tool (spreadsheet, SQL, Python). Use the checklist to self‑review.

Exercise 1: Clean short comments

Data lines:

1) "  Loved the product!!!  Visit https://shop.example.com NOW  "
2) "Email me at John.Doe+promo@example.com — thanks"
3) "Price was 1,299.00 USD... great value."
4) "Café chairs\tare nice — but out of stock :("

Goal: Apply a pipeline to produce the expected outputs (see exercise section below). Steps:

  • [ ] Normalize Unicode and punctuation (quotes/dashes).
  • [ ] Trim and collapse whitespace (tabs → space).
  • [ ] Replace URLs with <URL> and emails with <EMAIL>.
  • [ ] Lowercase.
  • [ ] Reduce punctuation runs ("!!!" → "!", "..." → ".").
  • [ ] Decide whether to strip diacritics (do it here).
Tips
  • Order matters: whitespace collapse after replacements prevents leftover gaps.
  • Keep placeholder casing consistent: either lowercase (<url>, <email>) or uppercase (<URL>, <EMAIL>) throughout.

Exercise 2: Canonicalize product names for dedup

Inputs:

"Café Mocha"
"Cafe  Mocha"
"CAFE MOCHA®"
"cafe-mocha"
"Cafe—Mocha"
"cafe mocha (Large)"
"Mocha, Cafe"

Goal: Create a canonical form that groups all variants together for counting while keeping an original display string.

  • [ ] Lowercase for canonical keys.
  • [ ] Strip diacritics (Café → Cafe). Remove ®, ™.
  • [ ] Replace dashes/commas with spaces; collapse whitespace.
  • [ ] Remove parenthetical size notes.
  • [ ] Token-sort for grouping (alphabetically sort tokens).
Hint

Use a "display_clean" (no token sort) and a "group_key" (token-sorted). Aggregate by group_key.

Common mistakes and self-check

  • Over-cleaning: Removing punctuation or case when product codes rely on them. Self-check: sample 50 records with codes and compare matches before/after.
  • Not handling Unicode: Hidden characters break joins. Self-check: length before/after zero-width removal.
  • Destroying numbers: Replacing all digits hides quantities. Self-check: confirm metrics derived from numbers remain accurate.
  • Unlogged changes: Inability to reproduce. Self-check: store pipeline version + example IO pairs.
  • Blind stopword removal: Kills meaning in short texts. Self-check: manual read of random 20 comments after stopword removal; if comprehension drops, revert.

Practical projects

  • Build a text cleaning pipeline that takes CSV comments and outputs cleaned text with placeholders and flags for changes applied.
  • Create a dedup report for product names showing original → display_clean → group_key, with group sizes.
  • Make a regex cookbook page with examples and test cases for your team.

Learning path

  1. Master whitespace and Unicode normalization.
  2. Standardize punctuation, quotes, and dashes.
  3. Introduce placeholders for PII and patterns (emails, URLs, IDs).
  4. Task‑dependent choices: case, diacritics, numbers, stopwords.
  5. Build a reusable, documented pipeline with tests and examples.

Next steps

  • Integrate your pipeline into ETL or notebooks so it runs automatically.
  • Add unit tests on tricky strings (accents, zero‑width, long punctuation).
  • Measure impact: fewer unique keys, better match rates, clearer topics.
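Unit tests on tricky strings can be plain assertions against your pipeline. A self-contained sketch (the small `clean` function here exists only to make the tests runnable; substitute your own pipeline):

```python
import re
import unicodedata

def clean(text: str) -> str:
    # Minimal "always safe" cleaner, for demonstration only
    text = unicodedata.normalize("NFKC", text)
    text = re.sub(r"[\u200B-\u200D\uFEFF]", "", text)
    text = re.sub(r"[\u2013\u2014]", "-", text)
    text = re.sub(r"([!?.])\1+", r"\1", text)
    return re.sub(r"\s+", " ", text).strip()

# Tricky strings: accents, zero-width chars, long punctuation runs
assert clean("Cafe\u0301") == "Café"       # NFKC composes e + combining accent
assert clean("a\u200Bb") == "ab"           # zero-width space removed
assert clean("wow!!!!") == "wow!"          # punctuation run reduced
assert clean("ok\u00A0—\u00A0fine") == "ok - fine"  # NBSP + em dash
print("all checks passed")
```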

Mini challenge

Clean these ticket subjects for routing. Produce a canonical form and a display version:

"[URGENT]  Login—failure on iOS 17…"
"Login failure (iOS-17)"
"login failure - IOS17"
One possible answer
display_clean:
"[urgent] login-failure on ios 17."
"login failure ios-17"
"login failure - ios17"

group_key (token-sorted after punctuation→space, digits kept):
"failure ios 17 login"

Quick Test

Take the quiz to check your understanding.

Practice Exercises

2 exercises to complete

Instructions

Clean the following 4 lines using a deterministic pipeline. Use any tool (spreadsheet, SQL, Python). Apply these steps in order:

  1. Normalize Unicode (quotes/dashes) and remove zero‑width/non‑breaking spaces.
  2. Trim and collapse whitespace (tabs/newlines to single spaces).
  3. Replace URLs with <URL> and emails with <EMAIL>.
  4. Lowercase the text.
  5. Reduce punctuation runs (e.g., "!!!"→"!", "..."→".").
  6. Strip diacritics.
1) "  Loved the product!!!  Visit https://shop.example.com NOW  "
2) "Email me at John.Doe+promo@example.com — thanks"
3) "Price was 1,299.00 USD... great value."
4) "Café chairs\tare nice — but out of stock :("
Expected Output
1) loved the product! visit <url> now
2) email me at <email> - thanks
3) price was 1,299.00 usd. great value.
4) cafe chairs are nice - but out of stock :(
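The six steps above can be sketched as one deterministic Python function that reproduces the expected output. The step order matters: URLs and emails are replaced before lowercasing (so the placeholders become <url>/<email>) and before punctuation-run reduction (so dots inside URLs are never touched).

```python
import re
import unicodedata

def exercise_pipeline(text: str) -> str:
    # 1. Unicode: smart quotes/dashes, zero-width and non-breaking spaces
    text = re.sub(r"[\u200B-\u200D\uFEFF]", "", text).replace("\u00A0", " ")
    text = re.sub(r"[\u201C\u201D]", '"', text)
    text = re.sub(r"[\u2018\u2019]", "'", text)
    text = re.sub(r"[\u2013\u2014]", "-", text)
    # 2. Trim and collapse whitespace (tabs/newlines to single spaces)
    text = re.sub(r"\s+", " ", text).strip()
    # 3. Placeholders for URLs and emails
    text = re.sub(r"https?://\S+|www\.\S+", "<URL>", text)
    text = re.sub(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}", "<EMAIL>", text)
    # 4. Lowercase (placeholders become <url>, <email>)
    text = text.lower()
    # 5. Punctuation runs ("!!!" -> "!", "..." -> ".")
    text = re.sub(r"([!?.])\1+", r"\1", text)
    # 6. Strip diacritics via NFD decomposition
    nfd = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in nfd if not unicodedata.combining(ch))

lines = [
    "  Loved the product!!!  Visit https://shop.example.com NOW  ",
    "Email me at John.Doe+promo@example.com \u2014 thanks",
    "Price was 1,299.00 USD... great value.",
    "Caf\u00e9 chairs\tare nice \u2014 but out of stock :(",
]
for line in lines:
    print(exercise_pipeline(line))
```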

Text Cleaning — Quick Test

Test your knowledge with 8 questions. Pass with 70% or higher.

