Why this matters
Before tokenizing, removing stopwords, or choosing a model, you must know the text’s language. As an NLP Engineer, you will:
- Route messages to the right tokenizer, stemmer, or lemmatizer per language.
- Pick the right embeddings or model head for multilingual systems.
- Filter unsupported languages early to avoid errors downstream.
- Measure multilingual coverage and detect code-switching in chats and social posts.
Concept explained simply
Language detection (a.k.a. language identification) decides which language a given text is written in. It often outputs a label (like "en", "es", "fr") and a confidence score.
Mental model: Every language leaves a fingerprint in text: character patterns (like "th" in English), common function words (like "de" in French/Spanish), and script clues (Cyrillic vs Latin). Your detector compares the text’s fingerprint to known fingerprints and picks the closest one.
What counts as a strong fingerprint?
- Character n-grams (e.g., trigrams) across the whole text.
- Stopword hits and their relative frequencies.
- Unicode script coverage (e.g., Hangul vs Latin vs Cyrillic).
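Each of these signals can be computed in a few lines. Here is a minimal sketch in plain Python; the helper names and the coarse script categories are illustrative choices for this lesson, not a standard API.

```python
import unicodedata
from collections import Counter

def char_trigrams(text: str) -> Counter:
    """Count character trigrams, padding with spaces so word boundaries show up."""
    padded = f" {text.lower().strip()} "
    return Counter(padded[i:i + 3] for i in range(len(padded) - 2))

def stopword_hits(text: str, stopwords: set) -> int:
    """Count tokens that appear in a language's stopword list."""
    return sum(tok.strip(".,!?;:") in stopwords for tok in text.lower().split())

def script_counts(text: str) -> Counter:
    """Roughly tally Unicode scripts by inspecting each letter's character name."""
    counts = Counter()
    for ch in text:
        if ch.isalpha():
            name = unicodedata.name(ch, "")
            if "CYRILLIC" in name:
                counts["Cyrillic"] += 1
            elif "HANGUL" in name:
                counts["Hangul"] += 1
            elif "LATIN" in name:
                counts["Latin"] += 1
            else:
                counts["Other"] += 1
    return counts

print(char_trigrams("the cat").most_common(3))   # e.g. [(' th', 1), ('the', 1), ('he ', 1)]
print(stopword_hits("the cat is on the mat", {"the", "is", "on"}))  # 4
print(script_counts("Привет, мир"))              # Counter({'Cyrillic': 9})
```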
Where it fits in your pipeline
- Run detection early, before language-specific normalization (e.g., stemming/lemmatizing).
- Do not strip accents/diacritics before detection—they carry signal (e.g., "é" vs "e").
- Keep spaces and most punctuation for n-grams; they help mark word boundaries.
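To make "keep diacritics, keep word boundaries" concrete, here is one possible normalization helper. Replacing punctuation with spaces (rather than deleting it outright) is an illustrative choice that preserves the boundaries the n-grams rely on.

```python
def normalize_for_detection(text: str) -> str:
    """Lowercase and keep letters (with their diacritics) and apostrophes.
    Everything else becomes a space, so word boundaries survive for n-grams."""
    kept = [ch if ch.isalpha() or ch == "'" else " " for ch in text.lower()]
    # Collapse the runs of spaces left behind by removed punctuation and digits.
    return " ".join("".join(kept).split())

print(normalize_for_detection("Café!! Crème, s'il vous plaît..."))  # café crème s'il vous plaît
```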
Approaches you can start with
- Unicode/script clues: Quick filter. If text is mostly Cyrillic, many languages are eliminated. But script ≠ language; use it as a hint, not a final answer.
- Stopword voting: Count how many common function words of each language appear. Works well on medium texts; can be noisy for very short inputs. (A concrete sketch follows this list.)
- Character n-grams: Build a small profile of frequent trigrams per language and score a text by overlap. Robust across domains and misspellings.
- Confidence and thresholds: Normalize scores and set a minimum threshold (e.g., 0.7). If no language passes, return unknown.
- Short texts: Use cautious thresholds; it’s okay to return unknown for 1–2 word queries.
- Mixed-language (code-switching): Start simple: report the dominant language plus a low-confidence flag (or proportions) to indicate mixing.
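Here is that first sketch of stopword voting, with a confidence threshold and an unknown fallback. The three stopword lists are deliberately tiny toys, and the 0.5 default threshold is illustrative; with richer lists you might move toward the 0.7 mentioned above.

```python
# Toy stopword lists; real lists would be larger and loaded from a resource file.
STOPWORDS = {
    "en": {"the", "is", "and", "of", "to", "in", "but", "very"},
    "es": {"el", "la", "es", "de", "en", "que", "y", "muy"},
    "fr": {"le", "la", "est", "de", "du", "en", "et", "très"},
}

def stopword_vote(text: str, threshold: float = 0.5):
    """Score each language by its share of stopword hits;
    return ("unknown", scores) when no language is confident enough."""
    tokens = [t.strip(".,!?;:¿¡") for t in text.lower().split()]
    if not tokens:
        return "unknown", {}
    hits = {lang: sum(t in words for t in tokens) for lang, words in STOPWORDS.items()}
    total = sum(hits.values())
    if total == 0:
        return "unknown", hits
    scores = {lang: h / total for lang, h in hits.items()}
    best = max(scores, key=scores.get)
    return (best if scores[best] >= threshold else "unknown"), scores

print(stopword_vote("El aprendizaje automático es útil en análisis de texto."))
# ('es', {'en': 0.0, 'es': 0.67, 'fr': 0.33}) approximately
```

Notice that "de", "en", and "la" are function words in more than one of these languages; that overlap is exactly why stopword votes are worth combining with character n-grams.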
Production tips you can adopt later
- Back-off strategy: script → n-grams → stopwords; combine with weighted voting.
- Language set restriction: if you only support a few languages, restrict scoring to those.
- Cache frequent domains/users to speed up repeated detections.
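A rough sketch of how these tips can fit together: a script filter restricted to the supported set, a toy trigram scorer as the second stage, and a cache in front. The profiles, the Cyrillic-only check, and the cache size are placeholders, and the stopword stage of the back-off is omitted to keep the sketch short.

```python
from functools import lru_cache

SUPPORTED = frozenset({"en", "es", "fr"})  # restrict scoring to the languages you serve

def script_hint(text: str) -> frozenset:
    """Cheap first stage: anything containing Cyrillic is outside our supported set."""
    if any("\u0400" <= ch <= "\u04ff" for ch in text):
        return frozenset()
    return SUPPORTED

def ngram_scores(text: str, candidates: frozenset) -> dict:
    """Second stage: count matches against toy per-language trigram profiles."""
    profiles = {"en": {" th", "the", "ing"}, "es": {" de", "ión", "aje"}, "fr": {" le", "ent", "age"}}
    padded = f" {text.lower()} "
    grams = {padded[i:i + 3] for i in range(len(padded) - 2)}
    return {lang: len(grams & profiles[lang]) for lang in candidates}

@lru_cache(maxsize=10_000)
def detect(text: str) -> str:
    """Back-off: script filter, then n-gram scoring; results are cached for repeats."""
    candidates = script_hint(text)
    if not candidates:
        return "unknown"
    scores = ngram_scores(text, candidates)
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else "unknown"

print(detect("Le traitement du langage naturel est fascinant."))  # fr
print(detect("Привет, как дела?"))                                # unknown (outside the supported set)
```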
Worked examples
Example 1 — "This dataset is small but very clean."
Stopwords "is", "but", "very" and trigrams like " th" (from "this"), "is ", and "ery" point strongly to English. Predicted: en, high confidence.
Example 2 — "El aprendizaje automático es útil en análisis de texto."
Stopwords "el", "en", "de" and trigrams like " de", "es ", and "aje" favor Spanish. Predicted: es, high confidence.
Example 3 — "Le traitement du langage naturel est fascinant."
Stopwords "le", "du", "est" and trigrams like " le", "ent", and "age" favor French. Predicted: fr, high confidence.
Build a simple detector (step-by-step)
- Define languages and resources: small stopword lists and 10–20 common trigrams per language.
- Preprocess: lowercase; keep spaces and diacritics; remove only extra punctuation.
- Score:
  - Stopword score = count of hits / tokens.
  - N-gram score = sum of matched trigrams (optionally weighted).
- Combine: final = 0.4 * stopword + 0.6 * n-gram (tweakable).
- Normalize: Softmax or divide by sum to get per-language confidences.
- Decide: If max confidence ≥ threshold (e.g., 0.7), return that language; else return unknown.
- Edge cases: If the text is shorter than about 5 characters, skip n-grams, lower the threshold, or return unknown. (A complete sketch of these steps follows below.)
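Putting the steps together, here is one possible end-to-end sketch. The stopword lists and trigram profiles are tiny placeholders (a real detector would use the 10–20 trigrams per language suggested above), and with resources this small the normalized confidences stay flat, which is why the demo call lowers the threshold.

```python
STOPWORDS = {
    "en": {"the", "is", "and", "but", "very", "of", "to", "in"},
    "es": {"el", "la", "es", "de", "en", "que", "muy", "un"},
    "fr": {"le", "la", "est", "de", "du", "en", "très", "un"},
}

# Real profiles would hold 10-20 frequent trigrams per language; these are toy-sized.
TRIGRAMS = {
    "en": {" th", "the", "he ", "ing", "and", "is "},
    "es": {" de", "de ", "la ", "ión", "es ", "os "},
    "fr": {" le", "le ", " de", "ent", "es ", "que"},
}

def detect_language(text: str, threshold: float = 0.7):
    """Return (language, confidence), or ("unknown", confidence) below the threshold."""
    clean = text.lower().strip()
    tokens = [t.strip(".,!?;:¿¡") for t in clean.split()]
    if not tokens:
        return "unknown", 0.0

    raw = {}
    for lang in STOPWORDS:
        # Stopword score: fraction of tokens that are stopwords for this language.
        sw = sum(t in STOPWORDS[lang] for t in tokens) / len(tokens)
        # N-gram score: share of the profile's trigrams found in the text
        # (skipped for very short inputs, per the edge-case step above).
        if len(clean) < 5:
            ng = 0.0
        else:
            padded = f" {clean} "
            grams = {padded[i:i + 3] for i in range(len(padded) - 2)}
            ng = len(grams & TRIGRAMS[lang]) / len(TRIGRAMS[lang])
        raw[lang] = 0.4 * sw + 0.6 * ng  # combine; the weights are tweakable

    total = sum(raw.values())
    if total == 0:
        return "unknown", 0.0
    confidences = {lang: score / total for lang, score in raw.items()}
    best = max(confidences, key=confidences.get)
    label = best if confidences[best] >= threshold else "unknown"
    return label, round(confidences[best], 2)

# With toy resources the scores are flat, so the demo uses a lower threshold.
print(detect_language("El aprendizaje automático es útil en análisis de texto.", threshold=0.5))  # ('es', 0.55)
print(detect_language("ok"))  # ('unknown', 0.0): too short to trust
```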
Exercises (practice)
Work through the tasks below. A full solution is available inside each exercise card. After finishing, use the checklist to self-review.
Exercise 1 — Stopword voting for en/es/fr
See the Exercises section below for the exact instructions and solution (ID: ex1).
Exercise 2 — Character trigram scoring
See the Exercises section below for the exact instructions and solution (ID: ex2).
Self-check checklist:
- You return unknown when confidence is below threshold.
- You keep diacritics and spaces during scoring.
- You can explain why a prediction was made (top features).
- You handle very short inputs conservatively.
- You can detect likely code-switching (e.g., second-highest score close to the top).
Common mistakes and how to self-check
- Stripping accents before detection: You lose discriminative signal. Self-check: Do "cafe" and "café" behave identically? If yes, you may be over-normalizing.
- Always forcing a label: Causes noisy routing. Self-check: Inspect score distributions; adopt a threshold and allow unknown.
- Ignoring short-text behavior: 1–2 words are unreliable. Self-check: Measure accuracy by length buckets and adjust thresholds.
- Assuming script = language: Cyrillic covers multiple languages. Self-check: Add a second-stage scorer after script detection.
- No evaluation on imbalanced data: Overall accuracy can mislead. Self-check: Use macro-averaged F1 across languages (see the sketch after this list).
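For the last two self-checks, here is a small evaluation sketch: a hand-rolled macro-averaged F1 and an accuracy-by-length report. The texts, gold labels, and bucket edges in the demo are made up for illustration.

```python
from collections import defaultdict

def macro_f1(y_true, y_pred):
    """Average per-language F1 so rare languages weigh as much as common ones."""
    f1s = []
    for lang in set(y_true) | set(y_pred):
        tp = sum(t == lang and p == lang for t, p in zip(y_true, y_pred))
        fp = sum(t != lang and p == lang for t, p in zip(y_true, y_pred))
        fn = sum(t == lang and p != lang for t, p in zip(y_true, y_pred))
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1s.append(2 * precision * recall / (precision + recall) if precision + recall else 0.0)
    return sum(f1s) / len(f1s)

def accuracy_by_length(texts, y_true, y_pred, edges=(0, 20, 60, 10_000)):
    """Report accuracy per character-length bucket to expose short-text weakness."""
    buckets = defaultdict(list)
    for text, t, p in zip(texts, y_true, y_pred):
        for lo, hi in zip(edges, edges[1:]):
            if lo <= len(text) < hi:
                buckets[f"{lo}-{hi} chars"].append(t == p)
                break
    return {bucket: sum(ok) / len(ok) for bucket, ok in buckets.items()}

# Hypothetical gold labels and predictions, just to show the calls.
texts = ["ok", "El gato duerme.", "Le chat dort sur le canapé tout l'après-midi."]
gold = ["unknown", "es", "fr"]
pred = ["en", "es", "fr"]
print(macro_f1(gold, pred))                   # 0.5
print(accuracy_by_length(texts, gold, pred))  # {'0-20 chars': 0.5, '20-60 chars': 1.0}
```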
Mini challenge
Label each input with a language and a confidence in [0,1], or return unknown if not sure. Explain your top 2 signals.
- "Vamos al cine esta noche?"
- "Data wrangling & modeling"
- "Analyse du discours politique"
- "ok"
Who this is for
- New NLP Engineers setting up multilingual preprocessing.
- Data Scientists integrating language-aware analytics.
- ML Engineers adding routing to production pipelines.
Prerequisites
- Basic Python or pseudocode comfort.
- Understanding of tokenization and stopwords.
- Familiarity with precision/recall and confidence thresholds.
Learning path
- Before: Text cleaning (whitespace, punctuation), tokenization basics.
- Now: Language detection basics (this lesson).
- Next: Language-specific normalization (stemming/lemmatization), handling code-switching, and multilingual embeddings.
Practical projects
- Build a lightweight detector for English/Spanish/French with stopwords + trigrams; log confidence histograms.
- Route user tickets by language to separate preprocessing pipelines; measure downstream accuracy gains.
- Implement a code-switch flag: if the second-best score is within 0.15 of the top, mark the text as mixed (a minimal sketch follows this list).
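For that third project, the flag can start as small as the sketch below; confidences is assumed to be the normalized per-language scores returned by your detector, and the 0.15 margin comes straight from the project description.

```python
def code_switch_flag(confidences: dict, margin: float = 0.15) -> bool:
    """Flag likely mixed-language text when the runner-up score is close to the winner."""
    if len(confidences) < 2:
        return False
    top, second = sorted(confidences.values(), reverse=True)[:2]
    return (top - second) <= margin

print(code_switch_flag({"en": 0.48, "es": 0.41, "fr": 0.11}))  # True: likely en/es mixing
print(code_switch_flag({"fr": 0.80, "es": 0.15, "en": 0.05}))  # False: clearly French
```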
Next steps
- Expand language coverage by adding profiles gradually and re-evaluating macro-F1.
- Tune thresholds per text length bucket.
- Add explainability: show top n-grams or stopwords that influenced a decision.
Quick test
Take the quick test below. Everyone can take it; only logged-in users get saved progress.