Sentence Segmentation

Learn Sentence Segmentation for free with explanations, exercises, and a quick test (for NLP Engineer).

Published: January 5, 2026 | Updated: January 5, 2026

Why this matters

Sentence segmentation splits raw text into sentences. Downstream NLP models assume clean sentence boundaries for tasks like machine translation, summarization, question answering, sentiment analysis, and information extraction. Poor segmentation causes incorrect context windows, broken entities, and degraded model performance.

  • Customer support: identify each complaint sentence to route to the right team.
  • Summarization: feed sentence-level units to rankers.
  • QA: retrieve the exact sentence with the answer span.
  • ASR transcripts: restore sentences from punctuation-less text.

Concept explained simply

Sentence segmentation decides where one sentence ends and the next begins. Boundaries usually fall at punctuation such as '.', '!', '?', or '।', but a splitter must avoid false splits after abbreviations such as 'Dr.' or 'U.S.'.

Mental model

Imagine walking through text with a highlighter. You highlight right after a true end-of-sentence marker. Your toolkit: punctuation patterns, abbreviation lists, capitalization cues, quotation handling, and sometimes a learned model that predicts boundaries from context.

Key methods you can use

1) Rule-based heuristics (fast, transparent)
  • Split on '.', '!', '?', '…', '।', '。', '!', '?' unless the period belongs to an abbreviation, decimal, or ordinal.
  • Keep the delimiter with the sentence for readability and alignment.
  • Use negative lists: titles (Dr., Mr., Ms.), initials (A.B.), acronyms (U.S., A.I.), decimals (3.14), ordinal (No. 3).
  • Post-checks: if the next token does not start with an uppercase letter (for languages that use case) and the previous token is likely an abbreviation, do not split.
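
One possible way to combine these rules is a single left-to-right scan. The sketch below is illustrative, not a reference implementation: the names split_sentences, is_boundary, ABBREVIATIONS, DOTTED, and the specific character lists are assumptions made for this example. The lowercase-continuation check is a simplification and will miss some cases, such as a sentence that really does end in 'U.S.'.

```python
import re

# Illustrative negative list (lowercased); extend it for your domain.
ABBREVIATIONS = {"dr.", "mr.", "ms.", "jr.", "no.", "etc.", "vs.", "fig."}
# Dotted initials and acronyms such as 'U.S.', 'A.I.', 'p.m.', 'A.B.'.
DOTTED = re.compile(r"(?:[A-Za-z]\.)+")
TERMINATORS = ".!?…।。!?"
NO_SPACE_NEEDED = "।。!?"  # marks that may be followed directly by text
CLOSERS = "'\"’”)]」』"


def is_boundary(text, sent_start, term_start, term_end, close_end):
    n = len(text)
    if close_end == n:
        return True
    if text[term_start] not in NO_SPACE_NEEDED and not text[close_end].isspace():
        return False  # internal mark: '3.14', 'v1.2.0', 'U.S'
    if text[term_start] == ".":
        # Find the whitespace-delimited token that ends with this period.
        k = term_start
        while k > sent_start and not text[k - 1].isspace():
            k -= 1
        token = text[k:term_end].lstrip("'\"‘“([")
        if token.lower() in ABBREVIATIONS or DOTTED.fullmatch(token):
            return False
    m = close_end
    while m < n and text[m].isspace():
        m += 1
    # A lowercase continuation suggests the sentence has not ended.
    return not (m < n and text[m].islower())


def split_sentences(text):
    """Return (start, end, sentence) triples; end is exclusive."""
    spans, start, i, n = [], 0, 0, len(text)
    while i < n:
        if text[i] not in TERMINATORS:
            i += 1
            continue
        term_end = i + 1
        while term_end < n and text[term_end] in TERMINATORS:
            term_end += 1  # absorb runs such as '...' or '?!'
        close_end = term_end
        while close_end < n and text[close_end] in CLOSERS:
            close_end += 1  # keep closing quotes/brackets with the sentence
        if is_boundary(text, start, i, term_end, close_end):
            spans.append((start, close_end, text[start:close_end]))
            start = close_end
            while start < n and text[start].isspace():
                start += 1
        i = close_end
    end = len(text.rstrip())
    if start < end:  # trailing text without a final terminator
        spans.append((start, end, text[start:end]))
    return spans
```

Each returned triple satisfies text[start:end] == sentence, so annotations made on the sentence can be mapped back to the source string. The scan touches each character a bounded number of times, which avoids the backtracking risk of a large regular expression.
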
2) Punkt-style unsupervised

Trains from raw text to learn which tokens with periods are likely abbreviations. Uses orthographic cues (uppercase after period), token-internal periods, and frequency of period-final tokens.
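
The cues listed above can be illustrated with a deliberately simplified heuristic. This is not the actual Punkt scoring, which is statistical; the function name learn_abbreviations, the max_len and min_count thresholds, and the toy corpus are all assumptions chosen for illustration.

```python
from collections import Counter


def learn_abbreviations(text, max_len=4, min_count=2):
    """Guess abbreviations from raw text using three simple cues:
    never seen without a final period, plus either an internal period,
    a lowercase/digit continuation, or a short and repeated form."""
    tokens = text.split()
    with_period = Counter()
    without_period = Counter()
    soft_continuation = Counter()  # followed by a lowercase word or a digit
    for idx, tok in enumerate(tokens):
        if tok.endswith(".") and len(tok) > 1:
            stem = tok[:-1].lower()
            with_period[stem] += 1
            if idx + 1 < len(tokens):
                first = tokens[idx + 1][0]
                if first.islower() or first.isdigit():
                    soft_continuation[stem] += 1
        else:
            without_period[tok.strip(",;:!?'\"()").lower()] += 1

    abbreviations = set()
    for stem, count in with_period.items():
        if not any(ch.isalpha() for ch in stem):
            continue  # plain numbers such as '4.' are not abbreviations
        if without_period[stem] > 0:
            continue  # also occurs without a period: likely an ordinary word
        internal_period = "." in stem
        short_and_repeated = len(stem) <= max_len and count >= min_count
        if internal_period or soft_continuation[stem] > 0 or short_and_repeated:
            abbreviations.add(stem + ".")
    return abbreviations


corpus = (
    "Prof. Lee met Mr. Ito at 9 a.m. today. Mr. Ito showed Fig. 2 to Prof. Lee. "
    "They read the report. The report was long."
)
print(sorted(learn_abbreviations(corpus)))
```

On a corpus this small the thresholds matter a great deal; with real data you would tune them against a held-out sample rather than trust the defaults used here.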

3) Supervised sequence labeling

Predict boundary vs non-boundary for each character or token using features (chars, POS, surrounding tokens). Models: CRF, BiLSTM. Needs labeled data.
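
As a rough sketch of the data preparation step, gold sentences can be turned into one feature dictionary and one label per token. The feature names and the helpers token_features and build_examples are hypothetical; the resulting lists could be fed to a CRF or any other classifier, which is not shown here.

```python
def token_features(tokens, i):
    """Features for deciding whether tokens[i] ends a sentence."""
    tok = tokens[i]
    nxt = tokens[i + 1] if i + 1 < len(tokens) else ""
    return {
        "lower": tok.lower(),
        "ends_with_period": tok.endswith("."),
        "ends_with_strong": tok.endswith(("!", "?")),
        "has_internal_period": "." in tok[:-1],
        "length": len(tok),
        "next_is_upper": nxt[:1].isupper(),
        "next_is_digit": nxt[:1].isdigit(),
    }


def build_examples(gold_sentences):
    """Label the last token of each gold sentence 1, all others 0."""
    tokens, labels = [], []
    for sentence in gold_sentences:
        sent_tokens = sentence.split()
        if not sent_tokens:
            continue  # skip empty sentences
        tokens.extend(sent_tokens)
        labels.extend([0] * (len(sent_tokens) - 1) + [1])
    features = [token_features(tokens, i) for i in range(len(tokens))]
    return features, labels


features, labels = build_examples(
    ["Dr. Kim met U.S. officials.", "They left at 3 p.m. today."]
)
```

Note that 'Dr.' and 'officials.' share the features ends_with_period and next_is_upper but carry different labels, which is why lexical features such as "lower" and surrounding context are needed.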

4) Neural punctuation restoration (for ASR/IM/chat)

When text lacks punctuation, train a model to insert '.', '?', '!' from the token sequence and, if available, prosody. Then apply a simple rule-based split on the restored punctuation.
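
The model itself is out of scope here, but the data flow around it can be sketched. In this hypothetical example, make_training_pair derives unpunctuated tokens and per-token labels from punctuated text, and apply_labels shows the final split once labels have been predicted. The sample text contains no abbreviations; real training data would need abbreviation handling before any final '.' is treated as a sentence end.

```python
END_MARKS = ".?!"


def make_training_pair(punctuated):
    """Strip punctuation and casing; label each token with the end mark
    that followed it in the source, or 'O' for none."""
    tokens, labels = [], []
    for raw in punctuated.split():
        mark = raw[-1] if raw[-1] in END_MARKS else "O"
        word = raw.rstrip(".,?!;:").lower()
        if word:
            tokens.append(word)
            labels.append(mark)
    return tokens, labels


def apply_labels(tokens, labels):
    """Insert predicted marks, then split on them."""
    sentences, current = [], []
    for tok, lab in zip(tokens, labels):
        if lab == "O":
            current.append(tok)
        else:
            current.append(tok + lab)
            sentences.append(" ".join(current))
            current = []
    if current:  # trailing words with no predicted end mark
        sentences.append(" ".join(current))
    return sentences


tokens, labels = make_training_pair("Are you there? Yes, I am here. Call me now!")
print(apply_labels(tokens, labels))
```

Here the gold labels stand in for model predictions, so the round trip reproduces the original sentence boundaries.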

Language and domain specifics

  • English: watch abbreviations (e.g., 'etc.', 'vs.', 'Fig.'), decimals (1.2), ellipses '...'.
  • Indic scripts: danda '।' marks sentence end in Hindi and related languages.
  • Chinese/Japanese: '。', '!', '?' without spaces; quotes may be '「」' or '『』'.
  • Social/chat: emojis, short lines, multiple '!' or '???'; newlines may imply boundaries.
  • Legal/medical: numbered clauses, section headers; maintain headings as separate sentences if needed.

Implementation tips

  • Preserve character offsets for each sentence as (start_index, end_index), with start inclusive and end exclusive. This keeps annotations aligned.
  • Order of operations: normalize Unicode and whitespace first; segment; then tokenize within each sentence.
  • Keep delimiters with the sentence by default. Example: 'Hello world.' not 'Hello world' + '.'.
  • Handle quotes and brackets: if sentence ends inside quotes, include the closing quote with the sentence.
  • Newline heuristics: in emails or bullet lists, treat blank lines as potential boundaries.
  • Performance: streaming splitters scan once; avoid catastrophic regex. Test with long strings (e.g., base64 blobs) to ensure stability.
  • Evaluation: Precision/Recall/F1 on boundaries; for long texts, also Pk or WindowDiff to measure segmentation quality.
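
For the evaluation tip, boundary Precision/Recall/F1 can be computed by comparing the end offsets of gold and predicted sentences. This is a minimal sketch; the function name boundary_prf and the (start, end) span format are assumptions made for this example.

```python
def boundary_prf(gold_spans, pred_spans):
    """Precision, recall, and F1 over sentence end offsets.

    Spans are (start, end) pairs with end exclusive.
    """
    gold = {end for _, end in gold_spans}
    pred = {end for _, end in pred_spans}
    true_positives = len(gold & pred)
    precision = true_positives / len(pred) if pred else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


gold = [(0, 10), (11, 20), (21, 30)]
pred = [(0, 10), (11, 15), (16, 20), (21, 30)]  # one spurious split at 15
print(boundary_prf(gold, pred))
```

The final boundary at the end of the text is almost always predicted correctly, so you may prefer to exclude it to avoid inflating scores on short documents.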

Worked examples

Example 1 — Abbreviations and decimals

Input:

Dr. Kim met U.S. officials at 3 p.m. in Washington. They discussed budgets totaling $1.2B. 'It is urgent,' she said.

Output sentences:

  1. Dr. Kim met U.S. officials at 3 p.m. in Washington.
  2. They discussed budgets totaling $1.2B.
  3. 'It is urgent,' she said.

Notes: Do not split after 'Dr.' or 'U.S.' or 'p.m.'; do split at the period after 'Washington.' and after '$1.2B.' despite the number containing a period.

Example 2 — Quotes, dashes, ellipses

Input:

He whispered, 'Wait... do not go.' Then—silence. Finally: Go!

Output sentences:

  1. He whispered, 'Wait... do not go.'
  2. Then—silence.
  3. Finally: Go!

Notes: Treat ellipses '...' as inside-sentence punctuation; the true boundary is the closing period inside the quote.

Example 3 — Multilingual end markers

Input:

他来了。真的吗?太好了! Hindi uses the danda '।' like this: यह सही है।

Output sentences:

  1. 他来了。
  2. 真的吗?
  3. 太好了!
  4. Hindi uses the danda '।' like this: यह सही है।

Notes: Recognize '。', '?', '!', and '।' as sentence boundaries. The quoted '।' in sentence 4 is mentioned rather than used, so it is not a boundary; only the final danda ends the sentence. If mixing scripts, ensure your splitter supports Unicode punctuation.
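
For text that has no spaces between sentences, a character-class pattern is often enough as a first pass. The sketch below is an assumption-laden illustration: split_unicode and END_MARKS are hypothetical names, the pattern ignores the ASCII period entirely, and it does not handle a quoted end mark such as the danda mentioned in sentence 4 above.

```python
import re

# Ideographic full stop, fullwidth '!' and '?', danda, plus ASCII '!' and '?'.
END_MARKS = "。!?।!?"
# One run of non-marks followed by any end marks; the two classes are
# disjoint, so the pattern cannot backtrack catastrophically.
PATTERN = re.compile(rf"[^{END_MARKS}]+[{END_MARKS}]*")


def split_unicode(text):
    """Split on Unicode end marks, keeping the mark with its sentence."""
    pieces = (m.group().strip() for m in PATTERN.finditer(text))
    return [p for p in pieces if p]


print(split_unicode("他来了。真的吗?太好了!"))
print(split_unicode("यह सही है। वह आया।"))
```

For mixed Latin text you would still need the abbreviation-aware handling of '.' described earlier; this pattern only covers marks that are rarely ambiguous.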

Practice: mini tasks

  1. Mark boundaries with '|' in this text:
    I met Sam Jr. today... he was late? Yes! But only by 5 min.
  2. Decide: split or not?
    The meeting is at 9 a.m. sharp. Bring docs.
    Do not split after 'a.m.'; split after 'sharp.'.
  3. Mark boundaries for this newline-heavy note:
    Action items
    - Fix ETL.
    - Email Dr. Rao.
    Thanks.
    Treat each bullet as a sentence; keep 'Thanks.' as its own sentence.
  • [ ] I can keep delimiters with sentences
  • [ ] I can avoid splitting at common abbreviations
  • [ ] I can handle quotes and brackets correctly
  • [ ] I can preserve character offsets

Exercises

Exercise 1 — Robust rule-based splitter

Goal: Write or describe rules to segment the text below. Include start/end character offsets for each sentence and keep punctuation with the sentence.

Dr. Rao arrived at 5 p.m. on Jan. 3, 2025. He met with the A.I. team in Bldg. 2. Results were 'ok... not great.' Next steps: fix data issues.

Deliverable: list of sentences with offsets and a short note explaining which rules prevent false splits (e.g., 'p.m.', 'A.I.', 'Bldg.').

Exercise 2 — Learn abbreviations like Punkt

Goal: From this tiny 'training' text, infer which period-final tokens are abbreviations. Then segment the test paragraph accordingly.

Training text:

Meet Dr. Li and Ms. Ana at 10 a.m. Bring docs to Bldg. 4. She likes A.I. demos.

Test text to segment:

Ms. Ana spoke with Dr. Li today. She visited Bldg. 4 to see the A.I. team. They left at 6 p.m. Happy?

Deliverable: list of sentences you produce and the abbreviation set you inferred.

Common mistakes and self-check

  • Mistake: Splitting after every period. Fix: maintain an abbreviation list and use case/next-token checks.
  • Mistake: Dropping delimiters or quotes. Fix: include final punctuation and closing quotes/brackets with the sentence.
  • Mistake: Losing offsets after normalization. Fix: normalize first; compute offsets on the normalized text; keep mapping if you must revert.
  • Mistake: Ignoring non-Latin punctuation. Fix: include '।', '。', '!', '?' in your patterns.
  • Mistake: Regex catastrophic backtracking. Fix: prefer linear-time scans and anchored patterns.
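
For the offsets mistake above, one way to keep a mapping is to record, for every character of the normalized text, its index in the original. The sketch below handles whitespace collapsing only; normalize_with_map and to_original_span are hypothetical helpers, and Unicode normalization that changes character counts would need the same bookkeeping.

```python
def normalize_with_map(text):
    """Collapse whitespace runs to one space and drop leading whitespace.

    Returns (normalized, index_map) where index_map[i] is the index in
    the original text of normalized[i].
    """
    out, index_map = [], []
    prev_space = False
    for i, ch in enumerate(text):
        if ch.isspace():
            if not prev_space and out:
                out.append(" ")
                index_map.append(i)
            prev_space = True
        else:
            out.append(ch)
            index_map.append(i)
            prev_space = False
    return "".join(out), index_map


def to_original_span(span, index_map):
    """Map a non-empty (start, end) span on normalized text back to the original."""
    start, end = span
    return index_map[start], index_map[end - 1] + 1


original = "Hi  there.\n\nBye."
normalized, index_map = normalize_with_map(original)
start = normalized.index("Bye.")
orig_start, orig_end = to_original_span((start, start + 4), index_map)
print(original[orig_start:orig_end])
```

Segmenting on the normalized string and mapping spans back this way keeps annotations aligned with the text users actually see.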
Self-check routine
  • Verify sentence counts on a gold sample; compute Precision/Recall/F1.
  • Spot-check hard cases: titles, decimals, URLs, emails, emojis, ellipses.
  • Randomly sample 20 sentences; ensure delimiters and quotes are attached correctly.

Practical projects

  1. News splitter: Implement rule-based segmentation with abbreviation learning on a small news set. Report F1 on a held-out batch.
  2. ASR punctuation restoration: Train a small model to insert '.', '?', '!' on transcripts; then split sentences and evaluate.
  3. Multilingual support: Extend your splitter to handle '।' and '。', with unit tests for Hindi and Chinese examples.

Who this is for

  • NLP engineers building preprocessing pipelines.
  • Data scientists preparing text for modeling.
  • Annotators/labelers who need stable sentence spans.

Prerequisites

  • Basic string handling and regular expressions.
  • Familiarity with Unicode and tokenization concepts.
  • Optional: knowledge of sequence labeling if using ML.

Learning path

  1. Start: Implement a minimal rule-based splitter for '.', '!', '?'.
  2. Harden: Add abbreviation handling, quotes/brackets, decimals.
  3. Evaluate: Create a small gold set; compute boundary F1.
  4. Extend: Add multilingual punctuation and newline heuristics.
  5. Advance: Try Punkt-style learning or a small punctuation-restoration model.

Next steps

  • Integrate your splitter into the preprocessing pipeline before tokenization.
  • Add logging for boundary decisions to simplify debugging.
  • Prepare unit tests that cover edge cases you encountered.

Mini challenge

Segment this tricky paragraph. List your rules used:

We met Prof. Green, Jr. at 4 p.m. on Tue. The 'alpha' build (v1.2.0) crashed... twice? Yes! Fix by EOD.
Hint
  • Titles and suffixes may end with periods but are not boundaries.
  • Parentheses and version numbers contain periods.
  • Question marks and exclamation points are strong boundaries.

Exercise 1: hints and expected output

  • Avoid splitting at 'p.m.', 'A.I.', 'Bldg.'
  • Include closing quotes with the sentence.
  • Provide a brief explanation of your negative lookbehinds or abbreviation checks.

Expected output for the Exercise 1 text:

[(0, 42, 'Dr. Rao arrived at 5 p.m. on Jan. 3, 2025.'), (43, 80, 'He met with the A.I. team in Bldg. 2.'), (81, 112, "Results were 'ok... not great.'"), (113, 141, 'Next steps: fix data issues.')]

Offsets are inclusive-exclusive on the original string.
