
Chunking Strategies

Learn Chunking Strategies for free with explanations, exercises, and a quick test (aimed at NLP engineers).

Published: January 5, 2026 | Updated: January 5, 2026

Why this matters

Chunking is how you split long text into pieces before creating embeddings and retrieving context for RAG. Good chunks keep meaning intact and boost recall; bad chunks break context and cause hallucinations. As an NLP Engineer, you will:

  • Index large docs (policies, manuals, code) for semantic search.
  • Build RAG pipelines that must fit model context windows.
  • Improve retrieval metrics (recall@k, MRR, answer accuracy) via better chunking.
  • Reduce hallucinations by preserving local context and structure.

Concept explained simply

Chunking = cutting a document into small, meaningful pieces that are big enough to carry context but small enough to be retrieved precisely.

Mental model

Imagine a book as a jigsaw puzzle. Each piece (chunk) should contain a recognisable part of the picture (section idea). Pieces too tiny lose meaning; pieces too large don’t fit where you need them (context window) and are hard to match in search.

Key terms
  • Chunk size: length per chunk (in tokens or characters).
  • Overlap: how many tokens are repeated between adjacent chunks to keep context continuity.
  • Splitter: the rule to split (by headers, paragraphs, sentences, code blocks, semantic similarity).
  • Stride: effective step size = chunk_size − overlap.
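The stride relationship above can be sketched as a minimal token-window splitter; plain Python lists stand in for real tokenizer output:

```python
def chunk_tokens(tokens, chunk_size, overlap):
    """Split a token list into overlapping windows.

    stride = chunk_size - overlap: each new chunk starts `stride`
    tokens after the previous one, so `overlap` tokens repeat
    between adjacent chunks.
    """
    if overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk_size")
    stride = chunk_size - overlap
    chunks = []
    for start in range(0, len(tokens), stride):
        chunks.append(tokens[start:start + chunk_size])
        if start + chunk_size >= len(tokens):
            break  # last window already reaches the end
    return chunks

# 10 tokens, chunk_size=4, overlap=1 -> stride=3
print(chunk_tokens(list(range(10)), chunk_size=4, overlap=1))
# -> [[0, 1, 2, 3], [3, 4, 5, 6], [6, 7, 8, 9]]
```

Note how token 3 and token 6 appear in two chunks each: that repetition is exactly the overlap.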

Practical heuristics that work

  • Start with chunk_size 200–400 tokens for dense retrieval; 400–800 for generative answers that need richer context.
  • Use overlap 10–20% of chunk_size. More only if sentences are long or references cross boundaries.
  • Prefer structure-aware splitters first: headers, lists, code blocks, tables. Then refine with sentence-level or token windows.
  • Keep metadata: source doc ID, section title, headings path, page number, code filename/function. Retrieval benefits from both content and metadata filtering.
  • Different content, different splitters:
    • Markdown/HTML/docs: header → paragraph → sentence → token window if needed.
    • Legal/long narrative: paragraph → sentence; larger overlap (15–25%).
    • Code: file → class/function → sliding token window around definitions.
    • Tables/bullets: keep rows/lists intact when possible.
  • Measure and iterate: evaluate retrieval (recall@k, answer accuracy). Adjust chunk_size/overlap only after testing.
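A minimal sketch of the structure-first idea, assuming Markdown input: split at headings before any finer pass. The regex and function name here are illustrative, not from a specific library:

```python
import re

def split_by_headers(markdown_text):
    """First-pass structural split: break a Markdown document at
    headings so each chunk stays inside one section. Finer splitting
    (paragraph / sentence / token window) would run on each piece next."""
    # Split before any line that starts with 1-6 '#' characters.
    parts = re.split(r"(?m)^(?=#{1,6} )", markdown_text)
    return [p.strip() for p in parts if p.strip()]

doc = "# Setup\nInstall the app.\n\n## Troubleshooting\nCheck logs."
print(split_by_headers(doc))
# -> ['# Setup\nInstall the app.', '## Troubleshooting\nCheck logs.']
```

Because the split point is a lookahead, each heading stays attached to the section it introduces.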
Rule-of-thumb cheatsheet
  • Short FAQs: 120–200 tokens, 10% overlap.
  • Knowledge bases: 256–384 tokens, 15% overlap.
  • Policy/legal: 300–500 tokens, 20% overlap.
  • Code: by function + 150–300 token windows, 10–20% overlap.

Worked examples

Example 1: Customer support article (Markdown)

Content: H1 Product Setup → H2 Install → steps; H2 Troubleshooting → bullet lists.

  • Splitter: header-aware → paragraph → sentence fallback.
  • Chunk size: 300 tokens; overlap: 45 tokens (15%).
  • Reasoning: tasks are step-based; moderate overlap preserves cross-step context.
  • Metadata: {doc_id, h1, h2, anchors_present:boolean}
Result shape
  • Chunk 1: Setup/Install intro + steps 1–3
  • Chunk 2: steps 3–6 (overlap keeps step 3)
  • Chunk 3: Troubleshooting overview + first bullet group
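The metadata side of this example can be sketched with plain dicts; the field values (doc_id "kb-42" and so on) are made up for illustration:

```python
# Hypothetical chunk records for the support article; the field names
# (doc_id, h1, h2) mirror the metadata schema suggested above.
chunks = [
    {"text": "Setup/Install intro + steps 1-3", "doc_id": "kb-42",
     "h1": "Product Setup", "h2": "Install"},
    {"text": "steps 3-6", "doc_id": "kb-42",
     "h1": "Product Setup", "h2": "Install"},
    {"text": "Troubleshooting overview + first bullet group", "doc_id": "kb-42",
     "h1": "Product Setup", "h2": "Troubleshooting"},
]

# Metadata filtering before (or alongside) vector search:
install_chunks = [c for c in chunks if c["h2"] == "Install"]
print(len(install_chunks))  # -> 2
```

Most vector stores accept per-chunk metadata like this and let you filter on it at query time, which narrows the search space before similarity ranking.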

Example 2: Legal policy (long narrative)

  • Splitter: paragraph → sentence. Avoid breaking citations mid-sentence.
  • Chunk size: 400 tokens; overlap: 80 tokens (20%).
  • Reasoning: long sentences and references; higher overlap reduces context loss.
  • Metadata: {doc_id, section_number, clause_ids}
Result shape
  • Chunk 1: Section 1 intro + clauses 1.1–1.2
  • Chunk 2: clauses 1.2–1.4 (overlap with 1.2)
  • Chunk 3: Section 2 intro, etc.

Example 3: Codebase docs + source files

  • Splitter: file → class/function → sliding token window.
  • Chunk size: 220 tokens; overlap: 30 tokens (≈14%).
  • Reasoning: function-level granularity improves pinpointing where logic resides.
  • Metadata: {repo, path, language, symbol_name}
Result shape
  • Chunk 1: README overview section
  • Chunk 2: function process_order() signature + body
  • Chunk 3: process_order() body continued (overlap around boundary)
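A minimal sketch of function-level code chunking using Python's standard ast module (the record fields mirror the metadata above; end_lineno requires Python 3.8+):

```python
import ast

def function_chunks(source, path="example.py"):
    """Split Python source at function boundaries.

    Each chunk is one function's full text plus metadata
    (file path, symbol name) for filtering and ranking."""
    tree = ast.parse(source)
    lines = source.splitlines()
    chunks = []
    for node in ast.walk(tree):
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)):
            body = "\n".join(lines[node.lineno - 1:node.end_lineno])
            chunks.append({"path": path, "symbol_name": node.name, "text": body})
    return chunks

src = """\
def process_order(o):
    return o

def cancel_order(o):
    return None
"""
print([c["symbol_name"] for c in function_chunks(src)])
# -> ['process_order', 'cancel_order']
```

In practice you would still apply a sliding token window to functions longer than your chunk budget, as the splitter order above suggests.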

Step-by-step: design a chunking plan

  1. Identify structure: Does the doc have headings, lists, code blocks, tables?
  2. Choose a primary splitter: header/paragraph/sentence/function.
  3. Set initial chunk_size and overlap based on content type and model context budget.
  4. Attach metadata: IDs, titles, paths, section markers.
  5. Create a tiny index; run test queries; measure recall@k and answer accuracy.
  6. Adjust: If recall is low, shrink chunks; if answers lack context, enlarge chunks or increase overlap.
Checklist
  • [ ] I picked a structure-aware splitter.
  • [ ] chunk_size fits my model’s context window plan.
  • [ ] Overlap 10–20% unless justified.
  • [ ] Metadata includes section hierarchy or symbol names.
  • [ ] I evaluated with representative queries.
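Step 5's recall@k can be computed in a few lines; the chunk IDs below are made up for illustration:

```python
def recall_at_k(retrieved_ids, relevant_ids, k):
    """Fraction of queries where at least one relevant chunk
    appears in the top-k retrieved results."""
    hits = 0
    for ret, rel in zip(retrieved_ids, relevant_ids):
        if any(r in rel for r in ret[:k]):
            hits += 1
    return hits / len(retrieved_ids)

# Two test queries: the first hits within top-2, the second misses.
retrieved = [["c1", "c7", "c3"], ["c9", "c2", "c5"]]
relevant = [{"c7"}, {"c4"}]
print(recall_at_k(retrieved, relevant, k=2))  # -> 0.5
```

Run this over a representative query set after each chunking change; a drop in recall@k tells you the adjustment hurt retrieval before you ever look at generated answers.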

Common mistakes and self-check

  • Chunks too tiny: embeddings miss semantics. Self-check: are many chunks under 2 sentences? Increase size.
  • No overlap across long sentences: context breaks. Self-check: scan boundaries; do cross-references get cut?
  • Ignoring structure: breaking tables/code mid-row/function. Self-check: ensure atomic units stay intact.
  • Overly large chunks: reduced precision and wasted tokens. Self-check: is the top-1 chunk often irrelevant? Reduce size.
  • No metadata: hard to filter or rank. Self-check: can you filter by section or file quickly?
Quick boundary test

Pick 3 random boundaries and read 2–3 sentences around them. If a reference is incomplete, increase overlap or adjust split rules.
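A minimal sketch of this boundary spot-check, assuming chunks are plain strings (the 80-character context width is arbitrary):

```python
import random

def boundary_samples(chunks, n=3, seed=0):
    """Pick n random adjacent-chunk boundaries and return the text
    around each, so a human can check for cut references."""
    rng = random.Random(seed)
    n = min(n, len(chunks) - 1)
    samples = []
    for i in sorted(rng.sample(range(len(chunks) - 1), n)):
        left = chunks[i][-80:]       # end of the earlier chunk
        right = chunks[i + 1][:80]   # start of the later chunk
        samples.append(f"[{i}|{i + 1}] ...{left} || {right}...")
    return samples

for line in boundary_samples(["Chunk one ends here.",
                              "Chunk two starts here.",
                              "Chunk three."]):
    print(line)
```

Reading a handful of these side-by-side views is usually enough to spot cross-references that overlap or split rules have cut.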

Exercises

These mirror the tasks below. Do them here, then check your solutions against the exercise blocks. Tip: estimating tokens as words × 1.3 is fine for practice.

  1. Design a chunking plan for a short knowledge base article with headers and bullet lists. Provide chunk_size, overlap, split rules, and example chunks with metadata.
  2. Given document length and target overlap percentage, compute stride and number of chunks. Verify top-k token budget for RAG.
Checklist
  • [ ] I wrote explicit split rules in order of application.
  • [ ] I justified chunk_size and overlap.
  • [ ] I showed example chunks and metadata fields.
  • [ ] I computed stride and number of chunks.
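For exercise 2, stride and chunk count follow directly from the definitions; a sketch (the rounding choices are one reasonable convention, not the only one):

```python
import math

def chunk_count(doc_tokens, chunk_size, overlap_pct):
    """Return (stride, number of chunks) for a document of
    `doc_tokens` tokens with a given overlap percentage."""
    overlap = round(chunk_size * overlap_pct)
    stride = chunk_size - overlap
    if doc_tokens <= chunk_size:
        return stride, 1
    # The first chunk covers chunk_size tokens; each later chunk
    # extends coverage by `stride` tokens.
    n = 1 + math.ceil((doc_tokens - chunk_size) / stride)
    return stride, n

# 2000-token doc, 300-token chunks, 15% overlap:
# overlap = 45, stride = 255
print(chunk_count(2000, 300, 0.15))  # -> (255, 8)
```

To verify a top-k token budget, multiply chunk_size by k: here, top-4 retrieval costs at most 4 × 300 = 1200 tokens of context.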

Practical projects

  • Index a policy manual (≥20 pages). Compare three setups: 256/15%, 384/15%, 384/25%. Report recall@5 and answer accuracy over 15 queries.
  • Build a code search demo: function-level chunks vs sliding windows only. Measure MRR@10.
  • Create a notebook to visualize chunk boundaries and overlaps on a sample doc. Highlight broken sentences or code blocks.

Mini challenge

You have an 8k-token model for generation and a 4k-token embedding model. Your docs are mostly long paragraphs with occasional tables. Propose chunking parameters and split rules that keep tables intact and ensure the top-4 retrieved chunks fit within 1.6k generation tokens. State your chunk_size, overlap, and why.

Who this is for

  • NLP engineers building RAG/search over long documents.
  • Data scientists improving retrieval quality.
  • ML engineers integrating embeddings into apps.

Prerequisites

  • Basic understanding of embeddings and vector search.
  • Familiarity with tokens vs words and model context limits.

Learning path

  1. Embeddings basics
  2. Retrieval and ranking
  3. Chunking strategies (this lesson)
  4. Evaluation and diagnostics
  5. RAG orchestration

Next steps

  • Run the exercises, then take the quick test below.
  • Iterate on your own document set and record metrics.

Practice Exercises


Instructions

Consider this simplified article outline:

  • H1: Device Setup
  • H2: Install App
    • Steps 1–6
  • H2: Connect Device
    • Bluetooth pairing tips (bullets)
  • H2: Troubleshooting
    • Common errors E01–E05 with short explanations

Task:

  • Propose split rules in order (e.g., header-aware → paragraph → sentence).
  • Choose chunk_size and overlap with reasoning.
  • Show 3 example chunks with brief content description.
  • List metadata fields to attach.
Expected Output
A clear plan including ordered split rules, chosen chunk_size and overlap with justification, three example chunk descriptions, and a metadata schema.

Chunking Strategies — Quick Test

Test your knowledge with 10 questions. Pass with 70% or higher.

