Why this matters
Chunking is how you split long text into pieces before creating embeddings and retrieving context for RAG. Good chunks keep meaning intact and boost recall; bad chunks break context and cause hallucinations. As an NLP Engineer, you will:
- Index large docs (policies, manuals, code) for semantic search.
- Build RAG pipelines that must fit model context windows.
- Improve retrieval metrics (recall@k, MRR, answer accuracy) via better chunking.
- Reduce hallucinations by preserving local context and structure.
Concept explained simply
Chunking = cutting a document into small, meaningful pieces that are big enough to carry context but small enough to be retrieved precisely.
Mental model
Imagine a book as a jigsaw puzzle. Each piece (chunk) should contain a recognisable part of the picture (section idea). Pieces too tiny lose meaning; pieces too large don't fit where you need them (context window) and are hard to match in search.
Key terms
- Chunk size: length per chunk (in tokens or characters).
- Overlap: how many tokens are repeated between adjacent chunks to keep context continuity.
- Splitter: the rule to split (by headers, paragraphs, sentences, code blocks, semantic similarity).
- Stride: effective step size = chunk_size - overlap.
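To make these terms concrete, here is a minimal token-window splitter in Python. It uses a whitespace split as a stand-in for a real tokenizer, and the default sizes are illustrative, not prescriptive:

```python
def sliding_chunks(text: str, chunk_size: int = 300, overlap: int = 45) -> list[str]:
    """Split text into overlapping token windows.

    A whitespace split approximates real tokenization here; swap in a
    proper tokenizer for production use.
    """
    tokens = text.split()
    stride = chunk_size - overlap  # effective step size
    chunks = []
    for start in range(0, len(tokens), stride):
        chunks.append(" ".join(tokens[start:start + chunk_size]))
        if start + chunk_size >= len(tokens):
            break  # last window already covers the tail
    return chunks
```

With chunk_size=300 and overlap=45, each new chunk starts 255 tokens after the previous one, so 45 tokens repeat across every boundary.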
Practical heuristics that work
- Start with chunk_size 200–400 tokens for dense retrieval; 400–800 for generative answers that need richer context.
- Use an overlap of 10–20% of chunk_size. Increase it only if sentences are long or references cross boundaries.
- Prefer structure-aware splitters first: headers, lists, code blocks, tables. Then refine with sentence-level or token windows (a sketch follows this list).
- Keep metadata: source doc ID, section title, headings path, page number, code filename/function. Retrieval benefits from both content and metadata filtering.
- Different content, different splitters:
  - Markdown/HTML/docs: header → paragraph → sentence → token window if needed.
  - Legal/long narrative: paragraph → sentence; larger overlap (15–25%).
  - Code: file → class/function → sliding token window around definitions.
  - Tables/bullets: keep rows/lists intact when possible.
- Measure and iterate: evaluate retrieval (recall@k, answer accuracy). Adjust chunk_size/overlap only after testing.
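Here is one way to express the "structure first, then refine" rule from the list above: split on Markdown headers, and only fall back to token windows when a section is too big. The header regex and the thresholds are illustrative assumptions:

```python
import re

def split_markdown(doc: str, max_tokens: int = 350, overlap: int = 50) -> list[str]:
    """Cut at Markdown headers first; refine oversized sections with
    overlapping token windows (whitespace tokens for illustration)."""
    sections = re.split(r"(?m)^(?=#{1,6} )", doc)  # keep each header with its body
    chunks = []
    for section in sections:
        tokens = section.split()
        if not tokens:
            continue
        if len(tokens) <= max_tokens:
            chunks.append(section.strip())
        else:
            stride = max_tokens - overlap
            for start in range(0, len(tokens), stride):
                chunks.append(" ".join(tokens[start:start + max_tokens]))
                if start + max_tokens >= len(tokens):
                    break
    return chunks
```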
Rule-of-thumb cheatsheet
- Short FAQs: 120–200 tokens, 10% overlap.
- Knowledge bases: 256–384 tokens, 15% overlap.
- Policy/legal: 300–500 tokens, 20% overlap.
- Code: by function + 150–300 token windows, 10–20% overlap.
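The cheatsheet translates directly into a small configuration table you can version and tune per corpus; the keys and mid-range defaults below are illustrative choices, not fixed recommendations:

```python
# Mid-range defaults from the cheatsheet above; tune after measuring.
CHUNK_PRESETS = {
    "faq":   {"chunk_size": 160, "overlap_pct": 0.10},
    "kb":    {"chunk_size": 320, "overlap_pct": 0.15},
    "legal": {"chunk_size": 400, "overlap_pct": 0.20},
    "code":  {"chunk_size": 220, "overlap_pct": 0.15},
}

def overlap_tokens(content_type: str) -> int:
    """Absolute overlap in tokens for a given content type."""
    preset = CHUNK_PRESETS[content_type]
    return round(preset["chunk_size"] * preset["overlap_pct"])
```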
Worked examples
Example 1: Customer support article (Markdown)
Content: H1 Product Setup → H2 Install → steps; H2 Troubleshooting → bullet lists.
- Splitter: header-aware → paragraph → sentence fallback.
- Chunk size: 300 tokens; overlap: 45 tokens (15%).
- Reasoning: tasks are step-based; moderate overlap preserves cross-step context.
- Metadata: {doc_id, h1, h2, anchors_present:boolean}
Result shape
- Chunk 1: Setup/Install intro + steps 1–3
- Chunk 2: steps 3–6 (overlap keeps step 3)
- Chunk 3: Troubleshooting overview + first bullet group
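A sketch of how Example 1's chunks could carry their heading path as metadata. The field names mirror the metadata line above; the splitting logic is a simplified assumption (headers up to H2 only, no token-window fallback):

```python
import re

def chunk_with_metadata(doc_id: str, markdown: str) -> list[dict]:
    """Header-aware split that attaches the current H1/H2 to each chunk."""
    records, h1, h2 = [], None, None
    for section in re.split(r"(?m)^(?=#{1,2} )", markdown):
        if not section.strip():
            continue
        first_line = section.splitlines()[0]
        if first_line.startswith("## "):
            h2 = first_line[3:].strip()
        elif first_line.startswith("# "):
            h1, h2 = first_line[2:].strip(), None
        records.append({"doc_id": doc_id, "h1": h1, "h2": h2,
                        "text": section.strip()})
    return records
```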
Example 2: Legal policy (long narrative)
- Splitter: paragraph → sentence. Avoid breaking citations mid-sentence.
- Chunk size: 400 tokens; overlap: 80 tokens (20%).
- Reasoning: long sentences and references; higher overlap reduces context loss.
- Metadata: {doc_id, section_number, clause_ids}
Result shape
- Chunk 1: Section 1 intro + clauses 1.1–1.2
- Chunk 2: clauses 1.2–1.4 (overlap with 1.2)
- Chunk 3: Section 2 intro, etc.
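For Example 2, a sentence-packing splitter keeps citations whole by cutting only at sentence ends. The regex below is a naive segmenter used for illustration; a real pipeline would use a proper sentence splitter:

```python
import re

def sentence_chunks(text: str, max_tokens: int = 400,
                    overlap_sentences: int = 2) -> list[str]:
    """Pack whole sentences up to max_tokens, repeating the last few
    sentences of each chunk at the start of the next one."""
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    chunks, current, fresh = [], [], 0
    for sentence in sentences:
        current.append(sentence)
        fresh += 1
        if len(" ".join(current).split()) >= max_tokens:
            chunks.append(" ".join(current))
            current = current[-overlap_sentences:]  # carry context forward
            fresh = 0
    if fresh:  # flush the tail only if it holds unseen sentences
        chunks.append(" ".join(current))
    return chunks
```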
Example 3: Codebase docs + source files
- Splitter: file → class/function → sliding token window.
- Chunk size: 220 tokens; overlap: 30 tokens (≈14%).
- Reasoning: function-level granularity makes it easier to pinpoint where logic resides.
- Metadata: {repo, path, language, symbol_name}
Result shape
- Chunk 1: README overview section
- Chunk 2: function process_order() signature + body
- Chunk 3: process_order() body continued (overlap around boundary)
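For Python sources, the standard-library ast module can produce the function-level chunks from Example 3; other languages would need a parser such as tree-sitter. The metadata keys mirror the example above:

```python
import ast

def function_chunks(source: str, repo: str, path: str) -> list[dict]:
    """Emit one chunk per top-level function or class, with symbol metadata."""
    lines = source.splitlines()
    chunks = []
    for node in ast.parse(source).body:
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef, ast.ClassDef)):
            body = "\n".join(lines[node.lineno - 1:node.end_lineno])
            chunks.append({"repo": repo, "path": path, "language": "python",
                           "symbol_name": node.name, "text": body})
    return chunks
```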
Step-by-step: design a chunking plan
- Identify structure: Does the doc have headings, lists, code blocks, tables?
- Choose a primary splitter: header/paragraph/sentence/function.
- Set initial chunk_size and overlap based on content type and model context budget.
- Attach metadata: IDs, titles, paths, section markers.
- Create a tiny index; run test queries; measure recall@k and answer accuracy (a minimal harness follows this list).
- Adjust: If recall is low, shrink chunks; if answers lack context, enlarge chunks or increase overlap.
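A minimal recall@k harness for step 5. It assumes you already have a search(query, k) function returning ranked chunk ids and a small labeled query set; both are placeholders, not a specific library's API:

```python
def recall_at_k(search, labeled_queries, k: int = 5) -> float:
    """labeled_queries: list of (query, set_of_relevant_chunk_ids).
    Counts a query as a hit if any relevant chunk appears in the top k."""
    hits = sum(
        1 for query, relevant in labeled_queries
        if set(search(query, k)) & relevant
    )
    return hits / len(labeled_queries)
```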
Design checklist
- [ ] I picked a structure-aware splitter.
- [ ] chunk_size fits my model's context window plan.
- [ ] Overlap is 10–20% unless justified.
- [ ] Metadata includes section hierarchy or symbol names.
- [ ] I evaluated with representative queries.
Common mistakes and self-check
- Chunks too tiny: embeddings miss semantics. Self-check: are many chunks under 2 sentences? Increase size.
- No overlap across long sentences: context breaks. Self-check: scan boundaries; do cross-references get cut?
- Ignoring structure: breaking tables/code mid-row/function. Self-check: ensure atomic units stay intact.
- Overly large chunks: reduced precision and wasted tokens. Self-check: top-1 chunk is often irrelevant? Reduce size.
- No metadata: hard to filter or rank. Self-check: can you filter by section or file quickly?
Quick boundary test
Pick 3 random boundaries and read 2β3 sentences around them. If a reference is incomplete, increase overlap or adjust split rules.
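The boundary test is easy to automate. This sketch assumes you have an ordered list of chunk strings and simply prints the text on both sides of a few random boundaries:

```python
import random

def spot_check_boundaries(chunks: list[str], samples: int = 3,
                          context_chars: int = 200) -> None:
    """Print the text around random chunk boundaries so broken
    references or cut sentences are easy to spot by eye."""
    boundaries = range(len(chunks) - 1)
    for i in random.sample(boundaries, k=min(samples, len(boundaries))):
        print(f"--- boundary between chunk {i} and chunk {i + 1} ---")
        print("...", chunks[i][-context_chars:])
        print(chunks[i + 1][:context_chars], "...")
```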
Exercises
These mirror the tasks below. Do them here, then check the solutions in the exercise blocks. Tip: estimating tokens as words × 1.3 is fine for practice.
- Design a chunking plan for a short knowledge base article with headers and bullet lists. Provide chunk_size, overlap, split rules, and example chunks with metadata.
- Given document length and target overlap percentage, compute stride and number of chunks. Verify top-k token budget for RAG.
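For the second exercise, the arithmetic looks like this; the document length and parameters below are invented for the worked example:

```python
import math

# Invented example: a 2,600-token document, 320-token chunks, 15% overlap.
doc_tokens, chunk_size = 2600, 320
overlap = round(chunk_size * 0.15)   # 48 tokens
stride = chunk_size - overlap        # 272 tokens
num_chunks = math.ceil((doc_tokens - chunk_size) / stride) + 1  # 10

# Top-k token budget check for RAG: retrieved chunks must fit the prompt.
k, prompt_budget = 5, 2000
print(f"stride={stride}, chunks={num_chunks}, "
      f"top-{k} cost={k * chunk_size} tokens, "
      f"fits={k * chunk_size <= prompt_budget}")
```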
Exercise checklist
- [ ] I wrote explicit split rules in order of application.
- [ ] I justified chunk_size and overlap.
- [ ] I showed example chunks and metadata fields.
- [ ] I computed stride and number of chunks.
Practical projects
- Index a policy manual (≥20 pages). Compare three setups: 256/15%, 384/15%, 384/25%. Report recall@5 and answer accuracy over 15 queries.
- Build a code search demo: function-level chunks vs sliding windows only. Measure MRR@10 (a scoring sketch follows this list).
- Create a notebook to visualize chunk boundaries and overlaps on a sample doc. Highlight broken sentences or code blocks.
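For the code search project, MRR@10 rewards ranking the first relevant chunk high. As with the recall harness earlier, search and the labeled set are placeholders:

```python
def mrr_at_k(search, labeled_queries, k: int = 10) -> float:
    """Mean reciprocal rank of the first relevant hit within the top k."""
    total = 0.0
    for query, relevant in labeled_queries:
        for rank, chunk_id in enumerate(search(query, k), start=1):
            if chunk_id in relevant:
                total += 1.0 / rank
                break
    return total / len(labeled_queries)
```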
Mini challenge
You have an 8k-token generation model and a 4k-token embedding model. Your docs are mostly long paragraphs with occasional tables. Propose chunking parameters and split rules that keep tables intact and ensure the top-4 retrieved chunks fit within 1.6k generation tokens. State your chunk_size and overlap, and justify them.
Who this is for
- NLP engineers building RAG/search over long documents.
- Data scientists improving retrieval quality.
- ML engineers integrating embeddings into apps.
Prerequisites
- Basic understanding of embeddings and vector search.
- Familiarity with tokens vs words and model context limits.
Learning path
- Embeddings basics
- Retrieval and ranking
- Chunking strategies (this lesson)
- Evaluation and diagnostics
- RAG orchestration
Next steps
- Run the exercises, then take the quick test below.
- Iterate on your own document set and record metrics.