Why this matters
Large Language Models are limited by a context window (the maximum number of tokens they can read at once). As an NLP Engineer, you must fit instructions, tools, chat history, and retrieved knowledge into that window without losing crucial information. Good context management boosts accuracy, cuts costs, and prevents truncation errors.
- Real task: Build a RAG chatbot that answers from a 500-page manual without exceeding an 8k–32k token limit.
- Real task: Summarize a 200-page policy while preserving definitions and exceptions.
- Real task: Build a code assistant that pulls the right files, compresses them, and cites exact lines.
Concept explained simply
The context window is the model's working memory. You must choose what to include now and what to leave out. Think like a luggage packer with weight limits: prioritize essentials, compress clothes, and label pockets.
- Token: A chunk of text (word pieces, punctuation). Limits are per-token.
- Context window: Maximum tokens the model sees at once (prompt + response).
- Budgeting: Reserving tokens for system/instructions/tools/history/docs/answer.
- Chunking: Cutting documents into retrieval-friendly pieces with overlap to keep coherence.
- Compression: Summarizing, selecting, or rewriting content to fit.
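Because every budgeting decision is made in tokens, it helps to measure sections directly. Below is a minimal counting sketch using the tiktoken library with an OpenAI-style encoding; other model families use different tokenizers, so treat the counts as approximations, and the example texts are purely illustrative.

```python
# Minimal token-counting sketch using tiktoken (an OpenAI-style tokenizer).
# Other model families tokenize differently, so treat counts as approximations.
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")

def count_tokens(text: str) -> int:
    """Number of tokens this text occupies in the context window."""
    return len(enc.encode(text))

# Illustrative prompt sections
sections = {
    "system": "You are a support assistant. Cite sources as [doc:section].",
    "question": "Why does error E042 appear during a firmware update?",
}
for name, text in sections.items():
    print(f"{name}: {count_tokens(text)} tokens")
```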
Mental model in one sentence
Context = Instruction clarity + Minimal history + Only the most relevant knowledge, compressed to fit.
A practical 5-step context budget
Step 1 – Know your constraints
- Window size (e.g., 8k, 32k, or 128k tokens).
- What must be preserved: constraints, definitions, citations, code lines.
- Latency and cost targets.
Step 2 – Allocate the budget (typical starting splits)
- System + tools: 10–25%
- Task instructions: 5–15%
- Chat history: 10–20% (summarize aggressively)
- Retrieved docs: 30–60%
- Answer allowance: 10–20%
Step 3 – Chunk and structure the knowledge
- Chunk prose at ~300–800 tokens; overlap 10–20%.
- For code: smaller chunks (80–200 tokens) around functions/classes; link to file summaries.
- Build hierarchical views: global summary → section summary → snippet.
Step 4 – Assemble the prompt
- Put instructions before knowledge. Ask for citations and uncertainty handling.
- Include only the top-k most relevant chunks; rerank by similarity + recency + metadata.
- Compress docs into bullet points that directly answer the query.
Step 5 – Monitor and adapt
- Track token usage per section and truncation events.
- If near the limit: reduce k → compress docs → summarize history → shorten instructions (see the budgeting sketch after this list).
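The sketch below turns these steps into code: it converts the allocation percentages into absolute budgets and flags sections that overflow. The share values, section names, and usage numbers are illustrative assumptions, not fixed recommendations.

```python
# Budget allocation and overflow check (illustrative shares; tune per application).

WINDOW = 8_000  # total context window in tokens

SHARES = {
    "system_tools": 0.15,
    "instructions": 0.10,
    "history": 0.15,
    "retrieved_docs": 0.45,
    "answer": 0.15,
}

def allocate(window: int, shares: dict) -> dict:
    """Convert fractional shares into absolute token budgets."""
    return {name: int(window * share) for name, share in shares.items()}

def over_budget(usage: dict, budgets: dict) -> list:
    """Sections whose measured token usage exceeds their budget."""
    return [name for name, used in usage.items() if used > budgets.get(name, 0)]

budgets = allocate(WINDOW, SHARES)
usage = {"system_tools": 1100, "instructions": 700, "history": 900,
         "retrieved_docs": 4300, "answer": 0}  # example measurements

overflow = over_budget(usage, budgets)
if overflow:
    # Apply the fallback order: reduce k -> compress docs -> summarize history
    # -> shorten instructions, re-measuring after each step.
    print("Over budget:", overflow)
```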
Worked examples
Example 1 – Customer support RAG (8k window)
- Budget: system/tools 1.2k, instructions 0.6k, history 0.6k, docs 4k, answer 1.6k.
- Chunking: 500-token chunks, 15% overlap. Top-k = 3–5 after rerank.
- Compression: Turn chunks into bullet evidence first; keep error codes and parameter values verbatim.
- Fallback: If tokens > 4k for docs, compress to 2–3 bullets per chunk.
Why these numbers?
Support answers need space for citations and steps. 500-token chunks balance coherence and retrieval precision. Overlap preserves cross-sentence context.
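One way to implement the "bullet evidence first" step from Example 1 is a small compression pass over each retrieved chunk before it enters the prompt. The prompt wording and the call_llm callable below are assumptions standing in for whatever client your pipeline uses.

```python
# Answer-targeted compression for retrieved chunks (sketch).
# `call_llm` stands in for whatever chat/completion client you use.

COMPRESS_PROMPT = (
    "Compress the excerpt below into at most 3 bullets that help answer the question.\n"
    "Keep error codes and parameter values verbatim.\n"
    "End each bullet with its citation (doc, section).\n\n"
    "Question: {question}\nCitation: {citation}\nExcerpt:\n{chunk}"
)

def compress_chunk(chunk: str, question: str, citation: str, call_llm) -> str:
    """Return 2-3 evidence bullets for one chunk, citations included."""
    return call_llm(
        COMPRESS_PROMPT.format(question=question, citation=citation, chunk=chunk)
    )
```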
Example 2 – Code assistant (32k window)
- Budget: system/tools 3k, instructions 1k, history 2k, docs 18k, answer 8k.
- Strategy: For a large repo, take a file-level summary (200–400 tokens/file) + exact function snippets (80–150 tokens each).
- Retrieval: Embed functions and docstrings; rerank by filename match + symbol reference.
- Compression: Preserve signatures and error messages exactly; summarize comments.
Edge case: huge files
Use hierarchical retrieval: project summary → file summary → function snippet (sketched below). Don't include entire files; target the symbols mentioned in the user query.
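The sketch below shows that hierarchy in code. The data shapes and the score(query, text) similarity callable are assumptions you would replace with your own embedding search and reranker.

```python
# Hierarchical retrieval: project summary -> file summary -> function snippet.
# `score(query, text)` is assumed to return a similarity in [0, 1].
from dataclasses import dataclass

@dataclass
class FunctionSnippet:
    file: str
    name: str
    code: str        # 80-150 token slice around one function or class

@dataclass
class FileEntry:
    path: str
    summary: str     # 200-400 token summary of the file
    functions: list  # list[FunctionSnippet]

def retrieve_context(query: str, repo_summary: str, files: list, score,
                     max_files: int = 3, max_snippets: int = 5) -> dict:
    """Keep the repo summary, the best-matching file summaries, and only the
    function snippets that actually match the query."""
    top_files = sorted(files, key=lambda f: score(query, f.summary),
                       reverse=True)[:max_files]
    snippets = sorted(
        (fn for f in top_files for fn in f.functions),
        key=lambda fn: score(query, fn.name + "\n" + fn.code),
        reverse=True,
    )[:max_snippets]
    return {
        "repo_summary": repo_summary,
        "file_summaries": [f.summary for f in top_files],
        "snippets": snippets,
    }
```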
Example 3 – Policy Q&A (128k window)
- Even with a large window, aim to keep docs ≤ 60% of budget for robustness.
- Map-reduce: Summarize sections locally, then merge into targeted notes for the query.
- Keep definitions and exceptions verbatim; compress examples.
- Ask the model to list unresolved ambiguities explicitly.
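A minimal map-reduce sketch for this case follows, assuming plain-text sections and a call_llm placeholder for your model client; the prompt wording is illustrative.

```python
# Map-reduce summarization for long policy documents (sketch).
# `call_llm` is a placeholder for your model client.

def map_section(section: str, query: str, call_llm) -> str:
    """Summarize one section against the query, quoting definitions/exceptions verbatim."""
    return call_llm(
        "Extract only the points relevant to the question below.\n"
        "Quote definitions and exceptions verbatim; compress examples.\n\n"
        f"Question: {query}\n\nSection:\n{section}"
    )

def reduce_notes(notes: list, query: str, call_llm) -> str:
    """Merge per-section notes into one answer-ready brief, flagging ambiguities."""
    return call_llm(
        "Merge these notes into a brief that answers the question, and list any "
        f"unresolved ambiguities explicitly.\n\nQuestion: {query}\n\n" + "\n\n".join(notes)
    )

def map_reduce(sections: list, query: str, call_llm) -> str:
    notes = [map_section(s, query, call_llm) for s in sections]
    return reduce_notes(notes, query, call_llm)
```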
How to choose chunking and overlap
- Prose: 300–800 tokens, 10–20% overlap.
- Tables/FAQs: keep row/QA pair intact; chunk by logical units.
- Code: 80–200 tokens around functions/classes; zero or minimal overlap.
- Papers/policies: 400–600 tokens aligned to headings.
Quick heuristics
- If retrieval feels too broad → smaller chunks.
- If answers lack context → slightly larger chunks or add overlap.
- Always align chunks to semantic boundaries (headings, functions).
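As a baseline, here is a fixed-size chunker with fractional overlap, reusing the tiktoken encoding assumed earlier. It is a sketch only: real pipelines should split on semantic boundaries (headings, functions) first and fall back to fixed windows inside long sections.

```python
# Fixed-size chunking with fractional overlap (sketch; split on semantic
# boundaries such as headings or functions before falling back to this).
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")

def chunk_text(text: str, chunk_tokens: int = 500, overlap: float = 0.15) -> list:
    """Split text into ~chunk_tokens pieces, each sharing `overlap` with its neighbor."""
    tokens = enc.encode(text)
    step = max(1, int(chunk_tokens * (1 - overlap)))
    chunks = []
    for start in range(0, len(tokens), step):
        piece = tokens[start:start + chunk_tokens]
        if not piece:
            break
        chunks.append(enc.decode(piece))
    return chunks
```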
Compression strategies that work
- Answer-targeted summaries: keep only lines that answer the current question.
- Bulletized evidence: short bullets with citations (doc, section, line).
- Selective quoting: exact quotes for definitions, error codes, API signatures.
- Multi-query retrieval: generate variants of the user query to catch synonyms; merge and dedupe.
- Reranking: combine semantic score with metadata (recency, section importance).
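Two of these strategies in sketch form: merging results from query variants and reranking with metadata. The candidate field names and the score weights are assumptions to tune on validation queries, not fixed values.

```python
# Multi-query merge/dedupe plus metadata-aware reranking (sketch).
# Each candidate is a dict: {"chunk_id", "text", "query_sim", "age_days", "section_weight"}.

def merge_and_dedupe(results_per_query: list) -> list:
    """Merge hits from several query variants, keeping the best score per chunk."""
    best = {}
    for results in results_per_query:
        for cand in results:
            cid = cand["chunk_id"]
            if cid not in best or cand["query_sim"] > best[cid]["query_sim"]:
                best[cid] = cand
    return list(best.values())

def rerank(candidates: list, top_k: int = 5) -> list:
    """Combine semantic similarity with recency and section importance."""
    def score(c):
        recency = 1.0 / (1.0 + c["age_days"] / 365)  # newer documents score higher
        return 0.7 * c["query_sim"] + 0.2 * recency + 0.1 * c["section_weight"]
    return sorted(candidates, key=score, reverse=True)[:top_k]
```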
When you hit the limit
- Reduce top-k by 1–2.
- Compress chunks to bullets.
- Summarize older chat history.
- Trim verbose instructions while keeping constraints.
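The same fallback order can be run as a loop, applying each step only while the assembled prompt is still over the limit. The four step functions are placeholders for the tactics above; wire in your own implementations.

```python
# Degradation sequence when the assembled prompt exceeds the limit (sketch).
# Each step function takes and returns the dict of prompt parts.

def shrink_to_fit(parts: dict, limit: int, count_tokens, steps: list) -> dict:
    """Apply steps (reduce k, compress chunks, summarize history, trim instructions)
    in order until the total fits or the steps run out."""
    def total() -> int:
        return sum(count_tokens(text) for text in parts.values())
    for step in steps:
        if total() <= limit:
            break
        parts = step(parts)
    return parts
```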
Quality and cost trade-offs
- Large windows reduce orchestration complexity but can be slower.
- Small windows force tighter retrieval and better compression, often improving precision.
- Balance: reserve minimum tokens for a clear instruction and citations; optimize the rest dynamically.
Exercises
Complete the exercises below. A short checklist follows to sanity-check your work.
Exercise 1 – Design a token budget for a support chatbot (ID: ex1)
You have an 8k-token model. You must answer error troubleshooting questions using product docs. Include: system/tools, instructions, minimal history, retrieved docs, and an answer.
- Propose numeric budgets for each section.
- Choose chunk size, overlap, and top-k.
- Describe what to compress if you exceed the limit.
Expected output
A concise plan with numbers for each section, chunking, k, and a fallback compression sequence.
Exercise 2 – Chunking strategy across content types (ID: ex2)
Design chunking for three sources: (A) product manual with headings and bullet lists, (B) codebase with long files and functions, (C) policy PDF with long sentences.
- Propose chunk sizes and overlap per source.
- Define retrieval top-k and reranking signals.
- State what must be quoted verbatim vs summarized.
Expected output
Three mini-specs: sizes, overlap, k, rerank features, and compression rules for each source.
Checklist – Pre-flight before deployment
- Instructions fit in under 15% of the window and are specific.
- Chat history summarized to the last 1–3 turns unless critical context is needed.
- Top-k chosen via validation, not guesswork.
- Chunks aligned to headings/functions with appropriate overlap.
- Compression keeps definitions, parameters, and citations verbatim.
- No truncation when measuring with max-length queries.
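For the last item, a small pre-flight sketch: run your longest validation queries through the prompt builder and flag any that would leave too little room for the answer. Here build_prompt and count_tokens are whatever your pipeline already provides.

```python
# Pre-flight truncation check with maximum-length queries (sketch).

def preflight(queries: list, build_prompt, count_tokens,
              window: int, answer_reserve: int) -> list:
    """Return (query, tokens_used) pairs that leave too little room for the answer."""
    failures = []
    for query in queries:
        used = count_tokens(build_prompt(query))
        if used > window - answer_reserve:
            failures.append((query, used))
    return failures
```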
Common mistakes and self-check
- Mistake: Over-stuffing many chunks. Fix: Reduce k and compress to bullets.
- Mistake: Long chat history. Fix: Summarize older turns; keep only active constraints.
- Mistake: Chunking mid-sentence or mid-function. Fix: Align to boundaries.
- Mistake: Missing citations. Fix: Reserve tokens for source markers; quote critical lines.
- Mistake: Vague instructions. Fix: State task, constraints, format, and when to abstain.
Self-check prompts
- Does any section regularly cause truncation?
- Do answers remain accurate when I add 20% extra chat history?
- Do citations point to the exact source lines?
Practical projects
- Build a RAG FAQ bot: validate different chunk sizes (300, 500, 800) and overlaps (0%, 10%, 20%). Report accuracy and token usage.
- Code helper: implement hierarchical retrieval (repo summary → file summary → function snippet) and measure latency vs accuracy.
- Policy Q&A: implement query compression into answer-targeted bullets and compare with raw chunk stuffing.
Mini challenge
Your 8k model starts truncating citations. You cannot change the model. Propose a 3-step change list that fixes truncation while keeping answer quality. Keep it under 5 sentences.
Who this is for
- NLP engineers building RAG apps, chatbots, and summarizers.
- Data scientists deploying LLMs with retrieval or tool-calling.
Prerequisites
- Basic prompt engineering (system/instruction roles).
- Understanding of embeddings and vector search.
- Familiarity with tokenization concepts.
Learning path
- Prompt structure and tokenization basics.
- Chunking and embedding strategies.
- RAG retrieval, reranking, and compression.
- Context budgeting and monitoring.
- Evaluation: accuracy, grounding, and latency.
Next steps
- Instrument token usage and truncation rates.
- A/B test chunk sizes and k on real queries.
- Introduce answer-targeted compression and measure grounding.
Quick Test
Take the quick test below to check your understanding.