Who this is for
Prompt engineers, ML/AI engineers, product managers, and QA/ops who need to ship and maintain LLM features across teams without bottlenecks.
Prerequisites
- Basic prompt design and evaluation familiarity
- Version control basics (semantic versioning helps)
- Comfort with model parameters (temperature, top_p) and simple test sets
Why this matters
LLM features touch many teams: product, design, engineering, QA, support, and risk/compliance. Clear documentation and clean handoff let others run, debug, and evolve your work. Real tasks you will face:
- Ship a prompt update and explain what changed, why, and how to roll back
- Provide an evaluation summary so PMs can decide to launch
- Enable on-call engineers to respond when output quality drops
- Share inputs/outputs and edge cases so QA can test reliably
- Show compliance how data is handled and what risks were mitigated
Concept explained simply
Documentation and handoff mean: make future-you and teammates successful without you in the room. Capture the what, why, how, and how-to-operate of your LLM solution in small, durable docs.
Mental model
Think in a 4-pack:
- Prompt Card (what/how)
- Evaluation Report (quality)
- Decision Record (why)
- Runbook (operate/restore)
Together, they enable safe changes, quick onboarding, and fast incident response.
Core deliverables and templates
Prompt Card — 1-page spec
Title: [Feature/Capability Name]
Owner: [Team/Person]
Version: [prompt-name@vX.Y.Z]
Model & Params: [model_name, temperature, top_p, max_tokens]
Purpose: [User problem and intended outcome]
Input Contract: [Fields, types, example]
Output Contract: [Schema or format, example]
System/Instructions: [System prompt or directives]
Few-shot Examples: [2–5 canonical examples]
Constraints: [PII rules, safety filters, token budget]
Pre/Post-processing: [Normalization, redaction, formatting]
Dependencies: [APIs, feature flags]
Known Limits: [Edge cases, failure modes]
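If your team prefers a machine-readable card that lives next to the prompt, a small structured record works well. The sketch below is a minimal, hypothetical example in Python; the field names mirror the template above, and nothing about the `PromptCard` class or storage format is prescribed.

```python
from dataclasses import dataclass, field, asdict
import json


@dataclass
class PromptCard:
    """Machine-readable Prompt Card; fields mirror the 1-page template."""
    title: str
    owner: str
    version: str                 # e.g. "categorize-email@v1.3.0"
    model: str
    params: dict                 # temperature, top_p, max_tokens, ...
    purpose: str
    input_contract: dict         # field name -> type/description
    output_contract: dict
    system_instructions: str
    few_shot_examples: list = field(default_factory=list)
    constraints: list = field(default_factory=list)
    pre_post_processing: list = field(default_factory=list)
    dependencies: list = field(default_factory=list)
    known_limits: list = field(default_factory=list)


card = PromptCard(
    title="Support Email Categorization",
    owner="AI Platform",
    version="categorize-email@v1.3.0",
    model="gpt-4o-mini",
    params={"temperature": 0.2, "top_p": 1, "max_tokens": 200},
    purpose="Classify support emails into billing/tech_issue/account/other",
    input_contract={"subject": "string", "body": "string"},
    output_contract={"category": "string in set", "confidence": "float 0-1"},
    system_instructions="You are a careful classifier. Choose one category.",
)

# Keep the card in version control next to the prompt so reviews and diffs
# cover both at once.
print(json.dumps(asdict(card), indent=2))
```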
Evaluation Report — summary
Scope: [what was evaluated]
Dataset: [size, composition, source]
Metrics: [exact match, F1, BLEU, human ratings, cost, latency]
Acceptance Criteria: [go/no-go thresholds]
Results: [table or bullets: baseline vs candidate]
Risk Notes: [bias, safety, PII]
Sign-off: [roles who approved]
Prompt Decision Record (PDR)
Title: [Decision: e.g., Switch model to X]
Date: [YYYY-MM-DD]
Context: [problem, constraints]
Options: [A/B/C briefly]
Decision: [chosen option]
Rationale: [why this, trade-offs]
Impact: [quality, cost, latency]
Version Change: [from vA to vB]
Rollback Plan: [how to revert]
Status: [accepted/superseded]
Runbook — operate & recover
Service: [feature/prompt]
Owners & On-call: [roles]
Dependencies: [APIs, flags, datasets]
Monitors: [what alarms mean]
P0 Incident Steps:
1) Disable the feature flag or route to fallback
2) Roll back to the last good version
3) Communicate status to the channel
Diagnosis:
- Check recent deploys, prompt diffs, dataset changes
- Re-run an eval subset
Recovery:
- Validate outputs vs acceptance criteria
Safeguards: [rate limits, redaction]
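The P0 steps are easiest to execute when the prompt version in use is controlled by a single flag or config key. Below is a minimal sketch, assuming a hypothetical flag store and prompt registry; the names (`FLAGS`, `PROMPT_VERSIONS`, `select_prompt`) are illustrative and not tied to any specific flag service.

```python
# Sketch of "disable feature flag or route to fallback" from the P0 steps.
# FLAGS and PROMPT_VERSIONS are hypothetical stand-ins for your flag
# service and prompt registry.

FLAGS = {
    "support_cat_v1": True,                          # kill switch
    "force_fallback": False,                         # route to last good
    "fallback_version": "categorize-email@v1.2.0",   # last known-good
}

PROMPT_VERSIONS = {
    "categorize-email@v1.3.0": "current prompt text ...",
    "categorize-email@v1.2.0": "last good prompt text ...",
}


def select_prompt(active_version: str) -> str | None:
    """Return the prompt to use, honoring the kill switch and fallback flag."""
    if not FLAGS["support_cat_v1"]:
        return None                                   # step 1: feature disabled
    if FLAGS["force_fallback"]:
        return PROMPT_VERSIONS[FLAGS["fallback_version"]]  # step 2: rollback
    return PROMPT_VERSIONS[active_version]


# During an incident, on-call flips one value instead of redeploying:
FLAGS["force_fallback"] = True
print(select_prompt("categorize-email@v1.3.0"))       # last good prompt text ...
```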
Handoff Pack Checklist
- Prompt Card complete and current version tagged
- PDR for last major change
- Evaluation Report with acceptance criteria
- Runbook with P0 steps and rollback
- Sample inputs/outputs (sanitized)
- Changelog and owners
Versioning guidance
- Use semantic-like versions: MAJOR.MINOR.PATCH
- MAJOR: breaking changes to I/O or behavior
- MINOR: improvements that preserve contract
- PATCH: parameter tweaks with same intent
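A small helper keeps version bumps consistent across the Prompt Card, PDR, and changelog. This is a sketch under the MAJOR.MINOR.PATCH convention above; the `prompt-name@vX.Y.Z` label format comes from the Prompt Card template, and the `bump` function is illustrative rather than a standard tool.

```python
def bump(version_label: str, part: str) -> str:
    """Bump a prompt version label like 'categorize-email@v1.3.0'.

    part: 'major' for breaking changes to I/O or behavior,
          'minor' for improvements that preserve the contract,
          'patch' for parameter tweaks with the same intent.
    """
    name, ver = version_label.split("@v")
    major, minor, patch = (int(x) for x in ver.split("."))
    if part == "major":
        major, minor, patch = major + 1, 0, 0
    elif part == "minor":
        minor, patch = minor + 1, 0
    elif part == "patch":
        patch += 1
    else:
        raise ValueError(f"unknown part: {part}")
    return f"{name}@v{major}.{minor}.{patch}"


print(bump("categorize-email@v1.2.0", "minor"))  # categorize-email@v1.3.0
```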
Worked examples
Example 1 — Prompt Card
Title: Support Email Categorization
Owner: AI Platform
Version: categorize-email@v1.3.0
Model & Params: gpt-4o-mini, temperature=0.2, top_p=1, max_tokens=200
Purpose: Classify incoming support emails into {billing, tech_issue, account, other}
Input Contract: {subject: string, body: string}
Output Contract: {category: string in set, confidence: float 0-1}
System/Instructions:
You are a careful classifier. Choose one category.
Few-shot Examples:
Input: "Card charged twice..." -> {category: billing, confidence: 0.92}
Input: "App crashes on launch" -> {category: tech_issue, confidence: 0.95}
Constraints: No PII in logs; redact emails and card numbers.
Pre/Post: Pre: trim, lowercase. Post: enforce the allowed label set, clamp confidence (see the sketch after this example).
Dependencies: Feature flag: support_cat_v1
Known Limits: Emails that mix multiple issues may be misclassified; treat confidence below 0.6 as low.
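The pre/post-processing line is where silent bugs hide, so it is worth pinning down in code. The sketch below assumes the model returns a JSON string matching the output contract; the function names are illustrative.

```python
import json

ALLOWED = {"billing", "tech_issue", "account", "other"}


def preprocess(subject: str, body: str) -> dict:
    """Pre: trim and lowercase, per the Prompt Card."""
    return {"subject": subject.strip().lower(), "body": body.strip().lower()}


def postprocess(raw_model_output: str) -> dict:
    """Post: enforce the allowed label set and clamp confidence to [0, 1]."""
    try:
        parsed = json.loads(raw_model_output)
        category = parsed.get("category", "other")
        confidence = float(parsed.get("confidence", 0.0))
    except (ValueError, TypeError, AttributeError):
        category, confidence = "other", 0.0          # unparseable output

    if category not in ALLOWED:
        category, confidence = "other", 0.0          # out-of-set label
    confidence = max(0.0, min(1.0, confidence))      # clamp to 0-1
    return {"category": category, "confidence": confidence}


print(postprocess('{"category": "billing", "confidence": 1.4}'))
# {'category': 'billing', 'confidence': 1.0}
```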
Example 2 — Prompt Decision Record
Title: Reduce categorization cost via model switch
Date: 2026-01-08
Context: Growing volume increased monthly spend; latency target P95 < 1s
Options: A) Keep model; B) Switch to gpt-4o-mini; C) Heavier caching
Decision: B + light cache
Rationale: 35% cost reduction, P95 latency 850ms, quality -0.3pp vs baseline
Impact: Cost -35%, Quality F1 0.912 -> 0.909, Latency P95 1.2s -> 0.85s
Version Change: categorize-email v1.2.0 -> v1.3.0
Rollback Plan: Revert config to v1.2.0 in flag service
Status: accepted
Example 3 — Evaluation + Runbook snippet
Scope: v1.2.0 (baseline) vs v1.3.0 (candidate)
Dataset: 1,000 labeled emails, stratified by category
Metrics: Macro F1, accuracy, cost/request, P95 latency
Acceptance: F1 drop <= 0.5pp, P95 <= 1s, cost reduction of at least 20%
Results: F1 0.912 -> 0.909; Acc 0.936 -> 0.934; Cost -35%; P95 0.85s
Sign-off: PM, QA, AI Lead
Runbook P0 Steps:
- If the F1 proxy alert triggers: toggle fallback to v1.2.0, notify #ai-oncall
- Re-run the 100-sample smoke set, attach the diff in the incident doc
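A short script makes the baseline-vs-candidate comparison and the acceptance check repeatable. The sketch below assumes you have gold labels and each version's predictions as Python lists; it uses scikit-learn's metrics, though any macro-F1 implementation works. The inline data is toy-sized for illustration and will not meet the thresholds; in practice you would run it on the full stratified eval set.

```python
from sklearn.metrics import accuracy_score, f1_score

# Toy data for illustration only; replace with the full labeled eval set.
gold = ["billing", "tech_issue", "account", "other", "billing"]
pred_baseline = ["billing", "tech_issue", "account", "other", "billing"]
pred_candidate = ["billing", "tech_issue", "other", "other", "billing"]


def summarize(gold, pred):
    """Compute the metrics named in the Evaluation Report."""
    return {
        "macro_f1": f1_score(gold, pred, average="macro"),
        "accuracy": accuracy_score(gold, pred),
    }


base = summarize(gold, pred_baseline)
cand = summarize(gold, pred_candidate)
f1_drop_pp = (base["macro_f1"] - cand["macro_f1"]) * 100

# Acceptance criterion from the report: F1 drop <= 0.5pp.
verdict = "PASS" if f1_drop_pp <= 0.5 else "FAIL"
print(f"baseline={base}")
print(f"candidate={cand}")
print(f"F1 drop = {f1_drop_pp:.2f}pp -> {verdict}")
```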
How to hand off effectively
- Audit: Ensure the 4-pack is complete and consistent.
- Package: Put Prompt Card, PDR, Evaluation summary, Runbook, and sample I/O in one place (same version label).
- Walkthrough: 30–45 minutes with the receiving team. Demo a few real inputs.
- Sandbox: Provide a minimal script or notebook to reproduce results (see the sketch after this list).
- Sign-off: Confirm acceptance criteria and owners for incidents.
- Aftercare: Be on call for the first week after launch.
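For the sandbox step, something as small as the script below is usually enough. The model call is stubbed out because the client you use (OpenAI SDK, internal gateway, etc.) is project-specific; `call_model`, `SAMPLES_FILE`, and the sample schema are placeholders, and the sample file is assumed to ship with the handoff pack.

```python
import json

SAMPLES_FILE = "samples_v1.3.0.jsonl"   # sanitized inputs + expected outputs


def call_model(subject: str, body: str) -> dict:
    """Placeholder for your real client.

    Replace with the actual call, using the model and parameters recorded
    on the Prompt Card (gpt-4o-mini, temperature=0.2, ...). This dummy
    lets the script run end to end without network access.
    """
    return {"category": "other", "confidence": 0.0}


def reproduce(samples_path: str) -> None:
    """Re-run the sanitized samples and report mismatches vs expected output."""
    mismatches = 0
    with open(samples_path) as f:
        for line in f:
            sample = json.loads(line)
            result = call_model(sample["subject"], sample["body"])
            if result["category"] != sample["expected_category"]:
                mismatches += 1
                print("MISMATCH:", sample["subject"][:60], "->", result)
    print(f"done: {mismatches} mismatches")


if __name__ == "__main__":
    reproduce(SAMPLES_FILE)
```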
Documentation quality checklist
- Clear purpose and acceptance criteria
- Explicit input/output contracts with examples
- Model and parameter settings recorded
- Known limits and risks stated
- Repeatable evaluation method and dataset description
- Rollback plan and P0 steps present
- Version tagged consistently across all docs
- PII and safety guidance included where relevant
Exercises
Exercise 1 — Write a 1-page Prompt Card
Create a Prompt Card for an “Intent Classifier for Chatbot Queries.” Keep it to one page using the template above.
- Define purpose and acceptance criteria
- Specify input/output contracts
- Write system instructions and 3 examples
- Note parameters and constraints (privacy/safety)
- Add pre/post-processing and known limits
Need a hint?
- Use a low temperature (0–0.3) for classification
- Constrain outputs to a fixed label set
- Provide at least one tricky edge-case example
Exercise 2 — From messy notes to a Handoff Pack
Turn the messy notes below into a concise Handoff Pack (PDR + Evaluation summary + Runbook P0 steps).
Messy notes
PM: we need faster responses; current P95 ~1.6s. Eng: tried smaller model, accuracy down ~0.5% on sample. Ops: please document rollback, last time unclear. Data: remove emails in logs; GDPR risk. AI: cache top intents likely helps; goal P95 <= 1s; F1 drop <= 0.5pp max.
- Write a 6–10 line PDR
- Write a 5–7 line Evaluation summary
- Write 3–5 P0 steps in the Runbook
Need a hint?
- Include acceptance criteria in the Evaluation summary
- In the PDR, list options and trade-offs
- P0 steps must include disable/rollback and comms
Common mistakes and how to self-check
- Missing I/O contract: Can another engineer build a stub client from your doc alone?
- No acceptance criteria: Can PM say yes/no based on your Evaluation summary?
- Version drift: Does every doc show the same version label?
- Hidden pre/post steps: Are normalization/redaction rules explicit?
- Vague rollback: Is there a one-click or single-command revert path?
- No risk notes: Did you address PII, bias, and failure modes?
- Exploration dump: Final docs should be concise; move experiments to an appendix if needed.
Self-check: If you went on vacation today, could QA test it, ops run it, and PM justify launch without pinging you?
Practical projects
- Document an existing small LLM feature at work or in a portfolio project using the 4-pack
- Create a smoke test set (50–100 items) and include it in your Evaluation summary
- Run a tabletop incident drill using your Runbook and refine it
Learning path
- Start with Prompt Card for clarity
- Add Evaluation summary with acceptance criteria
- Record a Prompt Decision when you make a non-trivial change
- Finish with a Runbook and handoff walkthrough
- Iterate based on feedback from QA and on-call
Next steps
- Complete the two exercises above
- Ask a peer to review your docs against the quality checklist
- When ready, take the Quick Test
Mini challenge
In 8 lines or fewer, write a Runbook P0 section for a summarization feature that starts producing empty summaries. Include disable/rollback and one diagnostic step.
Test yourself
Take the Quick Test below to check your understanding. The test is available to everyone; only logged-in users get saved progress.