Why this matters
Examples teach the model what "good" looks like in your domain. Counterexamples teach the model where the boundaries are. Together they reduce ambiguity, cut hallucinations, and make outputs consistent. Prompt engineers use them to:
- Adapt general models to niche vocabularies (e.g., internal product names, support categories).
- Enforce formats (e.g., valid JSON with only allowed fields).
- Disambiguate similar labels and avoid false positives.
- Convey edge cases and safety constraints without code changes.
Concept explained simply
An example shows a correct input → output pair for a given task. A counterexample shows an input that might tempt a wrong answer and demonstrates the correct boundary (what to do instead).
Mental model
Think of a boundary line drawn with pegs:
- Examples are pegs inside the region of "do this".
- Counterexamples are pegs outside, labeled "do not cross" with the correct alternative.
Place pegs near confusing borders (similar labels, tricky formats, or safety constraints). Few, high-quality pegs beat many vague ones.
How to craft examples and counterexamples
- State the objective and constraints (labels allowed, format required, tone, length).
- Collect borderline cases from real data (ambiguous, noisy, or commonly misclassified).
- Write 2–5 strong examples that cover typical and slightly tricky cases.
- Add 1–3 counterexamples that highlight what not to do and show the correct outcome.
- Keep structure consistent: same input fields, same output schema, short rationales.
- Iterate: when the model errs in production, turn that case into a new example or counterexample (see the sketch after this list).
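A minimal sketch of this workflow in Python; the record fields and helper names are illustrative, not a required schema. Each example and counterexample keeps the same fields, production failures are appended as new counterexamples, and everything is rendered with one consistent format.

import json

# Each record uses the same fields; counterexamples also carry the tempting wrong answer.
examples = [
    {"input": "I need to update my health insurance details.",
     "output": {"label": "HR", "rationale": "Benefits and employee records belong to HR."}},
]
counterexamples = [
    {"input": "Can my salary be paid earlier this month?",
     "wrong": "Sales",
     "output": {"label": "HR", "rationale": "Salary/payment policy is an HR matter."}},
]

def add_failure(case_input, wrong_answer, correct_output):
    # Turn a logged production failure into a new counterexample.
    counterexamples.append({"input": case_input, "wrong": wrong_answer, "output": correct_output})

def render_few_shot_block() -> str:
    # Format every record the same way so the structure stays consistent.
    lines = []
    for ex in examples:
        lines.append(f'Input: "{ex["input"]}"')
        lines.append(f'Output: {json.dumps(ex["output"])}')
    for cx in counterexamples:
        lines.append(f'Input: "{cx["input"]}"')
        lines.append(f'Wrong temptation: {cx["wrong"]}')
        lines.append(f'Correct Output: {json.dumps(cx["output"])}')
    return "\n".join(lines)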
Checklist: a good example/counterexample set
- Labels or schema are explicitly listed.
- At least one borderline case is covered.
- Outputs match the exact required format.
- Rationales are brief and task-focused when needed.
- Language matches domain terms and style guides.
Worked examples
Example 1 — Intent classification with borderline counterexample
Task: Route emails to one of {"HR", "Sales", "Support"}. Common confusion: salary questions (HR) vs discounts (Sales).
Few-shot prompt
Task: Classify the email intent into exactly one of: HR, Sales, Support. Respond with JSON: {"label": "...", "rationale": "..."}. Use domain terms as written.
Allowed labels: ["HR", "Sales", "Support"]
Example A (positive):
Input: "I need to update my health insurance details."
Output: {"label": "HR", "rationale": "Benefits and employee records belong to HR."}
Example B (positive):
Input: "Our team wants a demo and pricing for 30 users."
Output: {"label": "Sales", "rationale": "Demo and pricing are pre-sale activities."}
Example C (positive):
Input: "The app keeps logging me out after two minutes."
Output: {"label": "Support", "rationale": "Technical issue with the product."}
Counterexample 1 (borderline):
Input: "Can my salary be paid earlier this month?"
Wrong temptation: Sales
Correct Output: {"label": "HR", "rationale": "Salary/payment policy is an HR matter, not a sales discount."}
Counterexample 2 (borderline):
Input: "Is there a student discount for the Pro plan?"
Wrong temptation: Support
Correct Output: {"label": "Sales", "rationale": "Discounts and plans relate to pricing and sales."}
Why the counterexamples matter: they sit near the decision boundary where generic models often flip labels.
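To catch format drift early, you can validate every reply against the label contract before routing anything. A minimal sketch, assuming the model's reply arrives as a raw JSON string (the function name is illustrative):

import json

ALLOWED_LABELS = {"HR", "Sales", "Support"}

def validate_classification(reply: str) -> dict:
    # Reject replies that are not strict JSON, use extra fields, or pick a disallowed label.
    data = json.loads(reply)  # raises ValueError on invalid JSON
    if set(data) != {"label", "rationale"}:
        raise ValueError(f"unexpected fields: {sorted(data)}")
    if data["label"] not in ALLOWED_LABELS:
        raise ValueError(f"label {data['label']!r} is not in the allowed set")
    return data

A rejected reply is a good candidate for the next counterexample.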
Example 2 — JSON extraction with missing-field counterexamples
Task: Extract invoice data in strict JSON. Pitfall: inventing missing fields or wrong currency.
Few-shot prompt
Task: Extract {"invoice_number", "date_iso", "seller", "total_amount", "currency"} from the text. If a field is missing, use null. Return ONLY one JSON object. Do not invent values.
Example (positive):
Input: "Invoice No. 8842, Dated 2023-11-02, Seller: Acme Tools, Total: EUR 1,250"
Output: {"invoice_number": "8842", "date_iso": "2023-11-02", "seller": "Acme Tools", "total_amount": 1250, "currency": "EUR"}
Counterexample (missing field):
Input: "Billed by Northline. Total due: 500. No invoice number listed."
Correct Output: {"invoice_number": null, "date_iso": null, "seller": "Northline", "total_amount": 500, "currency": null}
Counterexample (avoid invention):
Input: "Invoice #A-19. Seller: H.M. Retail. Amount due but currency not shown."
Correct Output: {"invoice_number": "A-19", "date_iso": null, "seller": "H.M. Retail", "total_amount": null, "currency": null}
Key signal: these counterexamples show that null is preferred over guessing.
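The same idea works for extraction: a small check that accepts null but rejects missing or invented keys. A minimal sketch, assuming the model returns exactly one JSON object as instructed:

import json

REQUIRED_FIELDS = ["invoice_number", "date_iso", "seller", "total_amount", "currency"]

def validate_invoice(reply: str) -> dict:
    # null values pass; absent or extra keys do not.
    data = json.loads(reply)
    missing = [f for f in REQUIRED_FIELDS if f not in data]
    extra = [k for k in data if k not in REQUIRED_FIELDS]
    if missing or extra:
        raise ValueError(f"missing: {missing}, unexpected: {extra}")
    return data

A check like this cannot tell an invented value from a real one; that still relies on the "do not invent values" instruction and on spot checks.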
Example 3 — Style and constraints with do/do-not pairs
Task: Generate product titles under 50 characters, no emojis, capitalize only first word and proper nouns.
Few-shot prompt
Task: Write a concise product title under 50 characters. No emojis. Capitalize first word and proper nouns only.
Example (positive):
Input: "Wireless noise-cancelling headphones, Model X3"
Output: "Wireless noise-cancelling headphones X3"
Counterexample (length + emoji):
Input: "Ultra soft cotton T-shirt for men and women in many colors 😀"
Wrong temptation: Add emoji and long text
Correct Output: "Ultra soft cotton T-shirt"
Counterexample (capitalization):
Input: "Smart Home Hub by NovaTech"
Wrong temptation: "Smart Home Hub By Novatech"
Correct Output: "Smart home hub by NovaTech"
These do/do-not pairs make the constraints concrete and reduce retries.
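Rules like the length limit and the emoji ban are also easy to verify after generation. A rough sketch (the emoji screen is deliberately crude, and proper-noun capitalization would need a domain word list, so it is left to review):

def check_title(title: str) -> list[str]:
    # Return the list of violated rules; an empty list means the title passes.
    problems = []
    if len(title) >= 50:
        problems.append("not under 50 characters")
    if any(ord(ch) > 0x2500 for ch in title):  # rough screen for emoji and decorative symbols
        problems.append("contains emoji or symbols")
    words = title.split()
    if words and not words[0][0].isupper():
        problems.append("first word not capitalized")
    return problems

# check_title("Ultra soft cotton T-shirt") -> []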
Patterns and templates
Template — Classification with boundaries
Objective: Classify into one of: {ALLOWED_LABELS}. Respond as JSON: {"label": "...", "rationale": "..."}. Only one label.
List labels and short definitions.
Examples (2–4): Show typical and slightly tricky cases.
Counterexamples (1–3): For each, show the tempting wrong label and the correct output with rationale.
Template — Structured extraction with nulls
Objective: Extract fields: {FIELD_LIST}. If a field is missing or uncertain, use null. Return exactly one JSON object. Do not invent values.
Examples: clean and messy inputs.
Counterexamples: missing fields, conflicting values, units/currency absent.
Template — Style/format enforcement
Rules: {LENGTH_LIMIT}, {PROHIBITED_ELEMENTS}, {STYLE_GUIDE}
Examples: good outputs that meet rules.
Counterexamples: show violations and corrected outputs.
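One practical habit with templates like these: before sending a filled prompt, confirm that no uppercase placeholder survived the fill. A minimal sketch using the placeholder convention shown above:

import re

PLACEHOLDER = re.compile(r"\{[A-Z_]+\}")  # matches tokens such as {ALLOWED_LABELS} or {FIELD_LIST}

def unfilled_placeholders(prompt: str) -> list[str]:
    # Any match means the template was shipped half-filled.
    return PLACEHOLDER.findall(prompt)

# unfilled_placeholders("Classify into one of: {ALLOWED_LABELS}.") -> ['{ALLOWED_LABELS}']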
Exercises
Do these hands-on tasks. Then take the quick test at the end. Your progress in the test is saved only if you are logged in; everyone can practice the questions.
Exercise 1 — E-commerce category boundaries
Create a prompt that classifies product descriptions into exactly one of: {"Electronics", "Clothing", "Home"}. Include at least 3 positive examples and 2 counterexamples that address borderline cases (e.g., a "smart jacket" with built-in heating; a "desk lamp" sold as decor). Require JSON output with label and brief rationale, and prohibit multi-label answers.
What good looks like
- Allowed labels are listed with short definitions.
- Outputs are strict JSON with a single "label".
- Counterexamples explicitly show the tempting wrong label and the correct one.
Exercise 2 — Bug report extraction with missing fields
Design a prompt that extracts {"title", "severity", "component", "steps_to_reproduce"} from a bug report text. If any field is not present, set it to null. Include 2 positive examples and 2 counterexamples: one with no steps provided, and one where severity words appear in a quote but do not represent an assigned severity.
Quality bar
- Single JSON object only.
- Nulls instead of invented values.
- Counterexamples show that values are not inferred from quoted or unrelated text.
Common mistakes and self-check
- Too many examples: this overwhelms the model and dilutes the signal. Start with 3–6 total.
- Vague outputs: not enforcing a format. Always specify exact schema.
- No boundaries: positives alone lead to overgeneralization. Add counterexamples near known confusions.
- Inconsistent structure: switching formats across examples. Keep inputs/outputs aligned.
- Unrealistic data: examples unlike real inputs. Use real phrasing and noise.
Self-check
- Can a colleague predict your model's output from examples alone?
- Do counterexamples directly address the most common errors?
- Does every output strictly follow the schema without extra text?
Practical projects
- Support router: Build a few-shot classifier for internal support categories with 5 examples and 3 counterexamples from past tickets.
- Receipt parser: Extract totals and dates from receipts; include counterexamples for multi-currency and missing totals.
- Style enforcer: Rewrite release notes into a house style; add counterexamples for banned phrases and tone violations.
Who this is for
- Prompt engineers building reliable domain prompts.
- Data scientists prototyping LLM evaluators.
- Analysts and QA engineers who need consistent outputs from models.
Prerequisites
- Basic prompt design (task, constraints, format).
- Familiarity with your domain labels or schema.
- Ability to read and write simple JSON.
Learning path
- Define task and schema.
- Draft 2–3 positive examples.
- Add 1–2 counterexamples on real edge cases.
- Test on 20–50 samples; log failures (see the evaluation sketch after this list).
- Convert failures into improved counterexamples; keep total concise.
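A minimal evaluation loop for the testing step, assuming a hypothetical call_model function that sends a prompt to your model and returns the raw text reply, and a prompt containing an {input} marker:

import json

def evaluate(call_model, prompt_template, samples):
    # samples: list of {"input": ..., "expected_label": ...} drawn from real data.
    failures = []
    correct = 0
    for sample in samples:
        # str.replace avoids clashing with the literal braces in the JSON examples.
        reply = call_model(prompt_template.replace("{input}", sample["input"]))
        try:
            predicted = json.loads(reply)["label"]
        except (ValueError, KeyError):
            predicted = None  # malformed output counts as a failure
        if predicted == sample["expected_label"]:
            correct += 1
        else:
            failures.append({"sample": sample, "reply": reply})
    return correct / len(samples), failures

The failures list is the raw material for your next counterexamples, and the accuracy number gives you the before/after measurement used in the mini challenge below.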
Next steps
- Automate evaluation with a small test set and pass/fail checks.
- Rotate examples periodically to avoid overfitting to narrow phrasing.
- Document decisions: why each counterexample exists and what error it prevents.
Mini challenge
Pick a task you own. Add exactly one counterexample that targets the most frequent model mistake. Measure accuracy on 30 samples before and after. Did it reduce that mistake?
Progress & test
Anyone can take the quick test below. If you are logged in, your progress will be saved automatically.