Why this matters
As a Prompt Engineer, your prompts and context often touch real user data, evaluations, or internal instructions. A single careless example, log, or retrieval filter can leak private information, answers, or system prompts. Preventing leakage protects users, your company, and the validity of your model evaluations.
- Real tasks: write few-shot prompts without giving away answers; design redaction for PII; set safe retrieval filters for RAG; configure logging that keeps signals but not secrets.
- Impact: avoids privacy breaches, preserves fair evaluations, and increases trust in your AI product.
Concept explained simply
Think of your AI system as a conversation room with windows. Leakage happens when something that must stay inside (private data, answers, keys, system rules) is visible through a window. Prevention means covering windows you don’t need, blurring sensitive parts, and controlling who can look in.
Mental model
- Minimize: only bring the minimum necessary info into the room.
- Mask: if you must bring sensitive info, cover it (redact/mask/abstract).
- Fence: separate rooms (per user, per tenant, per session) so content never crosses by accident.
- Test: regularly walk outside and see what can be seen (red-team prompts and canary checks).
What counts as leakage?
- Train/eval contamination: evaluation items or answers appear in training data, examples, or hints.
- PII/secrets exposure: emails, phone numbers, tokens, API keys, or internal docs show up in outputs, logs, or prompts.
- Cross-tenant/context bleed: content from user A appears in responses for user B due to shared memory/history.
- System prompt reveal: hidden instructions exposed via prompt injection or direct queries.
Prevention toolkit (step-by-step)
- Classify data
- Never-share: secrets, tokens, private keys, raw credentials, highly sensitive PII.
- Restricted: customer identifiers, emails, addresses, phone numbers.
- Public/safe: already public info or abstracted stats.
- Minimize inputs
- Only include fields strictly needed for the task.
- Replace long histories with short summaries.
- Redact or mask
- Pattern-based redaction: emails, phone numbers, card numbers, SSNs, access tokens.
- Use placeholders like [EMAIL], [PHONE], [ORDER_ID], and keep a reversible map outside model context if needed.
Redaction checklist
- Emails → [EMAIL]
- Phone numbers → [PHONE]
- Payment data → [CARD]
- Access tokens/keys → [SECRET]
- Exact names/addresses → [NAME]/[ADDRESS]
- Fence memory and sessions
- Scope memory by user and tenant.
- Use short, ephemeral sessions; trim histories.
- Do not reuse conversation history across users.
- Safe retrieval (RAG)
- Filter by user/tenant metadata before similarity search results are shown to the model.
- Avoid global indexes without access filters.
- Hide sensitive reasoning
- Avoid exposing chain-of-thought for sensitive tasks. Prefer brief, non-sensitive rationales or no rationale.
- Canary and adversarial tests
- Seed harmless canary tokens (e.g., CANARY-ALPHA-93) in restricted docs to detect leaks if they appear in outputs.
- Probe with injection attempts to reveal system prompts—ensure the model refuses.
- Logging with care
- Log events and categories, not raw secrets or full PII.
- Apply the same redaction to logs as to prompts.
- Evaluate regularly
- Run PI/secret detectors on outputs.
- Check evaluation sets for overlap with examples/training.
Worked examples
Example 1 — Few-shot label leakage
Bad prompt (leaks exact answers from evaluation set):
System: You grade product reviews as Positive or Negative. User: Classify: Q: "This phone is terrible, battery dies" → A: Negative Q: "I love the camera and speed" → A: Positive Q: "This phone is terrible, battery dies" →
Fix:
System: You grade product reviews as Positive or Negative. User: Learn from these patterns (no overlap with evaluation items): Example: "Horrible battery life and slow" → Negative (focus on faults) Example: "Fast, great camera, very happy" → Positive (focus on praise) Now classify without repeating examples: Q: "This phone is terrible, battery dies" →
We removed identical evaluation text and used abstracted patterns.
Example 2 — RAG with PII
Original context (unsafe):
Ticket #4217 by Alice Brown (alice.brown@example.com, +1 202-555-0135) Issue: Card charged twice on 2024-05-03. Last 4 digits: 8421
Redacted context (safe):
Ticket [TICKET_ID] Customer: [NAME], [EMAIL], [PHONE] Issue: Card charged twice on [DATE]. Card: [CARD_LAST4]
Prompt template (minimized):
System: You are a support assistant. User: Summarize the issue and propose next steps without revealing personal details. Use placeholders as is.
Example 3 — Cross-tenant memory
Problem: A shared memory store caches helpful tips from all users and sometimes surfaces other customers’ details.
Fix: Partition memory by tenant and user, add a short TTL, and summarize before storing. Do not store raw PII.
Memory checklist
- Key by tenant_id + user_id
- Short TTL or rolling window
- Summarize; avoid raw PII
- No cross-tenant retrieval
Example 4 — Canary detection
Embed a canary like CANARY-ALPHA-93 in a restricted doc. If it appears in outputs, investigate filters and prompts immediately.
Exercises
Progress and saving: The quick test is available to everyone; only logged-in users get saved progress.
Exercise 1 — Redact sensitive data
Rewrite the following prompt so it contains no direct PII while keeping meaning:
"Contact Jane Miller at jane.miller@craftco.com, phone +44 20 7946 0958, about order 1189. Her card ending 9923 failed."
- Use placeholders like [NAME], [EMAIL], [PHONE], [ORDER_ID], [CARD_LAST4].
- Keep all necessary task info.
Tip
- Identify PII items first.
- Replace with placeholders consistently.
- Do not add new info.
Exercise 2 — Remove label leakage in few-shot
Design 2 few-shot examples for sentiment analysis that do not include or closely paraphrase the evaluation item:
Evaluation item: "Delivery took ages and the box was damaged"
- Your examples should show the pattern but not overlap semantically too closely.
- Include a brief parenthetical rationale, not chain-of-thought.
Tip
- Vary vocabulary substantially.
- Teach the decision rule instead of the answer.
Common mistakes and self-check
- Including evaluation answers in examples. Self-check: Do my examples match any evaluation text?
- Over-sharing context to be “safe.” Self-check: Can the model solve the task if I remove each field? If yes, remove it.
- Redacting in prompts but not in logs. Self-check: Are logs subjected to the same redaction rules?
- Using global retrieval without filters. Self-check: Are queries filtered by user/tenant before ranking?
- Exposing hidden instructions. Self-check: Does the model refuse when asked to reveal system prompts?
Practical projects
- Build a redaction preprocessor: given text with PII, output masked text plus a reversible mapping stored outside the model context. Acceptance: zero raw PII tokens in model inputs.
- Safe RAG demo: index two tenants’ docs with tenant filters; show that each tenant can only retrieve its own docs. Acceptance: canary terms from tenant A never appear for tenant B.
- Leakage test suite: create prompts that attempt to extract system instructions or PII. Acceptance: 0 critical leaks across 50 adversarial prompts.
Mini challenge
Write a one-paragraph guidance block for your team describing how to choose placeholders for PII and where to store the mapping. Keep it actionable and tool-agnostic.
Learning path
- Start: Redaction and masking patterns.
- Next: Safe RAG filtering and metadata design.
- Then: Evaluation for leakage and canary monitoring.
- Finally: Production logging policies and privacy reviews.
Who this is for
- Prompt engineers designing system prompts, few-shot examples, and RAG prompts.
- Data/ML practitioners integrating models with user data.
Prerequisites
- Basic prompt engineering skills (system/user roles, few-shot patterns).
- Familiarity with PII types and privacy basics.
Next steps
- Implement a redaction pass in your prompt pipeline.
- Add tenant/user filters to any retrieval step.
- Create a small leak test set and run it before each release.