Who this is for
This subskill is for Applied Scientists who build, evaluate, or deploy ML/AI features and need to proactively reduce harm, prevent misuse, and respond to safety incidents.
Prerequisites
- Basic understanding of ML model behavior (classification/generation, evaluation metrics)
- Familiarity with product requirements and experiment design (A/B tests, offline evals)
- Comfort with writing clear policies and decision logs
Progress and test
You can take the Quick Test at any time; it is free for everyone, but only logged-in users have their progress saved.
Why this matters
Real tasks you will face:
- Define unacceptable use cases (e.g., harassment, self-harm facilitation, malware creation) and enforce them in prompts and filters.
- Design red-teaming plans and measure jailbreak success rate, toxicity rate, and safe-but-helpful response rate.
- Decide when to refuse, when to warn, and when to provide safe alternatives.
- Implement incident response: detect, triage, contain, and patch harmful behaviors quickly.
- Balance user value with safety: reduce harm without making the system unusably cautious.
Concept explained simply
Safety and abuse considerations ensure your AI avoids causing harm and is hard to misuse. Think of it as designing guardrails: define what "bad" looks like, test for it before launch, monitor after launch, and fix gaps fast.
Mental model: The Harm Funnel
- Inputs: risky prompts + adversarial attempts
- Model: may comply, refuse, or be tricked (jailbroken)
- Guardrails: input filters, prompt policies, output filters, tool constraints
- Oversight: red teaming, human review, monitoring, user reporting
- Outcomes: prevented harm, mitigated harm (warnings/alternatives), or incident (requires response)
A simple workflow for safety and abuse considerations
1) Define risk surface
- List sensitive domains: physical danger, self-harm, hate/harassment, sexual content, illegal activity, privacy/PII, security (prompt injection, data exfiltration), medical/financial/legal advice.
- Map who could be harmed: end users, bystanders, specific groups.
- Rate likelihood and severity (Low/Med/High) to prioritize (see the scoring sketch below).
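To make the prioritization step concrete, here is a minimal sketch of a risk register, assuming a simple Low/Med/High scale mapped to 1-3 and ranked by likelihood x severity. The domains and ratings shown are illustrative, not a recommended taxonomy.

```python
from dataclasses import dataclass

# Illustrative scoring: map Low/Med/High to 1/2/3 and rank by likelihood x severity.
LEVELS = {"Low": 1, "Med": 2, "High": 3}

@dataclass
class Risk:
    domain: str      # e.g. "privacy/PII", "prompt injection"
    harmed: str      # who could be harmed
    likelihood: str  # "Low" | "Med" | "High"
    severity: str    # "Low" | "Med" | "High"

    @property
    def priority(self) -> int:
        return LEVELS[self.likelihood] * LEVELS[self.severity]

risks = [
    Risk("privacy/PII", "end users", "Med", "High"),
    Risk("prompt injection", "end users and bystanders", "High", "Med"),
    Risk("self-harm facilitation", "vulnerable users", "Low", "High"),
]

# Highest-priority risks first.
for r in sorted(risks, key=lambda r: r.priority, reverse=True):
    print(f"{r.priority}  {r.domain} ({r.likelihood} likelihood, {r.severity} severity)")
```

The exact scoring scheme matters less than applying one consistently so that mitigation work targets the highest-priority risks first.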
2) Write safety policy
- Examples of allowed, allowed-with-safeguards (warn + safe alternative), and disallowed requests.
- Refusal language and format: concise, neutral, empathetic, offering safe options.
- Escalation logic (e.g., self-harm crisis response); see the policy-table sketch below.
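One lightweight way to keep the policy enforceable is to encode it as data that guardrail code can look up at request time. This is only a sketch; the category names, actions, and guidance strings below are hypothetical and would come from your own policy.

```python
# Hypothetical policy table: each request category maps to an action plus response guidance.
SAFETY_POLICY = {
    "benign": {
        "action": "allow",
        "guidance": "Answer normally.",
    },
    "vulnerability_education": {
        "action": "allow_with_safeguards",
        "guidance": "Answer with safety framing and a defensive focus; no working exploits.",
    },
    "harassment": {
        "action": "refuse",
        "guidance": "Decline concisely and neutrally; offer a constructive alternative.",
    },
    "self_harm": {
        "action": "escalate",
        "guidance": "Respond with empathy and share crisis resources; never provide methods.",
    },
}

def policy_action(category: str) -> dict:
    """Look up the action for a classified request category (unknown categories default to refusal)."""
    return SAFETY_POLICY.get(category, {"action": "refuse", "guidance": "Decline politely."})

print(policy_action("self_harm")["action"])  # -> escalate
```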
3) Implement guardrails
- Input filtering: classify before model call; block or route to safer flows.
- Prompt engineering: system instructions with policy constraints, safety style, and grounding.
- Tool constraints: whitelist tools and restrict dangerous actions.
- Output filtering: classify responses; redact PII; add safety warnings where needed.
- Rate limits, age gating, and user controls (report/feedback); see the pipeline sketch below.
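A minimal sketch of how these guardrails can compose around a model call, assuming placeholder components: `classify_prompt`, `call_model`, and the PII regex stand in for whatever input classifier, model API, and redaction logic you actually use.

```python
import re

PII_PATTERN = re.compile(r"\b\d{3}-\d{2}-\d{4}\b")  # toy example: US SSN-like strings

def classify_prompt(prompt: str) -> str:
    # Placeholder: in practice, a trained classifier or moderation endpoint.
    return "self_harm" if "hurt myself" in prompt.lower() else "benign"

def call_model(prompt: str) -> str:
    # Placeholder for the real model call, made under a policy-constrained system prompt.
    return f"Draft reply to: {prompt}"

def redact_pii(text: str) -> str:
    return PII_PATTERN.sub("[REDACTED]", text)

def respond(prompt: str) -> str:
    category = classify_prompt(prompt)  # input filter
    if category == "self_harm":
        return ("I'm really sorry you're going through this. You deserve support; "
                "please consider reaching out to someone you trust or a crisis line.")
    if category != "benign":
        return "I can't help with that, but I'm happy to help with something else."
    output = call_model(prompt)         # model call
    return redact_pii(output)           # output filter

print(respond("Help me draft a thank-you email"))
```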
4) Evaluate pre-launch
- Safety test set with diverse languages, cultures, and edge cases.
- Metrics: toxicity rate, jailbreak success rate, unsafe advice rate, refusal rate on safe prompts (over-refusal), and helpfulness on benign prompts (see the metrics sketch below).
- Manual red teaming + automated adversarial tests; document gaps and mitigations.
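The core metrics can be computed directly from a labeled evaluation run. The record format below (a prompt type plus an outcome label) is an assumption for illustration; a real run would also slice results by abuse category and language.

```python
# Each record marks whether the prompt was adversarial or benign and how the system responded.
eval_results = [
    {"prompt_type": "adversarial", "outcome": "refused"},
    {"prompt_type": "adversarial", "outcome": "complied_unsafe"},  # a successful jailbreak
    {"prompt_type": "benign",      "outcome": "helped"},
    {"prompt_type": "benign",      "outcome": "refused"},          # over-refusal
]

def rate(records, prompt_type, outcome):
    subset = [r for r in records if r["prompt_type"] == prompt_type]
    return sum(r["outcome"] == outcome for r in subset) / len(subset) if subset else 0.0

jailbreak_rate    = rate(eval_results, "adversarial", "complied_unsafe")
over_refusal_rate = rate(eval_results, "benign", "refused")
print(f"jailbreak rate: {jailbreak_rate:.0%}, over-refusal rate: {over_refusal_rate:.0%}")
```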
5) Monitor and respond
- Sampling and targeted canary prompts; anomaly alerts on spikes in unsafe outputs (see the alert sketch below).
- User report intake, triage (severity/impact), containment, patch, and postmortem with learnings.
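A sketch of a simple monitoring check: alert when the unsafe-output rate in today's sample exceeds a rolling baseline by some margin. The margin, baseline window, and sample sizes below are arbitrary placeholders, not recommended values.

```python
from statistics import mean

def unsafe_rate(sampled_flags):
    """sampled_flags: list of booleans, True if a sampled output was flagged unsafe."""
    return mean(sampled_flags) if sampled_flags else 0.0

def should_alert(today_rate: float, baseline_rates: list, margin: float = 0.02) -> bool:
    baseline = mean(baseline_rates) if baseline_rates else 0.0
    return today_rate > baseline + margin

baseline = [0.004, 0.006, 0.005, 0.005]            # prior days' sampled unsafe rates
today = unsafe_rate([False] * 470 + [True] * 30)   # today's sample: 30 of 500 flagged
if should_alert(today, baseline):
    print("ALERT: unsafe-output rate spiked; open an incident and start triage.")
```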
Worked examples
Example 1: Social chatbot for teens
- Risks: harassment, bullying, sexual content, self-harm ideation, privacy leaks.
- Policy: disallow sexual content; if self-harm detected, respond with empathetic language and provide crisis resources; minimize personal data collection.
- Guardrails: input classifier for self-harm and harassment; system prompt with empathetic style and strict refusal patterns; output filter to prevent sharing PII.
- Eval: measure correct crisis-response rate, false positives on benign mental-health chats, and jailbreak rate with adversarial slang.
- Monitoring: age gating, report button, priority review for self-harm signals (see the escalation sketch below).
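The escalation path for self-harm signals might look like the sketch below: a detector (here a crude keyword check standing in for a classifier tuned on teen language), an empathetic response that points to crisis resources, and a high-priority entry in a human-review queue. All names are hypothetical.

```python
review_queue: list = []  # placeholder for a real priority-review system

def detect_self_harm(message: str) -> bool:
    # Placeholder: a trained classifier should handle slang and indirect phrasing.
    return any(kw in message.lower() for kw in ("hurt myself", "end it all"))

def handle(session_id: str, message: str) -> str:
    if detect_self_harm(message):
        review_queue.append({"session": session_id, "reason": "self_harm", "priority": "high"})
        return ("I'm really glad you told me. You're not alone, and you deserve support. "
                "Please consider reaching out to a trusted adult or a crisis line right now.")
    return "(normal chat flow)"

print(handle("abc123", "sometimes I want to hurt myself"))
print(review_queue)
```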
Example 2: Code assistant with security risks
- Risks: generating malware, exploit code, or instructions to bypass security.
- Policy: disallow weaponized malware; allow secure-coding help; require safety framing for vulnerability education.
- Guardrails: input filter for exploit requests; tool whitelist (read-only docs, no shell execution); output filter to detect dangerous payloads.
- Eval: red-team prompts (e.g., obfuscated requests), measure refusal precision/recall and helpfulness on legitimate secure-coding tasks.
- Monitoring: auto-block repeated exploit attempts and throttle accounts showing abuse patterns (see the throttling sketch below).
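The throttling behavior might be as simple as counting blocked exploit requests per account, as in this sketch; the threshold and in-memory counter are illustrative stand-ins for real abuse-tracking infrastructure.

```python
from collections import Counter

blocked_attempts = Counter()
THROTTLE_AFTER = 3  # arbitrary example threshold

def record_blocked_request(account_id: str) -> str:
    """Called whenever a request is blocked as an exploit attempt."""
    blocked_attempts[account_id] += 1
    if blocked_attempts[account_id] >= THROTTLE_AFTER:
        return "throttled"  # e.g. slow responses, CAPTCHAs, or manual review
    return "warned"

status = "warned"
for _ in range(3):
    status = record_blocked_request("acct_42")
print(status)  # -> "throttled" after repeated blocked exploit requests
```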
Example 3: Image captioning accessibility app
- Risks: inferring sensitive attributes (religion, health), stereotyping, privacy from faces/locations.
- Policy: avoid guessing sensitive traits; prioritize neutral factual descriptions; blur or avoid PII if uncertain.
- Guardrails: output post-processor to remove sensitive inferences; confidence thresholds; clear uncertainty language (see the post-processing sketch after this example).
- Eval: annotate a test set for sensitive-attribute leakage; measure leakage rate and harmlessness of captions.
- Monitoring: user feedback to correct biased captions; periodic bias audits.
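The output post-processor from Example 3's guardrails could start as simply as this sketch: drop low-confidence tags and anything on a sensitive-inference list before composing the caption. The term list and threshold are placeholders, not a vetted taxonomy.

```python
SENSITIVE_TERMS = {"religion", "pregnant", "disabled", "homeless"}  # illustrative only
CONFIDENCE_THRESHOLD = 0.8

def postprocess_caption(tags):
    """tags: list of (label, confidence) pairs from the captioning model."""
    kept = [label for label, conf in tags
            if conf >= CONFIDENCE_THRESHOLD and label.lower() not in SENSITIVE_TERMS]
    if not kept:
        return "An image; details are unclear."
    return "An image showing " + ", ".join(kept) + "."

tags = [("a person", 0.95), ("a park bench", 0.9), ("homeless", 0.6), ("a dog", 0.55)]
print(postprocess_caption(tags))  # -> "An image showing a person, a park bench."
```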
Example 4: Medical-like Q&A (not a medical device)
- Risks: harmful medical advice, overconfidence, privacy issues.
- Policy: provide general information with disclaimers; encourage professional consultation; refuse diagnosis or treatment instructions.
- Guardrails: grounding to reputable general references; output templates with disclaimers; strict refusal for high-risk requests (see the templating sketch after this example).
- Eval: measure unsafe advice rate and overconfidence markers; ensure helpfulness on general wellness topics.
- Monitoring: flag spikes in medical queries; review high-risk sessions.
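The disclaimer templating and strict refusal from Example 4's guardrails might be sketched like this; the category names and wording are assumptions, not vetted medical-safety copy.

```python
DISCLAIMER = ("This is general information, not medical advice. "
              "Please consult a healthcare professional about your situation.")

def format_answer(category: str, body: str) -> str:
    if category in {"diagnosis", "treatment_instructions"}:
        return ("I can't diagnose conditions or recommend treatments, and a healthcare "
                "professional is the right person to help with this. " + DISCLAIMER)
    return f"{body}\n\n{DISCLAIMER}"

print(format_answer("general_wellness", "Regular sleep and hydration support recovery."))
```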
Practical projects
- Build a mini safety classifier pipeline: label a small set of prompts, train or configure a toxicity/self-harm/PII detector, and connect it as input and output filters.
- Create a red-teaming prompt pack covering 8 abuse categories and 3 languages; report jailbreak rate and top failure patterns.
- Design an incident response playbook: severity matrix, on-call rotation, containment steps, and patch timelines.
Exercises
Complete these. They mirror the tasks you’ll do on the job.
Exercise 1: Risk mapping and guardrails
You're shipping a text-generation feature that drafts email replies. Identify abuse risks, propose mitigations, and define evaluation metrics.
- Deliverables checklist:
  - Risk taxonomy with likelihood/severity
  - Policy highlights (allowed / allowed-with-safeguards / disallowed)
  - Guardrails for input, prompt, tools, and output
  - Pre-launch metrics and target thresholds
Hints
- Consider harassment, fraud, and PII leakage in email contexts.
- Balance refusal with helpfulness on benign emails.
Exercise 2: Red-teaming plan
Design a 1-week red-teaming plan for a multilingual chatbot.
- Deliverables checklist:
  - Abuse categories and example prompts (3 languages)
  - Success criteria and metrics (e.g., jailbreak rate < X%)
  - Logging and sampling strategy
  - Patch plan for top 3 failure modes
Hints
- Include obfuscation and indirection techniques (role-play, code words).
- Cover both toxicity and data exfiltration attacks.
Common mistakes and self-check
- Mistake: Over-refusal that blocks benign tasks. Self-check: track helpfulness on a benign benchmark and reduce false positives.
- Mistake: Narrow red teaming. Self-check: include multiple languages, dialects, and cultural contexts.
- Mistake: One-time safety review. Self-check: set up monitoring, canary prompts, and periodic audits.
- Mistake: Policies without examples. Self-check: add clear, concrete examples for each rule.
- Mistake: No incident playbook. Self-check: define severity, timelines, responsible roles, and communication templates.
Mini challenge
Pick one worked example and write a 5-sentence refusal template that is empathetic, policy-aligned, and offers a safe alternative. Test it on three adversarial prompts and refine it.
Learning path
- Start: Learn safety taxonomies and write a draft policy with examples.
- Next: Implement input/output filters and test on a small safety set.
- Then: Run a structured red-teaming sprint; quantify risks and patch.
- Finally: Set up monitoring, reporting channels, and an incident response loop.
Next steps
- Turn your exercises into reusable templates for your team.
- Create a lightweight safety dashboard for your key metrics.
- Schedule a quarterly safety review with cross-functional partners.