Why this matters
AI products behave probabilistically, improve with data, and can fail in surprising ways. Good user research uncovers real jobs-to-be-done, validates where AI helps (and where it harms), and shapes requirements like data needs, quality bars, guardrails, and human-in-the-loop steps.
- Identify high-value, repetitive tasks that AI can reliably assist with.
- Map user workflows to find the right AI insertion point and human review moments.
- Define quality thresholds (precision/recall), acceptable errors, and recovery paths.
- Surface risks: bias, privacy, safety, explainability, and compliance.
Real tasks in the AI Product Manager role
- Interview target users to uncover decision points and data sources they trust.
- Run Wizard-of-Oz tests to simulate AI behavior before building the model.
- Design evaluation rubrics and acceptance criteria (what is "good enough").
- Co-create prompts and guardrails with users for safer, more useful outputs.
- Define success metrics tied to workflow outcomes, not just model benchmarks.
Concept explained simply
User research for AI products is learning how people make decisions, what data they rely on, and where imperfect AI assistance adds value with minimal risk. You test usefulness and trust—before you invest in heavy engineering.
Mental model
Think in three layers:
- Task layer: What job needs doing? What does "good" look like?
- Evidence layer: What inputs, context, and constraints shape the decision?
- AI layer: Where can AI assist, what errors are tolerable, and how do we recover?
Quick litmus tests
- Is the task frequent, time-consuming, and consistent enough to learn from?
- Do users have examples of good outputs we can learn from?
- What’s the worst plausible failure? Is there a safe fallback?
What makes AI user research different
- Probabilistic outputs: You plan for variability and edge cases, not single outcomes.
- Data dependency: Use research to surface what data exists, whether labeling is feasible, and which privacy constraints apply.
- Human-in-the-loop: Define when humans review, edit, or override outputs.
- Risk & ethics: Explore harms, bias, fairness, and consent early.
- Evaluation: Create rubrics and sample sets reflecting real user definitions of quality.
Methods you can use
- Contextual inquiry: Observe real tasks, inputs, and decision checkpoints.
- JTBD interviews: Uncover desired outcomes and success criteria.
- Wizard-of-Oz: Simulate AI behind the scenes to validate usefulness and UI.
- Prompt co-design sessions: Pair with users to craft prompts and guardrails.
- Red-teaming sessions: Ask users to intentionally break or stress the system.
- Diary studies: Track repeated tasks and variance across days or cases.
- Prototype A/B: Compare baseline workflow vs. AI-assisted workflow.
Worked examples
1) Support email triage assistant
Goal: Route and summarize incoming emails for faster resolution.
- Research questions: What categories matter? What info must be extracted? What are unacceptable routing errors?
- Method: Contextual inquiry + Wizard-of-Oz triage with a researcher behind the scenes.
- Data/ethics: Personal data in emails—define redactions and retention. Get consent for using historical tickets.
- Quality bar: 95% correct routing across the top 5 categories; clear rationale in summary; human can correct in 1 click (see the measurement sketch after this example).
- Outcome: Requirements include must-have fields (account ID, urgency, product area), confidence display, and edit controls.
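A minimal sketch of how you might check a routing quality bar like the 95% target above against a small labeled sample from your Wizard-of-Oz sessions. The category names and field names are hypothetical, not part of the lesson.

```python
# Sketch: per-category routing accuracy vs. a quality bar (hypothetical data).
from collections import defaultdict

labeled_sample = [
    {"true": "billing", "predicted": "billing"},
    {"true": "billing", "predicted": "account"},
    {"true": "outage", "predicted": "outage"},
    # ...more labeled tickets collected during research sessions
]

QUALITY_BAR = 0.95  # target: 95% correct routing on the top categories

totals, correct = defaultdict(int), defaultdict(int)
for ticket in labeled_sample:
    totals[ticket["true"]] += 1
    if ticket["predicted"] == ticket["true"]:
        correct[ticket["true"]] += 1

for category, n in totals.items():
    accuracy = correct[category] / n
    status = "meets bar" if accuracy >= QUALITY_BAR else "below bar"
    print(f"{category}: {accuracy:.0%} over {n} tickets ({status})")
```

Even a tiny script like this keeps the quality-bar conversation concrete: you can show stakeholders exactly which categories are below the bar and how many examples the judgment is based on.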
2) Sales call summarizer
Goal: Turn call transcripts into CRM-ready notes with action items.
- Research questions: Which summary sections are valuable? What mistakes break trust (e.g., hallucinated pricing)?
- Method: Prompt co-design with reps, then red-team with tricky jargon and accents.
- Data/ethics: Consent for recording; filter PII; regional compliance considerations.
- Quality bar: 90% accurate action items; zero invented discounts; clear timestamp citations for key claims.
- Outcome: UX includes citation links to transcript segments and a mandatory confirmation step before saving.
3) Coding assistant for internal tools
Goal: Help engineers write boilerplate and fix common errors.
- Research questions: Where does it help most—generating tests, refactoring, or explaining errors?
- Method: Diary study + pair sessions to observe failures and recovery.
- Data/ethics: No sharing of proprietary code externally; strong on-device or private model constraints.
- Quality bar: 80% of tasks completed at least 20% faster; zero code suggestions that violate security lint rules (see the sketch after this example).
- Outcome: Guardrails to block insecure patterns; inline explanations with references to docs.
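A minimal sketch (with made-up diary-study numbers) of how the "80% of tasks at least 20% faster" bar could be computed from baseline vs. assisted task times:

```python
# Sketch: share of tasks that hit the speed-up threshold (hypothetical data).
tasks = [
    {"task": "write unit test", "baseline_min": 30, "assisted_min": 18},
    {"task": "fix lint error", "baseline_min": 10, "assisted_min": 9},
    {"task": "refactor helper", "baseline_min": 45, "assisted_min": 30},
]

SPEEDUP_THRESHOLD = 0.20  # a task counts if it is at least 20% faster
TARGET_SHARE = 0.80       # quality bar: 80% of tasks must hit the threshold

faster = [
    t for t in tasks
    if (t["baseline_min"] - t["assisted_min"]) / t["baseline_min"] >= SPEEDUP_THRESHOLD
]
share = len(faster) / len(tasks)
print(f"{share:.0%} of tasks were 20%+ faster (target: {TARGET_SHARE:.0%})")
```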
Step-by-step plan
- Define the job-to-be-done: outcome, user, and constraints.
- Map the workflow: inputs, decision points, and failure impact.
- Select methods: interviews, observations, and a low-fidelity AI simulation.
- Create an evaluation rubric: define what acceptable quality looks like and how you will measure it.
- Run sessions, iterate prompts/UX, and log failure modes and mitigations.
Templates to copy
Rubric fields: scenario, desired outcome, unacceptable errors, acceptable errors, evidence/citations, confidence display, time-to-correct, final rating (1-5).
Session log: user role, task, prompt/inputs, output, user edits, time saved, issues spotted, suggestions.
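If it helps to keep these templates consistent across sessions, here is one possible way to capture them as structured records. This is a sketch under assumed field types, not a prescribed schema:

```python
# Sketch: the rubric and session-log templates as typed records.
from dataclasses import dataclass, field

@dataclass
class RubricEntry:
    scenario: str
    desired_outcome: str
    unacceptable_errors: list[str]
    acceptable_errors: list[str]
    evidence_citations: bool        # were citations/evidence shown?
    confidence_display: bool        # was confidence surfaced to the user?
    time_to_correct_sec: float
    final_rating: int               # 1-5

@dataclass
class SessionLog:
    user_role: str
    task: str
    prompt_inputs: str
    output: str
    user_edits: str
    time_saved_min: float
    issues_spotted: list[str] = field(default_factory=list)
    suggestions: list[str] = field(default_factory=list)
```

A spreadsheet with the same columns works just as well; the point is to fix the fields before the first session so results stay comparable.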
Interview and test scripts
- Warm-up: Walk me through the last time you did [task]. What made it hard?
- Inputs: What information do you look at? Which sources do you trust?
- Quality: Show me an example of a "good" result. Why is it good?
- Risk: What mistake would be annoying? Which mistake would be unacceptable?
- Controls: How would you want to review or edit an AI suggestion?
- Signals: What would help you trust or distrust an output?
Consent and safe prompting notes
- Obtain permission to view or use any real data; redact sensitive info.
- Tell participants the system may be imperfect; ask them to think aloud about any doubts.
- Stop if sensitive data appears; replace with realistic placeholders.
Data, safety, and evaluation checklist
- Data sources identified and documented (availability, freshness, sensitivity)
- Consent and redaction process defined
- Evaluation rubric agreed with users and stakeholders
- High-impact failure modes listed with mitigations
- Human-in-the-loop steps defined (when, who, how)
- Logging plan for errors, user edits, and feedback
- Success metrics tied to workflow outcomes (time saved, quality, adoption)
Exercises
Exercise 1: Draft a lean AI research plan
Pick a task your users do weekly. Write a one-page plan covering the job-to-be-done, risks, methods, participants, and evaluation rubric.
Hints
- Choose a task with clear inputs and visible outcomes.
- Define one unacceptable error and how you'd prevent or catch it.
Show solution
Example structure: Objective, Target users (5 support agents), Methods (2 interviews, 3 WoZ sessions), Data (redacted tickets), Risks (PII exposure; mitigation: redaction), Rubric (5-point scale on correctness, clarity, edit time), Success (95% correct routing on top categories).
Expected output: One-page plan with rubric fields and risk mitigations.
Exercise 2: Create an evaluation rubric and sample set
Assemble 12 real or realistic task examples. Define rating criteria and unacceptable errors. Include at least 3 edge cases.
Hints
- Balance common cases with rare but high-impact ones.
- Track time-to-correct and edits required as part of your rubric.
Show solution
Rubric example: Correctness (0-2), Completeness (0-2), Harmful error (Yes/No, auto-fail), Citations present (Yes/No), Edit time < 30s (Yes/No). Sample set: 6 common, 3 tricky language/jargon, 3 edge cases with ambiguous inputs.
Expected output: A rubric table and a labeled dataset of 12 cases with ground truth or desired outcomes.
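To show how an auto-fail rule interacts with graded criteria, here is a minimal scoring sketch for the rubric above. Field names are hypothetical:

```python
# Sketch: apply the Exercise 2 rubric; harmful errors auto-fail the case.
def score_case(case: dict) -> int:
    """Return a rubric score of 0-6, with harmful errors forcing a 0."""
    if case["harmful_error"]:          # Yes/No, auto-fail
        return 0
    score = case["correctness"] + case["completeness"]  # each graded 0-2
    score += 1 if case["citations_present"] else 0
    score += 1 if case["edit_time_under_30s"] else 0
    return score

example = {
    "correctness": 2,
    "completeness": 1,
    "harmful_error": False,
    "citations_present": True,
    "edit_time_under_30s": False,
}
print(score_case(example))  # 4
```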
Self-check
- Do your methods match your questions (discovery vs. validation)?
- Would your rubric catch the failures users fear most?
- Is there a clear plan to recover from errors?
Common mistakes and how to self-check
- Testing on happy paths only → Add edge cases and red-teaming prompts.
- Measuring model accuracy but not workflow impact → Track time saved and edit counts.
- Ignoring consent and privacy → Define redaction rules and retention upfront.
- Hiding uncertainty → Expose confidence and show evidence/citations.
- Over-automation → Keep an easy way to override or revert to manual.
Quick audit
- What’s the worst plausible failure and its mitigation?
- Where does a human review? What UI affordance supports that?
- How will you learn from user edits post-launch?
Practical projects
- Run a 3-session Wizard-of-Oz test for an AI assistant in your domain; report impact, failures, and updated requirements.
- Build a 15-case evaluation set, score two prompt variants with your rubric, and share the results.
- Design a trust UI: confidence indicator + citation pattern, and test it with 3 users.
Mini challenge
In 20 minutes, sketch a single screen showing: user input, AI output with citations, a confidence badge, and a one-click correction flow. List the top two unacceptable errors your design guards against.
Learning path
- Foundations: JTBD interviews and workflow mapping.
- AI-specific: Wizard-of-Oz and prompt co-design.
- Evaluation: Rubric design, edge cases, and human-in-the-loop criteria.
- Safety: Bias, privacy, and failure mitigation.
- Delivery: Translate findings into clear product requirements and metrics.
Next steps
- Pick one workflow and run a lean WoZ test this week.
- Create your 12-case evaluation set and share with your team.
- Document unacceptable errors and add them to your PRD as guardrail requirements.
Who this is for
AI Product Managers, UX Researchers, and Designers shaping AI-assisted workflows and defining acceptance criteria, risks, and evaluation methods.
Prerequisites
- Basic user interview skills and note-taking.
- Understanding of your product’s domain and key workflows.
- High-level knowledge of AI limitations (probabilistic outputs, potential biases).
Quick Test
Take the quick test below to check your understanding.