Why this matters
AI products behave probabilistically, improve with data, and can fail in surprising ways. Good user research uncovers real jobs-to-be-done, validates where AI helps (and where it harms), and shapes requirements like data needs, quality bars, guardrails, and human-in-the-loop steps.
- Identify high-value, repetitive tasks that AI can reliably assist with.
- Map user workflows to find the right AI insertion point and human review moments.
- Define quality thresholds (precision/recall), acceptable errors, and recovery paths.
- Surface risks: bias, privacy, safety, explainability, and compliance.
Real tasks in the AI Product Manager role
- Interview target users to uncover decision points and data sources they trust.
- Run Wizard-of-Oz tests to simulate AI behavior before building the model.
- Design evaluation rubrics and acceptance criteria (what is "good enough").
- Co-create prompts and guardrails with users for safer, more useful outputs.
- Define success metrics tied to workflow outcomes, not just model benchmarks.
Concept explained simply
User research for AI products is learning how people make decisions, what data they rely on, and where imperfect AI assistance adds value with minimal risk. You test usefulness and trust—before you invest in heavy engineering.
Mental model
Think in three layers:
- Task layer: What job needs doing? What does "good" look like?
- Evidence layer: What inputs, context, and constraints shape the decision?
- AI layer: Where can AI assist, what errors are tolerable, and how do we recover?
Quick litmus tests
- Is the task frequent, time-consuming, and consistent enough to learn from?
- Do users have examples of good outputs we can learn from?
- What’s the worst plausible failure? Is there a safe fallback?
What makes AI user research different
- Probabilistic outputs: You plan for variability and edge cases, not single outcomes.
- Data dependency: Use research to surface what data exists, whether labeling is feasible, and which privacy constraints apply.
- Human-in-the-loop: Define when humans review, edit, or override outputs.
- Risk & ethics: Explore harms, bias, fairness, and consent early.
- Evaluation: Create rubrics and sample sets reflecting real user definitions of quality.
Methods you can use
- Contextual inquiry: Observe real tasks, inputs, and decision checkpoints.
- JTBD interviews: Uncover desired outcomes and success criteria.
- Wizard-of-Oz: Simulate AI behind the scenes to validate usefulness and UI.
- Prompt co-design sessions: Pair with users to craft prompts and guardrails.
- Red-teaming sessions: Ask users to intentionally break or stress the system.
- Diary studies: Track repeated tasks and variance across days or cases.
- Prototype A/B: Compare baseline workflow vs. AI-assisted workflow.
Worked examples
1) Support email triage assistant
Goal: Route and summarize incoming emails for faster resolution.
- Research questions: What categories matter? What info must be extracted? What are unacceptable routing errors?
- Method: Contextual inquiry + Wizard-of-Oz triage with a researcher behind the scenes.
- Data/ethics: Personal data in emails—define redactions and retention. Get consent for using historical tickets.
- Quality bar: 95% correct routing across the top 5 categories; clear rationale in summary; human can correct in 1 click (see the measurement sketch after this example).
- Outcome: Requirements include must-have fields (account ID, urgency, product area), confidence display, and edit controls.
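A minimal sketch of how you might check a routing quality bar like the 95% target above against a small labeled sample from your Wizard-of-Oz sessions. The category names and field names are hypothetical, not part of the lesson.

```python
# Sketch: per-category routing accuracy vs. a quality bar (hypothetical data).
from collections import defaultdict

labeled_sample = [
    {"true": "billing", "predicted": "billing"},
    {"true": "billing", "predicted": "account"},
    {"true": "outage", "predicted": "outage"},
    # ...more labeled tickets collected during research sessions
]

QUALITY_BAR = 0.95  # target: 95% correct routing on the top categories

totals, correct = defaultdict(int), defaultdict(int)
for ticket in labeled_sample:
    totals[ticket["true"]] += 1
    if ticket["predicted"] == ticket["true"]:
        correct[ticket["true"]] += 1

for category, n in totals.items():
    accuracy = correct[category] / n
    status = "meets bar" if accuracy >= QUALITY_BAR else "below bar"
    print(f"{category}: {accuracy:.0%} over {n} tickets ({status})")
```

Even a tiny script like this keeps the quality-bar conversation concrete: you can show stakeholders exactly which categories are below the bar and how many examples the judgment is based on.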
2) Sales call summarizer
Goal: Turn call transcripts into CRM-ready notes with action items.
- Research questions: Which summary sections are valuable? What mistakes break trust (e.g., hallucinated pricing)?
- Method: Prompt co-design with reps, then red-team with tricky jargon and accents.
- Data/ethics: Consent for recording; filter PII; regional compliance considerations.
- Quality bar: 90% accurate action items; zero invented discounts; clear timestamp citations for key claims.
- Outcome: UX includes citation links to transcript segments and a mandatory confirmation step before saving.
3) Coding assistant for internal tools
Goal: Help engineers write boilerplate and fix common errors.
- Research questions: Where does it help most—generating tests, refactoring, or explaining errors?
- Method: Diary study + pair sessions to observe failures and recovery.
- Data/ethics: No sharing of proprietary code externally; strong on-device or private model constraints.
- Quality bar: 80% of tasks completed at least 20% faster; zero code suggestions that violate security lint rules (see the sketch after this example).
- Outcome: Guardrails to block insecure patterns; inline explanations with references to docs.
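A minimal sketch (with made-up diary-study numbers) of how the "80% of tasks at least 20% faster" bar could be computed from baseline vs. assisted task times:

```python
# Sketch: share of tasks that hit the speed-up threshold (hypothetical data).
tasks = [
    {"task": "write unit test", "baseline_min": 30, "assisted_min": 18},
    {"task": "fix lint error", "baseline_min": 10, "assisted_min": 9},
    {"task": "refactor helper", "baseline_min": 45, "assisted_min": 30},
]

SPEEDUP_THRESHOLD = 0.20  # a task counts if it is at least 20% faster
TARGET_SHARE = 0.80       # quality bar: 80% of tasks must hit the threshold

faster = [
    t for t in tasks
    if (t["baseline_min"] - t["assisted_min"]) / t["baseline_min"] >= SPEEDUP_THRESHOLD
]
share = len(faster) / len(tasks)
print(f"{share:.0%} of tasks were 20%+ faster (target: {TARGET_SHARE:.0%})")
```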
Step-by-step plan
- Define the job-to-be-done: outcome, user, and constraints.
- Map the workflow: inputs, decision points, and failure impact.
- Select methods: interviews, observations, and a low-fidelity AI simulation.
- Create an evaluation rubric: define what acceptable quality looks like and how you will measure it.
- Run sessions, iterate prompts/UX, and log failure modes and mitigations.
Templates to copy
Rubric fields: scenario, desired outcome, unacceptable errors, acceptable errors, evidence/citations, confidence display, time-to-correct, final rating (1-5).
Session log: user role, task, prompt/inputs, output, user edits, time saved, issues spotted, suggestions.
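If it helps to keep these templates consistent across sessions, here is one possible way to capture them as structured records. This is a sketch under assumed field types, not a prescribed schema:

```python
# Sketch: the rubric and session-log templates as typed records.
from dataclasses import dataclass, field

@dataclass
class RubricEntry:
    scenario: str
    desired_outcome: str
    unacceptable_errors: list[str]
    acceptable_errors: list[str]
    evidence_citations: bool        # were citations/evidence shown?
    confidence_display: bool        # was confidence surfaced to the user?
    time_to_correct_sec: float
    final_rating: int               # 1-5

@dataclass
class SessionLog:
    user_role: str
    task: str
    prompt_inputs: str
    output: str
    user_edits: str
    time_saved_min: float
    issues_spotted: list[str] = field(default_factory=list)
    suggestions: list[str] = field(default_factory=list)
```

A spreadsheet with the same columns works just as well; the point is to fix the fields before the first session so results stay comparable.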
Interview and test scripts
- Warm-up: Walk me through the last time you did [task]. What made it hard?
- Inputs: What information do you look at? Which sources do you trust?
- Quality: Show me an example of a "good" result. Why is it good?
- Risk: What mistake would be annoying? Which mistake would be unacceptable?
- Controls: How would you want to review or edit an AI suggestion?
- Signals: What would help you trust or distrust an output?
Consent and safe prompting notes
- Obtain permission to view or use any real data; redact sensitive info.
- Tell participants the system may be imperfect; ask them to think aloud about any doubts.
- Stop if sensitive data appears; replace with realistic placeholders.
Data, safety, and evaluation checklist
- Data sources identified and documented (availability, freshness, sensitivity)
- Consent and redaction process defined
- Evaluation rubric agreed with users and stakeholders
- High-impact failure modes listed with mitigations
- Human-in-the-loop steps defined (when, who, how)
- Logging plan for errors, user edits, and feedback
- Success metrics tied to workflow outcomes (time saved, quality, adoption)
Exercises
Exercise 1: Draft a lean AI research plan
Pick a task your users do weekly. Write a one-page plan covering the job-to-be-done, risks, methods, participants, and evaluation rubric.
Hints
- Choose a task with clear inputs and visible outcomes.
- Define one unacceptable error and how you'd prevent or catch it.
Show solution
Example structure: Objective, Target users (5 support agents), Methods (2 interviews, 3 WoZ sessions), Data (redacted tickets), Risks (PII exposure; mitigation: redaction), Rubric (5-point scale on correctness, clarity, edit time), Success (95% correct routing on top categories).
Expected output: One-page plan with rubric fields and risk mitigations.
Exercise 2: Create an evaluation rubric and sample set
Assemble 12 real or realistic task examples. Define rating criteria and unacceptable errors. Include at least 3 edge cases.
Hints
- Balance common cases with rare but high-impact ones.
- Track time-to-correct and edits required as part of your rubric.
Show solution
Rubric example: Correctness (0-2), Completeness (0-2), Harmful error (Yes/No, auto-fail), Citations present (Yes/No), Edit time < 30s (Yes/No). Sample set: 6 common, 3 tricky language/jargon, 3 edge cases with ambiguous inputs.
Expected output: A rubric table and a labeled dataset of 12 cases with ground truth or desired outcomes.
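To show how an auto-fail rule interacts with graded criteria, here is a minimal scoring sketch for the rubric above. Field names are hypothetical:

```python
# Sketch: apply the Exercise 2 rubric; harmful errors auto-fail the case.
def score_case(case: dict) -> int:
    """Return a rubric score of 0-6, with harmful errors forcing a 0."""
    if case["harmful_error"]:          # Yes/No, auto-fail
        return 0
    score = case["correctness"] + case["completeness"]  # each graded 0-2
    score += 1 if case["citations_present"] else 0
    score += 1 if case["edit_time_under_30s"] else 0
    return score

example = {
    "correctness": 2,
    "completeness": 1,
    "harmful_error": False,
    "citations_present": True,
    "edit_time_under_30s": False,
}
print(score_case(example))  # 4
```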
Self-check
- Do your methods match your questions (discovery vs. validation)?
- Would your rubric catch the failures users fear most?
- Is there a clear plan to recover from errors?
Common mistakes and how to self-check
- Testing on happy paths only → Add edge cases and red-teaming prompts.
- Measuring model accuracy but not workflow impact → Track time saved and edit counts.
- Ignoring consent and privacy → Define redaction rules and retention upfront.
- Hiding uncertainty → Expose confidence and show evidence/citations.
- Over-automation → Keep an easy way to override or revert to manual.
Quick audit
- What’s the worst plausible failure and its mitigation?
- Where does a human review? What UI affordance supports that?
- How will you learn from user edits post-launch?
Practical projects
- Run a 3-session Wizard-of-Oz test for an AI assistant in your domain; report impact, failures, and updated requirements.
- Build a 15-case evaluation set, score two prompt variants with your rubric, and share the results.
- Design a trust UI: confidence indicator + citation pattern, and test it with 3 users.
Mini challenge
In 20 minutes, sketch a single screen showing: user input, AI output with citations, a confidence badge, and a one-click correction flow. List the top two unacceptable errors your design guards against.
Learning path
- Foundations: JTBD interviews and workflow mapping.
- AI-specific: Wizard-of-Oz and prompt co-design.
- Evaluation: Rubric design, edge cases, and human-in-the-loop criteria.
- Safety: Bias, privacy, and failure mitigation.
- Delivery: Translate findings into clear product requirements and metrics.
Next steps
- Pick one workflow and run a lean WoZ test this week.
- Create your 12-case evaluation set and share with your team.
- Document unacceptable errors and add them to your PRD as guardrail requirements.
Who this is for
AI Product Managers, UX Researchers, and Designers shaping AI-assisted workflows and defining acceptance criteria, risks, and evaluation methods.
Prerequisites
- Basic user interview skills and note-taking.
- Understanding of your product’s domain and key workflows.
- High-level knowledge of AI limitations (probabilistic outputs, potential biases).
Quick Test
Take the quick test below to check your understanding.