Why this matters
As an AI Product Manager, you do not need to code models, but you must make confident decisions about use cases, data, metrics, cost, quality, and risks. These basics help you:
- Translate business goals into ML/LLM metrics and acceptance criteria.
- Choose between approaches (heuristics, classic ML, RAG, fine-tuning) with clear trade-offs.
- Design evaluation plans, test sets, and go/no-go launch thresholds.
- Estimate latency and cost, plan capacity, and monitor quality drift.
- Spot safety issues (prompt injection, bias) and set guardrails.
Who this is for
- Product Managers and aspiring PMs working on AI-powered features.
- Designers, data analysts, and founders coordinating with ML teams.
- Engineers moving into product roles who need a PM-focused view.
Prerequisites
- Comfort discussing metrics and basic statistics (averages, percentages).
- Basic understanding of APIs and product development life cycle.
- No coding required for this lesson.
Concept explained simply
Machine Learning (ML) learns patterns from examples to make predictions or decisions. Large Language Models (LLMs) generate or understand text by predicting the next token based on context.
- Supervised ML: learn from labeled examples (e.g., churn yes/no).
- Unsupervised ML: find structure without labels (e.g., clustering users).
- Reinforcement learning: learn via trial and error with rewards (e.g., tuning recommendations).
- LLMs: text-in, text-out; can be guided with prompts, augmented with retrieval (RAG), or updated via fine-tuning.
Key terms:
- Training/Validation/Test: learn, tune, then honestly measure generalization.
- Overfitting: the model memorizes the training data and fails on new data. Fix with simpler models, more data, regularization, or better validation.
- Metrics: choose based on business risk. For classification, understand precision, recall, F1, and thresholds (see the short sketch after this list). For regression, MAE/RMSE.
- LLM controls: temperature (randomness/creativity), max tokens (response length cap), system vs user instructions, few-shot examples.
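The lesson needs no code, but a short sketch can make these metric definitions concrete. A minimal example, assuming invented confusion-matrix counts rather than output from any real model:

```python
# Illustrative only: classification metrics from confusion-matrix counts.
# The counts are invented; in practice they come from your holdout set.
tp, fp, fn, tn = 80, 20, 40, 860  # true/false positives, false/true negatives

precision = tp / (tp + fp)  # of everything we flagged, how much was truly positive
recall = tp / (tp + fn)     # of everything truly positive, how much we caught
f1 = 2 * precision * recall / (precision + recall)  # harmonic mean of the two

print(f"precision={precision:.2f}  recall={recall:.2f}  f1={f1:.2f}")
# precision=0.80  recall=0.67  f1=0.73
```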
Mental model
Think of ML/LLM systems as factories:
- Inputs (data, prompt) go in; outputs (predictions, text) come out.
- Quality depends more on inputs, process, and checks than on model brand.
- As the PM, you define the spec: what quality means, how to measure it, how fast it must be, and how much it can cost.
Quick cheat-sheet
- Define the decision: what action will the model enable or automate?
- Map business risk to metric: costly misses -> recall; costly false alarms -> precision.
- Always get a baseline: majority class or a simple rule (see the sketch after this list).
- Offline first, then online: pass offline acceptance bar before A/B testing.
- LLM choice: RAG for up-to-date/factual; fine-tune for style/format/behavior on stable data.
- Monitor slices: user segments, languages, lengths, domains.
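A minimal sketch of why the baseline matters, assuming an invented dataset with 5% positives: a "model" that always predicts the majority class looks accurate yet catches nothing, so accuracy alone cannot be the shipping bar.

```python
# Illustrative only: a majority-class baseline on invented data (5% positives).
labels = [0] * 950 + [1] * 50      # e.g., 1 = churned, 0 = stayed (made up)
predictions = [0] * len(labels)    # baseline: always predict the majority class

accuracy = sum(p == y for p, y in zip(predictions, labels)) / len(labels)
recall = sum(p == 1 and y == 1 for p, y in zip(predictions, labels)) / sum(labels)

print(f"accuracy={accuracy:.2%}  recall={recall:.0%}")
# accuracy=95.00%  recall=0% -> a real model must clearly beat this to be worth shipping
```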
Worked examples
Example 1: Fraud alerts (classification threshold)
Business: Minimize missed fraud (false negatives) while keeping review workload manageable.
- Metric choice: prioritize recall; set a minimum precision so analysts are not overwhelmed.
- Spec: "Recall ≥ 95% on last 3 months data, Precision ≥ 20%, latency ≤ 300 ms."
- Thresholding: sweep thresholds, pick the point that meets both constraints.
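A sketch of that threshold sweep, assuming invented scores and labels in place of a real holdout set; the output lists which thresholds satisfy both the recall and precision constraints from the spec.

```python
# Illustrative only: sweep decision thresholds and check the spec at each one.
# Scores and labels are invented; in practice they come from a held-out test set.
import random

random.seed(0)
labels = [1] * 50 + [0] * 950  # 5% fraud, made up
scores = [random.betavariate(8, 2) if y else random.betavariate(2, 8) for y in labels]

def precision_recall(threshold):
    preds = [s >= threshold for s in scores]
    tp = sum(p and y for p, y in zip(preds, labels))
    fp = sum(p and not y for p, y in zip(preds, labels))
    fn = sum((not p) and y for p, y in zip(preds, labels))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

for t in [i / 10 for i in range(1, 10)]:  # thresholds 0.1 .. 0.9
    p, r = precision_recall(t)
    ok = p >= 0.20 and r >= 0.95          # the constraints from the spec above
    print(f"threshold={t:.1f}  precision={p:.2f}  recall={r:.2f}  meets spec: {ok}")
```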
What success looks like
Confusion matrix at chosen threshold shows most fraud caught (high recall). Review queue size fits team capacity. Post-launch monitoring shows stable rates across card types.
Example 2: Delivery time prediction (regression)
Business: Estimate ETA to set customer expectations.
- Baseline: median of historical ETAs by city and hour.
- Metric: MAE (minutes). Spec: "MAE ≤ 6 minutes on holdout, p95 latency ≤ 200 ms."
- Decision: ship if MAE improves on the baseline by ≥ 20% in the top 5 cities and does not regress by more than 5% on long-tail routes.
Why MAE over RMSE?
MAE is easier to communicate (average absolute error in minutes) and less dominated by rare outliers.
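A minimal sketch of that comparison, assuming invented delivery times: the hypothetical model beats the median baseline on MAE, and the single outlier inflates RMSE far more than MAE.

```python
# Illustrative only: MAE and RMSE for a model vs a median baseline on invented ETAs.
actual = [28, 35, 22, 40, 31, 65]    # actual delivery minutes (65 is an outlier)
baseline = [30] * len(actual)        # e.g., historical median ETA for this slice
model = [27, 36, 24, 38, 30, 50]     # hypothetical model predictions

def mae(pred, truth):
    return sum(abs(p - t) for p, t in zip(pred, truth)) / len(truth)

def rmse(pred, truth):
    return (sum((p - t) ** 2 for p, t in zip(pred, truth)) / len(truth)) ** 0.5

print(f"baseline MAE={mae(baseline, actual):.1f} min   model MAE={mae(model, actual):.1f} min")
print(f"baseline RMSE={rmse(baseline, actual):.1f} min  model RMSE={rmse(model, actual):.1f} min")
# The single 65-minute outlier dominates RMSE much more than it dominates MAE.
```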
Example 3: Knowledge assistant (LLM: RAG vs fine-tune)
Business: Employees ask questions about internal policies that change monthly.
- RAG: fetch relevant documents, ground the answer with citations; no retraining when docs update.
- Fine-tune: better for tone/format consistency; weaker for freshness, since every content update requires retraining.
- Spec: "Answer grounded with at least one citation 95% of the time, factuality score ≥ 4/5 in human eval, cost ≤ $0.02 per answer, median latency ≤ 2 s."
Decision
Choose RAG plus a prompt template with a citation requirement (sketched below); revisit fine-tuning for stylistic consistency once the content stabilizes.
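A sketch of how such a grounded prompt could be assembled. Everything here is hypothetical: the POLICY_DOCS snippets are invented, the retrieve function is a naive keyword overlap standing in for a real vector index, and no actual LLM is called.

```python
# Illustrative only: assemble a grounded, citation-requiring prompt for a policy Q&A bot.
POLICY_DOCS = {
    "travel-policy-v3": "Employees may book economy class for flights under 6 hours.",
    "expense-policy-v8": "Meal expenses are reimbursed up to 50 USD per day with receipts.",
    "remote-work-v2": "Remote work abroad requires manager approval for stays over 30 days.",
}

def retrieve(question, k=2):
    """Rank docs by word overlap with the question (toy stand-in for vector search)."""
    q_words = set(question.lower().split())
    ranked = sorted(
        POLICY_DOCS.items(),
        key=lambda item: len(q_words & set(item[1].lower().split())),
        reverse=True,
    )
    return ranked[:k]

def build_prompt(question):
    context = "\n".join(f"[{doc_id}] {text}" for doc_id, text in retrieve(question))
    return (
        "Answer using ONLY the sources below. Cite at least one source id in brackets.\n"
        "If the sources do not contain the answer, say so.\n"
        f"Sources:\n{context}\n\nQuestion: {question}\nAnswer:"
    )

print(build_prompt("What is the meal reimbursement limit per day"))
```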
Hands-on exercises
Do these now. They mirror the graded exercises below.
Exercise 1 (ex1): Map business risk to ML metric
- Scenario: Churn prediction to trigger a retention offer. Missing a churning user costs $100; offering to a non-churning user costs $5.
- Choose the primary metric and thresholding strategy.
- Given this confusion matrix at one threshold (TP=300, FP=150, FN=100, TN=450), compute precision and recall. Would you raise or lower the threshold?
Hint
Compare costs of FN vs FP. Precision = TP/(TP+FP). Recall = TP/(TP+FN).
- Checklist: chose metric aligned to cost, computed both precision and recall, justified threshold direction.
Exercise 2 (ex2): RAG or fine-tune?
- Scenario: Customer support bot must answer questions about a product catalog that changes daily. Tone should be on-brand and concise.
- Pick an approach (RAG, fine-tune, or both) and justify.
- Define 3 evaluation slices and 3 acceptance criteria (quality, latency, cost).
Hint
If facts change often, retrieval helps. Style can be controlled via prompt or light fine-tune.
- Checklist: chosen approach, slices include freshness and query difficulty, acceptance criteria are measurable.
Common mistakes and how to self-check
- Optimizing the wrong metric: Always connect metric to business cost; document false positive/negative costs.
- No baseline: Start with a simple rule or majority class; require clear improvement to ship.
- Data leakage: Make sure features aren’t using future information; keep a true holdout set (see the time-split sketch after this list).
- Ignoring slices: Evaluate segments (e.g., new users, long texts, non-English).
- LLM prompts without grounding: For factual tasks, use retrieval and require citations.
- Unbounded costs: Track tokens, average and p95 latency, and implement caching.
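A minimal sketch of the time-based split behind the data-leakage point: train on older records and hold out the most recent ones, so neither features nor labels peek into the future. The records and cutoff date are invented.

```python
# Illustrative only: split invented records by time instead of randomly to avoid leakage.
from datetime import date

# (event_date, features, label) tuples, made up for the example
records = [
    (date(2024, 1, 5), {"orders": 3}, 0),
    (date(2024, 2, 11), {"orders": 1}, 1),
    (date(2024, 3, 2), {"orders": 7}, 0),
    (date(2024, 4, 20), {"orders": 2}, 1),
    (date(2024, 5, 14), {"orders": 4}, 0),
]

cutoff = date(2024, 4, 1)  # everything on or after this date is the true holdout
train = [r for r in records if r[0] < cutoff]
holdout = [r for r in records if r[0] >= cutoff]

print(f"train: {len(train)} rows up to {max(r[0] for r in train)}")
print(f"holdout: {len(holdout)} rows from {min(r[0] for r in holdout)}")
```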
Self-check audit
- Do we have acceptance criteria with numbers for quality, latency, and cost?
- Is the test set truly unseen and recent?
- Are there clear go/no-go thresholds that reflect business risk?
- Do we have monitoring for drift and safety incidents post-launch?
Practical projects
- Project A: Classification spec doc. Deliver a 1–2 page spec for a lead-scoring model with baseline, metric targets, threshold policy, and monitoring plan.
- Project B: LLM evaluation plan. Create a rubric and golden set (20–50 prompts) for a policy Q&A assistant, including grounding and harmful-content checks.
- Project C: Cost-latency model. Build a simple spreadsheet estimating token usage, per-request cost, and p95 latency with and without caching/batching.
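Project C can start as a small script instead of a spreadsheet. A sketch with assumed numbers only: the request volume, token counts, prices, cache hit rate, and latencies are placeholders to replace with your provider's pricing and your own measurements.

```python
# Illustrative only: back-of-the-envelope cost and latency model for an LLM feature.
# Every number below is an assumption, not real pricing or measured latency.
requests_per_day = 50_000
input_tokens, output_tokens = 1_200, 300   # average tokens per request (assumed)
price_in, price_out = 0.50, 1.50           # USD per 1M input/output tokens (assumed)
cache_hit_rate = 0.30                      # share of requests answered from cache (assumed)
p95_model_ms, p95_cache_ms = 1_800, 50     # rough p95 latency for a model call vs a cache hit

cost_per_request = (input_tokens * price_in + output_tokens * price_out) / 1_000_000
daily_cost_no_cache = requests_per_day * cost_per_request
daily_cost_cached = requests_per_day * (1 - cache_hit_rate) * cost_per_request

# Crude blended p95: unless almost everything hits the cache, the tail is set by model calls.
blended_p95_ms = p95_cache_ms if cache_hit_rate >= 0.95 else p95_model_ms

print(f"cost per request: ${cost_per_request:.4f}")
print(f"daily cost without cache: ${daily_cost_no_cache:,.2f}")
print(f"daily cost with {cache_hit_rate:.0%} cache hits: ${daily_cost_cached:,.2f}")
print(f"blended p95 latency (rough): {blended_p95_ms} ms")
```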
Learning path
Step 1: Frame the problem
Define decision, user, success metric, and constraints.
- Output: one-paragraph problem statement.
Step 2: Baseline and data
Draft a rule-based baseline and identify required data and labels.
- Output: baseline description + data sources + label quality risks.
Step 3: Offline evaluation
Set acceptance criteria and test set slices; choose metrics.
- Output: metric table with thresholds and slice coverage.
Step 4: Online guardrails
Plan rate limits, abuse filters, and safety checks.
- Output: guardrail checklist and rollback plan.
Step 5: Launch and monitor
Define KPIs, alerting, and retraining cadence.
- Output: dashboard mock and on-call playbook.
Mini challenge
Pick a feature in your product that could use ML/LLM. In 5 sentences, write:
- The user decision or action aided by the model.
- Primary offline metric and threshold with justification.
- Online success metric (KPI) and ramp plan.
- Cost and latency budget.
- Two biggest risks and how you will monitor them.
Next steps
- Turn one practical project into a shareable one-pager for stakeholder feedback.
- Run a small pilot with a clear offline acceptance bar and a limited rollout.
- Set up monitoring early: quality slices, latency, cost, and safety incidents.