Why this skill matters for Applied Scientists
Research Problem Framing is how you turn ambiguous goals into rigorous, testable work. As an Applied Scientist, you will be asked to improve a product metric, reduce risk, or invent a new capability. Clear framing ensures you select the right objective, design feasible experiments, manage risk, and deliver measurable value—not just interesting models.
- Translate business goals into research questions and hypotheses.
- Choose success criteria, baselines, and stopping rules before running experiments.
- Decide scope and feasibility: data, compute, time, and stakeholder constraints.
- Plan experiments, anticipate risks, and communicate tradeoffs.
What you’ll be able to do
- Write a crisp research problem statement with decision boundaries.
- Define metrics, baselines, and sample sizes that align with business goals.
- Plan offline and online experiments with realistic timelines.
- Document risks, tradeoffs, and mitigation strategies.
Who this is for
- Applied Scientists and ML Engineers who need to turn ideas into shippable experiments.
- Data Scientists transitioning from analytics to product-facing research.
- Researchers who want stronger product impact and stakeholder alignment.
Prerequisites
- Comfort with Python or R for data analysis.
- Basic statistics: hypothesis testing, confidence intervals, power.
- Familiarity with common ML tasks (classification, ranking, forecasting).
Learning path: practical roadmap
- Clarify the business goal (use a PRFAQ-style framing: Problem, Result, FAQ)
  - Problem: What user or business pain are we solving?
  - Result: What decision will be made when results arrive?
  - FAQ: What is out of scope? What does success unlock?
- Formulate research questions and hypotheses (from goal to testable statements)
  - Primary metric and direction of improvement.
  - Minimum detectable effect (MDE) that matters.
  - Potential harms to monitor.
- Survey prior art (literature and production history)
  - Identify methods with proven lift and known failure modes.
  - Summarize 3–5 candidate approaches and the data each requires.
- Define success criteria and baselines (guard against p-hacking)
  - Write metrics, baselines, and decision rules before you test (see the pre-registration sketch after this list).
  - Include a simple baseline and a production-as-is baseline.
- Scope and feasibility (time, data, compute, and dependencies)
  - What can be built in 2–6 weeks?
  - What is the smallest slice that proves or de-risks the idea?
- Experiment plan (offline → limited online)
  - Offline validation and backtesting plan.
  - Pilot A/B or interleaving test, with guardrail metrics and stopping rules.
- Risks and tradeoffs (pre-mortem)
  - What could fail? How would we know quickly?
  - What do we sacrifice (latency, cost, fairness)?
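To make the later steps concrete, here is a minimal sketch of a pre-registered plan written down before the experiment runs; the field names and example values are illustrative assumptions, not a required schema:

# Illustrative pre-registration record: freeze this before running the experiment.
preregistration = {
    "research_question": "Does model X improve metric Y for cohort Z?",
    "primary_metric": "conversion_rate",
    "mde_absolute": 0.006,                    # smallest effect worth shipping
    "baselines": ["production_as_is", "simple_heuristic"],
    "guardrails": {"p95_latency_ms": 50, "opt_out_rate": 0.01},
    "decision_rule": "ship if lift >= MDE with 95% CI excluding 0 and guardrails hold",
    "stopping_rule": "halt early only on a guardrail breach; otherwise run full duration",
    "planned_duration_days": 14,
}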
Worked examples (with reasoning and code)
Example 1 — Search ranking relevance
Goal: Improve perceived relevance of top results.
- Research question: Does a cross-encoder reranker increase NDCG@10 versus current BM25+LTR?
- Hypothesis (H1): Reranker improves NDCG@10 by ≥ 1.5 points.
- Success criteria: Offline NDCG@10 lift of +1.5 ±0.7 points (95% CI) and online CTR +1.0%, with p95 latency degrading by no more than 50 ms.
- Baselines: Production (BM25+LTR) and a simple baseline (bi-encoder retrieval only).
# Compute NDCG@10 per query; the lift vs production is the mean per-query difference
import numpy as np

def ndcg_at_k(rels, k=10):
    """rels: graded relevance labels in the order the system ranked them."""
    rels = np.asarray(rels, dtype=float)
    discounts = np.log2(np.arange(2, len(rels) + 2))
    dcg = np.sum((2 ** rels - 1)[:k] / discounts[:k])
    ideal = np.sort(rels)[::-1]  # best possible ordering of the same labels
    idcg = np.sum((2 ** ideal - 1)[:k] / discounts[:k])
    return dcg / idcg if idcg > 0 else 0.0

# Bootstrapped CI of the lift: see the sketch below.
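A minimal sketch of the bootstrapped CI, assuming hypothetical per-query score arrays ndcg_prod and ndcg_reranker (synthetic placeholders here); resampling queries with replacement keeps the paired structure:

# Paired bootstrap over queries for the mean NDCG@10 lift.
import numpy as np

rng = np.random.default_rng(42)
# Placeholder per-query NDCG@10 scores; replace with real evaluation output.
ndcg_prod = rng.uniform(0.3, 0.8, size=2000)
ndcg_reranker = ndcg_prod + rng.normal(0.015, 0.05, size=2000)

diffs = ndcg_reranker - ndcg_prod
boot_means = np.array([rng.choice(diffs, size=len(diffs), replace=True).mean()
                       for _ in range(5000)])
lo, hi = np.percentile(boot_means, [2.5, 97.5])
print(f"mean lift: {diffs.mean():.4f}, 95% CI: [{lo:.4f}, {hi:.4f}]")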
Decision rule: If offline lift meets criteria and p95 latency delta ≤ 20ms on canary, proceed to 10% online A/B.
Example 2 — Churn prediction for support reduction
Goal: Reduce monthly churn-related tickets by 10%.
- Research question: Can a calibrated classifier enable proactive outreach that lowers churn tickets?
- Primary metric: Tickets per 1k users. Guardrails: Outreach opt-out rate, CSAT.
- Baseline: Heuristic rule (last activity more than 14 days ago).
# Threshold tuning under cost asymmetry
# cost(false negative) = 5x cost(false positive)
# Choose threshold that minimizes expected cost on validation
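A minimal sketch of that search, assuming hypothetical validation arrays y_val (true churn labels) and p_val (predicted churn probabilities), both synthetic placeholders here:

# Pick the probability threshold that minimizes expected cost on validation,
# with a false negative costing 5x a false positive.
import numpy as np

rng = np.random.default_rng(0)
# Placeholder validation data; replace with real labels and model scores.
y_val = rng.integers(0, 2, size=5000)
p_val = np.clip(0.3 * y_val + rng.uniform(0.0, 0.7, size=5000), 0, 1)

C_FP, C_FN = 1.0, 5.0
thresholds = np.linspace(0.05, 0.95, 91)
costs = []
for t in thresholds:
    pred = (p_val >= t).astype(int)
    fp = np.sum((pred == 1) & (y_val == 0))
    fn = np.sum((pred == 0) & (y_val == 1))
    costs.append(C_FP * fp + C_FN * fn)
best_t = thresholds[int(np.argmin(costs))]
print(f"cost-minimizing threshold: {best_t:.2f}")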
Scope: 3-week pilot to 5% of users; success if tickets per 1k users drop ≥ 8% with no CSAT regression.
Example 3 — Cold-start demand forecasting
Goal: Forecast weekly demand for a new store format.
- Research question: Does hierarchical forecasting (pooled with similar stores) beat naive and seasonal ARIMA?
- Metrics: WMAPE and P50 absolute error. Baseline: Seasonal naive (last year).
# Backtest: rolling-origin evaluation
# Train on weeks 1..t, predict t+1; slide window; compute WMAPE
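A minimal rolling-origin sketch, assuming a hypothetical weekly_demand array (synthetic here); in practice you would retrain the candidate model inside the loop, while the forecast shown is just the seasonal-naive baseline:

# Rolling-origin backtest: anchor at week t, forecast week t+1, slide forward.
import numpy as np

rng = np.random.default_rng(0)
# Placeholder weekly series (~2.5 years); replace with real store demand.
weekly_demand = 100 + 20 * np.sin(np.arange(130) * 2 * np.pi / 52) + rng.normal(0, 5, 130)

season = 52
actuals, preds = [], []
for t in range(season, len(weekly_demand) - 1):
    preds.append(weekly_demand[t + 1 - season])  # seasonal naive: same week last year
    actuals.append(weekly_demand[t + 1])

actuals, preds = np.array(actuals), np.array(preds)
wmape = np.sum(np.abs(actuals - preds)) / np.sum(np.abs(actuals))
print(f"seasonal-naive WMAPE: {wmape:.3f}")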
Risk: Data drift as promotions change; mitigation: include promo features and scenario stress tests.
Example 4 — Safety/fairness check in a classifier
Goal: Improve precision without harming group fairness.
- Guardrail metric: Demographic parity difference ≤ 0.05.
- Decision rule: Ship only if precision +2% and parity within bound on holdout and shadow deployment.
# Demographic parity difference: gap in positive prediction rates between groups
p_hat_group_a = positives_group_a / total_group_a
p_hat_group_b = positives_group_b / total_group_b
parity_diff = abs(p_hat_group_a - p_hat_group_b)
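For a runnable version, a minimal sketch assuming hypothetical arrays y_pred (0/1 predictions) and group (group labels), checked against the ≤ 0.05 bound above:

# Check the demographic-parity guardrail from prediction and group arrays.
import numpy as np

rng = np.random.default_rng(0)
# Placeholder data; replace with holdout predictions and group membership.
y_pred = rng.integers(0, 2, size=10_000)
group = rng.choice(["A", "B"], size=10_000)

rate_a = y_pred[group == "A"].mean()
rate_b = y_pred[group == "B"].mean()
parity_diff = abs(rate_a - rate_b)
print(f"parity diff: {parity_diff:.3f}, within bound: {parity_diff <= 0.05}")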
Example 5 — Sample size for A/B (conversion)
Goal: Detect lift from 6.0% to 6.6% conversion at 95% confidence, 80% power.
# Approximate two-proportion sample size per variant
from math import ceil, sqrt
p1, p2 = 0.060, 0.066
alpha, power = 0.05, 0.80
z_alpha = 1.96 # 95% two-sided
z_beta = 0.84 # 80% power
p_bar = (p1 + p2) / 2
num = (z_alpha*sqrt(2*p_bar*(1-p_bar)) + z_beta*sqrt(p1*(1-p1)+p2*(1-p2)))**2
den = (p2 - p1)**2
n_per_arm = ceil(num / den)  # rough estimate; round up to be conservative
Decision: If traffic cannot support this n in 2 weeks, reduce MDE or run longer; document tradeoff.
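To ground that decision, a quick sketch that converts the required sample size into a test duration, assuming a hypothetical daily_eligible_users figure and a 50/50 split:

# Translate sample size per arm into an approximate number of test days.
from math import ceil

n_per_arm = 25_710           # roughly what the calculation above yields
daily_eligible_users = 5000  # hypothetical traffic entering the experiment per day
users_per_arm_per_day = daily_eligible_users / 2
days_needed = ceil(n_per_arm / users_per_arm_per_day)
print(f"~{days_needed} days to reach {n_per_arm} users per arm")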
Drills and quick exercises
- Write a one-sentence problem statement that includes user, metric, and MDE.
- List two baselines: a production-as-is baseline and a simple heuristic/statistical baseline.
- State a primary hypothesis and at least one falsifiable null.
- Pick one guardrail metric and a threshold that would halt a rollout.
- Estimate back-of-the-envelope sample size for a 1% absolute lift in your metric.
- Identify the smallest feasible experiment slice deliverable in 2 weeks.
Mini tasks
- Turn a vague goal (“make recommendations better”) into a metric-aligned question and decision rule.
- Draft a pre-mortem: list top 3 risks and mitigations.
- Sketch an offline evaluation plan and a limited online rollout.
Common mistakes and debugging tips
- Optimizing the wrong metric: Tie metrics to the actual decision; add guardrails to protect user experience.
- No pre-defined baseline: Always include a trivial baseline and a production baseline; this prevents post-hoc success narratives.
- Scope creep: Freeze scope and MDE for the first iteration; log deferrals to a v2 list.
- Ignoring feasibility: Validate data availability, feature freshness, and compute early.
- Unclear decision rule: Write the ship/stop criteria as if/then statements before running anything.
- Underpowered tests: If you cannot reach required sample size, increase MDE, extend time, or switch to richer metrics.
Debugging a stalled project
- Re-check problem statement: is the decision actionable?
- Reduce scope to a smaller cohort or geography.
- Swap the primary metric to an earlier proxy if it is strongly correlated and quicker to measure.
Mini project: Frame and plan a real experiment
Scenario: You’re adding a “smart notifications” feature to re-engage inactive users.
- Deliverables:
- Problem statement: user, outcome metric (weekly active users), MDE, and time window.
- Hypotheses: H0/H1 with MDE and guardrails (opt-outs, complaint rate).
- Baselines: no notifications, simple time-since-last-visit rule.
- Feasibility: data required, feature freshness, latency, and privacy checks.
- Experiment plan: offline backtest on historical data and 10% online pilot.
- Risk/tradeoffs: user fatigue, send cost, fairness across time zones; mitigation plan.
- What to submit: a 1–2 page brief and a spreadsheet/notebook with sample size and baseline metrics.
Practical project ideas
- Ranking uplift: Reframe a search or recommendation improvement with NDCG/CTR metrics and latency guardrails.
- Fraud spike response: Create a 2-week feasibility plan for a high-recall rule+model hybrid with precision guardrails.
- Forecasting v1: Ship a WMAPE-targeted baseline using seasonal naive + feature calendar; document v2 paths.
Subskills
- Translating Business Needs Into Research Questions: Turn objectives into testable questions with measurable outcomes.
- Literature Review And Prior Art Search: Identify proven methods, known pitfalls, and realistic baselines.
- Defining Success Criteria And Baselines: Pre-register metrics, MDE, and baseline comparisons to avoid hindsight bias.
- Scope And Feasibility Assessment: Stress-test time, data, compute, and compliance constraints.
- Hypothesis Formulation: Write falsifiable statements tied to metrics and decision rules.
- Experiment Planning: Design offline/online evaluations, sample sizes, and guardrails.
- Risk And Tradeoff Analysis: Anticipate costs (latency, fairness, maintenance) and define mitigation.
Next steps
- Pick one of the practical project ideas and complete the mini project brief.
- Discuss your framing with a peer or mentor; refine metrics and decision rules.
- Move to implementation with a strict v1 scope and a scheduled decision checkpoint.
Copy-paste templates
# Problem statement
For [user/group], we aim to improve [metric] by [MDE] over [time window].
Success means: if [metric + MDE + CI] and guardrails [bounds], then [ship/iterate/stop].
# Baselines
- Production-as-is: [description]
- Simple baseline: [heuristic or linear model]
# Risks & tradeoffs
Top risks: [R1, R2, R3]. Mitigations: [M1, M2, M3].