
Documenting Assumptions And Limitations

Learn Documenting Assumptions And Limitations for free, with explanations, exercises, and a quick test (for Applied Scientists).

Published: January 7, 2026 | Updated: January 7, 2026

Why this matters

As an Applied Scientist, your models drive product decisions, experiments, and risk assessments. Clear documentation of assumptions and limitations prevents misuse, supports reproducibility, and helps stakeholders choose the right solution.

  • Product decisions: Explain what conditions must hold for the model to work (e.g., traffic volume, user behavior).
  • Experimentation: State design assumptions (randomization quality, stationarity) and known threats to validity.
  • Risk and compliance: Surface fairness, privacy, and safety constraints, along with mitigations and monitoring.

Concept explained simply

Assumptions are things you expect to be true but have not fully proven in your current work (e.g., data distribution is stable next quarter). Limitations are known boundaries where your approach may underperform or fail (e.g., low recall for rare classes).

Mental model

Think of your model like a tool with a label:

  • Use for: The problem slice where it shines.
  • Works because: The assumptions that make it valid.
  • Do not use for: Conditions where performance degrades.
  • Handle with care: Risks and mitigations.

A practical template

Use this template to write one page that anyone can understand:

Context: One sentence on the goal and where the model is used.

Assumptions (A):
- A1: [Claim], evidence: [metric/test], scope: [data/time/domain].
- A2: ...

Limitations (L):
- L1: [Boundary/weakness], impact: [metric/user segment], severity: [low/med/high].
- L2: ...

Uncertainties (U):
- U1: [What we don’t yet know], plan to learn: [experiment/monitor], timeline: [date].

Mitigations (M):
- M1: [Control/guardrail], owner: [name/team], trigger: [threshold].

Monitoring & Triggers:
- Metric: [e.g., recall minority segment], threshold: [value], action: [rollback/fallback/alert].

Version & Date:
- Model vX.Y, documented on [YYYY-MM-DD].
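
If you keep documentation in version control, the same template can live next to the code in a machine-readable form. Below is a minimal Python sketch of that idea; the class and field names simply mirror the template sections and are an illustrative assumption, not a required schema.

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class Assumption:
    claim: str      # what we expect to be true
    evidence: str   # metric/test backing the claim
    scope: str      # data slice, time window, or domain

@dataclass
class Limitation:
    boundary: str   # where the approach underperforms
    impact: str     # affected metric or user segment
    severity: str   # low / med / high

@dataclass
class AlumDoc:
    context: str
    version: str
    documented_on: str                                       # YYYY-MM-DD
    assumptions: List[Assumption] = field(default_factory=list)
    limitations: List[Limitation] = field(default_factory=list)

# Hypothetical instance, loosely based on the fraud example below
doc = AlumDoc(
    context="Real-time fraud scoring for card transactions.",
    version="v2.1",
    documented_on="2026-01-07",
    assumptions=[Assumption(
        claim="Merchant category mix stays within ±10% of last quarter",
        evidence="weekly PSI check",
        scope="card-present traffic, last 90 days")],
    limitations=[Limitation(
        boundary="Users with ≤ 2 prior transactions",
        impact="recall 12% lower",
        severity="med")],
)
print(doc.assumptions[0].claim)
```
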
Example phrases you can reuse

  • Assumption: "We assume input text is predominantly English (>95%), based on language-ID on a 100k sample (last 30 days)."
  • Limitation: "Performance drops for cold-start users (AUC -0.07); avoid using scores for first 24 hours of activity."
  • Uncertainty: "Long-term decay of calibration beyond 90 days is unknown; schedule recalibration check quarterly."
  • Mitigation: "If drift (PSI > 0.25) is detected, switch to baseline heuristic and alert the on-call."
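
The drift mitigation above leans on PSI, so here is a minimal sketch of how such a check might be computed in Python. The binning scheme, the 1e-6 floor, and the simulated feature snapshots are illustrative assumptions, not a prescribed implementation.

```python
import numpy as np

def psi(expected: np.ndarray, actual: np.ndarray, n_bins: int = 10) -> float:
    """Population Stability Index between a reference (training) sample and a
    recent (serving) sample of one numeric feature."""
    edges = np.quantile(expected, np.linspace(0, 1, n_bins + 1))
    exp_pct = np.histogram(expected, bins=edges)[0] / len(expected)
    # Clip the recent sample into the reference range so every value lands in a bin
    act_pct = np.histogram(np.clip(actual, edges[0], edges[-1]), bins=edges)[0] / len(actual)
    # Floor the proportions to avoid log(0) and division by zero
    exp_pct = np.clip(exp_pct, 1e-6, None)
    act_pct = np.clip(act_pct, 1e-6, None)
    return float(np.sum((act_pct - exp_pct) * np.log(act_pct / exp_pct)))

# Hypothetical feature snapshots: training reference vs. last 7 days of serving data
rng = np.random.default_rng(0)
reference = rng.normal(0.0, 1.0, 100_000)
recent = rng.normal(0.6, 1.3, 50_000)   # simulated drift

value = psi(reference, recent)
print(f"PSI = {value:.2f}")
if value > 0.25:
    print("Drift detected: switch to baseline heuristic and alert the on-call")
```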

Worked examples

Example 1: Fraud detection model (tabular, imbalanced)

Context: Real-time fraud scoring for card transactions.

  • Assumptions:
    • A1: Merchant category distributions remain within ±10% of last quarter; monitored via PSI.
    • A2: Label delay ≤ 14 days; training uses labels up to 21 days old to reduce leakage.
  • Limitations:
    • L1: Sparse history users (≤ 2 prior transactions) have 12% lower recall; scores should be combined with rules for these users.
    • L2: Model not calibrated for cross-border transactions; expect higher false positives for foreign MCCs.
  • Uncertainties:
    • U1: Impact of new 3DS policy changes; A/B ramp planned next month.
  • Mitigations:
    • M1: If recall_weekly < 0.70 in the RU/CIS region, route to the manual review queue.
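
A trigger like M1 is easiest to defend when it is scripted. The sketch below assumes a weekly snapshot of binary fraud labels, thresholded predictions, and a region column; the data layout and helper names are hypothetical.

```python
import numpy as np

def recall(y_true: np.ndarray, y_pred: np.ndarray) -> float:
    """Recall = TP / (TP + FN) on binary fraud labels."""
    tp = np.sum((y_true == 1) & (y_pred == 1))
    fn = np.sum((y_true == 1) & (y_pred == 0))
    return float(tp / (tp + fn)) if (tp + fn) else 0.0

def breached_regions(y_true, y_pred, regions, threshold=0.70):
    """Regions whose weekly recall falls below the M1 threshold."""
    return [r for r in np.unique(regions)
            if recall(y_true[regions == r], y_pred[regions == r]) < threshold]

# Hypothetical one-week scoring snapshot
rng = np.random.default_rng(1)
y_true = rng.integers(0, 2, 5_000)
y_pred = rng.integers(0, 2, 5_000)
regions = rng.choice(np.array(["RU/CIS", "EU", "NA"]), size=5_000)

for region in breached_regions(y_true, y_pred, regions):
    print(f"M1 triggered for {region}: route transactions to the manual review queue")
```
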
Example 2: Forecasting daily demand (time series)

Context: 30-day horizon demand forecast for inventory planning.

  • Assumptions:
    • A1: Seasonality patterns are stable year-over-year after holiday normalization.
    • A2: Promotions calendar is accurate and finalized 2 weeks before launch.
  • Limitations:
    • L1: Cannot capture sudden black-swan events; uncertainty intervals widen during anomalies.
    • L2: Low-volume SKUs (weekly sales < 20) have MAPE > 35%; use safety stock buffers.
  • Uncertainties:
    • U1: Supplier lead-time variance increasing; assess impact via scenario stress tests.
  • Mitigations:
    • M1: If WAPE > 25% for 2 consecutive weeks, escalate to manual override for top 50 SKUs.
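
The M1 trigger above can be checked with a few lines of code. WAPE is the sum of absolute forecast errors divided by total actual demand; the weekly values and the two-consecutive-weeks logic below are an illustrative sketch.

```python
import numpy as np

def wape(actual: np.ndarray, forecast: np.ndarray) -> float:
    """Weighted absolute percentage error over one evaluation window."""
    return float(np.sum(np.abs(actual - forecast)) / np.sum(np.abs(actual)))

def should_escalate(weekly_wape, threshold=0.25):
    """M1: escalate to manual override when WAPE exceeds the threshold
    for two consecutive weeks."""
    breaches = [w > threshold for w in weekly_wape]
    return any(a and b for a, b in zip(breaches, breaches[1:]))

# Hypothetical last four weekly WAPE values for the top 50 SKUs
recent_weeks = [0.18, 0.22, 0.27, 0.31]
if should_escalate(recent_weeks):
    print("Escalate: manual override for top 50 SKUs")
```
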
Example 3: Toxicity classifier (NLP)

Context: Flag harmful comments for moderator review.

  • Assumptions:
    • A1: Training distribution covers major dialects currently seen (>90% coverage by language ID).
    • A2: Human labels are majority-vote from trained raters; inter-rater agreement κ ≥ 0.62.
  • Limitations:
    • L1: Higher false positives for reclaimed slurs in community contexts; moderators should rely on context view.
    • L2: Sarcasm detection is weak; expect under-flagging in sarcastic content.
  • Uncertainties:
    • U1: Shift due to new platform slang; monthly lexicon updates planned.
  • Mitigations:
    • M1: Fairness checks per demographic proxy; if FNR disparity > 10pp, trigger retraining with reweighting.
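
One way the M1 fairness check might be scripted is sketched below: compute the false negative rate per demographic proxy group and compare the widest gap against the 10 percentage point threshold. The proxy group names and synthetic data are assumptions for illustration.

```python
import numpy as np

def fnr(y_true: np.ndarray, y_pred: np.ndarray) -> float:
    """False negative rate: share of truly harmful comments the model missed."""
    fn = np.sum((y_true == 1) & (y_pred == 0))
    tp = np.sum((y_true == 1) & (y_pred == 1))
    return float(fn / (fn + tp)) if (fn + tp) else 0.0

def fnr_disparity(y_true, y_pred, groups) -> float:
    """Largest FNR gap across demographic proxy groups."""
    rates = [fnr(y_true[groups == g], y_pred[groups == g]) for g in np.unique(groups)]
    return max(rates) - min(rates)

# Hypothetical evaluation slice with proxy group labels
rng = np.random.default_rng(2)
y_true = rng.integers(0, 2, 10_000)
y_pred = rng.integers(0, 2, 10_000)
groups = rng.choice(np.array(["group_a", "group_b", "group_c"]), size=10_000)

gap = fnr_disparity(y_true, y_pred, groups)
print(f"FNR disparity = {gap:.3f}")
if gap > 0.10:   # 10 percentage points
    print("M1 triggered: retrain with reweighting")
```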

How to write clear assumptions and limitations

  • Make them testable: include a metric or observable condition.
  • Bound them: specify time window, data slice, or scenario.
  • Trace them: cite the source (sample size, test name, date).
  • Quantify impact: describe effect on users or metrics.
  • Pair with a mitigation: what you will do if it breaks.

Good vs. vague statements

  • Vague: "Data probably stable."
  • Good: "Feature drift PSI < 0.1 for top 20 features (last 30 days vs. training)."
  • Vague: "Works on most users."
  • Good: "AUC ≥ 0.86 for users with ≥ 30 days history; drops to 0.79 for newer users."

Exercises

Complete the tasks below, then compare with the solutions. Tip: Write assumptions as "Condition + Evidence + Scope" and limitations as "Boundary + Impact + Severity".

Exercise 1 — Rewrite vague statements

Scenario: A recommendation model for news articles.

  • Vague assumption A: "Users like fresh content."
  • Vague limitation L: "Cold-start is an issue."

Task: Rewrite A and L into clear, testable statements including evidence and scope.

  • A includes measurable condition
  • A cites evidence or test
  • L states specific impact/segment
  • L includes severity or threshold

Sample answer

Assumption: "Clicks skew to articles < 24h old (CTR +18% vs. older), confirmed on a 500k-session sample (last 14 days)."

Limitation: "For new users (account age < 48h), recall@10 is 0.06 lower; severity: medium—add popularity fallback until 5 sessions observed."

Exercise 2 — Surface hidden assumptions

Scenario: You plan an uplift model to target a discount. The business wants to launch next week.

Task: List at least 4 hidden assumptions that must hold for results to be trustworthy, and propose one mitigation each.

  • Randomization or proper counterfactual design
  • Stable conversion tracking and label definitions
  • Sufficient sample size and variance assumptions
  • No interference or spillover across groups

Sample answer

  • A1: Randomization is unbiased; check via covariate balance before launch. Mitigation: re-randomize if SMD > 0.1 (see the sketch after this list).
  • A2: Tracking latency ≤ 1h; verify end-to-end event delivery. Mitigation: add backup batch ingestion.
  • A3: Power ≥ 80% for expected uplift 1.5pp; compute sample size. Mitigation: extend duration by 2 weeks if needed.
  • A4: No cross-group coupon sharing; Mitigation: per-user code restriction, monitor leakage rate.
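
For A1, a pre-launch balance check might look like the sketch below: compute the standardized mean difference (SMD) per covariate and flag anything above 0.1. The covariate names and synthetic samples are hypothetical.

```python
import numpy as np

def smd(treat: np.ndarray, control: np.ndarray) -> float:
    """Absolute standardized mean difference with a pooled-SD denominator."""
    pooled_sd = np.sqrt((np.var(treat, ddof=1) + np.var(control, ddof=1)) / 2)
    return float(abs(treat.mean() - control.mean()) / pooled_sd)

def imbalanced_covariates(covariates, threshold=0.1):
    """Covariates whose SMD exceeds the re-randomization threshold (A1)."""
    return [name for name, (treat, control) in covariates.items()
            if smd(treat, control) > threshold]

# Hypothetical pre-launch balance check on two covariates
rng = np.random.default_rng(3)
covariates = {
    "past_30d_spend": (rng.normal(55, 20, 4_000), rng.normal(50, 20, 4_000)),  # imbalanced on purpose
    "tenure_days":    (rng.normal(300, 90, 4_000), rng.normal(300, 90, 4_000)),
}

flagged = imbalanced_covariates(covariates)
if flagged:
    print(f"Re-randomize: covariate imbalance on {flagged}")
```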

Common mistakes and self-check

  • Listing generic risks without thresholds. Self-check: Do you have numbers, dates, or segments?
  • Confusing uncertainties with limitations. Self-check: Can this be tested soon? If yes, it is an uncertainty.
  • Not scoping to specific user segments. Self-check: Name the segment and metric direction.
  • Omitting monitoring hooks. Self-check: What alert triggers action?
  • Over-promising. Self-check: Can you defend each claim with current evidence?

Practical projects

  • Write a one-page ALUM doc (Assumptions, Limitations, Uncertainties, Mitigations) for a model you built (or a public dataset project). Include at least 3 assumptions, 3 limitations, and 2 mitigations.
  • Run a simple drift analysis on a dataset snapshot (e.g., compare two weeks). Document findings as assumptions/limitations with thresholds.
  • Peer review: Swap ALUM docs with a colleague; each adds one missing assumption and one clearer limitation.

Mini challenge

In 6 sentences or fewer, write assumptions and limitations for a sentiment model used to prioritize customer support tickets. Include one monitoring trigger.

Learning path

  1. Foundation: Review your model’s objective, key metrics, and data generation process.
  2. Evidence gathering: Run sanity checks (drift, calibration, segment performance).
  3. Draft ALUM: Fill the template; keep it to one page.
  4. Peer review: Ask a PM/engineer to challenge assumptions; refine.
  5. Operationalize: Add monitoring thresholds and owners; schedule review cadence.

Who this is for

  • Applied Scientists and ML Engineers shipping models to production.
  • Data Scientists preparing experiment or modeling reports.
  • PMs and Analysts who need clear, decision-ready documentation.

Prerequisites

  • Basic understanding of model evaluation (precision/recall/AUC/calibration).
  • Familiarity with dataset splits and drift concepts.
  • Comfort summarizing evidence from experiments or analyses.

Next steps

  • Convert one of your recent project notes into the ALUM template.
  • Set monitoring thresholds for two critical metrics and define an action per threshold.
  • Take the quick test to check your understanding.

Documenting Assumptions And Limitations — Quick Test

Test your knowledge with 6 questions. Pass with 70% or higher.
