Why this matters
Applied Scientists are expected to design solutions that are novel, feasible, and evidence-based. Strong literature review and prior art search skills help you:
- Avoid reinventing the wheel and pick proven baselines.
- Identify state-of-the-art methods, datasets, and metrics before proposing a solution.
- Validate novelty for publications, patents, and internal approvals.
- Spot risks early (bias, data leakage, IP conflicts, deployment pitfalls).
Real tasks you will do
- Draft a background section for a project proposal with 8–12 key references.
- Map prior art to confirm a feature idea isn’t already patented.
- Summarize pros/cons of top-3 approaches and recommend one for a pilot.
- Create an evidence matrix of methods, datasets, and reproducibility signals.
Concept explained simply
Literature review: systematically finding and understanding research about your problem. Prior art search: checking if ideas/implementations have already been disclosed (papers, patents, tech reports, standards, blogs).
Mental model: the research funnel (4R)
- Retrieve: cast a wide net with smart queries.
- Rapidly screen: skim titles/abstracts; discard off-topic items quickly.
- Read deeply: evaluate a shortlist for methods, baselines, data, and limitations.
- Record: capture notes, citations, and decisions in a structured matrix.
Cheat-sheet: Query building patterns
- Combine core concepts with synonyms using AND; list synonyms with OR.
- Use phrase quotes for multi-word terms: "contrastive learning".
- Include common abbreviations: (LLM OR "large language model").
- Add constraints: (benchmark OR dataset) AND (AUC OR F1) AND (2021..2024).
- For patents: include functional verbs (detect, classify, segment) and domain nouns (sensor, camera, EHR) plus classification terms (CPC/IPC codes when known).
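To make these patterns concrete, here is a minimal Python sketch that assembles a boolean query string from synonym groups. The group contents and the date-range constraint are illustrative placeholders, not a prescribed taxonomy; adapt them to your own concepts.

```python
def build_query(synonym_groups, constraints=None):
    """Join synonym groups with AND; join synonyms within a group with OR.

    Multi-word terms are wrapped in quotes so search engines treat them as phrases.
    """
    def fmt(term):
        return f'"{term}"' if " " in term else term

    clauses = ["(" + " OR ".join(fmt(t) for t in group) + ")" for group in synonym_groups]
    if constraints:
        clauses.extend(constraints)
    return " AND ".join(clauses)

# Illustrative synonym groups for a cold-start recommendation search
groups = [
    ["recommendation", "recommender system", "ranking"],
    ["cold start", "new item", "new user"],
    ["benchmark", "dataset"],
]
print(build_query(groups, constraints=["(2021..2024)"]))
# (recommendation OR "recommender system" OR ranking) AND ("cold start" OR "new item" OR "new user") AND (benchmark OR dataset) AND (2021..2024)
```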
Step-by-step workflow
1. Define the scope: problem, context, constraints, success metrics. Example: "Online recommendations; must handle cold-start; target metric: CTR uplift."
2. Build queries: split the problem into concepts and synonyms. Example: ("recommendation" OR "ranking") AND ("cold start" OR "new item").
3. Retrieve: search academic portals, preprint servers, and patent databases. Save the top 50–100 hits for screening (a retrieval sketch follows this list).
4. Screen: title/abstract triage with inclusion/exclusion rules (domain, method relevance, recency). Discard duplicates.
5. Read deeply and extract: objective, method, data, baselines, metrics, results, compute, limitations. Note reproducibility signals (code, seeds, data access).
6. Chain citations: backward (references) and forward (who cited it) to find influential works you missed.
7. Record and synthesize: maintain an evidence matrix and write a short narrative covering what works, when, and why. Decide on baselines and your novelty angle.
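As a concrete example of the retrieve step, the sketch below pulls candidate hits from the public arXiv API (export.arxiv.org/api/query) for later screening. The query string and result cap are placeholders you would adapt to your own concepts; for a full review you would repeat this across other portals and patent databases.

```python
import urllib.parse
import urllib.request
import xml.etree.ElementTree as ET

def arxiv_search(query, max_results=50):
    """Fetch title/abstract/link for candidate papers from the arXiv Atom feed."""
    url = "http://export.arxiv.org/api/query?" + urllib.parse.urlencode({
        "search_query": f"all:{query}",
        "start": 0,
        "max_results": max_results,
    })
    with urllib.request.urlopen(url) as resp:
        feed = ET.fromstring(resp.read())
    ns = {"atom": "http://www.w3.org/2005/Atom"}
    hits = []
    for entry in feed.findall("atom:entry", ns):
        hits.append({
            "title": entry.findtext("atom:title", default="", namespaces=ns).strip(),
            "abstract": entry.findtext("atom:summary", default="", namespaces=ns).strip(),
            "link": entry.findtext("atom:id", default="", namespaces=ns),
        })
    return hits

# Placeholder query; substitute your own concepts and synonyms
for hit in arxiv_search('"cold start" AND recommendation', max_results=10):
    print(hit["title"])
```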
Evidence matrix (copy/paste template)
- Paper/Patent:
- Year:
- Task/Domain:
- Method summary:
- Data/Scale:
- Metrics/Results:
- Baselines compared:
- Compute/Cost:
- Limitations/Risks:
- Reproducibility (code/data?):
- Relevance to our constraints:
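If you prefer to keep the matrix in code or export it to a spreadsheet, here is a minimal sketch of one entry as a Python dataclass. The field names simply mirror the template above; the CSV helper and file name are illustrative.

```python
from dataclasses import dataclass, asdict
import csv

@dataclass
class EvidenceEntry:
    source: str            # paper or patent identifier
    year: int
    task_domain: str
    method_summary: str
    data_scale: str
    metrics_results: str
    baselines: str
    compute_cost: str
    limitations_risks: str
    reproducibility: str   # code/data availability, seeds
    relevance: str         # fit with our constraints

def write_matrix(entries, path="evidence_matrix.csv"):
    """Dump the evidence matrix to CSV so it can be shared or diffed in reviews."""
    with open(path, "w", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=list(asdict(entries[0])))
        writer.writeheader()
        writer.writerows(asdict(e) for e in entries)
```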
Worked examples
Example 1: Fair ranking for recommendations
Goal: Reduce popularity bias while maintaining CTR.
How to search
- Concepts: fairness, ranking, recommendation, exposure
- Query: (fairness OR "fair exposure" OR debias*) AND (ranking OR recommender) AND (exposure OR popularity) AND (metric OR evaluation)
- Screen out: unrelated fairness (e.g., only classification), outdated (pre-2015) unless seminal.
What to extract
- Metrics: exposure disparity, NDCG, CTR proxy
- Methods: re-ranking, regularization, counterfactual estimators
- Risks: business trade-offs, cold-start creators
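One of the metrics listed above, exposure disparity, is easy to sanity-check with a small script. The sketch below assumes position-based exposure (1 / log2(rank + 1), the DCG discount) and a split into popular vs long-tail items; both choices are illustrative rather than a fixed definition from any particular paper.

```python
import math

def exposure(rank):
    """Position-based exposure, e.g. the discount used in DCG."""
    return 1.0 / math.log2(rank + 1)

def exposure_disparity(ranking, popular_items):
    """Difference in average exposure between popular and long-tail items in one ranking."""
    pop, tail = [], []
    for rank, item in enumerate(ranking, start=1):
        (pop if item in popular_items else tail).append(exposure(rank))
    avg = lambda xs: sum(xs) / len(xs) if xs else 0.0
    return avg(pop) - avg(tail)

# Toy example: items A and B are "popular" and sit at the top of the ranking
print(exposure_disparity(["A", "B", "C", "D", "E"], popular_items={"A", "B"}))
```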
Example 2: Missing data in healthcare time series
Goal: Robust imputation for ICU vitals streams.
How to search
- Concepts: time series, healthcare, imputation, irregular sampling
- Query: ("time series" AND (imput* OR interpolation) AND (healthcare OR ICU OR EHR) AND (irregular OR sparse))
- Add abbreviations: (RNN OR GRU OR TCN OR diffusion) for method breadth.
What to extract
- Datasets: MIMIC-III/IV
- Metrics: MAE, downstream AUROC on mortality task
- Compute: training time and hardware
- Limitations: failing patterns (e.g., long gaps)
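To ground the evaluation setup for this example, here is a minimal sketch of the standard mask-and-impute protocol: hide a fraction of observed values, impute with a simple baseline (pandas linear interpolation here, as a stand-in for the RNN/diffusion imputers you would survey), and score MAE on the hidden entries. The column name, signal shape, and mask rate are all placeholders.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Toy vitals series (placeholder column name and synthetic signal)
series = pd.Series(np.sin(np.linspace(0, 6, 200)) + rng.normal(0, 0.05, 200), name="heart_rate")

# Hide 20% of the observed values to form an evaluation mask
mask = rng.random(len(series)) < 0.2
held_out = series[mask]
corrupted = series.copy()
corrupted[mask] = np.nan

# Simple baseline: linear interpolation (swap in the imputer under evaluation)
imputed = corrupted.interpolate(method="linear", limit_direction="both")

mae = (imputed[mask] - held_out).abs().mean()
print(f"MAE on held-out values: {mae:.4f}")
```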
Example 3: Prior art for visual defect detection on assembly line
Goal: Check whether an idea is novel: contrastive pretraining + few-shot segmentation for surface defects.
How to search
- Keywords: (defect OR anomaly) AND (industrial OR manufacturing) AND (vision OR camera) AND (contrastive OR self-supervised) AND (few-shot OR low-shot)
- Patents: include verbs and components: (detect OR segment) AND (surface OR weld OR scratch) AND (camera OR sensor) AND (contrastive)
- Refine with classification terms if found relevant (e.g., CPC codes under computer vision inspection).
What to extract
- Claim scope and embodiments in patents
- Implementation specifics: augmentations, thresholding, post-processing
- Datasets: DAGM, MVTec AD; reported metrics
Practical projects
- Project 1: Baseline map for your team’s active problem
- Deliver: 1-page narrative + evidence matrix (10–15 entries) + recommended baselines.
- Project 2: Mini systematic review (lightweight)
- Define inclusion/exclusion, run citation chaining, and produce a PRISMA-style count summary (numbers only).
- Project 3: Prior art risk scan
- Draft a 2-page brief comparing your proposed idea to 3–5 closely related patents/papers, highlighting differences.
Exercises
- Exercise 1: Build a search strategy
Problem: "We need a robust method for detecting data drift in streaming tabular data with concept drift and limited labels."
Task: Write 2 boolean queries: one for academic literature, one for patents. Include at least 3 synonym groups and 1 constraint (e.g., timeframe or evaluation).
Submit: Your two queries and a one-sentence rationale each.
- Exercise 2: Rapid screening triage
Given title/abstract snippets (below), mark Include/Exclude and justify briefly:
- A: "Unsupervised drift detection via adaptive windows in data streams"
- B: "Image style transfer with transformers"
- C: "Monitoring ML systems in production: a survey of drift and skew"
- D: "Concept drift in non-stationary environments using KL divergence with labels"
- E: "Real-time anomaly detection in network traffic using PCA"
Completion checklist
- Queries include core concept, synonyms, and constraints.
- Screening decisions align with the defined problem (streaming tabular, drift, limited labels).
- Reasons mention method-task fit and data/label constraints.
Common mistakes and self-check
- Too narrow queries: You miss synonyms and adjacent fields. Self-check: Did you include 2–3 synonyms per core concept?
- No stopping rule: Endless searching. Self-check: Stop when the last 10 quality sources add no new methods or datasets.
- Ignoring patents/industry reports: Novelty risk. Self-check: Have you checked at least one patent database and one industry venue?
- Weak screening: Keeping everything. Self-check: Apply clear inclusion/exclusion criteria and cap deep reads to a shortlist.
- Poor notes: Can’t reproduce decisions. Self-check: Maintain an evidence matrix with decisions and rationale.
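The stopping rule in the self-check above is easy to operationalize. Here is a toy sketch: track whether each newly read source contributes any unseen methods or datasets, and stop once the last 10 sources add nothing new. The window size and the example method/dataset names are illustrative.

```python
def should_stop(sources, window=10):
    """Stopping-rule sketch: stop once the last `window` sources add no new methods/datasets.

    `sources` is an ordered list of per-source sets of methods/datasets mentioned.
    """
    seen = set()
    new_counts = []
    for items in sources:
        fresh = items - seen
        new_counts.append(len(fresh))
        seen |= fresh
    recent = new_counts[-window:]
    return len(recent) == window and sum(recent) == 0

# Toy usage: each set lists methods/datasets extracted from one source, in reading order
reading_log = [{"GRU-D", "MIMIC-III"}, {"BRITS"}, {"GRU-D"}, set(), {"MIMIC-III"}] + [set()] * 10
print(should_stop(reading_log))  # True: the last 10 sources added nothing new
```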
Quality bar for a "good enough" review
- 8–12 high-quality, recent sources + 2–3 seminal works
- At least 2 alternative methods compared head-to-head
- Clear baseline and recommended path forward
Mini challenge
Pick any ML task you care about. In 45 minutes: draft one query, collect 15 hits, triage to 5, extract key points into the evidence matrix, and write a 3-sentence recommendation.
Who this is for
- Applied Scientists and ML Engineers proposing solutions or writing internal/external research docs.
- Data Scientists validating ideas before building prototypes.
- Students preparing capstones or research statements.
Prerequisites
- Basic understanding of ML tasks and metrics.
- Comfort reading abstracts and method sections.
- Ability to write simple boolean queries with AND/OR/quotes.
Learning path
- Start: Learn problem scoping and success metrics.
- This subskill: Literature review and prior art search.
- Next: Experimental design and baseline selection.
- Then: Risk, bias, and deployment considerations.
Next steps
- Turn your evidence matrix into a short internal memo with recommended baselines.
- Schedule a review with a teammate to sanity-check coverage and novelty.
- Translate insights into an experiment plan: datasets, metrics, and compute budget.