Why this matters
As an Applied Scientist, you often need one model to do several things at once (multi-task), or to optimize a single task under several goals like accuracy, latency, and fairness (multi-objective). Doing this well reduces infrastructure cost, improves product metrics, and helps you ship faster.
- Real tasks: predict click-through rate and conversion rate together; detect multiple topics from text; forecast multiple demand signals per item; tune a model to be accurate but also fast and fair.
- Impact: fewer models to maintain, better generalization from shared learning, and clear trade-off controls between competing goals.
Concept explained simply
Multi-task learning (MTL)
Train a single model with a shared body ("trunk") and multiple heads, each head solving a different but related task. Example: one trunk for user-item features, one head for click, another for conversion.
Multi-objective optimization (MOO)
Optimize one model against several objectives at the same time. Example: minimize classification loss while keeping latency and unfairness small.
Mental model
- Multi-task: a Swiss Army knife — one handle (shared features) with multiple tools (task heads).
- Multi-objective: a balance scale — you adjust weights to balance competing goals.
Key differences at a glance
- Multi-task: multiple targets/labels; outputs are different tasks.
- Multi-objective: one model/task, multiple goals/constraints on its behavior.
- Both use weighted sums or constraints, but multi-task also structures outputs and losses per task.
Core building blocks
Model architecture
- Hard parameter sharing: one shared trunk, multiple heads. Simple and efficient (see the sketch after this list).
- Soft parameter sharing: separate task-specific trunks with regularization to keep them similar. More flexible, heavier.
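Below is a minimal hard-sharing sketch in PyTorch. The layer sizes and the two head names (head_click, head_conv) are illustrative assumptions, not requirements of the pattern.

```python
import torch
import torch.nn as nn

class SharedTrunkModel(nn.Module):
    """Hard parameter sharing: one shared trunk, one small head per task."""
    def __init__(self, in_dim: int, hidden: int = 64):
        super().__init__()
        # The shared trunk learns a representation used by every task.
        self.trunk = nn.Sequential(
            nn.Linear(in_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
        )
        # Cheap task-specific heads; they output logits, so apply
        # sigmoid inside the loss (BCEWithLogits) rather than here.
        self.head_click = nn.Linear(hidden, 1)
        self.head_conv = nn.Linear(hidden, 1)

    def forward(self, x):
        h = self.trunk(x)
        return self.head_click(h), self.head_conv(h)
```

Soft sharing would instead keep one trunk per task and add a regularization term that penalizes how far the trunks' weights drift apart.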
Loss design (the heart of it)
- Weighted sum: L = Σ w_i L_i (tasks or objectives). Choose weights by scale, importance, or with automatic methods.
- Dynamic weighting ideas: uncertainty weighting (higher noise → lower weight), GradNorm (balances gradient magnitudes), or simple performance-based reweighting.
- Constraints: convert them to penalties (e.g., a penalty that grows when latency exceeds its budget) or use Lagrange multipliers.
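A sketch combining the weighted sum with a constraint-turned-penalty, assuming you have a differentiable latency proxy (e.g., expected FLOPs mapped to milliseconds); the weights, λ, and the 50 ms budget below are placeholders.

```python
import torch

def combined_loss(task_losses, weights, latency_proxy=None,
                  budget_ms=50.0, lam=0.05):
    """L = sum_i w_i * L_i, plus an optional hinge penalty on a
    differentiable latency proxy tensor (placeholder budget and lambda)."""
    loss = sum(weights[name] * task_losses[name] for name in task_losses)
    if latency_proxy is not None:
        # Only the amount over budget is penalized (soft constraint).
        loss = loss + lam * torch.clamp(latency_proxy - budget_ms, min=0.0)
    return loss
```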
Data and sampling
- Missing labels per task: use masks so each task loss only uses examples with labels (see the masking sketch after this list).
- Imbalance: per-task sampling or class weighting.
- Task sampling: alternate batches per task to avoid dominance by the largest task.
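A minimal masking helper, assuming a 0/1 float mask that is 1 where the task's label exists and that unlabeled positions carry a placeholder label (e.g., 0.0) rather than NaN:

```python
import torch
import torch.nn.functional as F

def masked_bce(logits, labels, mask):
    """BCE averaged only over examples where mask == 1.
    Unlabeled examples contribute zero loss and zero gradient."""
    per_example = F.binary_cross_entropy_with_logits(
        logits, labels, reduction="none")
    # Clamp the denominator so an all-unlabeled batch does not divide by zero.
    return (per_example * mask).sum() / mask.sum().clamp(min=1.0)
```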
Evaluation
- Per-task metrics: track each task separately plus an aggregate.
- For multi-objective: plot trade-off curves (e.g., accuracy vs latency) and pick Pareto-efficient models (helper sketch after this list).
- Ablate weights to see sensitivity.
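A small helper for picking Pareto-efficient checkpoints from (accuracy, latency) pairs, assuming higher accuracy and lower latency are both preferred; the checkpoint names and numbers are made up.

```python
def pareto_front(points):
    """points: list of (name, accuracy, latency_ms).
    Keep a point unless another point is at least as good on both
    objectives and strictly better on at least one."""
    front = []
    for name, acc, lat in points:
        dominated = any(
            (a >= acc and l <= lat) and (a > acc or l < lat)
            for _, a, l in points
        )
        if not dominated:
            front.append((name, acc, lat))
    return front

# B dominates A (more accurate and faster); B and C remain on the front.
print(pareto_front([("A", 0.80, 40.0), ("B", 0.82, 38.0), ("C", 0.85, 55.0)]))
```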
Worked examples
Example 1: Multi-output regression
Goal: predict house price (regression) and time-on-market (regression) from the same features.
- Loss: L = w_price * MSE(price) + w_time * MSE(time).
- Scaling tip: set w_i inversely proportional to target variance to balance gradients.
Small numeric example
Suppose on a batch: MSE_price = 0.25, MSE_time = 4.0. If w_price = 2.0 and w_time = 0.5, then L = 2.0*0.25 + 0.5*4.0 = 0.5 + 2.0 = 2.5.
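The same arithmetic, plus the variance-based weighting tip, as a short sketch (the target values are synthetic):

```python
import torch

# Variance-based weights from training targets (synthetic values).
price = torch.tensor([3.1, 2.4, 5.0, 4.2])     # e.g., price in 100k USD
days = torch.tensor([20.0, 45.0, 10.0, 90.0])  # e.g., days on market
w_price, w_time = 1.0 / price.var(), 1.0 / days.var()
print(w_price.item(), w_time.item())  # the higher-variance target gets the smaller weight

# The hand-picked weights from the numeric example above:
mse_price, mse_time = 0.25, 4.0
total = 2.0 * mse_price + 0.5 * mse_time
print(total)  # 2.5
```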
Example 2: Multi-label classification
Goal: detect multiple topics per document. Use a single head with sigmoid outputs for each label.
- Loss per label: binary cross-entropy. Sum or average across labels.
- Threshold tuning: choose per-label thresholds based on precision/recall needs.
Practical tip
If rare labels underperform, increase their weight or oversample documents containing them.
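A sketch of the multi-label setup with per-label up-weighting for rare topics and per-label decision thresholds; the label count, pos_weight values, and thresholds are assumptions you would tune on validation data.

```python
import torch
import torch.nn as nn

num_labels = 5
logits = torch.randn(8, num_labels)                     # outputs for 8 documents
labels = torch.randint(0, 2, (8, num_labels)).float()   # multi-hot topic labels

# pos_weight > 1 up-weights positives of rare labels (here: labels 3 and 4).
pos_weight = torch.tensor([1.0, 1.0, 1.0, 4.0, 6.0])
criterion = nn.BCEWithLogitsLoss(pos_weight=pos_weight)  # averages across labels
loss = criterion(logits, labels)

# Per-label thresholds chosen for each label's precision/recall target.
thresholds = torch.tensor([0.5, 0.5, 0.5, 0.3, 0.25])
preds = (torch.sigmoid(logits) >= thresholds).int()
```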
Example 3: CTR + CVR with selection bias
Goal: predict click (CTR) and purchase (CVR). CVR labels are observed only after a click, so the conversion head sees a biased subset of traffic (selection bias).
- Architecture: shared trunk → head_click (sigmoid), head_conv (sigmoid).
- Loss: L = w_click * BCE(click) + w_conv * BCE(conv_with_mask). Use a mask so CVR loss applies only when label exists.
- Business objective: revenue ∝ CTR * CVR * price. Adjust w_conv upward if conversion quality is more important.
Handling missing CVR labels
Use a mask for unavailable CVR labels. Consider techniques to reduce bias, such as inverse propensity weighting or modeling the cascade explicitly.
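Putting the two heads and the CVR mask together in one loss (a sketch; the default weights and the 0/1 float mask convention are assumptions):

```python
import torch
import torch.nn.functional as F

def ctr_cvr_loss(click_logit, conv_logit, click_label, conv_label,
                 conv_mask, w_click=1.0, w_conv=2.0):
    """Weighted CTR + CVR loss; CVR contributes only where its label exists."""
    loss_click = F.binary_cross_entropy_with_logits(click_logit, click_label)
    per_example = F.binary_cross_entropy_with_logits(
        conv_logit, conv_label, reduction="none")
    # Average the CVR loss over labeled examples only (mask is a 0/1 float).
    loss_conv = (per_example * conv_mask).sum() / conv_mask.sum().clamp(min=1.0)
    return w_click * loss_click + w_conv * loss_conv
```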
Practical steps to build one
- Define outputs and objectives: list tasks or goals. Label any constraints (e.g., latency ≤ 50 ms).
- Choose architecture: start with a shared trunk and simple per-task heads.
- Pick initial weights: begin with equal weights after normalizing losses; adjust based on metric priorities.
- Implement masking: ensure each task’s loss uses only valid labels.
- Train with balanced sampling: rotate minibatches across tasks if label volume differs (see the sketch after this list).
- Evaluate: report per-task metrics and an aggregate; for multi-objective, produce trade-off points.
- Iterate on weights: small, deliberate changes; document effects.
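The balanced-sampling step could look like this round-robin sketch over task-specific data loaders; the rotation rule and loader setup are one option among several.

```python
from itertools import cycle

def round_robin_batches(loaders, steps):
    """Yield (task_name, batch) pairs, one task per step in rotation.
    Smaller tasks are cycled so every task appears equally often."""
    iters = {name: cycle(loader) for name, loader in loaders.items()}
    names = list(loaders)
    for step in range(steps):
        task = names[step % len(names)]
        yield task, next(iters[task])
```

Note that itertools.cycle caches one full pass of each loader in memory; for large datasets, re-create the iterator when it is exhausted instead.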
Simple aggregate metric idea
Compute a weighted score of normalized task metrics (e.g., average of per-task z-scores) to compare checkpoints without hiding weak tasks.
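One way to compute such an aggregate, assuming every metric is oriented so that higher is better:

```python
import numpy as np

def aggregate_score(metric_history, current, weights):
    """Weighted average of per-task z-scores.
    metric_history: task -> list of past checkpoint values (higher is better).
    current: task -> value for the checkpoint being scored."""
    score, total_w = 0.0, 0.0
    for task, value in current.items():
        hist = np.asarray(metric_history[task], dtype=float)
        std = hist.std() if hist.std() > 0 else 1.0  # guard against zero spread
        w = weights.get(task, 1.0)
        score += w * (value - hist.mean()) / std
        total_w += w
    return score / total_w
```

Because each task is normalized by its own history, a collapse on one task pulls the aggregate down even if another task improves a lot.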
Who this is for
- Applied Scientists and ML Engineers deploying models that must serve multiple predictions or balance accuracy with runtime/business constraints.
- Data Scientists upgrading single-task models into unified, production-ready systems.
Prerequisites
- Comfort with supervised learning (classification/regression), basic neural networks.
- Understanding of common losses (MSE, cross-entropy) and evaluation metrics.
- Familiarity with minibatch training and validation splits.
Learning path
- Before: single-task modeling, metrics and calibration, regularization.
- Now: multi-task vs multi-objective basics (this lesson).
- Next: advanced weighting (uncertainty, GradNorm), gradient conflict methods, constraint handling and Pareto optimization.
Common mistakes and self-check
- One task dominates training: check gradient magnitudes; try task reweighting or balanced sampling.
- Ignoring missing labels: ensure masked losses; verify zero loss contribution when label is absent.
- Mismatched scales: normalize losses or use variance-based weights so tasks train at similar rates.
- Evaluating only aggregate: always inspect per-task metrics to avoid hidden regressions.
- Static weights forever: revisit weights after observing trade-offs and business impact.
Self-check mini-audit
- ☐ Do I log per-task losses and metrics each epoch?
- ☐ Are loss weights justified and documented?
- ☐ Is masking correctly implemented and unit-tested?
- ☐ Do I have at least one trade-off curve or ablation showing weight sensitivity?
Practical projects
- Build a multi-task house model predicting price and time-on-market; experiment with equal vs variance-based weights.
- Create a multi-label news classifier; tune per-label thresholds for a chosen precision target.
- CTR+CVR prototype with a shared trunk and two heads; compare business metrics as you sweep loss weights.
Stretch goals
- Add a latency penalty to your training (proxy via model size or FLOPs) and find a Pareto-efficient checkpoint.
- Implement simple uncertainty-based weighting by learning per-task log-variance parameters.
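For the second stretch goal, here is a hedged sketch of the per-task log-variance idea; this form treats each task loss like a classification-style term (regression terms usually get an extra factor of 0.5).

```python
import torch
import torch.nn as nn

class UncertaintyWeighting(nn.Module):
    """Learn a log-variance s_i per task: total = sum_i exp(-s_i) * L_i + s_i.
    Noisier tasks push s_i up, which automatically down-weights them."""
    def __init__(self, num_tasks: int):
        super().__init__()
        self.log_vars = nn.Parameter(torch.zeros(num_tasks))

    def forward(self, task_losses):
        total = 0.0
        for i, loss in enumerate(task_losses):
            total = total + torch.exp(-self.log_vars[i]) * loss + self.log_vars[i]
        return total
```

Pass log_vars to the same optimizer as the model parameters so the weights and the model are learned jointly.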
Exercises
Exercise 1 (ex1): CTR + CVR weighted loss with masking
You have BCE_click = 0.35 averaged over the full batch and BCE_cvr = 0.85 averaged only over the examples that have a CVR label (40% of the batch). Use w_click = 1.0 and w_cvr = 2.0 with proper masking. Compute the total loss from these observed averages, and explain why masking matters.
- Expected: numeric total loss and a short rationale.
Exercise 2 (ex2): Multi-objective with latency penalty
You optimize classification loss with a latency constraint of 50 ms using penalty J = L_ce + λ * max(0, latency − 50). For a batch, L_ce = 0.42 and latency = 62 ms. Compute J for λ = 0.02 and λ = 0.10. Which setting enforces latency more strictly, and why?
- Expected: two J values and a brief explanation.
Checklist before submitting
- ☐ Loss formulas are written clearly.
- ☐ Masking explained for missing labels.
- ☐ Trade-off reasoning is concise.
Mini challenge
You have three tasks: signup prediction (AUC target), retention prediction (AUC target), and churn reasons (multi-label F1). Propose initial loss weights and one rule for when to increase or decrease each weight based on validation trends. Keep it to 3–5 lines.
Next steps
- Try dynamic task weighting (uncertainty or performance-based) and compare stability.
- Plot at least five weight configurations and pick a Pareto-efficient model for your use case.
- Document your chosen weights and the business rationale so teammates can iterate confidently.