
Multi-Task and Multi-Objective Modeling Basics

Learn Multi-Task and Multi-Objective Modeling Basics for free with explanations, exercises, and a quick test (for Applied Scientists).

Published: January 7, 2026 | Updated: January 7, 2026

Why this matters

As an Applied Scientist, you often need one model to do several things at once (multi-task), or to optimize a single task under several goals like accuracy, latency, and fairness (multi-objective). Doing this well reduces infrastructure cost, improves product metrics, and helps you ship faster.

  • Real tasks: predict click-through rate and conversion rate together; detect multiple topics from text; forecast multiple demand signals per item; tune a model to be accurate but also fast and fair.
  • Impact: fewer models to maintain, better generalization from shared learning, and clear trade-off controls between competing goals.

Concept explained simply

Multi-task learning (MTL)

Train a single model with a shared body ("trunk") and multiple heads, each head solving a different but related task. Example: one trunk for user-item features, one head for click, another for conversion.

Multi-objective optimization (MOO)

Optimize one model against several objectives at the same time. Example: minimize classification loss while keeping latency and unfairness small.

Mental model

  • Multi-task: a Swiss Army knife — one handle (shared features) with multiple tools (task heads).
  • Multi-objective: a balance scale — you adjust weights to balance competing goals.
Key differences at a glance
  • Multi-task: multiple targets/labels; outputs are different tasks.
  • Multi-objective: one model/task, multiple goals/constraints on its behavior.
  • Both use weighted sums or constraints, but multi-task also structures outputs and losses per task.

Core building blocks

Model architecture

  • Hard parameter sharing: one shared trunk, multiple heads. Simple and efficient (see the sketch after this list).
  • Soft parameter sharing: separate task-specific trunks with regularization to keep them similar. More flexible, heavier.
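For a concrete picture of hard parameter sharing, here is a minimal sketch assuming PyTorch; the layer sizes and head names such as head_click are illustrative placeholders, not a fixed recipe:

```python
import torch.nn as nn

class SharedTrunkModel(nn.Module):
    """Hard parameter sharing: one shared trunk feeding one small head per task."""
    def __init__(self, in_dim: int, hidden_dim: int = 64):
        super().__init__()
        # Shared trunk: features reused by every task.
        self.trunk = nn.Sequential(
            nn.Linear(in_dim, hidden_dim), nn.ReLU(),
            nn.Linear(hidden_dim, hidden_dim), nn.ReLU(),
        )
        # Task-specific heads (names are illustrative).
        self.head_click = nn.Linear(hidden_dim, 1)  # click logit
        self.head_conv = nn.Linear(hidden_dim, 1)   # conversion logit

    def forward(self, x):
        shared = self.trunk(x)
        return self.head_click(shared), self.head_conv(shared)
```

A soft-sharing variant would instead keep a separate trunk per task and add a regularizer that penalizes how far their weights drift apart.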

Loss design (the heart of it)

  • Weighted sum: L = Σ w_i L_i (tasks or objectives). Choose weights by scale, importance, or with automatic methods.
  • Dynamic weighting ideas: uncertainty weighting (higher noise → lower weight), GradNorm (balances gradient magnitudes), or simple performance-based reweighting.
  • Constraints: convert constraints to penalties (e.g., latency over budget) or use Lagrange multipliers.
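A minimal sketch of the weighted-sum and constraint-as-penalty ideas above, assuming PyTorch; the weights, lam, and the 50 ms budget are placeholders, and latency_ms stands for a scalar tensor from whatever latency proxy you use:

```python
import torch

def combined_loss(task_losses, weights, latency_ms=None,
                  latency_budget_ms=50.0, lam=0.1):
    """Weighted sum of per-task/objective losses plus an optional latency penalty."""
    # Weighted sum: L = sum_i w_i * L_i
    total = sum(w * l for w, l in zip(weights, task_losses))
    # Constraint turned into a penalty: cost only when the budget is exceeded.
    if latency_ms is not None:
        total = total + lam * torch.clamp(latency_ms - latency_budget_ms, min=0.0)
    return total
```

Raising lam enforces the constraint more strictly, which is exactly the knob explored in Exercise 2 below.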

Data and sampling

  • Missing labels per task: use masks so each task loss only uses examples with labels (see the sketch after this list).
  • Imbalance: per-task sampling or class weighting.
  • Task sampling: alternate batches per task to avoid dominance by the largest task.
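The masking idea in the first bullet takes only a few lines; a minimal sketch assuming PyTorch tensors, where missing targets are assumed to be filled with a 0 placeholder and flagged by a 0/1 mask:

```python
import torch

def masked_mse(pred, target, label_mask):
    """MSE for one task, counting only examples whose label exists (mask == 1)."""
    # Replace placeholder targets so they cannot leak NaNs or noise into the loss.
    safe_target = torch.where(label_mask.bool(), target, torch.zeros_like(target))
    per_example = (pred - safe_target) ** 2 * label_mask
    # Average over labeled examples only; clamp avoids dividing by zero.
    return per_example.sum() / label_mask.sum().clamp(min=1.0)
```

The same pattern works for classification: compute the per-example loss with reduction turned off, multiply by the mask, and normalize by the mask sum.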

Evaluation

  • Per-task metrics: track each task separately plus an aggregate.
  • For multi-objective: plot trade-off curves (e.g., accuracy vs. latency) and pick Pareto-efficient models (see the sketch after this list).
  • Ablate weights to see sensitivity.
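Pareto-efficient checkpoints can be picked with a few lines of plain Python; the (accuracy, latency) pairs below are made-up examples:

```python
def pareto_front(points):
    """Keep only non-dominated (accuracy, latency_ms) points.

    Higher accuracy and lower latency are better.
    """
    front = []
    for acc, lat in points:
        dominated = any(
            a >= acc and l <= lat and (a > acc or l < lat)
            for a, l in points
        )
        if not dominated:
            front.append((acc, lat))
    return front

checkpoints = [(0.81, 40.0), (0.83, 55.0), (0.80, 70.0), (0.84, 52.0)]
print(pareto_front(checkpoints))  # [(0.81, 40.0), (0.84, 52.0)]
```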

Worked examples

Example 1: Multi-output regression

Goal: predict house price (regression) and time-on-market (regression) from the same features.

  • Loss: L = w_price * MSE(price) + w_time * MSE(time).
  • Scaling tip: set w_i inversely proportional to target variance to balance gradients.
Small numeric example

Suppose on a batch: MSE_price = 0.25, MSE_time = 4.0. If w_price = 2.0 and w_time = 0.5, then L = 2.0*0.25 + 0.5*4.0 = 0.5 + 2.0 = 2.5.
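A few lines of plain Python reproduce this arithmetic and connect it to the inverse-variance scaling tip; the target variances below are hypothetical:

```python
# Weighted multi-output regression loss from the numeric example above.
mse_price, mse_time = 0.25, 4.0
w_price, w_time = 2.0, 0.5
print(w_price * mse_price + w_time * mse_time)  # 2.5

# Scaling tip: weights inversely proportional to target variance.
# Hypothetical variances; in practice estimate them from the training targets.
var_price, var_time = 0.5, 2.0
w_price, w_time = 1.0 / var_price, 1.0 / var_time  # 2.0 and 0.5, as above
```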

Example 2: Multi-label classification

Goal: detect multiple topics per document. Use a single head with sigmoid outputs for each label.

  • Loss per label: binary cross-entropy. Sum or average across labels.
  • Threshold tuning: choose per-label thresholds based on precision/recall needs.
Practical tip

If rare labels underperform, increase their weight or oversample documents containing them.
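A minimal sketch of this multi-label setup, assuming PyTorch; the label count, pos_weight values, and per-label thresholds are placeholders to be tuned on validation data:

```python
import torch
import torch.nn as nn

num_labels = 5
head = nn.Linear(128, num_labels)              # one logit (sigmoid output) per label

# Up-weight rare labels; these weights are placeholders, typically derived
# from inverse label frequency.
pos_weight = torch.tensor([1.0, 1.0, 4.0, 1.0, 8.0])
criterion = nn.BCEWithLogitsLoss(pos_weight=pos_weight)

features = torch.randn(32, 128)                # batch of trunk features
targets = torch.randint(0, 2, (32, num_labels)).float()
loss = criterion(head(features), targets)      # BCE averaged across labels and batch

# Per-label decision thresholds (placeholders) chosen for precision/recall needs.
thresholds = torch.tensor([0.5, 0.5, 0.3, 0.5, 0.2])
preds = (torch.sigmoid(head(features)) >= thresholds).int()
```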

Example 3: CTR + CVR with selection bias

Goal: predict click (CTR) and purchase (CVR). CVR labels are mostly observed after a click (selection bias).

  • Architecture: shared trunk → head_click (sigmoid), head_conv (sigmoid).
  • Loss: L = w_click * BCE(click) + w_conv * BCE(conv_with_mask). Use a mask so CVR loss applies only when label exists.
  • Business objective: revenue ∝ CTR * CVR * price. Adjust w_conv upward if conversion quality is more important.
Handling missing CVR labels

Use a mask for unavailable CVR labels. Consider techniques to reduce bias, such as inverse propensity weighting or modeling the cascade explicitly.
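Putting this example together, a minimal loss sketch assuming PyTorch and a two-head model like the one in the architecture section; missing CVR labels are assumed to be filled with 0 and flagged by conv_mask, and the weights are placeholders:

```python
import torch
import torch.nn.functional as F

def ctr_cvr_loss(click_logit, conv_logit, click_label, conv_label, conv_mask,
                 w_click=1.0, w_conv=2.0):
    """Weighted CTR + CVR loss; the CVR term uses only observed labels.

    All labels and the mask are float tensors with the same shape as the logits.
    """
    click_loss = F.binary_cross_entropy_with_logits(click_logit, click_label)
    conv_per_example = F.binary_cross_entropy_with_logits(
        conv_logit, conv_label, reduction="none"
    )
    # Zero out examples with no CVR label, then average over the labeled ones.
    conv_loss = (conv_per_example * conv_mask).sum() / conv_mask.sum().clamp(min=1.0)
    return w_click * click_loss + w_conv * conv_loss
```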

Practical steps to build one

  1. Define outputs and objectives: list tasks or goals. Label any constraints (e.g., latency ≤ 50 ms).
  2. Choose architecture: start with a shared trunk and simple per-task heads.
  3. Pick initial weights: begin with equal weights after normalizing losses; adjust based on metric priorities.
  4. Implement masking: ensure each task’s loss uses only valid labels.
  5. Train with balanced sampling: rotate minibatches across tasks if label volume differs.
  6. Evaluate: report per-task metrics and an aggregate; for multi-objective, produce trade-off points.
  7. Iterate on weights: small, deliberate changes; document effects.
Simple aggregate metric idea

Compute a weighted score of normalized task metrics (e.g., average of per-task z-scores) to compare checkpoints without hiding weak tasks.
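One way to sketch such an aggregate in plain Python; the checkpoint names, metric names, and values are made up, and every metric is assumed to be higher-is-better:

```python
from statistics import mean, pstdev

# Hypothetical validation metrics per checkpoint (higher = better).
checkpoints = {
    "ckpt_a": {"click_auc": 0.78, "conv_auc": 0.70},
    "ckpt_b": {"click_auc": 0.80, "conv_auc": 0.66},
    "ckpt_c": {"click_auc": 0.79, "conv_auc": 0.72},
}

def aggregate_score(checkpoints, weights):
    """Weighted average of per-task z-scores, computed across checkpoints."""
    tasks = list(weights)
    stats = {
        t: (mean(c[t] for c in checkpoints.values()),
            pstdev(c[t] for c in checkpoints.values()) or 1.0)  # avoid /0
        for t in tasks
    }
    return {
        name: sum(weights[t] * (m[t] - stats[t][0]) / stats[t][1] for t in tasks)
              / sum(weights.values())
        for name, m in checkpoints.items()
    }

print(aggregate_score(checkpoints, {"click_auc": 1.0, "conv_auc": 1.0}))
```

Because the z-scores are per task, a checkpoint that quietly sacrifices one task shows up as a negative contribution instead of being hidden behind a strong task.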

Who this is for

  • Applied Scientists and ML Engineers deploying models that must serve multiple predictions or balance accuracy with runtime/business constraints.
  • Data Scientists upgrading single-task models into unified, production-ready systems.

Prerequisites

  • Comfort with supervised learning (classification/regression), basic neural networks.
  • Understanding of common losses (MSE, cross-entropy) and evaluation metrics.
  • Familiarity with minibatch training and validation splits.

Learning path

  • Before: single-task modeling, metrics and calibration, regularization.
  • Now: multi-task vs multi-objective basics (this lesson).
  • Next: advanced weighting (uncertainty, GradNorm), gradient conflict methods, constraint handling and Pareto optimization.

Common mistakes and self-check

  • One task dominates training: check gradient magnitudes; try task reweighting or balanced sampling.
  • Ignoring missing labels: ensure masked losses; verify zero loss contribution when label is absent.
  • Mismatched scales: normalize losses or use variance-based weights so tasks train at similar rates.
  • Evaluating only aggregate: always inspect per-task metrics to avoid hidden regressions.
  • Static weights forever: revisit weights after observing trade-offs and business impact.
Self-check mini-audit
  • ☐ Do I log per-task losses and metrics each epoch?
  • ☐ Are loss weights justified and documented?
  • ☐ Is masking correctly implemented and unit-tested?
  • ☐ Do I have at least one trade-off curve or ablation showing weight sensitivity?

Practical projects

  • Build a multi-task house model predicting price and time-on-market; experiment with equal vs variance-based weights.
  • Create a multi-label news classifier; tune per-label thresholds for a chosen precision target.
  • CTR+CVR prototype with a shared trunk and two heads; compare business metrics as you sweep loss weights.
Stretch goals
  • Add a latency penalty to your training (proxy via model size or FLOPs) and find a Pareto-efficient checkpoint.
  • Implement simple uncertainty-based weighting by learning per-task log-variance parameters.
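For the second stretch goal, a minimal PyTorch sketch of uncertainty-based weighting with learned per-task log-variances (it follows the common exp(-s_i)*L_i + s_i formulation; class and parameter names are illustrative):

```python
import torch
import torch.nn as nn

class UncertaintyWeighting(nn.Module):
    """Learn one log-variance per task; noisier tasks receive smaller weights."""
    def __init__(self, num_tasks: int):
        super().__init__()
        self.log_vars = nn.Parameter(torch.zeros(num_tasks))

    def forward(self, task_losses):
        total = 0.0
        for i, loss in enumerate(task_losses):
            # exp(-s_i) down-weights noisy tasks; the + s_i term keeps the
            # optimizer from pushing every weight toward zero.
            total = total + torch.exp(-self.log_vars[i]) * loss + self.log_vars[i]
        return total
```

Remember to pass these log-variance parameters to the optimizer alongside the model parameters so they are actually learned.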

Exercises

Note: Everyone can do exercises and the quick test for free. Only logged-in users get saved progress.

Exercise 1 (ex1): CTR + CVR weighted loss with masking

You have BCE_click = 0.35 and BCE_cvr = 0.85 on a batch. CVR labels exist only for 40% of examples. Use w_click = 1.0 and w_cvr = 2.0 with proper masking. Compute the total loss given the observed averages, and explain why masking matters.

  • Expected: numeric total loss and a short rationale.

Exercise 2 (ex2): Multi-objective with latency penalty

You optimize classification loss with a latency constraint of 50 ms using penalty J = L_ce + λ * max(0, latency − 50). For a batch, L_ce = 0.42 and latency = 62 ms. Compute J for λ = 0.02 and λ = 0.10. Which setting enforces latency more strictly, and why?

  • Expected: two J values and a brief explanation.
Checklist before submitting
  • ☐ Loss formulas are written clearly.
  • ☐ Masking explained for missing labels.
  • ☐ Trade-off reasoning is concise.

Mini challenge

You have three tasks: signup prediction (AUC target), retention prediction (AUC target), and churn reasons (multi-label F1). Propose initial loss weights and one rule for when to increase or decrease each weight based on validation trends. Keep it to 3–5 lines.

Next steps

  • Try dynamic task weighting (uncertainty or performance-based) and compare stability.
  • Plot at least five weight configurations and pick a Pareto-efficient model for your use case.
  • Document your chosen weights and the business rationale so teammates can iterate confidently.

Practice Exercises

2 exercises to complete

Instructions (Exercise 1)

You have two heads: click and conversion. For a minibatch, average BCE_click = 0.35 over all examples. Average BCE_cvr = 0.85 computed only on examples with observed CVR labels (40% of the batch). Use w_click = 1.0 and w_cvr = 2.0. Compute total loss L = w_click*BCE_click + w_cvr*BCE_cvr and explain the role of masking in this setting.

Expected Output
Total loss = 1.0*0.35 + 2.0*0.85 = 2.05. Masking ensures CVR loss is computed only where labels exist, avoiding bias and spurious gradients.

Multi-Task and Multi-Objective Modeling Basics — Quick Test

Test your knowledge with 7 questions. Pass with 70% or higher.

