Why this matters
As an Applied Scientist, you often need one model to do several things at once (multi-task), or to optimize a single task under several goals like accuracy, latency, and fairness (multi-objective). Doing this well reduces infrastructure cost, improves product metrics, and helps you ship faster.
- Real tasks: predict click-through rate and conversion rate together; detect multiple topics from text; forecast multiple demand signals per item; tune a model to be accurate but also fast and fair.
- Impact: fewer models to maintain, better generalization from shared learning, and clear trade-off controls between competing goals.
Concept explained simply
Multi-task learning (MTL)
Train a single model with a shared body ("trunk") and multiple heads, each head solving a different but related task. Example: one trunk for user-item features, one head for click, another for conversion.
Multi-objective optimization (MOO)
Optimize one model against several objectives at the same time. Example: minimize classification loss while keeping latency and unfairness small.
Mental model
- Multi-task: a Swiss Army knife — one handle (shared features) with multiple tools (task heads).
- Multi-objective: a balance scale — you adjust weights to balance competing goals.
Key differences at a glance
- Multi-task: multiple targets/labels; outputs are different tasks.
- Multi-objective: one model/task, multiple goals/constraints on its behavior.
- Both use weighted sums or constraints, but multi-task also structures outputs and losses per task.
Core building blocks
Model architecture
- Hard parameter sharing: one shared trunk, multiple heads. Simple and efficient (see the sketch after this list).
- Soft parameter sharing: separate task-specific trunks with regularization to keep them similar. More flexible, heavier.
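Below is a minimal hard-sharing sketch in PyTorch. The layer sizes and the two head names (head_click, head_conv) are illustrative assumptions, not requirements of the pattern.

```python
import torch
import torch.nn as nn

class SharedTrunkModel(nn.Module):
    """Hard parameter sharing: one shared trunk, one small head per task."""
    def __init__(self, in_dim: int, hidden: int = 64):
        super().__init__()
        # The shared trunk learns a representation used by every task.
        self.trunk = nn.Sequential(
            nn.Linear(in_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
        )
        # Cheap task-specific heads; they output logits, so apply
        # sigmoid inside the loss (BCEWithLogits) rather than here.
        self.head_click = nn.Linear(hidden, 1)
        self.head_conv = nn.Linear(hidden, 1)

    def forward(self, x):
        h = self.trunk(x)
        return self.head_click(h), self.head_conv(h)
```

Soft sharing would instead keep one trunk per task and add a regularization term that penalizes how far the trunks' weights drift apart.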
Loss design (the heart of it)
- Weighted sum: L = Σ w_i L_i (tasks or objectives). Choose weights by scale, importance, or with automatic methods.
- Dynamic weighting ideas: uncertainty weighting (higher noise → lower weight), GradNorm (balances gradient magnitudes), or simple performance-based reweighting.
- Constraints: convert them to penalties (e.g., a penalty that grows when latency exceeds its budget) or use Lagrange multipliers.
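A sketch combining the weighted sum with a constraint-turned-penalty, assuming you have a differentiable latency proxy (e.g., expected FLOPs mapped to milliseconds); the weights, λ, and the 50 ms budget below are placeholders.

```python
import torch

def combined_loss(task_losses, weights, latency_proxy=None,
                  budget_ms=50.0, lam=0.05):
    """L = sum_i w_i * L_i, plus an optional hinge penalty on a
    differentiable latency proxy tensor (placeholder budget and lambda)."""
    loss = sum(weights[name] * task_losses[name] for name in task_losses)
    if latency_proxy is not None:
        # Only the amount over budget is penalized (soft constraint).
        loss = loss + lam * torch.clamp(latency_proxy - budget_ms, min=0.0)
    return loss
```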
Data and sampling
- Missing labels per task: use masks so each task loss only uses examples with labels (see the masking sketch after this list).
- Imbalance: per-task sampling or class weighting.
- Task sampling: alternate batches per task to avoid dominance by the largest task.
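A minimal masking helper, assuming a 0/1 float mask that is 1 where the task's label exists and that unlabeled positions carry a placeholder label (e.g., 0.0) rather than NaN:

```python
import torch
import torch.nn.functional as F

def masked_bce(logits, labels, mask):
    """BCE averaged only over examples where mask == 1.
    Unlabeled examples contribute zero loss and zero gradient."""
    per_example = F.binary_cross_entropy_with_logits(
        logits, labels, reduction="none")
    # Clamp the denominator so an all-unlabeled batch does not divide by zero.
    return (per_example * mask).sum() / mask.sum().clamp(min=1.0)
```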
Evaluation
- Per-task metrics: track each task separately plus an aggregate.
- For multi-objective: plot trade-off curves (e.g., accuracy vs latency) and pick Pareto-efficient models (helper sketch after this list).
- Ablate weights to see sensitivity.
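A small helper for picking Pareto-efficient checkpoints from (accuracy, latency) pairs, assuming higher accuracy and lower latency are both preferred; the checkpoint names and numbers are made up.

```python
def pareto_front(points):
    """points: list of (name, accuracy, latency_ms).
    Keep a point unless another point is at least as good on both
    objectives and strictly better on at least one."""
    front = []
    for name, acc, lat in points:
        dominated = any(
            (a >= acc and l <= lat) and (a > acc or l < lat)
            for _, a, l in points
        )
        if not dominated:
            front.append((name, acc, lat))
    return front

# B dominates A (more accurate and faster); B and C remain on the front.
print(pareto_front([("A", 0.80, 40.0), ("B", 0.82, 38.0), ("C", 0.85, 55.0)]))
```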
Worked examples
Example 1: Multi-output regression
Goal: predict house price (regression) and time-on-market (regression) from the same features.
- Loss: L = w_price * MSE(price) + w_time * MSE(time).
- Scaling tip: set w_i inversely proportional to target variance to balance gradients.
Small numeric example
Suppose on a batch: MSE_price = 0.25, MSE_time = 4.0. If w_price = 2.0 and w_time = 0.5, then L = 2.0*0.25 + 0.5*4.0 = 0.5 + 2.0 = 2.5.
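The same arithmetic, plus the variance-based weighting tip, as a short sketch (the target values are synthetic):

```python
import torch

# Variance-based weights from training targets (synthetic values).
price = torch.tensor([3.1, 2.4, 5.0, 4.2])     # e.g., price in 100k USD
days = torch.tensor([20.0, 45.0, 10.0, 90.0])  # e.g., days on market
w_price, w_time = 1.0 / price.var(), 1.0 / days.var()
print(w_price.item(), w_time.item())  # the higher-variance target gets the smaller weight

# The hand-picked weights from the numeric example above:
mse_price, mse_time = 0.25, 4.0
total = 2.0 * mse_price + 0.5 * mse_time
print(total)  # 2.5
```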
Example 2: Multi-label classification
Goal: detect multiple topics per document. Use a single head with sigmoid outputs for each label.
- Loss per label: binary cross-entropy. Sum or average across labels.
- Threshold tuning: choose per-label thresholds based on precision/recall needs.
Practical tip
If rare labels underperform, increase their weight or oversample documents containing them.
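A sketch of the multi-label setup with per-label up-weighting for rare topics and per-label decision thresholds; the label count, pos_weight values, and thresholds are assumptions you would tune on validation data.

```python
import torch
import torch.nn as nn

num_labels = 5
logits = torch.randn(8, num_labels)                     # outputs for 8 documents
labels = torch.randint(0, 2, (8, num_labels)).float()   # multi-hot topic labels

# pos_weight > 1 up-weights positives of rare labels (here: labels 3 and 4).
pos_weight = torch.tensor([1.0, 1.0, 1.0, 4.0, 6.0])
criterion = nn.BCEWithLogitsLoss(pos_weight=pos_weight)  # averages across labels
loss = criterion(logits, labels)

# Per-label thresholds chosen for each label's precision/recall target.
thresholds = torch.tensor([0.5, 0.5, 0.5, 0.3, 0.25])
preds = (torch.sigmoid(logits) >= thresholds).int()
```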
Example 3: CTR + CVR with selection bias
Goal: predict click (CTR) and purchase (CVR). CVR labels are observed only after a click, so the conversion head sees a biased subset of traffic (selection bias).
- Architecture: shared trunk → head_click (sigmoid), head_conv (sigmoid).
- Loss: L = w_click * BCE(click) + w_conv * BCE(conv_with_mask). Use a mask so CVR loss applies only when label exists.
- Business objective: revenue ∝ CTR * CVR * price. Adjust w_conv upward if conversion quality is more important.
Handling missing CVR labels
Use a mask for unavailable CVR labels. Consider techniques to reduce bias, such as inverse propensity weighting or modeling the cascade explicitly.
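Putting the two heads and the CVR mask together in one loss (a sketch; the default weights and the 0/1 float mask convention are assumptions):

```python
import torch
import torch.nn.functional as F

def ctr_cvr_loss(click_logit, conv_logit, click_label, conv_label,
                 conv_mask, w_click=1.0, w_conv=2.0):
    """Weighted CTR + CVR loss; CVR contributes only where its label exists."""
    loss_click = F.binary_cross_entropy_with_logits(click_logit, click_label)
    per_example = F.binary_cross_entropy_with_logits(
        conv_logit, conv_label, reduction="none")
    # Average the CVR loss over labeled examples only (mask is a 0/1 float).
    loss_conv = (per_example * conv_mask).sum() / conv_mask.sum().clamp(min=1.0)
    return w_click * loss_click + w_conv * loss_conv
```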
Practical steps to build one
- Define outputs and objectives: list tasks or goals. Label any constraints (e.g., latency ≤ 50 ms).
- Choose architecture: start with a shared trunk and simple per-task heads.
- Pick initial weights: begin with equal weights after normalizing losses; adjust based on metric priorities.
- Implement masking: ensure each task’s loss uses only valid labels.
- Train with balanced sampling: rotate minibatches across tasks if label volume differs (see the sketch after this list).
- Evaluate: report per-task metrics and an aggregate; for multi-objective, produce trade-off points.
- Iterate on weights: small, deliberate changes; document effects.
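The balanced-sampling step could look like this round-robin sketch over task-specific data loaders; the rotation rule and loader setup are one option among several.

```python
from itertools import cycle

def round_robin_batches(loaders, steps):
    """Yield (task_name, batch) pairs, one task per step in rotation.
    Smaller tasks are cycled so every task appears equally often."""
    iters = {name: cycle(loader) for name, loader in loaders.items()}
    names = list(loaders)
    for step in range(steps):
        task = names[step % len(names)]
        yield task, next(iters[task])
```

Note that itertools.cycle caches one full pass of each loader in memory; for large datasets, re-create the iterator when it is exhausted instead.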
Simple aggregate metric idea
Compute a weighted score of normalized task metrics (e.g., average of per-task z-scores) to compare checkpoints without hiding weak tasks.
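One way to compute such an aggregate, assuming every metric is oriented so that higher is better:

```python
import numpy as np

def aggregate_score(metric_history, current, weights):
    """Weighted average of per-task z-scores.
    metric_history: task -> list of past checkpoint values (higher is better).
    current: task -> value for the checkpoint being scored."""
    score, total_w = 0.0, 0.0
    for task, value in current.items():
        hist = np.asarray(metric_history[task], dtype=float)
        std = hist.std() if hist.std() > 0 else 1.0  # guard against zero spread
        w = weights.get(task, 1.0)
        score += w * (value - hist.mean()) / std
        total_w += w
    return score / total_w
```

Because each task is normalized by its own history, a collapse on one task pulls the aggregate down even if another task improves a lot.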
Who this is for
- Applied Scientists and ML Engineers deploying models that must serve multiple predictions or balance accuracy with runtime/business constraints.
- Data Scientists upgrading single-task models into unified, production-ready systems.
Prerequisites
- Comfort with supervised learning (classification/regression), basic neural networks.
- Understanding of common losses (MSE, cross-entropy) and evaluation metrics.
- Familiarity with minibatch training and validation splits.
Learning path
- Before: single-task modeling, metrics and calibration, regularization.
- Now: multi-task vs multi-objective basics (this lesson).
- Next: advanced weighting (uncertainty, GradNorm), gradient conflict methods, constraint handling and Pareto optimization.
Common mistakes and self-check
- One task dominates training: check gradient magnitudes; try task reweighting or balanced sampling.
- Ignoring missing labels: ensure masked losses; verify zero loss contribution when label is absent.
- Mismatched scales: normalize losses or use variance-based weights so tasks train at similar rates.
- Evaluating only aggregate: always inspect per-task metrics to avoid hidden regressions.
- Static weights forever: revisit weights after observing trade-offs and business impact.
Self-check mini-audit
- ☐ Do I log per-task losses and metrics each epoch?
- ☐ Are loss weights justified and documented?
- ☐ Is masking correctly implemented and unit-tested?
- ☐ Do I have at least one trade-off curve or ablation showing weight sensitivity?
Practical projects
- Build a multi-task house model predicting price and time-on-market; experiment with equal vs variance-based weights.
- Create a multi-label news classifier; tune per-label thresholds for a chosen precision target.
- CTR+CVR prototype with a shared trunk and two heads; compare business metrics as you sweep loss weights.
Stretch goals
- Add a latency penalty to your training (proxy via model size or FLOPs) and find a Pareto-efficient checkpoint.
- Implement simple uncertainty-based weighting by learning per-task log-variance parameters.
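For the second stretch goal, here is a hedged sketch of the per-task log-variance idea; this form treats each task loss like a classification-style term (regression terms usually get an extra factor of 0.5).

```python
import torch
import torch.nn as nn

class UncertaintyWeighting(nn.Module):
    """Learn a log-variance s_i per task: total = sum_i exp(-s_i) * L_i + s_i.
    Noisier tasks push s_i up, which automatically down-weights them."""
    def __init__(self, num_tasks: int):
        super().__init__()
        self.log_vars = nn.Parameter(torch.zeros(num_tasks))

    def forward(self, task_losses):
        total = 0.0
        for i, loss in enumerate(task_losses):
            total = total + torch.exp(-self.log_vars[i]) * loss + self.log_vars[i]
        return total
```

Pass log_vars to the same optimizer as the model parameters so the weights and the model are learned jointly.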
Exercises
Exercise 1 (ex1): CTR + CVR weighted loss with masking
You have BCE_click = 0.35 averaged over the full batch and BCE_cvr = 0.85 averaged only over the examples that have a CVR label (40% of the batch). Use w_click = 1.0 and w_cvr = 2.0 with proper masking. Compute the total loss from these observed averages, and explain why masking matters.
- Expected: numeric total loss and a short rationale.
Exercise 2 (ex2): Multi-objective with latency penalty
You optimize classification loss with a latency constraint of 50 ms using penalty J = L_ce + λ * max(0, latency − 50). For a batch, L_ce = 0.42 and latency = 62 ms. Compute J for λ = 0.02 and λ = 0.10. Which setting enforces latency more strictly, and why?
- Expected: two J values and a brief explanation.
Checklist before submitting
- ☐ Loss formulas are written clearly.
- ☐ Masking explained for missing labels.
- ☐ Trade-off reasoning is concise.
Mini challenge
You have three tasks: signup prediction (AUC target), retention prediction (AUC target), and churn reasons (multi-label F1). Propose initial loss weights and one rule for when to increase or decrease each weight based on validation trends. Keep it to 3–5 lines.
Next steps
- Try dynamic task weighting (uncertainty or performance-based) and compare stability.
- Plot at least five weight configurations and pick a Pareto-efficient model for your use case.
- Document your chosen weights and the business rationale so teammates can iterate confidently.