Why this matters
In production, models don’t live in notebooks—they power APIs with strict latency and reliability targets. Poor loading and missing warmup cause cold starts, p95 spikes, failed readiness checks, and wasted GPU/CPU memory.
- Reduce cold-start latency after scale-out or deploys.
- Stabilize p95/p99 latency for SLAs and user experience.
- Avoid out-of-memory (OOM) during initialization.
- Ensure thread/process-safe model instances under concurrency.
Real tasks you will do on the job
- Design model initialization sequences for CPU/GPU services.
- Warm tokenizers, feature pipelines, and compiled graphs before receiving traffic.
- Tune batch shapes and cache sizes to match real traffic.
- Implement readiness probes that only pass after warmup.
- Plan blue/green deployments with warm pools to avoid cold starts.
Concept explained simply
Model loading is getting your model and its dependencies into memory and ready for inference: weights, tokenizer/featurizer, device placement, and any compiled/optimized representation.
Warmup is running representative requests (or dummy inputs) right after loading to trigger one-time costs: JIT/graph compile, memory allocation, kernel caching, page faults, lazy CUDA initialization, and CPU caches.
Mental model
Think of a restaurant kitchen. Loading is stocking the fridge and turning on appliances. Warmup is preheating ovens and doing a test dish so the first paying customer doesn’t wait extra.
- Eager load: prepare everything up front. Faster first request, longer startup.
- Lazy load: prepare on first use. Faster startup, slower first request.
- Balanced: eager-load critical parts; lazy-load rare paths (see the sketch below).
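A minimal sketch of the balanced approach, assuming hypothetical load_model and load_rare_featurizer helpers: the critical model is loaded eagerly at startup, while a rarely used component is loaded lazily (and thread-safely) on first use.
# Sketch: eager-load the critical model, lazy-load a rarely used component.
# load_model and load_rare_featurizer are hypothetical loaders.
import threading
from my_models import load_model, load_rare_featurizer

model = load_model("main-model")  # eager: cost paid once at startup

_rare = None
_rare_lock = threading.Lock()

def get_rare_featurizer():
    # lazy: the first caller pays the load cost; later callers reuse the instance
    global _rare
    if _rare is None:
        with _rare_lock:
            if _rare is None:
                _rare = load_rare_featurizer("rare-path")
    return _rare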
Readiness checklist
- Weights loaded and moved to the right device (CPU/GPU).
- Tokenizer/featurizer initialized with vocab, rules, and thread-safe settings.
- One representative forward pass run for each expected batch shape.
- Caches primed (e.g., convolution algorithms, attention masks, ONNX/TensorRT kernels).
- Concurrency validated (thread/process safety; one model instance per worker if needed).
- Memory footprint measured vs limits, with headroom.
- Readiness/health checks pass only after warmup completes (see the gating sketch after this checklist).
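A gating sketch using FastAPI; the framework and endpoint path are assumptions, and load_model/run_warmup are hypothetical helpers standing in for your own loading and warmup code. The same idea applies to any HTTP/gRPC server: readiness returns success only after load and warmup finish.
# Sketch: readiness flips to true only after load + warmup succeed.
from fastapi import FastAPI, Response
from my_service import load_model, run_warmup  # hypothetical helpers

app = FastAPI()
state = {"ready": False, "model": None}

@app.on_event("startup")
def load_and_warm():
    state["model"] = load_model()
    run_warmup(state["model"])     # representative shapes, see the worked examples
    state["ready"] = True          # only now may the probe pass

@app.get("/ready")
def ready(response: Response):
    if not state["ready"]:
        response.status_code = 503  # orchestrator keeps the instance out of rotation
    return {"ready": state["ready"]}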
Worked examples
Example 1 — PyTorch model warmup (CPU/GPU)
# Pseudo-code
import torch
from my_models import ResNet50
DEVICE = "cuda" if torch.cuda.is_available() else "cpu"
model = ResNet50()
state = torch.load("resnet50.pt", map_location="cpu")
model.load_state_dict(state)
model.eval()
model.to(DEVICE)
# Optional: compile or torchscript (use what's available in your environment)
# model = torch.jit.trace(model, example_inputs)
# Warmup with representative shapes
with torch.inference_mode():
    for batch in [1, 4]:  # reflect real traffic mix
        dummy = torch.randn(batch, 3, 224, 224, device=DEVICE)
        _ = model(dummy)
# Now attach to the request handler
- Moves model to target device before warmup.
- Runs several batch sizes to populate caches.
- Uses inference_mode to avoid autograd overhead.
Example 2 — TensorFlow model warmup
import tensorflow as tf
model = tf.keras.models.load_model("/models/bert_classifier")
# Trace graph with tf.function
@tf.function(jit_compile=False)
def infer(x):
    return model(x, training=False)
# Warmup with representative shapes
dummy = tf.random.uniform((2, 128), maxval=30522, dtype=tf.int32)
_ = infer(dummy)
# For TF Serving, you can also ship warmup requests in the SavedModel's
# assets.extra directory so they run at server start (see the sketch after this example).
- Graph is traced to avoid first-call overhead.
- Warmup uses realistic sequence length.
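For the assets-based warmup mentioned in the comments above, TF Serving replays a TFRecord of PredictionLog protos stored at assets.extra/tf_serving_warmup_requests inside the model version directory. A minimal sketch follows; it assumes the tensorflow-serving-api package, and the model name, input key, and version path are placeholders for your deployment.
# Sketch: write a TF Serving warmup record so the server replays it at start.
import os
import numpy as np
import tensorflow as tf
from tensorflow_serving.apis import predict_pb2, prediction_log_pb2

warmup_path = "/models/bert_classifier/1/assets.extra/tf_serving_warmup_requests"
os.makedirs(os.path.dirname(warmup_path), exist_ok=True)
with tf.io.TFRecordWriter(warmup_path) as writer:
    request = predict_pb2.PredictRequest()
    request.model_spec.name = "bert_classifier"  # assumed serving model name
    dummy = np.random.randint(0, 30522, size=(2, 128), dtype=np.int32)
    request.inputs["input_ids"].CopyFrom(tf.make_tensor_proto(dummy))  # assumed input key
    log = prediction_log_pb2.PredictionLog(
        predict_log=prediction_log_pb2.PredictLog(request=request)
    )
    writer.write(log.SerializeToString())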
Example 3 — ONNX Runtime InferenceSession warmup
import onnxruntime as ort
import numpy as np
providers = ["CUDAExecutionProvider", "CPUExecutionProvider"]
sess = ort.InferenceSession("model.onnx", providers=providers)
# Inspect input name and shape
iname = sess.get_inputs()[0].name
# Warmup with expected shapes
for batch in (1, 8):
    dummy = np.random.randn(batch, 3, 224, 224).astype("float32")
    _ = sess.run(None, {iname: dummy})
- Initializes CUDA provider and algorithm caches.
- Verifies both small and medium batch sizes.
Example 4 — NLP pipeline warmup
# Pseudo-code: tokenizer + model
from my_nlp import load_tokenizer, load_model
tok = load_tokenizer("my-bert-tokenizer")
model = load_model("my-bert")
model.eval()
sample_texts = [
"Short sample.",
"This is a typical user sentence likely to reach the service.",
]
encoded = tok(sample_texts, padding=True, truncation=True, max_length=128, return_tensors="pt")
_ = model(**encoded)
- Tokenizer initialized and caches warmed.
- Typical length and padding/truncation behavior exercised.
Implementation patterns
Process model: pre-fork vs post-fork
- CPU model + pre-fork server (e.g., gunicorn): load in master before forking to benefit from copy-on-write memory sharing.
- GPU model: initialize CUDA per worker after fork to avoid forking GPU contexts (both patterns are sketched below).
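A minimal sketch of both patterns in a gunicorn config file; gunicorn's preload_app setting and post_fork hook are real, while load_gpu_model_and_warm is a hypothetical helper.
# gunicorn.conf.py sketch. preload_app imports the app (and any module-level
# CPU model load) in the master before forking, so read-only weights are
# shared across workers via copy-on-write.
preload_app = True
workers = 4

def post_fork(server, worker):
    # Runs inside each worker after fork: a safe place to create the CUDA
    # context and warm a GPU model. Never fork a process that has already
    # initialized CUDA. load_gpu_model_and_warm is a hypothetical helper.
    from my_models import load_gpu_model_and_warm
    worker.gpu_model = load_gpu_model_and_warm(device="cuda")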
Memory safety
- Measure peak RSS during warmup; keep headroom (e.g., 20–30%).
- Use half precision or quantization if appropriate to fit memory.
- Avoid warming with oversized batches that won’t occur in production.
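A quick headroom check to run right after warmup; psutil is an assumed dependency, and the CUDA branch applies only when a GPU is present.
# Sketch: verify memory headroom after warmup, before flipping readiness.
import psutil
import torch

def check_memory_headroom(limit_bytes, headroom=0.25):
    rss = psutil.Process().memory_info().rss  # current process RSS in bytes
    assert rss < limit_bytes * (1 - headroom), f"RSS {rss} leaves too little headroom"
    if torch.cuda.is_available():
        peak = torch.cuda.max_memory_allocated()  # peak bytes since process start
        total = torch.cuda.get_device_properties(0).total_memory
        assert peak < total * (1 - headroom), f"GPU peak {peak} too close to {total}"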
Choosing what to warm
- Critical path: tokenizer/featurizer + forward pass.
- Likely shapes: common batch sizes and sequence/image sizes.
- Rare branches: lazy-load to save time and memory.
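One way to encode these choices is a small declarative warmup plan that covers only the critical path; the batch sizes and shapes below are illustrative assumptions about traffic, not a prescription.
# Sketch: warm only the critical path with likely shapes; rare branches stay lazy.
import torch

WARMUP_PLAN = [
    {"batch_sizes": [1, 4, 8], "feature_shape": (3, 224, 224)},  # assumed common shapes
]

def warm_critical_path(model, device, plan=WARMUP_PLAN):
    with torch.inference_mode():
        for spec in plan:
            for bs in spec["batch_sizes"]:
                dummy = torch.randn(bs, *spec["feature_shape"], device=device)
                _ = model(dummy)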
Exercises
Do these two hands-on tasks. They mirror the exercises below the lesson and can be graded by a reviewer, but you can also self-check against the provided solutions.
Exercise 1 — Warmup plan and pseudo-code (Image classification)
Design a warmup for a ResNet-like model that will run on GPU in production but must also work on CPU in dev.
- Pick representative batch sizes (e.g., 1 and 8).
- Include device placement and inference-only mode.
- Ensure the warmup won’t cause OOM.
- Provide a short pseudo-code snippet for the warmup function.
Exercise 2 — Diagnose cold starts from logs
Given startup logs that show long first-request latency, identify the causes and propose mitigations.
Logs
[Init] Loading weights: 1.8s
[Init] Tokenizer vocab load: 700ms
[FirstRequest] CUDA context creation: 2.3s
[FirstRequest] JIT compile kernels: 1.2s
[FirstRequest] Sequence length 512 vs typical 128
[Metrics] p95 = 1450ms (first 5 mins), steady = 120ms
- List root causes behind the first request overhead.
- Propose a warmup that prevents these costs.
Self-check checklist
- You covered device placement and representative shapes.
- Your warmup uses inference-only/no-grad modes.
- Your plan prevents known first-call costs (CUDA init, JIT/graph trace, tokenizer caches).
- Memory headroom considered; no unrealistic batch sizes.
Common mistakes and how to self-check
- Only warming with batch=1: caches for larger batches remain cold. Warm multiple realistic shapes.
- Forking initialized CUDA: leads to instability. Initialize CUDA after fork, per worker.
- Tokenizer loaded lazily: first request spikes. Initialize and run a sample encode.
- Passing readiness too early: traffic hits cold model. Gate readiness until warmup completes.
- OOM during warmup: oversized dummy batches. Match production limits and add headroom.
- Ignoring concurrency: one global instance not thread-safe. Validate per-thread/worker access (a quick check is sketched below).
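For the concurrency point above, a quick startup check is sketched below; the thread count and dummy input shape are assumptions to adapt to your service.
# Sketch: fire a few concurrent dummy requests to surface thread-safety issues
# (exceptions, corrupted outputs) before real traffic arrives.
from concurrent.futures import ThreadPoolExecutor
import torch

def validate_concurrency(model, device, n_threads=4, n_requests=16):
    def one_request(_):
        with torch.inference_mode():
            return model(torch.randn(1, 3, 224, 224, device=device))  # illustrative shape
    with ThreadPoolExecutor(max_workers=n_threads) as pool:
        results = list(pool.map(one_request, range(n_requests)))  # re-raises worker errors
    assert len(results) == n_requests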
Quick self-audit
- Compare first vs second request latency; is the gap small after warmup?
- Check memory during warmup; peak below 80% of limit?
- Scale replicas from 0 to N; p95 remains stable within 10–20%?
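For the first audit point, a tiny timing sketch; predict stands in for your real request path and sample for a representative input.
# Sketch: compare first-request latency with steady state; a small gap after
# warmup means the one-time costs were paid before traffic arrived.
import time

def first_vs_steady(predict, sample, n_steady=20):
    start = time.perf_counter()
    predict(sample)
    first = time.perf_counter() - start
    steady = []
    for _ in range(n_steady):
        start = time.perf_counter()
        predict(sample)
        steady.append(time.perf_counter() - start)
    avg = sum(steady) / len(steady)
    print(f"first={first * 1000:.1f}ms steady={avg * 1000:.1f}ms ratio={first / avg:.1f}x")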
Mini challenge (15 minutes)
Your API scales from 1 to 6 replicas on traffic spikes. After scale-out, p95 shoots from 140ms to 900ms for 2 minutes. Propose a concrete warmup strategy to keep p95 under 200ms, including batch shapes, components to warm, and when readiness should flip to true.
Hint
- Warm tokenizer + model with typical sequence/image sizes and 2–3 batch sizes.
- Flip readiness only after successful warmup passes and a small synthetic load test.
- Consider a warm pool of pre-initialized instances during busy hours.
Practical projects
- Build a warmup module that supports CPU, CUDA, and ONNX Runtime providers with a single interface.
- Create a startup probe that runs a scripted warmup and exports a metric diff between first and steady-state latencies.
- Implement a blue/green deploy where green replicas warm fully and pass synthetic load before receiving real traffic.
Who this is for
- Machine Learning Engineers deploying models behind HTTP/gRPC APIs.
- Data/Platform Engineers responsible for runtime performance and reliability.
Prerequisites
- Basic Python and one DL framework (PyTorch or TensorFlow).
- Familiarity with containers and a process model (threads vs workers).
- Understanding of CPU/GPU memory limits and batching.
Learning path
- Start: Model packaging and artifact formats (SavedModel, TorchScript, ONNX).
- Then: Model loading and warmup (this lesson).
- Next: Concurrency, batching, and request routing.
- Finally: Observability (metrics/tracing) and autoscaling strategies.
Next steps
- Implement warmup in your current service and measure first vs steady-state latency.
- Automate warmup in container entrypoint; gate readiness on completion.
- Design a warm pool for peak hours if cold starts remain an issue.