Why this matters
NLP engineers often fine-tune and serve models that are expensive to run. Managing GPUs and cost well means you can ship models within budget, meet SLAs, and iterate faster. Real tasks you will face:
- Choosing a GPU type that fits model memory without overpaying.
- Keeping GPU utilization high (e.g., 70–95%) so you pay for speed, not idle time.
- Using mixed precision, gradient accumulation, and checkpointing to fit large models on smaller, cheaper GPUs.
- Deciding on spot/preemptible vs on-demand capacity and creating safe auto-resume plans.
- Estimating total run cost before you press "start."
Concept explained simply
Cost = Price per GPU-hour × (Number of GPUs) × (Hours). You control all three by:
- Picking the right GPU size (memory and speed).
- Reducing time-to-train (throughput up, stalls down).
- Cutting waste (idle time, overprovisioning, unnecessary precision).
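To make the formula concrete, here is a minimal estimator sketch in Python. The function name and parameters are illustrative, not from any particular library, and the throughput number is something you would measure on your own hardware.

```python
def estimate_cost(samples, epochs, samples_per_min, num_gpus, price_per_gpu_hour,
                  spot_discount=0.0, interruption_overhead=0.0):
    """Rough cost estimate: price per GPU-hour x number of GPUs x hours.

    spot_discount: fraction off the on-demand price (e.g. 0.5 for 50% cheaper).
    interruption_overhead: extra wall-time fraction expected from restarts (e.g. 0.1).
    """
    hours = (samples * epochs) / samples_per_min / 60.0
    hours *= 1.0 + interruption_overhead
    return num_gpus * hours * price_per_gpu_hour * (1.0 - spot_discount)

# Example: 50k samples, 2 epochs, 900 samples/min, one GPU at $2.50/hour.
print(f"${estimate_cost(50_000, 2, 900, 1, 2.50):.2f}")
```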
Mental model
Think of training like moving water through a pipe. The GPU is the pipe diameter (throughput). Your data loader and preprocessing are the faucet. If the faucet is slow, the pipe is underused. If the pipe is too large, you pay for unused capacity. Your goal: match faucet to pipe, and pick the narrowest pipe that still meets deadlines.
Jargon decoder
- Mixed precision (FP16/BF16): Lower precision math that reduces memory and boosts throughput with minimal accuracy loss.
- Gradient checkpointing: Save memory by recomputing activations during backward; slower, but fits larger batches/models.
- Gradient accumulation: Use multiple micro-batches to simulate a larger batch without increasing peak memory.
- Spot/preemptible: Cheaper GPUs that might be interrupted; requires auto-resume.
- Utilization: Percentage of time GPU is actively computing; higher is better.
Quick recipes you can use today
Recipe 1: Fit a model on a smaller GPU
- Enable mixed precision (FP16/BF16).
- Turn on gradient checkpointing.
- Reduce micro-batch size and use gradient accumulation to keep effective batch size.
- Use packed sequences and static padding where possible.
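A minimal PyTorch sketch of Recipe 1, using a toy model so it runs anywhere; in a real run you would swap in your transformer and data loader, and the sizes and hyperparameters here are placeholders, not recommendations.

```python
import torch
import torch.nn as nn
from torch.utils.checkpoint import checkpoint

# Toy stand-in for a transformer so the sketch is self-contained.
class ToyModel(nn.Module):
    def __init__(self, dim=512, layers=4):
        super().__init__()
        self.blocks = nn.ModuleList([nn.Sequential(nn.Linear(dim, dim), nn.GELU())
                                     for _ in range(layers)])
        self.head = nn.Linear(dim, 1)

    def forward(self, x):
        for block in self.blocks:
            # Gradient checkpointing: recompute activations in backward to save memory.
            x = checkpoint(block, x, use_reentrant=False)
        return self.head(x)

device = "cuda" if torch.cuda.is_available() else "cpu"
amp_dtype = torch.float16 if device == "cuda" else torch.bfloat16
model = ToyModel().to(device)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)
scaler = torch.cuda.amp.GradScaler(enabled=device == "cuda")  # FP16 needs loss scaling; BF16 usually does not
accum_steps = 4                                               # micro-batch x accum = effective batch

model.train()
for step in range(16):
    x = torch.randn(2, 512, device=device)                    # micro-batch of 2
    target = torch.zeros(2, 1, device=device)
    with torch.autocast(device_type=device, dtype=amp_dtype):
        loss = nn.functional.mse_loss(model(x), target) / accum_steps  # average over micro-batches
    scaler.scale(loss).backward()
    if (step + 1) % accum_steps == 0:                          # one optimizer step per effective batch
        scaler.step(optimizer)
        scaler.update()
        optimizer.zero_grad(set_to_none=True)
```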
Recipe 2: Raise utilization to cut hours
- Prefetch and increase data loader workers until CPU or disk becomes a bottleneck.
- Cache tokenized data to avoid re-tokenizing each epoch.
- Pin memory for faster host-to-device transfers; overlap transfer with compute.
- Profile and remove long Python-side transforms from the hot path.
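A small data-loading sketch for Recipe 2, assuming a pre-tokenized dataset cached as tensors (the dataset shape and worker counts are placeholders you would tune against your own CPU and disk).

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

def main():
    # Stand-in for a cached, pre-tokenized dataset (tokenize once, not every epoch).
    input_ids = torch.randint(0, 32_000, (10_000, 1024))
    dataset = TensorDataset(input_ids)

    loader = DataLoader(
        dataset,
        batch_size=8,
        shuffle=True,
        num_workers=4,            # raise until CPU or disk becomes the bottleneck
        pin_memory=True,          # page-locked host memory speeds host-to-device copies
        persistent_workers=True,  # keep workers alive across epochs
        prefetch_factor=2,        # batches each worker prepares ahead of time
    )

    device = "cuda" if torch.cuda.is_available() else "cpu"
    for (batch,) in loader:
        # non_blocking=True lets the copy overlap with GPU compute when pin_memory is on.
        batch = batch.to(device, non_blocking=True)
        # ... forward/backward would go here ...
        break

if __name__ == "__main__":
    main()
```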
Recipe 3: Safe spot strategy
- Checkpoint often (by steps or minutes), keep last N checkpoints.
- Log step/epoch and RNG state; on restart, resume from last checkpoint cleanly.
- Use stateless data iterators (deterministic sharding) so samples aren't skipped/duplicated.
- Accept 5–20% time overhead in exchange for a lower hourly price.
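A checkpoint/resume sketch for Recipe 3. The file path and field names are illustrative; the point is that resuming needs the model, optimizer, step counter, and RNG state, not just the weights.

```python
import os
import torch

CKPT_PATH = "checkpoint.pt"   # illustrative path; in practice keep the last N files

def save_checkpoint(model, optimizer, step, path=CKPT_PATH):
    # Save everything needed to resume exactly: weights, optimizer, step, RNG state.
    torch.save({
        "model": model.state_dict(),
        "optimizer": optimizer.state_dict(),
        "step": step,
        "torch_rng": torch.get_rng_state(),
        "cuda_rng": torch.cuda.get_rng_state_all() if torch.cuda.is_available() else None,
    }, path)

def load_checkpoint(model, optimizer, path=CKPT_PATH):
    if not os.path.exists(path):
        return 0                                   # fresh start
    ckpt = torch.load(path, map_location="cpu")
    model.load_state_dict(ckpt["model"])
    optimizer.load_state_dict(ckpt["optimizer"])
    torch.set_rng_state(ckpt["torch_rng"])
    if ckpt["cuda_rng"] is not None and torch.cuda.is_available():
        torch.cuda.set_rng_state_all(ckpt["cuda_rng"])
    return ckpt["step"] + 1                        # resume on the next step

# Usage inside a training loop (model/optimizer defined elsewhere):
# start_step = load_checkpoint(model, optimizer)
# for step in range(start_step, total_steps):
#     ...train...
#     if step % save_every == 0:
#         save_checkpoint(model, optimizer, step)
```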
GPU selection mini-guide
- Small model (≤7B parameters) + LoRA: a 16–24 GB GPU often works with mixed precision.
- Medium model (13–34B): 24–80 GB per GPU; consider tensor parallelism or FSDP sharding if needed.
- Large batch or long context: memory-driven; think gradient accumulation + checkpointing.
- Inference with low traffic: prefer autoscaling and right-size to avoid idle GPUs.
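A rough back-of-envelope helper for the mini-guide above. It assumes 2 bytes per parameter for FP16/BF16 weights and roughly 16 bytes per trainable parameter for AdamW state plus FP32 master weights and gradients; activations and framework overhead are ignored, so treat the result as a floor, not a guarantee. The LoRA parameter count is an illustrative ballpark.

```python
def rough_vram_gb(total_params_b, trainable_params_b, weight_bytes=2, optimizer_bytes=16):
    """Floor estimate of training VRAM in GB (ignores activations and overhead).

    total_params_b / trainable_params_b: parameter counts in billions.
    weight_bytes: ~2 for FP16/BF16 weights.
    optimizer_bytes: rough AdamW cost per *trainable* param (FP32 master weights,
    gradients, and two moment tensors); a rule of thumb, not an exact figure.
    """
    weights = total_params_b * 1e9 * weight_bytes
    optimizer = trainable_params_b * 1e9 * optimizer_bytes
    return (weights + optimizer) / 1024**3

# 7B model with LoRA (~0.02B trainable params): weights dominate, consistent with
# the 16-24 GB guidance above once activations are added.
print(f"7B + LoRA: at least {rough_vram_gb(7, 0.02):.1f} GB")
# 7B full fine-tune: optimizer state dominates and far exceeds a single 24 GB card.
print(f"7B full fine-tune: at least {rough_vram_gb(7, 7):.1f} GB")
```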
Worked examples
Example 1 – Plan a fine-tune and estimate cost
Scenario: You will LoRA fine-tune a 7B model for 3 epochs on 100k samples, seq length 1,024. Throughput target: 1,100 samples/min on a 24 GB GPU with FP16 and gradient accumulation of 4 (micro-batch 2). Estimated time: roughly 100k / 1,100 ≈ 91 min per epoch, × 3 epochs ≈ 273 min total ≈ 4.6 h. With on-demand price P dollars/GPU-hour:
- Cost ≈ 1 GPU × 4.6 h × P = 4.6P.
- If spot is 50% cheaper and you add 10% time overhead due to one interruption: Cost ≈ 1 × (4.6 × 1.1) × (0.5P) ≈ 2.53P.
Decision: If you can tolerate restart overhead, spot saves ~45% in this scenario.
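The same arithmetic as a few lines of Python, keeping P symbolic by setting it to 1:

```python
hours = 4.6                                    # estimated runtime from above
on_demand_cost = 1 * hours * 1.0               # 1 GPU × hours × P, with P = 1
spot_cost = 1 * (hours * 1.1) * (0.5 * 1.0)    # 10% restart overhead, 50% cheaper hourly price
savings = 1 - spot_cost / on_demand_cost       # ≈ 0.45
print(f"on-demand ≈ {on_demand_cost:.2f}P, spot ≈ {spot_cost:.2f}P, savings ≈ {savings:.0%}")
```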
Example 2 – Increase utilization to reduce wall time
Baseline: Utilization at 55%, step time 200 ms. After caching tokenized data, increasing data loader workers, and enabling pinned memory, utilization rises to 88% and step time drops to 130 ms.
- Speedup ≈ 200/130 ≈ 1.54× → a 10-hour job now takes ~6.5 hours.
- Cost reduction ≈ 35% for the same result (same GPUs, fewer hours).
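In code, using the step times from this example:

```python
baseline_step_ms, tuned_step_ms = 200, 130
speedup = baseline_step_ms / tuned_step_ms             # ≈ 1.54x
new_hours = 10 / speedup                                # the 10-hour job ≈ 6.5 h
cost_reduction = 1 - tuned_step_ms / baseline_step_ms   # ≈ 0.35
print(f"{speedup:.2f}x speedup, {new_hours:.1f} h, {cost_reduction:.0%} cheaper")
```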
Example 3 – Expected interruptions with spot
Assume average interruption once every 8 hours. Your job is 12 hours.
- Expected interruptions ≈ 12/8 = 1.5 → plan for 1–2 restarts.
- If each restart loses 5 minutes (checkpoint + warmup), overhead ≈ 10 minutes.
- If spot is 60% cheaper, savings usually dominate the small overhead.
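The same expectation, worked out in a few lines (budgeting for two restarts, the upper end of the 1–2 estimate):

```python
job_hours = 12
mean_hours_between_interruptions = 8
restart_minutes = 5

expected_interruptions = job_hours / mean_hours_between_interruptions  # 1.5 -> plan for 1-2
overhead_minutes = round(expected_interruptions) * restart_minutes      # ~10 minutes
# 60% cheaper hourly price, slightly longer wall time because of restarts.
spot_cost_fraction = 0.4 * (job_hours + overhead_minutes / 60) / job_hours
print(f"{expected_interruptions:.1f} expected interruptions, ~{overhead_minutes} min overhead, "
      f"spot ≈ {spot_cost_fraction:.0%} of the on-demand cost")
```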
Who this is for
- NLP Engineers and ML practitioners training or serving transformer models.
- Data scientists moving from experimentation to production with budget constraints.
Prerequisites
- Comfort with Python and a deep learning framework (e.g., PyTorch or similar).
- Basic understanding of training loops, batches, and evaluation.
- Familiarity with checkpointing and experiment tracking.
Learning path
- Refresh: Batching, precision, and memory basics.
- Learn: Mixed precision, gradient accumulation, checkpointing.
- Practice: Build a budget estimator and a restart-safe training script.
- Optimize: Raise utilization via data pipeline tuning.
- Scale: Consider multi-GPU sharding only when needed and measured.
Common mistakes and self-check
- Buying the biggest GPU by default.
- Self-check: Does your peak memory fit a cheaper GPU with FP16 and checkpointing?
- Low utilization due to slow data loading.
- Self-check: Is GPU busy ≥70%? If not, profile CPU, disk, and augmentation.
- No auto-resume for spot.
- Self-check: Can you kill the job and resume within minutes without losing progress?
- Ignoring effective batch size.
- Self-check: effective_batch = micro_batch × grad_accum × number_of_gpus.
- Tokenizing on the fly during training.
- Self-check: Pre-tokenize once and cache; measure the step time change.
Practical projects
- Budget Estimator: A small script that predicts GPU-hours and cost from dataset size, epochs, precision, and expected throughput.
- Restart-Safe Trainer: Train with frequent checkpoints; verify you can resume after a forced interruption.
- Utilization Booster: Tune data loading and measure tokens/sec and utilization before vs after changes.
Exercises
Do these now. Then check your work against the checklist below.
Exercise 1 – Plan a cost-efficient run
Scenario: Fine-tune a 7B model with LoRA for 2 epochs on 200k sequences (length 1,024). You can choose 1×24 GB on-demand or 1×24 GB spot. Assume with FP16 + gradient checkpointing + micro-batch 2 + grad accum 4 you achieve 1,200 samples/min. Checkpoint every 5 minutes; restart cost 3 minutes. The on-demand price is D per hour; spot is 50% of D. Estimate runtime and cost for both options and recommend one.
Exercise 2 – Hit a utilization target
Baseline: 45% GPU utilization, step time 220 ms, data loader workers=2, tokenization in loop. Target: ≥80% utilization. Propose concrete changes and explain expected effects.
- [ ] I estimated runtime as total samples ÷ samples/min.
- [ ] I computed total GPU-hours × price for on-demand and spot.
- [ ] I included interruption overhead in spot time.
- [ ] I proposed at least three utilization improvements.
Mini challenge
You have a weekend training budget cap of 12 GPU-hours total. Propose a training plan (precision, micro-batch, grad accumulation, checkpointing frequency, and whether to use spot) that stays under budget while finishing at least 1 epoch over 300k samples. Justify your choices briefly.
Next steps
- Automate cost estimation in your experiment launcher.
- Add periodic checkpoints and auto-resume to every training job by default.
- Track tokens/sec and utilization alongside loss metrics; optimize the bottleneck you can prove.
- For inference, experiment with quantization and autoscaling to minimize idle time.