Why this matters
Choosing between CPU and GPU (and sizing them correctly) directly affects training time, inference latency, and your cloud bill. As a Machine Learning Engineer you will:
- Ship fast inference APIs without overpaying for idle GPUs.
- Run training jobs that finish overnight instead of over the weekend.
- Plan capacity for batch pipelines and A/B experiments.
- Diagnose bottlenecks (compute, memory, I/O) and right-size instances.
Concept explained simply
CPUs are great at many different tasks and branching logic. GPUs excel at doing the same math on many data points in parallel, which is perfect for deep learning (matrix multiplications and convolutions).
Mental model: highways vs. intersections
Imagine a city. A CPU is a smart intersection that can route complex traffic one car at a time very efficiently. A GPU is a multi-lane highway built for huge flows of similar cars going in the same direction. Deep learning sends thousands of identical "cars" (tensor ops) down the highway, so GPUs dominate. Traditional ML or control-heavy code often prefers the CPU intersection.
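To make the highway analogy concrete, here is a minimal timing sketch that runs the same large matrix multiplication on CPU and GPU. It assumes PyTorch is installed and a CUDA GPU is available; exact numbers vary widely by hardware.

```python
# Time one large matmul on CPU vs. GPU (a rough illustration, not a benchmark suite).
import time
import torch

N = 4096
a = torch.randn(N, N)
b = torch.randn(N, N)

t0 = time.perf_counter()
_ = a @ b                             # CPU matmul
cpu_s = time.perf_counter() - t0
print(f"CPU matmul: {cpu_s:.3f} s")

if torch.cuda.is_available():
    a_gpu, b_gpu = a.cuda(), b.cuda()
    _ = a_gpu @ b_gpu                 # warm-up: first call pays one-time CUDA setup cost
    torch.cuda.synchronize()
    t0 = time.perf_counter()
    _ = a_gpu @ b_gpu
    torch.cuda.synchronize()          # wait for the kernel to finish before stopping the clock
    gpu_s = time.perf_counter() - t0
    print(f"GPU matmul: {gpu_s:.3f} s  (speedup ~{cpu_s / gpu_s:.0f}x)")
```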
Decision checklist (quick rules)
- Deep nets (Transformers, CNNs) → GPU; classic ML (trees, linear models) and control-heavy code → CPU.
- Confirm the model and batch fit in GPU memory before anything else.
- For online inference, pick the cheapest option that still meets your latency SLO.
- Compare total job cost (hourly price × hours), not just the hourly rate.
Worked examples
Example 1 – Tabular model training
Gradient boosting or logistic regression on 10M rows. Use a compute-optimized CPU instance. Benefit: plenty of RAM for the dataset and strong per-core performance. A GPU is unlikely to help.
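A minimal CPU-side sketch for this case, assuming xgboost is installed; the random data stands in for your real 10M-row feature matrix:

```python
# Histogram-based gradient boosting on all CPU cores.
import numpy as np
from xgboost import XGBClassifier

rng = np.random.default_rng(0)
X = rng.normal(size=(100_000, 20))                    # stand-in features
y = (X[:, 0] + rng.normal(size=100_000) > 0).astype(int)

model = XGBClassifier(
    n_estimators=300,
    tree_method="hist",   # histogram splits: fast and memory-friendly on CPU
    n_jobs=-1,            # use every available core
)
model.fit(X, y)
print(f"train accuracy: {model.score(X, y):.3f}")
```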
Example 2 – Fine-tuning a Transformer
Fine-tune a BERT-base model for text classification. Use a single modern GPU with 16–24 GB of memory. Enable mixed precision to fit larger batches and speed up training.
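A minimal mixed-precision (AMP) training step in PyTorch. The tiny model and random batches are stand-ins for your Transformer and dataset; the AMP parts (autocast plus GradScaler) are what carry over.

```python
# Mixed-precision training loop sketch; falls back to fp32 when no GPU is present.
import torch
import torch.nn as nn

device = "cuda" if torch.cuda.is_available() else "cpu"
model = nn.Sequential(nn.Linear(128, 256), nn.ReLU(), nn.Linear(256, 2)).to(device)
loss_fn = nn.CrossEntropyLoss()
optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)
scaler = torch.cuda.amp.GradScaler(enabled=(device == "cuda"))  # avoids fp16 underflow

for _ in range(10):                                   # stand-in for your epochs/batches
    x = torch.randn(32, 128, device=device)
    y = torch.randint(0, 2, (32,), device=device)
    optimizer.zero_grad()
    with torch.cuda.amp.autocast(enabled=(device == "cuda")):  # forward in mixed precision
        loss = loss_fn(model(x), y)
    scaler.scale(loss).backward()                     # backward on the scaled loss
    scaler.step(optimizer)                            # unscale grads, then step
    scaler.update()
```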
Example 3 – Real-time inference API
Image classification at 30–60 requests/sec with p95 < 150 ms. Options: (a) a small GPU with micro-batching, or (b) autoscaled CPU replicas with an optimized inference engine and a smaller batch size. Choose the cheaper option that meets the latency SLO.
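A sketch of option (a): a background thread drains a request queue for a short window, then runs one batched forward pass. The names `model` and `batcher` are hypothetical; in a real service this would sit behind your web framework.

```python
# Micro-batching sketch: collect up to 8 requests or 10 ms, whichever comes first.
import queue
import threading
import time
import torch

MAX_BATCH, WINDOW_S = 8, 0.010       # batch-size cap and 10 ms collection window
request_q = queue.Queue()            # each item: (input_tensor, reply_queue)

def batcher(model):
    while True:
        items = [request_q.get()]    # block until the first request arrives
        deadline = time.perf_counter() + WINDOW_S
        while len(items) < MAX_BATCH and time.perf_counter() < deadline:
            try:
                timeout = max(0.0, deadline - time.perf_counter())
                items.append(request_q.get(timeout=timeout))
            except queue.Empty:
                break
        inputs = torch.stack([x for x, _ in items])
        with torch.no_grad():
            outputs = model(inputs)  # one batched forward pass for the whole window
        for (_, reply_q), out in zip(items, outputs):
            reply_q.put(out)         # hand each result back to its caller

# Usage: threading.Thread(target=batcher, args=(model,), daemon=True).start()
# Each request handler puts (tensor, its_own_reply_queue) on request_q and waits.
```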
Cost and sizing basics
Key drivers:
- GPU memory capacity (fits model weights, activations, optimizer states during training).
- GPU generation (newer = faster, more memory bandwidth).
- CPU: number of vCPUs and RAM; use compute-optimized for numeric workloads.
Very rough memory rules of thumb:
- Inference memory ≈ model weights + runtime buffers + batch-dependent activations.
- Training memory ≈ weights + gradients + optimizer states + activations (often 2–6× the weights). Mixed precision can nearly halve several of these components.
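A back-of-the-envelope estimate using these rules, assuming fp32 weights (4 bytes per parameter) and an Adam-style optimizer (two extra states per weight):

```python
# Rough memory estimate for a BERT-base-sized model (~110M parameters).
params = 110e6

weights_gb   = params * 4 / 1e9     # fp32 weights
grads_gb     = weights_gb           # one gradient per weight
optimizer_gb = 2 * weights_gb       # Adam keeps two moment tensors per weight

print(f"Inference (weights only): ~{weights_gb:.2f} GB + runtime buffers")
print(f"Training (before activations): ~{weights_gb + grads_gb + optimizer_gb:.2f} GB")
# Activations scale with batch size and sequence/image size and often dominate;
# mixed precision roughly halves the weight and activation portions.
```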
Approximate cloud prices (varies by region and provider; treat these as rough ranges):
- 1 vCPU: about $0.03–$0.10/hour.
- Entry GPU (e.g., T4/A10 class): about $0.30–$0.80/hour.
- Mid/high GPU (e.g., V100/A100 class): about $1.50–$4.00/hour.
- Top-tier GPU (e.g., H100 class): about $4.00–$10.00+/hour.
Compare total job cost: hourly price × hours. A faster GPU that halves training time may be cheaper overall.
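A quick total-cost comparison; the prices and speedup below are illustrative placeholders, not quotes from any provider:

```python
# Total job cost = hourly price x hours; a faster GPU can win on both time and cost.
cpu_price, cpu_hours = 0.50, 40     # e.g., 16 vCPUs at ~$0.03/vCPU-hour
gpu_price, gpu_hours = 2.50, 4      # e.g., a mid-tier GPU finishing ~10x faster

print(f"CPU job: ${cpu_price * cpu_hours:.2f}")   # $20.00
print(f"GPU job: ${gpu_price * gpu_hours:.2f}")   # $10.00 - faster AND cheaper here
```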
How to choose, step-by-step
- Identify workload: training vs. inference; batch vs. online; peak QPS and latency SLO.
- Estimate memory: model size, precision, batch size. If it doesn't fit in GPU memory, reduce batch size or precision, or pick a larger GPU.
- Pick compute: deep nets → GPU; classic ML or control-heavy → CPU; hybrid workloads might mix (CPU data prep + GPU model).
- Right-size: start with the smallest instance that fits; profile utilization; scale up/down based on headroom and SLOs.
- Optimize: mixed precision, micro-batching, data loader workers, quantization for inference.
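One of the optimizations from the last step, sketched in PyTorch: dynamic int8 quantization for CPU inference, which tends to help Linear-heavy models such as Transformers. The tiny model here is a hypothetical stand-in for your trained network.

```python
# Dynamic quantization: Linear layers are converted to int8 for faster CPU inference.
import torch
import torch.nn as nn

model = nn.Sequential(nn.Linear(768, 768), nn.ReLU(), nn.Linear(768, 2)).eval()

quantized = torch.quantization.quantize_dynamic(
    model, {nn.Linear}, dtype=torch.qint8   # only Linear layers are quantized
)

x = torch.randn(1, 768)
with torch.no_grad():
    print(quantized(x))                     # same call signature, smaller and faster on CPU
```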
Exercises
Do these to practice. There's a quick test at the end; everyone can take it for free. Only logged-in users have their progress saved.
Exercise 1 – CPU or GPU?
Decide CPU or GPU for each scenario and state one reason.
- Train XGBoost on a 20M-row tabular dataset.
- Fine-tune a 110M-parameter Transformer for 3 epochs.
- Serve 10 req/sec sentiment model with p95 < 80 ms.
- Batch embed 5M sentences overnight.
- Classify 512×512 images offline, 200k images/day.
- Run feature engineering with heavy joins and UDFs.
Write down your mapping and one reason for each.
Exercise 2 – Memory tier estimate
Pick a GPU memory tier that likely fits each case. Use the rough logic from above: inference memory ≈ weights + overhead; training memory ≈ a multiple of the weights due to activations and optimizer states.
- 50M-parameter CNN, fp32 training, batch 64.
- 1.3B-parameter language model, fp16 inference, batch 1.
- 7B-parameter language model, fp16 inference, batch 1–2.
Choose from: 8 GB, 16 GB, 24 GB, 40 GB+.
Checklist: did you think it through?
- Did you consider precision (fp32 vs. fp16)?
- Did you account for activations (training) and overhead?
- Did you balance cost vs. latency/throughput?
Common mistakes and self-check
- Overbuying GPUs: Paying for a powerful GPU to serve tiny models at low QPS. Self-check: Is GPU utilization consistently < 20%? Try CPU autoscaling.
- Ignoring GPU memory: Model crashes due to OOM. Self-check: Log the peak allocated GPU memory (see the snippet after this list); reduce batch size or use mixed precision.
- CPU-bound pipelines: Data loading or preprocessing bottlenecks starve the GPU. Self-check: GPU utilization low while CPU is at 100%? Increase loader workers, prefetch, and use faster codecs.
- No batching strategy: Real-time services without micro-batching waste GPU throughput. Self-check: Add a small batch window (e.g., 5–10 ms) and measure p95 latency.
- Wrong CPU family: Memory-bound ETL on compute-optimized instances. Self-check: If RAM is the limiter, use memory-optimized CPUs.
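Two quick self-check helpers for the list above, sketched in PyTorch (assuming a CUDA device for the memory check):

```python
# Peak-memory logging and a better-fed data loader.
import torch
from torch.utils.data import DataLoader, TensorDataset

# "Ignoring GPU memory": log the peak allocation seen so far (e.g., after a step).
if torch.cuda.is_available():
    print(f"Peak GPU memory: {torch.cuda.max_memory_allocated() / 1e9:.2f} GB")

# "CPU-bound pipelines": more workers and prefetching keep the GPU fed;
# tune num_workers to your vCPU count.
dataset = TensorDataset(torch.randn(10_000, 128), torch.randint(0, 2, (10_000,)))
loader = DataLoader(dataset, batch_size=256, num_workers=4,
                    pin_memory=True, prefetch_factor=2)
```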
Practical projects
- Latency vs. cost dashboard: Implement an inference endpoint for a small image model on CPU and on a small GPU. Measure p50/p95 latency and cost/hour. Write a one-page recommendation.
- Throughput tuner: Train a Transformer for 1 epoch with different batch sizes and precisions. Record time/epoch, memory usage, and final loss. Summarize the best cost-performance setting.
- Batch pipeline: Process 1M texts to embeddings using CPU-only multi-processing vs. a single GPU with micro-batching. Compare total runtime and cloud cost.
Learning path
- Foundations: CPU vs. GPU basics; precision (fp32/fp16/int8); batching.
- Sizing: Estimating memory and compute; reading instance specs.
- Optimization: Mixed precision, micro-batching, data pipeline performance.
- Cost-aware deployment: Autoscaling, right-sizing, spot/preemptible strategies.
- Validation: Load testing, utilization tracking, and SLO checks.
Who this is for
Machine Learning Engineers, Data Scientists moving to production, and MLOps practitioners who need to make cost-effective compute choices for training and inference.
Prerequisites
- Comfort with Python and ML workflows.
- Basic understanding of neural networks and common classic ML algorithms.
- Ability to read simple hardware specs (vCPU, RAM, GPU VRAM).
Next steps
- Complete the exercises above and verify with the solutions.
- Take the Quick Test below to confirm you can choose between CPU and GPU under constraints.
- Apply the decision checklist to one of your current or past projects.
Mini challenge
Design an inference plan for a text classification API that must serve 40 req/sec with p95 < 120 ms and a budget of $2/hour. Propose CPU or GPU, batch size/micro-batch window, and any optimizations. Justify your choice briefly.
Hint
Consider a small GPU with micro-batching vs. several CPU replicas with autoscaling. Compare utilization and total hourly cost for both while meeting p95.
Quick Test
Everyone can take the Quick Test for free. Only logged-in users get saved progress and personalized next steps.