Why this matters
As a Machine Learning Engineer, you will see cloud bills grow quickly during training, data processing, and serving. Quotas can block work if you cannot get enough CPUs/GPUs, IPs, or API calls. Cost awareness keeps your experiments sustainable; quota planning keeps your services reliable.
- Training: Estimate cost, choose instance types, set checkpointing for cheaper preemptible/spot compute.
- Serving: Bound autoscaling, control egress, and plan capacity within quotas.
- Data pipelines: Schedule jobs during cheaper hours, compress artifacts, and tag resources for chargeback.
Concept explained simply
Cloud pricing is a meter: you pay for what you use. Quotas are guardrails: they cap how much you can use at once.
Mental model
- Meter: cost ≈ rate × usage. Rate depends on resource (CPU/GPU/storage/network). Usage is hours, GB, or requests.
- Guardrails: quotas limit peak capacity. You can request increases, but it takes time. Always keep a fallback plan.
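The meter model above can be sketched as a tiny estimator. The rates below are illustrative placeholders (not real provider prices), matching the rough ranges used later in this lesson:

```python
# Minimal sketch of the "meter" model: cost ≈ rate × usage.
# All rates here are illustrative, not real prices.

def estimate_cost(rate_per_unit: float, usage_units: float) -> float:
    """Return estimated cost in dollars for one metered resource."""
    return rate_per_unit * usage_units

# Compute: 2 GPUs at a hypothetical $2.50/GPU-hour for 12 hours.
gpu_cost = estimate_cost(rate_per_unit=2.50, usage_units=2 * 12)

# Storage: 200 GB for one month at a hypothetical $0.02/GB-month.
storage_cost = estimate_cost(rate_per_unit=0.02, usage_units=200)

print(f"GPU: ${gpu_cost:.2f}, storage: ${storage_cost:.2f}")
```

The same one-line function covers any metered resource; only the units change (hours, GB-months, requests).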
What affects the meter?
- Compute: on-demand vs spot/preemptible; GPU type; vCPU/RAM size.
- Storage: hot vs cold tiers; number of objects; request rates.
- Network: inter-zone/region transfer and internet egress.
- Managed services: databases, queues, feature stores, endpoints.
Prices vary by region and provider; treat any examples here as rough ranges only.
Core components of cloud cost for ML
- Compute: CPUs/GPUs per hour. Savings options: spot/preemptible, reservations/commitments, right-sizing.
- Storage: per-GB-month plus request and retrieval costs. Compress and lifecycle old artifacts.
- Network egress: moving data out of a region or to the public internet. Co-locate services to minimize.
- Managed endpoints: per-hour baseline, per-request, and sometimes per-concurrency charges.
- Observability: logs and metrics can add up quickly if they are very verbose.
Quotas you’ll meet
- Regional vCPU and RAM limits
- GPU count per project/region/zone
- Public IPs, load balancers, instance groups
- API rate limits and request quotas
- Service-specific limits (endpoints, clusters, jobs)
How to work with quotas
- Check current quotas for your target region.
- Estimate peak demand from your design (e.g., GPUs for training; replicas for serving).
- Request increases with justification and lead time.
- Prepare fallbacks: alternate regions, smaller instance types, spot pools, or throttled rollout.
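The quota-planning steps above can be sketched as a simple gap check. The quota and demand numbers are made up for illustration; in practice you would pull current quotas from your provider's console or API:

```python
# Hedged sketch: compare estimated peak demand against current quotas
# and flag where an increase request or a fallback plan is needed.
# All numbers are illustrative assumptions.

current_quotas = {"gpus": 4, "vcpus": 96, "public_ips": 8}
peak_demand = {"gpus": 8, "vcpus": 64, "public_ips": 2}

def quota_gaps(quotas: dict, demand: dict) -> dict:
    """Return resources where demand exceeds quota, with the shortfall."""
    return {r: demand[r] - quotas.get(r, 0)
            for r in demand if demand[r] > quotas.get(r, 0)}

gaps = quota_gaps(current_quotas, peak_demand)
for resource, shortfall in gaps.items():
    print(f"Request +{shortfall} {resource} or prepare a fallback")
```

Running this check early gives you the lead time that quota increase requests require.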
Worked examples
Example 1: Training job cost with checkpointing
- Assume 2 GPUs at $2.50/hour each for 12 hours: compute ≈ 2 × 2.50 × 12 = $60.
- Storage for dataset/checkpoints: 200 GB on standard tier at $0.02/GB-month. For one week: ≈ 200 × 0.02 × (7/30) ≈ $0.93.
- Network egress negligible if training and data are in the same region.
- With spot/preemptible at ~60% discount: compute ≈ $24; add checkpointing every 15–30 min to tolerate preemption.
Decision: If you can resume safely, choose spot to save ~60%.
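Example 1's arithmetic can be reproduced directly, using the illustrative rates from the text:

```python
# Training job cost from Example 1 (rates are the lesson's illustrative values).
GPU_RATE = 2.50        # $/GPU-hour
GPUS, HOURS = 2, 12
STORAGE_RATE = 0.02    # $/GB-month
GB, DAYS_KEPT = 200, 7
SPOT_DISCOUNT = 0.60   # ~60% off compute

compute = GPUS * GPU_RATE * HOURS               # $60.00
storage = GB * STORAGE_RATE * (DAYS_KEPT / 30)  # ≈ $0.93 for one week
spot_compute = compute * (1 - SPOT_DISCOUNT)    # $24.00

print(f"on-demand: ${compute + storage:.2f}, spot: ${spot_compute + storage:.2f}")
```

Storage is prorated by week because GB-month rates are billed for the fraction of the month the data is kept.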
Example 2: Real-time serving endpoint
- Baseline: one always-on instance (1 small GPU or 4 vCPUs) at $0.20/hour × 720 hours/month ≈ $144/month.
- Autoscaling up to 10 replicas at peak (assume 2 hours/day of peak for 30 days): extra ≈ 9 × $0.20 × (2 × 30) = $108.
- Egress: 0.5 GB/day to internet at $0.08/GB ≈ $1.20/month.
- Bound cost by setting max replicas to 10, enable request/response compression, and cache frequent results.
Decision: Keep min replicas low, set a clear max, and monitor p95 latency vs cost.
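Example 2's monthly total can be reproduced the same way, again with the lesson's illustrative rates:

```python
# Serving endpoint cost from Example 2 (rates are the lesson's illustrative values).
REPLICA_RATE = 0.20        # $/hour per replica
HOURS_PER_MONTH = 24 * 30  # 720

baseline = 1 * REPLICA_RATE * HOURS_PER_MONTH   # $144.00 always-on minimum
peak_extra = 9 * REPLICA_RATE * 2 * 30          # 9 extra replicas, 2 h/day: $108.00
egress = 0.5 * 30 * 0.08                        # 0.5 GB/day at $0.08/GB: $1.20

print(f"monthly total ≈ ${baseline + peak_extra + egress:.2f}")
```

Note that the always-on baseline dominates; that is why keeping min replicas low matters more than trimming egress here.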
Example 3: Weekly batch feature pipeline
- Compute: 16 vCPU for 3 hours at $0.40/hour/vCPU ≈ 16 × 0.40 × 3 = $19.20.
- Storage read/write: a few cents (often negligible compared to compute).
- Optimize: run on spot with retries; compress intermediate Parquet files; downscale memory if it is under-utilized.
Decision: Schedule during low-usage windows, use spot with retries, and prune intermediates.
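Example 3's per-run cost, and what it implies monthly if the pipeline runs weekly, can be sketched as:

```python
# Batch pipeline cost from Example 3 (rate is the lesson's illustrative value).
VCPUS, HOURS, RATE = 16, 3, 0.40   # $/vCPU-hour

per_run = VCPUS * RATE * HOURS     # $19.20 per weekly run
per_month = per_run * 52 / 12      # ≈ $83.20/month at one run per week

print(f"per run: ${per_run:.2f}, per month: ${per_month:.2f}")
```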
Decision cheatsheet
- Use spot/preemptible: training with checkpoints, batch ETL, backfills.
- Use on-demand/reserved: strict SLAs, low tolerance for interruption, critical real-time serving.
- Reduce egress: co-locate compute and data; avoid cross-region calls; compress responses.
- Right-size: pick the smallest instance that meets throughput/latency; measure utilization.
- Tag everything: project, owner, environment, experiment_id for cost allocation.
- Set budget alerts: e.g., 50%, 80%, and 100% thresholds with notifications.
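The 50/80/100% alert rule in the cheatsheet can be sketched as a small check; the budget figure is an assumption for illustration:

```python
# Sketch of 50/80/100% budget alert logic. BUDGET is a made-up example figure.
BUDGET = 200.0
THRESHOLDS = [0.50, 0.80, 1.00]

def triggered_alerts(spend: float, budget: float = BUDGET) -> list:
    """Return the threshold fractions that current spend has crossed."""
    return [t for t in THRESHOLDS if spend >= t * budget]

print(triggered_alerts(170.0))   # 170 of 200 crosses the 50% and 80% marks
```

In a real setup you would configure these thresholds in your provider's budget service with notification recipients, rather than polling spend yourself.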
Hands-on exercises
Complete these before the quick test.
- Exercise ex1: Estimate a training job cost and set budget thresholds. See details below.
- Exercise ex2: Plan quotas and fallbacks for a serving endpoint. See details below.
Exercise ex1 — instructions
You plan to fine-tune a model with:
- 2 GPUs at $2.50/hour each
- Training duration 10 hours
- Checkpoints: 3 files × 5 GB each kept for 2 weeks on $0.02/GB-month storage
- Spot option gives 60% discount on compute
Tasks:
- Calculate on-demand total cost and spot total cost (assume no extra retries).
- Propose budget alert thresholds at 50%, 80%, and 100% of the higher (on-demand) estimate.
- List two risks and mitigations if using spot.
Exercise ex2 — instructions
You will deploy a real-time endpoint with autoscaling:
- Instance cost: $0.15/hour per replica
- Min replicas: 1, Max replicas: 8
- Expected peak: 2 hours/day
- Response payload: ~200 KB, 30k responses/day, most clients in same region
Tasks:
- Estimate monthly baseline cost and peak cost.
- Identify two quotas that could block scaling and how to mitigate.
- Recommend a hard cap to bound cost without violating SLO.
Practice checklist
- Compute, storage, and egress were each estimated in dollars.
- Budget alerts at 50/80/100% are defined with recipients.
- Quotas were checked in the target region and increase requests drafted.
- Fallback strategies (spot, alternate region, smaller instance) are listed.
- Resource tags: project, owner, environment, experiment_id.
Common mistakes and how to self-check
- Mistake: Ignoring egress. Self-check: Are clients in other regions or on the public internet? Is payload compressed?
- Mistake: No checkpointing on spot. Self-check: Can the job resume within 10–15 minutes after preemption?
- Mistake: Unlimited autoscaling. Self-check: Did you set max replicas and rate limits?
- Mistake: No tags. Self-check: Can you attribute costs to a project/owner in your reports?
- Mistake: Requesting quota increases too late. Self-check: Did you submit requests at least several days before launch?
Practical projects
- Costed ML Experiment Tracker: script that logs per-run compute hours, storage growth, and estimated dollars.
- Cost-aware Batch Inference: pipeline that runs on spot with retries, checkpoints outputs, and emails budget status.
- GPU Quota Playbook: document and template request justifications for standard, high-memory, and A* GPU families.
Who this is for
- ML Engineers deploying training and inference on cloud.
- Data Scientists running frequent experiments who want predictable costs.
- Tech leads setting budgets and SLOs for ML systems.
Prerequisites
- Basic comfort with cloud compute and storage concepts.
- Familiarity with training loops and checkpointing.
- Understanding of latency/throughput targets for APIs.
Learning path
- Identify your workload pattern: training, batch, or real-time.
- List resources: compute type, storage class, data paths, expected traffic.
- Estimate cost using rate × usage; include egress and observability.
- Check quotas; submit increase requests; note fallback options.
- Set tags and budget alerts; test a small run; review actual vs estimate.
- Iterate: right-size instances, compress data, adjust autoscaling.
Next steps
- Complete the exercises below and compare with the provided solutions.
- Take the quick test to validate your understanding.
- Apply these steps to your current ML project and monitor actual spend for one week.
Mini challenge
You must deploy a demo endpoint for a week for a stakeholder. Design a plan that:
- Caps cost under $50 for the week.
- Handles up to 5 QPS with p95 latency under 300 ms.
- Includes at least two quota checks and a fallback.
Hint
- Keep min replicas to 0–1; set a conservative max; cache heavy responses.
- Use same-region data to avoid egress; compress responses.
- Prepare a smaller model as fallback if quotas block GPU.
Exercises — reference
These map to the exercises section below with full solutions.
- ex1: Training cost estimate + budget thresholds
- ex2: Serving quotas + cost bounding
Quick Test