Why this matters
As a Data Architect, you shape systems that must scale without burning budget. Cost optimization controls ensure platforms deliver the required performance while guarding against waste from idle resources, runaway queries, over-retention, and misconfigured scaling.
- Real tasks: define warehouse autosuspend rules, set workload isolation tiers, enforce storage lifecycle policies, choose reserved vs on-demand capacity, and implement budgets/alerts.
- Impact: predictable spend, resilient performance, better ROI, and happier stakeholders.
Who this is for
- Data Architects and Platform Engineers designing cloud data stacks.
- Analytics Engineers and Data Engineers tuning warehouses, lakes, and pipelines.
- Team leads who approve capacity and budgets.
Prerequisites
- Basic understanding of cloud compute, storage, and networking.
- Familiarity with data warehouses, data lakes, and batch/stream processing.
- Comfort reading simple cost estimates (per GB-month, per vCPU-hour, etc.).
Concept explained simply
Cost optimization controls are guardrails that keep performance high without overspending. Think of them as thermostats, timers, and safety valves for your data platform.
Mental model
- Preventive controls: stop waste before it happens (e.g., autosuspend, quotas, reserved capacity for steady loads, query timeouts).
- Detective controls: reveal waste quickly (e.g., budgets, alerts, cost dashboards, anomaly detection).
- Corrective controls: automatically trim waste (e.g., lifecycle policies, auto-scale in, job retries with backoff, preemption-aware workloads).
Quick glossary
- Autosuspend/Auto-resume: pause compute when idle; wake on demand.
- Reserved/committed capacity: discount for steady usage.
- Spot/preemptible: cheap but can be interrupted; best for fault-tolerant batch.
- Lifecycle policy: move data from hot to warm to archive; delete when expired.
- Query guardrails: limits on runtime, scanned bytes, or cost per query.
- Tagging/chargeback: map spend to teams and products for accountability.
Worked examples
Example 1: Warehouse idle cost control
Scenario: BI team queries a warehouse 9 a.m.–6 p.m. Mon–Fri; spikes at 10 a.m. and 3 p.m.; little night/weekend use.
- Controls: autosuspend at 5 minutes idle; auto-resume on demand; scale 1–3 clusters with queueing; schedule off on weekends; cache enabled; query timeout 5 minutes for ad-hoc role.
- Impact: If the warehouse idled 12 hours/day, 7 days/week, autosuspend could cut ~84 hours/week of compute. At an example rate of $2/hour, that is ~$672/month saved (assuming a 4-week month). Numbers are illustrative.
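A quick back-of-the-envelope check of that estimate, sketched in Python below; the idle hours, the $2/hour rate, and the 4-week month are the illustrative assumptions from the example, not measured values.

```python
# Rough estimate of compute saved by autosuspend (illustrative numbers only).
IDLE_HOURS_PER_DAY = 12      # warehouse sits idle overnight and on weekends
DAYS_PER_WEEK = 7
RATE_PER_HOUR = 2.00         # example on-demand rate in dollars
WEEKS_PER_MONTH = 4          # simple 4-week month for a quick estimate

idle_hours_per_week = IDLE_HOURS_PER_DAY * DAYS_PER_WEEK          # 84 hours
monthly_savings = idle_hours_per_week * WEEKS_PER_MONTH * RATE_PER_HOUR

print(f"Idle hours avoided per week: {idle_hours_per_week}")
print(f"Estimated monthly savings: ${monthly_savings:,.2f}")      # ~$672
```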
Example 2: Storage lifecycle tiers
Scenario: 10 TB/day logs; typical read pattern: 7-day hot, 90-day warm, 7-year archive.
- Controls: partition by date; hot tier 7 days; warm tier up to 90 days; archive beyond 90 days; TTL delete after 7 years.
- Illustrative pricing: hot $0.023/GB-month, warm $0.0125, archive $0.004. Moving 9 TB/day to warm/archive quickly reduces run-rate. Retrieval fees may apply; keep recent indexes hot to minimize restores.
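To see how the tiers compound at steady state, here is a minimal Python estimate; the decimal GB-per-TB conversion and the per-tier rates are illustrative assumptions, and retrieval fees are left out.

```python
# Steady-state monthly storage cost for the log scenario above (illustrative rates).
TB = 1_000  # GB per TB, using decimal units for simplicity

ingest_tb_per_day = 10
hot_days, warm_days, retention_years = 7, 90, 7
retention_days = retention_years * 365

# Data volume sitting in each tier once the pipeline reaches steady state.
hot_tb = ingest_tb_per_day * hot_days
warm_tb = ingest_tb_per_day * (warm_days - hot_days)
archive_tb = ingest_tb_per_day * (retention_days - warm_days)

rates = {"hot": 0.023, "warm": 0.0125, "archive": 0.004}  # $/GB-month, example only

tiered_cost = (hot_tb * rates["hot"] + warm_tb * rates["warm"]
               + archive_tb * rates["archive"]) * TB
all_hot_cost = (hot_tb + warm_tb + archive_tb) * TB * rates["hot"]

print(f"Tiered:  ${tiered_cost:,.0f}/month")
print(f"All hot: ${all_hot_cost:,.0f}/month")
```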
Example 3: Batch job with spot capacity and pruning
Scenario: nightly Spark job scans 20 TB, 2 hours on on-demand compute.
- Controls: use partition pruning to scan 4 TB; cache small dimension tables; switch 70% of workers to spot/preemptible with retries; start the job in an off-peak window.
- Impact: Data scanned is cut by 80%, and the blended compute rate drops. If on-demand is $1/hour/node and spot is $0.30, with 10 nodes for 2 hours: before ≈ 10 × $1 × 2 h = $20; after, 6 spot + 4 on-demand nodes for 1.2 hours ≈ (6 × 0.3 + 4 × 1) × 1.2 ≈ $6.96. Numbers are illustrative.
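The same arithmetic as a small Python sketch, using the illustrative node counts, rates, and 1.2-hour runtime from the example.

```python
# Before/after cost of the nightly job (illustrative rates, simple blended math).
NODES = 10
ON_DEMAND_RATE = 1.00   # $/node-hour, example
SPOT_RATE = 0.30        # $/node-hour, example

# Before: full 20 TB scan, all nodes on-demand.
before_hours = 2.0
before_cost = NODES * ON_DEMAND_RATE * before_hours            # $20.00

# After: partition pruning shrinks the scan so runtime drops; 70% of workers on spot.
after_hours = 1.2
spot_nodes, on_demand_nodes = 6, 4
after_cost = (spot_nodes * SPOT_RATE + on_demand_nodes * ON_DEMAND_RATE) * after_hours

print(f"Before: ${before_cost:.2f}  After: ${after_cost:.2f}")  # ~$6.96
print(f"Savings: {100 * (1 - after_cost / before_cost):.0f}%")
```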
Controls you can combine
- Compute: autosuspend, rightsize instances/warehouses, scale-to-zero dev, spot/preemptible for batch, reserved capacity for steady loads, concurrency quotas, workload isolation.
- Storage: lifecycle to warm/archive, compression, partitioning/clustering, small-file compaction, TTL retention, selective materialized views.
- Query/pipeline: timeouts, max scanned bytes, result caching, sampling/approximate queries where acceptable, incremental processing (a guardrail sketch follows this list).
- Governance: budgets, anomaly alerts, tags/labels, chargeback/showback, scheduled audits, cost KPIs (e.g., $/query, $/TB processed).
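As one concrete illustration of the query/pipeline guardrails above, here is a minimal, platform-agnostic sketch in Python; the role names, limits, and the idea of estimating scanned bytes before execution are assumptions for illustration, not any specific engine's API.

```python
# Simplified per-role query guardrails: reject work that would blow past its limits.
from dataclasses import dataclass

@dataclass
class Guardrail:
    max_runtime_s: int       # hard query timeout
    max_scan_gb: float       # cap on bytes scanned per query

# Example limits per role (assumed values for illustration).
GUARDRAILS = {
    "ad_hoc":    Guardrail(max_runtime_s=300,  max_scan_gb=100),
    "dashboard": Guardrail(max_runtime_s=60,   max_scan_gb=20),
    "elt":       Guardrail(max_runtime_s=7200, max_scan_gb=5000),
}

def admit(role: str, estimated_scan_gb: float) -> str:
    """Decide whether a query should run, based on its role's scan limit."""
    g = GUARDRAILS[role]
    if estimated_scan_gb > g.max_scan_gb:
        return f"reject: would scan {estimated_scan_gb} GB > {g.max_scan_gb} GB limit"
    return f"allow: timeout set to {g.max_runtime_s}s"

print(admit("ad_hoc", estimated_scan_gb=250))   # rejected by the scan cap
print(admit("dashboard", estimated_scan_gb=5))  # allowed with a 60s timeout
```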
Exercises
Do these, then compare with the provided sample solutions.
Exercise 1 — Warehouse guardrails and scaling
Design cost controls for an analytics warehouse with these traits: 30 analysts; business hours are 8 a.m.–7 p.m. local; ad-hoc spikes around 9 a.m.; nightly ELT 1 a.m.–3 a.m.; dev and prod separated.
- Deliverables: autosuspend/auto-resume settings; min/max clusters or size; queueing vs burst; query timeout/scan limits by role; weekday/weekend schedule; budgets and alerts; tagging plan for teams.
Exercise 2 — Storage lifecycle and retention plan
You ingest 5 TB/day of event logs. Requirements: 90 days fast access, 12 months cheaper access, 7 years compliance archive. Provide a lifecycle policy, partitioning plan, and a rough monthly cost estimate using these illustrative rates: hot $0.023/GB-mo, warm $0.0125/GB-mo, archive $0.004/GB-mo. Note: the numbers are illustrative only.
Checklist before you compare
- [ ] Preventive controls defined (autosuspend, timeouts, quotas).
- [ ] Detective controls defined (budgets, alerts, reports).
- [ ] Corrective controls defined (lifecycle, auto-scale-in, TTL).
- [ ] Roles and environments isolated (dev/test/prod).
- [ ] Estimates include simple math with clear assumptions.
Common mistakes and self-check
- Mistake: Turning on autoscaling without limits. Self-check: Did you set a max cluster/size and queue policy?
- Mistake: Over-retaining raw data. Self-check: Do you have TTL and archive tiers with restore procedures?
- Mistake: Using spot for critical low-latency workloads. Self-check: Are only fault-tolerant batch jobs on spot/preemptible?
- Mistake: Materialized views everywhere. Self-check: Does each view save more in query cost than its refresh spends?
- Mistake: No tagging. Self-check: Can you attribute at least 90% of spend to owners?
- Mistake: Budgets without alerts. Self-check: Do you have thresholds at 50/75/90/100% with action plans?
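A minimal sketch of the budgets-with-alerts self-check above, written as plain Python rather than any particular cloud billing API; the budget figure, thresholds, and simulated spend readings are assumptions.

```python
# Fire an alert for each budget threshold crossed this month (illustrative only).
MONTHLY_BUDGET = 50_000.0          # assumed budget in dollars
THRESHOLDS = (0.50, 0.75, 0.90, 1.00)

def alerts_due(month_to_date_spend: float, already_sent: set[float]) -> list[float]:
    """Return thresholds newly crossed, so each alert fires exactly once."""
    return [t for t in THRESHOLDS
            if month_to_date_spend >= t * MONTHLY_BUDGET and t not in already_sent]

sent: set[float] = set()
for spend in (20_000, 39_000, 47_000):           # simulated month-to-date readings
    for t in alerts_due(spend, sent):
        sent.add(t)
        print(f"ALERT: spend ${spend:,.0f} crossed {int(t * 100)}% of budget")
```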
Learning path
- Map workloads — list SLAs, concurrency, and usage windows.
  Mini task: Write down top 3 workloads and their peak hours.
- Pick preventive controls — autosuspend, size limits, query guardrails.
  Mini task: Set a default query timeout and a max scan per role.
- Design storage lifecycle — hot/warm/archive + TTL.
  Mini task: Draft 3 rules that move and delete data automatically.
- Add detective controls — budgets, anomaly alerts, tagging.
  Mini task: Define tags: environment, owner, cost-center, workload.
- Pilot and tune — apply to one workload, measure $/query and latency.
  Mini task: Compare before/after for one week and adjust limits.
Practical projects
- Cost-aware warehouse blueprint: Build IaC templates that create a warehouse with autosuspend, scaling limits, role-based timeouts, and tags. Include a default budget and alert policy.
- Data lake lifecycle: Implement hot-to-warm-to-archive moves, partition by date, compact small files daily, and auto-delete beyond retention. Create a monthly cost report per dataset (a tier-assignment sketch follows this list).
- Batch optimization: Convert a nightly job to use partition pruning and spot/preemptible workers with retries, and schedule during off-peak. Measure cost and runtime changes.
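For the data lake lifecycle project, here is a minimal sketch of a tier-assignment rule driven by partition age; the tier boundaries and the date-partitioned layout are assumptions to adapt to your platform's actual lifecycle policy mechanism.

```python
# Assign a storage tier to each date partition based on its age (illustrative rules).
from datetime import date, timedelta

HOT_DAYS, WARM_DAYS, RETENTION_DAYS = 7, 90, 7 * 365

def tier_for(partition_date: date, today: date) -> str:
    """Map a partition's age to hot/warm/archive, or mark it for deletion."""
    age = (today - partition_date).days
    if age > RETENTION_DAYS:
        return "delete"      # past retention: TTL removal
    if age > WARM_DAYS:
        return "archive"
    if age > HOT_DAYS:
        return "warm"
    return "hot"

today = date(2024, 6, 30)
for days_old in (1, 30, 400, 3000):
    p = today - timedelta(days=days_old)
    print(f"{p.isoformat()}: {tier_for(p, today)}")
```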
Mini challenge
Your CFO asks for a 25% cost reduction without hurting 95th percentile query latency for dashboards. List five changes you would try first and the metrics you’d track to prove success.
Possible angles to consider
- Autosuspend thresholds; warehouse right-sizing; caching effectiveness.
- Dashboards rewritten to limit scanned bytes; materialized views where they truly pay off.
- Warm/archive tiers for older partitions; TTL for raw staging.
- Budgets + anomaly alerts; tag coverage to find owners of spikes.
- Spot/preemptible for batch; reserved capacity for steady, predictable loads.
Next steps
- Apply one preventive, one detective, and one corrective control in your environment this week.
- Track two KPIs: cost per query (or per TB processed) and 95th percentile latency.
- Expand controls to the next highest-cost workload.
Quick test and progress
Take the quick test to check your understanding. The test is available to everyone. If you log in, your progress and results are saved.