Cost And Scalability Planning

Learn Cost And Scalability Planning for free with explanations, exercises, and a quick test (for Data Architects).

Published: January 18, 2026 | Updated: January 18, 2026

Why this matters

As a Data Architect, you are accountable for designing platforms that scale without surprise bills. Real tasks include: forecasting cloud spend, choosing storage/compute tiers, setting autoscaling limits, planning data lifecycle policies, and proving that your design can handle 10x growth while staying within budget.

Concept explained simply

Cost and scalability planning means estimating how much your data platform will cost as usage grows, then building guardrails so it stays fast and affordable. You do this by knowing what drives cost (compute, storage, data transfer, requests), choosing the right service tiers, and adding controls like autoscaling and budgets.

Mental model: The 3 dials

Imagine three dials you can turn: Volume (data size), Speed (throughput/concurrency), and Price (unit cost). You can turn two dials freely; the third reacts. If you need more speed at the same price, you must reduce the data volume scanned or switch to cheaper units (e.g., serverless, spot, columnar formats). If data volume grows, you either pay more or optimize to keep price flat.
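
A quick illustration with made-up numbers: at a hypothetical $5 per TB scanned, a workload that scans 1 TB per day costs roughly $150 per month. If data volume doubles but partition pruning keeps scanned bytes at 1 TB per day, the price dial stays put; if queries start scanning the full 2 TB per day, the bill roughly doubles unless you also switch to cheaper units.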

Core principles

  • Total cost of ownership (TCO): Compute + Storage + Data transfer + Requests + Ops overhead (see the cost-model sketch after this list).
  • Right-tiering: Hot (frequent access), Warm/Infrequent access (cheaper, retrieval fees), Cold/Archive (cheapest, slow retrieval).
  • Elasticity: Autoscaling and serverless reduce idle cost and absorb spikes; set caps to avoid runaway bills.
  • Unit economics: Cost per query, per GB scanned, per event, per pipeline run. Track and improve these.
  • Data lifecycle: Retain, down-tier, archive, or delete based on access patterns and compliance.
  • Workload shaping: Partitioning, pruning, compression, columnar formats, caching to reduce scanned bytes.
  • Guardrails: Budgets, alerts, quotas, per-query limits, and tag-based cost allocation.
  • Capacity planning: Design for P95/P99 load, not just average. Test with realistic bursts.
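
To make the TCO and unit-economics principles concrete, here is a minimal Python sketch of a monthly cost model. Every rate and volume in it is an illustrative assumption, not a quote from any provider; swap in your own rate card.

# Minimal monthly TCO sketch. All rates below are made-up placeholders;
# replace them with your provider's actual pricing.
RATES = {
    "compute_per_vcpu_hour": 0.05,   # assumed on-demand rate
    "storage_hot_per_gb": 0.023,     # assumed hot-tier $/GB-month
    "storage_warm_per_gb": 0.010,    # assumed warm-tier $/GB-month
    "egress_per_gb": 0.09,           # assumed internet egress
    "requests_per_million": 5.00,    # assumed request pricing
}

def monthly_tco(vcpu_hours, hot_gb, warm_gb, egress_gb, request_millions,
                ops_overhead=0.0):
    """Return a cost breakdown and total for one month."""
    breakdown = {
        "compute": vcpu_hours * RATES["compute_per_vcpu_hour"],
        "storage": hot_gb * RATES["storage_hot_per_gb"]
                   + warm_gb * RATES["storage_warm_per_gb"],
        "transfer": egress_gb * RATES["egress_per_gb"],
        "requests": request_millions * RATES["requests_per_million"],
        "ops": ops_overhead,         # on-call, tooling, licenses, etc.
    }
    breakdown["total"] = sum(breakdown.values())
    return breakdown

costs = monthly_tco(vcpu_hours=2_000, hot_gb=2_000, warm_gb=8_000,
                    egress_gb=500, request_millions=20)
for line_item, dollars in costs.items():
    print(f"{line_item:>8}: ${dollars:,.2f}")

Dividing the total by queries run or GB processed in the same month gives the unit costs that the unit-economics principle asks you to track.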

Worked examples

Example 1: BI warehouse for 200 analysts

Assumptions (example numbers):

  • Average 15 concurrent queries, peak 35.
  • Columnar store with 2 TB hot data, 8 TB warm.
  • Serverless warehouse billed by data scanned per query; aim to scan <2 GB per query via partitioning.

Plan:

  • Use partitioning on date and customer_id to prune scans.
  • Add result cache for top dashboards (morning spike).
  • Set per-query scan cap (e.g., 5 GB) and alert.
  • Down-tier fact tables older than 90 days to warm storage; keep dimensions hot.

Outcome: Peak handled by elasticity; cost controlled by pruning and query caps.
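
A back-of-the-envelope check for this example, as a short Python sketch. The $5-per-TB scan price and the query counts are illustrative assumptions, not a specific vendor's rates.

# Rough monthly scan-cost estimate for the BI warehouse example.
# All prices and usage figures are assumptions for illustration.
PRICE_PER_TB_SCANNED = 5.00        # assumed serverless scan price
ANALYSTS = 200
QUERIES_PER_ANALYST_PER_DAY = 40   # assumed usage
WORKDAYS_PER_MONTH = 21

def monthly_scan_cost(avg_gb_scanned_per_query):
    queries = ANALYSTS * QUERIES_PER_ANALYST_PER_DAY * WORKDAYS_PER_MONTH
    tb_scanned = queries * avg_gb_scanned_per_query / 1_000
    return tb_scanned * PRICE_PER_TB_SCANNED

# With pruning (about 2 GB/query) vs. without (say 20 GB/query):
print(f"pruned:   ${monthly_scan_cost(2):,.0f}/month")
print(f"unpruned: ${monthly_scan_cost(20):,.0f}/month")

The roughly tenfold gap between the two numbers is why the plan leans on partition pruning and a per-query scan cap rather than on larger compute.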

Example 2: Streaming pipeline at 50k events/sec

Assumptions:

  • Autoscaling consumers process events; average 20k/s, bursts to 50k/s.
  • Exactly-once sink to data lake; compaction every hour.

Plan:

  • Scale by lag: increase consumers when lag > 30s, reduce when < 5s.
  • Use spot/preemptible workers for non-stateful transformers with checkpointing.
  • Cap max workers to the budget; alert if lag > 2 min for 10 min (a sign the worker cap or a service quota has been hit).
  • Compact small files hourly to reduce request costs.

Outcome: Smooth bursts with bounded spend; small-file cost reduced by compaction.
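
The lag-based scaling rule above can be written as a small decision function. This is a sketch of the policy logic only; the thresholds and the worker cap mirror the example's assumptions, and wiring it to a real autoscaler or metrics source is left out.

# Sketch of the lag-driven scaling policy from the plan above.
# Thresholds and the worker cap are the example's assumptions.
SCALE_UP_LAG_SECONDS = 30
SCALE_DOWN_LAG_SECONDS = 5
MIN_WORKERS = 2
MAX_WORKERS = 40   # budget-derived cap; set from your spend limit

def desired_workers(current_workers: int, lag_seconds: float) -> int:
    """Return the worker count the autoscaler should target next."""
    if lag_seconds > SCALE_UP_LAG_SECONDS:
        target = current_workers + max(1, current_workers // 4)  # grow ~25%
    elif lag_seconds < SCALE_DOWN_LAG_SECONDS:
        target = current_workers - 1
    else:
        target = current_workers
    return max(MIN_WORKERS, min(MAX_WORKERS, target))

def should_alert(lag_seconds: float, minutes_over: float) -> bool:
    """Alert when lag exceeds 2 minutes for 10 minutes (likely cap or quota hit)."""
    return lag_seconds > 120 and minutes_over >= 10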

Example 3: Data lake lifecycle

Assumptions:

  • Raw logs: 30 TB/month. First 30 days hot, then infrequent access, archive after 180 days.
  • Access: 5% of logs queried after 30 days; rare archive restores for audits.

Plan:

  • Apply lifecycle rules at ingestion time.
  • Store in columnar compressed format (e.g., Parquet) to cut scans.
  • Tag buckets for cost allocation by domain.
  • Document restore SLAs and retrieval fees to avoid surprises.

Outcome: Majority of cost moves to cheaper tiers; predictable restore trade-offs.
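
As one possible implementation, the 30-day and 180-day transitions could be expressed as an object-store lifecycle rule. The sketch below assumes an S3 data lake with raw logs under a raw-logs/ prefix; the bucket name and prefix are placeholders, and other object stores offer equivalent lifecycle APIs.

# Sketch: hot -> infrequent access -> archive tiering for raw logs,
# applied with boto3. Bucket name and prefix are hypothetical.
import boto3

s3 = boto3.client("s3")
s3.put_bucket_lifecycle_configuration(
    Bucket="example-data-lake",              # placeholder bucket name
    LifecycleConfiguration={
        "Rules": [
            {
                "ID": "raw-logs-tiering",
                "Filter": {"Prefix": "raw-logs/"},
                "Status": "Enabled",
                "Transitions": [
                    {"Days": 30, "StorageClass": "STANDARD_IA"},  # warm
                    {"Days": 180, "StorageClass": "GLACIER"},     # archive
                ],
            }
        ]
    },
)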

Step-by-step: Plan cost and scale for a new workload

  1. Define success
    • What latency and concurrency must we meet at P95?
    • Monthly budget and cost per unit (per query/GB/event)?
  2. Quantify volumes
    • Data in (GB/day), data out, requests, retention.
    • Traffic patterns: steady vs spiky, seasonality.
  3. Choose tiers
    • Hot vs warm vs archive storage.
    • On-demand/serverless vs fixed clusters; spot where safe.
  4. Optimize unit cost
    • Partition, prune, compress, cache.
    • Batch small files; avoid unnecessary shuffles/scans.
  5. Set guardrails
    • Autoscaling limits, per-query scan caps, concurrency caps.
    • Budgets with alerts at 50%, 80%, and 100% of the planned monthly spend.
  6. Test and iterate
    • Load test to P95/P99.
    • Compare forecast vs observed cost; adjust tiers.
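
Step 6 compares forecast against observed cost; a small projection like the sketch below keeps that comparison honest as volumes grow. The baseline figures and the assumption about which line items scale with volume are illustrative simplifications to refine with real billing data.

# Sketch: project a monthly cost baseline to 2x and 5x growth.
# Baseline dollars and the volume-scaling split are assumptions.
BASELINE = {"compute": 4_000.0, "storage": 1_200.0,
            "transfer": 600.0, "requests": 200.0}   # $/month, illustrative
SCALES_WITH_VOLUME = {"compute", "transfer", "requests"}

def forecast(growth_factor, storage_growth=None):
    if storage_growth is None:
        storage_growth = growth_factor   # pessimistic default: storage grows too
    projected = {}
    for item, dollars in BASELINE.items():
        factor = growth_factor if item in SCALES_WITH_VOLUME else storage_growth
        projected[item] = dollars * factor
    projected["total"] = sum(projected.values())
    return projected

for growth in (1, 2, 5):
    print(f"{growth}x: ${forecast(growth)['total']:,.0f}/month")

The same structure works for the cost forecast workbook listed under practical projects later on this page.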

Checklists

Design review checklist:

  • Have we estimated TCO across compute, storage, transfer, and requests?
  • Do we have lifecycle policies for each dataset?
  • Are autoscaling caps and per-query limits defined?
  • Is data partitioned and compressed to minimize scans?
  • Are cost tags applied for allocation by team/domain?
  • Do we have SLOs for performance and restore?

Pre-go-live cost readiness:

  • Budgets and alerts configured (50/80/100%).
  • Load test completed to peak with acceptable cost/unit.
  • Runbooks for scale events and budget breaches.
  • Backup/restore tested with known retrieval cost.
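
The budget-alert item in this checklist boils down to a few lines of logic. A minimal sketch, assuming month-to-date spend is available from your billing export; the thresholds mirror the 50/80/100% convention used on this page.

# Sketch: report which budget alert thresholds have fired this month.
ALERT_THRESHOLDS = (0.5, 0.8, 1.0)

def fired_alerts(month_to_date_spend: float, monthly_budget: float):
    used = month_to_date_spend / monthly_budget
    return [f"{int(t * 100)}%" for t in ALERT_THRESHOLDS if used >= t]

# Example: $4,300 spent against a $5,000 budget -> ['50%', '80%']
print(fired_alerts(4_300, 5_000))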

Exercises

Do the two practical exercises below. Solutions are hidden in their toggles inside the Exercises section of this page.

  • Exercise 1: Estimate and cap costs for a weekly batch job.
  • Exercise 2: Right-size a data warehouse for spiky BI usage.

Common mistakes and how to self-check

  • Overprovisioning always-on clusters for rare peaks. Self-check: What is utilization outside peak hours, and can autoscaling absorb the P95 peaks instead?
  • Ignoring data transfer and request costs. Self-check: Do we know GB moved and request counts per job?
  • No lifecycle policies. Self-check: For each dataset, what happens at 30/90/180 days?
  • Scanning entire tables. Self-check: Are queries pruned by partitions and selective predicates?
  • Unbounded autoscaling. Self-check: Are there max worker limits and budget alerts?

Practical projects

  • Build a cost forecast workbook: Inputs (data volume, QPS, retention); Outputs (monthly compute, storage, transfer, total). Include sliders for 2x/5x growth.
  • Implement lifecycle rules on a sample data lake with hot/warm/archive and measure 30-day savings.
  • Create a guardrails pack: autoscaling caps, per-query scan limit, and budget alerts; validate with a load test.

Who this is for

  • Data Architects and Senior Data Engineers designing or optimizing platforms.
  • Team leads who own budget and SLAs for data systems.

Prerequisites

  • Basic cloud concepts (compute, storage, networking).
  • Familiarity with SQL engines and data formats (e.g., Parquet, ORC).
  • Understanding of autoscaling and partitioning.

Learning path

  1. Map workloads and unit economics (per query/GB/event).
  2. Right-tier storage and set lifecycle rules.
  3. Design elastic compute with caps and budgets.
  4. Optimize scans (partitioning, compression, caching).
  5. Load test, compare forecast vs actual, iterate.

Mini challenge

You expect a Monday 9–11 AM dashboard spike to 30 concurrent users; typical load is 8. Propose an elastic setup (tiers, caps, and limits) to keep latency under 3 seconds without overspending. Include one guardrail that prevents runaway scans.

Sample approach:
  • One medium warehouse baseline with auto-suspend 5 min; concurrency scaling adds a second during 9–11 AM.
  • Per-query scan cap 3 GB; queries above that fail fast with guidance.
  • Materialize top 5 dashboards at 8:45 AM to warm caches.
  • Budget alerts at 50/80/100%; max two concurrent clusters.

Next steps

  • Complete the Quick Test below to check understanding. The test is available to everyone; only logged-in users have saved progress.
  • Apply one lifecycle policy and one guardrail to a real dataset this week.

Practice Exercises

2 exercises to complete

Exercise 1 instructions

Scenario:

  • A batch job runs 4 hours every Sunday (4 runs/month), processing 5 TB of input and producing 0.5 TB of output.
  • Assume 1 TB = 1000 GB for calculations.
  • Compute: 64 vCPUs at $2.40 per vCPU-hour.
  • Storage hot tier: $0.023 per GB-month.
  • Egress to internet: 100 GB exported after each run; egress price $0.09 per GB.
  • Lifecycle: Keep raw input and output in the hot tier for the first month. Ignore retrieval fees for this exercise.

Tasks:

  • Calculate the month-1 cost for compute, storage (hot), and egress, and the total.
  • Propose two guardrails to cap spend and one optimization to reduce cost next month.

Expected Output

A short breakdown with dollar amounts for compute, storage, and egress; total monthly estimate; two guardrails; one optimization for month 2.
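
If you want to check your arithmetic for Exercise 1, a helper like the sketch below is enough. It mirrors the scenario's cost drivers; the argument values in the example call are placeholders for you to replace with the scenario's numbers, not a worked solution.

# Helper sketch for Exercise 1. Plug in the scenario's numbers yourself;
# the zeros below are placeholders, not the answer.
def batch_job_month_cost(runs_per_month, hours_per_run, vcpus, vcpu_hour_price,
                         hot_storage_gb, hot_price_per_gb_month,
                         egress_gb_per_run, egress_price_per_gb):
    compute = runs_per_month * hours_per_run * vcpus * vcpu_hour_price
    storage = hot_storage_gb * hot_price_per_gb_month
    egress = runs_per_month * egress_gb_per_run * egress_price_per_gb
    return {"compute": compute, "storage": storage, "egress": egress,
            "total": compute + storage + egress}

print(batch_job_month_cost(0, 0, 0, 0.0, 0, 0.0, 0, 0.0))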

Cost And Scalability Planning — Quick Test

Test your knowledge with 8 questions. Pass with 70% or higher.

