Performance And Scalability

Learn Performance And Scalability for Data Architects for free: roadmap, examples, subskills, and a skill exam.

Published: January 18, 2026 | Updated: January 18, 2026

Why this matters for Data Architects

Performance and scalability determine whether your data platform can handle real-world growth: more users, more data, and more concurrent workloads without surprises. As a Data Architect, you define patterns that keep costs under control while meeting SLAs for ingestion, transformation, and analytics. This skill helps you align data models, compute strategies, and governance with predictable performance.

What this unlocks in your role
  • Designs that meet query SLAs as data volume and concurrency grow
  • Clear workload isolation and cost controls for teams
  • Repeatable benchmarks and capacity plans stakeholders trust
  • Architecture choices that scale without rewrites

Who this is for

  • Data Architects defining warehouse/lakehouse topologies
  • Senior Data/Analytics Engineers shaping query and storage patterns
  • Platform Engineers owning clusters and cost controls
  • Team leads responsible for BI and ML workload reliability

Prerequisites

  • Comfort with SQL and data modeling (star/snowflake, partitioning)
  • Basics of cloud storage and compute (object stores, columnar formats)
  • Familiarity with your platform's resource model (warehouses, pools, clusters)
  • Understanding of job scheduling and orchestration concepts

Learning path

1) Clarify SLOs and constraints

Write explicit targets for latency, concurrency, freshness, and budget; they guide every design decision. A minimal way to capture them is sketched after this list.

  • Query latency targets (e.g., P95 < 5s for BI)
  • Data freshness (e.g., T+5 min for streaming aggregates)
  • Concurrency expectations (e.g., 200 BI users, 20 ETL jobs)
  • Cost ceiling per month or per workload
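
These targets are easiest to enforce when they live in a machine-readable form that benchmarks and reviews can check against. The sketch below is illustrative only; the metric names and numbers are assumptions, not a standard.

# Illustrative SLO targets (metric names and numbers are assumptions, not a standard)
SLO_TARGETS = {
    "bi_p95_latency_s": 5.0,    # P95 query latency for BI dashboards
    "freshness_lag_min": 5,     # streaming aggregates available within T+5 min
    "monthly_cost_usd": 20000,  # budget ceiling for the workload
}

EXPECTED_PEAK = {"bi_users": 200, "etl_jobs": 20}  # sizing inputs, not pass/fail checks

def slo_breaches(measured, targets=SLO_TARGETS):
    """Return {metric: (measured, target)} for every metric that exceeds its target."""
    return {
        name: (measured[name], limit)
        for name, limit in targets.items()
        if name in measured and measured[name] > limit
    }

For example, slo_breaches({"bi_p95_latency_s": 6.2}) flags the latency breach, so a benchmark run can fail fast instead of shipping a regression.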

2) Workload management strategy

Segment compute by workload. Separate ELT/ETL, BI/Ad‑hoc, and ML to avoid noisy neighbors; a platform-neutral configuration sketch follows this list.

  • Define resource groups/queues/pools per workload
  • Set quotas, timeouts, and priority rules
  • Create retry/backoff and fallbacks
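
Most platforms expose some form of pools, queues, or warehouses with quotas, timeouts, and priorities, although the exact knobs differ. The sketch below is a platform-neutral illustration; the pool names and limits are assumptions.

# Platform-neutral workload isolation sketch (pool names and limits are assumptions)
WORKLOAD_POOLS = {
    "etl":   {"max_concurrency": 20, "timeout_s": 3600, "priority": "normal"},
    "bi":    {"max_concurrency": 40, "timeout_s": 60,   "priority": "high"},
    "adhoc": {"max_concurrency": 10, "timeout_s": 300,  "priority": "low"},
    "ml":    {"max_concurrency": 8,  "timeout_s": 7200, "priority": "normal"},
}

def route(workload):
    """Admission control: reject unclassified workloads instead of letting them share a pool."""
    if workload not in WORKLOAD_POOLS:
        raise ValueError(f"Unknown workload '{workload}'; classify it before it can run")
    return WORKLOAD_POOLS[workload]

Keeping the mapping in one reviewable place ties quotas and timeouts back to the SLOs defined in step 1.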

3) Query performance principles

Design for pruning and parallelism.

  • Choose partition/cluster keys that match common filters
  • Reduce I/O: select necessary columns, push filters early
  • Use statistics, indexes (if supported), and materialized views

4) Aggregation & materialization strategy

Precompute where it matters. Balance freshness, cost, and complexity.

  • Daily/hourly rollups for heavy joins
  • Materialized views for predictable queries
  • Backfill and refresh policies

5) Caching concepts

Serve repeated requests from cache without violating data freshness agreements.

  • Result cache for repeat queries
  • Application cache for high-read metrics
  • Define TTL by data volatility

6) Concurrency & capacity planning

Estimate peak load and plan headroom.

  • Concurrency ≈ arrival rate × avg duration
  • Scale out (more clusters) vs scale up (larger nodes)
  • Adopt admission control and queuing

7) Cost optimization controls

Make performance affordable.

  • Right-size compute; auto-suspend/auto-resume
  • Storage layout: columnar, partitioned, compressed
  • Guardrails: quotas, budgets, tagging

8) Benchmarking & load testing

Measure before you decide.

  • Design repeatable test suites and datasets
  • Warm-up runs, then capture P50/P95, throughput, and cost
  • Document environment and test seeds

9) Scalability reviews

Institutionalize what works.

  • Regular reviews after major growth or feature changes
  • Track regressions and remediation playbooks
  • Share guidance as internal standards

Worked examples

Example 1 — Partitioning and clustering for query pruning

Scenario: Analysts filter dashboards by event_date and country; tables grow by 50M rows/day.

-- Choose event_date as partition key; cluster by country for common secondary filter
-- Example query pattern
SELECT country, COUNT(*) AS sessions
FROM fact_events
WHERE event_date BETWEEN '2026-01-01' AND '2026-01-07'
  AND country IN ('US','DE')
GROUP BY country;

Why it works: Partitioning by date prunes most files; clustering by country reduces scanned blocks, boosting performance and lowering cost.

When not to cluster on country

If the country distribution is extremely skewed (e.g., 70% of rows in one country), clustering on country alone may not help; consider a composite clustering key (e.g., country, device_type) or bucketing.
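
A quick way to sanity-check skew before committing to a clustering key is to look at the share of rows held by the hottest key value. A minimal sketch, assuming you can pull per-key row counts from the catalog or a COUNT(*) ... GROUP BY country query; the 0.5 threshold is an illustrative assumption:

# Skew check: what share of rows does the hottest key hold?
def skew_report(key_counts, threshold=0.5):
    """key_counts maps key value -> row count, e.g. {"US": 35_000_000, "DE": 4_000_000}."""
    total = sum(key_counts.values())
    hot_key, hot_count = max(key_counts.items(), key=lambda kv: kv[1])
    share = hot_count / total
    return hot_key, share, share > threshold  # True means this key alone prunes poorly

When the check flags skew, revisit the composite-key or bucketing options above before re-clustering a large table.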

Example 2 — Materialized view for heavy joins

Scenario: A daily KPI query joins 5 large tables and runs every 10 minutes.

-- Define a materialized view for a common, expensive join
CREATE MATERIALIZED VIEW mv_sales AS
SELECT f.order_id, f.event_date, d.product_id, d.category_id, c.country,
       f.net_revenue, f.qty
FROM fact_sales f
JOIN dim_product d ON f.product_id = d.product_id
JOIN dim_customer c ON f.customer_id = c.customer_id;

Then the KPI query aggregates on mv_sales:

SELECT event_date, category_id, SUM(net_revenue) AS rev
FROM mv_sales
WHERE event_date >= CURRENT_DATE - INTERVAL '7 days'
GROUP BY event_date, category_id;

Trade-offs: Faster queries and lower compute at the cost of storage and refresh overhead. Choose incremental refresh aligned with data arrival.

Example 3 — Application cache with safe TTL

Scenario: The same Top 10 categories chart is requested thousands of times/hour; data updates every 5 minutes.

# Pseudocode for an app-layer cache: serve repeated requests from cache, fall back to the warehouse
def get_top10_categories(cache, db):
    key = "top10:categories:today"
    value = cache.get(key)
    if value is None:
        value = db.query(
            "SELECT category_id, SUM(net_revenue) AS rev FROM mv_sales "
            "WHERE event_date = CURRENT_DATE "
            "GROUP BY category_id ORDER BY rev DESC LIMIT 10"
        )
        cache.set(key, value, ttl=300)  # TTL matches the 5-minute freshness SLO
    return value

Outcome: Reduces database load and stabilizes latency. TTL is set to the data freshness SLO.

Example 4 — Concurrency planning using Little’s Law

Scenario: Expect 180 dashboard requests per minute; average query duration 3 seconds.

# Concurrency estimate
arrival_rate = 180 / 60  # per second
avg_duration = 3         # seconds
concurrency = arrival_rate * avg_duration  # ~9 concurrent queries
# Plan: allow headroom (x2) => design for ~18 concurrent queries

Apply: Size pools/warehouses to sustain ~18 concurrent slots; add a queue and backoff for spikes.
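
For the spike handling mentioned above, a client-side retry with exponential backoff and jitter keeps queued requests from retrying in lockstep. A minimal sketch; submit_query stands in for whatever client call your platform provides:

import random
import time

# Retry with exponential backoff and jitter (submit_query is a placeholder for your client call)
def submit_with_backoff(submit_query, sql, max_attempts=5, base_delay_s=0.5):
    for attempt in range(max_attempts):
        try:
            return submit_query(sql)
        except Exception:  # in practice, catch only queue-full or throttling errors
            if attempt == max_attempts - 1:
                raise
            time.sleep(base_delay_s * (2 ** attempt) * random.uniform(0.5, 1.5))

The jitter factor spreads retries out so a burst of rejected queries does not hit the pool again at the same instant.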

Example 5 — Cost control via column pruning

Scenario: Analysts run SELECT * out of convenience, scanning far more data than their queries need.

-- Replace SELECT * with selected columns only
SELECT order_id, event_date, net_revenue
FROM fact_sales
WHERE event_date BETWEEN '2026-01-01' AND '2026-01-31';

Impact: Columnar stores scan fewer bytes; latency and cost both drop significantly.

Guardrail idea

Use a query policy or linter to block SELECT * in production schemas.
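
A lightweight version of that guardrail can run in CI or as a pre-submit hook. The sketch below uses a naive regex check; a production linter should parse the SQL properly, so treat this as an illustration only.

import re

# Naive guardrail: flag SELECT * in submitted SQL (illustrative; misses cases like "t.*")
SELECT_STAR = re.compile(r"\bselect\s+\*", re.IGNORECASE)

def lint_sql(sql):
    """Return a list of warnings for the given SQL text; an empty list means the check passes."""
    warnings = []
    if SELECT_STAR.search(sql):
        warnings.append("SELECT * found: list only the columns the query needs")
    return warnings

lint_sql("SELECT * FROM fact_sales") returns one warning, while the pruned query in the example above passes.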

Example 6 — Repeatable benchmark plan

Scenario: Compare two cluster sizes before scaling.

Plan:
1) Fix dataset snapshot and statistics.
2) Warm up each cluster with 3 runs.
3) Run 10 iterations capturing: P50/P95 latency, throughput, bytes scanned, cost.
4) Change one variable at a time (cluster size).
5) Decide using P95 latency and cost per successful query.

Decision rule: Prefer configuration that meets SLO with lowest cost per query and acceptable headroom.
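
Once the runs are captured, the comparison reduces to a few summary statistics per configuration. A minimal sketch, assuming each run is recorded with a latency, a success flag, and a cost figure (the record format is an assumption):

# Summarize benchmark runs into P50/P95 latency and cost per successful query
def percentile(sorted_vals, pct):
    """Nearest-rank percentile on an already sorted list."""
    idx = min(int(round(pct / 100 * (len(sorted_vals) - 1))), len(sorted_vals) - 1)
    return sorted_vals[idx]

def summarize(runs):
    """runs: list of dicts like {"latency_s": 2.1, "ok": True, "cost_usd": 0.004}."""
    latencies = sorted(r["latency_s"] for r in runs)
    successes = sum(1 for r in runs if r["ok"])
    return {
        "p50_s": percentile(latencies, 50),
        "p95_s": percentile(latencies, 95),
        "cost_per_success_usd": sum(r["cost_usd"] for r in runs) / max(successes, 1),
    }

Run summarize() once per configuration and pick the one whose P95 meets the SLO at the lowest cost per successful query.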

Drills and exercises

Common mistakes

  • Choosing partition keys that do not match filter patterns, leading to full scans
  • Overusing materialized views without monitoring refresh cost and staleness
  • Mixing ETL and BI on the same compute pool, causing noisy-neighbor effects
  • Ignoring data skew; a few hot keys dominate resources
  • Benchmarking on warm caches only, then failing in production after cache misses
  • Optimizing for average latency instead of P95/P99 tail latency
  • Relying only on scale-up; underusing scale-out and admission control

Debugging tips
  • Check query plan and bytes scanned vs returned rows; excessive scan indicates poor pruning
  • Inspect distribution/partition histograms; high skew calls for salting/bucketing
  • Compare P50 vs P95; large gaps point to concurrency or I/O contention
  • Evaluate cache hit ratio; low hits suggest missing TTL alignment or key design
  • Trace a single request across layers (app, cache, warehouse) to locate bottleneck

Mini project: Scalable KPI pipeline

Goal: Deliver a daily and hourly revenue dashboard that scales to 10× data and 5× users.

  1. Define SLOs: P95 < 5s for dashboard, freshness T+10 min, budget cap per day.
  2. Design storage layout: Partition fact_sales by event_date; cluster by country.
  3. Create rollups: daily_sales and hourly_sales tables; schedule incremental refresh.
  4. Add materialized view for expensive product/customer joins.
  5. Configure workloads: Separate pools for ETL and BI; set concurrency and timeouts.
  6. Add caching: app-layer cache for Top-N charts with 10‑minute TTL.
  7. Benchmark: 10 iterations, capture P95 latency and cost; compare two pool sizes.
  8. Write a scalability review: risks, cost projections, and rollback plan.

Deliverables checklist
  • ERD + partition/cluster rationale
  • SQL for rollups/materialized views
  • Workload configuration and guardrails
  • Benchmark results with decision rationale
  • One-page scalability review

Subskills

  • Workload Management Strategy — Isolate ETL, BI, and Ad‑hoc with quotas, priorities, and admission control.
  • Query Performance Principles — Pruning, parallelism, join strategy, and minimizing scanned bytes.
  • Data Aggregation And Materialization Strategy — Rollups, materialized views, and refresh policies.
  • Caching Concepts — Result cache vs application cache; TTL tied to freshness needs.
  • Concurrency And Capacity Planning — Headroom, queueing, and scale-out vs scale-up decisions.
  • Cost Optimization Controls — Right-sizing, auto-suspend, quotas, and query hygiene.
  • Benchmarking And Load Testing — Repeatable tests with P95 latency and cost per query.
  • Scalability Reviews — Regular audits and standards to prevent regressions.

Next steps

  • Pick one production workload and apply two quick wins (e.g., partitioning fix and query pruning).
  • Introduce a workload isolation policy with a small pilot team.
  • Schedule a monthly scalability review to catch regressions early.

Performance And Scalability — Skill Exam

This exam checks your understanding of performance and scalability for Data Architects. You can take it for free, and your score appears instantly. Anyone can take the exam; only logged-in users have their progress saved and can resume later. Tips: aim for at least 70% to pass, and read carefully: some questions require you to choose the best option under the stated constraints.

12 questions | 70% to pass
