Who this is for
- Data Analysts who query operational databases or read replicas that are sharded.
- Analysts building reports from multiple identically structured tables (e.g., customers_shard_0..3).
- Anyone collaborating with engineers on data models that span multiple shards.
Prerequisites
- Comfortable with SELECT, WHERE, GROUP BY, JOIN, UNION ALL.
- Basic understanding of indexes and table partitions.
- Know how to read and reason about query execution cost at a high level.
Why this matters
Real systems shard data to scale reads/writes. As an analyst, your queries can become slow or incorrect if they fan out across all shards or double-count results. Knowing sharding basics helps you route queries to the right subset of data, design efficient aggregations, and avoid misleading numbers.
- Weekly KPIs from time-sharded order tables without scanning years of data.
- Customer investigations routed to a single shard using a shard key formula.
- Global rankings built from per-shard aggregates combined safely.
Concept explained simply
Short definition
Sharding is horizontal partitioning of rows across multiple databases or tables (shards). Each shard holds a subset of rows; all shards share the same schema. A shard key decides where each row goes.
Mental model
Imagine mail sorted by ZIP ranges into separate bins. Each bin (shard) has letters (rows) with the same layout. A router uses the ZIP (shard key) to send letters to one bin. Queries that know a ZIP touch one bin. Queries that don’t know ZIP may need to open many bins.
Key terms
- Shard: A database or table holding part of the data.
- Shard key: Column(s) that determine shard placement (e.g., customer_id).
- Routing: Logic to pick the shard(s) a query must hit.
- Fan-out: Query that hits many/all shards.
- Hotspot: One shard receives disproportionate traffic/data.
- Rebalancing: Moving data to keep shards even.
Sharding vs partitioning
- Partitioning: Slices data inside one database instance.
- Sharding: Spreads data across multiple instances or tables. Often used with a routing layer.
You can have both. For example, each shard can also be partitioned by date.
Shard key choices and trade-offs
- Hash on id (e.g., customer_id % N): Good for even distribution; poor for range scans.
- Range/time (e.g., created_at by month/quarter): Great for time-bounded queries; risk of hotspots on recent time periods.
- Compound key (tenant_id + hash(user_id)): Balances cross-tenant queries and evenness.
Pick a shard key that matches your most frequent and latency-sensitive queries. Avoid keys that concentrate new writes on one shard.
Analyst toolkit for sharded data
- Identify the shard key and routing formula (ask engineering or read data docs).
- Determine if your question can target one shard (e.g., by customer_id) or a small subset (e.g., a date range).
- If you must fan out: compute per-shard aggregates first, UNION ALL them, then aggregate again globally.
- For time shards: scan only relevant time tables (e.g., orders_2024_q2 and orders_2024_q3 for May–July 2024).
- Use sampling for quick estimates when exactness is not required; mark estimates clearly.
Self-check before running any cross-shard query
- Do I know the shard key?
- Can I limit to one or a few shards?
- If I UNION ALL, do I aggregate results again to avoid double counts?
- Is my time filter aligned with time shards?
Worked examples
Example 1: Route to a single shard via hash key
Rule: shard_id = customer_id % 4. To get total spend for customer_id = 90210 in the last 365 days, compute 90210 % 4 = 2, so query only orders_shard_2.
SELECT SUM(order_amount) AS total_amount
FROM orders_shard_2
WHERE customer_id = 90210
AND created_at >= CURRENT_DATE - INTERVAL '365 days';Example 2: Time shards across quarters
Tables: orders_2024_q1..q4. To get May–July 2024 revenue, you only need q2 (Apr–Jun) and q3 (Jul–Sep).
WITH union_months AS (
SELECT DATE_TRUNC('month', created_at) AS month, order_amount
FROM orders_2024_q2
WHERE created_at >= '2024-05-01' AND created_at < '2024-08-01'
UNION ALL
SELECT DATE_TRUNC('month', created_at), order_amount
FROM orders_2024_q3
WHERE created_at >= '2024-05-01' AND created_at < '2024-08-01'
)
SELECT month, SUM(order_amount) AS total_revenue
FROM union_months
GROUP BY month
ORDER BY month;Example 3: Global top products via per-shard rollups
Compute per-shard revenue, then combine.
WITH per_shard AS (
SELECT product_id, SUM(amount) AS rev FROM tx_shard_0 WHERE created_at >= CURRENT_DATE - INTERVAL '7 days' GROUP BY 1
UNION ALL
SELECT product_id, SUM(amount) FROM tx_shard_1 WHERE created_at >= CURRENT_DATE - INTERVAL '7 days' GROUP BY 1
UNION ALL
SELECT product_id, SUM(amount) FROM tx_shard_2 WHERE created_at >= CURRENT_DATE - INTERVAL '7 days' GROUP BY 1
UNION ALL
SELECT product_id, SUM(amount) FROM tx_shard_3 WHERE created_at >= CURRENT_DATE - INTERVAL '7 days' GROUP BY 1
)
SELECT product_id, SUM(rev) AS total_rev
FROM per_shard
GROUP BY product_id
ORDER BY total_rev DESC
LIMIT 5;Example 4: Fast estimate by tenant sampling
Pick ~10% of tenants by hash and scale the result. Mark as estimate.
WITH sampled AS (
SELECT * FROM tenants WHERE MOD(tenant_id, 10) = 0
), per_shard AS (
SELECT SUM(o.amount) AS rev
FROM orders_shard_0 o JOIN sampled s ON o.tenant_id = s.tenant_id
UNION ALL
SELECT SUM(o.amount) FROM orders_shard_1 o JOIN sampled s ON o.tenant_id = s.tenant_id
)
SELECT SUM(rev) * 10 AS estimated_total_revenue FROM per_shard;Varies by country/company; treat as rough ranges.
Common mistakes and how to self-check
- Scanning all shards by default. Fix: compute routing first; limit shards.
- Double counting after UNION ALL. Fix: aggregate again at the end.
- Joining across shards directly. Fix: aggregate per shard, then join/aggregate on the results.
- Picking a hotspot-prone shard key (e.g., created_at only). Fix: mix hash with time or use a more even key.
- Mismatched time filters vs time shards. Fix: align WHERE to shard boundaries.
Quick self-audit checklist
- I know the shard key and formula.
- I am hitting only necessary shards.
- UNION ALL followed by a final GROUP BY is present if fan-out occurred.
- Date filters match time shards.
- If I estimated, I labeled the result as an estimate.
Exercises
Hands-on practice. After completing, take the Quick Test at the bottom. The test is available to everyone; only logged-in users get saved progress.
- Route a single-customer query to the right shard
Tables: customers_shard_0..3(id, name), orders_shard_0..3(order_id, customer_id, order_amount, created_at). Rule: shard_id = customer_id % 4. Task: total spent by customer_id = 90210 in the last 365 days. - Union only needed time shards
Tables: orders_2024_q1..q4. Task: monthly revenue for May–July 2024 (crosses Q2 and Q3). - Global top categories from per-shard aggregates
Tables: transactions_shard_0..3(product_category, amount, created_at). Task: top 3 categories by revenue in last 30 days.
Exercise checklist
- I computed or identified the correct shard(s).
- I used UNION ALL only for the necessary shards.
- I aggregated after UNION ALL to avoid double counts.
- I aligned dates to shard boundaries.
Practical projects
- Build a routing cheat-sheet: document shard keys, formulas, and table naming patterns for your team.
- Create a global KPI view: per-shard daily aggregates UNION ALLed into a stable reporting view, with a final GROUP BY.
- Benchmark: compare a naive cross-shard scan vs targeted shards using EXPLAIN or timing; document the improvement.
Learning path
- Start: Sharding Basics (this lesson).
- Next: Table Partitioning and pruning.
- Then: Distributed SQL concepts (fan-out, consistency, rebalancing).
- Finally: Data warehouse ingestion from sharded sources and building federated views.
Next steps
- Ask engineering for the shard key and routing rules in your environment.
- Refactor one heavy report to use per-shard aggregates and targeted unions.
- Create a small sampling query for quick estimates and validate it against a full run once.
Mini challenge
You need to design queries for an events table with 1B+ rows. Frequent questions: per-tenant weekly active users, and global trends by day. Propose a shard key and explain how you would query both per-tenant and global views efficiently. Write a sample per-tenant query and a global rollup using per-shard UNION ALL with a final GROUP BY.