How to learn Sharding Basics for SQL in Data Analyst for free

Who this is for

Data Analysts who query operational databases or read replicas that are sharded.
Analysts building reports from multiple identically structured tables (e.g., customers_shard_0..3).
Anyone collaborating with engineers on data models that span multiple shards.

Prerequisites

Comfortable with SELECT, WHERE, GROUP BY, JOIN, UNION ALL.
Basic understanding of indexes and table partitions.
Know how to read and reason about query execution cost at a high level.

Why this matters

Real systems shard data to scale reads/writes. As an analyst, your queries can become slow or incorrect if they fan out across all shards or double-count results. Knowing sharding basics helps you route queries to the right subset of data, design efficient aggregations, and avoid misleading numbers.

Weekly KPIs from time-sharded order tables without scanning years of data.
Customer investigations routed to a single shard using a shard key formula.
Global rankings built from per-shard aggregates combined safely.

Concept explained simply

Short definition

Sharding is horizontal partitioning of rows across multiple databases or tables (shards). Each shard holds a subset of rows; all shards share the same schema. A shard key decides where each row goes.

Mental model

Imagine mail sorted by ZIP ranges into separate bins. Each bin (shard) has letters (rows) with the same layout. A router uses the ZIP (shard key) to send letters to one bin. Queries that know a ZIP touch one bin. Queries that don’t know ZIP may need to open many bins.

Key terms

Shard: A database or table holding part of the data.
Shard key: Column(s) that determine shard placement (e.g., customer_id).
Routing: Logic to pick the shard(s) a query must hit.
Fan-out: Query that hits many/all shards.
Hotspot: One shard receives disproportionate traffic/data.
Rebalancing: Moving data to keep shards even.

Sharding vs partitioning

Partitioning: Slices data inside one database instance.
Sharding: Spreads data across multiple instances or tables. Often used with a routing layer.

You can have both. For example, each shard can also be partitioned by date.

Shard key choices and trade-offs

Hash on id (e.g., customer_id % N): Good for even distribution; poor for range scans.
Range/time (e.g., created_at by month/quarter): Great for time-bounded queries; risk of hotspots on recent time periods.
Compound key (tenant_id + hash(user_id)): Balances cross-tenant queries and evenness.

Pick a shard key that matches your most frequent and latency-sensitive queries. Avoid keys that concentrate new writes on one shard.

Analyst toolkit for sharded data

Identify the shard key and routing formula (ask engineering or read data docs).
Determine if your question can target one shard (e.g., by customer_id) or a small subset (e.g., a date range).
If you must fan out: compute per-shard aggregates first, UNION ALL them, then aggregate again globally.
For time shards: scan only relevant time tables (e.g., orders_2024_q2 and orders_2024_q3 for May–July 2024).
Use sampling for quick estimates when exactness is not required; mark estimates clearly.

Self-check before running any cross-shard query

Do I know the shard key?
Can I limit to one or a few shards?
If I UNION ALL, do I aggregate results again to avoid double counts?
Is my time filter aligned with time shards?

Worked examples

Example 1: Route to a single shard via hash key

Rule: shard_id = customer_id % 4. To get total spend for customer_id = 90210 in the last 365 days, compute 90210 % 4 = 2, so query only orders_shard_2.

SELECT SUM(order_amount) AS total_amount
FROM orders_shard_2
WHERE customer_id = 90210
  AND created_at >= CURRENT_DATE - INTERVAL '365 days';

Example 2: Time shards across quarters

Tables: orders_2024_q1..q4. To get May–July 2024 revenue, you only need q2 (Apr–Jun) and q3 (Jul–Sep).

WITH union_months AS (
  SELECT DATE_TRUNC('month', created_at) AS month, order_amount
  FROM orders_2024_q2
  WHERE created_at >= '2024-05-01' AND created_at < '2024-08-01'
  UNION ALL
  SELECT DATE_TRUNC('month', created_at), order_amount
  FROM orders_2024_q3
  WHERE created_at >= '2024-05-01' AND created_at < '2024-08-01'
)
SELECT month, SUM(order_amount) AS total_revenue
FROM union_months
GROUP BY month
ORDER BY month;

Example 3: Global top products via per-shard rollups

Compute per-shard revenue, then combine.

WITH per_shard AS (
  SELECT product_id, SUM(amount) AS rev FROM tx_shard_0 WHERE created_at >= CURRENT_DATE - INTERVAL '7 days' GROUP BY 1
  UNION ALL
  SELECT product_id, SUM(amount) FROM tx_shard_1 WHERE created_at >= CURRENT_DATE - INTERVAL '7 days' GROUP BY 1
  UNION ALL
  SELECT product_id, SUM(amount) FROM tx_shard_2 WHERE created_at >= CURRENT_DATE - INTERVAL '7 days' GROUP BY 1
  UNION ALL
  SELECT product_id, SUM(amount) FROM tx_shard_3 WHERE created_at >= CURRENT_DATE - INTERVAL '7 days' GROUP BY 1
)
SELECT product_id, SUM(rev) AS total_rev
FROM per_shard
GROUP BY product_id
ORDER BY total_rev DESC
LIMIT 5;

Example 4: Fast estimate by tenant sampling

Pick ~10% of tenants by hash and scale the result. Mark as estimate.

WITH sampled AS (
  SELECT * FROM tenants WHERE MOD(tenant_id, 10) = 0
), per_shard AS (
  SELECT SUM(o.amount) AS rev
  FROM orders_shard_0 o JOIN sampled s ON o.tenant_id = s.tenant_id
  UNION ALL
  SELECT SUM(o.amount) FROM orders_shard_1 o JOIN sampled s ON o.tenant_id = s.tenant_id
)
SELECT SUM(rev) * 10 AS estimated_total_revenue FROM per_shard;

Varies by country/company; treat as rough ranges.

Common mistakes and how to self-check

Scanning all shards by default. Fix: compute routing first; limit shards.
Double counting after UNION ALL. Fix: aggregate again at the end.
Joining across shards directly. Fix: aggregate per shard, then join/aggregate on the results.
Picking a hotspot-prone shard key (e.g., created_at only). Fix: mix hash with time or use a more even key.
Mismatched time filters vs time shards. Fix: align WHERE to shard boundaries.

Quick self-audit checklist

I know the shard key and formula.
I am hitting only necessary shards.
UNION ALL followed by a final GROUP BY is present if fan-out occurred.
Date filters match time shards.
If I estimated, I labeled the result as an estimate.

Exercises

Hands-on practice. After completing, take the Quick Test at the bottom. The test is available to everyone; only logged-in users get saved progress.

Route a single-customer query to the right shard
Tables: customers_shard_0..3(id, name), orders_shard_0..3(order_id, customer_id, order_amount, created_at). Rule: shard_id = customer_id % 4. Task: total spent by customer_id = 90210 in the last 365 days.
Union only needed time shards
Tables: orders_2024_q1..q4. Task: monthly revenue for May–July 2024 (crosses Q2 and Q3).
Global top categories from per-shard aggregates
Tables: transactions_shard_0..3(product_category, amount, created_at). Task: top 3 categories by revenue in last 30 days.

Exercise checklist

I computed or identified the correct shard(s).
I used UNION ALL only for the necessary shards.
I aggregated after UNION ALL to avoid double counts.
I aligned dates to shard boundaries.

Practical projects

Build a routing cheat-sheet: document shard keys, formulas, and table naming patterns for your team.
Create a global KPI view: per-shard daily aggregates UNION ALLed into a stable reporting view, with a final GROUP BY.
Benchmark: compare a naive cross-shard scan vs targeted shards using EXPLAIN or timing; document the improvement.

Learning path

Start: Sharding Basics (this lesson).
Next: Table Partitioning and pruning.
Then: Distributed SQL concepts (fan-out, consistency, rebalancing).
Finally: Data warehouse ingestion from sharded sources and building federated views.

Next steps

Ask engineering for the shard key and routing rules in your environment.
Refactor one heavy report to use per-shard aggregates and targeted unions.
Create a small sampling query for quick estimates and validate it against a full run once.

Mini challenge

You need to design queries for an events table with 1B+ rows. Frequent questions: per-tenant weekly active users, and global trends by day. Propose a shard key and explain how you would query both per-tenant and global views efficiently. Write a sample per-tenant query and a global rollup using per-shard UNION ALL with a final GROUP BY.

Menu

Sharding Basics

Table of Contents

Who this is for

Prerequisites

Why this matters

Concept explained simply

Sharding vs partitioning

Shard key choices and trade-offs

Analyst toolkit for sharded data

Worked examples

Example 1: Route to a single shard via hash key

Example 2: Time shards across quarters

Example 3: Global top products via per-shard rollups

Example 4: Fast estimate by tenant sampling

Common mistakes and how to self-check

Exercises

Practical projects

Learning path

Next steps

Mini challenge

Practice Exercises

Route a single-customer query to the right shard

Instructions

Expected Output

Union only needed time shards

Global top categories from per-shard aggregates

Sharding Basics — Quick Test

Have questions about Sharding Basics?

AI Assistant