Why this matters
As a Data Analyst, you often run queries on large tables: finding recent orders, filtering by customer segments, building dashboards, and preparing A/B reports. The difference between a 50 ms query and a 50 s query is usually the right index. Knowing when to use B-tree, Hash, Composite, and Covering indexes helps you:
- Speed up filters, joins, and ORDER BY/LIMIT queries.
- Cut compute costs by avoiding full table scans.
- Make dashboards responsive and ad-hoc analysis snappy.
Concept explained simply
An index is a compact structure that lets the database find rows quickly without scanning the whole table, much like a book's index. Different index types excel at different query patterns (a minimal DDL sketch follows this list):
- B-tree index: The default in many systems. Great for equality (=) and range (>, <, BETWEEN), and helps with ORDER BY on the indexed columns.
- Hash index: Optimized for equality lookups only (e.g., WHERE token = ...). Not for range, sorting, or prefix searches. Availability and behavior vary by database engine.
- Composite index: An index on multiple columns (e.g., (customer_id, order_date)). The leftmost (leading) column matters for how the index is used.
- Covering index: An index that contains all columns needed by a query, so the database can answer from the index alone (index-only scan) without touching the table. In some systems you can add non-key columns to cover a query (e.g., INCLUDE).
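To make these concrete, here is a minimal DDL sketch in PostgreSQL-flavored syntax. It reuses the orders and sessions tables from the worked examples below; the index names are arbitrary, and Hash and INCLUDE support varies by engine.
-- B-tree (the default access method when none is specified)
CREATE INDEX idx_orders_date ON orders (order_date);
-- Hash (equality-only lookups)
CREATE INDEX idx_sessions_token_hash ON sessions USING HASH (session_token);
-- Composite (the leftmost column, customer_id, leads)
CREATE INDEX idx_orders_cust_date ON orders (customer_id, order_date);
-- Covering (a non-key column added so matching queries can be answered from the index alone)
CREATE INDEX idx_orders_cust_date_amt ON orders (customer_id, order_date) INCLUDE (total_amount);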
Mental model
- Seek vs Scan: Without an index, the database scans every row. With the right index, it seeks directly to matching entries, reads fewer pages, and returns results faster.
- Selectivity: Indexes are most useful when a predicate filters to a small subset of rows (high selectivity); see the quick check after this list.
- Leading column rule: For composite indexes, the order of columns matters. The index is most effective when your WHERE and ORDER BY use the leftmost columns.
- Trade-offs: Indexes accelerate reads but slow down writes (INSERT/UPDATE/DELETE) and use storage. Add only those that deliver real benefit.
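A quick way to gauge selectivity before adding an index is to compare distinct values with total rows. A minimal sketch, assuming an orders table with a status column (the 1.0 factor just forces decimal division):
-- Average selectivity: values close to 1.0 suggest a highly selective column
SELECT COUNT(DISTINCT status) * 1.0 / COUNT(*) AS avg_selectivity,
       COUNT(*) AS total_rows
FROM orders;
-- Check skew: one dominant value (e.g., 95% 'completed') makes the column a poor
-- candidate for a standalone index
SELECT status, COUNT(*) AS rows_with_value
FROM orders
GROUP BY status
ORDER BY rows_with_value DESC;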
Worked examples
Example 1 — B-tree index for range + sort
Task: Show the latest 20 orders for a given customer quickly.
-- Query
SELECT order_id, order_date, total_amount
FROM orders
WHERE customer_id = 123
AND order_date >= DATE '2024-01-01'
ORDER BY order_date DESC
LIMIT 20;
-- Index (works in most engines; the DESC modifier is optional, since a B-tree can also be scanned backward)
CREATE INDEX idx_orders_customer_date ON orders (customer_id, order_date DESC);
Why this works: The composite B-tree lets the engine find the subset for customer_id, then read the date range in order, making ORDER BY + LIMIT efficient.
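To confirm the planner actually uses the index, check the plan. A sketch in PostgreSQL syntax (option names and output wording differ in other engines):
EXPLAIN (ANALYZE, BUFFERS)
SELECT order_id, order_date, total_amount
FROM orders
WHERE customer_id = 123
  AND order_date >= DATE '2024-01-01'
ORDER BY order_date DESC
LIMIT 20;
-- Expect an Index Scan on idx_orders_customer_date feeding the Limit,
-- with no separate Sort node and far fewer pages read than a sequential scan.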
Example 2 — Hash index for equality-only lookups
Task: Fetch a user session by exact token.
-- Query
SELECT user_id, last_seen
FROM sessions
WHERE session_token = 'abc123';
-- Option A (engines that support hash indexes, e.g., PostgreSQL):
CREATE INDEX idx_sessions_token_hash ON sessions USING HASH (session_token);
-- Option B (portable fallback):
CREATE INDEX idx_sessions_token ON sessions (session_token);
Use a Hash index only for equality comparisons, and only when your engine actually benefits from it; otherwise, a B-tree on the column is the safe choice.
Example 3 — Covering index for index-only scans
Task: A dashboard lists order_id and total_amount for a customer’s last 50 orders by date.
-- Query
SELECT order_id, total_amount
FROM orders
WHERE customer_id = 123
ORDER BY order_date DESC
LIMIT 50;
-- PostgreSQL (cover with INCLUDE; list order_id too, because PostgreSQL does not
-- add the primary key to secondary indexes automatically)
CREATE INDEX idx_orders_cover ON orders (customer_id, order_date DESC) INCLUDE (order_id, total_amount);
-- MySQL/InnoDB (no INCLUDE keyword)
-- (customer_id, order_date, total_amount) covers the query because InnoDB secondary
-- indexes implicitly carry the primary key (here assumed to be order_id)
CREATE INDEX idx_orders_cover ON orders (customer_id, order_date, total_amount);
-- SQL Server (INCLUDE non-key columns; order_id is carried implicitly if it is the
-- clustered key, otherwise add it to the INCLUDE list)
CREATE INDEX idx_orders_cover ON dbo.orders (customer_id, order_date DESC) INCLUDE (total_amount);
When fully covered, the engine can avoid visiting the base table, reducing I/O and latency.
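You can confirm coverage in the plan; the wording differs by engine, so treat this as a sketch:
-- PostgreSQL: expect "Index Only Scan using idx_orders_cover"
-- (index-only scans also rely on the visibility map, so a recently vacuumed table helps)
EXPLAIN
SELECT order_id, total_amount
FROM orders
WHERE customer_id = 123
ORDER BY order_date DESC
LIMIT 50;
-- MySQL: the same EXPLAIN shows "Using index" in the Extra column when covered.
-- SQL Server: the actual plan shows an Index Seek with no Key Lookup.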
Example 4 — Composite index order matters
Task: Two queries on events(user_id, occurred_at, event_type):
-- Q1: filter by user, then time range
SELECT * FROM events
WHERE user_id = 7 AND occurred_at BETWEEN NOW() - INTERVAL '7 days' AND NOW();
-- Q2: filter by time range only
SELECT count(*) FROM events
WHERE occurred_at >= NOW() - INTERVAL '1 day';
-- Index choice
CREATE INDEX idx_events_user_time ON events (user_id, occurred_at);
This index is great for Q1 (matches leading user_id then uses occurred_at). For Q2 (time-only filter), you may also want an index on occurred_at alone if it’s a frequent query.
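If Q2 runs often, a separate single-column index serves the time-only filter. A minimal sketch (the index name is arbitrary), to be weighed against the extra write cost:
-- Serves range filters on occurred_at alone (Q2)
CREATE INDEX idx_events_time ON events (occurred_at);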
How to choose the right index (quick steps)
- Identify the WHERE columns and the required sort order.
- Choose B-tree or Hash deliberately: B-tree for ranges and sorting, Hash for equality only.
- Order composite columns to match how the query filters and sorts.
- Cover the query to avoid table lookups when it helps.
- Verify with EXPLAIN and row/page reads.
Exercises you can do now
These mirror the graded exercises below. Run them on a realistic dataset if possible.
- Composite + covering. Speed up: SELECT order_id, total_amount FROM sales WHERE customer_id = ? AND order_date BETWEEN ? AND ? ORDER BY order_date DESC LIMIT 50; Create a composite B-tree and make it covering.
- Equality-heavy lookup. Speed up: SELECT user_id FROM api_keys WHERE api_key = ?; Try a Hash index if supported, otherwise a B-tree, and compare EXPLAIN before/after (a sketch follows this list).
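A minimal before/after workflow for the equality-heavy exercise might look like this; api_keys comes from the exercise, the literal key and index names are placeholders, and USING HASH is PostgreSQL syntax:
-- Before: capture the baseline plan (typically a sequential / full table scan)
EXPLAIN ANALYZE
SELECT user_id FROM api_keys WHERE api_key = 'example-key';
-- Add the index (Hash where supported, otherwise the portable B-tree fallback)
CREATE INDEX idx_api_keys_key_hash ON api_keys USING HASH (api_key);
-- CREATE INDEX idx_api_keys_key ON api_keys (api_key);
-- After: re-run the same EXPLAIN ANALYZE and compare scan type, rows read, and timing
EXPLAIN ANALYZE
SELECT user_id FROM api_keys WHERE api_key = 'example-key';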
Common mistakes and how to self-check
- Indexing low-selectivity columns alone (e.g., status with 95% of rows the same). Self-check: Does the predicate filter to a small fraction?
- Wrong composite order. Self-check: Do WHERE clauses use the leftmost columns? Does the ORDER BY align with the index order?
- Overlapping/duplicate indexes. Self-check: Can one composite index serve multiple queries? Remove redundant ones (the catalog query after this list can help).
- Expecting Hash to help ranges or sorting. Self-check: Does the query use only equality? If not, prefer B-tree.
- Ignoring write cost. Self-check: Are inserts/updates slowing down after adding indexes?
- Not validating with EXPLAIN. Self-check: Confirm Index Seek/Index Scan and row estimates match expectations.
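To back up the redundant-index and write-cost checks, the database's own statistics help. A PostgreSQL-flavored sketch (SQL Server offers sys.dm_db_index_usage_stats for the same purpose):
-- Indexes that are never (or rarely) scanned are candidates for review
SELECT schemaname, relname AS table_name, indexrelname AS index_name, idx_scan
FROM pg_stat_user_indexes
ORDER BY idx_scan, relname;
-- Compare definitions to spot overlapping indexes before dropping anything
SELECT tablename, indexname, indexdef
FROM pg_indexes
WHERE schemaname = 'public'
ORDER BY tablename, indexname;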
Self-check mini task
Pick 2 of your slowest queries. For each, list WHERE columns, joins, ORDER BY. Propose one index. Use EXPLAIN to confirm usage. If not used, ask: wrong order? low selectivity? already covered by another index?
Practical projects
- Dashboard turbocharge: Take a BI dashboard with 3 slow widgets. Add/adjust indexes so all load under 500 ms. Document each index, expected benefit, and EXPLAIN before/after.
- Query coverage audit: For a reporting schema, list top 10 queries and map them to existing indexes. Remove two redundant indexes safely after verifying no regressions.
- A/B analysis accelerator: Create composite indexes that speed up filtering by experiment_id and event timestamps across two large tables; show join improvements.
Mini challenge
You have a query: SELECT order_id, order_date FROM orders WHERE store_id = 77 AND order_date >= CURRENT_DATE - INTERVAL '30 days' ORDER BY order_date DESC LIMIT 100; Propose one index that helps both filtering and sorting, and say whether it could be covering if the SELECT also needs total_amount.
Suggested answer
CREATE INDEX idx_orders_store_date ON orders (store_id, order_date DESC); To cover total_amount as well, add INCLUDE (order_id, total_amount) where supported, or append the extra columns to the key in engines without INCLUDE; engines whose secondary indexes carry the primary key (such as InnoDB) may not need order_id listed explicitly.
Who this is for
- Data Analysts who run frequent filters/aggregations on large tables.
- Anyone tuning SQL for dashboards, reports, or ad-hoc analysis.
Prerequisites
- Comfort with SELECT, WHERE, JOIN, GROUP BY, ORDER BY.
- Know how to run EXPLAIN/EXPLAIN ANALYZE in your database.
Learning path
- Review slow queries and their predicates.
- Learn B-tree basics and the leading column rule.
- Practice composite indexes with ORDER BY + LIMIT.
- Explore covering indexes and measure index-only scans.
- Compare equality-only patterns with Hash vs B-tree.
- Clean up redundant indexes and measure write overhead.
Next steps
- Do the exercises below and verify with EXPLAIN.
- Take the quick test to check your understanding.
- Apply indexes on a real dataset and record latency improvements.