Who this is for
Backend engineers who write queries, design schemas, and need fast, predictable performance in production APIs, jobs, and analytics read paths.
Prerequisites
- Basic SQL (SELECT, WHERE, ORDER BY, JOIN)
- Understanding of primary keys and foreign keys
- Familiarity with your database’s query plan viewer (e.g., EXPLAIN)
Why this matters
Indexes are the difference between millisecond responses and timeouts. In real work you will:
- Speed up API endpoints that filter by email, status, or date ranges.
- Scale background jobs that scan large tables (orders, events, logs).
- Cut cloud costs by reducing full-table scans and CPU spikes.
- Keep SLAs during peak traffic while supporting new features.
Concept explained simply
An index is a data structure that helps the database find rows faster, like a book’s index. Instead of scanning every row, the DB jumps to the right area and narrows down quickly.
Mental model
- B-tree index: a sorted map of keys to row locations. Good for equality, ranges, and sorting.
- Composite index (a, b, c): effectively sorted by a, then b, then c. The leftmost prefix rule applies.
- Every index speeds up reads but adds write cost (inserts/updates/deletes must maintain it).
Index types at a glance (when to use)
- B-tree: default. Equality, ranges, ordering.
- Hash: equality only (engine-specific usefulness).
- GIN: membership/search for arrays, JSON, full-text.
- GiST: geometric, ranges, nearest-neighbor.
- BRIN: large, append-only tables with natural ordering (time-series).
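For reference, a rough sketch of the creation syntax in PostgreSQL; the table and column names are hypothetical, and other engines differ:
CREATE INDEX idx_users_email_btree ON users (email);                    -- B-tree (the default access method)
CREATE INDEX idx_sessions_token_hash ON sessions USING hash (token);    -- Hash: equality lookups only
CREATE INDEX idx_tickets_tags_gin ON tickets USING gin (tags);          -- GIN: array/JSON/full-text membership
CREATE INDEX idx_shops_location_gist ON shops USING gist (location);    -- GiST: geometric/nearest-neighbor (location assumed to be a point)
CREATE INDEX idx_events_created_brin ON events USING brin (created_at); -- BRIN: huge, naturally ordered tables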
Core concepts that guide index design
- Selectivity and cardinality: High selectivity (few rows match) makes indexes more effective.
- Composite indexes and leftmost prefix: an index on (a, b, c) can serve filters on a, or on a and b, or on all three; it cannot seek on b or c without a (see the sketch after this list).
- Order support: a B-tree can satisfy ORDER BY without a sort when the ORDER BY matches the index order and the earlier index columns are fixed by equality filters.
- Covering index: If the index holds all needed columns, the DB can avoid visiting the table (index-only scan).
- Write overhead: Each additional index slows writes and increases storage; only add what you need.
- Partial index: Index just the hot subset (e.g., WHERE deleted_at IS NULL).
- Unique index: Enforces uniqueness and speeds up lookups.
- Functional/expression index: index on lower(email) or date(order_date) if queries filter on that expression.
- Nulls behavior: Some engines treat NULLs specially; check how your DB handles them in indexes.
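To make the leftmost prefix rule concrete, here is a minimal sketch on a hypothetical table t with a composite index on (a, b, c):
-- Hypothetical composite index
CREATE INDEX idx_t_abc ON t (a, b, c);
-- Predicates that can seek via the index (leftmost prefix present):
--   WHERE a = 1
--   WHERE a = 1 AND b = 2
--   WHERE a = 1 AND b = 2 AND c > 10
-- Predicates that cannot seek efficiently (no leading filter on a):
--   WHERE b = 2
--   WHERE c > 10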
Worked examples
1) User lookup by email (equality)
-- Query
SELECT id FROM users WHERE email = 'a@b.com';
-- Index
CREATE UNIQUE INDEX idx_users_email ON users(email);
Why: High selectivity; unique guarantee; speeds logins and account lookups.
2) Orders by customer with status and recent date (filter + sort)
-- Query
SELECT *
FROM orders
WHERE customer_id = $1 AND status = 'paid' AND order_date >= now() - interval '30 days'
ORDER BY order_date DESC
LIMIT 50;
-- Candidate composite index (PostgreSQL-like)
CREATE INDEX idx_orders_customer_status_date
ON orders(customer_id, status, order_date DESC);
Why: Matches leftmost prefix (customer_id, status) and supports the DESC ordering on order_date to avoid a sort.
3) Search by tags membership (arrays/JSON)
-- Query (PostgreSQL-style arrays)
SELECT id FROM tickets WHERE tags @> ARRAY['urgent'];
-- Index (GIN for membership)
CREATE INDEX idx_tickets_tags_gin ON tickets USING GIN(tags);
Why: GIN accelerates contains/overlap on arrays and JSON paths.
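The same idea extends to JSON containment; a minimal sketch assuming a hypothetical documents table with a JSONB payload column:
-- Query (PostgreSQL JSONB containment)
SELECT id FROM documents WHERE payload @> '{"status": "open"}';
-- Index
CREATE INDEX idx_documents_payload_gin ON documents USING GIN(payload);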
4) Partial index on active rows (soft delete)
-- Query
SELECT id FROM posts WHERE deleted_at IS NULL AND slug = $1;
-- Partial unique index only for active rows
CREATE UNIQUE INDEX idx_posts_active_slug
ON posts(slug)
WHERE deleted_at IS NULL;
Why: Keeps index small and enforces uniqueness only among active posts.
How to choose an index (practical steps)
- Identify top queries: High-traffic endpoints and slow queries.
- Extract patterns: Predicates in WHERE, JOIN keys, ORDER BY, LIMIT.
- Estimate selectivity: Prefer indexing columns that sharply reduce rows.
- Pick index type: B-tree for most; GIN/GiST/BRIN for special cases.
- Choose column order: Most selective and consistently used predicates first; then sort columns.
- Cover if needed: Add included columns for index-only scans when beneficial.
- Validate: Use your DB’s EXPLAIN/plan viewer; confirm the index is chosen.
- Monitor: Track latency, CPU, and index bloat; prune unused indexes.
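For the monitoring step, a sketch of one way to spot never-used indexes in PostgreSQL (via pg_stat_user_indexes); treat matches as candidates to review, not automatic drops:
-- Indexes never scanned since statistics were last reset, largest first
SELECT relname AS table_name,
       indexrelname AS index_name,
       pg_size_pretty(pg_relation_size(indexrelid)) AS index_size
FROM pg_stat_user_indexes
WHERE idx_scan = 0
ORDER BY pg_relation_size(indexrelid) DESC;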
Quick self-estimation trick
Ask: Does this index let the DB jump directly to a tiny fraction of rows and return them already sorted? If yes, it’s likely a win.
Exercises
Try these. A solution follows each exercise.
Exercise 1 — Feed page for published posts
Query pattern:
SELECT id, title, published_at
FROM posts
WHERE published = true AND category_id = $1
ORDER BY published_at DESC
LIMIT 20;
- Design an index to make this fast without an extra sort.
- Assume many categories, and only ~10% of posts are published.
Show solution
CREATE INDEX idx_posts_feed
ON posts(category_id, published, published_at DESC);
-- Why
-- 1) Filters on category_id and published are satisfied by the leftmost prefix.
-- 2) DESC order on published_at matches the index to avoid a sort.
-- 3) Consider a partial index if queries always use published = true:
-- CREATE INDEX idx_posts_feed_active
-- ON posts(category_id, published_at DESC)
-- WHERE published = true;
Exercise 2 — Top purchasers last 7 days
Query pattern (PostgreSQL-like):
SELECT user_id, count(*) AS purchases
FROM events
WHERE type = 'purchase' AND created_at >= now() - interval '7 days'
GROUP BY user_id
ORDER BY purchases DESC
LIMIT 50;
- Propose an index that narrows scanning and helps an index-only scan.
- Table is append-only and large.
Show solution
-- Focus on filtering first: type and recent created_at.
-- Cover user_id to avoid table visits when possible.
CREATE INDEX idx_events_purchase_recent
ON events(type, created_at DESC)
INCLUDE (user_id);
-- Optional: make it partial to shrink the index (if type filter is always used):
-- CREATE INDEX idx_events_purchase_recent_partial
-- ON events(created_at DESC)
-- INCLUDE (user_id)
-- WHERE type = 'purchase';
-- Notes
-- - The dynamic 7-day boundary still benefits from the created_at ordering.
-- - INCLUDE (user_id) supports index-only scans for counting per user.
-- - For very large time-series tables, consider BRIN on created_at in addition to the above.
Checklist
- Does your index match the WHERE predicates’ leftmost prefix?
- If you ORDER BY, does index order align to avoid a sort?
- Could a partial index reduce size without breaking queries?
- Can the query be satisfied by an index-only scan?
- Are you okay with the extra write/storage cost?
Common mistakes and how to self-check
- Mistake: Indexing low-selectivity columns alone (e.g., a boolean). Fix: Combine with a selective column or use a partial index.
- Mistake: Wrong column order in composite indexes. Fix: Put the most selective and frequently filtered first, then sort keys.
- Mistake: Expecting ORDER BY to use index when WHERE doesn’t match leftmost prefix. Fix: Align predicates with index order.
- Mistake: Too many indexes slowing writes. Fix: Remove unused ones; measure write latency and size.
- Mistake: Wrapping a column in a function defeats the index (e.g., WHERE lower(email) = ...). Fix: Create an expression index that matches the function, as sketched below.
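A minimal sketch of that last fix, assuming a users table with a text email column:
-- Query that wraps the column in a function
SELECT id FROM users WHERE lower(email) = lower($1);
-- A plain index on email cannot serve this predicate; index the expression instead
CREATE INDEX idx_users_email_lower ON users (lower(email));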
Self-check method
- Run EXPLAIN (or your DB’s plan tool) before/after creating the index.
- Confirm index scan is used and sorting is removed.
- Measure latency over realistic data volumes.
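As a sketch of the before/after check in PostgreSQL, using the query and index from worked example 2 (exact plans vary with data size and statistics):
EXPLAIN (ANALYZE, BUFFERS)
SELECT *
FROM orders
WHERE customer_id = 42 AND status = 'paid' AND order_date >= now() - interval '30 days'
ORDER BY order_date DESC
LIMIT 50;
-- Before the index: expect a sequential scan feeding an explicit Sort node.
-- After creating idx_orders_customer_status_date: expect an index scan on that index,
-- with the Sort node gone and far fewer rows touched.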
Practical projects
- Audit your top 5 slow queries. For each, propose one index, create it in a staging database, and compare plans and timings.
- Implement a partial index for soft-deleted rows and measure index size difference.
- Add a GIN index for a JSON or array filter and compare performance with/without it.
Learning path
- Before this: Basic SQL and schema design.
- This subskill: Choosing and validating indexes for common workloads.
- Next: Query optimization (joins, window functions), transactions and locking, and database observability.
Next steps
- Run EXPLAIN on two of your queries and verify whether existing indexes are used.
- Pick one partial or covering index to implement in a safe environment.
- Take the quick test below to confirm understanding.
Mini challenge
Your analytics API runs this query:
SELECT order_id
FROM line_items
WHERE product_id = $1 AND created_at BETWEEN $2 AND $3
ORDER BY created_at ASC
LIMIT 200;
- Propose an index that avoids a sort and minimizes scanned rows.
- How would your answer change if most products appear in >50% of rows?
See a possible approach
CREATE INDEX idx_line_items_product_time
ON line_items(product_id, created_at ASC);
-- If product_id is not selective, consider additional filters or BRIN on created_at
-- (for very large, time-ordered data), plus caching at the application layer.