Why SQL matters for Data Analysts
SQL is the primary way data analysts access, clean, and aggregate data stored in relational databases and data warehouses. Mastering SQL unlocks faster insights, reliable dashboards, ad-hoc analyses, and collaboration with engineers. You will query tables, join datasets, build metrics, audit data quality, and deliver trustworthy answers.
Who this is for
- Aspiring data analysts needing a practical, project-ready SQL foundation.
- Analysts who can write simple queries and want to advance to window functions, optimization, and modeling.
- Professionals in adjacent roles (product, marketing, operations) who need reliable data-pulling and aggregation skills.
Prerequisites
- Comfort with basic math and spreadsheets.
- Familiarity with rows, columns, and CSVs.
- Optional but helpful: a local or cloud SQL environment (any vendor is fine; syntax differences are small for core topics).
Learning path
Quick study plan (2–4 weeks)
- Days 1–3: SELECT, WHERE, ORDER BY, NULLs, strings, dates.
- Days 4–6: CASE, GROUP BY, HAVING, joins.
- Days 7–9: Subqueries, CTEs, set operations; mini drills daily.
- Days 10–12: Window functions; build a summary dashboard query.
- Days 13–14: Indexes, EXPLAIN, import/export; finish a mini-project.
Worked examples
Assume an e-commerce schema:
customers(id, name, country)
orders(id, customer_id, order_date, status, total_amount)
order_items(order_id, product_id, quantity, unit_price)
products(id, name, category, price)
1) Basic filtering and sorting: completed orders from the last 7 days
SELECT id, customer_id, order_date, total_amount
FROM orders
WHERE status = 'completed'
AND order_date >= CURRENT_DATE - INTERVAL '7 days'
ORDER BY order_date DESC;
Why it matters: fast ad-hoc pulls for recent performance.
2) Aggregate revenue by country with a threshold
SELECT c.country, SUM(o.total_amount) AS revenue
FROM orders o
JOIN customers c ON c.id = o.customer_id
WHERE o.status = 'completed'
GROUP BY c.country
HAVING SUM(o.total_amount) >= 50000
ORDER BY revenue DESC;
Why it matters: focus on meaningful segments; HAVING filters groups.
3) Monthly cohorts with CASE and date functions
SELECT DATE_TRUNC('month', o.order_date) AS month,
CASE WHEN o.total_amount >= 100 THEN 'AOV 100+' ELSE 'AOV < 100' END AS aov_bucket,
COUNT(*) AS orders,
SUM(o.total_amount) AS revenue
FROM orders o
WHERE o.status = 'completed'
GROUP BY 1, 2
ORDER BY 1, 2;
Why it matters: quick cohort and KPI breakdowns.
4) Top 3 products per category using window functions
WITH sales AS (
SELECT oi.product_id,
p.category,
SUM(oi.quantity * oi.unit_price) AS revenue
FROM order_items oi
JOIN products p ON p.id = oi.product_id
GROUP BY 1, 2
)
SELECT category, product_id, revenue
FROM (
SELECT category, product_id, revenue,
ROW_NUMBER() OVER (PARTITION BY category ORDER BY revenue DESC) AS rn
FROM sales
) t
WHERE rn <= 3
ORDER BY category, revenue DESC;
Why it matters: rank within groups without writing a subquery for every category.
5) Detect potential duplicates with window functions
SELECT id, name, country,
COUNT(*) OVER (PARTITION BY LOWER(TRIM(name)), country) AS dup_count
FROM customers
ORDER BY dup_count DESC, name;
Why it matters: data-quality checks before reporting.
Hands-on drills
- Return the 20 most recent completed orders with customer names.
- Create a CASE expression that buckets order totals into Low (<50), Medium (50–199.99), High (200+).
- Find the top 5 countries by total revenue this quarter.
- List products never ordered (anti-join via NOT EXISTS, or LEFT JOIN ... WHERE the joined key IS NULL).
- Use a CTE to compute daily revenue, then a second query to compute a 7-day moving average with window functions.
- Compare UNION vs UNION ALL on two small derived tables; explain the difference in row counts.
- Run EXPLAIN (or your dialect’s equivalent) on a slow join; add an index and compare plans.
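For the never-ordered-products drill, two equivalent shapes are worth knowing (a sketch against the schema above; NOT EXISTS often optimizes at least as well as the LEFT JOIN form):

```sql
-- Anti-join with LEFT JOIN: unmatched products have a NULL order_id
SELECT p.id, p.name
FROM products p
LEFT JOIN order_items oi ON oi.product_id = p.id
WHERE oi.order_id IS NULL;

-- Equivalent with NOT EXISTS, often clearer about intent
SELECT p.id, p.name
FROM products p
WHERE NOT EXISTS (
  SELECT 1 FROM order_items oi WHERE oi.product_id = p.id
);
```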
Common mistakes
- Filtering aggregated results with WHERE instead of HAVING.
- Forgetting that COUNT(column) ignores NULLs while COUNT(*) counts rows.
- Using INNER JOIN when you need to keep unmatched rows (use LEFT JOIN).
- Grouping by a derived expression from SELECT without repeating it in GROUP BY (or without using positional references, if your dialect supports them).
- Assuming ORDER BY in subqueries is preserved; only the outermost ORDER BY guarantees ordering.
- Confusing UNION (deduplicates) with UNION ALL (keeps duplicates).
- Expecting indexes to help on expressions not covered by the index; consider functional or composite indexes where supported.
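Two of these mistakes are easy to see on the orders table (a sketch using the schema above):

```sql
-- COUNT(*) counts rows; COUNT(total_amount) skips NULL totals,
-- so the two can disagree when data is incomplete.
SELECT COUNT(*) AS all_rows,
       COUNT(total_amount) AS rows_with_amount
FROM orders;

-- WHERE filters rows before grouping; HAVING filters the groups.
-- Putting SUM(...) in WHERE is a syntax error in most dialects.
SELECT customer_id, SUM(total_amount) AS revenue
FROM orders
WHERE status = 'completed'          -- row-level filter
GROUP BY customer_id
HAVING SUM(total_amount) >= 1000;   -- group-level filter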
Debugging tips
- Build queries incrementally: SELECT a few columns → add WHERE → add JOINs → add GROUP BY → add HAVING → add window functions.
- Validate join cardinalities by counting rows before and after joins.
- Check NULL behavior explicitly with COALESCE or IS NULL to avoid surprises.
- Use EXPLAIN/EXPLAIN ANALYZE to see scans vs. index usage; watch for huge row estimates.
- Simplify complex logic into CTEs; name each step by its intent.
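A minimal illustration of the plan-reading tip (PostgreSQL syntax; other systems expose similar tools, e.g. EXPLAIN in MySQL or SHOWPLAN in SQL Server):

```sql
-- EXPLAIN shows the plan without running the query;
-- EXPLAIN ANALYZE runs it and reports actual row counts and timings.
EXPLAIN ANALYZE
SELECT c.country, SUM(o.total_amount) AS revenue
FROM orders o
JOIN customers c ON c.id = o.customer_id
GROUP BY c.country;

-- If the plan shows a sequential scan for the customer_id lookup,
-- an index may help; re-run EXPLAIN ANALYZE to compare plans.
CREATE INDEX idx_orders_customer_id ON orders (customer_id);
```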
Mini project: Product performance dashboard query
Goal: Produce a single query that powers a simple dashboard with KPIs by category and month.
Requirements
- Inputs: orders, order_items, products.
- Outputs per category and month: orders, units, revenue, average order value, top product (by revenue), 7-day moving average of revenue (computed at daily granularity then aggregated).
- Use: CTEs, joins, CASE, GROUP BY, window functions.
Suggested approach
- CTE daily_sales: sum revenue per day and category.
- CTE moving_avg: compute 7-day moving average per category.
- Main select: aggregate by month and category; join the top product per category via a window function ranking.
Acceptance checklist
- Uses at least two CTEs with clear names.
- Includes one window function and one CASE expression.
- Stays reproducible when the underlying tables refresh (no hard-coded dates or IDs).
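One possible skeleton for the first two CTEs (a sketch, not the only valid shape; PostgreSQL date functions assumed — the remaining KPIs and the top-product ranking join in via further CTEs):

```sql
WITH daily_sales AS (
  -- Revenue per category per day
  SELECT p.category,
         o.order_date::date AS day,
         SUM(oi.quantity * oi.unit_price) AS revenue
  FROM orders o
  JOIN order_items oi ON oi.order_id = o.id
  JOIN products p ON p.id = oi.product_id
  WHERE o.status = 'completed'
  GROUP BY 1, 2
),
moving_avg AS (
  -- 7-day moving average at daily granularity, per category
  SELECT category, day, revenue,
         AVG(revenue) OVER (
           PARTITION BY category
           ORDER BY day
           ROWS BETWEEN 6 PRECEDING AND CURRENT ROW
         ) AS revenue_7d_avg
  FROM daily_sales
)
SELECT category,
       DATE_TRUNC('month', day) AS month,
       SUM(revenue) AS revenue,
       AVG(revenue_7d_avg) AS avg_7d_revenue
FROM moving_avg
GROUP BY 1, 2
ORDER BY 1, 2;
```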
Subskills
- Data Types and Casting — Choose correct column types and convert safely to avoid truncation and precision issues.
- Filtering and Sorting (WHERE, ORDER BY) — Retrieve just the rows you need in the sequence you expect.
- Null Handling (IS NULL, COALESCE, NULLIF) — Control missing values to keep metrics correct.
- String Functions and Pattern Matching (LIKE, regex) — Clean, parse, and search text data.
- Date and Time Functions — Work with truncation, extraction, and intervals for time-based analytics.
- Conditional Logic (CASE WHEN THEN ELSE) — Create labeled segments and KPI buckets.
- Aggregations (GROUP BY, HAVING) — Summarize data into meaningful metrics.
- DML (INSERT, UPDATE, DELETE, MERGE) — Modify data carefully when needed.
- Joins (INNER/LEFT/RIGHT/FULL/CROSS) — Combine tables correctly without duplications or losses.
- Constraints and Keys (PK, FK, UNIQUE, CHECK) — Keep datasets consistent and trustworthy.
- DDL (CREATE, ALTER, DROP) — Define and evolve schemas safely.
- Normalization and Relational Modeling — Structure data for clarity and efficiency.
- Set Operations (UNION, INTERSECT, EXCEPT) — Combine result sets precisely.
- Subqueries (correlated/uncorrelated) — Express complex filters and comparisons.
- CTEs (WITH) — Organize complex queries into readable steps.
- Views and Materialized Views — Package logic for reuse and speed up reads.
- Window Functions (OVER, PARTITION BY, ORDER BY) — Compute rankings, moving averages, and shares.
- Indexes (B-tree, hash, composite, covering) — Accelerate lookups and joins.
- Transactions and Isolation Levels (ACID, locks) — Keep writes safe and consistent.
- Security and Permissions (GRANT, REVOKE, roles) — Enforce least-privilege access.
- JSON and Semi-Structured Data — Query nested attributes when schemas vary.
- Full Text Search Basics — Search human language fields effectively.
- Query Optimization (EXPLAIN/ANALYZE) — Diagnose and fix slow queries.
- Stored Procedures and Functions — Encapsulate logic close to the data.
- Triggers — Automate actions on data changes; use sparingly.
- Data Import and Export (COPY/LOAD/UNLOAD) — Move data reliably in and out.
- Backup, Restore, and Migrations — Protect and evolve production data.
- Database Monitoring and Maintenance (VACUUM/ANALYZE, stats) — Keep performance healthy.
- Partitioning Basics — Scale large tables and improve pruning.
- Sharding Basics — Distribute data across nodes; trade-offs apply.
- Vendor Dialects (PostgreSQL, MySQL, SQL Server) — Recognize syntax differences and features.
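As a taste of dialect differences, here is "truncate a date to the start of its month" in three systems (verify against your version's documentation):

```sql
-- PostgreSQL: DATE_TRUNC returns a timestamp at the month boundary
SELECT DATE_TRUNC('month', order_date) FROM orders;

-- MySQL: common idiom via formatting (returns a string)
SELECT DATE_FORMAT(order_date, '%Y-%m-01') FROM orders;

-- SQL Server (2012+): rebuild the date from its parts
SELECT DATEFROMPARTS(YEAR(order_date), MONTH(order_date), 1) FROM orders;
```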
Next steps
- Rebuild one of your regular reports using CTEs and window functions to simplify logic.
- Create a personal SQL notebook of patterns: date bucketing, top-N per group, dedup strategies.
- Move on to complementary analyst skills: spreadsheets for quick checks, Python for advanced analysis, and visualization tools for storytelling.