Why this matters
As a Data Visualization Engineer, you turn raw tables into visuals people trust. Your SQL must deliver clean, visual-ready datasets that:
- Match the visual grain (e.g., one row per month for a time series)
- Aggregate correctly (SUM/COUNT/AVG with proper GROUP BY)
- Filter and parametrize safely (date range, segments, top-N)
- Perform well (pre-aggregation, minimal columns, appropriate joins)
Real tasks you’ll do
- Build a monthly revenue time series with optional product/region filters
- Create a bar chart of top categories with an “Others” group
- Prepare a funnel dataset (visited → added_to_cart → purchased)
- Produce KPI tiles with current value, target, and delta vs last period
Concept explained simply
Visuals consume tables. Your job is to produce a table where each row equals one visual mark. If the visual is a monthly line chart, each row should represent one month. If it’s a bar chart by category, each row is one category.
SQL building blocks:
- SELECT: choose only the columns the visual needs
- FROM/JOIN: bring together necessary tables
- WHERE: filter to the relevant subset (use parameters for interactivity)
- GROUP BY: set row grain (what each row represents)
- ORDER BY/LIMIT: control visual order and top-N logic
Mental model: “One row per mark”
Pick the mark grain first: time bucket, category, segment, or a single KPI. Then ensure the query emits exactly one row per mark. Everything else (joins, filters, aggregations) must serve that grain.
Query patterns for common visuals
- Time series: DATE_TRUNC on a timestamp, GROUP BY the trunc, aggregate measures
- Bar chart (top-N): aggregate by category, rank with window functions, limit or bucket into Others
- KPIs: aggregate to a single row; optionally union with prior period for delta
- Funnel: compute counts at each step on the same population; left joins to preserve earlier steps
- Tables: minimal transformations, clear column names, stable sorting
Visual-friendly dataset checklist
- Row grain matches the visual
- Column names are readable and final (no cryptic aliases)
- Only needed columns are returned
- Filters are parameterized and safe
- Aggregations are correct and non-duplicative
- Performance: pre-aggregate where possible
Worked examples
Assume a simple commerce schema:
- orders(order_id, order_date, customer_id, status, total_amount)
- order_items(order_id, product_id, quantity, price)
- products(product_id, category)
- refunds(refund_id, order_id, refund_amount, refund_date)
Example 1: Monthly net revenue time series
SELECT
DATE_TRUNC('month', o.order_date) AS month,
SUM(o.total_amount) - COALESCE(SUM(r.refund_amount), 0) AS net_revenue
FROM orders o
LEFT JOIN refunds r
ON r.order_id = o.order_id
WHERE o.status = 'completed'
AND o.order_date >= :start_date
AND o.order_date < :end_date
GROUP BY 1
ORDER BY 1;
Notes: DATE_TRUNC sets the grain; LEFT JOIN keeps months with zero refunds; parameters bound in your BI tool.
Example 2: Top 5 categories with Others
WITH cat_rev AS (
SELECT p.category,
SUM(oi.quantity * oi.price) AS revenue
FROM order_items oi
JOIN orders o ON o.order_id = oi.order_id AND o.status = 'completed'
JOIN products p ON p.product_id = oi.product_id
WHERE o.order_date >= CURRENT_DATE - INTERVAL '90 days'
GROUP BY p.category
), ranked AS (
SELECT category, revenue,
DENSE_RANK() OVER (ORDER BY revenue DESC) AS rnk
FROM cat_rev
), split AS (
SELECT category, revenue FROM ranked WHERE rnk <= 5
UNION ALL
SELECT 'Others' AS category,
(SELECT SUM(revenue) FROM ranked WHERE rnk > 5) AS revenue
)
SELECT category, revenue
FROM split
WHERE revenue IS NOT NULL
ORDER BY revenue DESC;
Notes: Window function ranks; a UNION builds the Others bucket.
Example 3: Funnel counts
Assume events(user_id, event_name, event_time). Steps: visited_site, add_to_cart, purchase.
WITH base AS ( SELECT DISTINCT user_id FROM events WHERE event_time BETWEEN :start_date AND :end_date ), s1 AS ( SELECT DISTINCT user_id FROM events WHERE event_name = 'visited_site' AND event_time BETWEEN :start_date AND :end_date ), s2 AS ( SELECT DISTINCT user_id FROM events WHERE event_name = 'add_to_cart' AND event_time BETWEEN :start_date AND :end_date ), s3 AS ( SELECT DISTINCT user_id FROM events WHERE event_name = 'purchase' AND event_time BETWEEN :start_date AND :end_date ) SELECT 'visited_site' AS step, COUNT(b.user_id) AS users FROM base b LEFT JOIN s1 ON b.user_id = s1.user_id WHERE s1.user_id IS NOT NULL UNION ALL SELECT 'add_to_cart', COUNT(b.user_id) FROM base b LEFT JOIN s2 ON b.user_id = s2.user_id WHERE s2.user_id IS NOT NULL UNION ALL SELECT 'purchase', COUNT(b.user_id) FROM base b LEFT JOIN s3 ON b.user_id = s3.user_id;
Notes: The funnel is on the same population window; steps are counted distinctly.
Design a dataset query in 6 steps
- Define the visual grain (time bucket, category, segment, or single KPI).
- List required fields: dimensions, measures, filters, sort order.
- Sketch joins and filters on paper (avoid accidental fan-out).
- Write SELECT with clear aliases and minimal columns.
- Group and aggregate to the grain; test with small date ranges.
- Optimize: predicates on indexed columns, pre-aggregation, limit rows.
Performance tips that matter
- Pre-aggregate for dashboards; avoid aggregating millions of rows per view
- Filter early by date/status; avoid functions wrapping indexed columns
- Return only needed columns; avoid SELECT *
- Use appropriate time buckets; don’t over-granularize
- Consider materialized views for heavy reusable datasets
Quality, filters, and parameters
- Use explicit WHERE for data window; set defaults for :start_date/:end_date
- Handle NULLs with COALESCE where appropriate
- Name columns user-friendly: month, category, revenue
- Validate against known totals before publishing
Common mistakes and how to self-check
- Fan-out joins causing inflated sums
- Self-check: sum at detail level, then aggregate; compare to expected totals
- Mismatched grain for the visual
- Self-check: one row per mark? If not, adjust GROUP BY
- Forgetting filters (status, canceled orders)
- Self-check: explicit WHERE status filters, reviewed with domain owners
- Top-N without deterministic ordering
- Self-check: add tie-breakers and document them
- Returning too many columns
- Self-check: remove unused columns; keep the dataset lean
Exercises
Do these to practice. The Quick Test at the bottom is available to everyone; logged-in learners will have progress saved automatically.
Exercise 1: Monthly Net Revenue Time Series
Goal: Build a monthly net revenue dataset for a line chart.
- Tables: orders, refunds (as described above)
- Include only completed orders
- Parameters: :start_date, :end_date (end exclusive)
- Columns: month, net_revenue
Expected output (example):
month | net_revenue 2024-01-01 | 120000.00 2024-02-01 | 135500.00 ...
Hints
- Use DATE_TRUNC('month', order_date)
- LEFT JOIN refunds on order_id; aggregate after the join
- Filter orders by date and status before grouping
Show solution
SELECT
DATE_TRUNC('month', o.order_date) AS month,
SUM(o.total_amount) - COALESCE(SUM(r.refund_amount), 0) AS net_revenue
FROM orders o
LEFT JOIN refunds r ON r.order_id = o.order_id
WHERE o.status = 'completed'
AND o.order_date >= :start_date
AND o.order_date < :end_date
GROUP BY 1
ORDER BY 1;
Exercise 2: Top 5 Categories with Others
Goal: Produce a bar chart dataset of revenue by category for the last 90 days.
- Tables: orders, order_items, products
- Include completed orders only
- Columns: category, revenue (with an Others row if needed)
Expected output (example):
category | revenue Electronics | 250000.00 Home | 210000.00 Apparel | 175000.00 Beauty | 120000.00 Toys | 90000.00 Others | 60000.00
Hints
- SUM(quantity * price) for revenue
- Use DENSE_RANK over revenue desc to find top 5
- UNION ALL an Others row for the rest
Show solution
WITH cat_rev AS (
SELECT p.category,
SUM(oi.quantity * oi.price) AS revenue
FROM order_items oi
JOIN orders o ON o.order_id = oi.order_id AND o.status = 'completed'
JOIN products p ON p.product_id = oi.product_id
WHERE o.order_date >= CURRENT_DATE - INTERVAL '90 days'
GROUP BY p.category
), ranked AS (
SELECT category, revenue,
DENSE_RANK() OVER (ORDER BY revenue DESC) AS rnk
FROM cat_rev
), split AS (
SELECT category, revenue FROM ranked WHERE rnk <= 5
UNION ALL
SELECT 'Others' AS category,
(SELECT SUM(revenue) FROM ranked WHERE rnk > 5)
)
SELECT category, revenue
FROM split
WHERE revenue IS NOT NULL
ORDER BY revenue DESC;
- Checklist before running:
- Does the query emit exactly the required columns?
- Is the grain correct for the target visual?
- Are filters parameterized and safe?
- Can you explain each join and why it won’t duplicate rows?
Mini challenge
Create a KPI dataset that returns three rows: metric, current_value, prior_value, delta for total revenue over :start_date to :end_date vs the previous equal-length period. Keep it fast and single-query if possible.
Practical projects
- Build a daily active users (DAU) time series with weekday/weekly rollups
- Design a sales dashboard dataset pack: KPIs, time series, top-N by product and region
- Create a retention cohort dataset (monthly signup cohort vs active in month N)
Who this is for
- Aspiring and junior Data Visualization Engineers
- Data analysts building dashboard-ready datasets
- Engineers connecting SQL outputs to BI tools
Prerequisites
- Basic SQL: SELECT, WHERE, JOIN, GROUP BY, ORDER BY
- Comfort reading simple schemas and understanding visual requirements
Learning path
- Start: aggregate functions and grouping
- Next: window functions and top-N patterns
- Then: parameterization and performance basics
- Finally: robust datasets for multi-visual dashboards
Next steps
- Complete the exercises above
- Take the Quick Test below to check understanding
- Apply the patterns to one real dataset in your environment