Why this matters
BI analysts constantly clean data, join tables, and produce metrics that drive decisions. Real data is messy: missing values (NULLs) and duplicated rows are common. If you ignore them, you will miscount users, misstate revenue, and break dashboards. This lesson gives you the practical SQL patterns to handle NULLs and duplicates reliably. By the end, you will be able to:
- Calculate KPIs that ignore or replace missing values safely.
- Join tables without accidentally dropping unmatched rows.
- Deduplicate records with clear, auditable rules.
Concept explained simply
Think of your dataset as a spreadsheet.
- NULL means the cell is empty/unknown. It is not zero and not an empty string.
- Duplicates are multiple rows that represent the same real-world entity (same user, order, SKU) according to a business key.
Mental model
- Before querying: define your unit (row meaning) and your key (how to identify uniqueness).
- When aggregating: decide how to treat unknowns — drop, replace, or highlight.
- When merging: prefer rules that are deterministic (keep latest, highest quality, or trusted source).
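The three treatments of unknowns (drop, replace, highlight) can be sketched in one query; the table and column names here are hypothetical:

```sql
-- Hypothetical survey_responses(score) table where score is nullable
SELECT AVG(score)               AS avg_drop,       -- drop: aggregates skip NULLs
       AVG(COALESCE(score, 0)) AS avg_replace,     -- replace: treat unknown as 0 (changes the metric!)
       COUNT(*) - COUNT(score) AS missing_scores   -- highlight: report how many values are unknown
FROM survey_responses;
```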
NULLs: quick rules
- COUNT(*) counts rows; COUNT(col) ignores NULLs in col.
- WHERE col = NULL never matches. Use IS NULL / IS NOT NULL.
- COALESCE(col, default) replaces NULL with a default value.
- NULLIF(a, b) returns NULL if a = b, otherwise a.
- AVG, SUM, MIN, MAX ignore NULLs in their argument.
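A quick sketch of these rules on an inline three-row table (PostgreSQL-style VALUES list; adjust the syntax for your dialect):

```sql
-- Three payments, one with an unknown amount
SELECT COUNT(*)                 AS total_rows,     -- 3: counts every row
       COUNT(amount)            AS known_amounts,  -- 2: skips the NULL
       COALESCE(SUM(amount), 0) AS total_amount,   -- 30: SUM ignores the NULL
       AVG(amount)              AS avg_amount      -- 15: average of 10 and 20, not of 3 rows
FROM (VALUES (10), (20), (NULL)) AS payments(amount);
```

Note the last line: because AVG ignores NULLs, the denominator is 2, not 3. If a missing amount should count as zero, use AVG(COALESCE(amount, 0)) instead.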
Duplicates: quick rules
- Find duplicates: GROUP BY key HAVING COUNT(*) > 1.
- Remove duplicates deterministically: use ROW_NUMBER() OVER (PARTITION BY key ORDER BY preference) and keep rn = 1.
- UNION removes duplicate rows from the combined result; UNION ALL keeps every row (and is usually faster, since no dedup step is needed).
- DISTINCT removes duplicate rows in the projection but can hide data quality problems if overused.
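The first rule, written out as a ready-to-run audit query (assuming an orders table whose natural key is order_id):

```sql
-- List keys that appear more than once, worst offenders first
SELECT order_id,
       COUNT(*) AS copies
FROM orders
GROUP BY order_id
HAVING COUNT(*) > 1
ORDER BY copies DESC;
```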
Practical patterns
Pattern: Safe sums and counts with NULLs
-- Count all rows vs. non-null emails
SELECT COUNT(*) AS total_rows,
       COUNT(email) AS non_null_emails
FROM users;
-- Replace NULL amounts with 0 when summing
SELECT COALESCE(SUM(amount), 0) AS total_amount
FROM payments;
Pattern: Keep customers with zero orders (LEFT JOIN)
SELECT c.customer_id,
       COALESCE(SUM(o.amount), 0) AS revenue
FROM customers c
LEFT JOIN orders o
  ON o.customer_id = c.customer_id
 AND o.status <> 'cancelled' -- put filters that should not drop customers here
GROUP BY c.customer_id;
Pattern: Deduplicate by key, keep latest
WITH ranked AS (
  SELECT p.*,
         ROW_NUMBER() OVER (
           PARTITION BY p.sku
           ORDER BY p.updated_at DESC
         ) AS rn
  FROM products_raw p
)
SELECT *
FROM ranked
WHERE rn = 1;
Pattern: Merge two sources, then deduplicate
WITH unioned AS (
  SELECT user_id, event_time, 'web' AS source FROM web_events
  UNION ALL
  SELECT user_id, event_time, 'app' AS source FROM app_events
), ranked AS (
  SELECT *,
         ROW_NUMBER() OVER (
           PARTITION BY user_id, CAST(event_time AS DATE)
           ORDER BY event_time DESC
         ) AS rn
  FROM unioned
)
SELECT user_id, CAST(event_time AS DATE) AS event_date
FROM ranked
WHERE rn = 1 AND user_id IS NOT NULL;
Worked examples
Example 1: Reliable email coverage rate
Goal: show total users vs. users with email, treating missing emails as unknowns (not empty strings).
SELECT COUNT(*) AS total_users,
       COUNT(email) AS users_with_email,
       ROUND(100.0 * COUNT(email) / NULLIF(COUNT(*), 0), 2) AS email_coverage_pct
FROM dim_user;
Why this works: COUNT(email) ignores NULLs; NULLIF prevents division by zero.
Example 2: Deduplicate customers by email, keep most recent profile
WITH ranked AS (
  SELECT c.*,
         ROW_NUMBER() OVER (
           PARTITION BY LOWER(TRIM(c.email))
           ORDER BY c.updated_at DESC, c.customer_id DESC
         ) AS rn
  FROM staging_customers c
  WHERE c.email IS NOT NULL AND TRIM(c.email) <> ''
)
SELECT customer_id, email, updated_at
FROM ranked
WHERE rn = 1;
Why this works: a normalized key (LOWER/TRIM) defines duplicates; ROW_NUMBER picks the latest record deterministically.
Example 3: Orders per customer including zero
Goal: include customers without orders and treat missing amounts as zero revenue.
SELECT cu.customer_id,
       COUNT(o.order_id) AS orders_cnt,
       COALESCE(SUM(o.amount), 0) AS revenue
FROM dim_customer cu
LEFT JOIN fct_orders o
  ON o.customer_id = cu.customer_id
 AND o.status = 'paid'
GROUP BY cu.customer_id;
Placing the status filter in the ON clause keeps customers with zero paid orders.
Exercises you can try now
These exercises mirror the interactive tasks below. Do them locally or in a sandbox, then check your answers.
- Exercise 1: Count unique customers while ignoring NULLs.
- Exercise 2: Deduplicate products by SKU, keeping the latest row.
- Exercise 3: Merge leads from two sources, deduplicate by email, and keep the latest.
Checklist before you run your query
- Is your deduplication key clearly defined?
- Do aggregates handle NULLs as intended (COALESCE where needed)?
- Are LEFT JOIN filters placed in the ON clause to preserve unmatched rows?
- Is your rule for keeping one row deterministic (ORDER BY tie-breakers)?
Common mistakes and self-checks
- Mistake: Using = NULL or <> NULL.
  Fix: Use IS NULL / IS NOT NULL.
- Mistake: Using COUNT(col) to count all rows.
  Fix: COUNT(*) counts rows; COUNT(col) skips NULLs.
- Mistake: Turning a LEFT JOIN into an INNER JOIN by adding WHERE o.col ...
  Fix: Put such filters in the ON clause when you must keep unmatched left rows.
- Mistake: Slapping DISTINCT at the end.
  Fix: Define a proper deduplication rule with ROW_NUMBER and explain it.
- Mistake: Using UNION instead of UNION ALL and unexpectedly losing rows.
  Fix: Use UNION ALL unless you truly need cross-set deduplication.
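The LEFT JOIN pitfall is worth seeing side by side, sketched against the customers and orders tables from the mini challenge below:

```sql
-- Broken: the WHERE clause silently turns the LEFT JOIN into an INNER JOIN,
-- because customers with no orders get o.status = NULL, and NULL = 'paid'
-- is not TRUE, so those rows are filtered out
SELECT c.customer_id, COUNT(o.order_id) AS paid_orders
FROM customers c
LEFT JOIN orders o ON o.customer_id = c.customer_id
WHERE o.status = 'paid'
GROUP BY c.customer_id;

-- Fixed: filter in the ON clause, so unmatched customers survive with 0 orders
SELECT c.customer_id, COUNT(o.order_id) AS paid_orders
FROM customers c
LEFT JOIN orders o ON o.customer_id = c.customer_id
                  AND o.status = 'paid'
GROUP BY c.customer_id;
```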
Self-check prompts
- Can you explain why each dropped row is dropped?
- Can someone else reproduce your dedup result using your rule?
- If you run the query tomorrow, will you get the same row kept (tie-breakers set)?
Mini challenge
Tables:
customers(customer_id, email, signup_date)
orders(order_id, customer_id, amount, status, created_at)
Task: Build a customer revenue summary with one row per customer:
- Include all customers, even with zero orders.
- Only include orders with status = 'paid'.
- Return: customer_id, email, paid_orders_cnt, paid_revenue, first_order_date (NULL if none).
Hint: LEFT JOIN, COALESCE for revenue, MIN(created_at) for first order.
Who this is for
- BI Analysts and aspiring analysts working with dashboards, KPIs, and reporting.
- Data-savvy PMs or marketers validating metrics from raw tables.
Prerequisites
- Basic SQL: SELECT, WHERE, GROUP BY, JOIN.
- Familiarity with primary keys and business keys.
Learning path
- NULL fundamentals: counts, arithmetic, COALESCE, NULLIF.
- Duplicate detection: GROUP BY ... HAVING COUNT(*) > 1.
- Deterministic dedup: ROW_NUMBER with a clear ORDER BY.
- Safe joining and filtering strategies.
- Set ops: UNION vs UNION ALL.
Practical projects
- Contact list cleanup: deduplicate by normalized email, keep latest profile.
- Order ledger audit: identify duplicate orders by natural key and quantify impact on revenue.
- Daily active users: merge multiple event sources and deduplicate per user-day.
Next steps
- Apply these patterns to your most-used tables and document the dedup rules.
- Add data quality checks (e.g., duplicates count) to your dashboard refresh process.
Ready for the quick test?
Take the quick test below to check your understanding.