Why this matters
BI analysts constantly clean data, join tables, and produce metrics that drive decisions. Real data is messy: missing values (NULLs) and duplicated rows are common. If you ignore them, you will miscount users, misstate revenue, and break dashboards. This lesson gives you the practical SQL patterns to handle NULLs and duplicates reliably. By the end, you will be able to:
- Calculate KPIs that ignore or replace missing values safely.
- Join tables without accidentally dropping unmatched rows.
- Deduplicate records with clear, auditable rules.
Concept explained simply
Think of your dataset as a spreadsheet.
- NULL means the cell is empty/unknown. It is not zero and not an empty string.
- Duplicates are multiple rows that represent the same real-world entity (same user, order, SKU) according to a business key.
Mental model
- Before querying: define your unit (row meaning) and your key (how to identify uniqueness).
- When aggregating: decide how to treat unknowns — drop, replace, or highlight.
- When merging: prefer rules that are deterministic (keep latest, highest quality, or trusted source).
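The three treatments of unknowns (drop, replace, highlight) can be sketched in one query; the table and column names here are hypothetical:

```sql
-- Hypothetical survey_responses(score) table where score is nullable
SELECT AVG(score)               AS avg_drop,       -- drop: aggregates skip NULLs
       AVG(COALESCE(score, 0)) AS avg_replace,     -- replace: treat unknown as 0 (changes the metric!)
       COUNT(*) - COUNT(score) AS missing_scores   -- highlight: report how many values are unknown
FROM survey_responses;
```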
NULLs: quick rules
- COUNT(*) counts rows; COUNT(col) ignores NULLs in col.
- WHERE col = NULL never matches. Use IS NULL / IS NOT NULL.
- COALESCE(col, default) replaces NULL with a default value.
- NULLIF(a, b) returns NULL if a = b, otherwise a.
- AVG, SUM, MIN, MAX ignore NULLs in their argument.
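A quick sketch of these rules on an inline three-row table (PostgreSQL-style VALUES list; adjust the syntax for your dialect):

```sql
-- Three payments, one with an unknown amount
SELECT COUNT(*)                 AS total_rows,     -- 3: counts every row
       COUNT(amount)            AS known_amounts,  -- 2: skips the NULL
       COALESCE(SUM(amount), 0) AS total_amount,   -- 30: SUM ignores the NULL
       AVG(amount)              AS avg_amount      -- 15: average of 10 and 20, not of 3 rows
FROM (VALUES (10), (20), (NULL)) AS payments(amount);
```

Note the last line: because AVG ignores NULLs, the denominator is 2, not 3. If a missing amount should count as zero, use AVG(COALESCE(amount, 0)) instead.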
Duplicates: quick rules
- Find duplicates: GROUP BY key HAVING COUNT(*) > 1.
- Remove duplicates deterministically: use ROW_NUMBER() OVER (PARTITION BY key ORDER BY preference) and keep rn = 1.
- UNION removes duplicate rows from the combined result; UNION ALL keeps every row (and is usually faster, since no dedup step is needed).
- DISTINCT removes duplicate rows in the projection but can hide data quality problems if overused.
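The first rule, written out as a ready-to-run audit query (assuming an orders table whose natural key is order_id):

```sql
-- List keys that appear more than once, worst offenders first
SELECT order_id,
       COUNT(*) AS copies
FROM orders
GROUP BY order_id
HAVING COUNT(*) > 1
ORDER BY copies DESC;
```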
Practical patterns
Pattern: Safe sums and counts with NULLs
-- Count all rows vs. non-null emails
SELECT COUNT(*) AS total_rows,
       COUNT(email) AS non_null_emails
FROM users;
-- Replace NULL amounts with 0 when summing
SELECT COALESCE(SUM(amount), 0) AS total_amount
FROM payments;
Pattern: Keep customers with zero orders (LEFT JOIN)
SELECT c.customer_id,
       COALESCE(SUM(o.amount), 0) AS revenue
FROM customers c
LEFT JOIN orders o
  ON o.customer_id = c.customer_id
 AND o.status <> 'cancelled' -- put filters that should not drop customers here
GROUP BY c.customer_id;
Pattern: Deduplicate by key, keep latest
WITH ranked AS (
  SELECT p.*,
         ROW_NUMBER() OVER (
           PARTITION BY p.sku
           ORDER BY p.updated_at DESC
         ) AS rn
  FROM products_raw p
)
SELECT *
FROM ranked
WHERE rn = 1;
Pattern: Merge two sources, then deduplicate
WITH unioned AS (
  SELECT user_id, event_time, 'web' AS source FROM web_events
  UNION ALL
  SELECT user_id, event_time, 'app' AS source FROM app_events
), ranked AS (
  SELECT *,
         ROW_NUMBER() OVER (
           PARTITION BY user_id, CAST(event_time AS DATE)
           ORDER BY event_time DESC
         ) AS rn
  FROM unioned
)
SELECT user_id, CAST(event_time AS DATE) AS event_date
FROM ranked
WHERE rn = 1 AND user_id IS NOT NULL;
Worked examples
Example 1: Reliable email coverage rate
Goal: show total users vs. users with email, treating missing emails as unknowns (not empty strings).
SELECT COUNT(*) AS total_users,
       COUNT(email) AS users_with_email,
       ROUND(100.0 * COUNT(email) / NULLIF(COUNT(*), 0), 2) AS email_coverage_pct
FROM dim_user;
Why this works: COUNT(email) ignores NULLs; NULLIF prevents division by zero.
Example 2: Deduplicate customers by email, keep most recent profile
WITH ranked AS (
  SELECT c.*,
         ROW_NUMBER() OVER (
           PARTITION BY LOWER(TRIM(c.email))
           ORDER BY c.updated_at DESC, c.customer_id DESC
         ) AS rn
  FROM staging_customers c
  WHERE c.email IS NOT NULL AND TRIM(c.email) <> ''
)
SELECT customer_id, email, updated_at
FROM ranked
WHERE rn = 1;
Why this works: a normalized key (LOWER/TRIM) defines duplicates; ROW_NUMBER picks the latest record deterministically.
Example 3: Orders per customer including zero
Goal: include customers without orders and treat missing amounts as zero revenue.
SELECT cu.customer_id,
       COUNT(o.order_id) AS orders_cnt,
       COALESCE(SUM(o.amount), 0) AS revenue
FROM dim_customer cu
LEFT JOIN fct_orders o
  ON o.customer_id = cu.customer_id
 AND o.status = 'paid'
GROUP BY cu.customer_id;
Placing the status filter in the ON clause keeps customers with zero paid orders.
Exercises you can try now
These exercises mirror the interactive tasks below. Do them locally or in a sandbox, then check your answers.
- Exercise 1: Count unique customers while ignoring NULLs.
- Exercise 2: Deduplicate products by SKU, keeping the latest row.
- Exercise 3: Merge leads from two sources, deduplicate by email, and keep the latest.
Checklist before you run your query
- Is your deduplication key clearly defined?
- Do aggregates handle NULLs as intended (COALESCE where needed)?
- Are LEFT JOIN filters placed in the ON clause to preserve unmatched rows?
- Is your rule for keeping one row deterministic (ORDER BY tie-breakers)?
Common mistakes and self-checks
- Mistake: Using = NULL or <> NULL.
  Fix: Use IS NULL / IS NOT NULL.
- Mistake: Using COUNT(col) to count all rows.
  Fix: COUNT(*) counts rows; COUNT(col) skips NULLs.
- Mistake: Turning a LEFT JOIN into an INNER JOIN by adding WHERE o.col ...
  Fix: Put such filters in the ON clause when you must keep unmatched left rows.
- Mistake: Slapping DISTINCT at the end.
  Fix: Define a proper deduplication rule with ROW_NUMBER and explain it.
- Mistake: Using UNION instead of UNION ALL and unexpectedly losing rows.
  Fix: Use UNION ALL unless you truly need cross-set deduplication.
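The LEFT JOIN pitfall is worth seeing side by side, sketched against the customers and orders tables from the mini challenge below:

```sql
-- Broken: the WHERE clause silently turns the LEFT JOIN into an INNER JOIN,
-- because customers with no orders get o.status = NULL, and NULL = 'paid'
-- is not TRUE, so those rows are filtered out
SELECT c.customer_id, COUNT(o.order_id) AS paid_orders
FROM customers c
LEFT JOIN orders o ON o.customer_id = c.customer_id
WHERE o.status = 'paid'
GROUP BY c.customer_id;

-- Fixed: filter in the ON clause, so unmatched customers survive with 0 orders
SELECT c.customer_id, COUNT(o.order_id) AS paid_orders
FROM customers c
LEFT JOIN orders o ON o.customer_id = c.customer_id
                  AND o.status = 'paid'
GROUP BY c.customer_id;
```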
Self-check prompts
- Can you explain why each dropped row is dropped?
- Can someone else reproduce your dedup result using your rule?
- If you run the query tomorrow, will you get the same row kept (tie-breakers set)?
Mini challenge
Tables:
customers(customer_id, email, signup_date)
orders(order_id, customer_id, amount, status, created_at)
Task: Build a customer revenue summary with one row per customer:
- Include all customers, even with zero orders.
- Only include orders with status = 'paid'.
- Return: customer_id, email, paid_orders_cnt, paid_revenue, first_order_date (NULL if none).
Hint: LEFT JOIN, COALESCE for revenue, MIN(created_at) for first order.
Who this is for
- BI Analysts and aspiring analysts working with dashboards, KPIs, and reporting.
- Data-savvy PMs or marketers validating metrics from raw tables.
Prerequisites
- Basic SQL: SELECT, WHERE, GROUP BY, JOIN.
- Familiarity with primary keys and business keys.
Learning path
- NULL fundamentals: counts, arithmetic, COALESCE, NULLIF.
- Duplicate detection: GROUP BY ... HAVING COUNT(*) > 1.
- Deterministic dedup: ROW_NUMBER with a clear ORDER BY.
- Safe joining and filtering strategies.
- Set ops: UNION vs UNION ALL.
Practical projects
- Contact list cleanup: deduplicate by normalized email, keep latest profile.
- Order ledger audit: identify duplicate orders by natural key and quantify impact on revenue.
- Daily active users: merge multiple event sources and deduplicate per user-day.
Next steps
- Apply these patterns to your most-used tables and document the dedup rules.
- Add data quality checks (e.g., duplicates count) to your dashboard refresh process.
Ready for the quick test?
Take the quick test below to check your understanding.