
Duplicates Unique Nunique

Learn duplicates, unique, and nunique in pandas for free, with explanations, exercises, and a quick test (for Data Analysts).

Published: December 20, 2025 | Updated: December 20, 2025

Who this is for

Data Analysts who clean datasets, merge files, and produce counts like “unique users per day” or “distinct products per region.” If you often need to remove duplicates or compute distinct counts, this subskill is for you.

Prerequisites

  • Basic Python (variables, lists, functions)
  • Intro pandas: creating DataFrames, selecting columns

Why this matters

  • De-duplicate leads before emailing to avoid double-sending
  • Clean transactions to remove repeated rows before revenue totals
  • Count unique users by day, product, or region for KPIs
  • Audit merges to confirm no duplicate join keys slipped in

Concept explained simply

Think of your dataset as a stack of index cards. Some cards are exact repeats, others share the same key (like the same email). You need quick ways to 1) mark or remove repeats, and 2) list what’s distinct.

Mental model

  • duplicated: stamps each card as duplicate or not (True/False).
  • drop_duplicates: throws away duplicates, keeping the one you choose.
  • unique (Series): returns the distinct values from one column.
  • nunique: counts distinct values (per column, per group, etc.).

Core methods at a glance

# Mark duplicate rows
DF.duplicated(subset=None, keep='first')  # subset: columns to compare; keep: 'first'|'last'|False

# Remove duplicate rows
DF.drop_duplicates(subset=None, keep='first', ignore_index=False)

# Unique values from one column (Series only); returns an array in order of appearance
SERIES.unique()

# Count distinct values
DF.nunique(dropna=True)            # per column by default
SERIES.nunique(dropna=True)

# Grouped distinct counts (very common)
DF.groupby('col')['target'].nunique()

Worked examples

Example 1 — Remove duplicate customers by email
import pandas as pd

customers = pd.DataFrame({
    'email': ['a@x.com','b@x.com','b@x.com','c@x.com','d@x.com','d@x.com'],
    'name':  ['Ana',     'Ben',     'Ben',     'Cam',     'Dan',     'Dan'],
    'city':  ['NY',      'LA',      'LA',      'NY',      'SF',      'SF']
})

# Mark duplicates by email (keep first occurrence as original)
customers['is_dup'] = customers.duplicated(subset='email', keep='first')

# Remove duplicates by email, keep the last occurrence
dedup = customers.drop_duplicates(subset='email', keep='last', ignore_index=True)
print(customers)
print(dedup)

Tip: choose keep='first' or keep='last' depending on which row you want to retain (e.g., last may have latest info).
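One more option worth seeing: keep=False flags every copy, including the first. A quick sketch on the same customers frame:

# keep=False marks every occurrence of a repeated email; no row is kept as "the original"
all_dups = customers.duplicated(subset='email', keep=False)
print(all_dups.tolist())  # [False, True, True, False, True, True]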

Example 2 — Get unique categories and distinct counts
products = pd.DataFrame({
    'product': ['A','B','B','A','C','C','D'],
    'region':  ['NA','EU','EU','NA','NA','NA','EU']
})

# Unique product values (array)
print(products['product'].unique())  # ['A' 'B' 'C' 'D']

# Distinct counts per column
print(products.nunique())             # product: 4, region: 2

# Distinct products per region
print(products.groupby('region')['product'].nunique())
# EU: 2, NA: 2

Use .unique() when you need the values themselves; use .nunique() when you only need the count.
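One nuance worth a quick look is missing values: unique() includes NaN/None in its output, while nunique() drops them unless told otherwise. A minimal sketch with a made-up Series:

s = pd.Series(['A', 'B', None, 'B'])
print(s.unique())               # ['A' 'B' None]: missing values appear in the output
print(s.nunique())              # 2: NaN/None excluded by default (dropna=True)
print(s.nunique(dropna=False))  # 3: the missing value counted as its own category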

Example 3 — De-duplicate transactions by a key combination
orders = pd.DataFrame({
    'user_id':  [1,1,1,2,2,3],
    'order_id': [10,10,11,20,20,30],
    'amount':   [50,50,40,25,25,60]
})

# Some rows repeat the (user_id, order_id) pair exactly
mask = orders.duplicated(subset=['user_id','order_id'], keep='first')
print(mask.tolist())  # [False, True, False, False, True, False]: later repeats flagged

clean = orders.drop_duplicates(subset=['user_id','order_id'], keep='first')
print(clean)

De-dup on keys, not on amount. This preserves one row per logical transaction.
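A cheap sanity check after de-duplication (continuing Example 3): assert the key is now unique before computing totals or joining.

# The key should now be unique; fail fast if any duplicates slipped through
assert not clean.duplicated(subset=['user_id','order_id']).any()
print(clean['amount'].sum())  # 175: one row per logical transaction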

Learning path

  • Before this: Selecting/filtering rows, basic DataFrame operations
  • This lesson: duplicated, drop_duplicates, unique, nunique
  • Next: groupby aggregations, value_counts, joins/merge quality checks

Common mistakes and self-check

  • Forgetting subset=... and de-duplicating on all columns by accident
  • Using unique() on a DataFrame (it’s a Series method)
  • Confusing keep values: keep=False marks (or drops) every occurrence of a duplicate, keeping no copy at all
  • Ignoring NaN handling: nunique(dropna=True) excludes NaN by default
  • Not resetting the index: dropping duplicates preserves row order but leaves gaps in the index; use ignore_index=True when you need a clean 0..n-1 index

Quick self-check
  • Can you mark duplicates by a specific key or key-pair?
  • Can you keep only the last occurrence when de-duplicating?
  • Can you compute unique values and distinct counts per group?
  • Do you know when to include or exclude NaN in distinct counts?

Practice exercises

Complete Exercise 1 below. After finishing, compare with the solution and tick the checklist.

  • [Exercise 1] Duplicates and distinct counts on a small orders dataset

Exercise 1 instructions (open in your IDE or notebook)
  1. Create the DataFrame shown in the exercise block below.
  2. Mark duplicate rows across all columns and count them.
  3. Drop exact duplicate rows, keeping the first occurrence.
  4. Drop duplicates by ['order_id','user_email'], keeping the last occurrence.
  5. List unique products and count them.
  6. Compute distinct products per region using groupby + nunique.

Checklist
  • I can mark and drop duplicates by a chosen subset
  • I can get Series unique values and counts
  • I can compute groupwise distinct counts

Practical projects

  • CRM Clean-up: De-duplicate contacts by email; count unique domains; deliver a “clean contacts” CSV (see the sketch after this list)
  • Transaction Audit: Remove duplicated order lines by (user_id, order_id); report unique users per day
  • Catalog Check: List unique categories and the number of distinct products per category
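For the CRM clean-up, the unique-domain count is a one-liner once you split the email. A minimal sketch with a hypothetical contacts frame:

contacts = pd.DataFrame({'email': ['ana@x.com', 'ben@y.com', 'ben@y.com', 'cam@x.com']})
clean_contacts = contacts.drop_duplicates(subset='email')       # de-duplicate by email
print(clean_contacts['email'].str.split('@').str[1].nunique())  # 2: x.com and y.com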

Mini challenge

You are handed a DataFrame with columns ['session_id','user_id','page','ts']. Keep only the latest row per session_id, then report the number of distinct users per page. Hint: sort by ts, use drop_duplicates(subset='session_id', keep='last'), then groupby('page')['user_id'].nunique().
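A minimal sketch of one solution, using made-up data for the columns named above:

import pandas as pd

sessions = pd.DataFrame({
    'session_id': [1, 1, 2, 2, 3],
    'user_id':    [10, 10, 20, 20, 30],
    'page':       ['home', 'pricing', 'home', 'home', 'pricing'],
    'ts':         pd.to_datetime(['2025-01-01 09:00', '2025-01-01 09:05',
                                  '2025-01-01 10:00', '2025-01-01 10:02',
                                  '2025-01-01 11:00'])
})

# Latest row per session: sort by timestamp, keep the last occurrence of each session_id
latest = sessions.sort_values('ts').drop_duplicates(subset='session_id', keep='last')
print(latest.groupby('page')['user_id'].nunique())  # home: 1, pricing: 2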

Next steps

  • Take the Quick Test to confirm you’ve got it. The test is available to everyone; only logged-in users get saved progress.
  • Apply these tools in your next data-cleaning task.

Practice Exercises

1 exercise to complete

Instructions

Create this DataFrame and complete the tasks.

import pandas as pd

df = pd.DataFrame({
    'order_id':   [1,2,2,3,4,4,5],
    'user_email': ['a@x','b@x','b@x','c@x','d@x','d@x','e@x'],
    'product':    ['A','B','B','A','C','C','D'],
    'region':     ['NA','EU','EU','NA','NA','NA','EU']
})

# 1) Mark duplicate rows across ALL columns and count them
# 2) Drop exact duplicate rows (keep first); show the shape
# 3) Drop duplicates by ['order_id','user_email'] keeping LAST; show resulting rows
# 4) List unique products and count them using .unique() and .nunique()
# 5) Compute distinct products per region using groupby + nunique

Expected Output
  1. Duplicate flags across all columns: [False, False, True, False, False, True, False]; duplicate count = 2
  2. After drop_duplicates(keep='first'): shape is (5, 4)
  3. After drop_duplicates(subset=['order_id','user_email'], keep='last'): keeps the later rows for the pairs (2,'b@x') and (4,'d@x')
  4. Unique products: ['A', 'B', 'C', 'D']; product nunique = 4
  5. Distinct products per region: EU = 2, NA = 2
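A reference solution sketch (one valid approach among several):

print(df.duplicated().tolist(), df.duplicated().sum())                    # 1) flags and count (2)
print(df.drop_duplicates(keep='first').shape)                             # 2) (5, 4)
print(df.drop_duplicates(subset=['order_id','user_email'], keep='last'))  # 3) later rows kept
print(df['product'].unique(), df['product'].nunique())                    # 4) ['A' 'B' 'C' 'D'] 4
print(df.groupby('region')['product'].nunique())                          # 5) EU 2, NA 2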

Duplicates Unique Nunique — Quick Test

Test your knowledge with 7 questions. Pass with 70% or higher.
