
Duplicates Unique Nunique

Learn duplicates, unique, and nunique in pandas for free, with explanations, exercises, and a quick test (for Data Analysts).

Published: December 20, 2025 | Updated: December 20, 2025

Who this is for

Data Analysts who clean datasets, merge files, and produce counts like “unique users per day” or “distinct products per region.” If you often need to remove duplicates or compute distinct counts, this subskill is for you.

Prerequisites

  • Basic Python (variables, lists, functions)
  • Intro pandas: creating DataFrames, selecting columns

Why this matters

  • De-duplicate leads before emailing to avoid double-sending
  • Clean transactions to remove repeated rows before revenue totals
  • Count unique users by day, product, or region for KPIs
  • Audit merges to confirm no duplicate join keys slipped in

Concept explained simply

Think of your dataset as a stack of index cards. Some cards are exact repeats, others share the same key (like the same email). You need quick ways to 1) mark or remove repeats, and 2) list what’s distinct.

Mental model

  • duplicated: stamps each card as duplicate or not (True/False).
  • drop_duplicates: throws away duplicates, keeping the one you choose.
  • unique (Series): returns the distinct values from one column.
  • nunique: counts distinct values (per column, per group, etc.).

Core methods at a glance

# Mark duplicate rows
DF.duplicated(subset=None, keep='first')  # subset: columns to compare; keep: 'first'|'last'|False

# Remove duplicate rows
DF.drop_duplicates(subset=None, keep='first', ignore_index=False)

# Unique values from one column (Series only); returns an array in order of appearance
SERIES.unique()

# Count distinct values
DF.nunique(dropna=True)            # per column by default
SERIES.nunique(dropna=True)

# Grouped distinct counts (very common)
DF.groupby('col')['target'].nunique()

Worked examples

Example 1 — Remove duplicate customers by email
import pandas as pd

customers = pd.DataFrame({
    'email': ['a@x.com','b@x.com','b@x.com','c@x.com','d@x.com','d@x.com'],
    'name':  ['Ana',     'Ben',     'Ben',     'Cam',     'Dan',     'Dan'],
    'city':  ['NY',      'LA',      'LA',      'NY',      'SF',      'SF']
})

# Mark duplicates by email (keep first occurrence as original)
customers['is_dup'] = customers.duplicated(subset='email', keep='first')

# Remove duplicates by email, keep the last occurrence
dedup = customers.drop_duplicates(subset='email', keep='last', ignore_index=True)
print(customers)
print(dedup)

Tip: choose keep='first' or keep='last' depending on which row you want to retain (e.g., last may have latest info).
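One more option worth seeing: keep=False flags every copy, including the first. A quick sketch on the same customers frame:

# keep=False marks every occurrence of a repeated email; no row is kept as "the original"
all_dups = customers.duplicated(subset='email', keep=False)
print(all_dups.tolist())  # [False, True, True, False, True, True]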

Example 2 — Get unique categories and distinct counts
products = pd.DataFrame({
    'product': ['A','B','B','A','C','C','D'],
    'region':  ['NA','EU','EU','NA','NA','NA','EU']
})

# Unique product values (array)
print(products['product'].unique())  # ['A' 'B' 'C' 'D']

# Distinct counts per column
print(products.nunique())             # product: 4, region: 2

# Distinct products per region
print(products.groupby('region')['product'].nunique())
# EU: 2, NA: 2

Use .unique() when you need the values themselves; use .nunique() when you only need the count.
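One nuance worth a quick look is missing values: unique() includes NaN/None in its output, while nunique() drops them unless told otherwise. A minimal sketch with a made-up Series:

s = pd.Series(['A', 'B', None, 'B'])
print(s.unique())               # ['A' 'B' None]: missing values appear in the output
print(s.nunique())              # 2: NaN/None excluded by default (dropna=True)
print(s.nunique(dropna=False))  # 3: the missing value counted as its own category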

Example 3 — De-duplicate transactions by a key combination
orders = pd.DataFrame({
    'user_id':  [1,1,1,2,2,3],
    'order_id': [10,10,11,20,20,30],
    'amount':   [50,50,40,25,25,60]
})

# Some rows repeat the (user_id, order_id) pair exactly
mask = orders.duplicated(subset=['user_id','order_id'], keep='first')
print(mask.tolist())  # [False, True, False, False, True, False]: later repeats flagged

clean = orders.drop_duplicates(subset=['user_id','order_id'], keep='first')
print(clean)

De-dup on keys, not on amount. This preserves one row per logical transaction.
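A cheap sanity check after de-duplication (continuing Example 3): assert the key is now unique before computing totals or joining.

# The key should now be unique; fail fast if any duplicates slipped through
assert not clean.duplicated(subset=['user_id','order_id']).any()
print(clean['amount'].sum())  # 175: one row per logical transaction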

Learning path

  • Before this: Selecting/filtering rows, basic DataFrame operations
  • This lesson: duplicated, drop_duplicates, unique, nunique
  • Next: groupby aggregations, value_counts, joins/merge quality checks

Common mistakes and self-check

  • Forgetting subset=... and de-duplicating on all columns by accident
  • Using unique() on a DataFrame (it’s a Series method)
  • Confusing keep values: keep=False marks (or drops) every occurrence of a duplicate, keeping no copy at all
  • Ignoring NaN handling: nunique(dropna=True) excludes NaN by default
  • Not resetting the index: dropping duplicates preserves row order but leaves gaps in the index; use ignore_index=True when you need a clean 0..n-1 index

Quick self-check
  • Can you mark duplicates by a specific key or key-pair?
  • Can you keep only the last occurrence when de-duplicating?
  • Can you compute unique values and distinct counts per group?
  • Do you know when to include or exclude NaN in distinct counts?

Practice exercises

Complete Exercise 1 below. After finishing, compare with the solution and tick the checklist.

  • [Exercise 1] Duplicates and distinct counts on a small orders dataset

Exercise 1 instructions (open in your IDE or notebook)
  1. Create the DataFrame shown in the exercise block below.
  2. Mark duplicate rows across all columns and count them.
  3. Drop exact duplicate rows, keeping the first occurrence.
  4. Drop duplicates by ['order_id','user_email'], keeping the last occurrence.
  5. List unique products and count them.
  6. Compute distinct products per region using groupby + nunique.

Checklist
  • I can mark and drop duplicates by a chosen subset
  • I can get Series unique values and counts
  • I can compute groupwise distinct counts

Practical projects

  • CRM Clean-up: De-duplicate contacts by email; count unique domains; deliver a “clean contacts” CSV (see the sketch after this list)
  • Transaction Audit: Remove duplicated order lines by (user_id, order_id); report unique users per day
  • Catalog Check: List unique categories and the number of distinct products per category
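For the CRM clean-up, the unique-domain count is a one-liner once you split the email. A minimal sketch with a hypothetical contacts frame:

contacts = pd.DataFrame({'email': ['ana@x.com', 'ben@y.com', 'ben@y.com', 'cam@x.com']})
clean_contacts = contacts.drop_duplicates(subset='email')       # de-duplicate by email
print(clean_contacts['email'].str.split('@').str[1].nunique())  # 2: x.com and y.com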

Mini challenge

You are handed a DataFrame with columns ['session_id','user_id','page','ts']. Keep only the latest row per session_id, then report the number of distinct users per page. Hint: sort by ts, use drop_duplicates(subset='session_id', keep='last'), then groupby('page')['user_id'].nunique().
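A minimal sketch of one solution, using made-up data for the columns named above:

import pandas as pd

sessions = pd.DataFrame({
    'session_id': [1, 1, 2, 2, 3],
    'user_id':    [10, 10, 20, 20, 30],
    'page':       ['home', 'pricing', 'home', 'home', 'pricing'],
    'ts':         pd.to_datetime(['2025-01-01 09:00', '2025-01-01 09:05',
                                  '2025-01-01 10:00', '2025-01-01 10:02',
                                  '2025-01-01 11:00'])
})

# Latest row per session: sort by timestamp, keep the last occurrence of each session_id
latest = sessions.sort_values('ts').drop_duplicates(subset='session_id', keep='last')
print(latest.groupby('page')['user_id'].nunique())  # home: 1, pricing: 2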

Next steps

  • Take the Quick Test to confirm you’ve got it. The test is available to everyone; only logged-in users get saved progress.
  • Apply these tools in your next data-cleaning task.

Practice Exercises

1 exercise to complete

Instructions

Create this DataFrame and complete the tasks.

import pandas as pd

df = pd.DataFrame({
    'order_id':   [1,2,2,3,4,4,5],
    'user_email': ['a@x','b@x','b@x','c@x','d@x','d@x','e@x'],
    'product':    ['A','B','B','A','C','C','D'],
    'region':     ['NA','EU','EU','NA','NA','NA','EU']
})

# 1) Mark duplicate rows across ALL columns and count them
# 2) Drop exact duplicate rows (keep first); show the shape
# 3) Drop duplicates by ['order_id','user_email'] keeping LAST; show resulting rows
# 4) List unique products and count them using .unique() and .nunique()
# 5) Compute distinct products per region using groupby + nunique

Expected Output
  1. Duplicate flags across all columns: [False, False, True, False, False, True, False]; duplicate count = 2
  2. After drop_duplicates(keep='first'): shape is (5, 4)
  3. After drop_duplicates(subset=['order_id','user_email'], keep='last'): keeps the later rows for the pairs (2,'b@x') and (4,'d@x')
  4. Unique products: ['A', 'B', 'C', 'D']; product nunique = 4
  5. Distinct products per region: EU = 2, NA = 2
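A reference solution sketch (one valid approach among several):

print(df.duplicated().tolist(), df.duplicated().sum())                    # 1) flags and count (2)
print(df.drop_duplicates(keep='first').shape)                             # 2) (5, 4)
print(df.drop_duplicates(subset=['order_id','user_email'], keep='last'))  # 3) later rows kept
print(df['product'].unique(), df['product'].nunique())                    # 4) ['A' 'B' 'C' 'D'] 4
print(df.groupby('region')['product'].nunique())                          # 5) EU 2, NA 2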

Duplicates Unique Nunique — Quick Test

Test your knowledge with 7 questions. Pass with 70% or higher.
