Why this matters
As a Data Analyst, you will constantly answer questions like: What are the top customer segments? How does churn vary by plan? Which regions return products most often? Two pandas tools make this fast and reliable: value_counts (frequencies of a single column) and crosstab (a contingency table across two or more variables). Mastering them lets you produce clear summaries, percentage distributions, and quick sanity checks in seconds.
- Quickly find the top N categories and their share.
- Compare distributions across segments (e.g., churn by region, returns by channel).
- Produce totals and row/column percentages without custom code.
Concept explained simply
value_counts counts how many times each category appears in a Series. crosstab creates a table (rows vs columns) tallying counts for each combination of categories. Both can show percentages and include totals.
Mental model
- value_counts: Imagine a tally sheet for one question: How many of each answer?
- crosstab: Imagine a 2D grid: rows are one question (e.g., Region), columns another (e.g., Returned), and cells hold counts. Add margins for totals and normalize for percentages.
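A minimal sketch of both mental models on toy data (the column names are made up for illustration):
import pandas as pd
# One question, one tally sheet: counts per answer
answers = pd.Series(['Yes', 'No', 'Yes', 'Yes'])
print(answers.value_counts())  # Yes: 3, No: 1
# Two questions, one grid: counts per (region, returned) combination
df = pd.DataFrame({'region': ['North', 'North', 'South', 'South'],
                   'returned': ['Yes', 'No', 'No', 'No']})
print(pd.crosstab(df['region'], df['returned']))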
Quick syntax cheat-sheet
# 1) value_counts basics
s.value_counts() # counts, sorted by count desc
s.value_counts(normalize=True) # share (proportions) sum to 1
s.value_counts(dropna=False) # include NaN as a category
s.value_counts(sort=False) # keep original order (no sorting)
s.value_counts().head(5) # top 5 categories
# 2) crosstab basics
pd.crosstab(df['row_col'], df['col_col'])
# Add totals (margins)
pd.crosstab(df['A'], df['B'], margins=True, margins_name='Total')
# Percentages
pd.crosstab(df['A'], df['B'], normalize='index') # row-wise percent
pd.crosstab(df['A'], df['B'], normalize='columns') # column-wise percent
pd.crosstab(df['A'], df['B'], normalize='all') # table-wise percent
# Keep category combinations with zero counts (note: rows with NaN keys are always excluded)
pd.crosstab(df['A'], df['B'], dropna=False)
# With bins (e.g., age groups)
pd.crosstab(pd.cut(df['age'], bins=[0,25,40,60,100], include_lowest=True), df['B'])
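The snippets above assume an existing Series s and DataFrame df. A minimal, made-up setup you can paste first to try them (use A and B wherever 'row_col'/'col_col' appear):
import pandas as pd
s = pd.Series(['x', 'y', 'x', 'x', None])
df = pd.DataFrame({
    'A': ['a1', 'a1', 'a2', 'a2', 'a1'],
    'B': ['b1', 'b2', 'b1', 'b1', 'b2'],
    'age': [22, 35, 41, 58, 64],
})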
Worked examples
Example 1 — Top products with shares
import pandas as pd
products = pd.Series(['A','B','A','C','B','A','B','B','C','A', None])
counts = products.value_counts(dropna=False)
shares = products.value_counts(normalize=True, dropna=False)
print(counts)
print(shares.round(3))
What to notice
- Including NaN exposes missing data rate.
- normalize=True returns proportions useful for easy charting and comparisons.
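If you want counts and percentages side by side, one option is to concatenate the two Series from above into a small table (the column labels 'count' and 'pct' are just a choice):
summary = pd.concat([counts, (shares * 100).round(1)], axis=1, keys=['count', 'pct'])
print(summary)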
Example 2 — Returns by channel with totals and row percentages
df = pd.DataFrame({
'channel': ['Online','Online','Store','Store','Online','Store','Online','Online','Store','Store'],
'returned': ['Yes','No','No','Yes','No','No','No','Yes','No','Yes']
})
ct = pd.crosstab(df['channel'], df['returned'], margins=True, margins_name='Total')
row_pct = pd.crosstab(df['channel'], df['returned'], normalize='index')
print(ct)
print((row_pct*100).round(1))
What to notice
- margins adds a Total row/column for quick sanity checks.
- normalize='index' shows per-channel shares (each row sums to 1).
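To answer "which channel returns most?" directly, you can sort one column of the row-percentage table (continuing the example above):
return_rate = (row_pct['Yes'] * 100).round(1).sort_values(ascending=False)
print(return_rate)  # per-channel return rate in %, highest first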
Example 3 — Age-band vs churn, including missing and friendly percentages
df = pd.DataFrame({
'age': [22, 37, 41, 59, None, 26, 63, 44, 38, 23],
'churned': ['No','No','Yes','No','Yes','No','Yes','Yes','No','No']
})
age_band = pd.cut(df['age'], bins=[0,25,40,60,100], include_lowest=True)
ct = pd.crosstab(age_band, df['churned'], dropna=False)
row_pct = pd.crosstab(age_band, df['churned'], normalize='index', dropna=False)
print(ct)
print((row_pct*100).round(1).astype(str) + '%')
What to notice
- Binning creates interpretable groups.
- dropna=False keeps empty age bands (zero-count categories) visible. Note that crosstab always drops rows whose key is NaN, so the missing age does not appear as its own band; see the sketch below for how to surface it.
- Row percentages simplify comparison across age bands.
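Because crosstab excludes NaN keys, auditing missing ages needs an explicit placeholder. One option is to add a 'Missing' category before tabulating (the label is arbitrary; continuing Example 3):
age_band_audit = age_band.cat.add_categories('Missing').fillna('Missing')
print(pd.crosstab(age_band_audit, df['churned']))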
Exercises
Use the sample dataset below for both exercises. You can copy-paste it into a Python session.
import pandas as pd
df = pd.DataFrame({
'order_id': [1,2,3,4,5,6,7,8,9,10,11,12],
'region': ['North','South','East','West','North','East','West','South','North','West','East','South'],
'channel': ['Online','Store','Store','Online','Online','Online','Store','Store','Online','Store','Online','Online'],
'returned': ['No','No','Yes','No','Yes','No','No','Yes','No','Yes','No','No'],
'category': ['A','B','A','C','A','B','C','A','A','C','B',None]
})
Exercise 1 — Top categories and their share
Task: Show category counts including missing values, plus percentage share. Display only the top 3 non-missing categories by count. Round shares to 1 decimal place (%).
- Hint: Use value_counts twice (counts and normalize). Consider dropna.
- Goal: A small table or Series with counts and percent for A, B, C.
Tiny nudge: compute counts = df['category'].value_counts(dropna=False) and shares = df['category'].value_counts(normalize=True), then take the top 3 non-missing categories by count.
Exercise 2 — Returns by region (row percentages + totals)
Task: Build a crosstab of region vs returned with totals. Then produce row percentages (each region sums to 100%). Round to 1 decimal place.
- Hint: margins=True for totals, normalize='index' for row percentages.
Tiny nudge: ct = pd.crosstab(df['region'], df['returned'], margins=True); then a second crosstab with normalize='index'.
Practice checklist
- Can I include NaN in value_counts when needed?
- Can I switch between counts and percentages quickly?
- Can I add margins and read totals correctly?
- Do my row/column percentages sum to ~1 (or 100%)?
- Did I round for readability without losing meaning?
Common mistakes
- Mixing up normalize='index' and 'columns': use 'index' for row-wise shares and 'columns' for column-wise shares (see the sketch after this list).
- Dropping NaN unintentionally: Use dropna=False to keep missing categories visible.
- Comparing counts across groups with different sizes: Use percentages for fair comparison.
- Reading margins incorrectly: the Total row/column summarizes counts; when combined with normalize, the margins are normalized as well, so double-check the interpretation.
- Sorting confusion: value_counts sorts by count descending by default. For alphabetical order, call .sort_index() on the result.
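A quick sketch of the normalize directions and the alphabetical-sort pattern from the list above (toy data):
import pandas as pd
demo = pd.DataFrame({'A': ['x', 'x', 'y'], 'B': ['p', 'q', 'q']})
print(pd.crosstab(demo['A'], demo['B'], normalize='index'))    # each row sums to 1
print(pd.crosstab(demo['A'], demo['B'], normalize='columns'))  # each column sums to 1
vals = pd.Series(['b', 'a', 'b', 'c'])
print(vals.value_counts())               # sorted by count, descending
print(vals.value_counts().sort_index())  # alphabetical by category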
Self-check
- Do my totals equal the dataset size?
- Do row percentages sum to 100% (allow tiny rounding errors)?
- Did I explicitly decide whether to include NaN?
- Is the chosen normalization aligned with the question asked?
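The first two checks can be scripted. A sketch, assuming ct, row_pct, and df are still the objects from Example 2 (built with margins_name='Total'):
assert ct.loc['Total', 'Total'] == len(df)           # grand total equals dataset size
assert (row_pct.sum(axis=1) - 1).abs().max() < 1e-9  # each row of shares sums to 1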
Practical projects
- E-commerce snapshot: Use value_counts to list top 10 product categories and their shares; use crosstab to compare returns by shipping method with margins and row percentages.
- Customer support: crosstab ticket severity by channel; produce both counts and row percentages; highlight which channel has the highest share of critical tickets.
- Marketing campaign: value_counts on campaign source; crosstab source by conversion flag; export both as CSV after rounding percentages.
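For the export step in the marketing project, a minimal sketch (the source and converted column names and file names are assumptions about your data):
counts = df['source'].value_counts()  # 'source' / 'converted' are placeholder column names
ct_pct = (pd.crosstab(df['source'], df['converted'], normalize='index') * 100).round(1)
counts.to_csv('source_counts.csv')
ct_pct.to_csv('source_by_conversion_pct.csv')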
Who this is for
- Aspiring and practicing Data Analysts needing fast categorical summaries.
- Anyone using pandas to explore and report distributions.
Prerequisites
- Basic Python (variables, lists, functions).
- pandas fundamentals: Series, DataFrame, basic selection.
Learning path
- Refresh Series and DataFrame basics.
- Practice value_counts on multiple categorical columns; include dropna.
- Build crosstab tables; add margins; try different normalize modes.
- Format outputs (rounding, sorting) for stakeholder-friendly summaries.
- Apply to a real dataset and validate totals.
Mini challenge
Given a dataset of signups with columns plan (Basic/Pro/Team), country, and churn (Yes/No):
- Find plan distribution with value_counts including NaN share.
- Create crosstab of country vs churn with row percentages and totals.
- In one sentence, interpret which country shows the highest churn share.
Next steps
- Re-run the examples on your own data.
- Complete the exercises above, then take the Quick Test below.
- Move on to grouping and aggregation to build deeper analyses.