Who this is for
Data analysts and learners who want to turn continuous variables (ages, income, scores) into meaningful categories using pandas pd.cut and pd.qcut.
Prerequisites
- Basic Python and pandas (Series/DataFrame basics)
- Comfort with filtering, value_counts, and simple aggregations
- pandas installed in your environment
Learning path
- Understand when to bin data and why it helps analysis
- Learn pd.cut for equal-width or custom edges
- Learn pd.qcut for equal-frequency (quantile) bins
- Handle labels, boundaries, missing values, and duplicates
- Apply bins to analysis and visualization
Why this matters
- Customer analytics: create age groups, tenure buckets, or spend tiers to compare behavior.
- Reporting: simplify continuous metrics into clear segments (low/medium/high) for dashboards.
- Model features: engineer categorical features from continuous variables.
Real tasks you might do
- Group customers into quartiles by monthly spend to target campaigns.
- Bucket delivery times into on-time, slightly late, very late for SLA monitoring.
- Convert exam scores to grade bands for education reports.
Concept explained simply
Binning turns a continuous number into a category by asking: “Which interval does this number fall into?”
- pd.cut: you choose the cut points (bins). Good for business-defined bands, like [0, 50), [50, 100).
- pd.qcut: pandas chooses cut points so each bin has (about) the same number of rows. Good for quartiles/deciles.
Mental model
Imagine laying a ruler (the number line). With pd.cut you mark your own tick marks. With pd.qcut the data itself decides where the tick marks go so each section holds a similar amount of data.
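To make the ruler picture concrete, here is a minimal sketch on a made-up, skewed sample (the values are hypothetical and chosen only for illustration). With the same number of bins, pd.cut splits the range evenly and pd.qcut splits the rows evenly.

```python
import pandas as pd

# Hypothetical skewed sample: seven small values and one large outlier
values = pd.Series([1, 2, 3, 4, 5, 6, 7, 100])

# pd.cut: two equal-width intervals across the range of the data
width_bins = pd.cut(values, bins=2)

# pd.qcut: two intervals that each hold the same number of rows
freq_bins = pd.qcut(values, q=2)

# sort_index() orders the counts by interval rather than by frequency
width_counts = width_bins.value_counts().sort_index().tolist()
freq_counts = freq_bins.value_counts().sort_index().tolist()

print(width_counts)  # rows per equal-width bin
print(freq_counts)   # rows per equal-frequency bin
```

On this sample the outlier sits alone in the upper equal-width bin, while the quantile split places half of the rows in each bin.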
Key functions and parameters
- pd.cut(x, bins, right=True, labels=None, precision=3, include_lowest=False)
- bins: int (equal-width) or list of edges
- right: whether intervals include the right edge
- include_lowest: include the first interval’s left edge (matters when right=True, where intervals are otherwise open on the left)
- labels: list or False (False returns integer bin codes)
- pd.qcut(x, q, labels=None, duplicates='raise')
- q: int (e.g., 4 for quartiles) or list of quantiles (0 to 1)
- duplicates: 'drop' to handle non-unique quantile edges
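The effect of right, include_lowest, and labels=False is easiest to see on values that sit exactly on the edges. The sketch below uses a small hypothetical series and returns integer bin codes so the results are easy to compare.

```python
import pandas as pd

# Hypothetical values that fall exactly on the bin edges
scores = pd.Series([0, 50, 100])
edges = [0, 50, 100]

# Default right=True: intervals are (0, 50] and (50, 100], so 0 falls outside
default_codes = pd.cut(scores, bins=edges, labels=False)

# include_lowest=True closes the first interval on the left, so 0 is kept
lowest_codes = pd.cut(scores, bins=edges, labels=False, include_lowest=True)

# right=False: intervals are [0, 50) and [50, 100), so 100 falls outside
left_codes = pd.cut(scores, bins=edges, labels=False, right=False)

print(default_codes.tolist())
print(lowest_codes.tolist())
print(left_codes.tolist())
```

Values that fall outside every interval come back as NaN, which is why two of the three results contain a missing code.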
Worked examples
1) Custom edges with pd.cut
import pandas as pd
import numpy as np
ages = pd.Series([5, 17, 18, 29, 35, 49, 50, 72, np.nan])
# Define edges and labels
edges = [0, 18, 35, 50, float('inf')]
labels = ['0–17', '18–34', '35–49', '50+']
age_group = pd.cut(ages, bins=edges, right=False, labels=labels)
print(age_group)
print(age_group.value_counts(dropna=False))
We used right=False to include the left edge and exclude the right, so 18 goes to 18–34 and 50 goes to 50+.
2) Automatic equal-width from number of bins
import pandas as pd
import numpy as np
scores = pd.Series([32, 45, 58, 63, 71, 79, 84, 92])
# 4 equal-width bins between min and max
binned = pd.cut(scores, bins=4)
print(binned.cat.categories)
print(binned.value_counts())
pd.cut computed 4 intervals spanning the min and max of scores.
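If you want to see the edges pandas picked, one option is retbins=True, which returns the edge array alongside the binned values. This sketch reuses the scores from the example above; note that pandas places the lowest edge slightly below the minimum so that the smallest value is included.

```python
import pandas as pd

scores = pd.Series([32, 45, 58, 63, 71, 79, 84, 92])

# retbins=True returns a tuple: (binned values, array of computed edges)
binned, edges = pd.cut(scores, bins=4, retbins=True)

# Counts in interval order rather than by frequency
counts = binned.value_counts().sort_index().tolist()

print(edges)   # 5 edges define 4 intervals
print(counts)
```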
3) Equal-frequency bins with pd.qcut (quartiles)
import pandas as pd
sales = pd.Series([10, 20, 30, 40, 50, 60, 70, 80, 90, 100, 110, 120])
quartile = pd.qcut(sales, q=4, labels=['Q1','Q2','Q3','Q4'])
print(quartile)
print(quartile.value_counts())
Each quartile has 3 values, so counts are balanced.
Edge cases and tips
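- NaN stays NaN after binning. If you need a bucket for missing values, either fill the numbers before binning or add a category (for example with cat.add_categories) and then use fillna.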
- Labels length must equal number of intervals; if you pass 4 edges, you get 3 intervals.
- pd.qcut may fail when many identical values cause duplicate quantile edges; set duplicates='drop' to proceed with fewer bins.
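- Boundary inclusion: with the default right=True, intervals are (a, b], so a value equal to the lowest edge is left out unless you set include_lowest=True. With right=False, intervals are [a, b), so a value equal to the highest edge is left out.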
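The duplicate-edge and NaN cases above can be reproduced with a small sketch. The spend values are hypothetical and deliberately dominated by zeros so that several quartile edges coincide.

```python
import numpy as np
import pandas as pd

# Hypothetical spend data: mostly zeros, plus one missing value
spend = pd.Series([0, 0, 0, 0, 0, 0, 5, 10, np.nan])

# Asking for quartiles fails because several quantile edges are all 0
try:
    pd.qcut(spend, q=4)
    raised = False
except ValueError:
    raised = True

# duplicates='drop' discards the repeated edges and returns fewer bins
tiers = pd.qcut(spend, q=4, duplicates="drop")
n_bins = len(tiers.cat.categories)
n_missing = int(tiers.isna().sum())

print(raised)     # whether the plain call raised
print(n_bins)     # number of bins actually produced
print(n_missing)  # the NaN input stays unbinned
```

Checking the number of categories after duplicates='drop' is worthwhile, since labels written for four bins will no longer fit if fewer bins come back.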
How to use bins in analysis
- Group and aggregate: df.groupby('bin')['metric'].mean()
- Distribution checks: bin_col.value_counts(normalize=True)
- Visualization: bar charts of counts per bin, or color-encoding in scatter plots
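As a minimal sketch of the group-and-aggregate and distribution checks listed above, the example below bins a small invented DataFrame. The column names age, spend, and age_group are assumptions for illustration only.

```python
import pandas as pd

# Hypothetical customer data; column names are illustrative only
df = pd.DataFrame({
    "age": [22, 25, 37, 41, 58, 63],
    "spend": [100, 140, 200, 220, 90, 110],
})

# Left-closed business bands: [18, 35), [35, 50), [50, 120)
df["age_group"] = pd.cut(
    df["age"],
    bins=[18, 35, 50, 120],
    right=False,
    labels=["18–34", "35–49", "50+"],
)

# Mean spend per band; observed=True keeps only bands that appear in the data
mean_spend = df.groupby("age_group", observed=True)["spend"].mean()

# Share of rows per band, listed in band order
share = df["age_group"].value_counts(normalize=True).sort_index()

print(mean_spend)
print(share)
```

The same mean_spend Series can be passed to a bar chart (for example mean_spend.plot.bar()) if a plotting backend is installed.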
Practice: follow along
Exercises
Complete the tasks below. Then open the solution blocks to compare.
Exercise 1 — Age groups with pd.cut
Recreate the AgeGroup labels for a small list of ages using left-closed, right-open intervals.
- Data: [5, 17, 18, 29, 35, 49, 50, 72, NaN]
- Edges: [0, 18, 35, 50, inf]
- Labels: ['0–17', '18–34', '35–49', '50+']
- Use right=False
Hint: Use pd.cut(..., bins=edges, labels=labels, right=False). Remember that NaN stays NaN.
Exercise 2 — Quartiles with pd.qcut
Bin the following incomes into quartiles labeled Q1–Q4.
- Data: [10, 20, 30, 40, 50, 60, 70, 80, 90, 100, 110, 120]
- Hint: pd.qcut(incomes, 4, labels=[...])
Hint: Value counts for each label should be equal when there are no duplicates causing edge collisions.
Self-check checklist
- You verified boundary placement with right=False vs right=True.
- Labels length matched number of intervals.
- value_counts shows balanced counts for qcut on evenly spaced data.
- You noted NaN values remain unbinned unless filled.
Common mistakes and how to spot them
- Labels mismatch: ValueError about labels length. Fix by matching labels to number of intervals.
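- Wrong boundary inclusion: Values equal to an edge fall into an unexpected bin or become NaN. Fix by adjusting right and include_lowest.
- Using qcut on tiny or heavily repeated data: Quantile edges can coincide, which raises a ValueError about non-unique bin edges. Use duplicates='drop' or ask for fewer bins.
- Forgetting ordered categories: pd.cut and pd.qcut return ordered categoricals by default, but the order is lost if you convert the bins to plain strings. If you plan to sort or compare bins logically, check cat.ordered and restore order with cat.as_ordered() if needed.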
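Two of these mistakes are quick to check in code. The sketch below, on hypothetical values, triggers the label-length error on purpose and then confirms that the correctly labeled result is an ordered categorical.

```python
import pandas as pd

values = pd.Series([5, 15, 25])
edges = [0, 10, 20, 30]  # 4 edges define 3 intervals

# Passing 2 labels for 3 intervals raises a ValueError
try:
    pd.cut(values, bins=edges, labels=["low", "high"])
    mismatch_error = None
except ValueError as exc:
    mismatch_error = str(exc)

# With 3 labels the call succeeds and the categories keep their order
bands = pd.cut(values, bins=edges, labels=["low", "mid", "high"])
is_ordered = bands.cat.ordered
top_band = bands.max()  # max() is meaningful because the categories are ordered

print(mismatch_error)
print(is_ordered, top_band)
```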
Practical projects
- Customer spend tiers: Create deciles with qcut for monthly spend. Compare churn rates by tier.
- Delivery performance: Bin delivery minutes into [0, 10), [10, 30), [30, 60), [60, inf) and chart on-time vs late proportions.
Mini challenge
You have transaction amounts: [2, 5, 9, 15, 20, 26, 33, 47, 51, 68, 72, 90, 105, 130].
- Create 5 equal-frequency bins with qcut (labels Q1–Q5). If duplicates cause issues, allow duplicates='drop' and note the final number of bins.
- Create custom business bins: [0, 20, 50, 100, inf] with labels ['Micro','Small','Medium','Large'] using cut with right=False.
- Report counts per bin for both methods and 1 insight you observe.
Need a nudge?
Start with pd.Series(data). Use value_counts() with sort=False to keep label order.
Next steps
- Combine bins with groupby to compute KPIs per band.
- Try deciles (q=10) and compare uplift in segmentation analyses.
- Use pd.IntervalIndex to introspect interval boundaries when debugging.
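For the last point, a small sketch of boundary introspection: when no labels are passed, the categories of a pd.cut result are an IntervalIndex whose left, right, and closed attributes can be inspected. The amounts and edges here are hypothetical.

```python
import pandas as pd

amounts = pd.Series([3, 12, 27, 44])
binned = pd.cut(amounts, bins=[0, 10, 30, 50])

# Without labels, the categories are an IntervalIndex
intervals = binned.cat.categories

lefts = intervals.left.tolist()
rights = intervals.right.tolist()
closed_side = intervals.closed

print(lefts)        # left edge of each interval
print(rights)       # right edge of each interval
print(closed_side)  # which side of each interval is closed
```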