Who this is for
Data Analysts who handle API responses, product catalogs, surveys, logs, or any dataset with nested lists/dicts that must be turned into clean, tabular rows and columns.
Prerequisites
- Basic Python (lists, dicts, loops)
- Core pandas (DataFrame, Series, indexing, groupby)
- Comfort reading JSON-like structures
Why this matters
- APIs return nested JSON; you need to flatten it for analysis.
- Columns often contain arrays (tags, items, categories) that need row-wise expansion.
- Survey questions can be multi-select (lists) and must be exploded to count responses.
- Event logs store attributes as dicts; normalizing reveals columns you can aggregate.
Real tasks you’ll face
- Explode a list of purchased items per order to compute item-level revenue.
- Normalize a nested customer profile JSON into clean columns for modeling.
- Unpack an array of event objects to analyze funnels and retention.
Concept explained simply
Two superpowers:
- explode: Takes a column with list-like values and makes a new row for each element.
- json_normalize: Takes nested dicts/lists and flattens them into columns (and optionally rows) using paths you specify.
Mental model
Think of explode as a photocopier for rows that have a list: it duplicates the row once per list item. Think of json_normalize as a map that traces paths through nested dicts (like customer.name) and descends into nested lists of records (via record_path), pulling them out as tidy columns.
Core pandas tools you’ll use
- Series.explode(ignore_index=True)
- pandas.json_normalize(data, record_path=None, meta=None, meta_prefix=None, record_prefix=None, errors='raise', sep='.', max_level=None)
- DataFrame.apply(pd.Series) for quick dict-to-columns (simple cases; see the sketch after this list)
- Series.str methods to clean lists/dicts stored as strings (careful: only when truly string-encoded; json.loads or ast.literal_eval is usually the safer parser)
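The apply(pd.Series) shortcut in action, as a minimal sketch (the attrs column is a made-up example):
import pandas as pd

profiles = pd.DataFrame({
    'user_id': [1, 2],
    'attrs': [{'plan': 'pro', 'seats': 3}, {'plan': 'free', 'seats': 1}]
})
# Spread each flat dict into its own columns
attr_cols = profiles['attrs'].apply(pd.Series)
profiles_flat = pd.concat([profiles.drop(columns='attrs'), attr_cols], axis=1)
# For nested dicts or large frames, pd.json_normalize(profiles['attrs'].tolist()) is faster and safer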
Quick syntax reminders
df = pd.DataFrame({
    'order_id': [1, 2],
    'items': [['A', 'B'], ['C']]
})
# explode
out = df.explode('items', ignore_index=True)
# json_normalize
nested = [
    {
        'order_id': 1,
        'customer': {'id': 10, 'city': 'Berlin'},
        'items': [{'sku': 'A', 'qty': 2}, {'sku': 'B', 'qty': 1}]
    }
]
flat = pd.json_normalize(
    nested,
    record_path='items',
    meta=['order_id', ['customer', 'id'], ['customer', 'city']]
)
Worked examples
Example 1 — Explode a list column for item-level analysis
import pandas as pd
df = pd.DataFrame({
    'order_id': [101, 102, 103],
    'items': [
        ['pen', 'notebook'],
        ['pen'],
        []
    ],
    'price': [5.0, 3.0, 0.0]
})
# Expand rows so each item is one row
items_long = df.explode('items', ignore_index=True)
# Count items
item_counts = items_long['items'].value_counts(dropna=True)
print(items_long)
print(item_counts)
What to notice
- Empty lists become NaN rows after explode. Filter them if needed: items_long.dropna(subset=['items']).
- ignore_index=True gives a clean 0..N index.
Example 2 — Flatten nested dicts with json_normalize
data = [
    {
        'user': {'id': 1, 'name': 'Ada', 'geo': {'city': 'Paris'}},
        'stats': {'logins': 5, 'purchases': 2}
    },
    {
        'user': {'id': 2, 'name': 'Lin', 'geo': {'city': 'Tokyo'}},
        'stats': {'logins': 8, 'purchases': 1}
    }
]
flat = pd.json_normalize(data)
print(flat)
# Columns: ['user.id','user.name','user.geo.city','stats.logins','stats.purchases']
Why this helps
You instantly get usable columns for grouping, joins, and visualization.
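For example, the dotted columns group like any others (continuing the flat frame above):
# Total logins per city, straight off the flattened columns
logins_by_city = flat.groupby('user.geo.city')['stats.logins'].sum()
print(logins_by_city)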
Example 3 — Normalize records inside a list (record_path + meta)
orders = [
    {
        'order_id': 'o1',
        'customer': {'id': 7, 'segment': 'B2C'},
        'lines': [
            {'sku': 'A', 'qty': 2, 'price': 3.5},
            {'sku': 'B', 'qty': 1, 'price': 5.0}
        ]
    },
    {
        'order_id': 'o2',
        'customer': {'id': 8, 'segment': 'B2B'},
        'lines': [{'sku': 'C', 'qty': 4, 'price': 2.0}]
    }
]
lines = pd.json_normalize(
    orders,
    record_path='lines',
    meta=['order_id', ['customer', 'id'], ['customer', 'segment']]
)
print(lines)
# Each line becomes a row, customer/order fields are repeated for easy grouping.
Tip
Meta paths written as lists (like ['customer', 'id']) are joined with the default sep='.' into customer.id automatically. Reach for meta_prefix only when a meta name would clash with a record column, and remember it is applied to every meta column, order_id included; record_prefix does the same for the record columns. A sketch follows.
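A minimal sketch of both prefix options, reusing the orders list above:
# record_prefix renames the unnested record columns; meta_prefix renames ALL meta columns
lines_prefixed = pd.json_normalize(
    orders,
    record_path='lines',
    meta=['order_id', ['customer', 'id']],
    record_prefix='line.',
    meta_prefix='order.'
)
# Columns: line.sku, line.qty, line.price, order.order_id, order.customer.id
print(lines_prefixed.columns.tolist())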
Example 4 — Combine explode + json_normalize for mixed data
df = pd.DataFrame({
    'post_id': [1, 2],
    'comments': [
        [
            {'user': {'id': 11}, 'text': 'Nice', 'likes': 2},
            {'user': {'id': 12}, 'text': 'Wow', 'likes': 3}
        ],
        [
            {'user': {'id': 13}, 'text': 'Great', 'likes': 1}
        ]
    ]
})
# Step 1: explode comments to one comment per row
comments_long = df.explode('comments', ignore_index=True)
# Step 2: normalize each comment dict into columns
comment_cols = pd.json_normalize(comments_long['comments'])
result = pd.concat([comments_long['post_id'], comment_cols], axis=1)
print(result)
# Columns include user.id, text, likes
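One caveat for this pattern: a post with an empty comments list would leave a NaN after explode, and json_normalize cannot handle NaN records. Guard for it between the two steps:
# Drop empty-comment rows and realign the index before normalizing
comments_long = comments_long.dropna(subset=['comments']).reset_index(drop=True)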
When to choose which approach
- If your column already contains a list of dicts, explode first, then json_normalize the dicts.
- If your entire dataset is a nested structure (list of dicts), go straight to json_normalize with record_path+meta.
Exercises (practice in your own notebook)
Work through these in your own notebook and use each checklist to verify your result.
Exercise 1 — Explode tags and count
You have a DataFrame of products with a list of tags:
import pandas as pd
df = pd.DataFrame({
    'sku': ['p1', 'p2', 'p3'],
    'tags': [['blue', 'cotton'], ['cotton'], []]
})
# Task:
# 1) Explode tags into rows.
# 2) Remove empty/NaN tags.
# 3) Count tag frequencies, sorted descending.
- [ ] Used explode with ignore_index=True
- [ ] Dropped NaN tags after explode
- [ ] Produced a Series with counts (index=tag, value=count)
Hint
After explode, use dropna on the exploded column and then value_counts().
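One possible solution, for checking your work:
tags_long = df.explode('tags', ignore_index=True)
tags_long = tags_long.dropna(subset=['tags'])
tag_counts = tags_long['tags'].value_counts()  # sorted descending by default
print(tag_counts)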
Exercise 2 — Normalize nested payload and unnest events
Given:
data = pd.DataFrame({
    'payload': [
        {
            'user': {'id': 1, 'city': 'Paris'},
            'events': [{'type': 'view', 'value': 3}, {'type': 'buy', 'value': 1}]
        },
        {
            'user': {'id': 2, 'city': 'Rome'},
            'events': [{'type': 'view', 'value': 5}]
        }
    ]
})
# Task:
# 1) Extract rows for each event with json_normalize(record_path='events', meta=...).
# 2) Keep user.id and user.city as meta (list paths like ['user', 'id'] produce those dotted names automatically).
# 3) Result columns: type, value, user.id, user.city
- [ ] Used pd.json_normalize with record_path='events'
- [ ] Passed meta for user fields
- [ ] Got tidy result with one row per event
Hint
First convert the column to a list: payloads = data['payload'].tolist(), then use json_normalize on it.
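One possible solution sketch:
payloads = data['payload'].tolist()
events = pd.json_normalize(
    payloads,
    record_path='events',
    meta=[['user', 'id'], ['user', 'city']]
)
# Columns: type, value, user.id, user.city
print(events)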
Common mistakes and self-checks
- Forgetting to drop NaN after explode. Self-check: Do you have rows where the exploded column is NaN? Filter them.
- Exploding a dict directly. Self-check: explode expects list-like; iterating a dict yields its keys, which is rarely what you want (see the sketch after this list).
- Not preserving context when normalizing. Self-check: Use meta fields (like order_id, customer info) so you can aggregate later.
- Name collisions after normalize. Self-check: Use meta_prefix and review columns for duplicates.
- Assuming numeric types after flattening. Self-check: Inspect dtypes and use nullable types (e.g., Int64) if missing values exist.
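The dict-explode gotcha from the list above, as a two-line sketch:
s = pd.Series([{'a': 1, 'b': 2}])
print(s.explode())  # yields the keys 'a' and 'b', not the values
# For dict columns, use pd.json_normalize(s.tolist()) instead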
Quick dtype fix patterns
# After normalization
cols = ['qty', 'price']
for c in cols:
    flat[c] = pd.to_numeric(flat[c], errors='coerce')
flat['qty'] = flat['qty'].astype('Int64')
Practical projects
- E-commerce order lines: explode items from orders, compute item-level revenue, and top SKUs.
- Marketing events: normalize nested event payloads and build a funnel conversion table.
- Survey analysis: explode multi-select answers and chart most common choices by segment.
Learning path
- Before this: Reading files (CSV/JSON), Selecting/Filtering, Basic GroupBy.
- This subskill: explode, json_normalize, tidy nested structures.
- Next: Merging/Joining datasets, Aggregations, and Time-based analytics.
Next steps
- Refactor your data cleaning scripts to handle nested columns up-front.
- Create a utility function that normalizes a payload and documents the meta fields used (a sketch follows this list).
- Practice on one real dataset with both explode and json_normalize.
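For the utility-function idea above, a minimal sketch (the function and parameter names are placeholders, not a fixed API):
import pandas as pd

def normalize_payload(records, record_path, meta_fields):
    """Flatten a list of nested dicts: one row per record under record_path,
    with meta_fields (e.g. ['order_id', ['customer', 'id']]) repeated per row."""
    return pd.json_normalize(records, record_path=record_path, meta=meta_fields)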
Mini challenge
You receive a daily JSON dump of sessions where each session has user info and a list of actions with timestamps. Produce a table of one action per row including user_id, session_id, action_type, and minute-of-day. Then find top 3 actions by city.
Plan it
- json_normalize with record_path='actions' and meta=['session_id', ['user', 'id'], ['user', 'city']].
- Convert the timestamp to datetime, then compute minute-of-day as ts.dt.hour * 60 + ts.dt.minute.
- Group by city and action_type; take top 3.
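A sketch of that plan, assuming the dump parses to a list called sessions with fields named type and ts (adjust to your real schema):
import pandas as pd

sessions = [
    {
        'session_id': 's1',
        'user': {'id': 1, 'city': 'Paris'},
        'actions': [
            {'type': 'view', 'ts': '2024-05-01T09:15:00'},
            {'type': 'click', 'ts': '2024-05-01T09:17:00'}
        ]
    }
]
actions = pd.json_normalize(
    sessions,
    record_path='actions',
    meta=['session_id', ['user', 'id'], ['user', 'city']]
)
actions['ts'] = pd.to_datetime(actions['ts'])
actions['minute_of_day'] = actions['ts'].dt.hour * 60 + actions['ts'].dt.minute
# Top 3 action types per city: count, sort, then keep 3 per city group
top3 = (
    actions.groupby(['user.city', 'type'])
    .size()
    .sort_values(ascending=False)
    .groupby(level='user.city')
    .head(3)
)
print(top3)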