Who this is for
Data Analysts who handle API responses, product catalogs, surveys, logs, or any dataset with nested lists/dicts that must be turned into clean, tabular rows and columns.
Prerequisites
- Basic Python (lists, dicts, loops)
- Core pandas (DataFrame, Series, indexing, groupby)
- Comfort reading JSON-like structures
Why this matters
- APIs return nested JSON; you need to flatten it for analysis.
- Columns often contain arrays (tags, items, categories) that need row-wise expansion.
- Survey questions can be multi-select (lists) and must be exploded to count responses.
- Event logs store attributes as dicts; normalizing reveals columns you can aggregate.
Real tasks you’ll face
- Explode a list of purchased items per order to compute item-level revenue.
- Normalize a nested customer profile JSON into clean columns for modeling.
- Unpack an array of event objects to analyze funnels and retention.
Concept explained simply
Two superpowers:
- explode: Takes a column with list-like values and makes a new row for each element.
- json_normalize: Takes nested dicts/lists and flattens them into columns (and optionally rows) using paths you specify.
Mental model
Think of explode as a photocopier for rows that have a list: it duplicates the row once per list item. Think of json_normalize as a map that traces paths through nested dicts (like customer.name) and descends into nested lists of records (via record_path), pulling them out as tidy columns.
Core pandas tools you’ll use
- Series.explode(ignore_index=True)
- pandas.json_normalize(data, record_path=None, meta=None, meta_prefix=None, record_prefix=None, errors='raise', sep='.', max_level=None)
- DataFrame.apply(pd.Series) for quick dict-to-columns (simple cases; see the sketch after this list)
- Series.str methods to clean lists/dicts stored as strings (careful: only when truly string-encoded; json.loads or ast.literal_eval is usually the safer parser)
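The apply(pd.Series) shortcut in action, as a minimal sketch (the attrs column is a made-up example):
import pandas as pd

profiles = pd.DataFrame({
    'user_id': [1, 2],
    'attrs': [{'plan': 'pro', 'seats': 3}, {'plan': 'free', 'seats': 1}]
})
# Spread each flat dict into its own columns
attr_cols = profiles['attrs'].apply(pd.Series)
profiles_flat = pd.concat([profiles.drop(columns='attrs'), attr_cols], axis=1)
# For nested dicts or large frames, pd.json_normalize(profiles['attrs'].tolist()) is faster and safer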
Quick syntax reminders
df = pd.DataFrame({
    'order_id': [1, 2],
    'items': [['A', 'B'], ['C']]
})
# explode
out = df.explode('items', ignore_index=True)
# json_normalize
nested = [
    {
        'order_id': 1,
        'customer': {'id': 10, 'city': 'Berlin'},
        'items': [{'sku': 'A', 'qty': 2}, {'sku': 'B', 'qty': 1}]
    }
]
flat = pd.json_normalize(
    nested,
    record_path='items',
    meta=['order_id', ['customer', 'id'], ['customer', 'city']]
)
Worked examples
Example 1 — Explode a list column for item-level analysis
import pandas as pd
df = pd.DataFrame({
    'order_id': [101, 102, 103],
    'items': [
        ['pen', 'notebook'],
        ['pen'],
        []
    ],
    'price': [5.0, 3.0, 0.0]
})
# Expand rows so each item is one row
items_long = df.explode('items', ignore_index=True)
# Count items
item_counts = items_long['items'].value_counts(dropna=True)
print(items_long)
print(item_counts)
What to notice
- Empty lists become NaN rows after explode. Filter them if needed: items_long.dropna(subset=['items']).
- ignore_index=True gives a clean 0..N index.
Example 2 — Flatten nested dicts with json_normalize
data = [
    {
        'user': {'id': 1, 'name': 'Ada', 'geo': {'city': 'Paris'}},
        'stats': {'logins': 5, 'purchases': 2}
    },
    {
        'user': {'id': 2, 'name': 'Lin', 'geo': {'city': 'Tokyo'}},
        'stats': {'logins': 8, 'purchases': 1}
    }
]
flat = pd.json_normalize(data)
print(flat)
# Columns: ['user.id','user.name','user.geo.city','stats.logins','stats.purchases']
Why this helps
You instantly get usable columns for grouping, joins, and visualization.
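For example, the dotted columns group like any others (continuing the flat frame above):
# Total logins per city, straight off the flattened columns
logins_by_city = flat.groupby('user.geo.city')['stats.logins'].sum()
print(logins_by_city)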
Example 3 — Normalize records inside a list (record_path + meta)
orders = [
    {
        'order_id': 'o1',
        'customer': {'id': 7, 'segment': 'B2C'},
        'lines': [
            {'sku': 'A', 'qty': 2, 'price': 3.5},
            {'sku': 'B', 'qty': 1, 'price': 5.0}
        ]
    },
    {
        'order_id': 'o2',
        'customer': {'id': 8, 'segment': 'B2B'},
        'lines': [{'sku': 'C', 'qty': 4, 'price': 2.0}]
    }
]
lines = pd.json_normalize(
    orders,
    record_path='lines',
    meta=['order_id', ['customer', 'id'], ['customer', 'segment']]
)
print(lines)
# Each line becomes a row, customer/order fields are repeated for easy grouping.
Tip
Meta paths written as lists (like ['customer', 'id']) are joined with the default sep='.' into customer.id automatically. Reach for meta_prefix only when a meta name would clash with a record column, and remember it is applied to every meta column, order_id included; record_prefix does the same for the record columns. A sketch follows.
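A minimal sketch of both prefix options, reusing the orders list above:
# record_prefix renames the unnested record columns; meta_prefix renames ALL meta columns
lines_prefixed = pd.json_normalize(
    orders,
    record_path='lines',
    meta=['order_id', ['customer', 'id']],
    record_prefix='line.',
    meta_prefix='order.'
)
# Columns: line.sku, line.qty, line.price, order.order_id, order.customer.id
print(lines_prefixed.columns.tolist())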
Example 4 — Combine explode + json_normalize for mixed data
df = pd.DataFrame({
    'post_id': [1, 2],
    'comments': [
        [
            {'user': {'id': 11}, 'text': 'Nice', 'likes': 2},
            {'user': {'id': 12}, 'text': 'Wow', 'likes': 3}
        ],
        [
            {'user': {'id': 13}, 'text': 'Great', 'likes': 1}
        ]
    ]
})
# Step 1: explode comments to one comment per row
comments_long = df.explode('comments', ignore_index=True)
# Step 2: normalize each comment dict into columns
comment_cols = pd.json_normalize(comments_long['comments'])
result = pd.concat([comments_long['post_id'], comment_cols], axis=1)
print(result)
# Columns include user.id, text, likes
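One caveat for this pattern: a post with an empty comments list would leave a NaN after explode, and json_normalize cannot handle NaN records. Guard for it between the two steps:
# Drop empty-comment rows and realign the index before normalizing
comments_long = comments_long.dropna(subset=['comments']).reset_index(drop=True)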
When to choose which approach
- If your column already contains a list of dicts, explode first, then json_normalize the dicts.
- If your entire dataset is a nested structure (list of dicts), go straight to json_normalize with record_path+meta.
Exercises (practice in your own notebook)
Work through these in your own notebook and use each checklist to verify your result.
Exercise 1 — Explode tags and count
You have a DataFrame of products with a list of tags:
import pandas as pd
df = pd.DataFrame({
    'sku': ['p1', 'p2', 'p3'],
    'tags': [['blue', 'cotton'], ['cotton'], []]
})
# Task:
# 1) Explode tags into rows.
# 2) Remove empty/NaN tags.
# 3) Count tag frequencies, sorted descending.
- [ ] Used explode with ignore_index=True
- [ ] Dropped NaN tags after explode
- [ ] Produced a Series with counts (index=tag, value=count)
Hint
After explode, use dropna on the exploded column and then value_counts().
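One possible solution, for checking your work:
tags_long = df.explode('tags', ignore_index=True)
tags_long = tags_long.dropna(subset=['tags'])
tag_counts = tags_long['tags'].value_counts()  # sorted descending by default
print(tag_counts)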
Exercise 2 — Normalize nested payload and unnest events
Given:
data = pd.DataFrame({
    'payload': [
        {
            'user': {'id': 1, 'city': 'Paris'},
            'events': [{'type': 'view', 'value': 3}, {'type': 'buy', 'value': 1}]
        },
        {
            'user': {'id': 2, 'city': 'Rome'},
            'events': [{'type': 'view', 'value': 5}]
        }
    ]
})
# Task:
# 1) Extract rows for each event with json_normalize(record_path='events', meta=...).
# 2) Keep user.id and user.city as meta (list paths like ['user', 'id'] produce those dotted names automatically).
# 3) Result columns: type, value, user.id, user.city
- [ ] Used pd.json_normalize with record_path='events'
- [ ] Passed meta for user fields
- [ ] Got tidy result with one row per event
Hint
First convert the column to a list: payloads = data['payload'].tolist(), then use json_normalize on it.
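One possible solution sketch:
payloads = data['payload'].tolist()
events = pd.json_normalize(
    payloads,
    record_path='events',
    meta=[['user', 'id'], ['user', 'city']]
)
# Columns: type, value, user.id, user.city
print(events)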
Common mistakes and self-checks
- Forgetting to drop NaN after explode. Self-check: Do you have rows where the exploded column is NaN? Filter them.
- Exploding a dict directly. Self-check: explode expects list-like; iterating a dict yields its keys, which is rarely what you want (see the sketch after this list).
- Not preserving context when normalizing. Self-check: Use meta fields (like order_id, customer info) so you can aggregate later.
- Name collisions after normalize. Self-check: Use meta_prefix and review columns for duplicates.
- Assuming numeric types after flattening. Self-check: Inspect dtypes and use nullable types (e.g., Int64) if missing values exist.
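The dict-explode gotcha from the list above, as a two-line sketch:
s = pd.Series([{'a': 1, 'b': 2}])
print(s.explode())  # yields the keys 'a' and 'b', not the values
# For dict columns, use pd.json_normalize(s.tolist()) instead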
Quick dtype fix patterns
# After normalization
cols = ['qty', 'price']
for c in cols:
    flat[c] = pd.to_numeric(flat[c], errors='coerce')
flat['qty'] = flat['qty'].astype('Int64')
Practical projects
- E-commerce order lines: explode items from orders, compute item-level revenue, and top SKUs.
- Marketing events: normalize nested event payloads and build a funnel conversion table.
- Survey analysis: explode multi-select answers and chart most common choices by segment.
Learning path
- Before this: Reading files (CSV/JSON), Selecting/Filtering, Basic GroupBy.
- This subskill: explode, json_normalize, tidy nested structures.
- Next: Merging/Joining datasets, Aggregations, and Time-based analytics.
Next steps
- Refactor your data cleaning scripts to handle nested columns up-front.
- Create a utility function that normalizes a payload and documents the meta fields used (a sketch follows this list).
- Practice on one real dataset with both explode and json_normalize.
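For the utility-function idea above, a minimal sketch (the function and parameter names are placeholders, not a fixed API):
import pandas as pd

def normalize_payload(records, record_path, meta_fields):
    """Flatten a list of nested dicts: one row per record under record_path,
    with meta_fields (e.g. ['order_id', ['customer', 'id']]) repeated per row."""
    return pd.json_normalize(records, record_path=record_path, meta=meta_fields)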
Mini challenge
You receive a daily JSON dump of sessions where each session has user info and a list of actions with timestamps. Produce a table of one action per row including user_id, session_id, action_type, and minute-of-day. Then find top 3 actions by city.
Plan it
- json_normalize with record_path='actions' and meta=['session_id', ['user', 'id'], ['user', 'city']].
- Convert the timestamp to datetime, then compute minute-of-day as ts.dt.hour * 60 + ts.dt.minute.
- Group by city and action_type; take top 3.
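A sketch of that plan, assuming the dump parses to a list called sessions with fields named type and ts (adjust to your real schema):
import pandas as pd

sessions = [
    {
        'session_id': 's1',
        'user': {'id': 1, 'city': 'Paris'},
        'actions': [
            {'type': 'view', 'ts': '2024-05-01T09:15:00'},
            {'type': 'click', 'ts': '2024-05-01T09:17:00'}
        ]
    }
]
actions = pd.json_normalize(
    sessions,
    record_path='actions',
    meta=['session_id', ['user', 'id'], ['user', 'city']]
)
actions['ts'] = pd.to_datetime(actions['ts'])
actions['minute_of_day'] = actions['ts'].dt.hour * 60 + actions['ts'].dt.minute
# Top 3 action types per city: count, sort, then keep 3 per city group
top3 = (
    actions.groupby(['user.city', 'type'])
    .size()
    .sort_values(ascending=False)
    .groupby(level='user.city')
    .head(3)
)
print(top3)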