Why this matters
As a Data Scientist, you frequently work with millions of rows. Inefficient pandas or NumPy code can turn a 10-second task into a 10-minute blocker or, worse, crash your notebook by exhausting memory. Mastering performance and memory optimization lets you:
- Clean and aggregate large datasets without timeouts.
- Engineer features quickly for iterative modeling.
- Run experiments on a laptop instead of needing a cluster.
- Ship robust data pipelines that don’t silently duplicate data in memory.
Concept explained simply
Core idea: move work from slow Python loops to fast vectorized operations in pandas/NumPy, and store data in the smallest safe dtypes.
Mental model
Think of Python loops as you moving bricks one by one. Vectorized operations are a forklift: the same job but hundreds of bricks at once. Memory optimization is choosing the right-sized boxes for those bricks. Too big wastes space; too small breaks. Aim for the smallest box that fits (correct dtype), and use the forklift (vectorized ops).
Key techniques you will use
- Vectorize instead of loops: prefer column-wise arithmetic, .str, .dt, and NumPy ufuncs.
- Choose compact dtypes: use int8/16/32, float32 where precision allows; convert low-cardinality object columns to category.
- Avoid unnecessary copies: use .loc for assignment; avoid chained indexing; use .to_numpy(copy=False) when safe.
- Measure memory: df.info(memory_usage='deep'), df.memory_usage(deep=True).sum().
- Lean groupby: prefer built-in aggregations (sum, mean, size, nunique) and .agg with native funcs; avoid row-wise .apply.
- Chunk large files: pd.read_csv(..., chunksize=...), process and reduce before concatenating.
- Efficient joins: ensure join keys are the same dtype; consider categoricals for repeated keys; merge only required columns.
- String ops: use vectorized .str methods; avoid Python loops over rows.
- Sparse/zero-heavy data: consider pandas' Sparse dtypes or SciPy sparse matrices when values are mostly zeros.
- Sorting and indexing: sort_index/sort_values only when needed; set an index if repeatedly selecting by that key. (Several of these techniques are sketched below.)
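A minimal sketch touching several of the items above (vectorized .str/.dt, .loc assignment, category and Sparse dtypes, memory measurement). The events frame is made up purely for illustration.
import pandas as pd
events = pd.DataFrame({
    'email': ['a@x.com', 'b@y.org', 'c@x.com', 'd@z.net'],
    'ts': pd.to_datetime(['2024-01-01', '2024-01-15', '2024-02-01', '2024-02-03']),
    'clicks': [0, 3, 0, 0],
})
# Vectorized .str and .dt instead of looping over rows
events['domain'] = events['email'].str.split('@').str[1]
events['month'] = events['ts'].dt.month
# One explicit .loc write (no chained indexing)
events.loc[events['clicks'] > 2, 'clicks'] = 2  # cap outliers
# Low-cardinality strings -> category; mostly-zero numbers -> sparse
events['domain'] = events['domain'].astype('category')
events['clicks'] = events['clicks'].astype(pd.SparseDtype('int64', 0))
# Measure memory per column
print(events.memory_usage(deep=True))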
In-place vs copy in pandas
inplace=True does not guarantee faster or lower memory operations in modern pandas; it often creates a new object under the hood. Prefer assignment to a new variable and measure.
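A small sketch contrasting the two styles; measure on your own data before assuming inplace saves anything.
import pandas as pd
df = pd.DataFrame({'a': [1, 2, 3], 'b': [4.0, 5.0, 6.0]})
# Explicit assignment: result bound to a clear name, easy to chain and test
trimmed = df.drop(columns=['b'])
# In-place style: mutates df and returns None, often with no memory benefit
# df.drop(columns=['b'], inplace=True)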
A reliable optimization workflow
- Measure first: time your current code (small dry-run) and check memory with df.info(memory_usage='deep'); a timing sketch follows this list.
- Eliminate Python loops: replace with vectorized NumPy/pandas operations.
- Fix dtypes: downcast numerics; convert object to category where appropriate.
- Optimize I/O: read only needed columns (usecols), parse dates selectively, use chunksize for large files.
- Re-measure and validate correctness on a sample, then on full data.
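A minimal timing-and-memory sketch using time.perf_counter on a made-up million-row frame; in a notebook, %timeit works just as well.
import time
import numpy as np
import pandas as pd
df = pd.DataFrame({'y': np.random.default_rng(0).random(1_000_000)})
df.info(memory_usage='deep')              # baseline memory
start = time.perf_counter()
slow = [v * 2 for v in df['y']]           # Python loop over a Series
print('loop      :', round(time.perf_counter() - start, 4), 's')
start = time.perf_counter()
fast = df['y'] * 2                        # vectorized
print('vectorized:', round(time.perf_counter() - start, 4), 's')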
Worked examples
Example 1 — Replace loops with vectorization
Goal: standardize a column and compute a ratio without loops.
import pandas as pd
import numpy as np
df = pd.DataFrame({'x':[1,2,3,4,5],'y':[10,20,30,40,50]})
# Slow pattern (avoid)
# z = []
# for v in df['y']:
#     z.append((v - df['y'].mean()) / df['y'].std(ddof=0))
# df['z'] = z
# df['ratio'] = [a/b for a,b in zip(df['y'], df['x'])]
# Vectorized
y = df['y'].to_numpy()
df['z'] = (y - y.mean()) / y.std(ddof=0)
df['ratio'] = df['y'] / df['x']
print(df)
Example 2 — Reduce memory via dtypes
df = pd.DataFrame({
    'user_id': [101, 101, 102, 103, 103, 103],
    'city': ['NY', 'NY', 'LA', 'LA', 'LA', 'SF'],
    'amount': [12.5, 9.0, 3.0, 1.0, 0.0, 5.5],
    'flag': [0, 1, 0, 0, 1, 0]
})
print('Before bytes:', df.memory_usage(deep=True).sum())
# Optimize
opt = df.copy()
opt['user_id'] = opt['user_id'].astype('int16')
opt['city'] = opt['city'].astype('category')
opt['amount'] = opt['amount'].astype('float32')
opt['flag'] = opt['flag'].astype('int8') # or bool
print('After bytes:', opt.memory_usage(deep=True).sum())
print(opt.dtypes)
Expected: noticeable drop in bytes; dtypes are int16, category, float32, int8/bool.
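As an alternative to hand-picking widths, pd.to_numeric can downcast automatically based on the values it sees (which may be too small for future data, so check your value ranges). A short sketch reusing df from this example:
auto = df.copy()
auto['user_id'] = pd.to_numeric(auto['user_id'], downcast='integer')  # picks int8 here
auto['amount'] = pd.to_numeric(auto['amount'], downcast='float')      # picks float32
auto['flag'] = pd.to_numeric(auto['flag'], downcast='integer')
auto['city'] = auto['city'].astype('category')
print(auto.dtypes)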
Example 3 — Faster groupby aggregations
sales = pd.DataFrame({
    'city': ['NY', 'NY', 'LA', 'LA', 'SF', 'SF', 'SF'],
    'amount': [12.5, 9.0, 3.0, 1.0, 8.0, 2.0, 5.0]
})
# Slow-ish: apply with custom row-wise lambda (avoid)
# res = sales.groupby('city').apply(lambda g: pd.Series({
#     'sum': g['amount'].sum(),
#     'avg': g['amount'].mean(),
#     'n': g['amount'].size
# }))
# Fast: built-ins
res = sales.groupby('city')['amount'].agg(sum='sum', avg='mean', n='size')
print(res)
Hands-on exercises
Try these in your environment. They mirror the worked examples above, so you can check your solutions against them.
Exercise 1 — Vectorize and standardize
Create a DataFrame df = pd.DataFrame({'x':[1,2,3,4,5],'y':[10,20,30,40,50]}). Add:
- z: standardized y using population std (ddof=0)
- ratio: y divided by x
- Use no Python loops or row-wise apply.
Hint
Use to_numpy() or operate directly on the Series. For z: (y - y.mean()) / y.std(ddof=0).
Exercise 2 — Reduce memory safely
With df = pd.DataFrame({'user_id':[101,101,102,103,103,103],'city':['NY','NY','LA','LA','LA','SF'],'amount':[12.5,9.0,3.0,1.0,0.0,5.5],'flag':[0,1,0,0,1,0]}):
- Downcast user_id to smallest safe integer.
- Convert city to category.
- Downcast amount to float32; flag to int8 or bool.
- Verify dtypes and compare memory before/after with memory_usage(deep=True).
Hint
Use astype for explicit casts; print dtypes before and after. Compare byte counts.
Checklist before you move on
- I replaced loops with vectorized operations.
- I measured memory usage with deep=True.
- I used the smallest safe dtypes.
- I avoided chained indexing and used .loc for assignments.
- I used built-in aggregations for groupby.
Common mistakes and self-check
- Using row-wise apply for arithmetic. Self-check: can this be column arithmetic or a NumPy ufunc?
- Forgetting ddof when standardizing. Self-check: choose ddof=0 (population) or ddof=1 (sample) explicitly.
- Blindly using inplace=True. Self-check: measure; prefer clear assignments.
- Leaving object dtype for repeated strings. Self-check: is cardinality low? Convert to category.
- Mismatched join dtypes causing slow merges. Self-check: compare df1[key].dtype vs df2[key].dtype.
- Chained indexing leading to copies and warnings. Self-check: use df.loc[row_selector, col] = value. (This and the join-key check are sketched below.)
- Reading entire files when you need a few columns. Self-check: use usecols and chunksize.
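Two of these self-checks as a minimal sketch on made-up frames:
import pandas as pd
df = pd.DataFrame({'x': [1, -2, 3], 'y': [0, 0, 0]})
# Chained indexing may write to a temporary copy and trigger a warning:
# df[df['x'] > 0]['y'] = 1            # avoid
df.loc[df['x'] > 0, 'y'] = 1          # one explicit .loc write
facts = pd.DataFrame({'key': [1, 2, 3], 'val': [10, 20, 30]})
dims = pd.DataFrame({'key': ['1', '2'], 'name': ['a', 'b']})
print(facts['key'].dtype, dims['key'].dtype)   # int64 vs object: align before merging
dims['key'] = dims['key'].astype('int64')
print(facts.merge(dims, on='key', how='left'))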
Practical projects
- Retail rollups: ingest a multi-million-row CSV of transactions in chunks, downcast dtypes, compute daily city-level totals, and write a compact parquet file.
- Feature store preview: from a wide user events table, convert suitable strings to category, build 5 vectorized features (rates, rolling means), and validate memory impact.
- Join optimization: join a 5M-row fact table to two small dimensions. Align dtypes, select only needed columns, and benchmark merge times before/after.
Learning path
- Now: Performance Optimization and Memory (this page).
- Next: Efficient joins, indexes, and time-series windows.
- Then: Vectorized feature engineering patterns and scaling to out-of-core workflows.
Who this is for
Data Scientists and Analysts using pandas/NumPy who need faster pipelines and want to fit larger datasets in memory without sacrificing correctness.
Prerequisites
- Python basics: functions, loops, and comprehensions (you’ll replace loops, but knowing them helps).
- Comfort with pandas DataFrame/Series basics and NumPy arrays.
- Basic statistics (mean, std) and aggregations.
Mini challenge
You receive a 12M-row clickstream CSV with columns: user_id (int), ts (ISO string), url (string), device (5 categories), dwell_sec (float), and referrer (mostly empty). Outline a plan to load, reduce memory, and compute per-user average dwell time by device.
Suggested approach
- Read in chunks with usecols to skip referrer if unneeded; parse dates for ts only if required for grouping.
- Downcast numerics (user_id to int32/int16 if safe; dwell_sec to float32).
- Convert device to category; convert url to category only if its cardinality is very low (unlikely here), or skip loading it entirely for this task.
- Aggregate per chunk: compute the sum and count of dwell_sec per (user_id, device) rather than per-chunk means, so partial results combine exactly; then concatenate the chunk results, group again, sum, and divide for the final mean (see the sketch after this list).
- Write results to parquet.
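A minimal sketch of this plan. The file name clicks.csv, the chunk size, and the exact dtypes are assumptions for illustration.
import pandas as pd
dtypes = {'user_id': 'int32', 'device': 'category', 'dwell_sec': 'float32'}
parts = []
for chunk in pd.read_csv('clicks.csv',
                         usecols=['user_id', 'device', 'dwell_sec'],
                         dtype=dtypes, chunksize=1_000_000):
    # Keep sums and counts so per-chunk results can be combined exactly
    parts.append(chunk.groupby(['user_id', 'device'], observed=True)['dwell_sec']
                      .agg(['sum', 'count']))
combined = pd.concat(parts).groupby(['user_id', 'device'], observed=True).sum()
combined['avg_dwell_sec'] = combined['sum'] / combined['count']
# Writing to parquet requires an engine such as pyarrow
combined[['avg_dwell_sec']].to_parquet('avg_dwell_by_user_device.parquet')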
Next steps
Take the quick test to lock in the concepts. Note: the test is available to everyone; only logged-in users will have their progress saved.