
Performance Optimization And Memory

Learn Performance Optimization And Memory for free with explanations, exercises, and a quick test (for Data Scientists).

Published: January 1, 2026 | Updated: January 1, 2026

Why this matters

As a Data Scientist, you frequently work with millions of rows. Inefficient pandas or NumPy code can turn a 10-second task into a 10-minute blocker, or worse, crash your notebook by running out of memory. Mastering performance and memory lets you:

  • Clean and aggregate large datasets without timeouts.
  • Engineer features quickly for iterative modeling.
  • Run experiments on a laptop instead of needing a cluster.
  • Ship robust data pipelines that don’t silently duplicate data in memory.

Concept explained simply

Core idea: move work from slow Python loops to fast vectorized operations in pandas/NumPy, and store data in the smallest safe dtypes.

Mental model

Think of Python loops as you moving bricks one by one. Vectorized operations are a forklift: the same job but hundreds of bricks at once. Memory optimization is choosing the right-sized boxes for those bricks. Too big wastes space; too small breaks. Aim for the smallest box that fits (correct dtype), and use the forklift (vectorized ops).

Key techniques you will use

  • Vectorize instead of loops: prefer column-wise arithmetic, .str, .dt, and NumPy ufuncs.
  • Choose compact dtypes: use int8/16/32, float32 where precision allows; convert low-cardinality object columns to category.
  • Avoid unnecessary copies: use .loc for assignment; avoid chained indexing; use .to_numpy(copy=False) when safe.
  • Measure memory: df.info(memory_usage='deep'), df.memory_usage(deep=True).sum().
  • Lean groupby: prefer built-in aggregations (sum, mean, size, nunique) and .agg with native funcs; avoid row-wise .apply.
  • Chunk large files: pd.read_csv(..., chunksize=...), process and reduce each chunk before concatenating (see the chunking sketch after this list).
  • Efficient joins: ensure join keys are the same dtype; consider categoricals for repeated keys; merge only required columns.
  • String ops: use vectorized .str methods; avoid Python loops over rows.
  • Sparse/zero-heavy arrays: consider pandas' Sparse dtypes or SciPy sparse matrices when data is mostly zeros.
  • Sorting and indexing: sort_index/sort_values only when needed; set an index if repeatedly selecting by that key.
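
For the chunking technique above, here is a minimal sketch, assuming a hypothetical transactions.csv with city and amount columns: read only the columns you need, reduce each chunk to a small aggregate, then combine the pieces.

import pandas as pd

# Minimal chunking sketch; the file name and columns are placeholders for your data.
partial_sums = []
for chunk in pd.read_csv('transactions.csv',
                         usecols=['city', 'amount'],   # read only what you need
                         dtype={'amount': 'float32'},
                         chunksize=1_000_000):
    # Reduce each chunk to a small aggregate before keeping it around
    partial_sums.append(chunk.groupby('city')['amount'].sum())

# Combine the per-chunk reductions into the final result
totals = pd.concat(partial_sums).groupby(level=0).sum()
print(totals)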

In-place vs copy in pandas

inplace=True does not guarantee faster or lower-memory operations in modern pandas; many methods still create a new object under the hood. Prefer assigning the result to a variable, and measure.
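
A tiny illustration of the preferred pattern (the column name is arbitrary):

import pandas as pd

df = pd.DataFrame({'old_name': [1, 2, 3]})

# Mutating "in place" is not inherently faster or cheaper:
# df.rename(columns={'old_name': 'new_name'}, inplace=True)

# Preferred: plain assignment is typically just as fast, easier to read, and easy to profile
df = df.rename(columns={'old_name': 'new_name'})
print(df.columns.tolist())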

A reliable optimization workflow

  1. Measure first: time your current code (small dry-run) and check memory with df.info(memory_usage='deep') (see the measurement sketch after this list).
  2. Eliminate Python loops: replace with vectorized NumPy/pandas operations.
  3. Fix dtypes: downcast numerics; convert object to category where appropriate.
  4. Optimize I/O: read only needed columns (usecols), parse dates selectively, use chunksize for large files.
  5. Re-measure and validate correctness on a sample, then on full data.
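
A minimal measurement sketch for step 1, using synthetic data (sizes and column names are arbitrary):

import time
import numpy as np
import pandas as pd

# Synthetic data to measure against
df = pd.DataFrame({
    'key': np.random.choice(['a', 'b', 'c'], size=1_000_000),
    'val': np.random.rand(1_000_000),
})

# Memory: deep=True counts the real bytes held by object/string columns
print(df.memory_usage(deep=True).sum(), 'bytes')
df.info(memory_usage='deep')

# Time: wrap the operation you care about in a small dry run
start = time.perf_counter()
_ = df.groupby('key')['val'].mean()
print('groupby took', round(time.perf_counter() - start, 4), 'seconds')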

Worked examples

Example 1 — Replace loops with vectorization


Goal: standardize a column and compute a ratio without loops.

import pandas as pd
import numpy as np

df = pd.DataFrame({'x':[1,2,3,4,5],'y':[10,20,30,40,50]})

# Slow pattern (avoid)
# z = []
# for v in df['y']:
#     z.append((v - df['y'].mean()) / df['y'].std(ddof=0))
# df['z'] = z
# df['ratio'] = [a/b for a,b in zip(df['y'], df['x'])]

# Vectorized
y = df['y'].to_numpy()
df['z'] = (y - y.mean()) / y.std(ddof=0)
df['ratio'] = df['y'] / df['x']
print(df)

Example 2 — Reduce memory via dtypes

df = pd.DataFrame({
    'user_id':[101,101,102,103,103,103],
    'city':['NY','NY','LA','LA','LA','SF'],
    'amount':[12.5, 9.0, 3.0, 1.0, 0.0, 5.5],
    'flag':[0,1,0,0,1,0]
})
print('Before bytes:', df.memory_usage(deep=True).sum())

# Optimize
opt = df.copy()
opt['user_id'] = opt['user_id'].astype('int16')
opt['city'] = opt['city'].astype('category')
opt['amount'] = opt['amount'].astype('float32')
opt['flag'] = opt['flag'].astype('int8')  # or bool

print('After bytes:', opt.memory_usage(deep=True).sum())
print(opt.dtypes)

Expected: noticeable drop in bytes; dtypes are int16, category, float32, int8/bool.
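
If you would rather not pick integer widths by hand, pd.to_numeric can downcast automatically; a small sketch reusing df from this example (automatic downcasting picks int8 for user_id here because the sample IDs fit):

# Automatic downcasting: pandas picks the smallest dtype that holds the values
auto = df.copy()
auto['user_id'] = pd.to_numeric(auto['user_id'], downcast='integer')  # int8 for 101-103
auto['flag'] = pd.to_numeric(auto['flag'], downcast='integer')        # int8
auto['amount'] = pd.to_numeric(auto['amount'], downcast='float')      # float32
auto['city'] = auto['city'].astype('category')
print(auto.dtypes)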

Example 3 — Faster groupby aggregations

sales = pd.DataFrame({
    'city':['NY','NY','LA','LA','SF','SF','SF'],
    'amount':[12.5, 9.0, 3.0, 1.0, 8.0, 2.0, 5.0]
})

# Slow-ish: apply with custom row-wise lambda (avoid)
# res = sales.groupby('city').apply(lambda g: pd.Series({
#     'sum': g['amount'].sum(),
#     'avg': g['amount'].mean(),
#     'n': g['amount'].size
# }))

# Fast: built-ins
res = sales.groupby('city')['amount'].agg(sum='sum', avg='mean', n='size')
print(res)

Hands-on exercises

Try these in your environment. They mirror the exercises below so you can check solutions.

Exercise 1 — Vectorize and standardize

Create a DataFrame df = pd.DataFrame({'x':[1,2,3,4,5],'y':[10,20,30,40,50]}). Add:

  • z: standardized y using population std (ddof=0)
  • ratio: y divided by x
  • Use no Python loops or row-wise apply.

Hint

Use to_numpy() or operate directly on the Series. For z: (y - y.mean()) / y.std(ddof=0).

Exercise 2 — Reduce memory safely

With df = pd.DataFrame({'user_id':[101,101,102,103,103,103],'city':['NY','NY','LA','LA','LA','SF'],'amount':[12.5,9.0,3.0,1.0,0.0,5.5],'flag':[0,1,0,0,1,0]}):

  • Downcast user_id to smallest safe integer.
  • Convert city to category.
  • Downcast amount to float32; flag to int8 or bool.
  • Verify dtypes and compare memory before/after with memory_usage(deep=True).

Hint

Use astype for explicit casts; print dtypes before and after. Compare byte counts.

Checklist before you move on

  • I replaced loops with vectorized operations.
  • I measured memory usage with deep=True.
  • I used the smallest safe dtypes.
  • I avoided chained indexing and used .loc for assignments.
  • I used built-in aggregations for groupby.

Common mistakes and self-check

  • Using row-wise apply for arithmetic. Self-check: can this be column arithmetic or a NumPy ufunc?
  • Forgetting ddof when standardizing. Self-check: choose ddof=0 (population) or ddof=1 (sample) explicitly.
  • Blindly using inplace=True. Self-check: measure; prefer clear assignments.
  • Leaving object dtype for repeated strings. Self-check: is cardinality low? Convert to category.
  • Mismatched join dtypes causing slow merges. Self-check: compare df1[key].dtype vs df2[key].dtype.
  • Chained indexing leading to copies and warnings. Self-check: use df.loc[row_selector, col] = value (see the sketch after this list).
  • Reading entire files when you need a few columns. Self-check: use usecols and chunksize.
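
A short sketch of the chained-indexing fix mentioned above, using toy data:

import pandas as pd

df = pd.DataFrame({'city': ['NY', 'LA', 'SF'], 'amount': [12.5, 3.0, 8.0]})

# Chained indexing may write into a temporary copy and trigger SettingWithCopyWarning:
# df[df['city'] == 'LA']['amount'] = 0.0

# A single .loc call is unambiguous and assigns into df itself:
df.loc[df['city'] == 'LA', 'amount'] = 0.0
print(df)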

Practical projects

  • Retail rollups: ingest a multi-million-row CSV of transactions in chunks, downcast dtypes, compute daily city-level totals, and write a compact parquet file.
  • Feature store preview: from a wide user events table, convert suitable strings to category, build 5 vectorized features (rates, rolling means), and validate memory impact.
  • Join optimization: join a 5M-row fact table to two small dimensions. Align dtypes, select only needed columns, and benchmark merge times before/after (see the dtype-alignment sketch below).
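
A minimal dtype-alignment sketch for the join project, with toy tables standing in for the fact and dimension data:

import pandas as pd

# Toy fact/dimension tables; names and columns are hypothetical
fact = pd.DataFrame({'city_id': [1, 2, 2, 3], 'amount': [10.0, 5.0, 7.5, 2.0]})
dim = pd.DataFrame({'city_id': [1, 2, 3], 'city': ['NY', 'LA', 'SF']})
dim['city_id'] = dim['city_id'].astype('int32')     # simulate a dtype mismatch

print(fact['city_id'].dtype, dim['city_id'].dtype)  # int64 vs int32: mismatched keys

# Align the key dtype once, then merge only the columns you actually need
dim['city_id'] = dim['city_id'].astype(fact['city_id'].dtype)
out = fact.merge(dim[['city_id', 'city']], on='city_id', how='left')
print(out)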

Learning path

  • Now: Performance Optimization and Memory (this page).
  • Next: Efficient joins, indexes, and time-series windows.
  • Then: Vectorized feature engineering patterns and scaling to out-of-core workflows.

Who this is for

Data Scientists and Analysts using pandas/NumPy who need faster pipelines and to fit larger datasets into memory without sacrificing correctness.

Prerequisites

  • Python basics: functions, loops, and comprehensions (you’ll replace loops, but knowing them helps).
  • Comfort with pandas DataFrame/Series basics and NumPy arrays.
  • Basic statistics (mean, std) and aggregations.

Mini challenge

You receive a 12M-row clickstream CSV with columns: user_id (int), ts (ISO string), url (string), device (5 categories), dwell_sec (float), and referrer (mostly empty). Outline a plan to load, reduce memory, and compute per-user average dwell time by device.

Suggested approach

  • Read in chunks with usecols to skip referrer if unneeded; parse dates for ts only if required for grouping.
  • Downcast numerics (user_id to int32/int16 if safe; dwell_sec to float32).
  • Convert device to category; consider url to category only if very low cardinality (likely not) or avoid loading it entirely for this task.
  • Aggregate per chunk: compute the sum and count of dwell_sec per (user_id, device) rather than a per-chunk mean (means from different chunks cannot simply be averaged), then combine the chunk results by summing sums and counts and dividing at the end (see the sketch below).
  • Write results to parquet.
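
A sketch of the chunked sum/count approach, assuming a placeholder clickstream.csv path and the columns described above (writing parquet requires pyarrow or fastparquet):

import pandas as pd

# Per-chunk sums and counts combine safely; per-chunk means would not
pieces = []
for chunk in pd.read_csv('clickstream.csv',
                         usecols=['user_id', 'device', 'dwell_sec'],
                         dtype={'user_id': 'int32', 'device': 'category',
                                'dwell_sec': 'float32'},
                         chunksize=1_000_000):
    grp = chunk.groupby(['user_id', 'device'], observed=True)['dwell_sec']
    pieces.append(grp.agg(total='sum', n='count'))

combined = pd.concat(pieces).groupby(level=['user_id', 'device']).sum()
combined['avg_dwell_sec'] = combined['total'] / combined['n']
combined.reset_index().to_parquet('avg_dwell_by_user_device.parquet')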

Next steps

Take the quick test to lock in the concepts. Note: the test is available to everyone; only logged-in users will have their progress saved.

Practice Exercises

2 exercises to complete

Instructions

Create the DataFrame:

import pandas as pd
df = pd.DataFrame({'x':[1,2,3,4,5],'y':[10,20,30,40,50]})

Add two columns using vectorized operations only (no loops, no row-wise apply):

  • z: standardized y using population std (ddof=0)
  • ratio: y divided by x

Expected Output

A DataFrame with columns x, y, z, ratio where z ≈ [-1.414, -0.707, 0.000, 0.707, 1.414] and ratio = [10.0, 10.0, 10.0, 10.0, 10.0].

Performance Optimization And Memory — Quick Test

Test your knowledge with 8 questions. Pass with 70% or higher.
