
Memory Usage and Optimization

Learn Memory Usage and Optimization for free with explanations, exercises, and a quick test (for Data Analysts).

Published: December 20, 2025 | Updated: December 20, 2025

Why this matters

As a Data Analyst, you frequently load CSVs, join tables, and compute aggregates. Inefficient memory usage can cause slowdowns, crashes, or timeouts, especially on laptops or shared notebooks. Optimizing memory lets you:

  • Load larger datasets without running out of RAM.
  • Speed up joins, groupby, and sorting.
  • Reduce I/O time and file sizes when saving data.
  • Make dashboards and analyses more responsive.
Real tasks where this helps
  • Importing 5–10M row transaction logs for a cohort analysis.
  • Joining a customer table with events and computing retention.
  • Building daily KPIs from raw logs with limited compute.

Concept explained simply

Pandas stores data column by column. Different dtypes (int8, int16, int32, float32, category, datetime64[ns], bool, etc.) require different amounts of memory per value, so the same numbers can consume 2x–8x more memory than necessary if you use a larger-than-needed dtype.
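As a quick illustration, the same one million small integers occupy very different amounts of memory depending on dtype (a minimal, self-contained sketch with synthetic values):

import numpy as np
import pandas as pd

# One million small integers (0–99); pandas stores them as int64 by default
s = pd.Series(np.random.default_rng(0).integers(0, 100, size=1_000_000))

for dtype in ("int64", "int32", "int8"):
    mb = s.astype(dtype).memory_usage(deep=True) / 1024**2
    print(dtype, round(mb, 2), "MB")  # roughly 7.6, 3.8, and 0.95 MB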

Mental model

  • Measure first: like a budget, know where memory goes.
  • Match dtype to reality: pick the smallest type that safely fits your values.
  • Reduce repetition: convert repeated strings to category.
  • Be lazy with loading: read only what you need, in the right type.

Core techniques

1) Measure memory usage

# Deep includes Python object overhead for strings/objects
mem_bytes = df.memory_usage(deep=True).sum()
print(round(mem_bytes / 1024**2, 2), "MB")

# Quick overview
df.info(memory_usage='deep')

2) Downcast numerics safely

import pandas as pd

# For integers
df["quantity"] = pd.to_numeric(df["quantity"], downcast="integer")

# For floats
df["price"] = pd.to_numeric(df["price"], downcast="float")

Ensure the column has no unexpected decimals (for int) and values fit the range (e.g., int32: about −2.1B to 2.1B).
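Before a manual cast, a quick sanity check along these lines helps avoid silent truncation (a sketch, assuming the quantity column from the snippet above):

q = df["quantity"].dropna()
no_decimals = (q % 1 == 0).all()          # True if every value is a whole number
fits_int32 = q.min() >= -2_147_483_648 and q.max() <= 2_147_483_647
print("safe to store as int32:", bool(no_decimals and fits_int32))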

3) Use category for low-cardinality text

# Good for columns like country, status, product_id as string
low_card = df["country"].nunique() / len(df) < 0.5
if low_card:
    df["country"] = df["country"].astype("category")

Category stores a dictionary of unique values plus integer codes, saving memory and speeding up groupby/joins.
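You can inspect both pieces directly; the comparison below is a sketch that assumes the country column from the snippet above:

cat = df["country"].astype("category")
print(cat.cat.categories)   # the dictionary of unique values
print(cat.cat.codes.dtype)  # compact integer codes (int8 when there are few categories)

obj_mb = df["country"].astype("object").memory_usage(deep=True) / 1024**2
cat_mb = cat.memory_usage(deep=True) / 1024**2
print(round(obj_mb, 2), "MB as object ->", round(cat_mb, 2), "MB as category")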

4) Booleans and datetimes

# Convert 0/1 or yes/no into bool or category
map_yes_no = {"yes": True, "no": False}
df["is_active"] = df["is_active"].map(map_yes_no).astype("boolean")  # nullable boolean

# Parse dates once, store as datetime64[ns]
df["event_time"] = pd.to_datetime(df["event_time"], errors="coerce", utc=True)

5) Sparse for many zeros

import pandas as pd
from pandas.api.types import is_numeric_dtype

for col in df.columns:
    if is_numeric_dtype(df[col]) and (df[col] == 0).mean() > 0.9:
        df[col] = df[col].astype(pd.SparseDtype("float32", fill_value=0))

Sparse columns store only non-fill values, reducing memory for highly sparse matrices.
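A self-contained comparison on synthetic data (95% zeros) shows the effect:

import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
values = np.where(rng.random(1_000_000) < 0.95, 0.0, 1.0).astype("float32")

dense = pd.Series(values)
sparse = dense.astype(pd.SparseDtype("float32", fill_value=0))

print(round(dense.memory_usage(deep=True) / 1024**2, 2), "MB dense")
print(round(sparse.memory_usage(deep=True) / 1024**2, 2), "MB sparse")
print("density:", round(sparse.sparse.density, 3))  # fraction of stored (non-zero) values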

6) I/O optimizations when reading data

  • Select columns you actually need: usecols=[...].
  • Set dtypes on read for big wins: dtype={"id": "int32", "country": "category"}.
  • Read in chunks: read_csv(..., chunksize=500_000) to process piece by piece (see the sketch after this list).
  • Parse dates during read: parse_dates=["event_time"].
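Putting the chunked approach together, a piece-by-piece aggregation might look like this (a sketch; the file name and column names are hypothetical):

import pandas as pd

dtypes = {"user_id": "int32", "country": "category", "amount": "float32"}
partials = []
for chunk in pd.read_csv(
    "transactions.csv",
    usecols=list(dtypes) + ["event_time"],
    dtype=dtypes,
    parse_dates=["event_time"],
    chunksize=500_000,
):
    # Aggregate each chunk and keep only the small result in memory
    partials.append(chunk.groupby("country", observed=True)["amount"].sum())

total_by_country = pd.concat(partials).groupby(level=0).sum()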

7) Clean up and reduce copies

  • Drop temporary columns: df.drop(columns=[...], inplace=True).
  • Avoid keeping unnecessary copies alive: reassign the result (df = df.drop(columns=[...])) instead of holding both the raw and transformed frames; note that inplace=True does not guarantee the operation avoids an internal copy.
  • Garbage-collect after big deletions: in long scripts, import gc; del temp; gc.collect() (see the sketch below).
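A sketch of the delete-and-collect pattern (the events frame and column names here are hypothetical):

import gc

# Large intermediate result, only needed to build the daily KPI
tmp = df.merge(events, on="user_id", how="inner")
daily_kpi = tmp.groupby(tmp["event_time"].dt.date)["amount"].sum()

del tmp        # drop the only reference to the big intermediate
gc.collect()   # encourage Python to release the memory promptly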

Worked examples

Example 1 — Measure and reduce memory

import numpy as np
import pandas as pd

n = 1_000_000
rng = np.random.default_rng(0)

df = pd.DataFrame({
    "user_id": rng.integers(1, 10_000, size=n),
    "country": rng.choice(["US","DE","IN","BR","FR"], size=n),
    "amount": rng.normal(50, 10, size=n),
    "is_new": rng.choice(["yes","no"], size=n)
})

before = df.memory_usage(deep=True).sum()

# Optimize
df["user_id"] = pd.to_numeric(df["user_id"], downcast="integer")
df["amount"] = pd.to_numeric(df["amount"], downcast="float")
df["country"] = df["country"].astype("category")
df["is_new"] = df["is_new"].map({"yes": True, "no": False}).astype("boolean")

after = df.memory_usage(deep=True).sum()
print(round(before/1024**2,2), "MB ->", round(after/1024**2,2), "MB")

Result: typically a 40–70% reduction depending on data.

Example 2 — Safer integer downcast

# Detect best integer dtype by ranges
c = df["user_id"]
print(c.min(), c.max())  # ensure it fits into int16 or int32

# If max <= 32767 and min >= -32768:
df["user_id"] = df["user_id"].astype("int16")  # smaller than int64

Checking the range explicitly is deterministic and avoids silent overflow.
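If you do this often, a small helper that picks the dtype from the observed range keeps the check explicit (a sketch; it assumes the series has no missing values, and the function name is made up):

import numpy as np

def smallest_int_dtype(s):
    """Return the smallest signed integer dtype that can hold s.min()..s.max()."""
    lo, hi = s.min(), s.max()
    for dtype in ("int8", "int16", "int32", "int64"):
        info = np.iinfo(dtype)
        if lo >= info.min and hi <= info.max:
            return dtype
    return "int64"

df["user_id"] = df["user_id"].astype(smallest_int_dtype(df["user_id"]))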

Example 3 — Read only what you need

import pandas as pd

cols = ["user_id", "country", "amount", "event_time"]
dtypes = {"user_id": "int32", "country": "category", "amount": "float32"}

df = pd.read_csv(
    "transactions.csv",
    usecols=cols,
    dtype=dtypes,
    parse_dates=["event_time"],
)
df.info(memory_usage='deep')  # info() prints its report directly; no print() wrapper needed

Specifying dtypes at read-time avoids expensive type conversions later.

Exercises

These mirror the exercises below. You can complete them in a notebook or Python script.

  1. Optimize a synthetic DataFrame
    • Create 1M rows with columns: user_id (ints 1–50k), country (5 values), amount (floats), is_new ("yes"/"no").
    • Measure memory (deep) before and after optimizing dtypes (int32/float32/category/boolean).
    • Target: reduce by at least 50%.
  2. Detect and sparsify zeros
    • Create a numeric column with 95% zeros and random small positive numbers otherwise.
    • Convert to SparseDtype('float32', fill_value=0) and compare memory usage.
Self-check checklist
  • You used df.memory_usage(deep=True) and df.info(memory_usage='deep').
  • Numeric columns are downcast (int32/int16, float32) where safe.
  • Low-cardinality strings are category.
  • Binary flags are boolean or category.
  • Sparse conversion applied only when zeros ≥ 90%.


Common mistakes and how to self-check

  • Downcasting to int when the column has NaN: regular NumPy integers cannot hold NaN; use the nullable integer dtype (Int32) or keep float if NaNs are required (see the sketch after this list).
  • Using category for high-cardinality IDs: Category helps if repeats are common. If most values are unique, it may not save memory.
  • Forgetting deep=True: memory_usage() without deep underestimates object/string memory.
  • Parsing dates late: Converting dates after read can spike memory. Parse during read_csv.
  • Accidental copies: Chaining operations can create copies. Assign once when possible.
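For the first point, the nullable integer dtype looks like this (a minimal sketch with made-up values):

import numpy as np
import pandas as pd

s = pd.Series([1, 2, np.nan, 4])   # stored as float64 because of the NaN
# s.astype("int32") would raise; the nullable extension dtype keeps the missing value
s_nullable = s.astype("Int32")
print(s_nullable.dtype)            # Int32
print(s_nullable.isna().sum())     # 1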
Quick self-audit steps
  1. Run df.info(memory_usage='deep') and scan for object dtypes.
  2. Check .nunique() ratios to decide category.
  3. Validate numeric ranges before downcasting.
  4. Compare memory before/after and keep a note of changes (a helper sketch for steps 1–2 follows this list).
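Steps 1–2 can be wrapped in a small helper (a sketch; the function name is made up):

def memory_audit(df):
    """Print per-column dtype, deep memory in MB, and unique-value ratio."""
    n = max(len(df), 1)
    for col in df.columns:
        mb = df[col].memory_usage(deep=True) / 1024**2
        ratio = df[col].nunique(dropna=True) / n
        print(f"{col:<20} {str(df[col].dtype):<14} {mb:8.2f} MB  unique ratio {ratio:.2f}")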

Practical projects

  • Memory-optimized customer analytics: Load a large synthetic customer-event dataset, optimize types, compute retention by country, and plot memory savings per step.
  • Zero-inflated features report: Generate 20 numeric columns with varying sparsity; automatically convert highly sparse ones and report memory saved.
  • Optimized import pipeline: Build a function wrapping read_csv that takes usecols, dtype, and parse_dates, returns an optimized DataFrame, and logs memory.

Learning path

  • Learn pandas basics: indexing, selecting columns, filtering.
  • Master dtypes: numeric families, category, datetime, boolean, nullable types.
  • File I/O patterns: chunked reads, dtype hints, parse_dates, compression.
  • Performance: vectorization, avoiding loops, efficient groupby and merge.

Next steps

  • Practice with the exercises above.
  • Apply optimizations to one of your real datasets.
  • Take the quick test to confirm you can choose the right dtype and spot heavy columns.

Quick Test

Take the test below to check your understanding.

Mini challenge

You receive a 5M-row dataset with columns: order_id (numeric ID), country (text, ~20 values), is_return (0/1), discount (float), event_time (string timestamps). Describe the exact dtype choices you would apply on read, and one extra optimization to reduce memory further.

Practice Exercises

2 exercises to complete

Instructions

  1. Create a 1,000,000-row DataFrame with columns: user_id (1–50,000), country (5 values), amount (normal distribution), is_new ("yes"/"no").
  2. Print memory (MB) using df.memory_usage(deep=True).sum() before and after optimization.
  3. Optimize with: user_id → int32, amount → float32, country → category, is_new → boolean.
  4. Show final df.info(memory_usage='deep').
Expected Output
Clear memory reduction (target ≥ 50%). Dtypes should include int32/float32/category/boolean.

Memory Usage and Optimization — Quick Test

Test your knowledge with 6 questions. Pass with 70% or higher.

