Why this matters
For a Data Analyst, speed and memory efficiency directly affect how quickly you can explore data, iterate on insights, and deliver dashboards or reports. Real tasks include profiling slow joins, accelerating groupby aggregations, and reading large CSVs without crashing your notebook.
- Clean and analyze millions of rows without running out of memory.
- Speed up feature calculations and aggregations for ad-hoc analysis.
- Shorten ETL jobs so stakeholders get results faster.
Typical pain points you can solve
- Row-wise apply taking minutes on medium datasets.
- CSV loads consuming several GBs of RAM.
- groupby operations running slowly due to unsuitable dtypes.
- merge operations slowed by mismatched join keys and unsorted data.
Concept explained simply
Profiling is measuring where time and memory go. Speedups are the targeted changes you make once you know the bottleneck.
Mental model
- Measure first: identify the slowest 1–2 steps.
- Fix the biggest wins: vectorize, change dtypes, reduce data early.
- Re-measure to confirm the speedup and correctness.
Key tools (no special setup)
- Timing: use small repeatable tests with time.perf_counter in scripts or %timeit in notebooks (quick sketch after this list).
- Memory: df.info(memory_usage='deep') and df.memory_usage(deep=True).
- Row counts: confirm filters and joins with sanity checks like df.shape and value_counts().
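A minimal sketch of these tools on a synthetic frame (the column names and sizes here are made up for illustration; timings will vary by machine):
import time
import numpy as np
import pandas as pd

# Synthetic data just to have something to measure
df = pd.DataFrame({
    'city': np.random.choice(['NY', 'LA'], size=100_000),
    'amount': np.random.rand(100_000)
})

# Timing: wrap only the statement you care about
start = time.perf_counter()
totals = df.groupby('city')['amount'].sum()
print(f"groupby took {time.perf_counter() - start:.4f}s")

# Memory: info() prints per-column usage; memory_usage() returns the raw numbers
df.info(memory_usage='deep')
print(df.memory_usage(deep=True).sum(), "bytes total")

# Sanity checks: confirm shape and distributions before and after changes
print(df.shape)
print(df['city'].value_counts())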
Speedup toolbox
- Vectorize instead of row-wise apply/iterrows for arithmetic and boolean logic.
- Use built-in pandas ops: groupby, merge, join, assign, transform, where, clip.
- Reduce memory: downcast numerics, convert low-cardinality strings to category.
- Read efficiently: specify usecols, dtype, parse_dates; use chunksize for large files.
- Filter early: reduce rows/columns before expensive joins/aggregations.
- Align dtypes for joins: ensure both sides’ keys have the same dtype and, if helpful, are sorted or indexed (see the sketch after this list).
- Use .query/.eval for concise, vectorized expressions when applicable.
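For the dtype-alignment point above, a minimal sketch with made-up tables looks like this; mismatched key dtypes either raise errors or force extra conversion work during the merge:
import pandas as pd

# Hypothetical tables whose join keys arrive with different dtypes
transactions = pd.DataFrame({'user_id': ['1', '2', '2'], 'amount': [10.0, 5.0, 7.5]})  # string key
users = pd.DataFrame({'user_id': [1, 2], 'city': ['NY', 'LA']})                        # integer key

# Cast both keys to the same compact dtype before merging
transactions['user_id'] = transactions['user_id'].astype('int32')
users['user_id'] = users['user_id'].astype('int32')

result = transactions.merge(users, on='user_id', how='left')
print(result)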
When .apply is OK vs not
- OK: complex row-wise logic hard to vectorize; small data.
- Not OK: simple arithmetic, comparisons, or mapping you can do with vectorized operations or map/replace.
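To illustrate the "Not OK" case, here is the same lookup done with apply and with map; the dictionary values are made up:
import pandas as pd

s = pd.Series(['NY', 'LA', 'NY', 'SF'])
full_names = {'NY': 'New York', 'LA': 'Los Angeles', 'SF': 'San Francisco'}

slow = s.apply(lambda x: full_names[x])  # a Python function call per element
fast = s.map(full_names)                 # vectorized dictionary lookup
assert slow.equals(fast)                 # same result, map scales much better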
Worked examples
Example 1 — Replace row-wise apply with vectorization
Goal: Compute net price and a segment label quickly.
import numpy as np
import pandas as pd
sales = pd.DataFrame({
    'qty': [2, 1, 10],
    'price': [5.0, 100.0, 1.5],
    'discount_pct': [0.10, 0.0, 0.20]
})
# Vectorized: whole-column arithmetic, no Python-level loop
sales['net'] = sales['qty'] * sales['price'] * (1 - sales['discount_pct'])
# Nested np.where: bulk if qty >= 5, else premium if price >= 50, else standard
sales['segment'] = np.where(
    sales['qty'] >= 5, 'bulk',
    np.where(sales['price'] >= 50, 'premium', 'standard')
)
print(sales)
Result is instant, with net values [9.0, 100.0, 12.0] and segments ['standard', 'premium', 'bulk'].
Why this is fast
Vectorization pushes work into optimized C loops under the hood instead of Python-level loops.
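A rough timing comparison on a larger synthetic frame (100,000 rows here; the data and timings are illustrative and will vary by machine):
import time
import numpy as np
import pandas as pd

big = pd.DataFrame({
    'qty': np.random.randint(1, 20, 100_000),
    'price': np.random.uniform(1, 200, 100_000),
    'discount_pct': np.random.uniform(0, 0.3, 100_000),
})

start = time.perf_counter()
slow = big.apply(lambda r: r['qty'] * r['price'] * (1 - r['discount_pct']), axis=1)
print(f"apply:      {time.perf_counter() - start:.3f}s")

start = time.perf_counter()
fast = big['qty'] * big['price'] * (1 - big['discount_pct'])
print(f"vectorized: {time.perf_counter() - start:.3f}s")

assert np.allclose(slow, fast)  # same numbers, very different runtime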
Example 2 — Cut memory in half with dtypes
df = pd.DataFrame({
    'id': [1, 2, 3, 4],
    'amount': [12.5, 0.0, 9999.99, -3.25],
    'city': ['NY', 'LA', 'NY', 'LA']
})
# Before (info() prints directly, so no print() wrapper is needed)
df.info(memory_usage='deep')
# Optimize: downcast numerics, convert the repeated strings to category
df['id'] = pd.to_numeric(df['id'], downcast='integer')
df['amount'] = pd.to_numeric(df['amount'], downcast='float')
df['city'] = df['city'].astype('category')
# After
df.info(memory_usage='deep')
On typical datasets this reduces memory by 50–90% and speeds up groupby and merge operations on category columns.
Notes
- Downcasting floats can change stored values slightly; verify with an absolute or relative tolerance (e.g., np.allclose with rtol=1e-6), since a fixed abs(diff) <= 1e-6 check can fail for large magnitudes in float32 (see the check below).
- city as category stores each unique string once and represents rows as small integer codes under the hood.
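A quick before/after comparison that covers both notes; exact byte counts depend on your pandas version:
import numpy as np
import pandas as pd

raw = pd.DataFrame({
    'id': [1, 2, 3, 4],
    'amount': [12.5, 0.0, 9999.99, -3.25],
    'city': ['NY', 'LA', 'NY', 'LA']
})
optimized = raw.copy()
optimized['id'] = pd.to_numeric(optimized['id'], downcast='integer')
optimized['amount'] = pd.to_numeric(optimized['amount'], downcast='float')
optimized['city'] = optimized['city'].astype('category')

before_bytes = raw.memory_usage(deep=True).sum()
after_bytes = optimized.memory_usage(deep=True).sum()
print(f"{before_bytes} -> {after_bytes} bytes ({1 - after_bytes / before_bytes:.0%} saved)")

# Integrity: compare against the original with a relative tolerance (float32 rounding)
assert np.allclose(optimized['amount'], raw['amount'], rtol=1e-6)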
Example 3 — Fast CSV aggregation with chunks
import pandas as pd
from io import StringIO
csv_text = """
id,status,revenue
1,paid,10
2,unpaid,5
3,paid,7
4,paid,3
"""
# Stream the file in small chunks; only the needed columns and compact dtypes are loaded
paid_total = 0
for chunk in pd.read_csv(
    StringIO(csv_text),
    chunksize=2,
    usecols=['status', 'revenue'],
    dtype={'status': 'category', 'revenue': 'int32'},
):
    paid_total += chunk.loc[chunk['status'] == 'paid', 'revenue'].sum()
print(paid_total)  # 20
Chunking limits peak memory and keeps code responsive.
Step-by-step profiling recipe
- Reproduce the slowness on a small sample with the same structure.
- Measure baseline time and memory: time.perf_counter, df.info(memory_usage='deep') (a reusable timer sketch follows this list).
- Check dtypes and cardinality: df.dtypes, df.nunique().
- Apply one change at a time: vectorize, reduce columns (usecols), convert to category, align join dtypes.
- Re-measure and assert correctness with small test cases.
- Scale up gradually and validate results.
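One way to package this recipe is a small timer helper plus an explicit correctness check; the data and the single change (object to category) below are illustrative:
import time
from contextlib import contextmanager

import numpy as np
import pandas as pd

@contextmanager
def timed(label):
    # Minimal wall-clock timer for a block of code
    start = time.perf_counter()
    yield
    print(f"{label}: {time.perf_counter() - start:.3f}s")

# Synthetic sample with a low-cardinality string column
df = pd.DataFrame({
    'city': np.random.choice(['NY', 'LA', 'SF'], size=1_000_000),
    'amount': np.random.uniform(0, 100, size=1_000_000),
})
print(df.dtypes, df['city'].nunique())       # check dtypes and cardinality

with timed("groupby on object dtype"):       # baseline
    before = df.groupby('city')['amount'].sum()

df['city'] = df['city'].astype('category')   # one change
with timed("groupby on category dtype"):     # re-measure
    after = df.groupby('city')['amount'].sum()

assert np.allclose(before.sort_index(), after.sort_index())  # correctness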
Sanity checks to keep accuracy
- Row/column counts after filters and joins match expectations.
- Aggregates (sum, mean) before/after refactor are identical or within acceptable tolerance.
- Key columns have identical value_counts on samples.
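These checks can be written as plain assertions; result_old and result_new below are stand-ins for the outputs of the original and refactored pipelines:
import numpy as np
import pandas as pd

result_old = pd.DataFrame({'status': ['paid', 'paid', 'unpaid'], 'revenue': [10.0, 7.0, 5.0]})
result_new = result_old.copy()  # pretend this came from the refactored code

# Row/column counts match expectations
assert result_old.shape == result_new.shape

# Aggregates agree within tolerance
assert np.isclose(result_old['revenue'].sum(), result_new['revenue'].sum())

# Key columns have identical distributions
pd.testing.assert_series_equal(
    result_old['status'].value_counts(),
    result_new['status'].value_counts(),
)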
Exercises (do these in order)
These mirror the tasks below. Use the checklist to confirm your work.
Exercise 1 — Vectorize price calc and segmentation
Create sales as in Example 1. Compute columns net and segment using only vectorized operations (no apply/iterrows). Validate results.
- Vectorized only (no Python loops).
- net equals [9.0, 100.0, 12.0].
- segment equals ['standard','premium','bulk'].
Exercise 2 — Downcast and categorize
Use the sample df from Example 2. Downcast numerics and convert city to category. Show memory reduction and data integrity (floats within a small relative tolerance, e.g., np.allclose with rtol=1e-6).
- id is downcast to a smaller integer type (int8 on this sample).
- amount is float32.
- city is category with 2 categories.
- Memory reduced by at least ~50% on the sample.
Exercise 3 — Chunked paid revenue sum
Use the sample CSV in Example 3. Compute total revenue for status == 'paid' using chunksize, dtype, and usecols. Return 20.
- Use chunksize.
- Use dtype and usecols.
- Final sum is 20.
Common mistakes and self-check
- Using apply/iterrows for simple arithmetic. Self-check: can you express the logic with column operations, np.where, map, or replace?
- Forgetting to align join dtypes. Self-check: compare left.dtypes[key] vs right.dtypes[key]; cast to the same.
- Parsing entire CSVs when only a few columns are needed. Self-check: add usecols and re-measure.
- Leaving object strings when category fits. Self-check: if nunique/len < 0.5 and repeats exist, category likely helps (helper sketch below).
- Skipping correctness checks after refactoring. Self-check: assert baselines on small samples before scaling.
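A small helper for the category self-check above; the 0.5 ratio is the heuristic from the list, and the sample columns are made up:
import pandas as pd

def maybe_categorize(s: pd.Series, max_ratio: float = 0.5) -> pd.Series:
    # Convert object columns whose unique values are scarce relative to length
    if s.dtype == object and s.nunique() / len(s) < max_ratio:
        return s.astype('category')
    return s

df = pd.DataFrame({
    'city': ['NY', 'LA', 'NY', 'LA', 'NY'],   # 2 unique out of 5 -> category
    'note': ['a', 'b', 'c', 'd', 'e'],        # all unique -> left as object
})
for col in df.columns:
    df[col] = maybe_categorize(df[col])
print(df.dtypes)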
Quick fixes you can try first
- Filter early and drop unused columns.
- Downcast and categorize before groupby/merge.
- Use .query and .eval for concise vectorized expressions (sketch after this list).
- Prefer map/replace over apply for simple lookups.
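A small .query/.eval sketch; on a frame this tiny there is no speed gain, the benefit shows up on large frames where intermediate temporaries are avoided:
import pandas as pd

df = pd.DataFrame({'qty': [2, 1, 10], 'price': [5.0, 100.0, 1.5]})

# .query filters with a readable expression string
bulk = df.query('qty >= 5 and price < 10')

# .eval adds a derived column without building intermediate arrays by hand
df = df.eval('total = qty * price')
print(bulk, df, sep='\n')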
Practical projects
- Retail analytics mini-pipeline: read daily CSV sales with chunksize, compute category-level revenue, and write a compact Parquet output.
- Customer segmentation: derive segment labels using vectorized thresholds, compare runtime vs an apply baseline, and log speedup.
- Join optimization: merge transactions with products; profile mismatched dtypes vs aligned dtypes and category keys, and document the improvement.
Who this is for
- Data Analysts who already manipulate DataFrames and want faster pipelines.
- Anyone preparing for analytics interviews with performance-focused tasks.
Prerequisites
- Comfort with pandas basics: filtering, groupby, merge, dtypes.
- Basic Python and NumPy awareness.
Learning path
- Profile time and memory on a slow script.
- Apply vectorization and dtype optimization.
- Improve I/O: usecols, dtypes, parse_dates, chunksize.
- Refactor joins and groupbys for speed.
- Automate sanity checks for correctness.
Next steps
- Practice on your real datasets and log before/after metrics.
- Adopt a small helper module for profiling (timers, assertions, summaries).
- Move to columnar formats (e.g., Parquet) in your workflow to reduce repeated parsing cost.
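A minimal Parquet round-trip, assuming pyarrow or fastparquet is installed; the file name is made up:
import pandas as pd

# Write once, then reread without repeated CSV parsing
df = pd.DataFrame({'status': ['paid', 'unpaid', 'paid'], 'revenue': [10, 5, 7]})
df.to_parquet('sales.parquet')                                   # hypothetical path

# Column-selective read: only the requested columns are loaded
revenue = pd.read_parquet('sales.parquet', columns=['revenue'])
print(revenue['revenue'].sum())  # 22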
Mini challenge
You receive two 5M-row CSVs to join by user_id and date. In 3–5 steps, describe how you would profile and speed up the workflow. Include: dtypes, usecols, category vs int for keys, and one correctness check.
Example outline
- Read with usecols and dtypes (user_id as int32, date parsed; few strings as category).
- Filter early to the date range needed.
- Ensure both user_id columns share the same dtype; set_index or sort if beneficial.
- Merge and immediately validate row counts and sample spot-checks.
- Compare timing and memory before/after.
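A sketch of that outline using tiny in-memory stand-ins for the two CSVs; the file contents, column names, and date cutoff are made up, but the steps match the outline above:
import pandas as pd
from io import StringIO

tx_csv = StringIO("user_id,date,revenue,comment\n1,2024-01-02,10,ok\n2,2023-12-30,5,late\n2,2024-01-03,7,ok\n")
users_csv = StringIO("user_id,date,plan\n1,2024-01-02,pro\n2,2024-01-03,free\n")

# Read only the needed columns with compact dtypes
tx = pd.read_csv(tx_csv, usecols=['user_id', 'date', 'revenue'],
                 dtype={'user_id': 'int32', 'revenue': 'float32'}, parse_dates=['date'])
users = pd.read_csv(users_csv, dtype={'user_id': 'int32', 'plan': 'category'}, parse_dates=['date'])

# Filter early to the date range needed
tx = tx[tx['date'] >= '2024-01-01']

# Keys already share dtype (int32 on both sides); merge, then validate immediately
merged = tx.merge(users, on=['user_id', 'date'], how='left')
assert len(merged) == len(tx)  # a left join on unique keys should not add rows
print(merged)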
Progress & test
The quick test below is available to everyone; only logged-in users will have progress saved automatically.