Why this matters
In real ML/AI work, speed saves money and unlocks scale. You will:
- Speed up feature engineering and data preprocessing pipelines.
- Reduce model training time (and cost) by removing bottlenecks.
- Batch and stream data efficiently for large datasets.
- Make real-time inference meet latency SLAs.
- Diagnose memory spikes that crash notebooks or jobs.
Real tasks you might face
- Cut a 45-minute ETL step to under 5 minutes to meet a nightly retrain window.
- Bring a REST inference endpoint from 300 ms to under 80 ms P95 latency.
- Eliminate out-of-memory errors during cross-validation by batching and sharing arrays.
- Profile a slow custom loss function to fit into a 2-hour training budget.
Concept explained simply
Profiling is measuring where time and memory are spent. Optimization is changing code or data flow to make the slowest parts faster without breaking correctness.
Mental model
- Measure first: find the hottest 1–2 functions. Optimize only those.
- Pick the right tactic: algorithmic improvements > vectorization > parallelism > micro-tweaks.
- Trade-offs: speed vs memory vs readability. Document decisions.
Quick toolbox (no special setup)
- Timing: time.perf_counter(), timeit
- CPU profiling: cProfile, pstats
- Line timing: line_profiler (if available), or manual timers
- Memory: tracemalloc (built-in), sys.getsizeof (shallow sizes only), generators instead of materialized lists (see the sketch after this list)
- Parallelism: multiprocessing, concurrent.futures, joblib (common in scikit-learn)
- Vectorization: NumPy, pandas
- Caching: functools.lru_cache
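Here is a minimal sketch of three toolbox items together: timing a section with time.perf_counter(), tracking peak memory with tracemalloc, and caching a repeated pure function with functools.lru_cache. The functions build_features and expensive_lookup are placeholder workloads for illustration, not part of any library.

import time
import tracemalloc
from functools import lru_cache

def build_features(n):
    # Placeholder workload: a quadratic feature over n values.
    return [3*x*x + 2*x + 1 for x in range(n)]

@lru_cache(maxsize=None)
def expensive_lookup(key):
    # Placeholder for a pure, repeatedly called function worth caching.
    return sum(build_features(1000)) + key

# Timing: wrap the section with perf_counter.
start = time.perf_counter()
feats = build_features(200_000)
print('build time:', round(time.perf_counter() - start, 4), 's')

# Memory: tracemalloc reports current and peak allocations.
tracemalloc.start()
feats = build_features(200_000)
current, peak = tracemalloc.get_traced_memory()
tracemalloc.stop()
print('peak memory:', round(peak / 1e6, 1), 'MB')

# Caching: the second call with the same key is a cache hit.
expensive_lookup(7)
expensive_lookup(7)
print(expensive_lookup.cache_info())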
GIL and parallelism in one minute
Python's GIL allows only one thread to run Python bytecode at a time, so threads do not speed up CPU-bound Python code. Rules of thumb (a sketch follows the list):
- CPU-bound: multiprocessing or native extensions (NumPy, Numba).
- I/O-bound: multithreading (concurrent.futures.ThreadPoolExecutor).
- Distributed: joblib with a parallel backend, or a cluster tool (outside this lesson's scope).
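A minimal sketch of these rules, assuming a multi-core machine: processes for a CPU-bound function, threads for an I/O-bound one (simulated here with time.sleep). cpu_task and io_task are illustrative placeholders; the __main__ guard is needed because ProcessPoolExecutor may spawn fresh worker processes.

import math
import time
from concurrent.futures import ProcessPoolExecutor, ThreadPoolExecutor

def cpu_task(n):
    # CPU-bound: pure Python arithmetic; threads would be serialized by the GIL.
    return sum(math.sqrt(3*i*i + 2*i + 1) for i in range(n))

def io_task(delay):
    # Stand-in for I/O (network, disk): the GIL is released while waiting.
    time.sleep(delay)
    return delay

if __name__ == '__main__':
    # CPU-bound work: processes run in parallel across cores.
    with ProcessPoolExecutor() as pool:
        cpu_results = list(pool.map(cpu_task, [200_000] * 4))

    # I/O-bound work: threads overlap the waits.
    with ThreadPoolExecutor(max_workers=4) as pool:
        io_results = list(pool.map(io_task, [0.2] * 4))

    print(len(cpu_results), len(io_results))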
Workflow: from slow to fast
- Set a baseline: Wrap the full task with a timer (see the sketch after this list). Record input size, runtime, and memory peaks.
- Profile: Use cProfile to find hot functions. If needed, measure within a function using perf_counter.
- Choose a tactic: Try algorithmic changes, vectorization, batching, caching, or parallelism.
- Optimize safely: Keep a correctness test. Optimize one change at a time.
- Re-measure: Compare against baseline. Stop when you meet the target.
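A minimal baseline-and-verify harness for this workflow. slow_version and fast_version are placeholders standing in for your real step and its candidate replacement; swap in your own functions and inputs.

import time

def slow_version(xs):
    # Placeholder for the original implementation.
    return sum(3*x*x + 2*x + 1 for x in xs)

def fast_version(xs):
    # Placeholder for the optimized candidate under test.
    return sum(x*(3*x + 2) + 1 for x in xs)

xs = list(range(100_000))

start = time.perf_counter()
expected = slow_version(xs)
baseline_t = time.perf_counter() - start

start = time.perf_counter()
got = fast_version(xs)
optimized_t = time.perf_counter() - start

# Correctness first, then compare against the baseline.
assert expected == got, 'optimized output drifted'
print('baseline:', round(baseline_t, 4), 's  optimized:', round(optimized_t, 4), 's')
print('speedup:', round(baseline_t / optimized_t, 2), 'x')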
Worked examples
1) Vectorize a feature computation
Before: pure Python loops
import math

def feat_loop(xs):
    out = []
    for x in xs:
        y = 3*x*x + 2*x + 1
        out.append(math.sqrt(y))
    return sum(out)

xs = list(range(100000))
print(feat_loop(xs))

After: NumPy vectorization
import numpy as np

def feat_vec(xs):
    x = np.asarray(xs)
    y = 3*x*x + 2*x + 1
    return np.sqrt(y).sum()

xs = np.arange(100000)
print(feat_vec(xs))

Typical speedup: 10x–100x, plus lower Python-level overhead.
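If you want to confirm that range on your machine, here is a quick comparison reusing feat_loop and feat_vec from above; exact speedups vary with hardware, array size, and NumPy build.

import time
import numpy as np

xs_list = list(range(100000))
xs_arr = np.arange(100000)

start = time.perf_counter()
loop_result = feat_loop(xs_list)
loop_t = time.perf_counter() - start

start = time.perf_counter()
vec_result = feat_vec(xs_arr)
vec_t = time.perf_counter() - start

# The two implementations should agree to floating-point tolerance.
assert abs(loop_result - vec_result) < 1e-6 * abs(loop_result)
print('loop:', round(loop_t, 4), 's  vectorized:', round(vec_t, 4), 's')
print('speedup:', round(loop_t / vec_t, 1), 'x')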
2) Find a bottleneck with cProfile
Profile and interpret
import cProfile, pstats, io

def slow_fn(n):
    s = 0
    for i in range(n):
        for j in range(100):
            s += (i % 7) * (j % 11)
    return s

pr = cProfile.Profile()
pr.enable()
slow_fn(20000)
pr.disable()

s = io.StringIO()
ps = pstats.Stats(pr, stream=s).sort_stats('cumtime')
ps.print_stats(10)
print(s.getvalue())  # Look for the function with the highest cumtime

You'll see most time in slow_fn. The inner arithmetic loop dominates.
Optimize by hoisting invariants and using vector ops
import numpy as np

def faster_fn(n):
    j = np.arange(100)
    j_term = j % 11
    j_sum = j_term.sum()  # invariant across i
    # (i % 7) repeats every 7
    i_mod = np.arange(n) % 7
    return int((i_mod * j_sum).sum())

We removed the inner loop and used precomputed terms. Big win.
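Per the workflow above, verify the rewrite against the original before trusting the timing; this reuses slow_fn and faster_fn as defined in this example.

import time

n = 20000

start = time.perf_counter()
expected = slow_fn(n)
slow_t = time.perf_counter() - start

start = time.perf_counter()
got = faster_fn(n)
fast_t = time.perf_counter() - start

# Identical integer results, and the NumPy version skips the inner loop entirely.
assert expected == got
print('slow:', round(slow_t, 4), 's  fast:', round(fast_t, 5), 's')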
3) Reduce memory via generators and batching
Before: list materialization
def sum_squares_list(n):
    data = [i*i for i in range(n)]  # allocates n items
    return sum(data)

print(sum_squares_list(10_000_000))  # may stress memory

After: generator and chunked processing
def sum_squares_gen(n):
    return sum(i*i for i in range(n))  # no large list in memory

print(sum_squares_gen(10_000_000))

Same result, near-constant memory. For I/O or arrays, process in chunks and accumulate.
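Here is a minimal sketch of that chunked pattern for arrays, applied to the feature sum from worked example 1. feat_sum_chunked and chunk_size are illustrative names; tune the chunk size to your memory budget.

import numpy as np

def feat_sum_chunked(n, chunk_size=1_000_000):
    # Peak memory is one chunk of chunk_size floats, not n values at once.
    total = 0.0
    for start in range(0, n, chunk_size):
        x = np.arange(start, min(start + chunk_size, n), dtype=np.float64)
        total += float(np.sqrt(3*x*x + 2*x + 1).sum())
    return total

# Matches the all-at-once computation to floating-point tolerance.
x = np.arange(2_000_000, dtype=np.float64)
full = float(np.sqrt(3*x*x + 2*x + 1).sum())
chunked = feat_sum_chunked(2_000_000)
assert abs(full - chunked) < 1e-6 * full
print(feat_sum_chunked(10_000_000))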
Common mistakes and self-check
- Optimizing without measuring: Always keep baseline and after numbers.
- Micro-optimizing non-hot code: Focus on top 1–2 hotspots from the profiler.
- Ignoring algorithmic changes: A better algorithm beats low-level tweaks (see the sketch after this list).
- Python loops over arrays: Prefer NumPy/pandas vectorized ops.
- Threads for CPU-bound work: Use multiprocessing or vectorized/native code.
- Memory bloat from temporary lists: Use generators and in-place operations.
- Not testing correctness: Add assertions comparing old vs new outputs on sample data.
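A minimal sketch of the "better algorithm first" point referenced above: switching a membership test from a list (linear scan) to a set (hash lookup) turns an O(n·m) step into roughly O(n), which no micro-tweak to the loop can match. count_known_list and count_known_set are illustrative names.

import time

def count_known_list(items, known):
    # O(n * m): each membership test scans the whole list.
    known_list = list(known)
    return sum(1 for x in items if x in known_list)

def count_known_set(items, known):
    # O(n): each membership test is a hash lookup.
    known_set = set(known)
    return sum(1 for x in items if x in known_set)

items = list(range(10_000))
known = range(0, 10_000, 10)

start = time.perf_counter()
a = count_known_list(items, known)
list_t = time.perf_counter() - start

start = time.perf_counter()
b = count_known_set(items, known)
set_t = time.perf_counter() - start

assert a == b
print('list scan:', round(list_t, 3), 's  set lookup:', round(set_t, 4), 's')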
Self-check checklist
- I can run cProfile and identify top cumulative time functions.
- I can time a section with time.perf_counter() and report speedup.
- I replaced at least one Python loop with a vectorized operation.
- I reduced memory by avoiding unnecessary intermediate lists or copies.
- I validated optimized results against a known-correct baseline.
Exercises
Do these in order. Measure before and after. Keep outputs identical.
Exercise ex1 — Vectorize a slow scoring loop with NumPy
Starter
import time, math

def score_loop(xs):
    s = 0.0
    for x in xs:
        s += math.sqrt(3*x*x + 2*x + 1)
    return s

N = 200_000
xs = list(range(N))

# 1) Time the loop version
start = time.perf_counter()
baseline = score_loop(xs)
loop_t = time.perf_counter() - start
print('baseline:', baseline, 'time:', round(loop_t, 4), 's')

# 2) Write a vectorized function score_vec(xs) using NumPy
# 3) Time it and compare result equality and speedup

- Expected: same result within a relative tolerance of 1e-6 (e.g., math.isclose with rel_tol=1e-6).
- Target: 5x or more speedup on large N (typical).
Exercise ex2 — Remove memory bloat with generators
Starter
import time

def sum_cubes_list(n):
    data = [i*i*i for i in range(n)]
    return sum(data)

n = 8_000_000
start = time.perf_counter()
res_list = sum_cubes_list(n)
list_t = time.perf_counter() - start
print('list version time:', round(list_t, 3), 's')

# Task: Write sum_cubes_gen(n) using a generator expression.
# Compare times and confirm res_gen == res_list.

- Expected: identical sums; generator uses far less memory.
- Tip: Large n may take time; start with 2,000,000 then increase.
Mini checklist before you move on
- I measured runtime before and after.
- I verified outputs are equal.
- I can explain why the optimization worked.
Who this is for
- Machine Learning Engineers who need faster data and training pipelines.
- Data Scientists transitioning to production-grade performance.
- Backend engineers adding ML inference to services.
Prerequisites
- Comfortable with Python functions, loops, and basic data structures.
- Basic NumPy or pandas understanding.
- Ability to run scripts and install common packages if needed.
Learning path
- Start: Baseline and profile with cProfile and perf_counter.
- Then: Vectorize and batch with NumPy/pandas.
- Next: Parallelize appropriately (threads for I/O, processes for CPU).
- Finally: Memory discipline (generators, chunking, avoiding copies) and caching.
Practical projects
- Speed up a feature store build job by 5x and document profiling screenshots and timings.
- Reduce a model inference pipeline’s P95 latency to under 100 ms using batching and caching.
- Refactor a cross-validation script to parallelize folds safely, keeping memory use flat across folds.
Next steps
- Take the Quick Test to check your understanding.
- Apply one optimization to a current project. Measure and write a short before/after note.
- Revisit this lesson in a week and try a different tactic (e.g., caching or batching).
Mini challenge
You have a preprocessing step that repeatedly parses the same 1,000 JSON templates across 5 million rows. Propose two changes to make it faster and lighter, and how you would prove the improvement with numbers.