Why this matters
In real ML/AI work, speed saves money and unlocks scale. You will:
- Speed up feature engineering and data preprocessing pipelines.
- Reduce model training time (and cost) by removing bottlenecks.
- Batch and stream data efficiently for large datasets.
- Make real-time inference meet latency SLAs.
- Diagnose memory spikes that crash notebooks or jobs.
Real tasks you might face
- Cut a 45-minute ETL step to under 5 minutes to meet a nightly retrain window.
- Bring a REST inference endpoint from 300 ms to under 80 ms P95 latency.
- Eliminate out-of-memory errors during cross-validation by batching and sharing arrays.
- Profile a slow custom loss function to fit into a 2-hour training budget.
Concept explained simply
Profiling is measuring where time and memory are spent. Optimization is changing code or data flow to make the slowest parts faster without breaking correctness.
Mental model
- Measure first: find the hottest 1–2 functions. Optimize only those.
- Pick the right tactic: algorithmic improvements > vectorization > parallelism > micro-tweaks.
- Trade-offs: speed vs memory vs readability. Document decisions.
Quick toolbox (no special setup)
- Timing: time.perf_counter(), timeit
- CPU profiling: cProfile, pstats
- Line timing: line_profiler (if available), or manual timers
- Memory: tracemalloc (built-in), sys.getsizeof (shallow sizes only), generators instead of materialized lists (see the sketch after this list)
- Parallelism: multiprocessing, concurrent.futures, joblib (common in scikit-learn)
- Vectorization: NumPy, pandas
- Caching: functools.lru_cache
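Here is a minimal sketch of three toolbox items together: timing a section with time.perf_counter(), tracking peak memory with tracemalloc, and caching a repeated pure function with functools.lru_cache. The functions build_features and expensive_lookup are placeholder workloads for illustration, not part of any library.

import time
import tracemalloc
from functools import lru_cache

def build_features(n):
    # Placeholder workload: a quadratic feature over n values.
    return [3*x*x + 2*x + 1 for x in range(n)]

@lru_cache(maxsize=None)
def expensive_lookup(key):
    # Placeholder for a pure, repeatedly called function worth caching.
    return sum(build_features(1000)) + key

# Timing: wrap the section with perf_counter.
start = time.perf_counter()
feats = build_features(200_000)
print('build time:', round(time.perf_counter() - start, 4), 's')

# Memory: tracemalloc reports current and peak allocations.
tracemalloc.start()
feats = build_features(200_000)
current, peak = tracemalloc.get_traced_memory()
tracemalloc.stop()
print('peak memory:', round(peak / 1e6, 1), 'MB')

# Caching: the second call with the same key is a cache hit.
expensive_lookup(7)
expensive_lookup(7)
print(expensive_lookup.cache_info())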
GIL and parallelism in one minute
Python's GIL allows only one thread to run Python bytecode at a time, so threads do not speed up CPU-bound Python code. Rules of thumb (a sketch follows the list):
- CPU-bound: multiprocessing or native extensions (NumPy, Numba).
- I/O-bound: multithreading (concurrent.futures.ThreadPoolExecutor).
- Distributed: joblib with a parallel backend, or a cluster tool (outside this lesson's scope).
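A minimal sketch of these rules, assuming a multi-core machine: processes for a CPU-bound function, threads for an I/O-bound one (simulated here with time.sleep). cpu_task and io_task are illustrative placeholders; the __main__ guard is needed because ProcessPoolExecutor may spawn fresh worker processes.

import math
import time
from concurrent.futures import ProcessPoolExecutor, ThreadPoolExecutor

def cpu_task(n):
    # CPU-bound: pure Python arithmetic; threads would be serialized by the GIL.
    return sum(math.sqrt(3*i*i + 2*i + 1) for i in range(n))

def io_task(delay):
    # Stand-in for I/O (network, disk): the GIL is released while waiting.
    time.sleep(delay)
    return delay

if __name__ == '__main__':
    # CPU-bound work: processes run in parallel across cores.
    with ProcessPoolExecutor() as pool:
        cpu_results = list(pool.map(cpu_task, [200_000] * 4))

    # I/O-bound work: threads overlap the waits.
    with ThreadPoolExecutor(max_workers=4) as pool:
        io_results = list(pool.map(io_task, [0.2] * 4))

    print(len(cpu_results), len(io_results))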
Workflow: from slow to fast
- Set a baseline: Wrap the full task with a timer (see the sketch after this list). Record input size, runtime, and memory peaks.
- Profile: Use cProfile to find hot functions. If needed, measure within a function using perf_counter.
- Choose a tactic: Try algorithmic changes, vectorization, batching, caching, or parallelism.
- Optimize safely: Keep a correctness test. Optimize one change at a time.
- Re-measure: Compare against baseline. Stop when you meet the target.
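A minimal baseline-and-verify harness for this workflow. slow_version and fast_version are placeholders standing in for your real step and its candidate replacement; swap in your own functions and inputs.

import time

def slow_version(xs):
    # Placeholder for the original implementation.
    return sum(3*x*x + 2*x + 1 for x in xs)

def fast_version(xs):
    # Placeholder for the optimized candidate under test.
    return sum(x*(3*x + 2) + 1 for x in xs)

xs = list(range(100_000))

start = time.perf_counter()
expected = slow_version(xs)
baseline_t = time.perf_counter() - start

start = time.perf_counter()
got = fast_version(xs)
optimized_t = time.perf_counter() - start

# Correctness first, then compare against the baseline.
assert expected == got, 'optimized output drifted'
print('baseline:', round(baseline_t, 4), 's  optimized:', round(optimized_t, 4), 's')
print('speedup:', round(baseline_t / optimized_t, 2), 'x')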
Worked examples
1) Vectorize a feature computation
Before: pure Python loops
import math

def feat_loop(xs):
    out = []
    for x in xs:
        y = 3*x*x + 2*x + 1
        out.append(math.sqrt(y))
    return sum(out)

xs = list(range(100000))
print(feat_loop(xs))

After: NumPy vectorization
import numpy as np

def feat_vec(xs):
    x = np.asarray(xs)
    y = 3*x*x + 2*x + 1
    return np.sqrt(y).sum()

xs = np.arange(100000)
print(feat_vec(xs))

Typical speedup: 10x–100x, plus lower Python-level overhead.
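If you want to confirm that range on your machine, here is a quick comparison reusing feat_loop and feat_vec from above; exact speedups vary with hardware, array size, and NumPy build.

import time
import numpy as np

xs_list = list(range(100000))
xs_arr = np.arange(100000)

start = time.perf_counter()
loop_result = feat_loop(xs_list)
loop_t = time.perf_counter() - start

start = time.perf_counter()
vec_result = feat_vec(xs_arr)
vec_t = time.perf_counter() - start

# The two implementations should agree to floating-point tolerance.
assert abs(loop_result - vec_result) < 1e-6 * abs(loop_result)
print('loop:', round(loop_t, 4), 's  vectorized:', round(vec_t, 4), 's')
print('speedup:', round(loop_t / vec_t, 1), 'x')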
2) Find a bottleneck with cProfile
Profile and interpret
import cProfile, pstats, io

def slow_fn(n):
    s = 0
    for i in range(n):
        for j in range(100):
            s += (i % 7) * (j % 11)
    return s

pr = cProfile.Profile()
pr.enable()
slow_fn(20000)
pr.disable()

s = io.StringIO()
ps = pstats.Stats(pr, stream=s).sort_stats('cumtime')
ps.print_stats(10)
print(s.getvalue())  # Look for the function with the highest cumtime

You'll see most time in slow_fn. The inner arithmetic loop dominates.
Optimize by hoisting invariants and using vector ops
import numpy as np

def faster_fn(n):
    j = np.arange(100)
    j_term = j % 11
    j_sum = j_term.sum()  # invariant across i
    # (i % 7) repeats every 7
    i_mod = np.arange(n) % 7
    return int((i_mod * j_sum).sum())

We removed the inner loop and used precomputed terms. Big win.
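Per the workflow above, verify the rewrite against the original before trusting the timing; this reuses slow_fn and faster_fn as defined in this example.

import time

n = 20000

start = time.perf_counter()
expected = slow_fn(n)
slow_t = time.perf_counter() - start

start = time.perf_counter()
got = faster_fn(n)
fast_t = time.perf_counter() - start

# Identical integer results, and the NumPy version skips the inner loop entirely.
assert expected == got
print('slow:', round(slow_t, 4), 's  fast:', round(fast_t, 5), 's')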
3) Reduce memory via generators and batching
Before: list materialization
def sum_squares_list(n):
    data = [i*i for i in range(n)]  # allocates n items
    return sum(data)

print(sum_squares_list(10_000_000))  # may stress memory

After: generator and chunked processing
def sum_squares_gen(n):
    return sum(i*i for i in range(n))  # no large list in memory

print(sum_squares_gen(10_000_000))

Same result, near-constant memory. For I/O or arrays, process in chunks and accumulate.
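Here is a minimal sketch of that chunked pattern for arrays, applied to the feature sum from worked example 1. feat_sum_chunked and chunk_size are illustrative names; tune the chunk size to your memory budget.

import numpy as np

def feat_sum_chunked(n, chunk_size=1_000_000):
    # Peak memory is one chunk of chunk_size floats, not n values at once.
    total = 0.0
    for start in range(0, n, chunk_size):
        x = np.arange(start, min(start + chunk_size, n), dtype=np.float64)
        total += float(np.sqrt(3*x*x + 2*x + 1).sum())
    return total

# Matches the all-at-once computation to floating-point tolerance.
x = np.arange(2_000_000, dtype=np.float64)
full = float(np.sqrt(3*x*x + 2*x + 1).sum())
chunked = feat_sum_chunked(2_000_000)
assert abs(full - chunked) < 1e-6 * full
print(feat_sum_chunked(10_000_000))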
Common mistakes and self-check
- Optimizing without measuring: Always keep baseline and after numbers.
- Micro-optimizing non-hot code: Focus on top 1–2 hotspots from the profiler.
- Ignoring algorithmic changes: A better algorithm beats low-level tweaks (see the sketch after this list).
- Python loops over arrays: Prefer NumPy/pandas vectorized ops.
- Threads for CPU-bound work: Use multiprocessing or vectorized/native code.
- Memory bloat from temporary lists: Use generators and in-place operations.
- Not testing correctness: Add assertions comparing old vs new outputs on sample data.
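A minimal sketch of the "better algorithm first" point referenced above: switching a membership test from a list (linear scan) to a set (hash lookup) turns an O(n·m) step into roughly O(n), which no micro-tweak to the loop can match. count_known_list and count_known_set are illustrative names.

import time

def count_known_list(items, known):
    # O(n * m): each membership test scans the whole list.
    known_list = list(known)
    return sum(1 for x in items if x in known_list)

def count_known_set(items, known):
    # O(n): each membership test is a hash lookup.
    known_set = set(known)
    return sum(1 for x in items if x in known_set)

items = list(range(10_000))
known = range(0, 10_000, 10)

start = time.perf_counter()
a = count_known_list(items, known)
list_t = time.perf_counter() - start

start = time.perf_counter()
b = count_known_set(items, known)
set_t = time.perf_counter() - start

assert a == b
print('list scan:', round(list_t, 3), 's  set lookup:', round(set_t, 4), 's')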
Self-check checklist
- I can run cProfile and identify top cumulative time functions.
- I can time a section with time.perf_counter() and report speedup.
- I replaced at least one Python loop with a vectorized operation.
- I reduced memory by avoiding unnecessary intermediate lists or copies.
- I validated optimized results against a known-correct baseline.
Exercises
Do these in order. Measure before and after. Keep outputs identical.
Exercise ex1 — Vectorize a slow scoring loop with NumPy
Starter
import time, math

def score_loop(xs):
    s = 0.0
    for x in xs:
        s += math.sqrt(3*x*x + 2*x + 1)
    return s

N = 200_000
xs = list(range(N))

# 1) Time the loop version
start = time.perf_counter()
baseline = score_loop(xs)
loop_t = time.perf_counter() - start
print('baseline:', baseline, 'time:', round(loop_t, 4), 's')

# 2) Write a vectorized function score_vec(xs) using NumPy
# 3) Time it and compare result equality and speedup

- Expected: same result within a relative tolerance of 1e-6 (e.g., math.isclose with rel_tol=1e-6).
- Target: 5x or more speedup on large N (typical).
Exercise ex2 — Remove memory bloat with generators
Starter
import time

def sum_cubes_list(n):
    data = [i*i*i for i in range(n)]
    return sum(data)

n = 8_000_000
start = time.perf_counter()
res_list = sum_cubes_list(n)
list_t = time.perf_counter() - start
print('list version time:', round(list_t, 3), 's')

# Task: Write sum_cubes_gen(n) using a generator expression.
# Compare times and confirm res_gen == res_list.

- Expected: identical sums; generator uses far less memory.
- Tip: Large n may take time; start with 2,000,000 then increase.
Mini checklist before you move on
- I measured runtime before and after.
- I verified outputs are equal.
- I can explain why the optimization worked.
Who this is for
- Machine Learning Engineers who need faster data and training pipelines.
- Data Scientists transitioning to production-grade performance.
- Backend engineers adding ML inference to services.
Prerequisites
- Comfortable with Python functions, loops, and basic data structures.
- Basic NumPy or pandas understanding.
- Ability to run scripts and install common packages if needed.
Learning path
- Start: Baseline and profile with cProfile and perf_counter.
- Then: Vectorize and batch with NumPy/pandas.
- Next: Parallelize appropriately (threads for I/O, processes for CPU).
- Finally: Memory discipline (generators, chunking, avoiding copies) and caching.
Practical projects
- Speed up a feature store build job by 5x and document profiling screenshots and timings.
- Reduce a model inference pipeline’s P95 latency to under 100 ms using batching and caching.
- Refactor a cross-validation script to parallelize folds safely, keeping memory use flat across folds.
Next steps
- Take the Quick Test to check your understanding.
- Apply one optimization to a current project. Measure and write a short before/after note.
- Revisit this lesson in a week and try a different tactic (e.g., caching or batching).
Mini challenge
You have a preprocessing step that repeatedly parses the same 1,000 JSON templates across 5 million rows. Propose two changes to make it faster and lighter, and how you would prove the improvement with numbers.