Why this matters
Machine Learning Engineers often handle large datasets and performance-critical pipelines. Knowing the right data structures and algorithm patterns helps you:
- Load, clean, and transform data efficiently (lists, dicts, sets, NumPy arrays).
- Implement fast feature engineering (hash maps for lookups, heaps for top-k).
- Optimize inference and serving (queues, caching, batching).
- Design scalable solutions for recommendations, search, and monitoring.
Who this is for
- Beginner to intermediate Python users moving into ML/AI engineering.
- Data scientists who want production-minded performance.
- Software engineers needing a refresher on Pythonic DS&A with ML flavor.
Prerequisites
- Basic Python: variables, functions, loops, list/dict basics.
- Comfort reading small code snippets.
- Optional but helpful: basic NumPy knowledge.
Concept explained simply
Data structures store data with certain trade-offs. Algorithms are step-by-step methods that operate on those structures. The right combination saves time and memory.
- List: ordered, fast append, slow middle insert/delete.
- Tuple: like list but immutable (safe as dictionary keys).
- Dict: key → value mapping with O(1) average lookup.
- Set: unique items, O(1) average membership checks.
- Deque: fast append/pop from both ends.
- Heap (priority queue): fast access to smallest/largest item.
- Graph/Tree: nodes and edges; useful for dependencies, paths.
- NumPy array: dense numeric array with vectorized operations.
Mental model
Imagine each structure as a specialized toolbox drawer:
- Dict/Set = lightning-fast index cards for membership/lookup.
- List = a row of boxes; reaching the middle takes time.
- Heap = a bin that always exposes the lightest ball first.
- Deque = a train you can attach cars to from both ends.
- NumPy = the conveyor belt that processes many items at once.
Big-O basics you will actually use
- Hash map/set lookup: average O(1).
- List append/pop end: O(1). Insert/delete middle: O(n).
- Sorting: O(n log n) using Timsort (stable).
- Heap push/pop: O(log n). Build-heap from list: O(n).
- Binary search on sorted list: O(log n).
- Vectorized NumPy ops often approach O(n) with low constant factors.
Python building blocks for ML tasks
# Hash maps and counters
from collections import Counter, defaultdict
labels = ["cat", "dog", "dog", "cat", "bird"]
count = Counter(labels) # {'cat': 2, 'dog': 2, 'bird': 1}
# Fast membership checks
seen = set(labels)
"cat" in seen # True
# Deques for sliding windows
from collections import deque
window = deque(maxlen=3)
for x in [1, 2, 3, 4]:
window.append(x) # maintains last 3 items
# Heaps for top-k
import heapq
nums = [9, 1, 5, 3, 7]
smallest = heapq.nsmallest(2, nums) # [1, 3]
largest = heapq.nlargest(2, nums) # [9, 7]
# Binary search for thresholds
import bisect
probs = [0.1, 0.2, 0.35, 0.7, 0.9]
idx = bisect.bisect_left(probs, 0.5) # first >= 0.5 => index 3
# NumPy for vector ops
import numpy as np
X = np.array([[1., 2.], [3., 4.], [5., 6.]])
mean = X.mean(axis=0) # column-wise mean
Worked examples
1) Top-K frequent tokens (Counter + heap)
from collections import Counter
import heapq
tokens = ["ml", "ai", "ml", "data", "ml", "ai", "prod"]
K = 2
freq = Counter(tokens) # {'ml': 3, 'ai': 2, 'data': 1, 'prod': 1}
# Turn into list of (count, token) to use a min-heap of size K
heap = []
for token, c in freq.items():
if len(heap) < K:
heapq.heappush(heap, (c, token))
else:
if c > heap[0][0]:
heapq.heapreplace(heap, (c, token))
# heap contains the top K by count
result = [t for c, t in sorted(heap, reverse=True)] # ['ml', 'ai']
print(result)
2) Sliding window maximum (deque)
from collections import deque
def sliding_max(nums, k):
dq = deque() # store indices, values decreasing
out = []
for i, x in enumerate(nums):
# Remove indices out of window
while dq and dq[0] <= i - k:
dq.popleft()
# Maintain decreasing order
while dq and nums[dq[-1]] <= x:
dq.pop()
dq.append(i)
if i + 1 >= k:
out.append(nums[dq[0]])
return out
print(sliding_max([1,3,-1,-3,5,3,6,7], 3)) # [3,3,5,5,6,7]
3) Fast nearest neighbor with NumPy (vectorization)
import numpy as np
# Dataset: n x d, Query: m x d
X = np.array([[0.,0.],[1.,1.],[2.,2.],[3.,3.]])
Q = np.array([[1.2,1.0],[2.5,1.9]])
# L2 distances using (a-b)^2 = a^2 + b^2 - 2ab
X2 = (X**2).sum(axis=1)[:,None] # n x 1
Q2 = (Q**2).sum(axis=1)[None,:] # 1 x m
dists2 = X2 + Q2 - 2*X@Q.T # n x m squared distances
nn_idx = dists2.argmin(axis=0) # index per query
print(nn_idx) # e.g., [1 2]
Why this is better than loops
Vectorization offloads work to optimized C/BLAS under the hood, reducing Python-level overhead and often leveraging CPU vector instructions.
Learning path
- Master Python built-ins: list, dict, set, tuple, slicing.
- Learn algorithmic patterns: hashing, two pointers, sliding window, heap, binary search.
- Add collections tools: deque, Counter, defaultdict.
- Use bisect and heapq for real tasks (thresholding, top-k).
- Adopt NumPy for numeric-heavy operations and broadcasting.
- Practice by re-implementing common ML preprocessing steps efficiently.
Common mistakes and self-check
- Using list for membership checks instead of set. Self-check: If you do
if x in my_liston big data, consider a set. - Sorting for top-k when k is small. Self-check: Replace full sort with
heapq.nlargest(k, ...). - Manual loops for numeric ops. Self-check: Try a vectorized NumPy approach and compare runtime on 1e6 elements.
- Rebuilding dicts in a loop. Self-check: Use
defaultdictorCounterto accumulate. - Forgetting stable sort behavior. Self-check: When tie-breaking, sort by secondary key then primary.
Quick checklist
- I use set/dict for O(1) lookups.
- I reach for deque for queue-like windows.
- I use heapq for streaming top-k.
- I default to vectorized NumPy for heavy math.
- I know when to apply binary search (sorted data, thresholding).
Practical projects
- Streaming anomaly detector: compute rolling mean and max with deque; flag spikes.
- Top-K keywords per category: use Counter + heap to keep K words for each label.
- Fast similarity search (toy): compute cosine similarity matrix with NumPy; return nearest neighbors.
- Threshold tuner: given sorted probabilities and a target precision, use binary search to find the cutoff.
Exercises
Do these now. The same tasks also appear in the Exercises panel with hints and solutions.
- Exercise 1: Top-K frequent labels using Counter + heap. Input: list of labels and K. Output: list of K labels by descending frequency (ties arbitrary but stable if counts equal).
- Exercise 2: Sliding window max with deque. Input: list of ints and window size k. Output: list of window maxima.
- Exercise 3: Threshold index with bisect. Input: sorted probabilities and threshold t. Output: first index where value ≥ t (or len(list) if none).
Self-check checklist
- Time complexity: Exercise 1 O(n log k), Exercise 2 O(n), Exercise 3 O(log n).
- No unnecessary full sorts for Exercise 1.
- Deque stores indices and maintains decreasing order for Exercise 2.
- Bisect function chosen correctly (
bisect_leftfor ≥) for Exercise 3.
Next steps
- Integrate these patterns into your data loaders and feature pipelines.
- Profile your code on realistic input sizes; replace hotspots with better structures.
- Take the quick test below. Progress is available to everyone; saved progress is only for logged-in users.
Mini challenge
Given a stream of numeric scores, maintain the median after each new number in O(log n) per update using two heaps. Print the running median for the first 10 inputs. Hint: one max-heap for the lower half (store negatives), one min-heap for the upper half; rebalance sizes to differ by at most 1.