Topic Not Found

Why this matters

Machine Learning Engineers often handle large datasets and performance-critical pipelines. Knowing the right data structures and algorithm patterns helps you:

Load, clean, and transform data efficiently (lists, dicts, sets, NumPy arrays).
Implement fast feature engineering (hash maps for lookups, heaps for top-k).
Optimize inference and serving (queues, caching, batching).
Design scalable solutions for recommendations, search, and monitoring.

Who this is for

Beginner to intermediate Python users moving into ML/AI engineering.
Data scientists who want production-minded performance.
Software engineers needing a refresher on Pythonic DS&A with ML flavor.

Prerequisites

Basic Python: variables, functions, loops, list/dict basics.
Comfort reading small code snippets.
Optional but helpful: basic NumPy knowledge.

Concept explained simply

Data structures store data with certain trade-offs. Algorithms are step-by-step methods that operate on those structures. The right combination saves time and memory.

List: ordered, fast append, slow middle insert/delete.
Tuple: like list but immutable (safe as dictionary keys).
Dict: key → value mapping with O(1) average lookup.
Set: unique items, O(1) average membership checks.
Deque: fast append/pop from both ends.
Heap (priority queue): fast access to smallest/largest item.
Graph/Tree: nodes and edges; useful for dependencies, paths.
NumPy array: dense numeric array with vectorized operations.

Mental model

Imagine each structure as a specialized toolbox drawer:

Dict/Set = lightning-fast index cards for membership/lookup.
List = a row of boxes; reaching the middle takes time.
Heap = a bin that always exposes the lightest ball first.
Deque = a train you can attach cars to from both ends.
NumPy = the conveyor belt that processes many items at once.

Big-O basics you will actually use

Hash map/set lookup: average O(1).
List append/pop end: O(1). Insert/delete middle: O(n).
Sorting: O(n log n) using Timsort (stable).
Heap push/pop: O(log n). Build-heap from list: O(n).
Binary search on sorted list: O(log n).
Vectorized NumPy ops often approach O(n) with low constant factors.

Python building blocks for ML tasks

# Hash maps and counters
from collections import Counter, defaultdict

labels = ["cat", "dog", "dog", "cat", "bird"]
count = Counter(labels)  # {'cat': 2, 'dog': 2, 'bird': 1}

# Fast membership checks
seen = set(labels)
"cat" in seen  # True

# Deques for sliding windows
from collections import deque
window = deque(maxlen=3)
for x in [1, 2, 3, 4]:
    window.append(x)  # maintains last 3 items

# Heaps for top-k
import heapq
nums = [9, 1, 5, 3, 7]
smallest = heapq.nsmallest(2, nums)  # [1, 3]
largest = heapq.nlargest(2, nums)   # [9, 7]

# Binary search for thresholds
import bisect
probs = [0.1, 0.2, 0.35, 0.7, 0.9]
idx = bisect.bisect_left(probs, 0.5)  # first >= 0.5 => index 3

# NumPy for vector ops
import numpy as np
X = np.array([[1., 2.], [3., 4.], [5., 6.]])
mean = X.mean(axis=0)  # column-wise mean

Worked examples

1) Top-K frequent tokens (Counter + heap)

from collections import Counter
import heapq

tokens = ["ml", "ai", "ml", "data", "ml", "ai", "prod"]
K = 2
freq = Counter(tokens)  # {'ml': 3, 'ai': 2, 'data': 1, 'prod': 1}
# Turn into list of (count, token) to use a min-heap of size K
heap = []
for token, c in freq.items():
    if len(heap) < K:
        heapq.heappush(heap, (c, token))
    else:
        if c > heap[0][0]:
            heapq.heapreplace(heap, (c, token))

# heap contains the top K by count
result = [t for c, t in sorted(heap, reverse=True)]  # ['ml', 'ai']
print(result)

2) Sliding window maximum (deque)

from collections import deque

def sliding_max(nums, k):
    dq = deque()  # store indices, values decreasing
    out = []
    for i, x in enumerate(nums):
        # Remove indices out of window
        while dq and dq[0] <= i - k:
            dq.popleft()
        # Maintain decreasing order
        while dq and nums[dq[-1]] <= x:
            dq.pop()
        dq.append(i)
        if i + 1 >= k:
            out.append(nums[dq[0]])
    return out

print(sliding_max([1,3,-1,-3,5,3,6,7], 3))  # [3,3,5,5,6,7]

3) Fast nearest neighbor with NumPy (vectorization)

import numpy as np

# Dataset: n x d, Query: m x d
X = np.array([[0.,0.],[1.,1.],[2.,2.],[3.,3.]])
Q = np.array([[1.2,1.0],[2.5,1.9]])
# L2 distances using (a-b)^2 = a^2 + b^2 - 2ab
X2 = (X**2).sum(axis=1)[:,None]    # n x 1
Q2 = (Q**2).sum(axis=1)[None,:]    # 1 x m
dists2 = X2 + Q2 - 2*X@Q.T         # n x m squared distances
nn_idx = dists2.argmin(axis=0)     # index per query
print(nn_idx)  # e.g., [1 2]

Why this is better than loops

Vectorization offloads work to optimized C/BLAS under the hood, reducing Python-level overhead and often leveraging CPU vector instructions.

Learning path

Master Python built-ins: list, dict, set, tuple, slicing.
Learn algorithmic patterns: hashing, two pointers, sliding window, heap, binary search.
Add collections tools: deque, Counter, defaultdict.
Use bisect and heapq for real tasks (thresholding, top-k).
Adopt NumPy for numeric-heavy operations and broadcasting.
Practice by re-implementing common ML preprocessing steps efficiently.

Common mistakes and self-check

Using list for membership checks instead of set. Self-check: If you do if x in my_list on big data, consider a set.
Sorting for top-k when k is small. Self-check: Replace full sort with heapq.nlargest(k, ...).
Manual loops for numeric ops. Self-check: Try a vectorized NumPy approach and compare runtime on 1e6 elements.
Rebuilding dicts in a loop. Self-check: Use defaultdict or Counter to accumulate.
Forgetting stable sort behavior. Self-check: When tie-breaking, sort by secondary key then primary.

Quick checklist

I use set/dict for O(1) lookups.
I reach for deque for queue-like windows.
I use heapq for streaming top-k.
I default to vectorized NumPy for heavy math.
I know when to apply binary search (sorted data, thresholding).

Practical projects

Streaming anomaly detector: compute rolling mean and max with deque; flag spikes.
Top-K keywords per category: use Counter + heap to keep K words for each label.
Fast similarity search (toy): compute cosine similarity matrix with NumPy; return nearest neighbors.
Threshold tuner: given sorted probabilities and a target precision, use binary search to find the cutoff.

Exercises

Do these now. The same tasks also appear in the Exercises panel with hints and solutions.

Exercise 1: Top-K frequent labels using Counter + heap. Input: list of labels and K. Output: list of K labels by descending frequency (ties arbitrary but stable if counts equal).
Exercise 2: Sliding window max with deque. Input: list of ints and window size k. Output: list of window maxima.
Exercise 3: Threshold index with bisect. Input: sorted probabilities and threshold t. Output: first index where value ≥ t (or len(list) if none).

Self-check checklist

Time complexity: Exercise 1 O(n log k), Exercise 2 O(n), Exercise 3 O(log n).
No unnecessary full sorts for Exercise 1.
Deque stores indices and maintains decreasing order for Exercise 2.
Bisect function chosen correctly (bisect_left for ≥) for Exercise 3.

Next steps

Integrate these patterns into your data loaders and feature pipelines.
Profile your code on realistic input sizes; replace hotspots with better structures.
Take the quick test below. Progress is available to everyone; saved progress is only for logged-in users.

Mini challenge

Given a stream of numeric scores, maintain the median after each new number in O(log n) per update using two heaps. Print the running median for the first 10 inputs. Hint: one max-heap for the lower half (store negatives), one min-heap for the upper half; rebalance sizes to differ by at most 1.

Menu

Data Structures And Algorithms Basics

Table of Contents

Why this matters

Who this is for

Prerequisites

Concept explained simply

Big-O basics you will actually use

Python building blocks for ML tasks

Worked examples

1) Top-K frequent tokens (Counter + heap)

2) Sliding window maximum (deque)

3) Fast nearest neighbor with NumPy (vectorization)

Learning path

Common mistakes and self-check

Practical projects

Exercises

Next steps

Mini challenge

Practice Exercises

Top-K frequent labels using Counter + heap

Instructions

Expected Output

Sliding window maximum with deque

Find threshold index with bisect

Data Structures And Algorithms Basics — Quick Test

Have questions about Data Structures And Algorithms Basics?

AI Assistant