Data Structures And Algorithms Basics

Learn Data Structures And Algorithms Basics for free with explanations, exercises, and a quick test (for Machine Learning Engineers).

Published: January 1, 2026 | Updated: January 1, 2026

Why this matters

Machine Learning Engineers often handle large datasets and performance-critical pipelines. Knowing the right data structures and algorithm patterns helps you:

  • Load, clean, and transform data efficiently (lists, dicts, sets, NumPy arrays).
  • Implement fast feature engineering (hash maps for lookups, heaps for top-k).
  • Optimize inference and serving (queues, caching, batching).
  • Design scalable solutions for recommendations, search, and monitoring.

Who this is for

  • Beginner to intermediate Python users moving into ML/AI engineering.
  • Data scientists who want production-minded performance.
  • Software engineers needing a refresher on Pythonic DS&A with ML flavor.

Prerequisites

  • Basic Python: variables, functions, loops, list/dict basics.
  • Comfort reading small code snippets.
  • Optional but helpful: basic NumPy knowledge.

Concept explained simply

Data structures store data with specific trade-offs in speed and memory. Algorithms are step-by-step procedures that operate on those structures. The right combination saves both time and memory.

  • List: ordered, fast append, slow middle insert/delete.
  • Tuple: like list but immutable (safe as dictionary keys).
  • Dict: key → value mapping with O(1) average lookup.
  • Set: unique items, O(1) average membership checks.
  • Deque: fast append/pop from both ends.
  • Heap (priority queue): fast access to smallest/largest item.
  • Graph/Tree: nodes and edges; useful for dependencies and paths (see the sketch after this list).
  • NumPy array: dense numeric array with vectorized operations.
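
As a small illustration of the graph bullet above, here is a minimal sketch (with made-up node names) that represents a pipeline as an adjacency dict and finds a shortest path length with breadth-first search:

from collections import deque

# Hypothetical pipeline stages as an adjacency dict (node -> neighbors).
graph = {
    "load": ["clean"],
    "clean": ["features"],
    "features": ["train", "monitor"],
    "train": [],
    "monitor": [],
}

def bfs_distance(graph, start, goal):
    # Breadth-first search explores nodes level by level, so the first
    # time we reach `goal` we have the minimum number of edges.
    queue = deque([(start, 0)])
    visited = {start}
    while queue:
        node, dist = queue.popleft()
        if node == goal:
            return dist
        for nxt in graph.get(node, []):
            if nxt not in visited:
                visited.add(nxt)
                queue.append((nxt, dist + 1))
    return -1  # unreachable

print(bfs_distance(graph, "load", "train"))  # 3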

Mental model

Imagine each structure as a specialized toolbox drawer:

  • Dict/Set = lightning-fast index cards for membership/lookup.
  • List = a row of boxes; reaching the middle takes time.
  • Heap = a bin that always exposes the lightest ball first.
  • Deque = a train you can attach cars to from both ends.
  • NumPy = the conveyor belt that processes many items at once.

Big-O basics you will actually use

  • Hash map/set lookup: average O(1).
  • List append/pop end: O(1). Insert/delete middle: O(n).
  • Sorting: O(n log n) using Timsort (stable).
  • Heap push/pop: O(log n). Build-heap from list: O(n).
  • Binary search on sorted list: O(log n).
  • Vectorized NumPy ops are still O(n), but with much smaller constant factors than Python loops (see the timing sketch after this list).
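
To make a couple of these costs concrete, here is a small timing sketch comparing O(n) list membership with average O(1) set membership; absolute numbers are machine-dependent:

import timeit

n = 100_000
data_list = list(range(n))
data_set = set(data_list)
target = n - 1  # worst case for the list scan

list_time = timeit.timeit(lambda: target in data_list, number=100)
set_time = timeit.timeit(lambda: target in data_set, number=100)
print(f"list: {list_time:.4f}s  set: {set_time:.6f}s")  # set is orders of magnitude faster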

Python building blocks for ML tasks

# Hash maps and counters
from collections import Counter, defaultdict

labels = ["cat", "dog", "dog", "cat", "bird"]
count = Counter(labels)  # {'cat': 2, 'dog': 2, 'bird': 1}

# Fast membership checks
seen = set(labels)
"cat" in seen  # True

# Deques for sliding windows
from collections import deque
window = deque(maxlen=3)
for x in [1, 2, 3, 4]:
    window.append(x)  # maintains last 3 items

# Heaps for top-k
import heapq
nums = [9, 1, 5, 3, 7]
smallest = heapq.nsmallest(2, nums)  # [1, 3]
largest = heapq.nlargest(2, nums)   # [9, 7]

# Binary search for thresholds
import bisect
probs = [0.1, 0.2, 0.35, 0.7, 0.9]
idx = bisect.bisect_left(probs, 0.5)  # first >= 0.5 => index 3

# NumPy for vector ops
import numpy as np
X = np.array([[1., 2.], [3., 4.], [5., 6.]])
mean = X.mean(axis=0)  # column-wise mean

Worked examples

1) Top-K frequent tokens (Counter + heap)

from collections import Counter
import heapq

tokens = ["ml", "ai", "ml", "data", "ml", "ai", "prod"]
K = 2
freq = Counter(tokens)  # {'ml': 3, 'ai': 2, 'data': 1, 'prod': 1}
# Keep a min-heap of (count, token) pairs capped at size K
heap = []
for token, c in freq.items():
    if len(heap) < K:
        heapq.heappush(heap, (c, token))
    else:
        if c > heap[0][0]:
            heapq.heapreplace(heap, (c, token))

# heap contains the top K by count
result = [t for c, t in sorted(heap, reverse=True)]  # ['ml', 'ai']
print(result)

2) Sliding window maximum (deque)

from collections import deque

def sliding_max(nums, k):
    dq = deque()  # store indices, values decreasing
    out = []
    for i, x in enumerate(nums):
        # Remove indices out of window
        while dq and dq[0] <= i - k:
            dq.popleft()
        # Maintain decreasing order
        while dq and nums[dq[-1]] <= x:
            dq.pop()
        dq.append(i)
        if i + 1 >= k:
            out.append(nums[dq[0]])
    return out

print(sliding_max([1,3,-1,-3,5,3,6,7], 3))  # [3,3,5,5,6,7]

3) Fast nearest neighbor with NumPy (vectorization)

import numpy as np

# Dataset: n x d, Query: m x d
X = np.array([[0.,0.],[1.,1.],[2.,2.],[3.,3.]])
Q = np.array([[1.2,1.0],[2.5,1.9]])
# L2 distances using (a-b)^2 = a^2 + b^2 - 2ab
X2 = (X**2).sum(axis=1)[:,None]    # n x 1
Q2 = (Q**2).sum(axis=1)[None,:]    # 1 x m
dists2 = X2 + Q2 - 2*X@Q.T         # n x m squared distances
nn_idx = dists2.argmin(axis=0)     # index per query
print(nn_idx)  # [1 2] -> rows 1 and 2 of X are the nearest neighbors

Why this is better than loops

Vectorization offloads work to optimized C/BLAS under the hood, reducing Python-level overhead and often leveraging CPU vector instructions.
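
As a rough illustration rather than a rigorous benchmark, the sketch below compares a pure-Python accumulation loop with NumPy's vectorized sum on one million floats; the exact speedup depends on your machine and NumPy build:

import timeit
import numpy as np

x = np.random.rand(1_000_000)

def loop_sum(arr):
    # Pure-Python loop: every iteration pays interpreter overhead.
    total = 0.0
    for v in arr:
        total += v
    return total

loop_t = timeit.timeit(lambda: loop_sum(x), number=3)
vec_t = timeit.timeit(lambda: x.sum(), number=3)
print(f"loop: {loop_t:.3f}s  vectorized: {vec_t:.4f}s")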

Learning path

  1. Master Python built-ins: list, dict, set, tuple, slicing.
  2. Learn algorithmic patterns: hashing, two pointers, sliding window, heap, binary search.
  3. Add collections tools: deque, Counter, defaultdict.
  4. Use bisect and heapq for real tasks (thresholding, top-k).
  5. Adopt NumPy for numeric-heavy operations and broadcasting.
  6. Practice by re-implementing common ML preprocessing steps efficiently.

Common mistakes and self-check

  • Using list for membership checks instead of set. Self-check: If you do if x in my_list on big data, consider a set.
  • Sorting for top-k when k is small. Self-check: Replace full sort with heapq.nlargest(k, ...).
  • Manual loops for numeric ops. Self-check: Try a vectorized NumPy approach and compare runtime on 1e6 elements.
  • Rebuilding dicts in a loop. Self-check: Use defaultdict or Counter to accumulate.
  • Forgetting stable sort behavior. Self-check: When tie-breaking, sort by the secondary key first, then the primary key (see the sketch after this list).
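
To make the last two items concrete, here is a small sketch (with made-up data) showing defaultdict accumulation and Timsort's stability used for two-key sorting:

from collections import defaultdict

rows = [("cat", 3), ("dog", 2), ("cat", 1), ("bird", 2)]

# Accumulate per-key totals without key-existence checks.
totals = defaultdict(int)
for label, value in rows:
    totals[label] += value
print(dict(totals))  # {'cat': 4, 'dog': 2, 'bird': 2}

# Two-pass stable sort: secondary key first, then primary key.
# Ties on the primary key keep the secondary ordering.
by_label = sorted(rows, key=lambda r: r[0])
by_value = sorted(by_label, key=lambda r: r[1], reverse=True)
print(by_value)  # [('cat', 3), ('bird', 2), ('dog', 2), ('cat', 1)]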

Quick checklist

  • I use set/dict for O(1) lookups.
  • I reach for deque for queue-like windows.
  • I use heapq for streaming top-k.
  • I default to vectorized NumPy for heavy math.
  • I know when to apply binary search (sorted data, thresholding).

Practical projects

  • Streaming anomaly detector: compute a rolling mean and max with a deque; flag spikes (a sketch follows this list).
  • Top-K keywords per category: use Counter + heap to keep K words for each label.
  • Fast similarity search (toy): compute cosine similarity matrix with NumPy; return nearest neighbors.
  • Threshold tuner: given sorted probabilities and a target precision, use binary search to find the cutoff.
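
As one way to start the first project, here is a hedged sketch of a rolling-mean spike detector; the window size and threshold factor are illustrative choices, not recommendations:

from collections import deque

def detect_spikes(stream, window=5, factor=2.0):
    # Keep only the last `window` scores; deque evicts old items for us.
    recent = deque(maxlen=window)
    flags = []
    for x in stream:
        if len(recent) == recent.maxlen:
            mean = sum(recent) / len(recent)
            flags.append(x > factor * mean)
        else:
            flags.append(False)  # not enough history yet
        recent.append(x)
    return flags

scores = [1.0, 1.2, 0.9, 1.1, 1.0, 5.0, 1.0]
print(detect_spikes(scores))  # only the 5.0 is flagged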

Exercises

Do these now. The same tasks also appear in the Exercises panel with hints and solutions.

  1. Exercise 1: Top-K frequent labels using Counter + heap. Input: list of labels and K. Output: list of K labels by descending frequency (order among equal counts may be arbitrary).
  2. Exercise 2: Sliding window max with deque. Input: list of ints and window size k. Output: list of window maxima.
  3. Exercise 3: Threshold index with bisect. Input: sorted probabilities and threshold t. Output: first index where value ≥ t (or len(list) if none).

Self-check checklist

  • Time complexity: Exercise 1 O(n log k), Exercise 2 O(n), Exercise 3 O(log n).
  • No unnecessary full sorts for Exercise 1.
  • Deque stores indices and maintains decreasing order for Exercise 2.
  • Bisect function chosen correctly (bisect_left for ≥) for Exercise 3.

Next steps

  • Integrate these patterns into your data loaders and feature pipelines.
  • Profile your code on realistic input sizes; replace hotspots with better structures.
  • Take the quick test below. Progress is available to everyone; saved progress is only for logged-in users.

Mini challenge

Given a stream of numeric scores, maintain the median after each new number in O(log n) per update using two heaps. Print the running median for the first 10 inputs. Hint: one max-heap for the lower half (store negatives), one min-heap for the upper half; rebalance sizes to differ by at most 1.
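
If you want to check your approach afterward, here is one possible sketch of the two-heap technique described in the hint:

import heapq

def running_medians(stream):
    low, high = [], []  # max-heap via negation; min-heap
    medians = []
    for x in stream:
        # Push to the correct half.
        if not low or x <= -low[0]:
            heapq.heappush(low, -x)
        else:
            heapq.heappush(high, x)
        # Rebalance so len(low) is len(high) or len(high) + 1.
        if len(low) > len(high) + 1:
            heapq.heappush(high, -heapq.heappop(low))
        elif len(high) > len(low):
            heapq.heappush(low, -heapq.heappop(high))
        # Median is the top of `low`, or the mean of both tops.
        if len(low) > len(high):
            medians.append(float(-low[0]))
        else:
            medians.append((-low[0] + high[0]) / 2)
    return medians

print(running_medians([5, 2, 8, 1, 9, 3, 7, 4, 6, 0]))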

Practice Exercises

Instructions

Write a function topk_labels(labels, k) that returns the K most frequent labels. Use collections.Counter to count and a min-heap of size K (via heapq) to track the top K.

from collections import Counter
import heapq

def topk_labels(labels, k):
    # Your code here
    pass

print(topk_labels(["cat","dog","cat","bird","dog","cat"], 2))

Expected Output
['cat', 'dog']

Data Structures And Algorithms Basics — Quick Test

Test your knowledge with 10 questions. Pass with 70% or higher.
