Why this matters
k-Nearest Neighbors (k-NN) is a simple, effective baseline for classification, regression, and recommendation. As a Data Scientist, you will:
- Build quick baselines to compare against complex models.
- Prototype recommenders using “people/items like this.”
- Detect anomalies by measuring how isolated a point is.
- Classify images/text using vector embeddings and nearest neighbors.
Real-world tasks you might do with k-NN
- Predict if a transaction is fraudulent by comparing it to recent transactions.
- Recommend products based on similar users’ purchase histories.
- Estimate house prices using nearby comparable homes.
Concept explained simply
k-Nearest Neighbors predicts a value or class by looking at the k most similar data points (neighbors) and combining their targets.
- Classification: neighbors vote for a class; the majority wins.
- Regression: prediction is the average (or distance-weighted average) of neighbor values.
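A minimal sketch of both modes in plain NumPy (`knn_predict` is an illustrative helper for this lesson, not a library function):

```python
import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, query, k=3, mode="classify"):
    """Toy k-NN: Euclidean distance, uniform weights."""
    dists = np.linalg.norm(X_train - query, axis=1)  # distance from query to every training point
    nearest = np.argsort(dists)[:k]                   # indices of the k closest points
    if mode == "classify":
        return Counter(y_train[nearest]).most_common(1)[0][0]  # majority vote among neighbors
    return float(np.mean(y_train[nearest]))                    # plain average for regression
```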
Mental model: “Find your crowd”
Imagine you move to a new city and want a good cafe. You ask the three people who seem most like you (same tastes, budget, location). Their answers guide you. That’s k-NN: find the “closest” points and let them decide.
Core ingredients
- Distance: how to measure similarity. Common: Euclidean, Manhattan; for binary/categorical: Hamming.
- k (number of neighbors): small k can be noisy; large k can over-smooth. Tune via cross-validation.
- Feature scaling: standardize/normalize to prevent large-scale features from dominating.
- Weights: uniform (all neighbors equal) or distance-weighted (closer neighbors matter more).
- Tie-breaking: define a consistent rule (e.g., choose smallest distance, then smallest index).
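The distance options listed above are one-liners with NumPy; a small sketch with made-up vectors:

```python
import numpy as np

a = np.array([1.0, 2.0, 0.5])
b = np.array([2.0, 0.0, 0.5])

euclidean = np.sqrt(np.sum((a - b) ** 2))   # straight-line distance: sqrt(1 + 4 + 0) ≈ 2.236
manhattan = np.sum(np.abs(a - b))           # sum of absolute differences: 1 + 2 + 0 = 3

# Hamming distance for binary/categorical vectors: fraction of positions that differ
u = np.array([1, 0, 1, 1])
v = np.array([1, 1, 1, 0])
hamming = np.mean(u != v)                   # 2 of 4 positions differ -> 0.5
```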
Tip: choosing k
- Start with odd k to reduce ties in binary classification.
- Use cross-validation to pick k that minimizes validation error.
- As k increases: variance decreases, bias increases; boundaries get smoother.
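A sketch of that tuning loop, assuming scikit-learn is available (the dataset here is a random stand-in for your own X, y):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

X, y = make_classification(n_samples=300, n_features=5, random_state=0)  # stand-in data

cv_scores = {k: cross_val_score(KNeighborsClassifier(n_neighbors=k), X, y, cv=5).mean()
             for k in [1, 3, 5, 7, 9, 11]}       # odd k values to reduce ties
best_k = max(cv_scores, key=cv_scores.get)       # k with the best mean CV accuracy
```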
Worked examples
Example 1: 3-NN classification (Euclidean, uniform weights)
Data (x, y → label):
A: (1.5, 1.5)
A: (1.2, 1.0)
A: (0.0, 2.0)
B: (3.0, 2.0)
B: (3.0, 3.0)
B: (2.5, 3.5)
Query: q = (2.0, 2.0), k = 3
- Compute distances to q:
d(A 1.5,1.5) ≈ 0.707
d(A 1.2,1.0) ≈ 1.281
d(A 0.0,2.0) = 2.000
d(B 3.0,2.0) = 1.000
d(B 3.0,3.0) ≈ 1.414
d(B 2.5,3.5) ≈ 1.581
- 3 nearest: A(0.707), B(1.000), A(1.281).
- Votes: A=2, B=1 → predict A.
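To verify the hand calculation, a quick NumPy check:

```python
import numpy as np
from collections import Counter

X = np.array([[1.5, 1.5], [1.2, 1.0], [0.0, 2.0],   # class A points
              [3.0, 2.0], [3.0, 3.0], [2.5, 3.5]])  # class B points
y = np.array(["A", "A", "A", "B", "B", "B"])
q = np.array([2.0, 2.0])

d = np.linalg.norm(X - q, axis=1)         # Euclidean distances to q
idx = np.argsort(d)[:3]                    # three nearest neighbors
print(np.round(d[idx], 3))                 # [0.707 1.    1.281]
print(Counter(y[idx]).most_common(1))      # A gets 2 votes -> predict A
```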
Example 2: k-NN regression (uniform vs distance-weighted)
Neighbors’ target values and distances to q:
y = [100, 130, 110]
d = [0.8, 1.2, 2.0]
- Uniform mean: (100 + 130 + 110)/3 ≈ 113.33
- Distance-weighted (weights = 1/d):
w = [1.25, 0.8333, 0.5], sum w ≈ 2.5833
Weighted mean ≈ (100*1.25 + 130*0.8333 + 110*0.5) / 2.5833 ≈ 111.61
Closer neighbors pull the prediction more.
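A quick check of both averages in NumPy:

```python
import numpy as np

y = np.array([100.0, 130.0, 110.0])      # neighbor targets
d = np.array([0.8, 1.2, 2.0])            # neighbor distances to q

uniform = y.mean()                        # ≈ 113.33
w = 1.0 / d                               # inverse-distance weights [1.25, 0.8333, 0.5]
weighted = np.sum(w * y) / np.sum(w)      # ≈ 111.61, pulled toward the closest neighbor
```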
Example 3: Scaling matters
Features: age (years), income (k$). Two points:
P1 (age=30, income=40), label=A
P2 (age=40, income=41), label=B
Query q (age=31, income=60)
- Raw Euclidean distances:
d(q,P1) = sqrt((31-30)^2 + (60-40)^2) = sqrt(1 + 400) ≈ 20.02
d(q,P2) = sqrt((31-40)^2 + (60-41)^2) = sqrt(81 + 361) ≈ 21.02
q → closer to P1 (A).
- Notice that the income differences (20 and 19) dwarf the age differences (1 and 9) in both distances. If income ranges 0–200 and age 18–80, income dominates most comparisons unless you scale. Standardize (z-scores) or min–max scale before computing distances so features contribute comparably.
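A sketch of the scaled version, using min–max scaling with the illustrative ranges above (age 18–80, income 0–200):

```python
import numpy as np

points = np.array([[30.0, 40.0],   # P1
                   [40.0, 41.0],   # P2
                   [31.0, 60.0]])  # query q
lo = np.array([18.0, 0.0])         # assumed feature minimums (age, income)
hi = np.array([80.0, 200.0])       # assumed feature maximums
scaled = (points - lo) / (hi - lo) # min-max scale each feature to [0, 1]

p1, p2, q = scaled
print(round(np.linalg.norm(q - p1), 3), round(np.linalg.norm(q - p2), 3))
# P1 is still closer here, but age and income now contribute on comparable
# scales instead of income alone driving the distance.
```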
Mini check
If one feature’s unit is dollars and another is meters, should you scale? Yes—use standardization or min–max scaling.
How to apply k-NN in practice
- Prepare data: handle missing values; encode categoricals (e.g., one-hot or appropriate distance).
- Scale features: standardize numeric features.
- Choose distance: Euclidean for dense numeric; Manhattan for robustness; Hamming for binary indicators.
- Tune hyperparameters: pick k and weight scheme via cross-validation.
- Evaluate: use proper metrics (accuracy/F1 for classification, MAE/RMSE for regression).
- Speed: for large N, consider approximate neighbors or dimensionality reduction; index structures (KD-tree/Ball-tree) help in lower dimensions.
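A sketch that wires several of these steps together with scikit-learn, keeping the scaler inside the cross-validated pipeline (random stand-in data):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = make_classification(n_samples=400, n_features=6, random_state=0)  # stand-in data

# Scaling inside the pipeline keeps CV honest: each fold is scaled using
# only that fold's training statistics.
pipe = Pipeline([("scale", StandardScaler()),
                 ("knn", KNeighborsClassifier(n_neighbors=5, weights="distance"))])
print(cross_val_score(pipe, X, y, cv=5).mean())
```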
Deployment tip
Keep the training set (or an index built from it) accessible at prediction time. k-NN is fast to train because it mostly just stores the data, but it does its real work at inference.
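For example, you can build a neighbor index once and reuse it for every query; a sketch with scikit-learn's NearestNeighbors on stand-in data:

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

rng = np.random.default_rng(0)
X_train = rng.normal(size=(10_000, 8))        # stand-in training features

# Build the index once (train time) and keep it around for inference.
index = NearestNeighbors(n_neighbors=5, algorithm="ball_tree").fit(X_train)

queries = rng.normal(size=(3, 8))
dist, idx = index.kneighbors(queries)          # distances and row indices of the neighbors
```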
Checklist: before you fit k-NN
- [ ] Numeric features standardized
- [ ] Categorical features encoded or proper distance defined
- [ ] k chosen with cross-validation
- [ ] Weighting scheme decided (uniform vs distance)
- [ ] Tie-breaking policy defined
- [ ] Evaluation metric matched to business goal
Exercises
Try these without a calculator first; then verify.
Exercise 1: 3-NN classification by hand
Same data as Example 1. Use k = 3, Euclidean, uniform weights. What is the predicted class for q = (2.0, 2.0)?
Hint
Sort all distances ascending; take top 3 neighbors; count votes.
Show solution
Nearest three: A(0.707), B(1.000), A(1.281) → A=2, B=1 → predict A.
Exercise 2: k-NN regression weighting
Neighbors have distances d = [0.8, 1.2, 2.0] and targets y = [100, 130, 110]. Compute predictions for:
(a) uniform weights, (b) distance weights w = 1/d. Round to two decimals.
Hint
Uniform: simple mean. Weighted: sum(y_i * w_i) / sum(w_i).
Show solution
(a) (100 + 130 + 110)/3 ≈ 113.33
(b) w = [1.25, 0.8333, 0.5], sum ≈ 2.5833 → (125 + 108.33 + 55)/2.5833 ≈ 111.61
Common mistakes and self-check
- Forgetting to scale features → Self-check: does one feature have much larger units? Standardize and re-evaluate.
- Using too small k → Overfits noise. Self-check: validation error high variance across folds?
- Using accuracy on imbalanced data → Self-check: also view precision/recall/F1.
- Mismatched distance for data type → Self-check: are you using Euclidean on sparse binary indicators? Try Hamming or cosine on normalized vectors.
- Ignoring ties → Self-check: define a consistent tie-break rule.
Quick sanity checks
- Does prediction change drastically with tiny noise? Try larger k.
- Do results flip after scaling? Keep the scaled pipeline in cross-validation.
Mini challenge
You have 2 numeric features (standardized) and need a robust binary classifier baseline. Tune k in {1, 3, 5, 7, 9} and weighting in {uniform, distance} with 5-fold CV. Your dataset has class imbalance 1:4. What do you report?
- Deliver: best (k, weighting) by F1, plus confusion matrix on a held-out test set.
- Stretch: compare Manhattan vs Euclidean distance.
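One possible setup for this challenge, sketched with scikit-learn on stand-in imbalanced data (swap in your own features and labels):

```python
from sklearn.datasets import make_classification
from sklearn.metrics import confusion_matrix, f1_score
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Stand-in data with roughly 1:4 class imbalance.
X, y = make_classification(n_samples=1000, n_features=2, n_informative=2,
                           n_redundant=0, weights=[0.8, 0.2], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2,
                                          stratify=y, random_state=0)

pipe = Pipeline([("scale", StandardScaler()), ("knn", KNeighborsClassifier())])
grid = GridSearchCV(pipe,
                    {"knn__n_neighbors": [1, 3, 5, 7, 9],
                     "knn__weights": ["uniform", "distance"]},
                    scoring="f1", cv=5)                 # F1, not accuracy, for imbalance
grid.fit(X_tr, y_tr)

y_pred = grid.predict(X_te)
print(grid.best_params_)                                # best (k, weighting) by CV F1
print("test F1:", round(f1_score(y_te, y_pred), 3))
print(confusion_matrix(y_te, y_pred))                   # report on the held-out test set
```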
Who this is for
- Data Scientists building baselines and interpretable models.
- Analysts transitioning to ML who want a practical, quick-win method.
- Engineers prototyping recommenders with embeddings.
Prerequisites
- Basic statistics: mean, variance, bias–variance intuition.
- Python or similar for implementation (optional for the theory here).
- Data preprocessing: handling missing values, encoding categoricals.
Learning path
- Before: Feature scaling and normalization; distance metrics basics.
- Now: Nearest Neighbors Basics (this lesson).
- Next: Model selection with cross-validation; handling imbalanced data; approximate nearest neighbors for scale.
Practical projects
- Classification: Predict customer churn using a k-NN baseline; compare F1 vs logistic regression.
- Regression: Estimate apartment rent using k-NN; test uniform vs distance weights.
- Embeddings: Use precomputed text or image embeddings and k-NN for retrieval/classification.
Next steps
- Try different k and weighting on your data; log CV scores.
- Experiment with Euclidean vs Manhattan vs cosine (with normalized vectors).
- Investigate KD-tree/Ball-tree indexes for speed in low/moderate dimensions.
Check your understanding
Take the quick test below.