
Nearest Neighbors Basics

Learn Nearest Neighbors Basics for free with explanations, exercises, and a quick test (for Data Scientists).

Published: January 1, 2026 | Updated: January 1, 2026

Why this matters

Nearest Neighbors (k-NN) is a simple, effective baseline for classification, regression, and recommendation. As a Data Scientist, you will:

  • Build quick baselines to compare against complex models.
  • Prototype recommenders using “people/items like this.”
  • Detect anomalies by measuring how isolated a point is.
  • Classify images/text using vector embeddings and nearest neighbors.
Real-world tasks you might do with k-NN
  • Predict if a transaction is fraudulent by comparing it to recent transactions.
  • Recommend products based on similar users’ purchase histories.
  • Estimate house prices using nearby comparable homes.

Concept explained simply

k-Nearest Neighbors predicts a value or class by looking at the k most similar data points (neighbors) and combining their targets.

  • Classification: neighbors vote for a class; the majority wins.
  • Regression: prediction is the average (or distance-weighted average) of neighbor values.
Mental model: “Find your crowd”

Imagine you move to a new city and want a good cafe. You ask the three people who seem most like you (same tastes, budget, location). Their answers guide you. That’s k-NN: find the “closest” points and let them decide.
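
To make the idea concrete, here is a minimal from-scratch sketch (Python, with NumPy assumed available). It implements the classification case: compute distances, take the k closest points, and let them vote. The data points match Example 1 further down.

```python
import numpy as np

def knn_predict(X_train, y_train, query, k=3):
    """Classify `query` by majority vote among its k nearest training points."""
    dists = np.linalg.norm(X_train - query, axis=1)   # Euclidean distance to every training point
    nearest = np.argsort(dists)[:k]                   # indices of the k smallest distances
    labels, counts = np.unique(y_train[nearest], return_counts=True)
    return labels[np.argmax(counts)]                  # majority vote (alphabetically first label wins a tie)

# Toy data reused in the worked example below
X = np.array([[1.5, 1.5], [1.2, 1.0], [0.0, 2.0],
              [3.0, 2.0], [3.0, 3.0], [2.5, 3.5]])
y = np.array(["A", "A", "A", "B", "B", "B"])

print(knn_predict(X, y, np.array([2.0, 2.0]), k=3))   # -> A
```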

Core ingredients

  • Distance: how to measure similarity. Common: Euclidean, Manhattan; for binary/categorical: Hamming (see the sketch after this list).
  • k (number of neighbors): small k can be noisy; large k can over-smooth. Tune via cross-validation.
  • Feature scaling: standardize/normalize to prevent large-scale features from dominating.
  • Weights: uniform (all neighbors equal) or distance-weighted (closer neighbors matter more).
  • Tie-breaking: define a consistent rule (e.g., choose smallest distance, then smallest index).
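
The three distance functions named above fit in a few lines; this is an illustrative sketch (NumPy assumed; SciPy's scipy.spatial.distance module also provides them).

```python
import numpy as np

a = np.array([1.0, 2.0, 3.0])
b = np.array([2.0, 0.0, 3.0])

euclidean = np.sqrt(np.sum((a - b) ** 2))  # straight-line distance: sqrt(1 + 4 + 0) ≈ 2.236
manhattan = np.sum(np.abs(a - b))          # sum of absolute differences: 1 + 2 + 0 = 3

# Hamming: fraction of positions that differ, for binary/categorical vectors
u = np.array([1, 0, 1, 1])
v = np.array([1, 1, 1, 0])
hamming = np.mean(u != v)                  # 2 of 4 positions differ -> 0.5

print(euclidean, manhattan, hamming)
```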
Tip: choosing k
  • Start with odd k to reduce ties in binary classification.
  • Use cross-validation to pick k that minimizes validation error.
  • As k increases: variance decreases, bias increases; boundaries get smoother.
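
A minimal sketch of that tuning loop, assuming scikit-learn is available; the synthetic dataset here just stands in for your own features and labels.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data; replace with your own feature matrix and labels
X, y = make_classification(n_samples=500, n_features=8, random_state=0)

for k in [1, 3, 5, 7, 9, 15]:
    model = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=k))
    scores = cross_val_score(model, X, y, cv=5)   # 5-fold cross-validated accuracy
    print(f"k={k:>2}  mean accuracy={scores.mean():.3f}")
```

Keeping the scaler inside the pipeline means it is re-fit on each training fold, so validation scores are not contaminated by statistics from the held-out fold.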

Worked examples

Example 1: 3-NN classification (Euclidean, uniform weights)

Data (x, y → label):

A: (1.5, 1.5)
A: (1.2, 1.0)
A: (0.0, 2.0)
B: (3.0, 2.0)
B: (3.0, 3.0)
B: (2.5, 3.5)
Query: q = (2.0, 2.0), k = 3
  1. Compute distances to q:
    d(q, A(1.5, 1.5)) ≈ 0.707
    d(q, A(1.2, 1.0)) ≈ 1.281
    d(q, A(0.0, 2.0)) = 2.000
    d(q, B(3.0, 2.0)) = 1.000
    d(q, B(3.0, 3.0)) ≈ 1.414
    d(q, B(2.5, 3.5)) ≈ 1.581
  2. 3 nearest: A(0.707), B(1.000), A(1.281).
  3. Votes: A=2, B=1 → predict A.
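
If scikit-learn is available, the same calculation can be checked in code; the distances and the vote match the steps above.

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

X = np.array([[1.5, 1.5], [1.2, 1.0], [0.0, 2.0],
              [3.0, 2.0], [3.0, 3.0], [2.5, 3.5]])
y = np.array(["A", "A", "A", "B", "B", "B"])

clf = KNeighborsClassifier(n_neighbors=3, weights="uniform")  # Euclidean distance by default
clf.fit(X, y)

q = np.array([[2.0, 2.0]])
dist, idx = clf.kneighbors(q)     # distances to, and indices of, the 3 nearest points
print(dist.round(3), y[idx])      # [[0.707 1.    1.281]] [['A' 'B' 'A']]
print(clf.predict(q))             # ['A']
```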

Example 2: KNN regression (uniform vs distance-weighted)

Neighbors’ target values and distances to q:

y = [100, 130, 110]
d = [0.8, 1.2, 2.0]
  • Uniform mean: (100 + 130 + 110)/3 = 113.33
  • Distance-weighted (weights = 1/d):
    w = [1.25, 0.8333, 0.5], sum w ≈ 2.5833
    Weighted mean ≈ (100*1.25 + 130*0.8333 + 110*0.5) / 2.5833 ≈ 111.61

Closer neighbors pull the prediction more.
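
A few lines of plain Python reproduce both numbers:

```python
y = [100, 130, 110]
d = [0.8, 1.2, 2.0]

uniform = sum(y) / len(y)                                  # simple mean
w = [1 / di for di in d]                                   # inverse-distance weights
weighted = sum(yi * wi for yi, wi in zip(y, w)) / sum(w)   # weighted mean

print(round(uniform, 2), round(weighted, 2))               # 113.33 111.61
```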

Example 3: Scaling matters

Features: age (years), income (k$). Two points:

P1 (age=30, income=40), label=A
P2 (age=40, income=41), label=B
Query q (age=31, income=60)
  • Raw Euclidean distances:
    d(q,P1) = sqrt((31-30)^2 + (60-40)^2) = sqrt(1 + 400) ≈ 20.02
    d(q,P2) = sqrt((31-40)^2 + (60-41)^2) = sqrt(81 + 361) ≈ 21.02
    q → closer to P1 (A).
  • But if income ranges 0–200 and age 18–80, income dominates most comparisons when features are left on raw units. Standardize (z-scores, fit on the training data) before computing distances so each feature contributes comparably; see the sketch below.
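
The sketch below (NumPy assumed) makes the point tangible: with the exact same data, merely expressing income in dollars instead of thousands flips which point is nearest, which is why distances should be computed on standardized features.

```python
import numpy as np

def nearest(q, points, labels):
    """Label of the training point closest to q under Euclidean distance."""
    d = np.linalg.norm(points - q, axis=1)
    return labels[int(np.argmin(d))], d.round(2)

labels = np.array(["A", "B"])

# Income in k$, as in the example above: P1 (class A) is nearest
P_k = np.array([[30.0, 40.0], [40.0, 41.0]])
q_k = np.array([31.0, 60.0])
print(nearest(q_k, P_k, labels))                 # ('A', array([20.02, 21.02]))

# Same data with income in dollars: the nearest neighbor flips to P2 (class B)
unit = np.array([1.0, 1000.0])
print(nearest(q_k * unit, P_k * unit, labels))   # ('B', array([20000., 19000.]))
```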
Mini check

If one feature’s unit is dollars and another is meters, should you scale? Yes—use standardization or min–max scaling.

How to apply k-NN in practice

  1. Prepare data: handle missing values; encode categoricals (e.g., one-hot or appropriate distance).
  2. Scale features: standardize numeric features.
  3. Choose distance: Euclidean for dense numeric; Manhattan for robustness; Hamming for binary indicators.
  4. Tune hyperparameters: pick k and weight scheme via cross-validation.
  5. Evaluate: use proper metrics (accuracy/F1 for classification, MAE/RMSE for regression).
  6. Speed: for large N, consider approximate neighbors or dimensionality reduction; index structures (KD-tree/Ball-tree) help in lower dimensions.
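
Putting steps 2–5 together for a regression task might look like the sketch below (scikit-learn assumed; the synthetic data, grid values, and metrics are illustrative choices, not prescriptions).

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.metrics import mean_absolute_error, mean_squared_error
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.neighbors import KNeighborsRegressor
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for your own cleaned, encoded feature matrix
X, y = make_regression(n_samples=600, n_features=6, noise=10.0, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)

pipe = Pipeline([("scale", StandardScaler()),            # step 2: scaling
                 ("knn", KNeighborsRegressor())])
grid = {"knn__n_neighbors": [3, 5, 7, 9, 15],            # step 4: tune k and weights
        "knn__weights": ["uniform", "distance"],
        "knn__p": [1, 2]}                                 # step 3: 1 = Manhattan, 2 = Euclidean
search = GridSearchCV(pipe, grid, cv=5, scoring="neg_mean_absolute_error")
search.fit(X_tr, y_tr)

pred = search.predict(X_te)                               # step 5: evaluate on held-out data
print(search.best_params_)
print("MAE :", mean_absolute_error(y_te, pred))
print("RMSE:", np.sqrt(mean_squared_error(y_te, pred)))
```

Swapping KNeighborsRegressor for KNeighborsClassifier (and MAE/RMSE for accuracy or F1) gives the classification version of the same recipe.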
Deployment tip

Keep the training set (or an index of it) accessible at prediction time. k-NN is fast to train but does most of its work at inference time.

Checklist: before you fit k-NN

  • [ ] Numeric features standardized
  • [ ] Categorical features encoded or proper distance defined
  • [ ] k chosen with cross-validation
  • [ ] Weighting scheme decided (uniform vs distance)
  • [ ] Tie-breaking policy defined
  • [ ] Evaluation metric matched to business goal

Exercises

Try these without a calculator first; then verify.

Exercise 1: 3-NN classification by hand

Same data as Example 1. Use k = 3, Euclidean, uniform weights. What is the predicted class for q = (2.0, 2.0)?

Hint

Sort all distances ascending; take top 3 neighbors; count votes.

Show solution

Nearest three: A(0.707), B(1.000), A(1.281) → A=2, B=1 → predict A.

Exercise 2: KNN regression weighting

Neighbors have distances d = [0.8, 1.2, 2.0] and targets y = [100, 130, 110]. Compute predictions for:
(a) uniform weights, (b) distance weights w = 1/d. Round to two decimals.

Hint

Uniform: simple mean. Weighted: sum(y_i * w_i) / sum(w_i).

Show solution

(a) (100 + 130 + 110)/3 = 113.33
(b) w = [1.25, 0.8333, 0.5], sum ≈ 2.5833 → (125 + 108.33 + 55)/2.5833 ≈ 111.61

Common mistakes and self-check

  • Forgetting to scale features → Self-check: does one feature have much larger units? Standardize and re-evaluate.
  • Using too small a k → overfits noise. Self-check: does validation error vary widely across folds?
  • Using accuracy on imbalanced data → Self-check: also view precision/recall/F1.
  • Mismatched distance for data type → Self-check: are you using Euclidean on sparse binary indicators? Try Hamming or cosine on normalized vectors.
  • Ignoring ties → Self-check: define a consistent tie-break rule.
Quick sanity checks
  • Does prediction change drastically with tiny noise? Try larger k.
  • Do results flip after scaling? Keep scaling inside the cross-validation pipeline so it is fit on training folds only.

Mini challenge

You have 2 numeric features (standardized) and need a robust binary classifier baseline. Tune k in {1, 3, 5, 7, 9} and weighting in {uniform, distance} with 5-fold CV. Your dataset has class imbalance 1:4. What do you report?

  • Deliver: best (k, weighting) by F1, plus confusion matrix on a held-out test set.
  • Stretch: compare Manhattan vs Euclidean distance.
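
One way to set that up, sketched with scikit-learn and a synthetic imbalanced dataset standing in for yours:

```python
from sklearn.datasets import make_classification
from sklearn.metrics import classification_report, confusion_matrix
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in: 2 features, roughly 1:4 class imbalance
X, y = make_classification(n_samples=1000, n_features=2, n_informative=2,
                           n_redundant=0, weights=[0.8, 0.2], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25,
                                          stratify=y, random_state=0)

pipe = Pipeline([("scale", StandardScaler()),
                 ("knn", KNeighborsClassifier())])
grid = {"knn__n_neighbors": [1, 3, 5, 7, 9],
        "knn__weights": ["uniform", "distance"],
        "knn__p": [1, 2]}                        # stretch goal: Manhattan vs Euclidean
search = GridSearchCV(pipe, grid, cv=5, scoring="f1")   # F1 because of the imbalance
search.fit(X_tr, y_tr)

print("Best params by F1:", search.best_params_)
print(confusion_matrix(y_te, search.predict(X_te)))      # report on the held-out test set
print(classification_report(y_te, search.predict(X_te)))
```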

Who this is for

  • Data Scientists building baselines and interpretable models.
  • Analysts transitioning to ML who want a practical, quick-win method.
  • Engineers prototyping recommenders with embeddings.

Prerequisites

  • Basic statistics: mean, variance, bias–variance intuition.
  • Python or similar for implementation (optional for the theory here).
  • Data preprocessing: handling missing values, encoding categoricals.

Learning path

  • Before: Feature scaling and normalization; distance metrics basics.
  • Now: Nearest Neighbors Basics (this lesson).
  • Next: Model selection with cross-validation; handling imbalanced data; approximate nearest neighbors for scale.

Practical projects

  • Classification: Predict customer churn using k-NN baseline; compare F1 vs logistic regression.
  • Regression: Estimate apartment rent using k-NN; test uniform vs distance weights.
  • Embeddings: Use precomputed text or image embeddings and k-NN for retrieval/classification.

Next steps

  • Try different k and weighting on your data; log CV scores.
  • Experiment with Euclidean vs Manhattan vs cosine (with normalized vectors).
  • Investigate KD-tree/Ball-tree indexes for speed in low/moderate dimensions.
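
For the last point, a minimal sketch of building a tree-backed index with scikit-learn's NearestNeighbors (assumed available); for very large or very high-dimensional data, approximate libraries such as FAISS or Annoy are the usual next step.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

rng = np.random.default_rng(0)
X = rng.normal(size=(100_000, 8))       # 100k points in 8 dimensions

index = NearestNeighbors(n_neighbors=5, algorithm="kd_tree")  # or "ball_tree"
index.fit(X)

queries = rng.normal(size=(3, 8))
dist, idx = index.kneighbors(queries)   # 5 nearest training rows per query
print(idx.shape, dist.shape)            # (3, 5) (3, 5)
```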

Check your understanding

Take the quick test below.

Practice Exercises

2 exercises to complete

Instructions

Using the dataset below and Euclidean distance with uniform weights, classify q = (2.0, 2.0) with k=3.

A: (1.5, 1.5)
A: (1.2, 1.0)
A: (0.0, 2.0)
B: (3.0, 2.0)
B: (3.0, 3.0)
B: (2.5, 3.5)

Show your distance calculations and final vote.

Expected Output
Predicted class: A

Nearest Neighbors Basics — Quick Test

Test your knowledge with 8 questions. Pass with 70% or higher.

