
Clustering Basics: K-means, Hierarchical, and DBSCAN

Learn the basics of K-means, hierarchical, and DBSCAN clustering for free, with explanations, exercises, and a quick test (for Data Scientists).

Published: January 1, 2026 | Updated: January 1, 2026

Why this matters

Clustering reveals structure in unlabeled data. As a Data Scientist, you will:

  • Segment customers for targeted marketing and personalization.
  • Detect anomalous behavior or fraud by finding points that do not belong to any cluster.
  • Summarize large datasets (e.g., color quantization, topic grouping) to reduce complexity.
  • Create features (cluster IDs, distances to centroids) that boost downstream models.
Real-world tasks you might do
  • Choose K-means vs. DBSCAN based on whether clusters are compact and whether noise is expected.
  • Tune K, eps, and min_samples using silhouette and domain constraints.
  • Explain a dendrogram cut to a stakeholder to agree on number of groups.

Who this is for and prerequisites

Who this is for:
  • Aspiring and junior Data Scientists needing a solid start in clustering.
  • Analysts and ML engineers who want practical, explainable methods.
Prerequisites:
  • Comfort with basic statistics (mean, variance) and distance (Euclidean).
  • Know how to handle numeric arrays and plots in Python or similar.
  • Ability to scale/standardize features.

Concept explained simply

Clustering groups similar items without labels. You define a notion of similarity (a distance metric), and an algorithm forms groups so that similar points end up together and dissimilar points stay apart.

Mental model: Imagine dropping magnets into a field of iron filings (your data).
  • K-means: You place K magnets; filings move to the nearest magnet; you reposition magnets to the average of attached filings until stable.
  • Hierarchical (agglomerative): Start with every filing alone; repeatedly merge the two closest piles; you get a merge tree (dendrogram) and choose a cut level.
  • DBSCAN: Draw a small circle (eps) around each filing. Densely packed filings form clusters; sparse points become noise.

Core algorithms in brief

K-means (centroid-based)

Step 1: Choose K and initialize K centroids (random or k-means++).
Step 2: Assign each point to the nearest centroid.
Step 3: Recompute each centroid as the mean of its assigned points.
Step 4: Repeat Steps 2–3 until assignments stop changing (or max iterations).
When K-means shines and when it fails
  • Shines: roughly spherical, similar-sized clusters; scaled numeric features; large datasets.
  • Weak: non-spherical clusters, varying densities, outliers; sensitive to scaling and initialization.
How to pick K
  • Elbow: plot inertia vs. K; look for diminishing returns.
  • Silhouette: pick K that maximizes average silhouette score.
  • Domain: business limits (e.g., number of segments you can act on).
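
A minimal sketch of fitting K-means and comparing K values with inertia and silhouette, assuming scikit-learn and NumPy are installed; the two-blob toy data is made up for illustration:

import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 0.5, (50, 2)), rng.normal(4, 0.5, (50, 2))])  # toy, already-scaled data

for k in range(2, 6):
    km = KMeans(n_clusters=k, init="k-means++", n_init=10, random_state=0).fit(X)
    # inertia keeps falling as K grows (elbow); silhouette should peak near the "true" K
    print(k, round(km.inertia_, 1), round(silhouette_score(X, km.labels_), 3))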

Hierarchical (agglomerative)

Start with each point as its own cluster and iteratively merge the two closest clusters until one remains. You choose the final number of clusters by cutting the dendrogram at a chosen height.

Linkage choices
  • Single: nearest pair distance (can chain points into elongated clusters).
  • Complete: farthest pair distance (tends to produce compact clusters).
  • Average: average pairwise distance.
  • Ward: merges that minimize variance increase (similar to K-means behavior).
Pros and cons
  • Pros: No need to predefine K; interpretable tree.
  • Cons: O(n^2) memory/time for many implementations; sensitive to distance/linkage choice.
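
A minimal sketch comparing linkage choices, assuming scikit-learn is installed; the three-blob toy data comes from make_blobs and is purely illustrative:

import numpy as np
from sklearn.cluster import AgglomerativeClustering
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=60, centers=3, cluster_std=0.7, random_state=0)
for link in ["single", "complete", "average", "ward"]:
    labels = AgglomerativeClustering(n_clusters=3, linkage=link).fit_predict(X)
    print(link, np.bincount(labels))  # cluster sizes under each linkage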

DBSCAN (density-based)

Two parameters define density: eps (neighborhood radius) and min_samples (minimum points to be dense). Points are:

  • Core: at least min_samples within eps (including itself).
  • Border: within eps of a core but not dense enough to be core.
  • Noise: neither core nor border.
Parameter tips
  • k-distance plot: sort distances to the k-th nearest neighbor (k = min_samples) and look for the knee to pick eps (see the sketch after these tips).
  • Scale features first; eps is in feature units.
  • Larger eps merges clusters; larger min_samples requires denser regions.
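
A minimal sketch of the k-distance plot described above, assuming scikit-learn and matplotlib are installed; X here is placeholder data standing in for your own scaled feature matrix:

import numpy as np
import matplotlib.pyplot as plt
from sklearn.neighbors import NearestNeighbors

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))  # placeholder: substitute your own scaled features

min_samples = 4
dists, _ = NearestNeighbors(n_neighbors=min_samples).fit(X).kneighbors(X)
kth = np.sort(dists[:, -1])    # distance to the k-th nearest neighbor, sorted ascending
plt.plot(kth)
plt.ylabel("k-distance")
plt.show()                     # look for the knee; its height is a candidate eps
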
Pros and cons
  • Pros: Finds arbitrarily shaped clusters; identifies noise; no need to predefine K.
  • Cons: Struggles with varying densities; parameter sensitivity; distance computations can be heavy on very large datasets.
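
A minimal sketch of running DBSCAN, assuming scikit-learn is installed; eps and min_samples below are illustrative guesses for the made-up two-blob data, not tuned values:

import numpy as np
from sklearn.cluster import DBSCAN

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 0.3, (50, 2)), rng.normal(3, 0.3, (50, 2)), [[10.0, 10.0]]])
db = DBSCAN(eps=0.5, min_samples=5).fit(X)
print(set(db.labels_))  # cluster ids found; label -1 marks noise points such as the far outlier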

Worked examples

Example 1: One iteration of K-means

Points: A(1,1), B(1,2), C(4,4), D(5,5). K=2. Initial centroids m1=(1,1), m2=(5,5).

Walkthrough
  1. Assign to nearest centroid:
    • A→m1, B→m1, C→m2, D→m2.
  2. Recompute centroids:
    • m1 = mean(A,B) = (1, 1.5)
    • m2 = mean(C,D) = (4.5, 4.5)
  3. The next iteration keeps the same assignments, so the centroids stop moving and the algorithm has converged.
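
A quick NumPy check of this iteration (a minimal sketch, not a full K-means implementation):

import numpy as np

pts = np.array([[1, 1], [1, 2], [4, 4], [5, 5]])   # A, B, C, D
centroids = np.array([[1.0, 1.0], [5.0, 5.0]])     # m1, m2
dists = np.linalg.norm(pts[:, None, :] - centroids[None, :, :], axis=2)
assign = dists.argmin(axis=1)                      # [0, 0, 1, 1]: A,B -> m1; C,D -> m2
new_centroids = np.array([pts[assign == j].mean(axis=0) for j in range(2)])
print(new_centroids)                               # [[1.  1.5] [4.5 4.5]]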

Example 2: Agglomerative merges

Points: P1(0,0), P2(0,0.2), P3(3,3), P4(3.1,3.2). Distance: Euclidean. Linkage: complete.

Walkthrough
  1. Closest pairs: (P1,P2) and (P3,P4) form two small clusters first.
  2. With complete linkage, the distance between these two clusters is the maximum pairwise distance across them, which is large, so they merge only at the final step.
  3. Cut the dendrogram before the final merge to get 2 clusters: {P1,P2}, {P3,P4}.
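
A minimal sketch verifying this with SciPy's complete linkage (assuming SciPy is installed):

import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

pts = np.array([[0, 0], [0, 0.2], [3, 3], [3.1, 3.2]])  # P1, P2, P3, P4
Z = linkage(pts, method="complete")                      # merge tree (one row per merge)
print(fcluster(Z, t=2, criterion="maxclust"))            # two groups: {P1,P2} and {P3,P4}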

Example 3: DBSCAN labeling

Points: P1(0,0), P2(0,0.9), P3(0,1.8), P4(5,5), P5(5.2,5), P6(8,8). eps=1.0, min_samples=2.

Walkthrough
  • Neighbors within eps:
    • P1: {P1,P2} → core
    • P2: {P2,P1,P3} → core
    • P3: {P3,P2} → core
    • P4: {P4,P5} → core
    • P5: {P5,P4} → core
    • P6: {P6} → not core; not within eps of any core → noise
  • Clusters: C1={P1,P2,P3}, C2={P4,P5}; Noise={P6}.
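
A minimal sketch checking this labeling with scikit-learn's DBSCAN (assuming scikit-learn is installed):

import numpy as np
from sklearn.cluster import DBSCAN

pts = np.array([[0, 0], [0, 0.9], [0, 1.8], [5, 5], [5.2, 5], [8, 8]])  # P1..P6
db = DBSCAN(eps=1.0, min_samples=2).fit(pts)
print(db.labels_)  # [0 0 0 1 1 -1]: C1 = {P1,P2,P3}, C2 = {P4,P5}, P6 = noise (-1)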

How to choose the algorithm

  • If clusters are compact, similar-sized, numeric, and you need speed → K-means.
  • If you want a dendrogram and do not know K → Hierarchical (choose linkage thoughtfully).
  • If shapes are irregular and noise is likely → DBSCAN (tune eps, min_samples).
Distance and scaling checklist
  • Always scale features before distance-based clustering.
  • Use Euclidean for K-means; other metrics for hierarchical/DBSCAN are possible.
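
A minimal sketch of standardizing features before distance-based clustering, assuming scikit-learn is installed; the tiny matrix is made up to show the effect of very different feature scales:

import numpy as np
from sklearn.preprocessing import StandardScaler

X = np.array([[1.0, 2000.0], [2.0, 3000.0], [3.0, 1000.0]])  # columns on very different scales
X_scaled = StandardScaler().fit_transform(X)                 # each column: mean 0, std 1
print(X_scaled.mean(axis=0), X_scaled.std(axis=0))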

Practice: your turn

Complete these exercises. Then compare with the solutions.

  • Exercise 1: One K-means iteration with given points and centroids.
  • Exercise 2: DBSCAN core/border/noise and clusters with given eps and min_samples.
  • Exercise 3: Pick the best algorithm for each scenario and justify.
Preparation checklist
  • Scaled features? If not applicable, note why.
  • Can you compute Euclidean distance quickly by hand?
  • Do you understand core/border/noise definitions?

Common mistakes and self-check

  • Not scaling features before distance-based methods.
  • Assuming K-means can find non-spherical clusters.
  • Choosing K only by elbow without checking silhouette or business constraints.
  • Using DBSCAN with eps set in unscaled units.
  • Ignoring outliers; K-means is sensitive to them.
  • Misreading dendrograms and cutting at arbitrary heights.
How to self-check
  • Plot clusters and boundaries; do shapes match your expectations?
  • Compute the silhouette score; values above ~0.5 often indicate well-separated clusters (context-dependent).
  • Stability test: rerun K-means with different seeds; results should be similar.
  • For DBSCAN, try slightly different eps; clusters should not swing wildly.
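
A minimal sketch of two of these self-checks, silhouette score and seed stability, assuming scikit-learn is installed; the three-blob toy data is made up for illustration:

import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score, adjusted_rand_score

X, _ = make_blobs(n_samples=300, centers=3, cluster_std=0.8, random_state=0)
runs = []
for seed in (0, 1, 2):
    km = KMeans(n_clusters=3, n_init=10, random_state=seed).fit(X)
    runs.append(km.labels_)
    print("seed", seed, "silhouette:", round(silhouette_score(X, km.labels_), 3))
# adjusted Rand index near 1.0 means the two runs agree almost perfectly
print("seed 0 vs seed 1 agreement:", round(adjusted_rand_score(runs[0], runs[1]), 3))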

Practical projects

  • Customer segmentation: cluster customers on scaled RFM features; profile each cluster.
  • Sensor anomaly detection: use DBSCAN to flag noise points; validate against known events.
  • Image color quantization: K-means on pixel colors to reduce to K palette colors.

Learning path

  1. Review scaling and distance metrics (especially Euclidean, cosine).
  2. Implement K-means from scratch once (see the sketch after this list); then use a library.
  3. Learn agglomerative linkage effects by comparing single/complete/average/Ward.
  4. Tune DBSCAN using k-distance plots and domain ranges.
  5. Evaluate with silhouette and visual diagnostics; document decisions.
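
For step 2, a minimal from-scratch K-means sketch in plain NumPy; it ignores the rare empty-cluster edge case, and for real work you would prefer a library implementation such as scikit-learn's KMeans:

import numpy as np

def kmeans(X, k, n_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    centroids = X[rng.choice(len(X), size=k, replace=False)]       # random initialization
    for _ in range(n_iter):
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        assign = dists.argmin(axis=1)                              # assign to nearest centroid
        new_centroids = np.array([X[assign == j].mean(axis=0) for j in range(k)])
        if np.allclose(new_centroids, centroids):                  # stop when centroids settle
            break
        centroids = new_centroids
    return assign, centroids

rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0, 0.4, (30, 2)), rng.normal(4, 0.4, (30, 2))])  # toy two-blob data
print(kmeans(X, 2)[1])  # learned centroids (expected near the two blob centers)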

Mini challenge

You have 2D data with two crescent-shaped clusters and scattered noise. Choose one of the three algorithms and outline exact steps (including parameter guesses) to separate crescents and mark noise. Justify why your choice works.

Hint

Crescent shapes are non-spherical; density-based approaches usually outperform centroid-based methods here.

Next steps

  • Try PCA before clustering to denoise and visualize.
  • Experiment with different distance metrics for hierarchical and DBSCAN.
  • Add cluster features to a supervised model and measure uplift.

Practice Exercises

3 exercises to complete

Instructions (Exercise 1)

Points: A(1,1), B(1,2), C(4,4), D(5,5). K=2. Initial centroids: m1=(1,1), m2=(5,5).

  1. Assign each point to the nearest centroid (Euclidean).
  2. Compute the new centroids after this assignment.
Expected Output
Assignments: {A,B} -> m1, {C,D} -> m2. New centroids: m1=(1,1.5), m2=(4.5,4.5).

Clustering Basics: K-means, Hierarchical, and DBSCAN — Quick Test

Test your knowledge with 9 questions. Pass with 70% or higher.

