
Dimensionality Reduction: PCA and UMAP Basics

Learn the basics of dimensionality reduction with PCA and UMAP for free, with explanations, exercises, and a quick test (for Data Scientists).

Published: January 1, 2026 | Updated: January 1, 2026

Why this matters

As a Data Scientist, you often face hundreds of features. Dimensionality reduction helps you:

  • Speed up modeling by reducing feature count while keeping signal.
  • Visualize complex data (embeddings, images, customer behavior) in 2D/3D to explain insights.
  • Reduce multicollinearity and noise before clustering or regression.
  • Ship lighter models with fewer inputs and faster inference.

Typical tasks where this shows up: preparing features for k-means, making dashboards to show segments, compressing embeddings, and stabilizing linear models.

Concept explained simply

Dimensionality reduction maps high-dimensional data into a smaller number of dimensions while preserving structure.

  • PCA (Principal Component Analysis): finds new orthogonal axes (directions) that capture maximum variance in a linear way.
  • UMAP (Uniform Manifold Approximation and Projection): builds a nearest-neighbor graph that captures mostly local (and some global) structure, then lays it out in 2D/3D non-linearly.

Mental model

  • PCA: Imagine a cloud of points. Rotate your coordinate system to the direction where the cloud is the longest (PC1), then the next longest perpendicular direction (PC2), and so on. Keep only the first few directions that explain most of the spread.
  • UMAP: Imagine connecting each point to its nearest neighbors to form a graph. Then lay this graph flat in 2D so that connected points stay close and unrelated points spread apart.

Core techniques: PCA and UMAP

PCA in 30 seconds
  • Preprocess: numeric only; scale features (standardize) to avoid dominance by large-scale features.
  • Fit PCA on training data only; choose number of components by explained variance (e.g., 90–99%).
  • Use transformed components for modeling or clustering. Components are linear combinations of the original features.
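
A minimal sketch of these steps, assuming scikit-learn and a numeric training matrix named X_train (a placeholder here):

  import numpy as np
  from sklearn.preprocessing import StandardScaler
  from sklearn.decomposition import PCA

  scaler = StandardScaler()
  X_train_scaled = scaler.fit_transform(X_train)  # fit the scaler on training data only

  pca = PCA(n_components=0.95)                    # keep enough components for ~95% variance
  Z_train = pca.fit_transform(X_train_scaled)

  # Each component is a linear combination of the original features; the weights
  # (loadings) live in pca.components_, shape (n_components, n_features).
  print(pca.components_.shape)
  print(pca.explained_variance_ratio_[:5])
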
UMAP in 30 seconds
  • Good for visualization and uncovering non-linear structure.
  • Key hyperparameters: n_neighbors (local vs global), min_dist (cluster compactness), metric (euclidean, cosine for embeddings).
  • Stochastic: set random_state for reproducibility.
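
A minimal sketch, assuming the umap-learn package and an already scaled numeric array X (placeholder name):

  import umap

  reducer = umap.UMAP(
      n_neighbors=15,      # smaller = more local structure, larger = more global
      min_dist=0.1,        # smaller = tighter, more compact clusters
      metric="euclidean",  # use "cosine" for embedding vectors
      random_state=42,     # UMAP is stochastic; fix the seed for reproducible plots
  )
  embedding_2d = reducer.fit_transform(X)  # shape (n_samples, 2)
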
PCA or UMAP? quick guide
  • Speed up linear/clustering models: PCA first.
  • 2D plots to show clusters/manifolds: UMAP.
  • Downstream supervised task: PCA features are more stable inputs; UMAP is mainly for visualization, though pairing its embedding with kNN can help on non-linear tasks.

Worked examples

Example 1: Stabilize a logistic regression

Task: 120 correlated numeric features for credit risk.

  1. Scale features (StandardScaler).
  2. Fit PCA on train; choose components to reach 95% variance (say 30 components).
  3. Train logistic regression on PCA features.
  4. Result: faster training, reduced multicollinearity, similar or better AUC.
Why it works

Highly correlated features inflate the variance of regression coefficients. PCA compresses them into a smaller set of uncorrelated, orthogonal components.
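
One way this example could look with scikit-learn, assuming placeholder arrays X_train, y_train, X_valid, y_valid:

  from sklearn.pipeline import Pipeline
  from sklearn.preprocessing import StandardScaler
  from sklearn.decomposition import PCA
  from sklearn.linear_model import LogisticRegression
  from sklearn.metrics import roc_auc_score

  clf = Pipeline([
      ("scale", StandardScaler()),
      ("pca", PCA(n_components=0.95)),           # enough components for ~95% variance
      ("logreg", LogisticRegression(max_iter=1000)),
  ])
  clf.fit(X_train, y_train)                       # scaler and PCA are fit on training data only

  valid_scores = clf.predict_proba(X_valid)[:, 1]
  print("AUC:", roc_auc_score(y_valid, valid_scores))
  print("Components kept:", clf.named_steps["pca"].n_components_)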

Example 2: Visualize customer segments with UMAP

Task: 256-d product embedding vectors. Need a 2D map for stakeholders.

  1. Normalize vectors; set metric to cosine.
  2. UMAP(n_neighbors=30, min_dist=0.1, metric='cosine', random_state=42).
  3. Plot 2D embedding with points colored by known segment.
  4. Result: clusters become visible; mixed points flag potential mislabels.
Tip

Smaller min_dist pulls points tighter, emphasizing clusters; larger values show smoother structure.
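
A possible sketch with umap-learn and matplotlib, assuming a placeholder embeddings array of shape (n_customers, 256) and integer-coded segments labels:

  import umap
  import matplotlib.pyplot as plt
  from sklearn.preprocessing import normalize

  vectors = normalize(embeddings)                  # L2-normalize; pairs well with cosine
  reducer = umap.UMAP(n_neighbors=30, min_dist=0.1,
                      metric="cosine", random_state=42)
  coords = reducer.fit_transform(vectors)          # shape (n_customers, 2)

  # segments: integer labels per customer (placeholder), used only for coloring
  plt.scatter(coords[:, 0], coords[:, 1], c=segments, s=5, cmap="tab10")
  plt.title("Product embeddings (UMAP, cosine)")
  plt.show()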

Example 3: PCA before k-means

Task: 500 features, k-means is slow and noisy.

  1. Scale; PCA to 50 components (95% variance).
  2. Run k-means on PCA features.
  3. Result: faster convergence and higher silhouette scores.
Why it helps

PCA drops low-variance noise directions and decorrelates the features, which makes the roughly spherical clusters that k-means assumes more plausible.
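
A rough sketch with scikit-learn, assuming a placeholder numeric matrix X and 8 clusters chosen purely for illustration:

  from sklearn.preprocessing import StandardScaler
  from sklearn.decomposition import PCA
  from sklearn.cluster import KMeans
  from sklearn.metrics import silhouette_score

  X_scaled = StandardScaler().fit_transform(X)
  X_pca = PCA(n_components=50, random_state=42).fit_transform(X_scaled)

  for name, data in [("raw", X_scaled), ("pca", X_pca)]:
      labels = KMeans(n_clusters=8, n_init=10, random_state=42).fit_predict(data)
      # Note: each silhouette is computed in its own feature space,
      # so treat the comparison as a rough signal, not an exact benchmark.
      print(name, "silhouette:", silhouette_score(data, labels))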

How to do it (step-by-step)

PCA workflow

  1. Prepare data
    • Keep numeric features (or encode first).
    • Impute missing values.
    • Standardize features.
  2. Fit and select components
    • Fit PCA on training set only.
    • Inspect cumulative explained variance; pick a threshold (e.g., 95%).
  3. Transform and model
    • Transform train/validation/test using the fitted PCA.
    • Train downstream model; monitor performance and latency.
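
A minimal sketch of this workflow with scikit-learn, assuming placeholder splits X_train, X_valid, X_test:

  import numpy as np
  from sklearn.preprocessing import StandardScaler
  from sklearn.decomposition import PCA

  scaler = StandardScaler().fit(X_train)               # step 1: fit preprocessing on train only
  X_train_s = scaler.transform(X_train)

  full_pca = PCA().fit(X_train_s)                      # step 2: inspect cumulative variance
  cum_var = np.cumsum(full_pca.explained_variance_ratio_)
  n_components = int(np.argmax(cum_var >= 0.95)) + 1   # first count reaching the 95% threshold
  print("Components for 95% variance:", n_components)

  pca = PCA(n_components=n_components).fit(X_train_s)  # step 3: transform every split consistently
  Z_train = pca.transform(X_train_s)
  Z_valid = pca.transform(scaler.transform(X_valid))
  Z_test = pca.transform(scaler.transform(X_test))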

UMAP workflow

  1. Prepare data
    • Normalize/scale as appropriate for chosen metric.
    • Choose metric: cosine for embeddings; euclidean for dense numeric features.
  2. Configure UMAP
    • n_neighbors: 15–50 typical. Larger shows global structure; smaller emphasizes local clusters.
    • min_dist: 0.0–0.5. Smaller makes clusters compact.
    • Set random_state for reproducible plots.
  3. Fit and evaluate
    • Fit on train (or full data if purely for visualization).
    • Inspect cluster separation visually; for modeling, validate with metrics (e.g., kNN accuracy or clustering quality).
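
One way to sanity-check an embedding beyond eyeballing it, assuming umap-learn, scikit-learn, and labelled placeholder splits X_train, y_train, X_valid, y_valid that are already scaled appropriately:

  import umap
  from sklearn.neighbors import KNeighborsClassifier
  from sklearn.metrics import accuracy_score

  reducer = umap.UMAP(n_neighbors=30, min_dist=0.1, random_state=42).fit(X_train)
  Z_train = reducer.transform(X_train)
  Z_valid = reducer.transform(X_valid)            # new points can be projected after fitting

  knn = KNeighborsClassifier(n_neighbors=5).fit(Z_train, y_train)
  print("kNN accuracy on UMAP features:", accuracy_score(y_valid, knn.predict(Z_valid)))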

Pre-deploy checklist

  • Did you fit scaler and reducer only on training data?
  • Is the chosen number of PCA components justified by cumulative variance?
  • Is randomness controlled (UMAP random_state) for reproducible dashboards?
  • Are transformations applied consistently to all splits and in the same order?

Exercises

Do these before the quick test.

Exercise 1 (ex1): Estimate PCA direction

Data points (2D): (1,1), (2,2), (3,2.9), (4,4). Without heavy math, center the data and reason about the first principal component direction and variance share.

  • Mini task: Which direction is PC1 closest to: [1,0], [0,1], or [1,1]?
  • Mini task: Is PC1 likely to explain above 90% of the variance?

Exercise 2 (ex2): Pick the right method

You must present a 2D plot that reveals non-linear clusters in 300-d text embeddings. Stakeholders care about visual clarity more than exact coordinates.

  • Mini task: Choose PCA or UMAP.
  • Mini task: Pick a sensible metric and n_neighbors.

Exercise 3 (ex3): Choose components and UMAP settings

Your PCA explained_variance_ratio_ is: [0.40, 0.25, 0.12, 0.08, 0.05, 0.04, 0.03, 0.02, 0.01].

  • Mini task: How many components for at least 90% cumulative variance?
  • Mini task: For tight local clusters in 2D UMAP, suggest n_neighbors and min_dist ranges.
  • Self-check: After solving, compare with the solutions at the bottom of this page section.
  • Tip: Reason first, then verify.
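
If you want to verify your counts after reasoning them out, a small numpy sketch:

  import numpy as np

  ratios = np.array([0.40, 0.25, 0.12, 0.08, 0.05, 0.04, 0.03, 0.02, 0.01])
  cum_var = np.cumsum(ratios)
  print(cum_var)                                  # cumulative explained variance
  print(int(np.argmax(cum_var >= 0.90)) + 1)      # components needed for at least 90%
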
Exercise solutions

Solutions are also listed in the Exercises panel below.

Common mistakes and self-check

  • Leakage: Fitting scaler/PCA/UMAP on full data. Self-check: Confirm fit on train only, then transform val/test.
  • No scaling before PCA. Self-check: Inspect feature scales; standardize first.
  • Too many/few components. Self-check: Plot cumulative variance; choose a justified threshold.
  • Assuming UMAP coordinates are stable across runs and that plotted distances are meaningful. Self-check: Fix random_state; try multiple runs and confirm patterns persist.
  • Using euclidean for angular embeddings. Self-check: Try cosine; compare separation.

Practical projects

  • Build a PCA-preprocessed pipeline for k-means customer segmentation; compare silhouette before/after PCA.
  • Create a UMAP dashboard of product embeddings with hover labels; tune n_neighbors/min_dist for clarity.
  • Compress tabular features with PCA to meet a latency budget; report accuracy vs components curve.

Learning path

  • Master scaling and encoding.
  • Apply PCA for variance reduction and modeling.
  • Use UMAP for visualization; learn parameter effects.
  • Combine with clustering and classification; validate performance.
  • Package transformations in reproducible pipelines.

Who this is for

  • Data Scientists and learners who need to reduce features, visualize embeddings, and speed up models.

Prerequisites

  • Basic linear algebra intuition (vectors, variance).
  • Understanding of scaling, train/validation/test splits.
  • Familiarity with clustering or classification metrics.

Next steps

  • Run PCA on a recent dataset you used; record the variance curve.
  • Produce a UMAP plot of any embedding you have; try two parameter settings and compare.
  • Take the quick test to confirm understanding.

Mini challenge

You have 1,000 image embeddings (512-d) and need a slide with a clear 2D cluster map and a quick baseline classifier.

  • Decide: UMAP for the slide, PCA for the classifier.
  • Suggest settings: UMAP(cosine, n_neighbors≈30, min_dist≈0.1, random_state fixed). PCA to 95% variance, then logistic regression or kNN.
  • Deliverables: a 2D plot and a short table of accuracy vs components.

Ready to test yourself?

Take the quick test below. The test is available to everyone; only logged-in users get saved progress.

Practice Exercises

3 exercises to complete

Instructions

Given 2D points: (1,1), (2,2), (3,2.9), (4,4).

  • Center the data (subtract mean of each axis).
  • By visual reasoning, choose the likely direction of PC1: [1,0], [0,1], or [1,1].
  • Decide if PC1 explains above 90% of total variance.
Expected Output
PC1 ~ [1,1] direction; PC1 explains > 90% of variance; PC2 explains the small residual.
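
A quick numeric check of this expected output, assuming numpy and scikit-learn:

  import numpy as np
  from sklearn.decomposition import PCA

  points = np.array([[1, 1], [2, 2], [3, 2.9], [4, 4]])
  pca = PCA(n_components=2).fit(points)           # PCA centers the data internally
  print(pca.components_[0])                       # PC1 direction, close to [1, 1]/sqrt(2) up to sign
  print(pca.explained_variance_ratio_)            # PC1's share should be well above 0.90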

Dimensionality Reduction: PCA and UMAP Basics — Quick Test

Test your knowledge with 10 questions. Pass with 70% or higher.

