Normalizing and Scaling

Learn Normalizing and Scaling for free with explanations, exercises, and a quick test (for Data Analysts).

Published: December 19, 2025 | Updated: December 19, 2025

Who this is for

Analysts who prepare datasets for modeling or reporting and need consistent feature scales for algorithms (k-NN, k-means, PCA, regressions) and clear dashboards.

Prerequisites

  • Comfort with basic stats: mean, median, standard deviation, percentiles.
  • Ability to load data in spreadsheets, SQL, or Python/R.
  • Know the difference between numeric and categorical features.

Why this matters

Many models and distance-based methods are scale-sensitive. A feature measured in thousands can dominate another in decimals. Proper scaling avoids misleading results and improves training stability. In real tasks, you will:

  • Prepare customer metrics for clustering (k-means needs scaled inputs).
  • Standardize inputs for logistic/linear regression to stabilize coefficients.
  • Normalize ranges for dashboards so indicators plot comparably.

Concept explained simply

Scaling changes the units of a feature without changing its information. Think "resizing" values so features are comparable. Two common ways:

  • Normalization (Min–Max scaling): squashes values to a fixed range (usually 0 to 1).
  • Standardization (Z-score): centers values around 0 and scales by spread so typical values are around −1 to 1.

Mental model

Imagine different thermometers for the same room: one in °C, another in °F. They show different numbers for the same heat. Scaling is converting them to the same ruler so distances and comparisons make sense.

Key methods

Min–Max scaling (Normalization)

Formula: (x − min) / (max − min) → result in [0,1] on the data used to fit; new values beyond the fitted min/max land outside that range.

  • Great for bounded inputs and algorithms using distances (k-NN, k-means).
  • Sensitive to outliers because min/max can be extreme.
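
A minimal sketch of the formula in plain Python (the values are illustrative):

values = [5, 9, 13, 21]
lo, hi = min(values), max(values)
# Guard against a constant column, where hi == lo would divide by zero
scaled = [(v - lo) / (hi - lo) for v in values] if hi != lo else values
print(scaled)  # [0.0, 0.25, 0.5, 1.0]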

Z-score standardization (Standardization)

Formula: (x − mean) / std → mean ≈ 0, std ≈ 1.

  • Good default for many linear models and PCA.
  • Sensitive to outliers (mean and std move).
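
A matching sketch with Python's statistics module (illustrative values); note that pstdev divides by n, while stdev divides by n − 1:

import statistics

values = [10, 20, 60]
mu = statistics.mean(values)       # 30
sigma = statistics.pstdev(values)  # population std ≈ 21.6
z = [(v - mu) / sigma for v in values]
print([round(x, 3) for x in z])    # [-0.926, -0.463, 1.389]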

Robust scaling

Formula: (x − median) / IQR, where IQR = Q3 − Q1.

  • Resistant to outliers; ideal for skewed/heavy-tailed data.
  • Values are not bounded; range depends on distribution.
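
A sketch with NumPy's percentile function (illustrative values); note how the outlier stretches only its own scaled value:

import numpy as np

values = np.array([3, 5, 7, 9, 100])
q1, med, q3 = np.percentile(values, [25, 50, 75])  # 5.0, 7.0, 9.0
scaled = (values - med) / (q3 - q1)
print(scaled)  # [-1.  -0.5  0.   0.5  23.25]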

When to use which (quick guide)

  • Clustering or k-NN: Min–Max if few outliers; Robust if many outliers.
  • Linear/Logistic Regression, PCA: Z-score; Robust if heavy outliers.
  • Tree-based models (Random Forest, XGBoost): Scaling usually not needed.

Reliable workflow

  1. Profile data: check min, max, mean, std, median, IQR; inspect outliers.
  2. Choose scaler per feature group (e.g., robust for skewed revenue, z-score for balanced rates).
  3. Fit scaler on training data only; apply same transform to validation/test/production.
  4. Save scaler parameters (min/max or mean/std or median/IQR) for reproducibility.
  5. Document transformed features and reasons.
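
A minimal sketch of steps 3 and 4 with scikit-learn and joblib (data and file name are illustrative):

import joblib
import numpy as np
from sklearn.preprocessing import StandardScaler

X_train = np.array([[20.0], [50.0], [80.0]])
X_test = np.array([[35.0], [95.0]])

scaler = StandardScaler().fit(X_train)      # fit on training data only
X_test_scaled = scaler.transform(X_test)    # reuse the fitted mean_/scale_

joblib.dump(scaler, "scaler.joblib")        # persist parameters for production
scaler = joblib.load("scaler.joblib")       # reload with identical parameters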

Edge cases and tips

  • Constant features (std=0 or max=min): drop them or leave them unscaled; otherwise every formula above divides by zero.
  • Binary indicators: usually leave as 0/1. Scaling is rarely needed.
  • Dates: engineer meaningful numeric features first (e.g., days_since_signup), then scale.
  • Missing values: impute before scaling or use scalers that ignore missing values appropriately.
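
A pandas sketch for screening constant columns and missing values before scaling (column names are hypothetical):

import pandas as pd

df = pd.DataFrame({
    "spend": [10.0, 25.0, None, 40.0],  # has a missing value: impute first
    "region_flag": [1, 1, 1, 1],        # constant column: would divide by zero
})

constant_cols = [c for c in df.columns if df[c].nunique(dropna=True) <= 1]
print(constant_cols)    # ['region_flag'] -> drop before scaling
print(df.isna().sum())  # spend has 1 missing value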

Worked examples

Example 1: Min–Max for two features

Data: order_value = [20, 50, 80], items_count = [1, 3, 5].

  • order_value: min=20, max=80 → [ (20−20)/60=0, (50−20)/60=0.5, (80−20)/60=1 ] → [0, 0.5, 1]
  • items_count: min=1, max=5 → [ (1−1)/4=0, (3−1)/4=0.5, (5−1)/4=1 ] → [0, 0.5, 1]

Example 2: Z-score for a rate

Data: conversion_rate = [0.08, 0.10, 0.12]. mean=0.10, std≈0.01633 (population std, dividing by n).

  • (0.08−0.10)/0.01633≈−1.225, (0.10−0.10)/0.01633=0, (0.12−0.10)/0.01633≈+1.225
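
A quick way to see the n vs. n − 1 difference: NumPy's std defaults to the population form used above, while pandas and Excel's STDEV.S use the sample form:

import numpy as np

rates = np.array([0.08, 0.10, 0.12])
print(rates.std())        # ≈ 0.01633 (population, ddof=0, as used above)
print(rates.std(ddof=1))  # ≈ 0.02    (sample, n − 1)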

Example 3: Robust scaling for skewed revenue

Data: revenue = [30, 32, 35, 40, 120]. Q1=32, median=35, Q3=40 → IQR=8.

  • (30−35)/8=−0.625, (32−35)/8=−0.375, (35−35)/8=0, (40−35)/8=0.625, (120−35)/8=10.625
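
As a quick check, scikit-learn's scalers reproduce all three worked examples (StandardScaler also divides by the population standard deviation):

import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler, RobustScaler

order_value = np.array([[20.0], [50.0], [80.0]])
conversion = np.array([[0.08], [0.10], [0.12]])
revenue = np.array([[30.0], [32.0], [35.0], [40.0], [120.0]])

print(MinMaxScaler().fit_transform(order_value).ravel())   # [0.  0.5 1. ]
print(StandardScaler().fit_transform(conversion).ravel())  # ≈ [-1.225  0.  1.225]
print(RobustScaler().fit_transform(revenue).ravel())       # [-0.625 -0.375  0.  0.625  10.625]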

Spreadsheet, SQL, Python snippets

Spreadsheet:

MinMax: =(A2 - MIN($A$2:$A$100)) / (MAX($A$2:$A$100)-MIN($A$2:$A$100))
Z-score: =(A2 - AVERAGE($A$2:$A$100)) / STDEV.S($A$2:$A$100)
Robust: =(A2 - MEDIAN($A$2:$A$100)) / (QUARTILE.INC($A$2:$A$100,3)-QUARTILE.INC($A$2:$A$100,1))

SQL (window example):

SELECT x,
       (x - MIN(x) OVER()) / NULLIF(MAX(x) OVER() - MIN(x) OVER(),0) AS x_minmax
FROM t;

Python (scikit-learn):

from sklearn.preprocessing import MinMaxScaler, StandardScaler, RobustScaler

# Fit on the training features only, then reuse the fitted parameters
scaler = RobustScaler()
X_scaled = scaler.fit_transform(X_train)   # X_train: training feature matrix
X_test_scaled = scaler.transform(X_test)   # same median/IQR applied to test data

Practical projects

  • Customer clustering: scale monetary, frequency, and recency differently (e.g., robust for spend, z-score for frequency).
  • Churn model prep: z-score continuous features, keep binaries as-is; compare model stability with/without scaling.
  • PCA for dimensionality reduction: standardize features and visualize variance explained.
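
For the first project, scikit-learn's ColumnTransformer can apply a different scaler per feature group; a sketch with hypothetical column names and data:

import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import RobustScaler, StandardScaler

df = pd.DataFrame({
    "monetary": [30.0, 32.0, 35.0, 40.0, 120.0],  # skewed spend -> robust
    "frequency": [2.0, 4.0, 5.0, 6.0, 8.0],       # roughly symmetric -> z-score
})

pre = ColumnTransformer([
    ("robust", RobustScaler(), ["monetary"]),
    ("zscore", StandardScaler(), ["frequency"]),
])
X_scaled = pre.fit_transform(df)
print(X_scaled)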

Exercises

Complete these two exercises. A quick checklist before you start:

  • Identify outliers and skewness.
  • Pick the right scaler per feature.
  • Fit on training data parameters only.
  • Verify transformed ranges/means.

Exercise 1: Mixed scaling

Dataset (rows):

order_value: [40, 70, 100, 160]
items_count: [1, 2, 4, 7]
discount_rate: [0.00, 0.05, 0.10, 0.20]

Tasks:

  • Min–Max scale order_value and items_count to [0,1].
  • Z-score standardize discount_rate (use the sample std, n−1).

Hints

  • Use min and max by column for Min–Max.
  • For z-score, compute mean and std across discount_rate.

Solution

Min–Max:

  • order_value min=40, max=160 → [0, (70−40)/120=0.25, (100−40)/120=0.5, (160−40)/120=1] → [0, 0.25, 0.5, 1]
  • items_count min=1, max=7 → [0, (2−1)/6≈0.1667, (4−1)/6=0.5, (7−1)/6=1] → [0, 0.1667, 0.5, 1]

Z-score for discount_rate:

  • mean = (0 + 0.05 + 0.10 + 0.20)/4 = 0.0875
  • std (sample) ≈ 0.08539 (using n−1)
  • z ≈ [(0−0.0875)/0.08539=−1.025, (0.05−0.0875)/0.08539=−0.439, (0.10−0.0875)/0.08539=0.146, (0.20−0.0875)/0.08539=1.317]
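
You can verify the z-scores with NumPy by passing ddof=1 for the sample standard deviation:

import numpy as np

d = np.array([0.00, 0.05, 0.10, 0.20])
z = (d - d.mean()) / d.std(ddof=1)
print(np.round(z, 3))  # [-1.025 -0.439  0.146  1.317]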

Exercise 2: Robust scaling on skewed incomes

Data: annual_income_k = [30, 32, 35, 36, 38, 40, 120]

Tasks:

  • Compute median, Q1, Q3, IQR.
  • Robust scale each income: (x − median) / IQR.

Hints

  • Sorted values help find quartiles: [30, 32, 35, 36, 38, 40, 120].
  • Median is the middle value; Q1 is the median of the lower half, Q3 of the upper half (excluding the overall median; QUARTILE.INC interpolates and may give slightly different values).

Solution

  • Median=36; Q1=32; Q3=40 → IQR=8.
  • Scaled: (30−36)/8=−0.75; (32−36)/8=−0.5; (35−36)/8=−0.125; (36−36)/8=0; (38−36)/8=0.25; (40−36)/8=0.5; (120−36)/8=10.5.

Common mistakes and self-check

  • Fitting on full data (leakage): Always fit scaler on training only, then transform validation/test. Self-check: ensure you reuse the same fitted parameters.
  • Scaling categorical or binary flags unnecessarily: Usually keep as 0/1. Self-check: list all scaled columns; confirm only continuous features included.
  • Ignoring outliers with Min–Max: Leads to compressed ranges. Self-check: compare min/max before and after; inspect 1st/99th percentiles.
  • Forgetting to save parameters: Causes inconsistent production transforms. Self-check: export min/max or mean/std or median/IQR alongside model.
  • Dividing by zero: Constant columns break scaling. Self-check: drop zero-variance features first.
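
To avoid the leakage mistake entirely, wrap the scaler and model in a scikit-learn Pipeline so the scaler is refit inside each cross-validation fold; a minimal sketch on synthetic data:

from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = make_classification(n_samples=200, n_features=5, random_state=0)
pipe = make_pipeline(StandardScaler(), LogisticRegression())
scores = cross_val_score(pipe, X, y, cv=5)  # scaler fit on each training fold only
print(round(scores.mean(), 3))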

Learning path

  1. Data profiling: distributions, missing values, outliers.
  2. Choose scaler per feature type and distribution.
  3. Implement transforms in your tool (Excel/SQL/Python/R).
  4. Validate with simple models (k-NN or regression) to see impact.
  5. Package: persist parameters and document workflow.

Next steps

  • Add scaling steps into your standard data-cleaning pipeline template.
  • Test different scalers and compare model performance.
  • Create a short doc that maps features to chosen scalers and why.

Mini challenge

You have two features: time_on_site_seconds (heavily skewed with few very long sessions) and pages_viewed (0–20, roughly symmetric). Choose a scaler for each and justify in one sentence. Then implement quickly on a 10-row sample.

Quick Test

Take the quick test below to check your understanding: 10 questions, 70% or higher to pass.
