Who this is for
Analysts who prepare datasets for modeling or reporting and need consistent feature scales for algorithms (k-NN, k-means, PCA, regressions) and clear dashboards.
Prerequisites
- Comfort with basic stats: mean, median, standard deviation, percentiles.
- Ability to load data in spreadsheets, SQL, or Python/R.
- Know the difference between numeric and categorical features.
Why this matters
Many models and distance-based methods are scale-sensitive. A feature measured in thousands can dominate another in decimals. Proper scaling avoids misleading results and improves training stability. In real tasks, you will:
- Prepare customer metrics for clustering (k-means needs scaled inputs).
- Standardize inputs for logistic/linear regression to stabilize coefficients.
- Normalize ranges for dashboards so indicators plot comparably.
Concept explained simply
Scaling changes the units of a feature without changing its information. Think "resizing" values so features are comparable. Two common ways:
- Normalization (Min–Max scaling): squashes values to a fixed range (usually 0 to 1).
- Standardization (Z-score): centers values around 0 and scales by spread so typical values are around −1 to 1.
Mental model
Imagine different thermometers for the same room: one in °C, another in °F. They show different numbers for the same heat. Scaling is converting them to the same ruler so distances and comparisons make sense.
Key methods
Min–Max scaling (Normalization)
Formula: (x − min) / (max − min) → results in [0, 1] for the data used to compute min and max; new values outside that range fall outside [0, 1].
- Great for bounded inputs and algorithms using distances (k-NN, k-means).
- Sensitive to outliers because min/max can be extreme.
Z-score standardization (Standardization)
Formula: (x − mean) / std → mean ≈ 0, std ≈ 1.
- Good default for many linear models and PCA.
- Sensitive to outliers (mean and std move).
Robust scaling
Formula: (x − median) / IQR, where IQR = Q3 − Q1.
- Resistant to outliers; ideal for skewed/heavy-tailed data.
- Values are not bounded; range depends on distribution.
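A minimal sketch of the three formulas in plain Python/NumPy (the sample values are illustrative only):
import numpy as np

x = np.array([30.0, 32.0, 35.0, 40.0, 120.0])  # illustrative values

# Min-Max: squash into [0, 1] using the observed min and max
x_minmax = (x - x.min()) / (x.max() - x.min())

# Z-score: center on the mean, scale by the standard deviation
x_zscore = (x - x.mean()) / x.std(ddof=0)  # use ddof=1 for the sample (n-1) std

# Robust: center on the median, scale by the IQR (Q3 - Q1)
q1, q3 = np.percentile(x, [25, 75])
x_robust = (x - np.median(x)) / (q3 - q1)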
When to use which (quick guide)
- Clustering or k-NN: Min–Max if few outliers; Robust if many outliers.
- Linear/Logistic Regression, PCA: Z-score; Robust if heavy outliers.
- Tree-based models (Random Forest, XGBoost): Scaling usually not needed.
Reliable workflow
- Profile data: check min, max, mean, std, median, IQR; inspect outliers.
- Choose scaler per feature group (e.g., robust for skewed revenue, z-score for balanced rates).
- Fit scaler on training data only; apply the same transform to validation/test/production (see the sketch after this list).
- Save scaler parameters (min/max or mean/std or median/IQR) for reproducibility.
- Document transformed features and reasons.
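A sketch of this workflow with scikit-learn (the column names and values are illustrative, not from a real dataset):
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler, RobustScaler

# Tiny illustrative frames; in practice these are your train/test splits
X_train = pd.DataFrame({"revenue": [30, 32, 35, 40, 120],
                        "frequency": [1, 3, 2, 4, 3],
                        "is_member": [0, 1, 0, 1, 1]})
X_test = pd.DataFrame({"revenue": [28, 50], "frequency": [2, 5], "is_member": [1, 0]})

# One scaler per feature group; binary flags pass through unchanged
preprocess = ColumnTransformer(
    [("robust", RobustScaler(), ["revenue"]),       # skewed monetary feature
     ("zscore", StandardScaler(), ["frequency"])],  # roughly symmetric feature
    remainder="passthrough",
)

X_train_scaled = preprocess.fit_transform(X_train)  # fit on training data only
X_test_scaled = preprocess.transform(X_test)        # reuse the fitted parameters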
Edge cases and tips
- Constant features (std = 0 or max = min): drop them or leave them unchanged; the scaling formulas would otherwise divide by zero.
- Binary indicators: usually leave as 0/1. Scaling is rarely needed.
- Dates: engineer meaningful numeric features first (e.g., days_since_signup), then scale.
- Missing values: impute before scaling, or verify that your tool's scaler handles missing values the way you expect (see the sketch below).
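A sketch of handling two of these cases (missing values and constant columns) with scikit-learn; the values are illustrative:
import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.feature_selection import VarianceThreshold
from sklearn.preprocessing import StandardScaler

X = np.array([[1.0, 5.0, 0.2],
              [1.0, np.nan, 0.4],
              [1.0, 7.0, 0.6]])  # column 1 is constant, column 2 has a missing value

pipe = Pipeline([
    ("impute", SimpleImputer(strategy="median")),  # fill missing values first
    ("drop_constant", VarianceThreshold()),        # default threshold 0 drops constant columns
    ("scale", StandardScaler()),                   # then scale what remains
])
X_scaled = pipe.fit_transform(X)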
Worked examples
Example 1: Min–Max for two features
Data: order_value = [20, 50, 80], items_count = [1, 3, 5].
- order_value: min=20, max=80 → [ (20−20)/60=0, (50−20)/60=0.5, (80−20)/60=1 ] → [0, 0.5, 1]
- items_count: min=1, max=5 → [ (1−1)/4=0, (3−1)/4=0.5, (5−1)/4=1 ] → [0, 0.5, 1]
Example 2: Z-score for a rate
Data: conversion_rate = [0.08, 0.10, 0.12]. mean=0.10, std≈0.01633 (population std, dividing by n).
- (0.08−0.10)/0.01633≈−1.225, (0.10−0.10)/0.01633=0, (0.12−0.10)/0.01633≈+1.225
Example 3: Robust scaling for skewed revenue
Data: revenue = [30, 32, 35, 40, 120]. Q1=32, median=35, Q3=40 → IQR=8 (quartile values can differ slightly depending on the convention your tool uses).
- (30−35)/8=−0.625, (32−35)/8=−0.375, (35−35)/8=0, (40−35)/8=0.625, (120−35)/8=10.625
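These three examples can be reproduced in Python with NumPy (a quick check; Example 2 uses the population std, and the quartiles follow NumPy's default linear interpolation):
import numpy as np

order_value = np.array([20, 50, 80])
print((order_value - order_value.min()) / (order_value.max() - order_value.min()))  # [0, 0.5, 1]

rate = np.array([0.08, 0.10, 0.12])
print((rate - rate.mean()) / rate.std())  # approx [-1.225, 0, 1.225]

revenue = np.array([30, 32, 35, 40, 120])
q1, q3 = np.percentile(revenue, [25, 75])          # 32.0 and 40.0
print((revenue - np.median(revenue)) / (q3 - q1))  # [-0.625, -0.375, 0, 0.625, 10.625]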
Spreadsheet, SQL, Python snippets
Spreadsheet:
MinMax: =(A2 - MIN($A$2:$A$100)) / (MAX($A$2:$A$100)-MIN($A$2:$A$100))
Z-score: =(A2 - AVERAGE($A$2:$A$100)) / STDEV.S($A$2:$A$100)
Robust: =(A2 - MEDIAN($A$2:$A$100)) / (QUARTILE.INC($A$2:$A$100,3)-QUARTILE.INC($A$2:$A$100,1))
SQL (window example):
SELECT x,
  (x - MIN(x) OVER()) / NULLIF(MAX(x) OVER() - MIN(x) OVER(), 0) AS x_minmax
FROM t;
Python (scikit-learn):
from sklearn.preprocessing import MinMaxScaler, StandardScaler, RobustScaler
scaler = RobustScaler()                    # or MinMaxScaler() / StandardScaler()
X_scaled = scaler.fit_transform(X_train)   # fit on training data only
X_test_scaled = scaler.transform(X_test)   # reuse the same fitted parameters
Practical projects
- Customer clustering: scale monetary, frequency, and recency differently (e.g., robust for spend, z-score for frequency).
- Churn model prep: z-score continuous features, keep binaries as-is; compare model stability with/without scaling.
- PCA for dimensionality reduction: standardize features and visualize variance explained.
Exercises
Complete these two exercises. A quick checklist before you start:
- Identify outliers and skewness.
- Pick the right scaler per feature.
- Fit scaling parameters on training data only.
- Verify transformed ranges/means.
Exercise 1: Mixed scaling
Dataset (rows):
order_value: [40, 70, 100, 160]
items_count: [1, 2, 4, 7]
discount_rate: [0.00, 0.05, 0.10, 0.20]
Tasks:
- Min–Max scale order_value and items_count to [0,1].
- Z-score standardize discount_rate.
Hints
- Use min and max by column for Min–Max.
- For z-score, compute mean and std across discount_rate.
Show solution
Min–Max:
- order_value min=40, max=160 → [0, (70−40)/120=0.25, (100−40)/120=0.5, (160−40)/120=1] → [0, 0.25, 0.5, 1]
- items_count min=1, max=7 → [0, (2−1)/6≈0.1667, (4−1)/6=0.5, (7−1)/6=1] → [0, 0.1667, 0.5, 1]
Z-score for discount_rate:
- mean = (0 + 0.05 + 0.10 + 0.20)/4 = 0.0875
- std (sample) ≈ 0.08539 (using n−1)
- z ≈ [(0−0.0875)/0.08539≈−1.025, (0.05−0.0875)/0.08539≈−0.439, (0.10−0.0875)/0.08539≈0.146, (0.20−0.0875)/0.08539≈1.317]
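If you want to verify these numbers, a minimal NumPy sketch (ddof=1 gives the sample std used above):
import numpy as np

order_value = np.array([40, 70, 100, 160])
items_count = np.array([1, 2, 4, 7])
discount_rate = np.array([0.00, 0.05, 0.10, 0.20])

print((order_value - order_value.min()) / np.ptp(order_value))  # [0, 0.25, 0.5, 1]
print((items_count - items_count.min()) / np.ptp(items_count))  # [0, 0.1667, 0.5, 1]
print((discount_rate - discount_rate.mean()) / discount_rate.std(ddof=1))  # approx [-1.025, -0.439, 0.146, 1.317]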
Exercise 2: Robust scaling on skewed incomes
Data: annual_income_k = [30, 32, 35, 36, 38, 40, 120]
Tasks:
- Compute median, Q1, Q3, IQR.
- Robust scale each income: (x − median) / IQR.
Hints
- Sorted values help find quartiles: [30, 32, 35, 36, 38, 40, 120].
- Median is the middle value; Q1 is median of lower half, Q3 of upper half.
Show solution
- Median=36; Q1=32; Q3=40 → IQR=8 (Q1/Q3 taken as medians of the lower/upper halves; other quartile conventions can give slightly different values).
- Scaled: (30−36)/8=−0.75; (32−36)/8=−0.5; (35−36)/8=−0.125; (36−36)/8=0; (38−36)/8=0.25; (40−36)/8=0.5; (120−36)/8=10.5.
Common mistakes and self-check
- Fitting on full data (leakage): Always fit scaler on training only, then transform validation/test. Self-check: ensure you reuse the same fitted parameters.
- Scaling categorical or binary flags unnecessarily: Usually keep as 0/1. Self-check: list all scaled columns; confirm only continuous features included.
- Ignoring outliers with Min–Max: Leads to compressed ranges. Self-check: compare min/max before and after; inspect 1st/99th percentiles.
- Forgetting to save parameters: Causes inconsistent production transforms. Self-check: export min/max, mean/std, or median/IQR alongside the model (see the sketch after this list).
- Dividing by zero: Constant columns break scaling. Self-check: drop zero-variance features first.
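One common way to avoid inconsistent production transforms is to persist the fitted scaler object itself, for example with joblib (a sketch; the file name and data are illustrative):
import joblib
import numpy as np
from sklearn.preprocessing import RobustScaler

X_train = np.array([[30.0], [32.0], [35.0], [40.0], [120.0]])  # illustrative training data

scaler = RobustScaler().fit(X_train)          # learn median and IQR from training data only
joblib.dump(scaler, "robust_scaler.joblib")   # save the fitted scaler alongside the model

loaded = joblib.load("robust_scaler.joblib")  # reload later (e.g. in production)
print(loaded.transform(np.array([[50.0]])))   # same parameters, consistent transform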
Learning path
- Data profiling: distributions, missing values, outliers.
- Choose scaler per feature type and distribution.
- Implement transforms in your tool (Excel/SQL/Python/R).
- Validate with simple models (k-NN or regression) to see impact.
- Package: persist parameters and document workflow.
Next steps
- Add scaling steps into your standard data-cleaning pipeline template.
- Test different scalers and compare model performance.
- Create a short doc that maps features to chosen scalers and why.
Mini challenge
You have two features: time_on_site_seconds (heavily skewed with few very long sessions) and pages_viewed (0–20, roughly symmetric). Choose a scaler for each and justify in one sentence. Then implement quickly on a 10-row sample.
Quick Test
Take the quick test below to check your understanding. Available to everyone. If you log in, your progress will be saved automatically.