Who this is for
Analysts who prepare datasets for modeling or reporting and need consistent feature scales for algorithms (k-NN, k-means, PCA, regressions) and clear dashboards.
Prerequisites
- Comfort with basic stats: mean, median, standard deviation, percentiles.
- Ability to load data in spreadsheets, SQL, or Python/R.
- Know the difference between numeric and categorical features.
Why this matters
Many models and distance-based methods are scale-sensitive. A feature measured in thousands can dominate another in decimals. Proper scaling avoids misleading results and improves training stability. In real tasks, you will:
- Prepare customer metrics for clustering (k-means needs scaled inputs).
- Standardize inputs for logistic/linear regression to stabilize coefficients.
- Normalize ranges for dashboards so indicators plot comparably.
Concept explained simply
Scaling changes the units of a feature without changing its information. Think "resizing" values so features are comparable. Two common ways:
- Normalization (Min–Max scaling): squashes values to a fixed range (usually 0 to 1).
- Standardization (Z-score): centers values around 0 and scales by spread so typical values are around −1 to 1.
Mental model
Imagine different thermometers for the same room: one in °C, another in °F. They show different numbers for the same heat. Scaling is converting them to the same ruler so distances and comparisons make sense.
Key methods
Min–Max scaling (Normalization)
Formula: (x − min) / (max − min) → results in [0, 1] for the data used to compute min and max; new values outside that range fall outside [0, 1].
- Great for bounded inputs and algorithms using distances (k-NN, k-means).
- Sensitive to outliers because min/max can be extreme.
Z-score standardization (Standardization)
Formula: (x − mean) / std → mean ≈ 0, std ≈ 1.
- Good default for many linear models and PCA.
- Sensitive to outliers (mean and std move).
Robust scaling
Formula: (x − median) / IQR, where IQR = Q3 − Q1.
- Resistant to outliers; ideal for skewed/heavy-tailed data.
- Values are not bounded; range depends on distribution.
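A minimal sketch of the three formulas in plain Python/NumPy (the sample values are illustrative only):
import numpy as np

x = np.array([30.0, 32.0, 35.0, 40.0, 120.0])  # illustrative values

# Min-Max: squash into [0, 1] using the observed min and max
x_minmax = (x - x.min()) / (x.max() - x.min())

# Z-score: center on the mean, scale by the standard deviation
x_zscore = (x - x.mean()) / x.std(ddof=0)  # use ddof=1 for the sample (n-1) std

# Robust: center on the median, scale by the IQR (Q3 - Q1)
q1, q3 = np.percentile(x, [25, 75])
x_robust = (x - np.median(x)) / (q3 - q1)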
When to use which (quick guide)
- Clustering or k-NN: Min–Max if few outliers; Robust if many outliers.
- Linear/Logistic Regression, PCA: Z-score; Robust if heavy outliers.
- Tree-based models (Random Forest, XGBoost): Scaling usually not needed.
Reliable workflow
- Profile data: check min, max, mean, std, median, IQR; inspect outliers.
- Choose scaler per feature group (e.g., robust for skewed revenue, z-score for balanced rates).
- Fit scaler on training data only; apply the same transform to validation/test/production (see the sketch after this list).
- Save scaler parameters (min/max or mean/std or median/IQR) for reproducibility.
- Document transformed features and reasons.
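A sketch of this workflow with scikit-learn (the column names and values are illustrative, not from a real dataset):
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler, RobustScaler

# Tiny illustrative frames; in practice these are your train/test splits
X_train = pd.DataFrame({"revenue": [30, 32, 35, 40, 120],
                        "frequency": [1, 3, 2, 4, 3],
                        "is_member": [0, 1, 0, 1, 1]})
X_test = pd.DataFrame({"revenue": [28, 50], "frequency": [2, 5], "is_member": [1, 0]})

# One scaler per feature group; binary flags pass through unchanged
preprocess = ColumnTransformer(
    [("robust", RobustScaler(), ["revenue"]),       # skewed monetary feature
     ("zscore", StandardScaler(), ["frequency"])],  # roughly symmetric feature
    remainder="passthrough",
)

X_train_scaled = preprocess.fit_transform(X_train)  # fit on training data only
X_test_scaled = preprocess.transform(X_test)        # reuse the fitted parameters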
Edge cases and tips
- Constant features (std = 0 or max = min): drop them or leave them unchanged; the scaling formulas would otherwise divide by zero.
- Binary indicators: usually leave as 0/1. Scaling is rarely needed.
- Dates: engineer meaningful numeric features first (e.g., days_since_signup), then scale.
- Missing values: impute before scaling, or verify that your tool's scaler handles missing values the way you expect (see the sketch below).
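A sketch of handling two of these cases (missing values and constant columns) with scikit-learn; the values are illustrative:
import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.feature_selection import VarianceThreshold
from sklearn.preprocessing import StandardScaler

X = np.array([[1.0, 5.0, 0.2],
              [1.0, np.nan, 0.4],
              [1.0, 7.0, 0.6]])  # column 1 is constant, column 2 has a missing value

pipe = Pipeline([
    ("impute", SimpleImputer(strategy="median")),  # fill missing values first
    ("drop_constant", VarianceThreshold()),        # default threshold 0 drops constant columns
    ("scale", StandardScaler()),                   # then scale what remains
])
X_scaled = pipe.fit_transform(X)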
Worked examples
Example 1: Min–Max for two features
Data: order_value = [20, 50, 80], items_count = [1, 3, 5].
- order_value: min=20, max=80 → [ (20−20)/60=0, (50−20)/60=0.5, (80−20)/60=1 ] → [0, 0.5, 1]
- items_count: min=1, max=5 → [ (1−1)/4=0, (3−1)/4=0.5, (5−1)/4=1 ] → [0, 0.5, 1]
Example 2: Z-score for a rate
Data: conversion_rate = [0.08, 0.10, 0.12]. mean=0.10, std≈0.01633 (population std, dividing by n).
- (0.08−0.10)/0.01633≈−1.225, (0.10−0.10)/0.01633=0, (0.12−0.10)/0.01633≈+1.225
Example 3: Robust scaling for skewed revenue
Data: revenue = [30, 32, 35, 40, 120]. Q1=32, median=35, Q3=40 → IQR=8 (quartile values can differ slightly depending on the convention your tool uses).
- (30−35)/8=−0.625, (32−35)/8=−0.375, (35−35)/8=0, (40−35)/8=0.625, (120−35)/8=10.625
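These three examples can be reproduced in Python with NumPy (a quick check; Example 2 uses the population std, and the quartiles follow NumPy's default linear interpolation):
import numpy as np

order_value = np.array([20, 50, 80])
print((order_value - order_value.min()) / (order_value.max() - order_value.min()))  # [0, 0.5, 1]

rate = np.array([0.08, 0.10, 0.12])
print((rate - rate.mean()) / rate.std())  # approx [-1.225, 0, 1.225]

revenue = np.array([30, 32, 35, 40, 120])
q1, q3 = np.percentile(revenue, [25, 75])          # 32.0 and 40.0
print((revenue - np.median(revenue)) / (q3 - q1))  # [-0.625, -0.375, 0, 0.625, 10.625]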
Spreadsheet, SQL, Python snippets
Spreadsheet:
MinMax: =(A2 - MIN($A$2:$A$100)) / (MAX($A$2:$A$100)-MIN($A$2:$A$100))
Z-score: =(A2 - AVERAGE($A$2:$A$100)) / STDEV.S($A$2:$A$100)
Robust: =(A2 - MEDIAN($A$2:$A$100)) / (QUARTILE.INC($A$2:$A$100,3)-QUARTILE.INC($A$2:$A$100,1))
SQL (window example):
SELECT x,
  (x - MIN(x) OVER()) / NULLIF(MAX(x) OVER() - MIN(x) OVER(), 0) AS x_minmax
FROM t;
Python (scikit-learn):
from sklearn.preprocessing import MinMaxScaler, StandardScaler, RobustScaler
scaler = RobustScaler()                    # or MinMaxScaler() / StandardScaler()
X_scaled = scaler.fit_transform(X_train)   # fit on training data only
X_test_scaled = scaler.transform(X_test)   # reuse the same fitted parameters
Practical projects
- Customer clustering: scale monetary, frequency, and recency differently (e.g., robust for spend, z-score for frequency).
- Churn model prep: z-score continuous features, keep binaries as-is; compare model stability with/without scaling.
- PCA for dimensionality reduction: standardize features and visualize variance explained.
Exercises
Complete these two exercises. A quick checklist before you start:
- Identify outliers and skewness.
- Pick the right scaler per feature.
- Fit scaling parameters on training data only.
- Verify transformed ranges/means.
Exercise 1: Mixed scaling
Dataset (rows):
order_value: [40, 70, 100, 160]
items_count: [1, 2, 4, 7]
discount_rate: [0.00, 0.05, 0.10, 0.20]
Tasks:
- Min–Max scale order_value and items_count to [0,1].
- Z-score standardize discount_rate.
Hints
- Use min and max by column for Min–Max.
- For z-score, compute mean and std across discount_rate.
Show solution
Min–Max:
- order_value min=40, max=160 → [0, (70−40)/120=0.25, (100−40)/120=0.5, (160−40)/120=1] → [0, 0.25, 0.5, 1]
- items_count min=1, max=7 → [0, (2−1)/6≈0.1667, (4−1)/6=0.5, (7−1)/6=1] → [0, 0.1667, 0.5, 1]
Z-score for discount_rate:
- mean = (0 + 0.05 + 0.10 + 0.20)/4 = 0.0875
- std (sample) ≈ 0.08539 (using n−1)
- z ≈ [(0−0.0875)/0.08539≈−1.025, (0.05−0.0875)/0.08539≈−0.439, (0.10−0.0875)/0.08539≈0.146, (0.20−0.0875)/0.08539≈1.317]
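If you want to verify these numbers, a minimal NumPy sketch (ddof=1 gives the sample std used above):
import numpy as np

order_value = np.array([40, 70, 100, 160])
items_count = np.array([1, 2, 4, 7])
discount_rate = np.array([0.00, 0.05, 0.10, 0.20])

print((order_value - order_value.min()) / np.ptp(order_value))  # [0, 0.25, 0.5, 1]
print((items_count - items_count.min()) / np.ptp(items_count))  # [0, 0.1667, 0.5, 1]
print((discount_rate - discount_rate.mean()) / discount_rate.std(ddof=1))  # approx [-1.025, -0.439, 0.146, 1.317]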
Exercise 2: Robust scaling on skewed incomes
Data: annual_income_k = [30, 32, 35, 36, 38, 40, 120]
Tasks:
- Compute median, Q1, Q3, IQR.
- Robust scale each income: (x − median) / IQR.
Hints
- Sorted values help find quartiles: [30, 32, 35, 36, 38, 40, 120].
- Median is the middle value; Q1 is median of lower half, Q3 of upper half.
Show solution
- Median=36; Q1=32; Q3=40 → IQR=8 (Q1/Q3 taken as medians of the lower/upper halves; other quartile conventions can give slightly different values).
- Scaled: (30−36)/8=−0.75; (32−36)/8=−0.5; (35−36)/8=−0.125; (36−36)/8=0; (38−36)/8=0.25; (40−36)/8=0.5; (120−36)/8=10.5.
Common mistakes and self-check
- Fitting on full data (leakage): Always fit scaler on training only, then transform validation/test. Self-check: ensure you reuse the same fitted parameters.
- Scaling categorical or binary flags unnecessarily: Usually keep as 0/1. Self-check: list all scaled columns; confirm only continuous features included.
- Ignoring outliers with Min–Max: Leads to compressed ranges. Self-check: compare min/max before and after; inspect 1st/99th percentiles.
- Forgetting to save parameters: Causes inconsistent production transforms. Self-check: export min/max, mean/std, or median/IQR alongside the model (see the sketch after this list).
- Dividing by zero: Constant columns break scaling. Self-check: drop zero-variance features first.
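One common way to avoid inconsistent production transforms is to persist the fitted scaler object itself, for example with joblib (a sketch; the file name and data are illustrative):
import joblib
import numpy as np
from sklearn.preprocessing import RobustScaler

X_train = np.array([[30.0], [32.0], [35.0], [40.0], [120.0]])  # illustrative training data

scaler = RobustScaler().fit(X_train)          # learn median and IQR from training data only
joblib.dump(scaler, "robust_scaler.joblib")   # save the fitted scaler alongside the model

loaded = joblib.load("robust_scaler.joblib")  # reload later (e.g. in production)
print(loaded.transform(np.array([[50.0]])))   # same parameters, consistent transform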
Learning path
- Data profiling: distributions, missing values, outliers.
- Choose scaler per feature type and distribution.
- Implement transforms in your tool (Excel/SQL/Python/R).
- Validate with simple models (k-NN or regression) to see impact.
- Package: persist parameters and document workflow.
Next steps
- Add scaling steps into your standard data-cleaning pipeline template.
- Test different scalers and compare model performance.
- Create a short doc that maps features to chosen scalers and why.
Mini challenge
You have two features: time_on_site_seconds (heavily skewed with few very long sessions) and pages_viewed (0–20, roughly symmetric). Choose a scaler for each and justify in one sentence. Then implement quickly on a 10-row sample.
Quick Test
Take the quick test below to check your understanding. Available to everyone. If you log in, your progress will be saved automatically.