
Scaling And Normalization

Learn Scaling And Normalization for free with explanations, exercises, and a quick test (for Data Scientists).

Published: January 1, 2026 | Updated: January 1, 2026

Why this matters

As a Data Scientist, many models assume comparable feature scales. Distance-based methods (k-NN, k-means), gradient-based models (logistic/linear regression with regularization, neural nets), and kernel SVMs can perform poorly or converge slowly if features vary wildly in scale. Proper scaling can stabilize training, improve performance, and make coefficients more interpretable.

  • Clustering customer behavior: scaling prevents high-spend features from overpowering count features.
  • Fraud detection: robust scaling and log transforms reduce outlier dominance.
  • Image or sensor pipelines: normalized ranges help models converge consistently.

Who this is for

  • Aspiring and practicing Data Scientists working with classical ML and deep learning.
  • Anyone building pipelines where data arrives continuously and must be transformed consistently.

Prerequisites

  • Basic statistics: mean, standard deviation, median, IQR.
  • Understanding of train/validation/test splits and cross-validation.
  • Familiarity with an ML library (e.g., scikit-learn) is helpful but not required.

Concept explained simply

Scaling changes the range or spread of a feature; "normalization" is sometimes used as a synonym, but it can also mean rescaling each sample to a unit vector. Common variants (sketched in code after the quick mental model):

  • Standardization (z-score): center to mean 0 and scale to std 1 per feature.
  • Min–Max scaling: squeeze values to a fixed range (often 0–1).
  • Robust scaling: center by median and scale by IQR to resist outliers.
  • L2 normalization (per sample): scale each sample vector to length 1. Common with text vectors.
  • Log/power transform: make skewed positive data more symmetric (e.g., counts, spending).
Quick mental model
  • Distances care about scale: if a feature is 0–10000 and another is 0–1, the big-range feature dominates.
  • Gradients prefer well-conditioned inputs: similarly scaled features help optimization.
  • Trees split on thresholds and are usually scale-insensitive, but preprocessing for stability (e.g., log on heavy skew) can still help.
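A minimal sketch of the variants listed above with scikit-learn, assuming a small toy matrix X (rows are samples, columns are features):

import numpy as np
from sklearn.preprocessing import StandardScaler, MinMaxScaler, RobustScaler, Normalizer

X = np.array([[100.0, 1.0],
              [200.0, 2.0],
              [10000.0, 3.0]])  # one wide-range feature, one small-range feature

X_std = StandardScaler().fit_transform(X)      # mean 0, std 1 per column
X_minmax = MinMaxScaler().fit_transform(X)     # each column squeezed into [0, 1]
X_robust = RobustScaler().fit_transform(X)     # centers by median, scales by IQR
X_l2 = Normalizer(norm='l2').fit_transform(X)  # each row rescaled to unit length
X_log = np.log1p(X)                            # log(1 + x) for skewed positive data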

When to use what

  • k-NN, k-means, PCA, SVM (RBF): Standardization or Min–Max; consider Robust if outliers exist.
  • Linear/Logistic Regression, Elastic Net: Standardization; Robust if heavy outliers; log transform for skew first.
  • Neural nets: Often Standardization or Min–Max to [0,1] (images) or [-1,1].
  • Tree-based (Random Forest, XGBoost): Usually fine without scaling; optional log/robust for extreme skew/outliers.
  • Sparse text (TF–IDF): L2 normalization per sample; avoid mean-centering that destroys sparsity.

Worked examples

Example 1 — Standardization for regression

Task: Predict house price using features: size_sqft, bedrooms, age_years.

  1. Compute mean and std on training set only.
  2. Transform train, validation, and test with the same parameters.
  3. Train linear regression; coefficients become comparable in scale.
# Python / scikit-learn (assumes X_train, X_valid, y_train already exist)
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LinearRegression

cols = ['size_sqft', 'bedrooms', 'age_years']
scaler = StandardScaler().fit(X_train[cols])   # fit on the training split only
X_train_s = scaler.transform(X_train[cols])
X_valid_s = scaler.transform(X_valid[cols])    # reuse the same fitted parameters
model = LinearRegression().fit(X_train_s, y_train)
What improves?
  • Faster, more stable convergence for gradient-based solvers.
  • Regularization treats features fairly.

Example 2 — Min–Max for pixel intensities

Task: k-NN on grayscale images (0–255). Scale to [0,1] so Euclidean distances reflect relative differences, and adjust k if needed.

X_train_img = X_train_img / 255.0
X_test_img = X_test_img / 255.0
Why not standardize pixels?

Works too, but Min–Max preserves the natural bounds and is common for images and neural nets.

Example 3 — Robust scaling + log for skewed money amounts

Task: Cluster customers using features: transactions_count (counts), total_spend (skewed with outliers).

  1. Log1p on total_spend to reduce skew: log(1 + x).
  2. RobustScale both features to reduce the influence of outliers.
  3. Run k-means; clusters become more stable and interpretable.
import numpy as np
from sklearn.preprocessing import RobustScaler

X['total_spend_log'] = np.log1p(X['total_spend'])  # log(1 + x) reduces right skew
scaler = RobustScaler().fit(X[['transactions_count','total_spend_log']])
X_scaled = scaler.transform(X[['transactions_count','total_spend_log']])
Signs it worked
  • Cluster centers not dominated by a few high spenders.
  • Silhouette score improves on validation (see the quick check below).
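To check the second sign, a quick sketch (assumes X_scaled from the snippet above; the number of clusters is an illustrative choice):

from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

kmeans = KMeans(n_clusters=4, n_init=10, random_state=0).fit(X_scaled)
print(kmeans.cluster_centers_)                     # centers in scaled units, not raw dollars
print(silhouette_score(X_scaled, kmeans.labels_))  # closer to 1 is better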

Apply it safely (no data leakage)

  1. Split first. Create train/validation/test (or cross-validation folds).
  2. Fit transformers on train only. Save parameters (mean, std, median, IQR, min, max).
  3. Transform validation/test with the same fitted transformer.
  4. Handle missing values before scaling. Impute numeric features, then scale.
  5. Use pipelines. Automate fit/transform to avoid leakage (see the sketch below).
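A minimal sketch of steps 2–5 as a scikit-learn pipeline (assumes numeric X_train/y_train/X_test splits already exist; the median imputer and ridge model are illustrative choices):

from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_score

pipe = Pipeline([
    ('impute', SimpleImputer(strategy='median')),  # impute before scaling
    ('scale', StandardScaler()),                   # fitted on training folds only
    ('model', Ridge()),
])

scores = cross_val_score(pipe, X_train, y_train, cv=5)  # pipeline is refit inside each fold, so no leakage
pipe.fit(X_train, y_train)    # final fit on the full training split
preds = pipe.predict(X_test)  # test data is only transformed, never fitted
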
Tip: sparse data

For sparse inputs (e.g., TF–IDF), avoid mean-centering. Use per-sample L2 normalization or a scaler that does not center (e.g., MaxAbsScaler).
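A small sketch of sparse-friendly options (the toy documents are only for illustration):

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.preprocessing import Normalizer, MaxAbsScaler

docs = ["cheap flights to paris", "paris hotel deals", "cheap hotel paris"]
X_tfidf = TfidfVectorizer().fit_transform(docs)      # sparse matrix; rows are L2-normalized by default

X_l2 = Normalizer(norm='l2').fit_transform(X_tfidf)  # per-sample unit length, keeps sparsity
X_maxabs = MaxAbsScaler().fit_transform(X_tfidf)     # per-feature scaling without centering, keeps sparsity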

Common mistakes and self-check

  • Fitting on the whole dataset. Self-check: Does your code call fit on combined train+valid? If yes, fix it.
  • Scaling targets unintentionally. Self-check: Only scale X unless your model truly benefits from a transformed y (e.g., log-target for skewed regression).
  • Ignoring outliers. Self-check: Plot distributions; if heavy tails exist, try RobustScaler or log1p.
  • Scaling categorical encodings wrongly. Self-check: Avoid mean-centering one-hot dummies (it destroys sparsity and rarely helps); usually leave them as is.
  • Inconsistent transforms in production. Self-check: Are you saving and reusing the exact same fitted scaler (see the persistence sketch below)? If not, predictions will drift.
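One way to keep training and production transforms consistent is to persist the fitted scaler; a minimal sketch with joblib (X_train, X_new, num_cols, and the file name are assumed or arbitrary):

import joblib
from sklearn.preprocessing import StandardScaler

# At training time: fit on the training split and save the fitted object.
scaler = StandardScaler().fit(X_train[num_cols])
joblib.dump(scaler, 'scaler.joblib')

# At inference time: load the same fitted scaler and only transform.
scaler = joblib.load('scaler.joblib')
X_new_s = scaler.transform(X_new[num_cols])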

Exercises

Do these in a notebook or REPL. Then compare with the solutions.

Exercise 1 — Standardize numeric features safely

Dataset columns: size_sqft, bedrooms, age_years, city (categorical). Task: Standardize numeric features using only the training split. Confirm means ~0 and stds ~1 on the training split.

  • Split into train/valid (e.g., 80/20).
  • Fit StandardScaler on train numeric columns only.
  • Transform train and valid; compute column means/stds on transformed train.
  • Verify valid set is transformed with the same scaler.
Solution idea

See the solution in the Exercises section below for code and expected numbers.

Exercise 2 — Pick the right scaler

For each scenario, choose one: Standardize, Min–Max, Robust, L2 normalization (per sample), or No scaling.

  1. k-means on e-commerce spend with extreme outliers and counts.
  2. Lasso regression on moderate, non-skewed numeric features.
  3. Logistic regression on TF–IDF sparse vectors.
Solution idea

See the Exercises section below for the reasoning.

Practical projects

  • Build a clustering pipeline: log + robust scale + k-means on customer features. Evaluate with silhouette and stability across random seeds.
  • Train an SVM (RBF kernel) twice on the same dataset: once without scaling, once with StandardScaler in a pipeline. Compare accuracy and confusion matrices (a starter sketch follows this list).
  • Create a reusable scaling module that fits on train and serializes the transformer, then loads it to transform a held-out test file.
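A starter sketch for the second project (the dataset and metrics are illustrative choices):

from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score, confusion_matrix

X, y = load_breast_cancer(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)

for name, model in [('no scaling', SVC(kernel='rbf')),
                    ('with scaling', make_pipeline(StandardScaler(), SVC(kernel='rbf')))]:
    model.fit(X_tr, y_tr)
    preds = model.predict(X_te)
    print(name, accuracy_score(y_te, preds))
    print(confusion_matrix(y_te, preds))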

Learning path

  1. Master splits and cross-validation to avoid leakage.
  2. Practice Standard, Min–Max, and Robust scalers, plus log1p.
  3. Handle sparse data with L2 normalization.
  4. Build pipelines and persist transformers.
  5. Evaluate impact with metrics and cross-validation.

Next steps

  • Complete the Quick Test at the end of this page. Everyone can take it for free; if you are logged in, your progress is saved.
  • Apply the best-suited scaling to one of your current projects and record before/after metrics.
  • Move on to the next subskill in Feature Engineering when you score 70%+.

Practice Exercises

2 exercises to complete

Instructions

Dataset columns: size_sqft, bedrooms, age_years, city (categorical). Your tasks:

  1. Split the data into train (80%) and valid (20%).
  2. Fit a StandardScaler on the training set using only numeric columns.
  3. Transform both train and valid with the fitted scaler.
  4. Compute means and stds on the transformed training set; they should be about 0 and 1 respectively.
# Example (Python / scikit-learn)
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
import numpy as np

num_cols = ['size_sqft','bedrooms','age_years']
X_train, X_valid = train_test_split(X, test_size=0.2, random_state=42)
scaler = StandardScaler().fit(X_train[num_cols])
X_train_s = scaler.transform(X_train[num_cols])
X_valid_s = scaler.transform(X_valid[num_cols])

train_means = X_train_s.mean(axis=0)
train_stds = X_train_s.std(axis=0, ddof=0)
print(train_means, train_stds)
Expected Output
Training-set transformed numeric features with means near [0, 0, 0] and stds near [1, 1, 1]. Validation set is transformed but will not have exact zero mean or unit std.

Scaling And Normalization — Quick Test

Test your knowledge with 10 questions. Pass with 70% or higher.

