
Encoding Categorical Variables

Learn Encoding Categorical Variables for free with explanations, exercises, and a quick test (for Data Scientists).

Published: January 1, 2026 | Updated: January 1, 2026

Why this matters

Models learn only from numbers. If you turn categories into numbers poorly, you can inject fake order, explode the feature space, or leak target information. Good encoding improves accuracy, stability, and training speed.

Real tasks where this shows up
  • Predict churn from plan_type, region, and device_brand.
  • Forecast demand using store_id and holiday_name.
  • Credit risk modeling with occupation, education_level, and employer_industry.
  • Recommenders with product_category and user_segment.

Who this is for

  • Data Scientists and ML Engineers building tabular models.
  • Analysts moving from dashboards to predictive modeling.

Prerequisites

  • Comfort with basic Python or R data manipulation.
  • Understanding of supervised learning, train/validation/test splits, and cross-validation.
  • Basic statistics (means, variance, overfitting).

Concept explained simply

Encoding is the act of mapping text categories to numeric signals that your model can learn from without distorting meaning.

Mental model: Each category contains signal and noise. Your job is to convert categories into numbers that keep the signal and control the noise.

  • Nominal (no order): color, city, product_code.
  • Ordinal (has order): size XS < S < M < L < XL; education level.

Common encoders:

  • One-hot: a new binary column per category. Great for small-cardinality nominal features.
  • Ordinal encoding: map categories to integers in a meaningful order. For ordinal features only.
  • Frequency/Count encoding: replace each category with its frequency or count. A good baseline for high-cardinality features.
  • Target (mean) encoding: map each category to the average target (with CV and smoothing). Powerful, but you must prevent leakage.
  • Hashing: a fixed-size vector via a hash function. Useful when categories are numerous or changing.
  • Binary encoding: convert the category index to binary digits. A middle ground between one-hot and hashing (both hashing and binary are sketched after this list).
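
Hashing and binary encoding are the least familiar of these, so here is a minimal Python sketch of both, assuming pandas and scikit-learn are installed; the toy city column and n_features=8 are illustrative choices.

import pandas as pd
from sklearn.feature_extraction import FeatureHasher

df = pd.DataFrame({"city": ["Austin", "Boston", "Chicago", "Denver", "Austin"]})

# Hashing: each category is hashed into a fixed-size vector (here 8 columns).
hasher = FeatureHasher(n_features=8, input_type="string")
X_hash = hasher.transform([[c] for c in df["city"]]).toarray()

# Binary: factorize to integer codes, then split each code into bits.
codes, _ = pd.factorize(df["city"])             # Austin=0, Boston=1, Chicago=2, Denver=3
n_bits = max(1, int(codes.max()).bit_length())  # 4 categories -> 2 bits
for b in range(n_bits):
    df[f"city_bin{b}"] = (codes >> b) & 1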

Handling missing and unseen categories

  • Always reserve sentinel values for unknown (e.g., "__UNK__") and missing ("__NA__") categories.
  • In one-hot, include an "other" bucket if you cap the number of categories.
  • In target/frequency encoding, fall back to the global mean or global frequency for unseen categories (see the sketch below).
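
A minimal sketch of these fallback rules; encode_with_fallback is a hypothetical helper, not a library function.

def encode_with_fallback(value, mapping, fallback):
    # Missing values use the reserved "__NA__" key; anything not in the
    # learned mapping falls back (e.g., to a global mean or frequency).
    key = "__NA__" if value is None else value
    return mapping.get(key, fallback)

# Example: frequency map learned on training data, small global fallback.
freq_map = {"Austin": 0.333, "Boston": 0.167, "__NA__": 0.05}
encode_with_fallback("Zurich", freq_map, fallback=0.01)  # unseen -> 0.01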

Choosing an encoding: quick guide

  • If feature is ordinal: use ordinal encoding with domain-informed ordering.
  • If nominal and unique categories ≤ 20: one-hot (consider dropping one column for linear models).
  • If nominal and unique categories > 20:
    • Tree-based models: try frequency/count; optionally target encoding with CV+smoothing.
    • Linear models: try hashing or binary encoding; target encoding with strong regularization.
  • High-cardinality IDs (e.g., user_id): avoid one-hot; consider target encoding with strict CV, hashing, or learn embeddings in deep models.

Leakage control for target encoding

  • Compute encodings inside each CV fold, from the training folds only.
  • Use smoothing toward the global mean, especially for rare categories.
  • Never compute encodings on the full dataset before splitting (a leakage-safe sketch follows).
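
A minimal out-of-fold sketch of these rules, assuming pandas and scikit-learn; target_encode_cv is a hypothetical helper, and alpha is the smoothing strength from Example 4 below.

import numpy as np
import pandas as pd
from sklearn.model_selection import KFold

def target_encode_cv(df, col, target, alpha=10.0, n_splits=5, seed=0):
    # Out-of-fold target encoding with smoothing toward the global mean.
    mu = df[target].mean()
    enc = pd.Series(np.nan, index=df.index, dtype=float)
    kf = KFold(n_splits=n_splits, shuffle=True, random_state=seed)
    for tr_idx, va_idx in kf.split(df):
        stats = df.iloc[tr_idx].groupby(col)[target].agg(["sum", "count"])
        smoothed = (stats["sum"] + alpha * mu) / (stats["count"] + alpha)
        # Categories unseen in the training folds fall back to the global mean.
        enc.iloc[va_idx] = df.iloc[va_idx][col].map(smoothed).fillna(mu).to_numpy()
    return enc

For the final test set, fit one encoding on all training data and keep the global mean as the fallback for unseen categories.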

Worked examples

Example 1 — One-hot for small nominal

Feature: color ∈ {red, green, blue}. One-hot gives color_red, color_green, color_blue.

color   color_red  color_green  color_blue
red     1          0            0
green   0          1            0
blue    0          0            1

Good for linear and tree-based models. Avoids fake ordering.
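
A minimal pandas sketch of the same table (the three-row frame is illustrative):

import pandas as pd

df = pd.DataFrame({"color": ["red", "green", "blue"]})
one_hot = pd.get_dummies(df["color"], prefix="color")  # color_blue, color_green, color_red
# For linear models with an intercept, pass drop_first=True to drop one column.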

Example 2 — Ordinal for sizes

Feature: size ∈ {XS, S, M, L, XL}. Valid order: XS < S < M < L < XL.

mapping = {"XS": 1, "S": 2, "M": 3, "L": 4, "XL": 5}
size_encoded = mapping[size]  # or df["size"].map(mapping) for a pandas column

Do not use ordinal encoding for categories without order.

Example 3 — Frequency encoding for high cardinality

Feature: city with hundreds of values.

# freq(city) = count(city) / N
On a 6-row toy sample: Austin (2/6) = 0.333, Boston (1/6) = 0.167, Chicago (2/6) = 0.333, Denver (1/6) = 0.167

Simple, robust, and free of target leakage (frequencies never look at y).
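
A minimal pandas sketch, assuming train and test DataFrames with a city column; fitting on train keeps the mapping reproducible.

# Fit frequencies on training data only; reuse the same map on new data.
freq = train["city"].value_counts(normalize=True)       # count(city) / N
train["city_freq"] = train["city"].map(freq)
test["city_freq"] = test["city"].map(freq).fillna(0.0)  # unseen cities get 0 (or a global fallback)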

Example 4 — Target encoding with CV + smoothing

For a binary target y, compute per-category mean inside CV folds and shrink toward global mean μ:

enc(c) = (sum_y_c + α*μ) / (n_c + α)
# sum_y_c = sum of y over rows in category c; n_c = number of such rows
# choose α by validation (larger α => stronger shrinkage)

Always fit encodings on training folds only, then apply to validation/test. Use global mean for unseen categories.
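
A quick numeric check of the formula with made-up numbers:

mu, alpha = 0.4, 10.0   # global mean and smoothing strength
n_c, sum_y_c = 3, 2     # a rare category: 3 rows, 2 positives
enc_c = (sum_y_c + alpha * mu) / (n_c + alpha)  # (2 + 4) / 13 ≈ 0.462, pulled from 0.667 toward mu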

Steps to implement in a project

  1. Audit features: list categorical columns; mark nominal vs ordinal; count unique values; identify missing rate.
  2. Choose encoders: apply the quick guide above. Decide handling for rare categories and unknowns.
  3. Define CV scheme: K-fold or time-based split. Plan how encoders fit within each fold.
  4. Build pipelines: fit encoders on training data only; transform validation/test. Keep mappings for production (see the sketch after this list).
  5. Regularize: for target encoding, tune smoothing and noise; for one-hot with linear models, use regularization (L2/L1).
  6. Evaluate: compare encoders via CV metrics and stability across folds.
  7. Productionize: freeze category maps; set fallbacks for unknown/missing.
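
A minimal scikit-learn pipeline sketch covering steps 4 and 5; the columns, category order, and model are illustrative assumptions.

from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, OrdinalEncoder

pre = ColumnTransformer([
    # Unseen plan_type values become all-zero rows at predict time.
    ("plan", OneHotEncoder(handle_unknown="ignore"), ["plan_type"]),
    # Domain-informed order for size; unknown sizes map to -1.
    ("size", OrdinalEncoder(categories=[["XS", "S", "M", "L", "XL"]],
                            handle_unknown="use_encoded_value", unknown_value=-1), ["size"]),
])
# LogisticRegression applies L2 regularization by default (step 5).
model = Pipeline([("pre", pre), ("clf", LogisticRegression(max_iter=1000))])
# model.fit(X_train, y_train) fits the encoders on training data only.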

Hands-on exercises

These mirror the practice exercises below and their embedded solutions. Try them here first.

Exercise 1 — Churn dataset: pick and apply encodings

Dataset (6 rows):

city     plan_type  churn
Austin   Basic      0
Austin   Plus       1
Boston   Basic      0
Chicago  Plus       1
Chicago  Pro        0
Denver   Basic      1

  • Encode city using frequency (count/N).
  • One-hot encode plan_type into Basic, Plus, Pro.
  • Show the transformed table.

Peek: expected format

Columns: city_freq, plan_Basic, plan_Plus, plan_Pro, churn

  • [ ] Computed city frequencies correctly.
  • [ ] Created three plan_type columns with 0/1 values.
  • [ ] Preserved original row order.

Exercise 2 — Ordinal + rare categories

Dataset (5 rows):

size  country
S     US
XL    DE
M     FR
XS    CN
L     OTHER

  • Encode size with the mapping XS=1, S=2, M=3, L=4, XL=5.
  • One-hot encode country, grouping rare categories into OTHER (keep US, DE, FR, OTHER).
  • Show the transformed table.
  • [ ] Applied the exact ordinal mapping.
  • [ ] Produced four country columns with correct 0/1 flags.
  • [ ] Included a column for OTHER.

Common mistakes and self-check

  • Leakage in target encoding: encodings computed on the full dataset before splitting. Fix: compute inside CV folds only.
  • Using ordinal encoding for nominal categories: introduces fake order. Fix: use one-hot or frequency.
  • One-hot blow-up: too many categories create a wide, sparse, noisy matrix. Fix: cap to top-K + OTHER (see the sketch after this list), or use hashing/frequency/target encoding.
  • No plan for unseen categories in production. Fix: reserve UNK/OTHER values and fallbacks.
  • Mishandling one-hot columns in linear models. Fix: when the model has an intercept, drop exactly one column per feature group to avoid the dummy-variable trap.
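
A minimal pandas sketch of top-K capping; df and K=20 are illustrative assumptions.

# Keep the 20 most frequent cities; everything else becomes OTHER.
top_k = df["city"].value_counts().nlargest(20).index
df["city_capped"] = df["city"].where(df["city"].isin(top_k), "OTHER")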

Self-check prompts

  • Did I choose encoders per feature type and cardinality?
  • Are encoders integrated with the CV pipeline without leakage?
  • Do encoded features improve validation metrics consistently across folds?
  • Do I have deterministic mappings and unknown fallbacks for production?

Practical projects

  • Customer churn: compare one-hot vs frequency vs target encoding on 3 categorical features; report AUC and feature importance.
  • Demand forecasting: use store_id and promo_type; try hashing with different dimensions; evaluate MAPE.
  • Credit risk: apply ordinal encoding to risk_grades and target encoding to employer_industry with smoothing; monitor stability across time-based splits.

Learning path

  • Start: One-hot and ordinal encoding on small datasets.
  • Next: Frequency and rare-category handling.
  • Then: Target encoding with CV and smoothing; compare to hashing.
  • Advanced: Binary encoding and embeddings for deep models.

Next steps

  • Finish the exercises below and take the quick test.
  • Integrate encoders into a reproducible pipeline (e.g., scikit-learn ColumnTransformer or similar).
  • Track metrics by fold; keep artifacts for production (mappings and fallbacks).

Mini challenge

You have feature merchant_country with 120 unique values and binary target fraud. Design an encoding plan for a linear model and for a gradient boosting model. State how you will prevent leakage and handle unknowns.

Practice Exercises

2 exercises to complete

Instructions

Dataset (6 rows):

city     plan_type  churn
Austin   Basic      0
Austin   Plus       1
Boston   Basic      0
Chicago  Plus       1
Chicago  Pro        0
Denver   Basic      1

  • Encode city using frequency (count/N).
  • One-hot encode plan_type into Basic, Plus, Pro.
  • Output columns: city_freq, plan_Basic, plan_Plus, plan_Pro, churn.

Expected Output

A 6-row table with columns: city_freq (Austin=0.3333, Boston=0.1667, Chicago=0.3333, Denver=0.1667), plan_Basic, plan_Plus, plan_Pro as 0/1, plus churn.

Encoding Categorical Variables — Quick Test

Test your knowledge with 8 questions. Pass with 70% or higher.

