
Handling Categorical Values

Learn Handling Categorical Values for free with explanations, exercises, and a quick test (for Data Analysts).

Published: December 19, 2025 | Updated: December 19, 2025

Who this is for

Data Analysts and learners who need to clean datasets with text categories (like country, channel, device, plan). You will turn messy labels into consistent, model-ready features.

Prerequisites

  • Basic familiarity with spreadsheets or Python/pandas (nice to have, not required).
  • Know the difference between numerical and categorical data.
  • Comfort reading small tables and following step-by-step procedures.

Learning path

  • Before: Missing values handling, text normalization, and deduplication.
  • This lesson: Detect, clean, and encode categorical values safely.
  • After: Feature scaling, feature selection, and data splits for modeling.

Why this matters

Real analyst tasks rely on accurate categoricals:

  • Marketing: Compare conversion by campaign channel or region.
  • Product: Analyze churn by subscription plan and device type.
  • Finance: Classify expense categories and flag anomalies.
  • Operations: Forecast demand by store type or supplier.

Mishandled categories lead to broken dashboards, misleading averages, and biased models. Getting this right is a high-leverage data-cleaning skill.

Concept explained simply

Categories are labels that describe groups (e.g., "iOS", "Android"). Some have order ("Low", "Medium", "High"), others don’t ("Red", "Blue"). Algorithms need numbers, so we convert categories into numeric form without losing meaning or creating fake order.

Mental model

Imagine a set of labeled bins. First, make labels consistent (same spelling and case). Next, decide if bins are ordered. Then, turn bins into numbers carefully: use separate lights (one-hot) for unordered bins, or a ladder (ordinal codes) for ordered bins. For many tiny bins, bundle tiny ones into an "Other" bin so the system stays stable.

Core decisions

  • Type: Nominal (no order) vs. Ordinal (has order).
  • Cardinality: Few categories vs. many categories.
  • Downstream use: Dashboard summaries vs. statistical models vs. tree-based models.

Quick rules of thumb (a short pandas sketch follows this list)
  • Nominal + few categories: One-hot encode (optionally drop one column for linear models).
  • Ordinal: Map to ordered integers (e.g., 1–5).
  • High cardinality: Consider frequency encoding, target encoding (with leakage-safe folds), hashing, or grouping rare levels into "Other".
  • Unknowns/missing: Keep a distinct category ("Unknown" or "Missing") rather than dropping rows.
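
A minimal pandas illustration of the first two rules, assuming hypothetical column names:

  import pandas as pd

  df = pd.DataFrame({
      "device": ["ios", "android", "ios", "web"],    # nominal, few categories
      "priority": ["low", "high", "medium", "low"],  # ordinal
  })

  # Nominal + few categories: one-hot encode (drop_first=True suits linear models)
  onehot = pd.get_dummies(df["device"], prefix="device")

  # Ordinal: map to ordered integers
  df["priority_code"] = df["priority"].map({"low": 1, "medium": 2, "high": 3})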

Step-by-step workflow

  1. Profile: List columns that are categorical. Check unique counts, most frequent values, and missing rate (a pandas sketch of steps 1 and 2 follows this list).
  2. Normalize text: Trim spaces, unify case, fix common typos (e.g., "U.S.A" vs "USA").
  3. Define type: Mark columns as nominal or ordinal. Write the intended order for ordinal columns.
  4. Handle rare/unknown: Decide a threshold (e.g., categories < 1% frequency) to group into "Other"; create an explicit "Missing"/"Unknown" where appropriate.
  5. Choose encoding: One-hot for nominal small-cardinality; ordinal mapping for ordinal; frequency/target/hashing for high-cardinality.
  6. Fit on training data only: Save mappings from training; apply to validation/test. Use fold-based statistics for target encoding to avoid leakage.
  7. Validate: Spot-check with simple group averages and counts. Ensure no unexpected categories are dropped silently.
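
A minimal pandas sketch of steps 1 and 2, using a small hypothetical DataFrame:

  import pandas as pd

  df = pd.DataFrame({"channel": ["Email", "email ", "Social", "Paid", None]})

  # Step 1: profile unique counts, top values, and missing rate
  print(df["channel"].nunique())                   # unique categories
  print(df["channel"].value_counts(dropna=False))  # frequencies, including nulls
  print(df["channel"].isna().mean())               # missing rate

  # Step 2: normalize text by trimming whitespace and unifying case
  df["channel"] = df["channel"].str.strip().str.lower()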

Worked examples

Example 1 — One-hot encode a nominal column

Column: channel = {"Email", "email ", "Social", "Paid", null}

  1. Normalize: lower-case, trim: {"email", "email", "social", "paid", null}
  2. Add Missing category: replace null with "missing"
  3. One-hot encode: create channel_email, channel_social, channel_paid, channel_missing
  4. Optional for linear models: drop one column (e.g., channel_missing) to avoid perfect multicollinearity

Why it works

Nominal categories have no order. One-hot keeps information without implying rank.
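
A pandas sketch of this example; pd.get_dummies produces the four channel_* columns:

  import pandas as pd

  df = pd.DataFrame({"channel": ["Email", "email ", "Social", "Paid", None]})

  # Steps 1-2: normalize, then make missing an explicit category
  df["channel"] = df["channel"].str.strip().str.lower().fillna("missing")

  # Step 3: one-hot encode into channel_email, channel_social, channel_paid, channel_missing
  dummies = pd.get_dummies(df["channel"], prefix="channel")

  # Step 4 (optional, linear models): drop one column to avoid perfect multicollinearity
  dummies_linear = dummies.drop(columns=["channel_missing"])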

Example 2 — Ordinal encode survey satisfaction

Column: satisfaction = {"Very dissatisfied","Dissatisfied","Neutral","Satisfied","Very satisfied"}

  1. Define order: Very dissatisfied < Dissatisfied < Neutral < Satisfied < Very satisfied
  2. Map: {1,2,3,4,5}
  3. Missing/unknown: map to 3 (the neutral midpoint) or to a separate 0 = "Missing" code if you want to keep a distinct flag

Why not one-hot?

We would lose order information if we one-hot this variable, which can reduce model power and interpretability.
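
A pandas sketch of this mapping, using 0 as a distinct "Missing" code:

  import pandas as pd

  df = pd.DataFrame({"satisfaction": ["Satisfied", "Neutral", None, "Very satisfied"]})

  # Steps 1-2: define the order and map to integers
  order = {
      "Very dissatisfied": 1, "Dissatisfied": 2, "Neutral": 3,
      "Satisfied": 4, "Very satisfied": 5,
  }
  # Step 3: missing or unexpected labels fall back to 0 instead of being dropped
  df["satisfaction_code"] = df["satisfaction"].map(order).fillna(0).astype(int)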

Example 3 — Frequency encoding for high-cardinality IDs

Column: product_id with thousands of unique values.

  1. Compute frequency per ID on training data only (e.g., occurrences / total rows).
  2. Replace product_id with its frequency number.
  3. Unknown IDs in validation/test: assign a small default (e.g., global min frequency) or a learned "unknown" value.

When to use

Useful when one-hot would create too many columns and when plain label encoding would add fake order.
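
A pandas sketch of frequency encoding; train and test are hypothetical splits:

  import pandas as pd

  train = pd.DataFrame({"product_id": ["A1", "A1", "B2", "C3", "A1"]})
  test = pd.DataFrame({"product_id": ["B2", "Z9"]})  # "Z9" never appears in training

  # Step 1: compute frequencies on training data only
  freq = train["product_id"].value_counts(normalize=True)

  # Steps 2-3: map both splits; unseen IDs fall back to the training minimum
  train["product_freq"] = train["product_id"].map(freq)
  test["product_freq"] = test["product_id"].map(freq).fillna(freq.min())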

Example 4 — Target encoding with leakage-safe folds

Goal: Predict churn (0/1). Column: plan_type (many values).

  1. Split training into K folds.
  2. For each fold, compute mean churn by plan_type using only the other K-1 folds.
  3. Map plan_type in the held-out fold to these means. Apply smoothing to shrink rare plans toward global mean.
  4. On validation/test, map using statistics from full training only; unknown plans use global mean.

Why careful?

Directly using whole-training targets leaks information and inflates performance. Fold-based encoding prevents leakage.
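
A leakage-safe sketch using scikit-learn's KFold; the smoothing weight of 5 and the toy data are illustrative:

  import pandas as pd
  from sklearn.model_selection import KFold

  train = pd.DataFrame({
      "plan_type": ["basic", "pro", "basic", "pro", "basic", "enterprise"],
      "churn":     [1,       0,     1,       0,     0,       1],
  })
  global_mean = train["churn"].mean()
  smoothing = 5.0  # shrinks rare plans toward the global mean

  train["plan_te"] = global_mean  # fallback for plans unseen in a fold
  for fit_idx, enc_idx in KFold(n_splits=3, shuffle=True, random_state=0).split(train):
      # Steps 1-2: per-plan churn means computed on the other folds only
      stats = train.iloc[fit_idx].groupby("plan_type")["churn"].agg(["mean", "count"])
      smoothed = (stats["mean"] * stats["count"] + global_mean * smoothing) / (
          stats["count"] + smoothing
      )
      # Step 3: encode the held-out fold with those means
      col = train.columns.get_loc("plan_te")
      train.iloc[enc_idx, col] = (
          train.iloc[enc_idx]["plan_type"].map(smoothed).fillna(global_mean).values
      )

  # Step 4: for validation/test, recompute the smoothed means on the full training
  # set and map, with global_mean as the fallback for unknown plans.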

Handling missing, rare, and unknown

  • Missing: Use an explicit category ("Missing") to avoid row drops. This preserves signal if missingness is informative.
  • Rare: Group categories below a frequency threshold into "Other" to improve stability.
  • Unknown at inference: Always define a fallback mapping (e.g., "Unknown" bin, default frequency, or global mean). A sketch covering all three cases follows this list.
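
A minimal pandas sketch of these three rules; the 20% rare threshold only suits this tiny demo:

  import pandas as pd

  train = pd.DataFrame({"country": ["us", "us", "uk", None, "fj", "us", "uk", "de"]})

  # Missing: keep rows, use an explicit category
  train["country"] = train["country"].fillna("missing")

  # Rare: group low-frequency categories (but never "missing") into "other"
  freq = train["country"].value_counts(normalize=True)
  rare = freq[(freq < 0.20) & (freq.index != "missing")].index
  train["country"] = train["country"].replace(rare.tolist(), "other")

  # Unknown at inference: anything outside the learned vocabulary falls back to "other"
  vocab = set(train["country"].unique())
  incoming = pd.Series(["uk", "br"])
  incoming = incoming.where(incoming.isin(vocab), "other")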

Practical tooling tips

  • Set categorical dtype (where available) to save memory and catch invalid values early.
  • When one-hot encoding, use an option that ignores unknown categories rather than erroring, and keep track of feature names (see the scikit-learn sketch after this list).
  • Store mapping dictionaries (for ordinal, frequency, or target encodings) with metadata: date created, source columns, thresholds.
  • Always fit encoders on training data only, then transform validation/test.
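
A scikit-learn sketch of the first two tips (sparse_output assumes scikit-learn 1.2 or newer):

  import pandas as pd
  from sklearn.preprocessing import OneHotEncoder

  train = pd.DataFrame({"device": ["ios", "android", "web", "ios"]})
  test = pd.DataFrame({"device": ["ios", "kiosk"]})  # "kiosk" is unseen

  # Categorical dtype: saves memory and surfaces invalid values early
  train["device"] = train["device"].astype("category")

  # Fit on training only; unknown categories encode as all zeros instead of erroring
  enc = OneHotEncoder(handle_unknown="ignore", sparse_output=False)
  enc.fit(train[["device"]])

  print(enc.get_feature_names_out())      # device_android, device_ios, device_web
  print(enc.transform(test[["device"]]))  # the "kiosk" row is all zeros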

Exercises

These mirror the exercises below. Try them before opening the solutions.

  1. Exercise 1: Clean and encode a payment_method column with messy labels. Steps: normalize, group rare into Other, then one-hot with a safe fallback for unknowns.
  2. Exercise 2: Ordinal-encode a satisfaction column with a defined order; handle missing and unexpected labels safely.
  • Checklist:
    • Identified nominal vs ordinal correctly
    • Chose encodings suited to cardinality
    • Defined Missing/Unknown explicitly
    • Kept mappings that can be reused on new data

Common mistakes and self-check

  • Mistake: Label-encoding nominal categories for linear models, creating fake order. Self-check: Are coefficients implying rank that doesn’t exist? If yes, switch to one-hot.
  • Mistake: Target leakage from using full-data target means. Self-check: Did you compute encodings using the same rows you evaluate on? Use fold-based stats.
  • Mistake: Exploding the feature count with one-hot on high-cardinality columns. Self-check: Did the feature count grow massively? Consider frequency/target/hashing encodings or grouping rare levels.
  • Mistake: Dropping rows with missing categories. Self-check: Did the row count shrink? Prefer a "Missing" category.
  • Mistake: Inconsistent mappings across splits. Self-check: Do train and test produce different encoded shapes? Fit on train; apply consistently.

Practical projects

  • Retail demo: Clean and encode store_type, region, and promotion_code to analyze weekly sales uplift by segment.
  • Churn mini-model: Encode plan_type (target/frequency) and device (one-hot), then compare churn rates across encoded groups.
  • Support tickets: Normalize and encode issue_category (rare grouping) to build a simple time-to-resolution predictor.

Quick Test

Take the Quick Test below to check your understanding.

Mini challenge

You receive a "country" column with 250 unique values including typos ("U.S.", "USA", "United States", "Unted Stetes"). Create a plan to: normalize names, map to ISO-like standard labels, group tiny countries (< 0.5% of rows) into "Other", and choose an encoding for a logistic regression vs a gradient-boosted tree. Justify each decision in one sentence.

Next steps

  • Apply these encodings to a dataset you already know (marketing or product analytics).
  • Document your mappings and thresholds so teammates can reproduce your cleaning steps.
  • Continue to feature scaling and data splitting to prepare for modeling.

Practice Exercises

2 exercises to complete

Instructions

You have a column payment_method with sample values: ["credit card", "Credit Card", "paypal", "Pay Pal", "bank-transfer", null, "COD", "cash on delivery"].

  1. Normalize values: trim, lower-case, and standardize synonyms to a controlled set: {credit_card, paypal, bank_transfer, cod}.
  2. Create an Other category for anything outside the set and map null to missing.
  3. One-hot encode into columns prefixed with pay_. Use a safe fallback for unknowns at inference time (map to Other before encoding).
  4. Optional: drop one dummy (e.g., pay_other) for linear models to avoid multicollinearity.

Expected Output
New columns: pay_bank_transfer, pay_cod, pay_credit_card, pay_paypal, pay_missing, pay_other (each 0/1). Unknown or malformed future values map to pay_other=1.
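
One possible pandas sketch for this exercise; the synonym map is an assumption about which spellings appear:

  import pandas as pd

  df = pd.DataFrame({"payment_method": [
      "credit card", "Credit Card", "paypal", "Pay Pal",
      "bank-transfer", None, "COD", "cash on delivery",
  ]})

  # Step 1: normalize, then standardize synonyms to the controlled set
  synonyms = {
      "credit card": "credit_card",
      "paypal": "paypal",
      "pay pal": "paypal",
      "bank-transfer": "bank_transfer",
      "cod": "cod",
      "cash on delivery": "cod",
  }
  norm = df["payment_method"].str.strip().str.lower()
  std = norm.map(synonyms)

  # Step 2: unmapped labels become "other"; true nulls become "missing"
  std = std.mask(std.isna() & norm.notna(), "other").fillna("missing")

  # Step 3: fix the category set so every pay_ column exists, then one-hot encode;
  # at inference, the same pipeline sends unknown values to "other" before encoding
  levels = ["bank_transfer", "cod", "credit_card", "paypal", "missing", "other"]
  dummies = pd.get_dummies(pd.Categorical(std, categories=levels), prefix="pay")

  # Step 4 (optional, linear models): drop one dummy to avoid multicollinearity
  dummies_linear = dummies.drop(columns=["pay_other"])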

Handling Categorical Values — Quick Test

Test your knowledge with 10 questions. Pass with 70% or higher.

