
Handling Categorical Values

Learn Handling Categorical Values for free with explanations, exercises, and a quick test (for Data Analysts).

Published: December 19, 2025 | Updated: December 19, 2025

Who this is for

Data Analysts and learners who need to clean datasets with text categories (like country, channel, device, plan). You will turn messy labels into consistent, model-ready features.

Prerequisites

  • Basic familiarity with spreadsheets or Python/pandas (nice to have, not required).
  • Know the difference between numerical and categorical data.
  • Comfort reading small tables and following step-by-step procedures.

Learning path

  • Before: Missing values handling, text normalization, and deduplication.
  • This lesson: Detect, clean, and encode categorical values safely.
  • After: Feature scaling, feature selection, and data splits for modeling.

Why this matters

Real analyst tasks rely on accurate categoricals:

  • Marketing: Compare conversion by campaign channel or region.
  • Product: Analyze churn by subscription plan and device type.
  • Finance: Classify expense categories and flag anomalies.
  • Operations: Forecast demand by store type or supplier.

Mishandled categories lead to broken dashboards, misleading averages, and biased models. Getting this right is a high-leverage data-cleaning skill.

Concept explained simply

Categories are labels that describe groups (e.g., "iOS", "Android"). Some have order ("Low", "Medium", "High"), others don’t ("Red", "Blue"). Algorithms need numbers, so we convert categories into numeric form without losing meaning or creating fake order.

Mental model

Imagine a set of labeled bins. First, make labels consistent (same spelling and case). Next, decide if bins are ordered. Then, turn bins into numbers carefully: use separate lights (one-hot) for unordered bins, or a ladder (ordinal codes) for ordered bins. For many tiny bins, bundle tiny ones into an "Other" bin so the system stays stable.

Core decisions

  • Type: Nominal (no order) vs. Ordinal (has order).
  • Cardinality: Few categories vs. many categories.
  • Downstream use: Dashboard summaries vs. statistical models vs. tree-based models.

Quick rules of thumb (a short pandas sketch follows this list)
  • Nominal + few categories: One-hot encode (optionally drop one column for linear models).
  • Ordinal: Map to ordered integers (e.g., 1–5).
  • High cardinality: Consider frequency encoding, target encoding (with leakage-safe folds), hashing, or grouping rare levels into "Other".
  • Unknowns/missing: Keep a distinct category ("Unknown" or "Missing") rather than dropping rows.
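
A minimal pandas illustration of the first two rules, assuming hypothetical column names:

  import pandas as pd

  df = pd.DataFrame({
      "device": ["ios", "android", "ios", "web"],    # nominal, few categories
      "priority": ["low", "high", "medium", "low"],  # ordinal
  })

  # Nominal + few categories: one-hot encode (drop_first=True suits linear models)
  onehot = pd.get_dummies(df["device"], prefix="device")

  # Ordinal: map to ordered integers
  df["priority_code"] = df["priority"].map({"low": 1, "medium": 2, "high": 3})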

Step-by-step workflow

  1. Profile: List columns that are categorical. Check unique counts, most frequent values, and missing rate (a pandas sketch of steps 1 and 2 follows this list).
  2. Normalize text: Trim spaces, unify case, fix common typos (e.g., "U.S.A" vs "USA").
  3. Define type: Mark columns as nominal or ordinal. Write the intended order for ordinal columns.
  4. Handle rare/unknown: Decide a threshold (e.g., categories < 1% frequency) to group into "Other"; create an explicit "Missing"/"Unknown" where appropriate.
  5. Choose encoding: One-hot for nominal small-cardinality; ordinal mapping for ordinal; frequency/target/hashing for high-cardinality.
  6. Fit on training data only: Save mappings from training; apply to validation/test. Use fold-based statistics for target encoding to avoid leakage.
  7. Validate: Spot-check with simple group averages and counts. Ensure no unexpected categories are dropped silently.
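
A minimal pandas sketch of steps 1 and 2, using a small hypothetical DataFrame:

  import pandas as pd

  df = pd.DataFrame({"channel": ["Email", "email ", "Social", "Paid", None]})

  # Step 1: profile unique counts, top values, and missing rate
  print(df["channel"].nunique())                   # unique categories
  print(df["channel"].value_counts(dropna=False))  # frequencies, including nulls
  print(df["channel"].isna().mean())               # missing rate

  # Step 2: normalize text by trimming whitespace and unifying case
  df["channel"] = df["channel"].str.strip().str.lower()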

Worked examples

Example 1 — One-hot encode a nominal column

Column: channel = {"Email", "email ", "Social", "Paid", null}

  1. Normalize: lower-case, trim: {"email", "email", "social", "paid", null}
  2. Add Missing category: replace null with "missing"
  3. One-hot encode: create channel_email, channel_social, channel_paid, channel_missing
  4. Optional for linear models: drop one column (e.g., channel_missing) to avoid perfect multicollinearity

Why it works

Nominal categories have no order. One-hot keeps information without implying rank.
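
A pandas sketch of this example; pd.get_dummies produces the four channel_* columns:

  import pandas as pd

  df = pd.DataFrame({"channel": ["Email", "email ", "Social", "Paid", None]})

  # Steps 1-2: normalize, then make missing an explicit category
  df["channel"] = df["channel"].str.strip().str.lower().fillna("missing")

  # Step 3: one-hot encode into channel_email, channel_social, channel_paid, channel_missing
  dummies = pd.get_dummies(df["channel"], prefix="channel")

  # Step 4 (optional, linear models): drop one column to avoid perfect multicollinearity
  dummies_linear = dummies.drop(columns=["channel_missing"])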

Example 2 — Ordinal encode survey satisfaction

Column: satisfaction = {"Very dissatisfied","Dissatisfied","Neutral","Satisfied","Very satisfied"}

  1. Define order: Very dissatisfied < Dissatisfied < Neutral < Satisfied < Very satisfied
  2. Map: {1,2,3,4,5}
  3. Missing/unknown: map to 3 (the neutral midpoint) or to a separate 0 = "Missing" code if you want to keep a distinct flag

Why not one-hot?

We would lose order information if we one-hot this variable, which can reduce model power and interpretability.
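
A pandas sketch of this mapping, using 0 as a distinct "Missing" code:

  import pandas as pd

  df = pd.DataFrame({"satisfaction": ["Satisfied", "Neutral", None, "Very satisfied"]})

  # Steps 1-2: define the order and map to integers
  order = {
      "Very dissatisfied": 1, "Dissatisfied": 2, "Neutral": 3,
      "Satisfied": 4, "Very satisfied": 5,
  }
  # Step 3: missing or unexpected labels fall back to 0 instead of being dropped
  df["satisfaction_code"] = df["satisfaction"].map(order).fillna(0).astype(int)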

Example 3 — Frequency encoding for high-cardinality IDs

Column: product_id with thousands of unique values.

  1. Compute frequency per ID on training data only (e.g., occurrences / total rows).
  2. Replace product_id with its frequency number.
  3. Unknown IDs in validation/test: assign a small default (e.g., global min frequency) or a learned "unknown" value.

When to use

Useful when one-hot would create too many columns and when plain label encoding would add fake order.
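
A pandas sketch of frequency encoding; train and test are hypothetical splits:

  import pandas as pd

  train = pd.DataFrame({"product_id": ["A1", "A1", "B2", "C3", "A1"]})
  test = pd.DataFrame({"product_id": ["B2", "Z9"]})  # "Z9" never appears in training

  # Step 1: compute frequencies on training data only
  freq = train["product_id"].value_counts(normalize=True)

  # Steps 2-3: map both splits; unseen IDs fall back to the training minimum
  train["product_freq"] = train["product_id"].map(freq)
  test["product_freq"] = test["product_id"].map(freq).fillna(freq.min())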

Example 4 — Target encoding with leakage-safe folds

Goal: Predict churn (0/1). Column: plan_type (many values).

  1. Split training into K folds.
  2. For each fold, compute mean churn by plan_type using only the other K-1 folds.
  3. Map plan_type in the held-out fold to these means. Apply smoothing to shrink rare plans toward global mean.
  4. On validation/test, map using statistics from full training only; unknown plans use global mean.

Why careful?

Directly using whole-training targets leaks information and inflates performance. Fold-based encoding prevents leakage.
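
A leakage-safe sketch using scikit-learn's KFold; the smoothing weight of 5 and the toy data are illustrative:

  import pandas as pd
  from sklearn.model_selection import KFold

  train = pd.DataFrame({
      "plan_type": ["basic", "pro", "basic", "pro", "basic", "enterprise"],
      "churn":     [1,       0,     1,       0,     0,       1],
  })
  global_mean = train["churn"].mean()
  smoothing = 5.0  # shrinks rare plans toward the global mean

  train["plan_te"] = global_mean  # fallback for plans unseen in a fold
  for fit_idx, enc_idx in KFold(n_splits=3, shuffle=True, random_state=0).split(train):
      # Steps 1-2: per-plan churn means computed on the other folds only
      stats = train.iloc[fit_idx].groupby("plan_type")["churn"].agg(["mean", "count"])
      smoothed = (stats["mean"] * stats["count"] + global_mean * smoothing) / (
          stats["count"] + smoothing
      )
      # Step 3: encode the held-out fold with those means
      col = train.columns.get_loc("plan_te")
      train.iloc[enc_idx, col] = (
          train.iloc[enc_idx]["plan_type"].map(smoothed).fillna(global_mean).values
      )

  # Step 4: for validation/test, recompute the smoothed means on the full training
  # set and map, with global_mean as the fallback for unknown plans.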

Handling missing, rare, and unknown

  • Missing: Use an explicit category ("Missing") to avoid row drops. This preserves signal if missingness is informative.
  • Rare: Group categories below a frequency threshold into "Other" to improve stability.
  • Unknown at inference: Always define a fallback mapping (e.g., "Unknown" bin, default frequency, or global mean). A sketch covering all three cases follows this list.
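
A minimal pandas sketch of these three rules; the 20% rare threshold only suits this tiny demo:

  import pandas as pd

  train = pd.DataFrame({"country": ["us", "us", "uk", None, "fj", "us", "uk", "de"]})

  # Missing: keep rows, use an explicit category
  train["country"] = train["country"].fillna("missing")

  # Rare: group low-frequency categories (but never "missing") into "other"
  freq = train["country"].value_counts(normalize=True)
  rare = freq[(freq < 0.20) & (freq.index != "missing")].index
  train["country"] = train["country"].replace(rare.tolist(), "other")

  # Unknown at inference: anything outside the learned vocabulary falls back to "other"
  vocab = set(train["country"].unique())
  incoming = pd.Series(["uk", "br"])
  incoming = incoming.where(incoming.isin(vocab), "other")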

Practical tooling tips

  • Set categorical dtype (where available) to save memory and catch invalid values early.
  • When one-hot encoding, use an option that ignores unknown categories rather than erroring, and keep track of feature names (see the scikit-learn sketch after this list).
  • Store mapping dictionaries (for ordinal, frequency, or target encodings) with metadata: date created, source columns, thresholds.
  • Always fit encoders on training data only, then transform validation/test.
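
A scikit-learn sketch of the first two tips (sparse_output assumes scikit-learn 1.2 or newer):

  import pandas as pd
  from sklearn.preprocessing import OneHotEncoder

  train = pd.DataFrame({"device": ["ios", "android", "web", "ios"]})
  test = pd.DataFrame({"device": ["ios", "kiosk"]})  # "kiosk" is unseen

  # Categorical dtype: saves memory and surfaces invalid values early
  train["device"] = train["device"].astype("category")

  # Fit on training only; unknown categories encode as all zeros instead of erroring
  enc = OneHotEncoder(handle_unknown="ignore", sparse_output=False)
  enc.fit(train[["device"]])

  print(enc.get_feature_names_out())      # device_android, device_ios, device_web
  print(enc.transform(test[["device"]]))  # the "kiosk" row is all zeros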

Exercises

These mirror the exercises below. Try them before opening the solutions.

  1. Exercise 1: Clean and encode a payment_method column with messy labels. Steps: normalize, group rare into Other, then one-hot with a safe fallback for unknowns.
  2. Exercise 2: Ordinal-encode a satisfaction column with a defined order; handle missing and unexpected labels safely.
  • Checklist:
    • Identified nominal vs ordinal correctly
    • Chose encodings suited to cardinality
    • Defined Missing/Unknown explicitly
    • Kept mappings that can be reused on new data

Common mistakes and self-check

  • Mistake: Label-encoding nominal categories for linear models, creating fake order. Self-check: Are coefficients implying rank that doesn’t exist? If yes, switch to one-hot.
  • Mistake: Target leakage from using full-data target means. Self-check: Did you compute encodings using the same rows you evaluate on? Use fold-based stats.
  • Mistake: Exploding the feature count with one-hot on high-cardinality columns. Self-check: Did the feature count grow massively? Consider frequency/target/hashing encodings or grouping rare levels.
  • Mistake: Dropping rows with missing categories. Self-check: Did the row count shrink? Prefer a "Missing" category.
  • Mistake: Inconsistent mappings across splits. Self-check: Do train and test produce different encoded shapes? Fit on train; apply consistently.

Practical projects

  • Retail demo: Clean and encode store_type, region, and promotion_code to analyze weekly sales uplift by segment.
  • Churn mini-model: Encode plan_type (target/frequency) and device (one-hot), then compare churn rates across encoded groups.
  • Support tickets: Normalize and encode issue_category (rare grouping) to build a simple time-to-resolution predictor.

Quick Test

Take the Quick Test below to check your understanding.

Mini challenge

You receive a "country" column with 250 unique values including typos ("U.S.", "USA", "United States", "Unted Stetes"). Create a plan to: normalize names, map to ISO-like standard labels, group tiny countries (< 0.5% of rows) into "Other", and choose an encoding for a logistic regression vs a gradient-boosted tree. Justify each decision in one sentence.

Next steps

  • Apply these encodings to a dataset you already know (marketing or product analytics).
  • Document your mappings and thresholds so teammates can reproduce your cleaning steps.
  • Continue to feature scaling and data splitting to prepare for modeling.

Practice Exercises

2 exercises to complete

Instructions

You have a column payment_method with sample values: ["credit card", "Credit Card", "paypal", "Pay Pal", "bank-transfer", null, "COD", "cash on delivery"].

  1. Normalize values: trim, lower-case, and standardize synonyms to a controlled set: {credit_card, paypal, bank_transfer, cod}.
  2. Create an Other category for anything outside the set and map null to missing.
  3. One-hot encode into columns prefixed with pay_. Use a safe fallback for unknowns at inference time (map to Other before encoding).
  4. Optional: drop one dummy (e.g., pay_other) for linear models to avoid multicollinearity.

Expected Output
New columns: pay_bank_transfer, pay_cod, pay_credit_card, pay_paypal, pay_missing, pay_other (each 0/1). Unknown or malformed future values map to pay_other=1.
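
One possible pandas sketch for this exercise; the synonym map is an assumption about which spellings appear:

  import pandas as pd

  df = pd.DataFrame({"payment_method": [
      "credit card", "Credit Card", "paypal", "Pay Pal",
      "bank-transfer", None, "COD", "cash on delivery",
  ]})

  # Step 1: normalize, then standardize synonyms to the controlled set
  synonyms = {
      "credit card": "credit_card",
      "paypal": "paypal",
      "pay pal": "paypal",
      "bank-transfer": "bank_transfer",
      "cod": "cod",
      "cash on delivery": "cod",
  }
  norm = df["payment_method"].str.strip().str.lower()
  std = norm.map(synonyms)

  # Step 2: unmapped labels become "other"; true nulls become "missing"
  std = std.mask(std.isna() & norm.notna(), "other").fillna("missing")

  # Step 3: fix the category set so every pay_ column exists, then one-hot encode;
  # at inference, the same pipeline sends unknown values to "other" before encoding
  levels = ["bank_transfer", "cod", "credit_card", "paypal", "missing", "other"]
  dummies = pd.get_dummies(pd.Categorical(std, categories=levels), prefix="pay")

  # Step 4 (optional, linear models): drop one dummy to avoid multicollinearity
  dummies_linear = dummies.drop(columns=["pay_other"])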

Handling Categorical Values — Quick Test

Test your knowledge with 10 questions. Pass with 70% or higher.

