How to learn Data Type Fixes for Data Cleaning in Data Analyst for free

Why this matters

Data type fixes turn messy strings into usable numbers, dates, booleans, and categories. As a Data Analyst, you will regularly:

Calculate KPIs from currency fields stored as text (e.g., "$1,234.50").
Merge datasets where dates use different formats (e.g., 31-12-2024 vs 12/31/2024).
Filter on boolean flags expressed as "Y/N", "yes/no", or 1/0.
Group by standardized categories and codes (e.g., country codes).

Correct types make calculations correct, joins reliable, and dashboards stable.

Concept explained simply

A data type describes how a value behaves: numbers add up, dates know months and years, booleans are true/false, and categories have a fixed set of labels. If the type is wrong, tools treat the value as plain text, and math/logic will fail or be inaccurate.

Mental model: the casting gate

Imagine a gate that checks every value and either converts it cleanly or rejects it. Your job is to design rules so most values pass the gate and rejected values are flagged for review. Rules include trimming spaces, removing symbols, standardizing formats, and choosing the right type.

Common types and how to fix

Numbers (currency, percentages, units)

Remove non-numeric symbols: currency ($, €, £), commas, spaces, percentage signs, unit strings.
Handle negatives: both leading '-' and parentheses (e.g., (123.45) -> -123.45).
Normalize decimal separators (dot vs comma) using locale-aware parsing or replacements.
Cast with a safe function (e.g., TRY_ casts) and review any values that fail.

Examples

Excel/Sheets:

1) Clean currency:
=VALUE(SUBSTITUTE(SUBSTITUTE(SUBSTITUTE(A2,"$",""),",","")," ",""))

2) Parentheses negatives:
=IF(LEFT(A2,1)="(",-VALUE(SUBSTITUTE(SUBSTITUTE(MID(A2,2,LEN(A2)-2),",",""),"$","")),
     VALUE(SUBSTITUTE(SUBSTITUTE(A2,",",""),"$","")))

Python (pandas):

import re
s = df['price_str'].str.replace(r"[\$,\s]","", regex=True)
s = s.str.replace(r"\((.*)\)", r"-\1", regex=True)
df['price'] = pd.to_numeric(s, errors='coerce')

PostgreSQL:

-- Remove $ and commas, handle parentheses
SELECT CASE WHEN price_str ~ '^\(.*\)$'
            THEN -1 * (REPLACE(REPLACE(substr(price_str,2,length(price_str)-2),',',''),'$',''))::numeric
            ELSE (REPLACE(REPLACE(price_str,',',''),'$',''))::numeric
       END AS price
FROM t;

Dates and times

Unify format: explicitly define the pattern (e.g., YYYY-MM-DD).
Resolve ambiguity: decide whether day comes first (day-first vs month-first).
Handle time zones by converting to a standard (often UTC) when needed.
Flag impossible dates (e.g., 31/02/2024) for review.

Examples

Excel/Sheets:

=DATEVALUE(TEXT(A2,"yyyy-mm-dd"))  -- after first normalizing text to a consistent pattern
-- Or: Data > Split/Text-to-Columns with a defined date format

Python (pandas):

df['dt'] = pd.to_datetime(df['date_str'], errors='coerce', dayfirst=True)
# or provide exact format when known
df['dt_iso'] = pd.to_datetime(df['iso_str'], format='%Y-%m-%d', errors='coerce')

PostgreSQL:

SELECT to_date(date_str, 'DD-MM-YYYY') AS dt
FROM t;
-- For timestamps: to_timestamp('31-12-2024 23:59','DD-MM-YYYY HH24:MI')

Booleans

Map common variants to True/False: 'y','yes','true','1' -> true; 'n','no','false','0' -> false.
Treat blanks or unexpected values as NULL/Unknown and review.

Examples

Excel/Sheets:

=IF(OR(LOWER(A2)="y",LOWER(A2)="yes",A2="1",LOWER(A2)="true"),TRUE,
 IF(OR(LOWER(A2)="n",LOWER(A2)="no",A2="0",LOWER(A2)="false"),FALSE,NA()))

Python (pandas):

map_true = {'y':True,'yes':True,'true':True,'1':True}
map_false = {'n':False,'no':False,'false':False,'0':False}
low = df['flag'].astype(str).str.strip().str.lower()
df['is_active'] = low.map({**map_true, **map_false})

PostgreSQL:

SELECT CASE WHEN lower(flag) IN ('y','yes','true','1') THEN true
            WHEN lower(flag) IN ('n','no','false','0') THEN false
            ELSE NULL END AS is_active
FROM t;

Categories (enums, codes)

Standardize labels and casing (e.g., 'US', 'U.S.', 'United States' -> 'US').
Use a reference list/dictionary; log unknowns for review.

Examples

Excel/Sheets:

=SWITCH(UPPER(SUBSTITUTE(A2,".","")),
 "US","US","UNITED STATES","US","USA","US",
 "UK","UK","UNITED KINGDOM","UK",
 A2)  -- Fallback to original

Python (pandas):

canon = {'us':'US','u.s.':'US','united states':'US','usa':'US'}
val = df['country'].str.replace('.','', regex=False).str.strip().str.lower()
df['country_std'] = val.map(canon).fillna(df['country'])

PostgreSQL:

SELECT CASE WHEN replace(lower(country),'.','') IN ('us','united states','usa') THEN 'US'
            WHEN replace(lower(country),'.','') IN ('uk','united kingdom') THEN 'UK'
            ELSE country END AS country_std
FROM t;

Worked examples

Example 1: Currency string to number

Input: "$1,234.50", "(99.95)", " € 5 000 ". Goal: numeric values 1234.50, -99.95, 5000.00.

Show walkthrough

Trim spaces.
Detect parentheses to mark negatives.
Remove currency symbols and separators.
Cast to number; review any failures.

p = df['price_str'].str.strip()
p = p.str.replace(r"\((.*)\)", r"-\1", regex=True)
p = p.str.replace(r"[^0-9\-\.]","", regex=True)
df['price'] = pd.to_numeric(p, errors='coerce')

Example 2: Mixed date formats

Input: ["31-12-2024", "12/31/2024", "2024-12-31"]. Goal: one date column.

Show walkthrough

Try explicit formats in order.
Prefer day-first when region requires it.
Coerce failures to NULL for review.

date = pd.to_datetime(df['d1'], format='%d-%m-%Y', errors='coerce')
date = date.fillna(pd.to_datetime(df['d1'], format='%m/%d/%Y', errors='coerce'))
date = date.fillna(pd.to_datetime(df['d1'], errors='coerce'))
df['date'] = date

Example 3: Y/N to boolean

Input: ['Y','N','Yes','', '0','1','true','False'] -> True/False/NULL.

Show walkthrough

m = {'y':True,'yes':True,'1':True,'true':True,
     'n':False,'no':False,'0':False,'false':False}
low = df['flag'].astype(str).str.strip().str.lower()
df['flag_bool'] = low.map(m)

Practical steps (do this on every dataset)

Profile columns: detect current types, unique values, min/max, nulls.
Decide target type: number, date, boolean, category, or string.
Write conversion rules: trims, symbol removal, mapping dictionaries, explicit formats.
Apply safe casts: functions that return NULL on failure (e.g., errors='coerce', TRY_CAST).
Quantify fallout: % of values that failed conversion. Investigate top offenders.
Lock in: store cleaned columns with clear names (e.g., price_num, signup_date).

Exercises

These mirror the tasks below. Do them in your preferred tool (Excel/Sheets, SQL, or Python).

Exercise 1: Clean and cast currency strings to numbers.
Exercise 2: Parse mixed-format dates into one date column.
Exercise 3: Normalize boolean responses.

Checklist for your solutions

Spaces, symbols, and separators removed consistently.
Negatives handled, including parentheses.
Explicit date formats provided where possible.
Unknown boolean values set to NULL/NA (not silently converted).
Failure rate measured and reported.

Common mistakes and how to self-check

Mistake: Assuming all commas are thousands separators. Self-check: Count digits between separators; verify locale. Consider replacing only if pattern matches.
Mistake: Auto-parsing dates without specifying day-first. Self-check: Spot-check records where day and month both ≤ 12.
Mistake: Forcing booleans for every value. Self-check: Keep unknowns as NULL and report them.
Mistake: Losing negatives in parentheses. Self-check: Test with a sample that includes (123.45).
Mistake: Overwriting raw data. Self-check: Always create a new column for the converted value.

Practical projects

E-commerce snapshot: Convert price, discount %, and order_date to correct types; compute revenue and daily order counts.
Survey cleanup: Normalize yes/no/maybe flags; calculate completion rates by segment.
HR dataset: Standardize hire_date, termination_date, and employment_status; compute tenure and active headcount over time.

Learning path

Before this: Missing values handling and basic text cleaning (trim, lower, remove duplicates).
This subskill: Systematic conversion to numeric, date, boolean, and category types with validation.
Next: Data validation rules, outlier detection, and constraints to prevent future type drift.

Who this is for

Aspiring and junior Data Analysts cleaning CSVs, spreadsheets, or database tables.
Anyone preparing datasets for BI tools or machine learning features.

Prerequisites

Basic spreadsheet formulas or beginner knowledge of SQL/Python.
Comfort with simple string operations (trim, replace) and date formats.

Mini challenge

Given these values, produce 4 typed columns: price_num, signup_date, opted_in, country_std.

price_str: ["$1,290.00", "(75)", "EUR 5 100"]
date_str:  ["31-12-2024", "12/30/2024", "2024-12-29"]
opt_in:    ["Y", "no", "1"]
country:   ["U.S.", "United States", "UK"]

Rules: handle parentheses negatives, mixed dates, boolean mapping, and US/UK standardization. Report any failed parses.

Next steps

Apply these rules to one of your real datasets. Log the % of failed conversions.
Create small helper functions (or spreadsheet templates) you can reuse for numbers, dates, and booleans.
Take the quick test below to confirm mastery. Note: Everyone can take the test; only logged-in users will see saved progress.

Menu

Data Type Fixes

Table of Contents

Why this matters

Concept explained simply

Mental model: the casting gate

Common types and how to fix

Worked examples

Example 1: Currency string to number

Example 2: Mixed date formats

Example 3: Y/N to boolean

Practical steps (do this on every dataset)

Exercises

Checklist for your solutions

Common mistakes and how to self-check

Practical projects

Learning path

Who this is for

Prerequisites

Mini challenge

Next steps

Practice Exercises

Clean currency strings and cast to numbers

Instructions

Expected Output

Parse mixed-format dates into one column

Normalize booleans

Data Type Fixes — Quick Test

Have questions about Data Type Fixes?

AI Assistant