
Data Type Standardization

Learn Data Type Standardization for free with explanations, exercises, and a quick test (for ETL Developers).

Published: January 11, 2026 | Updated: January 11, 2026

Why this matters

As an ETL Developer, you frequently join data from multiple sources. If one source stores a date as text and another as a timestamp, joins break, aggregations become incorrect, and downstream dashboards fail. Data type standardization ensures every column has a clear, consistent type and format across the pipeline.

  • Real tasks you will do: unify date formats, convert numbers with different separators, map booleans, standardize currency and units, and enforce schema in landing/curated layers.
  • Impact: fewer runtime errors, stable analytics, faster debugging, and easier collaboration between teams.

Concept explained simply

Data type standardization is the process of cleaning and converting values so that each field has one agreed type and representation. Examples:

  • Dates: from mixed strings like "03/01/2024", "2024-03-01", "20240301" to a single timestamp column.
  • Numbers: from "1,234.50" or "12,50" to DECIMAL(12,2).
  • Booleans: from "Y/N", "true/false", 1/0 to a single BOOLEAN.
  • Nulls: from "", "N/A", "-" to actual NULL.
Mental model
  1. Profile: detect current types and odd values (see the sketch after this list).
  2. Map: define the target types and value mappings.
  3. Normalize: clean strings (trim, remove separators, unify case, detect locale).
  4. Parse: convert text to typed values (timestamp, decimal, boolean).
  5. Cast and enforce: apply target data types and constraints.
  6. Validate: check counts, nulls, ranges, failed parses.
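
For step 1, a quick pandas profiling sketch (the sample values are illustrative): collapsing digits makes every value with the same format group together.

import pandas as pd

# Hedged sketch: replace digits with '9' so values sharing a format
# collapse into one group; the sample values are illustrative.
s = pd.Series(['2024-03-01', '03/01/2024', '20240301', 'N/A'])
shapes = s.str.replace(r'\d', '9', regex=True).value_counts()
print(shapes)
# 9999-99-99    1
# 99/99/9999    1
# 99999999      1
# N/A           1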

What to standardize

  • Numeric: remove thousands separators, fix decimal symbol, cast to DECIMAL for money. Avoid FLOAT/DOUBLE for currency due to rounding.
  • Dates/Times: parse multiple formats, set timezone explicitly, store as UTC timestamp, keep original timezone when needed.
  • Booleans: map common variants ("Y/N", "Yes/No", "true/false", 1/0) to a single BOOLEAN.
  • Categories/Codes: map legacy codes to canonical values. Keep a reference mapping table under version control.
  • Nulls and Blanks: convert "", "N/A", "NULL", "-", 0-length strings to NULL when appropriate.
  • Identifiers: trim whitespace, normalize case if case-insensitive, preserve leading zeros if meaningful (see the sketch after this list).
  • Arrays/JSON: parse to structured types where possible or store as validated JSON.
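
For identifiers, a minimal pandas sketch (the column name is illustrative) of why IDs should be read as strings:

import pandas as pd
from io import StringIO

# Hedged sketch: read IDs as strings so leading zeros survive; without the
# dtype hint, read_csv would infer '007' as the integer 7.
csv = StringIO("user_id\n007\n042\n")
df = pd.read_csv(csv, dtype={'user_id': 'string'})
print(df['user_id'].tolist())  # ['007', '042']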

Worked examples

Example 1: SQL — Standardize orders
-- Goal: date to timestamp (UTC midnight), amount to DECIMAL(12,2), paid to BOOLEAN
WITH raw AS (
  SELECT * FROM orders_raw
), cleaned AS (
  SELECT
    TRIM(order_id) AS order_id,
    -- Normalize date strings then parse. This COALESCE pattern assumes the
    -- engine returns NULL on a format mismatch (as Spark does); in engines
    -- that raise instead (e.g. PostgreSQL), use TRY_* variants or pre-filter.
    COALESCE(
      TO_TIMESTAMP(order_date, 'YYYY-MM-DD'),
      TO_TIMESTAMP(order_date, 'MM/DD/YYYY'),
      TO_TIMESTAMP(order_date, 'YYYYMMDD')
    ) AS order_ts_local,
    -- Normalize amount: strip currency symbols; when both ',' and '.' appear,
    -- ',' is a thousands separator, otherwise treat ',' as the decimal symbol
    CASE
      WHEN REGEXP_REPLACE(amount_txt, '[^0-9,.-]', '') LIKE '%,%'
       AND REGEXP_REPLACE(amount_txt, '[^0-9,.-]', '') LIKE '%.%'
      THEN REPLACE(REGEXP_REPLACE(amount_txt, '[^0-9,.-]', ''), ',', '')
      ELSE REPLACE(REGEXP_REPLACE(amount_txt, '[^0-9,.-]', ''), ',', '.')
    END AS amount_norm,
    UPPER(TRIM(paid_status)) AS paid_norm,
    UPPER(NULLIF(TRIM(currency), '')) AS currency_norm
  FROM raw
), standardized AS (
  SELECT
    order_id,
    -- Assume the parsed value is local midnight; convert to UTC if the source
    -- timezone is known, otherwise document the assumption and keep it naive
    order_ts_local AT TIME ZONE 'UTC' AS order_ts,
    -- Currency normalization: EUR to USD at 1.10 (example static rate)
    CAST(
      CASE 
        WHEN currency_norm = 'EUR' THEN CAST(amount_norm AS DECIMAL(14,4)) * 1.10
        WHEN currency_norm = 'USD' OR currency_norm IS NULL THEN CAST(amount_norm AS DECIMAL(14,4))
        ELSE CAST(amount_norm AS DECIMAL(14,4)) -- fallback; document this assumption
      END AS DECIMAL(12,2)
    ) AS amount_usd,
    CASE 
      WHEN paid_norm IN ('Y','YES','TRUE','1') THEN TRUE
      WHEN paid_norm IN ('N','NO','FALSE','0') THEN FALSE
      ELSE NULL
    END AS paid
  FROM cleaned
)
SELECT * FROM standardized;

Note: For real pipelines, apply proper timezone conversion based on source system metadata.
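
In pandas, that explicit-timezone step might look like the following sketch ('Europe/Berlin' stands in for whatever the source system actually uses):

import pandas as pd

# Hedged sketch: declare the source timezone explicitly, then convert to UTC.
ts = pd.Series(pd.to_datetime(['2024-03-01 09:30']))
ts_utc = ts.dt.tz_localize('Europe/Berlin').dt.tz_convert('UTC')
print(ts_utc[0])  # 2024-03-01 08:30:00+00:00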

Example 2: Python (pandas) — Clean booleans, dates, numbers
import pandas as pd
import numpy as np

df = pd.DataFrame({
    'user_id': [' 001 ', '002', '003 '],
    'signup': ['2024-03-01', '03/02/2024', '20240303'],
    'opt_in': ['Y', 'false', None],
    'balance': ['1,234.50', '€12,50', 'USD 10.00']
})

# Trim IDs, preserve leading zeros
df['user_id'] = df['user_id'].str.strip()

# Parse mixed date formats; errors='coerce' turns failures into NaT
# (format='mixed' needs pandas >= 2.0; infer_datetime_format is deprecated)
df['signup_ts'] = pd.to_datetime(df['signup'], errors='coerce', format='mixed')

# Normalize numbers: strip everything but digits and separators, then decide
# whether ',' is a thousands separator or the decimal symbol
def norm_amount(x):
    if pd.isna(x):
        return np.nan
    s = ''.join(ch for ch in str(x) if ch.isdigit() or ch in ',.-')
    if ',' in s and '.' in s:
        s = s.replace(',', '')   # '1,234.50' -> '1234.50'
    else:
        s = s.replace(',', '.')  # '12,50' -> '12.50'
    try:
        return float(s)
    except ValueError:
        return np.nan

df['balance_num'] = df['balance'].apply(norm_amount)

# Map booleans
true_set = {'y','yes','true','1','t'}
false_set = {'n','no','false','0','f'}

def to_bool(x):
    if pd.isna(x):
        return np.nan
    v = str(x).strip().lower()
    if v in true_set:
        return True
    if v in false_set:
        return False
    return np.nan

df['opt_in_bool'] = df['opt_in'].apply(to_bool).astype('boolean')  # nullable boolean

# Final dtypes: user_id (string), signup_ts (datetime64[ns]), balance_num (float), opt_in_bool (nullable boolean)

Example 3: Spark SQL — Parse mixed date/time and amounts
-- Assume the DataFrame is registered as a temp view named 'df'
-- (df.createOrReplaceTempView('df')) with columns: id, dt_str, amount_str, flag
SELECT
  TRIM(id) AS id,
  COALESCE(
    to_timestamp(dt_str, 'yyyy-MM-dd HH:mm:ss'),
    to_timestamp(dt_str, 'MM/dd/yyyy HH:mm'),
    to_timestamp(dt_str, 'yyyyMMdd')
  ) AS event_ts,
  CAST(
    CASE
      WHEN regexp_replace(amount_str, '[^0-9,.-]', '') LIKE '%,%'
       AND regexp_replace(amount_str, '[^0-9,.-]', '') LIKE '%.%'
      THEN replace(regexp_replace(amount_str, '[^0-9,.-]', ''), ',', '')
      ELSE replace(regexp_replace(amount_str, '[^0-9,.-]', ''), ',', '.')
    END
    AS DECIMAL(12,2)
  ) AS amount,
  CASE 
    WHEN lower(flag) IN ('y','yes','true','1') THEN TRUE
    WHEN lower(flag) IN ('n','no','false','0') THEN FALSE
    ELSE NULL
  END AS flag_bool
FROM df;

Step-by-step process you can follow

  1. Profile: count distinct formats, find non-numeric characters, list boolean variants, detect null tokens.
  2. Define target schema: pick exact data types and precision (e.g., DECIMAL(12,2) for money; TIMESTAMP UTC).
  3. Normalize: trim, unify case, remove thousands separators, standardize decimal symbol.
  4. Parse and cast: use explicit formats; handle errors as NULL and log failures (see the sketch after this list).
  5. Validate: check record counts, null rates, min/max, and sample rows.
  6. Document: store mapping rules and assumptions with versioning.
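
Steps 4 and 5 fit together: parse with coercion, then count the rows that became NULL even though the source was not blank. A hedged pandas sketch (names illustrative):

import pandas as pd

# Hedged sketch: coerce failures to NaT, then quantify true parse failures
# (new NULLs whose source value was not blank).
raw = pd.Series(['2024-03-01', 'not-a-date', ''])
parsed = pd.to_datetime(raw, errors='coerce')
failed = parsed.isna() & raw.str.strip().ne('')
print(f"parse failures: {failed.sum()} of {len(raw)}")  # parse failures: 1 of 3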

Validation rules and self-checks

  • Row count preserved: source rows = target rows.
  • Null spikes: investigate any sudden increase in NULL after casting.
  • Range checks: dates within expected window; amounts not negative unless allowed.
  • Boolean completeness: no unmapped values left as strings.
  • Parsing failures: count rows where parsing returned NULL but source was not empty.
-- Example validation queries
SELECT COUNT(*) FROM src;
SELECT COUNT(*) FROM tgt;
SELECT COUNT(*) FROM tgt WHERE amount_usd IS NULL AND amount_txt IS NOT NULL; -- assumes tgt keeps the raw column for auditing
SELECT MIN(order_ts), MAX(order_ts) FROM tgt;
SELECT paid, COUNT(*) FROM tgt GROUP BY paid;

Who this is for

  • ETL Developers and Data Engineers building ingestion and transformation pipelines.
  • Analytics Engineers ensuring clean semantic layers.
  • Anyone preparing data for reliable analysis and ML features.

Prerequisites

  • Basic SQL (SELECT, CASE, CAST, DATE/TIMESTAMP functions).
  • Comfort with one scripting language (Python or Scala) if needed.
  • Understanding of timezones and number formats.
  • Familiarity with NULL semantics.

Learning path

  • Before: Data profiling and source exploration.
  • Now: Data type standardization (this lesson).
  • Next: Schema enforcement, data contracts, and data quality checks.

Common mistakes (and fixes)

  • Casting before cleaning. Fix: remove separators and normalize decimal symbol first.
  • Using FLOAT for money. Fix: use DECIMAL with sufficient precision (see the demo after this list).
  • Timezone confusion. Fix: set the source timezone explicitly and convert to UTC.
  • Dropping leading zeros from IDs. Fix: treat IDs as strings; do not cast to INT unless numerically meaningful.
  • Incomplete boolean mapping. Fix: enumerate all known variants and log unknowns.
  • Silent truncation/overflow. Fix: choose DECIMAL precision that covers maxima; test boundary values.
  • Locale surprises. Fix: do not rely on implicit locale; normalize strings explicitly.
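
The float-for-money pitfall is easy to demonstrate in plain Python:

from decimal import Decimal

# Binary floats accumulate representation error; fixed-point Decimal does not.
print(sum([0.10] * 3))                           # 0.30000000000000004
print(sum([Decimal('0.10')] * 3, Decimal('0')))  # 0.30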

Practical projects

  • Build a standardization view that unifies transactions from two payment gateways to a single schema.
  • Create a mapping table that converts vendor category codes to your company’s canonical categories and apply it in SQL.
  • Write a reusable function/library to normalize numbers and booleans across pipelines; include unit tests.

Exercises

Do these in your SQL engine or notebook. The quick test is available to everyone; sign in if you want your progress saved.

Exercise 1: SQL — Standardize payments

Source table: raw_payments with columns:

  • txn_id STRING (may contain spaces)
  • txn_date STRING with mixed formats: '2024-03-01', '03/01/2024', '20240301'
  • amount STRING like '1,234.50', '€12,50', 'USD 10.00'
  • currency STRING: 'EUR', 'USD', or NULL
  • success STRING: 'Y','N','true','false','1','0', or NULL

Task:

  • Produce: txn_id (trimmed), txn_ts (TIMESTAMP at UTC midnight), amount_usd DECIMAL(12,2) using rate EUR->USD = 1.10, success BOOLEAN.
  • When currency is NULL but amount contains a symbol, infer currency from the symbol.
  • Set parsing failures to NULL and keep row counts unchanged.

Expected output: a query returning standardized columns with the specified types.

Exercise 2: Python — Clean users

Input DataFrame raw_users with columns:

  • user_id: strings with leading/trailing spaces, some like '007'
  • signup_date: mixed forms '01-03-2024', '2024/03/02', 1709337600 (epoch seconds)
  • email_opt_in: 'Yes','no','1','0', None
  • balance: '1 234,00', '1234.00', ''

Task:

  • user_id trimmed (keep leading zeros).
  • signup_ts as pandas datetime in UTC; treat epoch as seconds.
  • email_opt_in_bool as boolean with unmapped values to NaN.
  • balance_num as float with correct decimal point; blanks to NaN.

Expected output: DataFrame with dtypes string, datetime64[ns, UTC], boolean/nullable, and float.

Checklist before you move on:

  • Did you preserve row counts?
  • Are all booleans truly boolean?
  • Are money amounts using fixed-point types (floats only for non-financial data)?
  • Did you log or quantify parsing failures?

Mini challenge

You receive product_size: 'M', 'medium', 'Med', 'L', 'Large', '' and weight: '1,000.0 g', '2.5kg', '0.5 KG'. Standardize to size ENUM ('S','M','L') and weight_kg DECIMAL(8,3).

Hint
  • Size: map variants to S/M/L; blanks to NULL.
  • Weight: strip spaces, lowercase, parse unit (g/kg), convert to kg, unify decimal.
Sample solution
-- Pseudocode / SQL-ish: extract the numeric part first, then convert units
WITH parsed AS (
  SELECT
    product_size,
    weight,
    CAST(
      CASE
        WHEN REGEXP_REPLACE(weight, '[^0-9,.]', '') LIKE '%,%'
         AND REGEXP_REPLACE(weight, '[^0-9,.]', '') LIKE '%.%'
        THEN REPLACE(REGEXP_REPLACE(weight, '[^0-9,.]', ''), ',', '')  -- '1,000.0 g' -> '1000.0'
        ELSE REPLACE(REGEXP_REPLACE(weight, '[^0-9,.]', ''), ',', '.') -- '2,5kg' -> '2.5'
      END AS DOUBLE
    ) AS weight_num
  FROM src
)
SELECT
  CASE
    WHEN LOWER(TRIM(product_size)) IN ('s','small') THEN 'S'
    WHEN LOWER(TRIM(product_size)) IN ('m','med','medium') THEN 'M'
    WHEN LOWER(TRIM(product_size)) IN ('l','large') THEN 'L'
    ELSE NULL
  END AS size,
  CAST(
    CASE
      WHEN LOWER(weight) LIKE '%kg%' THEN weight_num          -- '2.5kg', '0.5 KG'
      WHEN LOWER(weight) LIKE '%g%'  THEN weight_num / 1000.0 -- '1,000.0 g'
      ELSE NULL
    END AS DECIMAL(8,3)
  ) AS weight_kg
FROM parsed;

Next steps

  • Add schema enforcement so new loads cannot introduce type drift (see the sketch after this list).
  • Automate data quality checks for null spikes and parsing errors.
  • Document your standardization rules and share with your team.
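
As a starting point for schema enforcement, a minimal sketch (the expected dtypes are assumptions; adapt them to your target schema):

import pandas as pd

# Hedged sketch: fail loudly when a load drifts from the agreed schema.
EXPECTED = {
    'user_id': 'string',
    'signup_ts': 'datetime64[ns, UTC]',
    'balance_num': 'float64',
}

def assert_schema(df: pd.DataFrame) -> None:
    drift = {
        col: (str(df[col].dtype), expected)
        for col, expected in EXPECTED.items()
        if str(df[col].dtype) != expected
    }
    if drift:
        raise TypeError(f"type drift detected (actual, expected): {drift}")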

Note on progress

The quick test below is available to everyone. Sign in to save your test results and track progress over time.

Data Type Standardization — Quick Test

Test your knowledge with 9 questions. Pass with 70% or higher.
