Why this matters
As an ETL Developer, you frequently join data from multiple sources. If one source stores a date as text and another as a timestamp, joins break, aggregations become incorrect, and downstream dashboards fail. Data type standardization ensures every column has a clear, consistent type and format across the pipeline.
- Real tasks you will do: unify date formats, convert numbers with different separators, map booleans, standardize currency and units, and enforce schema in landing/curated layers.
- Impact: fewer runtime errors, stable analytics, faster debugging, and easier collaboration between teams.
Concept explained simply
Data type standardization is the process of cleaning and converting values so that each field has one agreed type and representation. Examples (sketched in code right after this list):
- Dates: from mixed strings like "03/01/2024", "2024-03-01", "20240301" to a single timestamp column.
- Numbers: from "1,234.50" or "12,50" to DECIMAL(12,2).
- Booleans: from "Y/N", "true/false", 1/0 to a single BOOLEAN.
- Nulls: from "", "N/A", "-" to actual NULL.
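Expressed in plain Python (standard library only; the field names and values below are purely illustrative), those conversions look like this:
# A minimal sketch of the conversions listed above; values and field names are examples only
from datetime import datetime
from decimal import Decimal

raw = {'order_date': '03/01/2024', 'amount': '1,234.50', 'active': 'Y', 'middle_name': 'N/A'}

order_ts = datetime.strptime(raw['order_date'], '%m/%d/%Y')        # date string -> typed timestamp
amount = Decimal(raw['amount'].replace(',', ''))                   # '1,234.50' -> Decimal('1234.50')
active = {'Y': True, 'N': False}.get(raw['active'])                # 'Y'/'N' -> True/False (unknown -> None)
middle_name = None if raw['middle_name'] in ('', 'N/A', '-') else raw['middle_name']  # null token -> None

print(order_ts, amount, active, middle_name)   # 2024-03-01 00:00:00 1234.50 True None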
Mental model
- Profile: detect current types and odd values.
- Map: define the target types and value mappings.
- Normalize: clean strings (trim, remove separators, unify case, detect locale).
- Parse: convert text to typed values (timestamp, decimal, boolean).
- Cast and enforce: apply target data types and constraints.
- Validate: check counts, nulls, ranges, failed parses.
What to standardize
- Numeric: remove thousands separators, fix decimal symbol, cast to DECIMAL for money. Avoid FLOAT/DOUBLE for currency due to rounding.
- Dates/Times: parse multiple formats, set timezone explicitly, store as UTC timestamp, keep original timezone when needed.
- Booleans: map common variants ("Y/N", "Yes/No", "true/false", 1/0) to a single BOOLEAN.
- Categories/Codes: map legacy codes to canonical values. Keep a reference mapping table under version control (see the mapping sketch after this list).
- Nulls and Blanks: convert "", "N/A", "NULL", "-", 0-length strings to NULL when appropriate.
- Identifiers: trim whitespace, normalize case if case-insensitive, preserve leading zeros if meaningful.
- Arrays/JSON: parse to structured types where possible or store as validated JSON.
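For category/code mapping, here is a small sketch of applying a version-controlled mapping table with pandas; the table and column names (vendor_code, canonical_category) are made up for illustration:
import pandas as pd

# Reference mapping table (in practice, a versioned CSV or seed table; names are illustrative)
code_map = pd.DataFrame({
    'vendor_code': ['ELEC', 'ELECT', 'HOMEKITCH', 'HK'],
    'canonical_category': ['Electronics', 'Electronics', 'Home & Kitchen', 'Home & Kitchen'],
})

orders = pd.DataFrame({'order_id': [1, 2, 3], 'vendor_code': ['ELECT', 'HK', 'UNKNOWN']})

# Left join keeps every order; unmapped codes surface as NULL so they can be logged and reviewed
standardized = orders.merge(code_map, on='vendor_code', how='left')
unmapped = standardized[standardized['canonical_category'].isna()]
print(standardized)
print(f"Unmapped codes: {unmapped['vendor_code'].tolist()}")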
Worked examples
Example 1: SQL — Standardize orders
-- Goal: date to timestamp (UTC midnight), amount to DECIMAL(12,2), paid to BOOLEAN
WITH raw AS (
  SELECT * FROM orders_raw
), cleaned AS (
  SELECT
    TRIM(order_id) AS order_id,
    -- Normalize date strings: try the known formats in order
    COALESCE(
      TO_TIMESTAMP(order_date, 'YYYY-MM-DD'),
      TO_TIMESTAMP(order_date, 'MM/DD/YYYY'),
      TO_TIMESTAMP(order_date, 'YYYYMMDD')
    ) AS order_ts_local,
    -- Normalize amount: strip currency symbols/letters, then fix separators.
    -- If both ',' and '.' appear, ',' is a thousands separator; otherwise treat ',' as the decimal symbol.
    CASE
      WHEN REGEXP_REPLACE(amount_txt, '[^0-9,.-]', '') LIKE '%,%'
       AND REGEXP_REPLACE(amount_txt, '[^0-9,.-]', '') LIKE '%.%'
        THEN REPLACE(REGEXP_REPLACE(amount_txt, '[^0-9,.-]', ''), ',', '')
      ELSE REPLACE(REGEXP_REPLACE(amount_txt, '[^0-9,.-]', ''), ',', '.')
    END AS amount_norm,
    UPPER(TRIM(paid_status)) AS paid_norm,
    UPPER(NULLIF(TRIM(currency), '')) AS currency_norm
  FROM raw
), standardized AS (
  SELECT
    order_id,
    -- Assume the source stores local midnight; convert to UTC if the source timezone is known.
    -- If the timezone is unknown, document the assumption and keep a naive timestamp.
    order_ts_local AT TIME ZONE 'UTC' AS order_ts,
    -- Currency normalization: EUR to USD at 1.10 (example static rate)
    CAST(
      CASE
        WHEN currency_norm = 'EUR' THEN CAST(amount_norm AS DECIMAL(14,4)) * 1.10
        WHEN currency_norm = 'USD' OR currency_norm IS NULL THEN CAST(amount_norm AS DECIMAL(14,4))
        ELSE CAST(amount_norm AS DECIMAL(14,4)) -- fallback; document this case
      END AS DECIMAL(12,2)
    ) AS amount_usd,
    CASE
      WHEN paid_norm IN ('Y','YES','TRUE','1') THEN TRUE
      WHEN paid_norm IN ('N','NO','FALSE','0') THEN FALSE
      ELSE NULL
    END AS paid
  FROM cleaned
)
SELECT * FROM standardized;
Note: For real pipelines, apply proper timezone conversion based on source system metadata. Also check how your engine handles parse failures: the COALESCE fallback above assumes a failed TO_TIMESTAMP returns NULL (as in Spark); engines that raise an error instead need error-tolerant variants such as TRY_TO_TIMESTAMP or TRY_CAST where available.
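If the source timezone is known, the conversion itself is straightforward in pandas; a small sketch, assuming 'Europe/Berlin' as an example source timezone:
import pandas as pd

# Example only: assume the source system recorded naive local Berlin times
local = pd.to_datetime(pd.Series(['2024-03-01 00:00:00', '2024-07-01 00:00:00']))
utc = local.dt.tz_localize('Europe/Berlin').dt.tz_convert('UTC')  # attach the source tz, then convert
print(utc)  # the UTC values shift by 1h in winter and 2h in summer (DST)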
Example 2: Python (pandas) — Clean booleans, dates, numbers
import pandas as pd
import numpy as np

df = pd.DataFrame({
    'user_id': [' 001 ', '002', '003 '],
    'signup': ['2024-03-01', '03/02/2024', '20240303'],
    'opt_in': ['Y', 'false', None],
    'balance': ['1,234.50', '€12,50', 'USD 10.00']
})

# Trim IDs and keep them as strings to preserve leading zeros
df['user_id'] = df['user_id'].str.strip()

# Parse mixed date formats to naive timestamps (document the timezone assumption).
# format='mixed' (pandas >= 2.0) parses each element individually; failures become NaT.
df['signup_ts'] = pd.to_datetime(df['signup'], errors='coerce', format='mixed')

# Normalize numbers: strip currency symbols/letters, then unify the decimal symbol
def norm_amount(x):
    if pd.isna(x):
        return np.nan
    s = ''.join(ch for ch in str(x) if ch.isdigit() or ch in {',', '.', '-'})
    if ',' in s and '.' in s:
        s = s.replace(',', '')   # '1,234.50' -> '1234.50' (comma is a thousands separator)
    else:
        s = s.replace(',', '.')  # '12,50'    -> '12.50'   (comma is the decimal symbol)
    try:
        return float(s)
    except ValueError:
        return np.nan

df['balance_num'] = df['balance'].apply(norm_amount)

# Map booleans; unknown or missing values stay as NA
true_set = {'y', 'yes', 'true', '1', 't'}
false_set = {'n', 'no', 'false', '0', 'f'}

def to_bool(x):
    if pd.isna(x):
        return np.nan
    v = str(x).strip().lower()
    if v in true_set:
        return True
    if v in false_set:
        return False
    return np.nan

df['opt_in_bool'] = df['opt_in'].apply(to_bool).astype('boolean')

# Final dtypes: user_id (object/str), signup_ts (datetime64[ns]), balance_num (float), opt_in_bool (nullable boolean)
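To make the "cast and enforce" step explicit, one option is a final astype pass over the columns produced above; a sketch continuing from this example's df (dtype names follow pandas conventions):
import pandas as pd

# Enforce the target schema explicitly so type drift fails loudly instead of silently
target_dtypes = {
    'user_id': 'string',       # nullable string dtype; preserves leading zeros
    'balance_num': 'float64',
    'opt_in_bool': 'boolean',  # nullable boolean dtype
}
df = df.astype(target_dtypes)

# signup_ts already comes out of pd.to_datetime as datetime64; assert it to catch regressions
assert pd.api.types.is_datetime64_any_dtype(df['signup_ts'])
print(df.dtypes)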
Example 3: Spark SQL — Parse mixed date/time and amounts
-- Assume the DataFrame is registered as a temp view named 'df'
-- (e.g., df.createOrReplaceTempView('df')) with columns: id, dt_str, amount_str, flag
SELECT
  TRIM(id) AS id,
  -- With spark.sql.ansi.enabled=false (the Spark 3.x default), a failed parse returns NULL,
  -- so COALESCE tries each known format in order
  COALESCE(
    to_timestamp(dt_str, 'yyyy-MM-dd HH:mm:ss'),
    to_timestamp(dt_str, 'MM/dd/yyyy HH:mm'),
    to_timestamp(dt_str, 'yyyyMMdd')
  ) AS event_ts,
  -- Strip symbols, then fix separators: if both ',' and '.' appear, ',' is a thousands separator
  CAST(
    CASE
      WHEN regexp_replace(amount_str, '[^0-9,.-]', '') LIKE '%,%'
       AND regexp_replace(amount_str, '[^0-9,.-]', '') LIKE '%.%'
        THEN replace(regexp_replace(amount_str, '[^0-9,.-]', ''), ',', '')
      ELSE replace(regexp_replace(amount_str, '[^0-9,.-]', ''), ',', '.')
    END
    AS DECIMAL(12,2)
  ) AS amount,
  CASE
    WHEN lower(flag) IN ('y','yes','true','1') THEN TRUE
    WHEN lower(flag) IN ('n','no','false','0') THEN FALSE
    ELSE NULL
  END AS flag_bool
FROM df;
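To run this SQL from PySpark, the DataFrame has to be registered as a temporary view first. A minimal sketch (the session setup and sample rows are illustrative; it assumes spark.sql.ansi.enabled is off, the Spark 3.x default, so failed parses return NULL):
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("standardize").getOrCreate()

rows = [("1", "2024-03-01 10:00:00", "1,234.50", "Y"),
        ("2", "03/02/2024 09:30", "12,50", "no"),
        ("3", "20240303", "USD 10.00", None)]
df = spark.createDataFrame(rows, ["id", "dt_str", "amount_str", "flag"])

# Spark SQL queries reference views, not Python variables, so expose the DataFrame by name
df.createOrReplaceTempView("df")

event_ts = spark.sql("""
    SELECT
      TRIM(id) AS id,
      COALESCE(
        to_timestamp(dt_str, 'yyyy-MM-dd HH:mm:ss'),
        to_timestamp(dt_str, 'MM/dd/yyyy HH:mm'),
        to_timestamp(dt_str, 'yyyyMMdd')
      ) AS event_ts
    FROM df
""")
event_ts.show(truncate=False)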
Step-by-step process you can follow
- Profile: count distinct formats, find non-numeric characters, list boolean variants, detect null tokens (see the profiling sketch after this list).
- Define target schema: pick exact data types and precision (e.g., DECIMAL(12,2) for money; TIMESTAMP UTC).
- Normalize: trim, unify case, remove thousands separators, standardize decimal symbol.
- Parse and cast: use explicit formats; handle errors as NULL and log failures.
- Validate: check record counts, null rates, min/max, and sample rows.
- Document: store mapping rules and assumptions with versioning.
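A small profiling sketch in pandas for the first step (the toy data stands in for a sample pulled from the real source):
import pandas as pd

# Toy raw extract; in practice, profile a sample of the actual source table
raw = pd.DataFrame({
    'paid': ['Y', 'yes', 'TRUE', '0', 'N/A', ''],
    'amount': ['1,234.50', '12,50', 'USD 10.00', '-', None, '99'],
})

# List boolean variants and candidate null tokens with their frequencies
print(raw['paid'].str.strip().str.upper().value_counts(dropna=False))

# Find non-numeric characters that would break a straight CAST
leftovers = raw['amount'].dropna().str.replace(r'[0-9.,\-\s]', '', regex=True)
print(leftovers[leftovers != ''].unique())   # e.g. ['USD'] -> currency text to strip before casting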
Validation rules and self-checks
- Row count preserved: source rows = target rows.
- Null spikes: investigate any sudden increase in NULL after casting.
- Range checks: dates within expected window; amounts not negative unless allowed.
- Boolean completeness: no unmapped values left as strings.
- Parsing failures: count rows where parsing returned NULL but source was not empty.
-- Example validation queries
SELECT COUNT(*) FROM src;
SELECT COUNT(*) FROM tgt;
SELECT COUNT(*) FROM tgt WHERE amount_usd IS NULL AND amount_txt IS NOT NULL;
SELECT MIN(order_ts), MAX(order_ts) FROM tgt;
SELECT paid, COUNT(*) FROM tgt GROUP BY paid;
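The same checks can be automated so a bad load fails fast. A sketch in pandas (column names follow the examples above; the toy frames and date window are illustrative):
import pandas as pd
import numpy as np

# Toy source and standardized target; in a pipeline these are the pre- and post-standardization tables
src = pd.DataFrame({'balance': ['1,234.50', '12,50', 'oops'],
                    'signup': ['2024-03-01', '2024-03-02', '2024-03-03']})
tgt = pd.DataFrame({'balance_num': [1234.50, 12.50, np.nan],
                    'signup_ts': pd.to_datetime(['2024-03-01', '2024-03-02', '2024-03-03'])})

# Row count preserved
assert len(src) == len(tgt), f"row count changed: {len(src)} -> {len(tgt)}"

# Parsing failures: target is NULL although the source value was not empty
failed = tgt['balance_num'].isna() & src['balance'].notna() & (src['balance'].str.strip() != '')
print(f"parse failures: {int(failed.sum())}")   # here: 1 ('oops'); log and investigate these rows

# Range check: dates inside the expected window
assert tgt['signup_ts'].between('2024-01-01', '2025-01-01').all(), "signup_ts out of range"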
Who this is for
- ETL Developers and Data Engineers building ingestion and transformation pipelines.
- Analytics Engineers ensuring clean semantic layers.
- Anyone preparing data for reliable analysis and ML features.
Prerequisites
- Basic SQL (SELECT, CASE, CAST, DATE/TIMESTAMP functions).
- Comfort with one scripting language (Python or Scala) if needed.
- Understanding of timezones and number formats.
- Familiarity with NULL semantics.
Learning path
- Before: Data profiling and source exploration.
- Now: Data type standardization (this lesson).
- Next: Schema enforcement, data contracts, and data quality checks.
Common mistakes (and fixes)
- Casting before cleaning. Fix: remove separators and normalize decimal symbol first.
- Using FLOAT for money. Fix: use DECIMAL with sufficient precision (the rounding demo after this list shows why).
- Timezone confusion. Fix: always set source timezone explicitly, convert to UTC.
- Dropping leading zeros from IDs. Fix: treat IDs as strings; do not cast to INT unless numerically meaningful.
- Incomplete boolean mapping. Fix: enumerate all known variants and log unknowns.
- Silent truncation/overflow. Fix: choose DECIMAL precision that covers maxima; test boundary values.
- Locale surprises. Fix: do not rely on implicit locale; normalize strings explicitly.
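A quick demonstration of the FLOAT-for-money pitfall mentioned above, and why fixed-point types (DECIMAL in SQL, Decimal in Python) are the fix:
from decimal import Decimal

# Binary floats cannot represent most decimal fractions exactly
print(0.10 + 0.20)                       # 0.30000000000000004
print(sum([0.01] * 100) == 1.0)          # False -- tiny errors accumulate across many rows

# Fixed-point decimals keep exact cents (the Python analogue of DECIMAL(12,2))
print(Decimal('0.10') + Decimal('0.20')) # 0.30
print(sum([Decimal('0.01')] * 100))      # 1.00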
Practical projects
- Build a standardization view that unifies transactions from two payment gateways to a single schema.
- Create a mapping table that converts vendor category codes to your company’s canonical categories and apply it in SQL.
- Write a reusable function/library to normalize numbers and booleans across pipelines; include unit tests.
Exercises
Do these in your SQL engine or notebook. The quick test is available to everyone; sign in if you want your progress saved.
Exercise 1: SQL — Standardize payments
Source table: raw_payments with columns:
- txn_id STRING (may contain spaces)
- txn_date STRING with mixed formats: '2024-03-01', '03/01/2024', '20240301'
- amount STRING like '1,234.50', '€12,50', 'USD 10.00'
- currency STRING: 'EUR', 'USD', or NULL
- success STRING: 'Y','N','true','false','1','0', or NULL
Task:
- Produce: txn_id (trimmed), txn_ts (TIMESTAMP at UTC midnight), amount_usd DECIMAL(12,2) using rate EUR->USD = 1.10, success BOOLEAN.
- When currency is NULL but amount contains a symbol, infer currency from the symbol.
- Set parsing failures to NULL and keep row counts unchanged.
Expected output: a query returning standardized columns with the specified types.
Exercise 2: Python — Clean users
Input DataFrame raw_users with columns:
- user_id: strings with leading/trailing spaces, some like '007'
- signup_date: mixed forms '01-03-2024', '2024/03/02', 1709337600 (epoch seconds)
- email_opt_in: 'Yes','no','1','0', None
- balance: '1 234,00', '1234.00', ''
Task:
- user_id trimmed (keep leading zeros).
- signup_ts as pandas datetime in UTC; treat epoch as seconds.
- email_opt_in_bool as boolean with unmapped values to NaN.
- balance_num as float with correct decimal point; blanks to NaN.
Expected output: DataFrame with dtypes string, datetime64[ns, UTC], boolean/nullable, and float.
Checklist before you move on:
- Did you preserve row counts?
- Are all booleans truly boolean?
- Are money amounts stored in fixed-point types (floats only for non-financial values)?
- Did you log or quantify parsing failures?
Mini challenge
You receive product_size: 'M', 'medium', 'Med', 'L', 'Large', '' and weight: '1,000.0 g', '2.5kg', '0.5 KG'. Standardize to size ENUM ('S','M','L') and weight_kg DECIMAL(8,3).
Hint
- Size: map variants to S/M/L; blanks to NULL.
- Weight: strip spaces, lowercase, parse unit (g/kg), convert to kg, unify decimal.
Sample solution
-- Pseudocode / SQL-ish
SELECT
  CASE
    WHEN LOWER(TRIM(product_size)) IN ('s','small') THEN 'S'
    WHEN LOWER(TRIM(product_size)) IN ('m','med','medium') THEN 'M'
    WHEN LOWER(TRIM(product_size)) IN ('l','large') THEN 'L'
    ELSE NULL
  END AS size,
  CAST(
    CASE
      -- check 'kg' before 'g': every value containing 'kg' also matches '%g%'
      WHEN LOWER(weight) LIKE '%kg%' THEN weight_num
      WHEN LOWER(weight) LIKE '%g%'  THEN weight_num / 1000.0
      ELSE NULL
    END AS DECIMAL(8,3)
  ) AS weight_kg
FROM (
  SELECT
    product_size,
    weight,
    -- Keep digits and separators only, then unify the decimal symbol:
    -- both ',' and '.' present => ',' is a thousands separator; otherwise ',' is the decimal symbol
    CAST(
      CASE
        WHEN REGEXP_REPLACE(weight, '[^0-9,.]', '') LIKE '%,%'
         AND REGEXP_REPLACE(weight, '[^0-9,.]', '') LIKE '%.%'
          THEN REPLACE(REGEXP_REPLACE(weight, '[^0-9,.]', ''), ',', '')
        ELSE REPLACE(REGEXP_REPLACE(weight, '[^0-9,.]', ''), ',', '.')
      END AS DOUBLE
    ) AS weight_num
  FROM src
) t;
Next steps
- Add schema enforcement so new loads cannot introduce type drift.
- Automate data quality checks for null spikes and parsing errors.
- Document your standardization rules and share with your team.