Why this matters
As an ETL Developer, you frequently join data from multiple sources. If one source stores a date as text and another as a timestamp, joins break, aggregations become incorrect, and downstream dashboards fail. Data type standardization ensures every column has a clear, consistent type and format across the pipeline.
- Real tasks you will do: unify date formats, convert numbers with different separators, map booleans, standardize currency and units, and enforce schema in landing/curated layers.
- Impact: fewer runtime errors, stable analytics, faster debugging, and easier collaboration between teams.
Concept explained simply
Data type standardization is the process of cleaning and converting values so that each field has one agreed type and representation. Examples (sketched in code right after this list):
- Dates: from mixed strings like "03/01/2024", "2024-03-01", "20240301" to a single timestamp column.
- Numbers: from "1,234.50" or "12,50" to DECIMAL(12,2).
- Booleans: from "Y/N", "true/false", 1/0 to a single BOOLEAN.
- Nulls: from "", "N/A", "-" to actual NULL.
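Expressed in plain Python (standard library only; the field names and values below are purely illustrative), those conversions look like this:
# A minimal sketch of the conversions listed above; values and field names are examples only
from datetime import datetime
from decimal import Decimal

raw = {'order_date': '03/01/2024', 'amount': '1,234.50', 'active': 'Y', 'middle_name': 'N/A'}

order_ts = datetime.strptime(raw['order_date'], '%m/%d/%Y')        # date string -> typed timestamp
amount = Decimal(raw['amount'].replace(',', ''))                   # '1,234.50' -> Decimal('1234.50')
active = {'Y': True, 'N': False}.get(raw['active'])                # 'Y'/'N' -> True/False (unknown -> None)
middle_name = None if raw['middle_name'] in ('', 'N/A', '-') else raw['middle_name']  # null token -> None

print(order_ts, amount, active, middle_name)   # 2024-03-01 00:00:00 1234.50 True None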
Mental model
- Profile: detect current types and odd values.
- Map: define the target types and value mappings.
- Normalize: clean strings (trim, remove separators, unify case, detect locale).
- Parse: convert text to typed values (timestamp, decimal, boolean).
- Cast and enforce: apply target data types and constraints.
- Validate: check counts, nulls, ranges, failed parses.
What to standardize
- Numeric: remove thousands separators, fix decimal symbol, cast to DECIMAL for money. Avoid FLOAT/DOUBLE for currency due to rounding.
- Dates/Times: parse multiple formats, set timezone explicitly, store as UTC timestamp, keep original timezone when needed.
- Booleans: map common variants ("Y/N", "Yes/No", "true/false", 1/0) to a single BOOLEAN.
- Categories/Codes: map legacy codes to canonical values. Keep a reference mapping table under version control (see the mapping sketch after this list).
- Nulls and Blanks: convert "", "N/A", "NULL", "-", 0-length strings to NULL when appropriate.
- Identifiers: trim whitespace, normalize case if case-insensitive, preserve leading zeros if meaningful.
- Arrays/JSON: parse to structured types where possible or store as validated JSON.
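For category/code mapping, here is a small sketch of applying a version-controlled mapping table with pandas; the table and column names (vendor_code, canonical_category) are made up for illustration:
import pandas as pd

# Reference mapping table (in practice, a versioned CSV or seed table; names are illustrative)
code_map = pd.DataFrame({
    'vendor_code': ['ELEC', 'ELECT', 'HOMEKITCH', 'HK'],
    'canonical_category': ['Electronics', 'Electronics', 'Home & Kitchen', 'Home & Kitchen'],
})

orders = pd.DataFrame({'order_id': [1, 2, 3], 'vendor_code': ['ELECT', 'HK', 'UNKNOWN']})

# Left join keeps every order; unmapped codes surface as NULL so they can be logged and reviewed
standardized = orders.merge(code_map, on='vendor_code', how='left')
unmapped = standardized[standardized['canonical_category'].isna()]
print(standardized)
print(f"Unmapped codes: {unmapped['vendor_code'].tolist()}")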
Worked examples
Example 1: SQL — Standardize orders
-- Goal: date to timestamp (UTC midnight), amount to DECIMAL(12,2), paid to BOOLEAN
WITH raw AS (
  SELECT * FROM orders_raw
), cleaned AS (
  SELECT
    TRIM(order_id) AS order_id,
    -- Normalize date strings: try the known formats in order
    COALESCE(
      TO_TIMESTAMP(order_date, 'YYYY-MM-DD'),
      TO_TIMESTAMP(order_date, 'MM/DD/YYYY'),
      TO_TIMESTAMP(order_date, 'YYYYMMDD')
    ) AS order_ts_local,
    -- Normalize amount: strip currency symbols/letters, then fix separators.
    -- If both ',' and '.' appear, ',' is a thousands separator; otherwise treat ',' as the decimal symbol.
    CASE
      WHEN REGEXP_REPLACE(amount_txt, '[^0-9,.-]', '') LIKE '%,%'
       AND REGEXP_REPLACE(amount_txt, '[^0-9,.-]', '') LIKE '%.%'
        THEN REPLACE(REGEXP_REPLACE(amount_txt, '[^0-9,.-]', ''), ',', '')
      ELSE REPLACE(REGEXP_REPLACE(amount_txt, '[^0-9,.-]', ''), ',', '.')
    END AS amount_norm,
    UPPER(TRIM(paid_status)) AS paid_norm,
    UPPER(NULLIF(TRIM(currency), '')) AS currency_norm
  FROM raw
), standardized AS (
  SELECT
    order_id,
    -- Assume the source stores local midnight; convert to UTC if the source timezone is known.
    -- If the timezone is unknown, document the assumption and keep a naive timestamp.
    order_ts_local AT TIME ZONE 'UTC' AS order_ts,
    -- Currency normalization: EUR to USD at 1.10 (example static rate)
    CAST(
      CASE
        WHEN currency_norm = 'EUR' THEN CAST(amount_norm AS DECIMAL(14,4)) * 1.10
        WHEN currency_norm = 'USD' OR currency_norm IS NULL THEN CAST(amount_norm AS DECIMAL(14,4))
        ELSE CAST(amount_norm AS DECIMAL(14,4)) -- fallback; document this case
      END AS DECIMAL(12,2)
    ) AS amount_usd,
    CASE
      WHEN paid_norm IN ('Y','YES','TRUE','1') THEN TRUE
      WHEN paid_norm IN ('N','NO','FALSE','0') THEN FALSE
      ELSE NULL
    END AS paid
  FROM cleaned
)
SELECT * FROM standardized;
Note: For real pipelines, apply proper timezone conversion based on source system metadata. Also check how your engine handles parse failures: the COALESCE fallback above assumes a failed TO_TIMESTAMP returns NULL (as in Spark); engines that raise an error instead need error-tolerant variants such as TRY_TO_TIMESTAMP or TRY_CAST where available.
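If the source timezone is known, the conversion itself is straightforward in pandas; a small sketch, assuming 'Europe/Berlin' as an example source timezone:
import pandas as pd

# Example only: assume the source system recorded naive local Berlin times
local = pd.to_datetime(pd.Series(['2024-03-01 00:00:00', '2024-07-01 00:00:00']))
utc = local.dt.tz_localize('Europe/Berlin').dt.tz_convert('UTC')  # attach the source tz, then convert
print(utc)  # the UTC values shift by 1h in winter and 2h in summer (DST)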
Example 2: Python (pandas) — Clean booleans, dates, numbers
import pandas as pd
import numpy as np

df = pd.DataFrame({
    'user_id': [' 001 ', '002', '003 '],
    'signup': ['2024-03-01', '03/02/2024', '20240303'],
    'opt_in': ['Y', 'false', None],
    'balance': ['1,234.50', '€12,50', 'USD 10.00']
})

# Trim IDs and keep them as strings to preserve leading zeros
df['user_id'] = df['user_id'].str.strip()

# Parse mixed date formats to naive timestamps (document the timezone assumption).
# format='mixed' (pandas >= 2.0) parses each element individually; failures become NaT.
df['signup_ts'] = pd.to_datetime(df['signup'], errors='coerce', format='mixed')

# Normalize numbers: strip currency symbols/letters, then unify the decimal symbol
def norm_amount(x):
    if pd.isna(x):
        return np.nan
    s = ''.join(ch for ch in str(x) if ch.isdigit() or ch in {',', '.', '-'})
    if ',' in s and '.' in s:
        s = s.replace(',', '')   # '1,234.50' -> '1234.50' (comma is a thousands separator)
    else:
        s = s.replace(',', '.')  # '12,50'    -> '12.50'   (comma is the decimal symbol)
    try:
        return float(s)
    except ValueError:
        return np.nan

df['balance_num'] = df['balance'].apply(norm_amount)

# Map booleans; unknown or missing values stay as NA
true_set = {'y', 'yes', 'true', '1', 't'}
false_set = {'n', 'no', 'false', '0', 'f'}

def to_bool(x):
    if pd.isna(x):
        return np.nan
    v = str(x).strip().lower()
    if v in true_set:
        return True
    if v in false_set:
        return False
    return np.nan

df['opt_in_bool'] = df['opt_in'].apply(to_bool).astype('boolean')

# Final dtypes: user_id (object/str), signup_ts (datetime64[ns]), balance_num (float), opt_in_bool (nullable boolean)
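To make the "cast and enforce" step explicit, one option is a final astype pass over the columns produced above; a sketch continuing from this example's df (dtype names follow pandas conventions):
import pandas as pd

# Enforce the target schema explicitly so type drift fails loudly instead of silently
target_dtypes = {
    'user_id': 'string',       # nullable string dtype; preserves leading zeros
    'balance_num': 'float64',
    'opt_in_bool': 'boolean',  # nullable boolean dtype
}
df = df.astype(target_dtypes)

# signup_ts already comes out of pd.to_datetime as datetime64; assert it to catch regressions
assert pd.api.types.is_datetime64_any_dtype(df['signup_ts'])
print(df.dtypes)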
Example 3: Spark SQL — Parse mixed date/time and amounts
-- Assume the DataFrame is registered as a temp view named 'df'
-- (e.g., df.createOrReplaceTempView('df')) with columns: id, dt_str, amount_str, flag
SELECT
  TRIM(id) AS id,
  -- With spark.sql.ansi.enabled=false (the Spark 3.x default), a failed parse returns NULL,
  -- so COALESCE tries each known format in order
  COALESCE(
    to_timestamp(dt_str, 'yyyy-MM-dd HH:mm:ss'),
    to_timestamp(dt_str, 'MM/dd/yyyy HH:mm'),
    to_timestamp(dt_str, 'yyyyMMdd')
  ) AS event_ts,
  -- Strip symbols, then fix separators: if both ',' and '.' appear, ',' is a thousands separator
  CAST(
    CASE
      WHEN regexp_replace(amount_str, '[^0-9,.-]', '') LIKE '%,%'
       AND regexp_replace(amount_str, '[^0-9,.-]', '') LIKE '%.%'
        THEN replace(regexp_replace(amount_str, '[^0-9,.-]', ''), ',', '')
      ELSE replace(regexp_replace(amount_str, '[^0-9,.-]', ''), ',', '.')
    END
    AS DECIMAL(12,2)
  ) AS amount,
  CASE
    WHEN lower(flag) IN ('y','yes','true','1') THEN TRUE
    WHEN lower(flag) IN ('n','no','false','0') THEN FALSE
    ELSE NULL
  END AS flag_bool
FROM df;
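To run this SQL from PySpark, the DataFrame has to be registered as a temporary view first. A minimal sketch (the session setup and sample rows are illustrative; it assumes spark.sql.ansi.enabled is off, the Spark 3.x default, so failed parses return NULL):
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("standardize").getOrCreate()

rows = [("1", "2024-03-01 10:00:00", "1,234.50", "Y"),
        ("2", "03/02/2024 09:30", "12,50", "no"),
        ("3", "20240303", "USD 10.00", None)]
df = spark.createDataFrame(rows, ["id", "dt_str", "amount_str", "flag"])

# Spark SQL queries reference views, not Python variables, so expose the DataFrame by name
df.createOrReplaceTempView("df")

event_ts = spark.sql("""
    SELECT
      TRIM(id) AS id,
      COALESCE(
        to_timestamp(dt_str, 'yyyy-MM-dd HH:mm:ss'),
        to_timestamp(dt_str, 'MM/dd/yyyy HH:mm'),
        to_timestamp(dt_str, 'yyyyMMdd')
      ) AS event_ts
    FROM df
""")
event_ts.show(truncate=False)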
Step-by-step process you can follow
- Profile: count distinct formats, find non-numeric characters, list boolean variants, detect null tokens (see the profiling sketch after this list).
- Define target schema: pick exact data types and precision (e.g., DECIMAL(12,2) for money; TIMESTAMP UTC).
- Normalize: trim, unify case, remove thousands separators, standardize decimal symbol.
- Parse and cast: use explicit formats; handle errors as NULL and log failures.
- Validate: check record counts, null rates, min/max, and sample rows.
- Document: store mapping rules and assumptions with versioning.
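A small profiling sketch in pandas for the first step (the toy data stands in for a sample pulled from the real source):
import pandas as pd

# Toy raw extract; in practice, profile a sample of the actual source table
raw = pd.DataFrame({
    'paid': ['Y', 'yes', 'TRUE', '0', 'N/A', ''],
    'amount': ['1,234.50', '12,50', 'USD 10.00', '-', None, '99'],
})

# List boolean variants and candidate null tokens with their frequencies
print(raw['paid'].str.strip().str.upper().value_counts(dropna=False))

# Find non-numeric characters that would break a straight CAST
leftovers = raw['amount'].dropna().str.replace(r'[0-9.,\-\s]', '', regex=True)
print(leftovers[leftovers != ''].unique())   # e.g. ['USD'] -> currency text to strip before casting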
Validation rules and self-checks
- Row count preserved: source rows = target rows.
- Null spikes: investigate any sudden increase in NULL after casting.
- Range checks: dates within expected window; amounts not negative unless allowed.
- Boolean completeness: no unmapped values left as strings.
- Parsing failures: count rows where parsing returned NULL but source was not empty.
-- Example validation queries
SELECT COUNT(*) FROM src;
SELECT COUNT(*) FROM tgt;
SELECT COUNT(*) FROM tgt WHERE amount_usd IS NULL AND amount_txt IS NOT NULL;
SELECT MIN(order_ts), MAX(order_ts) FROM tgt;
SELECT paid, COUNT(*) FROM tgt GROUP BY paid;
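The same checks can be automated so a bad load fails fast. A sketch in pandas (column names follow the examples above; the toy frames and date window are illustrative):
import pandas as pd
import numpy as np

# Toy source and standardized target; in a pipeline these are the pre- and post-standardization tables
src = pd.DataFrame({'balance': ['1,234.50', '12,50', 'oops'],
                    'signup': ['2024-03-01', '2024-03-02', '2024-03-03']})
tgt = pd.DataFrame({'balance_num': [1234.50, 12.50, np.nan],
                    'signup_ts': pd.to_datetime(['2024-03-01', '2024-03-02', '2024-03-03'])})

# Row count preserved
assert len(src) == len(tgt), f"row count changed: {len(src)} -> {len(tgt)}"

# Parsing failures: target is NULL although the source value was not empty
failed = tgt['balance_num'].isna() & src['balance'].notna() & (src['balance'].str.strip() != '')
print(f"parse failures: {int(failed.sum())}")   # here: 1 ('oops'); log and investigate these rows

# Range check: dates inside the expected window
assert tgt['signup_ts'].between('2024-01-01', '2025-01-01').all(), "signup_ts out of range"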
Who this is for
- ETL Developers and Data Engineers building ingestion and transformation pipelines.
- Analytics Engineers ensuring clean semantic layers.
- Anyone preparing data for reliable analysis and ML features.
Prerequisites
- Basic SQL (SELECT, CASE, CAST, DATE/TIMESTAMP functions).
- Comfort with one scripting language (Python or Scala) if needed.
- Understanding of timezones and number formats.
- Familiarity with NULL semantics.
Learning path
- Before: Data profiling and source exploration.
- Now: Data type standardization (this lesson).
- Next: Schema enforcement, data contracts, and data quality checks.
Common mistakes (and fixes)
- Casting before cleaning. Fix: remove separators and normalize decimal symbol first.
- Using FLOAT for money. Fix: use DECIMAL with sufficient precision (the rounding demo after this list shows why).
- Timezone confusion. Fix: always set source timezone explicitly, convert to UTC.
- Dropping leading zeros from IDs. Fix: treat IDs as strings; do not cast to INT unless numerically meaningful.
- Incomplete boolean mapping. Fix: enumerate all known variants and log unknowns.
- Silent truncation/overflow. Fix: choose DECIMAL precision that covers maxima; test boundary values.
- Locale surprises. Fix: do not rely on implicit locale; normalize strings explicitly.
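A quick demonstration of the FLOAT-for-money pitfall mentioned above, and why fixed-point types (DECIMAL in SQL, Decimal in Python) are the fix:
from decimal import Decimal

# Binary floats cannot represent most decimal fractions exactly
print(0.10 + 0.20)                       # 0.30000000000000004
print(sum([0.01] * 100) == 1.0)          # False -- tiny errors accumulate across many rows

# Fixed-point decimals keep exact cents (the Python analogue of DECIMAL(12,2))
print(Decimal('0.10') + Decimal('0.20')) # 0.30
print(sum([Decimal('0.01')] * 100))      # 1.00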
Practical projects
- Build a standardization view that unifies transactions from two payment gateways to a single schema.
- Create a mapping table that converts vendor category codes to your company’s canonical categories and apply it in SQL.
- Write a reusable function/library to normalize numbers and booleans across pipelines; include unit tests.
Exercises
Do these in your SQL engine or notebook. The quick test is available to everyone; sign in if you want your progress saved.
Exercise 1: SQL — Standardize payments
Source table: raw_payments with columns:
- txn_id STRING (may contain spaces)
- txn_date STRING with mixed formats: '2024-03-01', '03/01/2024', '20240301'
- amount STRING like '1,234.50', '€12,50', 'USD 10.00'
- currency STRING: 'EUR', 'USD', or NULL
- success STRING: 'Y','N','true','false','1','0', or NULL
Task:
- Produce: txn_id (trimmed), txn_ts (TIMESTAMP at UTC midnight), amount_usd DECIMAL(12,2) using rate EUR->USD = 1.10, success BOOLEAN.
- When currency is NULL but amount contains a symbol, infer currency from the symbol.
- Set parsing failures to NULL and keep row counts unchanged.
Expected output: a query returning standardized columns with the specified types.
Exercise 2: Python — Clean users
Input DataFrame raw_users with columns:
- user_id: strings with leading/trailing spaces, some like '007'
- signup_date: mixed forms '01-03-2024', '2024/03/02', 1709337600 (epoch seconds)
- email_opt_in: 'Yes','no','1','0', None
- balance: '1 234,00', '1234.00', ''
Task:
- user_id trimmed (keep leading zeros).
- signup_ts as pandas datetime in UTC; treat epoch as seconds.
- email_opt_in_bool as boolean with unmapped values to NaN.
- balance_num as float with correct decimal point; blanks to NaN.
Expected output: DataFrame with dtypes string, datetime64[ns, UTC], boolean/nullable, and float.
Checklist before you move on:
- Did you preserve row counts?
- Are all booleans truly boolean?
- Are money amounts stored in fixed-point types (floats only for non-financial values)?
- Did you log or quantify parsing failures?
Mini challenge
You receive product_size: 'M', 'medium', 'Med', 'L', 'Large', '' and weight: '1,000.0 g', '2.5kg', '0.5 KG'. Standardize to size ENUM ('S','M','L') and weight_kg DECIMAL(8,3).
Hint
- Size: map variants to S/M/L; blanks to NULL.
- Weight: strip spaces, lowercase, parse unit (g/kg), convert to kg, unify decimal.
Sample solution
-- Pseudocode / SQL-ish
SELECT
  CASE
    WHEN LOWER(TRIM(product_size)) IN ('s','small') THEN 'S'
    WHEN LOWER(TRIM(product_size)) IN ('m','med','medium') THEN 'M'
    WHEN LOWER(TRIM(product_size)) IN ('l','large') THEN 'L'
    ELSE NULL
  END AS size,
  CAST(
    CASE
      -- check 'kg' before 'g': every value containing 'kg' also matches '%g%'
      WHEN LOWER(weight) LIKE '%kg%' THEN weight_num
      WHEN LOWER(weight) LIKE '%g%'  THEN weight_num / 1000.0
      ELSE NULL
    END AS DECIMAL(8,3)
  ) AS weight_kg
FROM (
  SELECT
    product_size,
    weight,
    -- Keep digits and separators only, then unify the decimal symbol:
    -- both ',' and '.' present => ',' is a thousands separator; otherwise ',' is the decimal symbol
    CAST(
      CASE
        WHEN REGEXP_REPLACE(weight, '[^0-9,.]', '') LIKE '%,%'
         AND REGEXP_REPLACE(weight, '[^0-9,.]', '') LIKE '%.%'
          THEN REPLACE(REGEXP_REPLACE(weight, '[^0-9,.]', ''), ',', '')
        ELSE REPLACE(REGEXP_REPLACE(weight, '[^0-9,.]', ''), ',', '.')
      END AS DOUBLE
    ) AS weight_num
  FROM src
) t;
Next steps
- Add schema enforcement so new loads cannot introduce type drift.
- Automate data quality checks for null spikes and parsing errors.
- Document your standardization rules and share with your team.