Why this matters
ETL Developers constantly clean messy text and normalize dates so data can be joined, aggregated, and trusted. Common on-the-job tasks include:
- Standardizing names, emails, phone numbers, and free-text fields before loading into a warehouse.
- Parsing dates in many formats (e.g., 03/10/1979 vs 1979-10-03), converting time zones, and aligning to day/week/month.
- Building calendar dimensions and timestamp fields for accurate reporting and downstream modeling.
Who this is for
- Aspiring and junior ETL Developers who work with CSVs, logs, or API payloads.
- Analysts and data engineers who need reliable text/date cleaning steps in pipelines.
Prerequisites
- Basic SQL or a scripting language (e.g., Python) to express transformations.
- Familiarity with data types (string, date, timestamp).
Concept explained simply
Text transformations reshape strings: trim spaces, change case, split or join parts, and find/replace patterns. Date transformations turn strings into valid dates/timestamps, shift them (add days or months), convert time zones, and truncate to periods (day/week/month).
Mental model
Think of text and dates as clay. Your job is to mold them into standardized shapes so every downstream step recognizes and combines them correctly. If you always normalize inputs to one clean format, joins and aggregations become predictable and repeatable.
Core operations you will use often
- Text: TRIM/strip, UPPER/LOWER/Title Case, SUBSTRING/SPLIT, CONCAT, REPLACE/REGEXP_REPLACE, normalization (remove diacritics), safe casting.
- Date/Time: Parsing with explicit format strings, validating with try/strict parsing, DATEADD/ADD_MONTHS, DATEDIFF, DATE_TRUNC, time zone conversion (to/from UTC), handling nulls and invalids.
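The text operations above can be sketched in Python with the standard library. The function names here (`clean_text`, `remove_diacritics`) are illustrative, not from any specific framework:

```python
import re
import unicodedata

def clean_text(s: str) -> str:
    """Trim, collapse internal whitespace, and title-case a raw string."""
    s = re.sub(r"\s+", " ", s.strip())
    return s.title()

def remove_diacritics(s: str) -> str:
    """Decompose to NFKD and drop combining marks (e.g. 'é' -> 'e')."""
    decomposed = unicodedata.normalize("NFKD", s)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))

print(clean_text("  jOAN   SMITH "))   # -> Joan Smith
print(remove_diacritics("José Peña"))  # -> Jose Pena
```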
Safety tips
- Always trim before parsing or comparing strings/dates.
- Parse using explicit formats; avoid blind casting.
- Use time zone–aware functions instead of manual hour offsets (DST can bite!).
- Guard with try_parse + fallback formats; capture invalids to a quarantine table/flag.
Worked examples
1) Clean customer names and phones
Goal: Standardize names to Title Case, strip extra spaces, and normalize phones to E.164-like digits with leading +country code when known.
Input
name: " jOAN SMITH "
phone: "(415) 555-0123"
Transform
- Name: trim, collapse multiple spaces, title case
- Phone: remove non-digits, prepend +1 if 10-digit US
Output
name_clean: "Joan Smith"
phone_e164: "+14155550123"
SQL-style sketch
-- Name
INITCAP(REGEXP_REPLACE(TRIM(name), '\s+', ' ')) AS name_clean
-- Phone (US example; adjust for your country rules)
CASE
  WHEN REGEXP_REPLACE(phone, '[^0-9]', '') ~ '^[0-9]{10}$'  THEN '+1' || REGEXP_REPLACE(phone, '[^0-9]', '')
  WHEN REGEXP_REPLACE(phone, '[^0-9]', '') ~ '^1[0-9]{10}$' THEN '+'  || REGEXP_REPLACE(phone, '[^0-9]', '')
  ELSE '+' || REGEXP_REPLACE(phone, '[^0-9]', '')  -- fallback: keep digits; consider flagging for review
END AS phone_e164
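The same cleanup can be expressed in Python. This sketch assumes the US phone rules from the example; adjust the country logic for other locales:

```python
import re

def clean_name(name: str) -> str:
    """Trim, collapse spaces, and title-case a name."""
    return re.sub(r"\s+", " ", name.strip()).title()

def phone_to_e164(phone: str) -> str:
    """Normalize a phone string to E.164-like digits with a leading +."""
    digits = re.sub(r"[^0-9]", "", phone)
    if re.fullmatch(r"[0-9]{10}", digits):    # bare US 10-digit number
        return "+1" + digits
    if re.fullmatch(r"1[0-9]{10}", digits):   # US number with leading 1
        return "+" + digits
    return "+" + digits                       # fallback: keep digits, flag for review

print(clean_name(" jOAN SMITH "))       # -> Joan Smith
print(phone_to_e164("(415) 555-0123"))  # -> +14155550123
```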
2) Parse messy dates into ISO
Goal: Convert multiple input patterns to ISO date (YYYY-MM-DD), with a safe fallback and invalid flag.
Input date_str examples:
- "03/10/1979" (assume MM/DD/YYYY)
- "10-03-1979" (assume DD-MM-YYYY)
- "1979-03-10" (ISO)
Output
- dob_iso: 1979-03-10
- dob_is_valid: true/false
SQL-style sketch
-- Try multiple formats with fallback. Note: plain TO_DATE raises an error on a
-- non-matching string in many engines; use your engine's safe variant
-- (e.g., TRY_TO_DATE in Snowflake, TRY_CAST/TRY_CONVERT elsewhere).
COALESCE(
  TRY_TO_DATE(date_str, 'YYYY-MM-DD'),
  TRY_TO_DATE(date_str, 'MM/DD/YYYY'),
  TRY_TO_DATE(date_str, 'DD-MM-YYYY')
) AS dob_iso,
-- Validity flag
CASE WHEN
  TRY_TO_DATE(date_str, 'YYYY-MM-DD') IS NOT NULL OR
  TRY_TO_DATE(date_str, 'MM/DD/YYYY') IS NOT NULL OR
  TRY_TO_DATE(date_str, 'DD-MM-YYYY') IS NOT NULL
THEN TRUE ELSE FALSE END AS dob_is_valid
3) Convert local timestamp to UTC and truncate to week
Goal: Turn a local timestamp with offset into UTC, then get week start (Monday).
Input
local_ts: "2024-07-01 09:00:00-07:00"
Output
utc_ts: "2024-07-01 16:00:00+00:00"
week_start_date: 2024-07-01
SQL-style sketch
-- If your engine supports TIMESTAMP WITH TIME ZONE, cast directly; otherwise parse offset explicitly.
-- Convert to UTC (engine-specific):
(local_ts AT TIME ZONE 'UTC') AS utc_ts
-- Week start Monday (common approach):
DATE_TRUNC('week', utc_ts)::DATE AS week_start_date
Note: Exact syntax varies. The key idea is: use time zone–aware conversion, then date_trunc.
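In Python the same steps need only the standard library: `fromisoformat` parses the offset shown above, `astimezone` does the time zone-aware conversion, and subtracting `weekday()` days truncates to the Monday week start:

```python
from datetime import datetime, timedelta, timezone

def to_utc_week_start(local_ts: str):
    """Parse an offset-aware timestamp, convert to UTC, derive Monday week start."""
    ts = datetime.fromisoformat(local_ts)       # keeps the -07:00 offset
    utc_ts = ts.astimezone(timezone.utc)        # tz-aware conversion, no manual math
    week_start = (utc_ts - timedelta(days=utc_ts.weekday())).date()  # Monday
    return utc_ts, week_start

utc_ts, week_start = to_utc_week_start("2024-07-01 09:00:00-07:00")
print(utc_ts)      # 2024-07-01 16:00:00+00:00
print(week_start)  # 2024-07-01
```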
Exercises
Do these to solidify the concepts, then take the quick test below.
Exercise 1 — Clean names, phones, and DOB
Input rows:
1) name=" jOAN SMITH ", phone="(415) 555-0123", dob="03/10/1979"
2) name="ALAN turing", phone="415.555.0456", dob="1979-03-10"
3) name="m. o'NEIL ", phone="+1 212 555 8899", dob="10-03-1979"
Tasks:
- Normalize name to Title Case with single spaces.
- Normalize phone to E.164-like digits beginning with +.
- Parse DOB to ISO date using multiple formats. Add validity flag.
Exercise 2 — Split product code
Input product_code examples:
ACME-TSHIRT-RED-L-2024Q1
ACME-MUG-BLACK-2023Q4
ACME-TSHIRT-BLUE-M-2024Q2
Tasks:
- Extract brand, item, color, size (optional), year, quarter.
- Handle the version without size (MUG line).
Exercise 3 — Time zone to UTC
Input rows:
1) ts_local="2024-03-10 01:30:00-08:00"
2) ts_local="2024-03-10 03:30:00-07:00"
(Consider the DST shift around March.)
Tasks:
- Convert each to UTC with a time zone–aware method.
- Return UTC timestamp and date.
Self-check checklist
- All outputs are trimmed and consistently cased.
- Dates parse correctly across the provided formats.
- UTC conversion matches expectation even across DST boundaries.
Common mistakes and how to self-check
- Blind casting strings to dates. Fix: use explicit formats and try/fallback parsing; flag invalids.
- Manual hour math for time zones. Fix: use time zone–aware functions (DST safe).
- Not trimming before comparing text. Fix: TRIM and collapse spaces; standardize case before joins.
- Regex that over-matches. Fix: anchor patterns and test on samples; keep examples of allowed/blocked cases.
- Ambiguous day/month interpretation. Fix: define business rule per source; document assumptions.
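The over-matching point above is easy to demonstrate: an unanchored pattern accepts any string that merely contains a valid-looking part, while an anchored one requires the whole string to match:

```python
import re

loose = re.compile(r"[0-9]{10}")     # over-matches: finds 10 digits anywhere
strict = re.compile(r"^[0-9]{10}$")  # anchored: whole string must be 10 digits

sample = "call 4155550123 ext 99"
print(bool(loose.search(sample)))   # True  (over-match)
print(bool(strict.search(sample)))  # False (rejected as intended)
```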
Quick self-audit
- Re-run transformations twice; results should not change (idempotent).
- Sample 20 random rows; manually spot-check at least 5 edge cases.
- Compare row counts before/after parsing; investigate sudden drops or spikes.
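The idempotence check above fits in a few lines: run the transform on its own output and assert nothing changes. `normalize` here is a stand-in for your pipeline step:

```python
import re

def normalize(s: str) -> str:
    """Example transform: trim, collapse spaces, lowercase."""
    return re.sub(r"\s+", " ", s.strip()).lower()

raw = "  Mixed   CASE  text "
once = normalize(raw)
twice = normalize(once)
assert once == twice, "transform is not idempotent"
print(once)  # mixed case text
```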
Practical projects
Customer contact normalizer
- Inputs: name, email, phone from 3 vendors.
- Steps: trim/case standardization, email lowercasing and basic format check, phone E.164-like formatting by locale.
- Deliverables: clean table + invalids table with reasons.
Calendar dimension builder
- Create a date table from min to max event date.
- Add columns: week start (Mon), week of year, fiscal period, is_weekend, is_month_end.
- Use it to aggregate a fact table by week and month.
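A stdlib-only sketch of the calendar-dimension idea, one row per date with the derived columns listed above (fiscal period omitted, since it is organization-specific):

```python
from datetime import date, timedelta
import calendar

def build_calendar(start: date, end: date) -> list[dict]:
    """One row per date from start to end inclusive, with derived columns."""
    rows = []
    d = start
    while d <= end:
        rows.append({
            "date": d,
            "week_start": d - timedelta(days=d.weekday()),  # Monday
            "week_of_year": d.isocalendar()[1],
            "is_weekend": d.weekday() >= 5,
            "is_month_end": d.day == calendar.monthrange(d.year, d.month)[1],
        })
        d += timedelta(days=1)
    return rows

cal = build_calendar(date(2024, 6, 28), date(2024, 7, 1))
print(cal[-1]["week_start"])  # 2024-07-01 (July 1, 2024 is a Monday)
```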
Web log timestamp normalizer
- Parse logs with mixed offsets (e.g., -07:00, +01:00).
- Convert to UTC, derive date, hour, and session window start.
- Verify counts per hour remain consistent after conversion.
Learning path
- Before: basic data types, simple SELECTs/filters.
- Now: text/date parsing, normalization, and time zone conversion.
- Next: joins on standardized keys, slowly changing dimensions, and incremental loads.
Next steps
- Complete the Exercises above, then take the Quick Test below.
- Add parsing error reporting: count invalid dates per source.
- Document your chosen date format assumptions and share with teammates.
Mini challenge
Given event_time strings: "2024/12/31 23:59:59 -0500", "31-12-2024 21:00:00 +0100" — produce:
- utc_ts
- event_date (UTC)
- week_start_date (UTC, Monday)
Hint
Use explicit parsing per pattern, coalesce, then convert with time zone–aware functions before truncation.