Why this matters
ETL Developers constantly clean messy text and normalize dates so data can be joined, aggregated, and trusted. Common on-the-job tasks include:
- Standardizing names, emails, phone numbers, and free-text fields before loading into a warehouse.
- Parsing dates in many formats (e.g., 03/10/1979 vs 1979-10-03), converting time zones, and aligning to day/week/month.
- Building calendar dimensions and timestamp fields for accurate reporting and downstream modeling.
Who this is for
- Aspiring and junior ETL Developers who work with CSVs, logs, or API payloads.
- Analysts and data engineers who need reliable text/date cleaning steps in pipelines.
Prerequisites
- Basic SQL or a scripting language (e.g., Python) to express transformations.
- Familiarity with data types (string, date, timestamp).
Concept explained simply
Text transformations reshape strings: trim spaces, change case, split or join parts, and find/replace patterns. Date transformations turn strings into valid dates/timestamps, shift them (add days or months), convert time zones, and truncate to periods (day/week/month).
Mental model
Think of text and dates as clay. Your job is to mold them into standardized shapes so every downstream step recognizes and combines them correctly. If you always normalize inputs to one clean format, joins and aggregations become predictable and repeatable.
Core operations you will use often
- Text: TRIM/strip, UPPER/LOWER/Title Case, SUBSTRING/SPLIT, CONCAT, REPLACE/REGEXP_REPLACE, normalization (remove diacritics), safe casting.
- Date/Time: Parsing with explicit format strings, validating with try/strict parsing, DATEADD/ADD_MONTHS, DATEDIFF, DATE_TRUNC, time zone conversion (to/from UTC), handling nulls and invalids.
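The text operations above can be sketched in Python with the standard library. The function names here (`clean_text`, `remove_diacritics`) are illustrative, not from any specific framework:

```python
import re
import unicodedata

def clean_text(s: str) -> str:
    """Trim, collapse internal whitespace, and title-case a raw string."""
    s = re.sub(r"\s+", " ", s.strip())
    return s.title()

def remove_diacritics(s: str) -> str:
    """Decompose to NFKD and drop combining marks (e.g. 'é' -> 'e')."""
    decomposed = unicodedata.normalize("NFKD", s)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))

print(clean_text("  jOAN   SMITH "))   # -> Joan Smith
print(remove_diacritics("José Peña"))  # -> Jose Pena
```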
Safety tips
- Always trim before parsing or comparing strings/dates.
- Parse using explicit formats; avoid blind casting.
- Use time zone–aware functions instead of manual hour offsets (DST can bite!).
- Guard with try_parse + fallback formats; capture invalids to a quarantine table/flag.
Worked examples
1) Clean customer names and phones
Goal: Standardize names to Title Case, strip extra spaces, and normalize phones to E.164-like digits with leading +country code when known.
Input
name: " jOAN SMITH "
phone: "(415) 555-0123"
Transform
- Name: trim, collapse multiple spaces, title case
- Phone: remove non-digits, prepend +1 if 10-digit US
Output
name_clean: "Joan Smith"
phone_e164: "+14155550123"
SQL-style sketch
-- Name
INITCAP(REGEXP_REPLACE(TRIM(name), '\s+', ' ')) AS name_clean
-- Phone (US example; adjust for your country rules)
CASE
  WHEN REGEXP_REPLACE(phone, '[^0-9]', '') ~ '^[0-9]{10}$'  THEN '+1' || REGEXP_REPLACE(phone, '[^0-9]', '')
  WHEN REGEXP_REPLACE(phone, '[^0-9]', '') ~ '^1[0-9]{10}$' THEN '+'  || REGEXP_REPLACE(phone, '[^0-9]', '')
  ELSE '+' || REGEXP_REPLACE(phone, '[^0-9]', '')  -- fallback: keep digits; consider flagging for review
END AS phone_e164
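The same cleanup can be expressed in Python. This sketch assumes the US phone rules from the example; adjust the country logic for other locales:

```python
import re

def clean_name(name: str) -> str:
    """Trim, collapse spaces, and title-case a name."""
    return re.sub(r"\s+", " ", name.strip()).title()

def phone_to_e164(phone: str) -> str:
    """Normalize a phone string to E.164-like digits with a leading +."""
    digits = re.sub(r"[^0-9]", "", phone)
    if re.fullmatch(r"[0-9]{10}", digits):    # bare US 10-digit number
        return "+1" + digits
    if re.fullmatch(r"1[0-9]{10}", digits):   # US number with leading 1
        return "+" + digits
    return "+" + digits                       # fallback: keep digits, flag for review

print(clean_name(" jOAN SMITH "))       # -> Joan Smith
print(phone_to_e164("(415) 555-0123"))  # -> +14155550123
```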
2) Parse messy dates into ISO
Goal: Convert multiple input patterns to ISO date (YYYY-MM-DD), with a safe fallback and invalid flag.
Input date_str examples:
- "03/10/1979" (assume MM/DD/YYYY)
- "10-03-1979" (assume DD-MM-YYYY)
- "1979-03-10" (ISO)
Output
- dob_iso: 1979-03-10
- dob_is_valid: true/false
SQL-style sketch
-- Try multiple formats with fallback. Note: plain TO_DATE raises an error on a
-- non-matching string in many engines; use your engine's safe variant
-- (e.g., TRY_TO_DATE in Snowflake, TRY_CAST/TRY_CONVERT elsewhere).
COALESCE(
  TRY_TO_DATE(date_str, 'YYYY-MM-DD'),
  TRY_TO_DATE(date_str, 'MM/DD/YYYY'),
  TRY_TO_DATE(date_str, 'DD-MM-YYYY')
) AS dob_iso,
-- Validity flag
CASE WHEN
  TRY_TO_DATE(date_str, 'YYYY-MM-DD') IS NOT NULL OR
  TRY_TO_DATE(date_str, 'MM/DD/YYYY') IS NOT NULL OR
  TRY_TO_DATE(date_str, 'DD-MM-YYYY') IS NOT NULL
THEN TRUE ELSE FALSE END AS dob_is_valid
3) Convert local timestamp to UTC and truncate to week
Goal: Turn a local timestamp with offset into UTC, then get week start (Monday).
Input
local_ts: "2024-07-01 09:00:00-07:00"
Output
utc_ts: "2024-07-01 16:00:00+00:00"
week_start_date: 2024-07-01
SQL-style sketch
-- If your engine supports TIMESTAMP WITH TIME ZONE, cast directly; otherwise parse offset explicitly.
-- Convert to UTC (engine-specific):
(local_ts AT TIME ZONE 'UTC') AS utc_ts
-- Week start Monday (common approach):
DATE_TRUNC('week', utc_ts)::DATE AS week_start_date
Note: Exact syntax varies. The key idea is: use time zone–aware conversion, then date_trunc.
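In Python the same steps need only the standard library: `fromisoformat` parses the offset shown above, `astimezone` does the time zone-aware conversion, and subtracting `weekday()` days truncates to the Monday week start:

```python
from datetime import datetime, timedelta, timezone

def to_utc_week_start(local_ts: str):
    """Parse an offset-aware timestamp, convert to UTC, derive Monday week start."""
    ts = datetime.fromisoformat(local_ts)       # keeps the -07:00 offset
    utc_ts = ts.astimezone(timezone.utc)        # tz-aware conversion, no manual math
    week_start = (utc_ts - timedelta(days=utc_ts.weekday())).date()  # Monday
    return utc_ts, week_start

utc_ts, week_start = to_utc_week_start("2024-07-01 09:00:00-07:00")
print(utc_ts)      # 2024-07-01 16:00:00+00:00
print(week_start)  # 2024-07-01
```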
Exercises
Do these to solidify the concepts, then take the quick test below.
Exercise 1 — Clean names, phones, and DOB
Input rows:
1) name=" jOAN SMITH ", phone="(415) 555-0123", dob="03/10/1979"
2) name="ALAN turing", phone="415.555.0456", dob="1979-03-10"
3) name="m. o'NEIL ", phone="+1 212 555 8899", dob="10-03-1979"
Tasks:
- Normalize name to Title Case with single spaces.
- Normalize phone to E.164-like digits beginning with +.
- Parse DOB to ISO date using multiple formats. Add validity flag.
Exercise 2 — Split product code
Input product_code examples:
ACME-TSHIRT-RED-L-2024Q1
ACME-MUG-BLACK-2023Q4
ACME-TSHIRT-BLUE-M-2024Q2
Tasks:
- Extract brand, item, color, size (optional), year, quarter.
- Handle the version without size (MUG line).
Exercise 3 — Time zone to UTC
Input rows:
1) ts_local="2024-03-10 01:30:00-08:00"
2) ts_local="2024-03-10 03:30:00-07:00"
(Consider the DST shift around March.)
Tasks:
- Convert each to UTC with a time zone–aware method.
- Return UTC timestamp and date.
Self-check checklist
- All outputs are trimmed and consistently cased.
- Dates parse correctly across the provided formats.
- UTC conversion matches expectation even across DST boundaries.
Common mistakes and how to self-check
- Blind casting strings to dates. Fix: use explicit formats and try/fallback parsing; flag invalids.
- Manual hour math for time zones. Fix: use time zone–aware functions (DST safe).
- Not trimming before comparing text. Fix: TRIM and collapse spaces; standardize case before joins.
- Regex that over-matches. Fix: anchor patterns and test on samples; keep examples of allowed/blocked cases.
- Ambiguous day/month interpretation. Fix: define business rule per source; document assumptions.
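The over-matching point above is easy to demonstrate: an unanchored pattern accepts any string that merely contains a valid-looking part, while an anchored one requires the whole string to match:

```python
import re

loose = re.compile(r"[0-9]{10}")     # over-matches: finds 10 digits anywhere
strict = re.compile(r"^[0-9]{10}$")  # anchored: whole string must be 10 digits

sample = "call 4155550123 ext 99"
print(bool(loose.search(sample)))   # True  (over-match)
print(bool(strict.search(sample)))  # False (rejected as intended)
```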
Quick self-audit
- Re-run transformations twice; results should not change (idempotent).
- Sample 20 random rows; manually spot-check at least 5 edge cases.
- Compare row counts before/after parsing; investigate sudden drops or spikes.
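The idempotence check above fits in a few lines: run the transform on its own output and assert nothing changes. `normalize` here is a stand-in for your pipeline step:

```python
import re

def normalize(s: str) -> str:
    """Example transform: trim, collapse spaces, lowercase."""
    return re.sub(r"\s+", " ", s.strip()).lower()

raw = "  Mixed   CASE  text "
once = normalize(raw)
twice = normalize(once)
assert once == twice, "transform is not idempotent"
print(once)  # mixed case text
```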
Practical projects
Customer contact normalizer
- Inputs: name, email, phone from 3 vendors.
- Steps: trim/case standardization, email lowercasing and basic format check, phone E.164-like formatting by locale.
- Deliverables: clean table + invalids table with reasons.
Calendar dimension builder
- Create a date table from min to max event date.
- Add columns: week start (Mon), week of year, fiscal period, is_weekend, is_month_end.
- Use it to aggregate a fact table by week and month.
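A stdlib-only sketch of the calendar-dimension idea, one row per date with the derived columns listed above (fiscal period omitted, since it is organization-specific):

```python
from datetime import date, timedelta
import calendar

def build_calendar(start: date, end: date) -> list[dict]:
    """One row per date from start to end inclusive, with derived columns."""
    rows = []
    d = start
    while d <= end:
        rows.append({
            "date": d,
            "week_start": d - timedelta(days=d.weekday()),  # Monday
            "week_of_year": d.isocalendar()[1],
            "is_weekend": d.weekday() >= 5,
            "is_month_end": d.day == calendar.monthrange(d.year, d.month)[1],
        })
        d += timedelta(days=1)
    return rows

cal = build_calendar(date(2024, 6, 28), date(2024, 7, 1))
print(cal[-1]["week_start"])  # 2024-07-01 (July 1, 2024 is a Monday)
```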
Web log timestamp normalizer
- Parse logs with mixed offsets (e.g., -07:00, +01:00).
- Convert to UTC, derive date, hour, and session window start.
- Verify counts per hour remain consistent after conversion.
Learning path
- Before: basic data types, simple SELECTs/filters.
- Now: text/date parsing, normalization, and time zone conversion.
- Next: joins on standardized keys, slowly changing dimensions, and incremental loads.
Next steps
- Complete the Exercises above, then take the Quick Test below.
- Add parsing error reporting: count invalid dates per source.
- Document your chosen date format assumptions and share with teammates.
Mini challenge
Given event_time strings: "2024/12/31 23:59:59 -0500", "31-12-2024 21:00:00 +0100" — produce:
- utc_ts
- event_date (UTC)
- week_start_date (UTC, Monday)
Hint
Use explicit parsing per pattern, coalesce, then convert with time zone–aware functions before truncation.