Why this matters
Marketing Analysts constantly join exports from ad platforms, web analytics, and CRM. Raw data arrives messy: mixed case, extra spaces, inconsistent dates, duplicate leads, and irregular campaign names. Cleaning quickly and correctly means:
- Trustworthy conversion rates, CAC, and ROAS.
- Consistent channel and campaign naming for rollups.
- Accurate attribution and cohort analyses.
- Faster reporting with fewer manual fixes.
Concept explained simply
Data cleaning is turning inconsistent inputs into consistent, analysis-ready columns without changing the meaning. You standardize text, convert types (text to dates/numbers), handle missing values, fix categories, and remove duplicates.
Mental model
- See each column as a contract: one meaning, one format.
- Use repeatable formulas so new data cleans itself.
- Move left-to-right: detect → standardize → validate.
Common cleaning functions at a glance
- Whitespace/characters: TRIM, CLEAN, SUBSTITUTE, REGEXREPLACE
- Case/format: LOWER, UPPER, PROPER, TEXT
- Split/merge: SPLIT (or TEXTSPLIT), TEXTJOIN
- Dates/numbers: DATEVALUE, VALUE, NUMBERVALUE
- Lookup/mapping: VLOOKUP/XLOOKUP, IF, SWITCH
- Distinct/filter: UNIQUE, FILTER, SORT, REMOVE DUPLICATES (menu)
Step-by-step cleaning workflow
- Copy raw data to a Raw sheet; do not edit it directly.
- Build a Clean sheet with formulas referencing Raw.
- Standardize text:
=LOWER(TRIM(A2)), remove noise withREGEXREPLACE. - Parse and cast types: convert to valid dates and numbers.
- Map categories via lookup tables (e.g., channel groupings).
- Remove/flag duplicates and impossible values.
- Add validation columns (e.g.,
ISBLANK,ISNUMBER) for self-check.
Worked examples
1) Standardize UTM campaign and medium
Goal: convert varied inputs into consistent fields.
Raw!A: medium Raw!B: campaign " CPC " "Black Friday - US" "cpc" "black-friday_us" "Paid Social" "BlackFriday US" "(not set)" "BlackFriday-US" "CPC" "BF us"
In Clean!C (medium_std):
=LET(m,LOWER(TRIM(Raw!A2)),IF(m="(not set)","",IF(OR(m="cpc",m="ppc"),"cpc",IF(m="paid social","paid_social",m))))
In Clean!D (campaign_key):
=LOWER(REGEXREPLACE(TRIM(Raw!B2),"[^a-z0-9]+","_"))
Optional: remove leading/trailing underscores:
=LOWER(REGEXREPLACE(REGEXREPLACE(TRIM(Raw!B2),"[^a-z0-9]+","_"),"^_+|_+$",""))
2) Convert mixed date strings to real dates
Raw!C: date_text "2024/11/30" "11-30-2024" "30 Nov 2024" "2024.11.30" "Nov-30-24"
Clean!E (date_value):
=DATEVALUE(SUBSTITUTE(SUBSTITUTE(SUBSTITUTE(Raw!C2,".","/"),"-","/")," "," "))
This normalizes separators to "/" so DATEVALUE can parse. Format Clean!E as Date.
If some rows still fail
Wrap with an IFERROR fallback, e.g. try parsing textual months:
=IFERROR(DATEVALUE(SUBSTITUTE(SUBSTITUTE(SUBSTITUTE(Raw!C2,".","/"),"-","/")," "," ")),IFERROR(VALUE(Raw!C2),""))
3) Convert currency-like text to numbers
Raw!D: amount_text "$1,200.50" "€ 900,00" "1 500" "USD2,345" "£0.99"
Clean!F (amount_num):
=LET(t,REGEXREPLACE(Raw!D2,"[^0-9,\.\-]",""), dec, IF(REGEXMATCH(t,",\d{2}$"),",","."), grp, IF(dec=",",".",","), NUMBERVALUE(t, dec, grp))This removes currency letters/spaces, detects decimal separator, and converts to a number.
4) De-duplicate leads by email, keep latest
Sort Raw by Date desc, then keep the first occurrence of each email.
Formula approach in Sheets (array):
=LET(s,SORT(Raw!A2:D, Raw!D2:D, FALSE), emails, INDEX(s,,2), keep, MATCH(emails, emails, 0)=ROW(emails)-MIN(ROW(emails))+1, FILTER(s, keep))
Explanation: sort by date desc, then keep the first time each email appears.
Common mistakes and how to self-check
- Leaving hidden spaces: fix with
TRIMand confirm usingLEN(value). - Breaking joins due to case differences: always
LOWERboth sides before lookups. - Dates stored as text: check with
ISNUMBER(date_cell)afterDATEVALUE. - Over-cleaning: removing useful characters (e.g., hyphens in SKUs). Define allowed chars per column.
- Accidental duplicates after sort: re-check with
COUNTIF(email_range, email)=1.
Self-check checklist
- All key ID columns pass uniqueness checks.
- All date columns return TRUE for
ISNUMBER. - No leading/trailing spaces:
LEN(x)=LEN(TRIM(x)). - Category values match a controlled list (via validation or lookup).
- Aggregations before vs. after cleaning differ only where expected.
Exercises
These mirror the tasks below the lesson. Do them in a new Sheet. Aim to solve using formulas so your steps are repeatable.
Exercise 1: Standardize UTM fields
Input (paste into A1:B6):
medium campaign " CPC " "Black Friday - US" "cpc" "black-friday_us" "Paid Social" "BlackFriday US" "(not set)" "BlackFriday-US" "CPC" "BF us"
- Create columns C (medium_std) and D (campaign_key).
- medium_std rules: trim + lower; map "ppc" and any form of "cpc" to "cpc"; map "paid social" to "paid_social"; blank for "(not set)".
- campaign_key: trim, lower, replace any non-alphanumeric with a single underscore, remove leading/trailing underscores.
Exercise 2: Parse dates and amounts
Input (paste into A1:B6):
date_text amount_text 2024/11/30 $1,200.50 11-30-2024 € 900,00 30 Nov 2024 1 500 2024.11.30 USD2,345 Nov-30-24 £0.99
- Create columns C (date_value) and D (amount_num).
- Convert all date_text to true dates.
- Convert all amount_text to numeric values regardless of currency symbol or thousands delimiter.
Exercise checklist
- All standardized mediums are one of: cpc, paid_social, or blank when not set.
- All campaign keys are lowercase with underscores only.
ISNUMBER(C2:C)andISNUMBER(D2:D)return TRUE.- No leading/trailing underscores in campaign keys.
Practical projects
- Campaign naming cleaner: Input raw UTM exports; output standardized medium/source/campaign, with a lookup tab for mappings.
- Leads deduper: Merge form and CRM exports, keep latest per email, flag conflicts (e.g., different country recorded).
- Revenue prep sheet: Convert currency-like text to numbers, align to a unified currency column after conversion (assume conversion rates are handled elsewhere), and validate totals.
Mini challenge
Given product codes in A2:A, like "sku-123_a", " SKU 123 A ", "SKU-123A!!", create a formula that outputs standardized keys like "sku_123a" (lowercase, underscores, alphanumerics only). Hint: combine LOWER, TRIM, and REGEXREPLACE.
One-line answer
=LOWER(REGEXREPLACE(TRIM(A2),"[^a-z0-9]+","_"))
Who this is for
- Marketing Analysts cleaning exports from ads, analytics, and CRM.
- Anyone building recurring dashboards who wants consistent inputs.
Prerequisites
- Basic spreadsheet navigation and formula entry.
- Comfort with references (A1 and absolute references), sorting, and filtering.
Learning path
- Master text cleanup (TRIM, CLEAN, LOWER, SUBSTITUTE).
- Learn regex basics for
REGEXREPLACE/REGEXEXTRACT. - Parse types:
DATEVALUE,VALUE,NUMBERVALUE. - Category mapping with lookups and validation.
- Deduping and QA checks at the end of your pipeline.
Common pitfalls
- Mixing manual edits with formulas. Keep a separate Clean sheet.
- One-off fixes. Prefer formula patterns you can reuse.
- Ambiguous date formats. Normalize separators before parsing.
Next steps
- Turn your cleaning steps into a documented template.
- Add data validation drop-downs to prevent new messes.
- Practice on a new export each week to build speed.
Test availability note: The quick test is available to everyone; only logged-in users get saved progress.