Why this matters
Example 4 — Deduplicate by keeping the latest per ID
Goal: for each OrderID, keep the latest date row.
OrderID in A2:A, Date in B2:B, CustID in C2:C
One-cell solution:
=SORTN(
SORT({A2:A, C2:C, ARRAYFORMULA(IFERROR(DATEVALUE(SUBSTITUTE(B2:B,"/","-"))))}, 1, TRUE, 3, FALSE),
9^9, 2, 1, TRUE
)SORTby OrderID asc and Date desc (third column).SORTNkeeps first occurrence per OrderID (column 1).
Step-by-step cleaning workflow
1) Create a staging sheet
- Copy raw data to a new tab "stg_raw" (keep raw untouched).
- Add helper columns for parsing and flags.
2) Normalize text
- Remove stray characters:
=ARRAYFORMULA(REGEXREPLACE(range,"[^\w '\-]","")) - Whitespace/case:
=ARRAYFORMULA(PROPER(TRIM(range)))
3) Parse dates/numbers
- Dates:
=ARRAYFORMULA(IFERROR(DATEVALUE(SUBSTITUTE(date_col,"/","-")),"")) - Numbers:
=ARRAYFORMULA(VALUE(SUBSTITUTE(num_col,",","")))
4) Validate and deduplicate
- Duplicate flag:
=ARRAYFORMULA(COUNTIF(key_col,key_col)>1) - Keep latest per key: use
SORT+SORTNas in Example 4.
5) Map categories
- Create a mapping table (old → new).
- Apply:
=ARRAYFORMULA(IFERROR(VLOOKUP(old_col, map!A:B, 2, FALSE), old_col))
6) Output clean table
- Select only clean columns in final order:
=QUERY(clean_range, "select Col1, Col3, Col5", 0) - Freeze header, set formats, protect ranges if needed.
Reusable checklist
- Raw preserved
- Dates parse as numbers (check by changing format)
- No blank keys
- No dup keys (unless expected)
- Categories match controlled list
- All formulas error-handled (
IFERROR)
Exercises
Do these directly in a new sheet. Mirror of the practice tasks below. Your outputs should match the expected results provided.
Exercise 1 — Clean names into a standardized column
Dataset (A1:A6):
A Name " aNA gomez " "LEE CHEN!!!" "o'connor" " mary-jane smith " "Jo (temp) "
Task: In B2, write a single array formula to produce cleaned names with these rules: trim spaces, remove non-letters except spaces, apostrophes, and hyphens, and set Proper Case. Copy down via array (one formula only).
Expected result in B2:B6:
Ana Gomez Lee Chen O'Connor Mary-Jane Smith Jo Temp
Hint
Combine REGEXREPLACE, TRIM, and PROPER inside ARRAYFORMULA.
Exercise 2 — Standardize dates and deduplicate orders
Dataset (A1:C7):
OrderID | DateText | CustomerID 1001 | 2024/07/01 | C-01 1002 | 1-7-2024 | C-02 1001 | Jul 03, 2024 | C-01 1003 | 2024-07-02 | C-03 1002 | 2024/07/05 | C-02
Tasks:
- Create D2:D as StandardDate (true date values) from DateText using one array formula.
- Produce a deduplicated table with the latest row per OrderID (keep columns: OrderID, CustomerID, StandardDate).
Expected result (latest per OrderID):
1001 | C-01 | 2024-07-03 1002 | C-02 | 2024-07-05 1003 | C-03 | 2024-07-02
Hint
First convert text to dates with DATEVALUE and SUBSTITUTE. Then use SORT + SORTN to keep the latest per ID.
Common mistakes and self-check
- Mistake: Using Find & Replace to fix case and spaces manually. Fix: Use formulas (
TRIM,PROPER) so updates auto-apply. - Mistake: Storing dates as text. Fix: Ensure the cell formats change when you switch to a number format; otherwise parse with
DATEVALUE. - Mistake: VLOOKUP returning wrong matches. Fix: Use exact match (
FALSE) and clean keys on both sides. - Mistake: Losing duplicates unintentionally. Fix: Flag dups with
COUNTIFfirst; document dedup criteria. - Mistake: Unhandled errors break the table. Fix: Wrap with
IFERRORto keep outputs stable.
Self-check routine
- Switch date column to Plain text then back to Date—do values survive?
- COUNTBLANK on key columns equals 0?
- COUNTIF(key, key) maximum equals 1 after dedup?
- Random spot-check 5 rows against raw.
Mini challenge
You receive a CSV export with product names like " ultra-phone X (Black) ", categories like "MOB, phone", and prices with commas "1,299.00". Build a staging sheet that outputs:
- Clean ProductName: remove parentheses content, trim, Proper case.
- Category: standardized to one of {"Phone","Tablet","Accessory"} via mapping.
- Price: numeric.
Starter hints
- Remove parentheses:
=REGEXREPLACE(A2,"\s*\(.*\)","") - Price:
=VALUE(SUBSTITUTE(price,",","")) - Map:
=IFERROR(VLOOKUP(rawCat,map!A:B,2,FALSE),"Other")
Who this is for
- Aspiring and junior Data Analysts.
- Anyone cleaning CSV/Excel/Sheets exports for reporting.
Prerequisites
- Basic spreadsheet navigation and formulas.
- Comfort with ranges and absolute references.
Learning path
- Start: TRIM, PROPER, SPLIT, TEXTJOIN.
- Next: DATEVALUE, VALUE, error handling with IFERROR.
- Then: REGEXREPLACE/REGEXEXTRACT for pattern work.
- Finally: Deduping with SORT/SORTN, joining with VLOOKUP or INDEX/MATCH.
Practical projects
- Clean a 1,000-row lead list: names, emails, phone numbers, dedupe by email.
- Unify sales transactions from two marketplaces; standardize dates and categories; reconcile SKUs via a mapping tab.
- Create a repeatable cleaning template with staging and output tabs for a monthly report.
Next steps
- Build your personal cleaning template with sections for text normalization, date parsing, dedup, and mapping.
- Practice on messy public datasets (exports you already have) and time yourself to improve speed.
- Take the quick test below to confirm understanding.