Why this matters
As a Data Analyst, a surprising amount of work involves messy text: product names, emails, IDs, cities, categories, and free-form notes. The pandas .str accessor lets you clean, standardize, split, extract, and search strings directly on Series without loops. This speeds up analysis and keeps your code readable.
- Clean customer names and cities before grouping.
- Extract IDs, ZIP codes, or domains to join datasets.
- Filter leads by keywords and standardize categories.
- Split full names, parse addresses, and format outputs.
Concept explained simply
The .str accessor is a vectorized toolbox for text. When your Series holds strings (object or string dtype), you can call .str.method() to transform all values at once. It is like applying Python string methods and regex to every cell in a column, but faster and more concise.
Mental model
Imagine a conveyor belt of text values. The .str accessor is a panel of switches you flip in sequence: strip whitespace, lowercase, replace patterns, split into columns, extract parts. Each switch transforms the whole column in one go.
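A minimal sketch of that idea, comparing a Python loop to the vectorized chain (the city values here are made up):
import pandas as pd

cities = pd.Series([' Paris', 'BERLIN ', '  oslo '])

# Loop version: correct, but verbose and slow on large columns.
looped = pd.Series([c.strip().title() for c in cities])

# Vectorized version: same result in one readable chain.
vectorized = cities.str.strip().str.title()

print(vectorized.tolist())
# ['Paris', 'Berlin', 'Oslo']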
Core toolkit (80/20 you will use most)
- Cleaning and casing: .str.strip(), .str.lower(), .str.upper(), .str.title(), .str.replace(pat, repl, regex=True), .str.normalize('NFKC')
- Finding and filtering: .str.contains(pat, regex=True, na=False), .str.startswith(), .str.endswith(), .str.len()
- Extracting parts: .str.extract(regex) (first match, by group), .str.findall(regex) (all matches)
- Splitting and accessing: .str.split(pat, n, expand=True), .str.get(i), .str.slice(start, stop)
- Combining: .str.cat(other=..., sep=..., na_rep=''), .str.pad(), .str.zfill()
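A quick tour of one method from each group, on a made-up product column (the ORD- pattern is illustrative):
import pandas as pd

s = pd.Series(['  ORD-101 Basic Tee ', 'ORD-202 hoodie', None])

print(s.str.strip().str.lower())                          # cleaning and casing
print(s.str.contains('tee', case=False, na=False))        # finding and filtering
print(s.str.extract(r'(ORD-\d+)'))                        # extracting parts
print(s.str.strip().str.split(r'\s+', n=1, expand=True))  # splitting into columns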
Worked examples
Example 1 — Clean product names
import pandas as pd
s = pd.Series([' Basic Tee ', 'BASIC tee', None, 'basic-tee'])
clean = (s.str.strip()
          .str.lower()
          .str.replace(r'\s+', ' ', regex=True)
          .str.replace('-', ' ', regex=False))
print(clean)
# 0 basic tee
# 1 basic tee
# 2 NaN
# 3 basic tee
Tip: Use regex=True for patterns and regex=False for literal replacements.
Example 2 — Extract order ID
df = pd.DataFrame({'note': ['Paid ORD-1023 today', 'No order', 'Check ORD-77 and ORD-78']})
df['order_id'] = df['note'].str.extract(r'(ORD-\d+)')
print(df[['note','order_id']])
# note order_id
# 0 Paid ORD-1023 today ORD-1023
# 1 No order NaN
# 2 Check ORD-77 and ORD-78 ORD-77
.str.extract() returns the first match for each capturing group. Use .str.findall() to collect all matches.
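If you need every ID rather than just the first, .str.findall() returns a list per row:
df['all_ids'] = df['note'].str.findall(r'ORD-\d+')
print(df['all_ids'].tolist())
# [['ORD-1023'], [], ['ORD-77', 'ORD-78']]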
Example 3 — Split full name into two columns
df = pd.DataFrame({'full_name': ['Ana Ruiz', 'Lee', 'Sam A. Carter']})
df[['first','last']] = df['full_name'].str.split(r'\s+', n=1, expand=True)
print(df)
#        full_name   first       last
# 0       Ana Ruiz     Ana       Ruiz
# 1            Lee     Lee        NaN
# 2  Sam A. Carter     Sam  A. Carter
Use n=1 to split only once from the left. Missing parts become NaN.
Example 4 — Filter emails with safe handling of NaN
emails = pd.Series(['a@x.com', None, 'bad-at-x', 'team@corp.org'])
mask = emails.str.contains('@', na=False)
print(emails[mask])
# 0 a@x.com
# 3 team@corp.org
na=False treats missing values as non-matches, keeping the mask fully boolean; without it, NaN in the mask raises an error when you filter.
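To see why, build the mask without na=False; the NaN makes it unusable for filtering:
raw = emails.str.contains('@')
print(raw.tolist())
# [True, nan, False, True]
# emails[raw] raises an error because the mask contains NaN.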
Who this is for
- Aspiring and practicing Data Analysts working with messy text columns.
- Anyone comfortable with basic pandas and ready to clean/parse string data efficiently.
Prerequisites
- Basic pandas: creating DataFrames/Series, selecting columns.
- Very light regex familiarity helps, but simple patterns are enough to start.
- Python installed and a Jupyter notebook or similar environment.
Learning path
- Clean and standardize: .str.strip(), .str.lower(), .str.replace().
- Split and extract: .str.split(..., expand=True), .str.extract().
- Filter and validate: .str.contains(), .str.len(), .str.startswith().
- Combine and format: .str.cat(), .str.pad(), .str.zfill().
- Edge cases: handling NaN, literal vs regex search, multi-match extraction.
Hands-on exercises
Run these exercises in your notebook. Use the checklist further below before running; a sample solution follows each exercise.
Exercise 1 — Standardize city names
Create a clean city column with single spaces, trimmed edges, and title case.
import pandas as pd
df = pd.DataFrame({
    'city': [' new york ', 'LONDON', None, 'san francisco', 'Mexico-City']
})
# Goal: 'New York', 'London', NaN, 'San Francisco', 'Mexico City'
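One possible solution (other chains reach the same goal):
df['city_clean'] = (df['city']
                    .str.strip()
                    .str.replace('-', ' ', regex=False)
                    .str.replace(r'\s+', ' ', regex=True)
                    .str.title())
print(df['city_clean'].tolist())
# ['New York', 'London', nan, 'San Francisco', 'Mexico City']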
Exercise 2 — Extract order IDs
From column note, extract the first occurrence of an ID like ORD-12345 into order_id. Keep non-matching as NaN.
df = pd.DataFrame({'note': [
    'Paid on 2023-01-01. ID: ORD-10023',
    'Refunded. See ORD-8 and contact.',
    'No id in this row',
]})
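One possible solution:
df['order_id'] = df['note'].str.extract(r'(ORD-\d+)')
print(df['order_id'].tolist())
# ['ORD-10023', 'ORD-8', nan]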
Exercise 3 — Split full names
Split full_name into first and last, splitting only once on whitespace. Leave missing parts as NaN.
df = pd.DataFrame({'full_name': ['Jo Kim', 'Prince', 'R. D. Patel']})
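One possible solution:
df[['first', 'last']] = df['full_name'].str.split(r'\s+', n=1, expand=True)
print(df)
#      full_name   first      last
# 0       Jo Kim      Jo       Kim
# 1       Prince  Prince       NaN
# 2  R. D. Patel      R.  D. Patel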
Checklist before running
- Did you handle NaN safely (e.g., na= when filtering)?
- For .replace, do you need regex? If not, set regex=False.
- For .split, did you use expand=True to get columns?
- For .extract, does your regex have a capturing group (...)?
Common mistakes and self-check
- Forgetting expand=True in .str.split() when you need new columns. Self-check: did you get a DataFrame of columns?
- Accidentally using regex when you meant a literal. Self-check: set regex=False for symbols like +, ?, ( to avoid regex interpretation.
- Dropping rows because of NaN in .str.contains(). Self-check: use na=False.
- Running .str on non-string columns. Self-check: convert with .astype('string') or .astype(str) when appropriate (see the sketch after this list).
- Over-splitting names/addresses. Self-check: use n=1 to control splits.
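A minimal sketch of the non-string pitfall and its fix (made-up ZIP values):
import pandas as pd

zips = pd.Series([2704, 62704])              # integers, so .str is unavailable
# zips.str.zfill(5)                          # would raise AttributeError
padded = zips.astype('string').str.zfill(5)  # convert first, then pad
print(padded.tolist())
# ['02704', '62704']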
Practical projects
- CRM cleanup: standardize city, state, and email domain; create a domain column with .str.extract(); combine city and country with .str.cat() (see the sketch after this list).
- Order notes parser: extract the first ORD-\d+, flag lines mentioning refund (case-insensitive), and normalize whitespace.
- Web logs mini ETL: from URL paths, extract the product slug, pad IDs with .str.zfill(6), and filter API calls with .str.startswith('/api').
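A minimal sketch of the CRM cleanup idea; the frame and column names here are made up for illustration:
import pandas as pd

crm = pd.DataFrame({'email': ['ana@corp.org', 'lee@x.com', None],
                    'city': ['Lima', 'Oslo', 'Pune'],
                    'country': ['PE', 'NO', 'IN']})

crm['domain'] = crm['email'].str.extract(r'@([\w.-]+)')       # text after the @
crm['place'] = crm['city'].str.cat(crm['country'], sep=', ')  # 'Lima, PE', ...
print(crm[['domain', 'place']])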
Mini challenge
You receive an address column with values like "123 Main St, Springfield, IL 62704" or "45-12 5th Ave New York NY 10011". Create columns city, state, and zip by:
- Extracting a 5-digit ZIP with .str.extract(r'(\d{5})').
- Extracting a 2-letter state with .str.extract(r'\b([A-Z]{2})\b').
- Getting city as the token(s) before the state (hint: use .str.extract with a group capturing the text before the state).
Hint
Try a pattern like r',?\s*([A-Za-z .-]+),?\s*([A-Z]{2})\s*(\d{5})' to capture city, state, zip in one go.
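A sketch applying that pattern (the comma-separated form parses cleanly; the free-form address may leave extra street tokens in the city group, so expect some cleanup):
import pandas as pd

addr = pd.Series(['123 Main St, Springfield, IL 62704',
                  '45-12 5th Ave New York NY 10011'])
parts = addr.str.extract(r',?\s*([A-Za-z .-]+),?\s*([A-Z]{2})\s*(\d{5})')
parts.columns = ['city', 'state', 'zip']
print(parts)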
Practice & test
Ready to check your understanding? Take the quick test below.
Next steps
- Apply these methods to one real dataset you use weekly and document your cleaning steps.
- Create utility functions (e.g., clean_city(s), extract_order_id(s)) to reuse in future notebooks; a sketch follows this list.
- Move on to text feature engineering (n-grams, keyword flags) once you are comfortable.
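A sketch of what those helpers might look like (function names mirror the suggestions above):
import pandas as pd

def clean_city(s: pd.Series) -> pd.Series:
    """Trim, drop hyphens, collapse whitespace, and title-case city names."""
    return (s.str.strip()
             .str.replace('-', ' ', regex=False)
             .str.replace(r'\s+', ' ', regex=True)
             .str.title())

def extract_order_id(s: pd.Series) -> pd.Series:
    """Return the first ORD-style ID per row; non-matches become NaN."""
    return s.str.extract(r'(ORD-\d+)', expand=False)

print(clean_city(pd.Series([' mexico-city '])).iloc[0])           # Mexico City
print(extract_order_id(pd.Series(['Paid ORD-9 today'])).iloc[0])  # ORD-9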