Topic 11 of 30

String Operations With Str Accessor

Learn string operations with the pandas .str accessor for free, with explanations, exercises, and a quick test for Data Analysts.

Published: December 20, 2025 | Updated: December 20, 2025

Why this matters

As a Data Analyst, a surprising amount of work involves messy text: product names, emails, IDs, cities, categories, and free-form notes. The pandas .str accessor lets you clean, standardize, split, extract, and search strings directly on Series without loops. This speeds up analysis and keeps your code readable.

  • Clean customer names and cities before grouping.
  • Extract IDs, ZIP codes, or domains to join datasets.
  • Filter leads by keywords and standardize categories.
  • Split full names, parse addresses, and format outputs.

Concept explained simply

The .str accessor is a vectorized toolbox for text. When your Series holds strings (object or string dtype), you can call .str.method() to transform all values at once. It is like applying Python string methods and regex to every cell in a column, but faster and more concise.

Mental model

Imagine a conveyor belt of text values. The .str accessor is a panel of switches you flip in sequence: strip whitespace, lower case, replace patterns, split columns, extract parts. Each switch modifies the whole column in one go.
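A minimal sketch of that conveyor-belt idea, chaining two "switches" on a small Series (the values here are made up):

```python
import pandas as pd

s = pd.Series(['  Apple ', 'BANANA', None])

# Each .str call transforms the whole column; missing values pass through as NaN
result = s.str.strip().str.lower()
print(result.tolist())  # ['apple', 'banana', nan]
```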

Core toolkit (80/20 you will use most)

  • Cleaning and casing: .str.strip(), .str.lower(), .str.upper(), .str.title(), .str.replace(pat, repl, regex=True), .str.normalize('NFKC')
  • Finding and filtering: .str.contains(pat, regex=True, na=False), .str.startswith(), .str.endswith(), .str.len()
  • Extracting parts: .str.extract(regex) (first match, groups), .str.findall(regex) (all matches)
  • Splitting and accessing: .str.split(pat, n, expand=True), .str.get(i), .str.slice(start, stop)
  • Combining: .str.cat(other=..., sep=..., na_rep=''), .str.pad(), .str.zfill()
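The combining methods in the last bullet can be sketched like this (the ID and city values are invented for illustration):

```python
import pandas as pd

ids = pd.Series(['7', '42', '1003'])
cities = pd.Series(['austin', 'boston', 'denver'])

# Zero-pad each ID to width 5, then join element-wise with the city column
labels = ids.str.zfill(5).str.cat(cities.str.title(), sep=' - ')
print(labels.tolist())  # ['00007 - Austin', '00042 - Boston', '01003 - Denver']
```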

Worked examples

Example 1 — Clean product names
import pandas as pd

s = pd.Series(['  Basic Tee  ', 'BASIC  tee', None, 'basic-tee'])
clean = (s.str.strip()
           .str.lower()
           .str.replace(r'\s+', ' ', regex=True)
           .str.replace('-', ' ', regex=False)
        )
print(clean)
# 0    basic tee
# 1    basic tee
# 2          NaN
# 3    basic tee

Tip: Use regex=True for patterns and regex=False for literal replacements.
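To see the difference, here is the same pattern used both ways (hypothetical values):

```python
import pandas as pd

s = pd.Series(['a+b', 'aab'])

# As a regex, 'a+' means "one or more a"; as a literal, it is the two characters a+
print(s.str.replace('a+', 'X', regex=True).tolist())   # ['X+b', 'Xb']
print(s.str.replace('a+', 'X', regex=False).tolist())  # ['Xb', 'aab']
```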

Example 2 — Extract order ID
df = pd.DataFrame({'note': ['Paid ORD-1023 today', 'No order', 'Check ORD-77 and ORD-78']})
df['order_id'] = df['note'].str.extract(r'(ORD-\d+)')
print(df[['note','order_id']])
#                       note  order_id
# 0      Paid ORD-1023 today  ORD-1023
# 1                 No order       NaN
# 2  Check ORD-77 and ORD-78    ORD-77

.str.extract() returns only the first match of each capturing group (with a single group, pass expand=False to get a Series instead of a one-column DataFrame). Use .str.findall() to collect all matches.
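A quick sketch of the difference, reusing the order-note idea:

```python
import pandas as pd

notes = pd.Series(['Check ORD-77 and ORD-78', 'No order'])

# findall returns a list of every match per row (empty list when nothing matches)
print(notes.str.findall(r'ORD-\d+').tolist())  # [['ORD-77', 'ORD-78'], []]
```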

Example 3 — Split full name into two columns
df = pd.DataFrame({'full_name': ['Ana Ruiz', 'Lee', 'Sam A. Carter']})
df[['first','last']] = df['full_name'].str.split(r'\s+', n=1, expand=True)
print(df)
#        full_name  first       last
# 0       Ana Ruiz    Ana       Ruiz
# 1            Lee    Lee        NaN
# 2  Sam A. Carter    Sam  A. Carter

Use n=1 to split only once from the left. Missing parts become NaN.
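If you instead want the last token (for example, a surname), .str.rsplit() splits from the right; a small sketch:

```python
import pandas as pd

names = pd.Series(['Sam A. Carter', 'Lee'])

# rsplit with n=1 splits once from the right on whitespace:
# 'Sam A. Carter' -> ['Sam A.', 'Carter']; 'Lee' -> ['Lee', NaN]
parts = names.str.rsplit(n=1, expand=True)
```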

Example 4 — Filter emails with safe handling of NaN
emails = pd.Series(['a@x.com', None, 'bad-at-x', 'team@corp.org'])
mask = emails.str.contains('@', na=False)
print(emails[mask])
# 0          a@x.com
# 3    team@corp.org

na=False prevents NaN values from dropping useful rows by accident.

Who this is for

  • Aspiring and practicing Data Analysts working with messy text columns.
  • Anyone comfortable with basic pandas and ready to clean/parse string data efficiently.

Prerequisites

  • Basic pandas: creating DataFrames/Series, selecting columns.
  • Very light regex familiarity helps, but simple patterns are enough to start.
  • Python installed and a Jupyter notebook or similar environment.

Learning path

  1. Clean and standardize: .str.strip(), .str.lower(), .str.replace().
  2. Split and extract: .str.split(..., expand=True), .str.extract().
  3. Filter and validate: .str.contains(), .str.len(), .str.startswith().
  4. Combine and format: .str.cat(), .str.pad(), .str.zfill().
  5. Edge cases: handling NaN, literal vs regex search, multi-match extraction.

Hands-on exercises

These exercises mirror the practice block at the end of the page. Run them in your notebook, and work through the checklist below before executing.

Exercise 1 — Standardize city names

Create a clean city column with single spaces, trimmed edges, and title case.

import pandas as pd

df = pd.DataFrame({
    'city': ['  new   york ', 'LONDON', None, 'san   francisco', 'Mexico-City']
})
# Goal: 'New York', 'London', NaN, 'San Francisco', 'Mexico City'
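One possible solution (peek only after trying it yourself; the setup is repeated so the snippet runs on its own):

```python
import pandas as pd

df = pd.DataFrame({
    'city': ['  new   york ', 'LONDON', None, 'san   francisco', 'Mexico-City']
})

# Trim edges, collapse runs of spaces/hyphens to one space, then title-case
df['city_clean'] = (df['city']
                    .str.strip()
                    .str.replace(r'[-\s]+', ' ', regex=True)
                    .str.title())
print(df['city_clean'].tolist())  # ['New York', 'London', nan, 'San Francisco', 'Mexico City']
```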

Exercise 2 — Extract order IDs

From column note, extract the first occurrence of an ID like ORD-12345 into order_id. Keep non-matching as NaN.

df = pd.DataFrame({'note': [
    'Paid on 2023-01-01. ID: ORD-10023',
    'Refunded. See ORD-8 and contact.',
    'No id in this row',
]})
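One possible solution (setup repeated so the snippet runs on its own):

```python
import pandas as pd

df = pd.DataFrame({'note': [
    'Paid on 2023-01-01. ID: ORD-10023',
    'Refunded. See ORD-8 and contact.',
    'No id in this row',
]})

# extract keeps only the first match; expand=False returns a Series, not a DataFrame
df['order_id'] = df['note'].str.extract(r'(ORD-\d+)', expand=False)
print(df['order_id'].tolist())  # ['ORD-10023', 'ORD-8', nan]
```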

Exercise 3 — Split full names

Split full_name into first and last, splitting only once on whitespace. Leave missing parts as NaN.

df = pd.DataFrame({'full_name': ['Jo Kim', 'Prince', 'R. D. Patel']})
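One possible solution (setup repeated so the snippet runs on its own):

```python
import pandas as pd

df = pd.DataFrame({'full_name': ['Jo Kim', 'Prince', 'R. D. Patel']})

# n=1 splits at most once, from the left; rows with no space get NaN in 'last'
df[['first', 'last']] = df['full_name'].str.split(r'\s+', n=1, expand=True)
print(df['first'].tolist())  # ['Jo', 'Prince', 'R.']
print(df['last'].tolist())   # ['Kim', nan, 'D. Patel']
```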

Checklist before running

  • Did you handle NaN safely (e.g., parameters like na= when filtering)?
  • For .replace, do you need regex? If not, set regex=False.
  • For .split, did you use expand=True to get columns?
  • For .extract, does your regex have a capturing group (...)?

Common mistakes and self-check

  • Forgetting expand=True in .str.split() when you need new columns. Self-check: did you get a DataFrame of columns?
  • Accidentally using regex when you meant a literal string. Self-check: if your pattern contains symbols like +, ?, or ( that you mean literally, set regex=False.
  • Dropping rows because of NaN in .str.contains(). Self-check: use na=False.
  • Running .str on non-string columns. Self-check: convert with .astype('string') or .astype(str) when appropriate.
  • Over-splitting names/addresses. Self-check: use n=1 to control splits.
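For the non-string-column pitfall, a sketch of a safe conversion (the ZIP values are made up):

```python
import pandas as pd

zips = pd.Series([501, 2101, None])  # float64, because the missing value forces a float dtype

# zips.str.zfill(5) here would raise: .str only works on string values.
# Converting to Int64 first avoids '501.0'-style strings; 'string' is the nullable string dtype.
padded = zips.astype('Int64').astype('string').str.zfill(5)
print(padded.tolist())  # ['00501', '02101', <NA>]
```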

Practical projects

  • CRM cleanup: Standardize city, state, and email domain; create a domain column with .str.extract(); combine city, country with .str.cat().
  • Order notes parser: Extract first ORD-\d+, flag lines mentioning refund (case-insensitive), and normalize whitespace.
  • Web logs mini ETL: From URL paths, extract product slug, pad IDs with .str.zfill(6), and filter API calls with .str.startswith('/api').
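A starting sketch for the web-logs idea (the paths and the slug pattern are invented for illustration):

```python
import pandas as pd

paths = pd.Series(['/api/v1/users', '/product/red-mug-42', '/api/health'])

api_mask = paths.str.startswith('/api')                            # flag API calls
slug = paths.str.extract(r'^/product/([a-z0-9-]+)', expand=False)  # product slug, or NaN
print(api_mask.tolist())  # [True, False, True]
print(slug.tolist())      # [nan, 'red-mug-42', nan]
```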

Mini challenge

You receive a column address like "123 Main St, Springfield, IL 62704" or "45-12 5th Ave New York NY 10011". Create columns city, state, and zip by:

  1. Extracting a 5-digit ZIP with .str.extract(r'(\d{5})').
  2. Extracting a 2-letter state with .str.extract(r'\b([A-Z]{2})\b').
  3. Getting city as the token(s) before the state (hint: use .str.extract with a group capturing text before the state).
Hint

Try a pattern like r',?\s*([A-Za-z .-]+),?\s*([A-Z]{2})\s*(\d{5})' to capture city, state, zip in one go.
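A sketch of one way through all three steps, using the two sample addresses from the challenge. The patterns below are one possible variant of the hint; the city pattern assumes a city name of at most two words:

```python
import pandas as pd

addr = pd.Series([
    '123 Main St, Springfield, IL 62704',
    '45-12 5th Ave New York NY 10011',
])

out = pd.DataFrame({
    # ZIP: five digits at the end of the string
    'zip': addr.str.extract(r'(\d{5})\s*$', expand=False),
    # State: two capitals immediately before the ZIP
    'state': addr.str.extract(r'\b([A-Z]{2})\b(?=\s+\d{5})', expand=False),
    # City: one or two words just before the state (assumption: at most two words)
    'city': addr.str.extract(r'([A-Za-z]+(?: [A-Za-z]+)?),?\s+[A-Z]{2}\s+\d{5}',
                             expand=False),
})
print(out['city'].tolist())  # ['Springfield', 'New York']
```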

Practice & test

Ready to check your understanding? Take the quick test below.

Next steps

  • Apply these methods to one real dataset you use weekly and document your cleaning steps.
  • Create utility functions (e.g., clean_city(s), extract_order_id(s)) to reuse in future notebooks.
  • Move on to text feature engineering (n-grams, keyword flags) once you are comfortable.

Practice Exercises

3 exercises to complete

Instructions

Given a DataFrame with a city column, create city_clean where:

  • Edges are trimmed.
  • Multiple spaces/hyphens collapse to a single space.
  • Casing is title case (e.g., New York).
import pandas as pd

df = pd.DataFrame({
    'city': ['  new   york ', 'LONDON', None, 'san   francisco', 'Mexico-City']
})
Expected Output
  city                 city_clean
  '  new   york '   -> 'New York'
  'LONDON'          -> 'London'
  None              -> NaN
  'san   francisco' -> 'San Francisco'
  'Mexico-City'     -> 'Mexico City'

String Operations With Str Accessor — Quick Test

Test your knowledge with 7 questions. Pass with 70% or higher.

