luvv to helpDiscover the Best Free Online Tools
Topic 4 of 31

String Functions and Pattern Matching Like Regex

Learn String Functions and Pattern Matching Like Regex for free with explanations, exercises, and a quick test (for Data Analyst).

Published: December 19, 2025 | Updated: December 19, 2025

Why this matters

As a Data Analyst, you constantly clean and shape text: emails, product names, IDs, campaign tags, and addresses. String functions and pattern matching (LIKE and regex-style predicates) let you standardize messy data, validate formats, extract useful parts, and create dependable features for analysis.

  • Validate and filter records (e.g., find invalid emails or phone numbers).
  • Standardize text (lowercase, trim spaces, normalize separators).
  • Extract fields (domain from email, SKU prefix, hashtag from message).
  • Match patterns for reporting (campaign names, UTM tags, codes).

Concept explained simply

Think of strings as sequences of characters. String functions help you transform them (cut, join, replace), while pattern matching helps you find strings that look a certain way.

Mental model

  • Shaping clay: functions like LOWER, TRIM, SUBSTRING shape the clay into what you need.
  • Stencils: LIKE and regex define stencils. Only strings that fit the stencil pass.
Vendor differences you should know
  • Common functions exist everywhere but names differ: POSITION vs INSTR vs CHARINDEX; SUBSTRING vs SUBSTR.
  • Case-insensitive search: ILIKE (PostgreSQL), LOWER(column) LIKE LOWER('%...%') (portable), COLLATE options (varies).
  • Regex: PostgreSQL uses ~, ~*, REGEXP_REPLACE; BigQuery/Oracle use REGEXP_EXTRACT/REGEXP_REPLACE/REGEXP_CONTAINS; MySQL uses REGEXP; SQL Server uses LIKE plus CLR or JSON tricks (no native full regex predicate).

Must-know string tools

Core cleanup and casing
-- Trim spaces
TRIM('  hello  ')
-- Remove leading/trailing only
LTRIM('  hello')
RTRIM('hello  ')
-- Case
LOWER('ABC')  -- 'abc'
UPPER('abc')  -- 'ABC'
INITCAP in some engines for Title Case
Cutting and locating text
-- Substring
SUBSTRING(col FROM 1 FOR 5)      -- ANSI style
SUBSTRING(col, 1, 5)             -- Many engines
LEFT(col, 3) / RIGHT(col, 2)     -- Some engines
-- Find position of a substring
POSITION('@' IN email)           -- ANSI / Postgres
INSTR(email, '@')                -- Oracle / BigQuery / MySQL
CHARINDEX('@', email)            -- SQL Server
Joining and replacing
CONCAT(first_name, ' ', last_name)
REPLACE(text, '-', ' ')
Pattern matching: LIKE and regex-style
  • LIKE: % for any length, _ for single char. Example: product_name LIKE '%Pro%'
  • Case-insensitive: ILIKE (PostgreSQL) or LOWER(product_name) LIKE '%pro%'
  • Regex-style (varies): - PostgreSQL: col ~ '^[A-Z]{3}\d{4}$' - BigQuery/Oracle: REGEXP_CONTAINS(col, r'^[A-Z]{3}\d{4}$') - MySQL: col REGEXP '^[A-Z]{3}[0-9]{4}$'
-- Extract with regex (examples)
-- BigQuery/Oracle: REGEXP_EXTRACT(col, '(?i)@(.+)$')
-- Postgres: REGEXP_REPLACE(col, '.*@', '')  -- keep domain

Worked examples

1) Standardize emails and get domains

Goal: Lowercase emails, remove spaces, and extract domains for grouping.

-- Vendor-agnostic approach using POSITION/INSTR/CHARINDEX alternatives
SELECT
  LOWER(TRIM(email)) AS email_clean,
  -- ANSI/Postgres
  CASE WHEN POSITION('@' IN TRIM(email)) > 0
       THEN SUBSTRING(TRIM(email) FROM POSITION('@' IN TRIM(email)) + 1)
       ELSE NULL END AS domain_pos,
  -- BigQuery/Oracle/MySQL style
  CASE WHEN INSTR(TRIM(email), '@') > 0
       THEN SUBSTRING(TRIM(email), INSTR(TRIM(email), '@') + 1)
       ELSE NULL END AS domain_instr
FROM users;
Why this works

Trim removes stray spaces, LOWER standardizes casing, and substring-from-@ extracts domains. Use the locator function your engine supports.

2) Validate SKU format ABC-12345

Goal: Only keep SKUs with exactly 3 uppercase letters, a dash, and 5 digits.

-- PostgreSQL
SELECT * FROM products WHERE sku ~ '^[A-Z]{3}-\d{5}$';
-- BigQuery/Oracle
SELECT * FROM products WHERE REGEXP_CONTAINS(sku, r'^[A-Z]{3}-\d{5}$');
-- MySQL
SELECT * FROM products WHERE sku REGEXP '^[A-Z]{3}-[0-9]{5}$';
-- Portable fallback (less strict): LIKE plus length checks
SELECT *
FROM products
WHERE sku LIKE '___-%'
  AND LENGTH(sku) = 9
  AND sku NOT LIKE '% %';
Trade-offs

Regex checks exact pattern; LIKE version approximates and may allow false positives. Prefer regex if available.

3) Extract first and last names

Goal: For "First Middle Last" keep First and Last.

-- Use spaces as separators; engine-specific position function
-- Postgres (POSITION)
SELECT
  SUBSTRING(full_name FROM 1 FOR POSITION(' ' IN full_name) - 1) AS first_name,
  SPLIT_PART(full_name, ' ', CARDINALITY(regexp_split_to_array(full_name, '\\s+'))) AS last_name
FROM people;

-- BigQuery
SELECT
  SPLIT(full_name, ' ')[SAFE_OFFSET(0)] AS first_name,
  SPLIT(full_name, ' ')[OFFSET(ARRAY_LENGTH(SPLIT(full_name, ' ')) - 1)] AS last_name
FROM people;

-- Portable idea: take text before first space and after last space
-- Using INSTR variant
SELECT
  SUBSTRING(full_name, 1, INSTR(full_name, ' ') - 1) AS first_name,
  SUBSTRING(full_name, LENGTH(full_name) - INSTR(REVERSE(full_name), ' ') + 2) AS last_name
FROM people;
Notes

Real names are messy. These patterns handle common cases, but always inspect edge cases (single names, extra spaces).

Exercises

Try these hands-on tasks. Results should match the "Expected output" shown in each exercise card below.

  • Exercise 1: Clean emails and count domains.
  • Exercise 2: Validate phone numbers with pattern matching.
  • Exercise 3: Extract code parts with regex.
Readiness checklist
  • I can trim, lowercase, and replace characters.
  • I can find positions and take substrings.
  • I can write LIKE patterns with % and _.
  • I can write a basic regex for digits, letters, and anchors ^ $.

Common mistakes

  • Using LIKE when you need exact structure: LIKE '%-%' matches almost anything with a dash. Prefer regex when format matters.
  • Forgetting anchors: Without ^ and $, '123' matches '\\d{3}' anywhere. Anchors enforce full-string match.
  • Case sensitivity surprises: LIKE may be case-sensitive or not, depending on collation. For portability, use LOWER on both sides.
  • Hidden whitespace: Non-breaking spaces or trailing tabs cause mismatches. TRIM and REPLACE double-checks help.
  • Greedy regex capture: Use specific quantifiers like {2,4} instead of .*

How to self-check your SQL

  • Preview a small sample with SELECT ... LIMIT to verify patterns are correct.
  • Count matched vs unmatched rows to see if numbers make sense.
  • Manually spot-check a few edge cases (empty, very long, strange characters).
  • Build negative tests: search for rows that should not match and confirm 0 results.

Practical projects

  • Contact hygiene mini-pipeline: Standardize names, emails, phones, and flag invalid records.
  • Product code parser: Split SKUs into family, region, and sequence and build summary tables.
  • Campaign audit: Identify naming deviations from agreed conventions and propose fixes.

Mini challenge

Write a query that finds rows in a table messages(text) where the text contains exactly one hashtag (e.g., #promo). Return that hashtag. Hint: Use regex to count or to extract and compare.

Hint
-- Extract first hashtag and ensure replacing it removes all '#'

Who this is for

Beginner to intermediate Data Analysts who need reliable text cleaning and validation in SQL.

Prerequisites

  • Basic SELECT, WHERE, GROUP BY.
  • Comfort reading function docs for your SQL engine.

Learning path

  1. Master TRIM, LOWER/UPPER, REPLACE, SUBSTRING, POSITION/INSTR/CHARINDEX.
  2. Use LIKE and case-insensitive matches.
  3. Adopt regex-style predicates for precise validation and extraction.
  4. Build reusable snippets for your company’s most common patterns.

Next steps

  • Convert your most frequent cleaning steps into views or saved queries.
  • Create a small "data quality dashboard" counting valid vs invalid formats.
  • Take the quick test to confirm mastery.

Quick Test and progress

The quick test is available to everyone. If you log in, your progress and results will be saved.

Practice Exercises

3 exercises to complete

Instructions

You have a table users(id INT, email TEXT) with sample rows like: ' ALICE@GMAIL.com ', 'bob@yahoo.com', 'carol@outlook.COM', 'bad-email-without-at'.

  1. Clean email: trim spaces and lowercase to email_clean.
  2. Extract domain (text after '@') to domain.
  3. Return domain counts, excluding rows without a valid '@'.

Write one query that outputs domain and cnt, ordered by cnt desc.

Expected Output
domain | cnt ------ | --- gmail.com | 1 yahoo.com | 1 outlook.com | 1

String Functions and Pattern Matching Like Regex — Quick Test

Test your knowledge with 10 questions. Pass with 70% or higher.

10 questions70% to pass

Have questions about String Functions and Pattern Matching Like Regex?

AI Assistant

Ask questions about this tool