Why this matters
As a Data Analyst, you constantly receive columns that look numeric but are stored as text, dates with mixed formats, or IDs that lose leading zeros. Clean, correct dtypes are the foundation for reliable analysis, joins, filtering, aggregations, and time-series work.
- Compute revenue correctly by converting currency strings to numbers.
- Build time-series charts by parsing dates (and time zones) properly.
- Prevent broken merges by keeping IDs as strings with leading zeros.
- Speed up grouping and memory with categorical dtypes.
Who this is for
- Beginner to intermediate analysts using pandas.
- Anyone cleaning CSV/Excel data before analysis or dashboards.
Prerequisites
- Basic Python and pandas (Series/DataFrame, selecting columns).
- Comfort using Jupyter/Notebook or a Python REPL.
Concept explained simply
pandas dtypes describe how data is stored and how operations behave. You change dtypes to make pandas treat the data correctly.
Mental model
Imagine your DataFrame as a set of labeled pipes carrying values. Each pipe has a shape (dtype): numeric pipes allow math to flow, datetime pipes allow calendar/time logic, categorical pipes are labeled buckets, and object/string pipes are generic boxes. Converting dtypes is changing a pipe to the right shape so operations fit and flow.
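For instance, the same two digits flow differently through an object pipe and a numeric pipe:

```python
import pandas as pd

print(pd.Series(["1", "2"]).sum())              # "12" -- object pipe: string concatenation
print(pd.Series(["1", "2"]).astype(int).sum())  # 3    -- numeric pipe: real math
```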
Quick reference
- Force dtype exactly: `df["col"] = df["col"].astype("float64")`, or use `"Int64"` (nullable integer), `"string"`, `"category"`, `"boolean"`.
- Parse messy numbers: `pd.to_numeric(series, errors="coerce")` turns bad values into `NaN`.
- Parse dates and times: `pd.to_datetime(series, errors="coerce", dayfirst=True, utc=True)`.
- Time zones: localize naive times with `.dt.tz_localize("Europe/London")`, then convert with `.dt.tz_convert("UTC")`.
- Keep leading zeros: use `.astype("string")` (not int) for IDs.
- Fast grouping/filtering on columns with few unique values: `.astype("category")`.
Worked examples
1) Currency strings to numeric
```python
import pandas as pd

s = pd.Series(["$1,299.99", "$59", "—", "7.5", None])

# Strip everything except digits, the decimal point, and a minus sign,
# then turn empty results (like the bare dash) into missing values.
clean = (s.str.replace(r"[^0-9.\-]", "", regex=True)
          .replace("", pd.NA))

num = pd.to_numeric(clean, errors="coerce")
print(num)
print(num.dtype)  # float64
```
Why: remove symbols first, then coerce invalid entries to NaN so math works.
Mini task
Round the results with `num.round(2)` and compute the total using `num.sum(min_count=1)` to keep the sum NaN-safe.
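A minimal sketch of that task, reusing `num` from the example above:

```python
total = num.round(2).sum(min_count=1)  # NaN rows are skipped
print(total)  # 1366.49

# min_count=1 makes an all-NaN series sum to NaN instead of 0:
print(pd.Series([float("nan")]).sum(min_count=1))  # nan
```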
2) Parsing day-first dates and time zones
```python
dates = pd.Series(["31/12/2023 23:45", "01/01/2024 09:10", "bad"])
# dayfirst=True reads 31/12/2023 as 31 December; "bad" becomes NaT.
when = pd.to_datetime(dates, errors="coerce", dayfirst=True)
print(when)
print(when.dtype)  # datetime64[ns]

# Assume these timestamps are already in UTC strings:
shipped = pd.Series(["2024-01-02T08:15Z", None, "2024-01-03T12:00Z"])
shipped_ts = pd.to_datetime(shipped, errors="coerce", utc=True)
print(shipped_ts.dtype)  # datetime64[ns, UTC]
```
Mini task
Select the 31 December 2023 rows with `when.dt.date == pd.Timestamp("2023-12-31").date()`. If your times are local (naive), localize then convert: `when_local = when.dt.tz_localize("Europe/London").dt.tz_convert("UTC")` (only for non-UTC naive times).
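A minimal sketch of localize-then-convert, assuming the naive times are London wall-clock times:

```python
import pandas as pd

naive = pd.to_datetime(pd.Series(["31/12/2023 23:45"]), dayfirst=True)
utc = naive.dt.tz_localize("Europe/London").dt.tz_convert("UTC")
print(utc)  # 2023-12-31 23:45:00+00:00 (London is on UTC+0 in winter)
```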
3) Keep IDs and optimize categories
```python
df = pd.DataFrame({
    "customer_id": ["0007", "0012", "0100"],
    "segment": ["A", "B", "A"],
})
df["customer_id"] = df["customer_id"].astype("string")  # preserves leading zeros
df["segment"] = df["segment"].astype("category")        # few unique labels, cheap to store
print(df.dtypes)
```
Use the string dtype for IDs (not numbers). Use category for short, repeating labels to reduce memory and speed up grouping.
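To see the category savings yourself, compare memory on a larger column (exact sizes vary by machine and pandas version):

```python
labels = pd.Series(["A", "B", "A"] * 100_000)              # object dtype
print(labels.memory_usage(deep=True))                      # one Python string per row
print(labels.astype("category").memory_usage(deep=True))   # small integer codes + tiny lookup table
```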
4) Boolean from Y/N with missing values
```python
raw = pd.Series(["Y", "N", None, "Y"])
# Values missing from the mapping become NaN; "boolean" keeps them as <NA>.
yn = raw.map({"Y": True, "N": False}).astype("boolean")
print(yn)
print(yn.dtype)  # boolean (nullable)
```
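One caveat worth knowing: before filtering with a nullable boolean, decide what missing should mean. A sketch that explicitly treats `<NA>` as False, reusing `raw` and `yn` from above:

```python
mask = yn.fillna(False)  # explicit choice: unknown answers are excluded
print(raw[mask])         # keeps only the confirmed "Y" rows
```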
Common mistakes and self-checks
- Using `.astype(int)` on a column with NaN: this raises an error. Use `.astype("Int64")` (nullable) or fill NaN first (see the sketch after this list).
- Forgetting `errors="coerce"` in `to_numeric`/`to_datetime`: bad values will crash the conversion; coerce them to NaN instead.
- Parsing day-first dates incorrectly: pass `dayfirst=True` for formats like `31/12/2023`.
- Confusing `tz_localize` with `tz_convert`: localize naive timestamps to their correct original time zone, then convert to the target time zone (often UTC).
- Treating IDs as numbers: you can lose leading zeros and sorting semantics. Keep them as `string`.
- Leaving numeric-looking columns as `object`: aggregations and comparisons may be wrong or slow.
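A minimal sketch of the first pitfall (the exact exception name varies by pandas version):

```python
import pandas as pd

s = pd.Series([1.0, float("nan"), 3.0])
try:
    s.astype(int)              # plain int has no slot for NaN
except (ValueError, TypeError) as exc:
    print(type(exc).__name__)  # e.g., IntCastingNaNError (a ValueError subclass)
print(s.astype("Int64"))       # nullable integer keeps <NA>
```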
Self-check routine
- Print `df.dtypes` and confirm each column's dtype is intentional.
- Run `df.apply(pd.api.types.infer_dtype)` to spot mixed types.
- After conversion, validate ranges (e.g., prices >= 0) and date bounds (e.g., not before 1970 unless expected).
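For example, on a hypothetical frame with one sneaky mixed column:

```python
import pandas as pd

df = pd.DataFrame({"price": [1.0, 2.5], "qty": [1, "2"]})
print(df.apply(pd.api.types.infer_dtype))
# price         floating
# qty      mixed-integer
```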
Practice exercises
Do the two exercises below. Then take the quick test to confirm mastery.
- Exercise 1 — Clean an orders dataset: convert currency, parse dates with day-first, keep IDs as strings, handle UTC timestamps, and build a nullable boolean.
- Exercise 2 — Repair messy numeric survey ages: coerce bad values to NaN, fill with the median, and store as nullable integer.
- [ ] I used `to_numeric(..., errors="coerce")` to safely parse numbers.
- [ ] I used `to_datetime` with the right options (`dayfirst`, `utc`).
- [ ] I chose `string` for IDs and `category` for short label columns.
- [ ] I handled missing values before forcing strict dtypes.
Need a hint?
For numbers: strip symbols with `str.replace(r"[^0-9.\-]", "", regex=True)`, replace empty strings with `pd.NA`, then apply `to_numeric`. For dates: if strings end with `Z`, use `utc=True`.
Practical projects
- Sales cleanup notebook: load a CSV with prices like `$1,234.50`, quantities as text, and `DD/MM/YYYY` dates. Convert, validate ranges, and compute monthly revenue.
- Event timeline: parse event times from multiple sources (some UTC, some local). Localize, convert to UTC, and build an hourly event count series.
- Customer segmentation prep: convert ID columns to `string`, categorical-encode 3 label columns, and benchmark groupby speed before/after categories.
Learning path
- Now: Data types and conversions (this lesson).
- Next: Missing values (identify, impute, nullable dtypes).
- Then: String methods for cleaning text columns.
- Finally: Date/time operations (resampling, rolling windows, time zone workflows).
Next steps
- Refactor an existing notebook to make dtypes explicit at the top (one cell that cleans and converts; see the sketch below).
- Add assertions after conversions (e.g., `assert (df.price.dropna() >= 0).all()` to ignore NaN).
- Take the quick test to lock in your knowledge.
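A minimal sketch of such a cell, with hypothetical column names (adapt to your own data):

```python
import pandas as pd

def clean_dtypes(df: pd.DataFrame) -> pd.DataFrame:
    """One explicit place where every dtype decision lives."""
    out = df.copy()
    out["price"] = pd.to_numeric(out["price"], errors="coerce")
    out["order_date"] = pd.to_datetime(out["order_date"], errors="coerce", dayfirst=True)
    out["customer_id"] = out["customer_id"].astype("string")
    out["segment"] = out["segment"].astype("category")
    assert (out["price"].dropna() >= 0).all(), "negative prices found"
    return out
```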
Mini challenge
You receive three columns: net_price like "€2.499,90" (European format), order_time like "2024/03/01 18:30" (local Berlin time), and tier from {"free","pro","pro","free","enterprise"}. Convert them to numeric, timezone-aware UTC datetimes, and categories in the most memory-efficient way. Document each step in comments. Tip: replace thousand/decimal separators (e.g., . as thousands and , as decimal).
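One possible approach (a sketch, not the only valid answer; the sample values are made up):

```python
import pandas as pd

df = pd.DataFrame({
    "net_price": ["€2.499,90", "€59,00", "€1.000,00"],
    "order_time": ["2024/03/01 18:30", "2024/03/02 09:15", "2024/03/03 12:00"],
    "tier": ["free", "pro", "enterprise"],
})

# 1) European number format: drop the euro sign, remove "." thousands
#    separators, then turn the "," decimal separator into ".".
df["net_price"] = pd.to_numeric(
    df["net_price"]
      .str.replace("€", "", regex=False)
      .str.replace(".", "", regex=False)
      .str.replace(",", ".", regex=False),
    errors="coerce",
)

# 2) Naive Berlin wall-clock times -> timezone-aware UTC.
df["order_time"] = (
    pd.to_datetime(df["order_time"], format="%Y/%m/%d %H:%M", errors="coerce")
      .dt.tz_localize("Europe/Berlin")
      .dt.tz_convert("UTC")
)

# 3) Few repeated labels -> category for memory efficiency.
df["tier"] = df["tier"].astype("category")

print(df.dtypes)
```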
Ready for the quick test?
Answer a few questions to see if you can spot the right conversion in real scenarios.