Why this matters
As a Data Analyst, you constantly receive columns that look numeric but are stored as text, dates with mixed formats, or IDs that lose leading zeros. Clean, correct dtypes are the foundation for reliable analysis, joins, filtering, aggregations, and time-series work.
- Compute revenue correctly by converting currency strings to numbers.
- Build time-series charts by parsing dates (and time zones) properly.
- Prevent broken merges by keeping IDs as strings with leading zeros.
- Speed up grouping and memory with categorical dtypes.
Who this is for
- Beginner to intermediate analysts using pandas.
- Anyone cleaning CSV/Excel data before analysis or dashboards.
Prerequisites
- Basic Python and pandas (Series/DataFrame, selecting columns).
- Comfort using Jupyter/Notebook or a Python REPL.
Concept explained simply
pandas dtypes describe how data is stored and how operations behave. You change dtypes to make pandas treat the data correctly.
Mental model
Imagine your DataFrame as a set of labeled pipes carrying values. Each pipe has a shape (dtype): numeric pipes allow math to flow, datetime pipes allow calendar/time logic, categorical pipes are labeled buckets, and object/string pipes are generic boxes. Converting dtypes is changing a pipe to the right shape so operations fit and flow.
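For instance, the same two digits flow differently through an object pipe and a numeric pipe:

```python
import pandas as pd

print(pd.Series(["1", "2"]).sum())              # "12" -- object pipe: string concatenation
print(pd.Series(["1", "2"]).astype(int).sum())  # 3    -- numeric pipe: real math
```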
Quick reference
- Force dtype exactly: `df["col"] = df["col"].astype("float64")`, or use `"Int64"` (nullable integer), `"string"`, `"category"`, `"boolean"`.
- Parse messy numbers: `pd.to_numeric(series, errors="coerce")` turns bad values into `NaN`.
- Parse dates and times: `pd.to_datetime(series, errors="coerce", dayfirst=True, utc=True)`.
- Time zones: localize naive times with `.dt.tz_localize("Europe/London")`, then convert with `.dt.tz_convert("UTC")`.
- Keep leading zeros: use `.astype("string")` (not int) for IDs.
- Fast grouping/filtering on columns with few unique values: `.astype("category")`.
Worked examples
1) Currency strings to numeric
```python
import pandas as pd

s = pd.Series(["$1,299.99", "$59", "—", "7.5", None])

# Strip everything except digits, the decimal point, and a minus sign,
# then turn empty results (like the bare dash) into missing values.
clean = (s.str.replace(r"[^0-9.\-]", "", regex=True)
          .replace("", pd.NA))

num = pd.to_numeric(clean, errors="coerce")
print(num)
print(num.dtype)  # float64
```
Why: remove symbols first, then coerce invalid entries to NaN so math works.
Mini task
Round the results with `num.round(2)` and compute the total using `num.sum(min_count=1)` to keep the sum NaN-safe.
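A minimal sketch of that task, reusing `num` from the example above:

```python
total = num.round(2).sum(min_count=1)  # NaN rows are skipped
print(total)  # 1366.49

# min_count=1 makes an all-NaN series sum to NaN instead of 0:
print(pd.Series([float("nan")]).sum(min_count=1))  # nan
```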
2) Parsing day-first dates and time zones
```python
dates = pd.Series(["31/12/2023 23:45", "01/01/2024 09:10", "bad"])
# dayfirst=True reads 31/12/2023 as 31 December; "bad" becomes NaT.
when = pd.to_datetime(dates, errors="coerce", dayfirst=True)
print(when)
print(when.dtype)  # datetime64[ns]

# Assume these timestamps are already in UTC strings:
shipped = pd.Series(["2024-01-02T08:15Z", None, "2024-01-03T12:00Z"])
shipped_ts = pd.to_datetime(shipped, errors="coerce", utc=True)
print(shipped_ts.dtype)  # datetime64[ns, UTC]
```
Mini task
Select the 31 December 2023 rows with `when.dt.date == pd.Timestamp("2023-12-31").date()`. If your times are local (naive), localize then convert: `when_local = when.dt.tz_localize("Europe/London").dt.tz_convert("UTC")` (only for non-UTC naive times).
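A minimal sketch of localize-then-convert, assuming the naive times are London wall-clock times:

```python
import pandas as pd

naive = pd.to_datetime(pd.Series(["31/12/2023 23:45"]), dayfirst=True)
utc = naive.dt.tz_localize("Europe/London").dt.tz_convert("UTC")
print(utc)  # 2023-12-31 23:45:00+00:00 (London is on UTC+0 in winter)
```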
3) Keep IDs and optimize categories
```python
df = pd.DataFrame({
    "customer_id": ["0007", "0012", "0100"],
    "segment": ["A", "B", "A"],
})
df["customer_id"] = df["customer_id"].astype("string")  # preserves leading zeros
df["segment"] = df["segment"].astype("category")        # few unique labels, cheap to store
print(df.dtypes)
```
Use the string dtype for IDs (not numbers). Use category for short, repeating labels to reduce memory and speed up grouping.
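To see the category savings yourself, compare memory on a larger column (exact sizes vary by machine and pandas version):

```python
labels = pd.Series(["A", "B", "A"] * 100_000)              # object dtype
print(labels.memory_usage(deep=True))                      # one Python string per row
print(labels.astype("category").memory_usage(deep=True))   # small integer codes + tiny lookup table
```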
4) Boolean from Y/N with missing values
```python
raw = pd.Series(["Y", "N", None, "Y"])
# Values missing from the mapping become NaN; "boolean" keeps them as <NA>.
yn = raw.map({"Y": True, "N": False}).astype("boolean")
print(yn)
print(yn.dtype)  # boolean (nullable)
```
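One caveat worth knowing: before filtering with a nullable boolean, decide what missing should mean. A sketch that explicitly treats `<NA>` as False, reusing `raw` and `yn` from above:

```python
mask = yn.fillna(False)  # explicit choice: unknown answers are excluded
print(raw[mask])         # keeps only the confirmed "Y" rows
```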
Common mistakes and self-checks
- Using `.astype(int)` on a column with NaN: this raises an error. Use `.astype("Int64")` (nullable) or fill NaN first (see the sketch after this list).
- Forgetting `errors="coerce"` in `to_numeric`/`to_datetime`: bad values will crash the conversion; coerce them to NaN instead.
- Parsing day-first dates incorrectly: pass `dayfirst=True` for formats like `31/12/2023`.
- Confusing `tz_localize` with `tz_convert`: localize naive timestamps to their correct original time zone, then convert to the target time zone (often UTC).
- Treating IDs as numbers: you can lose leading zeros and sorting semantics. Keep them as `string`.
- Leaving numeric-looking columns as `object`: aggregations and comparisons may be wrong or slow.
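A minimal sketch of the first pitfall (the exact exception name varies by pandas version):

```python
import pandas as pd

s = pd.Series([1.0, float("nan"), 3.0])
try:
    s.astype(int)              # plain int has no slot for NaN
except (ValueError, TypeError) as exc:
    print(type(exc).__name__)  # e.g., IntCastingNaNError (a ValueError subclass)
print(s.astype("Int64"))       # nullable integer keeps <NA>
```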
Self-check routine
- Print `df.dtypes` and confirm each column's dtype is intentional.
- Run `df.apply(pd.api.types.infer_dtype)` to spot mixed types.
- After conversion, validate ranges (e.g., prices >= 0) and date bounds (e.g., not before 1970 unless expected).
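For example, on a hypothetical frame with one sneaky mixed column:

```python
import pandas as pd

df = pd.DataFrame({"price": [1.0, 2.5], "qty": [1, "2"]})
print(df.apply(pd.api.types.infer_dtype))
# price         floating
# qty      mixed-integer
```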
Practice exercises
Do the two exercises below. Then take the quick test to confirm mastery.
- Exercise 1 — Clean an orders dataset: convert currency, parse dates with day-first, keep IDs as strings, handle UTC timestamps, and build a nullable boolean.
- Exercise 2 — Repair messy numeric survey ages: coerce bad values to NaN, fill with the median, and store as nullable integer.
- [ ] I used `to_numeric(..., errors="coerce")` to safely parse numbers.
- [ ] I used `to_datetime` with the right options (`dayfirst`, `utc`).
- [ ] I chose `string` for IDs and `category` for short label columns.
- [ ] I handled missing values before forcing strict dtypes.
Need a hint?
For numbers: strip symbols with `str.replace(r"[^0-9.\-]", "", regex=True)`, replace empty strings with `pd.NA`, then apply `to_numeric`. For dates: if strings end with `Z`, use `utc=True`.
Practical projects
- Sales cleanup notebook: load a CSV with prices like `$1,234.50`, quantities as text, and `DD/MM/YYYY` dates. Convert, validate ranges, and compute monthly revenue.
- Event timeline: parse event times from multiple sources (some UTC, some local). Localize, convert to UTC, and build an hourly event count series.
- Customer segmentation prep: convert ID columns to `string`, categorical-encode 3 label columns, and benchmark groupby speed before/after categories.
Learning path
- Now: Data types and conversions (this lesson).
- Next: Missing values (identify, impute, nullable dtypes).
- Then: String methods for cleaning text columns.
- Finally: Date/time operations (resampling, rolling windows, time zone workflows).
Next steps
- Refactor an existing notebook to make dtypes explicit at the top (one cell that cleans and converts; see the sketch below).
- Add assertions after conversions (e.g., `assert (df.price.dropna() >= 0).all()` to ignore NaN).
- Take the quick test to lock in your knowledge.
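A minimal sketch of such a cell, with hypothetical column names (adapt to your own data):

```python
import pandas as pd

def clean_dtypes(df: pd.DataFrame) -> pd.DataFrame:
    """One explicit place where every dtype decision lives."""
    out = df.copy()
    out["price"] = pd.to_numeric(out["price"], errors="coerce")
    out["order_date"] = pd.to_datetime(out["order_date"], errors="coerce", dayfirst=True)
    out["customer_id"] = out["customer_id"].astype("string")
    out["segment"] = out["segment"].astype("category")
    assert (out["price"].dropna() >= 0).all(), "negative prices found"
    return out
```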
Mini challenge
You receive three columns: net_price like "€2.499,90" (European format), order_time like "2024/03/01 18:30" (local Berlin time), and tier from {"free","pro","pro","free","enterprise"}. Convert them to numeric, timezone-aware UTC datetimes, and categories in the most memory-efficient way. Document each step in comments. Tip: replace thousand/decimal separators (e.g., . as thousands and , as decimal).
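One possible approach (a sketch, not the only valid answer; the sample values are made up):

```python
import pandas as pd

df = pd.DataFrame({
    "net_price": ["€2.499,90", "€59,00", "€1.000,00"],
    "order_time": ["2024/03/01 18:30", "2024/03/02 09:15", "2024/03/03 12:00"],
    "tier": ["free", "pro", "enterprise"],
})

# 1) European number format: drop the euro sign, remove "." thousands
#    separators, then turn the "," decimal separator into ".".
df["net_price"] = pd.to_numeric(
    df["net_price"]
      .str.replace("€", "", regex=False)
      .str.replace(".", "", regex=False)
      .str.replace(",", ".", regex=False),
    errors="coerce",
)

# 2) Naive Berlin wall-clock times -> timezone-aware UTC.
df["order_time"] = (
    pd.to_datetime(df["order_time"], format="%Y/%m/%d %H:%M", errors="coerce")
      .dt.tz_localize("Europe/Berlin")
      .dt.tz_convert("UTC")
)

# 3) Few repeated labels -> category for memory efficiency.
df["tier"] = df["tier"].astype("category")

print(df.dtypes)
```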
Ready for the quick test?
Answer a few questions to see if you can spot the right conversion in real scenarios.