Why this matters
As a Data Analyst, you constantly move data between files, spreadsheets, APIs, and databases. Mastering pandas I/O lets you:
- Pull clean datasets quickly from CSV/Excel/JSON.
- Save analysis-ready outputs in efficient formats (Parquet) or back to SQL.
- Handle large files safely with chunking and correct dtypes.
Real tasks you’ll meet
- Import weekly sales CSVs, standardize columns, and load into a reporting database.
- Read budget spreadsheets with multiple sheets, merge, and export a clean summary.
- Ingest JSON from an API, normalize nested fields, and save as Parquet for fast querying.
Who this is for
- Aspiring and junior Data Analysts using Python.
- Analysts switching from Excel to pandas.
- Anyone who needs reliable read/write pipelines.
Prerequisites
- Basic Python (variables, functions).
- Basic pandas (DataFrame, columns, filters).
- Python installed; optional: engines like pyarrow (Parquet) and openpyxl (Excel).
Concept explained simply
pandas I/O is about two things: reading into DataFrames and writing DataFrames out. Each format has a reader (pd.read_csv, pd.read_excel, pd.read_json, pd.read_parquet, pd.read_sql) and a writer (to_csv, to_excel, to_json, to_parquet, to_sql). You control parsing (dtypes, dates, columns) and performance (chunksize, compression).
Mental model
- Define the shape: choose columns (usecols) and types (dtype) early to avoid surprises.
- Decode correctly: specify encoding, separators, decimal, thousands, and date parsing.
- Scale safely: use chunksize for big files, Parquet for speed/size, SQL for storage.
Mini cheat sheet
# CSV
df = pd.read_csv("file.csv", usecols=["id","date","amount"], dtype={"id":"int64","amount":"float64"}, parse_dates=["date"], na_values=["", "NA", "null"])
df.to_csv("out.csv", index=False)
# Excel
df = pd.read_excel("book.xlsx", sheet_name="Data", usecols="A:D", dtype={"id":"Int64"})
df.to_excel("out.xlsx", sheet_name="Clean", index=False)
# JSON (records lines)
df = pd.read_json("data.jsonl", lines=True)
df.to_json("out.json", orient="records", lines=True)
# Parquet (requires pyarrow or fastparquet)
df = pd.read_parquet("data.parquet")
df.to_parquet("out.parquet", compression="snappy", index=False)
# SQL (example: SQLite)
import sqlite3
con = sqlite3.connect(":memory:")
df.to_sql("table_name", con, if_exists="replace", index=False)
q = pd.read_sql_query("SELECT * FROM table_name WHERE amount > ?", con, params=(1000,))
Worked examples
Example 1: Robust CSV import/export
import pandas as pd
from io import StringIO
csv_data = StringIO("""id,date,region,amount
1,2024-01-10,EU,1200.50
2,2024/01/11,US,NA
3,10-01-2024,EU,950
""")
# Read with custom NA values; leave dates as strings for now
df = pd.read_csv(csv_data, na_values=["NA"])
# Mixed date formats: parse what pandas can and coerce the rest to NaT
df["date"] = pd.to_datetime(df["date"], errors="coerce")
# Clean and export
clean = df.dropna(subset=["amount"]).assign(amount=lambda d: d["amount"].astype(float))
clean.to_csv("clean_sales.csv", index=False)
print(clean.dtypes)
Key ideas: handle NA, parse dates, coerce mismatches, export without index.
Example 2: Excel with multiple sheets
import pandas as pd
# Read specific sheet and range
df_q1 = pd.read_excel("budget.xlsx", sheet_name="Q1", usecols="A:F", skiprows=1)
df_q2 = pd.read_excel("budget.xlsx", sheet_name="Q2", usecols="A:F", skiprows=1)
# Combine and write to a new workbook
budget = pd.concat([df_q1, df_q2], ignore_index=True)
with pd.ExcelWriter("budget_clean.xlsx", engine="openpyxl") as writer:
    budget.to_excel(writer, sheet_name="All", index=False)
    (budget.groupby("department")["spend"].sum()
        .reset_index()
        .to_excel(writer, sheet_name="Summary", index=False))
Tip: Use ExcelWriter to save multiple sheets in one file.
Example 3: JSON and JSON Lines
import pandas as pd
from pandas import json_normalize
# JSON Lines (one JSON per line)
df = pd.read_json("events.jsonl", lines=True)
# Nested JSON to columns
nested = {"user": {"id": 1, "name": "Ana"}, "purchase": {"total": 19.9, "items": 3}}
flat = json_normalize(nested) # columns like user.id, user.name, purchase.total
# Export as lines for streaming systems
df.to_json("events_out.jsonl", orient="records", lines=True)
print(flat.head())
Choose lines=True for newline-delimited JSON; json_normalize flattens nested objects.
Example 4: Parquet for speed and size
import pandas as pd
# Requires pyarrow (recommended)
df = pd.read_csv("large.csv", usecols=["id","date","amount"], parse_dates=["date"], dtype={"id":"int64","amount":"float64"})
df.to_parquet("large.parquet", compression="snappy", index=False)
fast = pd.read_parquet("large.parquet")
print(len(fast))
Parquet is columnar and compressed, which makes it great for analytics. If you get an engine error, install pyarrow.
Example 5: SQL roundtrip (SQLite)
import pandas as pd
import sqlite3
con = sqlite3.connect(":memory:")
sales = pd.DataFrame({
    "id": [1,2,3,4],
    "region": ["EU","US","EU","APAC"],
    "amount": [1200, 800, 650, 300]
})
# Write to SQL table
sales.to_sql("sales", con, if_exists="replace", index=False)
# Safe parameterized query
q = "SELECT region, SUM(amount) AS total FROM sales WHERE amount > ? GROUP BY region"
summary = pd.read_sql_query(q, con, params=(600,))
print(summary)
Use parameterized queries (params=...) to avoid SQL injection and parsing issues.
Performance and safety tips
- Limit columns with usecols and set dtype early to reduce memory.
- Use chunksize in read_csv for very large files; process chunk by chunk.
- Prefer Parquet for intermediate analytics storage; it’s compact and fast.
- Specify encoding (e.g., encoding="utf-8" or "latin-1") for non-UTF files.
- Be explicit with decimal="," and thousands="." when numbers use European-style separators (see the sketch below).
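For example, here is a minimal sketch of reading a European-style CSV; the file name and column names are hypothetical:
import pandas as pd
# Hypothetical export: semicolon-separated, latin-1 encoded, numbers like 1.234,56
eu = pd.read_csv(
    "sales_eu.csv",        # hypothetical file
    sep=";",               # semicolon instead of comma
    encoding="latin-1",    # source file is not UTF-8
    decimal=",",           # comma as decimal mark
    thousands=".",         # dot as thousands separator
    parse_dates=["date"],
    dayfirst=True,         # dates like 31-01-2024
)
print(eu.dtypes)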
Chunking pattern
import pandas as pd
acc = []
for chunk in pd.read_csv("big.csv", chunksize=100_000, usecols=["id","amount"], dtype={"id":"int64","amount":"float64"}):
    acc.append(chunk[chunk["amount"] > 0])
result = pd.concat(acc, ignore_index=True)
result.to_parquet("positive_amounts.parquet", index=False)
Exercises
These mirror the tasks below. Try them first, then open the solutions.
- CSV to Parquet clean-up
- Load sales.csv, parse date column, coerce bad dates.
- Filter to positive amounts and add a revenue column equal to amount.
- Save cleaned.parquet with snappy compression and summary.csv without index.
- SQL roundtrip aggregation
- Create a DataFrame of orders and push to SQLite.
- Query total revenue per customer with a parameterized minimum amount.
- Export the result to Excel with a sheet name Summary.
Self-check checklist
- No stray DataFrame index saved to files.
- Dates parsed correctly; no unexpected object dtype for dates.
- Numeric columns have numeric dtypes (int/float), not object.
- Output files open without errors; row counts match expectations.
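A quick way to run these checks is to reopen the file you just wrote; this sketch reuses the clean_sales.csv output from Example 1:
import pandas as pd
check = pd.read_csv("clean_sales.csv", parse_dates=["date"])
print(check.columns.tolist())  # should not contain "Unnamed: 0"
print(check.dtypes)            # date should be datetime64, amount float64 (not object)
print(len(check))              # row count should match what you exported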
Common mistakes and how to self-check
- Forgetting index=False when saving, which adds an extra "Unnamed: 0" column on re-read. Check by reopening the file and inspecting columns.
- Wrong encoding on read_csv. If you see weird characters, try encoding="latin-1" or inspect the source encoding.
- Mixed date formats causing object dtype. Use pd.to_datetime(..., errors="coerce") and check df["date"].dtype.
- Excel engine missing. If you see an engine error, install openpyxl and retry.
- JSON vs JSON Lines confusion. If each line is a JSON object, use lines=True.
- Parquet engine not installed. Install pyarrow for best results.
- Building SQL strings with f-strings. Use params= to avoid SQL injection and formatting issues.
Practical projects
- Monthly pipeline: Read weekly CSVs, combine, clean, and write a monthly Parquet plus an Excel summary.
- Spreadsheet audit: Load multiple Excel sheets, validate dtypes, flag anomalies, and export a clean workbook.
- Warehouse loader: Ingest JSONL events, normalize, and append to a SQL table for dashboards.
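For the warehouse loader, the one call not shown above is appending to an existing table with if_exists="append"; a minimal sketch with hypothetical file and table names:
import sqlite3
import pandas as pd
con = sqlite3.connect("warehouse.db")               # hypothetical local database
events = pd.read_json("events.jsonl", lines=True)   # new batch of JSONL events
events.to_sql("events", con, if_exists="append", index=False)  # append rather than replace
con.close()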
Learning path
- Right after this: Data types and missing values in pandas (astype, nullable dtypes).
- Then: Joining and merging datasets (merge, concat).
- Then: GroupBy and aggregations for reporting.
- Optional: Time zones and datetime best practices.
Mini challenge
Load quarterly data from a mix of formats: CSV (sales_q1.csv), Excel (sales_q2.xlsx, sheet Sales), and JSONL (sales_q3.jsonl). Standardize columns to id, date, region, amount (float), parse dates, concatenate, drop rows with missing id or amount, and save to two outputs: all_sales.parquet (snappy) and a region-level Excel summary (sheet Regions) with total amount per region.
Hint: use usecols, dtype, parse_dates, concat, groupby, to_parquet, and ExcelWriter.
When you’re ready, take the Quick Test below.