Why this matters
For a Data Scientist, most projects start and end with data I/O: you ingest files from stakeholders, APIs, or data lakes, clean them reliably, and export consistent results. Getting data loading right reduces bugs, saves time, and prevents subtle errors that derail models.
- Ingest messy CSVs and Excel sheets from business teams
- Handle large flat files without running out of memory
- Normalize types (dates, numbers, categories) for correct analysis
- Export analysis-ready datasets for teammates and pipelines
Who this is for
- Aspiring and practicing Data Scientists working with pandas/numpy
- Analysts transitioning from spreadsheets to Python
- Engineers who need reliable data ingestion in experiments
Prerequisites
- Basic Python: variables, lists/dicts, functions
- Intro pandas: DataFrame basics (select, filter)
- Comfort with running Python scripts or notebooks
Concept explained simply
Think of data I/O as a funnel:
- Top: Raw files arrive (CSV, JSON, Excel, Parquet)
- Middle (pandas): You specify how to read them (types, date parsing, missing values) and optionally stream in chunks
- Bottom: You export clean, typed, compressed data (CSV, Parquet, JSON) ready for analysis or production
Mental model: Be explicit. Tell pandas exactly what to expect: which columns to load (usecols), their types (dtype), how to parse dates (parse_dates), the encoding (encoding), and memory behavior (chunksize).
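For example, a fully explicit load of a small embedded CSV might look like the sketch below; the column names are made up purely for illustration.
import pandas as pd
from io import StringIO

raw = StringIO("order_id,date,region,price,comment\n1,2023-01-02,EU,12.5,ok\n2,2023-01-03,US,9.99,\n")

df = pd.read_csv(
    raw,
    usecols=["order_id", "date", "price"],            # load only what you need
    dtype={"order_id": "int64", "price": "float64"},  # declare types up front
    parse_dates=["date"],
    na_values=["", "NA", "NULL"],
    encoding="utf-8",        # matters for real files on disk; ignored for this text buffer
    # chunksize=100_000,     # would return an iterator of DataFrames for very large files
)
print(df.dtypes)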
Common file formats and when to use them
- CSV: Universal, human-readable; larger on disk, no types; great for interchange
- JSON / JSON Lines: Nested or event data; JSONL is stream-friendly
- Excel: Business-friendly; requires an engine such as openpyxl for .xlsx
- Parquet: Columnar, compressed, typed; best for analytics at scale
- Compression: gzip, zip, bz2, and xz reduce file size and I/O time
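To make the trade-offs concrete, here is a minimal round-trip sketch; the file names are placeholders, and the Excel and Parquet lines assume the relevant engines (openpyxl, pyarrow or fastparquet) are installed.
import pandas as pd

df = pd.DataFrame({"city": ["Lisbon", "Porto"], "sales": [120.5, 98.0]})

df.to_csv("sales.csv", index=False)             # universal, but plain untyped text
df.to_csv("sales.csv.gz", index=False)          # compression inferred from the .gz extension
# df.to_excel("sales.xlsx", index=False)        # needs openpyxl installed
# df.to_parquet("sales.parquet", index=False)   # columnar + typed; needs pyarrow or fastparquet

back = pd.read_csv("sales.csv.gz")              # compression inferred on read, too
print(back.dtypes)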
Worked examples
Example 1: Read a CSV correctly (types, dates, missing values)
import pandas as pd
from io import StringIO
raw = StringIO(
"""order_id,date,product,price,qty
1,2023-01-02,Widget A,12.5,2
2,2023/01/03,Widget B,9.99,3
3,01-04-2023,Widget A,,1
4,2023-01-05,Widget C,15.0,NaN
"""
)
# Be explicit about NA markers and column types
na_vals = ["", "NA", "NaN"]
dtypes = {"order_id": "int64", "product": "category", "price": "float64", "qty": "float64"}
df = pd.read_csv(raw, na_values=na_vals, dtype=dtypes)

# On recent pandas (>= 2.0), parse_dates infers a single format, so convert the
# mixed formats explicitly; dayfirst=True reads 01-04-2023 as 1 April 2023 (DD-MM-YYYY)
df["date"] = pd.to_datetime(df["date"], format="mixed", dayfirst=True)

df["revenue"] = df["price"].fillna(0) * df["qty"].fillna(0)
print(df.dtypes)
print(df)
Key ideas: tell pandas about NA markers and dtypes up front, parse dates explicitly (mixed formats need their own to_datetime pass), then compute safely with fillna.
Example 2: Large CSV in chunks to Parquet
import pandas as pd
from io import StringIO
raw = StringIO("""user_id,event,ts
1,login,2023-01-01 09:00:00
2,login,2023-01-01 09:05:00
1,click,2023-01-01 09:06:00
3,login,2023-01-01 09:07:00
""")
chunks = pd.read_csv(raw, parse_dates=["ts"], chunksize=2)
acc = []
for c in chunks:
    c["date"] = c["ts"].dt.date
    acc.append(c)
df = pd.concat(acc, ignore_index=True)

# Save to Parquet (fast, typed). Requires pyarrow or fastparquet installed in your environment.
# df.to_parquet("events.parquet", index=False)
print(df.groupby(["date", "event"]).size())
Chunking prevents memory blowups and lets you process rows incrementally.
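If even the concatenated result would be too large, you can aggregate each chunk and combine only the small summaries. A minimal sketch of that pattern, reusing the same in-memory CSV:
import pandas as pd
from io import StringIO

raw = StringIO("""user_id,event,ts
1,login,2023-01-01 09:00:00
2,login,2023-01-01 09:05:00
1,click,2023-01-01 09:06:00
3,login,2023-01-01 09:07:00
""")

partial_counts = []
for chunk in pd.read_csv(raw, parse_dates=["ts"], chunksize=2):
    chunk["date"] = chunk["ts"].dt.date
    partial_counts.append(chunk.groupby(["date", "event"]).size())

# Only the small per-chunk summaries are ever held in memory together
totals = pd.concat(partial_counts).groupby(level=["date", "event"]).sum()
print(totals)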
Example 3: JSON Lines in, filtered CSV out
import pandas as pd
from io import StringIO
raw = StringIO('\n'.join([
'{"id": 10, "country": "US", "amount": 25.0}',
'{"id": 11, "country": "DE", "amount": 17.5}',
'{"id": 12, "country": "US", "amount": 11.0}'
]))
df = pd.read_json(raw, lines=True)
us = df[df["country"] == "US"]
# us.to_csv("us_amounts.csv", index=False)
print(us)
JSON Lines (lines=True) is streaming-friendly and common for logs.
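read_json also accepts chunksize when lines=True, which gives you the same streaming pattern for large log files. A small sketch:
import pandas as pd
from io import StringIO

raw = StringIO('\n'.join([
    '{"id": 10, "country": "US", "amount": 25.0}',
    '{"id": 11, "country": "DE", "amount": 17.5}',
    '{"id": 12, "country": "US", "amount": 11.0}'
]))

# With lines=True and chunksize, read_json returns an iterator of DataFrames
total_us = 0.0
for chunk in pd.read_json(raw, lines=True, chunksize=2):
    total_us += chunk.loc[chunk["country"] == "US", "amount"].sum()
print(total_us)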
Example 4: Encodings and compression
import pandas as pd
from io import BytesIO
import gzip
# Simulate a Latin-1 encoded, semicolon-delimited CSV compressed with gzip
csv_bytes = "name;score\nAna;9\nJosé;10\n".encode("latin-1")
compressed = gzip.compress(csv_bytes)
buf = BytesIO(compressed)

# Compression must be stated explicitly here: there is no file extension to infer it from
df = pd.read_csv(buf, sep=';', compression='gzip', encoding='latin-1')
print(df)
Always specify encoding when unsure; compression speeds I/O and saves space.
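The same ideas apply on the way out: state the encoding explicitly and let the file extension drive compression. A minimal sketch (the file name is a placeholder):
import pandas as pd

df = pd.DataFrame({"name": ["Ana", "José"], "score": [9, 10]})

# Write a gzip-compressed, UTF-8 CSV; compression is inferred from the .gz extension
df.to_csv("scores.csv.gz", sep=';', index=False, encoding="utf-8")

# Read it back with the same delimiter and encoding
back = pd.read_csv("scores.csv.gz", sep=';', encoding="utf-8")
print(back)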
Practical steps: a repeatable checklist
- Confirm file format, delimiter, header row, and encoding
- List required columns; load only them (usecols)
- Define dtype for each column; use category for repeated strings
- Parse dates explicitly with parse_dates; set dayfirst if needed
- Declare na_values; verify missing-value counts
- For large files: set chunksize and process incrementally
- Validate: row counts, unique keys, summary stats
- Export in a compact, analytics-friendly format (often Parquet); include index=False
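Putting the checklist together on a tiny embedded example (the column names are made up, and the Parquet line assumes pyarrow or fastparquet):
import pandas as pd
from io import StringIO

raw = StringIO("txn_id,txn_date,store,amount\n1,2023-01-01,A,10.0\n2,2023-01-02,B,\n3,2023-01-02,A,7.5\n")

df = pd.read_csv(
    raw,
    usecols=["txn_id", "txn_date", "store", "amount"],
    dtype={"txn_id": "int64", "store": "category", "amount": "float64"},
    parse_dates=["txn_date"],
    na_values=["", "NA", "NULL"],
)

# Validate before anything downstream depends on the data
assert df["txn_id"].is_unique, "duplicate keys found"
print(len(df), df.isna().sum().to_dict())

# df.to_parquet("transactions.parquet", index=False)  # compact, typed export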
Exercises
Do these locally or in any Python environment. The data is embedded so you can copy-paste and run.
Exercise 1 — Clean CSV and summarize revenue
Goal: Load a small sales CSV, set correct types, handle missing values, compute revenue by product (descending).
Data to use
order_id,date,product,price,qty
1,2023-01-02,Widget A,12.5,2
2,2023/01/03,Widget B,9.99,3
3,01-04-2023,Widget A,,1
4,2023-01-05,Widget C,15.0,
- Use dtype for product as category and price/qty as float
- Parse dates (note the mixed formats)
- Treat empty strings as NA
- Add revenue = price * qty with missing values treated as 0
- Group by product, sum revenue, and sort descending
Exercise 2 — JSON Lines in, export filtered CSV
Goal: Read JSON Lines, filter by country == "US", compute total amount, and export to a compressed CSV.
Data to use
{"id": 10, "country": "US", "amount": 25.0}
{"id": 11, "country": "DE", "amount": 17.5}
{"id": 12, "country": "US", "amount": 11.0}
- Read with pd.read_json(..., lines=True)
- Filter to US rows
- Print the total amount and the number of rows
- Export to CSV with compression='gzip' and index=False (you can skip writing to disk if your environment does not allow it)
Common mistakes and self-check
- Relying on pandas guesses: Explicitly set dtype, parse_dates, and na_values
- Wrong delimiter: Use sep=';' when needed; confirm column counts
- Silently misparsed dates: Check df["date"].isna().sum() after parsing
- Memory blowups: Use usecols and chunksize, and convert repeated strings to category
- Unexpected encoding: Try encoding='utf-8' or 'latin-1' if you see decode errors
- Forgetting the index on export: Add index=False unless you need it
Self-check: After loading, validate number of rows, unique keys, and a few sampled values. If results look off, re-check delimiter, encoding, and date parsing.
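A few lines of pandas cover that self-check. A minimal sketch, using a tiny stand-in frame in place of whatever you just loaded:
import pandas as pd

# Stand-in DataFrame; in practice, use the one you just loaded
df = pd.DataFrame({"order_id": [1, 2, 3, 4],
                   "date": pd.to_datetime(["2023-01-02", "2023-01-03", None, "2023-01-05"]),
                   "price": [12.5, 9.99, None, 15.0]})

print("rows:", len(df))
print("duplicate keys:", df["order_id"].duplicated().sum())
print("unparsed dates:", df["date"].isna().sum())
print(df.sample(2, random_state=0))  # eyeball a couple of rows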
Practical projects
- Build a CSV-to-Parquet converter with CLI args: input path, usecols, dtypes, compression
- Create a daily ingestion script for a logs folder that appends JSONL into a partitioned Parquet dataset
- Write a data validator that checks schema, NA policy, and basic stats before saving
Learning path
- Master pandas read/write patterns (CSV, JSONL, Parquet, Excel)
- Learn schema control: dtypes, categories, nullable integers, dates
- Optimize for scale: chunking, memory profiling, feather/parquet
- Automate: small ETL scripts, scheduled jobs
Next steps
- Re-run the examples with your own files and encodings
- Add logging and data validation to your I/O scripts
- Take the Quick Test below to confirm understanding
Mini challenge
Given a 2 GB CSV with semicolons, mixed date formats, and occasional "NULL" strings, outline the exact read_csv call (sep, parse_dates, na_values, usecols, dtype, chunksize) and a plan to export to Parquet with partitioning by year-month. Keep it to 6–8 lines of code.