Why this matters
For a Data Scientist, most projects start and end with data I/O: you ingest files from stakeholders, APIs, or data lakes, clean them reliably, and export consistent results. Getting data loading right reduces bugs, saves time, and prevents subtle errors that derail models.
- Ingest messy CSVs and Excel sheets from business teams
- Handle large flat files without running out of memory
- Normalize types (dates, numbers, categories) for correct analysis
- Export analysis-ready datasets for teammates and pipelines
Who this is for
- Aspiring and practicing Data Scientists working with pandas/numpy
- Analysts transitioning from spreadsheets to Python
- Engineers who need reliable data ingestion in experiments
Prerequisites
- Basic Python: variables, lists/dicts, functions
- Intro pandas: DataFrame basics (select, filter)
- Comfort with running Python scripts or notebooks
Concept explained simply
Think of data I/O as a funnel:
- Top: Raw files arrive (CSV, JSON, Excel, Parquet)
- Middle (pandas): You specify how to read them (types, date parsing, missing values) and optionally stream in chunks
- Bottom: You export clean, typed, compressed data (CSV, Parquet, JSON) ready for analysis or production
Mental model: Be explicit. Tell pandas exactly what to expect: which columns to load (usecols), their types (dtype), how to parse dates (parse_dates), the encoding (encoding), and memory behavior (chunksize).
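For example, a fully explicit load of a small embedded CSV might look like the sketch below; the column names are made up purely for illustration.
import pandas as pd
from io import StringIO

raw = StringIO("order_id,date,region,price,comment\n1,2023-01-02,EU,12.5,ok\n2,2023-01-03,US,9.99,\n")

df = pd.read_csv(
    raw,
    usecols=["order_id", "date", "price"],            # load only what you need
    dtype={"order_id": "int64", "price": "float64"},  # declare types up front
    parse_dates=["date"],
    na_values=["", "NA", "NULL"],
    encoding="utf-8",        # matters for real files on disk; ignored for this text buffer
    # chunksize=100_000,     # would return an iterator of DataFrames for very large files
)
print(df.dtypes)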
Common file formats and when to use them
- CSV: Universal, human-readable; larger on disk, no types; great for interchange
- JSON / JSON Lines: Nested or event data; JSONL is stream-friendly
- Excel: Business-friendly; requires an engine such as openpyxl for .xlsx
- Parquet: Columnar, compressed, typed; best for analytics at scale
- Compression: gzip, zip, bz2, and xz reduce file size and I/O time
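To make the trade-offs concrete, here is a minimal round-trip sketch; the file names are placeholders, and the Excel and Parquet lines assume the relevant engines (openpyxl, pyarrow or fastparquet) are installed.
import pandas as pd

df = pd.DataFrame({"city": ["Lisbon", "Porto"], "sales": [120.5, 98.0]})

df.to_csv("sales.csv", index=False)             # universal, but plain untyped text
df.to_csv("sales.csv.gz", index=False)          # compression inferred from the .gz extension
# df.to_excel("sales.xlsx", index=False)        # needs openpyxl installed
# df.to_parquet("sales.parquet", index=False)   # columnar + typed; needs pyarrow or fastparquet

back = pd.read_csv("sales.csv.gz")              # compression inferred on read, too
print(back.dtypes)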
Worked examples
Example 1: Read a CSV correctly (types, dates, missing values)
import pandas as pd
from io import StringIO
raw = StringIO(
"""order_id,date,product,price,qty
1,2023-01-02,Widget A,12.5,2
2,2023/01/03,Widget B,9.99,3
3,01-04-2023,Widget A,,1
4,2023-01-05,Widget C,15.0,NaN
"""
)
# Be explicit about NA markers and column types
na_vals = ["", "NA", "NaN"]
dtypes = {"order_id": "int64", "product": "category", "price": "float64", "qty": "float64"}
df = pd.read_csv(raw, na_values=na_vals, dtype=dtypes)

# On recent pandas (>= 2.0), parse_dates infers a single format, so convert the
# mixed formats explicitly; dayfirst=True reads 01-04-2023 as 1 April 2023 (DD-MM-YYYY)
df["date"] = pd.to_datetime(df["date"], format="mixed", dayfirst=True)

df["revenue"] = df["price"].fillna(0) * df["qty"].fillna(0)
print(df.dtypes)
print(df)
Key ideas: tell pandas about NA markers and dtypes up front, parse dates explicitly (mixed formats need their own to_datetime pass), then compute safely with fillna.
Example 2: Large CSV in chunks to Parquet
import pandas as pd
from io import StringIO
raw = StringIO("""user_id,event,ts
1,login,2023-01-01 09:00:00
2,login,2023-01-01 09:05:00
1,click,2023-01-01 09:06:00
3,login,2023-01-01 09:07:00
""")
chunks = pd.read_csv(raw, parse_dates=["ts"], chunksize=2)
acc = []
for c in chunks:
    c["date"] = c["ts"].dt.date
    acc.append(c)
df = pd.concat(acc, ignore_index=True)

# Save to Parquet (fast, typed). Requires pyarrow or fastparquet installed in your environment.
# df.to_parquet("events.parquet", index=False)
print(df.groupby(["date", "event"]).size())
Chunking prevents memory blowups and lets you process rows incrementally.
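If even the concatenated result would be too large, you can aggregate each chunk and combine only the small summaries. A minimal sketch of that pattern, reusing the same in-memory CSV:
import pandas as pd
from io import StringIO

raw = StringIO("""user_id,event,ts
1,login,2023-01-01 09:00:00
2,login,2023-01-01 09:05:00
1,click,2023-01-01 09:06:00
3,login,2023-01-01 09:07:00
""")

partial_counts = []
for chunk in pd.read_csv(raw, parse_dates=["ts"], chunksize=2):
    chunk["date"] = chunk["ts"].dt.date
    partial_counts.append(chunk.groupby(["date", "event"]).size())

# Only the small per-chunk summaries are ever held in memory together
totals = pd.concat(partial_counts).groupby(level=["date", "event"]).sum()
print(totals)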
Example 3: JSON Lines in, filtered CSV out
import pandas as pd
from io import StringIO
raw = StringIO('\n'.join([
'{"id": 10, "country": "US", "amount": 25.0}',
'{"id": 11, "country": "DE", "amount": 17.5}',
'{"id": 12, "country": "US", "amount": 11.0}'
]))
df = pd.read_json(raw, lines=True)
us = df[df["country"] == "US"]
# us.to_csv("us_amounts.csv", index=False)
print(us)
JSON Lines (lines=True) is streaming-friendly and common for logs.
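read_json also accepts chunksize when lines=True, which gives you the same streaming pattern for large log files. A small sketch:
import pandas as pd
from io import StringIO

raw = StringIO('\n'.join([
    '{"id": 10, "country": "US", "amount": 25.0}',
    '{"id": 11, "country": "DE", "amount": 17.5}',
    '{"id": 12, "country": "US", "amount": 11.0}'
]))

# With lines=True and chunksize, read_json returns an iterator of DataFrames
total_us = 0.0
for chunk in pd.read_json(raw, lines=True, chunksize=2):
    total_us += chunk.loc[chunk["country"] == "US", "amount"].sum()
print(total_us)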
Example 4: Encodings and compression
import pandas as pd
from io import BytesIO
import gzip
# Simulate a Latin-1 encoded, semicolon-delimited CSV compressed with gzip
csv_bytes = "name;score\nAna;9\nJosé;10\n".encode("latin-1")
compressed = gzip.compress(csv_bytes)
buf = BytesIO(compressed)

# Compression must be stated explicitly here: there is no file extension to infer it from
df = pd.read_csv(buf, sep=';', compression='gzip', encoding='latin-1')
print(df)
Always specify encoding when unsure; compression speeds I/O and saves space.
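The same ideas apply on the way out: state the encoding explicitly and let the file extension drive compression. A minimal sketch (the file name is a placeholder):
import pandas as pd

df = pd.DataFrame({"name": ["Ana", "José"], "score": [9, 10]})

# Write a gzip-compressed, UTF-8 CSV; compression is inferred from the .gz extension
df.to_csv("scores.csv.gz", sep=';', index=False, encoding="utf-8")

# Read it back with the same delimiter and encoding
back = pd.read_csv("scores.csv.gz", sep=';', encoding="utf-8")
print(back)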
Practical steps: a repeatable checklist
- Confirm file format, delimiter, header row, and encoding
- List required columns; load only them (usecols)
- Define dtype for each column; use category for repeated strings
- Parse dates explicitly with parse_dates; set dayfirst if needed
- Declare na_values; verify missing-value counts
- For large files: set chunksize and process incrementally
- Validate: row counts, unique keys, summary stats
- Export in a compact, analytics-friendly format (often Parquet); include index=False
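Putting the checklist together on a tiny embedded example (the column names are made up, and the Parquet line assumes pyarrow or fastparquet):
import pandas as pd
from io import StringIO

raw = StringIO("txn_id,txn_date,store,amount\n1,2023-01-01,A,10.0\n2,2023-01-02,B,\n3,2023-01-02,A,7.5\n")

df = pd.read_csv(
    raw,
    usecols=["txn_id", "txn_date", "store", "amount"],
    dtype={"txn_id": "int64", "store": "category", "amount": "float64"},
    parse_dates=["txn_date"],
    na_values=["", "NA", "NULL"],
)

# Validate before anything downstream depends on the data
assert df["txn_id"].is_unique, "duplicate keys found"
print(len(df), df.isna().sum().to_dict())

# df.to_parquet("transactions.parquet", index=False)  # compact, typed export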
Exercises
Do these locally or in any Python environment. The data is embedded so you can copy-paste and run.
Exercise 1 — Clean CSV and summarize revenue
Goal: Load a small sales CSV, set correct types, handle missing values, compute revenue by product (descending).
Data to use
order_id,date,product,price,qty
1,2023-01-02,Widget A,12.5,2
2,2023/01/03,Widget B,9.99,3
3,01-04-2023,Widget A,,1
4,2023-01-05,Widget C,15.0,
- Use dtype for product as category and price/qty as float
- Parse dates (note the mixed formats)
- Treat empty strings as NA
- Add revenue = price * qty with missing values treated as 0
- Group by product, sum revenue, and sort descending
Exercise 2 — JSON Lines in, export filtered CSV
Goal: Read JSON Lines, filter by country == "US", compute total amount, and export to a compressed CSV.
Data to use
{"id": 10, "country": "US", "amount": 25.0}
{"id": 11, "country": "DE", "amount": 17.5}
{"id": 12, "country": "US", "amount": 11.0}
- Read with pd.read_json(..., lines=True)
- Filter to US rows
- Print the total amount and the number of rows
- Export to CSV with compression='gzip' and index=False (you can skip writing to disk if your environment does not allow it)
Common mistakes and self-check
- Relying on pandas guesses: Explicitly set dtype, parse_dates, and na_values
- Wrong delimiter: Use sep=';' when needed; confirm column counts
- Silently misparsed dates: Check df["date"].isna().sum() after parsing
- Memory blowups: Use usecols and chunksize, and convert repeated strings to category
- Unexpected encoding: Try encoding='utf-8' or 'latin-1' if you see decode errors
- Forgetting the index on export: Add index=False unless you need it
Self-check: After loading, validate number of rows, unique keys, and a few sampled values. If results look off, re-check delimiter, encoding, and date parsing.
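A few lines of pandas cover that self-check. A minimal sketch, using a tiny stand-in frame in place of whatever you just loaded:
import pandas as pd

# Stand-in DataFrame; in practice, use the one you just loaded
df = pd.DataFrame({"order_id": [1, 2, 3, 4],
                   "date": pd.to_datetime(["2023-01-02", "2023-01-03", None, "2023-01-05"]),
                   "price": [12.5, 9.99, None, 15.0]})

print("rows:", len(df))
print("duplicate keys:", df["order_id"].duplicated().sum())
print("unparsed dates:", df["date"].isna().sum())
print(df.sample(2, random_state=0))  # eyeball a couple of rows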
Practical projects
- Build a CSV-to-Parquet converter with CLI args: input path, usecols, dtypes, compression
- Create a daily ingestion script for a logs folder that appends JSONL into a partitioned Parquet dataset
- Write a data validator that checks schema, NA policy, and basic stats before saving
Learning path
- Master pandas read/write patterns (CSV, JSONL, Parquet, Excel)
- Learn schema control: dtypes, categories, nullable integers, dates
- Optimize for scale: chunking, memory profiling, feather/parquet
- Automate: small ETL scripts, scheduled jobs
Next steps
- Re-run the examples with your own files and encodings
- Add logging and data validation to your I/O scripts
- Take the Quick Test below to confirm understanding
Mini challenge
Given a 2 GB CSV with semicolons, mixed date formats, and occasional "NULL" strings, outline the exact read_csv call (sep, parse_dates, na_values, usecols, dtype, chunksize) and a plan to export to Parquet with partitioning by year-month. Keep it to 6–8 lines of code.