
Data Loading And IO

Learn Data Loading and I/O for free with explanations, exercises, and a quick test (aimed at Data Scientists).

Published: January 1, 2026 | Updated: January 1, 2026

Why this matters

As a Data Scientist, most projects start and end with data I/O. You must ingest files from stakeholders, APIs, or data lakes, clean them reliably, and export consistent results. Getting data loading right reduces bugs, saves time, and prevents subtle errors that derail models.

  • Ingest messy CSVs and Excel sheets from business teams
  • Handle large flat files without running out of memory
  • Normalize types (dates, numbers, categories) for correct analysis
  • Export analysis-ready datasets for teammates and pipelines

Quick note: The Quick Test is available to everyone. If you log in, your progress and score are saved.

Who this is for

  • Aspiring and practicing Data Scientists working with pandas/numpy
  • Analysts transitioning from spreadsheets to Python
  • Engineers who need reliable data ingestion in experiments

Prerequisites

  • Basic Python: variables, lists/dicts, functions
  • Intro pandas: DataFrame basics (select, filter)
  • Comfort with running Python scripts or notebooks

Concept explained simply

Think of data I/O as a funnel:

  • Top: Raw files arrive (CSV, JSON, Excel, Parquet)
  • Middle (pandas): You specify how to read them (types, date parsing, missing values) and optionally stream in chunks
  • Bottom: You export clean, typed, compressed data (CSV, Parquet, JSON) ready for analysis or production

Mental model: Be explicit. Tell pandas exactly what to expect: which columns to load (usecols), their types (dtype), how to parse dates (parse_dates), the encoding (encoding), and memory behavior (chunksize).
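
As a minimal sketch of that mental model (the file name and column names here are hypothetical), an explicit read_csv call might look like this:

import pandas as pd

df = pd.read_csv(
    "sales.csv",                            # hypothetical path, for illustration
    usecols=["order_id", "date", "price"],  # load only the columns you need
    dtype={"order_id": "int64", "price": "float64"},
    parse_dates=["date"],                   # parse to datetime64 on load
    encoding="utf-8",                       # state the encoding explicitly
)

Add chunksize=... to the same call when the file is too large to hold in memory at once.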

Common file formats and when to use them
  • CSV: Universal, human-readable; larger on disk, no types; great for interchange
  • JSON / JSON Lines: Nested or event data; JSONL is stream-friendly
  • Excel: Business-friendly; requires engines like openpyxl for .xlsx
  • Parquet: Columnar, compressed, typed; best for analytics at scale
  • Compression: gzip, zip, bz2, xz reduce size and I/O time
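
As a small sketch of the trade-off (file names are hypothetical), the same DataFrame can be exported as compressed CSV or as Parquet:

import pandas as pd

df = pd.DataFrame({"city": ["Berlin", "Lagos"], "temp_c": [21.5, 31.0]})

# CSV: universal but untyped; pandas infers the gzip codec from the
# .gz suffix, so no explicit compression argument is needed here.
df.to_csv("weather.csv.gz", index=False)

# Parquet: columnar, compressed, and typed; requires pyarrow or
# fastparquet installed in your environment.
# df.to_parquet("weather.parquet", index=False)

Reading the gzipped CSV back restores the values but not the dtypes; Parquet stores the schema alongside the data.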

Worked examples

Example 1: Read a CSV correctly (types, dates, missing values)

import pandas as pd
from io import StringIO

raw = StringIO(
    "order_id,date,product,price,qty\n"
    "1,2023-01-02,Widget A,12.5,2\n"
    "2,2023/01/03,Widget B,9.99,3\n"
    "3,01-04-2023,Widget A,,1\n"
    "4,2023-01-05,Widget C,15.0,NaN\n"
)

# Be explicit about NA markers and column types
na_vals = ["", "NA", "NaN"]
dtypes = {"order_id": "int64", "product": "category", "price": "float64", "qty": "float64"}
df = pd.read_csv(raw, na_values=na_vals, dtype=dtypes)

# The dates mix three formats (ISO, slashes, MM-DD-YYYY), so parse them
# element-wise after loading; format="mixed" requires pandas 2.0+.
df["date"] = pd.to_datetime(df["date"], format="mixed")

df["revenue"] = df["price"].fillna(0) * df["qty"].fillna(0)
print(df.dtypes)
print(df)

Key ideas: declare NA markers and dtypes up front, parse mixed-format dates explicitly, then compute safely with fillna.

Example 2: Large CSV in chunks to Parquet

import pandas as pd
from io import StringIO
raw = StringIO("""user_id,event,ts
1,login,2023-01-01 09:00:00
2,login,2023-01-01 09:05:00
1,click,2023-01-01 09:06:00
3,login,2023-01-01 09:07:00
""")
chunks = pd.read_csv(raw, parse_dates=["ts"], chunksize=2)
acc = []
for c in chunks:
    c["date"] = c["ts"].dt.date
    acc.append(c)
df = pd.concat(acc, ignore_index=True)
# Save to Parquet (fast, typed). Requires pyarrow or fastparquet installed in your environment.
# df.to_parquet("events.parquet", index=False)
print(df.groupby(["date", "event"]).size())

Chunking prevents memory blowups and lets you process rows incrementally.
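
If even the concatenated result would not fit in memory, aggregate each chunk and combine only the small partial results. A sketch on the same toy data:

import pandas as pd
from io import StringIO

raw = StringIO("""user_id,event,ts
1,login,2023-01-01 09:00:00
2,login,2023-01-01 09:05:00
1,click,2023-01-01 09:06:00
3,login,2023-01-01 09:07:00
""")

# Only one chunk is in memory at a time; the partial counts stay tiny.
partials = []
for chunk in pd.read_csv(raw, parse_dates=["ts"], chunksize=2):
    partials.append(chunk.groupby("event").size())
totals = pd.concat(partials).groupby(level=0).sum()
print(totals)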

Example 3: JSON Lines in, filtered CSV out

import pandas as pd
from io import StringIO
raw = StringIO('\n'.join([
    '{"id": 10, "country": "US", "amount": 25.0}',
    '{"id": 11, "country": "DE", "amount": 17.5}',
    '{"id": 12, "country": "US", "amount": 11.0}'
]))
df = pd.read_json(raw, lines=True)
us = df[df["country"] == "US"]
# us.to_csv("us_amounts.csv", index=False)
print(us)

JSON Lines (lines=True) is streaming-friendly and common for logs.
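
For large log files, the same idea works in streaming form: with lines=True, read_json also accepts a chunksize and returns an iterator. A sketch:

import pandas as pd
from io import StringIO

raw = StringIO('\n'.join([
    '{"id": 10, "country": "US", "amount": 25.0}',
    '{"id": 11, "country": "DE", "amount": 17.5}',
    '{"id": 12, "country": "US", "amount": 11.0}',
]))

# Filter each chunk as it streams in, keeping memory use flat.
keep = []
for chunk in pd.read_json(raw, lines=True, chunksize=2):
    keep.append(chunk[chunk["country"] == "US"])
us = pd.concat(keep, ignore_index=True)
print(us)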

Example 4: Encodings and compression

import pandas as pd
from io import BytesIO
import gzip
csv_bytes = "name;score\nAna;9\nJosé;10\n".encode("latin-1")
compressed = gzip.compress(csv_bytes)
buf = BytesIO(compressed)
df = pd.read_csv(buf, sep=';', compression='gzip', encoding='latin-1')
print(df)

If you are not certain a file is UTF-8, specify the encoding explicitly; compression reduces file size and often speeds up I/O.
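
If you do not know the encoding in advance, a small retry loop is one pragmatic option. This helper is not a pandas feature, just a sketch:

import pandas as pd
from io import BytesIO

data = "name;score\nJosé;10\n".encode("latin-1")

def read_csv_try_encodings(raw_bytes, encodings=("utf-8", "latin-1")):
    # latin-1 decodes any byte sequence, so keep it last as a fallback.
    for enc in encodings:
        try:
            return pd.read_csv(BytesIO(raw_bytes), sep=";", encoding=enc)
        except UnicodeDecodeError:
            continue
    raise ValueError("none of the candidate encodings worked")

print(read_csv_try_encodings(data))

Note that a successful decode is not proof of the right encoding; spot-check accented characters afterwards.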

Practical steps: a repeatable checklist

  • Confirm file format, delimiter, header row, and encoding
  • List required columns; load only them (usecols)
  • Define dtype for each column; use category for repeated strings
  • Parse dates explicitly with parse_dates; set dayfirst if needed
  • Declare na_values; verify missing value counts
  • For large files: set chunksize and process incrementally
  • Validate: row counts, unique keys, summary stats
  • Export in a compact, analytics-friendly format (often Parquet); include index=False
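
The checklist folds naturally into a small reusable loader. A sketch, where the path, columns, and NA policy are assumptions to adapt:

import pandas as pd

def load_orders(path):
    # Explicit schema: only the needed columns, typed, dates parsed.
    df = pd.read_csv(
        path,
        usecols=["order_id", "date", "product", "price"],
        dtype={"order_id": "int64", "product": "category", "price": "float64"},
        parse_dates=["date"],
        na_values=["", "NA", "NULL"],
    )
    # Validate before anything downstream trusts the data.
    assert len(df) > 0, "no rows loaded"
    assert df["order_id"].is_unique, "duplicate order_id values"
    assert pd.api.types.is_datetime64_any_dtype(df["date"]), "date column did not parse"
    return df

# df = load_orders("orders.csv")               # hypothetical input
# df.to_parquet("orders.parquet", index=False)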

Exercises

Do these locally or in any Python environment. The data is embedded so you can copy-paste and run.

Exercise 1 — Clean CSV and summarize revenue

Goal: Load a small sales CSV, set correct types, handle missing values, compute revenue by product (descending).

Data to use
order_id,date,product,price,qty
1,2023-01-02,Widget A,12.5,2
2,2023/01/03,Widget B,9.99,3
3,01-04-2023,Widget A,,1
4,2023-01-05,Widget C,15.0,
  • Use dtype for product as category, price/qty as float
  • Parse dates (note mixed formats)
  • Treat empty strings as NA
  • Add revenue = price * qty with missing treated as 0
  • Group by product and sum revenue, sort desc

The expected output is shown in the Practice Exercises section below.

Exercise 2 — JSON Lines in, export filtered CSV

Goal: Read JSON Lines, filter by country == "US", compute total amount, and export to a compressed CSV.

Data to use
{"id": 10, "country": "US", "amount": 25.0}
{"id": 11, "country": "DE", "amount": 17.5}
{"id": 12, "country": "US", "amount": 11.0}
  • Read with pd.read_json(..., lines=True)
  • Filter to US rows
  • Print total amount and number of rows
  • Export to CSV with compression='gzip' and index=False (you can skip writing to disk if not available)

Common mistakes and self-check

  • Relying on pandas guesses: Explicitly set dtype, parse_dates, na_values
  • Wrong delimiter: Use sep=';' when needed; confirm column counts
  • Silently misparsed dates: Check df["date"].isna().sum() after parsing
  • Memory blowups: Use usecols, chunksize, convert strings to category
  • Unexpected encoding: If you see decode errors, try encoding='utf-8' first, then 'latin-1'
  • Forgetting index on export: Add index=False unless you need it

Self-check: After loading, validate number of rows, unique keys, and a few sampled values. If results look off, re-check delimiter, encoding, and date parsing.
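
A minimal version of that self-check on a toy frame:

import pandas as pd
from io import StringIO

raw = StringIO("""order_id,date,product
1,2023-01-02,Widget A
2,2023-01-03,Widget B
""")
df = pd.read_csv(raw, parse_dates=["date"])

print(len(df))                       # row count matches the source?
print(df["order_id"].is_unique)      # key really unique?
print(df["date"].isna().sum())       # silently unparsed dates?
print(df.sample(2, random_state=0))  # eyeball a few sampled rows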

Practical projects

  • Build a CSV-to-Parquet converter with CLI args: input path, usecols, dtypes, compression
  • Create a daily ingestion script for a logs folder that appends JSONL into a partitioned Parquet dataset
  • Write a data validator that checks schema, NA policy, and basic stats before saving

Learning path

  • Master pandas read/write patterns (CSV, JSONL, Parquet, Excel)
  • Learn schema control: dtypes, categories, nullable integers, dates (see the sketch after this list)
  • Optimize for scale: chunking, memory profiling, feather/parquet
  • Automate: small ETL scripts, scheduled jobs
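
One schema-control detail worth knowing early is pandas' nullable integer dtype. A sketch with made-up data:

import pandas as pd
from io import StringIO

# Plain int64 cannot hold missing values, so a column with gaps is
# silently upcast to float64. The nullable "Int64" (capital I) keeps
# integer semantics and represents the gap as <NA>.
raw = StringIO("user_id,age\n1,34\n2,\n3,29\n")
df = pd.read_csv(raw, dtype={"user_id": "int64", "age": "Int64"})
print(df.dtypes)
print(df)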

Next steps

  • Re-run the examples with your own files and encodings
  • Add logging and data validation to your I/O scripts
  • Take the Quick Test below to confirm understanding

Mini challenge

Given a 2 GB CSV with semicolons, mixed date formats, and occasional "NULL" strings, outline the exact read_csv call (sep, parse_dates, na_values, usecols, dtype, chunksize) and a plan to export to Parquet with partitioning by year-month. Keep it to 6–8 lines of code.

Practice Exercises

2 exercises to complete

Instructions

Load the provided CSV (use StringIO), set correct dtypes, parse dates, handle missing values, compute revenue per product, and sort descending.

from io import StringIO
import pandas as pd
raw = StringIO("""order_id,date,product,price,qty
1,2023-01-02,Widget A,12.5,2
2,2023/01/03,Widget B,9.99,3
3,01-04-2023,Widget A,,1
4,2023-01-05,Widget C,15.0,
""")
# Your code here
Expected Output
product,revenue
Widget B,29.97
Widget A,25.00
Widget C,0.00

Data Loading And IO — Quick Test

Test your knowledge with 10 questions. Pass with 70% or higher.

