
Reading and Writing Data: CSV, Excel, JSON, Parquet, and SQL

Learn how to read and write data with CSV, Excel, JSON, Parquet, and SQL for free, with explanations, exercises, and a quick test (for Data Analysts).

Published: December 20, 2025 | Updated: December 20, 2025

Why this matters

As a Data Analyst, you constantly move data between files, spreadsheets, APIs, and databases. Mastering pandas I/O lets you:

  • Pull clean datasets quickly from CSV/Excel/JSON.
  • Save analysis-ready outputs in efficient formats (Parquet) or back to SQL.
  • Handle large files safely with chunking and correct dtypes.

Real tasks you’ll meet
  • Import weekly sales CSVs, standardize columns, and load into a reporting database.
  • Read budget spreadsheets with multiple sheets, merge, and export a clean summary.
  • Ingest JSON from an API, normalize nested fields, and save as Parquet for fast querying.

Who this is for

  • Aspiring and junior Data Analysts using Python.
  • Analysts switching from Excel to pandas.
  • Anyone who needs reliable read/write pipelines.

Prerequisites

  • Basic Python (variables, functions).
  • Basic pandas (DataFrame, columns, filters).
  • Python installed; optional: engines like pyarrow (Parquet) and openpyxl (Excel).

Concept explained simply

pandas I/O is about two things: reading into DataFrames and writing DataFrames out. Each format has a reader (pd.read_csv, pd.read_excel, pd.read_json, pd.read_parquet, pd.read_sql) and a writer (to_csv, to_excel, to_json, to_parquet, to_sql). You control parsing (dtypes, dates, columns) and performance (chunksize, compression).

Mental model

  • Define the shape: choose columns (usecols) and types (dtype) early to avoid surprises.
  • Decode correctly: specify encoding, separators, decimal, thousands, and date parsing.
  • Scale safely: use chunksize for big files, Parquet for speed/size, SQL for storage.

Mini cheat sheet
import pandas as pd

# CSV
df = pd.read_csv("file.csv", usecols=["id","date","amount"], dtype={"id":"int64","amount":"float64"},
                 parse_dates=["date"], na_values=["", "NA", "null"])
df.to_csv("out.csv", index=False)

# Excel
df = pd.read_excel("book.xlsx", sheet_name="Data", usecols="A:D", dtype={"id":"Int64"})
df.to_excel("out.xlsx", sheet_name="Clean", index=False)

# JSON Lines (one JSON object per line)
df = pd.read_json("data.jsonl", lines=True)
df.to_json("out.jsonl", orient="records", lines=True)

# Parquet (requires pyarrow or fastparquet)
df = pd.read_parquet("data.parquet")
df.to_parquet("out.parquet", compression="snappy", index=False)

# SQL (example: SQLite)
import sqlite3
con = sqlite3.connect(":memory:")
df.to_sql("table_name", con, if_exists="replace", index=False)
q = pd.read_sql_query("SELECT * FROM table_name WHERE amount > ?", con, params=(1000,))

Worked examples

Example 1: Robust CSV import/export

import pandas as pd
from io import StringIO

csv_data = StringIO("""id,date,region,amount
1,2024-01-10,EU,1200.50
2,2024/01/11,US,NA
3,10-01-2024,EU,950
""")

# Read with custom NA values and date parsing
df = pd.read_csv(csv_data, na_values=["NA"], parse_dates=["date"], dayfirst=False)

# Mixed date formats can leave an object dtype; coerce anything unparseable to NaT
df["date"] = pd.to_datetime(df["date"], errors="coerce")

# Clean and export
clean = df.dropna(subset=["amount"]).assign(amount=lambda d: d["amount"].astype(float))
clean.to_csv("clean_sales.csv", index=False)
print(clean.dtypes)

Key ideas: handle NA, parse dates, coerce mismatches, export without index.

Example 2: Excel with multiple sheets

import pandas as pd

# Read specific sheet and range
df_q1 = pd.read_excel("budget.xlsx", sheet_name="Q1", usecols="A:F", skiprows=1)
df_q2 = pd.read_excel("budget.xlsx", sheet_name="Q2", usecols="A:F", skiprows=1)

# Combine and write to a new workbook
budget = pd.concat([df_q1, df_q2], ignore_index=True)
with pd.ExcelWriter("budget_clean.xlsx", engine="openpyxl") as writer:
    budget.to_excel(writer, sheet_name="All", index=False)
    (budget.groupby("department")["spend"].sum()
           .reset_index()
           .to_excel(writer, sheet_name="Summary", index=False))

Tip: Use ExcelWriter to save multiple sheets in one file.
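
If a workbook has many similarly shaped sheets, you can also read them all in one call instead of naming each sheet. A minimal sketch, assuming every sheet in budget.xlsx (from the example above) shares the same A:F layout:

import pandas as pd

# sheet_name=None returns a dict of {sheet name: DataFrame} covering every sheet
sheets = pd.read_excel("budget.xlsx", sheet_name=None, usecols="A:F", skiprows=1)

# Keep the sheet name as a column so rows stay traceable to their quarter
frames = [df.assign(quarter=name) for name, df in sheets.items()]
budget = pd.concat(frames, ignore_index=True)
print(budget["quarter"].unique())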

Example 3: JSON and JSON Lines

import pandas as pd
from pandas import json_normalize

# JSON Lines (one JSON per line)
df = pd.read_json("events.jsonl", lines=True)

# Nested JSON to columns
nested = {"user": {"id": 1, "name": "Ana"}, "purchase": {"total": 19.9, "items": 3}}
flat = json_normalize(nested)  # columns like user.id, user.name, purchase.total

# Export as lines for streaming systems
df.to_json("events_out.jsonl", orient="records", lines=True)
print(flat.head())

Choose lines=True for newline-delimited JSON; json_normalize flattens nested objects.
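
API payloads often nest a list of records under each parent object. A minimal sketch of flattening that shape with record_path and meta, using a small hypothetical payload (the field names are made up for illustration):

import pandas as pd

# Hypothetical payload: each user carries a nested list of orders
payload = [
    {"user": {"id": 1, "name": "Ana"}, "orders": [{"sku": "A1", "total": 19.9}, {"sku": "B2", "total": 5.0}]},
    {"user": {"id": 2, "name": "Ben"}, "orders": [{"sku": "C3", "total": 42.0}]},
]

# record_path explodes the nested list into rows; meta repeats parent fields on each row
orders = pd.json_normalize(payload, record_path="orders", meta=[["user", "id"], ["user", "name"]])
print(orders)  # columns: sku, total, user.id, user.name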

Example 4: Parquet for speed and size

import pandas as pd

# Requires pyarrow (recommended)
df = pd.read_csv("large.csv", usecols=["id","date","amount"], parse_dates=["date"], dtype={"id":"int64","amount":"float64"})
df.to_parquet("large.parquet", compression="snappy", index=False)
fast = pd.read_parquet("large.parquet")
print(len(fast))

Parquet is columnar and compressed, which makes it great for analytics. If you hit an engine error, install pyarrow.
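
Because Parquet stores data by column, readers can also skip columns they do not need. A minimal sketch, reusing large.parquet from the example above:

import pandas as pd

# Only the requested columns are read from disk, which is much cheaper on wide files
subset = pd.read_parquet("large.parquet", columns=["id", "amount"])
print(subset.columns.tolist())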

Example 5: SQL roundtrip (SQLite)

import pandas as pd
import sqlite3

con = sqlite3.connect(":memory:")

sales = pd.DataFrame({
    "id": [1,2,3,4],
    "region": ["EU","US","EU","APAC"],
    "amount": [1200, 800, 650, 300]
})

# Write to SQL table
sales.to_sql("sales", con, if_exists="replace", index=False)

# Safe parameterized query
q = "SELECT region, SUM(amount) AS total FROM sales WHERE amount > ? GROUP BY region"
summary = pd.read_sql_query(q, con, params=(600,))
print(summary)

Use parameterized queries (params=...) to avoid SQL injection and parsing issues.
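
For databases other than SQLite, pandas usually connects through a SQLAlchemy engine. A minimal sketch, assuming SQLAlchemy is installed and using a local SQLite file (sales.db) as a stand-in for a real server:

import pandas as pd
from sqlalchemy import create_engine

# The connection string changes per database (PostgreSQL, MySQL, ...); a SQLite file is used here
engine = create_engine("sqlite:///sales.db")

sales = pd.DataFrame({"id": [1, 2], "region": ["EU", "US"], "amount": [1200.0, 800.0]})

# if_exists="append" adds rows to an existing table instead of replacing it
sales.to_sql("sales", engine, if_exists="append", index=False)

# chunksize makes read_sql return an iterator of DataFrames for large result sets
for chunk in pd.read_sql("SELECT * FROM sales", engine, chunksize=1_000):
    print(len(chunk))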

Performance and safety tips

  • Limit columns with usecols and set dtype early to reduce memory.
  • Use chunksize in read_csv for very large files; process chunk by chunk.
  • Prefer Parquet for intermediate analytics storage; it’s compact and fast.
  • Specify encoding (e.g., encoding="utf-8" or "latin-1") for non-UTF files.
  • Be explicit with decimal="," and thousands="." when numbers use European-style separators (see the sketch after the chunking pattern below).

Chunking pattern
import pandas as pd

acc = []
for chunk in pd.read_csv("big.csv", chunksize=100_000, usecols=["id","amount"], dtype={"id":"int64","amount":"float64"}):
    acc.append(chunk[chunk["amount"] > 0])
result = pd.concat(acc, ignore_index=True)
result.to_parquet("positive_amounts.parquet", index=False)
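
European-style exports often use semicolons as separators, commas for decimals, and dots for thousands. A minimal sketch, assuming a hypothetical file euro_sales.csv encoded as Latin-1:

import pandas as pd

# sep/decimal/thousands describe the file's own conventions; encoding handles non-UTF-8 bytes
df = pd.read_csv(
    "euro_sales.csv",
    sep=";",
    decimal=",",
    thousands=".",
    encoding="latin-1",
)
print(df.dtypes)  # numeric columns should come back as float/int, not object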

Exercises

These mirror the tasks below. Try them first, then open the solutions. Everyone can take the quick test; log in to save your progress.

  1. CSV to Parquet clean-up
    • Load sales.csv, parse date column, coerce bad dates.
    • Filter positive amounts and add a revenue column equal to amount.
    • Save cleaned.parquet with snappy compression and summary.csv without index.
  2. SQL roundtrip aggregation
    • Create a DataFrame of orders and push to SQLite.
    • Query total revenue per customer with a parameterized minimum amount.
    • Export the result to Excel with a sheet name Summary.

Self-check checklist
  • No stray DataFrame index saved to files.
  • Dates parsed correctly; no unexpected object dtype for dates.
  • Numeric columns have numeric dtypes (int/float), not object.
  • Output files open without errors; row counts match expectations.

Common mistakes and how to self-check

  • Forgetting index=False when saving, leading to an extra Unnamed: 0 column. Check by reopening the file and inspecting the columns (see the self-check sketch after this list).
  • Wrong encoding on read_csv. If you see weird characters, try encoding="latin-1" or inspect the source encoding.
  • Mixed date formats causing object dtype. Use pd.to_datetime(..., errors="coerce") and check df["date"].dtype.
  • Excel engine missing. If you see an engine error, install openpyxl and retry.
  • JSON vs JSON Lines confusion. If each line is a JSON object, use lines=True.
  • Parquet engine not installed. Install pyarrow for best results.
  • Building SQL strings with f-strings. Use params= to avoid SQL injection and formatting issues.
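
A quick way to catch the first two mistakes is to reopen your own output and inspect it. A minimal self-check sketch, assuming clean_sales.csv from Example 1:

import pandas as pd

check = pd.read_csv("clean_sales.csv")

# A stray "Unnamed: 0" column means the index was saved by mistake
assert "Unnamed: 0" not in check.columns, "re-export with index=False"

# Dates read back from CSV are plain strings until parsed again
check["date"] = pd.to_datetime(check["date"], errors="coerce")
print(check.dtypes)  # amount should be float64, date should be datetime64[ns]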

Practical projects

  • Monthly pipeline: Read weekly CSVs, combine, clean, and write a monthly Parquet plus an Excel summary.
  • Spreadsheet audit: Load multiple Excel sheets, validate dtypes, flag anomalies, and export a clean workbook.
  • Warehouse loader: Ingest JSONL events, normalize, and append to a SQL table for dashboards.

Learning path

  • Right after this: Data types and missing values in pandas (astype, nullable dtypes).
  • Then: Joining and merging datasets (merge, concat).
  • Then: GroupBy and aggregations for reporting.
  • Optional: Time zones and datetime best practices.

Mini challenge

Load quarterly data from a mix of formats: CSV (sales_q1.csv), Excel (sales_q2.xlsx, sheet Sales), and JSONL (sales_q3.jsonl). Standardize columns to id, date, region, amount (float), parse dates, concatenate, drop rows with missing id or amount, and save to two outputs: all_sales.parquet (snappy) and a region-level Excel summary (sheet Regions) with total amount per region.

Hint: use usecols, dtype, parse_dates, concat, groupby, to_parquet, and ExcelWriter.

When you’re ready, take the Quick Test below. Everyone can take it; log in to save progress.

Practice Exercises

2 exercises to complete

Instructions

  1. Read sales.csv with columns: id (int), date (mixed formats), region (str), amount (float). Treat "NA" and empty strings as missing. Parse date, coercing errors.
  2. Filter positive amounts and add a revenue column equal to amount.
  3. Save cleaned.parquet (snappy, no index) and summary.csv with total revenue by region (no index).

Data shape expectation

If the input has 10 rows, with 2 invalid dates and 1 non-positive amount in separate rows, expect 7 rows in the cleaned output (rows with unparseable dates are dropped along with non-positive amounts).

Expected Output
Files created: cleaned.parquet (snappy) and summary.csv with columns [region, revenue]. Cleaned DataFrame contains only positive amounts; dates are datetime64[ns].

Reading and Writing Data: CSV, Excel, JSON, Parquet, and SQL — Quick Test

Test your knowledge with 10 questions. Pass with 70% or higher.
