Why this matters
As a Data Analyst, you constantly move data between files, spreadsheets, APIs, and databases. Mastering pandas I/O lets you:
- Pull clean datasets quickly from CSV/Excel/JSON.
- Save analysis-ready outputs in efficient formats (Parquet) or back to SQL.
- Handle large files safely with chunking and correct dtypes.
Real tasks you’ll meet
- Import weekly sales CSVs, standardize columns, and load into a reporting database.
- Read budget spreadsheets with multiple sheets, merge, and export a clean summary.
- Ingest JSON from an API, normalize nested fields, and save as Parquet for fast querying.
Who this is for
- Aspiring and junior Data Analysts using Python.
- Analysts switching from Excel to pandas.
- Anyone who needs reliable read/write pipelines.
Prerequisites
- Basic Python (variables, functions).
- Basic pandas (DataFrame, columns, filters).
- Python installed; optional: engines like pyarrow (Parquet) and openpyxl (Excel).
Concept explained simply
pandas I/O is about two things: reading into DataFrames and writing DataFrames out. Each format has a reader (pd.read_csv, pd.read_excel, pd.read_json, pd.read_parquet, pd.read_sql) and a writer (to_csv, to_excel, to_json, to_parquet, to_sql). You control parsing (dtypes, dates, columns) and performance (chunksize, compression).
Mental model
- Define the shape: choose columns (usecols) and types (dtype) early to avoid surprises.
- Decode correctly: specify encoding, separators, decimal, thousands, and date parsing.
- Scale safely: use chunksize for big files, Parquet for speed/size, SQL for storage.
Mini cheat sheet
# CSV
df = pd.read_csv("file.csv", usecols=["id","date","amount"], dtype={"id":"int64","amount":"float64"}, parse_dates=["date"], na_values=["", "NA", "null"])
df.to_csv("out.csv", index=False)
# Excel
df = pd.read_excel("book.xlsx", sheet_name="Data", usecols="A:D", dtype={"id":"Int64"})
df.to_excel("out.xlsx", sheet_name="Clean", index=False)
# JSON (records lines)
df = pd.read_json("data.jsonl", lines=True)
df.to_json("out.json", orient="records", lines=True)
# Parquet (requires pyarrow or fastparquet)
df = pd.read_parquet("data.parquet")
df.to_parquet("out.parquet", compression="snappy", index=False)
# SQL (example: SQLite)
import sqlite3
con = sqlite3.connect(":memory:")
df.to_sql("table_name", con, if_exists="replace", index=False)
q = pd.read_sql_query("SELECT * FROM table_name WHERE amount > ?", con, params=(1000,))
Worked examples
Example 1: Robust CSV import/export
import pandas as pd
from io import StringIO
csv_data = StringIO("""id,date,region,amount
1,2024-01-10,EU,1200.50
2,2024/01/11,US,NA
3,10-01-2024,EU,950
""")
# Read with custom NA values; leave dates as strings for now
df = pd.read_csv(csv_data, na_values=["NA"])
# Mixed date formats: parse what pandas can and coerce the rest to NaT
df["date"] = pd.to_datetime(df["date"], errors="coerce")
# Clean and export
clean = df.dropna(subset=["amount"]).assign(amount=lambda d: d["amount"].astype(float))
clean.to_csv("clean_sales.csv", index=False)
print(clean.dtypes)
Key ideas: handle NA, parse dates, coerce mismatches, export without index.
Example 2: Excel with multiple sheets
import pandas as pd
# Read specific sheet and range
df_q1 = pd.read_excel("budget.xlsx", sheet_name="Q1", usecols="A:F", skiprows=1)
df_q2 = pd.read_excel("budget.xlsx", sheet_name="Q2", usecols="A:F", skiprows=1)
# Combine and write to a new workbook
budget = pd.concat([df_q1, df_q2], ignore_index=True)
with pd.ExcelWriter("budget_clean.xlsx", engine="openpyxl") as writer:
    budget.to_excel(writer, sheet_name="All", index=False)
    (budget.groupby("department")["spend"].sum()
        .reset_index()
        .to_excel(writer, sheet_name="Summary", index=False))
Tip: Use ExcelWriter to save multiple sheets in one file.
Example 3: JSON and JSON Lines
import pandas as pd
from pandas import json_normalize
# JSON Lines (one JSON per line)
df = pd.read_json("events.jsonl", lines=True)
# Nested JSON to columns
nested = {"user": {"id": 1, "name": "Ana"}, "purchase": {"total": 19.9, "items": 3}}
flat = json_normalize(nested) # columns like user.id, user.name, purchase.total
# Export as lines for streaming systems
df.to_json("events_out.jsonl", orient="records", lines=True)
print(flat.head())
Choose lines=True for newline-delimited JSON; json_normalize flattens nested objects.
Example 4: Parquet for speed and size
import pandas as pd
# Requires pyarrow (recommended)
df = pd.read_csv("large.csv", usecols=["id","date","amount"], parse_dates=["date"], dtype={"id":"int64","amount":"float64"})
df.to_parquet("large.parquet", compression="snappy", index=False)
fast = pd.read_parquet("large.parquet")
print(len(fast))
Parquet is columnar and compressed, which makes it great for analytics. If you get an engine error, install pyarrow.
Example 5: SQL roundtrip (SQLite)
import pandas as pd
import sqlite3
con = sqlite3.connect(":memory:")
sales = pd.DataFrame({
    "id": [1,2,3,4],
    "region": ["EU","US","EU","APAC"],
    "amount": [1200, 800, 650, 300]
})
# Write to SQL table
sales.to_sql("sales", con, if_exists="replace", index=False)
# Safe parameterized query
q = "SELECT region, SUM(amount) AS total FROM sales WHERE amount > ? GROUP BY region"
summary = pd.read_sql_query(q, con, params=(600,))
print(summary)
Use parameterized queries (params=...) to avoid SQL injection and parsing issues.
Performance and safety tips
- Limit columns with usecols and set dtype early to reduce memory.
- Use chunksize in read_csv for very large files; process chunk by chunk.
- Prefer Parquet for intermediate analytics storage; it’s compact and fast.
- Specify encoding (e.g., encoding="utf-8" or "latin-1") for non-UTF files.
- Be explicit with decimal="," and thousands="." when numbers use European-style separators (see the sketch below).
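For example, here is a minimal sketch of reading a European-style CSV; the file name and column names are hypothetical:
import pandas as pd
# Hypothetical export: semicolon-separated, latin-1 encoded, numbers like 1.234,56
eu = pd.read_csv(
    "sales_eu.csv",        # hypothetical file
    sep=";",               # semicolon instead of comma
    encoding="latin-1",    # source file is not UTF-8
    decimal=",",           # comma as decimal mark
    thousands=".",         # dot as thousands separator
    parse_dates=["date"],
    dayfirst=True,         # dates like 31-01-2024
)
print(eu.dtypes)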
Chunking pattern
import pandas as pd
acc = []
for chunk in pd.read_csv("big.csv", chunksize=100_000, usecols=["id","amount"], dtype={"id":"int64","amount":"float64"}):
    acc.append(chunk[chunk["amount"] > 0])
result = pd.concat(acc, ignore_index=True)
result.to_parquet("positive_amounts.parquet", index=False)
Exercises
These mirror the tasks below. Try them first, then open the solutions.
- CSV to Parquet clean-up
- Load sales.csv, parse date column, coerce bad dates.
- Filter to positive amounts and add a revenue column equal to amount.
- Save cleaned.parquet with snappy compression and summary.csv without index.
- SQL roundtrip aggregation
- Create a DataFrame of orders and push to SQLite.
- Query total revenue per customer with a parameterized minimum amount.
- Export the result to Excel with a sheet name Summary.
Self-check checklist
- No stray DataFrame index saved to files.
- Dates parsed correctly; no unexpected object dtype for dates.
- Numeric columns have numeric dtypes (int/float), not object.
- Output files open without errors; row counts match expectations.
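A quick way to run these checks is to reopen the file you just wrote; this sketch reuses the clean_sales.csv output from Example 1:
import pandas as pd
check = pd.read_csv("clean_sales.csv", parse_dates=["date"])
print(check.columns.tolist())  # should not contain "Unnamed: 0"
print(check.dtypes)            # date should be datetime64, amount float64 (not object)
print(len(check))              # row count should match what you exported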
Common mistakes and how to self-check
- Forgetting index=False when saving, which adds an extra "Unnamed: 0" column on re-read. Check by reopening the file and inspecting columns.
- Wrong encoding on read_csv. If you see weird characters, try encoding="latin-1" or inspect the source encoding.
- Mixed date formats causing object dtype. Use pd.to_datetime(..., errors="coerce") and check df["date"].dtype.
- Excel engine missing. If you see an engine error, install openpyxl and retry.
- JSON vs JSON Lines confusion. If each line is a JSON object, use lines=True.
- Parquet engine not installed. Install pyarrow for best results.
- Building SQL strings with f-strings. Use params= to avoid SQL injection and formatting issues.
Practical projects
- Monthly pipeline: Read weekly CSVs, combine, clean, and write a monthly Parquet plus an Excel summary.
- Spreadsheet audit: Load multiple Excel sheets, validate dtypes, flag anomalies, and export a clean workbook.
- Warehouse loader: Ingest JSONL events, normalize, and append to a SQL table for dashboards.
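For the warehouse loader, the one call not shown above is appending to an existing table with if_exists="append"; a minimal sketch with hypothetical file and table names:
import sqlite3
import pandas as pd
con = sqlite3.connect("warehouse.db")               # hypothetical local database
events = pd.read_json("events.jsonl", lines=True)   # new batch of JSONL events
events.to_sql("events", con, if_exists="append", index=False)  # append rather than replace
con.close()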
Learning path
- Right after this: Data types and missing values in pandas (astype, nullable dtypes).
- Then: Joining and merging datasets (merge, concat).
- Then: GroupBy and aggregations for reporting.
- Optional: Time zones and datetime best practices.
Mini challenge
Load quarterly data from a mix of formats: CSV (sales_q1.csv), Excel (sales_q2.xlsx, sheet Sales), and JSONL (sales_q3.jsonl). Standardize columns to id, date, region, amount (float), parse dates, concatenate, drop rows with missing id or amount, and save to two outputs: all_sales.parquet (snappy) and a region-level Excel summary (sheet Regions) with total amount per region.
Hint: use usecols, dtype, parse_dates, concat, groupby, to_parquet, and ExcelWriter.
When you’re ready, take the Quick Test below.