Why this matters
As a Data Analyst, you often face files too large to fit into memory: daily clickstream logs, month-long sales exports, or multi-GB CSVs from partners. Chunking and streaming let you process these safely and quickly without upgrading your laptop. You will be able to:
- Compute KPIs (e.g., revenue per region) from 10–100 GB CSVs.
- Filter, deduplicate, or sample rows while writing to a new file incrementally.
- Load data into databases in batches to avoid memory spikes.
Concept explained simply
Chunking reads a big file piece-by-piece (e.g., 50,000 rows at a time) so you process each piece and discard it before reading the next. Streaming is a broader idea: consume data as it arrives or in small parts without holding everything at once.
Mental model
Imagine a conveyor belt with boxes (chunks). You open one box, do your work (filter, aggregate), put the result aside, then open the next. You never hold all boxes at once, so your table (memory) never overflows.
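In code, the conveyor belt is just a for loop over an iterator; a minimal sketch (the file name is a placeholder):

import pandas as pd

# Each iteration yields one "box": a DataFrame of up to 50,000 rows.
# Process it, keep only the small result, and let the chunk be discarded.
for chunk in pd.read_csv('big_file.csv', chunksize=50_000):
    print(len(chunk), 'rows in this chunk')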
Essential pandas tools
- pd.read_csv(..., chunksize=50000): returns an iterator of DataFrames.
- usecols=... and dtype=...: read only the columns you need, with the right data types (see the peek sketch after this list).
- compression='gzip': read compressed CSVs directly.
- Accumulate results incrementally: update dicts, or group each chunk and then combine.
- Write results in batches using mode='a' (append) and control header writing.
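Before a full pass, it often pays to peek at a single chunk and confirm that the columns and dtypes look right. A minimal sketch, using the orders.csv layout from the worked examples below:

import pandas as pd

# Read just the first chunk to sanity-check columns and data types.
reader = pd.read_csv('orders.csv', usecols=['region', 'revenue'],
                     dtype={'region': 'category', 'revenue': 'float32'},
                     chunksize=50_000)
first = next(iter(reader))
print(first.dtypes)
print(first.head())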
Choosing a chunk size
Start with 50k–200k rows per chunk. If your machine slows or memory spikes, reduce it. If CPU is idle and memory is comfortable, increase it.
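To turn that rule of thumb into numbers, you can measure how much memory one chunk actually uses and scale from there; a rough sketch (the file name is a placeholder):

import pandas as pd

# Measure the in-memory size of a single 50k-row chunk, then adjust
# chunksize so one chunk stays well below your available RAM.
chunk = next(iter(pd.read_csv('orders.csv', chunksize=50_000)))
mb_per_chunk = chunk.memory_usage(deep=True).sum() / 1e6
print(f"~{mb_per_chunk:.1f} MB per 50,000-row chunk")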
Worked example 1: Global sum across chunks
Task: Total revenue from a huge CSV with columns: order_id, region, revenue.
import pandas as pd

revenue_total = 0.0
for chunk in pd.read_csv('orders.csv', usecols=['revenue'], dtype={'revenue':'float32'}, chunksize=100_000):
    revenue_total += chunk['revenue'].sum()
print(round(revenue_total, 2))

Why it works: each chunk sums its piece; we add it to a running total. Memory stays small.
Worked example 2: Group by category across chunks
Task: Revenue by region without loading the full dataset.
import pandas as pd

partial_totals = {}
for chunk in pd.read_csv('orders.csv', usecols=['region','revenue'],
                         dtype={'region':'category', 'revenue':'float32'},
                         chunksize=200_000):
    grp = chunk.groupby('region', observed=True)['revenue'].sum()
    for region, val in grp.items():
        partial_totals[region] = partial_totals.get(region, 0.0) + float(val)

# Turn into a DataFrame if needed
result = pd.DataFrame(list(partial_totals.items()), columns=['region','revenue'])
print(result.sort_values('revenue', ascending=False))

Tip: observed=True is faster with categorical dtypes and avoids unused category combinations in the result.
Worked example 3: Filter to a new file (append mode)
Task: Keep only successful transactions and store them into a new CSV.
import pandas as pd

out_path = 'transactions_success.csv'
first_chunk = True
for chunk in pd.read_csv('transactions.csv', chunksize=100_000, dtype={'status':'category'}):
    filtered = chunk[chunk['status'] == 'success']
    filtered.to_csv(out_path, index=False, mode='a', header=first_chunk)
    first_chunk = False

Key detail: write the header only once (the first_chunk flag). This avoids corrupting the output. Also delete any existing output file before re-running, because mode='a' keeps appending to it.
Worked example 4: Reading compressed logs
Task: Count the rows in a gzip-compressed CSV where http_status == 200.
import pandas as pd

total_ok = 0
for chunk in pd.read_csv('logs.csv.gz', compression='gzip', chunksize=250_000,
                         usecols=['http_status'], dtype={'http_status':'int16'}):
    total_ok += (chunk['http_status'] == 200).sum()
print(total_ok)

Performance and memory tips
- Read fewer columns with usecols.
- Downcast numerics: int16/float32 when safe (see the sketch after this list).
- Use category for low-cardinality strings (e.g., country, status).
- Avoid concatenating all chunks into one big DataFrame unless you truly need it. Aggregate or write out instead.
- Prototype on nrows=100_000 to validate logic, then switch to chunking.
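As a rough illustration of the dtype tips, here is a sketch comparing the default float64 against float32 for one column, reusing the orders.csv layout from the worked examples (adjust names to your data):

import pandas as pd

# Same column read twice on a 100k-row sample: default float64 vs float32.
default = pd.read_csv('orders.csv', usecols=['revenue'], nrows=100_000)
compact = pd.read_csv('orders.csv', usecols=['revenue'], nrows=100_000,
                      dtype={'revenue': 'float32'})

print(default['revenue'].memory_usage(deep=True))  # float64: ~8 bytes per row
print(compact['revenue'].memory_usage(deep=True))  # float32: ~4 bytes per row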
Quick self-check: am I safe on memory?
- Peak memory doesn’t jump when you increase file size (see the check after this list).
- Your code works on a subset and on the full file with the same logic.
- No step creates a giant temporary DataFrame unintentionally.
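One way to verify the first point is to track peak memory with the standard library's tracemalloc while a chunked job runs; a minimal sketch, reusing the orders.csv running-total example:

import tracemalloc
import pandas as pd

# Run the same chunked job on a subset and on the full file;
# the reported peak should stay roughly the same.
tracemalloc.start()
total = 0.0
for chunk in pd.read_csv('orders.csv', usecols=['revenue'], chunksize=100_000):
    total += chunk['revenue'].sum()
current, peak = tracemalloc.get_traced_memory()
tracemalloc.stop()
print(f"total revenue: {total:.2f}, peak memory: {peak / 1e6:.1f} MB")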
Exercises
Complete the tasks below.
Exercise 1 — Chunked groupby to CSV
Compute total quantity per product from a large CSV with columns: order_id, product, qty. Write the final totals to product_qty.csv.
- Read with chunksize=50_000, usecols=['product','qty'], dtype={'product':'category','qty':'int32'}.
- Group by product for each chunk and accumulate totals in a dict.
- Convert the dict to a DataFrame and save to CSV.
Need a tiny sample to test?
product,qty
A,2
B,1
A,5
C,3
B,4

Expected output:
product,qty
A,7
B,5
C,3

- [ ] Uses chunksize and usecols.
- [ ] Accumulates per product without concatenating all chunks.
- [ ] Writes a single header row in the output.
Common mistakes
- Concatenating every chunk into a big list and calling pd.concat at the end. This defeats the purpose; prefer incremental sums, counts, or writing to file in batches (a sketch of this anti-pattern follows the list).
- Forgetting header control when appending. Always use a flag to write the header only on the first write.
- Wrong dtypes causing a large memory footprint. Downcast integers/floats; use category for repeated strings.
- Expensive parsing on every column. Limit columns with usecols and skip parsing you don't need (such as date parsing).
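To make the first mistake concrete, here is a sketch of the anti-pattern, again using the orders.csv layout from the worked examples:

import pandas as pd

# Anti-pattern (don't run this on a file bigger than your RAM):
chunks = []
for chunk in pd.read_csv('orders.csv', usecols=['revenue'], chunksize=100_000):
    chunks.append(chunk)          # every chunk stays alive in the list
big = pd.concat(chunks)           # peak memory is roughly the whole file
print(big['revenue'].sum())

# Prefer the incremental pattern from Worked example 1: add each chunk's
# sum to a running total and never keep the chunks around.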
How to self-check
- Monitor memory with your OS tools during a run; it should be steady.
- Test the logic on a small nrows read; results should match the chunked version.
- Log or print chunk counters to confirm iterative processing (see the sketch after this list).
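A lightweight way to add that logging is a chunk counter with a timer; a sketch, once more with the orders.csv placeholder:

import time
import pandas as pd

# Print one progress line per chunk: chunk number, rows in this chunk,
# running row count, and seconds since the previous chunk finished.
rows_seen = 0
t_prev = time.perf_counter()
for i, chunk in enumerate(pd.read_csv('orders.csv', usecols=['revenue'],
                                      chunksize=100_000), start=1):
    # ... per-chunk work (filter, aggregate, write) goes here ...
    rows_seen += len(chunk)
    t_now = time.perf_counter()
    print(f"chunk {i}: {len(chunk):,} rows ({rows_seen:,} total) in {t_now - t_prev:.2f}s")
    t_prev = t_now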
Practical projects
- Daily KPI builder: Stream a month of transactions, compute revenue, orders, and AOV per day, write a daily summary CSV.
- Log filter: From compressed web logs, keep only http_status == 200 and selected columns, write to a clean CSV.
- Database loader: Read a huge CSV and write to a SQL table in batches (e.g., 50k rows per insert) to avoid timeouts and memory spikes (see the sketch after this list).
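For the database-loader project, a minimal sketch of batched inserts with pandas.DataFrame.to_sql; it assumes SQLAlchemy with a local SQLite file, and the connection string, table name, and file name are placeholders:

import pandas as pd
from sqlalchemy import create_engine

engine = create_engine('sqlite:///orders.db')  # swap in your own database URL

# Replace the table on the first batch, then append the rest -- the same
# "first write is special" idea as the CSV header flag above.
for i, chunk in enumerate(pd.read_csv('orders.csv', chunksize=50_000)):
    chunk.to_sql('orders', engine,
                 if_exists='replace' if i == 0 else 'append',
                 index=False)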
Who this is for
Data Analysts, BI Developers, and anyone processing files larger than available RAM.
Prerequisites
- Comfort with basic pandas (DataFrame, groupby, filtering).
- CSV basics and Python file paths.
- Optional: basic SQL for batch inserts.
Learning path
- Practice simple chunked sums and counts.
- Move to groupby per chunk and merge partial results.
- Add filtering and writing in append mode.
- Handle compressed files and dtype optimization.
Next steps
- Generalize your accumulation logic into reusable functions.
- Add simple logging (chunk number, rows processed, time per chunk).
- Explore writing to Parquet in partitions if your workflow uses columnar formats.
Mini challenge
Extend Worked example 2 to compute both sum(revenue) and count(orders) per region in a single pass, writing a CSV report as results update per chunk.