Why this matters
As a Data Analyst, you often face files too large to fit into memory: daily clickstream logs, month-long sales exports, or multi-GB CSVs from partners. Chunking and streaming let you process these safely and quickly without upgrading your laptop. You will be able to:
- Compute KPIs (e.g., revenue per region) from 10–100 GB CSVs.
- Filter, deduplicate, or sample rows while writing to a new file incrementally.
- Load data into databases in batches to avoid memory spikes.
Concept explained simply
Chunking reads a big file piece-by-piece (e.g., 50,000 rows at a time) so you process each piece and discard it before reading the next. Streaming is a broader idea: consume data as it arrives or in small parts without holding everything at once.
Mental model
Imagine a conveyor belt with boxes (chunks). You open one box, do your work (filter, aggregate), put the result aside, then open the next. You never hold all boxes at once, so your table (memory) never overflows.
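In code, the conveyor belt is just a for loop over an iterator; a minimal sketch (the file name is a placeholder):

import pandas as pd

# Each iteration yields one "box": a DataFrame of up to 50,000 rows.
# Process it, keep only the small result, and let the chunk be discarded.
for chunk in pd.read_csv('big_file.csv', chunksize=50_000):
    print(len(chunk), 'rows in this chunk')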
Essential pandas tools
- pd.read_csv(..., chunksize=50000): returns an iterator of DataFrames.
- usecols=... and dtype=...: read only the columns you need, with the right data types (see the peek sketch after this list).
- compression='gzip': read compressed CSVs directly.
- Accumulate results incrementally: update dicts, or group each chunk and then combine.
- Write results in batches using mode='a' (append) and control header writing.
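Before a full pass, it often pays to peek at a single chunk and confirm that the columns and dtypes look right. A minimal sketch, using the orders.csv layout from the worked examples below:

import pandas as pd

# Read just the first chunk to sanity-check columns and data types.
reader = pd.read_csv('orders.csv', usecols=['region', 'revenue'],
                     dtype={'region': 'category', 'revenue': 'float32'},
                     chunksize=50_000)
first = next(iter(reader))
print(first.dtypes)
print(first.head())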
Choosing a chunk size
Start with 50k–200k rows per chunk. If your machine slows or memory spikes, reduce it. If CPU is idle and memory is comfortable, increase it.
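To turn that rule of thumb into numbers, you can measure how much memory one chunk actually uses and scale from there; a rough sketch (the file name is a placeholder):

import pandas as pd

# Measure the in-memory size of a single 50k-row chunk, then adjust
# chunksize so one chunk stays well below your available RAM.
chunk = next(iter(pd.read_csv('orders.csv', chunksize=50_000)))
mb_per_chunk = chunk.memory_usage(deep=True).sum() / 1e6
print(f"~{mb_per_chunk:.1f} MB per 50,000-row chunk")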
Worked example 1: Global sum across chunks
Task: Total revenue from a huge CSV with columns: order_id, region, revenue.
import pandas as pd

revenue_total = 0.0
for chunk in pd.read_csv('orders.csv', usecols=['revenue'], dtype={'revenue':'float32'}, chunksize=100_000):
    revenue_total += chunk['revenue'].sum()
print(round(revenue_total, 2))

Why it works: each chunk sums its piece; we add it to a running total. Memory stays small.
Worked example 2: Group by category across chunks
Task: Revenue by region without loading the full dataset.
import pandas as pd

partial_totals = {}
for chunk in pd.read_csv('orders.csv', usecols=['region','revenue'],
                         dtype={'region':'category', 'revenue':'float32'},
                         chunksize=200_000):
    grp = chunk.groupby('region', observed=True)['revenue'].sum()
    for region, val in grp.items():
        partial_totals[region] = partial_totals.get(region, 0.0) + float(val)

# Turn into a DataFrame if needed
result = pd.DataFrame(list(partial_totals.items()), columns=['region','revenue'])
print(result.sort_values('revenue', ascending=False))

Tip: observed=True is faster with categorical dtypes and avoids unused category combinations in the result.
Worked example 3: Filter to a new file (append mode)
Task: Keep only successful transactions and store them into a new CSV.
import pandas as pd

out_path = 'transactions_success.csv'
first_chunk = True
for chunk in pd.read_csv('transactions.csv', chunksize=100_000, dtype={'status':'category'}):
    filtered = chunk[chunk['status'] == 'success']
    filtered.to_csv(out_path, index=False, mode='a', header=first_chunk)
    first_chunk = False

Key detail: write the header only once (the first_chunk flag). This avoids corrupting the output. Also delete any existing output file before re-running, because mode='a' keeps appending to it.
Worked example 4: Reading compressed logs
Task: Count the rows in a gzip-compressed CSV where http_status == 200.
import pandas as pd

total_ok = 0
for chunk in pd.read_csv('logs.csv.gz', compression='gzip', chunksize=250_000,
                         usecols=['http_status'], dtype={'http_status':'int16'}):
    total_ok += (chunk['http_status'] == 200).sum()
print(total_ok)

Performance and memory tips
- Read fewer columns with usecols.
- Downcast numerics: int16/float32 when safe (see the sketch after this list).
- Use category for low-cardinality strings (e.g., country, status).
- Avoid concatenating all chunks into one big DataFrame unless you truly need it. Aggregate or write out instead.
- Prototype on nrows=100_000 to validate logic, then switch to chunking.
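As a rough illustration of the dtype tips, here is a sketch comparing the default float64 against float32 for one column, reusing the orders.csv layout from the worked examples (adjust names to your data):

import pandas as pd

# Same column read twice on a 100k-row sample: default float64 vs float32.
default = pd.read_csv('orders.csv', usecols=['revenue'], nrows=100_000)
compact = pd.read_csv('orders.csv', usecols=['revenue'], nrows=100_000,
                      dtype={'revenue': 'float32'})

print(default['revenue'].memory_usage(deep=True))  # float64: ~8 bytes per row
print(compact['revenue'].memory_usage(deep=True))  # float32: ~4 bytes per row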
Quick self-check: am I safe on memory?
- Peak memory doesn’t jump when you increase file size (see the check after this list).
- Your code works on a subset and on the full file with the same logic.
- No step creates a giant temporary DataFrame unintentionally.
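One way to verify the first point is to track peak memory with the standard library's tracemalloc while a chunked job runs; a minimal sketch, reusing the orders.csv running-total example:

import tracemalloc
import pandas as pd

# Run the same chunked job on a subset and on the full file;
# the reported peak should stay roughly the same.
tracemalloc.start()
total = 0.0
for chunk in pd.read_csv('orders.csv', usecols=['revenue'], chunksize=100_000):
    total += chunk['revenue'].sum()
current, peak = tracemalloc.get_traced_memory()
tracemalloc.stop()
print(f"total revenue: {total:.2f}, peak memory: {peak / 1e6:.1f} MB")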
Exercises
Complete the tasks below.
Exercise 1 — Chunked groupby to CSV
Compute total quantity per product from a large CSV with columns: order_id, product, qty. Write the final totals to product_qty.csv.
- Read with chunksize=50_000, usecols=['product','qty'], dtype={'product':'category','qty':'int32'}.
- Group by product for each chunk and accumulate totals in a dict.
- Convert the dict to a DataFrame and save to CSV.
Need a tiny sample to test?
product,qty
A,2
B,1
A,5
C,3
B,4

Expected output:
product,qty
A,7
B,5
C,3

- [ ] Uses chunksize and usecols.
- [ ] Accumulates per product without concatenating all chunks.
- [ ] Writes a single header row in the output.
Common mistakes
- Concatenating every chunk into a big list and calling pd.concat at the end. This defeats the purpose; prefer incremental sums, counts, or writing to file in batches (a sketch of this anti-pattern follows the list).
- Forgetting header control when appending. Always use a flag to write the header only on the first write.
- Wrong dtypes causing a large memory footprint. Downcast integers/floats; use category for repeated strings.
- Expensive parsing on every column. Limit columns with usecols and skip parsing you don't need (such as date parsing).
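To make the first mistake concrete, here is a sketch of the anti-pattern, again using the orders.csv layout from the worked examples:

import pandas as pd

# Anti-pattern (don't run this on a file bigger than your RAM):
chunks = []
for chunk in pd.read_csv('orders.csv', usecols=['revenue'], chunksize=100_000):
    chunks.append(chunk)          # every chunk stays alive in the list
big = pd.concat(chunks)           # peak memory is roughly the whole file
print(big['revenue'].sum())

# Prefer the incremental pattern from Worked example 1: add each chunk's
# sum to a running total and never keep the chunks around.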
How to self-check
- Monitor memory with your OS tools during a run; it should be steady.
- Test the logic on a small nrows read; results should match the chunked version.
- Log or print chunk counters to confirm iterative processing (see the sketch after this list).
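A lightweight way to add that logging is a chunk counter with a timer; a sketch, once more with the orders.csv placeholder:

import time
import pandas as pd

# Print one progress line per chunk: chunk number, rows in this chunk,
# running row count, and seconds since the previous chunk finished.
rows_seen = 0
t_prev = time.perf_counter()
for i, chunk in enumerate(pd.read_csv('orders.csv', usecols=['revenue'],
                                      chunksize=100_000), start=1):
    # ... per-chunk work (filter, aggregate, write) goes here ...
    rows_seen += len(chunk)
    t_now = time.perf_counter()
    print(f"chunk {i}: {len(chunk):,} rows ({rows_seen:,} total) in {t_now - t_prev:.2f}s")
    t_prev = t_now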
Practical projects
- Daily KPI builder: Stream a month of transactions, compute revenue, orders, and AOV per day, write a daily summary CSV.
- Log filter: From compressed web logs, keep only http_status == 200 and selected columns, write to a clean CSV.
- Database loader: Read a huge CSV and write to a SQL table in batches (e.g., 50k rows per insert) to avoid timeouts and memory spikes (see the sketch after this list).
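For the database-loader project, a minimal sketch of batched inserts with pandas.DataFrame.to_sql; it assumes SQLAlchemy with a local SQLite file, and the connection string, table name, and file name are placeholders:

import pandas as pd
from sqlalchemy import create_engine

engine = create_engine('sqlite:///orders.db')  # swap in your own database URL

# Replace the table on the first batch, then append the rest -- the same
# "first write is special" idea as the CSV header flag above.
for i, chunk in enumerate(pd.read_csv('orders.csv', chunksize=50_000)):
    chunk.to_sql('orders', engine,
                 if_exists='replace' if i == 0 else 'append',
                 index=False)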
Who this is for
Data Analysts, BI Developers, and anyone processing files larger than available RAM.
Prerequisites
- Comfort with basic pandas (DataFrame, groupby, filtering).
- CSV basics and Python file paths.
- Optional: basic SQL for batch inserts.
Learning path
- Practice simple chunked sums and counts.
- Move to groupby per chunk and merge partial results.
- Add filtering and writing in append mode.
- Handle compressed files and dtype optimization.
Next steps
- Generalize your accumulation logic into reusable functions.
- Add simple logging (chunk number, rows processed, time per chunk).
- Explore writing to Parquet in partitions if your workflow uses columnar formats.
Mini challenge
Extend Worked example 2 to compute both sum(revenue) and count(orders) per region in a single pass, writing a CSV report as results update per chunk.