Chunking and Streaming Large Files

Learn Chunking and Streaming Large Files for free, with explanations, exercises, and a quick test (for Data Analysts).

Published: December 20, 2025 | Updated: December 20, 2025

Why this matters

As a Data Analyst, you often face files too large to fit into memory: daily clickstream logs, month-long sales exports, or multi-GB CSVs from partners. Chunking and streaming let you process these safely and quickly without upgrading your laptop. You will be able to:

  • Compute KPIs (e.g., revenue per region) from 10–100 GB CSVs.
  • Filter, deduplicate, or sample rows while writing to a new file incrementally.
  • Load data into databases in batches to avoid memory spikes.

Concept explained simply

Chunking reads a big file piece by piece (e.g., 50,000 rows at a time), so you process each piece and discard it before reading the next. Streaming is the broader idea: consume data as it arrives, or in small parts, without ever holding everything in memory at once.
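
Here is a minimal sketch of both ideas side by side, assuming a large orders.csv with a revenue column like the one used in the worked examples below:

import pandas as pd

# Chunking: pandas hands you one DataFrame of up to 50,000 rows at a time.
total = 0.0
for chunk in pd.read_csv('orders.csv', chunksize=50_000):
    total += chunk['revenue'].sum()

# Streaming: iterate the raw file line by line; only one line is held at a time.
row_count = 0
with open('orders.csv') as f:
    next(f)  # skip the header row
    for line in f:
        row_count += 1

print(total, row_count)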

Mental model

Imagine a conveyor belt with boxes (chunks). You open one box, do your work (filter, aggregate), put the result aside, then open the next. You never hold all boxes at once, so your table (memory) never overflows.

Essential pandas tools

  • pd.read_csv(..., chunksize=50000): returns an iterator of DataFrames.
  • usecols=... and dtype=...: read only what you need, with the right data types.
  • compression='gzip': read compressed CSVs directly.
  • Accumulate results incrementally: update dicts, or group each chunk then combine.
  • Write results in batches using mode='a' (append) and control header writing.

Choosing a chunk size

Start with 50k–200k rows per chunk. If your machine slows or memory spikes, reduce it. If CPU is idle and memory is comfortable, increase it.
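
Before committing to a full run, you can peek at a single chunk and measure how much memory it actually uses; a small sketch, assuming the same orders.csv file as the worked examples:

import pandas as pd

# Read just the first chunk and report its in-memory footprint.
with pd.read_csv('orders.csv', chunksize=100_000) as reader:
    first = next(reader)

mb = first.memory_usage(deep=True).sum() / 1_000_000
print(f"~{mb:.1f} MB for a 100,000-row chunk")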

Worked example 1: Global sum across chunks

Task: Total revenue from a huge CSV with columns: order_id, region, revenue.

import pandas as pd

revenue_total = 0.0
for chunk in pd.read_csv('orders.csv', usecols=['revenue'], dtype={'revenue':'float32'}, chunksize=100_000):
    revenue_total += chunk['revenue'].sum()

print(round(revenue_total, 2))

Why it works: each chunk sums its piece; we add to a running total. Memory stays small.

Worked example 2: Group by category across chunks

Task: Revenue by region without loading the full dataset.

import pandas as pd

partial_totals = {}
for chunk in pd.read_csv('orders.csv', usecols=['region','revenue'],
                        dtype={'region':'category', 'revenue':'float32'},
                        chunksize=200_000):
    grp = chunk.groupby('region', observed=True)['revenue'].sum()
    for region, val in grp.items():
        partial_totals[region] = partial_totals.get(region, 0.0) + float(val)

# Turn into a DataFrame if needed
result = pd.DataFrame(list(partial_totals.items()), columns=['region','revenue'])
print(result.sort_values('revenue', ascending=False))

Tip: observed=True groups only the categories that actually appear, which is faster with categorical dtypes and avoids rows for unused categories.
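
If you prefer to stay in pandas rather than use a plain dict, an equivalent pattern is to keep the tiny per-chunk results and combine them at the end; this does not defeat the purpose of chunking, because each partial result holds only one row per region (a sketch under the same orders.csv assumption):

import pandas as pd

partials = []
for chunk in pd.read_csv('orders.csv', usecols=['region', 'revenue'],
                         dtype={'region': 'category', 'revenue': 'float32'},
                         chunksize=200_000):
    # One small Series per chunk: index = region, values = revenue sum.
    partials.append(chunk.groupby('region', observed=True)['revenue'].sum())

result = pd.concat(partials).groupby(level=0).sum()
print(result.sort_values(ascending=False))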

Worked example 3: Filter to a new file (append mode)

Task: Keep only successful transactions and store them into a new CSV.

import pandas as pd

out_path = 'transactions_success.csv'
first_chunk = True

for chunk in pd.read_csv('transactions.csv', chunksize=100_000, dtype={'status':'category'}):
    filtered = chunk[chunk['status'] == 'success']
    filtered.to_csv(out_path, index=False, mode='a', header=first_chunk)
    first_chunk = False

Key detail: write the header only once (the first_chunk flag); otherwise every appended chunk adds another header row. Also remove any existing output file before the run, or mode='a' will append to stale data left by a previous run.

Worked example 4: Reading compressed logs

Task: Count rows in a GZIP-compressed CSV where http_status == 200.

import pandas as pd

total_ok = 0
for chunk in pd.read_csv('logs.csv.gz', compression='gzip', chunksize=250_000,
                        usecols=['http_status'], dtype={'http_status':'int16'}):
    total_ok += (chunk['http_status'] == 200).sum()

print(total_ok)

Performance and memory tips

  • Read fewer columns with usecols.
  • Downcast numerics: int16/float32 when safe.
  • Use category for low-cardinality strings (e.g., country, status); the sketch after this list shows how much these dtype choices can save.
  • Avoid concatenating all chunks into one big DataFrame unless you truly need it. Aggregate or write out instead.
  • Prototype on nrows=100_000 to validate logic, then switch to chunking.
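
To see how much the dtype advice above matters, compare one chunk's footprint before and after optimizing; a sketch, assuming orders.csv has region and revenue columns:

import pandas as pd

chunk = pd.read_csv('orders.csv', usecols=['region', 'revenue'], nrows=100_000)

# Default dtypes: object strings and float64.
before = chunk.memory_usage(deep=True).sum()

# Downcast the float and store the repeated strings as a categorical.
smaller = chunk.assign(
    revenue=chunk['revenue'].astype('float32'),
    region=chunk['region'].astype('category'),
)
after = smaller.memory_usage(deep=True).sum()

print(f"{before / 1e6:.1f} MB -> {after / 1e6:.1f} MB")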

Quick self-check: am I safe on memory?

  • Peak memory doesn’t jump when you increase file size.
  • Your code works on a subset and on the full file with the same logic.
  • No step creates a giant temporary DataFrame unintentionally.

Exercises

Complete the tasks below.

Exercise 1 — Chunked groupby to CSV

Compute total quantity per product from a large CSV with columns: order_id, product, qty. Write the final totals to product_qty.csv.

  1. Read with chunksize=50_000, usecols=['product','qty'], dtype={'product':'category','qty':'int32'}.
  2. Group by product for each chunk and accumulate totals in a dict.
  3. Convert the dict to a DataFrame sorted by product and save to CSV.

Need a tiny sample to test?
product,qty
A,2
B,1
A,5
C,3
B,4

Expected output:

product,qty
A,7
B,5
C,3

  • [ ] Uses chunksize and usecols.
  • [ ] Accumulates per product without concatenating all chunks.
  • [ ] Writes a single header row in the output.

Common mistakes

  • Concatenating every chunk into a big list and pd.concat at the end. This defeats the purpose. Prefer incremental sums, counts, or writing to file in batches.
  • Forgetting header control when appending. Always use a flag to write the header only on the first write.
  • Wrong dtypes causing large memory footprint. Downcast integers/floats; use category for repeated strings.
  • Expensive parsing on every column. Limit columns with usecols and skip parsing you don't need (for example, leave parse_dates off unless you actually use the dates).

How to self-check

  • Monitor memory with your OS tools during a run; it should be steady.
  • Test logic on a small nrows; results should match the chunked version.
  • Log or print chunk counters to confirm iterative processing (see the sketch below).
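
A minimal version of that last check, assuming the same orders.csv as before:

import time
import pandas as pd

start = time.perf_counter()
rows_seen = 0

for i, chunk in enumerate(pd.read_csv('orders.csv', chunksize=100_000), start=1):
    rows_seen += len(chunk)
    # One line per chunk confirms the file is being processed iteratively.
    print(f"chunk {i}: {rows_seen:,} rows so far, {time.perf_counter() - start:.1f}s elapsed")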

Practical projects

  • Daily KPI builder: Stream a month of transactions, compute revenue, orders, and AOV per day, write a daily summary CSV.
  • Log filter: From compressed web logs, keep only http_status==200 and selected columns, write to a clean CSV.
  • Database loader: Read a huge CSV and write to a SQL table in batches (e.g., 50k rows per insert) to avoid timeouts and memory spikes (see the sketch below).
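
A sketch of the database-loader idea, using SQLite through SQLAlchemy purely for illustration (the connection string, table name, and file name are placeholders):

import pandas as pd
from sqlalchemy import create_engine

engine = create_engine('sqlite:///analytics.db')

for chunk in pd.read_csv('orders.csv', chunksize=50_000):
    # Each 50k-row batch becomes its own insert, so memory stays flat
    # and a mid-run failure only loses the batches not yet written.
    chunk.to_sql('orders', engine, if_exists='append', index=False)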

Who this is for

Data Analysts, BI Developers, and anyone processing files larger than available RAM.

Prerequisites

  • Comfort with basic pandas (DataFrame, groupby, filtering).
  • CSV basics and Python file paths.
  • Optional: basic SQL for batch inserts.

Learning path

  1. Practice simple chunked sums and counts.
  2. Move to groupby per chunk and merge partial results.
  3. Add filtering and writing in append mode.
  4. Handle compressed files and dtype optimization.

Next steps

  • Generalize your accumulation logic into reusable functions.
  • Add simple logging (chunk number, rows processed, time per chunk).
  • Explore writing to Parquet in partitions if your workflow uses columnar formats (a sketch follows).
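
If you explore the Parquet route, a simple partitioning scheme is one file per chunk; this sketch assumes pyarrow (or fastparquet) is installed and uses placeholder paths:

import pandas as pd
from pathlib import Path

out_dir = Path('orders_parquet')
out_dir.mkdir(exist_ok=True)

for i, chunk in enumerate(pd.read_csv('orders.csv', chunksize=200_000)):
    # Columnar files compress well and can later be read column-by-column.
    chunk.to_parquet(out_dir / f'part_{i:05d}.parquet', index=False)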

Mini challenge

Extend Worked example 2 to compute both sum(revenue) and count(orders) per region in a single pass, writing an updated CSV report after each chunk.

Quick Test

The quick test has 8 questions; a score of 70% or higher passes.

