Why this matters
As a Product Analyst, you routinely pull raw data into clean, usable tables before any analysis. Typical tasks include:
- Extracting event logs (views, clicks, purchases) for a date range to analyze funnels or retention.
- Pulling orders and users from a database to build revenue and cohort datasets.
- Flattening JSON payloads from tracking or feature flags to evaluate experiments.
- Loading Parquet or CSV snapshots efficiently to avoid memory issues.
Solid extraction ensures your metrics are correct, reproducible, and fast to compute.
Concept explained simply
Data extraction is the step of getting just the right slice of data, with the right shape, into pandas. Think: source → subset → structure → save.
- Source: where data lives (CSV/Parquet/JSON/SQL).
- Subset: the rows/columns you actually need (date and event filters, usecols).
- Structure: clean types, tidy columns, flatten nested fields, de-duplicate.
- Save: persist to a file or keep in memory for analysis.
Mental model: the S3S funnel
- Scope: Define the exact question and required fields.
- Selective read: Load minimally (usecols, chunksize, date filters, SQL WHERE).
- Shape: Normalize/flatten, parse dates, set dtypes, drop dupes.
- Store: Cache to Parquet/Feather for faster reuse (a minimal end-to-end sketch follows).
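A minimal sketch of the funnel end to end, assuming a hypothetical events.csv with user_id, event_type, and ts columns:
import pandas as pd

# Scope: the question only needs purchase events with a user and a timestamp (assumed columns)
cols = ["user_id", "event_type", "ts"]

# Selective read: load only those columns, with explicit dtypes and parsed dates
events = pd.read_csv(
    "events.csv",  # hypothetical file path
    usecols=cols,
    dtype={"user_id": "string", "event_type": "category"},
    parse_dates=["ts"],
)

# Shape: keep only purchases, drop exact duplicates, sort by time
purchases = (
    events.loc[events["event_type"] == "purchase"]
    .drop_duplicates()
    .sort_values("ts")
)

# Store: cache the tidy slice to Parquet for fast reuse (needs pyarrow or fastparquet)
purchases.to_parquet("purchases.parquet", index=False)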
Core techniques in pandas
- CSV: pandas.read_csv(..., usecols=..., dtype=..., parse_dates=...)
- Parquet: pandas.read_parquet(...) for speed and compactness
- JSON: pandas.json_normalize(...) to flatten nested fields
- SQL: pandas.read_sql(query, conn, params=...) with parameterized filters
- Efficiency: usecols, dtype, chunksize, nrows, compression, category dtype
Worked examples
Example 1 — Event CSV: load only needed data
import pandas as pd
from io import StringIO
csv_text = """user_id,event_type,ts,order_id,price
u1,view,2023-05-20T12:00:00,,
u1,add_to_cart,2023-05-21T09:00:00,,
u1,purchase,2023-05-22T10:00:00,5001,19.99
u2,view,2023-05-23T08:00:00,,
u2,purchase,2023-05-25T14:00:00,5002,9.99
u2,purchase,2023-05-26T15:00:00,5003,9.99
u3,view,2023-05-26T16:00:00,,
u3,purchase,2023-05-27T11:00:00,5004,15.0
"""
use_cols = ["user_id","event_type","ts","order_id","price"]
dtypes = {"user_id":"string","event_type":"category","order_id":"Int64","price":"float64"}
df = pd.read_csv(StringIO(csv_text), usecols=use_cols, dtype=dtypes, parse_dates=["ts"])
# Filter and de-duplicate to first purchase per user in a window
# Note: between() is inclusive, and "2023-05-27" parses as midnight, so later events that day are excluded
mask = (df.event_type == "purchase") & (df.ts.between("2023-05-22", "2023-05-27"))
purchases = df.loc[mask].sort_values(["user_id","ts"]).drop_duplicates("user_id", keep="first")
print(purchases[["user_id","ts","order_id","price"]])
Why it works: usecols + dtype + parse_dates cut memory use; sort + drop_duplicates keeps the first purchase per user.
Example 2 — Nested JSON: flatten product attributes
import pandas as pd
from pandas import json_normalize
data = [
{"product_id":"p1","price":{"amount":19.99,"currency":"USD"},"tags":["summer","promo"],"category":"apparel"},
{"product_id":"p2","price":{"amount":9.50,"currency":"USD"},"tags":["new"],"category":"accessories"}
]
flat = json_normalize(data)
flat = flat.rename(columns={
"price.amount":"price",
"price.currency":"currency"
})
# Optional: explode tags to one tag per row
exploded = flat.explode("tags", ignore_index=True)
print(exploded[["product_id","category","price","currency","tags"]])
Why it works: json_normalize flattens nested dicts; explode handles list fields for tidy analysis.
Example 3 — SQL extraction with parameters
import pandas as pd
import sqlite3
# Setup toy DB
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users(user_id TEXT, country TEXT);")
conn.execute("CREATE TABLE orders(order_id INTEGER, user_id TEXT, ts TEXT, amount REAL);")
conn.executemany("INSERT INTO users VALUES(?,?)", [("u1","US"),("u2","US"),("u3","CA")])
conn.executemany("INSERT INTO orders VALUES(?,?,?,?)", [
(5001,"u1","2023-05-22T10:00:00",19.99),
(5002,"u2","2023-05-25T14:00:00",9.99),
(5003,"u2","2023-05-26T15:00:00",9.99),
(5004,"u3","2023-05-27T11:00:00",15.00)
])
query = """
SELECT o.order_id, o.user_id, o.ts, o.amount, u.country
FROM orders o
LEFT JOIN users u USING(user_id)
WHERE o.ts BETWEEN :start AND :end AND u.country = :country
"""
df = pd.read_sql(query, conn, params={"start":"2023-05-22","end":"2023-05-27","country":"US"})
df["ts"] = pd.to_datetime(df["ts"]) # ensure datetime dtype
print(df)
conn.close()
Why it works: Parameterized WHERE keeps reads small and avoids string concatenation mistakes; immediate dtype fix keeps downstream operations safe.
Example 4 — Parquet for speed
import pandas as pd
# Assumes events.parquet exists locally
# columns= keeps the read narrow; pandas uses pyarrow or fastparquet as the engine
cols = ["user_id","event_type","ts"]
df = pd.read_parquet("events.parquet", columns=cols)
df["ts"] = pd.to_datetime(df["ts"], errors="coerce")
print(df.head())
Why it works: Parquet is columnar and compressed; reading only needed columns is fast and memory-efficient.
Efficiency checklist
- Specify usecols to limit columns.
- Set dtypes early (category for low-cardinality strings).
- Parse dates with parse_dates or pd.to_datetime.
- Filter at the source (SQL WHERE; for files, load minimally then filter).
- Use chunksize for very large CSVs; process and append (a sketch follows this checklist).
- Prefer Parquet/Feather for intermediate storage.
- Handle compression (gzip, zip) directly in read_csv.
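A sketch of the chunked pattern from the checklist, assuming a large hypothetical events.csv.gz; each chunk is filtered before anything accumulates in memory:
import pandas as pd

chunks = []
for chunk in pd.read_csv(
    "events.csv.gz",  # hypothetical gzip file; compression is inferred from the extension
    usecols=["user_id", "event_type", "ts"],
    dtype={"user_id": "string"},
    parse_dates=["ts"],
    chunksize=100_000,  # rows per chunk; tune to available memory
):
    # Process and append: keep only purchase rows from each chunk
    chunks.append(chunk.loc[chunk["event_type"] == "purchase"])

purchases = pd.concat(chunks, ignore_index=True)
purchases["event_type"] = purchases["event_type"].astype("category")  # low-cardinality string -> category
purchases.to_parquet("purchases.parquet", index=False)  # Parquet for intermediate storage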
Common mistakes and self-checks
- Forgetting parse_dates: leads to string comparisons. Self-check: df.dtypes shows object for timestamp.
- Wrong dtypes for IDs: plain int vs nullable Int64 vs string. Self-check: do nulls exist? If yes, use a nullable Int64 or string dtype.
- Silent timezone issues: mixing naive and tz-aware. Self-check: df["ts"].dt.tz is consistent.
- Reading all columns: unnecessary memory blow-ups. Self-check: compare df.shape before/after usecols.
- Not de-duplicating events: double counting. Self-check: drop_duplicates by unique key (e.g., event_id).
- Exploding lists too late: mis-aggregations. Self-check: count rows before/after explode and validate totals.
- String concatenation in SQL: error-prone. Prefer params in read_sql.
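A few of these self-checks condensed into one helper (a sketch; the ts and event_id column names are assumptions):
import pandas as pd

def quick_checks(df: pd.DataFrame) -> None:
    # Dtype check: a timestamp stored as object usually means parse_dates was skipped
    print(df.dtypes)
    # Timezone check (assumes ts is already datetime): None means tz-naive; mixing naive and tz-aware values breaks comparisons
    print(df["ts"].dt.tz)
    # Duplicate check: rows sharing the supposed unique key (event_id is illustrative)
    print(df.duplicated(subset=["event_id"]).sum())
    # Null-rate check: high rates after a join or type coercion are a red flag
    print(df.isna().mean().round(3))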
Who this is for
Aspiring and practicing Product Analysts who need reliable, efficient methods to bring raw product data into pandas for analysis and dashboards.
Prerequisites
- Basic Python (variables, functions)
- Intro pandas (DataFrame, filtering, groupby)
- Very basic SQL (SELECT, WHERE, JOIN) helpful
Learning path
- Master targeted reading from CSV/Parquet (usecols, dtype, parse_dates).
- Flatten JSON payloads with json_normalize, explode, and column renaming.
- Query databases with parameterized SQL, then set correct dtypes.
- Apply window filters and de-duplication for first/last events.
- Package repeatable extracts and cache to Parquet.
Practical projects
- Weekly conversion funnel extractor: reads last 7 days of events, outputs tidy Parquet.
- Order-to-user joiner: produces revenue dataset by country with correct timestamps and currencies.
- Experiment assignment flattener: loads JSON assignment logs, explodes variants, and saves a clean table.
Exercises
These mirror the tasks in the Exercises section below. You can complete them in any Python environment. The quick test at the end is available to everyone; only logged-in users have their progress saved.
Exercise 1: First purchase per user (CSV)
Load a small event log, filter purchases within a date window, and keep the first purchase per user.
# See the Exercises list below for the full prompt and dataset (ex1)
Exercise 2: Flatten product JSON
Flatten nested price fields and explode tags to produce a tidy product table.
# See the Exercises list below for the full prompt and dataset (ex2)
Mini challenge
Combine techniques: Load purchases between two dates from CSV, enrich with a JSON product catalog (price and currency), and output total GMV by currency. Aim to minimize memory (usecols, dtype) and ensure timestamps are proper datetimes.
Hints
- Read purchases with usecols and parse_dates; keep only purchase rows.
- json_normalize catalog, rename price fields, and select needed columns.
- Merge on product_id, groupby currency, and sum price.
Next steps
- Automate repeatable extracts and save to Parquet for faster iteration.
- Add basic validation checks (row counts, null rates) after each extract; a combined sketch follows this list.
- Proceed to transformations and feature engineering for product metrics.
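A sketch of a repeatable extract with both ideas combined, assuming a hypothetical orders.csv; the validation numbers are printed rather than enforced:
import pandas as pd

def extract_orders(src: str = "orders.csv", dest: str = "orders.parquet") -> pd.DataFrame:
    # Targeted read: only the columns the revenue dataset needs (column names are assumptions)
    df = pd.read_csv(
        src,
        usecols=["order_id", "user_id", "ts", "amount"],
        dtype={"order_id": "Int64", "user_id": "string"},
        parse_dates=["ts"],
    )
    # Basic validation after the extract: row count and null rates
    print(f"rows: {len(df)}")
    print(df.isna().mean().round(3))
    # Cache to Parquet for faster iteration downstream (needs pyarrow or fastparquet)
    df.to_parquet(dest, index=False)
    return df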
Progress reminder
The quick test is open to everyone. Log in if you want your test results and exercise progress saved.