
Product Data Extraction

Learn Product Data Extraction for free with explanations, exercises, and a quick test (for Product Analysts).

Published: December 22, 2025 | Updated: December 22, 2025

Why this matters

As a Product Analyst, you routinely pull raw data into clean, usable tables before any analysis. Typical tasks include:

  • Extracting event logs (views, clicks, purchases) for a date range to analyze funnels or retention.
  • Pulling orders and users from a database to build revenue and cohort datasets.
  • Flattening JSON payloads from tracking or feature flags to evaluate experiments.
  • Loading Parquet or CSV snapshots efficiently to avoid memory issues.

Solid extraction ensures your metrics are correct, reproducible, and fast to compute.

Concept explained simply

Data extraction is the step of getting just the right slice of data, with the right shape, into pandas. Think: source → subset → structure → save.

  • Source: where data lives (CSV/Parquet/JSON/SQL).
  • Subset: the rows/columns you actually need (date and event filters, usecols).
  • Structure: clean types, tidy columns, flatten nested fields, de-duplicate.
  • Save: persist to a file or keep in memory for analysis.

Mental model: the S3S funnel

  1. Scope: Define the exact question and required fields.
  2. Selective read: Load minimally (usecols, chunksize, date filters, SQL WHERE).
  3. Shape: Normalize/flatten, parse dates, set dtypes, drop dupes.
  4. Store: Cache to Parquet/Feather for faster reuse (a minimal end-to-end sketch follows this list).
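
A minimal sketch of the funnel end to end, assuming a hypothetical local file events.csv with user_id, event_type, and ts columns (the file name and columns are illustrative, not a dataset from this course):

import pandas as pd
# 1) Scope: the question only needs purchase events and three fields
cols = ["user_id", "event_type", "ts"]
# 2) Selective read: load only those columns, with explicit dtypes and parsed dates
df = pd.read_csv("events.csv", usecols=cols,
                 dtype={"user_id": "string", "event_type": "category"},
                 parse_dates=["ts"])
# 3) Shape: keep purchases only and drop exact duplicate rows
purchases = df[df["event_type"] == "purchase"].drop_duplicates()
# 4) Store: cache to Parquet so later notebooks can reload it quickly
purchases.to_parquet("purchases_cache.parquet", index=False)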

Core techniques in pandas

  • CSV: pandas.read_csv(..., usecols=..., dtype=..., parse_dates=...)
  • Parquet: pandas.read_parquet(...) for speed and compactness
  • JSON: pandas.json_normalize(...) to flatten nested fields
  • SQL: pandas.read_sql(query, conn, params=...) with parameterized filters
  • Efficiency: usecols, dtype, chunksize, nrows, compression, category dtype

Worked examples

Example 1 — Event CSV: load only needed data

import pandas as pd
from io import StringIO
csv_text = """user_id,event_type,ts,order_id,price
u1,view,2023-05-20T12:00:00,,
u1,add_to_cart,2023-05-21T09:00:00,,
u1,purchase,2023-05-22T10:00:00,5001,19.99
u2,view,2023-05-23T08:00:00,,
u2,purchase,2023-05-25T14:00:00,5002,9.99
u2,purchase,2023-05-26T15:00:00,5003,9.99
u3,view,2023-05-26T16:00:00,,
u3,purchase,2023-05-27T11:00:00,5004,15.0
"""
use_cols = ["user_id","event_type","ts","order_id","price"]
dtypes = {"user_id":"string","event_type":"category","order_id":"Int64","price":"float64"}
df = pd.read_csv(StringIO(csv_text), usecols=use_cols, dtype=dtypes, parse_dates=["ts"])
# Filter to purchases in the window and de-duplicate to the first purchase per user
mask = (df.event_type == "purchase") & (df.ts.dt.normalize().between("2023-05-22", "2023-05-27"))
purchases = df.loc[mask].sort_values(["user_id","ts"]).drop_duplicates("user_id", keep="first")
print(purchases[["user_id","ts","order_id","price"]])

Why it works: usecols + dtype + parse_dates cuts memory; comparing normalized dates keeps the full end day inside the window; sort + drop_duplicates keeps the first purchase per user.

Example 2 — Nested JSON: flatten product attributes

import pandas as pd
from pandas import json_normalize
data = [
  {"product_id":"p1","price":{"amount":19.99,"currency":"USD"},"tags":["summer","promo"],"category":"apparel"},
  {"product_id":"p2","price":{"amount":9.50,"currency":"USD"},"tags":["new"],"category":"accessories"}
]
flat = json_normalize(data)
flat = flat.rename(columns={
  "price.amount":"price",
  "price.currency":"currency"
})
# Optional: explode tags to one tag per row
exploded = flat.explode("tags", ignore_index=True)
print(exploded[["product_id","category","price","currency","tags"]])

Why it works: json_normalize flattens nested dicts; explode handles list fields for tidy analysis.

Example 3 — SQL extraction with parameters

import pandas as pd
import sqlite3
# Setup toy DB
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users(user_id TEXT, country TEXT);")
conn.execute("CREATE TABLE orders(order_id INTEGER, user_id TEXT, ts TEXT, amount REAL);")
conn.executemany("INSERT INTO users VALUES(?,?)", [("u1","US"),("u2","US"),("u3","CA")])
conn.executemany("INSERT INTO orders VALUES(?,?,?,?)", [
  (5001,"u1","2023-05-22T10:00:00",19.99),
  (5002,"u2","2023-05-25T14:00:00",9.99),
  (5003,"u2","2023-05-26T15:00:00",9.99),
  (5004,"u3","2023-05-27T11:00:00",15.00)
])
query = """
SELECT o.order_id, o.user_id, o.ts, o.amount, u.country
FROM orders o
LEFT JOIN users u USING(user_id)
WHERE o.ts BETWEEN :start AND :end AND u.country = :country
"""
df = pd.read_sql(query, conn, params={"start":"2023-05-22","end":"2023-05-27","country":"US"})
df["ts"] = pd.to_datetime(df["ts"])  # ensure datetime dtype
print(df)
conn.close()

Why it works: Parameterized WHERE keeps reads small and avoids string concatenation mistakes; immediate dtype fix keeps downstream operations safe.

Example 4 — Parquet for speed

import pandas as pd
# Assuming events.parquet exists locally
# cols keeps schema narrow; engine picks fast backend if available
cols = ["user_id","event_type","ts"]
df = pd.read_parquet("events.parquet", columns=cols)
df["ts"] = pd.to_datetime(df["ts"], errors="coerce")
print(df.head())

Why it works: Parquet is columnar and compressed; reading only needed columns is fast and memory-efficient.

Efficiency checklist

  • Specify usecols to limit columns.
  • Set dtypes early (category for low-cardinality strings).
  • Parse dates with parse_dates or pd.to_datetime.
  • Filter at the source (SQL WHERE; for files, load minimally then filter).
  • Use chunksize for very large CSVs; process and append (see the sketch after this list).
  • Prefer Parquet/Feather for intermediate storage.
  • Handle compression (gzip, zip) directly in read_csv.
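
For very large raw files, a minimal sketch of a chunked, filtered read, assuming a hypothetical gzip-compressed file big_events.csv.gz (the file name, columns, and chunk size are illustrative):

import pandas as pd
chunks = []
reader = pd.read_csv("big_events.csv.gz",      # compression inferred from the .gz extension
                     usecols=["user_id", "event_type", "ts"],
                     dtype={"user_id": "string", "event_type": "category"},
                     parse_dates=["ts"],
                     chunksize=100_000)        # stream the file in pieces
for chunk in reader:
    # Filter each piece before keeping it so memory stays bounded
    chunks.append(chunk[chunk["event_type"] == "purchase"])
purchases = pd.concat(chunks, ignore_index=True)
print(len(purchases))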

Common mistakes and self-check

  • Forgetting parse_dates: leads to string comparisons. Self-check: df.dtypes shows object for timestamp.
  • Wrong dtypes for IDs: int vs nullable Int64 vs string. Self-check: do nulls exist? If yes, use nullable or strings.
  • Silent timezone issues: mixing naive and tz-aware. Self-check: df["ts"].dt.tz is consistent.
  • Reading all columns: unnecessary memory blow-ups. Self-check: compare df.shape before/after usecols.
  • Not de-duplicating events: double counting. Self-check: drop_duplicates by unique key (e.g., event_id); a small helper covering several of these checks is sketched after this list.
  • Exploding lists too late: mis-aggregations. Self-check: count rows before/after explode and validate totals.
  • String concatenation in SQL: error-prone. Prefer params in read_sql.
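
A few of these self-checks collected into a small helper; df, ts, and event_id are stand-ins for whatever extract and key columns you actually have:

import pandas as pd
def check_extract(df: pd.DataFrame, ts_col: str = "ts", key: str = "event_id") -> None:
    # Timestamps should be a real datetime dtype, not object/strings
    assert pd.api.types.is_datetime64_any_dtype(df[ts_col]), df[ts_col].dtype
    # Timezone should be consistent (None means naive everywhere)
    print("tz:", df[ts_col].dt.tz)
    # Duplicate keys inflate event counts; check before aggregating
    print("duplicate keys:", int(df.duplicated(subset=[key]).sum()))
    # Row count, null rates, and rough memory footprint after narrowing columns
    print("rows:", len(df))
    print(df.isna().mean().round(3))
    print(round(df.memory_usage(deep=True).sum() / 1e6, 2), "MB")

Call it right after each extract, e.g. check_extract(purchases, key="order_id").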

Who this is for

Aspiring and practicing Product Analysts who need reliable, efficient methods to bring raw product data into pandas for analysis and dashboards.

Prerequisites

  • Basic Python (variables, functions)
  • Intro pandas (DataFrame, filtering, groupby)
  • Very basic SQL (SELECT, WHERE, JOIN) helpful

Learning path

  1. Master targeted reading from CSV/Parquet (usecols, dtype, parse_dates).
  2. Flatten JSON payloads with json_normalize, explode, and column renaming.
  3. Query databases with parameterized SQL, then set correct dtypes.
  4. Apply window filters and de-duplication for first/last events.
  5. Package repeatable extracts and cache to Parquet (one approach is sketched below).
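
One way to package step 5, assuming a hypothetical CSV source with the same columns as the worked examples above (the function name, file names, and cache layout are illustrative):

import pandas as pd
from pathlib import Path
def extract_purchases(src: str, start: str, end: str, cache_dir: str = "cache") -> pd.DataFrame:
    # Repeatable extract: purchases in [start, end], cached to Parquet for reuse
    cache = Path(cache_dir) / f"purchases_{start}_{end}.parquet"
    if cache.exists():
        return pd.read_parquet(cache)          # reuse the cached extract
    df = pd.read_csv(src,
                     usecols=["user_id", "event_type", "ts", "order_id", "price"],
                     dtype={"user_id": "string", "event_type": "category", "order_id": "Int64"},
                     parse_dates=["ts"])
    out = df[(df["event_type"] == "purchase") & (df["ts"].dt.normalize().between(start, end))]
    cache.parent.mkdir(parents=True, exist_ok=True)
    out.to_parquet(cache, index=False)
    return out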

Practical projects

  • Weekly conversion funnel extractor: reads last 7 days of events, outputs tidy Parquet.
  • Order-to-user joiner: produces revenue dataset by country with correct timestamps and currencies.
  • Experiment assignment flattener: loads JSON assignment logs, explodes variants, and saves a clean table.

Exercises

These mirror the tasks in the Exercises section below. You can complete them in any Python environment. The quick test at the end is available to everyone; only logged-in users have progress saved.

Exercise 1: First purchase per user (CSV)

Load a small event log, filter purchases within a date window, and keep the first purchase per user.

# See the Exercises list below for the full prompt and dataset (ex1)
Exercise 2: Flatten product JSON

Flatten nested price fields and explode tags to produce a tidy product table.

# See the Exercises list below for the full prompt and dataset (ex2)

Mini challenge

Combine techniques: Load purchases between two dates from CSV, enrich with a JSON product catalog (price currency), and output total GMV by currency. Aim to minimize memory (usecols, dtype) and ensure timestamps are proper datetimes.

Hints
  • Read purchases with usecols and parse_dates; keep only purchase rows.
  • json_normalize catalog, rename price fields, and select needed columns.
  • Merge on product_id, groupby currency, and sum price.

Next steps

  • Automate repeatable extracts and save to Parquet for faster iteration.
  • Add basic validation checks (row counts, null rates) after each extract.
  • Proceed to transformations and feature engineering for product metrics.

Progress reminder

The quick test is open to everyone. Log in if you want your test results and exercise progress saved.

Practice Exercises

2 exercises to complete

Instructions

Create a DataFrame from the CSV text below. Keep only purchase events between 2023-05-22 and 2023-05-27 inclusive. For each user, keep the first purchase (earliest ts). Output columns: user_id, ts, order_id, price. Sort by user_id.

CSV text
user_id,event_type,ts,order_id,price
u1,view,2023-05-20T12:00:00,,
u1,add_to_cart,2023-05-21T09:00:00,,
u1,purchase,2023-05-22T10:00:00,5001,19.99
u2,view,2023-05-23T08:00:00,,
u2,purchase,2023-05-25T14:00:00,5002,9.99
u2,purchase,2023-05-26T15:00:00,5003,9.99
u3,view,2023-05-26T16:00:00,,
u3,purchase,2023-05-27T11:00:00,5004,15.0

  • Use usecols, dtype, and parse_dates to load efficiently.
  • Drop duplicates by user_id after sorting by ts ascending.
  • Ensure ts is datetime dtype.

Expected Output
user_id,ts,order_id,price
u1,2023-05-22T10:00:00,5001,19.99
u2,2023-05-25T14:00:00,5002,9.99
u3,2023-05-27T11:00:00,5004,15.0

Product Data Extraction — Quick Test

Test your knowledge with 8 questions. Pass with 70% or higher.
