Who this is for
Product Analysts who work with raw event data (web/app analytics, product telemetry, marketing attribution) and need clean, trustworthy data for funnels, retention, and experiment analysis.
Prerequisites
- Basic pandas: DataFrame operations, filtering, sorting
- Dates/times in pandas: to_datetime, timezone basics
- Comfort reading JSON-like columns (dict-like properties)
Why this matters
Real product decisions rely on accurate events. Dirty logs cause wrong funnels, broken cohorts, and misleading A/B test results. Typical on-the-job tasks:
- Unify mixed timestamp formats from SDKs and servers
- Deduplicate repeated events from retries or frontend double-fires
- Standardize event names and parameters for consistent analysis
- Filter bot/script traffic without harming real user data
- Flatten JSON properties so features are usable in SQL/pandas
Concept explained simply
Event log cleaning turns a messy stream of actions into a reliable timeline per user. You make timestamps comparable, ensure event names mean one thing, remove accidental repeats, and keep only analyzable rows.
Mental model
Think of your event log as a sorted story per user:
- Start with a spine: user_id and event_time in one timezone
- Make the words consistent: normalized event_name
- Remove echoes: deduplicate exact/near-duplicate events
- Add details: flatten properties into clear columns
- Keep it human: remove bot-like patterns and corrupt rows
Data requirements checklist
- Required columns: user_id, event_time, event_name
- Nice-to-have: event_id (for exact dedup), properties (dict), device, source
- All timestamps convertible via pandas.to_datetime
- One row = one event
- No personally identifiable information in free-text fields
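To make the checklist concrete, here is a small synthetic DataFrame in the expected shape. The column names follow the checklist; the rows and values are invented for illustration (including a deliberate duplicate, a bad timestamp, and a missing user_id) so the worked examples below have something to clean.
import pandas as pd

# Synthetic example data; values are invented for illustration only
df = pd.DataFrame({
    'user_id': ['u1', 'u1', 'u2', None],
    'event_time': ['2024-03-01T10:00:00Z', '2024-03-01 10:00:01', 'not a time', '2024-03-01T11:00:00Z'],
    'event_name': ['Sign-Up', 'sign up', 'purchase', 'purchase'],
    'event_id': ['e1', 'e1', 'e2', 'e3'],  # 'e1' duplicated on purpose
    'properties': [{'price': '9.99'}, {'price': '9.99'}, {'price': 20}, None],
})

# Structural check: required columns present, one row per event
required = {'user_id', 'event_time', 'event_name'}
assert required.issubset(df.columns), f'missing columns: {required - set(df.columns)}'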
Step-by-step cleaning pipeline
- Parse timestamps and set a single timezone (usually UTC)
- Sort by user_id, then event_time
- Standardize event_name (trim, lowercase, map aliases)
- Flatten properties into columns; cast types
- Remove obviously bad rows (missing user_id, invalid time)
- Drop exact duplicates (event_id or full-row subset)
- Remove near-duplicates fired within a tiny window per user/event
- Flag and optionally filter bots/spam (abnormally high event rate; one heuristic is sketched below)
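Of these steps, only the bot/spam flag has no worked example below. A minimal sketch of one common heuristic follows, assuming event_time has already been parsed to datetime (step 1); the events-per-minute threshold is a placeholder to tune on your own traffic, not a recommended value.
# Placeholder heuristic: flag users with an abnormally high event rate.
# Assumes event_time is already a datetime column; tune the threshold on your data.
EVENTS_PER_MINUTE_LIMIT = 60
per_minute = (df.assign(minute=df['event_time'].dt.floor('min'))
                .groupby(['user_id', 'minute'])
                .size())
bot_users = per_minute[per_minute > EVENTS_PER_MINUTE_LIMIT].index.get_level_values('user_id').unique()
df['is_bot'] = df['user_id'].isin(bot_users)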
Tip: reversible cleaning
Add helper columns like dropped_reason, is_bot, and is_near_dup before final filtering, as sketched below. This lets you audit decisions before any rows are removed.
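A minimal sketch of that pattern, using the flag columns named above; the dropped_reason labels are illustrative, and clean_df is just a name chosen here for the filtered result.
# Flag first, filter once at the end; labels are illustrative
df['dropped_reason'] = None
df.loc[df['user_id'].isna(), 'dropped_reason'] = 'missing_user_id'
df.loc[df['event_time'].isna(), 'dropped_reason'] = 'invalid_time'

# Audit what would be removed before committing to the filter
print(df['dropped_reason'].value_counts(dropna=False))
clean_df = df.loc[df['dropped_reason'].isna()].drop(columns=['dropped_reason'])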
Worked examples
Example 1 — Parse timestamps and unify timezone
import pandas as pd
# Assume df has 'event_time' as string with mixed formats
# Strategy: parse and force UTC; if you know a source timezone, localize then convert
# (on pandas >= 2.0, add format='mixed' if formats genuinely vary row to row)
df['event_time'] = pd.to_datetime(df['event_time'], errors='coerce', utc=True)
# If you know logs were local time 'US/Pacific' without tz info:
# naive = pd.to_datetime(df['event_time'], errors='coerce')
# df['event_time'] = naive.dt.tz_localize('US/Pacific').dt.tz_convert('UTC')
# Drop rows with invalid times
df = df.dropna(subset=['event_time'])
Why
UTC avoids daylight saving surprises and enables correct cross-region comparisons.
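If you do take the commented localize-then-convert path, daylight saving transitions create ambiguous and nonexistent local times. A minimal sketch of handling them explicitly, assuming 'US/Pacific' as in the example above:
# Naive local times around US/Pacific DST transitions:
# 01:30 on 2023-11-05 occurs twice (fall back); 02:30 on 2023-03-12 never occurs (spring forward)
naive = pd.to_datetime(pd.Series(['2023-11-05 01:30:00', '2023-03-12 02:30:00']))

# 'NaT' marks the problem rows instead of raising, so you can review or drop them
local = naive.dt.tz_localize('US/Pacific', ambiguous='NaT', nonexistent='NaT')
utc = local.dt.tz_convert('UTC')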
Example 2 — Standardize event names
# Normalize spelling/casing first, then map any remaining aliases
# (keys must match the post-normalization form: stripped, lowercased, spaces replaced with '_')
normalize = {
    'signup': 'sign_up',
    'sign-up': 'sign_up',
    'purchase': 'purchase'
}
clean = (df['event_name'].astype(str)
.str.strip()
.str.lower()
.str.replace(' ', '_', regex=False))
df['event_name'] = clean.map(normalize).fillna(clean)
Why
Consistent event_name values prevent split funnels and miscounted KPIs.
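As a quick self-check after normalization, compare the cleaned names against the vocabulary you expect; the expected set below is a hypothetical stand-in for your tracking plan.
# Hypothetical expected vocabulary; replace with your tracking plan's event list
expected = {'sign_up', 'purchase', 'page_view', 'add_to_cart'}
unexpected = df.loc[~df['event_name'].isin(expected), 'event_name'].value_counts()
print(unexpected)  # anything listed here needs a new alias or a tracking fix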
Example 3 — Deduplicate exact and near-duplicates
# 1) Drop exact duplicates by event_id if available
if 'event_id' in df.columns:
    df = df.sort_values(['event_id', 'event_time']).drop_duplicates('event_id', keep='first')
else:
    # fallback: drop identical rows across key fields
    df = df.drop_duplicates(subset=['user_id', 'event_name', 'event_time'])
# 2) Remove near-duplicates fired within 2s per user/event
# (e.g., double-clicks, retrying SDK)
df = df.sort_values(['user_id', 'event_name', 'event_time'])
lag = df.groupby(['user_id', 'event_name'])['event_time'].diff()
df['is_near_dup'] = (lag.dt.total_seconds().fillna(9999) < 2)
df = df.loc[~df['is_near_dup']].drop(columns=['is_near_dup'])
Why
Near-duplicate filtering removes noisy bursts without harming normal activity.
Example 4 — Flatten JSON properties and cast types
# Suppose 'properties' holds dict-like data; treat missing values as empty dicts
# so json_normalize does not fail on NaN/None rows
props = pd.json_normalize(df['properties'].apply(lambda x: x if isinstance(x, dict) else {}))
props = props.add_prefix('prop_')
# Align index and join back
props.index = df.index
df = pd.concat([df.drop(columns=['properties']), props], axis=1)
# Cast common fields
for c in ['prop_price', 'prop_value']:
    if c in df.columns:
        df[c] = pd.to_numeric(df[c], errors='coerce')
if 'prop_success' in df.columns:
    df['prop_success'] = df['prop_success'].astype('boolean')
Why
Flattened columns make features directly usable for grouping, aggregation, and modeling.
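As a quick illustration, the flattened price column now feeds a normal aggregation directly (assuming prop_price exists after the cast above; 'purchase' is the normalized name from Example 2):
# Revenue per user from purchase events, using the flattened price column
revenue = (df.loc[df['event_name'] == 'purchase']
             .groupby('user_id')['prop_price']
             .sum()
             .sort_values(ascending=False))
print(revenue.head())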
Exercises (hands-on)
Complete the exercise below. You can take the quick test afterward.
Checklist for your solution:
- UTC timestamps
- No null user_id
- Standardized event_name
- No exact/near duplicates
- Numeric price column
- Sorted by user_id, event_time
Common mistakes and self-check
1) Mixing timezones
Symptom: negative session gaps or events out of order. Fix: enforce UTC and re-sort.
2) Over-aggressive deduplication
Symptom: real repeated actions removed (e.g., add_to_cart twice). Fix: keep the near-dup window tiny and apply it only within the same user_id and event_name.
3) Losing properties while flattening
Symptom: nulls where values existed. Fix: ensure index alignment when joining the normalized properties back.
4) Keeping rows with missing user_id
Symptom: inflated counts and broken funnels. Fix: impute user_id only if a reliable session_id lets you recover it; otherwise drop those rows.
5) Not sorting before diff()
Symptom: near-dup logic fails. Fix: sort by user_id, event_name, event_time before groupby().diff().
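A minimal self-check sketch that guards against the mistakes above; it assumes the cleaned DataFrame is still named df and uses the column names from the checklist earlier in the lesson.
# Quick invariants to run after cleaning; a failed assert points back to a mistake above
assert str(df['event_time'].dt.tz) == 'UTC', 'timestamps are not UTC'
assert df['user_id'].notna().all(), 'null user_id rows remain'
assert not df.duplicated(subset=['user_id', 'event_name', 'event_time']).any(), 'exact duplicates remain'
per_user_sorted = df.groupby('user_id')['event_time'].apply(lambda s: s.is_monotonic_increasing)
assert per_user_sorted.all(), 'events are not sorted within each user'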
Practical projects
- Build a cleaning pipeline function clean_events(df, tz='UTC', near_dup_sec=2) returning a standardized DataFrame
- Create a validation report that counts dropped rows by reason and shows before/after metrics
- Construct a small ruleset to flag bots based on events per minute and repetitive patterns
Mini challenge
Using a cleaned log, generate a session_id by starting a new session whenever the time since a user's previous event exceeds 30 minutes. Then compute per-session event counts and the median session length. Keep the code concise and readable.
Learning path
- Before: pandas basics → datetime & timezones → string processing
- Now: event log cleaning (this lesson)
- Next: sessionization, funnel building, retention cohorts, experiment analysis
Next steps
- Turn the examples into a reusable cleaning module
- Add tests with small synthetic DataFrames to guard against regressions
- Run the quick test below to check understanding