Who this is for
Product Analysts who work with raw event data (web/app analytics, product telemetry, marketing attribution) and need clean, trustworthy data for funnels, retention, and experiment analysis.
Prerequisites
- Basic pandas: DataFrame operations, filtering, sorting
- Dates/times in pandas: to_datetime, timezone basics
- Comfort reading JSON-like columns (dict-like properties)
Why this matters
Real product decisions rely on accurate events. Dirty logs cause wrong funnels, broken cohorts, and misleading A/B test results. Typical on-the-job tasks:
- Unify mixed timestamp formats from SDKs and servers
- Deduplicate repeated events from retries or frontend double-fires
- Standardize event names and parameters for consistent analysis
- Filter bot/script traffic without harming real user data
- Flatten JSON properties so features are usable in SQL/pandas
Concept explained simply
Event log cleaning turns a messy stream of actions into a reliable timeline per user. You make timestamps comparable, ensure event names mean one thing, remove accidental repeats, and keep only analyzable rows.
Mental model
Think of your event log as a sorted story per user:
- Start with a spine: user_id and event_time in one timezone
- Make the words consistent: normalized event_name
- Remove echoes: deduplicate exact/near-duplicate events
- Add details: flatten properties into clear columns
- Keep it human: remove bot-like patterns and corrupt rows
Data requirements checklist
- Required columns: user_id, event_time, event_name
- Nice-to-have: event_id (for exact dedup), properties (dict), device, source
- All timestamps convertible via pandas.to_datetime
- One row = one event
- No personally identifiable information in free-text fields
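To make the checklist concrete, here is a small synthetic DataFrame in the expected shape. The column names follow the checklist; the rows and values are invented for illustration (including a deliberate duplicate, a bad timestamp, and a missing user_id) so the worked examples below have something to clean.
import pandas as pd

# Synthetic example data; values are invented for illustration only
df = pd.DataFrame({
    'user_id': ['u1', 'u1', 'u2', None],
    'event_time': ['2024-03-01T10:00:00Z', '2024-03-01 10:00:01', 'not a time', '2024-03-01T11:00:00Z'],
    'event_name': ['Sign-Up', 'sign up', 'purchase', 'purchase'],
    'event_id': ['e1', 'e1', 'e2', 'e3'],  # 'e1' duplicated on purpose
    'properties': [{'price': '9.99'}, {'price': '9.99'}, {'price': 20}, None],
})

# Structural check: required columns present, one row per event
required = {'user_id', 'event_time', 'event_name'}
assert required.issubset(df.columns), f'missing columns: {required - set(df.columns)}'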
Step-by-step cleaning pipeline
- Parse timestamps and set a single timezone (usually UTC)
- Sort by user_id, then event_time
- Standardize event_name (trim, lowercase, map aliases)
- Flatten properties into columns; cast types
- Remove obviously bad rows (missing user_id, invalid time)
- Drop exact duplicates (event_id or full-row subset)
- Remove near-duplicates fired within a tiny window per user/event
- Flag and optionally filter bots/spam (abnormally high event rate; one heuristic is sketched below)
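Of these steps, only the bot/spam flag has no worked example below. A minimal sketch of one common heuristic follows, assuming event_time has already been parsed to datetime (step 1); the events-per-minute threshold is a placeholder to tune on your own traffic, not a recommended value.
# Placeholder heuristic: flag users with an abnormally high event rate.
# Assumes event_time is already a datetime column; tune the threshold on your data.
EVENTS_PER_MINUTE_LIMIT = 60
per_minute = (df.assign(minute=df['event_time'].dt.floor('min'))
                .groupby(['user_id', 'minute'])
                .size())
bot_users = per_minute[per_minute > EVENTS_PER_MINUTE_LIMIT].index.get_level_values('user_id').unique()
df['is_bot'] = df['user_id'].isin(bot_users)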
Tip: reversible cleaning
Add helper columns like dropped_reason, is_bot, and is_near_dup before final filtering, as sketched below. This lets you audit decisions before any rows are removed.
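A minimal sketch of that pattern, using the flag columns named above; the dropped_reason labels are illustrative, and clean_df is just a name chosen here for the filtered result.
# Flag first, filter once at the end; labels are illustrative
df['dropped_reason'] = None
df.loc[df['user_id'].isna(), 'dropped_reason'] = 'missing_user_id'
df.loc[df['event_time'].isna(), 'dropped_reason'] = 'invalid_time'

# Audit what would be removed before committing to the filter
print(df['dropped_reason'].value_counts(dropna=False))
clean_df = df.loc[df['dropped_reason'].isna()].drop(columns=['dropped_reason'])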
Worked examples
Example 1 — Parse timestamps and unify timezone
import pandas as pd
# Assume df has 'event_time' as string with mixed formats
# Strategy: parse and force UTC; if you know a source timezone, localize then convert
# (on pandas >= 2.0, add format='mixed' if formats genuinely vary row to row)
df['event_time'] = pd.to_datetime(df['event_time'], errors='coerce', utc=True)
# If you know logs were local time 'US/Pacific' without tz info:
# naive = pd.to_datetime(df['event_time'], errors='coerce')
# df['event_time'] = naive.dt.tz_localize('US/Pacific').dt.tz_convert('UTC')
# Drop rows with invalid times
df = df.dropna(subset=['event_time'])
Why
UTC avoids daylight saving surprises and enables correct cross-region comparisons.
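If you do take the commented localize-then-convert path, daylight saving transitions create ambiguous and nonexistent local times. A minimal sketch of handling them explicitly, assuming 'US/Pacific' as in the example above:
# Naive local times around US/Pacific DST transitions:
# 01:30 on 2023-11-05 occurs twice (fall back); 02:30 on 2023-03-12 never occurs (spring forward)
naive = pd.to_datetime(pd.Series(['2023-11-05 01:30:00', '2023-03-12 02:30:00']))

# 'NaT' marks the problem rows instead of raising, so you can review or drop them
local = naive.dt.tz_localize('US/Pacific', ambiguous='NaT', nonexistent='NaT')
utc = local.dt.tz_convert('UTC')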
Example 2 — Standardize event names
# Normalize spelling/casing first, then map any remaining aliases
# (keys must match the post-normalization form: stripped, lowercased, spaces replaced with '_')
normalize = {
    'signup': 'sign_up',
    'sign-up': 'sign_up',
    'purchase': 'purchase'
}
clean = (df['event_name'].astype(str)
.str.strip()
.str.lower()
.str.replace(' ', '_', regex=False))
df['event_name'] = clean.map(normalize).fillna(clean)
Why
Consistent event_name values prevent split funnels and miscounted KPIs.
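As a quick self-check after normalization, compare the cleaned names against the vocabulary you expect; the expected set below is a hypothetical stand-in for your tracking plan.
# Hypothetical expected vocabulary; replace with your tracking plan's event list
expected = {'sign_up', 'purchase', 'page_view', 'add_to_cart'}
unexpected = df.loc[~df['event_name'].isin(expected), 'event_name'].value_counts()
print(unexpected)  # anything listed here needs a new alias or a tracking fix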
Example 3 — Deduplicate exact and near-duplicates
# 1) Drop exact duplicates by event_id if available
if 'event_id' in df.columns:
    df = df.sort_values(['event_id', 'event_time']).drop_duplicates('event_id', keep='first')
else:
    # fallback: drop identical rows across key fields
    df = df.drop_duplicates(subset=['user_id', 'event_name', 'event_time'])
# 2) Remove near-duplicates fired within 2s per user/event
# (e.g., double-clicks, retrying SDK)
df = df.sort_values(['user_id', 'event_name', 'event_time'])
lag = df.groupby(['user_id', 'event_name'])['event_time'].diff()
df['is_near_dup'] = (lag.dt.total_seconds().fillna(9999) < 2)
df = df.loc[~df['is_near_dup']].drop(columns=['is_near_dup'])
Why
Near-duplicate filtering removes noisy bursts without harming normal activity.
Example 4 — Flatten JSON properties and cast types
# Suppose 'properties' holds dict-like data; treat missing values as empty dicts
# so json_normalize does not fail on NaN/None rows
props = pd.json_normalize(df['properties'].apply(lambda x: x if isinstance(x, dict) else {}))
props = props.add_prefix('prop_')
# Align index and join back
props.index = df.index
df = pd.concat([df.drop(columns=['properties']), props], axis=1)
# Cast common fields
for c in ['prop_price', 'prop_value']:
    if c in df.columns:
        df[c] = pd.to_numeric(df[c], errors='coerce')
if 'prop_success' in df.columns:
    df['prop_success'] = df['prop_success'].astype('boolean')
Why
Flattened columns make features directly usable for grouping, aggregation, and modeling.
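As a quick illustration, the flattened price column now feeds a normal aggregation directly (assuming prop_price exists after the cast above; 'purchase' is the normalized name from Example 2):
# Revenue per user from purchase events, using the flattened price column
revenue = (df.loc[df['event_name'] == 'purchase']
             .groupby('user_id')['prop_price']
             .sum()
             .sort_values(ascending=False))
print(revenue.head())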
Exercises (hands-on)
Complete the exercise below. You can take the quick test afterward.
Checklist for your solution:
- UTC timestamps
- No null user_id
- Standardized event_name
- No exact/near duplicates
- Numeric price column
- Sorted by user_id, event_time
Common mistakes and self-check
1) Mixing timezones
Symptom: negative session gaps or events out of order. Fix: enforce UTC and re-sort.
2) Over-aggressive deduplication
Symptom: real repeated actions removed (e.g., add_to_cart twice). Fix: keep the near-dup window tiny and apply it only within the same user_id and event_name.
3) Losing properties while flattening
Symptom: nulls where values existed. Fix: ensure index alignment when joining the normalized properties back.
4) Keeping rows with missing user_id
Symptom: inflated counts and broken funnels. Fix: impute user_id only if a reliable session_id lets you recover it; otherwise drop those rows.
5) Not sorting before diff()
Symptom: near-dup logic fails. Fix: sort by user_id, event_name, event_time before groupby().diff().
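A minimal self-check sketch that guards against the mistakes above; it assumes the cleaned DataFrame is still named df and uses the column names from the checklist earlier in the lesson.
# Quick invariants to run after cleaning; a failed assert points back to a mistake above
assert str(df['event_time'].dt.tz) == 'UTC', 'timestamps are not UTC'
assert df['user_id'].notna().all(), 'null user_id rows remain'
assert not df.duplicated(subset=['user_id', 'event_name', 'event_time']).any(), 'exact duplicates remain'
per_user_sorted = df.groupby('user_id')['event_time'].apply(lambda s: s.is_monotonic_increasing)
assert per_user_sorted.all(), 'events are not sorted within each user'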
Practical projects
- Build a cleaning pipeline function clean_events(df, tz='UTC', near_dup_sec=2) returning a standardized DataFrame
- Create a validation report that counts dropped rows by reason and shows before/after metrics
- Construct a small ruleset to flag bots based on events per minute and repetitive patterns
Mini challenge
Using a cleaned log, generate a session_id by starting a new session whenever the time since a user's previous event exceeds 30 minutes. Then compute per-session event counts and the median session length. Keep the code concise and readable.
Learning path
- Before: pandas basics → datetime & timezones → string processing
- Now: event log cleaning (this lesson)
- Next: sessionization, funnel building, retention cohorts, experiment analysis
Next steps
- Turn the examples into a reusable cleaning module
- Add tests with small synthetic DataFrames to guard against regressions
- Run the quick test below to check understanding