
Event Log Cleaning

Learn Event Log Cleaning for free with explanations, exercises, and a quick test (for Product Analysts).

Published: December 22, 2025 | Updated: December 22, 2025

Who this is for

Product Analysts who work with raw event data (web/app analytics, product telemetry, marketing attribution) and need clean, trustworthy data for funnels, retention, and experiment analysis.

Prerequisites

  • Basic pandas: DataFrame operations, filtering, sorting
  • Dates/times in pandas: to_datetime, timezone basics
  • Comfort reading JSON-like columns (dict-like properties)

Why this matters

Real product decisions rely on accurate events. Dirty logs cause wrong funnels, broken cohorts, and misleading A/B test results. Typical on-the-job tasks:

  • Unify mixed timestamp formats from SDKs and servers
  • Deduplicate repeated events from retries or frontend double-fires
  • Standardize event names and parameters for consistent analysis
  • Filter bot/script traffic without harming real user data
  • Flatten JSON properties so features are usable in SQL/pandas

Concept explained simply

Event log cleaning turns a messy stream of actions into a reliable timeline per user. You make timestamps comparable, ensure event names mean one thing, remove accidental repeats, and keep only analyzable rows.

Mental model

Think of your event log as a sorted story per user:

  • Start with a spine: user_id and event_time in one timezone
  • Make the words consistent: normalized event_name
  • Remove echoes: deduplicate exact/near-duplicate events
  • Add details: flatten properties into clear columns
  • Keep it human: remove bot-like patterns and corrupt rows

Data requirements checklist

  • Required columns: user_id, event_time, event_name
  • Nice-to-have: event_id (for exact dedup), properties (dict), device, source
  • All timestamps convertible via pandas.to_datetime
  • One row = one event
  • No personally identifiable information in free-text fields
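
A quick way to enforce this checklist before cleaning is a small guard function. A minimal sketch, assuming the column names above (validate_events and REQUIRED are illustrative names, not a library API):

import pandas as pd

REQUIRED = ['user_id', 'event_time', 'event_name']

def validate_events(df: pd.DataFrame) -> None:
    # Fail fast if the raw log cannot support the pipeline below
    missing = [c for c in REQUIRED if c not in df.columns]
    if missing:
        raise ValueError(f'missing required columns: {missing}')
    # Count timestamps that would not survive parsing
    parsed = pd.to_datetime(df['event_time'], errors='coerce', utc=True)
    n_bad = int(parsed.isna().sum())
    if n_bad:
        print(f'warning: {n_bad} rows have unparseable event_time')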

Step-by-step cleaning pipeline

  1. Parse timestamps and set a single timezone (usually UTC)
  2. Sort by user_id, then event_time
  3. Standardize event_name (trim, lowercase, map aliases)
  4. Flatten properties into columns; cast types
  5. Remove obviously bad rows (missing user_id, invalid time)
  6. Drop exact duplicates (event_id or full-row subset)
  7. Remove near-duplicates fired within a tiny window per user/event
  8. Flag and optionally filter bots/spam (abnormally high rate)

Tip: reversible cleaning

Add helper columns like dropped_reason, is_bot, and is_near_dup before final filtering. This lets you audit every decision later.
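
A minimal sketch of the flag-then-filter pattern (the reason labels here are illustrative, not a fixed schema):

# Flag first, filter last: each rule records why a row would be dropped
df['dropped_reason'] = None
df.loc[df['user_id'].isna(), 'dropped_reason'] = 'missing_user_id'
df.loc[df['event_time'].isna(), 'dropped_reason'] = 'invalid_time'

# Audit the decisions before committing to them
print(df['dropped_reason'].value_counts(dropna=False))

df = df.loc[df['dropped_reason'].isna()].drop(columns=['dropped_reason'])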

Worked examples

Example 1 — Parse timestamps and unify timezone

import pandas as pd

# Assume df has 'event_time' as strings in mixed formats
# Strategy: parse and force UTC; if you know a source timezone, localize then convert.
# Note: pandas >= 2.0 infers the format from the first value and coerces
# non-matching rows to NaT; pass format='mixed' if formats genuinely vary by row.

df['event_time'] = pd.to_datetime(df['event_time'], errors='coerce', utc=True)

# If you know logs were local time 'US/Pacific' without tz info:
# naive = pd.to_datetime(df['event_time'], errors='coerce')
# df['event_time'] = naive.dt.tz_localize('US/Pacific').dt.tz_convert('UTC')

# Drop rows with invalid times
df = df.dropna(subset=['event_time'])
Why

UTC avoids daylight saving surprises and enables correct cross-region comparisons.

Example 2 — Standardize event names

# Normalize spelling/casing first, then map remaining aliases.
# Keys must match the cleaned form (lowercase, spaces -> underscores),
# so 'Sign Up' already becomes 'sign_up' and 'Sign-Up' becomes 'sign-up'.
normalize = {
    'signup': 'sign_up',
    'sign-up': 'sign_up',
}

clean = (df['event_name'].astype(str)
           .str.strip()
           .str.lower()
           .str.replace(' ', '_', regex=False))

df['event_name'] = clean.map(normalize).fillna(clean)
Why

Consistent event_name values prevent split funnels and miscounted KPIs.

Example 3 — Deduplicate exact and near-duplicates

# 1) Drop exact duplicates by event_id if available
if 'event_id' in df.columns:
    df = df.sort_values(['event_id', 'event_time']).drop_duplicates('event_id', keep='first')
else:
    # fallback: drop identical rows across key fields
    df = df.drop_duplicates(subset=['user_id', 'event_name', 'event_time'])

# 2) Remove near-duplicates fired within 2s per user/event
#    (e.g., double-clicks, a retrying SDK). diff() compares each event to the
#    previous one, so a rapid burst collapses to its first event.

df = df.sort_values(['user_id', 'event_name', 'event_time'])
lag = df.groupby(['user_id', 'event_name'])['event_time'].diff()
df['is_near_dup'] = lag < pd.Timedelta(seconds=2)  # NaT (first event) compares False

df = df.loc[~df['is_near_dup']].drop(columns=['is_near_dup'])
Why

Near-duplicate filtering removes noisy bursts without harming normal activity.

Example 4 — Flatten JSON properties and cast types

# Suppose 'properties' holds dict-like data; substitute {} for missing values
# so json_normalize does not fail on None/NaN rows
records = df['properties'].apply(lambda d: d if isinstance(d, dict) else {})
props = pd.json_normalize(records).add_prefix('prop_')

# json_normalize returns a fresh RangeIndex; re-align before joining back
props.index = df.index
df = pd.concat([df.drop(columns=['properties']), props], axis=1)

# Cast common fields
for c in ['prop_price', 'prop_value']:
    if c in df.columns:
        df[c] = pd.to_numeric(df[c], errors='coerce')

if 'prop_success' in df.columns:
    df['prop_success'] = df['prop_success'].astype('boolean')
Why

Flattened columns make features directly usable for grouping, aggregation, and modeling.

Exercises (hands-on)

Complete the exercise below, then take the quick test to check your understanding.

Checklist for your solution:

  • UTC timestamps
  • No null user_id
  • Standardized event_name
  • No exact or near duplicates
  • Numeric price column
  • Sorted by user_id, then event_time

Common mistakes and self-check

1) Mixing timezones

Symptom: negative session gaps or events out of order. Fix: enforce UTC and re-sort.

2) Over-aggressive deduplication

Symptom: real repeated actions removed (e.g., add_to_cart twice). Fix: limit near-dup window to tiny intervals and only within same user_id and event_name.

3) Losing properties while flattening

Symptom: nulls appear where values existed. Fix: ensure index alignment when joining normalized properties back; json_normalize returns a fresh RangeIndex.

4) Keeping rows with missing user_id

Symptom: inflated counts and broken funnels. Fix: impute user_id only when a reliable session_id exists to join on; otherwise drop the rows.

5) Not sorting before diff()

Symptom: near-dup logic fails. Fix: sort by user_id, event_name, event_time before groupby().diff().
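
A small self-check function catches mistakes 1, 4, and 5 automatically. A sketch, assuming the cleaned frame from the examples above:

def self_check(df):
    # Mistake 1: times must be timezone-aware UTC
    assert str(df['event_time'].dt.tz) == 'UTC', 'event_time is not UTC'
    # Mistake 4: no null user_id should remain
    assert df['user_id'].notna().all(), 'null user_id rows remain'
    # Mistakes 1 and 5: no negative gaps within a user once sorted
    gaps = (df.sort_values(['user_id', 'event_time'])
              .groupby('user_id')['event_time'].diff())
    assert not (gaps < pd.Timedelta(0)).any(), 'events out of order'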

Practical projects

  • Build a cleaning pipeline function clean_events(df, tz='UTC', near_dup_sec=2) returning a standardized DataFrame (a sketch follows this list)
  • Create a validation report that counts dropped rows by reason and shows before/after metrics
  • Construct a small ruleset to flag bots based on events per minute and repetitive patterns
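
One possible shape for that clean_events function, stitching together the steps from the worked examples. This is a sketch rather than a complete implementation: the alias map is a parameter, and bot filtering and property flattening are left as extensions.

import pandas as pd

def clean_events(df, tz='UTC', near_dup_sec=2, aliases=None):
    out = df.copy()
    # Parse timestamps, force one timezone, drop unusable rows
    out['event_time'] = (pd.to_datetime(out['event_time'], errors='coerce', utc=True)
                           .dt.tz_convert(tz))
    out = out.dropna(subset=['event_time', 'user_id'])
    # Standardize event names, then apply any alias map
    name = (out['event_name'].astype(str).str.strip()
              .str.lower().str.replace(' ', '_', regex=False))
    out['event_name'] = name.map(aliases or {}).fillna(name)
    # Drop near-duplicates within the window per user/event
    out = out.sort_values(['user_id', 'event_name', 'event_time'])
    lag = out.groupby(['user_id', 'event_name'])['event_time'].diff()
    out = out.loc[~(lag < pd.Timedelta(seconds=near_dup_sec))]
    return out.sort_values(['user_id', 'event_time']).reset_index(drop=True)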

Mini challenge

Using a cleaned log, generate a session_id by starting a new session whenever the time since the same user's previous event exceeds 30 minutes. Then compute per-session event counts and the median session length. Keep the code concise and readable.
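
One possible solution shape, if you get stuck (a sketch assuming a cleaned frame with UTC event_time; try it yourself first):

df = df.sort_values(['user_id', 'event_time'])
since_prev = df.groupby('user_id')['event_time'].diff()

# A new session starts at each user's first event or after a 30-minute gap
new_session = since_prev.isna() | (since_prev > pd.Timedelta(minutes=30))
df['session_id'] = new_session.cumsum()  # globally unique, increasing ids

per_session = df.groupby('session_id').agg(
    events=('event_name', 'size'),
    length=('event_time', lambda t: t.max() - t.min()),
)
print(per_session['events'].mean(), per_session['length'].median())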

Learning path

  • Before: pandas basics → datetime & timezones → string processing
  • Now: event log cleaning (this lesson)
  • Next: sessionization, funnel building, retention cohorts, experiment analysis

Next steps

  • Turn the examples into a reusable cleaning module
  • Add tests with small synthetic DataFrames to guard against regressions (a minimal example follows this list)
  • Run the quick test below to check understanding
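
For the testing step, a minimal pytest-style example with a synthetic frame (the import assumes you saved clean_events in a module named cleaning; adjust to your layout):

import pandas as pd
from cleaning import clean_events  # hypothetical module name

def test_near_duplicates_are_dropped():
    raw = pd.DataFrame({
        'user_id': ['u1', 'u1', 'u1'],
        'event_name': ['click', 'click', 'click'],
        'event_time': ['2025-01-01T00:00:00Z',
                       '2025-01-01T00:00:01Z',   # 1s later: near-duplicate
                       '2025-01-01T00:10:00Z'],  # outside the window: kept
    })
    cleaned = clean_events(raw, near_dup_sec=2)
    assert len(cleaned) == 2
    assert cleaned['event_time'].is_monotonic_increasing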

Practice Exercises

1 exercise to complete

Instructions

You receive a DataFrame df with columns: user_id, event_time (strings), event_name (mixed casing), event_id (may repeat), properties (dict with 'price'), and some rows with null user_id. Clean it using pandas:

  • Parse event_time and convert to UTC
  • Drop rows with invalid event_time or null user_id
  • Standardize event_name: trim, lowercase, replace spaces with underscores, map 'signup'/'sign up'/'Sign-Up' to 'sign_up'
  • Drop exact duplicates by event_id (keep first)
  • Remove near-duplicates: events with the same user_id and event_name occurring within 2 seconds
  • Flatten properties to a numeric column price
  • Sort by user_id, event_time and return the cleaned DataFrame

Expected Output

A cleaned DataFrame with UTC timestamps, no null user_id, standardized event_name, a numeric 'price' column, exact and near duplicates removed, and rows sorted by user_id then event_time.

Event Log Cleaning — Quick Test

Test your knowledge with 10 questions. Pass with 70% or higher.

