Why this matters
As a Product Analyst, your insights are only as good as the data behind them. Missing or broken events quietly distort funnels, A/B tests, retention, and revenue attribution. Detecting issues early prevents weeks of misleading dashboards and wrong decisions.
- Find the real cause of sudden funnel drops
- Catch schema changes that break dashboards
- Stop double-counting or sample bias before it spreads
- Give engineering clear, actionable fixes
Who this is for
- Product Analysts and Data Analysts working with event data
- Product Managers who own KPI dashboards
- Engineers instrumenting product analytics
Prerequisites
- Basics of event tracking: event name, timestamp, user/session IDs, properties
- Comfort with funnels and cohort definitions
- Ability to run simple queries or use your analytics tool’s charts
Concept explained simply
Think of your event stream like a heartbeat monitor for your product. When events go missing, the heart skips beats. When events are broken, the signal is noisy or mislabeled.
- Missing events: expected events stop arriving (volume collapses, gaps in time)
- Broken events: events arrive but are wrong (name changed, properties missing, duplicates, wrong timestamp)
Mental model: Volume, Shape, Flow
- Volume: how many events per time unit or per active user
- Shape: schema correctness (properties exist, types/values valid)
- Flow: relationships between events (ratios, funnels, cause–effect)
Detect by watching all three. If any axis changes suddenly without a plausible product reason, investigate.
Core detection signals
- Volume anomalies: sharp drop or spike vs. a 7-day or 28-day baseline
- User-normalized ratios: events per active user/device
- Schema validity: required properties present and typed correctly
- Topology ratios: downstream-to-upstream event ratio (e.g., CheckoutCompleted / AddToCart)
- Freshness: events arriving too late or with future timestamps
- Duplicates: unusually high events per user per minute
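The user-normalized and topology ratios above can be sketched in a few lines. This is a minimal illustration, not a prescribed implementation: the daily counts are hypothetical, and the helper names `ratio` and `ratio_shift` are made up for this example.

```python
from typing import Optional


def ratio(numerator_count: int, denominator_count: int) -> Optional[float]:
    """Return numerator / denominator, or None when the denominator is zero."""
    if denominator_count == 0:
        return None
    return numerator_count / denominator_count


def ratio_shift(baseline: float, current: float) -> float:
    """Relative change of a ratio vs. its baseline (-0.5 means it halved)."""
    return (current - baseline) / baseline


# Hypothetical daily counts; replace with numbers from your own event store.
baseline_day = {"active_users": 10_000, "AddToCart": 9_000, "CheckoutCompleted": 1_800}
today = {"active_users": 9_800, "AddToCart": 2_940, "CheckoutCompleted": 1_764}

# User-normalized ratio: events per active user.
per_user_before = ratio(baseline_day["AddToCart"], baseline_day["active_users"])
per_user_today = ratio(today["AddToCart"], today["active_users"])

# Topology ratio: downstream event divided by upstream event.
topo_before = ratio(baseline_day["CheckoutCompleted"], baseline_day["AddToCart"])
topo_today = ratio(today["CheckoutCompleted"], today["AddToCart"])

print(f"AddToCart per active user: {per_user_before:.2f} -> {per_user_today:.2f}")
print(f"CheckoutCompleted / AddToCart: {topo_before:.2f} -> {topo_today:.2f}")
```

In this made-up data the downstream-to-upstream ratio rises sharply while active users barely move, which points at under-reported upstream events rather than a change in user behavior.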
Quick checks
- Day-over-day change beyond threshold (e.g., 30–50%)
- Property null rate drift (e.g., >5–10% increase)
- Version change detected in payload (event_version, sdk_version)
- Time skew: created_at vs received_at difference
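The time-skew check can be expressed as a small classifier over `created_at` and `received_at`. This is a sketch under assumptions: the one-hour lateness limit and five-minute future tolerance are illustrative defaults, not recommended values, and `classify_skew` is a hypothetical helper.

```python
from datetime import datetime, timedelta, timezone


def classify_skew(
    created_at: datetime,
    received_at: datetime,
    max_delay: timedelta = timedelta(hours=1),         # assumed threshold
    future_tolerance: timedelta = timedelta(minutes=5),  # assumed clock-drift allowance
) -> str:
    """Label an event as 'ok', 'late', or 'future' from its two timestamps."""
    skew = received_at - created_at
    if skew < -future_tolerance:
        # The client claims the event happened after the server received it.
        return "future"
    if skew > max_delay:
        # The event arrived long after it was created (offline queue, retries).
        return "late"
    return "ok"


received = datetime(2024, 1, 1, 12, 0, 2, tzinfo=timezone.utc)
on_time = classify_skew(datetime(2024, 1, 1, 12, 0, 0, tzinfo=timezone.utc), received)
late = classify_skew(datetime(2024, 1, 1, 9, 0, 0, tzinfo=timezone.utc), received)
future = classify_skew(datetime(2024, 1, 1, 13, 0, 0, tzinfo=timezone.utc), received)
print(on_time, late, future)
```

Tracking the share of "late" and "future" events per day turns this into a freshness metric you can chart next to volume.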
Worked examples
Example 1: Sudden drop in AddToCart
Signal: AddToCart events down 65% day-over-day. Active users only down 2%.
- Check ratio: AddToCart per active user fell from 0.9 to 0.3
- Upstream event (ProductViewed) steady; downstream (CheckoutStarted) also down
- Conclusion: Missing event, likely SDK or feature-gating change
- Action: Confirm recent release; verify SDK init on product pages; request hotfix
What would prove it?
- Staging environment still emits AddToCart
- Only Web platform affected; Mobile stable
- Release notes show deprecated click handler
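To test the "only Web affected" hypothesis, the per-active-user ratio can be split by platform. The sketch below assumes raw rows carry `platform`, `user_id`, and `event` fields (field names are assumptions) and treats any user with at least one event as active.

```python
from collections import defaultdict


def events_per_active_user(rows, event_name):
    """Return {platform: count of event_name / distinct active users}."""
    event_counts = defaultdict(int)
    users = defaultdict(set)
    for row in rows:
        # Assumption: a user is "active" on a platform if they sent any event there.
        users[row["platform"]].add(row["user_id"])
        if row["event"] == event_name:
            event_counts[row["platform"]] += 1
    return {platform: event_counts[platform] / len(ids) for platform, ids in users.items()}


# Hypothetical sample rows for illustration only.
rows = [
    {"platform": "web", "user_id": "u1", "event": "ProductViewed"},
    {"platform": "web", "user_id": "u2", "event": "ProductViewed"},
    {"platform": "ios", "user_id": "u3", "event": "ProductViewed"},
    {"platform": "ios", "user_id": "u3", "event": "AddToCart"},
]
per_platform = events_per_active_user(rows, "AddToCart")
print(per_platform)
```

A ratio that collapses on one platform while staying stable on the others narrows the search to that platform's release.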
Example 2: SignupCompleted missing plan_id
Signal: Revenue by plan dashboard flatlines for new users; total signups normal.
- Null rate for plan_id jumped from 1% to 60%
- event_version updated to v3 yesterday
- Conclusion: Broken event schema (missing property), not missing event
- Action: Ask engineering to re-add plan_id or map new property planTier to plan_id
Self-check
- Verify plan distribution pre-change vs post-change
- Ensure downstream billing table still has plan
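The null-rate jump in this example is easiest to pin down when it is grouped by `event_version`. A minimal sketch, assuming each event is a dict with an `event_version` field and a nested `properties` dict (this payload shape is an assumption, not a specific tool's format):

```python
from collections import defaultdict


def null_rate_by_version(events, prop):
    """Return {event_version: share of events where prop is missing or None}."""
    totals = defaultdict(int)
    missing = defaultdict(int)
    for event in events:
        version = event.get("event_version", "unknown")
        totals[version] += 1
        if event.get("properties", {}).get(prop) is None:
            missing[version] += 1
    return {version: missing[version] / totals[version] for version in totals}


# Hypothetical payloads: v3 renamed plan_id to planTier.
events = [
    {"event_version": "v2", "properties": {"plan_id": "pro"}},
    {"event_version": "v2", "properties": {"plan_id": "free"}},
    {"event_version": "v3", "properties": {"planTier": "pro"}},
    {"event_version": "v3", "properties": {"planTier": "free"}},
]
rates = null_rate_by_version(events, "plan_id")
print(rates)
```

If the null rate is near zero for the old version and near one for the new version, the schema change is the cause and the overall rate simply reflects the rollout share.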
Example 3: Double-counted Purchase
Signal: Revenue 30% higher than payment processor; Purchase per user spiked.
- Duplicate bursts within 2 seconds for same user/order_id
- Autotrack + manual track both firing on same button
- Conclusion: Broken event (the same purchase is recorded more than once)
- Action: Add idempotency key (order_id), dedupe rule, or disable autotrack on that element
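A dedupe rule keyed on `order_id` could look like the following. This is a sketch only: it keeps the first event per key and, as a deliberate assumption, passes through events that lack a key field instead of dropping them.

```python
def dedupe_by_key(events, key_fields=("event", "order_id")):
    """Keep the first event for each idempotency key; preserve input order."""
    seen = set()
    unique = []
    for event in events:
        key = tuple(event.get(field) for field in key_fields)
        if None in key:
            # Cannot build a key: keep the event and let a schema check flag it.
            unique.append(event)
            continue
        if key in seen:
            continue  # duplicate of an event we already kept
        seen.add(key)
        unique.append(event)
    return unique


# Hypothetical events: autotrack and manual tracking both fired for order o-1.
purchases = [
    {"event": "Purchase", "user_id": "u1", "order_id": "o-1", "revenue": 50},
    {"event": "Purchase", "user_id": "u1", "order_id": "o-1", "revenue": 50},
    {"event": "Purchase", "user_id": "u2", "order_id": "o-2", "revenue": 30},
]
clean = dedupe_by_key(purchases)
print(sum(e["revenue"] for e in purchases), "->", sum(e["revenue"] for e in clean))
```

Deduping in analysis is a stopgap; the durable fix is still to stop the second event from firing.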
Step-by-step detection playbook
- Scan volume and ratios
  - Events per active user/device
  - Downstream/Upstream ratios (e.g., Purchase / CheckoutStarted)
- Check schema shape
  - Required properties present
  - Type/enum validity; null rate changes
- Segment by dimension
  - Platform, app version, region, release channel
- Time sanity
  - Late arrivals, future timestamps, timezone shifts
- Duplicates and spikes
  - Per-user-per-minute caps; identical payloads
- Trace to change
  - Recent deployments, feature flags, SDK updates
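The per-user-per-minute cap from the duplicates step can be sketched by bucketing timestamps to the minute. The cap value and the field names (`user_id`, `event`, `timestamp`) are assumptions for illustration.

```python
from collections import Counter
from datetime import datetime, timezone


def users_over_cap(events, cap_per_minute):
    """Return {(user_id, event, minute): count} for buckets above the cap."""
    counts = Counter()
    for event in events:
        # Truncate the timestamp to the start of its minute.
        minute = event["timestamp"].replace(second=0, microsecond=0)
        counts[(event["user_id"], event["event"], minute)] += 1
    return {key: n for key, n in counts.items() if n > cap_per_minute}


def at(second):
    """Hypothetical helper: a UTC timestamp inside one fixed minute."""
    return datetime(2024, 1, 1, 12, 0, second, tzinfo=timezone.utc)


events = [
    {"user_id": "u1", "event": "Purchase", "timestamp": at(1)},
    {"user_id": "u1", "event": "Purchase", "timestamp": at(2)},
    {"user_id": "u1", "event": "Purchase", "timestamp": at(3)},
    {"user_id": "u2", "event": "Purchase", "timestamp": at(4)},
]
flagged = users_over_cap(events, cap_per_minute=2)
print(flagged)
```

A sensible cap depends on the event: several ProductViewed events per minute may be normal, while several Purchase events per minute rarely are.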
Fast 30-minute triage
- Compare today vs last 7-day median
- Normalize by active users
- Check the null rate of two key properties
- Split by platform/app version
- Sample 20 raw events for eyeballing
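The first two triage steps (compare against the 7-day median, normalized by active users) fit in a few lines. A minimal sketch with hypothetical numbers; the 30% threshold mirrors the lower end of the day-over-day range mentioned earlier and should be tuned per event.

```python
from statistics import median


def change_vs_median(history, today):
    """Relative change of today's value vs. the median of the history window."""
    baseline = median(history)
    if baseline == 0:
        raise ValueError("baseline median is zero; relative change is undefined")
    return (today - baseline) / baseline


def is_anomalous(history, today, threshold=0.3):
    """Flag moves of at least `threshold` in either direction (assumed default)."""
    return abs(change_vs_median(history, today)) >= threshold


# Hypothetical AddToCart-per-active-user values for the last 7 days.
last_7_days = [0.88, 0.91, 0.90, 0.92, 0.89, 0.90, 0.93]
today_value = 0.30

change = change_vs_median(last_7_days, today_value)
print(f"change vs 7-day median: {change:.0%}, anomalous: {is_anomalous(last_7_days, today_value)}")
```

The median is used instead of the mean so that a single bad day inside the window does not drag the baseline toward the anomaly.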
Instrumentation verification checklist
- Event fires exactly once per user action
- Required properties always present
- Stable event name and casing
- Event version tracked on schema change
- User/session IDs present and consistent
- Timestamps in UTC; no future times
- Backoff/retry configured for network failures
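Several checklist items (required properties, types, UTC timestamps, no future times) can be verified on a sample of raw events. The sketch below assumes a simple contract mapping property names to Python types; the property types and the payload shape are illustrative assumptions.

```python
from datetime import datetime, timedelta, timezone

# Assumed contract for SignupCompleted; the types are illustrative.
SIGNUP_CONTRACT = {"user_id": str, "plan_id": str, "source": str}


def validate_event(event, required, now=None):
    """Return a list of human-readable problems; an empty list means valid."""
    problems = []
    now = now or datetime.now(timezone.utc)
    props = event.get("properties", {})
    for name, expected_type in required.items():
        if props.get(name) is None:
            problems.append(f"missing property: {name}")
        elif not isinstance(props[name], expected_type):
            problems.append(f"wrong type for {name}: expected {expected_type.__name__}")
    ts = event.get("timestamp")
    if ts is None:
        problems.append("missing timestamp")
    elif ts.tzinfo is None or ts.utcoffset() != timedelta(0):
        problems.append("timestamp is not UTC")
    elif ts > now:
        problems.append("timestamp is in the future")
    return problems


fixed_now = datetime(2024, 1, 2, tzinfo=timezone.utc)  # fixed clock for a repeatable example
good = {
    "properties": {"user_id": "u1", "plan_id": "pro", "source": "ads"},
    "timestamp": datetime(2024, 1, 1, tzinfo=timezone.utc),
}
bad = {
    "properties": {"user_id": "u1", "source": "ads"},
    "timestamp": datetime(2024, 1, 3, tzinfo=timezone.utc),
}
good_problems = validate_event(good, SIGNUP_CONTRACT, now=fixed_now)
bad_problems = validate_event(bad, SIGNUP_CONTRACT, now=fixed_now)
print(good_problems, bad_problems)
```

Returning a list of problems rather than a boolean makes the output directly usable in a bug report to engineering.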
Common mistakes and how to self-check
- Watching raw volume only
  - Self-check: Always compute events per active user
- Assuming marketing seasonality explains drops
  - Self-check: Compare the event's ratio to its upstream event; seasonality moves both together, so the ratio should stay stable
- Ignoring null rates
  - Self-check: Track property presence as a KPI
- Not segmenting by app version
  - Self-check: Build a version breakdown chart
- Fixing dashboards instead of data
  - Self-check: Validate the raw payload before patching charts
Practical projects
- Build an anomaly view: for 5 key events, chart 7-day rolling ratio to upstream and null rates for 3 required properties
- Create a schema contract: define required properties and types; include event_version; mock an alert that fires when a property's null rate increases by more than 10%
- Implement dedupe logic: design an idempotency rule using order_id within a 5-minute window
Exercises (do these before the quick test)
- Exercise 1: Find a missing event via ratio analysis (see below)
- Exercise 2: Diagnose a broken schema via null rates (see below)
Exercise 1 data and hints
You observe the following yesterday vs 7-day median:
- Active users: -3%
- ProductViewed: -4%
- AddToCart: -52%
- CheckoutStarted: -49%
Task: Determine if AddToCart is missing or broken. Specify two additional checks to confirm.
Exercise 2 data and hints
SignupCompleted has required properties: user_id, plan_id, source. Yesterday property presence:
- user_id: 100%
- plan_id: 42% (was 99% last week)
- source: 99%
Task: Is this a missing event or a broken event? Propose a remediation and an alert threshold.
Mini challenge
Your funnel is ProductViewed → AddToCart → CheckoutStarted → Purchase. Today, Purchase is flat, CheckoutStarted down 20%, AddToCart down 3%, ProductViewed flat. What’s your top hypothesis and first query?
Possible approach
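Hypothesis: CheckoutStarted is under-reported (broken or gated by a release) rather than reflecting a real behavior change, since Purchase sits downstream of it in the funnel and is flat. First query: CheckoutStarted per active user by app_version and platform; also check the null rate for cart_id on CheckoutStarted.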
Learning path
- Before: Event design and naming; consistent IDs
- This lesson: Detect missing and broken events
- Next: Alerting and data contracts; rollout validation and canary checks
Next steps
- Set baselines and thresholds for your top 10 events
- Add an event_version property where missing
- Schedule a weekly 15-minute data health review