luvv to helpDiscover the Best Free Online Tools
Topic 5 of 8

Attribution Ready Data Prep

Learn Attribution Ready Data Prep for free with explanations, exercises, and a quick test (for Marketing Analyst).

Published: December 22, 2025 | Updated: December 22, 2025

Why this matters

Attribution is only as good as the data you feed it. As a Marketing Analyst, you’ll answer questions like “Which channels drive conversions?” and “Where should we put the next dollar?” To do that, you need clean, standardized, deduplicated touchpoint data that connects to conversions within a clear lookback window. This lesson shows you how to prepare attribution-ready data with SQL.

Who this is for

  • Marketing Analysts and Growth Analysts who run channel/ROI analysis.
  • Data-savvy marketers who need reliable multi-touch paths.
  • Anyone building first-touch, last-touch, or multi-touch attribution reports.

Prerequisites

  • Comfort with SQL SELECT, JOIN, GROUP BY, and window functions.
  • Basic understanding of UTM parameters, channels, and conversions.
  • Familiarity with timestamps and time zones.

Concept explained simply

Attribution-ready data is a tidy table of user touchpoints linked to conversions. Each row represents a candidate touch (ad click, email click, session start) with:

  • Unified identity (stable user key).
  • Standardized channel/campaign fields.
  • Clean timestamps in one timezone.
  • Deduplication rules applied.
  • Join logic to match touches to conversions within a lookback window.
Mental model

Think of your dataset as a timeline per user:

  • First, make every touch comparable (same fields, same time zone, one user key).
  • Then, remove duplicates and noise (bots, internal traffic).
  • Finally, attach conversions to the relevant previous touches within the window (e.g., 30 days).

Once you have this tidy base, any attribution rule (first-touch, last-touch, linear, position-based) becomes a small SQL step on top.

Data you typically have

  • ad_clicks(click_id, user_id, device_id, utm_source, utm_medium, utm_campaign, clicked_at_utc)
  • ad_impressions(imp_id, user_id, device_id, utm_source, utm_medium, utm_campaign, seen_at_utc) – optional
  • web_sessions(session_id, user_id, started_at_utc, landing_page, is_bot)
  • conversions(order_id, user_id, revenue, converted_at_utc)
  • channel_map(source, medium, channel)

Note: Your naming may differ. The steps are the same.

Core steps to make data attribution-ready

  1. Unify identity – Create a stable key with COALESCE, e.g., COALESCE(user_id, device_id) AS person_id.
  2. Normalize channels – Map UTM source/medium to a standard channel taxonomy.
  3. Standardize timestamps – Convert all times to UTC; store a date for partitioning.
  4. Deduplicate – Remove duplicate clicks/impressions by a natural/business key.
  5. Filter noise – Exclude bots, internal traffic, and test users.
  6. Define lookback window – E.g., 30 days from touch to conversion.
  7. Sequence touches – Use window functions to order touches per person.
  8. Attach conversions – Match each conversion to eligible prior touches.

Worked examples

Example 1 – Normalize and deduplicate ad clicks
-- 1) Map UTM to channel and deduplicate clicks
WITH mapped AS (
  SELECT 
    COALESCE(user_id, device_id) AS person_id,
    LOWER(utm_source) AS utm_source,
    LOWER(utm_medium) AS utm_medium,
    utm_campaign,
    clicked_at_utc::timestamp AS clicked_at_utc,
    click_id,
    ROW_NUMBER() OVER (PARTITION BY click_id ORDER BY clicked_at_utc) AS rn
  FROM ad_clicks
), dedup AS (
  SELECT * FROM mapped WHERE rn = 1
)
SELECT d.person_id,
       cm.channel AS canonical_channel,
       d.utm_source, d.utm_medium, d.utm_campaign,
       d.clicked_at_utc,
       DATE_TRUNC('day', d.clicked_at_utc) AS clicked_date
FROM dedup d
LEFT JOIN channel_map cm
  ON cm.source = d.utm_source AND cm.medium = d.utm_medium;

Result: One clean row per actual click with a consistent channel.

Example 2 – Create a unified touch table
-- 2) Union multiple touch sources into one table
WITH clicks AS (
  SELECT COALESCE(user_id, device_id) AS person_id,
         clicked_at_utc AS touch_at,
         'click' AS touch_type,
         cm.channel AS channel,
         utm_campaign
  FROM ad_clicks c
  LEFT JOIN channel_map cm
    ON cm.source = LOWER(c.utm_source) AND cm.medium = LOWER(c.utm_medium)
  QUALIFY ROW_NUMBER() OVER (PARTITION BY c.click_id ORDER BY clicked_at_utc) = 1
), sessions AS (
  SELECT user_id AS person_id,
         started_at_utc AS touch_at,
         'session' AS touch_type,
         CASE WHEN landing_page ILIKE '%/email/%' THEN 'Email' ELSE 'Direct/Organic' END AS channel,
         NULL AS utm_campaign
  FROM web_sessions
  WHERE COALESCE(is_bot,false) = false
)
SELECT person_id, touch_at, touch_type, channel, utm_campaign
FROM clicks
UNION ALL
SELECT person_id, touch_at, touch_type, channel, utm_campaign
FROM sessions;

Result: A single timeline of touches per person across sources.

Example 3 – Attach conversions with a 30-day lookback
-- 3) For each conversion, find eligible prior touches (30 days)
WITH t AS (
  SELECT person_id, touch_at, channel,
         ROW_NUMBER() OVER (PARTITION BY person_id ORDER BY touch_at) AS touch_n
  FROM unified_touches
), c AS (
  SELECT person_id, order_id, converted_at_utc AS conv_at, revenue
  FROM conversions
), joined AS (
  SELECT c.order_id, c.person_id, c.conv_at, c.revenue,
         t.touch_at, t.channel,
         t.touch_n
  FROM c
  JOIN t ON t.person_id = c.person_id
        AND t.touch_at <= c.conv_at
        AND t.touch_at >= c.conv_at - INTERVAL '30 days'
)
SELECT order_id, person_id, conv_at, revenue,
       ARRAY_AGG(channel ORDER BY touch_at) AS path_channels,
       ARRAY_AGG(touch_at ORDER BY touch_at) AS path_times,
       (ARRAY_AGG(channel ORDER BY touch_at DESC))[1] AS last_touch_channel,
       (ARRAY_AGG(channel ORDER BY touch_at))[1] AS first_touch_channel
FROM joined
GROUP BY 1,2,3,4;

Result: For each conversion, you get an ordered path and clear first/last touch channels.

Attribution-ready checklist

  • [ ] All timestamps in UTC.
  • [ ] One stable person_id per row.
  • [ ] Channel taxonomy applied and documented.
  • [ ] Duplicate touches removed via clear keys.
  • [ ] Bot/internal/test traffic excluded.
  • [ ] Lookback window chosen and justified (e.g., 30 days).
  • [ ] Conversions attached only to earlier touches.
  • [ ] Edge cases handled (same-minute clicks, multiple conversions per user).

Common mistakes and how to self-check

  • Mixed time zones – Symptom: touches appearing after conversion when they weren’t. Fix: convert all to UTC at ingest.
  • Duplicate clicks – Symptom: inflated paths. Fix: ROW_NUMBER() over a natural key (e.g., click_id) and keep rn=1.
  • Loose joins – Symptom: future touches linked to past conversions. Fix: enforce touch_at <= conv_at.
  • Missing identity rules – Symptom: fragmented paths by device. Fix: COALESCE to a priority stack (user_id > device_id).
  • Unclear channel taxonomy – Symptom: “Other” dominates. Fix: maintain a channel_map table and version it.

Exercises (hands-on)

Complete these in your SQL environment. The Quick Test below is available to everyone. Logged-in users get saved progress.

Exercise 1: Map UTMs, deduplicate clicks, standardize time

Goal: Produce a clean clicks table with person_id, canonical channel, and UTC timestamps.

Data assumptions
  • Tables: ad_clicks, channel_map
  • Deduplicate on click_id
  • Use COALESCE(user_id, device_id) as person_id
Expected output (sample)
person_id | clicked_at_utc        | channel   | utm_campaign
--------- | --------------------- | --------- | ------------
U123      | 2025-05-01 10:12:00   | Paid Search | spring_sale
U555      | 2025-05-01 11:45:13   | Paid Social | brand_push

Exercise 2: Build conversion paths with a 30-day lookback

Goal: For each conversion, attach prior touches and compute first-touch and last-touch channels.

Data assumptions
  • Tables: unified_touches(person_id, touch_at, channel), conversions(person_id, order_id, revenue, converted_at_utc)
  • Lookback: 30 days
Expected output (sample)
order_id | person_id | first_touch | last_touch | path_channels
-------- | --------- | ----------- | ---------- | -----------------------------
O101     | U123      | Paid Social | Email      | {Paid Social, Paid Search, Email}
O102     | U555      | Paid Search | Paid Search| {Paid Search}

Practical projects

  • Build a daily attribution-ready mart (incremental) that outputs one row per conversion with path arrays.
  • Create a channel taxonomy documentation page and a validation query that surfaces unmapped source/medium pairs.
  • Implement parameterized lookback windows (7/14/30 days) and compare attribution shifts.

Learning path

  1. Standardize timestamps and identity keys on all marketing tables.
  2. Create and apply a channel_map; fix unmapped cases.
  3. Build a unified_touches view (union of clicks/sessions/email).
  4. Attach conversions with a clear lookback window.
  5. Add first-touch/last-touch fields; then test multi-touch rules.
  6. Schedule and monitor the pipeline; add data quality checks.

Next steps

  • Layer on attribution models (first, last, linear, position-based).
  • Create channel ROI dashboards based on the prepared data.
  • Iterate on identity rules and channel taxonomy as tracking evolves.

Mini challenge

Extend the join to include impressions (view-through) but only if there is no click within 24 hours of the conversion. How would you prioritize clicks over impressions in SQL?

Reference snippets

Prioritize clicks over impressions
WITH candidates AS (
  SELECT c.order_id, t.person_id, t.touch_at, t.channel, t.touch_type,
         ROW_NUMBER() OVER (
           PARTITION BY c.order_id
           ORDER BY CASE WHEN t.touch_type = 'click' THEN 1 ELSE 2 END, t.touch_at DESC
         ) AS ranky
  FROM conversions c
  JOIN unified_touches t
    ON t.person_id = c.person_id
   AND t.touch_at <= c.converted_at_utc
   AND t.touch_at >= c.converted_at_utc - INTERVAL '30 days'
)
SELECT * FROM candidates WHERE ranky = 1;

Pro tips

Small but impactful habits
  • Name lookback windows explicitly in column names (e.g., last_touch_30d) to avoid confusion.
  • Keep a QA query that counts touches after conversion; it should be zero.
  • Version your channel_map and store the version used per output date.

Practice Exercises

2 exercises to complete

Instructions

Create a clean clicks dataset with:

  • person_id = COALESCE(user_id, device_id)
  • clicked_at_utc normalized to timestamp
  • channel from channel_map (join on lower(source), lower(medium))
  • Deduplicate by click_id using ROW_NUMBER()

Return columns: person_id, clicked_at_utc, channel, utm_campaign.

Expected Output
person_id | clicked_at_utc | channel | utm_campaign U123 | 2025-05-01 10:12:00 | Paid Search | spring_sale U555 | 2025-05-01 11:45:13 | Paid Social | brand_push

Attribution Ready Data Prep — Quick Test

Test your knowledge with 6 questions. Pass with 70% or higher.

6 questions70% to pass

Have questions about Attribution Ready Data Prep?

AI Assistant

Ask questions about this tool