Why this matters
As a BI Developer, you turn business questions into trusted dashboards. That only works if you can quickly find the right data, understand its limits, and call out what is missing. This skill helps you avoid stalled projects, wrong metrics, and last-minute surprises.
- Real tasks: map KPIs to systems/tables, confirm fields and grain, check freshness, and document gaps with clear mitigation options.
- Common scenarios: marketing ROI needing ad spend + revenue, churn needing lifecycle events, supply metrics needing inventory and sales.
Concept explained simply
Identifying data sources and gaps means: list all possible places your needed fields could come from, evaluate if they are fit for purpose, and write down anything that prevents you from building the metric today.
Mental model: GRAIN+Q
- G = Grain: what one row represents (order, order line, session, event).
- R = Range (history): how far back the data goes.
- A = Availability: access, permissions, and where it lives (DW, SaaS, API).
- I = Integration: how you join it (keys, IDs, mapping tables).
- N = Naming/definitions: do fields match your metric definition?
- +Q = Quality: freshness, completeness, nulls, duplicates.
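The GRAIN+Q checklist can be kept as a small structured record per candidate source, which makes the fit-for-purpose decision explicit. A minimal sketch; all field and table names below are illustrative, not a fixed schema:

```python
# GRAIN+Q checklist captured as one record per candidate source.
from dataclasses import dataclass


@dataclass
class SourceCheck:
    system: str
    table: str
    grain: str               # G: what one row represents
    history_months: int      # R: how far back the data goes
    available: bool          # A: do we have access today?
    join_keys: list          # I: keys we can join on
    fields_match_def: bool   # N: do fields match the metric definition?
    quality_notes: str = ""  # +Q: freshness, nulls, duplicates


def fit_for_purpose(s: SourceCheck, min_history_months: int) -> list:
    """Return a list of gap categories; an empty list means no blockers found."""
    gaps = []
    if not s.available:
        gaps.append("access")
    if s.history_months < min_history_months:
        gaps.append("history")
    if not s.join_keys:
        gaps.append("integration")
    if not s.fields_match_def:
        gaps.append("definition")
    return gaps


orders = SourceCheck("DW", "fact_orders", "one row per order line",
                     history_months=36, available=True,
                     join_keys=["order_id", "campaign_id"],
                     fields_match_def=True)
print(fit_for_purpose(orders, min_history_months=12))  # -> []
```

A record like this drops straight into the source inventory template later in this lesson; a spreadsheet with the same columns works just as well.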
What to capture for each candidate source
- Owner/contact and system (e.g., Warehouse, ERP, CRM, Ads platform)
- Table/view name and last refresh time
- Grain (row meaning) and primary keys
- Critical fields present/absent
- History depth and retention policies
- Freshness/cadence (real-time, hourly, daily)
- Access/privacy constraints (PII, restricted scopes)
- Known data quality notes (nulls, duplicates, outliers)
A repeatable 5-step method
- Clarify the metric question and decision. Write the exact definition, filters, and time window.
- Derive the data needs. List entities, fields, and required joins.
- Inventory candidate sources. For each, fill the GRAIN+Q checklist.
- Identify and categorize gaps. Missing field, grain mismatch, freshness/latency, coverage, history, definition misalignment, access/privacy.
- Recommend mitigations. Alternate sources, derive/compute, adjust definition, add tracking, use proxies, or accept latency. Document trade-offs.
Simple templates you can copy
Source inventory entry:
- System/Table:
- Owner:
- Grain/PK:
- Key fields present/missing:
- History depth:
- Freshness:
- Access/privacy:
- Quality notes:
Gap log entry:
- Gap:
- Type (missing field / grain / freshness / history / coverage / definition / access):
- Impact:
- Mitigation options:
- Decision/next step:
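The gap log template above can also be kept in code, which lets you validate that every entry uses one of the agreed gap types. A minimal sketch, reusing the categories from step 4 of the method:

```python
# The gap log entry as a structure; the categories mirror the template above.
GAP_TYPES = {"missing field", "grain", "freshness", "history",
             "coverage", "definition", "access"}


def gap_entry(gap, gap_type, impact, mitigations, decision):
    """Build a gap log entry, validating the type against the agreed categories."""
    if gap_type not in GAP_TYPES:
        raise ValueError(f"unknown gap type: {gap_type}")
    return {"gap": gap, "type": gap_type, "impact": impact,
            "mitigations": mitigations, "decision": decision}


entry = gap_entry(
    gap="billing_events retained only 9 months",
    gap_type="history",
    impact="cannot build 12-month cohorts",
    mitigations=["restore archived data", "shorten window to 9 months"],
    decision="restore archive; re-check coverage",
)
```

Validating the type up front keeps the log filterable later, when you summarize gaps by category for stakeholders.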
Worked examples
Example 1: Weekly revenue by campaign
Requirement: Weekly net revenue by campaign for last 26 weeks, only paid orders, attributed to last click from web.
- Data needs: order_id, net_revenue, paid status, order_date, campaign_id/utm, last-click logic, channel filters.
- Candidate sources: DW fact_orders (paid flag, revenue), dim_campaigns, web_analytics_sessions (utm, last_click).
- Gaps found:
- Freshness: orders are refreshed daily at 02:00; stakeholders expect near real-time. Type: freshness.
- Attribution: last-click field not materialized; needs logic. Type: definition/transform.
- Coverage: direct traffic has no campaign. Type: coverage.
- Mitigations: set expectation to daily at 09:00; implement deterministic last-click in model; display "Unattributed" bucket.
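The "deterministic last-click" mitigation can be sketched as: for each order, take the campaign of the latest session that started before the order, and fall back to an "Unattributed" bucket when there is none or when the last session was direct. The session and order field names here are assumptions for illustration:

```python
# Minimal deterministic last-click attribution sketch.
from datetime import datetime


def last_click_campaign(order, sessions):
    """Campaign of the user's latest session before the order, else 'Unattributed'."""
    prior = [s for s in sessions
             if s["user_id"] == order["user_id"]
             and s["started_at"] <= order["order_ts"]]
    if not prior:
        return "Unattributed"  # coverage gap: no session before the order
    last = max(prior, key=lambda s: s["started_at"])
    # Direct traffic carries no campaign -> the "Unattributed" bucket.
    return last.get("campaign") or "Unattributed"


sessions = [
    {"user_id": 1, "started_at": datetime(2024, 3, 1, 9), "campaign": "spring_sale"},
    {"user_id": 1, "started_at": datetime(2024, 3, 1, 12), "campaign": None},  # direct
]
order = {"user_id": 1, "order_ts": datetime(2024, 3, 1, 13)}
print(last_click_campaign(order, sessions))  # -> Unattributed
```

In practice this logic would live in a model/transformation layer, but writing it out first forces agreement on ties, lookback windows, and the direct-traffic rule.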
Example 2: Churn rate by monthly cohort
Requirement: Churn rate for subscription users by signup cohort, last 12 months.
- Data needs: user_id, signup_date, cancel_date, reactivation, plan_status timeline.
- Candidate sources: billing_events (subscription_started/canceled), users_dim (signup), support_tickets (cancellations via support).
- Gaps found:
- History: billing_events only retained 9 months. Type: history.
- Integration: some cancellations only in support_tickets, no reliable user_id; email used. Type: integration.
- Mitigations: pull archived billing data from cold storage to cover older months; create email-to-user_id mapping table and measure residual unmatched rate.
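The email-to-user_id mapping mitigation is worth prototyping before committing, so you can quote the residual unmatched rate to stakeholders. A sketch under assumed table and field names:

```python
# Join support-ticket cancellations to users by normalized email,
# and measure the residual unmatched rate.
def map_cancellations(tickets, users_by_email):
    matched, unmatched = [], []
    for t in tickets:
        email = (t.get("email") or "").strip().lower()  # normalize before matching
        user_id = users_by_email.get(email)
        (matched if user_id else unmatched).append({**t, "user_id": user_id})
    rate = len(unmatched) / len(tickets) if tickets else 0.0
    return matched, unmatched, rate


users_by_email = {"a@x.com": 101, "b@x.com": 102}
tickets = [{"email": "A@x.com"}, {"email": "c@x.com"}]
matched, unmatched, rate = map_cancellations(tickets, users_by_email)
print(rate)  # -> 0.5
```

If the unmatched rate stays high after normalization, record it in the gap log as a known coverage limitation of the churn metric rather than silently dropping rows.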
Example 3: Out-of-stock rate by store
Requirement: Daily OOS rate by store and product category.
- Data needs: on_hand_qty by day, sales by day, store_id, product_category.
- Candidate sources: inventory_snapshot_daily, sales_fact, product_dim.
- Gaps found:
- Grain mismatch: inventory at store-product-day; sales at order line with timestamps. Type: grain.
- Quality: negative on_hand values on weekends. Type: quality.
- Mitigations: aggregate sales to day and align timezone; cap negative inventory at zero and flag days with anomalies for exclusion.
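The two mitigations above can be sketched directly: roll order-line sales up to the store/day grain, and cap negative on-hand values at zero while flagging the affected rows for review. Field layouts are illustrative:

```python
# Grain alignment and negative-inventory handling, sketched in plain Python.
from collections import defaultdict
from datetime import datetime


def sales_by_store_day(order_lines):
    """Aggregate (store_id, timestamp, qty) order lines to store/day grain."""
    daily = defaultdict(float)
    for store_id, ts, qty in order_lines:
        daily[(store_id, ts.date())] += qty
    return dict(daily)


def clean_on_hand(snapshots):
    """Cap negative on-hand at zero and flag anomalous rows for exclusion review."""
    cleaned = []
    for store_id, day, qty in snapshots:
        flagged = qty < 0  # quality gap: negative inventory
        cleaned.append((store_id, day, max(qty, 0), flagged))
    return cleaned


lines = [("S1", datetime(2024, 5, 4, 9, 30), 2),
         ("S1", datetime(2024, 5, 4, 17, 0), 3)]
print(sales_by_store_day(lines))  # -> {('S1', datetime.date(2024, 5, 4)): 5.0}
```

Note the timezone caveat from the mitigation still applies: convert timestamps to the store's local timezone before calling `.date()`, or store/day totals will drift at day boundaries.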
Exercises
Complete these before checking solutions. Use the checklist below to stay consistent.
Exercise 1 — Source inventory and gap log: Monthly Active Users by country
Goal: Identify sources and gaps to report Monthly Active Users (MAU) by country for the last 12 months, web only, excluding known bots.
- Define MAU precisely.
- List candidate sources and fill GRAIN+Q for each.
- List gaps and propose mitigations.
Deliverable: a short source inventory and a gap log with decisions.
Exercise 2 — Source mapping: Gross margin weekly by product category
Goal: Identify sources and gaps to report gross margin by product category, weekly, last 52 weeks.
- Define gross margin in fields.
- List candidate sources and joins.
- Flag gaps and mitigation options.
Deliverable: a brief mapping with gaps and suggested path.
Exercise checklist
- Exact metric definition with filters/time window.
- Source inventory includes owner, grain, history, freshness, key fields.
- Join keys identified; fallbacks if missing.
- Gaps categorized and impact stated.
- Mitigations are realistic with trade-offs.
Common mistakes and self-check
- Mistake: Assuming field names equal definitions. Self-check: compare business definition with source documentation and sample values.
- Mistake: Ignoring grain. Self-check: write row meaning for every table you use.
- Mistake: Overlooking history limits. Self-check: query min/max dates and retention policies.
- Mistake: Missing join key quality. Self-check: compute join coverage and duplicate rates.
- Mistake: Promising real-time on daily sources. Self-check: write the refresh cadence and share it with stakeholders.
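The join-coverage and duplicate-rate self-checks can be run with two short queries. A sketch against an in-memory SQLite database; the table and column names are illustrative, so substitute your warehouse tables:

```python
# Join coverage and duplicate rate, computed with plain SQL via sqlite3.
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE fact_orders (order_id INT, campaign_id INT);
    CREATE TABLE dim_campaigns (campaign_id INT);
    INSERT INTO fact_orders VALUES (1, 10), (2, 10), (3, NULL), (4, 99);
    INSERT INTO dim_campaigns VALUES (10), (11);
""")

# Join coverage: share of fact rows that find a match in the dimension.
coverage = con.execute("""
    SELECT AVG(CASE WHEN d.campaign_id IS NOT NULL THEN 1.0 ELSE 0.0 END)
    FROM fact_orders f
    LEFT JOIN dim_campaigns d ON f.campaign_id = d.campaign_id
""").fetchone()[0]

# Duplicate rate on the supposed primary key (should be 0).
dupes = con.execute("""
    SELECT COUNT(*) - COUNT(DISTINCT order_id) FROM fact_orders
""").fetchone()[0]

print(coverage, dupes)  # -> 0.5 0
```

A coverage of 0.5 here flags both the NULL campaign (direct traffic) and the orphaned campaign_id 99; both belong in the gap log with their own mitigation.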
Mini challenge
Stakeholder asks: "Show average resolution time for high-priority tickets by customer tier, last quarter." In 5 minutes, jot down: metric definition, candidate sources, expected gaps, and one mitigation per gap. Keep it concise.
Who this is for
- Aspiring and junior BI Developers who translate stakeholder needs into datasets and dashboards.
- Data analysts moving toward BI/analytics engineering responsibilities.
Prerequisites
- Comfort with SQL joins, aggregations, and date logic.
- Basic understanding of data warehousing concepts (facts, dimensions, grain).
- Familiarity with your company's data sources and access policies.
Learning path
- Requirement clarification and metric definitions.
- Identifying data sources and gaps (this lesson).
- Proposing mitigations and scoping transformations.
- Modeling datasets with tested joins and grain.
- Validating metrics and documenting assumptions.
Practical projects
- Build a source inventory for one domain (e.g., marketing). Include grain, history, freshness, and owner for the key tables.
- Create a gap log for two KPIs and propose concrete mitigations with timelines.
- Publish a one-page data readiness summary to align stakeholder expectations.
Next steps
- Apply the GRAIN+Q model to your current top KPI.
- Finish the exercises and then take the Quick Test.
- Schedule a brief review with a domain owner to confirm assumptions.
Progress and test
The Quick Test below is available to everyone. Only logged-in users have their progress saved.