Why this matters
Certified datasets are the trusted, governed sources that BI Developers expose to analysts and business users. They reduce duplicated logic, prevent conflicting numbers, speed up dashboard delivery, and make audits easier.
- Real tasks you will do: decide which dataset becomes the official source, document its meaning and lineage, set quality checks, control access, and manage version changes safely.
- Impact: fewer ad‑hoc extracts, consistent KPIs, and stakeholder confidence earned faster.
Who this is for & Prerequisites
- Who this is for: BI Developers, Analytics Engineers, Data Stewards, and anyone publishing shared data models.
- Prerequisites: basic SQL, understanding of your data warehouse model, and familiarity with your BI platform’s dataset or semantic layer concepts.
Concept explained simply
Think of your analytics environment like a library. Many books (datasets) exist, but only a few are on the “certified shelf.” Those books follow strict rules: they’re well‑edited, documented, and owned by someone. When people need facts, they grab the certified edition.
Mental model: Bronze/Silver/Gold tiers. Certification is the “Gold” tier: highest trust, documented, monitored, governed, and supported.
Certification in one sentence
A certified dataset is a governed, documented, monitored, and owned data product that business users can trust for decisions.
Certification criteria (practical checklist)
Use this before promoting any dataset to certified. Aim for each item to be verifiably true.
- Clear purpose and scope: what questions it answers and what it does NOT cover.
- Single source of truth for its metric domain; duplicates are deprecated or redirected.
- Named owner and steward with contact channel.
- Documentation: business definitions, field-level descriptions, refresh schedule, known limitations.
- Lineage recorded: source systems, transformation steps, and dependencies.
- Quality controls: automated tests (row counts, null rates, referential integrity, KPI thresholds).
- Access policy: who can view/query; PII handling defined; row/column security if needed.
- Performance: typical queries return within acceptable time; aggregations/indexing tuned.
- Versioning and change log with semantic versioning (e.g., 1.3.0) and deprecation windows.
- Monitoring and alerts on freshness, volume, and test failures.
- Naming and tags: clear, stable name; tags like domain=finance, status=certified.
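The quality controls in the checklist above can be automated. Below is a minimal Python sketch of three such checks (freshness, null rate, uniqueness) against an in-memory SQLite table; the `orders` table, its columns, and the thresholds are illustrative assumptions, not a prescribed schema.

```python
import sqlite3
from datetime import datetime, timedelta

# Illustrative table: in practice you would point these checks at your warehouse.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (order_id TEXT, customer_id TEXT, loaded_at TEXT)")
now = datetime(2024, 1, 1, 12, 0)  # fixed "now" so the example is reproducible
conn.executemany(
    "INSERT INTO orders VALUES (?, ?, ?)",
    [
        ("o1", "c1", (now - timedelta(hours=1)).isoformat()),
        ("o2", "c2", (now - timedelta(hours=2)).isoformat()),
        ("o3", None, (now - timedelta(hours=3)).isoformat()),
    ],
)

def check_freshness(conn, max_age_hours=4):
    # Newest load must be within the freshness SLA.
    (latest,) = conn.execute("SELECT MAX(loaded_at) FROM orders").fetchone()
    return now - datetime.fromisoformat(latest) <= timedelta(hours=max_age_hours)

def check_null_rate(conn, column="customer_id", max_rate=0.001):
    # Share of NULLs in the column must stay under the threshold.
    total, nulls = conn.execute(
        f"SELECT COUNT(*), SUM(CASE WHEN {column} IS NULL THEN 1 ELSE 0 END) FROM orders"
    ).fetchone()
    return (nulls / total) <= max_rate

def check_uniqueness(conn, column="order_id"):
    # Primary-key column must have no duplicates.
    total, distinct = conn.execute(
        f"SELECT COUNT(*), COUNT(DISTINCT {column}) FROM orders"
    ).fetchone()
    return total == distinct

results = {
    "freshness": check_freshness(conn),
    "null_rate": check_null_rate(conn),
    "uniqueness": check_uniqueness(conn),
}
print(results)  # null_rate fails: 1 of 3 rows has a NULL customer_id
```

In a real setup these checks would run on a schedule and feed the monitoring and alerting items from the checklist.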
Practical workflow to certify a dataset
- Propose
- Write a short proposal: purpose, target users, KPIs/fields list, expected refresh.
- Map lineage from raw to semantic model.
- Prepare
- Implement data quality rules (DQ) and set thresholds.
- Complete field descriptions and business definitions.
- Benchmark performance with common queries.
- Review
- Peer review by another BI/AE plus a data steward.
- Check the certification checklist item by item.
- Approve & Tag
- Record owner/steward, add status=certified, and semantic version (e.g., 1.0.0).
- Publish release notes.
- Monitor
- Set alerts on freshness, volumes, DQ tests.
- Track usage; deprecate overlapping datasets.
- Manage change
- For breaking changes, bump major version, communicate impact, provide migration path.
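The versioning rules from the workflow (major for breaking changes, minor for additions) can be captured in a small helper. A sketch, assuming plain `major.minor.patch` version strings:

```python
def bump_version(version: str, change: str) -> str:
    """Bump a semantic version string; change is 'major', 'minor', or 'patch'."""
    major, minor, patch = (int(p) for p in version.split("."))
    if change == "major":   # breaking: renamed/removed fields, changed grain
        return f"{major + 1}.0.0"
    if change == "minor":   # non-breaking: new fields, new tags
        return f"{major}.{minor + 1}.0"
    if change == "patch":   # fixes with no contract change
        return f"{major}.{minor}.{patch + 1}"
    raise ValueError(f"unknown change type: {change}")

print(bump_version("1.2.0", "minor"))  # 1.3.0 (e.g., a new field added)
print(bump_version("1.2.0", "major"))  # 2.0.0 (e.g., a breaking rename)
```

Pair every major bump with release notes, an announced deprecation window, and a migration path.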
Worked examples
Example 1: Certified Sales Orders
- Purpose: official count of orders, revenue, AOV.
- Scope exclusions: returns and chargebacks are tracked in a separate dataset.
- Key DQ rules: daily freshness < 4 hours; null customer_id < 0.1%; revenue >= 0; order_id uniqueness = 100%.
- Owner/Steward: Commerce BI lead / Data governance analyst.
- Performance: the “top product revenue by month” query returns in under 5 seconds for the last 24 months.
- Versioning: 1.2.0 (added discount_rate field, non-breaking).
Doc snippet
Revenue: sum(order_line_amount_after_discounts) in company currency at time of order. Excludes taxes and shipping. AOV = revenue / distinct orders.
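The revenue and AOV definitions in the snippet above translate directly into a query. A sketch using an in-memory SQLite table; the `order_lines` table and its values are made up for illustration:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE order_lines (order_id TEXT, order_line_amount_after_discounts REAL)"
)
# Two lines for order o1, one for o2 (illustrative amounts in company currency).
conn.executemany(
    "INSERT INTO order_lines VALUES (?, ?)",
    [("o1", 40.0), ("o1", 10.0), ("o2", 50.0)],
)
revenue, order_count, aov = conn.execute(
    """
    SELECT SUM(order_line_amount_after_discounts)      AS revenue,
           COUNT(DISTINCT order_id)                    AS order_count,
           SUM(order_line_amount_after_discounts) * 1.0
               / COUNT(DISTINCT order_id)              AS aov
    FROM order_lines
    """
).fetchone()
print(revenue, order_count, aov)  # 100.0 2 50.0
```

Encoding the definition as a single query keeps every dashboard that reuses it consistent with the documented metric.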
Example 2: Customer 360
- Purpose: unified customer attributes for segmentation and LTV.
- Security: PII masked at the column level for general audiences; full access only for authorized roles.
- DQ: email format valid >= 98%; duplicate customer profiles merge rate > 95%; foreign keys to orders 100% valid.
- Monitoring: spikes in new records > 3x baseline trigger an alert.
Doc snippet
Primary key: customer_sk (surrogate). Use for joins. Do not join on email. LTV_365 equals sum of net revenue in 365 days after first order.
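The spike alert from the monitoring rule above (new records more than 3x baseline) is a simple comparison against a rolling baseline. A sketch with made-up daily counts:

```python
def spike_alert(new_records_today: int, baseline: float, factor: float = 3.0) -> bool:
    """Alert when today's new-record count exceeds factor x the baseline."""
    return new_records_today > factor * baseline

# Illustrative counts: baseline is the mean of the prior days, last value is today.
daily_counts = [1000, 1100, 950, 1050, 4000]
baseline = sum(daily_counts[:-1]) / len(daily_counts[:-1])  # 1025.0
print(spike_alert(daily_counts[-1], baseline))  # True: 4000 > 3 * 1025
```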
Example 3: Marketing Spend
- Purpose: certified spend by channel and campaign, aligned with finance postings.
- Lineage: ad platform extracts → mapping table → finance adjustments → semantic model.
- DQ: spend non-negative; channel in approved list; freshness within 24 hours; reconciliation variance with finance < 1% monthly.
- Change management: breaking rename of channel_group in 2.0.0 with a 60‑day deprecation notice.
Doc snippet
Spend is post‑adjustment amount aligned to finance. Use channel_group for reporting, not raw vendor channel.
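The reconciliation rule above (monthly variance with finance under 1%) reduces to a one-line check. A sketch with hypothetical spend figures:

```python
def reconciliation_ok(bi_spend: float, finance_spend: float,
                      max_variance: float = 0.01) -> bool:
    """Pass when relative variance vs. the finance posting is under the threshold."""
    return abs(bi_spend - finance_spend) / finance_spend < max_variance

print(reconciliation_ok(100_500.0, 100_000.0))  # True: 0.5% variance
print(reconciliation_ok(102_000.0, 100_000.0))  # False: 2% variance
```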
Common mistakes and how to self-check
- Mistake: Certifying duplicate datasets for the same KPI. Self-check: search existing assets; if overlap, deprecate or merge.
- Mistake: No owner. Self-check: ensure named owner and backup steward are in the metadata.
- Mistake: Vague metric definitions. Self-check: can a new analyst reproduce KPI exactly from doc? If not, clarify.
- Mistake: Ignoring change management. Self-check: do you use semantic versioning and announce breaking changes?
- Mistake: Missing DQ tests. Self-check: at least one freshness, volume, validity, and business rule test is automated.
Exercises
Exercise ex1 — Build a certification checklist for one dataset
Pick a dataset you work with (or use the Sales Orders example). Draft a certification checklist and complete it as if you’re submitting for approval.
Instructions
- Write a one-paragraph purpose and scope with at least one explicit exclusion.
- List owner and steward contacts.
- Define 4+ DQ tests: freshness, nulls, uniqueness, business rule.
- Document 3 key fields with definitions.
- Propose access policy and tags.
- Assign version and write a two-line change log.
Expected output
A completed checklist (bullet list or table) plus a short doc snippet with field definitions and DQ rules.
Sample solution
Sample outline:
- Purpose: official orders metrics (revenue, AOV). Exclusion: returns handled elsewhere.
- Owner: Commerce BI lead. Steward: Governance analyst.
- DQ: freshness < 4h; order_id unique 100%; revenue ≥ 0; customer_id null < 0.1%.
- Fields: revenue (sum of order_line_amount_after_discounts), order_count (distinct order_id), aov (revenue / order_count).
- Access: all analysts read; finance full read; PII masked.
- Tags: domain=commerce, status=certified.
- Version: 1.0.0. Change log: 1.0.0 initial release.
Exercise ex2 — Choose which dataset to certify
You have two similar datasets: Sales_Fast (fresh every hour, missing discounts) and Sales_Full (daily, includes discounts and returns, with field docs). Decide which to certify and justify.
Instructions
- Compare against the checklist: completeness, documentation, lineage, and DQ.
- List risks if you certify the wrong one.
- Propose mitigation: can you improve the other or deprecate it?
Expected output
1–2 paragraphs stating the chosen dataset, rationale tied to criteria, and a small deprecation/upgrade plan.
Sample solution
Choose Sales_Full for certification: it includes discounts and returns, matches finance, and is documented. Risk of certifying Sales_Fast: it omits discounts, so AOV and margin would be inconsistent with finance. Plan: certify Sales_Full v1.0.0; open a freshness-improvement task; tag Sales_Fast as deprecated with a redirect note.
Self-checklist before you certify
- Purpose clearly stated and scoped
- Owner/steward identified
- Field-level docs for top 10 fields
- Lineage captured end-to-end
- Automated DQ tests in place
- Access policy defined
- Performance validated
- Version + change log written
- Monitoring and alerts configured
Practical projects
- Project 1: Convert one widely used but inconsistent dataset into a certified “Gold” version with docs, tests, and tags.
- Project 2: Implement a deprecation campaign for duplicate KPI datasets; publish redirects and migration notes.
- Project 3: Build a small dashboard that surfaces dataset health: freshness, DQ pass rate, usage count, and last change.
Learning path
- Before: brush up on data modeling and semantic layers; practice writing precise metric definitions.
- During: implement DQ tests and monitoring; practice versioning and change logs.
- After: introduce row/column security; expand to domain-based certification (Finance, Sales, Marketing).
Next steps
- Complete the exercises above and save your notes.
- Pick one real dataset this week and push it through the full workflow.
Mini challenge
Draft a 3-sentence announcement for a newly certified dataset that tells users what it is, what changed, and how to migrate (if needed). Keep it plain language and precise.