What this subskill covers
Discoverability and Search is how people find trustworthy data, metrics, dashboards, and pipelines through a metadata catalog. As a Data Architect, you define the metadata to index, how results are ranked, and how lineage, access, and quality shape the search experience.
Note: The Quick Test at the end is available to everyone. Only logged-in users have their progress saved.
Why this matters for a Data Architect
- Reduce duplicated data by guiding teams to certified sources.
- Speed up delivery by making the right table, metric, or owner findable in seconds.
- Power impact analysis by surfacing lineage in search (upstream/downstream).
- Support governance by highlighting sensitivity, retention, and access policies.
Real tasks you will do
- Define ranking rules (trust score + usage + freshness).
- Model synonyms for business terms (e.g., GMV, revenue, sales).
- Enable filters for domain, PII level, certification, data product, freshness.
- Integrate lineage to show top upstream sources and downstream consumers.
- Design access-aware search that hides restricted assets.
Concept explained simply
Think of your data estate like a library with a map. The catalog (metadata) describes each book (dataset, metric, dashboard): title, author, topics, quality, and who can borrow it. The map (lineage) shows where each book came from and which other books reference it.
Search finds candidates. Ranking chooses the best. Facets (filters) narrow results. Lineage and trust signals help users decide.
Mental model
- Index: store searchable fields (names, descriptions, columns, tags, owners, domains, policies, freshness, usage).
- Rank: combine trust (certified, tested), popularity (queries, readers), freshness (last update), and textual relevance.
- Filter: domain, system, sensitivity, SLA, lifecycle (active/deprecated), data product, business area.
- Explain: show why a result ranks high (matched synonyms, certified, heavily used).
- Connect: lineage card with upstream/downstream and owners.
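The index-and-match half of this mental model can be sketched in a few lines. This is a minimal, illustrative sketch assuming a tiny in-memory catalog; `CatalogEntry` and its fields are hypothetical names, not a real catalog API.

```python
from dataclasses import dataclass

@dataclass
class CatalogEntry:
    # Searchable metadata for one asset; all field names are illustrative.
    name: str
    description: str
    tags: list[str]
    owner: str
    certified: bool
    monthly_queries: int
    days_since_update: int

def text_match(entry: CatalogEntry, query: str) -> int:
    # Count query tokens that appear anywhere in the name, description, or tags.
    haystack = " ".join([entry.name, entry.description, *entry.tags]).lower()
    return sum(1 for token in query.lower().split() if token in haystack)
```

A real index would tokenize once at ingest time and support partial matches; the point here is only that matching runs over names, descriptions, and tags together, not names alone.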
Key components and patterns
- Metadata model: datasets, columns, dashboards, metrics, owners, domains, tags, policies (PII, retention), quality status, SLAs, deprecation, glossary terms, data product names.
- Indexing & search: tokenize names and descriptions, support exact matches and partials, optionally add synonyms for business terms. Keep updates incremental.
- Facets & filters: domain, system, sensitivity/PII, certification, freshness (last updated), status (active/deprecated), data product, geography, SLA tier.
- Lineage-powered discovery: show key upstream sources and top downstream consumers; expose impact radius and recent failures.
- Trust & quality: certification badges, test pass rate, incident count, last successful run, owner/steward visibility.
- Access-aware results: only show what the user is allowed to see; never leak sensitive titles or column names across boundaries.
- Curation workflow: allow users to suggest descriptions, owners, and tags; route to stewards for approval.
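Synonym support for business terms, mentioned above, can be as simple as expanding the query with any synonym group it touches. A minimal sketch, assuming a hand-curated synonym map; the groups and term names are hypothetical examples, not a standard vocabulary.

```python
# Hypothetical synonym groups keyed by a label; each group is one
# set of interchangeable single-token terms.
SYNONYMS = {
    "daily_active": {"dau", "daily_active_users"},
    "monthly_active": {"mau", "monthly_active_users"},
    "gmv": {"gmv", "gross_merchandise_value"},
    "revenue": {"revenue", "sales"},
}

def expand_query(query: str) -> set[str]:
    # Return the query tokens plus every synonym group they intersect.
    tokens = set(query.lower().split())
    expanded = set(tokens)
    for group in SYNONYMS.values():
        if tokens & group:
            expanded |= group
    return expanded
```

For example, a search for "dau" would also match assets described as daily_active_users, so certified metrics surface even when users type the abbreviation.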
Worked examples
1) Find the "active users" KPI source
- Search terms: active users, DAU, MAU.
- Synonyms: daily_active_users, dau, monthly_active_users.
- Filters: data type=metric, certification=certified, domain=product analytics.
- Ranking: certified metric with high usage and recent refresh ranks first.
- Lineage: verify metric is computed from event_fact table and standard sessionization logic.
Decision: choose the certified "product.active_users_daily" metric. Confirm owner and SLA from the detail panel.
2) Locate a trustworthy customer field
- Query: customer age.
- Synonyms: age_years, age, consumer_age.
- Filters: sensitivity=PII, domain=marketing, status=active.
- Lineage: check upstream source (CRM) and transformations.
- Trust: certified + tests passing + low incident count.
Decision: select "dim_customer.age_years" from CRM with proper PII handling and governance notes.
3) Triage a broken dashboard
- Query: dashboard name or top metric name.
- Open lineage: trace upstream chain to the first failing job.
- Impact: review downstream count to assess blast radius.
- Owner: contact upstream dataset owner shown in metadata.
Action: prioritize fix based on downstream consumers and SLA tier.
Step-by-step: design a minimal discovery feature
- Define searchable fields: name, description, columns, tags, owners, domain, system, sensitivity, certification, last updated, usage count, status.
- Create synonyms for key business terms (e.g., GMV=Gross Merchandise Value, revenue=sales).
- Choose default facets: domain, certification, freshness, sensitivity, status.
- Set ranking: score = text_relevance + trust_boost + popularity_boost + freshness_boost. Explain the score to users.
- Add lineage snippets: show 2 key upstream datasets and top 3 downstream consumers.
- Access awareness: filter results by the user’s entitlements before ranking.
- Curation flow: allow suggest-edits; stewards approve changes to descriptions and tags.
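The ranking and access-awareness steps above can be sketched together. This is an illustrative implementation with made-up weights and field names; in practice you would tune the weights against real search logs. Note that entitlement filtering happens before ranking, so restricted assets never influence what users see.

```python
def score(entry: dict, text_relevance: float) -> tuple[float, list[str]]:
    # score = text_relevance + trust_boost + popularity_boost + freshness_boost,
    # returned alongside human-readable reasons for explainability.
    reasons = [f"text relevance {text_relevance:.1f}"]
    total = text_relevance
    if entry["certified"]:
        total += 2.0  # trust boost (weight is illustrative)
        reasons.append("certified")
    popularity = min(entry["monthly_queries"] / 1000, 2.0)  # cap the boost
    total += popularity
    reasons.append(f"{entry['monthly_queries']} monthly queries")
    if entry["days_since_update"] <= 1:
        total += 1.0  # freshness boost
        reasons.append("updated in the last day")
    return total, reasons

def access_aware_search(entries, entitlements, text_relevance):
    # Drop anything the user cannot see BEFORE ranking, so restricted
    # assets never leak into result counts or ordering.
    visible = [e for e in entries if e["domain"] in entitlements]
    return sorted(visible, key=lambda e: score(e, text_relevance)[0], reverse=True)
```

Returning the reasons list alongside the score is what makes the "why ranked" explanation in the next tip cheap to display.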
Tip: Explainability matters
Display why a result ranks high: "Matches: description, synonym=DAU; Certified; 1.2k monthly queries; updated 4 hours ago."
Exercises you can do now
Do these before the Quick Test. Mirror them in your environment or on paper.
Exercise 1 — Search plan for refunds in the last 90 days
Goal: Define how search should find a dataset for "orders with refunds in last 90 days".
- Write your initial query and at least 3 synonyms.
- Pick facets to apply (domain, sensitivity, freshness, certification, status).
- Draft a ranking formula with weights for trust, popularity, freshness, and text relevance.
- Describe the lineage hops you expect (e.g., orders → payments → refunds).
- List the decision signals you will show on the result card.
Exercise 2 — Rank and trace: finance_quarterly_revenue
Given three similarly named candidate datasets, propose a ranking approach and a lineage-validation check:
- Propose a scoring breakdown for: certified data mart, highly used but stale table, and fresh but uncertified export.
- Write a 3-step lineage check to validate correctness.
- Define when to down-rank or tag as deprecated.
Checklist for both exercises
- Synonyms cover abbreviations and business terms.
- Facets include trust, sensitivity, and freshness.
- Ranking formula is explainable.
- Lineage validation is concrete (named hops/owners).
- Access rules are considered.
Common mistakes and self-check
- Only matching on names: Include descriptions, columns, tags, and synonyms.
- Ignoring trust: Certification and quality signals must boost ranking.
- Over-exposing sensitive assets: Enforce access-aware filtering before ranking.
- Stale indexes: Plan incremental updates to keep freshness accurate.
- Opaque ranking: Always show why a result ranks high.
Self-check
- Can a new analyst find the canonical revenue metric in under 3 clicks?
- Do restricted datasets disappear entirely for unauthorized users?
- Does each result show owner, last update, and certification at a glance?
- Can you trace upstream sources from any result card?
Practical projects
- MVP Catalog in a spreadsheet: Create sheets for datasets, columns, owners, tags, policies, lineage edges. Use filters as facets. Write a simple scoring column and sort by it. Add a separate sheet that explains the score for the top 10 results.
- Glossary-driven search: Build a two-column list of business terms and synonyms. Apply them to your spreadsheet catalog, then measure how many queries now find certified sources first.
- Lineage blast radius: From a selected table, enumerate all downstream assets and rank them by importance (usage and SLA). Use this to simulate impact analysis.
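The lineage blast-radius project above amounts to a graph walk over downstream edges followed by an importance sort. A minimal sketch, assuming lineage is stored as an asset-to-direct-consumers map; asset names and usage numbers are invented for illustration.

```python
# Hypothetical lineage: each asset maps to its direct downstream consumers.
DOWNSTREAM = {
    "dim_customer": ["fct_orders", "mkt_segments"],
    "fct_orders": ["revenue_dashboard"],
    "mkt_segments": [],
    "revenue_dashboard": [],
}
# Illustrative monthly usage counts, standing in for usage + SLA tier.
USAGE = {"fct_orders": 900, "mkt_segments": 120, "revenue_dashboard": 2400}

def blast_radius(asset: str) -> list[str]:
    # Collect all transitive downstream assets, then rank most-used first.
    seen: set[str] = set()
    stack = list(DOWNSTREAM.get(asset, []))
    while stack:
        node = stack.pop()
        if node not in seen:
            seen.add(node)
            stack.extend(DOWNSTREAM.get(node, []))
    return sorted(seen, key=lambda a: USAGE.get(a, 0), reverse=True)
```

Running this from a selected table gives the prioritized impact list the project asks for: the dashboard with 2,400 monthly queries outranks the lightly used segments table.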
Mini challenge
In one page, design the result card for a dataset search. Include: name, domain, owners, certification, last updated, sensitivity, 2 upstream and 3 downstream assets, and the “why ranked” explanation. Add two facet selections you would pre-apply for a first-time user.
Who this is for
- Data Architects designing catalogs and governance.
- Analytics Engineers and Data Engineers curating datasets and metrics.
- Stewards and Owners responsible for data trust.
Prerequisites
- Basic metadata modeling (entities, fields, tags, ownership).
- Understanding of data lineage (upstream/downstream, jobs, schedules).
- Governance basics (sensitivity, access controls, certification).
Learning path
- Start with Metadata Modeling.
- Learn Lineage Capture and Visualization.
- Add Discoverability and Search (this lesson).
- Advance to Trust Signals and Quality Metrics.
- Finish with Access Controls and Stewardship Workflows.
Next steps
- Complete the exercises and refine your ranking formula.
- Pilot a small catalog with 50–100 assets and measure search success.
- Iterate on facets and synonym coverage based on user feedback.