Why this matters
Master data describes core business entities like Customer, Product, Supplier, and Location. As a Data Architect, you design how these entities are unified across many systems to power analytics, operations, and governance. Done well, teams trust a single version of truth; done poorly, they ship features with duplicates, wrong prices, or broken personalization.
- Unify customer identities across CRM, billing, and ecommerce.
- Standardize product catalogs and hierarchies for pricing and reporting.
- Define keys and survivorship rules so golden records are reliable.
- Model relationships (customer-address, product-category, supplier-items) clearly.
- Enable data quality checks, stewardship workflows, and controlled change.
Concept explained simply
Master data is the set of core nouns of a business (who, what, where) that many processes share. Examples: Customers, Products, Suppliers, Employees, Locations. It changes more slowly than transactions, is reused widely, and needs consistent identifiers.
Key ideas:
- Golden record: the best, most complete view of an entity, merged from sources.
- Authoritative source: the system that should define specific attributes (e.g., CRM for marketing consent, PIM for product specs).
- Identifiers: stable business keys (like SKU, email) and surrogate keys for modeling.
- Matching and merging: rules and algorithms to recognize duplicates and unify them.
- Survivorship: decision rules to pick the final value when sources disagree.
- Reference data: controlled lists used by master data (countries, units, categories).
- Hierarchies and relationships: parent-child and many-to-many links (product taxonomy, customer-household).
- Change handling: decide which attributes overwrite (Type 1) vs. keep history (Type 2).
Mental model
Think of master data as a curated address book for the company. Many people submit contacts (systems), but you maintain one clean card per contact (golden record) with notes on where each detail came from (lineage) and rules on who can update which fields (stewardship).
How to model master data (step-by-step)
1) Define domains and scope
List domains (Customer, Product, Supplier, Location). For each, list sources, consumers, and high-value decisions impacted (pricing, churn, compliance).
2) Identify business keys and identifiers
Capture natural keys (SKU, email, tax ID) and expected issues (nulls, format differences). Add surrogate keys for the logical model.
3) Decide authoritative sources per attribute
Map attributes to their primary source (e.g., product dimensions from PIM, price from ERP). If multiple, set a trust order.
4) Define matching and merging rules
Specify exact keys (SKU), probabilistic matches (name + phone + address), and tie-breakers. Document a minimum match score to auto-merge.
5) Choose change handling (Type 1 vs Type 2)
Overwrite non-critical data (Type 1; e.g., phone number). Track history (Type 2) for regulatory or analytics needs (e.g., address, legal name).
6) Model relationships and hierarchies
Normalize shared entities (Address, Category). Use bridge tables for many-to-many (Party-Address). Represent hierarchies with parent_id or a hierarchy table.
7) Add data quality checks and stewardship fields
Include validation status, quality score, source timestamps, and steward review flags.
Worked examples
Example 1: Customer master across CRM, Billing, Ecommerce
Goal: One consolidated customer.
- Entities: Party, Person, Organization, Identifier, Address, PartyAddress, SourceSystem.
- Business keys: email (if verified), phone, billing account number.
- Matching: exact on billing_account OR fuzzy on (first_name, last_name, normalized_address, phone).
- Survivorship: email from Ecommerce if verified; otherwise CRM. Billing address from Billing. Marketing consent from CRM only.
- Change handling: Name and Address Type 2; Phone Type 1; Consent Type 2 (for audit).
Why this works
Different systems excel in different attributes. Keeping audit and Type 2 for sensitive fields preserves compliance and analytics.
Example 2: Product master with variants
- Entities: Product, ProductVariant, SKU, Category, ProductCategory, Attribute, ProductAttribute.
- Keys: Global Product ID (surrogate), SKU (business), GTIN (business).
- Hierarchy: Category parent_id to build a taxonomy.
- Authoritative sources: PIM for descriptions/specs; ERP for price and inventory.
- Change handling: Specs Type 2 (for warranty), price Type 1 for current operational view plus a price_history table if needed.
Variant modeling note
Keep product-level attributes (brand, model) separate from variant-level attributes (color, size).
Example 3: Address and Location normalization
- Normalize address attributes (street, city, region, postal_code, country_code) and store a standardized version.
- Use reference data for country and region codes.
- Support multiple roles via PartyAddress (billing, shipping, legal).
- Track geocoding status and confidence score.
Why normalize?
Normalized addresses improve match quality and routing, and reduce duplicates from format differences.
Example 4: Supplier onboarding
- Entities: Supplier (Organization), Identifier (tax_id, DUNS), BankAccount (sensitive), Contact.
- Authoritative source: Procurement portal for onboarding data; Finance for payment terms.
- Controls: Steward-reviewed status before supplier becomes active.
Exercises
Complete the exercises below. Then use the checklist to self-review. You can compare your answers using the solution toggles.
Exercise 1: Design a Customer master logical model (ID: ex1)
Company has CRM, Ecommerce, and Billing. Attributes: first_name, last_name, email (verified flag in Ecommerce), phone, billing_account_number, addresses (billing/shipping), marketing_consent, created_at per source. Deliver:
- Entities and relationships (with surrogate keys).
- Business keys and identifiers.
- Attribute-level authoritative source and survivorship.
- Type 1 vs Type 2 choices with reasoning.
- Basic CRUD note: who creates/updates each attribute.
Exercise 2: Define survivorship and matching rules (ID: ex2)
Given three sources for Customer:
- CRM: name, email, consent; email sometimes null.
- Ecommerce: email (verified flag), phone, shipping_address.
- Billing: billing_account_number, billing_address, legal_name.
Write matching rules (exact and fuzzy), survivorship order per attribute, and tie-breakers for conflicts.
Self-check checklist
- Have you separated identifiers from attributes and modeled them explicitly?
- Do you specify authoritative source per attribute and a fallback order?
- Are Type 1/Type 2 decisions tied to compliance and analytics needs?
- Did you include relationship tables for many-to-many links (e.g., Party-Address)?
- Do your matching rules handle edge cases (nicknames, missing email)?
- Is there a status/quality field to control downstream consumption?
Common mistakes and how to self-check
- Mixing transactions into master entities. Self-check: Does the table include events (orders)? If yes, separate them.
- Only one key. Self-check: Do you store both business keys and surrogate keys? You should.
- No explicit survivorship. Self-check: Can you answer “which source wins for phone?” If not, define it.
- Forgetting history. Self-check: Which attributes require Type 2 and why?
- Over-merging customers. Self-check: Do you have confidence thresholds and manual review for ambiguous matches?
- Ignoring reference data. Self-check: Are country/region codes constrained to a list?
Who this is for
Data Architects, Senior Data Engineers, and Analytics Engineers designing unified data models and governance for core entities.
Prerequisites
- Relational modeling (entities, relationships, normalization).
- Basic data quality concepts (validation, profiling).
- Understanding of business processes around customers/products.
Learning path
- Review master data fundamentals and domains.
- Practice identifiers and survivorship rules with small datasets.
- Model hierarchies and relationships for one domain (Product).
- Add Type 1/Type 2 history and stewardship fields.
- Validate with a pilot integration of two sources, then expand.
Practical projects
- Customer 360 Pilot: Build a Customer master from CRM + Ecommerce with match/merge and a golden record view. Acceptance: 98% dedupe precision on a labeled sample.
- Product Catalog Normalization: Consolidate PIM + ERP, with a canonical category hierarchy and SKU attributes. Acceptance: 100% category code validation; no orphan SKUs.
- Address Standardization Service: Normalize and score addresses across domains. Acceptance: 95% addresses standardized; store original and standardized forms.
Next steps
- Do the exercises and compare with the provided solutions.
- Take the quick test to check understanding. Note: The quick test is available to everyone; only logged-in users get saved progress.
- Apply the patterns to your next integration—start with two sources and one domain.
Mini challenge
Your company acquires a brand with its own CRM and PIM. Draft a one-page plan covering: domains in scope, identifier strategy, top 5 survivorship rules, Type 2 fields, and a risk list (e.g., over-merge, consent conflicts). Keep it implementation-agnostic and focused on modeling decisions.