Why this matters
As a Data Architect, you turn messy, duplicate-prone data into reliable structures that feed analytics, applications, and governance. Normalization gives you the rules to remove anomalies (insert, update, delete problems), reduce redundancy, and produce logical models that scale. You will use this when designing source system schemas, standardizing conceptual models into logical ones, and preparing stable layers before denormalized marts.
- Real tasks: designing OLTP schemas, reviewing vendor data models, mapping conceptual entities to 3NF, and approving naming/keys/relationships.
- Impact: fewer bugs, easier changes, smaller storage, simpler lineage, and predictable joins.
Concept explained simply
Normalization is a set of rules that ensure each fact lives in one place and depends on the right key.
Mental model
Imagine putting each unique fact on a labeled card. Cards are grouped by what uniquely identifies them (their key). If a card depends on another card (like City depends on ZipCode), you split it so each group is defined by its own key. The result: no duplicates, no contradictory facts.
The rules (practical version)
- 1NF: No repeating groups; attributes are atomic; each field has a single value per row.
- 2NF: If a table has a composite key, non-key attributes must depend on the whole key, not part of it (no partial dependency).
- 3NF: Non-key attributes must not depend on other non-key attributes (no transitive dependency).
- BCNF (bonus): Every determinant is a candidate key. Use when odd dependencies remain even after 3NF.
Common anomalies you avoid
- Insert anomaly: Cannot add a course until a student enrolls.
- Update anomaly: Changing a product name requires updating many rows.
- Delete anomaly: Deleting the last order also deletes the only record of a customer.
Worked examples
Example 1 — Student enrollments to 3NF
Raw table (EnrollmentRaw):
StudentID, StudentName, CourseID, CourseName, Grade
Issues: StudentName depends only on StudentID. CourseName depends only on CourseID. Grade depends on the pair (StudentID, CourseID). Partial and transitive dependencies exist.
Normalize it
- 1NF: Already atomic.
- 2NF: Split attributes that depend on part of the composite key.
- Students(StudentID PK, StudentName)
- Courses(CourseID PK, CourseName)
- Enrollments(StudentID FK, CourseID FK, Grade, PK: StudentID+CourseID)
- 3NF: No non-key attribute depends on another non-key attribute in any table. We are done.
Example 2 — Orders with customer data
Raw table (OrderLinesRaw):
OrderID, OrderDate, CustomerID, CustomerName, ProductID, ProductName, Qty
Dependencies: OrderID → OrderDate, CustomerID; CustomerID → CustomerName; ProductID → ProductName; Key for lines is (OrderID, ProductID).
Normalize it
- Orders(OrderID PK, OrderDate, CustomerID FK)
- Customers(CustomerID PK, CustomerName)
- Products(ProductID PK, ProductName)
- OrderLines(OrderID FK, ProductID FK, Qty, PK: OrderID+ProductID)
Now CustomerName changes in one place. No non-key depends on another non-key in the same table.
Example 3 — Employees, departments, projects
Raw table (AssignmentsRaw):
EmpID, EmpName, DeptID, DeptName, ProjectID, ProjectName, Role
Given: EmpID → EmpName, DeptID; DeptID → DeptName; ProjectID → ProjectName. Key is (EmpID, ProjectID).
Normalize it
- Employees(EmpID PK, EmpName, DeptID FK)
- Departments(DeptID PK, DeptName)
- Projects(ProjectID PK, ProjectName)
- Assignments(EmpID FK, ProjectID FK, Role, PK: EmpID+ProjectID)
How to apply this step-by-step
- List candidate keys and functional dependencies (FDs).
- Check 1NF: remove repeating groups; split multi-valued fields.
- Check 2NF: if key is composite, remove attributes depending on part of the key.
- Check 3NF: remove attributes that depend on other non-key attributes.
- Consider BCNF if some determinant is not a candidate key.
- Name tables by entity, choose stable natural/surrogate keys, and define FK relationships.
Self-check checklist
- Each non-key attribute depends on the whole key and nothing but the key.
- No attribute contains lists (commas, arrays) or repeated groups.
- Updates to a business fact happen in one table only.
- Joins reflect real relationships (1:1, 1:N, N:M via bridges).
Exercises
Complete these and compare with the solutions in the collapsible sections of each exercise on this page. Your progress in the quick test is available to everyone; only logged-in users get saved progress.
- Exercise 1: Normalize EnrollmentRaw(StudentID, StudentName, CourseID, CourseName, Grade) to 3NF. Specify tables, keys, and FKs.
- Exercise 2: For ProductSales(ProductID, ProductName, CategoryName, CategoryManager, OrderID, OrderDate, Qty) with key (OrderID, ProductID), identify partial and transitive dependencies, then propose 3NF tables.
- Exercise 3: Given FDs for Assignment(EmpID, EmpName, DeptID, DeptName, ProjectID, ProjectName, Role) and key (EmpID, ProjectID), produce a 3NF schema and list FDs satisfied by each table.
Pre-submission checklist
- Did you state functional dependencies before splitting tables?
- Did you keep business-friendly names and stable keys?
- Did you avoid embedding lists and repeated groups?
- Did you make many-to-many relationships explicit with bridge tables?
Common mistakes and how to catch them
- Forgetting the composite key: leads to 2NF violations. Fix by checking each attribute against the whole key.
- Hiding lists in strings: “Tags = red,blue,green” is not 1NF. Create a bridge table.
- Over-normalizing analytics outputs: keep source/OLTP in 3NF, then design star schemas for analytics separately when needed.
- Using unstable natural keys: if ProductName changes, do not use it as PK; use a surrogate and keep the natural key as a unique attribute.
Quick self-audit
- Pick any attribute and ask: which key determines it? If answer is not a key, fix the design.
- Simulate an update: how many rows change? If more than one for a single fact, there is redundancy.
Practical projects
- Normalize a small CRM: Leads, Accounts, Contacts, Activities. Deliver ERD, 3NF tables, and sample insert statements.
- Catalog-to-order flow: Products, Categories, Customers, Orders, OrderLines. Show FDs and justify all keys.
- Learning platform: Students, Courses, Instructors, Enrollments, Assignments, Grades. Include bridge tables for many-to-many relations.
Who this is for
- Aspiring and current Data Architects structuring OLTP and integration layers.
- Data Engineers designing staging and core layers before dimensional models.
- Analysts migrating from spreadsheets to relational models.
Prerequisites
- Basic relational database knowledge (tables, keys, foreign keys).
- Comfort reading simple ER diagrams and understanding one-to-many and many-to-many relationships.
Learning path
- Conceptual entities and relationships.
- Keys and functional dependencies.
- Normalization (1NF → 3NF, BCNF).
- Integrity constraints and naming conventions.
- From logical (3NF) to physical design (indexes, data types).
- Purposeful denormalization when needed (analytics, performance).
Mini challenge
Given Table: ShipmentsRaw(ShipmentID, ShipmentDate, WarehouseID, WarehouseCity, OrderID, CustomerID, CustomerName). Identify the minimal set of tables in 3NF and their keys, then write one sentence explaining which anomaly you prevented.
When you are ready, take the quick test at the end of this page.