Who this is for
- Data Architects defining governance standards.
- Data Engineers implementing catalogs, masking, and access policies.
- Analysts who need to handle PII safely in reports and models.
- Product and Privacy champions coordinating with security/legal teams.
Prerequisites
- Basic SQL and understanding of tables, columns, and views.
- Familiarity with your organization's roles/groups (e.g., analyst, finance, support).
- High-level awareness of privacy concepts (consent, retention). Legal details vary by region—coordinate with your privacy team.
Why this matters
As a Data Architect, you must keep data useful for the business while protecting people’s privacy. Clear classification and PII tagging enable:
- Automatic masking of sensitive columns for non-privileged users.
- Safer sharing with vendors by excluding high-risk fields.
- Faster incident response and data discovery during audits.
- Consistent retention and deletion workflows across systems.
Real tasks you will do:
- Define labels like Public, Internal, Confidential, Restricted and apply them to datasets and columns.
- Tag PII types (direct identifiers, quasi-identifiers, sensitive) and required controls (masking, encryption, retention).
- Design policies that propagate tags through pipelines and derived datasets.
Concept explained simply
Data classification is grouping data by risk if exposed. PII tagging marks data that can identify a person.
Mental model
Imagine every dataset and column wears a luggage tag. The tag says how delicate it is (sensitivity), what’s inside (PII type), who owns it, and how to handle it (mask, encrypt, keep for X days). Systems and people read the tag to treat the data correctly.
Common labels and PII types
- Sensitivity labels:
- Public – safe to share widely.
- Internal – business-only, low risk if leaked.
- Confidential – non-public with moderate risk (e.g., revenue by customer).
- Restricted – high risk, includes PII/financial/health data.
- PII categories:
- Direct identifiers – uniquely identify a person (name, email, phone, SSN, national ID).
- Quasi-identifiers – identify in combination (ZIP, birthdate, gender, IP address, device ID).
- Sensitive special categories – e.g., health, biometrics, precise location, ethnicity (definitions vary by jurisdiction).
Core components of a robust tagging scheme
- Required tags (key:value):
- sensitivity: public|internal|confidential|restricted
- pii: true|false
- pii_type: none|direct|quasi|sensitive
- regulatory_scope: none|gdpr|ccpa|hipaa|coppa|other (choose what applies)
- owner: team-or-person
- lawful_basis_or_purpose: e.g., contract, consent, fraud_prevention (if tracked)
- retention_period_days: integer (e.g., 365)
- encryption_required: true|false
- masking_policy: none|partial|hash|tokenize|aggregate_only
- access_scope: roles or groups allowed to view unmasked
- Apply at two levels:
- Dataset/table level – default sensitivity and retention.
- Column level – precise PII type and masking rules.
- Propagation rules:
- Downstream datasets inherit the most restrictive tag among inputs unless explicitly downgraded with justification.
- Aggregations that remove possibility of re-identification may be reclassified after review.
Worked examples
Example 1: Email event logs
Schema: event_time, user_id, email_address, ip_address, campaign_id
- Table tags: sensitivity=restricted, pii=true, retention_period_days=365, owner=marketing_analytics
- Column tags:
- email_address: pii_type=direct, masking_policy=partial (e.g., a*****@domain.com)
- ip_address: pii_type=quasi, masking_policy=partial or hash
- user_id: pii_type=quasi (if linkable to person), masking_policy=hash
- campaign_id: pii=false, sensitivity=internal
- Reasoning: Direct contact info and linkable identifiers warrant Restricted classification.
Example 2: E‑commerce orders
Schema: order_id, user_id, shipping_name, shipping_address, sku, price, payment_last4
- Table: sensitivity=restricted, pii=true, regulatory_scope=gdpr|ccpa (as applicable), retention_period_days=1095
- Columns:
- shipping_name, shipping_address: pii_type=direct, masking_policy=tokenize
- payment_last4: pii_type=quasi, masking_policy=none or retain as-is (not full PAN)
- sku, price: pii=false, sensitivity=internal
- Derived table: daily_revenue_by_sku (no user fields) – can be reclassified to internal, pii=false after review.
Example 3: Support transcripts (unstructured)
Fields: ticket_id, created_at, channel, transcript_text
- Table: sensitivity=restricted, pii=true
- transcript_text: pii_type=possible_mixed (unknown until scanned) – set masking_policy=redact_matches
- Process: run periodic DLP scanning; tag findings_count and last_scan_at; route high-risk samples to human review.
Step-by-step: Design your classification and PII tagging
- List data assets. Inventory tables, views, files, and key columns.
- Pick a simple label set. Public, Internal, Confidential, Restricted.
- Define PII types. direct, quasi, sensitive (plus optional possible_mixed for unstructured).
- Agree on required tags. Owner, retention, masking, regulatory_scope.
- Create decision rules. Examples: any direct identifier → Restricted; derived aggregates with no re-identification risk → Internal.
- Apply tags to a pilot domain. Document exceptions with justification.
- Automate propagation. Ensure pipelines copy tags forward and prevent downgrades without review.
- Monitor and iterate. Scan for drift, update labels when schemas change.
Implementation patterns (technology-agnostic)
- Catalog-first: Store tags centrally; sync to warehouse columns and files.
- Column comments or metadata: Encode key:value tags in metadata fields if native tags are limited.
- Masking policies: Define role-based masking (partial, hash, tokenize) for tagged columns.
- Data contracts: Include tags in schema definitions checked in version control.
- DLP scanning: Regex+ML scanning for free‑text fields; attach findings as metadata.
- Lineage-aware rules: If any input is Restricted, mark output Restricted unless an approved transformation removes risk.
Exercises
These mirror the exercises below. Try them here, then compare with the solutions.
Exercise 1: Classify a payments table
Table payments_raw columns: payment_id, user_id, full_name, email, card_token, card_bin, amount, currency, created_at, ip_address
- Task: Assign sensitivity, pii, pii_type, masking_policy, retention_period_days at table and column level.
- Goal: Produce a concise mapping for each column.
Show hints
- Direct identifiers: full_name, email.
- Quasi: user_id (if linkable), ip_address, card_bin.
- Token vs PAN: a properly generated card_token is not a PAN but may be sensitive.
Show solution
See the full solution in the Exercises section below.
Exercise 2: Define masking by role
Roles: finance_analyst, support_agent, data_scientist, admin. Table customers columns: customer_id, email, phone, city, birthdate, ltv
- Task: Decide which roles see unmasked values for each column. Propose a generic masking policy for others.
Show hints
- finance_analyst needs ltv; support_agent may need contact info.
- city can be Internal; birthdate is quasi and often masked.
Show solution
See the full solution in the Exercises section below.
- Checklist:
- I labeled the table and each column.
- I chose masking aligned to risk.
- I set a retention period with an owner.
Common mistakes and self-check
- Only tagging tables, not columns. Self-check: Are direct identifiers explicitly tagged?
- Over-classifying harmless aggregates. Self-check: Can any person be re-identified? If not, consider lowering sensitivity.
- Ignoring lineage. Self-check: Do derived views inherit the strictest upstream tag by default?
- No retention set. Self-check: Does each Restricted table have retention_period_days?
- One-size masking. Self-check: Do roles with legitimate need have unmasked access while others see masked?
Practical projects
- Create a tag dictionary for your org and apply it to 10 critical tables.
- Implement automatic tag propagation in one ETL/ELT pipeline.
- Add masking rules to two high-risk columns and validate with role-based queries.
- Run a DLP scan on a text field and add findings_count to the table metadata.
Learning path
- Start: Data classification and PII tagging (this lesson).
- Next: Access control and masking policies.
- Then: Lineage and metadata management.
- Follow-up: Retention and deletion workflows.
- Advanced: Privacy-preserving analytics (k-anonymity, differential privacy — conceptual grounding).
Mini challenge
You produce a weekly product_usage_agg by user_id with counts of feature clicks. Marketing wants user_id replaced so they can share the file with a partner. What tagging and transformation do you propose?
Show one good answer
- Transform: Replace user_id with a salted hash or random token specific to the partner and time-bounded.
- Tags: table sensitivity=confidential, pii=true, pii_type=quasi (tokenized), masking_policy=hash; add regulatory_scope as applicable.
- Note: Provide only necessary fields, remove small cohorts to reduce re-identification risk, set explicit retention.
Next steps
- Complete the exercises and compare with solutions.
- Take the quick test below to check your understanding.
- Pick one production dataset and apply the tag set this week.
Quick test
Everyone can take the test for free. Only logged-in users have their progress saved.
- There are multiple-choice questions. Aim for at least 70%.
- Use this as a learning tool—review explanations if you miss an item.