How to learn Data Classification And PII Tagging for Data Governance And Stewardship in Data Architect for free

Who this is for

Data Architects defining governance standards.
Data Engineers implementing catalogs, masking, and access policies.
Analysts who need to handle PII safely in reports and models.
Product and Privacy champions coordinating with security/legal teams.

Prerequisites

Basic SQL and understanding of tables, columns, and views.
Familiarity with your organization's roles/groups (e.g., analyst, finance, support).
High-level awareness of privacy concepts (consent, retention). Legal details vary by region—coordinate with your privacy team.

Why this matters

As a Data Architect, you must keep data useful for the business while protecting people’s privacy. Clear classification and PII tagging enable:

Automatic masking of sensitive columns for non-privileged users.
Safer sharing with vendors by excluding high-risk fields.
Faster incident response and data discovery during audits.
Consistent retention and deletion workflows across systems.

Real tasks you will do:

Define labels like Public, Internal, Confidential, Restricted and apply them to datasets and columns.
Tag PII types (direct identifiers, quasi-identifiers, sensitive) and required controls (masking, encryption, retention).
Design policies that propagate tags through pipelines and derived datasets.

Concept explained simply

Data classification is grouping data by risk if exposed. PII tagging marks data that can identify a person.

Mental model

Imagine every dataset and column wears a luggage tag. The tag says how delicate it is (sensitivity), what’s inside (PII type), who owns it, and how to handle it (mask, encrypt, keep for X days). Systems and people read the tag to treat the data correctly.

Common labels and PII types

Sensitivity labels:
- Public – safe to share widely.
- Internal – business-only, low risk if leaked.
- Confidential – non-public with moderate risk (e.g., revenue by customer).
- Restricted – high risk, includes PII/financial/health data.
PII categories:
- Direct identifiers – uniquely identify a person (name, email, phone, SSN, national ID).
- Quasi-identifiers – identify in combination (ZIP, birthdate, gender, IP address, device ID).
- Sensitive special categories – e.g., health, biometrics, precise location, ethnicity (definitions vary by jurisdiction).

Core components of a robust tagging scheme

Required tags (key:value):
- sensitivity: public|internal|confidential|restricted
- pii: true|false
- pii_type: none|direct|quasi|sensitive
- regulatory_scope: none|gdpr|ccpa|hipaa|coppa|other (choose what applies)
- owner: team-or-person
- lawful_basis_or_purpose: e.g., contract, consent, fraud_prevention (if tracked)
- retention_period_days: integer (e.g., 365)
- encryption_required: true|false
- masking_policy: none|partial|hash|tokenize|aggregate_only
- access_scope: roles or groups allowed to view unmasked
Apply at two levels:
- Dataset/table level – default sensitivity and retention.
- Column level – precise PII type and masking rules.
Propagation rules:
- Downstream datasets inherit the most restrictive tag among inputs unless explicitly downgraded with justification.
- Aggregations that remove possibility of re-identification may be reclassified after review.

Worked examples

Example 1: Email event logs

Schema: event_time, user_id, email_address, ip_address, campaign_id

Table tags: sensitivity=restricted, pii=true, retention_period_days=365, owner=marketing_analytics
Column tags:
- email_address: pii_type=direct, masking_policy=partial (e.g., a*****@domain.com)
- ip_address: pii_type=quasi, masking_policy=partial or hash
- user_id: pii_type=quasi (if linkable to person), masking_policy=hash
- campaign_id: pii=false, sensitivity=internal
Reasoning: Direct contact info and linkable identifiers warrant Restricted classification.

Example 2: E‑commerce orders

Schema: order_id, user_id, shipping_name, shipping_address, sku, price, payment_last4

Table: sensitivity=restricted, pii=true, regulatory_scope=gdpr|ccpa (as applicable), retention_period_days=1095
Columns:
- shipping_name, shipping_address: pii_type=direct, masking_policy=tokenize
- payment_last4: pii_type=quasi, masking_policy=none or retain as-is (not full PAN)
- sku, price: pii=false, sensitivity=internal
Derived table: daily_revenue_by_sku (no user fields) – can be reclassified to internal, pii=false after review.

Example 3: Support transcripts (unstructured)

Fields: ticket_id, created_at, channel, transcript_text

Table: sensitivity=restricted, pii=true
transcript_text: pii_type=possible_mixed (unknown until scanned) – set masking_policy=redact_matches
Process: run periodic DLP scanning; tag findings_count and last_scan_at; route high-risk samples to human review.

Step-by-step: Design your classification and PII tagging

List data assets. Inventory tables, views, files, and key columns.
Pick a simple label set. Public, Internal, Confidential, Restricted.
Define PII types. direct, quasi, sensitive (plus optional possible_mixed for unstructured).
Agree on required tags. Owner, retention, masking, regulatory_scope.
Create decision rules. Examples: any direct identifier → Restricted; derived aggregates with no re-identification risk → Internal.
Apply tags to a pilot domain. Document exceptions with justification.
Automate propagation. Ensure pipelines copy tags forward and prevent downgrades without review.
Monitor and iterate. Scan for drift, update labels when schemas change.

Implementation patterns (technology-agnostic)

Catalog-first: Store tags centrally; sync to warehouse columns and files.
Column comments or metadata: Encode key:value tags in metadata fields if native tags are limited.
Masking policies: Define role-based masking (partial, hash, tokenize) for tagged columns.
Data contracts: Include tags in schema definitions checked in version control.
DLP scanning: Regex+ML scanning for free‑text fields; attach findings as metadata.
Lineage-aware rules: If any input is Restricted, mark output Restricted unless an approved transformation removes risk.

Exercises

These mirror the exercises below. Try them here, then compare with the solutions.

Exercise 1: Classify a payments table

Table payments_raw columns: payment_id, user_id, full_name, email, card_token, card_bin, amount, currency, created_at, ip_address

Task: Assign sensitivity, pii, pii_type, masking_policy, retention_period_days at table and column level.
Goal: Produce a concise mapping for each column.

Show hints

Direct identifiers: full_name, email.
Quasi: user_id (if linkable), ip_address, card_bin.
Token vs PAN: a properly generated card_token is not a PAN but may be sensitive.

Show solution

See the full solution in the Exercises section below.

Exercise 2: Define masking by role

Roles: finance_analyst, support_agent, data_scientist, admin. Table customers columns: customer_id, email, phone, city, birthdate, ltv

Task: Decide which roles see unmasked values for each column. Propose a generic masking policy for others.

Show hints

finance_analyst needs ltv; support_agent may need contact info.
city can be Internal; birthdate is quasi and often masked.

Show solution

See the full solution in the Exercises section below.

Checklist:
- I labeled the table and each column.
- I chose masking aligned to risk.
- I set a retention period with an owner.

Common mistakes and self-check

Only tagging tables, not columns. Self-check: Are direct identifiers explicitly tagged?
Over-classifying harmless aggregates. Self-check: Can any person be re-identified? If not, consider lowering sensitivity.
Ignoring lineage. Self-check: Do derived views inherit the strictest upstream tag by default?
No retention set. Self-check: Does each Restricted table have retention_period_days?
One-size masking. Self-check: Do roles with legitimate need have unmasked access while others see masked?

Practical projects

Create a tag dictionary for your org and apply it to 10 critical tables.
Implement automatic tag propagation in one ETL/ELT pipeline.
Add masking rules to two high-risk columns and validate with role-based queries.
Run a DLP scan on a text field and add findings_count to the table metadata.

Learning path

Start: Data classification and PII tagging (this lesson).
Next: Access control and masking policies.
Then: Lineage and metadata management.
Follow-up: Retention and deletion workflows.
Advanced: Privacy-preserving analytics (k-anonymity, differential privacy — conceptual grounding).

Mini challenge

You produce a weekly product_usage_agg by user_id with counts of feature clicks. Marketing wants user_id replaced so they can share the file with a partner. What tagging and transformation do you propose?

Show one good answer

Transform: Replace user_id with a salted hash or random token specific to the partner and time-bounded.
Tags: table sensitivity=confidential, pii=true, pii_type=quasi (tokenized), masking_policy=hash; add regulatory_scope as applicable.
Note: Provide only necessary fields, remove small cohorts to reduce re-identification risk, set explicit retention.

Next steps

Complete the exercises and compare with solutions.
Take the quick test below to check your understanding.
Pick one production dataset and apply the tag set this week.

Quick test

Everyone can take the test for free. Only logged-in users have their progress saved.

There are multiple-choice questions. Aim for at least 70%.
Use this as a learning tool—review explanations if you miss an item.

Menu

Data Classification And PII Tagging

Table of Contents