
PII Tagging, Masking, and Tokenization

Learn PII tagging, masking, and tokenization for free with explanations, exercises, and a quick test (for Data Platform Engineers).

Published: January 11, 2026 | Updated: January 11, 2026

Why this matters

As a Data Platform Engineer, you enable secure analytics while protecting sensitive data. You will:

  • Classify and tag PII fields (names, emails, SSNs) in data warehouses and lakes.
  • Apply masking and tokenization to let analysts work with data safely.
  • Enforce role-based access with policies that automatically hide or transform sensitive columns.
  • Support compliance (GDPR/CCPA/PCI) with auditable, repeatable controls.
  • Design pipelines so that production logs, customer events, and ML features avoid exposing raw PII.

Concept explained simply

PII tagging, masking, and tokenization are three complementary controls for sensitive data:

  • Tagging/classification: Label columns or fields as PII types (e.g., email, phone) so policies know what to protect.
  • Masking: Hide or transform values so they are readable but not revealing. Examples: partial email (j***@domain), nulling, or format-preserving obfuscation. Can be static (permanent) or dynamic (at query time, based on user role).
  • Tokenization: Replace a value with a token. The true value is stored in a secure vault. Deterministic tokens let you join on tokens without seeing raw PII.
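
The three controls can be sketched together in a few lines of Python: a tag registry decides whether a field is masked, tokenized, or passed through. The column names, tags, and key below are illustrative assumptions, not a real catalog or KMS:

```python
import hmac, hashlib

# Illustrative tag registry: column name -> PII tag. In practice this
# lives in your data catalog or warehouse metadata, not in code.
TAGS = {"email": "pii_email", "phone": "pii_phone", "amount": None}

SECRET_KEY = b"demo-key-use-a-secret-manager"  # assumption: fetched from KMS

def mask_email(value: str) -> str:
    # Keep the first character and the domain; hide the rest of the local part.
    local, _, domain = value.partition("@")
    return f"{local[:1]}***@{domain}"

def tokenize(value: str) -> str:
    # Deterministic token via HMAC-SHA256: equal inputs yield equal tokens,
    # so downstream joins on the token still work.
    mac = hmac.new(SECRET_KEY, value.encode(), hashlib.sha256).hexdigest()
    return f"tk_{mac[:20]}"

def protect(column: str, value):
    if value is None:
        return None
    tag = TAGS.get(column)
    if tag == "pii_email":
        return mask_email(value)
    if tag == "pii_phone":
        return tokenize(value)
    return value  # untagged columns pass through unchanged
```

The dispatch on tags, not on column names, is the point: once a column is tagged, the right protection follows automatically.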

Mental model

Think of a 4-step loop:

  1. Discover sensitive fields.
  2. Tag them with consistent metadata.
  3. Protect using masking, tokenization, and access policies.
  4. Monitor access and data drift; fix gaps.

Key terms

  • PII (Personally Identifiable Information): Data that can identify a person (directly or indirectly).
  • Dynamic masking: Masking applied at query-time depending on user privileges.
  • Static masking: Persistent transformation written to storage.
  • Tokenization: Substitute real value with a token and keep a reversible mapping in a secure store.
  • Pseudonymization: Replacing identifiers with pseudonyms that can be reversed under strict controls.

Worked examples

Example 1 — Dynamic masking in a cloud warehouse

Goal: Mask emails for ANALYST role, show full for ADMIN.

-- Conceptual SQL (adjust to your warehouse syntax)
-- 1) Tag the column
ALTER TABLE analytics.users MODIFY COLUMN email SET TAG pii_type = 'email';

-- 2) Create a masking policy
CREATE MASKING POLICY mask_email AS (val STRING) RETURNS STRING ->
  CASE WHEN CURRENT_ROLE() IN ('ADMIN') THEN val
       WHEN val IS NULL THEN NULL
       ELSE REGEXP_REPLACE(val, '(^.).*@', '\\1***@')
  END;

-- 3) Apply policy to tagged column
ALTER TABLE analytics.users MODIFY COLUMN email SET MASKING POLICY mask_email;

Result: Analysts see j***@le.com; Admins see the full email.
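
Before deploying a policy like this, it helps to prototype the CASE logic locally and exercise the edge values the SQL must handle. This is a plain-Python sketch of the same rule (role names and the mask shape mirror the SQL above; it is not warehouse code):

```python
def apply_mask_policy(value, role):
    """Mirror the masking policy: ADMIN sees raw, others see j***@domain."""
    if role == "ADMIN":
        return value
    if value is None:
        return None
    local, at, domain = value.partition("@")
    if not at:
        # Not email-shaped: hide entirely rather than leak a malformed value.
        return "***"
    return f"{local[:1]}***@{domain}"
```

Running it against NULLs, one-character strings, and values with no @ catches exactly the cases that break naive SUBSTR/REPEAT arithmetic.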

Example 2 — Mask via a view in PostgreSQL

Goal: Analysts query a view with masked data; Secured role can query base table.

-- Base table
CREATE TABLE public.customers(
  customer_id BIGSERIAL PRIMARY KEY,
  full_name   TEXT,
  email       TEXT,
  phone       TEXT
);

-- Masking function (partial email)
CREATE OR REPLACE FUNCTION mask_email(e TEXT) RETURNS TEXT AS $$
  SELECT CASE WHEN e IS NULL THEN NULL
              ELSE SUBSTR(e, 1, 1) || '***' || SUBSTR(e, POSITION('@' IN e)) END;
$$ LANGUAGE SQL IMMUTABLE;

-- Masked view for analysts
CREATE SCHEMA IF NOT EXISTS secure;
CREATE OR REPLACE VIEW secure.v_customers_masked AS
SELECT customer_id,
       full_name,
       mask_email(email) AS email,
       '(***) ***-' || RIGHT(phone, 4) AS phone
FROM public.customers;

-- Grants (roles analyst and admin must already exist)
REVOKE ALL ON public.customers FROM analyst;
GRANT USAGE ON SCHEMA secure TO analyst;
GRANT SELECT ON secure.v_customers_masked TO analyst;
GRANT SELECT ON public.customers TO admin;

Result: Analysts query the view and see masked fields by default.

Example 3 — Deterministic tokenization for joins

Goal: Replace email with a deterministic token to enable joins without revealing the original.

# Pseudo-Python for concept demonstration (do not hardcode secrets)
import hmac, hashlib

SECRET_KEY = b"stored-in-kms-or-secret-manager"

def token_for_email(email: str) -> str:
    # Deterministic token using HMAC-SHA256
    mac = hmac.new(SECRET_KEY, email.strip().lower().encode(), hashlib.sha256).hexdigest()
    return f"tk_{mac[:20]}"  # shorten for storage

# Use token_for_email in an ingestion job, store tokens instead of raw emails.
# Keep raw-to-token mapping only if you truly need reversibility; otherwise avoid storing raw.

Design tips:

  • Keep secrets in a managed KMS/secret manager; rotate regularly.
  • If reversibility is required, store mappings in a hardened vault with strict access logging.
  • Use deterministic tokens when you need to join across datasets; use non-deterministic for stronger privacy where joins are not required.
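
For the non-deterministic case, a common shape is a random token plus a vault-side reversible mapping. The sketch below uses an in-memory dict purely as a stand-in for a hardened, access-logged vault table:

```python
import secrets

class TokenVault:
    """Toy vault: random tokens with a reversible mapping.

    Assumption: in production the mapping lives in a hardened store with
    strict access logging; this dict only illustrates the concept.
    """

    def __init__(self):
        self._by_token = {}

    def tokenize(self, value: str) -> str:
        # Non-deterministic: a fresh random token on every call, so tokens
        # from different datasets cannot be joined or correlated.
        token = "tk_" + secrets.token_hex(10)
        self._by_token[token] = value
        return token

    def detokenize(self, token: str) -> str:
        # Reversal should be rare, privileged, and audited.
        return self._by_token[token]
```

Tokenizing the same value twice yields two different tokens; that is the privacy gain, and also exactly why analytical joins stop working, which is the trade-off the tip above describes.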

How to implement (step-by-step)

  1. Inventory data sources: list tables, columns, and sample values.
  2. Define PII taxonomy: email, phone, government_id, address, ip_address, cookie_id, etc.
  3. Tag columns consistently in your catalog/warehouse metadata.
  4. Choose controls:
    • Dynamic masking for analysts and BI users.
    • Static masking for sharing datasets externally.
    • Tokenization for logs, event streams, and ML feature stores.
  5. Implement role-based policies; default to least privilege.
  6. Add audits: log policy hits, denied queries, and access to vaults.
  7. Test with realistic cases and edge values (NULLs, short strings, international formats).
  8. Document: what is protected, how, and who can request access.
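
Steps 1-3 can be bootstrapped with a pattern scan over sample values. The patterns below are deliberately rough assumptions for illustration; real classifiers combine regexes, column-name heuristics, and human review:

```python
import re

# Rough illustrative patterns -- tune and review before trusting them.
# Order matters: SSN is checked before phone, since an SSN-shaped string
# would also satisfy the looser phone pattern.
PII_PATTERNS = {
    "email": re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s]+$"),
    "ssn":   re.compile(r"^\d{3}-\d{2}-\d{4}$"),
    "phone": re.compile(r"^\+?[\d\s().-]{7,15}$"),
}

def suggest_tag(samples, threshold=0.8):
    """Suggest a PII tag if most non-null samples match one pattern."""
    values = [s for s in samples if s]
    if not values:
        return None
    for tag, pattern in PII_PATTERNS.items():
        hits = sum(1 for v in values if pattern.match(v))
        if hits / len(values) >= threshold:
            return tag
    return None
```

The threshold keeps a single odd value from blocking a tag, while the human-review step catches false positives before a tag drives an actual policy.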

Common mistakes and self-check

  • Only masking in BI, leaving raw exports unprotected. Fix: enforce policies at the warehouse/lake layer.
  • Inconsistent tagging (email vs EmailAddress). Fix: standardize a controlled vocabulary.
  • Hardcoding secrets in code. Fix: use a secret manager and rotate keys.
  • Using non-deterministic tokens when analytical joins are needed (or vice versa). Fix: align tokenization with use cases.
  • Forgetting backups and logs: raw PII may leak there. Fix: protect snapshots and logs, or tokenize upstream.

Self-check checklist

  • Every PII column is tagged with an agreed label.
  • Default analyst role cannot access raw PII.
  • There is an auditable path to request elevated access.
  • Secrets are stored and rotated via a managed service.
  • Backups/logs are masked or tokenized consistently.
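
A coverage check for the last two items can be as simple as diffing tagged columns against columns that have an active policy. The table and column names below are illustrative stand-ins for what catalog and warehouse metadata queries would return:

```python
# Illustrative inputs: in practice, query your catalog for tags and your
# warehouse for attached masking/tokenization policies.
tagged = {
    ("analytics.users", "email"): "pii_email",
    ("analytics.users", "phone"): "pii_phone",
    ("analytics.events", "ip_address"): "pii_ip",
}
protected = {
    ("analytics.users", "email"),
    ("analytics.users", "phone"),
}

def coverage_gaps(tagged, protected):
    """Return tagged PII columns that have no protection policy attached."""
    return sorted(col for col in tagged if col not in protected)

# Each gap is a (table, column) pair that needs a policy or a documented
# exception -- run this on a schedule to catch drift.
gaps = coverage_gaps(tagged, protected)
```

Scheduling this diff is the "monitor" step of the 4-step loop: new columns get tagged, and any tag without a policy surfaces as a gap.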

Exercises

Work through these practical tasks. Solutions are available, but try first.

Exercise 1 — Tagging and protection plan

Given two tables:

users(user_id, full_name, email, phone, created_at)
transactions(tx_id, user_id, card_last4, ip_address, amount, tx_ts)

Task:

  • Decide PII tags for each column.
  • Propose masking/tokenization strategy per column for ANALYST vs ADMIN roles.
  • Write 3-5 policy rules in plain language.

Solution

Example tags:

  • users.full_name: pii_type=name
  • users.email: pii_type=email
  • users.phone: pii_type=phone
  • transactions.card_last4: pci_partial
  • transactions.ip_address: pii_type=ip_address

Masking/tokenization:

  • full_name: mask to first initial + last name initial for ANALYST; full for ADMIN.
  • email: dynamic mask for ANALYST; full for ADMIN.
  • phone: show last 4 only for ANALYST.
  • card_last4: allow as-is for ANALYST (already partial); the full PAN is never stored.
  • ip_address: tokenized deterministically for joins; full for ADMIN if justified.

Policy rules:

  1. Columns tagged pii_type=email use mask_email for non-ADMIN roles.
  2. Columns tagged pii_type=phone show last 4 digits for non-ADMIN roles.
  3. ip_address is tokenized upstream; raw access limited to SECURITY role.
  4. card_last4 is visible to ANALYST; full PAN is never stored.
  5. All access is logged; exceptions require ticket approval.

Exercise 2 — Implement a masking policy or view

Task: Write SQL to mask users.email for ANALYST while showing full for ADMIN. Use either a masking policy (if supported) or a masked view.


Solution

-- Option A: Masking policy (conceptual)
CREATE MASKING POLICY mask_email AS (val STRING) RETURNS STRING ->
  CASE WHEN CURRENT_ROLE() IN ('ADMIN') THEN val
       WHEN val IS NULL THEN NULL
       ELSE REGEXP_REPLACE(val, '(^.).*@', '\\1***@') END;
ALTER TABLE analytics.users MODIFY COLUMN email SET MASKING POLICY mask_email;

-- Option B: View
CREATE OR REPLACE VIEW analytics.v_users_masked AS
SELECT user_id,
       full_name,
       CASE WHEN CURRENT_ROLE() IN ('ADMIN') THEN email
            ELSE REGEXP_REPLACE(email, '(^.).*@', '\\1***@') END AS email,
       phone,
       created_at
FROM analytics.users;

Completion checklist

  • You assigned consistent tags to all PII columns.
  • You wrote at least one executable policy or view.
  • You tested both roles (ANALYST, ADMIN) and verified outputs.

Practical projects

  • Warehouse policy pack: implement reusable masking policies for email, phone, name, and IP, driven by column tags.
  • Tokenizer microservice: build a minimal API that issues deterministic tokens and stores reversible mappings in a hardened table protected by row-level ACLs.
  • Catalog integration: write a job that scans schemas, auto-detects likely PII (regex + heuristics), applies tags, and generates a report of uncovered columns.

Who this is for

Data Platform, Data Engineering, Analytics Engineering, and Security Engineering roles that manage data access and compliance.

Prerequisites

  • Comfort with SQL and database roles/grants.
  • Basic understanding of data catalogs and column metadata.
  • Familiarity with secrets management (KMS/secret manager).

Learning path

  1. Learn PII categories and compliance basics.
  2. Practice tagging in your warehouse/catalog.
  3. Apply dynamic masking and validate with role changes.
  4. Implement tokenization in ingestion jobs.
  5. Add auditing and monitoring for policy coverage.

Next steps

  • Automate classification with patterns and human review.
  • Integrate masking/tokenization into CI checks for new columns.
  • Set up regular audits to ensure drift doesn't expose PII.

Mini challenge

Pick one dataset with at least 5 tables. Tag all PII, implement masking for 3 columns, tokenize 1 column needed for joins, and produce an audit log showing role-based differences for the same query.

Quick Test

Take the quick test to check your understanding: 8 questions, 70% or higher to pass.

