Why this matters
As a Data Platform Engineer, you enable secure analytics while protecting sensitive data. You will:
- Classify and tag PII fields (names, emails, SSNs) in data warehouses and lakes.
- Apply masking and tokenization to let analysts work with data safely.
- Enforce role-based access with policies that automatically hide or transform sensitive columns.
- Support compliance (GDPR/CCPA/PCI) with auditable, repeatable controls.
- Design pipelines so that production logs, customer events, and ML features avoid exposing raw PII.
Concept explained simply
PII tagging, masking, and tokenization are three complementary techniques for controlling sensitive data:
- Tagging/classification: Label columns or fields as PII types (e.g., email, phone) so policies know what to protect.
- Masking: Hide or transform values so they are readable but not revealing. Examples: partial email (j***@domain), nulling, or format-preserving obfuscation. Can be static (permanent) or dynamic (at query time, based on user role).
- Tokenization: Replace a value with a token. The true value is stored in a secure vault. Deterministic tokens let you join on tokens without seeing raw PII.
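The masking/tokenization distinction can be made concrete with a short sketch. This is illustrative only (the key and helper names are hypothetical; in practice the key would come from a KMS or secret manager):

```python
import hmac
import hashlib

# Hypothetical key for demonstration; never hardcode real keys.
KEY = b"demo-key"

def mask_email(email: str) -> str:
    # Masking: readable but not revealing (first char + *** + domain).
    local, _, domain = email.partition("@")
    return f"{local[:1]}***@{domain}"

def tokenize(value: str) -> str:
    # Tokenization: opaque, deterministic substitute; same input -> same token,
    # so tokenized columns can still be joined across datasets.
    return "tk_" + hmac.new(KEY, value.lower().encode(), hashlib.sha256).hexdigest()[:16]

print(mask_email("jane@example.com"))  # j***@example.com
print(tokenize("jane@example.com") == tokenize("JANE@example.com"))  # True
```

Note that the token is unreadable while the masked value stays human-recognizable; that trade-off drives which control you pick per column.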
Mental model
Think of a 4-step loop:
- Discover sensitive fields.
- Tag them with consistent metadata.
- Protect using masking, tokenization, and access policies.
- Monitor access and data drift; fix gaps.
Key terms
- PII (Personally Identifiable Information): Data that can identify a person (directly or indirectly).
- Dynamic masking: Masking applied at query-time depending on user privileges.
- Static masking: Persistent transformation written to storage.
- Tokenization: Substitute real value with a token and keep a reversible mapping in a secure store.
- Pseudonymization: Replacing identifiers with pseudonyms that can be reversed under strict controls.
Worked examples
Example 1 — Dynamic masking in a cloud warehouse
Goal: Mask emails for ANALYST role, show full for ADMIN.
-- Conceptual SQL (adjust to your warehouse syntax)
-- 1) Tag the column
ALTER TABLE analytics.users MODIFY COLUMN email SET TAG pii_type = 'email';
-- 2) Create a masking policy
CREATE MASKING POLICY mask_email AS (val STRING) RETURNS STRING ->
  CASE WHEN CURRENT_ROLE() IN ('ADMIN') THEN val
       WHEN val IS NULL THEN NULL
       ELSE CONCAT(SUBSTR(val, 1, 1), '***', SUBSTR(val, POSITION('@' IN val)))
  END;
-- 3) Apply policy to tagged column
ALTER TABLE analytics.users MODIFY COLUMN email SET MASKING POLICY mask_email;
Result: Analysts see j***@example.com; Admins see the full address.
Example 2 — Mask via a view in PostgreSQL
Goal: Analysts query a view with masked data; Secured role can query base table.
-- Base table
CREATE TABLE public.customers (
  customer_id BIGSERIAL PRIMARY KEY,
  full_name   TEXT,
  email       TEXT,
  phone       TEXT
);
-- Masking function (partial email)
CREATE OR REPLACE FUNCTION mask_email(e TEXT) RETURNS TEXT AS $$
  SELECT CASE WHEN e IS NULL THEN NULL
              ELSE SUBSTR(e, 1, 1) || '***' || SUBSTR(e, POSITION('@' IN e))
         END;
$$ LANGUAGE SQL IMMUTABLE;
-- Schema and masked view for analysts
CREATE SCHEMA IF NOT EXISTS secure;
CREATE OR REPLACE VIEW secure.v_customers_masked AS
SELECT customer_id,
       full_name,
       mask_email(email) AS email,
       '(***) ***-' || RIGHT(phone, 4) AS phone
FROM public.customers;
-- Grants (assumes the analyst and admin roles already exist)
REVOKE ALL ON public.customers FROM analyst;
GRANT SELECT ON secure.v_customers_masked TO analyst;
GRANT SELECT ON public.customers TO admin;
Result: Analysts query the view and see masked fields by default.
Example 3 — Deterministic tokenization for joins
Goal: Replace email with a deterministic token to enable joins without revealing the original.
# Python for concept demonstration (do not hardcode secrets)
import hmac
import hashlib

SECRET_KEY = b"stored-in-kms-or-secret-manager"  # load from a KMS/secret manager in practice

def token_for_email(email: str) -> str:
    # Deterministic token using HMAC-SHA256 over the normalized email
    mac = hmac.new(SECRET_KEY, email.strip().lower().encode(), hashlib.sha256).hexdigest()
    return f"tk_{mac[:20]}"  # shortened for storage

# Use token_for_email in an ingestion job; store tokens instead of raw emails.
# Keep a raw-to-token mapping only if you truly need reversibility; otherwise avoid storing raw values.
Design tips:
- Keep secrets in a managed KMS/secret manager; rotate regularly.
- If reversibility is required, store mappings in a hardened vault with strict access logging.
- Use deterministic tokens when you need to join across datasets; use non-deterministic for stronger privacy where joins are not required.
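To make the deterministic vs. non-deterministic choice concrete, here is a minimal sketch of a non-deterministic tokenizer with a reversible mapping. The `TokenVault` class and its in-memory dict are illustrative stand-ins for a hardened vault service:

```python
import secrets

class TokenVault:
    """Toy non-deterministic tokenizer; the dict stands in for a hardened vault."""

    def __init__(self):
        self._token_to_raw = {}

    def tokenize(self, value: str) -> str:
        # A fresh random token per call: the same input yields different tokens,
        # so tokens cannot be correlated or joined across datasets.
        token = "ntk_" + secrets.token_hex(16)
        self._token_to_raw[token] = value
        return token

    def detokenize(self, token: str) -> str:
        # In a real system, reversal requires strict authorization and full access logging.
        return self._token_to_raw[token]

vault = TokenVault()
t1 = vault.tokenize("jane@example.com")
t2 = vault.tokenize("jane@example.com")
# t1 != t2, yet both reverse to the original value via the vault.
```

The stronger privacy comes precisely from losing joinability: pick this variant only where analytics never needs to match tokens across tables.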
How to implement (step-by-step)
- Inventory data sources: list tables, columns, and sample values.
- Define PII taxonomy: email, phone, government_id, address, ip_address, cookie_id, etc.
- Tag columns consistently in your catalog/warehouse metadata.
- Choose controls:
- Dynamic masking for analysts and BI users.
- Static masking for sharing datasets externally.
- Tokenization for logs, event streams, and ML feature stores.
- Implement role-based policies; default to least privilege.
- Add audits: log policy hits, denied queries, and access to vaults.
- Test with realistic cases and edge values (NULLs, short strings, international formats).
- Document: what is protected, how, and who can request access.
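The edge-case testing step above can be sketched in Python. The masking helper below is illustrative (not from any specific library) and shows how to handle NULLs, empty strings, values without an '@', and international input rather than assuming clean data:

```python
def mask_email(email):
    # Handle NULL/None and malformed values instead of assuming clean input.
    if email is None:
        return None
    local, sep, domain = email.partition("@")
    if not sep:  # no '@': mask everything except the first character
        return (email[:1] + "***") if email else ""
    return f"{local[:1]}***@{domain}"

# Edge values every masking function should be exercised against:
cases = [None, "", "a", "x@y", "jane@example.com", "名前@example.jp"]
for c in cases:
    print(repr(c), "->", repr(mask_email(c)))
```

Running a table like `cases` against each masking function before rollout catches the failures (NULL propagation, short strings, non-ASCII) that only surface in production otherwise.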
Common mistakes and self-check
- Only masking in BI, leaving raw exports unprotected. Fix: enforce policies at the warehouse/lake layer.
- Inconsistent tagging (email vs EmailAddress). Fix: standardize a controlled vocabulary.
- Hardcoding secrets in code. Fix: use a secret manager and rotate keys.
- Using non-deterministic tokens when analytical joins are needed (or vice versa). Fix: align tokenization with use cases.
- Forgetting backups and logs: raw PII may leak there. Fix: protect snapshots and logs, or tokenize upstream.
Self-check checklist
- Every PII column is tagged with an agreed label.
- Default analyst role cannot access raw PII.
- There is an auditable path to request elevated access.
- Secrets are stored and rotated via a managed service.
- Backups/logs are masked or tokenized consistently.
Exercises
Work through these practical tasks. Solutions are provided, but try each one first.
Exercise 1 — Tagging and protection plan
Given two tables:
users(user_id, full_name, email, phone, created_at)
transactions(tx_id, user_id, card_last4, ip_address, amount, tx_ts)
Task:
- Decide PII tags for each column.
- Propose masking/tokenization strategy per column for ANALYST vs ADMIN roles.
- Write 3-5 policy rules in plain language.
Solution
Example tags:
- users.full_name: pii_type=name
- users.email: pii_type=email
- users.phone: pii_type=phone
- transactions.card_last4: pci_type=card_last4 (partial PAN)
- transactions.ip_address: pii_type=ip_address
Masking/tokenization:
- full_name: mask to first initial + last name initial for ANALYST; full for ADMIN.
- email: dynamic mask for ANALYST; full for ADMIN.
- phone: show last 4 only for ANALYST.
- card_last4: allow as-is for ANALYST (already partial); the full PAN is never stored.
- ip_address: tokenized deterministically for joins; full for ADMIN if justified.
Policy rules:
- Columns tagged pii_type=email use mask_email for non-ADMIN roles.
- Columns tagged pii_type=phone show last 4 digits for non-ADMIN roles.
- ip_address is tokenized upstream; raw access limited to SECURITY role.
- card_last4 is visible to ANALYST; full PAN is never stored.
- All access is logged; exceptions require ticket approval.
Exercise 2 — Implement a masking policy or view
Task: Write SQL to mask users.email for ANALYST while showing full for ADMIN. Use either a masking policy (if supported) or a masked view.
Solution
-- Option A: Masking policy (conceptual; adjust to your warehouse syntax)
CREATE MASKING POLICY mask_email AS (val STRING) RETURNS STRING ->
  CASE WHEN CURRENT_ROLE() IN ('ADMIN') THEN val
       WHEN val IS NULL THEN NULL
       ELSE REGEXP_REPLACE(val, '(^.).*@', '\\1***@')
  END;
ALTER TABLE analytics.users MODIFY COLUMN email SET MASKING POLICY mask_email;
-- Option B: View (CURRENT_ROLE() syntax varies by database)
CREATE OR REPLACE VIEW analytics.v_users_masked AS
SELECT user_id,
       full_name,
       CASE WHEN CURRENT_ROLE() IN ('ADMIN') THEN email
            ELSE REGEXP_REPLACE(email, '(^.).*@', '\\1***@')
       END AS email,
       phone,
       created_at
FROM analytics.users;
Completion checklist
- You assigned consistent tags to all PII columns.
- You wrote at least one executable policy or view.
- You tested both roles (ANALYST, ADMIN) and verified outputs.
Practical projects
- Warehouse policy pack: implement reusable masking policies for email, phone, name, and IP, driven by column tags.
- Tokenizer microservice: build a minimal API that issues deterministic tokens and stores reversible mappings in a hardened table protected by row-level ACLs.
- Catalog integration: write a job that scans schemas, auto-detects likely PII (regex + heuristics), applies tags, and generates a report of uncovered columns.
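For the catalog-integration project, the regex-plus-heuristics detection can start from a sketch like this. The patterns and hint list are illustrative starting points, not production-grade detectors, and every suggestion should go through human review:

```python
import re

# Illustrative value patterns; real detectors need broader formats and review.
PII_PATTERNS = {
    "email": re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s]+$"),
    "ssn": re.compile(r"^\d{3}-\d{2}-\d{4}$"),
    "ip_address": re.compile(r"^\d{1,3}(\.\d{1,3}){3}$"),
}

# Column-name hints complement value-based matching.
NAME_HINTS = {"email", "phone", "ssn", "ip", "address", "name"}

def suggest_tag(column_name, sample_values):
    lowered = column_name.lower()
    for hint in NAME_HINTS:
        if hint in lowered:
            return f"review:{hint}"  # name-based hit: flag for human review
    for tag, pattern in PII_PATTERNS.items():
        if sample_values and all(pattern.match(v) for v in sample_values if v):
            return tag  # all sampled values match this PII pattern
    return None

print(suggest_tag("contact", ["jane@example.com", "bob@corp.io"]))  # email
```

A job built on this would scan each schema, call `suggest_tag` per column with a value sample, write confirmed tags back to the catalog, and report columns that returned None as uncovered.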
Who this is for
Data Platform, Data Engineering, Analytics Engineering, and Security Engineering roles that manage data access and compliance.
Prerequisites
- Comfort with SQL and database roles/grants.
- Basic understanding of data catalogs and column metadata.
- Familiarity with secrets management (KMS/secret manager).
Learning path
- Learn PII categories and compliance basics.
- Practice tagging in your warehouse/catalog.
- Apply dynamic masking and validate with role changes.
- Implement tokenization in ingestion jobs.
- Add auditing and monitoring for policy coverage.
Next steps
- Automate classification with patterns and human review.
- Integrate masking/tokenization into CI checks for new columns.
- Set up regular audits to ensure drift doesn't expose PII.
Mini challenge
Pick one dataset with at least 5 tables. Tag all PII, implement masking for 3 columns, tokenize 1 column needed for joins, and produce an audit log showing role-based differences for the same query.
Quick Test
Take the quick test to check your understanding.