Who this is for
Data Engineers, Analytics Engineers, and platform-focused developers who design pipelines, build data models, or expose datasets to analysts and apps.
Prerequisites
- Basic SQL (SELECT, JOIN, views)
- Familiarity with roles/permissions in your warehouse or database
- Understanding of tables vs. views, and column data types
Why this matters
Real tasks you will face:
- Expose customer data to analysts without revealing emails, phone numbers, or government IDs.
- Move sensitive tables between environments (prod → staging) without leaking PII.
- Enable dedup/attribution using hashed identifiers instead of raw PII.
- Log and monitor pipelines while preventing PII from appearing in logs and error payloads.
Done well, masking reduces breach impact, speeds compliance reviews, and lets teams self-serve safely.
Concept explained simply
PII (Personally Identifiable Information) is any data that could identify a person (e.g., name + address, email, phone, national ID). Handling PII means discovering it, classifying it, limiting its use, and protecting it in motion and at rest.
Mental model
- Classify: What is sensitive? To whom?
- Minimize: Keep only what you need, as short as possible, for as little time as possible.
- Protect: Obscure data using masking, tokenization, hashing, or encryption.
- Control: Grant access by role; log and review usage.
- Prove: Document rules; make them testable and automatable.
Core principles and terminology
- Static masking: Permanently transforming data in a copy (e.g., masked sandbox export).
- Dynamic masking: On-the-fly transformation at query time, often role-aware.
- Tokenization: Replace a value with a surrogate token; mapping stored securely elsewhere; reversible for specific services.
- Hashing: One-way transformation for matching/joins (e.g., SHA-256 with salt). Not reversible; good for analytics joins without revealing PII.
- Encryption: Reversible with keys; protects at rest/in transit. Access requires key management.
- Format-preserving techniques: Keep value shape (e.g., xxxx-xxxx-xxxx-1234) to avoid breaking downstream schemas/validations.
- Data minimization: Drop, truncate, or generalize data you don’t need (e.g., keep only email domain, not full address).
- Least privilege: Grant only columns/rows needed. Build masked views for broad roles; allow unmasked only for a few.
- Auditability: Log who accessed sensitive columns; review periodically.
Worked examples
Example 1 — Mask emails and phones in a view
Goal: Show analysts a customer table with masked contact info while keeping domain and last 4 digits.
-- Source table
-- customers(id INT, full_name TEXT, email TEXT, phone TEXT, ssn TEXT)
-- Masking logic (dialect-agnostic pseudo-SQL)
CREATE OR REPLACE VIEW customers_masked AS
SELECT
id,
full_name,
/* keep first letter + domain */
CONCAT(SUBSTRING(email, 1, 1), '***', SUBSTRING(email, POSITION('@' IN email))) AS email_masked,
/* mask all but last 4 digits */
CONCAT(REPEAT('x', GREATEST(LENGTH(phone) - 4, 0)), RIGHT(phone, 4)) AS phone_masked,
/* show only last4 of SSN */
CONCAT('***-**-', RIGHT(ssn, 4)) AS ssn_masked
FROM customers;
Tip: If your database lacks POSITION/REPEAT/RIGHT, adapt with equivalent functions.
Example 2 — Tokenize card numbers with a mapping table
Goal: Replace card numbers with random tokens in analytics while allowing a secure service to detokenize if needed.
-- Token map keeps no raw PAN; use a strong salted hash as a lookup key.
-- card_token_map(pan_hash TEXT UNIQUE, token TEXT UNIQUE, created_at TIMESTAMP)
-- Ingest step (pseudo-SQL):
INSERT INTO card_token_map(pan_hash, token, created_at)
SELECT
SHA256(CONCAT(salt_secret, pan)) AS pan_hash,
GENERATE_UUID() AS token,
CURRENT_TIMESTAMP
FROM new_cards nc
WHERE NOT EXISTS (
SELECT 1 FROM card_token_map t WHERE t.pan_hash = SHA256(CONCAT(salt_secret, nc.pan))
);
-- Analytics table uses token instead of PAN
-- transactions_secure(..., pan_token TEXT)
Store salt_secret and any detokenization logic in a secure service, not in analyst-accessible SQL.
Example 3 — Privacy-preserving joins with salted hashes
Goal: Join events and CRM by user without exposing email.
-- Create a consistent salted hash in both systems
-- crm_users(email_hash TEXT)
-- web_events(email_hash TEXT)
-- Hash creation (pseudo-SQL)
UPDATE crm_users SET email_hash = SHA256(CONCAT(env_salt, LOWER(TRIM(email))));
UPDATE web_events SET email_hash = SHA256(CONCAT(env_salt, LOWER(TRIM(email))));
-- Join on email_hash
SELECT e.session_id, c.user_id
FROM web_events e
JOIN crm_users c USING (email_hash);
Use environment-specific salts to prevent cross-environment linkage.
How to apply in pipelines
- Discover and classify: Tag columns as PII, Sensitive, or Public.
- Choose a protection method per column: mask, hash, tokenize, encrypt, or drop.
- Implement access layers: masked views for most roles; raw only for a few audited roles.
- Test with sample datasets: verify format, uniqueness, and referential integrity remain valid.
- Automate: codify rules (e.g., SQL views, transformation jobs); add checks in CI for new PII columns.
- Review: rotate salts/keys periodically; review audit logs and role memberships.
Exercises
These are open to everyone. Note: the quick test is available to everyone; only logged-in users have progress saved.
Exercise 1 — Create a masked view for analysts
Using the table below, create a view that masks email and phone and shows only the last 4 of ssn.
-- customers(id INT, full_name TEXT, email TEXT, phone TEXT, ssn TEXT)
INSERT INTO customers VALUES
(1,'Ada Lovelace','ada@example.com','15551234567','123-45-6789'),
(2,'Grace Hopper','grace@navy.mil','15557654321','987-65-4321');
-- Requirements:
-- email_masked: keep first letter + domain (a***@example.com)
-- phone_masked: xxxxxx4567
-- ssn_masked: ***-**-6789
- Deliverable: CREATE VIEW customers_masked AS ...
- Check: Query the view and verify masked formats.
Exercise 2 — Build a tokenization mapping for credit cards
Create a mapping that replaces raw card numbers with tokens for analytics.
-- cards(card_id INT, pan TEXT)
INSERT INTO cards VALUES (1,'4111111111111111'),(2,'4000000000000002');
-- Create table: card_token_map(pan_hash TEXT UNIQUE, token TEXT UNIQUE)
-- Create view: cards_tokenized(card_id INT, pan_token TEXT)
-- Use SHA256(salt || pan) as lookup key; token can be a generated UUID.
- Deliverable: mapping table + a tokenized view that exposes pan_token only.
- Check: Each PAN maps to one token; same PAN → same token; different PANs → different tokens.
Need a checklist?
- Masked outputs keep expected format.
- No raw PII appears in views.
- Hashing uses a non-empty salt.
- Tokens are unique and non-guessable.
- Joins still work where intended.
Common mistakes and self-checks
- Masking that breaks joins: Keep stable keys (hashes/tokens) for joins; don’t mask join keys unless replaced with a consistent surrogate.
- Forgetting logs: Ensure ETL logs, error messages, and dashboards do not print raw PII. Self-check: search for email/phone patterns in logs.
- Unsalted hashing: Always include a salt to resist rainbow-table attacks. Self-check: confirm salts differ per environment.
- Leaking through derived fields: Domains, ZIP+4, or birthdates can re-identify users. Self-check: aggregate or generalize where possible.
- Too-broad access: Review role memberships quarterly. Self-check: attempt to select raw columns using a normal analyst role.
Practical projects
- Build a PII classification catalog: scan schemas and tag columns; export a report with recommended protections.
- Create a dynamic masking layer: role-aware views that show raw, redacted, or aggregated values based on user role.
- Design a sandbox anonymization job: generate a masked copy for dev/staging with deterministic hashing + format-preserving masks.
Learning path
- Identify PII: classify columns in your top 5 datasets.
- Implement masking: build at least one masked view per dataset.
- Add hashing/tokenization: replace PII in analytics-facing tables.
- Secure access: update roles/permissions and add audit queries.
- Harden logs and tests: scrub PII in logs and sample data used for tests.
Next steps
- Add automated checks that fail pipelines when new PII columns lack a masking rule.
- Rotate salts/keys on a schedule; document how to rehash/retokenize safely.
- Proceed to the quick test to confirm understanding.
Mini challenge
Take one high-risk table in your environment and produce two artifacts: (1) a masked view for analysts, (2) a one-page policy describing which columns are masked, hashed, tokenized, or dropped, and why. Have a peer review for both usability and privacy risk.