luvv to helpDiscover the Best Free Online Tools
Topic 5 of 8

PII Handling And Masking

Learn PII Handling And Masking for free with explanations, exercises, and a quick test (for Data Engineer).

Published: January 8, 2026 | Updated: January 8, 2026

Who this is for

Data Engineers, Analytics Engineers, and platform-focused developers who design pipelines, build data models, or expose datasets to analysts and apps.

Prerequisites

  • Basic SQL (SELECT, JOIN, views)
  • Familiarity with roles/permissions in your warehouse or database
  • Understanding of tables vs. views, and column data types

Why this matters

Real tasks you will face:

  • Expose customer data to analysts without revealing emails, phone numbers, or government IDs.
  • Move sensitive tables between environments (prod → staging) without leaking PII.
  • Enable dedup/attribution using hashed identifiers instead of raw PII.
  • Log and monitor pipelines while preventing PII from appearing in logs and error payloads.

Done well, masking reduces breach impact, speeds compliance reviews, and lets teams self-serve safely.

Concept explained simply

PII (Personally Identifiable Information) is any data that could identify a person (e.g., name + address, email, phone, national ID). Handling PII means discovering it, classifying it, limiting its use, and protecting it in motion and at rest.

Mental model
  • Classify: What is sensitive? To whom?
  • Minimize: Keep only what you need, as short as possible, for as little time as possible.
  • Protect: Obscure data using masking, tokenization, hashing, or encryption.
  • Control: Grant access by role; log and review usage.
  • Prove: Document rules; make them testable and automatable.

Core principles and terminology

  • Static masking: Permanently transforming data in a copy (e.g., masked sandbox export).
  • Dynamic masking: On-the-fly transformation at query time, often role-aware.
  • Tokenization: Replace a value with a surrogate token; mapping stored securely elsewhere; reversible for specific services.
  • Hashing: One-way transformation for matching/joins (e.g., SHA-256 with salt). Not reversible; good for analytics joins without revealing PII.
  • Encryption: Reversible with keys; protects at rest/in transit. Access requires key management.
  • Format-preserving techniques: Keep value shape (e.g., xxxx-xxxx-xxxx-1234) to avoid breaking downstream schemas/validations.
  • Data minimization: Drop, truncate, or generalize data you don’t need (e.g., keep only email domain, not full address).
  • Least privilege: Grant only columns/rows needed. Build masked views for broad roles; allow unmasked only for a few.
  • Auditability: Log who accessed sensitive columns; review periodically.

Worked examples

Example 1 — Mask emails and phones in a view

Goal: Show analysts a customer table with masked contact info while keeping domain and last 4 digits.

-- Source table
-- customers(id INT, full_name TEXT, email TEXT, phone TEXT, ssn TEXT)

-- Masking logic (dialect-agnostic pseudo-SQL)
CREATE OR REPLACE VIEW customers_masked AS
SELECT
  id,
  full_name,
  /* keep first letter + domain */
  CONCAT(SUBSTRING(email, 1, 1), '***', SUBSTRING(email, POSITION('@' IN email))) AS email_masked,
  /* mask all but last 4 digits */
  CONCAT(REPEAT('x', GREATEST(LENGTH(phone) - 4, 0)), RIGHT(phone, 4)) AS phone_masked,
  /* show only last4 of SSN */
  CONCAT('***-**-', RIGHT(ssn, 4)) AS ssn_masked
FROM customers;

Tip: If your database lacks POSITION/REPEAT/RIGHT, adapt with equivalent functions.

Example 2 — Tokenize card numbers with a mapping table

Goal: Replace card numbers with random tokens in analytics while allowing a secure service to detokenize if needed.

-- Token map keeps no raw PAN; use a strong salted hash as a lookup key.
-- card_token_map(pan_hash TEXT UNIQUE, token TEXT UNIQUE, created_at TIMESTAMP)

-- Ingest step (pseudo-SQL):
INSERT INTO card_token_map(pan_hash, token, created_at)
SELECT
  SHA256(CONCAT(salt_secret, pan)) AS pan_hash,
  GENERATE_UUID() AS token,
  CURRENT_TIMESTAMP
FROM new_cards nc
WHERE NOT EXISTS (
  SELECT 1 FROM card_token_map t WHERE t.pan_hash = SHA256(CONCAT(salt_secret, nc.pan))
);

-- Analytics table uses token instead of PAN
-- transactions_secure(..., pan_token TEXT)

Store salt_secret and any detokenization logic in a secure service, not in analyst-accessible SQL.

Example 3 — Privacy-preserving joins with salted hashes

Goal: Join events and CRM by user without exposing email.

-- Create a consistent salted hash in both systems
-- crm_users(email_hash TEXT)
-- web_events(email_hash TEXT)

-- Hash creation (pseudo-SQL)
UPDATE crm_users SET email_hash = SHA256(CONCAT(env_salt, LOWER(TRIM(email))));
UPDATE web_events SET email_hash = SHA256(CONCAT(env_salt, LOWER(TRIM(email))));

-- Join on email_hash
SELECT e.session_id, c.user_id
FROM web_events e
JOIN crm_users c USING (email_hash);

Use environment-specific salts to prevent cross-environment linkage.

How to apply in pipelines

  1. Discover and classify: Tag columns as PII, Sensitive, or Public.
  2. Choose a protection method per column: mask, hash, tokenize, encrypt, or drop.
  3. Implement access layers: masked views for most roles; raw only for a few audited roles.
  4. Test with sample datasets: verify format, uniqueness, and referential integrity remain valid.
  5. Automate: codify rules (e.g., SQL views, transformation jobs); add checks in CI for new PII columns.
  6. Review: rotate salts/keys periodically; review audit logs and role memberships.

Exercises

These are open to everyone. Note: the quick test is available to everyone; only logged-in users have progress saved.

Exercise 1 — Create a masked view for analysts

Using the table below, create a view that masks email and phone and shows only the last 4 of ssn.

-- customers(id INT, full_name TEXT, email TEXT, phone TEXT, ssn TEXT)
INSERT INTO customers VALUES
(1,'Ada Lovelace','ada@example.com','15551234567','123-45-6789'),
(2,'Grace Hopper','grace@navy.mil','15557654321','987-65-4321');

-- Requirements:
-- email_masked: keep first letter + domain (a***@example.com)
-- phone_masked: xxxxxx4567
-- ssn_masked: ***-**-6789
  • Deliverable: CREATE VIEW customers_masked AS ...
  • Check: Query the view and verify masked formats.

Exercise 2 — Build a tokenization mapping for credit cards

Create a mapping that replaces raw card numbers with tokens for analytics.

-- cards(card_id INT, pan TEXT)
INSERT INTO cards VALUES (1,'4111111111111111'),(2,'4000000000000002');

-- Create table: card_token_map(pan_hash TEXT UNIQUE, token TEXT UNIQUE)
-- Create view: cards_tokenized(card_id INT, pan_token TEXT)
-- Use SHA256(salt || pan) as lookup key; token can be a generated UUID.
  • Deliverable: mapping table + a tokenized view that exposes pan_token only.
  • Check: Each PAN maps to one token; same PAN → same token; different PANs → different tokens.
Need a checklist?
  • Masked outputs keep expected format.
  • No raw PII appears in views.
  • Hashing uses a non-empty salt.
  • Tokens are unique and non-guessable.
  • Joins still work where intended.

Common mistakes and self-checks

  • Masking that breaks joins: Keep stable keys (hashes/tokens) for joins; don’t mask join keys unless replaced with a consistent surrogate.
  • Forgetting logs: Ensure ETL logs, error messages, and dashboards do not print raw PII. Self-check: search for email/phone patterns in logs.
  • Unsalted hashing: Always include a salt to resist rainbow-table attacks. Self-check: confirm salts differ per environment.
  • Leaking through derived fields: Domains, ZIP+4, or birthdates can re-identify users. Self-check: aggregate or generalize where possible.
  • Too-broad access: Review role memberships quarterly. Self-check: attempt to select raw columns using a normal analyst role.

Practical projects

  • Build a PII classification catalog: scan schemas and tag columns; export a report with recommended protections.
  • Create a dynamic masking layer: role-aware views that show raw, redacted, or aggregated values based on user role.
  • Design a sandbox anonymization job: generate a masked copy for dev/staging with deterministic hashing + format-preserving masks.

Learning path

  1. Identify PII: classify columns in your top 5 datasets.
  2. Implement masking: build at least one masked view per dataset.
  3. Add hashing/tokenization: replace PII in analytics-facing tables.
  4. Secure access: update roles/permissions and add audit queries.
  5. Harden logs and tests: scrub PII in logs and sample data used for tests.

Next steps

  • Add automated checks that fail pipelines when new PII columns lack a masking rule.
  • Rotate salts/keys on a schedule; document how to rehash/retokenize safely.
  • Proceed to the quick test to confirm understanding.

Mini challenge

Take one high-risk table in your environment and produce two artifacts: (1) a masked view for analysts, (2) a one-page policy describing which columns are masked, hashed, tokenized, or dropped, and why. Have a peer review for both usability and privacy risk.

Practice Exercises

2 exercises to complete

Instructions

Given customers(id, full_name, email, phone, ssn), create a view customers_masked that:

  • email_masked: keep the first letter and domain (a***@example.com)
  • phone_masked: replace all but last 4 digits with x (e.g., xxxxxx4567)
  • ssn_masked: show ***-**-last4 (e.g., ***-**-6789)

Insert sample rows and verify outputs match the formats. Use your SQL dialect; adapt functions as needed.

Expected Output
A query to customers_masked returns masked contact information while preserving useful parts (email domain, last 4 digits). No raw PII appears.

PII Handling And Masking — Quick Test

Test your knowledge with 10 questions. Pass with 70% or higher.

10 questions70% to pass

Have questions about PII Handling And Masking?

AI Assistant

Ask questions about this tool