luvv to helpDiscover the Best Free Online Tools
Topic 7 of 10

Managing PII And Sensitive Data

Learn Managing PII And Sensitive Data for free with explanations, exercises, and a quick test (for Analytics Engineer).

Published: December 23, 2025 | Updated: December 23, 2025

Why this matters

As an Analytics Engineer, you will move data from source systems to models and dashboards. Many datasets contain PII (personally identifiable information) or other sensitive fields (like health, payment, or secrets). Your job is to protect people and the business by collecting the minimum data needed, transforming it into the safest usable form, and limiting access. Good practices prevent data leaks, reduce compliance risk (GDPR/HIPAA/PCI), and build trust with users.

Who this is for

  • Analytics Engineers and BI Developers building pipelines and semantic layers
  • Data Engineers integrating sources into a warehouse
  • Data Analysts who need safe ways to use identifiers for joins

Prerequisites

  • Comfort with SQL (SELECT, JOIN, CASE, CREATE VIEW)
  • Basic knowledge of ETL/ELT pipelines and data warehouses
  • Familiarity with roles/permissions at a high level

Concept explained simply

PII is any data that can identify a person. Direct identifiers (like email, phone, full name, SSN) identify someone immediately. Quasi-identifiers (like birthdate, ZIP, gender) can identify people when combined. Sensitive data also includes health info (PHI), payment data (PCI), credentials, and secrets.

What counts as PII? (open for examples)
  • Direct identifiers: full name, email, phone, government ID, device ID
  • Quasi-identifiers: birth date, ZIP/postcode, exact location, rare job title
  • Other sensitive: health fields, payment card details, API keys, auth tokens

Mental model: minimize, transform, gate.

  • Minimize: do not collect or store what you don’t need.
  • Transform: make data less risky (hash, mask, tokenize, aggregate).
  • Gate: restrict access via roles, views, row-level security, and logs.

Tip: Treat PII like a heavy object with gravity. The more you store and move it, the more risk you pull into your system. Keep it small, still, and contained.

Core patterns for ETL/ELT

  1. Classification at ingestion
    • Tag columns: pii_direct, pii_quasi, sensitive, or safe.
    • Document purpose: why do we store this field?
  2. Data minimization
    • Drop unneeded PII in staging.
    • Truncate granularity (e.g., birth year instead of full date) if sufficient.
  3. Masking (irreversible redaction for display)
    • Show only what’s necessary, e.g., user@example.com → u***@example.com.
    • Use static masking in transformed tables or dynamic masking in views.
  4. Hashing (one-way)
    • Compute a deterministic key for joins without revealing the original (e.g., SHA2(lower(trim(email)) || pepper)).
    • Use a secret pepper/salt managed outside code; do not hardcode.
  5. Tokenization (reversible mapping via secure table)
    • Replace PII with a surrogate token (e.g., uuid) and store original in a restricted lookup table.
    • Analytics uses tokens; only a small group can reverse map when truly needed.
  6. Aggregation and k-anonymity
    • Suppress or generalize groups with small counts (e.g., counts < 10 → Other).
    • Remove unique combinations that could re-identify users.
  7. Access control
    • Store raw PII in a secure schema; expose masked views for most users.
    • Grant least privilege: analysts get views; only a few get base tables.
  8. Auditing and retention
    • Log who accessed sensitive objects (without logging PII values).
    • Set deletion/TTL jobs to remove PII after it’s no longer needed.
  9. Safe operations
    • Never print PII to logs or error messages.
    • Use sample/anonymized data in demos and tests.

Note: Regulations vary by country/company. Work with your security/compliance partners for final decisions.

Worked examples

Example 1 — Deterministic hash for joining on email (no email exposure)
-- 1) Standardize email and create a join key with a secret pepper
-- Assume a runtime parameter :pepper provided by your secure config
CREATE OR REPLACE VIEW dim_customers_safekey AS
SELECT
  customer_id,
  LOWER(TRIM(email)) AS email_std,
  SHA2(CONCAT(LOWER(TRIM(email)), :pepper), 256) AS customer_key_hash
FROM raw.customers;

-- 2) Join another source on the same hashed key
CREATE OR REPLACE VIEW fct_orders_with_customer AS
SELECT o.order_id, o.amount, c.customer_id, c.customer_key_hash
FROM raw.orders o
JOIN (
  SELECT
    LOWER(TRIM(email)) AS email_std,
    SHA2(CONCAT(LOWER(TRIM(email)), :pepper), 256) AS customer_key_hash,
    customer_id
  FROM raw.customers
) c
  ON SHA2(CONCAT(LOWER(TRIM(o.customer_email)), :pepper), 256) = c.customer_key_hash;

Result: You can join accurately without storing or exposing the raw email. Keep the pepper secret outside code.

Example 2 — Tokenization via restricted lookup table
-- 1) Create a restricted lookup table (in secure schema)
CREATE TABLE secure.pii_email_lookup AS
SELECT
  GENERATE_UUID() AS email_token,
  email AS email_plain,
  CURRENT_TIMESTAMP AS created_at
FROM raw.customers;

-- 2) Create an analytics-facing table that only uses tokens
CREATE OR REPLACE VIEW analytics.dim_customers AS
SELECT
  c.customer_id,
  l.email_token,
  c.signup_date
FROM raw.customers c
JOIN secure.pii_email_lookup l
  ON l.email_plain = c.email;

-- Most users get SELECT on analytics.dim_customers, only a few on secure.pii_email_lookup.

Result: Analysts use tokens for joins and analysis. Only authorized users can reverse to emails via the secure table.

Example 3 — Dynamic masking in a view based on a role map
-- 1) Access map: who should see full PII
CREATE OR REPLACE VIEW admin.user_access AS
SELECT 'alice' AS username, 'full' AS access UNION ALL
SELECT 'bi_analyst_1', 'masked' UNION ALL
SELECT 'data_scientist_2', 'masked';

-- 2) Masking view
CREATE OR REPLACE VIEW analytics.v_customers_masked AS
SELECT
  c.customer_id,
  CASE WHEN ua.access = 'full' THEN c.email
       ELSE CONCAT(SUBSTR(c.email,1,1), '***@', SPLIT_PART(c.email,'@',2)) END AS email_display,
  CASE WHEN ua.access = 'full' THEN c.phone
       ELSE CONCAT('***-***-', RIGHT(REGEXP_REPLACE(c.phone,'[^0-9]',''), 4)) END AS phone_display
FROM raw.customers c
LEFT JOIN admin.user_access ua
  ON ua.username = SESSION_USER();

Result: Same view serves full or masked values depending on the current user’s access in the map.

Example 4 — k-anonymity thresholding (suppress small groups)
-- Aggregate and suppress groups smaller than k=10
WITH base AS (
  SELECT age, postal_code, COUNT(*) AS n
  FROM analytics.users
  GROUP BY age, postal_code
)
SELECT
  CASE WHEN n < 10 THEN NULL ELSE age END AS age_bucket,
  CASE WHEN n < 10 THEN 'Other' ELSE postal_code END AS postal_area,
  SUM(n) AS users
FROM base
GROUP BY 1,2;

Result: Prevents revealing unique or rare cohorts.

Hands-on exercises

Practice the core patterns. Solutions are collapsible below each exercise. After completing, take the Quick Test. Everyone can take it for free; if you log in, your progress and attempts are saved.

Exercise 1 — Partial phone masking

Goal: Show only last 4 digits of phone, preserving country code format if present.

  1. Create a SELECT that outputs phone_masked from a column phone.
  2. Keep only digits internally; display as ***-***-1234 for US-like numbers.
Hint
  • Use REGEXP_REPLACE to strip non-digits, RIGHT to take last 4.
  • Build the mask string with CONCAT.
Show solution
SELECT
  phone,
  CONCAT('***-***-', RIGHT(REGEXP_REPLACE(phone, '[^0-9]', ''), 4)) AS phone_masked
FROM raw.customers;

Exercise 2 — Deterministic hashed join key

Goal: Produce customer_key_hash for both customers and orders so they can be joined without exposing email.

  1. Standardize email to lowercase and trim.
  2. Use SHA2 with a secret pepper parameter (:pepper).
  3. Demonstrate a join between raw.orders and raw.customers using the hash.
Hint
  • Apply the exact same function on both sides of the join.
  • Keep the pepper out of code in real life; pass it at runtime.
Show solution
WITH cust AS (
  SELECT customer_id,
         SHA2(CONCAT(LOWER(TRIM(email)), :pepper), 256) AS customer_key_hash
  FROM raw.customers
), ord AS (
  SELECT order_id, amount,
         SHA2(CONCAT(LOWER(TRIM(customer_email)), :pepper), 256) AS customer_key_hash
  FROM raw.orders
)
SELECT o.order_id, o.amount, c.customer_id
FROM ord o
JOIN cust c USING (customer_key_hash);

Exercise 3 — Role-based masking view

Goal: Build a view that shows full dob to "full" users and only birth_year to others.

  1. Create an access map view admin.user_access(username, access).
  2. Create analytics.v_users_dob_masked with CASE using SESSION_USER().
  3. Return columns: user_id, dob_display, birth_year.
Hint
  • Use EXTRACT(YEAR FROM dob) for birth_year.
  • LEFT JOIN access map; default to masked when no match.
Show solution
CREATE OR REPLACE VIEW analytics.v_users_dob_masked AS
SELECT
  u.user_id,
  CASE WHEN COALESCE(ua.access, 'masked') = 'full' THEN TO_CHAR(u.dob)
       ELSE NULL END AS dob_display,
  EXTRACT(YEAR FROM u.dob) AS birth_year
FROM raw.users u
LEFT JOIN admin.user_access ua
  ON ua.username = SESSION_USER();

Self-check checklist

  • I can hash emails with a pepper and join safely.
  • I can create a masking view controlled by roles/users.
  • I can suppress small groups to protect privacy.
  • I do not hardcode secrets in SQL or logs.

Common mistakes and how to self-check

  • Mistake: Hardcoding salts/peppers in code or notebooks. Self-check: Search repo for obvious secrets; move to runtime secrets management.
  • Mistake: Inconsistent standardization before hashing (e.g., case/whitespace). Self-check: Verify that email_std is identical across sources.
  • Mistake: Logging PII in pipeline errors. Self-check: Trigger a controlled error and inspect logs; ensure values are redacted.
  • Mistake: Granting base table access to broad groups. Self-check: List grants; most users should only see masked views.
  • Mistake: Not handling small cohorts. Self-check: Count group sizes; apply a k threshold and retest.
  • Mistake: Retaining PII indefinitely. Self-check: Confirm TTLs or deletion jobs exist and run.

Practical projects

  • Build a privacy-safe customer 360: Use hashed keys for joins, tokenized emails, masked views for analysts, and a k=15 cohort suppression layer for reporting.
  • Design a PII intake pipeline: Classify columns, drop unneeded PII in staging, and generate a data catalog note per column (purpose, retention, masking).
  • Incident drill: Simulate a suspected PII leak in a dev environment; practice revoking access, rotating secrets, validating logs, and adding automated tests to prevent recurrence.

Learning path

  1. Classify and document PII fields in your current sources.
  2. Implement hashing for cross-source joins; remove raw identifiers from analytics tables.
  3. Add masking views and adjust permissions to least privilege.
  4. Add cohort suppression and retention jobs.
  5. Set up monitoring: access logs, schema tests for PII presence, and redaction tests.

Mini challenge

You discover user_phone in a popular dashboard. Only support staff truly need full phones; analysts just need region-level insights.

  • Task 1: Remove raw phone from the dashboard’s dataset; add phone_masked.
  • Task 2: Provide an alternative metric (e.g., country code distribution) that meets analysts’ needs.
  • Task 3: Document who can still see full phone and why.

Next steps

  • Refactor one live dataset to minimize PII this week.
  • Add a unit test that fails when a new PII column appears in analytics schemas.
  • Align with security/compliance on your standard k threshold and retention policies.

Quick Test

Available to everyone for free. Log in to save your progress and see your history.

Practice Exercises

3 exercises to complete

Instructions

Write a SQL SELECT that outputs phone_masked from raw.customers.phone, showing only the last 4 digits in the format ***-***-1234 regardless of input formatting.

Expected Output
Original: +1 (212) 555-0199 -> Masked: ***-***-0199

Managing PII And Sensitive Data — Quick Test

Test your knowledge with 10 questions. Pass with 70% or higher.

10 questions70% to pass

Have questions about Managing PII And Sensitive Data?

AI Assistant

Ask questions about this tool