Why this matters
As an Analytics Engineer, you will move data from source systems to models and dashboards. Many datasets contain PII (personally identifiable information) or other sensitive fields (like health, payment, or secrets). Your job is to protect people and the business by collecting the minimum data needed, transforming it into the safest usable form, and limiting access. Good practices prevent data leaks, reduce compliance risk (GDPR/HIPAA/PCI), and build trust with users.
Who this is for
- Analytics Engineers and BI Developers building pipelines and semantic layers
- Data Engineers integrating sources into a warehouse
- Data Analysts who need safe ways to use identifiers for joins
Prerequisites
- Comfort with SQL (SELECT, JOIN, CASE, CREATE VIEW)
- Basic knowledge of ETL/ELT pipelines and data warehouses
- Familiarity with roles/permissions at a high level
Concept explained simply
PII is any data that can identify a person. Direct identifiers (like email, phone, full name, SSN) identify someone immediately. Quasi-identifiers (like birthdate, ZIP, gender) can identify people when combined. Sensitive data also includes health info (PHI), payment data (PCI), credentials, and secrets.
What counts as PII? (open for examples)
- Direct identifiers: full name, email, phone, government ID, device ID
- Quasi-identifiers: birth date, ZIP/postcode, exact location, rare job title
- Other sensitive: health fields, payment card details, API keys, auth tokens
Mental model: minimize, transform, gate.
- Minimize: do not collect or store what you don’t need.
- Transform: make data less risky (hash, mask, tokenize, aggregate).
- Gate: restrict access via roles, views, row-level security, and logs.
Tip: Treat PII like a heavy object with gravity. The more you store and move it, the more risk you pull into your system. Keep it small, still, and contained.
Core patterns for ETL/ELT
- Classification at ingestion
- Tag columns: pii_direct, pii_quasi, sensitive, or safe.
- Document purpose: why do we store this field?
- Data minimization
- Drop unneeded PII in staging.
- Truncate granularity (e.g., birth year instead of full date) if sufficient.
- Masking (irreversible redaction for display)
- Show only what’s necessary, e.g., user@example.com → u***@example.com.
- Use static masking in transformed tables or dynamic masking in views.
- Hashing (one-way)
- Compute a deterministic key for joins without revealing the original (e.g., SHA2(lower(trim(email)) || pepper)).
- Use a secret pepper/salt managed outside code; do not hardcode.
- Tokenization (reversible mapping via secure table)
- Replace PII with a surrogate token (e.g., uuid) and store original in a restricted lookup table.
- Analytics uses tokens; only a small group can reverse map when truly needed.
- Aggregation and k-anonymity
- Suppress or generalize groups with small counts (e.g., counts < 10 → Other).
- Remove unique combinations that could re-identify users.
- Access control
- Store raw PII in a secure schema; expose masked views for most users.
- Grant least privilege: analysts get views; only a few get base tables.
- Auditing and retention
- Log who accessed sensitive objects (without logging PII values).
- Set deletion/TTL jobs to remove PII after it’s no longer needed.
- Safe operations
- Never print PII to logs or error messages.
- Use sample/anonymized data in demos and tests.
Note: Regulations vary by country/company. Work with your security/compliance partners for final decisions.
Worked examples
Example 1 — Deterministic hash for joining on email (no email exposure)
-- 1) Standardize email and create a join key with a secret pepper
-- Assume a runtime parameter :pepper provided by your secure config
CREATE OR REPLACE VIEW dim_customers_safekey AS
SELECT
customer_id,
LOWER(TRIM(email)) AS email_std,
SHA2(CONCAT(LOWER(TRIM(email)), :pepper), 256) AS customer_key_hash
FROM raw.customers;
-- 2) Join another source on the same hashed key
CREATE OR REPLACE VIEW fct_orders_with_customer AS
SELECT o.order_id, o.amount, c.customer_id, c.customer_key_hash
FROM raw.orders o
JOIN (
SELECT
LOWER(TRIM(email)) AS email_std,
SHA2(CONCAT(LOWER(TRIM(email)), :pepper), 256) AS customer_key_hash,
customer_id
FROM raw.customers
) c
ON SHA2(CONCAT(LOWER(TRIM(o.customer_email)), :pepper), 256) = c.customer_key_hash;
Result: You can join accurately without storing or exposing the raw email. Keep the pepper secret outside code.
Example 2 — Tokenization via restricted lookup table
-- 1) Create a restricted lookup table (in secure schema)
CREATE TABLE secure.pii_email_lookup AS
SELECT
GENERATE_UUID() AS email_token,
email AS email_plain,
CURRENT_TIMESTAMP AS created_at
FROM raw.customers;
-- 2) Create an analytics-facing table that only uses tokens
CREATE OR REPLACE VIEW analytics.dim_customers AS
SELECT
c.customer_id,
l.email_token,
c.signup_date
FROM raw.customers c
JOIN secure.pii_email_lookup l
ON l.email_plain = c.email;
-- Most users get SELECT on analytics.dim_customers, only a few on secure.pii_email_lookup.
Result: Analysts use tokens for joins and analysis. Only authorized users can reverse to emails via the secure table.
Example 3 — Dynamic masking in a view based on a role map
-- 1) Access map: who should see full PII
CREATE OR REPLACE VIEW admin.user_access AS
SELECT 'alice' AS username, 'full' AS access UNION ALL
SELECT 'bi_analyst_1', 'masked' UNION ALL
SELECT 'data_scientist_2', 'masked';
-- 2) Masking view
CREATE OR REPLACE VIEW analytics.v_customers_masked AS
SELECT
c.customer_id,
CASE WHEN ua.access = 'full' THEN c.email
ELSE CONCAT(SUBSTR(c.email,1,1), '***@', SPLIT_PART(c.email,'@',2)) END AS email_display,
CASE WHEN ua.access = 'full' THEN c.phone
ELSE CONCAT('***-***-', RIGHT(REGEXP_REPLACE(c.phone,'[^0-9]',''), 4)) END AS phone_display
FROM raw.customers c
LEFT JOIN admin.user_access ua
ON ua.username = SESSION_USER();
Result: Same view serves full or masked values depending on the current user’s access in the map.
Example 4 — k-anonymity thresholding (suppress small groups)
-- Aggregate and suppress groups smaller than k=10
WITH base AS (
SELECT age, postal_code, COUNT(*) AS n
FROM analytics.users
GROUP BY age, postal_code
)
SELECT
CASE WHEN n < 10 THEN NULL ELSE age END AS age_bucket,
CASE WHEN n < 10 THEN 'Other' ELSE postal_code END AS postal_area,
SUM(n) AS users
FROM base
GROUP BY 1,2;
Result: Prevents revealing unique or rare cohorts.
Hands-on exercises
Practice the core patterns. Solutions are collapsible below each exercise. After completing, take the Quick Test. Everyone can take it for free; if you log in, your progress and attempts are saved.
Exercise 1 — Partial phone masking
Goal: Show only last 4 digits of phone, preserving country code format if present.
- Create a SELECT that outputs phone_masked from a column phone.
- Keep only digits internally; display as ***-***-1234 for US-like numbers.
Hint
- Use REGEXP_REPLACE to strip non-digits, RIGHT to take last 4.
- Build the mask string with CONCAT.
Show solution
SELECT
phone,
CONCAT('***-***-', RIGHT(REGEXP_REPLACE(phone, '[^0-9]', ''), 4)) AS phone_masked
FROM raw.customers;
Exercise 2 — Deterministic hashed join key
Goal: Produce customer_key_hash for both customers and orders so they can be joined without exposing email.
- Standardize email to lowercase and trim.
- Use SHA2 with a secret pepper parameter (:pepper).
- Demonstrate a join between raw.orders and raw.customers using the hash.
Hint
- Apply the exact same function on both sides of the join.
- Keep the pepper out of code in real life; pass it at runtime.
Show solution
WITH cust AS (
SELECT customer_id,
SHA2(CONCAT(LOWER(TRIM(email)), :pepper), 256) AS customer_key_hash
FROM raw.customers
), ord AS (
SELECT order_id, amount,
SHA2(CONCAT(LOWER(TRIM(customer_email)), :pepper), 256) AS customer_key_hash
FROM raw.orders
)
SELECT o.order_id, o.amount, c.customer_id
FROM ord o
JOIN cust c USING (customer_key_hash);
Exercise 3 — Role-based masking view
Goal: Build a view that shows full dob to "full" users and only birth_year to others.
- Create an access map view admin.user_access(username, access).
- Create analytics.v_users_dob_masked with CASE using SESSION_USER().
- Return columns: user_id, dob_display, birth_year.
Hint
- Use EXTRACT(YEAR FROM dob) for birth_year.
- LEFT JOIN access map; default to masked when no match.
Show solution
CREATE OR REPLACE VIEW analytics.v_users_dob_masked AS
SELECT
u.user_id,
CASE WHEN COALESCE(ua.access, 'masked') = 'full' THEN TO_CHAR(u.dob)
ELSE NULL END AS dob_display,
EXTRACT(YEAR FROM u.dob) AS birth_year
FROM raw.users u
LEFT JOIN admin.user_access ua
ON ua.username = SESSION_USER();
Self-check checklist
- I can hash emails with a pepper and join safely.
- I can create a masking view controlled by roles/users.
- I can suppress small groups to protect privacy.
- I do not hardcode secrets in SQL or logs.
Common mistakes and how to self-check
- Mistake: Hardcoding salts/peppers in code or notebooks. Self-check: Search repo for obvious secrets; move to runtime secrets management.
- Mistake: Inconsistent standardization before hashing (e.g., case/whitespace). Self-check: Verify that email_std is identical across sources.
- Mistake: Logging PII in pipeline errors. Self-check: Trigger a controlled error and inspect logs; ensure values are redacted.
- Mistake: Granting base table access to broad groups. Self-check: List grants; most users should only see masked views.
- Mistake: Not handling small cohorts. Self-check: Count group sizes; apply a k threshold and retest.
- Mistake: Retaining PII indefinitely. Self-check: Confirm TTLs or deletion jobs exist and run.
Practical projects
- Build a privacy-safe customer 360: Use hashed keys for joins, tokenized emails, masked views for analysts, and a k=15 cohort suppression layer for reporting.
- Design a PII intake pipeline: Classify columns, drop unneeded PII in staging, and generate a data catalog note per column (purpose, retention, masking).
- Incident drill: Simulate a suspected PII leak in a dev environment; practice revoking access, rotating secrets, validating logs, and adding automated tests to prevent recurrence.
Learning path
- Classify and document PII fields in your current sources.
- Implement hashing for cross-source joins; remove raw identifiers from analytics tables.
- Add masking views and adjust permissions to least privilege.
- Add cohort suppression and retention jobs.
- Set up monitoring: access logs, schema tests for PII presence, and redaction tests.
Mini challenge
You discover user_phone in a popular dashboard. Only support staff truly need full phones; analysts just need region-level insights.
- Task 1: Remove raw phone from the dashboard’s dataset; add phone_masked.
- Task 2: Provide an alternative metric (e.g., country code distribution) that meets analysts’ needs.
- Task 3: Document who can still see full phone and why.
Next steps
- Refactor one live dataset to minimize PII this week.
- Add a unit test that fails when a new PII column appears in analytics schemas.
- Align with security/compliance on your standard k threshold and retention policies.
Quick Test
Available to everyone for free. Log in to save your progress and see your history.