Why this matters
As an Applied Scientist, you touch data at every stage: collection, feature engineering, training, evaluation, deployment, and monitoring. Personally Identifiable Information (PII) can appear explicitly (email, phone, SSN) or implicitly (combinations like ZIP + birth year + gender). Good PII handling prevents harm to users, reduces legal and reputational risk, and keeps your ML workflows efficient and compliant. In practice, it lets you:
- Build models without keeping unnecessary raw identifiers.
- Ship metrics and dashboards without leaking user data.
- Debug failures safely: avoid logging PII in prompts, traces, or errors.
- Enable reproducible science via safe, linkable pseudonyms instead of raw identifiers.
Who this is for
- Applied Scientists and ML Engineers training or evaluating models on user data.
- Data Scientists building features, dashboards, or experiments with real-world data.
- Researchers preparing datasets for LLMs or recommendation systems.
Prerequisites
- Comfort with Python (e.g., pandas) or SQL for working with data.
- Basic understanding of ML/data pipelines and logging.
- Familiarity with JSON/CSV and simple regex.
Concept explained simply
PII is any information that can identify a person directly (like email) or indirectly when combined with other data (like ZIP + birth year). Your job: minimize, protect, and control PII across the data lifecycle.
Mental model
Think in two layers:
- Surface area: Where can PII exist? (raw data, features, embeddings, logs, prompts, dashboards)
- Lifecycle: When do you touch it? (collect → store → use/train → evaluate/share → deploy/serve → log/monitor → retire/delete)
At each lifecycle step, apply controls: collect less, transform early, restrict access, log deliberately, and delete on schedule.
Common PII categories
- Direct identifiers: name, email, phone, SSN, national ID, exact address, account numbers, device IDs.
- Quasi-identifiers: date of birth, ZIP/postcode, gender, unique timestamps, rare events, IP.
- Sensitive text: free-form notes, support tickets, prompts, transcripts.
Controls by lifecycle (quick reference)
- Collect: Data minimization and purpose limitation. Avoid free-text fields by default.
- Store: Encrypt at rest; separate keys/secrets; limit access via roles.
- Use/Train: De-identify early; use pseudonyms (e.g., HMAC of user_id); exclude PII from embeddings.
- Evaluate/Share: Share aggregates, apply k-anonymity or differential privacy for broad audiences.
- Deploy/Serve: Redact inputs/outputs that could carry PII; mask logs.
- Log/Monitor: Log only what you need; never store raw PII in error traces or prompts (see the allowlist sketch after this list).
- Retire/Delete: Retention schedules; delete raw PII after derivation; document deletion.
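One concrete way to apply the Log/Monitor control is allowlist-based logging: only fields that have been reviewed as PII-free are ever written, so new sensitive fields are dropped by default. A minimal sketch, with illustrative field names:

```python
import json
import logging

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger("pipeline")

# Assumption: these field names have been reviewed by your team as PII-free.
SAFE_FIELDS = {"request_id", "model_version", "latency_ms", "status"}

def log_event(event: dict) -> None:
    """Log only allowlisted fields; anything else (emails, prompts, free text) is dropped by default."""
    safe = {k: v for k, v in event.items() if k in SAFE_FIELDS}
    logger.info(json.dumps(safe, sort_keys=True))

log_event({
    "request_id": "r-123",
    "latency_ms": 87,
    "email": "ava.brown@example.com",      # dropped: not on the allowlist
    "prompt": "My phone is 202-555-0133",  # dropped: not on the allowlist
})
```

The design choice here is "default deny": a new field added upstream never reaches the logs until someone consciously adds it to the allowlist.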
PII readiness checklist
- [ ] Do we truly need this field? If not, drop it.
- [ ] Can we transform it? (mask, generalize, tokenize, HMAC)
- [ ] Are secrets and keys stored separately with restricted access?
- [ ] Are logs scrubbed of PII by default?
- [ ] Are aggregates safe to share? (check k-anonymity/thresholds/differential privacy)
- [ ] Is there a defined retention period and deletion process?
- [ ] Is free-text scanned and redacted?
Worked examples
Example 1: Training a recommender with emails in source data
- Goal: Predict next purchase. Email is not needed for modeling.
- Action: Drop raw email after generating a stable pseudonym: user_pseudo = HMAC(secret_key, email); see the sketch after this example.
- Result: You preserve user linkage across events without storing raw email.
- Bonus: Keep the HMAC key separate from the data; rotate it if compromised.
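A minimal sketch of that pseudonymization step using Python's standard library. The environment variable name is illustrative; in practice the key lives in a secrets manager and is rotated on compromise:

```python
import hashlib
import hmac
import os

# Assumption: the key is provisioned outside the dataset; the env var name is illustrative.
SECRET_KEY = os.environ["PII_PSEUDONYM_KEY"].encode()

def pseudonymize(value: str) -> str:
    """Stable keyed pseudonym: same input + same key -> same output, not reversible without the key."""
    return hmac.new(SECRET_KEY, value.strip().lower().encode(), hashlib.sha256).hexdigest()

event = {"email": "ava.brown@example.com", "item": "sku-123", "ts": "2024-05-01T12:00:00Z"}
event["user_pseudo"] = pseudonymize(event.pop("email"))  # drop raw email, keep linkage across events
```

Normalizing the value (strip and lowercase) before hashing keeps linkage stable when the same email appears with different casing or whitespace.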
Example 2: Support tickets contain free-text PII
- Detect PII with a rules+NLP pass (regex for email/phone; NER for names/locations).
- Redact or replace with tags: "[EMAIL]", "[PHONE]", "[NAME]" (see the sketch after this example).
- For training a classifier, use redacted text. Store original only where necessary with tight access.
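A rules-only sketch of that redaction pass. The regex patterns are illustrative and intentionally narrow; a production pipeline would add an NER model for names and locations:

```python
import re

# Illustrative patterns only; extend coverage (names, addresses, IDs) before relying on this.
PATTERNS = [
    (re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}"), "[EMAIL]"),
    (re.compile(r"\+?\d[\d\-\s().]{7,}\d"), "[PHONE]"),
]

def redact(text: str) -> str:
    """Replace matched identifiers with tags, leaving the rest of the text intact."""
    for pattern, tag in PATTERNS:
        text = pattern.sub(tag, text)
    return text

print(redact("Reach me at ava.brown@example.com or 202-555-0133."))
# -> "Reach me at [EMAIL] or [PHONE]."
```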
Example 3: Aggregated metrics for a public report
- You want to share monthly active users per city.
- Apply count thresholds (e.g., suppress buckets with n < 20) and optional differential privacy noise for extra protection (see the sketch after this example).
- Publish only aggregates; never release raw rows.
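A sketch of the suppression step with optional noise. The threshold of 20 and the noise mechanism are illustrative assumptions, not a formally vetted differential privacy release:

```python
import random

MIN_COUNT = 20  # assumption: the agreed minimum cell size for publication

def safe_counts(counts, epsilon=None):
    """Suppress small cells; optionally add Laplace-style noise for extra protection.

    Note: a formally differentially private release needs more care (noise before
    thresholding, privacy budget accounting); this is only an illustration.
    """
    published = {}
    for bucket, n in counts.items():
        if n < MIN_COUNT:
            continue  # suppress small buckets entirely
        if epsilon is not None:
            # Laplace(scale = 1/epsilon) noise via the difference of two exponential draws
            noise = random.expovariate(epsilon) - random.expovariate(epsilon)
            n = max(0, round(n + noise))
        published[bucket] = n
    return published

print(safe_counts({"Boston": 431, "Smallville": 7}, epsilon=1.0))
# Smallville is suppressed; Boston is published with a small amount of noise
```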
How to implement controls (step-by-step)
- Map your data: List all fields and where they flow (source → transform → model → logs → outputs).
- Classify: Mark direct identifiers, quasi-identifiers, and sensitive free-text.
- Transform early: Drop, mask, generalize, or pseudonymize (e.g., HMAC) before broad access (see the field-plan sketch after these steps).
- Secure storage: Encrypt, restrict roles, separate keys, audit access.
- Safe analytics: Use aggregates with thresholds; consider k-anonymity or differential privacy for wide sharing.
- Harden logs: Default redaction in pipelines; avoid printing raw inputs in errors.
- Retain and delete: Define timelines; delete raw PII after feature derivation.
- Document: Keep a simple data-protection note for each dataset and model.
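One lightweight way to make the classify, transform, and document steps concrete is a per-field plan that doubles as the dataset's data-protection note. The field names, transform names, and environment variable below are illustrative assumptions:

```python
import hashlib
import hmac
import os
import re

# Assumption: the real key comes from a secrets manager; the env var name is illustrative.
KEY = os.environ.get("PII_PSEUDONYM_KEY", "dev-only-key").encode()

TRANSFORMS = {
    "keep":      lambda v: v,
    "hmac":      lambda v: hmac.new(KEY, str(v).encode(), hashlib.sha256).hexdigest(),
    "year_only": lambda v: str(v)[:4],          # "1990-04-03" -> "1990"
    "zip3":      lambda v: str(v)[:3] + "**",   # "02139" -> "021**"
    "redact":    lambda v: re.sub(r"\S+@\S+|\+?\d[\d\-\s().]{7,}\d", "[REDACTED]", str(v)),
}

# Per-field plan: doubles as the dataset's data-protection note. Unknown fields default to "drop".
FIELD_PLAN = {
    "user_id": "hmac",
    "email": "drop",
    "dob": "year_only",
    "zip": "zip3",
    "notes": "redact",
    "amount": "keep",
}

def apply_plan(record: dict) -> dict:
    """Apply the per-field plan; anything not explicitly planned is dropped (minimization by default)."""
    cleaned = {}
    for field, value in record.items():
        action = FIELD_PLAN.get(field, "drop")
        if action != "drop":
            cleaned[field] = TRANSFORMS[action](value)
    return cleaned

print(apply_plan({"user_id": "u-002", "email": "li.chen@example.org", "dob": "1990-04-03",
                  "zip": "02139", "notes": "Reach me at 202-555-0133", "amount": 12.5}))
```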
Exercises
Try the exercise here, then check the solution below.
Exercise 1 — Scrub PII from logs and create a privacy report
Dataset (2 JSON lines):
{"user_id": "u-001", "name": "Ava Brown", "email": "ava.brown@example.com", "phone": "+1-202-555-0133", "ip": "203.0.113.55", "notes": "Call me at 202-555-0133. Card ends 4242."}
{"user_id": "u-002", "name": "L. Chen", "email": "li.chen@example.org", "dob": "1990-04-03", "zip": "02139", "notes": "My address is 77 Mass Ave"}- Identify direct identifiers vs quasi-identifiers.
- Decide transform per field: drop, mask, generalize, tokenize/HMAC.
- Produce cleaned records and a short privacy report (what changed and why; residual risks).
Solution
Direct identifiers: name, email, phone, exact address (in notes). Quasi-identifiers: IP, DOB, ZIP. Actions:
- name: drop (not needed).
- email: replace with user_pseudo = HMAC(secret, email).
- phone: mask → "+1-202-***-****" and redact from notes.
- ip: generalize to a /24 block or city level if linkage is needed; otherwise drop.
- dob: generalize to year (1990); zip: generalize to first 3 digits (021**).
- notes: redact with tags: "Call me at [PHONE]. Card ends [REDACTED]." and "My address is [ADDRESS]".
Example cleaned records:
{"user_id": "u-001", "user_pseudo": "hmac_7f1c...", "phone_masked": "+1-202-***-****", "ip_block": "203.0.113.0/24", "notes_redacted": "Call me at [PHONE]. Card ends [REDACTED]."}
{"user_id": "u-002", "user_pseudo": "hmac_ba92...", "dob_year": 1990, "zip3": "021**", "notes_redacted": "My address is [ADDRESS]"}Privacy report: We removed raw names/emails, masked phone, generalized DOB/ZIP/IP, redacted free text. Residual risk: pattern uniqueness (rare ZIP+year); mitigate via thresholds in aggregates. Keys are stored separately with restricted access.
Common mistakes and self-check
- Mistake: Keeping raw IDs "just in case". Fix: Generate pseudonyms (HMAC) and delete raw IDs after verification.
- Mistake: Logging full requests/responses containing PII. Fix: Default to redaction; add allowlists for safe fields.
- Mistake: Using unsalted hashes for linkability. Fix: Use keyed HMAC with rotation policy.
- Mistake: Publishing small-cell aggregates. Fix: Apply minimum counts or noise before sharing.
- Mistake: Putting PII into embeddings. Fix: Remove PII before embedding or use redaction tags.
Self-check prompts
- If an attacker gets my training data, what raw identifiers would they find?
- Can I reproduce results without ever seeing a real email or phone?
- Are my logs safe to share with a teammate outside the project?
Practical projects
- Build a PII scrubber: rules (regex) + lightweight NER, with unit tests for common identifiers (example tests after this list).
- Create a privacy transform library: drop/mask/generalize/HMAC utilities with configuration and audit trail.
- Safe metrics pipeline: implement count thresholds and optional differential privacy for a weekly dashboard.
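For the first project, a few pytest-style tests illustrate the coverage worth automating; `scrubber.redact` is a hypothetical module and function standing in for your own scrubber:

```python
import pytest

from scrubber import redact  # hypothetical module and function from the scrubber project

@pytest.mark.parametrize("raw,expected", [
    ("mail me at ava@example.com", "mail me at [EMAIL]"),
    ("call +1-202-555-0133 today", "call [PHONE] today"),
    ("no pii here", "no pii here"),  # must not over-redact clean text
])
def test_redact_common_identifiers(raw, expected):
    assert redact(raw) == expected

def test_redact_is_idempotent():
    once = redact("ava@example.com / 202-555-0133")
    assert redact(once) == once  # running the scrubber twice must not change the output
```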
Learning path
- Start: Identify PII and map data flows.
- Apply transformations early (mask, generalize, pseudonymize).
- Harden storage and logs.
- Design safe analytics (thresholds, k-anonymity, differential privacy for wide sharing).
- Document retention and deletion; perform a mini privacy review per dataset/model.
Next steps
- Integrate PII checks into CI for data pipelines.
- Extend your scrubber to new data types (e.g., image captions or audio transcripts that may contain PII).
- Coordinate with stakeholders to agree on retention windows and access roles.
Mini challenge
You receive a new column "referrer_url" that sometimes contains full query strings with emails or phone numbers. Design a one-page plan to detect and redact PII in URLs, including tests and how you’ll verify no PII reaches logs or metrics.
Quick Test
Take the short test below to check your understanding.