
Handling PII And Compliance Basics

Learn Handling PII And Compliance Basics for free with explanations, exercises, and a quick test (for Machine Learning Engineers).

Published: January 1, 2026 | Updated: January 1, 2026

Why this matters

As a Machine Learning Engineer, you will touch real user data. Mishandling personally identifiable information (PII) can cause user harm, legal penalties, and lost trust. Good MLOps includes privacy-by-design: collect less, protect more, and prove it with audit trails.

  • Common tasks: designing pipelines that redact PII, setting retention rules, adding consent checks, building deletion workflows, and documenting privacy controls in model cards.
  • Business impact: fewer incidents, faster approvals from legal/security, and models that can be deployed with confidence.

Note: This is practical guidance, not legal advice. Consult your organization’s legal counsel for policy decisions.

Who this is for

  • ML Engineers and Data Scientists shipping models to production.
  • MLOps/Platform Engineers responsible for data pipelines and observability.
  • Team leads who must ensure privacy compliance at scale.

Prerequisites

  • Basic ML workflow knowledge (ingest, train, evaluate, deploy, monitor).
  • Familiarity with data schemas and feature engineering.
  • Basic understanding of authentication/authorization concepts.

Concept explained simply

Handling PII means recognizing which data can identify a person and applying controls so the person stays protected throughout your ML lifecycle.

Mental model

Think in three layers:

  • Identify: classify data as Public → Internal → Confidential → Restricted (PII lives in Restricted).
  • Minimize: collect only what you need, for a stated purpose, with a time limit.
  • Control: restrict access, mask in logs, tokenize in features, and keep auditable records.

Quick guide to common legal ideas (plain language)
  • Lawful basis: you must have a valid reason to use data (e.g., user consent, contract, legitimate interest; sensitive data often needs explicit consent).
  • Purpose limitation: only use data for the purposes you stated.
  • Data minimization: keep the smallest amount of data that works.
  • Retention and deletion: set time limits and actually delete or de-identify on schedule.
  • Data subject rights: enable access, correction, deletion, and objection requests.
  • Security and accountability: control access, log who did what, and prove it.

Anonymization vs. pseudonymization
  • Anonymized: individuals can no longer reasonably be re-identified. Hard to guarantee, and it must be irreversible in practice.
  • Pseudonymized: direct identifiers replaced (e.g., with tokens or hashes), but re-identification is possible if you hold the mapping or additional data. Still considered personal data.

Core definitions

  • PII (personally identifiable information): data that directly identifies (name, email, phone, SSN) or can be combined to identify (IP address with other fields, unique device IDs).
  • Sensitive data: special categories like health, biometrics, precise location, financial accounts—requires stronger safeguards.
  • De-identification toolkit: redaction, tokenization, hashing/HMAC, generalization (e.g., age bands), suppression, differential privacy, federated learning.

Examples of minimization you can apply
  • Replace email with stable user_id and keep the mapping in a separate secure store.
  • Round timestamps to day or hour, not milliseconds.
  • Use city or region instead of full address.
  • Keep only last 90 days of raw events; aggregate older data.
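
A minimal sketch of these rules in pandas, assuming a raw events DataFrame with hypothetical columns email, address_line, and event_time, plus an email-to-user_id mapping that lives in a separate secure store:

```python
import pandas as pd

def minimize_events(events: pd.DataFrame, email_to_user_id: dict) -> pd.DataFrame:
    """Apply simple minimization rules to a raw events table."""
    out = events.copy()

    # Replace email with a stable user_id; the mapping stays in a separate secure store
    out["user_id"] = out["email"].map(email_to_user_id)
    out = out.drop(columns=["email"])

    # Generalize: hour-level timestamps instead of milliseconds, no street address
    out["event_time"] = pd.to_datetime(out["event_time"]).dt.floor("h")
    out = out.drop(columns=["address_line"], errors="ignore")

    # Retention: keep only the last 90 days of raw events
    cutoff = pd.Timestamp.now() - pd.Timedelta(days=90)
    return out[out["event_time"] >= cutoff]
```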

Practical workflow for ML teams

  1. Classify data: tag fields (Direct Identifier, Sensitive, Quasi-identifier, Non-PII).
  2. Define purpose: write why each field is needed for the model; remove extras.
  3. Design controls: tokenization/HMAC for identifiers, redact logs, encrypt at rest, role-based access.
  4. Retention plan: set per-table retention; schedule deletions; keep audit logs of deletions.
  5. Consent and rights handling: store consent state; respect opt-out; implement deletion/unlearning queue.
  6. Validation gates: add CI checks for schema tags (no raw PII in features or logs); see the check sketched after this list.
  7. Document: include privacy notes in model cards and runbooks.
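
Step 6 can be a small script run in CI. A minimal sketch, assuming a hypothetical SCHEMA_TAGS mapping maintained alongside the dataset and a feature list pulled from the training config:

```python
# Hypothetical schema tags: field name -> classification label
SCHEMA_TAGS = {
    "user_id": "quasi_identifier",
    "email": "direct_identifier",
    "phone": "direct_identifier",
    "city": "quasi_identifier",
    "plan_tier": "non_pii",
    "purchase_amount": "non_pii",
}

FORBIDDEN_IN_FEATURES = {"direct_identifier", "sensitive"}

def check_feature_list(feature_columns: list[str]) -> list[str]:
    """Return violations: feature columns whose tag is not allowed in training data."""
    violations = []
    for col in feature_columns:
        tag = SCHEMA_TAGS.get(col, "unclassified")
        if tag in FORBIDDEN_IN_FEATURES or tag == "unclassified":
            violations.append(f"{col}: {tag}")
    return violations

if __name__ == "__main__":
    # Fails the CI job if raw identifiers or untagged fields appear in the feature set
    problems = check_feature_list(["email", "plan_tier", "purchase_amount"])
    if problems:
        raise SystemExit("PII check failed: " + ", ".join(problems))
```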

Tokenization vs. hashing vs. HMAC
  • Tokenization: replace with random token; lookups happen in a secure vault.
  • Hashing (one-way): reduces exposure but can be reversible via guessing for values with small domains (e.g., phone numbers); still personal data.
  • HMAC (keyed hash): deterministic mapping with a secret key; good for joins without exposing raw value; still personal data.
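
A minimal sketch of the difference between tokenization and HMAC pseudonymization (an in-memory dict stands in for a real vault service, and the HMAC key would come from a secret manager):

```python
import hashlib
import hmac
import secrets

value_to_token: dict[str, str] = {}  # stand-in for a separate, access-controlled vault

def tokenize(value: str) -> str:
    """Random token per value; re-identification requires access to the vault mapping."""
    if value not in value_to_token:
        value_to_token[value] = secrets.token_hex(16)
    return value_to_token[value]

HMAC_KEY = secrets.token_bytes(32)  # in practice, load from a secret manager and rotate

def pseudonymize(value: str) -> str:
    """Keyed hash: deterministic, so it supports joins, but still personal data."""
    return hmac.new(HMAC_KEY, value.encode("utf-8"), hashlib.sha256).hexdigest()

print(tokenize("alice@example.com"))      # opaque token, mapping held separately
print(pseudonymize("alice@example.com"))  # stable output for the same input and key
```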

Worked examples

1) Customer churn model using emails

Problem: Dataset has email, signup_time, purchases. Emails leak identity and add risk.

Solution steps:

  • Replace email with a stable user_id. Store email↔user_id in a separate secure service.
  • If you need to group by domain, compute email_domain client-side and drop full email. Or apply HMAC to the domain only if determinism is needed.
  • Redact emails from logs and error messages.
  • Retention: keep features 180 days; delete raw emails ASAP after tokenization.
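
One way the resulting feature row could look, sketched with a hypothetical helper (the user_id comes from the separate tokenization service, and only a keyed hash of the email domain is kept):

```python
import hashlib
import hmac

HMAC_KEY = b"load-from-a-secret-manager"  # placeholder; never hard-code real keys

def churn_feature_row(raw: dict, user_id: str) -> dict:
    """Build a PII-safe row: stable user_id, HMAC of the email domain, no raw email."""
    domain = raw["email"].split("@", 1)[1].lower()
    return {
        "user_id": user_id,
        "email_domain_hmac": hmac.new(HMAC_KEY, domain.encode(), hashlib.sha256).hexdigest(),
        "signup_time": raw["signup_time"],
        "purchases": raw["purchases"],
    }
```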

2) Resume parser (names + phone numbers)

Problem: Model learns spurious signals from names/phones; high risk of bias and re-identification.

Solution steps:

  • Drop names and phone numbers from training set. Use candidate_id only.
  • Mask PII in text using entity redaction (e.g., replace detected names with [NAME]).
  • Bias check: ensure features are job-related (skills, experience length) not identity.
  • Retention: delete raw resumes after extraction; keep structured fields with masks.
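
A minimal sketch of the masking step for contact details; detecting names reliably needs an NER-based tool (e.g., spaCy or Microsoft Presidio), so only the pattern-based part is shown:

```python
import re

EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")
PHONE_RE = re.compile(r"\+?\d[\d\s().-]{7,}\d")

def mask_contact_info(text: str) -> str:
    """Replace emails and phone numbers in free text with placeholder tokens."""
    text = EMAIL_RE.sub("[EMAIL]", text)
    text = PHONE_RE.sub("[PHONE]", text)
    return text

# Names would be detected with an NER model and replaced with "[NAME]" the same way.
print(mask_contact_info("Call Jane at +1 (555) 123-4567 or jane.doe@example.com"))
```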

3) Medical imaging classification

Problem: DICOM headers contain patient identifiers; images may include burned-in text.

Solution steps:

  • Strip or replace identifiable DICOM tags; validate with automated checks.
  • Detect and crop burned-in PHI overlays.
  • Use site-local training (federated) or strict access controls.
  • Maintain deletion workflow to remove a patient’s data and trigger partial retraining or unlearning.
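
A sketch of the header-stripping step using pydicom, with an illustrative tag list; a real workflow should follow the DICOM PS3.15 de-identification profiles and handle burned-in text separately:

```python
import pydicom

# Illustrative subset only; use your site's approved de-identification tag list.
TAGS_TO_BLANK = [
    "PatientName", "PatientID", "PatientBirthDate", "PatientAddress",
    "ReferringPhysicianName", "InstitutionName",
]

def strip_identifying_tags(in_path: str, out_path: str) -> None:
    """Blank direct identifiers in DICOM headers and drop private tags."""
    ds = pydicom.dcmread(in_path)
    for keyword in TAGS_TO_BLANK:
        if keyword in ds:
            setattr(ds, keyword, "")
    ds.remove_private_tags()  # private tags often carry identifiers too
    ds.save_as(out_path)
```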

Compliance-by-design checklist

  • Data purpose documented and approved.
  • Minimal set of features collected; direct identifiers avoided in features.
  • PII separated from features with tokenization/HMAC as needed.
  • Logs and metrics redacted; no raw PII in observability.
  • Access control and encryption enforced; secrets rotated.
  • Per-dataset retention and deletion jobs configured and tested.
  • Consent state respected; opt-out handled.
  • Deletion/unlearning request flow reaches the data lake, feature store, and derived models.
  • Model card includes privacy notes and data lineage.

Exercises

Do this now. Then compare with the solution.

Exercise 1: See the detailed task in the Practice Exercises section below.

Common mistakes and how to self-check

  • Mistake: Assuming hashing “removes” PII. Self-check: Can the value be linked back using lookups or guessing? If yes, treat as personal data.
  • Mistake: Keeping raw identifiers for convenience. Self-check: Replace with tokens; store the mapping elsewhere.
  • Mistake: Logging full requests containing PII. Self-check: Scan logs for email patterns/phone formats; add redaction.
  • Mistake: No deletion pipeline. Self-check: Trigger a test deletion; verify removal from lake, features, and derived models.
  • Mistake: Undefined purpose/retention. Self-check: Can you state why each field exists and its time limit?

Practical projects

  • Build a data classifier: write a simple rule-based tagger for a sample schema, producing tags like Direct Identifier, Sensitive, Non-PII, and a proposed action (drop, tokenize, aggregate); a starting point is sketched after this list.
  • Create a redaction middleware: remove emails, phones, and IDs from logs and metrics payloads; prove it with unit tests.
  • Deletion drill: implement a mock deletion request that removes a user across staging tables and triggers a model retrain script on a reduced dataset.
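
A starting point for the rule-based tagger (field names and patterns are illustrative; the first matching rule wins, so order and rules need tuning for your schema):

```python
import re

RULES = [
    (re.compile(r"email|phone|ssn|name|address", re.I), ("Direct Identifier", "tokenize or drop")),
    (re.compile(r"dob|birth|health|diagnosis|lat|lon|account", re.I), ("Sensitive", "drop or generalize")),
    (re.compile(r"ip|zip|postal|city|device|user_agent|_at$|time", re.I), ("Quasi-identifier", "generalize")),
]

def tag_field(field_name: str) -> tuple[str, str]:
    """Return (classification, proposed action) for a schema field name."""
    for pattern, result in RULES:
        if pattern.search(field_name):
            return result
    return ("Non-PII", "keep")

for field in ["email", "ip_address", "created_at", "plan_tier"]:
    print(field, "->", tag_field(field))
```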

Mini challenge

Your error monitoring shows occasional stack traces with real user emails. Propose a two-part fix that stops the leak today and prevents it next month.

Possible approach
  • Today: enable pattern-based redaction in the logging sink; scrub historical logs beyond retention policy.
  • Next month: move to structured logging with explicit allow-lists and add a CI gate that fails if new code logs disallowed fields.
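
A minimal sketch of the "today" fix as a Python logging filter that scrubs email patterns before records reach any sink (the pattern and logger name are illustrative):

```python
import logging
import re

EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")

class RedactEmailFilter(logging.Filter):
    """Replace anything that looks like an email before the record is formatted."""
    def filter(self, record: logging.LogRecord) -> bool:
        record.msg = EMAIL_RE.sub("[REDACTED_EMAIL]", record.getMessage())
        record.args = ()  # message is already fully formatted
        return True

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger("app")
logger.addFilter(RedactEmailFilter())
logger.error("Unhandled exception for user jane.doe@example.com")  # email is redacted
```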

Learning path

  • Start here: identify PII in your current datasets and tag fields.
  • Next: implement tokenization/HMAC for identifiers and remove raw PII from features/logs.
  • Then: set retention schedules and a deletion/unlearning workflow.
  • Finally: document privacy controls in model cards and automate validation checks in CI/CD.

Next steps

  • Run the exercise below and draft your team’s PII handling plan.
  • Take the Quick Test to confirm you understand the basics.
  • Apply one improvement in your pipeline this week (e.g., redact logs or add retention jobs).

Quick Test info

The Quick Test is available to everyone; only logged-in users get saved progress.

Practice Exercises

1 exercise to complete

Instructions

You are given a raw table for a churn model:

fields: user_id, email, phone, ip_address, city, address_line, created_at, last_login_at,
purchase_amount, plan_tier, complaint_text
  • Classify each field as Direct Identifier, Sensitive, Quasi-identifier, or Non-PII.
  • Propose a PII-safe feature pipeline: what to drop, what to tokenize/HMAC, and what to aggregate/generalize.
  • Define retention for raw vs. derived features.
  • Specify at least two validation checks to prevent regressions (e.g., in CI).

Expected Output
A short plan mapping fields to categories and actions (drop/tokenize/generalize), retention windows (e.g., 30/180 days), and two CI checks.

Handling PII And Compliance Basics — Quick Test

Test your knowledge with 8 questions. Pass with 70% or higher.
