Why this matters
Applied Scientists work with real-world data—often about people. Privacy and compliance protect users, reduce risk, and keep your models shippable. You will routinely:
- Decide what data is necessary for a model and what to drop or mask.
- Design labeling workflows without exposing unnecessary personal data.
- Handle data subject requests (access, deletion) and retention windows.
- Collaborate with legal/security teams on data protection impact assessments (DPIAs).
- Document how datasets were collected, processed, labeled, and audited.
Note: This guide is educational and not legal advice. Partner with your organization’s legal/compliance team for decisions.
Concept explained simply
Privacy and compliance ensure you collect and use only the data you truly need, protect it properly, and respect people’s rights.
Mental model: The 3-guardrail loop
- Purpose: Be specific. Why do you need the data? Can you achieve the goal with less?
- Protection: If you must keep it, protect it—masking, access controls, encryption, logging.
- Proof: Document what you did so others can verify (audits, DPIA, data map).
Key terms you will use
- Personal Data (PII): Any data that can identify a person (directly or indirectly).
- Sensitive Data: Higher-risk categories (e.g., health, biometrics, precise location, children’s data).
- Pseudonymization: Replace direct identifiers with tokens; still re-identifiable with a key (see the sketch after this list).
- Anonymization: Irreversibly remove the link to a person; treat with care to avoid re-identification.
- Lawful Basis: Legal reason to process personal data (e.g., consent, contract, legitimate interests, legal obligation).
- Data Subject Rights: Access, deletion, correction, portability, objection; your pipelines must be able to honor them.
- DPIA: Data Protection Impact Assessment for higher-risk processing (e.g., profiling, sensitive data).
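To make the pseudonymization vs. anonymization distinction concrete, here is a minimal Python sketch of keyed pseudonymization. The key name and handling are illustrative assumptions; because anyone holding the key can re-identify the tokens, the output still counts as personal data.

```python
import hmac
import hashlib

# Hypothetical key for illustration; in practice, load it from a secrets
# manager. Whoever holds this key can re-identify the tokens, which is
# exactly why pseudonymized data is still personal data.
SECRET_KEY = b"replace-with-a-managed-secret"

def pseudonymize(identifier: str) -> str:
    """Replace a direct identifier with a stable keyed token."""
    return hmac.new(SECRET_KEY, identifier.encode("utf-8"), hashlib.sha256).hexdigest()

print(pseudonymize("jane.doe@example.com"))  # same input -> same token; no raw email downstream
```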
Key principles and checklists
Data minimization checklist
- Is every field required for the model objective?
- Use synthetic or aggregated data where possible.
- Prefer features over raw text/images when feasible.
- Mask or drop direct identifiers (name, email, phone, SSN, exact address) unless essential.
- Reduce precision (e.g., city instead of full address; age band instead of birthdate); a sketch follows this checklist.
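As an illustration of the drop and reduce-precision checks, here is a minimal pandas sketch. The column names, sample values, and reference date are all hypothetical.

```python
import pandas as pd

# Hypothetical raw dataset; column names and values are illustrative.
df = pd.DataFrame({
    "user_email": ["jane.doe@example.com"],
    "birthdate": pd.to_datetime(["1990-04-12"]),
    "event_ts": pd.to_datetime(["2024-03-05 14:22:31"]),
    "message_text": ["My order never arrived."],
})

minimized = (
    df.drop(columns=["user_email"])  # drop direct identifiers
      .assign(
          # Age band (decade) instead of exact birthdate, as of a fixed reference date.
          age_band=lambda d: (pd.Timestamp("2024-01-01") - d["birthdate"]).dt.days // 365 // 10 * 10,
          # Month instead of an exact timestamp.
          event_month=lambda d: d["event_ts"].dt.to_period("M").astype(str),
      )
      .drop(columns=["birthdate", "event_ts"])  # keep only the reduced-precision versions
)
print(minimized)
```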
Lawful basis quick guide
- Consent: Clear, informed, revocable. Good for optional features and research with users’ agreement.
- Contract: Needed to deliver a service the user requested.
- Legitimate interests: Balance your need vs. user impact; document this assessment.
- Legal obligation: You must process for legal reasons.
Labeling workflow guardrails
- Redact identifiers before sending items to annotators where possible (a minimal sketch follows this list).
- Use role-based access; labelers see only what they need.
- Bind annotators by confidentiality and acceptable-use policies.
- Log who accessed what and when.
- Provide clear instructions to avoid adding personal notes in labels.
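One way to implement the redaction guardrail is pattern-based masking before items reach the labeling queue. This is a minimal sketch with illustrative regexes; they catch only obvious emails and phone numbers, so real pipelines typically layer an NER-based PII detector on top. Returning a hit count lets you log redaction rates for auditing.

```python
import re

# Illustrative patterns only; regexes miss names, addresses, and unusual
# formats, so treat this as a first pass, not a complete detector.
EMAIL = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")
PHONE = re.compile(r"\+?\d[\d\s().-]{7,}\d")

def redact(text: str) -> tuple[str, int]:
    """Mask obvious identifiers and return (redacted_text, hit_count)."""
    hits = 0
    for pattern, token in ((EMAIL, "[EMAIL]"), (PHONE, "[PHONE]")):
        text, n = pattern.subn(token, text)
        hits += n
    return text, hits

redacted, hits = redact("Reach me at jane@example.com or +1 (555) 010-2345.")
print(redacted, f"({hits} redactions)")
```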
Retention and deletion
- Set retention based on purpose and policy (e.g., 90 days for raw, 1 year for derived features).
- Automate deletion and verify with logs (see the sketch after this list).
- Support deletion requests by mapping identifiers through pipelines and derived data.
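A minimal sketch of an automated retention sweep, assuming a hypothetical layout where raw and derived data live under separate directories, each with its own window. Production systems usually rely on object-storage lifecycle rules instead, but the idea is the same: time-bound windows plus a verifiable deletion log.

```python
import logging
from datetime import datetime, timedelta, timezone
from pathlib import Path

logging.basicConfig(level=logging.INFO)

# Hypothetical layout: separate directories let raw data and derived
# features carry different retention windows (per the example above).
RETENTION = {
    Path("data/raw"): timedelta(days=90),
    Path("data/derived"): timedelta(days=365),
}

def sweep(now: datetime | None = None) -> None:
    """Delete files past their retention window and log each deletion."""
    now = now or datetime.now(timezone.utc)
    for root, window in RETENTION.items():
        for path in root.glob("**/*"):
            if not path.is_file():
                continue
            modified = datetime.fromtimestamp(path.stat().st_mtime, tz=timezone.utc)
            if now - modified > window:
                path.unlink()
                logging.info("deleted %s (age %s)", path, now - modified)

sweep()
```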
Worked examples
Example 1: Training a support-ticket classifier
Goal: Route tickets to the right team.
- Minimize: Drop name, email, phone. Keep ticket text, product, coarse timestamp (month), and language.
- Protection: Pseudonymize ticket IDs; encrypt storage; restrict access to the ML team.
- Proof: Document fields kept/dropped, rationale, and retention (raw text 90 days; embeddings 1 year).
Example 2: Face detection for store cameras
Goal: Count visitors, not identify them.
- Minimize: Process on-device; store only counts and bounding box stats, not raw faces.
- Protection: If frames are temporarily buffered, encrypt and auto-delete within seconds/minutes.
- Proof: DPIA due to potential high risk; justify anonymization approach and deletion timers.
Example 3: Mobile telemetry for model personalization
Goal: Improve on-device recommendations.
- Minimize: Collect event types, coarse location (city), device type. Avoid exact GPS and contact lists.
- Protection: Aggregate on-device; send only aggregated signals; limit IP storage (a sketch follows this example).
- Proof: Document opt-in (consent), data flows, and retention windows.
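To show what "aggregate on-device" can look like, here is a minimal sketch with hypothetical event records: raw events (including GPS) never leave the device, and only coarse, aggregated signals are sent.

```python
from collections import Counter

# Hypothetical raw on-device event log; this never leaves the device as-is.
events = [
    {"type": "open_app", "gps": (37.7749, -122.4194)},
    {"type": "play_track", "gps": (37.7750, -122.4195)},
    {"type": "open_app", "gps": (37.7751, -122.4196)},
]

# Aggregate locally: drop GPS entirely, keep only counts per event type
# plus a city-level location resolved on-device.
payload = {
    "event_counts": dict(Counter(e["type"] for e in events)),
    "coarse_location": "San Francisco",
}
print(payload)  # only this aggregated signal is transmitted, not raw events
```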
How to implement in your workflow
- Define purpose: Write a one-sentence model goal and a list of truly necessary data fields.
- Map data: Create a simple table with columns: source, field, personal/sensitive flag, transform (drop/mask/keep), retention (see the sketch after this list).
- Choose lawful basis: Decide consent/contract/legitimate interest and note rationale.
- DPIA trigger check: If the work involves profiling, sensitive data, or large-scale monitoring, run a DPIA with stakeholders.
- Prepare labeling: Redact, restrict access, and train annotators. Add instructions to avoid personal notes.
- Secure & log: Encrypt, enforce role-based access, and enable audit logs.
- Operationalize deletion: Set timers and test deletion end-to-end, including derived artifacts.
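The data map from the second step can live in code as well as in a document, which makes the minimization rules testable. A minimal sketch with hypothetical fields:

```python
from dataclasses import dataclass

@dataclass
class FieldSpec:
    """One row of the data map; values below are illustrative."""
    source: str
    field: str
    personal: bool
    sensitive: bool
    transform: str        # "drop" | "mask" | "keep"
    retention_days: int

DATA_MAP = [
    FieldSpec("support_db", "user_email",   personal=True,  sensitive=False, transform="drop", retention_days=0),
    FieldSpec("support_db", "message_text", personal=True,  sensitive=False, transform="mask", retention_days=90),
    FieldSpec("support_db", "product_sku",  personal=False, sensitive=False, transform="keep", retention_days=365),
]

# Minimization self-check: any personal field kept untransformed needs a
# documented rationale, so fail loudly if one slips through.
for spec in DATA_MAP:
    if spec.personal and spec.transform == "keep":
        raise ValueError(f"{spec.field}: personal field kept without mask/drop; document the rationale")
```

Keeping the map in code lets a CI check flag any personal field that slips through untransformed before the dataset ships.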
Exercises
Do these now. You can compare your answers with the sample solutions in each exercise below.
- Exercise 1: Data mapping and minimization plan for a chat dataset.
- Exercise 2: Draft lawful basis, consent copy elements, and retention for a speech labeling task.
Exercise 1: Data mapping and minimization
Dataset: 50k customer support chats with fields: chat_id, user_id, timestamp (UTC), user_name, user_email, message_text, product_sku, country, agent_notes.
Task: Create a table with columns [Field, Keep/Drop/Transform, Why], and define retention for raw vs. derived features.
Checklist:
- Drop or mask direct identifiers.
- Reduce precision where possible.
- Set different retention for raw vs. derived.
When done, compare with the solution below.
Exercise 2: Speech labeling compliance plan
Project: Train a wake-word model from user-submitted audio clips.
Tasks:
- Choose lawful basis and justify.
- Write 3–5 bullet points for consent text elements.
- Define annotator access rules and masking.
- Set retention for raw audio vs. features.
When done, compare with the solution below.
Self-check checklist
- Can you explain why each field is necessary?
- Have you reduced or masked identifiers?
- Is retention time-bound and automated?
- Can deletion requests be fulfilled across raw and derived data?
- Is your labeling workflow privacy-preserving?
Common mistakes and how to self-check
- Mistake: Collecting everything “just in case.” Fix: Tie each field to a specific feature or hypothesis.
- Mistake: Confusing pseudonymization with anonymization. Fix: If re-identification is possible, treat as personal data.
- Mistake: Retention set-and-forget. Fix: Implement automated deletion and verify with logs.
- Mistake: Exposing PII to annotators. Fix: Redact before labeling; use role-based access.
- Mistake: Ignoring downstream artifacts. Fix: Include embeddings, features, and caches in deletion workflows.
Practical projects
- Privacy-by-design data spec: Write a 1-page spec for a model dataset with purpose, fields, transformations, lawful basis, and retention.
- Labeling redaction pipeline: Prototype a simple script to remove names/emails from texts and log redaction rates (the redaction sketch above is a starting point).
- Deletion drill: Simulate a deletion request and document the steps to remove raw and derived data.
Mini challenge
You receive 10k emails for a spam classifier. In 5 sentences, propose a plan to minimize data, enable labeling safely, and set retention. Aim for clear trade-offs and rationale.
Learning path
- Start: Data minimization and redaction techniques.
- Next: Lawful basis selection and DPIA basics.
- Then: Labeling governance and annotator controls.
- Finally: Retention, deletion, and audit logging in production.
Who this is for
- Applied Scientists and ML Engineers handling user data.
- Data/Label Operations leads designing annotation workflows.
- Product teams embedding models into user-facing features.
Prerequisites
- Basic ML workflow knowledge (data collection, training, evaluation, deployment).
- Understanding of your organization’s data stack (storage, access control, logging).
- Willingness to document decisions clearly.
Next steps
- Complete the exercises and compare with solutions.
- Take the Quick Test below.
- Apply the checklists to your current dataset and get feedback from a privacy or security partner.