
Privacy And Compliance For Data

Learn Privacy And Compliance For Data for free with explanations, exercises, and a quick test (for Applied Scientists).

Published: January 7, 2026 | Updated: January 7, 2026

Why this matters

Applied Scientists work with real-world data—often about people. Privacy and compliance protect users, reduce risk, and keep your models shippable. You will routinely:

  • Decide what data is necessary for a model and what to drop or mask.
  • Design labeling workflows without exposing unnecessary personal data.
  • Handle data subject requests (access, deletion) and retention windows.
  • Collaborate with legal/security teams on data protection impact assessments (DPIAs).
  • Document how datasets were collected, processed, labeled, and audited.

Note: This guide is educational and not legal advice. Partner with your organization’s legal/compliance team for decisions.

Concept explained simply

Privacy and compliance ensure you collect and use only the data you truly need, protect it properly, and respect people’s rights.

Mental model: The 3-guardrail loop

  • Purpose: Be specific. Why do you need the data? Can you achieve the goal with less?
  • Protection: If you must keep it, protect it—masking, access controls, encryption, logging.
  • Proof: Document what you did so others can verify (audits, DPIA, data map).

Key terms you will use

  • Personal Data (PII): Any data that can identify a person (directly or indirectly).
  • Sensitive Data: Higher-risk categories (e.g., health, biometrics, precise location, children’s data).
  • Pseudonymization: Replace direct identifiers with tokens; still re-identifiable with a key.
  • Anonymization: Irreversibly remove link to a person; treat with care to avoid re-identification.
  • Lawful Basis: Legal reason to process personal data (e.g., consent, contract, legitimate interests, legal obligation).
  • Data Subject Rights: Access, deletion, correction, portability, objection—must be supportable.
  • DPIA: Data Protection Impact Assessment for higher-risk processing (e.g., profiling, sensitive data).

Key principles and checklists

Data minimization checklist

  • Is every field required for the model objective?
  • Use synthetic or aggregated data where possible.
  • Prefer features over raw text/images when feasible.
  • Mask or drop direct identifiers (name, email, phone, SSN, exact address) unless essential.
  • Reduce precision (e.g., city instead of full address; age band instead of birthdate); see the sketch after this list.
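
To make the last item concrete, here is a minimal sketch in Python, assuming records arrive as dicts; the field names (user_email, birthdate, product_sku) are illustrative, not a fixed schema.

```python
from datetime import date

# Fields never needed for the model objective: drop them outright.
DROP_FIELDS = {"user_name", "user_email", "phone", "ssn", "street_address"}

def age_band(birthdate: date, today: date) -> str:
    """Reduce a birthdate to a coarse age band such as '30-39'."""
    age = today.year - birthdate.year - (
        (today.month, today.day) < (birthdate.month, birthdate.day)
    )
    low = (age // 10) * 10
    return f"{low}-{low + 9}"

def minimize(record: dict) -> dict:
    """Drop direct identifiers and reduce precision of what remains."""
    out = {k: v for k, v in record.items() if k not in DROP_FIELDS}
    if "birthdate" in out:
        out["age_band"] = age_band(out.pop("birthdate"), date.today())
    return out

# Example: identifiers are gone, the birthdate becomes an age band.
print(minimize({"user_email": "a@b.com", "birthdate": date(1990, 5, 1), "product_sku": "X1"}))
```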

Lawful basis quick guide

  • Consent: Clear, informed, revocable. Good for optional features and research with users’ agreement.
  • Contract: Needed to deliver a service the user requested.
  • Legitimate interests: Balance your need vs. user impact; document this assessment.
  • Legal obligation: You must process for legal reasons.

Labeling workflow guardrails

  • Redact identifiers before sending items to annotators where possible (see the redaction sketch after this list).
  • Use role-based access; labelers see only what they need.
  • Bind annotators by confidentiality and acceptable-use policies.
  • Log who accessed what and when.
  • Provide clear instructions to avoid adding personal notes in labels.
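
A minimal redaction sketch for the first guardrail, using regular expressions. The patterns are illustrative and will miss many formats (names, addresses, national IDs), so treat this as a starting point, not a complete PII detector.

```python
import re

# Illustrative patterns; real pipelines need broader coverage and review.
PATTERNS = {
    "EMAIL": re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+"),
    "PHONE": re.compile(r"\+?\d[\d\s().-]{7,}\d"),
    "SSN":   re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
}

def redact(text: str) -> tuple[str, int]:
    """Replace matches with typed placeholders; return text and redaction count."""
    count = 0
    for label, pattern in PATTERNS.items():
        text, n = pattern.subn(f"[{label}]", text)
        count += n
    return text, count

# Log redaction rates so reviewers can audit coverage over time.
texts = ["Call me at +1 (555) 123-4567 or mail jane@example.com"]
results = [redact(t) for t in texts]
print(f"redactions per item: {sum(n for _, n in results) / len(texts):.2f}")
```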

Retention and deletion

  • Set retention based on purpose and policy (e.g., 90 days for raw, 1 year for derived features).
  • Automate deletion and verify with logs; a minimal sketch follows this list.
  • Support deletion requests by mapping identifiers through pipelines and derived data.
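
A minimal sketch of timer-based deletion with a verification log, assuming raw artifacts sit in a local directory and file modification time approximates ingestion time; production systems would typically rely on object-store lifecycle rules or a scheduled job instead.

```python
import logging
import time
from pathlib import Path

logging.basicConfig(level=logging.INFO)
log = logging.getLogger("retention")

RAW_DIR = Path("data/raw")   # illustrative location for raw artifacts
RAW_RETENTION_DAYS = 90      # matches the example policy above

def enforce_retention(directory: Path, max_age_days: int) -> None:
    """Delete files older than the retention window and log each deletion."""
    cutoff = time.time() - max_age_days * 86400
    for path in directory.glob("**/*"):
        if path.is_file() and path.stat().st_mtime < cutoff:
            path.unlink()
            log.info("deleted %s (age > %d days)", path, max_age_days)

enforce_retention(RAW_DIR, RAW_RETENTION_DAYS)
```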

Worked examples

Example 1: Training a support-ticket classifier

Goal: Route tickets to the right team.

  • Minimize: Drop name, email, phone. Keep ticket text, product, coarse timestamp (month), and language.
  • Protection: Pseudonymize ticket IDs (sketched below); encrypt storage; restrict access to the ML team.
  • Proof: Document fields kept/dropped, rationale, and retention (raw text 90 days; embeddings 1 year).
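
One way to pseudonymize ticket IDs is a keyed HMAC: tokens are stable (so joins still work), and re-identification requires the key plus the original ID list. A minimal sketch, assuming the key is fetched from a secrets manager rather than hard-coded:

```python
import hmac
import hashlib

# In practice, fetch this from a secrets manager; never commit it to code.
PSEUDONYM_KEY = b"replace-with-a-managed-secret"

def pseudonymize(ticket_id: str) -> str:
    """Map a ticket ID to a stable token. Only key holders can rebuild the
    mapping, by recomputing the HMAC over the known set of IDs."""
    return hmac.new(PSEUDONYM_KEY, ticket_id.encode(), hashlib.sha256).hexdigest()[:16]

print(pseudonymize("TICKET-12345"))  # same input always yields the same token
```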

Example 2: Face detection for store cameras

Goal: Count visitors, not identify them.

  • Minimize: Process on-device; store only counts and bounding box stats, not raw faces (sketched below).
  • Protection: If frames are temporarily buffered, encrypt and auto-delete within seconds/minutes.
  • Proof: DPIA due to potential high risk; justify anonymization approach and deletion timers.
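
A minimal counting sketch using OpenCV's bundled Haar cascade (assuming the opencv-python package is available); only the count and box statistics are returned, and the caller never persists the frame. The cascade choice and statistics are illustrative.

```python
import cv2

detector = cv2.CascadeClassifier(
    cv2.data.haarcascades + "haarcascade_frontalface_default.xml"
)

def count_visitors(frame) -> dict:
    """Return only aggregate stats; the raw frame is discarded by the caller."""
    gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
    boxes = detector.detectMultiScale(gray, scaleFactor=1.1, minNeighbors=5)
    return {
        "count": len(boxes),
        "mean_box_area": sum(w * h for (_, _, w, h) in boxes) / max(len(boxes), 1),
    }
```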

Example 3: Mobile telemetry for model personalization

Goal: Improve on-device recommendations.

  • Minimize: Collect event types, coarse location (city), device type. Avoid exact GPS and contact lists.
  • Protection: Aggregate on-device; send only aggregated signals (see the sketch below); limit IP storage.
  • Proof: Document opt-in (consent), data flows, and retention windows.
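
A minimal sketch of the aggregation step: events are counted locally per type and day, and only those counts leave the device. The event schema and the upload endpoint are hypothetical.

```python
from collections import Counter

def aggregate_events(events: list[dict]) -> dict:
    """Collapse raw events into (event_type, day) counts on-device;
    raw events never leave the device."""
    counts = Counter(
        (e["type"], e["timestamp"].date().isoformat()) for e in events
    )
    return {f"{etype}:{day}": n for (etype, day), n in counts.items()}

# Hypothetical upload call: only the aggregate is transmitted.
# upload("https://telemetry.example.com/v1/aggregates", aggregate_events(buffer))
```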

How to implement in your workflow

  1. Define purpose: Write a one-sentence model goal and a list of truly necessary data fields.
  2. Map data: Create a simple table: source, fields, personal/sensitive flag, transform (drop/mask/keep), retention (a small example follows this list).
  3. Choose lawful basis: Decide consent/contract/legitimate interest and note rationale.
  4. DPIA trigger check: If profiling, sensitive data, or large-scale monitoring—do a DPIA with stakeholders.
  5. Prepare labeling: Redact, restrict access, and train annotators. Add instructions to avoid personal notes.
  6. Secure & log: Encrypt, enforce role-based access, and enable audit logs.
  7. Operationalize deletion: Set timers and test deletion end-to-end, including derived artifacts.
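
The data map in step 2 can start as a small table checked into version control so it stays auditable. A minimal sketch as Python data, reusing illustrative fields from the support-chat example:

```python
# One row per field; review and update whenever the pipeline changes.
DATA_MAP = [
    # (source,        field,          personal, transform,            retention)
    ("support_chats", "chat_id",      False,    "pseudonymize",       "1 year"),
    ("support_chats", "user_email",   True,     "drop",               "n/a"),
    ("support_chats", "message_text", True,     "redact identifiers", "90 days raw"),
    ("support_chats", "country",      False,    "keep",               "1 year"),
]
```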

Exercises

Do these now, then compare with the sample solutions inside each exercise card below. Anyone can take the test at the end; only logged-in users will have their results saved.

  • Exercise 1: Data mapping and minimization plan for a chat dataset.
  • Exercise 2: Draft lawful basis, consent copy elements, and retention for a speech labeling task.

Exercise 1: Data mapping and minimization

Dataset: 50k customer support chats with fields: chat_id, user_id, timestamp (UTC), user_name, user_email, message_text, product_sku, country, agent_notes.

Task: Create a table with columns [Field, Keep/Drop/Transform, Why], and define retention for raw vs. derived features.

Checklist:

  • Drop or mask direct identifiers.
  • Reduce precision where possible.
  • Set different retention for raw vs. derived.

Expected output: a field-by-field plan with rationale, plus explicit retention windows for raw text and derived features. When done, compare with the solution below.

Exercise 2: Speech labeling compliance plan

Project: Train a wake-word model from user-submitted audio clips.

Tasks:

  • Choose lawful basis and justify.
  • Write 3–5 bullet points for consent text elements.
  • Define annotator access rules and masking.
  • Set retention for raw audio vs. features.

When done, compare with the solution below.

Self-check checklist

  • Can you explain why each field is necessary?
  • Have you reduced or masked identifiers?
  • Is retention time-bound and automated?
  • Can deletion requests be fulfilled across raw and derived data?
  • Is your labeling workflow privacy-preserving?

Common mistakes and how to self-check

  • Mistake: Collecting everything “just in case.” Fix: Tie each field to a specific feature or hypothesis.
  • Mistake: Confusing pseudonymization with anonymization. Fix: If re-identification is possible, treat as personal data.
  • Mistake: Retention set-and-forget. Fix: Implement automated deletion and verify with logs.
  • Mistake: Exposing PII to annotators. Fix: Redact before labeling; use role-based access.
  • Mistake: Ignoring downstream artifacts. Fix: Include embeddings, features, and caches in deletion workflows.

Practical projects

  • Privacy-by-design data spec: Write a 1-page spec for a model dataset with purpose, fields, transformations, lawful basis, and retention.
  • Labeling redaction pipeline: Prototype a simple script to remove names/emails from texts and log redaction rates.
  • Deletion drill: Simulate a deletion request and document the steps to remove raw and derived data.

Mini challenge

You receive 10k emails for a spam classifier. In 5 sentences, propose a plan to minimize data, enable labeling safely, and set retention. Aim for clear trade-offs and rationale.

Learning path

  • Start: Data minimization and redaction techniques.
  • Next: Lawful basis selection and DPIA basics.
  • Then: Labeling governance and annotator controls.
  • Finally: Retention, deletion, and audit logging in production.

Who this is for

  • Applied Scientists and ML Engineers handling user data.
  • Data/Label Operations leads designing annotation workflows.
  • Product teams embedding models into user-facing features.

Prerequisites

  • Basic ML workflow knowledge (data collection, training, evaluation, deployment).
  • Understanding of your organization’s data stack (storage, access control, logging).
  • Willingness to document decisions clearly.

Next steps

  • Complete the exercises and compare with solutions.
  • Take the Quick Test below. Anyone can take it; sign in if you want your progress saved.
  • Apply the checklists to your current dataset and get feedback from a privacy or security partner.

Privacy And Compliance For Data — Quick Test

Test your knowledge with 6 questions. Pass with 70% or higher.

