Why this matters
Data engineers move, transform, and store sensitive information. Compliance means doing this legally and safely, with clear controls and proof that they work. You will be asked to:
- Tag and protect personal data (PII/PHI/payment data).
- Implement access controls, encryption, and data retention/purge.
- Handle data subject requests (export or delete personal data).
- Keep audit logs that prove who accessed data and when.
- Design pipelines that respect data residency (e.g., keep EU data in EU).
Note: This is not legal advice. Always align with your organization’s legal and security teams.
Concept explained simply
Compliance is about three things:
- Know what sensitive data you have and why you have it.
- Put guardrails in place to reduce risk (controls).
- Keep evidence that the guardrails work (audits and logs).
Mental model: Locate → Reduce → Protect → Prove
- Locate: Map data flows and classify sensitive fields.
- Reduce: Collect less, anonymize or aggregate when possible.
- Protect: Restrict access, encrypt, mask, and separate environments.
- Prove: Log, monitor, and document reviews and tests.
Key terms and scope
- PII (Personally Identifiable Information): Data that identifies a person (name, email, phone, device ID).
- PHI (Protected Health Information): Health-related data tied to a person.
- PCI data: Payment card data, such as the PAN (the full card number). Requires strong controls and isolation.
- Data Subject Rights: People can request access, correction, or deletion of their personal data.
- Data Residency: Keeping data within specific regions.
- Data Minimization: Collect and keep only what you truly need.
- Least Privilege: Give only the access someone needs, nothing more.
- Retention vs Deletion: How long you keep data and how you securely remove it.
- DPIA (Data Protection Impact Assessment): A risk review for projects that process personal data.
- Evidence: Audit logs, reviews, test results, and documented procedures showing your controls work.
Regulation and framework names you may hear: GDPR, CCPA/CPRA, HIPAA, PCI DSS, SOC 2. The exact requirements vary by company and country. Focus on the principles and controls above.
Worked examples
Example 1: Implement deletion requests in a data lake
- Locate: Identify all tables where user_id appears (raw, bronze, silver, gold).
- Reduce: Stop copying unneeded PII into downstream tables.
- Protect: Ensure only the pipeline service can run purge jobs.
- Prove: Log each deletion job with counts and affected tables.
Practical steps (see the sketch after this list):
- Add a "tombstone" list of user_ids to delete. Anti-join it when writing tables to drop matching rows.
- For immutable raw files, mark for purge and run a compaction or overwrite process to exclude those user_ids.
- Rebuild downstream aggregates to remove references.
- Write a deletion report (user_ids count, tables touched, timestamp).
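A minimal sketch of the tombstone pattern, assuming pandas-style tables and a hypothetical deletion_report.json output; a real lake job would run the same anti-join in Spark, Trino, or whatever engine you use:

```python
import json
from datetime import datetime, timezone

import pandas as pd

# Toy stand-in for a lake table; real jobs would read Parquet/Delta files.
silver_events = pd.DataFrame({
    "user_id": [1, 2, 3, 2],
    "event": ["view", "click", "view", "purchase"],
})

# Tombstone list: user_ids with approved deletion requests.
tombstones = {2}

def purge_users(df, tombstones, table_name):
    """Drop tombstoned rows and return evidence for the deletion report."""
    mask = df["user_id"].isin(tombstones)
    evidence = {
        "table": table_name,
        "rows_deleted": int(mask.sum()),
        "purged_at": datetime.now(timezone.utc).isoformat(),
    }
    return df[~mask], evidence

cleaned, evidence = purge_users(silver_events, tombstones, "silver_events")

# Persist the evidence so each deletion run can be proven later.
with open("deletion_report.json", "w") as f:
    json.dump([evidence], f, indent=2)
```

The evidence record is what turns the purge into something you can show an auditor, which is the "Prove" step above.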
Example 2: Isolate payment data (PCI-like control)
- Locate: Tag columns like card_number, cardholder_name, expiry.
- Reduce: Store tokens instead of raw card numbers. Keep raw PAN only in a dedicated, tightly controlled zone.
- Protect: Encrypt at rest and in transit, separate networks/storage, restrict access to a small group.
- Prove: Access logs, key rotation records, quarterly access reviews.
Practical steps (see the sketch after this list):
- Create a separate storage container for payment data with stricter policies.
- Use a tokenization service so the analytics dataset stores only tokens.
- Mask or fully remove payment fields in BI views.
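A toy illustration of the tokenize-and-isolate idea in plain Python; the in-memory vault and HMAC scheme below are stand-ins for a dedicated tokenization service with keys held in a KMS:

```python
import hashlib
import hmac
import secrets

# In-memory stand-in for the vault; a real vault lives in the isolated
# payment zone behind its own network and access controls.
_vault = {}
_token_key = secrets.token_bytes(32)  # in practice, fetched from a KMS

def tokenize_pan(pan: str) -> str:
    """Return a deterministic token for a card number and store the mapping."""
    token = hmac.new(_token_key, pan.encode(), hashlib.sha256).hexdigest()
    _vault[token] = pan  # only the payment zone may ever read this mapping
    return token

# Analytics keeps the token (and, at most, the last four digits).
pan = "4111111111111111"
analytics_row = {"card_token": tokenize_pan(pan), "card_last4": pan[-4:]}
print(analytics_row)
```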
Example 3: Least privilege for analytics platform
- Locate: List datasets per team (marketing, finance, product).
- Reduce: Split shared tables into sensitive and non-sensitive versions.
- Protect: Create role-based access (e.g., finance_read, marketing_read); deny public access by default.
- Prove: Run a monthly review of who holds which role, and log all role changes.
Practical steps (see the sketch after this list):
- Define roles and map them to datasets and columns.
- Implement column masking for emails and phone numbers where the full value is not needed.
- Document the role-to-dataset matrix and review it regularly.
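A sketch of role-aware masking in plain Python; the role names and masking rules are illustrative, and in a warehouse you would typically express the same logic as masking policies or secured views:

```python
def mask_email(email: str) -> str:
    """Keep the first character and the domain; hide the rest."""
    local, _, domain = email.partition("@")
    return f"{local[:1]}***@{domain}"

def mask_phone(phone: str) -> str:
    """Show only the last two digits."""
    return "*" * (len(phone) - 2) + phone[-2:]

# Illustrative matrix: roles allowed to read each column unmasked.
UNMASKED_ROLES = {"email": {"support_admin"}, "phone": {"support_admin"}}
MASKERS = {"email": mask_email, "phone": mask_phone}

def read_column(column: str, value: str, role: str) -> str:
    if role in UNMASKED_ROLES.get(column, set()):
        return value
    return MASKERS[column](value)

print(read_column("email", "jane.doe@example.com", "marketing_read"))  # j***@example.com
print(read_column("email", "jane.doe@example.com", "support_admin"))   # full value
```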
How to apply this on your job
- Map data flows: List sources, pipelines, storage, and outputs. Mark PII/PHI/PCI fields.
- Set classification tags: sensitive, internal, public.
- Minimize: Drop unneeded fields; aggregate early when possible.
- Access control: Assign roles; remove individual, ad-hoc access.
- Encryption: Enable encryption at rest and in transit; manage keys with regular rotation.
- Logging and monitoring: Access logs, data changes, deletion actions.
- Retention: Define how long to keep each dataset and how to securely delete it (see the sketch after this list).
- Test and document: Run tabletop tests (e.g., mock deletion request), keep results as evidence.
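One way to make retention enforceable rather than aspirational is a per-dataset policy that the purge job reads; the dataset names and windows below are examples only, to be set with legal and the data owners:

```python
from datetime import datetime, timedelta, timezone

# Example retention policy: days to keep each dataset.
RETENTION_DAYS = {
    "logs_clickstream": 90,
    "user_support_tickets": 730,
    "payments_events": 365,
}

def purge_cutoff(dataset: str) -> datetime:
    """Rows older than this timestamp are due for secure deletion."""
    return datetime.now(timezone.utc) - timedelta(days=RETENTION_DAYS[dataset])

for dataset in RETENTION_DAYS:
    print(dataset, "-> purge rows before", purge_cutoff(dataset).date())
```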
Quick self-check before releasing a new pipeline
- All sensitive fields tagged
- Access roles reviewed
- Encryption confirmed
- Retention and purge defined
- Logs and dashboards in place
- Evidence stored (review notes, test runs)
Exercises
Try these on your own before looking at any solutions.
Exercise 1: Tag PII and propose controls
You have a user_profile table with columns: user_id, email, full_name, signup_ts, country, marketing_opt_in, device_id. Identify PII and propose column-level controls for staging, analytics, and BI views.
Exercise 2: Retention and deletion plan
Three datasets: logs_clickstream (billions of rows), user_support_tickets, payments_events. Draft retention windows, and describe how you will implement safe deletion and capture evidence.
Checklist for your answers
- Sensitive fields correctly identified and tagged
- Controls vary by environment and team role
- Retention aligns with business need and risk
- Deletion process includes downstream reprocessing
- Evidence described (logs, reports, reviews)
Common mistakes and how to self-check
- Mistake: "We encrypt, so we’re compliant." Fix: Also restrict access, log, and define retention.
- Mistake: Copying PII to dev/test. Fix: Use synthetic data or masked subsets.
- Mistake: Keeping data forever. Fix: Set and enforce retention by dataset.
- Mistake: Backups ignored. Fix: Include backups in deletion/retention plans.
- Mistake: No lineage for deletions. Fix: Track where identifiers flow; reprocess derived tables.
- Mistake: Secrets in code. Fix: Use secret managers, rotate regularly.
Self-check
- Can you show which tables contain emails and why?
- Can you prove who accessed payment data last month?
- Can you run a deletion request end-to-end and produce a report?
Practical projects
- Build a consent-aware event pipeline: tag fields, mask in BI, log consent checks.
- Create a data flow map: sources → lake/warehouse → marts; mark PII columns.
- Set up audit logging for a sensitive dataset and a monthly access review workflow.
- Implement column masking views for email/phone with role-based exceptions.
Who this is for
- Data Engineers and Analytics Engineers handling pipelines and warehouses.
- Platform Engineers responsible for data platforms and IAM.
- Anyone shaping data models that may include personal or payment data.
Prerequisites
- Basic SQL and ETL/ELT concepts.
- Understanding of data storage layers and backups.
- Intro-level IAM (roles, policies, least privilege).
Learning path
- Compliance Awareness Basics (this lesson)
- Data Classification and Tagging
- Access Control and Least Privilege
- Encryption and Key Management
- Monitoring, Auditing, and Evidence
- Privacy Engineering Patterns (pseudonymization, anonymization)
Next steps
- Apply the checklist to one production dataset this week.
- Run a mock deletion request and document evidence.
- Review roles on your most sensitive table and remove unneeded access.
When you’re ready, take the quick test below. Note: Everyone can take the test; only logged-in users get saved progress.
Mini challenge
You must share product analytics with a marketing team. They do not need emails, but they need user-level funnels by country and device. Propose a design that:
- Removes direct identifiers (email, full_name).
- Uses a stable pseudonymous user_key for joins.
- Masks rare countries to reduce re-identification risk.
- Logs access and enforces read-only roles for marketing.
One possible approach
Create a marketing_funnels table with user_key = hash(user_id, org_salt), aggregated to daily grain. Keep country, but bucket rare values as "other". Exclude email and full_name entirely. Restrict write access to pipelines only, and give marketing a read-only role. Enable access logging and a monthly access review.
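A sketch of the two privacy techniques in this approach, a salted pseudonymous key and rare-value bucketing; the salt value and threshold are illustrative:

```python
import hashlib
from collections import Counter

ORG_SALT = b"load-me-from-a-secret-manager"  # illustrative; treat like a key
RARE_COUNTRY_THRESHOLD = 20  # tune to your data's re-identification risk

def user_key(user_id: str) -> str:
    """Stable pseudonymous key: the same user_id always maps to the same key."""
    return hashlib.sha256(ORG_SALT + user_id.encode()).hexdigest()[:16]

def bucket_rare_countries(rows):
    """Replace countries seen fewer than RARE_COUNTRY_THRESHOLD times with 'other'."""
    counts = Counter(row["country"] for row in rows)
    return [
        dict(row, country=row["country"]
             if counts[row["country"]] >= RARE_COUNTRY_THRESHOLD else "other")
        for row in rows
    ]

rows = [{"user_id": "u1", "country": "DE"}, {"user_id": "u2", "country": "VA"}]
marketing_rows = [
    {"user_key": user_key(r["user_id"]), "country": r["country"]}
    for r in bucket_rare_countries(rows)
]
print(marketing_rows)  # no direct identifiers; rare countries bucketed
```

Note that a hash of a guessable user_id stays pseudonymous only while the salt is secret, so store the salt like a key; rotating it changes every user_key.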