Why this matters
Data engineers move, transform, and store sensitive information. Compliance means doing this legally and safely, with clear controls and proof that they work. You will be asked to:
- Tag and protect personal data (PII/PHI/payment data).
- Implement access controls, encryption, and data retention/purge.
- Handle data subject requests (export or delete personal data).
- Keep audit logs that prove who accessed data and when.
- Design pipelines that respect data residency (e.g., keep EU data in EU).
Note: This is not legal advice. Always align with your organization’s legal and security teams.
Concept explained simply
Compliance is about three things:
- Know what sensitive data you have and why you have it.
- Put guardrails in place to reduce risk (controls).
- Keep evidence that the guardrails work (audits and logs).
Mental model: Locate → Reduce → Protect → Prove
- Locate: Map data flows and classify sensitive fields.
- Reduce: Collect less, anonymize or aggregate when possible.
- Protect: Restrict access, encrypt, mask, and separate environments.
- Prove: Log, monitor, and document reviews and tests.
Key terms and scope
- PII (Personally Identifiable Information): Data that identifies a person (name, email, phone, device ID).
- PHI (Protected Health Information): Health-related data tied to a person.
- PCI data: Payment card data, such as the PAN (the full card number). Requires strong controls and isolation.
- Data Subject Rights: People can request access, correction, or deletion of their personal data.
- Data Residency: Keeping data within specific regions.
- Data Minimization: Collect and keep only what you truly need.
- Least Privilege: Give only the access someone needs, nothing more.
- Retention vs Deletion: How long you keep data and how you securely remove it.
- DPIA (Data Protection Impact Assessment): A risk review for projects that process personal data.
- Evidence: Audit logs, reviews, test results, and documented procedures showing your controls work.
Regulation and framework names you may hear: GDPR, CCPA/CPRA, HIPAA, PCI DSS, SOC 2. The exact requirements vary by company and country. Focus on the principles and controls above.
Worked examples
Example 1: Implement deletion requests in a data lake
- Locate: Identify all tables where user_id appears (raw, bronze, silver, gold).
- Reduce: Stop copying unneeded PII into downstream tables.
- Protect: Ensure only the pipeline service can run purge jobs.
- Prove: Log each deletion job with counts and affected tables.
Practical steps (see the sketch after this list):
- Add a "tombstone" list of user_ids to delete. Anti-join it when writing tables to drop matching rows.
- For immutable raw files, mark for purge and run a compaction or overwrite process to exclude those user_ids.
- Rebuild downstream aggregates to remove references.
- Write a deletion report (user_ids count, tables touched, timestamp).
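A minimal sketch of the tombstone pattern, assuming pandas-style tables and a hypothetical deletion_report.json output; a real lake job would run the same anti-join in Spark, Trino, or whatever engine you use:

```python
import json
from datetime import datetime, timezone

import pandas as pd

# Toy stand-in for a lake table; real jobs would read Parquet/Delta files.
silver_events = pd.DataFrame({
    "user_id": [1, 2, 3, 2],
    "event": ["view", "click", "view", "purchase"],
})

# Tombstone list: user_ids with approved deletion requests.
tombstones = {2}

def purge_users(df, tombstones, table_name):
    """Drop tombstoned rows and return evidence for the deletion report."""
    mask = df["user_id"].isin(tombstones)
    evidence = {
        "table": table_name,
        "rows_deleted": int(mask.sum()),
        "purged_at": datetime.now(timezone.utc).isoformat(),
    }
    return df[~mask], evidence

cleaned, evidence = purge_users(silver_events, tombstones, "silver_events")

# Persist the evidence so each deletion run can be proven later.
with open("deletion_report.json", "w") as f:
    json.dump([evidence], f, indent=2)
```

The evidence record is what turns the purge into something you can show an auditor, which is the "Prove" step above.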
Example 2: Isolate payment data (PCI-like control)
- Locate: Tag columns like card_number, cardholder_name, expiry.
- Reduce: Store tokens instead of raw card numbers. Keep raw PAN only in a dedicated, tightly controlled zone.
- Protect: Encrypt at rest and in transit, separate networks/storage, restrict access to a small group.
- Prove: Access logs, key rotation records, quarterly access reviews.
Practical steps (see the sketch after this list):
- Create a separate storage container for payment data with stricter policies.
- Use a tokenization service so the analytics dataset stores only tokens.
- Mask or fully remove payment fields in BI views.
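A toy illustration of the tokenize-and-isolate idea in plain Python; the in-memory vault and HMAC scheme below are stand-ins for a dedicated tokenization service with keys held in a KMS:

```python
import hashlib
import hmac
import secrets

# In-memory stand-in for the vault; a real vault lives in the isolated
# payment zone behind its own network and access controls.
_vault = {}
_token_key = secrets.token_bytes(32)  # in practice, fetched from a KMS

def tokenize_pan(pan: str) -> str:
    """Return a deterministic token for a card number and store the mapping."""
    token = hmac.new(_token_key, pan.encode(), hashlib.sha256).hexdigest()
    _vault[token] = pan  # only the payment zone may ever read this mapping
    return token

# Analytics keeps the token (and, at most, the last four digits).
pan = "4111111111111111"
analytics_row = {"card_token": tokenize_pan(pan), "card_last4": pan[-4:]}
print(analytics_row)
```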
Example 3: Least privilege for analytics platform
- Locate: List datasets per team (marketing, finance, product).
- Reduce: Split shared tables into sensitive and non-sensitive versions.
- Protect: Create role-based access (e.g., finance_read, marketing_read); deny public access by default.
- Prove: Run a monthly review of who holds which role, and log all role changes.
Practical steps (see the sketch after this list):
- Define roles and map them to datasets and columns.
- Implement column masking for emails and phone numbers where the full value is not needed.
- Document the role-to-dataset matrix and review it regularly.
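A sketch of role-aware masking in plain Python; the role names and masking rules are illustrative, and in a warehouse you would typically express the same logic as masking policies or secured views:

```python
def mask_email(email: str) -> str:
    """Keep the first character and the domain; hide the rest."""
    local, _, domain = email.partition("@")
    return f"{local[:1]}***@{domain}"

def mask_phone(phone: str) -> str:
    """Show only the last two digits."""
    return "*" * (len(phone) - 2) + phone[-2:]

# Illustrative matrix: roles allowed to read each column unmasked.
UNMASKED_ROLES = {"email": {"support_admin"}, "phone": {"support_admin"}}
MASKERS = {"email": mask_email, "phone": mask_phone}

def read_column(column: str, value: str, role: str) -> str:
    if role in UNMASKED_ROLES.get(column, set()):
        return value
    return MASKERS[column](value)

print(read_column("email", "jane.doe@example.com", "marketing_read"))  # j***@example.com
print(read_column("email", "jane.doe@example.com", "support_admin"))   # full value
```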
How to apply this on your job
- Map data flows: List sources, pipelines, storage, and outputs. Mark PII/PHI/PCI fields.
- Set classification tags: sensitive, internal, public.
- Minimize: Drop unneeded fields; aggregate early when possible.
- Access control: Assign roles; remove individual, ad-hoc access.
- Encryption: Enable encryption at rest and in transit; manage keys with regular rotation.
- Logging and monitoring: Access logs, data changes, deletion actions.
- Retention: Define how long to keep each dataset and how to securely delete it (see the sketch after this list).
- Test and document: Run tabletop tests (e.g., mock deletion request), keep results as evidence.
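One way to make retention enforceable rather than aspirational is a per-dataset policy that the purge job reads; the dataset names and windows below are examples only, to be set with legal and the data owners:

```python
from datetime import datetime, timedelta, timezone

# Example retention policy: days to keep each dataset.
RETENTION_DAYS = {
    "logs_clickstream": 90,
    "user_support_tickets": 730,
    "payments_events": 365,
}

def purge_cutoff(dataset: str) -> datetime:
    """Rows older than this timestamp are due for secure deletion."""
    return datetime.now(timezone.utc) - timedelta(days=RETENTION_DAYS[dataset])

for dataset in RETENTION_DAYS:
    print(dataset, "-> purge rows before", purge_cutoff(dataset).date())
```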
Quick self-check before releasing a new pipeline
- All sensitive fields tagged
- Access roles reviewed
- Encryption confirmed
- Retention and purge defined
- Logs and dashboards in place
- Evidence stored (review notes, test runs)
Exercises
Try these on your own before looking at any solutions.
Exercise 1: Tag PII and propose controls
You have a user_profile table with columns: user_id, email, full_name, signup_ts, country, marketing_opt_in, device_id. Identify PII and propose column-level controls for staging, analytics, and BI views.
Exercise 2: Retention and deletion plan
Three datasets: logs_clickstream (billions of rows), user_support_tickets, payments_events. Draft retention windows, and describe how you will implement safe deletion and capture evidence.
Checklist for your answers
- Sensitive fields correctly identified and tagged
- Controls vary by environment and team role
- Retention aligns with business need and risk
- Deletion process includes downstream reprocessing
- Evidence described (logs, reports, reviews)
Common mistakes and how to self-check
- Mistake: "We encrypt, so we’re compliant." Fix: Also restrict access, log, and define retention.
- Mistake: Copying PII to dev/test. Fix: Use synthetic data or masked subsets.
- Mistake: Keeping data forever. Fix: Set and enforce retention by dataset.
- Mistake: Backups ignored. Fix: Include backups in deletion/retention plans.
- Mistake: No lineage for deletions. Fix: Track where identifiers flow; reprocess derived tables.
- Mistake: Secrets in code. Fix: Use secret managers, rotate regularly.
Self-check
- Can you show which tables contain emails and why?
- Can you prove who accessed payment data last month?
- Can you run a deletion request end-to-end and produce a report?
Practical projects
- Build a consent-aware event pipeline: tag fields, mask in BI, log consent checks.
- Create a data flow map: sources → lake/warehouse → marts; mark PII columns.
- Set up audit logging for a sensitive dataset and a monthly access review workflow.
- Implement column masking views for email/phone with role-based exceptions.
Who this is for
- Data Engineers and Analytics Engineers handling pipelines and warehouses.
- Platform Engineers responsible for data platforms and IAM.
- Anyone shaping data models that may include personal or payment data.
Prerequisites
- Basic SQL and ETL/ELT concepts.
- Understanding of data storage layers and backups.
- Intro-level IAM (roles, policies, least privilege).
Learning path
- Compliance Awareness Basics (this lesson)
- Data Classification and Tagging
- Access Control and Least Privilege
- Encryption and Key Management
- Monitoring, Auditing, and Evidence
- Privacy Engineering Patterns (pseudonymization, anonymization)
Next steps
- Apply the checklist to one production dataset this week.
- Run a mock deletion request and document evidence.
- Review roles on your most sensitive table and remove unneeded access.
When you’re ready, take the quick test below. Note: Everyone can take the test; only logged-in users get saved progress.
Mini challenge
You must share product analytics with a marketing team. They do not need emails, but they need user-level funnels by country and device. Propose a design that:
- Removes direct identifiers (email, full_name).
- Uses a stable pseudonymous user_key for joins.
- Masks rare countries to reduce re-identification risk.
- Logs access and enforces read-only roles for marketing.
One possible approach
Create a marketing_funnels table with user_key = hash(user_id, org_salt), aggregated to daily grain. Keep country, but bucket rare values as "other". Exclude email and full_name entirely. Restrict write access to pipelines only, and give marketing a read-only role. Enable access logging and a monthly access review.
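A sketch of the two privacy techniques in this approach, a salted pseudonymous key and rare-value bucketing; the salt value and threshold are illustrative:

```python
import hashlib
from collections import Counter

ORG_SALT = b"load-me-from-a-secret-manager"  # illustrative; treat like a key
RARE_COUNTRY_THRESHOLD = 20  # tune to your data's re-identification risk

def user_key(user_id: str) -> str:
    """Stable pseudonymous key: the same user_id always maps to the same key."""
    return hashlib.sha256(ORG_SALT + user_id.encode()).hexdigest()[:16]

def bucket_rare_countries(rows):
    """Replace countries seen fewer than RARE_COUNTRY_THRESHOLD times with 'other'."""
    counts = Counter(row["country"] for row in rows)
    return [
        dict(row, country=row["country"]
             if counts[row["country"]] >= RARE_COUNTRY_THRESHOLD else "other")
        for row in rows
    ]

rows = [{"user_id": "u1", "country": "DE"}, {"user_id": "u2", "country": "VA"}]
marketing_rows = [
    {"user_key": user_key(r["user_id"]), "country": r["country"]}
    for r in bucket_rare_countries(rows)
]
print(marketing_rows)  # no direct identifiers; rare countries bucketed
```

Note that a hash of a guessable user_id stays pseudonymous only while the salt is secret, so store the salt like a key; rotating it changes every user_key.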