Who this is for
Data Platform Engineers who need to design, operate, or improve data systems that handle personal or regulated data. Also useful for analytics engineers and platform SREs partnering with security or compliance teams.
Prerequisites
- Basic understanding of data lakes/warehouses and access control concepts (roles, permissions).
- Familiarity with PII concepts (names, emails, IDs).
- Basic SQL and understanding of data pipelines (batch/streaming).
Why this matters
- You will be asked to implement controls for privacy laws (e.g., GDPR-like requests), retention, and auditing.
- Stakeholders (security, legal, data owners) expect evidence that controls exist and work.
- Controls reduce the risk of breaches, fines, and loss of customer trust.
Real tasks you might own
- Classify datasets and tag PII columns for masking.
- Set role-based access for finance, marketing, and external partners.
- Implement data retention and deletion workflows for user accounts.
- Encrypt data at rest and enforce TLS for in-transit data.
- Enable, retain, and review audit logs for access and schema changes.
Concept explained simply
Compliance controls are guardrails that define who can access data, how it is protected, how long it is kept, and how you prove it. They turn legal and security requirements into concrete configurations, processes, and evidence.
Mental model: The 6 questions
- What data is sensitive? (classification)
- Where does it live and flow? (inventory and lineage)
- Who can access it and why? (RBAC/ABAC + approvals)
- How is it protected? (encryption, masking, network)
- How long is it kept? (retention/deletion)
- How is it proven? (audit logs, monitoring, reviews)
Core controls you will use
1) Data classification and inventory
- Tag data domains (Customer, Finance, HR).
- Tag sensitivity (Public, Internal, Confidential, Restricted).
- Tag PII/PHI columns (Email, Phone, SSN-like IDs).
- Outcome: You can filter assets by sensitivity and apply controls consistently.
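A classification registry only works if tags stay consistent. A minimal sketch of an automated check (dataset names, tag values, and the rule that PII columns imply Restricted are all illustrative assumptions):

```python
# Allowed sensitivity values; anything else in the registry is flagged.
ALLOWED_SENSITIVITY = {"Public", "Internal", "Confidential", "Restricted"}

# Illustrative registry entries: dataset name, sensitivity tag, PII columns.
registry = [
    {"dataset": "customer_profile", "sensitivity": "Restricted", "pii_columns": ["email", "phone"]},
    {"dataset": "orders", "sensitivity": "Confidential", "pii_columns": []},
]

def validate(entries):
    """Return a list of problems; an empty list means the registry is consistent."""
    problems = []
    for e in entries:
        if e["sensitivity"] not in ALLOWED_SENSITIVITY:
            problems.append(f"{e['dataset']}: unknown sensitivity {e['sensitivity']!r}")
        if e["pii_columns"] and e["sensitivity"] != "Restricted":
            problems.append(f"{e['dataset']}: PII columns require Restricted sensitivity")
    return problems

print(validate(registry))  # -> []
```

A check like this can run in CI whenever the registry changes, so tagging drift is caught before controls are applied inconsistently.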
2) Access control (RBAC/ABAC)
- Roles map to job functions (e.g., Marketing_Analyst_Read).
- Policies restrict Restricted data to minimal roles; approvals required.
- Service accounts separated from human users; least privilege by default.
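The least-privilege idea above can be sketched as a deny-by-default grant table (role and table names are hypothetical):

```python
# Roles map to exactly the tables/columns a job function needs, nothing more.
ROLE_GRANTS = {
    "Marketing_Analyst_Read": {"orders": {"order_id", "product_id", "date"}},
    "Support_PII_Read": {"customer_profile": {"email", "phone"}},
}

def can_read(role, table, column):
    """Least privilege by default: anything not explicitly granted is denied."""
    return column in ROLE_GRANTS.get(role, {}).get(table, set())

print(can_read("Marketing_Analyst_Read", "orders", "order_id"))        # True
print(can_read("Marketing_Analyst_Read", "customer_profile", "email")) # False
```

Real platforms express this in warehouse GRANT statements or policy engines, but the decision logic is the same: absence of a grant means no access.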
3) Encryption
- At rest: enable managed encryption for storage; rotate keys periodically.
- In transit: enforce TLS for all data movement.
- Key management: restrict key usage to necessary services.
4) Retention and deletion
- Define retention periods by dataset class (e.g., Raw events: 90 days, Aggregates: 3 years).
- Automate deletion or archival; verify with logs and reports.
- Support user deletion requests across systems.
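The retention decision itself is simple date arithmetic. A sketch, assuming the illustrative windows from above (90 days for raw events, 3 years for aggregates):

```python
from datetime import date, timedelta

# Retention windows per dataset class, in days (values are illustrative).
RETENTION_DAYS = {"raw_events": 90, "aggregates": 3 * 365}

def is_expired(dataset, partition_date, today):
    """A partition is expired once it is older than the dataset's window."""
    return (today - partition_date).days > RETENTION_DAYS[dataset]

today = date(2024, 6, 1)
print(is_expired("raw_events", today - timedelta(days=91), today))  # True
print(is_expired("raw_events", today - timedelta(days=30), today))  # False
```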
5) Masking and tokenization
- Default masked views for PII; unmask only for permitted roles.
- Tokenize high-risk identifiers; keep vault separate from analytics.
6) Audit logging and monitoring
- Record access, policy changes, schema changes, and data deletions.
- Send critical events to a central log; retain for a defined period.
- Regular reviews: monthly or quarterly.
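A structured audit event makes central logging and later review practical. A sketch of the event shape (field names and the in-memory sink are illustrative; real systems ship JSON lines to a log pipeline):

```python
import json
from datetime import datetime, timezone

audit_log = []  # stand-in for a central log sink

def record_event(actor, action, obj, request_id=None):
    """Append one structured audit event; JSON lines are easy to ship and query."""
    event = {
        "ts": datetime.now(timezone.utc).isoformat(),
        "actor": actor,
        "action": action,
        "object": obj,
        "request_id": request_id,
    }
    audit_log.append(json.dumps(event))
    return event

record_event("svc_etl", "SELECT", "warehouse.customer_profile.email")
record_event("alice", "GRANT", "role:PII_Unmasked", request_id="TICKET-123")
```

Tying grants to a `request_id` is what lets a reviewer later match a privilege change to its approval ticket.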
7) Data sharing and consent
- Share only necessary fields; remove or mask PII by default.
- Document legal basis or consent handling where applicable.
8) Third parties and data flows
- Maintain a registry of destinations and purposes.
- Apply the same controls to extracts (encryption, access, retention).
9) Data localization
- Keep data in allowed regions where required; restrict cross-region copy.
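Restricting cross-region copies reduces to a deny-by-default allow-list check. A sketch (dataset and region names are hypothetical):

```python
# Illustrative allow-list of permitted regions per dataset.
ALLOWED_REGIONS = {"customer_data": {"eu-west-1", "eu-central-1"}}

def copy_allowed(dataset, dest_region):
    """Deny by default: unknown datasets have no permitted destinations."""
    return dest_region in ALLOWED_REGIONS.get(dataset, set())

print(copy_allowed("customer_data", "eu-west-1"))  # True
print(copy_allowed("customer_data", "us-east-1"))  # False
```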
Worked examples
Example A: Masking PII in the analytics warehouse
- Classify columns: email, phone, dob as Restricted PII.
- Create a masked view that shows partial email for most roles.
- Grant unmasked access only to a named role with approval.
- Enable logging for all SELECTs on PII tables.
Acceptance criteria:
- Non-privileged users see masked values.
- Access attempts to the unmasked view are logged.
- Role grants require a ticket/approval reference.
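The masked-view pattern can be sketched end to end; SQLite stands in for the warehouse here, and the table and mask format are illustrative:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customer_profile (id INTEGER, email TEXT)")
conn.execute("INSERT INTO customer_profile VALUES (1, 'alice@example.com')")

# Masked view: first character of the local part, then '***@' and the domain.
# Most roles are granted the view; only the privileged role sees the table.
conn.execute("""
    CREATE VIEW customer_profile_masked AS
    SELECT id,
           substr(email, 1, 1) || '***@'
               || substr(email, instr(email, '@') + 1) AS email
    FROM customer_profile
""")

row = conn.execute("SELECT email FROM customer_profile_masked").fetchone()
print(row[0])  # a***@example.com
```

In a real warehouse, grants on the base table go only to the privileged unmasking role; everyone else gets the view, so masking cannot be bypassed by BI-tool settings.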
Example B: 90-day retention for raw event data
- Set retention policy: Raw events retained 90 days, then deleted.
- Implement a scheduled job that deletes partitions older than 90 days.
- Generate a deletion report with counts per day.
- Store reports and job logs for 12 months.
Acceptance criteria:
- No partitions older than 90 days exist.
- Deletion logs and reports are accessible to compliance reviewers.
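The partition-deletion job above can be sketched as follows; partitions are modeled as a date-keyed dict with row counts, and both the window and the data are illustrative:

```python
from datetime import date, timedelta

# Date-partitioned raw events: partition date -> row count (illustrative).
partitions = {date(2024, 6, 1) - timedelta(days=d): d * 10 for d in (10, 95, 120)}

def run_retention(parts, today, days=90):
    """Delete partitions older than the window; return a deletion report
    (partition date -> rows deleted) to keep as evidence."""
    cutoff = today - timedelta(days=days)
    report = {}
    for pdate in sorted(parts):
        if pdate < cutoff:
            report[pdate.isoformat()] = parts.pop(pdate)
    return report

report = run_retention(partitions, date(2024, 6, 1))
print(report)  # the two partitions older than 90 days, with their row counts
```

The report, not the deletion itself, is what you show reviewers: it records what was removed and when, after the data is gone.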
Example C: Right-to-delete (user erasure) workflow
- Receive user_id from request system.
- Locate user_id across lake, warehouse, and derived tables via lineage or registry.
- Delete or anonymize records; reprocess affected aggregates.
- Produce evidence: job run ID, tables touched, before/after counts.
Acceptance criteria:
- All systems are updated within the defined SLA (e.g., 30 days).
- The evidence bundle is stored with the request ID.
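A sketch of the erasure run and its evidence bundle; table names, the in-memory data, and the evidence fields are all illustrative assumptions:

```python
import uuid

# Stand-in for tables located via lineage or a registry.
tables = {
    "lake.events": [{"user_id": 7, "event": "click"}, {"user_id": 8, "event": "view"}],
    "warehouse.customer_profile": [{"user_id": 7, "email": "x@example.com"}],
}

def erase_user(user_id, request_id):
    """Delete the user's rows everywhere and return an evidence bundle:
    job run ID, tables touched, before/after counts."""
    evidence = {"request_id": request_id, "job_run_id": str(uuid.uuid4()), "tables": {}}
    for name, rows in tables.items():
        before = len(rows)
        rows[:] = [r for r in rows if r["user_id"] != user_id]
        evidence["tables"][name] = {"before": before, "after": len(rows)}
    return evidence

bundle = erase_user(7, "REQ-1042")
```

The before/after counts per table are what make the bundle auditable: a reviewer can confirm the request touched every registered location.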
Example D: Controlled data share to a vendor
- Classify required fields; exclude PII when not strictly needed.
- Create a sanitized export view with only approved fields.
- Encrypt export; restrict access to a dedicated service account.
- Log all exports and review monthly.
Acceptance criteria:
- Only the approved schema is shared.
- All export operations are logged and reviewed.
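A minimal sketch of the sanitized export view, again using SQLite as a stand-in warehouse; the `orders` schema and approved column list are assumptions:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""CREATE TABLE orders (
    order_id INTEGER, product_id INTEGER, quantity INTEGER,
    price REAL, date TEXT, email TEXT)""")
conn.execute("INSERT INTO orders VALUES (1, 42, 2, 9.99, '2024-06-01', 'a@example.com')")

# Export view exposes only the approved, non-PII columns; the email column
# never reaches the vendor-facing object.
conn.execute("""
    CREATE VIEW orders_export AS
    SELECT order_id, product_id, quantity, price, date FROM orders
""")

cols = [d[0] for d in conn.execute("SELECT * FROM orders_export").description]
print(cols)  # ['order_id', 'product_id', 'quantity', 'price', 'date']
```

Granting the vendor's dedicated service account read access to the view only, never the base table, is what enforces the "only approved schema" criterion.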
Exercises
Exercise 1: Minimal controls for a customer data mart
Design a minimal, practical compliance control set for a new customer data mart that includes customer_profile, orders, and support_tickets tables.
- Classify datasets and sensitive columns.
- Define roles and access (who can read which tables/columns).
- Define masking rules and when unmasking is allowed.
- Define retention per table and audit logging scope.
- Write 3–5 acceptance criteria you could show to auditors.
Hints
- Start with simple tags: Restricted for PII columns; Confidential for non-PII business data.
- Default to masked views; create a single privileged role for unmasking with approvals.
Exercise 2: Retention and deletion workflow
Create a retention plan for raw_events and customer_profile tables and describe the deletion workflow.
- Retention windows for each table.
- How deletion runs (schedule, criteria) and how you verify success.
- What logs and reports you keep and for how long.
- How you handle re-processing of aggregates after deletions.
Hints
- Use partition-based deletes for time-series data.
- Keep deletion evidence separate from the data being deleted.
Self-check checklist
- I used classification tags consistently across datasets.
- Least-privilege roles are clearly defined.
- Masking rules cover all PII columns by default.
- Retention windows are clear and automated.
- Audit logging scope and review cadence are documented.
Common mistakes and self-check
- Inconsistent tagging: Fix by defining allowed values and adding automated checks.
- Too-broad roles: Split roles by function and data domain.
- Masking only in BI tools: Enforce masking in the warehouse/lake too.
- Deletion without evidence: Always produce reports with IDs, counts, and timestamps.
- No review of logs: Schedule monthly reviews and track findings.
Quick self-audit
- Pick one sensitive table. Can you prove who accessed it last month?
- Pick one PII column. Can you show the masked and unmasked paths?
- Pick one dataset. Can you show when its retention job last ran and what it deleted?
Practical projects
- Build a data classification registry: a simple table holding dataset, owner, sensitivity, PII columns, retention, and review date. Populate for 10 datasets.
- Implement masked views for three PII tables and a privileged unmasking role with request ID enforcement.
- Create a retention job for raw events with daily deletion and a weekly summary report.
Learning path
- Start: Compliance Controls Basics (this lesson).
- Next: Implementing Data Masking and Tokenization.
- Then: Access Governance and Just-in-Time Access.
- Advanced: Automated Lineage, Data Loss Prevention, and Continuous Compliance.
Next steps
- Write a one-page control summary for your current platform covering classification, access, masking, encryption, retention, and logging.
- Pick one dataset and apply all six control areas end-to-end this week.
- Take the Quick Test below to verify your understanding.
Mini challenge
You must share order analytics with a partner. Draft a minimal control plan that includes classification, minimized schema, masking, encryption, logging, and a 90-day access review. Keep it under 10 bullet points.
Reference outline
- Classify: Confidential dataset, no PII shared.
- Schema: Only order_id, product_id, quantity, price, date.
- Masking: N/A (no PII); verify no join keys reveal identities.
- Encryption: Encrypted at rest and in transit; dedicated service account.
- Access: Partner role limited to read-only on shared view.
- Logging: All SELECTs on the view logged and reviewed monthly.
- Retention: Share refreshed daily; logs retained 12 months.
- Review: Quarterly access review with ticketed approvals.