Who this is for
Data engineers, analytics engineers, and platform-minded developers who deploy pipelines, warehouses, and data platforms in the cloud and need safe, auditable access controls.
Prerequisites
- Basic understanding of cloud resources: storage, compute, databases/warehouses
- Comfort with JSON or YAML-like policy syntax
- Familiarity with your cloud provider's IAM terms is helpful but not required
Why this matters
As a data engineer, you move and transform sensitive data. You will routinely:
- Grant an ETL job read access to a raw bucket and write access to a curated bucket
- Give analysts read-only access to a warehouse while protecting PII
- Rotate credentials and use temporary tokens in orchestration systems
- Audit who touched which dataset to pass compliance checks
Correct IAM and role-based access keeps data safe, limits blast radius, and makes audits straightforward.
Concept explained simply
Identity and Access Management (IAM) answers two questions: Who are you, and what can you do? Role-Based Access Control (RBAC) groups permissions into roles like Reader, Writer, or Admin, then assigns those roles to users, groups, or services.
Mental model
Think of your platform as a building:
- Principals = people or services holding keys
- Roles = keyrings with specific doors they can open
- Policies = the rules printed on the keyring specifying which doors and when
- Resources = rooms (buckets, tables, clusters)
- Conditions = extra checks (time of day, resource tags, environment)
Good security means issuing the smallest keyring needed for a job, for a limited time, and logging each door opened.
Core building blocks
- Principals: users, groups, service accounts, or workloads
- Roles: collections of permissions (read, write, admin) scoped to resources
- Policies: allow/deny rules attached to roles or directly to principals/resources
- Scope: limit policy to specific paths, tables, databases, projects, or environments
- Conditions: tag-based or context checks (environment=prod, data=pii)
- Temporary credentials: short-lived tokens acquired by assuming a role
- Audit logs: track who assumed what role and which resources were accessed
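Here is how those building blocks fit together in a single policy document, sketched in provider-neutral YAML. The action, tag, and principal names are illustrative placeholders, not any specific provider's syntax:

```yaml
# One policy statement combining principal, actions, scope, and a condition.
# All names here are illustrative, not real provider syntax.
statement:
  effect: Allow
  principal: service-account:etl-nightly   # who is acting (a workload identity)
  actions:                                 # what they may do
    - storage:GetObject
  resources:                               # where, scoped to an exact prefix
    - bucket/raw/sales/*
  condition:                               # extra context checks
    resource_tag/environment: prod
  max_session_duration: 1h                 # short-lived credentials, not static keys
```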
Rule of thumb
- Deny by default, then allow only what is necessary
- Prefer roles assigned to groups or service accounts over direct user grants
- Use temporary credentials; avoid static keys
- Split duties: ingestion, transformation, analytics, and admin each get distinct roles
Worked examples
Example 1: Warehouse ReadOnly and Loader
- Create role AnalyticsReader with select permissions on schemas, views, and tables; no create/drop/alter
- Create role FactLoader with insert/update on fact and dimension tables in curated schema only
- Assign AnalyticsReader to analyst group; assign FactLoader to ETL service account
Rationale
Analysts can query safely; ETL can write curated tables but cannot alter schema or read secrets outside scope.
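Most warehouses define these roles with SQL GRANT statements; the YAML below is a provider-neutral sketch of the same intent, with illustrative role, schema, and account names:

```yaml
roles:
  AnalyticsReader:
    grants:
      - actions: [select]              # query only
        on: [schemas, views, tables]
    denied: [create, drop, alter]      # no schema changes
    assigned_to: group:analysts

  FactLoader:
    grants:
      - actions: [insert, update]      # load rows only, no DDL
        on:
          - curated.fact_*             # fact and dimension tables
          - curated.dim_*              # in the curated schema only
    assigned_to: service-account:etl-nightly
```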
Example 2: Bucket scope for ETL
- Allow storage:GetObject on raw/sales/* (read-only)
- Allow storage:PutObject on curated/sales/* (write-only)
- Deny storage:DeleteObject and forbid wildcards outside these prefixes
- Allow secrets:GetSecretValue for a single warehouse connection secret
- ETL assumes the role for a 1-hour session per run
Rationale
Limits both read and write to exact folders. No deletes means a bad job cannot wipe data.
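The same scope, expressed as a policy sketch. Action names follow the generic storage:/secrets: style used above; map them to your provider's real names (for example, s3:GetObject on AWS):

```yaml
statements:
  - effect: Allow
    actions: [storage:GetObject]
    resources: [bucket/raw/sales/*]        # read-only, exact prefix
  - effect: Allow
    actions: [storage:PutObject]
    resources: [bucket/curated/sales/*]    # write-only, exact prefix
  - effect: Deny                           # guardrail: a bad job cannot wipe data
    actions: [storage:DeleteObject]
    resources: [bucket/*]
  - effect: Allow
    actions: [secrets:GetSecretValue]
    resources: [secret/jdbc/warehouse]     # one named secret, nothing else
session:
  max_duration: 1h                         # assumed per run, then expires
```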
Example 3: Environment separation (dev/staging/prod)
- Tag resources with environment=dev|staging|prod
- Attach permission boundaries so dev roles cannot act on prod resources
- Grant broader rights in dev, stricter read-only in staging, and least privilege in prod
- Use separate service accounts per environment
Rationale
Prevents accidental access across environments and supports safe experimentation.
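A permission boundary for this pattern can be as small as one deny statement keyed on the environment tag (illustrative syntax again):

```yaml
# Attached to every dev role: even if a dev policy grants broad actions,
# anything touching a prod-tagged resource is denied.
boundary:
  statements:
    - effect: Deny
      actions: ["*"]
      resources: ["*"]
      condition:
        string_equals:
          resource_tag/environment: prod
```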
Hands-on practice
Complete the exercises below. When done, use the checklist to self-review.
Exercise 1 — ETL role policy (least privilege)
Design a policy for a nightly ETL job that:
- Reads only from raw/sales/
- Writes only to curated/sales/
- Cannot delete any object
- Can read one secret called jdbc/warehouse
- Uses a temporary session up to 1 hour
Tip
Scope to exact prefixes, avoid *, and specify only needed actions. Add an assume-role statement tied to the ETL service principal.
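If it helps, start from this skeleton and replace each TODO; the structure mirrors the worked examples above:

```yaml
statements:
  - effect: Allow
    actions: [storage:GetObject]
    resources: [TODO]                 # the exact raw prefix only
  - effect: Allow
    actions: [TODO]                   # write action; remember, no delete
    resources: [TODO]
  - effect: Deny
    actions: [storage:DeleteObject]
    resources: [TODO]
  - effect: Allow
    actions: [secrets:GetSecretValue]
    resources: [TODO]                 # the jdbc/warehouse secret only
assume_role:
  principal: TODO                     # the ETL service principal
  max_session_duration: 1h
```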
Exercise 2 — RBAC matrix for the team
Propose roles for a team with Data Engineers, Data Analysts, ML Engineers, and a Platform Admin across three data zones: raw, curated, warehouse.
Tip
Start with read-only for most, writer for ETL on curated, and tightly controlled admin rights.
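A skeleton for the matrix, with a few cells pre-filled from the tip above; fill each remaining cell with none, read, write, or admin:

```yaml
# zones: raw | curated | warehouse
roles:
  data_engineer:  { raw: read, curated: write, warehouse: TODO }
  data_analyst:   { raw: TODO, curated: TODO,  warehouse: read }
  ml_engineer:    { raw: TODO, curated: TODO,  warehouse: TODO }
  platform_admin: { raw: TODO, curated: TODO,  warehouse: TODO }
```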
Self-review checklist
- I granted only the actions required for the job
- I scoped access to exact paths/schemas/tables
- I avoided wildcards except where justified
- I used temporary credentials and role assumption
- I separated duties by environment and function
- I included conditions or tags where possible
- I considered audit logging for critical access
Common mistakes and how to self-check
- Using broad wildcards: Replace storage:* and dataset:* with specific actions and resources (see the before/after sketch after this list)
- Static keys in code: Switch to role assumption or workload identity; rotate keys immediately if found
- Single mega-role for everything: Split into reader, writer, admin, and per-environment roles
- Granting directly to users: Assign roles to groups or service accounts for easier audits
- No deny guardrails: Add explicit denies or permission boundaries for prod resources
- Unscoped secrets access: Limit to the exact secret and version; read-only
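For the wildcard mistake above, the fix is usually mechanical: enumerate the actions the job actually uses and pin the resources to exact prefixes (action and prefix names are illustrative):

```yaml
# Before: too broad
- effect: Allow
  actions: [storage:*]
  resources: ["*"]

# After: only what the job actually uses
- effect: Allow
  actions: [storage:GetObject, storage:ListBucket]
  resources: [bucket/raw/sales/*]
```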
Self-check mini audit
- Pick one pipeline and list every permission it uses; remove any unused
- Verify session duration; aim for the shortest practical runtime window
- Ensure access to PII is explicitly approved and logged
Practical projects
- Lock down a demo data lake: create raw and curated prefixes, build ETL roles, and prove least privilege with a dry-run script
- Warehouse access tiers: set up Reader, Loader, and Admin roles; onboard a new analyst in minutes using group assignment
- Environment isolation: tag resources and enforce boundaries so dev cannot affect prod; validate by attempting a blocked action
Learning path
- Basics: principals, roles, policies, scopes, conditions, audit logs
- Least privilege in practice: narrow actions and resources, remove wildcards
- Workload identity: service accounts and temporary credentials in orchestrators
- Environment strategy: dev/staging/prod separation with permission boundaries
- Data-layer nuance: object storage prefixes, table-level permissions, row/column-level security (if available)
- Governance: tagging, logging, alerting, and periodic access reviews
Next steps
- Harden one real pipeline by converting static keys to role assumption
- Introduce an explicit deny for prod resources
- Schedule quarterly access reviews for data roles
Mini challenge
Design two roles for a marketing attribution job: one role that reads only curated/marketing/ and another that writes only to curated/attribution/. Include a condition that the write role cannot be used outside 00:00–04:00 UTC. Explain how you would test it safely.
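One possible shape for the time condition, in the same illustrative YAML used throughout. The time_of_day operator is hypothetical; many providers only support absolute timestamp conditions, in which case the orchestrator or a scheduled policy change has to enforce the recurring window:

```yaml
- effect: Allow
  actions: [storage:PutObject]
  resources: [bucket/curated/attribution/*]
  condition:
    time_of_day_between: ["00:00Z", "04:00Z"]   # hypothetical condition operator
```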