Who this is for
- Machine learning engineers deploying models or training jobs on cloud platforms.
- MLOps engineers wiring CI/CD, data access, and monitoring for ML systems.
- Data scientists who need predictable, secure access to cloud data and compute.
Prerequisites
- Basic understanding of cloud resources (compute, storage, projects/accounts).
- Comfort with JSON/YAML configuration files.
- Familiarity with your cloud provider's console or CLI (any provider is fine).
Why this matters
As an ML engineer, you will grant training jobs access to datasets, restrict who can read model artifacts, rotate secrets for pipelines, and prove compliance with audit logs. Getting IAM and permissions right prevents data leaks, avoids outages caused by unexpected access denials, and limits who can launch or delete costly resources.
- Real task: Let a training job read from a specific bucket prefix and write only to a model-artifacts location.
- Real task: Allow CI to push images to a private registry but block it from deleting tags.
- Real task: Give a contractor time-limited, read-only access to a dataset and nothing else.
Core concepts explained simply
Identity and Access Management (IAM) controls who can do what on which resource, and under which conditions.
- Identity: a person (user), machine (service account or managed identity), or group/role.
- Permission: an allowed action (e.g., read object, write log, start job).
- Policy: attaches permissions to identities on resources. Has an effect (allow/deny) and optional conditions (time, IP, resource prefix).
- Scope: where the policy applies (account/project, resource group, bucket, container, specific path/prefix).
- Session: temporary credentials a job uses; should be short-lived.
Mental model: Access = Identity + Permission + Resource + Condition. Start narrow and expand only when a job fails for a legitimate reason.
Least privilege: Give the minimum permissions needed, scoped to the smallest resource possible, for the shortest time.
Guardrails: Deny policies, organization constraints, naming conventions, and logging that make it hard to do risky things by accident.
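The mental model above can be sketched as a tiny evaluator. This is a simplified illustration, not any provider's real evaluation logic: the `Statement` class, the `is_allowed` function, and the action names such as `storage:read` are hypothetical, and the only rules modeled are "default deny" and "an explicit deny beats any allow".

```python
from dataclasses import dataclass
from fnmatch import fnmatchcase
from typing import Callable, Optional


@dataclass(frozen=True)
class Statement:
    effect: str      # "Allow" or "Deny"
    identity: str    # who the statement applies to
    action: str      # hypothetical action name, e.g. "storage:read"
    resource: str    # glob pattern, e.g. "ml-data/churn/train/*"
    condition: Optional[Callable[[dict], bool]] = None  # optional extra check


def is_allowed(statements, identity, action, resource, context=None):
    """Return True only if some Allow matches and no Deny matches."""
    context = context or {}
    allowed = False
    for s in statements:
        matches = (
            s.identity == identity
            and s.action == action
            and fnmatchcase(resource, s.resource)
            and (s.condition is None or s.condition(context))
        )
        if not matches:
            continue
        if s.effect == "Deny":
            return False   # an explicit deny always wins
        allowed = True
    return allowed         # nothing matched: default deny


# Example policy: narrow allows plus one deny-style guardrail.
POLICY = [
    Statement("Allow", "trainer", "storage:read", "ml-data/churn/train/*"),
    Statement("Allow", "trainer", "storage:write", "ml-artifacts/churn/*"),
    Statement("Deny", "trainer", "storage:write", "ml-data/*"),
]
```

Calling `is_allowed(POLICY, "trainer", "storage:read", "ml-data/churn/test/x.csv")` returns `False` because no allow matches that prefix, which is the "start narrow" behavior the mental model describes.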
Cloud translation (mind the vocabulary differences):
- AWS: IAM users/roles, policies (JSON), resource ARNs, SCPs for guardrails.
- GCP: Principals (users/service accounts), roles/bindings, resource hierarchy (org > folder > project), IAM Conditions.
- Azure: Entra ID principals, role assignments, scopes (subscription/resource group/resource), managed identities.
Worked examples
Example 1 — AWS: Training job reads data and writes artifacts
Goal: A training role can read only s3://ml-data/projects/churn/train/* and write only s3://ml-artifacts/churn/*.
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "ReadTrainingData",
      "Effect": "Allow",
      "Action": ["s3:GetObject"],
      "Resource": [
        "arn:aws:s3:::ml-data/projects/churn/train/*"
      ]
    },
    {
      "Sid": "WriteArtifacts",
      "Effect": "Allow",
      "Action": ["s3:PutObject"],
      "Resource": [
        "arn:aws:s3:::ml-artifacts/churn/*"
      ]
    }
  ]
}
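Mental check: Identity (training role), Permissions (GetObject/PutObject), Resources (specific prefixes), Conditions (none yet). Least privilege is satisfied for a job that reads known object keys; a job that must list the prefix also needs s3:ListBucket on the bucket ARN, restricted with an s3:prefix condition.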
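To reuse this pattern across projects, the policy can be generated from parameters. The following is a minimal sketch; the function name `build_training_policy` and its arguments are illustrative assumptions, and the optional listing statement uses the `s3:ListBucket` action with an `s3:prefix` condition for jobs that need to enumerate objects.

```python
import json


def build_training_policy(data_bucket, data_prefix,
                          artifacts_bucket, artifacts_prefix,
                          allow_list=False):
    """Build an S3 policy document: read one prefix, write another."""
    data_prefix = data_prefix.strip("/")
    artifacts_prefix = artifacts_prefix.strip("/")
    statements = [
        {
            "Sid": "ReadTrainingData",
            "Effect": "Allow",
            "Action": ["s3:GetObject"],
            "Resource": [f"arn:aws:s3:::{data_bucket}/{data_prefix}/*"],
        },
        {
            "Sid": "WriteArtifacts",
            "Effect": "Allow",
            "Action": ["s3:PutObject"],
            "Resource": [f"arn:aws:s3:::{artifacts_bucket}/{artifacts_prefix}/*"],
        },
    ]
    if allow_list:
        # Listing is a bucket-level action, so the resource is the bucket
        # ARN and the prefix restriction moves into a condition.
        statements.append({
            "Sid": "ListTrainingPrefix",
            "Effect": "Allow",
            "Action": ["s3:ListBucket"],
            "Resource": [f"arn:aws:s3:::{data_bucket}"],
            "Condition": {"StringLike": {"s3:prefix": [f"{data_prefix}/*"]}},
        })
    return {"Version": "2012-10-17", "Statement": statements}


policy = build_training_policy("ml-data", "projects/churn/train",
                               "ml-artifacts", "churn")
policy_json = json.dumps(policy, indent=2)  # ready to attach to the training role
```

Generating the document keeps the prefixes in one place, which makes it harder for a copy-paste edit to widen a resource to the whole bucket.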
Example 2 — GCP: Vertex AI training reads dataset prefix
Goal: A Vertex AI custom job's service account can read only gs://ml-data/churn/train/*.
# Binding the predefined role roles/storage.objectViewer, limited to a prefix via an IAM Condition
bindings:
- role: roles/storage.objectViewer
  members:
  - serviceAccount:vertex-train@PROJECT_ID.iam.gserviceaccount.com
  condition:
    title: ReadTrainPrefixOnly
    expression: resource.name.startsWith("projects/_/buckets/ml-data/objects/churn/train/")
    description: Limit to training prefix
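Mental check: The condition narrows access to the exact object path prefix; keep the trailing slash so sibling prefixes such as churn/train-old/ do not match. IAM Conditions on Cloud Storage require uniform bucket-level access on the bucket.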
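Before applying a binding like this, it can help to check locally which object paths the expression would match. The sketch below mimics the `startsWith` check with plain Python string matching; it does not call any Google Cloud API, and the helper names are hypothetical.

```python
def object_resource_name(bucket, object_path):
    """Resource name format used for Cloud Storage objects in IAM Conditions."""
    return f"projects/_/buckets/{bucket}/objects/{object_path}"


def prefix_condition(bucket, prefix):
    """Build the condition expression for a prefix, forcing a trailing slash."""
    prefix = prefix.strip("/") + "/"
    return f'resource.name.startsWith("{object_resource_name(bucket, prefix)}")'


def condition_matches(bucket, prefix, object_path):
    """Local stand-in for the startsWith check in the condition."""
    required = object_resource_name(bucket, prefix.strip("/") + "/")
    return object_resource_name(bucket, object_path).startswith(required)


expression = prefix_condition("ml-data", "churn/train")
in_scope = condition_matches("ml-data", "churn/train", "churn/train/part-0.csv")
sibling = condition_matches("ml-data", "churn/train", "churn/training-old/part-0.csv")
```

Here `in_scope` is `True` and `sibling` is `False`: the forced trailing slash is what keeps `churn/training-old/` outside the grant.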
Example 3 — Azure: AML compute with managed identity and scoped storage access
Goal: Azure ML compute's system-assigned managed identity can read from a specific container and write to an artifacts container.
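- Assign Storage Blob Data Reader on scope: Storage Account > Container ml-data. Role scopes stop at the container, so narrowing to the churn/train/ path requires a role assignment condition on the blob path.
- Assign Storage Blob Data Contributor on scope: Storage Account > Container ml-artifacts. This role also allows deletes; add a condition or use a custom role if the job should only write.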
- Use the managed identity in the AML job so tokens are short-lived.
Mental check: Right identity, right roles, minimum scopes.
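Role assignments are made against a scope string, so it is worth building those strings deliberately. The sketch below assumes containers named ml-data and ml-artifacts; the subscription ID, resource group, storage account, and principal ID are placeholders, and `plan_role_assignments` is a hypothetical helper that only produces a plan (it does not call Azure).

```python
def container_scope(subscription_id, resource_group, storage_account, container):
    """Build the Azure resource ID for a blob container (the narrowest RBAC scope)."""
    return (
        f"/subscriptions/{subscription_id}"
        f"/resourceGroups/{resource_group}"
        f"/providers/Microsoft.Storage/storageAccounts/{storage_account}"
        f"/blobServices/default/containers/{container}"
    )


def plan_role_assignments(principal_id, subscription_id, resource_group, storage_account):
    """Return the two assignments from the example as plain data for review."""
    def scope(container):
        return container_scope(subscription_id, resource_group, storage_account, container)

    return [
        {"principal": principal_id,
         "role": "Storage Blob Data Reader",
         "scope": scope("ml-data")},
        {"principal": principal_id,
         "role": "Storage Blob Data Contributor",
         "scope": scope("ml-artifacts")},
    ]


# Placeholder values; substitute your own identifiers.
plan = plan_role_assignments(
    principal_id="<managed-identity-object-id>",
    subscription_id="<subscription-id>",
    resource_group="ml-rg",
    storage_account="mlstore",
)
```

Reviewing the plan as data before applying it makes it easy to spot an assignment that slipped up to storage-account or subscription scope.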
Hands-on exercises
Do these locally as design tasks. They mirror the graded exercises below.
Exercise 1 — Design a least-privilege plan for a training pipeline
- Identify all identities: CI pipeline, training job, model registry writer.
- List needed resources: data prefix, artifacts location, container registry, logs.
- Assign minimal permissions per identity with the narrowest scope (prefix-level where possible).
- Add one guardrail (a deny or org policy) to block wildcard writes to data buckets.
Expected result: A short plan listing identities, roles/permissions, scopes, conditions, and one deny-style guardrail.
Exercise 2 — Write a read-only policy for a data prefix
Create a minimal policy that allows reading only from a single dataset prefix and nothing else. Use pseudo-JSON if needed. Include:
- Effect: Allow
- Action: only read/list
- Resource: the exact prefix path
- Optional condition: restrict to that prefix
Expected result: A small JSON-like document granting read to a specific prefix only.
Preflight checklist for ML jobs
- Identity is a service account/managed identity (not a personal user).
- Permissions include only required actions (read data, write artifacts, write logs).
- Scope is the smallest resource (specific bucket/container or prefix).
- Sessions are temporary; no long-lived keys checked into code.
- Audit logs enabled; you can trace who accessed what.
- Guardrails exist: deny wildcards on data buckets, restricted public access.
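The checklist can be turned into an automated check that runs before a job is submitted. This is a sketch under stated assumptions: the job description format (`identity_type`, `actions`, `scopes`, `session_ttl_seconds`, `audit_logs_enabled`, `guardrails`) is invented for illustration, and the 12-hour session ceiling is an arbitrary example threshold.

```python
ALLOWED_ACTIONS = {"read_data", "write_artifacts", "write_logs"}
MAX_SESSION_SECONDS = 12 * 60 * 60  # assumed ceiling for "short-lived"


def preflight(job):
    """Return a list of checklist violations; an empty list means the job passes."""
    problems = []

    if job.get("identity_type") not in {"service_account", "managed_identity"}:
        problems.append("identity is not a service account or managed identity")

    extra = set(job.get("actions", [])) - ALLOWED_ACTIONS
    if extra:
        problems.append(f"unexpected actions: {sorted(extra)}")

    scopes = job.get("scopes") or ["*"]  # a missing scope is treated as unbounded
    if any(scope.strip() in {"", "*"} for scope in scopes):
        problems.append("scope is not limited to a bucket, container, or prefix")

    ttl = job.get("session_ttl_seconds")
    if ttl is None or ttl > MAX_SESSION_SECONDS:
        problems.append("credentials are long-lived or have no expiry")

    if not job.get("audit_logs_enabled", False):
        problems.append("audit logs are not enabled")

    if not job.get("guardrails"):
        problems.append("no guardrails configured")

    return problems


good_job = {
    "identity_type": "service_account",
    "actions": ["read_data", "write_artifacts"],
    "scopes": ["ml-data/churn/train/", "ml-artifacts/churn/"],
    "session_ttl_seconds": 3600,
    "audit_logs_enabled": True,
    "guardrails": ["deny-wildcard-writes"],
}
bad_job = {"identity_type": "user", "actions": ["read_data", "delete_bucket"], "scopes": ["*"]}
```

`preflight(good_job)` returns an empty list, while `preflight(bad_job)` reports one problem per checklist item it violates.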
Common mistakes and how to self-check
- Using broad roles (e.g., admin or owner). Self-check: Can this identity delete resources it never needs to delete? If yes, the role is overscoped.
- Wildcarding resources (e.g., bucket/* when only a prefix is needed). Self-check: Try listing outside the prefix; if it works, tighten scope.
- Permanent secrets in code. Self-check: Search repos for keys/tokens; replace with managed identities or secret managers.
- No separation of duties (CI has prod write/delete). Self-check: Review role boundaries; ensure CI can push but not delete or deploy to prod without approval.
- Missing logging. Self-check: Trigger a read and confirm it appears in audit logs within minutes.
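The "permanent secrets in code" self-check can be partly automated. The following is a minimal sketch that scans text for two well-known shapes: AWS access key IDs (which start with `AKIA` followed by 16 uppercase letters or digits) and PEM private key headers. The pattern set is deliberately small and is an assumption for illustration; dedicated secret scanners cover far more formats.

```python
import re

PATTERNS = {
    "aws_access_key_id": re.compile(r"\bAKIA[0-9A-Z]{16}\b"),
    "private_key_block": re.compile(r"-----BEGIN (?:RSA |EC |OPENSSH )?PRIVATE KEY-----"),
}


def scan_text(text):
    """Return (line_number, pattern_name) for every suspicious line."""
    findings = []
    for lineno, line in enumerate(text.splitlines(), start=1):
        for name, pattern in PATTERNS.items():
            if pattern.search(line):
                findings.append((lineno, name))
    return findings


# AKIAIOSFODNN7EXAMPLE is the placeholder key ID used in AWS documentation.
sample = 'region = "us-east-1"\nkey_id = "AKIAIOSFODNN7EXAMPLE"\n'
findings = scan_text(sample)
```

Running a check like this in CI or a pre-commit hook catches a leaked key before it reaches the repository history, where it is much harder to remove.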
Mini challenge
You must give a contractor read-only access to the dataset prefix for 7 days and nothing else. Describe:
- The identity you will create or use.
- The exact permissions and scope.
- How you will make access expire automatically.
- One monitoring step to verify correct use.
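One way to make access expire automatically on AWS is a date condition in the policy itself. The sketch below builds a read-only statement with a `DateLessThan` condition on the `aws:CurrentTime` key; the function name and the bucket and prefix values (borrowed from Example 1) are assumptions, and other providers offer their own mechanisms, such as time-based IAM Conditions on GCP.

```python
from datetime import datetime, timedelta, timezone


def contractor_read_policy(bucket, prefix, start, days=7):
    """Build a read-only S3 policy that stops matching after `days` days.

    `start` must be a timezone-aware datetime.
    """
    expires = (start + timedelta(days=days)).astimezone(timezone.utc)
    prefix = prefix.strip("/")
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Sid": "TimeBoxedRead",
                "Effect": "Allow",
                "Action": ["s3:GetObject"],
                "Resource": [f"arn:aws:s3:::{bucket}/{prefix}/*"],
                "Condition": {
                    "DateLessThan": {
                        # ISO 8601 UTC timestamp after which the allow no longer applies
                        "aws:CurrentTime": expires.strftime("%Y-%m-%dT%H:%M:%SZ")
                    }
                },
            }
        ],
    }


start = datetime(2025, 1, 1, tzinfo=timezone.utc)
contractor_policy = contractor_read_policy("ml-data", "projects/churn/train", start)
```

Because the expiry lives in the policy, access ends even if nobody remembers to remove the grant; deleting the stale policy afterwards is still good hygiene.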
Learning path
- Step 1: Identities and sessions — service accounts, managed identities, short-lived credentials.
- Step 2: Policies — allow vs deny, resource scoping, conditions.
- Step 3: Data access patterns — prefix-level access, read vs write split.
- Step 4: Guardrails — org policies, deny statements, private endpoints.
- Step 5: Audit and review — enable logs, periodic access reviews, least-privilege drift checks.
- Step 6: Automation — codify IAM in IaC and add policy tests in CI.
Practical projects
- Project 1: Build a minimal ML sandbox with two identities: trainer (read data, write artifacts) and registry-writer (write to model registry only). Prove it with a small training run.
- Project 2: Add a deny guardrail that blocks wildcard writes to data buckets/containers. Attempt a write outside the allowed prefix to confirm the deny triggers.
- Project 3: Create IAM policy unit tests in CI to fail PRs that add broad permissions or wildcards.
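For Project 3, a policy unit test can start as a small lint function over AWS-style policy JSON. This is a rough sketch: `find_broad_statements` is a hypothetical helper, and the rules it applies (flag `*` or `service:*` actions, and `*` or bucket-wide `arn:...:::*` resources in Allow statements) are example choices you would tune to your own standards.

```python
def _as_list(value):
    """Policy fields may be a single string or a list; normalize to a list."""
    if value is None:
        return []
    return [value] if isinstance(value, str) else list(value)


def find_broad_statements(policy):
    """Return human-readable findings for overly broad Allow statements."""
    findings = []
    statements = policy.get("Statement", [])
    if isinstance(statements, dict):  # a single statement may appear without a list
        statements = [statements]
    for i, stmt in enumerate(statements):
        if stmt.get("Effect") != "Allow":
            continue
        sid = stmt.get("Sid", f"statement[{i}]")
        for action in _as_list(stmt.get("Action")):
            if action == "*" or action.endswith(":*"):
                findings.append(f"{sid}: wildcard action {action!r}")
        for resource in _as_list(stmt.get("Resource")):
            if resource == "*" or resource.endswith(":::*"):
                findings.append(f"{sid}: wildcard resource {resource!r}")
    return findings


narrow = {"Statement": [{"Sid": "ReadTrainingData", "Effect": "Allow",
                         "Action": ["s3:GetObject"],
                         "Resource": ["arn:aws:s3:::ml-data/projects/churn/train/*"]}]}
broad = {"Statement": [{"Sid": "TooWide", "Effect": "Allow",
                        "Action": "s3:*", "Resource": "*"}]}
```

In CI, a test would load each policy file changed in the PR and fail if `find_broad_statements` returns anything, so reviewers see the exact statement that needs tightening.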
Next steps
- Refine your policies from the exercises into reusable templates.
- Pair with a teammate to do a 20-minute access review on your current ML projects.
- When ready, take the Quick Test below.
Ready? Take the Quick Test
Target score: 70% or higher. If you miss the mark, revisit the exercises and the common mistakes section, then try again.