Why this matters
As an MLOps Engineer, you connect data, training, and serving. Access control and IAM (Identity and Access Management) prevent data leaks, model tampering, and unauthorized use of compute. Real tasks include granting a training job access to the right bucket path, giving a model server read-only access to a registry, issuing short-lived credentials to CI/CD, and proving compliance during audits.
- Protect sensitive datasets (PII, PHI) while enabling experiments.
- Stop lateral movement across environments (dev/stage/prod).
- Meet compliance requirements (audit logs, separation of duties, least privilege).
Who this is for
- MLOps Engineers building pipelines and platforms.
- Data Scientists who request access and run jobs.
- Platform/SRE engineers integrating secrets, auth, and audit.
Prerequisites
- Basic understanding of ML pipelines (data ingest → train → register → deploy).
- Familiarity with cloud concepts (roles, policies, service accounts) or Kubernetes service accounts.
- Know what a secret is (API key, token, certificate).
Concept explained simply
Access control decides who (identity) can do what (permission) on which resource, under which conditions.
- Identity: user, service account, CI runner, app.
- Permission: read, write, list, execute.
- Resource: dataset path, model registry entry, secret, compute node.
- Condition: time-limited, environment-limited, IP range, tag-based.
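To make the four parts concrete, here is a minimal sketch of an access decision, not any real IAM engine. The `Grant` class, the `is_allowed` function, and the identity name `svc:train` are hypothetical names chosen for illustration. Resource matching uses shell-style globs, where `*` also matches `/`.

```python
from dataclasses import dataclass, field
from fnmatch import fnmatchcase


@dataclass
class Grant:
    identity: str            # who
    permission: str          # what
    resource_pattern: str    # which resource (glob pattern; "*" also matches "/")
    conditions: dict = field(default_factory=dict)  # under which conditions


def is_allowed(grants, identity, permission, resource, context):
    """Return True only if some grant matches all four parts of the request."""
    for g in grants:
        if (
            g.identity == identity
            and g.permission == permission
            and fnmatchcase(resource, g.resource_pattern)
            and all(context.get(k) == v for k, v in g.conditions.items())
        ):
            return True
    return False  # nothing matched: deny by default


# Hypothetical grant: the training service may read one dataset prefix in stage.
grants = [
    Grant("svc:train", "read", "datasets/team-a/project-x/*", {"env": "stage"}),
]

path = "datasets/team-a/project-x/train.csv"
print(is_allowed(grants, "svc:train", "read", path, {"env": "stage"}))
print(is_allowed(grants, "svc:train", "read", path, {"env": "prod"}))
```

The same request is allowed in stage and refused in prod, because the condition is part of the grant rather than an afterthought.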
Mental model
Think of your ML platform as a building with rooms:
- Lobby (CI/CD): issues visitor badges (short-lived creds).
- Data room: locked cabinets (dataset prefixes) with camera logs (audit).
- Training room: machines allowed to read only their cabinet and write results across the hall (artifact store).
- Serving room: read-only access to final models; guarded by a receptionist (API gateway) checking visitor badges (tokens).
Deeper dive: IAM building blocks
- Principals: humans, services, groups.
- Policies: allow/deny statements granting specific actions on resources.
- Role-based access control (RBAC): map job functions to bundled permissions.
- Attribute-based access control (ABAC): add conditions (env=prod, tag=teamA).
- Federation: trust an external identity provider to mint tokens (OIDC).
- Secrets management: store and rotate credentials safely.
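One way to see how RBAC and ABAC fit together is the sketch below. The role names, action names, and the `can` function are assumptions made for illustration and do not correspond to any particular cloud provider's API.

```python
# Hypothetical role catalog: each role bundles a set of actions (RBAC).
ROLES = {
    "Trainer": {"dataset.read", "artifact.write"},
    "Deployer": {"model.deploy"},
}

# Hypothetical bindings: a principal gets a role, narrowed by attributes (ABAC).
BINDINGS = [
    {"principal": "svc:train", "role": "Trainer",
     "when": {"env": "stage", "team": "teamA"}},
    {"principal": "svc:cicd", "role": "Deployer",
     "when": {"env": "prod"}},
]


def can(principal, action, attributes):
    """RBAC selects the permission bundle; ABAC conditions limit where it applies."""
    for binding in BINDINGS:
        if binding["principal"] != principal:
            continue
        if action not in ROLES[binding["role"]]:
            continue
        if all(attributes.get(k) == v for k, v in binding["when"].items()):
            return True
    return False  # no binding matched: deny by default


print(can("svc:train", "dataset.read", {"env": "stage", "team": "teamA"}))
print(can("svc:train", "dataset.read", {"env": "prod", "team": "teamA"}))
```

Keeping the role catalog separate from the bindings means a role can be reviewed once and then reused across environments with different conditions.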
Key terms in 60 seconds
- Least privilege: grant the minimum needed to complete a task.
- Separation of duties: no single identity can build, approve, and deploy to prod.
- Short-lived credentials: expire quickly; reduce blast radius.
- Federated identity: exchange trusted tokens instead of static keys.
- Scoped access: path-, project-, or environment-limited permissions.
- Audit logging: record who did what, when, from where.
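To illustrate why short-lived credentials reduce blast radius, here is a toy token that carries its own expiry. It is a standard-library sketch only, not a JWT or OIDC implementation. The key, the function names, and the claim names are assumptions, and a real platform would rely on its identity provider instead.

```python
import base64
import hashlib
import hmac
import json
import time

SIGNING_KEY = b"demo-only-key"  # assumption: a real key lives in a secrets manager


def mint_token(subject, ttl_seconds, now=None):
    """Issue a signed token that carries its own expiry time."""
    now = time.time() if now is None else now
    payload = json.dumps({"sub": subject, "exp": now + ttl_seconds}).encode()
    body = base64.urlsafe_b64encode(payload)
    sig = hmac.new(SIGNING_KEY, body, hashlib.sha256).hexdigest().encode()
    return body + b"." + sig


def verify_token(token, now=None):
    """Return the claims if the signature is valid and the token has not expired."""
    now = time.time() if now is None else now
    body, _, sig = token.partition(b".")
    expected = hmac.new(SIGNING_KEY, body, hashlib.sha256).hexdigest().encode()
    if not hmac.compare_digest(sig, expected):
        return None  # tampered, or signed with a different key
    claims = json.loads(base64.urlsafe_b64decode(body))
    if now >= claims["exp"]:
        return None  # expired: a leaked token stops working after the TTL
    return claims


token = mint_token("svc:train", ttl_seconds=3600, now=1000.0)
print(verify_token(token, now=1010.0))   # within the TTL
print(verify_token(token, now=4600.0))   # at expiry
```

The `now` parameter exists only so the expiry behavior can be tested without waiting an hour.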
Worked examples
Example 1: Training job on Kubernetes using object storage and a model registry
- Create a dedicated service account for the training pod. Do not reuse admin credentials.
- Grant read access only to s3://datasets/team-a/project-x/*.
- Grant write access to s3://artifacts/project-x/runs/${run_id}/* (no list on parent prefixes).
- Grant write to model registry only for version creation; deny delete.
- Use identity federation (e.g., OIDC) so the pod receives short-lived tokens bound to its service account and namespace.
- Encrypt at rest with a managed key; restrict who can decrypt.
Illustrative policy sketch
{
  "principal": "svcacct:train-pod",
  "allow": [
    {"action": ["object.get"], "resource": "datasets/team-a/project-x/*"},
    {"action": ["object.put"], "resource": "artifacts/project-x/runs/${run_id}/*"},
    {"action": ["registry.models.create_version"], "resource": "model-registry/project-x/*"}
  ],
  "deny": [
    {"action": ["object.list"], "resource": "artifacts/*"},
    {"action": ["registry.models.delete"], "resource": "model-registry/*"}
  ],
  "conditions": {"env": "stage", "token_ttl_minutes": 60}
}
Example 2: Model serving behind an API gateway
- Gateway enforces auth (OIDC) and rate limits; only tokens with audience=model-serving are accepted.
- Serving pod service account: read-only access to the specific model version path; cannot write.
- mTLS between gateway and serving pods; no public cluster IPs.
- CI/CD can deploy but cannot read datasets.
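The audience rule at the gateway can be sketched as a claims check. This assumes the token's signature has already been verified by a proper OIDC/JWT library. The function name `accept_claims` is hypothetical, while `aud` and `exp` are the standard JWT claim names, and `aud` may be either a string or a list of strings.

```python
import time

EXPECTED_AUDIENCE = "model-serving"


def accept_claims(claims, now=None):
    """Accept only unexpired tokens issued for the model-serving audience.

    Assumes signature verification already happened upstream.
    """
    now = time.time() if now is None else now

    aud = claims.get("aud")
    audiences = [aud] if isinstance(aud, str) else list(aud or [])
    if EXPECTED_AUDIENCE not in audiences:
        return False  # token was minted for some other service

    exp = claims.get("exp")
    if not isinstance(exp, (int, float)) or now >= exp:
        return False  # missing or past expiry
    return True


print(accept_claims({"aud": "model-serving", "exp": 200}, now=100))
print(accept_claims({"aud": "training", "exp": 200}, now=100))
```

Checking the audience stops a token issued for one service, for example the training pipeline, from being replayed against the serving endpoint.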
Example 3: Just-in-time access for a data scientist debugging PII
- Request is approved by a manager through a ticket; access is time-limited to 4 hours.
- Read-only to pii-datasets/project-y/*; actions are logged.
- Temporary membership removed automatically after TTL.
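A minimal in-memory sketch of this just-in-time flow is shown below. The `JitAccess` class, its method names, and the ticket ID are hypothetical. A real system would persist grants, verify the ticket approval, and send audit events to a log pipeline.

```python
import time


class JitAccess:
    """Time-limited read grants with a simple audit trail (illustration only)."""

    def __init__(self):
        self._grants = {}    # (user, resource_prefix) -> expiry timestamp
        self.audit_log = []  # tuples of (event, user, resource, ticket_id, time)

    def grant(self, user, resource_prefix, ticket_id, ttl_hours, now=None):
        now = time.time() if now is None else now
        self._grants[(user, resource_prefix)] = now + ttl_hours * 3600
        self.audit_log.append(("grant", user, resource_prefix, ticket_id, now))

    def can_read(self, user, resource, now=None):
        now = time.time() if now is None else now
        for (u, prefix), expires_at in list(self._grants.items()):
            if expires_at <= now:
                del self._grants[(u, prefix)]  # membership removed after TTL
                continue
            if u == user and resource.startswith(prefix):
                self.audit_log.append(("read", user, resource, None, now))
                return True
        self.audit_log.append(("denied", user, resource, None, now))
        return False


jit = JitAccess()
jit.grant("alice", "pii-datasets/project-y/", "TICKET-1", ttl_hours=4, now=0)
print(jit.can_read("alice", "pii-datasets/project-y/users.parquet", now=3600))
print(jit.can_read("alice", "pii-datasets/project-y/users.parquet", now=4 * 3600))
```

Both allowed and denied reads are recorded, so the audit trail also shows attempts made after the grant expired.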
Step-by-step: Set up least privilege for an ML pipeline
- Map identities: humans (engineers, approvers), services (train, serve, CI/CD), robots (data loaders).
- Define resources: dataset prefixes, feature store tables, artifact buckets, model registry, secrets, clusters/namespaces.
- Group permissions into roles: e.g., Trainer, Deployer, Registry-Writer, Dataset-Reader(Project X).
- Apply deny-by-default: no wildcard access; explicitly allow only necessary actions.
- Scope by environment: dev vs stage vs prod roles; no cross-env wildcard roles.
- Issue short-lived creds: use federation for workloads; avoid static keys in pods.
- Protect secrets: store in a secrets manager; mount at runtime; rotate regularly.
- Enable audit logs: collect and review; wire alerts for high-risk actions (delete, decrypt, assume-role).
- Test: run a dry-run CI job; verify it fails on disallowed actions and succeeds on allowed ones.
- Review regularly: quarterly access reviews; remove stale roles; adjust scopes.
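The deny-by-default and test steps above can be sketched as a small policy evaluator plus a table of expected outcomes. The policy mirrors the illustrative sketch from Example 1, with a hypothetical run ID `run-42` substituted for `${run_id}`. The evaluation order (explicit deny first, then allow, then default deny) is an assumption modeled on how many cloud IAM systems behave, so check your provider's documented rules.

```python
from fnmatch import fnmatchcase

# Illustrative policy from Example 1; "run-42" is a made-up run ID.
POLICY = {
    "allow": [
        {"action": ["object.get"], "resource": "datasets/team-a/project-x/*"},
        {"action": ["object.put"], "resource": "artifacts/project-x/runs/run-42/*"},
        {"action": ["registry.models.create_version"],
         "resource": "model-registry/project-x/*"},
    ],
    "deny": [
        {"action": ["object.list"], "resource": "artifacts/*"},
        {"action": ["registry.models.delete"], "resource": "model-registry/*"},
    ],
}


def _matches(statements, action, resource):
    return any(
        action in s["action"] and fnmatchcase(resource, s["resource"])
        for s in statements
    )


def evaluate(policy, action, resource):
    """Explicit deny wins; otherwise an allow is required; otherwise deny."""
    if _matches(policy["deny"], action, resource):
        return "deny"
    if _matches(policy["allow"], action, resource):
        return "allow"
    return "deny"


# Expected outcomes, including actions that must fail.
CASES = [
    ("object.get", "datasets/team-a/project-x/train.csv", "allow"),
    ("object.get", "datasets/team-b/project-z/train.csv", "deny"),
    ("object.put", "artifacts/project-x/runs/run-42/model.pkl", "allow"),
    ("object.put", "artifacts/project-x/runs/run-41/model.pkl", "deny"),
    ("object.list", "artifacts/project-x/", "deny"),
    ("registry.models.delete", "model-registry/project-x/v1", "deny"),
]

failures = [(a, r) for a, r, want in CASES if evaluate(POLICY, a, r) != want]
print("failures:", failures)
```

A CI dry run could run a table like this against the real policy simulator and fail the build whenever `failures` is non-empty.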
Security controls checklist
- Each workload has its own service account.
- No long-lived access keys in code or images.
- Dataset access is path-scoped, not bucket-wide.
- Model registry write is separated from delete.
- Prod deploy requires approval by a different person.
- All access attempts are logged and retained.
- Tokens/keys rotate automatically and frequently.
- Emergency break-glass credentials are stored offline, and any use of them is monitored.
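Logging only helps if someone looks at the result. As a sketch, the filter below picks out high-risk actions (delete, decrypt, assume-role) from a list of audit events. The event fields and the sample events are invented for illustration and do not follow any specific provider's log schema.

```python
HIGH_RISK_ACTIONS = {"delete", "decrypt", "assume-role"}


def high_risk_events(events):
    """Return the audit events whose action is on the high-risk list."""
    return [e for e in events if e["action"] in HIGH_RISK_ACTIONS]


# Made-up sample events: who did what, on which resource, from where.
events = [
    {"who": "svc:train", "action": "read",
     "resource": "datasets/team-a/project-x/train.csv", "source": "10.0.0.5"},
    {"who": "user:dana", "action": "delete",
     "resource": "model-registry/project-x/v3", "source": "10.0.0.9"},
    {"who": "svc:cicd", "action": "assume-role",
     "resource": "role/Deployer", "source": "10.0.1.2"},
]

for event in high_risk_events(events):
    print(f"ALERT: {event['who']} ran {event['action']} on {event['resource']}")
```

In practice the alert would go to an on-call channel rather than standard output.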
Your turn: exercises
Try these. You can check solutions, but attempt first.
Exercise 1 — Design least-privilege for a training job
You have a Kubernetes training job that needs to read datasets from a project path and write artifacts and model versions. Define identities, required permissions, resource scopes, and conditions (TTL, environment).
Need a nudge?
- Separate service accounts per workload.
- Scope dataset access to a specific prefix.
- Allow write only to a run-specific folder.
- Use short-lived tokens with TTL ≤ 60 minutes.
Exercise 2 — Pre-deploy IAM checklist for model serving
Create a concise checklist that a CI pipeline must pass before deploying a new model to production.
Need a nudge?
- AuthN at gateway, AuthZ for endpoints.
- Read-only model pull in prod.
- Audit logging and rollout approvals.
Common mistakes and how to self-check
- Wildcard permissions (e.g., read all buckets). Self-check: can this identity access other teams’ data?
- Static secrets in code. Self-check: rotate the secret; does anything break? If yes, a copy of the old secret was hardcoded somewhere.
- Overlapping roles that give unintended combined power. Self-check: simulate policy evaluation for high-risk actions.
- No environment boundaries. Self-check: can a dev role list or read prod resources?
- No audit/alerts. Self-check: who would notice a model delete at 2 AM?
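Some of these self-checks can be automated. The sketch below lints role definitions for wildcard permissions and missing environment boundaries. The role schema (`name`, `statements`, `actions`, `resource`, `env`) is a hypothetical format invented for this example.

```python
def lint_role(role):
    """Return a list of findings for one role definition (hypothetical schema)."""
    findings = []
    envs = set()
    for stmt in role["statements"]:
        resource = stmt["resource"]
        # A wildcard in the first path segment means "every bucket or project".
        if resource.split("/", 1)[0] == "*":
            findings.append(f"{role['name']}: wildcard resource '{resource}'")
        if "*" in stmt["actions"]:
            findings.append(f"{role['name']}: wildcard action")
        envs.add(stmt.get("env", "unspecified"))
    if len(envs) > 1 or "unspecified" in envs:
        findings.append(f"{role['name']}: not bound to exactly one environment")
    return findings


good_role = {
    "name": "dataset-reader-project-x-stage",
    "statements": [
        {"actions": ["object.get"],
         "resource": "datasets/team-a/project-x/*", "env": "stage"},
    ],
}
bad_role = {
    "name": "reader-all",
    "statements": [
        {"actions": ["*"], "resource": "*", "env": "dev"},
        {"actions": ["object.get"], "resource": "models/*", "env": "prod"},
    ],
}

print(lint_role(good_role))
for finding in lint_role(bad_role):
    print(finding)
```

A lint like this catches only the obvious cases. Overlapping roles still need a policy simulation, as noted in the list above.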
Practical projects
- Implement federation for a training workload: issue short-lived tokens and prove that dataset reads and artifact writes succeed while listing parent paths is denied.
- Create per-environment roles for a model server and demonstrate that a staging server cannot read prod models.
- Build a quarterly access review script that reports principals with unused high-privilege permissions.
Learning path
- This lesson: IAM fundamentals, least privilege, federation, audit.
- Next: Secrets management for ML (rotation, mounting, envelope encryption).
- Then: Network-level controls (mTLS, private endpoints, VPC/namespace isolation).
- Later: Data governance and lineage with access policies.
- Finally: Continuous compliance (policy-as-code, automated checks).
Next steps
- Refactor one pipeline to use unique service accounts and short-lived tokens.
- Replace any wildcard permissions found in your current roles.
- Enable and review audit logs; create an alert for model delete actions.
Mini challenge
Your company retrains a demand-forecasting model weekly. Draft a one-page IAM plan with: identities, roles, resource scopes, TTLs, and an approval flow for prod deploy. Add at least three deny rules you will test explicitly.