Why this skill matters for MLOps Engineers
Security and compliance for ML ensure that your data, models, and pipelines are protected end-to-end. As an MLOps Engineer, you enable safe experimentation and reliable delivery by enforcing least-privilege access, protecting PII, isolating networks, logging everything important, and proving that releases meet policy.
- Protect sensitive training data and model outputs.
- Prevent credential leaks and code-to-production supply chain risks.
- Enable regulated workloads (e.g., privacy laws) without blocking delivery.
- Reduce incident impact with strong auditability and recovery paths.
Who this is for
- MLOps and ML platform engineers shipping models to production.
- Data/ML engineers handling pipelines with sensitive data.
- Team leads who need practical, auditable controls.
Prerequisites
- Basic Python and CLI skills.
- Familiarity with containers and CI/CD.
- Basic understanding of cloud IAM and Kubernetes is helpful.
Learning path
1) Foundations: identities, access, and secrets
- Map system components: users, services, data stores, model registry, CI/CD, inference/training clusters.
- Apply least-privilege IAM roles to pipelines and services.
- Move credentials to a vault or secrets manager; remove from code and config.
2) Data protection: PII handling and encryption
- Identify PII fields and set redaction/masking rules at ingestion.
- Encrypt data at rest and in transit; enforce TLS everywhere.
- Set retention and deletion policies for training artifacts and logs.
3) Network isolation and safe connectivity
- Place workloads in private networks; restrict egress by allow-lists.
- Use service-to-service authentication (mTLS or workload identity).
- Expose only required endpoints, with rate limits and WAF where possible.
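The rate-limit bullet can be sketched as an in-process token bucket; production traffic should use the gateway or WAF's built-in limiter, but the mechanics are the same:

```python
import time

class TokenBucket:
    """Minimal token bucket: about `rate` requests per second, bursts up to `capacity`."""
    def __init__(self, rate: float, capacity: int):
        self.rate = rate
        self.capacity = capacity
        self.tokens = float(capacity)
        self.last = time.monotonic()

    def allow(self) -> bool:
        now = time.monotonic()
        # Refill proportionally to elapsed time, capped at capacity
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False

bucket = TokenBucket(rate=2.0, capacity=5)
results = [bucket.allow() for _ in range(7)]
print(results)  # first 5 allowed (burst), remainder denied until tokens refill
```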
4) Auditability and governance
- Emit structured audit logs for data access, model pushes, and inference calls.
- Protect logs with append-only storage and lifecycle policies.
- Create runbooks: access reviews, key rotation, incident response.
5) Secure deployment and model risk management
- Sign and verify images/artifacts; scan dependencies.
- Gate releases with approvals and automated checks.
- Document model risks, monitoring plans, and rollback steps.
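The approval-and-checks gate above can be sketched as a pure function that CI calls before deploying; the check names and one-approval quorum are illustrative assumptions:

```python
# Sketch of an automated release gate: every check must pass and the approval
# quorum must be met; returns the blocking reasons so CI can report them.
def release_allowed(checks: dict[str, bool], approvals: set[str],
                    required_approvals: int = 1) -> tuple[bool, list[str]]:
    reasons = [name for name, ok in checks.items() if not ok]
    if len(approvals) < required_approvals:
        reasons.append(f"approvals: {len(approvals)}/{required_approvals}")
    return (not reasons, reasons)

checks = {"signature_verified": True, "image_scan_clean": True, "unit_tests": True}
ok, reasons = release_allowed(checks, approvals={"alice"})
print(ok, reasons)  # True []
```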
Worked examples
Example 1 — Least-privilege IAM for training to read a specific dataset
Grant a training job read-only access to a single bucket/prefix, nothing else.
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "ListDatasetPrefix",
      "Effect": "Allow",
      "Action": ["s3:ListBucket"],
      "Resource": ["arn:aws:s3:::ml-datasets"],
      "Condition": {
        "StringLike": {"s3:prefix": ["customer-churn/*"]}
      }
    },
    {
      "Sid": "ReadDatasetObjects",
      "Effect": "Allow",
      "Action": ["s3:GetObject"],
      "Resource": ["arn:aws:s3:::ml-datasets/customer-churn/*"]
    }
  ]
}
Note the split: the s3:prefix condition key is only present on ListBucket requests, so attaching it to GetObject would deny every object read. GetObject is instead scoped by the object ARN.
Attach this role to the training job’s compute. Avoid wildcard actions and unrestricted resources.
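That no-wildcard rule can be enforced automatically with a small policy linter in CI. A sketch that flags only the most dangerous patterns (a real linter would also handle NotAction, NotResource, and conditions):

```python
import json

def find_wildcards(policy: dict) -> list[str]:
    """Flag wildcard actions and fully unrestricted resources in an IAM policy."""
    findings = []
    for stmt in policy.get("Statement", []):
        sid = stmt.get("Sid", "<no-sid>")
        actions = stmt.get("Action", [])
        resources = stmt.get("Resource", [])
        # Both fields may be a single string or a list; normalize to lists
        actions = [actions] if isinstance(actions, str) else actions
        resources = [resources] if isinstance(resources, str) else resources
        for action in actions:
            if action == "*" or action.endswith(":*"):
                findings.append(f"{sid}: wildcard action {action!r}")
        for resource in resources:
            if resource == "*":
                findings.append(f"{sid}: unrestricted resource")
    return findings

policy = json.loads('{"Statement": [{"Sid": "Bad", "Action": "s3:*", "Resource": "*"}]}')
print(find_wildcards(policy))
```

Run it over every policy file in the repo and fail the build on any finding.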
Example 2 — PII redaction in a Python preprocessing step
Remove emails and phone numbers before writing to disk or sending to downstream stages.
import re

def redact_pii(text: str) -> str:
    email_pattern = r"[a-zA-Z0-9_.+-]+@[a-zA-Z0-9-]+\.[a-zA-Z0-9-.]+"
    phone_pattern = r"(\+?\d{1,3}[\s-]?)?(\(?\d{3}\)?[\s-]?)\d{3}[\s-]?\d{4}"
    text = re.sub(email_pattern, "[REDACTED_EMAIL]", text)
    text = re.sub(phone_pattern, "[REDACTED_PHONE]", text)
    return text

records = [
    {"id": 1, "note": "Contact jane.doe@example.com or +1-212-555-1212"},
]
clean = [{**r, "note": redact_pii(r["note"])} for r in records]
print(clean)
Redaction should happen as early as possible. Keep a tested ruleset and unit tests for PII patterns.
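A starter suite for those unit tests might look like the following; the patterns are repeated here so the snippet runs standalone, and the case list is a seed to extend, not a complete set:

```python
import re

EMAIL = re.compile(r"[a-zA-Z0-9_.+-]+@[a-zA-Z0-9-]+\.[a-zA-Z0-9-.]+")
PHONE = re.compile(r"(\+?\d{1,3}[\s-]?)?(\(?\d{3}\)?[\s-]?)\d{3}[\s-]?\d{4}")

def redact_pii(text: str) -> str:
    return PHONE.sub("[REDACTED_PHONE]", EMAIL.sub("[REDACTED_EMAIL]", text))

# Edge cases worth covering: plus-addressing, parenthesized area codes,
# country codes, mixed separators, and text that must pass through untouched.
cases = {
    "mail me at a.b+tag@ex-ample.co.uk": "mail me at [REDACTED_EMAIL]",
    "(212) 555 1212": "[REDACTED_PHONE]",
    "+44 203 555 0199 now": "[REDACTED_PHONE] now",
    "no pii here": "no pii here",
}
for text, expected in cases.items():
    assert redact_pii(text) == expected, (text, redact_pii(text))
print("all redaction cases passed")
```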
Example 3 — Store secrets outside code (Kubernetes + env)
Create a Kubernetes Secret and inject it into the container as an environment variable. Do not commit the values to Git.
# k8s secret (values are base64-encoded)
apiVersion: v1
kind: Secret
metadata:
  name: model-api-secrets
  namespace: prod
type: Opaque
data:
  TOKEN: c3VwZXJfc2VjcmV0X3Rva2Vu
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: inference
  namespace: prod
spec:
  replicas: 2
  selector: {matchLabels: {app: inference}}
  template:
    metadata: {labels: {app: inference}}
    spec:
      containers:
        - name: app
          image: ghcr.io/org/inference:1.2.3
          env:
            - name: THIRD_PARTY_TOKEN
              valueFrom:
                secretKeyRef:
                  name: model-api-secrets
                  key: TOKEN
Use a secrets manager and automated rotation. Limit who can read the Secret and audit all access.
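On the application side, read the injected variable at startup and fail fast if it is missing, and never log the value itself. A sketch using the THIRD_PARTY_TOKEN name from the Deployment above (the inline assignment only simulates what Kubernetes injects):

```python
import os

def require_env(name: str) -> str:
    """Fetch a required secret from the environment; fail fast, never echo the value."""
    value = os.environ.get(name)
    if not value:
        raise RuntimeError(f"missing required environment variable {name}")
    return value

os.environ["THIRD_PARTY_TOKEN"] = "example-only"  # in production this comes from the Secret
token = require_env("THIRD_PARTY_TOKEN")
print(f"loaded {len(token)}-char token")  # log metadata about the secret, never the secret
```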
Example 4 — NetworkPolicy to restrict egress
Deny all egress by default and allow only the model registry and metrics endpoint.
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: inference-egress
  namespace: prod
spec:
  podSelector:
    matchLabels:
      app: inference
  policyTypes: [Egress]
  egress:
    - to:
        - namespaceSelector:
            matchLabels:
              ns: platform
          podSelector:
            matchLabels:
              app: model-registry
      ports:
        - protocol: TCP
          port: 8443
    - to:
        - ipBlock:
            cidr: 10.10.20.0/24
      ports:
        - protocol: TCP
          port: 9090
Pair this with DNS allow-lists or private endpoints to avoid accidental data exfiltration. Also remember to allow DNS (port 53 to the cluster resolver), or pods under deny-by-default egress cannot resolve names at all.
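The same allow-list idea can be repeated at the application layer as defense in depth. A sketch with illustrative internal hostnames:

```python
from urllib.parse import urlparse

# Illustrative egress allow-list mirroring the NetworkPolicy above
ALLOWED_EGRESS = {
    ("registry.platform.internal", 8443),
    ("metrics.platform.internal", 9090),
}

def egress_allowed(url: str) -> bool:
    """Check an outbound URL against the (hostname, port) allow-list."""
    parts = urlparse(url)
    port = parts.port or (443 if parts.scheme == "https" else 80)
    return (parts.hostname, port) in ALLOWED_EGRESS

print(egress_allowed("https://registry.platform.internal:8443/models/churn-v3"))  # True
print(egress_allowed("https://attacker.example.com/upload"))  # False
```

Wrap your HTTP client so every outbound call passes through this check, and log (do not just drop) denials.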
Example 5 — Structured audit logging in Python
Write structured, rotating logs for traceability.
import json
import logging
from logging.handlers import RotatingFileHandler

logger = logging.getLogger("audit")
logger.setLevel(logging.INFO)
handler = RotatingFileHandler("/var/log/inference_audit.log", maxBytes=10_000_000, backupCount=5)
handler.setFormatter(logging.Formatter("%(message)s"))
logger.addHandler(handler)

def audit(event_type, **kwargs):
    logger.info(json.dumps({"event": event_type, **kwargs}))

# Example usage
audit("model_loaded", model="churn-v3", sha256="abc123")
audit("predict", user="svc-inference", request_id="r-789", status=200)
Ship logs to an append-only store. Include request IDs, actor, model version, and hashes.
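When a true append-only store is unavailable, a hash chain gives cheap tamper evidence: each entry commits to the previous entry's hash, so any rewrite breaks verification from that point on. A sketch:

```python
import hashlib
import json

def append_entry(chain: list[dict], event: dict) -> None:
    """Append an event whose hash covers both the payload and the previous hash."""
    prev = chain[-1]["hash"] if chain else "0" * 64
    payload = json.dumps(event, sort_keys=True)
    entry_hash = hashlib.sha256((prev + payload).encode()).hexdigest()
    chain.append({"prev": prev, "event": event, "hash": entry_hash})

def verify_chain(chain: list[dict]) -> bool:
    """Recompute every hash; any edit to an earlier entry breaks all later links."""
    prev = "0" * 64
    for entry in chain:
        payload = json.dumps(entry["event"], sort_keys=True)
        expected = hashlib.sha256((prev + payload).encode()).hexdigest()
        if entry["prev"] != prev or entry["hash"] != expected:
            return False
        prev = entry["hash"]
    return True

chain: list[dict] = []
append_entry(chain, {"event": "model_loaded", "model": "churn-v3"})
append_entry(chain, {"event": "predict", "request_id": "r-789"})
print(verify_chain(chain))  # True
chain[0]["event"]["model"] = "tampered"
print(verify_chain(chain))  # False
```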
Example 6 — Verify signed model image in CI
Block deployments if signature verification fails.
# Shell steps in CI; fail the pipeline if verification or scanning fails
set -euo pipefail
if ! cosign verify --key cosign.pub ghcr.io/org/inference:1.2.3; then
  echo "Signature verification failed" >&2
  exit 1
fi
# Continue only after verification succeeds
trivy image --exit-code 1 ghcr.io/org/inference:1.2.3
Combine signature verification with image scanning to reduce supply chain risk.
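The trust-before-use idea extends to model artifacts themselves: record the SHA-256 digest at publish time and verify it before loading. A sketch with a throwaway file standing in for a real artifact:

```python
import hashlib
import tempfile
from pathlib import Path

def sha256_file(path: Path) -> str:
    """Stream the file in 1 MiB blocks so large model files never load fully into memory."""
    h = hashlib.sha256()
    with path.open("rb") as f:
        for block in iter(lambda: f.read(1 << 20), b""):
            h.update(block)
    return h.hexdigest()

def load_model_verified(path: Path, expected_sha256: str) -> bytes:
    """Refuse to load an artifact whose digest does not match the published one."""
    actual = sha256_file(path)
    if actual != expected_sha256:
        raise RuntimeError(f"model digest mismatch: expected {expected_sha256}, got {actual}")
    return path.read_bytes()

# Demo: a throwaway file stands in for a real model artifact
with tempfile.TemporaryDirectory() as d:
    artifact = Path(d) / "model.bin"
    artifact.write_bytes(b"fake model weights")
    digest = sha256_file(artifact)
    print(load_model_verified(artifact, digest)[:4])  # b'fake'
```

Record the digest in the model registry at publish time and in the audit log at load time, so the two can be reconciled later.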
Drills and quick exercises
- Create a role with read access to one dataset path and verify it cannot read others.
- Write a unit test suite for your PII redaction function with 10+ edge cases.
- Rotate one API token and prove zero downtime during rotation.
- Enable deny-by-default egress for a test namespace and allowlist only two endpoints.
- Emit structured audit logs for model load and predict; include request IDs.
- Sign a container image and enforce verification in CI before deploy.
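For the zero-downtime rotation drill, the core trick is an overlap window in which both old and new tokens validate. A minimal sketch of the server-side state:

```python
class TokenValidator:
    """Accepts any token in the active set; rotation adds the new token before retiring the old."""
    def __init__(self, initial_token: str):
        self.active = {initial_token}

    def is_valid(self, token: str) -> bool:
        return token in self.active

    def begin_rotation(self, new_token: str) -> None:
        self.active.add(new_token)      # overlap window: both tokens work

    def finish_rotation(self, old_token: str) -> None:
        self.active.discard(old_token)  # retire the old token once clients have migrated

v = TokenValidator("tok-old")
v.begin_rotation("tok-new")
print(v.is_valid("tok-old"), v.is_valid("tok-new"))  # True True during overlap
v.finish_rotation("tok-old")
print(v.is_valid("tok-old"), v.is_valid("tok-new"))  # False True after rotation
```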
Common mistakes
- Overly broad IAM policies: Start with read-only and specific resources; use access advisor or logs to refine.
- Secrets in env for long periods: Rotate regularly; prefer short-lived credentials.
- Redaction too late in the pipeline: Redact before writing to disk or emitting logs.
- Open egress: Deny by default and explicitly allow required hosts/ports.
- Missing context in logs: Always log who/what/when/version/request-id.
- Skipping verification on hotfixes: Automate signature checks so they cannot be bypassed.
Debugging tips
- Permissions: Use cloud access logs to identify the exact denied action and resource.
- Secrets: Confirm mount paths/keys in pod; test with a minimal pod and env print.
- Network: Use a test pod to run curl/dig from inside the namespace to verify policies.
- Logging: Validate JSON schema with a linter before shipping; ensure time is synchronized (NTP).
Mini project: Secure ML inference service
Goal: Deploy a small inference API with end-to-end controls.
- Create a minimal model server (mock predict OK).
- Store one external API key in a secrets manager; mount it at runtime.
- Configure IAM so the service can only read its model from a single bucket/path.
- Apply a NetworkPolicy to allow registry and metrics only.
- Emit structured audit logs for startup, model load, and predict.
- Sign the image and enforce signature verification in CI.
- Document risks (e.g., data exfiltration, PII in logs) and controls; require one approval to deploy.
Acceptance checklist
- Secrets never appear in code, logs, or images.
- Denied egress traffic is visible in network logs.
- Audit logs include model hash and request IDs.
- CI fails if signature or scan fails.
- A reviewer can verify IAM and network constraints from manifests/policies.
Subskills
- Access Control And IAM Basics — Design least-privilege roles for ML pipelines; scope permissions to exact resources; plan rotation.
- PII Handling And Redaction — Identify PII and redact/mask early; set retention and deletion policies.
- Secure Secrets Storage — Keep secrets in a vault or secrets manager; automate rotation and limited blast radius.
- Network Isolation Basics — Private networking, deny-by-default egress, allow-lists, and service identity.
- Audit Logs And Governance — Structured, append-only logs; reviews, runbooks, and lifecycle policies.
- Model Risk Management Basics — Document risks/controls, define approval gates, monitor for drift/incidents.
- Secure Deployment Practices — Sign and verify artifacts, scan images, and use progressive rollouts.
Next steps
- Automate policy checks in CI to block insecure changes.
- Add runtime security (e.g., minimal base images, read-only root FS).
- Expand monitoring to include data drift, security events, and anomaly alerts.