Why this skill matters for MLOps Engineers
Kubernetes is the backbone for running ML training, batch inference, and real-time serving reliably at scale. As an MLOps Engineer, mastering it lets you ship models faster, keep services stable, scale on demand (including GPUs), and automate updates without downtime.
- Standardize deployments of inference services and pipelines
- Run batch jobs and scheduled predictions with Jobs and CronJobs
- Control costs and performance with resource requests/limits and autoscaling
- Manage configs and secrets safely for data stores and model artifacts
- Ship repeatable setups using Helm charts
- Debug networking and roll back quickly when something breaks
What you’ll be able to do
- Deploy an ML model as a scalable, monitored service behind an Ingress
- Run batch inference with Jobs/CronJobs and handle retries
- Allocate CPU/GPU resources and apply HPA for traffic spikes
- Store parameters in ConfigMaps and credentials in Secrets
- Package ML infra as Helm charts for teams to reuse
- Diagnose pod crashes, networking issues, and perform safe rollbacks
Who this is for
- Practitioners moving from notebooks to production ML
- MLOps/Platform Engineers building ML serving and batch stacks
- Data Scientists who want hands-on deployment skills
Prerequisites
- Comfort with Docker images and basic Linux shell
- Basic YAML reading/editing
- Familiarity with HTTP APIs and environment variables
- Optional: GPU basics if you plan to serve/train on GPUs
Learning path
1) Get a cluster + kubectl
Use any Kubernetes cluster (local or cloud). Practice with a dedicated namespace for this skill.
Mini task
Create a namespace named ml and set it as the default namespace for your current kubectl context.
kubectl create namespace ml
kubectl config set-context --current --namespace=ml
2) Deployments, Services, Ingress
Run a simple inference web app as a Deployment. Expose it with a Service (ClusterIP). Route external traffic via an Ingress.
3) ConfigMaps and Secrets
Externalize configuration (e.g., model path, feature switches) and keep credentials in Secrets.
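A minimal sketch of this split, assuming a hypothetical iris-config ConfigMap consumed by the inference Deployment from the worked examples:
apiVersion: v1
kind: ConfigMap
metadata:
  name: iris-config
data:
  MODEL_PATH: /models/iris.onnx
  LOG_LEVEL: info
The Deployment can then load every key as an environment variable instead of hard-coding values in the pod spec:
# Inside the container spec
envFrom:
  - configMapRef:
      name: iris-config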
4) Jobs and CronJobs
Use Jobs for one-off batch inference and CronJobs for scheduled predictions or ETL-like preprocessing.
5) Resources and Autoscaling
Set requests/limits for predictable performance and enable HPA to scale pods automatically.
6) Helm Basics
Package manifests into a chart with values for environments (dev/stage/prod).
7) Debugging pods and networking
Learn to inspect events, logs, and connectivity to resolve common breaks quickly.
8) Rolling updates and rollbacks
Release new model images safely, monitor, and revert instantly if needed.
Worked examples
1) Real-time inference: Deployment + Service + Ingress
This runs a simple HTTP inference server listening on port 8080.
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: iris-inference
spec:
  replicas: 2
  selector:
    matchLabels: { app: iris }
  template:
    metadata:
      labels: { app: iris }
    spec:
      containers:
        - name: server
          image: ghcr.io/example/iris-inference:1.0
          ports:
            - containerPort: 8080
          env:
            - name: MODEL_PATH
              value: /models/iris.onnx
          resources:
            requests: { cpu: "250m", memory: "256Mi" }
            limits: { cpu: "500m", memory: "512Mi" }
          readinessProbe:
            httpGet: { path: /health, port: 8080 }
            initialDelaySeconds: 5
          livenessProbe:
            httpGet: { path: /health, port: 8080 }
            initialDelaySeconds: 10
---
apiVersion: v1
kind: Service
metadata:
  name: iris-svc
spec:
  selector: { app: iris }
  ports:
    - port: 80
      targetPort: 8080
      protocol: TCP
      name: http
  type: ClusterIP
---
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: iris-ingress
spec:
  rules:
    - host: ml.local
      http:
        paths:
          - path: /predict
            pathType: Prefix
            backend:
              service:
                name: iris-svc
                port:
                  number: 80
Smoke test
Port-forward the Service and test with curl.
kubectl port-forward svc/iris-svc 8080:80 &
curl -X POST http://localhost:8080/predict -d '{"x":[5.1,3.5,1.4,0.2]}' -H 'Content-Type: application/json'
2) Batch inference with a Job
Run inference for a dataset in object storage. Credentials are provided via a Secret.
---
apiVersion: v1
kind: Secret
metadata:
  name: s3-cred
type: Opaque
stringData:
  AWS_ACCESS_KEY_ID: "minio"
  AWS_SECRET_ACCESS_KEY: "miniosecret"
---
apiVersion: batch/v1
kind: Job
metadata:
  name: batch-infer
spec:
  backoffLimit: 2
  template:
    spec:
      restartPolicy: Never
      containers:
        - name: runner
          image: ghcr.io/example/batch-infer:1.0
          env:
            - name: INPUT_URI
              value: s3://datasets/iris.csv
            - name: OUTPUT_URI
              value: s3://predictions/iris.jsonl
            - name: AWS_REGION
              value: us-east-1
            - name: AWS_ACCESS_KEY_ID
              valueFrom: { secretKeyRef: { name: s3-cred, key: AWS_ACCESS_KEY_ID } }
            - name: AWS_SECRET_ACCESS_KEY
              valueFrom: { secretKeyRef: { name: s3-cred, key: AWS_SECRET_ACCESS_KEY } }
Tips
- Use backoffLimit for retries
- Add activeDeadlineSeconds to cap long runs
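CronJobs (learning path step 4) reuse the same pod template. A minimal nightly sketch, assuming the batch-infer image and s3-cred Secret from this example:
apiVersion: batch/v1
kind: CronJob
metadata:
  name: nightly-infer
spec:
  schedule: "0 2 * * *"
  concurrencyPolicy: Forbid
  jobTemplate:
    spec:
      backoffLimit: 2
      template:
        spec:
          restartPolicy: Never
          containers:
            - name: runner
              image: ghcr.io/example/batch-infer:1.0
              envFrom:
                - secretRef:
                    name: s3-cred
concurrencyPolicy: Forbid prevents overlapping runs if one night's job overshoots its window.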
3) GPU-enabled inference
Schedule a pod that needs one GPU. Requires the GPU device plugin installed on GPU nodes.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: torch-infer-gpu
spec:
  replicas: 1
  selector:
    matchLabels: { app: torch-gpu }
  template:
    metadata:
      labels: { app: torch-gpu }
    spec:
      nodeSelector:
        accelerator: nvidia
      containers:
        - name: server
          image: ghcr.io/example/torch-infer:2.0
          resources:
            limits:
              nvidia.com/gpu: 1
            requests:
              cpu: "500m"
              memory: "1Gi"
          ports:
            - containerPort: 8080
Note
Request GPUs via limits only; for extended resources such as nvidia.com/gpu, Kubernetes sets the request equal to the limit, so no separate request entry is needed.
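If your GPU nodes are also tainted to keep general workloads off them (common but not universal; the taint key below is an assumption), add a matching toleration next to the nodeSelector:
tolerations:
  - key: nvidia.com/gpu
    operator: Exists
    effect: NoSchedule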
4) Autoscaling with HPA (CPU-based)
Scale pods from 2 up to 10 when average CPU exceeds 60%.
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: iris-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: iris-inference
  minReplicas: 2
  maxReplicas: 10
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 60
Check scaling
kubectl get hpa
kubectl describe hpa iris-hpa
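The HPA only reacts to sustained load, so generate some before watching it. One low-tech sketch, assuming the iris-svc Service from example 1 (adjust the request to match your server's API):
kubectl run hpa-load --image=busybox --restart=Never -- /bin/sh -c "while true; do wget -q -O- http://iris-svc/predict; done"
kubectl get hpa iris-hpa --watch
kubectl delete pod hpa-load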
5) Helm chart skeleton for an inference service
# Chart.yaml
apiVersion: v2
name: iris-infer
version: 0.1.0
---
# values.yaml
replicaCount: 2
image:
  repository: ghcr.io/example/iris-inference
  tag: "1.0"
resources:
  requests: { cpu: "250m", memory: "256Mi" }
  limits: { cpu: "500m", memory: "512Mi" }
service:
  port: 80
  targetPort: 8080
ingress:
  enabled: true
  host: ml.local
  path: /predict
---
# templates/deployment.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: {{ include "iris-infer.fullname" . }}
spec:
  replicas: {{ .Values.replicaCount }}
  selector:
    matchLabels:
      app: {{ include "iris-infer.name" . }}
  template:
    metadata:
      labels:
        app: {{ include "iris-infer.name" . }}
    spec:
      containers:
        - name: app
          image: {{ .Values.image.repository }}:{{ .Values.image.tag }}
          ports:
            - containerPort: {{ .Values.service.targetPort }}
          resources: {{- toYaml .Values.resources | nindent 12 }}
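The values.yaml above also drives a Service, so a matching template belongs next to deployment.yaml. A minimal sketch, reusing the iris-infer.* helpers (which assume a _helpers.tpl not shown in this skeleton):
# templates/service.yaml
apiVersion: v1
kind: Service
metadata:
  name: {{ include "iris-infer.fullname" . }}
spec:
  selector:
    app: {{ include "iris-infer.name" . }}
  ports:
    - port: {{ .Values.service.port }}
      targetPort: {{ .Values.service.targetPort }}
      protocol: TCP
      name: http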
Install
helm install iris ./iris-infer
helm upgrade --install iris ./iris-infer -f values.yaml
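Individual values can also be overridden at install time with --set, e.g. to pin a different image tag without editing values.yaml:
helm upgrade --install iris ./iris-infer --set image.tag=1.1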
6) Rolling update and rollback
# Update the image
kubectl set image deployment/iris-inference server=ghcr.io/example/iris-inference:1.1
kubectl rollout status deployment/iris-inference
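# Optional: list recorded revisions before choosing a rollback target
kubectl rollout history deployment/iris-inference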
# If errors spike, roll back
kubectl rollout undo deployment/iris-inference --to-revision=1
Drills and exercises
- Create a Deployment for a toy FastAPI model server with a ConfigMap-based MODEL_NAME
- Add a readinessProbe that returns 200 only after the model loads
- Expose via Service and Ingress and verify with port-forward
- Create a Job that reads from a PVC and writes predictions back to the same PVC
- Set requests/limits and observe pod scheduling behavior when the node is under resource pressure
- Configure an HPA to scale from 1 to 5 on CPU 70%
- Templatize the setup with Helm and override image.tag via -f or --set
- Break something on purpose (wrong port) and use kubectl describe and logs to fix it
Common mistakes and debugging tips
- Missing readinessProbe causes traffic to hit cold pods. Fix: add a lightweight health endpoint and readinessProbe.
- No resource requests leads to noisy-neighbor issues. Fix: set minimal CPU/memory requests aligned with baseline load.
- Using latest image tags breaks reproducibility. Fix: pin semantic versions and label Deployments with the model version.
- Secrets in ConfigMaps leak credentials. Fix: store credentials in Secrets and mount as env or files.
- HPA not scaling. Fix: ensure metrics-server is running and target Deployment has resource requests.
- Ingress returns 404. Fix: verify path, service port/name, and that the Ingress controller is installed and running.
- GPU pod pending. Fix: check GPU node labels, device plugin, taints/tolerations, and nvidia.com/gpu limits.
Quick debug toolkit
kubectl get events --sort-by=.lastTimestamp
kubectl describe pod <name>
kubectl logs <name> -c <container>
kubectl exec -it <name> -- sh
kubectl get endpoints <service>
nslookup iris-svc
curl -v http://iris-svc/predict
Mini project: Production-ready ML inference
Goal: Package and deploy an iris classifier with safe rollouts, config separation, and autoscaling.
- Containerize the model server with a /predict route and a /health check.
- Create a ConfigMap for non-secret params and a Secret for any credentials.
- Deploy via Helm with values-dev.yaml and values-prod.yaml (a sketch of a prod override follows this list).
- Enable HPA (CPU 60% target, 2–10 replicas).
- Expose through Ingress at ml.local/predict.
- Perform a rolling update to version 1.1; monitor and roll back as practice.
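A prod override only needs to pin the differences from the chart defaults; the values below are illustrative placeholders, not recommendations:
# values-prod.yaml
replicaCount: 4
image:
  tag: "1.1"
ingress:
  host: ml.example.com
resources:
  requests: { cpu: "500m", memory: "512Mi" }
  limits: { cpu: "1", memory: "1Gi" }
Apply it with helm upgrade --install iris ./iris-infer -f values-prod.yaml.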
Acceptance checklist
- Zero-downtime rollout verified
- Config changes do not require image rebuild
- Secrets are not committed to source control
- Autoscaling observed under load test
Practical projects
- Scheduled batch predictions: nightly CronJob writing results to storage with retry policy
- GPU A/B serving: two Deployments with different model variants, traffic split at Ingress level
- Feature precompute pipeline: Job chain triggered via CronJob and messaging layer (simulate with separate Jobs)
Subskills
Focus areas for this skill:
- Deployments, Services, Ingress Basics — run and expose ML services safely
- Jobs, CronJobs for Batch Inference — one-off and scheduled predictions
- Resource Requests, Limits, Autoscaling — predictable performance and cost control
- ConfigMaps and Secrets — clean config separation and secure credentials
- Helm Basics — reusable, parameterized deployments
- Debugging Pods and Networking — fast incident resolution
- Rolling Updates and Rollbacks — safe releases and instant reverts
Next steps
- Add observability: logs, metrics, and dashboards
- Introduce canary or blue/green strategies
- Automate CI/CD to build, test, and deploy charts per commit