
Kubernetes For ML Workloads

Learn Kubernetes For ML Workloads as an MLOps Engineer, for free: roadmap, examples, subskills, and a skill exam.

Published: January 4, 2026 | Updated: January 4, 2026

Why this skill matters for MLOps Engineers

Kubernetes is the backbone for running ML training, batch inference, and real-time serving reliably at scale. As an MLOps Engineer, mastering it lets you ship models faster, keep services stable, scale on demand (including GPUs), and automate updates without downtime.

  • Standardize deployments of inference services and pipelines
  • Run batch jobs and scheduled predictions with Jobs and CronJobs
  • Control costs and performance with resource requests/limits and autoscaling
  • Manage configs and secrets safely for data stores and model artifacts
  • Ship repeatable setups using Helm charts
  • Debug networking and roll back quickly when something breaks

What you’ll be able to do

  • Deploy an ML model as a scalable, monitored service behind an Ingress
  • Run batch inference with Jobs/CronJobs and handle retries
  • Allocate CPU/GPU resources and apply HPA for traffic spikes
  • Store parameters in ConfigMaps and credentials in Secrets
  • Package ML infra as Helm charts for teams to reuse
  • Diagnose pod crashes, networking issues, and perform safe rollbacks

Who this is for

  • Practitioners moving from notebooks to production ML
  • MLOps/Platform Engineers building ML serving and batch stacks
  • Data Scientists who want hands-on deployment skills

Prerequisites

  • Comfort with Docker images and basic Linux shell
  • Basic YAML reading/editing
  • Familiarity with HTTP APIs and environment variables
  • Optional: GPU basics if you plan to serve/train on GPUs

Learning path

1) Get a cluster + kubectl

Use any Kubernetes cluster (local or cloud). Practice with a dedicated namespace for this skill.

Mini task

Create a namespace named ml and set it as the default namespace for your current kubectl context.

kubectl create namespace ml
kubectl config set-context --current --namespace=ml

2) Deployments, Services, Ingress

Run a simple inference web app as a Deployment. Expose it with a Service (ClusterIP). Route external traffic via an Ingress.

3) ConfigMaps and Secrets

Externalize configuration (e.g., model path, feature switches) and keep credentials in Secrets.
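
A minimal sketch of what this looks like (the iris-config name and keys are illustrative):

apiVersion: v1
kind: ConfigMap
metadata:
  name: iris-config
data:
  MODEL_PATH: /models/iris.onnx
  LOG_LEVEL: info

The container can then pull every key in as environment variables:

containers:
  - name: server
    image: ghcr.io/example/iris-inference:1.0
    envFrom:
      - configMapRef:
          name: iris-config   # injects all keys as env vars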

4) Jobs and CronJobs

Use Jobs for one-off batch inference and CronJobs for scheduled predictions or ETL-like preprocessing.
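
A nightly schedule could look roughly like this (schedule and names are illustrative; the image reuses the batch example further below):

apiVersion: batch/v1
kind: CronJob
metadata:
  name: nightly-predictions
spec:
  schedule: "0 2 * * *"          # every day at 02:00
  concurrencyPolicy: Forbid      # skip a run if the previous one is still going
  jobTemplate:
    spec:
      backoffLimit: 2
      template:
        spec:
          restartPolicy: Never
          containers:
            - name: runner
              image: ghcr.io/example/batch-infer:1.0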

5) Resources and Autoscaling

Set requests/limits for predictable performance and enable HPA to scale pods automatically.

6) Helm Basics

Package manifests into a chart with values for environments (dev/stage/prod).

7) Debugging pods and networking

Learn to inspect events, logs, and connectivity to resolve common breaks quickly.

8) Rolling updates and rollbacks

Release new model images safely, monitor, and revert instantly if needed.

Worked examples

1) Real-time inference: Deployment + Service + Ingress

This runs a simple HTTP inference server listening on port 8080.

---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: iris-inference
spec:
  replicas: 2
  selector:
    matchLabels: { app: iris }
  template:
    metadata:
      labels: { app: iris }
    spec:
      containers:
        - name: server
          image: ghcr.io/example/iris-inference:1.0
          ports:
            - containerPort: 8080
          env:
            - name: MODEL_PATH
              value: /models/iris.onnx
          resources:
            requests: { cpu: "250m", memory: "256Mi" }
            limits:   { cpu: "500m", memory: "512Mi" }
          readinessProbe:
            httpGet: { path: /health, port: 8080 }
            initialDelaySeconds: 5
          livenessProbe:
            httpGet: { path: /health, port: 8080 }
            initialDelaySeconds: 10
---
apiVersion: v1
kind: Service
metadata:
  name: iris-svc
spec:
  selector: { app: iris }
  ports:
    - port: 80
      targetPort: 8080
      protocol: TCP
      name: http
  type: ClusterIP
---
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: iris-ingress
spec:
  rules:
    - host: ml.local
      http:
        paths:
          - path: /predict
            pathType: Prefix
            backend:
              service:
                name: iris-svc
                port:
                  number: 80
Smoke test

Port-forward the Service and test with curl.

kubectl port-forward svc/iris-svc 8080:80 &
curl -X POST http://localhost:8080/predict -d '{"x":[5.1,3.5,1.4,0.2]}' -H 'Content-Type: application/json'

2) Batch inference with a Job

Run inference for a dataset in object storage. Credentials are provided via a Secret.

---
apiVersion: v1
kind: Secret
metadata:
  name: s3-cred
type: Opaque
stringData:
  AWS_ACCESS_KEY_ID: "minio"
  AWS_SECRET_ACCESS_KEY: "miniosecret"
---
apiVersion: batch/v1
kind: Job
metadata:
  name: batch-infer
spec:
  backoffLimit: 2
  template:
    spec:
      restartPolicy: Never
      containers:
        - name: runner
          image: ghcr.io/example/batch-infer:1.0
          env:
            - name: INPUT_URI
              value: s3://datasets/iris.csv
            - name: OUTPUT_URI
              value: s3://predictions/iris.jsonl
            - name: AWS_REGION
              value: us-east-1
            - name: AWS_ACCESS_KEY_ID
              valueFrom: { secretKeyRef: { name: s3-cred, key: AWS_ACCESS_KEY_ID } }
            - name: AWS_SECRET_ACCESS_KEY
              valueFrom: { secretKeyRef: { name: s3-cred, key: AWS_SECRET_ACCESS_KEY } }
Tips
  • Use backoffLimit for retries
  • Add activeDeadlineSeconds to cap long runs
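
For example, capping the Job above at one hour (the value is illustrative):

spec:
  backoffLimit: 2
  activeDeadlineSeconds: 3600   # fail the Job if it runs longer than 1 hour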

3) GPU-enabled inference

Schedule a pod that needs one GPU. Requires the GPU device plugin installed on GPU nodes.

apiVersion: apps/v1
kind: Deployment
metadata:
  name: torch-infer-gpu
spec:
  replicas: 1
  selector:
    matchLabels: { app: torch-gpu }
  template:
    metadata:
      labels: { app: torch-gpu }
    spec:
      nodeSelector:
        accelerator: nvidia
      containers:
        - name: server
          image: ghcr.io/example/torch-infer:2.0
          resources:
            limits:
              nvidia.com/gpu: 1
            requests:
              cpu: "500m"
              memory: "1Gi"
          ports:
            - containerPort: 8080
Note

Request GPUs via limits; for extended resources such as nvidia.com/gpu, Kubernetes treats the request as equal to the limit (if you set both, they must match), so GPUs cannot be overcommitted.
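
If your GPU nodes are tainted (common on managed clusters), the pod also needs a matching toleration. A sketch assuming the commonly used nvidia.com/gpu taint key:

spec:
  nodeSelector:
    accelerator: nvidia
  tolerations:
    - key: nvidia.com/gpu       # assumes nodes are tainted with this key
      operator: Exists
      effect: NoSchedule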

4) Autoscaling with HPA (CPU-based)

Scale pods from 2 up to 10 when average CPU exceeds 60%.

apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: iris-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: iris-inference
  minReplicas: 2
  maxReplicas: 10
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 60
Check scaling
kubectl get hpa
kubectl describe hpa iris-hpa
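
To watch the HPA react, generate traffic from inside the cluster and watch the replica count. A crude sketch with a throwaway busybox pod (a realistic load test should POST valid payloads to /predict):

kubectl run load-gen --rm -it --image=busybox:1.36 --restart=Never -- \
  /bin/sh -c "while true; do wget -q -O- http://iris-svc/health > /dev/null; done"
kubectl get hpa iris-hpa --watch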

5) Helm chart skeleton for an inference service

# Chart.yaml
apiVersion: v2
name: iris-infer
version: 0.1.0
---
# values.yaml
replicaCount: 2
image:
  repository: ghcr.io/example/iris-inference
  tag: "1.0"
resources:
  requests: { cpu: "250m", memory: "256Mi" }
  limits:   { cpu: "500m", memory: "512Mi" }
service:
  port: 80
  targetPort: 8080
ingress:
  enabled: true
  host: ml.local
  path: /predict
---
# templates/deployment.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: {{ include "iris-infer.fullname" . }}
spec:
  replicas: {{ .Values.replicaCount }}
  selector:
    matchLabels:
      app: {{ include "iris-infer.name" . }}
  template:
    metadata:
      labels:
        app: {{ include "iris-infer.name" . }}
    spec:
      containers:
        - name: app
          image: {{ .Values.image.repository }}:{{ .Values.image.tag }}
          ports:
            - containerPort: {{ .Values.service.targetPort }}
          resources: {{- toYaml .Values.resources | nindent 12 }}
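
The deployment template calls include "iris-infer.name" and "iris-infer.fullname", which Helm does not define for you; a minimal templates/_helpers.tpl sketch that provides them:

# templates/_helpers.tpl
{{- define "iris-infer.name" -}}
{{- .Chart.Name | trunc 63 | trimSuffix "-" -}}
{{- end -}}

{{- define "iris-infer.fullname" -}}
{{- printf "%s-%s" .Release.Name .Chart.Name | trunc 63 | trimSuffix "-" -}}
{{- end -}}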
Install
helm install iris ./iris-infer
helm upgrade --install iris ./iris-infer -f values.yaml

6) Rolling update and rollback

# Update the image
kubectl set image deployment/iris-inference server=ghcr.io/example/iris-inference:1.1
kubectl rollout status deployment/iris-inference

# If errors spike, roll back
kubectl rollout undo deployment/iris-inference --to-revision=1
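
To see which revision number to target (revision numbers depend on your rollout history; plain kubectl rollout undo with no flag goes back one revision):

kubectl rollout history deployment/iris-inference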

Drills and exercises

  • Create a Deployment for a toy FastAPI model server with a ConfigMap-based MODEL_NAME
  • Add a readinessProbe that returns 200 only after the model loads
  • Expose via Service and Ingress and verify with port-forward
  • Create a Job that reads from a PVC and writes predictions back to the same PVC
  • Set requests/limits and observe pod scheduling behavior when the node is under resource pressure
  • Configure an HPA to scale from 1 to 5 on CPU 70%
  • Templatize the setup with Helm and override image.tag via -f or --set
  • Break something on purpose (wrong port) and use kubectl describe and logs to fix it

Common mistakes and debugging tips

  • Missing readinessProbe causes traffic to hit cold pods. Fix: add a lightweight health endpoint and readinessProbe.
  • No resource requests leads to noisy-neighbor issues. Fix: set minimal CPU/memory requests aligned with baseline load.
  • Using latest image tags breaks reproducibility. Fix: pin semantic versions and label Deployments with the model version.
  • Secrets in ConfigMaps leak credentials. Fix: store credentials in Secrets and mount as env or files.
  • HPA not scaling. Fix: ensure metrics-server is running and target Deployment has resource requests.
  • Ingress returns 404. Fix: verify path, service port/name, and that the Ingress controller is installed and running.
  • GPU pod pending. Fix: check GPU node labels, device plugin, taints/tolerations, and nvidia.com/gpu limits.
Quick debug toolkit
kubectl get events --sort-by=.lastTimestamp
kubectl describe pod <name>
kubectl logs <name> -c <container>
kubectl exec -it <name> -- sh
kubectl get endpoints <service>
nslookup iris-svc
curl -v http://iris-svc/predict
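
The DNS and HTTP checks above use the Service name, so they must run from inside the cluster. A quick way is a throwaway pod (image and pod name are illustrative; busybox ships wget instead of curl):

kubectl run net-debug --rm -it --image=busybox:1.36 --restart=Never -- sh
# inside the pod:
nslookup iris-svc
wget -qO- http://iris-svc/predict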

Mini project: Production-ready ML inference

Goal: Package and deploy an iris classifier with safe rollouts, config separation, and autoscaling.

  1. Containerize the model server with a /predict route and a /health check.
  2. Create a ConfigMap for non-secret params and a Secret for any credentials.
  3. Deploy via Helm with values-dev.yaml and values-prod.yaml.
  4. Enable HPA (CPU 60% target, 2–10 replicas).
  5. Expose through Ingress at ml.local/predict.
  6. Perform a rolling update to version 1.1; monitor and roll back as practice.
Acceptance checklist
  • Zero-downtime rollout verified
  • Config changes do not require image rebuild
  • Secrets are not committed to source control
  • Autoscaling observed under load test

Practical projects

  • Scheduled batch predictions: nightly CronJob writing results to storage with retry policy
  • GPU A/B serving: two Deployments with different model variants, traffic split at Ingress level
  • Feature precompute pipeline: Job chain triggered via CronJob and messaging layer (simulate with separate Jobs)

Subskills

Focus areas for this skill:

  • Deployments, Services, Ingress Basics — run and expose ML services safely
  • Jobs, CronJobs for Batch Inference — one-off and scheduled predictions
  • Resource Requests, Limits, Autoscaling — predictable performance and cost control
  • ConfigMaps and Secrets — clean config separation and secure credentials
  • Helm Basics — reusable, parameterized deployments
  • Debugging Pods and Networking — fast incident resolution
  • Rolling Updates and Rollbacks — safe releases and instant reverts

Next steps

  • Add observability: logs, metrics, and dashboards
  • Introduce canary or blue/green strategies
  • Automate CI/CD to build, test, and deploy charts per commit

Kubernetes For ML Workloads — Skill Exam

This exam checks practical understanding of deploying and operating ML workloads on Kubernetes. You can take it for free. Your progress and best score are saved if you are logged in; guests can still complete the exam, but results won't be saved. Approx. 12–15 questions. Pass score: 70%. Open notes and terminal allowed.

