Why this skill matters for MLOps Engineers
Kubernetes is the backbone for running ML training, batch inference, and real-time serving reliably at scale. As an MLOps Engineer, mastering it lets you ship models faster, keep services stable, scale on demand (including GPUs), and automate updates without downtime.
- Standardize deployments of inference services and pipelines
- Run batch jobs and scheduled predictions with Jobs and CronJobs
- Control costs and performance with resource requests/limits and autoscaling
- Manage configs and secrets safely for data stores and model artifacts
- Ship repeatable setups using Helm charts
- Debug networking and roll back quickly when something breaks
What you’ll be able to do
- Deploy an ML model as a scalable, monitored service behind an Ingress
- Run batch inference with Jobs/CronJobs and handle retries
- Allocate CPU/GPU resources and apply HPA for traffic spikes
- Store parameters in ConfigMaps and credentials in Secrets
- Package ML infra as Helm charts for teams to reuse
- Diagnose pod crashes, networking issues, and perform safe rollbacks
Who this is for
- Practitioners moving from notebooks to production ML
- MLOps/Platform Engineers building ML serving and batch stacks
- Data Scientists who want hands-on deployment skills
Prerequisites
- Comfort with Docker images and basic Linux shell
- Basic YAML reading/editing
- Familiarity with HTTP APIs and environment variables
- Optional: GPU basics if you plan to serve/train on GPUs
Learning path
1) Get a cluster + kubectl
Use any Kubernetes cluster (local or cloud). Practice with a dedicated namespace for this skill.
Mini task
Create a namespace named ml and set it as the default namespace for your current kubectl context.
kubectl create namespace ml
kubectl config set-context --current --namespace=ml
2) Deployments, Services, Ingress
Run a simple inference web app as a Deployment. Expose it with a Service (ClusterIP). Route external traffic via an Ingress.
3) ConfigMaps and Secrets
Externalize configuration (e.g., model path, feature switches) and keep credentials in Secrets.
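A minimal sketch of this split, assuming a hypothetical iris-config ConfigMap consumed by the inference Deployment from the worked examples:
apiVersion: v1
kind: ConfigMap
metadata:
  name: iris-config
data:
  MODEL_PATH: /models/iris.onnx
  LOG_LEVEL: info
The Deployment can then load every key as an environment variable instead of hard-coding values in the pod spec:
# Inside the container spec
envFrom:
  - configMapRef:
      name: iris-config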
4) Jobs and CronJobs
Use Jobs for one-off batch inference and CronJobs for scheduled predictions or ETL-like preprocessing.
5) Resources and Autoscaling
Set requests/limits for predictable performance and enable HPA to scale pods automatically.
6) Helm Basics
Package manifests into a chart with values for environments (dev/stage/prod).
7) Debugging pods and networking
Learn to inspect events, logs, and connectivity to resolve common breaks quickly.
8) Rolling updates and rollbacks
Release new model images safely, monitor, and revert instantly if needed.
Worked examples
1) Real-time inference: Deployment + Service + Ingress
This runs a simple HTTP inference server listening on port 8080.
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: iris-inference
spec:
  replicas: 2
  selector:
    matchLabels: { app: iris }
  template:
    metadata:
      labels: { app: iris }
    spec:
      containers:
        - name: server
          image: ghcr.io/example/iris-inference:1.0
          ports:
            - containerPort: 8080
          env:
            - name: MODEL_PATH
              value: /models/iris.onnx
          resources:
            requests: { cpu: "250m", memory: "256Mi" }
            limits: { cpu: "500m", memory: "512Mi" }
          readinessProbe:
            httpGet: { path: /health, port: 8080 }
            initialDelaySeconds: 5
          livenessProbe:
            httpGet: { path: /health, port: 8080 }
            initialDelaySeconds: 10
---
apiVersion: v1
kind: Service
metadata:
  name: iris-svc
spec:
  selector: { app: iris }
  ports:
    - port: 80
      targetPort: 8080
      protocol: TCP
      name: http
  type: ClusterIP
---
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: iris-ingress
spec:
  rules:
    - host: ml.local
      http:
        paths:
          - path: /predict
            pathType: Prefix
            backend:
              service:
                name: iris-svc
                port:
                  number: 80
Smoke test
Port-forward the Service and test with curl.
kubectl port-forward svc/iris-svc 8080:80 &
curl -X POST http://localhost:8080/predict -d '{"x":[5.1,3.5,1.4,0.2]}' -H 'Content-Type: application/json'
2) Batch inference with a Job
Run inference for a dataset in object storage. Credentials are provided via a Secret.
---
apiVersion: v1
kind: Secret
metadata:
  name: s3-cred
type: Opaque
stringData:
  AWS_ACCESS_KEY_ID: "minio"
  AWS_SECRET_ACCESS_KEY: "miniosecret"
---
apiVersion: batch/v1
kind: Job
metadata:
  name: batch-infer
spec:
  backoffLimit: 2
  template:
    spec:
      restartPolicy: Never
      containers:
        - name: runner
          image: ghcr.io/example/batch-infer:1.0
          env:
            - name: INPUT_URI
              value: s3://datasets/iris.csv
            - name: OUTPUT_URI
              value: s3://predictions/iris.jsonl
            - name: AWS_REGION
              value: us-east-1
            - name: AWS_ACCESS_KEY_ID
              valueFrom: { secretKeyRef: { name: s3-cred, key: AWS_ACCESS_KEY_ID } }
            - name: AWS_SECRET_ACCESS_KEY
              valueFrom: { secretKeyRef: { name: s3-cred, key: AWS_SECRET_ACCESS_KEY } }
Tips
- Use backoffLimit for retries
- Add activeDeadlineSeconds to cap long runs
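CronJobs (learning path step 4) reuse the same pod template. A minimal nightly sketch, assuming the batch-infer image and s3-cred Secret from this example:
apiVersion: batch/v1
kind: CronJob
metadata:
  name: nightly-infer
spec:
  schedule: "0 2 * * *"
  concurrencyPolicy: Forbid
  jobTemplate:
    spec:
      backoffLimit: 2
      template:
        spec:
          restartPolicy: Never
          containers:
            - name: runner
              image: ghcr.io/example/batch-infer:1.0
              envFrom:
                - secretRef:
                    name: s3-cred
concurrencyPolicy: Forbid prevents overlapping runs if one night's job overshoots its window.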
3) GPU-enabled inference
Schedule a pod that needs one GPU. Requires the GPU device plugin installed on GPU nodes.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: torch-infer-gpu
spec:
  replicas: 1
  selector:
    matchLabels: { app: torch-gpu }
  template:
    metadata:
      labels: { app: torch-gpu }
    spec:
      nodeSelector:
        accelerator: nvidia
      containers:
        - name: server
          image: ghcr.io/example/torch-infer:2.0
          resources:
            limits:
              nvidia.com/gpu: 1
            requests:
              cpu: "500m"
              memory: "1Gi"
          ports:
            - containerPort: 8080
Note
Request GPUs via limits only; for extended resources such as nvidia.com/gpu, Kubernetes sets the request equal to the limit, so no separate request entry is needed.
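If your GPU nodes are also tainted to keep general workloads off them (common but not universal; the taint key below is an assumption), add a matching toleration next to the nodeSelector:
tolerations:
  - key: nvidia.com/gpu
    operator: Exists
    effect: NoSchedule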
4) Autoscaling with HPA (CPU-based)
Scale pods from 2 up to 10 when average CPU exceeds 60%.
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: iris-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: iris-inference
  minReplicas: 2
  maxReplicas: 10
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 60
Check scaling
kubectl get hpa
kubectl describe hpa iris-hpa
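The HPA only reacts to sustained load, so generate some before watching it. One low-tech sketch, assuming the iris-svc Service from example 1 (adjust the request to match your server's API):
kubectl run hpa-load --image=busybox --restart=Never -- /bin/sh -c "while true; do wget -q -O- http://iris-svc/predict; done"
kubectl get hpa iris-hpa --watch
kubectl delete pod hpa-load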
5) Helm chart skeleton for an inference service
# Chart.yaml
apiVersion: v2
name: iris-infer
version: 0.1.0
---
# values.yaml
replicaCount: 2
image:
  repository: ghcr.io/example/iris-inference
  tag: "1.0"
resources:
  requests: { cpu: "250m", memory: "256Mi" }
  limits: { cpu: "500m", memory: "512Mi" }
service:
  port: 80
  targetPort: 8080
ingress:
  enabled: true
  host: ml.local
  path: /predict
---
# templates/deployment.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: {{ include "iris-infer.fullname" . }}
spec:
  replicas: {{ .Values.replicaCount }}
  selector:
    matchLabels:
      app: {{ include "iris-infer.name" . }}
  template:
    metadata:
      labels:
        app: {{ include "iris-infer.name" . }}
    spec:
      containers:
        - name: app
          image: {{ .Values.image.repository }}:{{ .Values.image.tag }}
          ports:
            - containerPort: {{ .Values.service.targetPort }}
          resources: {{- toYaml .Values.resources | nindent 12 }}
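The values.yaml above also drives a Service, so a matching template belongs next to deployment.yaml. A minimal sketch, reusing the iris-infer.* helpers (which assume a _helpers.tpl not shown in this skeleton):
# templates/service.yaml
apiVersion: v1
kind: Service
metadata:
  name: {{ include "iris-infer.fullname" . }}
spec:
  selector:
    app: {{ include "iris-infer.name" . }}
  ports:
    - port: {{ .Values.service.port }}
      targetPort: {{ .Values.service.targetPort }}
      protocol: TCP
      name: http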
Install
helm install iris ./iris-infer
helm upgrade --install iris ./iris-infer -f values.yaml
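Individual values can also be overridden at install time with --set, e.g. to pin a different image tag without editing values.yaml:
helm upgrade --install iris ./iris-infer --set image.tag=1.1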
6) Rolling update and rollback
# Update the image
kubectl set image deployment/iris-inference server=ghcr.io/example/iris-inference:1.1
kubectl rollout status deployment/iris-inference
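# Optional: list recorded revisions before choosing a rollback target
kubectl rollout history deployment/iris-inference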
# If errors spike, roll back
kubectl rollout undo deployment/iris-inference --to-revision=1
Drills and exercises
- Create a Deployment for a toy FastAPI model server with a ConfigMap-based MODEL_NAME
- Add a readinessProbe that returns 200 only after the model loads
- Expose via Service and Ingress and verify with port-forward
- Create a Job that reads from a PVC and writes predictions back to the same PVC
- Set requests/limits and observe pod scheduling behavior when the node is under resource pressure
- Configure an HPA to scale from 1 to 5 on CPU 70%
- Templatize the setup with Helm and override image.tag via -f or --set
- Break something on purpose (wrong port) and use kubectl describe and logs to fix it
Common mistakes and debugging tips
- Missing readinessProbe causes traffic to hit cold pods. Fix: add a lightweight health endpoint and readinessProbe.
- No resource requests leads to noisy-neighbor issues. Fix: set minimal CPU/memory requests aligned with baseline load.
- Using latest image tags breaks reproducibility. Fix: pin semantic versions and label Deployments with the model version.
- Secrets in ConfigMaps leak credentials. Fix: store credentials in Secrets and mount as env or files.
- HPA not scaling. Fix: ensure metrics-server is running and target Deployment has resource requests.
- Ingress returns 404. Fix: verify path, service port/name, and that the Ingress controller is installed and running.
- GPU pod pending. Fix: check GPU node labels, device plugin, taints/tolerations, and nvidia.com/gpu limits.
Quick debug toolkit
kubectl get events --sort-by=.lastTimestamp
kubectl describe pod <name>
kubectl logs <name> -c <container>
kubectl exec -it <name> -- sh
kubectl get endpoints <service>
nslookup iris-svc
curl -v http://iris-svc/predict
Mini project: Production-ready ML inference
Goal: Package and deploy an iris classifier with safe rollouts, config separation, and autoscaling.
- Containerize the model server with a /predict route and a /health check.
- Create a ConfigMap for non-secret params and a Secret for any credentials.
- Deploy via Helm with values-dev.yaml and values-prod.yaml (a sketch of a prod override follows this list).
- Enable HPA (CPU 60% target, 2–10 replicas).
- Expose through Ingress at ml.local/predict.
- Perform a rolling update to version 1.1; monitor and roll back as practice.
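A prod override only needs to pin the differences from the chart defaults; the values below are illustrative placeholders, not recommendations:
# values-prod.yaml
replicaCount: 4
image:
  tag: "1.1"
ingress:
  host: ml.example.com
resources:
  requests: { cpu: "500m", memory: "512Mi" }
  limits: { cpu: "1", memory: "1Gi" }
Apply it with helm upgrade --install iris ./iris-infer -f values-prod.yaml.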
Acceptance checklist
- Zero-downtime rollout verified
- Config changes do not require image rebuild
- Secrets are not committed to source control
- Autoscaling observed under load test
Practical projects
- Scheduled batch predictions: nightly CronJob writing results to storage with retry policy
- GPU A/B serving: two Deployments with different model variants, traffic split at Ingress level
- Feature precompute pipeline: Job chain triggered via CronJob and messaging layer (simulate with separate Jobs)
Subskills
Focus areas for this skill:
- Deployments, Services, Ingress Basics — run and expose ML services safely
- Jobs, CronJobs for Batch Inference — one-off and scheduled predictions
- Resource Requests, Limits, Autoscaling — predictable performance and cost control
- ConfigMaps and Secrets — clean config separation and secure credentials
- Helm Basics — reusable, parameterized deployments
- Debugging Pods and Networking — fast incident resolution
- Rolling Updates and Rollbacks — safe releases and instant reverts
Next steps
- Add observability: logs, metrics, and dashboards
- Introduce canary or blue/green strategies
- Automate CI/CD to build, test, and deploy charts per commit