Who this is for
- MLOps Engineers deploying ML inference APIs or batch services on Kubernetes.
- Data/ML Engineers who need stable rollout, scaling, and simple external access.
- Developers moving from local Docker to production Kubernetes.
Prerequisites
- Comfort with Docker images.
- Basic kubectl usage (apply, get, describe, logs).
- Know Pods and containers (what a Pod is, basic YAML structure).
Why this matters
Your day-to-day MLOps tasks often include:
- Rolling out a new model version with zero downtime.
- Keeping a stable endpoint for clients even as Pods restart or scale.
- Routing external HTTP traffic to the right service, often with path-based rules like /predict or /metrics.
- Quickly reverting a bad release without breaking traffic.
Deployments, Services, and Ingress are the trio that make this reliable:
- Deployment = desired state, rolling updates, scaling for your Pods.
- Service = stable virtual IP and DNS name for accessing Pods.
- Ingress = HTTP(S) routing from outside the cluster to internal Services.
Concept explained simply
Think of Kubernetes networking like a building with three layers of doors:
- Deployment: The operations team that ensures a certain number of identical rooms (Pods) are always ready and handles swapping occupants during renovations (rolling updates).
- Service: A receptionist with a permanent phone number (ClusterIP) who forwards calls to any available room that matches certain labels.
- Ingress: The main entrance for visitors from the street; it reads the visitor's request (host/path) and directs them to the right receptionist (Service).
Mental model
- Labels and selectors are the glue. Deployment labels Pods; Service selects Pods via those labels; Ingress points to the Service.
- Ports must align. Traffic flows Ingress backend port -> Service port -> Service targetPort -> containerPort in the Pod (see the alignment sketch after this list).
- Health checks matter. Readiness gates traffic until the app is ready; liveness restarts stuck containers.
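Condensed, the port chain from the three manifests below lines up like this (a sketch of the relevant fields only, not complete manifests):

# Deployment Pod template: the app listens on 8080
ports:
- name: http
  containerPort: 8080
# Service: cluster-facing port 80 forwards to the container port (by number or name)
ports:
- name: http
  port: 80
  targetPort: 8080
# Ingress backend: must name the Service and its port
service:
  name: infer-svc
  port:
    number: 80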
Worked examples
Example 1: Deployment for an inference API
Goal: run 2 replicas of a FastAPI/Flask model server with safe rollouts and health checks.
Deployment YAML:
apiVersion: apps/v1
kind: Deployment
metadata:
  name: infer-deploy
spec:
  replicas: 2
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxSurge: 1
      maxUnavailable: 0
  selector:
    matchLabels:
      app: infer
  template:
    metadata:
      labels:
        app: infer
    spec:
      containers:
      - name: infer
        image: ghcr.io/example/infer:1.0
        ports:
        - containerPort: 8080
        env:
        - name: MODEL_NAME
          value: resnet50
        readinessProbe:
          httpGet:
            path: /health
            port: 8080
          initialDelaySeconds: 5
          periodSeconds: 5
        livenessProbe:
          httpGet:
            path: /live
            port: 8080
          initialDelaySeconds: 10
          periodSeconds: 10
        resources:
          requests:
            cpu: "250m"
            memory: "256Mi"
          limits:
            cpu: "1"
            memory: "1Gi"
What this gives you: consistent scaling, safe rolling updates, readiness gating that keeps unready Pods out of traffic, and liveness restarts for hung containers.
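To see the rolling update in action, push a new image tag and watch the rollout; the 1.1 tag below is a hypothetical next version of the example image:

kubectl set image deploy/infer-deploy infer=ghcr.io/example/infer:1.1
kubectl rollout status deploy/infer-deploy
# If the new version misbehaves, one command reverts to the previous ReplicaSet:
kubectl rollout undo deploy/infer-deploy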
Example 2: Service for stable access
Goal: create a stable virtual IP to reach the Pods.
Service YAML:
apiVersion: v1
kind: Service
metadata:
  name: infer-svc
spec:
  type: ClusterIP
  selector:
    app: infer
  ports:
  - name: http
    port: 80
    targetPort: 8080
Traffic flow: Service port 80 forwards to Pod port 8080. Inside the cluster, the DNS name is infer-svc.default.svc.cluster.local (replace default with your namespace).
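A quick way to verify DNS and routing from inside the cluster is a throwaway curl Pod (a sketch; assumes the Service is in the default namespace):

# Runs curl once inside the cluster, then deletes the Pod.
kubectl run curl-test --rm -it --restart=Never --image=curlimages/curl -- \
  curl -s http://infer-svc.default.svc.cluster.local/health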
Example 3: Ingress for external HTTP routing
Goal: expose /predict and /health from the outside world.
Ingress YAML (generic; an Ingress controller is required):
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: infer-ing
  # Controller-specific annotations (nginx.ingress.kubernetes.io/*, etc.)
  # would go under metadata.annotations. No rewrite annotation is needed
  # here because /predict and /health reach the backend unchanged.
spec:
  rules:
  - host: ml.example.local
    http:
      paths:
      - path: /predict
        pathType: Prefix
        backend:
          service:
            name: infer-svc
            port:
              number: 80
      - path: /health
        pathType: Prefix
        backend:
          service:
            name: infer-svc
            port:
              number: 80
Note: An Ingress controller must be installed in the cluster. The YAML declares intent; the controller does the actual routing.
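Once the controller has assigned an address, a smoke test from outside looks like this (assumes the controller publishes an IP; some publish a hostname instead):

# Read the address the controller assigned to the Ingress.
INGRESS_IP=$(kubectl get ing infer-ing -o jsonpath='{.status.loadBalancer.ingress[0].ip}')
# The Host header must match the rule's host.
curl -s -H "Host: ml.example.local" "http://$INGRESS_IP/health"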
How to apply: step-by-step
- Apply the Deployment. Wait until all Pods are Ready.
- Apply the Service. Confirm it has a ClusterIP and endpoints.
- Apply the Ingress. Verify the controller admits the rule and endpoints are ready.
- Smoke test /health first, then /predict to validate functionality.
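Scripted, the same sequence might look like this (manifest file names are assumptions; use whatever you saved them as):

kubectl apply -f deployment.yaml
kubectl rollout status deploy/infer-deploy --timeout=120s            # wait for Ready Pods
kubectl apply -f service.yaml
kubectl get endpointslices -l kubernetes.io/service-name=infer-svc   # expect 2 Pod IPs
kubectl apply -f ingress.yaml
kubectl describe ing infer-ing                                       # confirm the rule and backends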
Useful kubectl commands
kubectl apply -f deployment.yaml
kubectl apply -f service.yaml
kubectl apply -f ingress.yaml
kubectl get deploy,po,svc,ing
kubectl describe deploy infer-deploy
kubectl describe svc infer-svc
kubectl describe ing infer-ing
kubectl get endpointslices -l kubernetes.io/service-name=infer-svc
kubectl logs deploy/infer-deploy
Exercises (your turn)
These map to the graded exercises below. You can complete them in any conformant Kubernetes cluster. If you cannot run a cluster now, write the YAML and self-check against the checklist.
Exercise 1: Create a Deployment and Service for an inference API
Requirements:
- Deployment named infer-deploy with 2 replicas.
- Image: ghcr.io/example/infer:1.0
- Container port: 8080; readiness on /health; liveness on /live.
- Env: MODEL_NAME=beta
- Service infer-svc exposes port 80 -> targetPort 8080.
Self-check checklist
- kubectl get deploy shows 2/2 READY.
- kubectl get svc shows infer-svc with a ClusterIP and port 80.
- kubectl get endpointslices -l kubernetes.io/service-name=infer-svc shows addresses for 2 Pods.
- kubectl port-forward svc/infer-svc 8080:80, then curl localhost:8080/health, returns a healthy response.
Exercise 2: Add an Ingress for /predict
Requirements:
- Ingress name: infer-ing
- Host: ml.example.local
- Path: /predict routes to infer-svc port 80
Self-check checklist
- kubectl get ing shows ADDRESS populated (once the controller assigns one).
- curl -H "Host: ml.example.local" http://<ADDRESS>/predict reaches the Service.
- Readiness failures do not receive traffic (test by breaking readiness temporarily).
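One way to break readiness temporarily is to point the probe at a path that does not exist; /nope below is a deliberately fake path. With maxUnavailable: 0, the old Pods keep serving while the never-ready new Pods receive no traffic:

# Repoint the readiness probe at a nonexistent path (this triggers a rollout).
kubectl patch deploy infer-deploy --type=json -p \
  '[{"op":"replace","path":"/spec/template/spec/containers/0/readinessProbe/httpGet/path","value":"/nope"}]'
kubectl rollout status deploy/infer-deploy --timeout=60s   # stalls: new Pods never become Ready
kubectl rollout undo deploy/infer-deploy                   # restore the working probe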
Common mistakes and how to self-check
- Mismatched labels: Service selector must match Pod labels. Self-check: kubectl get pods -l app=infer and ensure non-empty.
- Port mismatch: Service targetPort must equal containerPort (or named port). Self-check: kubectl describe svc and verify.
- Only liveness probe: Using liveness without readiness can send traffic to unready Pods. Add readiness to gate traffic.
- Forgetting Ingress controller: Ingress YAML without a controller does nothing. Self-check: kubectl get pods -n ingress-controller-namespace (varies by setup).
- Zero maxUnavailable without surge: maxSurge: 0 combined with maxUnavailable: 0 is rejected by the API server, since no update could ever proceed. Use maxSurge: 1 and maxUnavailable: 0 for zero-downtime updates.
Practical projects
- Blue/Green switch: Deploy infer-v1 and infer-v2. Point the Service selector to one at a time and switch over (see the patch sketch after this list).
- Path router: Expose /predict to infer-svc and /metrics to a separate metrics-svc via Ingress.
- A/B experiment: Two Deployments behind two Services, then manually route 10% of test traffic to B using a separate path (/predict-b).
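For the blue/green project, the cutover itself can be a single Service patch; the version label below assumes you added it to each Deployment's Pod template:

# Repoint infer-svc from v1 Pods to v2 Pods in one step.
kubectl patch svc infer-svc -p '{"spec":{"selector":{"app":"infer","version":"v2"}}}'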
Learning path
- Start: Deployments/Services/Ingress basics (this lesson).
- Next: ConfigMaps/Secrets for model configs and API keys.
- Then: Horizontal Pod Autoscaler for load-based scaling of inference.
- Advanced: Canary rollouts with Progressive Delivery tools; TLS and cert rotation; Multi-tenancy and network policies.
Next steps
- Finish the exercises and compare with the provided solutions.
- Take the quick test below to validate your knowledge.
- Apply to a real service in your environment and capture before/after SLOs (latency, error rate) during a rollout.
Mini challenge
Create two Deployments (infer-v1, infer-v2) and two Services (infer-v1-svc, infer-v2-svc). Use an Ingress to route:
- /v1/predict -> infer-v1-svc
- /v2/predict -> infer-v2-svc
Bonus: Switch your main /predict to point to v2 by updating only the Ingress. Measure downtime (should be near zero if readiness is correct).