Who this is for
- Platform Engineers responsible for keeping Kubernetes clusters and workloads healthy.
- Backend Engineers who deploy services to containers and need fast root cause analysis.
- Anyone on-call for containerized apps and cluster infrastructure.
Prerequisites
- Basic Kubernetes: Pods, Deployments, Services, Nodes.
- Comfort with a terminal and reading logs.
- Familiarity with YAML manifests and container images.
Learning path
- Learn the 5-step triage flow and mental model (below).
- Practice with worked examples (CrashLoopBackOff, ImagePullBackOff, Networking, Scheduling).
- Do the hands-on exercises and checklist.
- Build a small troubleshooting playbook.
- Take the quick test to validate understanding.
Why this matters
Real tasks you will face:
- A new rollout causes pods to restart with CrashLoopBackOff.
- Can’t reach a service from another namespace—timeouts and 502s.
- Pods stay Pending due to scheduling constraints or resource pressure.
- Readiness probes fail after a configuration change.
- Image pulls fail in production during a critical deployment.
Fast, structured troubleshooting reduces downtime, improves reliability, and shortens on-call incidents.
Concept explained simply
Debugging in Kubernetes means shrinking the gap between what you declared in YAML and what is actually running. You observe state (status, logs, events), form a hypothesis, test it safely, and apply a minimal change.
Mental model: Spec → Controller → Reality
- Spec: Your manifests (Deployments, Services, Ingress, etc.).
- Controller: Reconciliation loops that try to make reality match the spec.
- Reality: Pods, Nodes, IPs, endpoints, and actual process behavior.
Debugging is finding where Spec and Reality diverge and why the controller can’t close the gap.
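For example, a Deployment that declares three replicas while its status reports fewer is exactly such a gap. A minimal sketch of checking it (deployment name and namespace are placeholders):
  kubectl get deploy <deploy> -n your-ns \
    -o custom-columns=NAME:.metadata.name,DESIRED:.spec.replicas,READY:.status.readyReplicas
  # DESIRED comes from the Spec, READY from Reality; a persistent gap means a
  # controller is blocked, and events/describe usually say why.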
5-step triage flow (use every time)
- Look: Status snapshot.
  kubectl get pods -A -o wide
  kubectl get events --sort-by=.metadata.creationTimestamp | tail -n 30
  kubectl get deploy,rs,svc,ing -n your-ns
- Explain: Use describe to get the system's story.
  kubectl describe pod <pod> -n your-ns
  kubectl describe deploy <deploy> -n your-ns
  kubectl describe node <node>
- Listen: Logs and probes.
  kubectl logs <pod> -n your-ns
  kubectl logs <pod> -c <container> --previous
  kubectl exec -it <pod> -- env | sort
- Trace: Network and endpoints.
  kubectl get svc <svc> -n your-ns -o yaml | grep -i selector -A3
  kubectl get endpoints <svc> -n your-ns
  kubectl run -it net-debug --rm --image=busybox --restart=Never -- sh
  # inside the pod: wget -qO- http://<svc>:<port>
- Resource: CPU/memory and scheduling.
  kubectl top pods -n your-ns
  kubectl describe pod <pod> | egrep -i 'oom|probe|failed'
  kubectl get pods -o custom-columns=NAME:.metadata.name,NODE:.spec.nodeName,STATUS:.status.phase
Optional advanced tools:
- Ephemeral debug container:
  kubectl debug -it <pod> --image=busybox --target=<container>
- Port-forward to test locally:
  kubectl port-forward svc/<svc> 8080:80
- Rollout controls:
  kubectl rollout status deploy/<name>
  kubectl rollout undo deploy/<name>
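If you run this flow often, the first pass can be scripted. A minimal sketch, assuming a single target namespace passed as an argument (the script name and the exact checks included are choices, not a standard tool):
  #!/usr/bin/env sh
  # triage.sh <namespace>: first-pass snapshot for the 5-step flow
  NS="${1:?usage: triage.sh <namespace>}"
  echo "== Look: pod status =="
  kubectl get pods -n "$NS" -o wide
  echo "== Look: recent events =="
  kubectl get events -n "$NS" --sort-by=.metadata.creationTimestamp | tail -n 30
  echo "== Trace: services and endpoints =="
  kubectl get svc,endpoints -n "$NS"
  echo "== Resource: usage (requires metrics-server) =="
  kubectl top pods -n "$NS" 2>/dev/null || echo "metrics unavailable"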
Worked examples
1) CrashLoopBackOff after deploy
- Look: kubectl get pods -n app shows CrashLoopBackOff.
- Explain: kubectl describe pod shows the container exited with code 1.
- Listen: kubectl logs --previous shows "Missing DB_URL".
- Check config: kubectl describe deploy app reveals no env var or a wrong Secret ref.
- Fix: Patch the deployment to include the env var; apply and monitor the rollout.
  kubectl set env deploy/app DB_URL=$(kubectl get secret db -o jsonpath='{.data.url}' | base64 -d)
  kubectl rollout status deploy/app
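Note that kubectl set env as used above writes the decoded value directly into the Deployment spec. A more durable fix is to reference the Secret from the manifest so the value stays in the Secret. A hedged sketch, assuming the Secret is named db and stores the connection string under the key url (as in the example):
  # container env in the Deployment's pod template
  env:
    - name: DB_URL
      valueFrom:
        secretKeyRef:
          name: db   # Secret name used in the example
          key: url   # key holding the connection string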
2) ImagePullBackOff
- Explain: kubectl describe pod shows "ErrImagePull: manifest unknown".
- Verify the image: tag typo or private registry auth missing.
- Fix the tag:
  kubectl set image deploy/web web=registry.example/web:1.2.3
- If the registry is private: ensure imagePullSecrets is referenced in the pod spec.
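For the private-registry case, a common pattern is a docker-registry Secret referenced from the pod template. A minimal sketch with placeholder credentials and the registry host from the example:
  kubectl create secret docker-registry regcred \
    --docker-server=registry.example \
    --docker-username=<user> \
    --docker-password=<password> \
    -n your-ns
  # attach it to the Deployment's pod template:
  kubectl patch deploy web -n your-ns \
    -p '{"spec":{"template":{"spec":{"imagePullSecrets":[{"name":"regcred"}]}}}}'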
3) Service reachable locally but not from another service
- Trace: kubectl get svc api -o yaml and kubectl get endpoints api show 0 endpoints.
- Compare selectors with pod labels: mismatch found (e.g., app: api-v2 vs app: api).
- Fix the selector or the labels so they match. Endpoints should populate within seconds.
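For reference, this is what a matching pair looks like; a sketch using the names from the example (ports are illustrative):
  apiVersion: v1
  kind: Service
  metadata:
    name: api
  spec:
    selector:
      app: api            # must equal the labels on the pods
    ports:
      - port: 80
        targetPort: 8080
  # and in the Deployment:
  # spec.template.metadata.labels:
  #   app: api
Once they match, kubectl get endpoints api should list pod IPs.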
4) Pods Pending due to scheduling
- Explain: kubectl describe pod shows "0/3 nodes are available: Insufficient cpu" or a taint.
- Check requests/limits: reduce or right-size them, or add nodes.
- If taints: add tolerations or change nodeSelector/affinity.
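Both fixes live in the pod spec. A sketch with placeholder values (actual request sizes and taint keys depend on your cluster):
  # right-size requests so the scheduler can place the pod
  resources:
    requests:
      cpu: 250m        # placeholder; size to what the app actually needs
      memory: 256Mi
  # tolerate a taint that is keeping pods off otherwise suitable nodes
  tolerations:
    - key: dedicated   # hypothetical taint key
      operator: Equal
      value: batch
      effect: NoSchedule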
Core troubleshooting checklist
- Check pod status and restarts.
- Read recent events and describe pod/deploy.
- Fetch current and previous container logs.
- Verify env vars, ConfigMaps, Secrets, and volumes are mounted.
- Confirm Service selectors match pod labels; endpoints exist.
- Test network path from inside the cluster (curl/wget).
- Inspect probes (readiness/liveness) and application port.
- Check resource pressure (OOMKilled, node pressure).
- Validate image tag and registry auth.
- Use rollout status/undo if needed.
Exercises (hands-on thinking)
Do these as thought experiments or on a sandbox cluster.
Exercise 1: Diagnose a CrashLoopBackOff
Use the 5-step flow to find the cause and propose a minimal fix.
- Run: kubectl get pods -n shop, kubectl describe pod <pod>, and kubectl logs <pod> --previous.
- Check for missing env vars or failing migrations.
Exercise 2: Solve ImagePullBackOff
Identify whether it’s an image tag issue or private registry auth problem.
- Check pod events for "manifest unknown" or "authentication required".
- Propose the exact kubectl fix command.
Exercise 3: Service has no endpoints
Find the selector/label mismatch and fix it.
- Compare the selector from kubectl get svc -o yaml with the labels from kubectl get pods --show-labels.
- Decide whether to change the Service selector or the Deployment labels.
Common mistakes and how to self-check
- Only reading logs but ignoring events. Self-check: Did you run kubectl describe and read the recent events?
- Debugging outside the cluster only. Self-check: Did you curl the service from inside the cluster?
- Forgetting probes. Self-check: Are readiness/liveness ports and paths correct? (See the probe sketch after this list.)
- Selector mismatch. Self-check: Do Service selectors exactly match pod labels?
- Ignoring resource limits. Self-check: Any OOMKilled or node pressure messages?
- Assuming image is correct. Self-check: Verified the tag exists and pull secrets are configured?
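For the probe self-check, the probe path and port have to line up with what the container actually serves. A minimal sketch, assuming an HTTP health endpoint at /healthz on port 8080 (both are assumptions about your app, and the timings are illustrative):
  containers:
    - name: api
      image: registry.example/api:1.0.0   # placeholder image
      ports:
        - containerPort: 8080
      readinessProbe:
        httpGet:
          path: /healthz    # must exist and return 2xx when the app is ready
          port: 8080        # must match the port the app listens on
        initialDelaySeconds: 5
        periodSeconds: 10
      livenessProbe:
        httpGet:
          path: /healthz
          port: 8080
        initialDelaySeconds: 15
        periodSeconds: 20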
Practical projects
- Build a Troubleshooting Runbook: a one-page checklist plus common commands for your team.
- Create a "net-debug" image or manifest for cluster-internal curl/dig testing (a starting-point manifest is sketched after this list).
- Set up a failure lab: manifests that intentionally break (bad image tag, wrong selector, failing probe) and practice fixing them.
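One possible starting point for the net-debug manifest, assuming busybox is enough (it ships wget and nslookup; swap in an image with curl and dig if you rely on those):
  apiVersion: v1
  kind: Pod
  metadata:
    name: net-debug
  spec:
    restartPolicy: Never
    containers:
      - name: shell
        image: busybox:1.36          # placeholder tag
        command: ["sleep", "3600"]   # keep the pod alive for interactive use
Then kubectl exec -it net-debug -- sh and test service DNS names from inside the cluster; delete the pod when you are done.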
Mini challenge
Your API pods started failing readiness after a config rollout. Using only kubectl, identify two plausible root causes and one quick rollback command. Keep the blast radius small.
Hint
- Compare probe port/path vs container port.
- Check env changes in Deployment revision (previous ReplicaSet diff).
- Use kubectl rollout undo deploy/api if needed.
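One way to see what the config rollout actually changed, assuming the Deployment's revision history has not been truncated (the label selector is an assumption about how the pods are labeled):
  kubectl rollout history deploy/api
  kubectl rollout history deploy/api --revision=<n>   # pod template of a specific revision
  # or compare the old and new ReplicaSets directly:
  kubectl get rs -l app=api -o yaml | less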
Next steps
- Practice the flow on staging incidents.
- Automate frequent checks with simple scripts.
- Pair with teammates on post-incident reviews to expand your playbook.
Quick Test
Take the quick test to check your understanding.