Debugging And Troubleshooting

Learn Debugging And Troubleshooting for free with explanations, exercises, and a quick test (for Platform Engineers).

Published: January 23, 2026 | Updated: January 23, 2026

Who this is for

  • Platform Engineers responsible for keeping Kubernetes clusters and workloads healthy.
  • Backend Engineers who deploy services to containers and need fast root cause analysis.
  • Anyone on-call for containerized apps and cluster infrastructure.

Prerequisites

  • Basic Kubernetes: Pods, Deployments, Services, Nodes.
  • Comfort with a terminal and reading logs.
  • Familiarity with YAML manifests and container images.

Learning path

  1. Learn the 5-step triage flow and mental model (below).
  2. Practice with worked examples (CrashLoopBackOff, ImagePullBackOff, Networking, Scheduling).
  3. Do the hands-on exercises and checklist.
  4. Build a small troubleshooting playbook.
  5. Take the quick test to validate understanding.

Why this matters

Real tasks you will face:

  • A new rollout causes pods to restart with CrashLoopBackOff.
  • Can’t reach a service from another namespace—timeouts and 502s.
  • Pods stay Pending due to scheduling constraints or resource pressure.
  • Readiness probes fail after a configuration change.
  • Image pulls fail in production during a critical deployment.

Fast, structured troubleshooting reduces downtime, improves reliability, and shortens on-call incidents.

Concept explained simply

Debugging in Kubernetes means shrinking the gap between what you declared in YAML and what is actually running. You observe state (status, logs, events), form a hypothesis, test it safely, and apply a minimal change.

Mental model: Spec → Controller → Reality

  • Spec: Your manifests (Deployments, Services, Ingress, etc.).
  • Controller: Reconciliation loops that try to make reality match the spec.
  • Reality: Pods, Nodes, IPs, endpoints, and actual process behavior.

Debugging is finding where Spec and Reality diverge and why the controller can’t close the gap.
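As a concrete Spec-side example, here is a minimal Deployment sketch (the names and image are illustrative); the Deployment controller keeps creating Pods until Reality matches it:

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: web            # illustrative name
spec:
  replicas: 2          # the declared Reality: two running Pods
  selector:
    matchLabels:
      app: web
  template:
    metadata:
      labels:
        app: web
    spec:
      containers:
        - name: web
          image: registry.example/web:1.0.0   # hypothetical image
```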

5-step triage flow (use every time)

  1. Look: Status snapshot.
    kubectl get pods -A -o wide
    kubectl get events --sort-by=.metadata.creationTimestamp | tail -n 30
    kubectl get deploy,rs,svc,ing -n your-ns
  2. Explain: Use describe to read the system’s story.
    kubectl describe pod <pod> -n your-ns
    kubectl describe deploy <deploy> -n your-ns
    kubectl describe node <node>
  3. Listen: Logs and probes.
    kubectl logs <pod> -n your-ns
    kubectl logs <pod> -c <container> --previous
    kubectl exec -it <pod> -- env | sort
  4. Trace: Network and endpoints.
    kubectl get svc <svc> -n your-ns -o yaml | grep -i selector -A3
    kubectl get endpoints <svc> -n your-ns
    kubectl run -it net-debug --rm --image=busybox --restart=Never -- sh
    # inside: wget -qO- http://<svc>.<ns>.svc.cluster.local:<port>
  5. Resource: CPU/memory and scheduling.
    kubectl top pods -n your-ns
    kubectl describe pod <pod> | egrep -i 'oom|probe|failed'
    kubectl get pods -o custom-columns=NAME:.metadata.name,NODE:.spec.nodeName,STATUS:.status.phase

Optional advanced tools:

  • Ephemeral debug container: kubectl debug -it <pod> --image=busybox --target=<container>
  • Port-forward: kubectl port-forward svc/<svc> 8080:80 to test locally.
  • Rollout controls: kubectl rollout status|undo deploy/<name>
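The first three triage steps can be wrapped in a small helper for on-call use. This is a sketch, not a definitive script; the function name and output filters are assumptions:

```shell
#!/bin/sh
# triage: one-shot read-only snapshot for a single pod (sketch; adjust to taste).
triage() {
  ns="${1:?usage: triage <namespace> <pod>}"
  pod="${2:?usage: triage <namespace> <pod>}"
  kubectl get pod "$pod" -n "$ns" -o wide                                 # 1. Look
  kubectl describe pod "$pod" -n "$ns" | grep -iE 'reason|message|state'  # 2. Explain
  kubectl logs "$pod" -n "$ns" --previous --tail=50 2>/dev/null           # 3. Listen
  kubectl get events -n "$ns" \
    --sort-by=.metadata.creationTimestamp | tail -n 20                    # recent events
}
```

Run it as `triage <namespace> <pod>` on a cluster; nothing here mutates state.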

Worked examples

1) CrashLoopBackOff after deploy
  1. Look: kubectl get pods -n app shows CrashLoopBackOff.
  2. Explain: kubectl describe pod shows container exited with code 1.
  3. Listen: kubectl logs --previous shows "Missing DB_URL".
  4. Check config: kubectl describe deploy app reveals no env var or wrong Secret ref.
  5. Fix: Patch the deployment to include the env var; apply and monitor rollout.
    kubectl set env deploy/app DB_URL=$(kubectl get secret db -o jsonpath='{.data.url}' | base64 -d)
    kubectl rollout status deploy/app
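Note that Secret .data values are base64-encoded, which is why the fix above pipes the jsonpath output through base64 -d. A local simulation of that decode step (the connection string is made up):

```shell
# Encode a value the way Kubernetes stores it in .data, then decode it back.
encoded=$(printf 'postgres://db.internal:5432/app' | base64)
decoded=$(printf '%s' "$encoded" | base64 -d)
echo "$decoded"   # -> postgres://db.internal:5432/app
```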
2) ImagePullBackOff
  1. Explain: kubectl describe pod shows "ErrImagePull: manifest unknown".
  2. Verify image: tag typo or private registry auth missing.
  3. Fix tag: kubectl set image deploy/web web=registry.example/web:1.2.3
  4. If private: ensure imagePullSecrets referenced in the pod spec.
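For the private-registry case, the pod spec needs an imagePullSecrets reference. A fragment sketch (the secret name `regcred` is an example):

```yaml
spec:
  imagePullSecrets:
    - name: regcred       # created via: kubectl create secret docker-registry ...
  containers:
    - name: web
      image: registry.example/web:1.2.3
```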
3) Service reachable locally but not from another service
  1. Trace: kubectl get svc api -o yaml, then kubectl get endpoints api — it shows 0 endpoints.
  2. Compare selectors with pod labels: mismatch found (e.g., app: api-v2 vs app: api).
  3. Fix selector or labels so they match. Endpoints should populate within seconds.
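A matching pair might look like this sketch (the `app: api` label is illustrative); note the Service selector must equal the labels on the pod template, not on the Deployment object itself:

```yaml
apiVersion: v1
kind: Service
metadata:
  name: api
spec:
  selector:
    app: api          # must match the pod template labels below
  ports:
    - port: 80
      targetPort: 8080
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: api
spec:
  selector:
    matchLabels:
      app: api
  template:
    metadata:
      labels:
        app: api      # a mismatch here (e.g. app: api-v2) yields 0 endpoints
    spec:
      containers:
        - name: api
          image: registry.example/api:1.0.0
```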
4) Pods Pending due to scheduling
  1. Explain: kubectl describe pod shows "0/3 nodes are available: Insufficient cpu" or a taint.
  2. Check requests/limits: reduce or right-size; or add nodes.
  3. If taints: add tolerations or change nodeSelector/affinity.
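Both remedies sketched in one pod spec fragment; the taint key/value and sizes are assumptions, so right-size against real `kubectl top` data:

```yaml
spec:
  tolerations:                  # only needed if nodes carry a matching taint
    - key: dedicated
      operator: Equal
      value: batch
      effect: NoSchedule
  containers:
    - name: app
      image: registry.example/app:1.0.0
      resources:
        requests:               # what the scheduler reserves on a node
          cpu: 250m
          memory: 256Mi
        limits:
          memory: 512Mi
```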

Core troubleshooting checklist

  • Check pod status and restarts.
  • Read recent events and describe pod/deploy.
  • Fetch current and previous container logs.
  • Verify env vars, ConfigMaps, Secrets, and volumes are mounted.
  • Confirm Service selectors match pod labels; endpoints exist.
  • Test network path from inside the cluster (curl/wget).
  • Inspect probes (readiness/liveness) and application port.
  • Check resource pressure (OOMKilled, node pressure).
  • Validate image tag and registry auth.
  • Use rollout status/undo if needed.
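The selector/endpoint items on the checklist can be scripted. A read-only sketch (the function name is an assumption):

```shell
#!/bin/sh
# check_service: compare a Service's selector with pod labels and endpoints.
check_service() {
  svc="${1:?usage: check_service <service> <namespace>}"
  ns="${2:?usage: check_service <service> <namespace>}"
  echo "Selector:"
  kubectl get svc "$svc" -n "$ns" -o jsonpath='{.spec.selector}'; echo
  echo "Pod labels:"
  kubectl get pods -n "$ns" --show-labels
  echo "Endpoints (no addresses => selector mismatch or unready pods):"
  kubectl get endpoints "$svc" -n "$ns"
}
```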

Exercises (hands-on thinking)

These mirror the practice exercises below. Do them as thought experiments or on a sandbox cluster.

Exercise 1: Diagnose a CrashLoopBackOff

Use the 5-step flow to find the cause and propose a minimal fix.

  • Run: kubectl get pods -n shop, kubectl describe pod, kubectl logs --previous.
  • Check for missing env or failing migrations.
Exercise 2: Solve ImagePullBackOff

Identify whether it’s an image tag issue or private registry auth problem.

  • Check pod events for "manifest unknown" or "authentication required".
  • Propose the exact kubectl fix command.
Exercise 3: Service has no endpoints

Find the selector/label mismatch and fix it.

  • Compare kubectl get svc -o yaml selector vs kubectl get pods --show-labels.
  • Decide whether to change the Service or the Deployment labels.

Common mistakes and how to self-check

  • Only reading logs but ignoring events. Self-check: Did you run kubectl describe and read recent events?
  • Debugging outside the cluster only. Self-check: Did you curl the service from inside the cluster?
  • Forgetting probes. Self-check: Are readiness/liveness ports and paths correct?
  • Selector mismatch. Self-check: Do Service selectors exactly match pod labels?
  • Ignoring resource limits. Self-check: Any OOMKilled or node pressure messages?
  • Assuming image is correct. Self-check: Verified the tag exists and pull secrets are configured?

Practical projects

  • Build a Troubleshooting Runbook: a one-page checklist plus common commands for your team.
  • Create a "net-debug" image or manifest for cluster-internal curl/dig testing.
  • Set up a failure lab: manifests that intentionally break (bad image tag, wrong selector, failing probe) and practice fixing them.
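A minimal net-debug manifest could look like this sketch; `nicolaka/netshoot` is one commonly used tool image, and busybox suffices for wget-only checks:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: net-debug
spec:
  restartPolicy: Never
  containers:
    - name: net-debug
      image: nicolaka/netshoot     # or busybox for basic wget tests
      command: ["sleep", "3600"]   # keep the pod alive for exec sessions
```

Then kubectl exec -it net-debug -- sh and curl/dig from inside the cluster; delete the pod when done.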

Mini challenge

Your API pods started failing readiness after a config rollout. Using only kubectl, identify two plausible root causes and one quick rollback command. Keep the blast radius small.

Hint
  • Compare probe port/path vs container port.
  • Check env changes in Deployment revision (previous ReplicaSet diff).
  • Use kubectl rollout undo deploy/api if needed.

Next steps

  • Practice the flow on staging incidents.
  • Automate frequent checks with simple scripts.
  • Pair with teammates on post-incident reviews to expand your playbook.

Quick Test

Take the quick test to check your understanding.

Practice Exercises

3 exercises to complete

Instructions

A pod in namespace shop is in CrashLoopBackOff. Use the 5-step flow to find the root cause and propose a minimal fix.

  1. Get status: kubectl get pods -n shop
  2. Describe pod: kubectl describe pod <pod> -n shop
  3. Check previous logs: kubectl logs <pod> -n shop --previous
  4. Inspect Deployment env/volumes: kubectl get deploy shop-api -n shop -o yaml
  5. Apply a minimal fix and verify rollout.

Expected Output

You identify a missing DB_URL env var (or wrong Secret/ConfigMap reference) causing the process to exit, and patch the Deployment to supply the correct configuration. Rollout stabilizes with passing readiness.

Debugging And Troubleshooting — Quick Test

Test your knowledge with 10 questions. Pass with 70% or higher.
