Who this is for
- Platform Engineers responsible for keeping Kubernetes clusters and workloads healthy.
- Backend Engineers who deploy services to containers and need fast root cause analysis.
- Anyone on-call for containerized apps and cluster infrastructure.
Prerequisites
- Basic Kubernetes: Pods, Deployments, Services, Nodes.
- Comfort with a terminal and reading logs.
- Familiarity with YAML manifests and container images.
Learning path
- Learn the 5-step triage flow and mental model (below).
- Practice with worked examples (CrashLoopBackOff, ImagePullBackOff, Networking, Scheduling).
- Do the hands-on exercises and checklist.
- Build a small troubleshooting playbook.
- Take the quick test to validate understanding.
Why this matters
Real tasks you will face:
- A new rollout causes pods to restart with CrashLoopBackOff.
- Can’t reach a service from another namespace—timeouts and 502s.
- Pods stay Pending due to scheduling constraints or resource pressure.
- Readiness probes fail after a configuration change.
- Image pulls fail in production during a critical deployment.
Fast, structured troubleshooting reduces downtime, improves reliability, and shortens on-call incidents.
Concept explained simply
Debugging in Kubernetes means shrinking the gap between what you declared in YAML and what is actually running. You observe state (status, logs, events), form a hypothesis, test it safely, and apply a minimal change.
Mental model: Spec → Controller → Reality
- Spec: Your manifests (Deployments, Services, Ingress, etc.).
- Controller: Reconciliation loops that try to make reality match the spec.
- Reality: Pods, Nodes, IPs, endpoints, and actual process behavior.
Debugging is finding where Spec and Reality diverge and why the controller can’t close the gap.
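For example, a Deployment that declares three replicas while its status reports fewer is exactly such a gap. A minimal sketch of checking it (deployment name and namespace are placeholders):
  kubectl get deploy <deploy> -n your-ns \
    -o custom-columns=NAME:.metadata.name,DESIRED:.spec.replicas,READY:.status.readyReplicas
  # DESIRED comes from the Spec, READY from Reality; a persistent gap means a
  # controller is blocked, and events/describe usually say why.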
5-step triage flow (use every time)
- Look: Status snapshot.
  kubectl get pods -A -o wide
  kubectl get events --sort-by=.metadata.creationTimestamp | tail -n 30
  kubectl get deploy,rs,svc,ing -n your-ns
- Explain: Use describe to get the system's story.
  kubectl describe pod <pod> -n your-ns
  kubectl describe deploy <deploy> -n your-ns
  kubectl describe node <node>
- Listen: Logs and probes.
  kubectl logs <pod> -n your-ns
  kubectl logs <pod> -c <container> --previous
  kubectl exec -it <pod> -- env | sort
- Trace: Network and endpoints.
  kubectl get svc <svc> -n your-ns -o yaml | grep -i selector -A3
  kubectl get endpoints <svc> -n your-ns
  kubectl run -it net-debug --rm --image=busybox --restart=Never -- sh
  # inside the pod: wget -qO- http://<svc>:<port>
- Resource: CPU/memory and scheduling.
  kubectl top pods -n your-ns
  kubectl describe pod <pod> | egrep -i 'oom|probe|failed'
  kubectl get pods -o custom-columns=NAME:.metadata.name,NODE:.spec.nodeName,STATUS:.status.phase
Optional advanced tools:
- Ephemeral debug container:
  kubectl debug -it <pod> --image=busybox --target=<container>
- Port-forward to test locally:
  kubectl port-forward svc/<svc> 8080:80
- Rollout controls:
  kubectl rollout status deploy/<name>
  kubectl rollout undo deploy/<name>
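If you run this flow often, the first pass can be scripted. A minimal sketch, assuming a single target namespace passed as an argument (the script name and the exact checks included are choices, not a standard tool):
  #!/usr/bin/env sh
  # triage.sh <namespace>: first-pass snapshot for the 5-step flow
  NS="${1:?usage: triage.sh <namespace>}"
  echo "== Look: pod status =="
  kubectl get pods -n "$NS" -o wide
  echo "== Look: recent events =="
  kubectl get events -n "$NS" --sort-by=.metadata.creationTimestamp | tail -n 30
  echo "== Trace: services and endpoints =="
  kubectl get svc,endpoints -n "$NS"
  echo "== Resource: usage (requires metrics-server) =="
  kubectl top pods -n "$NS" 2>/dev/null || echo "metrics unavailable"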
Worked examples
1) CrashLoopBackOff after deploy
- Look: kubectl get pods -n app shows CrashLoopBackOff.
- Explain: kubectl describe pod shows the container exited with code 1.
- Listen: kubectl logs --previous shows "Missing DB_URL".
- Check config: kubectl describe deploy app reveals no env var or a wrong Secret ref.
- Fix: Patch the deployment to include the env var; apply and monitor the rollout.
  kubectl set env deploy/app DB_URL=$(kubectl get secret db -o jsonpath='{.data.url}' | base64 -d)
  kubectl rollout status deploy/app
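Note that kubectl set env as used above writes the decoded value directly into the Deployment spec. A more durable fix is to reference the Secret from the manifest so the value stays in the Secret. A hedged sketch, assuming the Secret is named db and stores the connection string under the key url (as in the example):
  # container env in the Deployment's pod template
  env:
    - name: DB_URL
      valueFrom:
        secretKeyRef:
          name: db   # Secret name used in the example
          key: url   # key holding the connection string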
2) ImagePullBackOff
- Explain: kubectl describe pod shows "ErrImagePull: manifest unknown".
- Verify the image: tag typo or private registry auth missing.
- Fix the tag:
  kubectl set image deploy/web web=registry.example/web:1.2.3
- If the registry is private: ensure imagePullSecrets is referenced in the pod spec.
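For the private-registry case, a common pattern is a docker-registry Secret referenced from the pod template. A minimal sketch with placeholder credentials and the registry host from the example:
  kubectl create secret docker-registry regcred \
    --docker-server=registry.example \
    --docker-username=<user> \
    --docker-password=<password> \
    -n your-ns
  # attach it to the Deployment's pod template:
  kubectl patch deploy web -n your-ns \
    -p '{"spec":{"template":{"spec":{"imagePullSecrets":[{"name":"regcred"}]}}}}'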
3) Service reachable locally but not from another service
- Trace: kubectl get svc api -o yaml and kubectl get endpoints api show 0 endpoints.
- Compare selectors with pod labels: mismatch found (e.g., app: api-v2 vs app: api).
- Fix the selector or the labels so they match. Endpoints should populate within seconds.
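For reference, this is what a matching pair looks like; a sketch using the names from the example (ports are illustrative):
  apiVersion: v1
  kind: Service
  metadata:
    name: api
  spec:
    selector:
      app: api            # must equal the labels on the pods
    ports:
      - port: 80
        targetPort: 8080
  # and in the Deployment:
  # spec.template.metadata.labels:
  #   app: api
Once they match, kubectl get endpoints api should list pod IPs.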
4) Pods Pending due to scheduling
- Explain: kubectl describe pod shows "0/3 nodes are available: Insufficient cpu" or a taint.
- Check requests/limits: reduce or right-size them, or add nodes.
- If taints: add tolerations or change nodeSelector/affinity.
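Both fixes live in the pod spec. A sketch with placeholder values (actual request sizes and taint keys depend on your cluster):
  # right-size requests so the scheduler can place the pod
  resources:
    requests:
      cpu: 250m        # placeholder; size to what the app actually needs
      memory: 256Mi
  # tolerate a taint that is keeping pods off otherwise suitable nodes
  tolerations:
    - key: dedicated   # hypothetical taint key
      operator: Equal
      value: batch
      effect: NoSchedule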
Core troubleshooting checklist
- Check pod status and restarts.
- Read recent events and describe pod/deploy.
- Fetch current and previous container logs.
- Verify env vars, ConfigMaps, Secrets, and volumes are mounted.
- Confirm Service selectors match pod labels; endpoints exist.
- Test network path from inside the cluster (curl/wget).
- Inspect probes (readiness/liveness) and application port.
- Check resource pressure (OOMKilled, node pressure).
- Validate image tag and registry auth.
- Use rollout status/undo if needed.
Exercises (hands-on thinking)
Do these as thought experiments or on a sandbox cluster.
Exercise 1: Diagnose a CrashLoopBackOff
Use the 5-step flow to find the cause and propose a minimal fix.
- Run: kubectl get pods -n shop, kubectl describe pod <pod>, and kubectl logs <pod> --previous.
- Check for missing env vars or failing migrations.
Exercise 2: Solve ImagePullBackOff
Identify whether it’s an image tag issue or private registry auth problem.
- Check pod events for "manifest unknown" or "authentication required".
- Propose the exact kubectl fix command.
Exercise 3: Service has no endpoints
Find the selector/label mismatch and fix it.
- Compare the selector from kubectl get svc -o yaml with the labels from kubectl get pods --show-labels.
- Decide whether to change the Service selector or the Deployment labels.
Common mistakes and how to self-check
- Only reading logs but ignoring events. Self-check: Did you run kubectl describe and read the recent events?
- Debugging outside the cluster only. Self-check: Did you curl the service from inside the cluster?
- Forgetting probes. Self-check: Are readiness/liveness ports and paths correct? (See the probe sketch after this list.)
- Selector mismatch. Self-check: Do Service selectors exactly match pod labels?
- Ignoring resource limits. Self-check: Any OOMKilled or node pressure messages?
- Assuming image is correct. Self-check: Verified the tag exists and pull secrets are configured?
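For the probe self-check, the probe path and port have to line up with what the container actually serves. A minimal sketch, assuming an HTTP health endpoint at /healthz on port 8080 (both are assumptions about your app, and the timings are illustrative):
  containers:
    - name: api
      image: registry.example/api:1.0.0   # placeholder image
      ports:
        - containerPort: 8080
      readinessProbe:
        httpGet:
          path: /healthz    # must exist and return 2xx when the app is ready
          port: 8080        # must match the port the app listens on
        initialDelaySeconds: 5
        periodSeconds: 10
      livenessProbe:
        httpGet:
          path: /healthz
          port: 8080
        initialDelaySeconds: 15
        periodSeconds: 20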
Practical projects
- Build a Troubleshooting Runbook: a one-page checklist plus common commands for your team.
- Create a "net-debug" image or manifest for cluster-internal curl/dig testing (a starting-point manifest is sketched after this list).
- Set up a failure lab: manifests that intentionally break (bad image tag, wrong selector, failing probe) and practice fixing them.
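One possible starting point for the net-debug manifest, assuming busybox is enough (it ships wget and nslookup; swap in an image with curl and dig if you rely on those):
  apiVersion: v1
  kind: Pod
  metadata:
    name: net-debug
  spec:
    restartPolicy: Never
    containers:
      - name: shell
        image: busybox:1.36          # placeholder tag
        command: ["sleep", "3600"]   # keep the pod alive for interactive use
Then kubectl exec -it net-debug -- sh and test service DNS names from inside the cluster; delete the pod when you are done.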
Mini challenge
Your API pods started failing readiness after a config rollout. Using only kubectl, identify two plausible root causes and one quick rollback command. Keep the blast radius small.
Hint
- Compare probe port/path vs container port.
- Check env changes in Deployment revision (previous ReplicaSet diff).
- Use kubectl rollout undo deploy/api if needed.
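One way to see what the config rollout actually changed, assuming the Deployment's revision history has not been truncated (the label selector is an assumption about how the pods are labeled):
  kubectl rollout history deploy/api
  kubectl rollout history deploy/api --revision=<n>   # pod template of a specific revision
  # or compare the old and new ReplicaSets directly:
  kubectl get rs -l app=api -o yaml | less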
Next steps
- Practice the flow on staging incidents.
- Automate frequent checks with simple scripts.
- Pair with teammates on post-incident reviews to expand your playbook.
Quick Test
Take the quick test to check your understanding.