Topic 7 of 7

Debugging Pods And Networking

Learn Debugging Pods And Networking for free with explanations, exercises, and a quick test (for MLOps Engineers).

Published: January 4, 2026 | Updated: January 4, 2026

Who this is for

  • MLOps Engineers responsible for keeping ML training and inference services healthy.
  • Data/ML Engineers who deploy jobs to Kubernetes and need to troubleshoot pod and network issues.
  • SREs supporting GPU/CPU ML clusters.

Prerequisites

  • Basic Kubernetes objects: Pods, Deployments, Jobs/CronJobs, Services.
  • Comfort with kubectl and reading YAML manifests.
  • Basic Linux networking: DNS, ports, TCP/UDP, curl/nc.

Why this matters

  • Real task: your training job gets stuck in Pending or CrashLoopBackOff minutes before a deadline.
  • Real task: inference Pods are Running but not Ready; requests time out through the Service.
  • Real task: a new NetworkPolicy blocks model downloads or feature store access.

Fast, structured debugging keeps ML systems reliable and reduces downtime during experiments and releases.

Concept explained simply

Debugging Pods and networking is about answering three questions, in order:

  1. Is the Pod healthy? (images, command, probes, resources)
  2. Is the Service routing to healthy Pods? (selectors, endpoints)
  3. Is network policy or DNS blocking traffic? (NetworkPolicy, kube-dns, egress)

Mental model

  • Layer 1: Pod lifecycle — container starts, runs, exits. Signals include status, restarts, lastState, logs.
  • Layer 2: Service discovery — DNS name to ClusterIP, then to Pod IPs via Endpoints.
  • Layer 3: Network policy — allows or denies traffic between sources/destinations and ports.

Statuses cheat sheet
  • ImagePullBackOff: registry/auth/image name error; check events.
  • CrashLoopBackOff: process exits repeatedly; check logs --previous and describe.
  • Running but not Ready: readiness probe failing or app startup delay.
  • OOMKilled: memory limit too low or memory leak; see lastState in describe.

Networking path (inside cluster)
  1. Client resolves service.namespace.svc to ClusterIP via kube-dns.
  2. kube-proxy or eBPF dataplane chooses a Pod IP from Endpoints.
  3. NetworkPolicy rules enforced on source and destination.
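
A quick way to walk these three layers from a terminal is sketched below; predict-svc and prod are placeholders for your own Service and namespace.
# Layer 1: are the Pods healthy and Ready?
kubectl get pods -n prod -o wide
# Layer 2: does the Service name resolve, and does it have backing Pods?
kubectl run dns-check --rm -it --restart=Never -n prod --image=busybox:1.36 -- nslookup predict-svc.prod.svc
kubectl get endpointslices -n prod -l kubernetes.io/service-name=predict-svc
# Layer 3: is a NetworkPolicy in the way?
kubectl get networkpolicy -n prod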

Core toolkit

  • kubectl get/describe pods, events: fast status and reasons.
  • kubectl logs [pod] [-c container] [--previous]: app output and crash details.
  • kubectl exec -it [pod] -- sh/bash: run curl, nc, nslookup inside the Pod.
  • kubectl port-forward [pod|svc] local:remote: test endpoints locally.
  • kubectl get svc,endpoints,endpointslices: verify routing targets.
  • kubectl get networkpolicy; read rules: who can talk to whom, on which ports.
  • kubectl debug [pod] --image=busybox:1.36 -it: temporary toolbox container.

Try it: minimal diagnosis sequence
  1. kubectl get pods -o wide
  2. kubectl describe pod <name>
  3. kubectl logs <name> --previous (if restarting)
  4. kubectl get svc,ep
  5. kubectl exec -it <name> -- sh, then run curl or nslookup from inside the Pod
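
If the application image has no shell or network tools, step 5 still works through an ephemeral debug container. A minimal sketch; the Pod name, namespace, app container name, port, and path are assumptions:
# Attach a temporary busybox toolbox; --target shares the process namespace
# with the app container (assumed here to be named "server")
kubectl debug -it mlinfer-7b9f -n prod --image=busybox:1.36 --target=server
# Inside the toolbox the Pod's network namespace is shared, so localhost reaches the app
nslookup predict-svc.prod.svc
wget -qO- -T 5 http://127.0.0.1:8080/healthz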

Worked examples

Example 1: CrashLoopBackOff in an inference Pod

Situation: A model server Pod restarts repeatedly.

Investigate
# Observe status and restarts
kubectl get pod mlinfer-7b9f -n prod
# Events and last termination reason
kubectl describe pod mlinfer-7b9f -n prod
# Previous logs from the crashing container
kubectl logs mlinfer-7b9f -n prod --previous

Typical findings: Reason=OOMKilled in lastState; events show Back-off restarting failed container; logs show memory allocation during model load.

Fix
  • Increase memory requests/limits to fit model (e.g., request: 1Gi, limit: 2Gi).
  • Optionally increase readinessProbe initialDelaySeconds while the model warms up.
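
One way to apply these fixes from the command line is sketched below; the Deployment name mlinfer and the exact sizes are illustrative, not taken from this scenario.
# Raise memory so the model fits at load time (sizes are examples)
kubectl set resources deployment/mlinfer -n prod --requests=memory=1Gi --limits=memory=2Gi
# Give the server longer to warm up before readiness checks begin
kubectl patch deployment mlinfer -n prod --type=json \
  -p='[{"op":"add","path":"/spec/template/spec/containers/0/readinessProbe/initialDelaySeconds","value":60}]'
# Confirm the rollout settles and restarts stop
kubectl rollout status deployment/mlinfer -n prod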

Example 2: Service returns timeouts

Situation: Client calls predict-svc but gets timeouts.

Investigate
# Check Service and Endpoints
kubectl get svc predict-svc -n prod -o wide
kubectl get endpoints predict-svc -n prod -o yaml
# Validate selector matches Pod labels
kubectl get pods -n prod --show-labels

Typical root cause: Endpoints list is empty due to selector mismatch, so the Service has no backing Pods.

Fix
  • Align Service selector with Pod labels, or relabel Pods.
  • Confirm Endpoints now contain Pod IPs.
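
A minimal sketch of closing the gap; the label key app, the object names, and the manifest path are assumptions about your setup.
# See exactly what the Service selects
kubectl get svc predict-svc -n prod -o jsonpath='{.spec.selector}{"\n"}'
# Preferred fix: correct the selector or Pod template labels in the manifest, then re-apply
kubectl apply -f predict-svc.yaml
# Stopgap: relabel a running Pod to match the selector (new Pods still need the manifest fix)
kubectl label pod predict-7c9d4f6b8-abcde -n prod app=predict-svc --overwrite
# Verify the Service now has backing addresses
kubectl get endpoints predict-svc -n prod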

Example 3: DNS resolution fails inside training job

Situation: Job cannot download data from an object store by name, but pinging its IP works.

Investigate
# From inside the Pod
nslookup minio.storage.svc
curl -I http://minio.storage.svc:9000
# Check NetworkPolicies and kube-dns access
kubectl get networkpolicy -A

Typical root cause: an egress policy blocks UDP/TCP 53 to kube-dns. The default ndots:5 setting can also make lookups of external names slow; adjust it if needed.

Fix
  • Add egress rules to allow DNS to kube-dns (UDP/TCP 53).
  • If lookups are slow for external FQDNs, set dnsConfig.options ndots: 1.
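
A sketch of such an egress rule, assuming the workload namespace is ml and that kube-dns runs in kube-system with the common k8s-app=kube-dns label; adjust both to your cluster.
cat <<'EOF' | kubectl apply -f -
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-dns-egress
  namespace: ml
spec:
  podSelector: {}              # every Pod in the namespace
  policyTypes: ["Egress"]
  egress:
    # Allows are additive with any existing egress policies in the namespace
    - to:
        - namespaceSelector:
            matchLabels:
              kubernetes.io/metadata.name: kube-system
          podSelector:
            matchLabels:
              k8s-app: kube-dns
      ports:
        - protocol: UDP
          port: 53
        - protocol: TCP
          port: 53
EOF
For the slow-lookup case, ndots is set per Pod under spec.dnsConfig.options; the value must be a quoted string (name: ndots, value: "1").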

Example 4: ImagePullBackOff before job starts

Situation: Training job Pods never start; image pull fails.

Investigate
kubectl describe pod trainer-abc -n ml
# Look for events: Failed to pull image, authentication denied, not found

Fix
  • Correct image name/tag or registry path.
  • Configure imagePullSecrets on the ServiceAccount or Pod.
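
A sketch of the credentials fix; the registry URL, username, secret name, and ServiceAccount are placeholders.
# Create a pull secret and attach it to the ServiceAccount the Job uses
kubectl create secret docker-registry regcred -n ml \
  --docker-server=registry.example.com \
  --docker-username=ci-bot \
  --docker-password='<token>'
kubectl patch serviceaccount default -n ml -p '{"imagePullSecrets": [{"name": "regcred"}]}'
# Delete the stuck Pod so the Job controller recreates it with pull credentials
kubectl delete pod trainer-abc -n ml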

Step-by-step playbooks

When a Pod won't stay up

  1. kubectl describe pod to get reason, events, and lastState.
  2. kubectl logs --previous if CrashLoopBackOff.
  3. Check probes; relax initialDelaySeconds for slow starts.
  4. Check resources; raise memory/CPU if OOMKilled.
  5. Re-deploy and watch events.
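
For step 5, a convenient way to watch the redeploy settle (namespace is a placeholder):
# Sorting by time puts the newest events, and the current failure reason, at the bottom
kubectl get events -n prod --sort-by=.lastTimestamp
# Watch status and restart counts live
kubectl get pods -n prod -w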

When Service DNS works but requests fail

  1. kubectl get endpoints to verify backing Pods are listed.
  2. Ensure Pod labels match Service selector.
  3. Confirm readiness probes are passing; only Ready Pods are included.
  4. From a test Pod, curl service:port.
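
For step 4, a throwaway Pod keeps the check inside the cluster; the Service name, namespace, port, and path are assumptions.
kubectl run svc-test --rm -it --restart=Never -n prod --image=busybox:1.36 -- \
  wget -qO- -T 5 http://predict-svc.prod.svc:80/healthz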

When cross-namespace access fails

  1. List NetworkPolicies in both namespaces.
  2. Find egress rules on client and ingress rules on server.
  3. Allow required ports (e.g., 80/443/5432) and DNS 53 if needed.
  4. Re-test with nc or curl from a toolbox Pod.
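
For step 4, nc gives a clear pass/fail on a single port; the namespaces, host name, and port below are placeholders.
# Run from the client namespace; "open" means routing and policy allow the connection
kubectl run net-test --rm -it --restart=Never -n ml --image=busybox:1.36 -- \
  nc -zv -w 3 feature-store.data.svc 5432
# busybox nc flag support varies by build; use a fuller toolbox image if -z/-v are missing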

Exercises

Do these practical debugging tasks. You can complete them by writing out the commands and conclusions. Solutions are provided inside expandable sections.

  • Exercise 1: Diagnose CrashLoopBackOff in an inference Pod.
  • Exercise 2: Fix a Service with no endpoints due to selector mismatch.
  • Exercise 3: Unblock DNS in a restricted NetworkPolicy.

Checklist before you submit

  • Your answer includes the exact kubectl commands you would run.
  • You state the root cause in one clear sentence.
  • You propose a minimal fix and a way to verify it.

Common mistakes and self-check

  • Skipping describe: Self-check — Did you capture events and lastState? If not, run kubectl describe.
  • Assuming Service works if DNS resolves: Self-check — Did you verify Endpoints are non-empty?
  • Testing from your laptop only: Self-check — Did you exec into a Pod inside the cluster to test?
  • Ignoring readiness probe delays: Self-check — Are Pods Ready? If not, inspect readinessProbe settings.
  • Forgetting DNS egress in NetworkPolicy: Self-check — Do your egress rules permit UDP/TCP 53 to kube-dns?

Practical projects

  • Build an ML service runbook: include pod lifecycle checks, probe tuning, Service/Endpoints validation, and a NetworkPolicy allowlist template.
  • Create a networking sandbox namespace: install a deny-all policy, then iteratively allow DNS, Service-to-Service HTTP, and database ports; document each test command.

Learning path

  • Before: Pod spec basics, containers and images, Services and DNS.
  • This subskill: Systematic debugging of Pod lifecycle and cluster networking.
  • Next: Resource requests/limits tuning, autoscaling, observability (metrics/logs), and GitOps rollout strategies.

Mini challenge

A canary Deployment of your model server is Running but not Ready. The Service points to both stable and canary. Traffic spikes cause 5% errors. In 10 minutes, identify and fix the issue with minimal blast radius. What do you check first, what do you change, and how do you verify safely?

Next steps

  • Practice the playbooks until you can run them from memory.
  • Automate sanity checks with small scripts or make targets.
  • Pair-review your NetworkPolicies with a teammate.

Try the Quick Test

The Quick Test is available to everyone for free.

Practice Exercises

3 exercises to complete

Instructions

Scenario: Pod mlinfer-7b9f in namespace prod restarts every ~10s with status CrashLoopBackOff.

$ kubectl get pod mlinfer-7b9f -n prod
NAME           READY   STATUS             RESTARTS   AGE
mlinfer-7b9f   0/1     CrashLoopBackOff   8          2m

Events snippet observed in describe:

Back-off restarting failed container
Last State: Terminated: Reason=OOMKilled

Task:

  • List the exact kubectl commands you would run to confirm the root cause.
  • Write one-sentence root cause.
  • Propose a minimal fix and how you would verify it.

Expected Output

Root cause identified as OOMKilled during model load because the memory limit is too low; commands include describe and logs --previous; the fix is to increase memory requests/limits and adjust the readinessProbe if needed; verify that the Pod becomes Ready and the Endpoints are populated.

Debugging Pods And Networking — Quick Test

Test your knowledge with 10 questions. Pass with 70% or higher.
