Topic 7 of 7

Debugging Pods And Networking

Learn Debugging Pods And Networking for free with explanations, exercises, and a quick test (for MLOps Engineers).

Published: January 4, 2026 | Updated: January 4, 2026

Who this is for

  • MLOps Engineers responsible for keeping ML training and inference services healthy.
  • Data/ML Engineers who deploy jobs to Kubernetes and need to troubleshoot pod and network issues.
  • SREs supporting GPU/CPU ML clusters.

Prerequisites

  • Basic Kubernetes objects: Pods, Deployments, Jobs/CronJobs, Services.
  • Comfort with kubectl and reading YAML manifests.
  • Basic Linux networking: DNS, ports, TCP/UDP, curl/nc.

Why this matters

  • Real task: your training job gets stuck in Pending or CrashLoopBackOff minutes before a deadline.
  • Real task: inference Pods are Running but not Ready; requests time out through the Service.
  • Real task: a new NetworkPolicy blocks model downloads or feature store access.

Fast, structured debugging keeps ML systems reliable and reduces downtime during experiments and releases.

Concept explained simply

Debugging Pods and networking is about answering three questions, in order:

  1. Is the Pod healthy? (images, command, probes, resources)
  2. Is the Service routing to healthy Pods? (selectors, endpoints)
  3. Is network policy or DNS blocking traffic? (NetworkPolicy, kube-dns, egress)

Mental model

  • Layer 1: Pod lifecycle — container starts, runs, exits. Signals include status, restarts, lastState, logs.
  • Layer 2: Service discovery — DNS name to ClusterIP, then to Pod IPs via Endpoints.
  • Layer 3: Network policy — allows or denies traffic between sources/destinations and ports.

Statuses cheat sheet
  • ImagePullBackOff: registry/auth/image name error; check events.
  • CrashLoopBackOff: process exits repeatedly; check logs --previous and describe.
  • Running but not Ready: readiness probe failing or app startup delay.
  • OOMKilled: memory limit too low or memory leak; see lastState in describe.

Networking path (inside cluster)
  1. Client resolves service.namespace.svc to ClusterIP via kube-dns.
  2. kube-proxy or eBPF dataplane chooses a Pod IP from Endpoints.
  3. NetworkPolicy rules enforced on source and destination.
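
A quick way to walk these three layers from a terminal is sketched below; predict-svc and prod are placeholders for your own Service and namespace.
# Layer 1: are the Pods healthy and Ready?
kubectl get pods -n prod -o wide
# Layer 2: does the Service name resolve, and does it have backing Pods?
kubectl run dns-check --rm -it --restart=Never -n prod --image=busybox:1.36 -- nslookup predict-svc.prod.svc
kubectl get endpointslices -n prod -l kubernetes.io/service-name=predict-svc
# Layer 3: is a NetworkPolicy in the way?
kubectl get networkpolicy -n prod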

Core toolkit

  • kubectl get/describe pods, events: fast status and reasons.
  • kubectl logs [pod] [-c container] [--previous]: app output and crash details.
  • kubectl exec -it [pod] -- sh/bash: run curl, nc, nslookup inside the Pod.
  • kubectl port-forward [pod|svc] local:remote: test endpoints locally.
  • kubectl get svc,endpoints,endpointslices: verify routing targets.
  • kubectl get networkpolicy; read rules: who can talk to whom, on which ports.
  • kubectl debug [pod] --image=busybox:1.36 -it: temporary toolbox container.

Try it: minimal diagnosis sequence
  1. kubectl get pods -o wide
  2. kubectl describe pod <name>
  3. kubectl logs <name> --previous (if restarting)
  4. kubectl get svc,ep
  5. kubectl exec -it <name> -- sh, then run curl or nslookup from inside the Pod
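
If the application image has no shell or network tools, step 5 still works through an ephemeral debug container. A minimal sketch; the Pod name, namespace, app container name, port, and path are assumptions:
# Attach a temporary busybox toolbox; --target shares the process namespace
# with the app container (assumed here to be named "server")
kubectl debug -it mlinfer-7b9f -n prod --image=busybox:1.36 --target=server
# Inside the toolbox the Pod's network namespace is shared, so localhost reaches the app
nslookup predict-svc.prod.svc
wget -qO- -T 5 http://127.0.0.1:8080/healthz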

Worked examples

Example 1: CrashLoopBackOff in an inference Pod

Situation: A model server Pod restarts repeatedly.

Investigate
# Observe status and restarts
kubectl get pod mlinfer-7b9f -n prod
# Events and last termination reason
kubectl describe pod mlinfer-7b9f -n prod
# Previous logs from the crashing container
kubectl logs mlinfer-7b9f -n prod --previous

Typical findings: Reason=OOMKilled in lastState; events show Back-off restarting failed container; logs show memory allocation during model load.

Fix
  • Increase memory requests/limits to fit model (e.g., request: 1Gi, limit: 2Gi).
  • Optionally increase readinessProbe initialDelaySeconds while the model warms up.
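
One way to apply these fixes from the command line is sketched below; the Deployment name mlinfer and the exact sizes are illustrative, not taken from this scenario.
# Raise memory so the model fits at load time (sizes are examples)
kubectl set resources deployment/mlinfer -n prod --requests=memory=1Gi --limits=memory=2Gi
# Give the server longer to warm up before readiness checks begin
kubectl patch deployment mlinfer -n prod --type=json \
  -p='[{"op":"add","path":"/spec/template/spec/containers/0/readinessProbe/initialDelaySeconds","value":60}]'
# Confirm the rollout settles and restarts stop
kubectl rollout status deployment/mlinfer -n prod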

Example 2: Service returns timeouts

Situation: Client calls predict-svc but gets timeouts.

Investigate
# Check Service and Endpoints
kubectl get svc predict-svc -n prod -o wide
kubectl get endpoints predict-svc -n prod -o yaml
# Validate selector matches Pod labels
kubectl get pods -n prod --show-labels

Typical root cause: Endpoints list is empty due to selector mismatch, so the Service has no backing Pods.

Fix
  • Align Service selector with Pod labels, or relabel Pods.
  • Confirm Endpoints now contain Pod IPs.
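
A minimal sketch of closing the gap; the label key app, the object names, and the manifest path are assumptions about your setup.
# See exactly what the Service selects
kubectl get svc predict-svc -n prod -o jsonpath='{.spec.selector}{"\n"}'
# Preferred fix: correct the selector or Pod template labels in the manifest, then re-apply
kubectl apply -f predict-svc.yaml
# Stopgap: relabel a running Pod to match the selector (new Pods still need the manifest fix)
kubectl label pod predict-7c9d4f6b8-abcde -n prod app=predict-svc --overwrite
# Verify the Service now has backing addresses
kubectl get endpoints predict-svc -n prod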

Example 3: DNS resolution fails inside training job

Situation: Job cannot download data from an object store by name, but pinging its IP works.

Investigate
# From inside the Pod
nslookup minio.storage.svc
curl -I http://minio.storage.svc:9000
# Check NetworkPolicies and kube-dns access
kubectl get networkpolicy -A

Typical root cause: an egress policy blocks UDP/TCP 53 to kube-dns. The default ndots:5 setting can also make lookups of external names slow; adjust it if needed.

Fix
  • Add egress rules to allow DNS to kube-dns (UDP/TCP 53).
  • If lookups are slow for external FQDNs, set dnsConfig.options ndots: 1.
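
A sketch of such an egress rule, assuming the workload namespace is ml and that kube-dns runs in kube-system with the common k8s-app=kube-dns label; adjust both to your cluster.
cat <<'EOF' | kubectl apply -f -
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-dns-egress
  namespace: ml
spec:
  podSelector: {}              # every Pod in the namespace
  policyTypes: ["Egress"]
  egress:
    # Allows are additive with any existing egress policies in the namespace
    - to:
        - namespaceSelector:
            matchLabels:
              kubernetes.io/metadata.name: kube-system
          podSelector:
            matchLabels:
              k8s-app: kube-dns
      ports:
        - protocol: UDP
          port: 53
        - protocol: TCP
          port: 53
EOF
For the slow-lookup case, ndots is set per Pod under spec.dnsConfig.options; the value must be a quoted string (name: ndots, value: "1").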

Example 4: ImagePullBackOff before job starts

Situation: Training job Pods never start; image pull fails.

Investigate
kubectl describe pod trainer-abc -n ml
# Look for events: Failed to pull image, authentication denied, not found

Fix
  • Correct image name/tag or registry path.
  • Configure imagePullSecrets on the ServiceAccount or Pod.
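
A sketch of the credentials fix; the registry URL, username, secret name, and ServiceAccount are placeholders.
# Create a pull secret and attach it to the ServiceAccount the Job uses
kubectl create secret docker-registry regcred -n ml \
  --docker-server=registry.example.com \
  --docker-username=ci-bot \
  --docker-password='<token>'
kubectl patch serviceaccount default -n ml -p '{"imagePullSecrets": [{"name": "regcred"}]}'
# Delete the stuck Pod so the Job controller recreates it with pull credentials
kubectl delete pod trainer-abc -n ml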

Step-by-step playbooks

When a Pod won't stay up

  1. kubectl describe pod to get reason, events, and lastState.
  2. kubectl logs --previous if CrashLoopBackOff.
  3. Check probes; relax initialDelaySeconds for slow starts.
  4. Check resources; raise memory/CPU if OOMKilled.
  5. Re-deploy and watch events.
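
For step 5, a convenient way to watch the redeploy settle (namespace is a placeholder):
# Sorting by time puts the newest events, and the current failure reason, at the bottom
kubectl get events -n prod --sort-by=.lastTimestamp
# Watch status and restart counts live
kubectl get pods -n prod -w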

When Service DNS works but requests fail

  1. kubectl get endpoints to verify backing Pods are listed.
  2. Ensure Pod labels match Service selector.
  3. Confirm readiness probes are passing; only Ready Pods are included.
  4. From a test Pod, curl service:port.
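
For step 4, a throwaway Pod keeps the check inside the cluster; the Service name, namespace, port, and path are assumptions.
kubectl run svc-test --rm -it --restart=Never -n prod --image=busybox:1.36 -- \
  wget -qO- -T 5 http://predict-svc.prod.svc:80/healthz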

When cross-namespace access fails

  1. List NetworkPolicies in both namespaces.
  2. Find egress rules on client and ingress rules on server.
  3. Allow required ports (e.g., 80/443/5432) and DNS 53 if needed.
  4. Re-test with nc or curl from a toolbox Pod.
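
For step 4, nc gives a clear pass/fail on a single port; the namespaces, host name, and port below are placeholders.
# Run from the client namespace; "open" means routing and policy allow the connection
kubectl run net-test --rm -it --restart=Never -n ml --image=busybox:1.36 -- \
  nc -zv -w 3 feature-store.data.svc 5432
# busybox nc flag support varies by build; use a fuller toolbox image if -z/-v are missing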

Exercises

Do these practical debugging tasks. You can complete them by writing out the commands and conclusions. Solutions are provided inside expandable sections.

  • Exercise 1: Diagnose CrashLoopBackOff in an inference Pod.
  • Exercise 2: Fix a Service with no endpoints due to selector mismatch.
  • Exercise 3: Unblock DNS in a restricted NetworkPolicy.

Checklist before you submit

  • Your answer includes the exact kubectl commands you would run.
  • You state the root cause in one clear sentence.
  • You propose a minimal fix and a way to verify it.

Common mistakes and self-check

  • Skipping describe: Self-check — Did you capture events and lastState? If not, run kubectl describe.
  • Assuming Service works if DNS resolves: Self-check — Did you verify Endpoints are non-empty?
  • Testing from your laptop only: Self-check — Did you exec into a Pod inside the cluster to test?
  • Ignoring readiness probe delays: Self-check — Are Pods Ready? If not, inspect readinessProbe settings.
  • Forgetting DNS egress in NetworkPolicy: Self-check — Do your egress rules permit UDP/TCP 53 to kube-dns?

Practical projects

  • Build an ML service runbook: include pod lifecycle checks, probe tuning, Service/Endpoints validation, and a NetworkPolicy allowlist template.
  • Create a networking sandbox namespace: install a deny-all policy, then iteratively allow DNS, Service-to-Service HTTP, and database ports; document each test command.

Learning path

  • Before: Pod spec basics, containers and images, Services and DNS.
  • This subskill: Systematic debugging of Pod lifecycle and cluster networking.
  • Next: Resource requests/limits tuning, autoscaling, observability (metrics/logs), and GitOps rollout strategies.

Mini challenge

A canary Deployment of your model server is Running but not Ready. The Service points to both stable and canary. Traffic spikes cause 5% errors. In 10 minutes, identify and fix the issue with minimal blast radius. What do you check first, what do you change, and how do you verify safely?

Next steps

  • Practice the playbooks until you can run them from memory.
  • Automate sanity checks with small scripts or make targets.
  • Pair-review your NetworkPolicies with a teammate.

Try the Quick Test

The Quick Test is available to everyone for free.

Practice Exercises

3 exercises to complete

Instructions

Scenario: Pod mlinfer-7b9f in namespace prod restarts every ~10s with status CrashLoopBackOff.

$ kubectl get pod mlinfer-7b9f -n prod
NAME           READY   STATUS             RESTARTS   AGE
mlinfer-7b9f   0/1     CrashLoopBackOff   8          2m

Events snippet observed in describe:

Back-off restarting failed container
Last State: Terminated: Reason=OOMKilled

Task:

  • List the exact kubectl commands you would run to confirm the root cause.
  • Write one-sentence root cause.
  • Propose a minimal fix and how you would verify it.

Expected Output

Root cause identified as OOMKilled during model load because the memory limit is too low; commands include describe and logs --previous; the fix is to increase memory requests/limits and adjust the readinessProbe if needed; verify that the Pod becomes Ready and the Endpoints are populated.

Debugging Pods And Networking — Quick Test

Test your knowledge with 10 questions. Pass with 70% or higher.
