Container Debugging

Learn Container Debugging for free with explanations, exercises, and a quick test (for Machine Learning Engineers).

Published: January 1, 2026 | Updated: January 1, 2026

Why this matters

As a Machine Learning Engineer, containers keep your training jobs, model servers, and pipelines reproducible. When a container crashes or a model server is unreachable, debugging quickly saves time and cloud cost. You will use these skills to:

  • Unblock failing CI jobs that build and run training containers
  • Fix model-serving outages (ports, health checks, memory limits)
  • Diagnose GPU availability for training/inference containers
  • Trace data/path issues in mounted volumes during feature generation

Who this is for

ML/AI practitioners who package apps with Docker: training scripts, batch jobs, and API servers (FastAPI/Flask/Triton/TensorFlow Serving).

Prerequisites

  • Basic Docker usage: images, Dockerfile, run, build, volumes, ports
  • Command line comfort (bash/sh)
  • Basic Python app structure (for examples)

Concept explained simply

A container is a boxed app with its own filesystem, processes, and network. Debugging is about observing and controlling what happens inside the box and at its boundaries: startup, logs, processes, files, network, resources, and build steps.

A mental model

Use the "windows and knobs" model:

  • Window 1: Logs (what did the app say?)
  • Window 2: Shell (what does the box look like inside?)
  • Window 3: Inspect (what does Docker know about state/config?)
  • Window 4: Network (is it listening and reachable?)
  • Knob A: Entry point/CMD override to break into a shell
  • Knob B: Resources (CPU/RAM/GPU) and limits
  • Knob C: Volumes and file permissions

Core toolbox

Logs and status
docker ps -a
docker logs <container>
docker logs -f <container>   # follow live

Stopped containers still have logs. Pair with status:

docker inspect -f '{{.State.Status}} {{.State.ExitCode}} {{.State.OOMKilled}}' <container>
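
When several containers are in play, a status filter narrows the list; the format string below is one illustrative choice:

docker ps -a --filter status=exited --format '{{.Names}}\t{{.Status}}\t{{.Image}}'
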
Open a shell inside
docker exec -it <container> sh   # or bash
# As root if needed
docker exec -it -u 0 <container> sh

If it crashes at start, override entrypoint to get a shell:

docker run --rm -it --entrypoint sh <image>
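
Once inside that shell, re-running the original command by hand usually surfaces the real error; the paths and file names below are examples:

ls -l /app
env | sort                 # check expected environment variables
python server.py           # run the CMD manually and read the traceback
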
Processes and health
docker top <container>
ps aux        # inside container
# Health
docker inspect -f '{{.State.Health.Status}}' <container>
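
If the image defines a HEALTHCHECK, the full health object also records recent probe output, which often explains a flapping status:

docker inspect -f '{{json .State.Health}}' <container> | jq .
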
Network checks
docker port <container>
# Inside container:
ss -tulpn     # what's listening
curl localhost:8080
# Share network namespace for deep inspection:
docker run --rm -it --network container:<container> alpine sh

Ensure your app listens on 0.0.0.0 (not 127.0.0.1) when publishing ports.
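
As a sketch, a FastAPI app served with uvicorn binds to all interfaces like this (the module name main:app and the port are assumptions about your app):

# In the Dockerfile (main:app is an assumed module path)
CMD ["uvicorn", "main:app", "--host", "0.0.0.0", "--port", "8080"]
# Then publish the port when running
docker run -p 8080:8080 myapp:latest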

Files, volumes, permissions
docker inspect -f '{{json .Mounts}}' <container> | jq .
ls -l /app      # inside the container
id -u; id -g    # inside the container: current UID/GID

Mismatch between host user and container user often causes write failures. Use chown or run as matching UID/GID.
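
One common workaround is to run the container with your host UID/GID so writes to a bind mount succeed; the mount path below is an example:

docker run --rm -u "$(id -u):$(id -g)" -v "$PWD/data:/data" myapp:latest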

Resources and OOM
docker stats
# Was it OOM killed?
docker inspect -f '{{.State.OOMKilled}} {{.State.ExitCode}}' <container>  # 137 = 128 + SIGKILL, often OOM

If training crashes, reduce batch size, increase memory limit, or stream data.
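
Memory limits are set at run time; the values below are examples, not recommendations:

docker run --memory=16g --memory-swap=16g my-train-image
# Or adjust an existing container:
docker update --memory=16g --memory-swap=16g <container>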

GPU availability
# Start with GPU support
docker run --gpus all --rm nvidia/cuda:12.2.0-base-ubuntu22.04 nvidia-smi
# Inside your app container:
nvidia-smi
python -c "import torch; print(torch.cuda.is_available())"

If False: ensure the host NVIDIA driver and the NVIDIA Container Toolkit are installed, and that the container was started with --gpus all.

Build-time debugging
# Rebuild from scratch and print full step output (BuildKit)
docker build --no-cache --progress=plain -t myapp:debug .
# Speed up repeated dependency installs during development with a BuildKit cache mount
# (add to the Dockerfile):
# RUN --mount=type=cache,target=/root/.cache pip install -r requirements.txt

Use multi-stage builds to keep debug tools in a dev stage but out of prod images.
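
A minimal sketch of that split (stage names and packages are illustrative):

FROM python:3.10-slim AS base
WORKDIR /app
COPY requirements.txt .
RUN pip install -r requirements.txt
COPY . .

FROM base AS dev
# Extra debug tools only in the dev image
RUN apt-get update && apt-get install -y --no-install-recommends curl procps && rm -rf /var/lib/apt/lists/*

FROM base AS prod
CMD ["python", "app.py"]

Build the flavor you need with --target, for example: docker build --target dev -t myapp:dev .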

Worked examples

1) Container exits immediately on start

Symptoms: Status "exited" with non-zero code.

docker logs app
# Output: python: can't open file 'server.py': [Errno 2] No such file or directory

Fix: The CMD references a non-existent file. Override to inspect:

docker run --rm -it --entrypoint sh myapp:latest
ls -l
# Correct Dockerfile CMD to app.py and rebuild

2) Service unreachable from host

Symptoms: docker ps shows 0.0.0.0:8080->8080/tcp but curl localhost:8080 fails.

# Inside container
ss -tulpn | grep 8080
# Listening on 127.0.0.1:8080 only

Fix: Bind app to 0.0.0.0 and restart container.
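
For a Flask app, for example, the fix is the host argument in app.run; this is a minimal sketch (route and port are illustrative):

# app.py (minimal example)
from flask import Flask

app = Flask(__name__)

@app.route("/health")
def health():
    return "ok"

if __name__ == "__main__":
    # 0.0.0.0 makes the server reachable from outside the container
    app.run(host="0.0.0.0", port=8080)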

3) GPU not visible to PyTorch

Symptoms: torch.cuda.is_available() is False.

# Check container runtime
nvidia-smi            # not found or no GPUs
# Restart with GPU
docker run --gpus all -it my-ml-image
python -c "import torch; print(torch.cuda.is_available())"

If still False, verify the NVIDIA driver on the host and confirm the image ships a CUDA/cuDNN build that matches your PyTorch install.
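
One quick check of what the image actually ships (run inside the container):

python -c "import torch; print(torch.__version__, torch.version.cuda, torch.cuda.device_count())"

A CPU-only wheel prints None for torch.version.cuda, which explains a False result even when nvidia-smi works.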

4) OOM-killed training job

Symptoms: Exit code 137, logs end abruptly.

docker inspect -f '{{.State.OOMKilled}} {{.State.ExitCode}}' train
# true 137

Fix: Increase memory limit or reduce batch size/precision; stream data from disk.

Guided debugging flow (follow these steps)

  1. Check status and logs:
    docker ps -a
    docker logs -n 100 <container>
    
  2. If it crashes at start, drop into shell:
    docker run --rm -it --entrypoint sh <image>
    
    Verify files, env, and run the app manually.
  3. Check networking:
    docker port <container>
    ss -tulpn  # inside
    
    Ensure the app listens on 0.0.0.0 and ports are published.
  4. Check volumes and permissions:
    docker inspect -f '{{json .Mounts}}' <container>
    ls -l /data
    
    Fix ownership or mount paths.
  5. Check resources:
    docker stats
    
    Look for CPU/RAM spikes; inspect OOMKilled.
  6. GPU check (if needed):
    docker exec -it <container> nvidia-smi
    
    Confirm container started with --gpus all.
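
The checks above can be chained into a small triage script; this is a minimal sketch for one container, to be adjusted to your setup:

#!/usr/bin/env bash
# Usage: ./triage.sh <container>
set -u
c="$1"

echo "== Status =="
docker inspect -f '{{.State.Status}} exit={{.State.ExitCode}} oom={{.State.OOMKilled}}' "$c"

echo "== Last 50 log lines =="
docker logs --tail 50 "$c"

echo "== Published ports =="
docker port "$c" || true

echo "== Mounts =="
docker inspect -f '{{json .Mounts}}' "$c"

echo "== Listening sockets (inside) =="
docker exec "$c" sh -c 'ss -tulpn || netstat -tulpn' 2>/dev/null || echo "container not running or no ss/netstat"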

Exercises

These mirror the interactive tasks below so you can practice locally.

  • Exercise 1: Fix a container that exits immediately (missing entry script)
  • Exercise 2: Make a Flask app reachable from host (bind to 0.0.0.0 and publish port)

Checklist before you say “fixed”

  • Container stays healthy (or expected status) for at least 60 seconds
  • Logs are clean: no stack traces repeating
  • Port mapping shows expected host:container and the service is reachable
  • Volumes mounted and writable where needed
  • No OOMKilled and memory within limits
  • GPU is visible when required
  • Image can rebuild cleanly after changes

Common mistakes and self-check

  • Binding to 127.0.0.1 inside container. Self-check: ss -tulpn shows 127.0.0.1. Fix to 0.0.0.0.
  • Wrong CMD/ENTRYPOINT. Self-check: docker inspect shows the command; docker logs shows the missing file/module.
  • Forgetting --gpus all. Self-check: nvidia-smi not found or no devices.
  • Mounting wrong path. Self-check: docker inspect .Mounts and ls inside container.
  • Permission denied on mounted volume. Self-check: id -u; directory ownership.
  • Assuming logs go to stdout. Self-check: app logs to file; tail that file inside container.
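
If logs do go to a file, tail it from outside without opening a full shell (the path is an example):

docker exec -it <container> tail -f /app/logs/app.log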

Practical projects

  • Containerize a FastAPI model server with a healthcheck; simulate failures (port, OOM) and document fixes (a HEALTHCHECK sketch follows this list).
  • Build a training container with a debug flavor (extra tools) and a release flavor; compare sizes and startup.
  • Create a small GPU inference container and a validation script that asserts GPU availability on start.
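
For the first project, a Dockerfile HEALTHCHECK might look like this (the /health endpoint and port are assumptions about your app, and curl must be installed in the image):

HEALTHCHECK --interval=30s --timeout=3s --retries=3 \
  CMD curl -f http://localhost:8000/health || exit 1

docker inspect -f '{{.State.Health.Status}}' then reports starting, healthy, or unhealthy.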

Learning path

  • Start: Container basics and Dockerfile fundamentals
  • Then: Container Debugging (this page)
  • Next: CI/CD for container builds and runtime observability
  • Finally: Production hardening (healthchecks, resource limits, autoscaling)

Next steps

  • Do the exercises below and verify with the checklist
  • Run the Quick Test to check understanding
  • Apply the toolbox to your current ML container or model server

How to use the Quick Test

The Quick Test is available to everyone. Logged-in learners get saved progress and can resume later.

Mini challenge

Your API container builds and runs, docker ps shows 0.0.0.0:9000->8000/tcp, but curl localhost:9000 hangs. In 5 minutes, list your first three checks and a command for each. Compare with the checklist above.

Practice Exercises

2 exercises to complete

Instructions

Goal: Diagnose and fix a container that exits due to a wrong CMD.

  1. Create a folder and add this Dockerfile:
FROM python:3.10-slim
WORKDIR /app
COPY app.py .
CMD ["python", "server.py"]
  2. Add app.py:
print("App booted")
import time
time.sleep(5)
  3. Build and run:
docker build -t exit-demo:bad .
docker run --name exit-demo --rm exit-demo:bad

It will exit quickly. Use logs and inspect to find out why. Then fix the Dockerfile, rebuild, and confirm the container stays up for ~5 seconds printing the message.

Expected Output
You identify the wrong CMD (server.py missing), correct it to app.py, rebuild, and the container runs printing 'App booted' then exits after ~5 seconds.

Container Debugging — Quick Test

Test your knowledge with 8 questions. Pass with 70% or higher.
