Why this matters
As a Machine Learning Engineer, you rely on containers to keep your training jobs, model servers, and pipelines reproducible. When a container crashes or a model server is unreachable, debugging it quickly saves time and cloud cost. You will use these skills to:
- Unblock failing CI jobs that build and run training containers
- Fix model-serving outages (ports, health checks, memory limits)
- Diagnose GPU availability for training/inference containers
- Trace data/path issues in mounted volumes during feature generation
Who this is for
ML/AI practitioners who package apps with Docker: training scripts, batch jobs, and API servers (FastAPI/Flask/Triton/TensorFlow Serving).
Prerequisites
- Basic Docker usage: images, Dockerfile, run, build, volumes, ports
- Command line comfort (bash/sh)
- Basic Python app structure (for examples)
Concept explained simply
A container is a boxed app with its own filesystem, processes, and network. Debugging is about observing and controlling what happens inside the box and at its boundaries: startup, logs, processes, files, network, resources, and build steps.
A mental model
Use the "windows and knobs" model:
- Window 1: Logs (what did the app say?)
- Window 2: Shell (what does the box look like inside?)
- Window 3: Inspect (what does Docker know about state/config?)
- Window 4: Network (is it listening and reachable?)
- Knob A: Entry point/CMD override to break into a shell
- Knob B: Resources (CPU/RAM/GPU) and limits
- Knob C: Volumes and file permissions
Core toolbox
Logs and status
docker ps -a
docker logs <container>
docker logs -f <container> # follow live
Stopped containers still have logs. Pair with status:
docker inspect -f '{{.State.Status}} {{.State.ExitCode}} {{.State.OOMKilled}}' <container>
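To triage several stopped containers at once, a small loop like this can help (a sketch; tune the tail length to taste):
for c in $(docker ps -aq --filter status=exited); do
  docker inspect -f '{{.Name}}: status={{.State.Status}} exit={{.State.ExitCode}} oom={{.State.OOMKilled}}' "$c"
  docker logs -n 20 "$c" 2>&1 | tail -n 5   # last few log lines, stderr included
done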
Open a shell inside
docker exec -it <container> sh # or bash
# As root if needed
docker exec -it -u 0 <container> sh
If it crashes at start, override entrypoint to get a shell:
docker run --rm -it --entrypoint sh <image>
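Before overriding anything, confirm what the image is actually configured to run:
# Show ENTRYPOINT and CMD as Docker sees them
docker inspect -f 'ENTRYPOINT={{.Config.Entrypoint}} CMD={{.Config.Cmd}}' <image>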
Processes and health
docker top <container>
ps aux # inside container
# Health
docker inspect -f '{{.State.Health.Status}}' <container>
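If the image defines no HEALTHCHECK, you can attach one at run time. A sketch, assuming the server exposes /health on port 8000 and the image includes curl:
docker run -d --name api \
  --health-cmd 'curl -f http://localhost:8000/health || exit 1' \
  --health-interval 30s --health-retries 3 \
  myapp:latest
docker inspect -f '{{json .State.Health}}' api | jq .   # status plus recent probe results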
Network checks
docker port <container>
# Inside container:
ss -tulpn # what's listening
curl localhost:8080
# Share network namespace for deep inspection:
docker run --rm -it --network container:<container> alpine sh
Ensure your app listens on 0.0.0.0 (not 127.0.0.1) when publishing ports.
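To separate a bind-address problem from a port-publishing problem, compare the host view with the container view (assuming port 8080; ss may be absent in slim images):
curl -sS http://localhost:8080/ || echo "unreachable from host"
docker exec <container> sh -c 'ss -tulpn | grep 8080'   # want 0.0.0.0:8080 or [::]:8080, not 127.0.0.1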
Files, volumes, permissions
docker inspect -f '{{json .Mounts}}' <container> | jq .
ls -l /app
id -u; id -g
A mismatch between the host user and the container user often causes write failures. Use chown or run the container with a matching UID/GID.
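A sketch of running as your host user so writes to a mounted volume succeed (image, paths, and script name are placeholders):
docker run --rm -it \
  --user "$(id -u):$(id -g)" \
  -v "$PWD/features:/data" \
  myapp:latest python generate_features.py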
Resources and OOM
docker stats
# Was it OOM killed?
docker inspect -f '{{.State.OOMKilled}} {{.State.ExitCode}}' <container> # exit code 137 (SIGKILL) often means OOM
If training crashes, reduce batch size, increase memory limit, or stream data.
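A sketch of setting an explicit memory cap and watching usage while the job runs (names and sizes are placeholders):
docker run -d --name train --memory=8g --memory-swap=8g my-train-image python train.py --batch-size 16
docker stats --no-stream train   # one-shot snapshot of CPU/RAM usage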
GPU availability
# Start with GPU support
docker run --gpus all --rm nvidia/cuda:12.2.0-base-ubuntu22.04 nvidia-smi
# Inside your app container:
nvidia-smi
python -c "import torch; print(torch.cuda.is_available())"
If it prints False: ensure the host NVIDIA driver (and NVIDIA Container Toolkit) is installed and that the container was started with --gpus all.
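A quick assertion you can run at container start or in CI (assumes PyTorch is installed; the image name is a placeholder):
docker run --gpus all --rm my-ml-image \
  python -c "import torch; assert torch.cuda.is_available(), 'no GPU visible'; print(torch.cuda.get_device_name(0))"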
Build-time debugging
# Re-run every step and show full build output
docker build --progress=plain --no-cache -t myapp:debug .
# Speed up repeated dependency installs with a BuildKit cache mount
# (add in the Dockerfile during dev):
# RUN --mount=type=cache,target=/root/.cache pip install -r requirements.txt
Use multi-stage builds to keep debug tools in a dev stage but out of prod images.
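With a multi-stage Dockerfile you can build each flavor explicitly (the stage names debug and runtime are assumptions about your Dockerfile):
docker build --target debug -t myapp:debug .     # keeps shells and profilers
docker build --target runtime -t myapp:latest .  # slim production image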
Worked examples
1) Container exits immediately on start
Symptoms: Status "exited" with non-zero code.
docker logs app
# Output: python: can't open file 'server.py': [Errno 2] No such file or directory
Fix: The CMD references a non-existent file. Override to inspect:
docker run --rm -it --entrypoint sh myapp:latest
ls -l
# Correct Dockerfile CMD to app.py and rebuild
2) Service unreachable from host
Symptoms: docker ps shows 0.0.0.0:8080->8080/tcp but curl localhost:8080 fails.
# Inside container
ss -tulpn | grep 8080
# Listening on 127.0.0.1:8080 only
Fix: Bind app to 0.0.0.0 and restart container.
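The exact fix depends on your server; sketches for two common setups (module and app names are placeholders):
# FastAPI served by uvicorn
uvicorn main:app --host 0.0.0.0 --port 8080
# Flask served by gunicorn
gunicorn --bind 0.0.0.0:8080 app:app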
3) GPU not visible to PyTorch
Symptoms: torch.cuda.is_available() is False.
# Check container runtime
nvidia-smi # not found or no GPUs
# Restart with GPU
docker run --gpus all -it my-ml-image
python -c "import torch; print(torch.cuda.is_available())"
If it is still False, verify the host driver and that the image ships a compatible CUDA/cuDNN build.
4) OOM-killed training job
Symptoms: Exit code 137, logs end abruptly.
docker inspect -f '{{.State.OOMKilled}} {{.State.ExitCode}}' train
# true 137
Fix: Increase memory limit or reduce batch size/precision; stream data from disk.
Guided debugging flow (follow these steps)
- Check status and logs:
docker ps -a
docker logs -n 100 <container>
- If it crashes at start, drop into a shell:
docker run --rm -it --entrypoint sh <image>
Verify files, env, and run the app manually.
- Check networking:
docker port <container>
ss -tulpn # inside the container
Ensure the app listens on 0.0.0.0 and ports are published.
- Check volumes and permissions:
docker inspect -f '{{json .Mounts}}' <container>
ls -l /data
Fix ownership or mount paths.
- Check resources:
docker stats
Look for CPU/RAM spikes; inspect OOMKilled.
- GPU check (if needed):
docker exec -it <container> nvidia-smi
Confirm the container was started with --gpus all.
Exercises
These mirror the interactive tasks below so you can practice locally.
- Exercise 1: Fix a container that exits immediately (missing entry script)
- Exercise 2: Make a Flask app reachable from host (bind to 0.0.0.0 and publish port)
Checklist before you say “fixed”
- Container stays healthy (or expected status) for at least 60 seconds
- Logs are clean: no repeating stack traces
- Port mapping shows expected host:container and the service is reachable
- Volumes mounted and writable where needed
- No OOMKilled and memory within limits
- GPU is visible when required
- Image can rebuild cleanly after changes
Common mistakes and self-check
- Binding to 127.0.0.1 inside container. Self-check: ss -tulpn shows 127.0.0.1. Fix to 0.0.0.0.
- Wrong CMD/ENTRYPOINT. Self-check: docker inspect shows the command; docker logs show missing file/module.
- Forgetting --gpus all. Self-check: nvidia-smi not found or no devices.
- Mounting wrong path. Self-check: docker inspect .Mounts and ls inside container.
- Permission denied on mounted volume. Self-check: id -u; directory ownership.
- Assuming logs go to stdout. Self-check: the app logs to a file instead; tail that file inside the container (see the sketch below).
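For that last case, a sketch (the log path is an assumption):
docker exec -it <container> tail -n 100 -f /app/logs/app.log   # follow the app's own log file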
Practical projects
- Containerize a FastAPI model server with healthcheck; simulate failures (port, OOM) and document fixes.
- Build a training container with a debug flavor (extra tools) and a release flavor; compare sizes and startup.
- Create a small GPU inference container and a validation script that asserts GPU availability on start.
Learning path
- Start: Container basics and Dockerfile fundamentals
- Then: Container Debugging (this page)
- Next: CI/CD for container builds and runtime observability
- Finally: Production hardening (healthchecks, resource limits, autoscaling)
Next steps
- Do the exercises below and verify with the checklist
- Run the Quick Test to check understanding
- Apply the toolbox to your current ML container or model server
How to use the Quick Test
The Quick Test is available to everyone. Logged-in learners get saved progress and can resume later.
Mini challenge
Your API container builds and runs, docker ps shows 0.0.0.0:9000->8000/tcp, but curl localhost:9000 hangs. In 5 minutes, list your first three checks and a command for each. Compare with the checklist above.