Why this matters
As a Machine Learning Engineer, you rely on containers to keep your training jobs, model servers, and pipelines reproducible. When a container crashes or a model server is unreachable, debugging it quickly saves time and cloud cost. You will use these skills to:
- Unblock failing CI jobs that build and run training containers
- Fix model-serving outages (ports, health checks, memory limits)
- Diagnose GPU availability for training/inference containers
- Trace data/path issues in mounted volumes during feature generation
Who this is for
ML/AI practitioners who package apps with Docker: training scripts, batch jobs, and API servers (FastAPI/Flask/Triton/TensorFlow Serving).
Prerequisites
- Basic Docker usage: images, Dockerfile, run, build, volumes, ports
- Command line comfort (bash/sh)
- Basic Python app structure (for examples)
Concept explained simply
A container is a boxed app with its own filesystem, processes, and network. Debugging is about observing and controlling what happens inside the box and at its boundaries: startup, logs, processes, files, network, resources, and build steps.
A mental model
Use the "windows and knobs" model:
- Window 1: Logs (what did the app say?)
- Window 2: Shell (what does the box look like inside?)
- Window 3: Inspect (what does Docker know about state/config?)
- Window 4: Network (is it listening and reachable?)
- Knob A: Entry point/CMD override to break into a shell
- Knob B: Resources (CPU/RAM/GPU) and limits
- Knob C: Volumes and file permissions
Core toolbox
Logs and status
docker ps -a
docker logs <container>
docker logs -f <container> # follow live
Stopped containers still have logs. Pair with status:
docker inspect -f '{{.State.Status}} {{.State.ExitCode}} {{.State.OOMKilled}}' <container>
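To triage several stopped containers at once, a small loop like this can help (a sketch; tune the tail length to taste):
for c in $(docker ps -aq --filter status=exited); do
  docker inspect -f '{{.Name}}: status={{.State.Status}} exit={{.State.ExitCode}} oom={{.State.OOMKilled}}' "$c"
  docker logs -n 20 "$c" 2>&1 | tail -n 5   # last few log lines, stderr included
done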
Open a shell inside
docker exec -it <container> sh # or bash
# As root if needed
docker exec -it -u 0 <container> sh
If it crashes at start, override entrypoint to get a shell:
docker run --rm -it --entrypoint sh <image>
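Before overriding anything, confirm what the image is actually configured to run:
# Show ENTRYPOINT and CMD as Docker sees them
docker inspect -f 'ENTRYPOINT={{.Config.Entrypoint}} CMD={{.Config.Cmd}}' <image>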
Processes and health
docker top <container>
ps aux # inside container
# Health
docker inspect -f '{{.State.Health.Status}}' <container>
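If the image defines no HEALTHCHECK, you can attach one at run time. A sketch, assuming the server exposes /health on port 8000 and the image includes curl:
docker run -d --name api \
  --health-cmd 'curl -f http://localhost:8000/health || exit 1' \
  --health-interval 30s --health-retries 3 \
  myapp:latest
docker inspect -f '{{json .State.Health}}' api | jq .   # status plus recent probe results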
Network checks
docker port <container>
# Inside container:
ss -tulpn # what's listening
curl localhost:8080
# Share network namespace for deep inspection:
docker run --rm -it --network container:<container> alpine sh
Ensure your app listens on 0.0.0.0 (not 127.0.0.1) when publishing ports.
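To separate a bind-address problem from a port-publishing problem, compare the host view with the container view (assuming port 8080; ss may be absent in slim images):
curl -sS http://localhost:8080/ || echo "unreachable from host"
docker exec <container> sh -c 'ss -tulpn | grep 8080'   # want 0.0.0.0:8080 or [::]:8080, not 127.0.0.1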
Files, volumes, permissions
docker inspect -f '{{json .Mounts}}' <container> | jq .
ls -l /app
id -u; id -g
A mismatch between the host user and the container user often causes write failures. Use chown or run the container with a matching UID/GID.
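A sketch of running as your host user so writes to a mounted volume succeed (image, paths, and script name are placeholders):
docker run --rm -it \
  --user "$(id -u):$(id -g)" \
  -v "$PWD/features:/data" \
  myapp:latest python generate_features.py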
Resources and OOM
docker stats
# Was it OOM killed?
docker inspect -f '{{.State.OOMKilled}} {{.State.ExitCode}}' <container> # exit code 137 (SIGKILL) often means OOM
If training crashes, reduce batch size, increase memory limit, or stream data.
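A sketch of setting an explicit memory cap and watching usage while the job runs (names and sizes are placeholders):
docker run -d --name train --memory=8g --memory-swap=8g my-train-image python train.py --batch-size 16
docker stats --no-stream train   # one-shot snapshot of CPU/RAM usage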
GPU availability
# Start with GPU support
docker run --gpus all --rm nvidia/cuda:12.2.0-base-ubuntu22.04 nvidia-smi
# Inside your app container:
nvidia-smi
python -c "import torch; print(torch.cuda.is_available())"
If it prints False: ensure the host NVIDIA driver (and NVIDIA Container Toolkit) is installed and that the container was started with --gpus all.
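A quick assertion you can run at container start or in CI (assumes PyTorch is installed; the image name is a placeholder):
docker run --gpus all --rm my-ml-image \
  python -c "import torch; assert torch.cuda.is_available(), 'no GPU visible'; print(torch.cuda.get_device_name(0))"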
Build-time debugging
# Re-run every step and show full build output
docker build --progress=plain --no-cache -t myapp:debug .
# Speed up repeated dependency installs with a BuildKit cache mount
# (add in the Dockerfile during dev):
# RUN --mount=type=cache,target=/root/.cache pip install -r requirements.txt
Use multi-stage builds to keep debug tools in a dev stage but out of prod images.
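With a multi-stage Dockerfile you can build each flavor explicitly (the stage names debug and runtime are assumptions about your Dockerfile):
docker build --target debug -t myapp:debug .     # keeps shells and profilers
docker build --target runtime -t myapp:latest .  # slim production image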
Worked examples
1) Container exits immediately on start
Symptoms: Status "exited" with non-zero code.
docker logs app
# Output: python: can't open file 'server.py': [Errno 2] No such file or directory
Fix: The CMD references a non-existent file. Override to inspect:
docker run --rm -it --entrypoint sh myapp:latest
ls -l
# Correct Dockerfile CMD to app.py and rebuild
2) Service unreachable from host
Symptoms: docker ps shows 0.0.0.0:8080->8080/tcp but curl localhost:8080 fails.
# Inside container
ss -tulpn | grep 8080
# Listening on 127.0.0.1:8080 only
Fix: Bind app to 0.0.0.0 and restart container.
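The exact fix depends on your server; sketches for two common setups (module and app names are placeholders):
# FastAPI served by uvicorn
uvicorn main:app --host 0.0.0.0 --port 8080
# Flask served by gunicorn
gunicorn --bind 0.0.0.0:8080 app:app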
3) GPU not visible to PyTorch
Symptoms: torch.cuda.is_available() is False.
# Check container runtime
nvidia-smi # not found or no GPUs
# Restart with GPU
docker run --gpus all -it my-ml-image
python -c "import torch; print(torch.cuda.is_available())"
If it is still False, verify the host driver and that the image ships a compatible CUDA/cuDNN build.
4) OOM-killed training job
Symptoms: Exit code 137, logs end abruptly.
docker inspect -f '{{.State.OOMKilled}} {{.State.ExitCode}}' train
# true 137
Fix: Increase memory limit or reduce batch size/precision; stream data from disk.
Guided debugging flow (follow these steps)
- Check status and logs:
docker ps -a
docker logs -n 100 <container>
- If it crashes at start, drop into a shell:
docker run --rm -it --entrypoint sh <image>
Verify files, env, and run the app manually.
- Check networking:
docker port <container>
ss -tulpn # inside the container
Ensure the app listens on 0.0.0.0 and ports are published.
- Check volumes and permissions:
docker inspect -f '{{json .Mounts}}' <container>
ls -l /data
Fix ownership or mount paths.
- Check resources:
docker stats
Look for CPU/RAM spikes; inspect OOMKilled.
- GPU check (if needed):
docker exec -it <container> nvidia-smi
Confirm the container was started with --gpus all.
Exercises
These mirror the interactive tasks below so you can practice locally.
- Exercise 1: Fix a container that exits immediately (missing entry script)
- Exercise 2: Make a Flask app reachable from host (bind to 0.0.0.0 and publish port)
Checklist before you say “fixed”
- Container stays healthy (or expected status) for at least 60 seconds
- Logs are clean: no repeating stack traces
- Port mapping shows expected host:container and the service is reachable
- Volumes mounted and writable where needed
- No OOMKilled and memory within limits
- GPU is visible when required
- Image can rebuild cleanly after changes
Common mistakes and self-check
- Binding to 127.0.0.1 inside container. Self-check: ss -tulpn shows 127.0.0.1. Fix to 0.0.0.0.
- Wrong CMD/ENTRYPOINT. Self-check: docker inspect shows the command; docker logs show missing file/module.
- Forgetting --gpus all. Self-check: nvidia-smi not found or no devices.
- Mounting wrong path. Self-check: docker inspect .Mounts and ls inside container.
- Permission denied on mounted volume. Self-check: id -u; directory ownership.
- Assuming logs go to stdout. Self-check: the app logs to a file instead; tail that file inside the container (see the sketch below).
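For that last case, a sketch (the log path is an assumption):
docker exec -it <container> tail -n 100 -f /app/logs/app.log   # follow the app's own log file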
Practical projects
- Containerize a FastAPI model server with healthcheck; simulate failures (port, OOM) and document fixes.
- Build a training container with a debug flavor (extra tools) and a release flavor; compare sizes and startup.
- Create a small GPU inference container and a validation script that asserts GPU availability on start.
Learning path
- Start: Container basics and Dockerfile fundamentals
- Then: Container Debugging (this page)
- Next: CI/CD for container builds and runtime observability
- Finally: Production hardening (healthchecks, resource limits, autoscaling)
Next steps
- Do the exercises below and verify with the checklist
- Run the Quick Test to check understanding
- Apply the toolbox to your current ML container or model server
How to use the Quick Test
The Quick Test is available to everyone. Logged-in learners get saved progress and can resume later.
Mini challenge
Your API container builds and runs, docker ps shows 0.0.0.0:9000->8000/tcp, but curl localhost:9000 hangs. In 5 minutes, list your first three checks and a command for each. Compare with the checklist above.