Why this matters
As a Machine Learning Engineer, your models must ship reliably and run fast. Poor dependency management and bloated runtimes lead to slow builds, large images, cold-start delays, security risks, and inconsistent behavior across environments. Optimizing dependencies and runtimes means faster deployments, smaller images, reproducible builds, and lower costs.
- Real task: Build a CPU-only inference image under 800MB that starts in under 1s.
- Real task: Produce a GPU image compatible with a specific CUDA and driver version.
- Real task: Ensure reproducible builds from a clean CI machine.
Concept explained simply
Dependency and runtime optimization is choosing only what you need (base image, OS packages, Python/conda packages, model runtimes) and arranging Docker layers so builds are fast, reproducible, and minimal. For GPU, it also means matching CUDA/CuDNN precisely to your framework wheels.
Mental model
- Start lean: pick the smallest base that can run your code.
- Build then trim: compile in a builder stage, copy only artifacts to a slim runtime stage.
- Freeze the recipe: pin versions and hashes so results don’t change unexpectedly.
- Cache wisely: separate infrequent from frequent changes to reuse layers.
- Match the GPU stack: CUDA runtime version must match your framework build.
Core principles
- Minimal base images: prefer slim or distroless where feasible; avoid full OS images unless required.
- Multi-stage builds: compile native deps (e.g., numpy, opencv) in a builder; copy wheels/binaries into a clean runtime.
- Layer caching: copy dependency files (requirements.txt, lock files) before app code; install deps in a separate step.
- Pin versions and hashes: use exact versions; when possible, include hash checks for deterministic installs.
- OS package hygiene: combine apt-get update with install in one RUN; remove apt lists and build tools after use.
- .dockerignore: exclude data, venvs, build artifacts, and caches to keep context small.
- Non-root runtime: run as a non-root user for security and least privilege.
- Runtime choice: CPU vs GPU; for GPU, use the matching CUDA runtime image (not devel) unless you need to compile CUDA code inside the container.
- Environment parity: align Python version and libc (glibc vs musl) with your target environment.
- Deterministic builds: use lock files (requirements.txt with pins, poetry.lock, conda-lock) and consistent indexes.
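To see whether an image actually follows these principles, inspect its layers. A quick check, assuming the image is tagged my-app:latest (the tag is illustrative):
# Per-layer sizes reveal bloated COPY/RUN steps and cache-unfriendly ordering
docker history my-app:latest
# Total size of the image
docker images my-app:latest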
Worked examples
Example 1 — Shrink a CPU FastAPI inference image
Naive Dockerfile:
FROM python:3.11
WORKDIR /app
COPY . /app
RUN pip install -r requirements.txt
EXPOSE 8080
CMD ["uvicorn", "app:app", "--host", "0.0.0.0", "--port", "8080"]
Issues: large base, no caching for deps, copies junk, runs as root.
Improved:
# Builder
FROM python:3.11-slim AS builder
ENV PIP_NO_CACHE_DIR=1 \
PYTHONDONTWRITEBYTECODE=1 \
PYTHONUNBUFFERED=1
RUN apt-get update && apt-get install -y --no-install-recommends build-essential gcc && rm -rf /var/lib/apt/lists/*
WORKDIR /wheels
COPY requirements.txt ./
RUN pip wheel --wheel-dir=/wheels -r requirements.txt
# Runtime
FROM python:3.11-slim AS runtime
ENV PYTHONDONTWRITEBYTECODE=1 PYTHONUNBUFFERED=1
WORKDIR /app
COPY --from=builder /wheels /wheels
RUN pip install --no-cache-dir --no-compile /wheels/*.whl
COPY . /app
RUN useradd -m appuser && chown -R appuser:appuser /app
USER appuser
EXPOSE 8080
CMD ["uvicorn", "app:app", "--host", "0.0.0.0", "--port", "8080"]
Typical result: significantly smaller image and faster rebuilds (varies by project).
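To quantify the difference, build both variants and compare sizes (file names and tags are illustrative; BuildKit is assumed):
# Naive version kept in a separate file for comparison
DOCKER_BUILDKIT=1 docker build -f Dockerfile.naive -t inference:naive .
# Improved multi-stage version
DOCKER_BUILDKIT=1 docker build -t inference:slim .
docker images | grep inference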
Example 2 — Clean up OS packages and context
RUN --mount=type=cache,target=/var/cache/apt \
apt-get update && apt-get install -y --no-install-recommends \
libgomp1 \
&& rm -rf /var/lib/apt/lists/*
Plus ensure .dockerignore includes:
__pycache__/
*.pyc
.env
.venv/
.data/
models/
.git/
dist/
node_modules/
Effect: smaller context, fewer invalidated layers, smaller final image.
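Note that Debian-based images ship an apt config (docker-clean) that discards downloaded packages after install, so the cache mount above has limited effect unless that config is disabled. A sketch of one variant, based on the pattern documented for BuildKit cache mounts:
RUN rm -f /etc/apt/apt.conf.d/docker-clean \
    && echo 'Binary::apt::APT::Keep-Downloaded-Packages "true";' > /etc/apt/apt.conf.d/keep-cache
RUN --mount=type=cache,target=/var/cache/apt,sharing=locked \
    --mount=type=cache,target=/var/lib/apt,sharing=locked \
    apt-get update && apt-get install -y --no-install-recommends libgomp1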
Example 3 — GPU runtime with CUDA
ARG CUDA_VERSION=12.2.0
FROM nvidia/cuda:${CUDA_VERSION}-runtime-ubuntu22.04 AS runtime
ENV NVIDIA_VISIBLE_DEVICES=all \
NVIDIA_DRIVER_CAPABILITIES=compute,utility \
PYTHONDONTWRITEBYTECODE=1 PYTHONUNBUFFERED=1
RUN apt-get update && apt-get install -y --no-install-recommends python3 python3-pip && rm -rf /var/lib/apt/lists/*
WORKDIR /app
COPY requirements-gpu.txt ./
# Use the correct torch/TF build that matches CUDA_VERSION
# Example (adjust versions to match your CUDA):
# RUN pip install --no-cache-dir --extra-index-url https://download.pytorch.org/whl/cu121 torch==<ver>+cu121 torchvision==<ver>+cu121
RUN pip install --no-cache-dir -r requirements-gpu.txt
COPY . /app
RUN useradd -m appuser && chown -R appuser:appuser /app
USER appuser
CMD ["python3", "serve.py"]
Notes: Use CUDA runtime (not devel) for inference. Ensure framework wheel matches CUDA version. Host must have a compatible NVIDIA driver.
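A quick way to verify the pairing on a host with the NVIDIA Container Toolkit installed (image tag is illustrative):
docker run --rm --gpus all my-gpu-image \
  python3 -c "import torch; print(torch.version.cuda, torch.cuda.is_available())"
If this prints the expected CUDA version and True, the framework wheel, runtime image, and host driver are compatible.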
Example 4 — Deterministic installs with lock + hashes
Use a locked requirements file:
# requirements.txt (pinned)
fastapi==0.111.0 --hash=sha256:<hash1>
uvicorn==0.30.0 --hash=sha256:<hash2>
Then:
RUN pip install --require-hashes -r requirements.txt
Benefit: exact, reproducible installs across machines.
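Because --require-hashes needs a hash for every package, including transitive dependencies, it is easiest to generate the file with a tool. One option, assuming pip-tools and a top-level requirements.in file:
pip install pip-tools
pip-compile --generate-hashes --output-file=requirements.txt requirements.in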
Exercises
Complete these tasks locally. A simple CPU machine is enough unless noted. Mirror answers in the Exercises panel below.
Starter files and tips
Starter Dockerfile (ex1):
FROM python:3.11
WORKDIR /app
COPY . /app
RUN pip install -r requirements.txt
EXPOSE 8080
CMD ["uvicorn", "app:app", "--host", "0.0.0.0", "--port", "8080"]
Checklist you should hit:
- Use slim/minimal base images.
- Use multi-stage to build wheels, copy only what’s needed.
- Pin dependency versions (and hashes if possible).
- Combine apt-get update/install, remove lists after install.
- Use .dockerignore to exclude junk.
- Run as a non-root user.
- Separate dependency install from source copy to maximize cache.
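A Dockerfile linter catches several of these checklist items automatically; one option, assuming hadolint is installed:
hadolint Dockerfile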
Common mistakes and self-check
- Using a full OS image when slim/distroless suffices. Self-check: List native tools actually needed at runtime.
- Installing build tools in the final image. Self-check: Are gcc/build-essential in your final stage? If yes, move to builder.
- Not pinning versions. Self-check: requirements show exact versions? If not, pin them.
- Invalidating cache by copying source before dependencies. Self-check: Does Dockerfile copy requirements before app code?
- CUDA mismatch. Self-check: Does your framework wheel match the CUDA runtime tag?
- Leaving apt cache and lists. Self-check: rm -rf /var/lib/apt/lists/* at the end of apt RUN?
- Running as root. Self-check: USER set to a non-root account?
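Several of these self-checks can be run directly against a built image (tag is illustrative):
# Should print a non-root user
docker run --rm my-app:latest whoami
# The final stage should not contain a compiler
docker run --rm my-app:latest sh -c "command -v gcc || echo 'gcc not found (good)'"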
Mini challenge
Pick any of your current images. In 30 minutes, apply: slim base, multi-stage build, pinned requirements, and non-root user. Measure image size and cold start before/after. Write down three changes that delivered the biggest wins.
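For the before/after measurements, something like this is enough (tags, port, and endpoint are illustrative):
# Compare sizes across tags
docker images my-app
# Rough cold start: time from container start until the service responds
docker run -d --rm -p 8080:8080 --name coldstart my-app:after
time sh -c 'until curl -sf http://localhost:8080/health > /dev/null; do sleep 0.1; done'
docker stop coldstart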
Who this is for
- Machine Learning Engineers deploying inference/training services.
- Data/Platform Engineers managing ML microservices and batch jobs.
- MLOps engineers maintaining GPU fleets.
Prerequisites
- Basic Docker knowledge (images, layers, Dockerfile, build, run).
- Working Python project (FastAPI/Flask/CLI) to containerize.
- For GPU: access to NVIDIA driver and nvidia-container-runtime.
Learning path
- Start with minimal base images and .dockerignore.
- Add multi-stage builds for native deps.
- Pin versions and enable deterministic installs.
- Optimize OS layers and remove build-time packages.
- Handle GPU runtimes with correct CUDA/framework pairing.
- Adopt non-root users, healthcheck, and sensible entrypoints.
Practical projects
- CPU inference service: FastAPI with a small sklearn model; image target <= 500–800MB.
- GPU inference service: PyTorch ResNet; ensure CUDA match and measure throughput.
- Batch job image: nightly feature computation; validate deterministic installs by rebuilding on a clean runner.
Next steps
- Automate image scans and size checks in CI.
- Create a base image you own (internal standard) and inherit from it.
- Document your dependency policy (pinning, hashes, approved indexes).
Quick Test
Everyone can take the test below for free. Only logged-in users have their progress saved.