Why this matters
As a Machine Learning Engineer, your models must ship reliably and run fast. Poor dependency management and bloated runtimes lead to slow builds, large images, cold-start delays, security risks, and inconsistent behavior across environments. Optimizing dependencies and runtimes means faster deployments, smaller images, reproducible builds, and lower costs.
- Real task: Build a CPU-only inference image under 800MB that starts in under 1s.
- Real task: Produce a GPU image compatible with a specific CUDA and driver version.
- Real task: Ensure reproducible builds from a clean CI machine.
Concept explained simply
Dependency and runtime optimization is choosing only what you need (base image, OS packages, Python/conda packages, model runtimes) and arranging Docker layers so builds are fast, reproducible, and minimal. For GPU, it also means matching CUDA/CuDNN precisely to your framework wheels.
Mental model
- Start lean: pick the smallest base that can run your code.
- Build then trim: compile in a builder stage, copy only artifacts to a slim runtime stage.
- Freeze the recipe: pin versions and hashes so results don’t change unexpectedly.
- Cache wisely: separate infrequent from frequent changes to reuse layers.
- Match the GPU stack: CUDA runtime version must match your framework build.
Core principles
- Minimal base images: prefer slim or distroless where feasible; avoid full OS images unless required.
- Multi-stage builds: compile native deps (e.g., numpy, opencv) in a builder; copy wheels/binaries into a clean runtime.
- Layer caching: copy dependency files (requirements.txt, lock files) before app code; install deps in a separate step.
- Pin versions and hashes: use exact versions; when possible, include hash checks for deterministic installs.
- OS package hygiene: combine apt-get update with install in one RUN; remove apt lists and build tools after use.
- .dockerignore: exclude data, venvs, build artifacts, and caches to keep context small.
- Non-root runtime: run as a non-root user for security and least privilege.
- Runtime choice: CPU vs GPU; for GPU, use the matching CUDA runtime image (not devel) unless you need to compile CUDA code inside the container.
- Environment parity: align Python version and libc (glibc vs musl) with your target environment.
- Deterministic builds: use lock files (requirements.txt with pins, poetry.lock, conda-lock) and consistent indexes.
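To see whether an image actually follows these principles, inspect its layers. A quick check, assuming the image is tagged my-app:latest (the tag is illustrative):
# Per-layer sizes reveal bloated COPY/RUN steps and cache-unfriendly ordering
docker history my-app:latest
# Total size of the image
docker images my-app:latest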
Worked examples
Example 1 — Shrink a CPU FastAPI inference image
Naive Dockerfile:
FROM python:3.11
WORKDIR /app
COPY . /app
RUN pip install -r requirements.txt
EXPOSE 8080
CMD ["uvicorn", "app:app", "--host", "0.0.0.0", "--port", "8080"]
Issues: large base, no caching for deps, copies junk, runs as root.
Improved:
# Builder
FROM python:3.11-slim AS builder
ENV PIP_NO_CACHE_DIR=1 \
PYTHONDONTWRITEBYTECODE=1 \
PYTHONUNBUFFERED=1
RUN apt-get update && apt-get install -y --no-install-recommends build-essential gcc && rm -rf /var/lib/apt/lists/*
WORKDIR /wheels
COPY requirements.txt ./
RUN pip wheel --wheel-dir=/wheels -r requirements.txt
# Runtime
FROM python:3.11-slim AS runtime
ENV PYTHONDONTWRITEBYTECODE=1 PYTHONUNBUFFERED=1
WORKDIR /app
COPY --from=builder /wheels /wheels
RUN pip install --no-cache-dir --no-compile /wheels/*.whl
COPY . /app
RUN useradd -m appuser && chown -R appuser:appuser /app
USER appuser
EXPOSE 8080
CMD ["uvicorn", "app:app", "--host", "0.0.0.0", "--port", "8080"]
Typical result: significantly smaller image and faster rebuilds (varies by project).
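To quantify the difference, build both variants and compare sizes (file names and tags are illustrative; BuildKit is assumed):
# Naive version kept in a separate file for comparison
DOCKER_BUILDKIT=1 docker build -f Dockerfile.naive -t inference:naive .
# Improved multi-stage version
DOCKER_BUILDKIT=1 docker build -t inference:slim .
docker images | grep inference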
Example 2 — Clean up OS packages and context
RUN --mount=type=cache,target=/var/cache/apt \
apt-get update && apt-get install -y --no-install-recommends \
libgomp1 \
&& rm -rf /var/lib/apt/lists/*
Plus ensure .dockerignore includes:
__pycache__/
*.pyc
.env
.venv/
.data/
models/
.git/
dist/
node_modules/
Effect: smaller context, fewer invalidated layers, smaller final image.
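Note that Debian-based images ship an apt config (docker-clean) that discards downloaded packages after install, so the cache mount above has limited effect unless that config is disabled. A sketch of one variant, based on the pattern documented for BuildKit cache mounts:
RUN rm -f /etc/apt/apt.conf.d/docker-clean \
    && echo 'Binary::apt::APT::Keep-Downloaded-Packages "true";' > /etc/apt/apt.conf.d/keep-cache
RUN --mount=type=cache,target=/var/cache/apt,sharing=locked \
    --mount=type=cache,target=/var/lib/apt,sharing=locked \
    apt-get update && apt-get install -y --no-install-recommends libgomp1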
Example 3 — GPU runtime with CUDA
ARG CUDA_VERSION=12.2.0
FROM nvidia/cuda:${CUDA_VERSION}-runtime-ubuntu22.04 AS runtime
ENV NVIDIA_VISIBLE_DEVICES=all \
NVIDIA_DRIVER_CAPABILITIES=compute,utility \
PYTHONDONTWRITEBYTECODE=1 PYTHONUNBUFFERED=1
RUN apt-get update && apt-get install -y --no-install-recommends python3 python3-pip && rm -rf /var/lib/apt/lists/*
WORKDIR /app
COPY requirements-gpu.txt ./
# Use the correct torch/TF build that matches CUDA_VERSION
# Example (adjust versions to match your CUDA):
# RUN pip install --no-cache-dir --extra-index-url https://download.pytorch.org/whl/cu121 torch==<ver>+cu121 torchvision==<ver>+cu121
RUN pip install --no-cache-dir -r requirements-gpu.txt
COPY . /app
RUN useradd -m appuser && chown -R appuser:appuser /app
USER appuser
CMD ["python3", "serve.py"]
Notes: Use CUDA runtime (not devel) for inference. Ensure framework wheel matches CUDA version. Host must have a compatible NVIDIA driver.
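A quick way to verify the pairing on a host with the NVIDIA Container Toolkit installed (image tag is illustrative):
docker run --rm --gpus all my-gpu-image \
  python3 -c "import torch; print(torch.version.cuda, torch.cuda.is_available())"
If this prints the expected CUDA version and True, the framework wheel, runtime image, and host driver are compatible.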
Example 4 — Deterministic installs with lock + hashes
Use a locked requirements file:
# requirements.txt (pinned)
fastapi==0.111.0 --hash=sha256:<hash1>
uvicorn==0.30.0 --hash=sha256:<hash2>
Then:
RUN pip install --require-hashes -r requirements.txt
Benefit: exact, reproducible installs across machines.
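Because --require-hashes needs a hash for every package, including transitive dependencies, it is easiest to generate the file with a tool. One option, assuming pip-tools and a top-level requirements.in file:
pip install pip-tools
pip-compile --generate-hashes --output-file=requirements.txt requirements.in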
Exercises
Complete these tasks locally. A simple CPU machine is enough unless noted. Mirror answers in the Exercises panel below.
Starter files and tips
Starter Dockerfile (ex1):
FROM python:3.11
WORKDIR /app
COPY . /app
RUN pip install -r requirements.txt
EXPOSE 8080
CMD ["uvicorn", "app:app", "--host", "0.0.0.0", "--port", "8080"]
Checklist you should hit:
- Use slim/minimal base images.
- Use multi-stage to build wheels, copy only what’s needed.
- Pin dependency versions (and hashes if possible).
- Combine apt-get update/install, remove lists after install.
- Use .dockerignore to exclude junk.
- Run as a non-root user.
- Separate dependency install from source copy to maximize cache.
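A Dockerfile linter catches several of these checklist items automatically; one option, assuming hadolint is installed:
hadolint Dockerfile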
Common mistakes and self-check
- Using a full OS image when slim/distroless suffices. Self-check: List native tools actually needed at runtime.
- Installing build tools in the final image. Self-check: Are gcc/build-essential in your final stage? If yes, move to builder.
- Not pinning versions. Self-check: requirements show exact versions? If not, pin them.
- Invalidating cache by copying source before dependencies. Self-check: Does Dockerfile copy requirements before app code?
- CUDA mismatch. Self-check: Does your framework wheel match the CUDA runtime tag?
- Leaving apt cache and lists. Self-check: rm -rf /var/lib/apt/lists/* at the end of apt RUN?
- Running as root. Self-check: USER set to a non-root account?
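Several of these self-checks can be run directly against a built image (tag is illustrative):
# Should print a non-root user
docker run --rm my-app:latest whoami
# The final stage should not contain a compiler
docker run --rm my-app:latest sh -c "command -v gcc || echo 'gcc not found (good)'"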
Mini challenge
Pick any of your current images. In 30 minutes, apply: slim base, multi-stage build, pinned requirements, and non-root user. Measure image size and cold start before/after. Write down three changes that delivered the biggest wins.
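For the before/after measurements, something like this is enough (tags, port, and endpoint are illustrative):
# Compare sizes across tags
docker images my-app
# Rough cold start: time from container start until the service responds
docker run -d --rm -p 8080:8080 --name coldstart my-app:after
time sh -c 'until curl -sf http://localhost:8080/health > /dev/null; do sleep 0.1; done'
docker stop coldstart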
Who this is for
- Machine Learning Engineers deploying inference/training services.
- Data/Platform Engineers managing ML microservices and batch jobs.
- MLOps engineers maintaining GPU fleets.
Prerequisites
- Basic Docker knowledge (images, layers, Dockerfile, build, run).
- Working Python project (FastAPI/Flask/CLI) to containerize.
- For GPU: access to NVIDIA driver and nvidia-container-runtime.
Learning path
- Start with minimal base images and .dockerignore.
- Add multi-stage builds for native deps.
- Pin versions and enable deterministic installs.
- Optimize OS layers and remove build-time packages.
- Handle GPU runtimes with correct CUDA/framework pairing.
- Adopt non-root users, healthcheck, and sensible entrypoints.
Practical projects
- CPU inference service: FastAPI with a small sklearn model; image target <= 500–800MB.
- GPU inference service: PyTorch ResNet; ensure CUDA match and measure throughput.
- Batch job image: nightly feature computation; validate deterministic installs by rebuilding on a clean runner.
Next steps
- Automate image scans and size checks in CI.
- Create a base image you own (internal standard) and inherit from it.
- Document your dependency policy (pinning, hashes, approved indexes).
Quick Test
Everyone can take the test below for free. Only logged-in users have their progress saved.