
Writing Dockerfiles For ML

Learn to write Dockerfiles for ML for free, with explanations, exercises, and a quick test (for Machine Learning Engineers).

Published: January 1, 2026 | Updated: January 1, 2026

Why this matters

As a Machine Learning Engineer, you ship training and inference code that must run the same way on laptops, CI, and cloud. A well-crafted Dockerfile gives you:

  • Reproducible experiments and model training
  • Fast builds via caching (lower cost, less waiting)
  • Reliable GPU access and compatible CUDA/cuDNN stacks
  • Smaller images for quicker deployments
  • Safer images with non-root users and no secrets baked in

Concept explained simply

A Dockerfile is a recipe for building a container image. Each instruction adds a layer. When a layer's inputs haven't changed since the last build, Docker reuses it from cache. So if you order your Dockerfile from the most stable parts (base OS, system packages, Python deps) to the most frequently changing parts (your code), builds stay fast and predictable.
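
You can watch this cache in action with two builds and a layer listing (demo:latest is a placeholder tag):

docker build -t demo:latest .
docker build -t demo:latest .   # second build: unchanged steps are reported as cached
docker history demo:latest      # lists the layers that make up the image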

Mental model

  • Base image = your starting kitchen
  • RUN/apt + pip install = stocking the pantry
  • COPY code = adding your unique recipe
  • ENTRYPOINT/CMD = how to start cooking
  • .dockerignore = don’t bring unnecessary clutter into the kitchen

Core building blocks for ML Dockerfiles

  • Base images: python:3.10-slim (CPU) or CUDA-enabled images (GPU). Avoid floating latest tags; pin versions.
  • System deps: install in one RUN layer; clean apt caches to keep images small.
  • Python deps: copy only lock/requirements files before installing to maximize caching.
  • Non-root user: reduce risk; write to app dirs without root.
  • ENV vs ARG: ARG for build-time values, ENV for runtime; see the sketch after this list.
  • Multi-stage builds: build heavy stuff first, copy only what you need into a slim runtime.
  • .dockerignore: exclude data, models, __pycache__, and secrets.
  • GPU: use CUDA-capable base and run with --gpus all (host must have NVIDIA drivers/toolkit).
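
A minimal sketch tying a few of these together, showing ARG vs ENV and a non-root user (PY_VERSION and APP_ENV are placeholder names):

# ARG is build-time only; declared before FROM, it can parameterize the base image
ARG PY_VERSION=3.10
FROM python:${PY_VERSION}-slim

# ENV persists into the running container
ENV APP_ENV=production

RUN useradd -m -u 1000 appuser
USER appuser
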
Why caching matters (quick example)

Good:

COPY requirements.txt .
RUN pip install -r requirements.txt
COPY src/ ./src

Bad (breaks cache on every code change):

COPY . .
RUN pip install -r requirements.txt

Choosing a base image

  • CPU training/inference: python:3.10-slim or similar.
  • GPU training/inference: CUDA runtime images (e.g., nvidia/cuda:<version>-runtime-ubuntu22.04) or framework vendor images.
  • Pin versions (e.g., 3.10-slim, specific CUDA numbers) for reproducibility.
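
Tags can still be re-pointed upstream; for the strictest reproducibility you can pin by digest as well. A sketch (the digest is a placeholder):

# Either pin a tag...
FROM python:3.10-slim
# ...or, stricter, pin the digest (docker pull python:3.10-slim prints the real one)
# FROM python:3.10-slim@sha256:<digest>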

Worked examples

Example 1 — CPU training image (scikit-learn)

Goal: Reproducible training without baking data inside the image.

# Dockerfile
FROM python:3.10-slim

ENV PYTHONDONTWRITEBYTECODE=1 \
    PYTHONUNBUFFERED=1 \
    PIP_NO_CACHE_DIR=1

RUN apt-get update \
 && apt-get install -y --no-install-recommends build-essential git \
 && rm -rf /var/lib/apt/lists/*

WORKDIR /app

# 1) Cache-friendly dependency install
COPY requirements.txt ./
RUN pip install --upgrade pip \
 && pip install -r requirements.txt

# 2) Copy only the code last (changes often)
COPY src/ ./src
COPY train.py .

# 3) Non-root for safety; pre-create the models dir so appuser can write to it
RUN useradd -m -u 1000 appuser \
 && mkdir -p /app/models \
 && chown -R appuser:appuser /app
USER appuser

CMD ["python", "train.py"]

requirements.txt

pandas==2.1.4
scikit-learn==1.3.2
joblib==1.3.2

train.py

from pathlib import Path
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
import joblib

# Expect CSV via mounted volume: /data/train.csv
csv_path = Path("/data/train.csv")
if not csv_path.exists():
    print("Missing /data/train.csv. Mount a data volume.")
    raise SystemExit(1)

# Simple demo: first N-1 columns as features, last column as label
df = pd.read_csv(csv_path)
X = df.iloc[:, :-1]
y = df.iloc[:, -1]

model = LogisticRegression(max_iter=200)
model.fit(X, y)
acc = accuracy_score(y, model.predict(X))
print(f"Training complete. In-sample accuracy: {acc:.3f}")

Path("/app/models").mkdir(exist_ok=True)
joblib.dump(model, "/app/models/model.pkl")
print("Saved /app/models/model.pkl")

.dockerignore

__pycache__/
*.pyc
.env
.git
models/
data/

How to run

# Build
docker build -t ml-sklearn-train:cpu .

# Run (mount local data dir containing train.csv)
docker run --rm -v "$PWD/data":/data ml-sklearn-train:cpu
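
If you don't have a dataset handy, a quick way to generate a demo train.csv on the host (a sketch; assumes pandas and scikit-learn are installed locally):

# make_data.py — run on the host before `docker run`
from pathlib import Path
import pandas as pd
from sklearn.datasets import make_classification

X, y = make_classification(n_samples=200, n_features=4, random_state=42)
df = pd.DataFrame(X, columns=[f"f{i}" for i in range(4)])
df["label"] = y
Path("data").mkdir(exist_ok=True)
df.to_csv("data/train.csv", index=False)
print("Wrote data/train.csv")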

Example 2 — GPU training image (PyTorch)

Goal: Use CUDA-enabled base and verify GPU visibility.

# Dockerfile.gpu
FROM nvidia/cuda:12.1.1-cudnn8-runtime-ubuntu22.04

ENV DEBIAN_FRONTEND=noninteractive \
    PYTHONDONTWRITEBYTECODE=1 \
    PYTHONUNBUFFERED=1 \
    PIP_NO_CACHE_DIR=1

RUN apt-get update \
 && apt-get install -y --no-install-recommends python3 python3-pip python3-venv \
 && rm -rf /var/lib/apt/lists/*

RUN python3 -m venv /opt/venv \
 && /opt/venv/bin/pip install --upgrade pip \
 && /opt/venv/bin/pip install torch==2.2.0+cu121 torchvision==0.17.0+cu121 --index-url https://download.pytorch.org/whl/cu121

ENV PATH="/opt/venv/bin:$PATH"

WORKDIR /app
COPY gpu_check.py .

RUN useradd -m -u 1000 appuser
USER appuser

CMD ["python", "gpu_check.py"]

gpu_check.py

import torch
print("CUDA available:", torch.cuda.is_available())
print("Device count:", torch.cuda.device_count())
if torch.cuda.is_available():
    print("Device name:", torch.cuda.get_device_name(0))

How to run (host must have NVIDIA drivers and the NVIDIA Container Toolkit)

docker build -f Dockerfile.gpu -t ml-pytorch-train:gpu .
docker run --rm --gpus all ml-pytorch-train:gpu
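
If CUDA shows as unavailable, first confirm the container can see the GPU at all, using a bare CUDA image (the tag is an example; pick one your driver supports):

docker run --rm --gpus all nvidia/cuda:12.1.1-base-ubuntu22.04 nvidia-smi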

Example 3 — Multi-stage: slim inference image (FastAPI)

Goal: Build dependencies in one stage, copy minimal runtime with a non-root user.

# Dockerfile.infer
# 1) Builder stage
FROM python:3.10-slim AS builder
ENV PIP_NO_CACHE_DIR=1
RUN apt-get update \
 && apt-get install -y --no-install-recommends build-essential \
 && rm -rf /var/lib/apt/lists/*
WORKDIR /build
COPY requirements.txt .
RUN python -m venv /opt/venv \
 && /opt/venv/bin/pip install --upgrade pip \
 && /opt/venv/bin/pip install -r requirements.txt

# 2) Runtime stage
FROM python:3.10-slim AS runtime
ENV PATH="/opt/venv/bin:$PATH" \
    PYTHONDONTWRITEBYTECODE=1 \
    PYTHONUNBUFFERED=1
WORKDIR /app
# Copy only the virtualenv and app sources
COPY --from=builder /opt/venv /opt/venv
COPY app ./app

# Security: run as non-root
RUN useradd -m -u 1000 appuser
USER appuser

EXPOSE 8000
CMD ["uvicorn", "app.main:api", "--host", "0.0.0.0", "--port", "8000"]

requirements.txt

fastapi==0.109.0
uvicorn[standard]==0.25.0
scikit-learn==1.3.2
joblib==1.3.2

app/main.py

from fastapi import FastAPI
api = FastAPI()

@api.get("/")
def root():
    return {"status": "ok"}
How to run

docker build -f Dockerfile.infer -t ml-fastapi-infer:cpu .
docker run --rm -p 8000:8000 ml-fastapi-infer:cpu
# Visit http://localhost:8000 (returns {"status": "ok"})
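
To compare image sizes (useful for the exercise self-check later), standard Docker commands:

docker image ls ml-fastapi-infer                           # repository listing with sizes
docker image inspect -f '{{.Size}}' ml-fastapi-infer:cpu   # single image size in bytes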

Pre-build checklist

  • Pin base image and key package versions
  • Place dependency install before copying the full source
  • Use a .dockerignore to exclude data, models, and secrets
  • Install system packages in one RUN and clean apt lists
  • Create a non-root user and set WORKDIR
  • Choose CMD/ENTRYPOINT deliberately (see the sketch after this list); expose ports only if needed
  • For GPU: choose a CUDA-matching base and test with a minimal script
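
ENTRYPOINT vs CMD in one sketch: ENTRYPOINT fixes the program, CMD supplies overridable default arguments (--epochs is a placeholder flag):

ENTRYPOINT ["python", "train.py"]
CMD ["--epochs", "10"]

# docker run my-image              -> python train.py --epochs 10
# docker run my-image --epochs 50  -> python train.py --epochs 50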

Exercises

These mirror the practice section. You can complete them locally. The quick test is available for free; if you log in, your progress is saved.

Exercise 1 — CPU training image (cache-friendly)

  1. Create files: Dockerfile, requirements.txt, train.py, .dockerignore using Example 1 as a guide.
  2. Build the image: docker build -t ex1-sklearn:cpu .
  3. Prepare ./data/train.csv with simple numeric columns and a binary label.
  4. Run: docker run --rm -v "$PWD/data":/data ex1-sklearn:cpu
  • Self-check: Edit only train.py and rebuild; the dependency layers should be cached (much faster).
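
One way to run the self-check is to time both builds:

time docker build -t ex1-sklearn:cpu .   # first build installs dependencies
# edit only train.py, then:
time docker build -t ex1-sklearn:cpu .   # much faster; dependency layers come from cache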

Exercise 2 — Multi-stage inference (non-root, slim)

  1. Create Dockerfile.infer, requirements.txt, and app/main.py as in Example 3.
  2. Build: docker build -f Dockerfile.infer -t ex2-infer:cpu .
  3. Run: docker run --rm -p 8000:8000 ex2-infer:cpu
  4. Open http://localhost:8000 and confirm JSON response.
  • Self-check: Compare image size with and without multi-stage (single-stage typically larger).

Common mistakes and self-check

  • Copying full source before installing deps, breaking cache. Fix: copy only requirements first.
  • Using latest tags. Fix: pin versions.
  • Leaving apt caches. Fix: && rm -rf /var/lib/apt/lists/*
  • Running as root. Fix: create and switch to non-root user.
  • Baking large datasets or models into images. Fix: mount volumes or download at runtime.
  • Leaking secrets via ENV or COPY. Fix: use runtime env vars or secret managers; add secret files to .dockerignore (see the BuildKit sketch after this list).
  • Mismatched CUDA/toolkit vs framework wheels. Fix: match CUDA versions and test with a small script.
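
For the secrets point: if a build genuinely needs a credential (e.g., a private package index), BuildKit secret mounts keep it out of image layers. A sketch — the id pip_token and the index URL are placeholders:

# Dockerfile (requires BuildKit)
RUN --mount=type=secret,id=pip_token \
    pip install --index-url "https://user:$(cat /run/secrets/pip_token)@pypi.example.com/simple" \
    -r requirements.txt

# Build with:
# docker build --secret id=pip_token,src=./pip_token.txt -t my-image .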

Practical projects

  • Reproducible training: Create a CPU image for a tabular model with dependency caching and non-root user.
  • GPU training: Build a PyTorch CUDA image; run a mini training loop; log GPU name at startup.
  • Slim inference: Ship a FastAPI model service using multi-stage build and measure image size difference.

Who this is for

  • ML Engineers and Data Scientists who need portable training and inference environments
  • MLOps practitioners standardizing project containers

Prerequisites

  • Basic Python packaging and virtual environments
  • Command line and Docker basics (build, run, volumes)
  • Optional: NVIDIA GPU on host for GPU exercise

Learning path

  • Before: Docker fundamentals, Linux basics
  • Now: Writing Dockerfiles for ML (this lesson)
  • Next: Docker Compose for multi-service ML stacks; CI build caching; model serving patterns

Next steps

  • Automate builds in CI with pinned images and vulnerability scans
  • Create team templates for CPU/GPU training and inference
  • Adopt multi-stage builds and non-root defaults across repos

Mini challenge

Take your current ML repo and ship two images:

  • Training image (CPU) with cached deps and a train.py entrypoint
  • Inference image (slim) exposing port 8000 with a health endpoint

Hint

Start from Example 1 and Example 3. Verify cache by timing rebuilds after changing only code.

Ready? Take the quick test below for free. If you log in, your test progress will be saved.

Practice Exercises

2 exercises to complete

Instructions

  1. Create Dockerfile, requirements.txt, train.py, and .dockerignore following Example 1.
  2. Build: docker build -t ex1-sklearn:cpu .
  3. Ensure you have ./data/train.csv. Run: docker run --rm -v "$PWD/data":/data ex1-sklearn:cpu
  4. Edit only train.py (e.g., change a print). Rebuild and observe faster caching.

Expected Output

Console shows training accuracy and "Saved /app/models/model.pkl". The second build should be significantly faster (dependency layers cached).

Writing Dockerfiles For ML — Quick Test

Test your knowledge with 10 questions. Pass with 70% or higher.

