
Containerization And Images

Learn Containerization And Images for MLOps Engineers for free: roadmap, examples, subskills, and a skill exam.

Published: January 4, 2026 | Updated: January 4, 2026

Why this skill matters for MLOps Engineers

Containers are how you ship training jobs, batch pipelines, and realtime model services reliably across laptops, CI, and production clusters. Strong image practices give you reproducibility, faster builds, smaller attack surface, and predictable launches on CPUs and GPUs. This skill unlocks consistent deployments, easier debugging, safer rollbacks, and rapid iteration.

Who this is for

  • MLOps engineers standardizing training and serving environments.
  • Data scientists moving notebooks to production with minimal friction.
  • Platform engineers designing reliable ML build and deploy pipelines.

Prerequisites

  • Comfort with Python and the command line.
  • Basic understanding of model training and serving (e.g., FastAPI, Flask, or similar).
  • Docker installed locally; optional: access to a GPU machine if you want to try GPU images.

What you will learn

  • Write secure, cache-efficient Dockerfiles for training and serving.
  • Pin and lock dependencies for reproducible builds.
  • Build GPU-capable images and validate CUDA availability.
  • Minimize image size with multi-stage builds and slim bases.
  • Tag, version, and roll back images safely.
  • Scan images for vulnerabilities and misconfigurations.
  • Use Docker Compose for local multi-service ML stacks.

Learning path (practical roadmap)

  1. Foundations: Build a minimal Python training image and run a script.
    • Create a Dockerfile with a slim Python base.
    • Copy a requirements file, install, and run a simple train.py.
  2. Dependency discipline: Pin every version; create and use a lockfile.
    • Repeat the build with explicit versions and compare layer cache hits.
  3. Serving: Containerize a FastAPI model server and expose a port.
  4. GPU basics: Switch to a CUDA base and verify GPU access with nvidia-smi (if available).
  5. Optimization: Apply multi-stage builds and reduce image size; add non-root user.
  6. Security & tagging: Scan the image; adopt a tagging convention like app:1.2.0-20240101-sha.
  7. Local stack: Use Docker Compose to run training + serving + cache (e.g., Redis) locally.
Deep dive: Layer caching and multi-stage builds

Install dependencies before copying source code to maximize cache reuse. Use multi-stage builds to compile heavy artifacts once and copy only runtime needs into a slim final image.

# bad: cache breaks on any source change
COPY . /app
RUN pip install -r /app/requirements.txt

# better: stable layers first
COPY requirements.txt .
RUN pip install -r requirements.txt
COPY . .

GPU container quickstart checklist
  • Use an NVIDIA CUDA base matching your framework (e.g., PyTorch/TensorFlow).
  • Install the NVIDIA Container Toolkit on the host.
  • Run with --gpus all and verify nvidia-smi inside the container.
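
A quick host-side smoke test before building your own image is to run nvidia-smi from a stock CUDA base; the tag below is illustrative, so pick one matching your target CUDA version:

# should print the GPU table if drivers + container toolkit are set up correctly
docker run --rm --gpus all nvidia/cuda:12.1.1-base-ubuntu22.04 nvidia-smi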

Worked examples

Example 1: Dockerfile for a training job (with pinning and cache)

# Dockerfile.train
FROM python:3.11-slim as base
ENV PYTHONDONTWRITEBYTECODE=1 \
    PYTHONUNBUFFERED=1 \
    PIP_NO_CACHE_DIR=1

# system deps only if needed (gcc, git, etc.)
RUN apt-get update && apt-get install -y --no-install-recommends build-essential git && rm -rf /var/lib/apt/lists/*

WORKDIR /app
# 1) pin dependencies
COPY requirements.txt ./
# example requirements.txt
# numpy==1.26.0
# pandas==2.1.1
# scikit-learn==1.3.1
# joblib==1.3.2
RUN pip install --upgrade pip && pip install -r requirements.txt

# 2) copy code last to keep cache hits when code changes
COPY . .

# 3) run
CMD ["python", "train.py"]

Tip: Build with a lockfile (e.g., pip-compile) to avoid transitive drift.
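
One possible pip-tools workflow, assuming your direct dependencies live in a requirements.in file (the file name and image tag here are illustrative):

# compile a fully pinned requirements.txt, including transitive dependencies
pip install pip-tools
pip-compile requirements.in -o requirements.txt

# rebuild; identical inputs now produce identical dependency sets
docker build -f Dockerfile.train -t myorg/model-train:0.1 .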

Example 2: FastAPI serving image (non-root user, slim)

# Dockerfile.serve
FROM python:3.11-slim
ENV PYTHONDONTWRITEBYTECODE=1 PYTHONUNBUFFERED=1

# create non-root user
RUN useradd -m appuser
WORKDIR /app

COPY requirements-serve.txt .
# fastapi==0.110.0
# uvicorn[standard]==0.27.1
# scikit-learn==1.3.1
RUN pip install --no-cache-dir -r requirements-serve.txt

COPY . .
USER appuser
EXPOSE 8080
CMD ["uvicorn", "server:app", "--host", "0.0.0.0", "--port", "8080"]

Example 3: GPU image basics for training or inference

# Dockerfile.gpu
FROM nvidia/cuda:12.1.1-cudnn8-runtime-ubuntu22.04

# minimal tools
RUN apt-get update && apt-get install -y --no-install-recommends python3 python3-pip && rm -rf /var/lib/apt/lists/*

WORKDIR /app
COPY requirements-gpu.txt .
# torch==2.1.0+cu121 (match CUDA) from an appropriate index
# torchvision==0.16.0+cu121
RUN pip3 install --upgrade pip && pip3 install -r requirements-gpu.txt

COPY . .
CMD ["python3", "infer.py"]

# Run (host must have NVIDIA drivers + container toolkit)
docker run --rm --gpus all -v $PWD:/app -w /app myorg/model-gpu:0.1 nvidia-smi
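
If the image bundles PyTorch as sketched in requirements-gpu.txt, a framework-level check is a useful follow-up to nvidia-smi (same placeholder tag as above):

# confirm the framework sees the GPU, not just the driver
docker run --rm --gpus all myorg/model-gpu:0.1 \
  python3 -c "import torch; print(torch.cuda.is_available(), torch.cuda.device_count())"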

Example 4: Multi-stage to minimize size (build then copy)

# Dockerfile.multi
FROM python:3.11 as builder
WORKDIR /w
COPY requirements.txt .
RUN pip install --user -r requirements.txt
COPY . .

FROM python:3.11-slim
ENV PATH=/root/.local/bin:$PATH
WORKDIR /app
# copy only installed wheels/site-packages and needed files
COPY --from=builder /root/.local /root/.local
COPY --from=builder /w/app.py /app/app.py
CMD ["python", "app.py"]

Example 5: Local dev with Docker Compose

# docker-compose.yml
services:
  train:
    build:
      context: .
      dockerfile: Dockerfile.train
    volumes:
      - ./:/app
    command: ["python", "train.py", "--epochs", "2"]

  serve:
    build:
      context: .
      dockerfile: Dockerfile.serve
    ports:
      - "8080:8080"
    environment:
      - MODEL_PATH=/app/model.bin
    depends_on:
      - train

Run the stack, wait for training to produce the artifact, then hit the serving endpoint locally.
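
With the compose file above, that flow could look like the following; the /predict path and JSON payload are only examples and depend on how your serving app is written:

# build images and start both services
docker compose up --build

# after training writes the artifact and serving is up, test the endpoint
curl -X POST http://localhost:8080/predict \
  -H "Content-Type: application/json" \
  -d '{"features": [1.0, 2.0, 3.0]}'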

Example 6: Quick security scan

# Docker Scout (built into newer Docker versions)
docker build -t myimg:0.1 .
docker scout quickview myimg:0.1

# Trivy (install locally, then)
trivy image --severity HIGH,CRITICAL myimg:0.1

Fix critical findings by upgrading base images, patching packages, or replacing vulnerable libraries.

Drills and exercises

  • Create a Dockerfile that trains a small scikit-learn model, outputs a model file, and exits.
  • Refactor your Dockerfile to pin all dependencies and use a lockfile; verify identical hashes across two builds.
  • Switch to a multi-stage build and measure image size reduction (target at least 30%).
  • Add a non-root user and verify the process runs without elevated privileges.
  • Build a GPU image (if available) and run nvidia-smi inside the container.
  • Adopt a tagging scheme such as app:1.0.0-YYYYMMDD-<short-sha>; push two tags for the same image (version + commit).
  • Scan your image and resolve at least one high-severity issue by upgrading the base image.

Common mistakes and debugging tips

  • Unpinned dependencies: Leads to nondeterministic builds. Fix by pinning and using a lockfile.
  • Layer cache misses: Copying source before installing requirements breaks caching. Copy requirements first.
  • Huge images: Avoid development toolchains in final images; use multi-stage and slim bases.
  • Root user in production: Add a non-root user and proper file permissions.
  • Alpine with Python ML libs: Many ML wheels target glibc, not musl. Prefer Debian/Ubuntu slim images for Python ML stacks.
  • GPU not detected: Ensure NVIDIA drivers + container toolkit on host, use --gpus all, and match CUDA versions with your framework.
  • Port conflicts: Change host port mapping if 8080 is already in use.
  • Timezone/locale surprises: Set environment variables or minimize reliance on system locale/time if needed.
Troubleshooting checklist
  • docker system df to see image layer sizes.
  • docker history myimg:tag to inspect layer-by-layer growth.
  • Run a shell in the image (if available) to verify files and permissions (see the sketch after this checklist).
  • If using distroless, add debug variants or run tests before switching.
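
For the shell check, overriding the entrypoint is usually enough; myimg:tag is a placeholder, and slim images ship /bin/sh rather than bash:

# open an interactive shell in the image to inspect files and permissions
docker run --rm -it --entrypoint /bin/sh myimg:tag
# then, inside the container:
ls -la /app && id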

Mini project: Reproducible training-to-serving pipeline

Build a two-image setup: one for training a model and storing the artifact, another for serving that artifact via FastAPI. Orchestrate locally with Docker Compose.

Acceptance criteria
  • Training container exits successfully and writes model.bin to a shared volume.
  • Serving container starts as non-root and loads model.bin on startup.
  • curl request to /predict returns a valid JSON response.
  • Total serving image size under 400 MB (stretch goal: under 250 MB).
  • Security scan shows no critical vulnerabilities in the serving image.
Suggested steps
  1. Create Dockerfile.train and Dockerfile.serve with pinned dependencies.
  2. Use a named volume in Compose to share artifacts.
  3. Add multi-stage to reduce serving image size.
  4. Adopt a tagging scheme and tag both images with version + short SHA (see the tagging sketch after these steps).
  5. Run a security scan and fix top issues.
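
A tagging and push sequence in that spirit could look like this (registry, repository, and version are placeholders):

# derive a short commit SHA and build with both a version tag and a traceable tag
SHA=$(git rev-parse --short HEAD)
docker build -f Dockerfile.serve \
  -t registry.example.com/ml/serve:1.0.0 \
  -t registry.example.com/ml/serve:1.0.0-$(date +%Y%m%d)-${SHA} .

# push both tags so rollbacks can target the exact build
docker push registry.example.com/ml/serve:1.0.0
docker push registry.example.com/ml/serve:1.0.0-$(date +%Y%m%d)-${SHA}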

Practical projects

  • CPU-only microservice: Containerize a tiny scikit-learn model with FastAPI and autoscale-ready settings (readiness/health endpoints).
  • GPU inference demo: Torch-based image classifier in a CUDA runtime image; measure latency with and without half-precision.
  • Scheduled batch job: A container that downloads data, retrains nightly, and uploads a model to object storage; include build cache optimization.

Subskills

  • Dockerfiles For Training And Serving: Author cache-friendly, secure Dockerfiles that separate training and serving needs.
  • Image Tagging And Versioning: Tag by semver, date, and commit SHA for traceable rollouts and rollbacks.
  • Dependency Pinning And Lockfiles: Lock every transitive dependency for deterministic builds.
  • GPU Image Basics: Choose correct CUDA bases, validate drivers, and run with --gpus all.
  • Minimizing Image Size: Multi-stage builds, slim bases, and only copying what you need.
  • Security Scanning For Images: Identify and remediate CVEs before shipping.
  • Local Dev With Docker Compose: Spin up training + serving stacks and iterate quickly.

Next steps

  • Harden CI: add build cache, vulnerability scan, and signature/attestation steps.
  • Publish images to a private registry; enforce allowed base images.
  • Connect to your orchestration layer (Kubernetes, ECS, or similar) and add health/readiness probes.
