
GPU Image Basics

Learn GPU Image Basics with explanations, exercises, and a quick test, aimed at MLOps engineers.

Published: January 4, 2026 | Updated: January 4, 2026

Who this is for

  • MLOps engineers who package training/inference services that need NVIDIA GPUs.
  • Data scientists moving models to GPU-enabled containers for reproducibility.
  • Engineers optimizing image size and reliability across GPU nodes.

Prerequisites

  • Comfort with Dockerfiles and basic Linux shell.
  • Basic understanding of CUDA as the NVIDIA GPU compute stack (no deep CUDA coding required).
  • Access to a machine with an NVIDIA GPU and drivers installed helps, but you can still learn concepts without one.

Why this matters

Real MLOps work often includes:

  • Packaging a PyTorch or TensorFlow service that can see GPUs reliably on any node.
  • Reducing image size to cut CI/CD time and startup latency.
  • Pinning compatible versions (driver, CUDA, cuDNN, framework) to avoid runtime crashes.
  • Building multi-stage images: compile in a CUDA devel image, ship a smaller runtime image.

Typical on-call issue you will prevent

"Container started, but no GPU found" or "Illegal instruction / library not found" after an upgrade. Correct base image selection and runtime flags (like --gpus) prevent this.

Concept explained simply

GPU-enabled Docker images bundle the user-space GPU libraries (CUDA, cuDNN, NCCL) your app needs. The actual NVIDIA driver stays on the host. At runtime, Docker passes the host GPUs into the container so your app can use them.

Mental model

  • Host: has the physical GPUs + NVIDIA driver.
  • Container image: has CUDA user-space libraries and your app.
  • Runtime bridge: NVIDIA Container Toolkit + --gpus flag expose GPUs to the container.

If the host driver and your image’s CUDA stack are incompatible, things break. If the container doesn’t include the right user-space libs, your framework won’t find the GPU.
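
A minimal way to verify each layer, assuming the NVIDIA Container Toolkit is installed on the host and using the example tag from this topic:

# Host layer: the driver lives here; nvidia-smi reports its version and the GPUs it sees.
nvidia-smi

# Runtime bridge + container layer: run the same tool inside a CUDA base image.
# If this lists your GPUs, the toolkit and the --gpus flag are working.
docker run --rm --gpus all nvidia/cuda:12.1.1-cudnn8-runtime-ubuntu22.04 nvidia-smi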

Key components

  • Base images: nvidia/cuda:<version>-<variant>-<os>
    • Variants: runtime (smaller, for running apps) vs. devel (includes the nvcc compiler, for building).
    • Common OS: ubuntu22.04 (or similar). Choose what your org prefers.
  • User-space libraries: CUDA, cuDNN, NCCL (often bundled in the base tag).
  • NVIDIA Container Toolkit: enables docker run --gpus ... and passes devices/libs from host.
  • Environment variables:
    • NVIDIA_VISIBLE_DEVICES=all or GPU indices (e.g., 0,1).
    • NVIDIA_DRIVER_CAPABILITIES=compute,utility (typical for ML workloads); see the run examples below.
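
Example of limiting a container to specific GPUs (a sketch; the tag is an example, and the second form assumes the nvidia runtime is registered with the Docker daemon):

# Expose only GPUs 0 and 1 via the --gpus flag (note the quoting around device=...).
docker run --rm --gpus '"device=0,1"' nvidia/cuda:12.1.1-cudnn8-runtime-ubuntu22.04 nvidia-smi

# Equivalent selection via environment variables with the nvidia runtime.
docker run --rm --runtime=nvidia \
    -e NVIDIA_VISIBLE_DEVICES=0,1 \
    -e NVIDIA_DRIVER_CAPABILITIES=compute,utility \
    nvidia/cuda:12.1.1-cudnn8-runtime-ubuntu22.04 nvidia-smi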

Compatibility rule of thumb
  • Host driver version must support the CUDA major/minor used by your image.
  • Your ML framework build (e.g., PyTorch/TensorFlow) must match the CUDA/cuDNN user-space versions in the image.
  • When in doubt, pick the framework build first, then choose a base image that matches its CUDA requirements (a quick check is sketched below).
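
A quick way to check both sides of the rule (a sketch; assumes PyTorch is installed for the framework-side check):

# Host side: report the installed driver version (the nvidia-smi header also shows the highest CUDA version it supports).
nvidia-smi --query-gpu=driver_version --format=csv,noheader

# Framework side: the CUDA version your PyTorch build was compiled against.
python3 -c "import torch; print(torch.version.cuda)"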

Worked examples

Example 1 — Minimal runtime image for inference

Goal: Small image for PyTorch inference that can see the GPU.

# Dockerfile
ARG CUDA_VERSION=12.1.1
ARG OS_FLAVOR=ubuntu22.04
FROM nvidia/cuda:${CUDA_VERSION}-cudnn8-runtime-${OS_FLAVOR}

# System deps
RUN apt-get update -y && apt-get install -y --no-install-recommends \
    python3 python3-venv python3-pip ca-certificates && \
    rm -rf /var/lib/apt/lists/*

# App env
WORKDIR /app
COPY requirements.txt .
RUN python3 -m pip install --no-cache-dir -r requirements.txt

COPY server.py .
CMD ["python3", "server.py"]

requirements.txt example:

# Match your CUDA-capable framework build.
# Example: torch with CUDA. Replace with versions your project uses.
torch

server.py example:

import torch
print("CUDA available:", torch.cuda.is_available())
if torch.cuda.is_available():
    print("GPUs:", torch.cuda.device_count())

Run (host must have NVIDIA Container Toolkit):

docker build -t my-infer:gpu .
docker run --rm --gpus all my-infer:gpu

Expected: prints CUDA available: True and number of GPUs.

Example 2 — Development vs Runtime (multi-stage)

Goal: Compile a CUDA binary in a devel image, run it from a lean runtime image.

# Dockerfile
ARG CUDA_VERSION=12.1.1
ARG OS_FLAVOR=ubuntu22.04

# Build stage with NVCC
FROM nvidia/cuda:${CUDA_VERSION}-devel-${OS_FLAVOR} AS build
RUN apt-get update -y && apt-get install -y --no-install-recommends build-essential && rm -rf /var/lib/apt/lists/*
WORKDIR /src
# Write a simple vector-add kernel to vadd.cu (%% and \\n are escaped so the generated source keeps a literal %f and \n)
RUN printf '#include <stdio.h>\n__global__ void add(int n, float *x, float *y){int i=blockIdx.x*blockDim.x+threadIdx.x; if(i<n) y[i]=x[i]+y[i];}\nint main(){int N=1<<20; float *x,*y; cudaMallocManaged(&x,N*sizeof(float)); cudaMallocManaged(&y,N*sizeof(float)); for(int i=0;i<N;i++){x[i]=1.0f; y[i]=2.0f;} add<<<(N+255)/256,256>>>(N,x,y); cudaDeviceSynchronize(); printf("y[0]=%%f\\n", y[0]); cudaFree(x); cudaFree(y); return 0;}' > vadd.cu
RUN nvcc -O2 vadd.cu -o vadd

# Runtime stage: smaller
FROM nvidia/cuda:${CUDA_VERSION}-runtime-${OS_FLAVOR}
WORKDIR /app
COPY --from=build /src/vadd /app/vadd
CMD ["/app/vadd"]

Run:

docker build -t vadd:gpu .
docker run --rm --gpus all vadd:gpu

Expected: prints something like y[0]=3.000000.

Example 3 — Choosing base tags safely

  • Pick your framework build (e.g., the CUDA version it expects).
  • Choose the closest matching CUDA base image tag, e.g.: nvidia/cuda:12.1.1-cudnn8-runtime-ubuntu22.04 for a runtime-only service.
  • If you need to compile CUDA code or some pip wheels with GPU ops, use -devel- in a build stage, then copy artifacts into a -runtime- final stage.
  • Pin exact tags to ensure reproducible images; a digest-pinning sketch follows below.
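
Pinning can go one step further by resolving the tag to an immutable digest; a sketch using the example tag above:

# Pull the pinned tag, then record the digest it resolves to.
docker pull nvidia/cuda:12.1.1-cudnn8-runtime-ubuntu22.04
docker inspect --format '{{index .RepoDigests 0}}' nvidia/cuda:12.1.1-cudnn8-runtime-ubuntu22.04
# Reference the printed nvidia/cuda@sha256:... value in your FROM line for a byte-for-byte reproducible base.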

How to choose the right base image

  1. Identify framework build CUDA version (major.minor). Example pattern: 12.1 or 11.8.
  2. Pick runtime for running apps, devel when compiling CUDA code.
  3. Choose OS compatible with your org (e.g., ubuntu22.04).
  4. Prefer tags that include cuDNN when needed (e.g., cudnn8).
  5. Pin versions to avoid surprise upgrades.

Self-check: tag anatomy

Pattern: nvidia/cuda:<cuda_version>-<cudnn>-<variant>-<os> (the <cudnn> part appears only on tags that bundle cuDNN)
E.g., nvidia/cuda:12.1.1-cudnn8-runtime-ubuntu22.04

Security and size tips

  • Use multi-stage builds; keep final images on runtime variants (see the size check after this list).
  • Remove build tools and caches: rm -rf /var/lib/apt/lists/*, --no-install-recommends, pip --no-cache-dir.
  • Run as non-root where possible.
  • Pin versions in apt and pip for reproducibility.
  • Regularly rebuild to pick up security updates.
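
A quick way to see the effect of these tips, assuming the images from the worked examples were built locally:

# Compare sizes: runtime-based images should be much smaller than devel-based ones.
docker images nvidia/cuda
docker images my-infer:gpu
docker images vadd:gpu

# See which layers contribute the most to an image's size.
docker history vadd:gpu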

Common mistakes and self-check

  • Using a runtime image when you need nvcc to compile. Self-check: Does your build require CUDA compilation? If yes, add a -devel- builder stage.
  • Mismatched CUDA between framework and image. Self-check: Confirm your framework build’s expected CUDA major.minor vs image tag.
  • Forgetting --gpus or NVIDIA Container Toolkit. Self-check: Does nvidia-smi run in the container?
  • Not pinning tags. Self-check: Are you using floating tags like latest? Pin exact versions.
  • Huge final images. Self-check: Are compilers and headers present in the final image? Move them to a build stage only (quick checks below).
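
Two of these self-checks can be scripted; a sketch assuming the example image name from earlier:

# GPU visibility: should print the GPU table on a host with the NVIDIA Container Toolkit installed.
docker run --rm --gpus all my-infer:gpu nvidia-smi

# Image bloat: a runtime-based final image should not ship the CUDA compiler.
docker run --rm my-infer:gpu which nvcc || echo "no nvcc in final image (expected for runtime variants)"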

Exercises

Do these hands-on tasks, then take the quick test at the end of the topic.

Exercise 1 — Minimal CUDA runtime and GPU check

  • Create a Dockerfile using a -runtime- CUDA base image.
  • Install Python and a GPU-aware library (e.g., PyTorch) per your project needs.
  • Run the container with --gpus all and print whether a GPU is visible.
Tip

If you cannot access a GPU host right now, you can still build the image and run it without --gpus. It should print CUDA available: False, which confirms your code path works.

Exercise 2 — Multi-stage build for a CUDA binary

  • Use -devel- in the first stage to compile a simple CUDA program.
  • Copy the binary into a final -runtime- stage.
  • Run with --gpus all and verify output.

Checklist before you move on

  • You can explain the difference between runtime and devel images.
  • You can choose a CUDA base tag that matches your framework build.
  • You can run nvidia-smi inside a container.
  • You can reduce image size by using multi-stage builds.

Practical projects

  • GPU inference microservice: Build a fast-start image for a small model (e.g., text classification) and measure cold-start time before and after slimming.
  • Training worker image: Multi-stage build that compiles a custom CUDA op and ships a runtime-only final image.
  • Repro-friendly template: Create a repo template with pinned CUDA base tag, reproducible pip/apt installs, and a health check that runs nvidia-smi (a run-flag sketch follows below).
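
For the health-check idea, one option is Docker's built-in health-check flags; the image and container names below are illustrative placeholders for a long-running GPU service:

# Start the service with a health check that fails if the GPUs stop being visible.
docker run -d --name gpu-svc --gpus all \
    --health-cmd "nvidia-smi || exit 1" \
    --health-interval 30s --health-retries 3 \
    my-gpu-service:latest

# Inspect the health status over time.
docker inspect --format '{{.State.Health.Status}}' gpu-svc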

Learning path

  • Next: GPU scheduling basics in container orchestrators (resource requests, limits, device plugins).
  • Then: CI pipelines that build/test GPU images on runners with GPUs.
  • Later: Advanced CUDA runtime topics (NCCL tuning, MIG awareness, multi-GPU topology).

Next steps

  • Pick a framework version your team uses and pin a matching CUDA base tag.
  • Convert one of your existing CPU images to a GPU-enabled runtime image.
  • Set a policy to rebuild base images regularly and scan for vulnerabilities.

Mini challenge

Given a training job that needs CUDA 12.1, cuDNN 8, and no compilation, propose a final image tag and one runtime flag you will use when starting the container.

Example answer

Final image tag: nvidia/cuda:12.1.1-cudnn8-runtime-ubuntu22.04 (pinned). Runtime flag: --gpus all (and optionally -e NVIDIA_VISIBLE_DEVICES=all).
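
As a smoke test of that answer on a GPU host with the NVIDIA Container Toolkit installed:

# Should print the driver/CUDA header and list the visible GPUs.
docker run --rm --gpus all nvidia/cuda:12.1.1-cudnn8-runtime-ubuntu22.04 nvidia-smi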

Practice Exercises


Instructions

Build a small GPU-enabled image for a Python script that prints CUDA visibility.

  1. Create a Dockerfile using a pinned runtime tag like nvidia/cuda:12.1.1-cudnn8-runtime-ubuntu22.04.
  2. Install Python and pip, copy a script that prints torch.cuda.is_available().
  3. Build and run with --gpus all.

# server.py
import torch
print("CUDA available:", torch.cuda.is_available())
if torch.cuda.is_available():
    print("GPU count:", torch.cuda.device_count())

# Example build & run
docker build -t ex1-gpu .
docker run --rm --gpus all ex1-gpu

Expected Output

Prints 'CUDA available: True' and a positive GPU count when run with --gpus all on a GPU host. Without GPUs, prints 'CUDA available: False'.

GPU Image Basics — Quick Test

Test your knowledge with 10 questions. Pass with 70% or higher.

