Why this matters
As a Machine Learning Engineer, you ship containers that hold CUDA, Python, notebooks, and models. A single critical vulnerability or leaked secret can block deployment, break compliance, or expose data. Security scanning helps you catch issues early: outdated CUDA base layers, vulnerable Python wheels, exposed API keys in image layers, and misconfigurations in Dockerfiles.
- Before pushing to a registry, scan images to catch known CVEs.
- Gate CI builds: fail when high/critical issues are present.
- Produce SBOMs so teams know what’s inside the image.
- Scan for secrets embedded in layers or copied files.
- Quickly assess impact when a new 0-day is announced.
Real ML tasks this unlocks
- Choosing a safer CUDA/PyTorch base image for training jobs.
- Verifying Jupyter images used by data scientists are safe.
- Creating a minimal inference image with fewer vulnerabilities.
- Proving compliance with an SBOM for audits.
Concept explained simply
A container image is a stack of layers (base OS + runtimes + your code). Scanners read the image, list software (packages), match them against vulnerability databases, and report issues by severity (LOW to CRITICAL). You then remediate by updating the base image, upgrading packages, or removing unused components.
- CVE: a unique identifier for a known vulnerability.
- CVSS: a score estimating severity/impact.
- OS vs App vulns: OS packages (e.g., Debian) vs language deps (e.g., pip).
- SBOM: a Software Bill of Materials (CycloneDX/SPDX) listing components.
- Policies: rules that fail builds if certain severities exist.
Mental model: Image = Layers. Scanner = Map packages to CVEs. Fix = Replace vulnerable layers/deps with safer versions or remove them.
What counts as a pass/fail?
Common policy: fail if any CRITICAL/HIGH is detected, optionally ignoring unfixed issues or specific CVEs with documented justification. Aim for no CRITICAL; keep HIGHs low and documented.
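This policy can be sketched as a small gate over scanner findings. A minimal sketch in Python, assuming Trivy-style JSON fields (`Severity`, `FixedVersion`, `VulnerabilityID` follow Trivy's `-f json` output); the sample findings and the `ALLOWLIST` entry are made up for illustration:

```python
# Severity gate: fail on HIGH/CRITICAL, optionally ignoring unfixed
# issues and explicitly allow-listed CVEs with documented justification.
FAIL_SEVERITIES = {"HIGH", "CRITICAL"}
ALLOWLIST = {"CVE-2023-0001"}  # hypothetical exempted CVE

def gate(findings, ignore_unfixed=True):
    """Return the findings that should fail the build."""
    failing = []
    for f in findings:
        if f["Severity"] not in FAIL_SEVERITIES:
            continue
        if f["VulnerabilityID"] in ALLOWLIST:
            continue
        if ignore_unfixed and not f.get("FixedVersion"):
            continue
        failing.append(f)
    return failing

# Made-up sample findings
sample = [
    {"VulnerabilityID": "CVE-2023-0001", "Severity": "CRITICAL", "FixedVersion": "1.1.1t"},
    {"VulnerabilityID": "CVE-2023-0002", "Severity": "HIGH", "FixedVersion": ""},
    {"VulnerabilityID": "CVE-2023-0003", "Severity": "HIGH", "FixedVersion": "7.88.1"},
    {"VulnerabilityID": "CVE-2023-0004", "Severity": "MEDIUM", "FixedVersion": "2.0"},
]

failing = gate(sample)
print([f["VulnerabilityID"] for f in failing])  # ['CVE-2023-0003']
```

In a CI job, a non-empty result would translate to a non-zero exit code, mirroring what `--exit-code 1` does in the scanners themselves.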
Minimal setup (choose one scanner)
- Docker Scout (built into the Docker CLI on many systems)

```bash
docker pull pytorch/pytorch:2.1.0-cuda11.8-cudnn8-runtime
docker scout cves pytorch/pytorch:2.1.0-cuda11.8-cudnn8-runtime
```

- Trivy (popular open-source scanner)

```bash
trivy image --severity HIGH,CRITICAL \
  --ignore-unfixed \
  --scanners vuln \
  pytorch/pytorch:2.1.0-cuda11.8-cudnn8-runtime

# Gate by severity (non-zero exit on findings)
trivy image --severity HIGH,CRITICAL --ignore-unfixed --exit-code 1 IMAGE

# Secrets scan
trivy image --scanners secret IMAGE

# SBOM (CycloneDX)
trivy image --format cyclonedx -o sbom.json IMAGE
```

- Grype

```bash
grype pytorch/pytorch:2.1.0-cuda11.8-cudnn8-runtime --fail-on high
```
Reading scan results effectively
- Focus on HIGH/CRITICAL with a fixed version available.
- Note where the package lives: OS layer vs pip/conda dependency.
- Prefer base image upgrades over ad-hoc fixes when many issues share the same layer.
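This triage can be automated over a JSON report. A sketch assuming Trivy's JSON shape (`trivy image -f json -o report.json IMAGE` produces a `Results` list whose entries carry a `Class` such as `os-pkgs` or `lang-pkgs`); the embedded report here is a made-up sample, not real scanner output:

```python
import json
from collections import Counter

# Count fixable HIGH/CRITICAL findings per result class: many os-pkgs
# hits suggest a base-image upgrade, lang-pkgs hits suggest pip bumps.
report = json.loads("""
{
  "Results": [
    {"Target": "debian 12", "Class": "os-pkgs", "Vulnerabilities": [
      {"PkgName": "openssl", "Severity": "CRITICAL", "FixedVersion": "3.0.11"},
      {"PkgName": "curl", "Severity": "HIGH", "FixedVersion": "7.88.1"}
    ]},
    {"Target": "Python", "Class": "lang-pkgs", "Vulnerabilities": [
      {"PkgName": "requests", "Severity": "HIGH", "FixedVersion": "2.31.0"},
      {"PkgName": "pillow", "Severity": "LOW", "FixedVersion": ""}
    ]}
  ]
}
""")

counts = Counter()
for result in report.get("Results", []):
    for v in result.get("Vulnerabilities", []) or []:
        if v["Severity"] in {"HIGH", "CRITICAL"} and v.get("FixedVersion"):
            counts[result.get("Class", "unknown")] += 1

print(dict(counts))  # {'os-pkgs': 2, 'lang-pkgs': 1}
```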
Worked examples
Example 1: Scan a CUDA-based PyTorch image
- Pull the image:

```bash
docker pull pytorch/pytorch:2.1.0-cuda11.8-cudnn8-runtime
```

- Scan (Trivy):

```bash
trivy image --severity HIGH,CRITICAL --ignore-unfixed \
  pytorch/pytorch:2.1.0-cuda11.8-cudnn8-runtime
```

- Interpretation: you’ll likely see OS-level issues (from the underlying Debian/Ubuntu base) and possibly language-level issues. Prefer upgrading the base to a newer tag with fewer CVEs rather than patching many packages manually.
Sample output snippet:

```
CRITICAL  openssl  1.1.x  CVE-XXXX-YYYY  Fixed in 1.1.z
HIGH      curl     7.x    CVE-AAAA-BBBB  Fixed in 7.y
```
Action: pick a newer base image tag that includes patched openssl/curl.
Example 2: Scan a minimal Python inference image and fix
- Create a Dockerfile:

```dockerfile
FROM python:3.11-slim
WORKDIR /app
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt
COPY server.py .
CMD ["python", "server.py"]
```

- Build and scan:

```bash
docker build -t my-infer:1 .
trivy image --severity HIGH,CRITICAL --ignore-unfixed my-infer:1
```

- If results show OS vulnerabilities, try a fresher base and reduce packages:

```dockerfile
FROM python:3.11-slim-bookworm
# Optionally: RUN apt-get update && apt-get dist-upgrade -y && rm -rf /var/lib/apt/lists/*
```

- Rebuild and rescan to confirm a reduction in HIGH/CRITICAL findings.
Example 3: Scan for secrets
- Run a secrets scan:

```bash
trivy image --scanners secret my-infer:1
```

- If a secret is found (e.g., an API key in a copied .env file), remove it from the build context, add it to .dockerignore, and inject it at runtime via an environment variable or a secret manager.
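A common shape for the runtime-injection fix is to read the credential from the environment when the process starts, so it never lands in an image layer. A minimal sketch (the variable name `API_KEY` and the message text are illustrative):

```python
import os

def get_api_key() -> str:
    # Injected at runtime, e.g. `docker run -e API_KEY=... my-infer:1`,
    # or via your orchestrator's secret mechanism; never COPY'd into the image.
    key = os.environ.get("API_KEY")
    if not key:
        raise RuntimeError(
            "API_KEY is not set; inject it at runtime instead of baking it into the image"
        )
    return key
```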
Mini reference: reducing attack surface
- Choose minimal bases (slim, distroless) when possible.
- Multi-stage builds: copy only what’s needed into final image.
- Pin image digests for reproducibility.
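The multi-stage point can be sketched as a hedged Dockerfile: build tooling stays in a throwaway stage and only the installed packages plus the application code reach the final image. The file names follow the earlier inference example; the exact base tags are illustrative:

```dockerfile
# Build stage: compilers and build tooling stay here and never ship.
FROM python:3.11-slim AS build
WORKDIR /app
COPY requirements.txt .
RUN pip install --no-cache-dir --prefix=/install -r requirements.txt

# Final stage: copy only the installed packages and the app.
FROM python:3.11-slim
WORKDIR /app
COPY --from=build /install /usr/local
COPY server.py .
CMD ["python", "server.py"]
```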
Hands-on exercises
These mirror the graded exercises below. Do them here first.
Exercise 1 (ex1): Scan a public ML image and gate by severity
- Pull an ML image, e.g.:

```bash
docker pull jupyter/scipy-notebook:latest
```

- Scan and fail on HIGH/CRITICAL using one tool you have available:

```bash
# Option A: Trivy
trivy image --severity HIGH,CRITICAL --ignore-unfixed --exit-code 1 jupyter/scipy-notebook:latest

# Option B: Docker Scout
docker scout cves jupyter/scipy-notebook:latest
# Manually treat HIGH/CRITICAL findings as a failure

# Option C: Grype
grype jupyter/scipy-notebook:latest --fail-on high
```

- Note the exit code and the count of HIGH/CRITICAL findings. Capture a short summary.
Exercise 2 (ex2): Scan your own image for secrets and export an SBOM
- Build or choose any local image (e.g., my-infer:1).
- Run a secrets scan and an SBOM export:

```bash
trivy image --scanners secret my-infer:1
trivy image --format cyclonedx -o sbom.json my-infer:1
```

- Verify that sbom.json exists and review any flagged secrets.
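To sanity-check the exported SBOM you can parse it and count components; the same parse is handy when a new advisory lands and you need to know whether the affected package is in your images. A sketch assuming CycloneDX JSON (top-level `bomFormat` and `components` fields); the embedded document is a made-up sample, and in practice you would `json.load(open("sbom.json"))`:

```python
import json

# Made-up CycloneDX-style SBOM for illustration.
sbom = json.loads("""
{
  "bomFormat": "CycloneDX",
  "specVersion": "1.5",
  "components": [
    {"type": "library", "name": "openssl", "version": "3.0.9"},
    {"type": "library", "name": "requests", "version": "2.28.0"},
    {"type": "library", "name": "numpy", "version": "1.26.0"}
  ]
}
""")

assert sbom.get("bomFormat") == "CycloneDX", "not a CycloneDX document"
components = {c["name"]: c["version"] for c in sbom.get("components", [])}
print(f"{len(components)} components")

# Impact check for a hypothetical advisory against `requests`:
affected = "requests"
if affected in components:
    print(f"{affected} {components[affected]} is present; compare with the advisory's fixed version")
```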
Checklist
- I scanned an image and identified HIGH/CRITICAL issues.
- I verified the scanner’s exit code or policy behavior.
- I produced an SBOM file.
- I ran a secrets scan and addressed any findings.
Common mistakes and self-check
- Ignoring base images: Many issues come from the base layer. Self-check: did you try a newer base tag?
- Scanning only OS packages: Language-level CVEs matter too. Self-check: ensure the scanner includes pip/conda analysis.
- Not gating builds: Scans without policies don’t prevent risky releases. Self-check: confirm non-zero exit on severe findings.
- Shipping secrets: .env files or creds in layers. Self-check: secrets scan passes; .dockerignore excludes sensitive files.
- Unpinned tags: Using latest can change risk overnight. Self-check: pin image digest for reproducibility.
Practical projects
- Convert a training image from a full base to a slim base and measure CVE reduction.
- Create a CI job that scans PR-built images and fails on HIGH/CRITICAL.
- Generate and store SBOMs for all your service images, then compare over time.
Learning path
- Start here: basic scanning, SBOM, secrets scan.
- Next: Dockerfile hardening (non-root, minimal permissions, multi-stage).
- Then: signing and verifying images (image provenance).
- Then: CI/CD gating and exemptions with justification.
- Finally: runtime scanning and Kubernetes admission policies.
Who this is for
- Machine Learning Engineers shipping training or inference containers.
- Data scientists maintaining Jupyter or batch images.
- Anyone responsible for ML platform security.
Prerequisites
- Basic Docker usage (build, tag, run).
- Familiarity with Python environments and base images.
- Access to a terminal with Docker and a scanner installed.
Next steps
- Complete the exercises, then take the quick test below to lock in knowledge. Note: the test is available to everyone; only logged-in users get saved progress.
- Apply scanning to one image in your real project and record baseline metrics (counts of HIGH/CRITICAL).
- Plan a weekly review of base image tags and SBOM diffs.
Mini challenge
Within 30 minutes, reduce HIGH/CRITICAL findings on any ML image by at least 50% by switching to a newer or slimmer base and removing unused packages. Document before/after results and the exact changes you made.
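Measuring the 50% target is simple arithmetic over the two scan summaries. A sketch with placeholder counts (substitute your own before/after numbers):

```python
# Hypothetical counts of HIGH/CRITICAL findings before and after the change.
before = {"CRITICAL": 4, "HIGH": 18}
after = {"CRITICAL": 0, "HIGH": 7}

total_before = sum(before.values())  # 22
total_after = sum(after.values())    # 7
reduction = 100 * (total_before - total_after) / total_before

print(f"{total_before} -> {total_after} findings ({reduction:.0f}% reduction)")
print("goal met" if reduction >= 50 else "keep going")
```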