luvv to helpDiscover the Best Free Online Tools
Topic 2 of 8

Containerization Basics

Learn Containerization Basics for free with explanations, exercises, and a quick test (for Data Engineer).

Published: January 8, 2026 | Updated: January 8, 2026

Why this matters

Data engineers package jobs, dependencies, and tools so they run the same in dev, CI, and production. Containers make this predictable. You will:

  • Ship Python/SQL jobs with fixed dependencies.
  • Spin up local databases, message brokers, and object stores for testing.
  • Reproduce issues quickly in isolated environments.
  • Scale workers horizontally without “it works on my machine”.
Real tasks you’ll do at work
  • Build an image for a batch ETL and run it on a scheduler (e.g., Airflow).
  • Stand up Postgres and a metrics exporter locally to test a pipeline.
  • Pin image tags and roll out a safe update via CI/CD.

Concept explained simply

A container is a lightweight box that holds your app plus everything it needs to run. An image is the recipe (layers). A container is a running instance of that recipe.

Mental model

  • Image = frozen layered recipe.
  • Container = a process started from that recipe with its own filesystem and network.
  • Registry = a library of recipes you can pull from or push to.
Deep dive (in plain terms)

Images are built from a Dockerfile, layer by layer. If a layer didn’t change, the builder reuses it (cache) to speed up builds. Containers share the host kernel but are isolated with namespaces and cgroups. Volumes give containers durable storage.

Key concepts and terms

  • Dockerfile: Instructions to build an image (FROM, COPY, RUN, CMD).
  • Tag: A label for an image version (e.g., myapp:1.2.0).
  • Layer: A cached step in the image. Order matters for build speed.
  • Volume: Persistent storage outside the container’s writable layer.
  • Network: Virtual network that lets containers talk by name (DNS).
  • Environment variables: Config you pass at run time (e.g., DB_URL).

Worked examples

Example 1 — Containerize a tiny Python ETL

Goal: read a CSV, compute totals, print results.

Files
# app/requirements.txt
# (empty – we use only the standard library)

# app/main.py
import csv
from pathlib import Path

def run():
    rows = 0
    total = 0.0
    with open("data/input.csv", newline="") as f:
        reader = csv.DictReader(f)
        for r in reader:
            rows += 1
            total += float(r["amount"]) if r["amount"] else 0
    print(f"Total rows: {rows}; Total amount: {total}")

if __name__ == "__main__":
    run()

# app/data/input.csv
id,amount
1,25.5
2,50
3,50

# Dockerfile
FROM python:3.11-slim
WORKDIR /app
COPY app/ /app/
CMD ["python", "main.py"]
Build and run
docker build -t tiny-etl:1.0 .
docker run --rm tiny-etl:1.0

Expected console: Total rows: 3; Total amount: 125.5

Example 2 — Local Postgres with a volume + quick query

Goal: start Postgres with data persistence and query it from a temporary client container.

docker volume create pg_data
docker network create de-net || true

docker run -d --name pg --network de-net -e POSTGRES_PASSWORD=secret -v pg_data:/var/lib/postgresql/data postgres:16
# Wait a few seconds for PG to start
sleep 5
# Create a table and insert rows using the postgres client image
docker run --rm --network de-net -e PGPASSWORD=secret postgres:16 psql -h pg -U postgres -c "CREATE TABLE items (id INT, amount NUMERIC); INSERT INTO items VALUES (1,25.5),(2,50),(3,50);"
# Query rows
docker run --rm --network de-net -e PGPASSWORD=secret postgres:16 psql -h pg -U postgres -c "SELECT COUNT(*) FROM items;"

You should see a count of 3.

Example 3 — Two-service docker-compose (ETL + Postgres)

docker-compose.yml
services:
  pg:
    image: postgres:16
    environment:
      POSTGRES_PASSWORD: secret
    volumes:
      - pg_data:/var/lib/postgresql/data
    networks: [de]
  etl:
    image: tiny-etl:1.0
    depends_on: [pg]
    environment:
      DB_HOST: pg
      DB_USER: postgres
      DB_PASSWORD: secret
    networks: [de]
volumes:
  pg_data:
networks:
  de:
Run
docker compose up --build

The ETL can reach Postgres via the service name pg on the de network.

Registries, tags, and layers

  • Use immutable tags (e.g., 1.0.3) for deployments. Avoid relying on latest in production.
  • Order Dockerfile steps to maximize cache hits: install dependencies before copying frequently changing code.
  • Push/pull: docker push myrepo/tiny-etl:1.0 and docker pull when deploying elsewhere.
Layer caching trick

Put dependency installation before copying source files. That way, code changes don’t invalidate heavy dependency layers unnecessarily.

Volumes, env vars, and networks (essentials)

  • Persist data with named volumes (-v pg_data:/var/lib/postgresql/data).
  • Keep config in env vars; mount separate files for secrets when possible.
  • Use a user-defined network so containers can resolve each other by service name.

Security & resource limits basics

  • Run as a non-root user where possible.
  • Pass secrets at runtime; avoid baking them into images or logs.
  • Set resource boundaries for noisy workloads, e.g.: --memory 1g --cpus 1.0.
Non-root example
# in Dockerfile
RUN useradd -m appuser
USER appuser

Exercises

Complete these to lock in the skills. The Quick Test at the end is available to everyone; only logged-in learners will see saved progress.

Exercise 1 — Containerize a tiny Python ETL and run it

  • Create the files shown in Example 1.
  • Build the image tiny-etl:1.0.
  • Run it and confirm totals match.
Checklist
  • [ ] Dockerfile builds successfully
  • [ ] Image runs without mounting anything
  • [ ] Output shows correct counts

Exercise 2 — Postgres with volume + query from another container

  • Create a named volume and network.
  • Run Postgres on that network with a password.
  • Use a temporary postgres client container to create a table, insert rows, and count them.
Checklist
  • [ ] Named volume created
  • [ ] Postgres reachable by service name
  • [ ] Query returns expected count

Common mistakes & self-check

  • Using latest tags in prod: Pin a specific version instead.
  • Huge images: Use slim base images; clean caches; copy only needed files.
  • Baking secrets into images: Pass via env vars or mounted files at runtime.
  • Unreliable builds due to cache busting: Order Dockerfile steps carefully.
  • Forgetting volumes for stateful services: Use named volumes for databases and brokers.
Self-check prompts
  • Can you explain the difference between an image and a container?
  • Can two containers talk without exposing ports on the host? How?
  • How would you roll back a bad image deployment?

Practical projects

  • Local Data Lab: Compose file with Postgres + a Python ETL image that loads CSV to a table.
  • Metrics Sandbox: Add a lightweight exporter and scrape with a separate containerized tool; visualize counts from your pipeline.
  • Batch Runner: Build an image that runs a daily job via a simple shell entrypoint and accepts runtime config via env vars.

Mini challenge

Refactor your Dockerfile to reduce the final image size by at least 30% without changing functionality. Hint: use a slimmer base image, multi-stage build to drop build-time files, and avoid copying test data into the image.

What to measure
  • Image size before vs. after (docker images).
  • Build time improvements due to better layer caching.

Who this is for

  • New data engineers who need reproducible environments.
  • Analysts or scientists moving pipelines from notebooks to production.
  • Software engineers supporting data workloads.

Prerequisites

  • Basic command line skills.
  • Python fundamentals (reading files, running scripts).
  • Comfort with SQL basics is helpful.

Learning path

  • Containerization Basics (this lesson)
  • Docker Compose for local data stacks
  • Image publishing and CI builds
  • Intro to Kubernetes for batch jobs

Next steps

  • Add a non-root user to your images and set resource limits.
  • Introduce docker-compose to manage multi-service stacks.
  • Automate builds in CI with pinned tags and image scanning.

Progress & test note

The quick test below is available to everyone. If you are logged in, your answers and progress will be saved so you can pick up later.

Practice Exercises

2 exercises to complete

Instructions

  1. Create a folder app/ with main.py and data/input.csv exactly as shown in Example 1.
  2. Create the Dockerfile from Example 1 at the project root.
  3. Build:
    docker build -t tiny-etl:1.0 .
  4. Run:
    docker run --rm tiny-etl:1.0
Expected Output
Total rows: 3; Total amount: 125.5

Containerization Basics — Quick Test

Test your knowledge with 10 questions. Pass with 70% or higher.

10 questions70% to pass

Have questions about Containerization Basics?

AI Assistant

Ask questions about this tool