Why this matters
Data engineers package jobs, dependencies, and tools so they run the same in dev, CI, and production. Containers make this predictable. You will:
- Ship Python/SQL jobs with fixed dependencies.
- Spin up local databases, message brokers, and object stores for testing.
- Reproduce issues quickly in isolated environments.
- Scale workers horizontally without “it works on my machine”.
Real tasks you’ll do at work
- Build an image for a batch ETL and run it on a scheduler (e.g., Airflow).
- Stand up Postgres and a metrics exporter locally to test a pipeline.
- Pin image tags and roll out a safe update via CI/CD.
Concept explained simply
A container is a lightweight box that holds your app plus everything it needs to run. An image is the recipe (layers). A container is a running instance of that recipe.
Mental model
- Image = frozen layered recipe.
- Container = a process started from that recipe with its own filesystem and network.
- Registry = a library of recipes you can pull from or push to.
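A rough sketch of the image/container distinction using plain docker commands (tiny-etl:1.0 is the image built later in Example 1; any image you already have locally works just as well):
# Build once: produces an image (the frozen recipe)
docker build -t tiny-etl:1.0 .
# Run twice: each run starts a separate container from that same image
docker run --name run-a tiny-etl:1.0
docker run --name run-b tiny-etl:1.0
# Images are the recipes; containers (even exited ones) are the instances
docker images
docker ps -a
# Clean up the two stopped containers
docker rm run-a run-b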
Deep dive (in plain terms)
Images are built from a Dockerfile, layer by layer. If a step and its inputs haven't changed, the builder reuses the cached layer, which speeds up rebuilds. Containers share the host kernel but are isolated with namespaces and cgroups. Volumes give containers durable storage that outlives any single container.
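A quick way to see layers and caching in action on an image you have built (tiny-etl:1.0 from Example 1 is used here as a stand-in):
# One line per Dockerfile step, newest layer first, with sizes
docker history tiny-etl:1.0
# Rebuild with nothing changed: every step should be served from cache
docker build -t tiny-etl:1.0 .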
Key concepts and terms
- Dockerfile: Instructions to build an image (FROM, COPY, RUN, CMD).
- Tag: A label for an image version (e.g., myapp:1.2.0).
- Layer: A cached step in the image. Order matters for build speed.
- Volume: Persistent storage outside the container’s writable layer.
- Network: Virtual network that lets containers talk by name (DNS).
- Environment variables: Config you pass at run time (e.g., DB_URL).
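As a small illustration of runtime configuration, an environment variable passed with -e is visible to the process inside the container (the DB_URL value is a made-up placeholder):
docker run --rm -e DB_URL=postgresql://user:pass@db:5432/app python:3.11-slim \
  python -c "import os; print(os.environ['DB_URL'])"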
Worked examples
Example 1 — Containerize a tiny Python ETL
Goal: read a CSV, compute totals, print results.
Files
# app/requirements.txt
# (empty – we use only the standard library)
# app/main.py
import csv


def run():
    rows = 0
    total = 0.0
    with open("data/input.csv", newline="") as f:
        reader = csv.DictReader(f)
        for r in reader:
            rows += 1
            total += float(r["amount"]) if r["amount"] else 0
    print(f"Total rows: {rows}; Total amount: {total}")


if __name__ == "__main__":
    run()
# app/data/input.csv
id,amount
1,25.5
2,50
3,50
# Dockerfile
FROM python:3.11-slim
WORKDIR /app
COPY app/ /app/
CMD ["python", "main.py"]
Build and run
docker build -t tiny-etl:1.0 .
docker run --rm tiny-etl:1.0
Expected console output: Total rows: 3; Total amount: 125.5
Example 2 — Local Postgres with a volume + quick query
Goal: start Postgres with data persistence and query it from a temporary client container.
docker volume create pg_data
docker network create de-net || true
docker run -d --name pg --network de-net -e POSTGRES_PASSWORD=secret -v pg_data:/var/lib/postgresql/data postgres:16
# Wait a few seconds for PG to start
sleep 5
# Create a table and insert rows using the postgres client image
docker run --rm --network de-net -e PGPASSWORD=secret postgres:16 psql -h pg -U postgres -c "CREATE TABLE items (id INT, amount NUMERIC); INSERT INTO items VALUES (1,25.5),(2,50),(3,50);"
# Query rows
docker run --rm --network de-net -e PGPASSWORD=secret postgres:16 psql -h pg -U postgres -c "SELECT COUNT(*) FROM items;"
You should see a count of 3.
Example 3 — Two-service docker-compose (ETL + Postgres)
docker-compose.yml
services:
  pg:
    image: postgres:16
    environment:
      POSTGRES_PASSWORD: secret
    volumes:
      - pg_data:/var/lib/postgresql/data
    networks: [de]
  etl:
    build: .            # build from the Dockerfile in Example 1 so --build works
    image: tiny-etl:1.0
    depends_on: [pg]
    environment:
      DB_HOST: pg
      DB_USER: postgres
      DB_PASSWORD: secret
    networks: [de]

volumes:
  pg_data:

networks:
  de:
Run
docker compose up --build
The ETL can reach Postgres via the service name pg on the de network; the sketch below shows how the job could read the DB_* settings from the environment block.
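A minimal sketch of reading those compose-provided settings in Python; note that actually connecting to Postgres would also require adding a driver such as psycopg to requirements.txt, which Example 1 deliberately leaves empty:
# app/config.py (hypothetical helper, not part of Example 1)
import os

DB_HOST = os.environ.get("DB_HOST", "localhost")   # resolves to the pg service inside the compose network
DB_USER = os.environ.get("DB_USER", "postgres")
DB_PASSWORD = os.environ.get("DB_PASSWORD", "")    # supplied at run time, never baked into the image

print(f"Would connect to {DB_HOST} as {DB_USER}")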
Registries, tags, and layers
- Use immutable tags (e.g., 1.0.3) for deployments. Avoid relying on latest in production.
- Order Dockerfile steps to maximize cache hits: install dependencies before copying frequently changing code.
- Push/pull: docker push myrepo/tiny-etl:1.0 to publish, then docker pull when deploying elsewhere.
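A typical publish flow looks like this (myrepo is a placeholder for your registry namespace, e.g. a Docker Hub account or a private registry host):
docker tag tiny-etl:1.0 myrepo/tiny-etl:1.0
docker push myrepo/tiny-etl:1.0
# On the machine that deploys the job
docker pull myrepo/tiny-etl:1.0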
Layer caching trick
Put dependency installation before copying source files. That way, code changes don’t invalidate heavy dependency layers unnecessarily.
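For a Python job with third-party dependencies, that ordering might look like the sketch below (unlike Example 1, requirements.txt is assumed to list real packages here):
# Dockerfile (cache-friendly ordering)
FROM python:3.11-slim
WORKDIR /app
# Dependencies first: this layer is rebuilt only when requirements.txt changes
COPY app/requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt
# Source last: frequent code edits no longer invalidate the dependency layer
COPY app/ /app/
CMD ["python", "main.py"]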
Volumes, env vars, and networks (essentials)
- Persist data with named volumes (-v pg_data:/var/lib/postgresql/data).
- Keep config in env vars; mount separate files for secrets when possible (see the env-file sketch after this list).
- Use a user-defined network so containers can resolve each other by service name.
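One way to keep secrets out of the shell history and out of the image is an env file passed only at run time (.env is a hypothetical file you would exclude from version control and from the build context):
# .env (never committed, never copied into the image)
DB_PASSWORD=secret
# Inject it when the container starts
docker run --rm --env-file .env --network de-net tiny-etl:1.0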
Security & resource limits basics
- Run as a non-root user where possible.
- Pass secrets at runtime; avoid baking them into images or logs.
- Set resource boundaries for noisy workloads, e.g. --memory 1g --cpus 1.0.
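Applied to the Postgres container from Example 2, limits are just extra flags on docker run (1 GB and 1 CPU are arbitrary starting points; size them for your workload):
docker run -d --name pg --network de-net --memory 1g --cpus 1.0 \
  -e POSTGRES_PASSWORD=secret -v pg_data:/var/lib/postgresql/data postgres:16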
Non-root example
# in Dockerfile
RUN useradd -m appuser
USER appuser
Exercises
Complete these to lock in the skills. The Quick Test at the end is available to everyone; only logged-in learners will see saved progress.
Exercise 1 — Containerize a tiny Python ETL and run it
- Create the files shown in Example 1.
- Build the image tiny-etl:1.0.
- Run it and confirm totals match.
Checklist
- [ ] Dockerfile builds successfully
- [ ] Image runs without mounting anything
- [ ] Output shows correct counts
Exercise 2 — Postgres with volume + query from another container
- Create a named volume and network.
- Run Postgres on that network with a password.
- Use a temporary postgres client container to create a table, insert rows, and count them.
Checklist
- [ ] Named volume created
- [ ] Postgres reachable by service name
- [ ] Query returns expected count
Common mistakes & self-check
- Using latest tags in prod: Pin a specific version instead.
- Huge images: Use slim base images; clean caches; copy only needed files.
- Baking secrets into images: Pass via env vars or mounted files at runtime.
- Unreliable builds due to cache busting: Order Dockerfile steps carefully.
- Forgetting volumes for stateful services: Use named volumes for databases and brokers.
Self-check prompts
- Can you explain the difference between an image and a container?
- Can two containers talk without exposing ports on the host? How?
- How would you roll back a bad image deployment?
Practical projects
- Local Data Lab: Compose file with Postgres + a Python ETL image that loads CSV to a table.
- Metrics Sandbox: Add a lightweight exporter and scrape with a separate containerized tool; visualize counts from your pipeline.
- Batch Runner: Build an image that runs a daily job via a simple shell entrypoint and accepts runtime config via env vars.
Mini challenge
Refactor your Dockerfile to reduce the final image size by at least 30% without changing functionality. Hint: use a slimmer base image, multi-stage build to drop build-time files, and avoid copying test data into the image.
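One possible shape for that refactor, sketched on the assumption that your requirements.txt pulls in packages worth isolating from the final image (if the job is standard-library only, a slimmer base and a tighter COPY may already hit the target):
# Dockerfile (multi-stage sketch)
FROM python:3.11 AS build
WORKDIR /app
COPY app/requirements.txt .
RUN pip install --no-cache-dir --prefix=/install -r requirements.txt

FROM python:3.11-slim
WORKDIR /app
# Carry over only the installed packages and the code; build tools and test data stay behind
COPY --from=build /install /usr/local
COPY app/main.py .
CMD ["python", "main.py"]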
What to measure
- Image size before vs. after (docker images).
- Build time improvements due to better layer caching.
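Rough ways to capture both numbers from the command line (tiny-etl is the image name used throughout; adjust the tags to match your rebuild):
# Compare sizes across tags of the same image
docker images tiny-etl
# Time a rebuild after touching only source files; cached dependency layers should keep it short
time docker build -t tiny-etl:1.1 .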
Who this is for
- New data engineers who need reproducible environments.
- Analysts or scientists moving pipelines from notebooks to production.
- Software engineers supporting data workloads.
Prerequisites
- Basic command line skills.
- Python fundamentals (reading files, running scripts).
- Comfort with SQL basics is helpful.
Learning path
- Containerization Basics (this lesson)
- Docker Compose for local data stacks
- Image publishing and CI builds
- Intro to Kubernetes for batch jobs
Next steps
- Add a non-root user to your images and set resource limits.
- Introduce docker-compose to manage multi-service stacks.
- Automate builds in CI with pinned tags and image scanning.
Progress & test note
The quick test below is available to everyone. If you are logged in, your answers and progress will be saved so you can pick up later.