Why this matters
Data engineers package jobs, dependencies, and tools so they run the same in dev, CI, and production. Containers make this predictable. You will:
- Ship Python/SQL jobs with fixed dependencies.
- Spin up local databases, message brokers, and object stores for testing.
- Reproduce issues quickly in isolated environments.
- Scale workers horizontally without “it works on my machine”.
Real tasks you’ll do at work
- Build an image for a batch ETL and run it on a scheduler (e.g., Airflow).
- Stand up Postgres and a metrics exporter locally to test a pipeline.
- Pin image tags and roll out a safe update via CI/CD.
Concept explained simply
A container is a lightweight box that holds your app plus everything it needs to run. An image is the recipe (layers). A container is a running instance of that recipe.
Mental model
- Image = frozen layered recipe.
- Container = a process started from that recipe with its own filesystem and network.
- Registry = a library of recipes you can pull from or push to.
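A rough sketch of the image/container distinction using plain docker commands (tiny-etl:1.0 is the image built later in Example 1; any image you already have locally works just as well):
# Build once: produces an image (the frozen recipe)
docker build -t tiny-etl:1.0 .
# Run twice: each run starts a separate container from that same image
docker run --name run-a tiny-etl:1.0
docker run --name run-b tiny-etl:1.0
# Images are the recipes; containers (even exited ones) are the instances
docker images
docker ps -a
# Clean up the two stopped containers
docker rm run-a run-b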
Deep dive (in plain terms)
Images are built from a Dockerfile, layer by layer. If a step and its inputs haven't changed, the builder reuses the cached layer, which speeds up rebuilds. Containers share the host kernel but are isolated with namespaces and cgroups. Volumes give containers durable storage that outlives any single container.
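A quick way to see layers and caching in action on an image you have built (tiny-etl:1.0 from Example 1 is used here as a stand-in):
# One line per Dockerfile step, newest layer first, with sizes
docker history tiny-etl:1.0
# Rebuild with nothing changed: every step should be served from cache
docker build -t tiny-etl:1.0 .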
Key concepts and terms
- Dockerfile: Instructions to build an image (FROM, COPY, RUN, CMD).
- Tag: A label for an image version (e.g., myapp:1.2.0).
- Layer: A cached step in the image. Order matters for build speed.
- Volume: Persistent storage outside the container’s writable layer.
- Network: Virtual network that lets containers talk by name (DNS).
- Environment variables: Config you pass at run time (e.g., DB_URL).
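As a small illustration of runtime configuration, an environment variable passed with -e is visible to the process inside the container (the DB_URL value is a made-up placeholder):
docker run --rm -e DB_URL=postgresql://user:pass@db:5432/app python:3.11-slim \
  python -c "import os; print(os.environ['DB_URL'])"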
Worked examples
Example 1 — Containerize a tiny Python ETL
Goal: read a CSV, compute totals, print results.
Files
# app/requirements.txt
# (empty – we use only the standard library)
# app/main.py
import csv


def run():
    rows = 0
    total = 0.0
    with open("data/input.csv", newline="") as f:
        reader = csv.DictReader(f)
        for r in reader:
            rows += 1
            total += float(r["amount"]) if r["amount"] else 0
    print(f"Total rows: {rows}; Total amount: {total}")


if __name__ == "__main__":
    run()
# app/data/input.csv
id,amount
1,25.5
2,50
3,50
# Dockerfile
FROM python:3.11-slim
WORKDIR /app
COPY app/ /app/
CMD ["python", "main.py"]
Build and run
docker build -t tiny-etl:1.0 .
docker run --rm tiny-etl:1.0
Expected console output: Total rows: 3; Total amount: 125.5
Example 2 — Local Postgres with a volume + quick query
Goal: start Postgres with data persistence and query it from a temporary client container.
docker volume create pg_data
docker network create de-net || true
docker run -d --name pg --network de-net -e POSTGRES_PASSWORD=secret -v pg_data:/var/lib/postgresql/data postgres:16
# Wait a few seconds for PG to start
sleep 5
# Create a table and insert rows using the postgres client image
docker run --rm --network de-net -e PGPASSWORD=secret postgres:16 psql -h pg -U postgres -c "CREATE TABLE items (id INT, amount NUMERIC); INSERT INTO items VALUES (1,25.5),(2,50),(3,50);"
# Query rows
docker run --rm --network de-net -e PGPASSWORD=secret postgres:16 psql -h pg -U postgres -c "SELECT COUNT(*) FROM items;"
You should see a count of 3.
Example 3 — Two-service docker-compose (ETL + Postgres)
docker-compose.yml
services:
  pg:
    image: postgres:16
    environment:
      POSTGRES_PASSWORD: secret
    volumes:
      - pg_data:/var/lib/postgresql/data
    networks: [de]
  etl:
    build: .            # build from the Dockerfile in Example 1 so --build works
    image: tiny-etl:1.0
    depends_on: [pg]
    environment:
      DB_HOST: pg
      DB_USER: postgres
      DB_PASSWORD: secret
    networks: [de]

volumes:
  pg_data:

networks:
  de:
Run
docker compose up --build
The ETL can reach Postgres via the service name pg on the de network; the sketch below shows how the job could read the DB_* settings from the environment block.
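A minimal sketch of reading those compose-provided settings in Python; note that actually connecting to Postgres would also require adding a driver such as psycopg to requirements.txt, which Example 1 deliberately leaves empty:
# app/config.py (hypothetical helper, not part of Example 1)
import os

DB_HOST = os.environ.get("DB_HOST", "localhost")   # resolves to the pg service inside the compose network
DB_USER = os.environ.get("DB_USER", "postgres")
DB_PASSWORD = os.environ.get("DB_PASSWORD", "")    # supplied at run time, never baked into the image

print(f"Would connect to {DB_HOST} as {DB_USER}")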
Registries, tags, and layers
- Use immutable tags (e.g., 1.0.3) for deployments. Avoid relying on latest in production.
- Order Dockerfile steps to maximize cache hits: install dependencies before copying frequently changing code.
- Push/pull: docker push myrepo/tiny-etl:1.0 to publish, then docker pull when deploying elsewhere.
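A typical publish flow looks like this (myrepo is a placeholder for your registry namespace, e.g. a Docker Hub account or a private registry host):
docker tag tiny-etl:1.0 myrepo/tiny-etl:1.0
docker push myrepo/tiny-etl:1.0
# On the machine that deploys the job
docker pull myrepo/tiny-etl:1.0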
Layer caching trick
Put dependency installation before copying source files. That way, code changes don’t invalidate heavy dependency layers unnecessarily.
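For a Python job with third-party dependencies, that ordering might look like the sketch below (unlike Example 1, requirements.txt is assumed to list real packages here):
# Dockerfile (cache-friendly ordering)
FROM python:3.11-slim
WORKDIR /app
# Dependencies first: this layer is rebuilt only when requirements.txt changes
COPY app/requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt
# Source last: frequent code edits no longer invalidate the dependency layer
COPY app/ /app/
CMD ["python", "main.py"]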
Volumes, env vars, and networks (essentials)
- Persist data with named volumes (-v pg_data:/var/lib/postgresql/data).
- Keep config in env vars; mount separate files for secrets when possible (see the env-file sketch after this list).
- Use a user-defined network so containers can resolve each other by service name.
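One way to keep secrets out of the shell history and out of the image is an env file passed only at run time (.env is a hypothetical file you would exclude from version control and from the build context):
# .env (never committed, never copied into the image)
DB_PASSWORD=secret
# Inject it when the container starts
docker run --rm --env-file .env --network de-net tiny-etl:1.0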
Security & resource limits basics
- Run as a non-root user where possible.
- Pass secrets at runtime; avoid baking them into images or logs.
- Set resource boundaries for noisy workloads, e.g. --memory 1g --cpus 1.0.
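Applied to the Postgres container from Example 2, limits are just extra flags on docker run (1 GB and 1 CPU are arbitrary starting points; size them for your workload):
docker run -d --name pg --network de-net --memory 1g --cpus 1.0 \
  -e POSTGRES_PASSWORD=secret -v pg_data:/var/lib/postgresql/data postgres:16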
Non-root example
# in Dockerfile
RUN useradd -m appuser
USER appuser
Exercises
Complete these to lock in the skills. The Quick Test at the end is available to everyone; only logged-in learners will see saved progress.
Exercise 1 — Containerize a tiny Python ETL and run it
- Create the files shown in Example 1.
- Build the image tiny-etl:1.0.
- Run it and confirm totals match.
Checklist
- [ ] Dockerfile builds successfully
- [ ] Image runs without mounting anything
- [ ] Output shows correct counts
Exercise 2 — Postgres with volume + query from another container
- Create a named volume and network.
- Run Postgres on that network with a password.
- Use a temporary postgres client container to create a table, insert rows, and count them.
Checklist
- [ ] Named volume created
- [ ] Postgres reachable by service name
- [ ] Query returns expected count
Common mistakes & self-check
- Using latest tags in prod: Pin a specific version instead.
- Huge images: Use slim base images; clean caches; copy only needed files.
- Baking secrets into images: Pass via env vars or mounted files at runtime.
- Unreliable builds due to cache busting: Order Dockerfile steps carefully.
- Forgetting volumes for stateful services: Use named volumes for databases and brokers.
Self-check prompts
- Can you explain the difference between an image and a container?
- Can two containers talk without exposing ports on the host? How?
- How would you roll back a bad image deployment?
Practical projects
- Local Data Lab: Compose file with Postgres + a Python ETL image that loads CSV to a table.
- Metrics Sandbox: Add a lightweight exporter and scrape with a separate containerized tool; visualize counts from your pipeline.
- Batch Runner: Build an image that runs a daily job via a simple shell entrypoint and accepts runtime config via env vars.
Mini challenge
Refactor your Dockerfile to reduce the final image size by at least 30% without changing functionality. Hint: use a slimmer base image, multi-stage build to drop build-time files, and avoid copying test data into the image.
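One possible shape for that refactor, sketched on the assumption that your requirements.txt pulls in packages worth isolating from the final image (if the job is standard-library only, a slimmer base and a tighter COPY may already hit the target):
# Dockerfile (multi-stage sketch)
FROM python:3.11 AS build
WORKDIR /app
COPY app/requirements.txt .
RUN pip install --no-cache-dir --prefix=/install -r requirements.txt

FROM python:3.11-slim
WORKDIR /app
# Carry over only the installed packages and the code; build tools and test data stay behind
COPY --from=build /install /usr/local
COPY app/main.py .
CMD ["python", "main.py"]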
What to measure
- Image size before vs. after (docker images).
- Build time improvements due to better layer caching.
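Rough ways to capture both numbers from the command line (tiny-etl is the image name used throughout; adjust the tags to match your rebuild):
# Compare sizes across tags of the same image
docker images tiny-etl
# Time a rebuild after touching only source files; cached dependency layers should keep it short
time docker build -t tiny-etl:1.1 .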
Who this is for
- New data engineers who need reproducible environments.
- Analysts or scientists moving pipelines from notebooks to production.
- Software engineers supporting data workloads.
Prerequisites
- Basic command line skills.
- Python fundamentals (reading files, running scripts).
- Comfort with SQL basics is helpful.
Learning path
- Containerization Basics (this lesson)
- Docker Compose for local data stacks
- Image publishing and CI builds
- Intro to Kubernetes for batch jobs
Next steps
- Add a non-root user to your images and set resource limits.
- Introduce docker-compose to manage multi-service stacks.
- Automate builds in CI with pinned tags and image scanning.
Progress & test note
The quick test below is available to everyone. If you are logged in, your answers and progress will be saved so you can pick up later.