Who this is for
- Aspiring and junior Data Engineers who need reliable dev/stage/prod setups.
- Analysts or ML engineers turning notebooks into repeatable jobs.
- Anyone who has heard “works on my machine” and wants it to work everywhere.
Prerequisites
- Basic command line comfort (cd, mkdir, running scripts).
- Python basics (running a script, installing packages).
- Optional but helpful: Docker installed locally.
Why this matters
Real Data Engineering tasks rely on consistent environments to avoid silent failures and costly re-runs. You will:
- Deploy batch jobs (e.g., Spark/SQL/ETL) across dev, staging, and production.
- Run Airflow/DBT pipelines that require the same versions and credentials in each environment.
- Share reproducible projects with teammates and CI systems.
- Rotate secrets safely and parameterize jobs per environment without code changes.
Concept explained simply
Environment Configuration Management is how you make your code run the same way everywhere by:
- Pinning software versions.
- Separating configuration from code (using environment variables and config files).
- Keeping secrets out of source code.
- Automating setup so it’s repeatable and testable.
Mental model
Think: Recipe + Pantry + Labels.
- Recipe: code and dependency list (requirements.txt).
- Pantry: base runtime image or virtual environment with the tools you need.
- Labels: environment variables/config files that tell the same code how to behave in dev vs prod.
Core principles
- Idempotency: running setup twice should produce the same state (see the sketch after this list).
- Pin versions: specify exact versions to avoid breaking changes.
- Config, not code: swap environments by changing variables, not code.
- Secrets management: never commit secrets; use env vars or a secret manager.
- Environment parity: dev should mirror prod as closely as practical.
- Documentation as code: .env.example, Makefile, and READMEs reduce guesswork.
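A minimal sketch of what an idempotent, repeatable setup step looks like in Python (the file and folder names here are placeholders, not part of the later examples):
# setup.py (illustrative sketch; safe to run repeatedly)
from pathlib import Path
# mkdir with exist_ok=True is idempotent: a second run changes nothing
Path("data").mkdir(parents=True, exist_ok=True)
Path("config").mkdir(parents=True, exist_ok=True)
# create a local .env from the committed template only if it does not exist yet
if not Path(".env").exists() and Path(".env.example").exists():
    Path(".env").write_text(Path(".env.example").read_text())
print("Setup complete (re-running changes nothing)")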
Worked examples
Example 1: A small ETL with .env configuration
Goal: Parameterize input/output paths and processing batch size via environment variables.
# .env (do not commit real secrets)
INPUT_PATH=data/input.csv
OUTPUT_PATH=data/output.parquet
CHUNK_SIZE=5000
# .env.example (safe to commit)
INPUT_PATH=path/to/input.csv
OUTPUT_PATH=path/to/output.parquet
CHUNK_SIZE=5000
# job.py
import os
from dotenv import load_dotenv
import pandas as pd
load_dotenv()
input_path = os.getenv("INPUT_PATH", "data/input.csv")
output_path = os.getenv("OUTPUT_PATH", "data/output.parquet")
chunk_size = int(os.getenv("CHUNK_SIZE", "10000"))
# Minimal example: read once, write once (no real chunking; a chunked variant is sketched below)
df = pd.read_csv(input_path)
df.to_parquet(output_path, index=False)
print(f"Wrote {len(df)} rows to {output_path}")
Run: create a venv; install python-dotenv, pandas, and pyarrow; then run python job.py. Change values in .env to switch behavior without editing code.
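If the input file is too large to load at once, the same job can honor CHUNK_SIZE with a chunked variant. A sketch, assuming the pandas/pyarrow versions pinned in Example 2:
# job_chunked.py (sketch; same environment variables as job.py)
import os
import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq
from dotenv import load_dotenv
load_dotenv()
input_path = os.getenv("INPUT_PATH", "data/input.csv")
output_path = os.getenv("OUTPUT_PATH", "data/output.parquet")
chunk_size = int(os.getenv("CHUNK_SIZE", "10000"))
writer = None
rows = 0
# stream the CSV in CHUNK_SIZE-row pieces and append each piece to one Parquet file
for chunk in pd.read_csv(input_path, chunksize=chunk_size):
    table = pa.Table.from_pandas(chunk, preserve_index=False)
    if writer is None:
        # take the schema from the first chunk; dtype drift across chunks would need explicit dtypes
        writer = pq.ParquetWriter(output_path, table.schema)
    writer.write_table(table)
    rows += len(chunk)
if writer is not None:
    writer.close()
print(f"Wrote {rows} rows to {output_path}")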
Example 2: Reproducible environment with Docker
# Dockerfile
FROM python:3.11-slim
WORKDIR /app
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt
COPY . .
ENV PYTHONUNBUFFERED=1
CMD ["python", "job.py"]
# requirements.txt
pandas==2.2.0
pyarrow==14.0.1
python-dotenv==1.0.0
Build and run:
docker build -t etl-job:latest .
# Use --env-file to pass configuration without baking it into the image
docker run --rm --env-file .env -v "$PWD/data":/app/data etl-job:latest
Result: consistent runtime independent of host machine.
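One caveat with COPY . .: it copies everything in the build context, including a local .env, into the image. A minimal .dockerignore keeps secrets and local data out of the build:
# .dockerignore
.env
.venv/
data/
__pycache__/
.git/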
Example 3: Multi-environment configs (dev/stage/prod)
# config/dev.json
{
  "db_url": "postgresql://dev_user@localhost:5432/devdb",
  "bucket": "local-bucket",
  "parallelism": 2
}
# config/prod.json
{
  "db_url": "postgresql://service@prod:5432/warehouse",
  "bucket": "analytics-bucket",
  "parallelism": 8
}
# job_configured.py
import json, os
from dotenv import load_dotenv
load_dotenv()
env = os.getenv("APP_ENV", "dev")
with open(f"config/{env}.json") as f:
cfg = json.load(f)
print(f"Running with {env} config: parallelism={cfg['parallelism']}")
Switch environments by setting APP_ENV=dev|stage|prod (add a config/stage.json alongside the other two for staging).
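Because a typo in a config file only surfaces when the key is first used, it helps to validate the loaded config up front. A hedged sketch that extends job_configured.py (the required keys are simply the ones used in the example configs):
# job_configured.py (validation sketch)
import json, os
from dotenv import load_dotenv
load_dotenv()
env = os.getenv("APP_ENV", "dev")
with open(f"config/{env}.json") as f:
    cfg = json.load(f)
# fail fast if this environment's file is missing an expected key
required = {"db_url", "bucket", "parallelism"}
missing = required - cfg.keys()
if missing:
    raise KeyError(f"config/{env}.json is missing keys: {sorted(missing)}")
print(f"Running with {env} config: parallelism={cfg['parallelism']}")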
Step-by-step: make your project reproducible
- Create a project folder: env-demo. Add folders: data, config.
- Initialize a virtual environment and pin dependencies.
python -m venv .venv
source .venv/bin/activate  # Windows: .venv\Scripts\activate
pip install --upgrade pip
pip install pandas==2.2.0 pyarrow==14.0.1 python-dotenv==1.0.0
pip freeze > requirements.txt
- Add .env and .env.example; keep secrets only in .env.
- Add a simple Makefile to standardize commands.
run:
	python job.py

setup:
	python -m venv .venv && . .venv/bin/activate && pip install -r requirements.txt
- Smoke test: run the job locally; change .env to simulate staging.
- Optional: build a Docker image and run with --env-file to ensure parity.
Tip: Keep secrets safe
- Never commit .env files with real secrets (see the .gitignore sketch after this list).
- Store example keys in .env.example (no secrets).
- Rotate credentials periodically and prefer short-lived tokens where available.
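Two small pieces that enforce these points: a .gitignore entry for the real .env, and a strict reader that fails loudly when a secret is absent instead of falling back to a default (DB_PASSWORD below is only an example variable name):
# .gitignore (excerpt)
.env
.venv/
# secrets_check.py (sketch)
import os
def require_env(name: str) -> str:
    # never default a secret: a missing value should stop the job immediately
    value = os.getenv(name)
    if not value:
        raise RuntimeError(f"Missing required environment variable: {name}")
    return value
db_password = require_env("DB_PASSWORD")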
Exercises
Complete these in order. They mirror the exercises below so your work can be checked.
- Exercise 1: Parameterize a small ETL with .env (see Exercises list below).
- Exercise 2: Lock reproducible dependencies with a venv and pinned versions.
- Exercise 3: Containerize and run the job with environment variables.
Self-check checklist
- [ ] I can run the same script by only changing .env or APP_ENV.
- [ ] My requirements.txt has exact versions.
- [ ] My .env is ignored by git and .env.example documents required keys.
- [ ] Docker run with --env-file produces the same output as local run.
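For the last checklist item, one way to verify parity is to capture both runs and compare them (image tag and volume mount as in Example 2):
python job.py > local_run.txt
docker run --rm --env-file .env -v "$PWD/data":/app/data etl-job:latest > docker_run.txt
diff local_run.txt docker_run.txt && echo "Outputs match"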
Common mistakes and how to self-check
- Forgetting to pin versions: Run pip freeze > requirements.txt; re-install to verify consistent versions.
- Committing secrets: Ensure .gitignore includes .env; check git history if something slipped (commands sketched below).
- Hardcoding paths: Replace paths with env variables (INPUT_PATH, OUTPUT_PATH).
- Config drift across environments: Keep dev/stage/prod config files in one place; document differences.
- Docker image missing dependencies: Build after updating requirements.txt and verify with docker run.
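For the committing-secrets check, these standard git commands confirm that .env is ignored and was never committed:
# prints the matching .gitignore rule if .env is ignored
git check-ignore -v .env
# lists any past commits that touched .env; empty output means it never entered history
git log --all --oneline -- .env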
Practical projects
- Project 1: CSV to Parquet batch job with .env-driven input/output and chunk size. Provide .env.example.
- Project 2: Dockerized data quality checker that reads a config JSON per environment and prints validation results.
- Project 3: Simple orchestration (cron or a tiny scheduler script) that runs the same container with different .env files for dev and prod.
Learning path
- Today: Master .env files, pinned dependencies, and Docker basics.
- Next: Introduce Makefile or task runners to codify setup and run commands.
- Then: Learn secrets managers and CI variables; template configs per environment.
- Later: Infrastructure-as-Code and container orchestration for scalable, repeatable deployments.
Next steps
- Finish the exercises below and ensure your checklist is all marked.
- Convert one of your existing scripts into a parameterized, dockerized job.
- Take the Quick Test to confirm your understanding.
Mini challenge
Given a script that reads from one table and writes a daily partitioned Parquet to a folder, make it environment-agnostic by moving connection strings, output paths, and partition size into .env or config files. Prove it by running once with APP_ENV=dev and once with APP_ENV=prod without changing code.
Quick Test
Everyone can take the test; only logged-in users get saved progress.