Who this is for
- Aspiring and junior Data Engineers who need reliable dev/stage/prod setups.
- Analysts or ML engineers turning notebooks into repeatable jobs.
- Anyone who has heard “works on my machine” and wants it to work everywhere.
Prerequisites
- Basic command line comfort (cd, mkdir, running scripts).
- Python basics (running a script, installing packages).
- Optional but helpful: Docker installed locally.
Why this matters
Real Data Engineering tasks rely on consistent environments to avoid silent failures and costly re-runs. You will:
- Deploy batch jobs (e.g., Spark/SQL/ETL) across dev, staging, and production.
- Run Airflow/DBT pipelines that require the same versions and credentials in each environment.
- Share reproducible projects with teammates and CI systems.
- Rotate secrets safely and parameterize jobs per environment without code changes.
Concept explained simply
Environment Configuration Management is how you make your code run the same way everywhere by:
- Pinning software versions.
- Separating configuration from code (using environment variables and config files).
- Keeping secrets out of source code.
- Automating setup so it’s repeatable and testable.
Mental model
Think: Recipe + Pantry + Labels.
- Recipe: code and dependency list (requirements.txt).
- Pantry: base runtime image or virtual environment with the tools you need.
- Labels: environment variables/config files that tell the same code how to behave in dev vs prod.
Core principles
- Idempotency: running setup twice should produce the same state (see the sketch after this list).
- Pin versions: specify exact versions to avoid breaking changes.
- Config, not code: swap environments by changing variables, not code.
- Secrets management: never commit secrets; use env vars or a secret manager.
- Environment parity: dev should mirror prod as closely as practical.
- Documentation as code: .env.example, Makefile, and READMEs reduce guesswork.
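A minimal sketch of what an idempotent, repeatable setup step looks like in Python (the file and folder names here are placeholders, not part of the later examples):
# setup.py (illustrative sketch; safe to run repeatedly)
from pathlib import Path
# mkdir with exist_ok=True is idempotent: a second run changes nothing
Path("data").mkdir(parents=True, exist_ok=True)
Path("config").mkdir(parents=True, exist_ok=True)
# create a local .env from the committed template only if it does not exist yet
if not Path(".env").exists() and Path(".env.example").exists():
    Path(".env").write_text(Path(".env.example").read_text())
print("Setup complete (re-running changes nothing)")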
Worked examples
Example 1: A small ETL with .env configuration
Goal: Parameterize input/output paths and processing batch size via environment variables.
# .env (do not commit real secrets)
INPUT_PATH=data/input.csv
OUTPUT_PATH=data/output.parquet
CHUNK_SIZE=5000
# .env.example (safe to commit)
INPUT_PATH=path/to/input.csv
OUTPUT_PATH=path/to/output.parquet
CHUNK_SIZE=5000
# job.py
import os
from dotenv import load_dotenv
import pandas as pd
load_dotenv()
input_path = os.getenv("INPUT_PATH", "data/input.csv")
output_path = os.getenv("OUTPUT_PATH", "data/output.parquet")
chunk_size = int(os.getenv("CHUNK_SIZE", "10000"))
# Minimal example: read once, write once (no real chunking; a chunked variant is sketched below)
df = pd.read_csv(input_path)
df.to_parquet(output_path, index=False)
print(f"Wrote {len(df)} rows to {output_path}")
Run: create a venv; install python-dotenv, pandas, and pyarrow; then run python job.py. Change values in .env to switch behavior without editing code.
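If the input file is too large to load at once, the same job can honor CHUNK_SIZE with a chunked variant. A sketch, assuming the pandas/pyarrow versions pinned in Example 2:
# job_chunked.py (sketch; same environment variables as job.py)
import os
import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq
from dotenv import load_dotenv
load_dotenv()
input_path = os.getenv("INPUT_PATH", "data/input.csv")
output_path = os.getenv("OUTPUT_PATH", "data/output.parquet")
chunk_size = int(os.getenv("CHUNK_SIZE", "10000"))
writer = None
rows = 0
# stream the CSV in CHUNK_SIZE-row pieces and append each piece to one Parquet file
for chunk in pd.read_csv(input_path, chunksize=chunk_size):
    table = pa.Table.from_pandas(chunk, preserve_index=False)
    if writer is None:
        # take the schema from the first chunk; dtype drift across chunks would need explicit dtypes
        writer = pq.ParquetWriter(output_path, table.schema)
    writer.write_table(table)
    rows += len(chunk)
if writer is not None:
    writer.close()
print(f"Wrote {rows} rows to {output_path}")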
Example 2: Reproducible environment with Docker
# Dockerfile
FROM python:3.11-slim
WORKDIR /app
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt
COPY . .
ENV PYTHONUNBUFFERED=1
CMD ["python", "job.py"]
# requirements.txt
pandas==2.2.0
pyarrow==14.0.1
python-dotenv==1.0.0
Build and run:
docker build -t etl-job:latest .
# Use --env-file to pass configuration without baking it into the image
docker run --rm --env-file .env -v "$PWD/data":/app/data etl-job:latest
Result: consistent runtime independent of host machine.
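One caveat with COPY . .: it copies everything in the build context, including a local .env, into the image. A minimal .dockerignore keeps secrets and local data out of the build:
# .dockerignore
.env
.venv/
data/
__pycache__/
.git/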
Example 3: Multi-environment configs (dev/stage/prod)
# config/dev.json
{
  "db_url": "postgresql://dev_user@localhost:5432/devdb",
  "bucket": "local-bucket",
  "parallelism": 2
}
# config/prod.json
{
  "db_url": "postgresql://service@prod:5432/warehouse",
  "bucket": "analytics-bucket",
  "parallelism": 8
}
# job_configured.py
import json, os
from dotenv import load_dotenv
load_dotenv()
env = os.getenv("APP_ENV", "dev")
with open(f"config/{env}.json") as f:
cfg = json.load(f)
print(f"Running with {env} config: parallelism={cfg['parallelism']}")
Switch environments by setting APP_ENV=dev|stage|prod (add a config/stage.json alongside the other two for staging).
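Because a typo in a config file only surfaces when the key is first used, it helps to validate the loaded config up front. A hedged sketch that extends job_configured.py (the required keys are simply the ones used in the example configs):
# job_configured.py (validation sketch)
import json, os
from dotenv import load_dotenv
load_dotenv()
env = os.getenv("APP_ENV", "dev")
with open(f"config/{env}.json") as f:
    cfg = json.load(f)
# fail fast if this environment's file is missing an expected key
required = {"db_url", "bucket", "parallelism"}
missing = required - cfg.keys()
if missing:
    raise KeyError(f"config/{env}.json is missing keys: {sorted(missing)}")
print(f"Running with {env} config: parallelism={cfg['parallelism']}")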
Step-by-step: make your project reproducible
- Create a project folder: env-demo. Add folders: data, config.
- Initialize a virtual environment and pin dependencies.
python -m venv .venv
source .venv/bin/activate  # Windows: .venv\Scripts\activate
pip install --upgrade pip
pip install pandas==2.2.0 pyarrow==14.0.1 python-dotenv==1.0.0
pip freeze > requirements.txt
- Add .env and .env.example; keep secrets only in .env.
- Add a simple Makefile to standardize commands.
run:
	python job.py

setup:
	python -m venv .venv && . .venv/bin/activate && pip install -r requirements.txt
- Smoke test: run the job locally; change .env to simulate staging.
- Optional: build a Docker image and run with --env-file to ensure parity.
Tip: Keep secrets safe
- Never commit .env files with real secrets (see the .gitignore sketch after this list).
- Store example keys in .env.example (no secrets).
- Rotate credentials periodically and prefer short-lived tokens where available.
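Two small pieces that enforce these points: a .gitignore entry for the real .env, and a strict reader that fails loudly when a secret is absent instead of falling back to a default (DB_PASSWORD below is only an example variable name):
# .gitignore (excerpt)
.env
.venv/
# secrets_check.py (sketch)
import os
def require_env(name: str) -> str:
    # never default a secret: a missing value should stop the job immediately
    value = os.getenv(name)
    if not value:
        raise RuntimeError(f"Missing required environment variable: {name}")
    return value
db_password = require_env("DB_PASSWORD")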
Exercises
Complete these in order. They mirror the exercises below so your work can be checked.
- Exercise 1: Parameterize a small ETL with .env (see Exercises list below).
- Exercise 2: Lock reproducible dependencies with a venv and pinned versions.
- Exercise 3: Containerize and run the job with environment variables.
Self-check checklist
- [ ] I can run the same script by only changing .env or APP_ENV.
- [ ] My requirements.txt has exact versions.
- [ ] My .env is ignored by git and .env.example documents required keys.
- [ ] Docker run with --env-file produces the same output as local run.
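For the last checklist item, one way to verify parity is to capture both runs and compare them (image tag and volume mount as in Example 2):
python job.py > local_run.txt
docker run --rm --env-file .env -v "$PWD/data":/app/data etl-job:latest > docker_run.txt
diff local_run.txt docker_run.txt && echo "Outputs match"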
Common mistakes and how to self-check
- Forgetting to pin versions: Run pip freeze > requirements.txt; re-install to verify consistent versions.
- Committing secrets: Ensure .gitignore includes .env; check git history if something slipped (commands sketched below).
- Hardcoding paths: Replace paths with env variables (INPUT_PATH, OUTPUT_PATH).
- Config drift across environments: Keep dev/stage/prod config files in one place; document differences.
- Docker image missing dependencies: Build after updating requirements.txt and verify with docker run.
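For the committing-secrets check, these standard git commands confirm that .env is ignored and was never committed:
# prints the matching .gitignore rule if .env is ignored
git check-ignore -v .env
# lists any past commits that touched .env; empty output means it never entered history
git log --all --oneline -- .env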
Practical projects
- Project 1: CSV to Parquet batch job with .env-driven input/output and chunk size. Provide .env.example.
- Project 2: Dockerized data quality checker that reads a config JSON per environment and prints validation results.
- Project 3: Simple orchestration (cron or a tiny scheduler script) that runs the same container with different .env files for dev and prod.
Learning path
- Today: Master .env files, pinned dependencies, and Docker basics.
- Next: Introduce Makefile or task runners to codify setup and run commands.
- Then: Learn secrets managers and CI variables; template configs per environment.
- Later: Infrastructure-as-Code and container orchestration for scalable, repeatable deployments.
Next steps
- Finish the exercises below and ensure your checklist is all marked.
- Convert one of your existing scripts into a parameterized, dockerized job.
- Take the Quick Test to confirm your understanding.
Mini challenge
Given a script that reads from one table and writes a daily partitioned Parquet to a folder, make it environment-agnostic by moving connection strings, output paths, and partition size into .env or config files. Prove it by running once with APP_ENV=dev and once with APP_ENV=prod without changing code.
Quick Test
Everyone can take the test; only logged-in users get saved progress.