
Infrastructure And DevOps Basics

Learn Infrastructure And DevOps Basics for Data Engineers for free: roadmap, examples, subskills, and a skill exam.

Published: January 8, 2026 | Updated: January 8, 2026

Why this skill matters for Data Engineers

Modern data teams ship pipelines and platforms that must be reliable, reproducible, and cost-aware. Infrastructure and DevOps basics help you: provision cloud resources safely, package and run jobs consistently, automate testing and deployments, manage environments and secrets, control costs, scale workloads, and recover from failures. Mastering these unlocks faster delivery and fewer production incidents.

Who this is for

  • Early-career Data Engineers moving beyond notebooks to production.
  • Analysts/Scientists deploying recurring jobs or Airflow/Spark pipelines.
  • Engineers transitioning from on-prem to cloud data platforms.

Prerequisites

  • Comfortable with Python or SQL for data tasks.
  • Basic Git (clone, commit, branch, PR).
  • Familiarity with at least one cloud service concept (storage, compute, IAM).

Learning path

  1. Start with Infrastructure as Code (Terraform basics). Learn to declare and version cloud resources.
  2. Containerize a simple data job (Dockerfile, image size optimization, environment variables).
  3. Add CI/CD: run tests, build/push container, deploy job to a scheduler (e.g., Airflow/K8s/Cron).
  4. Introduce environment configuration management and secret handling.
  5. Manage dependencies and lock versions for reproducibility.
  6. Set up cost monitoring, quotas, and alerting.
  7. Plan scaling and resource allocation; test performance.
  8. Add backups, RPO/RTO targets, and a recovery runbook.
What "good" looks like
  • Infra is defined in code, reviewed via PRs, and applied via CI.
  • Jobs run in containers with pinned, reproducible dependencies.
  • Config/secrets are externalized; environments are isolated.
  • Pipelines have automated tests and deploy steps.
  • Costs are measured; quotas prevent runaway spend.
  • Capacity and recovery objectives are documented and tested.

Practical roadmap (milestones)

  1. M1 — IaC Skeleton: Provision storage and an execution role with Terraform; enable remote state.
  2. M2 — Containerized Job: Package a Python ETL into a small, fast Docker image.
  3. M3 — CI Basics: Lint, unit test, build and push image on each commit to main.
  4. M4 — Deploy Automation: CI applies Terraform in a controlled stage, then production.
  5. M5 — Config/Secrets: Move environment variables and secrets to a manager; no secrets in Git.
  6. M6 — Cost & Scaling: Add budgets/alerts; set resource requests/limits; load test.
  7. M7 — Recovery: Add backups and a tested disaster recovery runbook with RPO/RTO.
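The worked examples below cover resource requests and limits (example 6) but not spend tracking, so here is a rough sketch for M6 that pulls month-to-date spend grouped by an environment tag using boto3 and the AWS Cost Explorer API. The tag key is an assumption, cost-allocation tags must already be activated in the account, and Cost Explorer API calls are billed per request.

# cost_report.py (sketch; the "env" tag key is an assumption)
from datetime import date

import boto3

def month_to_date_cost_by_env(tag_key: str = "env") -> None:
    """Print month-to-date unblended cost grouped by a cost-allocation tag."""
    ce = boto3.client("ce")  # Cost Explorer
    today = date.today()
    resp = ce.get_cost_and_usage(
        TimePeriod={
            "Start": today.replace(day=1).isoformat(),  # note: empty window on the 1st
            "End": today.isoformat(),
        },
        Granularity="MONTHLY",
        Metrics=["UnblendedCost"],
        GroupBy=[{"Type": "TAG", "Key": tag_key}],
    )
    for group in resp["ResultsByTime"][0]["Groups"]:
        amount = float(group["Metrics"]["UnblendedCost"]["Amount"])
        print(group["Keys"][0], round(amount, 2), "USD")

if __name__ == "__main__":
    month_to_date_cost_by_env()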

Worked examples

1) Terraform: bucket + access policy

# main.tf
terraform {
  required_providers { aws = { source = "hashicorp/aws", version = "~> 5.0" } }
  backend "s3" { bucket = "my-tf-state" key = "infra/dev/terraform.tfstate" region = "us-east-1" }
}
provider "aws" { region = var.region }

resource "aws_s3_bucket" "raw" {
  bucket = "acme-data-raw-${var.env}"
  force_destroy = true
}

resource "aws_iam_user" "etl" { name = "etl-${var.env}" }
resource "aws_iam_user_policy" "etl_readwrite" {
  name = "etl-s3-access-${var.env}"
  user = aws_iam_user.etl.name
  policy = jsonencode({
    Version = "2012-10-17",
    Statement = [{
      Effect = "Allow",
      Action = ["s3:PutObject","s3:GetObject","s3:ListBucket"],
      Resource = [aws_s3_bucket.raw.arn, "${aws_s3_bucket.raw.arn}/*"]
    }]
  })
}

variable "env" { type = string }
variable "region" { type = string, default = "us-east-1" }
output "raw_bucket" { value = aws_s3_bucket.raw.bucket }

Run: terraform init → terraform plan → terraform apply -var env=dev.
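After apply, a quick smoke test from Python can confirm the bucket is usable before any pipeline code depends on it. A minimal sketch, assuming the default AWS credentials have access and the bucket name follows the naming in main.tf above:

# smoke_test.py (sketch; bucket name mirrors the naming in main.tf)
import boto3

def check_bucket(bucket_name: str) -> None:
    """Verify the freshly provisioned bucket exists and accepts writes."""
    s3 = boto3.client("s3")
    s3.head_bucket(Bucket=bucket_name)  # raises ClientError if missing or access is denied
    s3.put_object(Bucket=bucket_name, Key="_smoke/ok", Body=b"ok")
    body = s3.get_object(Bucket=bucket_name, Key="_smoke/ok")["Body"].read()
    assert body == b"ok"
    print(f"{bucket_name}: reachable and writable")

if __name__ == "__main__":
    check_bucket("acme-data-raw-dev")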

2) Dockerfile for a Python ETL

# Dockerfile
FROM python:3.11-slim
WORKDIR /app
# Install system deps first to maximize caching
RUN apt-get update && apt-get install -y --no-install-recommends build-essential && rm -rf /var/lib/apt/lists/*
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt
COPY etl/ etl/
COPY main.py .
ENV PYTHONUNBUFFERED=1
CMD ["python","main.py"]

Tip: keep images small; pin versions in requirements.txt; read secrets from environment variables.

3) CI pipeline (test → build → push)

# .github/workflows/ci.yml
name: ci
on: [push]
jobs:
  build_test:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v5
        with: { python-version: '3.11' }
      - run: pip install -r requirements.txt && pip install pytest flake8
      - run: flake8 etl && pytest -q
      - uses: docker/setup-buildx-action@v3
      - uses: docker/login-action@v3
        with:
          registry: ghcr.io
          username: ${{ secrets.CI_USER }}
          password: ${{ secrets.CI_TOKEN }}
      - uses: docker/build-push-action@v5
        with:
          context: .
          push: true
          tags: ghcr.io/org/data-etl:${{ github.sha }}

Store credentials in the platform’s secret manager. Never commit tokens.

4) Environment config and secrets

# config.yaml
s3_bucket: "acme-data-raw"
input_prefix: "landing/"
output_prefix: "curated/"

# main.py (snippet)
import os, yaml
with open("config.yaml") as f:
    cfg = yaml.safe_load(f)

S3_BUCKET = cfg["s3_bucket"]
AWS_REGION = os.environ.get("AWS_REGION", "us-east-1")
DB_PASSWORD = os.environ["DB_PASSWORD"]  # Inject via secret manager at runtime

Configuration stays in files or parameters; secrets come only from environment or secret manager at runtime.
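If secrets live in a dedicated manager rather than plain environment variables, the job can fetch them at startup. A minimal sketch using AWS Secrets Manager via boto3; the secret name is a placeholder, and other clouds offer equivalent APIs:

# secrets.py (sketch; the secret name is a placeholder)
import os

import boto3

def get_db_password() -> str:
    """Prefer an injected environment variable; otherwise fall back to AWS Secrets Manager."""
    if "DB_PASSWORD" in os.environ:
        return os.environ["DB_PASSWORD"]
    client = boto3.client("secretsmanager")
    resp = client.get_secret_value(SecretId="acme/etl/db_password")
    return resp["SecretString"]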

5) Dependency management with a lockfile

# requirements.in (human-edited)
boto3
pandas==2.2.*
pyarrow

# compile lock
pip install pip-tools
pip-compile requirements.in -o requirements.txt

# Use lock
pip install -r requirements.txt

Result: reproducible builds with pinned transitive versions.
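As an optional sanity check, the running container can verify that installed packages match the lockfile. A small standard-library sketch; it assumes the lockfile is requirements.txt with plain name==version pins:

# check_pins.py (optional sketch; compares installed packages against requirements.txt)
from importlib.metadata import PackageNotFoundError, version

def check_pins(lockfile: str = "requirements.txt") -> list:
    """Return (package, pinned, installed) tuples for any mismatch with the lockfile."""
    mismatches = []
    for line in open(lockfile):
        line = line.strip()
        if not line or line.startswith("#") or "==" not in line:
            continue
        name, _, pinned = line.partition("==")
        name = name.split("[")[0].strip()                    # drop extras like pkg[s3]
        pinned = pinned.split(";")[0].strip().split(" ")[0]  # drop markers and trailing content
        try:
            installed = version(name)
        except PackageNotFoundError:
            installed = "not installed"
        if installed != pinned:
            mismatches.append((name, pinned, installed))
    return mismatches

if __name__ == "__main__":
    for name, pinned, installed in check_pins():
        print(f"{name}: expected {pinned}, found {installed}")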

6) Scheduling with resource limits (Kubernetes Job)

apiVersion: batch/v1
kind: Job
metadata:
  name: etl-daily
spec:
  template:
    spec:
      containers:
      - name: etl
        image: ghcr.io/org/data-etl:SHA
        resources:
          requests: { cpu: "500m", memory: "1Gi" }
          limits:   { cpu: "2",    memory: "4Gi" }
        env:
        - name: DB_PASSWORD
          valueFrom: { secretKeyRef: { name: db-secret, key: password } }
      restartPolicy: Never
  backoffLimit: 2

Requests help the scheduler place the pod; limits cap usage and costs.

Drills and exercises

  • [ ] Write a Terraform module that creates a storage bucket and outputs its name.
  • [ ] Build a Docker image under 300MB for a simple pandas job.
  • [ ] Add flake8 and pytest to your CI and fail on lint/test errors.
  • [ ] Externalize all secrets to a secret manager or environment variables.
  • [ ] Create separate dev/stage/prod configs and deploy the same container to each.
  • [ ] Set a monthly budget with an alert at 80% spend.
  • [ ] Load test your job with 5× data and record runtime and cost deltas; see the timing sketch below.
  • [ ] Practice a restore from a backup and measure RTO.

Stretch goals
  • Introduce canary deployments for a non-critical pipeline.
  • Add data quality checks that gate deployments (e.g., expectations tests).
  • Automate Terraform plan comments on pull requests.
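For the load-test drill above, a minimal timing harness is enough to record runtime deltas. A sketch, where run_etl is a placeholder for your job's entry point and is assumed to return a row count:

# timing.py (sketch; run_etl is a placeholder for your job's entry point)
import json
import time

def timed_run(run_etl, label: str) -> dict:
    """Run the job once and emit a structured timing record."""
    start = time.perf_counter()
    rows = run_etl()  # assumed to return the number of rows processed
    elapsed = time.perf_counter() - start
    record = {"label": label, "rows": rows, "seconds": round(elapsed, 2)}
    print(json.dumps(record))
    return record

# Example: call timed_run once on the normal dataset and once on the 5x sample, then compare.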

Common mistakes and debugging tips

  • Hardcoding secrets: move them to a secret manager; rotation becomes far easier.
  • No version pinning: create a lockfile; rebuild and compare hashes.
  • One environment for all: split dev/stage/prod; use separate state and accounts/projects.
  • Ignoring costs: set quotas and alerts; label resources by owner and environment.
  • Oversized containers: slim base images, multi-stage builds, clean caches.
  • Unclear failure signals: add structured logs, metrics, and alerts; define SLOs (see the logging sketch below).
  • Terraform drift: run scheduled terraform plan and reconcile changes via PRs.

Quick debugging checklist
  • Pipeline fails on start: check image tag, secrets, and permission errors.
  • Time-outs: increase resources or parallelism; profile I/O vs CPU bottlenecks.
  • IAM denied: inspect exact action and resource ARN in error logs.
  • CI flaky: pin tool versions; cache dependencies; run tests in containers.
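For the "unclear failure signals" tip, emitting logs as single-line JSON makes failures searchable in whatever log tooling the platform provides. A minimal sketch using only the standard library; the field names are a suggestion, not a standard:

# logging_setup.py (sketch; field names are a suggestion)
import json
import logging
import sys

class JsonFormatter(logging.Formatter):
    """Render each log record as one line of JSON."""
    def format(self, record: logging.LogRecord) -> str:
        payload = {
            "time": self.formatTime(record, "%Y-%m-%dT%H:%M:%S%z"),
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
        }
        if record.exc_info:
            payload["exception"] = self.formatException(record.exc_info)
        return json.dumps(payload)

def get_logger(name: str = "etl") -> logging.Logger:
    logger = logging.getLogger(name)
    if not logger.handlers:  # avoid duplicate handlers on repeated calls
        handler = logging.StreamHandler(sys.stdout)
        handler.setFormatter(JsonFormatter())
        logger.addHandler(handler)
        logger.setLevel(logging.INFO)
    return logger

# Usage: get_logger().info("wrote 12000 rows to curated/events/")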

Mini project: Productionize a daily batch pipeline

Goal: Ingest a CSV from object storage, transform with pandas, and write partitioned Parquet to curated storage. Provision infra with Terraform, containerize the job, deploy with CI/CD, and add cost controls.
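A minimal sketch of the core transform (step 2 below), assuming s3fs and pyarrow are installed, the bucket names follow the earlier Terraform example, and the CSV has an event_id column:

# main.py (sketch; bucket names, prefixes, and the event_id column are assumptions)
import os

import pandas as pd

RAW = os.environ.get("RAW_BUCKET", "acme-data-raw-dev")
CURATED = os.environ.get("CURATED_BUCKET", "acme-data-curated-dev")

def run(run_date: str) -> int:
    """Load one day's CSV, clean it, write partitioned Parquet, and return the row count."""
    df = pd.read_csv(f"s3://{RAW}/landing/{run_date}/events.csv")  # s3:// paths work via s3fs
    df.columns = [c.strip().lower().replace(" ", "_") for c in df.columns]
    df = df.dropna(subset=["event_id"]).drop_duplicates("event_id")
    df["event_date"] = run_date
    df.to_parquet(
        f"s3://{CURATED}/curated/events/",
        partition_cols=["event_date"],
        index=False,
    )
    print(f"wrote {len(df)} rows for {run_date}")
    return len(df)

if __name__ == "__main__":
    run(os.environ.get("RUN_DATE", "2026-01-08"))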

  1. Infra: Terraform a raw and curated bucket, an execution role/user, and remote state.
  2. Code: Write main.py that loads CSV, cleans columns, writes Parquet, and logs metrics.
  3. Container: Create a slim Dockerfile; pin dependencies with a lockfile.
  4. Config: Store non-secret config in YAML; secrets via environment variables.
  5. CI: Lint/test, build image, push to registry; on tag, apply Terraform to stage.
  6. Scheduler: Create a Kubernetes CronJob or Airflow DAG to run daily.
  7. Costs: Add budget alert and a quota on compute; set requests/limits.
  8. Recovery: Enable versioning on buckets; document RPO/RTO; test restore.
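For step 8, a hedged sketch of enabling bucket versioning and restoring the previous version of an object with boto3; bucket and key names are placeholders, and a real runbook would also cover cross-region copies and RPO/RTO measurement:

# restore.py (sketch; bucket and key names are placeholders)
import boto3

s3 = boto3.client("s3")

def enable_versioning(bucket: str) -> None:
    """Turn on object versioning so overwrites and deletes are recoverable."""
    s3.put_bucket_versioning(
        Bucket=bucket,
        VersioningConfiguration={"Status": "Enabled"},
    )

def restore_previous_version(bucket: str, key: str) -> None:
    """Copy the second-newest version of `key` back over the current one."""
    versions = s3.list_object_versions(Bucket=bucket, Prefix=key).get("Versions", [])
    same_key = sorted(
        (v for v in versions if v["Key"] == key),
        key=lambda v: v["LastModified"],
        reverse=True,
    )
    if len(same_key) < 2:
        raise RuntimeError(f"No earlier version of {key} to restore")
    s3.copy_object(
        Bucket=bucket,
        Key=key,
        CopySource={"Bucket": bucket, "Key": key, "VersionId": same_key[1]["VersionId"]},
    )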

Acceptance criteria
  • Single command (or CI) deploys infra and schedules the job.
  • Job runs in dev and stage with different configs using the same image.
  • No secrets in Git; passing unit tests; image size under 300MB.
  • Budget alert configured; documented recovery steps.

Subskills

  • Infrastructure As Code Basics: Define, version, and review infra changes; plan before apply; manage remote state.
  • Containerization Basics: Build small, secure images; use env vars; optimize layers and caching.
  • CI CD For Data Pipelines: Automate tests, builds, and deployments; gate releases with checks.
  • Environment Configuration Management: Separate dev/stage/prod; externalize config; manage secrets safely.
  • Dependency Management: Pin versions, use lockfiles and virtual environments; avoid system Python.
  • Cost Monitoring And Quotas: Track spend, set budgets and alerts; apply quotas and resource limits.
  • Scaling And Resource Planning: Right-size CPU/memory; parallelize; test performance vs cost.
  • Disaster Recovery Basics: Define RPO/RTO; backups and restores; runbooks and drills.

Next steps

  • Complete the drills and the mini project to build confidence.
  • Take the skill exam below to validate your understanding. Anyone can take it; saved progress is available for logged-in users.
  • Continue with platform-specific deep dives (e.g., Airflow on K8s, Spark autoscaling) once basics are solid.

Infrastructure And DevOps Basics — Skill Exam

This exam checks practical understanding of Infrastructure and DevOps basics for Data Engineers. It is scenario-focused and covers IaC, containers, CI/CD, environments, dependencies, cost controls, scaling, and recovery. Anyone can take the exam for free; only logged-in users will have their progress and results saved. Scoring: 70% to pass. You can retake anytime.

12 questions · 70% to pass
