Why this matters
As an ETL Developer, scripting glues your tools together, makes jobs reliable, and unblocks you when GUIs fall short. You will use scripts to:
- Parameterize and run extracts, loads, and validations across environments.
- Add retries, alerts, and logs around fragile steps (e.g., flaky networks).
- Backfill historical data and reprocess failed partitions safely.
- Prepare runtime environments (folders, permissions, temp cleanups).
- Automate local reproductions of pipeline issues for faster debugging.
Example daily tasks you might automate
- Pull yesterday’s data from a source, land it with date-based folders, and validate counts.
- Convert CSVs to Parquet and push to a data lake with atomic moves.
- Load a staging table, merge into a target, and send a summary to logs.
Concept explained simply
Scripting for automation means writing small, repeatable programs (Bash, Python, PowerShell, etc.) that run ETL steps the same way every time, without human clicks. Good scripts are safe, observable, and configurable.
Mental model
Think of an automated job as a reliable factory machine:
- Inputs: parameters, files, tables, secrets from environment variables.
- Process: deterministic commands with guardrails (retries, checks).
- Outputs: data artifacts, logs, and clear exit codes.
Your script wraps each ETL step with safety rails so that running it today or tomorrow produces the same results (idempotency) and leaves a clear trace (observability).
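To make this concrete, here is a minimal, hypothetical job skeleton in Python. The SRC_PATH variable and the line-counting step are stand-ins, not part of any real pipeline; the point is the shape: inputs from the environment, a guarded process, and outputs with a clear exit code.

import logging
import os
import sys

logging.basicConfig(level=logging.INFO, format="%(asctime)s | %(levelname)s | %(message)s")

def main() -> int:
    # Inputs: parameters and secrets come from the environment, not from the script body.
    src_path = os.environ.get("SRC_PATH")  # hypothetical input parameter
    if not src_path:
        logging.error("SRC_PATH is not set")
        return 2  # distinct exit code for configuration problems

    # Process: a deterministic step wrapped in guardrails.
    try:
        with open(src_path, encoding="utf-8") as f:
            row_count = sum(1 for _ in f)  # stand-in for the real extract/transform/load step
    except OSError as exc:
        logging.error("Step failed: %s", exc)
        return 1  # non-zero so an orchestrator sees the failure

    # Outputs: artifacts plus a clear trace of what happened.
    logging.info("Processed %s rows=%d", src_path, row_count)
    return 0

if __name__ == "__main__":
    sys.exit(main())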
Core patterns you will use
- Idempotency: running twice yields the same final state. Use staging tables, upserts, atomic moves, and partitioned outputs (a small staging-plus-upsert sketch follows this list).
- Observability: timestamped logs, row counts, checksums, and explicit exit codes.
- Resilience: retries with backoff for transient failures; fail fast on non-retryable errors.
- Configuration: environment variables and parameters; no hardcoded secrets.
- Safety: set strict modes (e.g., Bash set -euo pipefail), validate inputs, and clean up temp files.
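To illustrate the staging-plus-upsert flavor of idempotency, here is a small sketch using SQLite so it runs anywhere. The customers table and sample rows are invented for the example, and ON CONFLICT ... DO UPDATE needs SQLite 3.24 or newer.

import sqlite3

rows = [(1, "Ada", "ada@example.com"), (2, "Grace", "grace@example.com")]  # pretend these came from an extract

con = sqlite3.connect("demo.db")
con.execute("CREATE TABLE IF NOT EXISTS customers (id INTEGER PRIMARY KEY, name TEXT, email TEXT)")
con.execute("CREATE TEMP TABLE stg_customers (id INTEGER PRIMARY KEY, name TEXT, email TEXT)")

# 1. Load into staging; a failed run leaves the target untouched.
con.executemany("INSERT INTO stg_customers (id, name, email) VALUES (?, ?, ?)", rows)

# 2. Upsert from staging into the target; re-running yields the same final state.
con.execute("""
    INSERT INTO customers (id, name, email)
    SELECT id, name, email FROM stg_customers
    ON CONFLICT(id) DO UPDATE SET name = excluded.name, email = excluded.email
""")
con.commit()

# 3. Observability: report what the run left behind.
count = con.execute("SELECT COUNT(*) FROM customers").fetchone()[0]
print(f"customers now holds {count} rows")
con.close()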
Worked examples
Example 1 — Bash: Extract from Postgres to dated CSV with retries and logs
#!/usr/bin/env bash
set -euo pipefail
log() { echo "$(date -u +%F'T'%T'Z') | $*"; }
retry() {
  local max=${1:-5} delay=${2:-2} n=0
  shift 2
  until "$@"; do
    n=$((n + 1))
    if [[ $n -ge $max ]]; then return 1; fi
    sleep $((delay * 2 ** (n - 1)))  # back off: delay, 2*delay, 4*delay, ...
  done
}
: "${PGHOST:?} ${PGDATABASE:?} ${PGUSER:?} ${PGPASSWORD:?}" # supplied via environment
OUT_DIR="./landing/$(date -u +%F)"; mkdir -p "$OUT_DIR"; OUT_FILE="$OUT_DIR/customers.csv"
QUERY="COPY (SELECT id, name, email FROM public.customers WHERE updated_at::date = CURRENT_DATE - INTERVAL '1 day') TO STDOUT WITH CSV HEADER"
extract() { psql -h "$PGHOST" -U "$PGUSER" -d "$PGDATABASE" -c "$QUERY" > "$OUT_FILE"; }
log "Starting extract"
retry 5 2 extract
log "Wrote $(wc -l < "$OUT_FILE") lines to $OUT_FILE"
How it works
- retry adds exponential backoff for transient DB hiccups.
- Environment variables carry credentials; nothing is hardcoded.
- Timestamped folders make runs auditable and idempotent per partition (date).
Example 2 — Python: Convert JSON files to partitioned Parquet idempotently
import hashlib
import json
import os
from datetime import datetime

import pandas as pd  # to_parquet needs a Parquet engine (pyarrow or fastparquet) installed

SRC = "./raw_json"
DST = f"./parquet/date={datetime.utcnow().date()}"
os.makedirs(DST, exist_ok=True)

def file_hash(path):
    h = hashlib.sha256()
    with open(path, 'rb') as f:
        h.update(f.read())
    return h.hexdigest()

for fn in os.listdir(SRC):
    if not fn.endswith('.json'):
        continue
    p = os.path.join(SRC, fn)
    h = file_hash(p)
    out = os.path.join(DST, f"{h}.parquet")
    if os.path.exists(out):
        print(f"Skip {fn}: already converted")
        continue
    with open(p) as f:
        data = json.load(f)
    df = pd.json_normalize(data)
    df.to_parquet(out, index=False)
    print(f"Wrote {out} rows={len(df)}")
How it works
- Hash-based filenames prevent duplicate conversions.
- Partitioning by date allows safe re-runs for specific days.
- Observable prints help trace what happened.
Example 3 — PowerShell: Reliable file move with checksum and archive
# Requires PowerShell 5+
$src = "C:\drop"
$dst = "C:\landing\$(Get-Date -Format yyyy-MM-dd)"
$arc = "C:\archive"
New-Item -ItemType Directory -Force -Path $dst | Out-Null
New-Item -ItemType Directory -Force -Path $arc | Out-Null
Get-ChildItem -Path $src -Filter *.csv | ForEach-Object {
    $hash = (Get-FileHash $_.FullName -Algorithm SHA256).Hash
    $out = Join-Path $dst ($hash + ".csv")
    if (Test-Path $out) {
        Write-Host "Skip $($_.Name): duplicate"
    } else {
        Copy-Item $_.FullName $out -Force
        Write-Host "Copied to $out"
    }
    Move-Item $_.FullName (Join-Path $arc $_.Name) -Force
}
How it works
- SHA-256 prevents duplicate loads.
- Dated landing path enables partitioned processing.
- Archive keeps source area clean for the next run.
Exercises
Do these locally.
- Bash: Reliable CSV-to-PostgreSQL load
  Goal: Load a CSV with columns id,name,email into staging, then upsert into the target.
  Requirements: strict mode, logging, upsert by id, clean staging on success, idempotent (safe to re-run).
- Python: Idempotent folder ingest to SQLite
  Goal: Detect new CSVs in ./incoming, insert rows into a SQLite table, and skip files already ingested by checksum. Produce a run summary.
Checklist before you automate
- Script fails fast on errors (strict modes / exceptions)
- All paths are parameterized (no hardcoded absolute paths)
- Secrets come from environment variables or a managed secret store
- Idempotency plan (staging + upsert, atomic moves, or checksums)
- Retries with backoff for transient steps
- Clear logs and exit codes
- Dry-run flag to preview actions
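For the dry-run item, one common approach is to compute the full plan and only perform destructive actions when the flag is absent. This is a sketch; the flag name, folder defaults, and the move step are only illustrative.

import argparse
import shutil
from pathlib import Path

parser = argparse.ArgumentParser(description="Example file mover with a dry-run preview")
parser.add_argument("--src", default="./incoming")
parser.add_argument("--dst", default="./landing")
parser.add_argument("--dry-run", action="store_true", help="print planned actions without touching files")
args = parser.parse_args()

if not args.dry_run:
    Path(args.dst).mkdir(parents=True, exist_ok=True)

for path in sorted(Path(args.src).glob("*.csv")):
    target = Path(args.dst) / path.name
    if args.dry_run:
        print(f"[dry-run] would move {path} -> {target}")
    else:
        shutil.move(str(path), str(target))
        print(f"moved {path} -> {target}")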
Common mistakes and self-check
- Hardcoding secrets: Never store passwords in scripts. Use environment variables. Self-check: search for patterns like "password=".
- No strict mode: Missing set -euo pipefail or try/except leads to silent failures. Self-check: intentionally break an input and confirm the script fails quickly.
- Non-idempotent writes: Overwriting targets mid-run or leaving partial copies behind (see the write-then-rename sketch after this list). Self-check: re-run and verify no duplicates or corruption.
- Missing logs: Hard to debug in prod. Self-check: can you answer what ran, when, and how many rows?
- Ignoring exit codes: Orchestrators rely on them. Self-check: echo $? (or equivalent) after your script.
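To avoid the partial-copy failure mode, a widely used trick (sketched here; the helper name is made up) is to write to a temporary file in the same directory and rename it into place only after the write succeeds, because a rename within one filesystem is atomic.

import os
import tempfile

def write_atomically(data: bytes, final_path: str) -> None:
    # Write to a temp file in the same directory so the final rename stays on one filesystem.
    directory = os.path.dirname(final_path) or "."
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    try:
        with os.fdopen(fd, "wb") as f:
            f.write(data)
            f.flush()
            os.fsync(f.fileno())  # ensure the bytes are on disk before the rename
        os.replace(tmp_path, final_path)  # atomic swap; readers never see a half-written file
    except BaseException:
        os.unlink(tmp_path)  # clean up the temp file if anything went wrong
        raise

os.makedirs("./landing", exist_ok=True)
write_atomically(b"id,name\n1,Ada\n", "./landing/customers.csv")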
Who this is for
- Aspiring and current ETL Developers who need reliable automation around data movement.
- Data Engineers who want portable, scriptable glue between tools.
- Analysts automating repeatable data prep tasks.
Prerequisites
- Basic command line comfort (navigating folders, running commands).
- Beginner knowledge of SQL.
- One scripting language (Bash, Python, or PowerShell) at a basic level.
Practical projects
- Daily CSV landing: From a shared folder, land files partitioned by date, dedupe, and log counts.
- Staging and merge: Create a staging table for a dimension, load a CSV, and upsert to the target. Include a data-quality row count check.
- Backfill tool: Given a date range, loop through partitions and reprocess with retries and resumable markers.
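For the backfill project, the resumable markers can be as simple as an empty _SUCCESS file per partition. The sketch below assumes that convention, and process_partition is a placeholder for your real reprocessing step.

import time
from datetime import date, timedelta
from pathlib import Path

def process_partition(day: date) -> None:
    # Placeholder: extract, transform, and load one day's partition here.
    print(f"reprocessing partition {day}")

def backfill(start: date, end: date, marker_root: str = "./markers", max_attempts: int = 3) -> None:
    day = start
    while day <= end:
        marker = Path(marker_root) / f"date={day}" / "_SUCCESS"
        if marker.exists():
            print(f"skip {day}: marker found")  # resumable: finished partitions are skipped on re-run
        else:
            for attempt in range(1, max_attempts + 1):
                try:
                    process_partition(day)
                    marker.parent.mkdir(parents=True, exist_ok=True)
                    marker.touch()  # write the marker only after the partition completed
                    break
                except Exception as exc:
                    print(f"{day} attempt {attempt} failed: {exc}")
                    if attempt == max_attempts:
                        raise  # surface the failure after exhausting retries
                    time.sleep(2 ** attempt)  # exponential backoff between attempts
        day += timedelta(days=1)

backfill(date(2024, 1, 1), date(2024, 1, 7))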
Learning path
- Start with safe scripting patterns (strict mode, exit codes, logs).
- Learn idempotent data movement (staging/merge, partitioning, atomic moves).
- Add observability (structured logs, row counts, checksums).
- Wrap with retries/backoff and parameterization.
- Integrate scripts into your orchestrator as tasks.
Quick Test
Take the quick test to check your understanding.
How to use this test effectively
- Score 70% or higher to proceed.
- If you miss a question, review the related section and re-run the exercise.
- Focus on idempotency, observability, and safe configuration.
Mini challenge
Your orchestrator is temporarily down, but you must run the daily customer pipeline. Write a short plan and pseudocode to:
- Extract yesterday’s customers to a dated folder.
- Convert to a durable format (e.g., Parquet) with dedupe.
- Upsert into your warehouse, log counts, and ensure reruns cause no duplicates.
Acceptance criteria
- Idempotency strategy is explicit.
- Secrets are not hardcoded.
- Clear logs and exit codes included.