
Scripting For Automation

Learn Scripting For Automation for free with explanations, exercises, and a quick test (for ETL Developers).

Published: January 11, 2026 | Updated: January 11, 2026

Why this matters

As an ETL Developer, scripting glues your tools together, makes jobs reliable, and unblocks you when GUIs fall short. You will use scripts to:

  • Parameterize and run extracts, loads, and validations across environments.
  • Add retries, alerts, and logs around fragile steps (e.g., flaky networks).
  • Backfill historical data and reprocess failed partitions safely.
  • Prepare runtime environments (folders, permissions, temp cleanups).
  • Automate local reproductions of pipeline issues for faster debugging.
Example daily tasks you might automate
  • Pull yesterday’s data from a source, land it with date-based folders, and validate counts.
  • Convert CSVs to Parquet and push to a data lake with atomic moves.
  • Load a staging table, merge into a target, and send a summary to logs.

Concept explained simply

Scripting for automation means writing small, repeatable programs (Bash, Python, PowerShell, etc.) that run ETL steps the same way every time, without human clicks. Good scripts are safe, observable, and configurable.

Mental model

Think of an automated job as a reliable factory machine:

  • Inputs: parameters, files, tables, secrets from environment variables.
  • Process: deterministic commands with guardrails (retries, checks).
  • Outputs: data artifacts, logs, and clear exit codes.

Your script wraps each ETL step with safety rails so that running it today or tomorrow produces the same results (idempotency) and leaves a clear trace (observability).

Core patterns you will use

  • Idempotency: running twice yields the same final state. Use staging tables, upserts, atomic moves, and partitioned outputs.
  • Observability: timestamped logs, row counts, checksums, and explicit exit codes.
  • Resilience: retries with backoff for transient failures; fail fast on non-retryable errors.
  • Configuration: environment variables and parameters; no hardcoded secrets.
  • Safety: set strict modes (e.g., Bash set -euo pipefail), validate inputs, and clean up temp files.
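Of these, idempotency is the one that most often needs a deliberate design decision, so here is a minimal sketch of the staging-plus-upsert pattern using SQLite from Python's standard library as a stand-in for a real warehouse; the customers table and sample rows are illustrative assumptions, not part of any specific pipeline.

import sqlite3

# Illustrative sample rows; in a real job these come from the extract step.
rows = [(1, "Ada", "ada@example.com"), (2, "Grace", "grace@example.com")]

con = sqlite3.connect("etl_demo.db")
con.execute("CREATE TABLE IF NOT EXISTS customers (id INTEGER PRIMARY KEY, name TEXT, email TEXT)")
con.execute("CREATE TEMP TABLE staging (id INTEGER PRIMARY KEY, name TEXT, email TEXT)")

# 1) Land everything in staging first; a failed run never touches the target.
con.executemany("INSERT INTO staging VALUES (?, ?, ?)", rows)

# 2) Upsert from staging into the target; re-running yields the same final state.
#    (SQLite needs "WHERE true" to parse INSERT ... SELECT ... ON CONFLICT.)
con.execute("""
    INSERT INTO customers (id, name, email)
    SELECT id, name, email FROM staging WHERE true
    ON CONFLICT(id) DO UPDATE SET name = excluded.name, email = excluded.email
""")
con.commit()
con.close()

The same shape carries over to Postgres or any warehouse with MERGE or ON CONFLICT support: load staging, then apply one set-based upsert.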

Worked examples

Example 1 — Bash: Extract from Postgres to dated CSV with retries and logs

#!/usr/bin/env bash
set -euo pipefail

log() { echo "$(date -u +%FT%TZ) | $*"; }

retry() {
  # retry <max_attempts> <base_delay_seconds> <command...> with exponential backoff
  local max=${1:-5} delay=${2:-2} n=0
  shift 2
  until "$@"; do
    n=$((n + 1))
    if [[ $n -ge $max ]]; then return 1; fi
    sleep $((delay * 2 ** (n - 1)))
  done
}

: "${PGHOST:?} ${PGDATABASE:?} ${PGUSER:?} ${PGPASSWORD:?}"  # supplied via environment
OUT_DIR="./landing/$(date -u +%F)"; mkdir -p "$OUT_DIR"; OUT_FILE="$OUT_DIR/customers.csv"

QUERY="COPY (SELECT id, name, email FROM public.customers WHERE updated_at::date = CURRENT_DATE - INTERVAL '1 day') TO STDOUT WITH CSV HEADER"

log "Starting extract"
retry 5 2 bash -c "psql -h \"$PGHOST\" -U \"$PGUSER\" -d \"$PGDATABASE\" -c \"$QUERY\" > \"$OUT_FILE\""
log "Wrote $(wc -l < "$OUT_FILE") lines to $OUT_FILE"
How it works
  • retry adds exponential backoff for transient DB hiccups.
  • Environment variables carry credentials; nothing is hardcoded.
  • Timestamped folders make runs auditable and idempotent per partition (date).

Example 2 — Python: Convert JSON files to partitioned Parquet idempotently

import json, os, hashlib
from datetime import datetime

import pandas as pd

SRC = "./raw_json"
DST = f"./parquet/date={datetime.utcnow().date()}"
os.makedirs(DST, exist_ok=True)

def file_hash(path):
    h = hashlib.sha256()
    with open(path, 'rb') as f:
        h.update(f.read())
    return h.hexdigest()

for fn in os.listdir(SRC):
    if not fn.endswith('.json'): 
        continue
    p = os.path.join(SRC, fn)
    h = file_hash(p)
    out = os.path.join(DST, f"{h}.parquet")
    if os.path.exists(out):
        print(f"Skip {fn}: already converted")
        continue
    with open(p) as f:
        data = json.load(f)
    df = pd.json_normalize(data)
    df.to_parquet(out, index=False)
    print(f"Wrote {out} rows={len(df)}")
How it works
  • Hash-based filenames prevent duplicate conversions.
  • Partitioning by date allows safe re-runs for specific days.
  • Observable prints help trace what happened.
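
One gap worth noting: the example writes each Parquet file directly to its final name, so a crash mid-write could leave a partial file that later runs skip. Below is a minimal sketch of the atomic-move pattern mentioned earlier, assuming pandas plus a Parquet engine such as pyarrow; write_parquet_atomically is a hypothetical helper, not a pandas API.

import os
import tempfile

import pandas as pd

def write_parquet_atomically(df: pd.DataFrame, final_path: str) -> None:
    # Write to a temp file in the same directory, then rename into place.
    # os.replace is atomic on the same filesystem, so re-runs never see
    # (or skip over) a half-written Parquet file.
    fd, tmp_path = tempfile.mkstemp(dir=os.path.dirname(final_path), suffix=".tmp")
    os.close(fd)
    try:
        df.to_parquet(tmp_path, index=False)
        os.replace(tmp_path, final_path)
    except Exception:
        if os.path.exists(tmp_path):
            os.remove(tmp_path)  # clean up the partial temp file
        raise

In Example 2 you would call write_parquet_atomically(df, out) instead of df.to_parquet(out, index=False).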

Example 3 — PowerShell: Reliable file move with checksum and archive

# Requires PowerShell 5+
$src = "C:\\drop"
$dst = "C:\\landing\\$(Get-Date -Format yyyy-MM-dd)"
$arc = "C:\\archive"
New-Item -ItemType Directory -Force -Path $dst | Out-Null
New-Item -ItemType Directory -Force -Path $arc | Out-Null

Get-ChildItem -Path $src -Filter *.csv | ForEach-Object {
  $hash = (Get-FileHash $_.FullName -Algorithm SHA256).Hash
  $out = Join-Path $dst ($hash + ".csv")
  if (Test-Path $out) {
    Write-Host "Skip $($_.Name): duplicate"
  } else {
    Copy-Item $_.FullName $out -Force
    Write-Host "Copied to $out"
  }
  Move-Item $_.FullName (Join-Path $arc $_.Name) -Force
}
How it works
  • SHA-256 prevents duplicate loads.
  • Dated landing path enables partitioned processing.
  • Archive keeps source area clean for the next run.

Exercises

Do these locally. The same instructions appear in the Practice Exercises section below for copy/paste.

  1. Bash: Reliable CSV-to-PostgreSQL load
    Goal: Load a CSV with columns id,name,email into staging, then upsert into target.
    Requirements: strict mode, logging, upsert by id, clean staging on success, idempotent (re-run safely).
  2. Python: Idempotent folder ingest to SQLite
    Goal: Detect new CSVs in ./incoming, insert rows into a SQLite table, skip already ingested files by checksum. Produce a run summary.
Checklist before you automate
  • Script fails fast on errors (strict modes / exceptions)
  • All paths are parameterized (no hardcoded absolute paths)
  • Secrets come from environment variables or a managed secret store
  • Idempotency plan (staging + upsert, atomic moves, or checksums)
  • Retries with backoff for transient steps
  • Clear logs and exit codes
  • Dry-run flag to preview actions
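
The dry-run item is easy to postpone and most valuable during incidents. Here is a minimal sketch with argparse; the --src/--dst defaults and the copy action are illustrative, not tied to any specific pipeline.

import argparse
import pathlib
import shutil

parser = argparse.ArgumentParser(description="Land CSV files into a folder")
parser.add_argument("--src", default="./drop")
parser.add_argument("--dst", default="./landing")
parser.add_argument("--dry-run", action="store_true", help="print planned actions without copying")
args = parser.parse_args()

for path in sorted(pathlib.Path(args.src).glob("*.csv")):
    target = pathlib.Path(args.dst) / path.name
    if args.dry_run:
        print(f"[dry-run] would copy {path} -> {target}")
    else:
        target.parent.mkdir(parents=True, exist_ok=True)
        shutil.copy2(path, target)
        print(f"copied {path} -> {target}")

Run it once with --dry-run to confirm the plan, then again without the flag to execute it.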

Common mistakes and self-check

  • Hardcoding secrets: Never store passwords in scripts. Use environment variables. Self-check: search for patterns like "password=".
  • No strict mode: Missing set -euo pipefail or try/except leads to silent failures. Self-check: intentionally break an input and confirm the script fails quickly.
  • Non-idempotent writes: Overwriting targets mid-run or partial copies. Self-check: re-run and verify no duplicates or corruption.
  • Missing logs: Hard to debug in prod. Self-check: can you answer what ran, when, and how many rows?
  • Ignoring exit codes: Orchestrators rely on them. Self-check: echo $? (or equivalent) after your script.
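
To make the strict-mode and exit-code self-checks concrete, here is a minimal Python sketch; job_step is a hypothetical placeholder for whatever your script actually does.

import logging
import sys

logging.basicConfig(level=logging.INFO, format="%(asctime)s | %(levelname)s | %(message)s")

def job_step() -> None:
    # Placeholder for real work; raise on any unrecoverable problem.
    raise RuntimeError("source file missing")

if __name__ == "__main__":
    try:
        job_step()
    except Exception as exc:
        logging.error("Step failed: %s", exc)
        sys.exit(1)  # orchestrators and shells see a nonzero exit code
    logging.info("Step succeeded")

After running it, check the status from your shell (echo $? in Bash, $LASTEXITCODE in PowerShell) and confirm it is nonzero on failure.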

Who this is for

  • Aspiring and current ETL Developers who need reliable automation around data movement.
  • Data Engineers who want portable, scriptable glue between tools.
  • Analysts automating repeatable data prep tasks.

Prerequisites

  • Basic command line comfort (navigating folders, running commands).
  • Beginner knowledge of SQL.
  • One scripting language (Bash, Python, or PowerShell) at a basic level.

Practical projects

  • Daily CSV landing: From a shared folder, land files partitioned by date, dedupe, and log counts.
  • Staging and merge: Create a staging table for a dimension, load a CSV, and upsert to the target. Include a data-quality row count check.
  • Backfill tool: Given a date range, loop through partitions and reprocess with retries and resumable markers.

Learning path

  • Start with safe scripting patterns (strict mode, exit codes, logs).
  • Learn idempotent data movement (staging/merge, partitioning, atomic moves).
  • Add observability (structured logs, row counts, checksums).
  • Wrap with retries/backoff and parameterization.
  • Integrate scripts into your orchestrator as tasks.

Quick Test

Take the quick test to check your understanding. Anyone can take it; only logged-in users will have their progress saved.

How to use this test effectively
  • Score 70% or higher to proceed.
  • If you miss a question, review the related section and re-run the exercise.
  • Focus on idempotency, observability, and safe configuration.

Mini challenge

Your orchestrator is temporarily down, but you must run the daily customer pipeline. Write a short plan and pseudocode to:

  • Extract yesterday’s customers to a dated folder.
  • Convert to a durable format (e.g., Parquet) with dedupe.
  • Upsert into your warehouse, log counts, and ensure reruns cause no duplicates.
Acceptance criteria
  • Idempotency strategy is explicit.
  • Secrets are not hardcoded.
  • Clear logs and exit codes included.

Practice Exercises

2 exercises to complete

Instructions

Create a Bash script load_customers.sh that:

  • Enables strict mode (set -euo pipefail) and logs with timestamps.
  • Takes two parameters: input CSV path and target table name (e.g., public.customers).
  • Creates a temporary staging table with columns id INT PRIMARY KEY, name TEXT, email TEXT.
  • Uses COPY to load the CSV into staging (assume header present).
  • Upserts into target by id (insert new, update changed name/email).
  • Drops staging on success; prints how many rows inserted/updated.
  • Reads PGHOST, PGDATABASE, PGUSER, PGPASSWORD from environment variables (do not hardcode).

Note: If the target table does not exist, create it with a matching schema and a PRIMARY KEY on id.

Expected Output
Example log lines:
2026-01-11T00:00:00Z | Loaded 1,234 rows into staging
2026-01-11T00:00:00Z | 1,000 inserted, 100 updated, 134 unchanged
Exit code 0

Scripting For Automation — Quick Test

Test your knowledge with 8 questions. Pass with 70% or higher.
