Who this is for
You build and support data platforms and want fast, reliable local feedback loops. If you touch ingestion, transformation, orchestration, or platform tooling, this lesson is for you.
Prerequisites
- Basic command-line familiarity
- Docker installed (or ready to use an alternative like Podman)
- Beginner knowledge of SQL and Python (helpful, not mandatory)
Why this matters
In real data platform work, you will:
- Reproduce pipeline issues locally without burning cloud credits
- Validate schema/contract changes before they break downstream teams
- Run transformations and tests quickly on your laptop
- Mock cloud services (S3, Kinesis, IAM) to test integrations
- Ship platform templates and dev containers so everyone can get productive in minutes
Concept explained simply
Local dev and testing tooling is a set of small, reliable tools and patterns that let you run a mini data platform on your laptop. You spin up lightweight versions of storage, compute, orchestration, and testing so you can iterate fast and confidently.
Mental model
Think in layers:
- Environment: containers/dev containers, reproducible shells
- Data services: local S3, warehouse/DB, message queue
- Transform + Orchestration: dbt/Spark + Airflow/Prefect locally
- Testing: unit, data quality, contract, integration
- Developer UX: make/tasks, pre-commit, seed fixtures, example datasets
What “good” looks like
- One command to start the stack
- One command to run tests
- Sample data included; easy to reset
- Docs in README; environment parity with CI
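In practice, the first two items usually collapse into a Makefile or Taskfile. A minimal sketch (target names and exact commands are assumptions that mirror the demo stack used later in this lesson; recipes must be indented with tabs):

# Makefile — example targets; adapt the commands to your own stack
up:        ## start the local containers
	docker compose up -d

seed:      ## load sample data (assumes data/orders.csv and an existing orders_raw table)
	psql postgresql://demo:demo@localhost:5432/warehouse \
		-c "\copy orders_raw FROM 'data/orders.csv' WITH (FORMAT csv, HEADER true);"

test:      ## run the test suite
	pytest -q

down:      ## stop containers and remove volumes
	docker compose down -v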
Core components of a local data sandbox
- Container runtime: Docker or Podman
- Object storage: MinIO or LocalStack (S3 API)
- Warehouse/DB: Postgres, DuckDB, or SQLite
- Compute: Spark local or DuckDB; optional Flink
- Orchestration: Airflow or Prefect in local mode
- Testing: pytest, dbt tests, Great Expectations, data contract checks
- Dev productivity: Makefile/Taskfile, pre-commit, .env files, seed datasets
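For the .env piece, a local-only example (the values simply mirror the demo credentials in the compose file below; never reuse them outside your laptop):

# .env — local-only settings matching the docker-compose example
POSTGRES_USER=demo
POSTGRES_PASSWORD=demo
POSTGRES_DB=warehouse
DSN=postgresql://demo:demo@localhost:5432/warehouse
MINIO_ROOT_USER=minio
MINIO_ROOT_PASSWORD=minio123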
Minimal setup steps (recommended)
Example docker-compose.yml (trimmed)
version: '3.9'
services:
  postgres:
    image: postgres:15
    environment:
      POSTGRES_USER: demo
      POSTGRES_PASSWORD: demo
      POSTGRES_DB: warehouse
    ports:
      - "5432:5432"
    healthcheck:
      test: ["CMD-SHELL", "pg_isready -U demo -d warehouse"]
      interval: 5s
      timeout: 3s
      retries: 10
  minio:
    image: quay.io/minio/minio:latest
    command: server /data --console-address ":9001"
    environment:
      MINIO_ROOT_USER: minio
      MINIO_ROOT_PASSWORD: minio123
    ports:
      - "9000:9000"
      - "9001:9001"
    volumes:
      - ./local_artifacts/minio:/data
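Once the stack is up, you can exercise the S3 API against MinIO with the regular AWS CLI by overriding the endpoint. A sketch (the bucket name raw and the data/orders.csv path are just examples; credentials match the compose file above):

# point the AWS CLI at local MinIO instead of AWS
export AWS_ACCESS_KEY_ID=minio
export AWS_SECRET_ACCESS_KEY=minio123
export AWS_DEFAULT_REGION=us-east-1

aws --endpoint-url http://localhost:9000 s3 mb s3://raw
aws --endpoint-url http://localhost:9000 s3 cp data/orders.csv s3://raw/orders.csv
aws --endpoint-url http://localhost:9000 s3 ls s3://raw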
Worked examples
Example 1: Seed data and run a local transform
Walkthrough
- Start services:
  docker compose up -d
- Seed Postgres with a CSV:
  psql postgresql://demo:demo@localhost:5432/warehouse \
    -c "CREATE TABLE IF NOT EXISTS orders_raw (order_id int, user_id int, amount numeric, created_at timestamp);" \
    -c "\copy orders_raw FROM 'data/orders.csv' WITH (FORMAT csv, HEADER true);"
- Run a transform (example SQL):
  psql postgresql://demo:demo@localhost:5432/warehouse -c "
    CREATE TABLE IF NOT EXISTS orders_clean AS
    SELECT
      order_id,
      user_id,
      amount::numeric(12,2) AS amount,
      created_at::timestamp AS created_at
    FROM orders_raw
    WHERE amount IS NOT NULL;"
- Quick data check:
  psql postgresql://demo:demo@localhost:5432/warehouse -c "SELECT COUNT(*) FROM orders_clean;"
Example 2: Data quality tests with pytest + SQL
Walkthrough
Create tests/test_quality.py
import os

import psycopg

DSN = os.getenv("DSN", "postgresql://demo:demo@localhost:5432/warehouse")


def test_no_negative_amounts():
    with psycopg.connect(DSN) as conn:
        cur = conn.execute("SELECT COUNT(*) FROM orders_clean WHERE amount < 0")
        cnt = cur.fetchone()[0]
        assert cnt == 0, f"Found {cnt} negative amounts"


def test_recent_data_exists():
    with psycopg.connect(DSN) as conn:
        cur = conn.execute(
            "SELECT COUNT(*) FROM orders_clean WHERE created_at > now() - interval '90 days'"
        )
        cnt = cur.fetchone()[0]
        assert cnt > 0, "No recent data found"
Run: pytest -q
Example 3: Contract test for schema changes
Walkthrough
Define a simple schema contract using a SQL assertion file (contracts/orders_clean.sql):
-- Expect required columns and types
SELECT
  (SELECT COUNT(*)
   FROM information_schema.columns
   WHERE table_name = 'orders_clean'
     AND column_name IN ('order_id', 'user_id', 'amount', 'created_at')) = 4 AS has_columns;

-- Type spot-check (numeric)
SELECT pg_typeof(amount) = 'numeric'::regtype AS amount_is_numeric
FROM orders_clean
LIMIT 1;
Run checks:
psql postgresql://demo:demo@localhost:5432/warehouse -f contracts/orders_clean.sql
Interpretation: any row that returns false means the contract is broken.
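If you would rather enforce the contract in the same pytest run as the quality tests, a small wrapper can execute the SQL file and assert every returned boolean is true. A sketch (the file path and DSN default carry over from the examples above, and it assumes each statement in the contract returns only boolean columns):

import os
from pathlib import Path

import psycopg

DSN = os.getenv("DSN", "postgresql://demo:demo@localhost:5432/warehouse")


def test_orders_clean_contract():
    sql = Path("contracts/orders_clean.sql").read_text()
    with psycopg.connect(DSN) as conn:
        # Split the file into statements; each one should return booleans only
        for statement in filter(None, (s.strip() for s in sql.split(";"))):
            row = conn.execute(statement).fetchone()
            assert row is None or all(row), f"Contract violated by: {statement[:60]}..."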
Exercises (hands-on)
These mirror the exercises below. Run them now and mark your checklist.
Exercise 1: One-command local stack
Goal: Start Postgres + MinIO with Docker Compose, load a CSV, and verify a transform table appears.
- Create docker-compose.yml with services (see example).
- Add a Makefile with targets: up, seed, transform, test, down.
- Seed the CSV into orders_raw, create orders_clean, and SELECT COUNT(*).
Exercise 2: Write tests
Goal: Add two tests—one quality test (no negative amounts) and one contract test (required columns exist).
- Add tests/test_quality.py or equivalent SQL checks.
- Ensure tests fail if you temporarily insert bad data, then pass after fix.
Checklist
- [ ] docker compose up starts all services and they report healthy
- [ ] Makefile target up works
- [ ] Seeded data visible in orders_raw
- [ ] orders_clean created with transformed data
- [ ] Tests run with a single command
- [ ] Negative amount test passes
- [ ] Contract check passes
- [ ] One-command cleanup resets state
Common mistakes and self-check
- Missing parity with CI: Self-check—run identical commands locally and in CI (using Makefile/Taskfile) to ensure parity.
- Brittle paths: Self-check—use env vars and relative project paths; verify on a clean clone.
- Forgotten test data reset: Self-check—add a clean target that drops/re-creates tables/buckets.
- Overcomplicated stacks: Self-check—start minimal (DB + storage); add only what’s needed for the task.
- No healthchecks: Self-check—add healthchecks so tests wait until services are ready.
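Building on the last point, a tiny wait loop that reuses the pg_isready healthcheck from the compose file keeps tests from racing the database. A sketch (adjust the user and database names to your setup):

# wait-for-postgres.sh — poll until Postgres accepts connections
until docker compose exec -T postgres pg_isready -U demo -d warehouse; do
  echo "waiting for postgres..."
  sleep 2
done
echo "postgres is ready"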
Practical projects
- Bootstrap a team template repo with docker-compose, Makefile, and a sample dataset
- Add data quality tests and a contract check for two critical tables
- Introduce pre-commit hooks to validate SQL formatting and run a fast smoke test (see the config sketch after this list)
- Wrap common flows in make targets: up, seed, transform, test, down, clean
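A starting point for the pre-commit idea (the hook choices here, sqlfluff plus a local pytest smoke hook, are suggestions rather than requirements):

# .pre-commit-config.yaml — example hooks; swap in your team's linters
repos:
  - repo: https://github.com/sqlfluff/sqlfluff
    rev: "3.0.7"  # pin to a current sqlfluff release
    hooks:
      - id: sqlfluff-lint
  - repo: local
    hooks:
      - id: smoke-test
        name: fast smoke test
        entry: pytest -q -m smoke  # assumes fast tests are marked with @pytest.mark.smoke
        language: system
        pass_filenames: false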
Learning path
- Start: Minimal local stack with Postgres + MinIO
- Next: Add automated tests (pytest/dbt tests) and pre-commit
- Then: Add orchestration locally (Airflow or Prefect) for end-to-end runs (see the flow sketch after this list)
- Finally: Mock cloud dependencies (LocalStack) and add contract testing in CI
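When you reach the orchestration step, a local run can be as small as a single flow. A sketch using Prefect's @flow/@task decorators and the transform SQL from Example 1 (Airflow would work equally well; the DSN default is an assumption carried over from earlier):

import os

import psycopg
from prefect import flow, task

DSN = os.getenv("DSN", "postgresql://demo:demo@localhost:5432/warehouse")


@task
def transform():
    # Rebuild orders_clean from orders_raw, mirroring Example 1
    with psycopg.connect(DSN, autocommit=True) as conn:
        conn.execute("DROP TABLE IF EXISTS orders_clean")
        conn.execute(
            "CREATE TABLE orders_clean AS "
            "SELECT order_id, user_id, amount::numeric(12,2) AS amount, "
            "created_at::timestamp AS created_at "
            "FROM orders_raw WHERE amount IS NOT NULL"
        )


@task
def check():
    with psycopg.connect(DSN) as conn:
        count = conn.execute("SELECT COUNT(*) FROM orders_clean").fetchone()[0]
        print(f"orders_clean rows: {count}")


@flow
def local_pipeline():
    transform()
    check()


if __name__ == "__main__":
    local_pipeline()  # runs end-to-end on your laptop, no scheduler required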
Mini challenge
Create a single make e2e command that: brings up services, seeds data, runs transforms, executes tests, prints a short summary, then exits with the correct status code.
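One possible shape for that target, assuming the up, seed, transform, and test targets from Exercise 1 exist (make already propagates a failing prerequisite's exit code, so the status code comes for free):

# e2e: full local run; fails fast if any step fails
e2e: up seed transform test
	@echo "e2e run finished: stack up, data seeded, transforms built, tests green"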
Next steps
- Polish your template repo and share it with your team
- Add sample data generators (e.g., synthetic orders) for richer tests
- Expand tests to cover edge cases (empty files, schema drift, timezones)
About saving your progress
The quick test and exercises are available to everyone. If you log in, your progress will be saved automatically.