Why this matters
As a Data Platform Engineer, you enable fast, reliable delivery across many repos and teams. Standard project scaffolding reduces setup time, prevents bikeshedding, improves onboarding, and makes CI/CD, security, and governance predictable.
- Spin up a new pipeline service in minutes with consistent structure
- Apply shared build, test, and deploy workflows across projects
- Enforce quality gates (linting, tests, docs) by default
- Speed up incident response because all repos look and behave the same
Concept explained simply
Standard project scaffolding is a ready-to-use template for new repos. It includes directory structure, baseline config, tests, docs, CI jobs, and local dev tooling. You clone it, change names, and start building immediately.
Mental model
Think of scaffolding as a productized starting point: one opinionated template per workload type (e.g., batch pipeline, stream processing, dbt analytics, Terraform infra). Each template ships with the smallest set of decisions already made.
Good scaffolding rules of thumb
- Simple defaults; extensible when needed
- Zero to first commit under 15 minutes
- Local run equals CI run (same commands)
- Documentation lives in the repo
- Security and quality checks are on by default
Standard layout essentials
repo-root/
  README.md                 # What this repo does and how to use it
  docs/                     # Architecture, decisions, runbooks
  src/                      # Application or models
  tests/                    # Unit/integration tests
  configs/                  # Envs: dev, stage, prod
  .gitignore
  .editorconfig
  .pre-commit-config.yaml   # Lint/format hooks
  pyproject.toml | package.json | dbt_project.yml | main.tf
  Dockerfile                # Reproducible runtime
  Makefile | Taskfile.yml   # One-liner commands
  ci/                       # Reusable CI configs (lint, test, build)
  scripts/                  # Helper scripts (idempotent)
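For example, the .pre-commit-config.yaml referenced above can start very small. A minimal sketch assuming black and ruff as the default formatter and linter; the rev values are placeholders to pin to your own standards:

# .pre-commit-config.yaml (illustrative; pin hook versions yourself)
repos:
  - repo: https://github.com/psf/black
    rev: 24.3.0              # placeholder version
    hooks:
      - id: black
  - repo: https://github.com/astral-sh/ruff-pre-commit
    rev: v0.4.4              # placeholder version
    hooks:
      - id: ruff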
- One task runner with consistent commands: make install, make fmt, make lint, make test, make run (see the Makefile sketch after this list)
- Environments: configs/dev.yaml, configs/prod.yaml with secret values injected at runtime (never committed)
- Docs baseline: docs/ADR-000-template.md, docs/runbook.md, docs/architecture.md
- Quality: linters, formatters, type checks, minimal tests
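A minimal Makefile sketch for the task-runner convention above. The tool choices (pip with a [dev] extra, black, ruff, pytest) are assumptions you can swap, and recipe lines must be indented with a real tab character:

# Makefile (sketch; assumes a Python stack with black, ruff, pytest)
.PHONY: install fmt lint test run
install:
	pip install -e ".[dev]"
fmt:
	black src tests
lint:
	ruff check src tests
test:
	pytest
run:
	python -m your_package.main $(ARGS)   # replace your_package with your source package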
Worked examples
Example 1 — Python batch pipeline service
data-batch-py/
  README.md
  src/
    batch_job/
      __init__.py
      main.py               # Entry point
      io.py                 # S3/GCS/DB I/O
      transform.py          # Pure functions
  tests/
    test_transform.py
  configs/
    dev.yaml
    prod.yaml
  pyproject.toml            # black, ruff, mypy, pytest
  Dockerfile
  Makefile                  # install, fmt, lint, test, run
  .pre-commit-config.yaml
  ci/
    pipeline.yaml           # lint -> test -> build -> scan
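The pyproject.toml carries the tool opinions for this template. A sketch of the relevant sections only, with line length and strictness as placeholder choices:

# pyproject.toml (tool sections only; illustrative defaults)
[tool.black]
line-length = 100

[tool.ruff]
line-length = 100

[tool.mypy]
strict = false               # tighten once the codebase settles

[tool.pytest.ini_options]
testpaths = ["tests"]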
Core commands
make install
make fmt
make lint
make test
make run ARGS="--config configs/dev.yaml"
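A sample configs/dev.yaml to back these commands. The keys below (source, sink) are hypothetical; they only need to match what the job reads:

# configs/dev.yaml (example only; no secrets, placeholder paths)
source:
  path: s3://my-dev-bucket/raw/
  format: parquet
sink:
  path: s3://my-dev-bucket/clean/
  format: parquet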
Minimal contents
# src/batch_job/main.py
from .io import read_source, write_sink
from .transform import clean_records

def run(config):
    df = read_source(config)
    out = clean_records(df)
    write_sink(out, config)

if __name__ == "__main__":
    # Run as a module so the relative imports resolve:
    #   python -m batch_job.main --config configs/dev.yaml
    import argparse, yaml
    parser = argparse.ArgumentParser()
    parser.add_argument("--config", default=None)
    args = parser.parse_args()
    cfg = {}
    if args.config:
        with open(args.config) as f:
            cfg = yaml.safe_load(f) or {}
    run(cfg)
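The transform and its test stay equally small. A sketch assuming clean_records operates on lists of dicts and drops records with a missing id; adapt the shapes to your real I/O types (e.g., DataFrames):

# src/batch_job/transform.py (pure functions, no I/O; illustrative logic)
def clean_records(records):
    """Drop records without an 'id' and strip whitespace from names."""
    cleaned = []
    for rec in records:
        if rec.get("id") is None:
            continue
        rec = dict(rec)                          # don't mutate the caller's data
        rec["name"] = (rec.get("name") or "").strip()
        cleaned.append(rec)
    return cleaned

# tests/test_transform.py
from batch_job.transform import clean_records

def test_clean_records_drops_missing_ids():
    raw = [{"id": 1, "name": " a "}, {"id": None, "name": "b"}]
    assert clean_records(raw) == [{"id": 1, "name": "a"}]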
Example 2 — dbt analytics repo
analytics-dbt/
  README.md
  dbt_project.yml
  models/
    staging/
    marts/
  macros/
  seeds/
  tests/                    # generic + singular tests
  profiles/                 # sample profile templates (no secrets)
  ci/
    dbt-ci.yaml             # deps, build, test
  Makefile                  # make deps, make build, make test
  docs/
    runbook.md
Default commands
make deps
make build # dbt build
make test # dbt test
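A minimal sketch of the generic tests these commands exercise, assuming a staging model named stg_orders with an order_id column (both names are placeholders):

# models/staging/schema.yml (example generic tests)
version: 2
models:
  - name: stg_orders
    columns:
      - name: order_id
        tests:
          - not_null
          - unique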
Example 3 — Terraform (data infra)
infra-tf/
  README.md
  modules/
    data_lake/
      main.tf
      variables.tf
      outputs.tf
  envs/
    dev/
      main.tf               # uses modules/data_lake
    prod/
      main.tf
  ci/
    tf-ci.yaml              # fmt, validate, plan, apply (manual)
  Makefile                  # fmt, validate, plan, apply
  docs/
    architecture.md
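A sketch of how an environment consumes the module. The source path matches the tree above; the variables and bucket name are placeholders that must match what modules/data_lake actually declares:

# envs/dev/main.tf (example module usage; backend and provider blocks omitted)
module "data_lake" {
  source      = "../../modules/data_lake"
  environment = "dev"
  bucket_name = "my-company-data-lake-dev"   # placeholder
}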
Guardrails
- make validate must pass before plan
- Apply only on protected branches with manual approval
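What those guardrails can look like in CI. A sketch assuming GitHub Actions, where a protected environment provides the manual approval (on GitHub the file would live under .github/workflows/); translate the same gates to your CI system:

# ci/tf-ci.yaml (sketch assuming GitHub Actions)
on:
  pull_request:
  push:
    branches: [main]
jobs:
  plan:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: hashicorp/setup-terraform@v3
      - run: make fmt validate plan
  apply:
    needs: plan
    if: github.ref == 'refs/heads/main'
    environment: production            # protected environment = manual approval gate
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: hashicorp/setup-terraform@v3
      - run: make apply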
Step-by-step: build your template once
- Create a new template repo named template-<workload> (e.g., template-data-batch-py).
- Add minimal files: README, src/, tests/, configs/, Makefile/Taskfile, Dockerfile, CI pipeline, pre-commit hooks.
- Define 4 must-have commands: install, fmt, lint, test, plus run/build if relevant.
- Set opinions: formatter, linter, typing/tooling, test framework. Keep to widely used defaults.
- Write docs/runbook.md: how to run locally, configs, CI, troubleshooting.
- Add a sample job/model/module that actually runs (hello world with I/O).
- Dry-run the template: clone it fresh and measure the time to a first successful make test.
- Version it (v1, v1.1). Document upgrade steps.
Optional improvements
- Pre-templated CODEOWNERS
- Security scans (container scan, IaC scan)
- Data contract/sample schema files
- Conventional commits and release automation
Exercises
Do these in a scratch directory or a test repository.
Exercise 1 — Create a minimal Python data pipeline scaffold
- Goal: pass make fmt, make lint, make test; run a simple transform.
- Folders: src/pipeline/, tests/, configs/
- Files: README.md, pyproject.toml, Makefile, Dockerfile, .pre-commit-config.yaml
Acceptance checklist
- make install installs deps
- make fmt and make lint succeed
- make test runs at least one unit test
- make run uses configs/dev.yaml
Hints
- Use black and ruff for formatting/linting
- Keep transform functions pure (inputs in, outputs out)
- Stub I/O with in-memory data or temp files
Exercise 2 — Add CI config and docs to your scaffold
- Add ci/pipeline.yaml that runs fmt-check, lint, test
- Add docs/runbook.md with how to run locally and in CI
- Ensure the same commands are used locally and in CI
Acceptance checklist
- CI runs on pull requests
- CI uses make fmt, make lint, make test
- docs/runbook.md explains the commands and configs
Common mistakes and self-check
- Too much boilerplate: self-check — can a new dev get to first green test in under 15 minutes?
- Hidden coupling: self-check — can you swap configs/dev.yaml to prod.yaml without code changes?
- Docs drift: self-check — is docs/runbook.md updated in the same PR as code changes?
- Inconsistent commands: self-check — do CI and local use the same make targets?
- Secrets in repo: self-check — do configs contain only placeholders, with real secrets injected at runtime?
Practical projects
Project 1 — Batch pipeline template v1
- Deliver a working template-repo with Python, tests, CI, Dockerfile
- Time-to-first-successful-run under 10 minutes
- Include ADR-000 about chosen tools
Project 2 — dbt template with quality checks
- dbt build and dbt test wired to make targets
- Generic tests for not null and uniqueness on sample models
- CI fails on test failures
Project 3 — Terraform environment template
- Format/validate/plan on PR
- Apply only on main with manual approval
- docs/architecture.md explaining module boundaries
Mini challenge
Pick one of your existing repos. Migrate it to the new scaffold with minimal disruption. Measure before/after:
- How long from clone to first successful test?
- How many commands to run locally?
- How many manual steps are eliminated?
Tip
Create a migration guide in docs/migration.md and do it in small PRs: layout, tooling, CI, docs.
Who this is for
- Data Platform Engineers who support multiple teams
- Data Engineers creating repeatable pipelines
- Analytics Engineers standardizing dbt projects
Prerequisites
- Basic Git workflow
- Familiarity with one stack (Python/dbt/Terraform)
- Comfort with a CI system and a task runner (Make/Task)
Learning path
- Draft a minimal scaffold for one workload (this lesson)
- Roll it out to one pilot team and gather feedback
- Harden CI, security, and docs
- Publish v1 and add versioned upgrade notes
- Extend to additional workloads (dbt, streaming, IaC)
Next steps
- Finish the exercises and take the quick test
- Templatize common commands across repos
- Create a single internal doc explaining how to pick the right template