Why this matters
As a Data Platform Engineer, you enable teams to ship reliable pipelines fast. A good CLI with solid templates removes guesswork, standardizes structure, embeds best practices, and reduces onboarding time from days to minutes.
- Spin up a new Airflow DAG with CI, ownership, and data quality checks in one command.
- Generate dbt models that follow naming conventions and materialization defaults.
- Create Spark jobs with consistent logging, configs, and deployment manifests.
Who this is for
- Data Platform Engineers building internal tooling for data teams.
- Data Engineers who want consistent scaffolds for pipelines and models.
- Analytics Engineers standardizing dbt project patterns.
Prerequisites
- Comfort with command-line basics and project structures.
- Familiarity with at least one pipeline tool (Airflow, dbt, or Spark).
- Basic understanding of templating concepts (variables, placeholders).
Concept explained simply
A scaffolding CLI is a tool that asks a few questions (or reads a config), then generates a ready-to-run folder with code, configs, tests, and docs. Templates are the blueprints it fills in with your answers.
Mental model
- Think "bento box": slots are fixed (folders, files, CI, docs), fillings vary (names, owners, parameters).
- Inputs: name, type, source, destination, owner, schedule, infra.
- Outputs: standardized repo layout, boilerplate code, CI, data tests, runbook.
- Guardrails: validations and dry-runs so users can preview before generating.
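For example, the inputs above could come from a small answers file instead of interactive prompts. The file name and fields below are illustrative, not a fixed format:

# pipeline.yaml -- illustrative answers file
name: user_events_ingest
type: airflow
source: kafka:user_events
destination: warehouse:raw_user_events
owner: data-platform@company.com
schedule: "0 * * * *"
infra: kubernetes  # assumption: target runtime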
Core building blocks
- Command design: datax <noun> <verb> [options] (e.g., datax pipeline create; a minimal sketch of this layout follows this list).
- Template variables: name, domain, owner, tags, schedule, data source/sink, processing engine.
- File rendering: turn {{ variable }} placeholders into real values.
- Conventions: naming (snake_case), folder layout (src/, tests/, dags/, models/), config (YAML), docs (README.md), ownership (CODEOWNERS).
- Safety: dry-run preview, idempotent re-run to update files, versioned templates, changelog.
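A minimal sketch of the noun/verb command layout using Python's argparse; the real datax CLI could use any framework, only the pipeline create path is shown, and the flag names mirror the worked examples below:

# cli.py -- noun/verb skeleton, illustrative only
import argparse

def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(prog="datax")
    nouns = parser.add_subparsers(dest="noun", required=True)

    pipeline = nouns.add_parser("pipeline", help="pipeline scaffolds")
    verbs = pipeline.add_subparsers(dest="verb", required=True)

    create = verbs.add_parser("create", help="generate a new pipeline")
    create.add_argument("engine", choices=["airflow", "spark"])
    create.add_argument("--name", required=True)
    create.add_argument("--owner", required=True)
    create.add_argument("--schedule", default="@daily")
    create.add_argument("--dry-run", action="store_true")
    return parser

if __name__ == "__main__":
    # e.g. datax pipeline create airflow --name user_events_ingest --owner data-platform@company.com --dry-run
    print(build_parser().parse_args())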
Worked examples
Example 1: Airflow pipeline scaffold
Goal: create an hourly DAG that reads Kafka topic user_events and writes to a warehouse table.
Command:
datax pipeline create airflow \
--name user_events_ingest \
--source kafka:user_events \
--sink warehouse:raw_user_events \
--schedule "0 * * * *" \
--owner data-platform@company.com \
--tags ingest,events
Preview (dry-run):
CREATE dags/user_events_ingest_dag.py
CREATE src/user_events_ingest/extract.py
CREATE src/user_events_ingest/load.py
CREATE tests/test_user_events_ingest.py
CREATE ops/ci.yaml
CREATE README.md
CREATE CODEOWNERS
Key template snippets:
# dags/{{ name }}_dag.py
from airflow import DAG
from airflow.operators.python import PythonOperator
from {{ name }}.extract import extract_fn  # generated under src/{{ name }}/
from {{ name }}.load import load_fn

with DAG(dag_id="{{ name }}", schedule_interval="{{ schedule }}", tags={{ tags }}) as dag:
    extract = PythonOperator(task_id="extract", python_callable=extract_fn)
    load = PythonOperator(task_id="load", python_callable=load_fn)
    extract >> load
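The dry-run above also lists tests/test_user_events_ingest.py. One plausible shape for that generated test, assuming pytest and that dags/ is readable by Airflow's DagBag:

# tests/test_user_events_ingest.py -- illustrative; the generated test may differ
from airflow.models import DagBag

def test_dag_loads_and_has_expected_tasks():
    dag_bag = DagBag(dag_folder="dags", include_examples=False)
    assert dag_bag.import_errors == {}  # the DAG file parses cleanly
    dag = dag_bag.get_dag("user_events_ingest")
    assert dag is not None
    assert {t.task_id for t in dag.tasks} == {"extract", "load"}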
Example 2: dbt model scaffold
Goal: create a fact model with tests and documentation.
Command:
datax model create dbt \
--name fct_orders \
--source ref:stg_orders \
--materialization table \
--owner analytics@company.com
Generated files:
models/fct_orders.sql
models/schema.yml
README.md
Template pieces:
-- models/{{ name }}.sql
select * from {{ source }}

# models/schema.yml
version: 2
models:
  - name: {{ name }}
    description: "Fact table for orders"
    columns:
      - name: order_id
        tests: [not_null, unique]
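One plausible rendering of the command above, assuming the scaffolder expands ref:stg_orders into a dbt ref() call. Note that the generated file has to keep dbt's own Jinja intact, so the scaffolder must emit it literally rather than render it away:

-- models/fct_orders.sql (rendered output, illustrative)
select * from {{ ref('stg_orders') }}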
Example 3: Spark streaming job scaffold
Goal: create a streaming job with checkpointing and config-driven options.
Command:
datax job create spark \
--name stream_orders \
--checkpoint s3://my-bucket/checkpoints/stream_orders \
--source kafka:orders \
--sink parquet:s3://my-bucket/dwh/bronze/orders
Generated files:
src/stream_orders/main.py
conf/stream_orders.yaml
ops/ci.yaml
README.md
Snippet:
# src/{{ name }}/main.py
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("{{ name }}").getOrCreate()

# in practice a kafka.bootstrap.servers option is also required (e.g., read from conf/{{ name }}.yaml)
spark.readStream.format("kafka").option("subscribe", "{{ source_topic }}") \
    .load() \
    .writeStream.option("checkpointLocation", "{{ checkpoint }}") \
    .format("parquet").start("{{ sink_path }}")
Step-by-step: build a minimal scaffolder
- Decide nouns and verbs. Example: nouns = pipeline, model, job; verbs = create, upgrade, doctor.
- Define inputs. Create a YAML questionnaire with defaults and validations (e.g., name pattern ^[a-z0-9_]+$).
- Prepare templates. Store them under templates/<type>/ with placeholders like {{ name }}, {{ owner }}, {{ schedule }}.
- Render and preview. Implement a dry-run that prints the file tree and diffs before writing (a minimal sketch follows this list).
- Add guardrails. Checks: name uniqueness, reserved words, forbidden characters, missing required inputs, and incompatible flags.
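A minimal sketch of the render-and-preview step, assuming Jinja2 as the rendering engine; the function and directory names are illustrative, not the real datax internals:

# scaffold.py -- render a template directory with a dry-run preview (sketch)
import re
from pathlib import Path

from jinja2 import Template  # assumption: Jinja2 renders the {{ ... }} placeholders

NAME_PATTERN = re.compile(r"^[a-z0-9_]+$")  # guardrail from the questionnaire

def render_tree(template_dir: Path, out_dir: Path, variables: dict, dry_run: bool = True) -> None:
    """Render every file under template_dir, substituting placeholders in
    both file contents and file names (e.g. '{{ name }}_dag.py')."""
    if not NAME_PATTERN.match(variables.get("name", "")):
        raise ValueError(f"invalid name: {variables.get('name')!r}")
    for src in sorted(template_dir.rglob("*")):
        if src.is_dir():
            continue
        rel = src.relative_to(template_dir)
        dest = out_dir / Template(str(rel)).render(**variables)  # file names can hold placeholders too
        content = Template(src.read_text()).render(**variables)
        if dry_run:
            print(f"CREATE {dest}")  # preview only, nothing is written
        else:
            dest.parent.mkdir(parents=True, exist_ok=True)
            dest.write_text(content)

# Example (paths and values are illustrative):
# render_tree(Path("templates/airflow"), Path("."),
#             {"name": "user_events_ingest", "owner": "data-platform@company.com",
#              "schedule": "0 * * * *", "tags": ["ingest", "events"]}, dry_run=True)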
Exercises
Complete these now, then use the checklist below to compare your results.
- Exercise 1: Design a create command for a new Airflow pipeline and show the generated file tree.
- Exercise 2: Render a simple template using variables for name and owner.
Exercise checklist
- The command has clear noun/verb and required flags.
- All generated files follow a consistent layout.
- Template variables are replaced everywhere (file content and names).
- Dry-run output is readable and actionable.
Common mistakes and self-check
- Too many prompts: Overwhelms users. Self-check: Can a sensible default be applied? If yes, make it the default.
- Hidden conventions: Undocumented defaults confuse users. Self-check: Is every default visible in README or --help?
- Template drift: Generated code stops matching standards. Self-check: Version templates and include a datax upgrade path.
- No validation: Bad names/schedules break deploys. Self-check: Add regex checks and reserved-word lists.
- Missing CI/docs: Only code is generated. Self-check: Ensure CI, CODEOWNERS, and README are always included.
- Destructive overwrites: Re-runs delete custom changes. Self-check: Use idempotent writes with diff previews and backups.
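For the last point, a sketch of a non-destructive write that prints a diff and only overwrites when forced; the function name and --force flag are illustrative:

# safe_write.py -- diff-before-overwrite sketch, illustrative only
import difflib
from pathlib import Path

def safe_write(dest: Path, new_content: str, force: bool = False) -> None:
    if dest.exists():
        old = dest.read_text().splitlines(keepends=True)
        new = new_content.splitlines(keepends=True)
        diff = "".join(difflib.unified_diff(old, new, fromfile=str(dest), tofile=f"{dest} (regenerated)"))
        print(diff if diff else f"UNCHANGED {dest}")
        if not force:
            print(f"SKIP {dest} (re-run with --force to overwrite)")
            return
    dest.parent.mkdir(parents=True, exist_ok=True)
    dest.write_text(new_content)
    print(f"WRITE {dest}")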
Practical projects
Project 1: Minimal Airflow scaffolder
- Inputs: name, schedule, owner.
- Outputs: DAG file, src/ folder, tests/, ops/ci.yaml, README.md, CODEOWNERS.
- Must support: dry-run and a --force flag with a diff preview.
Project 2: dbt model generator
- Inputs: name, source ref, materialization.
- Outputs: model SQL, schema.yml, docs.
- Validation: name matches ^(stg|int|fct)_[a-z0-9_]+$.
Project 3: Template versioning
- Add a --template-version flag with a changelog.
- Implement datax upgrade to update generated projects safely.
Learning path
- Start: Design CLI verbs/nouns and required inputs.
- Next: Build templates for one pipeline type end-to-end.
- Then: Add dry-run, validations, and CI templates.
- Finally: Introduce versioning and upgrades.
Next steps
- Pick one team use case and ship a v1 template this week.
- Collect feedback via a short form or office hours.
- Iterate defaults, then document them in README.md and --help.
Mini challenge
Extend your CLI with a new type: batch_spark. Reuse 80% of the streaming Spark template but swap sink/materialization defaults for batch. Provide a dry-run that shows only the differences from the streaming template.