
CLI And Templates For New Pipelines

Learn CLI And Templates For New Pipelines for free, with explanations, exercises, and a quick test (for Data Platform Engineers).

Published: January 11, 2026 | Updated: January 11, 2026

Why this matters

As a Data Platform Engineer, you enable teams to ship reliable pipelines fast. A good CLI with solid templates removes guesswork, standardizes structure, embeds best practices, and reduces onboarding time from days to minutes.

  • Spin up a new Airflow DAG with CI, ownership, and data quality checks in one command.
  • Generate dbt models that follow naming conventions and materialization defaults.
  • Create Spark jobs with consistent logging, configs, and deployment manifests.

Who this is for

  • Data Platform Engineers building internal tooling for data teams.
  • Data Engineers who want consistent scaffolds for pipelines and models.
  • Analytics Engineers standardizing dbt project patterns.

Prerequisites

  • Comfort with command-line basics and project structures.
  • Familiarity with at least one pipeline tool (Airflow, dbt, or Spark).
  • Basic understanding of templating concepts (variables, placeholders).

Concept explained simply

A scaffolding CLI is a tool that asks a few questions (or reads a config), then generates a ready-to-run folder with code, configs, tests, and docs. Templates are the blueprints it fills in with your answers.
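
To make this concrete, here is a minimal sketch of the idea in Python. Everything in it (the prompts, the defaults, the README "template") is illustrative rather than part of any real tool:

# scaffold_sketch.py - minimal illustration of "ask a few questions, then generate files"
from pathlib import Path

def ask(question: str, default: str) -> str:
    answer = input(f"{question} [{default}]: ").strip()
    return answer or default

name = ask("Pipeline name", "user_events_ingest")
owner = ask("Owner email", "data-platform@company.com")

# The "blueprint": a tiny README template filled in with the answers
project = Path(name)
project.mkdir(exist_ok=True)
(project / "README.md").write_text(f"# {name}\n\nOwner: {owner}\n")
print(f"Created {project}/README.md")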

Mental model

  • Think "bento box": slots are fixed (folders, files, CI, docs), fillings vary (names, owners, parameters).
  • Inputs: name, type, source, destination, owner, schedule, infra.
  • Outputs: standardized repo layout, boilerplate code, CI, data tests, runbook.
  • Guardrails: validations and dry-runs so users can preview before generating.

Core building blocks

  • Command design: datax <noun> <verb> [options] (e.g., datax pipeline create).
  • Template variables: name, domain, owner, tags, schedule, data source/sink, processing engine.
  • File rendering: turn {{ variable }} placeholders into real values (see the sketch after this list).
  • Conventions: naming (snake_case), folder layout (src/, tests/, dags/, models/), config (yaml), docs (README.md), ownership (CODEOWNERS).
  • Safety: dry-run preview, idempotent re-run to update files, versioned templates, changelog.
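
As a rough illustration of the rendering step, the sketch below substitutes {{ variable }} placeholders in both file paths and file contents with plain Python; a real scaffolder would more likely use a template engine such as Jinja2, and every name and value here is made up for the example:

# render_sketch.py - replace {{ variable }} placeholders in paths and contents
import re

def render(text: str, variables: dict) -> str:
    # Matches "{{ name }}", "{{name}}", etc., and looks the key up in `variables`
    return re.sub(r"\{\{\s*(\w+)\s*\}\}", lambda m: str(variables[m.group(1)]), text)

variables = {"name": "user_events_ingest", "owner": "data-platform@company.com"}

print(render("dags/{{ name }}_dag.py", variables))                 # dags/user_events_ingest_dag.py
print(render("# Owner: {{ owner }}\nDAG_ID = \"{{ name }}\"\n", variables))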

Worked examples

Example 1: Airflow pipeline scaffold

Goal: create an hourly DAG that reads Kafka topic user_events and writes to a warehouse table.

Command:
datax pipeline create airflow \
  --name user_events_ingest \
  --source kafka:user_events \
  --sink warehouse:raw_user_events \
  --schedule "0 * * * *" \
  --owner data-platform@company.com \
  --tags ingest,events

Preview (dry-run):

CREATE dags/user_events_ingest_dag.py
CREATE src/user_events_ingest/extract.py
CREATE src/user_events_ingest/load.py
CREATE tests/test_user_events_ingest.py
CREATE ops/ci.yaml
CREATE README.md
CREATE CODEOWNERS

Key template snippets:

# dags/{{ name }}_dag.py
from airflow import DAG
from airflow.operators.python import PythonOperator

# extract_fn and load_fn are assumed to be imported from the generated src/{{ name }}/ package
with DAG(dag_id="{{ name }}", schedule_interval="{{ schedule }}", tags={{ tags }}) as dag:
    extract = PythonOperator(task_id="extract", python_callable=extract_fn)
    load = PythonOperator(task_id="load", python_callable=load_fn)
    extract >> load

Example 2: dbt model scaffold

Goal: create a fact model with tests and documentation.

Command:
datax model create dbt \
  --name fct_orders \
  --source ref:stg_orders \
  --materialization table \
  --owner analytics@company.com

Generated files:

models/fct_orders.sql
models/schema.yml
README.md

Template pieces:

-- models/{{ name }}.sql
-- {{ source }} is a scaffolder placeholder; with --source ref:stg_orders it renders to {{ ref('stg_orders') }}
select * from {{ source }}

# models/schema.yml
version: 2
models:
  - name: {{ name }}
    description: "Fact table for orders"
    columns:
      - name: order_id
        tests: [not_null, unique]

Example 3: Spark streaming job scaffold

Goal: create a streaming job with checkpointing and config-driven options.

Command:
datax job create spark \
  --name stream_orders \
  --checkpoint s3://my-bucket/checkpoints/stream_orders \
  --source kafka:orders \
  --sink parquet:s3://my-bucket/dwh/bronze/orders

Generated files:

src/stream_orders/main.py
conf/stream_orders.yaml
ops/ci.yaml
README.md

Snippet:

# src/{{ name }}/main.py
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("{{ name }}").getOrCreate()

# A real Kafka source also needs kafka.bootstrap.servers, e.g. loaded from conf/{{ name }}.yaml
spark.readStream.format("kafka").option("subscribe", "{{ source_topic }}") \
    .load() \
    .writeStream.option("checkpointLocation", "{{ checkpoint }}") \
    .format("parquet").start("{{ sink_path }}")

Step-by-step: build a minimal scaffolder

  1. Decide nouns and verbs
    Example: nouns = pipeline, model, job; verbs = create, upgrade, doctor.
  2. Define inputs
    Create a YAML questionnaire with defaults and validations (e.g., name pattern ^[a-z0-9_]+$).
  3. Prepare templates
    Store them under templates/<type>/ with placeholders like {{ name }}, {{ owner }}, {{ schedule }}.
  4. Render and preview
    Implement a dry-run that prints the file tree and diffs before writing (a minimal end-to-end sketch follows this list).
  5. Add guardrails
    Checks: name uniqueness, reserved words, forbidden characters, missing required inputs, and incompatible flags.
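
Putting these steps together, here is one possible minimal scaffolder. The command shape follows datax pipeline create from this topic, but the flags, the file list, and the use of Python's {name}-style placeholders (instead of {{ name }}) are simplifying assumptions, not a reference implementation:

# datax_sketch.py - a toy "pipeline create" scaffolder with validation and dry-run
import argparse
import re
import sys
from pathlib import Path

NAME_PATTERN = re.compile(r"^[a-z0-9_]+$")  # validation from step 2

# Step 3: templates keyed by output path; both paths and bodies contain placeholders
TEMPLATES = {
    "dags/{name}_dag.py": "# DAG for {name}, owned by {owner}\n",
    "src/{name}/extract.py": "def extract_fn():\n    pass\n",
    "tests/test_{name}.py": "def test_placeholder():\n    assert True\n",
    "README.md": "# {name}\n\nOwner: {owner}\n",
}

def create(args: argparse.Namespace) -> None:
    # Step 5: guardrails before anything is written
    if not NAME_PATTERN.match(args.name):
        sys.exit(f"Invalid name '{args.name}': must match {NAME_PATTERN.pattern}")
    rendered = {
        path.format(name=args.name): body.format(name=args.name, owner=args.owner)
        for path, body in TEMPLATES.items()
    }
    for path in rendered:  # Step 4: always preview the file tree first
        print(f"CREATE {path}")
    if args.dry_run:
        return
    for path, body in rendered.items():
        target = Path(path)
        target.parent.mkdir(parents=True, exist_ok=True)
        target.write_text(body)

def main() -> None:
    # Step 1: nouns and verbs as nested subcommands
    parser = argparse.ArgumentParser(prog="datax")
    nouns = parser.add_subparsers(dest="noun", required=True)
    pipeline_verbs = nouns.add_parser("pipeline").add_subparsers(dest="verb", required=True)
    create_cmd = pipeline_verbs.add_parser("create")
    create_cmd.add_argument("--name", required=True)
    create_cmd.add_argument("--owner", required=True)
    create_cmd.add_argument("--dry-run", action="store_true")
    create_cmd.set_defaults(func=create)
    args = parser.parse_args()
    args.func(args)

if __name__ == "__main__":
    main()

Running python datax_sketch.py pipeline create --name user_events_ingest --owner data-platform@company.com --dry-run would print the CREATE lines without touching disk.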

Exercises

Complete these now; they mirror the Practice Exercises section below, so you can compare your results.

  • Exercise 1: Design a create command for a new Airflow pipeline and show the generated file tree.
  • Exercise 2: Render a simple template using variables for name and owner.

Exercise checklist

  • The command has clear noun/verb and required flags.
  • All generated files follow a consistent layout.
  • Template variables are replaced everywhere (file content and names).
  • Dry-run output is readable and actionable.

Common mistakes and self-check

  • Too many prompts: Overwhelms users. Self-check: Can a sensible default be applied? If yes, make it default.
  • Hidden conventions: Undocumented defaults confuse. Self-check: Is every default visible in README or --help?
  • Template drift: Generated code stops matching standards. Self-check: Version templates and include a datax upgrade path.
  • No validation: Bad names/schedules break deploys. Self-check: Add regex checks and reserved-word lists.
  • Missing CI/docs: Only code is generated. Self-check: Ensure CI, CODEOWNERS, and README are always included.
  • Destructive overwrites: Re-runs delete custom changes. Self-check: Use idempotent writes with diff previews and backups (see the diff-preview sketch after this list).
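
One way to avoid destructive overwrites is to diff the freshly rendered content against what is already on disk and only write after the change has been shown (and confirmed). A minimal sketch using Python's difflib, with a hypothetical README.md as the target:

# diff_preview_sketch.py - show a unified diff before overwriting a generated file
import difflib
from pathlib import Path

def preview_overwrite(path: Path, new_content: str) -> bool:
    """Print a diff between the existing file and the regenerated content;
    return True when there is something to update."""
    old_content = path.read_text() if path.exists() else ""
    diff = list(difflib.unified_diff(
        old_content.splitlines(keepends=True),
        new_content.splitlines(keepends=True),
        fromfile=f"{path} (current)",
        tofile=f"{path} (regenerated)",
    ))
    print("".join(diff) if diff else f"{path}: no changes")
    return bool(diff)

if __name__ == "__main__":
    # Hypothetical re-run: the owner line in README.md has changed in the template inputs
    if preview_overwrite(Path("README.md"), "# user_events_ingest\n\nOwner: data-platform@company.com\n"):
        pass  # a real tool would ask for confirmation (or require --force) before writing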

Practical projects

Project 1: Minimal Airflow scaffolder
  • Inputs: name, schedule, owner.
  • Outputs: DAG file, src/ folder, tests/, ops/ci.yaml, README.md, CODEOWNERS.
  • Must support: dry-run and a --force flag with diff.
Project 2: dbt model generator
  • Inputs: name, source ref, materialization.
  • Outputs: model SQL, schema.yml, docs.
  • Validation: name matches ^(stg|int|fct)_[a-z0-9_]+$.
Project 3: Template versioning
  • Add a --template-version flag with a changelog.
  • Implement datax upgrade to update generated projects safely (a rough version-check sketch follows this list).
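
For Project 3, one simple approach (an assumption, not a prescribed design) is to stamp each generated project with the template version it was built from, here via a made-up .datax_version file, and have datax upgrade compare that stamp against the version the CLI currently ships:

# upgrade_sketch.py - decide whether a generated project needs "datax upgrade"
from pathlib import Path

CURRENT_TEMPLATE_VERSION = "1.3.0"   # would normally ship with the CLI alongside its changelog
VERSION_FILE = ".datax_version"      # made-up convention, written at scaffold time

def needs_upgrade(project_dir: Path) -> bool:
    stamp = project_dir / VERSION_FILE
    if not stamp.exists():
        print("No template version recorded; treat as pre-versioning and upgrade.")
        return True
    recorded = stamp.read_text().strip()
    if recorded == CURRENT_TEMPLATE_VERSION:
        print(f"Already on template {recorded}; nothing to do.")
        return False
    print(f"Project on template {recorded}, CLI ships {CURRENT_TEMPLATE_VERSION}; re-render with diffs.")
    return True

if __name__ == "__main__":
    if needs_upgrade(Path(".")):
        pass  # a real `datax upgrade` would re-render the templates and show diffs before writing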

Learning path

  • Start: Design CLI verbs/nouns and required inputs.
  • Next: Build templates for one pipeline type end-to-end.
  • Then: Add dry-run, validations, and CI templates.
  • Finally: Introduce versioning and upgrades.

Next steps

  • Pick one team use case and ship a v1 template this week.
  • Collect feedback via a short form or office hours.
  • Iterate defaults, then document in README.md and --help.

Mini challenge

Extend your CLI with a new type: batch_spark. Reuse 80% of the streaming Spark template but swap sink/materialization defaults for batch. Provide a dry-run that shows only the differences from the streaming template.

Quick Test

Everyone can take the test below.

When ready, scroll to the Quick Test section.

Practice Exercises

2 exercises to complete

Instructions

Propose a datax pipeline create command for an hourly Kafka-to-warehouse ingest pipeline named user_events_ingest. Include flags for source, sink, schedule, owner, and tags. Then list the files your scaffold would create. Provide a dry-run style output.

Expected Output
Command with clear flags, and a file tree including: dags/user_events_ingest_dag.py, src/user_events_ingest/, tests/test_user_events_ingest.py, ops/ci.yaml, README.md, CODEOWNERS.

CLI And Templates For New Pipelines — Quick Test

Test your knowledge with 8 questions. Pass with 70% or higher.
