What is Developer Experience for Data, and why it matters
Developer Experience (DX) for data is the craft of making it fast, safe, and pleasant for data engineers and analysts to build, test, deploy, and operate data products. For a Data Platform Engineer, strong DX unlocks self-service pipelines, fewer tickets, faster onboarding, and higher reliability across the platform.
- Reduce time-to-first-pipeline with templates and CLI.
- Prevent drift with standard project scaffolding and libraries.
- Ship confidently using CI/CD for data assets and quality checks.
- Enable reproducible local dev with containerized services and seed data.
- Drive adoption via docs, examples, office hours, and feedback loops.
Who this is for
- Data Platform Engineers shaping internal developer platforms for data teams.
- Senior Data Engineers maintaining shared tooling and standards.
Prerequisites
- Comfort with Git, branching, and pull requests.
- Basic Python and shell scripting.
- Familiarity with containers (e.g., Docker) and YAML-based configs.
- Working knowledge of data workflow tools (e.g., orchestration, SQL modeling, storage formats).
Learning path (practical roadmap)
1. Start with a CLI and template
   Goal: Create a pipeline from zero to running in minutes.
   - Design a command like data init pipeline --name sales_etl.
   - Include templated folders, sample tests, and a README with quick start.
   - Validate input names, enforce conventions, and add helpful defaults.
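The validation bullet can be made concrete with a small helper. This is a minimal sketch: the snake_case rule, length limits, and reserved-name list are illustrative assumptions, not platform standards.

```python
import re

# Hypothetical naming convention: lowercase snake_case, 3-40 chars,
# starting with a letter. Adjust to your platform's conventions.
NAME_RE = re.compile(r"^[a-z][a-z0-9_]{2,39}$")
RESERVED = {"test", "tmp", "main"}  # illustrative names that collide with tooling


def validate_pipeline_name(name: str) -> list[str]:
    """Return a list of human-readable problems; an empty list means valid."""
    problems = []
    if not NAME_RE.match(name):
        problems.append("name must be lowercase snake_case, 3-40 chars, starting with a letter")
    if name in RESERVED:
        problems.append(f"'{name}' is reserved")
    return problems
```

Returning a list of problems (rather than raising on the first failure) lets the CLI print every issue at once, which shortens the fix-retry loop for users.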
2. Standard project scaffolding
   Goal: Prevent drift and improve discoverability.
   - Define a minimal, opinionated structure (src, tests, configs, docs).
   - Add pre-commit hooks for style and schema checks.
   - Ship a Makefile or task runner for common commands.
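A pre-commit hook can be a plain Python script invoked by your hook runner. The required entries below (README.md, a tests directory) are an illustrative policy, not a fixed standard:

```python
from pathlib import Path

# Illustrative policy: every scaffolded project must ship these entries.
REQUIRED = ["README.md", "tests"]


def check_project(root: Path) -> list[str]:
    """Return the required entries missing from a project directory."""
    return [req for req in REQUIRED if not (root / req).exists()]

# In a pre-commit hook, print the missing entries and exit non-zero,
# e.g.: missing = check_project(Path(".")); sys.exit(1 if missing else 0)
```

Keeping the check as a pure function makes it trivial to unit-test and to reuse in CI as a second enforcement point.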
3. CI/CD for data assets
   Goal: Automate unit tests, data quality checks, and deploys.
   - Run fast checks on PRs; full integration tests on main.
   - Gate merges on tests, lint, and contract checks.
   - Automate deploys with environment promotion.
4. Local dev and testing
   Goal: Reproduce prod-like behavior locally.
   - Provide docker-compose with Postgres/warehouse + object storage.
   - Seed sample datasets and credentials via .env files.
   - Offer a one-liner to start, run tests, and stop.
5. Standard libraries and SDKs
   Goal: Encapsulate platform best practices.
   - Common I/O, schema validation, idempotent writes, logging.
   - Semantic versioning and changelogs.
   - Deprecation guides and code mods for upgrades.
6. Documentation, examples, and enablement
   Goal: Discoverable, runnable examples and short guides.
   - Template READMEs, cookbook snippets, and FAQ.
   - Office hours, internal demos, and support workflows.
   - Track feedback and measure adoption.
Worked examples
Example 1: A minimal CLI to scaffold a new pipeline
# file: tools/cli.py
import argparse
import pathlib
import sys

TEMPLATE = {
    "README.md": "# {name}\n\nQuick start:\n\n- make setup\n- make test\n- make run\n",
    "src/{name}/__init__.py": "",
    "src/{name}/jobs/__init__.py": "",
    "src/{name}/jobs/extract.py": "print('extract step')\n",
    "tests/test_smoke.py": "def test_smoke():\n    assert True\n",
    "Makefile": "setup:\n\techo 'setup env'\n\nrun:\n\tPYTHONPATH=src python -m {name}.jobs.extract\n\ntest:\n\tpytest -q\n",
}


def create_pipeline(name: str, dest: str = ".") -> None:
    base = pathlib.Path(dest) / name
    base.mkdir(parents=True, exist_ok=True)
    for rel, content in TEMPLATE.items():
        path = base / rel.format(name=name)
        path.parent.mkdir(parents=True, exist_ok=True)
        path.write_text(content.format(name=name))
    print(f"Created pipeline scaffold at {base}")


if __name__ == "__main__":
    p = argparse.ArgumentParser()
    p.add_argument("name")
    args = p.parse_args()
    if not args.name.isidentifier():
        print("Error: name must be a valid identifier (letters, digits, underscore).", file=sys.stderr)
        sys.exit(1)
    create_pipeline(args.name)
Usage: python tools/cli.py sales_etl generates a runnable scaffold with tests and Makefile.
Example 2: CI pipeline that tests and deploys data assets
# .github/workflows/data-ci.yml
name: data-ci
on:
  pull_request:
  push:
    branches: [ main ]
jobs:
  test:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Set up Python
        uses: actions/setup-python@v5
        with: { python-version: '3.11' }
      - name: Install deps
        run: pip install -r requirements.txt
      - name: Lint & unit tests
        run: |
          flake8 src
          pytest -q
      - name: Data quality checks
        run: |
          python tools/run_dq_checks.py --env ci
  deploy:
    if: github.ref == 'refs/heads/main' && success()
    needs: [ test ]
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Deploy models
        run: |
          python tools/deploy.py --env prod
PRs run fast checks; merges to main trigger deployment after all gates pass.
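The tools/run_dq_checks.py script referenced in the workflow is not shown; a minimal version could express each rule as a function that returns failure messages. This sketch assumes rows arrive as a list of dicts (real platforms often wrap a framework such as Great Expectations or dbt tests instead):

```python
def check_no_null_keys(rows: list[dict], keys: list[str]) -> list[str]:
    """Fail rows where any primary-key column is null."""
    failures = []
    for i, row in enumerate(rows):
        for key in keys:
            if row.get(key) is None:
                failures.append(f"row {i}: null primary key '{key}'")
    return failures


def check_unique_keys(rows: list[dict], keys: list[str]) -> list[str]:
    """Fail rows whose key tuple was already seen earlier in the batch."""
    seen: set[tuple] = set()
    failures = []
    for i, row in enumerate(rows):
        k = tuple(row.get(key) for key in keys)
        if k in seen:
            failures.append(f"row {i}: duplicate key {k}")
        seen.add(k)
    return failures
```

Returning messages instead of raising lets the runner collect every failure across all checks, print them, and exit non-zero once, which is what the CI gate needs.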
Example 3: Local dev environment with docker-compose
# docker-compose.yml
services:
  postgres:
    image: postgres:15
    environment:
      POSTGRES_PASSWORD: example
    ports: ["5432:5432"]
    volumes:
      - pgdata:/var/lib/postgresql/data
  minio:
    image: quay.io/minio/minio:RELEASE.2023-12-02T09-19-22Z
    command: server /data --console-address ":9001"
    environment:
      MINIO_ROOT_USER: minio
      MINIO_ROOT_PASSWORD: minio123
    ports: ["9000:9000", "9001:9001"]
    volumes:
      - minio:/data
volumes:
  pgdata: {}
  minio: {}
# Makefile tasks
start:
	docker compose up -d
stop:
	docker compose down
seed:
	python tools/seed_local.py --pg postgresql://postgres:example@localhost:5432/postgres
One-liners to start, seed sample data, and run tests keep feedback loops fast.
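The tools/seed_local.py script in the Makefile is not shown; the data-generation half might look like the sketch below, which writes a deterministic CSV that a loader can then COPY into Postgres. The orders table, columns, and row counts are all illustrative assumptions:

```python
import csv
import random
from pathlib import Path


def write_seed_orders(path: Path, n: int = 100, seed: int = 42) -> int:
    """Write a deterministic sample orders CSV; returns the row count."""
    rng = random.Random(seed)  # fixed seed: every developer gets identical data
    with path.open("w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["order_id", "customer_id", "amount"])
        for i in range(n):
            writer.writerow([i, rng.randint(1, 20), round(rng.uniform(5, 500), 2)])
    return n
```

Seeding from a fixed random seed (rather than shipping static fixtures) keeps sample data small in Git while still reproducible across machines.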
Example 4: Standard library for safe, idempotent writes
# file: platform_std/io.py
from pathlib import Path

import pandas as pd


def write_parquet_atomic(df: pd.DataFrame, path: Path) -> None:
    # Write to a sibling temp file, then atomically rename over the target.
    tmp = path.with_suffix(".tmp.parquet")
    df.to_parquet(tmp, index=False)
    tmp.replace(path)


def upsert_by_keys(df: pd.DataFrame, path: Path, keys: list[str]) -> None:
    # keep='last' means incoming rows win over existing rows with the same keys.
    if path.exists():
        old = pd.read_parquet(path)
        merged = pd.concat([old, df]).drop_duplicates(subset=keys, keep="last")
    else:
        merged = df
    write_parquet_atomic(merged, path)
Publishing these helpers reduces copy-paste and enforces consistent behavior.
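Schema validation, listed alongside I/O in the standard-library step, can also start small. This sketch checks records against an expected column-to-type mapping; the function name and the list-of-dicts input shape are illustrative, not a platform API:

```python
def validate_schema(rows: list[dict], schema: dict[str, type]) -> list[str]:
    """Return problems found when checking rows against {column: type}."""
    problems = []
    for i, row in enumerate(rows):
        missing = schema.keys() - row.keys()
        if missing:
            problems.append(f"row {i}: missing columns {sorted(missing)}")
        for col, typ in schema.items():
            value = row.get(col)
            # None is treated as nullable here; pair with a null check if needed.
            if value is not None and not isinstance(value, typ):
                problems.append(
                    f"row {i}: '{col}' expected {typ.__name__}, got {type(value).__name__}"
                )
    return problems
```

Like the I/O helpers, publishing one blessed version of this check beats each team hand-rolling slightly different column validation.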
Example 5: Instrument feedback and adoption metrics
# file: tools/telemetry.py
import json
import os
import time
import uuid


def emit_event(event: str, props: dict) -> None:
    record = {
        "event": event,
        "props": props,
        "ts": int(time.time()),
        "trace_id": str(uuid.uuid4()),
    }
    # Append to a local log file; pipelines can forward this centrally.
    with open(os.environ.get("DX_EVENT_LOG", "dx_events.log"), "a") as f:
        f.write(json.dumps(record) + "\n")


# Usage in CLI after scaffolding success:
# emit_event("pipeline_created", {"template": "standard", "duration_ms": 2200})
Track time-to-first-success, template usage, and failure reasons to guide improvements.
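Reviewing weekly counts takes only a few lines over that event log. This sketch groups one event type by ISO week; the log format and event names follow the telemetry example above:

```python
import json
from collections import Counter
from datetime import datetime, timezone


def weekly_counts(log_path: str, event: str) -> Counter:
    """Count occurrences of one event per ISO week from a JSON-lines log."""
    counts: Counter = Counter()
    with open(log_path) as f:
        for line in f:
            record = json.loads(line)
            if record.get("event") != event:
                continue
            ts = datetime.fromtimestamp(record["ts"], tz=timezone.utc)
            year, week, _ = ts.isocalendar()
            counts[f"{year}-W{week:02d}"] += 1
    return counts

# e.g. weekly_counts("dx_events.log", "pipeline_created")
```

A flat trend in pipeline_created after a template release is itself a signal: either discovery is poor or the template misses real needs.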
Drills and exercises
- [ ] Build a scaffold that creates a pipeline with a single command and runs locally in < 5 minutes.
- [ ] Add pre-commit hooks that block commits without docs and tests.
- [ ] Create a CI workflow that runs unit tests and a simple data quality rule.
- [ ] Package one reusable function into a shared internal library and version it.
- [ ] Add a sample dataset and a make target to seed local dev.
- [ ] Instrument a metric for "pipeline_created" and review weekly counts.
Common mistakes and debugging tips
- Inconsistent scaffolds across teams
  Tip: Centralize templates and auto-apply updates via a small CLI command or code mod. Add a version file in projects and warn in CI if outdated.
- Slow CI wasting developer time
  Tip: Split checks: fast unit tests on PR, heavy integration on main or nightly. Cache dependencies and use selective test runs based on changed paths.
- Unreproducible local environments
  Tip: Pin versions in requirements and container images. Provide docker-compose and seed scripts. Document a one-command setup.
- Hidden platform errors
  Tip: Standardize logging format and levels. Include correlation IDs. Surface failure reasons early in CLI output and CI logs.
- No adoption measurement
  Tip: Emit events on template use, CI outcomes, and time-to-first-success. Review trends monthly and prioritize fixes that unblock the most users.
Mini project: From zero to deployed data pipeline with metrics
- Create a CLI command to scaffold a new pipeline with README, tests, and Makefile.
- Spin up local Postgres and object storage with docker-compose; seed sample data.
- Implement a small transform that reads CSV, writes Parquet with idempotent upserts.
- Add CI that runs unit tests and a simple data quality check (e.g., no nulls in primary keys).
- Instrument telemetry to log pipeline creation and first successful run duration.
- Write a short how-to doc with a runnable example.
Acceptance criteria
- New projects start with one command, succeed locally in < 10 minutes.
- CI blocks merge if tests or DQ checks fail.
- Library functions used (not duplicated) for I/O and idempotency.
- Telemetry file shows events for creation and success.
Subskills
- CLI And Templates For New Pipelines — One-command project creation with guardrails.
- Standard Project Scaffolding — Opinionated structure, hooks, and tasks.
- CI/CD For Data Assets — Automated tests, quality gates, and deploys.
- Local Dev And Testing Tooling — Reproducible containers, seed data, and envs.
- Standard Libraries And SDKs — Shared patterns for I/O, validation, and logging.
- Documentation And Examples — Runnable examples and concise guides.
- Platform Support And Enablement — Office hours, request triage, SLAs.
- Feedback And Adoption Metrics — Telemetry and surveys to improve DX.
Next steps
- Pick one team and pilot the full workflow. Collect feedback within two weeks.
- Document the top three friction points and ship fixes on a cadence.
- Scale by packaging your best practices into templates and libraries.
Take the skill exam