Who this is for
- Data Platform Engineers who support internal users (data engineers, analysts, ML teams).
- Engineers setting up developer portals, templates, and golden paths for data work.
- Team leads creating support processes, SLOs, and enablement programs.
Prerequisites
- Familiarity with data platform components: storage (data lake/warehouse), orchestration (Airflow or similar), transformation (dbt or SQL), CI/CD basics.
- Basic incident response concepts (severity, escalation).
- Comfort with writing clear documentation and checklists.
Learning path
- Learn the difference between Support (reactive) and Enablement (proactive).
- Define support channels, intake forms, SLAs/SLOs, and a triage workflow.
- Create golden paths: templates, starter repos, and paved-road docs.
- Set DX metrics: time-to-first-pipeline, MTTR, deployment frequency, change failure rate.
- Roll out changes safely: versioning, deprecations, comms, migration guides.
- Observe, measure, and iterate with feedback loops and office hours.
Why this matters
Real tasks you will do as a Data Platform Engineer:
- Unblock teams when pipelines fail, without becoming a bottleneck.
- Provide a paved path so new projects ship in hours, not weeks.
- Maintain SLOs (e.g., incident response and resolution) and reduce MTTR.
- Manage support channels (tickets, chat, office hours) with clear priorities.
- Publish migration guides and runbooks for safe platform upgrades.
- Track adoption and DX metrics to guide the platform roadmap.
Concept explained simply
Platform Support is the help desk for your data platform: you triage issues, fix urgent problems, and keep the lights on.
Platform Enablement is the coach: you give teams tools, templates, docs, and training so they can move fast without you.
Mental model
- Runway: Golden paths and templates help teams take off safely.
- Control tower: Triage and SLOs coordinate traffic and incidents.
- Toolbox: Starter repos, runbooks, and cookbooks solve common jobs.
- Radar: Telemetry and feedback loops show where to improve next.
Core components of Support and Enablement
- Support channels and intake: ticket form with mandatory fields (impact, severity, steps tried, logs).
- SLAs/SLOs: target response/resolution times by severity; publish clearly.
- Triage workflow (IDEAL): Intake → Diagnose → Empower → Automate → Learn.
- Golden paths: opinionated templates for ingestion, transformation, and CI/CD.
- Runbooks: step-by-step guides for common incidents and operations.
- Documentation: short, task-focused, with copy-paste commands and screenshots.
- Training: office hours, onboarding labs, and 30–60 minute enablement sessions.
- Observability: dashboards for job success rate, queue health, and user adoption.
- Change management: versioning, feature flags, deprecation windows, migration guides.
- DX metrics: time-to-first-pipeline, MTTR, deployment frequency, change failure rate, ticket backlog age.
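To make the intake and SLO components concrete, here is a minimal sketch of a ticket schema and per-severity SLO targets in Python. The field names and target times are illustrative assumptions, not a standard; set your own.

```python
from dataclasses import dataclass, field
from datetime import timedelta

# Hypothetical intake ticket carrying the mandatory fields named above.
@dataclass
class SupportTicket:
    environment: str          # e.g. "prod", "staging"
    severity: int             # 1 (highest) to 4
    business_impact: str      # who is blocked and what it costs
    error_snippet: str        # exact error text or log excerpt
    last_success: str         # timestamp of the last good run
    steps_tried: list[str] = field(default_factory=list)

# Illustrative SLO targets per severity: (first response, resolution).
SLO_TARGETS = {
    1: (timedelta(minutes=15), timedelta(hours=4)),
    2: (timedelta(hours=1), timedelta(days=1)),
    3: (timedelta(hours=4), timedelta(days=3)),
    4: (timedelta(days=1), timedelta(days=5)),
}
```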
Worked examples
Example 1: Triage a failing pipeline
- Intake: Confirm severity (users blocked?), capture logs, pipeline ID, last success time.
- Diagnose: Check platform health dashboard; compare recent changes (deploys, quotas).
- Empower: Share the minimal fix teams can do (e.g., lower parallelism, clear stuck run).
- Automate: Add an alert to catch quota breaches earlier.
- Learn: Update the runbook with the exact error signature and resolution steps.
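For the Automate step, a minimal sketch of the quota alert. `get_pool_usage` and `notify` are hypothetical hooks into your metrics backend and chat or paging tool; the threshold and pool name are placeholders.

```python
from datetime import datetime, timezone

QUOTA_ALERT_THRESHOLD = 0.85  # alert before the pool is exhausted

def check_compute_quota(get_pool_usage, notify):
    """Alert when shared-pool usage crosses the threshold.

    `get_pool_usage` returns utilization as a 0.0-1.0 float;
    `notify` posts to your support channel. Both are assumed hooks.
    """
    usage = get_pool_usage("shared-compute-pool")
    if usage >= QUOTA_ALERT_THRESHOLD:
        notify(
            channel="#data-platform-support",
            message=(
                f"[{datetime.now(timezone.utc):%H:%M} UTC] shared-compute-pool "
                f"at {usage:.0%}: quota breach likely. See runbook: quota-exceeded."
            ),
        )
```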
Ready-to-use comms template
Update format: Context, Impact, Next update time, Workaround, Owner. Send an update every 30–60 minutes for Sev-1.
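Filled in, a Sev-1 update might look like this; every detail below is an invented placeholder:

```
[Sev-1] Ingestion outage — shared compute pool
Context: Shared pool hit quota at 09:12 UTC; all prod ingestion DAGs failing.
Impact: Revenue dashboard stale; 3 teams blocked.
Workaround: None yet; backfill will run after recovery.
Owner: @oncall-data-platform
Next update: 09:45 UTC
```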
Example 2: Build a golden path for ingestion
- Goal: Ingest a table from a source into the lakehouse daily, with schema checks and data quality tests.
- Template: Starter repo with Airflow DAG, dbt model, and CI for tests.
- Docs: One-page guide: prerequisites, step-by-step setup, troubleshooting.
- Metric: Target time-to-first-pipeline under 2 hours for a new team member.
Golden path structure
- /template-ingestion: DAG, connection config, sample tests
- Checklist: access, secrets, naming conventions, data contracts
- Validation: execute "make validate" before the first run
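To make the starter repo concrete, here is a minimal sketch of the kind of DAG /template-ingestion could ship with, assuming Airflow 2.4+ and dbt. The DAG ID, script path, schedule, and model selector are placeholders for a team to replace.

```python
from datetime import datetime, timedelta

from airflow import DAG
from airflow.operators.bash import BashOperator

# Placeholder names: adjust source, dbt selector, and schedule per project.
with DAG(
    dag_id="ingest_orders_daily",
    start_date=datetime(2024, 1, 1),
    schedule="@daily",
    catchup=False,
    default_args={"retries": 2, "retry_delay": timedelta(minutes=5)},
) as dag:
    ingest = BashOperator(
        task_id="ingest_source",
        bash_command="python scripts/ingest.py --source orders",
    )
    transform = BashOperator(
        task_id="dbt_run_and_test",
        bash_command="dbt build --select staging.orders",  # runs models + tests
    )
    ingest >> transform  # schema checks and quality tests gate the table
```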
Example 3: Safe deprecation of a connector
- Plan: Provide v2 with adapters; keep v1 for 60 days.
- Comms: Announce immediately, send weekly reminders, and give a 7-day final notice.
- Safety: Feature flag and dual-run option for high-risk teams.
- Guide: Migration steps with code diff examples and rollback steps.
- Success: 0 Sev-1 incidents and 90% migration by day 45.
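One way the feature flag and dual-run option might be wired up; the flag names, connector entry points, and comparison hook are all assumptions, not a real library API.

```python
# Hypothetical flag check and dual-run wrapper for the connector migration.
FLAGS = {"use_connector_v2": False, "dual_run": True}  # per-team config

def run_connector(payload, v1_run, v2_run, compare):
    """Route traffic through v1/v2 based on flags.

    `v1_run`, `v2_run`, and `compare` are placeholders for the real
    connector entry points and a row-count/checksum comparison.
    """
    if FLAGS["dual_run"]:
        result_v1 = v1_run(payload)
        result_v2 = v2_run(payload)
        compare(result_v1, result_v2)  # log divergence; don't fail the run
        return result_v1               # v1 stays authoritative until cutover
    if FLAGS["use_connector_v2"]:
        return v2_run(payload)
    return v1_run(payload)
```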
Example 4: Intake form and priority
- Fields: environment, severity, business impact, error snippet, last success, changes made, steps tried.
- Priority matrix: P1 = production outage; P2 = production degraded; P3 = non-prod blocked; P4 = request/question.
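The matrix translates directly into code. A minimal sketch, with input names carried over from the intake-form sketch earlier (they are assumptions, not a standard schema):

```python
def assign_priority(environment: str, blocked: bool, degraded: bool) -> str:
    """Map intake fields to the P1-P4 matrix above."""
    if environment == "prod" and blocked:
        return "P1"  # production outage
    if environment == "prod" and degraded:
        return "P2"  # production degraded
    if blocked:
        return "P3"  # non-prod blocked
    return "P4"      # request or question

# Example: a degraded-but-running production pipeline lands at P2.
assert assign_priority("prod", blocked=False, degraded=True) == "P2"
```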
Exercises
Do these now. They mirror the graded exercises below.
Exercise 1: Write a Tier-1 incident runbook outline
Scenario: Production pipelines across multiple teams are failing with a shared compute pool error.
- Create a runbook outline with sections: Scope, Triggers, First checks, Containment, Root-cause paths, Comms template, Escalation matrix, Verification, Post-incident tasks.
What good looks like
- Clear, skimmable steps (numbered).
- Concrete commands/locations (dashboards, logs).
- Time-boxed checkpoints (e.g., escalate if no resolution in 20 minutes).
Exercise 2: Triage and prioritize a support queue
Tickets:
- A. Non-prod job failed overnight; workaround exists; small team.
- B. Production ingestion down for a revenue dashboard; no workaround.
- C. Access request for a new project; unblock within 2 days.
- D. Question about best practices for dbt testing.
Task: Assign priority (P1–P4) and channel (ticket, chat, office hours) for each, and write one-line justification.
Checklists
Daily support rotation checklist
- Review Sev-1 and Sev-2 queue; acknowledge within SLA.
- Scan platform health dashboard; note anomalies.
- Post daily status in support channel: top risks and mitigations.
- Tag product owners on any blocked deliverables.
- Update open incident tickets with next update time.
Enablement weekly checklist
- Measure time-to-first-pipeline and MTTR; log trends (a measurement sketch follows this list).
- Identify the top two recurring issues; propose automation or a docs upgrade.
- Add one improvement to a golden path template.
- Host office hours; capture FAQs and update docs.
- Review change calendar for upcoming migrations.
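For the MTTR measurement above, a minimal sketch assuming incidents are stored as (opened, resolved) timestamp pairs; the sample records are invented placeholders.

```python
from datetime import datetime, timedelta

# Hypothetical incident records: (opened, resolved) timestamps.
incidents = [
    (datetime(2024, 5, 6, 9, 0), datetime(2024, 5, 6, 11, 30)),
    (datetime(2024, 5, 8, 14, 0), datetime(2024, 5, 8, 14, 45)),
]

def mttr(records) -> timedelta:
    """Mean time to restore: average of (resolved - opened)."""
    durations = [resolved - opened for opened, resolved in records]
    return sum(durations, timedelta()) / len(durations)

print(mttr(incidents))  # 1:37:30 for the sample above
```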
Common mistakes and self-check
- Mistake: Only firefighting; no enablement. Fix: Reserve weekly time for templates, docs, and automation.
- Mistake: Vague intake. Fix: Mandatory fields and examples of good tickets.
- Mistake: Hidden SLAs. Fix: Publish SLOs and show them on dashboards.
- Mistake: Breaking changes with no rollback. Fix: Feature flags, dual-run, and clear sunset dates.
- Mistake: Overlong docs. Fix: Short task pages with copy-paste blocks and a Troubleshooting section.
Self-check
- Can a new engineer ship a pipeline in under 2 hours using your golden path?
- Are incident updates posted every 30–60 minutes for P1?
- Do you know last week’s MTTR and top recurring issue?
- Do your runbooks include escalation and verification steps?
Practical projects
- Build a "time-to-first-pipeline" starter kit: repo, one-page guide, and CI checks.
- Create a support intake form and triage SOP using the IDEAL workflow.
- Write and validate two runbooks: "Cluster quota exceeded" and "Credential rotation failure." Run tabletop drills.
Next steps
- Instrument DX metrics and add them to a team dashboard.
- Pick one recurring issue and automate the first fix step.
- Plan one enablement session with a clear before/after success metric.
Mini challenge
In 30 minutes, draft a "golden path" one-pager for adding a new source to your lake or warehouse. Include prerequisites, 5–7 steps, validation, and rollback. Share with a teammate and ask them to try it verbatim.