Who this is for
This lesson is for aspiring and current Platform Engineers, DevOps Engineers, SREs, and backend developers who want a clear, practical understanding of what an internal platform is and how to build one that developers love.
Prerequisites
- Basic understanding of software delivery: repos, CI/CD, environments.
- Familiarity with cloud or on-prem infrastructure concepts (compute, networking, IAM).
- Comfort using Git and reading YAML or similar config formats.
Why this matters
Platform Engineers remove friction from software delivery. A well-designed internal platform:
- Speeds up onboarding and feature delivery with self-service templates and paved paths.
- Improves reliability and security through consistent, reusable building blocks.
- Reduces cognitive load so product teams focus on business logic, not plumbing.
- Makes compliance and governance simpler with policy-by-default and auditability.
Real tasks in the profession:
- Design a Golden Path to create and deploy a new microservice with one command.
- Offer a self-service workflow for a temporary preview environment with cleanup policies.
- Define platform SLOs (e.g., CI queue wait times, deployment success rate) and report them.
- Run user research sessions with developers to prioritize platform improvements.
Concept explained simply
An internal platform is the curated, supported way your company builds, runs, and observes software. It combines tooling, standards, and workflows into a product for internal developers. The platform abstracts complexity (infrastructure, security, compliance) behind self-service interfaces and templates, so delivery is faster, safer, and more consistent.
Mental model
Think of the platform as a power grid for software teams:
- Centralized, reliable utilities (compute, CI/CD, observability, secrets) built once and shared by every team.
- Standard sockets (APIs, CLI, templates) so teams plug in without wiring the building.
- Meters and dashboards (SLOs, audits) to ensure quality and cost control.
Core principles
- Product mindset: treat developers as customers; discover needs and measure outcomes.
- Self-service first: reduce ticket-based work with safe, auditable automation.
- Paved paths (Golden Paths): opinionated, documented, and supported defaults that work end-to-end.
- Security and compliance by default: guardrails over gates; policy baked into templates and pipelines.
- Abstraction with escape hatches: hide complexity without blocking advanced use cases.
- Thin, modular platform: build the minimum that provides real value; integrate well instead of reinventing.
- Observable and measurable: define SLOs and track developer experience metrics.
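To make "abstraction with escape hatches" and "guardrails over gates" concrete, here is a minimal Python sketch under stated assumptions. The platform owns a default pipeline. Teams may insert their own steps or drop optional ones, but mandatory steps such as a security scan cannot be removed. The step names, DEFAULT_STEPS, MANDATORY_STEPS, and the PipelineOverrides shape are all hypothetical.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional

# Hypothetical paved-path pipeline owned by the platform team.
DEFAULT_STEPS: List[str] = ["build", "unit-test", "security-scan", "deploy-dev"]
# Guardrails: steps no team may remove, even through the escape hatch.
MANDATORY_STEPS = {"security-scan"}


@dataclass
class PipelineOverrides:
    """Escape hatch: custom steps to insert after a default step, plus optional removals."""
    insert_after: Dict[str, List[str]] = field(default_factory=dict)
    remove: List[str] = field(default_factory=list)


def compose_pipeline(overrides: Optional[PipelineOverrides] = None) -> List[str]:
    """Merge team overrides into the default pipeline without weakening guardrails."""
    if overrides is None:
        return list(DEFAULT_STEPS)
    blocked = set(overrides.remove) & MANDATORY_STEPS
    if blocked:
        raise ValueError(f"cannot remove mandatory steps: {sorted(blocked)}")
    unknown = (set(overrides.insert_after) | set(overrides.remove)) - set(DEFAULT_STEPS)
    if unknown:
        raise ValueError(f"unknown default steps: {sorted(unknown)}")
    pipeline: List[str] = []
    for step in DEFAULT_STEPS:
        if step not in overrides.remove:
            pipeline.append(step)
        # Custom steps keep their position even if their anchor step was removed.
        pipeline.extend(overrides.insert_after.get(step, []))
    return pipeline
```

For example, compose_pipeline(PipelineOverrides(insert_after={"unit-test": ["contract-test"]})) returns ["build", "unit-test", "contract-test", "security-scan", "deploy-dev"]. A request to remove "security-scan" raises ValueError. The override is validated automatically instead of waiting on a manual approval, which is the guardrail-over-gate idea.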
Scope and boundaries
What a platform typically offers:
- Service templates and generators (code + infra + pipeline scaffolding)
- Environment management (dev, preview, staging, prod) with safe promotions
- CI/CD orchestration and policy (build, test, security checks, deploy)
- Secrets, IAM, and identity integration
- Runtime standards (container base images, deployment patterns, rollback)
- Observability defaults (logs, metrics, traces, alerts)
- Cost and quota guardrails
What it should avoid:
- Owning product features or business logic
- Blocking flexibility without offering an escape hatch
- Rebuilding commodity tools when integration suffices
Checklist: Define your platform boundary
- State the customer: 'Internal developers and SREs building services X and Y'
- List must-have capabilities for Day 1 (e.g., service scaffolding, CI, deployment)
- List nice-to-haves for later (e.g., preview env auto-cleanup, cost dashboards)
- Write explicit non-goals (e.g., 'No support for on-prem this quarter')
- Identify escape hatches (e.g., custom pipeline step interface)
Worked examples
Example 1: Golden Path for a new API service
- Input: repo name, service type 'REST API', team ID.
- Output: bootstrapped repo with code template, Dockerfile, tests, IaC module, CI pipeline, deployment manifest.
- Policies: mandatory security scan step; base image pinned; least-privilege IAM.
- DX: 'create-service' CLI completes in under 5 minutes; initial deploy to dev auto-triggers.
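The Golden Path above can be sketched as a pure function that maps the three inputs to the files of a bootstrapped repo. This Python sketch rests on several hypothetical choices: the file layout, the pipeline YAML schema, the repo naming rule, and the placeholder image digest. Least-privilege IAM is left out. A real generator would more likely render maintained templates and push them to a Git host.

```python
import re
from typing import Dict

# Placeholder digest for illustration only; a real platform publishes and rotates a vetted one.
BASE_IMAGE = "python:3.12-slim@sha256:" + "0" * 64
SUPPORTED_TYPES = {"REST API"}
# Hypothetical naming rule: lowercase letters, digits, hyphens; 3 to 40 characters.
REPO_NAME_RE = re.compile(r"[a-z][a-z0-9-]{2,39}")


def scaffold_service(repo_name: str, service_type: str, team_id: str) -> Dict[str, str]:
    """Return {file path: contents} for a new service repository."""
    if not REPO_NAME_RE.fullmatch(repo_name):
        raise ValueError(f"invalid repo name: {repo_name!r}")
    if service_type not in SUPPORTED_TYPES:
        raise ValueError(f"unsupported service type: {service_type!r}")
    return {
        "README.md": f"# {repo_name}\n\nOwned by team {team_id}.\n",
        "app/main.py": "def health() -> dict:\n    return {'status': 'ok'}\n",
        "tests/test_health.py": (
            "from app.main import health\n\n"
            "def test_health():\n    assert health() == {'status': 'ok'}\n"
        ),
        # Policy: base image pinned by digest.
        "Dockerfile": f'FROM {BASE_IMAGE}\nCOPY . /srv\nCMD ["python", "-m", "app.main"]\n',
        "infra/service.tf": f"# IaC module stub for {repo_name} (team {team_id})\n",
        # Policy: the security scan is baked into the template, not opt-in.
        "ci/pipeline.yaml": "steps:\n  - build\n  - test\n  - security-scan\n  - deploy-dev\n",
        "deploy/manifest.yaml": f"service: {repo_name}\nteam: {team_id}\nenvironment: dev\n",
    }
```

Keeping generation free of side effects makes it easy to test and to preview in a CLI before anything is written.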
Example 2: Preview environment on pull request
- Trigger: PR labeled 'preview'.
- Automation: provision ephemeral stack (DB, API, frontend) with unique namespace.
- Guardrails: TTL 48h; size quotas; masked secrets.
- Teardown: on PR close or TTL expiry; artifacts persisted for audit.
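The trigger, TTL, and teardown rules above reduce to a small decision function that a scheduled controller could run for each pull request. This sketch assumes a hypothetical PullRequest record and namespace naming scheme. The actual provisioning, quotas, secret masking, and artifact persistence are out of scope.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Optional, Set

PREVIEW_LABEL = "preview"
TTL = timedelta(hours=48)


@dataclass
class PullRequest:
    number: int
    labels: Set[str]
    is_open: bool


def namespace_for(pr: PullRequest) -> str:
    # Unique, predictable namespace per PR so teardown can find everything it created.
    return f"preview-pr-{pr.number}"


def preview_action(pr: PullRequest, created_at: Optional[datetime], now: datetime) -> str:
    """Return 'provision', 'keep', 'teardown', or 'ignore' for one PR's preview environment."""
    if created_at is None:  # no environment exists yet
        wants_preview = pr.is_open and PREVIEW_LABEL in pr.labels
        return "provision" if wants_preview else "ignore"
    if not pr.is_open or now - created_at >= TTL:
        return "teardown"
    return "keep"
```

One caveat of this sketch: after a TTL teardown, a PR that still carries the label would be provisioned again on the next run. A real controller might remove the label or record the expiry to avoid that loop.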
Example 3: Platform SLOs and user feedback loop
- Define SLOs: CI queue wait < 2 min p95; deploy success > 99%; scaffold time < 5 min.
- Collect feedback: quarterly survey on onboarding time and friction points.
- Prioritize: reduce flaky tests in templates; cache dependencies to cut build times.
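The SLOs in Example 3 can be checked mechanically from raw samples. This sketch computes percentiles with the nearest-rank method, which is one common choice. The example does not say which percentile applies to scaffold time, so p95 is assumed there as well. The function and key names are illustrative.

```python
import math
from typing import Dict, List


def percentile_nearest_rank(values: List[float], pct: int) -> float:
    """Nearest-rank percentile: the smallest sample with at least pct% of samples <= it."""
    if not values:
        raise ValueError("no samples")
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))  # 1-based rank
    return ordered[rank - 1]


def evaluate_slos(
    queue_wait_s: List[float], deploy_ok: List[bool], scaffold_s: List[float]
) -> Dict[str, bool]:
    """Check the three Example 3 targets; True means the SLO is met."""
    if not deploy_ok:
        raise ValueError("no deployments recorded")
    success_rate = sum(deploy_ok) / len(deploy_ok)
    return {
        "ci_queue_wait_p95_lt_2min": percentile_nearest_rank(queue_wait_s, 95) < 120,
        "deploy_success_gt_99pct": success_rate > 0.99,
        # Assumption: scaffold time is also judged at p95.
        "scaffold_p95_lt_5min": percentile_nearest_rank(scaffold_s, 95) < 300,
    }
```

For instance, evaluate_slos([30] * 19 + [600], [True] * 199 + [False], [240, 250, 260]) reports all three SLOs as met. A single 10-minute queue wait out of 20 builds is tolerated at p95, and 199 of 200 deploys is a 99.5% success rate.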
Design your first platform slice (step-by-step)
- Interview 5–7 developers about their last deployment; map pain points.
- Choose one journey to smooth (e.g., new service creation).
- Define the MVP: one service template + CI + dev deploy + basic observability.
- Set SLOs and success metrics (lead time, deployment success, onboarding time).
- Ship to 1–2 pilot teams; fix issues; document the Golden Path clearly.
- Expand to more service types; add preview envs and guardrails.
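Of the success metrics in the steps above, lead time is usually the easiest to compute from data teams already have. This sketch treats lead time as commit-to-production time per change. That is a common definition, but this lesson does not fix one. The sketch compares a pilot team's median before and after adopting the Golden Path, and the Change tuple shape is an assumption.

```python
from datetime import datetime
from statistics import median
from typing import List, Tuple

Change = Tuple[datetime, datetime]  # (committed_at, deployed_to_prod_at)


def median_lead_time_hours(changes: List[Change]) -> float:
    """Median commit-to-production time, in hours."""
    hours = []
    for committed, deployed in changes:
        if deployed < committed:
            raise ValueError("deploy timestamp precedes commit")
        hours.append((deployed - committed).total_seconds() / 3600)
    return median(hours)


def improvement_pct(before: List[Change], after: List[Change]) -> float:
    """Percent reduction in median lead time (positive means faster delivery)."""
    b, a = median_lead_time_hours(before), median_lead_time_hours(after)
    return (b - a) / b * 100
```

The median is used here rather than the mean so that one unusually slow change does not dominate the pilot's result.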
Exercises
Complete these exercises. You can compare your answers with the solutions provided.
Exercise 1 — Define a platform MVP
Scenario: Your company has 20 services, slow onboarding (2 weeks), and manual deployments. Draft a 1-page MVP brief for the internal platform.
- State target users and their top 3 pain points.
- List MVP capabilities (3–5), non-goals, and SLOs.
- Describe one Golden Path workflow end-to-end.
Checklist before you finalize
- Users and pain points are explicit and evidence-based (from interviews).
- MVP capabilities are minimal yet end-to-end (scaffold → build → deploy → observe).
- Non-goals prevent scope creep this quarter.
- SLOs are measurable with existing tooling.
Exercise 2 — Golden Path spec
Create a concise spec for a 'create-and-deploy-service' Golden Path.
- Inputs, outputs, and required policies.
- Failure modes and rollback behavior.
- DX requirements (max time, minimal prompts, docs).
Checklist before you finalize
- Every step is automatable and idempotent.
- Security checks are embedded, not optional.
- Rollbacks and timeouts are defined for each stage.
- Success is observable (logs, metrics) with clear owners.
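The checklist's idempotent steps, per-stage timeouts, and rollback can be sketched as a small stage runner. This is one possible shape, not a model answer to the exercise. The Stage record, the log format, and the choice to roll back only fully completed stages are assumptions. Python threads also cannot be force-stopped, so a timed-out stage is abandoned rather than killed.

```python
import concurrent.futures
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Stage:
    name: str
    apply: Callable[[dict], None]     # should be idempotent: safe to re-run after a retry
    rollback: Callable[[dict], None]  # reverses a successful apply
    timeout_s: float


def run_golden_path(stages: List[Stage], state: dict) -> List[str]:
    """Run stages in order; on error or timeout, roll back completed stages in reverse order."""
    log: List[str] = []
    completed: List[Stage] = []
    pool = concurrent.futures.ThreadPoolExecutor(max_workers=1)
    try:
        for stage in stages:
            future = pool.submit(stage.apply, state)
            try:
                future.result(timeout=stage.timeout_s)
            except Exception as exc:  # also covers concurrent.futures.TimeoutError
                log.append(f"failed:{stage.name}:{type(exc).__name__}")
                for done in reversed(completed):
                    done.rollback(state)
                    log.append(f"rolled-back:{done.name}")
                return log
            completed.append(stage)
            log.append(f"ok:{stage.name}")
        return log
    finally:
        # wait=False: a timed-out stage keeps running in its thread; it is not killed.
        pool.shutdown(wait=False)
```

The returned log doubles as an observable record of each run. If a deploy stage fails after scaffold and build succeeded, the log ends with rolled-back:build followed by rolled-back:scaffold.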
Common mistakes and self-check
- Building a thick, do-everything platform before validating needs. Self-check: Can you name 3 validated pains solved by your MVP?
- No product ownership. Self-check: Do you have a backlog prioritized by developer impact?
- Gates instead of guardrails. Self-check: Are policies automated in pipelines/templates instead of manual approvals?
- Opaque operations. Self-check: Are SLOs published and visible to all teams?
- No escape hatches. Self-check: Can advanced users extend pipelines without forking everything?
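The "gates instead of guardrails" self-check is easiest to pass when each policy is a script the pipeline runs on every change. As one illustration, this Python sketch flags Dockerfile base images that are not pinned by sha256 digest, which is the "base image pinned" policy from Example 1. It handles the common FROM [--platform=...] image [AS name] forms and skips scratch and references to earlier build stages. It is not a full Dockerfile parser: line continuations and ARG substitution are ignored.

```python
import re
from typing import List

# Matches: FROM [--platform=<p>] <image> [AS <stage>]   (case-insensitive)
FROM_RE = re.compile(r"\s*FROM\s+(?:--platform=\S+\s+)?(\S+)(?:\s+AS\s+(\S+))?", re.IGNORECASE)
PINNED_RE = re.compile(r"@sha256:[0-9a-f]{64}$")


def unpinned_base_images(dockerfile: str) -> List[str]:
    """Return base images not pinned by digest; an empty list means the policy passes."""
    stages = {"scratch"}  # 'scratch' and earlier stage names are not registry images
    problems: List[str] = []
    for line in dockerfile.splitlines():
        match = FROM_RE.match(line)
        if not match:
            continue
        image, alias = match.group(1), match.group(2)
        if image.lower() not in stages and not PINNED_RE.search(image):
            problems.append(image)
        if alias:
            stages.add(alias.lower())
    return problems
```

A CI step could run this over every Dockerfile in a repo and fail the build when the list is non-empty. That wiring depends on the CI system, so it is left out here.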
Practical projects
- Build a service scaffold generator that creates code, CI pipeline, and deployment manifests in one command.
- Implement a preview environment workflow with automatic TTL cleanup and cost caps.
- Publish platform SLO dashboards and add alerts for CI queue saturation.
- Write a platform RFC describing boundaries, non-goals, and adoption plan.
Learning path
- Start: Internal Platform Concept (this lesson) → define MVP and Golden Paths.
- Next: Infrastructure as Code and Environment Management → make provisioning reproducible.
- Then: CI/CD Orchestration and Policy → embed quality and security checks.
- Later: Observability and SLOs → prove reliability and improve DX with data.
Next steps
- Finish Exercises 1–2 and refine them with a peer review.
- Pick one Practical Project and implement a minimal version.
Glossary
- Golden Path: An opinionated, supported way to complete a common task end-to-end.
- Escape Hatch: A controlled way to bypass defaults for advanced needs.
- Thin Platform: Minimal, composable capabilities that integrate existing tools.
- SLO: Service Level Objective; the reliability target the platform aims to meet.