What you will learn and why it matters
Platform Engineering Foundations teaches you how to build an Internal Developer Platform (IDP) that gives product teams safe self-serve infrastructure, standard golden paths, and reliable delivery. You will learn how to define platform boundaries, publish templates and guardrails, measure reliability with SLOs, and collaborate with dev teams to reduce cognitive load and lead time.
Quick contrast: Platform Engineering vs. classic DevOps
- Platform Engineering focuses on building reusable products (IDP, templates, paved roads) for internal users (developers).
- DevOps culture is about collaboration; platform engineering implements it with product thinking, self-service, and guardrails.
- Outcome: faster, safer delivery with less variation and fewer handoffs.
Who this is for
- Engineers moving from SRE/DevOps to platform roles.
- Backend engineers tasked with building service templates and infrastructure self-service.
- Tech leads aiming to standardize delivery and reliability across teams.
Prerequisites
- Basic Git workflow and CI familiarity.
- Comfortable reading YAML/JSON, writing simple scripts.
- Intro-level cloud knowledge (compute, networking, IAM basics).
If you are missing prerequisites
Practice by creating a small app repo, adding a simple CI pipeline, and deploying a hello-world service to any cloud or container platform. Focus on repeatability and documentation.
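For the CI piece, a minimal pipeline is enough; a sketch using GitHub Actions (the same tool the examples below use), with placeholder commands:
# .github/workflows/ci.yml (practice pipeline; replace the run command with your stack's build and test)
name: ci
on: [push]
jobs:
  test:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - run: echo "build and test commands go here"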
Learning path (milestones)
- Define the platform — Identify internal users, top 3 use cases, boundaries, and non-goals.
- Self-serve provisioning — Provide a request-to-provision workflow with automation and approvals.
- Golden paths — Offer ready-to-use service templates with CI/CD and runtime defaults.
- Service catalog — Register services, owners, versions, and dependencies.
- Guardrails — Policies, pre-commit checks, and deployment rules; avoid blocking innovation.
- Reliability — Define SLOs, track error budgets, focus on user-impacting SLIs.
- Collaboration — Feedback loops, docs-as-code, office hours, platform roadmap.
Step 1: Platform charter
Write a one-page charter: target users, problems solved, supported stacks, and measurable outcomes (e.g., 30% faster service onboarding).
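A sketch of what that page might contain (the teams, stacks, and targets here are placeholders):
# docs/platform-charter.md (illustrative outline)
Target users: two product teams shipping containerized Node services
Problems solved: slow service onboarding, inconsistent CI, unclear ownership
Supported stacks: Node 20 containers, PostgreSQL, GitHub Actions, Kubernetes
Non-goals: bespoke infrastructure requests, owning application code
Measurable outcomes: 30% faster service onboarding, under one day to first deploy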
Step 2: MVP self-serve
Automate one path end-to-end (new service + database) with approvals and audit trail.
Step 3: Golden paths
Publish 2–3 templates with CI/CD, logging, metrics, and runbooks.
Step 4: Catalog
Require ownership metadata and standardized docs for every service.
Step 5: Guardrails
Add policy checks (cost, security, reliability) and provide clear remediation.
Step 6: Measure reliability
Adopt SLOs and review error budgets with teams monthly.
Worked examples
1) Minimal golden path template (containerized service)
Provide a starter with health checks, CI, and runtime defaults.
# service-template/ci/github-actions.yml
name: ci
on: [push]
jobs:
  build-test:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-node@v4
        with: { node-version: '20' }
      - run: npm ci && npm test -- --ci
      - run: docker build -t acme/${{ github.repository }}:${{ github.sha }} .
  security-scan:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - run: trivy fs --exit-code 1 --severity HIGH,CRITICAL .
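Note that the security-scan job assumes the trivy CLI is already installed on the runner; on a stock GitHub-hosted runner you would add an install step (or use Trivy's official GitHub Action) before that command.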
Include a Dockerfile with healthcheck and minimal runtime config:
# Dockerfile
FROM node:20-alpine
WORKDIR /app
COPY package*.json ./
RUN npm ci --omit=dev
COPY . .
HEALTHCHECK --interval=30s --timeout=3s CMD wget -qO- http://localhost:3000/health || exit 1
CMD ["npm","start"]
Why this helps
Teams get a production-ready baseline (tests, security scan, health checks) consistently, reducing setup time and risk.
2) Terraform module with guardrails
# modules/service/variables.tf
variable "service_name" { type = string }
variable "env" { type = string, validation { condition = contains(["dev","staging","prod"], var.env) error_message = "env must be dev|staging|prod" } }
variable "replicas" { type = number, default = 2, validation { condition = var.replicas >= 2 error_message = "At least 2 replicas for HA." } }
# modules/service/main.tf
resource "kubernetes_deployment" "svc" { /* ... */ }
resource "kubernetes_service" "svc" { /* ... */ }
Policy-as-code sketch (conceptual guardrail):
# Example policy idea (pseudocode)
deny if input.env == "prod" and input.replicas < 2
warn if imageTag == "latest"
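The image-tag rule can also be enforced directly in the Terraform module as a variable validation, here as a hard fail rather than a warning (the image_tag variable is hypothetical):
# modules/service/variables.tf (additional guardrail sketch)
variable "image_tag" {
  type = string
  validation {
    condition     = var.image_tag != "latest"
    error_message = "Pin a specific image tag; :latest is not allowed."
  }
}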
Why this helps
Guardrails encode reliability and security expectations without custom review every time.
3) Service catalog entity (YAML)
apiVersion: backstage.io/v1alpha1
kind: Component
metadata:
  name: payments-api
  description: Handles card charges
  tags: [node, payments]
  annotations:
    pagerduty.com/service-id: PD123
    ci.system: github-actions
spec:
  type: service
  owner: team-payments
  lifecycle: production
  providesApis: [payments]
Minimum signals: owner, description, lifecycle, on-call, CI system.
4) SLO and error budget math
# 99.9% monthly SLO (~43 minutes of budget in a 30-day month)
SLO = 99.9%
Total time in 30 days = 43,200 min
Error budget = (1 - 0.999) * 43,200 = 43.2 min
If an incident caused 20 min of SLI violation, remaining budget is ~23.2 min; consider freezing risky changes if budget depletes early.
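To make the SLI concrete, availability can be computed as good requests over total requests; a Prometheus recording-rule sketch (metric and label names are placeholders):
# prometheus/slo-rules.yml (illustrative; assumes an http_requests_total counter with a code label)
groups:
  - name: payments-api-slo
    rules:
      - record: slo:availability:ratio_rate5m
        expr: |
          sum(rate(http_requests_total{job="payments-api", code!~"5.."}[5m]))
          /
          sum(rate(http_requests_total{job="payments-api"}[5m]))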
5) Self-serve request-to-provision workflow
# .github/workflows/provision.yml
name: provision
on:
  workflow_dispatch:
    inputs:
      service_name: { description: 'Service name', required: true }
      env: { description: 'Environment', required: true, default: 'dev' }
jobs:
  plan-apply:
    runs-on: ubuntu-latest
    permissions:
      contents: read
      id-token: write
      issues: write   # the manual-approval action opens an issue to collect approvals
    steps:
      - uses: actions/checkout@v4
      - name: Validate inputs
        run: |
          [[ "${{ github.event.inputs.env }}" == "prod" ]] && echo "Requires approval" || true
      - name: Terraform Plan
        run: |
          terraform init
          terraform plan -var service_name=${{ github.event.inputs.service_name }} -var env=${{ github.event.inputs.env }}
      - name: Manual approval for prod
        if: ${{ github.event.inputs.env == 'prod' }}
        uses: trstringer/manual-approval@v1
        with:
          secret: ${{ github.token }}
          approvers: team-platform
      - name: Terraform Apply
        run: |
          terraform apply -auto-approve -var service_name=${{ github.event.inputs.service_name }} -var env=${{ github.event.inputs.env }}
Why this helps
Gives developers a safe, auditable button to provision resources, with approvals for sensitive environments. The id-token: write permission lets the job authenticate to the cloud with short-lived OIDC credentials rather than stored secrets (assuming the provider is configured for OIDC federation).
Drills and exercises
- Write a one-page platform charter describing users, problems, and top-3 paved roads.
- Create a minimal service template with CI, container build, and a health endpoint.
- Add two guardrails: minimum replicas in prod, and disallow :latest images.
- Register one service in a catalog with owner, lifecycle, and on-call annotations.
- Define an SLO (availability or latency), calculate monthly error budget, and propose 3 SLIs.
- Draft a workflow that provisions a database through automation with an approval gate.
Common mistakes and debugging tips
- Building for “everyone” — leads to complexity. Tip: start with 1–2 target teams and top-3 use cases.
- Too many hard blocks — slows teams. Tip: prefer warnings and education; reserve hard fails for critical policy.
- Templates without ownership — they rot. Tip: assign template owners, version them, and deprecate old ones clearly.
- Catalog without incentives — stays empty. Tip: make catalog entries a deploy prerequisite; provide a 5-minute form.
- SLOs on internal metrics only — they don’t reflect user pain. Tip: use user-facing SLIs (availability, latency, error rate).
- Ignoring docs — increases support load. Tip: docs-as-code in the template; require README sections to pass CI.
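For that last point, a small CI step usually suffices; a sketch with illustrative section names:
# ci step sketch: fail the build if README.md is missing required sections
- name: Check README sections
  run: |
    for section in "## Overview" "## Runbook" "## On-call"; do
      grep -q "^$section" README.md || { echo "README is missing '$section'"; exit 1; }
    done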
Debugging reliability issues
- Start from user impact (SLI dashboards), not system internals.
- Check recent changes and compare to error budget burn rate.
- Roll back quickly if SLO threatens to breach; run a blameless review.
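Burn rate is how fast you spend the budget relative to an exactly-on-SLO pace; for a 99.9% SLO, a sustained 1.44% error ratio is a 14.4x burn. An alerting sketch that builds on the recording rule above (threshold and names are illustrative):
# prometheus/burn-rate-alert.yml (illustrative; pairs with the slo:availability recording rule above)
groups:
  - name: payments-api-burn-rate
    rules:
      - alert: HighErrorBudgetBurn
        expr: (1 - slo:availability:ratio_rate5m) > (14.4 * 0.001)
        for: 5m
        labels:
          severity: page
        annotations:
          summary: "payments-api is burning error budget roughly 14x faster than sustainable"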
Mini project: Launch a starter Internal Developer Platform (IDP)
Goal: One paved road for a containerized API with self-serve provisioning, registered in a catalog, with basic guardrails and an SLO.
- Deliverables:
- Template repo (service + CI + Dockerfile + runbook sections).
- Provision workflow for dev and prod with approval.
- Catalog registration manifest with owner and on-call.
- Two policy checks (replica minimum, image tag control).
- SLO doc with SLI queries and error budget.
- Acceptance criteria:
- New service can be created and deployed with one command or click.
- Catalog shows owner and lifecycle within minutes of creation.
- Prod deploy fails if guardrails violated (replicas, image tag).
- Latency or availability SLI visible on a dashboard.
Hints
- Keep defaults opinionated: logging, metrics, healthcheck.
- Make rollback easy: versioned artifacts and deploy history.
- Document support: how to escalate, office hours, and known limitations.
Subskills
- Internal Platform Concept
- Self Serve Infrastructure Principles
- Golden Paths And Templates
- Platform Service Catalog
- Reliability And SLO Thinking
- Standardization And Guardrails
- Collaboration With Product And Dev Teams
Practical projects
- Service Bootstrapper CLI: A small CLI that scaffolds services from templates, validates metadata, and pushes to a new repo. Success = a new service passes CI and registers in the catalog automatically.
- Policy Pack: Package 4–6 policies (replicas, image tags, required labels, cost tags) with a CI job that posts friendly remediation tips. Success = developers can fix violations within 10 minutes.
- Error Budget Tracker: A script or dashboard that reads SLI data and shows monthly budget remaining per service. Success = weekly review identifies top budget burners.
Next steps
- Iterate with user feedback sessions; measure lead time to first deploy from a template.
- Add one more golden path (e.g., async worker) and one more environment (staging).
- Expand guardrails to include basic cost controls and secret scanning.