Platform Engineering Foundations

Learn Platform Engineering Foundations for free as a platform engineer: a roadmap, worked examples, subskills, and a skill exam.

Published: January 23, 2026 | Updated: January 23, 2026

What you will learn and why it matters

Platform Engineering Foundations teaches you how to build an Internal Developer Platform (IDP) that gives product teams safe self-serve infrastructure, standard golden paths, and reliable delivery. You will learn how to define platform boundaries, publish templates and guardrails, measure reliability with SLOs, and collaborate with dev teams to reduce cognitive load and lead time.

Quick contrast: Platform Engineering vs. classic DevOps
  • Platform Engineering focuses on building reusable products (IDP, templates, paved roads) for internal users (developers).
  • DevOps culture is about collaboration; platform engineering implements it with product thinking, self-service, and guardrails.
  • Outcome: faster, safer delivery with less variation and fewer handoffs.

Who this is for

  • Engineers moving from SRE/DevOps to platform roles.
  • Backend engineers tasked with building service templates and infrastructure self-service.
  • Tech leads aiming to standardize delivery and reliability across teams.

Prerequisites

  • Basic Git workflow and CI familiarity.
  • Comfortable reading YAML/JSON and writing simple scripts.
  • Intro-level cloud knowledge (compute, networking, IAM basics).

If you are missing prerequisites

Practice by creating a small app repo, adding a simple CI pipeline, and deploying a hello-world service to any cloud or container platform. Focus on repeatability and documentation.

Learning path (milestones)

  1. Define the platform — Identify internal users, top 3 use cases, boundaries, and non-goals.
  2. Self-serve provisioning — Provide request-to-provision workflow with automation and approvals.
  3. Golden paths — Offer ready-to-use service templates with CI/CD and runtime defaults.
  4. Service catalog — Register services, owners, versions, and dependencies.
  5. Guardrails — Policies, pre-commit checks, and deployment rules; avoid blocking innovation.
  6. Reliability — Define SLOs, track error budgets, focus on user-impacting SLIs.
  7. Collaboration — Feedback loops, docs-as-code, office hours, platform roadmap.

Step 1: Platform charter

Write a one-page charter: target users, problems solved, supported stacks, and measurable outcomes (e.g., 30% faster service onboarding).

Step 2: MVP self-serve

Automate one path end-to-end (new service + database) with approvals and audit trail.

Step 3: Golden paths

Publish 2–3 templates with CI/CD, logging, metrics, and runbooks.

Step 4: Catalog

Require ownership metadata and standardized docs for every service.

Step 5: Guardrails

Add policy checks (cost, security, reliability) and provide clear remediation.

Step 6: Measure reliability

Adopt SLOs and review error budgets with teams monthly.

Worked examples

1) Minimal golden path template (containerized service)

Provide a starter with health checks, CI, and runtime defaults.

# service-template/.github/workflows/ci.yml (GitHub only runs workflows stored under .github/workflows/)
name: ci
on: [push]
jobs:
  build-test:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-node@v4
        with: { node-version: '20' }
      - run: npm ci && npm test -- --ci
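      - name: Build image
        run: |
          # GITHUB_REPOSITORY is "owner/repo"; keep only the repo part and lowercase it,
          # because Docker rejects uppercase letters in image names.
          name="${GITHUB_REPOSITORY##*/}"
          docker build -t "acme/${name,,}:${GITHUB_SHA}" .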
  security-scan:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
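      # Trivy is not preinstalled on GitHub-hosted runners, so run it through its action.
      - uses: aquasecurity/trivy-action@master  # pin to a release tag or commit SHA in real use
        with:
          scan-type: fs
          scan-ref: .
          exit-code: '1'
          severity: HIGH,CRITICAL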

Include a Dockerfile with healthcheck and minimal runtime config:

# Dockerfile
FROM node:20-alpine
WORKDIR /app
COPY package*.json ./
RUN npm ci --omit=dev
COPY . .
HEALTHCHECK --interval=30s --timeout=3s CMD wget -qO- http://localhost:3000/health || exit 1
CMD ["npm","start"]

Why this helps

Teams get a production-ready baseline (tests, security scan, health checks) consistently, reducing setup time and risk.

2) Terraform module with guardrails

# modules/service/variables.tf
variable "service_name" { type = string }
variable "env" {
  type = string
  validation {
    condition     = contains(["dev", "staging", "prod"], var.env)
    error_message = "The env value must be dev, staging, or prod."
  }
}
variable "replicas" {
  type    = number
  default = 2
  validation {
    condition     = var.replicas >= 2
    error_message = "At least 2 replicas are required for HA."
  }
}
# modules/service/main.tf
resource "kubernetes_deployment" "svc" { /* ... */ }
resource "kubernetes_service" "svc" { /* ... */ }

Policy-as-code sketch (conceptual guardrail):

# Example policy idea (pseudocode)
deny if input.env == "prod" and input.replicas < 2
warn if input.image_tag == "latest"

Why this helps

Guardrails encode reliability and security expectations without custom review every time.
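
The two rules from the policy sketch can be expressed as a small function. This is a minimal Python illustration, not a real policy engine: the names `PolicyResult`, `image_tag`, and `evaluate_policy`, and the request fields `env`, `replicas`, and `image`, are assumptions made for this example.

```python
from dataclasses import dataclass, field


@dataclass
class PolicyResult:
    """Outcome of evaluating one deployment request."""
    denials: list[str] = field(default_factory=list)
    warnings: list[str] = field(default_factory=list)

    @property
    def allowed(self) -> bool:
        # A request passes when nothing denied it; warnings never block.
        return not self.denials


def image_tag(image: str) -> str:
    """Return the tag of an image reference, or 'latest' when no tag is given."""
    # Inspect only the last path segment so a registry port
    # (for example host:5000/app) is not mistaken for a tag.
    last_segment = image.rsplit("/", 1)[-1]
    if ":" in last_segment:
        return last_segment.rsplit(":", 1)[1]
    return "latest"


def evaluate_policy(request: dict) -> PolicyResult:
    """Apply the hard rule (deny) and the soft rule (warn) to one request."""
    result = PolicyResult()
    if request["env"] == "prod" and request["replicas"] < 2:
        result.denials.append("prod requires at least 2 replicas; set replicas >= 2")
    if image_tag(request["image"]) == "latest":
        result.warnings.append("image uses the 'latest' tag; pin a version or commit SHA")
    return result


if __name__ == "__main__":
    outcome = evaluate_policy(
        {"env": "prod", "replicas": 1, "image": "acme/payments-api:latest"}
    )
    print("allowed:", outcome.allowed)
    for message in outcome.denials + outcome.warnings:
        print("-", message)
```

Each message states the remediation next to the violation, so a developer can fix the request without asking the platform team what went wrong.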

3) Service catalog entity (YAML)

apiVersion: backstage.io/v1alpha1
kind: Component
metadata:
  name: payments-api
  description: Handles card charges
  tags: [node, payments]
  annotations:
    pagerduty.com/service-id: PD123
    ci.system: github-actions
spec:
  type: service
  owner: team-payments
  lifecycle: production
  providesApis: [payments]

Minimum signals: owner, description, lifecycle, on-call, CI system.
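
A catalog can enforce these minimum signals with a small check at registration time. The sketch below is a hedged Python illustration that works on an already-parsed entity (a plain dict); the function name `missing_signals` and the mapping of "on-call" and "CI system" to the two annotations shown in the example entity are assumptions for this example.

```python
# Where each minimum signal is expected to live inside the parsed entity.
# The annotation keys mirror the example entity; adjust them to your catalog.
REQUIRED_FIELDS = {
    "owner": ("spec", "owner"),
    "description": ("metadata", "description"),
    "lifecycle": ("spec", "lifecycle"),
    "on-call": ("metadata", "annotations", "pagerduty.com/service-id"),
    "CI system": ("metadata", "annotations", "ci.system"),
}


def missing_signals(entity: dict) -> list[str]:
    """Return the names of required signals that are absent or empty."""
    missing = []
    for signal, path in REQUIRED_FIELDS.items():
        node = entity
        for key in path:
            # Walk one level down; stop as soon as the path cannot be followed.
            node = node.get(key) if isinstance(node, dict) else None
            if node is None:
                break
        if not node:
            missing.append(signal)
    return missing


entity = {
    "apiVersion": "backstage.io/v1alpha1",
    "kind": "Component",
    "metadata": {
        "name": "payments-api",
        "description": "Handles card charges",
        "annotations": {
            "pagerduty.com/service-id": "PD123",
            "ci.system": "github-actions",
        },
    },
    "spec": {"type": "service", "owner": "team-payments", "lifecycle": "production"},
}

if __name__ == "__main__":
    gaps = missing_signals(entity)
    print("ok" if not gaps else f"missing: {', '.join(gaps)}")
```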

4) SLO and error budget math

# 99.9% monthly SLO (43.2 min budget over a 30-day window)
SLO = 99.9%
Total time in 30 days = 43,200 min
Error budget = (1 - 0.999) * 43,200 = 43.2 min

If an incident caused 20 min of SLI violation, remaining budget is ~23.2 min; consider freezing risky changes if budget depletes early.
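
The same arithmetic as a short Python sketch, assuming a time-based availability SLO over a fixed 30-day window. The function names are hypothetical.

```python
def error_budget_minutes(slo: float, window_days: int = 30) -> float:
    """Minutes of SLI violation allowed by an availability SLO over the window."""
    if not 0 < slo < 1:
        raise ValueError("slo must be a fraction between 0 and 1, e.g. 0.999")
    total_minutes = window_days * 24 * 60
    return (1 - slo) * total_minutes


def remaining_budget_minutes(slo: float, bad_minutes: float, window_days: int = 30) -> float:
    """Budget left after subtracting the minutes already spent in violation."""
    return error_budget_minutes(slo, window_days) - bad_minutes


if __name__ == "__main__":
    budget = error_budget_minutes(0.999)
    remaining = remaining_budget_minutes(0.999, bad_minutes=20)
    print(f"budget={budget:.1f} min, remaining={remaining:.1f} min")
    # prints: budget=43.2 min, remaining=23.2 min
```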

5) Self-serve request-to-provision workflow

# .github/workflows/provision.yml
name: provision
on:
  workflow_dispatch:
    inputs:
      service_name: { description: 'Service name', required: true }
      env: { description: 'Environment', required: true, default: 'dev' }
jobs:
  plan-apply:
    runs-on: ubuntu-latest
    permissions:
      contents: read
      id-token: write
      issues: write   # the approval step tracks sign-off in an issue
    env:
      # Pass inputs as TF_VAR_* variables instead of interpolating them into shell commands.
      TF_VAR_service_name: ${{ inputs.service_name }}
      TF_VAR_env: ${{ inputs.env }}
    steps:
      - uses: actions/checkout@v4
      - uses: hashicorp/setup-terraform@v3
      - name: Validate inputs
        run: |
          case "$TF_VAR_env" in
            dev|staging|prod) ;;
            *) echo "env must be dev, staging, or prod" >&2; exit 1 ;;
          esac
      - name: Terraform Plan
        run: |
          terraform init -input=false
          terraform plan -input=false -out=tfplan
      - name: Manual approval for prod
        if: ${{ inputs.env == 'prod' }}
        uses: trstringer/manual-approval@v1
        with:
          secret: ${{ github.token }}
          approvers: team-platform
      - name: Terraform Apply
        # Apply the saved plan so that what was reviewed is exactly what gets applied.
        run: terraform apply -input=false tfplan

Why this helps

Gives developers a safe, auditable button to provision resources with approvals for sensitive environments.

Drills and exercises

  • Write a one-page platform charter describing users, problems, and top-3 paved roads.
  • Create a minimal service template with CI, container build, and a health endpoint.
  • Add two guardrails: minimum replicas in prod, and disallow :latest images.
  • Register one service in a catalog with owner, lifecycle, and on-call annotations.
  • Define an SLO (availability or latency), calculate monthly error budget, and propose 3 SLIs.
  • Draft a workflow that provisions a database through automation with an approval gate.

Common mistakes and debugging tips

  • Building for “everyone” — leads to complexity. Tip: start with 1–2 target teams and top-3 use cases.
  • Too many hard blocks — slows teams. Tip: prefer warnings and education; reserve hard fails for critical policy.
  • Templates without ownership — they rot. Tip: assign template owners, version them, and deprecate old ones clearly.
  • Catalog without incentives — stays empty. Tip: make catalog entries a deploy prerequisite; provide a 5-minute form.
  • SLOs on internal metrics only — don’t reflect user pain. Tip: use user-facing SLIs (availability, latency, error rate).
  • Ignoring docs — increases support load. Tip: docs-as-code in the template; require README sections to pass CI.
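
The docs tip (require README sections to pass CI) could be implemented with a check like the one below. This is a minimal Python sketch; the section titles in `REQUIRED_SECTIONS` and the function name are assumptions, and it matches only `#`-style Markdown headings without skipping code fences.

```python
import re

# Hypothetical section titles; choose the ones your platform requires.
REQUIRED_SECTIONS = ["Overview", "Runbook", "On-call", "SLOs"]

# Matches "# Title" through "###### Title", ignoring optional trailing #s.
HEADING_PATTERN = re.compile(r"^#{1,6}[ \t]+(.+?)[ \t]*#*[ \t]*$", flags=re.MULTILINE)


def missing_readme_sections(readme_text: str, required=REQUIRED_SECTIONS) -> list[str]:
    """Return the required section titles that have no matching heading."""
    headings = {
        match.group(1).strip().lower()
        for match in HEADING_PATTERN.finditer(readme_text)
    }
    return [title for title in required if title.lower() not in headings]


if __name__ == "__main__":
    sample = "# payments-api\n\n## Overview\nText\n\n## Runbook\nSteps\n"
    gaps = missing_readme_sections(sample)
    if gaps:
        # In CI, a non-zero exit code here would fail the job.
        print("README is missing sections:", ", ".join(gaps))
```

Listing the missing titles in the failure message keeps the check educational rather than a bare block.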

Debugging reliability issues
  • Start from user impact (SLI dashboards), not system internals.
  • Check recent changes and compare to error budget burn rate.
  • Roll back quickly if the SLO is at risk of being breached; run a blameless review afterwards.
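
To compare a recent change against the error budget burn rate, a common definition is the observed error rate divided by the error rate the SLO allows. A burn rate of 1 spends the budget exactly over the window; higher values spend it faster. The Python sketch below assumes a request-based SLI and a 30-day window; the function names and the sample numbers are illustrative only.

```python
def burn_rate(bad_events: int, total_events: int, slo: float) -> float:
    """Observed error rate relative to the error rate the SLO allows."""
    if total_events <= 0:
        raise ValueError("total_events must be positive")
    if not 0 < slo < 1:
        raise ValueError("slo must be a fraction between 0 and 1, e.g. 0.999")
    observed_error_rate = bad_events / total_events
    allowed_error_rate = 1 - slo
    return observed_error_rate / allowed_error_rate


def hours_until_exhausted(rate: float, window_days: int = 30,
                          budget_remaining: float = 1.0) -> float:
    """Hours until the remaining budget fraction is spent if the rate holds."""
    if rate <= 0:
        return float("inf")
    return budget_remaining * window_days * 24 / rate


if __name__ == "__main__":
    # Illustrative numbers: 50 failed requests out of 10,000 against a 99.9% SLO.
    rate = burn_rate(bad_events=50, total_events=10_000, slo=0.999)
    print(f"burn rate: {rate:.1f}x, budget gone in about {hours_until_exhausted(rate):.0f} h")
```

With these sample numbers the service burns budget about five times faster than sustainable, which supports rolling back before investigating further.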

Mini project: Launch a starter Internal Developer Platform (IDP)

Goal: One paved road for a containerized API with self-serve provisioning, registered in a catalog, with basic guardrails and an SLO.

  • Deliverables:
    • Template repo (service + CI + Dockerfile + runbook sections).
    • Provision workflow for dev and prod with approval.
    • Catalog registration manifest with owner and on-call.
    • Two policy checks (replica minimum, image tag control).
    • SLO doc with SLI queries and error budget.
  • Acceptance criteria:
    • New service can be created and deployed with one command or click.
    • Catalog shows owner and lifecycle within minutes of creation.
    • Prod deploy fails if guardrails are violated (replicas, image tag).
    • Latency or availability SLI visible on a dashboard.

Hints
  • Keep defaults opinionated: logging, metrics, healthcheck.
  • Make rollback easy: versioned artifacts and deploy history.
  • Document support: how to escalate, office hours, and known limitations.

Subskills

  • Internal Platform Concept
  • Self Serve Infrastructure Principles
  • Golden Paths And Templates
  • Platform Service Catalog
  • Reliability And SLO Thinking
  • Standardization And Guardrails
  • Collaboration With Product And Dev Teams

Practical projects

  • Service Bootstrapper CLI: A small CLI that scaffolds services from templates, validates metadata, and pushes to a new repo. Success = a new service passes CI and registers in the catalog automatically.
  • Policy Pack: Package 4–6 policies (replicas, image tags, required labels, cost tags) with a CI job that posts friendly remediation tips. Success = developers can fix violations within 10 minutes.
  • Error Budget Tracker: A script or dashboard that reads SLI data and shows monthly budget remaining per service. Success = weekly review identifies top budget burners.
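
As a starting point for the Service Bootstrapper CLI, the core of such a tool is a function that validates metadata and renders files from a template. The Python sketch below is one possible shape, not a reference design: the naming rule, the `scaffold` function, the generated file names, and the `experimental` lifecycle default are all assumptions. Pushing to a new repo and catalog registration are left out.

```python
import json
import re
import tempfile
from pathlib import Path
from string import Template

# Assumed naming rule: lowercase letters, digits, and hyphens, 3 to 40 characters.
NAME_PATTERN = re.compile(r"^[a-z][a-z0-9-]{1,38}[a-z0-9]$")

CATALOG_TEMPLATE = Template("""\
apiVersion: backstage.io/v1alpha1
kind: Component
metadata:
  name: $name
  description: $description
spec:
  type: service
  owner: $owner
  lifecycle: experimental
""")


def scaffold(target_dir: Path, name: str, owner: str, description: str) -> Path:
    """Validate metadata, then create a service directory from the template."""
    if not NAME_PATTERN.match(name):
        raise ValueError("name must be lowercase letters, digits, and hyphens (3-40 chars)")
    if not owner or not description:
        raise ValueError("owner and description are required")

    service_dir = target_dir / name
    service_dir.mkdir(parents=True, exist_ok=False)  # refuse to overwrite a service

    catalog = CATALOG_TEMPLATE.substitute(
        name=name,
        owner=owner,
        # json.dumps yields a double-quoted string, which is also valid YAML.
        description=json.dumps(description),
    )
    (service_dir / "catalog-info.yaml").write_text(catalog)
    (service_dir / "README.md").write_text(
        f"# {name}\n\n{description}\n\n## Runbook\n\nTODO\n"
    )
    return service_dir


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        created = scaffold(Path(tmp), "payments-api", "team-payments", "Handles card charges")
        print(sorted(p.name for p in created.iterdir()))
```

Validating before writing anything means a rejected request leaves no half-created service behind.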

Next steps

  • Iterate with user feedback sessions; measure lead time to first deploy from a template.
  • Add one more golden path (e.g., async worker) and one more environment (staging).
  • Expand guardrails to include basic cost controls and secret scanning.

Platform Engineering Foundations — Skill Exam

This timed exam checks your grasp of platform foundations: internal platform concepts, self-serve flows, golden paths, catalogs, guardrails, and SLOs. Everyone can take the exam for free. If you are logged in, your progress and results are saved; otherwise, you can still complete it without saving.

Rules: closed-book, no external tools required. Choose the best answer(s) based on the scenarios. Passing score is 70%.

14 questions | 70% to pass
