Reliability And Operations

Learn Reliability And Operations for Platform Engineers for free: roadmap, examples, subskills, and a skill exam.

Published: January 23, 2026 | Updated: January 23, 2026

Why Reliability and Operations matter for Platform Engineers

Reliability and Operations is where your platform proves itself in real conditions. As a Platform Engineer, you design for resilience, set service-level objectives (SLOs), run on-call, manage changes safely, and resolve incidents fast. Mastering this unlocks trusted releases, stable infrastructure, and calm operations even during spikes and failures.

  • You make reliability measurable (SLIs/SLOs, error budgets).
  • You keep systems ready: capacity planned, tested under load, backed up, and recoverable.
  • You reduce risk: safe change rollout, quick rollback, clear runbooks, strong incident response.

Who this is for

  • Platform Engineers building and operating shared services and infrastructure.
  • Backend Engineers moving toward SRE/Platform responsibilities.
  • Team leads who need predictable releases and stable production.

Prerequisites

  • Comfortable with Linux basics (shell, processes, logs).
  • Familiar with containers and orchestration (Docker, Kubernetes basics).
  • Able to read YAML/JSON and write simple scripts (Bash or Python).
  • Basic monitoring knowledge (metrics, logs, traces).

Learning path (practical roadmap)

Milestone 1 — Define reliability: SLIs, SLOs, error budgets
  1. Pick one service. Define its critical user journeys and SLIs (e.g., request latency < 300ms, availability).
  2. Set SLOs (e.g., 99.9% monthly) and compute the error budget (see the sketch after this milestone).
  3. Create alerting based on burn rate, not raw thresholds.
Checkpoint: You can explain what will wake you up at night and why.
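
To make the budget concrete, here is a minimal Bash sketch (assuming a 30-day month); run it with any SLO value:

#!/usr/bin/env bash
# Error budget in minutes for a monthly window (assumes a 30-day month)
SLO=0.999
MINUTES_IN_MONTH=$((30 * 24 * 60))   # 43200
BUDGET=$(echo "(1 - $SLO) * $MINUTES_IN_MONTH" | bc -l)
printf "SLO %s -> error budget: %.1f minutes/month\n" "$SLO" "$BUDGET"
# 99.5% -> 216 min, 99.9% -> 43.2 min, 99.99% -> ~4.3 min
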
Milestone 2 — On-call and incident response
  1. Draft escalation paths and severities (SEV1–SEV3).
  2. Write an incident runbook and practice a 30-minute simulated outage.
  3. Adopt a lightweight post-incident review template.
Checkpoint: You can lead an incident bridge and produce follow-ups.
Milestone 3 — Capacity planning
  1. Measure current load, peak factors, and growth rate.
  2. Forecast the next 3–6 months and define headroom (e.g., maintain 30–50%); a sizing sketch follows this milestone.
  3. Create a scaling plan (HPA settings, instance counts, storage growth).
Checkpoint: You can justify capacity with data and lead time.
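
A back-of-the-envelope sizing sketch; the peak, growth, per-replica throughput, and headroom values are illustrative assumptions to replace with your own measurements:

#!/usr/bin/env bash
# Rough replica count from peak demand, growth, and headroom (all numbers illustrative)
PEAK_RPS=1200          # observed peak requests/sec
GROWTH=1.25            # expected growth over the planning horizon (+25%)
PER_REPLICA_RPS=150    # safe throughput per replica from load testing
HEADROOM=1.4           # keep ~40% spare capacity
NEEDED=$(echo "$PEAK_RPS * $GROWTH * $HEADROOM / $PER_REPLICA_RPS" | bc -l)
echo "Plan for about $(printf '%.0f' "$NEEDED") replicas, then round up and add N+1 for failover"
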
Milestone 4 — Performance and load testing
  1. Establish a baseline with a representative test (ramp-up, steady, spike).
  2. Track p95/p99 latency, error rate, and resource saturation (a query sketch follows this milestone).
  3. Find the knee of the curve and record safe throughput.
Checkpoint: You can name the maximum safe RPS and its latency.
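
One way to read server-side percentiles during a test is to query Prometheus directly (requires curl and jq); the metric name below is an assumption, so substitute whatever latency histogram your service exposes:

#!/usr/bin/env bash
# Fetch p95 latency from Prometheus during a load test (metric name is an assumption)
PROM_URL=${PROM_URL:-http://prometheus:9090}
QUERY='histogram_quantile(0.95, sum(rate(http_request_duration_seconds_bucket[5m])) by (le))'
curl -sG "$PROM_URL/api/v1/query" --data-urlencode "query=$QUERY" | jq -r '.data.result[0].value[1]'
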
Milestone 5 — Chaos and resilience testing
  1. Start with small, contained failures (kill one pod, inject latency in staging); see the drill sketch below.
  2. Verify failover paths, time-to-recover, and alerts.
  3. Document guardrails and abort conditions.
Checkpoint: You know the blast radius and how the system heals.
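
A minimal pod-kill drill, assuming a Deployment named api with label app=api in a staging namespace (adjust the names to your setup):

#!/usr/bin/env bash
# Delete one pod and time how long the Deployment takes to report full availability again
set -euo pipefail
NS=staging
POD=$(kubectl -n "$NS" get pods -l app=api -o jsonpath='{.items[0].metadata.name}')
START=$(date +%s)
kubectl -n "$NS" delete pod "$POD" --wait=false
kubectl -n "$NS" rollout status deployment/api --timeout=5m
echo "Recovered in $(( $(date +%s) - START ))s; now confirm the expected alerts fired"
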
Milestone 6 — Backup and disaster recovery
  1. Define RPO/RTO for each data store.
  2. Automate backups and test restores regularly.
  3. Document the DR runbook and run a timed drill.
Checkpoint: You can restore within target RTO from a known-good backup.
Milestone 7 — Change risk management and operational readiness
  1. Standardize rollout strategies (canary, blue/green, feature flags); a canary sketch follows this milestone.
  2. Add preflight checks: health, dependencies, rollback plan.
  3. Ship a release using a checklist and capture learnings.
Checkpoint: Releases feel boring, reversible, and measurable.
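
As one plain-Kubernetes way to run a canary (no service mesh or rollout controller assumed): deploy a small canary Deployment that shares the stable Service selector, bake, then promote or abort. The manifest path, image tag, and names below are illustrative:

#!/usr/bin/env bash
# Canary by replica ratio: api-canary shares the app=api Service selector with the stable api Deployment
set -euo pipefail
kubectl apply -f deploy/api-canary.yaml                 # new version, 1 replica next to ~9 stable (~10% of traffic)
kubectl rollout status deployment/api-canary --timeout=5m
sleep 600                                               # bake period: watch burn-rate alerts and dashboards
# Promote: roll the new image into the stable Deployment, then remove the canary
kubectl set image deployment/api api=registry.example/api:v2
kubectl rollout status deployment/api --timeout=10m
kubectl delete deployment api-canary
# Abort instead: kubectl delete deployment api-canary   (stable pods keep serving the old version)
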

Worked examples

Example 1 — Prometheus SLO burn rate alert

Alert on fast and slow burn rates for a 99.9% availability SLO using an error-ratio recording rule.

# Recording rule example (ratio of errors to total over 5m)
- record: job:http_error_ratio:5m
  expr: sum(rate(http_requests_total{status=~"5.."}[5m]))
        / sum(rate(http_requests_total[5m]))

# Fast-burn alert (~14.4x budget burn)
- alert: HighErrorBudgetBurn
  expr: job:http_error_ratio:5m > (1 - 0.999) * 14.4
  for: 2m
  labels:
    severity: page
  annotations:
    summary: Fast burn rate threatens SLO
    runbook: "Open incident, initiate rollback if recent change"

# Slow-burn alert (~6x budget burn)
- alert: ModerateErrorBudgetBurn
  expr: avg_over_time(job:http_error_ratio:5m[1h]) > (1 - 0.999) * 6
  for: 30m
  labels:
    severity: ticket
  annotations:
    summary: Slow burn rate requires attention
    runbook: "Create task, investigate top offenders"

Tip: Tie alerts to clear actions and escalation paths.
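
Where the multipliers come from: a burn rate of 1 spends the whole error budget exactly over the SLO window (30 days here, or 720 hours). The common fast-burn threshold of 14.4 corresponds to spending 2% of the budget within one hour (0.02 × 720 = 14.4), and 6 corresponds to spending 5% within six hours (0.05 × 720 / 6 = 6), following the multiwindow, multi-burn-rate approach described in the Google SRE Workbook. Note that these rule fragments belong under a groups: ... rules: section in a Prometheus rules file.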

Example 2 — Kubernetes HPA + PDB to stay stable during updates
---
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: api-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: api
  minReplicas: 4
  maxReplicas: 12
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 60
---
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: api-pdb
spec:
  minAvailable: 3
  selector:
    matchLabels:
      app: api

Why: the HPA adds capacity during spikes; the PDB keeps at least three pods serving during voluntary disruptions such as maintenance or node drains.

Example 3 — k6 baseline load test
import http from 'k6/http';
import { check, sleep } from 'k6';

export const options = {
  stages: [
    { duration: '2m', target: 50 },  // ramp-up
    { duration: '5m', target: 50 },  // steady
    { duration: '1m', target: 150 }, // spike
  ],
  thresholds: {
    http_req_duration: ['p(95)<300'],
    http_req_failed: ['rate<0.01'],
  },
};

export default function () {
  const res = http.get('https://staging.example/api/health');
  check(res, { 'status is 200': (r) => r.status === 200 });
  sleep(0.2);
}

Record p95, p99, error rate, and resource saturation to find your safe throughput.
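
If you save the script as baseline.js, a typical invocation is k6 run baseline.js; when a threshold is breached, k6 exits non-zero, so the same script can double as a CI gate.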

Example 4 — Backup and restore verification (PostgreSQL)
#!/usr/bin/env bash
set -euo pipefail
TS=$(date +%Y%m%d-%H%M%S)
BACKUP_DIR=/backups
DB_URL=${DB_URL:-postgres://app:secret@db:5432/app}
# Connection URL without the trailing database name (e.g., postgres://app:secret@db:5432)
BASE_URL=${DB_URL%/*}

# 1) Create dump
pg_dump --format=custom --file="$BACKUP_DIR/app-$TS.dump" "$DB_URL"

# 2) Verify restore into a temp database
TMP_DB="verify_$TS"
psql "$DB_URL" -c "CREATE DATABASE \"$TMP_DB\";"
pg_restore --exit-on-error --dbname="$BASE_URL/$TMP_DB" "$BACKUP_DIR/app-$TS.dump"

# 3) Integrity check (example: count critical tables)
psql "$BASE_URL/$TMP_DB" -c "SELECT count(*) FROM users;"

# 4) Cleanup temp DB
psql "$DB_URL" -c "DROP DATABASE \"$TMP_DB\";"

echo "Backup verified: $BACKUP_DIR/app-$TS.dump"

Automate and schedule this. A backup is only as good as your last successful restore test.
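
A minimal way to schedule the verification, assuming the script above is installed as /usr/local/bin/verify-backup.sh (the path and 02:00 schedule are illustrative); on Kubernetes, a CronJob running the same script is the equivalent:

#!/usr/bin/env bash
# Append a nightly 02:00 run to the current user's crontab (path and schedule are illustrative)
( crontab -l 2>/dev/null; echo '0 2 * * * /usr/local/bin/verify-backup.sh >> /var/log/verify-backup.log 2>&1' ) | crontab -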

Example 5 — Incident flow and comms templates

Severity levels: SEV1 critical outage, SEV2 partial degradation, SEV3 minor impact.

# Incident roles
- Incident Commander (IC): decision maker, timekeeper
- Communications: status updates, stakeholder sync
- Ops/Tech lead: triage, mitigation

# First 10 minutes checklist
[ ] Declare severity and open an incident channel
[ ] Assign roles (IC, Comms, Tech lead)
[ ] State impact, scope, start time, suspected area
[ ] Mitigate user impact (rollback, feature flag, scale out)
[ ] Update status every 10–15 minutes

# Status update template
Impact: <what users see>
Scope: <regions/services>
Start time: <UTC>
Current action: <mitigation/diagnosis>
Next update: <UTC time>

Drills and exercises

  • [ ] Compute the monthly error budget (minutes) for SLOs 99.5%, 99.9%, and 99.99%.
  • [ ] Write one burn-rate alert and one saturation alert that would page you.
  • [ ] Run a 30-minute load test and record p95/p99, CPU, memory, and errors.
  • [ ] Kill one replica in staging. Measure time to recover and confirm alerts fired.
  • [ ] Execute a timed DB restore drill and record RTO/RPO achieved.
  • [ ] Practice a 20-minute incident simulation with roles and a status update.
  • [ ] Ship a canary deployment with a documented rollback plan.

Common mistakes and how to debug them

Mistake: Alert fatigue from noisy thresholds

Fix: Use SLO burn-rate alerts and add minimum durations (for:) to avoid flapping. Tie every page to a clear action.

Mistake: Unpracticed backups

Fix: Schedule restore verifications. Keep immutable copies and track restore timings against your RTO.

Mistake: Capacity based on averages

Fix: Plan for peaks and variability. Use p95 demand, add headroom, and account for failover (N+1 or zone loss).
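For example, with an illustrative p95 demand of 8 instances spread across 3 zones, losing one zone removes a third of capacity, so you need roughly 12 instances just to keep covering p95 demand during a zone outage, before adding growth headroom.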

Mistake: Big-bang releases

Fix: Prefer canary or blue/green. Automate health checks and rollback triggers.

Mistake: Unclear incident roles

Fix: Assign Incident Commander, Comms, and Tech lead at declaration. Timebox updates and log a timeline.

Mini project: Ship a reliable feature toggle rollout

Goal: Launch a new API feature behind a flag with safe rollout and the ability to roll back fast.

  1. Define SLIs/SLOs for the endpoint; create a fast- and slow-burn alert.
  2. Prepare capacity: set HPA and ensure PDB protects min availability.
  3. Run a k6 baseline; record safe throughput and latencies.
  4. Chaos drill: kill one pod during rollout in staging; verify recovery and alerts.
  5. Create a backup/restore test for the affected database.
  6. Plan the release: canary 5% → 25% → 50% → 100%, with abort thresholds and a rollback procedure.
  7. Write/attach the runbook and operational readiness checklist. Conduct a 15-minute go/no-go.
Operational readiness checklist (copy/paste)
  • [ ] SLOs defined; alerts routed; dashboards ready
  • [ ] Capacity headroom ≥ 30% and HPA configured
  • [ ] Rollout strategy and rollback plan documented
  • [ ] Backups verified in last 7 days
  • [ ] Runbook with triage steps and owners
  • [ ] Launch comms and status page templates ready

Subskills

  • On Call Practices — Build healthy rotations, triage quickly, escalate well, and close the loop.
  • Capacity Planning — Forecast usage, ensure headroom, and plan scale/failover.
  • Performance And Load Testing Basics — Create tests, interpret p95/p99, and find safe limits.
  • Chaos And Resilience Testing Basics — Run safe failure experiments to validate recovery paths.
  • Backup And Disaster Recovery — Meet RPO/RTO with tested, automated restores.
  • Handling Incidents And Outages — Lead response, communicate clearly, and produce learnings.
  • Change Risk Management — Use canary/blue-green, feature flags, and rollback strategies.
  • Runbooks And Operational Readiness — Document procedures and preflight checks for stable ops.

Practical projects

  • Golden Signals Dashboard: Build service dashboards for latency, traffic, errors, saturation with SLO overlays.
  • Resilience Game Day: Design a 60-minute scenario with 3 small failures and measurable recovery targets.
  • Release Safety Kit: Templates for rollout plans, rollback plans, and change windows with approvals.
  • DR Playbook: End-to-end restore from last backup to a clean environment within target RTO.

Next steps

Practice in a staging environment, then apply to one production service at a time. When you feel ready, take the skill exam below to validate your knowledge. Anyone can take the exam; only logged‑in users will have progress saved.

Reliability And Operations — Skill Exam

This exam checks practical understanding of reliability and operations for Platform Engineers. You will see scenario-based and concept questions. Passing score: 70%. Anyone can take the exam for free. Only logged-in users will have their progress saved. Tips: choose the best answer for your context, and assume a typical web service on Kubernetes with Prometheus unless stated otherwise.

12 questions | 70% to pass
