Why Reliability and Operations matter for Platform Engineers
Reliability and Operations is where your platform proves itself under real-world conditions. As a Platform Engineer, you design for resilience, set service-level objectives (SLOs), run on-call, manage change safely, and resolve incidents quickly. Mastering it unlocks trusted releases, stable infrastructure, and calm operations even during spikes and failures.
- You make reliability measurable (SLIs/SLOs, error budgets).
- You keep systems ready: capacity planned, tested under load, backed up, and recoverable.
- You reduce risk: safe change rollout, quick rollback, clear runbooks, strong incident response.
Who this is for
- Platform Engineers building and operating shared services and infrastructure.
- Backend Engineers moving toward SRE/Platform responsibilities.
- Team leads who need predictable releases and stable production.
Prerequisites
- Comfortable with Linux basics (shell, processes, logs).
- Familiar with containers and orchestration (Docker, Kubernetes basics).
- Able to read YAML/JSON and write simple scripts (Bash or Python).
- Basic monitoring knowledge (metrics, logs, traces).
Learning path (practical roadmap)
Milestone 1 — Define reliability: SLIs, SLOs, error budgets
- Pick one service. Define its critical user journeys and SLIs (e.g., request latency < 300ms, availability).
- Set SLOs (e.g., 99.9% monthly) and compute error budget.
- Create alerting based on burn rate, not raw thresholds.
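To make the budget math concrete, here is a minimal Python sketch (the 30-day window and the 99.9% example are assumptions) that converts an SLO into a monthly error budget and into the error-ratio thresholds implied by burn-rate multipliers:
# Minimal sketch: turn an SLO into an error budget and burn-rate thresholds.
MINUTES_PER_MONTH = 30 * 24 * 60  # 43,200 minutes in an assumed 30-day window

def error_budget_minutes(slo: float) -> float:
    """Minutes of full unavailability the SLO tolerates per window."""
    return (1 - slo) * MINUTES_PER_MONTH

def burn_rate_threshold(slo: float, burn_rate: float) -> float:
    """Error ratio at which the budget burns `burn_rate` times faster than allowed."""
    return (1 - slo) * burn_rate

slo = 0.999
print(f"Error budget: {error_budget_minutes(slo):.1f} min/month")            # 43.2
print(f"Fast-burn threshold (14.4x): {burn_rate_threshold(slo, 14.4):.4f}")  # 0.0144
print(f"Slow-burn threshold (6x): {burn_rate_threshold(slo, 6):.4f}")        # 0.0060
The same 14.4x and 6x multipliers reappear in the Prometheus rules under Worked examples.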
Milestone 2 — On-call and incident response
- Draft escalation paths and severities (SEV1–SEV3).
- Write an incident runbook and practice a 30-minute simulated outage.
- Adopt a lightweight post-incident review template.
Milestone 3 — Capacity planning
- Measure current load, peak factors, and growth rate.
- Forecast next 3–6 months and define headroom (e.g., maintain 30–50%).
- Create a scaling plan (HPA settings, instance counts, storage growth).
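The forecasting arithmetic can stay simple. Below is a Python sketch in which every input (peak load, growth rate, per-replica throughput, headroom) is an illustrative assumption, not a recommendation:
# Capacity sketch: forecast peak load and size replicas with headroom.
import math

peak_rps = 1200          # observed peak requests/sec (assumed)
growth_per_month = 0.08  # assumed month-over-month growth
months_ahead = 6
rps_per_replica = 150    # safe throughput per replica, from load testing (assumed)
headroom = 0.40          # keep 40% spare capacity

forecast_rps = peak_rps * (1 + growth_per_month) ** months_ahead
# Round up and add one replica so a single failure does not eat the headroom (N+1).
replicas = math.ceil(forecast_rps / rps_per_replica * (1 + headroom)) + 1
print(f"Forecast peak ~{forecast_rps:.0f} rps -> plan for {replicas} replicas")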
Milestone 4 — Performance and load testing
- Establish a baseline with a representative test (ramp-up, steady, spike).
- Track p95/p99 latency, error rate, and resource saturation.
- Find the knee of the curve and record safe throughput.
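After a run, a short post-processing step can extract the safe throughput. This sketch uses invented sample results; real numbers would come from your load-testing tool:
# Sample results: (requests/sec, p95 latency in ms, error rate). Invented data.
results = [
    (50, 120, 0.000),
    (100, 140, 0.001),
    (200, 190, 0.002),
    (300, 280, 0.004),
    (400, 650, 0.030),  # latency and errors climb sharply: past the knee
]
P95_TARGET_MS = 300
ERROR_TARGET = 0.01

# Safe throughput = highest tested load that still meets both targets.
safe = max(
    (rps for rps, p95, err in results if p95 < P95_TARGET_MS and err < ERROR_TARGET),
    default=None,
)
print(f"Safe throughput: {safe} rps")  # -> 300 rps for this sample data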
Milestone 5 — Chaos and resilience testing
- Start with small, contained failures (kill 1 pod, inject latency in staging).
- Verify failover paths, time-to-recover, and alerts.
- Document guardrails and abort conditions.
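One way to put a number on time-to-recover is to poll the service's health endpoint while the experiment runs. A rough standard-library sketch; the URL is a placeholder for your staging service:
# Rough sketch: measure time-to-recover by polling a health endpoint.
import time
import urllib.error
import urllib.request

HEALTH_URL = "https://staging.example/api/health"  # placeholder URL

def healthy() -> bool:
    try:
        with urllib.request.urlopen(HEALTH_URL, timeout=2) as resp:
            return resp.status == 200
    except (urllib.error.URLError, TimeoutError):
        return False

down_at = None
while True:
    ok = healthy()
    if not ok and down_at is None:
        down_at = time.time()
        print("Service unhealthy, recovery clock started")
    elif ok and down_at is not None:
        print(f"Recovered after {time.time() - down_at:.1f}s")
        break
    time.sleep(1)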
Milestone 6 — Backup and disaster recovery
- Define RPO/RTO for each data store.
- Automate backups and test restores regularly.
- Document the DR runbook and run a timed drill.
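Restore drills cover RTO; for RPO it helps to continuously check the age of the newest backup against the target. A minimal Python sketch; the directory, file pattern (matching the PostgreSQL example under Worked examples), and 1-hour RPO are assumptions:
# Minimal sketch: flag when the newest backup is older than the RPO target.
import time
from pathlib import Path

BACKUP_DIR = Path("/backups")   # assumed location, as in the backup example
RPO_SECONDS = 3600              # assumed target: lose at most 1 hour of data

dumps = sorted(BACKUP_DIR.glob("app-*.dump"), key=lambda p: p.stat().st_mtime)
if not dumps:
    raise SystemExit("No backups found")

age = time.time() - dumps[-1].stat().st_mtime
status = "OK" if age <= RPO_SECONDS else "RPO VIOLATION"
print(f"{status}: newest backup {dumps[-1].name} is {age / 60:.0f} minutes old")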
Milestone 7 — Change risk management and operational readiness
- Standardize rollout strategies (canary, blue/green, feature flags).
- Add preflight checks: health, dependencies, rollback plan.
- Ship a release using a checklist and capture learnings.
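The promote-or-abort decision in a canary rollout boils down to a small gate loop. The sketch below is only an outline: the traffic steps, threshold, soak time, and both helper functions are placeholders, not a specific tool's API:
# Canary gate sketch: advance traffic only while the canary stays healthy.
import time

TRAFFIC_STEPS = [5, 25, 50, 100]  # percent of traffic sent to the canary
ABORT_THRESHOLD = 0.005           # abort if more than 0.5% of requests fail
SOAK_SECONDS = 300                # observe each step before promoting further

def set_canary_weight(percent: int) -> None:
    # Placeholder: adjust the traffic split via your ingress, mesh, or feature flag.
    print(f"canary weight -> {percent}%")

def fetch_error_ratio() -> float:
    # Placeholder: query your monitoring system for the canary's error ratio,
    # e.g., the error-ratio recording rule from the Prometheus example.
    return 0.001

for step in TRAFFIC_STEPS:
    set_canary_weight(step)
    time.sleep(SOAK_SECONDS)
    if fetch_error_ratio() > ABORT_THRESHOLD:
        set_canary_weight(0)  # roll back: send all traffic to the stable version
        raise SystemExit(f"Canary aborted at {step}% traffic")
print("Canary promoted to 100%")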
Worked examples
Example 1 — Prometheus SLO burn rate alert
Alert on fast and slow burn rates for a 99.9% availability SLO, using the job:http_error_ratio:5m recording rule defined below.
# Recording rule example (ratio of errors to total over 5m)
- record: job:http_error_ratio:5m
  expr: |
    sum(rate(http_requests_total{status=~"5.."}[5m]))
    / sum(rate(http_requests_total[5m]))
# Fast-burn alert (~14.4x budget burn)
- alert: HighErrorBudgetBurn
  expr: job:http_error_ratio:5m > (1 - 0.999) * 14.4
  for: 2m
  labels:
    severity: page
  annotations:
    summary: Fast burn rate threatens SLO
    runbook: "Open incident, initiate rollback if recent change"
# Slow-burn alert (~6x budget burn)
- alert: ModerateErrorBudgetBurn
  expr: avg_over_time(job:http_error_ratio:5m[1h]) > (1 - 0.999) * 6
  for: 30m
  labels:
    severity: ticket
  annotations:
    summary: Slow burn rate requires attention
    runbook: "Create task, investigate top offenders"
Tip: Tie alerts to clear actions and escalation paths.
Example 2 — Kubernetes HPA + PDB to stay stable during updates
---
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: api-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: api
  minReplicas: 4
  maxReplicas: 12
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 60
---
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: api-pdb
spec:
  minAvailable: 3
  selector:
    matchLabels:
      app: api
Why: HPA adds capacity during spikes; PDB protects quorum during maintenance or node churn.
Example 3 — k6 baseline load test
import http from 'k6/http';
import { check, sleep } from 'k6';
export const options = {
  stages: [
    { duration: '2m', target: 50 },  // ramp-up
    { duration: '5m', target: 50 },  // steady
    { duration: '1m', target: 150 }, // spike
  ],
  thresholds: {
    http_req_duration: ['p(95)<300'],
    http_req_failed: ['rate<0.01'],
  },
};
export default function () {
  const res = http.get('https://staging.example/api/health');
  check(res, { 'status is 200': (r) => r.status === 200 });
  sleep(0.2);
}
Record p95, p99, error rate, and resource saturation to find your safe throughput.
Example 4 — Backup and restore verification (PostgreSQL)
#!/usr/bin/env bash
set -euo pipefail
TS=$(date +%Y%m%d-%H%M%S)
BACKUP_DIR=/backups
DB_URL=${DB_URL:-postgres://app:secret@db:5432/app}
# 1) Create dump
pg_dump --format=custom --file="$BACKUP_DIR/app-$TS.dump" "$DB_URL"
# 2) Verify restore into a temp database
TMP_DB="verify_$TS"
TMP_DB_URL="${DB_URL%/*}/$TMP_DB"  # swap only the database name at the end of the URL
psql "$DB_URL" -c "CREATE DATABASE \"$TMP_DB\";"
pg_restore --exit-on-error --dbname="$TMP_DB_URL" "$BACKUP_DIR/app-$TS.dump"
# 3) Integrity check (example: count critical tables)
psql "$TMP_DB_URL" -c "SELECT count(*) FROM users;"
# 4) Cleanup temp DB
psql "$DB_URL" -c "DROP DATABASE \"$TMP_DB\";"
echo "Backup verified: $BACKUP_DIR/app-$TS.dump"
Automate and schedule this. A backup is only as good as your last successful restore test.
Example 5 — Incident flow and comms templates
Severity levels: SEV1 critical outage, SEV2 partial degradation, SEV3 minor impact.
# Incident roles
- Incident Commander (IC): decision maker, timekeeper
- Communications: status updates, stakeholder sync
- Ops/Tech lead: triage, mitigation
# First 10 minutes checklist
[ ] Declare severity and open an incident channel
[ ] Assign roles (IC, Comms, Tech lead)
[ ] State impact, scope, start time, suspected area
[ ] Mitigate user impact (rollback, feature flag, scale out)
[ ] Update status every 10–15 minutes
# Status update template
Impact: <what users see>
Scope: <regions/services>
Start time: <UTC>
Current action: <mitigation/diagnosis>
Next update: <UTC time>
Drills and exercises
- [ ] Compute the monthly error budget (minutes) for SLOs 99.5%, 99.9%, and 99.99%.
- [ ] Write one burn-rate alert and one saturation alert that would page you.
- [ ] Run a 30-minute load test and record p95/p99, CPU, memory, and errors.
- [ ] Kill one replica in staging. Measure time to recover and confirm alerts fired.
- [ ] Execute a timed DB restore drill and record RTO/RPO achieved.
- [ ] Practice a 20-minute incident simulation with roles and a status update.
- [ ] Ship a canary deployment with a documented rollback plan.
Common mistakes and how to debug them
Mistake: Alert fatigue from noisy thresholds
Fix: Use SLO burn-rate alerts and add minimum durations (for:) to avoid flapping. Tie every page to a clear action.
Mistake: Unpracticed backups
Fix: Schedule restore verifications. Keep immutable copies and track restore timings against your RTO.
Mistake: Capacity based on averages
Fix: Plan for peaks and variability. Use p95 demand, add headroom, and account for failover (N+1 or zone loss).
Mistake: Big-bang releases
Fix: Prefer canary or blue/green. Automate health checks and rollback triggers.
Mistake: Unclear incident roles
Fix: Assign Incident Commander, Comms, and Tech lead at declaration. Timebox updates and log a timeline.
Mini project: Ship a reliable feature toggle rollout
Goal: Launch a new API feature behind a flag with safe rollout and the ability to rollback fast.
- Define SLIs/SLOs for the endpoint; create a fast- and slow-burn alert.
- Prepare capacity: set HPA and ensure PDB protects min availability.
- Run a k6 baseline; record safe throughput and latencies.
- Chaos drill: kill one pod during rollout in staging; verify recovery and alerts.
- Create a backup/restore test for the affected database.
- Plan the release: canary 5% → 25% → 50% → 100%, with abort thresholds and a rollback procedure.
- Write/attach the runbook and operational readiness checklist. Conduct a 15-minute go/no-go.
Operational readiness checklist (copy/paste)
- [ ] SLOs defined; alerts routed; dashboards ready
- [ ] Capacity headroom ≥ 30% and HPA configured
- [ ] Rollout strategy and rollback plan documented
- [ ] Backups verified in last 7 days
- [ ] Runbook with triage steps and owners
- [ ] Launch comms and status page templates ready
Subskills
- On Call Practices — Build healthy rotations, triage quickly, escalate well, and close the loop.
- Capacity Planning — Forecast usage, ensure headroom, and plan scale/failover.
- Performance And Load Testing Basics — Create tests, interpret p95/p99, and find safe limits.
- Chaos And Resilience Testing Basics — Run safe failure experiments to validate recovery paths.
- Backup And Disaster Recovery — Meet RPO/RTO with tested, automated restores.
- Handling Incidents And Outages — Lead response, communicate clearly, and produce learnings.
- Change Risk Management — Use canary/blue-green, feature flags, and rollback strategies.
- Runbooks And Operational Readiness — Document procedures and preflight checks for stable ops.
Practical projects
- Golden Signals Dashboard: Build service dashboards for latency, traffic, errors, saturation with SLO overlays.
- Resilience Game Day: Design a 60-minute scenario with 3 small failures and measurable recovery targets.
- Release Safety Kit: Templates for rollout plans, rollback plans, and change windows with approvals.
- DR Playbook: End-to-end restore from last backup to a clean environment within target RTO.
Next steps
Practice in a staging environment, then apply what you learn to one production service at a time. When you feel ready, take the skill exam below to validate your knowledge.