Why this matters
Reliability is the product feature customers notice most when it’s missing. As a Platform Engineer, you design shared systems (CI/CD, service mesh, observability, databases) that other teams rely on. Service Level Objectives (SLOs) turn vague reliability wishes into measurable goals, and error budgets translate those goals into controlled risk and deployment speed. Together they help you:
- Prioritize engineering work: use error budget burn to justify reliability work over new features.
- Create aligned alerts: page only for customer-impacting issues.
- Negotiate expectations: SLOs help product, support, and engineering agree on what “good” means.
Who this is for
- Platform and backend engineers building and running services.
- Team leads/SREs defining reliability guidelines.
- New engineers learning how to measure and defend reliability choices.
Prerequisites
- Basic understanding of HTTP services and request/response flow.
- Familiarity with metrics (counters, histograms), logging, and alerting concepts.
- Comfort with percentages and time windows.
Concept explained simply
Think of reliability as a promise to users. An SLO is the promise, measured by an SLI (what you actually measure). The SLA is a legal or external document and is not required for internal reliability practice.
- SLI (Service Level Indicator): a precise measurement. Example: percentage of requests under 300 ms and successful.
- SLO (Service Level Objective): the target for the SLI. Example: 99.9% of requests under 300 ms and successful over 30 days.
- Error budget: how much unreliability is allowed. If your SLO is 99.9%, your error budget is 0.1% of requests or time (see the quick sketch below).
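A quick sketch of that arithmetic in Python; the request volume is an assumed example, not a benchmark:

```python
# Error budget = 1 - SLO target, expressed in minutes and in requests.
slo_target = 0.999                 # 99.9% over a 30-day window
window_days = 30

budget_fraction = 1 - slo_target                           # 0.001, i.e. 0.1%
budget_minutes = budget_fraction * window_days * 24 * 60   # 43.2 minutes

monthly_requests = 10_000_000                          # assumed traffic volume
budget_requests = budget_fraction * monthly_requests   # ~10,000 failed requests allowed

print(f"Budget: {budget_minutes:.1f} min or {budget_requests:,.0f} requests per {window_days} days")
```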
Mental model
Reliability trades off with velocity. Your SLO sets the maximum acceptable risk. If you burn error budget too fast, you slow changes and focus on fixes. If you have budget left, you can ship faster.
Quick intuition: time-based vs request-based SLOs
Time-based (availability per month): great for infrastructure like control planes. Request-based (success/latency per request): ideal for APIs and user-facing endpoints. Choose what best reflects user experience.
SLI, SLO, SLA — practical differences
- SLI: formula only. Example: successful_requests_under_300ms / total_requests (rolling 30 days).
- SLO: target + window. Example: SLI ≥ 99.9% over 30 days.
- SLA: external commitment with consequences. Often derived from SLOs but not needed for internal practice.
Setting SLOs step-by-step
- Define the user journey. What is the critical path? (e.g., Checkout POST).
- Choose SLIs that reflect user happiness. Success + latency percentiles.
- Pick a sensible window. A 28–30 day window balances stability and responsiveness; use shorter windows for volatile services.
- Set the target. Start with what you can meet. 99–99.9% is common for single services; 99.99% requires strong redundancy.
- Compute error budget. Budget = 1 − SLO target.
- Design alerts on burn rate. Alert only when the budget is burning fast enough to threaten the SLO.
- Agree on policies. If the budget is exhausted: pause risky deploys and focus on reliability work (see the sketch after this list).
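If it helps to see the outcome of these steps in one place, here is a minimal sketch that records an SLO as data. The field names and example values are illustrative assumptions, not a required schema or tool.

```python
from dataclasses import dataclass, field

@dataclass
class BurnRateAlert:
    window_hours: float    # evaluation window for the alert
    burn_rate: float       # multiple of the budgeted error rate that triggers a page

@dataclass
class SLO:
    journey: str           # the user journey this protects
    sli: str               # the measurement, kept human-readable here
    target: float          # 0.999 means 99.9%
    window_days: int
    alerts: list[BurnRateAlert] = field(default_factory=list)

    @property
    def error_budget(self) -> float:
        return 1 - self.target   # step: budget = 1 - SLO target

# Example outcome of the steps above (values are assumptions).
checkout = SLO(
    journey="Checkout POST",
    sli="successful_requests_under_300ms / total_requests",
    target=0.999,
    window_days=30,
    alerts=[BurnRateAlert(window_hours=1, burn_rate=14.4),
            BurnRateAlert(window_hours=6, burn_rate=6.0)],
)
print(f"{checkout.journey}: budget = {checkout.error_budget:.3%} of requests")
```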
Tip: pick availability OR latency first
New teams often start with availability success rate (simple) before adding latency SLOs (harder to measure and tune).
Worked examples
Example 1: Monthly availability SLO
Target: 99.9% availability over 30 days for the API gateway.
- Total minutes in 30 days = 30 × 24 × 60 = 43,200.
- Error budget = 0.1% = 43.2 minutes of allowed unavailability.
- Alerting: use burn rate alerts.
- Fast page: 1-hour window, burn rate ≥ 14.4. (Consumes ~2% of the monthly budget in 1 hour.)
- Slow page: 6-hour window, burn rate ≥ 6. (Consumes ~5% of the monthly budget in 6 hours.)
Why these numbers?
Burn rate is the observed error rate divided by the budgeted error rate (1 − SLO target). A burn rate of 1 consumes exactly the whole budget by the end of the 30-day window; higher rates exhaust it proportionally faster. The thresholds above page early enough that you don’t sleepwalk into missing the SLO.
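A small Python sketch of that relationship; it reproduces the thresholds above for a 99.9% SLO over a 30-day window:

```python
def burn_rate(observed_error_rate: float, slo_target: float) -> float:
    """How many times faster than budgeted the error budget is being spent."""
    return observed_error_rate / (1 - slo_target)

def budget_consumed(burn: float, window_hours: float, slo_window_days: int = 30) -> float:
    """Fraction of the total error budget used at this burn rate for this long."""
    return burn * window_hours / (slo_window_days * 24)

print(f"{budget_consumed(14.4, window_hours=1):.0%} of the budget in 1 hour")   # 2%
print(f"{budget_consumed(6.0, window_hours=6):.0%} of the budget in 6 hours")   # 5%
print(f"burn rate at a 1.44% error rate: {burn_rate(0.0144, 0.999):.1f}")       # 14.4
```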
Example 2: Latency SLO for a POST /checkout
SLI: fraction of successful requests under 300 ms (exclude 4xx client errors).
SLO: 95% of successful requests under 300 ms over 7 days.
Why 7 days? Checkout traffic is spiky; a shorter window reacts faster to regressions.
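A minimal sketch of this SLI over a batch of requests. It assumes each record carries an HTTP status code and a latency in milliseconds; the sample data is made up.

```python
# Latency SLI: fraction of successful (2xx) requests under 300 ms.
# Client errors (4xx) are excluded entirely; 5xx are not "successful".
requests = [
    {"status": 200, "latency_ms": 120},
    {"status": 200, "latency_ms": 420},
    {"status": 404, "latency_ms": 15},
    {"status": 500, "latency_ms": 90},
    {"status": 200, "latency_ms": 280},
]

successful = [r for r in requests if 200 <= r["status"] < 300]
fast_enough = [r for r in successful if r["latency_ms"] < 300]

latency_sli = len(fast_enough) / len(successful) if successful else 1.0
print(f"Latency SLI: {latency_sli:.1%} (target: 95% under 300 ms over 7 days)")
```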
Edge cases
- Retry storms: cap client retries to avoid artificial failure inflation.
- Cold starts/warm-up: count cold-start latency against the SLI rather than excluding it, unless an exclusion is strictly necessary and documented.
Example 3: Background jobs throughput SLO
SLI: fraction of jobs completed within 5 minutes of enqueue.
SLO: 99% of jobs complete within 5 minutes over 30 days.
Alerting: if the 99th percentile age of jobs exceeds 5 minutes for 30 minutes, open an incident; if sustained for 2 hours, escalate.
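A sketch of this job SLI, assuming each job record has an enqueue time and an optional completion time; the sample jobs are illustrative, not from a real queue.

```python
from datetime import datetime, timedelta

now = datetime(2024, 1, 1, 12, 0)
jobs = [
    {"enqueued": now - timedelta(minutes=7),  "completed": now - timedelta(minutes=3)},  # took 4 min
    {"enqueued": now - timedelta(minutes=20), "completed": now - timedelta(minutes=2)},  # took 18 min
    {"enqueued": now - timedelta(minutes=6),  "completed": None},                        # still queued
]

def within_deadline(job, deadline=timedelta(minutes=5)):
    # Unfinished jobs count against the SLI once they are past the deadline.
    end = job["completed"] or now
    return end - job["enqueued"] <= deadline

job_sli = sum(within_deadline(j) for j in jobs) / len(jobs)
print(f"Job SLI: {job_sli:.1%} (target: 99% within 5 minutes over 30 days)")
```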
Capacity planning insight
When the SLO fails during peak load, you likely need queue autoscaling or rate limiting on producers.
Error budgets: turning targets into decisions
- Budget left: deploy at normal speed; try experiments.
- Budget burning fast: tighten change control, prioritize rollbacks and fixes.
- Budget exhausted: freeze risky changes; focus on reliability work until budget recovers.
Simple policy template
- 25% of the budget consumed in the first week: review on-call toil and top incidents.
- 50% by mid-cycle: restrict deploys to fixes.
- 75%: freeze feature work.
- 100%: escalate to leadership and prioritize the reliability roadmap.
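One way to encode that policy so it is applied consistently; the thresholds mirror the template above and are a starting point to adapt, not a standard.

```python
def budget_policy(budget_used: float, day_of_cycle: int) -> str:
    """Map error-budget consumption to an action (thresholds from the template above)."""
    if budget_used >= 1.00:
        return "Escalate to leadership; prioritize the reliability roadmap."
    if budget_used >= 0.75:
        return "Freeze feature deploys."
    if budget_used >= 0.50:
        return "Restrict deploys to fixes only."
    if budget_used >= 0.25 and day_of_cycle <= 7:
        return "Review on-call toil and top incidents."
    return "Deploy at normal speed."

print(budget_policy(0.30, day_of_cycle=5))    # early fast burn -> review toil/incidents
print(budget_policy(0.80, day_of_cycle=20))   # late in the cycle -> freeze features
```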
Measurement and alerts
- Metrics needed: total requests, successful requests, latency buckets (histograms), queue age/length, job completion counts.
- Rollups: compute SLIs over sliding windows (e.g., 30 days) and shorter alert windows (1h, 6h).
- Exclude only what users don’t experience (e.g., maintenance pages during agreed downtime) and document exclusions clearly.
Sample SLI expressions (illustrative; see the sketch after the list)
- Availability SLI: successful_requests / total_requests (rolling 30d).
- Latency SLI: requests_under_300ms / successful_requests (rolling 7d).
- Job SLI: jobs_completed_within_5m / jobs_enqueued (rolling 30d).
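The same expressions as plain ratios; the counter names are placeholders, and in practice the values come from your metrics system summed over the relevant rolling window.

```python
def availability_sli(successful_requests: int, total_requests: int) -> float:
    return successful_requests / total_requests if total_requests else 1.0

def latency_sli(requests_under_300ms: int, successful_requests: int) -> float:
    return requests_under_300ms / successful_requests if successful_requests else 1.0

def job_sli(jobs_completed_within_5m: int, jobs_enqueued: int) -> float:
    return jobs_completed_within_5m / jobs_enqueued if jobs_enqueued else 1.0

# Example: 30 days of counters summed (assumed numbers).
print(f"Availability: {availability_sli(998_700, 1_000_000):.3%}")  # 99.870%, below a 99.9% SLO
```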
Common mistakes and self-check
- Too many SLOs per service. Pick 1–3 that reflect user happiness. Self-check: can you explain each SLO’s user impact in one sentence?
- Paging on raw errors. Page on budget burn, not every blip. Self-check: does this alert imply we’ll miss the SLO soon?
- Hiding behind exclusions. Excluding every tough case gives fake reliability. Self-check: would users agree this time shouldn’t count?
- Unowned SLOs. Without owners, no one reacts. Self-check: who reviews SLOs weekly? Is a policy documented?
- Targets too high, too soon. 99.99% without redundancy is wishful. Self-check: do we have N+1 and failover tested?
Exercises
Try each exercise on your own first; check the hint or solution only to verify.
Exercise 1: Compute error budget and burn alerts
You run an API with SLO: 99.9% availability over 30 days.
- Task A: How many minutes of downtime are allowed?
- Task B: Propose two burn-rate alerts that would catch fast and slow burns.
Hint
30 days = 43,200 minutes. Error budget is 0.1% of that. For alerts, think 1-hour and 6-hour windows with burn rates around 14.4 and 6.
Solution
Allowed downtime: 43.2 minutes per 30-day window. Alerts: page if the 1-hour burn rate ≥ 14.4 (about 2% of the monthly budget in 1 hour); page if the 6-hour burn rate ≥ 6 (about 5% in 6 hours).
Exercise 2: Define a latency SLI/SLO for POST /checkout
- Task A: Write an SLI that measures user-perceived latency and ignores client mistakes.
- Task B: Propose an SLO target and window.
- Task C: Suggest a paging condition linked to budget burn or sustained p95 latency.
Hint
Filter out 4xx, include only successful responses, use a percentile threshold like 300 ms.
Solution
SLI: requests_under_300ms_success / successful_requests. SLO: 95% under 300 ms over 7 days. Paging: page when short-window burn (e.g., 30-minute and 2-hour windows) is high enough to threaten the 7-day SLO, or when p95 latency stays above 300 ms for 30 minutes.
Practical projects
- Instrument one service: add counters for total/successful requests and a latency histogram; publish a 30-day availability and a 7-day latency SLI panel.
- Define SLOs for a queue worker: 99% jobs complete within 5 minutes; add age and success metrics; configure two burn-rate alerts.
- Run a game day: simulate a dependency outage; observe burn rate, alerts, and document policy decisions.
Learning path
- Start: availability SLO for your most critical endpoint.
- Next: latency SLO on success-only requests.
- Then: background processing SLO (throughput/age).
- Finally: introduce burn-rate alerting and a written error budget policy.
Mini challenge
Your service depends on an upstream that randomly adds 100 ms latency 5% of the time.
- Propose an SLI and SLO that keep user experience good without over-alerting.
- Decide whether to include upstream-caused latency in your SLI.
- Draft one short policy line triggered when 25% of budget is gone in the first week.
One possible direction
Include upstream-caused latency (users feel it). Choose 95% under 350 ms over 7 days; page when p95 exceeds 350 ms for 30 minutes and the burn rate crosses your threshold; policy: restrict deploys to fixes until the burn rate stabilizes.
Next steps
- Write down 1–2 SLOs for your service and share with your team.
- Implement one burn-rate alert and track it for a week.
- Schedule a 30-minute review to adjust thresholds based on real behavior.
Quick test
Take the quick test below to check your understanding.