Why this skill matters for Backend Engineers
Performance and reliability determine how fast and how consistently your backend serves users, especially under load and during failures. Mastering this skill lets you reduce latency, increase throughput, prevent cascading outages, and recover gracefully when things go wrong. You will design systems that are fast on good days and resilient on bad days.
- Deliver predictable latency and throughput.
- Protect core services with circuit breakers, bulkheads, and timeouts.
- Set and meet SLOs; escalate and resolve incidents effectively.
- Run lightweight postmortems to learn and improve.
What you will be able to do
- Profile APIs and databases to find bottlenecks within minutes.
- Reduce tail latency (p95/p99) using caching, batching, and backpressure.
- Use concurrency safely to scale CPU and I/O work.
- Add circuit breakers, bulkheads, and graceful fallbacks to prevent meltdowns.
- Define SLOs/SLAs/SLIs and build basic alerts around them.
- Handle incidents, run postmortems, and prioritize fixes.
Practical roadmap
- Measure first: Add timing around critical paths; collect p50/p90/p99 latency, throughput, error rate.
- Profile and fix 1–2 bottlenecks: Use language profilers and DB EXPLAIN to eliminate low-hanging fruit.
- Stabilize under failure: Add timeouts, retries with jitter, circuit breakers, and bulkheads.
- Optimize concurrency: Use worker pools, async I/O, and safe parallelism; add backpressure.
- Set SLOs and alerts: Define user-centric SLOs; wire up basic alerting and dashboards.
- Practice incidents: Create runbooks, do a game-day, and run a lightweight postmortem.
Quick glossary
- Latency: Time to handle a request (p50/p95/p99 are percentiles).
- Throughput: Requests handled per second/minute.
- SLI: Quantified measurement (e.g., successful requests %).
- SLO: Target for an SLI (e.g., 99.9% success over 30 days).
- Error budget: Allowed amount of failure before the SLO is breached (e.g., 99.9% over 30 days leaves roughly 43 minutes of full downtime).
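To make these terms concrete, here is a minimal measurement sketch (Python, standard library only; names like timed_call and report are illustrative) that records per-request durations and reports p50/p95/p99 latency, throughput, and error rate over a window.
# Python (illustrative measurement sketch)
import time
import statistics

durations_ms = []  # per-request latencies collected over the window
errors = 0

def timed_call(handler, *args):
    # Wrap a handler: record its latency and count failures.
    global errors
    start = time.perf_counter()
    try:
        return handler(*args)
    except Exception:
        errors += 1
        raise
    finally:
        durations_ms.append((time.perf_counter() - start) * 1000)

def report(window_seconds):
    qs = statistics.quantiles(durations_ms, n=100)  # qs[49]=p50, qs[94]=p95, qs[98]=p99
    total = len(durations_ms)
    print(f"p50={qs[49]:.1f}ms p95={qs[94]:.1f}ms p99={qs[98]:.1f}ms "
          f"throughput={total / window_seconds:.1f} req/s error_rate={errors / total:.2%}")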
Worked examples
1) Find a slow database query with EXPLAIN
Symptom: An API endpoint takes 800 ms. Hypothesis: the query filtering on user_id is missing an index.
# SQL (PostgreSQL)
EXPLAIN ANALYZE
SELECT * FROM orders
WHERE user_id = 123
  AND created_at > now() - interval '30 days'
ORDER BY created_at DESC
LIMIT 50;
If the plan shows a Seq Scan on orders, add a composite index:
# SQL
CREATE INDEX CONCURRENTLY idx_orders_user_created ON orders(user_id, created_at DESC);
Re-run EXPLAIN ANALYZE. Expect Index Scan and latency drop (e.g., 800 ms → 40 ms). Watch write performance and index size when adding indexes.
2) Add a simple in-memory cache with TTL
Symptom: Repeated reads of a popular item cause load and p99 spikes.
# Python (naive in-process cache)
from time import time

cache = {}

def get_with_cache(key, loader, ttl=30):
    # Return a cached value while it is still fresh; otherwise load, cache, and return it.
    now = time()
    entry = cache.get(key)           # entry = (value, expires_at)
    if entry and now < entry[1]:
        return entry[0]
    value = loader()
    cache[key] = (value, now + ttl)  # keep for ttl seconds
    return value
Use for hot but not mission-critical data. For multiple instances or larger sets, move to a shared cache like Redis and add negative caching for known misses.
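If you do move to a shared cache, the same pattern carries over. A minimal sketch, assuming a local Redis and the redis-py client (the key layout and sentinel value are illustrative):
# Python (shared cache with TTL and negative caching; assumes redis-py)
import json
import redis

r = redis.Redis()   # assumes Redis on localhost:6379
MISS = "__miss__"   # sentinel stored for known misses

def get_with_shared_cache(key, loader, ttl=30, miss_ttl=5):
    cached = r.get(key)
    if cached is not None:
        decoded = json.loads(cached)
        return None if decoded == MISS else decoded
    value = loader()
    if value is None:
        # Negative caching: remember the miss briefly so repeated lookups
        # for missing items do not keep hitting the database.
        r.setex(key, miss_ttl, json.dumps(MISS))
        return None
    r.setex(key, ttl, json.dumps(value))
    return value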
3) Worker pool with backpressure
Goal: Limit concurrency to protect downstream services and avoid queue explosion.
# Go
package main

import (
	"fmt"
	"time"
)

func worker(id int, jobs <-chan int, results chan<- string) {
	for j := range jobs {
		time.Sleep(50 * time.Millisecond) // simulate work
		results <- fmt.Sprintf("job %d done by %d", j, id)
	}
}

func main() {
	jobs := make(chan int, 100) // bounded queue = backpressure
	results := make(chan string, 100)
	for w := 0; w < 8; w++ { // fixed-size worker pool
		go worker(w, jobs, results)
	}
	// Produce in a separate goroutine so main can drain results;
	// sending and draining sequentially would deadlock once both buffers fill.
	go func() {
		for j := 0; j < 1000; j++ {
			jobs <- j // blocks when the queue is full: that is the backpressure
		}
		close(jobs)
	}()
	for i := 0; i < 1000; i++ {
		<-results // drain every result
	}
}
Tune pool size using CPU cores and downstream limits. Use bounded queues to avoid unbounded memory growth.
4) Circuit breaker with fallback
Goal: Stop hammering a failing dependency and return a degraded response.
# Python (circuit breaker with fallback)
import time
import requests

FAILURE_THRESHOLD = 5  # consecutive failures before opening
OPEN_SECONDS = 10      # how long to fail fast before allowing a probe

state = "CLOSED"       # CLOSED = normal, OPEN = fail fast, HALF_OPEN = single probe
failures = 0
opened_at = 0.0

def call_downstream(url, fallback):
    global state
    # After OPEN_SECONDS in OPEN, allow one probe request (HALF_OPEN).
    if state == "OPEN" and time.time() - opened_at >= OPEN_SECONDS:
        state = "HALF_OPEN"
    if state == "OPEN":
        return fallback()
    try:
        resp = requests.get(url, timeout=0.3)  # 300 ms timeout
        resp.raise_for_status()
        record_success()
        return resp
    except Exception:
        record_failure()
        return fallback()

def record_failure():
    global state, failures, opened_at
    failures += 1
    # Trip on repeated failures, or re-open immediately if the probe failed.
    if (state == "CLOSED" and failures >= FAILURE_THRESHOLD) or state == "HALF_OPEN":
        state = "OPEN"
        opened_at = time.time()

def record_success():
    global state, failures
    failures = 0
    if state == "HALF_OPEN":
        state = "CLOSED"
Always pair retries with timeouts and jitter. Make fallbacks acceptable but clearly degraded.
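On the retry side, a minimal sketch of bounded retries with a per-attempt timeout, exponential backoff, and full jitter (Python; the attempt counts and delays are illustrative, and it assumes the call is idempotent):
# Python (bounded retries with timeout, exponential backoff, and full jitter)
import random
import time
import requests

def get_with_retries(url, attempts=3, base_delay=0.1, timeout=0.3):
    for attempt in range(attempts):
        try:
            resp = requests.get(url, timeout=timeout)  # per-attempt timeout
            resp.raise_for_status()
            return resp
        except Exception:
            if attempt == attempts - 1:
                raise  # out of attempts: surface the error
            # Full jitter: sleep a random amount up to the exponential cap.
            time.sleep(random.uniform(0, base_delay * (2 ** attempt)))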
5) Define an SLO and a simple error-budget alert
Context: Checkout API SLO: 99.9% of requests succeed over 30 days. Error budget = 0.1%.
- SLI: success_rate = successful_requests / total_requests.
- Alert: Page when success_rate over the last hour drops below 99.5% and the 6-hour trend is worsening.
- Action: Pause risky deploys when the error budget is being spent too fast.
Choosing thresholds
Use burn-rate alerts: fast burn (e.g., 14.4× over 5 minutes) and slow burn (e.g., 2× over 6 hours) to catch both sharp spikes and smoldering issues.
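As a sketch of the arithmetic behind those thresholds (Python; the window sizes and multipliers mirror the numbers above and are illustrative):
# Python (burn-rate check for a 99.9% SLO; thresholds are illustrative)
SLO = 0.999
ERROR_BUDGET = 1 - SLO  # 0.1%

def burn_rate(errors, total):
    # 1.0 means the window consumes budget exactly at the sustainable rate.
    return 0.0 if total == 0 else (errors / total) / ERROR_BUDGET

def should_page(errors_5m, total_5m):
    return burn_rate(errors_5m, total_5m) >= 14.4  # fast burn: page immediately

def should_ticket(errors_6h, total_6h):
    return burn_rate(errors_6h, total_6h) >= 2.0   # slow burn: open a ticket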
Drills and exercises
- Add timing logs to one endpoint and report p50/p95/p99 and throughput for a 10-minute window.
- Run EXPLAIN on your top 3 queries; add or adjust one index; measure impact.
- Add a 300 ms timeout and retry-with-jitter to one outbound HTTP call.
- Implement a small worker pool with a bounded queue and measure queue wait time under load.
- Define one user-visible SLO and create a basic burn-rate alert.
- Write a 1-page runbook: how to detect, mitigate, and roll back a failed dependency.
- Run a 30-minute game-day: kill a dependency and verify your circuit breaker plus fallback work.
Common mistakes and debugging tips
Mistake: Optimizing without measuring
Always gather baseline metrics first. Add request timing, query timing, and error counters. Compare before/after.
Mistake: Infinite retries or no timeouts
Retries without timeouts amplify incidents. Use timeouts plus bounded retry count with jittered backoff.
Mistake: Unbounded queues
They hide overload until memory runs out. Use bounded queues and drop or shed load with clear errors when full.
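A minimal sketch of explicit load shedding with a bounded queue (Python, standard library; the queue size and the caller's error handling are illustrative):
# Python (bounded queue with explicit load shedding)
import queue

work_queue = queue.Queue(maxsize=100)  # bounded: overload becomes visible, not silent

def submit(job):
    try:
        work_queue.put_nowait(job)     # never block the caller
        return True
    except queue.Full:
        # Shed load with a clear signal (e.g., return HTTP 429 upstream)
        # instead of letting an unbounded backlog eat memory.
        return False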
Mistake: Only looking at averages
Averages hide the tail; p95 and p99 drive user perception. Track tail latency and fix outliers first.
Debugging tip: Is it CPU, I/O, or lock contention?
Use a profiler and system metrics. High CPU and low I/O → algorithmic hot path. High I/O wait → disk/network. Threads blocked → lock contention or too much synchronization.
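For a quick CPU-side answer, the standard-library profiler is often enough. A sketch (Python; handle_request is a stand-in for your suspected hot path):
# Python (quick CPU profile of a suspected hot path)
import cProfile
import pstats

def handle_request():
    return sum(i * i for i in range(10_000))  # stand-in for your hot path

profiler = cProfile.Profile()
profiler.enable()
for _ in range(1000):
    handle_request()
profiler.disable()
pstats.Stats(profiler).sort_stats("cumulative").print_stats(15)  # top 15 by cumulative time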
Mini project: Resilient product recommendations
Build a small service that fetches product details and recommendations and returns a combined response. Requirements:
- Two downstream calls: products and recommendations.
- Timeout each call at 300 ms; retry recommendations at most 2 times with jitter.
- Circuit breaker for recommendations; fallback returns top-selling products from cache.
- Use a worker pool to precompute popular recommendations every minute.
- Expose /health and a /metrics-like endpoint (simple counters and timings).
- Define one SLO: 99% requests under 500 ms. Provide a basic burn-rate alert rule.
Acceptance checklist
- Service returns degraded but useful response when recommendations are down.
- p95 <= 500 ms under nominal load; no unbounded queues.
- Logs/metrics show timeouts, retry counts, and circuit state.
- A 1-page runbook and a 30-minute postmortem template are included.
Subskills
- Profiling And Bottleneck Analysis — Find slow code and queries using profilers and EXPLAIN; choose fixes with the best impact-to-effort ratio.
- Latency And Throughput Optimization — Reduce p95/p99 with caching, batching, pagination, and compression; increase throughput safely.
- Concurrency And Parallelism Basics — Use async I/O and worker pools; avoid races; apply backpressure.
- Circuit Breakers And Bulkheads Basics — Isolate failures, fail fast, and limit blast radius across services.
- Graceful Degradation And Fallbacks — Return partial results or cached data when dependencies fail.
- SLO SLA Concepts — Define SLIs/SLOs, track error budgets, and set burn-rate alerts.
- Incident Response Basics — Create runbooks, escalation paths, and practice game-days.
- Postmortems Basics — Blameless reviews, clear actions, owners, and due dates to prevent repeats.
Learning path
- Instrument and measure: add latency/throughput/error metrics and basic logs.
- Profile top endpoints and queries; ship 1–2 high-impact optimizations.
- Add timeouts, retries with jitter, and circuit breakers to one critical path.
- Introduce bounded queues and worker pools; test under load.
- Define an SLO and add a burn-rate alert; build a simple dashboard.
- Write a runbook and run a game-day; follow with a short postmortem.
Who this is for
- Backend engineers building APIs, services, or data pipelines.
- Platform/SRE-adjacent developers improving stability and latency.
- Developers preparing for system design interviews and on-call rotations.
Prerequisites
- Comfort with one backend language (e.g., Go, Java, Python, Node).
- Basic HTTP, REST/JSON, and database fundamentals (SQL or NoSQL).
- Familiarity with logs and simple metrics.
Practical projects you can build
- Rate-limited file upload service with worker pool and backpressure.
- Cache-first product catalog with TTL and negative caching.
- Search API with pagination, batching, and tail-latency reduction.
- Feature service using circuit breakers, bulkheads, and graceful fallbacks.
Quick Q&A
How do I pick a pool size?
Start with CPU cores for CPU-bound work or a multiple of cores for I/O-bound tasks. Tune with load tests and downstream limits.
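A first-guess sketch (Python; the 4× multiplier for I/O-bound work is a rule of thumb to confirm with load tests, not a fixed rule):
# Python (first-guess pool sizes; tune with load tests and downstream limits)
import os
from concurrent.futures import ProcessPoolExecutor, ThreadPoolExecutor

cores = os.cpu_count() or 4
cpu_pool = ProcessPoolExecutor(max_workers=cores)    # CPU-bound: about one worker per core
io_pool = ThreadPoolExecutor(max_workers=cores * 4)  # I/O-bound: workers mostly wait, so oversubscribe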
Retries: how many and how fast?
Use 2–3 attempts with exponential backoff and jitter. Never retry non-idempotent operations unless safe.
How to estimate capacity?
Measure current throughput and CPU/memory usage at p95. Use load tests to find safe headroom (e.g., 50–70% of max).
Next steps
- Work through the subskills below to build depth in each area.
- Complete the mini project and verify your SLO under load.
- Take the skill exam to validate understanding.