Chaos And Resilience Testing Basics

Learn Chaos And Resilience Testing Basics for free with explanations, exercises, and a quick test (for Platform Engineers).

Published: January 23, 2026 | Updated: January 23, 2026

Why this matters

Chaos and resilience testing helps Platform Engineers prove that systems can survive real failures before customers see them. In day-to-day work, you will verify failover of services and databases, test autoscaling and timeouts under load, and validate that alerts, dashboards, and runbooks guide fast recovery. Practicing this builds confidence, reduces incident impact, and turns unknowns into measurable risks.

Concept explained simply

Chaos engineering is the disciplined practice of injecting controlled failures to learn how systems behave. You define a steady state (normal behavior), form a hypothesis about how the system should handle a failure, run a small experiment, observe, and then improve. Resilience is the system's ability to continue meeting user expectations despite faults.

Mental model

Think of a wind tunnel for software. You send controlled blasts (faults) at your system while watching gauges (SLIs) to see if it stays stable. You start with a light breeze (small blast radius) and turn up the wind only when it's safe.

Core building blocks

  • Steady state: measurable indicators of normal behavior (e.g., 99th percentile latency under 400 ms; error rate under 1%).
  • Hypothesis: what you expect to happen (e.g., "If one pod dies, traffic shifts and error rate stays under 1%").
  • Blast radius: how much you expose to the experiment; keep it small at first.
  • Abort conditions: thresholds that stop the experiment to protect users (e.g., p99 latency exceeds 800 ms for 2 minutes).
  • Fault types: latency, errors, resource exhaustion (CPU/memory/disk), dependency outages, network partitions.
  • Observability: dashboards, logs, traces, alerts, and SLOs/SLIs to measure impact.
  • Game day: a scheduled, collaborative session running planned experiments with a clear rollback plan.
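
These building blocks fit naturally into a single, reviewable experiment definition. The sketch below shows one possible shape in Python; the class, field names, and thresholds are illustrative rather than a standard format, so adapt them to your own SLIs and tooling.

```python
from dataclasses import dataclass, field

@dataclass
class AbortCondition:
    """Stop the experiment when a metric breaches a threshold for too long."""
    metric: str              # e.g. "p95_latency_ms" or "error_rate_pct"
    threshold: float         # value that counts as a breach
    sustained_seconds: int   # how long the breach must last before aborting

@dataclass
class ChaosExperiment:
    """One small experiment, written down before anything is injected."""
    name: str
    steady_state: dict       # metric -> acceptable bound, e.g. {"p95_latency_ms": 300}
    hypothesis: str
    fault: str               # e.g. "terminate one checkout pod"
    blast_radius: str        # e.g. "1 of 6 pods, canary pool only"
    abort_conditions: list = field(default_factory=list)
    rollback: str = "stop the fault injector and restore the affected instance"

# Example: the single-pod-loss experiment described in the worked examples below.
pod_loss = ChaosExperiment(
    name="checkout-single-pod-loss",
    steady_state={"p95_latency_ms": 300, "error_rate_pct": 0.5},
    hypothesis="If one instance is terminated, the load balancer reroutes and SLOs stay green.",
    fault="terminate one checkout pod",
    blast_radius="1 of 6 pods, production canary pool only",
    abort_conditions=[
        AbortCondition("p95_latency_ms", 700, sustained_seconds=120),
        AbortCondition("error_rate_pct", 2.0, sustained_seconds=60),
    ],
)
```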

A safe workflow (step-by-step)

  1. Pick a tiny blast radius (1% traffic or a staging environment).
  2. Define steady state and a clear hypothesis.
  3. Choose one fault to inject (e.g., kill one pod).
  4. Set explicit abort conditions and rollback steps.
  5. Run the experiment during a staffed window; announce in chat.
  6. Observe SLIs and logs; record timings and outcomes.
  7. Stop, document, fix gaps (timeouts, retries, alerts, runbooks).
  8. Repeat with slightly larger scope only after fixes are verified.
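
Steps 4–7 of this workflow can be expressed as a small control loop. The sketch below assumes you supply your own `inject_fault`, `rollback`, and `read_sli` callables wired to your platform and monitoring; it is a minimal illustration of the control flow, not a production harness.

```python
import time

def run_experiment(inject_fault, rollback, read_sli, abort_conditions,
                   duration_s=180, poll_s=10):
    """Inject one fault, watch abort conditions, and always roll back.

    inject_fault / rollback: zero-argument callables you provide.
    read_sli: callable returning current SLIs, e.g. {"p95_latency_ms": 310, "error_rate_pct": 0.4}.
    abort_conditions: {metric: max_allowed_value}; simplified so any breach aborts immediately.
    """
    log = []
    inject_fault()
    try:
        deadline = time.time() + duration_s
        while time.time() < deadline:
            slis = read_sli()
            log.append((time.time(), slis))          # record timings and outcomes for the write-up
            breached = {m: v for m, v in slis.items()
                        if m in abort_conditions and v > abort_conditions[m]}
            if breached:
                print(f"ABORT: {breached} exceeded {abort_conditions}")
                break
            time.sleep(poll_s)
    finally:
        rollback()                                   # always restore, even if observation code fails
    return log

# Example wiring with stubbed metrics (replace the lambda with your monitoring client):
# run_experiment(kill_one_pod, restore_pod,
#                lambda: {"p95_latency_ms": 310, "error_rate_pct": 0.4},
#                abort_conditions={"p95_latency_ms": 700, "error_rate_pct": 2.0})
```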

Checklist: readiness before any chaos experiment
  • Dashboards show request rate, latency, error rate, saturation (CPU/mem), and dependency health.
  • Clear SLOs/SLIs and alert thresholds exist.
  • Runbook with rollback steps and on-call contact is ready.
  • Staging or canary environment available, or production blast radius under 1%.
  • Abort conditions documented.
  • Change freeze windows respected; stakeholders notified.
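
The measurable parts of this checklist can be encoded as a pre-run gate. A sketch, assuming the same kind of `read_sli` metrics reader as in the workflow sketch above; the 1% blast-radius guideline comes from this lesson, not from any particular tool.

```python
def preflight_ok(read_sli, steady_state, abort_conditions, blast_radius_pct):
    """Refuse to start unless the baseline is green and the guardrails are written down.

    steady_state / abort_conditions: {metric: threshold} dicts; blast_radius_pct: 0-100.
    """
    baseline = read_sli()
    problems = []
    for metric, bound in steady_state.items():
        if baseline.get(metric, float("inf")) > bound:
            problems.append(f"baseline {metric}={baseline.get(metric)} is already above {bound}")
    if not abort_conditions:
        problems.append("no abort conditions documented")
    if blast_radius_pct > 1:
        problems.append(f"blast radius {blast_radius_pct}% exceeds the 1% guideline")
    for p in problems:
        print("NOT READY:", p)
    return not problems
```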

Worked examples

Example 1: Losing one app instance
  • Steady state: p95 latency under 300 ms; error rate under 0.5% for the checkout service.
  • Hypothesis: If one instance is terminated, the load balancer reroutes and SLOs remain green.
  • Fault: Terminate one pod/instance of checkout.
  • Blast radius: 1 of 6 pods; production canary pool only.
  • Abort: p95 latency over 700 ms for 2 minutes or error rate over 2% for 1 minute.
  • Observe: Autoscaler reaction, retry behavior, circuit breaker status.
  • Outcome: Latency spike to 380 ms for 30 s (acceptable). No alert fired. Improvement: tune autoscaler to add capacity sooner.
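
One way to implement the fault in this example is with the official Kubernetes Python client. A minimal sketch, assuming a hypothetical `prod-canary` namespace, an `app=checkout` label, and credentials that allow pod deletion; the Deployment's ReplicaSet is expected to recreate the pod, which is exactly what you observe.

```python
import random
from kubernetes import client, config  # pip install kubernetes

def kill_one_checkout_pod(namespace="prod-canary", label_selector="app=checkout"):
    """Terminate a single randomly chosen pod matching the label (the fault for Example 1)."""
    config.load_kube_config()            # or config.load_incluster_config() inside the cluster
    v1 = client.CoreV1Api()
    pods = v1.list_namespaced_pod(namespace, label_selector=label_selector).items
    if not pods:
        raise RuntimeError(f"no pods found for {label_selector} in {namespace}")
    victim = random.choice(pods)
    print(f"Terminating {victim.metadata.name}")
    v1.delete_namespaced_pod(victim.metadata.name, namespace)
```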

Example 2: Database latency spike
  • Steady state: API p99 under 500 ms, error rate under 1%.
  • Hypothesis: When DB latency rises by 400 ms, API timeouts prevent thread exhaustion and the service degrades gracefully.
  • Fault: Inject 400 ms latency into DB queries for 60 s.
  • Blast radius: Only staging, or a single read-replica in prod canary.
  • Abort: API p99 over 1 s for 90 s, or queue depth exceeds 2x baseline.
  • Outcome: Threads saturated because the DB timeout was 10 s. Fix: reduce the DB timeout to 1 s, add bulkhead limits, and implement a fallback for read endpoints (sketched below).
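
The fixes from this example (a 1 s DB timeout, bulkhead limits, and a fallback for reads) can look roughly like the sketch below; the pool size, cache, and function names are illustrative.

```python
import threading
from concurrent.futures import ThreadPoolExecutor

_db_pool = ThreadPoolExecutor(max_workers=8)   # bulkhead: bounded pool for DB work
_bulkhead = threading.Semaphore(8)             # reject extra work instead of queueing it forever
_read_cache = {}                               # illustrative stale-read fallback store

def query_with_fallback(run_query, cache_key, timeout_s=1.0):
    """Run a DB query with a short timeout and a bulkhead; serve stale data on failure."""
    if not _bulkhead.acquire(blocking=False):
        return _read_cache.get(cache_key)      # bulkhead full: degrade immediately
    try:
        future = _db_pool.submit(run_query)
        result = future.result(timeout=timeout_s)  # fail fast instead of holding request threads for 10 s
        _read_cache[cache_key] = result
        return result
    except Exception:                          # includes the future's TimeoutError
        return _read_cache.get(cache_key)      # graceful degradation for read endpoints
    finally:
        _bulkhead.release()                    # note: a timed-out query still finishes in the pool thread
```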

Example 3: Partial network partition
  • Steady state: Message processing rate of 500 msg/s; dead-letter rate under 0.1%.
  • Hypothesis: If the network from workers to the broker is throttled, workers back off and the dead-letter rate stays under 0.5%.
  • Fault: Limit egress bandwidth from worker nodes to 10% for 3 minutes.
  • Blast radius: One node pool in staging.
  • Abort: Dead-letter over 1% or consumer lag over 10x baseline.
  • Outcome: DLQ spiked to 2%. Fix: adjust retry jitter, increase consumer concurrency, add alert for lag slope, and test idempotency handling.
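
The "adjust retry jitter" fix from this example usually means capped exponential backoff with full jitter, so throttled workers stop retrying in synchronized waves. A minimal sketch; the attempt limits are placeholders.

```python
import random
import time

def retry_with_jitter(operation, max_attempts=5, base_s=0.2, cap_s=10.0):
    """Retry a flaky operation with capped exponential backoff and full jitter."""
    for attempt in range(1, max_attempts + 1):
        try:
            return operation()
        except Exception:
            if attempt == max_attempts:
                raise                                   # let the message reach the DLQ after real retries
            backoff = min(cap_s, base_s * (2 ** attempt))
            time.sleep(random.uniform(0, backoff))      # full jitter: sleep a random slice of the backoff
```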

Who this is for

  • Platform Engineers and SREs validating reliability.
  • Backend Engineers owning services with production SLOs.
  • QA/Testing Engineers adding failure modes to test plans.

Prerequisites

  • Basic understanding of microservices or distributed systems.
  • Observability basics: metrics, logs, traces; familiarity with SLOs/SLIs.
  • Ability to deploy to a staging or canary environment.
  • Comfort with rollbacks and incident communication.

Learning path

  • Start here: define steady state and run a tiny fault injection in staging.
  • Next: introduce timeouts, retries, circuit breakers, and bulkheads.
  • Then: practice game days with cross-team participation.
  • Advance: test regional failover, stateful systems, and data durability.
  • Ongoing: automate recurring experiments in CI or scheduled jobs.

Common faults to test first

  • Process kill: terminate a single instance or pod.
  • Latency injection: add delay to a dependency call.
  • Error injection: force 5xx from a downstream dependency.
  • Resource stress: CPU or memory pressure on one node.
  • Network shaping: throttle bandwidth or add packet loss.
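
For latency and error injection at the application level, a thin wrapper around dependency calls is often enough to start. A sketch, with fault rates read from hypothetical environment variables so that the default is "no fault injected":

```python
import functools
import os
import random
import time

def chaos(call):
    """Wrap a dependency call with opt-in latency and error injection.

    CHAOS_LATENCY_MS and CHAOS_ERROR_RATE are hypothetical knobs; leaving them unset disables the fault.
    """
    @functools.wraps(call)
    def wrapper(*args, **kwargs):
        delay_ms = float(os.getenv("CHAOS_LATENCY_MS", "0"))
        error_rate = float(os.getenv("CHAOS_ERROR_RATE", "0"))
        if delay_ms:
            time.sleep(delay_ms / 1000.0)                             # latency injection
        if error_rate and random.random() < error_rate:
            raise RuntimeError("chaos: injected dependency failure")  # error injection
        return call(*args, **kwargs)
    return wrapper

@chaos
def fetch_inventory(sku):       # illustrative dependency call
    return {"sku": sku, "stock": 7}
```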

Common mistakes and how to self-check

  • Skipping steady state: If you cannot measure normal behavior, pause and build dashboards first.
  • Blast radius too large: start smaller. If unsure, use staging or a 0.5–1% traffic canary.
  • No abort conditions: Write them down and assign someone to watch them live.
  • Unclear ownership: Name a commander, an observer, and an executor for the experiment.
  • Not capturing learnings: Create an experiment record; include metrics, timeline, decisions, and fixes.
  • Testing only stateless services: Include stateful components (DBs, caches, queues) gradually.

Self-check prompt
  • Can I state the steady state in one sentence with concrete numbers?
  • Do I know exactly when and how the experiment will stop?
  • Can I roll back within 1–2 minutes?
  • Will someone on-call be present and informed?
  • Is customer impact extremely unlikely with my chosen blast radius?

Practical projects

  • Project 1: Add timeouts and retries to a service, then run a latency injection test to verify behavior.
  • Project 2: Create a one-page runbook with rollback and abort conditions; rehearse a 15-minute game day.
  • Project 3: Build a dashboard focused on your SLO; include p95/p99 latency, error rate, saturation, and dependency health.
  • Project 4: Automate a weekly staging experiment that kills one pod and posts results to team chat.
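
Project 4 can be as small as a scheduled job that runs the pod-kill experiment and posts a one-line summary to team chat. A sketch using the standard library and a hypothetical Slack-style incoming webhook URL:

```python
import json
import urllib.request

def post_result(webhook_url, experiment, passed, notes):
    """Post a short experiment summary to a chat incoming webhook (Slack-style JSON payload)."""
    text = f"Chaos run '{experiment}': {'PASSED' if passed else 'FAILED'} - {notes}"
    body = json.dumps({"text": text}).encode("utf-8")
    req = urllib.request.Request(webhook_url, data=body,
                                 headers={"Content-Type": "application/json"})
    urllib.request.urlopen(req)  # fire-and-forget; add error handling and retries for real use

# Example (hypothetical URL):
# post_result("https://hooks.slack.com/services/T000/B000/XXXX",
#             "checkout-single-pod-loss", passed=True, notes="p95 peaked at 380 ms")
```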

Mini tasks

  • Write a steady state for your most critical endpoint in one sentence.
  • List two abort conditions you would use for a small production canary.
  • Identify one dependency to isolate with a circuit breaker.

Exercises

The following exercises are also available below this lesson. Everyone can take them; only logged-in users will have their progress saved.

  1. Exercise 1: Design a tiny, safe chaos experiment (see details in the Exercises section below).
  2. Exercise 2: Map SLOs to abort conditions and observability checks.

Pre-run checklist for exercises
  • Have a staging or canary target ready.
  • Ensure dashboards show error rate and high-percentile latency.
  • Write a rollback step you can execute in under 2 minutes.

Next steps

  • Expand to dependency-level tests (databases, caches, queues) with stricter aborts.
  • Run a monthly game day and rotate ownership to spread knowledge.
  • Automate experiment definitions as code and integrate with CI for smoke chaos in staging.
  • Track reliability improvements over time: reduction in MTTR, fewer customer-facing incidents, improved SLO compliance.

Ready to check your understanding?

Try the quick test below. Everyone can take it; only logged-in users will see saved progress and stats.

Practice Exercises

2 exercises to complete

Instructions

Your task: propose a minimal experiment to validate instance failure resilience for a single service.

  • Define steady state with 2–3 measurable SLIs.
  • Write one hypothesis.
  • Choose one fault to inject and the exact blast radius.
  • List 2–3 abort conditions and a rollback step.

Keep it safe: assume staging or a production canary of at most 1% traffic.

Expected Output
A short plan (5–8 bullet points) that includes steady state, hypothesis, fault, blast radius, abort conditions, and rollback.

Chaos And Resilience Testing Basics — Quick Test

Test your knowledge with 8 questions. Pass with 70% or higher.
