Why this matters
Feature flags let you deploy code continuously while controlling who sees changes. This reduces risk, speeds up releases, and enables safe experiments.
- Turn features on/off instantly (kill switches) without redeploying.
- Roll out gradually (1% → 5% → 25% → 100%) and monitor impact.
- Target specific users, regions, or environments.
- Run A/B tests to validate product and performance assumptions.
Who this is for
- Backend engineers shipping APIs and services with CI/CD.
- Platform/DevOps engineers enabling safe deploys.
- Any engineer who wants to decouple deployment from release.
Prerequisites
- Basic understanding of deployments and environments (dev, staging, prod).
- Comfort with reading simple code/config files.
- Familiarity with metrics/logs and incident response.
Concept explained simply
A feature flag is a conditional switch in code that decides whether to execute a new path (feature) for some users or conditions. You ship the code dark, keep it off by default, then turn it on in a controlled way.
Mental model
Think of a traffic metering light on a freeway on-ramp. Cars (users) are allowed in gradually to avoid traffic jams (incidents). You can stop the flow instantly (kill switch) if there’s a crash (error spike).
Core building blocks
- Flag key: stable identifier, e.g., "checkout.v2".
- Description and owner: why it exists; who cleans it up.
- Default: off by default for safety.
- Targeting rules: who/when sees the feature (segments, percentages, attributes).
- Environments: separate settings for dev/staging/prod.
- Variations: not just on/off; can be values like numbers or strings.
- Evaluation: where the decision happens (server or client).
- Telemetry: logs/metrics to measure impact.
- Expiration: date to remove the flag once done.
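The building blocks above can be modeled as a small data structure. A minimal sketch in Python; the field names are illustrative, not any specific vendor's schema:

```python
from dataclasses import dataclass, field
from datetime import date
from typing import Optional

@dataclass
class FeatureFlag:
    key: str                        # stable identifier, e.g. "checkout.v2"
    description: str                # why the flag exists
    owner: str                      # who is responsible for cleanup
    default: bool = False           # off by default for safety
    percent: int = 0                # percentage rollout, 0-100
    environments: dict = field(default_factory=dict)  # per-env overrides
    expires: Optional[date] = None  # removal date once fully released

# Example definition for a checkout rewrite (dates and team name are made up)
checkout_v2 = FeatureFlag(
    key="checkout.v2",
    description="New checkout flow",
    owner="payments-team",
    expires=date(2026, 6, 30),
)
```

Keeping owner and expiry alongside the flag itself makes the later cleanup step auditable.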
Worked examples
Example 1 — Gradual rollout
We introduce a new pricing endpoint. It ships off by default and is progressively enabled for a percentage of traffic.
// Pseudocode
if (flags.enabledFor("pricing.v2", user.id, env="prod", percent=5, default=false)) {
  return pricingV2()
} else {
  return pricingV1()
}
Plan: 1% for 30 minutes, check error rate/latency; then 5%, 25%, 50%, 100%.
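The percentage check above relies on deterministic bucketing. A minimal sketch in Python, assuming a SHA-256 hash of the flag key plus user ID (the function name `enabled_for` is illustrative):

```python
import hashlib

def enabled_for(flag_key: str, user_id: str, percent: int) -> bool:
    """Deterministically bucket a user into [0, 100) and compare to percent."""
    digest = hashlib.sha256(f"{flag_key}:{user_id}".encode()).hexdigest()
    bucket = int(digest, 16) % 100
    return bucket < percent

# The same user always gets the same answer for a given flag
assert enabled_for("pricing.v2", "user-42", 5) == enabled_for("pricing.v2", "user-42", 5)

# Raising the percentage never removes a user who was already enabled
user = "user-42"
if enabled_for("pricing.v2", user, 5):
    assert enabled_for("pricing.v2", user, 25)
```

Because the bucket comparison is `bucket < percent`, raising the percentage only adds users; nobody flips back to the old version mid-rollout.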
Example 2 — Kill switch
A cache warmer occasionally overloads Redis. Wrap it with a flag for instant disable.
if (flags.isEnabled("infra.cacheWarmer", default=true)) {
  runCacheWarmer()
}
If Redis CPU spikes, flip the flag off without deploying.
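The kill-switch pattern can be sketched in Python; `flag_service` is a hypothetical client, and the point is that an unreachable flag service falls back to the stated default instead of crashing the job:

```python
def is_enabled(flag_service, key: str, default: bool) -> bool:
    """Evaluate a flag, falling back to the default if the service fails."""
    try:
        return flag_service.get_bool(key)
    except Exception:
        # Flag service unreachable: use the configured default
        return default

class DownService:
    """Simulates an unreachable flag service."""
    def get_bool(self, key):
        raise ConnectionError("flag service unreachable")

# With the service down, the cache warmer keeps its configured default
if is_enabled(DownService(), "infra.cacheWarmer", default=True):
    pass  # runCacheWarmer() would go here
```

The try/except boundary is what makes the flag safe to put in front of critical code paths.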
Example 3 — Configuration flag
Use a numeric variation to control a concurrency limit.
limit = flags.getNumber("search.maxConcurrent", default=8)
semaphore = new Semaphore(limit)
Adjust at runtime based on system load.
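Example 3 translated into runnable Python; here the flag store is just a dict standing in for a real flag service, and `get_number` is an illustrative helper:

```python
from threading import Semaphore

flags = {"search.maxConcurrent": 8}  # stand-in for a real flag store

def get_number(key: str, default: int) -> int:
    """Read a numeric variation, falling back to the default on bad values."""
    value = flags.get(key, default)
    return value if isinstance(value, int) else default

limit = get_number("search.maxConcurrent", default=8)
semaphore = Semaphore(limit)

def run_search(query: str) -> str:
    with semaphore:  # at most `limit` concurrent searches
        return f"results for {query}"
```

To adjust concurrency at runtime, a real implementation would re-read the flag and rebuild the semaphore, rather than mutating a live one.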
Implementation patterns
- Server-side evaluation (recommended for backend): evaluate on the server and log decisions; avoids exposing hidden features and allows centralized control.
- Caching: cache flag configs and refresh periodically; fall back to safe defaults if the flag service is unreachable.
- Deterministic bucketing: for percentage rollouts, hash a stable key (e.g., user ID) so users consistently see the same variation.
- Observability: tag logs/metrics with flag key and variation. Watch error rate, latency, CPU, and key business metrics.
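The caching pattern above can be sketched as a small TTL cache; `fetch_config` is an assumed callable standing in for a call to the flag service:

```python
import time

class FlagCache:
    """Cache flag configs; refresh after a TTL; keep the last good config on failure."""

    def __init__(self, fetch_config, ttl_seconds: float, defaults: dict):
        self._fetch = fetch_config     # callable returning the full flag config
        self._ttl = ttl_seconds
        self._config = dict(defaults)  # safe defaults until the first fetch
        self._fetched_at = None

    def get(self, key: str, default):
        now = time.monotonic()
        if self._fetched_at is None or now - self._fetched_at >= self._ttl:
            try:
                self._config = self._fetch()
                self._fetched_at = now
            except Exception:
                self._fetched_at = now  # back off; serve the last known config
        return self._config.get(key, default)

# Serves fetched values when the service is up...
cache = FlagCache(lambda: {"pricing.v2": True}, ttl_seconds=30, defaults={})

# ...and safe defaults when it is not.
def down():
    raise ConnectionError("flag service unreachable")

fallback = FlagCache(down, ttl_seconds=30, defaults={"pricing.v2": False})
```

Note that a failed refresh still advances the timestamp, so an outage produces one attempt per TTL window rather than a retry storm.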
Safe rollout steps
- Add code paths guarded by a flag. Keep default off.
- Ship to staging: enable 100% on staging; run tests and load checks.
- Prod canary: enable 1% (deterministic). Monitor technical and business metrics.
- Increase gradually if healthy: 1% → 5% → 25% → 50% → 100%.
- If issues arise: flip flag off immediately; investigate; fix; retry.
- After full release: remove the old path and delete the flag.
Risk checklist
- Is default safe (off)?
- Do we have kill switch and quick rollback?
- Are error budgets and thresholds defined?
- Is telemetry (metrics/logs/traces) in place?
- Is there an owner and a removal date?
Security and privacy
- Do not expose sensitive flags to clients; evaluate on server and send only the resulting behavior or non-sensitive config.
- Authenticate flag management; log all changes with actor and timestamp.
- Fail closed: if the flag service is unavailable, fall back to safe defaults.
Observability
- Emit counters: flag_evaluations{key,variation}, errors, and latencies.
- Correlate flag changes with incidents using change logs.
- Use SLOs to decide rollout pace and when to stop.
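The evaluation counter above can be sketched with a plain `Counter` keyed by (flag key, variation); a real setup would emit to your metrics library (Prometheus, StatsD, etc.):

```python
from collections import Counter

flag_evaluations = Counter()  # stands in for flag_evaluations{key,variation}

def record_evaluation(key: str, variation: str) -> None:
    flag_evaluations[(key, variation)] += 1

def evaluate(key: str, enabled: bool) -> bool:
    """Evaluate a flag and record which variation was served."""
    record_evaluation(key, "on" if enabled else "off")
    return enabled

evaluate("pricing.v2", True)
evaluate("pricing.v2", True)
evaluate("pricing.v2", False)
```

Tagging every evaluation with key and variation is what lets you split error rate and latency by variation during a rollout.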
Exercises
These mirror the interactive exercise below. Everyone can take them; only logged-in users will have progress saved.
Exercise 1 (ex1) — Design a safe rollout
Design a rollout plan for a new endpoint guarded by flag "orders.v3". Include: default, targeting, percent steps, metrics to watch, rollback criteria, and cleanup plan.
Hints
- Start with off by default.
- Pick deterministic bucketing (hash user ID).
- Define thresholds, e.g., error rate < 1%, p95 latency < 300ms.
- Checklist: Default off and safe fallback.
- Checklist: Monitoring dashboards ready.
- Checklist: Rollout steps and ownership defined.
- Checklist: Removal date scheduled.
Common mistakes
- Leaving flags forever: code rots. Self-check: does each flag have an owner and removal date?
- Client-side secrets: exposing hidden features. Self-check: are sensitive decisions evaluated on the server?
- Non-deterministic rollout: users flip between versions. Self-check: is bucketing based on stable IDs?
- No observability: you can’t see impact. Self-check: do logs/metrics include flag keys and variations?
- No safe default: outages when flag service fails. Self-check: are defaults and fallbacks defined?
Practical projects
- Add a kill switch around a heavy background job; simulate failure and flip it off.
- Implement percentage rollout using hashing on user ID; prove determinism in logs.
- Create a small config-driven flag file (JSON/YAML) with owner, default, rules, and an expiry note; write a linter that warns on expired flags.
Learning path
- Start: Feature Flags Basics (this lesson).
- Next: Rollout strategies, canary and blue/green.
- Then: Observability and incident response basics.
- Advanced: Experiment design and statistical guardrails.
Next steps
- Integrate flags into your CI/CD pipeline: enable on staging post-deploy, disable on incident.
- Automate flag cleanup reminders in code review.
- Pair with SLOs to drive rollout decisions.
Mini challenge
You deploy a new rate limiter behind a flag. After enabling at 10%, p95 latency improves but error rate rises from 0.2% to 1.6%. What do you do? Write a 3-step action plan (include kill switch, diagnostics you’d check, and a safer retry plan).
About progress and tests
The Quick Test is available to everyone. Only logged-in learners have their answers and progress saved.