Why this matters
Feature flags let API engineers ship safely: you can enable features for a subset of users, keep a kill switch for risky integrations, and separate deploy from release. This limits the blast radius of a bad change and shortens the feedback loop on new behavior.
- Release new behaviors gradually without breaking clients.
- Roll back instantly if metrics go red, without redeploying.
- Target tenants, regions, or traffic slices for experiments.
Who this is for
- API engineers and platform engineers introducing or improving flag-driven releases.
- Backend developers who need safe rollouts, canaries, or kill switches.
- Tech leads defining release policies and guardrails.
Prerequisites
- Comfort with HTTP APIs (status codes, headers, caching basics).
- Basic knowledge of server-side programming and configuration management.
- Familiarity with logs/metrics and reading dashboards.
Concept explained simply
A feature flag is a conditional switch in code that chooses between behaviors at runtime. Instead of deploying new code to change behavior, you flip a flag value. Server-side flags are evaluated on the backend, so clients see consistent, authoritative behavior.
Types you will use:
- Boolean: on/off (e.g., enable payment provider B).
- Multivariate: one of several values (e.g., pick algorithm A/B/C).
- Percentage rollout: enable for X% of traffic with stickiness (e.g., by user or tenant).
- Targeted: enable for a list (tenant IDs, regions, plans).
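The four types above can be sketched as one evaluation function. This is a minimal illustration, not any vendor's API: the rule shape (`type`, `value`, `percent`, `allow`), the sample rule values, and the `evaluate` and `bucketOf` helpers are all assumptions made for this example.

```javascript
// Hypothetical rule shapes for illustration; real flag systems define their own schema.
const rules = {
  'payments.providerB.enabled': { type: 'boolean', value: true },
  'ranking.algorithm': { type: 'multivariate', value: 'B' },
  'recs.beta_score.rollout': { type: 'percentage', percent: 25 },
  'search.fast.enabled': { type: 'targeted', allow: ['tenant-42', 'tenant-77'] },
};

// Map a non-negative integer ID to a stable bucket in [0, 100).
function bucketOf(numericId) {
  return numericId % 100;
}

// Evaluate one flag for a request context; unknown flags return the caller's fallback.
function evaluate(name, ctx, fallback) {
  const rule = rules[name];
  if (!rule) return fallback;
  switch (rule.type) {
    case 'boolean':
    case 'multivariate':
      return rule.value;
    case 'percentage':
      return bucketOf(ctx.userId) < rule.percent; // sticky: same user, same bucket
    case 'targeted':
      return rule.allow.includes(ctx.tenantId);
    default:
      return fallback;
  }
}

console.log(evaluate('ranking.algorithm', {}, 'A')); // B
console.log(evaluate('recs.beta_score.rollout', { userId: 1024 }, false)); // true (bucket 24 < 25)
console.log(evaluate('search.fast.enabled', { tenantId: 'tenant-42' }, false)); // true
```

Keeping the fallback as an explicit argument forces every call site to state what happens when the flag is missing.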
Mental model
Think of flags as circuit breakers and traffic routers:
- Decision point: "Which path should this request take?"
- Inputs: global defaults, environment, targeting rules, and identity (tenant/user/service).
- Outputs: the chosen behavior and telemetry events you emit to observe impact.
Quick visual model (text)
Request -> Identify (tenant/user/service) -> Evaluate flag rules -> Select behavior
-> Execute behavior -> Emit metrics/logs -> Cache decision (short TTL)
-> Fallback to safe default when evaluation fails
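The flow above can be sketched as a small wrapper around whatever rule engine or flag SDK you use. This is a sketch under assumptions: `createEvaluator`, the synchronous `fetchDecision` callback, and the injectable `now` clock are hypothetical names invented for this example.

```javascript
// Hypothetical helper: wraps flag evaluation with a short-TTL cache and a safe fallback.
function createEvaluator({ fetchDecision, ttlMs, now = Date.now }) {
  const cache = new Map(); // key -> { value, expiresAt }

  return function decide(flagName, identity, safeDefault) {
    const key = `${flagName}:${identity}`; // cache per identity so targeting stays correct
    const hit = cache.get(key);
    if (hit && hit.expiresAt > now()) return hit.value; // fresh cached decision

    try {
      const value = fetchDecision(flagName, identity); // evaluate flag rules
      cache.set(key, { value, expiresAt: now() + ttlMs });
      return value;
    } catch (err) {
      // Evaluation failed: prefer the last-known value, otherwise the safe default.
      return hit ? hit.value : safeDefault;
    }
  };
}

// Usage with a fake clock and a flag service that goes down.
let clock = 0;
let healthy = true;
const decide = createEvaluator({
  fetchDecision: () => {
    if (!healthy) throw new Error('flag service down');
    return true;
  },
  ttlMs: 60_000,
  now: () => clock,
});

console.log(decide('search.fast.enabled', 'tenant-42', false)); // true (evaluated)
healthy = false;
clock += 120_000; // cached entry has expired
console.log(decide('search.fast.enabled', 'tenant-42', false)); // true (last-known value)
console.log(decide('search.fast.enabled', 'tenant-99', false)); // false (safe default)
```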
Key patterns and decisions
- Evaluation location: Prefer server-side evaluation for APIs. Never trust client-side flags to protect sensitive paths.
- Stickiness: Use a stable key (tenantId, userId) to avoid flapping when doing percentage rollouts.
- Fallbacks: If the flag system is unavailable, use the last-known value or a safe default (usually the old, proven path) to preserve contract and availability.
- Observability: Emit counters for on/off path selection, errors, and latency per variant.
- Caching: Cache evaluated decisions briefly (e.g., 30–120 seconds) to reduce latency; key the cache on the same identity used for targeting so one tenant's or user's decision is never served to another.
- Compatibility: Flags should not break the public API contract. For contract-breaking changes, use versioning; use flags to prepare the path and gather signals.
- Security: Never expose raw internal flag names/values to clients; expose only intentional, stable behavior indicators if needed.
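The stickiness pattern above can be sketched with a deterministic hash, which also works when the stable key is a string rather than a number. The helper names `fnv1a` and `inRollout` are hypothetical, and FNV-1a is just one reasonable hash choice for illustration; this version hashes UTF-16 code units, so it matches byte-wise FNV-1a only for ASCII keys.

```javascript
// FNV-1a (32-bit): deterministic, so the same key always lands in the same bucket.
function fnv1a(str) {
  let hash = 0x811c9dc5; // FNV offset basis
  for (let i = 0; i < str.length; i++) {
    hash ^= str.charCodeAt(i);
    hash = Math.imul(hash, 0x01000193) >>> 0; // multiply by FNV prime, keep unsigned 32-bit
  }
  return hash;
}

// Salting the key with the flag name keeps buckets independent across flags,
// so the same tenants are not always first in line for every rollout.
function inRollout(flagName, stickyKey, percent) {
  const bucket = fnv1a(`${flagName}:${stickyKey}`) % 100;
  return bucket < percent;
}

// The same key gets the same answer on every call.
const first = inRollout('search.fast.enabled', 'tenant-42', 25);
const second = inRollout('search.fast.enabled', 'tenant-42', 25);
console.log(first === second); // true
```

Because the bucket is fixed per key, raising the percentage only adds identities to the rollout; nobody who was already enabled gets switched off.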
Worked examples
Example 1: Kill switch for a flaky downstream provider
Scenario: You have /payments that calls ProviderA. Add a kill switch to stop calling ProviderA when error rate spikes.
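// Pseudocode (Node/Express style); flags, providerA, and metrics are assumed helpers,
// and req.tenantId is assumed to be set by earlier auth middleware.
const express = require('express');
const app = express();
app.use(express.json()); // parse JSON bodies so req.body is populated

app.post('/payments', async (req, res) => {
  // Kill switch: defaults to true so a flag outage does not disable payments.
  const useProviderA = flags.getBoolean('payments.providerA.enabled', { tenantId: req.tenantId }, { default: true });
  if (!useProviderA) {
    return res.status(503).json({ message: 'Payments temporarily unavailable. Please retry later.' });
  }
  try {
    const result = await providerA.charge(req.body);
    return res.status(200).json(result);
  } catch (e) {
    // Count upstream failures so the error-rate spike is visible on dashboards.
    metrics.count('payments.providerA.errors');
    return res.status(502).json({ message: 'Upstream error' });
  }
});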
- Fallback: If the flag service is down, use the last-known value, or default to true if none exists. During an incident, ops flips the flag to false.
- Observation: Monitor 5xx, saturation, and the flag flip event.
Example 2: Gradual rollout of new response field
Scenario: Add field beta_score to /recommendations but avoid surprises.
// Guard behavior behind a flag but keep the contract stable.
// Assumes getPercent returns a sticky boolean decision; default: 0 means 0% (off) if evaluation fails.
const enabled = flags.getPercent('recs.beta_score.rollout', { userId }, { default: 0, stickinessKey: 'userId' });
const data = await service.getRecommendations(userId);
if (enabled) {
  data.beta_score = await service.computeBetaScore(userId);
}
// Contract safety: if clients are sensitive, also gate the field on a vendor-specific
// request header (here X-Beta-Features) and set Vary on that header before sending.
res.setHeader('Vary', 'X-Beta-Features');
res.json(data);
Notes:
- Use a Vary header if response shape depends on a request header to keep caches correct.
- Keep a default that returns a valid response without the new field.
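The header gate mentioned in Example 2 can be sketched as a pure function that is easy to unit test. Everything here is an assumption for illustration: the helper name `shapeResponse`, the opt-in value `beta_score` for the `X-Beta-Features` header, and the premise that header names arrive lowercased (as Node's HTTP server provides them).

```javascript
// Hypothetical pure helper: derives body and headers from the flag decision and request headers.
function shapeResponse({ requestHeaders, flagEnabled, base, betaScore }) {
  // Client must opt in explicitly; the opt-in value is an assumption for this sketch.
  const optedIn = requestHeaders['x-beta-features'] === 'beta_score';
  const body = { ...base }; // shallow copy so the default response is never mutated
  if (flagEnabled && optedIn) body.beta_score = betaScore;
  // Always send Vary: the request header can change the response for the same URL.
  return { headers: { Vary: 'X-Beta-Features' }, body };
}

const base = { items: ['a', 'b'] };
console.log(shapeResponse({ requestHeaders: {}, flagEnabled: true, base, betaScore: 0.9 }).body);
// { items: [ 'a', 'b' ] }
console.log(shapeResponse({ requestHeaders: { 'x-beta-features': 'beta_score' }, flagEnabled: true, base, betaScore: 0.9 }).body);
// { items: [ 'a', 'b' ], beta_score: 0.9 }
```

The response without the header is the default contract, so clients that never opt in see no change regardless of the flag state.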
Example 3: Tenant-targeted early access
Scenario: Enable a faster search path for specific enterprise tenants.
// Pseudocode (Go-ish); flags, index, and metrics are assumed packages.
func Search(w http.ResponseWriter, r *http.Request) {
	tenantId := r.Header.Get("X-Tenant-Id")
	// Default to false so tenants outside the allowlist, and flag outages, stay on the proven path.
	fastPath := flags.GetBoolean("search.fast.enabled", map[string]string{"tenantId": tenantId}, false)
	var results []Item
	if fastPath {
		results = index.FastSearch(r.Context(), r.URL.Query())
	} else {
		results = index.SafeSearch(r.Context(), r.URL.Query())
	}
	// Emit telemetry for both variants
	metrics.Count("search.variant", map[string]string{"fast": strconv.FormatBool(fastPath)})
	w.Header().Set("Content-Type", "application/json")
	json.NewEncoder(w).Encode(results)
}
Notes:
- Use an allowlist of tenant IDs in the flag rule.
- Track latency and errors per variant to decide rollout.
Step-by-step: Add a flag to an existing API
- Define the goal: What risk are you reducing? How will you know it works? List metrics and exit criteria.
- Choose flag type: boolean, multivariate, or percentage with stickiness.
- Pick identity: tenantId or userId for targeting; avoid IP as identity for stickiness.
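- Decide defaults and fallback: Off by default for new features. For kill switches, choose the default that keeps the service safe when the flag cannot be read (in Example 1, true, so a flag outage does not stop payments). Cache last-known values.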
- Add code guard: Place the decision at the smallest scope that contains the risk.
- Emit telemetry: Counters for selected variant, errors, latency, and an audit log on flips.
- Warm up and dry run: Validate both paths in staging; run shadow traffic if possible.
- Roll out gradually: 1% → 5% → 25% → 50% → 100%, verifying SLOs at each step.
- Cleanup: Remove flag and dead code once fully rolled out and stable.
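The gradual rollout step can be sketched as a small gate that advances only while SLOs hold. The names `STEPS` and `nextPercent`, the shape of the `slo` object, and the choice to roll back to 0% on any violation are assumptions for illustration; your thresholds and rollback policy may differ.

```javascript
// Rollout ladder from the steps above.
const STEPS = [1, 5, 25, 50, 100];

// Returns the next rollout percentage, or 0 (roll back) when an SLO check fails.
function nextPercent(current, slo) {
  const healthy = slo.errorRate <= slo.maxErrorRate && slo.p95Ms <= slo.maxP95Ms;
  if (!healthy) return 0;
  const next = STEPS.find((step) => step > current);
  return next === undefined ? current : next; // stay at 100 once fully rolled out
}

const ok = { errorRate: 0.001, maxErrorRate: 0.01, p95Ms: 180, maxP95Ms: 250 };
const bad = { errorRate: 0.05, maxErrorRate: 0.01, p95Ms: 180, maxP95Ms: 250 };

console.log(nextPercent(0, ok)); // 1
console.log(nextPercent(5, ok)); // 25
console.log(nextPercent(25, bad)); // 0
```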
Checklist: Safe rollout Definition of Done
- Flag has clear owner and purpose.
- Defaults and fallbacks defined and tested.
- Stickiness configured (user/tenant) for percentage rollouts.
- Logs/metrics per variant in place.
- Cache behavior understood; Vary header set if the response differs by a request header.
- Runbook for flipping/rollback documented.
- Cleanup task created with a date.
Common mistakes and self-checks
- Mistake: Client-side gating for sensitive server behavior. Self-check: Could a malicious client bypass this? If yes, move evaluation server-side.
- Mistake: Missing stickiness causing users to flip variants. Self-check: Does the same user get the same decision across requests?
- Mistake: Breaking caches with mixed responses. Self-check: If a request header selects the response variant, does the response name that header in Vary?
- Mistake: No fallback for flag unavailability. Self-check: Kill the flag service in a test env; do you still serve a safe response?
- Mistake: Long-lived flags and dead code. Self-check: Is there a cleanup date and owner? Add it now.
Exercises
Exercise 1: Design a safe flag for a header change
You want to add an X-Request-ID response header to all endpoints. Some proxies are sensitive to new headers.
- Design the flag plan: type, targeting, default, fallback, telemetry.
- Define acceptance checks for a 0% → 25% → 100% rollout.
Exercise 2: Implement percent rollout with stickiness
Write a function enabled(userId, percent) that uses userId % 100 < percent to decide. For userIds [129, 230, 345, 478] and percent=30, compute ON/OFF decisions.
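One possible solution sketch, if you want to check your work. It assumes userIds are non-negative integers; for string or sparse IDs you would hash the key first.

```javascript
// Sticky percentage check: the same userId always maps to the same bucket (userId % 100).
function enabled(userId, percent) {
  return userId % 100 < percent;
}

for (const id of [129, 230, 345, 478]) {
  console.log(id, enabled(id, 30) ? 'ON' : 'OFF');
}
// 129 ON   (bucket 29 < 30)
// 230 OFF  (bucket 30)
// 345 OFF  (bucket 45)
// 478 OFF  (bucket 78)
```

Note the boundary: bucket 30 is OFF at percent=30 because buckets 0 through 29 make up exactly 30% of the range.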
Practical projects
- Build a tiny flag evaluator service: In-memory rules (boolean, percentage by userId), HTTP endpoint to read rules, and a cache with last-updated timestamp.
- Add a kill switch to an API that calls a third-party. Simulate upstream failures and demonstrate instant rollback via flag flip.
- Implement a safe header-gated feature: Add a beta header that enables an extra response field. Ensure Vary is correct and caches behave as expected.
Learning path
- Start: Server-side flag basics and safe defaults (this lesson).
- Next: Observability for releases (latency, error budgets, variant tagging).
- Then: Progressive delivery patterns (canary, blue/green, shadow traffic).
- Advanced: Experimentation and multivariate flags while preserving API contracts.
Next steps
- Pick one endpoint in your service and add a non-breaking flag around a small change.
- Set up metrics per variant and perform a tiny (1%) rollout.
- Schedule cleanup once you reach 100%.
Mini challenge
You must change a default timeout from 2s to 5s on a read endpoint without hurting latency SLOs. Sketch a flag plan: what to measure, rollout steps, and the fallback if p95 latency regresses by 10%.
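As a starting point for the fallback part of the plan, the regression check can be sketched as a comparison of p95 latency between the old and new timeout cohorts. The helper names `p95` and `shouldRollBack`, the nearest-rank percentile method, and the assumption that both sample arrays are non-empty are choices made for this example.

```javascript
// p95 by the nearest-rank method on a sorted copy (assumes a non-empty array of milliseconds).
function p95(samples) {
  const sorted = [...samples].sort((a, b) => a - b);
  const rank = Math.ceil((95 * sorted.length) / 100);
  return sorted[rank - 1];
}

// True when the candidate cohort's p95 is at least maxRegression worse than the baseline's.
function shouldRollBack(baselineMs, candidateMs, maxRegression = 0.10) {
  const base = p95(baselineMs);
  const cand = p95(candidateMs);
  return (cand - base) / base >= maxRegression;
}

const baseline = Array(20).fill(100); // 2s-timeout cohort, p95 = 100 ms
console.log(shouldRollBack(baseline, Array(20).fill(105))); // false (5% slower)
console.log(shouldRollBack(baseline, Array(20).fill(120))); // true (20% slower)
```

In practice you would feed this from per-variant latency metrics and flip the flag back to the 2s path when it returns true.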