Why this matters
Backend engineers spend real time aligning on requirements, risk, and impact. A good technical spec:
- Aligns product, backend, frontend, data, security, and ops early.
- Catches risks before code: data loss, migrations, performance, costs, security/privacy.
- Speeds reviews and implementation with a clear contract (APIs, schemas, SLAs).
- Reduces rework by stating goals and non-goals up front.
- Makes rollout safe with a migration and rollback plan.
Concept explained simply
A technical spec is a shared contract for how a change will work and how you will deliver it. It focuses on what, why, and how safely, not just code.
Mental model
- North star: goals and success metrics.
- Guardrails: non-goals, constraints, NFRs (latency, reliability, security, cost).
- Map: architecture, data flows, API/schema, rollout, and observability.
What to include in a solid backend spec
- Summary: one paragraph of the change and impact.
- Context & Problem: current behavior, pain, and drivers.
- Goals & Non-Goals: what must be achieved and what is explicitly out of scope.
- Functional Requirements: user/system behaviors and acceptance criteria.
- Non-Functional Requirements (NFRs): latency, throughput, durability, availability, security/privacy, cost budgets.
- Architecture Overview: components, data stores, queues; include a simple diagram description.
- API & Schema Changes: endpoints, methods, request/response, status codes, DB schema, versioning strategy.
- Data Flows & State: read/write paths, consistency model, idempotency, transactions.
- Failure Modes & SLAs: timeouts, retries, backoff, circuit breakers, error handling, SLAs/SLOs.
- Dependencies: services, third parties, feature flags, configs.
- Risks & Trade-offs: alternatives considered, chosen approach and why, mitigations.
- Migration & Backward Compatibility: phased rollout, dual-writes, backfills, compatibility windows.
- Security & Privacy: authn/z, secrets, PII handling, logging redaction, compliance notes.
- Capacity & Scaling: load assumptions, sizing math, growth plan.
- Observability: logs, metrics, traces, dashboards, alerts.
- Testing Plan: unit/integration/e2e, load tests, chaos/failure testing.
- Rollout & Rollback Plan: flags, canary, monitoring gates, clear rollback steps.
- Open Questions: items to resolve before build.
- Timeline & Estimates: milestones; identify critical path.
- Appendix: glossary, diagram links-in-text (describe), data samples.
Step-by-step: how to write it
- Draft the summary last, write the problem first.
Why
Clarity on the problem shapes the entire design. The summary becomes concise after you decide the approach.
- List goals and non-goals.
Tip
Make boundaries explicit: âNot building admin UI now.â âNo cross-region replication in v1.â
- Capture functional requirements.
Tip
Use Given/When/Then for acceptance criteria to remove ambiguity.
- Set NFRs early.
Tip
Latency targets, throughput, error budgets, cost caps. These drive architecture choices.
- Sketch the architecture.
Tip
Describe components and data flow in text if no diagram tool is available.
- Define API and schema changes precisely.
Tip
Include request/response examples, status codes, versioning, and idempotency keys.
- Plan failure handling and observability.
Tip
Time out, retry with backoff, add metrics and alerts for each key path.
- Write migration, rollout, and rollback.
Tip
Feature flags, canaries, data backfills, and a single-step rollback are your safety net.
- List risks and alternatives with reasoning.
Tip
Make the trade-offs visible: simplicity vs. performance, consistency vs. availability.
- Review, refine, and get signâoff.
Tip
Ask reviewers to challenge assumptions, missing NFRs, and failure modes.
Worked examples
Example 1: Add perâclient rate limiting to public API
- Summary: Introduce token-bucket rate limiting per API key at the gateway, target 95p latency < 150ms at 300 RPS per key.
- Goals: Prevent abuse and protect downstream services; Non-goals: UI rate-limit dashboards (v1).
- NFRs: 99.9% availability; 95p latency < 150ms; cost < $200/month incremental.
- Architecture: API Gateway with Redis for counters; local leaky-bucket fallback on cache miss.
- API: 429 with Retry-After on limit exceed; response includes remaining tokens header.
- Failure: On Redis outage, degrade to local inâmemory limits; alert on 429 burst.
- Observability: MetricsâRPS by key, 429 rate, Redis latency, token bucket saturation.
- Rollout: Shadow mode â 1% enforce â 25% â 100%; rollback flips flag to shadow.
Example 2: Migrate /export reports to a gRPC service
- Summary: Extract CPU-heavy export job from monolith to a gRPC service with a queue.
- Goals: Reduce monolith p95 CPU by 30%. Non-goals: New export formats.
- Functional: POST /v1/exports returns job_id; client polls GET /v1/exports/{id}.
- NFRs: Start < 2s, complete 95% < 2 min, 99% success.
- Architecture: API â enqueue â worker â object storage; status in Postgres.
- Schema: exports(id uuid, status, created_at, owner_id, version).
- Migration: Dual-write status in monolith and new DB for 1 week; cutover flag.
- Risks: Cross-service latency; Mitigation: co-locate worker with queue.
Example 3: Add soft-delete to Orders
- Problem: Business needs to hide orders without losing audit.
- Schema change: add deleted_at TIMESTAMP NULL; unique indexes include (deleted_at).
- Functional: GET excludes soft-deleted by default; admin can include=true.
- NFRs: No data loss; migrations online; p95 queries unaffected (< 120ms).
- Migration: Backfill default NULL; roll forward by deploying read filters first.
- Rollback: Revert read filters via flag; no destructive changes.
Checklists
Pre-review checklist
- Problem, goals, non-goals are explicit.
- Functional requirements use acceptance criteria.
- NFRs are measurable and testable.
- API/schema changes are versioned and idempotent.
- Failure modes and observability are covered.
- Risks, alternatives, and trade-offs are documented.
- Rollout and rollback are safe and simple.
Pre-implementation checklist
- Open questions resolved or time-boxed.
- Capacity math reviewed.
- Security/privacy reviewed.
- Test plan and monitoring dashboards defined.
- Owner and timeline agreed.
Exercises
These mirror the graded tasks below. Do them here first; then submit in the exercises section.
Exercise 1: Rewrite a vague request into Goals/NonâGoals and Functional Requirements
Request: "Make the API faster and stop abuse."
- Write 2â3 goals and 2 non-goals.
- Write 3 functional requirements with acceptance criteria.
Exercise 2: Define NFRs, capacity, and rollback
Feature: New endpoint POST /v1/invoices to create invoices.
- Choose p95 latency, throughput, and availability targets.
- Show a quick capacity calc for peak (e.g., 100 RPS).
- Describe a one-step rollback.
- Self-check: Can a peer implement from your spec without a meeting?
- Self-check: Would your rollback work during a partial outage?
Common mistakes and how to self-check
- Only describing happy paths. Fix: List failure modes and timeouts.
- Skipping NFRs. Fix: Add latency, throughput, and availability targets.
- Under-specifying API versioning. Fix: Include version, compatibility window, and deprecation plan.
- No rollback. Fix: Add a simple, documented rollback step.
- Hand-wavy capacity. Fix: Do back-of-the-envelope math.
- Vague risks. Fix: Write explicit trade-offs and mitigations.
Practical projects
- One-pager spec: Add idempotent retries to payment webhook processing.
- Full spec: Introduce a write-through cache for product reads.
- Spec refactor: Take a past incident and write the spec that would have prevented it.
Who this is for
Backend engineers, tech leads, and SREs who plan, review, or implement service changes.
Prerequisites
- Comfort with HTTP/REST (or gRPC), data modeling, and basic distributed systems concepts.
- Familiarity with service monitoring and deployment practices.
Learning path
- Start: Write a one-pager for a small feature.
- Next: Produce a full spec with NFRs and rollout for a service change.
- Advance: Add performance modeling and cost analysis.
Next steps
- Complete the exercises and compare with solutions.
- Take the quick test to check understanding.
- Apply the checklists on your next real task.
Mini challenge
Time-box 30 minutes to draft a one-page spec for adding request pagination to a list endpoint. Include goals, NFRs, API changes, failure modes, and a rollback.
Quick Test
You can take the quick test below for everyone. If you are logged in, your progress will be saved.