Why this matters
As an NLP Engineer, you ship models that real users and systems call. A clear, stable inference API is how products get value from your models. You will:
- Expose models like sentiment, NER, embeddings, and text generation to apps and pipelines.
- Balance latency, throughput, and reliability under unpredictable load.
- Evolve models without breaking clients (versioning and backwards compatibility).
- Measure quality and performance with consistent logs and metrics.
Concept explained simply
An inference API is a contract: clients send well-formed inputs and get predictable outputs and errors. Your job is to make that contract simple, stable, and observable.
Mental model
Imagine a self-serve restaurant kiosk. The screen (API) offers a small set of clear options, validates your order, gives you an order number (request_id), and the kitchen (model server) prepares it. If the kitchen is busy, the kiosk tells you to wait or come back later (rate limit/overload). The kitchen can change recipes (new model versions), but the menu stays familiar (backwards compatible schema).
Core design elements
- Endpoints: keep them small and purpose-built. Typical set:
- POST /v1/infer or task-specific like POST /v1/sentiment
- GET /v1/health
- GET /v1/metadata (model_version, limits)
- Request shape: predictable top-level fields.
{
  "inputs": ["Text to analyze", "More text"],
  "parameters": {
    "language": "auto",
    "max_tokens": 256,
    "temperature": 0.2,
    "truncate": true
  },
  "idempotency_key": "8d1e...",
  "requested_model": "sentiment-en-1"
}
- Response shape: standardized envelope.
{
  "request_id": "b1f3...",
  "model_version": "sentiment-en-1.4.2",
  "results": [ { "label": "positive", "score": 0.98 } ],
  "usage": { "tokens_in": 28, "tokens_out": 0, "latency_ms": 42 }
}
- Error shape: informative and actionable.
{
  "error": {
    "type": "validation_error",
    "code": "INPUT_TOO_LONG",
    "message": "Input exceeds max_length=4096",
    "hint": "Set parameters.truncate=true or shorten input",
    "retry_after_s": null
  },
  "request_id": "b1f3..."
}
- Idempotency: accept an idempotency_key to safely retry client requests.
- Versioning: include model_version in responses; optionally support api_version in headers or path.
- Batching: allow arrays in inputs to amortize overhead. Document max batch size.
- Streaming: for generation, support streaming tokens when latency matters.
- Limits: document max length, rate limits, timeouts, and default parameters.
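To make the envelope concrete, here is a minimal server-side sketch using FastAPI and Pydantic (an assumed stack). The endpoint name, limits, model_version string, and the placeholder model call are illustrative, not a reference implementation.
import uuid
from fastapi import FastAPI
from fastapi.responses import JSONResponse
from pydantic import BaseModel, Field

app = FastAPI()
MAX_LENGTH = 4096                       # assumed per-input character limit
MODEL_VERSION = "sentiment-en-1.4.2"    # example version string

class Parameters(BaseModel):
    language: str = "auto"
    max_tokens: int = 256
    temperature: float = 0.2
    truncate: bool = False

class InferRequest(BaseModel):
    inputs: list[str] = Field(min_length=1, max_length=32)
    parameters: Parameters = Field(default_factory=Parameters)
    idempotency_key: str | None = None
    requested_model: str | None = None

@app.post("/v1/infer")
def infer(req: InferRequest):
    request_id = f"r-{uuid.uuid4().hex[:8]}"
    for text in req.inputs:
        if len(text) > MAX_LENGTH and not req.parameters.truncate:
            # Error response mirrors the documented error envelope.
            return JSONResponse(status_code=400, content={
                "error": {
                    "type": "validation_error",
                    "code": "INPUT_TOO_LONG",
                    "message": f"Input exceeds max_length={MAX_LENGTH}",
                    "hint": "Set parameters.truncate=true or shorten input",
                    "retry_after_s": None,
                },
                "request_id": request_id,
            })
    # Placeholder for the real model call; results stay aligned to input order.
    results = [{"label": "positive", "score": 0.98} for _ in req.inputs]
    return {
        "request_id": request_id,
        "model_version": MODEL_VERSION,
        "results": results,
        "usage": {"tokens_in": sum(len(t.split()) for t in req.inputs),
                  "tokens_out": 0, "latency_ms": 0},
    }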
Worked examples
Example 1 — Sentiment classification (sync, batched)
Endpoint: POST /v1/sentiment
Request
{
  "inputs": ["I love this phone", "Terrible battery life"],
  "parameters": {"language": "en"}
}
Response
{
  "request_id": "r-1001",
  "model_version": "sentiment-en-1.4.2",
  "results": [
    {"label": "positive", "score": 0.99},
    {"label": "negative", "score": 0.97}
  ],
  "usage": {"tokens_in": 12, "tokens_out": 0, "latency_ms": 35}
}
Notes
- Labels come from a stable, documented set: positive/neutral/negative.
- Score is a probability in [0, 1].
- Document the batch size limit (e.g., 32 documents).
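A minimal client-side sketch of this call, assuming the requests library; the base URL and bearer token are placeholders.
import requests

BASE_URL = "https://nlp.example.com"     # hypothetical host
texts = ["I love this phone", "Terrible battery life"]

resp = requests.post(
    f"{BASE_URL}/v1/sentiment",
    headers={"Authorization": "Bearer <token>"},
    json={"inputs": texts, "parameters": {"language": "en"}},
    timeout=10,
)
resp.raise_for_status()
body = resp.json()

# Results are aligned to input order, so zipping is safe.
for text, result in zip(texts, body["results"]):
    print(f"{text!r} -> {result['label']} ({result['score']:.2f})")
print("served by", body["model_version"], "request", body["request_id"])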
Example 2 — Named Entity Recognition (offsets, batching)
Endpoint: POST /v1/ner
Request
{
  "inputs": ["Alice lives in Paris."],
  "parameters": {"return_offsets": true}
}
Response
{
  "request_id": "r-2001",
  "model_version": "ner-multilingual-2.0.0",
  "results": [
    {
      "entities": [
        {"text": "Alice", "type": "PERSON", "start": 0, "end": 5, "score": 0.995},
        {"text": "Paris", "type": "LOCATION", "start": 15, "end": 20, "score": 0.993}
      ]
    }
  ],
  "usage": {"tokens_in": 8, "tokens_out": 0, "latency_ms": 48}
}
Design tips
- Return character offsets so clients can ground entities in the original text.
- Guarantee a stable entity type set (document aliasing and deprecations).
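A quick sketch of why offsets matter: with start/end character positions, a client can slice the original text to verify or highlight each entity. The entity list below is the example payload from above.
text = "Alice lives in Paris."
entities = [
    {"text": "Alice", "type": "PERSON", "start": 0, "end": 5},
    {"text": "Paris", "type": "LOCATION", "start": 15, "end": 20},
]

for ent in entities:
    span = text[ent["start"]:ent["end"]]
    # Offsets should always slice back to the entity surface form; that is
    # the guarantee that makes downstream highlighting safe.
    assert span == ent["text"], (span, ent["text"])
    print(f"{ent['type']:>9}: '{span}' at [{ent['start']}, {ent['end']})")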
Example 3 — Text generation (streaming and non-streaming)
Endpoints: POST /v1/generate (sync) and POST /v1/generate/stream (streaming)
Sync request
{
  "inputs": ["Write a product tagline for an eco-friendly bottle"],
  "parameters": {"max_tokens": 60, "temperature": 0.7, "stop": ["\n\n"]}
}
Sync response
{
  "request_id": "r-3001",
  "model_version": "gpt-like-0.9.1",
  "results": [ { "text": "Refresh the planet, one refill at a time." } ],
  "usage": {"tokens_in": 12, "tokens_out": 10, "latency_ms": 220}
}
Streaming response (Server-Sent Events or chunked JSON). Each event includes partial tokens:
event: token
data: {"token": "Refresh", "index": 0}
event: token
data: {"token": " the", "index": 1}
event: done
data: {"request_id": "r-3002", "model_version": "gpt-like-0.9.1", "usage": {"tokens_in": 12, "tokens_out": 10, "latency_ms": 180}}
Design tips
- Streaming improves perceived latency; always send a final 'done' frame that carries usage and request_id.
- Honor stop sequences server-side for consistent truncation.
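A minimal streaming sketch, assuming FastAPI's StreamingResponse with Server-Sent Events. Token generation is faked here; a real service would iterate over the model's token stream and honor stop sequences before emitting.
import json
import uuid
from fastapi import FastAPI
from fastapi.responses import StreamingResponse

app = FastAPI()

@app.post("/v1/generate/stream")
def generate_stream():
    request_id = f"r-{uuid.uuid4().hex[:8]}"

    def sse():
        # Placeholder for the real model's token stream.
        tokens = ["Refresh", " the", " planet,", " one", " refill", " at", " a", " time."]
        for i, tok in enumerate(tokens):
            yield f"event: token\ndata: {json.dumps({'token': tok, 'index': i})}\n\n"
        # Final 'done' frame carries request_id, model_version, and usage.
        done = {
            "request_id": request_id,
            "model_version": "gpt-like-0.9.1",
            "usage": {"tokens_in": 12, "tokens_out": len(tokens), "latency_ms": 0},
        }
        yield f"event: done\ndata: {json.dumps(done)}\n\n"

    return StreamingResponse(sse(), media_type="text/event-stream")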
Performance, reliability, and batching
- Latency targets: document typical p50/p95 for each endpoint.
- Timeouts: enforce a server-side timeout (e.g., 30 s for sync requests) and return a 504-style error with a hint to switch to streaming or reduce max_tokens.
- Batching: accept multiple inputs. Protect with max_batch and cumulative token caps; return results aligned to input order.
- Backpressure: return overload errors with retry_after_s when queues grow.
- Idempotency: if idempotency_key repeats within a window, return the original result.
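A sketch of the idempotency behavior described above: a repeated idempotency_key inside a window returns the cached result instead of re-running inference. The window length and in-memory cache are illustrative; a shared store (e.g., Redis) would be typical in production.
import time

IDEMPOTENCY_WINDOW_S = 600                      # assumed 10-minute replay window
_cache: dict[str, tuple[float, dict]] = {}      # key -> (timestamp, response)

def run_with_idempotency(key, run_inference):
    """Return the cached response for a repeated key; otherwise compute and cache."""
    now = time.time()
    if key is not None:
        cached = _cache.get(key)
        if cached and now - cached[0] < IDEMPOTENCY_WINDOW_S:
            return cached[1]        # safe retry: same result, no duplicate work
    response = run_inference()      # the actual model call goes here
    if key is not None:
        _cache[key] = (now, response)
    return response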
Observability
- Include request_id in every response and error.
- Log minimal structured JSON per request:
{
  "ts": "2026-01-05T12:00:08Z",
  "request_id": "r-4010",
  "endpoint": "/v1/sentiment",
  "model_version": "sentiment-en-1.4.2",
  "latency_ms": 41,
  "status": 200,
  "batch_size": 3,
  "tokens_in": 35,
  "tokens_out": 0
}
- Metrics to track: throughput (req/s), latency (p50/p95/p99), error rates by type, token usage, cache hit rate.
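A small sketch of emitting that log line with the standard logging module; the field names match the example entry above.
import json
import logging
from datetime import datetime, timezone

logger = logging.getLogger("inference")
logging.basicConfig(level=logging.INFO, format="%(message)s")

def log_request(request_id, endpoint, model_version, latency_ms,
                status, batch_size, tokens_in, tokens_out):
    # One structured JSON line per request keeps logs easy to parse and aggregate.
    logger.info(json.dumps({
        "ts": datetime.now(timezone.utc).isoformat(timespec="seconds"),
        "request_id": request_id,
        "endpoint": endpoint,
        "model_version": model_version,
        "latency_ms": latency_ms,
        "status": status,
        "batch_size": batch_size,
        "tokens_in": tokens_in,
        "tokens_out": tokens_out,
    }))

log_request("r-4010", "/v1/sentiment", "sentiment-en-1.4.2", 41, 200, 3, 35, 0)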
Security and safety
- Authentication: require a bearer token or similar credential; reject missing/invalid with a clear error.
- PII: avoid logging raw text unless explicitly enabled for debugging; support a redaction option.
- Payload limits: set max body size; validate UTF-8.
- CORS: allow only trusted origins if serving browsers.
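A sketch of the bearer-token check as a small helper, assuming FastAPI. The token set is a placeholder; a real service would verify against a secrets manager or identity provider, and this sketch uses FastAPI's default error wrapper rather than the full error envelope.
from fastapi import FastAPI, Header, HTTPException

app = FastAPI()
VALID_TOKENS = {"test-token"}   # placeholder credential store

def require_token(authorization):
    """Reject missing or invalid bearer tokens with a clear, actionable error."""
    token = (authorization or "").removeprefix("Bearer ").strip()
    if token not in VALID_TOKENS:
        raise HTTPException(status_code=401, detail={
            "type": "auth_error",
            "code": "INVALID_TOKEN",
            "message": "Missing or invalid bearer token",
            "hint": "Pass 'Authorization: Bearer <token>'",
        })

@app.get("/v1/metadata")
def metadata(authorization: str | None = Header(default=None)):
    require_token(authorization)
    return {"model_version": "sentiment-en-1.4.2",
            "limits": {"max_batch": 32, "max_length": 4096}}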
Versioning and compatibility
- Model stability: include model_version in responses and allow requested_model in requests.
- API versioning: support an api_version header or versioned path; deprecate old versions with clear dates.
- Backward compatibility: additive changes are safe; for breaking changes, provide a migration guide and dual-run period.
Error handling patterns
- 400 validation_error (bad schema or parameters)
- 422 unprocessable (cannot fulfill due to content; e.g., empty text)
- 429 rate_limited (include retry_after_s)
- 503 overload (server busy; include retry_after_s)
- 504 timeout (work took too long; suggest streaming or smaller max_tokens)
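One way to keep these failures consistent is a shared error-envelope builder. The sketch below (assuming FastAPI's JSONResponse) also sets the standard Retry-After header when retry_after_s is present, and could produce the 429 example that follows.
from fastapi.responses import JSONResponse

def error_response(status, err_type, code, message, request_id,
                   hint=None, retry_after_s=None):
    """Build the shared error envelope so every endpoint fails the same way."""
    headers = {}
    if retry_after_s is not None:
        headers["Retry-After"] = str(int(retry_after_s))   # standard HTTP retry hint
    return JSONResponse(status_code=status, headers=headers, content={
        "error": {
            "type": err_type,
            "code": code,
            "message": message,
            "hint": hint,
            "retry_after_s": retry_after_s,
        },
        "request_id": request_id,
    })

# For example:
# error_response(429, "rate_limited", "TOO_MANY_REQUESTS",
#                "Rate limit exceeded", "r-4290", retry_after_s=2)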
{
  "error": {
    "type": "rate_limited",
    "code": "TOO_MANY_REQUESTS",
    "message": "Rate limit exceeded",
    "retry_after_s": 2
  },
  "request_id": "r-4290"
}
Exercises
Practice the core decisions below. Then take the quick test at the end.
Exercise 1 — Design a stable sentiment API contract
Goal: Define request/response JSON for POST /v1/sentiment with batching, optional language, and clear errors for long input.
- Inputs: array of strings (1..32)
- Parameters: language (auto|en|es|...), truncate (bool)
- Response: label, score per input; include usage
- Error: INPUT_TOO_LONG with hint
Write your schema as JSON examples.
Exercise 2 — Plan a streaming generation endpoint
Goal: Specify request and streaming response frames for POST /v1/generate/stream. Include final done frame with usage.
- Parameters: max_tokens, temperature, stop
- Frames: token, done
- Document ordering and termination conditions
Checklist before you move on:
- Endpoints are minimal and task-focused
- Request and response envelopes are consistent
- Errors include actionable hints
- Idempotency and versioning are accounted for
- Metrics and request_id are included
Common mistakes and self-check
- Overloaded single endpoint that tries to do everything. Fix: split by task.
- Undocumented defaults. Fix: document defaults and echo the effective parameters in the response if needed.
- Inconsistent fields across endpoints. Fix: reuse a shared envelope schema.
- Ignoring batch alignment. Fix: always preserve input order.
- No streaming for long generations. Fix: add /stream with final summary frame.
- No overload signaling. Fix: 503 with retry_after_s.
Self-check prompts
- Can a client safely retry any request without duplicate effects?
- Can you deprecate a model without breaking existing clients?
- Can you quickly debug a user issue from logs using request_id?
Mini challenge
Extend your design to support tenant-level rate limits and usage reporting. Add a tenant_id in requests and return per-tenant usage in metadata. Ensure errors and streaming flows still include request_id and model_version.
Who this is for
- NLP Engineers and ML Engineers shipping models to production
- Backend Engineers integrating NLP into products
- Data Scientists moving prototypes to stable services
Prerequisites
- Basic HTTP and JSON
- Familiarity with NLP tasks (classification, generation, embeddings)
- Understanding of latency, throughput, and timeouts
Learning path
- Define the contract: inputs, outputs, errors, limits
- Add performance features: batching and streaming
- Harden reliability: idempotency, rate limiting, overload handling
- Instrument: logs, metrics, request_id, usage
- Version and evolve: api_version and model_version strategy
Practical projects
- Ship a POST /v1/embeddings service with batch support and usage reporting
- Convert a prototype text generator into /generate and /generate/stream with stop sequences
- Add model_version pinning and deprecation notices to an existing sentiment service
Next steps
- Complete the exercises above and validate with the checklist
- Take the Quick Test at the end of this page to check your understanding
- Iterate your own API spec using the self-check prompts