
Designing Inference APIs

Learn Designing Inference APIs for free with explanations, exercises, and a quick test (for NLP Engineers).

Published: January 5, 2026 | Updated: January 5, 2026

Why this matters

As an NLP Engineer, you ship models that real users and systems call. A clear, stable inference API is how products get value from your models. You will:

  • Expose models like sentiment, NER, embeddings, and text generation to apps and pipelines.
  • Balance latency, throughput, and reliability under unpredictable load.
  • Evolve models without breaking clients (versioning and backwards compatibility).
  • Measure quality and performance with consistent logs and metrics.

Concept explained simply

An inference API is a contract: clients send well-formed inputs and get predictable outputs and errors. Your job is to make that contract simple, stable, and observable.

Mental model

Imagine a self-serve restaurant kiosk. The screen (API) offers a small set of clear options, validates your order, gives you an order number (request_id), and the kitchen (model server) prepares it. If the kitchen is busy, the kiosk tells you to wait or come back later (rate limit/overload). The kitchen can change recipes (new model versions), but the menu stays familiar (backwards compatible schema).

Core design elements

  • Endpoints: keep them small and purpose-built. Typical set:
    • POST /v1/infer or task-specific like POST /v1/sentiment
    • GET /v1/health
    • GET /v1/metadata (model_version, limits)
  • Request shape: predictable top-level fields.
{
  "inputs": ["Text to analyze", "More text"],
  "parameters": {
    "language": "auto",
    "max_tokens": 256,
    "temperature": 0.2,
    "truncate": true
  },
  "idempotency_key": "8d1e...",
  "requested_model": "sentiment-en-1"
}
  • Response shape: standardized envelope.
{
  "request_id": "b1f3...",
  "model_version": "sentiment-en-1.4.2",
  "results": [ { "label": "positive", "score": 0.98 } ],
  "usage": { "tokens_in": 28, "tokens_out": 0, "latency_ms": 42 }
}
  • Error shape: informative and actionable.
{
  "error": {
    "type": "validation_error",
    "code": "INPUT_TOO_LONG",
    "message": "Input exceeds max_length=4096",
    "hint": "Set parameters.truncate=true or shorten input",
    "retry_after_s": null
  },
  "request_id": "b1f3..."
}
  • Idempotency: accept an idempotency_key so clients can retry requests safely without duplicating work.
  • Versioning: include model_version in responses; optionally support api_version in headers or path.
  • Batching: allow arrays in inputs to amortize overhead. Document max batch size.
  • Streaming: for generation, support streaming tokens when latency matters.
  • Limits: document max length, rate limits, timeouts, and default parameters.
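
To make these elements concrete, here is a minimal server-side sketch in Python using FastAPI and Pydantic. It is a sketch, not a reference implementation: the model name, the 4096-character limit, and the stubbed prediction are illustrative assumptions.

import time
import uuid
from typing import List, Optional

from fastapi import FastAPI
from fastapi.responses import JSONResponse
from pydantic import BaseModel

app = FastAPI()
MAX_LENGTH = 4096                      # assumed per-document character limit
MODEL_VERSION = "sentiment-en-1.4.2"   # assumed model version string

class Parameters(BaseModel):
    language: str = "auto"
    truncate: bool = False

class SentimentRequest(BaseModel):
    inputs: List[str]
    parameters: Parameters = Parameters()
    idempotency_key: Optional[str] = None
    requested_model: Optional[str] = None

def error_response(request_id, status, err_type, code, message, hint):
    # Shared error envelope so every endpoint fails the same way.
    return JSONResponse(status_code=status, content={
        "error": {"type": err_type, "code": code, "message": message,
                  "hint": hint, "retry_after_s": None},
        "request_id": request_id,
    })

@app.post("/v1/sentiment")
def sentiment(req: SentimentRequest):
    request_id = f"r-{uuid.uuid4().hex[:8]}"
    started = time.monotonic()
    for text in req.inputs:
        if len(text) > MAX_LENGTH and not req.parameters.truncate:
            return error_response(request_id, 400, "validation_error", "INPUT_TOO_LONG",
                                  f"Input exceeds max_length={MAX_LENGTH}",
                                  "Set parameters.truncate=true or shorten input")
    # Stub prediction: a real service would call the loaded model here.
    results = [{"label": "positive", "score": 0.98} for _ in req.inputs]
    return {
        "request_id": request_id,
        "model_version": MODEL_VERSION,
        "results": results,
        "usage": {"tokens_in": 0, "tokens_out": 0,
                  "latency_ms": int((time.monotonic() - started) * 1000)},
    }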

Worked examples

Example 1 — Sentiment classification (sync, batched)

Endpoint: POST /v1/sentiment

Request

{
  "inputs": ["I love this phone", "Terrible battery life"],
  "parameters": {"language": "en"}
}

Response

{
  "request_id": "r-1001",
  "model_version": "sentiment-en-1.4.2",
  "results": [
    {"label": "positive", "score": 0.99},
    {"label": "negative", "score": 0.97}
  ],
  "usage": {"tokens_in": 12, "tokens_out": 0, "latency_ms": 35}
}
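
For reference, a minimal client-side sketch of this call in Python (the base URL and token are placeholders):

import requests

texts = ["I love this phone", "Terrible battery life"]
resp = requests.post(
    "https://api.example.com/v1/sentiment",           # placeholder base URL
    headers={"Authorization": "Bearer YOUR_TOKEN"},    # placeholder credential
    json={"inputs": texts, "parameters": {"language": "en"}},
    timeout=10,
)
resp.raise_for_status()
body = resp.json()
# Results are aligned to the input order, so zip() pairs each text with its prediction.
for text, result in zip(texts, body["results"]):
    print(text, "->", result["label"], result["score"])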

Notes

  • Stable labels: document the full label set (positive/neutral/negative).
  • Score is a probability in [0, 1].
  • Document the batch size limit, e.g., 32 documents.

Example 2 — Named Entity Recognition (offsets, batching)

Endpoint: POST /v1/ner

Request

{
  "inputs": ["Alice lives in Paris."],
  "parameters": {"return_offsets": true}
}

Response

{
  "request_id": "r-2001",
  "model_version": "ner-multilingual-2.0.0",
  "results": [
    {
      "entities": [
        {"text": "Alice", "type": "PERSON", "start": 0, "end": 5, "score": 0.995},
        {"text": "Paris", "type": "LOCATION", "start": 15, "end": 20, "score": 0.993}
      ]
    }
  ],
  "usage": {"tokens_in": 8, "tokens_out": 0, "latency_ms": 48}
}
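
Because the response carries character offsets, a client can check that each span indexes back into the original text. A small sketch, assuming the response above has been parsed into a dict:

text = "Alice lives in Paris."
response = {
    "results": [{"entities": [
        {"text": "Alice", "type": "PERSON", "start": 0, "end": 5, "score": 0.995},
        {"text": "Paris", "type": "LOCATION", "start": 15, "end": 20, "score": 0.993},
    ]}]
}
for entity in response["results"][0]["entities"]:
    span = text[entity["start"]:entity["end"]]
    # Offsets are half-open [start, end), so the slice should equal the entity text.
    assert span == entity["text"], f"offset mismatch: {span!r} != {entity['text']!r}"
    print(f'{entity["type"]}: {span} ({entity["score"]:.3f})')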

Design tips

  • Return character offsets so entities can be grounded in the original text.
  • Guarantee a stable entity type set (document aliasing and deprecations).

Example 3 — Text generation (streaming and non-streaming)

Endpoints: POST /v1/generate (sync) and POST /v1/generate/stream (streaming)

Sync request

{
  "inputs": ["Write a product tagline for an eco-friendly bottle"],
  "parameters": {"max_tokens": 60, "temperature": 0.7, "stop": ["\n\n"]}
}

Sync response

{
  "request_id": "r-3001",
  "model_version": "gpt-like-0.9.1",
  "results": [ { "text": "Refresh the planet, one refill at a time." } ],
  "usage": {"tokens_in": 12, "tokens_out": 10, "latency_ms": 220}
}

Streaming response (Server-Sent Events or chunked JSON). Each event includes partial tokens:

event: token
data: {"token": "Refresh", "index": 0}

event: token
data: {"token": " the", "index": 1}

event: done
data: {"request_id": "r-3002", "model_version": "gpt-like-0.9.1", "usage": {"tokens_in": 12, "tokens_out": 10, "latency_ms": 180}}
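
A streaming client can consume these frames with a plain HTTP library. A minimal sketch using requests, assuming the server emits Server-Sent Events shaped like the frames above (the base URL and token are placeholders):

import json
import requests

with requests.post(
    "https://api.example.com/v1/generate/stream",      # placeholder base URL
    headers={"Authorization": "Bearer YOUR_TOKEN"},     # placeholder credential
    json={
        "inputs": ["Write a product tagline for an eco-friendly bottle"],
        "parameters": {"max_tokens": 60, "temperature": 0.7, "stop": ["\n\n"]},
    },
    stream=True,
    timeout=60,
) as resp:
    resp.raise_for_status()
    event = None
    for raw in resp.iter_lines(decode_unicode=True):
        if not raw:
            continue                                    # blank lines separate SSE events
        if raw.startswith("event:"):
            event = raw.split(":", 1)[1].strip()
        elif raw.startswith("data:"):
            data = json.loads(raw.split(":", 1)[1].strip())
            if event == "token":
                print(data["token"], end="", flush=True)
            elif event == "done":
                print()
                print("usage:", data["usage"], "request_id:", data["request_id"])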

Design tips

  • Streaming improves perceived latency; always end the stream with a 'done' frame carrying usage and request_id.
  • Honor stop sequences server-side for consistent truncation.

Performance, reliability, and batching

  • Latency targets: document typical p50/p95 for each endpoint.
  • Timeouts: enforce a server-side timeout (e.g., 30s sync) and return a 504-style error with a hint to switch to streaming or reduce max_tokens.
  • Batching: accept multiple inputs. Protect with max_batch and cumulative token caps; return results aligned to input order.
  • Backpressure: return overload errors with retry_after_s when queues grow.
  • Idempotency: if idempotency_key repeats within a window, return the original result.
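
As a sketch of the idempotency behavior described above, here is a simple in-memory cache keyed by idempotency_key; a real deployment would typically use a shared store such as Redis, and the TTL is an assumption:

import time
from typing import Callable, Dict, Optional, Tuple

IDEMPOTENCY_TTL_S = 600                        # assumed retention window for retries
_cache: Dict[str, Tuple[float, dict]] = {}     # key -> (stored_at, response body)

def run_with_idempotency(key: Optional[str], compute: Callable[[], dict]) -> dict:
    """Return the cached response when the same idempotency_key repeats within the TTL."""
    now = time.time()
    if key and key in _cache:
        stored_at, result = _cache[key]
        if now - stored_at < IDEMPOTENCY_TTL_S:
            return result                      # safe retry: client gets the original result
        del _cache[key]                        # entry expired; recompute below
    result = compute()
    if key:
        _cache[key] = (now, result)
    return result

# Usage inside a request handler (predict_batch is a placeholder):
# body = run_with_idempotency(req.idempotency_key, lambda: predict_batch(req.inputs))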

Observability

  • Include request_id in every response and error.
  • Log minimal structured JSON per request:
{
  "ts": "2026-01-05T12:00:08Z",
  "request_id": "r-4010",
  "endpoint": "/v1/sentiment",
  "model_version": "sentiment-en-1.4.2",
  "latency_ms": 41,
  "status": 200,
  "batch_size": 3,
  "tokens_in": 35,
  "tokens_out": 0
}
  • Metrics to track: throughput (req/s), latency (p50/p95/p99), error rates by type, token usage, cache hit rate.
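
A sketch of emitting one structured log line per request; the field names mirror the example above, and the logger setup is an assumption:

import json
import logging
import sys
import time

logger = logging.getLogger("inference")
logger.setLevel(logging.INFO)
logger.addHandler(logging.StreamHandler(sys.stdout))

def log_request(request_id, endpoint, model_version, latency_ms, status,
                batch_size, tokens_in, tokens_out):
    # One JSON object per line keeps logs easy to parse and aggregate.
    logger.info(json.dumps({
        "ts": time.strftime("%Y-%m-%dT%H:%M:%SZ", time.gmtime()),
        "request_id": request_id,
        "endpoint": endpoint,
        "model_version": model_version,
        "latency_ms": latency_ms,
        "status": status,
        "batch_size": batch_size,
        "tokens_in": tokens_in,
        "tokens_out": tokens_out,
    }))

# Example call after a request completes:
# log_request("r-4010", "/v1/sentiment", "sentiment-en-1.4.2", 41, 200, 3, 35, 0)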

Security and safety

  • Authentication: require a bearer token or similar credential; reject missing/invalid with a clear error.
  • PII: avoid logging raw text unless explicitly enabled for debugging; support a redaction option.
  • Payload limits: set max body size; validate UTF-8.
  • CORS: allow only trusted origins if serving browsers.
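
For the authentication point, a minimal sketch of a bearer-token check that fails with the standard error envelope; the token set, error type, and code are illustrative assumptions:

VALID_TOKENS = {"test-token"}     # in practice: a key store or auth service lookup

def authenticate(headers: dict, request_id: str):
    """Return None when authorized, otherwise (status_code, error_body)."""
    auth = headers.get("Authorization", "")
    token = auth.removeprefix("Bearer ").strip()
    if token not in VALID_TOKENS:
        return 401, {
            "error": {
                "type": "auth_error",
                "code": "INVALID_TOKEN",
                "message": "Missing or invalid bearer token",
                "hint": "Send Authorization: Bearer <token>",
                "retry_after_s": None,
            },
            "request_id": request_id,
        }
    return None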

Versioning and compatibility

  • Model stability: include model_version in responses and allow requested_model in requests.
  • API versioning: support an api_version header or versioned path; deprecate old versions with clear dates.
  • Backward compatibility: additive changes are safe; for breaking changes, provide a migration guide and dual-run period.

Error handling patterns

  • 400 validation_error (bad schema or parameters)
  • 422 unprocessable (cannot fulfill due to content; e.g., empty text)
  • 429 rate_limited (include retry_after_s)
  • 503 overload (server busy; include retry_after_s)
  • 504 timeout (work took too long; suggest streaming or smaller max_tokens)
{
  "error": {
    "type": "rate_limited",
    "code": "TOO_MANY_REQUESTS",
    "message": "Rate limit exceeded",
    "retry_after_s": 2
  },
  "request_id": "r-4290"
}
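
On the client side, these error shapes support a simple retry loop that honors retry_after_s on 429 and 503; the endpoint, payload, and retry budget below are illustrative assumptions:

import time
import requests

def post_with_retries(url, payload, token, max_attempts=4):
    for attempt in range(1, max_attempts + 1):
        resp = requests.post(url, json=payload,
                             headers={"Authorization": f"Bearer {token}"}, timeout=30)
        if resp.status_code not in (429, 503):
            return resp                        # success or a non-retryable error
        body = resp.json()
        # Prefer the server's retry_after_s; fall back to exponential backoff.
        wait_s = body.get("error", {}).get("retry_after_s") or 2 ** attempt
        time.sleep(wait_s)
    return resp                                # still overloaded after max_attempts

# resp = post_with_retries("https://api.example.com/v1/sentiment",
#                          {"inputs": ["I love this phone"]}, "YOUR_TOKEN")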

Exercises

Practice the core decisions below. Then take the quick test at the end. Note: Everyone can take the test; only logged-in users will have their progress saved.

Exercise 1 — Design a stable sentiment API contract

Goal: Define request/response JSON for POST /v1/sentiment with batching, optional language, and clear errors for long input.

  • Inputs: array of strings (1..32)
  • Parameters: language (auto|en|es|...), truncate (bool)
  • Response: label, score per input; include usage
  • Error: INPUT_TOO_LONG with hint

Write your schema as JSON examples.

Exercise 2 — Plan a streaming generation endpoint

Goal: Specify request and streaming response frames for POST /v1/generate/stream. Include final done frame with usage.

  • Parameters: max_tokens, temperature, stop
  • Frames: token, done
  • Document ordering and termination conditions

Checklist before you move on

  • Endpoints are minimal and task-focused
  • Request and response envelopes are consistent
  • Errors include actionable hints
  • Idempotency and versioning are accounted for
  • Metrics and request_id are included

Common mistakes and self-check

  • Overloaded single endpoint that tries to do everything. Fix: split by task.
  • Undocumented defaults. Fix: echo effective parameters in response if needed.
  • Inconsistent fields across endpoints. Fix: reuse a shared envelope schema.
  • Ignoring batch alignment. Fix: always preserve input order.
  • No streaming for long generations. Fix: add /stream with final summary frame.
  • No overload signaling. Fix: 503 with retry_after_s.

Self-check prompts

  • Can a client safely retry any request without duplicate effects?
  • Can you deprecate a model without breaking existing clients?
  • Can you quickly debug a user issue from logs using request_id?

Mini challenge

Extend your design to support tenant-level rate limits and usage reporting. Add a tenant_id in requests and return per-tenant usage in metadata. Ensure errors and streaming flows still include request_id and model_version.

Who this is for

  • NLP Engineers and ML Engineers shipping models to production
  • Backend Engineers integrating NLP into products
  • Data Scientists moving prototypes to stable services

Prerequisites

  • Basic HTTP and JSON
  • Familiarity with NLP tasks (classification, generation, embeddings)
  • Understanding of latency, throughput, and timeouts

Learning path

  1. Define the contract: inputs, outputs, errors, limits
  2. Add performance features: batching and streaming
  3. Harden reliability: idempotency, rate limiting, overload handling
  4. Instrument: logs, metrics, request_id, usage
  5. Version and evolve: api_version and model_version strategy

Practical projects

  • Ship a POST /v1/embeddings service with batch support and usage reporting
  • Convert a prototype text generator into /generate and /generate/stream with stop sequences
  • Add model_version pinning and deprecation notices to an existing sentiment service

Next steps

  • Complete the exercises above and validate with the checklist
  • Take the Quick Test at the end of this page to check your understanding
  • Iterate your own API spec using the self-check prompts

Practice Exercises

2 exercises to complete

Instructions

Create request and response JSON examples for POST /v1/sentiment.

  • Support batching: 1–32 texts via inputs
  • parameters: language (auto|en|es|...), truncate (bool, default false)
  • Response includes results aligned to inputs, model_version, usage
  • Define an INPUT_TOO_LONG error with hint and max_length detail

Write the JSON examples for a valid call and the error case.

Expected Output
Two JSON snippets: one success response with results array and usage; one error response with error.type='validation_error', error.code='INPUT_TOO_LONG', and request_id.

Designing Inference APIs — Quick Test

Test your knowledge with 7 questions. Pass with 70% or higher.
