Who this is for
- MLOps engineers who expose ML models to applications or other services.
- Backend engineers integrating inference into product features.
- Data scientists packaging models who want stable, safe API contracts.
Prerequisites
- Basic HTTP concepts (methods, headers, status codes) and JSON.
- Familiarity with Protobuf and gRPC tooling basics.
- Understanding of your model’s inputs/outputs and common failure modes.
Why this matters
In real products, models must be served reliably, safely, and at low latency. Good REST/gRPC API design lets you:
- Ship models independently of client releases (clear versioning).
- Scale online inference and batch jobs predictably.
- Observe, roll back, and A/B model versions without breaking clients.
- Reduce incidents from schema drift and ambiguous errors.
Concept explained simply
An API is a contract between the model server and clients. REST uses HTTP/JSON and is universal and human-friendly. gRPC uses HTTP/2 with Protobuf for compact, fast, strongly-typed RPC calls. Both can expose the same model; you choose based on clients, performance, and interoperability.
Mental model
- Treat inference as a function: output = f(input, model_version, metadata).
- The API contract is a stable shield around a changing model.
- Design for evolution: add fields without breaking old clients; version when you must.
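A minimal sketch of that function view (the names and return shape here are illustrative, not a required interface):

from typing import Any

def predict(inputs: list[Any], model_version: str = "stable",
            metadata: dict | None = None) -> dict:
    # Placeholder body: a real server resolves model_version to loaded
    # weights and runs inference. The point is the stable signature:
    # clients depend on this shape, not on the model behind it.
    return {"model_version": model_version, "outputs": [], "metadata": metadata or {}}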
When to choose REST vs gRPC
Choose REST when...
- You need broad client compatibility (browsers, external partners).
- Easy debugging with curl/Postman and human-readable payloads matter.
- Request/response sizes are moderate and latency is acceptable.
Choose gRPC when...
- You control both client and server (internal microservices).
- You need low latency, high throughput, or streaming (audio/video, token streams).
- Strong typing and backward-compatible schema evolution are important.
Design principles for ML inference APIs
- Contracts & versioning: Use explicit versions (e.g., /v1) for REST or package/service versions in Protobuf. Breaking changes require a new version; non-breaking changes are additive.
- Requests: Make inputs explicit. Include an optional model_version or model_id for pinning; otherwise route to a default alias (e.g., "stable"). Support batching via an array rather than separate endpoints when possible.
- Responses: Always include model_version, latency_ms, and a correlation request_id. Return calibrated scores and class labels or structured outputs.
- Idempotency & safety: GET must be safe; predictions are typically POST. For job-creating endpoints, support an Idempotency-Key header (REST) or a request key (gRPC) to prevent duplicates.
- Error model: Map errors clearly. REST: 400 for bad input, 404 for an unknown model, 409 for conflicts (e.g., a duplicate job), 429 when throttled, 5xx for server faults. gRPC: use the canonical codes (INVALID_ARGUMENT, NOT_FOUND, ALREADY_EXISTS, RESOURCE_EXHAUSTED, INTERNAL, UNAVAILABLE). A mapping sketch follows this list.
- Observability: Emit metrics: QPS, latency percentiles, error rates by code, payload sizes, model version distribution. Log request_id, user/tenant ids (if allowed), but avoid raw PII content.
- Performance patterns: Batching, compression, and streaming. REST: enable gzip; gRPC: use streams and message compression if needed.
- Compatibility & rollout: Support canary by header or route. Keep old versions alive until clients migrate. In Protobuf, never reuse field tags; reserve removed tags.
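To keep the REST and gRPC error models in lockstep, one option is a single mapping table, sketched below; the category names are illustrative, and grpc.StatusCode comes from the grpcio package:

import grpc  # pip install grpcio

# Hypothetical internal error categories mapped to REST status codes and
# canonical gRPC codes, mirroring the error-model bullet above.
ERROR_MAP = {
    "bad_input":     (400, grpc.StatusCode.INVALID_ARGUMENT),
    "unknown_model": (404, grpc.StatusCode.NOT_FOUND),
    "duplicate_job": (409, grpc.StatusCode.ALREADY_EXISTS),
    "throttled":     (429, grpc.StatusCode.RESOURCE_EXHAUSTED),
    "server_fault":  (500, grpc.StatusCode.INTERNAL),
    "overloaded":    (503, grpc.StatusCode.UNAVAILABLE),
}

def rest_status(category: str) -> int:
    return ERROR_MAP[category][0]

def grpc_code(category: str) -> grpc.StatusCode:
    return ERROR_MAP[category][1]

Keeping the table in one place means a REST gateway and a gRPC service can never disagree about what a given failure means.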
Design checklist (quick run-through)
- Versioned path or schema present
- Predict uses POST; GETs are safe
- Clear input schema; batch-friendly
- Response includes model_version and request_id
- Defined error codes and examples
- Idempotency strategy for job creation
- Observability fields/headers planned
- Security: auth, rate limits, PII policy
- Migration plan for next version
Worked examples
Example 1 — REST prediction for text classification
Goal: Sentiment classifier with optional batching and pinning by model alias.
POST /v1/models/sentiment:predict
Content-Type: application/json
Idempotency-Key: 8f0c0a...
{
"inputs": [
{"id": "t1", "text": "I love this!"},
{"id": "t2", "text": "Not great."}
],
"model": {"alias": "stable"}
}
200 OK
{
"model_version": "sentiment-2024-09-15",
"predictions": [
{"id": "t1", "label": "POSITIVE", "score": 0.98},
{"id": "t2", "label": "NEGATIVE", "score": 0.77}
],
"latency_ms": 24,
"request_id": "req_7b1..."
}
Errors: 400 for missing text, 404 for unknown alias, 429 when rate-limited.
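One way to implement this endpoint, sketched with FastAPI and pydantic; classify() and the ALIASES registry are placeholders for your real model code, and the request/response shapes mirror the bodies above:

import time
import uuid

from fastapi import FastAPI, HTTPException
from pydantic import BaseModel

class TextItem(BaseModel):
    id: str
    text: str

class ModelSelector(BaseModel):
    alias: str = "stable"

class PredictRequest(BaseModel):
    inputs: list[TextItem]
    model: ModelSelector = ModelSelector()

# Placeholder registry: alias -> concrete version actually served.
ALIASES = {"stable": "sentiment-2024-09-15"}

def classify(text: str) -> tuple[str, float]:
    # Placeholder: call your real model here.
    return ("POSITIVE", 0.98) if "love" in text.lower() else ("NEGATIVE", 0.77)

app = FastAPI()

@app.post("/v1/models/sentiment:predict")
def predict(req: PredictRequest) -> dict:
    started = time.perf_counter()
    if req.model.alias not in ALIASES:
        raise HTTPException(status_code=404, detail=f"unknown alias: {req.model.alias}")
    if not req.inputs:
        raise HTTPException(status_code=400, detail="inputs must be a non-empty array")
    predictions = []
    for item in req.inputs:
        label, score = classify(item.text)
        predictions.append({"id": item.id, "label": label, "score": score})
    return {
        "model_version": ALIASES[req.model.alias],
        "predictions": predictions,
        "latency_ms": int((time.perf_counter() - started) * 1000),
        "request_id": f"req_{uuid.uuid4().hex[:12]}",
    }

Note that FastAPI rejects schema violations with 422 by default; remap to 400 if your contract promises that code.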
Example 2 — gRPC embeddings with unary and streaming
syntax = "proto3";
package embeddings.v1;
message EmbedRequest { string id = 1; bytes image_jpeg = 2; string model_alias = 3; }
message EmbedResponse { string id = 1; repeated float vector = 2; string model_version = 3; }
service Embedder {
rpc Embed(EmbedRequest) returns (EmbedResponse);
rpc EmbedStream(stream EmbedRequest) returns (stream EmbedResponse);
}
Unary for single inputs; streaming for high-throughput pipelines. Use canonical status codes for errors.
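A server-side sketch in Python, assuming protoc generated stubs named embeddings_pb2 and embeddings_pb2_grpc from the proto above, with embed() standing in for the real model:

import grpc
from concurrent import futures

# Assumed module names from protoc code generation
# (python -m grpc_tools.protoc ... embeddings.proto).
import embeddings_pb2
import embeddings_pb2_grpc

MODEL_VERSION = "embedder-2024-09-01"  # placeholder

def embed(image_jpeg: bytes) -> list[float]:
    # Placeholder: decode the JPEG and run the real embedding model here.
    return [0.0] * 512

class EmbedderServicer(embeddings_pb2_grpc.EmbedderServicer):
    def Embed(self, request, context):
        if not request.image_jpeg:
            context.abort(grpc.StatusCode.INVALID_ARGUMENT, "image_jpeg is required")
        return embeddings_pb2.EmbedResponse(
            id=request.id,
            vector=embed(request.image_jpeg),
            model_version=MODEL_VERSION,
        )

    def EmbedStream(self, request_iterator, context):
        # Bidirectional stream: one response per request, in request order.
        # An invalid item aborts the whole stream with INVALID_ARGUMENT.
        for request in request_iterator:
            yield self.Embed(request, context)

def serve() -> None:
    server = grpc.server(futures.ThreadPoolExecutor(max_workers=8))
    embeddings_pb2_grpc.add_EmbedderServicer_to_server(EmbedderServicer(), server)
    server.add_insecure_port("[::]:50051")
    server.start()
    server.wait_for_termination()

The streaming handler reuses the unary logic, so both paths return identical shapes and error codes.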
Example 3 — Batch job via REST and gRPC
Asynchronous REST job:
POST /v1/jobs/text-batch:submit
{ "s3_uri": "s3://bucket/texts.jsonl", "output_uri": "s3://bucket/out/" }
202 Accepted
{ "job_id": "job_123", "status": "QUEUED" }
GET /v1/jobs/job_123
200 OK
{ "job_id": "job_123", "status": "SUCCEEDED", "output_uri": "s3://bucket/out/" }
gRPC alternative: a long-running RPC that server-streams JobStatus updates until completion, or a client-streaming RPC that sends inputs to a sink service.
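A minimal client sketch for the asynchronous REST flow in this example, using the requests library; BASE_URL, the timeouts, and the fixed poll interval are assumptions:

import time

import requests

BASE_URL = "https://api.example.com"  # placeholder

def run_batch_job(s3_uri: str, output_uri: str, poll_seconds: float = 5.0) -> dict:
    resp = requests.post(
        f"{BASE_URL}/v1/jobs/text-batch:submit",
        json={"s3_uri": s3_uri, "output_uri": output_uri},
        timeout=10,
    )
    resp.raise_for_status()  # expect 202 Accepted
    job_id = resp.json()["job_id"]
    while True:
        status = requests.get(f"{BASE_URL}/v1/jobs/{job_id}", timeout=10).json()
        if status["status"] in ("SUCCEEDED", "FAILED"):
            return status
        time.sleep(poll_seconds)  # production clients should back off and cap total wait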
Example 4 — Backward-compatible Protobuf change
// v1
message PredictRequest { string text = 1; }
// v1.1 (compatible): add an optional field with a new tag
message PredictRequest { string text = 1; string language = 2; }
// Bad: reusing tag 2 with a different meaning breaks old clients (do not do this):
// message PredictRequest { string text = 1; int32 language_id = 2; }
// If you must remove a field, reserve its tag (and name) instead:
// message PredictRequest { string text = 1; reserved 2; reserved "language"; }
Hands-on exercises
Do these now. You can compare with solutions after you try.
- Exercise 1: Design a REST predict endpoint for a text classifier. Include OpenAPI snippet, request/response bodies, and error mapping.
- Exercise 2: Define a gRPC service for image embeddings with unary and streaming RPCs. Include messages and basic error semantics.
Self-check checklist
- Versioning is explicit (path or schema).
- Batching supported or intentionally excluded with reasoning.
- Response includes model_version and request_id.
- Error codes are well-defined and tested.
Common mistakes and self-checks
- Ambiguous endpoints: Using GET for inference with bodies. Self-check: predictions are POST; GET is safe.
- No versioning: Breaking clients on update. Self-check: Can you deploy a new model without client changes?
- Hidden batch limits: Silent truncation. Self-check: Document the max batch size; return 413 or 400 with guidance (see the sketch after this list).
- Poor error mapping: Everything is 500. Self-check: Do you see 4xx for client issues vs 5xx for server?
- Leaking PII in logs: Logging raw inputs. Self-check: Are inputs redacted or hashed? Consent documented?
- Reusing Protobuf tags: Causes data corruption. Self-check: Old tags reserved and never reused.
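For the hidden-batch-limits item, a minimal guard sketch; MAX_BATCH and the exception type are illustrative, and the serving layer translates the failure into the documented status code:

MAX_BATCH = 64  # publish this limit in your API documentation

def check_batch_size(n_inputs: int) -> None:
    # REST: surface this as 413 (or 400) with guidance; gRPC: INVALID_ARGUMENT.
    if n_inputs > MAX_BATCH:
        raise ValueError(
            f"batch of {n_inputs} exceeds the documented max of {MAX_BATCH}; "
            "split the request instead of silently truncating it"
        )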
Practical projects
- Project A: Ship a sentiment REST API with /v1 and /v2. Run a canary: 10% traffic to /v2. Deliverables: OpenAPI spec, curl examples, dashboards for latency and errors.
- Project B: Build a gRPC embedding service with unary and streaming RPC. Load test with 1k QPS. Deliverables: .proto, client code, latency histogram, CPU/mem profile.
- Project C: Create a batch job REST endpoint and a status poller. Deliverables: job lifecycle doc, error table, idempotency test evidence.
Learning path
- Review HTTP basics and status codes; map common ML errors to 4xx/5xx.
- Model your request/response schema with examples and edge cases.
- Add versioning and rollout plan (aliases, canary, deprecation).
- Implement observability: request_id, model_version, latency metrics.
- Load test REST vs gRPC; choose based on latency/throughput and client ecosystem.
Next steps
- Harden your error model with integration tests.
- Add batch and streaming paths if your workload benefits.
- Prepare migration notes for the next version of your API.
Mini challenge
You must serve speech-to-text in real time to an internal microservice and expose offline transcription to external partners. Propose one REST endpoint and one gRPC RPC, explain why each fits its audience, and list key error codes.