2) Text generation — gRPC server streaming
syntax = "proto3";
package inference.v1;

message GenerateRequest {
  string request_id = 1;
  string model = 2;
  string prompt = 3;
  int32 max_tokens = 4;
  float temperature = 5;
}

message GenerateChunk {
  string request_id = 1;
  string token = 2;
  bool is_final = 3;
  int64 latency_ms = 4; // First-token latency for the first chunk
}

service TextGen {
  rpc Generate (GenerateRequest) returns (stream GenerateChunk);
}
Notes: server streaming keeps time-to-first-token low; end the stream with a final chunk where is_final=true, carrying summary metrics.
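A minimal Python servicer sketch for this flow, assuming the proto above has been compiled with grpcio-tools into inference_pb2 / inference_pb2_grpc (module names are assumptions); generate_tokens() is a hypothetical stand-in for the actual model call.

import time
from concurrent import futures

import grpc
import inference_pb2
import inference_pb2_grpc


def generate_tokens(prompt, max_tokens, temperature):
    # Hypothetical placeholder: yield tokens from your model here.
    for word in prompt.split()[:max_tokens]:
        yield word


class TextGenServicer(inference_pb2_grpc.TextGenServicer):
    def Generate(self, request, context):
        start = time.monotonic()
        first = True
        for token in generate_tokens(request.prompt, request.max_tokens,
                                     request.temperature):
            chunk = inference_pb2.GenerateChunk(
                request_id=request.request_id, token=token, is_final=False)
            if first:
                # Report first-token latency on the first chunk only.
                chunk.latency_ms = int((time.monotonic() - start) * 1000)
                first = False
            yield chunk
        # Final chunk: is_final=true plus a summary metric (total latency here).
        yield inference_pb2.GenerateChunk(
            request_id=request.request_id, token="", is_final=True,
            latency_ms=int((time.monotonic() - start) * 1000))


def serve():
    server = grpc.server(futures.ThreadPoolExecutor(max_workers=8))
    inference_pb2_grpc.add_TextGenServicer_to_server(TextGenServicer(), server)
    server.add_insecure_port("[::]:50051")
    server.start()
    server.wait_for_termination()

Because Generate returns an iterator, grpcio sends each yielded chunk to the client as it is produced, which is what keeps time-to-first-token low.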
3) Tabular anomaly detection — gRPC unary with batching
syntax = "proto3";
package inference.v1;

message Row { repeated float features = 1; string id = 2; }
message DetectRequest { string request_id = 1; repeated Row rows = 2; }
message Anomaly { string id = 1; float score = 2; bool is_anomaly = 3; }

message DetectResponse {
  string request_id = 1;
  string model_version = 2;
  int64 latency_ms = 3;
  repeated Anomaly results = 4; // Same order as rows
}

service AnomalyService { rpc Detect (DetectRequest) returns (DetectResponse); }
Notes: preserve input order in the results; include per-item IDs so clients can match rows to results; set a maximum batch size (e.g., 256 rows).
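A matching unary handler sketch in Python, again assuming grpcio-tools output in inference_pb2 / inference_pb2_grpc; score_rows() and the model_version string are hypothetical placeholders.

import time

import grpc
import inference_pb2
import inference_pb2_grpc

MAX_ROWS = 256  # mirror the suggested batch cap


def score_rows(rows):
    # Hypothetical placeholder: one (score, is_anomaly) pair per row, in order.
    return [(0.0, False) for _ in rows]


class AnomalyServicer(inference_pb2_grpc.AnomalyServiceServicer):
    def Detect(self, request, context):
        if len(request.rows) > MAX_ROWS:
            context.abort(grpc.StatusCode.INVALID_ARGUMENT,
                          f"batch too large: {len(request.rows)} > {MAX_ROWS}")
        start = time.monotonic()
        scores = score_rows(request.rows)
        response = inference_pb2.DetectResponse(
            request_id=request.request_id, model_version="2024-01-01")
        # Results are appended in input order and carry the caller's row IDs.
        for row, (score, flag) in zip(request.rows, scores):
            response.results.add(id=row.id, score=score, is_anomaly=flag)
        response.latency_ms = int((time.monotonic() - start) * 1000)
        return response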
Performance and cost tips
- Batching improves throughput but increases tail latency; tune it per workload with a maximum batch size and a max_batch_delay_ms cutoff (see the coalescing sketch after this list).
- Prefer gRPC for large volumes or low-latency needs; REST for simple integration.
- Avoid huge JSON payloads; use base64 only when necessary.
- Warm the model and keep a small pool of ready workers; expose /ready and /health endpoints.
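To make the batching tip concrete, here is an illustrative request coalescer: it collects up to MAX_BATCH items but never waits longer than MAX_BATCH_DELAY_MS for stragglers. All names here (run_model, submit, batcher) are assumptions for the sketch, not part of any framework.

import asyncio

MAX_BATCH = 32
MAX_BATCH_DELAY_MS = 5

queue: asyncio.Queue = asyncio.Queue()


def run_model(inputs):
    # Hypothetical placeholder: one batched forward pass, outputs in input order.
    return list(inputs)


async def submit(item):
    # Called per request: enqueue the item and await its result.
    fut = asyncio.get_running_loop().create_future()
    await queue.put((item, fut))
    return await fut


async def batcher():
    # Start once at startup, e.g. asyncio.create_task(batcher()).
    while True:
        item, fut = await queue.get()  # block until work arrives
        batch = [(item, fut)]
        deadline = asyncio.get_running_loop().time() + MAX_BATCH_DELAY_MS / 1000
        while len(batch) < MAX_BATCH:
            timeout = deadline - asyncio.get_running_loop().time()
            if timeout <= 0:
                break
            try:
                batch.append(await asyncio.wait_for(queue.get(), timeout))
            except asyncio.TimeoutError:
                break
        outputs = run_model([i for i, _ in batch])
        for (_, fut), out in zip(batch, outputs):
            fut.set_result(out)  # results return to callers in input order

Raising MAX_BATCH_DELAY_MS buys throughput at the cost of tail latency, which is exactly the trade-off to tune per workload.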
Security and privacy
- Authenticate every call; restrict model access by tenant.
- PII handling: redact in logs; encrypt in transit; set retention policies.
- Validate and sanitize inputs; enforce size/type limits.
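One way to enforce a hard size cap at the transport layer, if you serve over gRPC: grpcio accepts message-size options when the server is created (the 4 MiB value below is illustrative).

from concurrent import futures

import grpc

MAX_MESSAGE_BYTES = 4 * 1024 * 1024  # 4 MiB cap on any single message

server = grpc.server(
    futures.ThreadPoolExecutor(max_workers=8),
    options=[
        ("grpc.max_receive_message_length", MAX_MESSAGE_BYTES),
        ("grpc.max_send_message_length", MAX_MESSAGE_BYTES),
    ],
)
# Oversized requests fail with RESOURCE_EXHAUSTED before they reach your
# handler; pair this with per-field validation inside the handler itself.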
Exercises
Try these hands-on tasks.
- Exercise 1: Design a REST/JSON sentiment endpoint for single and batch inputs with clear errors.
- Exercise 2: Define a gRPC Embeddings service with request_id and per-text outputs.
Checklist
- Request/response include request_id, model, model_version.
- Errors map to HTTP/gRPC codes with machine-readable codes.
- Batching preserves order; per-item errors are possible.
- Reasonable size limits and timeouts are defined.
Common mistakes and self-check
- Mixing API and model versioning. Self-check: Can clients pin a model_version while staying on /v1?
- No request_id. Self-check: Can you trace one call across logs and metrics?
- Unbounded payload sizes. Self-check: Do you reject 20 MB JSON with 413?
- Ambiguous errors. Self-check: Would a client know whether to retry or to fix its input? (See the error-mapping sketch after this list.)
- Breaking changes. Self-check: Did you only add optional fields in v1?
- Ignoring timeouts. Self-check: Do clients and server have aligned time budgets?
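A sketch of REST error mapping, using FastAPI purely as an example framework (an assumption, not a requirement): oversized bodies get 413, bad input gets 400, and every error body carries a machine-readable code plus a retryable hint. The /v1/classify route and field names are hypothetical.

from fastapi import FastAPI, Request
from fastapi.responses import JSONResponse

app = FastAPI()
MAX_BODY_BYTES = 1 * 1024 * 1024  # 1 MiB request cap


def error_body(code: str, message: str, retryable: bool) -> dict:
    return {"error": {"code": code, "message": message, "retryable": retryable}}


@app.middleware("http")
async def reject_oversized(request: Request, call_next):
    length = request.headers.get("content-length")
    if length and int(length) > MAX_BODY_BYTES:
        return JSONResponse(status_code=413,
                            content=error_body("payload_too_large",
                                               "request body exceeds 1 MiB",
                                               retryable=False))
    return await call_next(request)


@app.post("/v1/classify")
async def classify(payload: dict):
    if "text" not in payload:
        return JSONResponse(status_code=400,
                            content=error_body("missing_field",
                                               "'text' is required",
                                               retryable=False))
    # Run the model here; transient backend failures would map to 503
    # with retryable=True in the error body.
    return {"label": "positive", "score": 0.98}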
Practical projects
- Build a dual-protocol service: Offer the same image-classify model via REST and gRPC; confirm equivalent results for a test set.
- Add streaming: Implement server-streaming for a text generator; render tokens as they arrive.
- Observability pack: Add request_id propagation, structured logs, latency histograms, and a /metrics endpoint.
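A minimal observability sketch for the last project: the client-supplied request_id is echoed into a structured log line and a latency histogram, and prometheus_client (a real library, used here as an assumption) serves the /metrics endpoint.

import json
import logging
import time
import uuid

from prometheus_client import Histogram, start_http_server

LATENCY = Histogram("inference_latency_seconds", "End-to-end inference latency")
log = logging.getLogger("inference")


def handle(request_id, payload):
    request_id = request_id or str(uuid.uuid4())  # generate one if the client omitted it
    start = time.monotonic()
    result = {"ok": True}  # hypothetical placeholder for the actual model call
    elapsed = time.monotonic() - start
    LATENCY.observe(elapsed)
    # One structured (JSON) log line per request, keyed by request_id.
    log.info(json.dumps({"request_id": request_id,
                         "latency_ms": int(elapsed * 1000)}))
    return request_id, result


start_http_server(9000)  # exposes Prometheus metrics at :9000/metrics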
Mini challenge: Introduce a new optional response field (e.g., calibration_score) without breaking existing clients. Document the change in a deprecation note inside the response warnings.
Learning path
- Start: REST/JSON endpoint with clean error mapping.
- Next: gRPC unary, then add server streaming.
- Then: batching strategies and request coalescing.
- Finally: observability, rate limits, and versioning policy.
Next steps
- Write a one-page API contract for your current model.
- Implement input validation and size limits.
- Add request_id propagation end-to-end.