Why this matters
As a Computer Vision Engineer, you create value only when your models are reliably accessible to products and users. A well-designed inference API makes model outputs predictable, fast, and safe to integrate. You will frequently:
- Serve detection/segmentation/classification models to web and mobile apps.
- Process images in batches for data pipelines.
- Handle large files (videos, PDFs) asynchronously.
- Version models, track latency/throughput targets, and provide stable contracts to frontend/partner teams.
Concept explained simply
An inference API is a clear contract: how clients send inputs to your model and how they receive results. Good design protects clients from model changes and protects servers from bad inputs.
Key ingredients
- Endpoint shape: single, batch, async job.
- Input format: image as URL, base64, or multipart upload; video as URL or job ID.
- Output format: consistent JSON with predictions, confidence, and metadata.
- Error handling: meaningful codes/messages and retry guidance.
- Performance: latency budgets, timeouts, and batch sizes.
- Safety: authentication, input validation, size limits.
- Stability: versioning and deprecation plan.
Mental model
Think in layers:
- Contract: requests/responses are your unbreakable interface.
- Traffic control: sync for quick images, async for heavy jobs.
- Validation gate: reject bad inputs early with clear messages.
- Predictable outputs: same schema regardless of model internals.
- Observability: IDs and metadata for tracing and debugging.
Core design decisions for CV inference APIs
1) Sync vs Async
- Sync (HTTP 200 with result): small images, expected latency < 1–2s.
- Async (HTTP 202 with job_id): large images, PDFs, videos, or latency > 2s. Client polls or subscribes for status.
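To make the split concrete, here is a minimal framework-free sketch of the dispatch decision (the in-memory job store and the eta value are placeholders for illustration):

import uuid

JOB_STORE = {}  # job_id -> status; stands in for a real queue or database

def handle_request(payload: dict) -> tuple:
    """Return (http_status, body): 200 for quick sync work, 202 for heavy async work."""
    if "video_url" in payload:
        # Heavy input: queue a job and hand back a handle immediately.
        job_id = "job_" + uuid.uuid4().hex[:8]
        JOB_STORE[job_id] = "queued"
        return 202, {"job_id": job_id, "status": "queued", "eta_seconds": 45}
    # Small image: run inference inline and return the result directly.
    return 200, {"predictions": run_model_stub(payload)}

def run_model_stub(payload: dict) -> list:
    return [{"class": "person", "score": 0.91, "bbox": [10, 20, 100, 200]}]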
2) Input format
- image_url: simple, server fetches image. Validate protocol and size.
- image_base64: single-request transport. Enforce max size and mime type.
- multipart/form-data: good for uploads from browsers.
- video_url: required for async jobs; optionally accept clip_time ranges.
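A sketch of the validation gate for these transports; the size cap, allowed types, and helper name are illustrative assumptions:

import base64
from urllib.parse import urlparse

MAX_IMAGE_BYTES = 10 * 1024 * 1024          # assumed 10 MB cap
ALLOWED_MIME = {"image/jpeg", "image/png"}  # assumed allowlist

def validate_image_input(body: dict) -> None:
    """Raise ValueError with a client-facing message; map to HTTP 400 upstream."""
    if "image_url" in body:
        scheme = urlparse(body["image_url"]).scheme
        if scheme not in ("http", "https"):
            raise ValueError("image_url must use http(s)")
    elif "image_base64" in body:
        try:
            raw = base64.b64decode(body["image_base64"], validate=True)
        except Exception:
            raise ValueError("image_base64 is not valid base64")
        if len(raw) > MAX_IMAGE_BYTES:
            raise ValueError("image exceeds %d bytes" % MAX_IMAGE_BYTES)
    else:
        raise ValueError("provide image_url or image_base64")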
3) Output schema
- Always include: request_id, model_version, timestamp.
- Predictions list with class, score, and geometry (bbox, polygon, mask reference).
- Optional: calibration info, processing_time_ms, warnings.
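One way to pin this schema down is with typed models. This sketch uses stdlib dataclasses (Pydantic would work just as well); the to_wire helper is an assumption to work around Python's reserved word class:

from dataclasses import asdict, dataclass, field
from typing import Optional

@dataclass
class Prediction:
    class_name: str                  # wire name is "class"; renamed in to_wire()
    score: float
    bbox: Optional[list] = None      # [x, y, w, h]; optional so mask-only models fit

@dataclass
class InferenceResponse:
    request_id: str
    model_version: str
    timestamp: str
    predictions: list
    processing_time_ms: Optional[int] = None
    warnings: list = field(default_factory=list)

def to_wire(p: Prediction) -> dict:
    d = asdict(p)
    d["class"] = d.pop("class_name")  # "class" is a keyword in Python
    return d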
4) Batching
- Accept an array of inputs under inputs[].
- Return outputs aligned by index; on partial failure, include per-item errors.
- Advertise max_batch_size; reject larger with clear errors.
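A sketch of batch handling with per-item errors, so one bad input never fails the whole batch (the batch limit and infer_one are assumptions):

MAX_BATCH_SIZE = 8

def handle_batch(inputs: list) -> tuple:
    if len(inputs) > MAX_BATCH_SIZE:
        return 400, {"error": {"code": "BATCH_TOO_LARGE",
                               "message": "max_batch_size is %d" % MAX_BATCH_SIZE}}
    results = []
    for item in inputs:                       # outputs stay aligned by index/id
        try:
            results.append({"id": item["id"], **infer_one(item)})
        except Exception as exc:              # per-item failure; batch still succeeds
            results.append({"id": item.get("id"),
                            "error": {"code": "ITEM_ERROR", "message": str(exc)}})
    return 200, {"results": results}

def infer_one(item: dict) -> dict:
    return {"text": "stub", "confidence": 0.9}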
5) Errors and retries
- 4xx for client mistakes (validation, size, auth).
- 5xx for server issues; include retry_after seconds when appropriate.
- Provide error.code (e.g., VALIDATION_ERROR, TIMEOUT) and a human-readable message.
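A small helper can keep every error in the same envelope; the code names follow the conventions above, and placing retry_after inside the error object is an assumption:

def error_body(code, message, retry_after=None):
    body = {"error": {"code": code, "message": message}}
    if retry_after is not None:
        body["error"]["retry_after"] = retry_after  # seconds; mirror in a Retry-After header
    return body

# 400 -> error_body("VALIDATION_ERROR", "image exceeds 10 MB limit")
# 503 -> error_body("OVERLOADED", "try again shortly", retry_after=30)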
6) Versioning
- Path-based (/v1/...), header-based (X-Model-Version), or model_id parameter.
- Never remove fields without deprecation period; add new fields as optional.
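Path-based versioning can be as simple as routing each prefix to a pinned model build; the registry contents below are made up for illustration:

# Each API version pins a model build, so /v1 stays stable while /v2 evolves.
MODEL_REGISTRY = {
    "v1": "detector-v3.2",
    "v2": "detector-v4.0-rc1",
}

def resolve_model(path: str) -> str:
    version = path.strip("/").split("/")[0]   # "/v1/detect" -> "v1"
    if version not in MODEL_REGISTRY:
        raise KeyError("unknown API version: " + version)
    return MODEL_REGISTRY[version]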
7) Security and limits
- Token-based auth in Authorization header.
- Max payload size, allowed mime types, and content scanning.
- Rate limits and fair batching to prevent overload.
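A sketch of the pre-inference gate covering auth, payload caps, and a naive in-memory rate limit; all thresholds are illustrative, and a production system would use a shared store:

import time

RATE_LIMIT = 10          # assumed requests per window
WINDOW_SECONDS = 60
_hits = {}               # token -> recent request timestamps

def check_request(auth_header: str, payload_bytes: int) -> None:
    """Raise with an HTTP-like status prefix; call before any model work."""
    if not auth_header.startswith("Bearer "):
        raise PermissionError("401: missing or malformed Authorization header")
    if payload_bytes > 10 * 1024 * 1024:
        raise ValueError("413: payload exceeds 10 MB limit")
    token = auth_header.removeprefix("Bearer ")
    now = time.time()
    recent = [t for t in _hits.get(token, []) if now - t < WINDOW_SECONDS]
    if len(recent) >= RATE_LIMIT:
        raise RuntimeError("429: rate limit exceeded; retry later")
    _hits[token] = recent + [now]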
Worked examples
Example 1: Sync object detection (single image)
POST /v1/detect
{
  "image_url": "https://example.com/street.jpg",
  "threshold": 0.35,
  "return": "bbox"
}
Response 200
{
  "request_id": "req_01H...",
  "model_version": "detector-v3.2",
  "processing_time_ms": 412,
  "predictions": [
    {"class": "person", "score": 0.91, "bbox": [x, y, w, h]},
    {"class": "car", "score": 0.88, "bbox": [x, y, w, h]}
  ],
  "warnings": []
}
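From the client side, the same call might look like this stdlib-only sketch (the host and token are placeholders):

import json
from urllib import request

req = request.Request(
    "https://api.example.com/v1/detect",
    data=json.dumps({"image_url": "https://example.com/street.jpg",
                     "threshold": 0.35}).encode(),
    headers={"Content-Type": "application/json",
             "Authorization": "Bearer <token>"},
)
with request.urlopen(req) as resp:
    body = json.load(resp)
    for p in body["predictions"]:
        print(p["class"], p["score"], p["bbox"])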
Example 2: Batch OCR (multipart or base64)
POST /v1/ocr (accepts inputs[] with base64 or URLs)
{
  "inputs": [
    {"id": "doc1", "image_base64": "..."},
    {"id": "doc2", "image_url": "https://.../invoice.png"}
  ],
  "language": "eng",
  "return": "text"
}
Response 200 (results aligned by input id)
{
  "request_id": "req_01J...",
  "model_version": "ocr-small-1.4",
  "results": [
    {"id": "doc1", "text": "Total: 124.50 USD", "confidence": 0.94},
    {"id": "doc2", "error": {"code": "FETCH_ERROR", "message": "Could not fetch image_url"}}
  ]
}
Example 3: Async video segmentation
POST /v1/video/segment (returns 202)
{
  "video_url": "https://example.com/clip.mp4",
  "classes": ["person", "road"],
  "output": {"format": "mask_uri"}
}
Response 202 (job queued)
{
  "job_id": "job_7f...",
  "status": "queued",
  "eta_seconds": 45
}
GET /v1/video/segment/job_7f...
{
  "job_id": "job_7f...",
  "status": "succeeded",
  "request_id": "req_...",
  "model_version": "vid-seg-2.1",
  "artifacts": {
    "mask_manifest_uri": "s3://bucket/manifest.json"
  }
}
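A client polling loop for this flow might look like the sketch below; the timeout and fallback interval are assumptions, and a production client should also honor any retry_after hints:

import json
import time
from urllib import request

def poll_job(job_id: str, timeout_s: int = 300) -> dict:
    """Poll the status endpoint until the job finishes or the deadline passes."""
    deadline = time.time() + timeout_s
    while time.time() < deadline:
        url = "https://api.example.com/v1/video/segment/" + job_id
        with request.urlopen(url) as resp:
            job = json.load(resp)
        if job["status"] in ("succeeded", "failed"):
            return job
        time.sleep(job.get("eta_seconds", 5))   # back off using the server's hint
    raise TimeoutError("job %s still running after %ds" % (job_id, timeout_s))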
Example 4: Detection with calibration and per-class thresholds
POST /v1/detect
{
  "image_url": "https://...",
  "thresholds": {"person": 0.3, "car": 0.5},
  "calibration": {"temperature": 1.2}
}
Response 200
{
  "predictions": [...],
  "calibration": {"temperature": 1.2, "note": "Applied"}
}
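Server-side, per-class thresholds and temperature calibration could be combined as in this sketch; it assumes softmax scores per candidate, which is a simplification of how real detectors score classes:

import math

def calibrate_and_filter(logits: dict, thresholds: dict, temperature: float = 1.0) -> list:
    """logits: {"person": 2.1, "car": 0.3, ...} for one detection candidate."""
    scaled = {c: l / temperature for c, l in logits.items()}   # temperature scaling
    z = sum(math.exp(v) for v in scaled.values())
    probs = {c: math.exp(v) / z for c, v in scaled.items()}    # softmax
    # Keep classes whose calibrated score clears their own threshold (default 0.5).
    return [{"class": c, "score": round(p, 3)}
            for c, p in probs.items() if p >= thresholds.get(c, 0.5)]

print(calibrate_and_filter({"person": 2.1, "car": 0.3},
                           {"person": 0.3, "car": 0.5},
                           temperature=1.2))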
Design checklist
- [ ] Is the endpoint sync/async documented and predictable?
- [ ] Are input types validated with clear limits (size, mime, dimensions)?
- [ ] Is batching supported with max_batch_size and per-item errors?
- [ ] Is the response schema consistent and versioned?
- [ ] Do errors carry error.code and actionable messages?
- [ ] Are authentication and rate limits defined?
- [ ] Are request_id and model_version always present?
- [ ] Are timeouts and latency budgets communicated?
Exercises
Do these in order. You can check solutions below each exercise.
Exercise 1 — Batch object detection API spec
Design the request/response for a batch detection endpoint that supports up to 8 images, accepts image_url or base64, and returns bboxes with scores and class names. Include how partial failures are represented.
Exercise 2 — Async video classification
Propose an async API for classifying a 20-second clip by action label. Define: job submission payload, 202 response, status polling response, and how to return per-segment scores. Include a retry_after hint if processing is busy.
Common mistakes and self-check
Mistakes to avoid
- Returning different schemas for different classes or models. Fix: use one stable schema with optional fields.
- Hiding errors inside 200 responses. Fix: use correct HTTP status and include error.code.
- Not bounding input sizes. Fix: enforce limits; document rejections.
- Forgetting model_version. Fix: include it in all responses.
- No plan for async jobs. Fix: job_id, status transitions, and polling endpoint.
- Only supporting one input transport. Fix: support URL and base64 or multipart.
Self-check prompts
- Can you upgrade the model without forcing client code changes?
- Can you identify and reproduce a bad result using request_id logs?
- Is there a clear path when processing exceeds timeouts?
Practical projects
- Stubbed detection API: Implement a mock /v1/detect that validates inputs and returns a fixed prediction. Focus on schema and errors (a starter sketch follows this list).
- Batch OCR pipeline: Create /v1/ocr for inputs[] with per-item results and errors. Add max_batch_size enforcement.
- Async video job runner: Design endpoints for submit, status, and cancel. Simulate processing delay. Add retry_after and job TTL.
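As a starting point for the first project, here is a FastAPI sketch (FastAPI is one reasonable choice, not a requirement; the prediction is deliberately hard-coded):

# pip install fastapi uvicorn   (assumed; run with: uvicorn app:app)
import uuid
from fastapi import FastAPI, HTTPException
from pydantic import BaseModel, HttpUrl

app = FastAPI()

class DetectRequest(BaseModel):
    image_url: HttpUrl            # rejects non-URL inputs automatically
    threshold: float = 0.5

@app.post("/v1/detect")
def detect(req: DetectRequest):
    if not (0.0 <= req.threshold <= 1.0):
        raise HTTPException(400, detail={"code": "VALIDATION_ERROR",
                                         "message": "threshold must be in [0, 1]"})
    return {
        "request_id": "req_" + uuid.uuid4().hex,
        "model_version": "detector-stub-0.1",
        "predictions": [{"class": "person", "score": 0.91, "bbox": [10, 20, 100, 200]}],
        "warnings": [],
    }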
Who this is for
- Computer Vision Engineers deploying models to production.
- ML Engineers building platform APIs for perception tasks.
- Backend Engineers integrating CV models into services.
Prerequisites
- Basic HTTP and JSON knowledge.
- Understanding of your CV model outputs (bbox, masks, classes).
- Familiarity with authentication and status codes.
Learning path
- Design stable request/response schemas for your current model.
- Add batching and per-item error handling.
- Introduce async for large files and long jobs.
- Implement versioning and deprecation policy.
- Harden validation, auth, and rate limits.
- Instrument request_id, latency, and error metrics.
Mini challenge
Your detection model will soon add instance masks. Without breaking clients, update your API to support masks while keeping bbox-only clients working. Propose the changes.
Hint
Add optional fields (e.g., mask_uri or polygons) and advertise via capabilities in the response while preserving existing fields.
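For instance, an extended response might look like this; the capabilities field and mask_uri name are one possible choice, not a fixed convention:

{
  "request_id": "req_...",
  "model_version": "detector-v4.0",
  "capabilities": ["bbox", "mask"],
  "predictions": [
    {"class": "person", "score": 0.91, "bbox": [x, y, w, h], "mask_uri": "s3://bucket/mask_0.png"}
  ]
}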
Next steps
- Complete the exercises below and check your answers.
- Take the Quick Test to confirm your understanding.
- Apply the checklist to one of your existing endpoints.