Who this is for
Applied Scientists and ML Engineers who need their models to work reliably with product, backend, and data platforms. If you ship models as services or batch jobs used by other teams, this is for you.
Why this matters
Real product work relies on clear boundaries. A good interface and explicit constraints let multiple teams build in parallel and prevent surprises at launch.
- API consumers know exactly what inputs are accepted and what outputs to expect.
- Ops can monitor SLAs and cost budgets.
- You can iterate and version models without breaking downstream systems.
Typical tasks in the role
- Define request/response schemas for a ranking or scoring service.
- Set latency, throughput, and cost budgets with infra partners.
- Document feature availability and staleness constraints from the feature store.
- Add versioning, deprecation policy, and error handling rules.
- Create contract tests to ensure nothing breaks across releases.
Concept explained simply
An interface is a contract that says: "If you give me X, I will return Y, under these conditions." Constraints are the guardrails: performance, cost, privacy, safety, and operational limits you agree to meet.
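In code, the contract is just typed inputs and outputs with the guardrails enforced at the boundary. Here is a minimal sketch in Python; the field names and bounds are illustrative, not a real API.

from dataclasses import dataclass

# Hypothetical request/response pair for a scoring service.
# Field names and bounds are illustrative only.

@dataclass(frozen=True)
class ScoreRequest:
    items: list[str]      # the "X" the caller promises to send
    max_items: int = 32   # explicit input bound (a constraint)

    def __post_init__(self):
        if not 1 <= len(self.items) <= self.max_items:
            raise ValueError(f"items must have 1..{self.max_items} entries")

@dataclass(frozen=True)
class ScoreResponse:
    scores: list[float]   # the "Y" the service promises to return, each in [0, 1]

    def __post_init__(self):
        if any(not 0.0 <= s <= 1.0 for s in self.scores):
            raise ValueError("scores must be in [0, 1]")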
Mental model
Think of your ML service as a shipping container:
- The container (interface) has a standard shape and labels (schema, types, version, error codes).
- Loading rules (constraints) define how heavy it can be (payload size), how fast it must arrive (latency SLOs), and inspection rules (validation, observability).
- If you change the inside contents (model), the container specs stay stable or evolve with clear versioning.
Key elements of a good interface and constraints
Interface contract
- Purpose: one clear sentence of what the service does.
- Request schema: types, required vs optional fields, max sizes, allowed value ranges.
- Response schema: fields, types, ranges, confidence, reason codes, and ordering guarantees.
- Behavioral guarantees: idempotency, determinism within a window, ordering, pagination, and batching rules.
- Error handling: error codes, retryability, backoff guidance, partial failure behavior.
- Versioning: version field, deprecation policy, change log expectations.
Constraints (guardrails)
- Latency: p50/p95 targets and timeouts, per batch size.
- Throughput: requests/sec and burst limits; rate limiting rules.
- Capacity and scaling: max concurrent requests, autoscaling triggers.
- Resource limits: CPU/GPU type, memory ceiling, max batch size, vector dimension limits.
- Cost budget: cost per 1k requests or per hour; concurrency tradeoffs.
- Data constraints: PII handling, retention, encryption at rest/in transit.
- Safety and fairness: blocked inputs, thresholds, fail-safe modes.
- Observability: required logs, metrics, traces, and dashboards.
Minimal interface template
{
  "purpose": "Short sentence of what this service does",
  "version": "v1",
  "request": {
    "schema": {
      "fields": [
        {"name": "items", "type": "array[string]", "required": true, "max_len": 32},
        {"name": "context", "type": "object", "required": false}
      ],
      "limits": {"max_payload_bytes": 200000}
    },
    "validation": ["reject_invalid_schema", "trim_whitespace"]
  },
  "response": {
    "schema": {
      "fields": [
        {"name": "scores", "type": "array[number]", "range": [0, 1]},
        {"name": "reason_codes", "type": "array[array[string]]"}
      ]
    },
    "errors": [
      {"code": "INVALID_INPUT", "retryable": false},
      {"code": "TIMEOUT", "retryable": true}
    ]
  },
  "constraints": {
    "latency_ms_p95": 200,
    "throughput_rps": 100,
    "max_batch": 32,
    "cost_per_1k_requests": 0.50,
    "data": {"pii": false}
  },
  "observability": {
    "metrics": ["latency_ms_p50", "latency_ms_p95", "rps", "error_rate"],
    "logs": ["request_id", "version", "error_code"],
    "traces": true
  },
  "compatibility": {
    "idempotency": true,
    "backwards_compatible_fields": true,
    "deprecation_policy_days": 90
  }
}
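To make limits like max_len and max_payload_bytes testable rather than aspirational, enforce them at the service boundary. Below is a minimal validation sketch in Python assuming the template above; the SPEC dict and validate_request are illustrative, not a real framework.

import json

# Minimal request validation against the template above (illustrative).
SPEC = {
    "fields": [
        {"name": "items", "type": "array[string]", "required": True, "max_len": 32},
        {"name": "context", "type": "object", "required": False},
    ],
    "limits": {"max_payload_bytes": 200000},
}

def validate_request(raw: bytes) -> dict:
    """Return the parsed payload, or raise ValueError carrying a stable error code."""
    if len(raw) > SPEC["limits"]["max_payload_bytes"]:
        raise ValueError("INVALID_INPUT: payload exceeds max_payload_bytes")
    payload = json.loads(raw)
    for field in SPEC["fields"]:
        name = field["name"]
        if field["required"] and name not in payload:
            raise ValueError(f"INVALID_INPUT: missing required field '{name}'")
        if name in payload and "max_len" in field and len(payload[name]) > field["max_len"]:
            raise ValueError(f"INVALID_INPUT: '{name}' exceeds max_len {field['max_len']}")
    return payload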
Worked examples
1) Real-time fraud scoring API
Purpose: Score a single transaction for fraud likelihood, returning a risk score and reason codes.
Request:
transaction_id: string (required)
amount: number (0..100000)
currency: string (ISO 4217)
features: object (optional)
idempotency_key: string (required)
Response:
risk_score: number (0..1)
reasons: array[string] (max 5)
model_version: string
Constraints:
p95 latency <= 80ms; timeout 120ms
throughput: 300 rps sustained, 600 rps burst (60s)
cost <= $0.20 per 1k requests (illustrative; real budgets vary by company and workload)
idempotent by idempotency_key
Errors:
INVALID_INPUT (no retry), TIMEOUT (retry with backoff), RATE_LIMIT (retry after the interval in the Retry-After header)
Observability:
Log request_id, transaction_id (hashed), version; tracing enabled
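A client can honor this contract mechanically: reuse one idempotency key across attempts and retry only the errors the contract marks retryable. A sketch in Python, assuming the requests library; the endpoint URL is hypothetical.

import time
import uuid
import requests  # assumed HTTP client

API_URL = "https://fraud.example.internal/v1/score"  # hypothetical endpoint
RETRYABLE = {"TIMEOUT", "RATE_LIMIT"}

def score_transaction(txn: dict, max_attempts: int = 3) -> dict:
    """Call the fraud API, retrying only contract-retryable errors."""
    body = {**txn, "idempotency_key": str(uuid.uuid4())}  # same key reused on every retry
    for attempt in range(max_attempts):
        try:
            resp = requests.post(API_URL, json=body, timeout=0.12)  # 120ms contract timeout
        except requests.exceptions.Timeout:
            code = "TIMEOUT"
        else:
            if resp.ok:
                return resp.json()
            code = resp.json().get("code", "UNKNOWN")
        if code not in RETRYABLE or attempt == max_attempts - 1:
            raise RuntimeError(f"fraud scoring failed: {code}")
        time.sleep(min(0.05 * 2 ** attempt, 1.0))  # exponential backoff, capped at 1s
    raise RuntimeError("unreachable")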
2) Text embeddings service
Purpose: Convert text to 768-dim embeddings for retrieval.
Request:
texts: array[string] (1..32)
normalize: boolean (default true)
Response:
vectors: array[array[number]] (shape: N x 768)
dim: 768
Constraints:
max chars per request: 32000
p95 latency: <= 300ms at batch=16; <= 550ms at batch=32
max batch: 32
memory ceiling: 2GB per pod
throughput: 150 rps (autoscale at CPU>70%)
Error behavior:
If any text exceeds char limit: INVALID_INPUT
Partial failure: allowed=false (fail whole batch)
Versioning:
Additive fields only in v1.x; breaking changes in v2.0 with a 90-day overlap
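The batch limits and the normalize flag translate directly into boundary checks. A sketch in Python, assuming numpy; the constants mirror the spec above.

import numpy as np

MAX_BATCH = 32
MAX_CHARS_PER_REQUEST = 32000

def validate_batch(texts: list[str]) -> None:
    """Enforce the contract: any violation fails the whole batch (allowed=false)."""
    if not 1 <= len(texts) <= MAX_BATCH:
        raise ValueError("INVALID_INPUT: texts must contain 1..32 items")
    if sum(len(t) for t in texts) > MAX_CHARS_PER_REQUEST:
        raise ValueError("INVALID_INPUT: request exceeds 32000 characters")

def maybe_normalize(vectors: np.ndarray, normalize: bool = True) -> np.ndarray:
    """Optionally L2-normalize the N x 768 output, per the 'normalize' flag."""
    if not normalize:
        return vectors
    norms = np.linalg.norm(vectors, axis=1, keepdims=True)
    return vectors / np.clip(norms, 1e-12, None)  # guard against zero vectors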
3) Online feature retrieval
Purpose: Fetch the latest user features for ranking.
Request:
user_id: string (required)
as_of_ts: int (epoch ms, optional)
Response:
features: object { "age_days": int, "avg_spend_7d": number, ... }
Constraints:
staleness <= 10 minutes (p95)
p95 latency <= 40ms
timeout 80ms; retries=1 with jitter
cache: TTL 120s; serve_stale_on_timeout=true
Errors:
NOT_FOUND (no retry), TIMEOUT (retry once)
Observability:
Metrics: staleness_ms_p95, cache_hit_rate, error_rate
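The cache rules (TTL 120s, serve_stale_on_timeout=true) fit in a few lines. A minimal Python sketch; fetch_fn stands in for your real feature-store client.

import time

class FeatureCache:
    """TTL cache that serves stale entries when the store times out (illustrative)."""
    def __init__(self, fetch_fn, ttl_s: float = 120.0):
        self.fetch_fn = fetch_fn  # callable: user_id -> features dict; may raise TimeoutError
        self.ttl_s = ttl_s
        self._store: dict[str, tuple[float, dict]] = {}

    def get(self, user_id: str) -> dict:
        now = time.monotonic()
        cached = self._store.get(user_id)
        if cached and now - cached[0] < self.ttl_s:
            return cached[1]                  # fresh hit within TTL
        try:
            features = self.fetch_fn(user_id)  # retry-with-jitter omitted for brevity
        except TimeoutError:
            if cached:
                return cached[1]              # serve_stale_on_timeout=true
            raise
        self._store[user_id] = (now, features)
        return features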
Hands-on exercises
Work through these, then compare your answers with the solutions below.
Exercise 1: Draft a ranking API contract
Design an interface for a recommendation ranking API.
- Input: candidate item IDs, user context, optional filters
- Output: ranked items with scores and reasons
- Batching allowed up to 20 users per request
- p95 latency 150ms at batch=1, 350ms at batch=20
- Rate limit 200 rps, burst 400 rps
- Clear error codes and idempotency
Deliverable: a compact JSON-like spec with request/response schemas, constraints, and errors.
Solution
{
  "purpose": "Rank candidate items for a user session",
  "version": "v1",
  "request": {
    "schema": {
      "fields": [
        {"name": "users", "type": "array[object]", "required": true, "max_len": 20},
        {"name": "candidates", "type": "array[string]", "required": true, "max_len": 500},
        {"name": "filters", "type": "object", "required": false},
        {"name": "idempotency_key", "type": "string", "required": true}
      ]
    }
  },
  "response": {
    "schema": {
      "fields": [
        {"name": "results", "type": "array[object]", "desc": "per user"},
        {"name": "results[].user_id", "type": "string"},
        {"name": "results[].ranked", "type": "array[object]"},
        {"name": "results[].ranked[].item_id", "type": "string"},
        {"name": "results[].ranked[].score", "type": "number", "range": [0, 1]},
        {"name": "results[].ranked[].reasons", "type": "array[string]", "max_len": 3},
        {"name": "model_version", "type": "string"}
      ]
    },
    "ordering": "ranked list is sorted by score desc"
  },
  "constraints": {
    "latency_ms_p95": {"batch1": 150, "batch20": 350},
    "throughput_rps": 200,
    "burst_rps": 400,
    "max_batch_users": 20,
    "cost_per_1k_requests": 0.80
  },
  "behavior": {"idempotency": true, "deterministic_within": "24h for same inputs"},
  "errors": [
    {"code": "INVALID_INPUT", "retryable": false},
    {"code": "TIMEOUT", "retryable": true},
    {"code": "RATE_LIMIT", "retryable": true}
  ],
  "observability": {"metrics": ["latency_p95", "rps", "error_rate"], "traces": true},
  "deprecation": {"policy_days": 90}
}
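Contract tests turn these limits into CI checks. A pytest-style sketch; call_ranking_api is a hypothetical test client that you would wire to your service or a mock.

# Contract-test sketch (pytest style) for the spec above.
from my_ranking_service.testing import call_ranking_api  # hypothetical import

def test_rejects_oversized_user_batch():
    users = [{"user_id": f"u{i}"} for i in range(21)]  # contract allows at most 20
    resp = call_ranking_api(users=users, candidates=["a"], idempotency_key="k1")
    assert resp["error"]["code"] == "INVALID_INPUT"

def test_scores_bounded_and_sorted():
    resp = call_ranking_api(users=[{"user_id": "u1"}],
                            candidates=["a", "b", "c"], idempotency_key="k2")
    for result in resp["results"]:
        scores = [r["score"] for r in result["ranked"]]
        assert all(0.0 <= s <= 1.0 for s in scores)    # range [0, 1]
        assert scores == sorted(scores, reverse=True)  # sorted by score desc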
Exercise 2: Budget your latency
You have a 200ms p95 latency SLO. Current components:
- Network: 40ms
- Feature store: 70ms
- Model inference: 80ms
- Post-processing: 40ms
Question: Are you within budget? If not, propose two realistic changes to meet the SLO and list the new target for each component.
Solution
Total = 40 + 70 + 80 + 40 = 230ms > 200ms. Not within budget.
- Option A: Cache hot features to cut feature store to 40ms; simplify post-processing to 20ms. New total: 40 + 40 + 80 + 20 = 180ms.
- Option B: Quantize model to reduce inference to 60ms; parallelize post-processing to 25ms. New total: 40 + 70 + 60 + 25 = 195ms.
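The budget check itself is one line of arithmetic; keeping it next to the spec lets you rerun it whenever a component changes. A tiny Python sketch:

SLO_MS = 200
components_ms = {"network": 40, "feature_store": 70, "inference": 80, "post_processing": 40}

total = sum(components_ms.values())
print(f"total={total}ms vs SLO={SLO_MS}ms -> within budget: {total <= SLO_MS}")
# prints: total=230ms vs SLO=200ms -> within budget: False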
Exercise checklist
- Request/response fields are typed and bounded.
- Error handling includes retry guidance.
- Latency, throughput, and batch size are explicit.
- Versioning and deprecation policy are defined.
- Observability metrics are listed.
- PII/Privacy decisions are stated.
Common mistakes and how to self-check
- Unbounded inputs: No limits on text length or batch size. Fix: add explicit max sizes.
- Ambiguous outputs: Scores without range or meaning. Fix: state range, calibration, and ordering.
- No retry semantics: Clients guess. Fix: mark each error as retryable or not with backoff hints.
- Ignoring p95: Only p50 measured. Fix: define and monitor p95 (or p99) and timeouts.
- Hidden PII: Raw user data sent to model. Fix: document PII policy and minimize payloads.
- Version churn: Breaking changes in-place. Fix: version field + deprecation window.
- Missing idempotency: Duplicate charges or updates. Fix: require idempotency key for mutating or billable calls.
Self-check mini audit
- Can another team implement a mock client/server using only your spec?
- Are all limits testable via contract tests?
- Do the component latency budgets sum to less than the end-to-end SLO, with margin to spare?
- Is there a rollback/fallback behavior when constraints are violated?
Learning path
- Study the interface template and adapt it to your product domain.
- Draft a minimal spec for one service (ranking or scoring).
- Review with backend, SRE, and product for constraints and costs.
- Add observability and error semantics; define p50/p95 and rate limits.
- Create contract tests for schema and limits; add to CI.
- Ship v1 behind a flag; deprecate prototypes with a clear timeline.
Practical projects
- Spam classifier API: Define schema, thresholds, and fail-safe when confidence is low (e.g., route to manual review).
- Embeddings batch job: Design batch file interface, chunk limits, and retry policy for partial failures.
- Feature service: Specify freshness, staleness metrics, and cache rules for real-time ranking.
Mini challenge
Pick any existing internal model. Write a one-page contract that includes purpose, request/response, constraints, and error semantics. Share it with a partner and ask them to implement a mock client. Iterate once based on their feedback.
Next steps
- Apply the template to your next model service.
- Automate schema validation and contract tests.
- Add dashboards for your SLOs and review them weekly.