Who this is for
Machine Learning Engineers and backend practitioners who ship models behind APIs and need predictable, low-latency feature preparation at inference time.
Prerequisites
- Comfort with Python or similar for data transforms
- Basic understanding of your model’s expected inputs
- Familiarity with REST/HTTP or gRPC APIs
- Knowledge of the training-time preprocessing steps used for your model
Why this matters
In production, raw requests are messy. You must convert them into model-ready features quickly and safely. Good inference-time preprocessing:
- Prevents training–serving skew and accuracy drops
- Meets latency SLOs and throughput goals
- Protects your service with validation and limits
- Enables observability and faster incident response
Real tasks you’ll do on the job
- Pinning exact vocabularies, scalers, and encoders used at training
- Validating request payloads and rejecting malformed data early
- Normalizing text, images, and tabular fields deterministically
- Fetching features from an online store with fallbacks
- Profiling and budgeting milliseconds across preprocessing steps
Concept explained simply
Inference-time preprocessing is the set of transformations that turn a live request into the exact feature tensor(s) your model expects. It must be deterministic (same input, same features), fast, and safe.
Mental model
Think of a funnel:
- Gate: validate and limit payloads
- Shape: clean and normalize to a strict schema
- Enrich: add derived fields or fetch features (with timeouts)
- Pack: produce model-ready tensors with pinned parameters
Everything is a contract: schema in, features out, with strict budgets.
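To make the funnel concrete, here is a minimal Python sketch. The field names, limits, and Features layout are illustrative assumptions, not a prescribed API; your gate, shape, enrich, and pack steps will mirror your own schema.

from dataclasses import dataclass

MAX_TEXT_CHARS = 5_000  # gate: illustrative payload limit

@dataclass
class Features:
    text: str
    length: int

def gate(payload: dict) -> str:
    # Validate and limit the raw payload; fail fast with a clear error.
    text = payload.get("text")
    if not isinstance(text, str):
        raise ValueError("field 'text' must be a string")
    if len(text) > MAX_TEXT_CHARS:
        raise ValueError(f"'text' exceeds {MAX_TEXT_CHARS} chars")
    return text

def shape(text: str) -> str:
    # Clean and normalize deterministically to a strict form.
    return " ".join(text.split()).lower()

def enrich(text: str) -> dict:
    # Derived fields; remote feature fetches would go here, behind timeouts.
    return {"length": len(text)}

def pack(text: str, extra: dict) -> Features:
    # Produce model-ready features with a fixed, documented layout.
    return Features(text=text, length=extra["length"])

def preprocess(payload: dict) -> Features:
    text = shape(gate(payload))
    return pack(text, enrich(text))

print(preprocess({"text": "  Amazing   PRODUCT!  "}))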
Core principles
- Determinism and parity: use the exact same parameters from training (means, stds, vocab, tokenizers, image sizes, category maps)
- Latency budget: define a P95/P99 budget for preprocessing and measure it (see the timing sketch after this list)
- Stateless handlers: avoid per-request global mutations; allow safe concurrency
- Validation and limits: schema checks, size limits, timeouts, and safe defaults
- Observability: log decision points, count fallbacks, tag out-of-distribution events
- Graceful degradation: if enrichment fails, fallback to a simpler path
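A latency budget is only real if you measure it. Below is a minimal sketch, using only the standard library, of recording per-request preprocessing latency and reporting P50/P95/P99; the preprocess stub stands in for your real path.

import time
from statistics import quantiles

latencies_ms: list = []

def preprocess(payload: dict) -> dict:
    # Stand-in for the real preprocessing path.
    return {"n_chars": len(payload.get("text", ""))}

def timed_preprocess(payload: dict) -> dict:
    start = time.perf_counter()
    try:
        return preprocess(payload)
    finally:
        latencies_ms.append((time.perf_counter() - start) * 1000.0)

for i in range(1000):
    timed_preprocess({"text": "x" * (i % 100)})

# quantiles(data, n=100) returns 99 cut points: index 49 is P50, 94 is P95, 98 is P99.
p = quantiles(latencies_ms, n=100)
print(f"P50={p[49]:.3f}ms  P95={p[94]:.3f}ms  P99={p[98]:.3f}ms")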
Typical workflow
- Load pinned artifacts at startup (vocab.json, scaler.pkl, image mean/std), as sketched after this list. Do not compute them on live traffic.
- Validate input (types, required fields, max sizes). Reject early with clear errors.
- Normalize (whitespace, casing, timezones, image channels, units).
- Derive features (hash buckets, day-of-week, length, bigram indicators).
- Optional feature fetch from online store with strict timeouts and fallbacks.
- Pack tensors in the exact order/shape expected by the model.
- Profile and monitor latencies and skew counters.
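A sketch of startup loading with checksum verification follows; the artifacts/ directory layout and manifest.json are assumptions for illustration. The point is that everything is loaded and verified once, before traffic arrives.

import hashlib
import json
import pickle
from pathlib import Path

ARTIFACT_DIR = Path("artifacts")  # hypothetical layout shipped with the model

def checksum(path: Path) -> str:
    return hashlib.sha256(path.read_bytes()).hexdigest()

# Load once at process start; never compute stats or touch disk per request.
VOCAB = json.loads((ARTIFACT_DIR / "vocab.json").read_text())
with open(ARTIFACT_DIR / "scaler.pkl", "rb") as f:
    SCALER = pickle.load(f)

# Verify against checksums recorded at training time (assumed manifest.json).
MANIFEST = json.loads((ARTIFACT_DIR / "manifest.json").read_text())
for name in ("vocab.json", "scaler.pkl"):
    actual = checksum(ARTIFACT_DIR / name)
    if actual != MANIFEST[name]:
        raise RuntimeError(f"artifact {name} checksum mismatch: {actual}")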
Worked examples
Example 1: Text sentiment API
Goal: Convert user text into token IDs for a model that expects lowercase, max length 128, and a fixed vocabulary with an UNK token.
Steps
- Validate: required field text (string), max length 5k chars
- Normalize: Unicode NFC, strip HTML, lowercase
- Tokenize: apply the saved WordPiece/BPE tokenizer and vocabulary
- Handle OOV: map unknown tokens to UNK_ID
- Pad/truncate: to length 128; build attention_mask
- Pack: inputs {input_ids, attention_mask}
Before/after example
Request:
{"text": "Amazing product!!! LOVE it."}
Features:
{
  "input_ids": [101, 5872, 4031, 999, 999, 999, 102, ...],
  "attention_mask": [1, 1, 1, 1, 1, 1, 1, ...]
}
Why this works
Deterministic transforms and pinned vocab eliminate skew. Length limits protect latency.
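A compact sketch of Example 1 follows. The tiny inline vocabulary and regex split are simplified stand-ins for the pinned training tokenizer; only the structure (normalize, map OOV to UNK, truncate/pad, build a mask) is the point.

import html
import re
import unicodedata

MAX_LEN = 128
# Tiny illustrative vocab; in practice, load the pinned training vocab.
VOCAB = {"[PAD]": 0, "[UNK]": 1, "[CLS]": 101, "[SEP]": 102,
         "amazing": 5872, "product": 4031, "!": 999}
UNK_ID = VOCAB["[UNK]"]
TAG_RE = re.compile(r"<[^>]+>")

def preprocess_text(text: str) -> dict:
    # Normalize: NFC, strip HTML, lowercase, matching the training spec.
    text = unicodedata.normalize("NFC", html.unescape(text))
    text = TAG_RE.sub(" ", text).lower()
    # Simple split standing in for the saved WordPiece/BPE tokenizer.
    tokens = re.findall(r"\w+|[^\w\s]", text)
    ids = [VOCAB["[CLS]"]] + [VOCAB.get(t, UNK_ID) for t in tokens] + [VOCAB["[SEP]"]]
    ids = ids[:MAX_LEN]  # truncate
    mask = [1] * len(ids)
    pad = MAX_LEN - len(ids)  # pad to fixed length
    return {"input_ids": ids + [VOCAB["[PAD]"]] * pad,
            "attention_mask": mask + [0] * pad}

features = preprocess_text("Amazing product!!! LOVE it.")
print(features["input_ids"][:10], features["attention_mask"][:10])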
Example 2: Tabular fraud score
Goal: Standardize numeric fields and one-hot encode categories using training-time stats.
Pinned artifacts
means = {"amount": 53.2, "age_days": 430.0}
stds = {"amount": 20.5, "age_days": 120.0}
country_map = {"US":0, "CA":1, "GB":2, "OTHER":3}
Steps
- Validate: amount in [0, 10000], user_id present, timestamp present
- Normalize: convert timestamp to UTC, compute hour_of_day
- Standardize: (x - mean) / std using pinned values
- Encode: map country to index; unknown to OTHER
- Optional: fetch user_risk from online store (timeout 20ms, fallback 0.0)
- Pack: fixed feature order [z_amount, z_age_days, country_idx, hour_of_day, user_risk]
Input/Output sketch
Input: {"amount": 80, "age_days": 365, "country": "DE", "timestamp": "2025-04-05T10:22:00+02:00"}
Output: [1.31, -0.54, 3, 8, 0.0]
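The full path for Example 2 might look like the sketch below. fetch_user_risk is a hypothetical stand-in for the online-store client (here it simulates a timeout to exercise the fallback); the pinned values are the ones listed above, and the output matches the sketch.

from datetime import datetime, timezone

# Pinned at training time; load from versioned artifacts in a real service.
MEANS = {"amount": 53.2, "age_days": 430.0}
STDS = {"amount": 20.5, "age_days": 120.0}
COUNTRY_MAP = {"US": 0, "CA": 1, "GB": 2, "OTHER": 3}

def fetch_user_risk(user_id: str) -> float:
    # Hypothetical online-store call; the real client sits behind a 20ms timeout.
    raise TimeoutError  # simulate a timeout to exercise the fallback

def preprocess_row(req: dict) -> list:
    amount, age_days = float(req["amount"]), float(req["age_days"])
    if not 0 <= amount <= 10_000:
        raise ValueError("amount out of range [0, 10000]")
    z_amount = (amount - MEANS["amount"]) / STDS["amount"]
    z_age = (age_days - MEANS["age_days"]) / STDS["age_days"]
    country_idx = COUNTRY_MAP.get(req.get("country"), COUNTRY_MAP["OTHER"])
    ts = datetime.fromisoformat(req["timestamp"]).astimezone(timezone.utc)
    try:
        user_risk = fetch_user_risk(req.get("user_id", ""))
    except TimeoutError:
        user_risk = 0.0  # documented fallback; count it in metrics
    # Fixed feature order; a golden test should pin this layout.
    return [round(z_amount, 2), round(z_age, 2), country_idx, ts.hour, user_risk]

print(preprocess_row({"amount": 80, "age_days": 365, "country": "DE",
                      "timestamp": "2025-04-05T10:22:00+02:00"}))
# [1.31, -0.54, 3, 8, 0.0]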
Example 3: Image classifier
Goal: Produce a 3x224x224 tensor normalized by channel with pinned mean/std.
Steps
- Validate: image size < 5 MB, supported type, max dimensions 2048x2048
- Decode and convert to RGB
- Resize shortest side to 256, center-crop 224x224
- Reorder to CHW, normalize with pinned mean/std
- Pack: float32 tensor ready for model
Tips
- Reject gigantic images early to protect latency
- Batch requests cautiously; ensure P95 stays within SLO
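A sketch of Example 3, assuming Pillow is installed. The mean/std shown are the common ImageNet values as placeholders; a real service would use its own pinned stats and return an actual tensor (e.g., numpy or torch) rather than nested lists.

import io
from PIL import Image

MAX_BYTES = 5 * 1024 * 1024
MAX_DIM = 2048
# Placeholder channel stats (ImageNet); use the values pinned at training.
MEAN = (0.485, 0.456, 0.406)
STD = (0.229, 0.224, 0.225)

def preprocess_image(data: bytes) -> list:
    if len(data) > MAX_BYTES:
        raise ValueError("image exceeds 5 MB")
    img = Image.open(io.BytesIO(data))  # lazy open: size known before decode
    if max(img.size) > MAX_DIM:
        raise ValueError("image exceeds 2048x2048")
    img = img.convert("RGB")
    # Resize shortest side to 256, then center-crop 224x224.
    w, h = img.size
    scale = 256 / min(w, h)
    img = img.resize((round(w * scale), round(h * scale)), Image.BILINEAR)
    w, h = img.size
    left, top = (w - 224) // 2, (h - 224) // 2
    img = img.crop((left, top, left + 224, top + 224))
    # HWC uint8 -> CHW float, normalized per channel.
    pixels = list(img.getdata())  # [(r, g, b), ...], length 224*224
    return [[(p[c] / 255.0 - MEAN[c]) / STD[c] for p in pixels]
            for c in range(3)]  # 3 x 50176; reshape to 3x224x224 downstream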
Implementation patterns
- Ship transforms with the model artifact (same versioning)
- Load artifacts at process start; avoid disk I/O per request
- Use strict schema checks and descriptive error messages
- Apply timeouts around remote calls; keep a no-remote fallback path (see the sketch after this list)
- Cache small lookups (e.g., vocab maps) in-memory; avoid global mutation
- Record feature stats counters (nulls replaced, OOV rate, truncation rate)
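One way to sketch the timeout-plus-fallback pattern, using a thread pool from the standard library. fetch_online_feature is a hypothetical stand-in for your feature-store client, and the counter stands in for a real metric.

from concurrent.futures import ThreadPoolExecutor
import time

pool = ThreadPoolExecutor(max_workers=8)
fallback_count = 0  # wire to a real metrics counter in production

def fetch_online_feature(key: str) -> float:
    # Hypothetical feature-store call; sleeps past the budget on purpose.
    time.sleep(0.05)
    return 0.42

def feature_with_fallback(key: str, timeout_s: float = 0.02,
                          default: float = 0.0) -> float:
    # Bound the remote call; on timeout or error, take the safe default.
    global fallback_count
    future = pool.submit(fetch_online_feature, key)
    try:
        return future.result(timeout=timeout_s)
    except Exception:
        future.cancel()
        fallback_count += 1
        return default

print(feature_with_fallback("user:123"), "fallbacks:", fallback_count)  # 0.0 fallbacks: 1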
Common mistakes and how to self-check
- Recomputing stats on live traffic → Self-check: confirm means/stds are loaded from artifact
- Forgetting timezone normalization → Self-check: assert all timestamps are UTC before deriving features
- Changing tokenization between training and serving → Self-check: verify vocab checksum matches training version
- Unlimited payload sizes → Self-check: add explicit size/length guards and tests
- Silent feature reordering → Self-check: unit test feature order against a golden sample (see the test sketch after this list)
- Unbounded remote fetches → Self-check: enforce timeouts and track fallback rates
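A golden-sample parity test might look like the sketch below. The inline golden record reuses Example 2's input and output; in practice it would be exported from the training pipeline and loaded from a file.

import unittest

def preprocess_row(req: dict) -> list:
    # Import your real preprocessing entry point; this stub echoes Example 2.
    return [1.31, -0.54, 3, 8, 0.0]

class GoldenParityTest(unittest.TestCase):
    def test_matches_training_output(self):
        golden = {
            "request": {"amount": 80, "age_days": 365, "country": "DE",
                        "timestamp": "2025-04-05T10:22:00+02:00"},
            "features": [1.31, -0.54, 3, 8, 0.0],
        }
        served = preprocess_row(golden["request"])
        self.assertEqual(len(golden["features"]), len(served))
        for expected, actual in zip(golden["features"], served):
            self.assertAlmostEqual(expected, actual, places=4)

if __name__ == "__main__":
    unittest.main()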
Before shipping checklist
- Artifacts pinned and versioned with the model
- Schema validation with clear errors
- Latency budget documented and profiled at P50/P95/P99
- Guards: size limits, timeouts, retries capped
- Fallback logic defined and measured
- Golden test cases for feature parity
Exercises
Do these, then compare with the solutions below.
Exercise 1: Design a safe text preprocessing path
Given a news sentiment model trained on lowercased, punctuation-stripped English text with max length 256 and a fixed vocab (with UNK), outline the inference-time steps, validations, and fallbacks. Include how you would ensure training–serving parity.
Exercise 2: Latency budgeting
Your API SLO is 150ms P95. Known components at P95: auth 5ms, network 15ms, online feature fetch 40ms, model inference 65ms, postprocessing 10ms. How many milliseconds remain for preprocessing? If your current preprocessing is 28ms P95, list two pragmatic steps to meet the SLO.
Show solutions
Exercise 1 — Possible solution
- Validate: text present (string), length ≤ 10k chars; reject otherwise
- Normalize: Unicode NFC, strip HTML, lowercase, remove punctuation per training spec
- Tokenize with the pinned training vocab; OOV → UNK_ID
- Truncate/pad to 256, build attention_mask
- Parity: load tokenizer/vocab artifact by exact version checksum; include a golden sample test comparing training notebook output vs service output
- Safety: length and time limits; log OOV and truncation rates
Exercise 2 — Calculation and actions
Sum the known components: 5 + 15 + 40 + 65 + 10 = 135ms. Remaining budget = 150 − 135 = 15ms for preprocessing. Current preprocessing is 28ms P95, so you are over by 13ms. Steps: (1) remove non-essential transforms or defer them; (2) warm caches and use vectorized ops; also consider narrowing max input length or batching tokenization.
Mini challenge
Your tabular model expects standardized numeric features and a 10k-category hash bucket for product_id. Live traffic includes new product_ids daily. Outline how you will handle unseen ids, enforce limits, and keep latency under 20ms for preprocessing. Keep it to 4–6 bullet points and focus on determinism and safety.
Learning path
- Map training transforms → write their serving equivalents
- Create schema and validators → add unit tests with golden cases
- Pin artifacts → load at startup and add checksums
- Add latency timers → set budgets and alerting
- Hardening → limits, timeouts, fallbacks
- Canary release → compare feature distributions and error rates
Practical projects
- Build a small REST service that preprocesses text into model tensors with pinned vocab, including OOV handling and truncation
- Implement a tabular pipeline that standardizes features and falls back when the online feature fetch times out; log fallback metrics
- Create an image preprocessor with strict size/type validation and CHW normalization; run a load test to verify P95 latency
Next steps
- Add observability: counters for null replacements, OOVs, truncations, and fetch timeouts
- Automate parity tests in CI with golden samples from training
- Document the schema contract and latency budget for your endpoint