
Preprocessing At Inference Time

Learn Preprocessing At Inference Time for free with explanations and exercises (for Machine Learning Engineers).

Published: January 1, 2026 | Updated: January 1, 2026

Who this is for

Machine Learning Engineers and backend practitioners who ship models behind APIs and need predictable, low-latency feature preparation at inference time.

Prerequisites

  • Comfort with Python or similar for data transforms
  • Basic understanding of your model’s expected inputs
  • Familiarity with REST/HTTP or gRPC APIs
  • Knowing the training-time preprocessing steps used for your model

Why this matters

In production, raw requests are messy. You must convert them into model-ready features quickly and safely. Good inference-time preprocessing:

  • Prevents training–serving skew and accuracy drops
  • Meets latency SLOs and throughput goals
  • Protects your service with validation and limits
  • Enables observability and faster incident response

Real tasks you’ll do on the job
  • Pinning exact vocabularies, scalers, and encoders used at training
  • Validating request payloads and rejecting malformed data early
  • Normalizing text, images, and tabular fields deterministically
  • Fetching features from an online store with fallbacks
  • Profiling and budgeting milliseconds across preprocessing steps

Concept explained simply

Inference-time preprocessing is the set of transformations that turn a live request into the exact feature tensor(s) your model expects. It must be deterministic (same input, same features), fast, and safe.

Mental model

Think of a funnel:

  • Gate: validate and limit payloads
  • Shape: clean and normalize to a strict schema
  • Enrich: add derived fields or fetch features (with timeouts)
  • Pack: produce model-ready tensors with pinned parameters

Everything is a contract: schema in, features out, with strict budgets.
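
A minimal sketch of the funnel in Python (the function names, fields, and limits are illustrative, not from any specific framework):

def gate(payload: dict) -> dict:
    # Gate: validate types and enforce payload limits
    text = payload.get("text")
    if not isinstance(text, str):
        raise ValueError("field 'text' (string) is required")
    if len(text) > 5000:
        raise ValueError("'text' exceeds 5000 characters")
    return payload

def shape(payload: dict) -> dict:
    # Shape: clean and normalize to a strict schema
    return {"text": " ".join(payload["text"].split()).lower()}

def enrich(features: dict) -> dict:
    # Enrich: derived fields; remote fetches would carry timeouts here
    return {**features, "char_len": len(features["text"])}

def pack(features: dict) -> dict:
    # Pack: model-ready inputs in a fixed order, pinned parameters applied
    return {"inputs": [features["text"]], "char_len": [features["char_len"]]}

def preprocess(payload: dict) -> dict:
    return pack(enrich(shape(gate(payload))))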

Core principles

  • Determinism and parity: use the exact same parameters from training (means, stds, vocab, tokenizers, image sizes, category maps)
  • Latency budget: define a P95/P99 budget for preprocessing and measure it
  • Stateless handlers: avoid per-request global mutations; allow safe concurrency
  • Validation and limits: schema checks, size limits, timeouts, and safe defaults
  • Observability: log decision points, count fallbacks, tag out-of-distribution events
  • Graceful degradation: if enrichment fails, fallback to a simpler path

Typical workflow

  1. Load pinned artifacts at startup (vocab.json, scaler.pkl, image mean/std). Do not compute them on live traffic (see the sketch after this list).
  2. Validate input (types, required fields, max sizes). Reject early with clear errors.
  3. Normalize (whitespace, casing, timezones, image channels, units).
  4. Derive features (hash buckets, day-of-week, length, bigram indicators).
  5. Optional feature fetch from online store with strict timeouts and fallbacks.
  6. Pack tensors in the exact order/shape expected by the model.
  7. Profile and monitor latencies and skew counters.
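
A sketch of steps 1 and 2, assuming the artifact names from step 1 (load once at process start, validate per request):

import json
import pickle

# Step 1: load pinned artifacts once at startup, never per request
with open("vocab.json") as f:
    VOCAB = json.load(f)            # frozen token -> id map from training
with open("scaler.pkl", "rb") as f:
    SCALER = pickle.load(f)         # fitted scaler with training means/stds

def validate(payload: dict) -> dict:
    # Step 2: reject malformed input early with a clear error
    if not isinstance(payload.get("text"), str):
        raise ValueError("'text' must be a string")
    if len(payload["text"]) > 5000:
        raise ValueError("'text' exceeds 5000 characters")
    return payload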

Worked examples

Example 1: Text sentiment API

Goal: Convert user text into token IDs for a model that expects lowercase, max length 128, and a fixed vocabulary with an UNK token.

Steps
  1. Validate: required field text (string), max length 5k chars
  2. Normalize: Unicode NFC, strip HTML, lowercase
  3. Tokenize: apply WordPiece/BPE with the saved vocab
  4. Handle OOV: map unknown tokens to UNK_ID
  5. Pad/truncate: to length 128; build attention_mask
  6. Pack: inputs {input_ids, attention_mask}
Before/after example
Request:
{"text": "Amazing product!!! LOVE it."}

Features:
{
  "input_ids": [101, 5872, 4031, 999, 999, 999, 102, ...],
  "attention_mask": [1,1,1,1,1,1,1,...]
}
Why this works

Deterministic transforms and pinned vocab eliminate skew. Length limits protect latency.
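
The six steps above as a runnable sketch. A whitespace-and-punctuation tokenizer stands in for WordPiece/BPE, and the tiny vocab is illustrative; a real service would load the pinned tokenizer artifact instead:

import html
import re
import unicodedata

VOCAB = {"[PAD]": 0, "[UNK]": 100, "[CLS]": 101, "[SEP]": 102,
         "amazing": 5872, "product": 4031, "!": 999}    # illustrative subset
PAD_ID, UNK_ID, MAX_LEN = 0, 100, 128

def preprocess_text(text: str) -> dict:
    if not isinstance(text, str) or len(text) > 5000:    # validate
        raise ValueError("text must be a string of at most 5000 chars")
    text = re.sub(r"<[^>]+>", " ", html.unescape(text))  # strip HTML
    text = unicodedata.normalize("NFC", text).lower()    # normalize
    tokens = re.findall(r"\w+|[^\w\s]", text)            # toy tokenizer
    ids = [VOCAB.get(t, UNK_ID) for t in tokens][:MAX_LEN - 2]  # OOV -> UNK
    ids = [VOCAB["[CLS]"]] + ids + [VOCAB["[SEP]"]]      # truncate to 128
    mask = [1] * len(ids)
    pad = MAX_LEN - len(ids)
    return {"input_ids": ids + [PAD_ID] * pad,           # pack
            "attention_mask": mask + [0] * pad}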

Example 2: Tabular fraud score

Goal: Standardize numeric fields and one-hot encode categories using training-time stats.

Pinned artifacts
means = {"amount": 53.2, "age_days": 430.0}
stds  = {"amount": 20.5, "age_days": 120.0}
country_map = {"US":0, "CA":1, "GB":2, "OTHER":3}
Steps
  1. Validate: amount in [0, 10000], user_id present, timestamp present
  2. Normalize: convert timestamp to UTC, compute hour_of_day
  3. Standardize: (x - mean) / std using pinned values
  4. Encode: map country to index; unknown to OTHER
  5. Optional: fetch user_risk from online store (timeout 20ms, fallback 0.0)
  6. Pack: fixed feature order [z_amount, z_age_days, country_idx, hour_of_day, user_risk]
Input/Output sketch
Input: {"amount": 80, "age_days": 365, "country": "DE", "timestamp": "2025-04-05T10:22:00+02:00"}
Output: [1.31, -0.54, 3, 8, 0.0]
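
A sketch of this path using the pinned artifacts above; the feature-store call is stubbed, and a real client should receive the 20ms timeout:

from datetime import datetime, timezone

MEANS = {"amount": 53.2, "age_days": 430.0}
STDS = {"amount": 20.5, "age_days": 120.0}
COUNTRY_MAP = {"US": 0, "CA": 1, "GB": 2, "OTHER": 3}

def fetch_user_risk(user_id: str) -> float:
    # Stub for the online store; on timeout or error, return the fallback
    return 0.0

def preprocess_tabular(req: dict) -> list:
    amount = float(req["amount"])                        # validate
    if not 0 <= amount <= 10000:
        raise ValueError("amount out of range [0, 10000]")
    ts = datetime.fromisoformat(req["timestamp"]).astimezone(timezone.utc)
    z_amount = (amount - MEANS["amount"]) / STDS["amount"]   # standardize
    z_age = (float(req["age_days"]) - MEANS["age_days"]) / STDS["age_days"]
    country_idx = COUNTRY_MAP.get(req.get("country"), COUNTRY_MAP["OTHER"])
    # Pack in the fixed order the model expects
    return [round(z_amount, 2), round(z_age, 2), country_idx,
            ts.hour, fetch_user_risk(req.get("user_id", ""))]

Running it on the input above yields [1.31, -0.54, 3, 8, 0.0]; the rounding is only to match the sketch, so keep full precision in production.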

Example 3: Image classifier

Goal: Produce a 3x224x224 tensor normalized by channel with pinned mean/std.

Steps
  1. Validate: image size < 5 MB, supported type, max dimensions 2048x2048
  2. Decode and convert to RGB
  3. Resize shortest side to 256, center-crop 224x224
  4. Reorder to CHW, normalize with pinned mean/std
  5. Pack: float32 tensor ready for model
Tips
  • Reject gigantic images early to protect latency
  • Batch requests cautiously; ensure P95 stays within SLO
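
The steps above as a sketch with Pillow and NumPy; the mean/std values are the common ImageNet ones and stand in for whatever your training pipeline actually pinned:

import io

import numpy as np
from PIL import Image

MEAN = np.array([0.485, 0.456, 0.406], dtype=np.float32)  # assumed pinned values
STD = np.array([0.229, 0.224, 0.225], dtype=np.float32)

def preprocess_image(data: bytes) -> np.ndarray:
    if len(data) > 5 * 1024 * 1024:                       # validate size
        raise ValueError("image exceeds 5 MB")
    img = Image.open(io.BytesIO(data)).convert("RGB")     # decode to RGB
    if max(img.size) > 2048:
        raise ValueError("image dimensions exceed 2048x2048")
    w, h = img.size                                       # shortest side -> 256
    scale = 256 / min(w, h)
    img = img.resize((round(w * scale), round(h * scale)), Image.BILINEAR)
    w, h = img.size                                       # center-crop 224x224
    left, top = (w - 224) // 2, (h - 224) // 2
    img = img.crop((left, top, left + 224, top + 224))
    arr = np.asarray(img, dtype=np.float32) / 255.0       # HWC in [0, 1]
    arr = (arr - MEAN) / STD                              # per-channel normalize
    return arr.transpose(2, 0, 1)                         # HWC -> CHW float32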

Implementation patterns

  • Ship transforms with the model artifact (same versioning)
  • Load artifacts at process start; avoid disk I/O per request
  • Use strict schema checks and descriptive error messages
  • Apply timeouts around remote calls; keep a no-remote fallback path (see the sketch below)
  • Cache small lookups (e.g., vocab maps) in-memory; avoid global mutation
  • Record feature stats counters (nulls replaced, OOV rate, truncation rate)
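
One way to sketch the timeout-plus-fallback pattern with only the standard library (most feature-store clients accept a timeout directly, which is preferable):

import time
from concurrent.futures import ThreadPoolExecutor, TimeoutError as FutureTimeout

_EXECUTOR = ThreadPoolExecutor(max_workers=8)
FALLBACK_COUNT = 0  # export to your metrics system in practice

def slow_store_lookup(user_id: str) -> float:
    time.sleep(0.05)                 # placeholder for a network call
    return 0.7

def fetch_with_fallback(user_id: str, timeout_s: float = 0.020,
                        default: float = 0.0) -> float:
    global FALLBACK_COUNT
    future = _EXECUTOR.submit(slow_store_lookup, user_id)
    try:
        return future.result(timeout=timeout_s)
    except FutureTimeout:
        future.cancel()              # best effort; a running call still finishes
        FALLBACK_COUNT += 1          # count fallbacks for observability
        return default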

Common mistakes and how to self-check

  • Recomputing stats on live traffic → Self-check: confirm means/stds are loaded from artifact
  • Forgetting timezone normalization → Self-check: assert all timestamps are UTC before deriving features
  • Changing tokenization between training and serving → Self-check: verify vocab checksum matches training version
  • Unlimited payload sizes → Self-check: add explicit size/length guards and tests
  • Silent feature reordering → Self-check: unit test feature order against a golden sample
  • Unbounded remote fetches → Self-check: enforce timeouts and track fallback rates
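
Two of these self-checks sketched in code: an artifact checksum gate at startup and a golden-sample test of feature order, reusing the preprocess_tabular sketch from Example 2 (the checksum value is a placeholder):

import hashlib

EXPECTED_VOCAB_SHA256 = "..."  # recorded at training time; placeholder here

def verify_artifact(path: str, expected_sha256: str) -> None:
    # Fail fast at startup if serving artifacts drift from training
    digest = hashlib.sha256(open(path, "rb").read()).hexdigest()
    if digest != expected_sha256:
        raise RuntimeError(f"{path} checksum mismatch: serving != training")

def test_golden_sample():
    # Golden input/output pair captured from the training pipeline
    golden_in = {"amount": 80, "age_days": 365, "country": "DE",
                 "timestamp": "2025-04-05T10:22:00+02:00"}
    assert preprocess_tabular(golden_in) == [1.31, -0.54, 3, 8, 0.0]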

Before shipping checklist

  • Artifacts pinned and versioned with the model
  • Schema validation with clear errors
  • Latency budget documented and profiled at P50/P95/P99
  • Guards: size limits, timeouts, retries capped
  • Fallback logic defined and measured
  • Golden test cases for feature parity

Exercises

Do these, then compare with the solutions below.

Exercise 1: Design a safe text preprocessing path

Given a news sentiment model trained on lowercased, punctuation-stripped English text with max length 256 and a fixed vocab (with UNK), outline the inference-time steps, validations, and fallbacks. Include how you would ensure training–serving parity.

Exercise 2: Latency budgeting

Your API SLO is 150ms P95. Known components at P95: auth 5ms, network 15ms, online feature fetch 40ms, model inference 65ms, postprocessing 10ms. How many milliseconds remain for preprocessing? If your current preprocessing is 28ms P95, list two pragmatic steps to meet the SLO.

Solutions

Exercise 1 — Possible solution

  1. Validate: text present (string), length ≤ 10k chars; reject otherwise
  2. Normalize: Unicode NFC, strip HTML, lowercase, remove punctuation per training spec
  3. Tokenize with the pinned training vocab; OOV → UNK_ID
  4. Truncate/pad to 256, build attention_mask
  5. Parity: load tokenizer/vocab artifact by exact version checksum; include a golden sample test comparing training notebook output vs service output
  6. Safety: length and time limits; log OOV and truncation rates

Exercise 2 — Calculation and actions

Sum the known components: 5 + 15 + 40 + 65 + 10 = 135ms. Remaining budget = 150 − 135 = 15ms for preprocessing. Current preprocessing is 28ms, so it is 13ms over budget. Pragmatic steps: (1) remove non-essential transforms or defer them; (2) warm caches and use vectorized ops; also consider capping max input length or batching tokenization.
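
To find which of the 28ms to cut, a per-stage timer is a cheap first step (a sketch; in production, export these durations to latency histograms):

import time
from contextlib import contextmanager

TIMINGS_MS = {}

@contextmanager
def stage(name: str):
    start = time.perf_counter()
    try:
        yield
    finally:
        TIMINGS_MS[name] = (time.perf_counter() - start) * 1000

# Usage in the handler, then log or export TIMINGS_MS:
# with stage("normalize"): text = normalize(text)
# with stage("tokenize"):  ids = tokenize(text)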

Mini challenge

Your tabular model expects standardized numeric features and a 10k-category hash bucket for product_id. Live traffic includes new product_ids daily. Outline how you will handle unseen ids, enforce limits, and keep latency under 20ms for preprocessing. Keep it to 4–6 bullet points and focus on determinism and safety.
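
Hint for the hash-bucket piece: Python's built-in hash() is salted per process, so a stable digest is what keeps buckets deterministic across replicas and deploys (a sketch, assuming the 10k bucket count from the challenge):

import hashlib

NUM_BUCKETS = 10_000

def product_bucket(product_id: str) -> int:
    # Stable across processes and machines, unlike the salted built-in hash()
    digest = hashlib.md5(product_id.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % NUM_BUCKETS

# Any unseen product_id maps deterministically into one of the 10k buckets
assert product_bucket("sku-123") == product_bucket("sku-123")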

Learning path

  1. Map training transforms → write their serving equivalents
  2. Create schema and validators → add unit tests with golden cases
  3. Pin artifacts → load at startup and add checksums
  4. Add latency timers → set budgets and alerting
  5. Hardening → limits, timeouts, fallbacks
  6. Canary release → compare feature distributions and error rates

Practical projects

  • Build a small REST service that preprocesses text into model tensors with pinned vocab, including OOV handling and truncation
  • Implement a tabular pipeline that standardizes features and falls back when the online feature fetch times out; log fallback metrics
  • Create an image preprocessor with strict size/type validation and CHW normalization; run a load test to verify P95 latency

Next steps

  • Add observability: counters for null replacements, OOVs, truncations, and fetch timeouts (see the sketch below)
  • Automate parity tests in CI with golden samples from training
  • Document the schema contract and latency budget for your endpoint
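
If you use prometheus_client, the counters from the first step might look like this (metric and label names are illustrative):

from prometheus_client import Counter

PREPROC_EVENTS = Counter(
    "preproc_events_total",
    "Preprocessing skew and fallback events",
    ["kind"],  # e.g. null_replaced, oov, truncated, fetch_timeout
)

# At the relevant points in the preprocessing path:
# PREPROC_EVENTS.labels(kind="oov").inc()
# PREPROC_EVENTS.labels(kind="fetch_timeout").inc()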
