Why this matters
Prevention: contract tests on schemas, canary validation, and alerts on null-rate spikes.
Exercise 2 solution
Bottleneck: feature_ms (180–205 ms vs others < 12 ms). Short-term: add read-through cache, increase timeouts/connection pool, batch/parallel fetch. Long-term: optimize feature store queries and index hot keys. Alert: P95 feature_ms >= 120 ms for 5 consecutive minutes or feature_ms/total_ms ratio > 0.6.