Two bugs caused the elastic policy to concentrate load on instances with
warm caches (10.3x token imbalance vs 2.7x baseline):
1. The queue term in _instance_cost counted only pending_prefill_tokens
and ignored ongoing_decode_tokens, so an instance with 50 decoding
requests appeared idle to the cost model.
2. Cache hits made overloaded instances look "cheap", creating a positive
feedback loop: more sessions → more cache → lower cost → more routing.
Added a hard gate (ongoing_tokens > avg * overload_factor) that breaks
affinity before the cost model runs, matching linear policy behavior.
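Illustrative sketch of the gate, not the code in this commit. InstanceLoad,
is_overloaded, pick_instance, and the 1.5 default are hypothetical names and
values; it also assumes ongoing_tokens is prefill plus decode tokens, and it
uses a least-loaded fallback as a stand-in for the real cost model:

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class InstanceLoad:
    # Hypothetical per-instance counters, named after the fields above.
    pending_prefill_tokens: int
    ongoing_decode_tokens: int

    @property
    def ongoing_tokens(self) -> int:
        # Assumption: total in-flight work is prefill plus decode tokens.
        return self.pending_prefill_tokens + self.ongoing_decode_tokens


def is_overloaded(inst: InstanceLoad, fleet: List[InstanceLoad],
                  overload_factor: float = 1.5) -> bool:
    # Hard gate: ongoing_tokens > avg * overload_factor.
    avg = sum(i.ongoing_tokens for i in fleet) / len(fleet)
    return inst.ongoing_tokens > avg * overload_factor


def pick_instance(fleet: List[InstanceLoad], affinity_idx: Optional[int],
                  overload_factor: float = 1.5) -> int:
    # Keep session affinity only while the cached instance passes the gate.
    if affinity_idx is not None and not is_overloaded(
            fleet[affinity_idx], fleet, overload_factor):
        return affinity_idx
    # Stand-in for the cost model: route to the least-loaded instance.
    return min(range(len(fleet)), key=lambda i: fleet[i].ongoing_tokens)
```

Because the gate is checked before any cache discount is applied, a cache
hit cannot make an overloaded instance look cheap in this sketch.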
Result: token imbalance 10.3x → 2.6x, TTFT p90 -37% vs baseline.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>