agentic-kvc/scripts
Gahow Wang 4b50c5a08d
Fix unified cost model: include decode load in queue + hard overload gate

Two bugs caused the elastic policy to concentrate load on cached instances
(10x token imbalance vs 2.7x for the baseline):

1. The _instance_cost queue term only counted pending_prefill_tokens and
   missed ongoing_decode_tokens entirely, so instances with 50 decoding
   requests appeared idle to the cost model.

2. Cache hits made overloaded instances look "cheap", creating a positive
   feedback loop: more sessions → more cache → lower cost → more routing.
   Added a hard gate (ongoing_tokens > avg * overload_factor) that breaks
   affinity before the cost model runs, matching linear policy behavior.
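
The two fixes can be illustrated with a minimal sketch. This is not the code
from this commit: the Instance dataclass, the cached_tokens field, the cost
formula (queued tokens plus uncached prompt tokens), and the default
overload_factor of 1.5 are all assumptions made for illustration. Only the
names pending_prefill_tokens, ongoing_decode_tokens, and overload_factor come
from the message above.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Instance:
    name: str
    pending_prefill_tokens: int
    ongoing_decode_tokens: int
    cached_tokens: int = 0  # hypothetical: prompt tokens already cached here


def ongoing_tokens(inst: Instance) -> int:
    # Fix 1: the queue term counts decode load as well as pending prefill.
    return inst.pending_prefill_tokens + inst.ongoing_decode_tokens


def instance_cost(inst: Instance, prompt_tokens: int) -> int:
    # Hypothetical cost: queued work plus the prompt tokens that miss the cache.
    uncached = max(prompt_tokens - inst.cached_tokens, 0)
    return ongoing_tokens(inst) + uncached


def is_overloaded(inst: Instance, instances: List[Instance],
                  overload_factor: float = 1.5) -> bool:
    # Fix 2: hard gate, ongoing_tokens > avg * overload_factor.
    avg = sum(ongoing_tokens(i) for i in instances) / len(instances)
    return ongoing_tokens(inst) > avg * overload_factor


def route(instances: List[Instance], prompt_tokens: int,
          overload_factor: float = 1.5) -> Instance:
    # The gate runs before the cost model, so a cache hit cannot keep
    # pulling traffic onto an instance that is already overloaded.
    candidates = [i for i in instances
                  if not is_overloaded(i, instances, overload_factor)]
    if not candidates:  # defensive fallback; keep every instance eligible
        candidates = instances
    return min(candidates, key=lambda i: instance_cost(i, prompt_tokens))
```

In this sketch, an instance with 5000 decode tokens and a 900-token cache hit
is gated out for a 1000-token prompt even though its uncached cost is only 100
tokens. A queue term that counted prefill alone would have rated it the
cheapest choice.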

Result: token imbalance 10.3x → 2.6x, TTFT p90 -37% vs baseline.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-05-24 16:25:02 +08:00