3957c2df86bbfa613f8babd44edc06cc5a365e2c
Root cause of the 6P+2D run-to-run collapse (rep1 100%, rep2 56%,
rep3 80%, session-routing 6.6%): not load-shedding, but a consumer
EngineCore crash.
Failure chain observed in the consumer logs:
1. D-pool fills to ~97% (decode-side capacity ceiling, the H1 story)
2. a large request's KV transfer fails: "Mooncake transfer engine
returned -1" (112k-token request, pool full)
3. scheduler fails the request (kv_load_failure_policy=fail)
4. PromptTokenStats.local_cache_hit, computed as num_cached +
recomputed - num_external_computed, goes NEGATIVE because the
external-transfer count exceeds the cached count
5. loggers.record() calls Counter.inc(negative) -> prometheus raises
"Counters can only be incremented by non-negative amounts."
6. EngineCore dies -> every subsequent request fails (the cliff:
all successes in the first ~110s, zero after)
This turns ONE failed request into a total config collapse, and is
what made the round-robin 6P+2D reps look randomly variable.
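A minimal sketch of steps 4 and 5, to show how one negative value becomes an exception. The formula is the one quoted in step 4 and the error text is the one from step 5. The helper names and the token counts are hypothetical, chosen only so that the external count exceeds the cached count, and counter_inc is a stand-in for the real counter, not the Prometheus client itself.

```python
def local_cache_hit(num_cached: int, recomputed: int,
                    num_external_computed: int) -> int:
    # Formula as quoted in step 4 of the failure chain.
    return num_cached + recomputed - num_external_computed


def counter_inc(amount: int) -> None:
    # Stand-in for the counter's increment: rejects negative amounts
    # with the message seen in the consumer logs.
    if amount < 0:
        raise ValueError(
            "Counters can only be incremented by non-negative amounts."
        )


# Illustrative values only, not taken from the logs.
hit = local_cache_hit(num_cached=1_000, recomputed=0,
                      num_external_computed=1_500)

try:
    counter_inc(hit)
    raised = False
except ValueError as exc:
    raised = True
    message = str(exc)
```

If nothing on the recording path catches this exception, it propagates out of the stats logging call, which matches the crash described in step 6.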
Fix: clamp the three per-source prompt-token counts to >= 0 in
loggers.record() before they hit Counter.inc(). The change is a pure
insertion, revertible via the existing sentinel mechanism. It lets a
transfer failure stay a single failed request instead of killing the
engine, so routing arms can be compared on equal footing.
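A rough sketch of the clamp, assuming the counts arrive as a mapping from source name to token count. The source names, StandInCounter, clamp_non_negative and record below are hypothetical illustrations, not the actual code in loggers.record().

```python
class StandInCounter:
    """Stand-in with the same non-negative rule as a Prometheus counter."""

    def __init__(self) -> None:
        self.value = 0

    def inc(self, amount: int = 1) -> None:
        if amount < 0:
            raise ValueError(
                "Counters can only be incremented by non-negative amounts."
            )
        self.value += amount


def clamp_non_negative(counts: dict[str, int]) -> dict[str, int]:
    # Floor every per-source count at zero before it reaches a counter.
    return {source: max(0, n) for source, n in counts.items()}


def record(counters: dict[str, StandInCounter],
           counts: dict[str, int]) -> None:
    for source, n in clamp_non_negative(counts).items():
        counters[source].inc(n)


# Hypothetical source names and values; one count is negative, as after
# a failed KV transfer.
raw_counts = {
    "local_compute": 0,
    "local_cache_hit": -500,
    "external_kv_transfer": 1_500,
}
counters = {source: StandInCounter() for source in raw_counts}
record(counters, raw_counts)  # no exception; the negative count adds 0
```

Clamping trades a slightly inaccurate metric on the failed request for keeping the engine alive, which is the trade this commit makes.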
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>