MB5 patch: clamp PD-consumer metrics counter underflow
Root cause of the 6P+2D run-to-run collapse (rep1 100%, rep2 56%,
rep3 80%, session-routing 6.6%): not load-shedding, but a consumer
EngineCore crash.
Failure chain observed in the consumer logs:
  1. D-pool fills to ~97% (decode-side capacity ceiling, the H1 story)
  2. a large request's KV transfer fails: "Mooncake transfer engine
     returned -1" (112k-token request, pool full)
  3. scheduler fails the request (kv_load_failure_policy=fail)
  4. PromptTokenStats.local_cache_hit = num_cached + recomputed -
     num_external_computed goes NEGATIVE (external transfer exceeded
     cached count)
  5. loggers.record() calls Counter.inc(negative) -> prometheus raises
     "Counters can only be incremented by non-negative amounts."
  6. EngineCore dies -> every subsequent request fails (the cliff:
     all successes in the first ~110s, zero after)
This turns ONE failed request into a total config collapse, and is
what made the round-robin 6P+2D reps look randomly variable.
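
Steps 4 and 5 of the chain can be reproduced in isolation. The sketch
below is illustrative only: StubCounter is a hypothetical stand-in that
mimics the non-negative check described above (it is not the
prometheus_client or vLLM source), and the token counts are made-up
values chosen so that the external count exceeds the cached count.

```python
class StubCounter:
    """Hypothetical stand-in for a Prometheus counter's non-negative check."""

    def __init__(self) -> None:
        self.value = 0

    def inc(self, amount: int = 1) -> None:
        # Same rule as in step 5: negative increments are rejected.
        if amount < 0:
            raise ValueError(
                "Counters can only be incremented by non-negative amounts."
            )
        self.value += amount


def local_cache_hit(num_cached: int, recomputed: int,
                    num_external_computed: int) -> int:
    # Formula quoted in step 4 of the failure chain.
    return num_cached + recomputed - num_external_computed


# Assumed numbers: the external transfer reports more tokens than the
# scheduler's cached count for a large request.
hit = local_cache_hit(num_cached=100_000, recomputed=0,
                      num_external_computed=112_000)

counter = StubCounter()
crashed = False
error_message = ""
try:
    counter.inc(hit)  # unguarded, as in the unpatched record() path
except ValueError as exc:
    crashed = True
    error_message = str(exc)
```

In the stub the exception is caught; in the chain above it goes
unhandled and takes the EngineCore down with it.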
Fix: clamp the three per-source prompt-token counts to >= 0 in
loggers.record() before they hit Counter.inc(). Pure insertion,
revertible via the existing sentinel mechanism. Lets a transfer
failure stay a single failed request instead of killing the engine,
so routing arms can be compared on equal footing.
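
A minimal sketch of the clamp, assuming a simplified stand-in for the
stats object. PromptTokenStatsSketch and clamp_non_negative are
hypothetical names used only for illustration; the three field names
are the ones the patch touches.

```python
from dataclasses import dataclass


@dataclass
class PromptTokenStatsSketch:
    """Hypothetical stand-in holding the three per-source counts."""
    computed: int = 0
    local_cache_hit: int = 0
    external_kv_transfer: int = 0


CLAMPED_FIELDS = ("computed", "local_cache_hit", "external_kv_transfer")


def clamp_non_negative(pts: PromptTokenStatsSketch) -> PromptTokenStatsSketch:
    # Floor each per-source count at zero so a later Counter.inc()
    # never receives a negative amount.
    for name in CLAMPED_FIELDS:
        if getattr(pts, name) < 0:
            setattr(pts, name, 0)
    return pts


# Assumed values for a request whose KV load failed.
pts = clamp_non_negative(
    PromptTokenStatsSketch(
        computed=512,
        local_cache_hit=-12_000,
        external_kv_transfer=112_000,
    )
)
```

One consequence worth keeping in mind: after clamping, the three counts
for the affected request no longer reconcile with each other, so the
metrics for that request are approximate. That trade-off is accepted
here in exchange for keeping the engine alive.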
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
@@ -36,6 +36,7 @@ MOONCAKE_REL = (
     "lib/python3.12/site-packages/vllm/distributed/kv_transfer/"
     "kv_connector/v1/mooncake/mooncake_connector.py"
 )
+LOGGERS_REL = "lib/python3.12/site-packages/vllm/v1/metrics/loggers.py"
 
 START_MARK = "# MB5_INSTRUMENT_START"
 END_MARK = "# MB5_INSTRUMENT_END"
@@ -192,9 +193,37 @@ MOONCAKE_PATCHES = [
      MOONCAKE_ANCHOR + MOONCAKE_INSERT),
 ]
 
+# ---------- Patch 4: vLLM 0.18.1 PD-consumer metrics counter underflow ------
+# In PromptTokenStats.update_from_output, local_cache_hit is computed as
+# (num_cached_tokens + recomputed - num_external_computed_tokens). On a
+# kv_consumer, a remote KV transfer can report more external-computed tokens
+# than the scheduler's cached count (esp. on a KV-load failure for a large
+# request), driving local_cache_hit negative. loggers.record() then calls
+# Counter.inc() with that negative value and prometheus_client raises
+# "Counters can only be incremented by non-negative amounts.", which kills the
+# EngineCore — turning one failed request into a total config collapse.
+# We clamp the per-source counts to >= 0 right before they are recorded.
+LOGGERS_ANCHOR = " pts = iteration_stats.prompt_token_stats\n"
+LOGGERS_INSERT = (
+    f" {START_MARK}\n"
+    f" if pts.local_cache_hit < 0:\n"
+    f" pts.local_cache_hit = 0\n"
+    f" if pts.computed < 0:\n"
+    f" pts.computed = 0\n"
+    f" if pts.external_kv_transfer < 0:\n"
+    f" pts.external_kv_transfer = 0\n"
+    f" {END_MARK}\n"
+)
+
+LOGGERS_PATCHES = [
+    ("PD-consumer counter underflow clamp", LOGGERS_ANCHOR,
+     LOGGERS_ANCHOR + LOGGERS_INSERT),
+]
+
 PATCH_FILES = [
     (TARGET_REL, SCHED_PATCHES),
     (MOONCAKE_REL, MOONCAKE_PATCHES),
+    (LOGGERS_REL, LOGGERS_PATCHES),
 ]
 
 
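
The patch entries above are (description, anchor, replacement) tuples
whose inserted text is bracketed by START_MARK and END_MARK. The
patcher's own apply and revert code is not part of this diff, so the
following is only a sketch of how such an anchor-and-sentinel scheme
could work. apply_patch and revert_patches are hypothetical helper
names, and the indentation in the sample source is assumed.

```python
START_MARK = "# MB5_INSTRUMENT_START"
END_MARK = "# MB5_INSTRUMENT_END"


def apply_patch(source: str, anchor: str, replacement: str) -> str:
    """Replace the first occurrence of anchor; no-op if already applied."""
    if replacement in source:
        return source  # idempotent: the insert is already present
    if anchor not in source:
        raise ValueError("anchor not found; refusing to patch")
    return source.replace(anchor, replacement, 1)


def revert_patches(source: str) -> str:
    """Drop every line from a START_MARK line through its END_MARK line."""
    kept = []
    skipping = False
    for line in source.splitlines(keepends=True):
        if START_MARK in line:
            skipping = True
            continue
        if END_MARK in line:
            skipping = False
            continue
        if not skipping:
            kept.append(line)
    return "".join(kept)


# Assumed sample of the target file; the real loggers.py differs.
original = (
    "def record(self, iteration_stats):\n"
    "    pts = iteration_stats.prompt_token_stats\n"
    "    self.counter.inc(pts.local_cache_hit)\n"
)
anchor = "    pts = iteration_stats.prompt_token_stats\n"
insert = (
    f"    {START_MARK}\n"
    "    if pts.local_cache_hit < 0:\n"
    "        pts.local_cache_hit = 0\n"
    f"    {END_MARK}\n"
)

patched = apply_patch(original, anchor, anchor + insert)
patched_twice = apply_patch(patched, anchor, anchor + insert)
reverted = revert_patches(patched)
```

Because each patch only inserts text after its anchor, stripping the
marked region is enough to restore the original file, which is what
makes a pure insertion cheap to revert.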