MB5 patch: clamp PD-consumer metrics counter underflow

Root cause of the 6P+2D run-to-run collapse (rep1 100%, rep2 56%,
rep3 80%, session-routing 6.6%): the consumer EngineCore crashes.
It is not load-shedding.

Failure chain observed in the consumer logs:
  1. D-pool fills to ~97% (decode-side capacity ceiling, the H1 story)
  2. a large request's KV transfer fails: "Mooncake transfer engine
     returned -1" (112k-token request, pool full)
  3. scheduler fails the request (kv_load_failure_policy=fail)
  4. PromptTokenStats.local_cache_hit, computed as num_cached +
     recomputed - num_external_computed, goes NEGATIVE (the
     external-transfer count exceeded the cached count)
  5. loggers.record() calls Counter.inc(negative) -> prometheus raises
     "Counters can only be incremented by non-negative amounts."
  6. EngineCore dies -> every subsequent request fails (the cliff:
     all successes in the first ~110s, zero after)
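
Steps 4 and 5 can be reproduced in isolation. The sketch below is
illustrative only: StandInCounter is a hypothetical stand-in that mimics
the non-negative check quoted in step 5 (it is not prometheus_client
itself), and the token counts are made-up numbers chosen so that the
external count exceeds the cached count.

```python
class StandInCounter:
    """Hypothetical stand-in for a monotonic counter that rejects negative increments."""

    def __init__(self):
        self.value = 0

    def inc(self, amount=1):
        if amount < 0:
            raise ValueError(
                "Counters can only be incremented by non-negative amounts."
            )
        self.value += amount


def local_cache_hit(num_cached, recomputed, num_external_computed):
    # Same arithmetic as described in step 4 of the failure chain.
    return num_cached + recomputed - num_external_computed


# Made-up numbers: a large external transfer against a small cached count.
hit = local_cache_hit(num_cached=1_000, recomputed=0, num_external_computed=112_000)

counter = StandInCounter()
try:
    counter.inc(hit)
    crashed = False
except ValueError:
    # In the engine this exception is not caught, which is what ends the process.
    crashed = True
```

With these numbers hit is -111000, so the increment is rejected and the
counter stays at 0.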

This turns ONE failed request into a total config collapse, and is
what made the round-robin 6P+2D reps look randomly variable.

Fix: clamp the three per-source prompt-token counts (local_cache_hit,
computed, external_kv_transfer) to >= 0 in loggers.record() before
they reach Counter.inc(). The change is a pure insertion and is
revertible via the existing sentinel mechanism. A transfer failure now
stays a single failed request instead of killing the engine, so
routing arms can be compared on equal footing.
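
A minimal sketch of the clamp, assuming a stats object with the three
fields named in the diff. PromptTokenStatsSketch and
clamp_per_source_counts are hypothetical names used only for
illustration; the actual patch inlines three if-statements into
loggers.record() rather than calling a helper.

```python
from dataclasses import dataclass


@dataclass
class PromptTokenStatsSketch:
    """Hypothetical stand-in holding only the three per-source counts."""
    computed: int = 0
    local_cache_hit: int = 0
    external_kv_transfer: int = 0


def clamp_per_source_counts(pts):
    # Floor each per-source count at zero so a later Counter.inc() cannot
    # receive a negative amount.
    for name in ("computed", "local_cache_hit", "external_kv_transfer"):
        if getattr(pts, name) < 0:
            setattr(pts, name, 0)
    return pts


# Made-up values: only local_cache_hit has underflowed.
pts = clamp_per_source_counts(
    PromptTokenStatsSketch(
        computed=500, local_cache_hit=-111_000, external_kv_transfer=112_000
    )
)
```

One consequence of clamping is that the three counts may no longer sum
to the request's prompt-token total for the affected request. The patch
accepts that inaccuracy in exchange for keeping the engine alive.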

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-28 13:01:23 +08:00
parent 8596135680
commit 3957c2df86


@@ -36,6 +36,7 @@ MOONCAKE_REL = (
"lib/python3.12/site-packages/vllm/distributed/kv_transfer/"
"kv_connector/v1/mooncake/mooncake_connector.py"
)
LOGGERS_REL = "lib/python3.12/site-packages/vllm/v1/metrics/loggers.py"
START_MARK = "# MB5_INSTRUMENT_START"
END_MARK = "# MB5_INSTRUMENT_END"
@@ -192,9 +193,37 @@ MOONCAKE_PATCHES = [
MOONCAKE_ANCHOR + MOONCAKE_INSERT),
]
# ---------- Patch 4: vLLM 0.18.1 PD-consumer metrics counter underflow ------
# In PromptTokenStats.update_from_output, local_cache_hit is computed as
# (num_cached_tokens + recomputed - num_external_computed_tokens). On a
# kv_consumer, a remote KV transfer can report more external-computed tokens
# than the scheduler's cached count (esp. on a KV-load failure for a large
# request), driving local_cache_hit negative. loggers.record() then calls
# Counter.inc() with that negative value and prometheus_client raises
# "Counters can only be incremented by non-negative amounts.", which kills the
# EngineCore — turning one failed request into a total config collapse.
# We clamp the per-source counts to >= 0 right before they are recorded.
LOGGERS_ANCHOR = " pts = iteration_stats.prompt_token_stats\n"
LOGGERS_INSERT = (
f" {START_MARK}\n"
f" if pts.local_cache_hit < 0:\n"
f" pts.local_cache_hit = 0\n"
f" if pts.computed < 0:\n"
f" pts.computed = 0\n"
f" if pts.external_kv_transfer < 0:\n"
f" pts.external_kv_transfer = 0\n"
f" {END_MARK}\n"
)
LOGGERS_PATCHES = [
("PD-consumer counter underflow clamp", LOGGERS_ANCHOR,
LOGGERS_ANCHOR + LOGGERS_INSERT),
]
PATCH_FILES = [
(TARGET_REL, SCHED_PATCHES),
(MOONCAKE_REL, MOONCAKE_PATCHES),
(LOGGERS_REL, LOGGERS_PATCHES),
]
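
The code that consumes PATCH_FILES is not part of this diff. As a rough
sketch of how an (anchor, replacement) pair and the START_MARK/END_MARK
sentinels could be applied and reverted, assuming the patcher works on
the file's text: apply_patch and revert_patches are hypothetical helper
names, and the sample source text is invented for the example.

```python
import re

START_MARK = "# MB5_INSTRUMENT_START"
END_MARK = "# MB5_INSTRUMENT_END"


def apply_patch(text, anchor, replacement):
    """Replace a unique anchor with anchor + inserted block; no-op if already applied."""
    if replacement in text:
        return text  # already patched
    if text.count(anchor) != 1:
        raise ValueError("anchor must occur exactly once in the target file")
    return text.replace(anchor, replacement, 1)


def revert_patches(text):
    """Remove every sentinel-delimited block, including the marker lines."""
    pattern = re.compile(
        r"^[ \t]*" + re.escape(START_MARK) + r"\n.*?^[ \t]*"
        + re.escape(END_MARK) + r"\n",
        re.DOTALL | re.MULTILINE,
    )
    return pattern.sub("", text)


# Invented sample target, standing in for the real loggers.py contents.
original = (
    "def record(stats):\n"
    "    pts = stats.prompt_token_stats\n"
    "    emit(pts)\n"
)
anchor = "    pts = stats.prompt_token_stats\n"
insert = (
    f"    {START_MARK}\n"
    "    if pts.local_cache_hit < 0:\n"
    "        pts.local_cache_hit = 0\n"
    f"    {END_MARK}\n"
)

patched = apply_patch(original, anchor, anchor + insert)
patched_twice = apply_patch(patched, anchor, anchor + insert)
reverted = revert_patches(patched)
```

Because each replacement is the anchor followed by a sentinel-delimited
block, a revert only has to delete everything between the markers, which
is what makes a pure insertion cheap to undo.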