MB5 patch: clamp PD-consumer metrics counter underflow

Root cause of the 6P+2D run-to-run collapse (rep1 100%, rep2 56%, rep3 80%, session-routing 6.6%): not load-shedding, but a consumer EngineCore crash.

Failure chain observed in the consumer logs:
  1. D-pool fills to ~97% (decode-side capacity ceiling, the H1 story)
  2. a large request's KV transfer fails: "Mooncake transfer engine returned -1" (112k-token request, pool full)
  3. scheduler fails the request (kv_load_failure_policy=fail)
  4. PromptTokenStats.local_cache_hit = num_cached + recomputed - num_external_computed goes NEGATIVE (external transfer exceeded cached count)
  5. loggers.record() calls Counter.inc(negative) -> prometheus raises "Counters can only be incremented by non-negative amounts."
  6. EngineCore dies -> every subsequent request fails (the cliff: all successes in the first ~110s, zero after)

This turns ONE failed request into a total config collapse, and is what made the round-robin 6P+2D reps look randomly variable.

Fix: clamp the three per-source prompt-token counts to >= 0 in loggers.record() before they hit Counter.inc(). Pure insertion, revertible via the existing sentinel mechanism. Lets a transfer failure stay a single failed request instead of killing the engine, so routing arms can be compared on equal footing.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-28 13:01:23 +08:00
parent 8596135680
commit 3957c2df86
1 changed files with 29 additions and 0 deletions
--- a/microbench/fresh_setup/instrument_kv_snapshot.py
+++ b/microbench/fresh_setup/instrument_kv_snapshot.py
@@ -36,6 +36,7 @@ MOONCAKE_REL = (
    "lib/python3.12/site-packages/vllm/distributed/kv_transfer/"
    "kv_connector/v1/mooncake/mooncake_connector.py"
 )
+LOGGERS_REL = "lib/python3.12/site-packages/vllm/v1/metrics/loggers.py"

 START_MARK = "# MB5_INSTRUMENT_START"
 END_MARK = "# MB5_INSTRUMENT_END"
@@ -192,9 +193,37 @@ MOONCAKE_PATCHES = [
     MOONCAKE_ANCHOR + MOONCAKE_INSERT),
 ]

+# ---------- Patch 4: vLLM 0.18.1 PD-consumer metrics counter underflow ------
+# In PromptTokenStats.update_from_output, local_cache_hit is computed as
+# (num_cached_tokens + recomputed - num_external_computed_tokens). On a
+# kv_consumer, a remote KV transfer can report more external-computed tokens
+# than the scheduler's cached count (esp. on a KV-load failure for a large
+# request), driving local_cache_hit negative. loggers.record() then calls
+# Counter.inc() with that negative value and prometheus_client raises
+# "Counters can only be incremented by non-negative amounts.", which kills the
+# EngineCore — turning one failed request into a total config collapse.
+# We clamp the per-source counts to >= 0 right before they are recorded.
+LOGGERS_ANCHOR = "        pts = iteration_stats.prompt_token_stats\n"
+LOGGERS_INSERT = (
+    f"        {START_MARK}\n"
+    f"        if pts.local_cache_hit < 0:\n"
+    f"            pts.local_cache_hit = 0\n"
+    f"        if pts.computed < 0:\n"
+    f"            pts.computed = 0\n"
+    f"        if pts.external_kv_transfer < 0:\n"
+    f"            pts.external_kv_transfer = 0\n"
+    f"        {END_MARK}\n"
+)
+
+LOGGERS_PATCHES = [
+    ("PD-consumer counter underflow clamp", LOGGERS_ANCHOR,
+     LOGGERS_ANCHOR + LOGGERS_INSERT),
+]
+
 PATCH_FILES = [
    (TARGET_REL, SCHED_PATCHES),
    (MOONCAKE_REL, MOONCAKE_PATCHES),
+    (LOGGERS_REL, LOGGERS_PATCHES),
 ]
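
Note: steps 4-5 of the failure chain and the effect of the clamp can be reproduced in isolation. The following is a minimal sketch, not vLLM code. `StandInCounter` is a hypothetical stand-in that only mimics the non-negative check of `prometheus_client.Counter.inc()`, `PromptTokenStatsSketch` is a simplified assumption carrying just the three fields the patch touches, and the token counts are illustrative values rather than numbers taken from the logs.

```python
from dataclasses import dataclass


class StandInCounter:
    """Hypothetical stand-in: mimics only the non-negative rule of a Prometheus counter."""

    def __init__(self) -> None:
        self.value = 0.0

    def inc(self, amount: float = 1) -> None:
        if amount < 0:
            raise ValueError(
                "Counters can only be incremented by non-negative amounts."
            )
        self.value += amount


@dataclass
class PromptTokenStatsSketch:
    """Simplified assumption: only the three per-source counts the patch clamps."""

    computed: int = 0
    local_cache_hit: int = 0
    external_kv_transfer: int = 0


def clamp_non_negative(pts: PromptTokenStatsSketch) -> PromptTokenStatsSketch:
    """Same effect as the inserted block: floor each per-source count at zero."""
    pts.computed = max(pts.computed, 0)
    pts.local_cache_hit = max(pts.local_cache_hit, 0)
    pts.external_kv_transfer = max(pts.external_kv_transfer, 0)
    return pts


def record(pts: PromptTokenStatsSketch, counter: StandInCounter, clamp: bool) -> None:
    """Record the three counts, optionally clamping them first."""
    if clamp:
        clamp_non_negative(pts)
    for amount in (pts.computed, pts.local_cache_hit, pts.external_kv_transfer):
        counter.inc(amount)


def make_failing_stats() -> PromptTokenStatsSketch:
    # Illustrative values: external transfer exceeds the cached count.
    num_cached, recomputed, num_external = 1_000, 0, 112_000
    return PromptTokenStatsSketch(
        computed=0,
        local_cache_hit=num_cached + recomputed - num_external,  # negative
        external_kv_transfer=num_external,
    )


if __name__ == "__main__":
    try:
        record(make_failing_stats(), StandInCounter(), clamp=False)
    except ValueError as exc:
        print("unclamped:", exc)

    counter = StandInCounter()
    record(make_failing_stats(), counter, clamp=True)
    print("clamped total:", counter.value)
```

Without the clamp, the exception propagates out of the recording path, which in the real engine is what takes the EngineCore down. With the clamp, the failed request contributes zero to the affected counter and recording continues.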