agentic-kvc

Files

Gahow Wang 3957c2df86 MB5 patch: clamp PD-consumer metrics counter underflow

Root cause of the 6P+2D run-to-run collapse (rep1 100%, rep2 56%,
rep3 80%, session-routing 6.6%): not load-shedding, but a consumer
EngineCore crash.

Failure chain observed in the consumer logs:
  1. D-pool fills to ~97% (decode-side capacity ceiling, the H1 story)
  2. a large request's KV transfer fails: "Mooncake transfer engine
     returned -1" (112k-token request, pool full)
  3. scheduler fails the request (kv_load_failure_policy=fail)
  4. PromptTokenStats.local_cache_hit = num_cached + recomputed -
     num_external_computed goes NEGATIVE (external transfer exceeded
     cached count)
  5. loggers.record() calls Counter.inc(negative) -> prometheus raises
     "Counters can only be incremented by non-negative amounts."
  6. EngineCore dies -> every subsequent request fails (the cliff:
     all successes in the first ~110s, zero after)

This turns ONE failed request into a total config collapse, and is
what made the round-robin 6P+2D reps look randomly variable.

Fix: clamp the three per-source prompt-token counts to >= 0 in
loggers.record() before they hit Counter.inc(). Pure insertion,
revertible via the existing sentinel mechanism. Lets a transfer
failure stay a single failed request instead of killing the engine,
so routing arms can be compared on equal footing.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

2026-05-28 13:01:23 +08:00

connector_tax

Connector tax: trace-replay confirms +45% kv_both penalty is gone; DR-fix adds 22% more

2026-05-27 09:13:50 +08:00

fresh_setup

MB5 patch: clamp PD-consumer metrics counter underflow

2026-05-28 13:01:23 +08:00

interference

Microbench 1 plots: prefill-decode interference heatmap + lines

2026-05-26 14:21:30 +08:00

lifecycle

PD-sep server-side profiling: vLLM patches + per-request breakdown