PD_DISAGG_RESULTS §5.1: D-pool pressure crashes consumers

Document the consumer EngineCore crash chain (D-pool 97% -> 112k-token
KV transfer fails -> negative prompt-token counter -> prometheus
ValueError -> engine dead -> cliff failure). Explains the round-robin
6P+2D rep variance (100/56/80%) as intermittent consumer death, and
notes the counter-clamp patch needed to compare routing arms fairly.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-28 13:02:21 +08:00
parent 3957c2df86
commit 2e6a369046


@@ -161,6 +161,39 @@ the other idles.** Splitting the KV pool by role exposes it:
This is **H1 (D-pool capacity ceiling)** and **H2 (static-partition
mismatch)** turning out to be the *same* phenomenon seen from two ratios.

### 5.1 The same pressure crashes consumers (a vLLM 0.18.1 fragility)

D-pool saturation doesn't just slow things down — under this workload it
**crashes the decode instances**. The exact chain, from the 6P+2D consumer
logs:
1. D-pool fills to **97.2%** (the capacity ceiling above).
2. A large request needs its KV pulled to the consumer, but the transfer
fails: `Mooncake transfer engine returned -1` (observed on a **112,793-token**
request — agentic sessions have very long multi-turn contexts, and the
pool had no room).
3. `kv_load_failure_policy=fail` fails that request — by itself recoverable.
4. **But** the failure path computes `PromptTokenStats.local_cache_hit =
num_cached + recomputed - num_external_computed`, which goes **negative**
when the externally transferred token count exceeds the scheduler's cached count.
5. `loggers.record()` calls `Counter.inc(negative)` → prometheus_client raises
*"Counters can only be incremented by non-negative amounts"* → the
**EngineCore dies**.
6. Once the consumer's engine is dead, **every** subsequent request fails.
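
Steps 4 and 5 can be reproduced in isolation. The sketch below is stdlib-only and makes several assumptions: `StandInCounter` is a hypothetical stand-in that mirrors only the one Prometheus counter behavior relevant here (rejecting negative increments), `local_cache_hit_delta` follows the arithmetic described in step 4 (cached plus recomputed, minus externally computed) rather than quoting vLLM's source, and the token counts are made-up values, not figures from the logs.

```python
class StandInCounter:
    """Hypothetical stand-in for a monotonic metrics counter.

    It mirrors a single behavior: negative increments are rejected
    with a ValueError, as in the message quoted in step 5.
    """

    def __init__(self) -> None:
        self.value = 0.0

    def inc(self, amount: float = 1.0) -> None:
        if amount < 0:
            raise ValueError(
                "Counters can only be incremented by non-negative amounts"
            )
        self.value += amount


def local_cache_hit_delta(
    num_cached: int, recomputed: int, num_external_computed: int
) -> int:
    # Arithmetic from step 4: nothing here prevents a negative result.
    return num_cached + recomputed - num_external_computed


# Made-up values: the external count exceeds the cached count.
delta = local_cache_hit_delta(
    num_cached=1_000, recomputed=0, num_external_computed=5_000
)

counter = StandInCounter()
crashed = False
error_message = ""
try:
    counter.inc(delta)  # step 5: a negative increment raises
except ValueError as exc:
    crashed = True
    error_message = str(exc)

print(delta, crashed, error_message)
```

In the stand-in the exception is caught; in the chain above it is not, which is what turns one failed request into a dead engine.
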

The signature is a cliff, not a slope: in the session-routing 6P+2D run, all
80 successes landed in the first ~110 s, followed by **zero** in the next ~2,800 s.
This same intermittent consumer death is almost certainly why the
round-robin 6P+2D reps varied so wildly (100% / 56% / 80%) — the consumer
crashed at different points in each rep.
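
One way to tell a cliff from a slope in a request log is to compare the time of the last success with the end of the run. The sketch below is a hypothetical helper, not part of the benchmark harness. Its input is a synthetic event list shaped like the run described above (80 successes inside the first 110 s, then only failures), not the actual log.

```python
from typing import Dict, List, Optional, Tuple


def summarize_cliff(
    events: List[Tuple[float, bool]], run_end_s: float
) -> Dict[str, Optional[float]]:
    """Summarize (timestamp_s, succeeded) events for one run.

    A long dead tail after the last success suggests the consumer
    died, rather than degrading gradually.
    """
    success_times = [t for t, ok in events if ok]
    last_ok = max(success_times) if success_times else None
    # With no successes at all, the whole run counts as dead tail.
    dead_tail = run_end_s - last_ok if last_ok is not None else run_end_s
    return {
        "successes": float(len(success_times)),
        "last_success_s": last_ok,
        "dead_tail_s": dead_tail,
    }


# Synthetic events: 80 successes spread over the first 110 s,
# then one failure every 10 s until the run ends.
events = [(i * 110.0 / 80, True) for i in range(80)]
events += [(110.0 + j * 10.0, False) for j in range(280)]

summary = summarize_cliff(events, run_end_s=2910.0)
print(summary)
```

Applied per rep, the same summary would show where in each round-robin run the consumer stopped answering.
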

**Two takeaways:** (a) PD-disagg under agentic context lengths hits KV-transfer
failures that colocation never does (8C never transfers — it prefills and
decodes in the same pool); (b) vLLM 0.18.1's failure handling amplifies one
failed request into a total collapse. We patched the counter underflow
(`instrument_kv_snapshot.py`, clamp to ≥ 0) so a transfer failure stays a
single failed request, which is required to compare routing arms fairly in §6.
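
A minimal sketch of the clamp idea follows. The actual patch lives in `instrument_kv_snapshot.py` and is not reproduced here: `clamp_counter_delta` and `StandInCounter` are hypothetical names, the counter is a stdlib stand-in rather than the real metrics client, and the deltas are made-up values.

```python
class StandInCounter:
    """Hypothetical stand-in: a counter that rejects negative increments."""

    def __init__(self) -> None:
        self.value = 0.0

    def inc(self, amount: float = 1.0) -> None:
        if amount < 0:
            raise ValueError(
                "Counters can only be incremented by non-negative amounts"
            )
        self.value += amount


def clamp_counter_delta(delta: float) -> float:
    # Clamp to >= 0 so an underflowed delta never reaches the counter.
    return max(0.0, delta)


counter = StandInCounter()

# Made-up deltas: one underflow from a failed transfer, then two normal ones.
deltas = [-4000.0, 512.0, 256.0]
for delta in deltas:
    counter.inc(clamp_counter_delta(delta))  # never raises

print(counter.value)
```

The trade-off is that the failed request contributes 0 to the counter instead of its true (negative) delta, so the metric is slightly off for that request, but the logging path no longer raises.
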

---

## 6. The routing handicap — and whether smarter routing rescues PD