PD_DISAGG_RESULTS §5.1: D-pool pressure crashes consumers

Document the consumer EngineCore crash chain (D-pool 97% -> 112k-token KV
transfer fails -> negative prompt-token counter -> prometheus ValueError ->
engine dead -> cliff failure). Explains the round-robin 6P+2D rep variance
(100/56/80%) as intermittent consumer death, and notes the counter-clamp
patch needed to compare routing arms fairly.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
@@ -161,6 +161,39 @@ the other idles.** Splitting the KV pool by role exposes it:
This is **H1 (D-pool capacity ceiling)** and **H2 (static-partition
mismatch)** turning out to be the *same* phenomenon seen from two ratios.

### 5.1 The same pressure crashes consumers (a vLLM 0.18.1 fragility)

D-pool saturation doesn't just slow things down: under this workload it
**crashes the decode instances**. The exact chain, from the 6P+2D consumer
logs:

1. D-pool fills to **97.2%** (the capacity ceiling above).
2. A large request needs its KV pulled to the consumer, but the transfer
   fails: `Mooncake transfer engine returned -1` (observed on a
   **112,793-token** request; agentic sessions have very long multi-turn
   contexts, and the pool had no room).
3. `kv_load_failure_policy=fail` fails that request, which by itself is
   recoverable.
4. **But** the failure path computes `PromptTokenStats.local_cache_hit =
   num_cached + recomputed - num_external_computed`, which goes **negative**
   when `num_external_computed` exceeds `num_cached + recomputed`.
5. `loggers.record()` calls `Counter.inc(negative)` → prometheus_client raises
   *"Counters can only be incremented by non-negative amounts"* → the
   **EngineCore dies**.
6. Once the consumer's engine is dead, **every** subsequent request fails.

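Steps 4 and 5 can be reproduced in isolation. The sketch below is a minimal illustration, not vLLM's code: `StrictCounter` is a hypothetical stand-in that only mimics the non-negative rule of a Prometheus counter, `record` is a hypothetical unguarded stats path, and the token counts are made-up values chosen to trigger the underflow.

```python
class StrictCounter:
    """Hypothetical stand-in for a Prometheus-style counter.

    It only reproduces the one behaviour that matters here:
    negative increments are rejected with ValueError.
    """

    def __init__(self) -> None:
        self.value = 0

    def inc(self, amount: int = 1) -> None:
        if amount < 0:
            raise ValueError(
                "Counters can only be incremented by non-negative amounts."
            )
        self.value += amount


def local_cache_hit(num_cached: int, recomputed: int,
                    num_external_computed: int) -> int:
    # Formula quoted in step 4 of the chain above.
    return num_cached + recomputed - num_external_computed


def record(counter: StrictCounter, num_cached: int, recomputed: int,
           num_external_computed: int) -> None:
    # Unguarded path: a negative delta reaches inc() and raises.
    counter.inc(local_cache_hit(num_cached, recomputed, num_external_computed))


counter = StrictCounter()

# Normal request (illustrative numbers): the delta is positive.
record(counter, num_cached=4096, recomputed=0, num_external_computed=1024)

# Failed-transfer request (illustrative numbers): the delta is negative.
try:
    record(counter, num_cached=1024, recomputed=0, num_external_computed=4096)
    crashed = False
except ValueError:
    # In the engine, nothing catches this, so the process dies (step 5).
    crashed = True
```

The point of the sketch is that the exception comes from the metrics path, not from the transfer itself, which is why one recoverable request failure ends up taking the whole engine down.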

The signature is a cliff, not a slope: in the session-routing 6P+2D run, all
80 successes landed in the first ~110 s, then **zero** in the next ~2,800 s.
This same intermittent consumer death is almost certainly why the
round-robin 6P+2D reps varied so wildly (100% / 56% / 80%): the consumer
crashed at a different point in each rep.

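One way to check a run for this cliff signature is to look at where the last success falls relative to the run length. The helper below is a hypothetical sketch: `cliff_summary` is not part of the benchmark tooling, the timestamps are synthetic (80 successes spread evenly over 110 s), and the 2,910 s duration is just the approximate 110 s + 2,800 s from the paragraph above.

```python
def cliff_summary(success_times_s, run_duration_s):
    """Return (last_success_s, dead_tail_s, dead_fraction) for one run.

    A long dead tail after an early last success suggests the consumer
    died, as opposed to a gradual slowdown.
    """
    if not success_times_s:
        return None, run_duration_s, 1.0
    last = max(success_times_s)
    dead_tail = run_duration_s - last
    return last, dead_tail, dead_tail / run_duration_s


# Synthetic data shaped like the session-routing 6P+2D run.
times = [110 * (i + 1) / 80 for i in range(80)]
last, dead_tail, dead_fraction = cliff_summary(times, 2910)

print(f"last success at {last:.0f} s, "
      f"dead tail {dead_tail:.0f} s ({dead_fraction:.0%} of the run)")
```

Applying the same summary to each round-robin rep would show whether the 100% / 56% / 80% spread lines up with different crash times, assuming per-request completion timestamps are available.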
**Two takeaways:** (a) PD-disagg under agentic context lengths hits KV-transfer
failures that colocation never does (8C never transfers: it prefills and
decodes in the same pool); (b) vLLM 0.18.1's failure handling amplifies one
failed request into a total collapse. We patched the counter underflow
(`instrument_kv_snapshot.py`, clamp to ≥ 0) so a transfer failure stays a
single failed request, which is required to compare routing arms fairly in §6.

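The clamp idea can be sketched as follows. This is an assumption about the shape of such a patch, not the contents of `instrument_kv_snapshot.py`: `StrictCounter` is a hypothetical stand-in for a counter that rejects negative increments, and `inc_clamped` is a hypothetical helper name.

```python
class StrictCounter:
    """Hypothetical stand-in: rejects negative increments."""

    def __init__(self) -> None:
        self.value = 0

    def inc(self, amount: int = 1) -> None:
        if amount < 0:
            raise ValueError(
                "Counters can only be incremented by non-negative amounts."
            )
        self.value += amount


def inc_clamped(counter: StrictCounter, amount: int) -> int:
    """Increment by max(amount, 0) and return how much was clamped away.

    A non-zero return value means the caller computed a negative delta;
    it can be logged instead of being allowed to raise.
    """
    safe = max(amount, 0)
    counter.inc(safe)
    return safe - amount


counter = StrictCounter()
dropped_ok = inc_clamped(counter, 3072)    # normal delta: nothing clamped
dropped_bad = inc_clamped(counter, -3072)  # underflow: clamped to 0, no raise
```

Clamping trades metric accuracy for availability: the counter under-reports on the affected request, but the engine keeps serving, so a failed transfer costs one request instead of the rest of the run.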

---

## 6. The routing handicap — and whether smarter routing rescues PD