docs(experiments): forensic explanation for E2 80% failure rate

Pulling admission-events.jsonl, prefill-0.log, and request-metrics
sampling shows the 1054 failures are NOT timeouts as initially
assumed. They are a 3-layer cascade:

L1: 562 "no-space" + 43 "session-not-resident" worker admission
rejects (51% of all admit attempts) because D0/D1 KV pools
saturate while D2 stays empty.
L2: rejects re-route to seed/reseed which need mooncake P→D KV
transfer; the backlog drops mooncake heartbeats and prefill-0
logs "Decode instance could be dead, remote mooncake session
... is not alive".
L3: SGLang aborts the request, SSE stream closes with 0 tokens,
agentic-pd-hybrid raises "generate stream ended before
producing any token" (the literal error string for all 1054).

E1 didn't hit this because pd-disaggregation has no admission RPC —
sessions just queue behind the running batch, paying TTFT instead
of failing. KVC v2's worker admission is supposed to be a safety
valve; on the cold-D pathology it becomes a failure amplifier.

The real fix is upstream D rebalancing (cold-D bonus or pre-warm),
not relaxing admission.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@@ -87,7 +87,7 @@ This is *the baseline H1 needs* — it shows the KVC layer (E2) has something co
### Key observations on E2

1. **D2 still has zero bindings** — same root cause as E1. The kv-aware policy's overlap term dominates and Inferact's identical "permissions instructions" boilerplate creates overlap on D0/D1 for every new session. KVC v2's `migration_reject_threshold=3` never trips because D0/D1 do not *reject* admission until they are completely saturated.
-2. **80 % failure rate, 1054 / 1285**. The 1054 reqs classified as bare `kvcache-centric` execution_mode are upstream timeouts / failures: they exceeded `--request-timeout-s 300` while waiting for admission. v2's stricter admission path (direct-to-D requires both `session resident on D` and `append_len ≤ τ_append` AND capacity) rejects more often than E1's vanilla pd-disagg; the rejects don't trigger migration (see §3 root cause), they cause the request to fall through to `fallback-seed` / `fallback-no-d-capacity` which queue on already-saturated D0/D1, hit timeout, and fail.
+2. **80 % failure rate, 1054 / 1285**. **NOT timeouts** — actual root cause is a 3-layer cascade documented in §6. Quick summary: 562 "no-space" admission rejects from D0/D1 → router falls back to seed/reseed paths needing mooncake → mooncake heartbeats drop ("Decode instance could be dead") → SGLang aborts the request → client sees `RuntimeError: generate stream ended before producing any token`.
3. **Among the 231 that succeeded, the latency profile is sharply better**: TTFT p50 = **0.43 s** vs E1's 88.62 s (E2/E1 = 0.5 %), latency p50 = **7.44 s** vs E1's 93.21 s (8 %). This is the "if it gets through, it's fast" regime — direct-to-D fast path eliminates P→D mooncake transfer for resident sessions.
4. **Direct-to-D fast path engaged 87 / 231 = 37.7 %** of successful requests. Lower than historical v2's 91.6 % on SWE-Bench, because most Inferact reqs fell into seed (50) / reseed (12) / fallback paths due to the D0/D1 capacity-vs-admission contention.
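
The ratios in these observations follow directly from the quoted counts and medians. A minimal sketch that re-derives them is below; the variable names are illustrative and only numbers stated in the list are used. Note that 1054 / 1285 works out to roughly 82 %, which the observation rounds to 80 %.

```python
# Re-derive the headline ratios from the counts and medians quoted above.
total_requests = 1285
failed_requests = 1054
succeeded_requests = total_requests - failed_requests   # requests that produced tokens
direct_to_d = 87                                        # successes served by the fast path

failure_rate = failed_requests / total_requests         # quoted as "80 %"
direct_share = direct_to_d / succeeded_requests         # quoted as 37.7 %
ttft_ratio = 0.43 / 88.62                               # E2 / E1 TTFT p50
latency_ratio = 7.44 / 93.21                            # E2 / E1 latency p50

print(f"succeeded:      {succeeded_requests}")
print(f"failure rate:   {failure_rate:.1%}")
print(f"direct-to-D:    {direct_share:.1%}")
print(f"TTFT ratio:     {ttft_ratio:.2%}")
print(f"latency ratio:  {latency_ratio:.1%}")
```
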
@@ -135,6 +135,84 @@ What we *can* say: RDMA is correctly engaged (every worker log shows `installTra
---

## 5b. Why E2 has 80 % failures — the real chain (forensic)

The summary's `error_count: 1054` and `execution_mode: kvcache-centric` mask the actual cascade. Pulling the underlying `request-metrics.jsonl`, `structural/admission-events.jsonl`, and per-worker SGLang logs gives the full picture.

### Layer 1 — worker admission rejects (51 % of admit attempts)

From `structural/admission-events.jsonl`:
```
admit ok = 581 (modes: seed=494, direct_append=87)
admit reject = 605 (reasons: no-space=562, session-not-resident=43)
```

**562 "no-space" rejects** — the D worker (almost always D0 or D1) reports its KV pool is full and refuses to take the request as direct-append. The router then re-routes the request to the seed/reseed path.

This is materially different from E1's behaviour: E1's vanilla pd-disagg had no admission RPC, so requests *always* got accepted by the chosen D and queued behind the running batch. E1 paid for that with a roughly 90-second TTFT (p50 = 88.62 s) but never saw a "no-space" failure.
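
The tally above can be reproduced with a few lines of Python. This is only a sketch: the real schema of `admission-events.jsonl` is not shown here, so the field names `outcome`, `mode`, and `reason` are assumptions, and the input is synthetic data built to match the reported counts.

```python
import json
from collections import Counter
from typing import Iterable


def tally_admissions(lines: Iterable[str]) -> dict:
    """Count admit outcomes per mode/reason. Field names are assumed, not the real schema."""
    outcomes, modes, reasons = Counter(), Counter(), Counter()
    for line in lines:
        line = line.strip()
        if not line:
            continue
        event = json.loads(line)
        outcome = event["outcome"]              # hypothetical field: "ok" or "reject"
        outcomes[outcome] += 1
        if outcome == "ok":
            modes[event.get("mode", "unknown")] += 1       # hypothetical field
        else:
            reasons[event.get("reason", "unknown")] += 1   # hypothetical field
    attempts = sum(outcomes.values())
    reject_rate = outcomes["reject"] / attempts if attempts else 0.0
    return {
        "outcomes": dict(outcomes),
        "modes": dict(modes),
        "reasons": dict(reasons),
        "reject_rate": reject_rate,
    }


# Synthetic events that reproduce the counts reported above.
synthetic_events = (
    [json.dumps({"outcome": "ok", "mode": "seed"})] * 494
    + [json.dumps({"outcome": "ok", "mode": "direct_append"})] * 87
    + [json.dumps({"outcome": "reject", "reason": "no-space"})] * 562
    + [json.dumps({"outcome": "reject", "reason": "session-not-resident"})] * 43
)
summary = tally_admissions(synthetic_events)
print(summary["outcomes"], f"reject rate = {summary['reject_rate']:.1%}")
```

With these counts the reject rate is 605 / (581 + 605), which is the 51 % quoted in the heading.
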
### Layer 2 — mooncake P→D transfer failures (real, observed in prefill log)

From `logs/prefill-0.log`:
```
[01:56:42] Prefill transfer failed for request rank=0 req.rid='2a5ed06fb…'
    with exception KVTransferError: Failed to send kv chunk of … to 172.18.112.37:46067
[01:56:42] Prefill transfer failed for request rank=0 req.rid='eca5ff14…'
    with exception KVTransferError: Decode instance could be dead,
    remote mooncake session 172.18.112.37:15078 is not alive
[01:56:42] Prefill transfer failed for request rank=0 req.rid='7ed9827b…'
    Decode instance could be dead, remote mooncake session ... is not alive
```

When the seed/reseed fallback queue piles up (because of layer 1), the D worker becomes heavily backlogged and its mooncake bootstrap session heartbeat drops — P interprets this as "the D worker is dead" and fails the transfer. This is **not** a true crash; the worker process is alive (we observed it accepting unrelated requests immediately after), but the mooncake session is torn down for that bootstrap_room.
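
To separate the two failure signatures in the prefill log, a simple substring scan is enough. The sketch below assumes the log is plain text laid out like the excerpt above; the substrings come from that excerpt, while the category labels (`session-not-alive`, `chunk-send-failed`) are made up for illustration.

```python
from collections import Counter
from typing import Iterable

# Substrings copied from the excerpt above; the labels are illustrative only.
SIGNATURES = {
    "Decode instance could be dead": "session-not-alive",
    "Failed to send kv chunk": "chunk-send-failed",
}


def classify_transfer_failures(lines: Iterable[str]) -> Counter:
    """Count log lines per failure signature (at most one label per line)."""
    counts = Counter()
    for line in lines:
        for needle, label in SIGNATURES.items():
            if needle in line:
                counts[label] += 1
                break
    return counts


# The excerpt above, used as sample input (elisions kept as they appear).
sample_log = [
    "[01:56:42] Prefill transfer failed for request rank=0 req.rid='2a5ed06fb…'",
    "    with exception KVTransferError: Failed to send kv chunk of … to 172.18.112.37:46067",
    "[01:56:42] Prefill transfer failed for request rank=0 req.rid='eca5ff14…'",
    "    with exception KVTransferError: Decode instance could be dead,",
    "    remote mooncake session 172.18.112.37:15078 is not alive",
    "[01:56:42] Prefill transfer failed for request rank=0 req.rid='7ed9827b…'",
    "    Decode instance could be dead, remote mooncake session ... is not alive",
]
failure_counts = classify_transfer_failures(sample_log)
print(dict(failure_counts))
```

On a real log, `classify_transfer_failures(open("logs/prefill-0.log", encoding="utf-8"))` would give the split between dead-session and chunk-send failures, assuming the messages are worded as in the excerpt.
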
### Layer 3 — client-visible error

From `request-metrics.jsonl` for all 1054 failed reqs:
```
"error": "RuntimeError: generate stream ended before producing any token"
```

This is what `agentic-pd-hybrid` sees when the SGLang `/generate` SSE stream closes with zero output tokens — the upstream abort from layer 1 or layer 2 propagates as an empty stream.
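
Checking that every failure carries the same error string is a one-pass histogram over the metrics file. The sketch below relies only on the `error` key shown above; that successful records leave it absent or null is an assumption, and the records here are synthetic.

```python
import json
from collections import Counter
from typing import Iterable


def error_histogram(lines: Iterable[str]) -> Counter:
    """Count records per distinct error string, skipping records without an error."""
    histogram = Counter()
    for line in lines:
        line = line.strip()
        if not line:
            continue
        record = json.loads(line)
        error = record.get("error")   # assumed absent or null on success
        if error:
            histogram[error] += 1
    return histogram


EMPTY_STREAM = "RuntimeError: generate stream ended before producing any token"

# Synthetic records matching the reported split: 1054 failures, 231 successes.
synthetic_metrics = (
    [json.dumps({"error": EMPTY_STREAM})] * 1054
    + [json.dumps({"error": None})] * 231
)
errors = error_histogram(synthetic_metrics)
print(len(errors), "distinct error string(s);", errors[EMPTY_STREAM], "occurrences")
```

A single key in the histogram is what "the literal error string for all 1054" means in practice; any second key would point to a failure mode outside this cascade.
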
### The complete causal chain

```
Inferact shared "permissions instructions" boilerplate
    ↓
overlap term in kv-aware lex score never lets D2 win → D2 cold forever
    ↓
50 sessions all pinned to D0 / D1
    ↓
D0 / D1 KV pool saturates
    ↓
worker admission emits 562 × "no-space"                                     ← Layer 1
    ↓
router falls back to seed/reseed path (needs P→D mooncake transfer)
    ↓
P→D transfer queue piles up; D mooncake heartbeat drops
    ↓
"Decode instance could be dead" → KVTransferError                           ← Layer 2
    ↓
SGLang aborts the req → SSE stream closes with 0 tokens
    ↓
agentic-pd-hybrid raises "generate stream ended ..." for 1054 reqs          ← Layer 3
```
### Why E1 didn't hit this

E1 used `mechanism=pd-disaggregation`, which has no per-worker admission RPC. The router blindly dispatched to D0/D1; SGLang's internal scheduler simply queued requests behind the running batch (some waited more than 90 s for a first token). Sampling the 85 E1 errors shows `request-timeout-s=300` failures: old-fashioned timeouts on the agentic-pd-hybrid side, not mooncake or admission failures.

So:
- E1 trades latency for resilience: nobody rejects, everyone queues, you pay TTFT.
- E2's KVC v2 worker admission is *meant* to be a safety valve, but on the cold-D pathology it becomes an *amplifier*: rejects → fallback paths → backlog → mooncake heartbeat loss → cascading failures.
### The real fix

Worker admission per se is not the bug; the bug is that no D rebalancing happens upstream. With balanced D load (e.g. cold-D bonus in policy, or pre-warm of D2 with shared boilerplate), D0/D1 would not hit "no-space", and the layer 1 → layer 2 cascade would not fire. The reseed long-tail TTFT (8.74 s p99 here) would then become the dominant cost, which is exactly the regime onboarding §3.1 H3 describes.
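
One way to picture a cold-D bonus is as a correction to the overlap term of a lexicographic score. The sketch below is purely illustrative and is not the project's policy code: the `DecodeWorker` fields, the `(overlap, free KV)` ordering, the `cold_bonus_tokens` parameter, and all numbers are assumptions chosen to show the effect. The bonus applies only while a worker has zero bindings, so once D2 holds its first session it competes on ordinary overlap again; a real policy would probably want a decaying or load-aware bonus.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class DecodeWorker:
    name: str
    overlap_tokens: int    # prefix tokens already cached for the incoming request
    bindings: int          # sessions currently bound to this worker
    kv_free_frac: float    # fraction of the KV pool still free


def lex_score(worker: DecodeWorker) -> Tuple[int, float]:
    """Overlap first, free KV only as a tie-breaker (assumed shape of the kv-aware score)."""
    return (worker.overlap_tokens, worker.kv_free_frac)


def cold_bonus_score(worker: DecodeWorker, cold_bonus_tokens: int = 2048) -> Tuple[int, float]:
    """Same score, but a worker with zero bindings is credited with extra overlap."""
    bonus = cold_bonus_tokens if worker.bindings == 0 else 0
    return (worker.overlap_tokens + bonus, worker.kv_free_frac)


# Made-up snapshot of the pathology: shared boilerplate gives D0/D1 overlap, D2 is cold.
workers = [
    DecodeWorker("D0", overlap_tokens=1500, bindings=25, kv_free_frac=0.02),
    DecodeWorker("D1", overlap_tokens=1500, bindings=25, kv_free_frac=0.03),
    DecodeWorker("D2", overlap_tokens=0, bindings=0, kv_free_frac=1.00),
]

baseline_choice = max(workers, key=lex_score).name         # overlap dominates, D2 never wins
bonus_choice = max(workers, key=cold_bonus_score).name     # cold worker gets a chance
print(f"baseline picks {baseline_choice}, cold-D bonus picks {bonus_choice}")
```
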

---

## 6. What this experiment actually shows

1. **The H200 + driver 570 + cu12.8 toolchain works for production-scale SGLang xPyD workloads.** Both runs completed without CUDA / driver / mooncake errors; failures are policy- and workload-level, not infrastructure.