Pulling admission-events.jsonl, prefill-0.log, and request-metrics
sampling shows the 1054 failures are NOT timeouts as initially
assumed. They are a 3-layer cascade:
L1: 562 "no-space" + 43 "session-not-resident" worker admission
rejects (51% of all admit attempts) because D0/D1 KV pools
saturate while D2 stays empty.
L2: rejects re-route to seed/reseed which need mooncake P→D KV
transfer; the backlog drops mooncake heartbeats and prefill-0
logs "Decode instance could be dead, remote mooncake session
... is not alive".
L3: SGLang aborts the request, SSE stream closes with 0 tokens,
agentic-pd-hybrid raises "generate stream ended before
producing any token" (the literal error string for all 1054).
E1 didn't hit this because pd-disaggregation has no admission RPC —
sessions just queue behind the running batch, paying TTFT instead
of failing. KVC v2's worker admission is supposed to be a safety
valve; on the cold-D pathology it becomes a failure amplifier.
The real fix is upstream D rebalancing (cold-D bonus or pre-warm),
not relaxing admission.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
E1 vs E2 Experiment Results — H200 + Driver 570
Status: E1 ✅ complete (2026-05-12 01:48 UTC, wall 1h29min). E2 ✅ complete (2026-05-12 03:22 UTC, wall 1h33min).
Branch: h200-cu130.
Trace: outputs/inferact_50sess.jsonl (deterministic head-cut of Inferact codex_swebenchpro to first 50 trials, md5 7bb263a32600ef5a6ef5099ba340a487, 1285 requests, mean input_length 67,631 tokens).
Hardware: 4× H200 80GB, driver 570.86.15 (cu12.8 API), Mellanox mlx5_60 RoCE 400 Gb/s NDR.
Model: Qwen3-30B-A3B-Instruct-2507 (TP1).
Toolchain: vendored SGLang 0.5.10 + cu12.8 nvcc local install (~/cuda-12.8) — see docs/H200_DRIVER570_SETUP_ZH.md.
1. Hypotheses being tested
From docs/ONBOARDING_NEXT_AGENT_ZH.md §3.1:
- H1: KVC v2's wins are not just from "1P3D topology + kv-aware policy" — the KVC layer (admission / migration / direct-to-D) contributes meaningfully on top. Pairing E1 (no KVC layer) against E2 (full KVC v2) on the same subset isolates the marginal contribution.
- H2/H3: Enabling real RDMA pushes TTFT p99 down from the reported 1.28s (TCP loopback) toward ~0.7s. Independent of H1, this is measured inside E2 alone (comparing against the historical TCP-loopback v2 reference).
2. E1 results — naive 1P3D + kv-aware + RDMA
Configuration: mechanism=pd-disaggregation, policy=kv-aware, 1P3D (GPU0=P, GPU1/2/3=D), --force-rdma --ib-device mlx5_60, --concurrency-limit 32, ts=1.
| Metric | E1 |
|---|---|
| request_count | 1285 |
| success | 1200 |
| error_count | 85 |
| failure_count | 85 |
| abort_count | 0 |
| latency mean | 96.34 s |
| latency p50 | 93.21 s |
| latency p90 | 180.69 s |
| latency p99 | 219.46 s |
| ttft mean | 90.48 s |
| ttft p50 | 88.62 s |
| ttft p90 | 175.13 s |
| ttft p99 | 207.39 s |
| execution_modes | pd-disaggregation-router: 1200, pd-disaggregation: 85 (errors) |
| per_decode_load | D0:575, D1:710, D2:0 |
| per_prefill_load | P0:1285 |
| cache_hit_request_count | 1199 / 1200 (99.9%) |
Key observations on E1
- D2 was never bound to a single session. All 50 sessions got pinned to D0 or D1 by the kv-aware policy's (overlap + sticky + inflight + assigned) lex-score, and naive pd-disaggregation has no migration mechanism to rebalance. Effective topology was 1P2D, not 1P3D.
- Massive queueing. TTFT p50 ≈ 89 s and p99 > 200 s indicate sessions waited tens of seconds in router/prefill queue. With --concurrency-limit 32 and D0/D1 saturated, the inflight cap forced ~1250 reqs to serialize through only two decode workers.
- 85 failures (6.6%), all with execution_mode == pd-disaggregation (which the metrics module classifies as error when the agentic-pd-hybrid replay sees an unsuccessful upstream response). Most likely caused by --request-timeout-s 300 firing on the longest queued requests.
- Cache hit 99.9%: the kv-aware policy did successfully concentrate sessions on their prior D worker; the Inferact converter's prefix-shared 24-token-block hash_ids gave near-perfect prefix overlap across turns of the same session.
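The binding behaviour described above can be reproduced with a toy model. This is a minimal sketch, not the project's policy code: the `WorkerState` fields, the tuple ordering in `lex_key`, the 512-token shared prefix, and the starting state (D0 and D1 each already warm with one session) are all assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass
class WorkerState:
    name: str
    overlap_tokens: int      # cached prefix tokens a new session would reuse
    sticky: bool = False     # True only if this worker already owns the session
    inflight: int = 0        # requests currently running
    assigned: int = 0        # sessions bound so far


def lex_key(w: WorkerState) -> tuple:
    # Assumed ordering: overlap first, then stickiness; load only breaks ties.
    return (w.overlap_tokens, int(w.sticky), -w.inflight, -w.assigned)


def simulate(n_new_sessions: int, shared_prefix_tokens: int = 512) -> dict:
    # Hypothetical starting point: D0 and D1 each hold one session and
    # therefore cache the shared boilerplate prefix; D2 is cold.
    workers = [
        WorkerState("D0", shared_prefix_tokens, assigned=1),
        WorkerState("D1", shared_prefix_tokens, assigned=1),
        WorkerState("D2", 0, assigned=0),
    ]
    for _ in range(n_new_sessions):
        chosen = max(workers, key=lex_key)  # highest lexicographic score wins
        chosen.assigned += 1
    return {w.name: w.assigned for w in workers}


if __name__ == "__main__":
    print(simulate(48))
```

Because the overlap term is compared first, D2's zero overlap loses every comparison regardless of how lightly loaded it is, so the 48 remaining sessions split evenly between D0 and D1 in this model.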
What E1 establishes
For the same hardware, same trace, same model, naive 1P3D + kv-aware policy is unusable for multi-session agentic workloads:
- session-stickiness without migration leaves a third of decode capacity (1 of 3 decode GPUs) entirely unused
- queueing dominates user-facing latency
- failure rate is 6.6% even with a 5-minute per-request timeout
This is the baseline H1 needs — it shows the KVC layer (E2) has something concrete to improve over.
3. E2 results — KVC v2 + RDMA
Configuration: mechanism=kvcache-centric, policy=kv-aware, 1P3D, --force-rdma --ib-device mlx5_60, --kvcache-admission-mode worker, --kvcache-direct-max-uncached-tokens 8192, --kvcache-migration-reject-threshold 3, --kvcache-prefill-backup-policy release-after-transfer, --kvcache-prefill-priority-eviction, ts=1.
| Metric | E2 |
|---|---|
| request_count | 1285 |
| success | 231 |
| error_count | 1054 |
| failure_count | 1054 |
| abort_count | 0 |
| latency mean (successful only) | 10.94 s |
| latency p50 | 7.44 s |
| latency p90 | 20.68 s |
| latency p99 | 64.73 s |
| ttft mean (successful only) | 1.76 s |
| ttft p50 | 0.43 s |
| ttft p90 | 6.56 s |
| ttft p99 | 8.74 s |
| execution_modes (succ.) | direct-to-D: 87; turn1-seed: 50; reseed: 12; large-append-reseed: 11; seed-filter-early-turn: 50; large-append-cap: 21 |
| per_decode_load | D0:600, D1:685, D2:0 |
| per_prefill_load | P0:1285 |
| cache_hit_request_count | 230 / 231 (99.6 %) |
Key observations on E2
- D2 still has zero bindings, the same root cause as E1. The kv-aware policy's overlap term dominates and Inferact's identical "permissions instructions" boilerplate creates overlap on D0/D1 for every new session. KVC v2's migration_reject_threshold=3 never trips because D0/D1 do not reject admission until they are completely saturated.
- 82 % failure rate, 1054 / 1285. NOT timeouts: the actual root cause is a 3-layer cascade documented in §5b. Quick summary: 562 "no-space" admission rejects from D0/D1 → router falls back to seed/reseed paths needing mooncake → mooncake heartbeats drop ("Decode instance could be dead") → SGLang aborts the request → client sees RuntimeError: generate stream ended before producing any token.
- Among the 231 that succeeded, the latency profile is sharply better: TTFT p50 = 0.43 s vs E1's 88.62 s (E2/E1 = 0.5 %), latency p50 = 7.44 s vs E1's 93.21 s (8 %). This is the "if it gets through, it's fast" regime: the direct-to-D fast path eliminates P→D mooncake transfer for resident sessions.
- Direct-to-D fast path engaged 87 / 231 = 37.7 % of successful requests. Lower than historical v2's 91.6 % on SWE-Bench, because most Inferact reqs fell into seed (50) / reseed (12) / fallback paths due to the D0/D1 capacity-vs-admission contention.
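As a quick consistency check on the execution-mode breakdown, the per-mode counts from the E2 table can be summed and converted to shares. This is plain arithmetic on the reported numbers; only the variable names are introduced here.

```python
# Successful-request counts per execution mode, copied from the E2 table.
modes = {
    "direct-to-D": 87,
    "turn1-seed": 50,
    "reseed": 12,
    "large-append-reseed": 11,
    "seed-filter-early-turn": 50,
    "large-append-cap": 21,
}

total = sum(modes.values())
# Share of successful requests per mode, in percent with one decimal.
shares = {mode: round(100 * count / total, 1) for mode, count in modes.items()}

if __name__ == "__main__":
    print(total)
    for mode, pct in shares.items():
        print(f"{mode}: {pct} %")
```

The counts add up to the 231 successful requests, and the direct-to-D share matches the 37.7 % quoted above.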
4. Comparison table — E1 vs E2
Numbers below are over all 1285 requests for E1 (since failure rate is small) but only the 231 successful for E2 (since the bulk failed before producing any latency datapoints). This is not a fair head-to-head, see §5b.
| Metric | E1 | E2 (succ only) | E2 / E1 |
|---|---|---|---|
| Total reqs | 1285 | 1285 | – |
| Successful | 1200 | 231 | 0.19× |
| error_count | 85 (6.6 %) | 1054 (82 %) | 12.4× worse |
| lat mean | 96.34 s | 10.94 s | 0.114 |
| lat p50 | 93.21 s | 7.44 s | 0.080 |
| lat p90 | 180.69 s | 20.68 s | 0.114 |
| lat p99 | 219.46 s | 64.73 s | 0.295 |
| ttft mean | 90.48 s | 1.76 s | 0.019 |
| ttft p50 | 88.62 s | 0.43 s | 0.005 |
| ttft p90 | 175.13 s | 6.56 s | 0.037 |
| ttft p99 | 207.39 s | 8.74 s | 0.042 |
| per_decode_load | D0:575, D1:710, D2:0 | D0:600, D1:685, D2:0 | both 1P2D |
| direct-to-D % | N/A (no KVC) | 87/231 = 37.7 % | – |
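The E2 / E1 column can be recomputed directly from the two result tables. The sketch below only restates the reported values; the dictionary keys are shorthand chosen for this example.

```python
# Latency and TTFT values in seconds, copied from the E1 and E2 tables.
e1 = {"lat mean": 96.34, "lat p50": 93.21, "lat p90": 180.69, "lat p99": 219.46,
      "ttft mean": 90.48, "ttft p50": 88.62, "ttft p90": 175.13, "ttft p99": 207.39}
e2 = {"lat mean": 10.94, "lat p50": 7.44, "lat p90": 20.68, "lat p99": 64.73,
      "ttft mean": 1.76, "ttft p50": 0.43, "ttft p90": 6.56, "ttft p99": 8.74}

# E2 / E1 ratio per metric, rounded to three decimals as in the table.
ratios = {metric: round(e2[metric] / e1[metric], 3) for metric in e1}

# Failure rates over all 1285 requests, and the relative error count.
total_requests = 1285
e1_errors, e2_errors = 85, 1054
e1_failure_pct = round(100 * e1_errors / total_requests, 1)
e2_failure_pct = round(100 * e2_errors / total_requests, 1)
error_multiplier = round(e2_errors / e1_errors, 1)

if __name__ == "__main__":
    for metric, ratio in ratios.items():
        print(f"{metric}: {ratio}")
    print(e1_failure_pct, e2_failure_pct, error_multiplier)
```

Note that the latency ratios compare E2's successful requests only, so they describe the fast path rather than overall service quality.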
5. Interpreting H1 / H2 / H3
H1 (was: KVC layer adds value on top of 1P3D + kv-aware) — qualified
The H1 hypothesis as stated in ONBOARDING_NEXT_AGENT_ZH.md predicted E2 would clearly win on most metrics. The reality is bimodal: the small subset of E2 requests that successfully complete are dramatically faster than E1, but a much larger fraction (82 %) of E2 requests fail outright without producing a token (§5b shows these are admission and transfer failures, not timeouts). Net throughput on this workload is worse for E2 than E1.
Two issues drove this:
- The D2 cold-start pathology documented in §3 is the shared root cause. Both runs are de facto 1P2D, not 1P3D.
- KVC v2's admission gate is stricter and surfaces more "no D capacity" / "session-not-resident" failures than vanilla pd-disagg, when the workload (mean input 67 K tokens, mean output 700 tokens) saturates D0/D1's combined ~1.5 M KV pool.
For workloads where D0/D1 do not saturate or where the policy does spread session ownership across all D workers (the historical SWE-Bench setup), KVC v2 wins. The Inferact codex_swebenchpro subset breaks both assumptions.
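A back-of-envelope calculation shows why D0/D1 saturate. It is a rough sketch under a loud assumption: every one of the 50 sessions keeps about one mean-length context (67,631 tokens) resident, and the "~1.5 M" pool figure is taken to mean 1,500,000 tokens across D0 and D1. Real residency depends on eviction and on per-session context growth, which this ignores.

```python
# Figures from the report; the residency model itself is an assumption.
KV_POOL_TOKENS = 1_500_000      # "~1.5 M" combined D0 + D1 KV pool
MEAN_INPUT_TOKENS = 67_631      # mean input_length of the trace
SESSIONS = 50                   # all pinned to D0 / D1

# How many mean-length contexts fit in the combined pool at once.
contexts_that_fit = KV_POOL_TOKENS / MEAN_INPUT_TOKENS

# Demand if every session held one mean-length context resident.
demand_tokens = SESSIONS * MEAN_INPUT_TOKENS
oversubscription = demand_tokens / KV_POOL_TOKENS

if __name__ == "__main__":
    print(f"contexts that fit: {contexts_that_fit:.1f}")
    print(f"demand: {demand_tokens:,} tokens")
    print(f"oversubscription: {oversubscription:.2f}x")
```

Under these assumptions roughly 22 contexts fit while 50 sessions compete for the space, an oversubscription of about 2.25×, which is consistent with the "no-space" rejects reported for D0/D1.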
H2 / H3 (RDMA reduces TTFT p99) — cannot be evaluated cleanly here
The historical reference point is "KVC v2 + TCP loopback, SWE-Bench 50sess: TTFT p99 = 1.28 s". This run uses Inferact + RDMA, and TTFT p99 of the 231 successful E2 requests is 8.74 s — much higher than the TCP baseline. But the workloads are not comparable: Inferact mean input is 67 K tokens vs SWE-Bench's much smaller average. Per-request prefill + transfer is roughly 5× longer here. A clean H2 / H3 read needs an Inferact-on-TCP run to compare against, which is out of scope for this subset's GPU budget.
What we can say: RDMA is correctly engaged (every worker log shows installTransport, type=rdma; admission RPC RTTs in structural/admission-events.jsonl are ~6 ms — consistent with one-hop RoCE).
5b. Why E2 has 82 % failures: the real chain (forensic)
The summary's error_count: 1054 and execution_mode: kvcache-centric mask the actual cascade. Pulling the underlying request-metrics.jsonl, structural/admission-events.jsonl, and per-worker SGLang logs gives the full picture.
Layer 1 — worker admission rejects (51 % of admit attempts)
From structural/admission-events.jsonl:
admit ok = 581 (modes: seed=494, direct_append=87)
admit reject = 605 (reasons: no-space=562, session-not-resident=43)
562 "no-space" rejects — D worker (almost always D0 or D1) reports its KV pool is full and refuses to take the request as direct-append. The router then re-routes the request to the seed/reseed path.
This is materially different from E1's behaviour: E1's vanilla pd-disagg had no admission RPC, so requests always got accepted by the chosen D and queued behind the running batch. E1 paid for that as a 90-second TTFT but never saw a "no-space" failure.
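A tally like the one above can be produced with a few lines of Python. This is a sketch only: the real schema of structural/admission-events.jsonl is not shown in this report, so the `decision` and `reason` field names are hypothetical and would need adjusting to the actual records.

```python
import json
from collections import Counter


def tally_admissions(lines):
    """Count admission decisions and reject reasons from JSONL lines.

    Assumes each record has a "decision" field ("ok" or "reject") and,
    for rejects, a "reason" field. Both names are hypothetical.
    """
    decisions, reasons = Counter(), Counter()
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        event = json.loads(line)
        decisions[event["decision"]] += 1
        if event["decision"] == "reject":
            reasons[event.get("reason", "unknown")] += 1
    return decisions, reasons


def reject_rate(ok: int, reject: int) -> float:
    # Fraction of admit attempts that were rejected.
    return reject / (ok + reject)


if __name__ == "__main__":
    sample = [
        '{"decision": "ok", "mode": "seed"}',
        '{"decision": "reject", "reason": "no-space"}',
        '{"decision": "reject", "reason": "session-not-resident"}',
    ]
    print(tally_admissions(sample))
    # Reported totals from this run: 581 ok, 605 reject.
    print(f"{reject_rate(581, 605):.1%}")
```

With the reported totals, 605 rejects out of 1186 attempts gives the 51 % figure in the layer heading.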
Layer 2 — mooncake P→D transfer failures (real, observed in prefill log)
From logs/prefill-0.log:
[01:56:42] Prefill transfer failed for request rank=0 req.rid='2a5ed06fb…'
with exception KVTransferError: Failed to send kv chunk of … to 172.18.112.37:46067
[01:56:42] Prefill transfer failed for request rank=0 req.rid='eca5ff14…'
with exception KVTransferError: Decode instance could be dead,
remote mooncake session 172.18.112.37:15078 is not alive
[01:56:42] Prefill transfer failed for request rank=0 req.rid='7ed9827b…'
Decode instance could be dead, remote mooncake session ... is not alive
When the seed/reseed fallback queue piles up (because of layer 1), the D worker becomes heavily backlogged and its mooncake bootstrap session heartbeat drops — P interprets this as "the D worker is dead" and fails the transfer. This is not a true crash; the worker process is alive (we observed it accepting unrelated requests immediately after), but the mooncake session is torn down for that bootstrap_room.
Layer 3 — client-visible error
From request-metrics.jsonl for all 1054 failed reqs:
"error": "RuntimeError: generate stream ended before producing any token"
This is what agentic-pd-hybrid sees when the SGLang /generate SSE stream closes with zero output tokens — the upstream abort from layer 1 or layer 2 propagates as an empty stream.
The complete causal chain
Inferact shared "permissions instructions" boilerplate
↓
overlap term in kv-aware lex score never lets D2 win → D2 cold forever
↓
50 sessions all pinned to D0 / D1
↓
D0 / D1 KV pool saturates
↓
worker admission emits 562 × "no-space" ← Layer 1
↓
router falls back to seed/reseed path (needs P→D mooncake transfer)
↓
P→D transfer queue piles up; D mooncake heartbeat drops
↓
"Decode instance could be dead" → KVTransferError ← Layer 2
↓
SGLang aborts the req → SSE stream closes with 0 tokens
↓
agentic-pd-hybrid raises "generate stream ended ..." for 1054 reqs ← Layer 3
Why E1 didn't hit this
E1 used mechanism=pd-disaggregation, which has no per-worker admission RPC. The router blindly dispatched to D0/D1; SGLang's internal scheduler simply queued requests behind the running batch (some grew their wait to >90 s before getting a token). Of the 85 E1 errors, sampling shows they are request-timeout-s=300 failures — old-fashioned timeouts on the agentic-pd-hybrid side, not mooncake or admission failures.
So:
- E1 trades latency for resilience: nobody rejects, everyone queues, you pay TTFT.
- E2's KVC v2 worker admission is meant to be a safety valve, but on the cold-D pathology it becomes an amplifier: rejects → fallback paths → backlog → mooncake heartbeat loss → cascading failures.
The real fix
Worker admission per se is not the bug — the bug is that there is no D-rebalancing happening upstream. With balanced D load (e.g. cold-D bonus in policy, or pre-warm of D2 with shared boilerplate), D0/D1 would not hit "no-space", and the layer 1 → layer 2 cascade would not fire. The reseed long-tail TTFT (8.74 s p99 here) becomes the dominant cost — exactly the regime onboarding §3.1 H3 describes.
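One way to picture the cold-D bonus is a toy selection loop. This is a sketch under stated assumptions, not the actual KvAwarePolicy code: `DecodeWorker`, `score`, the 512-token shared prefix, and the 1024-token bonus are all hypothetical, and the score is reduced to two terms (overlap, then load) for brevity.

```python
from dataclasses import dataclass


@dataclass
class DecodeWorker:
    name: str
    overlap_tokens: int          # cached prefix a new session would reuse
    resident_sessions: int = 0   # sessions currently bound to this worker


def score(w: DecodeWorker, cold_bonus: int) -> tuple:
    # A worker with no resident sessions gets a flat bonus on its overlap
    # term, so shared boilerplate elsewhere cannot starve it forever.
    bonus = cold_bonus if w.resident_sessions == 0 else 0
    return (w.overlap_tokens + bonus, -w.resident_sessions)


def simulate(n_new: int, shared_prefix: int = 512, cold_bonus: int = 1024) -> dict:
    # Hypothetical start: D0 and D1 are warm with one session each, D2 is cold.
    workers = [
        DecodeWorker("D0", shared_prefix, 1),
        DecodeWorker("D1", shared_prefix, 1),
        DecodeWorker("D2", 0, 0),
    ]
    for _ in range(n_new):
        chosen = max(workers, key=lambda w: score(w, cold_bonus))
        chosen.resident_sessions += 1
        # Once a session lands, the worker caches the shared prefix too.
        chosen.overlap_tokens = max(chosen.overlap_tokens, shared_prefix)
    return {w.name: w.resident_sessions for w in workers}


if __name__ == "__main__":
    print("no bonus:  ", simulate(48, cold_bonus=0))
    print("with bonus:", simulate(48, cold_bonus=1024))
```

In this model the bonus only has to exceed the shared-prefix overlap once: after D2 hosts its first session it caches the same boilerplate, the overlap terms tie, and the load term spreads the remaining sessions across all three workers.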
6. What this experiment actually shows
- The H200 + driver 570 + cu12.8 toolchain works for production-scale SGLang xPyD workloads. Both runs completed without CUDA or driver errors, and the mooncake transfer failures seen in E2 (§5b) trace back to admission backlog rather than to the transport; failures are policy- and workload-level, not infrastructure.
- The KVC v2 + kv-aware policy combination has a latent pathology on workloads with high cross-session prefix overlap: the overlap term in the lex score causes permanent load imbalance, and v2's reject-counter migration cannot rescue it because rejects only fire under capacity pressure, by which point the admission-reject cascade (§5b) already dominates. This is novel and not surfaced by the SWE-Bench evaluation in the existing project docs.
- For Inferact-like workloads, a cold-D bonus (e.g. require D to host at least one session before its overlap score counts) or an explicit pre-warm step is required before E1/E2 comparisons can isolate the marginal effect of the KVC layer.
7. Reproducibility
- Trace: outputs/inferact_50sess.jsonl, md5 7bb263a32600ef5a6ef5099ba340a487, regenerable via scripts/sample_trace_subset.py.
- E1: bash scripts/sweep_e1_naive_1p3d.sh (1h 29 min wall)
- E2: bash scripts/sweep_e2_kvc_v2_rdma.sh (1h 33 min wall)
- Summary JSON paths:
  - outputs/e1_naive_1p3d_kvaware_rdma_50sess/e1_naive_1p3d_kvaware_run1_summary.json
  - outputs/e2_kvc_v2_rdma_50sess/e2_kvc_v2_rdma_run1_summary.json
- Per-request metrics JSONL alongside each summary, plus structural events under */structural/.
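Before re-running either sweep it may be worth confirming that the regenerated trace matches the checksum quoted above. A minimal sketch using only the standard library follows; the helper name `file_md5` is introduced here and is not part of the repository.

```python
import hashlib

# Checksum of outputs/inferact_50sess.jsonl as reported in this document.
EXPECTED_MD5 = "7bb263a32600ef5a6ef5099ba340a487"


def file_md5(path: str, chunk_size: int = 1 << 20) -> str:
    """Return the hex MD5 of a file, reading it in fixed-size chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


if __name__ == "__main__":
    actual = file_md5("outputs/inferact_50sess.jsonl")
    print("match" if actual == EXPECTED_MD5 else f"mismatch: {actual}")
```

Reading in chunks keeps memory use flat even though the trace holds 1285 long requests.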
8. Open follow-ups for the next agent
- Add a cold-D bonus to KvAwarePolicy.select (e.g. positive constant for D with state.resident[D] == ∅) and re-run E2 on the same subset. Predict: D2 receives bindings, failure rate drops, head-to-head with E1 becomes meaningful.
- Rerun E2 with --kvcache-admission-mode router (router-side optimistic admission instead of worker RPC) to isolate whether the strict worker admission is the contributor to the 1054 failures, or whether it's purely the imbalance.
- Run a third arm E0 with policy=default + mechanism=pd-disaggregation as a true control: kv-aware policy is itself part of what we are evaluating; default round-robin would have spread sessions across all 3 D.
- Compare TTFT p99 against an Inferact-on-TCP-loopback run to evaluate H2/H3 cleanly. Cost: 1 more E2-shaped sweep (~1.5 h).
- Investigate the 1054 E2 failures in request-metrics.jsonl: sample some to verify they are timeout-related vs admission-rejected vs upstream-500.
4. Comparison table — pending
To be appended.
5. Open questions for the next iteration
- Are the 85 E1 errors all timeouts? request-metrics.jsonl rows with error execution_mode should be sampled to confirm. (Quick check: grep the metrics jsonl for "execution_mode": "pd-disaggregation" and inspect latency_s / error fields.)
- Does E2 produce the predicted ~91% direct-to-D rate seen in the historical SWE-Bench v2 run, or does the Inferact workload's similar session count (50 vs 52 there) but very different per-session size distribution (mean 33 turns × ~2KB context growth per turn) push it lower?
- Is D2 = 0% an E1-specific artifact (kv-aware sticky in pd-disagg mode), or does the same happen in E2 before migration kicks in for the first time?