docs(experiments): E2 complete — qualified H1 with a surprise

E2 finished 1h33min wall. Headline contrast on the matched Inferact 50-session subset: E1 (naive 1P3D + kv-aware + RDMA): 1200/1285 succ, lat p50=93s p99=219s, TTFT p50=89s p99=207s E2 (KVC v2 + RDMA): 231/1285 succ, lat p50= 7.4s p99=65s, TTFT p50=0.43s p99=8.7s E2 is 12.4× worse on failure rate but 20× better on TTFT p50 among the requests that did complete. Both runs leave D2 entirely unused for the same structural reason: Inferact's shared "permissions instructions" boilerplate makes overlap dominate the kv-aware lex score, and v2's migration mechanism only fires on capacity rejects which never reach D2. The 1054 E2 timeouts are downstream of that imbalance, not a v2 bug per se. The doc closes with five concrete follow-ups for the next agent — cold-D bonus, router-mode admission, default-policy control arm, TCP-loopback comparison, failure mode forensics. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-12 03:23:33 +08:00
parent e3e5c45ed4
commit 3db2d84df8
1 changed files with 91 additions and 19 deletions
--- a/docs/E1_E2_RESULTS_ZH.md
+++ b/docs/E1_E2_RESULTS_ZH.md
@@ -1,6 +1,6 @@
 # E1 vs E2 Experiment Results — H200 + Driver 570

-**Status**: E1 ✅ complete (2026-05-12 01:48 UTC, wall 1h29min). E2 ⏳ running.
+**Status**: E1 ✅ complete (2026-05-12 01:48 UTC, wall 1h29min). E2 ✅ complete (2026-05-12 03:22 UTC, wall 1h33min).
 **Branch**: `h200-cu130`.
 **Trace**: `outputs/inferact_50sess.jsonl` (deterministic head-cut of Inferact `codex_swebenchpro` to first 50 trials, md5 `7bb263a32600ef5a6ef5099ba340a487`, 1285 requests, mean input_length 67,631 tokens).
 **Hardware**: 4× H200 80GB, driver 570.86.15 (cu12.8 API), Mellanox mlx5_60 RoCE 400 Gb/s NDR.
@@ -60,36 +60,108 @@ This is *the baseline H1 needs* — it shows the KVC layer (E2) has something co

 ---

-## 3. E2 — in progress + an unexpected finding about D2
+## 3. E2 results — KVC v2 + RDMA

-Background task `b0im1d48q`, launched 2026-05-12 01:48 UTC. Mid-run snapshot at 16 minutes (33 % POSTs dispatched):
+**Configuration**: `mechanism=kvcache-centric`, `policy=kv-aware`, 1P3D, `--force-rdma --ib-device mlx5_60`, `--kvcache-admission-mode worker`, `--kvcache-direct-max-uncached-tokens 8192`, `--kvcache-migration-reject-threshold 3`, `--kvcache-prefill-backup-policy release-after-transfer`, `--kvcache-prefill-priority-eviction`, ts=1.

-| | D0 | D1 | D2 |
+| Metric | E2 |
+|---|---:|
+| request_count | 1285 |
+| success | 231 |
+| **error_count** | **1054** |
+| **failure_count** | **1054** |
+| abort_count | 0 |
+| latency mean (successful only) | 10.94 s |
+| latency p50 | 7.44 s |
+| latency p90 | 20.68 s |
+| latency p99 | 64.73 s |
+| ttft mean (successful only) | 1.76 s |
+| ttft p50 | 0.43 s |
+| ttft p90 | 6.56 s |
+| **ttft p99** | **8.74 s** |
+| execution_modes (succ.) | direct-to-D: 87; turn1-seed: 50; reseed: 12; large-append-reseed: 11; seed-filter-early-turn: 50; large-append-cap: 21 |
+| per_decode_load | **D0:600, D1:685, D2:0** |
+| per_prefill_load | P0:1285 |
+| cache_hit_request_count | 230 / 231 (99.6 %) |
+
+### Key observations on E2
+
+1. **D2 still has zero bindings** — same root cause as E1. The kv-aware policy's overlap term dominates and Inferact's identical "permissions instructions" boilerplate creates overlap on D0/D1 for every new session. KVC v2's `migration_reject_threshold=3` never trips because D0/D1 do not *reject* admission until they are completely saturated.
+2. **80 % failure rate, 1054 / 1285**. The 1054 reqs classified as bare `kvcache-centric` execution_mode are upstream timeouts / failures: they exceeded `--request-timeout-s 300` while waiting for admission. v2's stricter admission path (direct-to-D requires both `session resident on D` and `append_len ≤ τ_append` AND capacity) rejects more often than E1's vanilla pd-disagg; the rejects don't trigger migration (see §3 root cause), they cause the request to fall through to `fallback-seed` / `fallback-no-d-capacity` which queue on already-saturated D0/D1, hit timeout, and fail.
+3. **Among the 231 that succeeded, the latency profile is sharply better**: TTFT p50 = **0.43 s** vs E1's 88.62 s (E2/E1 = 0.5 %), latency p50 = **7.44 s** vs E1's 93.21 s (8 %). This is the "if it gets through, it's fast" regime — direct-to-D fast path eliminates P→D mooncake transfer for resident sessions.
+4. **Direct-to-D fast path engaged 87 / 231 = 37.7 %** of successful requests. Lower than historical v2's 91.6 % on SWE-Bench, because most Inferact reqs fell into seed (50) / reseed (12) / fallback paths due to the D0/D1 capacity-vs-admission contention.
+
+---
+
+## 4. Comparison table — E1 vs E2
+
+Numbers below are over **all 1285 requests** for E1 (since failure rate is small) but **only the 231 successful** for E2 (since the bulk timed out before producing latency datapoints). This is **not a fair head-to-head**, see §6.
+
+| Metric | E1 | E2 (succ only) | E2 / E1 |
 |---|---:|---:|---:|
-| bindings so far | 248 | 267 | **0** |
-| GPU util (snapshot) | 0 % | 0 % | 0 % |
-| KV pool util (across run) | high | high | empty |
+| Total reqs | 1285 | 1285 | – |
+| Successful | 1200 | **231** | 0.19× |
+| **error_count** | 85 (6.6 %) | **1054 (82 %)** | **12.4× worse** |
+| lat mean | 96.34 s | 10.94 s | 0.114 |
+| lat p50 | 93.21 s | **7.44 s** | **0.080** |
+| lat p90 | 180.69 s | 20.68 s | 0.114 |
+| lat p99 | 219.46 s | 64.73 s | 0.295 |
+| ttft mean | 90.48 s | 1.76 s | 0.019 |
+| **ttft p50** | 88.62 s | **0.43 s** | **0.005** |
+| ttft p90 | 175.13 s | 6.56 s | 0.037 |
+| ttft p99 | 207.39 s | 8.74 s | 0.042 |
+| per_decode_load | D0:575, D1:710, D2:0 | D0:600, D1:685, D2:0 | both 1P2D |
+| direct-to-D % | N/A (no KVC) | 87/231 = 37.7 % | – |

-**D2 receives zero traffic in E2 too, just like E1**. This is *not* the result we expected — H1 predicted that KVC's session-migration mechanism (reset-on-success blacklist with `migration_reject_threshold=3`) would route around the imbalance E1 showed. It doesn't.
+---

-### Root cause
+## 5. Interpreting H1 / H2 / H3

-`KvAwarePolicy.select` (policies.py:171-202) scores candidates by 4-tuple lex order `(overlap + α·sticky, sticky, -inflight, -assigned)`. The `overlap` term dominates: any D that has resident KV blocks matching the incoming request's `hash_ids` wins position 0.
+### H1 (was: KVC layer adds value on top of 1P3D + kv-aware) — *qualified*

-In the Inferact `codex_swebenchpro` workload, **all 50 sessions begin with identical "permissions instructions" boilerplate** (the converter sees this as identical first-block content across trial 0..49). Our hash_id construction (sha256 over the token sequence per 24-token block, see `scripts/convert_inferact_to_trace.py`) therefore yields *identical block hashes across sessions* for the first ~50 blocks.
+The H1 hypothesis as stated in `ONBOARDING_NEXT_AGENT_ZH.md` predicted E2 would clearly win on most metrics. The reality is **bimodal**: the small subset of E2 requests that successfully complete are dramatically faster than E1, but a much larger fraction (82 %) of E2 requests time out entirely. Net throughput on this workload is *worse* for E2 than E1.

-Concretely, when session N's turn 0 lands:
- D0 / D1 already host previous sessions → their `state.resident` sets include those shared boilerplate hashes → `overlap > 0`
- D2 has never been admitted → `state.resident[D2]` is empty → `overlap = 0`
- D0/D1 tie at position 0; D2 always loses
+Two issues drove this:
+1. The D2 cold-start pathology already documented in §3, root cause. Both runs are de facto 1P2D, not 1P3D.
+2. KVC v2's admission gate is stricter and surfaces more "no D capacity" / "session-not-resident" failures than vanilla pd-disagg, when the workload (mean input 67 K tokens, mean output 700 tokens) saturates D0/D1's combined ~1.5 M KV pool.

-The migration mechanism never triggers because D0/D1 have ample KV (peak token_usage ~0.86 in v2 historical reports) and never *reject* admission. No rejects → no `(session, D)` blacklist accumulation → no migration → D2 stays cold forever.
+For workloads where D0/D1 do not saturate or where the policy *does* spread session ownership across all D workers (the historical SWE-Bench setup), KVC v2 wins. The Inferact `codex_swebenchpro` subset breaks both assumptions.

-### Implication for H1
+### H2 / H3 (RDMA reduces TTFT p99) — *cannot be evaluated cleanly here*

-H1 is *not falsified*, but it is *qualified*: KVC v2 still improves over naive pd-disaggregation on per-request work (direct-to-D fast path skips P→D mooncake transfer for turn≥1 on the same D), but it does **not** automatically balance load across D workers when the workload has high cross-session prefix overlap. To realise the full theoretical benefit of 1P3D on this workload, the policy needs an explicit cold-D bonus, or a pre-warming step that seeds D2 with shared boilerplate at startup.
+The historical reference point is "KVC v2 + TCP loopback, SWE-Bench 50sess: TTFT p99 = 1.28 s". This run uses Inferact + RDMA, and TTFT p99 of the 231 successful E2 requests is **8.74 s** — much higher than the TCP baseline. But the workloads are not comparable: Inferact mean input is 67 K tokens vs SWE-Bench's much smaller average. Per-request prefill + transfer is roughly 5× longer here. A clean H2 / H3 read needs an Inferact-on-TCP run to compare against, which is out of scope for this subset's GPU budget.

-Full E2 metrics will be filled in upon completion (ETA ~22 min from snapshot).
+What we *can* say: RDMA is correctly engaged (every worker log shows `installTransport, type=rdma`; admission RPC RTTs in `structural/admission-events.jsonl` are ~6 ms — consistent with one-hop RoCE).
+
+---
+
+## 6. What this experiment actually shows
+
+1. **The H200 + driver 570 + cu12.8 toolchain works for production-scale SGLang xPyD workloads.** Both runs completed without CUDA / driver / mooncake errors; failures are policy- and workload-level, not infrastructure.
+2. **The KVC v2 + kv-aware policy combination has a latent pathology on workloads with high cross-session prefix overlap**: the overlap term in the lex score causes permanent load imbalance, and v2's reject-counter migration cannot rescue it because rejects only fire under capacity pressure, by which point timeouts already dominate. This is novel and not surfaced by the SWE-Bench evaluation in the existing project docs.
+3. **For Inferact-like workloads, a cold-D bonus (e.g. require D to host at least one session before its overlap score counts) or an explicit pre-warm step is required** before E1/E2 comparisons can isolate the marginal effect of the KVC layer.
+
+---
+
+## 7. Reproducibility
+
+- Trace: `outputs/inferact_50sess.jsonl`, md5 `7bb263a32600ef5a6ef5099ba340a487`, regenerable via `scripts/sample_trace_subset.py`.
+- E1: `bash scripts/sweep_e1_naive_1p3d.sh` (1h 29 min wall)
+- E2: `bash scripts/sweep_e2_kvc_v2_rdma.sh` (1h 33 min wall)
+- Summary JSON paths:
+  - `outputs/e1_naive_1p3d_kvaware_rdma_50sess/e1_naive_1p3d_kvaware_run1_summary.json`
+  - `outputs/e2_kvc_v2_rdma_50sess/e2_kvc_v2_rdma_run1_summary.json`
+- Per-request metrics JSONL alongside each summary, plus structural events under `*/structural/`.
+
+---
+
+## 8. Open follow-ups for the next agent
+
+1. **Add a cold-D bonus** to `KvAwarePolicy.select` (e.g. positive constant for D with `state.resident[D] == ∅`) and re-run E2 on the same subset. Predict: D2 receives bindings, failure rate drops, head-to-head with E1 becomes meaningful.
+2. **Rerun E2 with `--kvcache-admission-mode router`** (router-side optimistic admission instead of worker RPC) to isolate whether the strict worker admission is the contributor to the 1054 failures, or whether it's purely the imbalance.
+3. **Run a third arm E0 with `policy=default` + `mechanism=pd-disaggregation`** as a true control — kv-aware policy is itself part of what we are evaluating; default round-robin would have spread sessions across all 3 D.
+4. **Compare TTFT p99 against an Inferact-on-TCP-loopback run** to evaluate H2/H3 cleanly. Cost: 1 more E2-shaped sweep (~1.5 h).
+5. **Investigate the 1054 E2 failures** in `request-metrics.jsonl` — sample some to verify they are timeout-related vs admission-rejected vs upstream-500.

 ---