docs(experiments): E2 complete — qualified H1 with a surprise

E2 finished 1h33min wall. Headline contrast on the matched Inferact
50-session subset:

E1 (naive 1P3D + kv-aware + RDMA):
  1200/1285 succ, lat p50=93s p99=219s, TTFT p50=89s p99=207s
E2 (KVC v2 + RDMA):
   231/1285 succ, lat p50= 7.4s p99=65s, TTFT p50=0.43s p99=8.7s

E2 is 12.4× worse on failure rate but ~200× better on TTFT p50 among
the requests that did complete. Both runs leave D2 entirely unused
for the same structural reason: Inferact's shared "permissions
instructions" boilerplate makes overlap dominate the kv-aware lex
score, and v2's migration mechanism only fires on capacity rejects
which never reach D2. The 1054 E2 timeouts are downstream of that
imbalance, not a v2 bug per se.

The doc closes with five concrete follow-ups for the next agent —
cold-D bonus, router-mode admission, default-policy control arm,
TCP-loopback comparison, failure mode forensics.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
tim
2026-05-12 03:23:33 +08:00
parent e3e5c45ed4
commit 3db2d84df8


@@ -1,6 +1,6 @@
# E1 vs E2 Experiment Results — H200 + Driver 570
**Status**: E1 ✅ complete (2026-05-12 01:48 UTC, wall 1h29min). E2 ⏳ running.
**Status**: E1 ✅ complete (2026-05-12 01:48 UTC, wall 1h29min). E2 ✅ complete (2026-05-12 03:22 UTC, wall 1h33min).
**Branch**: `h200-cu130`.
**Trace**: `outputs/inferact_50sess.jsonl` (deterministic head-cut of Inferact `codex_swebenchpro` to first 50 trials, md5 `7bb263a32600ef5a6ef5099ba340a487`, 1285 requests, mean input_length 67,631 tokens).
**Hardware**: 4× H200 80GB, driver 570.86.15 (cu12.8 API), Mellanox mlx5_60 RoCE 400 Gb/s NDR.
@@ -60,36 +60,108 @@ This is *the baseline H1 needs* — it shows the KVC layer (E2) has something co
---
## 3. E2 — in progress + an unexpected finding about D2
## 3. E2 results — KVC v2 + RDMA
Background task `b0im1d48q`, launched 2026-05-12 01:48 UTC. Mid-run snapshot at 16 minutes (33 % POSTs dispatched):
**Configuration**: `mechanism=kvcache-centric`, `policy=kv-aware`, 1P3D, `--force-rdma --ib-device mlx5_60`, `--kvcache-admission-mode worker`, `--kvcache-direct-max-uncached-tokens 8192`, `--kvcache-migration-reject-threshold 3`, `--kvcache-prefill-backup-policy release-after-transfer`, `--kvcache-prefill-priority-eviction`, ts=1.
| | D0 | D1 | D2 |
| Metric | E2 |
|---|---:|
| request_count | 1285 |
| success | 231 |
| **error_count** | **1054** |
| **failure_count** | **1054** |
| abort_count | 0 |
| latency mean (successful only) | 10.94 s |
| latency p50 | 7.44 s |
| latency p90 | 20.68 s |
| latency p99 | 64.73 s |
| ttft mean (successful only) | 1.76 s |
| ttft p50 | 0.43 s |
| ttft p90 | 6.56 s |
| **ttft p99** | **8.74 s** |
| execution_modes (succ.) | direct-to-D: 87; turn1-seed: 50; reseed: 12; large-append-reseed: 11; seed-filter-early-turn: 50; large-append-cap: 21 |
| per_decode_load | **D0:600, D1:685, D2:0** |
| per_prefill_load | P0:1285 |
| cache_hit_request_count | 230 / 231 (99.6 %) |
### Key observations on E2
1. **D2 still has zero bindings** — same root cause as E1. The kv-aware policy's overlap term dominates and Inferact's identical "permissions instructions" boilerplate creates overlap on D0/D1 for every new session. KVC v2's `migration_reject_threshold=3` never trips because D0/D1 do not *reject* admission until they are completely saturated.
2. **82 % failure rate (1054 / 1285)**. The 1054 requests classified under the bare `kvcache-centric` execution_mode are upstream timeouts or failures: they exceeded `--request-timeout-s 300` while waiting for admission. v2's stricter admission path (direct-to-D requires `session resident on D`, `append_len ≤ τ_append`, and available capacity) rejects more often than E1's vanilla pd-disagg. Those rejects do not trigger migration (see the root-cause discussion in §3); instead the request falls through to `fallback-seed` / `fallback-no-d-capacity`, queues on the already-saturated D0/D1, hits the timeout, and fails.
3. **Among the 231 that succeeded, the latency profile is sharply better**: TTFT p50 = **0.43 s** vs E1's 88.62 s (E2/E1 = 0.5 %), latency p50 = **7.44 s** vs E1's 93.21 s (8 %). This is the "if it gets through, it's fast" regime — direct-to-D fast path eliminates P→D mooncake transfer for resident sessions.
4. **Direct-to-D fast path engaged 87 / 231 = 37.7 %** of successful requests. This is lower than the historical v2 figure of 91.6 % on SWE-Bench, because most Inferact requests fell into the seed (50), reseed (12), or fallback paths under the D0/D1 capacity-vs-admission contention.
---
## 4. Comparison table — E1 vs E2
Numbers below are over **all 1285 requests** for E1 (since its failure rate is small) but **only the 231 successful** requests for E2 (since the bulk timed out before producing latency datapoints). This is **not a fair head-to-head**; see §6.
| Metric | E1 | E2 (succ only) | E2 / E1 |
|---|---:|---:|---:|
| bindings so far | 248 | 267 | **0** |
| GPU util (snapshot) | 0 % | 0 % | 0 % |
| KV pool util (across run) | high | high | empty |
| Total reqs | 1285 | 1285 | |
| Successful | 1200 | **231** | 0.19× |
| **error_count** | 85 (6.6 %) | **1054 (82 %)** | **12.4× worse** |
| lat mean | 96.34 s | 10.94 s | 0.114 |
| lat p50 | 93.21 s | **7.44 s** | **0.080** |
| lat p90 | 180.69 s | 20.68 s | 0.114 |
| lat p99 | 219.46 s | 64.73 s | 0.295 |
| ttft mean | 90.48 s | 1.76 s | 0.019 |
| **ttft p50** | 88.62 s | **0.43 s** | **0.005** |
| ttft p90 | 175.13 s | 6.56 s | 0.037 |
| ttft p99 | 207.39 s | 8.74 s | 0.042 |
| per_decode_load | D0:575, D1:710, D2:0 | D0:600, D1:685, D2:0 | both 1P2D |
| direct-to-D % | N/A (no KVC) | 87/231 = 37.7 % | |
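
The ratio column can be re-derived from the raw figures. The sketch below copies a handful of values from the table into two dictionaries (the dictionary keys are illustrative names, not fields of the summary JSON) and recomputes E2 / E1, plus the inverse ratio for TTFT p50.

```python
# Values copied from the comparison table (seconds, except error_count).
# Key names are illustrative; they are not taken from the summary JSON schema.
e1 = {"error_count": 85, "lat_p50": 93.21, "lat_p99": 219.46,
      "ttft_p50": 88.62, "ttft_p99": 207.39}
e2 = {"error_count": 1054, "lat_p50": 7.44, "lat_p99": 64.73,
      "ttft_p50": 0.43, "ttft_p99": 8.74}


def ratios(baseline, candidate):
    """Return candidate / baseline for every metric in the baseline."""
    return {name: candidate[name] / baseline[name] for name in baseline}


r = ratios(e1, e2)
for name, value in r.items():
    print(f"{name:12s} E2/E1 = {value:.3f}")

# Inverse ratio: how much lower TTFT p50 is among completed E2 requests.
ttft_p50_speedup = e1["ttft_p50"] / e2["ttft_p50"]
print(f"TTFT p50 speedup = {ttft_p50_speedup:.0f}x")

# Failure rate over the full 1285-request trace.
e2_failure_rate = e2["error_count"] / 1285
print(f"E2 failure rate = {e2_failure_rate:.1%}")
```

The TTFT p50 ratio of 0.005 corresponds to a speedup of roughly two orders of magnitude, but only over the 231 requests that completed.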
**D2 receives zero traffic in E2 too, just like E1**. This is *not* the result we expected — H1 predicted that KVC's session-migration mechanism (reset-on-success blacklist with `migration_reject_threshold=3`) would route around the imbalance E1 showed. It doesn't.
---
### Root cause
## 5. Interpreting H1 / H2 / H3
`KvAwarePolicy.select` (policies.py:171-202) scores candidates by 4-tuple lex order `(overlap + α·sticky, sticky, -inflight, -assigned)`. The `overlap` term dominates: any D that has resident KV blocks matching the incoming request's `hash_ids` wins position 0.
### H1 (was: KVC layer adds value on top of 1P3D + kv-aware) — *qualified*
In the Inferact `codex_swebenchpro` workload, **all 50 sessions begin with identical "permissions instructions" boilerplate** (the converter sees this as identical first-block content across trial 0..49). Our hash_id construction (sha256 over the token sequence per 24-token block, see `scripts/convert_inferact_to_trace.py`) therefore yields *identical block hashes across sessions* for the first ~50 blocks.
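
A minimal sketch of why shared boilerplate produces colliding block hashes. It assumes each 24-token block is hashed independently with sha256; whether `scripts/convert_inferact_to_trace.py` chains hashes across blocks is not stated here, and `block_hashes` is a hypothetical helper, not the converter's API. With chained hashing a shared prefix would collide in the same way.

```python
import hashlib

BLOCK_TOKENS = 24  # block size stated in the text


def block_hashes(token_ids, block_tokens=BLOCK_TOKENS):
    """Hash each full block of token ids with sha256 (hypothetical helper)."""
    hashes = []
    for start in range(0, len(token_ids) - block_tokens + 1, block_tokens):
        block = token_ids[start:start + block_tokens]
        payload = ",".join(map(str, block)).encode("utf-8")
        hashes.append(hashlib.sha256(payload).hexdigest())
    return hashes


# Two made-up sessions: a shared 48-token boilerplate prefix, then
# 24 session-specific tokens each.
boilerplate = list(range(1000, 1048))
session_a = boilerplate + list(range(1, 25))
session_b = boilerplate + list(range(500, 524))

hashes_a = block_hashes(session_a)
hashes_b = block_hashes(session_b)
shared = sum(1 for x, y in zip(hashes_a, hashes_b) if x == y)
print(f"{shared} of {len(hashes_a)} blocks share a hash")
```

Only the two boilerplate blocks collide; the third, session-specific block differs. In the real trace the shared prefix spans roughly 50 blocks.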
The H1 hypothesis as stated in `ONBOARDING_NEXT_AGENT_ZH.md` predicted E2 would clearly win on most metrics. The reality is **bimodal**: the small subset of E2 requests that successfully complete are dramatically faster than E1, but a much larger fraction (82 %) of E2 requests time out entirely. Net throughput on this workload is *worse* for E2 than E1.
Concretely, when session N's turn 0 lands:
- D0 / D1 already host previous sessions → their `state.resident` sets include those shared boilerplate hashes → `overlap > 0`
- D2 has never been admitted → `state.resident[D2]` is empty → `overlap = 0`
- D0/D1 tie at position 0; D2 always loses
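
The cold-start case above can be reproduced with a toy version of the lexicographic score. This is a sketch, not the code in `policies.py`: the `DecodeState` class, the value of `ALPHA`, and the load figures are all assumptions made for illustration.

```python
from dataclasses import dataclass, field

ALPHA = 0.5  # hypothetical weight; the real α is not given in this document


@dataclass
class DecodeState:
    name: str
    resident: set = field(default_factory=set)  # block hashes held in KV
    inflight: int = 0
    assigned: int = 0


def score(state, hash_ids, sticky=0):
    """4-tuple compared lexicographically; larger is better."""
    overlap = len(state.resident & set(hash_ids))
    return (overlap + ALPHA * sticky, sticky, -state.inflight, -state.assigned)


def select(states, hash_ids):
    return max(states, key=lambda s: score(s, hash_ids))


boiler = {f"bp{i}" for i in range(50)}  # shared boilerplate blocks
d0 = DecodeState("D0", resident=set(boiler), inflight=9, assigned=20)
d1 = DecodeState("D1", resident=set(boiler), inflight=7, assigned=25)
d2 = DecodeState("D2")  # never admitted, so nothing is resident

new_session = list(boiler) + ["unique0", "unique1"]
winner = select([d0, d1, d2], new_session)
print(winner.name, score(d2, new_session))
```

D0 and D1 tie on position 0, so the choice between them falls to `-inflight`. D2 scores zero on position 0 and cannot win however idle it is, because the load terms are only consulted on a tie.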
Two issues drove this:
1. The D2 cold-start pathology already documented under "Root cause" in §3. Both runs are de facto 1P2D, not 1P3D.
2. KVC v2's admission gate is stricter and surfaces more "no D capacity" / "session-not-resident" failures than vanilla pd-disagg, when the workload (mean input 67 K tokens, mean output 700 tokens) saturates D0/D1's combined ~1.5 M KV pool.
The migration mechanism never triggers because D0/D1 have ample KV (peak token_usage ~0.86 in v2 historical reports) and never *reject* admission. No rejects → no `(session, D)` blacklist accumulation → no migration → D2 stays cold forever.
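
A minimal sketch of a reset-on-success reject counter, to make the "no rejects, no migration" argument concrete. `RejectTracker` and its methods are hypothetical names; only the threshold of 3 comes from the run configuration (`--kvcache-migration-reject-threshold 3`).

```python
from collections import defaultdict

REJECT_THRESHOLD = 3  # mirrors --kvcache-migration-reject-threshold 3


class RejectTracker:
    """Counts consecutive admission rejects per (session, worker) pair."""

    def __init__(self, threshold=REJECT_THRESHOLD):
        self.threshold = threshold
        self.rejects = defaultdict(int)

    def record(self, session, worker, admitted):
        key = (session, worker)
        if admitted:
            self.rejects.pop(key, None)  # reset on success
        else:
            self.rejects[key] += 1

    def should_migrate(self, session, worker):
        return self.rejects.get((session, worker), 0) >= self.threshold


tracker = RejectTracker()

# Case 1: the worker always admits, so the counter never moves.
for _ in range(100):
    tracker.record("s1", "D0", admitted=True)
never_rejected = tracker.should_migrate("s1", "D0")

# Case 2: three rejects in a row reach the threshold.
for _ in range(3):
    tracker.record("s2", "D1", admitted=False)
three_rejects = tracker.should_migrate("s2", "D1")

# Case 3: a success in the middle resets the count.
for admitted in (False, False, True, False, False):
    tracker.record("s3", "D0", admitted=admitted)
reset_by_success = tracker.should_migrate("s3", "D0")

print(never_rejected, three_rejects, reset_by_success)  # False True False
```

Under this model migration is purely reactive: a worker that keeps admitting never accumulates a count, so nothing ever pushes a session toward an idle worker.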
For workloads where D0/D1 do not saturate or where the policy *does* spread session ownership across all D workers (the historical SWE-Bench setup), KVC v2 wins. The Inferact `codex_swebenchpro` subset breaks both assumptions.
### Implication for H1
### H2 / H3 (RDMA reduces TTFT p99) — *cannot be evaluated cleanly here*
H1 is *not falsified*, but it is *qualified*: KVC v2 still improves over naive pd-disaggregation on per-request work (direct-to-D fast path skips P→D mooncake transfer for turn≥1 on the same D), but it does **not** automatically balance load across D workers when the workload has high cross-session prefix overlap. To realise the full theoretical benefit of 1P3D on this workload, the policy needs an explicit cold-D bonus, or a pre-warming step that seeds D2 with shared boilerplate at startup.
The historical reference point is "KVC v2 + TCP loopback, SWE-Bench 50sess: TTFT p99 = 1.28 s". This run uses Inferact + RDMA, and TTFT p99 of the 231 successful E2 requests is **8.74 s** — much higher than the TCP baseline. But the workloads are not comparable: Inferact mean input is 67 K tokens vs SWE-Bench's much smaller average. Per-request prefill + transfer is roughly 5× longer here. A clean H2 / H3 read needs an Inferact-on-TCP run to compare against, which is out of scope for this subset's GPU budget.
Full E2 metrics will be filled in upon completion (ETA ~22 min from snapshot).
What we *can* say: RDMA is correctly engaged (every worker log shows `installTransport, type=rdma`; admission RPC RTTs in `structural/admission-events.jsonl` are ~6 ms — consistent with one-hop RoCE).
---
## 6. What this experiment actually shows
1. **The H200 + driver 570 + cu12.8 toolchain works for production-scale SGLang xPyD workloads.** Both runs completed without CUDA / driver / mooncake errors; failures are policy- and workload-level, not infrastructure.
2. **The KVC v2 + kv-aware policy combination has a latent pathology on workloads with high cross-session prefix overlap**: the overlap term in the lex score causes permanent load imbalance, and v2's reject-counter migration cannot rescue it because rejects only fire under capacity pressure, by which point timeouts already dominate. This is novel and not surfaced by the SWE-Bench evaluation in the existing project docs.
3. **For Inferact-like workloads, a cold-D bonus (e.g. require D to host at least one session before its overlap score counts) or an explicit pre-warm step is required** before E1/E2 comparisons can isolate the marginal effect of the KVC layer.
---
## 7. Reproducibility
- Trace: `outputs/inferact_50sess.jsonl`, md5 `7bb263a32600ef5a6ef5099ba340a487`, regenerable via `scripts/sample_trace_subset.py`.
- E1: `bash scripts/sweep_e1_naive_1p3d.sh` (1h 29 min wall)
- E2: `bash scripts/sweep_e2_kvc_v2_rdma.sh` (1h 33 min wall)
- Summary JSON paths:
  - `outputs/e1_naive_1p3d_kvaware_rdma_50sess/e1_naive_1p3d_kvaware_run1_summary.json`
  - `outputs/e2_kvc_v2_rdma_50sess/e2_kvc_v2_rdma_run1_summary.json`
- Per-request metrics JSONL alongside each summary, plus structural events under `*/structural/`.
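
Before re-running either sweep, the trace can be checked against the md5 above. A minimal sketch using only the standard library; it assumes the trace holds one request per non-empty JSONL line, so a matching file should report 1285 requests.

```python
import hashlib
from pathlib import Path

EXPECTED_MD5 = "7bb263a32600ef5a6ef5099ba340a487"
TRACE_PATH = Path("outputs/inferact_50sess.jsonl")


def file_md5(path, chunk_size=1 << 20):
    """Stream the file in chunks so large traces are not loaded at once."""
    digest = hashlib.md5()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def count_requests(path):
    """Count non-empty lines (assumes one request per JSONL line)."""
    with open(path, "rb") as fh:
        return sum(1 for line in fh if line.strip())


if TRACE_PATH.exists():
    actual = file_md5(TRACE_PATH)
    status = "OK" if actual == EXPECTED_MD5 else "MISMATCH"
    print(f"md5 {actual} [{status}], {count_requests(TRACE_PATH)} requests")
else:
    print(f"{TRACE_PATH} not found; regenerate it first")
```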
---
## 8. Open follow-ups for the next agent
1. **Add a cold-D bonus** to `KvAwarePolicy.select` (e.g. positive constant for D with `state.resident[D] == ∅`) and re-run E2 on the same subset. Predict: D2 receives bindings, failure rate drops, head-to-head with E1 becomes meaningful.
2. **Rerun E2 with `--kvcache-admission-mode router`** (router-side optimistic admission instead of worker RPC) to isolate whether the strict worker admission contributes to the 1054 failures, or whether the imbalance alone explains them.
3. **Run a third arm E0 with `policy=default` + `mechanism=pd-disaggregation`** as a true control: the kv-aware policy is itself part of what we are evaluating, and default round-robin would have spread sessions across all three D workers.
4. **Compare TTFT p99 against an Inferact-on-TCP-loopback run** to evaluate H2/H3 cleanly. Cost: 1 more E2-shaped sweep (~1.5 h).
5. **Investigate the 1054 E2 failures** in `request-metrics.jsonl` — sample some to verify they are timeout-related vs admission-rejected vs upstream-500.
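
For follow-up 1, one possible shape of the cold-D bonus is sketched below. It is not a patch to `KvAwarePolicy.select`: the function name, the `COLD_BONUS` constant, the `alpha` default, and the worker dictionaries are all assumptions. The constant only needs to exceed any overlap a warm worker can reach.

```python
COLD_BONUS = 1_000  # hypothetical; must exceed any realistic overlap count


def score_with_cold_bonus(resident, hash_ids, inflight, assigned,
                          sticky=0, alpha=0.5):
    """Lex score with an extra term for workers holding no KV at all."""
    overlap = len(resident & set(hash_ids))
    bonus = COLD_BONUS if not resident else 0
    return (overlap + alpha * sticky + bonus, sticky, -inflight, -assigned)


def select(workers, hash_ids):
    return max(
        workers,
        key=lambda name: score_with_cold_bonus(hash_ids=hash_ids,
                                               **workers[name]),
    )


boiler = {f"bp{i}" for i in range(50)}
workers = {
    "D0": {"resident": set(boiler), "inflight": 9, "assigned": 20},
    "D1": {"resident": set(boiler), "inflight": 7, "assigned": 25},
    "D2": {"resident": set(), "inflight": 0, "assigned": 0},
}

# The first new session goes to the cold worker because of the bonus.
session_1 = list(boiler) + ["s1-unique"]
first = select(workers, session_1)
workers[first]["resident"].update(session_1)
workers[first]["inflight"] += 1
workers[first]["assigned"] += 1

# D2 is now warm: the bonus is gone and it competes on overlap and load.
session_2 = list(boiler) + ["s2-unique"]
second = select(workers, session_2)
print(first, second)
```

Once D2 holds the boilerplate blocks it ties with D0/D1 on overlap, and the `-inflight` term decides. Whether that is enough to keep the load balanced over a full run is what the proposed re-run would measure.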
---