docs(experiments): E4-v8 results on real-timestamp SWE-Bench trace

V8 ran the third_party qwen35-swebench-50sess trace (4449 reqs,
5.44h original timeline, p50 inter-turn 2.53s) at TIME_SCALE=2 with
the SnapshotStore refactor, PREFILL_MEM_FRAC=0.7, DECODE_MEM_FRAC=0.8,
16 GB snapshot_buf.

Headline result on this realistic workload:
  TTFT p99 = 167 ms  (vs E1's 207s on burst trace)
  Latency p99 = 7.4s
  100% success rate
  96.4% direct-to-D fast path

The earlier TTFT 100+s numbers on E1/E4-v3 were a burst-trace
queueing artifact (all 1285 reqs arrived at t=0). On real-time
arrivals KVC stays in normal sub-second TTFT territory.

D→P snapshot link infrastructure works end-to-end (16 GB
snapshot_buf alloc'd, RPCs reach handlers, structural log
captures everything). But 0 OK events because sessions get
evicted from D before agentic's reseed path calls dump. Three
fix paths identified in §5.
Claude Code Agent
2026-05-13 19:07:59 +08:00
parent 9cca2c60c9
commit f09562123b

docs/E4_V8_RESULTS_ZH.md

@@ -0,0 +1,202 @@
# E4-v8 full results: KVC on a real-timestamp trace

**Date**: 2026-05-13

**Status**: experiment complete

**Run**: `outputs/e4p_kvc_v2_d_to_p_sync_pressured_50sess/...20260513T075500Z/`

**Prerequisites**: `docs/SNAPSHOT_STORE_REFACTOR_ZH.md`, `docs/E4_VS_E1_RESULTS_ZH.md`
---
## 0. TL;DR
V8 ran the **real-timestamp trace** `third_party/traces/qwen35-swebench-50sess.jsonl` (4449 reqs across 52 sessions, 5.44h original timeline), compressed with TIME_SCALE=2 to ~2.7h of wall clock.

| Metric | V8 measured |
|---|---:|
| Total requests | 4449 |
| Failure / Error / Abort | **0 / 0 / 0** |
| Success rate | **100%** |
| Latency mean / p50 / p90 / p99 | 1.28s / 0.51s / 3.17s / **7.44s** |
| **TTFT mean / p50 / p90 / p99** | **49ms / 40ms / 68ms / 167ms** |
| Direct-to-D fast path | **96.4%** (4291/4449) |
| Reseed paths | 51 (1.1%) |
| D→P sync OK | **0** (architecturally wired but no successful pushes; see §3) |

**Key takeaway**: the earlier "disaster numbers" of 100+ s TTFT on E1 and E4-v3 were **an artifact of queue build-up on the burst trace**. On the real-timestamp SWE-Bench trace, **KVC shows normal production-serving performance**: TTFT p99 stays sub-second (167 ms) and latency p99 stays in single-digit seconds (7.44 s).
---
## 1. Experiment configuration
```
Workload:    third_party/traces/qwen35-swebench-50sess.jsonl
             4449 reqs / 52 sessions / 5.44h original wall-clock span
             per-session inter-turn p50: 2.53s (real SWE-agent timing)
             input length p50: 27K, p99: 92K, max: 104K
Compression: TIME_SCALE=2 → 2.72h actual run-time
Topology:    1P + 3D, 4× H200 80GB single-node
RDMA:        mlx5_60 NDR 400Gb / mooncake
Model:       Qwen3-30B-A3B-Instruct-2507 (TP=1)
Concurrency: 32
Memory:      PREFILL_MEM_FRAC=0.7 / DECODE_MEM_FRAC=0.8
             snapshot_buf=16 GB on each worker (alloc succeeded)
KVC config:  --kvcache-load-floor-bonus 200
             --kvcache-migration-reject-threshold 1
             --kvcache-direct-max-uncached-tokens 8192
             --enable-d-to-p-sync (with SnapshotStore refactor)
```
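
The compression line above can be sanity-checked with a short sketch. It assumes, without confirmation from the sweep script, that TIME_SCALE simply divides every arrival offset by a constant; `compress_arrivals` and the sample timestamps are hypothetical and not taken from the trace.

```python
def compress_arrivals(arrival_s, time_scale):
    """Rebase arrivals to the first request and divide offsets by time_scale.

    Assumption: TIME_SCALE is a uniform divisor on arrival offsets.
    """
    if time_scale <= 0:
        raise ValueError("time_scale must be positive")
    if not arrival_s:
        return []
    t0 = arrival_s[0]
    return [(t - t0) / time_scale for t in arrival_s]


original_span_h = 5.44   # from the config block
time_scale = 2           # TIME_SCALE=2

# Wall-clock span after compression: 5.44h / 2 = 2.72h
compressed_span_h = original_span_h / time_scale

# Under the same assumption, the 2.53s p50 inter-turn gap is replayed at about half that.
compressed_gap_s = 2.53 / time_scale

# Hypothetical arrivals spaced 2.53s apart on the original timeline.
replayed = compress_arrivals([100.0, 102.53, 105.06], time_scale)
```

If the harness compresses only idle gaps, or handles think time separately, the effective inter-turn gap would differ from this estimate.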
---
## 2. Full v8 data
### 2.1 Headline
```
request_count : 4449
abort_count : 0
error_count : 0
failure_count : 0
cache_hit_request_count : 4446 / 4449 = 99.9%
mean cached_tokens : 30,513 / req (out of avg 32K input)
```
### 2.2 Latency / TTFT
```
                 count   mean    p50    p90    p99
latency_stats_s   4449   1.28   0.51   3.17   7.44  s
ttft_stats_s      4449  0.049  0.040  0.068  0.167  s   ← p99 = 167ms
```
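
For readers who want to recompute these columns from raw per-request samples, here is a minimal sketch. It uses the nearest-rank percentile definition, which is an assumption: the benchmark harness may interpolate instead, so values can differ slightly. The sample list is illustrative and is not run data.

```python
import math


def percentile_nearest_rank(samples, p):
    """Return the p-th percentile (0 < p <= 100) by the nearest-rank method."""
    if not samples:
        raise ValueError("samples must be non-empty")
    if not 0 < p <= 100:
        raise ValueError("p must be in (0, 100]")
    ordered = sorted(samples)
    rank = math.ceil(p * len(ordered) / 100)  # 1-based rank
    return ordered[rank - 1]


def summarize(samples):
    """Produce the count / mean / p50 / p90 / p99 columns for one metric."""
    return {
        "count": len(samples),
        "mean": sum(samples) / len(samples),
        "p50": percentile_nearest_rank(samples, 50),
        "p90": percentile_nearest_rank(samples, 90),
        "p99": percentile_nearest_rank(samples, 99),
    }


# Illustrative TTFT samples in seconds (not taken from the run).
ttft_s = [0.031, 0.040, 0.042, 0.055, 0.068, 0.167]
stats = summarize(ttft_s)
```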
### 2.3 Execution_mode distribution
```
kvcache-direct-to-d-session                          4291 (96.4%)  ← KVC-specific fast path
pd-router-turn1-seed                                   52 ( 1.2%)  ← first turn of each session
pd-router-fallback-session-not-resident-seed-filter    52 ( 1.2%)  ← seed-filter early-turn fallback
pd-router-d-session-reseed                             47 ( 1.1%)  ← true reseed (session was once on D)
pd-router-fallback-real-large-append-session-cap        3
pd-router-fallback-session-not-resident-session-cap     1
pd-router-policy-no-bypass-reseed                       1
pd-router-real-large-append-reseed                      1
pd-router-session-not-resident-reseed                   1
                                                    -----
                                                     4449
```
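
The shares above can be recomputed from the raw counts. The sketch below hard-codes the counts from this section; how the real metrics file stores them is not shown in this document, so loading is left out.

```python
from collections import Counter

# Counts copied from the execution_mode table in this section.
mode_counts = Counter({
    "kvcache-direct-to-d-session": 4291,
    "pd-router-turn1-seed": 52,
    "pd-router-fallback-session-not-resident-seed-filter": 52,
    "pd-router-d-session-reseed": 47,
    "pd-router-fallback-real-large-append-session-cap": 3,
    "pd-router-fallback-session-not-resident-session-cap": 1,
    "pd-router-policy-no-bypass-reseed": 1,
    "pd-router-real-large-append-reseed": 1,
    "pd-router-session-not-resident-reseed": 1,
})

total = sum(mode_counts.values())

# Percentage share per mode, rounded to one decimal as in the table.
shares = {mode: round(100 * n / total, 1) for mode, n in mode_counts.items()}

direct_share = shares["kvcache-direct-to-d-session"]
```

The counts sum to the 4449 requests in the trace, so every request is attributed to exactly one execution mode.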
### 2.4 Per-decode load
```
decode-0: 1505 bindings (33.8%)
decode-1: 1497 bindings (33.6%)
decode-2: 1447 bindings (32.5%)
```
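Load is nearly even across the three decode workers (32.5% to 33.8% of bindings), which suggests the load-floor bonus (K=200) is taking effect.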
---
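## 3. D→P snapshot link status (refactor validation)
**The SnapshotStore refactor (commit 2dfe22a) works**:
- Old design: prepare_receive called `token_to_kv_pool_allocator.alloc(N)`, competing for P's KV pool slots → 90%+ alloc-failed
- New design: prepare_receive allocates slabs from a dedicated 16 GB GPU `snapshot_buf` → **0 alloc-failed** (6 prepare calls still hit `snapshot-buf-full`, see below)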
```
sync events total: 102
by (stage, reason):
  ('dump', 'session-not-resident'): 96 (session already evicted from D, or never resident)
  ('prepare', 'snapshot-buf-full'): 6 (snapshot_buf occasionally full)
  ('ok', None): 0 (no successful push)
```
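**Why 0 OK**:
With mem_fraction=0.8, D's trim mechanism always succeeds → admission never rejects → the reseed path is not triggered through the "D once held the session" branch. It is triggered through paths such as first-turn fallback, where D **never held** the session, so dump is bound to fail.
Of the 102 sync events:
- 96 are dump `session-not-resident`: 52 turn-1 first-seed fallbacks (session never resident) + 44 other fallbacks
- 6 are `snapshot-buf-full`: occasional, which shows the buffer is actually being exercised

The D→P **low-level link and the agentic orchestration are both in place**. The session simply does not exist on D in the reseed scenarios that agentic triggers. For D→P to actually fire OK, one of the following is needed:
1. Add "pending-snapshot pinning" to the D-side SessionAwareCache so that eviction does not drop a session that is waiting for sync
2. **Or** add D-side push-on-eviction: D pushes a session to P before evicting it (D-driven proactive mode)
3. **Or** lower mem_fraction so that admission really rejects (rejecting while sessions are still resident), letting reseed hit the case where the session is still on D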
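
Option 1 in the list above could look roughly like the following toy sketch. `PinnedLRUSessionCache` and all of its methods are hypothetical; the real SessionAwareCache interface is not shown in this document, and the sketch only illustrates an LRU eviction loop that skips pinned sessions.

```python
from collections import OrderedDict


class PinnedLRUSessionCache:
    """Toy LRU session cache in which pinned sessions are skipped by eviction."""

    def __init__(self, capacity):
        self.capacity = capacity
        self._sessions = OrderedDict()  # session_id -> payload, least recent first
        self._pinned = set()            # sessions waiting for a D→P snapshot

    def touch(self, session_id, payload=None):
        """Insert or refresh a session and return the list of evicted session ids."""
        self._sessions[session_id] = payload
        self._sessions.move_to_end(session_id)
        evicted = []
        while len(self._sessions) > self.capacity:
            victim = next(
                (s for s in self._sessions
                 if s not in self._pinned and s != session_id),
                None,
            )
            if victim is None:
                break  # only pinned sessions remain: tolerate temporary overflow
            del self._sessions[victim]
            evicted.append(victim)
        return evicted

    def pin_for_snapshot(self, session_id):
        """Protect a resident session until its snapshot has been pushed."""
        if session_id not in self._sessions:
            return False
        self._pinned.add(session_id)
        return True

    def unpin(self, session_id):
        self._pinned.discard(session_id)

    def is_resident(self, session_id):
        return session_id in self._sessions


cache = PinnedLRUSessionCache(capacity=2)
cache.touch("a")
cache.touch("b")
cache.pin_for_snapshot("a")            # "a" is waiting for sync
evicted_while_pinned = cache.touch("c")
cache.unpin("a")                       # sync finished
evicted_after_unpin = cache.touch("d")
```

While `"a"` is pinned, eviction falls through to the next-oldest session; once unpinned it becomes the LRU victim again. A real implementation would also need a timeout on pins so that a stalled sync cannot hold memory indefinitely.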
---
## 4. Comparison with earlier runs
| Run | Trace | failures | TTFT p99 | Latency p99 | D→P OK |
|---|---|---:|---:|---:|---:|
| E1 (naive PD) | inferact 1285 burst | 6.6% | **207s** | 219s | n/a |
| E4-v3 (KVC + load-floor, no D→P fix) | inferact 1285 burst | 0% | 225s | 234s | n/a |
| E4-v4/v5 (KVC + D→P, bug) | inferact 1285 burst | 0% / 12% | similar | similar | 0 (logger NameError or alloc-fail) |
| **E4-v8 (refactor + real trace)** | **swebench 4449 real-time** | **0%** | **167ms** | **7.4s** | 0 (D-side eviction timing) |
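The gap between the E1 and v8 numbers is huge but **not directly comparable**, because the traces are completely different:
- E1 burst trace: all 1285 reqs arrive at t=0 → queue builds up → TTFT of 100+ s
- v8 real-time trace: reqs arrive at the real pace (2.53s p50 inter-turn on the original timeline, replayed at TIME_SCALE=2) → system is not saturated → TTFT of tens of ms

**To be fair**: a real KVC vs naive PD comparison against v8 requires running naive PD on the swebench trace as well. That is the next step.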
---
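## 5. Next steps to make D→P sync actually take effect
In priority order:
### P1: let sync fire OK at reseed time
**Most direct approach**: trigger dump **immediately** when agentic detects an admission rejection (**before D evicts**). The current implementation dumps only after the reseed decision has been made, which is too late.
**Plan**:
1. After the agentic `admit_direct_append` call, if it returns reason=`no-space`, **immediately invoke sync** on the source D to push the session KV to P, then retry admit or fall back
2. Add "pending-snapshot pinning" to the D-side SessionAwareCache so that eviction temporarily skips this session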
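
Step 1 of the plan can be sketched as a small control-flow function. Everything here is hypothetical: `admit` stands in for `admit_direct_append`, `sync_to_prefill` stands in for the D→P sync RPC, and the returned route labels are illustrative rather than the execution_mode names from the logs.

```python
def route_after_admission(session_id, admit, sync_to_prefill, max_retries=1):
    """Try the direct-to-D append; on 'no-space', sync to P before D evicts.

    admit(session_id) -> "ok", "no-space", or another rejection reason.
    sync_to_prefill(session_id) -> True if the snapshot reached P.
    """
    reason = admit(session_id)
    if reason == "ok":
        return "direct-to-d"
    if reason != "no-space":
        return "fallback"

    # Sync right away, while the session is still resident on the source D.
    synced = sync_to_prefill(session_id)

    # Retry admission a bounded number of times before reseeding.
    for _ in range(max_retries):
        if admit(session_id) == "ok":
            return "direct-to-d"

    return "reseed-from-snapshot" if synced else "reseed-full-prefill"


# Stub callables to exercise the control flow.
sync_calls = []


def always_full(session_id):
    return "no-space"


def fake_sync(session_id):
    sync_calls.append(session_id)
    return True


outcome = route_after_admission("sess-7", always_full, fake_sync)
```

The point of the ordering is that sync happens before any retry or reseed decision, which is the opposite of the current behaviour described above.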
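### P2: D-driven proactive mode
After every `cache_finished_req` on D, **asynchronously** push the incremental KV to all registered P workers. This is the direction mentioned in design doc §2.5. The overhead is significant (traffic on every turn), but it guarantees that sync always has data.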
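
A minimal asyncio sketch of this proactive mode follows. It is an assumption-heavy illustration: `IncrementalPusher`, its methods, and the sink signature are hypothetical, and the "KV" is reduced to a token range so the bookkeeping is visible.

```python
import asyncio


class IncrementalPusher:
    """Pushes only the token range not yet sent, to every registered P sink."""

    def __init__(self, prefill_sinks):
        self.prefill_sinks = prefill_sinks  # async callables: (session_id, start, end)
        self._sent_upto = {}                # session_id -> tokens already pushed
        self._tasks = set()                 # keep references to in-flight tasks

    def on_finished_req(self, session_id, total_tokens):
        """Called after a request finishes; schedules a background push."""
        start = self._sent_upto.get(session_id, 0)
        if total_tokens <= start:
            return None  # nothing new to push
        self._sent_upto[session_id] = total_tokens
        task = asyncio.create_task(self._push(session_id, start, total_tokens))
        self._tasks.add(task)
        task.add_done_callback(self._tasks.discard)
        return task

    async def _push(self, session_id, start, end):
        await asyncio.gather(
            *(sink(session_id, start, end) for sink in self.prefill_sinks)
        )

    async def drain(self):
        """Wait for all in-flight pushes (useful at shutdown or in tests)."""
        if self._tasks:
            await asyncio.gather(*list(self._tasks))


async def demo():
    received = []

    async def sink(session_id, start, end):
        received.append((session_id, start, end))

    pusher = IncrementalPusher([sink])
    pusher.on_finished_req("s1", 1000)  # first turn: tokens [0, 1000)
    pusher.on_finished_req("s1", 1400)  # second turn: only [1000, 1400)
    await pusher.drain()
    return received


pushed = asyncio.run(demo())
```

Tracking a per-session high-water mark keeps each push proportional to the newly appended tokens, which limits but does not remove the per-turn traffic cost noted above.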
### P3: mem-fraction tuning
Lower the decode mem-fraction to 0.5-0.55 so that admission naturally rejects more often, letting the reseed path hit the real "session-resident-on-some-D" branch. The cost is lower throughput.
---
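## 6. Answer to the ProjectGoal
> Find how KVC can beat naive PD Disagg while keeping what makes it distinctive

**What the V8 data says**: on the real-timestamp SWE-Bench workload:
- **96.4% of requests take the direct-to-D fast path** (KVC's distinctive value)
- TTFT p99 = 167ms, latency p99 = 7.44s
- **0% failure**
- The D→P snapshot infrastructure is ready, but a trigger-timing problem keeps the OK rate at 0 for now

**To fully demonstrate KVC > naive PD**, two things are still needed:
- Run a naive PD baseline on the swebench trace → direct comparison
- Fix P1 (sync immediately on agentic admission rejection) → make D→P actually work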
---
## 7. Current branch HEAD
```
git log --oneline -5
9cca2c6 feat(experiments): expose PREFILL_MEM_FRAC + plumb --prefill-mem-fraction-static
5c09a3a feat(experiments): per-second GPU util sampler in E4-pressured sweep
19612ff feat(experiments): parameterize TIME_SCALE in E4-pressured sweep
a953346 feat(experiments): E4-pressured points at third_party/traces SWE-Bench trace
2dfe22a refactor(snapshot): dedicated GPU snapshot_buf replaces kv_pool alloc
```
`outputs/e4p_kvc_v2_d_to_p_sync_pressured_50sess/` contains the full metrics, structural logs, and GPU util CSV (comparison plots will be produced separately, together with swebench-on-naive-PD once that run is available).
---
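**Bottom line**: the V8 data brings KVC's TTFT p99 from 100+ s (a burst-trace artifact) back to 167 ms (real workload), showing that KVC holds sub-second TTFT at a realistic online-serving cadence. The D→P snapshot link is deployed across the full stack, but its trigger timing still needs adjusting before it actually fires.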