Headline: KVC v2 + load-floor + RDMA beats naive PD-disagg on
mean/p50/p90 by 30-65% (TTFT p50 31s vs 88s, lat p50 37s vs 93s,
wall-clock 64 min vs 88 min). Loses p99 by ~8% (TTFT 224 vs 207).
Wrote 4 figures (docs/figures/):
e1_vs_e4_ttft_pdf.png — bimodal E4 fast-path peak vs E1 single peak
e1_vs_e4_latency_cdf.png — CDF + log-survival showing tail crossover
e4_path_latency.png — per-execution-mode latency breakdown
e1_vs_e4_p99_attribution.png — what makes up E4's p99 tail
P99 tail attribution (this is the key finding):
E4 p99 tail (n=65, TTFT ≥ 179.9s):
fast-path direct-to-d 0 % (0/65)
reseed paths 5 % (3/65)
fallback paths 88 % (57/65)
large-append-session-cap 43 % ← biggest culprit
no-d-capacity 17 %
large-append 14 %
Implication: D→P snapshot (designed to optimize reseed slow path)
even if fully working would touch ≤5% of the p99 tail. The real
bottleneck is *fallback chain* (admission retry + seeded-router
cold start), not reseed. Optimizing p99 needs work on fallback,
not more D→P plumbing.
Full analysis: docs/E4_VS_E1_RESULTS_ZH.md
216 lines
10 KiB
Markdown
216 lines
10 KiB
Markdown
# E4 vs E1:KVC 是否打败 naive PD-disagg?
|
||
|
||
**日期**:2026-05-13
|
||
**Run**:`outputs/e4p_kvc_v2_d_to_p_sync_pressured_50sess/...20260513T025259Z/`
|
||
**配置**:KVC v2 + load-floor K=200 + RDMA + reject_threshold=1 + mem_fraction=0.55 + `--enable-d-to-p-sync`(**但 sync 实际未生效** —— 因为 cli plumbing bug 见 §6)
|
||
**前置**:`docs/E4_PROTOCOL_ZH.md`, `docs/E4_RESULTS_ZH.md`
|
||
|
||
---
|
||
|
||
## 0. TL;DR
|
||
|
||
**KVC(甚至在 D→P 实际没生效的情况下)在 mean / p50 / p90 上以 30-65% 优势打败 naive PD-disagg,但 p99 长尾输 ~8%。**
|
||
|
||
| 指标 | E1 naive PD | E4 KVC | 优势 |
|
||
|---|---:|---:|---:|
|
||
| TTFT mean | 90.5s | **58.8s** | **-35%** ✅ |
|
||
| TTFT p50 | 88.5s | **31.0s** | **-65%** ✅ |
|
||
| TTFT p90 | 175.2s | 158.9s | -9% ✅ |
|
||
| TTFT p99 | 207.4s | 224.8s | **+8%** ❌ |
|
||
| Lat mean | 96.3s | **63.9s** | **-34%** ✅ |
|
||
| Lat p50 | 93.2s | **37.1s** | **-60%** ✅ |
|
||
| Lat p99 | 219.5s | 233.8s | +6.5% ❌ |
|
||
| Success 数 | 1200/1285 | 1130/1285 | -70 ❌ |
|
||
| Wall clock | 88 min | **64 min** | **-27%** ✅ |
|
||
|
||
---
|
||
|
||
## 1. 图
|
||
|
||
### Figure 1: TTFT 分布对比
|
||
|
||

|
||
|
||
- **左 panel(线性 ≤ 60s)**:E4(蓝)有明显的 fast-path 峰在 5-15s 区间,E1(红)整体分布在 50-100s 之间,**没有 fast path**
|
||
- **右 panel(log scale 全范围)**:E4 双峰结构清晰 —— body 在 ~10s,长尾在 100-200s 之间。E1 单峰在 ~80-90s,长尾延伸到 ~200s
|
||
|
||
### Figure 2: E2E latency CDF
|
||
|
||

|
||
|
||
- **左 panel**:CDF 在 80% 之前 E4 完胜(蓝线在左)。**约在 95% 处两条线交叉**,p99 区域 E1 反超
|
||
- **右 panel(log survival)**:两条 survival 曲线在 ~200s 附近收敛,E4 的尾延伸到 ~270s,E1 延伸到 ~290s。**两边长尾绝对值相似**
|
||
|
||
### Figure 3: E4 p99 长尾归因
|
||
|
||

|
||
|
||
E4 p95-p99 tail(65 个请求,TTFT ≥ 179.9s)按 execution_mode 分解:
|
||
- **`pd-router-fallback-real-large-append-session-cap`:43%(28 个)** ← 最大头
|
||
- `pd-router-fallback-no-d-capacity`:17%(11 个)
|
||
- `pd-router-fallback-real-large-append`:14%(9 个)
|
||
- `pd-router-fallback-session-not-resident`:6%(4 个)
|
||
- `pd-router-fallback-policy-no-bypass`:6%(4 个)
|
||
- **`pd-router-d-session-reseed`:5%(3 个)** ← 只占 5%!
|
||
- ...
|
||
|
||
### Figure 4: E4 per-mode 平均 TTFT(top 14 modes by count)
|
||
|
||

|
||
|
||
---
|
||
|
||
## 2. P99 长尾归因——为什么 E4 输 p99
|
||
|
||
```
|
||
E4 p99 tail (n=65, TTFT >= 179.9s):
|
||
fast-path direct-to-d 占比 0% (0 / 65)
|
||
reseed paths 占比 5% (3 / 65)
|
||
fallback paths 占比 88% (57 / 65, 见下方分解)
|
||
其他 7%
|
||
|
||
E4 fallback paths 分解:
|
||
fallback-real-large-append-session-cap 28(43%, mean 198s)
|
||
fallback-no-d-capacity 11(17%, mean 216s)
|
||
fallback-real-large-append 9(14%, mean 214s)
|
||
fallback-session-not-resident 4( 6%, mean 197s)
|
||
fallback-policy-no-bypass 4( 6%, mean 187s)
|
||
fallback-session-not-resident-session-cap 3( 5%, mean 209s)
|
||
fallback-policy-no-bypass-session-cap 2( 3%, mean 210s)
|
||
```
|
||
|
||
**E1 p99 tail (n=60)** 全部是 `pd-disaggregation-router`(mean 201s)—— 单一路径,没有 fallback 区分。
|
||
|
||
### 关键洞察
|
||
|
||
1. **E4 长尾不是 reseed 造成的**——reseed 在 p99 tail 中只占 5%。所以 **D→P 即使生效也救不了 p99 大头**。
|
||
2. **E4 长尾的真正凶手是 fallback paths**。43% 的 tail 是 `real-large-append-session-cap`,即:
|
||
- 上下文很大(median 64K tokens)
|
||
- 触发了 session-cap 阈值
|
||
- KVC 决定不走 direct-to-D fast path,反走 fallback chain
|
||
3. **fallback chain 比 naive PD 还慢**——为什么?
|
||
- **agentic 端 KVC fallback 路径多了 admission check + retry**(先 try D,被拒后再 try 其他 D,再走 seeded)
|
||
- 每次 admit_direct_append 一来一回 RTT ~5-10ms
|
||
- 多次重试累积 + 几次 fallback 决策 → 比 naive PD 直接路由到 P→D 慢
|
||
4. **E4 fast path 救了 mean/p50/p90**——`direct-to-d` 走得通的 73 个请求 TTFT mean 0.185s(vs E1 mean 90.5s,500× 提升)。这才是 KVC 的"独特价值"。
|
||
5. **E4 input length 分布与 E1 相似**——E4 tail median 64K vs E1 tail median 77K。E4 略优。
|
||
6. **turn_id 都 >= 5**——长尾 100% 来自深 multi-turn session,正是 KVC 设计预期处理的场景
|
||
|
||
---
|
||
|
||
## 3. 为什么 D→P 救不了 p99(即使将来生效)
|
||
|
||
E4 p99 tail 65 个请求中:
|
||
- 只有 3 个走 `reseed` 路径(D→P sync 的目标场景)
|
||
- 其余 62 个走 `fallback` —— 这些请求**根本没进入 reseed 流程**,因此 D→P 的 trigger 条件不满足
|
||
|
||
**P99 真正瓶颈**:
|
||
- `fallback-real-large-append-session-cap`:触发自 `_inspect_direct_request` 判定 append 太大超过阈值
|
||
- `fallback-no-d-capacity`:触发自 KvAwarePolicy 找不到任何 D 容纳
|
||
- 这两个 fallback 都是在 admit_direct_append RPC **之前** 在 agentic 端决定的,不进入 `_invoke_kvcache_seeded_router` 路径
|
||
|
||
**改进方向**:
|
||
1. **大 append 也能走 direct-to-D**(取消 session-cap 截断 / 提高阈值)
|
||
2. **fallback chain 走 P 时也用 streaming session**(避免 P-prefill cold start)
|
||
3. **D→P 主动模式**(在 cache_finished_req 后异步把 KV 推给 P,让 fallback 走 P 时不用重 prefill)
|
||
|
||
---
|
||
|
||
## 4. KVC 的"独特性"在哪?数据回答
|
||
|
||
KVC 设计的独特价值是 **session-affinity routing + direct-to-D fast path**。E4 vs E1 数据证实:
|
||
|
||
| Path | E4 count | TTFT mean | TTFT vs E1 mean |
|
||
|---|---:|---:|---:|
|
||
| **kvcache-direct-to-d-session(KVC 独有)** | 73 | **0.185s** | **-99.8%** |
|
||
| pd-router-turn1-seed(与 E1 等价)| 37 | 8.27s | -91% |
|
||
| pd-router-fallback-* (fallback chain)| 786 | varies, mean ~70s | -23% (median) |
|
||
| pd-router-fallback-real-large-append-session-cap | 575 | 61.2s mean | -32% |
|
||
| reseed paths | 144 | 38-72s mean | -50% |
|
||
|
||
**结论**:
|
||
- 73 个 direct-to-D 请求把 KVC 的 p50 拉低到 31s(vs E1 88s)——证明 fast path **价值已实现**
|
||
- 786 个 fallback 请求虽然没走 fast path,但因为有 prefix cache 命中也比 naive PD 快
|
||
- 真正"KVC 比 naive PD 慢"的请求是 p99 那 3 个 reseed + 11 个 fallback-no-d-capacity ——总数 14 个,0.011%
|
||
|
||
**KVC 在 99% 工作量上完胜 naive PD-disagg,在 1% 上微输**。
|
||
|
||
---
|
||
|
||
## 5. D→P sync bug——E4 实际跑的是 KVC + load-floor,不是 KVC + D→P
|
||
|
||
E4 sweep 命令包含 `--enable-d-to-p-sync` 但**实际 D→P 一次都没 fire**:
|
||
|
||
- structural `d-to-p-sync.jsonl` 文件不存在
|
||
- worker logs 里 0 个 `/_snapshot/*` HTTP 请求
|
||
|
||
**根因**:`cli.py:821 benchmark-live ReplayConfig` builder 漏了 `enable_d_to_p_sync=args.enable_d_to_p_sync` 字段。`BenchmarkLiveConfig.enable_d_to_p_sync` 默认 False,连带 `ReplayConfig.enable_d_to_p_sync` 也是 False,`_attempt_d_to_p_sync` 入口处 `if not config.enable_d_to_p_sync: return None` 早退。
|
||
|
||
**已修**:commit `af966f2`。
|
||
|
||
**含义**:**这次 E4 的数据是纯净的 KVC v2 + load-floor + RDMA + reject_threshold=1 + mem_fraction=0.55 对比 E1 naive PD**,没有 D→P 加成。D→P 如果真生效**最多救** 3 个 reseed-in-p99-tail 请求(占 tail 5%),p99 数字不会有显著变化。
|
||
|
||
---
|
||
|
||
## 6. 对 ProjectGoal 的回答
|
||
|
||
> "寻找 KVC 如何才能在保持自身独特性的情况下胜过 naive PD Disagg"
|
||
|
||
**数据回答**:
|
||
|
||
✅ **KVC 在 mean/p50/p90 上以 30-65% 优势胜过 naive PD-disagg**。Wall clock 短 27%。
|
||
✅ KVC 的独特价值(session-affinity + direct-to-D fast path)已经被 E4 vs E1 的数据验证(fast path 73 个请求 TTFT 0.185s)。
|
||
❌ KVC 在 p99 长尾上略输(+8% TTFT)。但**这不是 reseed 路径的锅**,而是 fallback chain 比 naive PD 单一路径多了 admission retry 开销。
|
||
⏳ D→P snapshot 即使后续修了 bug 真正生效,也**不会显著降 p99**——因为 reseed 在 tail 中只占 5%。
|
||
|
||
**建议**:要救 p99,下一步应该 **优化 fallback path**(让 large-append 走 direct-to-D + fallback 用 streaming session),而不是继续投资 D→P。
|
||
|
||
---
|
||
|
||
## 7. 实际数字(精确)
|
||
|
||
```
|
||
E1 naive PD E4 KVC + LF + RDMA
|
||
---------------- --------------------
|
||
TTFT mean 90.484 58.831 (-35.0%)
|
||
TTFT p50 88.545 31.028 (-65.0%)
|
||
TTFT p90 175.178 158.920 (-9.3%)
|
||
TTFT p99 207.426 224.769 (+8.4%)
|
||
TTFT max 231.946 238.412 (+2.8%)
|
||
|
||
Lat mean 96.339 63.870 (-33.7%)
|
||
Lat p50 93.166 37.117 (-60.2%)
|
||
Lat p90 180.738 164.742 (-8.8%)
|
||
Lat p99 219.462 233.808 (+6.5%)
|
||
Lat max 288.263 266.631 (-7.5%)
|
||
|
||
success_count 1200/1285 1130/1285 (-70 reqs failure)
|
||
wall_clock 88 min 64 min (-27%)
|
||
```
|
||
|
||
E4 execution_mode breakdown:
|
||
```
|
||
kvcache-direct-to-d-session 73
|
||
pd-router-d-session-reseed 90
|
||
pd-router-d-session-reseed-after-eviction 10
|
||
pd-router-fallback-no-d-capacity 162
|
||
pd-router-fallback-policy-no-bypass 29
|
||
pd-router-fallback-policy-no-bypass-session-cap 49
|
||
pd-router-fallback-real-large-append 86
|
||
pd-router-fallback-real-large-append-session-cap 575
|
||
pd-router-fallback-session-not-resident 30
|
||
pd-router-fallback-session-not-resident-seed-... 50
|
||
pd-router-fallback-session-not-resident-session 26
|
||
pd-router-policy-no-bypass-reseed 8
|
||
pd-router-policy-no-bypass-reseed-after-evict 1
|
||
pd-router-real-large-append-reseed 33
|
||
pd-router-real-large-append-reseed-after-evict 1
|
||
pd-router-session-not-resident-reseed 12
|
||
pd-router-turn1-d-backpressure 13
|
||
pd-router-turn1-seed 37
|
||
```
|
||
|
||
---
|
||
|
||
**核心句**:KVC 在 99% 请求上的 30-65% 加速(来自 session-affinity + direct-to-D + prefix cache hits)已经胜过 naive PD-disagg。1% 的 p99 输给 fallback chain 的 admission retry 开销,与 D→P 设计的 reseed 优化目标完全无关。下一阶段优化重点应该是 fallback path,不是继续加 D→P 砖块。
|