kzlin 6572d7f3f4 docs: add v5 chapter (Option D worker-mode admission) and rename to V1_TO_V5

v5 sweep (sweep_tp1_v5_optD.sh) lands the previously-deferred Option D:
worker admission_mode authoritative for direct_append + seed + reseed,
bypassing replay's local _decode_session_soft_cap.

Key findings now documented:
- errors collapse from 9-10% to 0.2% (mooncake timeouts gone)
- session-cap fallback rises 33-35% -> 46-51% — D's true KV pool is the
  binding constraint, not replay's estimator; v4's "low fallback" was
  hiding capacity overruns as transfer-timeout errors
- direct-to-D subset latency unchanged from v4 (admission overhead negligible)
- new bottleneck: D's physical KV pool — points v6 at prefill backup release
  timing, priority eviction tuning, chunked seed, cross-D session migration,
  and real RDMA

Also adds a 5th lesson on errors-vs-fallback reciprocity and updates the
code index with the v5 endpoint extension and new CLI knobs.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

2026-04-29 16:13:25 +08:00

19 KiB

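KVC Experiment Pitfalls and Code Bug Analysis (v1 → v5)

This document records the pitfalls hit during the KVC experiments from v1 to v5, how the errors were diagnosed, and the code bugs that were eventually pinned down. Model: Qwen3-30B-A3B (TP1). Hardware: single node, 8×H100 80GB. Trace: qwen35-swebench-50sess.jsonl (4449 requests, 52 sessions).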

TL;DR

Version	Key change	Truncation rate	direct-to-D share	P50	Main bottleneck
v1 (smoke / early)	mechanism works end to end	-	-	-	-
v2	KVC + `--policy default`	56.8% / 61.4%	<0.1%	0.08s*	Routing mismatch (default policy)
v3	KVC + `--policy kv-aware`	0.9%	30-42%	1.5-1.8s	session-cap fallback (52-65%)
v4	v3 + soft_cap 4→16	1.0%	54-58%	1.08 / 0.84s	session-cap fallback still 35%, plus 9-10% mooncake errors
v5	Option D: worker-mode drives seed/reseed	0.9%	41-45%	1.59 / 1.31s	D's real KV pool capacity is insufficient → fallback rises to 46-51%

* The v2 P50 is a bogus number: more than half of the requests were aborted after generating a single token.

v2 pitfall: the default policy is fundamentally incompatible with the KVC mechanism

Symptoms

Running scripts/sweep_tp1_v2_fixed.sh produced:

Exp1 (8-way DP, baseline): 4449/4449 succeeded, P50=0.65s, error=0
Exp2 (1P7D KVC): 2524 truncated (56.8%), 18 errors, P50=0.08s* (bogus)
Exp3 (2P6D KVC): 2733 truncated (61.4%), 17 errors, P50=0.08s* (bogus)

Every truncated request has actual_output_tokens=1 and finish_reason="abort: session id X does not exist".

The wrong early diagnosis

The earlier RESULTS_SUMMARY.md blamed SGLang's --disaggregation-decode-allow-local-prefill flag, on the theory that the D worker still ran a local prefill even when bootstrap_room was set. That diagnosis is wrong, as _should_allow_local_prefill_on_decode at scheduler.py:1975-1980 shows:

def _should_allow_local_prefill_on_decode(self, req: Req) -> bool:
    return (
        self.disaggregation_mode == DisaggregationMode.DECODE
        and self.server_args.disaggregation_decode_allow_local_prefill
        and req.bootstrap_room is None  # ← a request with bootstrap_room never takes the local-prefill path
    )

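Requests on the KVC reseed path always carry a bootstrap_room, so they can never trigger local prefill.

Actual root cause: replay and the PD Router run misaligned round-robins

The experiment scripts ran KVC with --policy default while the baseline used --policy kv-aware. benchmark.py:287-300 shows how different the two are: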

def _decode_policy_for(policy_name: str) -> str:
    if policy_name == "sticky":      return "manual"
    if policy_name == "kv-aware":    return "consistent_hashing"
    return "round_robin"  # default

def _header_mode_for(policy_name: str) -> str:
    if policy_name == "sticky":      return "routing-key"
    if policy_name == "kv-aware":    return "target-worker"
    return "none"  # default

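With the default policy plus the KVC mechanism, a request goes through these steps:

1. The replay policy (policies.py:DefaultPolicy) picks a D by round-robin, say D-3.
2. Replay calls open_session(session_id=X) on D-3 (replay.py:1722-1731).
3. Replay sends the request through the PD Router (with session_params), but header_mode=none, so no routing header is sent.
4. The PD Router (pd_router.py:_select_decode_index) sees decode_policy=round_robin, round-robins with its own independent counter, and sends the request to D-5.
5. D-5's scheduler sees a session_id in session_params, but its own session_controller has no such session (it lives on D-3), so it aborts with "Invalid request: session id X does not exist" (scheduler.py:1824-1836).

The two round-robin counters are independent. A single slip is enough (any concurrency, or any direct-to-D request that bypasses the router, causes one), and after that they never line up again.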
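The drift between two independent counters can be reproduced with a toy model. This is a minimal sketch, not the replay or router code: `RoundRobin` and `simulate` are hypothetical names, and the only assumption is that one request bypasses the router so that only replay's counter advances.

```python
class RoundRobin:
    """Minimal round-robin counter over n workers."""

    def __init__(self, n: int) -> None:
        self.n = n
        self.i = 0

    def next(self) -> int:
        worker = self.i % self.n
        self.i += 1
        return worker


def simulate(num_requests: int, bypass_at: int, num_decode: int = 7) -> list:
    """Return, per request, whether it reaches the D that holds its session."""
    replay_rr = RoundRobin(num_decode)  # replay's counter: where the session is opened
    router_rr = RoundRobin(num_decode)  # router's counter: where the request lands
    hits = []
    for i in range(num_requests):
        opened_on = replay_rr.next()
        if i == bypass_at:
            # Direct-to-D request: the router never sees it, so only replay's
            # counter advances and the two counters drift apart by one.
            hits.append(True)
            continue
        hits.append(opened_on == router_rr.next())
    return hits


hits = simulate(num_requests=20, bypass_at=5)
print(hits)
```

In this model every request up to and including the bypass lands correctly, and every request after it misses, because the offset between the two counters never closes again.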

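Why does turn 0 not fail?

Turn 0 goes through _invoke_plain_router (replay.py:1894) without session_params and is handled as an ordinary PD-disaggregation request, so any D can serve it. Only from turn 1 onward does the request take the KVC path with session_params and run into the routing mismatch.

Confirming with the data (per-session pattern)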

session 11360 (58 turns): pattern = .TTTTT.TTTTTTT.TTTTTT...   ← turn 0 OK, turns 1+ almost all T (T = truncated)
session 18720 (87 turns): pattern = .TTTTTTTTTTTTTTTTTT...

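Every D worker received requests from all 52 sessions, because round-robin scatters each session's turns across all workers. With session affinity, each D would ideally see only about 7-8 sessions.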
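A per-session pattern like the one above can be rebuilt from the result records. The sketch below assumes each record is a dict; `actual_output_tokens` and `finish_reason` are the fields quoted earlier, while the `session_id` and `turn_id` keys and the helper names are assumptions made for illustration.

```python
from collections import defaultdict


def is_truncated(record: dict) -> bool:
    # Signature of the v2 failure: one token generated, then an abort.
    return (
        record["actual_output_tokens"] == 1
        and str(record["finish_reason"]).startswith("abort")
    )


def session_patterns(records: list) -> dict:
    """Map session_id -> string with '.' for OK turns and 'T' for truncated turns."""
    by_session = defaultdict(list)
    for record in records:
        by_session[record["session_id"]].append(record)
    patterns = {}
    for session_id, turns in by_session.items():
        turns.sort(key=lambda r: r["turn_id"])
        patterns[session_id] = "".join("T" if is_truncated(r) else "." for r in turns)
    return patterns


# Tiny synthetic example (not real trace data).
abort = "abort: session id X does not exist"
demo = [
    {"session_id": 1, "turn_id": 2, "actual_output_tokens": 1, "finish_reason": abort},
    {"session_id": 1, "turn_id": 0, "actual_output_tokens": 120, "finish_reason": "stop"},
    {"session_id": 1, "turn_id": 1, "actual_output_tokens": 1, "finish_reason": abort},
]
print(session_patterns(demo))
```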

Fix

The only correct fix is to switch the KVC policy from default to kv-aware:

- --policy default
+ --policy kv-aware

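KvAwarePolicy (policies.py:146-187) does two things:

It scores every D with _overlap_blocks + sticky_bonus, so a session naturally sticks to the same D (session affinity).
It sets header_mode=target-worker and sends the x-smg-target-worker header.
The PD Router then runs in consistent_hashing mode: when it sees the header it uses that worker directly instead of round-robining.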
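To make the scoring idea concrete, here is a minimal sketch of an "overlap plus sticky bonus" selection. It is not the code in policies.py: `overlap_blocks`, `pick_decode_worker`, the block representation, and the bonus value are all hypothetical, and the real KvAwarePolicy.select may weigh things differently.

```python
def overlap_blocks(request_blocks: list, cached_blocks: list) -> int:
    """Count how many leading blocks of the request are already cached."""
    n = 0
    for wanted, cached in zip(request_blocks, cached_blocks):
        if wanted != cached:
            break
        n += 1
    return n


def pick_decode_worker(request_blocks, worker_caches, last_worker=None, sticky_bonus=1.0):
    """Pick the worker with the highest (prefix overlap + sticky bonus) score."""
    scores = {}
    for worker, cached in worker_caches.items():
        score = float(overlap_blocks(request_blocks, cached))
        if worker == last_worker:
            score += sticky_bonus  # prefer the worker that served this session last
        scores[worker] = score
    return max(scores, key=scores.get)


caches = {"D-0": ["a"], "D-1": ["a"], "D-2": []}
# Equal overlap on D-0 and D-1: the sticky bonus keeps the session on D-1.
print(pick_decode_worker(["a", "b"], caches, last_worker="D-1"))
```

The point of the bonus is that a session does not hop between workers when several of them happen to hold the same prefix.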

v3 with the kv-aware policy: routing is fixed, but a new bottleneck appears

scripts/sweep_tp1_v3_kvaware.sh switches every KVC experiment to --policy kv-aware. Results:

Metric	v2 1P7D (default)	v3 1P7D (kv-aware)	v3 2P6D	8-way DP baseline
Truncated	56.8%	0.9%	0.9%	1.5%
Errors	18	363 (8.2%)	9	0
Mean	4.74s	4.88s	3.58s	1.43s
P50	0.08s* (bogus)	1.75s	1.52s	0.65s
P90	12.14s	12.67s	9.23s	3.61s
TTFT P50	-	0.36s	0.33s	0.09s

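✅ Truncation drops from 56.8% to 0.9%; the routing problem is fully solved.

❌ But P50 is still 2-3× the baseline.

The direct-to-D path performs very well (what KVC is supposed to look like)

Broken down by execution_mode:

Path	Exp1 1P7D share	Exp1 1P7D P50	Exp1 1P7D TTFT P50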
`kvcache-direct-to-d-session` ✨	42.0%	0.495s	0.043s
`pd-router-fallback-large-append-session-cap` 🔥	52.6%	5.6s	3.7s

On the direct-to-D path:

P50 = 0.495s (about 24% faster than the baseline's 0.65s)
TTFT P50 = 0.043s (about 2× faster than the baseline's 0.093s)
KV transfer = 0 (P is not involved; this is pure append-prefill on D)

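This is the real value of KVC. But only 30-42% of requests reach this path. (The 42.0% above appears to be computed over non-error requests: the 1716 reused-session requests are 42.0% of the 4086 non-error requests and 38.6% of all 4449, which is the figure the later v3/v4 comparison table reports.)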

New bottleneck: session-cap fallback takes 52-65%

pd-router-fallback-large-append-session-cap accounts for 52.6% of 1P7D and 65.4% of 2P6D. On this path the router wanted to open a new session on D, but admission rejected it ("d-session-cap"), so the request fell back to the plain router (full prefill on P, transfer to D, no session reuse).

Bimodal session distribution (starvation)

Session	Total turns	Direct-to-D	Session-cap fallback
22080	129	98%	0%
3840	118	97%	0%
70560	150	0%	99%
39360	148	0%	99%
61600	117	0%	99%

Sessions are either entirely lucky or entirely starved: a textbook bimodal distribution.

Root cause: hard-coded cap=4

The original code of replay.py:_decode_session_soft_cap:

def _decode_session_soft_cap(...) -> int:
    target_tokens = max(1, _estimate_session_resident_tokens(request))
    usable_capacity_tokens = _usable_capacity_tokens(residency, server_url)
    ...
    if usable_capacity_tokens <= 0:
        return 4
    return max(1, min(4, usable_capacity_tokens // target_tokens))
    #              ^^^ hard-coded upper bound of 4

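7 D workers × at most 4 sessions per D = 28 session slots in total. The trace has 52 sessions, so 24 sessions can never get a slot.

A startup race decides which sessions are the "lucky" ones: every later turn of the first 28 sessions to squeeze in takes direct-to-D (fast), while the remaining 24 sessions stay on the session-cap fallback (slow) forever.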
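The slot arithmetic and the first-come-first-served effect can be checked with a small simulation. This is a sketch under simplifying assumptions that are mine, not the replay code's: sessions arrive in random order, each takes one slot on the least-loaded D, and no slot is ever released. `simulate_slots` is a hypothetical helper.

```python
import random


def simulate_slots(num_sessions=52, num_decode=7, cap=4, seed=0):
    """Return (lucky, starved) session ids under a per-worker session cap."""
    rng = random.Random(seed)
    arrival_order = list(range(num_sessions))
    rng.shuffle(arrival_order)  # startup race: arrival order is effectively random
    free = [cap] * num_decode   # free session slots per D worker
    lucky, starved = [], []
    for session in arrival_order:
        worker = max(range(num_decode), key=lambda w: free[w])
        if free[worker] > 0:
            free[worker] -= 1
            lucky.append(session)    # holds a slot: later turns go direct-to-D
        else:
            starved.append(session)  # no slot anywhere: every turn falls back
    return lucky, starved


lucky4, starved4 = simulate_slots(cap=4)
lucky16, starved16 = simulate_slots(cap=16)
print(len(lucky4), len(starved4))    # slot counts with cap=4
print(len(lucky16), len(starved16))  # slot counts with cap=16
```

With cap=4 the model admits 28 sessions and starves 24, and which sessions are starved depends only on arrival order. With cap=16 the session-count limit no longer binds in this model.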

v4 improvement: raise the hard cap from 4 to 16

A small change to replay.py:_decode_session_soft_cap:

-    if usable_capacity_tokens <= 0:
-        return 4
-    return max(1, min(4, usable_capacity_tokens // target_tokens))
+    if usable_capacity_tokens <= 0:
+        return 16
+    return max(1, min(16, usable_capacity_tokens // target_tokens))

7 D × 16 = 112 slots, far more than the 52 sessions need.

v4 actual results (vs v3 1P7D / 2P6D)

Metric	v3 1P7D	v4 1P7D	v3 2P6D	v4 2P6D	baseline 8DP
Errors	363 (8%)	435 (10%)	9 (0%)	403 (9%)	0
Truncated	42	43	42	36	68
direct-to-D	38.6%	54.3%	30.5%	58.0% ⭐	-
session-cap fallback	48.3%	37.4%	65.4%	34.7%	-
Session reused	1716	2180	1358	2348	-
KV transfer blocks	62K	53K	79K	51K	-
Mean	4.88s	4.21s	3.58s	2.51s	1.43s
P50	1.75s	1.08s	1.52s	0.84s	0.65s
P90	12.67s	13.38s	9.23s	6.51s	3.61s
P99	28.72s	24.45s	18.70s	18.34s	8.38s
TTFT P50	0.36s	0.056s	0.33s	0.051s ⭐	0.094s
TTFT P90	10.97s	11.90s	6.95s	2.64s	0.26s

✓ direct-to-D share rises from 30-38% in v3 to 54-58% in v4

✓ Session reuse +27% (1P7D) / +73% (2P6D)

✓ KV transfer volume -15% (1P7D) / -36% (2P6D)

✓ TTFT P50 beats the baseline by 46% (0.051s vs 0.094s)

The direct-to-D path beats the baseline across the board (the real value of KVC)

Config	n	Lat P50	Lat P90	TTFT P50	TTFT P90
baseline 8DP	4381	0.66s	3.65s	0.094s	0.256s
v4 1P7D direct-to-D	2179	0.495s	3.03s	0.044s	0.055s
v4 2P6D direct-to-D	2348	0.499s	2.86s	0.043s	0.054s

The direct-to-D subset relative to the baseline:

P50 24-25% faster
P90 17-22% faster
TTFT P50 53-54% faster
TTFT P90 about 79% faster
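The relative improvements follow directly from the direct-to-D table above. A minimal sketch of the arithmetic, using only the table's values (`pct_faster` is a hypothetical helper):

```python
baseline = {"lat_p50": 0.66, "lat_p90": 3.65, "ttft_p50": 0.094, "ttft_p90": 0.256}
direct_to_d = {
    "v4 1P7D": {"lat_p50": 0.495, "lat_p90": 3.03, "ttft_p50": 0.044, "ttft_p90": 0.055},
    "v4 2P6D": {"lat_p50": 0.499, "lat_p90": 2.86, "ttft_p50": 0.043, "ttft_p90": 0.054},
}


def pct_faster(base: float, value: float) -> float:
    """Percentage reduction of value relative to base."""
    return round(100.0 * (1.0 - value / base), 1)


speedups = {
    config: {metric: pct_faster(baseline[metric], value) for metric, value in row.items()}
    for config, row in direct_to_d.items()
}
for config, row in speedups.items():
    print(config, row)
```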

Overall performance (errors and truncated requests removed) vs baseline

Config	clean	Mean	P50	P90	P99
baseline 8DP	4381	1.45s	0.66s	3.65s	8.38s
v4 2P6D	4010	2.53s	0.85s	6.55s	18.33s

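Against the baseline, P50 is about 29% slower, P90 about 79% slower, and P99 about 119% slower. Even with error requests excluded, the overall numbers still lose to the baseline, because 35% of requests are pushed onto the fallback path.

New bottleneck 1: 35% of requests still take the session-cap fallback

With the cap raised to 16, the real limit becomes the capacity-based term: min(16, usable_capacity_tokens // target_tokens).

target_tokens = input + output, commonly 50-100K in agentic workloads
D's KV pool ≈ 100-150K tokens (80GB H100, mem_fraction=0.835)
usable / target = 1-2, nowhere near 16 → the effective cap is the small number that the capacity term produces

Solving this requires changing the capacity-based estimate (or adopting Option D and letting D decide for itself).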
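The capacity term can be evaluated at the endpoints quoted above. The sketch mirrors only the return expression of _decode_session_soft_cap with the v4 constant; `effective_cap` is a hypothetical stand-in, and the pool and target sizes are the rough ranges from the text, not measured values.

```python
def effective_cap(usable_capacity_tokens: int, target_tokens: int, hard_cap: int = 16) -> int:
    """Session cap per D: the hard cap, limited by how many sessions fit in the pool."""
    if usable_capacity_tokens <= 0:
        return hard_cap
    return max(1, min(hard_cap, usable_capacity_tokens // max(1, target_tokens)))


caps = {
    (pool, target): effective_cap(pool, target)
    for pool in (100_000, 150_000)
    for target in (50_000, 100_000)
}
for (pool, target), cap in caps.items():
    print(f"pool={pool} target={target} -> cap={cap}")
```

At these endpoints the cap lands between 1 and 3, so raising the hard limit from 4 to 16 changes nothing once the capacity term is the smaller one.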

New bottleneck 2: 9-10% errors (mooncake transfer timeouts)

The P-side log shows:

KVTransferError: Failed to send kv chunk of <bootstrap_room> to 10.45.7.165:40319
Sync batch data transfer timeout after 32722558107ns  (a timeout of about 32.7 seconds)
Decode instance could be dead, remote mooncake session ... is not alive

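Characteristics:

All errors occur after the first 44.8% of the run (system pressure accumulates)
98% of errors fall on turn ≥ 31 (requests with large inputs)
With cap=4 in v3, 1P7D already had 363 errors (concentrated on a single D); cap=16 in v4 spreads the pressure evenly but at a larger scale

This is mooncake TCP loopback hitting timeouts once concurrency rises, not a logic bug in SGLang. Possible fixes:

Lengthen the mooncake transfer timeout (currently about 32s)
Limit the number of concurrent in-flight transfers
Switch to RDMA (loopback is a single-node simulation; production would use real RDMA)
Chunked KV transfer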
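Of the fix directions above, bounding the number of in-flight transfers is the easiest to illustrate in isolation. The sketch below uses a plain asyncio.Semaphore with a sleep standing in for the KV transfer; it does not touch mooncake or SGLang, and `bounded_transfers` and its parameters are hypothetical.

```python
import asyncio


async def bounded_transfers(num_transfers: int, max_inflight: int, transfer_seconds: float = 0.01) -> int:
    """Run fake transfers with at most max_inflight active; return the observed peak."""
    semaphore = asyncio.Semaphore(max_inflight)
    inflight = 0
    peak = 0

    async def one_transfer() -> None:
        nonlocal inflight, peak
        async with semaphore:  # wait here while max_inflight transfers are active
            inflight += 1
            peak = max(peak, inflight)
            await asyncio.sleep(transfer_seconds)  # stand-in for the real KV transfer
            inflight -= 1

    await asyncio.gather(*(one_transfer() for _ in range(num_transfers)))
    return peak


peak = asyncio.run(bounded_transfers(num_transfers=32, max_inflight=4))
print(peak)
```

Transfers beyond the limit wait in the queue instead of competing for loopback bandwidth, which trades queueing delay for a lower chance of hitting the transfer timeout.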

v5 lands Option D: worker-mode drives seed/reseed

scripts/sweep_tp1_v5_optD.sh actually implements Option D in code. The core change: --kvcache-admission-mode switches from local (estimated by replay) to worker (decided by D), and this now covers all of the direct_append, seed, and reseed paths.

Key code changes

SGLang side: the admit_direct_append endpoint in scheduler.py gains a mode field that accepts direct_append | seed. In seed mode, D actually reserves KV pool blocks and proactively calls maybe_trim_decode_session_cache to run LRU eviction.
Replay side: in replay.py, reseed, turn-1 seed, and large-append-reseed all go through the same admit endpoint; _decode_session_soft_cap is bypassed entirely in worker mode.
New run-time flags: --kvcache-admission-mode worker, --kvcache-seed-min-turn-id 1, --kvcache-seed-max-inflight-decode -1, --kvcache-prefill-backup-policy release-after-transfer, --kvcache-prefill-priority-eviction.

Hypotheses

The 35% session-cap fallback in v4 comes from replay's stale view plus a conservative capacity-based calculation → letting D inspect its own KV pool should win that 35% back.
D's own LRU eviction is more accurate than the reservation logic replay wrote for itself, so more sessions should be able to seed.

v5 actual results (vs v4, same configurations)

Metric	v4 1P7D	v5 1P7D	v4 2P6D	v5 2P6D	baseline 8DP
Errors	435 (10%)	9 (0.2%) ⭐	403 (9%)	9 (0.2%) ⭐	0
Truncated	43	42	36	42	68
direct-to-D	54.3%	44.7% ↓	58.0%	41.3% ↓	-
session-cap fallback	37.4%	45.6% ↑	34.7%	50.6% ↑	-
no-d-capacity fallback	0.3%	1.2%	0.2%	0.8%	-
pd-router-turn1-seed (newly visible)	-	1.2%	-	1.1%	-
pd-router-d-session-reseed (newly visible)	-	4.8%	-	3.4%	-
pd-router-large-append-reseed (newly visible)	-	1.0%	-	1.0%	-
Session reused	2180	1990	2348	1837	-
KV transfer blocks	53K	66K	51K	69K	-
Mean	4.21s	5.18s	2.51s	3.49s	1.45s
P50	1.08s	1.59s	0.84s	1.31s	0.66s
P90	13.38s	14.67s	6.51s	9.09s	3.65s
P99	24.45s	26.09s	18.34s	24.92s	8.38s
TTFT P50	0.056s	0.21s	0.051s	0.24s	0.094s
TTFT P90	11.90s	13.06s	2.64s	6.90s	0.26s

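✅ Reliability improves sharply: mooncake transfer-timeout errors fall from 9-10% to 0.2%. Deciding on D's real capacity avoids v4's failure chain of "optimistic admit → timeout about 30s later".

✅ The reseed and turn1-seed paths show up explicitly for the first time, which confirms that the admission endpoint really takes effect in seed mode.

❌ session-cap fallback goes up instead of down (37→46% and 35→51%). This indicates that v4's local soft_cap was more optimistic than D's real capacity: requests were admitted, immediately ran out of memory, and were counted as errors rather than fallbacks.

❌ Direct consequence: the direct-to-D share drops and overall latency gets worse across the board. P50/P90/P99 and TTFT all regress.

The direct-to-D subset is still solid (the real value of KVC remains)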

Config	n	Lat P50	Lat P90	TTFT P50	TTFT P90
baseline 8DP	4381	0.66s	3.65s	0.094s	0.256s
v4 2P6D direct-to-D	2348	0.499s	2.86s	0.043s	0.054s
v5 1P7D direct-to-D	1990	0.475s	3.04s	0.043s	0.055s
v5 2P6D direct-to-D	1837	0.483s	3.04s	0.043s	0.054s

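Tail latency and TTFT on direct-to-D are almost identical to v4 (the endpoint's decision overhead is negligible). The v5 regression is not the path itself getting slower; it is more requests being pushed to fallback.

The fallback path is even worse than in v4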

Config	n	Lat P50	Lat P90	TTFT P50
v5 1P7D session-cap fallback	2027	6.38s	17.47s	4.49s
v5 2P6D session-cap fallback	2253	3.13s	11.25s	0.89s

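Because the fallback share rises and this path is already roughly an order of magnitude slower than direct-to-D, the overall mean is dragged down even further.

The bottleneck v5 really exposes: the physical capacity of D's KV pool

Once the admission decision is handed to D, the bottleneck changes from "replay estimates too rigidly" to "D really cannot hold it":

80GB H100 × mem_fraction_static=0.835 → a single-GPU D has a KV pool of ≈ 100-150K tokens
A long-context agentic session has a per-turn footprint of 50-100K
A single D can therefore hold only 2-3 sessions at once → fitting the trace's 52 sessions onto 7 D workers is essentially impossible

v4's cap=16 "looked good" partly because the local soft_cap never actually checked D's free pool, so it opened many sessions that were bound to fail (counted as errors rather than fallbacks). v5 turns those into "honest rejections": the price of the jump in reliability is seeing the real capacity ceiling.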
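The way the same overload shows up as errors under one admission mode and as fallbacks under the other can be illustrated with a toy model of a single D. This is a sketch of the interpretation above, not of the scheduler: `admit_sessions`, the session sizes, and the "admitted but does not fit means error" rule are assumptions chosen for illustration.

```python
def admit_sessions(session_tokens, pool_tokens, mode, local_cap=16):
    """Toy admission model for one D worker.

    mode="worker": D checks its real free pool and rejects up front (fallback).
    mode="local":  the caller only counts sessions against local_cap, so a
                   session that does not fit is admitted and fails later (error).
    """
    if mode not in ("local", "worker"):
        raise ValueError(f"unknown mode: {mode}")
    used = 0
    opened = 0
    outcome = {"admitted": 0, "fallback": 0, "error": 0}
    for tokens in session_tokens:
        fits = used + tokens <= pool_tokens
        if mode == "worker":
            if fits:
                used += tokens
                outcome["admitted"] += 1
            else:
                outcome["fallback"] += 1  # explicit, honest rejection
        elif opened >= local_cap:
            outcome["fallback"] += 1      # local session-count cap reached
        else:
            opened += 1
            if fits:
                used += tokens
                outcome["admitted"] += 1
            else:
                outcome["error"] += 1     # optimistic admit that cannot be held
    return outcome


sessions = [60_000] * 6  # six sessions of 60K tokens against a 150K-token pool
local_result = admit_sessions(sessions, 150_000, "local")
worker_result = admit_sessions(sessions, 150_000, "worker")
print(local_result)
print(worker_result)
```

Both modes end up holding the same two sessions. The four that do not fit are reported as errors in local mode and as fallbacks in worker mode, which is the reciprocity seen between v4 and v5.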

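What v6 should target

Open up physical capacity management on D rather than tuning replay again:

1. Release the prefill backup earlier (release-after-transfer is already enabled but may not be timely enough) → keep backup blocks on P from occupying the KV pool for long.
2. Tune the priority eviction strategy (--kvcache-prefill-priority-eviction is already on): the current LRU may evict hot sessions by mistake; eviction should be weighted by per-session hit frequency and recency.
3. Chunked / streamed seed: do not reserve capacity for the whole prompt at once; spread it across chunks.
4. Cross-D session migration: when one D is full while a neighboring D is idle, migrate proactively instead of falling straight back to P.
5. Real multi-node RDMA: single-node mooncake loopback is one of the root causes of the errors; only multi-node plus RDMA makes the KV transfer after prefill backup release truly stable.

Engineering effort: items 1-3 are changes inside SGLang (scheduler.py + session_controller.py), item 4 requires a router protocol extension, and item 5 is a deployment change.

Index of key files and code locations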

Topic	Code location
Replay policy round-robin	`policies.py:63-67` `RoutingState.next_decode_worker_id`
KV-aware policy (session affinity)	`policies.py:146-187` `KvAwarePolicy.select`
PD router decode selection	`pd_router.py:51-74` `_select_decode_index`
Header construction	`replay.py:2407-2424` `_build_headers`
Policy → router config mapping	`benchmark.py:287-300` `_decode_policy_for/_header_mode_for`
Session admission soft cap	`replay.py:889-905` `_decode_session_soft_cap`
Existing D-side admission endpoint	`scheduler.py:3497-3580` `admit_direct_append` (extended in v5 to support `mode=seed`)
Worker-mode admission callers	`replay.py` reseed / turn1-seed / large-append-reseed paths
Prefill backup release policy (introduced in v5)	`--kvcache-prefill-backup-policy release-after-transfer`
Prefill priority eviction (introduced in v5)	`--kvcache-prefill-priority-eviction`
Error when a session is not found on D	`scheduler.py:1824-1836`
`_should_allow_local_prefill_on_decode`	`scheduler.py:1975-1980`
Reseed flow entry point	`replay.py:1665-1809` `_invoke_kvcache_seeded_router`
Direct-to-D flow	`replay.py:2351-2398` `_invoke_decode_session_direct`

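Lessons learned

1. Policy and mechanism are two orthogonal dimensions. --policy default is not a harmless default: it really is round-robin with no session affinity. The KVC mechanism must be paired with a session-affine policy.
2. Do not blindly trust a previous agent's RESULTS_SUMMARY. The v2 diagnosis ("local prefill bug") and the actual finish_reason ("session id does not exist") do not match at all. Every diagnosis must be cross-checked against raw fields such as finish_reason and execution_mode.
3. A bimodal distribution is a strong signal of starvation. In the v3 data some sessions take the fast path nearly 100% of the time and others the slow path nearly 100% of the time, which almost certainly means first-come-first-served contention for some resource. On seeing this pattern, look immediately for a hard-coded cap or a globally shared resource.
4. Measure by group, not by the overall average. The overall v3 P50 of about 1.5s looks slower than the baseline, but the direct-to-D subset on its own has P50=0.495s and already beats the baseline. The overall figure is dragged down by the fallback path, yet the core value of KVC is real.
5. Errors and fallback are two faces of the same resource pressure. v4's "low fallback rate + high error rate" is not a better outcome; it disguises capacity overruns as timeout failures instead of explicit rejections. Once v5 hands the decision to real capacity, fallback rises and errors fall, which is the more honest metric; do not be misled by v4's fallback number. When the error rate and the fallback rate move in opposite directions, suspect that the admission decision is lying.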
