agentic-pd-hybrid/docs/SWEBENCH_EXPERIMENT_RESULTS.md
kzlin c9d350b372 docs: KVC v1-v4 debug journey + raise session soft_cap to 16
Document the iterative debugging from v1 (broken KVC) through v4
(routing fixed + session cap raised), with code-level analysis of
the two main bugs encountered:

1. v2 root cause (mis-diagnosed previously as `allow_local_prefill`):
   `--policy default` for KVC mechanism caused replay's round-robin
   policy and the PD router's round-robin to diverge, sending requests
   with `session_params` to a D worker that did not have the session
   open. Resulted in 56-61% truncation with finish_reason
   "session id X does not exist".
   Fix: use `--policy kv-aware` (sweep_tp1_v3_kvaware.sh) so replay
   emits `x-smg-target-worker` and PD router uses consistent_hashing.

2. v3 new bottleneck: `pd-router-fallback-large-append-session-cap`
   dominated 52-65% of requests. Root cause was hardcoded
   `min(4, ...)` in `_decode_session_soft_cap`. With 7 D workers x 4
   sessions = 28 slots for 52 trace sessions, ~24 sessions starved
   permanently (bimodal direct-to-D rate of 0% or 99%).
   Fix: raise the cap to 16 (replay.py).
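
The slot arithmetic in item 2 can be checked with a short sketch. The helper name `session_slots` and its parameters are hypothetical and are not the signature of `_decode_session_soft_cap`; only the figures (7 D workers, 52 trace sessions, caps of 4 and 16) come from the notes above.

```python
def session_slots(num_decode_workers: int, soft_cap: int, num_sessions: int):
    """Return (total session slots, sessions that cannot get a slot)."""
    slots = num_decode_workers * soft_cap
    starved = max(0, num_sessions - slots)
    return slots, starved


# Compare the old hardcoded cap (4) with the raised cap (16).
for cap in (4, 16):
    slots, starved = session_slots(num_decode_workers=7, soft_cap=cap, num_sessions=52)
    print(f"cap={cap}: {slots} slots, {starved} sessions without a slot")
```

With a cap of 4 the 52 sessions exceed the 28 available slots by 24, which matches the starvation described above; with a cap of 16 the 112 slots cover every session.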

Also includes the v3 finding that direct-to-d-session path P50=0.495s
and TTFT P50=0.043s already beats the 8-way DP baseline (0.65s/0.093s)
- the KVC core mechanism works when fallback paths are avoided.
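
A minimal sketch of how the v2 divergence in item 1 could arise. It assumes, without confirmation from the code, that replay pins each session to a worker round-robin when the session is first seen, while the router advances its own round-robin on every request. The worker names, `pick_by_hash`, and `count_misrouted` are hypothetical, and hashing the session id only stands in for `consistent_hashing`; it is not the router's actual implementation.

```python
import hashlib
from itertools import cycle

# Hypothetical decode worker names (7 D workers, as in the notes above).
WORKERS = [f"decode-{i}" for i in range(7)]


def pick_by_hash(session_id: str) -> str:
    """Deterministic choice: any component hashing the same id agrees."""
    digest = hashlib.sha256(session_id.encode()).digest()
    return WORKERS[int.from_bytes(digest[:8], "big") % len(WORKERS)]


def count_misrouted(session_ids) -> int:
    """Count requests sent to a worker other than the session's home."""
    replay_rr = cycle(WORKERS)   # assumption: advances once per new session
    router_rr = cycle(WORKERS)   # assumption: advances once per request
    home = {}
    misrouted = 0
    for sid in session_ids:
        if sid not in home:
            home[sid] = next(replay_rr)
        if next(router_rr) != home[sid]:
            misrouted += 1
    return misrouted


# 50 requests interleaved across 5 sessions.
requests = [f"s{i % 5}" for i in range(50)]
print("misrouted with two independent round-robins:", count_misrouted(requests))
print("hash routing is stable:", all(pick_by_hash(s) == pick_by_hash(s) for s in requests))
```

The point of the sketch is only that two independent counters drift apart as soon as sessions issue more than one request, whereas a choice derived from the session id (or an explicit target header such as `x-smg-target-worker`) cannot.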

Files:
- docs/KVC_DEBUG_JOURNEY_V1_TO_V4.md: full journey + code location index
- docs/SWEBENCH_EXPERIMENT_{PROGRESS,RESULTS}.md: prior session notes
- scripts/sweep_tp1_v{2,3,4}*.sh: experiment driver scripts
- src/agentic_pd_hybrid/replay.py: cap 4 -> 16, audit fields
- src/agentic_pd_hybrid/pd_router.py: strip session_params from prefill
- src/agentic_pd_hybrid/metrics.py: truncated_request_count

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-28 21:10:41 +08:00


SWE-Bench PD Hybrid Experiment Results

Experiment Configuration

  • Model: Qwen3.5-35B-A3B (MoE, 35B total / 3B active), TP4
  • Hardware: 8x H100 80GB, NVLink, single node
  • Transfer backend: mooncake TCP (loopback)
  • Trace: 52 sessions, 4,449 requests (10% sample of SWE-Bench 500 instances)
  • Time compression: time-scale=10, concurrency-limit=32

Results Summary

Experiment A: pd-disaggregation (baseline)

| Metric | Value |
| --- | --- |
| Run dir | pd-disaggregation-default-20260426T202540Z |
| Requests | 4,449 / 4,449 (100%) |
| Errors | 0 |
| Mean Latency | 1.662s |
| P50 Latency | 0.973s |
| P90 Latency | 3.644s |
| P99 Latency | 7.676s |
| Mean TTFT | 0.445s |
| P50 TTFT | 0.340s |
| P90 TTFT | 0.880s |
| Mean TPOT | 5.20ms |
| Cache Hit Rate | 94.4% (4199/4449) |
| Mean Cached Tokens | 27,794 |
| KV Transfer Blocks | 105,235 |

Experiment B: pd-colo (colocation) — FAILED

| Metric | Value |
| --- | --- |
| Run dir | pd-colo-default-20260426T210129Z |
| Status | CRASHED |
| Error | token_to_kv_pool_allocator memory leak detected! |
| Root Cause | SGLang v0.5.10 `--disaggregation-mode null` is incompatible with Qwen3.5-35B-A3B (Mamba/GDN hybrid) |
| Requests | ~10 / 4,449 (0.2%) |

Conclusion: the currently vendored SGLang does not support colocation mode for this model. Memory management for Mamba models in token_to_kv_pool_allocator needs to be fixed.

Experiment C: kvcache-centric (session-aware PD)

| Metric | Value |
| --- | --- |
| Run dir | kvcache-centric-default-worker-admission-20260426T210800Z |
| Requests | 4,449 total |
| Errors | 4,390 (98.7%) |
| Successful | 59 (1.3%) |
| Mean Latency (success) | 1.238s |
| P50 Latency (success) | 0.484s |
| P90 Latency (success) | 2.550s |
| Mean TTFT (success) | 0.179s |
| P50 TTFT (success) | 0.081s |
| Mean TPOT (success) | 4.70ms |
| Direct-to-D session requests | 56 |
| KV Transfer (actual) | 196 blocks (vs 105,235 planned) |

Execution mode distribution:

  • kvcache-centric (failed): 4,390
  • kvcache-direct-to-d-session (success): 56
  • pd-router-* variants: 3
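
As a quick consistency check on the distribution above, the sketch below tallies the three reported counts against the trace total and prints each mode's share. The dictionary keys simply reuse the labels from the list; nothing here reads the real run directory.

```python
# Counts copied from the execution mode distribution above.
mode_counts = {
    "kvcache-centric (failed)": 4390,
    "kvcache-direct-to-d-session (success)": 56,
    "pd-router-* variants": 3,
}
TOTAL_REQUESTS = 4449

total = sum(mode_counts.values())
assert total == TOTAL_REQUESTS, f"mode counts sum to {total}, expected {TOTAL_REQUESTS}"

# Share of the trace handled by each execution mode, in percent.
shares = {mode: 100.0 * count / total for mode, count in mode_counts.items()}
for mode, share in shares.items():
    print(f"{mode}: {share:.2f}%")
```

The 56 direct-to-d-session requests and the 3 pd-router-* requests together account for the 59 successful requests in the table.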

Key Analysis

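1. pd-disaggregation (A): stable and reliable

  • 100% success rate, 0 errors
  • Mean latency of 1.66s is reasonable (it includes P→D KV transfer overhead)
  • The 94.4% cache hit rate indicates that the prefix cache works well on the P side
  • KV transfer of about 105K blocks is the main source of overhead
  • Suitable for production use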

2. pd-colo (B): unusable

  • The Mamba/GDN hybrid architecture of Qwen3.5-35B-A3B triggers a memory leak under disaggregation-mode null
  • This is an SGLang bug, not an agentic-pd-hybrid problem
  • Retest once SGLang ships a fix

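3. kvcache-centric (C): admission too conservative

  • The 98.7% error rate suggests that admission control rejected almost all requests
  • kvcache-seed-min-turn-id=2 filtered out the turn 1 seed (expected behavior)
  • However, the vast majority of turn 2+ requests also took the kvcache-centric path and then failed
  • Possible causes:
    • The worker admission query found no KV cache for the session on the D side (because turn 1 was not seeded)
    • A backlog in the D-side transfer queue caused admission to reject requests
  • The 56 successful direct-to-d-session requests performed very well: P50 TTFT of 0.08s, about 4x faster than pd-disagg's 0.34s
  • Admission parameters need tuning, or kvcache-seed-min-turn-id=1 should be used to allow turn 1 seeding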

4. kvcache-centric successful requests vs. pd-disaggregation

| Metric | pd-disagg (A) | kvcache-centric (C, success only) | Delta |
| --- | --- | --- | --- |
| Mean Latency | 1.662s | 1.238s | -25.5% |
| P50 Latency | 0.973s | 0.484s | -50.3% |
| Mean TTFT | 0.445s | 0.179s | -59.8% |
| P50 TTFT | 0.340s | 0.081s | -76.2% |
| Mean TPOT | 5.20ms | 4.70ms | -9.6% |
| Actual KV Transfer | 105,235 blk | 196 blk | -99.8% |

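When kvcache-centric succeeds (59 of 4,449 requests in this run), the performance gain is substantial:

  • TTFT drops by 60-76% (the D side appends directly, with no P→D transfer)
  • End-to-end latency drops by 25-50%
  • KV transfer drops by 99.8%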
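
The Delta column can be reproduced from the two value columns with a relative-change formula. The sketch below is only a recomputation of the reported numbers; the helper `pct_delta` and the dictionary layout are illustrative and are not taken from metrics.py.

```python
def pct_delta(baseline: float, candidate: float) -> float:
    """Relative change of candidate versus baseline, in percent."""
    return (candidate - baseline) / baseline * 100.0


# (pd-disagg A, kvcache-centric C success-only), copied from the table above.
reported = {
    "Mean Latency (s)": (1.662, 1.238),
    "P50 Latency (s)": (0.973, 0.484),
    "Mean TTFT (s)": (0.445, 0.179),
    "P50 TTFT (s)": (0.340, 0.081),
    "Mean TPOT (ms)": (5.20, 4.70),
    "Actual KV Transfer (blocks)": (105235, 196),
}

deltas = {name: pct_delta(a, c) for name, (a, c) in reported.items()}
for name, delta in deltas.items():
    print(f"{name}: {delta:+.1f}%")
```

Because the C column covers only the successful requests, these deltas describe the direct-to-D path when it works and not the run as a whole.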

Recommended Next Steps

  1. Fix pd-colo: file an SGLang issue about the memory leak for Mamba/GDN models under disaggregation-mode null
  2. Tune kvcache-centric admission:
    • Try --kvcache-seed-min-turn-id 1 to allow turn 1 seeding
    • Relax the --kvcache-seed-max-decode-transfer-queue-reqs threshold
    • Use --kvcache-admission-mode router (shadow state, off the critical path)
  3. Increase D-side memory: adjust --mem-fraction-static to give the KV cache more room
  4. Multi-P/D configuration: test a 2P2D (TP2) configuration to increase parallelism

Experiment Date

2026-04-27