# SWE-Bench PD Hybrid Experiment Results

## Experiment Configuration

- **Model**: Qwen3.5-35B-A3B (MoE, 35B total / 3B active), TP4
- **Hardware**: 8x H100 80GB, NVLink, single node
- **Transfer backend**: mooncake TCP (loopback)
- **Trace**: 52 sessions, 4,449 requests (10% sample of SWE-Bench 500 instances)
- **Time compression**: time-scale=10, concurrency-limit=32

## Results Summary

### Experiment A: pd-disaggregation (baseline)

| Metric | Value |
|--------|-------|
| Run dir | `pd-disaggregation-default-20260426T202540Z` |
| Requests | 4,449 / 4,449 (100%) |
| Errors | 0 |
| **Mean Latency** | **1.662s** |
| P50 Latency | 0.973s |
| P90 Latency | 3.644s |
| P99 Latency | 7.676s |
| Mean TTFT | 0.445s |
| P50 TTFT | 0.340s |
| P90 TTFT | 0.880s |
| Mean TPOT | 5.20ms |
| Cache Hit Rate | 94.4% (4199/4449) |
| Mean Cached Tokens | 27,794 |
| KV Transfer Blocks | 105,235 |

### Experiment B: pd-colo (colocation), FAILED

| Metric | Value |
|--------|-------|
| Run dir | `pd-colo-default-20260426T210129Z` |
| Status | **CRASHED** |
| Error | `token_to_kv_pool_allocator memory leak detected!` |
| Root Cause | SGLang v0.5.10 `--disaggregation-mode null` is incompatible with Qwen3.5-35B-A3B (Mamba/GDN hybrid) |
| Requests | ~10 / 4,449 (0.2%) |

**Conclusion**: The currently vendored SGLang does not support colocation mode for this model. The memory management for Mamba models in `token_to_kv_pool_allocator` needs to be fixed.

### Experiment C: kvcache-centric (session-aware PD)

| Metric | Value |
|--------|-------|
| Run dir | `kvcache-centric-default-worker-admission-20260426T210800Z` |
| Requests | 4,449 total |
| **Errors** | **4,390 (98.7%)** |
| Successful | 59 (1.3%) |
| Mean Latency (success) | 1.238s |
| P50 Latency (success) | 0.484s |
| P90 Latency (success) | 2.550s |
| Mean TTFT (success) | 0.179s |
| P50 TTFT (success) | 0.081s |
| Mean TPOT (success) | 4.70ms |
| Direct-to-D Sessions | 56 |
| KV Transfer (actual) | 196 blocks (vs 105,235 planned) |

**Execution mode distribution**:

- `kvcache-centric` (failed): 4,390
- `kvcache-direct-to-d-session` (success): 56
- `pd-router-*` variants: 3

## Key Analysis

### 1. pd-disaggregation (A): stable and reliable

- 100% success rate, 0 errors
- A mean latency of 1.66s is reasonable (it includes the P→D KV transfer overhead)
- The 94.4% cache hit rate indicates that the prefix cache works well on the P side
- The KV transfer of 105K blocks is the main source of overhead
- **Suitable for production use**

### 2. pd-colo (B): unusable

- The Mamba/GDN hybrid architecture of Qwen3.5-35B-A3B triggers a memory leak under `--disaggregation-mode null`
- This is an SGLang bug, not an agentic-pd-hybrid issue
- **Needs to be retested once SGLang is fixed**

### 3. kvcache-centric (C): admission is too conservative

- The 98.7% error rate indicates that admission control rejected almost all requests
- `kvcache-seed-min-turn-id=2` filtered out the turn 1 seed (expected behavior)
- However, the vast majority of turn 2+ requests also took the `kvcache-centric` path and then failed
- Possible causes:
  - The worker admission query finds no KV cache for the session on the D side (because turn 1 was never seeded)
  - A backlog in the D-side transfer queue causes admission to reject the request
- The 56 successful `kvcache-direct-to-d-session` requests performed very well: P50 TTFT of 0.08s, about 4x faster than the 0.34s of pd-disagg
- **The admission parameters need tuning, or `kvcache-seed-min-turn-id=1` should be used to allow turn 1 seeding**

### 4. Successful kvcache-centric requests vs pd-disaggregation

| Metric | pd-disagg (A) | kvcache-centric (C, success only) | Delta |
|--------|:---:|:---:|:---:|
| Mean Latency | 1.662s | 1.238s | **-25.5%** |
| P50 Latency | 0.973s | 0.484s | **-50.3%** |
| Mean TTFT | 0.445s | 0.179s | **-59.8%** |
| P50 TTFT | 0.340s | 0.081s | **-76.2%** |
| Mean TPOT | 5.20ms | 4.70ms | -9.6% |
| Actual KV Transfer | 105,235 blk | 196 blk | **-99.8%** |

**When kvcache-centric succeeds (59 of 4,449 requests), the performance gain is significant:**

- TTFT drops by 60-76% (the D side appends directly, with no P→D transfer needed)
- End-to-end latency drops by 25-50%
- KV transfer drops by 99.8%

## Recommended Next Steps

1. **Fix pd-colo**: File an SGLang issue about the memory leak for Mamba/GDN models under `--disaggregation-mode null`
2. **Tune kvcache-centric admission**:
   - Try `--kvcache-seed-min-turn-id 1` to allow turn 1 seeding
   - Relax the `--kvcache-seed-max-decode-transfer-queue-reqs` threshold
   - Use `--kvcache-admission-mode router` (shadow state, off the critical path)
3. **Increase D-side memory**: Adjust `--mem-fraction-static` to give the KV cache more space
4. **Multi-P/D configuration**: Test a 2P2D (TP2) configuration to increase parallelism

## Experiment Date

2026-04-27
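
## Appendix: Recomputing the Delta Column

The Delta column in the A vs C comparison (section 4) and the error share in Experiment C are simple relative-change arithmetic. The sketch below recomputes them from the reported numbers as a sanity check. The dictionary keys and function names (`relative_change_pct`, `delta_table`, `share_pct`) are illustrative and not part of agentic-pd-hybrid or SGLang; the values are copied from the tables in this report, not read from the run directories.

```python
# Sanity check for the reported deltas. All numbers are copied by hand from
# the result tables in this report; names are illustrative only.

BASELINE = {  # Experiment A: pd-disaggregation
    "mean_latency_s": 1.662,
    "p50_latency_s": 0.973,
    "mean_ttft_s": 0.445,
    "p50_ttft_s": 0.340,
    "mean_tpot_ms": 5.20,
    "kv_transfer_blocks": 105_235,
}

CANDIDATE = {  # Experiment C: kvcache-centric, successful requests only
    "mean_latency_s": 1.238,
    "p50_latency_s": 0.484,
    "mean_ttft_s": 0.179,
    "p50_ttft_s": 0.081,
    "mean_tpot_ms": 4.70,
    "kv_transfer_blocks": 196,
}


def relative_change_pct(baseline: float, candidate: float) -> float:
    """Relative change of candidate vs baseline, in percent (negative = lower)."""
    return (candidate - baseline) / baseline * 100.0


def delta_table(baseline: dict, candidate: dict) -> dict:
    """Per-metric relative change, rounded to one decimal like the report."""
    return {
        metric: round(relative_change_pct(baseline[metric], candidate[metric]), 1)
        for metric in baseline
    }


def share_pct(part: int, total: int) -> float:
    """Share of `part` in `total`, in percent, rounded to one decimal."""
    return round(part / total * 100.0, 1)


if __name__ == "__main__":
    for metric, delta in delta_table(BASELINE, CANDIDATE).items():
        print(f"{metric:20s} {delta:+.1f}%")
    print(f"{'error_share':20s} {share_pct(4_390, 4_449):.1f}%")
```

Because the C column covers only the 59 successful requests, these deltas describe the requests that passed admission rather than the full trace.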