Two cleanups:
1. Drop "E1: naive 1P3D default" experiment from the onboarding manual.
GPU hours are precious; naive 1P3D + policy=default almost certainly
loses on multi-turn cache hits (it is round-robin without prefix
awareness), so the run adds no information beyond the new E1 (naive
1P3D kv-aware).
The new manifest has only 2 runs: E1 (naive 1P3D kv-aware) + E2 (KVC
v2 + RDMA). Run-time budget drops from 16.5h serial to 11h serial /
5.5h parallel. Updated:
- §0 TL;DR ("3 组" -> "2 组")
- §2 H1 hypothesis (drop "default and kv-aware each one" -> just kv-aware)
- §3.1 experiment matrix (3 rows -> 2 rows + rationale for the drop)
- §3.2 startup config (drop E1 default section, renumber E2/E3 -> E1/E2)
- §6 decision table + expected-range table
- §7 FAQ ("3 个 E1-E3" -> "2 个 E1-E2")
- §9 deliverables
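
The reasoning for dropping the policy=default run can be illustrated with a toy model. This is a sketch under stated assumptions, not the router's code: a request counts as a "hit" if its worker has served any earlier turn of the same session, the worker count of 3 merely mirrors the 3D in 1P3D, and `simulate_hit_rate`, `round_robin`, and `session_affinity` are hypothetical names.

```python
from typing import Callable, List, Set


def simulate_hit_rate(num_sessions: int, turns: int, num_workers: int,
                      pick_worker: Callable[[int, int], int]) -> float:
    """Fraction of requests landing on a worker that already served the session."""
    seen: List[Set[int]] = [set() for _ in range(num_workers)]
    hits = 0
    request_index = 0
    # Sessions are interleaved turn by turn, as in a concurrent multi-turn trace.
    for _turn in range(turns):
        for session in range(num_sessions):
            worker = pick_worker(session, request_index)
            if session in seen[worker]:
                hits += 1
            seen[worker].add(session)
            request_index += 1
    return hits / request_index


def round_robin(num_workers: int) -> Callable[[int, int], int]:
    # Ignores the session entirely: the next request goes to the next worker.
    return lambda session, request_index: request_index % num_workers


def session_affinity(num_workers: int) -> Callable[[int, int], int]:
    # Stand-in for a prefix-aware policy: a session always returns to one worker.
    return lambda session, request_index: session % num_workers


rr_rate = simulate_hit_rate(4, 6, 3, round_robin(3))
affinity_rate = simulate_hit_rate(4, 6, 3, session_affinity(3))
print(f"round-robin hit rate:      {rr_rate:.3f}")
print(f"session-affinity hit rate: {affinity_rate:.3f}")
```

The model is generous to round-robin, since a worker that saw only an older turn holds a shorter prefix than the current request needs.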
2. Move 8 deprecated docs to docs/archive/:
AGENTIC_FIT_ANALYSIS_ZH.md (ts=10 era analysis; superseded)
STRUCTURAL_VALIDATION_REPORT_ZH.md (ts=10 era validation; superseded)
KVC_DEBUG_JOURNEY_V1_TO_V5.md (v1-v5 sweep process notes)
V5_PROFILE_INVESTIGATION_ZH.md (v5 1Hz polling investigation)
REFACTOR_PLAN_ZH.md (v0 plan; superseded by V1)
KVCACHE_CENTRIC_PROGRESS_ZH.md (earliest 2026-04-27 progress)
SWEBENCH_EXPERIMENT_PROGRESS.md (early SWE trace setup)
SWEBENCH_EXPERIMENT_RESULTS.md (early SWE result snapshot)
All cross-references in active docs (V2_DEEP_ANALYSIS / V2_RESULTS /
REFACTOR_PLAN_V1 / TEAM_REPORT / ONBOARDING) rewritten from
`docs/FOO.md` to `docs/archive/FOO.md` via sed pass.
Added `docs/archive/README.md` explaining what each archived doc is
and when (if ever) to reopen it. Designed so a new reader hitting
the archive dir immediately knows it's not required reading.
After this commit the active docs in docs/ are 9 files (down from 17),
which should make the onboarding doc's "Level 1 / Level 2 / Level 3"
classification self-evident.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
SWE-Bench PD Hybrid Experiment Results
Experiment configuration
- Model: Qwen3.5-35B-A3B (MoE, 35B total / 3B active), TP4
- Hardware: 8x H100 80GB, NVLink, single node
- Transfer backend: mooncake TCP (loopback)
- Trace: 52 sessions, 4,449 requests (10% sample of SWE-Bench 500 instances)
- Time compression: time-scale=10, concurrency-limit=32
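
One plausible reading of the time-compression settings is sketched below. It rests on two assumptions that the report does not confirm: that time-scale=10 divides every arrival offset in the trace by 10, and that concurrency-limit=32 caps the number of in-flight requests. The names `compress_offsets`, `replay`, and `send` are hypothetical.

```python
import asyncio
from typing import Awaitable, Callable, List, Sequence


def compress_offsets(arrivals_s: Sequence[float], time_scale: float) -> List[float]:
    """Shift arrivals to start at 0 and shrink every offset by time_scale (assumed meaning)."""
    if time_scale <= 0:
        raise ValueError("time_scale must be positive")
    if not arrivals_s:
        return []
    start = arrivals_s[0]
    return [(t - start) / time_scale for t in arrivals_s]


async def replay(arrivals_s: Sequence[float],
                 send: Callable[[int], Awaitable[object]],
                 time_scale: float = 10.0,
                 concurrency_limit: int = 32) -> list:
    """Fire request i at its compressed offset, with at most concurrency_limit in flight."""
    gate = asyncio.Semaphore(concurrency_limit)

    async def one(index: int, offset: float) -> object:
        await asyncio.sleep(offset)   # wait until the compressed arrival time
        async with gate:              # block here if the cap is already reached
            return await send(index)

    offsets = compress_offsets(arrivals_s, time_scale)
    return await asyncio.gather(*(one(i, o) for i, o in enumerate(offsets)))


# Arrivals recorded 10s and 30s after the first request replay at 1s and 3s.
print(compress_offsets([100.0, 110.0, 130.0], 10.0))
```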
Results summary
Experiment A: pd-disaggregation (baseline)
| Metric | Value |
|---|---|
| Run dir | pd-disaggregation-default-20260426T202540Z |
| Requests | 4,449 / 4,449 (100%) |
| Errors | 0 |
| Mean Latency | 1.662s |
| P50 Latency | 0.973s |
| P90 Latency | 3.644s |
| P99 Latency | 7.676s |
| Mean TTFT | 0.445s |
| P50 TTFT | 0.340s |
| P90 TTFT | 0.880s |
| Mean TPOT | 5.20ms |
| Cache Hit Rate | 94.4% (4199/4449) |
| Mean Cached Tokens | 27,794 |
| KV Transfer Blocks | 105,235 |
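
As a reference for how a summary like the one above can be derived from per-request measurements, here is a minimal sketch. It assumes nearest-rank percentiles; the report does not say which percentile method the harness uses, and `nearest_rank` and `summarize` are hypothetical helpers. The last lines recompute the cache hit rate from the counts in the table.

```python
import math
from statistics import mean
from typing import Dict, List, Sequence


def nearest_rank(sorted_values: Sequence[float], pct: float) -> float:
    """Nearest-rank percentile: the smallest value with at least pct% of samples at or below it."""
    rank = max(1, math.ceil(pct * len(sorted_values) / 100))
    return sorted_values[rank - 1]


def summarize(latencies_s: List[float]) -> Dict[str, float]:
    ordered = sorted(latencies_s)
    return {
        "mean": mean(ordered),
        "p50": nearest_rank(ordered, 50),
        "p90": nearest_rank(ordered, 90),
        "p99": nearest_rank(ordered, 99),
    }


# Made-up latencies purely to exercise the helper, not data from the run.
sample = [0.2, 0.4, 0.6, 0.8, 1.0, 1.2, 1.4, 1.6, 1.8, 2.0]
summary = summarize(sample)
print(summary)

# Cache hit rate recomputed from the counts reported for Experiment A.
hit_rate_pct = round(4199 / 4449 * 100, 1)
print(f"cache hit rate: {hit_rate_pct}%")
```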
Experiment B: pd-colo (colocation) — FAILED
| Metric | Value |
|---|---|
| Run dir | pd-colo-default-20260426T210129Z |
| Status | CRASHED |
| Error | token_to_kv_pool_allocator memory leak detected! |
| Root Cause | SGLang v0.5.10 `--disaggregation-mode null` is incompatible with Qwen3.5-35B-A3B (Mamba/GDN hybrid) |
| Requests | ~10 / 4,449 (0.2%) |
Conclusion: the currently vendored SGLang does not support colocation mode for this model. The memory management for Mamba models in token_to_kv_pool_allocator needs to be fixed.
Experiment C: kvcache-centric (session-aware PD)
| Metric | Value |
|---|---|
| Run dir | kvcache-centric-default-worker-admission-20260426T210800Z |
| Requests | 4,449 total |
| Errors | 4,390 (98.7%) |
| Successful | 59 (1.3%) |
| Mean Latency (success) | 1.238s |
| P50 Latency (success) | 0.484s |
| P90 Latency (success) | 2.550s |
| Mean TTFT (success) | 0.179s |
| P50 TTFT (success) | 0.081s |
| Mean TPOT (success) | 4.70ms |
| Direct-to-D Sessions | 56 |
| KV Transfer (actual) | 196 blocks (vs 105,235 planned) |
Execution mode distribution:
- `kvcache-centric` (failed): 4,390
- `kvcache-direct-to-d-session` (success): 56
- `pd-router-*` variants: 3
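
The mode counts can be cross-checked against the Experiment C table with a few lines of arithmetic. The sketch assumes the three `pd-router-*` requests are counted among the successes, which is the reading that makes 56 + 3 match the 59 reported.

```python
from collections import Counter

# Counts copied from the execution mode distribution.
mode_counts = Counter({
    "kvcache-centric": 4390,            # failed
    "kvcache-direct-to-d-session": 56,  # succeeded
    "pd-router-*": 3,                   # assumed to have succeeded
})

total = sum(mode_counts.values())
failed = mode_counts["kvcache-centric"]
succeeded = total - failed
error_rate_pct = round(failed / total * 100, 1)

print(f"total={total} failed={failed} succeeded={succeeded} error_rate={error_rate_pct}%")
```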
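Key analysis
1. pd-disaggregation (A): stable and reliable
- 100% success rate, 0 errors
- A mean latency of 1.66s is reasonable (it includes the P→D KV transfer overhead)
- The 94.4% cache hit rate shows the prefix cache is working well on the P side
- KV transfer of 105K blocks is the main source of overhead
- Suitable for production use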
2. pd-colo (B): unusable
- The Mamba/GDN hybrid architecture of Qwen3.5-35B-A3B triggers a memory leak under `disaggregation-mode null`
- This is an SGLang bug, not a problem in agentic-pd-hybrid
- Needs to be retested once SGLang fixes it
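3. kvcache-centric (C): admission is too conservative
- The 98.7% error rate indicates that admission control rejected almost every request
- `kvcache-seed-min-turn-id=2` filtered out turn 1 seeds (correct behavior)
- But the vast majority of turn 2+ requests also failed after taking the `kvcache-centric` path
- Possible causes:
  - The worker admission query finds no KV cache for the session on the D side (because turn 1 was never seeded)
  - A backlog in the D-side transfer queue causes admission to reject
- The 56 successful `direct-to-d-session` requests performed very well: P50 TTFT of 0.08s, about 4x faster than the 0.34s of pd-disagg
- The admission parameters need tuning, or `kvcache-seed-min-turn-id=1` should be used to allow turn 1 seeds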
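
The two suspected causes can be written down as a small decision function to make the failure mode concrete. This is a hypothetical sketch that encodes only the two hypotheses listed above; it is not the actual admission logic, and `rejection_reason`, its parameters, and the threshold value of 8 are all invented for illustration.

```python
from typing import Optional


def rejection_reason(d_has_session_kv: bool,
                     transfer_queue_reqs: int,
                     max_transfer_queue_reqs: int) -> Optional[str]:
    """Return why a request would be rejected, or None if it is admitted (hypothetical rules)."""
    if not d_has_session_kv:
        # Hypothesis 1: no KV cache for this session exists on the D side.
        return "no-session-kv-on-d"
    if transfer_queue_reqs > max_transfer_queue_reqs:
        # Hypothesis 2: the D-side transfer queue is backed up.
        return "transfer-queue-backlog"
    return None


# A turn 2 request whose turn 1 was never seeded finds nothing on D.
print(rejection_reason(d_has_session_kv=False, transfer_queue_reqs=0, max_transfer_queue_reqs=8))
# A session with KV on D is still rejected when the queue exceeds the threshold.
print(rejection_reason(d_has_session_kv=True, transfer_queue_reqs=12, max_transfer_queue_reqs=8))
# Only a session with KV on D and a short queue is admitted.
print(rejection_reason(d_has_session_kv=True, transfer_queue_reqs=2, max_transfer_queue_reqs=8))
```

Under the first hypothesis, a session that never gets seeded is rejected on every later turn, which would be consistent with an error rate close to 100%.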
4. kvcache-centric successful requests vs pd-disaggregation
| Metric | pd-disagg (A) | kvcache-centric (C, success only) | Delta |
|---|---|---|---|
| Mean Latency | 1.662s | 1.238s | -25.5% |
| P50 Latency | 0.973s | 0.484s | -50.3% |
| Mean TTFT | 0.445s | 0.179s | -59.8% |
| P50 TTFT | 0.340s | 0.081s | -76.2% |
| Mean TPOT | 5.20ms | 4.70ms | -9.6% |
| Actual KV Transfer | 105,235 blk | 196 blk | -99.8% |
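When kvcache-centric succeeds (only 59 requests in this run), the performance gains are substantial:
- TTFT drops by 60-76% (the D side appends directly, with no P→D transfer)
- End-to-end latency drops by 25-50%
- KV transfer drops by 99.8%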
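
The Delta column follows from the two value columns as (C - A) / A. The sketch below recomputes it from the numbers in the comparison table; `pct_delta` is a hypothetical helper name.

```python
# (baseline A, kvcache-centric C) pairs copied from the comparison table.
rows = {
    "Mean Latency (s)": (1.662, 1.238),
    "P50 Latency (s)": (0.973, 0.484),
    "Mean TTFT (s)": (0.445, 0.179),
    "P50 TTFT (s)": (0.340, 0.081),
    "Mean TPOT (ms)": (5.20, 4.70),
    "Actual KV Transfer (blocks)": (105235, 196),
}


def pct_delta(baseline: float, candidate: float) -> float:
    """Relative change of candidate versus baseline, in percent."""
    return (candidate - baseline) / baseline * 100


deltas = {name: round(pct_delta(a, c), 1) for name, (a, c) in rows.items()}
for name, delta in deltas.items():
    print(f"{name}: {delta:+.1f}%")

# The "about 4x faster" P50 TTFT claim is the ratio of the same two numbers.
p50_ttft_speedup = 0.340 / 0.081
print(f"P50 TTFT speedup: {p50_ttft_speedup:.1f}x")
```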
Next steps
- Fix pd-colo: file an SGLang issue about the memory leak of Mamba/GDN models under `disaggregation-mode null`
- Tune kvcache-centric admission:
  - Try `--kvcache-seed-min-turn-id 1` to allow turn 1 seeds
  - Relax the `--kvcache-seed-max-decode-transfer-queue-reqs` threshold
  - Use `--kvcache-admission-mode router` (shadow state, off the critical path)
- Increase D-side memory: adjust `--mem-fraction-static` to give the KV cache more room
- Multi P/D configuration: test a 2P2D (TP2) configuration to increase parallelism
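
A follow-up sweep over the admission settings could be enumerated as below. This is a sketch that uses only flag values appearing in this report (`1` and `2` for the seed turn, `router` for the admission mode). The queue threshold and `--mem-fraction-static` are left out because no candidate values are given, and the launch command the flags attach to is not specified here. `build_flag_sets` is a hypothetical helper.

```python
from itertools import product
from typing import List, Optional

SEED_MIN_TURN_IDS = ["1", "2"]                          # 2 is the setting used in this run
ADMISSION_MODES: List[Optional[str]] = [None, "router"]  # None leaves the flag unset


def build_flag_sets() -> List[List[str]]:
    """One list of extra CLI flags per combination to try."""
    flag_sets = []
    for min_turn_id, mode in product(SEED_MIN_TURN_IDS, ADMISSION_MODES):
        flags = ["--kvcache-seed-min-turn-id", min_turn_id]
        if mode is not None:
            flags += ["--kvcache-admission-mode", mode]
        flag_sets.append(flags)
    return flag_sets


flag_sets = build_flag_sets()
for flags in flag_sets:
    print(" ".join(flags))
```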
Experiment date
2026-04-27