Two cleanups:
1. Drop "E1: naive 1P3D default" experiment from the onboarding manual.
GPU hours are precious; naive 1P3D + policy=default is near-certain to
lose on multi-turn cache hits (it is round-robin without prefix
awareness), so the run adds no information beyond E1=naive 1P3D kv-aware.
The new manifest has only 2 runs: E1 (naive 1P3D kv-aware) + E2 (KVC
v2 + RDMA). Run-time budget drops from 16.5h serial to 11h serial /
5.5h parallel. Updated:
- §0 TL;DR ("3 组" -> "2 组", i.e. "3 groups" -> "2 groups")
- §2 H1 hypothesis (drop "default and kv-aware each one" -> just kv-aware)
- §3.1 experiment matrix (3 rows -> 2 rows + rationale for the drop)
- §3.2 startup config (drop E1 default section, renumber E2/E3 -> E1/E2)
- §6 decision table + expected-range table
- §7 FAQ ("3 个 E1-E3" -> "2 个 E1-E2", i.e. "3 runs E1-E3" -> "2 runs E1-E2")
- §9 deliverables
2. Move 8 deprecated docs to docs/archive/:
AGENTIC_FIT_ANALYSIS_ZH.md (ts=10 era analysis; superseded)
STRUCTURAL_VALIDATION_REPORT_ZH.md (ts=10 era validation; superseded)
KVC_DEBUG_JOURNEY_V1_TO_V5.md (v1-v5 sweep process notes)
V5_PROFILE_INVESTIGATION_ZH.md (v5 1Hz polling investigation)
REFACTOR_PLAN_ZH.md (v0 plan; superseded by V1)
KVCACHE_CENTRIC_PROGRESS_ZH.md (earliest 2026-04-27 progress)
SWEBENCH_EXPERIMENT_PROGRESS.md (early SWE trace setup)
SWEBENCH_EXPERIMENT_RESULTS.md (early SWE result snapshot)
All cross-references in active docs (V2_DEEP_ANALYSIS / V2_RESULTS /
REFACTOR_PLAN_V1 / TEAM_REPORT / ONBOARDING) rewritten from
`docs/FOO.md` to `docs/archive/FOO.md` via sed pass.
Added `docs/archive/README.md` explaining what each archived doc is
and when (if ever) to reopen it. Designed so a new reader hitting
the archive dir immediately knows it's not required reading.
After this commit the active docs in docs/ are 9 files (down from 17),
which should make the onboarding doc's "Level 1 / Level 2 / Level 3"
classification self-evident.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
SWE-Bench PD Hybrid Experiment Progress
Experiment goal
Reproduce the three agentic-pd-hybrid serving mechanisms on a single 8xH100 node and compare the performance of Qwen3.5-35B-A3B on agentic trajectories from 500 SWE-Bench instances.
Hardware environment
- 8x H100 80GB (NVLink interconnect, 2 NUMA nodes: GPU 0-3 / GPU 4-7)
- No RDMA/IB devices
- Transfer backend: mooncake TCP (nixl UCX was abandoned because the pip package lacks CUDA support, which causes a segfault)
Experiment matrix
| Experiment | Mechanism | Workers | GPU allocation | Router | Policy |
|---|---|---|---|---|---|
| A | pd-disaggregation | 1P + 1D (TP4 each) | P: 0-3, D: 4-7 | Yes | default |
| B | pd-colo | 2 direct (TP4 each) | D0: 0-3, D1: 4-7 | No | default |
| C | kvcache-centric | 1P + 1D (TP4 each) | P: 0-3, D: 4-7 | Yes | default |
Test workload
- Source data: `simm-swe-bench/outputs/20260416-205833-hicache-qwen35-verified-0-500/audit.jsonl`
  - 39,417 lines (turns), 497 unique instances (sessions)
  - 8-150 turns per instance (mean 79.3)
- Converted to the agentic-pd-hybrid trace format: `outputs/qwen35-swebench-500.jsonl`
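
The workload figures above (total turns, unique sessions, turns per session) can be rechecked with a short script. This is a minimal sketch, not the project's `convert_audit_to_trace.py`: the session field name `instance_id` is an assumption about the audit.jsonl schema, and `summarize_sessions` is a hypothetical helper.

```python
import json
from collections import Counter
from typing import Iterable


def summarize_sessions(lines: Iterable[str], session_key: str = "instance_id") -> dict:
    """Count turns per session in a JSONL stream (one JSON object per line).

    `session_key` is an assumed field name; adjust it to the real schema.
    """
    turns = Counter()
    for line in lines:
        line = line.strip()
        if not line:
            continue  # tolerate blank lines
        record = json.loads(line)
        turns[record[session_key]] += 1

    total = sum(turns.values())
    sessions = len(turns)
    return {
        "turns": total,
        "sessions": sessions,
        "min_turns": min(turns.values()) if turns else 0,
        "max_turns": max(turns.values()) if turns else 0,
        "mean_turns": total / sessions if sessions else 0.0,
    }


if __name__ == "__main__":
    demo = ['{"instance_id": "a"}', '{"instance_id": "a"}', '{"instance_id": "b"}']
    print(summarize_sessions(demo))
```

To run it on the real file, pass an open file handle instead of the demo list, for example `summarize_sessions(open(path, encoding="utf-8"))`.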
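Key findings
Transfer backend selection
- nixl (UCX): the UCX library bundled with the pip-installed `nixl_cu12` package has no CUDA support, so GPU memory registration segfaults. The system UCX (`/opt/hpcx/ucx`) has CUDA support, but NIXL cannot use it because of RPATH.
- mooncake (TCP): works. It needs two changes:
  - `third_party/sglang/.../mooncake_transfer_engine.py`: read the protocol from the `MOONCAKE_PROTOCOL` environment variable instead of hard-coding `"rdma"`
  - `src/agentic_pd_hybrid/stack.py`: when `transfer_backend == "mooncake"` and `force_rdma` is not set, automatically set `MOONCAKE_PROTOCOL=tcp`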
Code change log
- `third_party/sglang/python/sglang/srt/distributed/device_communicators/mooncake_transfer_engine.py`
  - Replaced the hard-coded `"rdma"` with `os.environ.get("MOONCAKE_PROTOCOL", "rdma")`
- `src/agentic_pd_hybrid/stack.py`
  - In `_build_process_env()`: default to `MOONCAKE_PROTOCOL=tcp` when the backend is mooncake and `force_rdma` is not set
- `scripts/convert_audit_to_trace.py` (new)
  - Converts the sibench audit.jsonl into the agentic-pd-hybrid trace format
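
The two protocol-related changes fit together roughly as sketched below. This is an illustration under stated assumptions, not the actual patch: `build_process_env` and `resolve_mooncake_protocol` are hypothetical stand-ins for the real functions, and using `setdefault` (so an explicitly exported `MOONCAKE_PROTOCOL` wins) is an assumption about what "default" means here.

```python
import os
from typing import Mapping, Optional


def build_process_env(
    base_env: Mapping[str, str],
    transfer_backend: str,
    force_rdma: bool = False,
) -> dict:
    """Launcher side: return a copy of the environment for a worker process.

    Hypothetical stand-in for `_build_process_env()` in stack.py.
    """
    env = dict(base_env)
    if transfer_backend == "mooncake" and not force_rdma:
        # Assumption: only fill in a default, never override an explicit value.
        env.setdefault("MOONCAKE_PROTOCOL", "tcp")
    return env


def resolve_mooncake_protocol(env: Optional[Mapping[str, str]] = None) -> str:
    """Worker side: read the protocol, falling back to the old hard-coded value."""
    env = os.environ if env is None else env
    return env.get("MOONCAKE_PROTOCOL", "rdma")


if __name__ == "__main__":
    tcp_env = build_process_env({}, "mooncake")
    rdma_env = build_process_env({}, "mooncake", force_rdma=True)
    print(resolve_mooncake_protocol(tcp_env))   # tcp
    print(resolve_mooncake_protocol(rdma_env))  # rdma
```

Keeping `"rdma"` as the fallback on the worker side means hosts with RDMA hardware behave as before; only the launcher opts into TCP.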
Experiment progress
- Step 0: Environment setup (uv sync, nixl/mooncake installation)
- Step 1: Trace format conversion (39,417 lines validated)
- Step 2: Smoke test (pd-disaggregation, mooncake TCP, 100 requests): passed
  - 100/100 requests, 0 errors
  - Mean latency: 1.53s, P50: 0.77s, P90: 2.82s
  - TTFT: mean 0.49s, P50 0.29s; TPOT: mean 4.7ms
  - 91/100 cache hits
- Step 3a: Experiment A full-scale attempt (39K reqs, 497 sessions): aborted
  - Run dir: `outputs/swebench-exps/pd-disaggregation-default-20260426T171113Z` (no metrics; the run was killed)
  - The first 90% completed in ~80 min (~8-10 req/s), but in the tail the D-side KV cache was 98% saturated
  - 497 concurrent sessions competed for D-side token space; mamba: 80-93 sessions could not drain
  - Lesson: 1P+1D (TP4) cannot sustain 497 concurrent sessions; reduce the number of sessions or lower the concurrency
- Step 3b: Experiment A, pd-disaggregation (52 sessions, 4449 reqs, concurrency=32): completed
  - Run dir: `outputs/swebench-exps/pd-disaggregation-default-20260426T202540Z`
  - Trace: `outputs/qwen35-swebench-50sess.jsonl` (10% sample, 52 sessions)
  - Result: 4449/4449 succeeded, 0 errors
  - Latency: mean=1.66s, P50=0.97s, P90=3.64s, P99=7.68s
  - TTFT: mean=0.45s, P50=0.34s, P90=0.88s
  - TPOT: mean=5.2ms, P50=5.2ms
  - Cache hit: 4199/4449 (94.4%)
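
Step 3b replays a roughly 10% sample of sessions rather than of individual requests, which keeps each multi-turn session intact. How the 52 sessions were actually selected is not recorded here; the following is one possible approach, a deterministic hash-based session sampler. The field name `instance_id` and both function names are assumptions for illustration.

```python
import hashlib
from typing import Iterable, List


def keep_session(session_id: str, fraction: float) -> bool:
    """Deterministically decide whether a session belongs to the sample.

    The first 8 bytes of SHA-256 give a uniform 64-bit integer; the session
    is kept when that integer falls below fraction * 2**64.
    """
    digest = hashlib.sha256(session_id.encode("utf-8")).digest()
    value = int.from_bytes(digest[:8], "big")
    return value < fraction * 2**64


def sample_sessions(
    records: Iterable[dict], fraction: float, session_key: str = "instance_id"
) -> List[dict]:
    """Keep every turn of the selected sessions, preserving the original order."""
    return [r for r in records if keep_session(r[session_key], fraction)]


if __name__ == "__main__":
    demo = [{"instance_id": f"s{i % 20}", "turn": i // 20} for i in range(100)]
    sampled = sample_sessions(demo, 0.1)
    print(sorted({r["instance_id"] for r in sampled}))
```

Because the decision depends only on the session id, reruns select the same sessions, and the realized sample size lands near, not exactly at, the requested fraction.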
- Step 4: Experiment B, pd-colo: failed (SGLang bug)
  - Run dir: `outputs/swebench-exps/pd-colo-default-20260426T210129Z`
  - Bug: with `--disaggregation-mode null` (colocation), the Qwen3.5-35B-A3B model triggers a token_to_kv_pool_allocator memory leak
  - Error: `ValueError: token_to_kv_pool_allocator memory leak detected!`
  - Both direct workers crashed (Scheduler exception) after handling ~5 requests
  - Conclusion: the currently vendored SGLang v0.5.10 does not support colocation mode for Qwen3.5-35B-A3B
- Step 5: Experiment C, kvcache-centric: completed (high error rate)
  - Run dir: `outputs/swebench-exps/kvcache-centric-default-worker-admission-20260426T210800Z`
  - 4390/4449 errors (98.7%): admission control is too conservative
  - 59 successful requests: mean latency 1.24s (25% lower than pd-disagg), TTFT 0.18s (60% lower)
  - Detailed analysis: `docs/SWEBENCH_EXPERIMENT_RESULTS.md`
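
The percentages quoted for Experiment C follow directly from the Step 3b and Step 5 figures. The arithmetic is sketched below; the helper name is hypothetical and the inputs are the numbers reported above. Since the Experiment C means cover only 59 successful requests against 4449 for Experiment A, the comparison is indicative rather than like-for-like.

```python
def relative_reduction(baseline: float, candidate: float) -> float:
    """Fraction by which `candidate` is lower than `baseline`."""
    return (baseline - candidate) / baseline


# Figures reported in Step 3b (Experiment A) and Step 5 (Experiment C).
total_requests = 4449
exp_c_errors = 4390
exp_a_cache_hits = 4199

exp_c_successes = total_requests - exp_c_errors
exp_c_error_rate = exp_c_errors / total_requests
exp_a_cache_hit_rate = exp_a_cache_hits / total_requests

latency_reduction = relative_reduction(baseline=1.66, candidate=1.24)  # mean latency, s
ttft_reduction = relative_reduction(baseline=0.45, candidate=0.18)     # mean TTFT, s

if __name__ == "__main__":
    print(f"Exp C successes:      {exp_c_successes}")
    print(f"Exp C error rate:     {exp_c_error_rate:.1%}")
    print(f"Exp A cache hit rate: {exp_a_cache_hit_rate:.1%}")
    print(f"Latency reduction:    {latency_reduction:.0%}")
    print(f"TTFT reduction:       {ttft_reduction:.0%}")
```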
- Step 6: Result comparison and analysis: completed
  - Full report: `docs/SWEBENCH_EXPERIMENT_RESULTS.md`
Launch scripts
- `scripts/run_exp_a_pd_disagg.sh`: Experiment A
- `scripts/run_exp_b_pd_colo.sh`: Experiment B
- `scripts/run_exp_c_kvcache_centric.sh`: Experiment C
- `scripts/convert_audit_to_trace.py`: trace conversion
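Known risks
- Qwen3.5-35B-A3B at TP4 leaves ~12GB of free memory per GPU (after model + CUDA graph); long sessions (150 turns) may OOM
- mooncake TCP loopback latency is far lower than real cross-machine latency, so the results are optimistic
- The original trace spans ~6000s, so a full replay is very time-consuming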