
kzlin 7590e55189 docs: archive deprecated docs to docs/archive/, drop E1 from onboarding

Two cleanups:

1. Drop "E1: naive 1P3D default" experiment from the onboarding manual.
   GPU hours are precious; naive 1P3D + policy=default has near-certain
   loss on multi-turn cache hit (it's round-robin without prefix awareness),
   so the comparison doesn't add information vs E1=naive 1P3D kv-aware.
   The new manifest has only 2 runs: E1 (naive 1P3D kv-aware) + E2 (KVC
   v2 + RDMA). Run-time budget drops from 16.5h serial to 11h serial /
   5.5h parallel. Updated:
   - §0 TL;DR ("3 组" -> "2 组")
   - §2 H1 hypothesis (drop "default and kv-aware each one" -> just kv-aware)
   - §3.1 experiment matrix (3 rows -> 2 rows + rationale for the drop)
   - §3.2 startup config (drop E1 default section, renumber E2/E3 -> E1/E2)
   - §6 decision table + expected-range table
   - §7 FAQ ("3 个 E1-E3" -> "2 个 E1-E2")
   - §9 deliverables

2. Move 8 deprecated docs to docs/archive/:
     AGENTIC_FIT_ANALYSIS_ZH.md         (ts=10 era analysis; superseded)
     STRUCTURAL_VALIDATION_REPORT_ZH.md (ts=10 era validation; superseded)
     KVC_DEBUG_JOURNEY_V1_TO_V5.md      (v1-v5 sweep process notes)
     V5_PROFILE_INVESTIGATION_ZH.md     (v5 1Hz polling investigation)
     REFACTOR_PLAN_ZH.md                (v0 plan; superseded by V1)
     KVCACHE_CENTRIC_PROGRESS_ZH.md     (earliest 2026-04-27 progress)
     SWEBENCH_EXPERIMENT_PROGRESS.md    (early SWE trace setup)
     SWEBENCH_EXPERIMENT_RESULTS.md     (early SWE result snapshot)

   All cross-references in active docs (V2_DEEP_ANALYSIS / V2_RESULTS /
   REFACTOR_PLAN_V1 / TEAM_REPORT / ONBOARDING) rewritten from
   `docs/FOO.md` to `docs/archive/FOO.md` via sed pass.

   Added `docs/archive/README.md` explaining what each archived doc is
   and when (if ever) to reopen it. Designed so a new reader hitting
   the archive dir immediately knows it's not required reading.

After this commit the active docs in docs/ are 9 files (down from 17),
which should make the onboarding doc's "Level 1 / Level 2 / Level 3"
classification self-evident.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

2026-05-11 22:40:35 +08:00


SWE-Bench PD Hybrid Experiment Progress

Experiment Goal

Reproduce the three agentic-pd-hybrid serving mechanisms on a single 8xH100 node and compare the performance of Qwen3.5-35B-A3B on agentic trajectories from 500 SWE-Bench instances.

Hardware Environment

8x H100 80GB (NVLink interconnect, 2 NUMA nodes: GPU 0-3 / GPU 4-7)
No RDMA/IB devices
Transfer backend: mooncake TCP (nixl UCX was abandoned because the pip package lacks CUDA support and segfaults)

Experiment Matrix

Experiment	Mechanism	Workers	GPU Allocation	Router	Policy
A	pd-disaggregation	1P + 1D (TP4 each)	P: 0-3, D: 4-7	Yes	default
B	pd-colo	2 direct (TP4 each)	D0: 0-3, D1: 4-7	No	default
C	kvcache-centric	1P + 1D (TP4 each)	P: 0-3, D: 4-7	Yes	default
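
The matrix above can also be written down as data, which makes the GPU split per worker explicit. This is only a sketch: the `Experiment` class, its field names, and the `cuda_visible_devices` helper are hypothetical and are not taken from the agentic-pd-hybrid code base.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Experiment:
    """One row of the experiment matrix (hypothetical structure)."""
    name: str
    mechanism: str
    worker_gpus: dict[str, tuple[int, ...]]  # worker label -> GPU ids (TP4 each)
    use_router: bool
    policy: str = "default"

    def cuda_visible_devices(self, worker: str) -> str:
        # Render the GPU ids in the comma-separated form used by CUDA_VISIBLE_DEVICES.
        return ",".join(str(gpu) for gpu in self.worker_gpus[worker])


MATRIX = [
    Experiment("A", "pd-disaggregation", {"P": (0, 1, 2, 3), "D": (4, 5, 6, 7)}, True),
    Experiment("B", "pd-colo", {"D0": (0, 1, 2, 3), "D1": (4, 5, 6, 7)}, False),
    Experiment("C", "kvcache-centric", {"P": (0, 1, 2, 3), "D": (4, 5, 6, 7)}, True),
]

for exp in MATRIX:
    for worker in exp.worker_gpus:
        print(exp.name, exp.mechanism, worker, exp.cuda_visible_devices(worker))
```

Every worker gets one NUMA node's worth of GPUs (0-3 or 4-7), matching the hardware layout described earlier.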

Workload

Source data: simm-swe-bench/outputs/20260416-205833-hicache-qwen35-verified-0-500/audit.jsonl
39,417 lines (turns), 497 unique instances (sessions)
8-150 turns per instance (mean 79.3)
Converted to the agentic-pd-hybrid trace format: outputs/qwen35-swebench-500.jsonl
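
The workload figures above (turns, sessions, turns per session) can be re-derived from a JSONL file with a few lines of Python. The sketch below assumes one turn per line and a session identifier stored under a key named `instance_id`; that key name is an assumption, since the actual audit.jsonl schema is not shown here.

```python
import json
from collections import Counter
from statistics import mean


def session_turn_stats(lines, session_key="instance_id"):
    """Count turns per session from JSONL lines (one turn per line).

    `session_key` is a hypothetical field name; adjust it to the real schema.
    """
    turns = Counter()
    for line in lines:
        line = line.strip()
        if not line:
            continue  # tolerate blank lines
        record = json.loads(line)
        turns[record[session_key]] += 1
    if not turns:
        raise ValueError("no records found")
    counts = list(turns.values())
    return {
        "turns": sum(counts),
        "sessions": len(counts),
        "min_turns": min(counts),
        "max_turns": max(counts),
        "mean_turns": round(mean(counts), 1),
    }


# Tiny in-memory sample standing in for the real file.
sample = [
    '{"instance_id": "a", "turn": 0}',
    '{"instance_id": "a", "turn": 1}',
    '{"instance_id": "a", "turn": 2}',
    '{"instance_id": "b", "turn": 0}',
]
print(session_turn_stats(sample))
```

For the real trace, pass an open file handle instead of `sample`, for example `session_turn_stats(open(path, encoding="utf-8"))`.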

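Key Findings

Transfer Backend Selection

nixl (UCX): The UCX library bundled with the pip-installed nixl_cu12 package has no CUDA support, which causes a segfault during GPU memory registration. The system UCX (/opt/hpcx/ucx) does have CUDA support, but NIXL cannot use it because of RPATH.
mooncake (TCP): Works. Two changes are required:
1. third_party/sglang/.../mooncake_transfer_engine.py: read the protocol from the MOONCAKE_PROTOCOL environment variable instead of hard-coding "rdma"
2. src/agentic_pd_hybrid/stack.py: when transfer_backend == "mooncake" and force_rdma is not set, automatically set MOONCAKE_PROTOCOL=tcp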
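
The two changes fit together as a launcher-side default plus an engine-side lookup. The following is a minimal sketch of that logic, not the actual patch: the function names `resolve_mooncake_protocol` and `build_process_env` and their parameters are hypothetical stand-ins for the code in mooncake_transfer_engine.py and stack.py. It also assumes that an explicitly exported MOONCAKE_PROTOCOL should win over the TCP default, which is why `setdefault` is used.

```python
import os


def resolve_mooncake_protocol(environ=None):
    """Engine side: read the protocol from the environment, falling back to "rdma"."""
    environ = os.environ if environ is None else environ
    return environ.get("MOONCAKE_PROTOCOL", "rdma")


def build_process_env(transfer_backend, force_rdma=False, base_env=None):
    """Launcher side: default mooncake to TCP unless RDMA is forced.

    Returns a copy of the environment to hand to the worker process.
    """
    env = dict(os.environ if base_env is None else base_env)
    if transfer_backend == "mooncake" and not force_rdma:
        # setdefault keeps any value the user exported explicitly (assumption).
        env.setdefault("MOONCAKE_PROTOCOL", "tcp")
    return env


worker_env = build_process_env("mooncake", base_env={})
print(resolve_mooncake_protocol(worker_env))
```

With `force_rdma=True`, or with a backend other than mooncake, the variable is left unset and the engine-side lookup falls back to "rdma".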

Code Changes

third_party/sglang/python/sglang/srt/distributed/device_communicators/mooncake_transfer_engine.py
- Replaced the hard-coded "rdma" with os.environ.get("MOONCAKE_PROTOCOL", "rdma")
src/agentic_pd_hybrid/stack.py
- In _build_process_env(): for mooncake without force_rdma, set MOONCAKE_PROTOCOL=tcp by default
scripts/convert_audit_to_trace.py (new)
- Converts the sibench audit.jsonl into the agentic-pd-hybrid trace format

Experiment Progress

Step 0: Environment setup (uv sync, nixl/mooncake installation)
Step 1: Trace format conversion (39,417 lines validated)
Step 2: Smoke test (pd-disaggregation, mooncake TCP, 100 requests): passed
- 100/100 requests, 0 errors
- Mean latency: 1.53s, P50: 0.77s, P90: 2.82s
- TTFT: mean 0.49s, P50 0.29s; TPOT: mean 4.7ms
- 91/100 cache hits
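Step 3a: Experiment A, full-trace attempt (39K reqs, 497 sessions): aborted
- Run dir: outputs/swebench-exps/pd-disaggregation-default-20260426T171113Z (no metrics; the run was killed)
- The first 90% completed in ~80min (~8-10 req/s), but in the tail the D-side KV cache was 98% saturated
- 497 concurrent sessions competed for D-side token space, and 80-93 mamba sessions could not drain
- Lesson: 1P+1D (TP4) cannot sustain 497 concurrent sessions; reduce the number of sessions or lower the concurrency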
Step 3b: Experiment A, pd-disaggregation (52 sessions, 4449 reqs, concurrency=32): completed
- Run dir: outputs/swebench-exps/pd-disaggregation-default-20260426T202540Z
- Trace: outputs/qwen35-swebench-50sess.jsonl (10% sample, 52 sessions)
- Result: 4449/4449 succeeded, 0 errors
- Latency: mean=1.66s, P50=0.97s, P90=3.64s, P99=7.68s
- TTFT: mean=0.45s, P50=0.34s, P90=0.88s
- TPOT: mean=5.2ms, P50=5.2ms
- Cache hit: 4199/4449 (94.4%)
Step 4: Experiment B, pd-colo: failed (SGLang bug)
- Run dir: outputs/swebench-exps/pd-colo-default-20260426T210129Z
- Bug: under --disaggregation-mode null (colocation), the Qwen3.5-35B-A3B model triggers a token_to_kv_pool_allocator memory leak
- Error: ValueError: token_to_kv_pool_allocator memory leak detected!
- Both direct workers crashed (Scheduler exception) after handling ~5 requests
- Conclusion: the currently vendored SGLang v0.5.10 does not support colocation mode for Qwen3.5-35B-A3B
Step 5: Experiment C, kvcache-centric: completed (high error rate)
- Run dir: outputs/swebench-exps/kvcache-centric-default-worker-admission-20260426T210800Z
- 4390/4449 errors (98.7%); admission control is too conservative
- 59 successful requests: mean latency 1.24s (25% lower than pd-disagg), TTFT 0.18s (60% lower)
- Detailed analysis: docs/SWEBENCH_EXPERIMENT_RESULTS.md
Step 6: Result comparison and analysis: completed
- Full report: docs/SWEBENCH_EXPERIMENT_RESULTS.md
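
The percentages quoted in Steps 3b and 5 follow from simple ratios over the reported means and counts. The sketch below recomputes them from the figures listed above; the helper names `relative_reduction` and `rate` are illustrative only and do not come from the project's analysis scripts.

```python
def relative_reduction(baseline, candidate):
    """Percentage by which `candidate` is lower than `baseline`."""
    return (baseline - candidate) / baseline * 100.0


def rate(part, total):
    """`part` as a percentage of `total`."""
    return part / total * 100.0


# Figures copied from Step 3b (pd-disaggregation) and Step 5 (kvcache-centric).
pd_disagg = {"mean_latency_s": 1.66, "mean_ttft_s": 0.45, "cache_hits": 4199, "total": 4449}
kvc = {"mean_latency_s": 1.24, "mean_ttft_s": 0.18, "errors": 4390, "total": 4449}

latency_gain = relative_reduction(pd_disagg["mean_latency_s"], kvc["mean_latency_s"])
ttft_gain = relative_reduction(pd_disagg["mean_ttft_s"], kvc["mean_ttft_s"])
error_rate = rate(kvc["errors"], kvc["total"])
cache_hit_rate = rate(pd_disagg["cache_hits"], pd_disagg["total"])

print(f"latency reduction: {latency_gain:.1f}%")
print(f"TTFT reduction:    {ttft_gain:.1f}%")
print(f"kvc error rate:    {error_rate:.1f}%")
print(f"cache hit rate:    {cache_hit_rate:.1f}%")
```

Note that the kvcache-centric means cover only the 59 successful requests, whereas the pd-disaggregation means cover all 4449, so the two sets of numbers are not a like-for-like comparison.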

Launch Scripts

scripts/run_exp_a_pd_disagg.sh: Experiment A
scripts/run_exp_b_pd_colo.sh: Experiment B
scripts/run_exp_c_kvcache_centric.sh: Experiment C
scripts/convert_audit_to_trace.py: trace conversion

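Known Risks

Qwen3.5-35B-A3B at TP4 leaves ~12GB of usable memory per GPU (after model + CUDA graph); long sessions (150 turns) may OOM
mooncake TCP loopback latency is far lower than real cross-machine latency, so the results are optimistic
The original trace spans ~6000s, so a full replay is very time-consuming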
