agentic-pd-hybrid/docs/archive/SWEBENCH_EXPERIMENT_PROGRESS.md
kzlin 7590e55189 docs: archive deprecated docs to docs/archive/, drop E1 from onboarding
Two cleanups:

1. Drop "E1: naive 1P3D default" experiment from the onboarding manual.
   GPU hours are precious; naive 1P3D + policy=default has near-certain
   loss on multi-turn cache hit (it's round-robin without prefix awareness),
   so the comparison doesn't add information vs E1=naive 1P3D kv-aware.
   The new manifest has only 2 runs: E1 (naive 1P3D kv-aware) + E2 (KVC
   v2 + RDMA). Run-time budget drops from 16.5h serial to 11h serial /
   5.5h parallel. Updated:
   - §0 TL;DR ("3 组" -> "2 组")
   - §2 H1 hypothesis (drop "default and kv-aware each one" -> just kv-aware)
   - §3.1 experiment matrix (3 rows -> 2 rows + rationale for the drop)
   - §3.2 startup config (drop E1 default section, renumber E2/E3 -> E1/E2)
   - §6 decision table + expected-range table
   - §7 FAQ ("3 个 E1-E3" -> "2 个 E1-E2")
   - §9 deliverables

2. Move 8 deprecated docs to docs/archive/:
     AGENTIC_FIT_ANALYSIS_ZH.md         (ts=10 era analysis; superseded)
     STRUCTURAL_VALIDATION_REPORT_ZH.md (ts=10 era validation; superseded)
     KVC_DEBUG_JOURNEY_V1_TO_V5.md      (v1-v5 sweep process notes)
     V5_PROFILE_INVESTIGATION_ZH.md     (v5 1Hz polling investigation)
     REFACTOR_PLAN_ZH.md                (v0 plan; superseded by V1)
     KVCACHE_CENTRIC_PROGRESS_ZH.md     (earliest 2026-04-27 progress)
     SWEBENCH_EXPERIMENT_PROGRESS.md    (early SWE trace setup)
     SWEBENCH_EXPERIMENT_RESULTS.md     (early SWE result snapshot)

   All cross-references in active docs (V2_DEEP_ANALYSIS / V2_RESULTS /
   REFACTOR_PLAN_V1 / TEAM_REPORT / ONBOARDING) rewritten from
   `docs/FOO.md` to `docs/archive/FOO.md` via sed pass.

   Added `docs/archive/README.md` explaining what each archived doc is
   and when (if ever) to reopen it. Designed so a new reader hitting
   the archive dir immediately knows it's not required reading.

After this commit the active docs in docs/ are 9 files (down from 17),
which should make the onboarding doc's "Level 1 / Level 2 / Level 3"
classification self-evident.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-11 22:40:35 +08:00


SWE-Bench PD Hybrid Experiment Progress

Experiment Goal

Reproduce the three agentic-pd-hybrid serving mechanisms on a single 8xH100 node and compare their performance when serving Qwen3.5-35B-A3B on the agentic trajectories of 500 SWE-Bench instances.

Hardware Environment

  • 8x H100 80GB (NVLink interconnect, 2 NUMA nodes: GPU 0-3 / GPU 4-7)
  • No RDMA/IB devices
  • Transfer backend: mooncake TCP (nixl UCX was abandoned because the pip package lacks CUDA support and segfaults)

Experiment Matrix

| Experiment | Mechanism | Workers | GPU allocation | Router | Policy |
|---|---|---|---|---|---|
| A | pd-disaggregation | 1P + 1D (TP4 each) | P: 0-3, D: 4-7 | Yes | default |
| B | pd-colo | 2 direct (TP4 each) | D0: 0-3, D1: 4-7 | No | default |
| C | kvcache-centric | 1P + 1D (TP4 each) | P: 0-3, D: 4-7 | Yes | default |

Test Workload

  • Source data: simm-swe-bench/outputs/20260416-205833-hicache-qwen35-verified-0-500/audit.jsonl
  • 39,417 lines (turns), 497 unique instances (sessions)
  • 8-150 turns per instance (mean 79.3)
  • Converted to the agentic-pd-hybrid trace format: outputs/qwen35-swebench-500.jsonl
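
The turn and session counts above can be re-derived from a JSONL file with a few lines of Python. This is a minimal sketch, not the repository's conversion script; the field name `instance_id` is an assumption about the audit.jsonl schema, which these notes do not specify.

```python
import json
from collections import Counter


def session_turn_stats(lines, session_key="instance_id"):
    """Summarize turns per session from an iterable of JSONL lines.

    session_key is a hypothetical field name; adjust it to the real schema.
    """
    turns = Counter()
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        record = json.loads(line)
        turns[record[session_key]] += 1

    counts = list(turns.values())
    total = sum(counts)
    return {
        "turns": total,
        "sessions": len(counts),
        "min_turns": min(counts) if counts else 0,
        "max_turns": max(counts) if counts else 0,
        "mean_turns": total / len(counts) if counts else 0.0,
    }
```

On a real file this could be called as `session_turn_stats(open(path, encoding="utf-8"))`. For the figures reported above, 39,417 turns over 497 sessions gives 39417 / 497 ≈ 79.3 turns per session.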

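Key Findings

Transfer Backend Selection

  • nixl (UCX): the UCX library bundled with the pip-installed nixl_cu12 package has no CUDA support, which causes a segfault during GPU memory registration. The system UCX (/opt/hpcx/ucx) has CUDA support, but NIXL cannot use it because of its RPATH.
  • mooncake (TCP): works, with two changes:
    1. third_party/sglang/.../mooncake_transfer_engine.py: read the protocol from the MOONCAKE_PROTOCOL environment variable instead of hardcoding "rdma"
    2. src/agentic_pd_hybrid/stack.py: when transfer_backend == "mooncake" and force_rdma is not set, automatically set MOONCAKE_PROTOCOL=tcp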
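
The two changes above amount to an environment-variable handshake between the launcher and the transfer engine. A minimal sketch follows, assuming hypothetical function names and signatures; only `MOONCAKE_PROTOCOL`, `transfer_backend`, `force_rdma`, and the `"rdma"` fallback come from these notes. Letting an explicitly exported value take precedence over the `tcp` default is also an assumption.

```python
import os


def resolve_mooncake_protocol(environ=None):
    """Engine side: read the protocol from the environment, falling back to "rdma"."""
    env = os.environ if environ is None else environ
    return env.get("MOONCAKE_PROTOCOL", "rdma")


def build_process_env_sketch(base_env, transfer_backend, force_rdma=False):
    """Launcher side (hypothetical stand-in for _build_process_env).

    Defaults MOONCAKE_PROTOCOL to "tcp" for the mooncake backend unless
    force_rdma is set. setdefault keeps any value the caller already exported.
    """
    env = dict(base_env)  # copy so the caller's mapping is not mutated
    if transfer_backend == "mooncake" and not force_rdma:
        env.setdefault("MOONCAKE_PROTOCOL", "tcp")
    return env
```

With this split, a host without RDMA/IB devices gets TCP without extra flags, while a host with RDMA hardware can pass `force_rdma=True` and fall through to the engine's `"rdma"` default.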

Code Changes

  1. third_party/sglang/python/sglang/srt/distributed/device_communicators/mooncake_transfer_engine.py

    • Replaced the hardcoded "rdma" with os.environ.get("MOONCAKE_PROTOCOL", "rdma")
  2. src/agentic_pd_hybrid/stack.py

    • In _build_process_env(): default to MOONCAKE_PROTOCOL=tcp when the backend is mooncake and force_rdma is not set
  3. scripts/convert_audit_to_trace.py (new)

    • Converts the sibench audit.jsonl into the agentic-pd-hybrid trace format

Experiment Progress

  • Step 0: Environment setup (uv sync, nixl/mooncake installation)
  • Step 1: Trace format conversion (39,417 lines validated)
  • Step 2: Smoke test (pd-disaggregation, mooncake TCP, 100 requests): passed
    • 100/100 requests, 0 errors
    • Mean latency: 1.53s, P50: 0.77s, P90: 2.82s
    • TTFT: mean 0.49s, P50 0.29s; TPOT: mean 4.7ms
    • 91/100 cache hits
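  • Step 3a: Experiment A, full-scale attempt (39K reqs, 497 sessions): aborted
    • Run dir: outputs/swebench-exps/pd-disaggregation-default-20260426T171113Z (no metrics; the run was killed)
    • The first 90% completed in ~80 min (~8-10 req/s), but in the tail the D-side KV cache reached 98% saturation
    • 497 concurrent sessions competed for D-side token space; on the mamba side, 80-93 sessions could not drain
    • Lesson: 1P+1D (TP4) cannot sustain 497 concurrent sessions; reduce the number of sessions or lower the concurrency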
  • Step 3b: Experiment A, pd-disaggregation (52 sessions, 4449 reqs, concurrency=32): completed
    • Run dir: outputs/swebench-exps/pd-disaggregation-default-20260426T202540Z
    • Trace: outputs/qwen35-swebench-50sess.jsonl (10% sample, 52 sessions)
    • Result: 4449/4449 succeeded, 0 errors
    • Latency: mean=1.66s, P50=0.97s, P90=3.64s, P99=7.68s
    • TTFT: mean=0.45s, P50=0.34s, P90=0.88s
    • TPOT: mean=5.2ms, P50=5.2ms
    • Cache hit: 4199/4449 (94.4%)
  • Step 4: Experiment B, pd-colo: failed (SGLang bug)
    • Run dir: outputs/swebench-exps/pd-colo-default-20260426T210129Z
    • Bug: under --disaggregation-mode null (colocation), the Qwen3.5-35B-A3B model triggers a token_to_kv_pool_allocator memory leak
    • Error: ValueError: token_to_kv_pool_allocator memory leak detected!
    • Both direct workers crashed (Scheduler exception) after handling ~5 requests
    • Conclusion: the currently vendored SGLang v0.5.10 does not support colocation mode for Qwen3.5-35B-A3B
  • Step 5: Experiment C, kvcache-centric: completed (high error rate)
    • Run dir: outputs/swebench-exps/kvcache-centric-default-worker-admission-20260426T210800Z
    • 4390/4449 errors (98.7%): admission control is too conservative
    • 59 successful requests: mean latency 1.24s (25% lower than pd-disagg's 1.66s), TTFT 0.18s (60% lower than pd-disagg's 0.45s)
    • Detailed analysis: docs/SWEBENCH_EXPERIMENT_RESULTS.md
  • Step 6: Comparative analysis of results: completed
    • Full report: docs/SWEBENCH_EXPERIMENT_RESULTS.md
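
The summary figures in the steps above (mean, percentiles, cache hit rate, and the relative improvements quoted for Experiment C) follow from simple arithmetic over per-request measurements. The sketch below shows one way to compute them. It uses the nearest-rank percentile definition, which is an assumption: these notes do not say which percentile method the benchmark harness uses, and the function names are illustrative.

```python
import math


def percentile(values, p):
    """Nearest-rank percentile of a non-empty sequence, with 0 < p <= 100."""
    ordered = sorted(values)
    # Multiply before dividing to keep the rank exact for integer p.
    rank = max(1, math.ceil(p * len(ordered) / 100))
    return ordered[rank - 1]


def summarize(latencies_s):
    """Mean and tail percentiles for a non-empty list of latencies in seconds."""
    return {
        "mean": sum(latencies_s) / len(latencies_s),
        "p50": percentile(latencies_s, 50),
        "p90": percentile(latencies_s, 90),
        "p99": percentile(latencies_s, 99),
    }


def relative_reduction(baseline, candidate):
    """Fraction by which candidate is lower than baseline."""
    return (baseline - candidate) / baseline
```

Applied to the reported numbers, `relative_reduction(1.66, 1.24)` is about 0.25 and `relative_reduction(0.45, 0.18)` is 0.6, matching the 25% and 60% figures. The Experiment C side of that comparison rests on only 59 successful requests, so it is not a like-for-like comparison with the 4449 requests of Experiment A.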

Launch Scripts

  • scripts/run_exp_a_pd_disagg.sh: Experiment A
  • scripts/run_exp_b_pd_colo.sh: Experiment B
  • scripts/run_exp_c_kvcache_centric.sh: Experiment C
  • scripts/convert_audit_to_trace.py: trace conversion

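Known Risks

  1. Qwen3.5-35B-A3B at TP4 leaves ~12GB/GPU of available memory (after model weights and CUDA graphs); long sessions (150 turns) may OOM
  2. mooncake TCP loopback latency is far lower than real cross-machine latency, so the results are optimistic
  3. The original trace spans ~6000s, so a full replay is very time-consuming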