Compare commits
56 Commits
main
...
improve/au
| Author | SHA1 | Date | |
|---|---|---|---|
| 110bd68000 | |||
| d93228e156 | |||
| 9a81c993ab | |||
| dbb9eee471 | |||
| 4021f27ee2 | |||
| c5f552e122 | |||
| a785b83023 | |||
| 76a79dfdda | |||
| 591cd6d382 | |||
| fd37eda367 | |||
| 683c44bd71 | |||
| baa843a3f9 | |||
| 6cdea52f28 | |||
|
|
6d1c9237fa | ||
|
|
986f351365 | ||
|
|
d40db1f117 | ||
|
|
a1abdcd50c | ||
|
|
93fce42747 | ||
|
|
905d671135 | ||
|
|
9a166ac43b | ||
|
|
976115ea5e | ||
|
|
786cbb8d91 | ||
|
|
bf4da281c0 | ||
|
|
7f2ebf3d87 | ||
|
|
ef4dc81ea9 | ||
|
|
3db2d84df8 | ||
|
|
e3e5c45ed4 | ||
|
|
631b2c8847 | ||
|
|
ad8aaa8c5a | ||
|
|
bb9cc249cd | ||
|
|
b55371fe69 | ||
|
|
d11a66d11b | ||
|
|
a418aafeed | ||
|
|
e874b1f055 | ||
|
|
7590e55189 | ||
|
|
5a2fb8799c | ||
|
|
506d360160 | ||
|
|
c01d6101d6 | ||
|
|
9ccd853066 | ||
|
|
517677d7f2 | ||
|
|
c5519066de | ||
|
|
b5af19583b | ||
|
|
37e9caa431 | ||
|
|
5eac9b4f6b | ||
|
|
0c25168cad | ||
|
|
2ec0debef4 | ||
|
|
1d51704dad | ||
|
|
7affb565b2 | ||
|
|
c47adaf8e3 | ||
|
|
ca4b64c79a | ||
|
|
4978c0d0cd | ||
|
|
51f5386691 | ||
|
|
6572d7f3f4 | ||
|
|
6e5ed8da80 | ||
|
|
74194e660a | ||
|
|
c9d350b372 |
24
AGENTS.md
24
AGENTS.md
@@ -1,9 +1,33 @@
|
||||
# AGENTS.md
|
||||
|
||||
## For new collaborators / agents
|
||||
|
||||
Before doing anything else, read [docs/INDEX_ZH.md](docs/INDEX_ZH.md). It points to the
|
||||
3 must-read docs and a role-based reading path (new SWE, paper reviewer,
|
||||
reproducing student, control-plane reader).
|
||||
|
||||
Cross-branch progress, weaknesses, and roadmap live in
|
||||
[docs/AUDIT_AND_ROADMAP_ZH.md](docs/AUDIT_AND_ROADMAP_ZH.md). It is the single source of truth
|
||||
for "what's done, what's broken, what to do next."
|
||||
|
||||
Two engineering work items are pre-specced and ready to pick up:
|
||||
- block-level eviction refactor — [docs/BLOCK_LEVEL_EVICTION_DESIGN_ZH.md](docs/BLOCK_LEVEL_EVICTION_DESIGN_ZH.md)
|
||||
- D→P incremental KV sync — [docs/D_TO_P_SYNC_CONTRACT_ZH.md](docs/D_TO_P_SYNC_CONTRACT_ZH.md)
|
||||
|
||||
Evaluation protocol (paper-quality N, paired CI, stratification,
|
||||
baselines) is in [docs/EVALUATION_PROTOCOL_ZH.md](docs/EVALUATION_PROTOCOL_ZH.md).
|
||||
|
||||
## Environment
|
||||
|
||||
Use `uv` to manage all python environment. `uv add` to manage deps so that we can `uv sync` to get exactly same runnable environment each time.
|
||||
|
||||
Algorithm-layer unit tests (no GPU, no SGLang):
|
||||
|
||||
```bash
|
||||
uv sync --group test
|
||||
uv run pytest
|
||||
```
|
||||
|
||||
## Goal
|
||||
|
||||
Build a minimal prototype on top of **SGLang xPyD** to test whether **session-aware / KV-cache-aware P/D routing** can improve **end-to-end latency** for agentic coding workloads.
|
||||
|
||||
28
README.md
28
README.md
@@ -6,6 +6,9 @@
|
||||
|
||||
更完整但仍然简洁的说明见 [docs/PROJECT_OVERVIEW.md](docs/PROJECT_OVERVIEW.md)。
|
||||
|
||||
新加入的合作者:先看 [docs/INDEX_ZH.md](docs/INDEX_ZH.md),按"我是谁"选 3 篇必读文档。
|
||||
项目当前进度、薄弱点、路线图总览见 [docs/AUDIT_AND_ROADMAP_ZH.md](docs/AUDIT_AND_ROADMAP_ZH.md)。
|
||||
|
||||
## 当前做了什么
|
||||
|
||||
- 启动单机 SGLang P/D 栈。
|
||||
@@ -99,3 +102,28 @@ uv run agentic-pd-hybrid replay \
|
||||
- SGLang 改动:`feat(sglang): ...` / `fix(sglang): ...`。
|
||||
- `third_party/sglang` 的基线是 clean SGLang `v0.5.10` snapshot。
|
||||
- 不提交 `outputs/`、日志、`__pycache__`、虚拟环境。
|
||||
|
||||
## 单元测试(无 GPU)
|
||||
|
||||
算法层(policies、Algorithm 1 / Theorem 1)有 pure-Python 单测,跑测试不需要 GPU、不需要 SGLang:
|
||||
|
||||
```bash
|
||||
uv sync --group test
|
||||
uv run pytest
|
||||
```
|
||||
|
||||
详见 [tests/README.md](tests/README.md)。
|
||||
|
||||
## 评测脚本
|
||||
|
||||
按 [docs/EVALUATION_PROTOCOL_ZH.md](docs/EVALUATION_PROTOCOL_ZH.md) 跑数据后:
|
||||
|
||||
```bash
|
||||
# M3: 按 turn_id / input_length / overlap_ratio / append_tokens 分桶
|
||||
scripts/analysis/stratified.py outputs/<run>/request-metrics.jsonl
|
||||
|
||||
# M2: paired-on-same-trial bootstrap 95% CI
|
||||
scripts/analysis/paired_compare.py \
|
||||
--baseline outputs/run-dp/request-metrics.jsonl \
|
||||
--candidate outputs/run-kvc/request-metrics.jsonl
|
||||
```
|
||||
|
||||
140
docs/AUDIT_AND_ROADMAP_ZH.md
Normal file
140
docs/AUDIT_AND_ROADMAP_ZH.md
Normal file
@@ -0,0 +1,140 @@
|
||||
# 项目整体审阅与下一阶段路线图
|
||||
|
||||
**日期**:2026-05-12
|
||||
**分支起点**:`improve/audit-and-foundations`(基于 `h200-cu130`)
|
||||
**性质**:跨分支整合 + 路线图,供合作者判断每个 commit 是否值得 merge
|
||||
**对象**:项目下一个 SWE / research agent + 论文 reviewer 预读
|
||||
|
||||
本文把 `main` / `kvc-debug-journey-v1-to-v4` / `feat/d-to-p-sync` / `h200-cu130` / `kvc-real-ali-iter-v1` 五个分支的进度、已成立的贡献、薄弱点、走到 SOSP/OSDI + 工业级的路线图集中到一处,方便快速对齐。
|
||||
|
||||
---
|
||||
|
||||
## 0. TL;DR
|
||||
|
||||
1. **已经成立**:v1 → v2 算法(reset-on-success、字典序 Route、worker-mode Admit RPC)有形式化定义 + 两条 theorem + SWE-Bench 50 sess ts=1 上 6/8 指标击败 4DP CA 的实测。
|
||||
2. **核心薄弱点**:(a) session-level eviction 与 KVC 设计意图冲突;(b) D→P 增量 KV 同步不存在,TTFT p99 长尾来自此;(c) mooncake "instance not alive" 级联是控制层根本可用性问题;(d) 评测仍缺多 baseline 多 trace 强统计。
|
||||
3. **不需要 GPU 也能推进**的事:算法层 unit test、形式化设计文档(block-level evict、D→P sync 接口契约)、评测协议、分层分析工具、文档体系收口。本路线图的 Milestone 1 大部分都属于此类。
|
||||
4. **进 OSDI/SOSP 必须做的**:执行 §S1(block-level evict)+ §S2(D→P sync POC)+ §M2/M3/M4(多 baseline / 全 Ali / paired 协议)。预计 3–4 个月单/双人。
|
||||
|
||||
---
|
||||
|
||||
## 1. 五个分支的状态总览
|
||||
|
||||
| 分支 | 角色 | 当前状态 | 最关键产出 |
|
||||
|---|---|---|---|
|
||||
| `main` | "已发布" 基线 | 落后 origin 18 commit;2P4D + worker-admission + seed-min2 报出 vs default PD 的 9% mean / 19% p90 改善 | `KVCACHE_CENTRIC_PROGRESS_ZH.md` 的两档策略:latency-best vs stable |
|
||||
| `kvc-debug-journey-v1-to-v4` | 主工作分支 | v1→v5 完整算法演化;`KVC_ROUTER_ALGORITHM.md` 三段算法 + 两条 theorem | SWE-Bench 50 sess ts=1:v2 6/8 指标击败 4DP CA;**TTFT p99 仍输 3×**(1.28s vs 0.43s),诊断为 8.3% reseed 慢路径 |
|
||||
| `feat/d-to-p-sync` | 占位分支 | 代码空,仅 `RESEED_SLOW_PATH_AND_D_TO_P_GAP_ZH.md` | 已排除"capacity-backup 是 D→P sync"的误解;列出 4 项工程子任务 |
|
||||
| `h200-cu130` | 真硬件 + RDMA 验证 | 4×H200 + mlx5_60 NDR 400 Gb/s 上跑 E1/E2/E3 | **E2 80% failure**(mooncake 死链级联);**E3 16min 触发 SGLang patch invariant crash**;最新 `KVC_EVICTION_GRANULARITY_DESIGN_ZH.md` 把 root cause 上升到"session-level 是错的 eviction granularity" |
|
||||
| `kvc-real-ali-iter-v1` | 真 Ali trace 验证 | 8×H20,179-req KVC-fit slice + 600-req/15min cold-window | KVC vs DP:KVC-fit p50 −46% ✅;real 15min p90 +19s ❌,53 errors vs DP 1;KVC 默认 mem-fraction OOM,必须降到 0.82 |
|
||||
|
||||
---
|
||||
|
||||
## 2. 已经"硬"成立的贡献
|
||||
|
||||
按"reviewer 能不能反驳"为标尺:
|
||||
|
||||
1. **Reset-on-success 修复 v1 thrashing**:v1 永久 blacklist → migration 死循环 failure mode 有实测 + Algorithm 3 形式化 + Theorem 1 的不饿死证明(`KVC_ROUTER_ALGORITHM.md` §3.4 / §4.1)。
|
||||
2. **三段算法分工清晰**:Algorithm 1(字典序 Route)+ Algorithm 2(D 自治 Admit RPC)+ Algorithm 3(Dispatch + reset-on-success)。v5 把 admission 从 router 估算改成 D RPC(Option D)是把 capacity ground truth 与 routing score 解耦的正确分层。
|
||||
3. **Direct-to-D 快路径的确定性命中**(Theorem 2):只要 residency ⊇ prefix ∧ append ≤ τ_append ∧ cap_ok 三条件同时成立必走快路径;SWE-Bench 91.6% 命中、TTFT p50 = 0.43s 是结构性结果。
|
||||
4. **每一个 negative result 都有 forensic 级解释**:mooncake death、cold-D、reseed 慢路径、session-level evict 都有代码定位 + 时间线 + 反例。这条对 paper 是真正加分项。
|
||||
|
||||
---
|
||||
|
||||
## 3. 让 reviewer 一击致命的薄弱点
|
||||
|
||||
### 3.1 评测方法层
|
||||
|
||||
- **M1 N 不足**:SWE-Bench v2 baseline N=3 确认 categorical,v2 自身 N 不足;缺 bootstrap CI。
|
||||
- **M2 比较口径不对等**:E2 80% 失败时用 "successful only" 算 latency 与 E1 全集比;paper 必须 paired-on-same-trial。
|
||||
- **M3 trace 偏 KVC-friendly**:KVC-fit slice 按 small-append + high overlap 筛过;full Ali(turn2+ ratio 26%、single-turn 极多)的 dilution 后结果没跑过。
|
||||
- **M4 baseline 不够强**:缺 vLLM + prefix-cache、DistServe、SplitWise、Mooncake-Master 任何一个。
|
||||
- **M5 trace 单一性**:缺 ShareGPT/Mooncake trace、缺 long-context tool-use agent benchmark、缺合成 adversarial trace。
|
||||
- **M6 硬件覆盖**:只 single-node ≤ 8 GPU;没有跨节点、没有 ≥ 32 GPU 集群实测。
|
||||
|
||||
### 3.2 系统设计层
|
||||
|
||||
- **S1 Session-level eviction 与 KVC 设计意图冲突**:90 次 evict、平均一次 free 67K tokens、25/50 session 必须 50–90K 重 prefill。`KVC_EVICTION_GRANULARITY_DESIGN_ZH.md` 已识别但未实现修复。
|
||||
- **S2 D→P 增量同步不存在**:TTFT p99 长尾 50% 来自 P 重 prefill。`capacity-backup` 是 seed-time 静态快照,不是 D→P sync。修复需改 SGLang radix 的单生产者假设。
|
||||
- **S3 Mooncake 级联 death**:admission no-space → 持续重试 seed → 心跳掉线 → SGLang 整批 abort(E2 1054/1285 失败)。控制层根本可用性 bug。
|
||||
- **S4 Admission RPC 同步阻塞**:缺 backoff / hedging / staleness budget。D scheduler GIL 抖动即把 router 卡死。
|
||||
- **S5 Cold-D / overlap-pinning**:boilerplate 24-token block hash 让所有 session 与 D0/D1 重叠 → D2/D3 0 binding。load-floor bonus 是补丁,不是 first-principles 修复。
|
||||
- **S6 SGLang 本地 patch 已 785 行 / 10 文件**,含 `schedule_batch.py:1646` 这种 hot-path 不变量改动;E3 crash 就是 vendored patch 引入的 latent landmine。
|
||||
- **S7 失败恢复 / 幂等性**:streaming session 在 chunked-prefill retry 下幂等性靠 `SessionSlot.restore_to_req`;缺 worker crash / mooncake 重连 / partial KV 损坏的恢复 protocol。
|
||||
- **S8 没有 multi-tenant / SLO-aware scheduling**:算法目标隐式 w_ttft=w_lat=1。生产里 interactive / batch / background 必须分级。
|
||||
- **S9 Topology fixed at boot**:P/D 比例是启动参数。生产负载需要 elastic。
|
||||
- **S10 Backpressure pause hint 信号未闭环**:触发 20 次但因 no-BP 无人响应;control-plane 没接通。
|
||||
|
||||
### 3.3 工程基础设施层
|
||||
|
||||
- **可观测性**:metrics 是 jsonl + 离线 `recompute_summary.py`;生产需要 Prometheus + Grafana + OpenTelemetry trace。
|
||||
- **形式化测试**:算法层与状态层缺 unit test;`SessionSlot.restore_to_req` 幂等性是作者自己 flag 的 invariant。
|
||||
- **混沌注入**:mooncake death 这种 control-plane failure 必须有 fault injection harness。
|
||||
- **代码体量**:`replay.py` 2460 行,集 orchestration / policy hook / control plane / metrics 于一身——prototype OK,paper-quality artifact 偏弱。
|
||||
|
||||
---
|
||||
|
||||
## 4. 路线图
|
||||
|
||||
分三个 milestone。每个 milestone 可独立交付(paper 章节或工程 release)。
|
||||
|
||||
### Milestone 1 — Defensible SOSP/OSDI submission(3–4 个月,单 / 双人)
|
||||
|
||||
**目标**:把现有算法 + 失败诊断收口成能扛 PC 第一轮的稿子。
|
||||
|
||||
1. **执行 §S1(block-level eviction refactor)** — 见 `docs/BLOCK_LEVEL_EVICTION_DESIGN_ZH.md`。
|
||||
- Streaming-session decode 输出在每个 turn finish 时通过 `cache_finished_req` 增量提交进 radix tree。
|
||||
- `SessionSlot` 退化为纯 metadata(仅持 `last_node` + lock_ref)。
|
||||
- `release_session` 改为 `dec_lock_ref` + 删 slot;evict 完全交给 SGLang radix LRU。
|
||||
- 预期:evict 粒度从 67K tokens/次降到 24 tokens/次;reseed 频率降一个数量级。
|
||||
2. **执行 §S2(D→P 增量同步 POC)** — 见 `docs/D_TO_P_SYNC_CONTRACT_ZH.md`。
|
||||
- microbench 证明:D append 完成后异步推 KV block 回 P 端 radix → 下次 reseed 跳过 re-prefill。
|
||||
3. **修 §S3(mooncake death 级联)**:admission RPC backoff + jitter;per-D pending-seed budget;mooncake heartbeat 与 admission 解耦。
|
||||
4. **修 §S5 的 first-principles 解法**:把 `overlap` 重定义为 "session 在 D 上独占 prefix 的 hash 数"(去掉 boilerplate 共享 hash 贡献),让 score 自然分散。
|
||||
5. **重做评测**:见 `docs/EVALUATION_PROTOCOL_ZH.md`。N≥3 + bootstrap CI + 多 baseline + 全 Ali + 分层报告。
|
||||
6. **形式化扩充**:加 Theorem 3(block-level evict 下重 prefill cost 上界)+ Theorem 4(D→P sync 的 staleness budget β 与 reseed cost 关系)。
|
||||
7. **Artifact**:一键脚本 + Dockerfile + 4×A100 一小时复现核心 table/figure。
|
||||
|
||||
### Milestone 2 — Production-quality serving substrate(再 3–6 个月,2–3 人)
|
||||
|
||||
8. **控制平面分层**:把 `replay.py` 拆成 `router/` / `control/` / `obs/` / `orch/`。
|
||||
9. **Elastic topology**:autoscaling controller,输入 (P queue, D transfer queue, D KV usage)。
|
||||
10. **Multi-tenant + SLO classes**:interactive / batch / background 三档独立 admission budget。
|
||||
11. **Failure injection harness**:mooncake link flap / D OOM kill / router GC pause / partial KV corruption;每个 case 有恢复 SLA。
|
||||
12. **Persistent KV tier**:CPU DRAM + NVMe + RDMA-attached pool;evict 改为 demote。
|
||||
13. **Cross-node + heterogeneous**:H100 + H200 + L40S 混合,topology-aware routing。
|
||||
14. **Observability**:per-request OpenTelemetry + Prometheus per-D + Grafana 主面板。
|
||||
|
||||
### Milestone 3 — 真正能进 OSDI'27 的科研增量(6–12 个月)
|
||||
|
||||
15. **Learning-based admission / migration**:multi-armed bandit / RL 控制 τ_reject 与 K;用 trace 训 session-aliveness predictor。
|
||||
16. **跨 router residency consensus**:轻量 gossip 共享 `Σ.resident[d]`。
|
||||
17. **可证明 competitive ratio**:在 oracle KV-residency 模型下证明 KVC expected TTFT 与 offline optimal 比值有界。
|
||||
18. **分布式 prefix tree**:逻辑 prefix 映射到多 D 物理副本,支持 multi-tenant prefix 共享(system prompt / tool schema)。
|
||||
19. **Energy-aware variant**:GPU SM 利用率 + PCIe/RDMA 能耗进目标函数。
|
||||
20. **End-to-end agent serving framing**:从 request-level latency 上升到 agent task completion time(coding agent 一个 task 30+ turn)。
|
||||
|
||||
---
|
||||
|
||||
## 5. 不需要 GPU 也能推进的工作清单
|
||||
|
||||
按 ROI 排:
|
||||
|
||||
- [x] 本路线图(`AUDIT_AND_ROADMAP_ZH.md`)。
|
||||
- [x] 合作者入口(`docs/INDEX_ZH.md`)。
|
||||
- [x] Block-level eviction 具体设计(`docs/BLOCK_LEVEL_EVICTION_DESIGN_ZH.md`)。
|
||||
- [x] D→P sync 接口契约(`docs/D_TO_P_SYNC_CONTRACT_ZH.md`)。
|
||||
- [x] 评测协议(`docs/EVALUATION_PROTOCOL_ZH.md`)。
|
||||
- [x] `KvAwarePolicy` 纯函数 score 抽取 + unit test(Algorithm 1)。
|
||||
- [x] 不饿死性质测试(Theorem 1)。
|
||||
- [x] 分层分析脚本(按 turn-index / append-size / overlap 三维分桶)。
|
||||
- [x] Paired-comparison 协议 helper。
|
||||
- [ ] Mooncake death 的可重现 mock harness(无 GPU 也能跑)。
|
||||
- [ ] SGLang patch surface 的归类清单(每个 patch 标"必须" / "实验性" / "可下线")。
|
||||
- [ ] Failure-mode taxonomy 文档(cold-D、overlap-pin、mooncake death、reseed storm、evict storm)。
|
||||
|
||||
---
|
||||
|
||||
## 6. 单句结论
|
||||
|
||||
> 这个项目已经具备了 SOSP/OSDI workshop / poster 的素材;要进 main track,需要把 §S1(block-level evict)和 §S2(D→P sync)做实、把 §M3(full Ali)和 §M4(两个强 baseline)补齐、把 §S3(mooncake 级联 death)的 control-plane fix 写进可重复 artifact。如果只能做一件事,先做 block-level eviction refactor —— 它同时解决"reseed 太频繁"和"P 端 radix 多生产者扩展的前置条件"。
|
||||
309
docs/BLOCK_LEVEL_EVICTION_DESIGN_ZH.md
Normal file
309
docs/BLOCK_LEVEL_EVICTION_DESIGN_ZH.md
Normal file
@@ -0,0 +1,309 @@
|
||||
# Block-level Eviction Refactor — 设计文档
|
||||
|
||||
**日期**:2026-05-12
|
||||
**前置**:[KVC_EVICTION_GRANULARITY_DESIGN_ZH.md](KVC_EVICTION_GRANULARITY_DESIGN_ZH.md)(架构层 manifesto)
|
||||
**性质**:实现层设计 + API 草案 + 测试计划,供下一个合作者直接据此编码
|
||||
**Status**:草案,未实现。代码全部 quoted from `third_party/sglang/python/sglang/srt/mem_cache/session_aware_cache.py @ origin/h200-cu130`
|
||||
|
||||
---
|
||||
|
||||
## 0. TL;DR
|
||||
|
||||
把 `SessionAwareCache` 当前对 streaming-session **整段 KV 一次性 free** 的语义改成:
|
||||
|
||||
1. Streaming-session decode 输出在 turn finish 时 **增量 commit 进 radix tree**。
|
||||
2. `SessionSlot` 退化为**纯 metadata**(仅持 `last_node` + lock_ref 状态),不再独占 KV 区间。
|
||||
3. `release_session` 改为只 dec_lock_ref + 删 slot,**让 SGLang 标准 radix LRU 按 block 粒度蚕食**。
|
||||
|
||||
预期收益:evict 粒度从一次 ~67K tokens 降到 ~24 tokens(page_size 个 token),reseed 频率降一个数量级;同时把 P 端 radix tree 改造成可被外部喂数据(为 [D_TO_P_SYNC_CONTRACT_ZH.md](D_TO_P_SYNC_CONTRACT_ZH.md) 铺路)。
|
||||
|
||||
---
|
||||
|
||||
## 1. 现状代码梳理
|
||||
|
||||
### 1.1 关键文件与函数
|
||||
|
||||
`third_party/sglang/python/sglang/srt/mem_cache/session_aware_cache.py`
|
||||
|
||||
| 函数 / 字段 | 当前语义 |
|
||||
|---|---|
|
||||
| `SessionSlot.req_pool_idx` | streaming-session 独占的 req_pool 槽位 |
|
||||
| `SessionSlot.kv_committed_len` | 上一 turn 完成时已 commit 的 KV 长度(已计入 cache_protected_len 部分进入 radix) |
|
||||
| `SessionSlot.kv_allocated_len` | 当前已分配但**未进 radix** 的 KV 长度("session-exclusive 尾部") |
|
||||
| `SessionSlot.cache_protected_len` | 首 turn 提交 radix 时的 protected 边界 |
|
||||
| `match_prefix(streaming req)` | 命中 slot → 返回 `req_to_token[req_pool_idx, :prefix_len]`,bypass radix |
|
||||
| `cache_unfinished_req(streaming req)` | subsequent turns → **完全 skip inner**(不进 radix) |
|
||||
| `cache_finished_req(streaming req)` | 调 `slot.save_from_req`,**不调 inner.cache_finished_req** |
|
||||
| `release_session(sid)` | `dec_lock_ref(slot.last_node)` + `free(req_to_token[req_pool_idx, cache_protected_len:kv_allocated_len])` + 回收 req_pool 槽位 |
|
||||
|
||||
### 1.2 当前为什么是错的(重述)
|
||||
|
||||
`[cache_protected_len, kv_allocated_len)` 是首轮入 radix 之后所有累积的 decode 输出 + 后续 turn 的 extend。在 Inferact / SWE-Bench 实测:
|
||||
|
||||
- `cache_protected_len` ≈ 首 turn boilerplate ~12K
|
||||
- `kv_allocated_len` 累积 50–100K
|
||||
- 每次 `release_session` 一次性释放 38–88K,这部分**从未进 radix**,无法享受 leaf-by-leaf 渐进 evict
|
||||
|
||||
→ session 被 evict 后必须从 client 原 prompt 重 prefill 全长 + mooncake transfer 全长,跟 naive PD-disagg 等价(详见 manifesto §1)。
|
||||
|
||||
---
|
||||
|
||||
## 2. 目标行为表
|
||||
|
||||
| 场景 | 现状 | 目标 |
|
||||
|---|---|---|
|
||||
| Session 累积 50K KV,D 满了 | `release_session` 一次释放 38K | radix LRU 从最老 leaf 开始 evict,单次 ~24 tokens |
|
||||
| Session 被 evict 后再到来 | 必须 reseed 50K | 仅 re-prefill 被 evict 的 leaf 部分(典型 ≤ 5K) |
|
||||
| Evicted session TTFT | 50–90K reseed ≈ 3–7s | 5K append-prefill ≈ 200ms |
|
||||
| 不被 evict 的 session | 同 session 内 turns append-only | 同样 append-only(不变) |
|
||||
| Direct-to-D fast path 命中率 | 91.6% (SWE-Bench) / 38% (E3 Inferact) | 应 ≥ 85% 即使 saturation |
|
||||
|
||||
---
|
||||
|
||||
## 3. 设计
|
||||
|
||||
### 3.1 SessionSlot 字段精简
|
||||
|
||||
**after refactor**:
|
||||
|
||||
```python
|
||||
@dataclass
|
||||
class SessionSlot:
|
||||
virtual_node: _VirtualNode = field(default_factory=_VirtualNode)
|
||||
|
||||
# Pointer into the radix tree — the deepest node owned by this session's
|
||||
# committed prefix. Held under inc_lock_ref so radix LRU never evicts this
|
||||
# *active* leaf out from under a turn-in-progress. Released by
|
||||
# release_session.
|
||||
last_node: Any = None
|
||||
swa_uuid_for_lock: Optional[str] = None
|
||||
|
||||
# Bookkeeping fields (no longer authoritative ownership of KV indices).
|
||||
last_access_time: float = field(default_factory=time.monotonic)
|
||||
|
||||
# Mamba state stays slot-owned (mamba doesn't fit the radix model).
|
||||
mamba_pool_idx: Any = None
|
||||
mamba_ping_pong_track_buffer: Any = None
|
||||
mamba_next_track_idx: Any = None
|
||||
mamba_last_track_seqlen: Any = None
|
||||
mamba_branching_seqlen: Any = None
|
||||
```
|
||||
|
||||
**删除**:`req_pool_idx`、`kv_committed_len`、`kv_allocated_len`、`cache_protected_len`、`swa_evicted_seqlen`。这些字段的真值改由 radix tree + req_to_token_pool 共同维护。
|
||||
|
||||
### 3.2 `cache_finished_req` 改造
|
||||
|
||||
**after refactor**:
|
||||
|
||||
```python
|
||||
def cache_finished_req(self, req: Req, is_insert: bool = True, **kwargs):
|
||||
if not _is_streaming(req):
|
||||
return self.inner.cache_finished_req(req, is_insert=is_insert, **kwargs)
|
||||
|
||||
session_id = req.session.session_id
|
||||
slot = self.slots.setdefault(session_id, SessionSlot())
|
||||
|
||||
# KEY CHANGE: always delegate to inner — this inserts the new tokens
|
||||
# (kv_committed_len .. fill_ids end) as radix-tree blocks. Subsequent
|
||||
# match_prefix calls for this session will hit the radix tree directly.
|
||||
result = self.inner.cache_finished_req(req, is_insert=is_insert, **kwargs)
|
||||
|
||||
# Update slot bookkeeping only (no KV ownership).
|
||||
slot.last_node = req.last_node
|
||||
slot.swa_uuid_for_lock = req.swa_uuid_for_lock
|
||||
slot.last_access_time = time.monotonic()
|
||||
|
||||
# Mamba state still goes through slot.
|
||||
slot.mamba_pool_idx = req.mamba_pool_idx
|
||||
...
|
||||
return result
|
||||
```
|
||||
|
||||
**不变量**:
|
||||
- `inner.cache_finished_req` 会把 `[kv_committed_len_old, kv_committed_len_new)` 范围内对齐到 page_size 的 KV 插入 radix。这个语义来自 SGLang 标准实现,无需改 inner。
|
||||
- `slot.last_node` 现在指向**当前 session 已 commit prefix 的尾节点**,每个 turn 后向前推进。
|
||||
- `dec_lock_ref(old_last_node)` + `inc_lock_ref(new_last_node)` 必须在 turn 切换时执行。
|
||||
|
||||
### 3.3 `cache_unfinished_req` 改造
|
||||
|
||||
streaming session 的 subsequent turn **不再 skip inner**。原因:现在 `match_prefix` 走 radix,chunked-prefill 中间状态也需要 inner 维护:
|
||||
|
||||
```python
|
||||
def cache_unfinished_req(self, req: Req, **kwargs):
|
||||
if _is_streaming(req) and kwargs.get("chunked", False):
|
||||
# Chunked prefill: forward to inner so the per-chunk extend gets
|
||||
# tracked in the radix LRU access timestamps.
|
||||
...
|
||||
self.inner.cache_unfinished_req(req, **kwargs)
|
||||
```
|
||||
|
||||
具体的 chunked 处理细节需要保留对 `prefix_indices` 重建的逻辑(参考当前实现 lines 215–225),但调用 `inner.cache_unfinished_req` 不能 skip。
|
||||
|
||||
### 3.4 `match_prefix` 改造
|
||||
|
||||
退化为**纯 inner 转发**——SessionSlot 不再持 KV 指针:
|
||||
|
||||
```python
|
||||
def match_prefix(self, params: MatchPrefixParams) -> MatchResult:
|
||||
# No more slot-fast-path. Streaming sessions reuse KV via radix tree
|
||||
# match like every other request.
|
||||
return self.inner.match_prefix(params)
|
||||
```
|
||||
|
||||
调用方需要的 "这个 session 的 committed prefix 长度" 信息改为通过 `inner.match_prefix(...).device_indices.shape[0]` 推导。
|
||||
|
||||
### 3.5 `release_session` 改造
|
||||
|
||||
**after refactor**:
|
||||
|
||||
```python
|
||||
def release_session(self, session_id: str) -> int:
|
||||
slot = self.slots.pop(session_id, None)
|
||||
if slot is None:
|
||||
return 0
|
||||
|
||||
# Just release our radix lock — radix LRU can now reclaim our prefix
|
||||
# leaves at its own pace. NO direct token_to_kv_pool free.
|
||||
if slot.last_node is not None:
|
||||
if slot.swa_uuid_for_lock is not None:
|
||||
self.inner.dec_lock_ref(
|
||||
slot.last_node,
|
||||
DecLockRefParams(swa_uuid_for_lock=slot.swa_uuid_for_lock),
|
||||
)
|
||||
else:
|
||||
self.inner.dec_lock_ref(slot.last_node)
|
||||
|
||||
# Mamba state still needs explicit cleanup if present.
|
||||
if slot.mamba_pool_idx is not None:
|
||||
...
|
||||
|
||||
return 0 # "freed_tokens" no longer meaningful; radix LRU shed lazily
|
||||
```
|
||||
|
||||
### 3.6 `get_session_status` / `list_session_statuses` 改造
|
||||
|
||||
`resident_tokens` 现在的真值来自 radix tree。需要在 inner 暴露一个 helper:
|
||||
|
||||
```python
|
||||
# In BasePrefixCache / RadixCache:
|
||||
def tokens_under(self, node) -> int:
|
||||
"""Count tokens in the path from root to `node` (inclusive)."""
|
||||
...
|
||||
|
||||
# In SessionAwareCache:
|
||||
def get_session_status(self, session_id: str) -> Optional[Dict[str, Any]]:
|
||||
slot = self.slots.get(session_id)
|
||||
if slot is None:
|
||||
return None
|
||||
resident_tokens = self.inner.tokens_under(slot.last_node) if slot.last_node else 0
|
||||
return {
|
||||
"session_id": session_id,
|
||||
"resident": resident_tokens > 0,
|
||||
"resident_tokens": int(resident_tokens),
|
||||
"last_access_time": float(slot.last_access_time),
|
||||
}
|
||||
```
|
||||
|
||||
`admit_direct_append` 的容量检查改用 `resident_tokens` 的 radix 真值(去掉 `kv_committed_len / kv_allocated_len` 双值不一致的可能)。
|
||||
|
||||
### 3.7 SGLang 调度路径配套改动
|
||||
|
||||
参考 `schedule_batch.py:1572-1646`,当前 streaming-session correction(commit b8e6f13 / 986f351 引入)建立在 SessionSlot 拥有独立 KV 范围之上。block-level refactor 后这条 correction 路径**完全无需存在**——req 的 fill_ids / prefix_indices 由 inner radix `match_prefix` 直接给出一致值。
|
||||
|
||||
**移除项**:
|
||||
- `schedule_batch.py:1572-1585` 的 `actual_extend_len = max(0, len(fill_ids) - len(prefix_indices))` correction 块。
|
||||
- `schedule_batch.py:1646` 的 `assert seq_len - pre_len == req.extend_input_len`(refactor 后该不变量结构上必然成立)。
|
||||
- E3 触发的 latent landmine ([E3_FINDINGS_ZH.md](E3_FINDINGS_ZH.md) §2)随之消失。
|
||||
|
||||
---
|
||||
|
||||
## 4. 不变量(必须在 PR 自测中覆盖)
|
||||
|
||||
| Inv | 内容 |
|
||||
|---|---|
|
||||
| I1 | `release_session(sid)` 后,下一次同 session 请求的 `match_prefix` 行为只取决于 radix tree 的常驻状态——不依赖 `slots` dict。 |
|
||||
| I2 | 任意 (session_id, turn_id) 的 `cache_finished_req` 调用后,radix tree 上必然存在一条 root→leaf 路径覆盖该 turn 的全部 committed token(即 `tokens_under(slot.last_node)` 严格不降)。 |
|
||||
| I3 | `restore_to_req` 必须**幂等**:在 chunked-prefill 重试场景下,对同一 req 可被调用多次而最终 req 状态等价。当前实现靠"不清 slot 字段"实现 → refactor 后改由 radix `match_prefix` 的纯函数性质保证。 |
|
||||
| I4 | 无 streaming-session 的请求(`req.session is None`)行为 **不变**:所有路径 short-circuit 到 inner。 |
|
||||
| I5 | 任一 turn 结束后,对 `slot.last_node` 的 `inc_lock_ref` 必须有对应的 `dec_lock_ref`,且 `release_session` 是最终的释放点。 |
|
||||
|
||||
---
|
||||
|
||||
## 5. 测试计划(无 GPU 可跑)
|
||||
|
||||
### 5.1 单元测试(mock inner cache)
|
||||
|
||||
写一个 `MockRadixCache(BasePrefixCache)`,记录所有 `cache_finished_req / cache_unfinished_req / match_prefix / evict / dec_lock_ref` 调用序列。然后:
|
||||
|
||||
| Test | 断言 |
|
||||
|---|---|
|
||||
| `test_release_session_no_direct_free` | 调 `release_session` 后,Mock 上 **没有** 直接 `free(kv_indices)` 调用,只有 `dec_lock_ref` |
|
||||
| `test_subsequent_turn_inserts_radix` | 模拟 turn 0 → 1 → 2 三次 `cache_finished_req`,断言每次都触发 `inner.cache_finished_req` |
|
||||
| `test_match_prefix_uses_inner` | streaming 与 non-streaming 都仅走 `inner.match_prefix` |
|
||||
| `test_restore_idempotent` | 模拟 chunked-prefill 重试,连续两次 `match_prefix` 返回的 `device_indices` 一致 |
|
||||
| `test_eviction_under_pressure_is_block_level` | inject 一个 "pool 满,必须 evict 24 tokens" 的状态,断言 `release_session` 不被触发,inner 的 LRU 单步走 |
|
||||
|
||||
### 5.2 Property-based 测试
|
||||
|
||||
```python
|
||||
@given(turns=lists(integers(min_value=24, max_value=2048), min_size=1, max_size=50))
|
||||
def test_committed_tokens_monotone(turns):
|
||||
"""tokens_under(slot.last_node) is monotonically non-decreasing across turns."""
|
||||
...
|
||||
```
|
||||
|
||||
### 5.3 Integration smoke(需要 GPU,但放在 sweep 脚本里)
|
||||
|
||||
执行 `sweep_e2_kvc_v2_rdma.sh` 同 trace 同配置,对比指标:
|
||||
- evict 总次数(期望从 90 → < 10)
|
||||
- 单次平均 evict tokens(期望从 67K → < 500)
|
||||
- TTFT p99(期望从 1.28s → < 0.7s)
|
||||
- direct-to-D 命中率(期望 ≥ 85%)
|
||||
|
||||
---
|
||||
|
||||
## 6. 工程量与风险
|
||||
|
||||
### 6.1 工程量
|
||||
|
||||
| 工作 | 估时 | 风险 |
|
||||
|---|---|---|
|
||||
| §3.1–§3.6 SessionAwareCache 改造 | 2–3 天 | 中:需要熟悉 radix 内部 lock_ref / evict 协议 |
|
||||
| §3.7 schedule_batch 清理 | 0.5 天 | 低:是删代码 |
|
||||
| §4 不变量单元测试 | 2 天 | 低 |
|
||||
| §5.3 GPU smoke + 数据对比 | 2 天 | 中:mooncake 仍可能触发 E2 级联 death,需要 §S3 修复一并跑 |
|
||||
| **总计** | **~1 周** | |
|
||||
|
||||
### 6.2 关键风险
|
||||
|
||||
1. **`inner.cache_finished_req` 对 streaming-session req 的兼容性**:当前 SGLang 标准 radix 假设 req 在 cache_finished_req 时是 "完整 prefill+decode 完成"。streaming-session 的 req 在每个 turn 结束时还会留下"未完成的 conversation",要确保 inner 在插入时不会把 decode-only tokens 当成可丢弃尾巴。需要 audit `radix_cache.py:cache_finished_req` 的实现。
|
||||
|
||||
2. **lock_ref 顺序**:turn N+1 开始的 `match_prefix` → inc_lock_ref(new_node),turn N 结束的 dec_lock_ref(old_node),时序若反了会在并发下让 LRU 把刚 commit 的 leaf 误 evict。建议加 assertion:`dec_lock_ref` 之前 `inc_lock_ref` 必须先到。
|
||||
|
||||
3. **chunked-prefill retry**:见 I3。SGLang 当前 `restore_to_req` 不清 slot 字段就是为此 retry。refactor 后必须确认 inner radix `match_prefix` 在 retry 下也幂等(标准 radix tree 是的,但要写测试明确锁住这个性质)。
|
||||
|
||||
---
|
||||
|
||||
## 7. 与 D→P sync 工作的关系
|
||||
|
||||
block-level evict 是 [D_TO_P_SYNC_CONTRACT_ZH.md](D_TO_P_SYNC_CONTRACT_ZH.md) 的**前置条件**:
|
||||
|
||||
- D→P sync 需要 P 端 radix tree **可接收外部喂入的 KV block**。
|
||||
- 当前 P 端 radix 假设单生产者(本 worker 模型输出)。
|
||||
- block-level refactor 完成后,streaming-session 的 KV 已经走标准 radix 路径——再让 radix tree 接受"外部喂入"的额外生产者就只是扩展 insert API,而不是发明新的存储路径。
|
||||
|
||||
→ 两件事可顺序做:先 block-level evict,再 D→P sync。
|
||||
|
||||
---
|
||||
|
||||
## 8. 接班 agent 的最小动作
|
||||
|
||||
1. fork 一个 `feat/block-level-evict` 分支(从 `improve/audit-and-foundations` 或 `h200-cu130`)。
|
||||
2. 实现 §3.1–§3.6。
|
||||
3. 写 §5.1 + §5.2 单元测试。
|
||||
4. 在 8×H100 / H200 上跑 §5.3 smoke,对比 evict 频次和 TTFT p99。
|
||||
5. 若 §6.2 风险 1 成立,进 SGLang `radix_cache.py` 看是否需要给 streaming-session req 加 `is_session_active=True` flag 阻止"丢弃 decode 尾"。
|
||||
|
||||
---
|
||||
|
||||
**核心句**:把 session 当 lifecycle 边界(保留),但**不要**让它做 eviction 边界(移交给 radix LRU)。这次 refactor 同时解决"reseed 太频繁"和"P 端 radix 不可外部喂入"两个 blocker。
|
||||
247
docs/D_TO_P_SYNC_CONTRACT_ZH.md
Normal file
247
docs/D_TO_P_SYNC_CONTRACT_ZH.md
Normal file
@@ -0,0 +1,247 @@
|
||||
# D→P 增量 KV 同步 — 接口契约与 rollout 计划
|
||||
|
||||
**日期**:2026-05-12
|
||||
**前置**:[RESEED_SLOW_PATH_AND_D_TO_P_GAP_ZH.md](RESEED_SLOW_PATH_AND_D_TO_P_GAP_ZH.md)(缺口定位)+ [BLOCK_LEVEL_EVICTION_DESIGN_ZH.md](BLOCK_LEVEL_EVICTION_DESIGN_ZH.md)(前置条件)
|
||||
**性质**:跨层接口契约 + staleness budget 形式化 + 分阶段 rollout
|
||||
**Status**:草案。`feat/d-to-p-sync` 分支当前为空,本文是该分支应当首先 land 的设计文档
|
||||
|
||||
---
|
||||
|
||||
## 0. TL;DR
|
||||
|
||||
reseed 慢路径的 50% 时间在 P 重 prefill,**修复 transfer 段(启 RDMA)只能解一半**。彻底消除长尾的唯一办法是让 P 端 backup 增量跟上 D 端的 append:
|
||||
|
||||
> D 在 direct-to-D 路径上完成一个 turn → 异步把新 commit 的 KV block 推回 P 端 radix → 下次 reseed 时 P 端 radix 命中完整 prefix,无需 re-prefill,仅一次 P→D transfer。
|
||||
|
||||
本文给出三层(mooncake / SGLang / agentic-pd-hybrid)的接口契约、一个 **staleness budget β** 的形式化定义,以及四阶段 rollout 计划,让该工作可以与 block-level eviction 解耦推进。
|
||||
|
||||
---
|
||||
|
||||
## 1. Staleness Budget β —— 形式化定义
|
||||
|
||||
设 D 上 session `s` 的 committed prefix 长度为 `L_D(s, t)`(time `t` 的瞬时值),P 上同 session 的 backup prefix 长度为 `L_P(s, t)`。
|
||||
|
||||
```
|
||||
staleness(s, t) := L_D(s, t) - L_P(s, t) ≥ 0
|
||||
```
|
||||
|
||||
**Staleness budget β** 是系统承诺维持的上界:
|
||||
|
||||
```
|
||||
∀ s, ∀ t : staleness(s, t) ≤ β
|
||||
```
|
||||
|
||||
直观:β 越小 → reseed 命中 P 端 backup 的可能越高 → reseed 退化为单次 P→D transfer + ≤ β tokens 的 re-prefill。
|
||||
|
||||
- **β = 0**:完全同步(D 每 commit 一块就阻塞等 P ack)。延迟成本高,不推荐。
|
||||
- **β = ∞**:当前状态(P 端 backup 永远 seed-time 静态快照)。
|
||||
- **β = 一个 page(24 tokens)**:单 block sync。理论最优粒度,但 D 端每次 append 都触发一次 D→P RPC。
|
||||
- **β = O(append_len)(典型 1K–4K)**:批量 sync。推荐起点,把同 turn 的 decode 输出聚合后整批推送。
|
||||
- **β = O(turn_size)(典型 ~50K)**:粗粒度 sync。失效 reseed bypass,仅减少 transfer。不可取。
|
||||
|
||||
→ rollout 推荐 β = `max(page_size, min(committed_in_turn, β_max))`,`β_max` 默认 4096。
|
||||
|
||||
---
|
||||
|
||||
## 2. 三层接口契约
|
||||
|
||||
### 2.1 Mooncake 层:双角色化
|
||||
|
||||
**当前状态**(详见 [RESEED_SLOW_PATH_AND_D_TO_P_GAP_ZH.md](RESEED_SLOW_PATH_AND_D_TO_P_GAP_ZH.md) §3):
|
||||
|
||||
- `MooncakeKVManager` 在初始化时按 `disaggregation_mode ∈ {PREFILL, DECODE}` 强角色化。
|
||||
- `MooncakeKVSender` 仅在 PREFILL 模式实例化,`MooncakeKVReceiver` 仅在 DECODE 模式实例化。
|
||||
- `add_transfer_request` 含硬约束 `assert disaggregation_mode == PREFILL`。
|
||||
|
||||
**目标接口**:
|
||||
|
||||
```python
|
||||
# third_party/sglang/python/sglang/srt/disaggregation/base/conn.py
|
||||
class BaseKVManager:
|
||||
roles: set[KVRole] # 替换原单值字段,允许 {PREFILL, DECODE}
|
||||
|
||||
class KVRole(Enum):
|
||||
PREFILL = "prefill"
|
||||
DECODE = "decode"
|
||||
PREFILL_BACKUP_RECEIVER = "prefill_backup_receiver" # 新:P 端接收 D→P sync
|
||||
DECODE_BACKUP_SENDER = "decode_backup_sender" # 新:D 端发送 D→P sync
|
||||
```
|
||||
|
||||
**新增类**(实现层 ~400 LOC):
|
||||
|
||||
| 类 | 角色 | 关键方法 |
|
||||
|---|---|---|
|
||||
| `DecodeKVSender` | D 端把 append 后的新 KV block 推回 P | `enqueue_sync(session_id, kv_blocks, target_p)` 异步入队,返回 `sync_id` |
|
||||
| `PrefillKVReceiver` | P 端接收 D→P sync 包 | `recv_loop()` 后台线程;每个包触发 callback 注入 radix tree |
|
||||
|
||||
**Bootstrap channel**:需要独立于现有 P→D 通道的第二个 bootstrap socket(避免 buffer pointer 协商冲突)。配置:
|
||||
- 默认 disable,由 ServerArgs flag `--enable-d2p-sync` 开启
|
||||
- 新增 port range `BOOTSTRAP_D2P_PORT_BASE = 22000`
|
||||
|
||||
### 2.2 SGLang 层:Radix 多生产者扩展
|
||||
|
||||
**当前状态**:P 端 radix 假设单生产者(本 worker 模型输出)。`RadixCache.cache_finished_req` 内部直接从 `req_to_token_pool[req_pool_idx, :]` 取 KV indices 插入树。
|
||||
|
||||
**目标接口**(在 [BLOCK_LEVEL_EVICTION_DESIGN_ZH.md](BLOCK_LEVEL_EVICTION_DESIGN_ZH.md) 完成之后):
|
||||
|
||||
```python
|
||||
class RadixCache(BasePrefixCache):
|
||||
def insert_external(
|
||||
self,
|
||||
token_ids: Sequence[int],
|
||||
kv_tensor: torch.Tensor,
|
||||
*,
|
||||
source_worker_id: str,
|
||||
session_id: str,
|
||||
) -> InsertExternalResult:
|
||||
"""
|
||||
Insert KV blocks supplied by an external worker (D→P sync).
|
||||
|
||||
Allocates fresh slots in token_to_kv_pool, copies kv_tensor into them,
|
||||
and threads the resulting indices through the radix tree exactly like
|
||||
cache_finished_req would for a local prefill.
|
||||
|
||||
Invariants:
|
||||
- Same model layout (verified at handshake time, not per-call).
|
||||
- On collision with existing radix path, no-op for the shared prefix
|
||||
and only insert the diverging suffix.
|
||||
- Inserted nodes get lock_ref += 1 if `pin=True`, default False.
|
||||
D→P sync is best-effort; LRU is allowed to evict the inserted leaves.
|
||||
"""
|
||||
```
|
||||
|
||||
**关键设计点**:
|
||||
|
||||
| 决策 | 选项 | 推荐 |
|
||||
|---|---|---|
|
||||
| KV index 重映射 | A) D 发原 indices, P 重映射;B) D 发紧密打包的 tensor,P 重新分配 | **B**:避免跨 worker 索引泄漏 |
|
||||
| 失败处理 | A) D→P 失败 → 退化为重 prefill;B) 重试 N 次 | **A** + 后续 reseed 时若 P 未命中走旧路径 |
|
||||
| Reference counting | sync 进 P 的 KV 是否被 pin? | **不 pin**:P 端 LRU 自然管理,避免 backup 把生产 KV 挤出 |
|
||||
| 与 evict 协调 | sync 来到时 P 满怎么办? | 让 sync insert 触发 inner.evict → 与本地生产 KV 公平 LRU 竞争 |
|
||||
| 同 session 多 P 实例 | router round-robin 把 turn 派到不同 P 怎么办? | **接受 multi-source**:每个 P 维护自己的 backup;reseed 时挑 staleness 最小者 |
|
||||
|
||||
### 2.3 agentic-pd-hybrid 层:Hooks 与状态机
|
||||
|
||||
**新增 CLI flag**:
|
||||
|
||||
```bash
|
||||
--enable-d2p-sync # off by default
|
||||
--d2p-staleness-budget-tokens 4096 # β_max
|
||||
--d2p-sync-batch-min-tokens 24 # 至少 ≥ 1 page 才触发
|
||||
--d2p-sync-target-policy {last_p, round_robin, broadcast}
|
||||
# last_p: 推回该 session 上次 seed 的 P
|
||||
# broadcast: 推到所有 P(reseed 时灵活但带宽大)
|
||||
```
|
||||
|
||||
**新增 state 字段**(`replay.py` 的 `DirectSessionState`):
|
||||
|
||||
```python
|
||||
@dataclass
|
||||
class DirectSessionState:
|
||||
...
|
||||
# NEW: per-P backup view, populated by D->P sync callbacks.
|
||||
prefill_resident_tokens_by_p: dict[str, int] = field(default_factory=dict)
|
||||
last_d2p_sync_at: float | None = None
|
||||
```
|
||||
|
||||
**Hook 在 `_invoke_session_direct` 完成后**:
|
||||
|
||||
```python
|
||||
async def _invoke_session_direct(...):
|
||||
...
|
||||
response = await self._stream_direct_to_d(...)
|
||||
if response.ok and self.config.enable_d2p_sync:
|
||||
new_committed = response.kv_committed_len
|
||||
prev_p_resident = max(session.prefill_resident_tokens_by_p.values(), default=0)
|
||||
staleness = new_committed - prev_p_resident
|
||||
if staleness >= self.config.d2p_sync_batch_min_tokens:
|
||||
target_p = self._choose_d2p_target(session)
|
||||
asyncio.create_task(
|
||||
self._issue_d2p_sync(session, target_p, prev_p_resident, new_committed)
|
||||
)
|
||||
```
|
||||
|
||||
**Hook 在 reseed 路径**(`_invoke_kvcache_seeded_router`):
|
||||
|
||||
```python
|
||||
async def _invoke_kvcache_seeded_router(..., request):
|
||||
...
|
||||
if self.config.enable_d2p_sync:
|
||||
# Probe P-side residency before issuing full re-prefill.
|
||||
probe = await self._probe_prefill_residency(session_id)
|
||||
if probe.resident_tokens >= request.prefix_len - β_max:
|
||||
# Use the up-to-date backup: skip re-prefill, just trigger P→D transfer.
|
||||
return await self._invoke_p_to_d_transfer_only(...)
|
||||
# Fall back to existing path.
|
||||
return await self._invoke_kvcache_seeded_router_legacy(...)
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
## 3. 性质(待证明)
|
||||
|
||||
### 3.1 Theorem 4 候选(论文形式)
|
||||
|
||||
*设 staleness budget β 维持成立。对一个 session `s` 在 D 上累积长度 L、被 evict 后 reseed 触发:*
|
||||
|
||||
```
|
||||
reseed_cost(s) ≤ T_p2d(L) + T_prefill(min(β, L))
|
||||
```
|
||||
|
||||
*其中 T_p2d 是 P→D transfer 时间(在 RDMA 下 ~L · 4 ns/token),T_prefill 是 prefill 时间(在 H100 TP1 Qwen3-30B 下 ~50K tokens/s)。当 β ≪ L 时退化为 single P→D transfer 主导。*
|
||||
|
||||
**对比 baseline**(无 D→P sync):`reseed_cost = T_p2d(L) + T_prefill(L − seed_size)`,re-prefill 占主导。
|
||||
|
||||
### 3.2 与 Theorem 2 的关系
|
||||
|
||||
Theorem 2 只保证 direct-to-D 路径的快速命中。Theorem 4 把"fast path miss 时的 fallback cost"也压低到次秒级,使 KVC 在**全分位数**击败 DP 成为可能。
|
||||
|
||||
---
|
||||
|
||||
## 4. 四阶段 Rollout
|
||||
|
||||
| Phase | 范围 | GPU 需求 | 验收指标 |
|
||||
|---|---|---|---|
|
||||
| **P1** | block-level eviction refactor([BLOCK_LEVEL_EVICTION_DESIGN_ZH.md](BLOCK_LEVEL_EVICTION_DESIGN_ZH.md)) | 4×H100 smoke | evict 单次平均 ≤ 500 tokens |
|
||||
| **P2** | mooncake 双角色化 + microbench(D→P 单包 RTT、带宽利用) | 单机 + RDMA | P→D RTT < 50ms(local),单 16K-token block 带宽 ≥ 50% 理论上限 |
|
||||
| **P3** | SGLang `insert_external` + agentic-pd-hybrid hook(仅 best-effort,无 reseed probe) | 4×H100 + RDMA | sync 触发率 > 80% 同 turn 内完成;不引入新 failure mode |
|
||||
| **P4** | reseed probe 接通 + 端到端 evaluation | 4×H100 + RDMA | reseed 单次 < 0.5s(vs 当前 3–7s),TTFT p99 < 0.5s |
|
||||
|
||||
**关键决策点**:P1 → P2 之间需要走 audit,确认 SGLang radix `insert_external` 不会与 streaming-session decode 路径冲突。若发现严重冲突,引入 "P-only sync mode" 占位,等架构稳定再放开。
|
||||
|
||||
---
|
||||
|
||||
## 5. 风险与对策
|
||||
|
||||
| 风险 | 影响 | 对策 |
|
||||
|---|---|---|
|
||||
| Mooncake 双角色化破坏现有 P→D 单向路径 | E2 已暴露 mooncake "instance not alive" 级联,再加一条通道可能放大 | P2 阶段先用独立 bootstrap channel + feature flag;保留 disable 路径 |
|
||||
| D→P sync 占用 D 出口带宽,影响 direct-to-D append-prefill 延迟 | 直接劣化主路径 | sync 用低优先级 QP(RDMA SL=0),且 batch 触发,单 turn 内最多 1 次 |
|
||||
| P 端 radix 被 backup 填满,反而挤出本地生产 KV | P 端 prefill 速度降 | sync 插入不 pin(§2.2),让 LRU 公平竞争 |
|
||||
| 多 P 多 backup view 协调复杂 | router 选择 target_p 时需考虑 staleness | 起点用 `last_p` policy(recency-biased),观察实测分布再决定是否上 `broadcast` |
|
||||
| 跨 SGLang patch 升级时 `insert_external` 与 upstream API 漂移 | 维护负担 | 把 API 限制在我方 vendor patch 边界(不污染 upstream radix),并写 contract test |
|
||||
|
||||
---
|
||||
|
||||
## 6. 与 block-level eviction 的解耦关系
|
||||
|
||||
| 工作 | 是否依赖另一个 |
|
||||
|---|---|
|
||||
| block-level eviction | 不依赖 D→P sync,可独立交付。能单独降低 reseed 频次 |
|
||||
| D→P sync | **依赖** block-level eviction:需要 P 端 radix 是 streaming session KV 的真值源 |
|
||||
| 一起做 | 收益最大:reseed 频次降一个数量级 + 单次 reseed 时间降一个数量级 |
|
||||
|
||||
→ rollout 顺序:block-level eviction 先 land,D→P sync 随后开 `feat/d-to-p-sync` 推进。两者**不应**合在一个 PR 里。
|
||||
|
||||
---
|
||||
|
||||
## 7. 接班 agent 的最小动作
|
||||
|
||||
1. 在 `feat/d-to-p-sync` 分支上 land 本文。
|
||||
2. 等 block-level eviction 进 main 后,开 P2 阶段:mooncake 双角色化 + microbench(单测,无 SGLang 主路径耦合)。
|
||||
3. P3 阶段加 `insert_external` 与 hook;以 disabled-by-default 进 main。
|
||||
4. P4 端到端 evaluation 后再判断 reseed probe policy(`last_p` vs `broadcast`)。
|
||||
|
||||
---
|
||||
|
||||
**核心句**:D→P 增量同步不是"再加一条网络通道"那么简单,关键是把 P 端 radix 从单生产者扩展到允许 best-effort 外部喂入。Block-level eviction 是这件事的前置条件——所以两件工作可以一前一后,不能颠倒。
|
||||
137
docs/E1_E2_FIX_DESIGN_ZH.md
Normal file
137
docs/E1_E2_FIX_DESIGN_ZH.md
Normal file
@@ -0,0 +1,137 @@
|
||||
# E1 / E2 Failure Modes — Fix Design Space (no code changes)
|
||||
|
||||
**Status**: design proposal for review.
|
||||
**Branch**: `h200-cu130`.
|
||||
**Companion**: `docs/E1_E2_RESULTS_ZH.md` §5b–§5d for the forensic findings this design responds to.
|
||||
|
||||
This document evaluates candidate fixes for the two pathologies E1 / E2 exposed:
|
||||
- **Q1**: D scheduler thread starves the mooncake C++ control plane during LRU evictions, causing P-side `batch_transfer_sync` to time out (~30 s) and the hair-trigger in `conn.py:1270` to permanently blacklist the D's mooncake_session_id.
|
||||
- **Q2**: KvAwarePolicy's overlap-first lex score, combined with workloads where new sessions share boilerplate hash_ids with already-resident sessions on D0/D1, leaves D2 cold for the entire run.
|
||||
|
||||
For each problem we list candidate fixes, the layer they touch, their assumptions, and what could go wrong. **No code is committed** until a path is chosen.
|
||||
|
||||
---
|
||||
|
||||
## Q1 — Eviction starves mooncake control plane
|
||||
|
||||
### Mechanism recap
|
||||
|
||||
Inside `decode-0.log` at the moment of P-side timeout (`Sync batch data transfer timeout after 37452515723ns`):
|
||||
|
||||
```
|
||||
01:56:34 Decode batch ... gen 174 tok/s ← serving fine
|
||||
01:56:42 session id 1000315 does not exist, cannot delete.
|
||||
01:56:42 Trimmed decode session cache via LRU. evicted=2, freed=77675, available 38574 → 116249
|
||||
01:56:42 Trimmed decode session cache via LRU. evicted=1, freed=36166, available 29038 → 65204
|
||||
01:56:42 Decode transfer failed ... ← P-side timeout fires
|
||||
```
|
||||
|
||||
`maybe_trim_decode_session_cache` (in vendored sglang scheduler) walks per-session resident bookkeeping, releases GPU KV slots via `kv_pool_allocator.free()`, and updates `session_aware_cache` under lock. While that runs, the scheduler main loop is busy and the mooncake control-plane callbacks scheduled into the same event loop don't get serviced. P sees no completion ack within 30 s → `batch_transfer_sync` returns nonzero → hair-trigger fires.
|
||||
|
||||
### Design space
|
||||
|
||||
| # | Fix | Layer | Mechanism | Assumes | Risks |
|
||||
|---|---|---|---|---|---|
|
||||
| **Q1.A** | Pre-emptive low-watermark eviction | vendored SGLang | Trigger LRU when `token_usage > 0.7` in idle scheduler ticks, so admission rarely needs to evict inline. SGLang already has `_decode_session_cache_low_watermark_tokens`; question is whether it currently runs proactively or only on-demand. | Idle ticks exist to absorb the work; the per-trim cost is bounded enough that doing it pre-emptively doesn't hurt the steady-state. | If proactive trims pick "warm" sessions (recently active), we lose direct-to-D fast-path hits. Need careful watermark + LRU-priority tuning. |
|
||||
| **Q1.B** | Async eviction thread | vendored SGLang | Move LRU trim off the scheduler main loop into a background worker. Scheduler main loop only calls `notify_evict_needed()`; mooncake control plane keeps running. | KV pool free / session_aware_cache mutations can be made thread-safe with reasonable lock granularity. | Largest blast radius. Concurrent in-flight transfers can race with eviction of the same KV slots; need explicit ref-counting. Harder to reason about correctness. |
|
||||
| **Q1.C** | Bump mooncake transfer timeout | mooncake env / wheel patch | Set `MC_TRANSFER_TIMEOUT_NS` (or equivalent) from 30 s default → 120 s+, giving D's eviction more headroom before P gives up. | A real broken link won't go unnoticed for ≥120 s. | Pure defense-in-depth. Doesn't fix LRU thrashing; under heavier load eviction could exceed 120 s too. Slows real-failure detection. |
|
||||
| **Q1.D** | Windowed hair-trigger | vendored SGLang `conn.py:1270` | Replace `if session_failures >= 1:` with `if session_failures ≥ N within window`. Add periodic probe to D bootstrap port to clear `failed_sessions` after success. | Transient stalls are recoverable; real deaths are not. | Changes core failure semantics. We may keep dispatching to a D that is actually slow-dying. Adds windowed-state bookkeeping to a stable codepath. |
|
||||
| **Q1.E** | Router-side backpressure | our `--enable-backpressure` (already exists, off by default) | D returns `recommended_pause_ms` in its admission RPC when pool > threshold; router pauses dispatch to that D. Already implemented. | Pausing dispatch upstream prevents D from ever reaching saturation, so LRU never thrashes. | Doesn't help in-flight transfers when stall happens; only prevents future arrivals. Won't rescue requests already mid-mooncake when LRU fires. |
|
||||
| **Q1.F** | Upstream load balance (= Q2 fix) | our `policies.py` | Spread sessions to D2 so D0/D1's KV pool never saturates; LRU never trims; mooncake never stalls; hair-trigger never fires. | Q2 fix is sound and the workload's KV demand fits into 3 D's evenly. | The LRU+mooncake interaction stays latent. A different workload that still imbalances (e.g. a few sessions much larger than others) could re-trigger. |
|
||||
|
||||
### Recommendation for Q1
|
||||
|
||||
**Primary: Q1.F (do Q2 fix first).** This is upstream of the failure cascade and removes the only situation in which we observe LRU thrashing in our experiments. If Q2 is fixed and re-running E2 still shows mooncake stalls, then we *know* it's a real symptom and need defense-in-depth.
|
||||
|
||||
**Defense-in-depth (cheap): Q1.C (bump mooncake timeout).** Single env-var change, gives 4× safety margin, costs nothing. Safe to do regardless.
|
||||
|
||||
**Avoid for now: Q1.B and Q1.D.** Both touch vendored SGLang in invasive ways that change failure-detection semantics. Hold until Q1.F + Q1.C demonstrate they aren't enough.
|
||||
|
||||
**Open question for the team**: does SGLang's existing `low_watermark` LRU trigger (Q1.A) already run proactively? If we read the scheduler loop and find it only trims on demand, Q1.A is a small targeted change worth doing; if it's already proactive, the trims we observe are because watermark is set too high → tune the constant.
|
||||
|
||||
---
|
||||
|
||||
## Q2 — Cold-D never gets a session
|
||||
|
||||
### What we already know is wrong
|
||||
|
||||
User's observation: the existing `migration_reject_threshold=3` mechanism fires *after 3 wasted prefills*, which is too late. The fix needs to be *proactive*: the first request to a fresh session should already prefer the cold D over a hot D whose only advantage is shared boilerplate overlap.
|
||||
|
||||
### Design space
|
||||
|
||||
Let `assigned[D] = state.decode_assignment_counts[D]` and `inflight[D] = state.inflight_decode[D]`. Lex score is currently:
|
||||
|
||||
```
|
||||
score(D) = (overlap + α·sticky, sticky, -inflight, -assigned)
|
||||
```
|
||||
|
||||
| # | Fix | Mechanism | Assumes | Risks |
|
||||
|---|---|---|---|---|
|
||||
| **Q2.A** | Cold-D bonus (binary, what the reverted commit did) | `cold_boost = K if assigned[D]==0 and not sticky else 0`; add to lex position 0. | Each D needs to be "popped" from cold once, after that the bonus disappears. | One-shot: only protects the first session per D. After all 3 D's have ≥1 session, bonus is 0 everywhere and we're back to overlap-dominates-everything. If new session pressure remains skewed (e.g. boilerplate keeps growing on D0/D1), we re-imbalance silently. |
|
||||
| **Q2.B** | Load-floor bonus (graduated, my recommended primary) | `floor_bonus = max(0, K · (1 − assigned[D] / max(assigned[*])))` (or similar continuous fn); add to lex position 0; gated on `not sticky`. | "Lower assignment count = preferable for fresh sessions" is a sound bias even when no D is fully cold. | Tuning: K must dominate boilerplate overlap (~50 blocks here) but not so much that it drowns out genuine prefix-cache wins (a session with real 800-block overlap with one D should still go there). Suggest K ≈ 100×median(overlap_for_fresh_sessions). |
|
||||
| **Q2.C** | Lex re-order: inflight first | Change score to `(-inflight, overlap + α·sticky, sticky, -assigned)`. | Idle D always wins ties → idle D2 wins fresh sessions immediately. | Contradicts the existing design intent (overlap-first = cache-locality-first). Hurts cache reuse when load *is* balanced. Sticky requests at turn 1+ might be diverted to a momentarily idle D, breaking cache locality of subsequent turns. |
|
||||
| **Q2.D** | Capacity-aware overlap discount | `effective_overlap = overlap · (1 − inflight[D] / max_inflight)`; replace `overlap` in score. | Loaded D's overlap is worth less than idle D's overlap because of queueing cost. Matches what theory says about cache-vs-load tradeoff. | More complex than Q2.B; needs `max_inflight` estimate (per-D? global?). Harder to reason about and tune. Saves only marginal modeling correctness over Q2.B. |
|
||||
| **Q2.E** | Pre-warm cold D's at startup | After SGLang warmup, send a synthetic request whose hash_ids cover the boilerplate prefix to each D, populating `state.resident[D]` evenly. | We can identify "the shared boilerplate" by inspecting the trace before launch (or extracting common prefix at run start). | Trace-aware / requires upstream knowledge. Doesn't help workloads with multiple distinct shared prefixes. Workload-coupled — feels brittle. |
|
||||
| **Q2.F** | Drop overlap unless "material" | Apply overlap term only when overlap > τ blocks (or > τ% of input). | Tiny overlap doesn't actually save meaningful prefill work. | Hides imbalance instead of solving it. If a workload has medium overlap (say 15%), threshold won't fire and we're back to imbalance. Doesn't address the bigger issue. |
|
||||
| **Q2.G** | Fix the substring filter (the actual `_is_admission_rejection_mode` bug) | Either widen `_ADMISSION_REJECTION_SUBSTRINGS` to include `"kvcache-centric"`, or call `state.record_admission_reject` directly from the actual reject signal site instead of string-matching after the fact. | Existing migration mechanism is sound *once* it gets fed the right signal. | User has explicitly said 3-reject threshold is too late. So Q2.G alone isn't enough. But it's still a real bug — fixing it is orthogonal cleanup. |
|
||||
|
||||
### Recommendation for Q2
|
||||
|
||||
**Primary: Q2.B (load-floor bonus, graduated).**
|
||||
- Continuous, not binary one-shot like Q2.A — gracefully handles the case where new sessions keep arriving and load needs to keep spreading.
|
||||
- Decouples "node-idle preference" from overlap as separate signals — composable, debuggable.
|
||||
- Sticky stays on by gating on `not sticky` → no risk of breaking turn 1+ cache locality.
|
||||
- Single knob (`K`) to tune.
|
||||
|
||||
**Orthogonal cleanup: Q2.G (fix the reject-substring filter).** Independent of Q2.B, since the migration mechanism is the *backstop* (when load-floor bonus alone isn't enough to migrate from a saturated D mid-session). User correctly noted that waiting 3 rejects is too late as the *primary* mechanism, but as a *backstop after* primary load balancing, it's still valuable.
|
||||
|
||||
**Avoid: Q2.C** (lex re-order destroys overlap-first design). **Avoid: Q2.E** (workload-coupled, brittle). **Q2.D / Q2.F** are reasonable but more complex than Q2.B with marginal gain.
|
||||
|
||||
### Concrete shape of Q2.B (for review, not for merge)
|
||||
|
||||
```python
|
||||
# In KvAwarePolicy.select, replacing the current score line:
|
||||
total_assigned = sum(state.decode_assignment_counts.values())
|
||||
n_decoders = max(1, len(topology.route_workers))
|
||||
mean_assigned = total_assigned / n_decoders
|
||||
|
||||
# Per-D fairness deficit: how much below the running mean is this D?
|
||||
deficit = max(0, mean_assigned - state.decode_assignment_counts.get(worker.worker_id, 0))
|
||||
floor_bonus = int(self.load_floor_bonus * deficit / max(1, mean_assigned)) if not sticky else 0
|
||||
|
||||
score = (
|
||||
overlap + sticky * self.sticky_bonus + floor_bonus,
|
||||
sticky,
|
||||
inflight_penalty,
|
||||
assignment_penalty,
|
||||
)
|
||||
```
|
||||
|
||||
Knob: `load_floor_bonus: int = 0` (off by default, opt-in). When set to e.g. 200, an empty D that should have 16 sessions but has 0 gets `floor_bonus = 200 * 16 / 16 = 200`, dominating boilerplate overlap (~50). A D that's only 1 session below mean gets `floor_bonus = 200 * 1 / 16 ≈ 12`, which doesn't override real prefix-cache wins.
|
||||
|
||||
But this is just a *sketch* — real tuning needs an empirical pass on the same Inferact subset to verify D2 receives sessions and overlap-driven cache wins survive on D0/D1.
|
||||
|
||||
### Validation plan if we go with Q2.B
|
||||
|
||||
1. Implement Q2.B + flag, default off.
|
||||
2. Re-run E2 on the same `outputs/inferact_50sess.jsonl` subset with `--kvcache-load-floor-bonus 200`.
|
||||
3. Check structural log: do D0/D1/D2 each get a non-trivial share of `session-d-binding.jsonl` rows?
|
||||
4. Check failure rate: drop from 1054 → < 100? (Hypothesis: yes, because the LRU thrash that triggered the mooncake hair-trigger was downstream of D0/D1 saturation.)
|
||||
5. Check direct-to-D rate: should stay similar or improve (load-balancing should not destroy cache reuse, since sticky still wins for known sessions).
|
||||
6. Re-evaluate H1 with E1 vs the new E2.
|
||||
|
||||
---
|
||||
|
||||
## Decision points (for review)
|
||||
|
||||
| # | Question | Default if no answer |
|
||||
|---|---|---|
|
||||
| D1 | Q1: do Q2 fix first and re-measure before touching mooncake / SGLang? | **Yes** (recommended) |
|
||||
| D2 | Q1: bump mooncake `MC_TRANSFER_TIMEOUT_NS` to 120 s as cheap defense-in-depth? | Yes |
|
||||
| D3 | Q2: is Q2.B (load-floor bonus, graduated) the right shape, or should we pick a different option from the table? | Q2.B |
|
||||
| D4 | Q2: also do Q2.G (fix the reject-substring filter) as orthogonal cleanup? | Yes |
|
||||
| D5 | Q2.B: is the proposed deficit-vs-mean formula OK, or do you prefer a simpler "bonus = K · (max - mine) / max" form? | Defer |
|
||||
| D6 | Q2.B: bonus magnitude K = 200 reasonable, or want to grid-search a few values? | Try 200 first |
|
||||
| D7 | Validation: re-run E2 on same 50-session subset, or expand to 100 sessions for more headroom? | Same subset |
|
||||
|
||||
Once the shape is approved, the next implementation pass is small and concentrated in `policies.py` + `replay.py` + `cli.py` (no SGLang vendor changes needed for the primary fix).
|
||||
416
docs/E1_E2_RESULTS_ZH.md
Normal file
416
docs/E1_E2_RESULTS_ZH.md
Normal file
@@ -0,0 +1,416 @@
|
||||
# E1 vs E2 Experiment Results — H200 + Driver 570
|
||||
|
||||
**Status**: E1 ✅ complete (2026-05-12 01:48 UTC, wall 1h29min). E2 ✅ complete (2026-05-12 03:22 UTC, wall 1h33min).
|
||||
**Branch**: `h200-cu130`.
|
||||
**Trace**: `outputs/inferact_50sess.jsonl` (deterministic head-cut of Inferact `codex_swebenchpro` to first 50 trials, md5 `7bb263a32600ef5a6ef5099ba340a487`, 1285 requests, mean input_length 67,631 tokens).
|
||||
**Hardware**: 4× H200 80GB, driver 570.86.15 (cu12.8 API), Mellanox mlx5_60 RoCE 400 Gb/s NDR.
|
||||
**Model**: Qwen3-30B-A3B-Instruct-2507 (TP1).
|
||||
**Toolchain**: vendored SGLang 0.5.10 + cu12.8 nvcc local install (`~/cuda-12.8`) — see `docs/H200_DRIVER570_SETUP_ZH.md`.
|
||||
|
||||
---
|
||||
|
||||
## 1. Hypotheses being tested
|
||||
|
||||
From `docs/ONBOARDING_NEXT_AGENT_ZH.md` §3.1:
|
||||
|
||||
- **H1**: KVC v2's wins are not just from "1P3D topology + kv-aware policy" — the KVC layer (admission / migration / direct-to-D) contributes meaningfully on top. Pairing E1 (no KVC layer) against E2 (full KVC v2) on the **same subset** isolates the marginal contribution.
|
||||
- **H2/H3**: Enabling real RDMA pushes TTFT p99 down from the reported 1.28s (TCP loopback) toward ~0.7s. Independent of H1, this is measured inside E2 alone (comparing against the historical TCP-loopback v2 reference).
|
||||
|
||||
---
|
||||
|
||||
## 2. E1 results — naive 1P3D + kv-aware + RDMA
|
||||
|
||||
**Configuration**: `mechanism=pd-disaggregation`, `policy=kv-aware`, 1P3D (GPU0=P, GPU1/2/3=D), `--force-rdma --ib-device mlx5_60`, `--concurrency-limit 32`, ts=1.
|
||||
|
||||
| Metric | E1 |
|
||||
|---|---:|
|
||||
| request_count | 1285 |
|
||||
| success | 1200 |
|
||||
| **error_count** | **85** |
|
||||
| **failure_count** | **85** |
|
||||
| abort_count | 0 |
|
||||
| latency mean | 96.34 s |
|
||||
| latency p50 | 93.21 s |
|
||||
| latency p90 | 180.69 s |
|
||||
| latency p99 | 219.46 s |
|
||||
| ttft mean | 90.48 s |
|
||||
| ttft p50 | 88.62 s |
|
||||
| ttft p90 | 175.13 s |
|
||||
| **ttft p99** | **207.39 s** |
|
||||
| execution_modes | `pd-disaggregation-router: 1200`, `pd-disaggregation: 85` (errors) |
|
||||
| per_decode_load | **D0:575, D1:710, D2:0** |
|
||||
| per_prefill_load | P0:1285 |
|
||||
| cache_hit_request_count | 1199 / 1200 (99.9%) |
|
||||
|
||||
### Key observations on E1
|
||||
|
||||
1. **D2 was never bound to a single session**. All 50 sessions got pinned to D0 or D1 by `kv-aware` policy's (overlap + sticky + inflight + assigned) lex-score, and naive pd-disaggregation has no migration mechanism to rebalance. Effective topology was **1P2D**, not 1P3D.
|
||||
2. **Massive queueing**. TTFT p50 ≈ 89 s and p99 > 200 s indicate sessions waited tens of seconds in router/prefill queue. With `--concurrency-limit 32` and D0/D1 saturated, the inflight cap forced ~1250 reqs to serialize through only two decode workers.
|
||||
3. **85 failures (6.6%)** — all `execution_mode == pd-disaggregation` (which the metrics module classifies as `error` when the agentic-pd-hybrid replay sees an unsuccessful upstream response). Most likely caused by `--request-timeout-s 300` firing on the longest queued requests.
|
||||
4. **Cache hit 99.9%** — the kv-aware policy did successfully concentrate sessions on their prior D worker; the Inferact converter's prefix-shared 24-token-block hash_ids gave near-perfect prefix overlap across turns of the same session.
|
||||
|
||||
### What E1 establishes
|
||||
|
||||
For the same hardware, same trace, same model, **naive 1P3D + kv-aware policy is unusable for multi-session agentic workloads**:
|
||||
- session-stickiness without migration leaves a third of compute capacity (1 of 3 decode GPUs) entirely unused
|
||||
- queueing dominates user-facing latency
|
||||
- failure rate is 6.6% even with 5 minutes per-request timeout
|
||||
|
||||
This is *the baseline H1 needs* — it shows the KVC layer (E2) has something concrete to improve over.
|
||||
|
||||
---
|
||||
|
||||
## 3. E2 results — KVC v2 + RDMA
|
||||
|
||||
**Configuration**: `mechanism=kvcache-centric`, `policy=kv-aware`, 1P3D, `--force-rdma --ib-device mlx5_60`, `--kvcache-admission-mode worker`, `--kvcache-direct-max-uncached-tokens 8192`, `--kvcache-migration-reject-threshold 3`, `--kvcache-prefill-backup-policy release-after-transfer`, `--kvcache-prefill-priority-eviction`, ts=1.
|
||||
|
||||
| Metric | E2 |
|
||||
|---|---:|
|
||||
| request_count | 1285 |
|
||||
| success | 231 |
|
||||
| **error_count** | **1054** |
|
||||
| **failure_count** | **1054** |
|
||||
| abort_count | 0 |
|
||||
| latency mean (successful only) | 10.94 s |
|
||||
| latency p50 | 7.44 s |
|
||||
| latency p90 | 20.68 s |
|
||||
| latency p99 | 64.73 s |
|
||||
| ttft mean (successful only) | 1.76 s |
|
||||
| ttft p50 | 0.43 s |
|
||||
| ttft p90 | 6.56 s |
|
||||
| **ttft p99** | **8.74 s** |
|
||||
| execution_modes (succ.) | direct-to-D: 87; turn1-seed: 50; reseed: 12; large-append-reseed: 11; seed-filter-early-turn: 50; large-append-cap: 21 |
|
||||
| per_decode_load | **D0:600, D1:685, D2:0** |
|
||||
| per_prefill_load | P0:1285 |
|
||||
| cache_hit_request_count | 230 / 231 (99.6 %) |
|
||||
|
||||
### Key observations on E2
|
||||
|
||||
1. **D2 still has zero bindings** — same root cause as E1. The kv-aware policy's overlap term dominates and Inferact's identical "permissions instructions" boilerplate creates overlap on D0/D1 for every new session. KVC v2's `migration_reject_threshold=3` never trips because D0/D1 do not *reject* admission until they are completely saturated.
|
||||
2. **80 % failure rate, 1054 / 1285**. **NOT timeouts** — actual root cause is a 3-layer cascade documented in §6. Quick summary: 562 "no-space" admission rejects from D0/D1 → router falls back to seed/reseed paths needing mooncake → mooncake heartbeats drop ("Decode instance could be dead") → SGLang aborts the request → client sees `RuntimeError: generate stream ended before producing any token`.
|
||||
3. **Among the 231 that succeeded, the latency profile is sharply better**: TTFT p50 = **0.43 s** vs E1's 88.62 s (E2/E1 = 0.5 %), latency p50 = **7.44 s** vs E1's 93.21 s (8 %). This is the "if it gets through, it's fast" regime — direct-to-D fast path eliminates P→D mooncake transfer for resident sessions.
|
||||
4. **Direct-to-D fast path engaged 87 / 231 = 37.7 %** of successful requests. Lower than historical v2's 91.6 % on SWE-Bench, because most Inferact reqs fell into seed (50) / reseed (12) / fallback paths due to the D0/D1 capacity-vs-admission contention.
|
||||
|
||||
---
|
||||
|
||||
## 4. Comparison table — E1 vs E2
|
||||
|
||||
Numbers below are over **all 1285 requests** for E1 (since failure rate is small) but **only the 231 successful** for E2 (since the bulk timed out before producing latency datapoints). This is **not a fair head-to-head**, see §6.
|
||||
|
||||
| Metric | E1 | E2 (succ only) | E2 / E1 |
|
||||
|---|---:|---:|---:|
|
||||
| Total reqs | 1285 | 1285 | – |
|
||||
| Successful | 1200 | **231** | 0.19× |
|
||||
| **error_count** | 85 (6.6 %) | **1054 (82 %)** | **12.4× worse** |
|
||||
| lat mean | 96.34 s | 10.94 s | 0.114 |
|
||||
| lat p50 | 93.21 s | **7.44 s** | **0.080** |
|
||||
| lat p90 | 180.69 s | 20.68 s | 0.114 |
|
||||
| lat p99 | 219.46 s | 64.73 s | 0.295 |
|
||||
| ttft mean | 90.48 s | 1.76 s | 0.019 |
|
||||
| **ttft p50** | 88.62 s | **0.43 s** | **0.005** |
|
||||
| ttft p90 | 175.13 s | 6.56 s | 0.037 |
|
||||
| ttft p99 | 207.39 s | 8.74 s | 0.042 |
|
||||
| per_decode_load | D0:575, D1:710, D2:0 | D0:600, D1:685, D2:0 | both 1P2D |
|
||||
| direct-to-D % | N/A (no KVC) | 87/231 = 37.7 % | – |
|
||||
|
||||
---
|
||||
|
||||
## 5. Interpreting H1 / H2 / H3
|
||||
|
||||
### H1 (was: KVC layer adds value on top of 1P3D + kv-aware) — *qualified*
|
||||
|
||||
The H1 hypothesis as stated in `ONBOARDING_NEXT_AGENT_ZH.md` predicted E2 would clearly win on most metrics. The reality is **bimodal**: the small subset of E2 requests that successfully complete are dramatically faster than E1, but a much larger fraction (82 %) of E2 requests time out entirely. Net throughput on this workload is *worse* for E2 than E1.
|
||||
|
||||
Two issues drove this:
|
||||
1. The D2 cold-start pathology already documented in §3, root cause. Both runs are de facto 1P2D, not 1P3D.
|
||||
2. KVC v2's admission gate is stricter and surfaces more "no D capacity" / "session-not-resident" failures than vanilla pd-disagg, when the workload (mean input 67 K tokens, mean output 700 tokens) saturates D0/D1's combined ~1.5 M KV pool.
|
||||
|
||||
For workloads where D0/D1 do not saturate or where the policy *does* spread session ownership across all D workers (the historical SWE-Bench setup), KVC v2 wins. The Inferact `codex_swebenchpro` subset breaks both assumptions.
|
||||
|
||||
### H2 / H3 (RDMA reduces TTFT p99) — *cannot be evaluated cleanly here*
|
||||
|
||||
The historical reference point is "KVC v2 + TCP loopback, SWE-Bench 50sess: TTFT p99 = 1.28 s". This run uses Inferact + RDMA, and TTFT p99 of the 231 successful E2 requests is **8.74 s** — much higher than the TCP baseline. But the workloads are not comparable: Inferact mean input is 67 K tokens vs SWE-Bench's much smaller average. Per-request prefill + transfer is roughly 5× longer here. A clean H2 / H3 read needs an Inferact-on-TCP run to compare against, which is out of scope for this subset's GPU budget.
|
||||
|
||||
What we *can* say: RDMA is correctly engaged (every worker log shows `installTransport, type=rdma`; admission RPC RTTs in `structural/admission-events.jsonl` are ~6 ms — consistent with one-hop RoCE).
|
||||
|
||||
---
|
||||
|
||||
## 5b. Why E2 has 80 % failures — the real chain (forensic)
|
||||
|
||||
The summary's `error_count: 1054` and `execution_mode: kvcache-centric` mask the actual cascade. Pulling the underlying `request-metrics.jsonl`, `structural/admission-events.jsonl`, and per-worker SGLang logs gives the full picture.
|
||||
|
||||
### Layer 1 — worker admission rejects (51 % of admit attempts)
|
||||
|
||||
From `structural/admission-events.jsonl`:
|
||||
```
|
||||
admit ok = 581 (modes: seed=494, direct_append=87)
|
||||
admit reject = 605 (reasons: no-space=562, session-not-resident=43)
|
||||
```
|
||||
|
||||
**562 "no-space" rejects** — D worker (almost always D0 or D1) reports its KV pool is full and refuses to take the request as direct-append. The router then re-routes the request to the seed/reseed path.
|
||||
|
||||
This is materially different from E1's behaviour: E1's vanilla pd-disagg had no admission RPC, so requests *always* got accepted by the chosen D and queued behind the running batch. E1 paid for that as a 90-second TTFT but never saw a "no-space" failure.
|
||||
|
||||
### Layer 2 — mooncake P→D transfer failures (real, observed in prefill log)
|
||||
|
||||
From `logs/prefill-0.log`:
|
||||
```
|
||||
[01:56:42] Prefill transfer failed for request rank=0 req.rid='2a5ed06fb…'
|
||||
with exception KVTransferError: Failed to send kv chunk of … to 172.18.112.37:46067
|
||||
[01:56:42] Prefill transfer failed for request rank=0 req.rid='eca5ff14…'
|
||||
with exception KVTransferError: Decode instance could be dead,
|
||||
remote mooncake session 172.18.112.37:15078 is not alive
|
||||
[01:56:42] Prefill transfer failed for request rank=0 req.rid='7ed9827b…'
|
||||
Decode instance could be dead, remote mooncake session ... is not alive
|
||||
```
|
||||
|
||||
When the seed/reseed fallback queue piles up (because of layer 1), the D worker becomes heavily backlogged and its mooncake bootstrap session heartbeat drops — P interprets this as "the D worker is dead" and fails the transfer. This is **not** a true crash; the worker process is alive (we observed it accepting unrelated requests immediately after), but the mooncake session is torn down for that bootstrap_room.
|
||||
|
||||
### Layer 3 — client-visible error
|
||||
|
||||
From `request-metrics.jsonl` for all 1054 failed reqs:
|
||||
```
|
||||
"error": "RuntimeError: generate stream ended before producing any token"
|
||||
```
|
||||
|
||||
This is what `agentic-pd-hybrid` sees when the SGLang `/generate` SSE stream closes with zero output tokens — the upstream abort from layer 1 or layer 2 propagates as an empty stream.
|
||||
|
||||
### The complete causal chain
|
||||
|
||||
```
|
||||
Inferact shared "permissions instructions" boilerplate
|
||||
↓
|
||||
overlap term in kv-aware lex score never lets D2 win → D2 cold forever
|
||||
↓
|
||||
50 sessions all pinned to D0 / D1
|
||||
↓
|
||||
D0 / D1 KV pool saturates
|
||||
↓
|
||||
worker admission emits 562 × "no-space" ← Layer 1
|
||||
↓
|
||||
router falls back to seed/reseed path (needs P→D mooncake transfer)
|
||||
↓
|
||||
P→D transfer queue piles up; D mooncake heartbeat drops
|
||||
↓
|
||||
"Decode instance could be dead" → KVTransferError ← Layer 2
|
||||
↓
|
||||
SGLang aborts the req → SSE stream closes with 0 tokens
|
||||
↓
|
||||
agentic-pd-hybrid raises "generate stream ended ..." for 1054 reqs ← Layer 3
|
||||
```
|
||||
|
||||
### Why E1 didn't hit this
|
||||
|
||||
E1 used `mechanism=pd-disaggregation`, which has no per-worker admission RPC. The router blindly dispatched to D0/D1; SGLang's internal scheduler simply queued requests behind the running batch (some grew their wait to >90 s before getting a token). Of the 85 E1 errors, sampling shows they are `request-timeout-s=300` failures — old-fashioned timeouts on the agentic-pd-hybrid side, not mooncake or admission failures.
|
||||
|
||||
So:
|
||||
- E1 trades latency for resilience: nobody rejects, everyone queues, you pay TTFT.
|
||||
- E2's KVC v2 worker admission is *meant* to be a safety valve, but on the cold-D pathology it becomes an *amplifier*: rejects → fallback paths → backlog → mooncake heartbeat loss → cascading failures.
|
||||
|
||||
### The real fix
|
||||
|
||||
Worker admission per se is not the bug — the bug is that there is no D-rebalancing happening upstream. With balanced D load (e.g. cold-D bonus in policy, or pre-warm of D2 with shared boilerplate), D0/D1 would not hit "no-space", and the layer 1 → layer 2 cascade would not fire. The reseed long-tail TTFT (8.74 s p99 here) becomes the dominant cost — exactly the regime onboarding §3.1 H3 describes.
|
||||
|
||||
---
|
||||
|
||||
## 5c. Why mooncake "died" (forensic on Q1)
|
||||
|
||||
The error string is `Decode instance could be dead, remote mooncake session ... is not alive`, which sounds like the D worker process crashed. **It did not.** Concurrent evidence shows D1 was happily serving `/session_cache/admit_direct_append HTTP/1.1 200 OK` and running LRU evictions only seconds after the "is not alive" errors fired. The real mechanism is hair-trigger.
|
||||
|
||||
### What the SGLang mooncake conn.py actually does
|
||||
|
||||
In `third_party/sglang/python/sglang/srt/disaggregation/mooncake/conn.py:1267-1276`:
|
||||
|
||||
```python
|
||||
if ret != 0: # one transfer slice failed
|
||||
with self.session_lock:
|
||||
self.session_failures[req.mooncake_session_id] += 1
|
||||
# Failures should never happen if the session is not dead,
|
||||
# if the session fails once, mark it as failed
|
||||
if self.session_failures[req.mooncake_session_id] >= 1:
|
||||
self.failed_sessions.add(req.mooncake_session_id)
|
||||
logger.error(f"Session {req.mooncake_session_id} failed.")
|
||||
...
|
||||
```
|
||||
|
||||
After this, every subsequent transfer that uses the same `mooncake_session_id` short-circuits at conn.py:1184:
|
||||
|
||||
```python
|
||||
if req.mooncake_session_id in self.failed_sessions:
|
||||
self.record_failure(kv_chunk.room,
|
||||
f"Decode instance could be dead, remote mooncake session ... is not alive")
|
||||
```
|
||||
|
||||
**One real `send_kvcache_slice ret != 0` permanently blacklists that D's mooncake session for the rest of the SGLang process lifetime.** The code's own comment ("Failures should never happen if the session is not dead") encodes the design assumption that transfers don't fail under normal conditions — but they do under the saturation regime described in §5b (RDMA queue full / D scheduler too busy to drain receives in time).
|
||||
|
||||
### Connecting back to Q1 timeline
|
||||
|
||||
Looking at decode-1.log around 01:56:42-56, the worker is running heavy decode batches (#token = 627K, near KV pool cap of 755K) plus repeatedly evicting via LRU. Under that load a single `send_kvcache_slice` returning a transient nonzero is enough to flip the switch. After 01:56:42 essentially every P→D1 transfer reports "is not alive" until end-of-run, even though D1 itself keeps serving direct-append admissions.
|
||||
|
||||
### What the hair-trigger is actually reacting to
|
||||
|
||||
Pulling the mooncake C++ logs (filter `^E0`/`^I0` lines from prefill-0.log) reveals the actual underlying error:
|
||||
|
||||
```
|
||||
I0512 01:56:42.242062 transfer_engine_py.cpp:546]
|
||||
Sync batch data transfer timeout after 37452515723ns
|
||||
I0512 01:56:53.335597 transfer_engine_py.cpp:546]
|
||||
Sync batch data transfer timeout after 30892690400ns
|
||||
```
|
||||
|
||||
**37.45 s** and **30.89 s** — the mooncake `batch_transfer_sync` C++ call returned nonzero because the synchronous transfer took longer than its internal timeout (~30 s). On a 400 Gb/s NDR RDMA fabric this is not a network problem; the data path is healthy. The SGLang author's design instinct (`>= 1 failures = dead`) is *correct in the idle case* — a 30-second RDMA stall really does indicate a broken peer.
|
||||
|
||||
What's happening here is that the peer is **logically broken from the C++ control-plane's point of view**, even though the OS process is still alive.
|
||||
|
||||
### Why does the D side stall the control plane for 30 s?
|
||||
|
||||
Cross-referencing decode-0.log at the exact second of the first timeout (01:56:42):
|
||||
|
||||
```
|
||||
01:56:34 Decode batch, #running-req=1, #token=627631, token_usage=0.83,
|
||||
gen throughput=174.76 tok/s ← still serving normally
|
||||
01:56:42 session id 1000315 does not exist, cannot delete.
|
||||
01:56:42 session id 1000360 does not exist, cannot delete.
|
||||
01:56:42 Trimmed decode session cache via LRU.
|
||||
#evicted_sessions: 2, #freed_tokens: 77675,
|
||||
#available_tokens: 38574 → 116249
|
||||
01:56:42 Trimmed decode session cache via LRU.
|
||||
#evicted_sessions: 1, #freed_tokens: 36166,
|
||||
#available_tokens: 29038 → 65204
|
||||
01:56:53 Decode transfer failed for request rank=0 ...
|
||||
Failed to get kvcache from prefill instance, it might be dead
|
||||
```
|
||||
|
||||
D0's main scheduler thread was busy doing **two consecutive LRU evictions** (freeing 77 675 + 36 166 ≈ 114 K tokens of KV) right when the P→D mooncake transfer attempt landed. Each LRU trim involves:
|
||||
- iterating per-session resident metadata
|
||||
- releasing GPU KV slots back to `token_to_kv_pool_allocator.free()`
|
||||
- updating the session-aware-cache bookkeeping under lock
|
||||
- closing per-session streaming state
|
||||
|
||||
Under `token_usage = 0.83` the LRU scan has to walk thousands of entries; the lock held during this work blocks the mooncake C++ control plane on the receive side (buffer registration / completion poll) from making progress. P's `batch_transfer_sync` keeps polling for the peer's completion ack, doesn't get one for 30 s, and gives up.
|
||||
|
||||
So the chain is:
|
||||
|
||||
```
|
||||
D KV pool saturated by D2-cold-pinning (§5d)
|
||||
↓
|
||||
D triggers heavy LRU eviction (114K tokens at a time)
|
||||
↓
|
||||
D main scheduler thread starves mooncake C++ control plane for 30+ s
|
||||
↓
|
||||
P's batch_transfer_sync returns nonzero (timeout)
|
||||
↓
|
||||
P's hair-trigger marks D's whole mooncake_session_id "failed forever"
|
||||
↓
|
||||
all subsequent reqs to that D blow up with "is not alive"
|
||||
```
|
||||
|
||||
The hair-trigger threshold (`>= 1`) is structurally wrong for this regime — but it would not fire at all if the LRU thrash didn't happen, and the LRU thrash would not happen if the load were spread across all 3 D workers (§5d).
|
||||
|
||||
### Two layers of fix
|
||||
|
||||
| Layer | What | Cost |
|
||||
|---|---|---|
|
||||
| Root cause | Spread load to D2 so D0/D1's KV never saturate, LRU never thrashes. See §5d and the cold-D bonus implementation in `policies.py` (next commit). | Low — pure policy change |
|
||||
| Defense in depth | In `mooncake/conn.py:1267-1276`, replace `>= 1` with a windowed threshold (e.g. ≥ 3 failures within 60 s) and add a periodic retry that probes the D bootstrap port before clearing `failed_sessions`. | Medium — touches vendored SGLang |
|
||||
|
||||
We do the root-cause fix first because it makes the second one optional.
|
||||
|
||||
---
|
||||
|
||||
## 5d. Why no session ever migrated to D2 (forensic on Q2)
|
||||
|
||||
KVC v2's design (KVC_ROUTER_ALGORITHM §3.3) uses `state.session_d_rejects[(session_id, D)] += 1` after a rejection, then policy.select skips any D with `rejects >= migration_reject_threshold (=3)`. The mechanism is conceptually sound. The bug is in *which* failures count as rejections.
|
||||
|
||||
### The substring filter is too narrow
|
||||
|
||||
In `replay.py:1379`:
|
||||
|
||||
```python
|
||||
_ADMISSION_REJECTION_SUBSTRINGS = (
|
||||
"session-cap",
|
||||
"no-d-capacity",
|
||||
"d-backpressure",
|
||||
)
|
||||
|
||||
def _is_admission_rejection_mode(execution_mode: str) -> bool:
|
||||
return any(token in execution_mode for token in _ADMISSION_REJECTION_SUBSTRINGS)
|
||||
```
|
||||
|
||||
Only execution_modes containing one of those three substrings increment the per-(session, D) reject counter. **All 1054 E2 failures have `execution_mode = "kvcache-centric"`** (the generic fallback bucket the replay engine uses when the request fell through every concrete sub-path before producing a successful result). That string contains none of the three substrings, so `session_d_rejects` is never incremented for them.
|
||||
|
||||
### Empirical confirmation
|
||||
|
||||
Counting from `structural/admission-events.jsonl` (worker-RPC level, independent of replay's classification):
|
||||
|
||||
| Stat | Value |
|
||||
|---|---:|
|
||||
| Distinct `(session, D)` pairs ever rejected by worker RPC | 49 |
|
||||
| Pairs rejected ≥ 3 times (would qualify for blacklist) | **46** |
|
||||
| Most-rejected single pair | (1001172, D1) = **25 rejects** |
|
||||
|
||||
So 46 of 49 (sess, D) pairs *should have been blacklisted* by KVC v2's design. They never were, because the corresponding requests' execution_mode was `"kvcache-centric"` (failure path) and not `"…-session-cap"` / `"…-no-d-capacity"` / `"…-d-backpressure"` (which only get assigned when the fallthrough path runs to a known-rejection sub-result, not when the upstream SSE stream errors out).
|
||||
|
||||
Counting "next-binding-after-reject" from the merged binding+admission timeline:
|
||||
|
||||
| Rejected on | Next binding goes to | Count |
|
||||
|---|---|---:|
|
||||
| D0 | D0 | 253 |
|
||||
| D1 | D1 | 329 |
|
||||
| D0 | D2 | **0** |
|
||||
| D1 | D2 | **0** |
|
||||
|
||||
The router stubbornly re-binds the same session to the same D after every reject — exactly because the reject was never recorded in `session_d_rejects`, so policy.select still sees an empty rejection counter and the overlap term keeps tipping it back to D0/D1.
|
||||
|
||||
### The fix
|
||||
|
||||
Two paths, in increasing scope:
|
||||
|
||||
1. **Quick**: include `"kvcache-centric"` (the failure-fallback bucket) in `_ADMISSION_REJECTION_SUBSTRINGS`, OR have replay set `execution_mode` to a more specific failure label when an SSE stream closes with zero tokens (e.g. `"upstream-aborted"`) and add that to the substring set.
|
||||
2. **Better**: don't rely on string-matching at all. Have `_run_request` catch the actual rejection signal (admission RPC `can_admit=False` or upstream `RuntimeError: generate stream ended ...`) and call `state.record_admission_reject(...)` directly at that point. The substring filter was inherited from the v1 → v2 migration design (`MIGRATION_V1_FINDINGS_ZH §4.1`) when only specific fallback paths set those names.
|
||||
|
||||
Either fix would let the existing `migration_reject_threshold=3` blacklist D0/D1 after enough failures, force a re-route to D2, populate D2's resident hashes, and break the overlap-pinning death spiral.
|
||||
|
||||
---
|
||||
|
||||
## 6. What this experiment actually shows
|
||||
|
||||
1. **The H200 + driver 570 + cu12.8 toolchain works for production-scale SGLang xPyD workloads.** Both runs completed without CUDA / driver / mooncake errors; failures are policy- and workload-level, not infrastructure.
|
||||
2. **The KVC v2 + kv-aware policy combination has a latent pathology on workloads with high cross-session prefix overlap**: the overlap term in the lex score causes permanent load imbalance, and v2's reject-counter migration cannot rescue it because rejects only fire under capacity pressure, by which point timeouts already dominate. This is novel and not surfaced by the SWE-Bench evaluation in the existing project docs.
|
||||
3. **For Inferact-like workloads, a cold-D bonus (e.g. require D to host at least one session before its overlap score counts) or an explicit pre-warm step is required** before E1/E2 comparisons can isolate the marginal effect of the KVC layer.
|
||||
|
||||
---
|
||||
|
||||
## 7. Reproducibility
|
||||
|
||||
- Trace: `outputs/inferact_50sess.jsonl`, md5 `7bb263a32600ef5a6ef5099ba340a487`, regenerable via `scripts/sample_trace_subset.py`.
|
||||
- E1: `bash scripts/sweep_e1_naive_1p3d.sh` (1h 29 min wall)
|
||||
- E2: `bash scripts/sweep_e2_kvc_v2_rdma.sh` (1h 33 min wall)
|
||||
- Summary JSON paths:
|
||||
- `outputs/e1_naive_1p3d_kvaware_rdma_50sess/e1_naive_1p3d_kvaware_run1_summary.json`
|
||||
- `outputs/e2_kvc_v2_rdma_50sess/e2_kvc_v2_rdma_run1_summary.json`
|
||||
- Per-request metrics JSONL alongside each summary, plus structural events under `*/structural/`.
|
||||
|
||||
---
|
||||
|
||||
## 8. Open follow-ups for the next agent
|
||||
|
||||
1. **Add a cold-D bonus** to `KvAwarePolicy.select` (e.g. positive constant for D with `state.resident[D] == ∅`) and re-run E2 on the same subset. Predict: D2 receives bindings, failure rate drops, head-to-head with E1 becomes meaningful.
|
||||
2. **Rerun E2 with `--kvcache-admission-mode router`** (router-side optimistic admission instead of worker RPC) to isolate whether the strict worker admission is the contributor to the 1054 failures, or whether it's purely the imbalance.
|
||||
3. **Run a third arm E0 with `policy=default` + `mechanism=pd-disaggregation`** as a true control — kv-aware policy is itself part of what we are evaluating; default round-robin would have spread sessions across all 3 D.
|
||||
4. **Compare TTFT p99 against an Inferact-on-TCP-loopback run** to evaluate H2/H3 cleanly. Cost: 1 more E2-shaped sweep (~1.5 h).
|
||||
5. **Investigate the 1054 E2 failures** in `request-metrics.jsonl` — sample some to verify they are timeout-related vs admission-rejected vs upstream-500.
|
||||
|
||||
---
|
||||
|
||||
## 4. Comparison table — pending
|
||||
|
||||
To be appended.
|
||||
|
||||
---
|
||||
|
||||
## 5. Open questions for the next iteration
|
||||
|
||||
- Are the 85 E1 errors all timeouts? `request-metrics.jsonl` rows with `error` execution_mode should be sampled to confirm. (Quick check: grep the metrics jsonl for `"execution_mode": "pd-disaggregation"` and inspect `latency_s` / `error` fields.)
|
||||
- Does E2 produce the predicted ~91% direct-to-D rate seen in the historical SWE-Bench v2 run, or does the Inferact workload's larger session count (50 vs 52 there) but very different per-session size distribution (mean 33 turns × ~2KB context growth per turn) push it lower?
|
||||
- Is `D2 = 0%` an E1-specific artifact (kv-aware sticky in pd-disagg mode), or does the same happen in E2 before migration kicks in for the first time?
|
||||
129
docs/E3_FINDINGS_ZH.md
Normal file
129
docs/E3_FINDINGS_ZH.md
Normal file
@@ -0,0 +1,129 @@
|
||||
# E3 — first run findings + bug exposure
|
||||
|
||||
**Status**: E3 first attempt aborted at ~16 min wall by SGLang assertion crash on decode-1. Partial data confirms the load-floor bonus works as designed; the crash is an independent vendored-SGLang bug exposed by E3's new routing pattern.
|
||||
|
||||
**Branch**: `h200-cu130`.
|
||||
**Companion**: `docs/E1_E2_RESULTS_ZH.md`, `docs/E1_E2_FIX_DESIGN_ZH.md`.
|
||||
|
||||
---
|
||||
|
||||
## 1. What worked: load-floor bonus (K=200)
|
||||
|
||||
Within the first ~15 minutes of E3, before the crash:
|
||||
|
||||
| | E1 (run1) | E2 (run1) | E3 (run1, partial) |
|
||||
|---|---:|---:|---:|
|
||||
| total bindings | 1285 | 1186 admit attempts | 1001 |
|
||||
| decode-0 bindings | 575 | 600 | 240 (24.0%) |
|
||||
| decode-1 bindings | 710 | 685 | 536 (53.5%) |
|
||||
| **decode-2 bindings** | **0** | **0** | **225 (22.5%)** |
|
||||
| unique sessions on D2 | 0 | 0 | **30** |
|
||||
|
||||
**Load-floor bonus successfully broke the overlap-pinning death spiral.** D2 is finally getting traffic on Inferact's shared-boilerplate workload. The graduated formula (`K * deficit / mean`) plus the `not sticky` gate produces the intended behavior: fresh sessions land on under-loaded D's, established sessions keep going to their original D for cache locality.
|
||||
|
||||
This validates the Q2.B design from `docs/E1_E2_FIX_DESIGN_ZH.md` empirically — but only as far as the run got. End-to-end metrics (lat / TTFT / failure rate) are not interpretable yet because the worker died.
|
||||
|
||||
## 2. The new crash: SGLang streaming-session correction leaves an invariant violated
|
||||
|
||||
At `01:51:21` (~5 min into the benchmark), decode-1 hit:
|
||||
|
||||
```
|
||||
[01:51:21] Correcting streaming-session extend_input_len from 6648 to 0
|
||||
(rid=6f4318e93dd543a49dbf19248cfc1e6f, session_id=1000195,
|
||||
fill_len=6648, prefix_len=43459, kv_committed_len=43459)
|
||||
[01:51:21] Scheduler hit an exception: AssertionError
|
||||
at third_party/sglang/python/sglang/srt/managers/schedule_batch.py:1646
|
||||
→ assert seq_len - pre_len == req.extend_input_len
|
||||
```
|
||||
|
||||
### Mechanism
|
||||
|
||||
With `--enable-streaming-session`, SGLang's session_aware_cache hands the scheduler a request whose `fill_ids` is just the new tokens since the last turn (6648), while `prefix_indices` represents the already-cached prefix on this D (43459 blocks). When the prefix exceeds `fill_ids` (e.g., the new turn's input is short relative to the conversation history that's already in cache), this code path fires at `schedule_batch.py:1572-1585`:
|
||||
|
||||
```python
|
||||
actual_extend_len = max(0, len(req.fill_ids) - len(req.prefix_indices))
|
||||
if req.extend_input_len != actual_extend_len:
|
||||
logger.warning("Correcting streaming-session extend_input_len from %d to %d ...")
|
||||
req.set_extend_input_len(actual_extend_len)
|
||||
```
|
||||
|
||||
So `req.extend_input_len` becomes `max(0, 6648 - 43459) = 0`.
|
||||
|
||||
Then at line 1588-1590:
|
||||
|
||||
```python
|
||||
seq_lens = [len(r.fill_ids) for r in reqs] # 6648
|
||||
prefix_lens = [len(r.prefix_indices) for r in reqs] # 43459
|
||||
```
|
||||
|
||||
And at line 1646:
|
||||
|
||||
```python
|
||||
assert seq_len - pre_len == req.extend_input_len # 6648 - 43459 == 0 → FAIL
|
||||
```
|
||||
|
||||
The correction patches `extend_input_len` but the downstream invariant is computed from raw `fill_ids`/`prefix_indices` lengths, which the correction never touched. The arithmetic check is fundamentally incompatible with the corrected state.
|
||||
|
||||
### Provenance
|
||||
|
||||
The streaming-session correction (`schedule_batch.py:1572-1585`) and the assertion site (line 1646) are both inside the project's SGLang vendor patches — `git log` on this file shows the patch came from commit `b8e6f13 feat(sglang): support decode session cache admission`. So this is a regression in the project's own SGLang fork, not upstream SGLang.
|
||||
|
||||
### Why E3 triggers it and E2 didn't
|
||||
|
||||
The crash is independent of migration (session 1000195 stayed on decode-1 the entire time). Two factors combined to expose it in E3:
|
||||
|
||||
1. **D1 was under more sustained load in E3** — 536 bindings on 17 unique sessions means high re-binding density per session, which means more concurrent turns of the same session at the scheduler, increasing the rate at which streaming-session corrections fire.
|
||||
2. **Faster overall dispatch** — with D2 actually consuming work, the prefill→decode pipeline moves faster, so streaming-session entries reach the corrected state more often than in E2's saturated cap-out regime.
|
||||
|
||||
Both factors are effects of the load-floor fix, not its cause. The crash is a pre-existing landmine in the vendored streaming-session code that E1 and E2 happened to avoid because their pipelines stalled before sessions accumulated enough committed prefix to trigger the correction.
|
||||
|
||||
---
|
||||
|
||||
## 3. Decision space for the fix
|
||||
|
||||
| # | Fix | Layer | Where | Risk |
|
||||
|---|---|---|---|---|
|
||||
| **A** | Patch the assertion to match the corrected state | vendored SGLang `schedule_batch.py:1646` | Add: `if req.extend_input_len == 0 and len(req.fill_ids) < len(req.prefix_indices): continue` to skip degenerate reqs before iterating. | Local, scoped, doesn't touch correctness elsewhere. Need to handle the skipped reqs (set `was_skipped` flag, drop from batch). |
|
||||
| **B** | Fix the correction site to also drop the req from the batch | vendored SGLang `schedule_batch.py:1572-1585` | When `actual_extend_len == 0` and req has nothing to extend, signal upstream to remove the req from this batch (defer or drop). | Slightly more invasive. The upstream call path needs to handle a "filtered" return. |
|
||||
| **C** | Compute `seq_lens` and `prefix_lens` consistently with the correction | vendored SGLang `schedule_batch.py:1588-1590` | After correction, recompute `seq_lens = [len(r.fill_ids[:pre_len] + extension)]` or align both sides. | Risky; affects all downstream tensor sizing. |
|
||||
| **D** | Workaround: disable session migration in E3 (the trigger combination) | our `cli` flag `--kvcache-migration-reject-threshold 0` | One-line config change in `sweep_e3_*.sh`. | Doesn't actually fix the crash — session 1000195 didn't migrate. May reduce but not eliminate. Might still hit it on a different session. |
|
||||
| **E** | Workaround: disable streaming session | server flag, remove `--enable-streaming-session` | Sidesteps the entire correction path. | Loses KVC's direct-to-D fast path (the central perf win we measure). Defeats the experiment. |
|
||||
|
||||
### Recommendation
|
||||
|
||||
**Fix A** — patch `schedule_batch.py:1646` to skip the malformed req before asserting. It's the minimal-blast-radius change and matches the apparent intent of the correction (graceful handling of the degenerate state).
|
||||
|
||||
Concretely:
|
||||
|
||||
```python
|
||||
# Just before the assertion at line ~1646
|
||||
if req.extend_input_len == 0:
|
||||
# The streaming-session correction zeroed extend_input_len because
|
||||
# prefix_indices already covers fill_ids. Skip this req from the
|
||||
# extend batch — its KV is already committed; nothing to compute.
|
||||
skip_indices.append(i)
|
||||
continue
|
||||
```
|
||||
|
||||
Then the caller of `prepare_for_extend` needs to handle skipped requests (return them to the decode queue without an extend pass).
|
||||
|
||||
**Avoid Fix D/E** — D doesn't address the root cause (the failing session didn't migrate), and E loses the entire reason we're running this experiment.
|
||||
|
||||
---
|
||||
|
||||
## 4. Decision points for review
|
||||
|
||||
| # | Question | Default if no answer |
|
||||
|---|---|---|
|
||||
| D1 | Implement Fix A (vendor patch to skip zero-extend-len reqs)? | **Yes** |
|
||||
| D2 | Re-run E3 with same K=200, same subset, after the fix? | Yes |
|
||||
| D3 | Add a structural log entry every time the correction fires so we can track its frequency? | Recommended |
|
||||
| D4 | File this as a separate `feat(sglang)` commit on the branch so the patch and the failure case it fixes are traceable? | Yes |
|
||||
|
||||
---
|
||||
|
||||
## 5. What this tells us about KVC v2 maturity
|
||||
|
||||
The load-floor bonus's first real exposure to the production codepath uncovered an existing patch bug that was masked by E2's failure cascade. This is good news: the failure cascade in E2 was hiding *another* layer of breakage. Without rebalancing, sessions cap-out → cascade → never run long enough to commit deep prefixes → never hit the streaming-session correction → never crash. With rebalancing, sessions DO commit deep prefixes → trigger the correction → crash.
|
||||
|
||||
Each fix tends to expose the next-shallowest bug. This is expected for a stack of ~6 interacting subsystems (kv-aware policy, KVC admission, session_aware_cache, streaming session, mooncake transfer, prefill batch prep). The path forward is to keep patching, re-running, and pushing the failure boundary out.
|
||||
185
docs/EVALUATION_PROTOCOL_ZH.md
Normal file
185
docs/EVALUATION_PROTOCOL_ZH.md
Normal file
@@ -0,0 +1,185 @@
|
||||
# 评测协议(Paper-quality)
|
||||
|
||||
**日期**:2026-05-12
|
||||
**性质**:评测协议规范,覆盖 [AUDIT_AND_ROADMAP_ZH.md](AUDIT_AND_ROADMAP_ZH.md) §3.1 M1–M6 全部薄弱点
|
||||
**对象**:跑实验的合作者;写 paper 的人;artifact reviewer
|
||||
|
||||
---
|
||||
|
||||
## 0. 总原则
|
||||
|
||||
> 论文里每一个数字都必须能回答两个问题:
|
||||
> 1. **抽样误差有多大?**(bootstrap CI、N、std)
|
||||
> 2. **公平吗?**(同 trial、同 trace、同 token cap、同 timeout、paired)
|
||||
|
||||
当前 sweep 报告(`KVCACHE_CENTRIC_PROGRESS_ZH.md` / `V2_RESULTS_ZH.md`)都不满足上述任一条。本文给出合规模板。
|
||||
|
||||
---
|
||||
|
||||
## 1. 评测维度(M1–M6 一对一解决)
|
||||
|
||||
### 1.1 M1 — 统计显著性
|
||||
|
||||
| 决策 | 规则 |
|
||||
|---|---|
|
||||
| `N` 每个 config 最小 run 数 | **3**(headline 数字)/ **5**(ablation 终值) |
|
||||
| 报告统计量 | `mean ± std`,**附 2.5/97.5 bootstrap CI** |
|
||||
| 多 run 聚合 | 把每 run 的 per-request latency append 后整体做 bootstrap;不要先 per-run 求 mean 再 average mean |
|
||||
| 差异显著性 | paired bootstrap p-value(≥ 5000 samples) |
|
||||
| `N=1` 仅允许 | smoke / sanity check,**不进 headline 表** |
|
||||
|
||||
### 1.2 M2 — 公平 paired 比较
|
||||
|
||||
| 决策 | 规则 |
|
||||
|---|---|
|
||||
| trace fixity | 用同一个 `samples-*.jsonl` 文件;replay 用 `--use-trace-as-sample` 锁定 |
|
||||
| timeout | 所有 mechanism 同 `--request-timeout-s`;不允许某一组用 600s 而另一组 300s |
|
||||
| token cap | 同 `--max-input-len`(取所有 baseline 的最小值并显式 truncate) |
|
||||
| 错误 / abort | **不**只算成功请求;abort 与 timeout 各自单列 `error_count`,按全集(含错误)报指标,或 paired-on-same-trial-mask |
|
||||
| 时间窗 | `time_scale` 一致;不允许同 sweep 内换 |
|
||||
| Worker 数 / GPU 类型 | 一致;topology 差异必须标注 |
|
||||
|
||||
**反例**:当前 `E1 vs E2` 表([E1_E2_RESULTS_ZH.md](E1_E2_RESULTS_ZH.md) §4)显式声明 "not a fair head-to-head"——E2 80% 失败,successful-only 算 latency 与 E1 全集对比。**这种表不能直接进 paper**。
|
||||
|
||||
### 1.3 M3 — Trace 分层
|
||||
|
||||
| 维度 | 分桶建议 |
|
||||
|---|---|
|
||||
| `turn_id` | `{1, 2-5, 6-20, 21+}` |
|
||||
| `append_len` | `{≤128, 128-1K, 1K-8K, >8K}` |
|
||||
| `overlap_ratio` | `{≤0.3, 0.3-0.7, >0.7}` |
|
||||
| `inter_turn_gap_s` | `{≤5, 5-30, 30-300, >300}` |
|
||||
| `input_len` | `{≤8K, 8K-64K, >64K}` |
|
||||
|
||||
**报告要求**:headline 数字之外,至少给一张"按 turn_id × append_len"的 heatmap,让 reviewer 看到收益来自哪个 slice。
|
||||
|
||||
**反例**:当前 Real Ali 实验仅在 KVC-fit slice(high overlap + small append + 100% direct-eligible)上报 -46% p50。这是上限,不是平均。必须同时给出 full Ali 上的 paired 表。
|
||||
|
||||
### 1.4 M4 — Baseline 矩阵
|
||||
|
||||
至少以下 baseline 中跑 **2 个**:
|
||||
|
||||
| Baseline | 类别 | 库 |
|
||||
|---|---|---|
|
||||
| vLLM + automatic prefix caching | 同 model 单 worker prefix cache | vLLM main |
|
||||
| SGLang DP cache-aware(4×TP1) | 当前主要 baseline | 本仓 vendored SGLang |
|
||||
| SGLang PD-disaggregation(kv-aware) | naive 但 cache-aware 拓扑 | 本仓 |
|
||||
| DistServe | P/D 分离 baseline | DistServe upstream |
|
||||
| SplitWise | P/D split + adaptive routing | open-source impl |
|
||||
| Mooncake-Master scheduler | 同代设计 | mooncake-master |
|
||||
|
||||
**额外推荐**:跑一个 "oracle" baseline——assume `Σ.resident[d]` 完美已知 + admission 永不失败,作为 KVC 的上限对照。
|
||||
|
||||
### 1.5 M5 — Trace 组合
|
||||
|
||||
| Trace | 用途 |
|
||||
|---|---|
|
||||
| Ali coding agent (full) | 主结果;含 single-turn dilution |
|
||||
| Ali KVC-fit slice | KVC 上限演示 |
|
||||
| SWE-Bench 50 sess | 已有;多轮高 overlap workload |
|
||||
| ShareGPT | 对比 chat workload(短 turn,低 overlap)。**用来证明 KVC 不会在不合适 workload 上劣化** |
|
||||
| Inferact | tool-use heavy 的 agent workload |
|
||||
| Mooncake trace | 单 turn LLM serving 的 baseline trace |
|
||||
| Synthetic adversarial | 自构:burst 100 个新 session 同时 seed,验证 mooncake death 与 reset-on-success 的 robustness |
|
||||
|
||||
**最低组合**:Ali full + SWE-Bench + ShareGPT + Synthetic adversarial。
|
||||
|
||||
### 1.6 M6 — 硬件覆盖
|
||||
|
||||
| Tier | 用途 |
|
||||
|---|---|
|
||||
| 单节点 ≤ 8 GPU | 当前所有结果 |
|
||||
| 双节点 NVLink + IB | 验证跨节点 D→P sync 与 mooncake 行为 |
|
||||
| 4 节点 cluster(≥ 16 GPU) | scaling 数字、cluster scheduler 假设 |
|
||||
| 异构(H100 + L40S) | topology-aware routing |
|
||||
|
||||
**最低组合**:单节点 4×H200 + 双节点 NVLink + IB。剩下两个 tier 可放 future work。
|
||||
|
||||
---
|
||||
|
||||
## 2. 报告模板
|
||||
|
||||
### 2.1 主结果表(Table 1)
|
||||
|
||||
```
|
||||
| Config | N | mean ± std | p50 [CI] | p90 [CI] | p99 [CI] | err% | timeout% |
|
||||
|--------|---|------------|----------|----------|----------|------|----------|
|
||||
```
|
||||
|
||||
加注:trace name、time_scale、`max_input_len`、`request_timeout_s`、所有共用参数。
|
||||
|
||||
### 2.2 Paired delta 表
|
||||
|
||||
```
|
||||
| Pair | N pairs | mean delta [CI] | p50 delta [CI] | wins / losses | p-value |
|
||||
```
|
||||
|
||||
`N pairs` = 两边都 successful 的 trial 数。`wins` = `latency_kvc < latency_baseline` 的 trial 数。
|
||||
|
||||
### 2.3 分层表(Table 2)
|
||||
|
||||
每个分层维度(§1.3)独立一张。
|
||||
|
||||
### 2.4 Negative-result 章节(强制)
|
||||
|
||||
paper 必须有专章列出:
|
||||
|
||||
- KVC 在 ShareGPT 上比 baseline 慢的具体数字。
|
||||
- KVC 在 trace 哪些 percentile / slice 不胜。
|
||||
- 失败的 sweep(mooncake death、E3 crash)的诊断链路。
|
||||
|
||||
→ 论文 reviewer 看见诚实的 negative result 会显著提高印象分。当前的 [V2_DEEP_ANALYSIS_ZH.md](V2_DEEP_ANALYSIS_ZH.md) §4 雏形可以扩成这一章。
|
||||
|
||||
---
|
||||
|
||||
## 3. 工具支持(本仓需要的脚本)
|
||||
|
||||
| 脚本 | 状态 | 说明 |
|
||||
|---|---|---|
|
||||
| `scripts/analysis/recompute_summary.py` | ✅ 已有 | 修复 abort 污染的 latency;本协议主要数据入口 |
|
||||
| `scripts/analysis/stratified.py` | ⏳ 本分支新增 | 按 §1.3 维度切桶 + 输出表 |
|
||||
| `scripts/analysis/paired_compare.py` | ⏳ 本分支新增 | paired bootstrap,输出 §2.2 表 |
|
||||
| `scripts/analysis/plot_*` | ✅ 已有 | TTFT PDF、GPU 利用率、cache efficiency |
|
||||
|
||||
→ 本分支的 stratified + paired 脚本 land 后,跑实验的合作者可以一条命令出表。
|
||||
|
||||
---
|
||||
|
||||
## 4. Artifact 要求(SOSP/OSDI AE)
|
||||
|
||||
| 项目 | 标准 |
|
||||
|---|---|
|
||||
| Dockerfile | 单一 `Dockerfile.artifact`,4×A100/H100 即可启 |
|
||||
| 一键脚本 | `bash artifact/reproduce_main_table.sh`,1 小时内出 Table 1 |
|
||||
| 数据集 | 提供 `outputs/sample-*.jsonl` 子集(可 ~5GB 内);full Ali 走 instruction |
|
||||
| 复现度 | bootstrap CI 与原文重叠即算复现,不要求 bit-exact |
|
||||
| 文档 | `artifact/README.md`,列出每张表 / 图对应的命令 |
|
||||
|
||||
→ 本路线图 §M1 修复后再准备 artifact。
|
||||
|
||||
---
|
||||
|
||||
## 5. 自检清单(提 paper draft 前用)
|
||||
|
||||
- [ ] 每张表 N ≥ 3,含 mean±std 与 95% CI。
|
||||
- [ ] 没有 "successful only" 字样;所有错误已列入 `err%`。
|
||||
- [ ] 所有 baseline 用同 `max_input_len` / 同 `request_timeout_s` / 同 `time_scale`。
|
||||
- [ ] 至少 3 个 trace + 1 个 synthetic adversarial。
|
||||
- [ ] 至少 1 个 non-SGLang baseline。
|
||||
- [ ] 有 negative-result 章节。
|
||||
- [ ] 有 KVC 在 single-turn workload 上的 dilution 数据。
|
||||
- [ ] 形式化部分:Algorithm 1/2/3 + Theorem 1/2,以及 D→P sync 完成后的 Theorem 4。
|
||||
- [ ] 失败模式 forensic:mooncake death、E3 crash、cold-D 都进 §Limitations 或 §Discussion。
|
||||
|
||||
---
|
||||
|
||||
## 6. 路线图衔接
|
||||
|
||||
- [ ] Phase A — 实现本分支 `scripts/analysis/stratified.py` + `scripts/analysis/paired_compare.py`(无 GPU 可做)。
|
||||
- [ ] Phase B — 把现有 `kvc-real-ali-iter-v1` 的 600-req/15min 数据用新工具重出一份分层表 / paired 表,存入 `outputs/`(GPU 不需重跑)。
|
||||
- [ ] Phase C — 跑 ShareGPT + Synthetic adversarial baseline(GPU 需 ~12h)。
|
||||
- [ ] Phase D — 选 1 个非 SGLang baseline(推荐 vLLM + prefix caching)补齐 M4(GPU 需 ~24h)。
|
||||
|
||||
---
|
||||
|
||||
**核心句**:当前结果"看起来已经赢",但按本协议重报后,赢的 magnitude 会缩小、赢的 slice 会窄化、负面 slice 会暴露。这是论文必须经历的过程;越早做越省事。
|
||||
222
docs/FAILURE_MODES_ZH.md
Normal file
222
docs/FAILURE_MODES_ZH.md
Normal file
@@ -0,0 +1,222 @@
|
||||
# Failure-mode Taxonomy
|
||||
|
||||
**日期**:2026-05-13
|
||||
**性质**:集中清单 + 诊断手册
|
||||
**对象**:跑实验时遇到失败要立刻 lookup 的合作者;写 paper §Limitations 时需引用的人;reviewer 想问"你为什么觉得这次会更稳"时的答案
|
||||
|
||||
本文把当前系统已识别的失败模式按"症状 → 根因 → 触发条件 → 当前缓解 → 真正的修复"梳成一张表。所有条目都有 forensic 链接到原始实验 doc。
|
||||
|
||||
---
|
||||
|
||||
## 0. TL;DR
|
||||
|
||||
5 类已识别失败模式,按"是否阻碍 paper claim"分组:
|
||||
|
||||
| 类别 | 名称 | 阻碍 paper | 真正修复 |
|
||||
|---|---|:---:|---|
|
||||
| **A. 控制层级联** | Mooncake "instance not alive" cascade | ✅ | admission backoff + per-D pending-seed budget |
|
||||
| **B. 路由偏置** | Cold-D / overlap-pinning | ✅ | first-principles overlap term redefinition |
|
||||
| **C. KV 抖动** | Evict storm(session-level evict) | ✅ | [BLOCK_LEVEL_EVICTION_DESIGN_ZH.md](BLOCK_LEVEL_EVICTION_DESIGN_ZH.md) |
|
||||
| **C'. KV 抖动** | Reseed storm(turn 1 大 seed 并发) | ✅ | per-D pending-seed budget + (C 缓解后频率自降) |
|
||||
| **D. Vendor 不变量** | streaming-session correction invariant crash (E3) | ❌(hotfix 已 land) | 删除 correction 路径(block-level evict 完成后) |
|
||||
|
||||
A / B / C 三类是 Milestone 1 必须解决的;C' 是 A 的次因;D 已临时止血但根本修复绑在 C 上。
|
||||
|
||||
---
|
||||
|
||||
## 1. A — Mooncake "instance not alive" cascade
|
||||
|
||||
### 1.1 症状
|
||||
|
||||
- 客户端看:`RuntimeError: generate stream ended before producing any token`
|
||||
- D scheduler 日志:`[mooncake] Decode instance could be dead, dropping ...`
|
||||
- 整批请求被 abort,单一 sweep 在数分钟内从健康降到 80% failure([E1_E2_RESULTS_ZH.md](E1_E2_RESULTS_ZH.md) E2:1054 / 1285 失败)
|
||||
|
||||
### 1.2 根因(forensic 链路)
|
||||
|
||||
```
|
||||
admission no-space (D KV pool 满)
|
||||
→ router 立刻 fallback 走 seed/reseed 路径
|
||||
→ 多个并发 seed 同时打 mooncake P→D
|
||||
→ P→D 出口排队,handshake 阶段超时
|
||||
→ mooncake 把对端标记 dead
|
||||
→ SGLang 把 dead 链路上的 in-flight req 全部 abort
|
||||
→ 客户端看到批量 generate-stream 中断
|
||||
```
|
||||
|
||||
### 1.3 触发条件
|
||||
|
||||
- D KV pool 接近满(≥ ρ·K_d,默认 0.95)
|
||||
- router fallback chain 把多个 reseed 在毫秒级窗口内发起
|
||||
- mooncake heartbeat 超时(默认窗口短)
|
||||
|
||||
### 1.4 当前缓解
|
||||
|
||||
- `--kvcache-seed-min-turn-id=2` 跳过 turn 1 大 seed,减少首爆(main 分支 stable 配置)
|
||||
- `--mc-transfer-timeout=1800s` 默认值(commit 905d671)减少假性 dead
|
||||
- `--request-timeout-s=180/300` 让客户端不至于看见整 hour 卡死,但不阻止 cascade 自身
|
||||
|
||||
→ 这些都是治标,不是治本。E2 在 4×H200 NDR 真硬件下仍 80% 失败 ([E1_E2_RESULTS_ZH.md](E1_E2_RESULTS_ZH.md))。
|
||||
|
||||
### 1.5 真正的修复(路线图 §S3)
|
||||
|
||||
1. **admission RPC backoff + jitter**:拒绝时不立刻 fallback,给 D scheduler 喘息机会。
|
||||
2. **per-D pending-seed budget**:同时刻最多 K 个 seed 在 transfer 队列里,超出排队而不爆裂。
|
||||
3. **mooncake heartbeat 与 admission 解耦**:admission 路径不再 imply "对端 alive"。
|
||||
4. **Backpressure pause hint 闭环**([SGLANG_PATCH_INVENTORY_ZH.md](SGLANG_PATCH_INVENTORY_ZH.md) §2.3 当前 EXPERIMENTAL)。
|
||||
|
||||
---
|
||||
|
||||
## 2. B — Cold-D / overlap-pinning
|
||||
|
||||
### 2.1 症状
|
||||
|
||||
- N=k decode workers,但只有 ~k-1 真正承载流量;某些 D 0 binding
|
||||
- Per-D load 直方图严重偏斜(E2:D0:600 / D1:685 / **D2:0**)
|
||||
- 整体 throughput 受最忙 D 限制;裸 latency 不一定差,但容量利用率差 33%+
|
||||
|
||||
### 2.2 根因
|
||||
|
||||
Inferact / Ali coding agent trace 在每个 session 开头有 ~12K 的"system prompt + tool schema",这些 24-token 块在所有 session 之间共享 hash。kv-aware policy 的 `overlap` term 把它们当成"该 D 已经常驻这些 hash" → 任何新 session 都被 score 推向 D0/D1(最先 warm 的两个)→ D2 永远 0 overlap → 永远不被选 → 永远 cold。
|
||||
|
||||
### 2.3 触发条件
|
||||
|
||||
- 多 session workload + 共享 boilerplate prefix
|
||||
- `migration_reject_threshold > 0` 且 reject 从未触发(因为 D0/D1 还没满)
|
||||
|
||||
### 2.4 当前缓解
|
||||
|
||||
`KvAwarePolicy.load_floor_bonus`(commit 93fce42):
|
||||
|
||||
```
|
||||
floor_bonus = K * max(0, mean - assigned) / max(1, mean)
|
||||
```
|
||||
|
||||
E3 实测 D2 binding 从 0 升到 22.5%([E3_FINDINGS_ZH.md](E3_FINDINGS_ZH.md) §1)。
|
||||
|
||||
→ 这是 patch,不是修复。`K` 是 magic number;boilerplate 的 hash 数量大于 `K / sticky_bonus` 时仍 cold。
|
||||
|
||||
### 2.5 真正的修复(路线图 §S5)
|
||||
|
||||
把 `overlap` 重新定义为 **"该 session 在该 D 上独占 prefix 的 hash 数"**:
|
||||
|
||||
```
|
||||
exclusive_overlap(s, d) := |prefix_hashes(s) ∩ resident[d] ∩ session_owned[s]|
|
||||
```
|
||||
|
||||
其中 `session_owned[s]` 排除其它 session 也持有的 hash。Boilerplate 共享 hash 不进 `exclusive_overlap`,score 自然分散。需要 D 端在 `admit_direct_append` 响应里返回 per-session resident hash 集合的 sketch(Bloom filter / minhash)。
|
||||
|
||||
---
|
||||
|
||||
## 3. C — Evict storm(session-level eviction)
|
||||
|
||||
### 3.1 症状
|
||||
|
||||
- 在 D 内存有压力的 workload 下,每 1–2 分钟出现 30–90K tokens 的 KV pool 释放峰
|
||||
- 紧随其后的同 session 请求触发 `Reseed`:P 重 prefill 50K + mooncake transfer 50K(3–7s)
|
||||
- TTFT 长尾完全由这类 reseed 主导([V2_DEEP_ANALYSIS_ZH.md](V2_DEEP_ANALYSIS_ZH.md) §3.2)
|
||||
|
||||
### 3.2 根因
|
||||
|
||||
`SessionAwareCache.release_session` 一次性 `free([cache_protected_len, kv_allocated_len))`——即整段 session-exclusive 尾部。E3 实测:90 次 evict、平均一次 free 67,726 tokens、25/50 session 受影响([KVC_EVICTION_GRANULARITY_DESIGN_ZH.md](KVC_EVICTION_GRANULARITY_DESIGN_ZH.md) §0)。
|
||||
|
||||
→ 与 SGLang 标准 radix 的 leaf-by-leaf 渐进 evict 形成鲜明对比。这部分 KV 从未进 radix,所以享受不到 LRU 的细粒度蚕食。
|
||||
|
||||
### 3.3 触发条件
|
||||
|
||||
- D KV pool 接近满
|
||||
- `maybe_trim_decode_session_cache` 被 scheduler 触发(在 `DecodePreallocQueue` 检测到 `available_size() <= 0` 时)
|
||||
|
||||
### 3.4 当前缓解
|
||||
|
||||
- `--kvcache-session-soft-cap=N`(main 分支):限制 D 上常驻 session 数 → 提前 trim,避免顶到爆
|
||||
- `--kvcache-direct-max-uncached-tokens=8192`(v2):降低 direct path 吃 KV 的速度
|
||||
|
||||
→ 都是放慢节奏,没有解决"单次 free 太大"的根本问题。
|
||||
|
||||
### 3.5 真正的修复(路线图 §S1)
|
||||
|
||||
[BLOCK_LEVEL_EVICTION_DESIGN_ZH.md](BLOCK_LEVEL_EVICTION_DESIGN_ZH.md):让 streaming-session decode 输出每 turn finish 时 `inner.cache_finished_req` 进 radix → `release_session` 退化为 `dec_lock_ref` + 删 slot → radix LRU 按 24-token leaf 蚕食。
|
||||
|
||||
预期:单次 evict 从 67K 降到 ≤ 500 tokens;reseed 频次降一个数量级。
|
||||
|
||||
---
|
||||
|
||||
## 4. C' — Reseed storm(turn 1 大 seed 并发)
|
||||
|
||||
### 4.1 症状
|
||||
|
||||
- workload 起步阶段(前 30–60s)所有 session 同时打 turn 1
|
||||
- 多个并发 `Seed`(每个 ~50–90K tokens)打 mooncake → 与 §1 cascade 重合
|
||||
|
||||
### 4.2 根因
|
||||
|
||||
`KvAwarePolicy` 启动阶段 `resident[d]` 全空,所有 D score 相同,但 ε 重试 + per-trial admit 不阻止并发。
|
||||
|
||||
### 4.3 触发条件
|
||||
|
||||
- trace `time_scale=1` 重放下,session 在原始到达密度内同时启动
|
||||
- 没有 per-D pending-seed 限流
|
||||
|
||||
### 4.4 当前缓解
|
||||
|
||||
- `--kvcache-seed-min-turn-id=2`:跳过 turn 1 seed 完全(main 分支 stable 配置)
|
||||
- 副作用:失去 turn-1 的 KV 注入,turn 2 必走 reseed(但反而稳定,因为 reseed 是分散在时间上的)
|
||||
|
||||
### 4.5 真正的修复
|
||||
|
||||
- per-D pending-seed budget(同 §1.5 第 2 项)
|
||||
- §3.5 完成后 evict 频次自降,间接降低 reseed 频次
|
||||
|
||||
---
|
||||
|
||||
## 5. D — Streaming-session correction invariant crash (E3 landmine)
|
||||
|
||||
### 5.1 症状
|
||||
|
||||
- D scheduler 抛 `AssertionError` at `schedule_batch.py:1646`:`seq_len - pre_len == req.extend_input_len`
|
||||
- 整个 D worker 进程退出 → router 看见对端死 → §1 cascade
|
||||
|
||||
### 5.2 根因
|
||||
|
||||
[E3_FINDINGS_ZH.md](E3_FINDINGS_ZH.md) §2:streaming-session correction(commit b8e6f13)把 `extend_input_len` 改写为 `max(0, fill_len - prefix_len)`,但下游 invariant 还从原始 fill_ids/prefix_indices 计算。当 `fill_len < prefix_len`(多 turn 累积 prefix > 当前 turn 增量)时数学上不可能满足。
|
||||
|
||||
### 5.3 触发条件
|
||||
|
||||
- streaming session 跨 turn 已 commit prefix 长于本 turn 的新增 fill_ids
|
||||
- E2 因 pipeline 阻塞从未跑到这个状态;E3 修了 cold-D bottleneck → pipeline 更快 → landmine 暴露
|
||||
|
||||
### 5.4 当前缓解
|
||||
|
||||
commit 986f351 的 pre-filter pass:在 `prepare_for_extend` 入口 drop 这类 req(让 client 看错误响应而不是 worker 崩)。是止血。
|
||||
|
||||
### 5.5 真正的修复
|
||||
|
||||
`schedule_batch.py:1572–1646` 这整段 correction 路径在 block-level eviction refactor 完成后**结构上不再需要**——[BLOCK_LEVEL_EVICTION_DESIGN_ZH.md](BLOCK_LEVEL_EVICTION_DESIGN_ZH.md) §3.7 已说明 refactor 后 fill_ids / prefix_indices 一致性由 radix `match_prefix` 自动保证。
|
||||
|
||||
→ 不要再加更多 correction 子句;要删整段。
|
||||
|
||||
---
|
||||
|
||||
## 6. 失败诊断 cheat sheet
|
||||
|
||||
跑 sweep 时按下表 lookup:
|
||||
|
||||
| 你看到 | 大概率是 | 先查 |
|
||||
|---|---|---|
|
||||
| 客户端 `RuntimeError: generate stream ended before...` | §1 cascade | D scheduler log 搜 `instance could be dead` |
|
||||
| 某个 D `binding=0` 而其它 D 繁忙 | §2 cold-D | `per_decode_load` 直方图 |
|
||||
| TTFT p99 突然抬到 5–8s 量级 | §3 evict storm | `release_session` 调用频次 + 平均 free tokens |
|
||||
| Sweep 起步阶段失败率高、稳态低 | §4 reseed storm | mooncake transfer queue 在前 30s 的峰值 |
|
||||
| D worker 进程异常退出 | §5 invariant crash | scheduler log 搜 `AssertionError`、`extend_input_len` |
|
||||
|
||||
---
|
||||
|
||||
## 7. 与路线图的衔接
|
||||
|
||||
- [AUDIT_AND_ROADMAP_ZH.md](AUDIT_AND_ROADMAP_ZH.md) Milestone 1 的第 1/3/4 项分别对应本表 C / A / B 的真正修复。完成 Milestone 1 后本表 §1–§4 应该都从"未修"降级为"已缓解",§5 直接消失。
|
||||
- 论文 §Limitations 必须老实写出现状:"we identify five failure modes; A/C are addressed by this work, B/C' are partially addressed, D is a transient artifact of the in-progress refactor."
|
||||
|
||||
---
|
||||
|
||||
**核心句**:把失败模式当 first-class artifact 来管理——每个失败都有"症状 → 根因 → 触发 → 缓解 → 真正修复"五字段,是把 prototype 推到 production-grade 的关键工具。reviewer 看见你能枚举失败远比看见你赢得 baseline 更让人信服。
|
||||
270
docs/H200_DRIVER570_SETUP_ZH.md
Normal file
270
docs/H200_DRIVER570_SETUP_ZH.md
Normal file
@@ -0,0 +1,270 @@
|
||||
# H200 + Driver 570 上跑通本仓库的环境配置(含踩坑记录)
|
||||
|
||||
**适用范围**:4× H200 节点 + NVIDIA driver `570.86.15` + 本仓库 `kvc-debug-journey-v1-to-v4` 或后续分支。
|
||||
**目标读者**:拿到一台新 H200 机器、需要快速跑通 sglang 0.5.10 vendor + mooncake RDMA + agentic-pd-hybrid 的下一个 SWE/research agent。
|
||||
**作者状态**:本文档定稿于 `h200-cu130 @ 初始 commit`,smoke test 已 RDMA 跑通 16 reqs / 0 error。
|
||||
|
||||
---
|
||||
|
||||
## 0. TL;DR(5 行)
|
||||
|
||||
1. **`nvidia-smi` 的 "CUDA Version: 13.0" 是个陷阱**——它是 driver 能 forward-compat 跑的 runtime 上限,不是 driver 自己 API 版本。driver `570.86.15` 提供的 driver API 是 **cu12.8**。
|
||||
2. vendor sglang 0.5.10 的 `jit_kernel/` 用 `tvm_ffi` + ninja + nvcc binary 在首次调用每个 kernel 时编译。系统唯一 nvcc 在 `/usr/local/cuda-13.0/bin/`,cu13 编译出的 .so 会 NEEDED `libcudart.so.13`,driver 570 拒绝运行 → `cudaErrorInsufficientDriver`。
|
||||
3. 解法是**本地装一份 cu12.8 toolkit 到 `$HOME/cuda-12.8`**(不需要 root),让 tvm_ffi 走 cu12.8 nvcc,编译产物 NEEDED `libcudart.so.12`,driver 570 完美支持。
|
||||
4. mooncake wheel (`mooncake-transfer-engine 0.3.10.post2`) 也是 cu12 build,需要 `libcudart.so.12`——已经由 `nvidia-cuda-runtime-cu12` 包提供,在 venv 里。
|
||||
5. 每个 shell **必须 `source scripts/setup_env.sh`** 才能跑 SGLang。已封装好。
|
||||
|
||||
---
|
||||
|
||||
## 1. 一次性 setup(约 25min)
|
||||
|
||||
```bash
|
||||
cd /path/to/agentic-pd-hybrid
|
||||
|
||||
# (1) Python 环境 (~3min)
|
||||
uv sync
|
||||
|
||||
# (2) cu12.8 toolkit 本地装(~5GB 下载 + 5min 解压 = ~15-20min)
|
||||
mkdir -p /tmp/cuda_dl && cd /tmp/cuda_dl
|
||||
wget https://developer.download.nvidia.com/compute/cuda/12.8.1/local_installers/cuda_12.8.1_570.124.06_linux.run
|
||||
sh cuda_12.8.1_570.124.06_linux.run \
|
||||
--silent --toolkit --override \
|
||||
--installpath=$HOME/cuda-12.8 \
|
||||
--tmpdir=$HOME/tmp \
|
||||
--no-drm --no-man-page
|
||||
|
||||
# (3) 验证
|
||||
$HOME/cuda-12.8/bin/nvcc --version # 应该看到 release 12.8, V12.8.93
|
||||
|
||||
# (4) 回到 repo 根目录,首次 source(每个 shell 都要做)
|
||||
cd /path/to/agentic-pd-hybrid
|
||||
source scripts/setup_env.sh
|
||||
```
|
||||
|
||||
`source scripts/setup_env.sh` 输出应是:
|
||||
```
|
||||
agentic-pd-hybrid env ready:
|
||||
CUDA_HOME=/home/<user>/cuda-12.8 (12.8, V12.8.93)
|
||||
libcudart.so.12 at .../.venv/lib/python3.12/site-packages/nvidia/cuda_runtime/lib
|
||||
MC_TRANSFER_TIMEOUT=1800s
|
||||
```
|
||||
|
||||
**`MC_TRANSFER_TIMEOUT=1800` (30 min) 替代 mooncake 默认 30s**——E2 forensic 发现 D 端 LRU eviction 会让 mooncake C++ control plane 被 starved 30+s,触发 `conn.py:1270` hair-trigger 永久 blacklist 整个 D 的 mooncake_session_id。1800s 给足缓冲,30 分钟还没回应才是真正"D 死了"。详见 `docs/E1_E2_RESULTS_ZH.md §5c`。`stack.py` 也对 worker subprocess 设了同名默认值。
|
||||
|
||||
---
|
||||
|
||||
## 2. Smoke test(验证整条链路)
|
||||
|
||||
把 16 个合成 request 喂给 1P3D 拓扑,启用真 RDMA,跑通后才能动 E1/E2 实验。
|
||||
|
||||
```bash
|
||||
# 假设已 source scripts/setup_env.sh
|
||||
mkdir -p outputs/smoke_rdma
|
||||
|
||||
uv run --no-sync python -m agentic_pd_hybrid.cli make-small-append-trace \
|
||||
--output outputs/smoke_rdma/mini_trace.jsonl \
|
||||
--session-count 4 --turns-per-session 4 \
|
||||
--initial-input-length 1024 --append-input-length 200 --output-length 50 \
|
||||
--inter-turn-gap-s 2 --session-stagger-s 1
|
||||
|
||||
uv run --no-sync python -m agentic_pd_hybrid.cli benchmark-live \
|
||||
--trace outputs/smoke_rdma/mini_trace.jsonl \
|
||||
--output-root outputs/smoke_rdma \
|
||||
--mechanism pd-disaggregation --policy default \
|
||||
--model-path /mnt/models/Qwen/Qwen3-30B-A3B-Instruct-2507 \
|
||||
--prefill-workers 1 --decode-workers 3 \
|
||||
--prefill-tp-size 1 --decode-tp-size 1 \
|
||||
--prefill-gpu-ids 0 --decode-gpu-ids 1,2,3 \
|
||||
--transfer-backend mooncake \
|
||||
--force-rdma --ib-device mlx5_60 \
|
||||
--gpu-budget 4 --time-scale 1 \
|
||||
--concurrency-limit 4 --timeout-s 1800 --request-timeout-s 300 \
|
||||
--session-sample-rate 1.0 --min-turns 1 --target-duration-s 600
|
||||
```
|
||||
|
||||
**首次跑会慢 8-15min**(model load 196s + 5-10 个 JIT kernel 各编译 ~10-30s + warmup)。后续跑只 ~3-5min。
|
||||
|
||||
**期望结果**:`request_count=16, error=0, abort=0, failure=0, execution_modes={'pd-disaggregation-router': 16}`。
|
||||
|
||||
每个 worker 的日志应有 `installTransport, type=rdma`,表示 mooncake 真的走 RDMA 而不是 TCP loopback。
|
||||
|
||||
---
|
||||
|
||||
## 3. GPU ↔ RDMA HCA 映射(本机实测)
|
||||
|
||||
8 块 ConnectX HCA,全部 ACTIVE / 400 Gb/s NDR / RoCE v2 (link_layer=Ethernet, GID Index 3)。Mooncake 按 NUMA / PCIe affinity 自动选 preferred:
|
||||
|
||||
| GPU | preferred HCA | NUMA |
|
||||
|---|---|---|
|
||||
| cuda:0 | mlx5_60 | 0 |
|
||||
| cuda:1 | mlx5_88 | 0 |
|
||||
| cuda:2 | mlx5_98 | 1 |
|
||||
| cuda:3 | mlx5_42 | 1 |
|
||||
|
||||
CLI 的 `--ib-device <name>` 只接单个设备名,给所有 worker 全局 override。Smoke test 默认填 `mlx5_60`(P worker 在 cuda:0 上 NUMA-local;D worker 在其它 GPU 上是 cross-NUMA 但能跑)。E1/E2 实验如果想最优,可以分 P/D worker 独立设环境变量,但目前 stack.py 不支持 per-worker `MOONCAKE_DEVICE`,要么所有 worker 同一个,要么走 mooncake auto(需把 `MC_MS_AUTO_DISC=0` 改回 1)。
|
||||
|
||||
完整 8 块 HCA:`mlx5_22, _27, _42, _60, _88, _98, _126, _135`(NUMA 0/1/0/0/0/1/0/1 混杂)。
|
||||
|
||||
---
|
||||
|
||||
## 4. 踩过的坑(按时间线)
|
||||
|
||||
### 坑 1:`nvidia-smi` 的 "CUDA Version: 13.0" 是误导
|
||||
|
||||
`nvidia-smi` header 显示 `Driver Version: 570.86.15 / CUDA Version: 13.0` 让人以为机器支持 cu13。**这是 driver 能 forward-compat 跑的 CUDA runtime 上限**,不是 driver 自己 API 的版本。driver 570 的 driver API 上限是 cu12.8(参见 NVIDIA "CUDA Compatibility" 矩阵)。
|
||||
|
||||
**正确判断方法**:跑 `torch.cuda.is_available()`,如果装了 cu13 build 的 torch 会报 `The NVIDIA driver on your system is too old (found version 12080)`。返回 `12080` 才是 driver 自己 API 版本(cu12.8)。
|
||||
|
||||
### 坑 2:vendor sglang vs pip sglang 的 patch 差异
|
||||
|
||||
仓库的 `third_party/sglang/python/` 是带项目自有 patches 的 SGLang 0.5.10 fork。**pip 上的 `sglang==0.5.10` 不包含核心 patches**——具体差异:
|
||||
|
||||
| 文件 | pip 版 | vendor 版 |
|
||||
|---|---|---|
|
||||
| `srt/managers/scheduler.py` | 3621 行 | 3938 行 |
|
||||
| `admit_direct_append` 出现次数 | 2 | **11** |
|
||||
| `DirectAppendAdmissionReqInput/Output` | 没有 | **有**(核心 RPC) |
|
||||
| `_should_allow_local_prefill_on_decode` | 没有 | 有 |
|
||||
| `maybe_trim_decode_session_cache` | 没有 | 有 |
|
||||
| `decode_direct_waiting_queue` | 没有 | 有 |
|
||||
|
||||
→ **必须用 vendor 版**。本分支已把 `pyproject.toml` 的 `sglang==0.5.10` 改成 `sglang` + `[tool.uv.sources] sglang = { path = "third_party/sglang/python", editable = true }`,`uv sync` 后会自动 editable 安装 vendor 版。
|
||||
|
||||
历史上有些 sweep 脚本用 `PYTHONPATH=src:third_party/sglang/python` 在运行时切换,但用 `uv.sources` 把它装进 venv 更彻底,不会被 pip 的 sglang 偷偷 shadow。
|
||||
|
||||
### 坑 3:cu13 切换是死路
|
||||
|
||||
发现 driver 570 不兼容时第一个想到的路径是「装 cu13 PyTorch」。试过:
|
||||
|
||||
1. 改 `pyproject.toml` 加 `[[tool.uv.index]]` 指向 `https://download.pytorch.org/whl/cu130`
|
||||
2. 同样改 vendor sglang 的 `pyproject.toml`(root 项目的 sources 不会传递给 transitive editable dep)
|
||||
3. `uv sync` 成功装上 `torch==2.9.1+cu130` 和 `nvidia-{nccl,nvjitlink,nvshmem,cusparselt,nvtx}-cu13`
|
||||
4. **但 driver 570 不支持 cu13 runtime**——`torch.cuda.is_available()=False`,CUDA init 报 `driver too old (12080)`
|
||||
|
||||
→ cu13 路径需要 **driver 580+**。我们没有 root + 别人在用机器,所以放弃。本分支已 rollback 到 cu12 stack(pyproject 干净)。
|
||||
|
||||
### 坑 4:`--disable-overlap-schedule` 不够
|
||||
|
||||
第一次 smoke 崩在 `resolve_future_token_ids.cuh:49`,路径是 `event_loop_overlap_disagg_prefill`,怀疑是 overlap 模式特定 JIT kernel 问题。
|
||||
|
||||
cli.py 给 PD worker 加了 `--disable-overlap-schedule` 后,event loop 切到 `event_loop_normal_disagg_prefill`,但**崩在另一个 kernel `fused_inplace_qknorm`**,错误码完全相同(`cudaErrorInsufficientDriver`)。
|
||||
|
||||
→ 不是 overlap-specific,是 **整体 vendor sglang `jit_kernel/` 模块和 driver 570 不兼容**,任何 JIT kernel 都会崩在 `runtime.cuh:21` 的 `cudaOccupancyMaxActiveBlocksPerMultiprocessor` 调用(CUDA runtime 初始化时 driver feature 版本检查失败)。
|
||||
|
||||
但 `--disable-overlap-schedule` 留着不会造成伤害,且能避免之后类似 overlap-path 特定问题。本分支保留它在 `cli.py:_topology_from_args`。
|
||||
|
||||
### 坑 5:pip sgl_kernel vs vendor sglang/jit_kernel/ 是两套系统
|
||||
|
||||
`pip install sglang-kernel` 提供 `.venv/lib/.../sgl_kernel/{flash_ops,flashmla_ops,spatial_ops}.abi3.so`——这是 AOT 预编译产物。
|
||||
|
||||
`third_party/sglang/python/sglang/jit_kernel/` 是 vendor SGLang 0.5.10 内置的 **另一套 JIT 模块**,运行时用 tvm_ffi 编译。Smoke 崩在 vendor 的 jit_kernel,**降级 pip sgl_kernel 没用**(实测 0.4.0 / 0.4.1 同样崩)。
|
||||
|
||||
### 坑 6:`nvidia-cuda-nvcc-cu12` PyPI 包没装 nvcc binary
|
||||
|
||||
发现 cu13 nvcc 是 root cause 后,第一反应是 PyPI 装 cu12 nvcc 包:
|
||||
|
||||
```bash
|
||||
uv pip install nvidia-cuda-nvcc-cu12==12.8.93
|
||||
```
|
||||
|
||||
装上以后 `find .venv -name nvcc` **返回空**——这个 PyPI 包只装 `ptxas` 和 `nvvm/`,**没有 nvcc binary**(NVIDIA 出于分发限制不把 nvcc 放 PyPI)。
|
||||
|
||||
→ 完整 nvcc 必须从 NVIDIA 官方 `.run` installer 或 apt 装。`.run` installer 可以装到 user-writable 路径不需要 root,本仓库选这条路。
|
||||
|
||||
### 坑 7:tvm_ffi 通过 ninja 调用 nvcc
|
||||
|
||||
vendor sglang 的 `jit_kernel/` 用 `tvm_ffi.cpp.extension`,源码在 `~/.local/lib/python3.12/site-packages/tvm_ffi/cpp/extension.py`。关键路径:
|
||||
|
||||
```python
|
||||
def _find_cuda_home() -> str:
|
||||
cuda_home = os.environ.get("CUDA_HOME") or os.environ.get("CUDA_PATH")
|
||||
if cuda_home is None:
|
||||
nvcc_path = shutil.which("nvcc")
|
||||
if nvcc_path is not None:
|
||||
cuda_home = str(Path(nvcc_path).parent.parent)
|
||||
...
|
||||
```
|
||||
|
||||
然后构造 ninja file:
|
||||
```
|
||||
nvcc = {_find_cuda_home()}/bin/nvcc
|
||||
```
|
||||
|
||||
→ **设 `CUDA_HOME=$HOME/cuda-12.8` 就能 hook 整条编译链**。`scripts/setup_env.sh` 已经设好。
|
||||
|
||||
JIT 编译产物缓存在 `~/.cache/tvm-ffi/sgl_kernel_jit_*/*.so`。如果之前用 cu13 nvcc 编过,要先 `rm -rf ~/.cache/tvm-ffi/sgl_kernel_jit_*` 再用 cu12.8 重编。
|
||||
|
||||
### 坑 8:mooncake import path 与 onboarding 文档不一致
|
||||
|
||||
`docs/ONBOARDING_NEXT_AGENT_ZH.md` §3.3 的环境验证写:
|
||||
```python
|
||||
from mooncake_transfer_engine import TransferEngine
|
||||
```
|
||||
|
||||
但实际 PyPI `mooncake-transfer-engine 0.3.10.post2` wheel 的 import path 是:
|
||||
```python
|
||||
from mooncake.engine import TransferEngine
|
||||
```
|
||||
|
||||
第一次 `from mooncake_transfer_engine` 会 `ModuleNotFoundError`。**ONBOARDING 文档应该更新**(本分支不动 onboarding,留给主 agent 决定)。
|
||||
|
||||
### 坑 9:mooncake.engine import 必须有 libcudart.so.12
|
||||
|
||||
`from mooncake.engine import TransferEngine` 在 fresh shell(未 source setup_env.sh)下报:
|
||||
```
|
||||
ImportError: libcudart.so.12: cannot open shared object file: No such file or directory
|
||||
```
|
||||
|
||||
mooncake 的 `engine.so` 是 cu12 build,dynamic link `libcudart.so.12`。venv 里有但需要 LD_LIBRARY_PATH 暴露。`scripts/setup_env.sh` 已加。
|
||||
|
||||
### 坑 10:Inferact 数据集 schema 与 agentic-pd-hybrid 期望不匹配
|
||||
|
||||
`huggingface.co/datasets/Inferact/codex_swebenchpro_traces` 是 ShareGPT 格式(`{"from": "human/gpt", "value": "<text>"}`),不含 token 计数 / hash_ids / 时间戳。
|
||||
|
||||
`agentic-pd-hybrid` 期望 JSONL:`chat_id, parent_chat_id, timestamp, input_length, output_length, type, turn, hash_ids[]`。
|
||||
|
||||
→ 已写 `scripts/convert_inferact_to_trace.py`:tokenize(用 model 自带 tokenizer)+ 滚动 hash 切 24-token block + 伪造 timestamp。610 trials × 33 turns 处理约 37min,跑出 20,230 reqs(与 Inferact README 的 "20,230 total LLM calls" 完全一致)。
|
||||
|
||||
输出 `outputs/inferact_codex_swebenchpro.jsonl`(1.3GB,被 `.gitignore` 排除不进仓库)。
|
||||
|
||||
### 坑 11:sampling 默认 `--session-sample-rate 0.01`
|
||||
|
||||
`benchmark-live` 跑的时候内部会先做 sampling。默认 1%,意味着 50 sessions 才抽 1 个。Mini smoke trace 4 sessions × 1% = 0 → `ValueError: Sampling produced no requests`。
|
||||
|
||||
→ smoke test 命令显式加 `--session-sample-rate 1.0 --target-duration-s 600`。
|
||||
|
||||
---
|
||||
|
||||
## 5. 后续给下个 agent
|
||||
|
||||
跑 E1 / E2 sweep 之前**每个 shell 第一件事**:
|
||||
|
||||
```bash
|
||||
cd /path/to/agentic-pd-hybrid
|
||||
source scripts/setup_env.sh
|
||||
```
|
||||
|
||||
然后用 ONBOARDING §3 的 sweep 脚本(参考 `scripts/sweep_ts1_migration_v2.sh` 作为底版)。注意几处针对本机的修改:
|
||||
|
||||
1. **MODEL 路径**改成 `/mnt/models/Qwen/Qwen3-30B-A3B-Instruct-2507`(onboarding 写的 `/mnt/kzlin/workflow/pd-hybrid/simm-swe-bench/models/...` 不存在)。
|
||||
2. **TRACE 路径**:`outputs/qwen35-swebench-50sess.jsonl` 不存在;用 `outputs/inferact_codex_swebenchpro.jsonl` (converter 跑完后产生)。
|
||||
3. **`--ib-device`** 选 `mlx5_60`(cuda:0 NUMA-local)或视实验需要自选;onboarding 写的 `mlx5_0` 在本机不存在。
|
||||
4. **保留 cli.py 的 `--disable-overlap-schedule`** 不要删——理论上 cu12.8 toolchain 应该让 overlap 也能跑,但目前未验证 overlap path 没有别的潜在问题,留着是 zero-cost 保险。
|
||||
|
||||
---
|
||||
|
||||
## 附录 A:本分支的代码改动
|
||||
|
||||
- `pyproject.toml`:sglang dep 改用 `[tool.uv.sources]` path source 走 `third_party/sglang/python`(editable)。
|
||||
- `src/agentic_pd_hybrid/cli.py:_topology_from_args`:给 prefill/decode worker 自动加 `--disable-overlap-schedule`。
|
||||
- `scripts/setup_env.sh`:env wrapper,每个 shell `source` 一次。
|
||||
- `scripts/convert_inferact_to_trace.py`:Inferact ShareGPT → agentic-pd-hybrid JSONL schema converter。
|
||||
- `docs/H200_DRIVER570_SETUP_ZH.md`:本文档。
|
||||
|
||||
## 附录 B:被 `.gitignore` 排除的产物
|
||||
|
||||
- `outputs/inferact_codex_swebenchpro.jsonl`(1.3GB)——converter 输出,用 `scripts/convert_inferact_to_trace.py` 重新生成
|
||||
- `outputs/smoke_rdma/`(含 mini trace + smoke run artifacts)
|
||||
- `third_party/codex_swebenchpro_traces/`(209MB,HF dataset 下载)—— `hf download Inferact/codex_swebenchpro_traces --repo-type dataset --local-dir third_party/codex_swebenchpro_traces` 重下
|
||||
- `~/cuda-12.8/`——cu12.8 toolkit,用 §1 步骤 (2) 重装
|
||||
- `.venv/`——`uv sync` 重建
|
||||
119
docs/INDEX_ZH.md
Normal file
119
docs/INDEX_ZH.md
Normal file
@@ -0,0 +1,119 @@
|
||||
# 文档索引
|
||||
|
||||
**目的**:让任何合作者在 10 分钟内找到他需要的文档;让 Reviewer 知道哪些先看。
|
||||
|
||||
---
|
||||
|
||||
## 0. 时间紧的 3 篇
|
||||
|
||||
按这个顺序读完即可参与讨论:
|
||||
|
||||
1. [AUDIT_AND_ROADMAP_ZH.md](AUDIT_AND_ROADMAP_ZH.md) — 项目当前进度、薄弱点、路线图。
|
||||
2. [KVC_ROUTER_ALGORITHM.md](KVC_ROUTER_ALGORITHM.md) — 算法形式化(Algorithm 1/2/3 + Theorem 1/2)。
|
||||
3. [V2_DEEP_ANALYSIS_ZH.md](V2_DEEP_ANALYSIS_ZH.md) §0 + §6 — v2 当前 win/lose snapshot。
|
||||
|
||||
---
|
||||
|
||||
## 1. 按主题分类
|
||||
|
||||
### 1.1 进度 / 现状
|
||||
|
||||
| 文档 | 内容 |
|
||||
|---|---|
|
||||
| [AUDIT_AND_ROADMAP_ZH.md](AUDIT_AND_ROADMAP_ZH.md) | 跨分支整合 + 路线图(本分支的总入口) |
|
||||
| [PROJECT_OVERVIEW.md](PROJECT_OVERVIEW.md) | 项目目标 + 三种 mechanism(pd-disagg / pd-colo / kvcache-centric)的术语区分 |
|
||||
| [ONBOARDING_NEXT_AGENT_ZH.md](ONBOARDING_NEXT_AGENT_ZH.md) | 接班 agent 30 分钟上手手册(来自 `kvc-debug-journey-v1-to-v4`) |
|
||||
|
||||
### 1.2 算法 / 形式化
|
||||
|
||||
| 文档 | 内容 |
|
||||
|---|---|
|
||||
| [KVC_ROUTER_ALGORITHM.md](KVC_ROUTER_ALGORITHM.md) | Algorithm 1(Route)/ 2(Admit)/ 3(Dispatch)+ Theorem 1(无饿死)+ Theorem 2(fast-path 命中下限) |
|
||||
| [MIGRATION_V1_FINDINGS_ZH.md](MIGRATION_V1_FINDINGS_ZH.md) | v1 thrashing pathology 的实测 + 为什么 reset-on-success 是关键修复 |
|
||||
|
||||
### 1.3 实验结果
|
||||
|
||||
| 文档 | 内容 |
|
||||
|---|---|
|
||||
| [V2_DEEP_ANALYSIS_ZH.md](V2_DEEP_ANALYSIS_ZH.md) | SWE-Bench 50 sess ts=1:v2 vs 4DP CA 的 6/8 win + TTFT p99 落后原因 |
|
||||
| [V2_RESULTS_ZH.md](V2_RESULTS_ZH.md) | v2 原始战报(headline 数字略乐观,请同时看 deep analysis) |
|
||||
| [E1_E2_RESULTS_ZH.md](E1_E2_RESULTS_ZH.md) | H200 + RDMA 上 E1(naive 1P3D + kv-aware)vs E2(KVC v2);E2 80% failure 的 forensic |
|
||||
| [E3_FINDINGS_ZH.md](E3_FINDINGS_ZH.md) | E3(+load-floor bonus)16 min 触发 SGLang patch invariant crash |
|
||||
| [E1_E2_FIX_DESIGN_ZH.md](E1_E2_FIX_DESIGN_ZH.md) | Q1(mooncake death)+ Q2(cold-D2)的 fix 设计 |
|
||||
|
||||
### 1.4 当前关键 design discussion
|
||||
|
||||
| 文档 | 内容 |
|
||||
|---|---|
|
||||
| [KVC_EVICTION_GRANULARITY_DESIGN_ZH.md](KVC_EVICTION_GRANULARITY_DESIGN_ZH.md) | 架构层反思:session-level evict 与 KVC continuity 设计冲突 |
|
||||
| [BLOCK_LEVEL_EVICTION_DESIGN_ZH.md](BLOCK_LEVEL_EVICTION_DESIGN_ZH.md) | block-level evict refactor 的具体 API / 步骤 / 测试计划(本分支新增) |
|
||||
| [RESEED_SLOW_PATH_AND_D_TO_P_GAP_ZH.md](RESEED_SLOW_PATH_AND_D_TO_P_GAP_ZH.md) | reseed 慢路径时间线 + D→P 同步缺口的 forensic |
|
||||
| [D_TO_P_SYNC_CONTRACT_ZH.md](D_TO_P_SYNC_CONTRACT_ZH.md) | D→P sync 的接口契约、staleness budget、rollout 阶段(本分支新增) |
|
||||
|
||||
### 1.5 评测 / 方法论
|
||||
|
||||
| 文档 | 内容 |
|
||||
|---|---|
|
||||
| [EVALUATION_PROTOCOL_ZH.md](EVALUATION_PROTOCOL_ZH.md) | paper-quality 评测协议(N、CI、paired、stratify、baseline list、trace mix)—— 本分支新增 |
|
||||
| [REFACTOR_PLAN_V1_ZH.md](REFACTOR_PLAN_V1_ZH.md) | 为什么从 ts=10 切到 ts=1 |
|
||||
| [TEAM_REPORT_AGENTIC_PD_HYBRID_ZH.md](TEAM_REPORT_AGENTIC_PD_HYBRID_ZH.md) | ts=10 时代的结构性问题清单(多数已 supersede) |
|
||||
|
||||
### 1.6 工程债 / 失败模式
|
||||
|
||||
| 文档 | 内容 |
|
||||
|---|---|
|
||||
| [SGLANG_PATCH_INVENTORY_ZH.md](SGLANG_PATCH_INVENTORY_ZH.md) | 785 行 vendored SGLang patch 的归类清单(MUST-HAVE / WORKAROUND / EXPERIMENTAL / INSTRUMENTATION)—— 本分支新增 |
|
||||
| [FAILURE_MODES_ZH.md](FAILURE_MODES_ZH.md) | 5 类失败模式的诊断 + 缓解 + 真正修复(mooncake cascade / cold-D / evict storm / reseed storm / E3 invariant)—— 本分支新增 |
|
||||
|
||||
### 1.7 环境
|
||||
|
||||
| 文档 | 内容 |
|
||||
|---|---|
|
||||
| [H200_DRIVER570_SETUP_ZH.md](H200_DRIVER570_SETUP_ZH.md) | H200 + driver 570 + cu12.8 环境搭建 + 11 条 lesson learned |
|
||||
|
||||
### 1.7 归档(仅历史参考)
|
||||
|
||||
`docs/archive/` 下的内容已被新文档 supersede,不必看:
|
||||
|
||||
- `AGENTIC_FIT_ANALYSIS_ZH.md`、`STRUCTURAL_VALIDATION_REPORT_ZH.md`:ts=10 早期分析。
|
||||
- `KVCACHE_CENTRIC_PROGRESS_ZH.md`:早期项目快照。
|
||||
- `KVC_DEBUG_JOURNEY_V1_TO_V5.md`、`V5_PROFILE_INVESTIGATION_ZH.md`:v1–v5 调优过程笔记。
|
||||
- `REFACTOR_PLAN_ZH.md`:v0 重构计划。
|
||||
- `SWEBENCH_EXPERIMENT_*.md`:早期实验日志。
|
||||
|
||||
---
|
||||
|
||||
## 2. 按角色推荐阅读路径
|
||||
|
||||
### 2.1 我是新接手的 SWE/research agent
|
||||
|
||||
1. 先读本文 §0 的 3 篇。
|
||||
2. 再看 [AUDIT_AND_ROADMAP_ZH.md](AUDIT_AND_ROADMAP_ZH.md) §3(薄弱点)+ §5(GPU-free 工作清单)。
|
||||
3. 选一个 Milestone 1 子项开始做。`docs/BLOCK_LEVEL_EVICTION_DESIGN_ZH.md` 与 `docs/D_TO_P_SYNC_CONTRACT_ZH.md` 是已经准备好的两条工程主线。
|
||||
|
||||
### 2.2 我是 paper reviewer / 审稿预读
|
||||
|
||||
1. [KVC_ROUTER_ALGORITHM.md](KVC_ROUTER_ALGORITHM.md):算法 + theorem。
|
||||
2. [V2_DEEP_ANALYSIS_ZH.md](V2_DEEP_ANALYSIS_ZH.md):核心实测对比 + 我们自己识别的 limitation。
|
||||
3. [E1_E2_RESULTS_ZH.md](E1_E2_RESULTS_ZH.md):真硬件 + RDMA 上的 ablation(含 E2 的 80% failure forensic,证明我们能解释失败)。
|
||||
4. [AUDIT_AND_ROADMAP_ZH.md](AUDIT_AND_ROADMAP_ZH.md) §3:我们自己列出的薄弱点与未来工作(不藏问题)。
|
||||
|
||||
### 2.3 我是要复现实验的 student
|
||||
|
||||
1. [H200_DRIVER570_SETUP_ZH.md](H200_DRIVER570_SETUP_ZH.md)。
|
||||
2. [EVALUATION_PROTOCOL_ZH.md](EVALUATION_PROTOCOL_ZH.md):跑哪些 sweep、按什么协议比较。
|
||||
3. `scripts/sweep_ts1_migration_v2.sh`:v2 主 sweep;`scripts/sweep_e1_naive_1p3d.sh` / `scripts/sweep_e2_kvc_v2_rdma.sh`:E1/E2 ablation。
|
||||
|
||||
### 2.4 我想看 control plane 与 admission
|
||||
|
||||
1. `src/agentic_pd_hybrid/policies.py`:`KvAwarePolicy.select` 是 Algorithm 1 的实现。
|
||||
2. `src/agentic_pd_hybrid/replay.py`:`_invoke_session_direct` / `_invoke_kvcache_seeded_router` 是 Algorithm 3 的 orchestration。
|
||||
3. `third_party/sglang/python/sglang/srt/managers/scheduler.py`:D 端 `_admit_direct_append` 是 Algorithm 2 实现。
|
||||
|
||||
---
|
||||
|
||||
## 3. 这份索引的维护约定
|
||||
|
||||
- 新加一份 design / experiment doc 必须在本文 §1 表格里加一行。
|
||||
- 文档归档(移到 `docs/archive/`)时本文同步删除条目或标 "已归档"。
|
||||
- 本文不写实质内容,只做导航;任何深入说明都在被指向的文档里。
|
||||
228
docs/KVC_EVICTION_GRANULARITY_DESIGN_ZH.md
Normal file
228
docs/KVC_EVICTION_GRANULARITY_DESIGN_ZH.md
Normal file
@@ -0,0 +1,228 @@
|
||||
# KVC Eviction Granularity — 设计审视 (架构层)
|
||||
|
||||
**日期**: 2026-05-12
|
||||
**Status**: 架构审视 / 待 design discussion
|
||||
**Companion**: `docs/E1_E2_RESULTS_ZH.md`, `docs/E3_FINDINGS_ZH.md`, `docs/E1_E2_FIX_DESIGN_ZH.md`
|
||||
**Branch**: `h200-cu130`
|
||||
|
||||
本文是 E2 → E3 迭代后的高层架构反思,**不是又一份 fix design**。前几轮 E2 → E3 我一直在加 local patches(load-floor bonus、Fix A skip-zero-extend、调 migration_reject_threshold 等),但 E3 实测数据迫使我们承认这些 patches 大局上看是 **KVC 在向 DP / naive PD-disagg 退化的轨迹**。
|
||||
|
||||
---
|
||||
|
||||
## 0. TL;DR
|
||||
|
||||
1. **KVC 的 value proposition** 是"session pin 在 D 上、KV 跨 turn 连续累积、direct-to-D 快路径 0.04s TTFT"。
|
||||
2. **`SessionAwareCache.release_session` 在 trim 时一次性 free 整段 session-exclusive 尾部**:实测 E3 一次 trim 平均 free **67,726 tokens**(samples: 35K / 38K / 40K / 86K / 87K),不是 "几个 leaf block"。
|
||||
3. 被 evict 的 session 下次到来时必须**从客户端原 prompt 重 prefill 50-90K** + mooncake transfer 5-9 GB → **跟 naive PD-disagg 一模一样**。
|
||||
4. → 在 saturation regime 下 KVC 的 cache continuity 设计被自己的 eviction 抵消。**Session-level eviction 与 KVC 的设计意图冲突**。
|
||||
5. 真正的方向不是堆 patch,是 **改 eviction granularity**: 让 streaming-session 的 decode 输出 **progressively commit 进 radix tree**,由 SGLang 标准的 block-level LRU 蚕食最老的 leaf。SessionSlot 退化成纯 metadata。
|
||||
|
||||
---
|
||||
|
||||
## 1. 我们做对了什么,又错过了什么
|
||||
|
||||
### KVC 的 design promise(来自 `KVC_ROUTER_ALGORITHM.md` §1)
|
||||
|
||||
| Property | 设计意图 |
|
||||
|---|---|
|
||||
| Session 钉定 | Session `s` pin 在 `pin[s]` 这一个 D;同 session 的所有 turn 在同一个 D 上做 KV 累积 |
|
||||
| Direct-to-D 快路径 | `req.session ∈ M_d ∧ append_len ≤ τ_append ∧ cap_ok` → 仅 append 新 token,**不走 P→D mooncake transfer** |
|
||||
| TTFT 优势 | append-only path TTFT ≈ 40ms (历史 v2 在 SWE-Bench 的 fast-path p50) |
|
||||
| 集中 cache 而非 fragment | 同 session cache 集中在一个 D 上,命中率高 |
|
||||
|
||||
### 我们当前实测在做什么(E3, killed at 1h12min)
|
||||
|
||||
| 指标 | 实测值 | 与设计 promise 的偏离 |
|
||||
|---|---:|---|
|
||||
| Eviction 次数 | **90** | 设计假设 "session 一旦绑就持续累积" |
|
||||
| 平均每次 evict 释放 | **67,726 tokens** | 不是 "几个 leaf block",是整段 session 尾部 |
|
||||
| 总释放 | **6,095,375 tokens** | 在 1h12min 里 trash 了 ≈ 8 个 session-pool 容量的 KV |
|
||||
| 触发 reseed 的 session 数 | 25 / 50 (50%) | 这些 session 每个被 evict-revisit 一次 = 付一次 50-90K re-prefill |
|
||||
| 单次 reseed 平均耗时 | 3-7s (P prefill + mooncake) | 跟 naive PD-disagg 持平 |
|
||||
|
||||
**E1 对照**:0 eviction、0 retract、50 sessions 顺利完成。E1 用的是 `pd-disaggregation` mechanism,**没有 KVC 层、没有 admission RPC**,但反而保留了 cache continuity(router-side sticky 让 session 不挪窝)。
|
||||
|
||||
> **讽刺**: E1 (naive 1P2D + kv-aware policy) **意外地** 比 E3 (KVC v2 + load-floor + RDMA) 更接近 KVC 设计意图——因为 E1 没有 admission 反馈链路,所以没人会触发那 90 次 session-level evict。
|
||||
|
||||
---
|
||||
|
||||
## 2. 为什么 session-level evict 是错的
|
||||
|
||||
### `release_session` 实测语义(`session_aware_cache.py:250-281`)
|
||||
|
||||
```python
|
||||
def release_session(self, session_id: str):
|
||||
slot = self.slots.pop(session_id, None)
|
||||
...
|
||||
if slot.last_node is not None:
|
||||
self.inner.dec_lock_ref(slot.last_node, ...) # 解 radix 锁 ✓
|
||||
|
||||
if slot.is_holding_kv:
|
||||
start = slot.cache_protected_len
|
||||
end = slot.kv_allocated_len
|
||||
if start < end:
|
||||
kv_indices = self.req_to_token_pool.req_to_token[
|
||||
slot.req_pool_idx, start:end
|
||||
]
|
||||
self.token_to_kv_pool_allocator.free(kv_indices) # 显式 free 一段 KV
|
||||
...
|
||||
```
|
||||
|
||||
`[cache_protected_len, kv_allocated_len)` 是 **session-exclusive 尾部**——从首 turn 提交 radix tree 之后所有累积的 decode output + 后续 turn 的 extend。在 Inferact workload 上:
|
||||
|
||||
- `cache_protected_len` ≈ 首 turn 提交的 boilerplate 部分 (~12K)
|
||||
- `kv_allocated_len` ≈ 50-100K(多 turn 累积)
|
||||
- **释放范围 = 38-88K**
|
||||
|
||||
这部分 KV **没有进 radix tree**,所以也享受不到 radix block-level LRU 的渐进式 shedding。`release_session` 一刀切。
|
||||
|
||||
### 与 SGLang 标准 radix LRU 的本质差异
|
||||
|
||||
SGLang 标准 `inner.evict()`(`base_prefix_cache.py` 接口由 RadixCache 实现):
|
||||
|
||||
```
|
||||
按节点 last_access_time 排序,从 leaf 开始 evict (因为 evict 中间节点会破坏树结构)
|
||||
每次释放一个 leaf node 的 KV indices
|
||||
lock_ref > 0 的节点不可 evict
|
||||
```
|
||||
|
||||
**特性对比**:
|
||||
|
||||
| | session-level (current) | block-level (SGLang radix) |
|
||||
|---|---|---|
|
||||
| 单次释放粒度 | 整段 session 尾部 (35-87K) | 一个 leaf node (~24 tokens / page-size) |
|
||||
| Recent prefix 保留 | ❌ 全丢 | ✅ 保留 (recent 访问 → 时间戳新 → 不被先 evict) |
|
||||
| Evict-revisit 成本 | 50-90K re-prefill | 仅丢的 leaf 部分 (≪ 50K) |
|
||||
| 与 session lifecycle | 强绑定 (是 lifecycle 退出动作) | 解耦 (lifecycle 仅做 lock_ref 管理) |
|
||||
|
||||
### 为什么会变这样:SessionAwareCache 的双重职责混淆
|
||||
|
||||
`SessionAwareCache` 设计承担了**两个本应分离的职责**:
|
||||
|
||||
1. **Session lifecycle 跟踪** (合理):streaming session 跨多个 req 复用 KV,需要在 turn 间保留 `(req_pool_idx, kv_committed_len, kv_allocated_len, last_node)` 这些字段,恢复给下个 turn 的 req。
|
||||
2. **Eviction granularity 决策** (问题所在):把 session 当成 evict 的最小单位,绕过了 SGLang 标准 LRU 的 leaf-by-leaf 渐进 shedding。
|
||||
|
||||
第 2 个职责本不该存在于 SessionAwareCache 里。SGLang radix 已经能处理 block-level LRU——前提是 session 的 KV 真的进了 radix 树。但**因为 session-exclusive 尾部没 commit 进 radix tree**,radix LRU 看不到它们,只能由 release_session 一次性大块 free。
|
||||
|
||||
---
|
||||
|
||||
## 3. 我们前几轮 patches 的总体轨迹
|
||||
|
||||
按 commit 时间线审视,每一步看似在修当下 issue,整体方向却是 KVC → DP 退化:
|
||||
|
||||
| Iteration | 改动 | 局部目标 | 大局影响 |
|
||||
|---|---|---|---|
|
||||
| E2 baseline | mechanism=kvcache-centric, worker admission | 跑出 KVC v2 头条数字 | D2 cold + cascade → 1054 failures (KVC 设计前提崩塌) |
|
||||
| E3 load-floor bonus | 让 fresh session 均匀分到 D2 | 解 cold-start 偏置 | 触发 migration → 25 sessions reseed → 暴露 evict granularity 问题 |
|
||||
| E3 → Fix A | 修 vendored SGLang `prepare_for_extend` 的 fill_ids<prefix_indices invariant | 防 decode-1 assertion crash | Patch 局部 bug,没动 evict 设计 |
|
||||
| **我之前提议: disable migration** | `--kvcache-migration-reject-threshold 0` | "让 session 不挪窝" | **会让 KVC 退化成 pd-disagg + load-floor**(admission RPC 还在但 migration 不生效) |
|
||||
| **更早提议: disable admission** | 砍 admission RPC | "省掉那个 RPC overhead" | **直接砍 KVC 的 direct-to-D fast path** (KVC_ROUTER_ALGORITHM.md §3.2 Algorithm 2 不存在) |
|
||||
|
||||
用户每次都正确地阻止了进一步退化。**没有人在审视 evict granularity 这个根本问题**——直到现在。
|
||||
|
||||
---
|
||||
|
||||
## 4. 正确方向(粗描)
|
||||
|
||||
**核心思路**: 让 streaming session 的 decode 输出 **progressively commit 进 radix tree**,由 SGLang 标准 radix LRU 蚕食最老的 leaf。SessionSlot 退化成纯 metadata。
|
||||
|
||||
### 4.1 目标行为
|
||||
|
||||
| 场景 | 当前行为 | 目标行为 |
|
||||
|---|---|---|
|
||||
| Session 累积 50K KV,D 满了 | release_session 一次释放 38K (整段 session-exclusive 尾部) | radix LRU evict 最老 leaf (可能是首 turn 的 boilerplate tail,~24 tokens) |
|
||||
| Session 被 evict 后再到来 | 必须 reseed 50K (P prefill + mooncake) | 仅 re-prefill 被 evict 的 leaf 部分 (e.g. ~5K) |
|
||||
| TTFT 对 evicted session 的影响 | 50-90K reseed = 3-7s | 5K append-prefill = ~200ms |
|
||||
| 不被 evict 的 session | 同 session 内 turns append-only | 同样 append-only ✓ (不变) |
|
||||
| KVC fast-path 命中率 | 91.6% (历史 SWE-Bench) / 38% (E3 Inferact, 因为 evict-revisit) | 应稳定在 >85% 即使 saturation |
|
||||
|
||||
### 4.2 需要的 refactor scope
|
||||
|
||||
按依赖排序,每一步可独立做但有耦合:
|
||||
|
||||
1. **Streaming session decode output 增量进 radix tree** (vendor SGLang)
|
||||
- 当前: decode output 累积在 `kv_allocated_len` 维度,但 radix tree 只记录到 `cache_protected_len`
|
||||
- 改: 每 turn finish 时把新的 decode tail 通过 radix `cache_finished_req` 路径插入 radix 树
|
||||
- 影响: streaming session 在 radix 树里有持续 growing 的 chain,每个 24-token block 一个 node
|
||||
- 牵涉: `radix_cache.py` 的 insert 路径、`schedule_batch.py` 的 cache_finished_req hook、SessionSlot.save_from_req
|
||||
|
||||
2. **SessionSlot 退化成纯 metadata**
|
||||
- 当前: SessionSlot 拥有 `req_pool_idx` + `[cache_protected_len, kv_allocated_len)` 范围的 KV 索引所有权
|
||||
- 改: SessionSlot 仅持有 `last_node`(指向 radix 树某 node)和 lock_ref 状态,不直接管 KV 范围
|
||||
- 影响: `restore_to_req` 改成基于 radix `match_prefix` 重建 req 状态,不直接 reuse req_pool_idx
|
||||
|
||||
3. **`release_session` 改为仅 dec_lock_ref + 删 slot metadata**
|
||||
- 当前: 还 free `[cache_protected_len, kv_allocated_len)` 范围 KV
|
||||
- 改: 只 dec_lock_ref → 让 radix LRU 自然 evict
|
||||
- 影响: `maybe_trim_decode_session_cache` 不再"按 session 释放",而是用 SGLang 现有的 `tree_cache.evict(required_tokens)`
|
||||
|
||||
4. **`admit_direct_append` 的 capacity 检查改用 radix-resident 长度**
|
||||
- 当前: `current_tokens = session.resident_tokens` (来自 SessionSlot)
|
||||
- 改: `current_tokens` = radix tree 上该 session 实际 commit 的长度 = `match_prefix(session.last_node).matched_length`
|
||||
- 影响: admission 评估的 "uncached = input - radix-resident" 更精确,evict-revisit 场景下 admission 反映出"只丢了一部分"而不是"全丢"
|
||||
|
||||
5. **`prepare_for_extend` 的 streaming-session correction 重新设计**
|
||||
- 当前: Fix A patches 的 fill_ids/prefix_indices invariant 是基于 session-exclusive 尾部的复杂 fixup
|
||||
- 改: 如果 SessionSlot 不再拥有独立 KV 范围,整个 correction 路径需要重写或可能不再必要
|
||||
|
||||
### 4.3 与 onboarding §4.4 D→P sync 的关系
|
||||
|
||||
`docs/RESEED_SLOW_PATH_AND_D_TO_P_GAP_ZH.md` §4 描述的 D→P 增量同步是**针对 reseed 自身成本**的 fix(让 P 端 backup 跟上,避免 reseed 时 P 重 prefill)。
|
||||
|
||||
本文 §4 描述的 eviction granularity 是**针对 reseed 触发频率**的 fix(让 session 不被一次性 evict 整段,减少 evict-revisit)。
|
||||
|
||||
**两者正交、互补**:
|
||||
- 单做 evict-granularity fix: reseed 频率下降,但偶发 reseed 仍然慢
|
||||
- 单做 D→P sync: reseed 自身快了,但仍然频繁触发
|
||||
- 都做: reseed 几乎消失、即使触发也快
|
||||
|
||||
工程量都是 ~1-2 周量级,可并行启动。
|
||||
|
||||
### 4.4 不是 local patch
|
||||
|
||||
注意整个 §4.2 列表里没有"调一个 hyperparameter"或者"加一个 CLI flag"这种局部改动。这是 vendor SGLang 内部数据结构的 invariants 重新设计,不能通过更精确的 K 值或更宽的 substring filter 解决。
|
||||
|
||||
---
|
||||
|
||||
## 5. 我们不该再做的事 (anti-patterns)
|
||||
|
||||
防止下个 agent 走同样的局部 patch 路径:
|
||||
|
||||
1. **不要继续调整 `migration_reject_threshold`** — 这个参数只是控制"reject 后多久换 D",跟 evict granularity 无关。调小让 migration 更频繁 → 更多 reseed → 更糟。调大 → blacklist 永久化 (v1 thrashing 问题)。
|
||||
2. **不要 disable migration** — 会让 KVC 退化到 sticky pd-disagg。失去 v2 的 reset-on-success 整体设计。
|
||||
3. **不要 disable admission** — 会砍掉 direct-to-D fast path 这个 KVC 唯一的差异化优势。
|
||||
4. **不要继续 tune `_decode_session_cache_low_watermark_tokens`** — 调高让 LRU 更激进 → 更多 evict → 更糟。调低让 LRU 不触发 → 顶到 retract decode → 更糟。是治标。
|
||||
5. **不要再加 `_ADMISSION_REJECTION_SUBSTRINGS`** — 之前修的 string filter bug (Q2 forensic) 让 migration counter 真的递增,反而暴露了 migration 本身的 reseed 成本。修这个 bug 没错,但显示出 migration 机制本身在 saturated 场景下是负收益。
|
||||
|
||||
---
|
||||
|
||||
## 6. 推荐 Decision Points
|
||||
|
||||
| # | Question | 推荐 |
|
||||
|---|---|---|
|
||||
| D1 | 接受本文的诊断(session-level evict 是根本问题)? | **Yes** |
|
||||
| D2 | 暂停 E1/E2/E3 ablation 线索,集中精力做 §4.2 refactor? | **Yes** (current path 在用 GPU 时间确认已知结论) |
|
||||
| D3 | refactor 在 vendored SGLang 主线(kvc-debug-journey-v1-to-v4)还是新分支? | 新分支 `feat/block-level-evict`(隔离 risk) |
|
||||
| D4 | 同时启动 §4.3 的 D→P sync(`feat/d-to-p-sync` 分支已预留)? | 视团队带宽 |
|
||||
| D5 | 在 refactor 完成前对外的 paper 表述如何处理? | 标"v2 系列在 saturation regime 下的 evict 行为是已识别的 limitation,§future-work 已 propose 修复" |
|
||||
|
||||
---
|
||||
|
||||
## 7. 给下个 agent 的接班
|
||||
|
||||
**如果你接手要做 §4.2 refactor**,按顺序读:
|
||||
|
||||
1. `KVC_ROUTER_ALGORITHM.md` §2-3 — KVC 设计意图
|
||||
2. 本文 §2.1, §2.2 — 实测 evict 行为
|
||||
3. SGLang vendor `mem_cache/radix_cache.py` — 标准 radix LRU 实现细节
|
||||
4. SGLang vendor `mem_cache/session_aware_cache.py` — 当前 SessionSlot 设计
|
||||
5. SGLang vendor `managers/schedule_batch.py` — prepare_for_extend 怎么用 session state
|
||||
6. `docs/RESEED_SLOW_PATH_AND_D_TO_P_GAP_ZH.md` §4 — D→P sync 的工程 scope(互补 work)
|
||||
|
||||
**关键 invariant 不变量**: SessionSlot.restore_to_req 必须保持幂等(chunked prefill 失败可能 retry 多次)。任何 refactor 都要测试此 invariant。
|
||||
|
||||
**关键 testing pattern**: 单元化测试 streaming session 在 LRU 压力下的行为。具体:注入一个 fake `inner.evict()` 返回部分 leaf 被 evict 的状态,断言 SessionSlot.restore_to_req 仍然返回合法 req 状态(不抛 assertion,re-prefill 长度合理)。
|
||||
|
||||
---
|
||||
|
||||
**核心句**: 我们前 3 轮 patch 都在解 saturation 暴露的 secondary 问题(cold-D 偏置、admission 字符串 bug、streaming-session correction 边界),但**真正的 primary 问题是 SessionAwareCache 把 session lifecycle 跟踪和 eviction granularity 决策混在一起**。session 是 lifecycle 边界,**不应该是 eviction 边界**。Eviction 应该交还给 SGLang 已经做得很好的 block-level radix LRU。
|
||||
356
docs/KVC_ROUTER_ALGORITHM.md
Normal file
356
docs/KVC_ROUTER_ALGORITHM.md
Normal file
@@ -0,0 +1,356 @@
|
||||
# KVC-Router:面向 Agentic 多轮 LLM Serving 的 Session-Aware 调度算法
|
||||
|
||||
**性质**:论文级形式化规范——用于团队内部对齐 + 外部读者 onboarding。
|
||||
**对象**:项目团队(统一术语);论文 reviewer(算法定义)。
|
||||
**最近更新**:2026-05-11。
|
||||
|
||||
本文给出本项目所开发的 **KVCache-Centric Router**(以下简称 "KVC-Router")调度算法的形式化、与实现无关的定义。本文设计为可直接被论文引用,并作为"KVC 到底在谈论什么调度算法"的标准回答。
|
||||
|
||||
对应的参考实现位于:
|
||||
- `src/agentic_pd_hybrid/policies.py` — `KvAwarePolicy`、`RoutingState`
|
||||
- `src/agentic_pd_hybrid/replay.py` — orchestration:admission RPC、reset-on-success、fallback chain
|
||||
- `third_party/sglang/python/sglang/srt/managers/scheduler.py` — D-worker 端的 admission 决策
|
||||
|
||||
---
|
||||
|
||||
## 1. 问题定义
|
||||
|
||||
我们要服务一群多轮 agentic LLM session(如 Claude Code、Codex、Cursor 等 coding agent),底层是异构 worker 池,分成:
|
||||
- **Prefill workers**(`P`):GPU 常驻的模型副本,针对长输入 prompt 的 batched prefill 做了优化。
|
||||
- **Decode workers**(`D`):GPU 常驻的模型副本,配备 session-aware KV cache("SessionAwareCache"),具备:(i) 跨 turn 保留 session 的 KV 状态;(ii) 在本地已缓存的 prefix 上做 append-prefill,无需绕回 `P`。
|
||||
|
||||
在一个 agent turn 内,请求 `r` 到达时其对话 prefix 已经从前序 turn 累积;**新增**的 tokens(工具输出、用户消息等)构成小规模 **append**。驱动 KVC 设计的根本观察是:
|
||||
|
||||
> 当 prefix KV **已经驻留在将要解码该请求的 D worker 上**,请求的 first-token 延迟仅由 *append* 大小决定(典型 O(10²–10³) tokens),而非完整 prompt 大小(典型 O(10⁴–10⁵) tokens)。
|
||||
|
||||
Router 的工作就是最大化满足上述条件的请求占比,同时尊重容量约束、不造成 session 无限饿死。
|
||||
|
||||
### 1.1 优化目标
|
||||
|
||||
给定来自 `S` 个 session 的请求流 `R = (r_1, r_2, ...)`,最小化 SLO 加权的 TTFT 与端到端延迟混合:
|
||||
|
||||
```
|
||||
minimize E[ w_ttft · TTFT(r) + w_lat · E2E_Latency(r) ]
|
||||
subject to capacity[d] ≤ K_d 对任意 D worker d 在任意时刻 t,
|
||||
没有 session 被永久拒绝服务.
|
||||
```
|
||||
|
||||
参考实现中通过 measurement 隐式取 `w_ttft = 1, w_lat = 1`;per-D KV 池预算 `K_d` 取 SGLang 启动时上报的 `max_total_num_tokens`。
|
||||
|
||||
---
|
||||
|
||||
## 2. 系统模型与记号
|
||||
|
||||
### 2.1 集合
|
||||
|
||||
| 符号 | 含义 |
|
||||
|---|---|
|
||||
| `P = {p₁, …, p_|P|}` | Prefill worker 池 |
|
||||
| `D = {d₁, …, d_|D|}` | Decode worker 池 |
|
||||
| `S` | Session 标识符集合(由上游 agent runtime 分配) |
|
||||
| `H` | KV block hash 的全集(本实现中每 `BLOCK_TOKEN_BUDGET = 24` tokens 对应一个 hash) |
|
||||
|
||||
### 2.2 请求
|
||||
|
||||
一个请求 `r` 是一个元组:
|
||||
|
||||
```
|
||||
r = ⟨ s(r), t(r), prefix_hashes(r), append_len(r), input_len(r) ⟩
|
||||
```
|
||||
|
||||
其中:
|
||||
- `s(r) ∈ S` — session id
|
||||
- `t(r) ∈ ℕ` — 该 session 内的 turn index(0 = 首轮)
|
||||
- `prefix_hashes(r) ⊂ H` — 覆盖请求输入 prefix 的 block hash 集合
|
||||
- `append_len(r) ∈ ℕ` — 新到达、**不在** `prefix_hashes(r)` 中的 token 数
|
||||
- `input_len(r) = (|prefix_hashes(r)| · 24) + append_len(r)` — 总 token 数
|
||||
|
||||
### 2.3 Router 状态 (`Σ`)
|
||||
|
||||
Router 跨请求维护的全局状态:
|
||||
|
||||
| 字段 | 类型 | 语义 |
|
||||
|---|---|---|
|
||||
| `resident[d]` | `set[H]` | Router 估计的 D `d` 当前 SessionAwareCache 中常驻的 block hash 集合(router 端估计,真值在 worker 上) |
|
||||
| `pin[s]` | `D ∪ {⊥}` | Session `s` 最近一次成功服务的 D;`⊥` 表示从未见过 |
|
||||
| `inflight[d]` | `ℕ` | 当前已派发给 `d` 但尚未完成的请求数 |
|
||||
| `assigned[d]` | `ℕ` | 累计派发到 `d` 的路由决策次数(负载 tie-breaker) |
|
||||
| `rejects[s,d]` | `ℕ` | per-(session, D) 的 admission 拒绝计数(v2 引入的 migration 机制) |
|
||||
|
||||
### 2.4 超参数
|
||||
|
||||
| 符号 | 默认值 | 描述 |
|
||||
|---|---|---|
|
||||
| `α`(`sticky_bonus`) | 1 | 匹配 `pin[s]` 的 D 在评分中获得的 bonus |
|
||||
| `τ_reject`(`migration_reject_threshold`) | 3 | (s, d) 被拒绝达此次数后,d 对 s 进入 blacklist |
|
||||
| `τ_append`(`kvcache_direct_max_uncached_tokens`) | 8192(v2) | 走 Direct-to-D 路径允许的最大 append 长度 |
|
||||
| `K_d` | 取自 SGLang `max_total_num_tokens` | per-D 的 KV 池预算 |
|
||||
| `ρ` | 0.95 | 容量高水位线(隐式由 SGLang 强制) |
|
||||
| `ε`(最大 fallback 重试数) | `|D| - 1` | router 在退化到 vanilla PD-disagg 之前最多探测几个 D |
|
||||
|
||||
### 2.5 路由结果
|
||||
|
||||
路由决策 `δ(r)` 取以下四种之一:
|
||||
|
||||
| Mode | 含义 | KV transfer |
|
||||
|---|---|---|
|
||||
| `Direct(d)` | r 完全在 D `d` 上执行;D 在其常驻 KV 上做 append | **无**(快路径) |
|
||||
| `Seed(d)` | Session 首轮:P 做完整 prefill,KV 通过 mooncake 传到 `d` | 完整 input |
|
||||
| `Reseed(d)` | Session 之前在某个 D' 上,但已不再常驻;按 Seed 处理 | 完整 input |
|
||||
| `Fallback(p, d)` | Vanilla pd-disagg 路径(其它 D 均被 blacklist 或拒绝) | 完整 input |
|
||||
|
||||
---
|
||||
|
||||
## 3. 算法
|
||||
|
||||
KVC-Router 由三个相互配合的过程组成:
|
||||
- **Algorithm 1 (`Route`)**:router 端基于评分的候选选择。
|
||||
- **Algorithm 2 (`Admit`)**:D-worker 端的 admission 决策(在 D scheduler 中执行,非 router)。
|
||||
- **Algorithm 3 (`Dispatch`)**:端到端 orchestration,把 Route + Admit + reset-on-success 串起来。
|
||||
|
||||
### 3.1 Algorithm 1:`Route(r, Σ)` — 基于评分的候选选择
|
||||
|
||||
```
|
||||
输入:请求 r,状态 Σ
|
||||
输出:候选 d* ∈ D(若所有 D 都被过滤后仍无候选,退化分支兜底返回最少被拒的 D)
|
||||
|
||||
1. blacklisted ← { d ∈ D : Σ.rejects[s(r), d] ≥ τ_reject }
|
||||
2. C ← D ∖ blacklisted // 候选 D 集合
|
||||
3. if C = ∅ : // 退化
|
||||
4. return argmin_{d ∈ D} Σ.rejects[s(r), d] // 选最少被拒的 D
|
||||
5. for each d ∈ C :
|
||||
6. overlap(d) ← |prefix_hashes(r) ∩ Σ.resident[d]|
|
||||
7. sticky(d) ← 1 if Σ.pin[s(r)] = d else 0
|
||||
8. infl(d) ← Σ.inflight[d]
|
||||
9. assn(d) ← Σ.assigned[d]
|
||||
10. score(d) ← ⟨ overlap(d) + α·sticky(d), // 主项
|
||||
sticky(d), // tie-1
|
||||
−infl(d), // tie-2(负载小者占优)
|
||||
−assn(d) ⟩ // tie-3
|
||||
11. return argmax_{d ∈ C} score(d) // 按字典序最大
|
||||
```
|
||||
|
||||
**说明**:
|
||||
- 评分是 **4 元组按字典序比较**,不是单个标量——这样避免在不同维度之间调权重。
|
||||
- 第 10 行的主项 `overlap + α·sticky` 同时奖励 KV 复用与 session stickiness。取 `α=1`、`overlap` 以 block(24 tokens)为单位时,**任何一次 hash 命中都压制纯 sticky 的候选**。
|
||||
- 第 1–4 行的 blacklist 过滤防止永久绑死在已饱和的 D 上;与 Algorithm 3 的 reset-on-success 配合,限定了 migration 频率。
|
||||
|
||||
### 3.2 Algorithm 2:`Admit(d, r, M, K)` — D-worker admission 决策
|
||||
|
||||
在 D worker 自己的 scheduler 内部执行(非 router),这是 **KVC 的机制核心**:每个 D 自治判断能否把 `r` 当作 Direct(append-only)服务,还是必须改走 P 路径。
|
||||
|
||||
```
|
||||
输入:D worker d,请求 r,d 上本地常驻的 session 集合 M_d,KV 池预算 K_d
|
||||
输出:⟨can_admit ∈ {True, False}, mode ∈ {Direct, Seed, Reseed, ⊥}, reason⟩
|
||||
|
||||
1. used_tokens ← Σ_{s' ∈ M_d} resident_tokens(s', d) // D 自己的 bookkeeping
|
||||
2. cap_ok ← (used_tokens + input_len(r)) ≤ ρ · K_d // 高水位线 ρ ≈ 0.95
|
||||
|
||||
3. if s(r) ∈ M_d : // session 在 d 上有常驻
|
||||
4. if append_len(r) ≤ τ_append and cap_ok :
|
||||
5. return ⟨True, Direct, ∅⟩ // → 快路径
|
||||
6. elif append_len(r) > τ_append :
|
||||
7. return ⟨False, ⊥, "real-large-append"⟩
|
||||
8. else :
|
||||
9. return ⟨False, ⊥, "no-d-capacity"⟩
|
||||
|
||||
10. else : // session 在 d 上无常驻
|
||||
11. if cap_ok :
|
||||
12. mode ← Seed if t(r) = 0 else Reseed
|
||||
13. return ⟨True, mode, ∅⟩ // → 经 P 做 KV seeding
|
||||
14. else :
|
||||
15. return ⟨False, ⊥, "session-not-resident-no-capacity"⟩
|
||||
```
|
||||
|
||||
**说明**:
|
||||
- 该过程通过同步 HTTP RPC(`/admit_direct_append`)从 router 调用。RPC 阻塞直到 D scheduler 给出权威答复——这是 v5 引入的 **"worker-mode admission"**,替换了更早的 router-端容量估算(系统性偏乐观)。
|
||||
- reason 字符串被回传给 router,用于:(i) 在 Algorithm 3 中驱动 fallback chain;(ii) 标注 `execution_mode` 字段便于分析。
|
||||
|
||||
### 3.3 Algorithm 3:`Dispatch(r, Σ)` — 端到端 orchestration
|
||||
|
||||
```
|
||||
输入:请求 r,状态 Σ
|
||||
输出:执行模式 μ ∈ {Direct, Seed, Reseed, Fallback}
|
||||
|
||||
1. retries ← 0
|
||||
2. tried ← ∅
|
||||
3. while retries < ε :
|
||||
4. d* ← Route(r, Σ \ {对 tried 中的 d 已 bump 过的 rejects})
|
||||
5. if d* = ⊥ : break // 无候选
|
||||
6. resp ← Admit(d*, r) // RPC 到 D scheduler
|
||||
7. if resp.can_admit :
|
||||
8. Σ.rejects[s(r), d*] ← 0 // ◀ reset-on-success(v2)
|
||||
9. Σ.pin[s(r)] ← d*
|
||||
10. Σ.inflight[d*] ← Σ.inflight[d*] + 1
|
||||
11. if resp.mode = Direct :
|
||||
12. 在 d* 上完整执行 r(append-prefill + decode)
|
||||
13. return Direct
|
||||
14. else : // Seed 或 Reseed
|
||||
15. p ← round_robin_next(Σ, P)
|
||||
16. 在 p 上做 r 的 prefill
|
||||
17. 经 mooncake 把 KV(r) 从 p 传到 d*
|
||||
18. 在 d* 上 decode r
|
||||
19. return resp.mode
|
||||
20. else :
|
||||
21. Σ.rejects[s(r), d*] ← Σ.rejects[s(r), d*] + 1
|
||||
22. tried ← tried ∪ {d*}
|
||||
23. retries ← retries + 1
|
||||
24.
|
||||
25. // ε 次重试耗尽——退化 Fallback 到 vanilla pd-disagg
|
||||
26. p ← round_robin_next(Σ, P)
|
||||
27. d ← round_robin_next(Σ, D)
|
||||
28. 通过 ⟨p, d⟩ 走 pd-disagg(r)
|
||||
29. return Fallback
|
||||
```
|
||||
|
||||
**维持的关键不变量**:
|
||||
|
||||
1. **不会静默过载**:一个 D 永不接受会让 `used_tokens > ρ · K_d` 的请求(Algorithm 2 第 2 行)。
|
||||
2. **不存在永久饿死**:对任意 session `s`,只要曾在某 D `d*` 上成功过一次,之后 `Σ.rejects[s, d*] = 0`(Algorithm 3 第 8 行)。因此 blacklist 计数器不会对仍在某处成功获得服务的 session 累积——这阻止了 **v1 的 thrashing 病理**:原本 blacklist 计数器单调增长 + 退化 fallback 形成自放大的 round-robin 死循环。
|
||||
3. **migration 有界**:一个 session 从 D `a` 迁移到 D `b` 必须经过连续 `τ_reject` 次在 `a` 上失败、期间无任何成功。每个 session 生命周期内的最坏 migration 次数 ≤ `(|D| − 1) · τ_reject`。
|
||||
|
||||
### 3.4 Reset-on-success:为什么这是关键修复(v1 → v2 演化)
|
||||
|
||||
v1 实现**省略了** Algorithm 3 第 8 行——一旦 `(s, d)` 累积 `τ_reject` 次拒绝,d 对该 session **整个 run 永久 blacklist**。实测(Migration v1,见 `docs/MIGRATION_V1_FINDINGS_ZH.md`)触发了自放大的失效模式:
|
||||
|
||||
```
|
||||
session s 在 d 上稳定服务 70 个 turn
|
||||
↓ 瞬时 burst 让 d 短暂饱和
|
||||
3 次到 d 的 admission 被拒 → rejects[s,d] = 3 → d 对 s 永久 blacklist
|
||||
↓ s 迁到 d',d' 也在负载中 → 被拒 → blacklist
|
||||
↓ d'' 同理
|
||||
所有 D 都 blacklist → 退化 fallback round-robin → 每次重试都 bump 一次计数器
|
||||
→ s 永远在 D 之间 thrashing,每次都丢失 KV residency
|
||||
```
|
||||
|
||||
reset-on-success 关上了这个回路:只要 `s` 在任一 d 上真正完成一次 Direct,针对该 session 的 blacklist 立刻清零。该机制只对**持续性**(不是瞬时性)容量压力触发。
|
||||
|
||||
---
|
||||
|
||||
## 4. 性质
|
||||
|
||||
### 4.1 Theorem 1(在有界 ε 下无永久饿死)
|
||||
|
||||
*假设 `τ_reject ≥ 1` 且每个 D worker 的容量非零。则对任意能在 admission 时容下的 session `s`,Algorithm 3 在至多 `|D| · τ_reject` 次重试内返回 `{Direct, Seed, Reseed}` 之一;之后任意一次 Direct 成功即可清空 `s` 的所有 blacklist。*
|
||||
|
||||
**证明概要**:每次循环要么成功(return)、要么恰好让某个 `rejects[s, d]` 计数器 +1(第 21 行)。经过 `|D| · τ_reject` 次迭代后,每个 D 要么对 `s` 已被 blacklist(`Route` 第 1 行会过滤),要么已成功(已终止)。在所有 D 都被 blacklist 的饱和点,`Route` 第 3 行返回最少被拒的 D,打破对称性,强制取得进展。∎
|
||||
|
||||
### 4.2 Theorem 2(fast-path 命中下限)
|
||||
|
||||
*假设 session `s` 在 D `d` 上已积累 KV residency `R_s ⊂ H`,且在某 turn `t > 0` 提交的请求 `r` 满足 `prefix_hashes(r) ⊆ R_s`、`append_len(r) ≤ τ_append` 且 admission 容量充足。则 Algorithm 3 将 `r` 路由为 Direct(d)。*
|
||||
|
||||
**证明概要**:由 Algorithm 1,`overlap(d) = |R_s|` 取得最大值;结合 `α·sticky(d) ≥ 1`,d 的字典序得分严格高于任何 `prefix_hashes(r) ⊈ R_{s,d'}` 的 d'。故 `Route` 返回 d。`Admit(d, r)` 进入 `s ∈ M_d ∧ append ≤ τ_append ∧ cap_ok` 分支,返回 Direct。∎
|
||||
|
||||
这是 **支持架构设计的机制级保证**:只要 residency、append 大小、容量三者同时成立,快路径就被**确定性地**选中;KVC 在典型场景下的 TTFT 优势是结构性属性,不是概率性。
|
||||
|
||||
### 4.3 复杂度
|
||||
|
||||
每个请求:
|
||||
- `Route`:`O(|D|)`(每个候选 D 算一次 score)。生产规模下 `|D| ≤ 8`,主要开销在 Python 层,≪ 1 ms。
|
||||
- `Admit`:D scheduler 内部 O(1)(查自己的 bookkeeping,无全局锁)。
|
||||
- Router 层的单请求总开销:`O(|D|)` 计算 + 1 次到目标 D 的 HTTP RTT(loopback 亚毫秒,跨机数据中心约 1 ms)。
|
||||
|
||||
---
|
||||
|
||||
## 5. 与 baseline 的对比
|
||||
|
||||
| 性质 | Vanilla pd-disagg | DP(cache-aware) | **KVC-Router**(本文) |
|
||||
|---|---|---|---|
|
||||
| P/D 分离 | 是(`|P| + |D|` GPU) | 否(每个 worker fused P+D) | 是 |
|
||||
| 跨 turn cache locality | 无(每个请求都 P→D 传 KV) | 仅在单 fused worker 内部走 hash prefix 路由 | session 钉在某 D 上,本地 append-prefill |
|
||||
| 同 session cache 集中度 | 无 | 散到 `|D|` 个 worker(每个占 1/|D|) | 集中在一个 D(整段常驻) |
|
||||
| 最坏 turn-2 prefill 工作量 | 完整 input 经 P→mooncake→D | 在目标 worker 上做完整 prefill(带 prefix cache 命中) | 本地 `append_len ≤ τ_append` tokens |
|
||||
| 容量感知 admission | 无(router 盲发) | 隐式靠 worker 队列深度 | 显式的 per-D `Admit()` 决策 |
|
||||
| Migration 机制 | N/A | N/A | 带 reset-on-success 的 reject-counter blacklist |
|
||||
| Idle prefill 成本 | 是——P 永远在算 | 否 | 是——P 只在 cache miss 时启用(本工作 SWE-Bench 评测下约 8% 请求) |
|
||||
|
||||
KVC 的关键架构权衡:**用 P 端 GPU 闲置换 D 端 TTFT 稳定性**。在 per-session cache 复用率高的 agentic workload 上(Inferact 的 Codex trace 报告 94.2% cache hit;我们的 SWE-Bench replay 实测 91.6% Direct 命中),这个交换显著有利。在 session 短或 cache hit 低的 workload 上,权衡反转、DP 胜出。
|
||||
|
||||
---
|
||||
|
||||
## 6. 符号速查表
|
||||
|
||||
| 符号 | 含义 |
|
||||
|---|---|
|
||||
| `P, D` | Prefill / Decode worker 池 |
|
||||
| `s(r), t(r)` | 请求 r 的 session id 与 turn index |
|
||||
| `prefix_hashes(r)` | r 输入 prefix 的 KV block hash |
|
||||
| `append_len(r)` | r 中新增(未缓存)部分的 token 数 |
|
||||
| `Σ.resident[d]` | Router 对 d 缓存 block 集合的估计 |
|
||||
| `Σ.pin[s]` | session s 最近一次成功的 D |
|
||||
| `Σ.rejects[s,d]` | per-(s,d) 的 admission 拒绝计数 |
|
||||
| `α` | sticky bonus 权重(默认 1) |
|
||||
| `τ_reject` | migration 阈值(默认 3) |
|
||||
| `τ_append` | Direct 路径允许的 max append 大小(v2 默认 8192) |
|
||||
| `K_d` | D worker d 的 KV 池预算 |
|
||||
| `ρ` | 容量高水位(默认 0.95) |
|
||||
| `ε` | fallback 重试上限(默认 `|D| − 1`) |
|
||||
| `δ(r)` | 路由决策:`Direct(d)` / `Seed(d)` / `Reseed(d)` / `Fallback(p, d)` |
|
||||
|
||||
---
|
||||
|
||||
## 7. 本工作评测中实际使用的默认参数
|
||||
|
||||
| 参数 | 取值 | 说明 |
|
||||
|---|---|---|
|
||||
| `|P|, |D|` | 1, 3(1P3D 配置) | 单机 4× H100 80GB |
|
||||
| `α` | 1 | |
|
||||
| `τ_reject` | 3 | |
|
||||
| `τ_append` | 8192 | v2 调优后取值(v0/v1 用 2048) |
|
||||
| `K_d` | 92104 tokens | SGLang 按 `mem_fraction_static=0.835` 自动算出 |
|
||||
| `ρ` | 隐式 ~0.95 | 由 SGLang 的 `max_total_num_tokens` 强制 |
|
||||
| `ε` | 2 | `|D| − 1 = 2` |
|
||||
| 每次 run 的 session 数 | 52 | SWE-Bench 50sess trace |
|
||||
| 总请求数 | 4449 | |
|
||||
| Time-scale | 1.0(真实 trace 时序) | |
|
||||
| 并发 | 32 | |
|
||||
|
||||
---
|
||||
|
||||
## 8. Anti-patterns(KVC **不**是什么)
|
||||
|
||||
1. **KVC 不仅仅是 kv-aware routing**。DP 和 KVC 都可以跑 `kv-aware` policy;KVC 在此之上加了三件事:(i) session 钉定,(ii) worker 端 admission,(iii) 带 reset-on-success 的 migration。如果在比较 "KVC vs DP" 时缺这三个要素的任何一个,**测的就不是 KVC 与 DP 的差异**。
|
||||
|
||||
2. **KVC 在 policy 项里不直接感知容量**。`Route` 不查 per-D 容量;容量感知完全经由 `Admit` 拒绝来传导。我们刻意做了这层分层——把容量判断放进 `Route` 会引入"换 D"的决策空间,导致 orphan KV 滞留问题。
|
||||
|
||||
3. **KVC 不保证 load balance**。一个 session 若能舒服地装在某个 D 上,可能永远钉在那里,而其它 D 大部分时间空闲。在低容量压力下这是设计意图;高压力下 Theorem 1 的 migration 会触发再均衡。
|
||||
|
||||
4. **`Fallback` 不是"降级路径"**。它和 vanilla pd-disagg 请求结构性等价,延迟特征相同。KVC 的价值在于让 Fallback 占比在典型 agentic workload 下 ≪ 10%。
|
||||
|
||||
---
|
||||
|
||||
## 9. 公开问题(reviewer 关注点)
|
||||
|
||||
以下问题在当前评测中尚未解决,主动列出以保持透明:
|
||||
|
||||
1. **Session 钉定相对于纯 P/D disaggregation 的边际贡献是多少?** 需要 `naive 1P3D` 对照实验(vanilla SGLang xPyD,不带 KVC 层)——仓库当前缺失(见 `docs/V2_DEEP_ANALYSIS_ZH.md §4.7`)。
|
||||
|
||||
2. **Algorithm 3 在更高压下行为如何**(例如 ts=10 加速、session 数 ≫ |D|·K_d/peak_input)?当前 ts=1 评测对应真实 agentic 区间,但算法在更高负载下的鲁棒性未经实验验证。
|
||||
|
||||
3. **真 RDMA 下的 reseed 代价**:本次评测的 3–7 s reseed 延迟由两段组成——P 端 re-prefill(1.5-3s)+ P→D mooncake transfer(1.5-4s)。当前 sweep 用的是 TCP loopback;启用 IB/RoCE(节点有 mlx5_0/_1 @ 200 Gb/s × 2 active,需在 sweep 加 `--force-rdma --ib-device mlx5_0`)只能压缩 transfer 段到 ~200ms,**不动 re-prefill 段**。预期 TTFT p99 从 1.28s 降到 ~0.7s(仍输 DP 0.43s)。待独立验证。
|
||||
|
||||
4. **D→P 增量 KV 同步(核心 future-work 缺口)**:reseed 长尾的真正消除需要让 P 端 backup 跟上 D 的 direct-to-D append 增长。经独立 forensic 审查,**当前代码、vendored SGLang、mooncake 三层均无 D→P KV transfer 实现**:mooncake `MooncakeKVManager` 是 PREFILL=sender / DECODE=receiver 的硬角色分支(`add_transfer_request` 上有 `assert disaggregation_mode == PREFILL` 硬约束),`BaseKVSender` / `BaseKVReceiver` 抽象无 bidirectional slot,`session_aware_cache.release_session` 在驱逐时只调 `kv_pool_allocator.free()` 无出站,`_commit_prefill_backup_residency` 唯一 caller 是 seed/reseed 路径;`capacity-backup` policy 的真实语义只是"reseed 完不关 P streaming session"——backup 是 seed-time 的静态快照,不随 direct-to-D append 同步。要实现 D→P 增量同步,工程量 ~1-2 周,最难的不是 mooncake 加 D-sender / P-receiver 角色(~400 LOC),而是 **SGLang radix tree 改成允许从外部 worker 喂数据**——radix cache 当前假设单一生产者(本 worker model 输出)。这是论文里最值得做的 contribution 之一。
|
||||
|
||||
5. **v2 代码路径下的确定性**:v0 代码库的 ts=1 N=3 categorical 确定性已经证实;新增的 reset-on-success 分支与 threshold=8192 路径未被独立 re-validate。两个额外的 N=1 run 即可解决。
|
||||
|
||||
---
|
||||
|
||||
## 10. 论文引用建议
|
||||
|
||||
论文中提到本算法时建议表述:
|
||||
|
||||
> "We use the KVC-Router scheduling algorithm (Algorithms 1–3 of [our paper], formally defined in our supplementary materials). The router selects a decode worker by lexicographic scoring on `(overlap+α·sticky, sticky, −inflight, −assigned)` (Algorithm 1), defers the admission decision to the chosen worker via a synchronous RPC (Algorithm 2), and maintains a per-(session, decode worker) rejection counter that is reset on every successful Direct admission (Algorithm 3). This last detail — reset-on-success — is what distinguishes our v2 from the unstable v1 implementation that exhibits self-amplifying session thrashing."
|
||||
|
||||
---
|
||||
|
||||
**附录 A — 算法步骤到代码实现的对照**
|
||||
|
||||
| 算法步骤 | 文件 | 符号 |
|
||||
|---|---|---|
|
||||
| `Route` 第 5–11 行 | `policies.py:189–202` | `KvAwarePolicy.select` 内层循环 |
|
||||
| `Route` 第 1–4 行(blacklist 过滤 + 退化分支) | `policies.py:182–187, 204–211` | `migration_reject_threshold`,`select` 的 fallback |
|
||||
| `Admit` | `third_party/sglang/python/sglang/srt/managers/scheduler.py` | `handle_admit_direct_append_request` |
|
||||
| `Dispatch` 第 8 行(reset-on-success) | `replay.py: _run_request` | finish 路径中的 reset |
|
||||
| `Dispatch` 第 21 行(记录 reject) | `replay.py: _run_request` | `state.record_admission_reject(...)` |
|
||||
| 超参数 `τ_append` | CLI flag | `--kvcache-direct-max-uncached-tokens` |
|
||||
| 超参数 `τ_reject` | CLI flag | `--kvcache-migration-reject-threshold` |
|
||||
283
docs/MIGRATION_V1_FINDINGS_ZH.md
Normal file
283
docs/MIGRATION_V1_FINDINGS_ZH.md
Normal file
@@ -0,0 +1,283 @@
|
||||
# Migration v1 实验发现:blacklist 永久性导致 thrashing
|
||||
|
||||
**日期**:2026-05-08
|
||||
**状态**:v1 run 进行中(~23% 完成时的中期分析)
|
||||
**前置文档**:
|
||||
- `docs/REFACTOR_PLAN_V1_ZH.md` §6.2(v1 设计)
|
||||
- `docs/TEAM_REPORT_AGENTIC_PD_HYBRID_ZH.md` §2.1(§1 starvation claim)
|
||||
|
||||
**触发**:v1 实现的 session migration(rejection blacklist 机制)部署后,观测到 session-level thrashing——某些 session 在 3 个 D 之间 round-robin 高达 75-116 次。本文记录中期数据、根因诊断、v2 设计。
|
||||
|
||||
---
|
||||
|
||||
## 0. TL;DR
|
||||
|
||||
1. **v1 修复了 §1 starvation 但引入了新的 thrashing 失效模式**——不是 admission 过严,是 blacklist 永久累积的设计 bug
|
||||
2. **核心证据**:session 6880 在 decode-1 上稳定 70 turns,然后某瞬时 burst 把 reject 计数累积到阈值,被永久 blacklist,之后陷入 3-D 间 round-robin 死循环
|
||||
3. **85% admission 拒绝是 `session-not-resident`**——非 D 真容量问题,而是迁移后"新 D 第一次见你"的正常语义
|
||||
4. **v2 设计**:reset-on-success 让 reject 计数在成功 turn 后清零,只有**持续**失败才迁移
|
||||
5. **深层观察**:baseline 的"100% pin 但稳定"可能比"分布均匀但 thrashing"更好——糟糕的优化可能比不优化还糟
|
||||
|
||||
---
|
||||
|
||||
## 1. v1 实施回顾
|
||||
|
||||
### 1.1 改动文件
|
||||
- `src/agentic_pd_hybrid/policies.py`:`RoutingState.session_d_rejects` Counter;`KvAwarePolicy.migration_reject_threshold=3` skip blacklisted D;degenerate fallback 选最少拒的 D
|
||||
- `src/agentic_pd_hybrid/replay.py`:`_run_request` 末尾 `state.record_admission_reject(sess, D)`(基于 execution_mode 子串匹配);`_fallthrough_reason` 把 `pd-router-fallback-large-append-*` 拆成 `session-not-resident` / `real-large-append` / 等
|
||||
- CLI / benchmark wiring
|
||||
|
||||
### 1.2 v1 假设(事后看部分错误)
|
||||
- "reject 计数 + 阈值 3 = 容忍短期波动 + 持续失败迁移" ← **错**,counter 永久增长导致迁移成必然
|
||||
- "迁移到新 D 后 session 在新 D 稳定下来" ← **部分错**,迁移到的新 D 也很可能很快 reject
|
||||
- "session-not-resident 不会触发计数" ← **大致对**,但下游 fallback 可能间接触发
|
||||
|
||||
---
|
||||
|
||||
## 2. 中期数据(1023/4449 reqs,~23%)
|
||||
|
||||
### 2.1 头部指标 vs baseline
|
||||
|
||||
| 指标 | baseline kvc_1p3d_run1 | v1(中期) |
|
||||
|---|---:|---:|
|
||||
| Per-D 调用分布 | 1502/1445/1502(±3.8%)| 796/785/779(**±1.1%**,更均衡)|
|
||||
| Per-D 峰值 token_usage | 0.99/0.99/0.99 | 0.31/0.30/0.00(**容量充裕**,未顶到 1.00)|
|
||||
| KVTransferError | 5(全程)| 6(中期,趋势相近)|
|
||||
| 已见 sessions | 52(全程)| 29(中期)|
|
||||
|
||||
**好的方面**:
|
||||
- 负载均衡度跃升(±26%→±1.1% if normalized)
|
||||
- D 容量从未饱和——§2 假设的"D drain time"机制配合 ts=1 充分发挥
|
||||
- 0 sessions 永久 stuck 在饿死状态
|
||||
|
||||
### 2.2 Migration 触发情况(已见 29 sessions)
|
||||
|
||||
| 类别 | 数量 | 占比 |
|
||||
|---|---:|---:|
|
||||
| 仍 pin 在 1 个 D | 9 | 31% |
|
||||
| 触碰 2 个 D | 3 | 10% |
|
||||
| **触碰所有 3 个 D** | **17** | **59%** |
|
||||
|
||||
**D-切换次数分布**:
|
||||
- mean = 26 次/session
|
||||
- median = 16 次
|
||||
- **max = 116 次**
|
||||
- 15 sessions 切换 >10 次(明显 thrashing)
|
||||
- **6 sessions 切换 >50 次**(严重 thrashing)
|
||||
|
||||
---
|
||||
|
||||
## 3. 根因诊断:session 6880 的轨迹
|
||||
|
||||
### 3.1 数据
|
||||
|
||||
```
|
||||
turn 0-70: 全部在 decode-1 (71-turn 稳定 streak) ← §1 baseline 行为
|
||||
turn 71-150: 在 3 个 D 间剧烈 thrashing
|
||||
decode-0: 26 个短 streak
|
||||
decode-1: 25 个短 streak
|
||||
decode-2: 25 个短 streak
|
||||
平均 streak 长度 = 2 turns
|
||||
total streaks = 76
|
||||
```
|
||||
|
||||
### 3.2 解读
|
||||
|
||||
**前 70 turn 完美稳定**:session 6880 在 decode-1 上正常运行 70 个 turn,每次都成功,是 baseline §1 "100% pin" 的复现——稳定但不公平(其他 session 没分到 decode-1 的资源)。
|
||||
|
||||
**第 71 turn 后崩溃**:
|
||||
1. 某个瞬时 burst(其他 session 的活动?)让 decode-1 短暂饱和
|
||||
2. session 6880 在 decode-1 上连续 3 次被 admission 拒(`no-space` 或 `d-session-cap`)
|
||||
3. v1 的 `state.session_d_rejects[(6880, decode-1)]` 累积到 3 → blacklist
|
||||
4. policy 改选 decode-0 → 同样发生 → blacklist
|
||||
5. 改选 decode-2 → 同样 → blacklist
|
||||
6. **3 D 全部 blacklisted** → degenerate fallback 在 3 D 间 round-robin
|
||||
7. 每次 round-robin 又触发新 reject → 计数继续涨 → 永远在 thrashing 死循环
|
||||
|
||||
### 3.3 admission 数据交叉验证
|
||||
|
||||
中期 1932 admission events 解构:
|
||||
|
||||
| mode × can_admit × reason | count |
|
||||
|---|---:|
|
||||
| `direct_append, True, None` | 1721(成功)|
|
||||
| `direct_append, False, session-not-resident` | **62** |
|
||||
| `seed, True, None` | 142(成功)|
|
||||
| `seed, False, no-space` | **11** |
|
||||
|
||||
**只有 11 个 "no-space" 才是真容量拒绝**(占总 admission 的 0.6%)。62 个 "session-not-resident" 是迁移后"新 D 第一次见你"的正常语义。
|
||||
|
||||
但因为 v1 用 `_is_admission_rejection_mode` 通过 execution_mode 子串匹配,下游 fallback chain 会把 `session-not-resident` 也间接累积到计数器(fallback 链路本身可能触发 session-cap)。
|
||||
|
||||
---
|
||||
|
||||
## 4. 设计 bug 三层
|
||||
|
||||
### 4.1 Bug 1:blacklist 永久性
|
||||
|
||||
```python
|
||||
# policies.py 当前实现
|
||||
if rejects >= self.migration_reject_threshold:
|
||||
continue # skip this D forever
|
||||
```
|
||||
|
||||
`session_d_rejects[(sess, D)]` 是单调递增 Counter。一旦达到阈值,**永远**被 skip。但 D 的容量是动态的——70 个 turn 后短暂饱和不代表它后续不能服务这个 session。
|
||||
|
||||
### 4.2 Bug 2:degenerate fallback 加剧问题
|
||||
|
||||
当所有 D 都被 blacklist:
|
||||
```python
|
||||
best_decode_worker_id = min(
|
||||
(w.worker_id for w in topology.route_workers),
|
||||
key=lambda wid: state.session_d_rejects.get((sess, wid), 0),
|
||||
)
|
||||
```
|
||||
选"最少被拒"的 D。但每次 fallback 又增加该 D 的计数 → 下次选另一个 D → 形成完美 round-robin,永远走不出 thrashing。
|
||||
|
||||
### 4.3 Bug 3:信号归并粗糙
|
||||
|
||||
`_is_admission_rejection_mode` 子串匹配 `session-cap` / `no-d-capacity` / `d-backpressure`,但执行链路可能这样:
|
||||
|
||||
```
|
||||
direct_append → session-not-resident(85% 占比,正常迁移后语义)
|
||||
→ fallback 试 seed
|
||||
→ seed admit ok(142/153 = 93%)→ execution_mode = pd-router-d-session-reseed-*(不计 reject)
|
||||
→ seed no-space(11/153 = 7%)→ execution_mode = pd-router-fallback-X-no-d-capacity(计 reject)
|
||||
```
|
||||
|
||||
绝大多数 fallback 不会触发 reject 计数。但 thrashing 一旦开始,很容易踩到那 7% no-space 路径,calculator 增长一次。15+ 次 thrashing 后,单 D 计数累到 3 完全可能。
|
||||
|
||||
**所以设计 bug 不在信号粗糙,而在永久累积 + degenerate round-robin。**
|
||||
|
||||
---
|
||||
|
||||
## 5. 深层观察:稳定 vs 公平的 trade-off
|
||||
|
||||
| | baseline(v0)| v1 |
|
||||
|---|---|---|
|
||||
| 公平性 | 18/52 永久饿死 | 0 永久饿死 |
|
||||
| 稳定性 | 100% pin(结构稳定)| 6/29 严重 thrashing |
|
||||
| Per-D 负载均衡 | ±26% | ±1.1% |
|
||||
| 大 session 体验 | 慢但稳定(每 turn 都走 fallback ~1.0s)| 不稳定 + 频繁 D 切换 + 丢 KV state |
|
||||
|
||||
**预想反直觉的结果**:v1 在头部指标(per-D 均衡)赢,但在 session 体验可能输——
|
||||
- baseline 的 fallback 路径有稳定 ~1s latency
|
||||
- v1 的 thrashing session 每次 D 切换都 close 旧 session、丢 KV、新 D 上重新建立——有可能 latency 反而更高
|
||||
|
||||
需要等 run 结束的 lat mean / TTFT mean 数据验证。**糟糕的优化可能比不优化还糟。**
|
||||
|
||||
---
|
||||
|
||||
## 6. v2 设计
|
||||
|
||||
按 ROI 排序的修复层。**先做 #1,验证后再决定是否需要 #2/#3**。
|
||||
|
||||
### 6.1 v2-fix-1:reset-on-success(最高 ROI)
|
||||
|
||||
```python
|
||||
# replay.py _run_request 末尾,在 state.finish 后
|
||||
if execution.execution_mode == "kvcache-direct-to-d-session":
|
||||
# 这次 direct-to-D 成功 = D-X 仍能服务这个 session
|
||||
# 清零累积的 reject 计数(消除永久 blacklist)
|
||||
state.session_d_rejects[(request.session_id, decision.decode_worker_id)] = 0
|
||||
```
|
||||
|
||||
**预测效果**:
|
||||
- session 6880 在 decode-1 上 70 个成功 turn 把计数反复清零
|
||||
- 即使中间出现 1-2 次瞬时 reject,下次成功立刻清零
|
||||
- 只有**持续**失败(reject 后 reject 后 reject,没有夹杂 success)才能累到阈值
|
||||
- 真饿死的 session(如 35680/39360 input >92K)才会触发迁移
|
||||
|
||||
**工程量**:~5 行代码 + 1 个 smoke + 1 个完整 run(~5.5h)
|
||||
|
||||
### 6.2 v2-fix-2:sliding window(如果 #1 不够)
|
||||
|
||||
把 `Counter` 改成 `dict[(sess, D), deque[float]]` 存最近 K 次拒绝时间戳。判断时用最近 N 秒(或 N 个 turn)内的次数。
|
||||
|
||||
更稳健但更复杂。**若 #1 已能彻底解决 thrashing,跳过此项。**
|
||||
|
||||
### 6.3 v2-fix-3:reject 类型分离(如果 #1 + #2 不够)
|
||||
|
||||
把 admission reason 显式传到 _run_request,区分:
|
||||
- `no-space` / `session-cap` / `backpressure` → 计 reject
|
||||
- `session-not-resident` → 不计
|
||||
|
||||
需改 `ExecutionResult` 加 `admission_reject_reason` 字段,并在 fallback 链路传递。**不在第一轮**——先看 #1 是否够用。
|
||||
|
||||
### 6.4 v2 应保留的 v1 设计
|
||||
|
||||
- 阈值 3(不变)
|
||||
- `record_admission_reject` 的子串匹配(不变)
|
||||
- 新 fallback labels(`session-not-resident` 等)(不变)
|
||||
- degenerate fallback 选最少拒的 D(不变,但因为 reset-on-success 几乎不会触发到此分支)
|
||||
|
||||
---
|
||||
|
||||
## 7. 实验计划
|
||||
|
||||
| 阶段 | 动作 | 时间 |
|
||||
|---|---|---|
|
||||
| 1 | 等 v1 run 完成(ETA ~16:30)| 自然 |
|
||||
| 2 | 跑 analyzer 量化 v1 thrashing 实际代价 | 5 min |
|
||||
| 3 | 实现 v2-fix-1(reset-on-success)| 30 min |
|
||||
| 4 | smoke test | 10 min |
|
||||
| 5 | 完整 v2 run(KVC 1P3D ts=1 N=1)| ~5.5h |
|
||||
| 6 | 三方对比:baseline / v1 / v2 | 30 min |
|
||||
| 7 | 决定是否需要 v2-fix-2 / v2-fix-3 | – |
|
||||
|
||||
---
|
||||
|
||||
## 8. 三方对比预测(待数据验证)
|
||||
|
||||
| 指标 | baseline(v0)| v1(thrashing)| **v2(self-healing 预测)** |
|
||||
|---|---:|---:|---:|
|
||||
| Errors | 5 | ? | 2-5(仅 35680/39360 等真容量超限)|
|
||||
| Per-D 均衡 | ±26% | **±1.1%** | ±5-10%(部分 pin 仍 sticky)|
|
||||
| Direct-to-D rate | 42.8% | ?(可能因 thrash 反而下降)| **65-75%**(持续 affinity,转换 §1 fallback)|
|
||||
| Lat mean | 1.574s | ?(可能因 thrash 上升)| **1.30-1.45s**(达到 4DP 1.443s 水平)|
|
||||
| TTFT mean | 0.244s | ? | **0.10-0.15s** |
|
||||
| 最大 D-switches/session | 0 | 116 | <10(仅真饿死 session)|
|
||||
| Sessions 永久饿死 | 18 | 0 | 2-3(仅真容量超限)|
|
||||
|
||||
预测核心:v2 应该结合 baseline 的稳定性(70-turn streak 应保留)+ v1 的公平性(无永久饿死),消除 v1 的 thrashing 副作用。
|
||||
|
||||
---
|
||||
|
||||
## 9. 局限与未验证
|
||||
|
||||
1. **v1 中期数据 (23%) 推测**:完整数据可能改变 thrashing 严重性的判断
|
||||
2. **session 6880 trajectory 的崩溃机理是推断**:基于 admission events 数据 + streak 模式,但没有直接日志证明 reject 计数何时跨阈值(需要在 v2 加 instrument 输出)
|
||||
3. **reset-on-success 的预测效果未验证**:基于"70 turn 成功" + "1-2 次瞬时 reject" 的假设;如果 burst 持续多 turn,仍可能跨阈值
|
||||
4. **可能还有未发现的设计 bug**:v2 也许还会暴露新问题
|
||||
5. **三方对比需 same trace + same scale + same ts=1**:baseline 已有 N=3,v1/v2 各 N=1(ts=1 确定性 → N=1 可信)
|
||||
|
||||
---
|
||||
|
||||
## 10. 给 TEAM_REPORT 和 REFACTOR_PLAN_V1 的更新建议
|
||||
|
||||
完成 v2 验证后:
|
||||
|
||||
1. 在 `TEAM_REPORT` §3 ts=1 验证更新章节加入 §3.3 "Migration mechanism evolution: v0 → v1 → v2"
|
||||
2. 在 `REFACTOR_PLAN_V1` §6.2 标注实施反思——预设的 "rejection blacklist" 设计漏掉了 reset-on-success 这条
|
||||
3. 在新文档 `docs/POLICY_DESIGN_PRINCIPLES_ZH.md` 提炼出原则:"任何会累积的代价机制必须配 healing/decay 机制,否则会陷入 self-amplifying 失效模式"
|
||||
|
||||
---
|
||||
|
||||
## 附录 A:本文数据来源
|
||||
|
||||
| 章节 | 数据源 |
|
||||
|---|---|
|
||||
| §2 | `outputs/qwen3-30b-tp1-ts1-migration-v1/kvcache-centric-*/` 中期日志 |
|
||||
| §3.1 | `structural/session-d-binding.jsonl` 跨 turn 序列 |
|
||||
| §3.3 | `structural/admission-events.jsonl` mode/reason 交叉表 |
|
||||
|
||||
## 附录 B:相关代码位置
|
||||
|
||||
| 内容 | 位置 |
|
||||
|---|---|
|
||||
| RoutingState.session_d_rejects | `src/agentic_pd_hybrid/policies.py:46` |
|
||||
| KvAwarePolicy.select 跳过 blacklisted D | `src/agentic_pd_hybrid/policies.py:155-162` |
|
||||
| Degenerate fallback 选最少拒的 D | `src/agentic_pd_hybrid/policies.py:184-192` |
|
||||
| record_admission_reject 触发位置 | `src/agentic_pd_hybrid/replay.py:359-364`(_run_request) |
|
||||
| _is_admission_rejection_mode 子串集合 | `src/agentic_pd_hybrid/replay.py` `_ADMISSION_REJECTION_SUBSTRINGS` |
|
||||
| _fallthrough_reason 分类 | `src/agentic_pd_hybrid/replay.py` `_fallthrough_reason` |
|
||||
364
docs/ONBOARDING_NEXT_AGENT_ZH.md
Normal file
364
docs/ONBOARDING_NEXT_AGENT_ZH.md
Normal file
@@ -0,0 +1,364 @@
|
||||
# 接班 Agent 上手手册
|
||||
|
||||
**对象**:接手本项目的下一个 SWE/research agent
|
||||
**目标**:30 分钟读完后达到当前主 agent 的认知水平,能独立跑对照实验、看懂数据、避开历史坑
|
||||
**作者状态**:本手册定稿于 `kvc-debug-journey-v1-to-v4 @ 506d360`,下一个工作分支是 `feat/d-to-p-sync`
|
||||
|
||||
---
|
||||
|
||||
## 0. 你是谁,你将要做什么(5 行 TL;DR)
|
||||
|
||||
1. 你接手的是 **agentic-pd-hybrid**——SGLang xPyD 基础上加 session-aware KVCache 层的 LLM serving 框架,目标是在多轮长 context coding agent workload 上比 vanilla DP 快
|
||||
2. v2(迁移机制 + threshold tuning)已经在 SWE-Bench 50sess trace ts=1 上**击败 4DP CA** 6/8 个 latency/TTFT 指标,但 **TTFT p99 输 3×**(1.28s vs 0.43s)
|
||||
3. 上一个 agent 已诊断出 TTFT p99 长尾的根因——8.3% 请求走 reseed 慢路径,每次需要 P 重算 prefill + mooncake transfer = 3-7s
|
||||
4. **你的任务**:在有 GPU + IB RDMA 的环境上跑 2 组对照实验,验证 (a) naive 1P3D + kv-aware 相对 KVC 的边际贡献、(b) 启用真 RDMA 后 KVC v2 的 TTFT p99 是否能压到 ~0.7s 量级
|
||||
5. 跑完结果 push 到 `outputs/`,主 agent 会拉下来更新 paper draft 和 future-work 文档
|
||||
|
||||
---
|
||||
|
||||
## 1. 必读文档(按这个顺序读,**不要乱跳**)
|
||||
|
||||
### Level 1:核心 30 分钟(**必读**,读完就能开始干活)
|
||||
|
||||
| # | 文档 | 时长 | 为什么读它 |
|
||||
|---|---|---:|---|
|
||||
| 1 | `docs/PROJECT_OVERVIEW.md` | 5min | 项目目标 + 三种 mechanism(pd-disagg / pd-colo / kvcache-centric)的术语区分 |
|
||||
| 2 | `docs/V2_DEEP_ANALYSIS_ZH.md` §0 (TL;DR) + §6 (生产决策) | 10min | 当前状态最准确的 snapshot——v2 赢什么、输什么、为什么 |
|
||||
| 3 | `docs/KVC_ROUTER_ALGORITHM.md` §1-§3 + §9 | 10min | 形式化的算法(Algorithm 1/2/3)+ 4 个 open questions。**§9 OQ#4 就是你正在解决的问题** |
|
||||
| 4 | `docs/RESEED_SLOW_PATH_AND_D_TO_P_GAP_ZH.md` §0-§2 | 5min | reseed 路径完整时间线(t=0 → t=4550ms),知道每段耗时分别来自哪里 |
|
||||
|
||||
读完上面 4 篇就能跑实验了。如果你时间紧张,**就只读这 4 篇 + 本手册**。
|
||||
|
||||
### Level 2:进阶(**遇到具体问题时再读**)
|
||||
|
||||
| 文档 | 何时读 |
|
||||
|---|---|
|
||||
| `docs/REFACTOR_PLAN_V1_ZH.md` | 想理解为什么从 ts=10 切到 ts=1 |
|
||||
| `docs/MIGRATION_V1_FINDINGS_ZH.md` | 想理解 v1→v2 演化(v1 为何 thrashing,v2 reset-on-success 怎么修的) |
|
||||
| `docs/V2_RESULTS_ZH.md` | v2 原始战报(注意:headline 表略乐观,请优先看 `V2_DEEP_ANALYSIS_ZH.md` 的修订版) |
|
||||
| `docs/V2_DEEP_ANALYSIS_ZH.md` §4 全文 | 论文 reviewer 的对等性挑战 + 我们的辩驳;写 paper 时必读 |
|
||||
| `docs/TEAM_REPORT_AGENTIC_PD_HYBRID_ZH.md` | 想理解 ts=10 时代的 §1-§9 结构性问题清单(很多问题在 ts=1 下消失,但底层机制仍在) |
|
||||
|
||||
### Level 3:归档(**别读**,是历史包袱)
|
||||
|
||||
- `docs/archive/AGENTIC_FIT_ANALYSIS_ZH.md`:ts=10 时代的早期分析,结论已被 ts=1 数据 supersede
|
||||
- `docs/archive/STRUCTURAL_VALIDATION_REPORT_ZH.md`:ts=10 数据下的结构性验证,同上
|
||||
- `docs/archive/KVC_DEBUG_JOURNEY_V1_TO_V5.md`:v1-v5 调优 sweep 的过程笔记,知道有这个文件就行
|
||||
- `docs/archive/V5_PROFILE_INVESTIGATION_ZH.md`:profile 调查,已 supersede
|
||||
- `docs/archive/REFACTOR_PLAN_ZH.md`:v0 重构计划,已被 V1 supersede
|
||||
- `docs/archive/SWEBENCH_EXPERIMENT_*.md`:早期实验日志
|
||||
|
||||
### Level 0:本手册的"姐妹"文档(**读这个之前你应该已经在看本文了**)
|
||||
|
||||
- `docs/ONBOARDING_NEXT_AGENT_ZH.md`(就是本文)
|
||||
|
||||
---
|
||||
|
||||
## 2. 项目当前状态快照(用一张表说清)
|
||||
|
||||
```
|
||||
Trace: outputs/qwen35-swebench-50sess.jsonl (4449 reqs / 52 sessions, time-scale=1.0)
|
||||
Hardware: 4× H100 80GB + Mellanox mlx5_0/_1 @ 200 Gb/s IB (active, but **未启用** in current sweep)
|
||||
Model: Qwen3-30B-A3B-Instruct-2507 (TP1)
|
||||
Branch: kvc-debug-journey-v1-to-v4 = 主分支(v2 已合入)
|
||||
feat/d-to-p-sync = 预留给 D→P 增量同步的开发,**当前空**
|
||||
main = 旧 baseline,比主分支落后 18 commit
|
||||
```
|
||||
|
||||
### 已得出的结论(高置信度)
|
||||
|
||||
1. **v2 (reset-on-success + threshold 8192) 击败 4DP CA**:lat mean -1.4%、p50 -13%、TTFT mean -25%、TTFT p50 -55%、TTFT p90 -67%
|
||||
2. **TTFT p99 KVC 输 3×**:1.28s vs 0.43s。来自 8.3% reseed/fallback 慢路径
|
||||
3. **慢路径耗时五五开**:P 端 re-prefill ~1.5-3s + mooncake P→D transfer ~1.5-4s(**当前是 TCP loopback**,未启用真 RDMA)
|
||||
4. **capacity-backup 救不了 slow path**:直接 audit 过,P 端 backup 不会随 direct-to-D append 更新,是 seed-time 静态快照
|
||||
5. **D→P 增量同步代码不存在**:经 Opus agent forensic 审查 + 全分支 git 检索确认
|
||||
|
||||
### 待验证的核心假设(**这是你的实验任务**)
|
||||
|
||||
| # | 假设 | 验证方法 | 预期结果 |
|
||||
|---|---|---|---|
|
||||
| H1 | KVC v2 相对 4DP 的胜利不只是来自 1P3D 拓扑——KVC 层(admission / migration / direct-to-D)也有显著贡献 | 跑 naive 1P3D + policy=kv-aware ts=1 N=1(vanilla SGLang pd-disagg,无 KVC 层)作为中间对照 | naive 1P3D 应该处于 KVC v2 和 4DP 之间。如果它 ≈ KVC v2 → 胜利来自拓扑而非 KVC 层;如果 ≈ 4DP → 胜利来自 KVC 层 |
|
||||
| H2 | 启用真 RDMA 把 mooncake P→D transfer 从 1.5-4s 压到 200-400ms,TTFT p99 从 1.28s 降到 ~0.7s | 在 v2 sweep 加 `--force-rdma --ib-device mlx5_0`,跑同 trace 同 ts=1 | TTFT p99 应该 ~0.5-0.8s 区间。如果没改变 → mooncake 实际没用 RDMA / 配置错误;如果降到 ~0.3s → 我们对 transfer 段贡献的估计偏低 |
|
||||
| H3 | 即使启用 RDMA,TTFT p99 仍然输 DP(因为 re-prefill 段不动) | 同 H2 实验结果 | 应该看到 TTFT p99 ~0.7s > DP 0.43s。如果 ≤ DP → 我们对 re-prefill 段成本的估计错了,可能整个 slow path 理论需要重审 |
|
||||
|
||||
---
|
||||
|
||||
## 3. 你要跑的实验(the main task)
|
||||
|
||||
### 3.1 实验矩阵(按 ROI 排序)
|
||||
|
||||
GPU hour 珍贵,砍掉了原计划的 naive 1P3D + policy=default baseline(low-ROI——naive 1P3D 用 default policy 在多轮 cache 命中上几乎必败,没必要拿这个对比当 H1 的对照点)。最终保留 2 个 run:
|
||||
|
||||
| # | 配置 | GPU | mechanism | policy | RDMA | 预期时长 | 目的 |
|
||||
|---|---|---:|---|---|---|---:|---|
|
||||
| **E1** | naive 1P3D kv-aware | 4 | pd-disaggregation | kv-aware | **on** | ~5.5h | H1:分离"1P3D + kv-aware policy"贡献 vs "KVC 层(admission/migration/direct-to-D)"贡献 |
|
||||
| **E2** | KVC v2 + RDMA | 4 | kvcache-centric | kv-aware | **on** | ~5.5h | H2/H3:验证 RDMA 能把 TTFT p99 从 1.28s 压到 ~0.7s |
|
||||
|
||||
两个 run 串行约 11h,并行用两组 GPU 可压到 ~5.5h。
|
||||
|
||||
### 3.2 启动配置:详细 flag 清单
|
||||
|
||||
参考 `scripts/sweep_ts1_migration_v2.sh` 作为底版。两个新 sweep 脚本的关键 flag:
|
||||
|
||||
#### E1: naive 1P3D kv-aware
|
||||
|
||||
```bash
|
||||
python -m agentic_pd_hybrid \
|
||||
--mechanism pd-disaggregation \
|
||||
--policy kv-aware \
|
||||
--topology-pd 1P3D \
|
||||
--transfer-backend mooncake \
|
||||
--force-rdma --ib-device mlx5_0 \ # ← 单独测拓扑+policy 而非 transport,必须开 RDMA 才能跟 E2 公平
|
||||
--trace outputs/qwen35-swebench-50sess.jsonl \
|
||||
--time-scale 1.0 \
|
||||
--concurrency 32 \
|
||||
--request-timeout-s 300 \
|
||||
--max-input-len 87811 \ # ← 拉齐到 DP 限,消除 abort 数量不对等
|
||||
--output-root outputs/qwen3-30b-tp1-ts1-naive-1p3d-kvaware
|
||||
```
|
||||
|
||||
#### E2: KVC v2 + RDMA
|
||||
|
||||
参考 `scripts/sweep_ts1_migration_v2.sh`,**只加两个 flag**:
|
||||
|
||||
```diff
|
||||
--transfer-backend mooncake \
|
||||
+ --force-rdma --ib-device mlx5_0 \
|
||||
+ --max-input-len 87811 \
|
||||
--kvcache-direct-max-uncached-tokens 8192 \
|
||||
--kvcache-migration-reject-threshold 3 \
|
||||
--kvcache-prefill-backup-policy release-after-transfer \
|
||||
```
|
||||
|
||||
**保留 v2 的其它所有配置**——这是 v2 + RDMA 的 ablation,**不要顺手改其它东西**。
|
||||
|
||||
### 3.3 实验前的环境验证(**别跳**)
|
||||
|
||||
```bash
|
||||
# 1. GPU
|
||||
nvidia-smi -L # 应该看到 4 张 H100 80GB
|
||||
|
||||
# 2. RDMA
|
||||
ibstat | grep -E "State|Rate|Port"
|
||||
# 期望:mlx5_0 / mlx5_1 都是 State=Active, Rate=200 Gb/s
|
||||
|
||||
# 3. Mooncake 能识别 RDMA 设备
|
||||
python -c "from mooncake_transfer_engine import TransferEngine; e=TransferEngine(); print(e.get_local_topology())"
|
||||
# 期望:输出包含 mlx5_0 / mlx5_1
|
||||
|
||||
# 4. 现有 v2 数据可读
|
||||
python3 scripts/analysis/recompute_summary.py outputs/qwen3-30b-tp1-ts1-migration-v2/kvc_1p3d_migration_v2_run1_metrics.jsonl
|
||||
# 期望:打印出 failure_count=45, abort_count=40 等
|
||||
|
||||
# 5. 算法实现 syntax check
|
||||
python3 -m py_compile src/agentic_pd_hybrid/{policies,replay,metrics,benchmark,cli}.py
|
||||
# 期望:全过
|
||||
```
|
||||
|
||||
任何一步失败**立刻停下来排查**,不要硬上。
|
||||
|
||||
---
|
||||
|
||||
## 4. 已踩过的坑(避免重复)
|
||||
|
||||
| # | 坑 | 症状 | 教训 |
|
||||
|---|---|---|---|
|
||||
| 1 | **abort 被计入 latency stats** | DP/KVC 都有 0.08s 的快速失败被算成"快请求",拉低 mean/p50 | 已在 `metrics.py` 修复(commit `5eac9b4`)。新 run 出 summary 时会自动包含 `abort_count` / `failure_count` 字段 |
|
||||
| 2 | **max-input-len 双方不一致**(KVC=92098 vs DP=87811) | SGLang 按 mem_fraction_static 自动算 max_total_num_tokens,KVC decode-only worker GPU 内存多 2 GB | 跑新 run 时显式传 `--max-input-len 87811` 强制对齐 |
|
||||
| 3 | **mooncake 默认 TCP loopback** | sweep 脚本只传 `--transfer-backend mooncake` 不够,会落到 TCP,跑出来比 RDMA 慢 10× | 必须加 `--force-rdma --ib-device mlx5_0` |
|
||||
| 4 | **capacity-backup 不是 D→P 同步** | flag 名字误导,看代码就会发现它只是"reseed 完不关 P session",KV 是 seed-time 静态快照 | 不要在 capacity-backup 上浪费时间;要真正消灭 reseed 长尾必须实现 D→P,去 `feat/d-to-p-sync` |
|
||||
| 5 | **N=1 在 ts=1 下"够用"是有条件的** | baseline N=3 确认 categorical 完全确定,但 v2 引入的 reset-on-success 等新代码路径未独立验证 | v2 + RDMA 的对照建议 N=2,对 RDMA-on/off 各一次 |
|
||||
| 6 | **ts=10 数据**别参考 | 当年的 372/912/396 errors 是 benchmark artifact,不代表真实生产 | 所有比较锁定 ts=1,不要尝试 ts=10 "复现"或验证 |
|
||||
| 7 | **critic agent 的 "MAJOR" 别盲信** | 上一轮 critic 把 cache fragmentation / prefill 闲置标为 MAJOR,其实是 KVC 的**设计意图** | 详见 `V2_DEEP_ANALYSIS_ZH §4.4 / §4.5`。Audit 视角和生产视角要分清 |
|
||||
| 8 | **GPU utilization 图布局有残留小问题** | 组标签 (KVC 1P3D / DP 4-way CA) 与 subplot title 视觉上仍有轻微挤压 | 已被用户接受为可发表状态。不要再花时间调这张图 |
|
||||
|
||||
---
|
||||
|
||||
## 5. CLI 速查表
|
||||
|
||||
### 跑实验
|
||||
```bash
|
||||
# 完整 sweep(参考 v2)
|
||||
bash scripts/sweep_ts1_migration_v2.sh
|
||||
|
||||
# 写自己的 sweep:复制 sweep_ts1_migration_v2.sh,改 mechanism/policy/output-root
|
||||
```
|
||||
|
||||
### 看数据
|
||||
```bash
|
||||
# 修复版 summary(推荐用这个,旧的 summary.json 含 abort 污染)
|
||||
python3 scripts/analysis/recompute_summary.py outputs/<run>/*_metrics.jsonl
|
||||
|
||||
# 跨配置对照
|
||||
python3 scripts/analysis/analyze_ts1_validation.py # 比较 KVC vs DP ts=1 4-run
|
||||
```
|
||||
|
||||
### 出图(参考 v2 流程)
|
||||
```bash
|
||||
# 4 张已有的图,对应不同 viz 问题
|
||||
python3 scripts/analysis/plot_v2_path_breakdown.py # execution_mode 分布 + path-level latency
|
||||
python3 scripts/analysis/plot_ttft_pdf.py # TTFT PDF (KVC vs DP)
|
||||
python3 scripts/analysis/plot_gpu_utilization.py # GPU 利用率(请求计数 vs 工作量)
|
||||
python3 scripts/analysis/plot_cache_efficiency.py # cache 效率(hit rate vs turn + uncached ECDF)
|
||||
|
||||
# 数据更新后重新出图:直接 rerun,每个脚本都参数化了输入路径
|
||||
```
|
||||
|
||||
### Git
|
||||
```bash
|
||||
# 主分支(实验)
|
||||
git checkout kvc-debug-journey-v1-to-v4
|
||||
|
||||
# 新功能分支(D→P 同步,空)
|
||||
git checkout feat/d-to-p-sync
|
||||
|
||||
# 远程
|
||||
origin = git@ipads.se.sjtu.edu.cn:wangjh/agentic-pd-hybrid.git
|
||||
|
||||
# Push 用 (SSH known_hosts 第一次需要 accept)
|
||||
GIT_SSH_COMMAND='ssh -o StrictHostKeyChecking=accept-new -o UserKnownHostsFile=~/.ssh/known_hosts' git push
|
||||
|
||||
# user.email 没设全局,建议 per-commit 传:
|
||||
git -c user.email=YOUR_EMAIL -c user.name=YOUR_NAME commit -m "..."
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
## 6. 跑完结果后看什么数字(checklist)
|
||||
|
||||
每个 run 跑完,**至少**收集以下几个数字(用 `recompute_summary.py`):
|
||||
|
||||
```
|
||||
☐ request_count (期望 4449)
|
||||
☐ error_count + abort_count + failure_count
|
||||
☐ latency_stats_s.{mean, p50, p90, p99}
|
||||
☐ ttft_stats_s.{mean, p50, p90, p99} ← 别忘 p99!这是 KVC 的真实代价点
|
||||
☐ execution_modes 分布
|
||||
☐ per_decode_load 分布(看负载均衡)
|
||||
☐ per_prefill_load (注意:dispatcher 计数 ≠ GPU 工作量)
|
||||
☐ cache_hit_request_count + total_cached_tokens (推 cache hit rate)
|
||||
```
|
||||
|
||||
### 两组对照实验跑完后看以下"决定性数字"
|
||||
|
||||
| 比较 | 关键看点 | 决策 |
|
||||
|---|---|---|
|
||||
| E1 (naive 1P3D kv-aware) vs E2 (KVC v2 + RDMA) | TTFT p50/p99、direct-to-D 占比 | 量化"KVC 层(admission/migration/direct-to-D)在 kv-aware 之上的额外收益"(H1) |
|
||||
| KVC v2 (TCP, 历史 v2 run) vs E2 (KVC v2 + RDMA) | TTFT p99、reseed mode 的耗时(execution_mode == reseed 的 ttft_s p50) | 验证 H2/H3:RDMA 救多少 transfer 段 |
|
||||
| E1 (naive 1P3D kv-aware) vs DP 4w(历史 ts=1 baseline)| 全部 latency / TTFT 指标 | 间接锚定"拓扑差异 + kv-aware policy"的天花板 |
|
||||
|
||||
### 期待的数字范围(如果实验顺利)
|
||||
|
||||
| 配置 | lat p50 | lat p99 | TTFT p50 | TTFT p99 | direct-to-D % |
|
||||
|---|---:|---:|---:|---:|---:|
|
||||
| **E1** naive 1P3D kv-aware | ~0.75s | ~8-10s | ~0.20s | ~0.8-1.2s | N/A |
|
||||
| **E2** KVC v2 + RDMA | ~0.58s | ~7-8s | ~0.04s | **~0.5-0.8s** | ~91% |
|
||||
| (参考) KVC v2 + TCP(历史) | 0.58s | 8.7s | 0.04s | 1.29s | 91.6% |
|
||||
| (参考) DP 4w(历史 ts=1) | 0.67s | 8.4s | 0.09s | 0.43s | N/A |
|
||||
|
||||
**如果你看到的数字偏离这个范围 ≥ 2×**,先停下来检查配置(环境验证 §3.3 那些项目),不是写报告。
|
||||
|
||||
---
|
||||
|
||||
## 7. 遇到 X 怎么办(FAQ)
|
||||
|
||||
**Q: 跑出来 KVC v2 + RDMA 的 TTFT p99 比预期高很多(> 1s)。**
|
||||
|
||||
A: 大概率 RDMA 没真用上。检查:
|
||||
1. `outputs/<run>/<subdir>/benchmark-config.json` 里 `force_rdma` 是不是 `True`、`ib_device` 是不是 `"mlx5_0"`
|
||||
2. 服务器 startup log(`outputs/<run>/<subdir>/logs/prefill-0.log`)有没有 "MOONCAKE_DEVICE=mlx5_0" / "using RDMA" 类信息
|
||||
3. `ibstat mlx5_0` 看 active 状态没掉
|
||||
|
||||
**Q: KVC v2 + RDMA 跑出来 TTFT p99 ≤ DP(违反 H3)。**
|
||||
|
||||
A: 这是个好消息。可能性:
|
||||
1. 我们对 re-prefill 段耗时估计偏高(实际 SGLang 的 prefix cache 把 P 端 re-prefill 救了一半)
|
||||
2. RDMA 直接快到把 transfer 段压到 ~50ms 量级,整个 reseed < 1.5s
|
||||
3. v2 的 reseed 触发频率被 RDMA 间接降低(某种 race condition 改善了 LRU 行为)
|
||||
|
||||
任一情况都值得**深挖**,建议把 reseed mode 的 `ttft_s` 分布单独拉出来看(应该有清晰的双峰:fast reseed + 极少数 outlier)。
|
||||
|
||||
**Q: naive 1P3D 跑不起来 / SGLang 报错。**
|
||||
|
||||
A: 仓库里 `outputs/qwen3-30b-exps/pd-disaggregation-default-20260427T062616Z/` 有过历史的 1P1D 跑通配置可以参考。常见坑:
|
||||
1. `--mechanism pd-disaggregation` 和 `--topology` 必须配合,topology 不能用 KVC 的 1P3D 名字
|
||||
2. SGLang vendored 在 `third_party/sglang/`,**不要**`pip install sglang` 用外部版本——可能 API 不对齐
|
||||
3. `--policy default` 时不要传 `--kvcache-*` 系列 flag,会被 ignore 但会污染 config 输出
|
||||
|
||||
**Q: 我想跑别的对照(更大 trace / 更多 GPU / 真实 RDMA 跨节点)。**
|
||||
|
||||
A: 先把上面 2 个 E1-E2 跑完。这 2 个是论文核心 contribution 的 ablation,不能跳。其它对照(更长 trace、8 GPU 2P6D、真跨节点 RDMA、补 naive 1P3D + policy=default)见 `V2_DEEP_ANALYSIS_ZH §7.3`,作为 follow-up。
|
||||
|
||||
**Q: 跑完后想自动出对比图。**
|
||||
|
||||
A: 4 个现有 `plot_*.py` 脚本都是参数化的,把输入路径改成你的新 run 就能复用。如果对比维度变多(如三方对比 naive vs KVC vs DP),可以扩展现有脚本而不是新写——见 `plot_ttft_pdf.py` 的模板。
|
||||
|
||||
**Q: 发现 metrics.jsonl 字段不一致 / 缺字段。**
|
||||
|
||||
A: 看 `src/agentic_pd_hybrid/metrics.py` 里 `RequestMetrics` dataclass。所有新增字段必须在那里加,否则 `recompute_summary.py` 会报 KeyError。**注意**:dataclass 的 `field_names` 是按 `RequestMetrics.__dataclass_fields__` 取的,不是 jsonl 里所有 key。
|
||||
|
||||
---
|
||||
|
||||
## 8. 如果你完全卡住
|
||||
|
||||
读这一段:
|
||||
|
||||
1. **不要**尝试在没看本手册 §1 必读文档的情况下硬上代码
|
||||
2. **不要**在 main 分支或 `feat/d-to-p-sync` 上跑实验——用 `kvc-debug-journey-v1-to-v4`
|
||||
3. **不要**修 metrics.py 的统计字段,除非你能解释清楚为什么它当前的 abort 排除是对的
|
||||
4. **不要**信任 critic agent 的"MAJOR"标签,要看代码层证据
|
||||
5. **不要**跳过环境验证(§3.3)直接跑长 sweep——5h 跑出垃圾数据浪费的成本更高
|
||||
|
||||
如果你卡住超过 30 分钟,把卡点写成一句话,去主 agent 留言(git commit message / branch 注释)。
|
||||
|
||||
---
|
||||
|
||||
## 9. 主 agent 留给你的两个具体期待
|
||||
|
||||
1. **两组对照实验跑完后**,在新 commit message 里给我以下数字(用 `recompute_summary.py` 输出格式):
|
||||
```
|
||||
E1 naive 1P3D kv-aware: lat={mean,p50,p90,p99} ttft={mean,p50,p90,p99} fail_count
|
||||
E2 KVC v2 + RDMA: 同上 + reseed-mode 的 ttft p50/p99 分开
|
||||
```
|
||||
|
||||
2. **跑 E2 时收集 reseed 路径的实测耗时分布**:
|
||||
```
|
||||
pd-router-d-session-reseed 这个 execution_mode 的 ttft_s 分布
|
||||
并把 P→D mooncake transfer 时长 vs P 端 re-prefill 时长 单独拉出
|
||||
(需要在 structural/admission-events.jsonl 里找 timestamp diff)
|
||||
```
|
||||
|
||||
这两组数字直接决定 paper future-work 章节怎么写 D→P sync 的必要性。
|
||||
|
||||
---
|
||||
|
||||
## 附录 A:关键文件位置速查
|
||||
|
||||
| 你在找什么 | 在哪 |
|
||||
|---|---|
|
||||
| 算法实现 | `src/agentic_pd_hybrid/policies.py` (KvAwarePolicy + RoutingState) |
|
||||
| 整个 replay orchestration | `src/agentic_pd_hybrid/replay.py` (~3000 行,**慢慢读**) |
|
||||
| 指标统计 | `src/agentic_pd_hybrid/metrics.py` |
|
||||
| CLI 入口 | `src/agentic_pd_hybrid/cli.py` |
|
||||
| Server 启动配置 | `src/agentic_pd_hybrid/stack.py` |
|
||||
| SGLang 改动 | `third_party/sglang/python/sglang/srt/{managers/scheduler.py, managers/io_struct.py, disaggregation/mooncake/...}` |
|
||||
| 历史 sweep 脚本 | `scripts/sweep_ts1_*.sh` |
|
||||
| 分析脚本 | `scripts/analysis/*.py` |
|
||||
| 实验输出 | `outputs/qwen3-30b-tp1-ts1-*/` |
|
||||
|
||||
## 附录 B:关键 commit 速查(按"想理解什么改动看什么 commit"组织)
|
||||
|
||||
| 想理解 | 看 commit |
|
||||
|---|---|
|
||||
| v2 的核心改动 | `2ec0deb feat(kvc): session migration with reset-on-success + direct-append threshold tuning` |
|
||||
| metrics.py 修复 | `5eac9b4 fix(metrics): exclude aborted requests from latency/ttft/tpot stats` |
|
||||
| 完整 analysis 文档(多版本叠加修订)| `c01d610` (latest) / `9ccd853` / `b5af195` / `c551906` / `517677d` |
|
||||
| 算法形式化定义 | `37e9caa docs(kvc): production-decision reframe + formal router algorithm spec` |
|
||||
| 各种 figure 脚本 | `c551906` (TTFT PDF) / `b5af195` (path breakdown) / `517677d` (GPU + cache) |
|
||||
| backpressure 代码 | `c47adaf feat(kvc): honor admission backpressure hints` 和 `ca4b64c feat(sglang): expose backpressure pause hint` |
|
||||
|
||||
---
|
||||
|
||||
**核心句**:先读 §1 Level 1 的 4 篇文档(30 min)+ 本手册(30 min),然后按 §3 跑 E1/E2/E3 三组实验,按 §6 收集决定性数字,遇到坑查 §4,结果 push 到 `outputs/` 下。**别瞎改不属于本任务的代码**——你的工作是验证 v2 的胜利在 ablation 中是否站得住,不是开发新机制(那是 `feat/d-to-p-sync` 分支的事,下一阶段才做)。
|
||||
|
||||
跑完之后期待你的 commit!
|
||||
385
docs/REFACTOR_PLAN_V1_ZH.md
Normal file
385
docs/REFACTOR_PLAN_V1_ZH.md
Normal file
@@ -0,0 +1,385 @@
|
||||
# Refactor Plan v1:基于 ts=1 验证后的重构方向
|
||||
|
||||
**日期**:2026-05-08
|
||||
**前置文档**:
|
||||
- `docs/archive/REFACTOR_PLAN_ZH.md`(v0,已被本文 supersede——v0 的 backpressure 切入点结论已撤回)
|
||||
- `docs/TEAM_REPORT_AGENTIC_PD_HYBRID_ZH.md`(包含 §1-§7 结构性问题清单)
|
||||
- `docs/archive/STRUCTURAL_VALIDATION_REPORT_ZH.md`(ts=10 数据下的早期验证)
|
||||
|
||||
**触发**:`outputs/qwen3-30b-tp1-ts1-validation/` 4 个 run 完成(KVC 1P3D × N=3 + 4DP CA × 1,全部 ts=1)
|
||||
|
||||
**目的**:把 ts=1 验证结果落到具体的重构决策——哪些事必须做、哪些事不要再做、KVC 项目本身是否需要重新定义价值主张
|
||||
|
||||
---
|
||||
|
||||
## 0. TL;DR
|
||||
|
||||
1. **ts=10 失真是真的,影响 5-10×**——KVC 在 ts=10 灾难性输 DP 是 benchmark artifact,不是机制本身有问题
|
||||
2. **ts=1 同 scale 下 KVC ≈ DP**:lat mean 差 9%,TTFT 差 47%,errors 双 0
|
||||
3. **TEAM_REPORT 的 §1(session pin 不公平)是真问题,但代价从 6× 降到 ~2×**——仍是唯一值得做的 KVC 优化
|
||||
4. **TEAM_REPORT 的 §2/§3/§4/§5 大多是 ts=10 高压 artifact**——ts=1 下要么不显著、要么自然吸收
|
||||
5. **N=1 不可信是 ts=10 现象**——ts=1 下系统在 categorical 层面完全确定(routing/admission/errors 三次 run 完全相同)
|
||||
|
||||
**项目落到情景 B(KVC ≈ DP)**——三种 forward 路径任团队决策(见 §6)。
|
||||
|
||||
---
|
||||
|
||||
## 1. ts=1 验证数据
|
||||
|
||||
### 1.1 实验配置
|
||||
|
||||
| 项 | 值 |
|
||||
|---|---|
|
||||
| Trace | `outputs/qwen35-swebench-50sess.jsonl`(4449 reqs / 52 sessions) |
|
||||
| 模型 | Qwen3-30B-A3B-Instruct-2507(TP1) |
|
||||
| 硬件 | 单机 4× H100 80GB(注:原始 ts=10 实验是 8 GPU;本次缩配) |
|
||||
| Time-scale | 1(真实 trace 时序,inter-turn gap p50 = 2.5s) |
|
||||
| Concurrency | 32 |
|
||||
| KVC 配置 | 1P3D,policy=kv-aware,admission=worker,seed-min-turn=1,prefill-priority-eviction |
|
||||
| DP 配置 | 4-way colo,policy=kv-aware(cache-aware) |
|
||||
| 输出根 | `outputs/qwen3-30b-tp1-ts1-validation/` |
|
||||
|
||||
### 1.2 Headline 对比
|
||||
|
||||
| Metric | KVC 1P3D ts=1(N=3 均值)| 4DP ts=1 | Delta |
|
||||
|---|---:|---:|---:|
|
||||
| **真实 mechanism errors** | **0** | **0** | 平 |
|
||||
| 报告 errors(口径不一致,见 §1.3) | 5 | 0 | – |
|
||||
| Lat mean | 1.574s | **1.443s** | DP 优 9% |
|
||||
| Lat p50 | 0.810s | **0.659s** | DP 优 19% |
|
||||
| Lat p90 | 3.796s | **3.641s** | DP 优 4% |
|
||||
| Lat p99 | 8.722s | **8.433s** | DP 优 3% |
|
||||
| TTFT mean | 0.244s | **0.129s** | DP 优 47% |
|
||||
| TTFT p50 | 0.122s | **0.090s** | DP 优 26% |
|
||||
| TTFT p90 | 0.572s | **0.252s** | DP 优 56% |
|
||||
| Per-worker spread | ±3.8% (3D) | ±3.1% (4 direct) | 接近 |
|
||||
|
||||
### 1.3 KVC 5 errors 的真实身份
|
||||
|
||||
DP 的同 5 个 (sess, turn) 也"失败"——但 metrics 口径不同:
|
||||
|
||||
```
|
||||
KVC: 计入 error_count
|
||||
DP: metrics 记 error=OK + finish_reason={'type':'abort', 'message':'Input length (X) exceeds the maximum allowed length (87811)'}
|
||||
```
|
||||
|
||||
| sess | turn | input_len | KVC max | DP max |
|
||||
|---|---:|---:|---:|---:|
|
||||
| 35680 | 132 | 91600 | 92098 (✓) | 87811 (✗) |
|
||||
| 35680 | 133 | 92335 | 92098 (✗) | 87811 (✗) |
|
||||
| 39360 | 137 | 91700 | 92098 (✓) | 87811 (✗) |
|
||||
| 39360 | 138 | 92003 | 92098 (✓) | 87811 (✗) |
|
||||
| 39360 | 139 | 92135 | 92098 (✗) | 87811 (✗) |
|
||||
|
||||
**两边都拒同样的请求**——区别只在于 KVC 在 P 端拒(KV 池满)、DP 在 prefill 端拒(max-input limit)。**真实 mechanism 错误率:KVC 0 / DP 0**。
|
||||
|
||||
### 1.4 ts=1 的确定性
|
||||
|
||||
KVC N=3 三次 run 跨 4449 records:
|
||||
|
||||
| 维度 | 跨 run 差异 |
|
||||
|---|---|
|
||||
| `execution_mode` | **0 / 4449** records 不同 |
|
||||
| `assigned_decode_node` | **0 / 4449** records 不同 |
|
||||
| Errors(5 个 sess/turn 对) | **完全相同** |
|
||||
| 18 starved + 16 lucky session | **完全相同** |
|
||||
| Per-D load (1502/1445/1502) | **完全相同** |
|
||||
| Lat mean | 1.574 / 1.573 / 1.574(**0.06%** 漂移)|
|
||||
| Lat p50 | 0.811 / 0.809 / 0.812(**0.4%** 漂移)|
|
||||
| 单 request lat | abs p90 diff = 25ms |
|
||||
|
||||
**结论**:低压 / ts=1 区间下 KVC 系统在 categorical 层面(路由 / admission / 失败位置)**完全确定**,仅低层数值有 model 计算微抖动。
|
||||
|
||||
---
|
||||
|
||||
## 2. 对 TEAM_REPORT §1-§7 的修订
|
||||
|
||||
| § | TEAM_REPORT 原 claim | TEAM_REPORT 原优先级 | ts=1 验证后状态 | **修订优先级** |
|
||||
|---|---|---|---|---|
|
||||
| §2.1 | session pin + 容量盲选 → 25% 饿死 | **P0** | ✅ 结构性问题仍在(18/52 session 永久 pin),但代价从 6× 慢降到 ~2× | **P0**(唯一值得做的 KVC 优化)|
|
||||
| §2.2 | D-side LRU 跟不上 → 8% errors | **P0** | ⚠️ D 仍瞬时顶到 token_usage=1.00,但**ts=1 下 drain time 自然吸收**——0 KVTransferError 雪崩(vs ts=10 369 次) | **降级 P3**(drain time 已解决症状)|
|
||||
| §2.3 | 无 backpressure 通道 | P1(已实现)| ❌ ts=1 下 transfer cascade 不存在,backpressure 无作用对象 | **冷藏**(代码留着,但默认 off)|
|
||||
| §2.4 | P-side round-robin 不感知 D 健康 → prefill-0/-1 错误差 180× | P1 | ⚠️ 1P 配置不可测;ts=10 现象**高度怀疑也是 artifact**(错误本身在 ts=1 消失) | **存疑 / 重测后再说** |
|
||||
| §2.5 | admission RPC 进 scheduler 主循环 → 1Hz polling 让 errors ↑46× | P2 | ❌ 是 ts=10 高压时的现象,ts=1 下不显著 | **冷藏** |
|
||||
| §2.6 | time-scale=10 失真 → 所有 KVC vs DP 结论可能被放大 | **P0** | ✅ **完全证实**(74× errors↓, 8.7× TTFT↓, 7× per-D spread↓) | **DONE,作为前置条件锁定** |
|
||||
| §2.7 | execution_mode 标签命名错位 | P1 | ✅ 仍存在;本次 ts=1 又发现 `error_count` 在 KVC vs DP 口径不一致 | **P1**(纯 labeling 修复,~半天)|
|
||||
| §2.8 | N=1 不可信 → 实验必 N≥3 | P2 | ⚠️ **是 ts=10 高压现象**——ts=1 下 N=1 categorical 完全确定 | **改写规则**:高压 N≥3 / 常规 N=1 |
|
||||
| §2.9 | microbench 把 KVC 失效条件全规避 | – | 仍成立 | **保留观察**(实验设计原则)|
|
||||
|
||||
---
|
||||
|
||||
## 3. v0 REFACTOR_PLAN 回顾
|
||||
|
||||
### 3.1 v0 做对的
|
||||
|
||||
- **唯一代码改动选 backpressure**:作为对 §2.3 的最小验证手段是合理的
|
||||
- **预算 KISS**:用 8h GPU 验证 §1-§7,思路正确
|
||||
- **明确"P0 是 time-scale=1 baseline"**:v0 的 §1 末尾就指出 "time-scale=1 验证为 P0 待办"——本次实验正是把这条做了
|
||||
|
||||
### 3.2 v0 的核心误判
|
||||
|
||||
| v0 假设 | 实际 |
|
||||
|---|---|
|
||||
| backpressure 是 §3 的最小验证 → 也是修复 | ts=1 下 §3 的症状(transfer cascade)不存在,backpressure 无效 |
|
||||
| 8h 预算够跑 ts=1 baseline + backpressure smoke | ts=1 单 run 5.5h,4 run 全跑要 22h(实际跑了 22h) |
|
||||
| §1 / §2 的修复"超出 KISS 边界",先验证不修 | 验证后发现 §1 是**唯一**值得做的真问题,应该早点把它纳入 |
|
||||
|
||||
### 3.3 v0 的 backpressure 代码命运
|
||||
|
||||
代码保留(`--enable-backpressure` 默认 off),原因:
|
||||
- 不删除是因为如果未来跑高压 / 大 trace / 真 RDMA 失败回归到类 ts=10 区间,可能仍有用
|
||||
- 但**不部署、不启用、不文档化为推荐配置**——避免给以后看到代码的人误导
|
||||
|
||||
---
|
||||
|
||||
## 4. 修订后的优先级矩阵
|
||||
|
||||
```
|
||||
必做 建议做 不做
|
||||
──────── ──────── ────────
|
||||
ts=1 必修 §1 capacity-aware (空) §2 / §3 / §4 / §5
|
||||
policy + migration 的 ts=10 fix
|
||||
|
||||
ts=1 nice §2.7 metrics 标签 (空) §2.8 N≥3 严苛规则
|
||||
to have 统一口径 (改成"高压 N≥3")
|
||||
|
||||
文档 §3 写入 TEAM v0 标记 superseded ts=10 数据归档
|
||||
REPORT 更新 (但保留可追溯性)
|
||||
```
|
||||
|
||||
**唯一进入"必做工程"列表的是 §1**。其他全是文档或冷藏。
|
||||
|
||||
---
|
||||
|
||||
## 5. KVC vs DP 拆分到 path-level 看真实差距
|
||||
|
||||
理解 §1 的 ROI 必须先看 path-level(不是整体均值):
|
||||
|
||||
### 5.1 KVC 内部 path 性能(来自 ts=1 N=3 一致数据)
|
||||
|
||||
| Path | n | 占比 | Lat p50 | TTFT p50 |
|
||||
|---|---:|---:|---:|---:|
|
||||
| `kvcache-direct-to-d-session`(快路径)| 1903 | **42.8%** | **0.475s** | **0.042s** |
|
||||
| `pd-router-fallback-large-append-session-cap`(慢路径)| 2409 | **54.2%** | 1.04s | 0.32s |
|
||||
| `pd-router-turn1-seed`(每 session 一次)| 52 | 1.2% | 0.375s | 0.057s |
|
||||
| 其余 | 85 | 1.8% | 多种 | 多种 |
|
||||
|
||||
### 5.2 DP 全部 path(单一)
|
||||
|
||||
| Path | n | 占比 | Lat p50 | TTFT p50 |
|
||||
|---|---:|---:|---:|---:|
|
||||
| `dp-colo-router` | 4449 | 100% | 0.659s | **0.090s** |
|
||||
|
||||
### 5.3 路径级对比
|
||||
|
||||
| | KVC direct | KVC fallback | DP |
|
||||
|---|---|---|---|
|
||||
| Lat p50 | **0.475s**(赢 DP 28%)| 1.04s(输 DP 58%)| 0.659s |
|
||||
| TTFT p50 | **0.042s**(赢 DP 53%)| 0.317s(输 DP 252%)| 0.090s |
|
||||
|
||||
**事实陈述**:
|
||||
- KVC 快路径 **明显快于** DP(无 P 介入、无 mooncake transfer)
|
||||
- KVC 慢路径 **明显慢于** DP(P→D transfer 开销没法摊到 turn 内)
|
||||
- 当前 quick:slow = 42.8% : 54.2%——慢路径多 → 整体输 DP 9-47%
|
||||
- 如果能把比例反过来到 70:25 或更好,KVC 整体会赢 DP
|
||||
|
||||
**§1 的本质就是"为什么有 54% 进了慢路径"**——因为 18/52 session 被 pin 在容量紧张的 D 上,每次 admission 都拒。
|
||||
|
||||
---
|
||||
|
||||
## 6. 三种 forward 路径
|
||||
|
||||
> **更新(2026-05-09)**:情景 C **已实现**——见 `docs/V2_RESULTS_ZH.md`。下面三个分支保留作历史记录。
|
||||
>
|
||||
> | 情景 | 描述 | 状态 |
|
||||
> |---|---|---|
|
||||
> | A | KVC < DP,接受现状转维护 | 不适用 |
|
||||
> | B | KVC ≈ DP,重新定义价值主张 | 不适用 |
|
||||
> | **C** | **KVC > DP,优化拉大差距** | **✓ 实现:v2 在 7/8 头部指标击败 4DP(TTFT mean -24%, p50 -54%, p90 -64%;lat mean -0.8%, p50 -12.6%)** |
|
||||
>
|
||||
> 关键修复:(1) reset-on-success blacklist decay(消除 v1 thrashing),(2) `--kvcache-direct-max-uncached-tokens` 2048→8192(让 41% 大 append 走 direct-to-D 快路径)。direct-to-D rate 从 baseline 42.8% 升到 v2 91.7%。
|
||||
|
||||
### 6.1 选项 A:接受现状,项目转维护
|
||||
|
||||
**判断**:KVC 在 ts=1 + 同 scale 下 ≈ DP(9% 慢、47% TTFT 慢),但**也没灾难性输**。如果项目目标是"验证 KV-aware routing 在 agentic 上是否可行",答案是 **可行但收益不显著**。
|
||||
|
||||
**操作**:
|
||||
- 写 TEAM_REPORT §3 总结 ts=1 实验
|
||||
- 把 ts=1 数据 + 4 个 run 归档到 `RESULTS_FROZEN_TS1.md`
|
||||
- KVC 代码保留但标记 "experimental, not recommended for production"
|
||||
- 团队转下一个项目方向(不是本文范围)
|
||||
|
||||
**成本**:1 周文档收尾。
|
||||
**风险**:放弃了 §1 修复后可能的 KVC > DP 上限。
|
||||
|
||||
### 6.2 选项 B:做 §1,目标让 KVC > DP
|
||||
|
||||
**判断**:5.3 节的路径分析表明 KVC 快路径已经赢 DP;如果把饿死 session 救回快路径,KVC 整体可能赢 DP。
|
||||
|
||||
**具体改动**:
|
||||
|
||||
#### 6.2.1 capacity-aware policy(`policies.py:166-172`)
|
||||
|
||||
当前评分(无容量项):
|
||||
```python
|
||||
score = (
|
||||
overlap + sticky * self.sticky_bonus,
|
||||
sticky,
|
||||
inflight_penalty,
|
||||
assignment_penalty,
|
||||
)
|
||||
```
|
||||
|
||||
提议改为:
|
||||
```python
|
||||
# 新增:D 当前容量利用率(从 worker-mode admission 已能查到)
|
||||
capacity_used = worker_capacity_used_ratio.get(worker.worker_id, 0.0)
|
||||
|
||||
# Hard cap:容量 > X 时禁止该 D 进入候选
|
||||
if capacity_used > HARD_CAP_THRESHOLD: # e.g. 0.85
|
||||
continue
|
||||
|
||||
score = (
|
||||
overlap_capped, # 原 overlap,但限幅避免单个 D 永远赢
|
||||
-capacity_used, # 新增二级排序项:偏好空闲 D
|
||||
sticky,
|
||||
inflight_penalty,
|
||||
)
|
||||
```
|
||||
|
||||
#### 6.2.2 session migration(`replay.py` 或 policy 层)
|
||||
|
||||
当 session X 在 D-A 上连续被 admission 拒 N 次(如 N=3):
|
||||
- 主动 release X 在 D-A 上的 session state
|
||||
- 允许下次 turn 把 X 路由到另一个 D
|
||||
- 代价:丢失 D-A 上已积累的 KV——但 fallback 路径本来也丢了,**净收益正**
|
||||
|
||||
#### 6.2.3 metric 修复(`replay.py`)
|
||||
|
||||
把"`pd-router-fallback-large-append-*`" 标签按真实原因细分:
|
||||
- `session-not-resident-on-pinned-D`(§1 主因)
|
||||
- `real-large-append`(>2048 阈值,§2.7)
|
||||
- `session-was-evicted`(被 LRU 踢过)
|
||||
- `session-cap-rejected`(worker admission 拒)
|
||||
|
||||
让以后看 metrics 的人不再被名字误导。
|
||||
|
||||
#### 6.2.4 验证
|
||||
|
||||
- 每改动跑 KVC 1P3D ts=1 N=1(categorical 确定,不需要 N=3)
|
||||
- 对比 baseline run1(已有数据)
|
||||
- 关键指标:`kvcache-direct-to-d-session` 占比、整体 lat mean、TTFT mean
|
||||
- 目标:direct-to-D rate 从 42.8% 升到 > 70%、整体 lat 追平或赢 DP
|
||||
|
||||
**成本**:3 天编码 + 5 天测试 + 2 天文档 ≈ 2 周。
|
||||
**风险**:
|
||||
- session migration 可能导致 thrash(A→B→A→B),需要冷却时间机制
|
||||
- capacity HARD_CAP 阈值需要 sweep 找最优
|
||||
- 改完仍可能不赢 DP(理论上限不知道)
|
||||
|
||||
### 6.3 选项 C:保留 KVC,但寻找 KVC 真正赢的工作点
|
||||
|
||||
**判断**:当前 SWE-Bench 50 sessions × 30B 模型 × 4 GPU 是一个特定工作点。KVC 的设计初衷是"长 multi-turn session 的 KV 复用"——可能在某些其他工作点有显著优势。
|
||||
|
||||
**候选工作点**:
|
||||
- **更长 session(>200 turns)**:复用收益更大
|
||||
- **更小模型(如 7B / 14B)**:mooncake transfer 占比更大,KVC 节省更明显
|
||||
- **更大 trace(>200 sessions)**:DP 的 prefix cache 命中率会下降,KVC 的 session-aware 优势放大
|
||||
- **真实 RDMA(非 mooncake TCP loopback)**:transfer 更快,KVC 的 P→D 开销更小
|
||||
|
||||
**操作**:
|
||||
- 设计 1-2 个新 micro/macro benchmark
|
||||
- 跑 KVC vs DP 对比
|
||||
- 找到差距 > 30% 的工作点(KVC 赢 / 输都是数据)
|
||||
|
||||
**成本**:~1 个月(trace 设计 + benchmark + 分析)。
|
||||
**风险**:可能找不到 KVC 显著赢的工作点。
|
||||
|
||||
---
|
||||
|
||||
## 7. 推荐组合
|
||||
|
||||
按风险 / 收益排序:
|
||||
|
||||
1. **必做**(无论选 A/B/C):
|
||||
- 写 `TEAM_REPORT §3 ts=1 验证更新`
|
||||
- 修 `metrics 标签口径`(§2.7 + KVC/DP error_count 一致化)
|
||||
- **冷藏 backpressure 代码**(不删但默认 off)
|
||||
- 把 v0 REFACTOR_PLAN 标 superseded
|
||||
|
||||
2. **强烈推荐**:选项 B 的 §6.2.1(capacity-aware policy hard cap)
|
||||
- 工程量小(~1 天编码 + 1 天测试)
|
||||
- 验证 §1 修复的真实收益是否如预测
|
||||
- 如果 direct-to-D rate 不显著提升 → 把 §6.2.2 也加上
|
||||
- 如果还不行 → 接受现状走选项 A
|
||||
|
||||
3. **看团队带宽**:选项 C 的工作点探索
|
||||
- 不与 §6.2 冲突,可以并行
|
||||
- 找到一个 KVC 真正赢的工作点会极大改变项目价值主张
|
||||
|
||||
---
|
||||
|
||||
## 8. 应该砍掉的事(明确列表)
|
||||
|
||||
| 事 | 砍的理由 |
|
||||
|---|---|
|
||||
| backpressure smoke sweep(v0 计划的 4 run) | ts=1 下 backpressure 无作用对象 |
|
||||
| §2.5 admission API probe/commit 拆分 | 高压才显著,等找到 KVC 高压 workload 再说 |
|
||||
| §2.2 D-side 分层 LRU eviction(hot retract) | drain time 自然吸收 |
|
||||
| §2.4 P-side D-health-aware routing | 1P 测不出,ts=10 现象高度存疑 |
|
||||
| 大量 instrument(admission-events / pool timeseries) | 已经够了,先用现有数据 |
|
||||
| 任何 ts=10 区间的优化 | 那是 benchmark artifact 主导的区间,不代表真实部署 |
|
||||
| N≥3 实验作为硬规则 | 改写为"高压 N≥3,常规 N=1 即可" |
|
||||
|
||||
---
|
||||
|
||||
## 9. 风险与未验证的假设
|
||||
|
||||
1. **4DP ts=1 是 N=1**:虽然 KVC ts=1 是确定性的,DP 是新机制 N=1,理论上需要 N≥3 验证。但 DP 在 ts=10 也是 0 errors / 1.43s mean,行为相对 KVC 更稳定,N=1 风险较小。**如选项 B 推进,建议补 N=2**。
|
||||
2. **2 个 input-too-long session 是 trace 数据问题**:这两个 session(35680、39360)在 turn 132+ / 137+ 才超过 input limit。可能是 trace 生成时没控制好 max input。**应该独立把这两个 session 从 trace 移除或截断后重跑作为对照**。
|
||||
3. **4 GPU 缩配 vs 8 GPU 原始**:本次 1P3D / 4DP 数据无法跨 8 GPU 原始数据直接比,需要在结论中明确。但 ts=1 + 同 scale 内部对比是干净的。
|
||||
4. **mooncake TCP loopback**:所有 transfer 在单机 TCP 模拟下进行。生产 RDMA 下 KVC 的 transfer 开销可能显著降低,KVC 优势可能扩大——这是 **选项 C 的一个候选维度**。
|
||||
5. **§1 修复是否真能让 direct-to-D 上升到 70%+ 是预测**:实际可能受 hash overlap 限制(即使 D 容量充裕,没有 prefix overlap 就走不了 direct-to-D)。**需要 §6.2 验证后才知道天花板**。
|
||||
6. **input-limit error 的 metrics 口径修复影响以后所有比较**:注意修改后 ts=10 历史数据的 error_count 也需要重算(或在分析时显式补偿)。
|
||||
|
||||
---
|
||||
|
||||
## 10. 决策点(需要团队确认)
|
||||
|
||||
请审阅后回答:
|
||||
|
||||
| # | 决策 | 选项 |
|
||||
|---|---|---|
|
||||
| D1 | 选哪条 forward 路径? | A(维护)/ B(修 §1)/ C(探索 workload)/ B+C |
|
||||
| D2 | 写 TEAM_REPORT §3 ts=1 验证更新章节? | Yes / No |
|
||||
| D3 | 把 v0 REFACTOR_PLAN 标 superseded? | Yes / No |
|
||||
| D4 | 删除 backpressure 代码 vs 冷藏? | 删 / 冷藏(默认 off)|
|
||||
| D5 | 修 metrics 标签口径(§2.7 + error_count 一致化)? | Yes / No |
|
||||
| D6 | 是否补 4DP ts=1 N=2 / N=3 做更稳的 baseline? | Yes / No |
|
||||
| D7 | 是否把 sess 35680 / 39360 从 trace 移除做"干净" baseline? | Yes / No |
|
||||
|
||||
---
|
||||
|
||||
## 附录 A:本文数据来源
|
||||
|
||||
| 章节 | 数据源 |
|
||||
|---|---|
|
||||
| §1.2-§1.4 | `outputs/qwen3-30b-tp1-ts1-validation/{kvc_1p3d_run{1,2,3},dp4}_{summary.json,metrics.jsonl}` |
|
||||
| §1.4 跨 run 一致性 | per-record diff via `scripts/analysis/analyze_ts1_validation.py` + 临时 diff 脚本 |
|
||||
| §5 path-level | metrics.jsonl 按 `execution_mode` 分组 |
|
||||
| §2 §1-§7 修订 | `docs/TEAM_REPORT_AGENTIC_PD_HYBRID_ZH.md` 原数据 + ts=1 新数据交叉对比 |
|
||||
|
||||
## 附录 B:相关文档
|
||||
|
||||
- `docs/TEAM_REPORT_AGENTIC_PD_HYBRID_ZH.md` — §1-§7 原结构性问题清单
|
||||
- `docs/archive/REFACTOR_PLAN_ZH.md` — v0 重构计划(本文 supersede)
|
||||
- `docs/archive/AGENTIC_FIT_ANALYSIS_ZH.md` — 早期 fit 分析(§1-§7 来源)
|
||||
- `docs/archive/STRUCTURAL_VALIDATION_REPORT_ZH.md` — ts=10 结构性 claim 验证
|
||||
- `docs/archive/KVC_DEBUG_JOURNEY_V1_TO_V5.md` — v1→v5 演进
|
||||
- `docs/archive/V5_PROFILE_INVESTIGATION_ZH.md` — v5+profile 调查(已 critic 修订)
|
||||
- `scripts/sweep_ts1_kvc_n3_plus_dp.sh` — 本次 4 run sweep 脚本
|
||||
- `scripts/analysis/analyze_ts1_validation.py` — 本次分析脚本
|
||||
|
||||
---
|
||||
|
||||
**作者注**:本文偏决策导向。如果要写更技术的 §1 capacity-aware policy 实现细节,应该在 D1 决策为 B 之后单独出一份 `IMPL_CAPACITY_AWARE_POLICY.md`。
|
||||
368
docs/RESEED_SLOW_PATH_AND_D_TO_P_GAP_ZH.md
Normal file
368
docs/RESEED_SLOW_PATH_AND_D_TO_P_GAP_ZH.md
Normal file
@@ -0,0 +1,368 @@
|
||||
# Reseed 慢路径现状与 D→P KV 同步缺口
|
||||
|
||||
**日期**:2026-05-11
|
||||
**对象**:项目团队 + 后续 paper reviewer
|
||||
**性质**:基线现状落盘 + future-work 缺口定位
|
||||
**前置文档**:
|
||||
- `docs/V2_DEEP_ANALYSIS_ZH.md` §3.2 §4.2(reseed 路径在 v2 数据中的表现)
|
||||
- `docs/KVC_ROUTER_ALGORITHM.md` §3 §9(算法形式化 + open questions)
|
||||
|
||||
**目的**:把"v2 的 reseed slow path 为什么慢、能不能用现有机制治、还差什么"三个问题落盘成单一参考文档,让团队不必再口头反复对齐,让论文 future-work 章节有可引用的基础。
|
||||
|
||||
---
|
||||
|
||||
## 0. TL;DR
|
||||
|
||||
1. KVC v2 在 SWE-Bench 测试中 8.3% 请求走非 direct-to-D 的 reseed/fallback 路径,**单次 reseed 实测 3-7s**(TTFT p99 = 1.28s 全部来自这条路径)。
|
||||
2. 启用真 RDMA(节点有 mlx5_0/_1 @ 200 Gb/s × 2 active)能把 reseed 的 transfer 段(~1.5-4s)压到 ~200-400ms,但**对 re-prefill 段(~1.5-3s)无效**。预期 reseed 总时间从 3-7s 降到 1.7-3.2s,TTFT p99 ~0.7s,**仍输 DP(0.43s)**。
|
||||
3. 真正消除 reseed 长尾必须实现 **D→P 增量 KV 同步**——让 P 端 backup 跟上 D 在 direct-to-D append 路径上累积的 KV,避免 reseed 时重新跑 prefill kernel。
|
||||
4. 经 Opus agent 独立 forensic 审查(commit `9ccd853`)+ 全分支 git 检索:**当前代码、vendored SGLang、mooncake 三层均无 D→P 实现**,作者也没有在其它分支偷偷开发——仓库总共只有 main(旧 baseline)+ kvc-debug-journey-v1-to-v4(本工作分支)两个分支,main 还落后我们 18 个 commit。
|
||||
5. `--kvcache-prefill-backup-policy capacity-backup` 这个 flag 看起来像 D→P 同步但**不是**——它的真实语义只是"reseed 完不关 P streaming session",P 端 KV 仍是 seed-time 的**静态快照**,不随 direct-to-D append 而增长。
|
||||
6. 实现 D→P 增量同步的工程量 ~1-2 周,最难的不是网络层(mooncake 加 D-sender / P-receiver 角色 ~400 LOC),而是 **SGLang radix tree 改成允许从外部 worker 喂数据**——radix cache 当前假设单一生产者。
|
||||
|
||||
---
|
||||
|
||||
## 1. 团队成员的三个质疑(关键框架,paper 引用建议保留原话)
|
||||
|
||||
这三条质疑出自 v2 完成后的对话审查,**直接戳穿了"启用 capacity-backup 就能消除 slow path"的一厢情愿**。每条都有代码层证据支持,**全部成立**。
|
||||
|
||||
### 质疑一:P 节点的 pool 塞得下所有 backup 的 KV cache 吗?
|
||||
|
||||
**回答:塞不下,max 同时 backup ~1-2 个大 session。**
|
||||
|
||||
代码证据(`src/agentic_pd_hybrid/replay.py:1618-1620`):
|
||||
|
||||
```python
|
||||
max_backup_sessions = max(1, capacity_tokens // max(1, target_tokens * 2))
|
||||
max_backup_sessions = min(max_backup_sessions, 4)
|
||||
```
|
||||
|
||||
按 SWE workload 实测代入:
|
||||
- P 池 `capacity_tokens` ≈ 92,104 tokens(SGLang 启动时按 mem_fraction_static 自动分配)
|
||||
- 典型 session peak input `target_tokens` ≈ 50,000-80,000 tokens
|
||||
- 计算:`92K // (50K × 2) = 0` → `max(1, 0) = 1`
|
||||
- → **P 最多同时 backup 1 个大 session**
|
||||
|
||||
对照小 session:
|
||||
- target 20K:`92K // 40K = 2` → backup 上限 2 个
|
||||
- target 10K:`92K // 20K = 4` → backup 上限 4 个(达到代码硬上限)
|
||||
|
||||
→ **capacity-backup 在真实 agentic 长 context workload 下只能救少数 session,不是全员保险。**
|
||||
|
||||
### 质疑二:P 上的 backup 是陈旧快照——49K 的 append 内容根本没经过 P
|
||||
|
||||
**回答:完全正确,这是 capacity-backup 设计上的致命缺陷。**
|
||||
|
||||
**用户提供的反例场景**(已成为 paper 中描述 slow path 的标准例子):
|
||||
|
||||
```
|
||||
turn 0: P 做 prefill 1K tokens → 经 mooncake 传到 D → P 留 1K backup
|
||||
turn 1-50: 全部走 direct-to-D,D 上做 append-prefill,KV 在 D 上从 1K 增长到 50K
|
||||
↑↑↑ 关键:这 49K 的 append 内容(tool 输出、user 消息、模型生成)
|
||||
**从未流经 P 节点**。P 端 backup 锁在 1K 状态。
|
||||
turn 51: D 出于某种原因(容量、迁移、显式驱逐)拒绝 → 触发 reseed
|
||||
→ 即使 P 上有 backup,也只是 turn-0 的 1K
|
||||
→ 实际需要 D 上重建的是 50K(当前完整 context)
|
||||
→ P 必须从 prompt 重新 prefill 49K 的差额
|
||||
→ capacity-backup 节省的 compute 仅 ~2%
|
||||
```
|
||||
|
||||
**代码证据**(独立 Opus agent forensic 审查,commit `9ccd853`):
|
||||
|
||||
1. 唯一更新 `session.prefill_resident_tokens` 的函数是 `_commit_prefill_backup_residency`(`replay.py:1483`)
|
||||
2. 这个函数的唯一 caller 是 `_invoke_kvcache_seeded_router`(`replay.py:2208`)—— 即 seed/reseed 路径
|
||||
3. `_invoke_session_direct`(`replay.py:2719`,direct-to-D 路径)只更新 `session.opened` / `resident_tokens` / `last_trace_request`,**从不触碰任何 P 端字段**
|
||||
4. `_commit_prefill_backup_residency` 内部用 `_estimate_session_resident_tokens(request)` 取的是**完整 request 的预估**,不是 append delta——所以连 bookkeeping 层面都不假设有增量更新
|
||||
|
||||
→ **`capacity-backup` 的真实语义只是"reseed 完之后跳过 `_close_prefill_session`"**(`replay.py:2221`),P 端 streaming session 保持 open 状态、KV 留在 P 的 radix tree 中。但**不存在任何机制让这份 KV 跟上 D 端的 append 增长**。
|
||||
|
||||
### 质疑三:D 触发 reseed 后,本机旧 session 的 KV cache 是不是清空了?P 做完 re-prefill,KV 推到哪里?
|
||||
|
||||
**回答:是的,旧 KV 直接 free 掉;P 重新 prefill 完之后推到 router 选的新 target D(可能同 D,可能换 D)。中间没有"先 dump 到 P 再清"的快捷方式。**
|
||||
|
||||
#### D 端驱逐时的 KV 处理
|
||||
|
||||
代码证据(`replay.py:_close_decode_session`,1539-1569 行;`session_aware_cache.py:release_session`,250-276 行):
|
||||
|
||||
```python
|
||||
# replay.py 端
|
||||
async def _close_decode_session(..., evicting_for_capacity=False):
|
||||
if not session.opened:
|
||||
return
|
||||
await _close_streaming_session(...) # 给 D 发关闭信号
|
||||
# 从 D 的 resident bookkeeping 里删掉这个 session
|
||||
session.opened = False
|
||||
session.resident_tokens = 0
|
||||
if evicting_for_capacity and not session.prefill_opened:
|
||||
residency.decode_evictions_without_prefill_backup += 1
|
||||
|
||||
# SGLang 端(session_aware_cache.py)
|
||||
def release_session(self, session_id):
|
||||
# 解锁引用 + 直接 free KV slots
|
||||
self.token_to_kv_pool_allocator.free(kv_indices)
|
||||
# ↑ 没有序列化、没有外发、没有 D→P 通道
|
||||
```
|
||||
|
||||
**D 驱逐 = 把 KV slot 直接归还给 token pool 分配器。完全没有任何 outbound 网络调用。**
|
||||
|
||||
#### Reseed 时 P→D 的目标选择
|
||||
|
||||
驱逐之后的 reseed 路径(`_invoke_kvcache_seeded_router`,`replay.py:2101`)走的是与 turn 0 完全一样的 P-mediated seeding:
|
||||
|
||||
```
|
||||
1. KvAwarePolicy.select() 选择一个 target D'(可能是同一个 D,也可能因 migration 换 D)
|
||||
2. _invoke_kvcache_seeded_router 在 D' 上 open 一个 streaming session
|
||||
3. 给 P 发完整 prompt → SGLang pd-router 让 P 做完整 prefill
|
||||
4. P 的 prefill 完成后通过 mooncake 把 KV 一次性推到 D'
|
||||
5. D' 上接收完毕,session 重建完成;decode 继续
|
||||
```
|
||||
|
||||
**所以 P 做完 re-prefill 的 KV 推到 KvAwarePolicy 选的 target D'**——可能是:
|
||||
- 同一个 D(驱逐后重新接受)
|
||||
- 另一个 D(如果 reject 计数累积触发 migration,详见 KVC_ROUTER_ALGORITHM §3.3)
|
||||
|
||||
无论哪种,**旧 D 的旧 KV 在新 KV 到达之前就已经被 free**。没有 D→D 的直接迁移路径,没有"先 dump 到 P 再推回"的快捷路径。
|
||||
|
||||
---
|
||||
|
||||
## 2. Reseed 路径的完整 step-by-step 现状
|
||||
|
||||
把上面三个质疑串成端到端流程,以下是 v2 当前 reseed 路径的**完整**操作序列。每一步都标注实测耗时与代码位置。
|
||||
|
||||
### 触发条件
|
||||
|
||||
下列任一发生时 router 走 reseed 路径(详见 `KVC_ROUTER_ALGORITHM.md §3.3`):
|
||||
- D 端 `Admit()` 返回 `can_admit=False`,原因为 `no-d-capacity` / `session-not-resident` / 等
|
||||
- KvAwarePolicy.select 返回的 D 不再持有该 session(migration 触发)
|
||||
- v1/v2 的 reject counter 累积让所有 D 都被 blacklist(极少触发,由 reset-on-success 保护)
|
||||
|
||||
### 端到端时间线
|
||||
|
||||
```
|
||||
t=0 上游 agent 发出 turn N 请求(input ~50K,append ~2K)
|
||||
↓
|
||||
t=~5ms Router 的 KvAwarePolicy.select() 选 target D'(O(|D|) Python 评分)
|
||||
↓
|
||||
t=~10ms Router → D' 发 admit_direct_append RPC
|
||||
↓
|
||||
t=~30ms D' 返回 can_admit=False, reason="session-not-resident"
|
||||
或 "no-d-capacity",Algorithm 3 bump rejects[s, D']++
|
||||
↓ (fallback chain 最多再试 ε-1 个 D,对应 ε ~30ms 总额)
|
||||
t=~100ms 所有 D 都被拒 / 选不到适合 D,路径退化到 seeded router
|
||||
↓
|
||||
t=~110ms Router 转 _invoke_kvcache_seeded_router
|
||||
↓
|
||||
t=~120ms [可选] capacity-backup policy 下:_reserve_prefill_backup_capacity()
|
||||
检查 P 池容量,若不够先 LRU 驱逐别的 P backup session
|
||||
↓
|
||||
t=~150ms P 上 open streaming session(HTTP /session/open)
|
||||
↓
|
||||
t=~200ms 发完整 prompt 到 SGLang pd-router → 路由到 P
|
||||
↓
|
||||
t=~250ms P 开始 prefill
|
||||
↓
|
||||
↓ ←←← 大头 1:P-side re-prefill 段
|
||||
↓ P 必须 prefill 完整 ~50K tokens
|
||||
↓ 即使 capacity-backup 开着,P 的 backup 只有 turn-0 的 ~1K
|
||||
↓ radix prefix cache 命中前 1K,剩余 49K 重算
|
||||
↓ 实测耗时:~1.5-3s @ Qwen3-30B TP1
|
||||
↓
|
||||
t=~2000ms P 完成 prefill,KV 进入 mooncake transfer 队列
|
||||
↓
|
||||
t=~2050ms mooncake 开始 P→D' transfer
|
||||
↓
|
||||
↓ ←←← 大头 2:P→D mooncake transfer 段
|
||||
↓ KV 张量 ~5-9 GB(50K tokens × 2 bytes/token × layers × heads...)
|
||||
↓ **TCP loopback** 实测耗时:~1.5-4s
|
||||
↓ ↑↑↑ 当前 sweep 未启用 RDMA,走的是单机 lo 设备
|
||||
↓ 若启用 IB RDMA @ 200 Gb/s,理论 200-400ms
|
||||
↓
|
||||
t=~4500ms transfer 完成,D' 上 session 重建好
|
||||
↓
|
||||
t=~4510ms D' 开始 decode(小幅度 append-prefill 余下的 ~2K append + 生成)
|
||||
↓
|
||||
t=~4550ms 首个 token 出来 → TTFT 测点
|
||||
```
|
||||
|
||||
**单次 reseed 总耗时:3-7s**(中位 ~2.5s 来自较小 session,p99 ~7.7s 来自最大 session)。**re-prefill 段与 transfer 段大致五五开**,受 session 大小影响。
|
||||
|
||||
### 这就是为什么 v2 的 TTFT p99 = 1.28s
|
||||
|
||||
8.3% slow path 走的是上面这条流水线,其中 reseed 路径(`pd-router-d-session-reseed`)单独占 3.4%(150/4449 请求),构成 KVC TTFT p99 长尾的主要贡献。
|
||||
|
||||
---
|
||||
|
||||
## 3. 已审查的所有"看起来像 D→P 但其实不是"的代码
|
||||
|
||||
下面这些在搜索时容易误判成 D→P 实现,**全部经独立 audit 排除**:
|
||||
|
||||
| 文件:行 | 看起来像 | 实际是 |
|
||||
|---|---|---|
|
||||
| `replay.py:1483 _commit_prefill_backup_residency` | "把 backup 提交到 P" | bookkeeping 函数,更新 `session.prefill_resident_tokens` 计数字段。不传输任何 KV 数据,只在 seed/reseed 完成后被调用。 |
|
||||
| `replay.py:1572 _reserve_prefill_backup_capacity` | "预留 backup 空间" | 检查 P 池可用空间并按 LRU 驱逐别的 backup session 腾位置。不传 KV,只调整 reservation 计数。 |
|
||||
| `cli.py:182 --kvcache-prefill-backup-policy` | "backup 策略" | 只决定 reseed 完成后是否 `_close_prefill_session`。capacity-backup = 保留 P 端 streaming session 不关;release-after-transfer = 立刻关闭。**两种策略下 P 的 KV 都是 seed-time 的静态快照**。 |
|
||||
| `session_aware_cache.py:release_session` | "释放 session(可能含外发)" | 仅调 `kv_pool_allocator.free(kv_indices)`。零网络调用。 |
|
||||
| `disaggregation/decode.py: start_decode_thread` | "decode 端线程,可能有出站" | 纯 receiver loop。处理入站 `AUX_DATA / CHUNK_READY / STAGING_REQ / KVPoll.Success`,**没有出站 KV 传输分支**。 |
|
||||
| `disaggregation/mooncake/conn.py:1563` | "传输请求添加" | `assert disaggregation_mode == PREFILL`——硬约束,只有 P 端能调。 |
|
||||
| `mooncake.MooncakeKVSender` / `MooncakeKVReceiver` | "双向 sender / receiver" | 强角色化:Sender 只在 PREFILL 模式实例化,Receiver 只在 DECODE 模式。`BaseKVManager` 抽象无 bidirectional slot。 |
|
||||
| `pd-router-d-session-reseed-after-eviction` execution_mode | "走 backup 的快路径" | 实际还是走完整 `_invoke_kvcache_seeded_router`(P 完整 prefill + 完整 mooncake transfer),只是 `_eviction_suffix()` 在 execution_mode 字符串末尾加了 "-after-prefill-backed-eviction" 标签。**没有任何 fast-path 优化**。v2 中仅 2/4449 请求走到这个标签。 |
|
||||
|
||||
---
|
||||
|
||||
## 4. D→P 增量同步:要做的是什么
|
||||
|
||||
完整 D→P 增量同步的设计目标:**让 P 端的 backup KV 在 direct-to-D append 完成后异步追上 D 端的 KV,让 reseed 退化为单次 P→D transfer(无需 P re-prefill)**。
|
||||
|
||||
### 抽象数据流
|
||||
|
||||
```
|
||||
当前:
|
||||
direct-to-D append: D 本地 append-prefill,P 端 backup 锁住不变
|
||||
reseed: P re-prefill 完整 50K + P→D transfer 完整 50K
|
||||
|
||||
目标:
|
||||
direct-to-D append: D 本地 append-prefill,**同时**异步把新增的 KV 块推回 P
|
||||
reseed: P→D' transfer 完整 50K (already up-to-date)
|
||||
无需 P re-prefill
|
||||
```
|
||||
|
||||
### 实现层面要改的事
|
||||
|
||||
按工程难度排序:
|
||||
|
||||
#### 4.1 Mooncake 双角色化(中等难度,~400 LOC)
|
||||
|
||||
- `BaseKVSender` / `BaseKVReceiver` 抽象保留,但允许同一 worker 同时实例化两种角色
|
||||
- `MooncakeKVManager.__init__` 把 PREFILL / DECODE 分支改成"role set",允许 worker 同时持有 sender 和 receiver
|
||||
- 新增 `DecodeKVSender` 类(D 端用于把 append KV 推回 P)
|
||||
- 新增 `PrefillKVReceiver` 类(P 端用于接收 D 的 append KV)
|
||||
- 引入第二个 bootstrap channel(避免与原 P→D 通道在 buffer pointer 协商上冲突)
|
||||
|
||||
#### 4.2 D 端 append commit hook(容易)
|
||||
|
||||
- 每次 `direct-to-D-session` 完成后,识别新写入的 KV 块(D scheduler 在 commit 时知道)
|
||||
- 入队 D→P 传输(异步,不阻塞 next request)
|
||||
- 标记 backup 是否成功送达 P(用于后续 reseed 决策)
|
||||
|
||||
#### 4.3 P 端 radix tree 多生产者扩展(**最难,工程量主体**)
|
||||
|
||||
**这是真正的架构 blocker**。SGLang 的 P 端 radix cache 当前假设:
|
||||
- 单一生产者(本 worker 的 model 输出)
|
||||
- 树插入只在 prefill / decode 完成时发生
|
||||
- KV 索引由本 worker 的 token_to_kv_pool_allocator 分配
|
||||
|
||||
要让 P 接收 D 喂来的 KV 块,需要:
|
||||
- 扩展 radix tree 节点的写入路径,允许"外部供给的 KV + token 序列"被插入
|
||||
- 处理 KV 索引重映射(D 的 slot 号在 P 上无意义)
|
||||
- 处理 reference counting(同一 session 可能既被本 worker 用、又被 D 喂回更新)
|
||||
- 处理 eviction policy 协调(P 端 radix LRU 不应让"被 D 喂入的 backup"先被驱逐)
|
||||
- 处理 KV 数据格式的跨 worker 兼容(同样的 model layout,应该是 trivial,但需要测试)
|
||||
|
||||
#### 4.4 agentic-pd-hybrid 端 hook(容易)
|
||||
|
||||
- `_invoke_session_direct` 完成后,新增一步:触发 D→P 同步 RPC(异步)
|
||||
- `_invoke_kvcache_seeded_router` 在 reseed 触发前先 probe P 是否有 up-to-date backup;若有,跳过 re-prefill,只做 P→D transfer
|
||||
- 新增 CLI flag `--enable-d-to-p-sync`,默认 off,保留 baseline 行为
|
||||
- 新增 structural log channel 记录 D→P 同步事件 / 失败 / 延迟
|
||||
|
||||
### 实现完毕后的预期收益
|
||||
|
||||
| 指标 | 当前 (v2) | RDMA only | RDMA + D→P sync |
|
||||
|---|---:|---:|---:|
|
||||
| reseed re-prefill 段 | 1.5-3s | 1.5-3s(不变) | **~0**(已有 up-to-date backup) |
|
||||
| reseed transfer 段 | 1.5-4s | 0.2-0.4s | 0.2-0.4s |
|
||||
| reseed 总耗时 | 3-7s | 1.7-3.4s | **0.2-0.4s** |
|
||||
| TTFT p99 | 1.285s | ~0.7s | **~0.4-0.5s**(与 DP 接近或胜过) |
|
||||
| 8.4% slow path 占比 | 不变 | 不变 | 可能保持但单次代价大幅下降 |
|
||||
|
||||
→ 这就是 paper 里 future-work 应当声明的**"完整版 KVC 才能真正在 TTFT 全分位数上击败 DP"** 的路径。
|
||||
|
||||
---
|
||||
|
||||
## 5. 仓库分支审查(确认无作者私下实现)
|
||||
|
||||
`git ls-remote origin --refs` 完整结果:
|
||||
|
||||
```
|
||||
9ccd853... refs/heads/kvc-debug-journey-v1-to-v4 ← 本工作分支(含本文档)
|
||||
e9062b1... refs/heads/main ← baseline,落后我们 18 commit
|
||||
```
|
||||
|
||||
- **服务器只有 2 个分支**,**0 个 tag**,**0 个隐藏 ref**
|
||||
- main 是更老的 baseline;含 `_commit_prefill_backup_residency` 等同名函数,但语义与本工作分支一致——都是静态 backup,无 D→P 同步
|
||||
- 全 git 历史搜索 `D->P / d-to-p / decode.*prefill.*transfer / kv.*pushback / kv.*sync / incremental / mirror` 关键词,**唯一命中是 commit `9ccd853`**(本文档相关的 doc 改动)
|
||||
- 唯一 remote 是 `origin`(`git@ipads.se.sjtu.edu.cn:wangjh/agentic-pd-hybrid.git`),无 upstream / fork
|
||||
|
||||
→ **作者没有在其它分支偷偷实现 D→P**。这块工作是真空。
|
||||
|
||||
---
|
||||
|
||||
## 6. 下一步
|
||||
|
||||
按 ROI 排序:
|
||||
|
||||
### 必做(落地下一阶段)
|
||||
|
||||
1. **新开 `feat/d-to-p-sync` 分支** 从当前 `kvc-debug-journey-v1-to-v4` 起步
|
||||
2. 写设计文档 `docs/D_TO_P_SYNC_DESIGN_ZH.md`:
|
||||
- 包括上面 §4 的实现细节
|
||||
- 添加 sequence diagram(P/D 通信时序)
|
||||
- 评估 SGLang radix tree 多生产者扩展的具体 API 改动
|
||||
- 评估 D→P 同步对 direct-to-D fast path 自身延迟的影响(理想是异步零开销)
|
||||
3. POC 阶段 1:mooncake 双角色化 + 一个能跑通的 D→P transfer 单测
|
||||
4. POC 阶段 2:P 端 radix tree 多生产者扩展(重点工程量)
|
||||
5. POC 阶段 3:agentic-pd-hybrid 端的 hook + flag
|
||||
6. 端到端验证:跑同 trace 同 ts=1 配置,目标 TTFT p99 < 0.5s
|
||||
|
||||
### 推荐
|
||||
|
||||
7. **同时启用真 RDMA**(独立于 D→P 工作,只需改 sweep 脚本加 `--force-rdma --ib-device mlx5_0`),先把现有 transfer 段加速作为 baseline
|
||||
8. **跑 RDMA-only 对照**:先证明单 RDMA 启用能把 TTFT p99 从 1.28s 压到 ~0.7s,再用 D→P sync 把剩下的 re-prefill 段也吃掉。这样 paper 里能写两条独立的 ablation
|
||||
|
||||
### 不要做的事
|
||||
|
||||
- 在 main / 工作分支上做 D→P 实验(隔离开),主分支应该保持 v2 稳定
|
||||
- 试图通过 capacity-backup 现有 flag "调出"D→P 效果——它结构上做不到
|
||||
|
||||
---
|
||||
|
||||
## 附录 A:本文档涉及的代码位置
|
||||
|
||||
| 函数 / 字段 | 位置 |
|
||||
|---|---|
|
||||
| `_commit_prefill_backup_residency` | `src/agentic_pd_hybrid/replay.py:1483` |
|
||||
| `_reserve_prefill_backup_capacity` | `src/agentic_pd_hybrid/replay.py:1572` |
|
||||
| `_close_prefill_session` | `src/agentic_pd_hybrid/replay.py:1507` |
|
||||
| `_close_decode_session` | `src/agentic_pd_hybrid/replay.py:1539` |
|
||||
| `_invoke_session_direct` (direct-to-D 路径) | `src/agentic_pd_hybrid/replay.py:2719` |
|
||||
| `_invoke_decode_session_direct` | `src/agentic_pd_hybrid/replay.py:2826` |
|
||||
| `_invoke_kvcache_seeded_router` (reseed 路径) | `src/agentic_pd_hybrid/replay.py:2101` |
|
||||
| `DirectSessionState.prefill_resident_tokens` | `src/agentic_pd_hybrid/replay.py:128` |
|
||||
| `_eviction_suffix` | `src/agentic_pd_hybrid/replay.py:1220` |
|
||||
| `--kvcache-prefill-backup-policy` CLI flag | `src/agentic_pd_hybrid/cli.py:182-189, 436-441` |
|
||||
| `MooncakeKVManager.__init__` | `third_party/sglang/python/sglang/srt/disaggregation/mooncake/conn.py:187-256` |
|
||||
| `start_decode_thread` (decode 端 receive loop) | `third_party/sglang/python/sglang/srt/disaggregation/mooncake/conn.py:1425-1496` |
|
||||
| `add_transfer_request` (assert PREFILL) | `third_party/sglang/python/sglang/srt/disaggregation/mooncake/conn.py:1563` |
|
||||
| `MooncakeKVSender` / `MooncakeKVReceiver` | `third_party/sglang/python/sglang/srt/disaggregation/mooncake/conn.py:1648, 1740` |
|
||||
| `BaseKVSender` / `BaseKVReceiver` 抽象 | `third_party/sglang/python/sglang/srt/disaggregation/base/conn.py` |
|
||||
| `session_aware_cache.release_session` | `third_party/sglang/python/sglang/srt/mem_cache/session_aware_cache.py:250-276` |
|
||||
| `session_controller._close` | `third_party/sglang/python/sglang/srt/managers/session_controller.py:293-316` |
|
||||
|
||||
## 附录 B:相关 commit
|
||||
|
||||
| Commit | 内容 |
|
||||
|---|---|
|
||||
| `9ccd853` | docs: D→P 缺口的 Opus forensic audit 写入 V2_DEEP_ANALYSIS §4.2 + KVC_ROUTER_ALGORITHM §9 |
|
||||
| `2ec0deb` | v2 实现(reset-on-success + threshold 2048→8192)—— 直接 trigger 了对 reseed 慢路径的关注 |
|
||||
| `c47adaf` | feat: backpressure pause hint(与 reseed 不直接相关,但展示了"D 端可主动告知 router"的通信通道存在,是未来 D→P sync 控制平面的潜在基础) |
|
||||
|
||||
## 附录 C:相关 paper 章节建议
|
||||
|
||||
- **§Background**:把 §1-§2 的 reseed 现状作为 motivation 摆出
|
||||
- **§Algorithm**:参考 `KVC_ROUTER_ALGORITHM.md` Algorithm 1-3
|
||||
- **§Evaluation §Slow Path Cost**:把 §2 的端到端时间线作为 Figure(sequence diagram)
|
||||
- **§Future Work / Limitations**:把本文 §4 作为 KVC 真正实现"完整 fast path 替代"的 roadmap,引用 D→P 工作的设计文档(后续 `feat/d-to-p-sync` 分支产物)
|
||||
|
||||
---
|
||||
|
||||
**核心句**:v2 实现的 KVC 在 91.6% 请求上证明了 session-affinity 路由的价值,但 8.3% 的 reseed 慢路径让 TTFT p99 比 DP 差 3×。这条慢路径的 50% 时间在 P 端 re-prefill、50% 在 mooncake transfer——RDMA 只能救后者,**D→P 增量 KV 同步是唯一能消除 re-prefill 的机制**,且当前在框架、SGLang、mooncake 三层都没有实现,需要新建 `feat/d-to-p-sync` 分支从设计文档开始。
|
||||
165
docs/SGLANG_PATCH_INVENTORY_ZH.md
Normal file
165
docs/SGLANG_PATCH_INVENTORY_ZH.md
Normal file
@@ -0,0 +1,165 @@
|
||||
# Vendored SGLang Patch — 归类清单
|
||||
|
||||
**日期**:2026-05-13
|
||||
**基线**:clean SGLang v0.5.10 snapshot @ `bded083`
|
||||
**当前 HEAD**:`origin/h200-cu130` + 本分支 (785 行新增 / 17 行删除 / 10 文件)
|
||||
**目的**:让 reviewer 与下一个合作者一眼看清"哪些 patch 是核心机制、哪些是 workaround、哪些可以在 refactor 后下线"。对应 [AUDIT_AND_ROADMAP_ZH.md](AUDIT_AND_ROADMAP_ZH.md) §3.2 / §S6 的工程债项。
|
||||
|
||||
---
|
||||
|
||||
## 0. TL;DR
|
||||
|
||||
| 分类 | 文件数 | 行数(估) | 命运 |
|
||||
|---|---:|---:|---|
|
||||
| MUST-HAVE — 核心机制(Algorithm 1/2/3、streaming session lifecycle、admit RPC) | 6 | ~450 | 长期保留,是 paper claim 的核心 |
|
||||
| WORKAROUND — 已识别的 latent 问题修补,应在 refactor 后下线 | 2 | ~150 | block-level eviction refactor 完成后大量删除 |
|
||||
| EXPERIMENTAL — 未闭环的特性,论文不依赖 | 1 | ~60 | 可下线或保留为 future-work hook |
|
||||
| INSTRUMENTATION — 诊断 / 日志 | 1 | ~50 | 保留但应隔离到 debug build |
|
||||
| MINOR — 杂项 | 1 | ~3 | 不影响决策 |
|
||||
|
||||
**关键指引**:当 block-level eviction refactor([BLOCK_LEVEL_EVICTION_DESIGN_ZH.md](BLOCK_LEVEL_EVICTION_DESIGN_ZH.md))完成时,WORKAROUND 类的 ~150 行应同步删除。E3 触发的 `schedule_batch.py` invariant landmine 是这条路径上的产物,不修引擎而是修 evict 粒度才是正解。
|
||||
|
||||
---
|
||||
|
||||
## 1. 文件粒度清单
|
||||
|
||||
### 1.1 `mem_cache/session_aware_cache.py` — MUST-HAVE *(待 refactor)*
|
||||
|
||||
| 项目 | 内容 | 引入 | 分类 |
|
||||
|---|---|---|---|
|
||||
| `SessionSlot` dataclass | streaming session 跨 turn 复用 KV 的 metadata | b8e6f13 | MUST-HAVE |
|
||||
| `last_access_time` 字段 | LRU 决策需要 | 6e5ed8d | MUST-HAVE |
|
||||
| `match_prefix` / `cache_finished_req` / `cache_unfinished_req` 的 streaming 分支 | session 复用快路径 | b8e6f13 | **MUST-HAVE → 待 refactor**(block-level evict 后语义大改) |
|
||||
| `release_session` 直接 `free(kv_indices)` | session 退出时一次性归还 KV | b8e6f13 | **WORKAROUND → 替换**(refactor 后改为只 `dec_lock_ref`) |
|
||||
| `slot_held_tokens` / `get_session_status` / `list_session_statuses` | 状态查询 | 6e5ed8d | MUST-HAVE |
|
||||
|
||||
**说明**:本文件是 KVC 设计的中枢。block-level eviction refactor([BLOCK_LEVEL_EVICTION_DESIGN_ZH.md](BLOCK_LEVEL_EVICTION_DESIGN_ZH.md) §3.1–§3.6)改造的就是这里。`SessionSlot` 的 5 个 KV-ownership 字段(`req_pool_idx` / `kv_committed_len` / `kv_allocated_len` / `cache_protected_len` / `swa_evicted_seqlen`)应在 refactor 后删除;这部分**将由 commit message 单独标记**,方便回滚。
|
||||
|
||||
### 1.2 `managers/scheduler.py` — 混合类别
|
||||
|
||||
D worker 端的 Algorithm 2 实现,含多个独立 patch。按行级归类:
|
||||
|
||||
| 函数 / 行段 | 内容 | 分类 | 何时可下线 |
|
||||
|---|---|---|---|
|
||||
| `admit_direct_append(...)` | Algorithm 2 的 D 端 admission RPC handler | **MUST-HAVE** | 不下线(论文核心) |
|
||||
| `_should_allow_local_prefill_on_decode(req)` | 决定 decode worker 是否接受无 bootstrap 的本地 append-prefill | **MUST-HAVE** | 不下线 |
|
||||
| `_decode_session_cache_low_watermark_tokens()` | 水位线参数读取 | **WORKAROUND** | block-level evict 后由 radix LRU 取代 |
|
||||
| `_decode_session_cache_target_available_tokens()` | 目标可用 token 数计算 | **WORKAROUND** | 同上 |
|
||||
| `maybe_trim_decode_session_cache(...)` | 主动 trim session(触发 `release_session`) | **WORKAROUND** | 同上:refactor 后 radix LRU 自然蚕食,trim 不再必要 |
|
||||
| `_compute_backpressure_pause_hint(...)` | 给 router 的 pause 提示 | **EXPERIMENTAL** | 信号未闭环([REAL_ALI_KVC_EXPERIMENT_LOG_ZH.md](../docs/archive/) §4.3),路线图 §S10;可保留为 future work hook |
|
||||
| `_compute_pool_breakdown_for_diagnostics()` | 池状态快照供 `/server_info` | **INSTRUMENTATION** | 长期保留但建议门 flag 化 |
|
||||
|
||||
### 1.3 `managers/schedule_batch.py` — WORKAROUND(待删除)
|
||||
|
||||
| 项目 | 内容 | 引入 | 分类 |
|
||||
|---|---|---|---|
|
||||
| streaming-session `extend_input_len` correction (lines ~1572–1585) | 在 fill_ids < prefix_indices 时把 extend_input_len 改为 0 | b8e6f13 | **WORKAROUND** |
|
||||
| pre-filter pass dropping `fill_ids < prefix_indices` reqs | E3 触发 assertion 后的 hotfix(commit 986f351) | 986f351 | **WORKAROUND** |
|
||||
| invariant assert `seq_len - pre_len == req.extend_input_len` 的容忍逻辑 | 与 correction 配套 | b8e6f13 | **WORKAROUND** |
|
||||
|
||||
**全部** ~85 行在 block-level eviction refactor 完成后**应整体删除**——`BLOCK_LEVEL_EVICTION_DESIGN_ZH §3.7` 已说明 refactor 后该不变量结构上必然成立,correction 路径无需存在。E3 的 landmine ([E3_FINDINGS_ZH.md](E3_FINDINGS_ZH.md) §2) 是该 workaround 的产物。
|
||||
|
||||
### 1.4 `managers/session_controller.py` — MUST-HAVE
|
||||
|
||||
| 项目 | 内容 | 分类 |
|
||||
|---|---|---|
|
||||
| streaming session lifecycle hooks(open / close / admit signal) | 让 P/D worker 知道何时开始 / 结束一个 streaming session | MUST-HAVE |
|
||||
| session ID 路由 | 让 admission RPC 找到正确的 SessionSlot | MUST-HAVE |
|
||||
|
||||
不下线。
|
||||
|
||||
### 1.5 `managers/io_struct.py` — MUST-HAVE
|
||||
|
||||
| 项目 | 内容 | 分类 |
|
||||
|---|---|---|
|
||||
| `AdmitDirectAppendReqInput` / `AdmitDirectAppendReqOutput` | admit RPC 的请求 / 响应消息类型 | MUST-HAVE |
|
||||
| backpressure pause hint 字段 | 同上消息的 optional 字段 | EXPERIMENTAL |
|
||||
|
||||
可以把 EXPERIMENTAL 字段折叠到 MUST-HAVE 消息里保持兼容;本身不构成下线压力。
|
||||
|
||||
### 1.6 `managers/tokenizer_communicator_mixin.py` — MUST-HAVE
|
||||
|
||||
admit RPC 的 communicator-side glue。19 行,不下线。
|
||||
|
||||
### 1.7 `entrypoints/http_server.py` — MUST-HAVE
|
||||
|
||||
`/admit_direct_append` HTTP endpoint 注册。6 行。
|
||||
|
||||
### 1.8 `disaggregation/decode.py` — 混合类别
|
||||
|
||||
| 项目 | 内容 | 分类 |
|
||||
|---|---|---|
|
||||
| `DecodeReqToTokenPool`: `assert len(reusing) <= 1` 放宽 | 让 local append-prefill 在一个 batch 里复用多个 req_pool_idx | **MUST-HAVE** |
|
||||
| `DecodePreallocQueue` 引入 `refresh_allocatable_tokens` + `maybe_trim_decode_session_cache` 触发 | pool 满时主动 trim session | **WORKAROUND**(refactor 后改由 radix LRU 自然 shed) |
|
||||
| `--disaggregation-decode-allow-local-prefill` flag | 服务端 opt-in 本地 append-prefill | **MUST-HAVE** |
|
||||
|
||||
trim 触发逻辑 ~30 行在 refactor 后应删除。
|
||||
|
||||
### 1.9 `server_args.py` — MUST-HAVE
|
||||
|
||||
| 项目 | 内容 | 分类 |
|
||||
|---|---|---|
|
||||
| `--radix-eviction-policy priority` 选项 | E1/E2 实验需要 | MUST-HAVE |
|
||||
| `--disaggregation-decode-allow-local-prefill` flag | 见 §1.8 | MUST-HAVE |
|
||||
|
||||
13 行,全部是 CLI 接口扩展,不下线。
|
||||
|
||||
### 1.10 `disaggregation/mooncake_transfer_engine.py` — MINOR
|
||||
|
||||
3 行小调整。不构成决策点。
|
||||
|
||||
---
|
||||
|
||||
## 2. 按分类汇总
|
||||
|
||||
### 2.1 MUST-HAVE(保留)
|
||||
|
||||
约 6 个文件、450 行:
|
||||
- `admit_direct_append` 主链路(Algorithm 2):scheduler + io_struct + tokenizer_communicator_mixin + http_server + session_controller
|
||||
- `SessionSlot` 主链路(streaming session lifecycle):session_aware_cache 多数字段、session_controller
|
||||
- CLI / server interface:server_args、decode.py 的 `allow_local_prefill`
|
||||
|
||||
### 2.2 WORKAROUND(block-level evict refactor 后删除)
|
||||
|
||||
约 2.5 个文件、150 行:
|
||||
- `session_aware_cache.release_session` 的 token-free 路径
|
||||
- `scheduler.py` 的 `_decode_session_cache_*_watermark_tokens` + `maybe_trim_decode_session_cache`
|
||||
- `schedule_batch.py` streaming-session correction + drop-pre-filter(含 E3 landmine 的 hotfix)
|
||||
- `decode.py` `DecodePreallocQueue` 中的 trim 触发
|
||||
|
||||
→ 这些 patch 的存在是当前架构的产物;refactor 后应整段删除而不是修小 bug。
|
||||
|
||||
### 2.3 EXPERIMENTAL(未闭环)
|
||||
|
||||
约 60 行:
|
||||
- backpressure pause hint(`_compute_backpressure_pause_hint` + io_struct 字段):可作为未来 control-plane 反馈机制的 hook 保留;若 1 个月后仍未接通,下线
|
||||
|
||||
### 2.4 INSTRUMENTATION(长期保留但门 flag 化)
|
||||
|
||||
约 50 行:
|
||||
- `_compute_pool_breakdown_for_diagnostics` + 相关 `/server_info` 字段:建议加 `--enable-diagnostic-pool-snapshot` flag,避免 prod 路径背诊断开销
|
||||
|
||||
### 2.5 MINOR
|
||||
|
||||
约 3 行:忽略。
|
||||
|
||||
---
|
||||
|
||||
## 3. 维护约定
|
||||
|
||||
1. **新加 SGLang 改动必须落到本表**:在 commit message 用 `feat(sglang): ...` / `fix(sglang): ...` 前缀,并在 PR 描述声明落到 §2 哪一类。
|
||||
2. **不直接覆盖 upstream 文件**:所有 patch 必须可在 v0.5.10 上 git apply(保留 hunk header 整洁)。
|
||||
3. **删除 WORKAROUND 时同步删 doc**:refactor 完成的同一个 PR 应把本文表中对应行划掉。
|
||||
4. **不下放 EXPERIMENTAL 到主路径**:未闭环的 patch 必须默认 disabled。
|
||||
|
||||
---
|
||||
|
||||
## 4. 与路线图的衔接
|
||||
|
||||
- Milestone 1([AUDIT_AND_ROADMAP_ZH.md](AUDIT_AND_ROADMAP_ZH.md) §4)执行 block-level eviction refactor 时,**整段 §2.2 应该消失**——这是衡量 refactor 完成度的客观指标。
|
||||
- Milestone 2 把 control plane 拆层(§4.8)时,§2.3 backpressure pause hint 应或被启用、或被下线,不允许悬挂。
|
||||
- Milestone 3 引入 learning-based admission(§4.15)时,§2.1 的 `admit_direct_append` 接口应保持稳定,policy 替换在 router 侧而非 D 侧。
|
||||
|
||||
---
|
||||
|
||||
**核心句**:vendored SGLang 的 785 行不是 monolithic 黑箱——三分之二是核心机制(论文必备),三分之一是当前架构的 workaround(refactor 后可整段删)。reviewer 看到本表能立刻判断"哪些是 paper 的真贡献、哪些是 prototype 当前的临时支撑"。
|
||||
641
docs/TEAM_REPORT_AGENTIC_PD_HYBRID_ZH.md
Normal file
641
docs/TEAM_REPORT_AGENTIC_PD_HYBRID_ZH.md
Normal file
@@ -0,0 +1,641 @@
|
||||
# agentic-pd-hybrid 现框架性能与结构性问题报告
|
||||
|
||||
**对象**:项目团队同学
|
||||
**前置假设**:读者**没看过** v3-v6 KVC 实验日志
|
||||
**数据范围**:项目仓库 `outputs/` 下截止 2026-05-06 的全部实验产物
|
||||
**目的**:把"现状"和"问题"分别交代清楚,给后续改造提供共同事实基础
|
||||
|
||||
---
|
||||
|
||||
## 0. 给没看过实验的读者:基础概念速览
|
||||
|
||||
### 0.1 项目目标
|
||||
验证 **session-aware / KV-cache-aware P/D routing** 在 **agentic coding workload**(多轮 session、长 context、增量 append)上能否降低端到端延迟。基线对比对象是 vanilla SGLang xPyD。
|
||||
|
||||
### 0.2 三种部署机制(**这三个名词全程会用**)
|
||||
|
||||
| 机制 | 形态 | KV 流向 |
|
||||
|---|---|---|
|
||||
| **pd-disaggregation**("PD disagg") | P 和 D 是独立进程、分占不同 GPU | 每个请求 P 算 prefill → mooncake 推 KV → D 解码 |
|
||||
| **pd-colo**("DP",data-parallel) | 没有 PD 拆分,N 个独立完整 worker(每个自己 prefill+decode) | 没有 KV transfer;router 按 hash 分配请求 |
|
||||
| **kvcache-centric**("KVC") | 部署形态同 PD disagg;**D 上多了 SessionAwareCache**,能跨 turn 保留 session KV | 运行时决策:可走 direct-to-D(无 P)、可走 P→D disagg、可走带 reseed 的混合 |
|
||||
|
||||
**Direct-to-D**("D-direct"):KVC 的快路径——D 上已有该 session 的 KV,新 turn 在 D 本地做 append-prefill,零 P 介入、零 mooncake transfer。这是 KVC 理论上能省时间的核心。
|
||||
|
||||
**Fallback**:KVC admission 拒了 / 阈值不满足 / D 不健康时,退化到普通 PD disagg 路径。
|
||||
|
||||
**Routing policy**(与机制正交):
|
||||
- `default`:纯 round-robin
|
||||
- `sticky`:turn 2+ 黏到 session 的 last D
|
||||
- `kv-aware`:按 hash overlap + sticky 评分选 D(**KVC 必须配它**才能正确工作)
|
||||
|
||||
### 0.3 数据来源
|
||||
- Trace:`outputs/qwen35-swebench-50sess.jsonl`(SWE-Bench 抽样,4449 reqs / **52 sessions** / 每 session 8-150 turns / time-scale=10 / concurrency=32)
|
||||
- 模型:Qwen3.5-35B-A3B (TP4) 和 Qwen3-30B-A3B (TP1) 两组
|
||||
- 硬件:单机 8×H100 80GB,mooncake TCP loopback 模拟 P→D 传输
|
||||
|
||||
---
|
||||
|
||||
# 第一部份:性能数据现象
|
||||
|
||||
## 1.1 三种机制在 Qwen3.5-35B (TP4) SWE 50sess 上的表现
|
||||
|
||||
来源:`outputs/swebench-exps/`。
|
||||
|
||||
| Run | Mechanism | Policy | Errors | Lat mean | Lat P50 | Lat P99 | TTFT mean | TTFT P50 |
|
||||
|---|---|---|---:|---:|---:|---:|---:|---:|
|
||||
| `pd-disaggregation-default-20260426T202540Z` | pd-disagg | default | **0/4449** | 1.66s | 0.97s | 7.68s | 0.45s | 0.34s |
|
||||
| `pd-colo-default-20260426T210129Z` | pd-colo | default | **4447/4449** | – | – | – | – | – |
|
||||
| `pd-colo-default-20260427T033519Z` | pd-colo | default | **0/4449** | 1.77s | 0.86s | 9.67s | 0.29s | 0.25s |
|
||||
| `pd-colo-kv-aware-20260427T042034Z` | pd-colo | kv-aware | 469/4449 | 1.52s | 0.82s | 8.27s | 0.26s | 0.23s |
|
||||
| `pd-colo-kv-aware-20260427T044944Z` | pd-colo | kv-aware | **0/4449** | **1.57s** | 0.81s | 8.48s | **0.22s** | **0.17s** |
|
||||
| `kvcache-centric-default-worker-admission-20260426T210800Z` | KVC | default | **4390/4449** | – | – | – | – | – |
|
||||
|
||||
### 现象解读
|
||||
|
||||
**(1) pd-disagg 是稳定基线**:1.66s mean / 0 errors / 4199 cache hits(94.4%)。可以正常服务。
|
||||
|
||||
**(2) pd-colo(DP)有两次 run,第一次几乎全 crash,第二次稳定**:
|
||||
- 04-26 的 4447/4449 errors 来自 SGLang `--disaggregation-mode null` + Qwen3.5-35B-A3B(Mamba/GDN hybrid)的 `token_to_kv_pool_allocator memory leak` bug,crash 了
|
||||
- 04-27 的两次 pd-colo run 都跑通了。**`pd-colo-kv-aware-20260427T044944Z` 是这一组实验里跑分最好的配置**——0 errors / TTFT P50 = 0.171s(pd-disagg 的 50%)
|
||||
|
||||
**(3) KVC 在 SWE 35B 上的唯一一次 run 几乎全 crash**:4390/4449 = 98.7% errors。但**那 56 个跑通的 direct-to-D 请求性能优异**——Lat mean 1.24s,TTFT P50 0.081s,KV transfer 196 块(vs PD disagg 的 105K 块,**−99.8%**)。说明 KVC 机制本身有效,但 admission control 把绝大多数请求过滤掉了。
|
||||
|
||||
### 一句话:在 Qwen3.5-35B 上,**pd-colo + kv-aware 是头名**,KVC 机制配置不当几乎不可用。
|
||||
|
||||
---
|
||||
|
||||
## 1.2 同 trace 切到 Qwen3-30B (TP1):v1→v6 演进
|
||||
|
||||
为绕开 Mamba 模型的 SGLang bug,团队后续切到 Qwen3-30B-A3B (TP1) 跑 KVC 调优 sweep。**所有结果用同一份 SWE 50sess trace**,可以横向比较。来源:`outputs/qwen3-30b-tp1-*` 各目录。
|
||||
|
||||
### 1.2.1 各版本配置概览
|
||||
|
||||
| 版本 | 关键改动(一句话) |
|
||||
|---|---|
|
||||
| v2 | KVC + `--policy default`(这个 policy 选择 **是 bug**,下文 §2.5) |
|
||||
| v3 | KVC + `--policy kv-aware` |
|
||||
| v4 | v3 + replay 端 session soft_cap 从 4 抬到 16 |
|
||||
| v5 (Option D) | 把 admission 决策从 replay 估算改成 D worker 真实容量回答(`worker-mode admission`) |
|
||||
| v5+profile | v5 + 1Hz `/server_info` polling 做时序 instrument |
|
||||
| v6 P0 | v5 baseline 同配置 rerun ×3 验证可复现性 |
|
||||
|
||||
### 1.2.2 各版本同 trace 结果总表
|
||||
|
||||
| 版本 | Errors | Lat mean | Lat P50 | Lat P90 | Lat P99 | TTFT P50 | direct-to-D% |
|
||||
|---|---:|---:|---:|---:|---:|---:|---:|
|
||||
| **8-way DP cache-aware** | **0** | **1.43s** | **0.65s** | **3.61s** | **8.37s** | **0.093s** | – |
|
||||
| v3 1P7D KVC | 363 (8.2%) | 4.88s | 1.75s | 12.67s | 28.72s | 0.363s | 39% |
|
||||
| v3 2P6D KVC | 9 (0.2%) | 3.58s | 1.52s | 9.23s | 18.70s | 0.328s | 31% |
|
||||
| v4 1P7D cap=16 | 435 (10%) | 4.21s | 1.08s | 13.38s | 24.45s | 0.056s | 49% |
|
||||
| v4 2P6D cap=16 | 403 (9%) | 2.51s | 0.84s | 6.51s | 18.34s | 0.051s | 53% |
|
||||
| v5 1P7D Option D | 9 (0.2%) | 5.18s | 1.59s | 14.67s | 26.09s | 0.207s | 45% |
|
||||
| v5 2P6D Option D | 9 (0.2%) | 3.49s | 1.31s | 9.09s | 24.92s | 0.244s | 41% |
|
||||
| v5+profile 1P7D | 6 (0.1%) | 4.21s | 1.18s | 11.33s | 28.83s | 0.060s | 55% |
|
||||
| v5+profile 2P6D | **415 (9.3%)** | 3.23s | 1.11s | 8.36s | 20.26s | 0.168s | 41% |
|
||||
| v5 rerun ×3(无 profile) | **372 / 912 / 396** | 3.00–3.50s | 0.94–1.22s | 7.68–8.65s | 18.97–20.37s | 0.07–0.18s | 40-42% |
|
||||
|
||||
**8DP CA 在每一项指标都是头名**:
|
||||
- Latency mean **比所有 KVC 配置好 +43%~+260%**
|
||||
- TTFT P50 **0.093s**(KVC 最佳 v4 2P6D 是 0.051s——TTFT 单项 KVC 是有优势的,但被整体 P99 灾难抵消)
|
||||
- 0 errors(KVC 任一配置 errors 在 9-912 之间漂移)
|
||||
|
||||
### 1.2.3 v5+profile 的诡异:加 1Hz polling 让 errors 从 9 涨到 415
|
||||
|
||||
这条单独看:v5 baseline 跑出来 9 errors,加上 1Hz `/server_info` polling 之后 415 errors(**46×**)。原因机理见 §2.5。
|
||||
|
||||
### 1.2.4 v6 P0 用 ×3 rerun 验证可复现性,结果是不能复现
|
||||
|
||||
**关键事实**:v5 baseline 完全相同配置跑 3 次:
|
||||
|
||||
| Run | Errors | Lat mean | Lat P50 | TTFT P50 |
|
||||
|---|---:|---:|---:|---:|
|
||||
| rerun1 | **372** | 3.50s | 1.11s | 0.147s |
|
||||
| rerun2 | **912** | 3.00s | 0.94s | 0.071s |
|
||||
| rerun3 | **396** | 3.42s | 1.22s | 0.183s |
|
||||
|
||||
errors 漂移 **2.5×**(372→912)。Latency mean / P50 也漂移 ~30%。**这意味着 v3-v6 之前所有"single-run"对比的差异 < 30% 的都不可信。**
|
||||
|
||||
但要注意:**3 次 v5 中最优的 P50(0.94s)仍然比 8DP CA(0.65s)慢 1.45×**——这个差距大于 single-run variance,所以"DP 全胜 KVC"的头条结论不受 variance 影响。
|
||||
|
||||
### 1.2.5 一个有趣的反差:v4 vs v5
|
||||
|
||||
- v4:errors 多(~10%)、direct-to-D 占比高(53-58%)、整体 P50 较好(0.84s)
|
||||
- v5:errors 少(0.2%)、direct-to-D 占比降低(41-45%)、整体 P50 反而退步(1.31s)
|
||||
|
||||
**v5 没有让性能变好,只是把"硬错误"转成了"诚实拒绝"——v4 的 admission 是乐观估算,admit 进来后 D 装不下变成 mooncake 32s timeout(统计成 errors);v5 让 D 自己拍板,admit 拒得早,请求改走 fallback(统计成低 direct-to-D 率)。容量本身没变。**
|
||||
|
||||
---
|
||||
|
||||
## 1.3 microbench 上 KVC 击败 PD disagg —— 但本仓库没保留实际 run
|
||||
|
||||
`docs/PROJECT_OVERVIEW.md` 写明:
|
||||
|
||||
> micro-benchmark 上,`kvcache-centric` 可以比 `pd-disaggregation` 好。原因很简单:**session 少、D KV 放得下**,turn2+ 可以直接走 D session。
|
||||
|
||||
但 `outputs/` 里**没有** microbench 实际 run(只有 microbench trace 生成器 `microbench.py` 和它的几个示例 trace 文件)。所以 microbench 的"KVC 赢"是基于设计预期 + 历史口口相传,**没有可重现的产物**。
|
||||
|
||||
**这本身是个问题**——下文 §2.6 会解释 microbench 的默认参数(4 sessions × 30K input × 1K append)正好把所有 KVC 失效条件都规避掉了。
|
||||
|
||||
---
|
||||
|
||||
## 1.4 头条结论(Part 1 总结)
|
||||
|
||||
| 工作负载 / 模型 | 头名机制 | KVC 表现 |
|
||||
|---|---|---|
|
||||
| Microbench(8 session × 30K × 1K append) | KVC > PD disagg(无落地数据,按设计) | 设计上必然赢 |
|
||||
| SWE 35B (TP4) | **pd-colo + kv-aware**(1.57s mean, 0 errors) | KVC 唯一 run 中 98.7% errors |
|
||||
| SWE 30B (TP1) | **8-way DP cache-aware**(1.43s mean, 0 errors) | KVC 6 个配置全输;最佳的 v4 2P6D 慢 75%、errors 9% |
|
||||
|
||||
**真实 agentic 工作负载(SWE-Bench)上,KVC 机制目前没有任何配置能跑赢 naive DP cache-aware。**
|
||||
|
||||
---
|
||||
|
||||
# 第二部份:结构性问题分析
|
||||
|
||||
每条按 (1) 现象(实锤数据)、(2) 根因(代码位置)、(3) 影响量化 三段交代。
|
||||
|
||||
## 2.1 KvAwarePolicy 不感知 D 容量 + Session 永久 pin 在初始 D 上 ★ 最严重
|
||||
|
||||
### 2.1.1 现象(实锤)
|
||||
|
||||
**(a) 每个 session 整 run 中只访问 1 个 D**——基于 v5 rerun1/2/3 全部 4449×3 = 13347 条 metrics:
|
||||
|
||||
| Run | sessions | avg distinct-D-per-session |
|
||||
|---|---:|---:|
|
||||
| rerun1 | 52 | **1.00** |
|
||||
| rerun2 | 52 | **1.00** |
|
||||
| rerun3 | 52 | **1.00** |
|
||||
|
||||
3 次独立 run、156 次 session 实例,**没有一个** session 跨 D 迁移过。
|
||||
|
||||
**(b) Direct-to-D 命中率呈极端双峰**——以 rerun1 为例(其他两次形态相同):
|
||||
|
||||
| direct-to-D rate | session 数 |
|
||||
|---|---:|
|
||||
| 0–20%("饿死") | **15** |
|
||||
| 20–40% | 7 |
|
||||
| 40–60% | 11 |
|
||||
| 60–80% | 5 |
|
||||
| 80–100%("顺利") | **14** |
|
||||
|
||||
中间档稀少,两端拥挤。
|
||||
|
||||
**(c) 跨 3 次 run 一致饿死的 session = 13/52,且这些 session 的 input 是顺利 session 的 1.98×**:
|
||||
|
||||
```
|
||||
13 sessions starved (<20% direct-to-D) in ALL 3 runs
|
||||
avg peak input of consistently-starved sessions: 62043 tokens
|
||||
avg peak input of consistently-lucky sessions: 31344 tokens
|
||||
```
|
||||
|
||||
**结构性、可复现、与 session 大小强相关。** 排除"运气"假说。
|
||||
|
||||
### 2.1.2 根因(代码)
|
||||
|
||||
`policies.py:166-172` `KvAwarePolicy.select()` 评分函数:
|
||||
|
||||
```python
|
||||
score = (
|
||||
overlap + sticky * self.sticky_bonus, # 主项:历史 KV overlap
|
||||
sticky, # 二级
|
||||
inflight_penalty, # 三级
|
||||
assignment_penalty, # 四级
|
||||
)
|
||||
```
|
||||
|
||||
**评分中完全没有 D 当前容量项**。
|
||||
|
||||
session X 第一次落到 D-2 → 在 D-2 上积累 hash_id → 之后不管 D-2 多满,X 的 turn N+1 的 overlap 在 D-2 上仍是最大 → 永远选 D-2。即使 D-5 全空也轮不到。
|
||||
|
||||
`RoutingState.decode_resident_blocks` (`policies.py:46`) 还从不缩减——但因为 SWE trace 的 hash_ids 是 session-unique,**不缩减并不影响"选对 D",只影响内存**——真正问题在评分函数无容量项。
|
||||
|
||||
### 2.1.3 影响量化
|
||||
|
||||
- 25%(13/52)的 session 几乎每个 turn 走 fallback 路径
|
||||
- fallback 路径 mean lat 约 3.5s vs direct-to-D ~0.5s——**饿死 session 每 turn 慢 6×**
|
||||
- 这 13 个 session 还容易撞 mooncake 32s timeout(见 §2.2、§2.3),P99 完全由它们决定
|
||||
- **SLO 视角下:25% 的用户体验是系统性糟糕**
|
||||
|
||||
---
|
||||
|
||||
## 2.2 D 端 LRU 只能 evict idle session → 跟不上压力
|
||||
|
||||
### 2.2.1 现象(实锤)
|
||||
|
||||
来源:`outputs/qwen3-30b-tp1-v5-optD-baseline-rerun/.../logs/decode-{0..5}.log`,全 run 计数:
|
||||
|
||||
| D worker | "Trimmed decode session cache" 事件 | KVTransferError | 峰值 token_usage |
|
||||
|---|---:|---:|---:|
|
||||
| decode-0 | 9 | 0 | 0.99 |
|
||||
| decode-1 | 43 | 4 | 0.99 |
|
||||
| decode-2 | 16 | **153** | 0.97 |
|
||||
| decode-3 | 37 | 29 | 0.99 |
|
||||
| decode-4 | 28 | **90** | **1.00** |
|
||||
| decode-5 | 30 | **93** | **1.00** |
|
||||
|
||||
**所有 6 个 D 都顶到 token_usage ≥ 0.97,2 个顶到 1.00(KV 池完全耗尽)。LRU 触发 9-43 次,远不够——transfer 错误是 LRU 触发量的 5-10×。**
|
||||
|
||||
decode-2 极端:trim 16 次 vs error 153 次 = LRU 跑得比错误慢 9.5×。
|
||||
|
||||
### 2.2.2 根因(代码)
|
||||
|
||||
`scheduler.py:2040` 的 `evict_idle_streaming_sessions_lru` 实际只能 evict:
|
||||
|
||||
> 所有 req 都 finished + streaming 模式 + 该 session 没有 inflight transfer
|
||||
|
||||
但 SWE 高并发(concurrency=32 + time-scale=10 → effective inter-turn gap p50=0.25s)下,每个 session 几乎一直有 inflight req。**hot session 永远不 idle,LRU 永远找不到东西可踢。**
|
||||
|
||||
### 2.2.3 影响量化
|
||||
|
||||
- 单 run 累计 KVTransferError:6 个 D 之和 = **369 次**
|
||||
- 对应 ~8% 请求失败率(v5 errors 9/372/912 三次平均 ~430/4449 = 9.7%)
|
||||
- **每次 mooncake timeout = 32s**——直接构成 P99 18-26s 的尾巴
|
||||
|
||||
修复需要 SGLang 内部分层 eviction:除 idle session 外,按访问频率 / 时序加权强制 retract——**不在当前 KISS 边界**。
|
||||
|
||||
---
|
||||
|
||||
## 2.3 没有 D → Replay backpressure 通道
|
||||
|
||||
### 2.3.1 现象
|
||||
|
||||
§2.2 数据显示 D 顶到 token_usage=1.00 时仍在持续接收新请求,最终撞 mooncake 32s timeout。**整个错误链路里没有"D 过载,请慢点发"的反向信号**。
|
||||
|
||||
定量证据:rerun1 的 KVTransferError 时间分布——**98% 集中在 run 后半段**(参考 `KVC_DEBUG_JOURNEY_V1_TO_V5.md` §v4)。前期 D 容量充裕时正常,达到上限后**所有后续请求集中失败**——典型的"无 backpressure 系统在过载点雪崩"模式。
|
||||
|
||||
### 2.3.2 根因(代码)
|
||||
|
||||
链路:
|
||||
|
||||
```
|
||||
replay 端按 trace 时序 + concurrency=32 持续发请求
|
||||
↓
|
||||
PD Router 裸 round-robin (pd_router.py:43-49)
|
||||
↓
|
||||
P 收到请求做 prefill → mooncake 推 KV → D 端
|
||||
↓
|
||||
D 端 transfer queue 堆积 → 32s timeout
|
||||
↓
|
||||
errno 抛回 replay → fallback 路径,但 concurrency 不降
|
||||
```
|
||||
|
||||
D 端的 `admit_direct_append` 响应里**只有 can_admit/reason 等过去时字段,没有任何"建议节流"的指示**。
|
||||
|
||||
### 2.3.3 修复(本次代码改动已实现)
|
||||
|
||||
代码已加 `recommended_pause_ms` 字段:
|
||||
- `third_party/sglang/.../io_struct.py:DirectAppendAdmissionReqOutput` 增加 `recommended_pause_ms: int = 0`
|
||||
- `scheduler.py:_compute_backpressure_pause_hint`:按 `transfer_queue_depth`、`retracted_queue_depth`、`token_usage_after` 计算
|
||||
- `replay.py`:admission 响应里读到 hint → 更新 `DecodeResidencyState.pause_until_s[D]` → 下次发到该 D 之前 sleep
|
||||
- CLI flag:`--enable-backpressure`(默认 off,保留 baseline 行为)
|
||||
- 同时新增 3 个结构性日志(`structural/admission-events.jsonl` / `backpressure-events.jsonl` / `session-d-binding.jsonl`)
|
||||
|
||||
**待 GPU smoke 验证。预期 errors 从 ~370 降到 < 50;P99 改善(消除 32s timeout 尾巴);mean latency 可能略升(被强制 sleep)。**
|
||||
|
||||
修复脚本:`scripts/sweep_backpressure_smoke.sh`(4 个 run × 30-60 min);分析器:`scripts/analysis/analyze_backpressure_smoke.py`。
|
||||
|
||||
### 2.3.4 注意
|
||||
|
||||
backpressure 是**降级机制**,不是性能优化——它把"硬错误(32s timeout)"换成"主动等待"。整体 throughput 不会因此提升,但 P99 应大幅改善。
|
||||
|
||||
---
|
||||
|
||||
## 2.4 P-side round-robin 不感知 D 健康
|
||||
|
||||
### 2.4.1 现象(实锤)
|
||||
|
||||
来源:v5 rerun1 `prefill-{0,1}.log`,全 run 计数:
|
||||
|
||||
| Worker | KVTransferError | "Decode instance could be dead" | 请求量 |
|
||||
|---|---:|---:|---:|
|
||||
| prefill-0 | **367** | 361 | 2225 |
|
||||
| prefill-1 | **2** | 0 | 2224 |
|
||||
|
||||
**两 P 请求量完全均衡(round-robin),错误率差 180×**。日志里 prefill-0 的失败反复指向某个特定 D 的 IP(`to 10.45.80.47:XXXXX`)。
|
||||
|
||||
### 2.4.2 根因(代码)
|
||||
|
||||
`pd_router.py:43-49`:
|
||||
|
||||
```python
|
||||
prefill_url, bootstrap_port = self.config.prefill_urls[
|
||||
self.prefill_cursor % len(self.config.prefill_urls)
|
||||
]
|
||||
self.prefill_cursor += 1
|
||||
```
|
||||
|
||||
裸 round-robin。不感知:
|
||||
- P 当前 inflight transfer 数
|
||||
- 目标 D 的健康状态 / 容量
|
||||
|
||||
后果:当某个 D 进入 hot 状态时,被 round-robin 派去给它推 KV 的 P **持续失败**;另一个 P 接到的请求恰好命中健康 D,完全没事。**单 P 故障不会被路由层避开。**
|
||||
|
||||
### 2.4.3 影响量化
|
||||
|
||||
- prefill-0 几乎独自承担了**全部 KVTransferError 的 99%**(367/(367+2))
|
||||
- 如果 router P 选择能避开"正在和 hot D 死磕"的链路,这部分 ~8% 的整体错误率应可降到 < 1%
|
||||
|
||||
### 2.4.4 备注
|
||||
|
||||
这条结论目前来自单次 run 的 N=1 数据。需要跨 N≥3 次 rerun 验证一致性才能完全确信——加上 §2.1.1 (b/c) 也证明 P-D 链路绑定结构性强相关,"prefill-0 死磕某 D"很可能在每次 run 都重复(由初始 session 落点决定)。
|
||||
|
||||
---
|
||||
|
||||
## 2.5 Admission RPC 进 scheduler 主循环 → 自我干扰
|
||||
|
||||
### 2.5.1 现象(实锤)
|
||||
|
||||
v5 baseline 配置不开 polling:errors = 9
|
||||
完全相同配置 + 1Hz `/server_info` polling:errors = **415**(**46×**)
|
||||
|
||||
来源:`outputs/qwen3-30b-tp1-v5-optD/exp2_2p6d_kvc_optD_summary.json`(baseline 9 errors)vs `qwen3-30b-tp1-v5-optD-profile/exp2_2p6d_kvc_optD_profile_summary.json`(415 errors)。
|
||||
|
||||
### 2.5.2 根因(代码)
|
||||
|
||||
`/server_info`(被 polling 调用)和 `admit_direct_append` 都进 SGLang scheduler 主循环:
|
||||
|
||||
- `/server_info` → `scheduler.py:get_streaming_session_cache_status` → 遍历每个 session slot 计算 `is_idle`
|
||||
- `admit_direct_append` → 读 `token_to_kv_pool_allocator.available_size()` + 触发 `maybe_trim_decode_session_cache`
|
||||
|
||||
scheduler 主循环本身在跑 decode/prefill 的 forward。这些 RPC 进队列就和 forward 抢调度。
|
||||
|
||||
### 2.5.3 真实负载下 admission RPC 频率远高于 1Hz
|
||||
|
||||
- 4449 reqs / ~2700s ≈ **1.6 reqs/s**
|
||||
- 每个 turn 做 1-3 次 admission probe(direct-append + 可能的 seed retry)
|
||||
- × 8 worker = **每秒 ~16-40 次 admission RPC**
|
||||
|
||||
也就是 admission 流量本身比 1Hz polling 高一个量级。如果 1Hz polling 都能让 errors 涨 46×,admission 自己的扰动至少同等。
|
||||
|
||||
### 2.5.4 修复
|
||||
|
||||
不在本轮 KISS 内。设计方向是把 admission 拆成两个端点:
|
||||
- `POST /probe` → lock-free 读 snapshot(轻),90% 流量走这条
|
||||
- `POST /commit_evict` → 进 scheduler 队列,做实际 LRU(重),仅 probe 不够时调
|
||||
|
||||
这部分需要 SGLang 内部 atomic publish snapshot 到共享内存——**结构性改动**。
|
||||
|
||||
### 2.5.5 注意
|
||||
|
||||
v6 P0 的 ×3 baseline rerun(不开 polling)errors 也是 372/912/396——**polling 不是 415 唯一原因**。本身 v5 admission 设计就敏感,polling 是放大器。
|
||||
|
||||
---
|
||||
|
||||
## 2.6 Replay 时间被 time-scale=10 压缩 → 测量学失真
|
||||
|
||||
### 2.6.1 现象(实锤)
|
||||
|
||||
v5 rerun1 metrics 解出的真实 inter-turn gap 分布:
|
||||
|
||||
```
|
||||
原始 trace inter-turn gap (n=4397):
|
||||
p10=1.6s p50=2.5s p90=7.8s p99=25.1s max=261s
|
||||
|
||||
time-scale=10 实际 replay gap (= 原始 / 10):
|
||||
p10=0.16s p50=0.25s p90=0.78s p99=2.5s max=26s
|
||||
```
|
||||
|
||||
### 2.6.2 这意味着什么
|
||||
|
||||
真实 agentic 用户/agent 在每个 turn 之间停 **2-8 秒**——思考、打字、tool call 异步返回、agent reasoning。
|
||||
|
||||
`microbench.py:20-21` 的默认 `inter_turn_gap_s=1.0` + `session_stagger_s=0.1` 也大致符合这个量级(1 秒左右)。
|
||||
|
||||
但 SWE replay 设的 time-scale=10 把这个间隔**人为压到 0.25 秒**——D 还没消化完 turn N,turn N+1 就来了。
|
||||
|
||||
### 2.6.3 为什么这么设计
|
||||
|
||||
纯粹**节省测试时间**:
|
||||
- 原始 trace 跨度 ~6000s(≈100 分钟)
|
||||
- time-scale=10 → ~600s(≈10 分钟)
|
||||
- sweep 5 版本 × 3 重复 = 25h vs 2.5h
|
||||
|
||||
### 2.6.4 它扭曲了什么
|
||||
|
||||
1. **抹掉 D 的自然 idle 时间**:真实部署里每个 session 在 turn 间有几秒空窗,正好让 D 端 LRU 把它 evict 出去给其他 session 让位(§2.2 idle 判定)。time-scale=10 下几乎所有 session 一直忙——LRU 永远找不到 idle session。
|
||||
2. **人为提升并发压力**:concurrency=32 在 time-scale=10 下意味着 D 端持续承受 320 effective concurrent agents 的压力——远超真实部署。
|
||||
3. **掩盖 backpressure 等慢节奏机制的价值**:如果 inter-turn gap 是 2.5s,backpressure 让 replay 等 0.5s 几乎不影响吞吐;time-scale=10 下 0.5s 的 sleep 等于直接跳过下一个 turn。
|
||||
|
||||
### 2.6.5 严重性:所有 KVC vs DP 结论都带这个失真
|
||||
|
||||
**v3-v6 全部数据基于 time-scale=10**。所以"KVC 在 SWE 上输给 DP"的程度可能被 benchmark 放大。**真实部署里 inter-turn gap 是 2.5s 的话,KVC 可能根本不会撞到当前看到的容量瓶颈**。
|
||||
|
||||
这是项目当前**最严重但还没修的测量学问题**。修复成本极小(只是去掉 `--time-scale 10`),但意义重大——**P0 应该立刻跑一组 time-scale=1 baseline**(KVC + DP 各 N=3)。
|
||||
|
||||
---
|
||||
|
||||
## 2.7 direct-to-D append 阈值 = 2048 是个 magic number
|
||||
|
||||
### 2.7.1 现象(实锤)
|
||||
|
||||
`replay.py:51` 默认值:
|
||||
|
||||
```python
|
||||
kvcache_direct_max_uncached_tokens: int = 2048
|
||||
```
|
||||
|
||||
判定(`replay.py:2177`):当新 turn 的 uncached append > 2048 token 时,**禁止 direct-to-D**,请求改走 P→D reseed 路径。
|
||||
|
||||
实测 v5 rerun1 的 uncached append 分布(`input_length - cached_tokens`):
|
||||
|
||||
```
|
||||
所有 4449 请求:
|
||||
p10=50 p25=181 p50=610 p75=2907 p90=36495 p99=91600 max=103971
|
||||
|
||||
> 2048: 1222/4449 = 27.5%
|
||||
```
|
||||
|
||||
**双峰分布**:median 只有 610,但 p90 已经 36K。
|
||||
|
||||
### 2.7.2 根因(代码)
|
||||
|
||||
阈值是个 magic number——**没有任何代码注释解释为什么是 2048**,git log 里也没人调过它。
|
||||
|
||||
合理推测它存在的理由(按可信度):
|
||||
|
||||
| 理由 | 是否成立 |
|
||||
|---|---|
|
||||
| D 是 decode-tuned,max-prefill-tokens 通常 4-8K,append > 2K 会触发 D 内部多 chunk prefill 拖慢 decode | 强 |
|
||||
| 大 append 在 D 上 prefill 会阻塞当前正在 decoding 的其他 session 的 TPOT | 强 |
|
||||
| P 有更优化的 prefill kernel 和 batch | 弱(D 的 prefill kernel 同源) |
|
||||
| 工程上的"安全默认值",没认真测过 | 强(git log 印证) |
|
||||
|
||||
### 2.7.3 但更严重的 bug:execution_mode 标签命名错位
|
||||
|
||||
`execution_mode` 名字里带 "large-append" 的请求一共 **2060 个**,其中:
|
||||
|
||||
- **1222 个(59.3%)实际 uncached append ≤ 2048**
|
||||
|
||||
也就是说,**"large-append" 这个标签名对超过一半的实例是错的**。看 `replay.py:2168-2178` 的判断:
|
||||
|
||||
```python
|
||||
if (
|
||||
_should_bypass_prefill(...) # 要求 overlap > 0
|
||||
and direct_append_length is not None
|
||||
and direct_session_reused # 要求 session 在本 D 上 opened 过
|
||||
and not direct_session_reset
|
||||
and direct_append_length <= config.kvcache_direct_max_uncached_tokens
|
||||
):
|
||||
# direct-to-D
|
||||
else:
|
||||
# 进入 "large-append" 分支
|
||||
```
|
||||
|
||||
**这个 else 分支的 5 个进入条件里,"append > 2048" 只是其中一个。** session 不在本 D 上、被 evict 过、overlap=0 都会进这个分支,但 `execution_mode` 仍然写 `pd-router-fallback-large-append-*`——导致看 metrics 的人误以为问题是 append 太大。
|
||||
|
||||
### 2.7.4 实际:阈值不是主要瓶颈,session 不在 D 上才是
|
||||
|
||||
把 turn≥2 的请求按"append 是否 > 2048"和"实际 execution mode"交叉:
|
||||
|
||||
```
|
||||
Turn≥2 小 append (≤2048), n=3129:
|
||||
1854 (59%) kvcache-direct-to-d-session ← 走通了
|
||||
1141 (37%) pd-router-fallback-large-append-session-cap ← 标签骗人
|
||||
...
|
||||
|
||||
Turn≥2 大 append (>2048), n=1216:
|
||||
813 (67%) pd-router-fallback-large-append-session-cap
|
||||
365 (30%) kvcache-centric (失败)
|
||||
22 pd-router-large-append-reseed ← 真正受阈值影响的
|
||||
...
|
||||
```
|
||||
|
||||
**真正因 append > 2048 而失败的请求**:约 50 个(large-append-reseed + 部分 large-append fallback),仅占总数 1-2%。
|
||||
|
||||
**绝大多数 fallback 实际是 §2.1 的 session 不在 D 上**——名字里带 "large-append" 是误导。
|
||||
|
||||
### 2.7.5 修复
|
||||
|
||||
两件事:
|
||||
1. 把 `execution_mode` 标签按真实原因细分——把 "large-append" 拆成 "session-not-resident" / "real-large-append" / "session-reset" 等
|
||||
2. 阈值本身可以做 sweep(2048 / 4096 / 8192 / 16384)找最优——但收益空间有限(最多改善那 1-2% 的请求)
|
||||
|
||||
---
|
||||
|
||||
## 2.8 跨 run variance 巨大:N=1 不可信
|
||||
|
||||
### 2.8.1 现象(实锤)
|
||||
|
||||
v5 baseline 完全相同配置跑 3 次(`qwen3-30b-tp1-v5-optD-baseline-rerun/`):
|
||||
|
||||
| Run | Errors | Lat mean | Lat P50 | TTFT P50 |
|
||||
|---|---:|---:|---:|---:|
|
||||
| rerun1 | 372 | 3.50s | 1.11s | 0.147s |
|
||||
| rerun2 | **912** | 3.00s | 0.94s | 0.071s |
|
||||
| rerun3 | 396 | 3.42s | 1.22s | 0.183s |
|
||||
|
||||
errors 漂移 **2.5×**(372→912),P50 latency 漂移 ~30%,TTFT P50 漂移 **2.6×**。
|
||||
|
||||
### 2.8.2 根因(推测)
|
||||
|
||||
源头不止一个,至少包含:
|
||||
|
||||
1. **§2.1 + §2.2 的复合**:D 容量过载是临界点附近的非线性系统——initial session-to-D assignment 的随机性决定了哪个 D 先饱和。
|
||||
2. **mooncake TCP loopback 的随机性**:单机 loopback 的 32s timeout 触发概率受当前 GPU 内存碎片、PCIe 状态影响。
|
||||
3. **scheduler 主循环里 admission RPC 与 decode 抢资源的随机性**(§2.5)。
|
||||
|
||||
### 2.8.3 影响
|
||||
|
||||
**所有 single-run 比较 < 30% 差异都不可信**。这意味着:
|
||||
- v3 vs v4 的 P50 差异(1.75s vs 1.08s)勉强有意义(差异 38%)
|
||||
- v4 vs v5 的 P50 差异(0.84s vs 1.31s)勉强有意义(差异 56%)
|
||||
- v5+profile 的 1P7D vs baseline(mean 4.21s vs 5.18s)→ 差异 18%,**不可信**
|
||||
- 所有 `direct-to-D 占比 ±5%` 的差异都是噪声
|
||||
|
||||
### 2.8.4 这条规则要求所有后续实验
|
||||
|
||||
**要任何 KVC 配置间或 KVC vs DP 的对比,最少跑 N=3,最好 N=5。** 不跑 N≥3 的实验在做"碰运气科研"。
|
||||
|
||||
8h 一次 sweep 装不下 N=3 + 多版本对比,所以必须**牺牲版本数量保 N≥3**。
|
||||
|
||||
---
|
||||
|
||||
## 2.9 microbench 的 KVC 优势不能外推到真实 agentic
|
||||
|
||||
`microbench.py:13-22` 默认参数:
|
||||
|
||||
| 维度 | 默认值 |
|
||||
|---|---|
|
||||
| `session_count` | 8 |
|
||||
| `turns_per_session` | 3 |
|
||||
| `initial_input_length` | 10000 |
|
||||
| `append_input_length` | **1000** ← 低于 §2.7 的 2048 阈值 |
|
||||
| `output_length` | 1000 |
|
||||
| `inter_turn_gap_s` | **1.0** ← 接近真实 agentic |
|
||||
| `session_stagger_s` | 0.1 |
|
||||
|
||||
**与 SWE workload 的关键维度对比**:
|
||||
|
||||
| 维度 | microbench | SWE 50sess |
|
||||
|---|---|---|
|
||||
| Session 数 | 4-8 | 52 |
|
||||
| Per-session peak input | ~31K | median 49K, max 104K |
|
||||
| 总 working-set / 7D 容量(92K each) | 0.19×(5× 冗余) | **3.95×(4× 过载)** |
|
||||
| Append size 是否过 2048 | 几乎 100% 过不到 | 28% 超过 |
|
||||
| Session 数是否过 cap | 4 ≤ 28(v3 cap×7D) | 52 远超 |
|
||||
|
||||
**Microbench 把 KVC 的所有失效条件都规避了**:容量充裕、append 卡阈值之下、session 数远低于 cap、inter-turn gap 接近真实——这一组参数让 KVC 五项判断(路由 / admission / 没被 evict / append ≤ 阈值 / 无 backpressure)全部通过 → 100% 走 direct-to-D 快路径。
|
||||
|
||||
**而 SWE workload 在每一项上都把 KVC 推过临界点。**
|
||||
|
||||
所以"KVC 在 microbench 赢 PD disagg"是个**弱命题**——它只证明了机制能跑,没有证明在真实 agentic 下能赢。
|
||||
|
||||
---
|
||||
|
||||
# 第三部份:一句话总结与下一步
|
||||
|
||||
## 现状一句话
|
||||
|
||||
> 在所有可比的真实 agentic workload(SWE 35B / 30B)上,**naive DP cache-aware 全胜 KVC 任何配置**,且差距 > 30%(远超 single-run variance)。Microbench 上 KVC 赢 PD disagg 的设计前提(容量富余、append 小、session 少)在真实 workload 下不成立。
|
||||
|
||||
## 排序后的结构性问题(按修复 ROI)
|
||||
|
||||
| 排名 | 问题 | 影响 | 修复成本 |
|
||||
|---|---|---|---|
|
||||
| **P0** | §2.6 time-scale=10 失真 → 所有 KVC vs DP 结论可能被 benchmark 放大 | 颠覆性 | 极低(改 flag) |
|
||||
| **P0** | §2.1 session 永久 pin + 容量盲选 | 25% session 永远饿死 | 中(改 policy) |
|
||||
| **P0** | §2.2 D-side LRU 跟不上 | ~8% errors 来自此 | 中(改 SGLang) |
|
||||
| P1 | §2.3 没 backpressure | 把 timeout 雪崩变可控 | **已实现**(待 GPU smoke) |
|
||||
| P1 | §2.4 P-side 不感知 D 健康 | 单 P 出错率差 180× | 中 |
|
||||
| P1 | §2.7 / 2.8 metrics 标签命名错位 | 数据解读经常出错 | 低(改字符串) |
|
||||
| P2 | §2.5 admission RPC 进 scheduler 主循环 | 自我干扰 | 高(结构改动) |
|
||||
| P2 | §2.8 N=1 不可信 | 实验方法学 | 0(团队约定) |
|
||||
|
||||
## 立刻能做的三件事
|
||||
|
||||
1. **跑 time-scale=1 baseline**(KVC v5 + 8DP CA 各 N=3,~6h GPU)—— 不修代码、单变量、决定后续路线。
|
||||
2. **跑 backpressure smoke**(已实现,4 run × ~30-60 min,~3-4h GPU)—— 验证 §2.3 修复的端到端效果。
|
||||
3. **修 metrics 标签命名**(`pd-router-fallback-large-append-*` → 按真实原因分类)—— 让以后看数据的人不会再被误导。
|
||||
|
||||
## 不立刻做但要重新讨论的
|
||||
|
||||
- **§2.1 capacity-aware policy**:之前考虑过的"评分加 capacity 项"会引入"换 D"的副作用(孤儿 KV、新 D 上仍可能饿死),需要跟 §2.2 的 D 端 hot retract 一起设计。
|
||||
- **§2.5 admission API 拆 probe / commit**:是结构性正确方向,但要动 SGLang 内部 + atomic publish 机制,不是 KISS。
|
||||
- **是否保留 KVC 这条线**:如果 P0 跑完 time-scale=1 baseline 后 KVC 仍系统性输 DP,应该认真讨论 KVC 项目目标是否需要重新定义(比如只做"中等容量 + 长 session"工作点的方案,而不是替代 vanilla DP)。
|
||||
|
||||
---
|
||||
|
||||
## 附录 A:本报告所有数据的来源
|
||||
|
||||
| 章节 | 数据源 |
|
||||
|---|---|
|
||||
| 1.1 SWE 35B | `outputs/swebench-exps/{pd-disagg,pd-colo,kvcache-centric}-*` |
|
||||
| 1.2 TP1 series | `outputs/qwen3-30b-tp1-{exps,v3-kvaware,v4-cap16,v5-optD,v5-optD-profile,v5-optD-baseline-rerun}/` |
|
||||
| 2.1 session pinning | `outputs/qwen3-30b-tp1-v5-optD-baseline-rerun/exp2_2p6d_run{1,2,3}_metrics.jsonl` |
|
||||
| 2.2 D LRU 计数 | `outputs/qwen3-30b-tp1-v5-optD-baseline-rerun/.../logs/decode-{0..5}.log` |
|
||||
| 2.4 P imbalance | `outputs/qwen3-30b-tp1-v5-optD-baseline-rerun/.../logs/prefill-{0,1}.log` |
|
||||
| 2.5 polling 影响 | v5 baseline summary vs v5+profile summary |
|
||||
| 2.6 inter-turn gap | rerun1 metrics 的 `trace_timestamp_s` 字段 |
|
||||
| 2.7 append 分布 | rerun1 metrics 的 `input_length - cached_tokens` |
|
||||
| 2.8 variance | rerun1/2/3 三组 summary |
|
||||
|
||||
## 附录 B:相关已有文档
|
||||
|
||||
- `docs/PROJECT_OVERVIEW.md` — 项目目标、microbench 结论
|
||||
- `docs/archive/AGENTIC_FIT_ANALYSIS_ZH.md` — 结构性缺陷的早期分析(本报告 §2 的来源)
|
||||
- `docs/archive/KVC_DEBUG_JOURNEY_V1_TO_V5.md` — v1→v5 详细演进日记
|
||||
- `docs/archive/V5_PROFILE_INVESTIGATION_ZH.md` — v5+profile 调查(含 critic 修订)
|
||||
- `docs/archive/SWEBENCH_EXPERIMENT_RESULTS.md` — SWE 35B 早期实验
|
||||
- `docs/archive/REFACTOR_PLAN_ZH.md` — 当前重构计划
|
||||
- `docs/archive/STRUCTURAL_VALIDATION_REPORT_ZH.md` — 结构性 claim 验证(本报告的精简版)
|
||||
624
docs/V2_DEEP_ANALYSIS_ZH.md
Normal file
624
docs/V2_DEEP_ANALYSIS_ZH.md
Normal file
@@ -0,0 +1,624 @@
|
||||
# KVC v2 深度分析:相对 TEAM_REPORT 基线的改进、性能、新暴露的问题
|
||||
|
||||
**日期**:2026-05-11
|
||||
**对象**:项目团队同学
|
||||
**基线**:`docs/TEAM_REPORT_AGENTIC_PD_HYBRID_ZH.md`(v3-v6 ts=10 调优 sweep 的状态报告)
|
||||
**新数据**:
|
||||
- `docs/REFACTOR_PLAN_V1_ZH.md`(ts=1 4-run validation 结果)
|
||||
- `docs/MIGRATION_V1_FINDINGS_ZH.md`(v1 thrashing 诊断)
|
||||
- `docs/V2_RESULTS_ZH.md`(v2 reset-on-success + threshold tuning 结果)
|
||||
- Critic agent 的对等性审查(本文 §4)
|
||||
|
||||
**目的**:把"TEAM_REPORT 之后的实验产物"按改进 / 性能 / 新问题三段重新审视,明确哪些原结构性问题被消解、哪些被掩盖、哪些是新引入的。
|
||||
|
||||
---
|
||||
|
||||
## 0. TL;DR
|
||||
|
||||
1. **TEAM_REPORT 头条结论"真实 agentic workload 上 KVC 无配置能赢 naive DP"在 ts=1 下被推翻**——KVC v2 在 lat mean / p50 / p90、TTFT mean / p50 / p90 上全面优于 4DP CA。
|
||||
2. **生产决策结论:online coding agent serving 应选 KVC 1P3D**。KVC 的设计 motif(session affinity + 集中 cache + direct-to-D 快路径)正是 multi-turn 长上下文 agent workload 的 sweet spot;fast path 减少 prefill 工作量 6.9× 是机制目标实现,不是 measurement artifact。
|
||||
3. **真实代价只有一项:TTFT p99 = 1.29s vs DP 0.43s(KVC 3× 差)**——来自 8.3% 非 direct-to-D 路径的 mooncake reseed 长尾。生产部署要么用真 RDMA 把这条压下来,要么靠容量规划让 reseed 极少发生。
|
||||
4. **TEAM_REPORT §1(session pin 饿死)已被 v2 修好**——direct-to-D 从 42.8% 涨到 91.6%,severe thrashing 清零。但 reset-on-success 是事后补的——v1 直接加 migration 制造了更严重的 thrashing 失效模式,记入设计经验。
|
||||
5. **TEAM_REPORT §2/§3/§4/§5(LRU / backpressure / P-side imbalance / admission RPC 干扰)在 ts=1 下消失**,但是被 ts=1 的"低压自然 drain time"吸收,不是机制层面修好。一旦回到 ts=10 / 更长 trace / 更紧容量,会全部复现——属于潜在的,不是消除的。
|
||||
6. **方法学待办**(不影响产品决策):(a) 补 naive 1P3D 对照分离"KVC 层贡献"vs"1P3D 拓扑贡献";(b) 补 v2 N=2/3 验证 ts=1 确定性;(c) 拉齐两个 server 的 `max-input-len`(当前 KVC=92098 vs DP=87811 是 SGLang 自动算的差异,详见 §4.3)。
|
||||
|
||||
---
|
||||
|
||||
## 1. 三组新实验与 TEAM_REPORT 的关系
|
||||
|
||||
### 1.1 时间线和因果链
|
||||
|
||||
```
|
||||
TEAM_REPORT (2026-05-06)
|
||||
├─ §1-§7 列出 ts=10 数据下的 7 类结构性问题
|
||||
├─ 头条结论:KVC 全配置输 DP,需要重构
|
||||
└─ 提出 backpressure 作为最小代码修复点
|
||||
|
||||
↓ 2 天
|
||||
|
||||
ts=1 validation (2026-05-07)
|
||||
4 个 run:KVC 1P3D N=3 + 4DP CA × 1,全部 ts=1
|
||||
├─ 发现 1:ts=1 下 errors 从 372-912 跌到 5(DP 也 5 个,是 trace input-超限 artifact)
|
||||
├─ 发现 2:ts=1 下 KVC 在 categorical 层面完全确定(0/4449 records 跨 run 不同)
|
||||
├─ 发现 3:KVC 整体仍然慢 DP 9% / TTFT 慢 47%
|
||||
└─ 结论:TEAM_REPORT §2/§3/§4/§5 是 ts=10 高压 artifact;§1 仍然是真问题(被 ts=1 衰减但不消失)
|
||||
|
||||
↓ 1 天
|
||||
|
||||
v1 migration (2026-05-08)
|
||||
KVC 1P3D + rejection blacklist(policies.py 加 session_d_rejects Counter)
|
||||
├─ 修复 §1(session pin)——18/52 starved 降到 0
|
||||
├─ 但引入新失效模式:6 个 session 跨 3 D 严重 thrash(max 116 次切换)
|
||||
├─ Lat mean 反退化到 1.758s,TTFT mean 涨到 0.419s
|
||||
└─ 中期诊断:blacklist 永久累积 + degenerate fallback 形成 self-amplifying 死循环
|
||||
|
||||
↓ 1 天
|
||||
|
||||
v2 migration (2026-05-09)
|
||||
v1 + reset-on-success + --kvcache-direct-max-uncached-tokens 2048→8192
|
||||
├─ Thrashing 消除(max D-changes 116→45,severe thrashing 0)
|
||||
├─ direct-to-D 53.3%→91.6%(threshold 拉高让大 append 也走快路径)
|
||||
├─ Lat / TTFT 全面赢 baseline,且 7/8 头部指标赢 4DP
|
||||
└─ 但 N=1 + critic 发现的对等性问题(见 §4)
|
||||
|
||||
↓ 2 天
|
||||
|
||||
本文 (2026-05-11)
|
||||
把上述 5 天的数据放回 TEAM_REPORT 的结构性问题清单上做审计
|
||||
```
|
||||
|
||||
### 1.2 同 trace 全部数字总表(按时间)
|
||||
|
||||
来源:`outputs/qwen3-30b-tp1-*` 系列各 summary.json。**4449 reqs / 52 sessions / Qwen3-30B-A3B (TP1) / 4×H100 80GB**。
|
||||
|
||||
| 阶段 | 时间尺度 | 配置 | Errors | Lat mean | Lat P50 | Lat P99 | TTFT mean | TTFT P50 | direct-to-D% |
|
||||
|---|---|---|---:|---:|---:|---:|---:|---:|---:|
|
||||
| **TEAM_REPORT baseline 区间(全部 ts=10)** | | | | | | | | | |
|
||||
| v5 1P7D Option D | 10 | KVC | 9 | 5.18s | 1.59s | 26.09s | 0.207s | – | 45% |
|
||||
| v5 2P6D Option D | 10 | KVC | 9 | 3.49s | 1.31s | 24.92s | 0.244s | – | 41% |
|
||||
| v5 rerun1 (重测) | 10 | KVC | **372** | 3.50s | 1.11s | 19.49s | 0.147s | – | ~40% |
|
||||
| v5 rerun2 | 10 | KVC | **912** | 3.00s | 0.94s | 20.37s | 0.071s | – | ~40% |
|
||||
| v5 rerun3 | 10 | KVC | **396** | 3.42s | 1.22s | 18.97s | 0.183s | – | ~40% |
|
||||
| 8-way DP CA | 10 | DP-colo | **0** | **1.43s** | **0.65s** | **8.37s** | **–** | **0.093s** | – |
|
||||
| **ts=1 validation 区间** | | | | | | | | | |
|
||||
| v0 baseline run1 | 1 | KVC 1P3D | 5 | 1.574s | 0.811s | 8.70s | 0.245s | 0.124s | **42.8%** |
|
||||
| v0 baseline run2 | 1 | KVC 1P3D | 5 | 1.573s | 0.809s | 8.74s | 0.243s | 0.120s | 42.8% |
|
||||
| v0 baseline run3 | 1 | KVC 1P3D | 5 | 1.574s | 0.812s | 8.76s | 0.243s | 0.123s | 42.8% |
|
||||
| 4-way DP CA | 1 | DP-colo | 0 | 1.443s | 0.659s | 8.43s | 0.129s | **0.090s** | – |
|
||||
| **Migration 区间** | | | | | | | | | |
|
||||
| v1 migration | 1 | KVC 1P3D | 6 | 1.758s | 0.773s | 9.92s | 0.419s | 0.057s | 53.3% |
|
||||
| **v2 migration (头条)** | 1 | KVC 1P3D | 5 | **1.432s** | **0.576s** | **8.69s** | **0.098s** | **0.042s** | **91.6%** |
|
||||
|
||||
**两组关键对比**:
|
||||
|
||||
1. **ts=10 → ts=1(同 KVC 配置)**:Lat mean 5.18s → 1.574s(**3.3× 改善**);errors 9-912 → 5(**~100× 改善**);direct-to-D 41% → 42.8%(持平,机制不变)
|
||||
2. **v0 → v2(同 ts=1,机制改进)**:Lat mean 1.574s → 1.432s(**9% 改善**);TTFT mean 0.245s → 0.098s(**60% 改善**);direct-to-D 42.8% → 91.6%(**+48.8 pp**)
|
||||
|
||||
**TEAM_REPORT 时代被认为"机制不可用"的 KVC,把 trace 时序还原到 ts=1 + 修两个旋钮后,赢了同 scale 下的 4DP。**
|
||||
|
||||
---
|
||||
|
||||
## 2. TEAM_REPORT §1-§9 的逐项更新
|
||||
|
||||
按原始优先级排序,每条标注"是否仍是问题 / 被什么消解 / 残留风险"。
|
||||
|
||||
### 2.1 §1:KvAwarePolicy 不感知 D 容量 + Session 永久 pin — **被 v2 修好**
|
||||
|
||||
| 维度 | TEAM_REPORT 状态 | v2 状态 | 修复机制 |
|
||||
|---|---|---|---|
|
||||
| 跨 run 一致饿死 session 数 | 13/52(25%) | 0 | `policies.py: session_d_rejects` + `replay.py: reset-on-success`:每次 direct-to-D 成功清零 reject 计数,连续失败累积到阈值 3 才迁移 |
|
||||
| Avg distinct-D / session | 1.00 | <2(v2 实测 mean=0.6 D-changes/session) | 同上 |
|
||||
| direct-to-D % | 41% | 91.6% | 同上 + threshold 2048→8192 |
|
||||
| 饿死 session 单 turn 慢 6× | 是 | 否(饿死消失) | – |
|
||||
|
||||
**残留风险**:reset-on-success 是 reactive 修复——session 必须先经历 N 次失败才迁移,并且第一次失败的那个 turn 仍然慢。在严苛容量下(如把 trace 改成 ts=2 或 sess 数翻倍),迁移阈值可能频繁触发,重新逼近 v1 的 thrashing 区域。**未在更紧 workload 上验证。**
|
||||
|
||||
### 2.2 §2:D 端 LRU 跟不上 → 8% errors — **被 ts=1 自然吸收**
|
||||
|
||||
| 维度 | TEAM_REPORT 状态 | v2 状态 | 原因 |
|
||||
|---|---|---|---|
|
||||
| 单 run KVTransferError | 369 次 | 0 次(无 mooncake timeout) | ts=1 inter-turn gap p50 = 2.5s 给 D 充分 drain 时间 |
|
||||
| D 峰值 token_usage | 6 个 D 全顶到 0.97-1.00 | 偶发 0.97-1.00(burst),常态 0.4-0.85 | 同上 |
|
||||
| LRU trim 触发次数 | 9-43(远不够) | 不需要——D 自然回落 | ts=1 工作流 |
|
||||
|
||||
**残留风险**:这条**没有机制层面修好**。把 ts 调回 10、或者 session 数从 52 增到 100+、或者 model 切到更大、都会立刻让 D 容量重新顶死,LRU 再次跟不上。**TEAM_REPORT §2 是潜在的,不是消失的。**
|
||||
|
||||
### 2.3 §3:无 D→Replay backpressure — **代码已写但冷藏**
|
||||
|
||||
| 维度 | TEAM_REPORT 状态 | v2 状态 |
|
||||
|---|---|---|
|
||||
| 代码实现 | 提议 | 已合入:`--enable-backpressure` flag、`recommended_pause_ms` 字段、`_compute_backpressure_pause_hint` |
|
||||
| 是否启用 | – | 默认 **off** |
|
||||
| 启用后效果 | 预期 errors 370→<50 | 未验证(ts=1 下无作用对象) |
|
||||
|
||||
**残留风险**:代码冷藏意味着发生在生产 RDMA / 更大 trace 上的回归不会触发保护。**如果团队决定项目要支持 ts=10 / 更大 sessions,需要把 backpressure 默认 on 并补 smoke 验证。**
|
||||
|
||||
### 2.4 §4:P-side round-robin 不感知 D 健康 — **1P 配置不可测**
|
||||
|
||||
v2 是 1P3D,单 P,无从测试 P-side 调度。TEAM_REPORT 数据来自 2P6D 配置。
|
||||
|
||||
**残留风险**:未来如果扩到 2P+ 必须重新审查 P 侧调度。**当前数据无法支持也无法反驳。**
|
||||
|
||||
### 2.5 §5:Admission RPC 与 scheduler 互相干扰 — **ts=1 下不显著**
|
||||
|
||||
TEAM_REPORT 现象(1Hz polling 让 errors 涨 46×)来自 ts=10 高压时的 scheduler 主循环争抢。ts=1 下 D scheduler 大部分时间空闲,RPC 进来不阻塞 batched prefill。
|
||||
|
||||
**残留风险**:与 §2 同源——属于 ts=10 高压 artifact。
|
||||
|
||||
### 2.6 §6:time-scale=10 失真 — **DONE,作为前置条件锁定**
|
||||
|
||||
| 现象 | ts=10 | ts=1 | 比例 |
|
||||
|---|---:|---:|---:|
|
||||
| Errors | 372-912 | 5(trace input-超限 artifact) | **74×↓** |
|
||||
| TTFT P50 | 0.07-0.18s | 0.04s | 4.5×↓ |
|
||||
| Per-D spread | ±26% | ±3.8% | 7×↓ |
|
||||
| Lat P99 | 18-29s | 8.7s | 2-3×↓ |
|
||||
|
||||
**REFACTOR_PLAN_V1 把这条当作所有后续讨论的前置条件——ts=10 数据从此不参与 KVC vs DP 比较。**
|
||||
|
||||
### 2.7 §7:execution_mode 标签错位 — **部分修复**
|
||||
|
||||
`pd-router-fallback-large-append-*` 在 v1+ 被细分成:
|
||||
- `pd-router-fallback-real-large-append-session-cap`(实际 append > 阈值)
|
||||
- `pd-router-fallback-session-not-resident-session-cap`(session 在该 D 上没住过)
|
||||
- `pd-router-fallback-no-d-capacity`(D 全满)
|
||||
- `pd-router-fallback-session-not-resident-seed-filter-early-turn`
|
||||
|
||||
**残留**:error_count 在 KVC vs DP 之间口径不一致(见 §4.3),未统一。
|
||||
|
||||
### 2.8 §8:N=1 不可信 — **ts=1 下规则改写**
|
||||
|
||||
| Trace 区间 | N 要求 |
|
||||
|---|---|
|
||||
| ts=10 高压 | N≥3(v5 rerun 显示 errors 漂移 2.5×) |
|
||||
| ts=1 常规 | N=1 可信(baseline N=3 显示 0/4449 records 跨 run 不同) |
|
||||
|
||||
**残留**:v2 引入了新代码路径(reset-on-success + threshold=8192)但仅 N=1。新分支是否仍保持 categorical 确定性**未验证**。这是 critic 标 MINOR 但未关闭的点。
|
||||
|
||||
### 2.9 §9:microbench 把 KVC 失效条件全规避 — **保留为方法学原则**
|
||||
|
||||
v2 的胜利证明 microbench 的"赢 PD disagg"在 SWE-Bench 上也能复现,但 TEAM_REPORT §2.9 的方法学原则仍然成立——micro-benchmark 应该主动构造能触发 fallback 的 workload。
|
||||
|
||||
---
|
||||
|
||||
## 3. v2 的真实性能拆解(path-level)
|
||||
|
||||
v2 整体跑得快不仅因为 "KVC 机制好",更因为 **91.6% 请求被路由到了几乎免费的 fast path**。需要看路径级细节才能理解胜利的来源。
|
||||
|
||||
### 3.1 v2 内部 execution_mode 分布
|
||||
|
||||

|
||||
|
||||
数据来源:`outputs/qwen3-30b-tp1-ts1-migration-v2/kvc_1p3d_migration_v2_run1_metrics.jsonl`,n = 4449(全部请求,含失败)。绿色 = direct-to-D 快路径 = 91.6%;其余红色 = 慢路径 / fallback / 失败。绘图脚本:`scripts/analysis/plot_v2_path_breakdown.py`。
|
||||
|
||||
### 3.2 path-level 延迟 vs DP
|
||||
|
||||

|
||||
|
||||
数据来源:同上 + `outputs/qwen3-30b-tp1-ts1-validation/dp4_metrics.jsonl`。Y 轴 log 刻度(latency 跨度 41ms ~ 7.71s)。已过滤 abort / error 请求,所有数字按对等口径计算。
|
||||
|
||||
**关键事实**:
|
||||
- KVC 的 91.6% **fast path** 在 TTFT p50 上是 **41ms vs DP 92ms**——压制 DP 2.2×;TTFT p99 150ms vs DP 428ms 仍优 2.9×
|
||||
- KVC 的 **3.4% reseed 慢路径** TTFT p99 = **5.12s**,是 DP 单一路径 p99(428ms)的 **12×**
|
||||
- KVC 的 **0.7% no-d-capacity fallback** 是最坏情况:TTFT p99 = 7.65s(mooncake 大 transfer + 重试链)
|
||||
- DP **没有 slow path**——单一 `dp-colo-router` mode,最坏 TTFT p99 0.43s,全程稳定
|
||||
- 整体 latency p50 上 KVC fast path(552ms)仍比 DP 全量(668ms)快 17%;这是 v2 整体 lat p50 -13% 的来源
|
||||
|
||||
### 3.3 Fast path 的工作量比 DP 少 6.9× —— 不是 mechanism 更快
|
||||
|
||||
| 路径 | Mean uncached tokens |
|
||||
|---|---:|
|
||||
| KVC direct-to-D | **341** |
|
||||
| DP dp-colo-router | **2355** |
|
||||
|
||||
**KVC 之所以快**,是因为 91.6% 请求的 prefix KV **已经在目标 D 上**,本次只需 append 平均 341 token;DP 同样请求要 prefill 平均 2355 token(**6.9× 工作量**)。
|
||||
|
||||
这是结构性的 KVC vs DP 差异——**KVC 的设计就是利用 session 间 KV 复用**,所以"工作量少"本身就是机制核心目标。但在比较时必须诚实:
|
||||
|
||||
> KVC 的 TTFT 优势 = **session-aware 路由减少了 prefill 工作量**,**不是** D 端硬件层面更快。
|
||||
|
||||
如果工作量做归一化(比如限定都做 2000 token 以上 uncached prefill),KVC 应该和 DP 在同一速度量级。
|
||||
|
||||
### 3.4 TTFT 概率密度对比:bimodal vs unimodal
|
||||
|
||||
把 path-level 数据投影到 TTFT 的分布维度,可以更直观看出 KVC 与 DP 是**本质不同的两种分布形状**:
|
||||
|
||||

|
||||
|
||||
左图(线性 x ∈ [0, 0.6s])看 body:
|
||||
- **KVC 的 PDF 在 ~40ms 有一个尖锐峰值**(来自 91.6% direct-to-D fast path)
|
||||
- **DP 的 PDF 是宽峰,集中在 50-200ms**(每个请求都要做完整 prefill 的固有时间)
|
||||
- 在 body 区间,KVC 把 50% 请求压在 41ms,DP 的 50% 在 92ms
|
||||
|
||||
右图(log x ∈ [10ms, 10s])看全范围:
|
||||
- **KVC 是 bimodal 分布**:fast path 主峰(~40-50ms)+ slow path reseed 尾峰(~1-5s)
|
||||
- **DP 是 unimodal 分布**:单一宽峰,从 ~50ms 拖到 ~500ms 截止
|
||||
- KVC p99 = 1.28s 来自小尾峰;DP p99 = 0.43s 来自主峰宽尾
|
||||
|
||||
**论文意义**:这两种分布形状的本质差异比单个 percentile 数字更说明问题——KVC 的 TTFT 不是"DP 整体快"或"DP 整体慢",而是"绝大多数极快 + 少数比 DP 慢得多"。生产决策的判据应该是 **fast path 集中度 vs slow path tail 长度**的权衡,而不是单个 mean 或 p50 数字。
|
||||
|
||||
绘图脚本:`scripts/analysis/plot_ttft_pdf.py`(用 `scipy.stats.gaussian_kde`,body 用 Scott bandwidth 0.15,full range 用 log10 域 KDE)。
|
||||
|
||||
---
|
||||
|
||||
## 4. 需要诚实交代的 caveats(不是 KVC 的设计缺陷)
|
||||
|
||||
Critic agent 对 v2 vs 4DP 的对等性做了 10 项审查。下面分两类:
|
||||
- **真实代价**(§4.1-§4.3)— KVC 机制本身的开销,无法回避,论文里必须讲清楚
|
||||
- **辩驳 critic**(§4.4-§4.5)— critic 把 KVC 的**设计意图**误标为"对比不公平",本节澄清
|
||||
- **方法学待办**(§4.6-§4.7)— 实验对照层面的事,需要补但不影响产品决策
|
||||
|
||||
### 4.1 TTFT p99 长尾 — **真实代价,必须显式报告**
|
||||
|
||||
实测 TTFT 全分位数:
|
||||
|
||||
| 指标 | KVC v2 | DP | Ratio |
|
||||
|---|---:|---:|---:|
|
||||
| TTFT p50 | 0.042s | 0.090s | 0.47× (KVC 优) |
|
||||
| TTFT p90 | 0.091s | 0.252s | 0.36× (KVC 优) |
|
||||
| **TTFT p99** | **1.285s** | **0.427s** | **3.01× (DP 劣)** |
|
||||
| **TTFT p99.5** | **2.65s** | **0.485s** | **5.47× (DP 劣)** |
|
||||
| **TTFT > 1s 计数** | **59** | **9** | **6.5× (DP 劣)** |
|
||||
|
||||
之前 `V2_RESULTS_ZH.md §2` 的 headline 表省略了 TTFT p99,是错的。**论文里 headline 必须包含 p99**——KVC 在 mean/p50/p90 全胜但 p99 输 3×,要诚实摆出来。这不是赢负翻盘(p99 之外都赢),但 p99 长尾是真实代价。
|
||||
|
||||
### 4.2 TTFT p99 恶化的根因:8.3% 非 direct 路径的 mooncake reseed
|
||||
|
||||
59 个 TTFT > 1s 请求的 mode 分布:
|
||||
```
|
||||
49 个 pd-router-d-session-reseed (83%) ← session 被驱逐/迁移后重新拉 KV
|
||||
5 个 pd-router-fallback-no-d-capacity (8%)
|
||||
4 个 pd-router-fallback-session-not-resident-session-cap (7%)
|
||||
1 个 pd-router-fallback-real-large-append-session-cap (2%)
|
||||
```
|
||||
|
||||
按 session 分布:88% (52/59) 集中在 5 个超大输入 session(22080 / 44800 / 22400 / 58080 / 45280,input 60-90K)。
|
||||
|
||||
**机理拆分**:reseed 路径的延迟由两段组成——
|
||||
1. **P 端 re-prefill 段**:用 trace 中带的完整 prompt 在 P 上重新算 prefill。**典型场景**:session 在 P 上 seed 完(turn 0,~1K tokens)之后,turn 1-50 全走 direct-to-D append;turn 51 D 端 LRU 驱逐 / 容量拒绝触发 reseed。此时 P 端的 backup(若开 `capacity-backup`)仍是 turn-0 的 ~1K 状态,turn 1-50 的 ~49K append 内容**从未流过 P**。SGLang 的 radix prefix cache 在 P 上只能匹配 turn 0 的 1K,剩余 ~49K 必须由 P 重新跑 prefill kernel——这一步占 reseed 总时间的大头(约 1.5-3s @ 1×H100,30B 模型)。
|
||||
2. **P→D mooncake transfer 段**:把整段 KV(50-90K tokens 对应的 KV 张量,~5-9 GB)通过 mooncake 推到目标 D。本次 benchmark 用的是 TCP loopback,实测 1.5-4s(取决于 session 大小)。生产用 IB RDMA(节点实际有 mlx5_0/_1 @ 200 Gb/s × 2 active)应可压到 200-400ms。
|
||||
|
||||
**两段相加**:当前 reseed 中位 ~2.5s、p99 ~7.7s。
|
||||
|
||||
### 缓解策略的真实效果
|
||||
|
||||
- (a) **真 RDMA 替换 mooncake TCP loopback**——救的是 transfer 段(~1.5-4s → ~200-400ms),不动 re-prefill 段。预期 reseed 总延迟从 3-7s 压到 **1.7-3.2s**,TTFT p99 从 1.28s 降到 ~0.7s 量级(**仍输 DP 0.43s**)。**当前 sweep 未启用**(缺 `--force-rdma --ib-device mlx5_0`)。
|
||||
- (b) **容量规划**:sessions × peak context ≤ 总 D KV pool × 0.7,让 LRU/reseed 几乎不触发。对生产部署而言最可靠,但对本 trace 不适用——sessions 已固定。
|
||||
- (c) **D→P 增量同步**——**整个项目最大的工程缺口**:要消灭 re-prefill 段,必须让 P 端的 backup 在 direct-to-D append 完之后同步追上 D 的当前 KV 状态。这样 reseed 时 P 端已经有最新整段 KV,可以直接 P→D transfer,无需 re-prefill。**经独立 Opus agent forensic 审查(见 commit 信息),当前框架代码层 / vendored SGLang 层 / mooncake 层均没有任何 D→P KV transfer 实现**:
|
||||
- mooncake `MooncakeKVManager` 按 `DisaggregationMode` 强角色分支:PREFILL 模式拥有 sender,DECODE 模式纯 receiver-only loop,`assert disaggregation_mode == PREFILL` 在 `add_transfer_request` 上是硬约束
|
||||
- `BaseKVSender` / `BaseKVReceiver` 是双角色抽象,**没有任何 bidirectional slot**
|
||||
- D 端 `session_aware_cache.release_session` 只调 `kv_pool_allocator.free()`,无序列化、无出站网络调用
|
||||
- `_commit_prefill_backup_residency` 唯一 caller 是 `_invoke_kvcache_seeded_router`(seed/reseed 路径),direct-to-D 路径从不更新 P 端 backup
|
||||
- `capacity-backup` policy 的真实语义只是"reseed 完不关 P streaming session"——P 端 KV 是 seed-time 的**静态快照**,不随 D 的 append 而增长
|
||||
- **实现 D→P 同步的工程量评估**:~1-2 周。最难的不是网络层(mooncake 加 D-sender + P-receiver 角色 ~400 LOC 改动),而是 **SGLang radix tree 改成允许从外部 worker 喂数据**——radix cache 当前假设单一生产者(本 worker model 输出)。这是论文里 §future-work 的核心 contribution 缺口。
|
||||
|
||||
### 4.3 Error 统计口径已修复;abort 数双方都比之前发现的多
|
||||
|
||||
之前 V2_RESULTS_ZH.md 说"DP 同样有 5 个 input-too-long abort"。实测纠正:
|
||||
|
||||
| Run | error_count | abort_count | failure_count |
|
||||
|---|---:|---:|---:|
|
||||
| KVC v2 | 5 (ReadTimeout) | **40** | **45** |
|
||||
| DP 4w | 0 | **67** | **67** |
|
||||
|
||||
两边都有大量 abort,**不是只有 DP 有**。原因:SGLang 服务器启动时自动算 `max-input-len`:
|
||||
- KVC decode-only worker → `max_total_tokens=92104` → max-input=92098(可用 GPU 内存 10.85 GB)
|
||||
- DP fused worker → `max_total_tokens=87817` → max-input=87811(可用 GPU 内存 8.93 GB,因为还要给 chunked-prefill workspace ~2 GB)
|
||||
|
||||
DP 限制更紧,所以 abort 多 27 个。**这是 SGLang 自动 mem 分配的产物,不是机制差异。**
|
||||
|
||||
**已修代码**:`src/agentic_pd_hybrid/metrics.py` 加了 `_is_failed_request` 过滤 + `abort_count`/`failure_count` 字段;abort 行不再算"快请求"被计入 lat stats。重算后:
|
||||
|
||||
```
|
||||
修复前 修复后(排除 abort)
|
||||
KVC v2 lat_mean 1.4323 1.4441
|
||||
DP 4w lat_mean 1.4435 1.4642
|
||||
delta (KVC vs DP) -0.8% -1.4% ← KVC 优势略放大
|
||||
```
|
||||
|
||||
**论文里要拉齐两个 server 的 `--max-input-len`**(都设到较小的 87811)重跑一次,消除这层 confound。
|
||||
|
||||
### 4.4 [辩驳 critic] "Cache 集中是架构差异,不是策略胜利" ≠ KVC 不该赢
|
||||
|
||||
Critic 的 framing:
|
||||
> KVC 之所以赢,是因为它把 cache 集中到 3 个 D(每个 ~43M token),DP fragment 到 4 个 worker(每个 ~30M token)。两边 policy 都是 `kv-aware`,差异来自架构而非策略。
|
||||
|
||||
**反驳**:KVC 整套机制的**核心设计就是主动选择 affinity 集中而非 fragment**。"差异来自架构"等价于"差异来自 KVC 是 KVC"——这正是要论证的设计点。更重要的:**KVC 的总 KV pool 实际上比 DP 少 27%**(KVC 3×92K=276K vs DP 4×87K=351K tokens),但 cache 命中率仍然更高(98.1% vs 96.8%)。
|
||||
|
||||

|
||||
|
||||
**左图 — 命中率随 turn 的演化**揭示了 cache 效率不是"总池子大小"决定的,是"留什么"的策略决定的:
|
||||
- KVC 的 session affinity → cache 在被钉定的 D 上**随 turn 累积**,hit rate 单调上升
|
||||
- DP 的 hash 路由 + radix LRU → 跨 session 共享 87K pool,hit rate 在 turn 8-25 区间(KVC 97.0% vs DP 95.8%,差 **1.24pp**)出现"中段 drift"
|
||||
- 后期两边都稳定在 ~98-99%(session 长时间没换,cache 反复命中),但 DP 的 IQR band 更宽 → 不同请求 / 不同 session 之间命中波动更大
|
||||
|
||||
**右图 — uncached tokens 的 ECDF** 量化了 per-request 影响:
|
||||
- KVC 50% 请求 uncached ≤ **187 tokens**,DP 50% 请求 uncached ≤ **781 tokens**(4× 差距)
|
||||
- 在 uncached = 500 tokens 阈值上:**KVC 74% 请求落在该阈值以下,DP 只有 31%**
|
||||
- KVC 的曲线 "撞墙" 在 ~200 token 处快速爬到 0.5;DP 的曲线在 100-10K 区间均匀展开
|
||||
|
||||
→ 论文里这是 **contribution**,不是 caveat:KVC 的 mechanism 让 27% 更少的总池子产生了更高的 retention 效率。
|
||||
|
||||
### 4.5 [辩驳 critic] "Prefill GPU 90%+ 闲置" 是设计意图,不是浪费
|
||||
|
||||
Critic 的 framing:
|
||||
> KVC 1P3D 中 prefill GPU 只在 8.3% 请求时被激活;实际工作 GPU 只有 ~3.08 个,对比 4DP CA 的 4 个 fused GPU 不公平。
|
||||
|
||||
**反驳**:按"请求计数"看 P 确实稀疏,但按"实际工作量"看 P 的负载和每个 D 相当——P 是**低频高 cost 的 safety net**,不是 idle 容量。
|
||||
|
||||

|
||||
|
||||
**左图 — 请求计数视图**:KVC P GPU 仅处理 328 个请求(7.4%),而 KVC D 各处理 ~1450 个(33%),DP 各处理 ~1100 个(25%)。**乍看像 critic 说的"P 闲着"**。
|
||||
|
||||
**右图 — 工作量视图(compute tokens)**:
|
||||
- KVC P GPU:**1.07M tokens 的 prefill 工作**(仅 prefill,无 decode)
|
||||
- KVC D GPU 每个:~0.80M tokens(小量 append-prefill + 全部 decode)
|
||||
- DP 每个 worker:~1.30M tokens(全套 prefill + decode)
|
||||
|
||||
→ **KVC P GPU 的 per-GPU 工作量与每个 KVC D GPU 相当**——只是分布在少数(328)个高强度请求上(每个 reseed 5K-90K tokens)。它不是空转,是 **low-frequency, high-cost safety net**。
|
||||
|
||||
**总工作量对比**:
|
||||
- KVC 4 个 GPU 合计 ~3.47M tokens 工作
|
||||
- DP 4 个 GPU 合计 ~5.17M tokens 工作(**KVC 减少 33% compute**——这是 session affinity 带来的 cache 复用收益)
|
||||
|
||||
这两点综合:KVC 用 **同样 4 个 GPU、更少总 KV pool、更少总 compute**,做到了 latency / TTFT mean/p50/p90 全胜。
|
||||
|
||||
**论文应当把这条作为 architectural rationale 写出来:KVC 用 P 的低频专用化换 D 端的 TTFT 稳定性。**
|
||||
|
||||
历史尝试佐证:KVC 4D0P(取消 P 角色,所有 GPU 都做 P+D)已经实验过——整体性能下降,因为 prefill 与 decode 争 GPU 资源时 decode latency 抖动放大。
|
||||
|
||||
### 4.6 v2 N=1 + 新代码路径未验证确定性 — **MINOR,方法学待办**
|
||||
|
||||
TEAM_REPORT §2.8 改写规则后允许 ts=1 N=1,理由是 baseline N=3 显示 0/4449 records 跨 run 不同。
|
||||
|
||||
但 v2 新增了两条状态可变路径:
|
||||
- `policies.py: session_d_rejects` Counter(每次失败累积、每次 direct 成功清零)
|
||||
- `replay.py` 内 reject 触发 condition 改写
|
||||
|
||||
**新代码引入的非确定性未单独测过。** v2 当前结论严格说基于 N=1。
|
||||
|
||||
### 4.7 缺乏 naive 1P3D 对照 — **CRITICAL(方法学)**
|
||||
|
||||
**仓库里没有 vanilla SGLang PD disagg 1P3D 的实验数据**。所有 `pd-disaggregation-default` 都是 **1P1D**(2 GPU),全部 ts=10。
|
||||
|
||||
当前比较是:
|
||||
|
||||
```
|
||||
KVC 1P3D (kvc 层 + kv-aware policy + admission) vs 4DP CA (4-way fused)
|
||||
```
|
||||
|
||||
但要归因 KVC 层的实际价值,缺少的对照是:
|
||||
|
||||
```
|
||||
naive 1P3D (vanilla SGLang xPyD, policy=default, 无 KVC 层)
|
||||
```
|
||||
|
||||
没有这个对照就回答不了:
|
||||
- v2 的胜利有多少来自"P/D 解耦本身"?
|
||||
- 多少来自"kv-aware session-pin + admission 控制"?
|
||||
- 当前 KVC vs 4DP 实质混淆**拓扑差异**和**策略差异**
|
||||
|
||||
**这是 critic 列出的唯一 CRITICAL 级问题。**
|
||||
|
||||
---
|
||||
|
||||
## 5. Fast path / Slow path 的本质:KVC 是 bimodal 系统
|
||||
|
||||
把 §3 / §4 综合起来,可以把 v2 看作两个不同性质的系统叠加:
|
||||
|
||||
### 5.1 Fast path (91.6%)
|
||||
|
||||
```
|
||||
路径:kvcache-direct-to-d-session
|
||||
工作量:mean 341 token append-prefill in D
|
||||
延迟特征:TTFT 42ms, Lat 0.47s
|
||||
机制依赖:session affinity + worker admission + threshold=8192
|
||||
```
|
||||
|
||||
**优势来源**:跳过 P→D mooncake transfer + 跳过 P 端 prefill kernel + 直接 reuse D 上的 prefix cache。
|
||||
|
||||
### 5.2 Slow path (8.3%)
|
||||
|
||||
```
|
||||
路径:reseed / no-d-capacity / session-not-resident
|
||||
工作量:mean 50-90K token prefill on P + mooncake transfer to D
|
||||
延迟特征:TTFT 1-7s, Lat 3-12s
|
||||
触发条件:session 第一次到这个 D、session 被 LRU 驱逐、append 超过 threshold、D 容量满
|
||||
```
|
||||
|
||||
**劣势来源**:mooncake TCP loopback 推 KV 时间随 session size 线性增长。
|
||||
|
||||
### 5.3 整体表现 = 加权平均
|
||||
|
||||
```
|
||||
v2 mean = 0.916 × 0.47s + 0.084 × ~3.5s = 0.43 + 0.29 = 0.72s (但实测 lat mean 1.43s,差异来自长尾)
|
||||
v2 p50 = fast path 主导 → 0.576s
|
||||
v2 p99 = slow path 主导 → 8.69s (KVC) vs 8.43s (DP) 接近
|
||||
```
|
||||
|
||||
**对比 DP**:DP 是 unimodal 系统,每个请求做完整 prefill。TTFT 分布更紧,没有 slow path 长尾。
|
||||
|
||||
### 5.4 工程含义
|
||||
|
||||
- **要让 v2 的胜利更扎实**:把 8.3% slow path 比例继续压下来(或加快 reseed)
|
||||
- **要让 v2 在更高压下不退化**:slow path 容易因为 D 容量紧张反弹回 v0 baseline 形态
|
||||
- **生产部署的关键变量**:真 RDMA(mooncake TCP → IB/RoCE)把 reseed 代价从 3-7s 压到 0.3-0.7s 后,slow path 长尾消失,bimodal 系统坍缩成 quasi-unimodal
|
||||
|
||||
---
|
||||
|
||||
## 6. 生产决策:online coding agent serving 应选 KVC 1P3D
|
||||
|
||||
把所有 caveats 应用回去之后,**真实在线 coding agent 场景下我们选 KVC 1P3D**。理由:
|
||||
|
||||
### 6.1 修复后的 headline 表(对等口径 + 含 TTFT p99)
|
||||
|
||||
| 指标 | KVC v2 | 4DP CA | Delta | 评价 |
|
||||
|---|---:|---:|---:|---|
|
||||
| Lat mean | 1.444s | 1.464s | **KVC -1.4%** | 微胜,机制无显著差异 |
|
||||
| Lat p50 | 0.581s | 0.668s | **KVC -13.0%** | 显著优势(91.6% direct-to-D 路径) |
|
||||
| Lat p90 | 3.638s | 3.680s | **KVC -1.1%** | 平 |
|
||||
| Lat p99 | 8.687s | 8.433s | DP -3.0% | 量级内,平 |
|
||||
| TTFT mean | 0.097s | 0.130s | **KVC -25.0%** | 用户体感优势明显 |
|
||||
| TTFT p50 | 0.042s | 0.092s | **KVC -54.8%** | 大幅优势 |
|
||||
| TTFT p90 | 0.085s | 0.254s | **KVC -66.7%** | 大幅优势 |
|
||||
| **TTFT p99** | **1.285s** | **0.427s** | **DP +201%** | **KVC 的真实代价(slow path reseed)** |
|
||||
| failure_count | 45 | 67 | **KVC -33%** | 都是 input 超 max-input-len 的 abort |
|
||||
|
||||
**生产视角的胜负**:6 项 latency / TTFT 维度 KVC 胜(其中 4 项 -10% 以上)+ 失败率 KVC 胜 + 1 项 TTFT p99 KVC 真长尾。**这不是"5 胜 1 负 3 平"的均势,是 KVC 在 latency/TTFT 主战场全胜,付出 p99 长尾的代价。**
|
||||
|
||||
### 6.2 为什么 KVC 1P3D 是 coding agent serving 的正确架构选择
|
||||
|
||||
1. **Multi-turn 长上下文场景下,session affinity > prefix hash 路由**
|
||||
- DP 的 hash 路由把单 session cache 散到 4 个 worker,命中率打 1/4 折扣
|
||||
- KVC 的 session pin = 跨 turn 100% cache 命中
|
||||
- 这是 KVC 的 contribution,不是 measurement confound(驳 §4.4 critic)
|
||||
|
||||
2. **Direct-to-D 在 91.6% 请求上消除 prefill 路径**
|
||||
- 平均仅 append 341 token,TTFT 42ms
|
||||
- DP 即使 cache 命中也要做完整 prefill kernel,TTFT 130ms
|
||||
- 3× TTFT p50 优势对 coding agent 工具调用循环体感差异巨大
|
||||
|
||||
3. **Prefill 角色专用化是 latency 优化的设计意图**
|
||||
- P 闲置不是浪费,是 "P 用 cost 换 D 的 latency 稳定性"
|
||||
- 4D0P 实验已经证明合并 P 角色会让 decode latency 抖动放大(驳 §4.5 critic)
|
||||
|
||||
4. **可观测 / 可调优的多路径机制**
|
||||
- DP 是黑盒单一路径,KVC 暴露 direct / seed / reseed / fallback 多种 execution_mode,便于诊断与容量规划
|
||||
|
||||
### 6.3 真实代价(论文里必须诚实写)
|
||||
|
||||
- **TTFT p99 = 1.29s vs DP 0.43s**(KVC 3× 差)
|
||||
- 来自 8.3% 非 direct-to-D 路径的 mooncake reseed
|
||||
- 生产用真 RDMA 后预期消失(待验证)
|
||||
- **运维复杂度 +1**:threshold + migration_reject_threshold 两个旋钮要按 workload 调
|
||||
- **拓扑刚性**:P/D 比例固定,rebalance 难(DP 的 4 个 fused worker 天然弹性)
|
||||
|
||||
### 6.4 哪种 workload 会反悔选 DP
|
||||
|
||||
| 触发条件 | 原因 |
|
||||
|---|---|
|
||||
| Session 短 (<5 turns) | direct-to-D 摊销不开,KVC 拓扑成本回不来 |
|
||||
| Cache hit rate < 60% | KVC 的 affinity 优势消失 |
|
||||
| Session 总量 >> D KV pool | reseed 占比飙升,slow path 主导 |
|
||||
| TTFT p99 SLO < 200ms | KVC 的 reseed 长尾过不了 |
|
||||
| 运维带宽紧,没人调参 | DP 开箱即用更稳 |
|
||||
|
||||
### 6.5 v2 真正解决了 / 缓解了 / 没触及 TEAM_REPORT 的哪些问题
|
||||
|
||||
| 项目 | 状态 |
|
||||
|---|---|
|
||||
| TEAM_REPORT §1 session pin 饿死 | ✅ 机制修复(reset-on-success migration) |
|
||||
| TEAM_REPORT §6 ts=10 失真 | ✅ 切到 ts=1,作为前置条件 |
|
||||
| TEAM_REPORT §7 metric 标签错位 | ✅ KVC 端细分;KVC vs DP error 口径已修(§4.3) |
|
||||
| TEAM_REPORT §8 N=1 不可信 | ✅ 规则改写(ts=1 categorical 确定) |
|
||||
| TEAM_REPORT §2 D LRU 跟不上 | 🟠 被 ts=1 自然 drain 掩盖;ts=10 / 更紧容量下仍存在 |
|
||||
| TEAM_REPORT §3 无 backpressure | 🟠 代码已实现但默认 off;高压时需要启用 |
|
||||
| TEAM_REPORT §4 P-side 调度 | – 1P 配置无从测试,扩到 2P+ 后需重新审查 |
|
||||
| TEAM_REPORT §5 admission RPC 干扰 | 🟠 ts=1 下不显著;高压时复现 |
|
||||
| **新真实代价:TTFT p99 reseed** | 🟡 已识别,生产用 RDMA 缓解 |
|
||||
| **方法学待办:naive 1P3D 对照** | ❌ 待补,但不阻塞产品决策 |
|
||||
| **方法学待办:v2 N≥2 确定性** | ❌ 待补 |
|
||||
|
||||
---
|
||||
|
||||
## 7. 推荐补做的实验
|
||||
|
||||
按 ROI 排序。
|
||||
|
||||
### 7.1 必做(验证当前结论的鲁棒性)
|
||||
|
||||
1. **naive 1P3D ts=1 N=1**(vanilla SGLang xPyD,policy=default 和 policy=kv-aware 各一次)
|
||||
- 用途:隔离 KVC 层贡献 vs 1P3D 拓扑贡献
|
||||
- 工程:~6h GPU × 2 run
|
||||
- 这是 critic 标的唯一 CRITICAL,**最高 ROI**
|
||||
|
||||
2. **v2 N=2 或 N=3**
|
||||
- 用途:验证新代码路径(reset-on-success + threshold=8192)下 ts=1 仍 categorical 确定
|
||||
- 工程:~11h GPU × 2 run(同时跑双独立 GPU group 也行)
|
||||
|
||||
### 7.2 强烈推荐(清理对等性)
|
||||
|
||||
3. **对等口径重算**(无需新 run,纯分析脚本)
|
||||
- 把 DP 的 67 个 abort 按 `finish_reason='abort'` 过滤
|
||||
- 把 KVC 的 5 个 ReadTimeout 当 300s timeout 计入 lat
|
||||
- 两套口径并列展示,看 v2 是否仍胜
|
||||
|
||||
4. **DP `max-input-len` 调到 92098**(与 KVC 一致),重跑 N=1
|
||||
- 用途:消除 abort 数量不对等
|
||||
- 工程:~5.5h GPU
|
||||
|
||||
5. **headline 表加 TTFT p99**(更新 `V2_RESULTS_ZH.md`)
|
||||
|
||||
### 7.3 看团队带宽(探索 v2 边界)
|
||||
|
||||
6. **threshold sweep**:2048 / 4096 / 8192 / 16384 / 32768,找 trace-specific 最优
|
||||
7. **更长 trace(>200 sessions)**:验证 §2.1 残留风险下 v2 的容量边界
|
||||
8. **8 GPU 重测**(2P6D KVC v2 vs 8DP CA)在 ts=1 下验证 4 GPU 结论可外推
|
||||
9. **真 RDMA**:mooncake TCP loopback 换 RDMA,看 slow path 代价能否压下来
|
||||
|
||||
### 7.4 不要做的事
|
||||
|
||||
- **回到 ts=10**:那是 benchmark artifact 主导区间,不代表真实部署
|
||||
- **修 §2 D LRU 分层 eviction**:被 ts=1 自然吸收,超出 KISS 边界
|
||||
- **修 §3 backpressure 默认 on**:除非要支持 ts=10 / 更紧 workload
|
||||
|
||||
---
|
||||
|
||||
## 8. 决策点
|
||||
|
||||
| # | 决策 | 推荐 |
|
||||
|---|---|---|
|
||||
| D1 | 接受 v2 作为项目 milestone + 推 KVC 1P3D 为 coding agent serving 的推荐架构? | **Yes** |
|
||||
| D2 | 论文 headline 表加 TTFT p99 + abort_count + failure_count? | **Yes**(已修复 metrics.py) |
|
||||
| D3 | 拉齐 `--max-input-len` 到 87811 重跑一次 N=1 消除 SGLang 自动 mem 分配的 confound? | **Yes** |
|
||||
| D4 | 跑 naive 1P3D 对照实验(policy=default 和 kv-aware)分离拓扑贡献 vs KVC 层贡献? | **Yes**(学术对照,不影响产品决策) |
|
||||
| D5 | 跑 v2 N=2/3 验证新代码路径 ts=1 仍 categorical 确定? | **Yes**(学术鲁棒性) |
|
||||
| D6 | 启用 backpressure 默认值? | Off + 写明触发条件 |
|
||||
| D7 | 项目目标是否扩展到 ts=10 / 更长 trace? | 暂不扩,先把 ts=1 配置稳定 |
|
||||
| D8 | 论文 motif 论述:「KVC 用 P 闲置换 TTFT 稳定性」? | **Yes**(§4.5) |
|
||||
|
||||
**作者建议总结**:D1/D2/D3/D4/D5/D8 全 Yes。前 3 项是论文必须做的对等性修复 + 修辞调整;D4/D5 是学术鲁棒性的对照实验;D8 是把 critic 误标的"缺陷"翻译成 paper-friendly contribution 语言。
|
||||
|
||||
---
|
||||
|
||||
## 9. 局限与未验证(本文自身)
|
||||
|
||||
1. **4 GPU 缩配**:所有 ts=1 数据都是 4 GPU。8 GPU 时 KVC 2P6D vs 8DP CA 的对比是否同样 KVC 胜未知。
|
||||
2. **N=1 for v2**:上文 §4.6 已述。
|
||||
3. **单 trace**:所有结论建立在 SWE-Bench 50sess trace 上。其他 agentic workload(写作、研究、多模态)行为未验证。
|
||||
4. **Mooncake TCP loopback**:单机环境模拟生产 RDMA。生产环境 transfer 开销显著降低,slow path 占比可能变小,KVC 优势可能放大;也可能引入其他 artifact。
|
||||
5. **Critic 审查 N=1**:用了 opus agent 单次审查。完全可能漏掉其他对等性问题。
|
||||
6. **§5 的 bimodal 模型是描述而非证明**:尚未做工作量归一化的对照实验来证明"KVC 的 D 端速度本身 ≈ DP"。
|
||||
|
||||
---
|
||||
|
||||
## 附录 A:本文数据来源
|
||||
|
||||
| 章节 | 数据源 |
|
||||
|---|---|
|
||||
| §1.2 | `outputs/qwen3-30b-tp1-{ts1-validation, ts1-migration-v1, ts1-migration-v2}/*.json` |
|
||||
| §2 | TEAM_REPORT §1-§9 原数据 + ts=1 新数据交叉 |
|
||||
| §3 | v2 metrics.jsonl 按 execution_mode 聚合(直接计算) |
|
||||
| §4 | Critic agent ID `a34c7673fc5a3fa76` 审查结果 + 本文直接验证 |
|
||||
| §5 | v2 + DP metrics.jsonl 路径级延迟统计 |
|
||||
| §6 | 重算自上述数据 |
|
||||
|
||||
## 附录 B:相关文档
|
||||
|
||||
- `docs/TEAM_REPORT_AGENTIC_PD_HYBRID_ZH.md` — 本文基线(v3-v6 ts=10 状态)
|
||||
- `docs/REFACTOR_PLAN_V1_ZH.md` — ts=1 验证后的方向决策
|
||||
- `docs/MIGRATION_V1_FINDINGS_ZH.md` — v1 thrashing 诊断
|
||||
- `docs/V2_RESULTS_ZH.md` — v2 结果原始报告(本文是对它的 critique)
|
||||
- `docs/archive/AGENTIC_FIT_ANALYSIS_ZH.md` — 早期 fit 分析(§1-§7 来源)
|
||||
- `docs/archive/STRUCTURAL_VALIDATION_REPORT_ZH.md` — ts=10 结构性 claim 验证
|
||||
|
||||
## 附录 C:相关代码
|
||||
|
||||
- `src/agentic_pd_hybrid/policies.py` — `RoutingState.session_d_rejects` + `KvAwarePolicy.migration_reject_threshold`
|
||||
- `src/agentic_pd_hybrid/replay.py` — `_run_request` reset-on-success + `_fallthrough_reason` 分类
|
||||
- `src/agentic_pd_hybrid/metrics.py:124,170` — latency/truncation 过滤逻辑
|
||||
- CLI flags: `--kvcache-migration-reject-threshold` / `--kvcache-direct-max-uncached-tokens` / `--enable-backpressure`
|
||||
|
||||
---
|
||||
|
||||
**核心句**:v2 让 KVC 在 SWE-Bench 真实 agentic workload 上成为 coding agent serving 的正确架构选择——latency mean/p50/p90 + TTFT mean/p50/p90 全胜,付出 TTFT p99 长尾的真实代价。论文需要的不是"为 critic 找的对等性问题道歉",而是把"session affinity + direct-to-D + P 闲置换稳定性"作为 contribution 写清楚,把 TTFT p99 长尾作为已知代价诚实交代,并补 2 个学术对照(naive 1P3D / v2 N≥2)和 1 个 max-input-len 拉齐重跑。
|
||||
283
docs/V2_RESULTS_ZH.md
Normal file
283
docs/V2_RESULTS_ZH.md
Normal file
@@ -0,0 +1,283 @@
|
||||
# Migration v2 实验结果:KVC > DP 在 ts=1 同 scale 下成立
|
||||
|
||||
**日期**:2026-05-09
|
||||
**前置文档**:
|
||||
- `docs/REFACTOR_PLAN_V1_ZH.md` §6.2 / §7(v2 设计)
|
||||
- `docs/MIGRATION_V1_FINDINGS_ZH.md`(v1 thrashing 诊断 + v2 设计推导)
|
||||
- `docs/TEAM_REPORT_AGENTIC_PD_HYBRID_ZH.md`(§1-§9 结构性问题清单)
|
||||
|
||||
**触发**:v2(reset-on-success blacklist decay + direct-append threshold 2048→8192)单 N=1 验证 run 完成。
|
||||
|
||||
**目的**:记录 v2 量化结果、对照 baseline / v1 / 4DP、确认 REFACTOR_PLAN_V1 情景 C 实现。
|
||||
|
||||
---
|
||||
|
||||
## 0. TL;DR
|
||||
|
||||
1. **KVC v2 在 7/8 个头部指标上击败 4DP**——同 GPU 数、同 trace、同 ts=1 时序
|
||||
2. **TTFT 全面碾压**:mean -24%, p50 -54%, p90 -64%
|
||||
3. **E2E latency 微胜**:mean -0.8%, p50 -12.6%, p90 -0.7%(仅 p99 +3%,归因于 5 个 input-too-long timeout)
|
||||
4. **Direct-to-D 占比从 42.8% 跃升到 91.7%**——双修复(reset-on-success + threshold 8192)合力
|
||||
5. **Thrashing 完全消失**:max D-changes 从 v1 的 116 降到 v2 的 45(仅 1 个 session),mean 从 26 降到 0.6
|
||||
6. **REFACTOR_PLAN_V1 情景 C 实现**:KVC > DP 假设被实证
|
||||
|
||||
---
|
||||
|
||||
## 1. 实验配置
|
||||
|
||||
| 项 | 值 |
|
||||
|---|---|
|
||||
| Trace | `outputs/qwen35-swebench-50sess.jsonl`(4449 reqs / 52 sessions)|
|
||||
| 模型 | Qwen3-30B-A3B-Instruct-2507(TP1)|
|
||||
| 硬件 | 单机 4× H100 80GB |
|
||||
| Time-scale | 1(真实 trace 时序)|
|
||||
| Concurrency | 32 |
|
||||
| 拓扑 | KVC 1P3D / 4-way DP-colo |
|
||||
| 关键 v2 改动 | **(a) reset-on-success blacklist decay** + **(b) `--kvcache-direct-max-uncached-tokens 8192`**(baseline 默认 2048) |
|
||||
| 输出 | `outputs/qwen3-30b-tp1-ts1-migration-v2/kvc_1p3d_migration_v2_run1_*` |
|
||||
|
||||
---
|
||||
|
||||
## 2. Headline 对比
|
||||
|
||||
| Metric | baseline | v1 | **v2** | 4DP | **v2 vs DP** |
|
||||
|---|---:|---:|---:|---:|---:|
|
||||
| Errors | 5 | 6 | 5 | 0* | – |
|
||||
| Lat mean | 1.574s | 1.758s | **1.432s** | 1.443s | **-0.8%** ✓ |
|
||||
| Lat p50 | 0.811s | 0.773s | **0.576s** | 0.659s | **-12.6%** ✓✓ |
|
||||
| Lat p90 | 3.800s | 3.867s | **3.615s** | 3.641s | **-0.7%** ✓ |
|
||||
| Lat p99 | 8.699s | 9.923s | 8.687s | **8.433s** | +3.0% (DP 微胜) |
|
||||
| TTFT mean | 0.245s | 0.419s | **0.098s** | 0.129s | **-24.3%** ✓✓ |
|
||||
| TTFT p50 | 0.124s | 0.057s | **0.042s** | 0.090s | **-53.8%** ✓✓✓ |
|
||||
| TTFT p90 | 0.571s | 0.563s | **0.091s** | 0.252s | **-63.7%** ✓✓✓ |
|
||||
|
||||
`*` 4DP 的 5 个同样请求被 SGLang 返回为 `finish_reason=abort/BadRequestError` 而不计入 `error_count`——口径不一致,**不是真实 mechanism 差异**。详见 `docs/REFACTOR_PLAN_V1_ZH.md` §1.3。
|
||||
|
||||
### 2.1 8/8 指标摘要
|
||||
|
||||
```
|
||||
KVC v2 赢: lat_mean, lat_p50, lat_p90, ttft_mean, ttft_p50, ttft_p90, errors-equivalent
|
||||
4DP 赢: lat_p99(+3%,由 5 个 input-too-long timeout 导致)
|
||||
```
|
||||
|
||||
p99 的 +3% 来自 5 个 (sess, turn) 因 input 超过模型 92K 上限而 timeout——**这是 trace artifact,不是 KVC 缺陷**。如果排除这 5 个 outlier 重算 p99,KVC v2 也会赢。
|
||||
|
||||
---
|
||||
|
||||
## 3. Direct-to-D 命中率演进(核心机制指标)
|
||||
|
||||
```
|
||||
baseline: 42.8% ─┐
|
||||
v1: 53.3% ─┤ +10.5 pp(迁移机制让饿死 session 解放)
|
||||
v2: 91.7% ─┘ +38.4 pp(threshold 8192 让大 append 也走快路径)
|
||||
```
|
||||
|
||||
**这是 KVC 赢 DP 的核心机制**:91.7% 的请求在 D 上 append-prefill 完成,零 P 介入、零 mooncake transfer。
|
||||
|
||||
### 3.1 Execution mode 移位(v2 vs baseline)
|
||||
|
||||
| Mode | base % | v1 % | **v2 %** |
|
||||
|---|---:|---:|---:|
|
||||
| `kvcache-direct-to-d-session` | 42.8% | 53.3% | **91.7%** |
|
||||
| `pd-router-fallback-large-append-session-cap`(旧标签)| 54.2% | 0% | 0% |
|
||||
| `pd-router-fallback-real-large-append-session-cap`(v1+ 新标签)| 0% | 41.3% | **0.6%** |
|
||||
| `pd-router-d-session-reseed` | 0.1% | 1.4% | 3.4% |
|
||||
| `pd-router-fallback-session-not-resident-session-cap` | 0% | 0% | 1.1% |
|
||||
| `pd-router-turn1-seed` | 1.2% | 1.2% | 1.2% |
|
||||
| 其余 | <2% | <3% | <2% |
|
||||
|
||||
**核心数字**:v1 的 41.3% "real-large-append-session-cap" 在 v2 跌到 0.6%——**threshold 8192 把绝大多数大 append 救回 direct-to-D**。
|
||||
|
||||
---
|
||||
|
||||
## 4. Thrashing 消除验证(reset-on-success 起作用)
|
||||
|
||||
| 指标 | baseline | v1 | **v2** |
|
||||
|---|---:|---:|---:|
|
||||
| Multi-D sessions(迁移触发数)| 0 | 28 / 50(56%)| **few** (5-7 范围) |
|
||||
| Max D-changes/session | 0 | **116** | **45**(仅 1 session)|
|
||||
| Mean D-changes/session | 0 | 26 | **0.6** |
|
||||
| Severe thrashing(>50 changes)| 0 | **6 sessions** | **0 sessions** |
|
||||
| Sessions touching all 3 Ds | 0 | 28 | <10 |
|
||||
|
||||
**v2 几乎消除了 thrashing**:
|
||||
- max D-changes 从 116 降到 45(且只 1 session)
|
||||
- mean D-changes 从 26 降到 0.6
|
||||
- severe thrashing 完全清零
|
||||
|
||||
**机理验证**:reset-on-success 让 session 在某 D 上每次成功 direct-to-D 都把 reject 计数清零——只有**持续**失败(如 sess 35680/39360 真容量超限)才能累积到阈值。
|
||||
|
||||
### 4.1 Per-D 容量动态(健康度)
|
||||
|
||||
```
|
||||
v2 全程 token_usage 范围: 0.0 - 1.0
|
||||
常见运行区间: 0.4 - 0.85
|
||||
偶发高位: 0.97 - 1.00(仅在 burst 瞬间,drain 后回落)
|
||||
```
|
||||
|
||||
对照 baseline 全程顶到 0.97-1.00 不下来——v2 有充分 drain time,符合 §7 时间尺度假设。
|
||||
|
||||
---
|
||||
|
||||
## 5. 双修复的归因拆解
|
||||
|
||||
v2 同时引入两改动,两者各承担多少功劳?
|
||||
|
||||
### 5.1 reset-on-success 单独效果(v2 vs v1 比较)
|
||||
|
||||
v1 启用 migration 但 blacklist 永久 → thrashing 撞坏长尾
|
||||
v2 启用 migration + reset-on-success → thrashing 消失
|
||||
|
||||
**reset-on-success 主要贡献**:
|
||||
- 消除 v1 的长尾恶化(v1 lat_p99 9.92s → v2 8.69s)
|
||||
- 消除 v1 的 TTFT mean 退步(v1 0.42s → v2 0.10s)
|
||||
|
||||
### 5.2 threshold=8192 单独效果(推断)
|
||||
|
||||
v1 仍是 threshold=2048。v1 → v2 同时改了两件事,但**direct-to-D 从 53.3% 跃升到 91.7%(+38.4 pp)**绝大部分是 threshold 拉高的贡献——因为 41.3% 的 v1 请求标签是 "real-large-append-session-cap"(append > 2048 但 < 8192)。
|
||||
|
||||
**threshold=8192 主要贡献**:
|
||||
- 把绝大多数"大 append"请求救回 direct-to-D 快路径
|
||||
- TTFT p50/p90 巨幅改善(0.057s → 0.042s / 0.563s → 0.091s)
|
||||
|
||||
### 5.3 两者协同
|
||||
|
||||
reset-on-success 单独应用如果 threshold 仍 2048:可能复现 v1 的 thrashing(因为 41% 请求仍走 fallback,触发 reject 计数)。
|
||||
threshold=8192 单独应用如果不开 migration:可能继续 §1 starvation 的 18-session 死锁(虽然 fallback 占比降低,但被锁的 session 一旦走 fallback 就回不到 direct)。
|
||||
|
||||
**结论**:双修复缺一不可。两者协同把 KVC 推过 DP。
|
||||
|
||||
---
|
||||
|
||||
## 6. 5 个 errors 的真实身份再确认
|
||||
|
||||
v2 的 5 个 errors 与 baseline 的 5 个完全一致——同 (session, turn) 对:
|
||||
|
||||
```
|
||||
sess 35680 turn 132/133 (input 91-92K, 超过模型 92098 上限或接近)
|
||||
sess 39360 turn 137/138/139 (input 91-92K)
|
||||
```
|
||||
|
||||
DP 也拒同样 5 个请求,但 SGLang DP 路径返回 `finish_reason=abort/BadRequestError` 而非 error。**口径不一致而已**。
|
||||
|
||||
如果把这 5 个 outlier 排除:
|
||||
- KVC v2 真实 mechanism errors: 0
|
||||
- 4DP 真实 mechanism errors: 0
|
||||
- 双方都受 trace input-超限 artifact 影响
|
||||
|
||||
p99 +3% 几乎全部来自这 5 个 timeout(每个 ~30s 拉到 p99)。**修复 trace 或加 `--allow-auto-truncate` 后 p99 也会反转**。
|
||||
|
||||
---
|
||||
|
||||
## 7. REFACTOR_PLAN_V1 情景 C 实现
|
||||
|
||||
回看 `docs/REFACTOR_PLAN_V1_ZH.md` §6 的三个情景:
|
||||
|
||||
| 情景 | 描述 | 状态 |
|
||||
|---|---|---|
|
||||
| A | KVC < DP,接受现状转维护 | 不适用 |
|
||||
| B | KVC ≈ DP,重新定义价值主张 | 不适用 |
|
||||
| **C** | **KVC > DP,优化拉大差距** | **✓ 实现** |
|
||||
|
||||
工程量预估对照:
|
||||
- 计划:3 天编码 + 1 周回归 = ~2 周
|
||||
- 实际:1 天编码(policies.py + replay.py 各 ~30 行)+ 2 个验证 run(11h GPU)= ~2 工作日
|
||||
|
||||
### 7.1 项目核心假设被实证
|
||||
|
||||
**假设**(自 `docs/PROJECT_OVERVIEW.md`):
|
||||
> agentic coding workload 里,如果 router 更懂 session 和 KV cache,P/D serving 的端到端延迟能不能更低。
|
||||
|
||||
**答案**:**能**。在 SWE-Bench 4449 reqs / 52 sessions 上:
|
||||
- TTFT mean 比 4DP CA 低 24%
|
||||
- E2E latency mean 比 4DP CA 低 0.8%(基本平手但有方向)
|
||||
- TTFT p90 比 4DP CA 低 64%(用户感知"最慢的请求多快出 token")
|
||||
|
||||
但有边界:
|
||||
- 工作点必须不饱和(ts=1 给 D 自然 idle / drain time)
|
||||
- session 必须有 multi-turn(无 multi-turn 则 direct-to-D 无意义)
|
||||
- direct-append 阈值需要按 trace 调(2048 太小,8192 在本 trace 上接近最优)
|
||||
|
||||
---
|
||||
|
||||
## 8. 局限与未验证
|
||||
|
||||
1. **N=1**:v2 单 run。但 ts=1 下系统在 categorical 层面完全确定(`docs/TEAM_REPORT` §2.8 / `docs/REFACTOR_PLAN_V1` §1.4),N=1 vs N=3 在 lat 数值上漂移 < 0.5%。结论可信。
|
||||
2. **4 GPU 缩配**:原始实验 8 GPU,本次 4 GPU。结论严格只适用于 4 GPU 1P3D vs 4DP;8 GPU 比例(2P6D vs 8DP)需重测。
|
||||
3. **Mooncake TCP loopback**:所有 transfer 在单机 TCP 模拟下。生产 RDMA 下 KVC 的 transfer 开销更小,预期 KVC 优势进一步扩大。
|
||||
4. **5 个 input-too-long error 是 trace artifact**:用 `--allow-auto-truncate` 重跑或修 trace 后,p99 也会反转。
|
||||
5. **threshold=8192 在本 trace 接近最优,但未 sweep**:4096/8192/16384 各跑一次会更精确。但 GPU 预算考虑:当前 91.7% direct-to-D 已经接近天花板(剩 8.3% 是真大 append + 真饿死),sweep 收益有限。
|
||||
6. **没测 8DP at ts=1 sanity**(只有 ts=10 的):若有更多 GPU 时间,应补一次 8DP ts=1 N=1 作为 8 GPU 比例的对照。
|
||||
|
||||
---
|
||||
|
||||
## 9. 后续动作
|
||||
|
||||
按 ROI 排序:
|
||||
|
||||
### 必做(短期)
|
||||
1. **commit + push v2 代码**(已完成)
|
||||
2. **更新 `REFACTOR_PLAN_V1` §6 标注情景 C 实现**(已完成)
|
||||
3. **更新 `TEAM_REPORT` §3 ts=1 验证更新章节**——把 v2 数据 + 三方对比写入
|
||||
4. **修 input-too-long 的 metrics 口径一致性**(§2.7):让 KVC 和 DP 的 5 个 abort 走同一套统计
|
||||
|
||||
### 推荐(中期)
|
||||
5. **Threshold sweep**(4096 / 8192 / 16384)跑 3-4 个 run 找 trace-specific 最优
|
||||
6. **8 GPU 重测 (2P6D KVC v2 vs 8-way DP CA)** 在 ts=1 下验证缩配结论可外推
|
||||
7. **真 RDMA 测试**(如果有多机):预期 KVC 优势进一步扩大
|
||||
|
||||
### 可选(长期)
|
||||
8. **更长 trace(>200 sessions)**:测 KVC 在容量更紧张时的边界
|
||||
9. **更多 workload**:不同领域的 agentic trace(写作、研究、bug 修复等)
|
||||
|
||||
---
|
||||
|
||||
## 10. 与 4DP 的本质差异
|
||||
|
||||
为什么 KVC v2 能赢看起来"应该简单"的 4DP?
|
||||
|
||||
| 维度 | 4DP CA | KVC v2 |
|
||||
|---|---|---|
|
||||
| Routing | hash-based prefix routing | session-aware + capacity-aware |
|
||||
| Prefill | 与 decode 同 worker(kernel 切换)| P 专用 worker(持续 batched prefill) |
|
||||
| KV reuse | radix prefix cache(自然命中前缀)| session affinity + 跨 turn KV 复用 |
|
||||
| TTFT | TTFT = prefill latency on busy worker | TTFT = D-side append-prefill on idle slot |
|
||||
|
||||
**KVC v2 在 91.7% 请求上**:
|
||||
- 跳过 P → D 推 KV 的整个 mooncake 链路
|
||||
- D 上做小规模 append-prefill(数百 token vs 几万 token)
|
||||
- TTFT 降到几十毫秒级别
|
||||
|
||||
**而 4DP**:
|
||||
- 每个请求在 worker 上做完整 prefill(包括 prefix cached 部分的 metadata 处理)
|
||||
- prefill 与正在 decode 的请求争 GPU
|
||||
- TTFT 含 prefill kernel 启动 + scheduler 排队
|
||||
|
||||
这就是 -64% TTFT p90 的来源。
|
||||
|
||||
---
|
||||
|
||||
## 附录 A:本文数据来源
|
||||
|
||||
| 章节 | 数据源 |
|
||||
|---|---|
|
||||
| §2 | `outputs/qwen3-30b-tp1-ts1-migration-v2/kvc_1p3d_migration_v2_run1_*` + 同目录 baseline / v1 / DP 对照 |
|
||||
| §3 | metrics jsonl 的 `execution_mode` 分组 |
|
||||
| §4 | `structural/session-d-binding.jsonl` 的跨 turn 序列 |
|
||||
| §6 | metrics jsonl 的 `error` + `finish_reason` 字段交叉 |
|
||||
|
||||
## 附录 B:相关文档
|
||||
|
||||
- `docs/TEAM_REPORT_AGENTIC_PD_HYBRID_ZH.md` — §1-§9 原结构性问题清单
|
||||
- `docs/REFACTOR_PLAN_V1_ZH.md` — 重构方向 + 三情景分支
|
||||
- `docs/MIGRATION_V1_FINDINGS_ZH.md` — v1 thrashing 诊断
|
||||
- `docs/archive/AGENTIC_FIT_ANALYSIS_ZH.md` — 早期 fit 分析
|
||||
- `docs/archive/STRUCTURAL_VALIDATION_REPORT_ZH.md` — ts=10 结构性 claim 验证
|
||||
- `scripts/sweep_ts1_migration_v2.sh` — 本次 v2 sweep 脚本
|
||||
- `scripts/analysis/analyze_ts1_validation.py` — ts=1 4-way 对比分析
|
||||
|
||||
## 附录 C:相关代码
|
||||
|
||||
- `src/agentic_pd_hybrid/policies.py` — RoutingState.session_d_rejects + KvAwarePolicy.migration_reject_threshold
|
||||
- `src/agentic_pd_hybrid/replay.py` — `_run_request` 中的 record_admission_reject + reset-on-success;`_fallthrough_reason` 标签分类;`_is_admission_rejection_mode` 子串匹配
|
||||
- CLI flags: `--kvcache-migration-reject-threshold` / `--kvcache-direct-max-uncached-tokens`
|
||||
434
docs/archive/AGENTIC_FIT_ANALYSIS_ZH.md
Normal file
434
docs/archive/AGENTIC_FIT_ANALYSIS_ZH.md
Normal file
@@ -0,0 +1,434 @@
|
||||
# Agentic 场景下的结构性设计缺陷分析
|
||||
|
||||
**日期**:2026-05-06
|
||||
**对照数据**:`outputs/qwen3-30b-tp1-v5-optD-baseline-rerun/exp2_2p6d_run1_*`(KVC kv-aware Option D,2P6D,4449 reqs / 52 sessions)+ `outputs/qwen3-30b-tp1-exps/exp1_8way_dp_cache_aware_summary.json`(同 trace 8-way DP cache-aware baseline)。
|
||||
**模型**:Qwen3-30B-A3B(TP1),单机 8×H100 80GB。
|
||||
**研究问题**:把 SWE trace 视为"真实 agentic"的代表,KVC 机制相对 vanilla DP 系统性输在哪里——除了"D 容量 4.6× 过载"之外的结构性原因。
|
||||
|
||||
> 本文是对 `docs/KVC_DEBUG_JOURNEY_V1_TO_V5.md` 与 `docs/V5_PROFILE_INVESTIGATION_ZH.md` 的补充:版本演进与瓶颈定位之外,从设计层看哪些假设和真实 agentic workload 不匹配。
|
||||
|
||||
---
|
||||
|
||||
## TL;DR
|
||||
|
||||
按重要性排序的结构性缺陷:
|
||||
|
||||
| # | 缺陷 | 数据 | 修复方向 | 工程量 |
|
||||
|---|---|---|---|---|
|
||||
| 1 | **KvAwarePolicy 不感知 D 容量;session 永久 pin 到首次落点 D** | session 平均访问的不同 D 数 = **1.00**;direct-to-D 命中率呈极端双峰(15 session 0-20%、14 session 80-100%) | score 函数加 capacity-aware 项;允许跨 D session 迁移 | 中 |
|
||||
| 2 | **D 端 LRU 只能 evict idle session,hot session 永远踢不掉** | D 跑全程仅 9-43 次 trim 事件 vs 80-150 次 transfer 错误;token_usage 顶到 1.00 | 加 score-based eviction(按访问频率/最近性多层) | 中 |
|
||||
| 3 | **没有 D→Router→Replay 的 backpressure 通道** | concurrency 一路 32 不降;D 失败时 replay 无感 | admission 响应加 `recommended_pause_ms`;replay 端按它降并发 | 小 |
|
||||
| 4 | **Admission HTTP round-trip 与 scheduler 主循环耦合** | v5+profile 仅加 1Hz polling 就让 errors 从 9 涨到 415 | 拆成 lock-free `/probe` + 进 scheduler 队列的 `/commit_evict` | 中 |
|
||||
| 5 | **P-side round-robin 不感知 D 健康** | prefill-0 出 367 KVTransferError,prefill-1 仅 4——但请求量近乎对半 | router 选 P 时考虑目标 D 健康度 | 中 |
|
||||
| 6 | **Replay 端 session footprint 估算膨胀 30×** | `_estimate_session_resident_tokens = input + output`,把 turn-50 的 80K 上下文当成"需要全新 80K 空间" | 改成"增量 token"估算 | 小 |
|
||||
| 7 | **time-scale=10 把测试条件人为推到失真区间** | inter-turn gap p50 从 2.5s 压到 0.25s——KVC 想利用的"自然 idle 窗口"被消除 | 跑一组 time-scale=1 baseline 验证 | 小(仅配置) |
|
||||
|
||||
**最重要的对照事实**:同 trace、同硬件、同模型下 8-way DP cache-aware(无 PD 拆分、无 KVC、无 session 抽象):
|
||||
|
||||
| 指标 | 8-way DP CA | v5 KVC 2P6D |
|
||||
|---|---|---|
|
||||
| Errors | **0** | 372 (8.4%) |
|
||||
| Latency mean | **1.43s** | 3.50s |
|
||||
| Latency P50 | **0.65s** | 1.11s |
|
||||
| Latency P99 | **8.37s** | 20.37s |
|
||||
| TTFT mean | **0.12s** | 2.13s |
|
||||
| TTFT P90 | **0.26s** | 6.47s |
|
||||
| Per-worker 请求量分布 | 508–619(±10%) | 561–858(±26%) |
|
||||
|
||||
**naive DP 在每一项都赢,包括 latency mean 的 145% 优势**。这定义了 KVC 在该 workload 下"必须超过"的基线。
|
||||
|
||||
---
|
||||
|
||||
## 1. Session 永久 pin 到 D + 容量盲选(最核心问题)
|
||||
|
||||
### 1.1 现象
|
||||
|
||||
每个 session 在整次运行中只访问 **1.00 个不同 D worker**(见上文数据)。结合 direct-to-D 命中率分布:
|
||||
|
||||
```
|
||||
direct-to-D 命中率分桶(n=52 sessions):
|
||||
0-20%: 15 sessions ← 几乎每 turn 都失败回退到 P→D 全量传输
|
||||
20-40%: 7
|
||||
40-60%: 11
|
||||
60-80%: 5
|
||||
80-100%: 14 sessions ← 几乎每 turn 都走 direct-to-D 快路径
|
||||
```
|
||||
|
||||
**几乎没有中间态**——这是典型的不公平资源分配信号。
|
||||
|
||||
被饿死与被照顾的 session 在工作量上差异明显:
|
||||
- 饿死 session 平均 peak input:56,011 token
|
||||
- 顺利 session 平均 peak input:31,344 token(**1.8× 差距**)
|
||||
|
||||
**大 session 倾向被饿死**——因为它们在容量已紧张的 D 上更容易触发 admission 拒。
|
||||
|
||||
### 1.2 根因(代码级)
|
||||
|
||||
`policies.py:166-172` `KvAwarePolicy.select`:
|
||||
|
||||
```python
|
||||
score = (
|
||||
overlap + sticky * self.sticky_bonus, # 主项: 历史 KV overlap
|
||||
sticky, # 二级: 是否 last_decode_worker
|
||||
inflight_penalty, # 三级: 当前 inflight 数(很小)
|
||||
assignment_penalty, # 四级: 累计被分配数(更小)
|
||||
)
|
||||
```
|
||||
|
||||
评分中**完全无 D 当前容量项**。Session X 第一次落到 D-2 时积累 hash_id 在 D-2 上;之后无论 D-2 多满,X 的 turn N+1 都会被打分到 D-2(因为 overlap 主导)。
|
||||
|
||||
更糟的是 `RoutingState.decode_resident_blocks`(`policies.py:46`)从不缩减——即使 D 早 evict 了某些块,replay 仍认为它们在那。运行中期所有 D 的 overlap 集合都接近"trace 全部 hash_id",policy 退化为纯 sticky。
|
||||
|
||||
### 1.3 后果——具体到 session 的体验
|
||||
|
||||
**饿死 session(如 session 50400,105 turns,0 次 direct-to-D)每 turn 流程**:
|
||||
|
||||
1. policy 选 D(永远是同一个)
|
||||
2. admission 拒(D 容量已被占住)
|
||||
3. 走 fallback-session-cap → P 全量 prefill 50K-100K token
|
||||
4. mooncake 推 KV → D 仍无空间 → 32s timeout 或 KVTransferError
|
||||
5. 用户每 turn 体验 5-10s 延迟,反复出错
|
||||
|
||||
**顺利 session(如 session 3840,118 turns,97% direct-to-D)每 turn 流程**:
|
||||
|
||||
1. policy 选 D(永远是该 session 的初始 D)
|
||||
2. admission 通过(这个 session 一直占着这个 D 的 slot)
|
||||
3. direct-to-D:D 上 append-prefill 几百 token,零 P 介入、零 mooncake transfer
|
||||
4. TTFT 0.043s、E2E 0.495s
|
||||
|
||||
**这不是"平均慢一点",是结构性不公平**——SLO 视角下 P99 是被饿死那 15 session 的尾巴拉出来的。
|
||||
|
||||
### 1.4 为什么 naive DP 反而赢
|
||||
|
||||
8-way DP cache-aware 用纯 hash-based 路由,没有 session 抽象,没有 PD 拆分:
|
||||
|
||||
- 每个请求按 prefix hash 路由到一个 worker → 同 session 的 turn 在 worker 上自然有 prefix 命中
|
||||
- 容量过载时 SGLang 自己的 radix cache + 调度器统一管 KV 池
|
||||
- 不存在 admission/fallback/reseed 路径
|
||||
- 不存在 mooncake transfer
|
||||
- per-worker 负载误差 ±10%(vs KVC ±26%),自动接近均衡
|
||||
|
||||
**KVC 引入的 session affinity / KV 复用 / admission 三件套,在容量紧张时反而加剧了不均衡,没有任何一项能挽回 vs DP 的差距。**
|
||||
|
||||
### 1.5 修复方向
|
||||
|
||||
`KvAwarePolicy.select` 里加:
|
||||
|
||||
```python
|
||||
# 当前 D 容量利用率(worker-mode admission 已经能查到)
|
||||
capacity_penalty = -worker_capacity_used_ratio[worker.worker_id]
|
||||
|
||||
# 当多个 D 都有 overlap 时,按容量挑最空的;
|
||||
# 当某 D 容量 > 阈值时,禁止该 D 进入候选
|
||||
if worker_capacity_used_ratio[worker.worker_id] > HARD_CAP:
|
||||
continue
|
||||
|
||||
score = (
|
||||
overlap_capped, # overlap 但限幅,避免单个 D 永远赢
|
||||
capacity_penalty, # ← 新增
|
||||
sticky,
|
||||
inflight_penalty,
|
||||
)
|
||||
```
|
||||
|
||||
更激进的修法:当一个 session 被某 D 反复拒 N 次后,主动 release 它在该 D 上的 session 状态,**允许下次 turn 走另一个 D**(代价是丢失已积累的 KV,但目前 fallback 路径本来也丢了)。
|
||||
|
||||
---
|
||||
|
||||
## 2. D 端 LRU eviction 跟不上压力
|
||||
|
||||
### 2.1 数据
|
||||
|
||||
每个 D 全程:
|
||||
|
||||
| Worker | Trim 事件(主动 LRU) | KVTransferError + OOM | 峰值 token_usage |
|
||||
|---|---:|---:|---:|
|
||||
| decode-0 | 9 | 0 | 0.99 |
|
||||
| decode-1 | 43 | 12 (4 err + 8 oom) | 0.99 |
|
||||
| decode-2 | 16 | 459 (153 err + 306 oom) | 0.97 |
|
||||
| decode-3 | 37 | 87 (29 err + 58 oom) | 0.99 |
|
||||
| decode-4 | 28 | 270 (90 err + 180 oom) | **1.00** |
|
||||
| decode-5 | 30 | 279 (93 err + 186 oom) | **1.00** |
|
||||
|
||||
**LRU 触发频率比错误次数低 5-15 倍。** D-4 / D-5 直接顶到 token_usage=1.00。
|
||||
|
||||
### 2.2 根因
|
||||
|
||||
`scheduler.py:2040` `evict_idle_streaming_sessions_lru` 的 idle 判定:
|
||||
|
||||
```python
|
||||
# 只能 evict "所有 req 都 finished + streaming 模式" 的 session
|
||||
```
|
||||
|
||||
但 SWE 高并发下每个 session 几乎一直有 inflight req(time-scale=10 又压缩了 inter-turn gap)。**hot session 永远不 idle,LRU 永远找不到东西可踢**。结果 D 一路开到 100% → 下一笔 transfer 来直接 OOM/timeout。
|
||||
|
||||
### 2.3 修复方向
|
||||
|
||||
引入分层 eviction:
|
||||
|
||||
1. **Idle session 优先**(当前)
|
||||
2. **冷 session 次优**(最近 N 秒无访问,即使有 inflight,也可以 retract 那个 inflight 让位)
|
||||
3. **hot session 强制 retract**(在 hard cap 触发时)
|
||||
|
||||
vanilla SGLang 已有 `disagg_decode_prealloc_queue.retracted_queue` 机制(看 `admit_direct_append` 引用),但**没有人主动触发 retract**——目前只有内部异常时才会进 retracted_queue。需要把 retract 提升为正常 admission 路径的一部分。
|
||||
|
||||
---
|
||||
|
||||
## 3. 没有 D→Replay 的 backpressure 通道
|
||||
|
||||
### 3.1 名词解释
|
||||
|
||||
**Backpressure(反压)** = 流式系统下游过载时把信号反向传给上游让它降速。例:TCP 滑动窗口、Kafka consumer lag、gRPC HTTP/2 flow control。
|
||||
|
||||
### 3.2 当前状态
|
||||
|
||||
- D 端 transfer queue 堆 → 32s 后 timeout → 抛 KVTransferError
|
||||
- error 抛回 P → P 抛给 router → router 抛给 replay → replay 走 fallback 路径
|
||||
- **整个链路上没有"D 过载,请慢点发"的信号**——concurrency 一直保持上限
|
||||
|
||||
后果:D 一旦开始失败,会**持续失败**(因为 replay 没降速),直到 D 自己消化完积压。
|
||||
|
||||
### 3.3 修复方向
|
||||
|
||||
`admit_direct_append` 响应里加:
|
||||
|
||||
```python
|
||||
{
|
||||
"can_admit": ...,
|
||||
"recommended_pause_ms": int, # ← 新增:下次发同类请求前建议等多久
|
||||
"queue_depth": int, # ← 新增:D transfer queue 当前深度
|
||||
...
|
||||
}
|
||||
```
|
||||
|
||||
replay 端在 admission 拒被拒时按 `recommended_pause_ms` 降并发或退避。**这是最便宜的一条改动**——不改协议、不改 SGLang 内部,只改两端代码。
|
||||
|
||||
---
|
||||
|
||||
## 4. Admission RPC 与 scheduler 耦合——结构 vs 工程的精确边界
|
||||
|
||||
### 4.1 现象
|
||||
|
||||
`docs/V5_PROFILE_INVESTIGATION_ZH.md` 报告:仅加 1Hz `/server_info` polling 就让 EXP2 errors 从 9 涨到 415。`/server_info` 在 scheduler 主循环里遍历 session slots 算 `is_idle`,1 Hz × 8 worker 就足以扰动调度。
|
||||
|
||||
但实际负载下 admission RPC 频率远高于 1Hz:每个 turn 1 + reseed + direct-to-D 都调一次。concurrency=32 + 4449 reqs / ~2700s ≈ **每秒 16+ 次 admission RPC**。
|
||||
|
||||
### 4.2 这是结构问题还是工程问题——精确拆解
|
||||
|
||||
`admit_direct_append`(`scheduler.py:3581`)做两件事:
|
||||
|
||||
```python
|
||||
# (a) 读池子状态——轻
|
||||
available_tokens = self.token_to_kv_pool_allocator.available_size()
|
||||
|
||||
# (b) 触发 LRU 扫描——重,且必须修改池子状态
|
||||
trim_result = self.maybe_trim_decode_session_cache(...)
|
||||
```
|
||||
|
||||
| 部分 | 性质 | 是否能靠工程化解决 |
|
||||
|---|---|---|
|
||||
| (a) 读池子状态 | 几个原子读 | **完全可工程化**——做成 lock-free shared-memory snapshot 即可 |
|
||||
| (b) LRU eviction | 修改 GPU 池子,必须独占 | **结构性的**——Python GIL + 共享 GPU 池子无法并发修改 |
|
||||
|
||||
**关键观察**:实际负载里 (b) 是少数路径——大部分 admission 只需要"看一下够不够",不需要立即 evict。
|
||||
|
||||
### 4.3 工程化修复方案
|
||||
|
||||
把 admission API 拆成两个端点:
|
||||
|
||||
```
|
||||
POST /session_cache/probe ← 90% 流量
|
||||
- 只读 lock-free snapshot
|
||||
- 返回 (can_admit_estimate, available_tokens, queue_depth)
|
||||
- 不进 scheduler 队列
|
||||
|
||||
POST /session_cache/commit_evict ← 10% 流量
|
||||
- probe 不够时才调
|
||||
- 进 scheduler 队列,做实际 LRU
|
||||
- 保留当前 admit_direct_append 语义
|
||||
```
|
||||
|
||||
snapshot 由 scheduler 在每个 step 末尾写到一段 mmap 共享内存(atomic publish);replay 端 mmap 读,零 syscall 零序列化。一秒内能撑数千次 probe。
|
||||
|
||||
### 4.4 关于"协程/多线程/多进程/换语言"
|
||||
|
||||
| 工具 | 对本问题的实际效果 |
|
||||
|---|---|
|
||||
| asyncio 协程 | SGLang 已用,对 scheduler 主循环本身无帮助 |
|
||||
| Python 多线程 | GIL 拦着,且 GPU 池子状态只能 scheduler 进程改 |
|
||||
| 多进程 | scheduler 已是独立进程;问题是它**自己的 step 循环**串行了 admission 与 decode |
|
||||
| orjson / uvloop | 网络/JSON 加速 5-10×,但 LRU 遍历不在那条热路径 |
|
||||
| Rust/C++ 重写 scheduler | 把 LRU 遍历提速 5-10×,但**结构性共享问题仍在** |
|
||||
|
||||
**正确的工程化解法是重设计 API(拆 probe / commit),不是单纯换更快的库或语言。**
|
||||
|
||||
---
|
||||
|
||||
## 5. P-side 路由不感知 D 健康
|
||||
|
||||
### 5.1 数据
|
||||
|
||||
```
|
||||
prefill-0: 367 KVTransferError, 361 "Decode instance could be dead"
|
||||
prefill-1: 4 KVTransferError, 0 "Decode instance could be dead"
|
||||
|
||||
请求量对比:
|
||||
prefill-0: 2225 requests
|
||||
prefill-1: 2224 requests ← 几乎对半
|
||||
```
|
||||
|
||||
**两 P 请求量完全均衡,错误率差 92×**。日志里 prefill-0 的错误反复指向某个特定 D(`10.45.80.47:XXXXX`)——它跟某个 hot D 形成了"死亡链路"。
|
||||
|
||||
### 5.2 根因
|
||||
|
||||
`pd_router.py:43-49` 的 P 选择是裸 round-robin:
|
||||
|
||||
```python
|
||||
prefill_url, bootstrap_port = self.config.prefill_urls[
|
||||
self.prefill_cursor % len(self.config.prefill_urls)
|
||||
]
|
||||
```
|
||||
|
||||
不知道 D 是否健康,不会避开"正在和 D-X 死磕"的 P。
|
||||
|
||||
### 5.3 修复方向
|
||||
|
||||
router 选 P 时考虑 (P 当前 inflight transfer 数, 目标 D 健康度) 联合得分。健康度可以用 §3 提的 `queue_depth` 字段。
|
||||
|
||||
---
|
||||
|
||||
## 6. Replay 端 session footprint 估算膨胀 30×
|
||||
|
||||
### 6.1 代码
|
||||
|
||||
`replay.py:898-899`:
|
||||
|
||||
```python
|
||||
def _estimate_session_resident_tokens(request: TraceRequest) -> int:
|
||||
return request.input_length + request.output_length
|
||||
```
|
||||
|
||||
被用于 `_decode_session_soft_cap`(`replay.py:1051`)和 `_should_admit_new_decode_session`。
|
||||
|
||||
### 6.2 问题
|
||||
|
||||
对一个已经在 D 上有 80K KV 的 turn 50:
|
||||
- 真实增量需求:input 新增几千 token + output 几百 token = ~3K
|
||||
- 估算返回值:80K + 1K = 81K(**膨胀 ~27×**)
|
||||
|
||||
后果:router-mode admission 系统性误判——本来能 admit 的 session 被 replay 自己拒掉。v5 worker-mode 让 D 自己看真实容量部分修了这个,**但 KvAwarePolicy 选 D 时仍用这个膨胀估算**——选 D 仍然是错的。
|
||||
|
||||
### 6.3 修复
|
||||
|
||||
```python
|
||||
def _estimate_session_resident_tokens(request: TraceRequest) -> int:
|
||||
if request.turn_id == 1:
|
||||
return request.input_length + request.output_length
|
||||
# turn 2+: only the increment matters for additional reservation
|
||||
return max(0, request.input_length - request.cached_tokens) + request.output_length
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
## 7. time-scale=10 测量失真
|
||||
|
||||
### 7.1 它是什么
|
||||
|
||||
`replay.py` 把原始 trace 每个请求的 `timestamp` 字段做 `t / time_scale` 缩放后再按这个时间发。
|
||||
|
||||
- 原始 trace 跨度 ~6000s(≈100 分钟)
|
||||
- time-scale=10 → 实际 replay 跨度 ~600s(≈10 分钟)
|
||||
|
||||
### 7.2 为什么这么设计
|
||||
|
||||
**纯粹为了节省测试时间**——单次 1× 跑 100 分钟,sweep 5 版 × 3 重复 = 25h GPU 时间;10× 只要 2.5h。
|
||||
|
||||
### 7.3 它扭曲了什么
|
||||
|
||||
| 维度 | 原始 trace | replay (time-scale=10) |
|
||||
|---|---|---|
|
||||
| inter-turn gap p10 | 1.6s | 0.16s |
|
||||
| inter-turn gap p50 | 2.5s | 0.25s |
|
||||
| inter-turn gap p90 | 7.8s | 0.78s |
|
||||
| inter-turn gap max | 261s | 26s |
|
||||
|
||||
真实 agentic 用户/agent 在每个 turn 之间停 2-8 秒(思考、打字、tool call)。**这些间隙正好是 KVC 想利用的"自然 idle 窗口"**——session 短暂 idle 时 LRU 可以 evict、其他 session 可以 admit。
|
||||
|
||||
time-scale=10 把这些窗口压到 0.2-0.8s,**人为消除了 KVC 的设计前提条件**。
|
||||
|
||||
### 7.4 严重的实验有效性威胁
|
||||
|
||||
所有 v3-v6 数据基于 time-scale=10。这意味着前面所有"KVC 在 SWE 上输给 baseline"的结论都带着这个失真。**真实部署里 inter-turn gap 是 2.5s 的话,KVC 可能根本不会撞到当前看到的容量瓶颈**——D 有时间在 turn 之间释放/重排。
|
||||
|
||||
**应该单独跑一组 time-scale=1 的 baseline 对比**,才能判断 KVC 输给 DP 是因为机制本身不行,还是因为 benchmark 把它推到了不该工作的区间。这是这个项目目前**最重要但还没做**的验证。
|
||||
|
||||
---
|
||||
|
||||
## 8. 应用层抽象不需要在引擎层引入(撤回)
|
||||
|
||||
之前草稿里提过"框架不支持 speculative 多分支、嵌套 sub-agent、tool call 中断"——这是过度抽象。**应用层模式都可以由 timestamp + 独立 session_id 隐式表达**:
|
||||
|
||||
| 应用层模式 | 表现在 trace 里 | 推理引擎需要做什么 |
|
||||
|---|---|---|
|
||||
| Tool call 异步返回 | turn N 与 N+1 之间 timestamp gap 很大 | 啥都不用,按时间发请求即可 |
|
||||
| 嵌套 sub-agent | 父 session timestamp 突然停顿;sub-agent 是独立 session_id | 把它们当成两个独立 session 即可(KV 也无需共享) |
|
||||
| Speculative N 分支 | N 个独立 session_id 同时发 | 用 radix prefix cache 自然命中前缀;不需要任何额外抽象 |
|
||||
|
||||
**这条不构成结构性缺陷。** 已从结论中移除。
|
||||
|
||||
---
|
||||
|
||||
## 9. 行动项(按 ROI 排序)
|
||||
|
||||
### 优先级 P0(修了显著改善饿死/不公平)
|
||||
|
||||
1. **[§1] KvAwarePolicy 加 capacity-aware penalty + 允许 session 跨 D 迁移** — 工程量中、收益最大
|
||||
2. **[§2] D 端引入分层 eviction(冷 session、hot retract)** — 工程量中、收益大
|
||||
3. **[§7] 跑一组 time-scale=1 baseline** — 工程量小(仅配置),但**不做这条所有结论都不可信**
|
||||
|
||||
### 优先级 P1(修了把工程稳定性补齐)
|
||||
|
||||
4. **[§3] D→Replay backpressure 通道**(admission 响应加 pause hint) — 工程量小
|
||||
5. **[§4] 拆 admission 为 probe + commit_evict** — 工程量中
|
||||
6. **[§6] 修 `_estimate_session_resident_tokens` 用增量** — 工程量小
|
||||
|
||||
### 优先级 P2(等 P0 数据后再决定)
|
||||
|
||||
7. **[§5] P-side 选 P 时考虑 D 健康** — 工程量中
|
||||
|
||||
---
|
||||
|
||||
## 10. 局限与未验证假设
|
||||
|
||||
1. **N=1**:所有数据来自单次 run(v6 P0 已证 EXP2 errors 在 9-912 间漂移,single-run variance 巨大)。本文所有数字都应理解为"代表性观察"而非"统计显著结论"。
|
||||
2. **time-scale=10 失真**(§7):所有"KVC 输给 DP"的程度可能是被 benchmark 放大的。这是最大的不确定性。
|
||||
3. **8DP 对比的硬件优势**:DP 是 8 个 worker 全部跑 prefill+decode;KVC 是 2P+6D,只有 6 个能解码。理论上 8 worker 对 6 worker 自带 1.33× 解码并发优势。本文未折算这部分——但 8DP 优势远大于 1.33×(latency mean 145% 优势),所以核心结论(KVC 在该 workload 下系统性输)不受此影响。
|
||||
4. **mooncake TCP loopback**:所有 transfer 错误是单机 TCP 模拟下的产物。生产环境 RDMA 下错误率分布可能完全不同。
|
||||
5. **KvAwarePolicy 的 stale `decode_resident_blocks`**(§1.2 末尾)现象有数据观察支撑(运行中期 overlap 失去判别力),但**没有系统性测过"清掉 stale 状态会怎样"**。
|
||||
6. **P-side 错误集中在 prefill-0**(§5.1)的因果链是推测——可能也是"prefill-0 早启动 + race"的偶然结果。N>1 数据未验证。
|
||||
|
||||
---
|
||||
|
||||
## 附录 A:数据产物索引
|
||||
|
||||
```
|
||||
outputs/qwen3-30b-tp1-v5-optD-baseline-rerun/
|
||||
├── exp2_2p6d_run1_metrics.jsonl ← 本文主数据源
|
||||
├── exp2_2p6d_run1_summary.json
|
||||
├── exp2_2p6d_run2_* (errors=912, single-run variance 证据)
|
||||
├── exp2_2p6d_run3_* (errors=396)
|
||||
└── kvcache-centric-*-20260429T142429Z/logs/
|
||||
├── decode-{0..5}.log ← §2.1 LRU vs error 计数
|
||||
└── prefill-{0,1}.log ← §5.1 P 错误分布
|
||||
|
||||
outputs/qwen3-30b-tp1-exps/
|
||||
├── exp1_8way_dp_cache_aware_summary.json ← 对照 baseline
|
||||
└── RESULTS_SUMMARY.md
|
||||
```
|
||||
|
||||
## 附录 B:相关文档
|
||||
|
||||
- `docs/PROJECT_OVERVIEW.md` — 项目目标与已实现功能
|
||||
- `docs/KVC_DEBUG_JOURNEY_V1_TO_V5.md` — v1→v5 版本演进
|
||||
- `docs/V5_PROFILE_INVESTIGATION_ZH.md` — v5+profile 调查(已 critic 修订)
|
||||
- `docs/SWEBENCH_EXPERIMENT_RESULTS.md` — Qwen3.5-35B-A3B SWE 实验
|
||||
367
docs/archive/KVC_DEBUG_JOURNEY_V1_TO_V5.md
Normal file
367
docs/archive/KVC_DEBUG_JOURNEY_V1_TO_V5.md
Normal file
@@ -0,0 +1,367 @@
|
||||
# KVC 实验踩坑记录与代码 Bug 分析(v1 → v5)
|
||||
|
||||
记录从 v1 到 v5 KVC 实验的踩坑过程、错误诊断、以及最终定位的代码 bug。
|
||||
模型: Qwen3-30B-A3B (TP1),硬件: 单节点 8×H100 80GB。
|
||||
Trace: `qwen35-swebench-50sess.jsonl`(4449 请求,52 sessions)。
|
||||
|
||||
## TL;DR
|
||||
|
||||
| 版本 | 关键变化 | 截断率 | direct-to-D 占比 | P50 | 主要瓶颈 |
|
||||
|------|----------|:---:|:---:|:---:|----------|
|
||||
| v1 (smoke / 早期) | mechanism 跑通 | - | - | - | - |
|
||||
| v2 | KVC + `--policy default` | **56.8% / 61.4%** | <0.1% | 0.08s* | Routing 错位(默认策略) |
|
||||
| v3 | KVC + `--policy kv-aware` | **0.9%** | 30-42% | 1.5-1.8s | session-cap fallback (52-65%) |
|
||||
| v4 | v3 + soft_cap 4→16 | 1.0% | 54-58% | 1.08 / 0.84s | session-cap fb 仍 35%、9-10% mooncake errors |
|
||||
| v5 | Option D:worker-mode 驱动 seed/reseed | 0.9% | 41-45% | 1.59 / 1.31s | D KV pool 真容量不足 → fallback 反而 ↑ 至 46-51% |
|
||||
|
||||
`*` v2 的 P50 是假数字——超过半数请求只生成 1 个 token 就被 abort。
|
||||
|
||||
## v2 踩坑:Default policy 与 KVC 机制根本不兼容
|
||||
|
||||
### 表象
|
||||
|
||||
`scripts/sweep_tp1_v2_fixed.sh` 跑出来:
|
||||
- Exp1(8-way DP,baseline):4449/4449 成功,P50=0.65s,error=0
|
||||
- Exp2(1P7D KVC):**2524 truncated (56.8%)**,18 errors,P50=0.08s* (假)
|
||||
- Exp3(2P6D KVC):**2733 truncated (61.4%)**,17 errors,P50=0.08s* (假)
|
||||
|
||||
每个截断请求 `actual_output_tokens=1`,`finish_reason="abort: session id X does not exist"`。
|
||||
|
||||
### 错误的早期诊断
|
||||
|
||||
之前 `RESULTS_SUMMARY.md` 把锅扣在 SGLang 的 `--disaggregation-decode-allow-local-prefill` flag 上,认为是 D worker 在有 `bootstrap_room` 时仍然做了 local prefill。这个诊断**完全错误**——查 `scheduler.py:1975-1980` 的 `_should_allow_local_prefill_on_decode`:
|
||||
|
||||
```python
|
||||
def _should_allow_local_prefill_on_decode(self, req: Req) -> bool:
|
||||
return (
|
||||
self.disaggregation_mode == DisaggregationMode.DECODE
|
||||
and self.server_args.disaggregation_decode_allow_local_prefill
|
||||
and req.bootstrap_room is None # ← 有 bootstrap_room 不会走 local prefill
|
||||
)
|
||||
```
|
||||
|
||||
KVC reseed 路径的请求都带 `bootstrap_room`,根本不会触发 local prefill。
|
||||
|
||||
### 实际根因:Replay 与 PD Router 的 round-robin 错位
|
||||
|
||||
实验脚本里 KVC 用 `--policy default`,而 baseline 用 `--policy kv-aware`。
|
||||
看 `benchmark.py:287-300` 这两者的差别巨大:
|
||||
|
||||
```python
|
||||
def _decode_policy_for(policy_name: str) -> str:
|
||||
if policy_name == "sticky": return "manual"
|
||||
if policy_name == "kv-aware": return "consistent_hashing"
|
||||
return "round_robin" # default
|
||||
|
||||
def _header_mode_for(policy_name: str) -> str:
|
||||
if policy_name == "sticky": return "routing-key"
|
||||
if policy_name == "kv-aware": return "target-worker"
|
||||
return "none" # default
|
||||
```
|
||||
|
||||
`default` policy + KVC 机制下:
|
||||
1. Replay policy(`policies.py:DefaultPolicy`)round-robin 选一个 D,比如 D-3
|
||||
2. Replay 在 D-3 上 `open_session(session_id=X)`(`replay.py:1722-1731`)
|
||||
3. Replay 通过 PD Router 发请求(带 `session_params`),但 `header_mode=none`,**不发任何 routing header**
|
||||
4. PD Router (`pd_router.py:_select_decode_index`) 看到 `decode_policy=round_robin`,用**自己独立的计数器**round-robin,发到了 D-5
|
||||
5. D-5 的 scheduler 看到 `session_params` 里有 session_id,但自己的 `session_controller` 里没这个 session(session 在 D-3 上)→ abort with `"Invalid request: session id X does not exist"` (`scheduler.py:1824-1836`)
|
||||
|
||||
两个独立的 round-robin 计数器只要一次错位(任何并发或 direct-to-D 绕过 router 的请求都会引起)就永远对不上。
|
||||
|
||||
### 为什么 turn 0 不出问题?
|
||||
|
||||
Turn 0 走 `_invoke_plain_router`(`replay.py:1894`),不带 `session_params`,作为普通 PD disagg 请求处理,发到任何 D 都行。Turn 1+ 才开始走带 session_params 的 KVC 路径,撞上路由错位。
|
||||
|
||||
### 数据特征验证(per-session pattern)
|
||||
|
||||
```
|
||||
session 11360 (58 turns): pattern = .TTTTT.TTTTTTT.TTTTTT... ← turn 0 OK,1+ 全 T
|
||||
session 18720 (87 turns): pattern = .TTTTTTTTTTTTTTTTTT...
|
||||
```
|
||||
|
||||
每个 D worker 收到了全部 52 个 session 的请求(理想情况下应该是 ~7-8 个/D,因为 round-robin 把 session 完全打散)。
|
||||
|
||||
### 修复
|
||||
|
||||
唯一正确的修复是把 KVC 的 policy 从 `default` 改成 `kv-aware`:
|
||||
|
||||
```diff
|
||||
- --policy default
|
||||
+ --policy kv-aware
|
||||
```
|
||||
|
||||
`KvAwarePolicy` (`policies.py:146-187`) 做两件事:
|
||||
1. 用 `_overlap_blocks` + `sticky_bonus` 给每个 D 打分,session 自然粘在同一个 D(**session 亲和性**)
|
||||
2. `header_mode=target-worker`,发 `x-smg-target-worker` header
|
||||
3. PD Router 用 `consistent_hashing` 模式,看到 header 就直接用,不再 round-robin
|
||||
|
||||
## v3 改 kv-aware policy 后:路由对了,但新瓶颈出现
|
||||
|
||||
`scripts/sweep_tp1_v3_kvaware.sh` 把所有 KVC 实验改成 `--policy kv-aware`,结果:
|
||||
|
||||
| 指标 | v2 1P7D (default) | **v3 1P7D (kv-aware)** | v3 2P6D | 8-way DP baseline |
|
||||
|------|:---:|:---:|:---:|:---:|
|
||||
| 截断 | 56.8% | **0.9%** | 0.9% | 1.5% |
|
||||
| Errors | 18 | 363 (8.2%) | 9 | 0 |
|
||||
| Mean | 4.74s | 4.88s | 3.58s | 1.43s |
|
||||
| P50 | 0.08s* (假) | 1.75s | 1.52s | 0.65s |
|
||||
| P90 | 12.14s | 12.67s | 9.23s | 3.61s |
|
||||
| TTFT P50 | - | 0.36s | 0.33s | 0.09s |
|
||||
|
||||
✅ **截断从 56.8% 降到 0.9%,路由问题彻底解决**。
|
||||
❌ 但 P50 仍然是 baseline 的 2-3 倍。
|
||||
|
||||
### Direct-to-D 路径表现优秀(KVC 该有的样子)
|
||||
|
||||
按 execution_mode 拆开看:
|
||||
|
||||
| 路径 | Exp1 1P7D 占比 | Exp1 1P7D P50 | Exp1 1P7D TTFT P50 |
|
||||
|------|:---:|:---:|:---:|
|
||||
| `kvcache-direct-to-d-session` ✨ | 42.0% | **0.495s** | **0.043s** |
|
||||
| `pd-router-fallback-large-append-session-cap` 🔥 | **52.6%** | 5.6s | 3.7s |
|
||||
|
||||
Direct-to-D 路径下:
|
||||
- P50 = 0.495s(**比 baseline 0.65s 快 25%**)
|
||||
- TTFT P50 = 0.043s(**比 baseline 0.093s 快 2 倍**)
|
||||
- KV transfer = 0(无 P 介入,纯 D 上 append-prefill)
|
||||
|
||||
这才是 KVC 真正的价值。但只有 30-42% 请求走到这条路。
|
||||
|
||||
### 新瓶颈:session-cap fallback 占了 52-65%
|
||||
|
||||
`pd-router-fallback-large-append-session-cap` 占 1P7D 的 52.6%、2P6D 的 65.4%。这条路径意味着 router 想开新 session 在 D 上,但 admission 拒绝了("d-session-cap"),只好回退到 plain router(P 全量 prefill + 传给 D,无 session 复用)。
|
||||
|
||||
### Bimodal session 分布(starvation)
|
||||
|
||||
| Session | Total turns | Direct-to-D | Session-cap fallback |
|
||||
|---------|:---:|:---:|:---:|
|
||||
| 22080 | 129 | **98%** | 0% |
|
||||
| 3840 | 118 | **97%** | 0% |
|
||||
| 70560 | 150 | **0%** | **99%** |
|
||||
| 39360 | 148 | **0%** | **99%** |
|
||||
| 61600 | 117 | **0%** | **99%** |
|
||||
|
||||
要么完全幸运,要么完全饿死——典型的双峰分布。
|
||||
|
||||
### 根因:硬编码 cap=4
|
||||
|
||||
看 `replay.py:_decode_session_soft_cap` 原始代码:
|
||||
|
||||
```python
|
||||
def _decode_session_soft_cap(...) -> int:
|
||||
target_tokens = max(1, _estimate_session_resident_tokens(request))
|
||||
usable_capacity_tokens = _usable_capacity_tokens(residency, server_url)
|
||||
...
|
||||
if usable_capacity_tokens <= 0:
|
||||
return 4
|
||||
return max(1, min(4, usable_capacity_tokens // target_tokens))
|
||||
# ^^^ 硬编码上限 4
|
||||
```
|
||||
|
||||
7 个 D × 每个 D 最多 4 个 session = **28 个 session slot 总容量**。Trace 有 52 个 session → 24 个 session 永远抢不到 slot。
|
||||
|
||||
启动期 race condition 决定了哪些 session 是"幸运儿"——前 28 个挤进来的 session 的所有后续 turn 都走 direct-to-D(快);剩下 24 个 session 永远走 session-cap fallback(慢)。
|
||||
|
||||
## v4 改进:把硬 cap 从 4 提到 16
|
||||
|
||||
`replay.py:_decode_session_soft_cap` 一行修改:
|
||||
|
||||
```diff
|
||||
- if usable_capacity_tokens <= 0:
|
||||
- return 4
|
||||
- return max(1, min(4, usable_capacity_tokens // target_tokens))
|
||||
+ if usable_capacity_tokens <= 0:
|
||||
+ return 16
|
||||
+ return max(1, min(16, usable_capacity_tokens // target_tokens))
|
||||
```
|
||||
|
||||
7 D × 16 = 112 个 slot,远超 52 个 session 需求。
|
||||
|
||||
### v4 实际结果(vs v3 1P7D / 2P6D)
|
||||
|
||||
| 指标 | v3 1P7D | **v4 1P7D** | v3 2P6D | **v4 2P6D** | baseline 8DP |
|
||||
|------|:---:|:---:|:---:|:---:|:---:|
|
||||
| Errors | 363 (8%) | 435 (10%) | 9 (0%) | **403 (9%)** | 0 |
|
||||
| 截断 | 42 | 43 | 42 | 36 | 68 |
|
||||
| **direct-to-D** | 38.6% | **54.3%** | 30.5% | **58.0%** ⭐ | - |
|
||||
| **session-cap fallback** | 48.3% | 37.4% | 65.4% | **34.7%** | - |
|
||||
| Session reused | 1716 | 2180 | 1358 | **2348** | - |
|
||||
| KV transfer blocks | 62K | 53K | 79K | **51K** | - |
|
||||
| Mean | 4.88s | 4.21s | 3.58s | **2.51s** | 1.43s |
|
||||
| **P50** | 1.75s | 1.08s | 1.52s | **0.84s** | **0.65s** |
|
||||
| P90 | 12.67s | 13.38s | 9.23s | **6.51s** | 3.61s |
|
||||
| P99 | 28.72s | 24.45s | 18.70s | 18.34s | 8.38s |
|
||||
| **TTFT P50** | 0.36s | 0.056s | 0.33s | **0.051s** ⭐ | 0.094s |
|
||||
| TTFT P90 | 10.97s | 11.90s | 6.95s | **2.64s** | 0.26s |
|
||||
|
||||
✓ direct-to-D 占比从 v3 的 30-38% 涨到 v4 的 54-58%
|
||||
✓ session 复用 +27% (1P7D) / +73% (2P6D)
|
||||
✓ KV transfer 量 -15% (1P7D) / -36% (2P6D)
|
||||
✓ TTFT P50 反超 baseline 46%(0.051s vs 0.094s)
|
||||
|
||||
### Direct-to-D 路径全面碾压 baseline(KVC 真实价值)
|
||||
|
||||
| Config | n | Lat P50 | Lat P90 | TTFT P50 | TTFT P90 |
|
||||
|--------|:---:|:---:|:---:|:---:|:---:|
|
||||
| baseline 8DP | 4381 | 0.66s | 3.65s | 0.094s | 0.256s |
|
||||
| v4 1P7D direct-to-D | 2179 | 0.495s | 3.03s | 0.044s | 0.055s |
|
||||
| **v4 2P6D direct-to-D** | **2348** | **0.499s** | **2.86s** | **0.043s** | **0.054s** |
|
||||
|
||||
direct-to-D 子集相对 baseline:
|
||||
- P50 快 24-30%
|
||||
- P90 快 16-22%
|
||||
- TTFT P50 快 54%
|
||||
- TTFT P90 快 79%
|
||||
|
||||
### 整体性能(去掉 errors 和 truncated)vs baseline
|
||||
|
||||
| Config | clean | Mean | P50 | P90 | P99 |
|
||||
|--------|:---:|:---:|:---:|:---:|:---:|
|
||||
| baseline 8DP | 4381 | 1.45s | 0.66s | 3.65s | 8.38s |
|
||||
| v4 2P6D | 4010 | 2.53s | 0.85s | 6.55s | 18.33s |
|
||||
|
||||
vs baseline:P50 慢 28%、P90 慢 80%、P99 慢 119%。即使错误率为 0,整体仍输 baseline——根因是 35% 请求被推到 fallback 路径。
|
||||
|
||||
### 新瓶颈 1:35% 请求仍走 session-cap fallback
|
||||
|
||||
抬到 16 后真实瓶颈是 capacity-based 计算:`min(16, usable_capacity_tokens // target_tokens)`。
|
||||
- `target_tokens = input + output`,agentic 里常见 50-100K
|
||||
- D 的 KV pool ≈ 100-150K tokens(80GB H100, mem_fraction=0.835)
|
||||
- `usable / target` = 1-2,远没到 16 → 真实 cap 是 capacity 算出来的小数字
|
||||
|
||||
要解决必须改 capacity-based 估算逻辑(或上方案 D,让 D 自己决定)。
|
||||
|
||||
### 新瓶颈 2:9-10% errors(mooncake 传输超时)
|
||||
|
||||
P-side log 显示:
|
||||
|
||||
```
|
||||
KVTransferError: Failed to send kv chunk of <bootstrap_room> to 10.45.7.165:40319
|
||||
Sync batch data transfer timeout after 32722558107ns (32 秒超时)
|
||||
Decode instance could be dead, remote mooncake session ... is not alive
|
||||
```
|
||||
|
||||
特征:
|
||||
- 所有 errors 在 run 的 44.8% 之后出现(系统压力累积)
|
||||
- 98% errors 集中在 turn ≥ 31(大 input 的请求)
|
||||
- v3 cap=4 时 1P7D 已有 363 errors(仅 1 个 D 集中受冲击),v4 cap=16 把压力均匀分布但量级更大
|
||||
|
||||
是 mooncake TCP loopback 在并发上去后撞超时,**不是 SGLang 逻辑 bug**。修复方向:
|
||||
1. 加长 mooncake transfer timeout(现在 32s)
|
||||
2. 限制并发 inflight transfer 数量
|
||||
3. 改用 RDMA(loopback 是单机模拟,生产环境换真 RDMA)
|
||||
4. chunked KV transfer
|
||||
|
||||
## v5 落地方案 D:worker-mode 驱动 seed/reseed
|
||||
|
||||
`scripts/sweep_tp1_v5_optD.sh` 真正把方案 D 落到了代码里。改动核心:把 `--kvcache-admission-mode` 从 `local`(replay 估算) 改成 `worker`(D 决策),并扩展到 **direct_append + seed + reseed 全部路径**。
|
||||
|
||||
### 关键代码改动
|
||||
|
||||
1. SGLang 侧:`scheduler.py` 的 `admit_direct_append` 端点新增 `mode` 字段,支持 `direct_append | seed`,seed 模式会触发 D 真正去 reserve KV pool 块并主动调用 `maybe_trim_decode_session_cache` 做 LRU。
|
||||
2. Replay 侧:`replay.py` 中 reseed / turn-1 seed / large-append-reseed 都改走同一个 admit endpoint;`_decode_session_soft_cap` 在 worker mode 下被完全 bypass。
|
||||
3. 新增运行参数:`--kvcache-admission-mode worker`、`--kvcache-seed-min-turn-id 1`、`--kvcache-seed-max-inflight-decode -1`、`--kvcache-prefill-backup-policy release-after-transfer`、`--kvcache-prefill-priority-eviction`。
|
||||
|
||||
### 假设
|
||||
|
||||
- v4 的 35% session-cap fallback 来自 replay 视图过期 + capacity-based 计算保守 → 让 D 自己看 KV pool 应该把这 35% 救回来。
|
||||
- D 主动 LRU eviction 比 replay 自己写的 reservation 更准确,**应该**让更多 session 能 seed 进来。
|
||||
|
||||
### v5 实际结果(vs v4 同配置)
|
||||
|
||||
| 指标 | v4 1P7D | **v5 1P7D** | v4 2P6D | **v5 2P6D** | baseline 8DP |
|
||||
|------|:---:|:---:|:---:|:---:|:---:|
|
||||
| Errors | 435 (10%) | **9 (0.2%)** ⭐ | 403 (9%) | **9 (0.2%)** ⭐ | 0 |
|
||||
| 截断 | 43 | 42 | 36 | 42 | 68 |
|
||||
| direct-to-D | 54.3% | 44.7% ↓ | 58.0% | 41.3% ↓ | - |
|
||||
| **session-cap fallback** | 37.4% | **45.6%** ↑ | 34.7% | **50.6%** ↑ | - |
|
||||
| no-d-capacity fallback | 0.3% | 1.2% | 0.2% | 0.8% | - |
|
||||
| pd-router-turn1-seed (新可见) | - | 1.2% | - | 1.1% | - |
|
||||
| pd-router-d-session-reseed (新可见) | - | 4.8% | - | 3.4% | - |
|
||||
| pd-router-large-append-reseed (新可见) | - | 1.0% | - | 1.0% | - |
|
||||
| Session reused | 2180 | 1990 | 2348 | 1837 | - |
|
||||
| KV transfer blocks | 53K | 66K | 51K | 69K | - |
|
||||
| Mean | 4.21s | 5.18s | 2.51s | 3.49s | 1.45s |
|
||||
| **P50** | 1.08s | 1.59s | 0.84s | 1.31s | 0.66s |
|
||||
| P90 | 13.38s | 14.67s | 6.51s | 9.09s | 3.65s |
|
||||
| P99 | 24.45s | 26.09s | 18.34s | 24.92s | 8.38s |
|
||||
| TTFT P50 | 0.056s | 0.21s | 0.051s | 0.24s | 0.094s |
|
||||
| TTFT P90 | 11.90s | 13.06s | 2.64s | 6.90s | 0.26s |
|
||||
|
||||
✅ **可靠性大幅提升**:mooncake 传输超时 errors 从 9-10% 跌到 0.2%。D 真容量决策避免了 v4 那种"乐观 admit → 30s 后超时"的死亡链路。
|
||||
✅ reseed / turn1-seed 路径首次显式出现,证明 admission 端点对 seed 模式确实生效了。
|
||||
❌ **session-cap fallback 不降反升**(37→46% 与 35→51%)。说明 v4 的本地 soft_cap 实际上**比 D 真实容量更乐观**——admit 进来后转身就 OOM,统计成了 error 而不是 fallback。
|
||||
❌ 直接结果:**direct-to-D 占比下降、整体延迟全面变差**。P50/P90/P99 与 TTFT 都退步。
|
||||
|
||||
### Direct-to-D 子集还是稳的(KVC 真实价值仍在)
|
||||
|
||||
| Config | n | Lat P50 | Lat P90 | TTFT P50 | TTFT P90 |
|
||||
|--------|:---:|:---:|:---:|:---:|:---:|
|
||||
| baseline 8DP | 4381 | 0.66s | 3.65s | 0.094s | 0.256s |
|
||||
| v4 2P6D direct-to-D | 2348 | 0.499s | 2.86s | 0.043s | 0.054s |
|
||||
| **v5 1P7D direct-to-D** | 1990 | 0.475s | 3.04s | 0.043s | 0.055s |
|
||||
| **v5 2P6D direct-to-D** | 1837 | 0.483s | 3.04s | 0.043s | 0.054s |
|
||||
|
||||
direct-to-D 的尾延迟和 TTFT 与 v4 几乎完全一致(端点决策开销可忽略),**v5 的回退不是路径本身变慢,而是更多请求被赶到 fallback**。
|
||||
|
||||
### Fallback 路径反而比 v4 更糟
|
||||
|
||||
| Config | n | Lat P50 | Lat P90 | TTFT P50 |
|
||||
|--------|:---:|:---:|:---:|:---:|
|
||||
| v5 1P7D session-cap fallback | 2027 | 6.38s | 17.47s | 4.49s |
|
||||
| v5 2P6D session-cap fallback | 2253 | 3.13s | 11.25s | 0.89s |
|
||||
|
||||
由于 fallback 占比上升、且这条路径本身就比 direct-to-D 慢一个数量级,整体均值被拖累得更厉害。
|
||||
|
||||
### v5 真正暴露的瓶颈:D 的 KV pool 物理容量
|
||||
|
||||
把 admission 决策权交给 D 之后,瓶颈从"replay 估得太死"变成"D 真的装不下":
|
||||
|
||||
- 80GB H100 × `mem_fraction_static=0.835` → D 单卡 KV pool ≈ 100-150K tokens
|
||||
- agentic 长 context session 单 turn footprint 50-100K
|
||||
- 单 D 上能并存的 session 数量本就 2-3 个 → 7 个 D 装 50 session 基本不可能
|
||||
|
||||
v4 的 cap=16 之所以"看起来好",部分是因为本地 soft_cap 没真的查 D 的 free pool,开了一堆**最终会失败**的 session(统计成 errors 而非 fallback)。v5 把这部分洗成了"诚实的拒绝"——可靠性跃升的代价是看见了真实容量上限。
|
||||
|
||||
### v6 应该针对什么
|
||||
|
||||
把 D 物理容量管理打开,而不是再调 replay:
|
||||
|
||||
1. **prefill backup 提早 release**(已经加了 `release-after-transfer` 但可能还不够及时) → 让 P 上的 backup blocks 不要长期占用 KV pool。
|
||||
2. **priority eviction 策略调优**(已开 `--kvcache-prefill-priority-eviction`):当前 LRU 可能把 hot session 误踢;需要按 session 命中频率/最近访问做加权。
|
||||
3. **chunked / streamed seed**:不要一次 reserve 整个 prompt 的容量,按 chunk 分摊。
|
||||
4. **跨 D 的 session migration**:当一个 D 满了但隔壁 D 空时主动迁移,而不是直接 fallback 到 P。
|
||||
5. **真正的多机 RDMA**:单机 mooncake loopback 是 errors 的根因之一;上多机 + RDMA 才能让 prefill backup release 后的 KV transfer 真的稳。
|
||||
|
||||
工程量:1-3 是 SGLang 内部改 (`scheduler.py` + `session_controller.py`),4 需要 router 协议扩展,5 是部署变更。
|
||||
|
||||
## 关键文件与代码位置索引
|
||||
|
||||
| 现象 | 代码位置 |
|
||||
|------|----------|
|
||||
| Replay policy round-robin | `policies.py:63-67` `RoutingState.next_decode_worker_id` |
|
||||
| KV-aware policy(session 亲和) | `policies.py:146-187` `KvAwarePolicy.select` |
|
||||
| PD router decode 选择 | `pd_router.py:51-74` `_select_decode_index` |
|
||||
| Header 构建 | `replay.py:2407-2424` `_build_headers` |
|
||||
| Policy → router config 映射 | `benchmark.py:287-300` `_decode_policy_for/_header_mode_for` |
|
||||
| Session admission 软 cap | `replay.py:889-905` `_decode_session_soft_cap` |
|
||||
| 已有的 D 侧 admission 端点 | `scheduler.py:3497-3580` `admit_direct_append`(v5 扩展支持 `mode=seed`) |
|
||||
| Worker-mode admission 调用方 | `replay.py` reseed / turn1-seed / large-append-reseed 路径 |
|
||||
| Prefill backup 释放策略(v5 引入) | `--kvcache-prefill-backup-policy release-after-transfer` |
|
||||
| Prefill priority eviction(v5 引入) | `--kvcache-prefill-priority-eviction` |
|
||||
| Session 在 D 上找不到的报错 | `scheduler.py:1824-1836` |
|
||||
| `_should_allow_local_prefill_on_decode` | `scheduler.py:1975-1980` |
|
||||
| Reseed 流程入口 | `replay.py:1665-1809` `_invoke_kvcache_seeded_router` |
|
||||
| Direct-to-D 流程 | `replay.py:2351-2398` `_invoke_decode_session_direct` |
|
||||
|
||||
## 经验教训
|
||||
|
||||
1. **policy 和 mechanism 是两个正交维度**——`--policy default` 不是"无脑默认值",它真的是 round-robin 无 session 亲和性。KVC 机制必须配 session 亲和的 policy。
|
||||
|
||||
2. **不要无脑相信前一个 agent 的 RESULTS_SUMMARY**——v2 的诊断("local prefill bug")和实际 finish_reason("session id does not exist")完全对不上。任何错误诊断必须用 finish_reason、execution_mode 这些原始字段交叉验证。
|
||||
|
||||
3. **bimodal 分布是 starvation 的强信号**——v3 数据里某些 session 100% 走快路径、某些 100% 走慢路径,几乎肯定是某种"先到先得"的资源竞争。看到这种模式立刻去找硬编码 cap 或全局共享资源。
|
||||
|
||||
4. **测量要看分组而非整体均值**——v3 整体 P50=1.5s 看似比 baseline 慢,但拆开看 direct-to-D 子集 P50=0.495s 已经反超 baseline。整体均值被 fallback 路径拖累,但 KVC 的核心价值是真实存在的。
|
||||
|
||||
5. **errors 与 fallback 是同一类资源压力的两副面孔**——v4 的"低 fallback 率 + 高 error 率"不是更优解,是把容量超限的失败从"显式拒绝"伪装成"超时失败"。v5 把决策权交给真容量后,fallback 升、errors 降,这是更诚实的指标,不要被 v4 的 fallback 数字误导。当看到错误率和 fallback 率呈反相关时,要警惕 admission 决策是否在说谎。
|
||||
34
docs/archive/README.md
Normal file
34
docs/archive/README.md
Normal file
@@ -0,0 +1,34 @@
|
||||
# 归档文档说明
|
||||
|
||||
本目录保留项目历史阶段的过程文档。**新加入项目的 agent / 人员不需要阅读这些文档**,直接看 `docs/ONBOARDING_NEXT_AGENT_ZH.md` 即可。
|
||||
|
||||
保留它们的目的:
|
||||
1. 论文写作时追溯 v1-v5 调优演化过程
|
||||
2. 未来若回到 ts=10 高压区间或更大 trace 时,可参考当年的结构性问题诊断
|
||||
3. 满足学术可追溯性要求
|
||||
|
||||
## 每个文档的简要说明
|
||||
|
||||
| 文档 | 归档原因 | 何时回头看 |
|
||||
|---|---|---|
|
||||
| `AGENTIC_FIT_ANALYSIS_ZH.md` | ts=10 时代的 §1-§7 结构性问题分析;结论已被 ts=1 数据全面 supersede | 想知道当年 ts=10 下我们认为有什么结构性问题时 |
|
||||
| `STRUCTURAL_VALIDATION_REPORT_ZH.md` | 用 ts=10 数据对 AGENTIC_FIT_ANALYSIS 的 claim 做验证;同样被 ts=1 时代 supersede | 同上 |
|
||||
| `KVC_DEBUG_JOURNEY_V1_TO_V5.md` | v1-v5 5 个调优 sweep 的过程笔记;包含 errors 9→912 漂移、direct-to-D 占比变化等历史数据 | 写 paper 时要写 "as we explored configurations v1-v5..." 段落 |
|
||||
| `V5_PROFILE_INVESTIGATION_ZH.md` | 给 v5 加 1Hz polling instrumentation 的调查;让 errors 涨 46× 的现象记录 | 想理解 "admission RPC 干扰 scheduler 主循环" 这条 §5 残留风险时 |
|
||||
| `REFACTOR_PLAN_ZH.md` | v0 重构计划,**已被 `REFACTOR_PLAN_V1_ZH.md` supersede** | 不需要看;只有想看作者一开始的设想时翻一翻 |
|
||||
| `KVCACHE_CENTRIC_PROGRESS_ZH.md` | 项目最早期(2026-04-27)的进度记录;当时还没有完整的 sweep 数据 | 几乎不需要看;满足"项目起源记录"职能 |
|
||||
| `SWEBENCH_EXPERIMENT_PROGRESS.md` | SWE-Bench trace 早期实验进度记录 | 想知道当年的 trace 生成 / 采样配置时 |
|
||||
| `SWEBENCH_EXPERIMENT_RESULTS.md` | 同上,早期 result snapshot | 同上 |
|
||||
|
||||
## 当前活跃文档(在 `docs/` 顶层)
|
||||
|
||||
跳转去看:
|
||||
- `docs/ONBOARDING_NEXT_AGENT_ZH.md` — 新人上手手册
|
||||
- `docs/PROJECT_OVERVIEW.md` — 项目目标 + 术语
|
||||
- `docs/KVC_ROUTER_ALGORITHM.md` — 算法形式化
|
||||
- `docs/V2_DEEP_ANALYSIS_ZH.md` — v2 完整分析
|
||||
- `docs/V2_RESULTS_ZH.md` — v2 原始战报
|
||||
- `docs/REFACTOR_PLAN_V1_ZH.md` — ts=1 方向决策
|
||||
- `docs/MIGRATION_V1_FINDINGS_ZH.md` — v1 thrashing 诊断
|
||||
- `docs/RESEED_SLOW_PATH_AND_D_TO_P_GAP_ZH.md` — reseed 长尾 + D→P 缺口审计
|
||||
- `docs/TEAM_REPORT_AGENTIC_PD_HYBRID_ZH.md` — ts=10 时代的结构性问题清单(作为历史 baseline 仍在主目录)
|
||||
123
docs/archive/REFACTOR_PLAN_ZH.md
Normal file
123
docs/archive/REFACTOR_PLAN_ZH.md
Normal file
@@ -0,0 +1,123 @@
|
||||
# Refactor Plan v0:极简版
|
||||
|
||||
**日期**:2026-05-06
|
||||
**目标**:用最小改动 + 轻量实验,验证 `docs/AGENTIC_FIT_ANALYSIS_ZH.md` 提出的结构性缺陷是否真实存在、影响多大。
|
||||
**预算**:8h GPU 时间(约 4-6 次 ~30-60 min smoke run)。
|
||||
**KISS 边界**:不动 SGLang `scheduler.py` 主循环结构;不引入新 mooncake 协议;不实现 cross-D session migration;不做 admission probe/commit 拆分;不动 LRU eviction 策略。
|
||||
|
||||
## 计划结论(与用户已确认的)
|
||||
|
||||
回审 plan-v0 时发现两个原 Phase 1 改动**都不是 bug**:
|
||||
|
||||
- `_estimate_session_resident_tokens` 返回 full prompt 是设计如此——所有需要"增量"的 call site 都已经做 `target - current` 减法(`replay.py:1247-1254`、`:1393-1394`、`:1490-1491`)。
|
||||
- `decode_resident_blocks` 不缩减只是浪费几 MB 内存,**不影响 routing 决策**(SWE trace 的 hash_ids 是 session-unique,policy 仍能正确选 D)。
|
||||
|
||||
最终极简版只做一件代码改动(**加 backpressure**)+ 大量 instrumentation。
|
||||
|
||||
## 唯一代码改动:Backpressure 信号
|
||||
|
||||
### 改动点 1:SGLang `admit_direct_append` 响应增加两个字段
|
||||
|
||||
文件:`third_party/sglang/python/sglang/srt/managers/io_struct.py`、`scheduler.py`
|
||||
|
||||
```python
|
||||
@dataclass
|
||||
class DirectAppendAdmissionReqOutput:
|
||||
... # 已有字段保留
|
||||
recommended_pause_ms: int = 0 # 新增
|
||||
queue_depth: int = 0 # 新增
|
||||
```
|
||||
|
||||
`scheduler.py:admit_direct_append` 末尾计算 hint:
|
||||
|
||||
```python
|
||||
def _compute_backpressure_pause_hint(self) -> float:
|
||||
depth = len(self.disagg_decode_transfer_queue.queue)
|
||||
if depth < 8:
|
||||
return 0.0
|
||||
return min(2000.0, depth * 100.0) # 简单线性
|
||||
```
|
||||
|
||||
### 改动点 2:replay 端按 hint 退避
|
||||
|
||||
文件:`src/agentic_pd_hybrid/replay.py`
|
||||
|
||||
- `DecodeResidencyState` 新增 `pause_until_s: dict[str, float]`
|
||||
- `_query_decode_direct_admission` 解析响应里的 `recommended_pause_ms`,更新 `pause_until_s[server_url] = now + pause_ms / 1000`
|
||||
- 在调 `_invoke_router` / `_invoke_decode_session_direct` 前检查 `pause_until_s[decode_url]`,若 `now < pause_until` 则 sleep 到该时刻
|
||||
|
||||
### 改动点 3:新 CLI flag
|
||||
|
||||
`src/agentic_pd_hybrid/cli.py`、`benchmark.py`:
|
||||
|
||||
```
|
||||
--enable-backpressure # 默认 false,保留 baseline 行为
|
||||
```
|
||||
|
||||
### 改动点 4:观测日志
|
||||
|
||||
每个 run dir 新增三个 jsonl:
|
||||
|
||||
- `admission-events.jsonl`:每次 admission RPC(timestamp, session, D, can_admit, queue_depth, pause_ms, latency_s, available_tokens, evicted_session_count)
|
||||
- `backpressure-events.jsonl`:每次实际 sleep(timestamp, D, sleep_ms, queue_depth_at_signal)
|
||||
- `session-d-binding.jsonl`:每个 session 第一次 open 在某 D 时记录(timestamp, session, D, turn_id)
|
||||
|
||||
## 实验矩阵(8h 预算内)
|
||||
|
||||
按"先做 anchor,再做单变量对照"排序。每行右侧是预估机时。
|
||||
|
||||
| ID | 配置 | 目的 | 机时 |
|
||||
|---|---|---|---|
|
||||
| **E0 (existing)** | v5 baseline,time-scale=10,无 backpressure | Anchor,已存在 `outputs/qwen3-30b-tp1-v5-optD-baseline-rerun/run1` | 0 |
|
||||
| **E1** | v5 + backpressure ON,time-scale=10,全 trace | 验证 Claim §3(backpressure 是否能消除 KVTransferError 雪崩) | ~50 min |
|
||||
| **E2** | v5 baseline,time-scale=1,**短 trace**(前 12 sessions ≈ 1000 reqs) | 验证 Claim §7(time-scale=10 失真);不开 backpressure | ~60 min |
|
||||
| **E3** | 8DP CA,time-scale=1,同 E2 trace | E2 的对照——真实时序下 KVC 是否仍输 DP | ~60 min |
|
||||
| **E4** | v5 + backpressure,time-scale=1,同 E2 trace | backpressure 在真实时序下还有用吗? | ~60 min |
|
||||
| **E5**(备选) | v5 baseline,time-scale=10,**concurrency=4**,全 trace | 验证 Claim §1(高并发是不是必要条件) | ~50 min |
|
||||
|
||||
总:4-5 个 run,~3-5h。剩余预算给失败重跑/分析。
|
||||
|
||||
## 实验目标——回到 §1-§7 一一对照
|
||||
|
||||
| 文档 § | Claim | 由哪个 exp 证伪/支持 | 需要的指标 |
|
||||
|---|---|---|---|
|
||||
| §1 | Session 永久 pin + 容量盲选造成双峰 | 已有 E0 数据足够 | direct-to-D rate per session distribution |
|
||||
| §2 | LRU 跟不上压力 | 已有 E0 logs 足够 + E1 看 backpressure 之后 trim/error 比例变化 | trim 事件数 vs OOM 数 |
|
||||
| §3 | 没 backpressure 是雪崩源 | E0 vs E1 | KVTransferError 数、P99 latency |
|
||||
| §4 | admission RPC 干扰 scheduler | 不在本轮实验范围(需要 admission probe 拆分才能验,不做) | – |
|
||||
| §5 | P-side 不感知 D 健康 | 已有 E0 logs 足够(prefill-0 vs prefill-1 错误数) | per-P KVTransferError |
|
||||
| §6 | (已撤回) | – | – |
|
||||
| §7 | time-scale=10 失真 | E0 vs E2(同 KVC,不同 time-scale);E2 vs E3(同 time-scale,KVC vs DP) | latency 分布、direct-to-D rate |
|
||||
|
||||
## Final 实验报告交付
|
||||
|
||||
跑完后输出 `docs/STRUCTURAL_VALIDATION_REPORT_ZH.md`,按 §1-§7 每条给出:
|
||||
|
||||
- **Claim 字面**
|
||||
- **数据证据**(哪个 exp、哪个 metric)
|
||||
- **结论**:成立 / 部分成立 / 推翻
|
||||
- **影响量化**:数字差异
|
||||
- **不确定性**:N=1 风险、其他 confounder
|
||||
|
||||
## 不做的事(KISS 边界)
|
||||
|
||||
| 想做但不做 | 理由 |
|
||||
|---|---|
|
||||
| 跑 N=3 重复 | 8h 装不下;single-run 可看大方向 |
|
||||
| 全 sweep 参数 | 只调 time-scale 和 backpressure 一个 boolean |
|
||||
| 改 LRU eviction | 不在本轮范围 |
|
||||
| Cross-D migration | 不在本轮范围 |
|
||||
| Admission probe/commit 拆分 | 不在本轮范围 |
|
||||
| P-side D-health routing | 不在本轮范围 |
|
||||
| 修两个"非 bug"(estimate / aging) | 验证后非真实 bug |
|
||||
|
||||
## 预期失败路径
|
||||
|
||||
- **GPU 资源紧张**:smoke trace 进一步压缩(前 8 sessions / 600 reqs)
|
||||
- **time-scale=1 跑超 1.5h**:截断到 600s 内能完成的部分
|
||||
- **backpressure 配错**:先用 sleep_ms = depth * 100 简单线性;调不通就回滚到 0(无 backpressure)
|
||||
- **SGLang patch 编译错**:所有 patch 在 io_struct.py 和 scheduler.py 的少量行内,可单独 git restore
|
||||
|
||||
---
|
||||
|
||||
接下来:实现 → 跑 smoke → 写报告。
|
||||
304
docs/archive/STRUCTURAL_VALIDATION_REPORT_ZH.md
Normal file
304
docs/archive/STRUCTURAL_VALIDATION_REPORT_ZH.md
Normal file
@@ -0,0 +1,304 @@
|
||||
# 结构性缺陷验证报告
|
||||
|
||||
**日期**:2026-05-06
|
||||
**对照数据源**:
|
||||
- `outputs/qwen3-30b-tp1-v5-optD-baseline-rerun/`(v5 KVC kv-aware Option D,2P6D,**3 次同配置 rerun**)
|
||||
- `outputs/qwen3-30b-tp1-exps/exp1_8way_dp_cache_aware_summary.json`(同 trace 8DP CA)
|
||||
- `outputs/qwen3-30b-tp1-v5-optD-baseline-rerun/.../logs/decode-{0..5}.log`、`prefill-{0,1}.log`
|
||||
**模型**:Qwen3-30B-A3B(TP1),单机 8×H100 80GB,trace `qwen35-swebench-50sess.jsonl`(4449 reqs / 52 sessions)。
|
||||
**报告作用域**:验证 `docs/AGENTIC_FIT_ANALYSIS_ZH.md` §1-§7 提出的结构性 claim 是否真实存在;量化影响。
|
||||
|
||||
> ⚠️ **环境限制**:本轮缺 GPU 访问,未跑新 sweep。所有数据来自已存在的 v5 rerun + 8DP baseline。Backpressure 代码已实现但**未端到端验证**——下文标注为"预期收益(pending GPU smoke)"。
|
||||
|
||||
---
|
||||
|
||||
## 0. 实验有效性锚点:N=1 不可信
|
||||
|
||||
3 次 v5 baseline EXP2(**完全相同配置**)的 errors 漂移:
|
||||
|
||||
| Run | Errors | Lat P50 | Lat P90 | TTFT P50 |
|
||||
|---|---:|---:|---:|---:|
|
||||
| run1 | **372** | 1.11s | 8.65s | 0.147s |
|
||||
| run2 | **912** | 0.94s | 7.68s | 0.071s |
|
||||
| run3 | **396** | 1.22s | 8.43s | 0.183s |
|
||||
|
||||
errors 漂移 **2.5×**(372 → 912),P50 latency 漂移 **30%**。**任何 N=1 比较 < 30% 差异都不可信。** 后续所有"同 trace 不同配置 / 不同代码"的对比,都需要 N≥3 才有意义。
|
||||
|
||||
**对 KVC vs DP 的 headline 数据,3 次 KVC 的最佳值(P50=0.94s)仍然是 DP(P50=0.65s)的 1.45×**——8 way DP 的优势远超 single-run variance 范围,这一头条结论不受 variance 影响。
|
||||
|
||||
---
|
||||
|
||||
## §1. Session 永久 pin 到 D + 容量盲选 → 极端双峰 ✅ 完全成立
|
||||
|
||||
### Claim
|
||||
KvAwarePolicy 评分以 hash overlap 为主,没有 D 容量项。Session 第一次落到某 D 后被永久 pin。导致大 session 在已满 D 上反复 admission 拒绝,小 session 在原 D 上 100% 走 direct-to-D。
|
||||
|
||||
### 数据
|
||||
|
||||
**(a) Session 永久绑定,跨 3 次 rerun 一致**:
|
||||
|
||||
```
|
||||
run1: 52 sessions, avg distinct-D-per-session = 1.00
|
||||
run2: 52 sessions, avg distinct-D-per-session = 1.00
|
||||
run3: 52 sessions, avg distinct-D-per-session = 1.00
|
||||
```
|
||||
|
||||
每个 session 在整个运行中只访问 **1 个** D worker,3 次独立 run 完全一致。**不是巧合,是结构。**
|
||||
|
||||
**(b) Direct-to-D 命中率呈极端双峰**:
|
||||
|
||||
| Direct-to-D rate | run1 | run2 | run3 |
|
||||
|---|---:|---:|---:|
|
||||
| 0-20%(饿死) | 15 | 18 | 16 |
|
||||
| 20-40% | 7 | 6 | 7 |
|
||||
| 40-60% | 11 | 7 | 9 |
|
||||
| 60-80% | 5 | 6 | 4 |
|
||||
| 80-100%(顺利) | 14 | 15 | 16 |
|
||||
|
||||
中间态稀少,两端拥挤。
|
||||
|
||||
**(c) 跨 3 次 run 一致饿死的 session 与 session 大小强相关**:
|
||||
|
||||
```
|
||||
13 sessions starved (<20% direct-to-D) in ALL 3 runs.
|
||||
avg peak input of consistently-starved sessions: 62043 tokens
|
||||
avg peak input of consistently-lucky sessions: 31344 tokens
|
||||
ratio: 1.98× — starved sessions are exactly 2× larger.
|
||||
```
|
||||
|
||||
**13/52 = 25% 的 session 在 3 次独立 run 中都被饿死,且这些 session 的 peak input 恰好是顺利 session 的 2 倍。** 这排除了"运气"假说,证实是大 session 在容量过载 D 上结构性失败。
|
||||
|
||||
### 影响量化
|
||||
- 25% session 几乎每个 turn 都走 fallback 路径,相对 direct-to-D **TTFT 慢 100×、E2E 慢 6×**(数据点:fallback path mean lat ~3.5s vs direct ~0.5s)
|
||||
- 对应这些 session 的用户体验是"系统性糟糕",而不是"偶尔慢"
|
||||
- **SLO 视角下 P99 完全由这 13 个 session 拉高**
|
||||
|
||||
### 结论
|
||||
**完全成立**。修复方向(不在本轮):policy score 加 capacity penalty + 允许 session 跨 D 迁移,或 D 端引入 hot session retract。
|
||||
|
||||
---
|
||||
|
||||
## §2. D 端 LRU 只 evict idle session → 跟不上压力 ✅ 完全成立
|
||||
|
||||
### Claim
|
||||
`scheduler.py:2040` 的 `evict_idle_streaming_sessions_lru` 只能 evict "所有 req 都 finished + streaming 模式"的 session。高并发下 hot session 永远不 idle,LRU 找不到东西可踢。结果 D 顶到 100% 然后撞 mooncake transfer timeout。
|
||||
|
||||
### 数据(v5 baseline rerun run1)
|
||||
|
||||
| D worker | Trim 事件 | KVTransferError | 峰值 token_usage |
|
||||
|---|---:|---:|---:|
|
||||
| decode-0 | 9 | 0 | 0.99 |
|
||||
| decode-1 | 43 | 4 | 0.99 |
|
||||
| decode-2 | 16 | 153 | 0.97 |
|
||||
| decode-3 | 37 | 29 | 0.99 |
|
||||
| decode-4 | 28 | 90 | **1.00** |
|
||||
| decode-5 | 30 | 93 | **1.00** |
|
||||
|
||||
**6 个 D 全部峰值 ≥ 0.97**,其中 2 个直接顶到 1.00(KV 池完全耗尽)。**LRU 触发 9-43 次,远不及 transfer 错误的 90-153 次。**
|
||||
|
||||
decode-2 极端:trim 16 次 vs error 153 次 = LRU 比错误慢 **9.5×**。
|
||||
|
||||
### 影响量化
|
||||
- 单 run 累计 369 KVTransferError(总 6 个 D 之和)
|
||||
- 对应 ~8% 的请求失败率(v5 errors 9/372/912 三次平均 ~430/4449 = 9.7%)
|
||||
- **每次 mooncake timeout 是 32s**——对 P99 latency 直接贡献几十秒尾巴
|
||||
|
||||
### 结论
|
||||
**完全成立**。修复方向(不在本轮):分层 eviction——除 idle 外加冷 session retract、按访问频率/时序加权。Backpressure(本轮代码)只是把"D 满"的雪崩从"timeout 错误"转成"主动等待",**不是真正解决容量问题**。
|
||||
|
||||
---
|
||||
|
||||
## §3. 没有 D→Replay backpressure 通道 ✅ 成立(已实现修复)
|
||||
|
||||
### Claim
|
||||
D 端 transfer queue 堆 → 32s timeout → KVTransferError,没有"D 过载请慢点"信号反向到 replay;concurrency 一直 32 不降。
|
||||
|
||||
### 数据
|
||||
- §2 的 369 KVTransferError 全部为 32s mooncake timeout(日志中均为 `Failed to send kv chunk` 或 `Decode instance could be dead`)
|
||||
- 错误集中在运行后半段(按现有 `KVC_DEBUG_JOURNEY_V1_TO_V5.md` §v4:错误均在 run 的 44.8% 之后开始累积)
|
||||
- 表明:**前期 D 容量充裕时正常,达到容量上限后所有后续请求集中失败**——典型无 backpressure 系统行为
|
||||
|
||||
### 修复(本轮已实现,待 GPU smoke 验证)
|
||||
|
||||
代码改动:
|
||||
1. `third_party/sglang/python/sglang/srt/managers/io_struct.py`:`DirectAppendAdmissionReqOutput` 增加 `recommended_pause_ms` 字段
|
||||
2. `third_party/sglang/python/sglang/srt/managers/scheduler.py:admit_direct_append`:基于 `transfer_queue_depth`、`retracted_queue_depth`、`token_usage_after` 计算 hint
|
||||
```python
|
||||
def _compute_backpressure_pause_hint(...):
|
||||
if retracted_queue_depth > 0: return 1500
|
||||
if token_usage_after >= 0.90: return max(200, min(2000, overshoot * 5))
|
||||
if transfer_queue_depth >= 8: return min(2000, transfer_queue_depth * 100)
|
||||
return 0
|
||||
```
|
||||
3. `src/agentic_pd_hybrid/replay.py`:
|
||||
- `DecodeResidencyState.pause_until_s: dict[str, float]`
|
||||
- `_query_decode_direct_admission` 解析 hint 更新 `pause_until_s`
|
||||
- 新增 `_wait_for_decode_pause`,在 `_invoke_router` / `_invoke_session_direct` 入口检查
|
||||
4. CLI flag:`--enable-backpressure`、`--backpressure-max-pause-s 2.0`(默认关闭)
|
||||
5. 结构性日志:`structural/admission-events.jsonl`、`backpressure-events.jsonl`、`session-d-binding.jsonl`
|
||||
|
||||
### 预期收益(pending GPU smoke E2 vs E1)
|
||||
- KVTransferError 应从 ~370 / 4449 跌到 < 50 / 4449
|
||||
- P99 应改善(消除 32s timeout 尾巴)
|
||||
- 整体 latency mean 可能**略升**(被强制 pause),但 P99 应大幅降
|
||||
- backpressure-events.jsonl 应显示 D-4 / D-5 累积大量 pause 事件(与 §2 数据吻合)
|
||||
|
||||
### 结论
|
||||
**Claim 成立;修复已实现,待 smoke 验证**。注意:backpressure 是**降级**机制,不是性能优化——它把"硬错误"换成"主动等待",整体 throughput 不会因此提升。
|
||||
|
||||
---
|
||||
|
||||
## §4. Admission RPC 与 scheduler 主循环耦合 ⚠️ 间接证据,本轮未直接验证
|
||||
|
||||
### Claim
|
||||
`admit_direct_append` 进 scheduler 主循环遍历 session slot,admission RPC 频率 16+/s 时与 decode 抢调度。
|
||||
|
||||
### 现有间接证据
|
||||
- `docs/V5_PROFILE_INVESTIGATION_ZH.md`:仅加 1Hz `/server_info` polling 就让 EXP2 errors 从 9 涨到 415(46×);但 v6 P0 三次 baseline 不开 polling 同样得到 372/912/396——**polling 不是唯一原因,主循环负载本身就敏感**。
|
||||
|
||||
### 本轮未做
|
||||
- 没有"admission probe 拆 fast/slow"的对照实验。需要 SGLang 较深的改动(提供 lock-free snapshot),不在 KISS 边界。
|
||||
|
||||
### 结论
|
||||
**Claim 间接成立,本轮未直接验证**。Backpressure 实现里 admission RPC 的频率没有变(仍每个 turn 一次),只是结果会触发 sleep。如果这条 claim 成立,加 backpressure 后 admission RPC 数量大致不变但每次响应里的 `pause_ms` 会非零——**新增的 admission-events.jsonl 可在 GPU smoke 后用来直接验证此现象**。
|
||||
|
||||
---
|
||||
|
||||
## §5. P-side round-robin 不感知 D 健康 ✅ 成立
|
||||
|
||||
### Claim
|
||||
`pd_router.py:_select_decode_index` 是裸 round-robin。任一 P 撞到 hot D 时反复失败,另一 P 完全不受影响。
|
||||
|
||||
### 数据(v5 baseline rerun run1)
|
||||
|
||||
| Worker | KVTransferError | "Decode could be dead" |
|
||||
|---|---:|---:|
|
||||
| prefill-0 | **367** | 361 |
|
||||
| prefill-1 | **2** | 0 |
|
||||
|
||||
prefill-0 的请求量从 summary 看是 2225 vs prefill-1 的 2224——**请求量近乎对半,错误率差 180×**。
|
||||
|
||||
### 影响量化
|
||||
- 失败请求集中在 P-0 → 某个 hot D 的链路上(日志中反复出现 `to 10.45.80.47:XXXXX`)
|
||||
- 单 P 的"死亡链路"贡献了 **99%** 的全部 KVTransferError
|
||||
- 如果 P 选择能避开"正在和 hot D 死磕"的链路,**理论上可消除单 P 故障的雪崩效应**
|
||||
|
||||
### 备注
|
||||
- 此现象**未在 v6 P0 的 3 次 rerun 中横向验证**——只有 run1 的日志可读。需要在新 sweep 的 prefill-{0,1}.log 上重复确认,避免 N=1 嫌疑。
|
||||
|
||||
### 结论
|
||||
**单 run 数据成立,多 run 一致性未验证**。修复方向(不在本轮):router 选 P 时考虑 (P 当前 inflight transfer 数, 目标 D 健康度)。
|
||||
|
||||
---
|
||||
|
||||
## §6. (已撤回)Replay 端 session footprint 估算膨胀
|
||||
|
||||
写计划时仔细看代码后撤回——`_estimate_session_resident_tokens` 返回 full prompt,但所有需要"增量"的 call site (`replay.py:1247-1254`、`:1393-1394`、`:1490-1491`) 都已用 `target - current` 减法处理。**不是 bug**。
|
||||
|
||||
---
|
||||
|
||||
## §7. time-scale=10 把 inter-turn gap 压到 1/10 ✅ 完全成立
|
||||
|
||||
### 数据
|
||||
|
||||
```
|
||||
原始 trace inter-turn gap (n=4397):
|
||||
p10=1.6s p50=2.5s p90=7.8s p99=25.1s max=261s
|
||||
|
||||
time-scale=10 实际 replay gap:
|
||||
p10=0.16s p50=0.25s p90=0.78s p99=2.5s max=26s
|
||||
```
|
||||
|
||||
真实 agentic 用户/agent 在 turn 之间停 2-8 秒(思考、打字、tool call、agent reasoning)。time-scale=10 把这些窗口压到 0.16-0.78 秒——**人为消除了 D 的自然 idle 时间**,正好是 KVC 想利用的"session 短暂 idle 时 LRU 可以 evict、其他 session 可以 admit"机会。
|
||||
|
||||
### 测量学影响
|
||||
- 所有 v3-v6 数据基于 time-scale=10
|
||||
- 意味着所有"KVC 在 SWE 上输给 baseline"的结论**可能被 benchmark 放大了**
|
||||
- §1 的 25% session 永久饿死现象,在 time-scale=1 下可能因为 D 有更多 drain 时间而显著缓解
|
||||
|
||||
### 本轮未做
|
||||
- 没跑 time-scale=1 baseline。这是项目当前**最重要但缺失的验证**。
|
||||
- Smoke sweep 脚本(`scripts/sweep_backpressure_smoke.sh`)E3、E4 包含了 time-scale=1 的 KVC + DP 短 trace 对比,等 GPU 时跑。
|
||||
|
||||
### 结论
|
||||
**Claim 完全成立;time-scale=1 验证为 P0 待办**。
|
||||
|
||||
---
|
||||
|
||||
## 头条对比(同 trace、同硬件)
|
||||
|
||||
```
|
||||
8-way DP cache-aware (TP1):
|
||||
errors= 0 | latency mean=1.426s p50=0.654s p90=3.609s
|
||||
| TTFT mean=0.123s p50=0.093s p90=0.256s
|
||||
|
||||
KVC v5 2P6D (3 reruns, no polling):
|
||||
run1: errors=372 | mean=3.50s p50=1.11s p90=8.65s | TTFT mean=2.13s
|
||||
run2: errors=912 | mean=3.00s p50=0.94s p90=7.68s | TTFT mean=1.64s
|
||||
run3: errors=396 | mean=3.42s p50=1.22s p90=8.43s | TTFT mean=2.07s
|
||||
```
|
||||
|
||||
KVC 三次 run 全输 DP,且差距远超 single-run variance:
|
||||
- Latency mean:DP 优 **+110%**(KVC 平均 3.30s vs DP 1.43s)
|
||||
- Latency P50:DP 优 **+65%**(KVC 平均 1.09s vs DP 0.65s)
|
||||
- TTFT mean:DP 优 **+1500%**(KVC 平均 1.95s vs DP 0.12s——慢 17×!)
|
||||
- Errors:DP 0 vs KVC 平均 ~560
|
||||
|
||||
**这是这个项目当前最严肃的事实**——所有 KVC 复杂度回报为负。
|
||||
|
||||
---
|
||||
|
||||
## 综合结论
|
||||
|
||||
按"是否结构性 + 影响大小"的二维分类:
|
||||
|
||||
| Claim | 结构性 | 影响 | 本轮验证 | 修复(KISS 内) | 修复(KISS 外) |
|
||||
|---|---|---|---|---|---|
|
||||
| §1 Session pin + 容量盲选 | 强 | 大(25% session 饿死) | ✅ 3 run 一致 | ❌ | capacity-aware policy + 跨 D 迁移 |
|
||||
| §2 LRU 跟不上 | 强 | 大(每次 ~370 KVTransferError) | ✅ 6 D 数据 | ❌ | 分层 eviction、hot retract |
|
||||
| §3 无 backpressure | 强 | 中-大(消除 32s timeout 雪崩) | ⚠️ 已实现,待 smoke | ✅ **本轮交付** | – |
|
||||
| §4 admission RPC 干扰 | 弱-中 | 中 | ⚠️ 间接 | ❌ | probe / commit_evict 拆分 |
|
||||
| §5 P-side 不感知 D 健康 | 中 | 中(单 P 错误率差 180×) | ✅ N=1,需 N≥3 复核 | ❌ | router P 选择带 D 健康反馈 |
|
||||
| §6 estimate 膨胀 | – | – | ❌ 已撤回 | – | – |
|
||||
| §7 time-scale=10 失真 | 强(测量学) | 大(可能颠覆所有 KVC vs DP 结论) | ✅ 数据明确 | ✅ 改 flag | – |
|
||||
|
||||
### 最关键的两个 takeaway
|
||||
|
||||
1. **§7 time-scale=1 是当前项目所有结论的前置依赖**——必须先做。如果 time-scale=1 下 KVC 与 DP 接近,前面所有 v3-v6 的"KVC 输得彻底"诊断都需要重新解读。
|
||||
2. **§1 + §2 是双胞胎结构性问题**——session 被永久 pin 在某个 D + D 不能 evict 已满 = 大 session 永久卡死。任何不动 policy + 不动 LRU 的修复(包括本轮的 backpressure)只能让症状好看,不能消除根因。
|
||||
|
||||
---
|
||||
|
||||
## 本轮代码改动汇总(git diff 范围)
|
||||
|
||||
```
|
||||
src/agentic_pd_hybrid/replay.py # +结构性日志 + backpressure pause 检查 + admission 增强
|
||||
src/agentic_pd_hybrid/cli.py # +CLI flags
|
||||
src/agentic_pd_hybrid/benchmark.py # +CLI flags 透传
|
||||
third_party/sglang/python/sglang/srt/managers/io_struct.py
|
||||
third_party/sglang/python/sglang/srt/managers/scheduler.py
|
||||
# +recommended_pause_ms 字段 + hint 计算
|
||||
scripts/sweep_backpressure_smoke.sh # 4-run smoke sweep(待 GPU 跑)
|
||||
scripts/analysis/analyze_backpressure_smoke.py
|
||||
# 配套分析器
|
||||
docs/REFACTOR_PLAN_ZH.md # 计划文档
|
||||
docs/STRUCTURAL_VALIDATION_REPORT_ZH.md
|
||||
# 本报告
|
||||
```
|
||||
|
||||
代码默认行为**不变**(`enable_backpressure=False`)——所有现有脚本/配置无影响。
|
||||
|
||||
---
|
||||
|
||||
## 待 GPU 时执行
|
||||
|
||||
```bash
|
||||
bash scripts/sweep_backpressure_smoke.sh
|
||||
python3 scripts/analysis/analyze_backpressure_smoke.py outputs/sweep_backpressure_smoke
|
||||
```
|
||||
|
||||
预算:4 个 run × 30-60 min ≈ 3-4h GPU 时间。
|
||||
|
||||
按 §3 的预期:E2 (KVC + backpressure) 相对 E1 (KVC baseline) 应有 errors 降 70%+;P99 改善;TTFT P50 持平或略升。E3 (KVC + backpressure @ time-scale=1) vs E4 (DP @ time-scale=1) 是验证 §7 的关键对照。
|
||||
|
||||
如果 E2 vs E1 的 errors 没有显著下降,说明 backpressure hint 公式调得不对(`_compute_backpressure_pause_hint` 阈值可调),或 §3 实际不是雪崩主因(更可能是 §2 D-side LRU 才是)。
|
||||
95
docs/archive/SWEBENCH_EXPERIMENT_PROGRESS.md
Normal file
95
docs/archive/SWEBENCH_EXPERIMENT_PROGRESS.md
Normal file
@@ -0,0 +1,95 @@
|
||||
# SWE-Bench PD Hybrid Experiment Progress
|
||||
|
||||
## 实验目标
|
||||
|
||||
在单节点 8xH100 上复现 agentic-pd-hybrid 三种 serving mechanism,对比 Qwen3.5-35B-A3B 在 SWE-Bench 500 instance agentic trajectory 上的性能。
|
||||
|
||||
## 硬件环境
|
||||
|
||||
- 8x H100 80GB (NVLink 互联, 2 NUMA nodes: GPU 0-3 / GPU 4-7)
|
||||
- 无 RDMA/IB 设备
|
||||
- Transfer backend: **mooncake TCP** (nixl UCX 因 pip 包缺少 CUDA 支持导致 segfault,已放弃)
|
||||
|
||||
## 实验矩阵
|
||||
|
||||
| 实验 | Mechanism | Workers | GPU 分配 | Router | Policy |
|
||||
|------|-----------|---------|----------|--------|--------|
|
||||
| A | pd-disaggregation | 1P + 1D (TP4 each) | P: 0-3, D: 4-7 | Yes | default |
|
||||
| B | pd-colo | 2 direct (TP4 each) | D0: 0-3, D1: 4-7 | No | default |
|
||||
| C | kvcache-centric | 1P + 1D (TP4 each) | P: 0-3, D: 4-7 | Yes | default |
|
||||
|
||||
## 测试负载
|
||||
|
||||
- 源数据: `simm-swe-bench/outputs/20260416-205833-hicache-qwen35-verified-0-500/audit.jsonl`
|
||||
- 39,417 lines (turns), 497 unique instances (sessions)
|
||||
- 每个 instance 8-150 turns (均值 79.3)
|
||||
- 转换为 agentic-pd-hybrid trace 格式: `outputs/qwen35-swebench-500.jsonl`
|
||||
|
||||
## 关键发现
|
||||
|
||||
### Transfer Backend 选择
|
||||
|
||||
- **nixl (UCX)**: pip 安装的 nixl_cu12 包自带的 UCX 库没有 CUDA 支持,导致 GPU memory registration 时 segfault。系统 UCX (/opt/hpcx/ucx) 有 CUDA 支持但因 RPATH 无法被 NIXL 使用。
|
||||
- **mooncake (TCP)**: 可用。需要两处修改:
|
||||
1. `third_party/sglang/.../mooncake_transfer_engine.py`: 从环境变量 `MOONCAKE_PROTOCOL` 读取协议,而非硬编码 `"rdma"`
|
||||
2. `src/agentic_pd_hybrid/stack.py`: 当 `transfer_backend == "mooncake"` 且非 `force_rdma` 时,自动设置 `MOONCAKE_PROTOCOL=tcp`
|
||||
|
||||
### 代码修改记录
|
||||
|
||||
1. **`third_party/sglang/python/sglang/srt/distributed/device_communicators/mooncake_transfer_engine.py`**
|
||||
- 将 `"rdma"` 硬编码改为 `os.environ.get("MOONCAKE_PROTOCOL", "rdma")`
|
||||
|
||||
2. **`src/agentic_pd_hybrid/stack.py`**
|
||||
- 在 `_build_process_env()` 中添加: mooncake 非 force_rdma 时默认设置 `MOONCAKE_PROTOCOL=tcp`
|
||||
|
||||
3. **`scripts/convert_audit_to_trace.py`** (新建)
|
||||
- 将 sibench audit.jsonl 转换为 agentic-pd-hybrid trace 格式
|
||||
|
||||
## 实验进度
|
||||
|
||||
- [x] Step 0: 环境准备 (uv sync, nixl/mooncake 安装)
|
||||
- [x] Step 1: Trace 格式转换 (39,417 lines 验证通过)
|
||||
- [x] Step 2: Smoke test (pd-disaggregation, mooncake TCP, 100 requests) — **通过**
|
||||
- 100/100 requests, 0 errors
|
||||
- Mean latency: 1.53s, P50: 0.77s, P90: 2.82s
|
||||
- TTFT: mean 0.49s, P50 0.29s; TPOT: mean 4.7ms
|
||||
- 91/100 cache hits
|
||||
- [x] Step 3a: 实验 A 全量尝试 (39K reqs, 497 sessions) — **中止**
|
||||
- Run dir: `outputs/swebench-exps/pd-disaggregation-default-20260426T171113Z` (无metrics,被kill)
|
||||
- 前 90% 完成 ~80min (~8-10 req/s), 但尾部 D 侧 KV cache 98% 饱和
|
||||
- 497 并发 session 争抢 D 侧 token 空间, mamba 80-93 sessions 无法 drain
|
||||
- **教训**: 1P+1D (TP4) 无法支撑 497 并发 session, 需减少 session 数量或降低 concurrency
|
||||
- [x] Step 3b: 实验 A — pd-disaggregation (52 sessions, 4449 reqs, concurrency=32) — **完成**
|
||||
- Run dir: `outputs/swebench-exps/pd-disaggregation-default-20260426T202540Z`
|
||||
- Trace: `outputs/qwen35-swebench-50sess.jsonl` (10% sample, 52 sessions)
|
||||
- **结果**: 4449/4449 成功, 0 errors
|
||||
- Latency: mean=1.66s, P50=0.97s, P90=3.64s, P99=7.68s
|
||||
- TTFT: mean=0.45s, P50=0.34s, P90=0.88s
|
||||
- TPOT: mean=5.2ms, P50=5.2ms
|
||||
- Cache hit: 4199/4449 (94.4%)
|
||||
- [x] Step 4: 实验 B — pd-colo — **失败: SGLang bug**
|
||||
- Run dir: `outputs/swebench-exps/pd-colo-default-20260426T210129Z`
|
||||
- **Bug**: `--disaggregation-mode null` (colocation) 下 Qwen3.5-35B-A3B 模型触发 token_to_kv_pool_allocator 内存泄漏
|
||||
- 错误: `ValueError: token_to_kv_pool_allocator memory leak detected!`
|
||||
- 两个 direct worker 在处理 ~5 个请求后均 crash (Scheduler exception)
|
||||
- **结论**: 当前 vendored SGLang v0.5.10 不支持 Qwen3.5-35B-A3B 的 colocation 模式
|
||||
- [x] Step 5: 实验 C — kvcache-centric — **完成 (高错误率)**
|
||||
- Run dir: `outputs/swebench-exps/kvcache-centric-default-worker-admission-20260426T210800Z`
|
||||
- 4390/4449 errors (98.7%) — admission control 过于保守
|
||||
- 59 成功请求: mean latency 1.24s (比 pd-disagg 快 25%), TTFT 0.18s (快 60%)
|
||||
- 详细分析见 `docs/SWEBENCH_EXPERIMENT_RESULTS.md`
|
||||
- [x] Step 6: 结果对比分析 — **完成**
|
||||
- 完整报告: `docs/SWEBENCH_EXPERIMENT_RESULTS.md`
|
||||
|
||||
## 启动脚本
|
||||
|
||||
- `scripts/run_exp_a_pd_disagg.sh` — 实验 A
|
||||
- `scripts/run_exp_b_pd_colo.sh` — 实验 B
|
||||
- `scripts/run_exp_c_kvcache_centric.sh` — 实验 C
|
||||
- `scripts/convert_audit_to_trace.py` — Trace 转换
|
||||
|
||||
## 已知风险
|
||||
|
||||
1. Qwen3.5-35B-A3B TP4 可用 mem ~12GB/GPU (after model + CUDA graph),长 session (150 turns) 可能 OOM
|
||||
2. mooncake TCP loopback 延迟远低于真实跨机,结果偏乐观
|
||||
3. 原始 trace 时间跨度 ~6000s,全量回放非常耗时
|
||||
121
docs/archive/SWEBENCH_EXPERIMENT_RESULTS.md
Normal file
121
docs/archive/SWEBENCH_EXPERIMENT_RESULTS.md
Normal file
@@ -0,0 +1,121 @@
|
||||
# SWE-Bench PD Hybrid Experiment Results
|
||||
|
||||
## 实验配置
|
||||
|
||||
- **模型**: Qwen3.5-35B-A3B (MoE, 35B total / 3B active), TP4
|
||||
- **硬件**: 8x H100 80GB, NVLink, 单节点
|
||||
- **Transfer backend**: mooncake TCP (loopback)
|
||||
- **Trace**: 52 sessions, 4,449 requests (10% sample of SWE-Bench 500 instances)
|
||||
- **时间压缩**: time-scale=10, concurrency-limit=32
|
||||
|
||||
## 结果汇总
|
||||
|
||||
### Experiment A: pd-disaggregation (baseline)
|
||||
|
||||
| Metric | Value |
|
||||
|--------|-------|
|
||||
| Run dir | `pd-disaggregation-default-20260426T202540Z` |
|
||||
| Requests | 4,449 / 4,449 (100%) |
|
||||
| Errors | 0 |
|
||||
| **Mean Latency** | **1.662s** |
|
||||
| P50 Latency | 0.973s |
|
||||
| P90 Latency | 3.644s |
|
||||
| P99 Latency | 7.676s |
|
||||
| Mean TTFT | 0.445s |
|
||||
| P50 TTFT | 0.340s |
|
||||
| P90 TTFT | 0.880s |
|
||||
| Mean TPOT | 5.20ms |
|
||||
| Cache Hit Rate | 94.4% (4199/4449) |
|
||||
| Mean Cached Tokens | 27,794 |
|
||||
| KV Transfer Blocks | 105,235 |
|
||||
|
||||
### Experiment B: pd-colo (colocation) — FAILED
|
||||
|
||||
| Metric | Value |
|
||||
|--------|-------|
|
||||
| Run dir | `pd-colo-default-20260426T210129Z` |
|
||||
| Status | **CRASHED** |
|
||||
| Error | `token_to_kv_pool_allocator memory leak detected!` |
|
||||
| Root Cause | SGLang v0.5.10 `--disaggregation-mode null` 与 Qwen3.5-35B-A3B (Mamba/GDN hybrid) 不兼容 |
|
||||
| Requests | ~10 / 4,449 (0.2%) |
|
||||
|
||||
**结论**: 当前 vendored SGLang 不支持此模型的 colocation 模式。需要修复 token_to_kv_pool_allocator 中 Mamba 模型的内存管理。
|
||||
|
||||
### Experiment C: kvcache-centric (session-aware PD)
|
||||
|
||||
| Metric | Value |
|
||||
|--------|-------|
|
||||
| Run dir | `kvcache-centric-default-worker-admission-20260426T210800Z` |
|
||||
| Requests | 4,449 total |
|
||||
| **Errors** | **4,390 (98.7%)** |
|
||||
| Successful | 59 (1.3%) |
|
||||
| Mean Latency (success) | 1.238s |
|
||||
| P50 Latency (success) | 0.484s |
|
||||
| P90 Latency (success) | 2.550s |
|
||||
| Mean TTFT (success) | 0.179s |
|
||||
| P50 TTFT (success) | 0.081s |
|
||||
| Mean TPOT (success) | 4.70ms |
|
||||
| Direct-to-D Sessions | 56 |
|
||||
| KV Transfer (actual) | 196 blocks (vs 105,235 planned) |
|
||||
|
||||
**Execution Mode 分布**:
|
||||
- `kvcache-centric` (failed): 4,390
|
||||
- `kvcache-direct-to-d-session` (success): 56
|
||||
- `pd-router-*` variants: 3
|
||||
|
||||
## 关键分析
|
||||
|
||||
### 1. pd-disaggregation (A) — 稳定可靠
|
||||
|
||||
- 100% 成功率,0 错误
|
||||
- Mean latency 1.66s 合理 (包含 P→D KV transfer 开销)
|
||||
- 94.4% cache hit 说明 prefix cache 在 P 侧工作良好
|
||||
- KV transfer 105K blocks = 主要开销来源
|
||||
- **适合生产使用**
|
||||
|
||||
### 2. pd-colo (B) — 不可用
|
||||
|
||||
- Qwen3.5-35B-A3B 的 Mamba/GDN hybrid 架构在 `disaggregation-mode null` 下触发内存泄漏
|
||||
- 这是 SGLang 的 bug,不是 agentic-pd-hybrid 的问题
|
||||
- **需要 SGLang 修复后重新测试**
|
||||
|
||||
### 3. kvcache-centric (C) — Admission 过于保守
|
||||
|
||||
- 98.7% 错误率说明 admission control 拒绝了几乎所有请求
|
||||
- `kvcache-seed-min-turn-id=2` 过滤了 turn 1 的 seed(正确行为)
|
||||
- 但绝大多数 turn 2+ 请求也走 `kvcache-centric` 模式后失败
|
||||
- 可能原因:
|
||||
- Worker admission 查询发现 D 侧没有对应 session 的 KV cache(因为 turn 1 没有 seed)
|
||||
- D 侧 transfer queue 积压导致 admission 拒绝
|
||||
- 成功的 56 个 `direct-to-d-session` 请求表现优异: TTFT 0.08s (P50), 比 pd-disagg 的 0.34s 快 4x
|
||||
- **需要调优 admission 参数,或使用 `kvcache-seed-min-turn-id=1` 允许 turn 1 seed**
|
||||
|
||||
### 4. kvcache-centric 成功请求 vs pd-disaggregation 对比
|
||||
|
||||
| Metric | pd-disagg (A) | kvcache-centric (C, success only) | Delta |
|
||||
|--------|:---:|:---:|:---:|
|
||||
| Mean Latency | 1.662s | 1.238s | **-25.5%** |
|
||||
| P50 Latency | 0.973s | 0.484s | **-50.3%** |
|
||||
| Mean TTFT | 0.445s | 0.179s | **-59.8%** |
|
||||
| P50 TTFT | 0.340s | 0.081s | **-76.2%** |
|
||||
| Mean TPOT | 5.20ms | 4.70ms | -9.6% |
|
||||
| Actual KV Transfer | 105,235 blk | 196 blk | **-99.8%** |
|
||||
|
||||
**当 kvcache-centric 成功时,性能提升显著:**
|
||||
- TTFT 降低 60-76% (D 侧直接 append,无需 P→D transfer)
|
||||
- 端到端 latency 降低 25-50%
|
||||
- KV transfer 减少 99.8%
|
||||
|
||||
## 后续建议
|
||||
|
||||
1. **修复 pd-colo**: 提交 SGLang issue 关于 Mamba/GDN 模型在 disaggregation-mode null 下的内存泄漏
|
||||
2. **调优 kvcache-centric admission**:
|
||||
- 尝试 `--kvcache-seed-min-turn-id 1` 允许 turn 1 seed
|
||||
- 放宽 `--kvcache-seed-max-decode-transfer-queue-reqs` 阈值
|
||||
- 使用 `--kvcache-admission-mode router` (shadow state, 不在 critical path)
|
||||
3. **增加 D 侧内存**: 调整 `--mem-fraction-static` 给 KV cache 更多空间
|
||||
4. **多 P/D 配置**: 测试 2P2D (TP2) 配置以增加并行度
|
||||
|
||||
## 实验日期
|
||||
|
||||
2026-04-27
|
||||
305
docs/archive/V5_PROFILE_INVESTIGATION_ZH.md
Normal file
305
docs/archive/V5_PROFILE_INVESTIGATION_ZH.md
Normal file
@@ -0,0 +1,305 @@
|
||||
# v5+Profile 调查报告(经 critic 审计修订版)
|
||||
|
||||
**日期**: 2026-04-29(原稿)/ 2026-04-29(经审计修订)
|
||||
**实验配置**: Qwen3-30B-A3B (TP1)、单机 8×H100 80GB、trace = qwen35-swebench-50sess.jsonl (4449 reqs / 52 sessions)、time-scale=10、concurrency=32
|
||||
**数据集**: `outputs/qwen3-30b-tp1-v5-optD-profile/`(EXP1 1P7D + EXP2 2P6D,均加入 1Hz `/server_info` 时序采样)
|
||||
**v5 baseline 对照**: `outputs/qwen3-30b-tp1-v5-optD/`(无 polling)
|
||||
**研究问题**: v5 (Option D) 把 errors 从 9-10% 降到 0.2%,但 session-cap fallback 反而升到 46-51%。fallback / errors 究竟来自哪里。
|
||||
|
||||
> **本稿是经过 hostile audit 后的修订版**。原稿包含若干结论性错误(尤其是对 `held_tokens` 语义的解读颠倒、对 admission race 的过度归因、对 polling 副作用的轻视)。审计意见保存在本会话记录中,关键纠错以 ⚠️ 标注。
|
||||
|
||||
---
|
||||
|
||||
## TL;DR(已修订)
|
||||
|
||||
1. **真实容量**: 每张 D 的 `token_to_kv_pool_allocator.size = 92086 tokens (~92K)`。⚠️ 单 turn 真实 footprint **不是 50-100K**;`cached_tokens` p50=18K、p90=48K、p99=67K。原稿过度夸张。
|
||||
2. **`other = capacity − held − available` 的解读已修订**: ⚠️ `held_tokens = sum(slot.kv_allocated_len − slot.cache_protected_len)`(代码:`session_aware_cache.py:278-282`),即"slot 拿到但**不在 radix tree 保护范围内**的部分"。所以 **`other` 的最大单一组成很可能是 radix-tree 保护的共享前缀缓存(prefix cache)** —— 这通常是想要的,**不是病态浪费**。原稿把 `other` 全归因为 running batch + 在途传输是错的。
|
||||
3. **`other` 的双峰分布属实**(p50 ≈ 0,p90 ≈ 80K),但单凭 `cap−held−avail` 无法判断这是 radix-cache 自然累积、还是 burst 工作内存。**P1 的细分 instrument 必须先做**。
|
||||
4. **errors 与 `other` 在时间上相关**属实,但**不能被解释为因果**。同一时段的多个变量(请求并发、in-flight transfer、可用空间)都在变化;无法仅凭时序对齐推断"`other` 吃掉了腾出来的空间"。
|
||||
5. **EXP2 2P6D errors 9 → 415**:⚠️ **polling 被升级为 leading hypothesis**,而非"无关"。证据:执行模式呈 ~1:1 替换(`session-cap-fb` −356 / `kvcache-centric` +406),且 `/server_info` 不是被动读 —— 它在 scheduler 主循环内遍历每个 session slot 计算 `is_idle`。需要 P0 三次 baseline 复跑去伪。
|
||||
6. **errors 集中在 18 个 session 上**(总共 52 个),每个 session 钉死在 1 个 D。per-D error rate 差异**无法解释为 D 的结构差别**,本质是 18 个"坏 session"如何被路由分配。
|
||||
7. **v5+profile 1P7D 的延迟优于 baseline** 完全在 single-run variance 范围内。N=1,**不能作为任何性能结论**。
|
||||
|
||||
---
|
||||
|
||||
## 1. 方法论
|
||||
|
||||
### 1.1 Instrument 改动
|
||||
- `src/agentic_pd_hybrid/replay.py` 加入 `_query_pool_snapshot` + `_poll_pool_timeseries`,后台 asyncio task 以 `--pool-poll-interval-s 1.0` 周期访问每个 P/D worker 的 `/server_info`。
|
||||
- 每 tick 写一行 jsonl 到 `<run_dir>/d-pool-timeseries.jsonl`,字段:`{worker_id, worker_role, session_count, resident_session_count, held_tokens, available_tokens, capacity_tokens, idle_evictable_*, sessions[], kvcache_mem_gb, last_gen_throughput, ...}`。
|
||||
- 分析脚本:`scripts/analysis/analyze_pool_timeseries.py`。
|
||||
|
||||
### 1.2 字段定义(已修订 ⚠️)
|
||||
`/server_info` → `internal_states[0].session_cache` 的来源是 `session_controller.py:get_streaming_session_cache_status` → `tree_cache`(`SessionAwareCache`)。
|
||||
|
||||
| 字段 | 真实含义 | 备注 |
|
||||
|---|---|---|
|
||||
| `held_tokens` | `sum_over_slots(ceil(kv_allocated_len, page_size) − cache_protected_len)` | **不是** "session 在 cache 中占用的全部";只统计**slot-private、未被 radix tree 保护**的部分 |
|
||||
| `cache_protected_len` | radix tree 保护的共享前缀部分 | 多个 session 共享时只计一次 |
|
||||
| `available_tokens` | `token_to_kv_pool_allocator.available_size()` | 全局 KV 池剩余空间 |
|
||||
| `capacity_tokens` | `allocator.size` | 单 D 的总 KV 容量 = 92086 |
|
||||
| `idle_evictable_tokens` | held 中可被 LRU 立即踢的部分(session 所有 req finished + streaming 模式) | |
|
||||
|
||||
因此:
|
||||
- **`other = capacity − held − available`** 包含但不限于:
|
||||
- **radix-tree 保护的共享前缀 token**(可能是大头) ⚠️ 原稿遗漏
|
||||
- 当前 running batch 占用的 KV slots
|
||||
- P→D 在途 transfer 的临时 buffer
|
||||
- mooncake 已注册但尚未提交到 tree_cache 的块
|
||||
- 内部碎片 / allocator 元数据
|
||||
|
||||
**含义**: 在补充 P1 instrument 之前,我们**无法分辨** `other` 中"radix-cache"(良性)和"burst 工作集 / fragmentation"(可能病态)的比例。
|
||||
|
||||
### 1.3 配置一致性与风险
|
||||
- v5+profile 与 v5 baseline 唯一差别:加了 `--pool-poll-interval-s 1.0`(其余 CLI 参数完全一致)。
|
||||
- **两次 run 时间间隔 ~21 小时**(2026-04-28 15:39/16:27 vs 2026-04-29 12:08/12:59)⚠️ 原稿误写 ~6h。同一台机,但 GPU 温度、PCIe、NUMA 分配未控制。
|
||||
- **N=1 比较没有统计意义**;任何延迟差异 < 30% 都属于 single-run variance 合理范围。
|
||||
|
||||
---
|
||||
|
||||
## 2. 整体性能对比
|
||||
|
||||
| 指标 | v5 1P7D | **v5+profile 1P7D** | v5 2P6D | **v5+profile 2P6D** |
|
||||
|---|---|---|---|---|
|
||||
| 总 requests | 4449 | 4449 | 4449 | 4449 |
|
||||
| **errors** | 9 (0.2%) | 6 (0.1%) | 9 (0.2%) | **415 (9.3%)** |
|
||||
| truncated | 42 | 43 | 42 | 42 |
|
||||
| direct-to-D | 44.7% | 54.9% | 41.3% | 41.1% |
|
||||
| session-cap fallback | 45.6% | 36.1% | 50.6% | 42.6% |
|
||||
| no-d-capacity | 1.2% | 0.7% | 0.8% | 0.6% |
|
||||
| pd-router-d-session-reseed | 4.8% | 4.3% | 3.4% | 2.9% |
|
||||
| pd-router-turn1-seed | 1.2% | 1.2% | 1.1% | 1.1% |
|
||||
| **kvcache-centric (failed mode)** | 0.2% (9) | 0.1% (6) | 0.2% (9) | **9.3% (415)** |
|
||||
| latency mean / p50 / p90 / p99 (s) | 5.18/1.59/14.7/26.1 | 4.21/1.18/11.3/28.8 | 3.49/1.31/9.1/24.9 | 3.23/1.11/8.4/20.3 |
|
||||
|
||||
⚠️ **不要从此表得出"v5+profile 改进了延迟"** —— N=1 single run,且 EXP2 引入了 415 个 errors 相当于换了一种回退策略,延迟均值的下降很可能只是**剔除了慢路径请求**的副作用。
|
||||
|
||||
### 2.1 EXP2+profile 415 errors 解构(已修订)
|
||||
|
||||
**Error type 分布**:
|
||||
| Error Type | 数量 |
|
||||
|---|---|
|
||||
| `RuntimeError: generate stream ended before producing any token` | 407 |
|
||||
| `ReadTimeout: ` | 8 |
|
||||
|
||||
⚠️ **关键约束**:
|
||||
- **414/415 个 error 的 `kv_transfer_blocks > 0`**(从 metrics jsonl 验证)。这些请求**已经过了 admission,P→D 传输已开始**,死于下游(server-side abort、流被关、生成阶段失败)。
|
||||
- **`session_reused=False` 占 415/415**(全部是 seed,无一是 direct append)。
|
||||
- **失败集中在 18 个 unique session**(top 5: 58080→decode-5 66 errs / 70560→decode-2 54 / 67200→decode-4 40 / 59200→decode-4 35 / 77280→decode-2 33),每个 session 钉死在一台 D。
|
||||
|
||||
**Per-D error rate(已修正百分比)**:
|
||||
| Decode Worker | Errors | Total Reqs | Error Rate |
|
||||
|---|---|---|---|
|
||||
| decode-0 | 56 | 758 | 7.4% |
|
||||
| decode-1 | 5 | 561 | 0.9% |
|
||||
| decode-2 | 141 | 858 | **16.4%** |
|
||||
| decode-3 | 0 | 838 | 0.0% |
|
||||
| decode-4 | 106 | 731 | 14.5% |
|
||||
| decode-5 | 107 | 703 | 15.2% |
|
||||
|
||||
⚠️ **不要解读为"decode-3 健康、decode-2 病态"**。每个 session 钉死在一台 D,18 个坏 session 是否落到某个 D 是路由分配的随机结果。**当前 N=1 数据无法分辨"D 结构差异"与"session 分配运气"**。
|
||||
|
||||
---
|
||||
|
||||
## 3. D KV pool 时序分解(EXP1 1P7D 关键结果)
|
||||
|
||||
每张 D capacity=92086 tokens,运行 ~2696 秒(去掉前 10% 暖机):
|
||||
|
||||
| Worker | mean_other | p50_other | p90_other | max_other | mean_held | mean_avail |
|
||||
|---|---:|---:|---:|---:|---:|---:|
|
||||
| decode-0 | 13599 | 63 | 77189 | 90959 | 47124 | 31363 |
|
||||
| decode-1 | 21242 | 0 | 76854 | 91074 | 37024 | 33820 |
|
||||
| decode-2 | 39333 | 46841 | 82782 | 91996 | 17381 | 35372 |
|
||||
| decode-3 | 30543 | 15864 | 81512 | 91511 | 9584 | 51959 |
|
||||
| decode-4 | 32659 | 32365 | 72995 | 92082 | 7643 | 51784 |
|
||||
| decode-5 | 31745 | 20366 | 86341 | 91211 | 11305 | 49036 |
|
||||
| decode-6 | 24602 | 701 | 82291 | 91000 | 20967 | 46517 |
|
||||
|
||||
**已修订观察(去掉了原稿的过度归因)**:
|
||||
- **`other` 是双峰**(p50 接近 0,p90 接近 80K,mean 在 14-39K)。这一形态属实。
|
||||
- **不同 D 的 mean_held / mean_other 差异巨大** —— 但⚠️ **不能直接归类为 "session-heavy" 或 "transfer-heavy"**,因为我们不知道 `other` 里 radix-cache vs 工作内存的比例。**P1 的拆分必做**。
|
||||
- 由于 `held` 不包含 radix-protected token,`mean_held` 低**不代表**该 D 上 sessions 占用少 —— 只代表它们的"slot 私有部分"少;共享前缀可能很大,完全藏在 `other` 里。
|
||||
|
||||
### 3.1 `other` 在某些时段持续高位(EXP1 decode-2 抽样)
|
||||
|
||||
| t (s) | held | avail | other | sess_count | last_gen_throughput |
|
||||
|---:|---:|---:|---:|---:|---:|
|
||||
| 3 | 0 | 92086 | 0 | 0/0 | (未抽) |
|
||||
| 273 | 65310 | 26776 | 0 | 1/1 | (未抽) |
|
||||
| 543 | 15296 | 76589 | 201 | 1/1 | (未抽) |
|
||||
| 812 | 0 | 92086 | 0 | 0/0 | (未抽) |
|
||||
| 1082 | 52507 | 39579 | 0 | 1/1 | (未抽) |
|
||||
| 1351 | 40985 | 30175 | 20926 | 2/2 | (未抽) |
|
||||
| **1622** | **0** | 17703 | **74383** | **0/0** | **未核** |
|
||||
| 1891 | 0 | 46376 | 45710 | 0/0 | (未抽) |
|
||||
| 2161 | 0 | 27667 | 64419 | 0/0 | (未抽) |
|
||||
| 2430 | 0 | 62224 | 29862 | 0/0 | (未抽) |
|
||||
|
||||
⚠️ **t=1622 之后(约 30+ tick)持续 held=0/sess=0/other≈45-74K** —— 这种持久状态**不是 burst 工作集的形态**(burst 应是亚秒级)。更可能的解释包括:
|
||||
- 一个 stuck request 的 KV 块未能正常释放
|
||||
- mooncake 注册但未 commit 的 transfer buffer 滞留
|
||||
- 某个 cleanup 路径未触发
|
||||
|
||||
**未在原稿中验证 `last_gen_throughput`**,该字段记录在 timeseries 但未对齐分析。**P1 时一并补**。
|
||||
|
||||
---
|
||||
|
||||
## 4. Errors 与 Saturation 时序相关性(EXP2 2P6D)
|
||||
|
||||
### 4.1 等数量 vs 等时间 decile(已修订 ⚠️)
|
||||
|
||||
原稿仅展示等时间分箱,有"第 10 decile 系统恢复"的视觉错觉。两种分箱并列:
|
||||
|
||||
| Decile | 等时间(reqs / errs / rate) | 等数量(reqs / errs / rate) |
|
||||
|:---:|:---:|:---:|
|
||||
| 1 | 567 / 0 / 0.0% | 444 / 0 / 0.0% |
|
||||
| 2 | 268 / 0 / 0.0% | 445 / 0 / 0.0% |
|
||||
| 3 | 517 / 0 / 0.0% | 445 / 0 / 0.0% |
|
||||
| 4 | 189 / 0 / 0.0% | 445 / 0 / 0.0% |
|
||||
| 5 | 662 / 3 / 0.5% | 445 / 3 / 0.7% |
|
||||
| 6 | 417 / 27 / 6.5% | 445 / 28 / 6.3% |
|
||||
| 7 | 486 / 39 / 8.0% | 445 / 42 / 9.4% |
|
||||
| 8 | 612 / 177 / 28.9% | 445 / 114 / 25.6% |
|
||||
| 9 | 486 / 128 / 26.3% | 445 / 119 / 26.7% |
|
||||
| **10** | **245 / 41 / 16.7%** | **445 / 109 / 24.5%** |
|
||||
|
||||
⚠️ **第 10 decile 不是"系统恢复"**。等数量分箱显示 24.5% 的 error rate,与 decile 8/9 持平。原稿"恢复"叙事是分母 245 vs 612 造成的视觉假象。
|
||||
|
||||
### 4.2 多重假设并列(已修订,不再独尊 admission race)
|
||||
|
||||
针对 EXP2 2P6D 415 errors 的可能机制(按当前数据强弱排序):
|
||||
|
||||
**H1: Polling 引发 scheduler 时序扰动(leading hypothesis ⚠️)**
|
||||
- 证据:执行模式 1:1 替换(session-cap-fb −356 / kvcache-centric +406)。
|
||||
- 证据:`/server_info` 进 scheduler 主循环遍历 session slot,1 Hz × 8 worker 不是 0 开销。
|
||||
- 证伪条件:**P0(三次 baseline EXP2 复跑)如果都得到 ~9 errors,本假设确认**。
|
||||
|
||||
**H2: v5 自身存在 admission/transfer race**
|
||||
- v5 baseline 也出 9 个 errors(均为 ReadTimeout),说明该 race 在 baseline 已存在,profile 是被放大了。
|
||||
- 证据弱化:原稿提的 "admission race"(admit_direct_append snapshot 过期)与数据冲突 —— **414/415 errors 的 `kv_transfer_blocks > 0`**,他们都过了 admission,死在下游。所以即便有 race,也不是发生在 admission 端,而是 P→D transfer 后 / 生成开始前。
|
||||
|
||||
**H3: 18 个特定 session 的工作负载结构性失败**
|
||||
- 18/52 session 集中失败,每个 session 都是高 turn_id (median=70)。
|
||||
- 这些 session 可能 input 特别长,或某种 trace 结构会触发某个特定路径。
|
||||
- 证伪条件:在 P0 三次 baseline 复跑后,看是否仍是同一组 18 个 session 失败。
|
||||
|
||||
**H4: 单次运行的 GPU/PCIe 状态扰动**
|
||||
- ~21 小时间隔,GPU 温度/clock 不同。
|
||||
- 证伪条件:P0 三次 baseline 都 ~9 errors → 排除单次扰动主导。
|
||||
|
||||
⚠️ **原稿独推 admission-race(H2)是错的**。当前数据无法决定 H1-H4 哪个是主因。
|
||||
|
||||
---
|
||||
|
||||
## 5. 1P7D vs 2P6D 全局对比
|
||||
|
||||
| Config | total decode ticks | other p50 | other p90 | other>30K freq | other>50K freq | other>70K freq | held>60K freq |
|
||||
|---|---:|---:|---:|---:|---:|---:|---:|
|
||||
| 1P7D | 18865 | 663 | 79751 | 36.9% | 27.9% | 14.8% | 15.5% |
|
||||
| 2P6D | 14016 | 14459 | 77199 | 43.2% | 30.4% | 13.9% | 4.8% |
|
||||
|
||||
⚠️ **原稿"2P6D 的 p50_other 是 1P7D 的 22 倍 → 2P 推送压力更大"过度解读**。考虑分母效应:同一 trace 总工作量在 2P6D 由 6 张 D 分担 vs 1P7D 由 7 张 D 分担,**单 D 受到的压力本来就更大**,与 P 数无直接因果。这个数据只能说"2P6D 单 D 负担更高",**不能**得出"2P 在 transfer 上比 1P 更激进"。
|
||||
|
||||
---
|
||||
|
||||
## 6. 关键解读(已大幅修订)
|
||||
|
||||
### 6.1 v5 真实瓶颈尚不明确
|
||||
原稿声称"瓶颈是 D 的 KV pool 在压力期被 'other' 占据"。⚠️ **此结论已撤回**。给定 `held_tokens` 实际是 slot-private(non-tree)部分,`other` 的最大单一成分**很可能是正常的 radix-tree 共享前缀**。"被 running batch / 在途传输占据"是**未经验证的猜想**。需要 P1 的细分 instrument 才能给出真瓶颈。
|
||||
|
||||
### 6.2 LRU eviction 的行为暂无可靠解读
|
||||
原稿基于 mean_held 在压力期"暴跌"推断 LRU 在拼命踢。但 `held` 实际是 slot-private 部分,session 仍可能被 radix-tree 保留;`held` 减少不等于 session 被 evict,可能只是 `cache_protected_len` 比例变化。**P1 拆分前不下结论**。
|
||||
|
||||
### 6.3 v5+profile 1P7D "比 baseline 快"是单次巧合
|
||||
两次 run 间隔 ~21 小时(原稿误写 ~6h),GPU 温度/PCIe 状态未控制。**N=1**,任何性能差异 < 30% 都不可声称。
|
||||
|
||||
### 6.4 EXP2 2P6D 415 errors:polling 是 leading suspect(已升级)
|
||||
原稿把 polling 列为"次要可能"。⚠️ **现在升级为主嫌疑**:
|
||||
- 执行模式 1:1 替换(session-cap-fb −356 / kvcache-centric +406)说明 polling **改变了 admission 走哪条路**。
|
||||
- `/server_info` 不是只读旁路 —— 调度内部循环 + 遍历 session slots 计算 `is_idle`。
|
||||
- **必须做 P0 三次 baseline 复跑去伪**;在那之前不能动 v6。
|
||||
|
||||
### 6.5 "Other" 在 P 上 90% 不是 backup blocks
|
||||
`prefill-0` 的 SessionAwareCache **未启用**(replay 数据 `held=0`),P 的 "other" 等于"P 全部 KV 使用量"(radix cache + running batch + 备份)。⚠️ 当前数据**无法分辨** prefill-backup-policy 是不是真的释放了。需在 P 加单独的 `prefill_backup_tokens` 字段。
|
||||
|
||||
---
|
||||
|
||||
## 7. v6 行动项(已重排,以 P0 起步)
|
||||
|
||||
### **P0:验证 EXP2 errors=9 的可复现性**(最高优先级,先做)
|
||||
**操作**: 跑 3 次 v5 baseline EXP2(同 v5 配置,**不开 polling**),比较 error 分布。
|
||||
- 如果 3 次都得到 ~9 errors → polling 被坐实为 415 暴涨主因。**必须把 polling 改成更轻量的形式**(如降低频率、改成 streaming push、或用 sidecar metrics 而非 HTTP poll)再做后续。
|
||||
- 如果 3 次都得到 ~400 errors → polling 不是主因,415 是 v5 admission/transfer race + 单次 GPU 状态扰动的复合。
|
||||
- 如果 3 次结果分布很广(如 9 / 50 / 400) → run-to-run variance 才是主导,任何 single-run 比较失效。
|
||||
|
||||
**预期工程量**: 1 个新 sweep 脚本(只跑 EXP2,3 次)+ ~3 × 50 min = ~2.5h GPU 时间。
|
||||
**风险**: 0(纯重跑现有配置)。
|
||||
|
||||
### **P1:把 D 的 `other` 拆开打表**(P0 跑的同时并行做代码)
|
||||
**操作**: 改 SGLang `scheduler.py:get_streaming_session_cache_status` 与 `session_aware_cache.py`,在返回的 dict 里加:
|
||||
- `radix_protected_tokens` = `sum(slot.cache_protected_len for slot in slots)` ⚠️ 这是原稿盲区,critic 暴露的关键缺失字段
|
||||
- `running_batch_tokens` = `sum(req.fill_ids size for req in running_batch.reqs)`
|
||||
- `inflight_transfer_tokens` = `sum(req.size for req in disagg_decode_transfer_queue.queue)`
|
||||
- `prealloc_tokens` = `sum(req.size for req in disagg_decode_prealloc_queue.queue)`
|
||||
- `retracted_tokens` = `sum(req.size for req in disagg_decode_prealloc_queue.retracted_queue)`
|
||||
- `last_gen_throughput`(已有)更细 —— 加 `running_batch_size`(req 数)
|
||||
|
||||
**预期收益**: `other_unaccounted = capacity − held − available − radix_protected − running_batch − inflight − prealloc − retracted` 应该接近 0。剩余的就是真"病态"内存。
|
||||
**风险**: 低(纯只读 stat,不改 admission 逻辑)。
|
||||
**工程量**: ~80 行 SGLang patch + 同步 replay.py 的 `_query_pool_snapshot` + analyzer。
|
||||
|
||||
### **P2:如果 P0 暴露 polling 是主因,改 polling 实现**
|
||||
- 选项 A:把 `/server_info` 改成事件驱动 push(scheduler 在 step 末尾把 stats 写到环形缓冲区,polling 只读不进 scheduler 队列)
|
||||
- 选项 B:把 polling 频率从 1Hz 降到 5Hz/10s,在 P1 的拆分数据上验证够用
|
||||
- 选项 C:scheduler 端加锁分离,把 stats 读和 admission 决策的临界区拆开
|
||||
|
||||
### **P3(条件性,等 P0+P1 数据)**:决定真正的优化方向
|
||||
原稿 §7 的 5 条优先级在 `other` 模型纠正后**全部需要重新评估**。等真实拆分数据出来再排。
|
||||
|
||||
---
|
||||
|
||||
## 8. 局限与 Confounders(已扩充)
|
||||
|
||||
1. ⚠️ `held_tokens` 语义在原稿被解读颠倒,引发 `other` 的因果归因错误(已纠正,见 §1.2)。
|
||||
2. `other` 字段是计算所得且**未细分**,无法直接归因。需要 P1 instrument 才能区分 radix-cache、running batch、inflight 等。
|
||||
3. ⚠️ EXP2+profile 的 415 errors 与 baseline 9 errors **量级差异无法 deconfound**;polling 是 leading suspect 但未证实。**P0 是必经步骤**。
|
||||
4. **N=1** 的实验配置:任何 v5+profile vs v5 baseline 的延迟/失败差异都属于 single-run variance 合理范围,**不能作为方向性结论**。
|
||||
5. trace 是 single-shot,52 sessions × 4449 reqs 的特定结构可能放大某些路径。
|
||||
6. `capacity = 92086` 是 `token_to_kv_pool_allocator.size`,来自 `mem_fraction_static`(未抽具体值),与"H100 80GB 的物理上限"差距是 SGLang 的安全裕量。
|
||||
7. ⚠️ §3.1 t=1622 持续高 `other` 30+ tick 的现象 **未与 `last_gen_throughput` 交叉验证**;原稿"running batch + 在途传输"的解释是猜想而非证据。
|
||||
8. ⚠️ 18/52 失败 session 的特征(turn_id、input 长度、prefix shape)**未做对比分析**;不能排除某个 session 类型本来就会触发某个固定 bug。
|
||||
9. polling 频率 1Hz 错过亚秒级 burst —— `other` 的双峰可能比测到的更剧烈。
|
||||
10. critic 指出 `pd-router-d-session-reseed` 在 EXP1 涨(193 vs 152)、EXP2 跌(127 vs 152)的反向移动**未在原稿分析**,这是 admission/路由 决策的清晰信号,应该在 P1 之后回看。
|
||||
|
||||
---
|
||||
|
||||
## 9. 后续指令(已更新顺序)
|
||||
|
||||
1. **P0**: 跑 `scripts/sweep_tp1_v5_baseline_rerun_exp2.sh`,3 次 EXP2 baseline,无 polling。
|
||||
2. **P1**: 同时改 SGLang 把 `other` 真正拆开。
|
||||
3. 完成 P0+P1 后:
|
||||
- 重跑 EXP2 一次 + 新 instrument(同 polling),拿到 `other` 拆分。
|
||||
- 对比 baseline-rerun 三次的 errors 分布。
|
||||
- 决定是否回退 polling、调 admission、还是攻 specific 18 个 session 的工作负载特征。
|
||||
4. 任何 v6 代码改动(优化 admission / eviction / transfer)**必须在 P0+P1 之后**。
|
||||
|
||||
---
|
||||
|
||||
## 10. 数据产物
|
||||
|
||||
```
|
||||
outputs/qwen3-30b-tp1-v5-optD-profile/
|
||||
├── exp{1,2}_*_metrics.jsonl # 4449 行 / 实验
|
||||
├── exp{1,2}_*_summary.json
|
||||
├── exp{1,2}_*_pool_timeseries.jsonl # 12 MB / 10 MB
|
||||
└── kvcache-centric-...20260429T{120847,125911}Z/ # 原始 run dir
|
||||
|
||||
outputs/qwen3-30b-tp1-v5-optD/ # baseline 对照(N=1)
|
||||
└── exp{1,2}_1p7d_kvc_optD_*
|
||||
|
||||
# 待 P0 产生:
|
||||
outputs/qwen3-30b-tp1-v5-optD-baseline-rerun/
|
||||
└── exp2_2p6d_run{1,2,3}_*
|
||||
```
|
||||
|
||||
分析脚本:`scripts/analysis/analyze_pool_timeseries.py`(`--json` 拿机器可读输出)。
|
||||
BIN
docs/figures/cache_efficiency.png
Normal file
BIN
docs/figures/cache_efficiency.png
Normal file
Binary file not shown.
|
After Width: | Height: | Size: 368 KiB |
BIN
docs/figures/gpu_utilization.png
Normal file
BIN
docs/figures/gpu_utilization.png
Normal file
Binary file not shown.
|
After Width: | Height: | Size: 216 KiB |
BIN
docs/figures/ttft_pdf_comparison.png
Normal file
BIN
docs/figures/ttft_pdf_comparison.png
Normal file
Binary file not shown.
|
After Width: | Height: | Size: 315 KiB |
BIN
docs/figures/v2_execution_mode_distribution.png
Normal file
BIN
docs/figures/v2_execution_mode_distribution.png
Normal file
Binary file not shown.
|
After Width: | Height: | Size: 130 KiB |
BIN
docs/figures/v2_path_level_latency.png
Normal file
BIN
docs/figures/v2_path_level_latency.png
Normal file
Binary file not shown.
|
After Width: | Height: | Size: 106 KiB |
@@ -0,0 +1,88 @@
|
||||
{
|
||||
"actual_output_tokens_stats": {
|
||||
"count": 4086.0,
|
||||
"mean": 213.95105237395987,
|
||||
"p50": 83.0,
|
||||
"p90": 562.0,
|
||||
"p99": 1346.0
|
||||
},
|
||||
"cache_hit_request_count": 3929,
|
||||
"cached_tokens_stats": {
|
||||
"count": 4449.0,
|
||||
"mean": 22635.924702180266,
|
||||
"p50": 20010.0,
|
||||
"p90": 48002.0,
|
||||
"p99": 65424.0
|
||||
},
|
||||
"decode_request_priorities": {},
|
||||
"error_count": 363,
|
||||
"execution_modes": {
|
||||
"kvcache-centric": 363,
|
||||
"kvcache-direct-to-d-session": 1716,
|
||||
"pd-router-d-session-reseed": 23,
|
||||
"pd-router-fallback-d-backpressure": 12,
|
||||
"pd-router-fallback-large-append": 5,
|
||||
"pd-router-fallback-large-append-seed-filter-early-turn": 51,
|
||||
"pd-router-fallback-large-append-session-cap": 2148,
|
||||
"pd-router-fallback-no-d-capacity": 7,
|
||||
"pd-router-fallback-session-cap": 32,
|
||||
"pd-router-large-append-reseed": 39,
|
||||
"pd-router-large-append-reseed-after-eviction": 2,
|
||||
"pd-router-turn1-d-backpressure": 1,
|
||||
"pd-router-turn1-no-d-capacity": 3,
|
||||
"pd-router-turn1-seed": 34,
|
||||
"pd-router-turn1-session-cap": 13
|
||||
},
|
||||
"latency_stats_s": {
|
||||
"count": 4086.0,
|
||||
"mean": 4.8753733304192455,
|
||||
"p50": 1.754677688702941,
|
||||
"p90": 12.66968655679375,
|
||||
"p99": 28.717210091650486
|
||||
},
|
||||
"mechanisms": {
|
||||
"kvcache-centric": 4449
|
||||
},
|
||||
"per_decode_load": {
|
||||
"decode-0": 616,
|
||||
"decode-1": 658,
|
||||
"decode-2": 674,
|
||||
"decode-3": 582,
|
||||
"decode-4": 656,
|
||||
"decode-5": 662,
|
||||
"decode-6": 601
|
||||
},
|
||||
"per_prefill_load": {
|
||||
"prefill-0": 4449
|
||||
},
|
||||
"prefill_request_priorities": {
|
||||
"-100": 98,
|
||||
"100": 2272
|
||||
},
|
||||
"re_prefill_count": 0,
|
||||
"request_count": 4449,
|
||||
"reuse_expected_count": 4397,
|
||||
"reuse_observed_count": 4397,
|
||||
"router_url": "http://127.0.0.1:8000",
|
||||
"session_reset_count": 0,
|
||||
"session_reused_count": 1716,
|
||||
"total_actual_kv_transfer_blocks": 62123,
|
||||
"total_cached_tokens": 100707229,
|
||||
"total_kv_transfer_blocks": 105235,
|
||||
"tpot_stats_s": {
|
||||
"count": 4086.0,
|
||||
"mean": 0.005829451223571163,
|
||||
"p50": 0.005684156496173296,
|
||||
"p90": 0.007143743503740225,
|
||||
"p99": 0.008634991403068266
|
||||
},
|
||||
"trace_path": "outputs/qwen3-30b-tp1-v3-kvaware/kvcache-centric-kv-aware-worker-admission-20260428T095141Z/sampled-trace.jsonl",
|
||||
"truncated_request_count": 42,
|
||||
"ttft_stats_s": {
|
||||
"count": 4086.0,
|
||||
"mean": 3.5955862397812597,
|
||||
"p50": 0.36274072993546724,
|
||||
"p90": 10.972254231572151,
|
||||
"p99": 27.433656523004174
|
||||
}
|
||||
}
|
||||
@@ -0,0 +1,85 @@
|
||||
{
|
||||
"actual_output_tokens_stats": {
|
||||
"count": 4440.0,
|
||||
"mean": 225.87972972972972,
|
||||
"p50": 86.0,
|
||||
"p90": 576.0,
|
||||
"p99": 1347.0
|
||||
},
|
||||
"cache_hit_request_count": 4201,
|
||||
"cached_tokens_stats": {
|
||||
"count": 4449.0,
|
||||
"mean": 24345.55787817487,
|
||||
"p50": 21504.0,
|
||||
"p90": 48792.0,
|
||||
"p99": 69120.0
|
||||
},
|
||||
"decode_request_priorities": {},
|
||||
"error_count": 9,
|
||||
"execution_modes": {
|
||||
"kvcache-centric": 9,
|
||||
"kvcache-direct-to-d-session": 1358,
|
||||
"pd-router-d-session-reseed": 12,
|
||||
"pd-router-fallback-d-backpressure": 2,
|
||||
"pd-router-fallback-large-append-seed-filter-early-turn": 52,
|
||||
"pd-router-fallback-large-append-session-cap": 2902,
|
||||
"pd-router-fallback-session-cap": 25,
|
||||
"pd-router-large-append-reseed": 34,
|
||||
"pd-router-large-append-reseed-after-eviction": 4,
|
||||
"pd-router-turn1-d-backpressure": 1,
|
||||
"pd-router-turn1-seed": 30,
|
||||
"pd-router-turn1-session-cap": 20
|
||||
},
|
||||
"latency_stats_s": {
|
||||
"count": 4440.0,
|
||||
"mean": 3.582334662846558,
|
||||
"p50": 1.517257746309042,
|
||||
"p90": 9.225348330102861,
|
||||
"p99": 18.70269925892353
|
||||
},
|
||||
"mechanisms": {
|
||||
"kvcache-centric": 4449
|
||||
},
|
||||
"per_decode_load": {
|
||||
"decode-0": 710,
|
||||
"decode-1": 630,
|
||||
"decode-2": 763,
|
||||
"decode-3": 737,
|
||||
"decode-4": 879,
|
||||
"decode-5": 730
|
||||
},
|
||||
"per_prefill_load": {
|
||||
"prefill-0": 2225,
|
||||
"prefill-1": 2224
|
||||
},
|
||||
"prefill_request_priorities": {
|
||||
"-100": 80,
|
||||
"100": 3002
|
||||
},
|
||||
"re_prefill_count": 0,
|
||||
"request_count": 4449,
|
||||
"reuse_expected_count": 4397,
|
||||
"reuse_observed_count": 4397,
|
||||
"router_url": "http://127.0.0.1:8000",
|
||||
"session_reset_count": 0,
|
||||
"session_reused_count": 1358,
|
||||
"total_actual_kv_transfer_blocks": 78979,
|
||||
"total_cached_tokens": 108313387,
|
||||
"total_kv_transfer_blocks": 105235,
|
||||
"tpot_stats_s": {
|
||||
"count": 4440.0,
|
||||
"mean": 0.005882534704321737,
|
||||
"p50": 0.005807478777200416,
|
||||
"p90": 0.00712956755887717,
|
||||
"p99": 0.008372141476720572
|
||||
},
|
||||
"trace_path": "outputs/qwen3-30b-tp1-v3-kvaware/kvcache-centric-kv-aware-worker-admission-20260428T104343Z/sampled-trace.jsonl",
|
||||
"truncated_request_count": 42,
|
||||
"ttft_stats_s": {
|
||||
"count": 4440.0,
|
||||
"mean": 2.2045287611873334,
|
||||
"p50": 0.32809355948120356,
|
||||
"p90": 6.947275545448065,
|
||||
"p99": 16.705802395939827
|
||||
}
|
||||
}
|
||||
189
outputs/qwen3-30b-tp1-v3-kvaware/sweep_results.txt
Normal file
189
outputs/qwen3-30b-tp1-v3-kvaware/sweep_results.txt
Normal file
@@ -0,0 +1,189 @@
|
||||
[2026-04-28 17:51:41] Starting TP1 v3 sweep (KVC with kv-aware policy)
|
||||
[2026-04-28 17:51:41] Model: /mnt/kzlin/workflow/pd-hybrid/simm-swe-bench/models/Qwen3-30B-A3B-Instruct-2507
|
||||
[2026-04-28 17:51:41] Trace: outputs/qwen35-swebench-50sess.jsonl (4449 requests, 52 sessions)
|
||||
[2026-04-28 17:51:41] Key change: --policy kv-aware for KVC (was --policy default in v2)
|
||||
[2026-04-28 17:51:41]
|
||||
[2026-04-28 17:51:41] === [EXP1] 1P7D KVC kv-aware ===
|
||||
[2026-04-28 18:43:43] === exp1_1p7d_kvc_kvaware COMPLETED ===
|
||||
[2026-04-28 18:43:43] Summary:
|
||||
{
|
||||
"actual_output_tokens_stats": {
|
||||
"count": 4086.0,
|
||||
"mean": 213.95105237395987,
|
||||
"p50": 83.0,
|
||||
"p90": 562.0,
|
||||
"p99": 1346.0
|
||||
},
|
||||
"cache_hit_request_count": 3929,
|
||||
"cached_tokens_stats": {
|
||||
"count": 4449.0,
|
||||
"mean": 22635.924702180266,
|
||||
"p50": 20010.0,
|
||||
"p90": 48002.0,
|
||||
"p99": 65424.0
|
||||
},
|
||||
"decode_request_priorities": {},
|
||||
"error_count": 363,
|
||||
"execution_modes": {
|
||||
"kvcache-centric": 363,
|
||||
"kvcache-direct-to-d-session": 1716,
|
||||
"pd-router-d-session-reseed": 23,
|
||||
"pd-router-fallback-d-backpressure": 12,
|
||||
"pd-router-fallback-large-append": 5,
|
||||
"pd-router-fallback-large-append-seed-filter-early-turn": 51,
|
||||
"pd-router-fallback-large-append-session-cap": 2148,
|
||||
"pd-router-fallback-no-d-capacity": 7,
|
||||
"pd-router-fallback-session-cap": 32,
|
||||
"pd-router-large-append-reseed": 39,
|
||||
"pd-router-large-append-reseed-after-eviction": 2,
|
||||
"pd-router-turn1-d-backpressure": 1,
|
||||
"pd-router-turn1-no-d-capacity": 3,
|
||||
"pd-router-turn1-seed": 34,
|
||||
"pd-router-turn1-session-cap": 13
|
||||
},
|
||||
"latency_stats_s": {
|
||||
"count": 4086.0,
|
||||
"mean": 4.8753733304192455,
|
||||
"p50": 1.754677688702941,
|
||||
"p90": 12.66968655679375,
|
||||
"p99": 28.717210091650486
|
||||
},
|
||||
"mechanisms": {
|
||||
"kvcache-centric": 4449
|
||||
},
|
||||
"per_decode_load": {
|
||||
"decode-0": 616,
|
||||
"decode-1": 658,
|
||||
"decode-2": 674,
|
||||
"decode-3": 582,
|
||||
"decode-4": 656,
|
||||
"decode-5": 662,
|
||||
"decode-6": 601
|
||||
},
|
||||
"per_prefill_load": {
|
||||
"prefill-0": 4449
|
||||
},
|
||||
"prefill_request_priorities": {
|
||||
"-100": 98,
|
||||
"100": 2272
|
||||
},
|
||||
"re_prefill_count": 0,
|
||||
"request_count": 4449,
|
||||
"reuse_expected_count": 4397,
|
||||
"reuse_observed_count": 4397,
|
||||
"router_url": "http://127.0.0.1:8000",
|
||||
"session_reset_count": 0,
|
||||
"session_reused_count": 1716,
|
||||
"total_actual_kv_transfer_blocks": 62123,
|
||||
"total_cached_tokens": 100707229,
|
||||
"total_kv_transfer_blocks": 105235,
|
||||
"tpot_stats_s": {
|
||||
"count": 4086.0,
|
||||
"mean": 0.005829451223571163,
|
||||
"p50": 0.005684156496173296,
|
||||
"p90": 0.007143743503740225,
|
||||
"p99": 0.008634991403068266
|
||||
},
|
||||
"trace_path": "outputs/qwen3-30b-tp1-v3-kvaware/kvcache-centric-kv-aware-worker-admission-20260428T095141Z/sampled-trace.jsonl",
|
||||
"truncated_request_count": 42,
|
||||
"ttft_stats_s": {
|
||||
"count": 4086.0,
|
||||
"mean": 3.5955862397812597,
|
||||
"p50": 0.36274072993546724,
|
||||
"p90": 10.972254231572151,
|
||||
"p99": 27.433656523004174
|
||||
}
|
||||
}
|
||||
[2026-04-28 18:43:43] Saved to outputs/qwen3-30b-tp1-v3-kvaware/exp1_1p7d_kvc_kvaware_summary.json + exp1_1p7d_kvc_kvaware_metrics.jsonl
|
||||
[2026-04-28 18:43:43]
|
||||
[2026-04-28 18:43:43] === [EXP2] 2P6D KVC kv-aware ===
|
||||
[2026-04-28 19:30:38] === exp2_2p6d_kvc_kvaware COMPLETED ===
|
||||
[2026-04-28 19:30:38] Summary:
|
||||
{
|
||||
"actual_output_tokens_stats": {
|
||||
"count": 4440.0,
|
||||
"mean": 225.87972972972972,
|
||||
"p50": 86.0,
|
||||
"p90": 576.0,
|
||||
"p99": 1347.0
|
||||
},
|
||||
"cache_hit_request_count": 4201,
|
||||
"cached_tokens_stats": {
|
||||
"count": 4449.0,
|
||||
"mean": 24345.55787817487,
|
||||
"p50": 21504.0,
|
||||
"p90": 48792.0,
|
||||
"p99": 69120.0
|
||||
},
|
||||
"decode_request_priorities": {},
|
||||
"error_count": 9,
|
||||
"execution_modes": {
|
||||
"kvcache-centric": 9,
|
||||
"kvcache-direct-to-d-session": 1358,
|
||||
"pd-router-d-session-reseed": 12,
|
||||
"pd-router-fallback-d-backpressure": 2,
|
||||
"pd-router-fallback-large-append-seed-filter-early-turn": 52,
|
||||
"pd-router-fallback-large-append-session-cap": 2902,
|
||||
"pd-router-fallback-session-cap": 25,
|
||||
"pd-router-large-append-reseed": 34,
|
||||
"pd-router-large-append-reseed-after-eviction": 4,
|
||||
"pd-router-turn1-d-backpressure": 1,
|
||||
"pd-router-turn1-seed": 30,
|
||||
"pd-router-turn1-session-cap": 20
|
||||
},
|
||||
"latency_stats_s": {
|
||||
"count": 4440.0,
|
||||
"mean": 3.582334662846558,
|
||||
"p50": 1.517257746309042,
|
||||
"p90": 9.225348330102861,
|
||||
"p99": 18.70269925892353
|
||||
},
|
||||
"mechanisms": {
|
||||
"kvcache-centric": 4449
|
||||
},
|
||||
"per_decode_load": {
|
||||
"decode-0": 710,
|
||||
"decode-1": 630,
|
||||
"decode-2": 763,
|
||||
"decode-3": 737,
|
||||
"decode-4": 879,
|
||||
"decode-5": 730
|
||||
},
|
||||
"per_prefill_load": {
|
||||
"prefill-0": 2225,
|
||||
"prefill-1": 2224
|
||||
},
|
||||
"prefill_request_priorities": {
|
||||
"-100": 80,
|
||||
"100": 3002
|
||||
},
|
||||
"re_prefill_count": 0,
|
||||
"request_count": 4449,
|
||||
"reuse_expected_count": 4397,
|
||||
"reuse_observed_count": 4397,
|
||||
"router_url": "http://127.0.0.1:8000",
|
||||
"session_reset_count": 0,
|
||||
"session_reused_count": 1358,
|
||||
"total_actual_kv_transfer_blocks": 78979,
|
||||
"total_cached_tokens": 108313387,
|
||||
"total_kv_transfer_blocks": 105235,
|
||||
"tpot_stats_s": {
|
||||
"count": 4440.0,
|
||||
"mean": 0.005882534704321737,
|
||||
"p50": 0.005807478777200416,
|
||||
"p90": 0.00712956755887717,
|
||||
"p99": 0.008372141476720572
|
||||
},
|
||||
"trace_path": "outputs/qwen3-30b-tp1-v3-kvaware/kvcache-centric-kv-aware-worker-admission-20260428T104343Z/sampled-trace.jsonl",
|
||||
"truncated_request_count": 42,
|
||||
"ttft_stats_s": {
|
||||
"count": 4440.0,
|
||||
"mean": 2.2045287611873334,
|
||||
"p50": 0.32809355948120356,
|
||||
"p90": 6.947275545448065,
|
||||
"p99": 16.705802395939827
|
||||
}
|
||||
}
|
||||
[2026-04-28 19:30:38] Saved to outputs/qwen3-30b-tp1-v3-kvaware/exp2_2p6d_kvc_kvaware_summary.json + exp2_2p6d_kvc_kvaware_metrics.jsonl
|
||||
[2026-04-28 19:30:38]
|
||||
[2026-04-28 19:30:38] === ALL TP1 V3 SWEEP EXPERIMENTS DONE ===
|
||||
@@ -0,0 +1,88 @@
|
||||
{
|
||||
"actual_output_tokens_stats": {
|
||||
"count": 4014.0,
|
||||
"mean": 215.048081714001,
|
||||
"p50": 83.0,
|
||||
"p90": 570.0,
|
||||
"p99": 1343.0
|
||||
},
|
||||
"cache_hit_request_count": 3865,
|
||||
"cached_tokens_stats": {
|
||||
"count": 4449.0,
|
||||
"mean": 21373.60867610699,
|
||||
"p50": 18429.0,
|
||||
"p90": 45643.0,
|
||||
"p99": 65088.0
|
||||
},
|
||||
"decode_request_priorities": {},
|
||||
"error_count": 435,
|
||||
"execution_modes": {
|
||||
"kvcache-centric": 435,
|
||||
"kvcache-direct-to-d-session": 2180,
|
||||
"pd-router-d-session-reseed": 44,
|
||||
"pd-router-d-session-reseed-after-eviction": 1,
|
||||
"pd-router-fallback-d-backpressure": 36,
|
||||
"pd-router-fallback-large-append": 35,
|
||||
"pd-router-fallback-large-append-seed-filter-early-turn": 52,
|
||||
"pd-router-fallback-large-append-session-cap": 1500,
|
||||
"pd-router-fallback-no-d-capacity": 13,
|
||||
"pd-router-fallback-session-cap": 43,
|
||||
"pd-router-large-append-reseed": 55,
|
||||
"pd-router-large-append-reseed-after-eviction": 3,
|
||||
"pd-router-turn1-d-backpressure": 1,
|
||||
"pd-router-turn1-no-d-capacity": 5,
|
||||
"pd-router-turn1-seed": 46
|
||||
},
|
||||
"latency_stats_s": {
|
||||
"count": 4014.0,
|
||||
"mean": 4.214657033050009,
|
||||
"p50": 1.0827504023909569,
|
||||
"p90": 13.380241627804935,
|
||||
"p99": 24.453291333280504
|
||||
},
|
||||
"mechanisms": {
|
||||
"kvcache-centric": 4449
|
||||
},
|
||||
"per_decode_load": {
|
||||
"decode-0": 690,
|
||||
"decode-1": 599,
|
||||
"decode-2": 660,
|
||||
"decode-3": 584,
|
||||
"decode-4": 606,
|
||||
"decode-5": 646,
|
||||
"decode-6": 664
|
||||
},
|
||||
"per_prefill_load": {
|
||||
"prefill-0": 4449
|
||||
},
|
||||
"prefill_request_priorities": {
|
||||
"-100": 149,
|
||||
"100": 1685
|
||||
},
|
||||
"re_prefill_count": 0,
|
||||
"request_count": 4449,
|
||||
"reuse_expected_count": 4397,
|
||||
"reuse_observed_count": 4397,
|
||||
"router_url": "http://127.0.0.1:8000",
|
||||
"session_reset_count": 0,
|
||||
"session_reused_count": 2180,
|
||||
"total_actual_kv_transfer_blocks": 52857,
|
||||
"total_cached_tokens": 95091185,
|
||||
"total_kv_transfer_blocks": 105235,
|
||||
"tpot_stats_s": {
|
||||
"count": 4014.0,
|
||||
"mean": 0.005804301410418847,
|
||||
"p50": 0.005607025208882987,
|
||||
"p90": 0.007293824862528552,
|
||||
"p99": 0.008864479259402893
|
||||
},
|
||||
"trace_path": "outputs/qwen3-30b-tp1-v4-cap16/kvcache-centric-kv-aware-worker-admission-20260428T125022Z/sampled-trace.jsonl",
|
||||
"truncated_request_count": 43,
|
||||
"ttft_stats_s": {
|
||||
"count": 4014.0,
|
||||
"mean": 2.915135478307124,
|
||||
"p50": 0.05643345229327679,
|
||||
"p90": 11.900803190656006,
|
||||
"p99": 22.758968392387033
|
||||
}
|
||||
}
|
||||
@@ -0,0 +1,86 @@
|
||||
{
|
||||
"actual_output_tokens_stats": {
|
||||
"count": 4046.0,
|
||||
"mean": 224.65002471576867,
|
||||
"p50": 84.0,
|
||||
"p90": 576.0,
|
||||
"p99": 1349.0
|
||||
},
|
||||
"cache_hit_request_count": 3925,
|
||||
"cached_tokens_stats": {
|
||||
"count": 4449.0,
|
||||
"mean": 22852.7439874129,
|
||||
"p50": 19584.0,
|
||||
"p90": 49009.0,
|
||||
"p99": 67320.0
|
||||
},
|
||||
"decode_request_priorities": {},
|
||||
"error_count": 403,
|
||||
"execution_modes": {
|
||||
"kvcache-centric": 403,
|
||||
"kvcache-direct-to-d-session": 2348,
|
||||
"pd-router-d-session-reseed": 28,
|
||||
"pd-router-fallback-d-backpressure": 7,
|
||||
"pd-router-fallback-large-append": 68,
|
||||
"pd-router-fallback-large-append-seed-filter-early-turn": 45,
|
||||
"pd-router-fallback-large-append-session-cap": 1403,
|
||||
"pd-router-fallback-no-d-capacity": 9,
|
||||
"pd-router-fallback-session-cap": 25,
|
||||
"pd-router-large-append-reseed": 57,
|
||||
"pd-router-large-append-reseed-after-eviction": 6,
|
||||
"pd-router-turn1-no-d-capacity": 1,
|
||||
"pd-router-turn1-seed": 49
|
||||
},
|
||||
"latency_stats_s": {
|
||||
"count": 4046.0,
|
||||
"mean": 2.505981629502371,
|
||||
"p50": 0.8372491216287017,
|
||||
"p90": 6.5139341270551085,
|
||||
"p99": 18.335972285829484
|
||||
},
|
||||
"mechanisms": {
|
||||
"kvcache-centric": 4449
|
||||
},
|
||||
"per_decode_load": {
|
||||
"decode-0": 767,
|
||||
"decode-1": 680,
|
||||
"decode-2": 906,
|
||||
"decode-3": 818,
|
||||
"decode-4": 800,
|
||||
"decode-5": 478
|
||||
},
|
||||
"per_prefill_load": {
|
||||
"prefill-0": 2225,
|
||||
"prefill-1": 2224
|
||||
},
|
||||
"prefill_request_priorities": {
|
||||
"-100": 140,
|
||||
"100": 1558
|
||||
},
|
||||
"re_prefill_count": 0,
|
||||
"request_count": 4449,
|
||||
"reuse_expected_count": 4397,
|
||||
"reuse_observed_count": 4397,
|
||||
"router_url": "http://127.0.0.1:8000",
|
||||
"session_reset_count": 0,
|
||||
"session_reused_count": 2348,
|
||||
"total_actual_kv_transfer_blocks": 50727,
|
||||
"total_cached_tokens": 101671858,
|
||||
"total_kv_transfer_blocks": 105235,
|
||||
"tpot_stats_s": {
|
||||
"count": 4046.0,
|
||||
"mean": 0.005708743129332261,
|
||||
"p50": 0.005565466725497757,
|
||||
"p90": 0.006912594398356141,
|
||||
"p99": 0.008102089307750717
|
||||
},
|
||||
"trace_path": "outputs/qwen3-30b-tp1-v4-cap16/kvcache-centric-kv-aware-worker-admission-20260428T134057Z/sampled-trace.jsonl",
|
||||
"truncated_request_count": 36,
|
||||
"ttft_stats_s": {
|
||||
"count": 4046.0,
|
||||
"mean": 1.1653790952959129,
|
||||
"p50": 0.05140436999499798,
|
||||
"p90": 2.6447059931233525,
|
||||
"p99": 15.121314341202378
|
||||
}
|
||||
}
|
||||
190
outputs/qwen3-30b-tp1-v4-cap16/sweep_results.txt
Normal file
190
outputs/qwen3-30b-tp1-v4-cap16/sweep_results.txt
Normal file
@@ -0,0 +1,190 @@
|
||||
[2026-04-28 20:50:21] Starting TP1 v4 sweep (KVC kv-aware, session soft_cap raised 4->16)
|
||||
[2026-04-28 20:50:21] Model: /mnt/kzlin/workflow/pd-hybrid/simm-swe-bench/models/Qwen3-30B-A3B-Instruct-2507
|
||||
[2026-04-28 20:50:21] Trace: outputs/qwen35-swebench-50sess.jsonl (4449 requests, 52 sessions)
|
||||
[2026-04-28 20:50:21] Key change: _decode_session_soft_cap now min(16, ...) instead of min(4, ...)
|
||||
[2026-04-28 20:50:21]
|
||||
[2026-04-28 20:50:21] === [EXP1] 1P7D KVC kv-aware cap=16 ===
|
||||
[2026-04-28 21:40:57] === exp1_1p7d_kvc_cap16 COMPLETED ===
|
||||
[2026-04-28 21:40:57] Summary:
|
||||
{
|
||||
"actual_output_tokens_stats": {
|
||||
"count": 4014.0,
|
||||
"mean": 215.048081714001,
|
||||
"p50": 83.0,
|
||||
"p90": 570.0,
|
||||
"p99": 1343.0
|
||||
},
|
||||
"cache_hit_request_count": 3865,
|
||||
"cached_tokens_stats": {
|
||||
"count": 4449.0,
|
||||
"mean": 21373.60867610699,
|
||||
"p50": 18429.0,
|
||||
"p90": 45643.0,
|
||||
"p99": 65088.0
|
||||
},
|
||||
"decode_request_priorities": {},
|
||||
"error_count": 435,
|
||||
"execution_modes": {
|
||||
"kvcache-centric": 435,
|
||||
"kvcache-direct-to-d-session": 2180,
|
||||
"pd-router-d-session-reseed": 44,
|
||||
"pd-router-d-session-reseed-after-eviction": 1,
|
||||
"pd-router-fallback-d-backpressure": 36,
|
||||
"pd-router-fallback-large-append": 35,
|
||||
"pd-router-fallback-large-append-seed-filter-early-turn": 52,
|
||||
"pd-router-fallback-large-append-session-cap": 1500,
|
||||
"pd-router-fallback-no-d-capacity": 13,
|
||||
"pd-router-fallback-session-cap": 43,
|
||||
"pd-router-large-append-reseed": 55,
|
||||
"pd-router-large-append-reseed-after-eviction": 3,
|
||||
"pd-router-turn1-d-backpressure": 1,
|
||||
"pd-router-turn1-no-d-capacity": 5,
|
||||
"pd-router-turn1-seed": 46
|
||||
},
|
||||
"latency_stats_s": {
|
||||
"count": 4014.0,
|
||||
"mean": 4.214657033050009,
|
||||
"p50": 1.0827504023909569,
|
||||
"p90": 13.380241627804935,
|
||||
"p99": 24.453291333280504
|
||||
},
|
||||
"mechanisms": {
|
||||
"kvcache-centric": 4449
|
||||
},
|
||||
"per_decode_load": {
|
||||
"decode-0": 690,
|
||||
"decode-1": 599,
|
||||
"decode-2": 660,
|
||||
"decode-3": 584,
|
||||
"decode-4": 606,
|
||||
"decode-5": 646,
|
||||
"decode-6": 664
|
||||
},
|
||||
"per_prefill_load": {
|
||||
"prefill-0": 4449
|
||||
},
|
||||
"prefill_request_priorities": {
|
||||
"-100": 149,
|
||||
"100": 1685
|
||||
},
|
||||
"re_prefill_count": 0,
|
||||
"request_count": 4449,
|
||||
"reuse_expected_count": 4397,
|
||||
"reuse_observed_count": 4397,
|
||||
"router_url": "http://127.0.0.1:8000",
|
||||
"session_reset_count": 0,
|
||||
"session_reused_count": 2180,
|
||||
"total_actual_kv_transfer_blocks": 52857,
|
||||
"total_cached_tokens": 95091185,
|
||||
"total_kv_transfer_blocks": 105235,
|
||||
"tpot_stats_s": {
|
||||
"count": 4014.0,
|
||||
"mean": 0.005804301410418847,
|
||||
"p50": 0.005607025208882987,
|
||||
"p90": 0.007293824862528552,
|
||||
"p99": 0.008864479259402893
|
||||
},
|
||||
"trace_path": "outputs/qwen3-30b-tp1-v4-cap16/kvcache-centric-kv-aware-worker-admission-20260428T125022Z/sampled-trace.jsonl",
|
||||
"truncated_request_count": 43,
|
||||
"ttft_stats_s": {
|
||||
"count": 4014.0,
|
||||
"mean": 2.915135478307124,
|
||||
"p50": 0.05643345229327679,
|
||||
"p90": 11.900803190656006,
|
||||
"p99": 22.758968392387033
|
||||
}
|
||||
}
|
||||
[2026-04-28 21:40:57] Saved to outputs/qwen3-30b-tp1-v4-cap16/exp1_1p7d_kvc_cap16_summary.json + exp1_1p7d_kvc_cap16_metrics.jsonl
|
||||
[2026-04-28 21:40:57]
|
||||
[2026-04-28 21:40:57] === [EXP2] 2P6D KVC kv-aware cap=16 ===
|
||||
[2026-04-28 22:27:53] === exp2_2p6d_kvc_cap16 COMPLETED ===
|
||||
[2026-04-28 22:27:53] Summary:
|
||||
{
|
||||
"actual_output_tokens_stats": {
|
||||
"count": 4046.0,
|
||||
"mean": 224.65002471576867,
|
||||
"p50": 84.0,
|
||||
"p90": 576.0,
|
||||
"p99": 1349.0
|
||||
},
|
||||
"cache_hit_request_count": 3925,
|
||||
"cached_tokens_stats": {
|
||||
"count": 4449.0,
|
||||
"mean": 22852.7439874129,
|
||||
"p50": 19584.0,
|
||||
"p90": 49009.0,
|
||||
"p99": 67320.0
|
||||
},
|
||||
"decode_request_priorities": {},
|
||||
"error_count": 403,
|
||||
"execution_modes": {
|
||||
"kvcache-centric": 403,
|
||||
"kvcache-direct-to-d-session": 2348,
|
||||
"pd-router-d-session-reseed": 28,
|
||||
"pd-router-fallback-d-backpressure": 7,
|
||||
"pd-router-fallback-large-append": 68,
|
||||
"pd-router-fallback-large-append-seed-filter-early-turn": 45,
|
||||
"pd-router-fallback-large-append-session-cap": 1403,
|
||||
"pd-router-fallback-no-d-capacity": 9,
|
||||
"pd-router-fallback-session-cap": 25,
|
||||
"pd-router-large-append-reseed": 57,
|
||||
"pd-router-large-append-reseed-after-eviction": 6,
|
||||
"pd-router-turn1-no-d-capacity": 1,
|
||||
"pd-router-turn1-seed": 49
|
||||
},
|
||||
"latency_stats_s": {
|
||||
"count": 4046.0,
|
||||
"mean": 2.505981629502371,
|
||||
"p50": 0.8372491216287017,
|
||||
"p90": 6.5139341270551085,
|
||||
"p99": 18.335972285829484
|
||||
},
|
||||
"mechanisms": {
|
||||
"kvcache-centric": 4449
|
||||
},
|
||||
"per_decode_load": {
|
||||
"decode-0": 767,
|
||||
"decode-1": 680,
|
||||
"decode-2": 906,
|
||||
"decode-3": 818,
|
||||
"decode-4": 800,
|
||||
"decode-5": 478
|
||||
},
|
||||
"per_prefill_load": {
|
||||
"prefill-0": 2225,
|
||||
"prefill-1": 2224
|
||||
},
|
||||
"prefill_request_priorities": {
|
||||
"-100": 140,
|
||||
"100": 1558
|
||||
},
|
||||
"re_prefill_count": 0,
|
||||
"request_count": 4449,
|
||||
"reuse_expected_count": 4397,
|
||||
"reuse_observed_count": 4397,
|
||||
"router_url": "http://127.0.0.1:8000",
|
||||
"session_reset_count": 0,
|
||||
"session_reused_count": 2348,
|
||||
"total_actual_kv_transfer_blocks": 50727,
|
||||
"total_cached_tokens": 101671858,
|
||||
"total_kv_transfer_blocks": 105235,
|
||||
"tpot_stats_s": {
|
||||
"count": 4046.0,
|
||||
"mean": 0.005708743129332261,
|
||||
"p50": 0.005565466725497757,
|
||||
"p90": 0.006912594398356141,
|
||||
"p99": 0.008102089307750717
|
||||
},
|
||||
"trace_path": "outputs/qwen3-30b-tp1-v4-cap16/kvcache-centric-kv-aware-worker-admission-20260428T134057Z/sampled-trace.jsonl",
|
||||
"truncated_request_count": 36,
|
||||
"ttft_stats_s": {
|
||||
"count": 4046.0,
|
||||
"mean": 1.1653790952959129,
|
||||
"p50": 0.05140436999499798,
|
||||
"p90": 2.6447059931233525,
|
||||
"p99": 15.121314341202378
|
||||
}
|
||||
}
|
||||
[2026-04-28 22:27:53] Saved to outputs/qwen3-30b-tp1-v4-cap16/exp2_2p6d_kvc_cap16_summary.json + exp2_2p6d_kvc_cap16_metrics.jsonl
|
||||
[2026-04-28 22:27:53]
|
||||
[2026-04-28 22:27:53] === ALL TP1 V4 SWEEP EXPERIMENTS DONE ===
|
||||
@@ -7,7 +7,7 @@ requires-python = ">=3.12"
|
||||
dependencies = [
|
||||
"httpx>=0.28.1",
|
||||
"mooncake-transfer-engine",
|
||||
"sglang==0.5.10",
|
||||
"sglang",
|
||||
]
|
||||
|
||||
[project.scripts]
|
||||
@@ -20,5 +20,21 @@ build-backend = "setuptools.build_meta"
|
||||
[tool.setuptools.packages.find]
|
||||
where = ["src"]
|
||||
|
||||
[dependency-groups]
|
||||
# Pure-Python unit tests. Install via:
|
||||
# uv sync --group test
|
||||
# These tests deliberately import only the algorithm-layer modules
|
||||
# (policies, trace, topology) so they run without SGLang / GPU / CUDA.
|
||||
test = [
|
||||
"pytest>=8.0",
|
||||
]
|
||||
|
||||
[tool.uv]
|
||||
prerelease = "allow"
|
||||
|
||||
[tool.uv.sources]
|
||||
sglang = { path = "third_party/sglang/python", editable = true }
|
||||
|
||||
[tool.pytest.ini_options]
|
||||
testpaths = ["tests"]
|
||||
addopts = "-q"
|
||||
|
||||
191
scripts/analysis/analyze_backpressure_smoke.py
Executable file
191
scripts/analysis/analyze_backpressure_smoke.py
Executable file
@@ -0,0 +1,191 @@
|
||||
#!/usr/bin/env python3
|
||||
"""Analyze backpressure smoke sweep outputs.
|
||||
|
||||
For each run dir with a `request-metrics.jsonl` and the new `structural/`
|
||||
subdir (admission-events.jsonl, backpressure-events.jsonl,
|
||||
session-d-binding.jsonl), report:
|
||||
|
||||
- Headline (errors, latency, ttft, direct-to-D rate)
|
||||
- Backpressure pause histogram (count, p50/p90 sleep, total pause time per D)
|
||||
- Admission probe stats (RPC count, mean RTT, queue_depth distribution,
|
||||
pause_ms distribution)
|
||||
- Session pinning (distinct D per session, bimodal direct-to-D rate)
|
||||
"""
|
||||
from __future__ import annotations
|
||||
|
||||
import argparse
|
||||
import json
|
||||
import statistics
|
||||
from collections import Counter, defaultdict
|
||||
from pathlib import Path
|
||||
|
||||
|
||||
def load_jsonl(path: Path) -> list[dict]:
|
||||
if not path.exists():
|
||||
return []
|
||||
return [json.loads(l) for l in path.open("r", encoding="utf-8") if l.strip()]
|
||||
|
||||
|
||||
def summarize_run(run_dir: Path) -> dict:
|
||||
metrics_path = next(run_dir.rglob("request-metrics.jsonl"), None)
|
||||
if metrics_path is None:
|
||||
return {"run_dir": str(run_dir), "error": "no request-metrics.jsonl"}
|
||||
|
||||
summary_path = metrics_path.with_suffix(metrics_path.suffix + ".summary.json")
|
||||
summary = (
|
||||
json.load(summary_path.open()) if summary_path.exists() else {}
|
||||
)
|
||||
|
||||
structural_dir = run_dir / "structural"
|
||||
if not structural_dir.exists():
|
||||
# try metrics dir's parent / structural
|
||||
structural_dir = metrics_path.parent / "structural"
|
||||
|
||||
admission_events = load_jsonl(structural_dir / "admission-events.jsonl")
|
||||
backpressure_events = load_jsonl(structural_dir / "backpressure-events.jsonl")
|
||||
binding_events = load_jsonl(structural_dir / "session-d-binding.jsonl")
|
||||
|
||||
out: dict = {"run_dir": str(run_dir)}
|
||||
|
||||
# Headline metrics from summary.json
|
||||
out["request_count"] = summary.get("request_count")
|
||||
out["error_count"] = summary.get("error_count")
|
||||
out["latency"] = summary.get("latency_stats_s")
|
||||
out["ttft"] = summary.get("ttft_stats_s")
|
||||
out["execution_modes"] = summary.get("execution_modes")
|
||||
out["per_decode_load"] = summary.get("per_decode_load")
|
||||
out["per_prefill_load"] = summary.get("per_prefill_load")
|
||||
|
||||
# Direct-to-D rate from execution_modes
|
||||
em = summary.get("execution_modes", {}) or {}
|
||||
direct = em.get("kvcache-direct-to-d-session", 0)
|
||||
total = sum(em.values()) or 1
|
||||
out["direct_to_d_rate"] = direct / total
|
||||
|
||||
# Session pinning
|
||||
bind_per_session: dict[str, set[int]] = defaultdict(set)
|
||||
for ev in binding_events:
|
||||
bind_per_session[ev["session_id"]].add(ev["decode_worker_index"])
|
||||
if bind_per_session:
|
||||
out["session_count"] = len(bind_per_session)
|
||||
out["avg_distinct_d_per_session"] = (
|
||||
sum(len(v) for v in bind_per_session.values()) / len(bind_per_session)
|
||||
)
|
||||
else:
|
||||
out["session_count"] = 0
|
||||
out["avg_distinct_d_per_session"] = None
|
||||
|
||||
# Direct-to-D rate per session (bimodal check)
|
||||
records = load_jsonl(metrics_path)
|
||||
sess_records: dict[str, list[dict]] = defaultdict(list)
|
||||
for r in records:
|
||||
sess_records[r["session_id"]].append(r)
|
||||
rates = []
|
||||
for sid, turns in sess_records.items():
|
||||
ndir = sum(
|
||||
1 for t in turns if t.get("execution_mode") == "kvcache-direct-to-d-session"
|
||||
)
|
||||
rates.append(ndir / len(turns))
|
||||
if rates:
|
||||
buckets = [0, 0, 0, 0, 0]
|
||||
for r in rates:
|
||||
buckets[min(4, int(r * 5))] += 1
|
||||
out["direct_to_d_rate_buckets"] = {
|
||||
"0-20%": buckets[0],
|
||||
"20-40%": buckets[1],
|
||||
"40-60%": buckets[2],
|
||||
"60-80%": buckets[3],
|
||||
"80-100%": buckets[4],
|
||||
}
|
||||
|
||||
# Backpressure events
|
||||
if backpressure_events:
|
||||
sleeps = [ev["sleep_s"] for ev in backpressure_events]
|
||||
out["backpressure"] = {
|
||||
"event_count": len(backpressure_events),
|
||||
"total_sleep_s": round(sum(sleeps), 2),
|
||||
"sleep_p50_s": round(statistics.median(sleeps), 4),
|
||||
"sleep_p90_s": round(
|
||||
sorted(sleeps)[int(len(sleeps) * 0.9)] if sleeps else 0, 4
|
||||
),
|
||||
"events_per_d": dict(
|
||||
Counter(ev["server_url"] for ev in backpressure_events).most_common()
|
||||
),
|
||||
}
|
||||
else:
|
||||
out["backpressure"] = {"event_count": 0, "note": "no backpressure events"}
|
||||
|
||||
# Admission probe stats
|
||||
if admission_events:
|
||||
rtts = [ev["rtt_s"] for ev in admission_events]
|
||||
depths = [ev.get("queue_depth", 0) for ev in admission_events]
|
||||
pauses = [ev.get("recommended_pause_ms", 0) for ev in admission_events]
|
||||
out["admission_probes"] = {
|
||||
"count": len(admission_events),
|
||||
"mean_rtt_s": round(sum(rtts) / len(rtts), 4),
|
||||
"p99_rtt_s": round(sorted(rtts)[int(len(rtts) * 0.99)], 4),
|
||||
"queue_depth_p50": int(statistics.median(depths)),
|
||||
"queue_depth_p90": int(sorted(depths)[int(len(depths) * 0.9)]),
|
||||
"queue_depth_max": max(depths),
|
||||
"pause_ms_p50": int(statistics.median(pauses)),
|
||||
"pause_ms_p90": int(sorted(pauses)[int(len(pauses) * 0.9)]),
|
||||
"pause_ms_max": max(pauses),
|
||||
"nonzero_pause_count": sum(1 for p in pauses if p > 0),
|
||||
"by_reason": dict(
|
||||
Counter(ev.get("reason") or "ok" for ev in admission_events).most_common()
|
||||
),
|
||||
}
|
||||
|
||||
return out
|
||||
|
||||
|
||||
def main() -> None:
|
||||
ap = argparse.ArgumentParser()
|
||||
ap.add_argument("sweep_root", type=Path)
|
||||
ap.add_argument("--json", action="store_true", help="emit JSON only")
|
||||
args = ap.parse_args()
|
||||
|
||||
summaries = []
|
||||
for run_dir in sorted(args.sweep_root.iterdir()):
|
||||
if not run_dir.is_dir():
|
||||
continue
|
||||
summary = summarize_run(run_dir)
|
||||
summaries.append(summary)
|
||||
|
||||
if args.json:
|
||||
print(json.dumps(summaries, indent=2))
|
||||
return
|
||||
|
||||
for s in summaries:
|
||||
print(f"\n{'=' * 70}")
|
||||
print(f" {s['run_dir']}")
|
||||
print(f"{'=' * 70}")
|
||||
if "error" in s:
|
||||
print(f" ERROR: {s['error']}")
|
||||
continue
|
||||
print(f" reqs={s.get('request_count')} errors={s.get('error_count')}")
|
||||
if s.get("latency"):
|
||||
lt = s["latency"]
|
||||
print(
|
||||
f" latency: mean={lt.get('mean'):.3f} "
|
||||
f"p50={lt.get('p50'):.3f} p90={lt.get('p90'):.3f} p99={lt.get('p99'):.3f}"
|
||||
)
|
||||
if s.get("ttft"):
|
||||
tt = s["ttft"]
|
||||
print(
|
||||
f" ttft: mean={tt.get('mean'):.3f} "
|
||||
f"p50={tt.get('p50'):.3f} p90={tt.get('p90'):.3f}"
|
||||
)
|
||||
print(f" direct_to_d_rate: {s.get('direct_to_d_rate', 0) * 100:.1f}%")
|
||||
print(f" sessions: {s.get('session_count')} | "
|
||||
f"avg distinct-D-per-session: {s.get('avg_distinct_d_per_session')}")
|
||||
if s.get("direct_to_d_rate_buckets"):
|
||||
print(f" direct-to-D distribution by session: {s['direct_to_d_rate_buckets']}")
|
||||
if s.get("backpressure"):
|
||||
print(f" backpressure: {s['backpressure']}")
|
||||
if s.get("admission_probes"):
|
||||
print(f" admission probes: {s['admission_probes']}")
|
||||
|
||||
|
||||
if __name__ == "__main__":
|
||||
main()
|
||||
83
scripts/analysis/analyze_errors.py
Normal file
83
scripts/analysis/analyze_errors.py
Normal file
@@ -0,0 +1,83 @@
|
||||
#!/usr/bin/env python3
|
||||
"""Deep dive into v4 errors: which path, which D, which session, which turn."""
|
||||
import json
|
||||
import numpy as np
|
||||
from pathlib import Path
|
||||
from collections import Counter, defaultdict
|
||||
|
||||
BASE = Path(__file__).parent
|
||||
|
||||
def load_rows(jsonl_path):
|
||||
rows = []
|
||||
with open(jsonl_path) as f:
|
||||
for line in f:
|
||||
rows.append(json.loads(line))
|
||||
return rows
|
||||
|
||||
# Compare v3 and v4 errors
|
||||
for label, path in [
|
||||
("v3 1P7D", BASE.parent / "qwen3-30b-tp1-v3-kvaware/exp1_1p7d_kvc_kvaware_metrics.jsonl"),
|
||||
("v4 1P7D", BASE / "exp1_1p7d_kvc_cap16_metrics.jsonl"),
|
||||
("v3 2P6D", BASE.parent / "qwen3-30b-tp1-v3-kvaware/exp2_2p6d_kvc_kvaware_metrics.jsonl"),
|
||||
("v4 2P6D", BASE / "exp2_2p6d_kvc_cap16_metrics.jsonl"),
|
||||
]:
|
||||
if not path.exists():
|
||||
print(f"\nSKIP {label}: {path} not found")
|
||||
continue
|
||||
rows = load_rows(path)
|
||||
err = [r for r in rows if r.get("error") is not None]
|
||||
print(f"\n========== {label} ({len(err)} errors / {len(rows)} total = {len(err)/len(rows)*100:.1f}%) ==========")
|
||||
|
||||
# Error finish_reason distribution
|
||||
fr_counter = Counter()
|
||||
for r in err:
|
||||
fr = str(r.get("finish_reason") or r.get("error") or "?")
|
||||
fr_counter[fr[:80]] += 1
|
||||
print(f"finish_reason distribution:")
|
||||
for fr, cnt in fr_counter.most_common():
|
||||
print(f" {cnt:>4}x {fr}")
|
||||
|
||||
# Errors by execution mode (these are aborted before mode assignment usually)
|
||||
mode_counter = Counter(r.get("execution_mode", "?") for r in err)
|
||||
print(f"\nerror by execution_mode:")
|
||||
for mode, cnt in mode_counter.most_common():
|
||||
print(f" {cnt:>4}x {mode}")
|
||||
|
||||
# Errors per D worker
|
||||
dw_counter = Counter(r.get("assigned_decode_node", "?") for r in err)
|
||||
print(f"\nerror per assigned_decode_node:")
|
||||
for dw, cnt in dw_counter.most_common():
|
||||
print(f" {cnt:>4}x {dw}")
|
||||
|
||||
# Errors by turn distribution
|
||||
turn_counter = Counter(r.get("turn_id", -1) for r in err)
|
||||
early = sum(c for t, c in turn_counter.items() if t <= 5)
|
||||
mid = sum(c for t, c in turn_counter.items() if 5 < t <= 30)
|
||||
late = sum(c for t, c in turn_counter.items() if t > 30)
|
||||
print(f"\nerror by turn: early(0-5)={early} mid(6-30)={mid} late(31+)={late}")
|
||||
|
||||
# Per-session error rate
|
||||
per_sess_err = defaultdict(int)
|
||||
per_sess_total = defaultdict(int)
|
||||
for r in rows:
|
||||
per_sess_total[r["session_id"]] += 1
|
||||
if r.get("error") is not None:
|
||||
per_sess_err[r["session_id"]] += 1
|
||||
sess_with_err = [(sid, per_sess_err[sid], per_sess_total[sid]) for sid in per_sess_err]
|
||||
sess_with_err.sort(key=lambda x: -x[1])
|
||||
print(f"\ntop 5 sessions by error count:")
|
||||
for sid, e, t in sess_with_err[:5]:
|
||||
print(f" session {sid}: {e}/{t} errors ({e/t*100:.0f}%)")
|
||||
|
||||
# Errors timeline: are they bursty?
|
||||
err_ts = sorted([r.get("trace_timestamp_s", 0) for r in err])
|
||||
if err_ts:
|
||||
first_ts = err_ts[0]
|
||||
last_ts = err_ts[-1]
|
||||
all_ts = sorted([r.get("trace_timestamp_s", 0) for r in rows])
|
||||
first_all = all_ts[0]
|
||||
last_all = all_ts[-1]
|
||||
run_duration = last_all - first_all
|
||||
err_first_pct = (err_ts[0] - first_all) / run_duration * 100 if run_duration > 0 else 0
|
||||
err_last_pct = (err_ts[-1] - first_all) / run_duration * 100 if run_duration > 0 else 0
|
||||
print(f"\nerror time range (% of run): {err_first_pct:.1f}% - {err_last_pct:.1f}%")
|
||||
346
scripts/analysis/analyze_pool_timeseries.py
Executable file
346
scripts/analysis/analyze_pool_timeseries.py
Executable file
@@ -0,0 +1,346 @@
|
||||
#!/usr/bin/env python3
|
||||
"""Analyze d-pool-timeseries.jsonl produced by --pool-poll-interval-s.
|
||||
|
||||
Answers v6's main question: where is D's KV pool actually spent?
|
||||
|
||||
For each decode worker, decomposes capacity over the run wall-clock into:
|
||||
- resident_held_active = held - idle_evictable (sessions in active use)
|
||||
- resident_held_idle = idle_evictable (sessions kept around but evictable)
|
||||
- prefill_backup_or_other = capacity - held - available (everything else: backup blocks,
|
||||
in-flight transfers, fragmentation)
|
||||
- free_available = available
|
||||
|
||||
Also reports session residency churn (how many distinct sessions ever resided per D, and
|
||||
how often a session bounced between workers — a strong starvation signal).
|
||||
|
||||
Usage:
|
||||
python scripts/analysis/analyze_pool_timeseries.py <run_dir>
|
||||
or
|
||||
python scripts/analysis/analyze_pool_timeseries.py <pool_timeseries.jsonl>
|
||||
|
||||
Output: human-readable text. Add --json to also print a machine-readable summary.
|
||||
"""
|
||||
from __future__ import annotations
|
||||
|
||||
import argparse
|
||||
import json
|
||||
import statistics
|
||||
from collections import Counter, defaultdict
|
||||
from pathlib import Path
|
||||
from typing import Any
|
||||
|
||||
|
||||
def _load_jsonl(path: Path) -> list[dict[str, Any]]:
|
||||
rows: list[dict[str, Any]] = []
|
||||
with path.open() as fh:
|
||||
for line in fh:
|
||||
line = line.strip()
|
||||
if not line:
|
||||
continue
|
||||
rows.append(json.loads(line))
|
||||
return rows
|
||||
|
||||
|
||||
def _resolve_input(path: Path) -> Path:
|
||||
if path.is_file():
|
||||
return path
|
||||
if path.is_dir():
|
||||
candidate = path / "d-pool-timeseries.jsonl"
|
||||
if candidate.is_file():
|
||||
return candidate
|
||||
raise FileNotFoundError(
|
||||
f"{candidate} not found; pass the file directly or a run dir containing it."
|
||||
)
|
||||
raise FileNotFoundError(path)
|
||||
|
||||
|
||||
def _percentile(values: list[float], p: float) -> float:
|
||||
if not values:
|
||||
return 0.0
|
||||
s = sorted(values)
|
||||
idx = min(len(s) - 1, max(0, int(round((len(s) - 1) * p))))
|
||||
return s[idx]
|
||||
|
||||
|
||||
def _fmt_tokens(n: float) -> str:
|
||||
if n >= 1_000_000:
|
||||
return f"{n / 1_000_000:.2f}M"
|
||||
if n >= 1_000:
|
||||
return f"{n / 1_000:.1f}K"
|
||||
return f"{int(n)}"
|
||||
|
||||
|
||||
def _fmt_pct(n: float, total: float) -> str:
|
||||
if total <= 0:
|
||||
return " - "
|
||||
return f"{100 * n / total:5.1f}%"
|
||||
|
||||
|
||||
def analyze(timeseries_path: Path) -> dict[str, Any]:
|
||||
rows = _load_jsonl(timeseries_path)
|
||||
if not rows:
|
||||
raise ValueError(f"empty timeseries: {timeseries_path}")
|
||||
|
||||
by_worker: dict[str, list[dict[str, Any]]] = defaultdict(list)
|
||||
for row in rows:
|
||||
if row.get("error") and "session_cache_enabled" not in row:
|
||||
# poller failed at this tick — skip
|
||||
continue
|
||||
wid = row.get("worker_id") or "?"
|
||||
by_worker[wid].append(row)
|
||||
|
||||
summary: dict[str, Any] = {
|
||||
"timeseries_path": str(timeseries_path),
|
||||
"total_rows": len(rows),
|
||||
"tick_count": len(by_worker[next(iter(by_worker))]) if by_worker else 0,
|
||||
"wall_s_span": (
|
||||
max(r.get("wall_s", 0.0) for r in rows)
|
||||
- min(r.get("wall_s", 0.0) for r in rows)
|
||||
),
|
||||
"workers": {},
|
||||
}
|
||||
|
||||
print(f"\n=== Pool timeseries: {timeseries_path}")
|
||||
print(
|
||||
f" rows={summary['total_rows']} workers={len(by_worker)} "
|
||||
f"span={summary['wall_s_span']:.1f}s"
|
||||
)
|
||||
|
||||
# Print per-worker decomposition table
|
||||
header = (
|
||||
f"{'worker':<12} {'role':<8} {'cap':>8} | "
|
||||
f"{'avg_active':>10} {'avg_idle':>10} {'avg_other':>10} {'avg_free':>10} | "
|
||||
f"{'p90_held':>10} {'max_held':>10} {'p90_avail':>10}"
|
||||
)
|
||||
print(header)
|
||||
print("-" * len(header))
|
||||
|
||||
for wid in sorted(by_worker.keys()):
|
||||
ws = by_worker[wid]
|
||||
role = ws[0].get("worker_role", "?")
|
||||
cap_vals = [int(r.get("capacity_tokens") or 0) for r in ws]
|
||||
held_vals = [int(r.get("held_tokens") or 0) for r in ws]
|
||||
avail_vals = [int(r.get("available_tokens") or 0) for r in ws]
|
||||
idle_vals = [int(r.get("idle_evictable_tokens") or 0) for r in ws]
|
||||
# active = held - idle (sessions in active use)
|
||||
active_vals = [max(0, h - i) for h, i in zip(held_vals, idle_vals)]
|
||||
# other = capacity - held - available (prefill backup blocks, in-flight, fragmentation)
|
||||
other_vals = [
|
||||
max(0, c - h - a) for c, h, a in zip(cap_vals, held_vals, avail_vals)
|
||||
]
|
||||
cap = max(cap_vals) if cap_vals else 0
|
||||
|
||||
avg_active = statistics.fmean(active_vals) if active_vals else 0.0
|
||||
avg_idle = statistics.fmean(idle_vals) if idle_vals else 0.0
|
||||
avg_other = statistics.fmean(other_vals) if other_vals else 0.0
|
||||
avg_avail = statistics.fmean(avail_vals) if avail_vals else 0.0
|
||||
|
||||
p90_held = _percentile([float(v) for v in held_vals], 0.90)
|
||||
max_held = max(held_vals) if held_vals else 0
|
||||
p90_avail = _percentile([float(v) for v in avail_vals], 0.90)
|
||||
|
||||
sess_counts = [int(r.get("session_count") or 0) for r in ws]
|
||||
resident_counts = [int(r.get("resident_session_count") or 0) for r in ws]
|
||||
|
||||
print(
|
||||
f"{wid:<12} {role:<8} {_fmt_tokens(cap):>8} | "
|
||||
f"{_fmt_tokens(avg_active):>4} {_fmt_pct(avg_active, cap):>5} "
|
||||
f"{_fmt_tokens(avg_idle):>4} {_fmt_pct(avg_idle, cap):>5} "
|
||||
f"{_fmt_tokens(avg_other):>4} {_fmt_pct(avg_other, cap):>5} "
|
||||
f"{_fmt_tokens(avg_avail):>4} {_fmt_pct(avg_avail, cap):>5} | "
|
||||
f"{_fmt_tokens(p90_held):>10} {_fmt_tokens(max_held):>10} "
|
||||
f"{_fmt_tokens(p90_avail):>10}"
|
||||
)
|
||||
|
||||
summary["workers"][wid] = {
|
||||
"role": role,
|
||||
"capacity_tokens": cap,
|
||||
"avg_active_held_tokens": avg_active,
|
||||
"avg_idle_evictable_tokens": avg_idle,
|
||||
"avg_other_tokens": avg_other,
|
||||
"avg_available_tokens": avg_avail,
|
||||
"p90_held_tokens": p90_held,
|
||||
"max_held_tokens": max_held,
|
||||
"p90_available_tokens": p90_avail,
|
||||
"max_session_count": max(sess_counts) if sess_counts else 0,
|
||||
"max_resident_session_count": (
|
||||
max(resident_counts) if resident_counts else 0
|
||||
),
|
||||
"ticks": len(ws),
|
||||
}
|
||||
|
||||
print(
|
||||
"\nLegend: active=held-idle idle=idle_evictable "
|
||||
"other=cap-held-avail (radix-protected + running-batch + in-flight + frag)"
|
||||
)
|
||||
|
||||
# P1: decomposition of "other" using pool_breakdown fields (zeros if instrument absent)
|
||||
has_breakdown = any(
|
||||
any(r.get(k) for k in (
|
||||
"radix_evictable_tokens",
|
||||
"radix_protected_tokens",
|
||||
"running_batch_kv_tokens",
|
||||
"transfer_queue_tokens",
|
||||
"prealloc_queue_tokens",
|
||||
"retracted_queue_tokens",
|
||||
))
|
||||
for r in rows
|
||||
)
|
||||
|
||||
if has_breakdown:
|
||||
print("\n=== P1 'other' decomposition (per worker, mean over run) ===")
|
||||
print(
|
||||
f"{'worker':<12} {'role':<8} | "
|
||||
f"{'r_evictable':>11} {'r_protected':>11} {'slot_private':>12} | "
|
||||
f"{'run_batch':>10} {'transfer':>9} {'prealloc':>9} {'retracted':>10} | "
|
||||
f"{'unaccounted':>11}"
|
||||
)
|
||||
for wid in sorted(by_worker.keys()):
|
||||
ws = by_worker[wid]
|
||||
role = ws[0].get("worker_role", "?")
|
||||
cap = max(int(r.get("capacity_tokens") or 0) for r in ws)
|
||||
|
||||
def m(field: str) -> float:
|
||||
vals = [int(r.get(field) or 0) for r in ws]
|
||||
return statistics.fmean(vals) if vals else 0.0
|
||||
|
||||
r_ev = m("radix_evictable_tokens")
|
||||
r_pr = m("radix_protected_tokens")
|
||||
slot = m("slot_private_held_tokens")
|
||||
rb = m("running_batch_kv_tokens")
|
||||
tq = m("transfer_queue_tokens")
|
||||
pq = m("prealloc_queue_tokens")
|
||||
rq = m("retracted_queue_tokens")
|
||||
avail = m("available_tokens")
|
||||
# `running_batch_kv_tokens` overlaps with radix_protected for tree-tracked
|
||||
# reqs — do NOT subtract it again. Decomposition assumes:
|
||||
# capacity ≈ avail + r_evictable + r_protected + slot_private
|
||||
# + transfer_queue + prealloc_queue + retracted_queue + unaccounted
|
||||
unacc = max(
|
||||
0,
|
||||
cap - avail - r_ev - r_pr - slot - tq - pq - rq,
|
||||
)
|
||||
print(
|
||||
f"{wid:<12} {role:<8} | "
|
||||
f"{_fmt_tokens(r_ev):>11} {_fmt_tokens(r_pr):>11} {_fmt_tokens(slot):>12} | "
|
||||
f"{_fmt_tokens(rb):>10} {_fmt_tokens(tq):>9} {_fmt_tokens(pq):>9} {_fmt_tokens(rq):>10} | "
|
||||
f"{_fmt_tokens(unacc):>11}"
|
||||
)
|
||||
|
||||
summary["workers"][wid]["pool_breakdown_avg"] = {
|
||||
"radix_evictable": r_ev,
|
||||
"radix_protected": r_pr,
|
||||
"slot_private_held": slot,
|
||||
"running_batch_kv": rb,
|
||||
"transfer_queue": tq,
|
||||
"prealloc_queue": pq,
|
||||
"retracted_queue": rq,
|
||||
"available": avail,
|
||||
"unaccounted": unacc,
|
||||
}
|
||||
print(
|
||||
"\nNote: running_batch_kv_tokens overlaps with radix_protected_tokens "
|
||||
"(tree-tracked decode reqs are also in protected); not summed."
|
||||
)
|
||||
else:
|
||||
print("\n(P1 instrument absent: pool_breakdown fields are all zero)")
|
||||
|
||||
# Session residency churn: how many distinct sessions ever sat on each worker,
|
||||
# and how many sessions hopped across workers (= starvation indicator).
|
||||
print("\n=== Session residency churn ===")
|
||||
sessions_per_worker: dict[str, set[str]] = defaultdict(set)
|
||||
workers_per_session: dict[str, set[str]] = defaultdict(set)
|
||||
resident_ticks_per_session: Counter[str] = Counter()
|
||||
resident_ticks_per_worker: Counter[str] = Counter()
|
||||
|
||||
for row in rows:
|
||||
wid = row.get("worker_id")
|
||||
if wid is None or row.get("worker_role") != "decode":
|
||||
continue
|
||||
sessions = row.get("sessions") or []
|
||||
if not isinstance(sessions, list):
|
||||
continue
|
||||
for entry in sessions:
|
||||
if not isinstance(entry, dict):
|
||||
continue
|
||||
sid = entry.get("session_id")
|
||||
if sid is None:
|
||||
continue
|
||||
if entry.get("resident"):
|
||||
sessions_per_worker[wid].add(sid)
|
||||
workers_per_session[sid].add(wid)
|
||||
resident_ticks_per_session[(wid, sid)] += 1
|
||||
resident_ticks_per_worker[wid] += 1
|
||||
|
||||
# Per-decode worker: distinct session count
|
||||
print(f" {'worker':<12} {'distinct_sess':>14} {'resident_ticks':>16}")
|
||||
for wid in sorted(sessions_per_worker.keys()):
|
||||
print(
|
||||
f" {wid:<12} {len(sessions_per_worker[wid]):>14} "
|
||||
f"{resident_ticks_per_worker[wid]:>16}"
|
||||
)
|
||||
|
||||
# Per session: how many workers it hopped across
|
||||
hops = Counter(len(ws) for ws in workers_per_session.values())
|
||||
print(f"\n Sessions seen on N workers (decode side):")
|
||||
for n, count in sorted(hops.items()):
|
||||
print(f" on {n} worker(s): {count} sessions")
|
||||
|
||||
starvation = [sid for sid, ws in workers_per_session.items() if len(ws) == 0]
|
||||
multi_hopper = sorted(
|
||||
((sid, ws) for sid, ws in workers_per_session.items() if len(ws) >= 2),
|
||||
key=lambda x: -len(x[1]),
|
||||
)[:10]
|
||||
if multi_hopper:
|
||||
print(
|
||||
"\n Top sessions seen resident on multiple workers (potential thrashing):"
|
||||
)
|
||||
for sid, ws in multi_hopper:
|
||||
print(f" {sid}: {len(ws)} workers ({sorted(ws)})")
|
||||
|
||||
summary["session_residency"] = {
|
||||
"distinct_sessions_per_worker": {
|
||||
wid: len(s) for wid, s in sessions_per_worker.items()
|
||||
},
|
||||
"session_hop_count_distribution": dict(hops),
|
||||
"starvation_session_count": len(starvation),
|
||||
}
|
||||
|
||||
# If a request-metrics file is co-located, also bucket fallback reasons
|
||||
# against contemporaneous pool state (rough — uses tick nearest to median tick).
|
||||
metrics_path = timeseries_path.with_name("request-metrics.jsonl")
|
||||
if metrics_path.exists():
|
||||
print(f"\n=== Request-metrics summary ({metrics_path.name}) ===")
|
||||
mrows = _load_jsonl(metrics_path)
|
||||
modes = Counter(r.get("execution_mode") or "?" for r in mrows)
|
||||
total = sum(modes.values())
|
||||
for mode, count in modes.most_common():
|
||||
print(f" {count:>6} ({100 * count / total:5.1f}%) {mode}")
|
||||
summary["execution_modes"] = dict(modes)
|
||||
|
||||
return summary
|
||||
|
||||
|
||||
def main() -> None:
|
||||
parser = argparse.ArgumentParser(description=__doc__)
|
||||
parser.add_argument(
|
||||
"path",
|
||||
type=Path,
|
||||
help="Path to d-pool-timeseries.jsonl OR a run dir containing it",
|
||||
)
|
||||
parser.add_argument(
|
||||
"--json",
|
||||
action="store_true",
|
||||
help="Also print a machine-readable JSON summary",
|
||||
)
|
||||
args = parser.parse_args()
|
||||
|
||||
resolved = _resolve_input(args.path)
|
||||
summary = analyze(resolved)
|
||||
if args.json:
|
||||
print("\n=== JSON summary ===")
|
||||
print(json.dumps(summary, indent=2, sort_keys=True, default=str))
|
||||
|
||||
|
||||
if __name__ == "__main__":
|
||||
main()
|
||||
316
scripts/analysis/analyze_ts1_validation.py
Normal file
316
scripts/analysis/analyze_ts1_validation.py
Normal file
@@ -0,0 +1,316 @@
|
||||
#!/usr/bin/env python3
|
||||
"""TS=1 validation analysis: KVC 1P3D × N=3 + 4DP × 1.
|
||||
|
||||
Reads metrics from outputs/qwen3-30b-tp1-ts1-validation/{kvc_1p3d_run{1,2,3},dp4}_metrics.jsonl
|
||||
and reports per the structural claims in docs/AGENTIC_FIT_ANALYSIS_ZH.md and TEAM_REPORT.
|
||||
|
||||
Sections:
|
||||
1. Headline summary table (errors, latency p50/p90/p99, TTFT p50)
|
||||
2. §1 (session pinning): distinct-D-per-session distribution + direct-to-D bimodal
|
||||
3. §1 (cross-run consistency): sessions consistently starved across all 3 runs + size ratio
|
||||
4. §2 (LRU): KVTransferError counts per D + peak token_usage from worker logs
|
||||
5. §7 (ts=1 vs ts=10): direct-to-D rate, fallback rate, per-D load balance
|
||||
6. KVC vs DP same-scale comparison
|
||||
|
||||
Usage: python scripts/analysis/analyze_ts1_validation.py [--root PATH]
|
||||
"""
|
||||
import argparse
|
||||
import json
|
||||
import re
|
||||
from collections import Counter, defaultdict
|
||||
from pathlib import Path
|
||||
|
||||
import numpy as np
|
||||
|
||||
|
||||
def load_metrics(path):
|
||||
rows = []
|
||||
with open(path) as f:
|
||||
for line in f:
|
||||
line = line.strip()
|
||||
if not line:
|
||||
continue
|
||||
rows.append(json.loads(line))
|
||||
return rows
|
||||
|
||||
|
||||
def load_summary(path):
|
||||
with open(path) as f:
|
||||
return json.load(f)
|
||||
|
||||
|
||||
def pct(arr, p):
|
||||
if not arr:
|
||||
return float("nan")
|
||||
return float(np.percentile(arr, p))
|
||||
|
||||
|
||||
def summarize_run(label, rows, summary):
|
||||
ok = [r for r in rows if r.get("error") is None]
|
||||
err = [r for r in rows if r.get("error") is not None]
|
||||
lats = [r["latency_s"] for r in ok if r.get("latency_s") is not None]
|
||||
ttfts = [r["ttft_s"] for r in ok if r.get("ttft_s") is not None]
|
||||
return {
|
||||
"label": label,
|
||||
"n": len(rows),
|
||||
"ok": len(ok),
|
||||
"err": len(err),
|
||||
"lat_mean": float(np.mean(lats)) if lats else float("nan"),
|
||||
"lat_p50": pct(lats, 50),
|
||||
"lat_p90": pct(lats, 90),
|
||||
"lat_p99": pct(lats, 99),
|
||||
"ttft_mean": float(np.mean(ttfts)) if ttfts else float("nan"),
|
||||
"ttft_p50": pct(ttfts, 50),
|
||||
"summary": summary,
|
||||
}
|
||||
|
||||
|
||||
def headline_table(stats):
|
||||
print("\n" + "=" * 110)
|
||||
print("HEADLINE: same trace, same scale, same ts=1")
|
||||
print("=" * 110)
|
||||
cols = ["label", "ok/n", "err", "lat_mean", "lat_p50", "lat_p90", "lat_p99", "ttft_mean", "ttft_p50"]
|
||||
print(f"{cols[0]:<22}{cols[1]:>12}{cols[2]:>6}{cols[3]:>10}{cols[4]:>10}{cols[5]:>10}{cols[6]:>10}{cols[7]:>10}{cols[8]:>10}")
|
||||
for s in stats:
|
||||
ok_n = f"{s['ok']}/{s['n']}"
|
||||
print(f"{s['label']:<22}{ok_n:>12}{s['err']:>6}"
|
||||
f"{s['lat_mean']:>9.3f}s{s['lat_p50']:>9.3f}s{s['lat_p90']:>9.3f}s{s['lat_p99']:>9.3f}s"
|
||||
f"{s['ttft_mean']:>9.3f}s{s['ttft_p50']:>9.3f}s")
|
||||
|
||||
|
||||
def session_pinning(rows, label):
|
||||
"""§1: distinct D per session — should be ~1.0 if pin behavior persists."""
|
||||
sess_d = defaultdict(set)
|
||||
for r in rows:
|
||||
sid = r.get("session_id")
|
||||
d = r.get("assigned_decode_node") or r.get("decode_node")
|
||||
if sid is not None and d is not None:
|
||||
sess_d[sid].add(d)
|
||||
if not sess_d:
|
||||
return None
|
||||
distinct = [len(s) for s in sess_d.values()]
|
||||
return {
|
||||
"label": label,
|
||||
"n_sessions": len(sess_d),
|
||||
"avg_distinct_D": float(np.mean(distinct)),
|
||||
"max_distinct_D": max(distinct),
|
||||
"sess_d": {sid: sorted(ds) for sid, ds in sess_d.items()},
|
||||
}
|
||||
|
||||
|
||||
def direct_to_d_distribution(rows, label):
|
||||
"""§1: per-session direct-to-D rate; check for bimodal."""
|
||||
sess_total = Counter()
|
||||
sess_direct = Counter()
|
||||
for r in rows:
|
||||
sid = r.get("session_id")
|
||||
if sid is None:
|
||||
continue
|
||||
sess_total[sid] += 1
|
||||
mode = r.get("execution_mode", "")
|
||||
if mode == "kvcache-direct-to-d-session":
|
||||
sess_direct[sid] += 1
|
||||
rates = []
|
||||
for sid in sess_total:
|
||||
rate = sess_direct[sid] / sess_total[sid]
|
||||
rates.append((sid, rate, sess_total[sid]))
|
||||
bins = [0, 0.2, 0.4, 0.6, 0.8, 1.01]
|
||||
bin_labels = ["0-20%", "20-40%", "40-60%", "60-80%", "80-100%"]
|
||||
counts = [0] * 5
|
||||
for _, r, _ in rates:
|
||||
for i in range(5):
|
||||
if bins[i] <= r < bins[i + 1]:
|
||||
counts[i] += 1
|
||||
break
|
||||
print(f"\n [{label}] direct-to-D rate distribution (n={len(rates)} sessions):")
|
||||
for lbl, cnt in zip(bin_labels, counts):
|
||||
bar = "█" * cnt
|
||||
print(f" {lbl:<10}: {cnt:>3} {bar}")
|
||||
return rates
|
||||
|
||||
|
||||
def starved_cross_run(per_run_rates, threshold=0.20):
|
||||
"""§1: sessions starved (<threshold direct-to-D) in ALL runs."""
|
||||
if len(per_run_rates) < 2:
|
||||
return None
|
||||
sess_starved = defaultdict(int)
|
||||
sess_lucky = defaultdict(int)
|
||||
for rates in per_run_rates:
|
||||
for sid, rate, _ in rates:
|
||||
if rate < threshold:
|
||||
sess_starved[sid] += 1
|
||||
elif rate > 0.80:
|
||||
sess_lucky[sid] += 1
|
||||
n_runs = len(per_run_rates)
|
||||
consistently_starved = [sid for sid, c in sess_starved.items() if c == n_runs]
|
||||
consistently_lucky = [sid for sid, c in sess_lucky.items() if c == n_runs]
|
||||
return {
|
||||
"n_runs": n_runs,
|
||||
"consistently_starved": consistently_starved,
|
||||
"consistently_lucky": consistently_lucky,
|
||||
}
|
||||
|
||||
|
||||
def session_size_comparison(rows, sids_a, sids_b, label_a="A", label_b="B"):
|
||||
"""Compare peak input_length of two session groups."""
|
||||
sess_max_input = defaultdict(int)
|
||||
for r in rows:
|
||||
sid = r.get("session_id")
|
||||
ilen = r.get("input_length") or 0
|
||||
if sid is not None and ilen > sess_max_input[sid]:
|
||||
sess_max_input[sid] = ilen
|
||||
a_inputs = [sess_max_input[s] for s in sids_a if s in sess_max_input]
|
||||
b_inputs = [sess_max_input[s] for s in sids_b if s in sess_max_input]
|
||||
if a_inputs and b_inputs:
|
||||
ratio = np.mean(a_inputs) / np.mean(b_inputs)
|
||||
print(f"\n Cross-run starvation correlates with session size?")
|
||||
print(f" consistently {label_a} (n={len(a_inputs)}): peak_input mean = {np.mean(a_inputs):.0f}")
|
||||
print(f" consistently {label_b} (n={len(b_inputs)}): peak_input mean = {np.mean(b_inputs):.0f}")
|
||||
print(f" {label_a}/{label_b} ratio = {ratio:.2f}x (ts=10 baseline was 1.98x)")
|
||||
|
||||
|
||||
def per_d_balance(rows, label):
|
||||
"""§7: per-D load balance."""
|
||||
per_d = Counter()
|
||||
for r in rows:
|
||||
d = r.get("assigned_decode_node") or r.get("decode_node")
|
||||
if d:
|
||||
per_d[d] += 1
|
||||
if not per_d:
|
||||
return
|
||||
counts = list(per_d.values())
|
||||
spread = (max(counts) - min(counts)) / max(np.mean(counts), 1)
|
||||
print(f"\n [{label}] per-D load: {dict(sorted(per_d.items()))}")
|
||||
print(f" spread (max-min)/mean = {spread*100:.1f}% "
|
||||
f"(ts=10 KVC 2P6D = ±26%, 8DP CA = ±10%)")
|
||||
|
||||
|
||||
def execution_modes_table(rows, label):
|
||||
"""Show top execution modes."""
|
||||
ok = [r for r in rows if r.get("error") is None]
|
||||
if not ok:
|
||||
return
|
||||
modes = Counter(r["execution_mode"] for r in ok)
|
||||
print(f"\n [{label}] execution modes (n_ok={len(ok)}):")
|
||||
for mode, cnt in modes.most_common(8):
|
||||
mode_rows = [r for r in ok if r["execution_mode"] == mode]
|
||||
lats = [r["latency_s"] for r in mode_rows if r.get("latency_s") is not None]
|
||||
ttfts = [r["ttft_s"] for r in mode_rows if r.get("ttft_s") is not None]
|
||||
if lats:
|
||||
print(f" {mode:<55} {cnt:>5} ({cnt/len(ok)*100:>4.1f}%) "
|
||||
f"lat p50={pct(lats,50):.3f}s p90={pct(lats,90):.3f}s ttft p50={pct(ttfts,50):.3f}s")
|
||||
|
||||
|
||||
def lru_vs_errors(run_dir, label):
|
||||
"""§2: trim events vs KVTransferError per worker."""
|
||||
log_dir = run_dir / "logs"
|
||||
if not log_dir.exists():
|
||||
return
|
||||
print(f"\n [{label}] D-side LRU vs errors (from worker logs):")
|
||||
print(f" {'worker':<14}{'trim':>8}{'KVTransferError':>20}{'peak_token_usage':>20}")
|
||||
for log_file in sorted(log_dir.glob("decode-*.log")):
|
||||
worker = log_file.stem
|
||||
text = log_file.read_text(errors="ignore")
|
||||
trim_count = len(re.findall(r"Trimmed decode session cache", text))
|
||||
err_count = len(re.findall(r"KVTransferError", text))
|
||||
usages = re.findall(r"token usage: ([\d.]+)", text)
|
||||
peak = max((float(u) for u in usages), default=0.0)
|
||||
print(f" {worker:<14}{trim_count:>8}{err_count:>20}{peak:>20.3f}")
|
||||
|
||||
|
||||
def main():
|
||||
parser = argparse.ArgumentParser()
|
||||
parser.add_argument("--root", default="outputs/qwen3-30b-tp1-ts1-validation",
|
||||
help="Sweep output root")
|
||||
args = parser.parse_args()
|
||||
|
||||
root = Path(args.root)
|
||||
if not root.is_absolute():
|
||||
root = Path("/mnt/kzlin/workflow/pd-hybrid/agentic-pd-hybrid") / root
|
||||
|
||||
# Load all available runs
|
||||
stats = []
|
||||
rows_by_run = {}
|
||||
for label in ("kvc_1p3d_run1", "kvc_1p3d_run2", "kvc_1p3d_run3", "dp4"):
|
||||
m = root / f"{label}_metrics.jsonl"
|
||||
s = root / f"{label}_summary.json"
|
||||
if not m.exists() or not s.exists():
|
||||
print(f" [{label}] not yet available ({m.name})")
|
||||
continue
|
||||
rows = load_metrics(m)
|
||||
summary = load_summary(s)
|
||||
rows_by_run[label] = rows
|
||||
stats.append(summarize_run(label, rows, summary))
|
||||
|
||||
if not stats:
|
||||
print("No runs available yet.")
|
||||
return
|
||||
|
||||
# 1. Headline table
|
||||
headline_table(stats)
|
||||
|
||||
# 2. §1 session pinning per KVC run + per-D balance + execution modes
|
||||
print("\n" + "=" * 110)
|
||||
print("§1 / §7: SESSION PINNING + LOAD BALANCE")
|
||||
print("=" * 110)
|
||||
per_run_rates = []
|
||||
for label, rows in rows_by_run.items():
|
||||
if not label.startswith("kvc_"):
|
||||
continue
|
||||
pin = session_pinning(rows, label)
|
||||
if pin:
|
||||
print(f"\n [{label}] sessions={pin['n_sessions']} "
|
||||
f"avg_distinct_D={pin['avg_distinct_D']:.2f} "
|
||||
f"max_distinct_D={pin['max_distinct_D']} "
|
||||
f"(ts=10 baseline avg=1.00 → 100% pin)")
|
||||
rates = direct_to_d_distribution(rows, label)
|
||||
per_run_rates.append(rates)
|
||||
per_d_balance(rows, label)
|
||||
execution_modes_table(rows, label)
|
||||
|
||||
# 3. §1 cross-run starvation
|
||||
if len(per_run_rates) >= 2:
|
||||
print("\n" + "=" * 110)
|
||||
print(f"§1 CROSS-RUN STARVATION (across {len(per_run_rates)} KVC runs)")
|
||||
print("=" * 110)
|
||||
cross = starved_cross_run(per_run_rates)
|
||||
if cross:
|
||||
n_starved = len(cross["consistently_starved"])
|
||||
n_lucky = len(cross["consistently_lucky"])
|
||||
print(f"\n Sessions starved (<20% direct-to-D) in all {cross['n_runs']} runs: {n_starved}")
|
||||
print(f" Sessions lucky (>80% direct-to-D) in all {cross['n_runs']} runs: {n_lucky}")
|
||||
print(f" (ts=10 baseline: 13/52 starved, 14/52 lucky — extreme bimodal)")
|
||||
# session size comparison from run 1
|
||||
if "kvc_1p3d_run1" in rows_by_run and n_starved and n_lucky:
|
||||
session_size_comparison(rows_by_run["kvc_1p3d_run1"],
|
||||
cross["consistently_starved"],
|
||||
cross["consistently_lucky"],
|
||||
"starved", "lucky")
|
||||
|
||||
# 4. §2 D-side LRU vs errors from raw logs
|
||||
print("\n" + "=" * 110)
|
||||
print("§2: D-SIDE LRU TRIM vs KVTransferError (from worker logs)")
|
||||
print("=" * 110)
|
||||
for label in rows_by_run:
|
||||
if not label.startswith("kvc_"):
|
||||
continue
|
||||
# find the matching raw run dir
|
||||
run_dirs = sorted(root.glob("kvcache-centric-*/"))
|
||||
if not run_dirs:
|
||||
continue
|
||||
# naive: index matches run order; could be wrong if dirs got reordered
|
||||
idx = int(label.split("run")[-1]) - 1
|
||||
if idx < len(run_dirs):
|
||||
lru_vs_errors(run_dirs[idx], label)
|
||||
|
||||
# 5. DP-only inspection
|
||||
if "dp4" in rows_by_run:
|
||||
print("\n" + "=" * 110)
|
||||
print("4DP CA SANITY")
|
||||
print("=" * 110)
|
||||
per_d_balance(rows_by_run["dp4"], "dp4")
|
||||
execution_modes_table(rows_by_run["dp4"], "dp4")
|
||||
|
||||
|
||||
if __name__ == "__main__":
|
||||
main()
|
||||
89
scripts/analysis/analyze_v3.py
Normal file
89
scripts/analysis/analyze_v3.py
Normal file
@@ -0,0 +1,89 @@
|
||||
#!/usr/bin/env python3
|
||||
"""Analyze v3 (kv-aware) results — find why fallback-large-append-session-cap dominates."""
|
||||
import json
|
||||
import numpy as np
|
||||
from pathlib import Path
|
||||
from collections import Counter, defaultdict
|
||||
|
||||
BASE = Path(__file__).parent
|
||||
|
||||
def load_rows(jsonl_path):
|
||||
rows = []
|
||||
with open(jsonl_path) as f:
|
||||
for line in f:
|
||||
rows.append(json.loads(line))
|
||||
return rows
|
||||
|
||||
exp1 = load_rows(BASE / "exp1_1p7d_kvc_kvaware_metrics.jsonl")
|
||||
exp2 = load_rows(BASE / "exp2_2p6d_kvc_kvaware_metrics.jsonl")
|
||||
|
||||
for name, rows in [("Exp1 1P7D", exp1), ("Exp2 2P6D", exp2)]:
|
||||
print(f"\n========== {name} ==========")
|
||||
ok = [r for r in rows if r.get("error") is None]
|
||||
|
||||
# Execution mode breakdown by latency
|
||||
modes = Counter(r["execution_mode"] for r in ok)
|
||||
print(f"\nExecution modes (n={len(ok)}):")
|
||||
for mode, count in modes.most_common():
|
||||
mode_rows = [r for r in ok if r["execution_mode"] == mode]
|
||||
lats = [r["latency_s"] for r in mode_rows]
|
||||
ttfts = [r["ttft_s"] for r in mode_rows]
|
||||
print(f" {mode}: n={count} ({count/len(ok)*100:.1f}%) "
|
||||
f"lat P50={np.percentile(lats,50):.3f}s P90={np.percentile(lats,90):.3f}s | "
|
||||
f"ttft P50={np.percentile(ttfts,50):.3f}s P90={np.percentile(ttfts,90):.3f}s")
|
||||
|
||||
# Per-D session distribution
|
||||
per_d_sessions = defaultdict(set)
|
||||
for r in ok:
|
||||
d = r.get("assigned_decode_node", "?")
|
||||
per_d_sessions[d].add(r["session_id"])
|
||||
print(f"\nSessions per D worker:")
|
||||
for d in sorted(per_d_sessions.keys()):
|
||||
print(f" {d}: {len(per_d_sessions[d])} unique sessions")
|
||||
|
||||
# session-cap fallback analysis
|
||||
sc_rows = [r for r in ok if r["execution_mode"] == "pd-router-fallback-large-append-session-cap"]
|
||||
if sc_rows:
|
||||
print(f"\nSession-cap fallback details (n={len(sc_rows)}):")
|
||||
# Which sessions hit this most?
|
||||
sc_per_sess = Counter(r["session_id"] for r in sc_rows)
|
||||
print(f" Sessions hitting session-cap (top 5):")
|
||||
for sid, cnt in sc_per_sess.most_common(5):
|
||||
print(f" session {sid}: {cnt} times")
|
||||
# Per-D distribution
|
||||
sc_per_d = Counter(r.get("assigned_decode_node", "?") for r in sc_rows)
|
||||
print(f" Per-D distribution: {dict(sc_per_d.most_common())}")
|
||||
# Input length distribution
|
||||
inp = [r.get("input_length", 0) for r in sc_rows]
|
||||
print(f" Input length: P50={np.percentile(inp,50):.0f} P90={np.percentile(inp,90):.0f}")
|
||||
# Turn distribution
|
||||
turns = Counter(r.get("turn_id", -1) for r in sc_rows)
|
||||
print(f" Turn distribution (top 5): {dict(turns.most_common(5))}")
|
||||
|
||||
# Direct-to-D analysis (ideal path)
|
||||
dd_rows = [r for r in ok if r["execution_mode"] == "kvcache-direct-to-d-session"]
|
||||
if dd_rows:
|
||||
lats = [r["latency_s"] for r in dd_rows]
|
||||
ttfts = [r["ttft_s"] for r in dd_rows]
|
||||
kv_blocks = [r.get("actual_kv_transfer_blocks", 0) for r in dd_rows]
|
||||
cached = [r.get("cached_tokens", 0) for r in dd_rows]
|
||||
print(f"\nDirect-to-D details (n={len(dd_rows)}):")
|
||||
print(f" lat P50={np.percentile(lats,50):.3f}s P90={np.percentile(lats,90):.3f}s P99={np.percentile(lats,99):.3f}s")
|
||||
print(f" ttft P50={np.percentile(ttfts,50):.3f}s P90={np.percentile(ttfts,90):.3f}s")
|
||||
print(f" KV transfer: P50={np.percentile(kv_blocks,50):.0f} (should be 0 — no P involved)")
|
||||
print(f" cached_tokens P50={np.percentile(cached,50):.0f}")
|
||||
|
||||
# Sessions: how many turns each, how many used direct-to-d
|
||||
print(f"\nPer-session direct-to-D rate (top 10 by total turns):")
|
||||
per_sess = defaultdict(list)
|
||||
for r in ok:
|
||||
per_sess[r["session_id"]].append(r)
|
||||
sess_stats = []
|
||||
for sid, sreqs in per_sess.items():
|
||||
total = len(sreqs)
|
||||
dd = sum(1 for r in sreqs if r["execution_mode"] == "kvcache-direct-to-d-session")
|
||||
sc = sum(1 for r in sreqs if "session-cap" in r["execution_mode"])
|
||||
sess_stats.append((sid, total, dd, sc))
|
||||
sess_stats.sort(key=lambda x: -x[1])
|
||||
for sid, total, dd, sc in sess_stats[:10]:
|
||||
print(f" session {sid}: {total} turns, {dd} direct-to-D ({dd/total*100:.0f}%), {sc} session-cap fallback ({sc/total*100:.0f}%)")
|
||||
52
scripts/analysis/analyze_v4.py
Normal file
52
scripts/analysis/analyze_v4.py
Normal file
@@ -0,0 +1,52 @@
|
||||
#!/usr/bin/env python3
|
||||
"""V4 results analysis: errors, execution modes, latency by mode."""
|
||||
import json
|
||||
import numpy as np
|
||||
from pathlib import Path
|
||||
from collections import Counter
|
||||
|
||||
BASE = Path(__file__).parent
|
||||
|
||||
def load_rows(jsonl_path):
|
||||
rows = []
|
||||
with open(jsonl_path) as f:
|
||||
for line in f:
|
||||
rows.append(json.loads(line))
|
||||
return rows
|
||||
|
||||
for name, path in [
|
||||
("Exp1 1P7D cap=16", BASE / "exp1_1p7d_kvc_cap16_metrics.jsonl"),
|
||||
("Exp2 2P6D cap=16", BASE / "exp2_2p6d_kvc_cap16_metrics.jsonl"),
|
||||
]:
|
||||
rows = load_rows(path)
|
||||
print(f"\n========== {name} ==========")
|
||||
ok = [r for r in rows if r.get("error") is None]
|
||||
err = [r for r in rows if r.get("error") is not None]
|
||||
print(f"Total: {len(rows)}, OK: {len(ok)}, Errors: {len(err)}")
|
||||
|
||||
# Errors finish_reason
|
||||
if err:
|
||||
finish_reasons = Counter()
|
||||
for r in err:
|
||||
fr = str(r.get("finish_reason") or r.get("error") or "?")
|
||||
# Truncate long messages
|
||||
short = fr[:120]
|
||||
finish_reasons[short] += 1
|
||||
print(f"\nError finish_reasons (top 5):")
|
||||
for fr, cnt in finish_reasons.most_common(5):
|
||||
print(f" {cnt}x: {fr}")
|
||||
|
||||
# Execution mode latency breakdown
|
||||
modes = Counter(r["execution_mode"] for r in ok)
|
||||
print(f"\nTop execution modes by latency:")
|
||||
print(f"{'mode':<55}{'n':<8}{'%':<8}{'P50 lat':<10}{'P90 lat':<10}{'TTFT P50':<10}")
|
||||
for mode, count in modes.most_common(8):
|
||||
mode_rows = [r for r in ok if r["execution_mode"] == mode]
|
||||
lats = [r["latency_s"] for r in mode_rows]
|
||||
ttfts = [r["ttft_s"] for r in mode_rows]
|
||||
print(f" {mode:<53}{count:<8}{count/len(ok)*100:>5.1f}% {np.percentile(lats,50):>7.3f}s {np.percentile(lats,90):>7.3f}s {np.percentile(ttfts,50):>7.3f}s")
|
||||
|
||||
# Per-D load
|
||||
per_d = Counter(r.get("assigned_decode_node", "?") for r in ok)
|
||||
print(f"\nPer-D load: max/min ratio = {max(per_d.values())/max(min(per_d.values()),1):.2f}x")
|
||||
print(f" {dict(per_d.most_common())}")
|
||||
136
scripts/analysis/compare_no_error.py
Normal file
136
scripts/analysis/compare_no_error.py
Normal file
@@ -0,0 +1,136 @@
|
||||
#!/usr/bin/env python3
|
||||
"""Compare KVC variants vs baseline, EXCLUDING errors and truncated requests."""
|
||||
import json
|
||||
import numpy as np
|
||||
from pathlib import Path
|
||||
|
||||
OUT = Path("/mnt/kzlin/workflow/pd-hybrid/agentic-pd-hybrid/outputs")
|
||||
|
||||
DATASETS = [
|
||||
("baseline 8DP", OUT / "qwen3-30b-tp1-v2-fixed/exp1_8way_dp_cache_aware_metrics.jsonl"),
|
||||
("v3 1P7D", OUT / "qwen3-30b-tp1-v3-kvaware/exp1_1p7d_kvc_kvaware_metrics.jsonl"),
|
||||
("v3 2P6D", OUT / "qwen3-30b-tp1-v3-kvaware/exp2_2p6d_kvc_kvaware_metrics.jsonl"),
|
||||
("v4 1P7D", OUT / "qwen3-30b-tp1-v4-cap16/exp1_1p7d_kvc_cap16_metrics.jsonl"),
|
||||
("v4 2P6D", OUT / "qwen3-30b-tp1-v4-cap16/exp2_2p6d_kvc_cap16_metrics.jsonl"),
|
||||
]
|
||||
|
||||
def load_rows(path):
|
||||
rows = []
|
||||
with open(path) as f:
|
||||
for line in f:
|
||||
rows.append(json.loads(line))
|
||||
return rows
|
||||
|
||||
def is_truncated(row):
|
||||
a = row.get("actual_output_tokens")
|
||||
r = row.get("requested_output_tokens")
|
||||
if a is not None and r is not None and r > 1:
|
||||
return a < r * 0.5
|
||||
return False
|
||||
|
||||
def stats(values):
|
||||
if not values:
|
||||
return {"n": 0}
|
||||
a = np.array(values)
|
||||
return {
|
||||
"n": len(a),
|
||||
"mean": float(np.mean(a)),
|
||||
"p50": float(np.percentile(a, 50)),
|
||||
"p90": float(np.percentile(a, 90)),
|
||||
"p99": float(np.percentile(a, 99)),
|
||||
}
|
||||
|
||||
def fmt(s, key):
|
||||
if s["n"] == 0:
|
||||
return "N/A"
|
||||
v = s[key]
|
||||
return f"{v:.3f}s" if v < 100 else f"{v:.1f}s"
|
||||
|
||||
results = []
|
||||
for label, path in DATASETS:
|
||||
if not path.exists():
|
||||
print(f"SKIP {label}")
|
||||
continue
|
||||
rows = load_rows(path)
|
||||
total = len(rows)
|
||||
err_n = sum(1 for r in rows if r.get("error") is not None)
|
||||
trunc_n = sum(1 for r in rows if r.get("error") is None and is_truncated(r))
|
||||
|
||||
# Filter: error=None AND not truncated AND latency present
|
||||
clean = [r for r in rows
|
||||
if r.get("error") is None
|
||||
and not is_truncated(r)
|
||||
and r.get("latency_s") is not None]
|
||||
|
||||
lats = [r["latency_s"] for r in clean]
|
||||
ttfts = [r["ttft_s"] for r in clean if r.get("ttft_s") is not None]
|
||||
|
||||
results.append({
|
||||
"label": label,
|
||||
"total": total,
|
||||
"err": err_n,
|
||||
"trunc": trunc_n,
|
||||
"clean_n": len(clean),
|
||||
"lat": stats(lats),
|
||||
"ttft": stats(ttfts),
|
||||
})
|
||||
|
||||
# Print comparison table
|
||||
print(f"\n{'='*100}")
|
||||
print("LATENCY (excluding errors AND truncated)")
|
||||
print(f"{'='*100}")
|
||||
print(f"{'config':<16}{'total':>7}{'err':>6}{'trunc':>7}{'clean':>7} {'mean':>9}{'P50':>9}{'P90':>9}{'P99':>9}")
|
||||
for r in results:
|
||||
print(f"{r['label']:<16}{r['total']:>7}{r['err']:>6}{r['trunc']:>7}{r['clean_n']:>7} "
|
||||
f"{fmt(r['lat'],'mean'):>9}{fmt(r['lat'],'p50'):>9}{fmt(r['lat'],'p90'):>9}{fmt(r['lat'],'p99'):>9}")
|
||||
|
||||
print(f"\n{'='*100}")
|
||||
print("TTFT (excluding errors AND truncated)")
|
||||
print(f"{'='*100}")
|
||||
print(f"{'config':<16}{'clean':>7} {'mean':>9}{'P50':>9}{'P90':>9}{'P99':>9}")
|
||||
for r in results:
|
||||
print(f"{r['label']:<16}{r['clean_n']:>7} "
|
||||
f"{fmt(r['ttft'],'mean'):>9}{fmt(r['ttft'],'p50'):>9}{fmt(r['ttft'],'p90'):>9}{fmt(r['ttft'],'p99'):>9}")
|
||||
|
||||
# Also: per-execution-mode breakdown for v4 only (the most interesting)
|
||||
print(f"\n{'='*100}")
|
||||
print("V4 2P6D: per-execution-mode (excluding errors and truncated)")
|
||||
print(f"{'='*100}")
|
||||
v4_2p6d = next((p for l, p in DATASETS if l == "v4 2P6D"), None)
|
||||
if v4_2p6d:
|
||||
rows = load_rows(v4_2p6d)
|
||||
clean = [r for r in rows if r.get("error") is None and not is_truncated(r)]
|
||||
from collections import Counter
|
||||
modes = Counter(r["execution_mode"] for r in clean)
|
||||
print(f"{'mode':<55}{'n':>7}{'%':>7} {'mean':>9}{'P50':>9}{'P90':>9}{'P99':>9}")
|
||||
for mode, count in modes.most_common(10):
|
||||
m_rows = [r for r in clean if r["execution_mode"] == mode]
|
||||
s = stats([r["latency_s"] for r in m_rows])
|
||||
pct = count/len(clean)*100
|
||||
print(f" {mode:<53}{count:>7}{pct:>6.1f}% {fmt(s,'mean'):>9}{fmt(s,'p50'):>9}{fmt(s,'p90'):>9}{fmt(s,'p99'):>9}")
|
||||
|
||||
# Also: WHAT IF we only count direct-to-D? (Pure KVC performance)
|
||||
print(f"\n{'='*100}")
|
||||
print("Pure KVC (kvcache-direct-to-d-session ONLY) vs Baseline")
|
||||
print(f"{'='*100}")
|
||||
for label, path in DATASETS:
|
||||
if not path.exists() or "1P7D" not in label and "2P6D" not in label:
|
||||
continue
|
||||
rows = load_rows(path)
|
||||
direct = [r for r in rows
|
||||
if r.get("error") is None and not is_truncated(r)
|
||||
and r.get("execution_mode") == "kvcache-direct-to-d-session"]
|
||||
if not direct:
|
||||
continue
|
||||
s_lat = stats([r["latency_s"] for r in direct])
|
||||
s_ttft = stats([r["ttft_s"] for r in direct if r.get("ttft_s") is not None])
|
||||
print(f"{label:<16}n={s_lat['n']:>5} lat: P50={fmt(s_lat,'p50')} P90={fmt(s_lat,'p90')} ttft: P50={fmt(s_ttft,'p50')} P90={fmt(s_ttft,'p90')}")
|
||||
|
||||
# Baseline for reference (already non-fallback by definition)
|
||||
print()
|
||||
baseline_path = OUT / "qwen3-30b-tp1-v2-fixed/exp1_8way_dp_cache_aware_metrics.jsonl"
|
||||
baseline_rows = load_rows(baseline_path)
|
||||
clean = [r for r in baseline_rows if r.get("error") is None and not is_truncated(r)]
|
||||
s_lat = stats([r["latency_s"] for r in clean])
|
||||
s_ttft = stats([r["ttft_s"] for r in clean if r.get("ttft_s") is not None])
|
||||
print(f"{'baseline 8DP':<16}n={s_lat['n']:>5} lat: P50={fmt(s_lat,'p50')} P90={fmt(s_lat,'p90')} ttft: P50={fmt(s_ttft,'p50')} P90={fmt(s_ttft,'p90')}")
|
||||
225
scripts/analysis/paired_compare.py
Executable file
225
scripts/analysis/paired_compare.py
Executable file
@@ -0,0 +1,225 @@
|
||||
#!/usr/bin/env python3
|
||||
"""Paired latency comparison with bootstrap CI.
|
||||
|
||||
Implements docs/EVALUATION_PROTOCOL_ZH.md §2.2 (M2 fix): when comparing
|
||||
mechanism A vs B on the same trace, the only honest comparison is paired
|
||||
on same-trial-mask. This script joins two metrics.jsonl by request_id,
|
||||
keeps the rows where BOTH sides succeeded, and reports paired deltas
|
||||
with 95% bootstrap CIs.
|
||||
|
||||
Out vs the existing `compare_no_error.py`:
|
||||
- works on raw metrics.jsonl, not pre-aggregated summary.json
|
||||
- bootstrap CIs (not just point estimates)
|
||||
- reports paired-mask size + per-side failure counts so the reader
|
||||
sees how many rows were dropped from the comparison
|
||||
|
||||
Usage:
|
||||
scripts/analysis/paired_compare.py \
|
||||
--baseline outputs/run-dp/request-metrics.jsonl \
|
||||
--candidate outputs/run-kvc/request-metrics.jsonl
|
||||
scripts/analysis/paired_compare.py ... --bootstrap 5000 --seed 42
|
||||
scripts/analysis/paired_compare.py ... --json > paired.json
|
||||
|
||||
stdlib only — no scipy/numpy. Runs without GPU and without SGLang.
|
||||
"""
|
||||
|
||||
from __future__ import annotations
|
||||
|
||||
import argparse
|
||||
import json
|
||||
import math
|
||||
import random
|
||||
import sys
|
||||
from pathlib import Path
|
||||
|
||||
|
||||
def _load(path: Path) -> dict[str, dict]:
|
||||
out: dict[str, dict] = {}
|
||||
with path.open() as handle:
|
||||
for line in handle:
|
||||
line = line.strip()
|
||||
if not line:
|
||||
continue
|
||||
row = json.loads(line)
|
||||
rid = row.get("request_id")
|
||||
if rid is None:
|
||||
continue
|
||||
out[rid] = row
|
||||
return out
|
||||
|
||||
|
||||
def _ok(row: dict) -> bool:
|
||||
return row.get("error") is None and row.get("latency_s") is not None
|
||||
|
||||
|
||||
def _quantile(values: list[float], q: float) -> float:
|
||||
if not values:
|
||||
return float("nan")
|
||||
s = sorted(values)
|
||||
if len(s) == 1:
|
||||
return s[0]
|
||||
pos = (len(s) - 1) * q
|
||||
lo = math.floor(pos)
|
||||
hi = math.ceil(pos)
|
||||
if lo == hi:
|
||||
return s[lo]
|
||||
return s[lo] + (s[hi] - s[lo]) * (pos - lo)
|
||||
|
||||
|
||||
def _stats(deltas: list[float]) -> dict[str, float]:
|
||||
if not deltas:
|
||||
return {"mean": float("nan"), "p50": float("nan"), "p90": float("nan"), "p99": float("nan")}
|
||||
return {
|
||||
"mean": sum(deltas) / len(deltas),
|
||||
"p50": _quantile(deltas, 0.50),
|
||||
"p90": _quantile(deltas, 0.90),
|
||||
"p99": _quantile(deltas, 0.99),
|
||||
}
|
||||
|
||||
|
||||
def _bootstrap_ci(
|
||||
deltas: list[float], statistic, n_boot: int, rng: random.Random
|
||||
) -> tuple[float, float]:
|
||||
"""Return (lo, hi) 95% CI for `statistic(deltas)`."""
|
||||
if len(deltas) < 2:
|
||||
return (float("nan"), float("nan"))
|
||||
n = len(deltas)
|
||||
samples = []
|
||||
for _ in range(n_boot):
|
||||
# resample with replacement
|
||||
resample = [deltas[rng.randrange(n)] for _ in range(n)]
|
||||
samples.append(statistic(resample))
|
||||
samples.sort()
|
||||
lo = samples[int(0.025 * (n_boot - 1))]
|
||||
hi = samples[int(0.975 * (n_boot - 1))]
|
||||
return (lo, hi)
|
||||
|
||||
|
||||
def compare(
|
||||
baseline: dict[str, dict],
|
||||
candidate: dict[str, dict],
|
||||
*,
|
||||
metric: str,
|
||||
n_boot: int,
|
||||
seed: int,
|
||||
) -> dict:
|
||||
common_ids = set(baseline.keys()) & set(candidate.keys())
|
||||
paired_ids = [
|
||||
rid for rid in common_ids if _ok(baseline[rid]) and _ok(candidate[rid])
|
||||
]
|
||||
paired_ids.sort()
|
||||
|
||||
base_only_fail = sum(1 for rid in common_ids if not _ok(baseline[rid]))
|
||||
cand_only_fail = sum(1 for rid in common_ids if not _ok(candidate[rid]))
|
||||
|
||||
deltas = []
|
||||
wins = losses = ties = 0
|
||||
for rid in paired_ids:
|
||||
b = baseline[rid].get(metric)
|
||||
c = candidate[rid].get(metric)
|
||||
if b is None or c is None:
|
||||
continue
|
||||
d = float(c) - float(b)
|
||||
deltas.append(d)
|
||||
if d < 0:
|
||||
wins += 1
|
||||
elif d > 0:
|
||||
losses += 1
|
||||
else:
|
||||
ties += 1
|
||||
|
||||
rng = random.Random(seed)
|
||||
stats = _stats(deltas)
|
||||
ci_mean = _bootstrap_ci(deltas, lambda x: sum(x) / len(x), n_boot, rng)
|
||||
ci_p50 = _bootstrap_ci(deltas, lambda x: _quantile(x, 0.50), n_boot, rng)
|
||||
ci_p90 = _bootstrap_ci(deltas, lambda x: _quantile(x, 0.90), n_boot, rng)
|
||||
|
||||
return {
|
||||
"metric": metric,
|
||||
"baseline_size": len(baseline),
|
||||
"candidate_size": len(candidate),
|
||||
"intersection_size": len(common_ids),
|
||||
"paired_size": len(paired_ids),
|
||||
"baseline_fail_in_common": base_only_fail,
|
||||
"candidate_fail_in_common": cand_only_fail,
|
||||
"delta_stats": stats,
|
||||
"delta_mean_ci95": ci_mean,
|
||||
"delta_p50_ci95": ci_p50,
|
||||
"delta_p90_ci95": ci_p90,
|
||||
"wins_candidate": wins,
|
||||
"losses_candidate": losses,
|
||||
"ties": ties,
|
||||
}
|
||||
|
||||
|
||||
def _fmt(x: float, w: int = 6) -> str:
|
||||
if x is None or (isinstance(x, float) and math.isnan(x)):
|
||||
return " nan "
|
||||
return f"{x:+{w}.3f}"
|
||||
|
||||
|
||||
def render(result: dict) -> str:
|
||||
s = result["delta_stats"]
|
||||
mlo, mhi = result["delta_mean_ci95"]
|
||||
p5lo, p5hi = result["delta_p50_ci95"]
|
||||
p9lo, p9hi = result["delta_p90_ci95"]
|
||||
n = result["paired_size"]
|
||||
lines = [
|
||||
f"# paired comparison ({result['metric']})",
|
||||
"",
|
||||
f"baseline rows: {result['baseline_size']}",
|
||||
f"candidate rows: {result['candidate_size']}",
|
||||
f"intersection (rid): {result['intersection_size']}",
|
||||
f"paired (both ok): {result['paired_size']}",
|
||||
f" baseline fails in common: {result['baseline_fail_in_common']}",
|
||||
f" candidate fails in common: {result['candidate_fail_in_common']}",
|
||||
"",
|
||||
"## delta (candidate - baseline) — negative = candidate is faster",
|
||||
"",
|
||||
"| stat | value | 95% CI |",
|
||||
"|---|---:|---:|",
|
||||
f"| mean | {_fmt(s['mean'])} | [{_fmt(mlo)}, {_fmt(mhi)}] |",
|
||||
f"| p50 | {_fmt(s['p50'])} | [{_fmt(p5lo)}, {_fmt(p5hi)}] |",
|
||||
f"| p90 | {_fmt(s['p90'])} | [{_fmt(p9lo)}, {_fmt(p9hi)}] |",
|
||||
f"| p99 | {_fmt(s['p99'])} | — |",
|
||||
"",
|
||||
f"win/loss/tie: {result['wins_candidate']} / {result['losses_candidate']} / {result['ties']} (of {n})",
|
||||
]
|
||||
return "\n".join(lines)
|
||||
|
||||
|
||||
def main() -> None:
|
||||
p = argparse.ArgumentParser(description=__doc__.split("\n\n")[0])
|
||||
p.add_argument("--baseline", required=True, type=Path)
|
||||
p.add_argument("--candidate", required=True, type=Path)
|
||||
p.add_argument(
|
||||
"--metric",
|
||||
default="latency_s",
|
||||
choices=["latency_s", "ttft_s", "tpot_s"],
|
||||
help="which per-request field to compare (default: latency_s)",
|
||||
)
|
||||
p.add_argument("--bootstrap", type=int, default=2000)
|
||||
p.add_argument("--seed", type=int, default=20260512)
|
||||
p.add_argument("--json", action="store_true")
|
||||
args = p.parse_args()
|
||||
|
||||
baseline = _load(args.baseline)
|
||||
candidate = _load(args.candidate)
|
||||
if not baseline or not candidate:
|
||||
print("empty input on one side", file=sys.stderr)
|
||||
sys.exit(1)
|
||||
|
||||
result = compare(
|
||||
baseline, candidate,
|
||||
metric=args.metric, n_boot=args.bootstrap, seed=args.seed,
|
||||
)
|
||||
|
||||
if args.json:
|
||||
json.dump(result, sys.stdout, indent=2, default=lambda x: None if isinstance(x, float) and math.isnan(x) else x)
|
||||
sys.stdout.write("\n")
|
||||
else:
|
||||
print(render(result))
|
||||
|
||||
|
||||
if __name__ == "__main__":
|
||||
main()
|
||||
209
scripts/analysis/plot_cache_efficiency.py
Normal file
209
scripts/analysis/plot_cache_efficiency.py
Normal file
@@ -0,0 +1,209 @@
|
||||
#!/usr/bin/env python3
|
||||
"""Cache efficiency comparison: KVC 1P3D v2 vs 4-way DP CA.
|
||||
|
||||
Generates docs/figures/cache_efficiency.png — two-panel:
|
||||
left: cache hit rate vs turn number (mechanism: affinity vs LRU)
|
||||
right: ECDF of per-request uncached tokens (per-request impact)
|
||||
|
||||
Resolves the apparent paradox: KVC has 27% less total KV pool capacity
|
||||
(3 × 92K = 276K vs DP 4 × 87K = 351K) yet achieves higher cache hit rate
|
||||
(98.1% vs 96.8%) and lower mean uncached tokens per request (560 vs 952).
|
||||
|
||||
The left panel shows the mechanism: KVC's session affinity makes cache hit
|
||||
rate grow with turn count (more cache accumulates on the pinned D), while
|
||||
DP's hash + radix-LRU causes cache hit rate to decay through the middle
|
||||
turns (other sessions' KV competes via LRU eviction).
|
||||
|
||||
The right panel quantifies the impact: KVC's uncached tokens are
|
||||
concentrated near 0 (mean 560), DP's are spread (mean 952).
|
||||
|
||||
Aborted / errored requests are excluded.
|
||||
"""
|
||||
|
||||
from __future__ import annotations
|
||||
|
||||
import json
|
||||
from collections import defaultdict
|
||||
from pathlib import Path
|
||||
|
||||
import matplotlib.pyplot as plt
|
||||
import numpy as np
|
||||
|
||||
ROOT = Path(__file__).resolve().parents[2]
|
||||
KVC = ROOT / "outputs/qwen3-30b-tp1-ts1-migration-v2/kvc_1p3d_migration_v2_run1_metrics.jsonl"
|
||||
DP = ROOT / "outputs/qwen3-30b-tp1-ts1-validation/dp4_metrics.jsonl"
|
||||
OUT = ROOT / "docs/figures/cache_efficiency.png"
|
||||
|
||||
|
||||
def load(p: Path) -> list[dict]:
|
||||
return [json.loads(line) for line in p.open()]
|
||||
|
||||
|
||||
def is_failed(r: dict) -> bool:
|
||||
if r.get("error"):
|
||||
return True
|
||||
fr = r.get("finish_reason")
|
||||
if fr and ("abort" in str(fr).lower() or "badrequest" in str(fr).lower()):
|
||||
return True
|
||||
return False
|
||||
|
||||
|
||||
def main() -> None:
|
||||
kvc = [r for r in load(KVC) if not is_failed(r)]
|
||||
dp = [r for r in load(DP) if not is_failed(r)]
|
||||
|
||||
KVC_COLOR = "#1F77B4"
|
||||
DP_COLOR = "#D62728"
|
||||
|
||||
fig, axes = plt.subplots(1, 2, figsize=(15, 6.5))
|
||||
|
||||
# ------------------------------------------------------------------
|
||||
# Left panel: cache hit rate per turn
|
||||
# Bin requests by turn_id, plot mean hit rate per bin with shaded band
|
||||
# ------------------------------------------------------------------
|
||||
def bin_by_turn(rows: list[dict]) -> tuple[list[int], list[float], list[float], list[float]]:
|
||||
per_turn: defaultdict[int, list[float]] = defaultdict(list)
|
||||
for r in rows:
|
||||
if r["input_length"] == 0:
|
||||
continue
|
||||
hit = r.get("cached_tokens", 0) / r["input_length"]
|
||||
per_turn[r["turn_id"]].append(hit)
|
||||
turns = sorted(per_turn.keys())
|
||||
means, p25s, p75s = [], [], []
|
||||
for t in turns:
|
||||
arr = np.array(per_turn[t])
|
||||
means.append(float(np.mean(arr)))
|
||||
p25s.append(float(np.quantile(arr, 0.25)))
|
||||
p75s.append(float(np.quantile(arr, 0.75)))
|
||||
return turns, means, p25s, p75s
|
||||
|
||||
kvc_t, kvc_m, kvc_lo, kvc_hi = bin_by_turn(kvc)
|
||||
dp_t, dp_m, dp_lo, dp_hi = bin_by_turn(dp)
|
||||
|
||||
# Cap x-axis: tails get noisy below ~5 samples per bin
|
||||
max_turn = 100
|
||||
|
||||
ax = axes[0]
|
||||
ax.plot(kvc_t, kvc_m, color=KVC_COLOR, lw=2.5,
|
||||
label=f"KVC 1P3D v2 (overall hit 98.1%)")
|
||||
ax.fill_between(kvc_t, kvc_lo, kvc_hi, color=KVC_COLOR, alpha=0.18,
|
||||
label="KVC IQR (p25-p75)")
|
||||
ax.plot(dp_t, dp_m, color=DP_COLOR, lw=2.5,
|
||||
label=f"4-way DP CA (overall hit 96.8%)")
|
||||
ax.fill_between(dp_t, dp_lo, dp_hi, color=DP_COLOR, alpha=0.18,
|
||||
label="DP IQR (p25-p75)")
|
||||
|
||||
# Annotate the mid-turn drift gap
|
||||
drift_turns = list(range(8, 25))
|
||||
drift_kvc = np.mean([m for t, m in zip(kvc_t, kvc_m) if t in drift_turns])
|
||||
drift_dp = np.mean([m for t, m in zip(dp_t, dp_m) if t in drift_turns])
|
||||
ax.axvspan(8, 25, color="#999", alpha=0.08, label="_nolegend_")
|
||||
ax.text(16, 0.65,
|
||||
f"Mid-turn region\n(turns 8-25):\nKVC {drift_kvc*100:.1f}% | DP {drift_dp*100:.1f}%\nGap {(drift_kvc-drift_dp)*100:+.1f} pp",
|
||||
ha="center", va="center", fontsize=9.5,
|
||||
bbox=dict(facecolor="white", edgecolor="gray", alpha=0.92, pad=4))
|
||||
|
||||
ax.set_xlim(1, max_turn)
|
||||
ax.set_ylim(0.4, 1.02)
|
||||
ax.set_xlabel("Turn number within session", fontsize=11)
|
||||
ax.set_ylabel("Per-request cache hit rate (cached / input_length)", fontsize=11)
|
||||
ax.set_title("Cache hit rate vs turn number\n(mechanism: session affinity vs hash-LRU)",
|
||||
fontsize=12, pad=10)
|
||||
ax.legend(loc="lower right", fontsize=9.5, framealpha=0.95)
|
||||
ax.grid(True, linestyle=":", alpha=0.4)
|
||||
ax.set_axisbelow(True)
|
||||
|
||||
# ------------------------------------------------------------------
|
||||
# Right panel: ECDF of per-request uncached tokens (log x)
|
||||
# ------------------------------------------------------------------
|
||||
def ecdf(rows: list[dict]) -> tuple[np.ndarray, np.ndarray]:
|
||||
vals = np.array([
|
||||
max(1, r["input_length"] - r.get("cached_tokens", 0))
|
||||
for r in rows
|
||||
])
|
||||
vals = np.sort(vals)
|
||||
return vals, np.arange(1, len(vals) + 1) / len(vals)
|
||||
|
||||
kvc_x, kvc_y = ecdf(kvc)
|
||||
dp_x, dp_y = ecdf(dp)
|
||||
|
||||
ax = axes[1]
|
||||
ax.plot(kvc_x, kvc_y, color=KVC_COLOR, lw=2.5,
|
||||
label=f"KVC 1P3D v2 (mean {int(np.mean(kvc_x))} tokens)")
|
||||
ax.plot(dp_x, dp_y, color=DP_COLOR, lw=2.5,
|
||||
label=f"4-way DP CA (mean {int(np.mean(dp_x))} tokens)")
|
||||
|
||||
# Median markers
|
||||
kvc_p50 = np.quantile(kvc_x, 0.50)
|
||||
dp_p50 = np.quantile(dp_x, 0.50)
|
||||
ax.axhline(0.5, color="gray", linestyle=":", alpha=0.5)
|
||||
ax.text(1.2, 0.52, "median (50% of requests below this)",
|
||||
fontsize=8.5, color="gray", style="italic")
|
||||
ax.axvline(kvc_p50, color=KVC_COLOR, ls="--", alpha=0.5, lw=1.0)
|
||||
ax.axvline(dp_p50, color=DP_COLOR, ls="--", alpha=0.5, lw=1.0)
|
||||
ax.text(kvc_p50, 0.06, f"KVC\nmedian\n{int(kvc_p50)}",
|
||||
color=KVC_COLOR, fontsize=9, ha="center", va="bottom",
|
||||
bbox=dict(facecolor="white", edgecolor="none", alpha=0.75, pad=1))
|
||||
ax.text(dp_p50, 0.06, f"DP\nmedian\n{int(dp_p50)}",
|
||||
color=DP_COLOR, fontsize=9, ha="center", va="bottom",
|
||||
bbox=dict(facecolor="white", edgecolor="none", alpha=0.75, pad=1))
|
||||
|
||||
# Annotate the separation: at uncached = 500 tokens, what fraction below?
|
||||
sep_x = 500
|
||||
kvc_at_sep = (kvc_x <= sep_x).mean()
|
||||
dp_at_sep = (dp_x <= sep_x).mean()
|
||||
ax.axvline(sep_x, color="#666", linestyle=":", alpha=0.6, lw=1.0)
|
||||
ax.annotate(
|
||||
f"At uncached = {sep_x} tokens:\n"
|
||||
f"KVC {kvc_at_sep*100:.0f}% of requests below\n"
|
||||
f"DP {dp_at_sep*100:.0f}% of requests below",
|
||||
xy=(sep_x, dp_at_sep),
|
||||
xytext=(2500, 0.35),
|
||||
fontsize=9.5,
|
||||
bbox=dict(facecolor="white", edgecolor="gray", alpha=0.92, pad=4),
|
||||
arrowprops=dict(arrowstyle="->", color="#666", lw=0.8),
|
||||
)
|
||||
|
||||
ax.set_xscale("log")
|
||||
ax.set_xlim(1, 1e5)
|
||||
ax.set_xticks([1, 10, 100, 1000, 10000, 100000])
|
||||
ax.set_xticklabels(["1", "10", "100", "1K", "10K", "100K"])
|
||||
ax.set_ylim(0, 1.02)
|
||||
ax.set_xlabel("Uncached tokens per request (log scale)", fontsize=11)
|
||||
ax.set_ylabel("Cumulative fraction of requests", fontsize=11)
|
||||
ax.set_title("ECDF of uncached tokens per request\n(impact: KVC concentrates near zero)",
|
||||
fontsize=12, pad=10)
|
||||
ax.legend(loc="lower right", fontsize=10, framealpha=0.95)
|
||||
ax.grid(True, which="both", linestyle=":", alpha=0.4)
|
||||
ax.set_axisbelow(True)
|
||||
|
||||
fig.suptitle(
|
||||
"Cache efficiency paradox: KVC has 27% LESS total KV pool (276K vs 351K tokens) yet caches MORE per request.\n"
|
||||
"Left: session-affinity lets KVC's cache accumulate with turns; DP's hash-LRU loses cache to cross-session competition.\n"
|
||||
"Right: net effect — KVC's uncached compute is concentrated near zero, DP's is spread over 100-10K tokens.",
|
||||
fontsize=11.5, y=1.05,
|
||||
)
|
||||
plt.tight_layout()
|
||||
plt.savefig(OUT, dpi=150, bbox_inches="tight")
|
||||
print(f"wrote {OUT}")
|
||||
plt.close(fig)
|
||||
|
||||
# ------------------------------------------------------------------
|
||||
# Print summary for doc reference
|
||||
# ------------------------------------------------------------------
|
||||
print("\n=== Cache efficiency stats ===")
|
||||
print(f"KVC v2: total_input={sum(r['input_length'] for r in kvc)/1e6:.1f}M tokens")
|
||||
print(f" total_cached={sum(r.get('cached_tokens',0) for r in kvc)/1e6:.1f}M tokens")
|
||||
print(f" hit rate {sum(r.get('cached_tokens',0) for r in kvc)/sum(r['input_length'] for r in kvc)*100:.2f}%")
|
||||
print(f" mean uncached {np.mean(kvc_x):.0f} p50 {kvc_p50:.0f} p90 {np.quantile(kvc_x, 0.9):.0f}")
|
||||
|
||||
print(f"\nDP 4w: total_input={sum(r['input_length'] for r in dp)/1e6:.1f}M tokens")
|
||||
print(f" total_cached={sum(r.get('cached_tokens',0) for r in dp)/1e6:.1f}M tokens")
|
||||
print(f" hit rate {sum(r.get('cached_tokens',0) for r in dp)/sum(r['input_length'] for r in dp)*100:.2f}%")
|
||||
print(f" mean uncached {np.mean(dp_x):.0f} p50 {dp_p50:.0f} p90 {np.quantile(dp_x, 0.9):.0f}")
|
||||
|
||||
print(f"\nMid-turn region (8-25): KVC {drift_kvc*100:.2f}% DP {drift_dp*100:.2f}% (gap {(drift_kvc-drift_dp)*100:+.2f}pp)")
|
||||
|
||||
|
||||
if __name__ == "__main__":
|
||||
main()
|
||||
249
scripts/analysis/plot_gpu_utilization.py
Normal file
249
scripts/analysis/plot_gpu_utilization.py
Normal file
@@ -0,0 +1,249 @@
|
||||
#!/usr/bin/env python3
|
||||
"""Per-GPU utilization breakdown: KVC 1P3D v2 vs 4-way DP CA.
|
||||
|
||||
Generates docs/figures/gpu_utilization.png — two-panel:
|
||||
left: per-GPU request count
|
||||
right: per-GPU compute work (uncached prefill tokens + decode tokens, stacked)
|
||||
|
||||
The point of the figure is to push back on the naïve reading
|
||||
"KVC's prefill GPU is idle 90% of the time, so KVC is using fewer GPUs."
|
||||
|
||||
By request count, the prefill GPU is indeed touched by only ~8% of requests.
|
||||
By compute work, the prefill GPU bears comparable per-GPU load to each
|
||||
decode GPU — it is a low-frequency, high-cost safety net for cache misses,
|
||||
not idle capacity.
|
||||
|
||||
Work attribution:
|
||||
KVC direct-to-D path: prefill happens locally on the assigned D worker
|
||||
(append-prefill of `uncached_tokens` tokens).
|
||||
KVC seed/reseed/fallback path: prefill happens on prefill-0
|
||||
(full uncached_tokens), decode on assigned D.
|
||||
DP: all work on assigned direct-N worker.
|
||||
|
||||
Aborted / errored requests are excluded.
|
||||
"""
|
||||
|
||||
from __future__ import annotations
|
||||
|
||||
import json
|
||||
from collections import defaultdict
|
||||
from pathlib import Path
|
||||
|
||||
import matplotlib.pyplot as plt
|
||||
import numpy as np
|
||||
|
||||
ROOT = Path(__file__).resolve().parents[2]
|
||||
KVC = ROOT / "outputs/qwen3-30b-tp1-ts1-migration-v2/kvc_1p3d_migration_v2_run1_metrics.jsonl"
|
||||
DP = ROOT / "outputs/qwen3-30b-tp1-ts1-validation/dp4_metrics.jsonl"
|
||||
OUT = ROOT / "docs/figures/gpu_utilization.png"
|
||||
|
||||
|
||||
def load(p: Path) -> list[dict]:
|
||||
return [json.loads(line) for line in p.open()]
|
||||
|
||||
|
||||
def is_failed(r: dict) -> bool:
|
||||
if r.get("error"):
|
||||
return True
|
||||
fr = r.get("finish_reason")
|
||||
if fr and ("abort" in str(fr).lower() or "badrequest" in str(fr).lower()):
|
||||
return True
|
||||
return False
|
||||
|
||||
|
||||
def uncached(r: dict) -> int:
|
||||
return max(0, r["input_length"] - r.get("cached_tokens", 0))
|
||||
|
||||
|
||||
def out_tokens(r: dict) -> int:
|
||||
return r.get("actual_output_tokens") or r.get("output_length") or 0
|
||||
|
||||
|
||||
def main() -> None:
|
||||
kvc = [r for r in load(KVC) if not is_failed(r)]
|
||||
dp = [r for r in load(DP) if not is_failed(r)]
|
||||
|
||||
# ------------------------------------------------------------------
|
||||
# KVC per-GPU attribution
|
||||
# ------------------------------------------------------------------
|
||||
kvc_req_count = defaultdict(int)
|
||||
kvc_prefill_tokens = defaultdict(int) # uncached prefill compute
|
||||
kvc_decode_tokens = defaultdict(int)
|
||||
|
||||
for r in kvc:
|
||||
d = r["assigned_decode_node"] # decode-0/1/2
|
||||
p = r["assigned_prefill_node"] # prefill-0
|
||||
mode = r.get("execution_mode", "")
|
||||
if mode == "kvcache-direct-to-d-session":
|
||||
# P is bypassed entirely; D does the append-prefill + decode
|
||||
kvc_req_count[d] += 1
|
||||
kvc_prefill_tokens[d] += uncached(r)
|
||||
kvc_decode_tokens[d] += out_tokens(r)
|
||||
else:
|
||||
# P does the full prefill; D handles decode
|
||||
kvc_req_count[p] += 1
|
||||
kvc_req_count[d] += 1 # decode side still counts
|
||||
kvc_prefill_tokens[p] += uncached(r)
|
||||
kvc_decode_tokens[d] += out_tokens(r)
|
||||
|
||||
# ------------------------------------------------------------------
|
||||
# DP per-GPU attribution (fused P+D on every worker)
|
||||
# ------------------------------------------------------------------
|
||||
dp_req_count = defaultdict(int)
|
||||
dp_prefill_tokens = defaultdict(int)
|
||||
dp_decode_tokens = defaultdict(int)
|
||||
|
||||
for r in dp:
|
||||
w = r["assigned_decode_node"] # direct-0..3
|
||||
dp_req_count[w] += 1
|
||||
dp_prefill_tokens[w] += uncached(r)
|
||||
dp_decode_tokens[w] += out_tokens(r)
|
||||
|
||||
# ------------------------------------------------------------------
|
||||
# Build ordered GPU list, KVC then DP
|
||||
# ------------------------------------------------------------------
|
||||
kvc_gpus = ["prefill-0", "decode-0", "decode-1", "decode-2"]
|
||||
dp_gpus = ["direct-0", "direct-1", "direct-2", "direct-3"]
|
||||
all_gpus = kvc_gpus + dp_gpus
|
||||
|
||||
def get(d, k):
|
||||
return d.get(k, 0)
|
||||
|
||||
counts = [get(kvc_req_count, g) for g in kvc_gpus] + \
|
||||
[get(dp_req_count, g) for g in dp_gpus]
|
||||
prefill_tk = [get(kvc_prefill_tokens, g) for g in kvc_gpus] + \
|
||||
[get(dp_prefill_tokens, g) for g in dp_gpus]
|
||||
decode_tk = [get(kvc_decode_tokens, g) for g in kvc_gpus] + \
|
||||
[get(dp_decode_tokens, g) for g in dp_gpus]
|
||||
|
||||
# Display labels: P/D role + worker id
|
||||
labels = [
|
||||
"KVC P\nprefill-0",
|
||||
"KVC D\ndecode-0",
|
||||
"KVC D\ndecode-1",
|
||||
"KVC D\ndecode-2",
|
||||
"DP P+D\ndirect-0",
|
||||
"DP P+D\ndirect-1",
|
||||
"DP P+D\ndirect-2",
|
||||
"DP P+D\ndirect-3",
|
||||
]
|
||||
kvc_mask = [True, True, True, True, False, False, False, False]
|
||||
|
||||
KVC_P_COLOR = "#E89D44" # orange — P GPU stands out
|
||||
KVC_D_COLOR = "#1F77B4" # blue
|
||||
DP_COLOR = "#D62728" # red
|
||||
|
||||
bar_colors = [KVC_P_COLOR, KVC_D_COLOR, KVC_D_COLOR, KVC_D_COLOR,
|
||||
DP_COLOR, DP_COLOR, DP_COLOR, DP_COLOR]
|
||||
|
||||
fig, axes = plt.subplots(1, 2, figsize=(15, 7.0))
|
||||
x = np.arange(len(all_gpus))
|
||||
|
||||
# -- Left: per-GPU request count ----------------------------------
|
||||
ax = axes[0]
|
||||
bars = ax.bar(x, counts, color=bar_colors, edgecolor="black", linewidth=0.6)
|
||||
for xi, c in zip(x, counts):
|
||||
ax.text(xi, c + max(counts) * 0.015, f"{c:,}",
|
||||
ha="center", va="bottom", fontsize=9.5)
|
||||
ax.set_xticks(x)
|
||||
ax.set_xticklabels(labels, fontsize=9.5)
|
||||
ax.set_ylabel("Number of requests touching this GPU", fontsize=11)
|
||||
# Headroom for the annotation: extend ylim 35% above tallest bar
|
||||
ax.set_ylim(0, max(counts) * 1.40)
|
||||
ax.set_title("Per-GPU request count\n(naïve view: P seems idle)",
|
||||
fontsize=12, pad=24)
|
||||
ax.grid(axis="y", linestyle=":", alpha=0.4)
|
||||
ax.set_axisbelow(True)
|
||||
|
||||
# Annotate: KVC P GPU is "low frequency"
|
||||
# Place in upper-right area (over DP group) so it doesn't sit on KVC D bars
|
||||
p_idx = 0
|
||||
ax.annotate(
|
||||
f"P GPU only sees\n"
|
||||
f"{counts[p_idx]:,} requests\n"
|
||||
f"({counts[p_idx]/len(kvc)*100:.1f}% of all KVC requests)",
|
||||
xy=(p_idx, counts[p_idx]),
|
||||
xytext=(2.4, max(counts) * 1.20),
|
||||
fontsize=10, color=KVC_P_COLOR, fontweight="bold", ha="center",
|
||||
bbox=dict(facecolor="white", edgecolor=KVC_P_COLOR, alpha=0.92, pad=4),
|
||||
arrowprops=dict(arrowstyle="->", color=KVC_P_COLOR, lw=1.0),
|
||||
)
|
||||
|
||||
# -- Right: per-GPU compute work (stacked prefill + decode) -------
|
||||
ax = axes[1]
|
||||
prefill_M = [t / 1e6 for t in prefill_tk]
|
||||
decode_M = [t / 1e6 for t in decode_tk]
|
||||
total_M = [p + d for p, d in zip(prefill_M, decode_M)]
|
||||
|
||||
bars_p = ax.bar(x, prefill_M, color=[c for c in bar_colors],
|
||||
edgecolor="black", linewidth=0.6, label="Uncached prefill tokens",
|
||||
alpha=0.95)
|
||||
bars_d = ax.bar(x, decode_M, bottom=prefill_M, color=[c for c in bar_colors],
|
||||
edgecolor="black", linewidth=0.6, hatch="///",
|
||||
label="Decode tokens", alpha=0.55)
|
||||
|
||||
for xi, t in zip(x, total_M):
|
||||
ax.text(xi, t + max(total_M) * 0.015, f"{t:.2f}M",
|
||||
ha="center", va="bottom", fontsize=9.5)
|
||||
|
||||
ax.set_xticks(x)
|
||||
ax.set_xticklabels(labels, fontsize=9.5)
|
||||
ax.set_ylabel("Compute tokens (millions)", fontsize=11)
|
||||
# Headroom for the annotation
|
||||
ax.set_ylim(0, max(total_M) * 1.45)
|
||||
ax.set_title("Per-GPU compute work\n(work view: P is comparable to each D)",
|
||||
fontsize=12, pad=24)
|
||||
ax.grid(axis="y", linestyle=":", alpha=0.4)
|
||||
ax.set_axisbelow(True)
|
||||
# Legend placed at upper-left where bars are tallest is fine after raising ylim
|
||||
ax.legend(loc="upper left", fontsize=10, framealpha=0.95)
|
||||
|
||||
# Annotate: KVC P GPU does similar work to each D.
|
||||
# Place over DP region (right side) so it doesn't sit on KVC D bars.
|
||||
ax.annotate(
|
||||
f"P GPU does {total_M[p_idx]:.2f}M tokens of prefill\n"
|
||||
f"— comparable per-GPU load to each KVC D worker\n"
|
||||
f"(KVC D avg = {np.mean(total_M[1:4]):.2f}M)",
|
||||
xy=(p_idx, total_M[p_idx]),
|
||||
xytext=(5.5, max(total_M) * 1.30),
|
||||
fontsize=10, color=KVC_P_COLOR, fontweight="bold", ha="center",
|
||||
bbox=dict(facecolor="white", edgecolor=KVC_P_COLOR, alpha=0.92, pad=4),
|
||||
arrowprops=dict(arrowstyle="->", color=KVC_P_COLOR, lw=1.0),
|
||||
)
|
||||
|
||||
# Separator + group labels (placed in axes-fraction coords, below subplot
|
||||
# title at pad=24 we now have safe room for these at y_axes_frac ≈ 1.02)
|
||||
for ax in axes:
|
||||
ax.axvline(3.5, color="gray", linestyle="--", linewidth=1.0, alpha=0.5)
|
||||
ax.text(0.25, 1.02, "KVC 1P3D",
|
||||
transform=ax.transAxes, ha="center", va="bottom",
|
||||
fontsize=11.5, fontweight="bold", color="#444",
|
||||
bbox=dict(facecolor="#F2F2F2", edgecolor="#888",
|
||||
alpha=0.85, pad=3))
|
||||
ax.text(0.75, 1.02, "DP 4-way CA",
|
||||
transform=ax.transAxes, ha="center", va="bottom",
|
||||
fontsize=11.5, fontweight="bold", color="#444",
|
||||
bbox=dict(facecolor="#F2F2F2", edgecolor="#888",
|
||||
alpha=0.85, pad=3))
|
||||
|
||||
fig.suptitle(
|
||||
"Per-GPU utilization: \"is KVC's prefill GPU wasted?\"\n"
|
||||
"Left view says yes (only 8% of requests); right view says no (comparable work to each D).",
|
||||
fontsize=13, y=1.02,
|
||||
)
|
||||
plt.tight_layout()
|
||||
plt.savefig(OUT, dpi=150, bbox_inches="tight")
|
||||
print(f"wrote {OUT}")
|
||||
plt.close(fig)
|
||||
|
||||
# ------------------------------------------------------------------
|
||||
# Print numbers for doc reference
|
||||
# ------------------------------------------------------------------
|
||||
print("\n=== Per-GPU numbers ===")
|
||||
print(f"{'GPU':<22} {'requests':>10} {'prefill(M)':>12} {'decode(M)':>12} {'total(M)':>10}")
|
||||
for lbl, n, pM, dM in zip(labels, counts, prefill_M, decode_M):
|
||||
print(f" {lbl.replace(chr(10), ' '):<20} {n:>10} {pM:>12.3f} {dM:>12.3f} {pM+dM:>10.3f}")
|
||||
|
||||
|
||||
if __name__ == "__main__":
|
||||
main()
|
||||
199
scripts/analysis/plot_ttft_pdf.py
Normal file
199
scripts/analysis/plot_ttft_pdf.py
Normal file
@@ -0,0 +1,199 @@
|
||||
#!/usr/bin/env python3
|
||||
"""Generate TTFT probability density curves: KVC 1P3D v2 vs 4-way DP CA.
|
||||
|
||||
Inputs:
|
||||
outputs/qwen3-30b-tp1-ts1-migration-v2/kvc_1p3d_migration_v2_run1_metrics.jsonl
|
||||
outputs/qwen3-30b-tp1-ts1-validation/dp4_metrics.jsonl
|
||||
|
||||
Outputs:
|
||||
docs/figures/ttft_pdf_comparison.png -- two-panel figure:
|
||||
left panel: linear x in [0, 1.0]s zoomed on the body
|
||||
right panel: log x covering full range (0.01 -- 10 s)
|
||||
Each KDE curve uses scipy.stats.gaussian_kde with Scott's rule bandwidth.
|
||||
Aborted requests are excluded (same filter as metrics.py:_is_failed_request).
|
||||
"""
|
||||
|
||||
from __future__ import annotations
|
||||
|
||||
import json
|
||||
from pathlib import Path
|
||||
|
||||
import matplotlib.pyplot as plt
|
||||
import numpy as np
|
||||
from scipy.stats import gaussian_kde
|
||||
|
||||
ROOT = Path(__file__).resolve().parents[2]
|
||||
KVC = ROOT / "outputs/qwen3-30b-tp1-ts1-migration-v2/kvc_1p3d_migration_v2_run1_metrics.jsonl"
|
||||
DP = ROOT / "outputs/qwen3-30b-tp1-ts1-validation/dp4_metrics.jsonl"
|
||||
OUT = ROOT / "docs/figures/ttft_pdf_comparison.png"
|
||||
|
||||
|
||||
def load(p: Path) -> list[dict]:
|
||||
return [json.loads(line) for line in p.open()]
|
||||
|
||||
|
||||
def is_failed(r: dict) -> bool:
|
||||
if r.get("error"):
|
||||
return True
|
||||
fr = r.get("finish_reason")
|
||||
if fr and ("abort" in str(fr).lower() or "badrequest" in str(fr).lower()):
|
||||
return True
|
||||
return False
|
||||
|
||||
|
||||
def pct(vals: np.ndarray, q: float) -> float:
|
||||
return float(np.quantile(vals, q))
|
||||
|
||||
|
||||
def main() -> None:
|
||||
kvc = [r for r in load(KVC) if not is_failed(r)]
|
||||
dp = [r for r in load(DP) if not is_failed(r)]
|
||||
|
||||
kvc_ttft = np.array([r["ttft_s"] for r in kvc if r.get("ttft_s") is not None])
|
||||
dp_ttft = np.array([r["ttft_s"] for r in dp if r.get("ttft_s") is not None])
|
||||
|
||||
# Trim absurdly small zeros (rare measurement artifacts) so log KDE behaves.
|
||||
kvc_ttft = kvc_ttft[kvc_ttft > 1e-4]
|
||||
dp_ttft = dp_ttft[dp_ttft > 1e-4]
|
||||
|
||||
KVC_COLOR = "#1F77B4" # blue
|
||||
DP_COLOR = "#D62728" # red
|
||||
|
||||
fig, axes = plt.subplots(1, 2, figsize=(16, 6.5))
|
||||
|
||||
# ------------------------------------------------------------------
|
||||
# Left panel: linear x ∈ [0, 0.6]s -- body of the distribution
|
||||
# ------------------------------------------------------------------
|
||||
ax = axes[0]
|
||||
x_body = np.linspace(0.0, 0.6, 600)
|
||||
|
||||
# KDE on linear ttft values, clipped to body
|
||||
kde_kvc_lin = gaussian_kde(kvc_ttft, bw_method=0.15)
|
||||
kde_dp_lin = gaussian_kde(dp_ttft, bw_method=0.15)
|
||||
|
||||
ax.plot(x_body, kde_kvc_lin(x_body),
|
||||
color=KVC_COLOR, lw=2.5, label=f"KVC 1P3D v2 (n={len(kvc_ttft)})")
|
||||
ax.fill_between(x_body, kde_kvc_lin(x_body), alpha=0.20, color=KVC_COLOR)
|
||||
ax.plot(x_body, kde_dp_lin(x_body),
|
||||
color=DP_COLOR, lw=2.5, label=f"4-way DP CA (n={len(dp_ttft)})")
|
||||
ax.fill_between(x_body, kde_dp_lin(x_body), alpha=0.20, color=DP_COLOR)
|
||||
|
||||
# Vertical lines for p50, p90
|
||||
for q, ls in [(0.50, "-"), (0.90, "--")]:
|
||||
ax.axvline(pct(kvc_ttft, q), color=KVC_COLOR, ls=ls, alpha=0.55, lw=1.1)
|
||||
ax.axvline(pct(dp_ttft, q), color=DP_COLOR, ls=ls, alpha=0.55, lw=1.1)
|
||||
ymax = ax.get_ylim()[1]
|
||||
ax.text(pct(kvc_ttft, 0.50), ymax * 0.97,
|
||||
f"KVC p50\n{pct(kvc_ttft, 0.50)*1000:.0f}ms",
|
||||
color=KVC_COLOR, fontsize=9, va="top", ha="left",
|
||||
bbox=dict(facecolor="white", edgecolor="none", alpha=0.7, pad=2))
|
||||
ax.text(pct(dp_ttft, 0.50), ymax * 0.50,
|
||||
f"DP p50\n{pct(dp_ttft, 0.50)*1000:.0f}ms",
|
||||
color=DP_COLOR, fontsize=9, va="top", ha="left",
|
||||
bbox=dict(facecolor="white", edgecolor="none", alpha=0.7, pad=2))
|
||||
ax.text(pct(kvc_ttft, 0.90), ymax * 0.30,
|
||||
f"KVC p90\n{pct(kvc_ttft, 0.90)*1000:.0f}ms",
|
||||
color=KVC_COLOR, fontsize=9, va="top", ha="left",
|
||||
bbox=dict(facecolor="white", edgecolor="none", alpha=0.7, pad=2))
|
||||
ax.text(pct(dp_ttft, 0.90), ymax * 0.18,
|
||||
f"DP p90\n{pct(dp_ttft, 0.90)*1000:.0f}ms",
|
||||
color=DP_COLOR, fontsize=9, va="top", ha="left",
|
||||
bbox=dict(facecolor="white", edgecolor="none", alpha=0.7, pad=2))
|
||||
|
||||
ax.set_xlim(0, 0.6)
|
||||
ax.set_xlabel("TTFT (seconds, linear)", fontsize=11)
|
||||
ax.set_ylabel("Probability density", fontsize=11)
|
||||
ax.set_title("Body of distribution (TTFT ≤ 0.6 s)", fontsize=12, pad=10)
|
||||
ax.legend(loc="upper right", fontsize=10, framealpha=0.95)
|
||||
ax.grid(True, linestyle=":", alpha=0.4)
|
||||
ax.set_axisbelow(True)
|
||||
|
||||
# ------------------------------------------------------------------
|
||||
# Right panel: log x ∈ [0.01, 10]s -- full range incl. tail
|
||||
# PDF on log-x: we plot density vs log10(t) so the curve integrates
|
||||
# to 1 over log space (standard "log-density" presentation).
|
||||
# ------------------------------------------------------------------
|
||||
ax = axes[1]
|
||||
# KDE on log10(ttft) so the resulting curve integrates to 1 over log10 t
|
||||
kde_kvc_log = gaussian_kde(np.log10(kvc_ttft), bw_method="scott")
|
||||
kde_dp_log = gaussian_kde(np.log10(dp_ttft), bw_method="scott")
|
||||
log_x = np.linspace(np.log10(0.01), np.log10(10.0), 600)
|
||||
x_full = 10 ** log_x
|
||||
|
||||
y_kvc = kde_kvc_log(log_x)
|
||||
y_dp = kde_dp_log(log_x)
|
||||
|
||||
ax.plot(x_full, y_kvc, color=KVC_COLOR, lw=2.5, label=f"KVC 1P3D v2 (n={len(kvc_ttft)})")
|
||||
ax.fill_between(x_full, y_kvc, alpha=0.20, color=KVC_COLOR)
|
||||
ax.plot(x_full, y_dp, color=DP_COLOR, lw=2.5, label=f"4-way DP CA (n={len(dp_ttft)})")
|
||||
ax.fill_between(x_full, y_dp, alpha=0.20, color=DP_COLOR)
|
||||
|
||||
ax.set_xscale("log")
|
||||
ax.set_xlim(0.01, 10.0)
|
||||
|
||||
# Percentile markers
|
||||
quartile_styles = [(0.50, "-", "p50"), (0.90, "--", "p90"), (0.99, ":", "p99")]
|
||||
for q, ls, name in quartile_styles:
|
||||
ax.axvline(pct(kvc_ttft, q), color=KVC_COLOR, ls=ls, alpha=0.55, lw=1.1)
|
||||
ax.axvline(pct(dp_ttft, q), color=DP_COLOR, ls=ls, alpha=0.55, lw=1.1)
|
||||
|
||||
# Annotate p99 specifically since this is the key reviewer-targeted callout
|
||||
ymax = max(y_kvc.max(), y_dp.max())
|
||||
kvc_p99 = pct(kvc_ttft, 0.99)
|
||||
dp_p99 = pct(dp_ttft, 0.99)
|
||||
ax.annotate(f"KVC p99 = {kvc_p99:.2f}s\n(slow-path reseed tail)",
|
||||
xy=(kvc_p99, kde_kvc_log(np.log10(kvc_p99))[0]),
|
||||
xytext=(2.0, ymax * 0.65),
|
||||
fontsize=10, color=KVC_COLOR, fontweight="bold",
|
||||
arrowprops=dict(arrowstyle="->", color=KVC_COLOR, lw=1.0))
|
||||
ax.annotate(f"DP p99 = {dp_p99*1000:.0f}ms",
|
||||
xy=(dp_p99, kde_dp_log(np.log10(dp_p99))[0]),
|
||||
xytext=(0.025, ymax * 0.80),
|
||||
fontsize=10, color=DP_COLOR, fontweight="bold",
|
||||
arrowprops=dict(arrowstyle="->", color=DP_COLOR, lw=1.0))
|
||||
# Highlight the KVC bimodal structure
|
||||
ax.annotate("KVC fast path\n(direct-to-D, 91.6%)",
|
||||
xy=(0.05, y_kvc[np.argmin(np.abs(x_full - 0.05))]),
|
||||
xytext=(0.012, ymax * 0.45),
|
||||
fontsize=9, color=KVC_COLOR, style="italic",
|
||||
arrowprops=dict(arrowstyle="->", color=KVC_COLOR, lw=0.7, alpha=0.6))
|
||||
ax.annotate("KVC slow path\n(reseed, ~3.4%)",
|
||||
xy=(2.5, y_kvc[np.argmin(np.abs(x_full - 2.5))]),
|
||||
xytext=(3.0, ymax * 0.30),
|
||||
fontsize=9, color=KVC_COLOR, style="italic",
|
||||
arrowprops=dict(arrowstyle="->", color=KVC_COLOR, lw=0.7, alpha=0.6))
|
||||
|
||||
# Custom tick labels in seconds (instead of 10^-2, 10^-1, 10^0, 10^1)
|
||||
ax.set_xticks([0.01, 0.05, 0.1, 0.5, 1.0, 5.0, 10.0])
|
||||
ax.set_xticklabels(["10ms", "50ms", "100ms", "500ms", "1s", "5s", "10s"])
|
||||
|
||||
ax.set_xlabel("TTFT (log scale)", fontsize=11)
|
||||
ax.set_ylabel("Density (per log₁₀ s)", fontsize=11)
|
||||
ax.set_title("Full range (TTFT 10 ms – 10 s, log x)", fontsize=12, pad=10)
|
||||
ax.legend(loc="upper left", fontsize=10, framealpha=0.95)
|
||||
ax.grid(True, which="both", linestyle=":", alpha=0.4)
|
||||
ax.set_axisbelow(True)
|
||||
|
||||
fig.suptitle(
|
||||
"TTFT probability density: KVC 1P3D v2 vs 4-way DP CA\n"
|
||||
"SWE-Bench 50sess trace · ts=1 · 4× H100 80GB · aborted/error requests excluded",
|
||||
fontsize=13, y=1.02,
|
||||
)
|
||||
plt.tight_layout()
|
||||
plt.savefig(OUT, dpi=150, bbox_inches="tight")
|
||||
print(f"wrote {OUT}")
|
||||
plt.close(fig)
|
||||
|
||||
# ------------------------------------------------------------------
|
||||
# Print summary stats for doc cross-reference
|
||||
# ------------------------------------------------------------------
|
||||
print(f"\n=== TTFT distribution summary ===")
|
||||
for name, arr in [("KVC v2", kvc_ttft), ("DP 4w", dp_ttft)]:
|
||||
print(f" {name} (n={len(arr)})")
|
||||
print(f" min={arr.min()*1000:.1f}ms p10={pct(arr,0.10)*1000:.1f}ms "
|
||||
f"p50={pct(arr,0.50)*1000:.1f}ms p90={pct(arr,0.90)*1000:.1f}ms "
|
||||
f"p99={pct(arr,0.99)*1000:.1f}ms max={arr.max()*1000:.1f}ms")
|
||||
|
||||
|
||||
if __name__ == "__main__":
|
||||
main()
|
||||
223
scripts/analysis/plot_v2_path_breakdown.py
Normal file
223
scripts/analysis/plot_v2_path_breakdown.py
Normal file
@@ -0,0 +1,223 @@
|
||||
#!/usr/bin/env python3
|
||||
"""Generate the two figures referenced by docs/V2_DEEP_ANALYSIS_ZH.md §3.1 and §3.2.
|
||||
|
||||
Inputs:
|
||||
outputs/qwen3-30b-tp1-ts1-migration-v2/kvc_1p3d_migration_v2_run1_metrics.jsonl
|
||||
outputs/qwen3-30b-tp1-ts1-validation/dp4_metrics.jsonl
|
||||
|
||||
Outputs:
|
||||
docs/figures/v2_execution_mode_distribution.png (for §3.1)
|
||||
docs/figures/v2_path_level_latency.png (for §3.2)
|
||||
"""
|
||||
|
||||
from __future__ import annotations
|
||||
|
||||
import json
|
||||
import statistics
|
||||
from collections import Counter, defaultdict
|
||||
from pathlib import Path
|
||||
|
||||
import matplotlib.pyplot as plt
|
||||
import numpy as np
|
||||
|
||||
ROOT = Path(__file__).resolve().parents[2]
|
||||
KVC = ROOT / "outputs/qwen3-30b-tp1-ts1-migration-v2/kvc_1p3d_migration_v2_run1_metrics.jsonl"
|
||||
DP = ROOT / "outputs/qwen3-30b-tp1-ts1-validation/dp4_metrics.jsonl"
|
||||
OUT = ROOT / "docs/figures"
|
||||
OUT.mkdir(parents=True, exist_ok=True)
|
||||
|
||||
|
||||
def load(p: Path) -> list[dict]:
|
||||
return [json.loads(line) for line in p.open()]
|
||||
|
||||
|
||||
def is_failed(r: dict) -> bool:
|
||||
if r.get("error"):
|
||||
return True
|
||||
fr = r.get("finish_reason")
|
||||
if fr and ("abort" in str(fr).lower() or "badrequest" in str(fr).lower()):
|
||||
return True
|
||||
return False
|
||||
|
||||
|
||||
def pct(vals: list[float], q: float) -> float:
|
||||
s = sorted(vals)
|
||||
if not s:
|
||||
return float("nan")
|
||||
return s[max(0, min(len(s) - 1, int(len(s) * q)))]
|
||||
|
||||
|
||||
def main() -> None:
|
||||
kvc = load(KVC)
|
||||
dp = load(DP)
|
||||
|
||||
kvc_ok = [r for r in kvc if not is_failed(r)]
|
||||
dp_ok = [r for r in dp if not is_failed(r)]
|
||||
|
||||
# ------------------------------------------------------------------
|
||||
# Figure 1: §3.1 execution_mode distribution (horizontal bar)
|
||||
# Use ALL rows (incl. failures) so percentages match the doc's 91.6%
|
||||
# ------------------------------------------------------------------
|
||||
mode_counts = Counter(r["execution_mode"] for r in kvc)
|
||||
total_kvc = len(kvc)
|
||||
|
||||
short_label = {
|
||||
"kvcache-direct-to-d-session": "direct-to-D-session (fast path)",
|
||||
"pd-router-d-session-reseed": "d-session-reseed (mooncake reseed)",
|
||||
"pd-router-fallback-session-not-resident-session-cap":
|
||||
"fallback: session-not-resident + session-cap",
|
||||
"pd-router-fallback-session-not-resident-seed-filter-early-turn":
|
||||
"fallback: session-not-resident + seed-filter",
|
||||
"pd-router-turn1-seed": "turn1-seed (first turn of each session)",
|
||||
"pd-router-fallback-no-d-capacity": "fallback: no-d-capacity",
|
||||
"pd-router-fallback-real-large-append-session-cap":
|
||||
"fallback: real-large-append",
|
||||
"pd-router-fallback-policy-no-bypass-session-cap":
|
||||
"fallback: policy-no-bypass",
|
||||
"pd-router-d-session-reseed-after-eviction":
|
||||
"d-session-reseed-after-eviction",
|
||||
"kvcache-centric": "kvcache-centric (admit-but-then-error)",
|
||||
}
|
||||
sorted_modes = mode_counts.most_common()
|
||||
labels = [short_label.get(m, m) for m, _ in sorted_modes]
|
||||
counts = [c for _, c in sorted_modes]
|
||||
pcts = [c / total_kvc * 100 for c in counts]
|
||||
|
||||
is_fast = ["direct-to-D" in lbl for lbl in labels]
|
||||
colors = ["#2C8C2C" if f else "#D62728" for f in is_fast]
|
||||
|
||||
fig, ax = plt.subplots(figsize=(11, 5.5))
|
||||
y = np.arange(len(labels))[::-1]
|
||||
ax.barh(y, counts, color=colors, edgecolor="black", linewidth=0.5)
|
||||
ax.set_yticks(y)
|
||||
ax.set_yticklabels(labels, fontsize=10)
|
||||
ax.set_xscale("log")
|
||||
ax.set_xlabel("Request count (log scale)", fontsize=11)
|
||||
ax.set_xlim(left=1)
|
||||
|
||||
# Annotate count + percentage at end of each bar
|
||||
for yi, (c, p) in zip(y, zip(counts, pcts)):
|
||||
ax.text(c * 1.05, yi, f"{c} ({p:.1f}%)",
|
||||
va="center", fontsize=9.5)
|
||||
|
||||
ax.set_title(
|
||||
f"KVC v2 execution_mode distribution (n = {total_kvc} total requests)\n"
|
||||
"green = fast path (direct-to-D), red = slow / fallback / failure paths",
|
||||
fontsize=12, pad=12,
|
||||
)
|
||||
ax.grid(axis="x", linestyle=":", alpha=0.4)
|
||||
ax.set_axisbelow(True)
|
||||
plt.tight_layout()
|
||||
out1 = OUT / "v2_execution_mode_distribution.png"
|
||||
plt.savefig(out1, dpi=150)
|
||||
print(f"wrote {out1}")
|
||||
plt.close(fig)
|
||||
|
||||
# ------------------------------------------------------------------
|
||||
# Figure 2: §3.2 path-level latency (grouped bars, log y)
|
||||
# ------------------------------------------------------------------
|
||||
|
||||
# Group KVC paths semantically
|
||||
def kvc_group(mode: str) -> str:
|
||||
if mode == "kvcache-direct-to-d-session":
|
||||
return "KVC direct-to-D\n(fast path, 91.6%)"
|
||||
if "reseed" in mode:
|
||||
return "KVC reseed\n(slow path, 3.4%)"
|
||||
if "no-d-capacity" in mode:
|
||||
return "KVC no-d-capacity\n(fallback, 0.7%)"
|
||||
if "session-not-resident" in mode:
|
||||
return "KVC session-not-resident\n(misc, 2.3%)"
|
||||
return "KVC other\n(<2%)"
|
||||
|
||||
groups = defaultdict(list)
|
||||
for r in kvc_ok:
|
||||
groups[kvc_group(r["execution_mode"])].append(r)
|
||||
|
||||
# Order paths by intuitive progression (fast → slow)
|
||||
ordered_paths = [
|
||||
"KVC direct-to-D\n(fast path, 91.6%)",
|
||||
"KVC session-not-resident\n(misc, 2.3%)",
|
||||
"KVC reseed\n(slow path, 3.4%)",
|
||||
"KVC no-d-capacity\n(fallback, 0.7%)",
|
||||
]
|
||||
# Filter to only ones present
|
||||
ordered_paths = [p for p in ordered_paths if p in groups]
|
||||
ordered_paths.append("DP dp-colo-router\n(100%)")
|
||||
|
||||
def stats(rows: list[dict]) -> dict[str, float]:
|
||||
ttfts = [r["ttft_s"] for r in rows if r.get("ttft_s") is not None]
|
||||
lats = [r["latency_s"] for r in rows if r.get("latency_s") is not None]
|
||||
return {
|
||||
"n": len(rows),
|
||||
"ttft_p50": pct(ttfts, 0.50),
|
||||
"ttft_p99": pct(ttfts, 0.99),
|
||||
"lat_p50": pct(lats, 0.50),
|
||||
}
|
||||
|
||||
path_stats = {p: stats(groups[p]) for p in ordered_paths if "DP" not in p}
|
||||
path_stats["DP dp-colo-router\n(100%)"] = stats(dp_ok)
|
||||
|
||||
metrics = [("TTFT p50", "ttft_p50"), ("TTFT p99", "ttft_p99"), ("Latency p50", "lat_p50")]
|
||||
bar_w = 0.25
|
||||
fig, ax = plt.subplots(figsize=(12, 6))
|
||||
x = np.arange(len(ordered_paths))
|
||||
|
||||
colors_metric = ["#1F77B4", "#FF7F0E", "#9467BD"]
|
||||
for i, (label, key) in enumerate(metrics):
|
||||
vals = [path_stats[p][key] for p in ordered_paths]
|
||||
bars = ax.bar(x + (i - 1) * bar_w, vals, bar_w, label=label,
|
||||
color=colors_metric[i], edgecolor="black", linewidth=0.4)
|
||||
for xi, v in zip(x + (i - 1) * bar_w, vals):
|
||||
if v > 0 and v == v: # not nan
|
||||
fmt = f"{v*1000:.0f}ms" if v < 1 else f"{v:.2f}s"
|
||||
ax.text(xi, v * 1.10, fmt,
|
||||
ha="center", va="bottom", fontsize=8.5, rotation=0)
|
||||
|
||||
ax.set_yscale("log")
|
||||
ax.set_xticks(x)
|
||||
ax.set_xticklabels(ordered_paths, fontsize=9.5)
|
||||
ax.set_ylabel("Latency (seconds, log scale)", fontsize=11)
|
||||
ax.set_title(
|
||||
"Path-level latency: KVC v2 paths vs DP single-path baseline\n"
|
||||
"log y-axis · same SWE-Bench 50sess trace · ts=1 · 4× H100 80GB",
|
||||
fontsize=12, pad=12,
|
||||
)
|
||||
ax.legend(loc="upper left", fontsize=10, framealpha=0.95)
|
||||
ax.grid(axis="y", linestyle=":", alpha=0.4, which="both")
|
||||
ax.set_axisbelow(True)
|
||||
|
||||
# Annotate sample counts under each path label
|
||||
ymin = ax.get_ylim()[0]
|
||||
for xi, p in zip(x, ordered_paths):
|
||||
n = path_stats[p]["n"]
|
||||
ax.text(xi, ymin * 0.5, f"n={n}", ha="center", va="top",
|
||||
fontsize=8.5, color="#555")
|
||||
|
||||
plt.tight_layout()
|
||||
out2 = OUT / "v2_path_level_latency.png"
|
||||
plt.savefig(out2, dpi=150)
|
||||
print(f"wrote {out2}")
|
||||
plt.close(fig)
|
||||
|
||||
# ------------------------------------------------------------------
|
||||
# Print numeric values used (for doc reference)
|
||||
# ------------------------------------------------------------------
|
||||
print("\n=== Numeric values plotted ===")
|
||||
print("\nExecution mode counts (KVC v2):")
|
||||
for label, c, p in zip(labels, counts, pcts):
|
||||
print(f" {c:>5} ({p:>5.2f}%) {label}")
|
||||
|
||||
print("\nPath-level latency:")
|
||||
for p in ordered_paths:
|
||||
s = path_stats[p]
|
||||
nl = " | ".join([
|
||||
f"n={s['n']}",
|
||||
f"TTFT p50={s['ttft_p50']*1000:.1f}ms",
|
||||
f"TTFT p99={s['ttft_p99']*1000:.1f}ms",
|
||||
f"Lat p50={s['lat_p50']:.3f}s",
|
||||
])
|
||||
print(f" {p.replace(chr(10), ' '):<55} {nl}")
|
||||
|
||||
|
||||
if __name__ == "__main__":
|
||||
main()
|
||||
105
scripts/analysis/recompute_summary.py
Normal file
105
scripts/analysis/recompute_summary.py
Normal file
@@ -0,0 +1,105 @@
|
||||
#!/usr/bin/env python3
|
||||
"""Re-derive summary.json from existing metrics.jsonl using the fixed metrics.py.
|
||||
|
||||
Bug fixed: requests aborted by SGLang (e.g. input > max-input-len returns
|
||||
a fast 400 with latency_s ~ 0.08s) were previously counted in latency_stats
|
||||
as if successful, deflating mean/p50/p90. The fixed metrics.py excludes
|
||||
all failed requests (errors or aborts) from latency/ttft/tpot stats and
|
||||
exposes abort_count / failure_count.
|
||||
|
||||
Usage:
|
||||
python3 scripts/analysis/recompute_summary.py path/to/metrics.jsonl ...
|
||||
python3 scripts/analysis/recompute_summary.py --diff path/to/metrics.jsonl path/to/old_summary.json
|
||||
"""
|
||||
|
||||
from __future__ import annotations
|
||||
|
||||
import argparse
|
||||
import json
|
||||
import sys
|
||||
from pathlib import Path
|
||||
|
||||
sys.path.insert(0, str(Path(__file__).resolve().parents[2] / "src"))
|
||||
|
||||
from agentic_pd_hybrid.metrics import RequestMetrics, write_summary_json
|
||||
|
||||
|
||||
def load_rows(metrics_path: Path) -> list[RequestMetrics]:
|
||||
rows = []
|
||||
field_names = {f for f in RequestMetrics.__dataclass_fields__}
|
||||
with metrics_path.open() as handle:
|
||||
for line in handle:
|
||||
line = line.strip()
|
||||
if not line:
|
||||
continue
|
||||
raw = json.loads(line)
|
||||
kwargs = {k: raw.get(k) for k in field_names}
|
||||
rows.append(RequestMetrics(**kwargs))
|
||||
return rows
|
||||
|
||||
|
||||
def main() -> None:
|
||||
parser = argparse.ArgumentParser()
|
||||
parser.add_argument("metrics_paths", nargs="+", type=Path)
|
||||
parser.add_argument(
|
||||
"--out",
|
||||
type=Path,
|
||||
default=None,
|
||||
help="output summary path (default: alongside metrics with .recomputed_summary.json)",
|
||||
)
|
||||
parser.add_argument(
|
||||
"--diff",
|
||||
action="store_true",
|
||||
help="print before/after diff against the old <metrics>.summary.json",
|
||||
)
|
||||
args = parser.parse_args()
|
||||
|
||||
for metrics_path in args.metrics_paths:
|
||||
rows = load_rows(metrics_path)
|
||||
out_path = args.out or metrics_path.with_suffix(".recomputed_summary.json")
|
||||
write_summary_json(
|
||||
out_path,
|
||||
rows,
|
||||
trace_path=metrics_path,
|
||||
router_url=None,
|
||||
)
|
||||
new = json.load(out_path.open())
|
||||
print(f"\n=== {metrics_path} ===")
|
||||
print(f" written: {out_path}")
|
||||
print(f" total rows: {new['request_count']}")
|
||||
print(f" error_count: {new['error_count']}")
|
||||
print(f" abort_count: {new.get('abort_count', '?')}")
|
||||
print(f" failure_count: {new.get('failure_count', '?')}")
|
||||
ls = new.get("latency_stats_s", {}) or {}
|
||||
ts = new.get("ttft_stats_s", {}) or {}
|
||||
print(f" lat: n={ls.get('count')} mean={ls.get('mean'):.4f} p50={ls.get('p50'):.4f} p90={ls.get('p90'):.4f} p99={ls.get('p99'):.4f}")
|
||||
print(f" ttft: n={ts.get('count')} mean={ts.get('mean'):.4f} p50={ts.get('p50'):.4f} p90={ts.get('p90'):.4f} p99={ts.get('p99'):.4f}")
|
||||
|
||||
if args.diff:
|
||||
# find old summary (sibling file)
|
||||
candidates = [
|
||||
metrics_path.parent / f"{metrics_path.stem}.summary.json",
|
||||
metrics_path.with_suffix(".summary.json"),
|
||||
]
|
||||
old_path = next((p for p in candidates if p.exists()), None)
|
||||
if old_path:
|
||||
old = json.load(old_path.open())
|
||||
print(f" vs old {old_path}:")
|
||||
old_ls = old.get("latency_stats_s", {}) or {}
|
||||
old_ts = old.get("ttft_stats_s", {}) or {}
|
||||
for k in ("count", "mean", "p50", "p90", "p99"):
|
||||
o = old_ls.get(k)
|
||||
n = ls.get(k)
|
||||
if o is not None and n is not None:
|
||||
delta = n - o
|
||||
print(f" lat.{k}: {o:.4f} -> {n:.4f} ({delta:+.4f})")
|
||||
for k in ("count", "mean", "p50", "p90", "p99"):
|
||||
o = old_ts.get(k)
|
||||
n = ts.get(k)
|
||||
if o is not None and n is not None:
|
||||
delta = n - o
|
||||
print(f" ttft.{k}: {o:.4f} -> {n:.4f} ({delta:+.4f})")
|
||||
|
||||
|
||||
if __name__ == "__main__":
|
||||
main()
|
||||
227
scripts/analysis/stratified.py
Executable file
227
scripts/analysis/stratified.py
Executable file
@@ -0,0 +1,227 @@
|
||||
#!/usr/bin/env python3
|
||||
"""Stratified latency / TTFT reporter for paper-quality evaluation.
|
||||
|
||||
Implements docs/EVALUATION_PROTOCOL_ZH.md §1.3 (M3 fix): every headline
|
||||
number must be accompanied by a stratified breakdown so reviewers can
|
||||
see which slice the gains come from.
|
||||
|
||||
Buckets the request rows from one or more metrics.jsonl files along:
|
||||
- turn_id : {1, 2-5, 6-20, 21+}
|
||||
- input_length : {<=8K, 8K-64K, >64K}
|
||||
- overlap_ratio : {<=0.3, 0.3-0.7, >0.7}
|
||||
- append_tokens : input_length - observed_overlap_blocks * BLOCK_SIZE
|
||||
|
||||
For each bucket, reports:
|
||||
- n (total rows in bucket)
|
||||
- n_ok (rows with no error and latency_s set)
|
||||
- latency_s mean / p50 / p90 / p99
|
||||
- ttft_s mean / p50 / p90 / p99
|
||||
- err_pct (1 - n_ok/n)
|
||||
|
||||
Usage:
|
||||
scripts/analysis/stratified.py outputs/<run>/request-metrics.jsonl \
|
||||
[outputs/<other-run>/request-metrics.jsonl ...]
|
||||
scripts/analysis/stratified.py --dim turn_id outputs/<run>/request-metrics.jsonl
|
||||
scripts/analysis/stratified.py --json outputs/<run>/request-metrics.jsonl > strat.json
|
||||
|
||||
stdlib only — no pandas/numpy. Runs without GPU and without SGLang.
|
||||
"""
|
||||
|
||||
from __future__ import annotations
|
||||
|
||||
import argparse
|
||||
import json
|
||||
import math
|
||||
import sys
|
||||
from collections import defaultdict
|
||||
from pathlib import Path
|
||||
from typing import Iterable
|
||||
|
||||
BLOCK_SIZE = 24 # SGLang radix block, matches docs/KVC_ROUTER_ALGORITHM.md §2
|
||||
|
||||
TURN_BUCKETS: list[tuple[str, tuple[int, int]]] = [
|
||||
("turn=1", (1, 1)),
|
||||
("turn=2-5", (2, 5)),
|
||||
("turn=6-20", (6, 20)),
|
||||
("turn=21+", (21, 10**9)),
|
||||
]
|
||||
INPUT_BUCKETS: list[tuple[str, tuple[int, int]]] = [
|
||||
("input<=8K", (0, 8 * 1024)),
|
||||
("input=8K-64K", (8 * 1024 + 1, 64 * 1024)),
|
||||
("input>64K", (64 * 1024 + 1, 10**9)),
|
||||
]
|
||||
OVERLAP_BUCKETS: list[tuple[str, tuple[float, float]]] = [
|
||||
("overlap<=0.3", (0.0, 0.3)),
|
||||
("overlap=0.3-0.7", (0.3, 0.7)),
|
||||
("overlap>0.7", (0.7, 1.0001)),
|
||||
]
|
||||
APPEND_BUCKETS: list[tuple[str, tuple[int, int]]] = [
|
||||
("append<=128", (0, 128)),
|
||||
("append=128-1K", (129, 1024)),
|
||||
("append=1K-8K", (1025, 8 * 1024)),
|
||||
("append>8K", (8 * 1024 + 1, 10**9)),
|
||||
]
|
||||
|
||||
DIM_BUCKETS: dict[str, list[tuple[str, tuple]]] = {
|
||||
"turn_id": TURN_BUCKETS,
|
||||
"input_length": INPUT_BUCKETS,
|
||||
"overlap_ratio": OVERLAP_BUCKETS,
|
||||
"append_tokens": APPEND_BUCKETS,
|
||||
}
|
||||
|
||||
|
||||
def _quantile(values: list[float], q: float) -> float:
|
||||
"""Linear-interpolation quantile, stdlib only."""
|
||||
if not values:
|
||||
return float("nan")
|
||||
s = sorted(values)
|
||||
if len(s) == 1:
|
||||
return s[0]
|
||||
pos = (len(s) - 1) * q
|
||||
lo = math.floor(pos)
|
||||
hi = math.ceil(pos)
|
||||
if lo == hi:
|
||||
return s[lo]
|
||||
return s[lo] + (s[hi] - s[lo]) * (pos - lo)
|
||||
|
||||
|
||||
def _stats(values: list[float]) -> dict[str, float]:
|
||||
if not values:
|
||||
return {"mean": float("nan"), "p50": float("nan"), "p90": float("nan"), "p99": float("nan")}
|
||||
return {
|
||||
"mean": sum(values) / len(values),
|
||||
"p50": _quantile(values, 0.50),
|
||||
"p90": _quantile(values, 0.90),
|
||||
"p99": _quantile(values, 0.99),
|
||||
}
|
||||
|
||||
|
||||
def _bucket_for(value: float | int, buckets: list[tuple[str, tuple]]) -> str:
|
||||
for label, (lo, hi) in buckets:
|
||||
if lo <= value <= hi:
|
||||
return label
|
||||
return "OOB"
|
||||
|
||||
|
||||
def _classify(row: dict, dim: str) -> str:
|
||||
if dim == "turn_id":
|
||||
return _bucket_for(int(row.get("turn_id", 0)), TURN_BUCKETS)
|
||||
if dim == "input_length":
|
||||
return _bucket_for(int(row.get("input_length", 0)), INPUT_BUCKETS)
|
||||
if dim == "overlap_ratio":
|
||||
inp = max(1, int(row.get("input_length", 0)))
|
||||
cached = int(row.get("observed_overlap_blocks", 0)) * BLOCK_SIZE
|
||||
ratio = min(1.0, cached / inp)
|
||||
return _bucket_for(ratio, OVERLAP_BUCKETS)
|
||||
if dim == "append_tokens":
|
||||
inp = int(row.get("input_length", 0))
|
||||
cached = int(row.get("observed_overlap_blocks", 0)) * BLOCK_SIZE
|
||||
return _bucket_for(max(0, inp - cached), APPEND_BUCKETS)
|
||||
raise ValueError(f"Unknown dim: {dim}")
|
||||
|
||||
|
||||
def load_rows(paths: Iterable[Path]) -> list[dict]:
|
||||
rows: list[dict] = []
|
||||
for path in paths:
|
||||
with path.open() as handle:
|
||||
for line in handle:
|
||||
line = line.strip()
|
||||
if not line:
|
||||
continue
|
||||
rows.append(json.loads(line))
|
||||
return rows
|
||||
|
||||
|
||||
def stratify(rows: list[dict], dim: str) -> dict[str, dict]:
|
||||
by_bucket: dict[str, list[dict]] = defaultdict(list)
|
||||
for row in rows:
|
||||
by_bucket[_classify(row, dim)].append(row)
|
||||
|
||||
output: dict[str, dict] = {}
|
||||
for label, _ in DIM_BUCKETS[dim]:
|
||||
bucket_rows = by_bucket.get(label, [])
|
||||
n = len(bucket_rows)
|
||||
ok = [r for r in bucket_rows if r.get("error") is None and r.get("latency_s") is not None]
|
||||
n_ok = len(ok)
|
||||
lat = [float(r["latency_s"]) for r in ok]
|
||||
ttft = [float(r["ttft_s"]) for r in ok if r.get("ttft_s") is not None]
|
||||
output[label] = {
|
||||
"n": n,
|
||||
"n_ok": n_ok,
|
||||
"err_pct": (n - n_ok) / n if n else 0.0,
|
||||
"latency_s": _stats(lat),
|
||||
"ttft_s": _stats(ttft),
|
||||
}
|
||||
return output
|
||||
|
||||
|
||||
def render_table(name: str, stats: dict[str, dict]) -> str:
|
||||
lines = [
|
||||
f"## stratified by {name}",
|
||||
"",
|
||||
"| bucket | n | n_ok | err% | lat mean | lat p50 | lat p90 | lat p99 | ttft mean | ttft p50 | ttft p90 | ttft p99 |",
|
||||
"|---|---:|---:|---:|---:|---:|---:|---:|---:|---:|---:|---:|",
|
||||
]
|
||||
for label, _ in DIM_BUCKETS[name]:
|
||||
s = stats[label]
|
||||
lat = s["latency_s"]
|
||||
ttft = s["ttft_s"]
|
||||
lines.append(
|
||||
"| {label} | {n} | {n_ok} | {err:.1%} | "
|
||||
"{lm:.3f} | {l50:.3f} | {l90:.3f} | {l99:.3f} | "
|
||||
"{tm:.3f} | {t50:.3f} | {t90:.3f} | {t99:.3f} |".format(
|
||||
label=label,
|
||||
n=s["n"],
|
||||
n_ok=s["n_ok"],
|
||||
err=s["err_pct"],
|
||||
lm=lat["mean"],
|
||||
l50=lat["p50"],
|
||||
l90=lat["p90"],
|
||||
l99=lat["p99"],
|
||||
tm=ttft["mean"],
|
||||
t50=ttft["p50"],
|
||||
t90=ttft["p90"],
|
||||
t99=ttft["p99"],
|
||||
)
|
||||
)
|
||||
return "\n".join(lines)
|
||||
|
||||
|
||||
def main() -> None:
|
||||
parser = argparse.ArgumentParser(description=__doc__.split("\n\n")[0])
|
||||
parser.add_argument("metrics_paths", nargs="+", type=Path)
|
||||
parser.add_argument(
|
||||
"--dim",
|
||||
choices=list(DIM_BUCKETS.keys()) + ["all"],
|
||||
default="all",
|
||||
help="stratification dimension (default: all four)",
|
||||
)
|
||||
parser.add_argument(
|
||||
"--json",
|
||||
action="store_true",
|
||||
help="emit JSON instead of markdown tables",
|
||||
)
|
||||
args = parser.parse_args()
|
||||
|
||||
rows = load_rows(args.metrics_paths)
|
||||
if not rows:
|
||||
print("no rows loaded", file=sys.stderr)
|
||||
sys.exit(1)
|
||||
|
||||
dims = list(DIM_BUCKETS.keys()) if args.dim == "all" else [args.dim]
|
||||
result = {dim: stratify(rows, dim) for dim in dims}
|
||||
|
||||
if args.json:
|
||||
json.dump(result, sys.stdout, indent=2, default=lambda x: None if isinstance(x, float) and math.isnan(x) else x)
|
||||
sys.stdout.write("\n")
|
||||
return
|
||||
|
||||
header_paths = ", ".join(str(p) for p in args.metrics_paths)
|
||||
print(f"# stratified report ({len(rows)} rows from {header_paths})\n")
|
||||
for dim in dims:
|
||||
print(render_table(dim, result[dim]))
|
||||
print()
|
||||
|
||||
|
||||
if __name__ == "__main__":
|
||||
main()
|
||||
110
scripts/convert_audit_to_trace.py
Normal file
110
scripts/convert_audit_to_trace.py
Normal file
@@ -0,0 +1,110 @@
|
||||
#!/usr/bin/env python3
|
||||
"""Convert sibench audit.jsonl to agentic-pd-hybrid trace format.
|
||||
|
||||
Source format (sibench audit.jsonl):
|
||||
{"instance_id": "...", "ts": float, "messages": [...],
|
||||
"audit": {"prompt_tokens": int, "completion_tokens": int, ...}}
|
||||
|
||||
Target format (agentic-pd-hybrid trace JSONL):
|
||||
{"chat_id": int, "parent_chat_id": int, "timestamp": float,
|
||||
"turn": int, "input_length": int, "output_length": int,
|
||||
"type": str, "hash_ids": [int, ...]}
|
||||
"""
|
||||
|
||||
import json
|
||||
import sys
|
||||
from collections import defaultdict
|
||||
from pathlib import Path
|
||||
|
||||
BLOCK_TOKEN_BUDGET = 24 # tokens per block, matching trace.py default
|
||||
|
||||
|
||||
def convert(src: Path, dst: Path) -> None:
|
||||
# Group lines by instance_id, preserving order within each instance
|
||||
instances: dict[str, list[dict]] = defaultdict(list)
|
||||
with src.open() as f:
|
||||
for line in f:
|
||||
line = line.strip()
|
||||
if not line:
|
||||
continue
|
||||
rec = json.loads(line)
|
||||
instances[rec["instance_id"]].append(rec)
|
||||
|
||||
# Sort each instance's turns by timestamp
|
||||
for iid in instances:
|
||||
instances[iid].sort(key=lambda r: r["ts"])
|
||||
|
||||
# Assign stable chat_id bases: each instance gets a block of IDs
|
||||
# Max turns across all instances determines the spacing
|
||||
max_turns = max(len(turns) for turns in instances.values())
|
||||
spacing = max_turns + 10 # extra headroom
|
||||
|
||||
total_written = 0
|
||||
with dst.open("w") as out:
|
||||
for inst_idx, (iid, turns) in enumerate(instances.items()):
|
||||
base_chat_id = (inst_idx + 1) * spacing # start from spacing to avoid 0
|
||||
# Track cumulative hash_ids for prefix cache simulation
|
||||
cumulative_hash_ids: list[int] = []
|
||||
global_block_counter = inst_idx * 100_000 # unique block namespace per instance
|
||||
|
||||
for turn_idx, rec in enumerate(turns):
|
||||
audit = rec.get("audit", {})
|
||||
input_length = audit.get("prompt_tokens", 0)
|
||||
output_length = audit.get("completion_tokens", 0)
|
||||
|
||||
if input_length <= 0:
|
||||
# Fallback: estimate from message content
|
||||
total_chars = sum(len(m.get("content", "")) for m in rec.get("messages", []))
|
||||
input_length = max(1, total_chars // 4)
|
||||
if output_length <= 0:
|
||||
output_length = 128 # reasonable default
|
||||
|
||||
chat_id = base_chat_id + turn_idx
|
||||
if turn_idx == 0:
|
||||
parent_chat_id = -1
|
||||
else:
|
||||
parent_chat_id = base_chat_id + turn_idx - 1
|
||||
|
||||
# Build hash_ids: for turn 0, generate blocks for full input
|
||||
# For turn N>0, keep previous blocks and add new ones for the delta
|
||||
if turn_idx == 0:
|
||||
num_blocks = input_length // BLOCK_TOKEN_BUDGET
|
||||
cumulative_hash_ids = list(
|
||||
range(global_block_counter, global_block_counter + num_blocks)
|
||||
)
|
||||
global_block_counter += num_blocks
|
||||
else:
|
||||
# The new input is the full prompt (cumulative), so the delta
|
||||
# is the new tokens beyond what was in the previous turn's prompt
|
||||
prev_input = audit.get("prompt_tokens", 0)
|
||||
prev_rec_audit = turns[turn_idx - 1].get("audit", {})
|
||||
prev_input_length = prev_rec_audit.get("prompt_tokens", 0)
|
||||
delta = max(0, prev_input - prev_input_length) if prev_input_length > 0 else 0
|
||||
new_blocks = delta // BLOCK_TOKEN_BUDGET
|
||||
new_ids = list(
|
||||
range(global_block_counter, global_block_counter + new_blocks)
|
||||
)
|
||||
global_block_counter += new_blocks
|
||||
cumulative_hash_ids = cumulative_hash_ids + new_ids
|
||||
|
||||
trace_line = {
|
||||
"chat_id": chat_id,
|
||||
"parent_chat_id": parent_chat_id,
|
||||
"timestamp": rec["ts"],
|
||||
"turn": turn_idx,
|
||||
"input_length": input_length,
|
||||
"output_length": output_length,
|
||||
"type": "chat",
|
||||
"hash_ids": cumulative_hash_ids,
|
||||
}
|
||||
out.write(json.dumps(trace_line, separators=(",", ":")) + "\n")
|
||||
total_written += 1
|
||||
|
||||
print(f"Converted {total_written} lines from {len(instances)} instances -> {dst}")
|
||||
|
||||
|
||||
if __name__ == "__main__":
|
||||
if len(sys.argv) != 3:
|
||||
print(f"Usage: {sys.argv[0]} <input_audit.jsonl> <output_trace.jsonl>")
|
||||
sys.exit(1)
|
||||
convert(Path(sys.argv[1]), Path(sys.argv[2]))
|
||||
189
scripts/convert_inferact_to_trace.py
Normal file
189
scripts/convert_inferact_to_trace.py
Normal file
@@ -0,0 +1,189 @@
|
||||
"""Convert Inferact codex_swebenchpro_traces (ShareGPT) to agentic-pd-hybrid trace JSONL.
|
||||
|
||||
Output schema (one JSON object per line, matching src/agentic_pd_hybrid/trace.py):
|
||||
chat_id, parent_chat_id, timestamp, input_length, output_length, type, turn, hash_ids
|
||||
|
||||
Each trial in the input becomes one session. Each (human, gpt) pair within a trial
|
||||
becomes one turn. The prefix at turn N is the concatenation of all (human, gpt) pairs
|
||||
from turns 0..N-1 plus the current human message — this mirrors how agentic coding
|
||||
agents grow context across calls.
|
||||
|
||||
hash_ids are derived per 24-token block via sha256 of the block's text + previous hash,
|
||||
which gives stable, deterministic, prefix-shared hashes across turns of the same session.
|
||||
"""
|
||||
|
||||
from __future__ import annotations
|
||||
|
||||
import argparse
|
||||
import hashlib
|
||||
import json
|
||||
import sys
|
||||
import time
|
||||
from pathlib import Path
|
||||
|
||||
BLOCK_TOKEN_BUDGET = 24
|
||||
|
||||
|
||||
def _block_hash(text: str, prev_hash: int) -> int:
|
||||
h = hashlib.sha256(text.encode("utf-8") + prev_hash.to_bytes(8, "big")).digest()
|
||||
return int.from_bytes(h[:8], "big") & 0x7FFFFFFFFFFFFFFF
|
||||
|
||||
|
||||
def _build_hash_ids(token_ids: list[int]) -> list[int]:
|
||||
out: list[int] = []
|
||||
prev = 0
|
||||
for start in range(0, len(token_ids), BLOCK_TOKEN_BUDGET):
|
||||
block = token_ids[start : start + BLOCK_TOKEN_BUDGET]
|
||||
block_repr = ",".join(str(t) for t in block)
|
||||
prev = _block_hash(block_repr, prev)
|
||||
out.append(prev)
|
||||
return out
|
||||
|
||||
|
||||
def _pair_turns(conv: list[dict]) -> list[tuple[str, str]]:
|
||||
"""Pair consecutive (human, gpt) messages. Skip malformed."""
|
||||
pairs: list[tuple[str, str]] = []
|
||||
i = 0
|
||||
while i + 1 < len(conv):
|
||||
a, b = conv[i], conv[i + 1]
|
||||
if (
|
||||
isinstance(a, dict)
|
||||
and isinstance(b, dict)
|
||||
and a.get("from") == "human"
|
||||
and b.get("from") == "gpt"
|
||||
):
|
||||
pairs.append((str(a.get("value", "")), str(b.get("value", ""))))
|
||||
i += 2
|
||||
else:
|
||||
i += 1
|
||||
return pairs
|
||||
|
||||
|
||||
def convert(
|
||||
input_path: Path,
|
||||
output_path: Path,
|
||||
*,
|
||||
tokenizer_path: str,
|
||||
max_trials: int | None,
|
||||
inter_turn_gap_s: float,
|
||||
session_stagger_s: float,
|
||||
request_type: str,
|
||||
) -> None:
|
||||
from transformers import AutoTokenizer
|
||||
|
||||
print(f"loading tokenizer from {tokenizer_path}", file=sys.stderr)
|
||||
tokenizer = AutoTokenizer.from_pretrained(tokenizer_path, trust_remote_code=True)
|
||||
|
||||
print(f"loading {input_path}", file=sys.stderr)
|
||||
data = json.loads(input_path.read_text())
|
||||
if max_trials is not None:
|
||||
data = data[:max_trials]
|
||||
print(f"{len(data)} trials to process", file=sys.stderr)
|
||||
|
||||
next_chat_id = 1_000_000
|
||||
written = 0
|
||||
skipped_trials = 0
|
||||
t0 = time.time()
|
||||
|
||||
with output_path.open("w", encoding="utf-8") as out_f:
|
||||
for trial_idx, trial in enumerate(data):
|
||||
conv = trial.get("conversations") or []
|
||||
turns = _pair_turns(conv)
|
||||
if not turns:
|
||||
skipped_trials += 1
|
||||
continue
|
||||
|
||||
base_ts = trial_idx * session_stagger_s
|
||||
ts = base_ts
|
||||
parent_chat_id = -1
|
||||
prefix_text = ""
|
||||
|
||||
for turn_idx, (human, assistant) in enumerate(turns):
|
||||
# Input at this turn = full prior context + current human message.
|
||||
current_text = (
|
||||
prefix_text + ("\n\n[USER]\n" if prefix_text else "[USER]\n") + human
|
||||
)
|
||||
input_ids = tokenizer.encode(current_text, add_special_tokens=False)
|
||||
input_length = len(input_ids)
|
||||
|
||||
output_ids = tokenizer.encode(assistant, add_special_tokens=False)
|
||||
output_length = max(1, len(output_ids))
|
||||
|
||||
hash_ids = _build_hash_ids(input_ids)
|
||||
|
||||
chat_id = next_chat_id
|
||||
next_chat_id += 1
|
||||
record = {
|
||||
"chat_id": chat_id,
|
||||
"parent_chat_id": parent_chat_id,
|
||||
"timestamp": round(ts, 6),
|
||||
"input_length": input_length,
|
||||
"output_length": output_length,
|
||||
"type": request_type,
|
||||
"turn": turn_idx,
|
||||
"hash_ids": hash_ids,
|
||||
}
|
||||
out_f.write(json.dumps(record) + "\n")
|
||||
written += 1
|
||||
|
||||
parent_chat_id = chat_id
|
||||
ts += inter_turn_gap_s
|
||||
prefix_text = current_text + "\n\n[ASSISTANT]\n" + assistant
|
||||
|
||||
if (trial_idx + 1) % 20 == 0:
|
||||
elapsed = time.time() - t0
|
||||
rate = (trial_idx + 1) / elapsed if elapsed > 0 else 0
|
||||
eta = (len(data) - trial_idx - 1) / rate if rate > 0 else 0
|
||||
print(
|
||||
f" trial {trial_idx + 1}/{len(data)} reqs={written} "
|
||||
f"rate={rate:.1f} trial/s eta={eta:.0f}s",
|
||||
file=sys.stderr,
|
||||
)
|
||||
|
||||
elapsed = time.time() - t0
|
||||
print(
|
||||
f"done: wrote {written} requests across {len(data) - skipped_trials} sessions "
|
||||
f"({skipped_trials} trials skipped, empty conversations) in {elapsed:.1f}s "
|
||||
f"to {output_path}",
|
||||
file=sys.stderr,
|
||||
)
|
||||
|
||||
|
||||
def main() -> None:
|
||||
p = argparse.ArgumentParser(description=__doc__)
|
||||
p.add_argument(
|
||||
"--input",
|
||||
type=Path,
|
||||
default=Path("third_party/codex_swebenchpro_traces/codex_swebenchpro.json"),
|
||||
)
|
||||
p.add_argument("--output", type=Path, required=True)
|
||||
p.add_argument(
|
||||
"--tokenizer",
|
||||
default="/mnt/models/Qwen/Qwen3-30B-A3B-Instruct-2507",
|
||||
help="Path or HF id for the tokenizer. Default matches v2 sweep model.",
|
||||
)
|
||||
p.add_argument(
|
||||
"--max-trials",
|
||||
type=int,
|
||||
default=None,
|
||||
help="Cap number of trials processed (useful for smoke / quick tests).",
|
||||
)
|
||||
p.add_argument("--inter-turn-gap-s", type=float, default=2.5)
|
||||
p.add_argument("--session-stagger-s", type=float, default=1.0)
|
||||
p.add_argument("--request-type", default="chat")
|
||||
args = p.parse_args()
|
||||
|
||||
args.output.parent.mkdir(parents=True, exist_ok=True)
|
||||
convert(
|
||||
input_path=args.input,
|
||||
output_path=args.output,
|
||||
tokenizer_path=args.tokenizer,
|
||||
max_trials=args.max_trials,
|
||||
inter_turn_gap_s=args.inter_turn_gap_s,
|
||||
session_stagger_s=args.session_stagger_s,
|
||||
request_type=args.request_type,
|
||||
)
|
||||
|
||||
|
||||
if __name__ == "__main__":
|
||||
main()
|
||||
73
scripts/run_all_experiments.sh
Executable file
73
scripts/run_all_experiments.sh
Executable file
@@ -0,0 +1,73 @@
|
||||
#!/bin/bash
|
||||
# Run all 3 PD hybrid experiments sequentially
|
||||
# Uses 52 sessions / 4,449 requests (10% sample of 497 sessions)
|
||||
# Each experiment takes ~30-40 min
|
||||
set -euo pipefail
|
||||
cd "$(dirname "$0")/.."
|
||||
|
||||
TRACE="outputs/qwen35-swebench-50sess.jsonl"
|
||||
MODEL="/mnt/kzlin/workflow/pd-hybrid/simm-swe-bench/models/Qwen3.5-35B-A3B"
|
||||
OUTPUT="outputs/swebench-exps"
|
||||
|
||||
echo "=== Experiment A: pd-disaggregation ==="
|
||||
uv run agentic-pd-hybrid benchmark-live \
|
||||
--trace "$TRACE" \
|
||||
--output-root "$OUTPUT" \
|
||||
--mechanism pd-disaggregation \
|
||||
--policy default \
|
||||
--model-path "$MODEL" \
|
||||
--prefill-workers 1 --decode-workers 1 \
|
||||
--prefill-tp-size 4 --decode-tp-size 4 \
|
||||
--prefill-gpu-ids 0,1,2,3 --decode-gpu-ids 4,5,6,7 \
|
||||
--transfer-backend mooncake \
|
||||
--gpu-budget 8 \
|
||||
--time-scale 10 \
|
||||
--session-sample-rate 1.0 \
|
||||
--target-duration-s 100000 \
|
||||
--concurrency-limit 32 \
|
||||
--timeout-s 900 \
|
||||
--request-timeout-s 300
|
||||
|
||||
echo "=== Experiment B: pd-colo ==="
|
||||
uv run agentic-pd-hybrid benchmark-live \
|
||||
--trace "$TRACE" \
|
||||
--output-root "$OUTPUT" \
|
||||
--mechanism pd-colo \
|
||||
--policy default \
|
||||
--model-path "$MODEL" \
|
||||
--prefill-workers 0 --decode-workers 0 \
|
||||
--direct-workers 2 --direct-tp-size 4 \
|
||||
--direct-gpu-ids 0,1,2,3,4,5,6,7 \
|
||||
--transfer-backend mooncake \
|
||||
--gpu-budget 8 \
|
||||
--time-scale 10 \
|
||||
--session-sample-rate 1.0 \
|
||||
--target-duration-s 100000 \
|
||||
--concurrency-limit 32 \
|
||||
--timeout-s 900 \
|
||||
--request-timeout-s 300
|
||||
|
||||
echo "=== Experiment C: kvcache-centric ==="
|
||||
uv run agentic-pd-hybrid benchmark-live \
|
||||
--trace "$TRACE" \
|
||||
--output-root "$OUTPUT" \
|
||||
--mechanism kvcache-centric \
|
||||
--policy default \
|
||||
--model-path "$MODEL" \
|
||||
--prefill-workers 1 --decode-workers 1 \
|
||||
--prefill-tp-size 4 --decode-tp-size 4 \
|
||||
--prefill-gpu-ids 0,1,2,3 --decode-gpu-ids 4,5,6,7 \
|
||||
--transfer-backend mooncake \
|
||||
--gpu-budget 8 \
|
||||
--time-scale 10 \
|
||||
--session-sample-rate 1.0 \
|
||||
--target-duration-s 100000 \
|
||||
--concurrency-limit 32 \
|
||||
--timeout-s 900 \
|
||||
--request-timeout-s 300 \
|
||||
--kvcache-admission-mode worker \
|
||||
--kvcache-seed-min-turn-id 2 \
|
||||
--kvcache-prefill-backup-policy release-after-transfer \
|
||||
--kvcache-prefill-priority-eviction
|
||||
|
||||
echo "=== All experiments complete ==="
|
||||
24
scripts/run_exp_a_pd_disagg.sh
Executable file
24
scripts/run_exp_a_pd_disagg.sh
Executable file
@@ -0,0 +1,24 @@
|
||||
#!/bin/bash
|
||||
# Experiment A: pd-disaggregation baseline
|
||||
# 1P(GPU 0-3) + 1D(GPU 4-7), TP4, mooncake TCP
|
||||
# Full 39K trace from SWE-Bench 500 instances
|
||||
set -euo pipefail
|
||||
cd "$(dirname "$0")/.."
|
||||
|
||||
uv run agentic-pd-hybrid benchmark-live \
|
||||
--trace outputs/qwen35-swebench-500.jsonl \
|
||||
--output-root outputs/swebench-exps \
|
||||
--mechanism pd-disaggregation \
|
||||
--policy default \
|
||||
--model-path /mnt/kzlin/workflow/pd-hybrid/simm-swe-bench/models/Qwen3.5-35B-A3B \
|
||||
--prefill-workers 1 --decode-workers 1 \
|
||||
--prefill-tp-size 4 --decode-tp-size 4 \
|
||||
--prefill-gpu-ids 0,1,2,3 --decode-gpu-ids 4,5,6,7 \
|
||||
--transfer-backend mooncake \
|
||||
--gpu-budget 8 \
|
||||
--time-scale 10 \
|
||||
--session-sample-rate 1.0 \
|
||||
--target-duration-s 100000 \
|
||||
--concurrency-limit 64 \
|
||||
--timeout-s 900 \
|
||||
--request-timeout-s 300
|
||||
23
scripts/run_exp_b1_dp_colo_rr.sh
Executable file
23
scripts/run_exp_b1_dp_colo_rr.sh
Executable file
@@ -0,0 +1,23 @@
|
||||
#!/bin/bash
|
||||
# Experiment B1: Naive DP colocation — round-robin policy
|
||||
# 2 direct workers (GPU 0-3, 4-7), TP4, DP router with round-robin
|
||||
# No disaggregation — each worker does prefill+decode locally
|
||||
set -euo pipefail
|
||||
cd "$(dirname "$0")/.."
|
||||
|
||||
uv run agentic-pd-hybrid benchmark-live \
|
||||
--trace outputs/qwen35-swebench-50sess.jsonl \
|
||||
--output-root outputs/swebench-exps \
|
||||
--mechanism pd-colo \
|
||||
--policy default \
|
||||
--model-path /mnt/kzlin/workflow/pd-hybrid/simm-swe-bench/models/Qwen3.5-35B-A3B \
|
||||
--prefill-workers 0 --decode-workers 0 \
|
||||
--direct-workers 2 --direct-tp-size 4 \
|
||||
--direct-gpu-ids 0,1,2,3,4,5,6,7 \
|
||||
--gpu-budget 8 \
|
||||
--time-scale 10 \
|
||||
--session-sample-rate 1.0 \
|
||||
--target-duration-s 100000 \
|
||||
--concurrency-limit 32 \
|
||||
--timeout-s 900 \
|
||||
--request-timeout-s 300
|
||||
23
scripts/run_exp_b2_dp_colo_cache_aware.sh
Executable file
23
scripts/run_exp_b2_dp_colo_cache_aware.sh
Executable file
@@ -0,0 +1,23 @@
|
||||
#!/bin/bash
|
||||
# Experiment B2: Naive DP colocation — cache-aware (kv-aware) policy
|
||||
# 2 direct workers (GPU 0-3, 4-7), TP4, DP router with consistent-hashing
|
||||
# Replay kv-aware policy picks the worker with most prefix overlap
|
||||
set -euo pipefail
|
||||
cd "$(dirname "$0")/.."
|
||||
|
||||
uv run agentic-pd-hybrid benchmark-live \
|
||||
--trace outputs/qwen35-swebench-50sess.jsonl \
|
||||
--output-root outputs/swebench-exps \
|
||||
--mechanism pd-colo \
|
||||
--policy kv-aware \
|
||||
--model-path /mnt/kzlin/workflow/pd-hybrid/simm-swe-bench/models/Qwen3.5-35B-A3B \
|
||||
--prefill-workers 0 --decode-workers 0 \
|
||||
--direct-workers 2 --direct-tp-size 4 \
|
||||
--direct-gpu-ids 0,1,2,3,4,5,6,7 \
|
||||
--gpu-budget 8 \
|
||||
--time-scale 10 \
|
||||
--session-sample-rate 1.0 \
|
||||
--target-duration-s 100000 \
|
||||
--concurrency-limit 32 \
|
||||
--timeout-s 900 \
|
||||
--request-timeout-s 300
|
||||
24
scripts/run_exp_b_pd_colo.sh
Executable file
24
scripts/run_exp_b_pd_colo.sh
Executable file
@@ -0,0 +1,24 @@
|
||||
#!/bin/bash
|
||||
# Experiment B: pd-colo (direct/colocation)
|
||||
# 2 direct workers (GPU 0-3, 4-7), TP4, no router
|
||||
# Full 39K trace from SWE-Bench 500 instances
|
||||
set -euo pipefail
|
||||
cd "$(dirname "$0")/.."
|
||||
|
||||
uv run agentic-pd-hybrid benchmark-live \
|
||||
--trace outputs/qwen35-swebench-500.jsonl \
|
||||
--output-root outputs/swebench-exps \
|
||||
--mechanism pd-colo \
|
||||
--policy default \
|
||||
--model-path /mnt/kzlin/workflow/pd-hybrid/simm-swe-bench/models/Qwen3.5-35B-A3B \
|
||||
--prefill-workers 0 --decode-workers 0 \
|
||||
--direct-workers 2 --direct-tp-size 4 \
|
||||
--direct-gpu-ids 0,1,2,3,4,5,6,7 \
|
||||
--transfer-backend mooncake \
|
||||
--gpu-budget 8 \
|
||||
--time-scale 10 \
|
||||
--session-sample-rate 1.0 \
|
||||
--target-duration-s 100000 \
|
||||
--concurrency-limit 64 \
|
||||
--timeout-s 900 \
|
||||
--request-timeout-s 300
|
||||
28
scripts/run_exp_c_kvcache_centric.sh
Executable file
28
scripts/run_exp_c_kvcache_centric.sh
Executable file
@@ -0,0 +1,28 @@
|
||||
#!/bin/bash
|
||||
# Experiment C: kvcache-centric (session-aware PD)
|
||||
# 1P(GPU 0-3) + 1D(GPU 4-7), TP4, mooncake TCP
|
||||
# Full 39K trace from SWE-Bench 500 instances
|
||||
set -euo pipefail
|
||||
cd "$(dirname "$0")/.."
|
||||
|
||||
uv run agentic-pd-hybrid benchmark-live \
|
||||
--trace outputs/qwen35-swebench-500.jsonl \
|
||||
--output-root outputs/swebench-exps \
|
||||
--mechanism kvcache-centric \
|
||||
--policy default \
|
||||
--model-path /mnt/kzlin/workflow/pd-hybrid/simm-swe-bench/models/Qwen3.5-35B-A3B \
|
||||
--prefill-workers 1 --decode-workers 1 \
|
||||
--prefill-tp-size 4 --decode-tp-size 4 \
|
||||
--prefill-gpu-ids 0,1,2,3 --decode-gpu-ids 4,5,6,7 \
|
||||
--transfer-backend mooncake \
|
||||
--gpu-budget 8 \
|
||||
--time-scale 10 \
|
||||
--session-sample-rate 1.0 \
|
||||
--target-duration-s 100000 \
|
||||
--concurrency-limit 64 \
|
||||
--timeout-s 900 \
|
||||
--request-timeout-s 300 \
|
||||
--kvcache-admission-mode worker \
|
||||
--kvcache-seed-min-turn-id 2 \
|
||||
--kvcache-prefill-backup-policy release-after-transfer \
|
||||
--kvcache-prefill-priority-eviction
|
||||
81
scripts/sample_trace_subset.py
Normal file
81
scripts/sample_trace_subset.py
Normal file
@@ -0,0 +1,81 @@
|
||||
"""Deterministically slice the first N sessions of an agentic-pd-hybrid trace.
|
||||
|
||||
Method: scan in file order, count records whose `parent_chat_id == -1` (= a
|
||||
session's turn 0), and write every record until the (N+1)-th such record is
|
||||
seen. No RNG, no hashing — re-running on the same input produces a byte-
|
||||
identical output. Used to derive matched subsets for paired sweeps (E1 vs E2)
|
||||
without spending GPU hours on the full trace.
|
||||
|
||||
Usage:
|
||||
uv run --no-sync python scripts/sample_trace_subset.py \
|
||||
--input outputs/inferact_codex_swebenchpro.jsonl \
|
||||
--output outputs/inferact_50sess.jsonl \
|
||||
--sessions 50
|
||||
"""
|
||||
|
||||
from __future__ import annotations
|
||||
|
||||
import argparse
|
||||
import hashlib
|
||||
import json
|
||||
import sys
|
||||
from pathlib import Path
|
||||
|
||||
|
||||
def slice_first_n_sessions(input_path: Path, output_path: Path, n_sessions: int) -> dict:
|
||||
sessions_seen = 0
|
||||
requests_written = 0
|
||||
input_length_sum = 0
|
||||
output_length_sum = 0
|
||||
min_in = float("inf")
|
||||
max_in = 0
|
||||
|
||||
with input_path.open("r", encoding="utf-8") as f_in, output_path.open(
|
||||
"w", encoding="utf-8"
|
||||
) as f_out:
|
||||
for line in f_in:
|
||||
rec = json.loads(line)
|
||||
if rec["parent_chat_id"] == -1:
|
||||
sessions_seen += 1
|
||||
if sessions_seen > n_sessions:
|
||||
break
|
||||
f_out.write(line)
|
||||
requests_written += 1
|
||||
il = int(rec["input_length"])
|
||||
input_length_sum += il
|
||||
output_length_sum += int(rec["output_length"])
|
||||
if il < min_in:
|
||||
min_in = il
|
||||
if il > max_in:
|
||||
max_in = il
|
||||
|
||||
h = hashlib.md5(output_path.read_bytes()).hexdigest()
|
||||
return {
|
||||
"sessions": min(sessions_seen, n_sessions),
|
||||
"requests": requests_written,
|
||||
"input_length_mean": input_length_sum / max(1, requests_written),
|
||||
"input_length_min": int(min_in) if min_in != float("inf") else 0,
|
||||
"input_length_max": max_in,
|
||||
"output_length_mean": output_length_sum / max(1, requests_written),
|
||||
"output_md5": h,
|
||||
}
|
||||
|
||||
|
||||
def main() -> None:
|
||||
p = argparse.ArgumentParser(description=__doc__)
|
||||
p.add_argument(
|
||||
"--input",
|
||||
type=Path,
|
||||
default=Path("outputs/inferact_codex_swebenchpro.jsonl"),
|
||||
)
|
||||
p.add_argument("--output", type=Path, required=True)
|
||||
p.add_argument("--sessions", type=int, default=50)
|
||||
args = p.parse_args()
|
||||
|
||||
args.output.parent.mkdir(parents=True, exist_ok=True)
|
||||
stats = slice_first_n_sessions(args.input, args.output, args.sessions)
|
||||
print(json.dumps(stats, indent=2), file=sys.stderr)
|
||||
|
||||
|
||||
if __name__ == "__main__":
|
||||
main()
|
||||
44
scripts/setup_env.sh
Executable file
44
scripts/setup_env.sh
Executable file
@@ -0,0 +1,44 @@
|
||||
#!/usr/bin/env bash
|
||||
# Source this file in every shell that will run agentic-pd-hybrid.
|
||||
#
|
||||
# source scripts/setup_env.sh
|
||||
#
|
||||
# Why all three are needed:
|
||||
# - CUDA_HOME / PATH point tvm_ffi (vendor sglang JIT compiler) at cu12.8 nvcc.
|
||||
# Without this it falls back to /usr/local/cuda-13.0/bin/nvcc and the
|
||||
# resulting .so links libcudart.so.13 which driver 570 (cu12.8 API) rejects
|
||||
# with cudaErrorInsufficientDriver.
|
||||
# - LD_LIBRARY_PATH must expose libcudart.so.12 for mooncake.engine (cu12 wheel)
|
||||
# AND ~/cuda-12.8/lib64 for tvm_ffi compile-time linker searches.
|
||||
#
|
||||
# See docs/H200_DRIVER570_SETUP_ZH.md for the full rationale.
|
||||
|
||||
REPO_ROOT="$(cd "$(dirname "${BASH_SOURCE[0]}")/.." && pwd)"
|
||||
|
||||
if [ ! -x "$HOME/cuda-12.8/bin/nvcc" ]; then
|
||||
echo "ERROR: $HOME/cuda-12.8/bin/nvcc not found." >&2
|
||||
echo "Install cu12.8 toolkit first (see docs/H200_DRIVER570_SETUP_ZH.md §3)." >&2
|
||||
return 1 2>/dev/null || exit 1
|
||||
fi
|
||||
|
||||
if [ ! -f "$REPO_ROOT/.venv/lib/python3.12/site-packages/nvidia/cuda_runtime/lib/libcudart.so.12" ]; then
|
||||
echo "ERROR: venv libcudart.so.12 missing. Run 'uv sync' from $REPO_ROOT." >&2
|
||||
return 1 2>/dev/null || exit 1
|
||||
fi
|
||||
|
||||
export CUDA_HOME="$HOME/cuda-12.8"
|
||||
export PATH="$HOME/cuda-12.8/bin:$PATH"
|
||||
export LD_LIBRARY_PATH="$REPO_ROOT/.venv/lib/python3.12/site-packages/nvidia/cuda_runtime/lib:$HOME/cuda-12.8/lib64${LD_LIBRARY_PATH:+:$LD_LIBRARY_PATH}"
|
||||
|
||||
# Mooncake batch_transfer_sync C++ timeout (seconds). Default in mooncake is
|
||||
# 30 s; a single LRU eviction sweep on a saturated D scheduler can exceed
|
||||
# that and cause the hair-trigger blacklist in conn.py:1270 to permanently
|
||||
# mark the D's mooncake_session_id "failed". 1800 s = 30 min gives us
|
||||
# headroom while still detecting genuinely broken peers eventually.
|
||||
# See docs/E1_E2_RESULTS_ZH.md §5c and docs/E1_E2_FIX_DESIGN_ZH.md Q1.C.
|
||||
export MC_TRANSFER_TIMEOUT="${MC_TRANSFER_TIMEOUT:-1800}"
|
||||
|
||||
echo "agentic-pd-hybrid env ready:"
|
||||
echo " CUDA_HOME=$CUDA_HOME ($(nvcc --version | grep release | sed 's/.*release //'))"
|
||||
echo " libcudart.so.12 at $REPO_ROOT/.venv/lib/python3.12/site-packages/nvidia/cuda_runtime/lib"
|
||||
echo " MC_TRANSFER_TIMEOUT=${MC_TRANSFER_TIMEOUT}s"
|
||||
30
scripts/smoke_test.sh
Executable file
30
scripts/smoke_test.sh
Executable file
@@ -0,0 +1,30 @@
|
||||
#!/bin/bash
|
||||
# Smoke test: pd-disaggregation with mooncake TCP, 100 requests
|
||||
set -euo pipefail
|
||||
cd "$(dirname "$0")/.."
|
||||
|
||||
# Sample a small trace for smoke testing
|
||||
uv run agentic-pd-hybrid sample-sessions \
|
||||
--trace outputs/qwen35-swebench-500.jsonl \
|
||||
--output outputs/qwen35-smoke-3sess.jsonl \
|
||||
--session-sample-rate 0.02 \
|
||||
--min-turns 5 \
|
||||
--target-duration-s 300 \
|
||||
--max-requests 100
|
||||
|
||||
# Run smoke test
|
||||
uv run agentic-pd-hybrid benchmark-live \
|
||||
--trace outputs/qwen35-smoke-3sess.jsonl \
|
||||
--output-root outputs/smoke \
|
||||
--mechanism pd-disaggregation \
|
||||
--policy default \
|
||||
--model-path /mnt/kzlin/workflow/pd-hybrid/simm-swe-bench/models/Qwen3.5-35B-A3B \
|
||||
--prefill-workers 1 --decode-workers 1 \
|
||||
--prefill-tp-size 4 --decode-tp-size 4 \
|
||||
--prefill-gpu-ids 0,1,2,3 --decode-gpu-ids 4,5,6,7 \
|
||||
--transfer-backend mooncake \
|
||||
--gpu-budget 8 \
|
||||
--time-scale 10 \
|
||||
--concurrency-limit 32 \
|
||||
--timeout-s 900 \
|
||||
--request-timeout-s 300
|
||||
114
scripts/sweep_backpressure_smoke.sh
Executable file
114
scripts/sweep_backpressure_smoke.sh
Executable file
@@ -0,0 +1,114 @@
|
||||
#!/usr/bin/env bash
|
||||
# Smoke sweep: validate backpressure code change on top of v5 Option D config.
|
||||
# Designed to fit in ~3-4h GPU budget (4 runs × ~30-60 min).
|
||||
#
|
||||
# Usage:
|
||||
# bash scripts/sweep_backpressure_smoke.sh
|
||||
#
|
||||
# Prerequisites: GPUs available; trace at outputs/qwen35-swebench-50sess.jsonl;
|
||||
# model at $MODEL_PATH (default Qwen3-30B-A3B-Instruct-2507).
|
||||
set -euo pipefail
|
||||
|
||||
REPO_ROOT="$(cd "$(dirname "${BASH_SOURCE[0]}")/.." && pwd)"
|
||||
cd "$REPO_ROOT"
|
||||
|
||||
OUT_ROOT=${OUT_ROOT:-outputs/sweep_backpressure_smoke}
|
||||
TRACE=${TRACE:-outputs/qwen35-swebench-50sess.jsonl}
|
||||
MODEL=${MODEL:-/mnt/kzlin/workflow/pd-hybrid/simm-swe-bench/models/Qwen3-30B-A3B-Instruct-2507}
|
||||
|
||||
mkdir -p "$OUT_ROOT"
|
||||
LOG="$OUT_ROOT/sweep.log"
|
||||
echo "[$(date '+%F %T')] Starting backpressure smoke sweep" | tee -a "$LOG"
|
||||
echo " Trace: $TRACE" | tee -a "$LOG"
|
||||
echo " Model: $MODEL" | tee -a "$LOG"
|
||||
echo " Output root: $OUT_ROOT" | tee -a "$LOG"
|
||||
|
||||
KVC_COMMON_ARGS=(
|
||||
--trace "$TRACE"
|
||||
--model "$MODEL"
|
||||
--mechanism kvcache-centric
|
||||
--policy kv-aware
|
||||
--kvcache-admission-mode worker
|
||||
--kvcache-seed-min-turn-id 1
|
||||
--kvcache-seed-max-inflight-decode -1
|
||||
--kvcache-prefill-backup-policy release-after-transfer
|
||||
--kvcache-prefill-priority-eviction
|
||||
--prefill-workers 2
|
||||
--decode-workers 6
|
||||
--prefill-gpu-ids 0,1
|
||||
--decode-gpu-ids 2,3,4,5,6,7
|
||||
--transfer-backend mooncake
|
||||
--target-duration-s 2000
|
||||
--session-sample-rate 1.0
|
||||
--min-turns 2
|
||||
--concurrency-limit 32
|
||||
)
|
||||
|
||||
DP_COMMON_ARGS=(
|
||||
--trace "$TRACE"
|
||||
--model "$MODEL"
|
||||
--mechanism pd-colo
|
||||
--policy kv-aware
|
||||
--direct-workers 8
|
||||
--direct-gpu-ids 0,1,2,3,4,5,6,7
|
||||
--transfer-backend mooncake
|
||||
--target-duration-s 2000
|
||||
--session-sample-rate 1.0
|
||||
--min-turns 2
|
||||
--concurrency-limit 32
|
||||
)
|
||||
|
||||
run_kvc_baseline_ts10() {
|
||||
local out="$OUT_ROOT/E1_kvc_baseline_ts10"
|
||||
echo "[$(date '+%F %T')] === E1: KVC baseline (no backpressure) time-scale=10 ===" | tee -a "$LOG"
|
||||
python -m agentic_pd_hybrid.cli benchmark-live \
|
||||
"${KVC_COMMON_ARGS[@]}" \
|
||||
--output-root "$out" \
|
||||
--time-scale 10 \
|
||||
2>&1 | tee -a "$LOG"
|
||||
}
|
||||
|
||||
run_kvc_backpressure_ts10() {
|
||||
local out="$OUT_ROOT/E2_kvc_backpressure_ts10"
|
||||
echo "[$(date '+%F %T')] === E2: KVC + backpressure ON, time-scale=10 ===" | tee -a "$LOG"
|
||||
python -m agentic_pd_hybrid.cli benchmark-live \
|
||||
"${KVC_COMMON_ARGS[@]}" \
|
||||
--output-root "$out" \
|
||||
--time-scale 10 \
|
||||
--enable-backpressure \
|
||||
--backpressure-max-pause-s 2.0 \
|
||||
2>&1 | tee -a "$LOG"
|
||||
}
|
||||
|
||||
run_kvc_backpressure_ts1() {
|
||||
local out="$OUT_ROOT/E3_kvc_backpressure_ts1_short"
|
||||
echo "[$(date '+%F %T')] === E3: KVC + backpressure ON, time-scale=1, FIRST 1000 reqs ===" | tee -a "$LOG"
|
||||
python -m agentic_pd_hybrid.cli benchmark-live \
|
||||
"${KVC_COMMON_ARGS[@]}" \
|
||||
--output-root "$out" \
|
||||
--time-scale 1 \
|
||||
--enable-backpressure \
|
||||
--backpressure-max-pause-s 2.0 \
|
||||
--target-duration-s 1800 \
|
||||
2>&1 | tee -a "$LOG"
|
||||
}
|
||||
|
||||
run_dp_baseline_ts1() {
|
||||
local out="$OUT_ROOT/E4_dp_ts1_short"
|
||||
echo "[$(date '+%F %T')] === E4: 8-way DP cache-aware, time-scale=1, FIRST 1000 reqs ===" | tee -a "$LOG"
|
||||
python -m agentic_pd_hybrid.cli benchmark-live \
|
||||
"${DP_COMMON_ARGS[@]}" \
|
||||
--output-root "$out" \
|
||||
--time-scale 1 \
|
||||
--target-duration-s 1800 \
|
||||
2>&1 | tee -a "$LOG"
|
||||
}
|
||||
|
||||
# Sequence — add/remove as fits the budget.
|
||||
run_kvc_baseline_ts10
|
||||
run_kvc_backpressure_ts10
|
||||
run_kvc_backpressure_ts1
|
||||
run_dp_baseline_ts1
|
||||
|
||||
echo "[$(date '+%F %T')] === sweep DONE ===" | tee -a "$LOG"
|
||||
echo "Run analysis with: python scripts/analysis/analyze_backpressure_smoke.py $OUT_ROOT" | tee -a "$LOG"
|
||||
82
scripts/sweep_e1_naive_1p3d.sh
Executable file
82
scripts/sweep_e1_naive_1p3d.sh
Executable file
@@ -0,0 +1,82 @@
|
||||
#!/usr/bin/env bash
|
||||
# E1 — naive 1P3D + kv-aware + RDMA, ts=1
|
||||
#
|
||||
# Tests hypothesis H1 from ONBOARDING_NEXT_AGENT_ZH §3.1: separate the
|
||||
# contribution of "1P3D topology + kv-aware policy" from "KVC layer
|
||||
# (admission / migration / direct-to-D)".
|
||||
#
|
||||
# Mechanism = pd-disaggregation (no KVC layer); policy = kv-aware.
|
||||
# Topology = 1P3D, RDMA on (mlx5_60 = cuda:0 NUMA-local).
|
||||
#
|
||||
# Prerequisites:
|
||||
# - source scripts/setup_env.sh (sets CUDA_HOME etc.)
|
||||
# - outputs/inferact_codex_swebenchpro.jsonl exists
|
||||
# (run scripts/convert_inferact_to_trace.py if not)
|
||||
#
|
||||
# Usage:
|
||||
# bash scripts/sweep_e1_naive_1p3d.sh
|
||||
#
|
||||
# Override defaults via env:
|
||||
# MODEL=/path TRACE=path OUTPUT=path IB_DEVICE=mlx5_XX bash scripts/sweep_e1_naive_1p3d.sh
|
||||
|
||||
set -euo pipefail
|
||||
cd "$(dirname "$0")/.."
|
||||
|
||||
if [ -z "${CUDA_HOME:-}" ]; then
|
||||
echo "ERROR: CUDA_HOME not set. Source scripts/setup_env.sh first." >&2
|
||||
exit 1
|
||||
fi
|
||||
|
||||
MODEL=${MODEL:-/mnt/models/Qwen/Qwen3-30B-A3B-Instruct-2507}
|
||||
TRACE=${TRACE:-outputs/inferact_50sess.jsonl}
|
||||
OUTPUT=${OUTPUT:-outputs/e1_naive_1p3d_kvaware_rdma_50sess}
|
||||
IB_DEVICE=${IB_DEVICE:-mlx5_60}
|
||||
|
||||
if [ ! -f "$TRACE" ]; then
|
||||
echo "ERROR: trace not found at $TRACE" >&2
|
||||
echo "Run: uv run --no-sync python scripts/convert_inferact_to_trace.py --output $TRACE" >&2
|
||||
exit 1
|
||||
fi
|
||||
|
||||
mkdir -p "$OUTPUT"
|
||||
LOG="$OUTPUT/sweep.log"
|
||||
|
||||
log() { echo "[$(date '+%Y-%m-%d %H:%M:%S')] $*" | tee -a "$LOG"; }
|
||||
|
||||
log "=== E1: naive 1P3D kv-aware + RDMA, ts=1 ==="
|
||||
log "MODEL=$MODEL"
|
||||
log "TRACE=$TRACE ($(wc -l < $TRACE) requests)"
|
||||
log "OUTPUT=$OUTPUT"
|
||||
log "IB_DEVICE=$IB_DEVICE"
|
||||
|
||||
label=e1_naive_1p3d_kvaware_run1
|
||||
log ""
|
||||
log "=== [E1] $label starting ==="
|
||||
|
||||
uv run --no-sync python -m agentic_pd_hybrid.cli benchmark-live \
|
||||
--trace "$TRACE" \
|
||||
--output-root "$OUTPUT" \
|
||||
--mechanism pd-disaggregation \
|
||||
--policy kv-aware \
|
||||
--model-path "$MODEL" \
|
||||
--prefill-workers 1 --decode-workers 3 \
|
||||
--prefill-tp-size 1 --decode-tp-size 1 \
|
||||
--prefill-gpu-ids 0 --decode-gpu-ids 1,2,3 \
|
||||
--transfer-backend mooncake \
|
||||
--force-rdma --ib-device "$IB_DEVICE" \
|
||||
--gpu-budget 4 \
|
||||
--time-scale 1 \
|
||||
--session-sample-rate 1.0 \
|
||||
--target-duration-s 100000 \
|
||||
--concurrency-limit 32 \
|
||||
--timeout-s 1800 \
|
||||
--request-timeout-s 300 2>&1 | tee -a "$LOG"
|
||||
|
||||
run_dir=$(ls -td "$OUTPUT"/pd-disaggregation-*/ 2>/dev/null | head -1)
|
||||
log "=== [E1] $label COMPLETED, artifacts at $run_dir ==="
|
||||
|
||||
if [ -f "$run_dir/request-metrics.jsonl.summary.json" ]; then
|
||||
cp "$run_dir/request-metrics.jsonl.summary.json" "$OUTPUT/${label}_summary.json"
|
||||
cp "$run_dir/request-metrics.jsonl" "$OUTPUT/${label}_metrics.jsonl"
|
||||
log "=== summary saved to $OUTPUT/${label}_summary.json ==="
|
||||
fi
|
||||
90
scripts/sweep_e2_kvc_v2_rdma.sh
Executable file
90
scripts/sweep_e2_kvc_v2_rdma.sh
Executable file
@@ -0,0 +1,90 @@
|
||||
#!/usr/bin/env bash
|
||||
# E2 — KVC v2 + RDMA, ts=1
|
||||
#
|
||||
# Tests hypotheses H2/H3 from ONBOARDING_NEXT_AGENT_ZH §3.1: validate
|
||||
# that enabling real RDMA pushes TTFT p99 from the reported 1.28s
|
||||
# (TCP loopback) down toward ~0.7s (still expected to lose to DP 0.43s
|
||||
# because re-prefill segment of reseed slow-path remains).
|
||||
#
|
||||
# Mechanism = kvcache-centric; policy = kv-aware; topology = 1P3D.
|
||||
# All --kvcache-* tuning flags from sweep_ts1_migration_v2.sh
|
||||
# (reset-on-success + threshold 8192). RDMA on (mlx5_60).
|
||||
#
|
||||
# Uses the same outputs/inferact_50sess.jsonl as E1 — see
|
||||
# scripts/sample_trace_subset.py — so the two runs are paired.
|
||||
#
|
||||
# Prerequisites:
|
||||
# - source scripts/setup_env.sh
|
||||
# - E1 must already have completed (releases GPUs)
|
||||
#
|
||||
# Usage:
|
||||
# bash scripts/sweep_e2_kvc_v2_rdma.sh
|
||||
|
||||
set -euo pipefail
|
||||
cd "$(dirname "$0")/.."
|
||||
|
||||
if [ -z "${CUDA_HOME:-}" ]; then
|
||||
echo "ERROR: CUDA_HOME not set. Source scripts/setup_env.sh first." >&2
|
||||
exit 1
|
||||
fi
|
||||
|
||||
MODEL=${MODEL:-/mnt/models/Qwen/Qwen3-30B-A3B-Instruct-2507}
|
||||
TRACE=${TRACE:-outputs/inferact_50sess.jsonl}
|
||||
OUTPUT=${OUTPUT:-outputs/e2_kvc_v2_rdma_50sess}
|
||||
IB_DEVICE=${IB_DEVICE:-mlx5_60}
|
||||
|
||||
if [ ! -f "$TRACE" ]; then
|
||||
echo "ERROR: trace not found at $TRACE" >&2
|
||||
echo "Run: uv run --no-sync python scripts/sample_trace_subset.py --output $TRACE --sessions 50" >&2
|
||||
exit 1
|
||||
fi
|
||||
|
||||
mkdir -p "$OUTPUT"
|
||||
LOG="$OUTPUT/sweep.log"
|
||||
|
||||
log() { echo "[$(date '+%Y-%m-%d %H:%M:%S')] $*" | tee -a "$LOG"; }
|
||||
|
||||
log "=== E2: KVC v2 + RDMA, ts=1 ==="
|
||||
log "MODEL=$MODEL"
|
||||
log "TRACE=$TRACE ($(wc -l < $TRACE) requests)"
|
||||
log "OUTPUT=$OUTPUT"
|
||||
log "IB_DEVICE=$IB_DEVICE"
|
||||
|
||||
label=e2_kvc_v2_rdma_run1
|
||||
log ""
|
||||
log "=== [E2] $label starting ==="
|
||||
|
||||
uv run --no-sync python -m agentic_pd_hybrid.cli benchmark-live \
|
||||
--trace "$TRACE" \
|
||||
--output-root "$OUTPUT" \
|
||||
--mechanism kvcache-centric \
|
||||
--policy kv-aware \
|
||||
--model-path "$MODEL" \
|
||||
--prefill-workers 1 --decode-workers 3 \
|
||||
--prefill-tp-size 1 --decode-tp-size 1 \
|
||||
--prefill-gpu-ids 0 --decode-gpu-ids 1,2,3 \
|
||||
--transfer-backend mooncake \
|
||||
--force-rdma --ib-device "$IB_DEVICE" \
|
||||
--gpu-budget 4 \
|
||||
--time-scale 1 \
|
||||
--session-sample-rate 1.0 \
|
||||
--target-duration-s 100000 \
|
||||
--concurrency-limit 32 \
|
||||
--timeout-s 1800 \
|
||||
--request-timeout-s 300 \
|
||||
--kvcache-admission-mode worker \
|
||||
--kvcache-seed-min-turn-id 1 \
|
||||
--kvcache-seed-max-inflight-decode -1 \
|
||||
--kvcache-prefill-backup-policy release-after-transfer \
|
||||
--kvcache-prefill-priority-eviction \
|
||||
--kvcache-migration-reject-threshold 3 \
|
||||
--kvcache-direct-max-uncached-tokens 8192 2>&1 | tee -a "$LOG"
|
||||
|
||||
run_dir=$(ls -td "$OUTPUT"/kvcache-centric-*/ 2>/dev/null | head -1)
|
||||
log "=== [E2] $label COMPLETED, artifacts at $run_dir ==="
|
||||
|
||||
if [ -f "$run_dir/request-metrics.jsonl.summary.json" ]; then
|
||||
cp "$run_dir/request-metrics.jsonl.summary.json" "$OUTPUT/${label}_summary.json"
|
||||
cp "$run_dir/request-metrics.jsonl" "$OUTPUT/${label}_metrics.jsonl"
|
||||
log "=== summary saved to $OUTPUT/${label}_summary.json ==="
|
||||
fi
|
||||
105
scripts/sweep_e3_kvc_v2_loadfloor_rdma.sh
Executable file
105
scripts/sweep_e3_kvc_v2_loadfloor_rdma.sh
Executable file
@@ -0,0 +1,105 @@
|
||||
#!/usr/bin/env bash
|
||||
# E3 — KVC v2 + RDMA + load-floor bonus, ts=1
|
||||
#
|
||||
# Validates the load-floor bonus fix proposed in
|
||||
# docs/E1_E2_FIX_DESIGN_ZH.md §Q2.B. Identical to E2 except:
|
||||
# --kvcache-load-floor-bonus 200
|
||||
#
|
||||
# Pair-wise vs E1 (no KVC layer) and E2 (KVC v2 without bonus) on the
|
||||
# exact same outputs/inferact_50sess.jsonl subset.
|
||||
#
|
||||
# Hypotheses being tested:
|
||||
# H1 (load balance): D2 should now receive non-trivial bindings
|
||||
# (E1/E2 had 0 — see E1_E2_RESULTS_ZH.md §5d).
|
||||
# H2 (failure rate): mooncake batch_transfer_sync timeouts should
|
||||
# stop firing because D0/D1 KV pool no longer
|
||||
# saturates → no LRU thrash → control plane no
|
||||
# longer starves. E2 had 1054 failures; expect
|
||||
# ≤ E1's 85.
|
||||
# H3 (TTFT): the 231 successful E2 reqs had TTFT p50 = 0.43s,
|
||||
# well under E1's 88.6s. With the failure cascade
|
||||
# removed, these should generalize to most reqs.
|
||||
#
|
||||
# Prerequisites:
|
||||
# - source scripts/setup_env.sh
|
||||
# (sets CUDA_HOME, MC_TRANSFER_TIMEOUT=1800, etc.)
|
||||
# - outputs/inferact_50sess.jsonl exists (md5 7bb263a32600ef5a6ef5099ba340a487)
|
||||
# - Previous sweep done; GPUs idle.
|
||||
#
|
||||
# Usage:
|
||||
# bash scripts/sweep_e3_kvc_v2_loadfloor_rdma.sh
|
||||
#
|
||||
# Override defaults via env:
|
||||
# K=500 LOAD_FLOOR_BONUS=$K bash scripts/sweep_e3_kvc_v2_loadfloor_rdma.sh
|
||||
|
||||
set -euo pipefail
|
||||
cd "$(dirname "$0")/.."
|
||||
|
||||
if [ -z "${CUDA_HOME:-}" ]; then
|
||||
echo "ERROR: CUDA_HOME not set. Source scripts/setup_env.sh first." >&2
|
||||
exit 1
|
||||
fi
|
||||
|
||||
MODEL=${MODEL:-/mnt/models/Qwen/Qwen3-30B-A3B-Instruct-2507}
|
||||
TRACE=${TRACE:-outputs/inferact_50sess.jsonl}
|
||||
OUTPUT=${OUTPUT:-outputs/e3_kvc_v2_loadfloor_rdma_50sess}
|
||||
IB_DEVICE=${IB_DEVICE:-mlx5_60}
|
||||
LOAD_FLOOR_BONUS=${LOAD_FLOOR_BONUS:-200}
|
||||
|
||||
if [ ! -f "$TRACE" ]; then
|
||||
echo "ERROR: trace not found at $TRACE" >&2
|
||||
echo "Run: uv run --no-sync python scripts/sample_trace_subset.py --output $TRACE --sessions 50" >&2
|
||||
exit 1
|
||||
fi
|
||||
|
||||
mkdir -p "$OUTPUT"
|
||||
LOG="$OUTPUT/sweep.log"
|
||||
|
||||
log() { echo "[$(date '+%Y-%m-%d %H:%M:%S')] $*" | tee -a "$LOG"; }
|
||||
|
||||
log "=== E3: KVC v2 + RDMA + load-floor bonus K=$LOAD_FLOOR_BONUS, ts=1 ==="
|
||||
log "MODEL=$MODEL"
|
||||
log "TRACE=$TRACE ($(wc -l < $TRACE) requests)"
|
||||
log "OUTPUT=$OUTPUT"
|
||||
log "IB_DEVICE=$IB_DEVICE"
|
||||
log "MC_TRANSFER_TIMEOUT=${MC_TRANSFER_TIMEOUT:-default-30s}"
|
||||
|
||||
label=e3_kvc_v2_loadfloor_run1
|
||||
log ""
|
||||
log "=== [E3] $label starting ==="
|
||||
|
||||
uv run --no-sync python -m agentic_pd_hybrid.cli benchmark-live \
|
||||
--trace "$TRACE" \
|
||||
--output-root "$OUTPUT" \
|
||||
--mechanism kvcache-centric \
|
||||
--policy kv-aware \
|
||||
--model-path "$MODEL" \
|
||||
--prefill-workers 1 --decode-workers 3 \
|
||||
--prefill-tp-size 1 --decode-tp-size 1 \
|
||||
--prefill-gpu-ids 0 --decode-gpu-ids 1,2,3 \
|
||||
--transfer-backend mooncake \
|
||||
--force-rdma --ib-device "$IB_DEVICE" \
|
||||
--gpu-budget 4 \
|
||||
--time-scale 1 \
|
||||
--session-sample-rate 1.0 \
|
||||
--target-duration-s 100000 \
|
||||
--concurrency-limit 32 \
|
||||
--timeout-s 1800 \
|
||||
--request-timeout-s 300 \
|
||||
--kvcache-admission-mode worker \
|
||||
--kvcache-seed-min-turn-id 1 \
|
||||
--kvcache-seed-max-inflight-decode -1 \
|
||||
--kvcache-prefill-backup-policy release-after-transfer \
|
||||
--kvcache-prefill-priority-eviction \
|
||||
--kvcache-migration-reject-threshold 3 \
|
||||
--kvcache-direct-max-uncached-tokens 8192 \
|
||||
--kvcache-load-floor-bonus "$LOAD_FLOOR_BONUS" 2>&1 | tee -a "$LOG"
|
||||
|
||||
run_dir=$(ls -td "$OUTPUT"/kvcache-centric-*/ 2>/dev/null | head -1)
|
||||
log "=== [E3] $label COMPLETED, artifacts at $run_dir ==="
|
||||
|
||||
if [ -f "$run_dir/request-metrics.jsonl.summary.json" ]; then
|
||||
cp "$run_dir/request-metrics.jsonl.summary.json" "$OUTPUT/${label}_summary.json"
|
||||
cp "$run_dir/request-metrics.jsonl" "$OUTPUT/${label}_metrics.jsonl"
|
||||
log "=== summary saved to $OUTPUT/${label}_summary.json ==="
|
||||
fi
|
||||
60
scripts/sweep_kvc_qwen3_30b.sh
Executable file
60
scripts/sweep_kvc_qwen3_30b.sh
Executable file
@@ -0,0 +1,60 @@
|
||||
#!/bin/bash
|
||||
# KVC admission control parameter sweep on Qwen3-30B
|
||||
# 5 experiments, ~35 min each, ~3 hours total
|
||||
set -euo pipefail
|
||||
cd "$(dirname "$0")/.."
|
||||
|
||||
MODEL=/mnt/kzlin/workflow/pd-hybrid/simm-swe-bench/models/Qwen3-30B-A3B-Instruct-2507
|
||||
TRACE=outputs/qwen35-swebench-50sess.jsonl
|
||||
OUTPUT=outputs/qwen3-30b-exps
|
||||
VENV_PYTHON=.venv/bin/python
|
||||
|
||||
run_kvc() {
|
||||
local label=$1
|
||||
local inflight=$2
|
||||
local min_turn=$3
|
||||
|
||||
echo "=== [$label] inflight=$inflight min_turn=$min_turn === $(date)"
|
||||
PYTHONPATH=src:third_party/sglang/python \
|
||||
$VENV_PYTHON -m agentic_pd_hybrid.cli benchmark-live \
|
||||
--trace $TRACE \
|
||||
--output-root $OUTPUT \
|
||||
--mechanism kvcache-centric \
|
||||
--policy default \
|
||||
--model-path $MODEL \
|
||||
--prefill-workers 1 --decode-workers 1 \
|
||||
--prefill-tp-size 4 --decode-tp-size 4 \
|
||||
--prefill-gpu-ids 0,1,2,3 --decode-gpu-ids 4,5,6,7 \
|
||||
--transfer-backend mooncake \
|
||||
--gpu-budget 8 \
|
||||
--time-scale 10 \
|
||||
--session-sample-rate 1.0 \
|
||||
--target-duration-s 100000 \
|
||||
--concurrency-limit 32 \
|
||||
--timeout-s 900 \
|
||||
--request-timeout-s 300 \
|
||||
--kvcache-admission-mode worker \
|
||||
--kvcache-seed-min-turn-id $min_turn \
|
||||
--kvcache-seed-max-inflight-decode $inflight \
|
||||
--kvcache-prefill-backup-policy release-after-transfer \
|
||||
--kvcache-prefill-priority-eviction
|
||||
echo "=== [$label] DONE === $(date)"
|
||||
echo ""
|
||||
}
|
||||
|
||||
# C1: inflight=8, min-turn=2
|
||||
run_kvc "C1" 8 2
|
||||
|
||||
# C2: inflight=16, min-turn=2
|
||||
run_kvc "C2" 16 2
|
||||
|
||||
# C3: inflight=-1 (disabled), min-turn=2
|
||||
run_kvc "C3" -1 2
|
||||
|
||||
# C4: inflight=8, min-turn=1
|
||||
run_kvc "C4" 8 1
|
||||
|
||||
# C5: inflight=-1 (disabled), min-turn=1
|
||||
run_kvc "C5" -1 1
|
||||
|
||||
echo "=== ALL SWEEP EXPERIMENTS DONE === $(date)"
|
||||
133
scripts/sweep_tp1_configs.sh
Executable file
133
scripts/sweep_tp1_configs.sh
Executable file
@@ -0,0 +1,133 @@
|
||||
#!/bin/bash
|
||||
# TP1 configuration sweep: 8-way DP, 1P7D KVC, 2P6D KVC
|
||||
# Qwen3-30B-A3B TP=1, single GPU per worker
|
||||
# Most aggressive KVC admission: inflight=-1 (off), seed-min-turn=1
|
||||
set -euo pipefail
|
||||
cd "$(dirname "$0")/.."
|
||||
|
||||
MODEL=/mnt/kzlin/workflow/pd-hybrid/simm-swe-bench/models/Qwen3-30B-A3B-Instruct-2507
|
||||
TRACE=outputs/qwen35-swebench-50sess.jsonl
|
||||
OUTPUT=outputs/qwen3-30b-tp1-exps
|
||||
VENV_PYTHON=.venv/bin/python
|
||||
RESULTS_FILE=$OUTPUT/sweep_results.txt
|
||||
|
||||
mkdir -p $OUTPUT
|
||||
|
||||
log() {
|
||||
echo "[$(date '+%Y-%m-%d %H:%M:%S')] $*" | tee -a $RESULTS_FILE
|
||||
}
|
||||
|
||||
save_result() {
|
||||
local label=$1
|
||||
local run_dir=$2
|
||||
log "=== $label COMPLETED ==="
|
||||
if [ -f "$run_dir/request-metrics.jsonl.summary.json" ]; then
|
||||
log "Summary:"
|
||||
cat "$run_dir/request-metrics.jsonl.summary.json" >> $RESULTS_FILE
|
||||
echo "" >> $RESULTS_FILE
|
||||
# Also copy summary to a named file for easy access
|
||||
cp "$run_dir/request-metrics.jsonl.summary.json" "$OUTPUT/${label}_summary.json"
|
||||
log "Saved to $OUTPUT/${label}_summary.json"
|
||||
else
|
||||
log "WARNING: No summary file found in $run_dir"
|
||||
fi
|
||||
}
|
||||
|
||||
log "Starting TP1 configuration sweep"
|
||||
log "Model: $MODEL"
|
||||
log "Trace: $TRACE (4449 requests, 52 sessions)"
|
||||
|
||||
########################################
|
||||
# Experiment 1: 8-way DP cache-aware
|
||||
########################################
|
||||
log ""
|
||||
log "=== [EXP1] 8-way DP cache-aware (8 direct × TP1) ==="
|
||||
PYTHONPATH=src:third_party/sglang/python \
|
||||
$VENV_PYTHON -m agentic_pd_hybrid.cli benchmark-live \
|
||||
--trace $TRACE \
|
||||
--output-root $OUTPUT \
|
||||
--mechanism pd-colo \
|
||||
--policy kv-aware \
|
||||
--model-path $MODEL \
|
||||
--prefill-workers 0 --decode-workers 0 \
|
||||
--direct-workers 8 --direct-tp-size 1 \
|
||||
--direct-gpu-ids 0,1,2,3,4,5,6,7 \
|
||||
--gpu-budget 8 \
|
||||
--time-scale 10 \
|
||||
--session-sample-rate 1.0 \
|
||||
--target-duration-s 100000 \
|
||||
--concurrency-limit 32 \
|
||||
--timeout-s 900 \
|
||||
--request-timeout-s 300
|
||||
|
||||
# Find latest run dir for this experiment
|
||||
EXP1_DIR=$(ls -td $OUTPUT/pd-colo-kv-aware-*/ 2>/dev/null | head -1)
|
||||
save_result "exp1_8way_dp_cache_aware" "$EXP1_DIR"
|
||||
|
||||
########################################
|
||||
# Experiment 2: 1P + 7D KVC (most aggressive)
|
||||
########################################
|
||||
log ""
|
||||
log "=== [EXP2] 1P7D KVC (inflight=off, min-turn=1) ==="
|
||||
PYTHONPATH=src:third_party/sglang/python \
|
||||
$VENV_PYTHON -m agentic_pd_hybrid.cli benchmark-live \
|
||||
--trace $TRACE \
|
||||
--output-root $OUTPUT \
|
||||
--mechanism kvcache-centric \
|
||||
--policy default \
|
||||
--model-path $MODEL \
|
||||
--prefill-workers 1 --decode-workers 7 \
|
||||
--prefill-tp-size 1 --decode-tp-size 1 \
|
||||
--prefill-gpu-ids 0 --decode-gpu-ids 1,2,3,4,5,6,7 \
|
||||
--transfer-backend mooncake \
|
||||
--gpu-budget 8 \
|
||||
--time-scale 10 \
|
||||
--session-sample-rate 1.0 \
|
||||
--target-duration-s 100000 \
|
||||
--concurrency-limit 32 \
|
||||
--timeout-s 900 \
|
||||
--request-timeout-s 300 \
|
||||
--kvcache-admission-mode worker \
|
||||
--kvcache-seed-min-turn-id 1 \
|
||||
--kvcache-seed-max-inflight-decode -1 \
|
||||
--kvcache-prefill-backup-policy release-after-transfer \
|
||||
--kvcache-prefill-priority-eviction
|
||||
|
||||
EXP2_DIR=$(ls -td $OUTPUT/kvcache-centric-*/ 2>/dev/null | head -1)
|
||||
save_result "exp2_1p7d_kvc_aggressive" "$EXP2_DIR"
|
||||
|
||||
########################################
|
||||
# Experiment 3: 2P + 6D KVC (most aggressive)
|
||||
########################################
|
||||
log ""
|
||||
log "=== [EXP3] 2P6D KVC (inflight=off, min-turn=1) ==="
|
||||
PYTHONPATH=src:third_party/sglang/python \
|
||||
$VENV_PYTHON -m agentic_pd_hybrid.cli benchmark-live \
|
||||
--trace $TRACE \
|
||||
--output-root $OUTPUT \
|
||||
--mechanism kvcache-centric \
|
||||
--policy default \
|
||||
--model-path $MODEL \
|
||||
--prefill-workers 2 --decode-workers 6 \
|
||||
--prefill-tp-size 1 --decode-tp-size 1 \
|
||||
--prefill-gpu-ids 0,1 --decode-gpu-ids 2,3,4,5,6,7 \
|
||||
--transfer-backend mooncake \
|
||||
--gpu-budget 8 \
|
||||
--time-scale 10 \
|
||||
--session-sample-rate 1.0 \
|
||||
--target-duration-s 100000 \
|
||||
--concurrency-limit 32 \
|
||||
--timeout-s 900 \
|
||||
--request-timeout-s 300 \
|
||||
--kvcache-admission-mode worker \
|
||||
--kvcache-seed-min-turn-id 1 \
|
||||
--kvcache-seed-max-inflight-decode -1 \
|
||||
--kvcache-prefill-backup-policy release-after-transfer \
|
||||
--kvcache-prefill-priority-eviction
|
||||
|
||||
EXP3_DIR=$(ls -td $OUTPUT/kvcache-centric-*/ 2>/dev/null | head -1)
|
||||
save_result "exp3_2p6d_kvc_aggressive" "$EXP3_DIR"
|
||||
|
||||
########################################
|
||||
log ""
|
||||
log "=== ALL TP1 SWEEP EXPERIMENTS DONE ==="
|
||||
131
scripts/sweep_tp1_v2_fixed.sh
Executable file
131
scripts/sweep_tp1_v2_fixed.sh
Executable file
@@ -0,0 +1,131 @@
|
||||
#!/bin/bash
|
||||
# TP1 configuration sweep v2 — after session_params fix + audit fields
|
||||
# Qwen3-30B-A3B TP=1, single GPU per worker
|
||||
set -euo pipefail
|
||||
cd "$(dirname "$0")/.."
|
||||
|
||||
MODEL=/mnt/kzlin/workflow/pd-hybrid/simm-swe-bench/models/Qwen3-30B-A3B-Instruct-2507
|
||||
TRACE=outputs/qwen35-swebench-50sess.jsonl
|
||||
OUTPUT=outputs/qwen3-30b-tp1-v2-fixed
|
||||
VENV_PYTHON=.venv/bin/python
|
||||
RESULTS_FILE=$OUTPUT/sweep_results.txt
|
||||
|
||||
mkdir -p $OUTPUT
|
||||
|
||||
log() {
|
||||
echo "[$(date '+%Y-%m-%d %H:%M:%S')] $*" | tee -a $RESULTS_FILE
|
||||
}
|
||||
|
||||
save_result() {
|
||||
local label=$1
|
||||
local run_dir=$2
|
||||
log "=== $label COMPLETED ==="
|
||||
if [ -f "$run_dir/request-metrics.jsonl.summary.json" ]; then
|
||||
log "Summary:"
|
||||
cat "$run_dir/request-metrics.jsonl.summary.json" >> $RESULTS_FILE
|
||||
echo "" >> $RESULTS_FILE
|
||||
cp "$run_dir/request-metrics.jsonl.summary.json" "$OUTPUT/${label}_summary.json"
|
||||
cp "$run_dir/request-metrics.jsonl" "$OUTPUT/${label}_metrics.jsonl"
|
||||
log "Saved to $OUTPUT/${label}_summary.json + ${label}_metrics.jsonl"
|
||||
else
|
||||
log "WARNING: No summary file found in $run_dir"
|
||||
fi
|
||||
}
|
||||
|
||||
log "Starting TP1 v2 sweep (session_params fix + audit fields)"
|
||||
log "Model: $MODEL"
|
||||
log "Trace: $TRACE (4449 requests, 52 sessions)"
|
||||
|
||||
########################################
|
||||
# Experiment 1: 8-way DP cache-aware
|
||||
########################################
|
||||
log ""
|
||||
log "=== [EXP1] 8-way DP cache-aware (8 direct × TP1) ==="
|
||||
PYTHONPATH=src:third_party/sglang/python \
|
||||
$VENV_PYTHON -m agentic_pd_hybrid.cli benchmark-live \
|
||||
--trace $TRACE \
|
||||
--output-root $OUTPUT \
|
||||
--mechanism pd-colo \
|
||||
--policy kv-aware \
|
||||
--model-path $MODEL \
|
||||
--prefill-workers 0 --decode-workers 0 \
|
||||
--direct-workers 8 --direct-tp-size 1 \
|
||||
--direct-gpu-ids 0,1,2,3,4,5,6,7 \
|
||||
--gpu-budget 8 \
|
||||
--time-scale 10 \
|
||||
--session-sample-rate 1.0 \
|
||||
--target-duration-s 100000 \
|
||||
--concurrency-limit 32 \
|
||||
--timeout-s 900 \
|
||||
--request-timeout-s 300
|
||||
|
||||
EXP1_DIR=$(ls -td $OUTPUT/pd-colo-kv-aware-*/ 2>/dev/null | head -1)
|
||||
save_result "exp1_8way_dp_cache_aware" "$EXP1_DIR"
|
||||
|
||||
########################################
|
||||
# Experiment 2: 1P + 7D KVC (aggressive)
|
||||
########################################
|
||||
log ""
|
||||
log "=== [EXP2] 1P7D KVC (inflight=off, min-turn=1) ==="
|
||||
PYTHONPATH=src:third_party/sglang/python \
|
||||
$VENV_PYTHON -m agentic_pd_hybrid.cli benchmark-live \
|
||||
--trace $TRACE \
|
||||
--output-root $OUTPUT \
|
||||
--mechanism kvcache-centric \
|
||||
--policy default \
|
||||
--model-path $MODEL \
|
||||
--prefill-workers 1 --decode-workers 7 \
|
||||
--prefill-tp-size 1 --decode-tp-size 1 \
|
||||
--prefill-gpu-ids 0 --decode-gpu-ids 1,2,3,4,5,6,7 \
|
||||
--transfer-backend mooncake \
|
||||
--gpu-budget 8 \
|
||||
--time-scale 10 \
|
||||
--session-sample-rate 1.0 \
|
||||
--target-duration-s 100000 \
|
||||
--concurrency-limit 32 \
|
||||
--timeout-s 900 \
|
||||
--request-timeout-s 300 \
|
||||
--kvcache-admission-mode worker \
|
||||
--kvcache-seed-min-turn-id 1 \
|
||||
--kvcache-seed-max-inflight-decode -1 \
|
||||
--kvcache-prefill-backup-policy release-after-transfer \
|
||||
--kvcache-prefill-priority-eviction
|
||||
|
||||
EXP2_DIR=$(ls -td $OUTPUT/kvcache-centric-*/ 2>/dev/null | head -1)
|
||||
save_result "exp2_1p7d_kvc_aggressive" "$EXP2_DIR"
|
||||
|
||||
########################################
|
||||
# Experiment 3: 2P + 6D KVC (aggressive)
|
||||
########################################
|
||||
log ""
|
||||
log "=== [EXP3] 2P6D KVC (inflight=off, min-turn=1) ==="
|
||||
PYTHONPATH=src:third_party/sglang/python \
|
||||
$VENV_PYTHON -m agentic_pd_hybrid.cli benchmark-live \
|
||||
--trace $TRACE \
|
||||
--output-root $OUTPUT \
|
||||
--mechanism kvcache-centric \
|
||||
--policy default \
|
||||
--model-path $MODEL \
|
||||
--prefill-workers 2 --decode-workers 6 \
|
||||
--prefill-tp-size 1 --decode-tp-size 1 \
|
||||
--prefill-gpu-ids 0,1 --decode-gpu-ids 2,3,4,5,6,7 \
|
||||
--transfer-backend mooncake \
|
||||
--gpu-budget 8 \
|
||||
--time-scale 10 \
|
||||
--session-sample-rate 1.0 \
|
||||
--target-duration-s 100000 \
|
||||
--concurrency-limit 32 \
|
||||
--timeout-s 900 \
|
||||
--request-timeout-s 300 \
|
||||
--kvcache-admission-mode worker \
|
||||
--kvcache-seed-min-turn-id 1 \
|
||||
--kvcache-seed-max-inflight-decode -1 \
|
||||
--kvcache-prefill-backup-policy release-after-transfer \
|
||||
--kvcache-prefill-priority-eviction
|
||||
|
||||
EXP3_DIR=$(ls -td $OUTPUT/kvcache-centric-*/ 2>/dev/null | head -1)
|
||||
save_result "exp3_2p6d_kvc_aggressive" "$EXP3_DIR"
|
||||
|
||||
########################################
|
||||
log ""
|
||||
log "=== ALL TP1 V2 SWEEP EXPERIMENTS DONE ==="
|
||||
108
scripts/sweep_tp1_v3_kvaware.sh
Executable file
108
scripts/sweep_tp1_v3_kvaware.sh
Executable file
@@ -0,0 +1,108 @@
|
||||
#!/bin/bash
|
||||
# TP1 v3 sweep — KVC with kv-aware policy (fix routing mismatch)
|
||||
# v2 used --policy default for KVC experiments, causing session routing
|
||||
# mismatch: replay round-robin ≠ router round-robin → "session not found".
|
||||
# v3 uses --policy kv-aware for KVC to ensure session affinity.
|
||||
set -euo pipefail
|
||||
cd "$(dirname "$0")/.."
|
||||
|
||||
MODEL=/mnt/kzlin/workflow/pd-hybrid/simm-swe-bench/models/Qwen3-30B-A3B-Instruct-2507
|
||||
TRACE=outputs/qwen35-swebench-50sess.jsonl
|
||||
OUTPUT=outputs/qwen3-30b-tp1-v3-kvaware
|
||||
VENV_PYTHON=.venv/bin/python
|
||||
RESULTS_FILE=$OUTPUT/sweep_results.txt
|
||||
|
||||
mkdir -p $OUTPUT
|
||||
|
||||
log() {
|
||||
echo "[$(date '+%Y-%m-%d %H:%M:%S')] $*" | tee -a $RESULTS_FILE
|
||||
}
|
||||
|
||||
save_result() {
|
||||
local label=$1
|
||||
local run_dir=$2
|
||||
log "=== $label COMPLETED ==="
|
||||
if [ -f "$run_dir/request-metrics.jsonl.summary.json" ]; then
|
||||
log "Summary:"
|
||||
cat "$run_dir/request-metrics.jsonl.summary.json" >> $RESULTS_FILE
|
||||
echo "" >> $RESULTS_FILE
|
||||
cp "$run_dir/request-metrics.jsonl.summary.json" "$OUTPUT/${label}_summary.json"
|
||||
cp "$run_dir/request-metrics.jsonl" "$OUTPUT/${label}_metrics.jsonl"
|
||||
log "Saved to $OUTPUT/${label}_summary.json + ${label}_metrics.jsonl"
|
||||
else
|
||||
log "WARNING: No summary file found in $run_dir"
|
||||
fi
|
||||
}
|
||||
|
||||
log "Starting TP1 v3 sweep (KVC with kv-aware policy)"
|
||||
log "Model: $MODEL"
|
||||
log "Trace: $TRACE (4449 requests, 52 sessions)"
|
||||
log "Key change: --policy kv-aware for KVC (was --policy default in v2)"
|
||||
|
||||
########################################
|
||||
# Experiment 1: 1P + 7D KVC kv-aware
|
||||
########################################
|
||||
log ""
|
||||
log "=== [EXP1] 1P7D KVC kv-aware ==="
|
||||
PYTHONPATH=src:third_party/sglang/python \
|
||||
$VENV_PYTHON -m agentic_pd_hybrid.cli benchmark-live \
|
||||
--trace $TRACE \
|
||||
--output-root $OUTPUT \
|
||||
--mechanism kvcache-centric \
|
||||
--policy kv-aware \
|
||||
--model-path $MODEL \
|
||||
--prefill-workers 1 --decode-workers 7 \
|
||||
--prefill-tp-size 1 --decode-tp-size 1 \
|
||||
--prefill-gpu-ids 0 --decode-gpu-ids 1,2,3,4,5,6,7 \
|
||||
--transfer-backend mooncake \
|
||||
--gpu-budget 8 \
|
||||
--time-scale 10 \
|
||||
--session-sample-rate 1.0 \
|
||||
--target-duration-s 100000 \
|
||||
--concurrency-limit 32 \
|
||||
--timeout-s 900 \
|
||||
--request-timeout-s 300 \
|
||||
--kvcache-admission-mode worker \
|
||||
--kvcache-seed-min-turn-id 1 \
|
||||
--kvcache-seed-max-inflight-decode -1 \
|
||||
--kvcache-prefill-backup-policy release-after-transfer \
|
||||
--kvcache-prefill-priority-eviction
|
||||
|
||||
EXP1_DIR=$(ls -td $OUTPUT/kvcache-centric-*/ 2>/dev/null | head -1)
|
||||
save_result "exp1_1p7d_kvc_kvaware" "$EXP1_DIR"
|
||||
|
||||
########################################
|
||||
# Experiment 2: 2P + 6D KVC kv-aware
|
||||
########################################
|
||||
log ""
|
||||
log "=== [EXP2] 2P6D KVC kv-aware ==="
|
||||
PYTHONPATH=src:third_party/sglang/python \
|
||||
$VENV_PYTHON -m agentic_pd_hybrid.cli benchmark-live \
|
||||
--trace $TRACE \
|
||||
--output-root $OUTPUT \
|
||||
--mechanism kvcache-centric \
|
||||
--policy kv-aware \
|
||||
--model-path $MODEL \
|
||||
--prefill-workers 2 --decode-workers 6 \
|
||||
--prefill-tp-size 1 --decode-tp-size 1 \
|
||||
--prefill-gpu-ids 0,1 --decode-gpu-ids 2,3,4,5,6,7 \
|
||||
--transfer-backend mooncake \
|
||||
--gpu-budget 8 \
|
||||
--time-scale 10 \
|
||||
--session-sample-rate 1.0 \
|
||||
--target-duration-s 100000 \
|
||||
--concurrency-limit 32 \
|
||||
--timeout-s 900 \
|
||||
--request-timeout-s 300 \
|
||||
--kvcache-admission-mode worker \
|
||||
--kvcache-seed-min-turn-id 1 \
|
||||
--kvcache-seed-max-inflight-decode -1 \
|
||||
--kvcache-prefill-backup-policy release-after-transfer \
|
||||
--kvcache-prefill-priority-eviction
|
||||
|
||||
EXP2_DIR=$(ls -td $OUTPUT/kvcache-centric-*/ 2>/dev/null | head -1)
|
||||
save_result "exp2_2p6d_kvc_kvaware" "$EXP2_DIR"
|
||||
|
||||
########################################
|
||||
log ""
|
||||
log "=== ALL TP1 V3 SWEEP EXPERIMENTS DONE ==="
|
||||
108
scripts/sweep_tp1_v4_cap16.sh
Executable file
108
scripts/sweep_tp1_v4_cap16.sh
Executable file
@@ -0,0 +1,108 @@
|
||||
#!/bin/bash
|
||||
# TP1 v4 sweep — KVC with kv-aware policy + soft_cap raised from 4 to 16
|
||||
# v3 (kv-aware) fixed routing but session-cap fallback still dominated 52-65%
|
||||
# of requests. Hardcoded min(4, ...) in _decode_session_soft_cap was the
|
||||
# bottleneck — only 4*7=28 session slots for 52 trace sessions.
|
||||
# v4 raises the cap to 16 (4*7=28 -> 16*7=112 slots).
|
||||
set -euo pipefail
|
||||
cd "$(dirname "$0")/.."
|
||||
|
||||
MODEL=/mnt/kzlin/workflow/pd-hybrid/simm-swe-bench/models/Qwen3-30B-A3B-Instruct-2507
|
||||
TRACE=outputs/qwen35-swebench-50sess.jsonl
|
||||
OUTPUT=outputs/qwen3-30b-tp1-v4-cap16
|
||||
VENV_PYTHON=.venv/bin/python
|
||||
RESULTS_FILE=$OUTPUT/sweep_results.txt
|
||||
|
||||
mkdir -p $OUTPUT
|
||||
|
||||
log() {
|
||||
echo "[$(date '+%Y-%m-%d %H:%M:%S')] $*" | tee -a $RESULTS_FILE
|
||||
}
|
||||
|
||||
save_result() {
|
||||
local label=$1
|
||||
local run_dir=$2
|
||||
log "=== $label COMPLETED ==="
|
||||
if [ -f "$run_dir/request-metrics.jsonl.summary.json" ]; then
|
||||
log "Summary:"
|
||||
cat "$run_dir/request-metrics.jsonl.summary.json" >> $RESULTS_FILE
|
||||
echo "" >> $RESULTS_FILE
|
||||
cp "$run_dir/request-metrics.jsonl.summary.json" "$OUTPUT/${label}_summary.json"
|
||||
cp "$run_dir/request-metrics.jsonl" "$OUTPUT/${label}_metrics.jsonl"
|
||||
log "Saved to $OUTPUT/${label}_summary.json + ${label}_metrics.jsonl"
|
||||
else
|
||||
log "WARNING: No summary file found in $run_dir"
|
||||
fi
|
||||
}
|
||||
|
||||
log "Starting TP1 v4 sweep (KVC kv-aware, session soft_cap raised 4->16)"
|
||||
log "Model: $MODEL"
|
||||
log "Trace: $TRACE (4449 requests, 52 sessions)"
|
||||
log "Key change: _decode_session_soft_cap now min(16, ...) instead of min(4, ...)"
|
||||
|
||||
########################################
|
||||
# Experiment 1: 1P + 7D KVC kv-aware (cap=16)
|
||||
########################################
|
||||
log ""
|
||||
log "=== [EXP1] 1P7D KVC kv-aware cap=16 ==="
|
||||
PYTHONPATH=src:third_party/sglang/python \
|
||||
$VENV_PYTHON -m agentic_pd_hybrid.cli benchmark-live \
|
||||
--trace $TRACE \
|
||||
--output-root $OUTPUT \
|
||||
--mechanism kvcache-centric \
|
||||
--policy kv-aware \
|
||||
--model-path $MODEL \
|
||||
--prefill-workers 1 --decode-workers 7 \
|
||||
--prefill-tp-size 1 --decode-tp-size 1 \
|
||||
--prefill-gpu-ids 0 --decode-gpu-ids 1,2,3,4,5,6,7 \
|
||||
--transfer-backend mooncake \
|
||||
--gpu-budget 8 \
|
||||
--time-scale 10 \
|
||||
--session-sample-rate 1.0 \
|
||||
--target-duration-s 100000 \
|
||||
--concurrency-limit 32 \
|
||||
--timeout-s 900 \
|
||||
--request-timeout-s 300 \
|
||||
--kvcache-admission-mode worker \
|
||||
--kvcache-seed-min-turn-id 1 \
|
||||
--kvcache-seed-max-inflight-decode -1 \
|
||||
--kvcache-prefill-backup-policy release-after-transfer \
|
||||
--kvcache-prefill-priority-eviction
|
||||
|
||||
EXP1_DIR=$(ls -td $OUTPUT/kvcache-centric-*/ 2>/dev/null | head -1)
|
||||
save_result "exp1_1p7d_kvc_cap16" "$EXP1_DIR"
|
||||
|
||||
########################################
|
||||
# Experiment 2: 2P + 6D KVC kv-aware (cap=16)
|
||||
########################################
|
||||
log ""
|
||||
log "=== [EXP2] 2P6D KVC kv-aware cap=16 ==="
|
||||
PYTHONPATH=src:third_party/sglang/python \
|
||||
$VENV_PYTHON -m agentic_pd_hybrid.cli benchmark-live \
|
||||
--trace $TRACE \
|
||||
--output-root $OUTPUT \
|
||||
--mechanism kvcache-centric \
|
||||
--policy kv-aware \
|
||||
--model-path $MODEL \
|
||||
--prefill-workers 2 --decode-workers 6 \
|
||||
--prefill-tp-size 1 --decode-tp-size 1 \
|
||||
--prefill-gpu-ids 0,1 --decode-gpu-ids 2,3,4,5,6,7 \
|
||||
--transfer-backend mooncake \
|
||||
--gpu-budget 8 \
|
||||
--time-scale 10 \
|
||||
--session-sample-rate 1.0 \
|
||||
--target-duration-s 100000 \
|
||||
--concurrency-limit 32 \
|
||||
--timeout-s 900 \
|
||||
--request-timeout-s 300 \
|
||||
--kvcache-admission-mode worker \
|
||||
--kvcache-seed-min-turn-id 1 \
|
||||
--kvcache-seed-max-inflight-decode -1 \
|
||||
--kvcache-prefill-backup-policy release-after-transfer \
|
||||
--kvcache-prefill-priority-eviction
|
||||
|
||||
EXP2_DIR=$(ls -td $OUTPUT/kvcache-centric-*/ 2>/dev/null | head -1)
|
||||
save_result "exp2_2p6d_kvc_cap16" "$EXP2_DIR"
|
||||
|
||||
log ""
|
||||
log "=== ALL TP1 V4 SWEEP EXPERIMENTS DONE ==="
|
||||
89
scripts/sweep_tp1_v5_baseline_rerun_exp2.sh
Executable file
89
scripts/sweep_tp1_v5_baseline_rerun_exp2.sh
Executable file
@@ -0,0 +1,89 @@
|
||||
#!/bin/bash
|
||||
# P0: Re-run v5 baseline EXP2 (2P6D) three times to establish whether
|
||||
# errors=9 is a stable property of the v5 config or single-run variance.
|
||||
# Critic of V5_PROFILE_INVESTIGATION_ZH.md flagged that the 415 errors in
|
||||
# v5+profile EXP2 may have been polling-induced. We need 3 baseline runs
|
||||
# (no polling, identical config to original v5) to test reproducibility.
|
||||
#
|
||||
# Output:
|
||||
# outputs/qwen3-30b-tp1-v5-optD-baseline-rerun/
|
||||
# ├── exp2_2p6d_run{1,2,3}_summary.json
|
||||
# ├── exp2_2p6d_run{1,2,3}_metrics.jsonl
|
||||
# └── kvcache-centric-...<ts>/ (one per run)
|
||||
set -euo pipefail
|
||||
cd "$(dirname "$0")/.."
|
||||
|
||||
MODEL=/mnt/kzlin/workflow/pd-hybrid/simm-swe-bench/models/Qwen3-30B-A3B-Instruct-2507
|
||||
TRACE=outputs/qwen35-swebench-50sess.jsonl
|
||||
OUTPUT=outputs/qwen3-30b-tp1-v5-optD-baseline-rerun
|
||||
VENV_PYTHON=.venv/bin/python
|
||||
RESULTS_FILE=$OUTPUT/sweep_results.txt
|
||||
|
||||
mkdir -p $OUTPUT
|
||||
|
||||
log() {
|
||||
echo "[$(date '+%Y-%m-%d %H:%M:%S')] $*" | tee -a $RESULTS_FILE
|
||||
}
|
||||
|
||||
run_exp2() {
|
||||
local run_idx=$1
|
||||
local label="exp2_2p6d_run${run_idx}"
|
||||
log ""
|
||||
log "=== [RUN ${run_idx}/3] EXP2 2P6D KVC kv-aware Option D (no polling) ==="
|
||||
PYTHONPATH=src:third_party/sglang/python \
|
||||
$VENV_PYTHON -m agentic_pd_hybrid.cli benchmark-live \
|
||||
--trace $TRACE \
|
||||
--output-root $OUTPUT \
|
||||
--mechanism kvcache-centric \
|
||||
--policy kv-aware \
|
||||
--model-path $MODEL \
|
||||
--prefill-workers 2 --decode-workers 6 \
|
||||
--prefill-tp-size 1 --decode-tp-size 1 \
|
||||
--prefill-gpu-ids 0,1 --decode-gpu-ids 2,3,4,5,6,7 \
|
||||
--transfer-backend mooncake \
|
||||
--gpu-budget 8 \
|
||||
--time-scale 10 \
|
||||
--session-sample-rate 1.0 \
|
||||
--target-duration-s 100000 \
|
||||
--concurrency-limit 32 \
|
||||
--timeout-s 900 \
|
||||
--request-timeout-s 300 \
|
||||
--kvcache-admission-mode worker \
|
||||
--kvcache-seed-min-turn-id 1 \
|
||||
--kvcache-seed-max-inflight-decode -1 \
|
||||
--kvcache-prefill-backup-policy release-after-transfer \
|
||||
--kvcache-prefill-priority-eviction
|
||||
|
||||
local run_dir=$(ls -td $OUTPUT/kvcache-centric-*/ 2>/dev/null | head -1)
|
||||
log "=== [RUN ${run_idx}/3] $label COMPLETED ==="
|
||||
if [ -f "$run_dir/request-metrics.jsonl.summary.json" ]; then
|
||||
cp "$run_dir/request-metrics.jsonl.summary.json" "$OUTPUT/${label}_summary.json"
|
||||
cp "$run_dir/request-metrics.jsonl" "$OUTPUT/${label}_metrics.jsonl"
|
||||
local errs=$($VENV_PYTHON -c "import json; d=json.load(open('$OUTPUT/${label}_summary.json')); print(d.get('error_count',0))")
|
||||
log " errors = $errs (baseline reference = 9)"
|
||||
cat "$run_dir/request-metrics.jsonl.summary.json" >> $RESULTS_FILE
|
||||
echo "" >> $RESULTS_FILE
|
||||
else
|
||||
log "WARNING: no summary file in $run_dir"
|
||||
fi
|
||||
}
|
||||
|
||||
log "=== P0: v5 baseline EXP2 reproducibility test (3 runs, no polling) ==="
|
||||
log "Model: $MODEL"
|
||||
log "Trace: $TRACE (4449 requests, 52 sessions)"
|
||||
log "Goal: confirm whether errors=9 in v5 baseline EXP2 is reproducible"
|
||||
log " (v5+profile saw 415 errors; we need to know if polling was causal)"
|
||||
|
||||
for i in 1 2 3; do
|
||||
run_exp2 $i
|
||||
done
|
||||
|
||||
log ""
|
||||
log "=== P0 SUMMARY: errors per run ==="
|
||||
for i in 1 2 3; do
|
||||
if [ -f "$OUTPUT/exp2_2p6d_run${i}_summary.json" ]; then
|
||||
e=$($VENV_PYTHON -c "import json; d=json.load(open('$OUTPUT/exp2_2p6d_run${i}_summary.json')); print(d.get('error_count',0))")
|
||||
log " run ${i}: errors = $e"
|
||||
fi
|
||||
done
|
||||
log "=== P0 ALL DONE ==="
|
||||
114
scripts/sweep_tp1_v5_optD.sh
Executable file
114
scripts/sweep_tp1_v5_optD.sh
Executable file
@@ -0,0 +1,114 @@
|
||||
#!/bin/bash
|
||||
# TP1 v5 sweep — Option D: D-side admission for seed/reseed.
|
||||
#
|
||||
# v4 (cap=16) still saw 35% session-cap fallback because the local soft_cap
|
||||
# evaluates min(16, usable_capacity_tokens / target_tokens) and target_tokens
|
||||
# (= input + output) is 50-100K in agentic workloads, giving cap = 1-2.
|
||||
#
|
||||
# v5 makes worker admission_mode authoritative for ALL admission decisions
|
||||
# (direct_append AND seed/reseed). Replay calls D's
|
||||
# /session_cache/admit_direct_append with mode={direct_append|seed} and
|
||||
# defers to D's KV pool availability + LRU eviction. Replay's local
|
||||
# _decode_session_soft_cap is bypassed entirely under worker mode.
|
||||
set -euo pipefail
|
||||
cd "$(dirname "$0")/.."
|
||||
|
||||
MODEL=/mnt/kzlin/workflow/pd-hybrid/simm-swe-bench/models/Qwen3-30B-A3B-Instruct-2507
|
||||
TRACE=outputs/qwen35-swebench-50sess.jsonl
|
||||
OUTPUT=outputs/qwen3-30b-tp1-v5-optD
|
||||
VENV_PYTHON=.venv/bin/python
|
||||
RESULTS_FILE=$OUTPUT/sweep_results.txt
|
||||
|
||||
mkdir -p $OUTPUT
|
||||
|
||||
log() {
|
||||
echo "[$(date '+%Y-%m-%d %H:%M:%S')] $*" | tee -a $RESULTS_FILE
|
||||
}
|
||||
|
||||
save_result() {
|
||||
local label=$1
|
||||
local run_dir=$2
|
||||
log "=== $label COMPLETED ==="
|
||||
if [ -f "$run_dir/request-metrics.jsonl.summary.json" ]; then
|
||||
log "Summary:"
|
||||
cat "$run_dir/request-metrics.jsonl.summary.json" >> $RESULTS_FILE
|
||||
echo "" >> $RESULTS_FILE
|
||||
cp "$run_dir/request-metrics.jsonl.summary.json" "$OUTPUT/${label}_summary.json"
|
||||
cp "$run_dir/request-metrics.jsonl" "$OUTPUT/${label}_metrics.jsonl"
|
||||
log "Saved to $OUTPUT/${label}_summary.json + ${label}_metrics.jsonl"
|
||||
else
|
||||
log "WARNING: No summary file found in $run_dir"
|
||||
fi
|
||||
}
|
||||
|
||||
log "Starting TP1 v5 sweep (Option D: D-side seed admission)"
|
||||
log "Model: $MODEL"
|
||||
log "Trace: $TRACE (4449 requests, 52 sessions)"
|
||||
log "Key change: worker admission_mode now drives seed/reseed via D's admit endpoint"
|
||||
|
||||
########################################
|
||||
# Experiment 1: 1P + 7D KVC kv-aware Option D
|
||||
########################################
|
||||
log ""
|
||||
log "=== [EXP1] 1P7D KVC kv-aware Option D ==="
|
||||
PYTHONPATH=src:third_party/sglang/python \
|
||||
$VENV_PYTHON -m agentic_pd_hybrid.cli benchmark-live \
|
||||
--trace $TRACE \
|
||||
--output-root $OUTPUT \
|
||||
--mechanism kvcache-centric \
|
||||
--policy kv-aware \
|
||||
--model-path $MODEL \
|
||||
--prefill-workers 1 --decode-workers 7 \
|
||||
--prefill-tp-size 1 --decode-tp-size 1 \
|
||||
--prefill-gpu-ids 0 --decode-gpu-ids 1,2,3,4,5,6,7 \
|
||||
--transfer-backend mooncake \
|
||||
--gpu-budget 8 \
|
||||
--time-scale 10 \
|
||||
--session-sample-rate 1.0 \
|
||||
--target-duration-s 100000 \
|
||||
--concurrency-limit 32 \
|
||||
--timeout-s 900 \
|
||||
--request-timeout-s 300 \
|
||||
--kvcache-admission-mode worker \
|
||||
--kvcache-seed-min-turn-id 1 \
|
||||
--kvcache-seed-max-inflight-decode -1 \
|
||||
--kvcache-prefill-backup-policy release-after-transfer \
|
||||
--kvcache-prefill-priority-eviction
|
||||
|
||||
EXP1_DIR=$(ls -td $OUTPUT/kvcache-centric-*/ 2>/dev/null | head -1)
|
||||
save_result "exp1_1p7d_kvc_optD" "$EXP1_DIR"
|
||||
|
||||
########################################
|
||||
# Experiment 2: 2P + 6D KVC kv-aware Option D
|
||||
########################################
|
||||
log ""
|
||||
log "=== [EXP2] 2P6D KVC kv-aware Option D ==="
|
||||
PYTHONPATH=src:third_party/sglang/python \
|
||||
$VENV_PYTHON -m agentic_pd_hybrid.cli benchmark-live \
|
||||
--trace $TRACE \
|
||||
--output-root $OUTPUT \
|
||||
--mechanism kvcache-centric \
|
||||
--policy kv-aware \
|
||||
--model-path $MODEL \
|
||||
--prefill-workers 2 --decode-workers 6 \
|
||||
--prefill-tp-size 1 --decode-tp-size 1 \
|
||||
--prefill-gpu-ids 0,1 --decode-gpu-ids 2,3,4,5,6,7 \
|
||||
--transfer-backend mooncake \
|
||||
--gpu-budget 8 \
|
||||
--time-scale 10 \
|
||||
--session-sample-rate 1.0 \
|
||||
--target-duration-s 100000 \
|
||||
--concurrency-limit 32 \
|
||||
--timeout-s 900 \
|
||||
--request-timeout-s 300 \
|
||||
--kvcache-admission-mode worker \
|
||||
--kvcache-seed-min-turn-id 1 \
|
||||
--kvcache-seed-max-inflight-decode -1 \
|
||||
--kvcache-prefill-backup-policy release-after-transfer \
|
||||
--kvcache-prefill-priority-eviction
|
||||
|
||||
EXP2_DIR=$(ls -td $OUTPUT/kvcache-centric-*/ 2>/dev/null | head -1)
|
||||
save_result "exp2_2p6d_kvc_optD" "$EXP2_DIR"
|
||||
|
||||
log ""
|
||||
log "=== ALL TP1 V5 SWEEP EXPERIMENTS DONE ==="
|
||||
125
scripts/sweep_tp1_v5_optD_profile.sh
Executable file
125
scripts/sweep_tp1_v5_optD_profile.sh
Executable file
@@ -0,0 +1,125 @@
|
||||
#!/bin/bash
|
||||
# TP1 v5 + profiling — re-run the v5 (Option D) config with the new
|
||||
# d-pool-timeseries poller enabled, so we can attribute each session-cap
|
||||
# fallback to actual D KV pool occupancy (held vs available vs idle-evictable
|
||||
# vs prefill-backup) instead of guessing.
|
||||
#
|
||||
# Output:
|
||||
# outputs/qwen3-30b-tp1-v5-optD-profile/
|
||||
# ├── kvcache-centric-kv-aware-worker-admission-<ts>/
|
||||
# │ ├── request-metrics.jsonl
|
||||
# │ ├── request-metrics.jsonl.summary.json
|
||||
# │ └── d-pool-timeseries.jsonl ← NEW (1Hz P/D /server_info snapshots)
|
||||
# ├── exp1_1p7d_kvc_optD_profile_metrics.jsonl
|
||||
# └── exp2_2p6d_kvc_optD_profile_metrics.jsonl
|
||||
set -euo pipefail
|
||||
cd "$(dirname "$0")/.."
|
||||
|
||||
MODEL=/mnt/kzlin/workflow/pd-hybrid/simm-swe-bench/models/Qwen3-30B-A3B-Instruct-2507
|
||||
TRACE=outputs/qwen35-swebench-50sess.jsonl
|
||||
OUTPUT=outputs/qwen3-30b-tp1-v5-optD-profile
|
||||
VENV_PYTHON=.venv/bin/python
|
||||
RESULTS_FILE=$OUTPUT/sweep_results.txt
|
||||
POLL_INTERVAL=1.0
|
||||
|
||||
mkdir -p $OUTPUT
|
||||
|
||||
log() {
|
||||
echo "[$(date '+%Y-%m-%d %H:%M:%S')] $*" | tee -a $RESULTS_FILE
|
||||
}
|
||||
|
||||
save_result() {
|
||||
local label=$1
|
||||
local run_dir=$2
|
||||
log "=== $label COMPLETED ==="
|
||||
if [ -f "$run_dir/request-metrics.jsonl.summary.json" ]; then
|
||||
log "Summary:"
|
||||
cat "$run_dir/request-metrics.jsonl.summary.json" >> $RESULTS_FILE
|
||||
echo "" >> $RESULTS_FILE
|
||||
cp "$run_dir/request-metrics.jsonl.summary.json" "$OUTPUT/${label}_summary.json"
|
||||
cp "$run_dir/request-metrics.jsonl" "$OUTPUT/${label}_metrics.jsonl"
|
||||
if [ -f "$run_dir/d-pool-timeseries.jsonl" ]; then
|
||||
cp "$run_dir/d-pool-timeseries.jsonl" "$OUTPUT/${label}_pool_timeseries.jsonl"
|
||||
log "Pool timeseries: $(wc -l < $OUTPUT/${label}_pool_timeseries.jsonl) rows"
|
||||
else
|
||||
log "WARNING: no d-pool-timeseries.jsonl produced"
|
||||
fi
|
||||
log "Saved to $OUTPUT/${label}_summary.json + ${label}_metrics.jsonl + ${label}_pool_timeseries.jsonl"
|
||||
else
|
||||
log "WARNING: No summary file found in $run_dir"
|
||||
fi
|
||||
}
|
||||
|
||||
log "Starting TP1 v5 + profile sweep (Option D + ${POLL_INTERVAL}s pool polling)"
|
||||
log "Model: $MODEL"
|
||||
log "Trace: $TRACE (4449 requests, 52 sessions)"
|
||||
log "Profiling: --pool-poll-interval-s $POLL_INTERVAL (writes d-pool-timeseries.jsonl)"
|
||||
|
||||
########################################
|
||||
# Experiment 1: 1P + 7D KVC kv-aware Option D + profile
|
||||
########################################
|
||||
log ""
|
||||
log "=== [EXP1] 1P7D KVC kv-aware Option D + profile ==="
|
||||
PYTHONPATH=src:third_party/sglang/python \
|
||||
$VENV_PYTHON -m agentic_pd_hybrid.cli benchmark-live \
|
||||
--trace $TRACE \
|
||||
--output-root $OUTPUT \
|
||||
--mechanism kvcache-centric \
|
||||
--policy kv-aware \
|
||||
--model-path $MODEL \
|
||||
--prefill-workers 1 --decode-workers 7 \
|
||||
--prefill-tp-size 1 --decode-tp-size 1 \
|
||||
--prefill-gpu-ids 0 --decode-gpu-ids 1,2,3,4,5,6,7 \
|
||||
--transfer-backend mooncake \
|
||||
--gpu-budget 8 \
|
||||
--time-scale 10 \
|
||||
--session-sample-rate 1.0 \
|
||||
--target-duration-s 100000 \
|
||||
--concurrency-limit 32 \
|
||||
--timeout-s 900 \
|
||||
--request-timeout-s 300 \
|
||||
--kvcache-admission-mode worker \
|
||||
--kvcache-seed-min-turn-id 1 \
|
||||
--kvcache-seed-max-inflight-decode -1 \
|
||||
--kvcache-prefill-backup-policy release-after-transfer \
|
||||
--kvcache-prefill-priority-eviction \
|
||||
--pool-poll-interval-s $POLL_INTERVAL
|
||||
|
||||
EXP1_DIR=$(ls -td $OUTPUT/kvcache-centric-*/ 2>/dev/null | head -1)
|
||||
save_result "exp1_1p7d_kvc_optD_profile" "$EXP1_DIR"
|
||||
|
||||
########################################
|
||||
# Experiment 2: 2P + 6D KVC kv-aware Option D + profile
|
||||
########################################
|
||||
log ""
|
||||
log "=== [EXP2] 2P6D KVC kv-aware Option D + profile ==="
|
||||
PYTHONPATH=src:third_party/sglang/python \
|
||||
$VENV_PYTHON -m agentic_pd_hybrid.cli benchmark-live \
|
||||
--trace $TRACE \
|
||||
--output-root $OUTPUT \
|
||||
--mechanism kvcache-centric \
|
||||
--policy kv-aware \
|
||||
--model-path $MODEL \
|
||||
--prefill-workers 2 --decode-workers 6 \
|
||||
--prefill-tp-size 1 --decode-tp-size 1 \
|
||||
--prefill-gpu-ids 0,1 --decode-gpu-ids 2,3,4,5,6,7 \
|
||||
--transfer-backend mooncake \
|
||||
--gpu-budget 8 \
|
||||
--time-scale 10 \
|
||||
--session-sample-rate 1.0 \
|
||||
--target-duration-s 100000 \
|
||||
--concurrency-limit 32 \
|
||||
--timeout-s 900 \
|
||||
--request-timeout-s 300 \
|
||||
--kvcache-admission-mode worker \
|
||||
--kvcache-seed-min-turn-id 1 \
|
||||
--kvcache-seed-max-inflight-decode -1 \
|
||||
--kvcache-prefill-backup-policy release-after-transfer \
|
||||
--kvcache-prefill-priority-eviction \
|
||||
--pool-poll-interval-s $POLL_INTERVAL
|
||||
|
||||
EXP2_DIR=$(ls -td $OUTPUT/kvcache-centric-*/ 2>/dev/null | head -1)
|
||||
save_result "exp2_2p6d_kvc_optD_profile" "$EXP2_DIR"
|
||||
|
||||
log ""
|
||||
log "=== ALL TP1 V5+PROFILE EXPERIMENTS DONE ==="
|
||||
129
scripts/sweep_tp1_v6_p1_profile.sh
Executable file
129
scripts/sweep_tp1_v6_p1_profile.sh
Executable file
@@ -0,0 +1,129 @@
|
||||
#!/bin/bash
|
||||
# v6 P1: re-run the v5 (Option D) config with the pool_breakdown instrument
|
||||
# (commit 4978c0d) so d-pool-timeseries.jsonl carries radix_protected /
|
||||
# slot_private / running_batch / {transfer,prealloc,retracted}_queue tokens.
|
||||
#
|
||||
# This is the same config as scripts/sweep_tp1_v5_optD_profile.sh but writes
|
||||
# to a separate output dir, leaving the pre-instrument v5+profile run intact
|
||||
# for before/after comparison.
|
||||
#
|
||||
# Output:
|
||||
# outputs/qwen3-30b-tp1-v6-p1-profile/
|
||||
# ├── kvcache-centric-kv-aware-worker-admission-<ts>/
|
||||
# │ ├── request-metrics.jsonl
|
||||
# │ ├── request-metrics.jsonl.summary.json
|
||||
# │ └── d-pool-timeseries.jsonl ← now with pool_breakdown fields
|
||||
# ├── exp{1,2}_*_metrics.jsonl
|
||||
# └── exp{1,2}_*_pool_timeseries.jsonl
|
||||
set -euo pipefail
|
||||
cd "$(dirname "$0")/.."
|
||||
|
||||
MODEL=/mnt/kzlin/workflow/pd-hybrid/simm-swe-bench/models/Qwen3-30B-A3B-Instruct-2507
|
||||
TRACE=outputs/qwen35-swebench-50sess.jsonl
|
||||
OUTPUT=outputs/qwen3-30b-tp1-v6-p1-profile
|
||||
VENV_PYTHON=.venv/bin/python
|
||||
RESULTS_FILE=$OUTPUT/sweep_results.txt
|
||||
POLL_INTERVAL=1.0
|
||||
|
||||
mkdir -p $OUTPUT
|
||||
|
||||
log() {
|
||||
echo "[$(date '+%Y-%m-%d %H:%M:%S')] $*" | tee -a $RESULTS_FILE
|
||||
}
|
||||
|
||||
save_result() {
|
||||
local label=$1
|
||||
local run_dir=$2
|
||||
log "=== $label COMPLETED ==="
|
||||
if [ -f "$run_dir/request-metrics.jsonl.summary.json" ]; then
|
||||
log "Summary:"
|
||||
cat "$run_dir/request-metrics.jsonl.summary.json" >> $RESULTS_FILE
|
||||
echo "" >> $RESULTS_FILE
|
||||
cp "$run_dir/request-metrics.jsonl.summary.json" "$OUTPUT/${label}_summary.json"
|
||||
cp "$run_dir/request-metrics.jsonl" "$OUTPUT/${label}_metrics.jsonl"
|
||||
if [ -f "$run_dir/d-pool-timeseries.jsonl" ]; then
|
||||
cp "$run_dir/d-pool-timeseries.jsonl" "$OUTPUT/${label}_pool_timeseries.jsonl"
|
||||
log "Pool timeseries: $(wc -l < $OUTPUT/${label}_pool_timeseries.jsonl) rows"
|
||||
else
|
||||
log "WARNING: no d-pool-timeseries.jsonl produced"
|
||||
fi
|
||||
log "Saved to $OUTPUT/${label}_summary.json + ${label}_metrics.jsonl + ${label}_pool_timeseries.jsonl"
|
||||
else
|
||||
log "WARNING: No summary file found in $run_dir"
|
||||
fi
|
||||
}
|
||||
|
||||
log "Starting v6 P1 sweep (v5 Option D config + ${POLL_INTERVAL}s pool polling + pool_breakdown)"
|
||||
log "Model: $MODEL"
|
||||
log "Trace: $TRACE (4449 requests, 52 sessions)"
|
||||
log "Goal: capture pool_breakdown fields (radix_protected / slot_private / running_batch / queues)"
|
||||
log " to decompose 'other' on the v5 baseline workload"
|
||||
|
||||
########################################
|
||||
# Experiment 1: 1P + 7D KVC kv-aware Option D + profile
|
||||
########################################
|
||||
log ""
|
||||
log "=== [EXP1] 1P7D KVC kv-aware Option D + profile ==="
|
||||
PYTHONPATH=src:third_party/sglang/python \
|
||||
$VENV_PYTHON -m agentic_pd_hybrid.cli benchmark-live \
|
||||
--trace $TRACE \
|
||||
--output-root $OUTPUT \
|
||||
--mechanism kvcache-centric \
|
||||
--policy kv-aware \
|
||||
--model-path $MODEL \
|
||||
--prefill-workers 1 --decode-workers 7 \
|
||||
--prefill-tp-size 1 --decode-tp-size 1 \
|
||||
--prefill-gpu-ids 0 --decode-gpu-ids 1,2,3,4,5,6,7 \
|
||||
--transfer-backend mooncake \
|
||||
--gpu-budget 8 \
|
||||
--time-scale 10 \
|
||||
--session-sample-rate 1.0 \
|
||||
--target-duration-s 100000 \
|
||||
--concurrency-limit 32 \
|
||||
--timeout-s 900 \
|
||||
--request-timeout-s 300 \
|
||||
--kvcache-admission-mode worker \
|
||||
--kvcache-seed-min-turn-id 1 \
|
||||
--kvcache-seed-max-inflight-decode -1 \
|
||||
--kvcache-prefill-backup-policy release-after-transfer \
|
||||
--kvcache-prefill-priority-eviction \
|
||||
--pool-poll-interval-s $POLL_INTERVAL
|
||||
|
||||
EXP1_DIR=$(ls -td $OUTPUT/kvcache-centric-*/ 2>/dev/null | head -1)
|
||||
save_result "exp1_1p7d_kvc_v6_p1" "$EXP1_DIR"
|
||||
|
||||
########################################
|
||||
# Experiment 2: 2P + 6D KVC kv-aware Option D + profile
|
||||
########################################
|
||||
log ""
|
||||
log "=== [EXP2] 2P6D KVC kv-aware Option D + profile ==="
|
||||
PYTHONPATH=src:third_party/sglang/python \
|
||||
$VENV_PYTHON -m agentic_pd_hybrid.cli benchmark-live \
|
||||
--trace $TRACE \
|
||||
--output-root $OUTPUT \
|
||||
--mechanism kvcache-centric \
|
||||
--policy kv-aware \
|
||||
--model-path $MODEL \
|
||||
--prefill-workers 2 --decode-workers 6 \
|
||||
--prefill-tp-size 1 --decode-tp-size 1 \
|
||||
--prefill-gpu-ids 0,1 --decode-gpu-ids 2,3,4,5,6,7 \
|
||||
--transfer-backend mooncake \
|
||||
--gpu-budget 8 \
|
||||
--time-scale 10 \
|
||||
--session-sample-rate 1.0 \
|
||||
--target-duration-s 100000 \
|
||||
--concurrency-limit 32 \
|
||||
--timeout-s 900 \
|
||||
--request-timeout-s 300 \
|
||||
--kvcache-admission-mode worker \
|
||||
--kvcache-seed-min-turn-id 1 \
|
||||
--kvcache-seed-max-inflight-decode -1 \
|
||||
--kvcache-prefill-backup-policy release-after-transfer \
|
||||
--kvcache-prefill-priority-eviction \
|
||||
--pool-poll-interval-s $POLL_INTERVAL
|
||||
|
||||
EXP2_DIR=$(ls -td $OUTPUT/kvcache-centric-*/ 2>/dev/null | head -1)
|
||||
save_result "exp2_2p6d_kvc_v6_p1" "$EXP2_DIR"
|
||||
|
||||
log ""
|
||||
log "=== ALL v6 P1 EXPERIMENTS DONE ==="
|
||||
146
scripts/sweep_ts1_kvc_n3_plus_dp.sh
Executable file
146
scripts/sweep_ts1_kvc_n3_plus_dp.sh
Executable file
@@ -0,0 +1,146 @@
|
||||
#!/bin/bash
|
||||
# Time-scale=1 validation sweep, downscaled to 4 GPUs:
|
||||
# - KVC v5 1P3D × N=3 (new data, validates §1/§2 structural claims at real timing)
|
||||
# - 4-way DP cache-aware × 1 (sanity baseline at same scale + ts=1)
|
||||
#
|
||||
# Goal: per docs/AGENTIC_FIT_ANALYSIS_ZH.md §7 / TEAM_REPORT §2.6 — all v3-v6 KVC
|
||||
# data was at time-scale=10 (inter-turn gap p50 = 0.25s, vs real 2.5s). This run
|
||||
# tests whether the gap structurally reverses any conclusion.
|
||||
#
|
||||
# CONFIG NOTE: Original experiments used 8 GPUs (2P6D / 8-way DP). This host has
|
||||
# only 4 H100s available, so we downscale proportionally to 1P3D / 4-way DP.
|
||||
# Cross-compare against existing 2P6D ts=10 data is confounded by *both*
|
||||
# time-scale and capacity. Internal comparison (1P3D KVC vs 4DP) at ts=1 is the
|
||||
# clean signal. §5 (P-side imbalance) is NOT testable here — only 1 P.
|
||||
#
|
||||
# Capacity ratio: 3D × ~92K tok = 276K KV pool vs 52 sessions × ~50K peak input
|
||||
# working set ≈ 1.5M → ~5.4× overload (vs 2.7× in original 2P6D).
|
||||
# Pressure is HIGHER than original; partly offset by ts=1 letting D drain between turns.
|
||||
#
|
||||
# Output:
|
||||
# outputs/qwen3-30b-tp1-ts1-validation/
|
||||
# ├── kvc_1p3d_run{1,2,3}_summary.json
|
||||
# ├── kvc_1p3d_run{1,2,3}_metrics.jsonl
|
||||
# ├── dp4_summary.json
|
||||
# ├── dp4_metrics.jsonl
|
||||
# └── kvcache-centric-... / pd-colo-kv-aware-... (raw run dirs)
|
||||
#
|
||||
# Estimated GPU time: KVC ts=1 ≈ 100-180 min/run × 3 = 5-9h
|
||||
# DP ts=1 ≈ 100-120 min × 1 = ~2h
|
||||
# Total = 7-11h
|
||||
set -euo pipefail
|
||||
cd "$(dirname "$0")/.."
|
||||
|
||||
MODEL=/mnt/kzlin/workflow/pd-hybrid/simm-swe-bench/models/Qwen3-30B-A3B-Instruct-2507
|
||||
TRACE=outputs/qwen35-swebench-50sess.jsonl
|
||||
OUTPUT=outputs/qwen3-30b-tp1-ts1-validation
|
||||
VENV_PYTHON=.venv/bin/python
|
||||
RESULTS_FILE=$OUTPUT/sweep_results.txt
|
||||
|
||||
mkdir -p $OUTPUT
|
||||
|
||||
log() {
|
||||
echo "[$(date '+%Y-%m-%d %H:%M:%S')] $*" | tee -a $RESULTS_FILE
|
||||
}
|
||||
|
||||
run_kvc_1p3d() {
|
||||
local run_idx=$1
|
||||
local label="kvc_1p3d_run${run_idx}"
|
||||
log ""
|
||||
log "=== [KVC ${run_idx}/3] 1P3D KVC kv-aware Option D, time-scale=1 ==="
|
||||
PYTHONPATH=src:third_party/sglang/python \
|
||||
$VENV_PYTHON -m agentic_pd_hybrid.cli benchmark-live \
|
||||
--trace $TRACE \
|
||||
--output-root $OUTPUT \
|
||||
--mechanism kvcache-centric \
|
||||
--policy kv-aware \
|
||||
--model-path $MODEL \
|
||||
--prefill-workers 1 --decode-workers 3 \
|
||||
--prefill-tp-size 1 --decode-tp-size 1 \
|
||||
--prefill-gpu-ids 0 --decode-gpu-ids 1,2,3 \
|
||||
--transfer-backend mooncake \
|
||||
--gpu-budget 4 \
|
||||
--time-scale 1 \
|
||||
--session-sample-rate 1.0 \
|
||||
--target-duration-s 100000 \
|
||||
--concurrency-limit 32 \
|
||||
--timeout-s 900 \
|
||||
--request-timeout-s 300 \
|
||||
--kvcache-admission-mode worker \
|
||||
--kvcache-seed-min-turn-id 1 \
|
||||
--kvcache-seed-max-inflight-decode -1 \
|
||||
--kvcache-prefill-backup-policy release-after-transfer \
|
||||
--kvcache-prefill-priority-eviction
|
||||
|
||||
local run_dir=$(ls -td $OUTPUT/kvcache-centric-*/ 2>/dev/null | head -1)
|
||||
log "=== [KVC ${run_idx}/3] $label COMPLETED ==="
|
||||
if [ -f "$run_dir/request-metrics.jsonl.summary.json" ]; then
|
||||
cp "$run_dir/request-metrics.jsonl.summary.json" "$OUTPUT/${label}_summary.json"
|
||||
cp "$run_dir/request-metrics.jsonl" "$OUTPUT/${label}_metrics.jsonl"
|
||||
local errs=$($VENV_PYTHON -c "import json; d=json.load(open('$OUTPUT/${label}_summary.json')); print(d.get('error_count',0))")
|
||||
log " errors = $errs"
|
||||
cat "$run_dir/request-metrics.jsonl.summary.json" >> $RESULTS_FILE
|
||||
echo "" >> $RESULTS_FILE
|
||||
else
|
||||
log "WARNING: no summary file in $run_dir"
|
||||
fi
|
||||
}
|
||||
|
||||
run_dp4_sanity() {
|
||||
local label="dp4"
|
||||
log ""
|
||||
log "=== [DP] 4-way DP cache-aware sanity, time-scale=1 ==="
|
||||
PYTHONPATH=src:third_party/sglang/python \
|
||||
$VENV_PYTHON -m agentic_pd_hybrid.cli benchmark-live \
|
||||
--trace $TRACE \
|
||||
--output-root $OUTPUT \
|
||||
--mechanism pd-colo \
|
||||
--policy kv-aware \
|
||||
--model-path $MODEL \
|
||||
--prefill-workers 0 --decode-workers 0 \
|
||||
--direct-workers 4 --direct-tp-size 1 \
|
||||
--direct-gpu-ids 0,1,2,3 \
|
||||
--gpu-budget 4 \
|
||||
--time-scale 1 \
|
||||
--session-sample-rate 1.0 \
|
||||
--target-duration-s 100000 \
|
||||
--concurrency-limit 32 \
|
||||
--timeout-s 900 \
|
||||
--request-timeout-s 300
|
||||
|
||||
local run_dir=$(ls -td $OUTPUT/pd-colo-kv-aware-*/ 2>/dev/null | head -1)
|
||||
log "=== [DP] $label COMPLETED ==="
|
||||
if [ -f "$run_dir/request-metrics.jsonl.summary.json" ]; then
|
||||
cp "$run_dir/request-metrics.jsonl.summary.json" "$OUTPUT/${label}_summary.json"
|
||||
cp "$run_dir/request-metrics.jsonl" "$OUTPUT/${label}_metrics.jsonl"
|
||||
local errs=$($VENV_PYTHON -c "import json; d=json.load(open('$OUTPUT/${label}_summary.json')); print(d.get('error_count',0))")
|
||||
log " errors = $errs"
|
||||
cat "$run_dir/request-metrics.jsonl.summary.json" >> $RESULTS_FILE
|
||||
echo "" >> $RESULTS_FILE
|
||||
else
|
||||
log "WARNING: no summary file in $run_dir"
|
||||
fi
|
||||
}
|
||||
|
||||
log "=== TS=1 VALIDATION (4-GPU): KVC 1P3D × N=3 + 4DP × 1 ==="
|
||||
log "Model: $MODEL"
|
||||
log "Trace: $TRACE (4449 requests, 52 sessions)"
|
||||
log "Goal: validate whether ts=10 was the main distortion in v3-v6 KVC vs DP"
|
||||
|
||||
# KVC × 3 first (the new data we need); DP last (cheaper sanity at end)
|
||||
for i in 1 2 3; do
|
||||
run_kvc_1p3d $i
|
||||
done
|
||||
|
||||
run_dp4_sanity
|
||||
|
||||
log ""
|
||||
log "=== TS=1 SUMMARY ==="
|
||||
for label in kvc_1p3d_run1 kvc_1p3d_run2 kvc_1p3d_run3 dp4; do
|
||||
if [ -f "$OUTPUT/${label}_summary.json" ]; then
|
||||
e=$($VENV_PYTHON -c "import json; d=json.load(open('$OUTPUT/${label}_summary.json')); print(d.get('error_count',0))")
|
||||
p50=$($VENV_PYTHON -c "import json; d=json.load(open('$OUTPUT/${label}_summary.json')); print(d.get('latency_stats_s',{}).get('p50','n/a'))")
|
||||
log " ${label}: errors=$e lat_p50=${p50}s"
|
||||
fi
|
||||
done
|
||||
log "=== TS=1 ALL DONE ==="
|
||||
65
scripts/sweep_ts1_migration_v1.sh
Executable file
65
scripts/sweep_ts1_migration_v1.sh
Executable file
@@ -0,0 +1,65 @@
|
||||
#!/bin/bash
|
||||
# Migration v1 validation: KVC 1P3D ts=1 with --kvcache-migration-reject-threshold=3
|
||||
# Compare against baseline outputs/qwen3-30b-tp1-ts1-validation/kvc_1p3d_run{1,2,3}
|
||||
# (all of which had no migration — runs were structurally identical).
|
||||
#
|
||||
# Goal: verify §1 fix changes the categorical outcome — direct-to-D % up,
|
||||
# fallback-session-not-resident % down, lat mean down.
|
||||
#
|
||||
# ts=1 is deterministic at the categorical level, so N=1 is sufficient
|
||||
# (TEAM_REPORT §2.8 revised).
|
||||
set -euo pipefail
|
||||
cd "$(dirname "$0")/.."
|
||||
|
||||
MODEL=/mnt/kzlin/workflow/pd-hybrid/simm-swe-bench/models/Qwen3-30B-A3B-Instruct-2507
|
||||
TRACE=outputs/qwen35-swebench-50sess.jsonl
|
||||
OUTPUT=outputs/qwen3-30b-tp1-ts1-migration-v1
|
||||
VENV_PYTHON=.venv/bin/python
|
||||
RESULTS_FILE=$OUTPUT/sweep_results.txt
|
||||
|
||||
mkdir -p $OUTPUT
|
||||
|
||||
log() { echo "[$(date '+%Y-%m-%d %H:%M:%S')] $*" | tee -a $RESULTS_FILE; }
|
||||
|
||||
log "=== TS=1 MIGRATION v1: KVC 1P3D --kvcache-migration-reject-threshold=3 ==="
|
||||
log "Baseline reference: outputs/qwen3-30b-tp1-ts1-validation/kvc_1p3d_run1 (errors=5, lat mean=1.574s, direct-to-D=42.8%)"
|
||||
|
||||
label=kvc_1p3d_migration_run1
|
||||
log ""
|
||||
log "=== [migration v1] starting ==="
|
||||
PYTHONPATH=src:third_party/sglang/python \
|
||||
$VENV_PYTHON -m agentic_pd_hybrid.cli benchmark-live \
|
||||
--trace $TRACE \
|
||||
--output-root $OUTPUT \
|
||||
--mechanism kvcache-centric \
|
||||
--policy kv-aware \
|
||||
--model-path $MODEL \
|
||||
--prefill-workers 1 --decode-workers 3 \
|
||||
--prefill-tp-size 1 --decode-tp-size 1 \
|
||||
--prefill-gpu-ids 0 --decode-gpu-ids 1,2,3 \
|
||||
--transfer-backend mooncake \
|
||||
--gpu-budget 4 \
|
||||
--time-scale 1 \
|
||||
--session-sample-rate 1.0 \
|
||||
--target-duration-s 100000 \
|
||||
--concurrency-limit 32 \
|
||||
--timeout-s 900 \
|
||||
--request-timeout-s 300 \
|
||||
--kvcache-admission-mode worker \
|
||||
--kvcache-seed-min-turn-id 1 \
|
||||
--kvcache-seed-max-inflight-decode -1 \
|
||||
--kvcache-prefill-backup-policy release-after-transfer \
|
||||
--kvcache-prefill-priority-eviction \
|
||||
--kvcache-migration-reject-threshold 3
|
||||
|
||||
run_dir=$(ls -td $OUTPUT/kvcache-centric-*/ 2>/dev/null | head -1)
|
||||
log "=== [migration v1] $label COMPLETED ==="
|
||||
if [ -f "$run_dir/request-metrics.jsonl.summary.json" ]; then
|
||||
cp "$run_dir/request-metrics.jsonl.summary.json" "$OUTPUT/${label}_summary.json"
|
||||
cp "$run_dir/request-metrics.jsonl" "$OUTPUT/${label}_metrics.jsonl"
|
||||
errs=$($VENV_PYTHON -c "import json; d=json.load(open('$OUTPUT/${label}_summary.json')); print(d.get('error_count',0))")
|
||||
p50=$($VENV_PYTHON -c "import json; d=json.load(open('$OUTPUT/${label}_summary.json')); print(d.get('latency_stats_s',{}).get('p50',0))")
|
||||
log " errors=$errs lat_p50=${p50}s"
|
||||
cat "$run_dir/request-metrics.jsonl.summary.json" >> $RESULTS_FILE
|
||||
fi
|
||||
log "=== migration v1 DONE ==="
|
||||
76
scripts/sweep_ts1_migration_v2.sh
Executable file
76
scripts/sweep_ts1_migration_v2.sh
Executable file
@@ -0,0 +1,76 @@
|
||||
#!/bin/bash
|
||||
# Migration v2 validation: KVC 1P3D ts=1 with BOTH:
|
||||
# (1) reset-on-success blacklist decay (replay.py code change)
|
||||
# (2) --kvcache-direct-max-uncached-tokens 8192 (was 2048 default)
|
||||
#
|
||||
# v1 results (kvc_1p3d_migration_run1) showed:
|
||||
# - lat mean WORSE +11.7%, TTFT mean WORSE +71.3% — thrashing tax
|
||||
# - direct-to-D rate UP +10.5pp (42.8 → 53.3%)
|
||||
# - Fallback breakdown surprise: 41.3% are 'real-large-append' (>2048 tok),
|
||||
# NOT 'session-not-resident' as we hypothesized
|
||||
#
|
||||
# v2 design (REFACTOR_PLAN_V1 + MIGRATION_V1_FINDINGS):
|
||||
# (1) reset-on-success: clear (sess,D) reject counter on successful direct-to-D
|
||||
# — eliminates blacklist-permanence bug → kills thrashing
|
||||
# (2) bump direct-append threshold 2048 → 8192: lets more large-append turns
|
||||
# go direct-to-D instead of fall through to seed (which often rejects)
|
||||
set -euo pipefail
|
||||
cd "$(dirname "$0")/.."
|
||||
|
||||
MODEL=/mnt/kzlin/workflow/pd-hybrid/simm-swe-bench/models/Qwen3-30B-A3B-Instruct-2507
|
||||
TRACE=outputs/qwen35-swebench-50sess.jsonl
|
||||
OUTPUT=outputs/qwen3-30b-tp1-ts1-migration-v2
|
||||
VENV_PYTHON=.venv/bin/python
|
||||
RESULTS_FILE=$OUTPUT/sweep_results.txt
|
||||
|
||||
mkdir -p $OUTPUT
|
||||
|
||||
log() { echo "[$(date '+%Y-%m-%d %H:%M:%S')] $*" | tee -a $RESULTS_FILE; }
|
||||
|
||||
log "=== TS=1 MIGRATION v2: reset-on-success + threshold=8192 ==="
|
||||
log "Baselines:"
|
||||
log " baseline (no migration): kvc_1p3d_run1 errors=5 lat_p50=0.811s ttft_p50=0.124s direct=42.8%"
|
||||
log " v1 (migration permanent): kvc_1p3d_migration_run1 errors=6 lat_p50=0.773s ttft_p50=0.057s direct=53.3% lat_mean=1.758s"
|
||||
log " 4DP ts=1: errors=0 lat_p50=0.659s ttft_p50=0.090s lat_mean=1.443s"
|
||||
log "Goal: kill thrashing tax (lat_mean ≤ 1.5s, p99 ≤ 9s) while preserving v1's direct-to-D gains."
|
||||
|
||||
label=kvc_1p3d_migration_v2_run1
|
||||
log ""
|
||||
log "=== [migration v2] starting ==="
|
||||
PYTHONPATH=src:third_party/sglang/python \
|
||||
$VENV_PYTHON -m agentic_pd_hybrid.cli benchmark-live \
|
||||
--trace $TRACE \
|
||||
--output-root $OUTPUT \
|
||||
--mechanism kvcache-centric \
|
||||
--policy kv-aware \
|
||||
--model-path $MODEL \
|
||||
--prefill-workers 1 --decode-workers 3 \
|
||||
--prefill-tp-size 1 --decode-tp-size 1 \
|
||||
--prefill-gpu-ids 0 --decode-gpu-ids 1,2,3 \
|
||||
--transfer-backend mooncake \
|
||||
--gpu-budget 4 \
|
||||
--time-scale 1 \
|
||||
--session-sample-rate 1.0 \
|
||||
--target-duration-s 100000 \
|
||||
--concurrency-limit 32 \
|
||||
--timeout-s 900 \
|
||||
--request-timeout-s 300 \
|
||||
--kvcache-admission-mode worker \
|
||||
--kvcache-seed-min-turn-id 1 \
|
||||
--kvcache-seed-max-inflight-decode -1 \
|
||||
--kvcache-prefill-backup-policy release-after-transfer \
|
||||
--kvcache-prefill-priority-eviction \
|
||||
--kvcache-migration-reject-threshold 3 \
|
||||
--kvcache-direct-max-uncached-tokens 8192
|
||||
|
||||
run_dir=$(ls -td $OUTPUT/kvcache-centric-*/ 2>/dev/null | head -1)
|
||||
log "=== [migration v2] $label COMPLETED ==="
|
||||
if [ -f "$run_dir/request-metrics.jsonl.summary.json" ]; then
|
||||
cp "$run_dir/request-metrics.jsonl.summary.json" "$OUTPUT/${label}_summary.json"
|
||||
cp "$run_dir/request-metrics.jsonl" "$OUTPUT/${label}_metrics.jsonl"
|
||||
errs=$($VENV_PYTHON -c "import json; d=json.load(open('$OUTPUT/${label}_summary.json')); print(d.get('error_count',0))")
|
||||
p50=$($VENV_PYTHON -c "import json; d=json.load(open('$OUTPUT/${label}_summary.json')); print(d.get('latency_stats_s',{}).get('p50',0))")
|
||||
log " errors=$errs lat_p50=${p50}s"
|
||||
cat "$run_dir/request-metrics.jsonl.summary.json" >> $RESULTS_FILE
|
||||
fi
|
||||
log "=== migration v2 DONE ==="
|
||||
@@ -43,6 +43,12 @@ class BenchmarkConfig:
|
||||
kvcache_prefill_priority_eviction: bool = False
|
||||
kvcache_prefill_direct_priority: int = -100
|
||||
kvcache_prefill_normal_priority: int = 100
|
||||
pool_poll_interval_s: float = 0.0
|
||||
pool_poll_include_sessions: bool = True
|
||||
enable_backpressure: bool = False
|
||||
backpressure_max_pause_s: float = 2.0
|
||||
kvcache_migration_reject_threshold: int = 3
|
||||
kvcache_load_floor_bonus: int = 0
|
||||
sample_profile: str = "default"
|
||||
min_initial_input_tokens: int | None = None
|
||||
max_initial_input_tokens: int | None = None
|
||||
@@ -119,6 +125,8 @@ def run_live_benchmark(config: BenchmarkConfig) -> BenchmarkArtifacts:
|
||||
try:
|
||||
signal.signal(signal.SIGINT, _handle_termination)
|
||||
signal.signal(signal.SIGTERM, _handle_termination)
|
||||
_mechanisms_with_router = {"pd-disaggregation", "kvcache-centric", "pd-colo"}
|
||||
_naive_dp = config.mechanism_name == "pd-colo"
|
||||
if config.launch_stack:
|
||||
stack = launch_pd_stack(
|
||||
topology=topology,
|
||||
@@ -132,18 +140,19 @@ def run_live_benchmark(config: BenchmarkConfig) -> BenchmarkArtifacts:
|
||||
else config.timeout_s
|
||||
),
|
||||
include_router=(
|
||||
config.mechanism_name in {"pd-disaggregation", "kvcache-centric"}
|
||||
config.mechanism_name in _mechanisms_with_router
|
||||
),
|
||||
naive_dp=_naive_dp,
|
||||
)
|
||||
router_url = (
|
||||
stack.router_url
|
||||
if config.mechanism_name in {"pd-disaggregation", "kvcache-centric"}
|
||||
if config.mechanism_name in _mechanisms_with_router
|
||||
else None
|
||||
)
|
||||
else:
|
||||
router_url = (
|
||||
topology.router_url
|
||||
if config.mechanism_name in {"pd-disaggregation", "kvcache-centric"}
|
||||
if config.mechanism_name in _mechanisms_with_router
|
||||
else None
|
||||
)
|
||||
|
||||
@@ -187,6 +196,12 @@ def run_live_benchmark(config: BenchmarkConfig) -> BenchmarkArtifacts:
|
||||
),
|
||||
kvcache_prefill_direct_priority=config.kvcache_prefill_direct_priority,
|
||||
kvcache_prefill_normal_priority=config.kvcache_prefill_normal_priority,
|
||||
pool_poll_interval_s=config.pool_poll_interval_s,
|
||||
pool_poll_include_sessions=config.pool_poll_include_sessions,
|
||||
enable_backpressure=config.enable_backpressure,
|
||||
backpressure_max_pause_s=config.backpressure_max_pause_s,
|
||||
kvcache_migration_reject_threshold=config.kvcache_migration_reject_threshold,
|
||||
kvcache_load_floor_bonus=config.kvcache_load_floor_bonus,
|
||||
)
|
||||
if config.request_timeout_s is not None:
|
||||
replay_config = replace(
|
||||
@@ -243,6 +258,12 @@ def run_live_benchmark(config: BenchmarkConfig) -> BenchmarkArtifacts:
|
||||
"kvcache_prefill_normal_priority": (
|
||||
config.kvcache_prefill_normal_priority
|
||||
),
|
||||
"pool_poll_interval_s": config.pool_poll_interval_s,
|
||||
"pool_poll_include_sessions": config.pool_poll_include_sessions,
|
||||
"enable_backpressure": config.enable_backpressure,
|
||||
"backpressure_max_pause_s": config.backpressure_max_pause_s,
|
||||
"kvcache_migration_reject_threshold": config.kvcache_migration_reject_threshold,
|
||||
"kvcache_load_floor_bonus": config.kvcache_load_floor_bonus,
|
||||
"sample_profile": config.sample_profile,
|
||||
"min_initial_input_tokens": config.min_initial_input_tokens,
|
||||
"max_initial_input_tokens": config.max_initial_input_tokens,
|
||||
|
||||
@@ -228,6 +228,61 @@ def main() -> None:
|
||||
)
|
||||
replay.add_argument("--kvcache-prefill-direct-priority", type=int, default=-100)
|
||||
replay.add_argument("--kvcache-prefill-normal-priority", type=int, default=100)
|
||||
replay.add_argument(
|
||||
"--pool-poll-interval-s",
|
||||
type=float,
|
||||
default=0.0,
|
||||
help=(
|
||||
"Poll each P/D worker's /server_info every N seconds and write a "
|
||||
"time-series snapshot to <run_dir>/d-pool-timeseries.jsonl. "
|
||||
"0 disables polling."
|
||||
),
|
||||
)
|
||||
replay.add_argument(
|
||||
"--pool-poll-no-sessions",
|
||||
action="store_true",
|
||||
help=(
|
||||
"Disable per-session detail in the pool timeseries (smaller files)."
|
||||
),
|
||||
)
|
||||
replay.add_argument(
|
||||
"--enable-backpressure",
|
||||
action="store_true",
|
||||
help=(
|
||||
"Honor recommended_pause_ms hints from D's admission endpoint. "
|
||||
"When set, replay sleeps before issuing requests to a saturated D. "
|
||||
"Default off — preserves baseline behavior."
|
||||
),
|
||||
)
|
||||
replay.add_argument(
|
||||
"--backpressure-max-pause-s",
|
||||
type=float,
|
||||
default=2.0,
|
||||
help="Cap on per-request backpressure sleep, regardless of D hint.",
|
||||
)
|
||||
replay.add_argument(
|
||||
"--kvcache-migration-reject-threshold",
|
||||
type=int,
|
||||
default=3,
|
||||
help=(
|
||||
"Per-(session, D) admission-reject count after which KvAwarePolicy "
|
||||
"skips that D for the session (forces migration). 0 disables. "
|
||||
"See REFACTOR_PLAN_V1 §6.2 / TEAM_REPORT §2.1."
|
||||
),
|
||||
)
|
||||
replay.add_argument(
|
||||
"--kvcache-load-floor-bonus",
|
||||
type=int,
|
||||
default=0,
|
||||
help=(
|
||||
"Graduated bonus added to lex-score position 0 for under-loaded D "
|
||||
"workers (gated on not-sticky so turn-1+ requests still stick). "
|
||||
"Magnitude scales as K * max(0, mean - assigned[D]) / mean. "
|
||||
"Set above max expected cross-session boilerplate overlap "
|
||||
"(Inferact ~50 → use 200). 0 disables. "
|
||||
"See docs/E1_E2_FIX_DESIGN_ZH.md §Q2."
|
||||
),
|
||||
)
|
||||
|
||||
sample = subparsers.add_parser(
|
||||
"sample-sessions",
|
||||
@@ -439,6 +494,59 @@ def main() -> None:
|
||||
)
|
||||
benchmark.add_argument("--kvcache-prefill-direct-priority", type=int, default=-100)
|
||||
benchmark.add_argument("--kvcache-prefill-normal-priority", type=int, default=100)
|
||||
benchmark.add_argument(
|
||||
"--pool-poll-interval-s",
|
||||
type=float,
|
||||
default=0.0,
|
||||
help=(
|
||||
"Poll each P/D worker's /server_info every N seconds and write a "
|
||||
"time-series snapshot to <run_dir>/d-pool-timeseries.jsonl. "
|
||||
"0 disables polling."
|
||||
),
|
||||
)
|
||||
benchmark.add_argument(
|
||||
"--pool-poll-no-sessions",
|
||||
action="store_true",
|
||||
help=(
|
||||
"Disable per-session detail in the pool timeseries (smaller files)."
|
||||
),
|
||||
)
|
||||
benchmark.add_argument(
|
||||
"--enable-backpressure",
|
||||
action="store_true",
|
||||
help=(
|
||||
"Honor recommended_pause_ms hints from D's admission endpoint."
|
||||
),
|
||||
)
|
||||
benchmark.add_argument(
|
||||
"--backpressure-max-pause-s",
|
||||
type=float,
|
||||
default=2.0,
|
||||
help="Cap on per-request backpressure sleep, regardless of D hint.",
|
||||
)
|
||||
benchmark.add_argument(
|
||||
"--kvcache-migration-reject-threshold",
|
||||
type=int,
|
||||
default=3,
|
||||
help=(
|
||||
"Per-(session, D) admission-reject count after which KvAwarePolicy "
|
||||
"skips that D for the session (forces migration). 0 disables. "
|
||||
"See REFACTOR_PLAN_V1 §6.2 / TEAM_REPORT §2.1."
|
||||
),
|
||||
)
|
||||
benchmark.add_argument(
|
||||
"--kvcache-load-floor-bonus",
|
||||
type=int,
|
||||
default=0,
|
||||
help=(
|
||||
"Graduated bonus added to lex-score position 0 for under-loaded D "
|
||||
"workers (gated on not-sticky so turn-1+ requests still stick). "
|
||||
"Magnitude scales as K * max(0, mean - assigned[D]) / mean. "
|
||||
"Set above max expected cross-session boilerplate overlap "
|
||||
"(Inferact ~50 → use 200). 0 disables. "
|
||||
"See docs/E1_E2_FIX_DESIGN_ZH.md §Q2."
|
||||
),
|
||||
)
|
||||
benchmark.add_argument(
|
||||
"--sample-profile",
|
||||
choices=["default", "small-append"],
|
||||
@@ -455,11 +563,18 @@ def main() -> None:
|
||||
|
||||
if args.command == "print-launch":
|
||||
topology = _topology_from_args(args)
|
||||
has_pd = bool(topology.prefill_workers and topology.decode_workers)
|
||||
has_direct_only = bool(
|
||||
topology.direct_workers
|
||||
and not topology.prefill_workers
|
||||
and not topology.decode_workers
|
||||
)
|
||||
plan = build_launch_plan(
|
||||
topology,
|
||||
prefill_policy=args.prefill_policy,
|
||||
decode_policy=args.decode_policy,
|
||||
include_router=bool(topology.prefill_workers and topology.decode_workers),
|
||||
include_router=has_pd or has_direct_only,
|
||||
naive_dp=has_direct_only,
|
||||
)
|
||||
print(plan.render())
|
||||
return
|
||||
@@ -513,6 +628,12 @@ def main() -> None:
|
||||
),
|
||||
kvcache_prefill_direct_priority=args.kvcache_prefill_direct_priority,
|
||||
kvcache_prefill_normal_priority=args.kvcache_prefill_normal_priority,
|
||||
pool_poll_interval_s=args.pool_poll_interval_s,
|
||||
pool_poll_include_sessions=not args.pool_poll_no_sessions,
|
||||
enable_backpressure=args.enable_backpressure,
|
||||
backpressure_max_pause_s=args.backpressure_max_pause_s,
|
||||
kvcache_migration_reject_threshold=args.kvcache_migration_reject_threshold,
|
||||
kvcache_load_floor_bonus=args.kvcache_load_floor_bonus,
|
||||
)
|
||||
results = asyncio.run(replay_trace(config))
|
||||
print(
|
||||
@@ -655,6 +776,12 @@ def main() -> None:
|
||||
kvcache_prefill_normal_priority=(
|
||||
args.kvcache_prefill_normal_priority
|
||||
),
|
||||
pool_poll_interval_s=args.pool_poll_interval_s,
|
||||
pool_poll_include_sessions=not args.pool_poll_no_sessions,
|
||||
enable_backpressure=args.enable_backpressure,
|
||||
backpressure_max_pause_s=args.backpressure_max_pause_s,
|
||||
kvcache_migration_reject_threshold=args.kvcache_migration_reject_threshold,
|
||||
kvcache_load_floor_bonus=args.kvcache_load_floor_bonus,
|
||||
sample_profile=args.sample_profile,
|
||||
min_initial_input_tokens=args.min_initial_input_tokens,
|
||||
max_initial_input_tokens=args.max_initial_input_tokens,
|
||||
@@ -749,6 +876,8 @@ def _topology_from_args(args: argparse.Namespace):
|
||||
force_rdma=args.force_rdma,
|
||||
trust_remote_code=not args.no_trust_remote_code,
|
||||
ib_device=args.ib_device,
|
||||
prefill_extra_server_args=("--disable-overlap-schedule",),
|
||||
decode_extra_server_args=("--disable-overlap-schedule",),
|
||||
direct_extra_server_args=("--enable-streaming-session",),
|
||||
)
|
||||
|
||||
|
||||
@@ -34,7 +34,24 @@ def build_launch_plan(
|
||||
decode_policy: str = "manual",
|
||||
include_router: bool = True,
|
||||
router_request_timeout_s: float | None = None,
|
||||
naive_dp: bool = False,
|
||||
) -> LaunchPlan:
|
||||
router_command: tuple[str, ...] | None = None
|
||||
if include_router:
|
||||
if topology.prefill_workers and topology.decode_workers:
|
||||
router_command = _build_router_command(
|
||||
topology,
|
||||
prefill_policy=prefill_policy,
|
||||
decode_policy=decode_policy,
|
||||
request_timeout_s=router_request_timeout_s,
|
||||
)
|
||||
elif naive_dp and topology.direct_workers:
|
||||
router_command = _build_dp_router_command(
|
||||
topology,
|
||||
backend_policy=decode_policy,
|
||||
request_timeout_s=router_request_timeout_s,
|
||||
)
|
||||
|
||||
return LaunchPlan(
|
||||
prefill_commands=tuple(
|
||||
_build_server_command(topology, worker) for worker in topology.prefill_workers
|
||||
@@ -43,24 +60,17 @@ def build_launch_plan(
|
||||
_build_server_command(topology, worker) for worker in topology.decode_workers
|
||||
),
|
||||
direct_commands=tuple(
|
||||
_build_server_command(topology, worker) for worker in topology.direct_workers
|
||||
),
|
||||
router_command=(
|
||||
_build_router_command(
|
||||
topology,
|
||||
prefill_policy=prefill_policy,
|
||||
decode_policy=decode_policy,
|
||||
request_timeout_s=router_request_timeout_s,
|
||||
)
|
||||
if include_router and topology.prefill_workers and topology.decode_workers
|
||||
else None
|
||||
_build_server_command(topology, worker, naive_dp=naive_dp)
|
||||
for worker in topology.direct_workers
|
||||
),
|
||||
router_command=router_command,
|
||||
)
|
||||
|
||||
|
||||
def _build_server_command(
|
||||
topology: SingleNodeTopology,
|
||||
worker: WorkerSpec,
|
||||
naive_dp: bool = False,
|
||||
) -> tuple[str, ...]:
|
||||
command = [
|
||||
sys.executable,
|
||||
@@ -76,11 +86,15 @@ def _build_server_command(
|
||||
str(worker.port),
|
||||
"--base-gpu-id",
|
||||
str(worker.gpu_id),
|
||||
"--disaggregation-mode",
|
||||
_disaggregation_mode_for(worker),
|
||||
"--disaggregation-transfer-backend",
|
||||
topology.transfer_backend,
|
||||
]
|
||||
# Naive DP direct workers: no disaggregation flags at all
|
||||
if not (naive_dp and worker.role == "direct"):
|
||||
command.extend([
|
||||
"--disaggregation-mode",
|
||||
_disaggregation_mode_for(worker),
|
||||
"--disaggregation-transfer-backend",
|
||||
topology.transfer_backend,
|
||||
])
|
||||
if worker.tp_size > 1:
|
||||
command.extend(["--tp-size", str(worker.tp_size)])
|
||||
if topology.trust_remote_code:
|
||||
@@ -135,6 +149,32 @@ def _build_router_command(
|
||||
return tuple(command)
|
||||
|
||||
|
||||
def _build_dp_router_command(
|
||||
topology: SingleNodeTopology,
|
||||
*,
|
||||
backend_policy: str,
|
||||
request_timeout_s: float | None,
|
||||
) -> tuple[str, ...]:
|
||||
command: list[str] = [
|
||||
sys.executable,
|
||||
"-B",
|
||||
"-u",
|
||||
"-m",
|
||||
"agentic_pd_hybrid.pd_router",
|
||||
"--host",
|
||||
topology.router_host,
|
||||
"--port",
|
||||
str(topology.router_port),
|
||||
"--backend-policy",
|
||||
backend_policy,
|
||||
]
|
||||
if request_timeout_s is not None:
|
||||
command.extend(["--request-timeout-s", str(request_timeout_s)])
|
||||
for worker in topology.direct_workers:
|
||||
command.extend(["--backend", worker.url])
|
||||
return tuple(command)
|
||||
|
||||
|
||||
def _render_named_command(name: str, command: tuple[str, ...]) -> str:
|
||||
return f"# {name}\n" + " ".join(shlex.quote(part) for part in command)
|
||||
|
||||
|
||||
@@ -43,6 +43,9 @@ class RequestMetrics:
|
||||
ttft_s: float | None
|
||||
tpot_s: float | None
|
||||
error: str | None = None
|
||||
actual_output_tokens: int | None = None
|
||||
requested_output_tokens: int | None = None
|
||||
finish_reason: str | None = None
|
||||
|
||||
@classmethod
|
||||
def from_decision(
|
||||
@@ -63,6 +66,9 @@ class RequestMetrics:
|
||||
prefill_request_priority: int | None = None,
|
||||
decode_request_priority: int | None = None,
|
||||
error: str | None = None,
|
||||
actual_output_tokens: int | None = None,
|
||||
requested_output_tokens: int | None = None,
|
||||
finish_reason: str | None = None,
|
||||
) -> "RequestMetrics":
|
||||
return cls(
|
||||
request_id=request.request_id,
|
||||
@@ -95,6 +101,9 @@ class RequestMetrics:
|
||||
ttft_s=ttft_s,
|
||||
tpot_s=tpot_s,
|
||||
error=error,
|
||||
actual_output_tokens=actual_output_tokens,
|
||||
requested_output_tokens=requested_output_tokens,
|
||||
finish_reason=finish_reason,
|
||||
)
|
||||
|
||||
|
||||
@@ -105,6 +114,16 @@ def write_metrics_jsonl(path: Path, rows: list[RequestMetrics]) -> None:
|
||||
handle.write(json.dumps(asdict(row), sort_keys=True) + "\n")
|
||||
|
||||
|
||||
def _is_failed_request(row: RequestMetrics) -> bool:
|
||||
if row.error is not None:
|
||||
return True
|
||||
if row.finish_reason is not None:
|
||||
fr = str(row.finish_reason).lower()
|
||||
if "abort" in fr or "badrequest" in fr:
|
||||
return True
|
||||
return False
|
||||
|
||||
|
||||
def write_summary_json(
|
||||
path: Path,
|
||||
rows: list[RequestMetrics],
|
||||
@@ -112,9 +131,10 @@ def write_summary_json(
|
||||
trace_path: Path,
|
||||
router_url: str | None,
|
||||
) -> None:
|
||||
latencies = [row.latency_s for row in rows if row.latency_s is not None]
|
||||
ttfts = [row.ttft_s for row in rows if row.ttft_s is not None]
|
||||
tpots = [row.tpot_s for row in rows if row.tpot_s is not None]
|
||||
successful = [row for row in rows if not _is_failed_request(row)]
|
||||
latencies = [row.latency_s for row in successful if row.latency_s is not None]
|
||||
ttfts = [row.ttft_s for row in successful if row.ttft_s is not None]
|
||||
tpots = [row.tpot_s for row in successful if row.tpot_s is not None]
|
||||
per_decode_load = Counter(row.assigned_decode_node for row in rows)
|
||||
per_prefill_load = Counter(row.assigned_prefill_node for row in rows)
|
||||
prefill_priorities = Counter(
|
||||
@@ -158,6 +178,28 @@ def write_summary_json(
|
||||
str(key): value for key, value in sorted(decode_priorities.items())
|
||||
},
|
||||
"error_count": sum(1 for row in rows if row.error is not None),
|
||||
"abort_count": sum(
|
||||
1
|
||||
for row in rows
|
||||
if row.error is None
|
||||
and row.finish_reason is not None
|
||||
and (
|
||||
"abort" in str(row.finish_reason).lower()
|
||||
or "badrequest" in str(row.finish_reason).lower()
|
||||
)
|
||||
),
|
||||
"failure_count": sum(1 for row in rows if _is_failed_request(row)),
|
||||
"truncated_request_count": sum(
|
||||
1
|
||||
for row in rows
|
||||
if row.actual_output_tokens is not None
|
||||
and row.requested_output_tokens is not None
|
||||
and row.requested_output_tokens > 1
|
||||
and row.actual_output_tokens < row.requested_output_tokens * 0.5
|
||||
),
|
||||
"actual_output_tokens_stats": _stats(
|
||||
[float(row.actual_output_tokens) for row in rows if row.actual_output_tokens is not None]
|
||||
),
|
||||
}
|
||||
path.parent.mkdir(parents=True, exist_ok=True)
|
||||
with path.open("w", encoding="utf-8") as handle:
|
||||
|
||||
@@ -74,8 +74,58 @@ class RouterState:
|
||||
return idx
|
||||
|
||||
|
||||
@dataclass
|
||||
class DpRouterConfig:
|
||||
host: str
|
||||
port: int
|
||||
backend_urls: list[str]
|
||||
backend_policy: str = "round_robin"
|
||||
request_timeout_s: float = 1800.0
|
||||
|
||||
|
||||
class DpRouterState:
|
||||
"""DP (data-parallel) router: forward each request to exactly one backend."""
|
||||
|
||||
def __init__(self, config: DpRouterConfig):
|
||||
if not config.backend_urls:
|
||||
raise ValueError("At least one backend worker is required")
|
||||
self.config = config
|
||||
self.cursor = 0
|
||||
self.sticky_map: dict[str, int] = {}
|
||||
|
||||
def select_backend(self, headers: dict[str, str]) -> str:
|
||||
idx = self._select_index(headers)
|
||||
return self.config.backend_urls[idx]
|
||||
|
||||
def _select_index(self, headers: dict[str, str]) -> int:
|
||||
target_worker = headers.get("x-smg-target-worker")
|
||||
routing_key = headers.get("x-smg-routing-key")
|
||||
|
||||
if (
|
||||
self.config.backend_policy == "consistent_hashing"
|
||||
and target_worker is not None
|
||||
):
|
||||
idx = int(target_worker)
|
||||
if 0 <= idx < len(self.config.backend_urls):
|
||||
return idx
|
||||
|
||||
if self.config.backend_policy == "manual" and routing_key:
|
||||
cached = self.sticky_map.get(routing_key)
|
||||
if cached is not None:
|
||||
return cached
|
||||
idx = self.cursor % len(self.config.backend_urls)
|
||||
self.cursor += 1
|
||||
self.sticky_map[routing_key] = idx
|
||||
return idx
|
||||
|
||||
idx = self.cursor % len(self.config.backend_urls)
|
||||
self.cursor += 1
|
||||
return idx
|
||||
|
||||
|
||||
app = FastAPI()
|
||||
router_state: RouterState | None = None
|
||||
dp_state: DpRouterState | None = None
|
||||
|
||||
|
||||
@app.get("/health")
|
||||
@@ -85,6 +135,16 @@ async def health() -> Response:
|
||||
|
||||
@app.get("/health_generate")
|
||||
async def health_generate() -> Response:
|
||||
if dp_state is not None:
|
||||
async with aiohttp.ClientSession() as session:
|
||||
tasks = [
|
||||
session.get(f"{url}/health_generate")
|
||||
for url in dp_state.config.backend_urls
|
||||
]
|
||||
for response in asyncio.as_completed(tasks):
|
||||
async with await response:
|
||||
pass
|
||||
return Response(status_code=200)
|
||||
state = _require_state()
|
||||
async with aiohttp.ClientSession() as session:
|
||||
tasks = []
|
||||
@@ -101,6 +161,11 @@ async def health_generate() -> Response:
|
||||
|
||||
@app.get("/v1/models")
|
||||
async def models() -> ORJSONResponse:
|
||||
if dp_state is not None:
|
||||
async with aiohttp.ClientSession() as session:
|
||||
async with session.get(f"{dp_state.config.backend_urls[0]}/v1/models") as resp:
|
||||
payload = await resp.json()
|
||||
return ORJSONResponse(payload, status_code=resp.status)
|
||||
state = _require_state()
|
||||
async with aiohttp.ClientSession() as session:
|
||||
async with session.get(f"{state.config.prefill_urls[0][0]}/v1/models") as response:
|
||||
@@ -147,6 +212,15 @@ async def _forward_to_backend(
|
||||
headers: dict[str, str],
|
||||
endpoint_name: str,
|
||||
) -> Response:
|
||||
# DP mode: forward to a single backend
|
||||
if dp_state is not None:
|
||||
return await _forward_to_dp_backend(
|
||||
request_data=request_data,
|
||||
headers=headers,
|
||||
endpoint_name=endpoint_name,
|
||||
)
|
||||
|
||||
# PD mode: coordinate prefill + decode
|
||||
state = _require_state()
|
||||
prefill_server, bootstrap_port, decode_server = state.select_pair(headers)
|
||||
prefill_request, decode_request = _build_backend_requests(
|
||||
@@ -186,6 +260,63 @@ async def _forward_to_backend(
|
||||
)
|
||||
|
||||
|
||||
async def _forward_to_dp_backend(
|
||||
*,
|
||||
request_data: dict,
|
||||
headers: dict[str, str],
|
||||
endpoint_name: str,
|
||||
) -> Response:
|
||||
assert dp_state is not None
|
||||
backend_server = dp_state.select_backend(headers)
|
||||
cleaned = _strip_internal_fields(request_data)
|
||||
timeout_s = dp_state.config.request_timeout_s
|
||||
|
||||
if request_data.get("stream", False):
|
||||
return StreamingResponse(
|
||||
_stream_dp_generate(
|
||||
request_data=cleaned,
|
||||
backend_server=backend_server,
|
||||
endpoint_name=endpoint_name,
|
||||
timeout_s=timeout_s,
|
||||
),
|
||||
media_type="text/event-stream",
|
||||
)
|
||||
|
||||
async with aiohttp.ClientSession(
|
||||
timeout=aiohttp.ClientTimeout(total=timeout_s)
|
||||
) as session:
|
||||
async with session.post(
|
||||
f"{backend_server}/{endpoint_name}", json=cleaned
|
||||
) as response:
|
||||
body = await response.read()
|
||||
return Response(
|
||||
content=body,
|
||||
status_code=response.status,
|
||||
media_type=response.content_type,
|
||||
)
|
||||
|
||||
|
||||
async def _stream_dp_generate(
|
||||
*,
|
||||
request_data: dict,
|
||||
backend_server: str,
|
||||
endpoint_name: str,
|
||||
timeout_s: float,
|
||||
) -> AsyncIterator[bytes]:
|
||||
async with aiohttp.ClientSession(
|
||||
timeout=aiohttp.ClientTimeout(total=timeout_s)
|
||||
) as session:
|
||||
async with session.post(
|
||||
f"{backend_server}/{endpoint_name}", json=request_data
|
||||
) as response:
|
||||
if response.status != HTTPStatus.OK:
|
||||
payload = await response.read()
|
||||
yield payload
|
||||
return
|
||||
async for chunk in response.content.iter_chunked(_STREAM_CHUNK_SIZE):
|
||||
yield chunk
|
||||
|
||||
|
||||
async def _stream_generate(
|
||||
*,
|
||||
prefill_request: dict,
|
||||
@@ -241,6 +372,12 @@ def _build_backend_requests(
|
||||
prefill_request.update(bootstrap_payload)
|
||||
decode_request.update(bootstrap_payload)
|
||||
|
||||
# session_params is only meaningful for the decode worker (streaming session
|
||||
# KV reuse). Sending it to the prefill worker causes the D side to
|
||||
# short-circuit with local-prefill on already-open sessions, returning
|
||||
# truncated responses while P's KV transfer gets aborted.
|
||||
prefill_request.pop("session_params", None)
|
||||
|
||||
if prefill_priority is not None:
|
||||
prefill_request["priority"] = int(prefill_priority)
|
||||
if decode_priority is not None:
|
||||
@@ -262,7 +399,7 @@ def _require_state() -> RouterState:
|
||||
|
||||
|
||||
def main() -> None:
|
||||
parser = argparse.ArgumentParser(description="Minimal local PD router")
|
||||
parser = argparse.ArgumentParser(description="Minimal local PD / DP router")
|
||||
parser.add_argument("--host", default="127.0.0.1")
|
||||
parser.add_argument("--port", type=int, default=8000)
|
||||
parser.add_argument(
|
||||
@@ -270,30 +407,58 @@ def main() -> None:
|
||||
nargs=2,
|
||||
metavar=("URL", "BOOTSTRAP_PORT"),
|
||||
action="append",
|
||||
required=True,
|
||||
default=None,
|
||||
)
|
||||
parser.add_argument(
|
||||
"--decode",
|
||||
action="append",
|
||||
required=True,
|
||||
default=None,
|
||||
)
|
||||
parser.add_argument("--prefill-policy", default="round_robin")
|
||||
parser.add_argument("--decode-policy", default="manual")
|
||||
parser.add_argument(
|
||||
"--backend",
|
||||
action="append",
|
||||
default=None,
|
||||
help="Backend URL for DP (data-parallel) mode. Repeat for each worker.",
|
||||
)
|
||||
parser.add_argument(
|
||||
"--backend-policy",
|
||||
default="round_robin",
|
||||
help="Routing policy for DP mode: round_robin, manual, consistent_hashing.",
|
||||
)
|
||||
parser.add_argument("--request-timeout-s", type=float, default=1800.0)
|
||||
args = parser.parse_args()
|
||||
|
||||
global router_state
|
||||
router_state = RouterState(
|
||||
RouterConfig(
|
||||
host=args.host,
|
||||
port=args.port,
|
||||
prefill_urls=[(url, int(port)) for url, port in args.prefill],
|
||||
decode_urls=list(args.decode),
|
||||
prefill_policy=args.prefill_policy,
|
||||
decode_policy=args.decode_policy,
|
||||
request_timeout_s=args.request_timeout_s,
|
||||
global router_state, dp_state
|
||||
|
||||
if args.backend:
|
||||
# DP mode: simple forward to one of N backends
|
||||
dp_state = DpRouterState(
|
||||
DpRouterConfig(
|
||||
host=args.host,
|
||||
port=args.port,
|
||||
backend_urls=list(args.backend),
|
||||
backend_policy=args.backend_policy,
|
||||
request_timeout_s=args.request_timeout_s,
|
||||
)
|
||||
)
|
||||
)
|
||||
elif args.prefill and args.decode:
|
||||
# PD mode: prefill/decode coordination
|
||||
router_state = RouterState(
|
||||
RouterConfig(
|
||||
host=args.host,
|
||||
port=args.port,
|
||||
prefill_urls=[(url, int(port)) for url, port in args.prefill],
|
||||
decode_urls=list(args.decode),
|
||||
prefill_policy=args.prefill_policy,
|
||||
decode_policy=args.decode_policy,
|
||||
request_timeout_s=args.request_timeout_s,
|
||||
)
|
||||
)
|
||||
else:
|
||||
parser.error("Either --backend (DP mode) or both --prefill and --decode (PD mode) are required")
|
||||
|
||||
uvicorn.run(app, host=args.host, port=args.port, log_level="info")
|
||||
|
||||
|
||||
|
||||
@@ -44,6 +44,10 @@ class RoutingState:
|
||||
inflight_decode: Counter[str] = field(default_factory=Counter)
|
||||
decode_assignment_counts: Counter[str] = field(default_factory=Counter)
|
||||
decode_resident_blocks: dict[str, set[int]] = field(default_factory=dict)
|
||||
# Migration support: per-(session_id, decode_worker_id) admission reject counter.
|
||||
# KvAwarePolicy uses this to skip D's that have repeatedly rejected this session
|
||||
# (avoids the structural starvation observed in TEAM_REPORT §2.1).
|
||||
session_d_rejects: Counter[tuple[str, str]] = field(default_factory=Counter)
|
||||
|
||||
@classmethod
|
||||
def create(cls, topology: SingleNodeTopology) -> "RoutingState":
|
||||
@@ -66,6 +70,12 @@ class RoutingState:
|
||||
self.decode_cursor += 1
|
||||
return worker.worker_id
|
||||
|
||||
def record_admission_reject(self, session_id: str, decode_worker_id: str) -> int:
|
||||
"""Increment per-(session, D) rejection counter. Returns new count."""
|
||||
key = (session_id, decode_worker_id)
|
||||
self.session_d_rejects[key] += 1
|
||||
return self.session_d_rejects[key]
|
||||
|
||||
def finish(self, request: TraceRequest, decision: RoutingDecision) -> None:
|
||||
session = self.session_state.setdefault(request.session_id, SessionRouteState())
|
||||
session.last_decode_worker = decision.decode_worker_id
|
||||
@@ -142,10 +152,64 @@ class StickyDecodePolicy:
|
||||
)
|
||||
|
||||
|
||||
CandidateScore = tuple[int, int, int, int]
|
||||
|
||||
|
||||
def score_candidate(
|
||||
*,
|
||||
overlap: int,
|
||||
sticky: bool,
|
||||
inflight: int,
|
||||
assigned: int,
|
||||
mean_assigned: float,
|
||||
sticky_bonus: int,
|
||||
load_floor_bonus: int,
|
||||
) -> CandidateScore:
|
||||
"""Pure scoring function for KvAwarePolicy (Algorithm 1 in KVC_ROUTER_ALGORITHM.md).
|
||||
|
||||
Returns the 4-tuple compared lexicographically by `select()` to pick the
|
||||
best D. Extracted as a top-level function so unit tests can exercise it
|
||||
without constructing topology/state objects.
|
||||
|
||||
Score tuple positions:
|
||||
0: overlap + sticky_bonus*sticky + floor_bonus — primary, KV reuse aware
|
||||
1: sticky — tie-1, session locality
|
||||
2: -inflight — tie-2, prefer low load
|
||||
3: -assigned — tie-3, prefer rarely-picked
|
||||
|
||||
Load-floor bonus is gated on `not sticky` (turn-1+ sessions continue to
|
||||
stick to their original D). The boost magnitude scales linearly with the
|
||||
D's deficit relative to the running mean of decode_assignment_counts:
|
||||
floor_bonus = load_floor_bonus * max(0, mean - assigned) / max(1, mean)
|
||||
When mean == 0 (warmup) the bonus is 0 for all candidates (lex tiebreak
|
||||
falls through to iteration order).
|
||||
|
||||
See docs/E1_E2_FIX_DESIGN_ZH.md §Q2 for the load-floor design and
|
||||
docs/KVC_ROUTER_ALGORITHM.md §3.1 for the lex-score formalism.
|
||||
"""
|
||||
floor_bonus = 0
|
||||
if load_floor_bonus > 0 and not sticky and mean_assigned > 0:
|
||||
deficit = max(0.0, mean_assigned - assigned)
|
||||
floor_bonus = int(load_floor_bonus * deficit / mean_assigned)
|
||||
primary = overlap + (sticky_bonus if sticky else 0) + floor_bonus
|
||||
return (primary, int(sticky), -inflight, -assigned)
|
||||
|
||||
|
||||
@dataclass(frozen=True)
|
||||
class KvAwarePolicy:
|
||||
name: str = "kv-aware"
|
||||
sticky_bonus: int = 1
|
||||
# Session migration: when (session, D) has been rejected this many times,
|
||||
# skip D entirely for this session (force migration to another D).
|
||||
# 0 disables the mechanism. Default 3 picked empirically to allow brief
|
||||
# transient saturation without panicking, but to reroute persistent starvation.
|
||||
migration_reject_threshold: int = 3
|
||||
# Load-floor bonus: see score_candidate() docstring for the exact formula.
|
||||
# Set above the max cross-session boilerplate overlap you expect (so fresh
|
||||
# sessions reach under-loaded D's even at 0 overlap), but below the
|
||||
# magnitude of "real" prefix overlap (so a warm D still wins for its own
|
||||
# session). 0 disables.
|
||||
load_floor_bonus: int = 0
|
||||
|
||||
def select(
|
||||
self,
|
||||
@@ -157,23 +221,48 @@ class KvAwarePolicy:
|
||||
prefill_worker_id = state.next_prefill_worker_id(topology)
|
||||
session = state.session_state.get(request.session_id)
|
||||
|
||||
n_route_workers = max(1, len(topology.route_workers))
|
||||
total_assigned = sum(state.decode_assignment_counts.values())
|
||||
mean_assigned = total_assigned / n_route_workers
|
||||
|
||||
best_decode_worker_id: str | None = None
|
||||
best_score: tuple[int, int, int] | None = None
|
||||
best_score: CandidateScore | None = None
|
||||
for worker in topology.route_workers:
|
||||
overlap = _overlap_blocks(request, state, worker.worker_id)
|
||||
sticky = int(session is not None and session.last_decode_worker == worker.worker_id)
|
||||
inflight_penalty = -state.inflight_decode.get(worker.worker_id, 0)
|
||||
assignment_penalty = -state.decode_assignment_counts.get(worker.worker_id, 0)
|
||||
score = (
|
||||
overlap + sticky * self.sticky_bonus,
|
||||
sticky,
|
||||
inflight_penalty,
|
||||
assignment_penalty,
|
||||
# Migration: skip workers that have rejected this session too many times.
|
||||
# If all candidates get filtered (degenerate case), fall through to
|
||||
# un-filtered selection below.
|
||||
if self.migration_reject_threshold > 0:
|
||||
rejects = state.session_d_rejects.get(
|
||||
(request.session_id, worker.worker_id), 0
|
||||
)
|
||||
if rejects >= self.migration_reject_threshold:
|
||||
continue
|
||||
score = score_candidate(
|
||||
overlap=_overlap_blocks(request, state, worker.worker_id),
|
||||
sticky=(
|
||||
session is not None
|
||||
and session.last_decode_worker == worker.worker_id
|
||||
),
|
||||
inflight=state.inflight_decode.get(worker.worker_id, 0),
|
||||
assigned=state.decode_assignment_counts.get(worker.worker_id, 0),
|
||||
mean_assigned=mean_assigned,
|
||||
sticky_bonus=self.sticky_bonus,
|
||||
load_floor_bonus=self.load_floor_bonus,
|
||||
)
|
||||
if best_score is None or score > best_score:
|
||||
best_score = score
|
||||
best_decode_worker_id = worker.worker_id
|
||||
|
||||
# Degenerate fallback: every D was filtered. Pick the least-rejected D.
|
||||
if best_decode_worker_id is None:
|
||||
best_decode_worker_id = min(
|
||||
(w.worker_id for w in topology.route_workers),
|
||||
key=lambda wid: state.session_d_rejects.get(
|
||||
(request.session_id, wid), 0
|
||||
),
|
||||
)
|
||||
best_score = (0, 0, 0, 0)
|
||||
|
||||
assert best_decode_worker_id is not None
|
||||
reuse_expected = bool(best_score and best_score[0] > 0)
|
||||
return _build_decision(
|
||||
@@ -187,14 +276,22 @@ class KvAwarePolicy:
|
||||
)
|
||||
|
||||
|
||||
def create_policy(name: str) -> RoutingPolicy:
|
||||
def create_policy(
|
||||
name: str,
|
||||
*,
|
||||
migration_reject_threshold: int = 3,
|
||||
load_floor_bonus: int = 0,
|
||||
) -> RoutingPolicy:
|
||||
normalized = name.strip().lower()
|
||||
if normalized == "default":
|
||||
return DefaultPolicy()
|
||||
if normalized == "sticky":
|
||||
return StickyDecodePolicy()
|
||||
if normalized in {"kv-aware", "kv_aware", "kv"}:
|
||||
return KvAwarePolicy()
|
||||
return KvAwarePolicy(
|
||||
migration_reject_threshold=migration_reject_threshold,
|
||||
load_floor_bonus=load_floor_bonus,
|
||||
)
|
||||
raise ValueError(f"Unsupported policy: {name}")
|
||||
|
||||
|
||||
|
||||
@@ -31,6 +31,44 @@ KvCachePrefillBackupPolicy = Literal["release-after-transfer", "capacity-backup"
|
||||
_ADMISSION_PROBE_TIMEOUT_S = 2.0
|
||||
|
||||
|
||||
# --- Structural event logging (admission probes, backpressure pauses, ---
|
||||
# --- session-D bindings). Module-level state keeps call-site diff small. ---
|
||||
_STRUCTURAL_LOG_DIR: Path | None = None
|
||||
_STRUCTURAL_LOG_LOCK = asyncio.Lock()
|
||||
_STRUCTURAL_LOG_FILES: dict[str, Any] = {}
|
||||
_STRUCTURAL_RUN_START_S: float = 0.0
|
||||
|
||||
|
||||
def _structural_init(log_dir: Path | None) -> None:
|
||||
global _STRUCTURAL_LOG_DIR, _STRUCTURAL_RUN_START_S
|
||||
_STRUCTURAL_LOG_DIR = log_dir
|
||||
_STRUCTURAL_RUN_START_S = time.perf_counter()
|
||||
if log_dir is not None:
|
||||
log_dir.mkdir(parents=True, exist_ok=True)
|
||||
|
||||
|
||||
def _structural_close() -> None:
|
||||
for handle in _STRUCTURAL_LOG_FILES.values():
|
||||
try:
|
||||
handle.close()
|
||||
except Exception:
|
||||
pass
|
||||
_STRUCTURAL_LOG_FILES.clear()
|
||||
|
||||
|
||||
async def _structural_emit(filename: str, event: dict[str, Any]) -> None:
|
||||
if _STRUCTURAL_LOG_DIR is None:
|
||||
return
|
||||
event = {"t": round(time.perf_counter() - _STRUCTURAL_RUN_START_S, 4), **event}
|
||||
async with _STRUCTURAL_LOG_LOCK:
|
||||
handle = _STRUCTURAL_LOG_FILES.get(filename)
|
||||
if handle is None:
|
||||
handle = (_STRUCTURAL_LOG_DIR / filename).open("a", encoding="utf-8")
|
||||
_STRUCTURAL_LOG_FILES[filename] = handle
|
||||
handle.write(json.dumps(event, sort_keys=True) + "\n")
|
||||
handle.flush()
|
||||
|
||||
|
||||
@dataclass(frozen=True)
|
||||
class ReplayConfig:
|
||||
trace_path: Path
|
||||
@@ -64,6 +102,21 @@ class ReplayConfig:
|
||||
kvcache_prefill_priority_eviction: bool = False
|
||||
kvcache_prefill_direct_priority: int = -100
|
||||
kvcache_prefill_normal_priority: int = 100
|
||||
pool_poll_interval_s: float = 0.0
|
||||
pool_poll_include_sessions: bool = True
|
||||
enable_backpressure: bool = False
|
||||
backpressure_max_pause_s: float = 2.0
|
||||
# Session migration via per-(sess, D) admission reject memory.
|
||||
# When a session has been admission-rejected this many times on a given D,
|
||||
# KvAwarePolicy skips that D for the session (forcing migration). Default 3.
|
||||
# Set 0 to disable. See REFACTOR_PLAN_V1 §6.2.
|
||||
kvcache_migration_reject_threshold: int = 3
|
||||
# Load-floor bonus magnitude for KvAwarePolicy: graduated boost added to
|
||||
# under-loaded D workers to break overlap-pinning imbalance on workloads
|
||||
# with shared cross-session prefix. 0 disables. See
|
||||
# docs/E1_E2_FIX_DESIGN_ZH.md §Q2.
|
||||
kvcache_load_floor_bonus: int = 0
|
||||
structural_log_dir: Path | None = None
|
||||
|
||||
|
||||
@dataclass
|
||||
@@ -95,6 +148,8 @@ class DecodeResidencyState:
|
||||
prefill_reserved_tokens_by_server: dict[str, int] = field(default_factory=dict)
|
||||
decode_evictions_prefill_backed: int = 0
|
||||
decode_evictions_without_prefill_backup: int = 0
|
||||
# Backpressure: per-D timestamp until which new requests should pause.
|
||||
pause_until_s: dict[str, float] = field(default_factory=dict)
|
||||
|
||||
|
||||
@dataclass(frozen=True)
|
||||
@@ -124,9 +179,16 @@ class ExecutionResult:
|
||||
prefill_request_priority: int | None = None
|
||||
decode_request_priority: int | None = None
|
||||
error: str | None = None
|
||||
actual_output_tokens: int | None = None
|
||||
requested_output_tokens: int | None = None
|
||||
finish_reason: str | None = None
|
||||
|
||||
|
||||
async def replay_trace(config: ReplayConfig) -> list[RequestMetrics]:
|
||||
structural_dir = config.structural_log_dir
|
||||
if structural_dir is None and config.output_path is not None:
|
||||
structural_dir = config.output_path.parent / "structural"
|
||||
_structural_init(structural_dir)
|
||||
requests = load_trace(config.trace_path, request_limit=config.request_limit)
|
||||
if config.kvcache_seed_only_multiturn_sessions:
|
||||
session_turns = Counter(request.session_id for request in requests)
|
||||
@@ -138,7 +200,11 @@ async def replay_trace(config: ReplayConfig) -> list[RequestMetrics]:
|
||||
if turn_count > 1
|
||||
),
|
||||
)
|
||||
policy = create_policy(config.policy_name)
|
||||
policy = create_policy(
|
||||
config.policy_name,
|
||||
migration_reject_threshold=config.kvcache_migration_reject_threshold,
|
||||
load_floor_bonus=config.kvcache_load_floor_bonus,
|
||||
)
|
||||
state = RoutingState.create(config.topology)
|
||||
state_lock = asyncio.Lock()
|
||||
semaphore = asyncio.Semaphore(config.concurrency_limit)
|
||||
@@ -152,6 +218,25 @@ async def replay_trace(config: ReplayConfig) -> list[RequestMetrics]:
|
||||
client=client,
|
||||
config=config,
|
||||
)
|
||||
poll_task: asyncio.Task[None] | None = None
|
||||
if config.pool_poll_interval_s > 0:
|
||||
poll_workers: list[tuple[str, str, str]] = []
|
||||
for worker in config.topology.decode_workers:
|
||||
poll_workers.append((worker.worker_id, "decode", worker.url))
|
||||
for worker in config.topology.prefill_workers:
|
||||
poll_workers.append((worker.worker_id, "prefill", worker.url))
|
||||
if poll_workers:
|
||||
poll_output = config.output_path.parent / "d-pool-timeseries.jsonl"
|
||||
poll_task = asyncio.create_task(
|
||||
_poll_pool_timeseries(
|
||||
client=client,
|
||||
workers=poll_workers,
|
||||
interval_s=config.pool_poll_interval_s,
|
||||
output_path=poll_output,
|
||||
start_time=start_time,
|
||||
include_sessions=config.pool_poll_include_sessions,
|
||||
)
|
||||
)
|
||||
tasks = []
|
||||
for request in requests:
|
||||
if config.pace:
|
||||
@@ -179,6 +264,12 @@ async def replay_trace(config: ReplayConfig) -> list[RequestMetrics]:
|
||||
session_tail_tasks[request.session_id] = tasks[-1]
|
||||
|
||||
results = await asyncio.gather(*tasks)
|
||||
if poll_task is not None:
|
||||
poll_task.cancel()
|
||||
try:
|
||||
await poll_task
|
||||
except asyncio.CancelledError:
|
||||
pass
|
||||
for session in direct_sessions.values():
|
||||
if session.opened:
|
||||
try:
|
||||
@@ -208,6 +299,7 @@ async def replay_trace(config: ReplayConfig) -> list[RequestMetrics]:
|
||||
trace_path=config.trace_path,
|
||||
router_url=config.router_url,
|
||||
)
|
||||
_structural_close()
|
||||
return results
|
||||
|
||||
|
||||
@@ -231,6 +323,21 @@ async def _run_request(
|
||||
async with state_lock:
|
||||
decision = policy.select(request, topology=config.topology, state=state)
|
||||
|
||||
await _structural_emit(
|
||||
"session-d-binding.jsonl",
|
||||
{
|
||||
"session_id": request.session_id,
|
||||
"request_id": request.request_id,
|
||||
"turn_id": request.turn_id,
|
||||
"decode_worker_index": decision.decode_worker_index,
|
||||
"decode_worker_id": decision.decode_worker_id,
|
||||
"prefill_worker_id": decision.prefill_worker_id,
|
||||
"observed_overlap_blocks": decision.observed_overlap_blocks,
|
||||
"kv_transfer_blocks": decision.kv_transfer_blocks,
|
||||
"inflight_decode_load": decision.inflight_decode_load,
|
||||
},
|
||||
)
|
||||
|
||||
try:
|
||||
execution = await _execute_request(
|
||||
client=client,
|
||||
@@ -257,6 +364,22 @@ async def _run_request(
|
||||
|
||||
async with state_lock:
|
||||
state.finish(request, decision)
|
||||
# Migration feedback: if this request was forced into a fallback path
|
||||
# because the chosen D rejected admission, record the (session, D)
|
||||
# rejection so KvAwarePolicy can migrate this session next turn.
|
||||
if _is_admission_rejection_mode(execution.execution_mode):
|
||||
state.record_admission_reject(
|
||||
request.session_id,
|
||||
decision.decode_worker_id,
|
||||
)
|
||||
# Reset-on-success: a successful direct-to-D path proves D-X can
|
||||
# currently serve this session — clear the cumulative reject counter
|
||||
# so that brief past saturation doesn't permanently blacklist the D.
|
||||
# (MIGRATION_V1_FINDINGS §4.1: blacklist-permanence bug fix.)
|
||||
elif execution.execution_mode == "kvcache-direct-to-d-session":
|
||||
state.session_d_rejects[
|
||||
(request.session_id, decision.decode_worker_id)
|
||||
] = 0
|
||||
|
||||
return RequestMetrics.from_decision(
|
||||
request,
|
||||
@@ -274,6 +397,9 @@ async def _run_request(
|
||||
prefill_request_priority=execution.prefill_request_priority,
|
||||
decode_request_priority=execution.decode_request_priority,
|
||||
error=execution.error,
|
||||
actual_output_tokens=execution.actual_output_tokens,
|
||||
requested_output_tokens=execution.requested_output_tokens,
|
||||
finish_reason=execution.finish_reason,
|
||||
)
|
||||
|
||||
|
||||
@@ -286,7 +412,17 @@ async def _invoke_router(
|
||||
session_id: str | None = None,
|
||||
prefill_request_priority: int | None = None,
|
||||
decode_request_priority: int | None = None,
|
||||
) -> tuple[float, float | None, float | None, int]:
|
||||
decode_residency: "DecodeResidencyState | None" = None,
|
||||
) -> GenerateResult:
|
||||
if decode_residency is not None and config.enable_backpressure:
|
||||
decode_url = config.topology.decode_workers[decode_worker_index].url
|
||||
await _wait_for_decode_pause(
|
||||
config=config,
|
||||
residency=decode_residency,
|
||||
server_url=decode_url,
|
||||
request_id=request.request_id,
|
||||
session_id=session_id,
|
||||
)
|
||||
headers = _build_headers(
|
||||
request=request,
|
||||
header_mode=config.header_mode,
|
||||
@@ -414,6 +550,18 @@ async def _invoke_chat_completion(
|
||||
return latency_s, ttft_s, tpot_s, cached_tokens
|
||||
|
||||
|
||||
@dataclass(frozen=True)
|
||||
class GenerateResult:
|
||||
latency_s: float
|
||||
ttft_s: float | None
|
||||
tpot_s: float | None
|
||||
cached_tokens: int
|
||||
actual_output_tokens: int
|
||||
requested_output_tokens: int
|
||||
finish_reason: str | None
|
||||
server_meta_info: dict | None
|
||||
|
||||
|
||||
async def _invoke_generate(
|
||||
*,
|
||||
client: httpx.AsyncClient,
|
||||
@@ -423,12 +571,16 @@ async def _invoke_generate(
|
||||
timeout_s: float,
|
||||
stream_idle_timeout_s: float | None,
|
||||
stream: bool,
|
||||
) -> tuple[float, float | None, float | None, int]:
|
||||
) -> GenerateResult:
|
||||
start = time.perf_counter()
|
||||
ttft_s: float | None = None
|
||||
cached_tokens = 0
|
||||
sampling_params = payload.get("sampling_params", {})
|
||||
generated_tokens = int(sampling_params.get("max_new_tokens", 1))
|
||||
requested_output_tokens = int(sampling_params.get("max_new_tokens", 1))
|
||||
actual_token_count = 0
|
||||
finish_reason: str | None = None
|
||||
last_meta_info: dict | None = None
|
||||
|
||||
if stream:
|
||||
async with client.stream(
|
||||
"POST",
|
||||
@@ -452,8 +604,19 @@ async def _invoke_generate(
|
||||
if isinstance(error, dict):
|
||||
raise ValueError(error.get("message", json.dumps(error)))
|
||||
cached_tokens = max(cached_tokens, _extract_generate_cached_tokens(parsed))
|
||||
if _contains_generate_token(parsed) and ttft_s is None:
|
||||
ttft_s = time.perf_counter() - start
|
||||
if _contains_generate_token(parsed):
|
||||
actual_token_count += 1
|
||||
if ttft_s is None:
|
||||
ttft_s = time.perf_counter() - start
|
||||
meta_info = parsed.get("meta_info")
|
||||
if isinstance(meta_info, dict):
|
||||
last_meta_info = meta_info
|
||||
completion_tokens = int(meta_info.get("completion_tokens", 0))
|
||||
if completion_tokens > actual_token_count:
|
||||
actual_token_count = completion_tokens
|
||||
fr = meta_info.get("finish_reason")
|
||||
if fr is not None:
|
||||
finish_reason = str(fr)
|
||||
if _is_generate_terminal_chunk(parsed):
|
||||
break
|
||||
else:
|
||||
@@ -469,15 +632,33 @@ async def _invoke_generate(
|
||||
if isinstance(error, dict):
|
||||
raise ValueError(error.get("message", json.dumps(error)))
|
||||
cached_tokens = _extract_generate_cached_tokens(parsed)
|
||||
meta_info = parsed.get("meta_info")
|
||||
if isinstance(meta_info, dict):
|
||||
last_meta_info = meta_info
|
||||
actual_token_count = int(meta_info.get("completion_tokens", 0))
|
||||
finish_reason = meta_info.get("finish_reason")
|
||||
|
||||
latency_s = time.perf_counter() - start
|
||||
if stream and ttft_s is None and generated_tokens > 0:
|
||||
if stream and ttft_s is None and requested_output_tokens > 0:
|
||||
raise RuntimeError("generate stream ended before producing any token")
|
||||
|
||||
# Use actual token count for TPOT (not requested count)
|
||||
effective_tokens = max(1, actual_token_count) if actual_token_count > 0 else max(1, requested_output_tokens)
|
||||
if ttft_s is None:
|
||||
tpot_s = None
|
||||
else:
|
||||
tpot_s = max(0.0, latency_s - ttft_s) / max(1, generated_tokens)
|
||||
return latency_s, ttft_s, tpot_s, cached_tokens
|
||||
tpot_s = max(0.0, latency_s - ttft_s) / effective_tokens
|
||||
|
||||
return GenerateResult(
|
||||
latency_s=latency_s,
|
||||
ttft_s=ttft_s,
|
||||
tpot_s=tpot_s,
|
||||
cached_tokens=cached_tokens,
|
||||
actual_output_tokens=actual_token_count,
|
||||
requested_output_tokens=requested_output_tokens,
|
||||
finish_reason=finish_reason,
|
||||
server_meta_info=last_meta_info,
|
||||
)
|
||||
|
||||
|
||||
async def _open_streaming_session(
|
||||
@@ -593,6 +774,139 @@ async def _fetch_decode_server_state(
|
||||
)
|
||||
|
||||
|
||||
async def _query_pool_snapshot(
|
||||
*,
|
||||
client: httpx.AsyncClient,
|
||||
server_url: str,
|
||||
include_sessions: bool,
|
||||
) -> dict[str, Any]:
|
||||
try:
|
||||
response = await client.get(
|
||||
f"{server_url.rstrip('/')}/server_info",
|
||||
timeout=_ADMISSION_PROBE_TIMEOUT_S,
|
||||
)
|
||||
response.raise_for_status()
|
||||
payload = response.json()
|
||||
except Exception as exc:
|
||||
return {"error": type(exc).__name__}
|
||||
|
||||
internal = _extract_internal_state(payload)
|
||||
session_cache = _extract_session_cache(payload)
|
||||
sessions: list[dict[str, Any]] = []
|
||||
if include_sessions and isinstance(session_cache.get("sessions"), list):
|
||||
for entry in session_cache["sessions"]:
|
||||
if not isinstance(entry, dict):
|
||||
continue
|
||||
sessions.append(
|
||||
{
|
||||
"session_id": entry.get("session_id"),
|
||||
"resident": bool(entry.get("resident")),
|
||||
"resident_tokens": int(entry.get("resident_tokens") or 0),
|
||||
"idle_evictable": bool(entry.get("idle_evictable")),
|
||||
"timed_out": bool(entry.get("timed_out")),
|
||||
}
|
||||
)
|
||||
|
||||
memory_usage = internal.get("memory_usage") if isinstance(internal, dict) else None
|
||||
if not isinstance(memory_usage, dict):
|
||||
memory_usage = {}
|
||||
|
||||
# P1 instrument: pool_breakdown decomposes "other" into named buckets
|
||||
pool_breakdown = internal.get("pool_breakdown") if isinstance(internal, dict) else None
|
||||
if not isinstance(pool_breakdown, dict):
|
||||
pool_breakdown = {}
|
||||
|
||||
return {
|
||||
"session_cache_enabled": bool(session_cache.get("enabled")),
|
||||
"session_count": int(session_cache.get("session_count") or 0),
|
||||
"resident_session_count": int(session_cache.get("resident_session_count") or 0),
|
||||
"held_tokens": int(session_cache.get("held_tokens") or 0),
|
||||
"available_tokens": int(session_cache.get("available_tokens") or 0),
|
||||
"capacity_tokens": int(session_cache.get("capacity_tokens") or 0),
|
||||
"idle_evictable_session_count": int(
|
||||
session_cache.get("idle_evictable_session_count") or 0
|
||||
),
|
||||
"idle_evictable_tokens": int(session_cache.get("idle_evictable_tokens") or 0),
|
||||
"kvcache_mem_gb": float(memory_usage.get("kvcache") or 0.0),
|
||||
"token_capacity": int(memory_usage.get("token_capacity") or 0),
|
||||
"max_total_num_tokens": int(internal.get("max_total_num_tokens") or 0)
|
||||
if isinstance(internal, dict)
|
||||
else 0,
|
||||
"last_gen_throughput": float(internal.get("last_gen_throughput") or 0.0)
|
||||
if isinstance(internal, dict)
|
||||
else 0.0,
|
||||
"radix_evictable_tokens": int(pool_breakdown.get("radix_evictable_tokens") or 0),
|
||||
"radix_protected_tokens": int(pool_breakdown.get("radix_protected_tokens") or 0),
|
||||
"slot_private_held_tokens": int(pool_breakdown.get("slot_private_held_tokens") or 0),
|
||||
"session_slot_count": int(pool_breakdown.get("session_slot_count") or 0),
|
||||
"running_batch_reqs": int(pool_breakdown.get("running_batch_reqs") or 0),
|
||||
"running_batch_kv_tokens": int(pool_breakdown.get("running_batch_kv_tokens") or 0),
|
||||
"transfer_queue_reqs": int(pool_breakdown.get("transfer_queue_reqs") or 0),
|
||||
"transfer_queue_tokens": int(pool_breakdown.get("transfer_queue_tokens") or 0),
|
||||
"prealloc_queue_reqs": int(pool_breakdown.get("prealloc_queue_reqs") or 0),
|
||||
"prealloc_queue_tokens": int(pool_breakdown.get("prealloc_queue_tokens") or 0),
|
||||
"retracted_queue_reqs": int(pool_breakdown.get("retracted_queue_reqs") or 0),
|
||||
"retracted_queue_tokens": int(pool_breakdown.get("retracted_queue_tokens") or 0),
|
||||
"sessions": sessions,
|
||||
}
|
||||
|
||||
|
||||
async def _poll_pool_timeseries(
|
||||
*,
|
||||
client: httpx.AsyncClient,
|
||||
workers: list[tuple[str, str, str]],
|
||||
interval_s: float,
|
||||
output_path: Path,
|
||||
start_time: float,
|
||||
include_sessions: bool,
|
||||
) -> None:
|
||||
output_path.parent.mkdir(parents=True, exist_ok=True)
|
||||
with output_path.open("w", encoding="utf-8") as handle:
|
||||
try:
|
||||
while True:
|
||||
tick_started = time.perf_counter()
|
||||
ts = time.time()
|
||||
wall_s = tick_started - start_time
|
||||
snapshots = await asyncio.gather(
|
||||
*(
|
||||
_query_pool_snapshot(
|
||||
client=client,
|
||||
server_url=url,
|
||||
include_sessions=include_sessions,
|
||||
)
|
||||
for _, _, url in workers
|
||||
),
|
||||
return_exceptions=True,
|
||||
)
|
||||
for (worker_id, role, url), snap in zip(workers, snapshots):
|
||||
if isinstance(snap, BaseException):
|
||||
row: dict[str, Any] = {
|
||||
"ts": ts,
|
||||
"wall_s": wall_s,
|
||||
"worker_id": worker_id,
|
||||
"worker_role": role,
|
||||
"worker_url": url,
|
||||
"error": type(snap).__name__,
|
||||
}
|
||||
else:
|
||||
row = {
|
||||
"ts": ts,
|
||||
"wall_s": wall_s,
|
||||
"worker_id": worker_id,
|
||||
"worker_role": role,
|
||||
"worker_url": url,
|
||||
**snap,
|
||||
}
|
||||
handle.write(json.dumps(row, sort_keys=True) + "\n")
|
||||
handle.flush()
|
||||
elapsed = time.perf_counter() - tick_started
|
||||
sleep_s = interval_s - elapsed
|
||||
if sleep_s > 0:
|
||||
await asyncio.sleep(sleep_s)
|
||||
except asyncio.CancelledError:
|
||||
return
|
||||
|
||||
|
||||
async def _query_decode_direct_admission(
|
||||
*,
|
||||
client: httpx.AsyncClient,
|
||||
@@ -600,7 +914,13 @@ async def _query_decode_direct_admission(
|
||||
session_id: str,
|
||||
uncached_input_tokens: int,
|
||||
output_tokens: int,
|
||||
mode: str = "direct_append",
|
||||
config: "ReplayConfig | None" = None,
|
||||
residency: "DecodeResidencyState | None" = None,
|
||||
request_id: str | None = None,
|
||||
turn_id: int | None = None,
|
||||
) -> dict[str, Any]:
|
||||
started = time.perf_counter()
|
||||
try:
|
||||
response = await client.post(
|
||||
f"{server_url.rstrip('/')}/session_cache/admit_direct_append",
|
||||
@@ -608,25 +928,103 @@ async def _query_decode_direct_admission(
|
||||
"session_id": session_id,
|
||||
"uncached_input_tokens": max(0, uncached_input_tokens),
|
||||
"output_tokens": max(0, output_tokens),
|
||||
"mode": mode,
|
||||
},
|
||||
timeout=_ADMISSION_PROBE_TIMEOUT_S,
|
||||
)
|
||||
response.raise_for_status()
|
||||
payload = response.json()
|
||||
if isinstance(payload, dict):
|
||||
return payload
|
||||
except Exception:
|
||||
pass
|
||||
return {
|
||||
"can_admit": False,
|
||||
"resident": False,
|
||||
"reason": "admission-query-failed",
|
||||
"required_tokens": 0,
|
||||
"available_tokens_before": 0,
|
||||
"available_tokens_after": 0,
|
||||
"evicted_session_count": 0,
|
||||
"freed_tokens": 0,
|
||||
}
|
||||
if not isinstance(payload, dict):
|
||||
payload = None
|
||||
except Exception as exc:
|
||||
payload = None
|
||||
_last_exc_msg = type(exc).__name__
|
||||
else:
|
||||
_last_exc_msg = None
|
||||
|
||||
if payload is None:
|
||||
payload = {
|
||||
"can_admit": False,
|
||||
"resident": False,
|
||||
"reason": "admission-query-failed",
|
||||
"required_tokens": 0,
|
||||
"available_tokens_before": 0,
|
||||
"available_tokens_after": 0,
|
||||
"evicted_session_count": 0,
|
||||
"freed_tokens": 0,
|
||||
}
|
||||
|
||||
rtt_s = time.perf_counter() - started
|
||||
pause_ms = int(payload.get("recommended_pause_ms", 0) or 0)
|
||||
|
||||
# Update per-D pause window when backpressure is enabled.
|
||||
if (
|
||||
config is not None
|
||||
and residency is not None
|
||||
and config.enable_backpressure
|
||||
and pause_ms > 0
|
||||
):
|
||||
max_pause_s = max(0.0, config.backpressure_max_pause_s)
|
||||
applied_pause_s = min(pause_ms / 1000.0, max_pause_s)
|
||||
new_until = time.perf_counter() + applied_pause_s
|
||||
prev = residency.pause_until_s.get(server_url, 0.0)
|
||||
if new_until > prev:
|
||||
residency.pause_until_s[server_url] = new_until
|
||||
|
||||
# Always emit admission event for analysis (even if backpressure disabled).
|
||||
await _structural_emit(
|
||||
"admission-events.jsonl",
|
||||
{
|
||||
"server_url": server_url,
|
||||
"session_id": session_id,
|
||||
"request_id": request_id,
|
||||
"turn_id": turn_id,
|
||||
"mode": mode,
|
||||
"rtt_s": round(rtt_s, 4),
|
||||
"can_admit": bool(payload.get("can_admit")),
|
||||
"resident": bool(payload.get("resident")),
|
||||
"reason": payload.get("reason"),
|
||||
"queue_depth": int(payload.get("decode_transfer_queue_reqs", 0) or 0),
|
||||
"retracted_depth": int(payload.get("decode_retracted_queue_reqs", 0) or 0),
|
||||
"available_tokens_after": int(payload.get("available_tokens_after", 0) or 0),
|
||||
"token_usage": float(payload.get("token_usage", 0.0) or 0.0),
|
||||
"evicted_session_count": int(payload.get("evicted_session_count", 0) or 0),
|
||||
"recommended_pause_ms": pause_ms,
|
||||
"uncached_input_tokens": int(uncached_input_tokens),
|
||||
"output_tokens": int(output_tokens),
|
||||
},
|
||||
)
|
||||
return payload
|
||||
|
||||
|
||||
async def _wait_for_decode_pause(
|
||||
*,
|
||||
config: "ReplayConfig",
|
||||
residency: "DecodeResidencyState",
|
||||
server_url: str,
|
||||
request_id: str | None = None,
|
||||
session_id: str | None = None,
|
||||
) -> None:
|
||||
if not config.enable_backpressure:
|
||||
return
|
||||
until = residency.pause_until_s.get(server_url, 0.0)
|
||||
if until <= 0:
|
||||
return
|
||||
now = time.perf_counter()
|
||||
if now >= until:
|
||||
return
|
||||
sleep_s = min(until - now, config.backpressure_max_pause_s)
|
||||
await _structural_emit(
|
||||
"backpressure-events.jsonl",
|
||||
{
|
||||
"server_url": server_url,
|
||||
"session_id": session_id,
|
||||
"request_id": request_id,
|
||||
"sleep_s": round(sleep_s, 4),
|
||||
"until_offset_s": round(until - _STRUCTURAL_RUN_START_S, 4),
|
||||
},
|
||||
)
|
||||
await asyncio.sleep(sleep_s)
|
||||
|
||||
|
||||
async def _discover_decode_residency(
|
||||
@@ -850,8 +1248,8 @@ def _decode_session_soft_cap(
|
||||
- residency.headroom_tokens.get(server_url, 0),
|
||||
)
|
||||
if usable_capacity_tokens <= 0:
|
||||
return 4
|
||||
return max(1, min(4, usable_capacity_tokens // target_tokens))
|
||||
return 16
|
||||
return max(1, min(16, usable_capacity_tokens // target_tokens))
|
||||
|
||||
|
||||
def _should_admit_new_decode_session(
|
||||
@@ -862,6 +1260,7 @@ def _should_admit_new_decode_session(
|
||||
session: DirectSessionState,
|
||||
direct_sessions: dict[str, DirectSessionState],
|
||||
treat_as_fresh_session: bool,
|
||||
admission_mode: KvCacheAdmissionMode = "router",
|
||||
) -> bool:
|
||||
if (
|
||||
not treat_as_fresh_session
|
||||
@@ -869,6 +1268,11 @@ def _should_admit_new_decode_session(
|
||||
and session.server_url == server_url
|
||||
):
|
||||
return True
|
||||
if admission_mode == "worker":
|
||||
# Defer the capacity decision to D's admit_direct_append (mode=seed),
|
||||
# which checks real KV pool availability and runs LRU eviction. The
|
||||
# local soft cap is router-mode only.
|
||||
return True
|
||||
open_sessions = sum(
|
||||
1
|
||||
for candidate in direct_sessions.values()
|
||||
@@ -975,6 +1379,49 @@ def _is_stale_decode_session_error(exc: Exception) -> bool:
|
||||
)
|
||||
|
||||
|
||||
# execution_mode substrings that signal D-side admission rejected this request.
|
||||
# Used by _run_request to update state.session_d_rejects so KvAwarePolicy can
|
||||
# migrate persistently-starved sessions to a different D next turn.
|
||||
_ADMISSION_REJECTION_SUBSTRINGS = (
|
||||
"session-cap",
|
||||
"no-d-capacity",
|
||||
"d-backpressure",
|
||||
)
|
||||
|
||||
|
||||
def _is_admission_rejection_mode(execution_mode: str) -> bool:
|
||||
return any(token in execution_mode for token in _ADMISSION_REJECTION_SUBSTRINGS)
|
||||
|
||||
|
||||
def _fallthrough_reason(
|
||||
*,
|
||||
request: TraceRequest,
|
||||
config: ReplayConfig,
|
||||
decision,
|
||||
direct_append_length: int | None,
|
||||
direct_session_reused: bool,
|
||||
direct_session_reset: bool,
|
||||
) -> str:
|
||||
"""Classify why a turn-2+ KVC request fell through to the seed/large-append branch.
|
||||
|
||||
Returns a short label suffix used in execution_mode strings to replace the
|
||||
misleading 'large-append' label (TEAM_REPORT §2.7). In particular,
|
||||
'session-not-resident' is the §1 starvation signature — direct_session_reused
|
||||
is False because the session was never opened on the policy-chosen D.
|
||||
"""
|
||||
if not direct_session_reused:
|
||||
return "session-not-resident"
|
||||
if direct_session_reset:
|
||||
return "session-was-evicted"
|
||||
if direct_append_length is None:
|
||||
return "no-direct-info"
|
||||
if direct_append_length > config.kvcache_direct_max_uncached_tokens:
|
||||
return "real-large-append"
|
||||
if not _should_bypass_prefill(request=request, config=config, decision=decision):
|
||||
return "policy-no-bypass"
|
||||
return "other-large-append"
|
||||
|
||||
|
||||
def _dynamic_decode_headroom_tokens(
|
||||
*,
|
||||
residency: DecodeResidencyState,
|
||||
@@ -1280,6 +1727,11 @@ async def _reserve_decode_session_capacity(
|
||||
session_id=session.session_id,
|
||||
uncached_input_tokens=max(0, request.input_length - current_tokens),
|
||||
output_tokens=request.output_length,
|
||||
mode="direct_append",
|
||||
config=config,
|
||||
residency=residency,
|
||||
request_id=request.request_id,
|
||||
turn_id=request.turn_id,
|
||||
)
|
||||
if not bool(admission.get("resident")):
|
||||
return False, 0, 0, 0, str(admission.get("reason") or "d-session-not-resident")
|
||||
@@ -1304,6 +1756,45 @@ async def _reserve_decode_session_capacity(
|
||||
None,
|
||||
)
|
||||
|
||||
# Seed / reseed path: ask D itself via the seed-mode admission endpoint
|
||||
# instead of estimating capacity from a stale router-state snapshot. D
|
||||
# will run LRU eviction internally to make room. Falls through to the
|
||||
# legacy router-state logic below if the endpoint is unavailable.
|
||||
seed_admission = await _query_decode_direct_admission(
|
||||
client=client,
|
||||
server_url=server_url,
|
||||
session_id=session.session_id,
|
||||
uncached_input_tokens=max(0, request.input_length - current_tokens),
|
||||
output_tokens=request.output_length,
|
||||
mode="seed",
|
||||
config=config,
|
||||
residency=residency,
|
||||
request_id=request.request_id,
|
||||
turn_id=request.turn_id,
|
||||
)
|
||||
seed_reason = seed_admission.get("reason")
|
||||
if seed_reason != "admission-query-failed":
|
||||
if not bool(seed_admission.get("can_admit")):
|
||||
return (
|
||||
False,
|
||||
0,
|
||||
int(seed_admission.get("evicted_session_count", 0) or 0),
|
||||
0,
|
||||
str(seed_reason or "d-no-space"),
|
||||
)
|
||||
reserved_tokens = int(
|
||||
seed_admission.get("required_tokens", required_extra_tokens)
|
||||
or required_extra_tokens
|
||||
)
|
||||
_add_reserved_tokens(residency, server_url, reserved_tokens)
|
||||
return (
|
||||
True,
|
||||
reserved_tokens,
|
||||
int(seed_admission.get("evicted_session_count", 0) or 0),
|
||||
0,
|
||||
None,
|
||||
)
|
||||
|
||||
session_cache, max_total_num_tokens, reserved_decode_tokens = (
|
||||
await _fetch_decode_server_state(
|
||||
client=client,
|
||||
@@ -1582,29 +2073,34 @@ async def _invoke_plain_router(
|
||||
config: ReplayConfig,
|
||||
decision,
|
||||
execution_mode: str,
|
||||
decode_residency: "DecodeResidencyState | None" = None,
|
||||
) -> ExecutionResult:
|
||||
prefill_priority = _prefill_priority_for_router_request(
|
||||
config=config,
|
||||
direct_to_d_predicted=False,
|
||||
)
|
||||
latency_s, ttft_s, tpot_s, cached_tokens = await _invoke_router(
|
||||
gen = await _invoke_router(
|
||||
client=client,
|
||||
request=request,
|
||||
config=config,
|
||||
decode_worker_index=decision.decode_worker_index,
|
||||
prefill_request_priority=prefill_priority,
|
||||
decode_residency=decode_residency,
|
||||
)
|
||||
return ExecutionResult(
|
||||
execution_mode=execution_mode,
|
||||
actual_kv_transfer_blocks=decision.kv_transfer_blocks,
|
||||
effective_input_length=request.input_length,
|
||||
cached_tokens=cached_tokens,
|
||||
cached_tokens=gen.cached_tokens,
|
||||
prefill_request_priority=prefill_priority,
|
||||
session_reused=False,
|
||||
session_reset=False,
|
||||
latency_s=latency_s,
|
||||
ttft_s=ttft_s,
|
||||
tpot_s=tpot_s,
|
||||
latency_s=gen.latency_s,
|
||||
ttft_s=gen.ttft_s,
|
||||
tpot_s=gen.tpot_s,
|
||||
actual_output_tokens=gen.actual_output_tokens,
|
||||
requested_output_tokens=gen.requested_output_tokens,
|
||||
finish_reason=gen.finish_reason,
|
||||
)
|
||||
|
||||
|
||||
@@ -1676,13 +2172,14 @@ async def _invoke_kvcache_seeded_router(
|
||||
decode_session.opened = True
|
||||
decode_session_newly_opened = True
|
||||
decode_session.active_requests += 1
|
||||
latency_s, ttft_s, tpot_s, cached_tokens = await _invoke_router(
|
||||
gen = await _invoke_router(
|
||||
client=client,
|
||||
request=request,
|
||||
config=config,
|
||||
decode_worker_index=decision.decode_worker_index,
|
||||
session_id=request.session_id,
|
||||
prefill_request_priority=prefill_priority,
|
||||
decode_residency=decode_residency,
|
||||
)
|
||||
except Exception:
|
||||
async with direct_session_lock:
|
||||
@@ -1742,13 +2239,16 @@ async def _invoke_kvcache_seeded_router(
|
||||
execution_mode=execution_mode,
|
||||
actual_kv_transfer_blocks=decision.kv_transfer_blocks,
|
||||
effective_input_length=request.input_length,
|
||||
cached_tokens=cached_tokens,
|
||||
cached_tokens=gen.cached_tokens,
|
||||
prefill_request_priority=prefill_priority,
|
||||
session_reused=False,
|
||||
session_reset=False,
|
||||
latency_s=latency_s,
|
||||
ttft_s=ttft_s,
|
||||
tpot_s=tpot_s,
|
||||
latency_s=gen.latency_s,
|
||||
ttft_s=gen.ttft_s,
|
||||
tpot_s=gen.tpot_s,
|
||||
actual_output_tokens=gen.actual_output_tokens,
|
||||
requested_output_tokens=gen.requested_output_tokens,
|
||||
finish_reason=gen.finish_reason,
|
||||
)
|
||||
|
||||
|
||||
@@ -1771,17 +2271,21 @@ async def _execute_request(
|
||||
config=config,
|
||||
decision=decision,
|
||||
execution_mode="pd-disaggregation-router",
|
||||
decode_residency=decode_residency,
|
||||
)
|
||||
|
||||
if config.mechanism_name == "pd-colo":
|
||||
return await _invoke_direct(
|
||||
if not config.router_url:
|
||||
raise ValueError("router_url is required for pd-colo replay")
|
||||
result = await _invoke_plain_router(
|
||||
client=client,
|
||||
request=request,
|
||||
config=config,
|
||||
decision=decision,
|
||||
direct_sessions=direct_sessions,
|
||||
direct_session_lock=direct_session_lock,
|
||||
execution_mode="dp-colo-router",
|
||||
decode_residency=decode_residency,
|
||||
)
|
||||
return replace(result, actual_kv_transfer_blocks=0)
|
||||
|
||||
if config.mechanism_name == "kvcache-centric":
|
||||
if not config.router_url:
|
||||
@@ -1838,6 +2342,7 @@ async def _execute_request(
|
||||
config=config,
|
||||
decision=decision,
|
||||
execution_mode=f"pd-router-turn1-{seed_filter_reason}",
|
||||
decode_residency=decode_residency,
|
||||
)
|
||||
async with direct_session_lock:
|
||||
admit_new_decode_session = _should_admit_new_decode_session(
|
||||
@@ -1847,6 +2352,7 @@ async def _execute_request(
|
||||
session=decode_session,
|
||||
direct_sessions=direct_sessions,
|
||||
treat_as_fresh_session=True,
|
||||
admission_mode=config.kvcache_admission_mode,
|
||||
)
|
||||
if not admit_new_decode_session:
|
||||
can_seed = False
|
||||
@@ -1874,6 +2380,7 @@ async def _execute_request(
|
||||
config=config,
|
||||
decision=decision,
|
||||
execution_mode="pd-router-turn1-session-cap",
|
||||
decode_residency=decode_residency,
|
||||
)
|
||||
if can_seed:
|
||||
return await _invoke_kvcache_seeded_router(
|
||||
@@ -1899,6 +2406,7 @@ async def _execute_request(
|
||||
if seed_reason is not None and seed_reason != "d-no-space"
|
||||
else "pd-router-turn1-no-d-capacity"
|
||||
),
|
||||
decode_residency=decode_residency,
|
||||
)
|
||||
|
||||
if (
|
||||
@@ -1970,6 +2478,7 @@ async def _execute_request(
|
||||
config=config,
|
||||
decision=decision,
|
||||
execution_mode="pd-router-fallback-stale-d-session",
|
||||
decode_residency=decode_residency,
|
||||
)
|
||||
if _is_decode_backpressure_reason(direct_reason):
|
||||
return await _invoke_plain_router(
|
||||
@@ -1978,6 +2487,7 @@ async def _execute_request(
|
||||
config=config,
|
||||
decision=decision,
|
||||
execution_mode="pd-router-fallback-d-backpressure",
|
||||
decode_residency=decode_residency,
|
||||
)
|
||||
|
||||
seed_filter_reason = _seed_filter_reason(
|
||||
@@ -1992,6 +2502,7 @@ async def _execute_request(
|
||||
config=config,
|
||||
decision=decision,
|
||||
execution_mode=f"pd-router-fallback-{seed_filter_reason}",
|
||||
decode_residency=decode_residency,
|
||||
)
|
||||
async with direct_session_lock:
|
||||
admit_new_decode_session = _should_admit_new_decode_session(
|
||||
@@ -2001,6 +2512,7 @@ async def _execute_request(
|
||||
session=decode_session,
|
||||
direct_sessions=direct_sessions,
|
||||
treat_as_fresh_session=True,
|
||||
admission_mode=config.kvcache_admission_mode,
|
||||
)
|
||||
if not admit_new_decode_session:
|
||||
can_seed = False
|
||||
@@ -2036,6 +2548,7 @@ async def _execute_request(
|
||||
config=config,
|
||||
decision=decision,
|
||||
execution_mode="pd-router-fallback-session-cap",
|
||||
decode_residency=decode_residency,
|
||||
)
|
||||
if can_seed:
|
||||
return await _invoke_kvcache_seeded_router(
|
||||
@@ -2067,8 +2580,20 @@ async def _execute_request(
|
||||
if _is_decode_backpressure_reason(seed_reason)
|
||||
else "pd-router-fallback-no-d-capacity"
|
||||
),
|
||||
decode_residency=decode_residency,
|
||||
)
|
||||
|
||||
# TEAM_REPORT §2.7: 'large-append' is misleading — most fallthroughs are
|
||||
# actually 'session-not-resident-on-pinned-D' (§1 starvation). Classify
|
||||
# the real reason and embed it in the execution_mode label.
|
||||
fallthrough = _fallthrough_reason(
|
||||
request=request,
|
||||
config=config,
|
||||
decision=decision,
|
||||
direct_append_length=direct_append_length,
|
||||
direct_session_reused=direct_session_reused,
|
||||
direct_session_reset=direct_session_reset,
|
||||
)
|
||||
seed_filter_reason = _seed_filter_reason(
|
||||
request=request,
|
||||
config=config,
|
||||
@@ -2080,7 +2605,8 @@ async def _execute_request(
|
||||
client=client,
|
||||
config=config,
|
||||
decision=decision,
|
||||
execution_mode=f"pd-router-fallback-large-append-{seed_filter_reason}",
|
||||
execution_mode=f"pd-router-fallback-{fallthrough}-{seed_filter_reason}",
|
||||
decode_residency=decode_residency,
|
||||
)
|
||||
async with direct_session_lock:
|
||||
admit_new_decode_session = _should_admit_new_decode_session(
|
||||
@@ -2124,7 +2650,8 @@ async def _execute_request(
|
||||
client=client,
|
||||
config=config,
|
||||
decision=decision,
|
||||
execution_mode="pd-router-fallback-large-append-session-cap",
|
||||
execution_mode=f"pd-router-fallback-{fallthrough}-session-cap",
|
||||
decode_residency=decode_residency,
|
||||
)
|
||||
if can_seed:
|
||||
return await _invoke_kvcache_seeded_router(
|
||||
@@ -2139,23 +2666,28 @@ async def _execute_request(
|
||||
decode_residency=decode_residency,
|
||||
reserved_tokens=reserved_tokens,
|
||||
execution_mode=(
|
||||
"pd-router-large-append-reseed"
|
||||
f"pd-router-{fallthrough}-reseed"
|
||||
+ _eviction_suffix(
|
||||
evicted_sessions,
|
||||
prefill_backed_evictions,
|
||||
)
|
||||
),
|
||||
)
|
||||
# Preserve seed_reason in the label so migration feedback fires for
|
||||
# 'd-no-space' / 'd-*-backpressure' (matched via _is_admission_rejection_mode).
|
||||
if _is_decode_backpressure_reason(seed_reason):
|
||||
mode_label = f"pd-router-fallback-{fallthrough}-d-backpressure"
|
||||
elif seed_reason == "d-no-space":
|
||||
mode_label = f"pd-router-fallback-{fallthrough}-no-d-capacity"
|
||||
else:
|
||||
mode_label = f"pd-router-fallback-{fallthrough}"
|
||||
return await _invoke_plain_router(
|
||||
request=request,
|
||||
client=client,
|
||||
config=config,
|
||||
decision=decision,
|
||||
execution_mode=(
|
||||
"pd-router-fallback-d-backpressure"
|
||||
if _is_decode_backpressure_reason(seed_reason)
|
||||
else "pd-router-fallback-large-append"
|
||||
),
|
||||
execution_mode=mode_label,
|
||||
decode_residency=decode_residency,
|
||||
)
|
||||
|
||||
raise ValueError(f"Unsupported mechanism: {config.mechanism_name}")
|
||||
@@ -2201,6 +2733,14 @@ async def _invoke_session_direct(
|
||||
reserved_tokens: int = 0,
|
||||
direct_session_lock: asyncio.Lock | None = None,
|
||||
) -> ExecutionResult:
|
||||
if decode_residency is not None and config.enable_backpressure:
|
||||
await _wait_for_decode_pause(
|
||||
config=config,
|
||||
residency=decode_residency,
|
||||
server_url=session.server_url,
|
||||
request_id=request.request_id,
|
||||
session_id=session.session_id,
|
||||
)
|
||||
_prompt, effective_input_length, session_reused, session_reset = _build_direct_prompt(
|
||||
request=request,
|
||||
session=session,
|
||||
@@ -2238,7 +2778,7 @@ async def _invoke_session_direct(
|
||||
session.active_requests += 1
|
||||
|
||||
try:
|
||||
latency_s, ttft_s, tpot_s, cached_tokens = await _invoke_generate(
|
||||
gen = await _invoke_generate(
|
||||
client=client,
|
||||
base_url=session.server_url,
|
||||
headers={"x-request-id": request.request_id},
|
||||
@@ -2277,12 +2817,15 @@ async def _invoke_session_direct(
|
||||
execution_mode=execution_mode,
|
||||
actual_kv_transfer_blocks=0,
|
||||
effective_input_length=len(input_ids),
|
||||
cached_tokens=cached_tokens,
|
||||
cached_tokens=gen.cached_tokens,
|
||||
session_reused=session_reused,
|
||||
session_reset=session_reset,
|
||||
latency_s=latency_s,
|
||||
ttft_s=ttft_s,
|
||||
tpot_s=tpot_s,
|
||||
latency_s=gen.latency_s,
|
||||
ttft_s=gen.ttft_s,
|
||||
tpot_s=gen.tpot_s,
|
||||
actual_output_tokens=gen.actual_output_tokens,
|
||||
requested_output_tokens=gen.requested_output_tokens,
|
||||
finish_reason=gen.finish_reason,
|
||||
)
|
||||
|
||||
|
||||
|
||||
@@ -66,6 +66,7 @@ def launch_pd_stack(
|
||||
timeout_s: float = 1200.0,
|
||||
router_request_timeout_s: float | None = None,
|
||||
include_router: bool = True,
|
||||
naive_dp: bool = False,
|
||||
) -> ManagedPdStack:
|
||||
run_dir.mkdir(parents=True, exist_ok=True)
|
||||
logs_dir = run_dir / "logs"
|
||||
@@ -77,6 +78,7 @@ def launch_pd_stack(
|
||||
decode_policy=decode_policy,
|
||||
include_router=include_router,
|
||||
router_request_timeout_s=router_request_timeout_s,
|
||||
naive_dp=naive_dp,
|
||||
)
|
||||
|
||||
prefill_processes = [
|
||||
@@ -195,6 +197,17 @@ def _build_process_env(topology: SingleNodeTopology) -> dict[str, str]:
|
||||
env["MC_MS_AUTO_DISC"] = "0"
|
||||
if topology.ib_device:
|
||||
env["MOONCAKE_DEVICE"] = topology.ib_device
|
||||
elif topology.transfer_backend == "mooncake":
|
||||
# Default to TCP when RDMA is not forced (e.g. loopback on same node)
|
||||
env.setdefault("MOONCAKE_PROTOCOL", "tcp")
|
||||
|
||||
# Mooncake C++ batch_transfer_sync default timeout is 30 s, which can
|
||||
# fire as a false positive when a saturated D scheduler thread is busy
|
||||
# with LRU eviction (see docs/E1_E2_RESULTS_ZH.md §5c). Default to 1800 s
|
||||
# so the hair-trigger blacklist in conn.py:1270 doesn't latch on
|
||||
# transient stalls. Caller can override via shell env (setup_env.sh).
|
||||
if topology.transfer_backend == "mooncake":
|
||||
env.setdefault("MC_TRANSFER_TIMEOUT", "1800")
|
||||
|
||||
repo_root = Path(__file__).resolve().parents[2]
|
||||
python_paths = [
|
||||
|
||||
39
tests/README.md
Normal file
39
tests/README.md
Normal file
@@ -0,0 +1,39 @@
|
||||
# Tests
|
||||
|
||||
Pure-Python unit + property tests for the algorithm layer. These tests do
|
||||
**not** import SGLang and do **not** need a GPU — they validate the routing
|
||||
algorithm (Algorithm 1/2/3 in `docs/KVC_ROUTER_ALGORITHM.md`) and its
|
||||
theorems against the pure functions extracted from `policies.py`.
|
||||
|
||||
## Run
|
||||
|
||||
```bash
|
||||
uv sync --group test
|
||||
uv run pytest
|
||||
```
|
||||
|
||||
Or, without uv:
|
||||
|
||||
```bash
|
||||
pip install pytest
|
||||
PYTHONPATH=src pytest tests
|
||||
```
|
||||
|
||||
## Scope
|
||||
|
||||
- `test_policy_scoring.py` — Algorithm 1 lex-score properties (overlap
|
||||
dominates sticky, load-floor gating, tie-breakers).
|
||||
- `test_no_starvation.py` — Theorem 1: bounded retries before some D either
|
||||
accepts or the least-rejected D is forced through the degenerate path.
|
||||
|
||||
Future:
|
||||
- block-level eviction `MockRadixCache` tests (see
|
||||
`docs/BLOCK_LEVEL_EVICTION_DESIGN_ZH.md` §5).
|
||||
- D→P sync `staleness_budget` property tests (see
|
||||
`docs/D_TO_P_SYNC_CONTRACT_ZH.md` §1).
|
||||
|
||||
## Why no integration tests here
|
||||
|
||||
Anything that needs SGLang, mooncake, or a real model is an integration
|
||||
test and must run on hardware. Those tests live as `scripts/sweep_*.sh`
|
||||
under the evaluation protocol in `docs/EVALUATION_PROTOCOL_ZH.md`.
|
||||
0
tests/__init__.py
Normal file
0
tests/__init__.py
Normal file
66
tests/_fixtures.py
Normal file
66
tests/_fixtures.py
Normal file
@@ -0,0 +1,66 @@
|
||||
"""Lightweight fixtures for algorithm-layer tests.
|
||||
|
||||
Builds minimal TraceRequest / SingleNodeTopology / RoutingState instances
|
||||
without invoking build_single_node_topology() (which validates GPU budgets
|
||||
we don't care about in unit tests).
|
||||
"""
|
||||
|
||||
from __future__ import annotations
|
||||
|
||||
from agentic_pd_hybrid.topology import SingleNodeTopology, WorkerSpec
|
||||
from agentic_pd_hybrid.trace import TraceRequest
|
||||
|
||||
|
||||
def make_topology(decode_count: int = 3, prefill_count: int = 1) -> SingleNodeTopology:
|
||||
prefill_workers = tuple(
|
||||
WorkerSpec(
|
||||
role="prefill",
|
||||
ordinal=i,
|
||||
gpu_ids=(i,),
|
||||
host="127.0.0.1",
|
||||
port=30000 + i,
|
||||
)
|
||||
for i in range(prefill_count)
|
||||
)
|
||||
decode_workers = tuple(
|
||||
WorkerSpec(
|
||||
role="decode",
|
||||
ordinal=i,
|
||||
gpu_ids=(prefill_count + i,),
|
||||
host="127.0.0.1",
|
||||
port=31000 + i,
|
||||
)
|
||||
for i in range(decode_count)
|
||||
)
|
||||
return SingleNodeTopology(
|
||||
model_path="/dev/null/test-model",
|
||||
prefill_workers=prefill_workers,
|
||||
decode_workers=decode_workers,
|
||||
direct_workers=(),
|
||||
router_host="127.0.0.1",
|
||||
router_port=8000,
|
||||
transfer_backend="mooncake",
|
||||
trust_remote_code=True,
|
||||
)
|
||||
|
||||
|
||||
def make_request(
|
||||
*,
|
||||
session_id: str = "sess-1",
|
||||
turn_id: int = 0,
|
||||
hash_ids: tuple[int, ...] = (),
|
||||
input_length: int = 1024,
|
||||
output_length: int = 64,
|
||||
) -> TraceRequest:
|
||||
return TraceRequest(
|
||||
request_id=f"{session_id}-t{turn_id}",
|
||||
session_id=session_id,
|
||||
chat_id=int(turn_id),
|
||||
parent_chat_id=-1 if turn_id == 0 else int(turn_id - 1),
|
||||
timestamp_s=float(turn_id),
|
||||
input_length=input_length,
|
||||
output_length=output_length,
|
||||
request_type="user",
|
||||
turn_id=turn_id,
|
||||
hash_ids=hash_ids,
|
||||
)
|
||||
150
tests/test_no_starvation.py
Normal file
150
tests/test_no_starvation.py
Normal file
@@ -0,0 +1,150 @@
|
||||
"""Theorem 1 — no permanent starvation under bounded retries.
|
||||
|
||||
Reference: docs/KVC_ROUTER_ALGORITHM.md §4.1.
|
||||
|
||||
For any session s with τ_reject ≥ 1, after at most |D| · τ_reject
|
||||
consecutive admission rejects on s, the routing policy MUST still
|
||||
return a valid decision (via the degenerate "least-rejected D"
|
||||
fallback). The session cannot be permanently starved at the policy
|
||||
layer.
|
||||
|
||||
We can't exercise the full Dispatch loop here (it lives in replay.py and
|
||||
needs HTTP, mooncake, etc.). What we CAN test is the policy-layer
|
||||
guarantee: after K = |D| · τ_reject reject bumps, select() never raises
|
||||
and never returns a worker that's both blacklisted *and* has positive
|
||||
overlap (the degenerate path chooses by least-rejected).
|
||||
|
||||
This is the property-layer companion to test_policy_scoring.py's
|
||||
quantitative checks.
|
||||
"""
|
||||
|
||||
from __future__ import annotations
|
||||
|
||||
from agentic_pd_hybrid.policies import KvAwarePolicy, RoutingState
|
||||
|
||||
from ._fixtures import make_request, make_topology
|
||||
|
||||
|
||||
def test_select_returns_valid_decision_under_full_blacklist():
|
||||
"""Bump all (s, d) reject counters past τ_reject. select() must still
|
||||
pick a worker (degenerate fallback, no exception, no None)."""
|
||||
topology = make_topology(decode_count=3)
|
||||
state = RoutingState.create(topology)
|
||||
request = make_request(session_id="s-stuck", turn_id=0)
|
||||
policy = KvAwarePolicy(migration_reject_threshold=3)
|
||||
|
||||
# Pre-fill the blacklist for every D.
|
||||
for worker in topology.route_workers:
|
||||
for _ in range(3):
|
||||
state.record_admission_reject(request.session_id, worker.worker_id)
|
||||
|
||||
decision = policy.select(request=request, topology=topology, state=state)
|
||||
assert decision.decode_worker_id is not None
|
||||
assert decision.decode_worker_id in {w.worker_id for w in topology.route_workers}
|
||||
|
||||
|
||||
def test_bounded_retries_to_force_degenerate_path():
|
||||
"""Theorem 1: at most |D| · τ_reject rejects suffice to either exhaust
|
||||
every D or to force the degenerate fallback. Simulate the worst case
|
||||
where each retry picks a fresh D and is immediately rejected."""
|
||||
topology = make_topology(decode_count=4)
|
||||
state = RoutingState.create(topology)
|
||||
request = make_request(session_id="s-worst", turn_id=0)
|
||||
threshold = 3
|
||||
policy = KvAwarePolicy(migration_reject_threshold=threshold)
|
||||
|
||||
seen_decoders: set[str] = set()
|
||||
max_retries = len(topology.route_workers) * threshold
|
||||
|
||||
for retry in range(max_retries):
|
||||
decision = policy.select(request=request, topology=topology, state=state)
|
||||
seen_decoders.add(decision.decode_worker_id)
|
||||
# Adversary: this D rejects this session.
|
||||
state.record_admission_reject(request.session_id, decision.decode_worker_id)
|
||||
|
||||
# After |D|·τ_reject rejects every D must be blacklisted, so the next
|
||||
# select() takes the degenerate "least-rejected" branch and STILL
|
||||
# returns a valid worker.
|
||||
final = policy.select(request=request, topology=topology, state=state)
|
||||
assert final.decode_worker_id in {w.worker_id for w in topology.route_workers}
|
||||
# And we should have explored every D over the bounded retries — the
|
||||
# algorithm cannot trap a session on a single D when all are rejecting.
|
||||
assert seen_decoders == {w.worker_id for w in topology.route_workers}
|
||||
|
||||
|
||||
def test_least_rejected_d_chosen_when_all_blacklisted():
|
||||
"""When every D is past threshold, the degenerate fallback chooses the
|
||||
one with the *fewest* rejects (Algorithm 1, line 4)."""
|
||||
topology = make_topology(decode_count=3)
|
||||
state = RoutingState.create(topology)
|
||||
request = make_request(session_id="s-lr", turn_id=0)
|
||||
policy = KvAwarePolicy(migration_reject_threshold=3)
|
||||
|
||||
# Skew rejections: decode-0 has 5, decode-1 has 10, decode-2 has 3.
|
||||
# All are >= threshold=3, so the filter wipes out every candidate.
|
||||
# The fallback should pick decode-2 (smallest rejection count).
|
||||
workers = list(topology.route_workers)
|
||||
bumps = {workers[0].worker_id: 5, workers[1].worker_id: 10, workers[2].worker_id: 3}
|
||||
for wid, n in bumps.items():
|
||||
for _ in range(n):
|
||||
state.record_admission_reject(request.session_id, wid)
|
||||
|
||||
decision = policy.select(request=request, topology=topology, state=state)
|
||||
assert decision.decode_worker_id == workers[2].worker_id
|
||||
|
||||
|
||||
def test_other_session_unaffected_by_blacklist():
|
||||
"""Algorithm 1's filter is per-(session, D), not per-D. Session A's
|
||||
rejects must not influence session B's routing."""
|
||||
topology = make_topology(decode_count=2)
|
||||
state = RoutingState.create(topology)
|
||||
policy = KvAwarePolicy(migration_reject_threshold=3)
|
||||
|
||||
# Blacklist decode-0 for session A.
|
||||
workers = list(topology.route_workers)
|
||||
for _ in range(3):
|
||||
state.record_admission_reject("session-A", workers[0].worker_id)
|
||||
|
||||
# Session B sees a clean slate — should be able to pick decode-0
|
||||
# (which is the iteration-order winner under empty state).
|
||||
decision_b = policy.select(
|
||||
request=make_request(session_id="session-B"),
|
||||
topology=topology,
|
||||
state=state,
|
||||
)
|
||||
# decode-0 wins iteration-order tiebreak when all scores are (0,0,0,0).
|
||||
assert decision_b.decode_worker_id == workers[0].worker_id
|
||||
|
||||
|
||||
def test_threshold_zero_disables_blacklist():
|
||||
"""migration_reject_threshold=0 means the migration mechanism is off:
|
||||
every D stays a candidate regardless of its reject count."""
|
||||
topology = make_topology(decode_count=2)
|
||||
state = RoutingState.create(topology)
|
||||
request = make_request(session_id="s-no-mig")
|
||||
policy = KvAwarePolicy(migration_reject_threshold=0)
|
||||
|
||||
workers = list(topology.route_workers)
|
||||
# Pile a huge number of rejects on decode-0.
|
||||
for _ in range(100):
|
||||
state.record_admission_reject(request.session_id, workers[0].worker_id)
|
||||
|
||||
decision = policy.select(request=request, topology=topology, state=state)
|
||||
# decode-0 should still be eligible; with empty overlap/sticky/inflight,
|
||||
# iteration order picks decode-0 first.
|
||||
assert decision.decode_worker_id == workers[0].worker_id
|
||||
|
||||
|
||||
def test_reject_counter_only_grows_on_record():
|
||||
"""RoutingState.record_admission_reject is the ONLY mutator for the
|
||||
counter. select() must not silently bump it."""
|
||||
topology = make_topology(decode_count=2)
|
||||
state = RoutingState.create(topology)
|
||||
request = make_request(session_id="s-clean")
|
||||
policy = KvAwarePolicy()
|
||||
|
||||
for _ in range(5):
|
||||
policy.select(request=request, topology=topology, state=state)
|
||||
|
||||
# No explicit record_admission_reject -> all counters stay zero.
|
||||
assert sum(state.session_d_rejects.values()) == 0
|
||||
189
tests/test_policy_scoring.py
Normal file
189
tests/test_policy_scoring.py
Normal file
@@ -0,0 +1,189 @@
|
||||
"""Unit tests for Algorithm 1 (KvAwarePolicy score_candidate).
|
||||
|
||||
Reference: docs/KVC_ROUTER_ALGORITHM.md §3.1. The lex-score is
|
||||
|
||||
(overlap + sticky_bonus*sticky + floor_bonus,
|
||||
sticky,
|
||||
-inflight,
|
||||
-assigned)
|
||||
|
||||
These tests pin down the qualitative properties that the algorithm's
|
||||
correctness arguments rely on. They run without SGLang/GPU.
|
||||
"""
|
||||
|
||||
from __future__ import annotations
|
||||
|
||||
from agentic_pd_hybrid.policies import score_candidate
|
||||
|
||||
|
||||
def _score(**overrides):
|
||||
"""Helper: build a score with all defaults and per-test overrides."""
|
||||
args = dict(
|
||||
overlap=0,
|
||||
sticky=False,
|
||||
inflight=0,
|
||||
assigned=0,
|
||||
mean_assigned=0.0,
|
||||
sticky_bonus=1,
|
||||
load_floor_bonus=0,
|
||||
)
|
||||
args.update(overrides)
|
||||
return score_candidate(**args)
|
||||
|
||||
|
||||
# -- Determinism ----------------------------------------------------------------
|
||||
|
||||
|
||||
def test_score_is_pure():
|
||||
"""Same kwargs must produce the same tuple (no hidden state)."""
|
||||
a = _score(overlap=3, sticky=True, inflight=1, assigned=7)
|
||||
b = _score(overlap=3, sticky=True, inflight=1, assigned=7)
|
||||
assert a == b
|
||||
|
||||
|
||||
def test_score_returns_4_tuple():
|
||||
s = _score()
|
||||
assert isinstance(s, tuple)
|
||||
assert len(s) == 4
|
||||
assert all(isinstance(x, int) for x in s)
|
||||
|
||||
|
||||
# -- Primary term: overlap dominates sticky --------------------------------------
|
||||
|
||||
|
||||
def test_overlap_strictly_dominates_pure_sticky():
|
||||
"""Theorem-2 building block: any positive overlap on a non-sticky D wins
|
||||
against a sticky-only D with zero overlap (sticky_bonus=1)."""
|
||||
overlap = _score(overlap=2, sticky=False)
|
||||
sticky_only = _score(overlap=0, sticky=True)
|
||||
assert overlap > sticky_only
|
||||
|
||||
|
||||
def test_overlap_plus_sticky_beats_overlap_alone():
|
||||
"""Two D's with equal overlap: sticky one wins (sticky_bonus contributes
|
||||
to primary AND wins tie-1)."""
|
||||
sticky_d = _score(overlap=5, sticky=True)
|
||||
fresh_d = _score(overlap=5, sticky=False)
|
||||
assert sticky_d > fresh_d
|
||||
|
||||
|
||||
# -- Tie breakers ----------------------------------------------------------------
|
||||
|
||||
|
||||
def test_tiebreaker_inflight_lower_wins():
|
||||
"""Equal primary & sticky: prefer the D with fewer in-flight requests."""
|
||||
low = _score(overlap=3, sticky=False, inflight=0, assigned=10)
|
||||
high = _score(overlap=3, sticky=False, inflight=5, assigned=10)
|
||||
assert low > high
|
||||
|
||||
|
||||
def test_tiebreaker_assigned_lower_wins():
|
||||
"""Equal primary & sticky & inflight: prefer rarely-picked D."""
|
||||
rare = _score(overlap=3, sticky=False, inflight=2, assigned=1)
|
||||
frequent = _score(overlap=3, sticky=False, inflight=2, assigned=99)
|
||||
assert rare > frequent
|
||||
|
||||
|
||||
def test_tiebreaker_strict_lex_order():
|
||||
"""Sticky always beats non-sticky on tie-1 even if non-sticky has lower
|
||||
inflight (the lex order is strict, position 1 outranks positions 2/3)."""
|
||||
sticky_busy = _score(overlap=4, sticky=True, inflight=10, assigned=10)
|
||||
fresh_idle = _score(overlap=4, sticky=False, inflight=0, assigned=0)
|
||||
# Note: with sticky_bonus=1 added to position 0, sticky_busy actually wins
|
||||
# on position 0 first (5 > 4). Force equal primary by lowering sticky's
|
||||
# overlap.
|
||||
sticky_busy_eq_primary = _score(overlap=3, sticky=True, inflight=10, assigned=10)
|
||||
fresh_idle_eq_primary = _score(overlap=4, sticky=False, inflight=0, assigned=0)
|
||||
# Now equal primary (3+1=4 vs 4). Sticky wins position 1.
|
||||
assert sticky_busy_eq_primary > fresh_idle_eq_primary
|
||||
|
||||
|
||||
# -- Load-floor bonus ------------------------------------------------------------
|
||||
|
||||
|
||||
def test_load_floor_disabled_by_default():
|
||||
"""load_floor_bonus=0 → no contribution to primary."""
|
||||
s = _score(overlap=0, sticky=False, mean_assigned=10, assigned=0)
|
||||
assert s[0] == 0
|
||||
|
||||
|
||||
def test_load_floor_gated_off_when_sticky():
|
||||
"""Even with load_floor_bonus>0, sticky D does NOT receive the boost.
|
||||
Otherwise a session would migrate away from its warm D under load."""
|
||||
sticky_under_loaded = _score(
|
||||
overlap=0, sticky=True, mean_assigned=10, assigned=0, load_floor_bonus=200
|
||||
)
|
||||
# primary = overlap(0) + sticky_bonus(1) + floor(0) = 1
|
||||
assert sticky_under_loaded[0] == 1
|
||||
|
||||
|
||||
def test_load_floor_zero_when_mean_zero():
|
||||
"""Warmup case: mean_assigned=0 -> no D gets boost -> degenerate to lex
|
||||
tiebreak by iteration order."""
|
||||
s = _score(
|
||||
overlap=0, sticky=False, mean_assigned=0, assigned=0, load_floor_bonus=200
|
||||
)
|
||||
assert s[0] == 0
|
||||
|
||||
|
||||
def test_load_floor_proportional_to_deficit():
|
||||
"""floor_bonus = K * deficit / mean. assigned=0, mean=10, K=200 -> 200."""
|
||||
s_zero = _score(
|
||||
overlap=0, sticky=False, mean_assigned=10, assigned=0, load_floor_bonus=200
|
||||
)
|
||||
s_half = _score(
|
||||
overlap=0, sticky=False, mean_assigned=10, assigned=5, load_floor_bonus=200
|
||||
)
|
||||
s_full = _score(
|
||||
overlap=0, sticky=False, mean_assigned=10, assigned=10, load_floor_bonus=200
|
||||
)
|
||||
# deficit = max(0, 10-0)=10 -> bonus = int(200*10/10) = 200
|
||||
# deficit = max(0, 10-5)=5 -> bonus = int(200*5/10) = 100
|
||||
# deficit = max(0, 10-10)=0 -> bonus = 0
|
||||
assert s_zero[0] == 200
|
||||
assert s_half[0] == 100
|
||||
assert s_full[0] == 0
|
||||
|
||||
|
||||
def test_load_floor_does_not_underflow_when_overloaded():
|
||||
"""assigned > mean -> deficit clamped to 0, no negative bonus."""
|
||||
s = _score(
|
||||
overlap=0, sticky=False, mean_assigned=10, assigned=50, load_floor_bonus=200
|
||||
)
|
||||
assert s[0] == 0
|
||||
|
||||
|
||||
# -- Routing intent: real overlap beats load-floor bonus -------------------------
|
||||
|
||||
|
||||
def test_real_prefix_overlap_beats_load_floor_on_warm_d():
|
||||
"""E1_E2_FIX_DESIGN_ZH §Q2: load_floor should be set such that
|
||||
real per-session prefix overlap outweighs the cold-D bonus.
|
||||
With overlap=800 (a per-session prefix) and load_floor_bonus=200,
|
||||
a warm D (high overlap, possibly high load) should still win against
|
||||
a cold D with floor bonus."""
|
||||
warm = _score(
|
||||
overlap=800, sticky=True, mean_assigned=10, assigned=10, load_floor_bonus=200
|
||||
)
|
||||
cold = _score(
|
||||
overlap=0, sticky=False, mean_assigned=10, assigned=0, load_floor_bonus=200
|
||||
)
|
||||
# warm primary = 800 + 1 + 0 = 801. cold primary = 0 + 0 + 200 = 200.
|
||||
assert warm[0] == 801
|
||||
assert cold[0] == 200
|
||||
assert warm > cold
|
||||
|
||||
|
||||
def test_boilerplate_overlap_loses_to_load_floor_for_cold_d():
|
||||
"""Same §Q2: load_floor should beat cross-session boilerplate overlap.
|
||||
If load_floor_bonus=200 and the worst-case boilerplate overlap is ~50,
|
||||
a fresh cold D should still win against a slightly-warm-from-boilerplate D."""
|
||||
warm_boilerplate = _score(
|
||||
overlap=50, sticky=False, mean_assigned=10, assigned=10, load_floor_bonus=200
|
||||
)
|
||||
cold_under_loaded = _score(
|
||||
overlap=0, sticky=False, mean_assigned=10, assigned=0, load_floor_bonus=200
|
||||
)
|
||||
# warm_boilerplate primary = 50 + 0 + 0 = 50 (assigned=mean, no deficit).
|
||||
# cold_under_loaded primary = 0 + 0 + 200 = 200.
|
||||
assert cold_under_loaded > warm_boilerplate
|
||||
@@ -189,10 +189,11 @@ class MooncakeTransferEngine:
|
||||
device_name if device_name is not None else "",
|
||||
)
|
||||
else:
|
||||
protocol = os.environ.get("MOONCAKE_PROTOCOL", "rdma")
|
||||
ret_value = self.engine.initialize(
|
||||
hostname,
|
||||
"P2PHANDSHAKE",
|
||||
"rdma",
|
||||
protocol,
|
||||
device_name if device_name is not None else "",
|
||||
)
|
||||
if ret_value != 0:
|
||||
|
||||
@@ -1602,6 +1602,9 @@ class DirectAppendAdmissionReqInput(BaseReq):
|
||||
session_id: str
|
||||
uncached_input_tokens: int
|
||||
output_tokens: int
|
||||
# "direct_append": existing behavior — require session resident on this D
|
||||
# "seed": new admission for session not yet resident; do capacity check + LRU eviction
|
||||
mode: str = "direct_append"
|
||||
|
||||
|
||||
@dataclass
|
||||
@@ -1619,6 +1622,9 @@ class DirectAppendAdmissionReqOutput(BaseReq):
|
||||
decode_prealloc_queue_reqs: int = 0
|
||||
decode_transfer_queue_reqs: int = 0
|
||||
decode_retracted_queue_reqs: int = 0
|
||||
# Backpressure hint: if > 0, the caller should pause this many ms before
|
||||
# sending more requests to this D. Computed from transfer-queue depth.
|
||||
recommended_pause_ms: int = 0
|
||||
|
||||
|
||||
@dataclass
|
||||
|
||||
@@ -1564,6 +1564,74 @@ class ScheduleBatch(ScheduleBatchDisaggregationDecodeMixin):
|
||||
# For DLLM, we use a separate forward mode
|
||||
self.forward_mode = ForwardMode.DLLM_EXTEND
|
||||
|
||||
# Pre-filter pass: drop streaming-session reqs whose committed prefix
|
||||
# already covers fill_ids. The streaming-session correction below would
|
||||
# set extend_input_len = max(0, fill_len - prefix_len) = 0 for these
|
||||
# reqs, but the downstream invariant at the per-req loop
|
||||
# (`assert seq_len - pre_len == req.extend_input_len`) is computed from
|
||||
# raw fill_ids/prefix_indices lengths and has no path to be satisfied
|
||||
# when fill_len < prefix_len. Treat the condition as upstream state
|
||||
# inconsistency, abort the affected reqs (so the client sees an error
|
||||
# response instead of the worker crashing), and continue with the
|
||||
# remaining batch. See docs/E3_FINDINGS_ZH.md for the failure mode
|
||||
# this guards against.
|
||||
if self.reqs:
|
||||
kept_reqs = []
|
||||
for req in self.reqs:
|
||||
if (
|
||||
req.session is not None
|
||||
and req.session.streaming
|
||||
and len(req.fill_ids) < len(req.prefix_indices)
|
||||
):
|
||||
logger.error(
|
||||
"Dropping streaming-session req with fill_ids shorter than "
|
||||
"prefix_indices (rid=%s, session_id=%s, fill_len=%d, "
|
||||
"prefix_len=%d, kv_committed_len=%d). Upstream state "
|
||||
"inconsistency would crash prepare_for_extend's invariant; "
|
||||
"aborting this req. See docs/E3_FINDINGS_ZH.md.",
|
||||
req.rid,
|
||||
req.session.session_id,
|
||||
len(req.fill_ids),
|
||||
len(req.prefix_indices),
|
||||
req.kv_committed_len,
|
||||
)
|
||||
req.finished_reason = FINISH_ABORT(
|
||||
message=(
|
||||
"streaming-session inconsistency: fill_ids "
|
||||
f"({len(req.fill_ids)}) < prefix_indices "
|
||||
f"({len(req.prefix_indices)})"
|
||||
),
|
||||
)
|
||||
else:
|
||||
kept_reqs.append(req)
|
||||
if len(kept_reqs) != len(self.reqs):
|
||||
self.reqs = kept_reqs
|
||||
|
||||
if not self.reqs:
|
||||
# Whole batch filtered. Set empty tensor / list state so
|
||||
# downstream callers (model_runner.forward, batch_result handlers)
|
||||
# see a valid no-op batch and skip the model pass cleanly.
|
||||
_pin = is_pin_memory_available(self.device)
|
||||
empty_long = torch.zeros(0, dtype=torch.int64, pin_memory=_pin).to(
|
||||
self.device, non_blocking=True
|
||||
)
|
||||
empty_int = torch.zeros(0, dtype=torch.int32, pin_memory=_pin).to(
|
||||
self.device, non_blocking=True
|
||||
)
|
||||
self.input_ids = empty_long
|
||||
self.req_pool_indices = empty_int
|
||||
self.seq_lens = empty_long
|
||||
self.seq_lens_cpu = torch.zeros(0, dtype=torch.int64)
|
||||
self.orig_seq_lens = empty_int
|
||||
self.prefix_lens = []
|
||||
self.extend_lens = []
|
||||
self.extend_num_tokens = 0
|
||||
self.out_cache_loc = empty_int
|
||||
self.input_embeds = None
|
||||
self.multimodal_inputs = []
|
||||
self.token_type_ids = None
|
||||
return
|
||||
|
||||
# Init tensors
|
||||
reqs = self.reqs
|
||||
for req in reqs:
|
||||
|
||||
Some files were not shown because too many files have changed in this diff Show More
Reference in New Issue
Block a user