Compare commits
6 Commits
feat/d-to- ... kvc-debug-

| Author | SHA1 | Date |
|---|---|---|
| | 8fc31be605 | |
| | 314c4cda0e | |
| | 722032a13b | |
| | 7590e55189 | |
| | 5a2fb8799c | |
| | 506d360160 | |
4
.gitignore
vendored
@@ -13,6 +13,10 @@ src/*.egg-info
outputs/

# Vendored dependencies. Track only the maintained SGLang fork/snapshot.
# third_party/traces/ holds the replay trace files used by the benchmark
# (~56 MB each) for convenient transfer between hosts; they would otherwise
# live under outputs/ but outputs/ is gitignored.
third_party/*
!third_party/sglang/
!third_party/traces/
*.log

364
docs/ONBOARDING_NEXT_AGENT_ZH.md
Normal file
@@ -0,0 +1,364 @@
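# Onboarding Handbook for the Next Agent

**Audience**: the next SWE/research agent taking over this project
**Goal**: after a 30-minute read, reach the current main agent's level of understanding, so you can run the controlled experiments on your own, read the data, and avoid the known pitfalls
**Author status**: this handbook was finalized at `kvc-debug-journey-v1-to-v4 @ 506d360`; the next working branch is `feat/d-to-p-sync`

---

## 0. Who you are and what you will do (5-line TL;DR)

1. You are taking over **agentic-pd-hybrid**: an LLM serving framework that adds a session-aware KVCache layer on top of SGLang xPyD. The goal is to beat vanilla DP on multi-turn, long-context coding-agent workloads.
2. v2 (migration mechanism + threshold tuning) already **beats 4DP CA** on 6 of 8 latency/TTFT metrics on the SWE-Bench 50sess trace at ts=1, but **loses TTFT p99 by 3×** (1.28s vs 0.43s).
3. The previous agent has diagnosed the root cause of the TTFT p99 tail: 8.3% of requests take the reseed slow path, each of which needs a P-side prefill recompute plus a mooncake transfer, 3-7s in total.
4. **Your task**: on a machine with GPUs and IB RDMA, run 2 controlled experiments to verify (a) the marginal contribution of naive 1P3D + kv-aware relative to KVC, and (b) whether enabling real RDMA brings KVC v2's TTFT p99 down to roughly 0.7s.
5. When the runs finish, push the results to `outputs/`; the main agent will pull them to update the paper draft and the future-work document.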

---

## 1. Required reading (read in this order, **do not skip around**)

### Level 1: the core 30 minutes (**required**; once done you can start working)

| # | Document | Time | Why read it |
|---|---|---:|---|
| 1 | `docs/PROJECT_OVERVIEW.md` | 5min | Project goals, plus the terminology that separates the three mechanisms (pd-disagg / pd-colo / kvcache-centric) |
| 2 | `docs/V2_DEEP_ANALYSIS_ZH.md` §0 (TL;DR) + §6 (production decision) | 10min | The most accurate snapshot of the current state: what v2 wins, what it loses, and why |
| 3 | `docs/KVC_ROUTER_ALGORITHM.md` §1-§3 + §9 | 10min | The formalized algorithms (Algorithm 1/2/3) plus 4 open questions. **§9 OQ#4 is the problem you are working on** |
| 4 | `docs/RESEED_SLOW_PATH_AND_D_TO_P_GAP_ZH.md` §0-§2 | 5min | The full reseed-path timeline (t=0 → t=4550ms), so you know where each segment's latency comes from |

After these 4 documents you can run the experiments. If you are short on time, **read only these 4 plus this handbook**.

### Level 2: advanced (**read only when you hit a specific problem**)

| Document | When to read |
|---|---|
| `docs/REFACTOR_PLAN_V1_ZH.md` | You want to understand why we switched from ts=10 to ts=1 |
| `docs/MIGRATION_V1_FINDINGS_ZH.md` | You want to understand the v1→v2 evolution (why v1 thrashed, and how v2's reset-on-success fixed it) |
| `docs/V2_RESULTS_ZH.md` | The original v2 results report (note: the headline table is slightly optimistic; prefer the revised version in `V2_DEEP_ANALYSIS_ZH.md`) |
| `docs/V2_DEEP_ANALYSIS_ZH.md` §4 (full text) | The paper reviewer's fairness challenges and our rebuttals; required reading when writing the paper |
| `docs/TEAM_REPORT_AGENTIC_PD_HYBRID_ZH.md` | You want to understand the ts=10-era list of structural problems, §1-§9 (many disappear at ts=1, but the underlying mechanisms remain) |

### Level 3: archive (**do not read**; historical baggage)

- `docs/archive/AGENTIC_FIT_ANALYSIS_ZH.md`: early ts=10-era analysis; its conclusions are superseded by the ts=1 data
- `docs/archive/STRUCTURAL_VALIDATION_REPORT_ZH.md`: structural validation on ts=10 data; same as above
- `docs/archive/KVC_DEBUG_JOURNEY_V1_TO_V5.md`: process notes from the v1-v5 tuning sweeps; it is enough to know this file exists
- `docs/archive/V5_PROFILE_INVESTIGATION_ZH.md`: profiling investigation; superseded
- `docs/archive/REFACTOR_PLAN_ZH.md`: the v0 refactor plan; superseded by V1
- `docs/archive/SWEBENCH_EXPERIMENT_*.md`: early experiment logs

### Level 0: this handbook's "sibling" documents (**by the time you get here you are already reading it**)

- `docs/ONBOARDING_NEXT_AGENT_ZH.md` (this document)

---

## 2. Snapshot of the project's current state (in one table)

```
Trace:    outputs/qwen35-swebench-50sess.jsonl  (4449 reqs / 52 sessions, time-scale=1.0)
Hardware: 4× H100 80GB + Mellanox mlx5_0/_1 @ 200 Gb/s IB (active, but **not enabled** in current sweep)
Model:    Qwen3-30B-A3B-Instruct-2507 (TP1)
Branch:   kvc-debug-journey-v1-to-v4 = primary branch (v2 merged)
          feat/d-to-p-sync            = reserved for D→P incremental sync development, **currently empty**
          main                        = old baseline, 18 commits behind the primary branch
```

### Conclusions already reached (high confidence)

1. **v2 (reset-on-success + threshold 8192) beats 4DP CA**: lat mean -1.4%, p50 -13%, TTFT mean -25%, TTFT p50 -55%, TTFT p90 -67%
2. **KVC loses TTFT p99 by 3×**: 1.28s vs 0.43s, caused by the 8.3% of requests on the reseed/fallback slow path
3. **Slow-path time splits roughly evenly**: P-side re-prefill ~1.5-3s + mooncake P→D transfer ~1.5-4s (**currently over TCP loopback**; real RDMA is not enabled)
4. **capacity-backup cannot rescue the slow path**: audited directly; the P-side backup is not updated by direct-to-D appends, it is a static seed-time snapshot
5. **No D→P incremental sync code exists**: confirmed by a forensic review from an Opus agent plus a git search across all branches

### Core hypotheses still to verify (**this is your experimental task**)

| # | Hypothesis | How to verify | Expected result |
|---|---|---|---|
| H1 | KVC v2's win over 4DP does not come only from the 1P3D topology; the KVC layer (admission / migration / direct-to-D) also contributes significantly | Run naive 1P3D + policy=kv-aware at ts=1, N=1 (vanilla SGLang pd-disagg, no KVC layer) as an intermediate control | Naive 1P3D should land between KVC v2 and 4DP. If it ≈ KVC v2 → the win comes from topology, not the KVC layer; if it ≈ 4DP → the win comes from the KVC layer |
| H2 | Enabling real RDMA cuts the mooncake P→D transfer from 1.5-4s to 200-400ms and TTFT p99 from 1.28s to ~0.7s | Add `--force-rdma --ib-device mlx5_0` to the v2 sweep; same trace, same ts=1 | TTFT p99 should fall in the ~0.5-0.8s range. If unchanged → mooncake is not actually using RDMA, or the config is wrong; if it drops to ~0.3s → we underestimated the transfer segment's contribution |
| H3 | Even with RDMA, TTFT p99 still loses to DP (because the re-prefill segment is untouched) | Same results as the H2 experiment | Expect TTFT p99 ~0.7s > DP 0.43s. If ≤ DP → our estimate of the re-prefill cost is wrong, and the whole slow-path theory may need revisiting |
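
The H1 decision rule in the table can be sketched as a small helper. This is a minimal illustration and not code from the repository: the function name `classify_h1`, the tolerance `tol`, and the use of a single lower-is-better metric are all assumptions made for the example. It places the naive 1P3D result on a 0-to-1 scale between KVC v2 and 4DP and reports which explanation the position supports.

```python
def classify_h1(naive: float, kvc: float, dp: float, tol: float = 0.25) -> str:
    """Locate the naive 1P3D result relative to KVC v2 and 4DP (lower is better).

    Returns "topology" if naive is close to KVC v2, "kvc-layer" if it is close
    to (or worse than) 4DP, "mixed" in between, and "inconclusive" when KVC v2
    does not beat 4DP on the chosen metric.
    """
    span = dp - kvc
    if span <= 0:
        return "inconclusive"
    position = (naive - kvc) / span  # 0.0 = same as KVC v2, 1.0 = same as 4DP
    if position <= tol:
        return "topology"
    if position >= 1.0 - tol:
        return "kvc-layer"
    return "mixed"


# Historical TTFT p50 values quoted in this handbook: KVC v2 = 0.04s, DP 4w = 0.09s.
# The naive value here is a made-up input, not a measurement.
print(classify_h1(naive=0.065, kvc=0.04, dp=0.09))
```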

---

## 3. The experiments you will run (the main task)

### 3.1 Experiment matrix (ordered by ROI)

GPU hours are precious, so the originally planned naive 1P3D + policy=default baseline was cut. It was low-ROI: naive 1P3D with the default policy is almost certain to lose on multi-turn cache hits, so it is not a useful control point for H1. Two runs remain:

| # | Config | GPU | mechanism | policy | RDMA | Expected duration | Purpose |
|---|---|---:|---|---|---|---:|---|
| **E1** | naive 1P3D kv-aware | 4 | pd-disaggregation | kv-aware | **on** | ~5.5h | H1: separate the contribution of "1P3D + kv-aware policy" from that of the "KVC layer (admission/migration/direct-to-D)" |
| **E2** | KVC v2 + RDMA | 4 | kvcache-centric | kv-aware | **on** | ~5.5h | H2/H3: verify that RDMA brings TTFT p99 from 1.28s down to ~0.7s |

Run serially, the two take about 11h; run in parallel on two sets of GPUs, about 5.5h.

### 3.2 Launch configuration: detailed flag list

Use `scripts/sweep_ts1_migration_v2.sh` as the template. Key flags for the two new sweep scripts:

#### E1: naive 1P3D kv-aware

```bash
# --force-rdma / --ib-device: this run isolates topology + policy, not transport,
#   so RDMA must be on for a fair comparison with E2.
# --max-input-len 87811: aligned to the DP limit, so abort counts are comparable.
python -m agentic_pd_hybrid \
    --mechanism pd-disaggregation \
    --policy kv-aware \
    --topology-pd 1P3D \
    --transfer-backend mooncake \
    --force-rdma --ib-device mlx5_0 \
    --trace outputs/qwen35-swebench-50sess.jsonl \
    --time-scale 1.0 \
    --concurrency 32 \
    --request-timeout-s 300 \
    --max-input-len 87811 \
    --output-root outputs/qwen3-30b-tp1-ts1-naive-1p3d-kvaware
```

#### E2: KVC v2 + RDMA

Start from `scripts/sweep_ts1_migration_v2.sh` and **add only these two lines**:

```diff
     --transfer-backend mooncake \
+    --force-rdma --ib-device mlx5_0 \
+    --max-input-len 87811 \
     --kvcache-direct-max-uncached-tokens 8192 \
     --kvcache-migration-reject-threshold 3 \
     --kvcache-prefill-backup-policy release-after-transfer \
```

**Keep every other v2 setting unchanged.** This is the v2 + RDMA ablation; **do not change anything else while you are at it**.

### 3.3 Environment verification before the experiments (**do not skip**)

```bash
# 1. GPU
nvidia-smi -L   # should list 4× H100 80GB

# 2. RDMA
ibstat | grep -E "State|Rate|Port"
# Expected: mlx5_0 / mlx5_1 both show State=Active, Rate=200 Gb/s

# 3. Mooncake can see the RDMA devices
python -c "from mooncake_transfer_engine import TransferEngine; e=TransferEngine(); print(e.get_local_topology())"
# Expected: output includes mlx5_0 / mlx5_1

# 4. Existing v2 data is readable
python3 scripts/analysis/recompute_summary.py outputs/qwen3-30b-tp1-ts1-migration-v2/kvc_1p3d_migration_v2_run1_metrics.jsonl
# Expected: prints failure_count=45, abort_count=40, etc.

# 5. Syntax check of the algorithm implementation
python3 -m py_compile src/agentic_pd_hybrid/{policies,replay,metrics,benchmark,cli}.py
# Expected: all pass
```

If any step fails, **stop immediately and investigate**; do not push ahead.
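
---

## 4. Pitfalls already hit (do not repeat them)

| # | Pitfall | Symptom | Lesson |
|---|---|---|---|
| 1 | **Aborts counted in latency stats** | Both DP and KVC had 0.08s fast failures counted as "fast requests", dragging down mean/p50 | Fixed in `metrics.py` (commit `5eac9b4`). Summaries from new runs automatically include the `abort_count` / `failure_count` fields |
| 2 | **max-input-len differs between the two sides** (KVC=92098 vs DP=87811) | SGLang derives max_total_num_tokens from mem_fraction_static, and the KVC decode-only worker has 2 GB more GPU memory | For new runs, pass `--max-input-len 87811` explicitly to force alignment |
| 3 | **mooncake defaults to TCP loopback** | Passing only `--transfer-backend mooncake` in the sweep script is not enough; it falls back to TCP and runs 10× slower than RDMA | You must add `--force-rdma --ib-device mlx5_0` |
| 4 | **capacity-backup is not D→P sync** | The flag name is misleading; the code shows it only means "do not close the P session after reseed", and the KV is a static seed-time snapshot | Do not spend time on capacity-backup; eliminating the reseed tail requires implementing D→P, which belongs on `feat/d-to-p-sync` |
| 5 | **N=1 being "enough" at ts=1 is conditional** | The N=3 baseline confirmed that categorical outcomes are fully deterministic, but new code paths introduced by v2 (such as reset-on-success) have not been independently verified | For the v2 + RDMA comparison, N=2 is recommended: one run each with RDMA on and off |
| 6 | **Do not rely on ts=10 data** | The 372/912/396 errors from that era are a benchmark artifact and do not represent real production | Lock all comparisons to ts=1; do not try to "reproduce" or validate at ts=10 |
| 7 | **Do not blindly trust the critic agent's "MAJOR" labels** | The previous critic flagged cache fragmentation and prefill idleness as MAJOR, when both are KVC's **design intent** | See `V2_DEEP_ANALYSIS_ZH §4.4 / §4.5`. Keep the audit view and the production view separate |
| 8 | **The GPU utilization figure still has a minor layout issue** | The group labels (KVC 1P3D / DP 4-way CA) are still slightly cramped against the subplot titles | The user has accepted it as publishable. Do not spend more time on this figure |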
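
Pitfall 1 is easy to reintroduce when writing ad-hoc analysis code, so here is a minimal sketch of the idea behind the fix: aborted requests are counted, but kept out of the latency statistics. It is not the repository's `metrics.py`; the record keys `latency_s` and `aborted`, the function names, and the nearest-rank percentile are assumptions chosen for the illustration.

```python
import math
from statistics import mean


def nearest_rank(sorted_values: list, q: float) -> float:
    """Nearest-rank percentile for q in (0, 100]; input must be sorted and non-empty."""
    index = max(0, math.ceil(q / 100.0 * len(sorted_values)) - 1)
    return sorted_values[index]


def latency_summary(records: list) -> dict:
    """Summarize latency over completed requests only; aborts are counted, not averaged."""
    completed = sorted(r["latency_s"] for r in records if not r.get("aborted", False))
    return {
        "request_count": len(records),
        "abort_count": len(records) - len(completed),
        "mean": mean(completed) if completed else None,
        "p50": nearest_rank(completed, 50) if completed else None,
        "p99": nearest_rank(completed, 99) if completed else None,
    }


records = [
    {"latency_s": 0.08, "aborted": True},  # fast failure: must not count as a fast request
    {"latency_s": 0.5},
    {"latency_s": 1.0},
    {"latency_s": 2.0},
]
summary = latency_summary(records)
print(summary)
```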

---

## 5. CLI cheat sheet

### Running experiments

```bash
# Full sweep (v2 as reference)
bash scripts/sweep_ts1_migration_v2.sh

# To write your own sweep: copy sweep_ts1_migration_v2.sh and change mechanism/policy/output-root
```

### Inspecting data

```bash
# Fixed summary (use this one; the old summary.json is polluted by aborts)
python3 scripts/analysis/recompute_summary.py outputs/<run>/*_metrics.jsonl

# Cross-config comparison
python3 scripts/analysis/analyze_ts1_validation.py   # compares the KVC vs DP ts=1 4-run set
```

### Plotting (follow the v2 workflow)

```bash
# 4 existing figures, each answering a different visualization question
python3 scripts/analysis/plot_v2_path_breakdown.py   # execution_mode distribution + path-level latency
python3 scripts/analysis/plot_ttft_pdf.py            # TTFT PDF (KVC vs DP)
python3 scripts/analysis/plot_gpu_utilization.py     # GPU utilization (request count vs work)
python3 scripts/analysis/plot_cache_efficiency.py    # cache efficiency (hit rate vs turn + uncached ECDF)

# After the data changes, just rerun: every script takes its input paths as parameters
```

### Git

```bash
# Primary branch (experiments)
git checkout kvc-debug-journey-v1-to-v4

# New feature branch (D→P sync, empty)
git checkout feat/d-to-p-sync

# Remote: origin = git@ipads.se.sjtu.edu.cn:wangjh/agentic-pd-hybrid.git

# Pushing (the SSH known_hosts entry has to be accepted the first time)
GIT_SSH_COMMAND='ssh -o StrictHostKeyChecking=accept-new -o UserKnownHostsFile=~/.ssh/known_hosts' git push

# user.email is not set globally; passing it per commit is recommended:
git -c user.email=YOUR_EMAIL -c user.name=YOUR_NAME commit -m "..."
```

---

## 6. Which numbers to look at after a run (checklist)

After each run, collect **at least** the following numbers (using `recompute_summary.py`):

```
☐ request_count (expected 4449)
☐ error_count + abort_count + failure_count
☐ latency_stats_s.{mean, p50, p90, p99}
☐ ttft_stats_s.{mean, p50, p90, p99}    ← do not forget p99! This is where KVC pays its real cost
☐ execution_modes distribution
☐ per_decode_load distribution (check load balance)
☐ per_prefill_load (note: dispatcher count ≠ GPU work)
☐ cache_hit_request_count + total_cached_tokens (to derive the cache hit rate)
```

### The "decisive numbers" once both controlled experiments are done

| Comparison | What to look at | Decision |
|---|---|---|
| E1 (naive 1P3D kv-aware) vs E2 (KVC v2 + RDMA) | TTFT p50/p99, direct-to-D share | Quantify the extra gain of the KVC layer (admission/migration/direct-to-D) on top of kv-aware (H1) |
| KVC v2 (TCP, historical v2 run) vs E2 (KVC v2 + RDMA) | TTFT p99, reseed-mode latency (ttft_s p50 where execution_mode == reseed) | Verify H2/H3: how much of the transfer segment RDMA recovers |
| E1 (naive 1P3D kv-aware) vs DP 4w (historical ts=1 baseline) | All latency / TTFT metrics | Indirectly anchor the ceiling of "topology difference + kv-aware policy" |

### Expected number ranges (if the experiments go well)

| Config | lat p50 | lat p99 | TTFT p50 | TTFT p99 | direct-to-D % |
|---|---:|---:|---:|---:|---:|
| **E1** naive 1P3D kv-aware | ~0.75s | ~8-10s | ~0.20s | ~0.8-1.2s | N/A |
| **E2** KVC v2 + RDMA | ~0.58s | ~7-8s | ~0.04s | **~0.5-0.8s** | ~91% |
| (reference) KVC v2 + TCP (historical) | 0.58s | 8.7s | 0.04s | 1.29s | 91.6% |
| (reference) DP 4w (historical ts=1) | 0.67s | 8.4s | 0.09s | 0.43s | N/A |

**If the numbers you see deviate from these ranges by ≥ 2×**, stop and check the configuration first (the environment verification items in §3.3) instead of writing the report.
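
The "≥ 2× off the expected range" rule can be checked mechanically before any report is written. The sketch below is an assumption-laden illustration, not a repository script: the dictionary keys, the function names, and the reading of "2×" as "at least twice the upper bound or at most half the lower bound" are choices made for this example. The ranges are copied from the E2 row of the expected-ranges table; the observed values are made up.

```python
# Expected (low, high) ranges in seconds, copied from the E2 row of the table.
EXPECTED_E2 = {
    "lat_p50": (0.58, 0.58),
    "lat_p99": (7.0, 8.0),
    "ttft_p50": (0.04, 0.04),
    "ttft_p99": (0.5, 0.8),
}


def deviates(observed: float, low: float, high: float, factor: float = 2.0) -> bool:
    """True when observed is at least `factor` times above high or below low."""
    return observed >= high * factor or observed <= low / factor


def flag_deviations(observed: dict, expected: dict, factor: float = 2.0) -> list:
    """Names of metrics whose observed value is outside the tolerated band."""
    return sorted(
        name
        for name, (low, high) in expected.items()
        if name in observed and deviates(observed[name], low, high, factor)
    )


# Hypothetical observed summary for an E2 run.
observed = {"lat_p50": 0.60, "lat_p99": 7.4, "ttft_p50": 0.04, "ttft_p99": 1.7}
print(flag_deviations(observed, EXPECTED_E2))
```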

---

## 7. What to do when X happens (FAQ)

**Q: TTFT p99 for KVC v2 + RDMA comes out much higher than expected (> 1s).**

A: RDMA is most likely not actually in use. Check:
1. In `outputs/<run>/<subdir>/benchmark-config.json`, is `force_rdma` set to `True` and `ib_device` set to `"mlx5_0"`?
2. Does the server startup log (`outputs/<run>/<subdir>/logs/prefill-0.log`) contain messages like "MOONCAKE_DEVICE=mlx5_0" / "using RDMA"?
3. Run `ibstat mlx5_0` to confirm the port has not dropped out of the active state

**Q: KVC v2 + RDMA gives TTFT p99 ≤ DP (contradicting H3).**

A: That is good news. Possible explanations:
1. We overestimated the re-prefill segment (SGLang's prefix cache may in practice save half of the P-side re-prefill)
2. RDMA is fast enough to push the transfer segment down to ~50ms, so the whole reseed takes < 1.5s
3. RDMA indirectly lowers how often v2 triggers a reseed (some race condition improves LRU behavior)

Any of these is worth **digging into**. Pull out the `ttft_s` distribution for reseed mode on its own (it should be clearly bimodal: fast reseeds plus a very small number of outliers).

**Q: Naive 1P3D will not start / SGLang reports errors.**

A: The repository has a historical working 1P1D configuration under `outputs/qwen3-30b-exps/pd-disaggregation-default-20260427T062616Z/` that you can use as a reference. Common pitfalls:
1. `--mechanism pd-disaggregation` and `--topology` must match; the topology cannot use KVC's 1P3D name
2. SGLang is vendored under `third_party/sglang/`; **do not** `pip install sglang` to use an external version, because the APIs may not line up
3. With `--policy default`, do not pass the `--kvcache-*` flags; they are ignored but pollute the config output

**Q: I want to run other comparisons (larger trace / more GPUs / real cross-node RDMA).**

A: Finish E1 and E2 above first. These 2 are the ablations for the paper's core contribution and cannot be skipped. Other comparisons (longer trace, 8-GPU 2P6D, real cross-node RDMA, adding back naive 1P3D + policy=default) are listed in `V2_DEEP_ANALYSIS_ZH §7.3` as follow-ups.

**Q: I want to generate the comparison figures automatically after the runs.**

A: The 4 existing `plot_*.py` scripts are all parameterized; point the input paths at your new run to reuse them. If the comparison gains more dimensions (for example a three-way naive vs KVC vs DP comparison), extend an existing script rather than writing a new one; see `plot_ttft_pdf.py` for the template.

**Q: The metrics.jsonl fields are inconsistent / some fields are missing.**

A: Look at the `RequestMetrics` dataclass in `src/agentic_pd_hybrid/metrics.py`. Every new field must be added there, otherwise `recompute_summary.py` raises a KeyError. **Note**: `field_names` is taken from `RequestMetrics.__dataclass_fields__`, not from all the keys present in the jsonl.
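
A quick way to spot a field mismatch is to diff the keys of one jsonl record against the dataclass's declared fields. The sketch below uses a hypothetical stand-in class, `RequestMetricsSketch`, with three fields; the real `RequestMetrics` is not reproduced here, and `request_id` and `new_field` are invented names. Only `dataclasses.fields` and `json.loads` from the standard library are relied on.

```python
import json
from dataclasses import dataclass, fields


@dataclass
class RequestMetricsSketch:
    """Hypothetical stand-in; the real RequestMetrics declares many more fields."""
    request_id: str
    execution_mode: str
    ttft_s: float


def unknown_keys(jsonl_line: str, cls) -> set:
    """Keys present in one metrics.jsonl record but not declared on the dataclass."""
    declared = {f.name for f in fields(cls)}
    return set(json.loads(jsonl_line)) - declared


line = (
    '{"request_id": "r1", "execution_mode": "kvcache-direct-to-d-session", '
    '"ttft_s": 0.04, "new_field": 1}'
)
print(unknown_keys(line, RequestMetricsSketch))
```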
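
---

## 8. If you are completely stuck

Read this section:

1. **Do not** start hacking on code before reading the required documents in §1 of this handbook
2. **Do not** run experiments on `main` or `feat/d-to-p-sync`; use `kvc-debug-journey-v1-to-v4`
3. **Do not** change the statistics fields in metrics.py unless you can explain clearly why its current abort exclusion is correct
4. **Do not** trust the critic agent's "MAJOR" labels; look for code-level evidence
5. **Do not** skip environment verification (§3.3) and jump straight into a long sweep; 5h of garbage data costs more

If you are stuck for more than 30 minutes, write the blocker down in one sentence and leave it for the main agent (in a git commit message or a branch note).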

---

## 9. Two specific things the main agent expects from you

1. **After both controlled experiments finish**, give me the following numbers in the new commit message (in `recompute_summary.py` output format):

   ```
   E1 naive 1P3D kv-aware: lat={mean,p50,p90,p99} ttft={mean,p50,p90,p99} fail_count
   E2 KVC v2 + RDMA:       same as above + reseed-mode ttft p50/p99 reported separately
   ```

2. **While running E2, collect the measured latency distribution of the reseed path**:

   ```
   the ttft_s distribution for the execution_mode pd-router-d-session-reseed
   and, separately, the P→D mooncake transfer duration vs the P-side re-prefill duration
   (requires timestamp diffs from structural/admission-events.jsonl)
   ```

These two sets of numbers directly determine how the paper's future-work section argues the need for D→P sync.
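
The first half of item 2 (the reseed-mode `ttft_s` distribution) can be pulled from a metrics jsonl with a few lines of standard-library Python. This is a sketch under assumptions: the keys `execution_mode` and `ttft_s` and the mode name are the ones quoted in this handbook, while the function name, the nearest-rank percentile, and the sample records are invented for the illustration. The transfer vs re-prefill split is not covered because the layout of `admission-events.jsonl` is not described here. With a real run, an open file handle can be passed in place of the sample list.

```python
import json
import math

RESEED_MODE = "pd-router-d-session-reseed"  # execution_mode named in this handbook


def reseed_ttft_stats(jsonl_lines) -> dict:
    """Collect ttft_s for reseed-mode records and report count, p50 and p99."""
    values = []
    for line in jsonl_lines:
        line = line.strip()
        if not line:
            continue
        record = json.loads(line)
        if record.get("execution_mode") == RESEED_MODE and record.get("ttft_s") is not None:
            values.append(float(record["ttft_s"]))
    values.sort()
    if not values:
        return {"count": 0, "p50": None, "p99": None}

    def rank(q: float) -> float:
        # Nearest-rank percentile on the sorted values.
        return values[max(0, math.ceil(q / 100.0 * len(values)) - 1)]

    return {"count": len(values), "p50": rank(50), "p99": rank(99)}


# Made-up sample records, only to show the call shape.
sample = [
    '{"execution_mode": "kvcache-direct-to-d-session", "ttft_s": 0.04}',
    '{"execution_mode": "pd-router-d-session-reseed", "ttft_s": 3.2}',
    '{"execution_mode": "pd-router-d-session-reseed", "ttft_s": 4.5}',
    '{"execution_mode": "pd-router-d-session-reseed", "ttft_s": 6.9}',
]
print(reseed_ttft_stats(sample))
```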

---

## Appendix A: where the key files live

| What you are looking for | Where |
|---|---|
| Algorithm implementation | `src/agentic_pd_hybrid/policies.py` (KvAwarePolicy + RoutingState) |
| The whole replay orchestration | `src/agentic_pd_hybrid/replay.py` (~3000 lines, **read it slowly**) |
| Metrics | `src/agentic_pd_hybrid/metrics.py` |
| CLI entry point | `src/agentic_pd_hybrid/cli.py` |
| Server launch configuration | `src/agentic_pd_hybrid/stack.py` |
| SGLang changes | `third_party/sglang/python/sglang/srt/{managers/scheduler.py, managers/io_struct.py, disaggregation/mooncake/...}` |
| Historical sweep scripts | `scripts/sweep_ts1_*.sh` |
| Analysis scripts | `scripts/analysis/*.py` |
| Experiment outputs | `outputs/qwen3-30b-tp1-ts1-*/` |

## Appendix B: key commits (organized by "which commit explains which change")

| To understand | Look at commit |
|---|---|
| The core v2 change | `2ec0deb feat(kvc): session migration with reset-on-success + direct-append threshold tuning` |
| The metrics.py fix | `5eac9b4 fix(metrics): exclude aborted requests from latency/ttft/tpot stats` |
| The full analysis document (revised over several versions) | `c01d610` (latest) / `9ccd853` / `b5af195` / `c551906` / `517677d` |
| The formal algorithm definition | `37e9caa docs(kvc): production-decision reframe + formal router algorithm spec` |
| The figure scripts | `c551906` (TTFT PDF) / `b5af195` (path breakdown) / `517677d` (GPU + cache) |
| The backpressure code | `c47adaf feat(kvc): honor admission backpressure hints` and `ca4b64c feat(sglang): expose backpressure pause hint` |

---

**In one sentence**: first read the 4 Level 1 documents in §1 (30 min) plus this handbook (30 min), then run the two experiments E1/E2 as described in §3, collect the decisive numbers per §6, consult §4 when you hit a pitfall, and push the results under `outputs/`. **Do not touch code that is outside this task.** Your job is to verify whether v2's win holds up under ablation, not to develop a new mechanism (that belongs on the `feat/d-to-p-sync` branch, in the next phase).

Looking forward to your commit once the runs are done!
|
||||
@@ -2,9 +2,9 @@
|
||||
|
||||
**日期**:2026-05-08
|
||||
**前置文档**:
|
||||
- `docs/REFACTOR_PLAN_ZH.md`(v0,已被本文 supersede——v0 的 backpressure 切入点结论已撤回)
|
||||
- `docs/archive/REFACTOR_PLAN_ZH.md`(v0,已被本文 supersede——v0 的 backpressure 切入点结论已撤回)
|
||||
- `docs/TEAM_REPORT_AGENTIC_PD_HYBRID_ZH.md`(包含 §1-§7 结构性问题清单)
|
||||
- `docs/STRUCTURAL_VALIDATION_REPORT_ZH.md`(ts=10 数据下的早期验证)
|
||||
- `docs/archive/STRUCTURAL_VALIDATION_REPORT_ZH.md`(ts=10 数据下的早期验证)
|
||||
|
||||
**触发**:`outputs/qwen3-30b-tp1-ts1-validation/` 4 个 run 完成(KVC 1P3D × N=3 + 4DP CA × 1,全部 ts=1)
|
||||
|
||||
@@ -372,11 +372,11 @@ score = (
|
||||
## 附录 B:相关文档
|
||||
|
||||
- `docs/TEAM_REPORT_AGENTIC_PD_HYBRID_ZH.md` — §1-§7 原结构性问题清单
|
||||
- `docs/REFACTOR_PLAN_ZH.md` — v0 重构计划(本文 supersede)
|
||||
- `docs/AGENTIC_FIT_ANALYSIS_ZH.md` — 早期 fit 分析(§1-§7 来源)
|
||||
- `docs/STRUCTURAL_VALIDATION_REPORT_ZH.md` — ts=10 结构性 claim 验证
|
||||
- `docs/KVC_DEBUG_JOURNEY_V1_TO_V5.md` — v1→v5 演进
|
||||
- `docs/V5_PROFILE_INVESTIGATION_ZH.md` — v5+profile 调查(已 critic 修订)
|
||||
- `docs/archive/REFACTOR_PLAN_ZH.md` — v0 重构计划(本文 supersede)
|
||||
- `docs/archive/AGENTIC_FIT_ANALYSIS_ZH.md` — 早期 fit 分析(§1-§7 来源)
|
||||
- `docs/archive/STRUCTURAL_VALIDATION_REPORT_ZH.md` — ts=10 结构性 claim 验证
|
||||
- `docs/archive/KVC_DEBUG_JOURNEY_V1_TO_V5.md` — v1→v5 演进
|
||||
- `docs/archive/V5_PROFILE_INVESTIGATION_ZH.md` — v5+profile 调查(已 critic 修订)
|
||||
- `scripts/sweep_ts1_kvc_n3_plus_dp.sh` — 本次 4 run sweep 脚本
|
||||
- `scripts/analysis/analyze_ts1_validation.py` — 本次分析脚本
|
||||
|
||||
|
||||
@@ -633,9 +633,9 @@ errors 漂移 **2.5×**(372→912),P50 latency 漂移 ~30%,TTFT P50 漂
|
||||
## 附录 B:相关已有文档
|
||||
|
||||
- `docs/PROJECT_OVERVIEW.md` — 项目目标、microbench 结论
|
||||
- `docs/AGENTIC_FIT_ANALYSIS_ZH.md` — 结构性缺陷的早期分析(本报告 §2 的来源)
|
||||
- `docs/KVC_DEBUG_JOURNEY_V1_TO_V5.md` — v1→v5 详细演进日记
|
||||
- `docs/V5_PROFILE_INVESTIGATION_ZH.md` — v5+profile 调查(含 critic 修订)
|
||||
- `docs/SWEBENCH_EXPERIMENT_RESULTS.md` — SWE 35B 早期实验
|
||||
- `docs/REFACTOR_PLAN_ZH.md` — 当前重构计划
|
||||
- `docs/STRUCTURAL_VALIDATION_REPORT_ZH.md` — 结构性 claim 验证(本报告的精简版)
|
||||
- `docs/archive/AGENTIC_FIT_ANALYSIS_ZH.md` — 结构性缺陷的早期分析(本报告 §2 的来源)
|
||||
- `docs/archive/KVC_DEBUG_JOURNEY_V1_TO_V5.md` — v1→v5 详细演进日记
|
||||
- `docs/archive/V5_PROFILE_INVESTIGATION_ZH.md` — v5+profile 调查(含 critic 修订)
|
||||
- `docs/archive/SWEBENCH_EXPERIMENT_RESULTS.md` — SWE 35B 早期实验
|
||||
- `docs/archive/REFACTOR_PLAN_ZH.md` — 当前重构计划
|
||||
- `docs/archive/STRUCTURAL_VALIDATION_REPORT_ZH.md` — 结构性 claim 验证(本报告的精简版)
|
||||
|
||||
@@ -239,6 +239,34 @@ v2 整体跑得快不仅因为 "KVC 机制好",更因为 **91.6% 请求被路
|
||||
|
||||
绘图脚本:`scripts/analysis/plot_ttft_pdf.py`(用 `scipy.stats.gaussian_kde`,body 用 Scott bandwidth 0.15,full range 用 log10 域 KDE)。
|
||||
|
||||
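### 3.5 TPOT probability density comparison: KVC does not sacrifice decode speed

To pre-empt the reviewer question "is KVC's TTFT advantage bought by sacrificing decode speed (TPOT)?", we also compared the probability density of inter-token latency:

![TPOT PDF](figures/tpot_pdf_comparison.png)

Measured TPOT quantiles:

| Metric | KVC v2 | DP 4w | Δ |
|---|---:|---:|---:|
| min | 4.432ms | 4.420ms | +0.012ms |
| p50 | 5.561ms | 5.525ms | **+0.035ms (+0.6%)** |
| p90 | 6.644ms | 6.694ms | **−0.050ms (−0.7%)** |
| p99 | 7.568ms | 7.543ms | +0.026ms |
| mean | 5.680ms | 5.661ms | **+0.019ms (+0.34%)** |
| std | 0.711ms | 0.720ms | −0.009ms |
| max | 11.315ms | 9.531ms | +1.78ms |

**Key fact**: over the body of the distribution (up to p99, covering 99% of requests), **the TPOT difference between KVC and DP is within 0.05ms (< 1%)**. The two KDE curves are visually almost identical (left panel). This is expected: with the same model (Qwen3-30B-A3B) on the same GPU (H100), per-token decode latency is determined by the hardware and the model architecture, not by the routing policy.

**The only visible difference is at the max**: KVC 11.3ms vs DP 9.5ms, so **KVC's tail has an outlier about 1.8ms larger**. Suspected source: cold-start decode after a reseed (the KV has just arrived on D, and the first warm-up decode step is slightly slower than steady state). This affects ≤ 0.1% of requests and is negligible.

**Significance for the paper (important)**: this figure guards against the reviewer question "does KVC trade slower decode for faster TTFT?". The answer is **no**: KVC's win **happens entirely on the prefill path** (direct append-prefill in D, versus DP's full prefill on the same worker). On the decode path both sides do plain batched generation at the same speed.

**Cross-check with §3.2 path-level latency**: in that figure's "Lat p50" column, the gap between the KVC fast path (0.55s) and DP (0.67s) **comes almost entirely from the TTFT segment** (KVC 41ms vs DP 92ms, a 51ms difference). In the decode segment both sides spend mean output_tokens × TPOT ≈ 227 × 5.7ms ≈ 1.3s (identical). That agreement is exactly what the TPOT figure shows.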
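
The two figures quoted in the cross-check (the ~1.3s decode segment and the 51ms TTFT gap) follow from simple arithmetic. The snippet below just redoes that arithmetic with the values stated in the text; the variable names are illustrative, and nothing here reads real run data.

```python
MEAN_OUTPUT_TOKENS = 227   # mean output_tokens per request, as quoted in the text
TPOT_S = 5.7e-3            # ~5.7 ms per output token on both systems
KVC_TTFT_S = 0.041         # KVC fast-path TTFT p50
DP_TTFT_S = 0.092          # DP TTFT p50

decode_s = MEAN_OUTPUT_TOKENS * TPOT_S   # decode segment, identical on both sides
ttft_gap_s = DP_TTFT_S - KVC_TTFT_S      # the segment where KVC actually wins

print(f"decode segment ~ {decode_s:.2f}s, TTFT gap ~ {ttft_gap_s * 1000:.0f}ms")
```

Because the decode term is the same on both sides, it cancels in any KVC-vs-DP difference; only the TTFT term can separate the two systems.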

Plotting script: `scripts/analysis/plot_tpot_pdf.py` (uses `scipy.stats.gaussian_kde`; the body panel uses bandwidth 0.15, and the full range uses a KDE in the log10 domain).
|
||||
|
||||
---
|
||||
|
||||
## 4. 需要诚实交代的 caveats(不是 KVC 的设计缺陷)
|
||||
@@ -339,33 +367,38 @@ Critic 的 framing:
|
||||
|
||||
→ 论文里这是 **contribution**,不是 caveat:KVC 的 mechanism 让 27% 更少的总池子产生了更高的 retention 效率。
|
||||
|
||||
### 4.5 [辩驳 critic] "Prefill GPU 90%+ 闲置" 是设计意图,不是浪费
|
||||
### 4.5 KVC 的 compute 经济:session affinity 让系统总 compute 减少 33%
|
||||
|
||||
Critic 的 framing:
|
||||
> KVC 1P3D 中 prefill GPU 只在 8.3% 请求时被激活;实际工作 GPU 只有 ~3.08 个,对比 4DP CA 的 4 个 fused GPU 不公平。
|
||||
**头条事实**:在同样 4449 个请求的 workload 上,KVC v2 整个系统消耗的 compute tokens 比 4DP CA 少 33%。
|
||||
|
||||
**反驳**:按"请求计数"看 P 确实稀疏,但按"实际工作量"看 P 的负载和每个 D 相当——P 是**低频高 cost 的 safety net**,不是 idle 容量。
|
||||

|
||||
|
||||

|
||||
**左图 — 系统总 compute(堆叠条形图)**:
|
||||
- KVC 1P3D v2 总 compute = **3.47M tokens**
|
||||
- P-side 重 prefill(reseed/seed 路径,8.3% 请求):1.07M
|
||||
- D-side append-prefill(91.6% direct-to-D 路径,每个请求平均仅 341 token):1.39M
|
||||
- Decode:1.01M
|
||||
- DP 4-way CA 总 compute = **5.17M tokens**
|
||||
- Full prefill(每个请求都是 mean 952 uncached token):4.17M
|
||||
- Decode:1.00M
|
||||
|
||||
**左图 — 请求计数视图**:KVC P GPU 仅处理 328 个请求(7.4%),而 KVC D 各处理 ~1450 个(33%),DP 各处理 ~1100 个(25%)。**乍看像 critic 说的"P 闲着"**。
|
||||
差异的根因**完全在 prefill 段**:DP 每个请求做 mean 952 token 的 uncached prefill,KVC 91.6% 请求只做 mean 341 token 的 append-prefill(剩 8.3% 走 P 做平均 5455 token 的重 prefill)。session affinity 让 91.6% 请求的 prefix KV **已经在目标 D 上 resident**,下次 turn 只需算 append delta——**这就是 cache 复用直接折算成 compute 减少的过程**。
|
||||
|
||||
**右图 — 工作量视图(compute tokens)**:
|
||||
- KVC P GPU:**1.07M tokens 的 prefill 工作**(仅 prefill,无 decode)
|
||||
- KVC D GPU 每个:~0.80M tokens(小量 append-prefill + 全部 decode)
|
||||
- DP 每个 worker:~1.30M tokens(全套 prefill + decode)
|
||||
**右图 — per-GPU 工作分布(同样 8 个 GPU)**:
|
||||
- KVC 把 compute **不均匀分配**:P 专门承担 1.07M 的重 prefill(不做 decode),3 个 D 各自只承担 ~0.80M 的轻 append + decode 混合。
|
||||
- DP 把 compute **均匀分配**:每个 fused worker ~1.25M(full prefill + decode 必须在同 GPU 上交替)。
|
||||
|
||||
→ **KVC P GPU 的 per-GPU 工作量与每个 KVC D GPU 相当**——只是分布在少数(328)个高强度请求上(每个 reseed 5K-90K tokens)。它不是空转,是 **low-frequency, high-cost safety net**。
|
||||
这种"不均匀分配"是 KVC 的设计意图,不是 load imbalance bug:
|
||||
1. **重 prefill 被隔离**——P 的 prefill kernel 不会插队进 D 的 decode batch,decode 端 batching 几乎无 jitter(详见 §3.5 TPOT 双方完全重合)
|
||||
2. **D 端只做小 append**(mean 341 token vs DP 的 952 token),prefill kernel 占的 GPU 时间从 ~10ms 降到 ~1ms,对 decode batching 的干扰从主导变为可忽略
|
||||
3. **总 compute 不依赖每个 GPU 满载** —— "P 闲着但当它工作时承担全部重活" 是合理的分工
|
||||
|
||||
**总工作量对比**:
|
||||
- KVC 4 个 GPU 合计 ~3.47M tokens 工作
|
||||
- DP 4 个 GPU 合计 ~5.17M tokens 工作(**KVC 减少 33% compute**——这是 session affinity 带来的 cache 复用收益)
|
||||
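**Paper framing**: this figure shows that session affinity does not only yield locality; it converts locality directly into **a system-level reduction in compute**. Specifically:

- For 91.6% of requests, uncached_tokens drops from a mean of 952 (DP) to a mean of 341 (KVC direct-to-D), a 64% reduction in work
- For 8.3% of requests, uncached_tokens rises under KVC (mean 5455 on reseed vs a mean of 952 across all DP requests), but the request count is small
- Weighted together, KVC's total prefill compute drops by about 41% (1.07M+1.39M = 2.46M vs 4.17M); adding the unchanged decode work, total compute drops by 33%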
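
The percentages in this section can be re-derived from the per-category token counts it quotes. The snippet below does only that arithmetic; the dictionary keys are illustrative names, and the inputs are the rounded "M tokens" figures from the text rather than values read from run data.

```python
# Token counts in millions, as stated in this section.
kvc = {"p_prefill": 1.07, "d_append_prefill": 1.39, "decode": 1.01}
dp = {"full_prefill": 4.17, "decode": 1.00}

kvc_total = sum(kvc.values())
dp_total = sum(dp.values())
kvc_prefill = kvc["p_prefill"] + kvc["d_append_prefill"]

total_reduction = 1 - kvc_total / dp_total
prefill_reduction = 1 - kvc_prefill / dp["full_prefill"]

print(f"KVC total {kvc_total:.2f}M vs DP total {dp_total:.2f}M")
print(f"total compute reduction:   {total_reduction:.1%}")
print(f"prefill compute reduction: {prefill_reduction:.1%}")
```

Note that KVC's total is about 67% of DP's total, which is the same statement as a 33% total reduction; the prefill-only reduction is a separate, larger number.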
|
||||
|
||||
这两点综合:KVC 用 **同样 4 个 GPU、更少总 KV pool、更少总 compute**,做到了 latency / TTFT mean/p50/p90 全胜。
|
||||
|
||||
**论文应当把这条作为 architectural rationale 写出来:KVC 用 P 的低频专用化换 D 端的 TTFT 稳定性。**
|
||||
|
||||
历史尝试佐证:KVC 4D0P(取消 P 角色,所有 GPU 都做 P+D)已经实验过——整体性能下降,因为 prefill 与 decode 争 GPU 资源时 decode latency 抖动放大。
|
||||
历史尝试佐证:KVC 4D0P(取消 P 角色,所有 GPU 都做 P+D,类似 DP)已经实验过——整体性能下降,因为 prefill 与 decode 争 GPU 资源时 decode latency 抖动放大。这反过来印证 "P 专门化" 的设计价值:它让 D 的 decode 路径**永不与重 prefill 在同 GPU 上争资源**。
|
||||
|
||||
### 4.6 v2 N=1 + 新代码路径未验证确定性 — **MINOR,方法学待办**
|
||||
|
||||
@@ -609,8 +642,8 @@ v2 p99 = slow path 主导 → 8.69s (KVC) vs 8.43s (DP) 接近
|
||||
- `docs/REFACTOR_PLAN_V1_ZH.md` — ts=1 验证后的方向决策
|
||||
- `docs/MIGRATION_V1_FINDINGS_ZH.md` — v1 thrashing 诊断
|
||||
- `docs/V2_RESULTS_ZH.md` — v2 结果原始报告(本文是对它的 critique)
|
||||
- `docs/AGENTIC_FIT_ANALYSIS_ZH.md` — 早期 fit 分析(§1-§7 来源)
|
||||
- `docs/STRUCTURAL_VALIDATION_REPORT_ZH.md` — ts=10 结构性 claim 验证
|
||||
- `docs/archive/AGENTIC_FIT_ANALYSIS_ZH.md` — 早期 fit 分析(§1-§7 来源)
|
||||
- `docs/archive/STRUCTURAL_VALIDATION_REPORT_ZH.md` — ts=10 结构性 claim 验证
|
||||
|
||||
## 附录 C:相关代码
|
||||
|
||||
|
||||
@@ -271,8 +271,8 @@ p99 +3% 几乎全部来自这 5 个 timeout(每个 ~30s 拉到 p99)。**修
|
||||
- `docs/TEAM_REPORT_AGENTIC_PD_HYBRID_ZH.md` — §1-§9 原结构性问题清单
|
||||
- `docs/REFACTOR_PLAN_V1_ZH.md` — 重构方向 + 三情景分支
|
||||
- `docs/MIGRATION_V1_FINDINGS_ZH.md` — v1 thrashing 诊断
|
||||
- `docs/AGENTIC_FIT_ANALYSIS_ZH.md` — 早期 fit 分析
|
||||
- `docs/STRUCTURAL_VALIDATION_REPORT_ZH.md` — ts=10 结构性 claim 验证
|
||||
- `docs/archive/AGENTIC_FIT_ANALYSIS_ZH.md` — 早期 fit 分析
|
||||
- `docs/archive/STRUCTURAL_VALIDATION_REPORT_ZH.md` — ts=10 结构性 claim 验证
|
||||
- `scripts/sweep_ts1_migration_v2.sh` — 本次 v2 sweep 脚本
|
||||
- `scripts/analysis/analyze_ts1_validation.py` — ts=1 4-way 对比分析
|
||||
|
||||
|
||||
34
docs/archive/README.md
Normal file
@@ -0,0 +1,34 @@
# About the archived documents

This directory keeps process documents from earlier phases of the project. **Agents and people new to the project do not need to read them**; go straight to `docs/ONBOARDING_NEXT_AGENT_ZH.md`.

They are kept in order to:
1. Trace the v1-v5 tuning evolution when writing the paper
2. Provide the original structural-problem diagnosis if we ever return to the ts=10 high-pressure regime or to larger traces
3. Satisfy academic traceability requirements

## Short description of each document

| Document | Why archived | When to look back |
|---|---|---|
| `AGENTIC_FIT_ANALYSIS_ZH.md` | ts=10-era analysis of the §1-§7 structural problems; its conclusions are fully superseded by the ts=1 data | You want to know which structural problems we believed in under ts=10 |
| `STRUCTURAL_VALIDATION_REPORT_ZH.md` | Validates the AGENTIC_FIT_ANALYSIS claims on ts=10 data; likewise superseded in the ts=1 era | Same as above |
| `KVC_DEBUG_JOURNEY_V1_TO_V5.md` | Process notes from the 5 tuning sweeps v1-v5; contains historical data such as the errors 9→912 drift and the changes in direct-to-D share | You are writing the "as we explored configurations v1-v5..." paragraph of the paper |
| `V5_PROFILE_INVESTIGATION_ZH.md` | Investigation that added 1Hz polling instrumentation to v5; records the observation that errors rose 46× | You want to understand the §5 residual risk "admission RPC interferes with the scheduler main loop" |
| `REFACTOR_PLAN_ZH.md` | v0 refactor plan, **superseded by `REFACTOR_PLAN_V1_ZH.md`** | Not needed; skim it only if you want to see the author's initial ideas |
| `KVCACHE_CENTRIC_PROGRESS_ZH.md` | Progress notes from the earliest phase of the project (2026-04-27), before any complete sweep data existed | Almost never; it serves as the "project origin record" |
| `SWEBENCH_EXPERIMENT_PROGRESS.md` | Progress notes from the early SWE-Bench trace experiments | You want to know the trace generation / sampling configuration used back then |
| `SWEBENCH_EXPERIMENT_RESULTS.md` | Same as above; early result snapshot | Same as above |

## Currently active documents (at the top level of `docs/`)

Go to:
- `docs/ONBOARDING_NEXT_AGENT_ZH.md`: onboarding handbook
- `docs/PROJECT_OVERVIEW.md`: project goals + terminology
- `docs/KVC_ROUTER_ALGORITHM.md`: formal algorithm specification
- `docs/V2_DEEP_ANALYSIS_ZH.md`: full v2 analysis
- `docs/V2_RESULTS_ZH.md`: original v2 results report
- `docs/REFACTOR_PLAN_V1_ZH.md`: ts=1 direction decision
- `docs/MIGRATION_V1_FINDINGS_ZH.md`: v1 thrashing diagnosis
- `docs/RESEED_SLOW_PATH_AND_D_TO_P_GAP_ZH.md`: reseed tail + D→P gap audit
- `docs/TEAM_REPORT_AGENTIC_PD_HYBRID_ZH.md`: ts=10-era list of structural problems (kept in the main directory as a historical baseline)
|
||||
Binary file not shown.
|
Before Width: | Height: | Size: 196 KiB After Width: | Height: | Size: 244 KiB |
BIN
docs/figures/tpot_pdf_comparison.png
Normal file
Binary file not shown.
|
After Width: | Height: | Size: 306 KiB |
@@ -1,24 +1,25 @@
|
||||
#!/usr/bin/env python3
|
||||
"""Per-GPU utilization breakdown: KVC 1P3D v2 vs 4-way DP CA.
|
||||
"""System compute economy: KVC 1P3D v2 vs 4-way DP CA.
|
||||
|
||||
Generates docs/figures/gpu_utilization.png — two-panel:
|
||||
left: per-GPU request count
|
||||
right: per-GPU compute work (uncached prefill tokens + decode tokens, stacked)
|
||||
Generates docs/figures/gpu_utilization.png -- two-panel:
|
||||
left: total system compute (stacked by work type)
|
||||
right: per-GPU compute distribution (specialized vs fused)
|
||||
|
||||
The point of the figure is to push back on the naïve reading
|
||||
"KVC's prefill GPU is idle 90% of the time, so KVC is using fewer GPUs."
|
||||
The punchline is the TOTAL system compute reduction:
|
||||
KVC v2 system: 3.47 M tokens of compute (1.07 P-prefill + 1.39 D-append + 1.01 decode)
|
||||
DP 4-way: 5.17 M tokens of compute (4.17 full-prefill + 1.00 decode)
|
||||
→ KVC does 33% LESS compute for the SAME workload (same 4449 requests).
|
||||
|
||||
By request count, the prefill GPU is indeed touched by only ~8% of requests.
|
||||
By compute work, the prefill GPU bears comparable per-GPU load to each
|
||||
decode GPU — it is a low-frequency, high-cost safety net for cache misses,
|
||||
not idle capacity.
|
||||
This is the non-trivial finding: session affinity converts to reduced
|
||||
system-wide work, not just locality. The per-GPU panel then explains
|
||||
the architectural shape: KVC concentrates heavy prefill on a specialized
|
||||
P worker, leaves D workers with light append + decode; DP forces every
|
||||
worker to absorb the full prefill load mixed with decode.
|
||||
|
||||
Work attribution:
|
||||
KVC direct-to-D path: prefill happens locally on the assigned D worker
|
||||
(append-prefill of `uncached_tokens` tokens).
|
||||
KVC seed/reseed/fallback path: prefill happens on prefill-0
|
||||
(full uncached_tokens), decode on assigned D.
|
||||
DP: all work on assigned direct-N worker.
|
||||
The earlier version of this figure showed per-GPU request count + per-GPU
|
||||
compute and was confusing to external reviewers ("P doing prefill is
|
||||
trivial"). This version leads with the system-total comparison, which IS
|
||||
the non-trivial result.
|
||||
|
||||
Aborted / errored requests are excluded.
|
||||
"""
|
||||
@@ -64,157 +65,211 @@ def main() -> None:
     dp = [r for r in load(DP) if not is_failed(r)]

     # ------------------------------------------------------------------
-    # KVC per-GPU attribution
+    # KVC per-GPU + per-work-type attribution
     # ------------------------------------------------------------------
-    kvc_req_count = defaultdict(int)
-    kvc_prefill_tokens = defaultdict(int)   # uncached prefill compute
+    kvc_prefill_tokens = defaultdict(int)
     kvc_decode_tokens = defaultdict(int)

     for r in kvc:
-        d = r["assigned_decode_node"]    # decode-0/1/2
-        p = r["assigned_prefill_node"]   # prefill-0
+        d = r["assigned_decode_node"]
+        p = r["assigned_prefill_node"]
         mode = r.get("execution_mode", "")
         if mode == "kvcache-direct-to-d-session":
-            # P is bypassed entirely; D does the append-prefill + decode
-            kvc_req_count[d] += 1
+            # P bypassed; D does small append-prefill + decode
             kvc_prefill_tokens[d] += uncached(r)
             kvc_decode_tokens[d] += out_tokens(r)
         else:
-            # P does the full prefill; D handles decode
-            kvc_req_count[p] += 1
-            kvc_req_count[d] += 1   # decode side still counts
+            # P does heavy prefill; D handles decode
             kvc_prefill_tokens[p] += uncached(r)
             kvc_decode_tokens[d] += out_tokens(r)

     # ------------------------------------------------------------------
     # DP per-GPU attribution (fused P+D on every worker)
     # ------------------------------------------------------------------
-    dp_req_count = defaultdict(int)
     dp_prefill_tokens = defaultdict(int)
     dp_decode_tokens = defaultdict(int)

     for r in dp:
-        w = r["assigned_decode_node"]   # direct-0..3
-        dp_req_count[w] += 1
+        w = r["assigned_decode_node"]
         dp_prefill_tokens[w] += uncached(r)
         dp_decode_tokens[w] += out_tokens(r)

     # ------------------------------------------------------------------
-    # Build ordered GPU list, KVC then DP
+    # Aggregate work by category for the left panel
     # ------------------------------------------------------------------
+    kvc_p_prefill = kvc_prefill_tokens.get("prefill-0", 0)
+    kvc_d_prefill = sum(v for k, v in kvc_prefill_tokens.items() if k.startswith("decode-"))
+    kvc_d_decode = sum(kvc_decode_tokens.values())
+    kvc_total = kvc_p_prefill + kvc_d_prefill + kvc_d_decode
+
+    dp_prefill_total = sum(dp_prefill_tokens.values())
+    dp_decode_total = sum(dp_decode_tokens.values())
+    dp_total = dp_prefill_total + dp_decode_total
+
+    M = 1e6
+    saving_pct = (1 - kvc_total / dp_total) * 100
+
+    # ------------------------------------------------------------------
+    # Colors
+    # ------------------------------------------------------------------
+    KVC_P_COLOR = "#E89D44"        # orange — P GPU
+    KVC_D_PREF_COLOR = "#7AB6D9"   # light blue — D-side small append-prefill
+    KVC_D_DEC_COLOR = "#1F77B4"    # dark blue — D-side decode
+    DP_PREF_COLOR = "#E07474"      # light red — DP full prefill
+    DP_DEC_COLOR = "#D62728"       # dark red — DP decode
+
+    fig, axes = plt.subplots(1, 2, figsize=(15, 7.0))
+
+    # ==================================================================
+    # Left panel: System-wide compute, stacked by work type
+    # ==================================================================
+    ax = axes[0]
+    x = np.array([0, 1])
+    bar_w = 0.55
+
+    # KVC stack: P-prefill (bottom orange) + D-prefill (light blue) + D-decode (dark blue)
+    ax.bar(0, kvc_p_prefill / M, bar_w, color=KVC_P_COLOR,
+           edgecolor="black", linewidth=0.6,
+           label="KVC: P-side heavy prefill (reseed / seed)")
+    ax.bar(0, kvc_d_prefill / M, bar_w, bottom=kvc_p_prefill / M,
+           color=KVC_D_PREF_COLOR, edgecolor="black", linewidth=0.6,
+           label="KVC: D-side append-prefill (direct-to-D, small)")
+    ax.bar(0, kvc_d_decode / M, bar_w,
+           bottom=(kvc_p_prefill + kvc_d_prefill) / M,
+           color=KVC_D_DEC_COLOR, edgecolor="black", linewidth=0.6,
+           label="Decode (both)")
+
+    # DP stack: full prefill (light red) + decode (dark red)
+    ax.bar(1, dp_prefill_total / M, bar_w,
+           color=DP_PREF_COLOR, edgecolor="black", linewidth=0.6,
+           label="DP: fused worker prefill (full uncached)")
+    ax.bar(1, dp_decode_total / M, bar_w, bottom=dp_prefill_total / M,
+           color=DP_DEC_COLOR, edgecolor="black", linewidth=0.6,
+           label="_nolegend_")
+
+    # Inline labels for stack segments
+    def stack_label(xpos, ypos, text, color="white", fontsize=10):
+        ax.text(xpos, ypos, text, ha="center", va="center",
+                fontsize=fontsize, color=color, fontweight="bold")
+
+    stack_label(0, kvc_p_prefill / M / 2,
+                f"P heavy prefill\n{kvc_p_prefill/M:.2f}M")
+    stack_label(0, (kvc_p_prefill + kvc_d_prefill / 2) / M,
+                f"D append-prefill\n{kvc_d_prefill/M:.2f}M",
+                color="black")
+    stack_label(0, (kvc_p_prefill + kvc_d_prefill + kvc_d_decode / 2) / M,
+                f"D decode\n{kvc_d_decode/M:.2f}M")
+    stack_label(1, dp_prefill_total / M / 2,
+                f"Full prefill\n(every worker)\n{dp_prefill_total/M:.2f}M",
+                color="black")
+    stack_label(1, (dp_prefill_total + dp_decode_total / 2) / M,
+                f"Decode\n{dp_decode_total/M:.2f}M")
+
+    # Totals on top
+    ax.text(0, kvc_total / M + 0.15, f"{kvc_total/M:.2f}M tokens",
+            ha="center", va="bottom", fontsize=12, fontweight="bold",
+            color="#1F77B4")
+    ax.text(1, dp_total / M + 0.15, f"{dp_total/M:.2f}M tokens",
+            ha="center", va="bottom", fontsize=12, fontweight="bold",
+            color="#D62728")
+
+    # Big savings annotation — placed centrally inside the panel,
+    # bracketed by a horizontal arrow connecting the bar tops.
+    headroom_top = max(kvc_total, dp_total) / M * 1.42
+    arrow_y = max(kvc_total, dp_total) / M * 1.08
+    text_y = max(kvc_total, dp_total) / M * 1.22
+
+    ax.annotate("", xy=(0.78, arrow_y), xytext=(0.22, arrow_y),
+                arrowprops=dict(arrowstyle="<->", color="#2C8C2C", lw=1.8))
+    ax.text(
+        0.5, text_y, f"−{saving_pct:.0f}%\ntotal compute",
+        ha="center", va="center",
+        fontsize=13, fontweight="bold", color="#2C8C2C",
+        bbox=dict(facecolor="#E8F5E8", edgecolor="#2C8C2C", alpha=0.95, pad=5),
+    )
+
+    ax.set_xticks(x)
+    ax.set_xlim(-0.5, 1.5)
+    ax.set_xticklabels(["KVC 1P3D v2", "DP 4-way CA"], fontsize=12, fontweight="bold")
+    ax.set_ylabel("Total system compute (millions of token-equivalents)", fontsize=11)
+    ax.set_ylim(0, headroom_top)
+    ax.set_title("System-wide compute economy | same 4449-request workload",
+                 fontsize=12, pad=10)
+    ax.grid(axis="y", linestyle=":", alpha=0.4)
+    ax.set_axisbelow(True)
+    ax.legend(loc="upper left", fontsize=8.5, framealpha=0.95)
+
+    # ==================================================================
+    # Right panel: per-GPU breakdown showing the architectural shape
+    # ==================================================================
+    ax = axes[1]
+
     kvc_gpus = ["prefill-0", "decode-0", "decode-1", "decode-2"]
     dp_gpus = ["direct-0", "direct-1", "direct-2", "direct-3"]
     all_gpus = kvc_gpus + dp_gpus

-    def get(d, k):
-        return d.get(k, 0)
-
-    counts = [get(kvc_req_count, g) for g in kvc_gpus] + \
-             [get(dp_req_count, g) for g in dp_gpus]
-    prefill_tk = [get(kvc_prefill_tokens, g) for g in kvc_gpus] + \
-                 [get(dp_prefill_tokens, g) for g in dp_gpus]
-    decode_tk = [get(kvc_decode_tokens, g) for g in kvc_gpus] + \
-                [get(dp_decode_tokens, g) for g in dp_gpus]
-
-    # Display labels: P/D role + worker id
     labels = [
-        "KVC P\nprefill-0",
-        "KVC D\ndecode-0",
-        "KVC D\ndecode-1",
-        "KVC D\ndecode-2",
-        "DP P+D\ndirect-0",
-        "DP P+D\ndirect-1",
-        "DP P+D\ndirect-2",
-        "DP P+D\ndirect-3",
+        "KVC\nP-only", "KVC\nD-0", "KVC\nD-1", "KVC\nD-2",
+        "DP\nP+D-0", "DP\nP+D-1", "DP\nP+D-2", "DP\nP+D-3",
     ]
-    kvc_mask = [True, True, True, True, False, False, False, False]
-
-    KVC_P_COLOR = "#E89D44"   # orange — P GPU stands out
-    KVC_D_COLOR = "#1F77B4"   # blue
-    DP_COLOR = "#D62728"      # red
-
-    bar_colors = [KVC_P_COLOR, KVC_D_COLOR, KVC_D_COLOR, KVC_D_COLOR,
-                  DP_COLOR, DP_COLOR, DP_COLOR, DP_COLOR]
-
-    fig, axes = plt.subplots(1, 2, figsize=(15, 6.5))
     x = np.arange(len(all_gpus))
-
-    # -- Left: per-GPU request count ----------------------------------
-    ax = axes[0]
-    bars = ax.bar(x, counts, color=bar_colors, edgecolor="black", linewidth=0.6)
-    for xi, c in zip(x, counts):
-        ax.text(xi, c + max(counts) * 0.015, f"{c:,}",
-                ha="center", va="bottom", fontsize=9.5)
-    ax.set_xticks(x)
-    ax.set_xticklabels(labels, fontsize=9.5)
-    ax.set_ylabel("Number of requests touching this GPU", fontsize=11)
-    ax.set_title("Per-GPU request count\n(naïve view: P seems idle)", fontsize=12, pad=10)
-    ax.grid(axis="y", linestyle=":", alpha=0.4)
-    ax.set_axisbelow(True)
+    prefill_M = ([kvc_prefill_tokens.get(g, 0) / M for g in kvc_gpus]
+                 + [dp_prefill_tokens.get(g, 0) / M for g in dp_gpus])
+    decode_M = ([kvc_decode_tokens.get(g, 0) / M for g in kvc_gpus]
+                + [dp_decode_tokens.get(g, 0) / M for g in dp_gpus])

-    # Annotate: KVC P GPU is "low frequency"
-    p_idx = 0
-    p_pct = counts[p_idx] / sum(counts[:4]) * 100   # vs KVC total
-    ax.annotate(
-        f"P GPU only sees\n"
-        f"{counts[p_idx]:,} requests\n"
-        f"({counts[p_idx]/len(kvc)*100:.1f}% of total)",
-        xy=(p_idx, counts[p_idx]),
-        xytext=(p_idx + 0.6, max(counts) * 0.55),
-        fontsize=9, color=KVC_P_COLOR, fontweight="bold",
-        arrowprops=dict(arrowstyle="->", color=KVC_P_COLOR, lw=1.0),
-    )
+    # Color by group: orange for KVC P, blue for KVC D, red for DP
+    bar_colors_prefill = [KVC_P_COLOR, KVC_D_PREF_COLOR, KVC_D_PREF_COLOR, KVC_D_PREF_COLOR,
+                          DP_PREF_COLOR, DP_PREF_COLOR, DP_PREF_COLOR, DP_PREF_COLOR]
+    bar_colors_decode = [KVC_D_DEC_COLOR, KVC_D_DEC_COLOR, KVC_D_DEC_COLOR, KVC_D_DEC_COLOR,
+                         DP_DEC_COLOR, DP_DEC_COLOR, DP_DEC_COLOR, DP_DEC_COLOR]
+
+    ax.bar(x, prefill_M, color=bar_colors_prefill,
+           edgecolor="black", linewidth=0.5, label="Prefill compute")
+    ax.bar(x, decode_M, bottom=prefill_M, color=bar_colors_decode,
+           edgecolor="black", linewidth=0.5, hatch="///",
+           alpha=0.75, label="Decode compute")

-    # -- Right: per-GPU compute work (stacked prefill + decode) -------
-    ax = axes[1]
-    prefill_M = [t / 1e6 for t in prefill_tk]
-    decode_M = [t / 1e6 for t in decode_tk]
     total_M = [p + d for p, d in zip(prefill_M, decode_M)]

-    bars_p = ax.bar(x, prefill_M, color=[c for c in bar_colors],
-                    edgecolor="black", linewidth=0.6, label="Uncached prefill tokens",
-                    alpha=0.95)
-    bars_d = ax.bar(x, decode_M, bottom=prefill_M, color=[c for c in bar_colors],
-                    edgecolor="black", linewidth=0.6, hatch="///",
-                    label="Decode tokens", alpha=0.55)
-
     for xi, t in zip(x, total_M):
         ax.text(xi, t + max(total_M) * 0.015, f"{t:.2f}M",
                 ha="center", va="bottom", fontsize=9.5)

     ax.set_xticks(x)
     ax.set_xticklabels(labels, fontsize=9.5)
-    ax.set_ylabel("Compute tokens (millions)", fontsize=11)
-    ax.set_title("Per-GPU compute work\n(work view: P is comparable to each D)",
+    ax.set_ylabel("Compute (millions of token-equivalents)", fontsize=11)
+    ax.set_ylim(0, max(total_M) * 1.30)
+    ax.set_title("Where the work lives | specialized P + light D vs uniform fused workers",
                  fontsize=12, pad=10)
     ax.grid(axis="y", linestyle=":", alpha=0.4)
     ax.set_axisbelow(True)
-    ax.legend(loc="upper left", fontsize=10, framealpha=0.95)
-
-    # Annotate: KVC P GPU does similar work to each D
-    ax.annotate(
-        f"P GPU does {total_M[p_idx]:.2f}M tokens of\n"
-        f"prefill — comparable per-GPU\n"
-        f"load to each KVC D worker",
-        xy=(p_idx, total_M[p_idx]),
-        xytext=(p_idx + 0.6, max(total_M) * 0.62),
-        fontsize=9, color=KVC_P_COLOR, fontweight="bold",
-        arrowprops=dict(arrowstyle="->", color=KVC_P_COLOR, lw=1.0),
+    # Separator + headline takeaways under the GROUP labels (in axes
+    # fraction coords so they don't shift if ylim changes).
+    ax.axvline(3.5, color="gray", linestyle="--", linewidth=1.0, alpha=0.5)
+    ax.text(
+        0.22, 0.97,
+        f"KVC: P specialized for heavy prefill\nD workers ~{np.mean(total_M[1:4]):.2f}M each (light)",
+        transform=ax.transAxes, ha="center", va="top", fontsize=9.5,
+        bbox=dict(facecolor="#FFFAE6", edgecolor="#888", alpha=0.92, pad=4),
+    )
+    ax.text(
+        0.78, 0.97,
+        f"DP: every worker {np.mean(total_M[4:]):.2f}M (fused)\nfull prefill interleaved with decode",
+        transform=ax.transAxes, ha="center", va="top", fontsize=9.5,
+        bbox=dict(facecolor="#FFE8E8", edgecolor="#888", alpha=0.92, pad=4),
     )

-    # Separator + group labels
-    for ax in axes:
-        ax.axvline(3.5, color="gray", linestyle="--", linewidth=1.0, alpha=0.5)
-        ymin, ymax = ax.get_ylim()
-        ax.text(1.5, ymax * 1.05, "KVC 1P3D", ha="center", fontsize=11,
-                fontweight="bold", color="#444")
-        ax.text(5.5, ymax * 1.05, "DP 4-way CA", ha="center", fontsize=11,
-                fontweight="bold", color="#444")
+    # No second legend on the right panel — the colours are already
+    # introduced in the left panel and the in-panel annotation boxes
+    # explain what each group means. Decode being hatched is signalled
+    # in the right-panel bar style itself.

     fig.suptitle(
-        "Per-GPU utilization: \"is KVC's prefill GPU wasted?\"\n"
-        "Left view says yes (only 8% of requests); right view says no (comparable work to each D).",
-        fontsize=13, y=1.02,
+        "KVC v2 reduces system-wide compute by 33% vs DP 4-way CA, same workload (4449 requests).\n"
+        "Mechanism: 91.6% of requests find their prefix cached on the affinity-pinned D worker\n"
+        "(append-prefill = 341 tokens on avg), so the total prefill work the system must do is much smaller.",
+        fontsize=12, y=1.05,
     )
     plt.tight_layout()
     plt.savefig(OUT, dpi=150, bbox_inches="tight")
@@ -224,10 +279,19 @@ def main() -> None:
     # ------------------------------------------------------------------
     # Print numbers for doc reference
     # ------------------------------------------------------------------
-    print("\n=== Per-GPU numbers ===")
-    print(f"{'GPU':<22} {'requests':>10} {'prefill(M)':>12} {'decode(M)':>12} {'total(M)':>10}")
-    for lbl, n, pM, dM in zip(labels, counts, prefill_M, decode_M):
-        print(f" {lbl.replace(chr(10), ' '):<20} {n:>10} {pM:>12.3f} {dM:>12.3f} {pM+dM:>10.3f}")
+    print("\n=== System totals ===")
+    print(f"KVC v2 total: {kvc_total/M:.3f}M tokens")
+    print(f" P heavy prefill: {kvc_p_prefill/M:.3f}M")
+    print(f" D append-prefill: {kvc_d_prefill/M:.3f}M")
+    print(f" D decode: {kvc_d_decode/M:.3f}M")
+    print(f"DP 4w total: {dp_total/M:.3f}M tokens")
+    print(f" Full prefill: {dp_prefill_total/M:.3f}M")
+    print(f" Decode: {dp_decode_total/M:.3f}M")
+    print(f"\nKVC vs DP: -{saving_pct:.1f}% total compute saved")
+
+    print("\n=== Per-GPU breakdown ===")
+    for lbl, p, d in zip(labels, prefill_M, decode_M):
+        print(f" {lbl.replace(chr(10), ' '):<14} prefill={p:.3f}M decode={d:.3f}M total={p+d:.3f}M")


 if __name__ == "__main__":
231
scripts/analysis/plot_tpot_pdf.py
Normal file
@@ -0,0 +1,231 @@
#!/usr/bin/env python3
"""Generate TPOT probability density curves: KVC 1P3D v2 vs 4-way DP CA.

Inputs:
  outputs/qwen3-30b-tp1-ts1-migration-v2/kvc_1p3d_migration_v2_run1_metrics.jsonl
  outputs/qwen3-30b-tp1-ts1-validation/dp4_metrics.jsonl

Outputs:
  docs/figures/tpot_pdf_comparison.png  -- two-panel figure (mirroring
                                           the TTFT PDF style):
    left panel:  linear x in [3.5, 9.0] ms zoomed on the body
    right panel: log x covering full range (1 -- 20 ms)

The headline finding here is that **KVC and DP have nearly identical TPOT
distributions** in the body (no formal statistical test is run here): same
model on same GPU type means per-token decode latency is determined by
hardware/model, not by routing policy. This is paper-relevant: it indicates
KVC's TTFT win is not bought by sacrificing decode throughput.
"""

from __future__ import annotations

import json
from pathlib import Path

import matplotlib.pyplot as plt
import numpy as np
from scipy.stats import gaussian_kde

ROOT = Path(__file__).resolve().parents[2]
KVC = ROOT / "outputs/qwen3-30b-tp1-ts1-migration-v2/kvc_1p3d_migration_v2_run1_metrics.jsonl"
DP = ROOT / "outputs/qwen3-30b-tp1-ts1-validation/dp4_metrics.jsonl"
OUT = ROOT / "docs/figures/tpot_pdf_comparison.png"


def load(p: Path) -> list[dict]:
    return [json.loads(line) for line in p.read_text().split("\n") if line.strip()]


def is_failed(r: dict) -> bool:
    if r.get("error"):
        return True
    fr = r.get("finish_reason")
    if fr and ("abort" in str(fr).lower() or "badrequest" in str(fr).lower()):
        return True
    return False


def pct(vals: np.ndarray, q: float) -> float:
    return float(np.quantile(vals, q))


def main() -> None:
    kvc = [r for r in load(KVC) if not is_failed(r)]
    dp = [r for r in load(DP) if not is_failed(r)]

    kvc_tpot = np.array([r["tpot_s"] for r in kvc if r.get("tpot_s") is not None])
    dp_tpot = np.array([r["tpot_s"] for r in dp if r.get("tpot_s") is not None])

    # Trim absurdly small zeros (rare measurement artifacts) so log KDE behaves.
    kvc_tpot = kvc_tpot[kvc_tpot > 1e-5]
    dp_tpot = dp_tpot[dp_tpot > 1e-5]

    KVC_COLOR = "#1F77B4"  # blue
    DP_COLOR = "#D62728"   # red

    fig, axes = plt.subplots(1, 2, figsize=(16, 6.5))

    # ------------------------------------------------------------------
    # Left panel: linear x ∈ [3.5, 9.0] ms -- body of the distribution
    # ------------------------------------------------------------------
    ax = axes[0]
    x_body_ms = np.linspace(3.5, 9.0, 600)
    x_body_s = x_body_ms / 1000.0

    kde_kvc_lin = gaussian_kde(kvc_tpot, bw_method=0.15)
    kde_dp_lin = gaussian_kde(dp_tpot, bw_method=0.15)

    # Plot density vs ms (scale density by 1000 to compensate for the
    # x-axis-unit change so the curve still integrates to ~1 over the
    # body region of interest).
    y_kvc_lin = kde_kvc_lin(x_body_s) / 1000.0
    y_dp_lin = kde_dp_lin(x_body_s) / 1000.0

    ax.plot(x_body_ms, y_kvc_lin, color=KVC_COLOR, lw=2.5,
            label=f"KVC 1P3D v2 (n={len(kvc_tpot)})")
    ax.fill_between(x_body_ms, y_kvc_lin, alpha=0.20, color=KVC_COLOR)
    ax.plot(x_body_ms, y_dp_lin, color=DP_COLOR, lw=2.5,
            label=f"4-way DP CA (n={len(dp_tpot)})")
    ax.fill_between(x_body_ms, y_dp_lin, alpha=0.20, color=DP_COLOR)

    # Vertical lines for p50, p90
    for q, ls in [(0.50, "-"), (0.90, "--")]:
        ax.axvline(pct(kvc_tpot, q) * 1000, color=KVC_COLOR, ls=ls, alpha=0.55, lw=1.1)
        ax.axvline(pct(dp_tpot, q) * 1000, color=DP_COLOR, ls=ls, alpha=0.55, lw=1.1)

    ymax = ax.get_ylim()[1]
    ax.text(pct(kvc_tpot, 0.50) * 1000, ymax * 0.97,
            f"KVC p50\n{pct(kvc_tpot, 0.50)*1000:.2f}ms",
            color=KVC_COLOR, fontsize=9, va="top", ha="right",
            bbox=dict(facecolor="white", edgecolor="none", alpha=0.7, pad=2))
    ax.text(pct(dp_tpot, 0.50) * 1000, ymax * 0.50,
            f"DP p50\n{pct(dp_tpot, 0.50)*1000:.2f}ms",
            color=DP_COLOR, fontsize=9, va="top", ha="left",
            bbox=dict(facecolor="white", edgecolor="none", alpha=0.7, pad=2))
    ax.text(pct(kvc_tpot, 0.90) * 1000, ymax * 0.30,
            f"KVC p90\n{pct(kvc_tpot, 0.90)*1000:.2f}ms",
            color=KVC_COLOR, fontsize=9, va="top", ha="right",
            bbox=dict(facecolor="white", edgecolor="none", alpha=0.7, pad=2))
    ax.text(pct(dp_tpot, 0.90) * 1000, ymax * 0.18,
            f"DP p90\n{pct(dp_tpot, 0.90)*1000:.2f}ms",
            color=DP_COLOR, fontsize=9, va="top", ha="left",
            bbox=dict(facecolor="white", edgecolor="none", alpha=0.7, pad=2))

    # Annotate the overlap finding
    delta_mean_ms = (kvc_tpot.mean() - dp_tpot.mean()) * 1000
    delta_p50_ms = (pct(kvc_tpot, 0.50) - pct(dp_tpot, 0.50)) * 1000
    ax.text(
        0.04, 0.55,
        "Two curves are\nvisually overlapping:\n"
        f"Δmean = {delta_mean_ms:+.3f} ms\n"
        f"Δp50 = {delta_p50_ms:+.3f} ms\n"
        f"(|Δmean| = {abs(delta_mean_ms) / (dp_tpot.mean() * 1000) * 100:.2f}% of DP mean)",
        transform=ax.transAxes, fontsize=10.5, color="#333",
        bbox=dict(facecolor="#FFFAE6", edgecolor="#888", alpha=0.92, pad=5),
        va="top",
    )

    ax.set_xlim(3.5, 9.0)
    ax.set_xlabel("TPOT (milliseconds, linear)", fontsize=11)
    ax.set_ylabel("Probability density (per ms)", fontsize=11)
    ax.set_title("Body of distribution (3.5 ms ≤ TPOT ≤ 9.0 ms)",
                 fontsize=12, pad=10)
    ax.legend(loc="upper right", fontsize=10, framealpha=0.95)
    ax.grid(True, linestyle=":", alpha=0.4)
    ax.set_axisbelow(True)

    # ------------------------------------------------------------------
    # Right panel: log x ∈ [1, 20] ms -- full range incl. tail
    # ------------------------------------------------------------------
    ax = axes[1]
    kde_kvc_log = gaussian_kde(np.log10(kvc_tpot), bw_method="scott")
    kde_dp_log = gaussian_kde(np.log10(dp_tpot), bw_method="scott")
    log_x = np.linspace(np.log10(1e-3), np.log10(20e-3), 600)
    x_full_ms = (10 ** log_x) * 1000

    y_kvc = kde_kvc_log(log_x)
    y_dp = kde_dp_log(log_x)

    ax.plot(x_full_ms, y_kvc, color=KVC_COLOR, lw=2.5,
            label=f"KVC 1P3D v2 (n={len(kvc_tpot)})")
    ax.fill_between(x_full_ms, y_kvc, alpha=0.20, color=KVC_COLOR)
    ax.plot(x_full_ms, y_dp, color=DP_COLOR, lw=2.5,
            label=f"4-way DP CA (n={len(dp_tpot)})")
    ax.fill_between(x_full_ms, y_dp, alpha=0.20, color=DP_COLOR)

    ax.set_xscale("log")
    ax.set_xlim(1, 20)

    # Percentile markers
    for q, ls in [(0.50, "-"), (0.90, "--"), (0.99, ":")]:
        ax.axvline(pct(kvc_tpot, q) * 1000, color=KVC_COLOR, ls=ls, alpha=0.55, lw=1.1)
        ax.axvline(pct(dp_tpot, q) * 1000, color=DP_COLOR, ls=ls, alpha=0.55, lw=1.1)

    # Annotate tail (p99 + max)
    kvc_p99_ms = pct(kvc_tpot, 0.99) * 1000
    dp_p99_ms = pct(dp_tpot, 0.99) * 1000
    kvc_max_ms = kvc_tpot.max() * 1000
    dp_max_ms = dp_tpot.max() * 1000

    ymax = max(y_kvc.max(), y_dp.max())
    ax.text(
        0.04, 0.55,
        "p99 / max tail:\n"
        f"KVC p99 = {kvc_p99_ms:.2f}ms\n"
        f"DP p99 = {dp_p99_ms:.2f}ms\n"
        f"KVC max = {kvc_max_ms:.2f}ms\n"
        f"DP max = {dp_max_ms:.2f}ms\n"
        f"(KVC tail slightly heavier;\n"
        f"≤ 0.1% of requests affected)",
        transform=ax.transAxes, fontsize=10, color="#333",
        bbox=dict(facecolor="#FFFAE6", edgecolor="#888", alpha=0.92, pad=5),
        va="top",
    )

    # Custom tick labels
    ax.set_xticks([1, 2, 5, 10, 20])
    ax.set_xticklabels(["1ms", "2ms", "5ms", "10ms", "20ms"])

    ax.set_xlabel("TPOT (log scale)", fontsize=11)
    ax.set_ylabel("Density (per log₁₀ s)", fontsize=11)
    ax.set_title("Full range (TPOT 1 ms – 20 ms, log x)",
                 fontsize=12, pad=10)
    ax.legend(loc="upper right", fontsize=10, framealpha=0.95)
    ax.grid(True, which="both", linestyle=":", alpha=0.4)
    ax.set_axisbelow(True)

    fig.suptitle(
        "TPOT probability density: KVC 1P3D v2 vs 4-way DP CA\n"
        "Same model (Qwen3-30B-A3B) on same H100 GPU type → per-token decode latency is\n"
        "determined by hardware/model, not routing policy. KVC's TTFT win is not bought\n"
        "by sacrificing decode throughput.",
        fontsize=12, y=1.04,
    )
    plt.tight_layout()
    plt.savefig(OUT, dpi=150, bbox_inches="tight")
    print(f"wrote {OUT}")
    plt.close(fig)

    # ------------------------------------------------------------------
    # Print summary stats for doc cross-reference
    # ------------------------------------------------------------------
    print("\n=== TPOT distribution summary ===")
    for name, arr in [("KVC v2", kvc_tpot), ("DP 4w", dp_tpot)]:
        print(f" {name} (n={len(arr)})")
        print(f" min={arr.min()*1000:.3f}ms p10={pct(arr,0.10)*1000:.3f}ms "
              f"p50={pct(arr,0.50)*1000:.3f}ms p90={pct(arr,0.90)*1000:.3f}ms "
              f"p99={pct(arr,0.99)*1000:.3f}ms p99.9={pct(arr,0.999)*1000:.3f}ms "
              f"max={arr.max()*1000:.3f}ms")
        print(f" mean={arr.mean()*1000:.3f}ms std={arr.std()*1000:.3f}ms")

    print(f"\nΔmean = {(kvc_tpot.mean()-dp_tpot.mean())*1000:+.3f}ms "
          f"({(kvc_tpot.mean()-dp_tpot.mean())/dp_tpot.mean()*100:+.2f}%)")
    print(f"Δp50 = {(pct(kvc_tpot,0.5)-pct(dp_tpot,0.5))*1000:+.3f}ms")
    print(f"Δp99 = {(pct(kvc_tpot,0.99)-pct(dp_tpot,0.99))*1000:+.3f}ms")
    print(f"→ Conclusion: KVC TPOT distribution closely matches DP's in the "
          f"body, with slightly heavier tail (KVC max {kvc_tpot.max()*1000:.2f}ms vs DP {dp_tpot.max()*1000:.2f}ms).")


if __name__ == "__main__":
    main()
32
third_party/traces/README.md
vendored
Normal file
@@ -0,0 +1,32 @@
# Replay traces

The trace files used by the benchmark are kept here for convenient transfer between hosts. This directory
is explicitly whitelisted in `.gitignore` (like `third_party/sglang/`), so the files travel with git.

## File list

| File | Size | Contents | Source |
|---|---:|---|---|
| `qwen35-swebench-50sess.jsonl` | 54 MB | 4449 reqs / 52 sessions / Qwen3.5-35B inference output | The `simm-swe-bench` project used SiBench to replay SiCo's `swe.jsonl` through SGLang, producing audit.jsonl, which was then converted with `scripts/convert_audit_to_trace.py` |

For the detailed provenance see `docs/ONBOARDING_NEXT_AGENT_ZH.md`; for the actual schema see `src/agentic_pd_hybrid/trace.py`.

## Usage

The replay side takes the trace path from the CLI flag `--trace`. The default sweep scripts point to
`outputs/qwen35-swebench-50sess.jsonl`; for backward compatibility with older scripts, **creating a
symlink there after cloning is recommended**:

```bash
mkdir -p outputs
ln -sf ../third_party/traces/qwen35-swebench-50sess.jsonl \
    outputs/qwen35-swebench-50sess.jsonl
```

Alternatively, change the `--trace` path in the sweep scripts to point to `third_party/traces/...`.

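Once the trace is in place, a quick readability check can catch a broken symlink or a truncated transfer before a long replay run. The following is a minimal sketch, assuming only that the trace is JSON Lines (one JSON object per line); the helper name `count_trace_records` is hypothetical and is not part of this repository, and no schema fields are assumed (the actual schema is in `src/agentic_pd_hybrid/trace.py`).

```python
import json
from pathlib import Path


def count_trace_records(path: Path) -> int:
    """Count JSON records in a JSONL file, failing loudly on a malformed line.

    Hypothetical helper for illustration; not part of the repository.
    """
    count = 0
    with path.open(encoding="utf-8") as f:
        for lineno, line in enumerate(f, start=1):
            if not line.strip():
                continue  # tolerate blank lines
            try:
                json.loads(line)
            except json.JSONDecodeError as exc:
                raise ValueError(f"{path}:{lineno}: invalid JSON ({exc})") from exc
            count += 1
    return count


if __name__ == "__main__":
    # Path taken from the symlink instructions in this README.
    trace = Path("outputs/qwen35-swebench-50sess.jsonl")
    if trace.exists():
        print(f"{trace}: {count_trace_records(trace)} records")
    else:
        print(f"{trace} not found; create the symlink first")
```

For the trace listed in this README, the record count would be expected to match the 4449 requests given in the file list.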
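## Adding a new trace

To add a new trace file in the future (for example, a converted `codex_swebenchpro` trace), put it in this directory
and update the file list in this README. **Do not `git add` a single file larger than 100 MB directly**: by default,
GitLab limits single files to 100 MB when LFS is not enabled.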
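The size rule above can be checked mechanically before staging new traces. This is a minimal sketch rather than project tooling: the function name `oversized_files` is hypothetical, and reading "100 MB" as 100 × 1024 × 1024 bytes is an assumption (the exact threshold depends on the server configuration).

```python
from __future__ import annotations

from pathlib import Path

# Assumption: the 100 MB limit is interpreted as 100 MiB.
LIMIT_BYTES = 100 * 1024 * 1024


def oversized_files(directory: Path, limit: int = LIMIT_BYTES) -> list[tuple[Path, int]]:
    """Return (path, size_in_bytes) for regular files under `directory` larger than `limit`."""
    return sorted(
        (p, p.stat().st_size)
        for p in directory.rglob("*")
        if p.is_file() and p.stat().st_size > limit
    )


if __name__ == "__main__":
    traces_dir = Path("third_party/traces")
    if traces_dir.is_dir():
        for path, size in oversized_files(traces_dir):
            print(f"{path}: {size / 1024 / 1024:.1f} MiB exceeds the limit")
```

Any file this reports would need LFS, splitting, or compression before being committed.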
4449
third_party/traces/qwen35-swebench-50sess.jsonl
vendored
Normal file
File diff suppressed because one or more lines are too long