profile(kvc): rewrite v5+profile report after critic audit + P0/P1 instrument

Hostile audit of the original report flagged three load-bearing errors:

1. held_tokens semantic was inverted. session_held_tokens() at
   session_aware_cache.py:278-282 sums (kv_allocated_len - cache_protected_len)
   per slot, i.e. the slot-private part (NOT in the radix tree). So
   "other = cap - held - avail" actually CONTAINS the radix-tree protected
   prefix cache (likely the single biggest component for shared agentic
   prefixes), not just running batch + in-flight as the original report claimed.
2. Admission-race causal hypothesis for the 415 EXP2+profile errors is
   contradicted by the data: 414/415 errors have kv_transfer_blocks > 0. They
   passed admission and died downstream ("generate stream ended before
   producing any token", raised by the client when a 200 response had an
   empty stream).
3. Polling deconfound was too quickly dismissed. Mode counts shift ~1:1
   (session-cap-fb -356 / kvcache-centric +406), and /server_info is not a
   passive read: it dispatches into the scheduler main loop and iterates
   every session slot.

Plus:
- per-D error% is confounded by sticky session affinity (only 18 unique
  sessions cause the 415 errors; decode-3 had 0 errors only because no
  high-error session landed there);
- decile 10 "recovery" was an equal-time binning artifact (24.5% under
  equal-count);
- the v5 vs v5+profile time gap was 21h, not 6h;
- the p50/p90 latency comparison is N=1.

Rewritten report (docs/V5_PROFILE_INVESTIGATION_ZH.md) marks each correction
with ⚠️ and demotes admission-race to one of four hypotheses (H1-H4).

Action items split into P0 (verify, must do first) and P1 (instrument):

P0: scripts/sweep_tp1_v5_baseline_rerun_exp2.sh runs 3x v5 baseline EXP2 (no
polling, identical config to the original v5 run) to test whether the 9-error
baseline result is reproducible. If 3 runs give ~9 errors and profile gives
415, polling is the leading suspect. Currently running in background.

P1: scheduler.py:_compute_pool_breakdown_for_diagnostics adds a read-only
"pool_breakdown" dict to /server_info covering: radix_evictable_tokens,
radix_protected_tokens, slot_private_held_tokens, session_slot_count,
running_batch_{reqs,kv_tokens}, transfer_queue_{reqs,tokens},
prealloc_queue_{reqs,tokens}, retracted_queue_{reqs,tokens}. With these,
"unaccounted = cap - sum(known)" exposes true leakage. replay.py captures all
fields into the per-tick row; analyzer prints the decomposition and gracefully
handles old timeseries (prints "P1 instrument absent").

Mock-tested end-to-end. SGLang patch is read-only and does not affect
admission/scheduling. Old v5+profile data still analyzes correctly.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
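The "unaccounted = cap - sum(known)" arithmetic from the P1 item can be sketched for a single timeseries row as below. This is a minimal illustration, not the analyzer's code: the field names follow the pool_breakdown keys listed in this commit, while `ADDITIVE_BUCKETS`, `unaccounted_tokens`, and the numbers in `example_row` are hypothetical and invented for illustration. It assumes, as the analyzer comment does, that running_batch_kv_tokens overlaps with radix_protected_tokens and so is not subtracted.

```python
from typing import Mapping

# Buckets assumed to be additive. running_batch_kv_tokens is left out on
# purpose: the commit notes that it overlaps with radix_protected_tokens.
ADDITIVE_BUCKETS = (
    "available_tokens",
    "radix_evictable_tokens",
    "radix_protected_tokens",
    "slot_private_held_tokens",
    "transfer_queue_tokens",
    "prealloc_queue_tokens",
    "retracted_queue_tokens",
)


def unaccounted_tokens(row: Mapping[str, int]) -> int:
    """Capacity minus the sum of known additive buckets for one tick.

    A negative value is returned unchanged: it points at double counting
    between buckets rather than at leaked KV memory.
    """
    capacity = int(row.get("capacity_tokens") or 0)
    known = sum(int(row.get(name) or 0) for name in ADDITIVE_BUCKETS)
    return capacity - known


# Hypothetical tick; the values are made up and are not measured data.
example_row = {
    "capacity_tokens": 92086,
    "available_tokens": 30000,
    "radix_evictable_tokens": 10000,
    "radix_protected_tokens": 40000,
    "slot_private_held_tokens": 8000,
    "running_batch_kv_tokens": 35000,  # overlaps with radix_protected; ignored
    "transfer_queue_tokens": 2000,
    "prealloc_queue_tokens": 1000,
    "retracted_queue_tokens": 0,
}
print(unaccounted_tokens(example_row))
```

Unlike the analyzer in this commit, the sketch does not clamp the result at zero, so an over-counted row stays visible as a negative number.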
2026-04-29 22:29:21 +08:00
parent 51f5386691
commit 4978c0d0cd
5 changed files with 567 additions and 1 deletions
--- a/docs/V5_PROFILE_INVESTIGATION_ZH.md
+++ b/docs/V5_PROFILE_INVESTIGATION_ZH.md
@@ -0,0 +1,305 @@
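+# v5+Profile Investigation Report (revised after critic audit)
+
+**Date**: 2026-04-29 (original draft) / 2026-04-29 (revised after audit)
+**Experiment setup**: Qwen3-30B-A3B (TP1), single node with 8×H100 80GB, trace = qwen35-swebench-50sess.jsonl (4449 reqs / 52 sessions), time-scale=10, concurrency=32
+**Dataset**: `outputs/qwen3-30b-tp1-v5-optD-profile/` (EXP1 1P7D + EXP2 2P6D, both with 1 Hz `/server_info` time-series sampling added)
+**v5 baseline for comparison**: `outputs/qwen3-30b-tp1-v5-optD/` (no polling)
+**Research question**: v5 (Option D) cut errors from 9-10% to 0.2%, but session-cap fallback rose to 46-51%. Where do the fallbacks and errors actually come from?
+
+> **This is the revised version produced after a hostile audit.** The original draft contained several erroneous conclusions (notably an inverted reading of `held_tokens` semantics, over-attribution to an admission race, and underestimating the side effects of polling). The audit notes are kept in this session's record; key corrections are marked with ⚠️.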
+
+---
+
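+## TL;DR (revised)
+
+1. **Actual capacity**: each D has `token_to_kv_pool_allocator.size = 92086 tokens (~92K)`. ⚠️ The real per-turn footprint is **not 50-100K**; `cached_tokens` has p50=18K, p90=48K, p99=67K. The original draft overstated it.
+2. **The reading of `other = capacity − held − available` has been revised**: ⚠️ `held_tokens = sum(slot.kv_allocated_len − slot.cache_protected_len)` (code: `session_aware_cache.py:278-282`), i.e. the part a slot holds that is **not protected by the radix tree**. So **the largest single component of `other` is most likely the radix-tree-protected shared prefix cache**, which is usually desirable and **not pathological waste**. The original draft was wrong to attribute all of `other` to running batch + in-flight transfers.
+3. **The bimodal distribution of `other` is real** (p50 ≈ 0, p90 ≈ 80K), but `cap−held−avail` alone cannot tell whether this is natural radix-cache accumulation or burst working memory. **The P1 breakdown instrument must come first**.
+4. **Errors and `other` are correlated in time**, which is real, but this **cannot be read as causation**. Several variables change over the same window (request concurrency, in-flight transfers, available space); time alignment alone does not support the inference that "`other` ate the space that had been freed".
+5. **EXP2 2P6D errors 9 → 415**: ⚠️ **polling is promoted to the leading hypothesis** rather than "unrelated". Evidence: execution modes shift roughly 1:1 (`session-cap-fb` −356 / `kvcache-centric` +406), and `/server_info` is not a passive read: it runs inside the scheduler main loop and iterates over every session slot to compute `is_idle`. Three P0 baseline reruns are needed to test this.
+6. **Errors are concentrated in 18 sessions** (out of 52), each pinned to a single D. Differences in per-D error rate **cannot be explained as structural differences between Ds**; they reflect how the 18 "bad sessions" were routed.
+7. **The better latency of v5+profile 1P7D over baseline** is entirely within single-run variance. N=1, so it **cannot support any performance conclusion**.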
+
+---
+
+## 1. Methodology
+
+### 1.1 Instrument changes
+- `src/agentic_pd_hybrid/replay.py` adds `_query_pool_snapshot` + `_poll_pool_timeseries`; a background asyncio task polls `/server_info` on every P/D worker at the interval set by `--pool-poll-interval-s 1.0`.
+- Each tick writes one jsonl row to `<run_dir>/d-pool-timeseries.jsonl`, with fields: `{worker_id, worker_role, session_count, resident_session_count, held_tokens, available_tokens, capacity_tokens, idle_evictable_*, sessions[], kvcache_mem_gb, last_gen_throughput, ...}`.
+- Analysis script: `scripts/analysis/analyze_pool_timeseries.py`.
+
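+### 1.2 Field definitions (revised ⚠️)
+`/server_info` → `internal_states[0].session_cache` comes from `session_controller.py:get_streaming_session_cache_status` → `tree_cache` (`SessionAwareCache`).
+
+| Field | Actual meaning | Notes |
+|---|---|---|
+| `held_tokens` | `sum_over_slots(ceil(kv_allocated_len, page_size) − cache_protected_len)` | **Not** "everything a session occupies in the cache"; counts only the **slot-private part not protected by the radix tree** |
+| `cache_protected_len` | Shared-prefix portion protected by the radix tree | Counted once when shared by multiple sessions |
+| `available_tokens` | `token_to_kv_pool_allocator.available_size()` | Free space remaining in the global KV pool |
+| `capacity_tokens` | `allocator.size` | Total KV capacity of one D = 92086 |
+| `idle_evictable_tokens` | Portion of held that LRU can evict immediately (all of the session's reqs finished + streaming mode) | |
+
+Therefore:
+- **`other = capacity − held − available`** includes, but is not limited to:
+  - **shared-prefix tokens protected by the radix tree** (possibly the bulk) ⚠️ omitted in the original draft
+  - KV slots occupied by the current running batch
+  - temporary buffers for in-flight P→D transfers
+  - blocks registered with mooncake but not yet committed to tree_cache
+  - internal fragmentation / allocator metadata
+
+**Implication**: until the P1 instrument is in place, we **cannot separate** the share of `other` that is "radix cache" (benign) from the share that is "burst working set / fragmentation" (possibly pathological).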
+
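+### 1.3 Configuration consistency and risks
+- The only difference between v5+profile and the v5 baseline: `--pool-poll-interval-s 1.0` was added (all other CLI arguments are identical).
+- **The two runs are ~21 hours apart** (2026-04-28 15:39/16:27 vs 2026-04-29 12:08/12:59) ⚠️ the original draft wrongly said ~6h. Same machine, but GPU temperature, PCIe, and NUMA allocation were not controlled.
+- **An N=1 comparison has no statistical significance**; any latency difference < 30% falls within plausible single-run variance.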
+
+---
+
+## 2. Overall performance comparison
+
+| Metric | v5 1P7D | **v5+profile 1P7D** | v5 2P6D | **v5+profile 2P6D** |
+|---|---|---|---|---|
+| Total requests | 4449 | 4449 | 4449 | 4449 |
+| **errors** | 9 (0.2%) | 6 (0.1%) | 9 (0.2%) | **415 (9.3%)** |
+| truncated | 42 | 43 | 42 | 42 |
+| direct-to-D | 44.7% | 54.9% | 41.3% | 41.1% |
+| session-cap fallback | 45.6% | 36.1% | 50.6% | 42.6% |
+| no-d-capacity | 1.2% | 0.7% | 0.8% | 0.6% |
+| pd-router-d-session-reseed | 4.8% | 4.3% | 3.4% | 2.9% |
+| pd-router-turn1-seed | 1.2% | 1.2% | 1.1% | 1.1% |
+| **kvcache-centric (failed mode)** | 0.2% (9) | 0.1% (6) | 0.2% (9) | **9.3% (415)** |
+| latency mean / p50 / p90 / p99 (s) | 5.18/1.59/14.7/26.1 | 4.21/1.18/11.3/28.8 | 3.49/1.31/9.1/24.9 | 3.23/1.11/8.4/20.3 |
+
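+⚠️ **Do not conclude from this table that "v5+profile improved latency"**: this is an N=1 single run, and the 415 errors introduced in EXP2 amount to a different fallback strategy, so the drop in mean latency is most likely just a side effect of **dropping slow-path requests**.
+
+### 2.1 Breakdown of the 415 EXP2+profile errors (revised)
+
+**Error type distribution**:
+| Error Type | Count |
+|---|---|
+| `RuntimeError: generate stream ended before producing any token` | 407 |
+| `ReadTimeout: ` | 8 |
+
+⚠️ **Key constraints**:
+- **414/415 errors have `kv_transfer_blocks > 0`** (verified from the metrics jsonl). These requests **had already passed admission and P→D transfer had started**; they died downstream (server-side abort, stream closed, or failure during generation).
+- **`session_reused=False` for 415/415** (all are seeds; none is a direct append).
+- **Failures are concentrated in 18 unique sessions** (top 5: 58080→decode-5 66 errs / 70560→decode-2 54 / 67200→decode-4 40 / 59200→decode-4 35 / 77280→decode-2 33), each pinned to one D.
+
+**Per-D error rate (percentages corrected)**: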
+| Decode Worker | Errors | Total Reqs | Error Rate |
+|---|---|---|---|
+| decode-0 | 56 | 758 | 7.4% |
+| decode-1 | 5 | 561 | 0.9% |
+| decode-2 | 141 | 858 | **16.4%** |
+| decode-3 | 0 | 838 | 0.0% |
+| decode-4 | 106 | 731 | 14.5% |
+| decode-5 | 107 | 703 | 15.2% |
+
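+⚠️ **Do not read this as "decode-3 is healthy, decode-2 is pathological"**. Each session is pinned to one D, and whether any of the 18 bad sessions lands on a given D is a random outcome of routing. **The current N=1 data cannot distinguish "structural differences between Ds" from "session-assignment luck"**.
+
+---
+
+## 3. D KV pool time-series decomposition (key EXP1 1P7D results)
+
+Each D has capacity=92086 tokens; the run lasted ~2696 seconds (first 10% dropped as warm-up):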
+
+| Worker | mean_other | p50_other | p90_other | max_other | mean_held | mean_avail |
+|---|---:|---:|---:|---:|---:|---:|
+| decode-0 | 13599 | 63 | 77189 | 90959 | 47124 | 31363 |
+| decode-1 | 21242 | 0 | 76854 | 91074 | 37024 | 33820 |
+| decode-2 | 39333 | 46841 | 82782 | 91996 | 17381 | 35372 |
+| decode-3 | 30543 | 15864 | 81512 | 91511 | 9584 | 51959 |
+| decode-4 | 32659 | 32365 | 72995 | 92082 | 7643 | 51784 |
+| decode-5 | 31745 | 20366 | 86341 | 91211 | 11305 | 49036 |
+| decode-6 | 24602 | 701 | 82291 | 91000 | 20967 | 46517 |
+
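+**Revised observations (the original draft's over-attribution removed)**:
+- **`other` is bimodal** (pooled across Ds, p50 is near 0 and p90 near 80K; per-D means range from 14K to 39K). This shape is real.
+- **mean_held / mean_other differ widely across Ds**, but ⚠️ **they cannot be directly labelled "session-heavy" or "transfer-heavy"**, because we do not know the radix-cache vs working-memory split inside `other`. **The P1 breakdown is required**.
+- Because `held` excludes radix-protected tokens, a low `mean_held` **does not mean** the sessions on that D occupy little; it only means their slot-private part is small. The shared prefix may be large and entirely hidden inside `other`.
+
+### 3.1 `other` stays high for extended periods (EXP1 decode-2 samples)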
+
+| t (s) | held | avail | other | sess_count | last_gen_throughput |
+|---:|---:|---:|---:|---:|---:|
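+| 3 | 0 | 92086 | 0 | 0/0 | (not sampled) |
+| 273 | 65310 | 26776 | 0 | 1/1 | (not sampled) |
+| 543 | 15296 | 76589 | 201 | 1/1 | (not sampled) |
+| 812 | 0 | 92086 | 0 | 0/0 | (not sampled) |
+| 1082 | 52507 | 39579 | 0 | 1/1 | (not sampled) |
+| 1351 | 40985 | 30175 | 20926 | 2/2 | (not sampled) |
+| **1622** | **0** | 17703 | **74383** | **0/0** | **not checked** |
+| 1891 | 0 | 46376 | 45710 | 0/0 | (not sampled) |
+| 2161 | 0 | 27667 | 64419 | 0/0 | (not sampled) |
+| 2430 | 0 | 62224 | 29862 | 0/0 | (not sampled) |
+
+⚠️ **After t=1622 (for roughly 30+ ticks) the state stays at held=0/sess=0/other≈45-74K**. A state this persistent **does not look like a burst working set** (a burst should be sub-second). More likely explanations include:
+- KV blocks of a stuck request that were never released
+- transfer buffers registered with mooncake but never committed, left lingering
+- a cleanup path that did not fire
+
+**`last_gen_throughput` was not checked in the original draft**; the field is recorded in the timeseries but was never aligned with this analysis. **To be added together with P1**.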
+
+---
+
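+## 4. Temporal correlation between errors and saturation (EXP2 2P6D)
+
+### 4.1 Equal-count vs equal-time deciles (revised ⚠️)
+
+The original draft showed only equal-time bins, which gave the visual illusion that "the system recovers in decile 10". Both binnings side by side:
+
+| Decile | Equal-time (reqs / errs / rate) | Equal-count (reqs / errs / rate) |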
+|:---:|:---:|:---:|
+| 1 | 567 / 0 / 0.0% | 444 / 0 / 0.0% |
+| 2 | 268 / 0 / 0.0% | 445 / 0 / 0.0% |
+| 3 | 517 / 0 / 0.0% | 445 / 0 / 0.0% |
+| 4 | 189 / 0 / 0.0% | 445 / 0 / 0.0% |
+| 5 | 662 / 3 / 0.5% | 445 / 3 / 0.7% |
+| 6 | 417 / 27 / 6.5% | 445 / 28 / 6.3% |
+| 7 | 486 / 39 / 8.0% | 445 / 42 / 9.4% |
+| 8 | 612 / 177 / 28.9% | 445 / 114 / 25.6% |
+| 9 | 486 / 128 / 26.3% | 445 / 119 / 26.7% |
+| **10** | **245 / 41 / 16.7%** | **445 / 109 / 24.5%** |
+
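+⚠️ **Decile 10 is not "system recovery"**. Equal-count binning shows a 24.5% error rate, on par with deciles 8/9. The original "recovery" narrative is an artifact of where the equal-time bin boundaries fall (decile 10 holds only 245 requests vs 612 in decile 8).
+
+### 4.2 Competing hypotheses side by side (revised; admission race no longer singled out)
+
+Possible mechanisms for the 415 EXP2 2P6D errors (ordered by strength of current evidence):
+
+**H1: Polling perturbs scheduler timing (leading hypothesis ⚠️)**
+- Evidence: execution modes shift 1:1 (session-cap-fb −356 / kvcache-centric +406).
+- Evidence: `/server_info` enters the scheduler main loop and iterates over session slots; 1 Hz × 8 workers is not zero overhead.
+- Test: **if all three P0 baseline EXP2 reruns give ~9 errors, this hypothesis is supported**; if they give ~400 errors without polling, it is falsified.
+
+**H2: v5 itself has an admission/transfer race**
+- The v5 baseline also produces 9 errors (all ReadTimeout), which suggests the race already exists in the baseline and profiling amplified it.
+- Evidence weakened: the "admission race" proposed in the original draft (stale admit_direct_append snapshot) conflicts with the data. **414/415 errors have `kv_transfer_blocks > 0`**: they all passed admission and died downstream. So even if there is a race, it does not occur at admission but after the P→D transfer and before generation starts.
+
+**H3: Structural workload failure in 18 specific sessions**
+- Failures are concentrated in 18/52 sessions, all with high turn_id (median=70).
+- These sessions may have unusually long inputs, or some trace structure may trigger a specific code path.
+- Test: after the three P0 baseline reruns, check whether the same 18 sessions still fail.
+
+**H4: GPU/PCIe state perturbation in a single run**
+- The runs are ~21 hours apart, with different GPU temperature/clocks.
+- Test: if all three P0 baselines give ~9 errors → single-run perturbation is ruled out as the dominant cause.
+
+⚠️ **The original draft was wrong to promote admission race (H2) alone**. The current data cannot determine which of H1-H4 is the main cause.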
+
+---
+
+## 5. 1P7D vs 2P6D global comparison
+
+| Config | total decode ticks | other p50 | other p90 | other>30K freq | other>50K freq | other>70K freq | held>60K freq |
+|---|---:|---:|---:|---:|---:|---:|---:|
+| 1P7D | 18865 | 663 | 79751 | 36.9% | 27.9% | 14.8% | 15.5% |
+| 2P6D | 14016 | 14459 | 77199 | 43.2% | 30.4% | 13.9% | 4.8% |
+
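+⚠️ **The original claim "p50_other in 2P6D is 22× that of 1P7D → 2P pushes harder" is an over-reading**. Consider the denominator effect: the same trace's total work is shared by 6 Ds in 2P6D vs 7 Ds in 1P7D, so **each D is under more load to begin with**, with no direct causal link to the number of Ps. This data only supports "per-D load is higher in 2P6D"; it does **not** support "2P is more aggressive than 1P on transfers".
+
+---
+
+## 6. Key interpretations (substantially revised)
+
+### 6.1 The real v5 bottleneck is still unclear
+The original draft claimed "the bottleneck is that D's KV pool is occupied by 'other' under pressure". ⚠️ **This conclusion is withdrawn**. Given that `held_tokens` is actually the slot-private (non-tree) part, the largest single component of `other` is **most likely the normal radix-tree shared prefix**. "Occupied by running batch / in-flight transfers" is an **unverified conjecture**. The P1 breakdown instrument is needed to identify the real bottleneck.
+
+### 6.2 No reliable reading of LRU eviction behaviour yet
+The original draft inferred that LRU was evicting aggressively because mean_held "plunged" under pressure. But `held` is the slot-private part, and a session may still be retained by the radix tree; a drop in `held` does not mean a session was evicted, and may only reflect a change in the `cache_protected_len` share. **No conclusion until the P1 breakdown**.
+
+### 6.3 v5+profile 1P7D being "faster than baseline" is within single-run noise
+The two runs are ~21 hours apart (the original draft wrongly said ~6h), with GPU temperature/PCIe state uncontrolled. **N=1**: no performance difference < 30% can be claimed.
+
+### 6.4 EXP2 2P6D 415 errors: polling is the leading suspect (promoted)
+The original draft listed polling as a "minor possibility". ⚠️ **It is now promoted to prime suspect**:
+- The 1:1 shift in execution modes (session-cap-fb −356 / kvcache-centric +406) suggests polling **changed which path admission takes**.
+- `/server_info` is not a read-only side path: it enters the scheduler's internal loop and iterates over session slots to compute `is_idle`.
+- **The three P0 baseline reruns must be done to test this**; v6 must not be touched before then.
+
+### 6.5 "Other" on P cannot be read as 90% backup blocks
+SessionAwareCache is **not enabled** on `prefill-0` (replay data shows `held=0`), so "other" on P equals "all KV usage on P" (radix cache + running batch + backups). ⚠️ The current data **cannot tell** whether prefill-backup-policy actually released anything. A separate `prefill_backup_tokens` field needs to be added on P.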
+
+---
+
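+## 7. v6 action items (reordered, starting with P0)
+
+### **P0: verify that EXP2 errors=9 is reproducible** (highest priority, do first)
+**Action**: run the v5 baseline EXP2 three times (same v5 config, **polling off**) and compare error distributions.
+- If all 3 runs give ~9 errors → polling becomes the leading suspect for the jump to 415. **Polling must then be made lighter** (e.g. lower frequency, streaming push, or sidecar metrics instead of HTTP polling) before any follow-up work.
+- If all 3 runs give ~400 errors → polling is not the main cause; the 415 is a combination of the v5 admission/transfer race and single-run GPU state perturbation.
+- If the 3 results are widely spread (e.g. 9 / 50 / 400) → run-to-run variance dominates and any single-run comparison is invalid.
+
+**Estimated effort**: 1 new sweep script (EXP2 only, 3 runs) + ~3 × 50 min = ~2.5h GPU time.
+**Risk**: none (pure rerun of an existing config).
+
+### **P1: break down D's `other` into a table** (write the code in parallel while P0 runs)
+**Action**: modify SGLang's `session_controller.py:get_streaming_session_cache_status` and `session_aware_cache.py` to add to the returned dict:
+- `radix_protected_tokens` = `sum(slot.cache_protected_len for slot in slots)` ⚠️ the blind spot of the original draft; the key missing field exposed by the critic
+- `running_batch_tokens` = `sum(req.fill_ids size for req in running_batch.reqs)`
+- `inflight_transfer_tokens` = `sum(req.size for req in disagg_decode_transfer_queue.queue)`
+- `prealloc_tokens` = `sum(req.size for req in disagg_decode_prealloc_queue.queue)`
+- `retracted_tokens` = `sum(req.size for req in disagg_decode_prealloc_queue.retracted_queue)`
+- `last_gen_throughput` (already present), refined: add `running_batch_size` (number of reqs)
+
+**Expected benefit**: `other_unaccounted = capacity − held − available − radix_protected − running_batch − inflight − prealloc − retracted` should be close to 0. Whatever remains is the truly "pathological" memory.
+**Risk**: low (read-only stats; admission logic unchanged).
+**Effort**: ~80-line SGLang patch + matching updates to `_query_pool_snapshot` in replay.py + the analyzer.
+
+### **P2: if P0 shows polling is the main cause, change the polling implementation**
+- Option A: make `/server_info` an event-driven push (the scheduler writes stats to a ring buffer at the end of each step; polling only reads the buffer and never enters the scheduler queue)
+- Option B: lower the polling rate from 1 Hz to one sample every 5-10 s, and verify on the P1 breakdown data that this is sufficient
+- Option C: split the locking on the scheduler side so that the critical sections for reading stats and for admission decisions are separate
+
+### **P3 (conditional, pending P0+P1 data)**: decide the real optimization direction
+The 5 priorities in §7 of the original draft **all need re-evaluation** now that the `other` model is corrected. Rank them once the real breakdown data is available.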
+
+---
+
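+## 8. Limitations and confounders (expanded)
+
+1. ⚠️ The original draft read `held_tokens` semantics backwards, which led to wrong causal attribution for `other` (corrected, see §1.2).
+2. `other` is a derived field and is **not broken down**, so it cannot be attributed directly. The P1 instrument is needed to separate radix cache, running batch, in-flight, etc.
+3. ⚠️ The order-of-magnitude gap between the 415 EXP2+profile errors and the 9 baseline errors **cannot be deconfounded**; polling is the leading suspect but unconfirmed. **P0 is a mandatory step**.
+4. **N=1** per configuration: any latency/failure difference between v5+profile and the v5 baseline may fall within single-run variance and **cannot serve as a directional conclusion**.
+5. The trace is single-shot; the specific structure of its 52 sessions / 4449 reqs may amplify certain paths.
+6. `capacity = 92086` is `token_to_kv_pool_allocator.size`, derived from `mem_fraction_static` (exact value not extracted); the gap to "the physical limit of an H100 80GB" is SGLang's safety margin.
+7. ⚠️ The persistently high `other` for 30+ ticks at t=1622 in §3.1 was **not cross-checked against `last_gen_throughput`**; the original "running batch + in-flight transfer" explanation is conjecture, not evidence.
+8. ⚠️ The characteristics of the 18/52 failing sessions (turn_id, input length, prefix shape) have **not been compared**; we cannot rule out that some session type always triggers a fixed bug.
+9. A 1 Hz polling rate misses sub-second bursts; the bimodality of `other` may be sharper than measured.
+10. The critic noted that `pd-router-d-session-reseed` moved in opposite directions, up in EXP1 (193 vs 152) and down in EXP2 (127 vs 152), which the **original draft did not analyze**; this is a clear signal about admission/routing decisions and should be revisited after P1.
+
+---
+
+## 9. Next steps (order updated)
+
+1. **P0**: run `scripts/sweep_tp1_v5_baseline_rerun_exp2.sh`: 3 EXP2 baseline runs, no polling.
+2. **P1**: in parallel, modify SGLang to actually break down `other`.
+3. After P0+P1:
+   - Rerun EXP2 once with the new instrument (same polling) to obtain the `other` breakdown.
+   - Compare against the error distribution of the three baseline reruns.
+   - Decide whether to roll back polling, tune admission, or target the workload characteristics of the specific 18 sessions.
+4. Any v6 code change (optimizing admission / eviction / transfer) **must come after P0+P1**.
+
+---
+
+## 10. Data artifacts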
+
+```
+outputs/qwen3-30b-tp1-v5-optD-profile/
+├── exp{1,2}_*_metrics.jsonl                # 4449 rows per experiment
+├── exp{1,2}_*_summary.json
+├── exp{1,2}_*_pool_timeseries.jsonl        # 12 MB / 10 MB
+└── kvcache-centric-...20260429T{120847,125911}Z/  # raw run dir
+
+outputs/qwen3-30b-tp1-v5-optD/  # baseline for comparison (N=1)
+└── exp{1,2}_1p7d_kvc_optD_*
+
+# To be produced by P0:
+outputs/qwen3-30b-tp1-v5-optD-baseline-rerun/
+└── exp2_2p6d_run{1,2,3}_*
+```
+
+Analysis script: `scripts/analysis/analyze_pool_timeseries.py` (use `--json` for machine-readable output).
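The equal-time vs equal-count comparison in §4.1 of the report can be sketched as follows. This is an illustrative reconstruction, not the script that produced the table: the function names and the `(time, is_error)` tuple layout are assumptions, and the toy data is invented. The equal-count split uses integer edges `i * n // n_bins`, which for 4449 requests yields one bin of 444 followed by nine of 445, matching the request counts shown in the table.

```python
from typing import List, Sequence, Tuple

Request = Tuple[float, bool]      # (arrival time in seconds, request errored?)
BinStat = Tuple[int, int, float]  # (requests, errors, error rate)


def _summarize(bins: Sequence[Sequence[Request]]) -> List[BinStat]:
    """Collapse each bin into (requests, errors, error rate)."""
    stats = []
    for b in bins:
        errs = sum(1 for _, is_err in b if is_err)
        stats.append((len(b), errs, errs / len(b) if b else 0.0))
    return stats


def equal_time_bins(reqs: Sequence[Request], n_bins: int = 10) -> List[BinStat]:
    """Bins of equal wall-clock width; request counts per bin can differ a lot."""
    if not reqs:
        return []
    ordered = sorted(reqs)
    t0, t1 = ordered[0][0], ordered[-1][0]
    width = (t1 - t0) / n_bins or 1.0  # avoid zero width if all times are equal
    bins: List[List[Request]] = [[] for _ in range(n_bins)]
    for t, is_err in ordered:
        idx = min(int((t - t0) / width), n_bins - 1)  # last timestamp joins last bin
        bins[idx].append((t, is_err))
    return _summarize(bins)


def equal_count_bins(reqs: Sequence[Request], n_bins: int = 10) -> List[BinStat]:
    """Bins holding (almost) the same number of requests, in time order."""
    ordered = sorted(reqs)
    n = len(ordered)
    edges = [i * n // n_bins for i in range(n_bins + 1)]
    return _summarize([ordered[a:b] for a, b in zip(edges, edges[1:])])


# Invented toy data: errors cluster late, and the tail of the run is sparse.
toy = [(0.0, False), (1.0, False), (2.0, False), (3.0, False),
       (4.0, True), (5.0, True), (6.0, True), (10.0, False)]
print(equal_time_bins(toy, n_bins=2))
print(equal_count_bins(toy, n_bins=2))
```

On the toy data the two binnings put the late errors in different bins and report different per-bin rates for the same requests, which is the effect the report warns about for decile 10.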
--- a/scripts/analysis/analyze_pool_timeseries.py
+++ b/scripts/analysis/analyze_pool_timeseries.py
@@ -171,9 +171,80 @@ def analyze(timeseries_path: Path) -> dict[str, Any]:

    print(
        "\nLegend: active=held-idle  idle=idle_evictable  "
-        "other=cap-held-avail (prefill backup, in-flight, fragmentation)"
+        "other=cap-held-avail (radix-protected + running-batch + in-flight + frag)"
    )

+    # P1: decomposition of "other" using pool_breakdown fields (zeros if instrument absent)
+    has_breakdown = any(
+        any(r.get(k) for k in (
+            "radix_evictable_tokens",
+            "radix_protected_tokens",
+            "running_batch_kv_tokens",
+            "transfer_queue_tokens",
+            "prealloc_queue_tokens",
+            "retracted_queue_tokens",
+        ))
+        for r in rows
+    )
+
+    if has_breakdown:
+        print("\n=== P1 'other' decomposition (per worker, mean over run) ===")
+        print(
+            f"{'worker':<12} {'role':<8} | "
+            f"{'r_evictable':>11} {'r_protected':>11} {'slot_private':>12} | "
+            f"{'run_batch':>10} {'transfer':>9} {'prealloc':>9} {'retracted':>10} | "
+            f"{'unaccounted':>11}"
+        )
+        for wid in sorted(by_worker.keys()):
+            ws = by_worker[wid]
+            role = ws[0].get("worker_role", "?")
+            cap = max(int(r.get("capacity_tokens") or 0) for r in ws)
+
+            def m(field: str) -> float:
+                vals = [int(r.get(field) or 0) for r in ws]
+                return statistics.fmean(vals) if vals else 0.0
+
+            r_ev = m("radix_evictable_tokens")
+            r_pr = m("radix_protected_tokens")
+            slot = m("slot_private_held_tokens")
+            rb = m("running_batch_kv_tokens")
+            tq = m("transfer_queue_tokens")
+            pq = m("prealloc_queue_tokens")
+            rq = m("retracted_queue_tokens")
+            avail = m("available_tokens")
+            # `running_batch_kv_tokens` overlaps with radix_protected for tree-tracked
+            # reqs — do NOT subtract it again. Decomposition assumes:
+            # capacity ≈ avail + r_evictable + r_protected + slot_private
+            #           + transfer_queue + prealloc_queue + retracted_queue + unaccounted
+            unacc = max(
+                0,
+                cap - avail - r_ev - r_pr - slot - tq - pq - rq,
+            )
+            print(
+                f"{wid:<12} {role:<8} | "
+                f"{_fmt_tokens(r_ev):>11} {_fmt_tokens(r_pr):>11} {_fmt_tokens(slot):>12} | "
+                f"{_fmt_tokens(rb):>10} {_fmt_tokens(tq):>9} {_fmt_tokens(pq):>9} {_fmt_tokens(rq):>10} | "
+                f"{_fmt_tokens(unacc):>11}"
+            )
+
+            summary["workers"][wid]["pool_breakdown_avg"] = {
+                "radix_evictable": r_ev,
+                "radix_protected": r_pr,
+                "slot_private_held": slot,
+                "running_batch_kv": rb,
+                "transfer_queue": tq,
+                "prealloc_queue": pq,
+                "retracted_queue": rq,
+                "available": avail,
+                "unaccounted": unacc,
+            }
+        print(
+            "\nNote: running_batch_kv_tokens overlaps with radix_protected_tokens "
+            "(tree-tracked decode reqs are also in protected); not summed."
+        )
+    else:
+        print("\n(P1 instrument absent: pool_breakdown fields are all zero)")
+
    # Session residency churn: how many distinct sessions ever sat on each worker,
    # and how many sessions hopped across workers (= starvation indicator).
    print("\n=== Session residency churn ===")
--- a/scripts/sweep_tp1_v5_baseline_rerun_exp2.sh
+++ b/scripts/sweep_tp1_v5_baseline_rerun_exp2.sh
@@ -0,0 +1,89 @@
+#!/bin/bash
+# P0: Re-run v5 baseline EXP2 (2P6D) three times to establish whether
+# errors=9 is a stable property of the v5 config or single-run variance.
+# Critic of V5_PROFILE_INVESTIGATION_ZH.md flagged that the 415 errors in
+# v5+profile EXP2 may have been polling-induced. We need 3 baseline runs
+# (no polling, identical config to original v5) to test reproducibility.
+#
+# Output:
+#   outputs/qwen3-30b-tp1-v5-optD-baseline-rerun/
+#     ├── exp2_2p6d_run{1,2,3}_summary.json
+#     ├── exp2_2p6d_run{1,2,3}_metrics.jsonl
+#     └── kvcache-centric-...<ts>/   (one per run)
+set -euo pipefail
+cd "$(dirname "$0")/.."
+
+MODEL=/mnt/kzlin/workflow/pd-hybrid/simm-swe-bench/models/Qwen3-30B-A3B-Instruct-2507
+TRACE=outputs/qwen35-swebench-50sess.jsonl
+OUTPUT=outputs/qwen3-30b-tp1-v5-optD-baseline-rerun
+VENV_PYTHON=.venv/bin/python
+RESULTS_FILE=$OUTPUT/sweep_results.txt
+
+mkdir -p "$OUTPUT"
+
+log() {
+  echo "[$(date '+%Y-%m-%d %H:%M:%S')] $*" | tee -a "$RESULTS_FILE"
+}
+
+run_exp2() {
+  local run_idx=$1
+  local label="exp2_2p6d_run${run_idx}"
+  log ""
+  log "=== [RUN ${run_idx}/3] EXP2 2P6D KVC kv-aware Option D (no polling) ==="
+  PYTHONPATH=src:third_party/sglang/python \
+  $VENV_PYTHON -m agentic_pd_hybrid.cli benchmark-live \
+    --trace $TRACE \
+    --output-root $OUTPUT \
+    --mechanism kvcache-centric \
+    --policy kv-aware \
+    --model-path $MODEL \
+    --prefill-workers 2 --decode-workers 6 \
+    --prefill-tp-size 1 --decode-tp-size 1 \
+    --prefill-gpu-ids 0,1 --decode-gpu-ids 2,3,4,5,6,7 \
+    --transfer-backend mooncake \
+    --gpu-budget 8 \
+    --time-scale 10 \
+    --session-sample-rate 1.0 \
+    --target-duration-s 100000 \
+    --concurrency-limit 32 \
+    --timeout-s 900 \
+    --request-timeout-s 300 \
+    --kvcache-admission-mode worker \
+    --kvcache-seed-min-turn-id 1 \
+    --kvcache-seed-max-inflight-decode -1 \
+    --kvcache-prefill-backup-policy release-after-transfer \
+    --kvcache-prefill-priority-eviction
+
+  local run_dir=$(ls -td $OUTPUT/kvcache-centric-*/ 2>/dev/null | head -1)
+  log "=== [RUN ${run_idx}/3] $label COMPLETED ==="
+  if [ -f "$run_dir/request-metrics.jsonl.summary.json" ]; then
+    cp "$run_dir/request-metrics.jsonl.summary.json" "$OUTPUT/${label}_summary.json"
+    cp "$run_dir/request-metrics.jsonl" "$OUTPUT/${label}_metrics.jsonl"
+    local errs=$($VENV_PYTHON -c "import json; d=json.load(open('$OUTPUT/${label}_summary.json')); print(d.get('error_count',0))")
+    log "  errors = $errs (baseline reference = 9)"
+    cat "$run_dir/request-metrics.jsonl.summary.json" >> $RESULTS_FILE
+    echo "" >> $RESULTS_FILE
+  else
+    log "WARNING: no summary file in $run_dir"
+  fi
+}
+
+log "=== P0: v5 baseline EXP2 reproducibility test (3 runs, no polling) ==="
+log "Model: $MODEL"
+log "Trace: $TRACE (4449 requests, 52 sessions)"
+log "Goal: confirm whether errors=9 in v5 baseline EXP2 is reproducible"
+log "      (v5+profile saw 415 errors; we need to know if polling was causal)"
+
+for i in 1 2 3; do
+  run_exp2 $i
+done
+
+log ""
+log "=== P0 SUMMARY: errors per run ==="
+for i in 1 2 3; do
+  if [ -f "$OUTPUT/exp2_2p6d_run${i}_summary.json" ]; then
+    e=$($VENV_PYTHON -c "import json; d=json.load(open('$OUTPUT/exp2_2p6d_run${i}_summary.json')); print(d.get('error_count',0))")
+    log "  run ${i}: errors = $e"
+  fi
+done
+log "=== P0 ALL DONE ==="
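Once the three runs finish, the three-way reading of the P0 result described in the report (§7) could be automated along these lines. This is a sketch under stated assumptions: `classify_p0`, its labels, and the `tolerance` factor of 3 are hypothetical choices for illustration, and the report itself only says "~9", "~400", and "widely spread" without fixing thresholds. The reference values 9 and 415 are the error counts quoted in the report.

```python
from typing import Sequence


def classify_p0(errors: Sequence[int], low_ref: int = 9, high_ref: int = 415,
                tolerance: float = 3.0) -> str:
    """Map per-run error counts to one of the three P0 outcomes.

    A run counts as "near baseline" if errors <= low_ref * tolerance and as
    "near profile" if errors >= high_ref / tolerance. Both cut-offs are
    illustrative assumptions, not values taken from the report.
    """
    if not errors:
        raise ValueError("need at least one run")
    near_low = [e <= low_ref * tolerance for e in errors]
    near_high = [e >= high_ref / tolerance for e in errors]
    if all(near_low):
        # Baseline reproduces ~9 errors: polling becomes the leading suspect.
        return "baseline-reproducible"
    if all(near_high):
        # Baseline also fails at scale: polling is not the main cause.
        return "baseline-also-fails"
    # Mixed results: run-to-run variance dominates single-run comparisons.
    return "run-to-run-variance"


print(classify_p0([9, 12, 7]))
print(classify_p0([400, 380, 420]))
print(classify_p0([9, 50, 400]))
```

The error counts would come from the `error_count` field that the sweep script already reads out of each `exp2_2p6d_run{1,2,3}_summary.json`.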
--- a/src/agentic_pd_hybrid/replay.py
+++ b/src/agentic_pd_hybrid/replay.py
@@ -708,6 +708,11 @@ async def _query_pool_snapshot(
    if not isinstance(memory_usage, dict):
        memory_usage = {}

+    # P1 instrument: pool_breakdown decomposes "other" into named buckets
+    pool_breakdown = internal.get("pool_breakdown") if isinstance(internal, dict) else None
+    if not isinstance(pool_breakdown, dict):
+        pool_breakdown = {}
+
    return {
        "session_cache_enabled": bool(session_cache.get("enabled")),
        "session_count": int(session_cache.get("session_count") or 0),
@@ -727,6 +732,18 @@ async def _query_pool_snapshot(
        "last_gen_throughput": float(internal.get("last_gen_throughput") or 0.0)
        if isinstance(internal, dict)
        else 0.0,
+        "radix_evictable_tokens": int(pool_breakdown.get("radix_evictable_tokens") or 0),
+        "radix_protected_tokens": int(pool_breakdown.get("radix_protected_tokens") or 0),
+        "slot_private_held_tokens": int(pool_breakdown.get("slot_private_held_tokens") or 0),
+        "session_slot_count": int(pool_breakdown.get("session_slot_count") or 0),
+        "running_batch_reqs": int(pool_breakdown.get("running_batch_reqs") or 0),
+        "running_batch_kv_tokens": int(pool_breakdown.get("running_batch_kv_tokens") or 0),
+        "transfer_queue_reqs": int(pool_breakdown.get("transfer_queue_reqs") or 0),
+        "transfer_queue_tokens": int(pool_breakdown.get("transfer_queue_tokens") or 0),
+        "prealloc_queue_reqs": int(pool_breakdown.get("prealloc_queue_reqs") or 0),
+        "prealloc_queue_tokens": int(pool_breakdown.get("prealloc_queue_tokens") or 0),
+        "retracted_queue_reqs": int(pool_breakdown.get("retracted_queue_reqs") or 0),
+        "retracted_queue_tokens": int(pool_breakdown.get("retracted_queue_tokens") or 0),
        "sessions": sessions,
    }

--- a/third_party/sglang/python/sglang/srt/managers/scheduler.py
+++ b/third_party/sglang/python/sglang/srt/managers/scheduler.py
@@ -3181,6 +3181,89 @@ class Scheduler(
            success = False
        return success

+    def _compute_pool_breakdown_for_diagnostics(self) -> dict:
+        """Read-only KV pool decomposition for the agentic-pd-hybrid profiler.
+
+        Decomposes capacity into:
+          - radix_evictable_tokens / radix_protected_tokens: tree-managed
+          - slot_private_held_tokens: SessionAwareCache out-of-tree slot holds
+          - running_batch_kv_tokens: kv_allocated_len of currently-decoding reqs
+            (overlaps with radix_protected; not additive)
+          - {transfer,prealloc,retracted}_queue_{reqs,tokens}: disagg queues
+          - available_tokens: free pool
+
+        Caller computes "unaccounted = capacity - sum_of_known" to find leakage.
+        Implementation is best-effort; missing components return omitted keys.
+        """
+        breakdown: dict = {
+            "capacity_tokens": int(self.max_total_num_tokens or 0),
+            "available_tokens": int(self.token_to_kv_pool_allocator.available_size()),
+        }
+
+        # Radix tree (works for SessionAwareCache and most inner caches)
+        try:
+            ev = self.tree_cache.evictable_size()
+            pr = self.tree_cache.protected_size()
+            if isinstance(ev, tuple):
+                ev = ev[0]
+            if isinstance(pr, tuple):
+                pr = pr[0]
+            breakdown["radix_evictable_tokens"] = int(ev or 0)
+            breakdown["radix_protected_tokens"] = int(pr or 0)
+        except Exception:
+            pass
+
+        # SessionAwareCache slot-private holds (already in session_cache.held_tokens
+        # but mirrored here for one-stop decomposition)
+        try:
+            from sglang.srt.mem_cache.session_aware_cache import SessionAwareCache
+            if isinstance(self.tree_cache, SessionAwareCache):
+                breakdown["slot_private_held_tokens"] = int(
+                    self.tree_cache.session_held_tokens()
+                )
+                breakdown["session_slot_count"] = int(
+                    self.tree_cache.session_held_req_count()
+                )
+        except Exception:
+            pass
+
+        # Running batch KV (overlaps with radix_protected for tree-tracked reqs)
+        try:
+            running_reqs = self.running_batch.reqs
+            breakdown["running_batch_reqs"] = len(running_reqs)
+            breakdown["running_batch_kv_tokens"] = sum(
+                int(getattr(req, "kv_allocated_len", 0) or 0)
+                for req in running_reqs
+            )
+        except Exception:
+            pass
+
+        # Disagg decode queues
+        if self.disaggregation_mode == DisaggregationMode.DECODE:
+            try:
+                tq = self.disagg_decode_transfer_queue.queue
+                pq = self.disagg_decode_prealloc_queue.queue
+                rq = self.disagg_decode_prealloc_queue.retracted_queue
+                breakdown["transfer_queue_reqs"] = len(tq)
+                breakdown["transfer_queue_tokens"] = sum(
+                    int(getattr(getattr(dr, "req", None), "kv_allocated_len", 0) or 0)
+                    for dr in tq
+                )
+                breakdown["prealloc_queue_reqs"] = len(pq)
+                breakdown["prealloc_queue_tokens"] = sum(
+                    int(getattr(getattr(dr, "req", None), "kv_allocated_len", 0) or 0)
+                    for dr in pq
+                )
+                breakdown["retracted_queue_reqs"] = len(rq)
+                breakdown["retracted_queue_tokens"] = sum(
+                    int(getattr(req, "kv_allocated_len", 0) or 0)
+                    for req in rq
+                )
+            except Exception:
+                pass
+
+        return breakdown
+
    def get_internal_state(self, recv_req: GetInternalStateReq):
        ret = vars(get_global_server_args())
        ret["last_gen_throughput"] = self.last_gen_throughput
@@ -3196,6 +3279,7 @@ class Scheduler(
        ret["session_cache"] = (
            self.session_controller.get_streaming_session_cache_status()
        )
+        ret["pool_breakdown"] = self._compute_pool_breakdown_for_diagnostics()

        if not self.spec_algorithm.is_none() and self.spec_total_num_forward_ct > 0:
            ret["avg_spec_accept_length"] = (