docs(kvc): record real Ali KVC experiment results

feat(kvc): add real Ali replay workflow
docs(kvc): agentic-fit analysis, refactor plan, validation report
2026-05-12 05:28:06 +00:00 · 2026-05-12 05:28:00 +00:00 · 2026-05-06 21:30:11 +08:00 · 2026-05-06 21:29:56 +08:00 · 2026-05-06 21:29:46 +08:00 · 2026-05-06 21:29:30 +08:00
50 changed files with 7228 additions and 114 deletions
--- a/docs/AGENTIC_FIT_ANALYSIS_ZH.md
+++ b/docs/AGENTIC_FIT_ANALYSIS_ZH.md
@@ -0,0 +1,434 @@
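+# Structural Design Flaws in Agentic Scenarios: An Analysis
+
+**Date**: 2026-05-06
+**Reference data**: `outputs/qwen3-30b-tp1-v5-optD-baseline-rerun/exp2_2p6d_run1_*` (KVC kv-aware Option D, 2P6D, 4449 reqs / 52 sessions) + `outputs/qwen3-30b-tp1-exps/exp1_8way_dp_cache_aware_summary.json` (8-way DP cache-aware baseline on the same trace).
+**Model**: Qwen3-30B-A3B (TP1), single node with 8×H100 80GB.
+**Research question**: treating the SWE trace as representative of "real agentic" workloads, where does the KVC mechanism lose systematically to vanilla DP? In particular, which structural causes exist beyond "D capacity is overloaded 4.6×"?
+
+> This document supplements `docs/KVC_DEBUG_JOURNEY_V1_TO_V5.md` and `docs/V5_PROFILE_INVESTIGATION_ZH.md`: beyond version history and bottleneck localization, it asks which design-level assumptions fail to match real agentic workloads.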
+
+---
+
+## TL;DR
+
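+Structural flaws, ordered by importance:
+
+| # | Flaw | Evidence | Fix direction | Effort |
+|---|---|---|---|---|
+| 1 | **KvAwarePolicy is blind to D capacity; a session is permanently pinned to the D it first landed on** | average number of distinct D workers visited per session = **1.00**; direct-to-D hit rate is strongly bimodal (15 sessions at 0-20%, 14 sessions at 80-100%) | add a capacity-aware term to the score function; allow sessions to migrate across D workers | Medium |
+| 2 | **D-side LRU can only evict idle sessions, so hot sessions can never be evicted** | only 9-43 trim events per D over the whole run vs 80-150 transfer errors; token_usage peaks at 1.00 | add score-based eviction (tiered by access frequency / recency) | Medium |
+| 3 | **No D→Router→Replay backpressure channel** | concurrency stays at 32 and never drops; replay is unaware when D fails | add `recommended_pause_ms` to the admission response; replay lowers concurrency accordingly | Small |
+| 4 | **Admission HTTP round-trip is coupled to the scheduler main loop** | in v5+profile, merely adding 1Hz polling raised errors from 9 to 415 | split into a lock-free `/probe` and a `/commit_evict` that enters the scheduler queue | Medium |
+| 5 | **P-side round-robin is blind to D health** | prefill-0 logged 367 KVTransferError, prefill-1 only 4, yet their request counts are nearly equal | have the router consider target D health when choosing P | Medium |
+| 6 | **Replay-side session footprint estimate is inflated ~30×** | `_estimate_session_resident_tokens = input + output` treats the 80K context at turn 50 as "needs 80K of fresh space" | switch to an incremental-token estimate | Small |
+| 7 | **time-scale=10 artificially pushes test conditions into a distorted regime** | inter-turn gap p50 is compressed from 2.5s to 0.25s, which removes the "natural idle window" KVC wants to exploit | run a time-scale=1 baseline to verify | Small (config only) |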
+
+**The most important reference fact**: on the same trace, hardware, and model, 8-way DP cache-aware (no PD split, no KVC, no session abstraction) gives:
+
+| Metric | 8-way DP CA | v5 KVC 2P6D |
+|---|---|---|
+| Errors | **0** | 372 (8.4%) |
+| Latency mean | **1.43s** | 3.50s |
+| Latency P50 | **0.65s** | 1.11s |
+| Latency P99 | **8.37s** | 20.37s |
+| TTFT mean | **0.12s** | 2.13s |
+| TTFT P90 | **0.26s** | 6.47s |
+| Per-worker request count distribution | 508–619 (±10%) | 561–858 (±26%) |
+
+**Naive DP wins on every metric; KVC's mean latency, for example, is 145% higher (3.50s vs 1.43s).** This defines the baseline that KVC must beat on this workload.
+
+---
+
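+## 1. Sessions permanently pinned to D + capacity-blind selection (the core problem)
+
+### 1.1 Symptoms
+
+Over the whole run, each session visits only **1.00 distinct D workers** (see the data above). Combine this with the direct-to-D hit-rate distribution:
+
+```
+direct-to-D hit-rate buckets (n=52 sessions):
+  0-20%:  15 sessions ← almost every turn fails and falls back to a full P→D transfer
+  20-40%:  7
+  40-60%: 11
+  60-80%:  5
+  80-100%: 14 sessions ← almost every turn takes the direct-to-D fast path
+```
+
+**The two extreme buckets hold 29 of the 52 sessions**, which is a typical signal of unfair resource allocation.
+
+Starved and well-served sessions differ clearly in workload:
+- starved sessions, mean peak input: 56,011 tokens
+- well-served sessions, mean peak input: 31,344 tokens (**1.8× gap**)
+
+**Large sessions tend to be starved**, because they are more likely to trigger an admission rejection on a D whose capacity is already tight.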
+
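+### 1.2 Root cause (code level)
+
+`policies.py:166-172` `KvAwarePolicy.select`:
+
+```python
+score = (
+    overlap + sticky * self.sticky_bonus,    # primary: historical KV overlap
+    sticky,                                   # secondary: is this last_decode_worker?
+    inflight_penalty,                         # tertiary: current inflight count (small)
+    assignment_penalty,                       # quaternary: cumulative assignments (smaller)
+)
+```
+
+The score has **no term at all for D's current capacity**. When session X first lands on D-2, its hash_ids accumulate on D-2; from then on, no matter how full D-2 is, turn N+1 of X is scored onto D-2 (because overlap dominates).
+
+Worse, `RoutingState.decode_resident_blocks` (`policies.py:46`) never shrinks: even after D has evicted some blocks, replay still believes they are there. By mid-run, every D's overlap set approaches "all hash_ids in the trace", and the policy degenerates to pure sticky.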
+
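+### 1.3 Consequences: what individual sessions experience
+
+**A starved session (e.g. session 50400, 105 turns, 0 direct-to-D hits), per-turn flow**:
+
+1. policy picks a D (always the same one)
+2. admission rejects (D's capacity is already occupied)
+3. the request falls back via fallback-session-cap → P does a full prefill of 50K-100K tokens
+4. mooncake pushes the KV → D still has no room → 32s timeout or KVTransferError
+5. the user sees 5-10s of latency per turn, with repeated errors
+
+**A well-served session (e.g. session 3840, 118 turns, 97% direct-to-D), per-turn flow**:
+
+1. policy picks a D (always the session's initial D)
+2. admission passes (this session has held a slot on this D all along)
+3. direct-to-D: append-prefill of a few hundred tokens on D, with no P involvement and no mooncake transfer
+4. TTFT 0.043s, E2E 0.495s
+
+**This is not "a bit slower on average"; it is structural unfairness.** From an SLO viewpoint, P99 is set by the tail of those 15 starved sessions.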
+
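+### 1.4 Why naive DP wins anyway
+
+8-way DP cache-aware uses pure hash-based routing, with no session abstraction and no PD split:
+
+- each request is routed to a worker by prefix hash → turns of the same session naturally get prefix hits on that worker
+- under capacity overload, SGLang's own radix cache + scheduler manage the KV pool in one place
+- there is no admission/fallback/reseed path
+- there is no mooncake transfer
+- per-worker load deviation is ±10% (vs ±26% for KVC), so load approaches balance on its own
+
+**The session affinity / KV reuse / admission trio that KVC introduces actually worsens the imbalance when capacity is tight, and none of the three closes the gap to DP.**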
+
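+### 1.5 Fix direction
+
+Add to `KvAwarePolicy.select` (inside its per-worker loop):
+
+```python
+# Current D capacity utilization (worker-mode admission can already query this)
+capacity_penalty = -worker_capacity_used_ratio[worker.worker_id]
+
+# When several D workers have overlap, prefer the emptiest one;
+# when a D's utilization exceeds the threshold, exclude it from the candidates
+if worker_capacity_used_ratio[worker.worker_id] > HARD_CAP:
+    continue
+
+score = (
+    overlap_capped,                # overlap, but clamped so a single D cannot win forever
+    capacity_penalty,              # ← new
+    sticky,
+    inflight_penalty,
+)
+```
+
+A more aggressive fix: after a session has been rejected N times by a given D, proactively release its session state on that D and **allow the next turn to go to another D** (the cost is losing the accumulated KV, but the current fallback path loses it anyway).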
+
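The fragment in §1.5 can be turned into a small self-contained sketch. Everything below is illustrative: `select_decode_worker`, `HARD_CAP`, and `OVERLAP_CAP` are hypothetical names and thresholds, not the actual `KvAwarePolicy` API, and the utilization map is assumed to come from whatever the worker-mode admission endpoint already reports.

```python
from typing import Dict, Iterable, Optional

HARD_CAP = 0.95    # assumed cutoff: skip decode workers above this utilization
OVERLAP_CAP = 64   # assumed clamp so overlap alone cannot dominate forever


def select_decode_worker(
    workers: Iterable[str],
    overlap_blocks: Dict[str, int],
    used_ratio: Dict[str, float],
    last_worker: Optional[str] = None,
) -> Optional[str]:
    """Return the best decode worker, or None if all are above HARD_CAP."""
    best_worker, best_score = None, None
    for worker in workers:
        ratio = used_ratio.get(worker, 0.0)
        if ratio > HARD_CAP:
            continue  # full workers never enter the candidate set
        score = (
            min(overlap_blocks.get(worker, 0), OVERLAP_CAP),  # clamped overlap
            -ratio,                                           # emptier is better
            1 if worker == last_worker else 0,                # sticky tie-break
        )
        if best_score is None or score > best_score:
            best_worker, best_score = worker, score
    return best_worker
```

With the clamp in place, a session whose old D is nearly full can move to a less loaded D once both workers reach the overlap ceiling; returning `None` gives the caller an explicit "no capacity anywhere" signal instead of a silent sticky choice.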
+---
+
+## 2. D-side LRU eviction cannot keep up with pressure
+
+### 2.1 Data
+
+Per D, over the whole run:
+
+| Worker | Trim events (proactive LRU) | KVTransferError + OOM | Peak token_usage |
+|---|---:|---:|---:|
+| decode-0 | 9 | 0 | 0.99 |
+| decode-1 | 43 | 12 (4 err + 8 oom) | 0.99 |
+| decode-2 | 16 | 459 (153 err + 306 oom) | 0.97 |
+| decode-3 | 37 | 87 (29 err + 58 oom) | 0.99 |
+| decode-4 | 28 | 270 (90 err + 180 oom) | **1.00** |
+| decode-5 | 30 | 279 (93 err + 186 oom) | **1.00** |
+
+**On the four hot workers (decode-2 to decode-5), errors outnumber LRU trim events by roughly 2-29×.** D-4 / D-5 hit token_usage=1.00 outright.
+
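+### 2.2 Root cause
+
+The idle check in `scheduler.py:2040` `evict_idle_streaming_sessions_lru`:
+
+```python
+# can only evict sessions where all reqs are finished and the session is in streaming mode
+```
+
+But under high SWE concurrency, nearly every session always has an inflight req (and time-scale=10 compresses the inter-turn gap further). **Hot sessions are never idle, so LRU never finds anything to evict.** As a result D climbs to 100%, and the next incoming transfer hits OOM/timeout directly.
+
+### 2.3 Fix direction
+
+Introduce tiered eviction:
+
+1. **Idle sessions first** (current behavior)
+2. **Cold sessions next** (no access in the last N seconds; even if there is an inflight req, it can be retracted to make room)
+3. **Forced retract of hot sessions** (when the hard cap triggers)
+
+vanilla SGLang already has the `disagg_decode_prealloc_queue.retracted_queue` mechanism (see the references in `admit_direct_append`), but **nothing triggers a retract proactively**: today a request enters retracted_queue only on an internal exception. Retract needs to be promoted into the normal admission path.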
+
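The three tiers above can be sketched as a victim-selection function. This is a minimal illustration under assumed names (`SessionInfo`, `pick_eviction_victims`, `cold_after_s`); it does not model SGLang's real session controller or the retract mechanics, only the ordering of candidates.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class SessionInfo:  # hypothetical view of one resident session on D
    session_id: str
    resident_tokens: int
    inflight: int          # number of unfinished requests
    last_access_s: float   # time of the most recent access


def pick_eviction_victims(
    sessions: List[SessionInfo],
    need_tokens: int,
    now_s: float,
    cold_after_s: float = 5.0,
    hard_cap_hit: bool = False,
) -> List[str]:
    """Pick victims tier by tier: idle first, then cold, then hot (hard cap only)."""

    def tier(s: SessionInfo) -> int:
        if s.inflight == 0:
            return 0  # idle
        if now_s - s.last_access_s >= cold_after_s:
            return 1  # cold: has inflight work but no recent access
        return 2      # hot

    max_tier = 2 if hard_cap_hit else 1
    victims, freed = [], 0
    # Within a tier, the least recently accessed session goes first.
    for s in sorted(sessions, key=lambda s: (tier(s), s.last_access_s)):
        if freed >= need_tokens or tier(s) > max_tier:
            break
        victims.append(s.session_id)
        freed += s.resident_tokens
    return victims
```

Hot sessions are only touched when `hard_cap_hit` is set, which keeps the common path close to the current idle-only LRU while still giving D a way out when it is pinned at 100%.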
+---
+
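+## 3. No D→Replay backpressure channel
+
+### 3.1 Terminology
+
+**Backpressure** = when the downstream of a streaming system is overloaded, it propagates a signal back upstream so that the upstream slows down. Examples: TCP sliding window, Kafka consumer lag, gRPC HTTP/2 flow control.
+
+### 3.2 Current state
+
+- the D-side transfer queue piles up → timeout after 32s → KVTransferError is raised
+- the error propagates to P → P passes it to the router → the router passes it to replay → replay takes the fallback path
+- **nowhere along this chain is there a "D is overloaded, please slow down" signal**: concurrency stays at its cap
+
+Consequence: once D starts failing, it **keeps failing** (because replay does not slow down) until D works through its backlog on its own.
+
+### 3.3 Fix direction
+
+Add to the `admit_direct_append` response:
+
+```python
+{
+  "can_admit": ...,
+  "recommended_pause_ms": int,    # ← new: suggested wait before sending the next request of this kind
+  "queue_depth": int,             # ← new: current depth of D's transfer queue
+  ...
+}
+```
+
+On an admission rejection, replay lowers concurrency or backs off according to `recommended_pause_ms`. **This is the cheapest change of all**: it touches neither the protocol nor SGLang internals, only the code at the two ends.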
+
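On the replay side, one plausible way to consume the pause hint is an additive-increase / multiplicative-decrease rule. The sketch below assumes the response fields proposed in §3.3; the function names, the halving factor, and the ceiling of 32 are illustrative choices rather than existing replay code.

```python
def next_concurrency(current: int, response: dict, floor: int = 1, ceiling: int = 32) -> int:
    """AIMD-style update: halve on rejection or pause hint, otherwise grow by one."""
    admitted = response.get("can_admit", False)
    pause_ms = response.get("recommended_pause_ms", 0)
    if admitted and pause_ms == 0:
        return min(ceiling, current + 1)
    return max(floor, current // 2)


def pause_seconds(response: dict) -> float:
    """How long to wait before the next request of this kind, in seconds."""
    return max(0, response.get("recommended_pause_ms", 0)) / 1000.0
```

Because the rule only reads fields from the admission response, it needs no extra RPC and degrades to the current behavior when D never sets a pause hint.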
+---
+
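+## 4. Admission RPC coupled to the scheduler: the precise boundary between structural and engineering
+
+### 4.1 Symptoms
+
+`docs/V5_PROFILE_INVESTIGATION_ZH.md` reports that merely adding 1Hz `/server_info` polling raised EXP2 errors from 9 to 415. `/server_info` iterates over session slots inside the scheduler main loop to compute `is_idle`; 1 Hz × 8 workers is already enough to perturb scheduling.
+
+Under real load, however, the admission RPC rate is far above 1Hz: every turn issues one call, and each reseed and direct-to-D attempt issues another. concurrency=32 + 4449 reqs / ~2700s ≈ **16+ admission RPCs per second**.
+
+### 4.2 Structural or engineering problem? A precise breakdown
+
+`admit_direct_append` (`scheduler.py:3581`) does two things:
+
+```python
+# (a) read pool state: cheap
+available_tokens = self.token_to_kv_pool_allocator.available_size()
+
+# (b) trigger an LRU scan: expensive, and it must mutate pool state
+trim_result = self.maybe_trim_decode_session_cache(...)
+```
+
+| Part | Nature | Solvable by engineering? |
+|---|---|---|
+| (a) read pool state | a few atomic reads | **Fully solvable**: a lock-free shared-memory snapshot is enough |
+| (b) LRU eviction | mutates the GPU pool, must be exclusive | **Structural**: with the Python GIL and a shared GPU pool, concurrent mutation is not possible |
+
+**Key observation**: under real load, (b) is the minority path. Most admissions only need to "check whether there is enough room" and do not need to evict immediately.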
+
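+### 4.3 Engineering fix
+
+Split the admission API into two endpoints:
+
+```
+POST /session_cache/probe          ← 90% of traffic
+  - read-only lock-free snapshot
+  - returns (can_admit_estimate, available_tokens, queue_depth)
+  - does not enter the scheduler queue
+
+POST /session_cache/commit_evict   ← 10% of traffic
+  - called only when probe reports insufficient room
+  - enters the scheduler queue and performs the actual LRU
+  - keeps the current admit_direct_append semantics
+```
+
+The scheduler writes the snapshot to an mmap shared-memory segment at the end of each step (atomic publish); replay reads it through mmap, with zero syscalls and zero serialization. This can sustain thousands of probes per second.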
+
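+### 4.4 On "coroutines / threads / processes / switching languages"
+
+| Tool | Actual effect on this problem |
+|---|---|
+| asyncio coroutines | SGLang already uses them; they do not help the scheduler main loop itself |
+| Python threads | blocked by the GIL, and GPU pool state can only be modified by the scheduler process |
+| Multiple processes | the scheduler is already a separate process; the problem is that **its own step loop** serializes admission and decode |
+| orjson / uvloop | speeds up network/JSON 5-10×, but the LRU scan is not on that hot path |
+| Rewriting the scheduler in Rust/C++ | speeds up the LRU scan 5-10×, but **the structural sharing problem remains** |
+
+**The right engineering fix is to redesign the API (split probe / commit), not merely to swap in a faster library or language.**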
+
+---
+
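+## 5. P-side routing is blind to D health
+
+### 5.1 Data
+
+```
+prefill-0:  367 KVTransferError, 361 "Decode instance could be dead"
+prefill-1:    4 KVTransferError, 0  "Decode instance could be dead"
+
+Request counts:
+  prefill-0: 2225 requests
+  prefill-1: 2224 requests   ← nearly an even split
+```
+
+**The two P workers receive almost exactly the same number of requests, yet their error counts differ by 92×.** In the logs, prefill-0's errors repeatedly point to one specific D (`10.45.80.47:XXXXX`): it has formed a "death link" with a hot D.
+
+### 5.2 Root cause
+
+P selection in `pd_router.py:43-49` is bare round-robin:
+
+```python
+prefill_url, bootstrap_port = self.config.prefill_urls[
+    self.prefill_cursor % len(self.config.prefill_urls)
+]
+```
+
+It does not know whether D is healthy, and it will not steer around a P that is "locked in a losing fight with D-X".
+
+### 5.3 Fix direction
+
+When choosing P, the router should use a joint score over (P's current inflight transfer count, target D health). Health can be derived from the `queue_depth` field proposed in §3.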
+
+---
+
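+## 6. Replay-side session footprint estimate inflated ~30×
+
+### 6.1 Code
+
+`replay.py:898-899`:
+
+```python
+def _estimate_session_resident_tokens(request: TraceRequest) -> int:
+    return request.input_length + request.output_length
+```
+
+It is used by `_decode_session_soft_cap` (`replay.py:1051`) and `_should_admit_new_decode_session`.
+
+### 6.2 Problem
+
+For turn 50 of a session that already has 80K of KV on D:
+- actual incremental need: a few thousand new input tokens + a few hundred output tokens = ~3K
+- value returned by the estimate: 80K + 1K = 81K (**~27× inflation**)
+
+Consequence: router-mode admission misjudges systematically, and replay itself rejects sessions that could have been admitted. v5 worker-mode partly fixes this by letting D look at its real capacity, **but KvAwarePolicy still uses this inflated estimate when choosing D**, so the choice of D is still wrong.
+
+### 6.3 Fix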
+
+```python
+def _estimate_session_resident_tokens(request: TraceRequest) -> int:
+    if request.turn_id == 1:
+        return request.input_length + request.output_length
+    # turn 2+: only the increment matters for additional reservation
+    return max(0, request.input_length - request.cached_tokens) + request.output_length
+```
+
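The ~27× figure from §6.2 can be reproduced with a short worked example. The `TraceRequest` below is a hypothetical minimal stand-in (the real class lives in the replay code), and the turn-50 numbers are chosen to match the "80K resident, ~3K increment" case described above; `cached_tokens` is assumed to be the part of the input already resident on D.

```python
from dataclasses import dataclass


@dataclass
class TraceRequest:  # hypothetical minimal stand-in for the replay type
    turn_id: int
    input_length: int
    output_length: int
    cached_tokens: int = 0


def estimate_full(request: TraceRequest) -> int:
    """Current behavior: every turn is charged its full context."""
    return request.input_length + request.output_length


def estimate_incremental(request: TraceRequest) -> int:
    """Proposed behavior: from turn 2 on, charge only tokens not yet resident."""
    if request.turn_id == 1:
        return estimate_full(request)
    return max(0, request.input_length - request.cached_tokens) + request.output_length


turn50 = TraceRequest(turn_id=50, input_length=80_000, output_length=1_000, cached_tokens=78_000)
inflation = estimate_full(turn50) / estimate_incremental(turn50)  # 81_000 / 3_000
```

For a first turn the two estimates coincide, so the change only affects sessions that already hold KV on D.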
+---
+
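+## 7. Measurement distortion from time-scale=10
+
+### 7.1 What it is
+
+`replay.py` scales the `timestamp` field of every request in the original trace as `t / time_scale` and then sends requests on that schedule.
+
+- original trace span: ~6000s (≈100 minutes)
+- time-scale=10 → actual replay span: ~600s (≈10 minutes)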
+
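A tiny illustration of the scaling and its effect on inter-turn gaps, using made-up timestamps whose gaps match the p50/p90 values reported in §7.3. `scale_timestamps` and `inter_turn_gaps` are hypothetical helpers, not functions from `replay.py`.

```python
from typing import List


def scale_timestamps(timestamps: List[float], time_scale: float) -> List[float]:
    """Apply the t / time_scale compression described above."""
    return [t / time_scale for t in timestamps]


def inter_turn_gaps(timestamps: List[float]) -> List[float]:
    """Gaps between consecutive turns of one session."""
    return [later - earlier for earlier, later in zip(timestamps, timestamps[1:])]


# Illustrative session: gaps of 2.5s, 2.5s and 7.8s in the original trace.
original = [0.0, 2.5, 5.0, 12.8]
compressed = scale_timestamps(original, time_scale=10)
gaps_before = inter_turn_gaps(original)    # roughly [2.5, 2.5, 7.8]
gaps_after = inter_turn_gaps(compressed)   # roughly [0.25, 0.25, 0.78]
```

Every gap shrinks by the same factor, so any idle window shorter than ten times the eviction/admission latency effectively disappears in the compressed replay.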
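+### 7.2 Why it was designed this way
+
+**Purely to save test time**: a single 1× run takes 100 minutes, so a sweep of 5 versions × 3 repeats = 25h of GPU time; at 10× it takes only 2.5h.
+
+### 7.3 What it distorts
+
+| Dimension | Original trace | replay (time-scale=10) |
+|---|---|---|
+| inter-turn gap p10 | 1.6s | 0.16s |
+| inter-turn gap p50 | 2.5s | 0.25s |
+| inter-turn gap p90 | 7.8s | 0.78s |
+| inter-turn gap max | 261s | 26s |
+
+Real agentic users/agents pause 2-8 seconds between turns (thinking, typing, tool calls). **These gaps are exactly the "natural idle windows" KVC wants to exploit**: while a session is briefly idle, LRU can evict it and other sessions can be admitted.
+
+time-scale=10 compresses these windows to 0.2-0.8s, **artificially removing the precondition KVC was designed around**.
+
+### 7.4 A serious threat to experimental validity
+
+All v3-v6 data is based on time-scale=10. Every earlier conclusion that "KVC loses to the baseline on SWE" therefore carries this distortion. **If the inter-turn gap in a real deployment is 2.5s, KVC may never hit the capacity bottleneck observed here**, because D has time to release/reorder between turns.
+
+**A separate time-scale=1 baseline comparison should be run** to determine whether KVC loses to DP because the mechanism itself is inadequate or because the benchmark pushed it into a regime it was never meant to operate in. This is the project's **most important validation that has not yet been done**.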
+
+---
+
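+## 8. Application-layer abstractions do not need to be introduced at the engine layer (retracted)
+
+An earlier draft claimed that "the framework does not support speculative multi-branch, nested sub-agents, or tool-call interruption". That was over-abstraction. **Application-layer patterns can all be expressed implicitly through timestamp + independent session_id**:
+
+| Application-layer pattern | How it appears in the trace | What the inference engine must do |
+|---|---|---|
+| Tool call returning asynchronously | a large timestamp gap between turn N and N+1 | nothing; just send requests on schedule |
+| Nested sub-agent | the parent session's timestamps suddenly pause; the sub-agent is an independent session_id | treat them as two independent sessions (no KV sharing needed either) |
+| Speculative N branches | N independent session_ids sent at the same time | the radix prefix cache hits the shared prefix naturally; no extra abstraction is needed |
+
+**This item is not a structural flaw.** It has been removed from the conclusions.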
+
+---
+
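+## 9. Action items (ordered by ROI)
+
+### Priority P0 (fixing these significantly reduces starvation/unfairness)
+
+1. **[§1] Add a capacity-aware penalty to KvAwarePolicy + allow sessions to migrate across D**: medium effort, largest payoff
+2. **[§2] Introduce tiered eviction on D (cold sessions, hot retract)**: medium effort, large payoff
+3. **[§7] Run a time-scale=1 baseline**: small effort (config only), but **without it none of the conclusions can be trusted**
+
+### Priority P1 (fixing these fills in engineering stability)
+
+4. **[§3] D→Replay backpressure channel** (add a pause hint to the admission response): small effort
+5. **[§4] Split admission into probe + commit_evict**: medium effort
+6. **[§6] Make `_estimate_session_resident_tokens` use the increment**: small effort
+
+### Priority P2 (decide after the P0 data is in)
+
+7. **[§5] Consider D health when choosing P on the P side**: medium effort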
+
+---
+
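+## 10. Limitations and unverified assumptions
+
+1. **N=1**: all data comes from a single run (v6 P0 already showed EXP2 errors drifting between 9 and 912, so single-run variance is huge). Every number in this document should be read as a "representative observation", not a "statistically significant conclusion".
+2. **time-scale=10 distortion** (§7): the degree to which "KVC loses to DP" may be amplified by the benchmark. This is the largest uncertainty.
+3. **Hardware advantage in the 8DP comparison**: DP runs prefill+decode on all 8 workers; KVC is 2P+6D, so only 6 workers can decode. In theory 8 workers vs 6 gives a built-in 1.33× decode-concurrency advantage. This document does not normalize for that, but 8DP's advantage is far larger than 1.33× (KVC's mean latency is 145% higher), so the core conclusion (KVC loses systematically on this workload) is unaffected.
+4. **mooncake TCP loopback**: all transfer errors are artifacts of single-node TCP emulation. Under RDMA in production, the error-rate distribution may be completely different.
+5. **Stale `decode_resident_blocks` in KvAwarePolicy** (end of §1.2): the effect is supported by observation (overlap loses its discriminative power mid-run), but **"what happens if the stale state is cleared" has not been tested systematically**.
+6. **P-side errors concentrated on prefill-0** (§5.1): the causal chain is conjecture; it may also be a chance result of "prefill-0 started earlier + a race". This has not been verified with N>1 data.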
+
+---
+
+## Appendix A: Index of data artifacts
+
+```
+outputs/qwen3-30b-tp1-v5-optD-baseline-rerun/
+├── exp2_2p6d_run1_metrics.jsonl    ← primary data source for this document
+├── exp2_2p6d_run1_summary.json
+├── exp2_2p6d_run2_*  (errors=912, evidence of single-run variance)
+├── exp2_2p6d_run3_*  (errors=396)
+└── kvcache-centric-*-20260429T142429Z/logs/
+    ├── decode-{0..5}.log           ← §2.1 LRU vs error counts
+    └── prefill-{0,1}.log           ← §5.1 P error distribution
+
+outputs/qwen3-30b-tp1-exps/
+├── exp1_8way_dp_cache_aware_summary.json   ← reference baseline
+└── RESULTS_SUMMARY.md
+```
+
+## Appendix B: Related documents
+
+- `docs/PROJECT_OVERVIEW.md`: project goals and implemented features
+- `docs/KVC_DEBUG_JOURNEY_V1_TO_V5.md`: v1→v5 version history
+- `docs/V5_PROFILE_INVESTIGATION_ZH.md`: v5+profile investigation (revised after critique)
+- `docs/SWEBENCH_EXPERIMENT_RESULTS.md`: Qwen3.5-35B-A3B SWE experiments
--- a/docs/KVC_DEBUG_JOURNEY_V1_TO_V5.md
+++ b/docs/KVC_DEBUG_JOURNEY_V1_TO_V5.md
@@ -0,0 +1,367 @@
+# KVC Experiment Pitfalls and Code Bug Analysis (v1 → v5)
+
+This document records the pitfalls hit across the v1 to v5 KVC experiments, the error diagnoses, and the code bugs that were finally identified.
+Model: Qwen3-30B-A3B (TP1); hardware: single node with 8×H100 80GB.
+Trace: `qwen35-swebench-50sess.jsonl` (4449 requests, 52 sessions).
+
+## TL;DR
+
+| Version | Key change | Truncation rate | direct-to-D share | P50 | Main bottleneck |
+|------|----------|:---:|:---:|:---:|----------|
+| v1 (smoke / early) | mechanism works end to end | - | - | - | - |
+| v2 | KVC + `--policy default` | **56.8% / 61.4%** | <0.1% | 0.08s* | Routing mismatch (default policy) |
+| v3 | KVC + `--policy kv-aware` | **0.9%** | 30-42% | 1.5-1.8s | session-cap fallback (52-65%) |
+| v4 | v3 + soft_cap 4→16 | 1.0% | 54-58% | 1.08 / 0.84s | session-cap fb still 35%, 9-10% mooncake errors |
+| v5 | Option D: worker-mode drives seed/reseed | 0.9% | 41-45% | 1.59 / 1.31s | D KV pool is genuinely too small → fallback rises to 46-51% |
+
+`*` v2's P50 is a fake number: more than half of the requests were aborted after generating a single token.
+
+## v2 pitfall: the default policy is fundamentally incompatible with the KVC mechanism
+
+### Symptoms
+
+`scripts/sweep_tp1_v2_fixed.sh` produced:
+- Exp1 (8-way DP, baseline): 4449/4449 succeeded, P50=0.65s, error=0
+- Exp2 (1P7D KVC): **2524 truncated (56.8%)**, 18 errors, P50=0.08s* (fake)
+- Exp3 (2P6D KVC): **2733 truncated (61.4%)**, 17 errors, P50=0.08s* (fake)
+
+Every truncated request has `actual_output_tokens=1` and `finish_reason="abort: session id X does not exist"`.
+
+### The wrong early diagnosis
+
+Earlier, `RESULTS_SUMMARY.md` blamed SGLang's `--disaggregation-decode-allow-local-prefill` flag, on the theory that the D worker still performed a local prefill when `bootstrap_room` was present. That diagnosis is **completely wrong**. See `_should_allow_local_prefill_on_decode` at `scheduler.py:1975-1980`:
+
+```python
+def _should_allow_local_prefill_on_decode(self, req: Req) -> bool:
+    return (
+        self.disaggregation_mode == DisaggregationMode.DECODE
+        and self.server_args.disaggregation_decode_allow_local_prefill
+        and req.bootstrap_room is None  # ← with a bootstrap_room, local prefill is never taken
+    )
+```
+
+Requests on the KVC reseed path all carry `bootstrap_room`, so they never trigger local prefill.
+
+### Actual root cause: round-robin mismatch between Replay and the PD Router
+
+In the experiment scripts, KVC used `--policy default` while the baseline used `--policy kv-aware`.
+Per `benchmark.py:287-300`, the two differ enormously:
+
+```python
+def _decode_policy_for(policy_name: str) -> str:
+    if policy_name == "sticky":      return "manual"
+    if policy_name == "kv-aware":    return "consistent_hashing"
+    return "round_robin"  # default
+
+def _header_mode_for(policy_name: str) -> str:
+    if policy_name == "sticky":      return "routing-key"
+    if policy_name == "kv-aware":    return "target-worker"
+    return "none"  # default
+```
+
+Under the `default` policy + KVC mechanism:
+1. The replay policy (`policies.py:DefaultPolicy`) picks a D by round-robin, say D-3
+2. Replay calls `open_session(session_id=X)` on D-3 (`replay.py:1722-1731`)
+3. Replay sends the request through the PD Router (with `session_params`), but because `header_mode=none` it **sends no routing header at all**
+4. The PD Router (`pd_router.py:_select_decode_index`) sees `decode_policy=round_robin`, round-robins with **its own independent counter**, and sends the request to D-5
+5. D-5's scheduler sees a session_id in `session_params`, but its own `session_controller` has no such session (the session lives on D-3) → abort with `"Invalid request: session id X does not exist"` (`scheduler.py:1824-1836`)
+
+Once the two independent round-robin counters fall out of step a single time (any concurrency, or any direct-to-D request that bypasses the router, will cause it), they never line up again.
+
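The "one slip and they never realign" behavior is easy to reproduce in a toy model. The simulation below is a deliberate simplification, not the real replay or router code: it assumes one worker choice per request on each side, and `bypass_indices` (a hypothetical parameter) marks requests that skip the router and therefore advance only the replay-side counter.

```python
from typing import Collection


def count_misrouted(n_requests: int, n_decode: int, bypass_indices: Collection[int] = ()) -> int:
    """Count routed requests whose router-chosen D differs from the replay-chosen D."""
    replay_cursor = 0   # replay-side round-robin counter
    router_cursor = 0   # router-side round-robin counter, advanced independently
    misrouted = 0
    for i in range(n_requests):
        session_worker = replay_cursor % n_decode   # where replay opened the session
        replay_cursor += 1
        if i in bypass_indices:
            continue  # direct-to-D: the router never sees it, its counter stays put
        routed_worker = router_cursor % n_decode    # where the router sends the request
        router_cursor += 1
        if routed_worker != session_worker:
            misrouted += 1
    return misrouted
```

With 7 decode workers and no bypass, the counters stay in lockstep and nothing is misrouted; a single bypassed request at index 3 leaves the counters offset by one, so all 96 later routed requests out of 100 land on the wrong worker.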
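+### Why does turn 0 not fail?
+
+Turn 0 goes through `_invoke_plain_router` (`replay.py:1894`) without `session_params` and is handled as an ordinary PD disagg request, so any D will do. Only from turn 1 onward does the request take the KVC path with session_params and run into the routing mismatch.
+
+### Verification from data signatures (per-session pattern)
+
+```
+session 11360 (58 turns): pattern = .TTTTT.TTTTTTT.TTTTTT...   ← turn 0 OK, turns 1+ almost all T
+session 18720 (87 turns): pattern = .TTTTTTTTTTTTTTTTTT...
+```
+
+Every D worker received requests from all 52 sessions, because round-robin scattered each session across all workers (ideally each D would serve only ~7-8 sessions).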
+
+### Fix
+
+The only correct fix is to change KVC's policy from `default` to `kv-aware`:
+
+```diff
+- --policy default
++ --policy kv-aware
+```
+
+With `KvAwarePolicy` (`policies.py:146-187`), three things happen:
+1. It scores each D with `_overlap_blocks` + `sticky_bonus`, so a session naturally sticks to the same D (**session affinity**)
+2. `header_mode=target-worker` sends the `x-smg-target-worker` header
+3. The PD Router runs in `consistent_hashing` mode and uses the header directly when present, instead of round-robin
+
+## v3, after switching to the kv-aware policy: routing is right, but a new bottleneck appears
+
+`scripts/sweep_tp1_v3_kvaware.sh` switched all KVC experiments to `--policy kv-aware`. Results:
+
+| Metric | v2 1P7D (default) | **v3 1P7D (kv-aware)** | v3 2P6D | 8-way DP baseline |
+|------|:---:|:---:|:---:|:---:|
+| Truncated | 56.8% | **0.9%** | 0.9% | 1.5% |
+| Errors | 18 | 363 (8.2%) | 9 | 0 |
+| Mean | 4.74s | 4.88s | 3.58s | 1.43s |
+| P50 | 0.08s* (fake) | 1.75s | 1.52s | 0.65s |
+| P90 | 12.14s | 12.67s | 9.23s | 3.61s |
+| TTFT P50 | - | 0.36s | 0.33s | 0.09s |
+
+✅ **Truncation dropped from 56.8% to 0.9%; the routing problem is fully solved.**
+❌ But P50 is still 2-3× the baseline.
+
+### The direct-to-D path performs very well (what KVC should look like)
+
+Broken down by execution_mode:
+
+| Path | Exp1 1P7D share | Exp1 1P7D P50 | Exp1 1P7D TTFT P50 |
+|------|:---:|:---:|:---:|
+| `kvcache-direct-to-d-session` ✨ | 42.0% | **0.495s** | **0.043s** |
+| `pd-router-fallback-large-append-session-cap` 🔥 | **52.6%** | 5.6s | 3.7s |
+
+On the direct-to-D path:
+- P50 = 0.495s (**about 24% faster than the baseline's 0.65s**)
+- TTFT P50 = 0.043s (**about 2× faster than the baseline's 0.093s**)
+- KV transfer = 0 (no P involvement, pure append-prefill on D)
+
+This is KVC's real value. But only 30-42% of requests reach this path.
+
+### New bottleneck: session-cap fallback takes 52-65%
+
+`pd-router-fallback-large-append-session-cap` accounts for 52.6% of 1P7D and 65.4% of 2P6D. This path means the router wanted to open a new session on D but admission refused ("d-session-cap"), so the request fell back to the plain router (full prefill on P + transfer to D, with no session reuse).
+
+### Bimodal session distribution (starvation)
+
+| Session | Total turns | Direct-to-D | Session-cap fallback |
+|---------|:---:|:---:|:---:|
+| 22080 | 129 | **98%** | 0% |
+| 3840 | 118 | **97%** | 0% |
+| 70560 | 150 | **0%** | **99%** |
+| 39360 | 148 | **0%** | **99%** |
+| 61600 | 117 | **0%** | **99%** |
+
+A session is either entirely lucky or entirely starved: a textbook bimodal distribution.
+
+### Root cause: hard-coded cap=4
+
+See the original code of `replay.py:_decode_session_soft_cap`:
+
+```python
+def _decode_session_soft_cap(...) -> int:
+    target_tokens = max(1, _estimate_session_resident_tokens(request))
+    usable_capacity_tokens = _usable_capacity_tokens(residency, server_url)
+    ...
+    if usable_capacity_tokens <= 0:
+        return 4
+    return max(1, min(4, usable_capacity_tokens // target_tokens))
+    #              ^^^ hard-coded upper bound of 4
+```
+
+7 D × at most 4 sessions per D = **28 session slots in total**. The trace has 52 sessions → 24 sessions can never get a slot.
+
+A startup race condition decides which sessions are the "lucky ones": every later turn of the first 28 sessions to squeeze in takes direct-to-D (fast); the remaining 24 sessions take the session-cap fallback forever (slow).
+
+## v4 improvement: raise the hard cap from 4 to 16
+
+A small change to two constants in `replay.py:_decode_session_soft_cap`:
+
+```diff
+-    if usable_capacity_tokens <= 0:
+-        return 4
+-    return max(1, min(4, usable_capacity_tokens // target_tokens))
++    if usable_capacity_tokens <= 0:
++        return 16
++    return max(1, min(16, usable_capacity_tokens // target_tokens))
+```
+
+7 D × 16 = 112 slots, far more than the 52 sessions need.
+
+### v4 actual results (vs v3 1P7D / 2P6D)
+
+| Metric | v3 1P7D | **v4 1P7D** | v3 2P6D | **v4 2P6D** | baseline 8DP |
+|------|:---:|:---:|:---:|:---:|:---:|
+| Errors | 363 (8%) | 435 (10%) | 9 (0%) | **403 (9%)** | 0 |
+| Truncated | 42 | 43 | 42 | 36 | 68 |
+| **direct-to-D** | 38.6% | **54.3%** | 30.5% | **58.0%** ⭐ | - |
+| **session-cap fallback** | 48.3% | 37.4% | 65.4% | **34.7%** | - |
+| Session reused | 1716 | 2180 | 1358 | **2348** | - |
+| KV transfer blocks | 62K | 53K | 79K | **51K** | - |
+| Mean | 4.88s | 4.21s | 3.58s | **2.51s** | 1.43s |
+| **P50** | 1.75s | 1.08s | 1.52s | **0.84s** | **0.65s** |
+| P90 | 12.67s | 13.38s | 9.23s | **6.51s** | 3.61s |
+| P99 | 28.72s | 24.45s | 18.70s | 18.34s | 8.38s |
+| **TTFT P50** | 0.36s | 0.056s | 0.33s | **0.051s** ⭐ | 0.094s |
+| TTFT P90 | 10.97s | 11.90s | 6.95s | **2.64s** | 0.26s |
+
+✓ direct-to-D share rose from 30.5-38.6% in v3 to 54-58% in v4
+✓ session reuse +27% (1P7D) / +73% (2P6D)
+✓ KV transfer volume -15% (1P7D) / -36% (2P6D)
+✓ TTFT P50 beats the baseline by 46% (0.051s vs 0.094s)
+
+### The direct-to-D path beats the baseline across the board (KVC's real value)
+
+| Config | n | Lat P50 | Lat P90 | TTFT P50 | TTFT P90 |
+|--------|:---:|:---:|:---:|:---:|:---:|
+| baseline 8DP | 4381 | 0.66s | 3.65s | 0.094s | 0.256s |
+| v4 1P7D direct-to-D | 2179 | 0.495s | 3.03s | 0.044s | 0.055s |
+| **v4 2P6D direct-to-D** | **2348** | **0.499s** | **2.86s** | **0.043s** | **0.054s** |
+
+The direct-to-D subset relative to the baseline:
+- P50 is 24-25% faster
+- P90 is 17-22% faster
+- TTFT P50 is about 54% faster
+- TTFT P90 is about 79% faster
+
+### Overall performance (excluding errors and truncated) vs baseline
+
+| Config | clean | Mean | P50 | P90 | P99 |
+|--------|:---:|:---:|:---:|:---:|:---:|
+| baseline 8DP | 4381 | 1.45s | 0.66s | 3.65s | 8.38s |
+| v4 2P6D | 4010 | 2.53s | 0.85s | 6.55s | 18.33s |
+
+vs baseline: P50 is 29% slower, P90 is 80% slower, P99 is 119% slower. Even with errors excluded, the overall result still loses to the baseline; the root cause is that 35% of requests are pushed onto the fallback path.
+
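+### New bottleneck 1: 35% of requests still take the session-cap fallback
+
+After raising the cap to 16, the real bottleneck is the capacity-based computation: `min(16, usable_capacity_tokens // target_tokens)`.
+- `target_tokens = input + output`, commonly 50-100K in agentic workloads
+- D's KV pool ≈ 100-150K tokens (80GB H100, mem_fraction=0.835)
+- `usable / target` = 1-2, nowhere near 16 → the effective cap is the small number the capacity computation yields
+
+Solving this requires changing the capacity-based estimation logic (or adopting Option D and letting D decide for itself).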
+
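The arithmetic behind both cap values can be checked with a few lines. The function below mirrors the two-branch formula quoted from `_decode_session_soft_cap`, with the hard cap lifted into a parameter for illustration; the token figures are example values inside the 100-150K pool and 50-100K footprint ranges given above, not measurements.

```python
def decode_session_soft_cap(usable_capacity_tokens: int, target_tokens: int, hard_cap: int) -> int:
    """Sessions allowed per D: the hard cap, clamped by what the KV pool can hold."""
    target_tokens = max(1, target_tokens)
    if usable_capacity_tokens <= 0:
        return hard_cap
    return max(1, min(hard_cap, usable_capacity_tokens // target_tokens))


N_DECODE, N_SESSIONS = 7, 52

# Slot counts implied by the hard cap alone (v3 vs v4).
slots_v3 = N_DECODE * 4                    # 28
slots_v4 = N_DECODE * 16                   # 112
starved_v3 = N_SESSIONS - slots_v3         # 24 sessions without a slot

# Effective cap once capacity is taken into account (illustrative numbers).
cap_small_session = decode_session_soft_cap(120_000, 60_000, hard_cap=16)   # 2
cap_large_session = decode_session_soft_cap(120_000, 100_000, hard_cap=16)  # 1
```

Raising `hard_cap` from 4 to 16 changes nothing for these inputs: the capacity term already limits each D to one or two sessions, which is why the fallback share stayed around 35% in v4.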
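+### New bottleneck 2: 9-10% errors (mooncake transfer timeouts)
+
+The P-side log shows:
+
+```
+KVTransferError: Failed to send kv chunk of <bootstrap_room> to 10.45.7.165:40319
+Sync batch data transfer timeout after 32722558107ns  (32-second timeout)
+Decode instance could be dead, remote mooncake session ... is not alive
+```
+
+Characteristics:
+- All errors occur after the 44.8% mark of the run (system pressure accumulates)
+- 98% of errors are concentrated at turn ≥ 31 (requests with large inputs)
+- With v3 cap=4, 1P7D already had 363 errors (the impact was concentrated on a single D); v4 cap=16 spreads the pressure evenly, but at a larger magnitude
+
+This is mooncake TCP loopback hitting timeouts once concurrency rises, **not an SGLang logic bug**. Fix directions:
+1. Lengthen the mooncake transfer timeout (currently 32s)
+2. Limit the number of concurrent inflight transfers
+3. Switch to RDMA (loopback is a single-node emulation; production would use real RDMA)
+4. chunked KV transfer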
+
+## v5: landing Option D, with worker-mode driving seed/reseed
+
+`scripts/sweep_tp1_v5_optD.sh` actually implements Option D in code. Core change: switch `--kvcache-admission-mode` from `local` (replay estimates) to `worker` (D decides), and extend it to **all of the direct_append + seed + reseed paths**.
+
+### Key code changes
+
+1. SGLang side: the `admit_direct_append` endpoint in `scheduler.py` gains a `mode` field supporting `direct_append | seed`; seed mode makes D actually reserve KV pool blocks and proactively call `maybe_trim_decode_session_cache` for LRU.
+2. Replay side: in `replay.py`, reseed / turn-1 seed / large-append-reseed all go through the same admit endpoint; `_decode_session_soft_cap` is bypassed entirely in worker mode.
+3. New run parameters: `--kvcache-admission-mode worker`, `--kvcache-seed-min-turn-id 1`, `--kvcache-seed-max-inflight-decode -1`, `--kvcache-prefill-backup-policy release-after-transfer`, `--kvcache-prefill-priority-eviction`.
+
+### Hypotheses
+
+- v4's 35% session-cap fallback comes from a stale replay view + a conservative capacity-based computation → letting D inspect its own KV pool should recover that 35%.
+- D's proactive LRU eviction is more accurate than the reservation logic replay implements itself, so it **should** let more sessions seed in.
+
+### v5 actual results (vs v4, same configuration)
+
+| Metric | v4 1P7D | **v5 1P7D** | v4 2P6D | **v5 2P6D** | baseline 8DP |
+|------|:---:|:---:|:---:|:---:|:---:|
+| Errors | 435 (10%) | **9 (0.2%)** ⭐ | 403 (9%) | **9 (0.2%)** ⭐ | 0 |
+| Truncated | 43 | 42 | 36 | 42 | 68 |
+| direct-to-D | 54.3% | 44.7% ↓ | 58.0% | 41.3% ↓ | - |
+| **session-cap fallback** | 37.4% | **45.6%** ↑ | 34.7% | **50.6%** ↑ | - |
+| no-d-capacity fallback | 0.3% | 1.2% | 0.2% | 0.8% | - |
+| pd-router-turn1-seed (newly visible) | - | 1.2% | - | 1.1% | - |
+| pd-router-d-session-reseed (newly visible) | - | 4.8% | - | 3.4% | - |
+| pd-router-large-append-reseed (newly visible) | - | 1.0% | - | 1.0% | - |
+| Session reused | 2180 | 1990 | 2348 | 1837 | - |
+| KV transfer blocks | 53K | 66K | 51K | 69K | - |
+| Mean | 4.21s | 5.18s | 2.51s | 3.49s | 1.45s |
+| **P50** | 1.08s | 1.59s | 0.84s | 1.31s | 0.66s |
+| P90 | 13.38s | 14.67s | 6.51s | 9.09s | 3.65s |
+| P99 | 24.45s | 26.09s | 18.34s | 24.92s | 8.38s |
+| TTFT P50 | 0.056s | 0.21s | 0.051s | 0.24s | 0.094s |
+| TTFT P90 | 11.90s | 13.06s | 2.64s | 6.90s | 0.26s |
+
+✅ **Reliability improves sharply**: mooncake transfer-timeout errors fall from 9-10% to 0.2%. Deciding on D's real capacity avoids v4's "optimistic admit → timeout 30s later" death link.
+✅ The reseed / turn1-seed paths appear explicitly for the first time, showing that the admission endpoint really takes effect in seed mode.
+❌ **session-cap fallback rises instead of falling** (37→46% and 35→51%). This shows that v4's local soft_cap was actually **more optimistic than D's real capacity**: sessions were admitted and then immediately hit OOM, which was counted as an error rather than a fallback.
+❌ Direct result: **the direct-to-D share drops and overall latency gets worse across the board.** P50/P90/P99 and TTFT all regress.
+
+### The direct-to-D subset stays solid (KVC's real value is still there)
+
+| Config | n | Lat P50 | Lat P90 | TTFT P50 | TTFT P90 |
+|--------|:---:|:---:|:---:|:---:|:---:|
+| baseline 8DP | 4381 | 0.66s | 3.65s | 0.094s | 0.256s |
+| v4 2P6D direct-to-D | 2348 | 0.499s | 2.86s | 0.043s | 0.054s |
+| **v5 1P7D direct-to-D** | 1990 | 0.475s | 3.04s | 0.043s | 0.055s |
+| **v5 2P6D direct-to-D** | 1837 | 0.483s | 3.04s | 0.043s | 0.054s |
+
+The tail latency and TTFT of direct-to-D are almost identical to v4 (the endpoint decision overhead is negligible). **v5's regression is not the path itself getting slower; more requests are being pushed onto fallback.**
+
+### The fallback path is even worse than in v4
+
+| Config | n | Lat P50 | Lat P90 | TTFT P50 |
+|--------|:---:|:---:|:---:|:---:|
+| v5 1P7D session-cap fallback | 2027 | 6.38s | 17.47s | 4.49s |
+| v5 2P6D session-cap fallback | 2253 | 3.13s | 11.25s | 0.89s |
+
+Because the fallback share rose, and this path is inherently an order of magnitude slower than direct-to-D, the overall mean is dragged down even further.
+
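+### The bottleneck v5 really exposes: the physical capacity of D's KV pool
+
+With the admission decision handed to D, the bottleneck shifts from "replay estimates too rigidly" to "D really cannot fit it":
+
+- 80GB H100 × `mem_fraction_static=0.835` → KV pool per D GPU ≈ 100-150K tokens
+- a long-context agentic session has a per-turn footprint of 50-100K
+- a single D can host only 2-3 sessions concurrently → fitting 50 sessions on 7 D workers is essentially impossible
+
+Part of why v4's cap=16 "looked good" is that the local soft_cap never actually checked D's free pool and opened many sessions that **would eventually fail** (counted as errors rather than fallbacks). v5 turns these into "honest rejections": the price of the reliability jump is seeing the real capacity ceiling.
+
+### What v6 should target
+
+Open up D's physical capacity management rather than tuning replay further:
+
+1. **Earlier release of prefill backup** (`release-after-transfer` has been added but may still not be timely enough) → keep backup blocks on P from occupying the KV pool long-term.
+2. **Tune the priority eviction strategy** (`--kvcache-prefill-priority-eviction` is already on): the current LRU may wrongly evict hot sessions; eviction should be weighted by session hit frequency / recency.
+3. **chunked / streamed seed**: do not reserve capacity for the whole prompt at once; spread it across chunks.
+4. **Cross-D session migration**: when one D is full but a neighboring D is free, migrate proactively instead of falling straight back to P.
+5. **Real multi-node RDMA**: single-node mooncake loopback is one root cause of the errors; only multi-node + RDMA makes the KV transfer after prefill backup release truly stable.
+
+Effort: 1-3 are SGLang-internal changes (`scheduler.py` + `session_controller.py`), 4 needs a router protocol extension, 5 is a deployment change.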
+
+## Index of key files and code locations
+
+| Symptom / component | Code location |
+|------|----------|
+| Replay policy round-robin | `policies.py:63-67` `RoutingState.next_decode_worker_id` |
+| KV-aware policy (session affinity) | `policies.py:146-187` `KvAwarePolicy.select` |
+| PD router decode selection | `pd_router.py:51-74` `_select_decode_index` |
+| Header construction | `replay.py:2407-2424` `_build_headers` |
+| Policy → router config mapping | `benchmark.py:287-300` `_decode_policy_for/_header_mode_for` |
+| Session admission soft cap | `replay.py:889-905` `_decode_session_soft_cap` |
+| Existing D-side admission endpoint | `scheduler.py:3497-3580` `admit_direct_append` (v5 extends it to support `mode=seed`) |
+| Worker-mode admission callers | `replay.py` reseed / turn1-seed / large-append-reseed paths |
+| Prefill backup release policy (introduced in v5) | `--kvcache-prefill-backup-policy release-after-transfer` |
+| Prefill priority eviction (introduced in v5) | `--kvcache-prefill-priority-eviction` |
+| Error when a session is not found on D | `scheduler.py:1824-1836` |
+| `_should_allow_local_prefill_on_decode` | `scheduler.py:1975-1980` |
+| Reseed flow entry point | `replay.py:1665-1809` `_invoke_kvcache_seeded_router` |
+| Direct-to-D flow | `replay.py:2351-2398` `_invoke_decode_session_direct` |
+
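+## Lessons learned
+
+1. **policy and mechanism are two orthogonal dimensions.** `--policy default` is not a "harmless default"; it really is round-robin with no session affinity. The KVC mechanism must be paired with a session-affine policy.
+
+2. **Do not blindly trust a previous agent's RESULTS_SUMMARY.** v2's diagnosis ("local prefill bug") does not match the actual finish_reason ("session id does not exist") at all. Any error diagnosis must be cross-checked against raw fields such as finish_reason and execution_mode.
+
+3. **A bimodal distribution is a strong signal of starvation.** In the v3 data some sessions take the fast path nearly 100% of the time and others the slow path nearly 100% of the time, which almost certainly indicates some kind of "first come, first served" resource contention. On seeing this pattern, look immediately for a hard-coded cap or a globally shared resource.
+
+4. **Measure by group, not by overall mean.** v3's overall P50=1.5s looks slower than the baseline, but the direct-to-D subset at P50=0.495s already beats it. The overall mean is dragged down by the fallback path, but KVC's core value is real.
+
+5. **errors and fallback are two faces of the same resource pressure.** v4's "low fallback rate + high error rate" is not a better outcome; it disguises over-capacity failures as "timeout failures" instead of "explicit rejections". After v5 hands the decision to real capacity, fallback rises and errors fall; that is the more honest metric, so do not be misled by v4's fallback numbers. When error rate and fallback rate are anti-correlated, suspect that the admission decision is lying.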
--- a/docs/REAL_ALI_KVC_EXPERIMENT_LOG_ZH.md
+++ b/docs/REAL_ALI_KVC_EXPERIMENT_LOG_ZH.md
@@ -0,0 +1,514 @@
+# Real Ali KVC Experiment Log
+
+**Branch**: `kvc-real-ali-iter-v1`, checked out from `kvc-debug-journey-v1-to-v4`.
+**Date**: 2026-05-11/12.
+**Environment**: single node with 8x NVIDIA H20, SGLang xPyD, model `/home/admin/cpfs/wjh/models/Qwen/Qwen3-Coder-30B-A3B-Instruct`.
+**Real trace**: `/home/admin/cpfs/wjh/ali-trace/trace-qwen3-coder-formatted/041715-041717.jsonl`.
+
+This log records KVC pd-hybrid iterations on the real Ali workload. Conclusions hold only under the current evidence; the `time-scale=10` smoke and the KVC-friendly slice are not used as full-workload headlines.
+
+## 1. Latest progress
+
+Fixed samples and a sweep pipeline for the real Ali trace have been added:
+
+- `scripts/prepare_real_ali_samples.py`: generates reproducible experiment samples from the real Ali trace, preserving the real input/output/hash_ids/timestamp, with optional timestamp rebasing.
+- `scripts/sweep_real_ali_kvc.sh`: runs DP cache-aware, PD-disaggregation, KVC, and KVC+backpressure in turn on the same prebuilt sample.
+- `benchmark-live --use-trace-as-sample`: replays the given trace directly, so that strategies are not made incomparable by resampling.
+- `replay-progress.jsonl` heartbeat: future long runs write client-side progress every 30s without polling `/server_info`, to avoid perturbing the scheduler.
+- `prepare_real_ali_samples.py --max-sampled-duration-s`: generates a capped sample for quick smoke runs; for iteration only, not for headlines.
+
+Real Ali KVC-fit smoke completed so far:
+
+- Sample: `outputs/real-ali-kvc-iter/samples-balanced/ali-kvc-fit-smallappend.jsonl`
+- 179 requests, 64 sessions, all multi-turn; 115 turn2+ requests, direct-eligible ratio 100%.
+- `time-scale=10`, concurrency 32.
+- DP cache-aware, PD-disaggregation, KVC no-backpressure, and KVC+backpressure have all completed.
+
+## 2. 全量 Ali trace 画像
+
+`outputs/real-ali-kvc-iter/ali-full-profile.json` 显示：
+
+| 指标 | 数值 |
+|---|---:|
+| requests | 763,727 |
+| sessions | 555,905 |
+| multi-turn sessions | 39,247 |
+| turn2+ requests | 207,822 |
+| turn2+ direct-eligible ratio | 82.95% |
+| input p50 / p90 / p99 | 4,329 / 51,067 / 112,955 tokens |
+| output p50 / p90 / p99 | 93 / 826 / 5,616 tokens |
+| append p50 / p90 / p99 | 303 / 2,879 / 17,885 tokens |
+| inter-turn gap p50 / p90 / p99 | 4.65s / 38.68s / 1,133s |
+
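+This profile shows that KVC has real applicability: hash overlap and small appends are common on turn2+. However, single-turn sessions dominate the full workload (516,658 of 555,905 sessions, about 93%), so KVC gains will be heavily diluted. Results must therefore be reported per slice, not only on the KVC-fit subset.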
+
+## 3. Samples run so far
+
+### Continuous 15min cold-window session sample
+
+Path: `outputs/real-ali-kvc-iter/samples-window-900s-600req/ali-window.jsonl`
+
+- 600 requests, 439 sessions, 32 multi-turn sessions.
+- Rebased duration: 886.544s, covering about 15min.
+- turn2+ requests: 161; direct-eligible: 143; ratio 88.8%.
+- input p50 / p90 / p99: 3,871 / 68,234 / 98,131.
+- output p50 / p90 / p99: 85 / 712 / 5,195.
+- append p50 / p90 / p99: 274 / 2,202 / 16,120.
+- inter-turn gap p50 / p90 / p99: 4.656s / 19.376s / 63.575s.
+
+This is the replacement validation sample for the 179-request KVC-fit smoke. It splits the 900s window into 15 time buckets and selects, round-robin across buckets, whole sessions that start from their root inside the window, until 600 requests are reached. This keeps missing parents from causing `load_trace()` to fragment real sessions, and it spreads requests over the whole 15min instead of taking only the first 600 requests of the window.
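
The bucketed selection described above can be sketched roughly as follows. This is a minimal illustration, not the logic of `scripts/prepare_real_ali_samples.py`: the function name `sample_window_sessions`, the `(session_id, root_timestamp_s, n_requests)` tuple layout, and the visiting order are all assumptions. Because it only picks whole sessions, it may overshoot the request target, and the real script may handle that boundary differently.

```python
from collections import defaultdict


def sample_window_sessions(sessions, window_s=900.0, buckets=15, target_requests=600):
    """Pick whole sessions round-robin across time buckets.

    sessions: iterable of (session_id, root_timestamp_s, n_requests), with
    timestamps relative to the window start (hypothetical layout).
    Returns (picked_session_ids, total_requests).
    """
    bucket_width = window_s / buckets
    per_bucket = defaultdict(list)
    for sid, root_ts, n_req in sorted(sessions, key=lambda s: s[1]):
        # Cold-window rule: keep only sessions whose root turn is inside the window.
        if 0 <= root_ts < window_s:
            per_bucket[int(root_ts // bucket_width)].append((sid, n_req))

    picked, total = [], 0
    cursor = [0] * buckets  # next unvisited session per bucket
    progressed = True
    while total < target_requests and progressed:
        progressed = False
        for b in range(buckets):
            if total >= target_requests:
                break
            if cursor[b] < len(per_bucket[b]):
                sid, n_req = per_bucket[b][cursor[b]]
                cursor[b] += 1
                picked.append(sid)
                total += n_req
                progressed = True
    return picked, total
```

Visiting one session per bucket per round is what spreads the sample across the window; a session that starts outside the window is never admitted, which is the cold-window restriction noted in this section.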
+
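+Important boundary: this is a **cold-window / new-session-only** sample, not a complete raw production window; it excludes ongoing sessions that were already active before the window started. It can therefore be used for stability validation under "600+ requests, 15min, real mixed load", but it cannot by itself represent the full Ali production window.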
+
+### KVC-fit small append
+
+Path: `outputs/real-ali-kvc-iter/samples-balanced/ali-kvc-fit-smallappend.jsonl`
+
+- 179 requests, 64 sessions.
+- input p50 / p90: 6,446 / 15,491.
+- output p50 / p90: 112 / 1,159.
+- append p50 / p90: 215 / 855.
+- overlap ratio p50 / p90: 0.875 / 0.938.
+
+This is the KVC-friendly slice, used to check the mechanism's upper bound and whether microbenchmark results transfer to real token/hash sequences.
+
+### Representative-mt / early multi-turn balanced
+
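+Path: `outputs/real-ali-kvc-iter/samples-balanced/ali-representative-mt.jsonl`
+
+- 460 requests, 64 sessions.
+- input p50 / p90: 41,175 / 98,621.
+- append p50 / p90 / p99: 272 / 1,979 / 13,900.
+
+This sample is closer to real multi-turn pressure and will be used to check stability under large contexts and large resident KV. Its current implementation, however, takes the earliest 64 multi-turn sessions after start_time; it is not a strictly random or stratified representative sample. A formal headline needs stratified sampling over input/append/output/gap.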
+
+### Capped smoke samples
+
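+To keep a few real long gaps from wasting wall time in smokes, two capped samples were added:
+
+- `outputs/real-ali-kvc-iter/samples-balanced-cap120s/ali-kvc-fit-smallappend.jsonl`: 177 requests, 64 sessions, duration 65.859s.
+- `outputs/real-ali-kvc-iter/samples-balanced-cap120s/ali-representative-mt.jsonl`: 359 requests, 64 sessions, duration 117.366s.
+
+The capped KVC-fit sample drops the two requests at the tail of the original KVC-fit sample with timestamps 3613s and 5414s (179 -> 177 requests). The capped samples are therefore only for quick engineering iteration; formal comparisons should still use the full sample or a real continuous window.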
+
+## 4. Current results
+
+### 4.1 DP cache-aware vs KVC+backpressure, KVC-fit, time-scale=10
+
+| Strategy | Requests | Errors | Trunc | E2E mean | E2E p50 | E2E p90 | E2E p99 | TTFT mean | TTFT p50 |
+|---|---:|---:|---:|---:|---:|---:|---:|---:|---:|
+| 8-way DP cache-aware | 179 | 0 | 0 | 6.603s | 3.126s | 17.639s | 34.582s | 1.112s | 1.052s |
+| KVC 2P6D + worker admission + backpressure | 179 | 0 | 0 | 4.443s | 2.076s | 13.288s | 21.202s | 0.700s | 0.154s |
+
+Paired comparison (KVC - DP):
+
+- overall E2E mean delta: -2.161s; p50 delta: -1.427s; 152/179 wins.
+- turn2+ direct subset: mean delta -2.503s; p50 delta -1.508s; 103/115 wins.
+- turn2+ TTFT mean delta: -0.930s; p50 delta -0.887s.
+
+Execution paths:
+
+- KVC turn1 seed: 64 requests.
+- `kvcache-direct-to-d-session`: 115 requests.
+- session reused: 115.
+- actual KV transfer blocks: 623.
+
+Structural logs:
+
+- admission probes: 179, all `ok`.
+- transfer queue depth: p50=0, p90=2, max=3.
+- backpressure events: 0.
+
+Interpretation: this run shows a positive signal for **KVC direct-to-D/session reuse** on the real Ali KVC-fit slice. It does not show that backpressure works, because backpressure never triggered.
+
+### 4.2 PD-disaggregation baseline, KVC-fit, time-scale=10
+
+| Strategy | Requests | Errors | Trunc | E2E mean | E2E p50 | E2E p90 | E2E p99 | TTFT mean | TTFT p50 |
+|---|---:|---:|---:|---:|---:|---:|---:|---:|---:|
+| PD-disaggregation 2P6D | 179 | 0 | 0 | 7.850s | 6.306s | 15.192s | 22.405s | 4.994s | 5.336s |
+
+Paired comparison (PD - DP):
+
+- overall E2E mean delta: +1.247s.
+- p50 delta: +2.231s.
+- 46/179 faster, 133/179 slower.
+
+Interpretation: on this KVC-fit slice, plain PD-disaggregation is clearly weaker than 8-way DP cache-aware. It pays the P->D transfer and split-scheduling cost without the bypass benefit of KVC direct-to-D.
+
+### 4.3 KVC no-backpressure ablation, KVC-fit, time-scale=10
+
+| Strategy | Requests | Errors | Trunc | E2E mean | E2E p50 | E2E p90 | E2E p99 | TTFT mean | TTFT p50 |
+|---|---:|---:|---:|---:|---:|---:|---:|---:|---:|
+| KVC 2P6D worker admission, no backpressure | 179 | 0 | 0 | 4.404s | 1.936s | 13.200s | 21.326s | 0.604s | 0.139s |
+
+Paired comparison:
+
+- KVC no-BP vs DP: mean delta -2.200s, p50 delta -1.434s, 153/179 wins.
+- KVC no-BP vs PD-disaggregation: mean delta -3.447s, p50 delta -3.514s, 163/179 wins.
+- KVC no-BP vs KVC+BP: mean delta -0.039s, p50 delta -0.005s, 92/179 wins.
+
+Structural analysis:
+
+- direct-to-D rate: 64.25%.
+- admission probes: 179, all `ok`.
+- transfer queue depth: p50=0, p90=2, max=3.
+- pause_ms all 0, backpressure events 0.
+
+Interpretation: no-backpressure and +backpressure are nearly equivalent, which indicates that this slice puts no pressure on D. The gain in this run comes from direct-to-D, not from backpressure.
+
+### 4.4 Continuous 15min / 600-request window, time-scale=1
+
+Sample: `outputs/real-ali-kvc-iter/samples-window-900s-600req/ali-window.jsonl`
+
+Important boundary: this is a cold-window / new-session-only session sample, not a complete raw window. It covers about 15min with `missing_parent_count=0`, but excludes ongoing sessions already active before the window started.
+
+Results:
+
+| Strategy | Requests | Errors | Trunc | E2E mean | E2E p50 | E2E p90 | E2E p99 | TTFT mean | TTFT p50 | TTFT p90 |
+|---|---:|---:|---:|---:|---:|---:|---:|---:|---:|---:|
+| DP cache-aware 8-way | 600 | 1 | 0 | 13.942s | 5.222s | 29.299s | 151.183s | 6.162s | 1.746s | 19.176s |
+| PD-disaggregation 2P6D | 600 | 1 | 0 | 40.886s | 40.018s | 84.681s | 113.460s | 38.545s | 37.782s | 81.852s |
+| KVC 2P6D mem_fraction_static=0.82 | 600 | 53 | 0 | 12.386s | 4.225s | 37.998s | 78.234s | 10.078s | 2.674s | 27.774s |
+
+Default KVC failed to start:
+
+- Default KVC 2P6D hit OOM at startup twice on H20 and never reached replay.
+- Logs show only about 526MB free when the decode/prefill workers started, with OOM during model loading.
+- `--load-format layered` does not support Qwen3-Coder-30B-A3B.
+- With `--mem-fraction-static 0.82`, KVC starts and completes the replay, but this lowers KV pool capacity, so this KVC run is a memory-constrained rerun.
+- With `KVC_SEED_MIN_TURN_ID=2` + `mem_fraction_static=0.82`, the scheduler was SIGKILLed during startup (suspected OS OOM killer) and never reached replay.
+
+Paired comparison (computed only on the 547 paired requests that have a latency on both sides):
+
+- KVC vs DP: mean delta -1.335s, p50 delta -0.055s, p90 delta +19.371s, 284 wins / 263 losses.
+- KVC vs PD: mean delta -28.341s, p50 delta -25.687s, p90 delta +2.834s, 465 wins / 82 losses.
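
A paired comparison of this kind could be computed along the following lines. This is a sketch under assumptions: the log does not say how `p50 delta` is defined, so the code takes it to be the median of per-request deltas, and `paired_comparison` with its dict-based input is hypothetical rather than part of the repo.

```python
from statistics import mean, median


def paired_comparison(a, b):
    """Compare per-request latencies of strategy `a` against strategy `b`.

    a, b: dict mapping request_id -> latency in seconds, or None when the
    request errored. Only requests with a latency on both sides are paired.
    """
    ids = [rid for rid in a if a[rid] is not None and b.get(rid) is not None]
    deltas = [a[rid] - b[rid] for rid in ids]  # negative means `a` was faster
    return {
        "n": len(ids),
        "mean_delta": mean(deltas),
        "p50_delta": median(deltas),
        "wins": sum(d < 0 for d in deltas),
        "losses": sum(d > 0 for d in deltas),
    }
```

Dropping requests that errored on either side keeps the deltas meaningful, but it also hides the error rate, which is why error counts need to be reported next to any paired result.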
+
+KVC structural data:
+
+- execution modes: 388 `pd-router-turn1-seed`, 90 `kvcache-direct-to-d-session`, 67 `pd-router-fallback-large-append-session-cap`, 1 `pd-router-large-append-reseed`, 1 `pd-router-turn1-d-backpressure`, 53 `kvcache-centric` error rows.
+- direct-to-D rate: 15.0%.
+- direct-to-D distribution by session: 413/439 sessions at a 0-20% direct rate; only 6 sessions at 80-100%.
+- admission probes: 533; reason `ok` 531, `no-space` 2; queue depth p50=0, p90=2, max=5.
+- The pause hint was non-zero 20 times, but there were no backpressure events because this run was no-BP.
+
+KVC error breakdown:
+
+- 50 `ReadTimeout`.
+- 2 `HTTPStatusError 400 Bad Request` on `open_session`.
+- 1 context length error: the same `input_length=310521 > 262144` seen by DP/PD.
+- Errors concentrate on turn1: 50 on turn1, 3 on turn2+.
+
+Interpretation:
+
+1. KVC is still clearly better than plain PD, which shows that plain P->D disaggregation is very costly on the real 600-request window.
+2. Against DP, KVC shows only a small positive signal on the mean/p50 of clean requests, while p90 gets worse and error_count rises from DP's 1 to 53.
+3. So on this 600-request / 15min window, **KVC cannot be counted as a stable system-level improvement**. The main problem is not that the direct-to-D fast path is ineffective, but that its coverage is only 15%, and that turn1 seed / session admission / the memory-constrained KV pool introduce many timeouts.
+4. This directly corrects the conclusion of the 179-request KVC-fit smoke: the small sample shows that a KVC-suitable slice exists; the 600-request mixed window shows that the current implementation cannot yet serve a real mixed workload stably.
+
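+## 5. Has KVC improved over pd-colocation/pd-disaggregation?
+
+Only the following qualified conclusions can be drawn so far:
+
+1. **Versus PD-disaggregation: a clear improvement.**
+   On the KVC-fit slice, PD-disaggregation p50 is 6.306s, KVC no-BP p50 1.936s, KVC+BP p50 2.076s; TTFT p50 is 5.336s vs 0.139s/0.154s. The gain comes mainly from turn2+ requests going straight to an existing D session, avoiding a full prefill on P and a P->D KV transfer every turn.
+
+2. **Versus the strong DP cache-aware baseline: an improvement on the KVC-fit slice.**
+   KVC no-BP and KVC+BP both beat DP on overall mean/p50/p90/p99, with paired wins of 153/179 and 152/179 respectively. But this slice is KVC-friendly, all multi-turn, and 100% direct-eligible on turn2+; it does not represent the full Ali workload.
+
+3. **Versus the full workload: not yet shown.**
+   Single-turn sessions are the majority in the full Ali trace, and long contexts and long-tail outputs are common. KVC's gains are diluted by single-turn traffic, and D resident KV capacity and tail stability become stronger constraints.
+
+4. **On the 600-request / 15min mixed window: no stable improvement yet.**
+   KVC's clean E2E mean/p50 show a positive signal, but error_count=53/600 and the paired p90 delta versus DP is worse. By the "E2E + error/truncation" standard this is not a system-level win.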
+
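+## 6. Where the gain comes from
+
+Main gain path:
+
+1. The turn1 seed establishes a session on D.
+2. On turn2+, if the append is small and hash overlap is high, the request goes straight through `kvcache-direct-to-d-session`.
+3. direct-to-D keeps the P worker out of the path and skips the P->D KV transfer.
+4. D only prefills the small append suffix and reuses the existing prefix KV directly.
+
+This yields two observable gains:
+
+- TTFT drops sharply: on the turn2+ direct subset, TTFT mean falls from about 1.04s on DP to about 0.112s.
+- E2E drops: mean E2E on the direct subset is about 2.50s lower.
+
+In addition, KVC's cached_tokens statistic is much higher: KVC mean cached tokens 5,992 vs DP mean 228. This indicates that it really does reuse large stretches of real prefix KV.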
+
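+## 7. Problems encountered and fixes
+
+### Problem 1: the generic sampler is dominated by a single long session
+
+Symptom: the real Ali session distribution has a pronounced long tail, and duration-oriented sampling easily picks unbalanced samples, making strategy comparisons unrepeatable or unrepresentative of multi-session contention.
+
+Fix: added `scripts/prepare_real_ali_samples.py`, which builds a balanced sample under a session cap and a per-session turn cap while preserving the real token/hash/timestamp.
+
+### Problem 2: resampling per strategy makes runs incomparable
+
+Symptom: `benchmark-live` originally resampled according to its parameters, so different strategies could replay different requests.
+
+Fix: added `--use-trace-as-sample`; every strategy copies and replays the same prebuilt sample, which is what makes later paired comparisons meaningful.
+
+### Problem 3: no progress output during long trace replays
+
+Symptom: `request-metrics.jsonl` and the summary are written only after the replay ends, so under real pacing it is hard to tell normal waiting from a hang.
+
+Fix: added a `replay-progress.jsonl` heartbeat that writes submitted/completed/inflight/errors/execution_modes every 30s. It uses only client-local state and does not call `/server_info`.
+
+### Problem 4: `/server_info` polling perturbs the scheduler
+
+Symptom: in earlier profiling, 1Hz polling visibly changed the error count. Continuously polling the pool during a real performance run turns the measurement tool into a source of interference.
+
+Fix: `scripts/sweep_real_ali_kvc.sh` disables pool polling by default. Capacity questions rely on the structural logs and, when needed, a separate profile run that is not mixed into the headline performance run.
+
+### Problem 5: the backpressure smoke never triggered backpressure
+
+Symptom: in the KVC-fit smoke the transfer queue max was only 3, every admission reason was `ok`, and pause_ms was always 0.
+
+Conclusion: this run cannot show that backpressure works, only that direct-to-D works. Validating backpressure needs a dedicated pressure sample with more sessions, larger resident KV, or higher concurrency.
+
+### Problem 6: environment differs from earlier reports
+
+Symptom: earlier docs say H100, while the real environment this round is H20; the model path is also under `/home/admin/cpfs/wjh/models/...`.
+
+Handling: this log records results as H20. Cross-document comparisons should look only at mechanism trends and must not treat H100 and H20 absolute latencies as the same experiment.
+
+### Problem 7: a continuous window can truncate session ancestry
+
+Symptom: cutting a window directly by timestamp can leave requests whose parent turn lies outside the window. For KVC this makes session reuse and the turn chain inconsistent with the real workload.
+
+Handling: the current continuous window is only a candidate to be improved, not a formal headline. A formal window needs to keep warmup ancestors or explicitly preserve the original session chain information.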
+
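+## 8. Current hypotheses if the full workload performs poorly
+
+The cause may not be a small implementation bug, but the combination of the scheme's applicability and resource constraints:
+
+1. **Single-turn dilution**: single-turn sessions are the majority of full Ali sessions; for them the KVC seed is pure cost with no turn2+ reuse.
+2. **Long contexts crowd the D KV pool**: with input p90 51K and p99 113K, the long tail of resident KV limits how many sessions D can hold at once.
+3. **direct is not a free lunch**: turn1 seed, admission probes, and session lifecycle all carry extra cost, which is amortized only when later turns reuse enough.
+4. **D-side capacity and eviction remain the core risk**: the earlier SWE experiments already showed that session pinning + capacity-blind D selection causes starvation; the early multi-turn balanced sample may reproduce it.
+5. **Plain PD-disaggregation is weak**: if KVC frequently falls back to the plain PD path, the whole system is dragged down by P->D transfer and high TTFT.
+6. **Insufficient H20 memory headroom changes KVC's conditions**: default KVC 2P6D OOMs at startup, and `mem_fraction_static` must be lowered to complete the 600-request run; this further shrinks the D KV pool and amplifies session-cap fallbacks and timeouts.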
+
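+## 9. Next validation steps, in order
+
+1. Add a sticky/session-affinity baseline to separate the contribution of "sticking to the same D" from that of "KVC direct bypass".
+2. Add KVC `seed-min-turn-id=2` or no-turn1-seed to check whether the turn1 seed cost is worth it.
+3. Run DP / PD / KVC no-BP / KVC+BP on the early multi-turn balanced sample to validate real multi-turn pressure with large contexts.
+4. Run a small fixed sample at `time-scale=1`, so results do not hold only under compressed replay.
+5. Build a continuous window that includes single-turn sessions and handles missing parent turns inside the window, then report weighted by the full Ali distribution.
+6. Run N>=3 reruns of the final candidate configuration and report variance; N=1 counts only as a smoke.
+7. For the 600-request window, run `seed-min-turn-id=2` first to cut single-turn turn1 seeds; the goal is to bring the 53/600 errors down close to DP's 1/600 before discussing latency.
+   - The first attempt never reached replay (suspected OS OOM at startup). H20 startup GPU/system memory stability must be solved first, or the worker count / model memory footprint reduced.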
+
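+## 10. KVC error root cause and preparation for multi-turn-only validation
+
+The user pointed out that the 179-request run is not enough and asked for at least 15min / 600+ requests; the current formal diagnosis is based on
+`outputs/real-ali-kvc-iter/runs/window900s-600req-ts1-kvc-mem082/kvcache-centric-kv-aware-worker-admission-20260511T093601Z`.
+
+### 10.1 Why KVC has so many errors
+
+This run has 600 requests, and KVC mem0.82 has 53 errors:
+
+- 50 `ReadTimeout`.
+- 2 HTTP 400 on `/open_session`.
+- 1 genuine over-context error: input 310,521 > model context 262,144.
+
+By turn, 50/53 errors are on turn1. By structural admission, the vast majority of failed requests had already been judged `can_admit=true` by D-side admission in
+`structural/admission-events.jsonl`, so this is not simply
+`d-session-cap` or `no-space`. The main failure point is that a turn1 seed, after entering the KVC seeded path, times out during
+P/D streaming session bootstrap, P->D transfer, or router streaming; and because the mixed real window contains many single-turn sessions,
+these turn1 seeds bring no later reuse benefit for most sessions.
+
+Conclusion: the main cause of the current KVC errors is **too many turn1 seeds for single-turn sessions or sessions not known to be multi-turn**, which pushes many new sessions into
+the KVC control plane and the seeded router path, increasing timeouts and leftover session lifecycle state. The direct-to-D fast path itself is not at fault.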
+
+### 10.2 Fixes and ablation switches added
+
+Code and script fixes:
+
+- `scripts/sweep_real_ali_kvc.sh` adds `KVC_SEED_ONLY_MULTITURN=1`, which passes
+  `--kvcache-seed-only-multiturn-sessions`. This is an oracle ablation to test whether seeding only sessions that will have later turns removes the turn1 seed errors.
+- `src/agentic_pd_hybrid/replay.py` adds a single close+retry on `/open_session` 400 and writes
+  `structural/session-lifecycle.jsonl`. This is a lifecycle robustness fix aimed at the "already exists" 400 caused by
+  server-side sessions left behind after a timeout; it does not change the routing policy.
+- `scripts/prepare_real_ali_samples.py` adds `--window-min-turns` and `--window-output-name` for generating reproducible multi-turn-only window samples.
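
The close+retry behaviour described for `/open_session` can be illustrated with a small sketch. It is not the code in `replay.py`: `SessionAlreadyExists`, `open_session_with_retry`, the injected `open_fn` / `close_fn` callables, and the event names are all hypothetical stand-ins, chosen so the example runs without a server.

```python
class SessionAlreadyExists(Exception):
    """Stand-in for an HTTP 400 'already exists' reply from open_session."""


def open_session_with_retry(open_fn, close_fn, session_id, lifecycle_log):
    """Open a session; on an 'already exists' failure, close it and retry once."""
    try:
        result = open_fn(session_id)
        lifecycle_log.append({"session": session_id, "event": "open_ok"})
        return result
    except SessionAlreadyExists:
        # A stale server-side session (for example left behind by a timeout)
        # blocks the open, so drop it and try exactly one more time.
        lifecycle_log.append({"session": session_id, "event": "open_400_close_retry"})
        close_fn(session_id)
        result = open_fn(session_id)  # a second failure propagates to the caller
        lifecycle_log.append({"session": session_id, "event": "open_ok_after_retry"})
        return result
```

Retrying only once keeps the fix a lifecycle cleanup rather than a hidden retry loop, and logging each step is what lets absorbed 400s be counted separately from error rows.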
+
+Validation:
+
+- `uv run python -m py_compile scripts/prepare_real_ali_samples.py src/agentic_pd_hybrid/replay.py src/agentic_pd_hybrid/benchmark.py src/agentic_pd_hybrid/cli.py`
+- `bash -n scripts/sweep_real_ali_kvc.sh`
+
+### 10.3 Generated multi-turn-only sample
+
+Sample path:
+
+`outputs/real-ali-kvc-iter/samples-window-900s-600req-multiturn/ali-window-multiturn.jsonl`
+
+Generation command:
+
+```bash
+uv run python scripts/prepare_real_ali_samples.py \
+  --trace /home/admin/cpfs/wjh/ali-trace/trace-qwen3-coder-formatted/041715-041717.jsonl \
+  --output-root outputs/real-ali-kvc-iter/samples-window-900s-600req-multiturn \
+  --window-duration-s 900 \
+  --window-target-requests 600 \
+  --window-buckets 15 \
+  --window-min-turns 2 \
+  --window-output-name ali-window-multiturn.jsonl \
+  --profiles representative-mt \
+  --max-sessions 64 \
+  --max-turns-per-session 12
+```
+
+Sample profile:
+
+- 626 requests, 107 sessions, all 107 multi-turn.
+- sampled duration 889.341s.
+- turn2+ = 519.
+- direct-eligible turn2+ = 473 / 519 = 91.1%.
+- missing parent = 0.
+- input p50/p90/p99 = 26,846 / 91,596 / 123,898 tokens.
+
+This case is a "multi-turn pressure slice with single-turn filtered out". It cannot replace the full mixed workload, but it can answer one question:
+if the workload really is dominated by multi-turn coding agent sessions, do KVC's direct-to-D coverage and stability approach the microbenchmark?
+
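+### 10.4 Blocked on GPU resources
+
+At the time of this entry, all 8 GPUs are occupied by another group of `vllm serve` processes, about 82GB / 98GB each, on ports 51000-51007.
+They are not SGLang/benchmark processes from this repo, so no new performance run was started, to avoid mistaking a resource conflict for a KVC policy failure.
+
+Once the GPUs are free, run these two groups first: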
+
+```bash
+# Mixed real window: check whether seed-only-multiturn brings the 53/600 errors down
+TRACE=outputs/real-ali-kvc-iter/samples-window-900s-600req/ali-window.jsonl \
+OUT_ROOT=outputs/real-ali-kvc-iter/runs/window900s-600req-ts1-kvc-seedonly-mt-mem082 \
+RUNS="kvc" \
+TIME_SCALE=1 \
+CONCURRENCY=32 \
+REQUEST_TIMEOUT_S=600 \
+STACK_TIMEOUT_S=1800 \
+EXTRA_SERVER_ARGS="--mem-fraction-static 0.82" \
+KVC_SEED_ONLY_MULTITURN=1 \
+bash scripts/sweep_real_ali_kvc.sh
+
+# Multi-turn-only workload: DP vs KVC, to check whether the filtered workload reproduces the microbenchmark gain
+TRACE=outputs/real-ali-kvc-iter/samples-window-900s-600req-multiturn/ali-window-multiturn.jsonl \
+OUT_ROOT=outputs/real-ali-kvc-iter/runs/window900s-600req-multiturn-ts1-mem082 \
+RUNS="dp kvc" \
+TIME_SCALE=1 \
+CONCURRENCY=32 \
+REQUEST_TIMEOUT_S=600 \
+STACK_TIMEOUT_S=1800 \
+EXTRA_SERVER_ARGS="--mem-fraction-static 0.82" \
+KVC_SEED_ONLY_MULTITURN=1 \
+bash scripts/sweep_real_ali_kvc.sh
+```
+
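+### 10.5 Multi-turn-only launch attempt blocked by GPU occupancy
+
+The user asked to launch the multi-turn-only `pd-disaggregation` vs `kvcache-centric` comparison. A pre-launch check found all 8 GPUs occupied by external
+`vllm serve` processes, about 84GB / 98GB each, on ports 51000-51007. These processes do not belong to this repo's SGLang/benchmark runs.
+
+SGLang was therefore not force-started. The remaining GPU memory is not enough to start 2P6D or the 8-worker baseline; forcing a run would only produce initialization OOMs or unstable timeouts,
+which cannot be used to judge whether KVC pd-hybrid beats pd-disaggregation.
+
+Multi-turn-only comparison command to run once resources are free: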
+
+```bash
+TRACE=outputs/real-ali-kvc-iter/samples-window-900s-600req-multiturn/ali-window-multiturn.jsonl \
+OUT_ROOT=outputs/real-ali-kvc-iter/runs/window900s-600req-multiturn-ts1-pd-vs-kvc-mem082 \
+RUNS="pd kvc" \
+TIME_SCALE=1 \
+CONCURRENCY=32 \
+REQUEST_TIMEOUT_S=600 \
+STACK_TIMEOUT_S=1800 \
+EXTRA_SERVER_ARGS="--mem-fraction-static 0.82" \
+KVC_SEED_ONLY_MULTITURN=1 \
+bash scripts/sweep_real_ali_kvc.sh
+```
+
+### 10.6 Multi-turn-only PD vs KVC formal results
+
+After resources were freed, the multi-turn-only comparison was launched and completed. Command:
+
+```bash
+TRACE=outputs/real-ali-kvc-iter/samples-window-900s-600req-multiturn/ali-window-multiturn.jsonl \
+OUT_ROOT=outputs/real-ali-kvc-iter/runs/window900s-600req-multiturn-ts1-pd-vs-kvc-mem082 \
+RUNS="pd kvc" \
+TIME_SCALE=1 \
+CONCURRENCY=32 \
+REQUEST_TIMEOUT_S=600 \
+STACK_TIMEOUT_S=1800 \
+EXTRA_SERVER_ARGS="--mem-fraction-static 0.82" \
+KVC_SEED_ONLY_MULTITURN=1 \
+bash scripts/sweep_real_ali_kvc.sh
+```
+
+Run directories:
+
+- PD: `outputs/real-ali-kvc-iter/runs/window900s-600req-multiturn-ts1-pd-vs-kvc-mem082/pd-disaggregation-kv-aware-20260512T030433Z`
+- KVC: `outputs/real-ali-kvc-iter/runs/window900s-600req-multiturn-ts1-pd-vs-kvc-mem082/kvcache-centric-kv-aware-worker-admission-20260512T040444Z`
+
+The sample is still 626 requests, 107 sessions, 889.341s, all multi-turn sessions.
+
+| Strategy | Requests | Errors | Trunc | E2E mean | E2E p50 | E2E p90 | E2E p99 | TTFT mean | TTFT p50 | TTFT p90 |
+|---|---:|---:|---:|---:|---:|---:|---:|---:|---:|---:|
+| PD-disaggregation 2P6D | 626 | 0 | 0 | 97.013s | 70.243s | 214.309s | 308.406s | 94.506s | 69.048s | 212.528s |
+| KVC 2P6D worker admission, no BP, seed-only-multiturn | 626 | 39 | 0 | 43.362s | 8.239s | 135.289s | 236.475s | 40.578s | 1.442s | 132.233s |
+
+The paired comparison is computed only on the 587 requests where KVC succeeded and PD also has a latency:
+
+- PD same-request E2E mean/p50/p90/p99: 97.457s / 70.514s / 214.095s / 309.362s.
+- KVC same-request E2E mean/p50/p90/p99: 43.362s / 8.239s / 135.930s / 237.283s.
+- mean E2E reduction: 55.5%.
+- absolute mean improvement: 54.095s.
+- wins/losses: 472 / 115.
+
+Breakdown by KVC execution mode:
+
+| KVC mode | Count | KVC mean | PD same mean | Reduction |
+|---|---:|---:|---:|---:|
+| `kvcache-direct-to-d-session` | 286 | 2.255s | 92.944s | 97.6% |
+| `pd-router-fallback-large-append-session-cap` | 169 | 88.869s | 113.614s | 21.8% |
+| `pd-router-d-session-reseed` | 25 | 143.456s | 106.501s | -34.7% |
+| `pd-router-large-append-reseed` | 19 | 47.631s | 88.981s | 46.5% |
+| `pd-router-turn1-seed` | 78 | 55.974s | 73.050s | 23.4% |
+
+Breakdown by turn depth:
+
+- turn2+: 504 successful paired requests, KVC mean 40.791s vs PD mean 101.055s, reduction 59.6%.
+- turn>=5: 299 successful paired requests, KVC mean 34.121s vs PD mean 104.697s, reduction 67.4%.
+- turn>=10: 161 successful paired requests, KVC mean 39.027s vs PD mean 86.548s, reduction 54.9%.
+
+KVC execution modes:
+
+- `kvcache-direct-to-d-session`: 286.
+- `pd-router-fallback-large-append-session-cap`: 169.
+- `pd-router-turn1-seed`: 78.
+- `pd-router-d-session-reseed`: 25.
+- `pd-router-large-append-reseed`: 19.
+- `pd-router-fallback-no-d-capacity`: 4.
+- `pd-router-turn1-d-backpressure`: 5.
+- `pd-router-d-session-reseed-after-eviction`: 1.
+- error rows: 39, recorded as `kvcache-centric`.
+
+The source of KVC's gain is clear: for the 286 direct-to-D requests, the same-request mean drops from PD's 92.944s to 2.255s, essentially reproducing the core mechanism gain of the microbenchmark. These requests skip the P worker and the P->D KV transfer and only process the append suffix on an existing D session. Overall actual KV transfer blocks fall from 4436 (PD same-success) to 3827 (KVC success); in the summary accounting, KVC's total actual KV transfer blocks is 3827, below PD's 5276.
+
+But this run still cannot support a "stable production-grade win" conclusion:
+
+1. KVC still has 39/626 errors, an error rate of 6.23%, versus 0 for PD.
+2. All 39 errors are client-side `ReadTimeout`, not server-side OOM/Traceback; no matching crash keywords were found in the server logs.
+3. Error distribution: 24 on turn1, 15 on turn2+; by decode node: decode-0 15, decode-1 9, decode-3 7, decode-4 5, decode-5 3.
+4. 8 `/open_session` 400s were absorbed by close+retry and written to `structural/session-lifecycle.jsonl`; none became an HTTP 400 error row.
+5. The long-tail drain is pronounced: PD finished in about 60min and KVC in about 40min, both far beyond the 889s trace duration. At 900s KVC had completed 490/626 while PD had completed only 283/626, so KVC's mid-run throughput is better, but the last few dozen large-append fallbacks still drag out the tail.
+6. direct-to-D coverage is 286/626 = 45.7% of all requests (286/519 = 55.1% of turn2+ requests), below the sample's static direct-eligible turn2+ ratio of 91.1%. The gap comes mainly from D session/residency capacity, the large-append session cap, and reseed/fallback.
+
+Current assessment:
+
+- Looking only at successful paired requests, KVC already shows a strong E2E improvement over PD-disaggregation on the multi-turn-only workload, mainly from direct-to-D session reuse.
+- Judged by system reliability, the current implementation does not pass: a 6.23% timeout rate undermines any "stable system" conclusion.
+- The main reason for the gap between the real workload and the microbenchmark is not that the KVC fast path is ineffective, but insufficient fast-path coverage, D-side resident KV/session admission pressure, large-append fallback, and the timeout stability of the seeded/reseed paths.
--- a/docs/REFACTOR_PLAN_ZH.md
+++ b/docs/REFACTOR_PLAN_ZH.md
@@ -0,0 +1,123 @@
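+# Refactor Plan v0: minimal version
+
+**Date**: 2026-05-06
+**Goal**: with minimal changes and lightweight experiments, verify whether the structural defects raised in `docs/AGENTIC_FIT_ANALYSIS_ZH.md` are real and how large their impact is.
+**Budget**: 8h of GPU time (about 4-6 smoke runs of ~30-60 min each).
+**KISS boundary**: do not touch the main-loop structure of SGLang `scheduler.py`; no new mooncake protocol; no cross-D session migration; no admission probe/commit split; no change to the LRU eviction policy.
+
+## Plan conclusions (confirmed with the user)
+
+Re-reviewing plan-v0 showed that neither of the two original Phase 1 changes **is a bug**:
+
+- `_estimate_session_resident_tokens` returning the full prompt is by design: every call site that needs the "increment" already subtracts `target - current` (`replay.py:1247-1254`, `:1393-1394`, `:1490-1491`).
+- `decode_resident_blocks` never shrinking only wastes a few MB of memory and **does not affect routing decisions** (the SWE trace's hash_ids are session-unique, so the policy still picks the right D).
+
+The final minimal version makes a single code change (**add backpressure**) plus extensive instrumentation.
+
+## The only code change: backpressure signal
+
+### Change 1: add two fields to the SGLang `admit_direct_append` response
+
+Files: `third_party/sglang/python/sglang/srt/managers/io_struct.py`, `scheduler.py`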
+
+```python
+@dataclass
+class DirectAppendAdmissionReqOutput:
+    ...                         # existing fields unchanged
+    recommended_pause_ms: int = 0   # new
+    queue_depth: int = 0            # new
+```
+
+`scheduler.py:admit_direct_append` computes the hint at the end:
+
+```python
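+def _compute_backpressure_pause_hint(self) -> int:
+    # Pause hint in milliseconds; an int to match the `recommended_pause_ms` field.
+    depth = len(self.disagg_decode_transfer_queue.queue)
+    if depth < 8:
+        return 0
+    return min(2000, depth * 100)   # simple linear ramp, capped at 2s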
+```
+
+### Change 2: replay backs off according to the hint
+
+File: `src/agentic_pd_hybrid/replay.py`
+
+- `DecodeResidencyState` gains `pause_until_s: dict[str, float]`
+- `_query_decode_direct_admission` parses `recommended_pause_ms` from the response and sets `pause_until_s[server_url] = now + pause_ms / 1000`
+- Before calling `_invoke_router` / `_invoke_decode_session_direct`, check `pause_until_s[decode_url]`; if `now < pause_until`, sleep until that time
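
The replay-side backoff listed above could look roughly like the following sketch. It is an illustration under assumptions, not the planned `replay.py` code: the class name `DecodePauseTracker`, the injectable `clock` and `sleep` arguments, and the `max_pause_s` cap are hypothetical, and a real async replay loop would presumably await a non-blocking sleep instead of calling `time.sleep`.

```python
import time


class DecodePauseTracker:
    """Track per-decode-worker pause deadlines derived from admission hints."""

    def __init__(self, max_pause_s=2.0, clock=time.monotonic):
        self.pause_until_s = {}  # decode_url -> deadline on the clock's timeline
        self.max_pause_s = max_pause_s
        self._clock = clock

    def record_hint(self, decode_url, recommended_pause_ms):
        """Store a deadline from a `recommended_pause_ms` hint (0 means no pause)."""
        if recommended_pause_ms <= 0:
            return
        pause_s = min(recommended_pause_ms / 1000.0, self.max_pause_s)
        deadline = self._clock() + pause_s
        # Never shorten a pause that is already in effect.
        current = self.pause_until_s.get(decode_url, 0.0)
        self.pause_until_s[decode_url] = max(deadline, current)

    def remaining_s(self, decode_url):
        """Seconds left before requests to this decode worker may proceed."""
        return max(0.0, self.pause_until_s.get(decode_url, 0.0) - self._clock())

    def wait(self, decode_url, sleep=time.sleep):
        """Sleep out any remaining pause; returns the seconds slept."""
        remaining = self.remaining_s(decode_url)
        if remaining > 0:
            sleep(remaining)
        return remaining
```

Keying the deadline by decode URL means only traffic to the overloaded worker is slowed, and capping each pause bounds how much latency a single hint can add.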
+
+### Change 3: new CLI flag
+
+`src/agentic_pd_hybrid/cli.py`, `benchmark.py`:
+
+```
+--enable-backpressure   # default false, keeps baseline behavior
+```
+
+### Change 4: observability logs
+
+Each run dir gains three jsonl files:
+
+- `admission-events.jsonl`: one row per admission RPC (timestamp, session, D, can_admit, queue_depth, pause_ms, latency_s, available_tokens, evicted_session_count)
+- `backpressure-events.jsonl`: one row per actual sleep (timestamp, D, sleep_ms, queue_depth_at_signal)
+- `session-d-binding.jsonl`: one row the first time a session is opened on a given D (timestamp, session, D, turn_id)
+
+## Experiment matrix (within the 8h budget)
+
+Ordered as "anchor first, then single-variable comparisons". The last column is the estimated machine time.
+
+| ID | Config | Purpose | Machine time |
+|---|---|---|---|
+| **E0 (existing)** | v5 baseline, time-scale=10, no backpressure | Anchor, already exists at `outputs/qwen3-30b-tp1-v5-optD-baseline-rerun/run1` | 0 |
+| **E1** | v5 + backpressure ON, time-scale=10, full trace | Validate Claim §3 (can backpressure eliminate the KVTransferError avalanche?) | ~50 min |
+| **E2** | v5 baseline, time-scale=1, **short trace** (first 12 sessions ≈ 1000 reqs) | Validate Claim §7 (time-scale=10 distortion); backpressure off | ~60 min |
+| **E3** | 8DP CA, time-scale=1, same trace as E2 | Control for E2: does KVC still lose to DP under real timing? | ~60 min |
+| **E4** | v5 + backpressure, time-scale=1, same trace as E2 | Is backpressure still useful under real timing? | ~60 min |
+| **E5** (optional) | v5 baseline, time-scale=10, **concurrency=4**, full trace | Validate Claim §1 (is high concurrency a necessary condition?) | ~50 min |
+
+Total: 4-5 runs, ~3-5h. The remaining budget goes to reruns of failures and analysis.
+
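+## Experiment goals: mapped back to §1-§7 one by one
+
+| Doc § | Claim | Experiment that refutes/supports it | Metrics needed |
+|---|---|---|---|
+| §1 | Permanent session pinning + capacity-blind selection causes bimodality | Existing E0 data is enough | direct-to-D rate per session distribution |
+| §2 | LRU cannot keep up with pressure | Existing E0 logs are enough + E1 to see how the trim/error ratio changes with backpressure | trim event count vs OOM count |
+| §3 | Missing backpressure is the source of the avalanche | E0 vs E1 | KVTransferError count, P99 latency |
+| §4 | admission RPC interferes with the scheduler | Out of scope this round (needs the admission probe split to test, not done) | – |
+| §5 | P-side is unaware of D health | Existing E0 logs are enough (prefill-0 vs prefill-1 error counts) | per-P KVTransferError |
+| §6 | (withdrawn) | – | – |
+| §7 | time-scale=10 distortion | E0 vs E2 (same KVC, different time-scale); E2 vs E3 (same time-scale, KVC vs DP) | latency distribution, direct-to-D rate |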
+
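+## Final experiment report deliverable
+
+After the runs, produce `docs/STRUCTURAL_VALIDATION_REPORT_ZH.md`, giving for each of §1-§7:
+
+- **Claim as stated**
+- **Data evidence** (which exp, which metric)
+- **Conclusion**: holds / partially holds / refuted
+- **Impact quantification**: numeric differences
+- **Uncertainty**: N=1 risk, other confounders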
+
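+## What we will not do (KISS boundary)
+
+| Tempting but skipped | Reason |
+|---|---|
+| N=3 repeats | Does not fit in 8h; a single run shows the general direction |
+| Full parameter sweep | Only time-scale and the backpressure boolean are varied |
+| Change LRU eviction | Out of scope this round |
+| Cross-D migration | Out of scope this round |
+| Admission probe/commit split | Out of scope this round |
+| P-side D-health routing | Out of scope this round |
+| Fix the two "non-bugs" (estimate / aging) | Verified not to be real bugs |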
+
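+## Expected failure paths
+
+- **GPU resources are tight**: compress the smoke trace further (first 8 sessions / 600 reqs)
+- **time-scale=1 runs over 1.5h**: truncate to what can complete within 600s
+- **backpressure misconfigured**: start with the simple linear sleep_ms = depth * 100; if it cannot be tuned, roll back to 0 (no backpressure)
+- **SGLang patch fails to build**: all patches touch only a few lines in io_struct.py and scheduler.py and can be reverted individually with git restore
+
+---
+
+Next: implement → run smoke → write the report.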
--- a/docs/STRUCTURAL_VALIDATION_REPORT_ZH.md
+++ b/docs/STRUCTURAL_VALIDATION_REPORT_ZH.md
@@ -0,0 +1,304 @@
+# Structural Defect Validation Report
+
+**Date**: 2026-05-06
+**Reference data sources**:
+- `outputs/qwen3-30b-tp1-v5-optD-baseline-rerun/` (v5 KVC kv-aware Option D, 2P6D, **3 reruns of the same config**)
+- `outputs/qwen3-30b-tp1-exps/exp1_8way_dp_cache_aware_summary.json` (8DP CA on the same trace)
+- `outputs/qwen3-30b-tp1-v5-optD-baseline-rerun/.../logs/decode-{0..5}.log`, `prefill-{0,1}.log`
+**Model**: Qwen3-30B-A3B (TP1), single node with 8×H100 80GB, trace `qwen35-swebench-50sess.jsonl` (4449 reqs / 52 sessions).
+**Scope**: verify whether the structural claims in `docs/AGENTIC_FIT_ANALYSIS_ZH.md` §1-§7 are real, and quantify their impact.
+
+> ⚠️ **Environment limitation**: there was no GPU access this round, so no new sweep was run. All data comes from the existing v5 reruns + the 8DP baseline. The backpressure code is implemented but **not validated end to end**; it is marked below as "expected benefit (pending GPU smoke)".
+
+---
+
+## 0. Validity anchor: N=1 is not trustworthy
+
+Error drift across 3 v5 baseline EXP2 runs (**identical configuration**):
+
+| Run | Errors | Lat P50 | Lat P90 | TTFT P50 |
+|---|---:|---:|---:|---:|
+| run1 | **372** | 1.11s | 8.65s | 0.147s |
+| run2 | **912** | 0.94s | 7.68s | 0.071s |
+| run3 | **396** | 1.22s | 8.43s | 0.183s |
+
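+Errors drift by **2.5×** (372 → 912) and P50 latency by **30%**. **No N=1 comparison with a difference below 30% is trustworthy.** All later "same trace, different config / different code" comparisons need N≥3 to be meaningful.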
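
As a quick arithmetic check, the drift figures can be recomputed from the three rows of the table. The snippet below is only a worked calculation on those published numbers; the helper name `spread` is introduced here for illustration.

```python
# Values copied from the three identical-configuration reruns above.
errors = {"run1": 372, "run2": 912, "run3": 396}
p50_s = {"run1": 1.11, "run2": 0.94, "run3": 1.22}


def spread(values):
    """Return (max/min ratio, relative drift from min to max)."""
    lo, hi = min(values), max(values)
    return hi / lo, (hi - lo) / lo


err_ratio, _ = spread(errors.values())
_, p50_rel = spread(p50_s.values())
print(f"errors max/min = {err_ratio:.2f}x")
print(f"P50 relative drift = {p50_rel:.1%}")
```

The ratio is taken against the minimum, so it describes the worst-case gap between two runs of the same configuration, which is the gap an N=1 comparison could mistake for a real effect.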
+
+**For the KVC vs DP headline, the best of the 3 KVC runs (P50=0.94s) is still 1.45× DP (P50=0.65s).** 8-way DP's advantage is far outside single-run variance, so the headline conclusion is unaffected by variance.
+
+---
+
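+## §1. Session permanently pinned to D + capacity-blind selection → extreme bimodality ✅ Fully holds
+
+### Claim
+KvAwarePolicy scores mainly on hash overlap and has no D capacity term. Once a session first lands on a D, it is pinned there permanently. As a result, large sessions are repeatedly rejected by admission on an already full D, while small sessions go direct-to-D 100% of the time on their original D.
+
+### Data
+
+**(a) Sessions are permanently bound, consistently across 3 reruns**: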
+
+```
+run1: 52 sessions, avg distinct-D-per-session = 1.00
+run2: 52 sessions, avg distinct-D-per-session = 1.00
+run3: 52 sessions, avg distinct-D-per-session = 1.00
+```
+
+Each session visits only **1** D worker over the whole run, identical across 3 independent runs. **This is structure, not coincidence.**
+
+**(b) Direct-to-D hit rate is extremely bimodal**:
+
+| Direct-to-D rate | run1 | run2 | run3 |
+|---|---:|---:|---:|
+| 0-20% (starved) | 15 | 18 | 16 |
+| 20-40% | 7 | 6 | 7 |
+| 40-60% | 11 | 7 | 9 |
+| 60-80% | 5 | 6 | 4 |
+| 80-100% (lucky) | 14 | 15 | 16 |
+
+The two end buckets hold the most sessions in every run; the middle buckets are comparatively sparse.
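
A histogram like the one above could be derived from per-request execution modes roughly as follows. This is a sketch under assumptions: it treats a request as direct-to-D when its mode equals `kvcache-direct-to-d-session`, uses half-open 20% bins with 100% counted in the last bin, and the function name and input layout are hypothetical. The report does not state how bin edges were handled.

```python
def direct_rate_histogram(per_session_modes,
                          direct_mode="kvcache-direct-to-d-session",
                          n_bins=5):
    """Count sessions per direct-to-D rate bin.

    per_session_modes: dict mapping session_id -> list of execution modes,
    one entry per request in that session.
    Returns a list of n_bins counts, from lowest rate to highest.
    """
    counts = [0] * n_bins
    for modes in per_session_modes.values():
        if not modes:
            continue  # a session with no requests has no defined rate
        rate = sum(m == direct_mode for m in modes) / len(modes)
        # A rate of exactly 1.0 belongs to the top bin rather than overflowing.
        idx = min(int(rate * n_bins), n_bins - 1)
        counts[idx] += 1
    return counts
```

Computing the rate per session before binning is what exposes bimodality; a single global direct-to-D rate would average the starved and lucky sessions together.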
+
+**(c) Sessions starved in all 3 runs correlate strongly with session size**:
+
+```
+13 sessions starved (<20% direct-to-D) in ALL 3 runs.
+  avg peak input of consistently-starved sessions: 62043 tokens
+  avg peak input of consistently-lucky sessions:   31344 tokens
+  ratio: 1.98× — starved sessions are exactly 2× larger.
+```
+
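+**13/52 = 25% of sessions are starved in all 3 independent runs, and their average peak input is about 2× that of the lucky sessions.** This rules out the "luck" hypothesis and confirms a structural failure of large sessions on capacity-overloaded D workers.
+
+### Impact quantification
+- 25% of sessions take the fallback path on nearly every turn, which is about **100× slower on TTFT and 6× slower on E2E** than direct-to-D (data point: fallback path mean lat ~3.5s vs direct ~0.5s)
+- For these sessions the user experience is "systematically bad", not "occasionally slow"
+- **From an SLO perspective, P99 is driven entirely by these 13 sessions**
+
+### Conclusion
+**Fully holds**. Fix direction (not this round): add a capacity penalty to the policy score + allow cross-D session migration, or introduce hot-session retract on D.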
+
+---
+
+## §2. D-side LRU only evicts idle sessions → cannot keep up with pressure ✅ Fully holds
+
+### Claim
+`evict_idle_streaming_sessions_lru` at `scheduler.py:2040` can only evict sessions whose requests have all finished and that are in streaming mode. Under high concurrency hot sessions are never idle, so LRU finds nothing to evict. D then fills to 100% and runs into the mooncake transfer timeout.
+
+### Data (v5 baseline rerun run1)
+
+| D worker | Trim events | KVTransferError | Peak token_usage |
+|---|---:|---:|---:|
+| decode-0 | 9 | 0 | 0.99 |
+| decode-1 | 43 | 4 | 0.99 |
+| decode-2 | 16 | 153 | 0.97 |
+| decode-3 | 37 | 29 | 0.99 |
+| decode-4 | 28 | 90 | **1.00** |
+| decode-5 | 30 | 93 | **1.00** |
+
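+**All 6 D workers peak at ≥ 0.97**, and 2 reach 1.00 (KV pool fully exhausted). **LRU fires 9-43 times per worker, far fewer than the 90-153 transfer errors on the worst workers.**
+
+decode-2 is the extreme: 16 trims vs 153 errors, so LRU lags errors by **9.5×**.
+
+### Impact quantification
+- 369 KVTransferError in a single run (sum over the 6 D workers)
+- This corresponds to a request failure rate of ~8% (369/4449 ≈ 8.3%); averaging v5 runs with 9/372/912 errors gives ~430/4449 = 9.7%
+- **Each mooncake timeout is 32s**, which directly adds a tail of tens of seconds to P99 latency
+
+### Conclusion
+**Fully holds**. Fix direction (not this round): tiered eviction, adding cold-session retract beyond idle eviction and weighting by access frequency/recency. Backpressure (this round's code) only turns the "D full" avalanche from "timeout errors" into "deliberate waiting"; **it does not actually solve the capacity problem**.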
+
+---
+
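+## §3. No D→Replay backpressure channel ✅ Holds (fix implemented)
+
+### Claim
+The D-side transfer queue piles up → 32s timeout → KVTransferError, with no "D is overloaded, slow down" signal flowing back to replay; concurrency stays at 32 throughout.
+
+### Data
+- All 369 KVTransferError from §2 are 32s mooncake timeouts (logged as either `Failed to send kv chunk` or `Decode instance could be dead`)
+- Errors concentrate in the second half of the run (per the existing `KVC_DEBUG_JOURNEY_V1_TO_V5.md` §v4: errors only start accumulating after 44.8% of the run)
+- This indicates that **requests succeed while D has spare capacity, and failures cluster once the capacity limit is reached**, which is typical of a system without backpressure
+
+### Fix (implemented this round, pending GPU smoke validation)
+
+Code changes:
+1. `third_party/sglang/python/sglang/srt/managers/io_struct.py`: `DirectAppendAdmissionReqOutput` gains a `recommended_pause_ms` field
+2. `third_party/sglang/python/sglang/srt/managers/scheduler.py:admit_direct_append`: computes the hint from `transfer_queue_depth`, `retracted_queue_depth`, and `token_usage_after`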
+   ```python
+   def _compute_backpressure_pause_hint(...):
+       if retracted_queue_depth > 0: return 1500
+       if token_usage_after >= 0.90: return max(200, min(2000, overshoot * 5))
+       if transfer_queue_depth >= 8: return min(2000, transfer_queue_depth * 100)
+       return 0
+   ```
+3. `src/agentic_pd_hybrid/replay.py`:
+   - `DecodeResidencyState.pause_until_s: dict[str, float]`
+   - `_query_decode_direct_admission` parses the hint and updates `pause_until_s`
+   - New `_wait_for_decode_pause`, checked on entry to `_invoke_router` / `_invoke_session_direct`
+4. CLI flags: `--enable-backpressure`, `--backpressure-max-pause-s 2.0` (off by default)
+5. Structural logs: `structural/admission-events.jsonl`, `backpressure-events.jsonl`, `session-d-binding.jsonl`
+
+### Expected benefit (pending GPU smoke E2 vs E1)
+- KVTransferError should fall from ~370 / 4449 to < 50 / 4449
+- P99 should improve (the 32s timeout tail disappears)
+- Overall latency mean may **rise slightly** (forced pauses), but P99 should drop substantially
+- backpressure-events.jsonl should show many pause events accumulating on D-4 / D-5 (consistent with the §2 data)
+
+### Conclusion
+**Claim holds; the fix is implemented and awaits smoke validation**. Note: backpressure is a **degradation** mechanism, not a performance optimization. It trades "hard errors" for "deliberate waiting"; overall throughput will not improve because of it.
+
+---
+
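+## §4. Admission RPC coupled to the scheduler main loop ⚠️ Indirect evidence, not directly validated this round
+
+### Claim
+`admit_direct_append` enters the scheduler main loop and walks the session slots; at an admission RPC rate of 16+/s it competes with decode for scheduling.
+
+### Existing indirect evidence
+- `docs/V5_PROFILE_INVESTIGATION_ZH.md`: merely adding 1Hz `/server_info` polling raised EXP2 errors from 9 to 415 (46×); but the three v6 P0 baselines without polling also produced 372/912/396, so **polling is not the only cause and the main loop is sensitive to load in itself**.
+
+### Not done this round
+- No controlled experiment with the admission probe split into fast/slow paths. It needs fairly deep SGLang changes (a lock-free snapshot) and is outside the KISS boundary.
+
+### Conclusion
+**Claim holds indirectly; not directly validated this round**. In the backpressure implementation the admission RPC frequency is unchanged (still once per turn); only the result can trigger a sleep. If this claim holds, then with backpressure the number of admission RPCs stays roughly the same while `pause_ms` in each response becomes non-zero. **The new admission-events.jsonl can be used to check this directly after the GPU smoke.**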
+
+---
+
+## §5. P-side round-robin is unaware of D health ✅ Holds
+
+### Claim
+`pd_router.py:_select_decode_index` is bare round-robin. When one P hits a hot D it fails repeatedly, while the other P is completely unaffected.
+
+### Data (v5 baseline rerun run1)
+
+| Worker | KVTransferError | "Decode could be dead" |
+|---|---:|---:|
+| prefill-0 | **367** | 361 |
+| prefill-1 | **2** | 0 |
+
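+Per the summary, prefill-0 handled 2225 requests vs 2224 for prefill-1: **request volume is split almost evenly, yet error counts differ by 180×**.
+
+### Impact quantification
+- Failed requests concentrate on the link from P-0 to one hot D (`to 10.45.80.47:XXXXX` appears repeatedly in the logs)
+- A single P's "dead link" contributes **99%** of all KVTransferError
+- If P selection could avoid links that are "stuck fighting a hot D", **the avalanche from a single-P failure could in theory be eliminated**
+
+### Note
+- This has **not been cross-checked across the 3 v6 P0 reruns**; only run1's logs are readable. It needs to be reconfirmed on the prefill-{0,1}.log of a new sweep to avoid the N=1 concern.
+
+### Conclusion
+**Holds on single-run data; multi-run consistency not verified**. Fix direction (not this round): when choosing P, the router should consider (P's current inflight transfer count, target D health).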
+
+---
+
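+## §6. (Withdrawn) Replay-side session footprint estimate inflation
+
+Withdrawn after a closer reading of the code while writing the plan. `_estimate_session_resident_tokens` returns the full prompt, but every call site that needs the increment (`replay.py:1247-1254`, `:1393-1394`, `:1490-1491`) already subtracts `target - current`. **Not a bug.**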
+
+---
+
+## §7. time-scale=10 compresses inter-turn gaps to 1/10 ✅ Fully holds
+
+### Data
+
+```
+Original trace inter-turn gap (n=4397):
+  p10=1.6s   p50=2.5s   p90=7.8s   p99=25.1s   max=261s
+
+Actual replay gap at time-scale=10:
+  p10=0.16s  p50=0.25s  p90=0.78s  p99=2.5s    max=26s
+```
+
+A real agentic user/agent pauses 2-8 seconds between turns (thinking, typing, tool calls, agent reasoning). time-scale=10 squeezes these windows to 0.16-0.78 seconds, which **artificially removes D's natural idle time**. That idle time is exactly what KVC wants to exploit: while a session is briefly idle, LRU can evict it and other sessions can be admitted.
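
The compression is plain division of each gap by the time-scale factor. A minimal sketch, using the percentiles reported above; the function name `replay_gaps` is made up for illustration and is not part of the replay code:

```python
def replay_gaps(original_gaps_s, time_scale):
    """Divide each inter-turn gap (seconds) by the replay time-scale factor."""
    if time_scale <= 0:
        raise ValueError("time_scale must be positive")
    return {name: gap / time_scale for name, gap in original_gaps_s.items()}


# Percentiles of the original trace as reported in this section (seconds).
original = {"p10": 1.6, "p50": 2.5, "p90": 7.8, "p99": 25.1, "max": 261.0}

for scale in (10, 1):
    scaled = replay_gaps(original, scale)
    print(f"time-scale={scale}:", {k: round(v, 2) for k, v in scaled.items()})
```

At time-scale=1 the gaps are unchanged, which is the idle window the planned time-scale=1 runs would restore.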
+
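+### Measurement impact
+- All v3-v6 data is based on time-scale=10
+- This means every "KVC loses to baseline on SWE" conclusion **may be amplified by the benchmark**
+- The §1 finding that 25% of sessions starve permanently may ease significantly at time-scale=1, because D gets more time to drain
+
+### Not done this round
+- No time-scale=1 baseline was run. This is currently the project's **most important missing validation**.
+- The smoke sweep script (`scripts/sweep_backpressure_smoke.sh`) includes E3 and E4, a short-trace KVC vs DP comparison at time-scale=1, to be run when GPUs are available.
+
+### Conclusion
+**Claim fully holds; time-scale=1 validation is a P0 to-do.**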
+
+---
+
+## Headline comparison (same trace, same hardware)
+
+```
+8-way DP cache-aware (TP1):
+  errors=  0 | latency mean=1.426s p50=0.654s p90=3.609s
+              | TTFT  mean=0.123s p50=0.093s p90=0.256s
+
+KVC v5 2P6D (3 reruns, no polling):
+  run1: errors=372 | mean=3.50s p50=1.11s p90=8.65s | TTFT mean=2.13s
+  run2: errors=912 | mean=3.00s p50=0.94s p90=7.68s | TTFT mean=1.64s
+  run3: errors=396 | mean=3.42s p50=1.22s p90=8.43s | TTFT mean=2.07s
+```
+
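+KVC loses to DP in all three runs, and the gap is far larger than single-run variance:
+- Latency mean: DP is better by roughly **+130%** (KVC average 3.31s vs DP 1.43s)
+- Latency P50: DP is better by roughly **+67%** (KVC average 1.09s vs DP 0.65s)
+- TTFT mean: DP is better by roughly **+1500%** (KVC average 1.95s vs DP 0.12s, about 16× slower)
+- Errors: DP 0 vs KVC average ~560
+
+**This is the most sobering fact about the project right now**: the return on all of KVC's added complexity is negative.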
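
The averages and ratios can be recomputed from the per-run figures in the block above. A small sketch; the dictionary layout and the `summarize` helper are illustrative assumptions, while the numbers are copied from the headline comparison:

```python
from statistics import mean

# DP baseline and the three KVC v5 2P6D reruns, as listed in the headline block.
dp = {"latency_mean": 1.426, "latency_p50": 0.654, "ttft_mean": 0.123}
kvc_runs = [
    {"errors": 372, "latency_mean": 3.50, "latency_p50": 1.11, "ttft_mean": 2.13},
    {"errors": 912, "latency_mean": 3.00, "latency_p50": 0.94, "ttft_mean": 1.64},
    {"errors": 396, "latency_mean": 3.42, "latency_p50": 1.22, "ttft_mean": 2.07},
]


def summarize(baseline, runs):
    """For each metric: KVC average, KVC/DP ratio, and how much slower KVC is in percent."""
    out = {}
    for key, base in baseline.items():
        avg = mean(run[key] for run in runs)
        out[key] = {"kvc_avg": avg, "ratio": avg / base, "pct_slower": (avg / base - 1) * 100}
    return out


summary = summarize(dp, kvc_runs)
for key, row in summary.items():
    print(f"{key}: avg={row['kvc_avg']:.2f}s ratio={row['ratio']:.1f}x (+{row['pct_slower']:.0f}%)")
print("mean errors:", mean(run["errors"] for run in kvc_runs))
```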
+
+---
+
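+## Overall conclusions
+
+Classified along two axes, "is it structural" and "how large is the impact":
+
+| Claim | Structural | Impact | Verified this round | Fix (within KISS) | Fix (beyond KISS) |
+|---|---|---|---|---|---|
+| §1 Session pinning + capacity-blind selection | Strong | Large (25% of sessions starve) | ✅ Consistent across 3 runs | ❌ | capacity-aware policy + cross-D migration |
+| §2 LRU cannot keep up | Strong | Large (~370 KVTransferErrors per run) | ✅ Data from 6 Ds | ❌ | tiered eviction, hot retract |
+| §3 No backpressure | Strong | Medium to large (removes the 32s timeout cascade) | ⚠️ Implemented, awaiting smoke | ✅ **Delivered this round** | - |
+| §4 Admission RPC interference | Weak to medium | Medium | ⚠️ Indirect | ❌ | probe / commit_evict split |
+| §5 P-side unaware of D health | Medium | Medium (~180× error gap between the two Ps) | ✅ N=1, needs N≥3 recheck | ❌ | router P selection with D health feedback |
+| §6 Estimate inflation | - | - | ❌ Withdrawn | - | - |
+| §7 time-scale=10 distortion | Strong (measurement) | Large (could overturn every KVC vs DP conclusion) | ✅ Data is clear | ✅ Change the flag | - |
+
+### The two most important takeaways
+
+1. **§7 time-scale=1 is a prerequisite for every conclusion in the project so far** and must be done first. If KVC and DP are close at time-scale=1, all earlier v3-v6 diagnoses that "KVC loses decisively" need to be reinterpreted.
+2. **§1 + §2 are twin structural problems**: a session permanently pinned to one D, plus a D that cannot evict once full, means large sessions get stuck for good. Any fix that touches neither the policy nor LRU (including this round's backpressure) can only make the symptoms look better; it cannot remove the root cause.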
+
+---
+
+## Summary of code changes this round (git diff scope)
+
+```
+src/agentic_pd_hybrid/replay.py        # + structural logs, backpressure pause check, admission enhancements
+src/agentic_pd_hybrid/cli.py           # + CLI flags
+src/agentic_pd_hybrid/benchmark.py     # + CLI flag pass-through
+third_party/sglang/python/sglang/srt/managers/io_struct.py
+third_party/sglang/python/sglang/srt/managers/scheduler.py
+                                       # + recommended_pause_ms field and hint computation
+scripts/sweep_backpressure_smoke.sh    # 4-run smoke sweep (to run when GPUs are available)
+scripts/analysis/analyze_backpressure_smoke.py
+                                       # companion analyzer
+docs/REFACTOR_PLAN_ZH.md               # plan document
+docs/STRUCTURAL_VALIDATION_REPORT_ZH.md
+                                       # this report
+```
+
+Default behavior is **unchanged** (`enable_backpressure=False`); existing scripts and configs are unaffected.
+
+---
+
+## To run when GPUs are available
+
+```bash
+bash scripts/sweep_backpressure_smoke.sh
+python3 scripts/analysis/analyze_backpressure_smoke.py outputs/sweep_backpressure_smoke
+```
+
+Budget: 4 runs × 30-60 min ≈ 2-4h of GPU time.
+
+Per the §3 expectations, E2 (KVC + backpressure) relative to E1 (KVC baseline) should show errors down 70%+, improved P99, and TTFT P50 flat or slightly higher. E3 (KVC + backpressure @ time-scale=1) vs E4 (DP @ time-scale=1) is the key comparison for validating §7.
+
+If errors do not drop significantly from E1 to E2, either the backpressure hint formula is mistuned (the thresholds in `_compute_backpressure_pause_hint` are adjustable), or §3 is not actually the main cause of the cascade (more likely §2, the D-side LRU).
--- a/docs/SWEBENCH_EXPERIMENT_PROGRESS.md
+++ b/docs/SWEBENCH_EXPERIMENT_PROGRESS.md
@@ -0,0 +1,95 @@
+# SWE-Bench PD Hybrid Experiment Progress
+
+## Goal
+
+Reproduce the three agentic-pd-hybrid serving mechanisms on a single 8xH100 node and compare the performance of Qwen3.5-35B-A3B on agentic trajectories from 500 SWE-Bench instances.
+
+## Hardware
+
+- 8x H100 80GB (NVLink interconnect, 2 NUMA nodes: GPU 0-3 / GPU 4-7)
+- No RDMA/IB devices
+- Transfer backend: **mooncake TCP** (nixl UCX was abandoned: the pip package lacks CUDA support and segfaults)
+
+## Experiment matrix
+
+| Experiment | Mechanism | Workers | GPU allocation | Router | Policy |
+|------|-----------|---------|----------|--------|--------|
+| A | pd-disaggregation | 1P + 1D (TP4 each) | P: 0-3, D: 4-7 | Yes | default |
+| B | pd-colo | 2 direct (TP4 each) | D0: 0-3, D1: 4-7 | No | default |
+| C | kvcache-centric | 1P + 1D (TP4 each) | P: 0-3, D: 4-7 | Yes | default |
+
+## Workload
+
+- Source data: `simm-swe-bench/outputs/20260416-205833-hicache-qwen35-verified-0-500/audit.jsonl`
+- 39,417 lines (turns), 497 unique instances (sessions)
+- 8-150 turns per instance (mean 79.3)
+- Converted to the agentic-pd-hybrid trace format: `outputs/qwen35-swebench-500.jsonl`
+
+## Key findings
+
+### Transfer backend selection
+
+- **nixl (UCX)**: the UCX library bundled with the pip-installed nixl_cu12 package has no CUDA support, which causes a segfault during GPU memory registration. The system UCX (/opt/hpcx/ucx) has CUDA support, but NIXL cannot use it because of RPATH.
+- **mooncake (TCP)**: works, with two changes:
+  1. `third_party/sglang/.../mooncake_transfer_engine.py`: read the protocol from the `MOONCAKE_PROTOCOL` environment variable instead of hard-coding `"rdma"`
+  2. `src/agentic_pd_hybrid/stack.py`: when `transfer_backend == "mooncake"` and `force_rdma` is not set, automatically set `MOONCAKE_PROTOCOL=tcp`
+
+### Code changes
+
+1. **`third_party/sglang/python/sglang/srt/distributed/device_communicators/mooncake_transfer_engine.py`**
+   - Replace the hard-coded `"rdma"` with `os.environ.get("MOONCAKE_PROTOCOL", "rdma")`
+
+2. **`src/agentic_pd_hybrid/stack.py`**
+   - In `_build_process_env()`: default to `MOONCAKE_PROTOCOL=tcp` for mooncake when `force_rdma` is not set
+
+3. **`scripts/convert_audit_to_trace.py`** (new)
+   - Converts sibench audit.jsonl into the agentic-pd-hybrid trace format
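
The two mooncake changes fit together roughly as below. This is a sketch under assumptions, not the actual patch: the function names `resolve_mooncake_protocol` and `build_process_env` and their signatures are hypothetical, and `setdefault` (an already exported `MOONCAKE_PROTOCOL` wins) is one reading of "default to tcp". Only the environment variable name, the `"rdma"` fallback, and the `force_rdma` condition come from the notes above.

```python
import os


def resolve_mooncake_protocol(environ=None):
    """Worker side: read the protocol from the environment, falling back to "rdma"."""
    environ = os.environ if environ is None else environ
    return environ.get("MOONCAKE_PROTOCOL", "rdma")


def build_process_env(base_env, transfer_backend, force_rdma=False):
    """Launcher side: return a copy of base_env, defaulting mooncake to TCP."""
    env = dict(base_env)
    if transfer_backend == "mooncake" and not force_rdma:
        # Assumption: an explicitly exported value is kept rather than overwritten.
        env.setdefault("MOONCAKE_PROTOCOL", "tcp")
    return env


if __name__ == "__main__":
    env = build_process_env({}, "mooncake")
    print(resolve_mooncake_protocol(env))
```

On a node without RDMA/IB devices the launcher therefore starts workers with TCP unless `force_rdma` is requested.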
+
+## Progress
+
+- [x] Step 0: Environment setup (uv sync, nixl/mooncake install)
+- [x] Step 1: Trace format conversion (39,417 lines validated)
+- [x] Step 2: Smoke test (pd-disaggregation, mooncake TCP, 100 requests): **passed**
+  - 100/100 requests, 0 errors
+  - Mean latency: 1.53s, P50: 0.77s, P90: 2.82s
+  - TTFT: mean 0.49s, P50 0.29s; TPOT: mean 4.7ms
+  - 91/100 cache hits
+- [x] Step 3a: Experiment A at full scale (39K reqs, 497 sessions): **aborted**
+  - Run dir: `outputs/swebench-exps/pd-disaggregation-default-20260426T171113Z` (no metrics; killed)
+  - The first 90% completed in ~80min (~8-10 req/s), but in the tail the D-side KV cache was 98% saturated
+  - 497 concurrent sessions competed for D-side token space; mamba: 80-93 sessions could not drain
+  - **Lesson**: 1P+1D (TP4) cannot sustain 497 concurrent sessions; reduce the number of sessions or lower concurrency
+- [x] Step 3b: Experiment A, pd-disaggregation (52 sessions, 4449 reqs, concurrency=32): **done**
+  - Run dir: `outputs/swebench-exps/pd-disaggregation-default-20260426T202540Z`
+  - Trace: `outputs/qwen35-swebench-50sess.jsonl` (10% sample, 52 sessions)
+  - **Result**: 4449/4449 succeeded, 0 errors
+  - Latency: mean=1.66s, P50=0.97s, P90=3.64s, P99=7.68s
+  - TTFT: mean=0.45s, P50=0.34s, P90=0.88s
+  - TPOT: mean=5.2ms, P50=5.2ms
+  - Cache hit: 4199/4449 (94.4%)
+- [x] Step 4: Experiment B, pd-colo: **failed (SGLang bug)**
+  - Run dir: `outputs/swebench-exps/pd-colo-default-20260426T210129Z`
+  - **Bug**: under `--disaggregation-mode null` (colocation), the Qwen3.5-35B-A3B model triggers a token_to_kv_pool_allocator memory leak
+  - Error: `ValueError: token_to_kv_pool_allocator memory leak detected!`
+  - Both direct workers crashed after handling ~5 requests (Scheduler exception)
+  - **Conclusion**: the currently vendored SGLang v0.5.10 does not support colocation mode for Qwen3.5-35B-A3B
+- [x] Step 5: Experiment C, kvcache-centric: **done (high error rate)**
+  - Run dir: `outputs/swebench-exps/kvcache-centric-default-worker-admission-20260426T210800Z`
+  - 4390/4449 errors (98.7%): admission control is too conservative
+  - 59 successful requests: mean latency 1.24s (25% faster than pd-disagg), TTFT 0.18s (60% faster)
+  - Detailed analysis in `docs/SWEBENCH_EXPERIMENT_RESULTS.md`
+- [x] Step 6: Comparative analysis: **done**
+  - Full report: `docs/SWEBENCH_EXPERIMENT_RESULTS.md`
+
+## Launch scripts
+
+- `scripts/run_exp_a_pd_disagg.sh`: Experiment A
+- `scripts/run_exp_b_pd_colo.sh`: Experiment B
+- `scripts/run_exp_c_kvcache_centric.sh`: Experiment C
+- `scripts/convert_audit_to_trace.py`: trace conversion
+
+## Known risks
+
+1. Qwen3.5-35B-A3B at TP4 leaves ~12GB/GPU of usable memory (after model + CUDA graph); long sessions (150 turns) may OOM
+2. mooncake TCP loopback latency is far lower than real cross-machine latency, so the results are optimistic
+3. The original trace spans ~6000s, so a full replay is very time-consuming
--- a/docs/SWEBENCH_EXPERIMENT_RESULTS.md
+++ b/docs/SWEBENCH_EXPERIMENT_RESULTS.md
@@ -0,0 +1,121 @@
+# SWE-Bench PD Hybrid Experiment Results
+
+## Configuration
+
+- **Model**: Qwen3.5-35B-A3B (MoE, 35B total / 3B active), TP4
+- **Hardware**: 8x H100 80GB, NVLink, single node
+- **Transfer backend**: mooncake TCP (loopback)
+- **Trace**: 52 sessions, 4,449 requests (10% sample of SWE-Bench 500 instances)
+- **Time compression**: time-scale=10, concurrency-limit=32
+
+## Results summary
+
+### Experiment A: pd-disaggregation (baseline)
+
+| Metric | Value |
+|--------|-------|
+| Run dir | `pd-disaggregation-default-20260426T202540Z` |
+| Requests | 4,449 / 4,449 (100%) |
+| Errors | 0 |
+| **Mean Latency** | **1.662s** |
+| P50 Latency | 0.973s |
+| P90 Latency | 3.644s |
+| P99 Latency | 7.676s |
+| Mean TTFT | 0.445s |
+| P50 TTFT | 0.340s |
+| P90 TTFT | 0.880s |
+| Mean TPOT | 5.20ms |
+| Cache Hit Rate | 94.4% (4199/4449) |
+| Mean Cached Tokens | 27,794 |
+| KV Transfer Blocks | 105,235 |
+
+### Experiment B: pd-colo (colocation) — FAILED
+
+| Metric | Value |
+|--------|-------|
+| Run dir | `pd-colo-default-20260426T210129Z` |
+| Status | **CRASHED** |
+| Error | `token_to_kv_pool_allocator memory leak detected!` |
+| Root Cause | SGLang v0.5.10 `--disaggregation-mode null` is incompatible with Qwen3.5-35B-A3B (Mamba/GDN hybrid) |
+| Requests | ~10 / 4,449 (0.2%) |
+
+**Conclusion**: the currently vendored SGLang does not support colocation mode for this model. The memory management for Mamba models in token_to_kv_pool_allocator needs to be fixed.
+
+### Experiment C: kvcache-centric (session-aware PD)
+
+| Metric | Value |
+|--------|-------|
+| Run dir | `kvcache-centric-default-worker-admission-20260426T210800Z` |
+| Requests | 4,449 total |
+| **Errors** | **4,390 (98.7%)** |
+| Successful | 59 (1.3%) |
+| Mean Latency (success) | 1.238s |
+| P50 Latency (success) | 0.484s |
+| P90 Latency (success) | 2.550s |
+| Mean TTFT (success) | 0.179s |
+| P50 TTFT (success) | 0.081s |
+| Mean TPOT (success) | 4.70ms |
+| Direct-to-D Sessions | 56 |
+| KV Transfer (actual) | 196 blocks (vs 105,235 planned) |
+
+**Execution mode distribution**:
+- `kvcache-centric` (failed): 4,390
+- `kvcache-direct-to-d-session` (success): 56
+- `pd-router-*` variants: 3
+
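+## Key analysis
+
+### 1. pd-disaggregation (A): stable and reliable
+
+- 100% success rate, 0 errors
+- Mean latency of 1.66s is reasonable (it includes P→D KV transfer overhead)
+- The 94.4% cache hit rate shows the prefix cache works well on the P side
+- KV transfer of 105K blocks is the main source of overhead
+- **Suitable for production use**
+
+### 2. pd-colo (B): unusable
+
+- The Mamba/GDN hybrid architecture of Qwen3.5-35B-A3B triggers a memory leak under `disaggregation-mode null`
+- This is an SGLang bug, not an agentic-pd-hybrid problem
+- **Retest once SGLang is fixed**
+
+### 3. kvcache-centric (C): admission is too conservative
+
+- The 98.7% error rate shows that admission control rejected almost every request
+- `kvcache-seed-min-turn-id=2` filtered out the turn 1 seed (correct behavior)
+- But the vast majority of turn 2+ requests also went through `kvcache-centric` mode and then failed
+- Possible causes:
+  - The worker admission query finds no KV cache for the session on the D side (because turn 1 was not seeded)
+  - A backlog in the D-side transfer queue causes admission to reject
+- The 56 successful `direct-to-d-session` requests performed very well: TTFT 0.08s (P50), about 4x faster than pd-disagg's 0.34s
+- **Tune the admission parameters, or use `kvcache-seed-min-turn-id=1` to allow turn 1 seeding**
+
+### 4. kvcache-centric successful requests vs pd-disaggregation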
+
+| Metric | pd-disagg (A) | kvcache-centric (C, success only) | Delta |
+|--------|:---:|:---:|:---:|
+| Mean Latency | 1.662s | 1.238s | **-25.5%** |
+| P50 Latency | 0.973s | 0.484s | **-50.3%** |
+| Mean TTFT | 0.445s | 0.179s | **-59.8%** |
+| P50 TTFT | 0.340s | 0.081s | **-76.2%** |
+| Mean TPOT | 5.20ms | 4.70ms | -9.6% |
+| Actual KV Transfer | 105,235 blk | 196 blk | **-99.8%** |
+
+**When kvcache-centric succeeds, the gains are substantial** (this covers only the 59 successful requests, 1.3% of the trace):
+- TTFT is 60-76% lower (direct append on the D side, no P→D transfer needed)
+- End-to-end latency is 25-50% lower
+- KV transfer is 99.8% lower
+
+## Next steps
+
+1. **Fix pd-colo**: file an SGLang issue about the memory leak for Mamba/GDN models under disaggregation-mode null
+2. **Tune kvcache-centric admission**:
+   - Try `--kvcache-seed-min-turn-id 1` to allow turn 1 seeding
+   - Relax the `--kvcache-seed-max-decode-transfer-queue-reqs` threshold
+   - Use `--kvcache-admission-mode router` (shadow state, off the critical path)
+3. **More D-side memory**: adjust `--mem-fraction-static` to give the KV cache more room
+4. **Multi-P/D configurations**: test 2P2D (TP2) to increase parallelism
+
+## Experiment date
+
+2026-04-27
--- a/docs/V5_PROFILE_INVESTIGATION_ZH.md
+++ b/docs/V5_PROFILE_INVESTIGATION_ZH.md
@@ -0,0 +1,305 @@
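+# v5+Profile investigation report (revised after critic audit)
+
+**Date**: 2026-04-29 (original draft) / 2026-04-29 (revised after audit)
+**Setup**: Qwen3-30B-A3B (TP1), single node 8×H100 80GB, trace = qwen35-swebench-50sess.jsonl (4449 reqs / 52 sessions), time-scale=10, concurrency=32
+**Dataset**: `outputs/qwen3-30b-tp1-v5-optD-profile/` (EXP1 1P7D + EXP2 2P6D, both with 1Hz `/server_info` time-series sampling)
+**v5 baseline for comparison**: `outputs/qwen3-30b-tp1-v5-optD/` (no polling)
+**Research question**: v5 (Option D) cut errors from 9-10% to 0.2%, but session-cap fallback rose to 46-51%. Where do the fallbacks and errors actually come from?
+
+> **This is the revised version after a hostile audit.** The original draft contained several wrong conclusions (notably an inverted reading of `held_tokens` semantics, over-attribution to an admission race, and underweighting the side effects of polling). The audit comments are kept in this session's record; key corrections are marked with ⚠️.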
+
+---
+
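+## TL;DR (revised)
+
+1. **Real capacity**: each D has `token_to_kv_pool_allocator.size = 92086 tokens (~92K)`. ⚠️ The real per-turn footprint is **not 50-100K**; `cached_tokens` is p50=18K, p90=48K, p99=67K. The original draft overstated it.
+2. **The reading of `other = capacity − held − available` has been revised**: ⚠️ `held_tokens = sum(slot.kv_allocated_len − slot.cache_protected_len)` (code: `session_aware_cache.py:278-282`), i.e. the part a slot holds that is **not protected by the radix tree**. So **the largest single component of `other` is most likely the radix-tree-protected shared prefix cache**, which is usually desirable and **not pathological waste**. The original draft was wrong to attribute all of `other` to the running batch plus in-flight transfers.
+3. **The bimodal distribution of `other` is real** (p50 ≈ 0, p90 ≈ 80K), but `cap−held−avail` alone cannot tell whether this is natural radix-cache accumulation or burst working memory. **The P1 breakdown instrumentation must come first.**
+4. **Errors and `other` do correlate in time**, but this **cannot be read as causation**. Several variables change over the same period (request concurrency, in-flight transfers, free space); timing alignment alone cannot show that "`other` ate the space that was freed".
+5. **EXP2 2P6D errors 9 → 415**: ⚠️ **polling is upgraded to the leading hypothesis** rather than "unrelated". Evidence: execution modes show a ~1:1 substitution (`session-cap-fb` −356 / `kvcache-centric` +406), and `/server_info` is not a passive read: it iterates over every session slot inside the scheduler main loop to compute `is_idle`. Three P0 baseline reruns are needed to test this.
+6. **Errors concentrate on 18 sessions** (out of 52), each pinned to a single D. Differences in per-D error rate **cannot be explained as structural differences between Ds**; they reflect how the 18 "bad sessions" were routed.
+7. **The latency advantage of v5+profile 1P7D over baseline** is entirely within single-run variance. N=1; it **cannot support any performance conclusion**.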
+
+---
+
+## 1. Methodology
+
+### 1.1 Instrumentation changes
+- `src/agentic_pd_hybrid/replay.py` gains `_query_pool_snapshot` + `_poll_pool_timeseries`; a background asyncio task hits `/server_info` on every P/D worker at the `--pool-poll-interval-s 1.0` period.
+- Each tick writes one jsonl line to `<run_dir>/d-pool-timeseries.jsonl` with fields `{worker_id, worker_role, session_count, resident_session_count, held_tokens, available_tokens, capacity_tokens, idle_evictable_*, sessions[], kvcache_mem_gb, last_gen_throughput, ...}`.
+- Analysis script: `scripts/analysis/analyze_pool_timeseries.py`.
+
+### 1.2 Field definitions (revised ⚠️)
+The source of `/server_info` → `internal_states[0].session_cache` is `session_controller.py:get_streaming_session_cache_status` → `tree_cache` (`SessionAwareCache`).
+
+| Field | Actual meaning | Notes |
+|---|---|---|
+| `held_tokens` | `sum_over_slots(ceil(kv_allocated_len, page_size) − cache_protected_len)` | **Not** "everything the session occupies in the cache"; it counts only the **slot-private part not protected by the radix tree** |
+| `cache_protected_len` | The shared-prefix part protected by the radix tree | Counted once when shared by multiple sessions |
+| `available_tokens` | `token_to_kv_pool_allocator.available_size()` | Free space remaining in the global KV pool |
+| `capacity_tokens` | `allocator.size` | Total KV capacity of one D = 92086 |
+| `idle_evictable_tokens` | The part of held that LRU can evict immediately (all of the session's reqs finished + streaming mode) | |
+
+Therefore:
+- **`other = capacity − held − available`** includes, but is not limited to:
+  - **radix-tree-protected shared prefix tokens** (possibly the bulk) ⚠️ missed in the original draft
+  - KV slots held by the current running batch
+  - temporary buffers for in-flight P→D transfers
+  - blocks registered with mooncake but not yet committed to tree_cache
+  - internal fragmentation / allocator metadata
+
+**Implication**: until the P1 instrumentation is added, we **cannot separate** the "radix-cache" share of `other` (benign) from the "burst working set / fragmentation" share (possibly pathological).
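
The derived `other` value is a plain subtraction over one timeseries row. A minimal sketch, assuming each row is a dict carrying the `capacity_tokens`, `held_tokens` and `available_tokens` fields listed in §1.1; the helper name `other_tokens` and the consistency check are illustrative and not taken from `analyze_pool_timeseries.py`. The sample values are the decode-2 row at t=1351 from §3.1.

```python
def other_tokens(row):
    """Return capacity - held - available for one d-pool-timeseries row."""
    other = row["capacity_tokens"] - row["held_tokens"] - row["available_tokens"]
    if other < 0:
        # held + available should never exceed capacity in a consistent snapshot.
        raise ValueError(f"inconsistent snapshot: other={other}")
    return other


# decode-2 sample at t=1351 s from the EXP1 table in section 3.1.
row = {"capacity_tokens": 92086, "held_tokens": 40985, "available_tokens": 30175}
print(other_tokens(row))  # 20926
```

Because `held_tokens` excludes radix-protected tokens, this number alone says nothing about which component dominates.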
+
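+### 1.3 Configuration consistency and risks
+- The only difference between v5+profile and the v5 baseline is the added `--pool-poll-interval-s 1.0` (all other CLI arguments are identical).
+- **The two runs are ~21 hours apart** (2026-04-28 15:39/16:27 vs 2026-04-29 12:08/12:59) ⚠️ the original draft wrongly said ~6h. Same machine, but GPU temperature, PCIe, and NUMA allocation were not controlled.
+- **An N=1 comparison has no statistical meaning**; any latency difference < 30% is within plausible single-run variance.
+
+---
+
+## 2. Overall performance comparison
+
+| Metric | v5 1P7D | **v5+profile 1P7D** | v5 2P6D | **v5+profile 2P6D** |
+|---|---|---|---|---|
+| Total requests | 4449 | 4449 | 4449 | 4449 |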
+| **errors** | 9 (0.2%) | 6 (0.1%) | 9 (0.2%) | **415 (9.3%)** |
+| truncated | 42 | 43 | 42 | 42 |
+| direct-to-D | 44.7% | 54.9% | 41.3% | 41.1% |
+| session-cap fallback | 45.6% | 36.1% | 50.6% | 42.6% |
+| no-d-capacity | 1.2% | 0.7% | 0.8% | 0.6% |
+| pd-router-d-session-reseed | 4.8% | 4.3% | 3.4% | 2.9% |
+| pd-router-turn1-seed | 1.2% | 1.2% | 1.1% | 1.1% |
+| **kvcache-centric (failed mode)** | 0.2% (9) | 0.1% (6) | 0.2% (9) | **9.3% (415)** |
+| latency mean / p50 / p90 / p99 (s) | 5.18/1.59/14.7/26.1 | 4.21/1.18/11.3/28.8 | 3.49/1.31/9.1/24.9 | 3.23/1.11/8.4/20.3 |
+
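+⚠️ **Do not conclude from this table that "v5+profile improved latency".** It is an N=1 single run, and the 415 errors in EXP2 amount to a different fallback strategy; the drop in mean latency is most likely just a side effect of **removing slow-path requests**.
+
+### 2.1 Breakdown of the 415 EXP2+profile errors (revised)
+
+**Error type distribution**:
+| Error Type | Count |
+|---|---|
+| `RuntimeError: generate stream ended before producing any token` | 407 |
+| `ReadTimeout: ` | 8 |
+
+⚠️ **Key constraints**:
+- **414 of the 415 errors have `kv_transfer_blocks > 0`** (verified from the metrics jsonl). These requests **had already passed admission and the P→D transfer had started**; they died downstream (server-side abort, stream closed, or failure in the generation phase).
+- **`session_reused=False` for 415/415** (all are seeds; none is a direct append).
+- **Failures concentrate on 18 unique sessions** (top 5: 58080→decode-5 66 errs / 70560→decode-2 54 / 67200→decode-4 40 / 59200→decode-4 35 / 77280→decode-2 33), each pinned to a single D.
+
+**Per-D error rate (percentages corrected)**: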
+| Decode Worker | Errors | Total Reqs | Error Rate |
+|---|---|---|---|
+| decode-0 | 56 | 758 | 7.4% |
+| decode-1 | 5 | 561 | 0.9% |
+| decode-2 | 141 | 858 | **16.4%** |
+| decode-3 | 0 | 838 | 0.0% |
+| decode-4 | 106 | 731 | 14.5% |
+| decode-5 | 107 | 703 | 15.2% |
+
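+⚠️ **Do not read this as "decode-3 is healthy, decode-2 is pathological".** Each session is pinned to one D, and whether any of the 18 bad sessions lands on a given D is a random outcome of routing. **The current N=1 data cannot separate "structural differences between Ds" from "luck in session assignment".**
+
+---
+
+## 3. D KV pool time-series breakdown (key EXP1 1P7D results)
+
+Each D has capacity=92086 tokens; the run lasted ~2696 seconds (first 10% dropped as warm-up):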
+
+| Worker | mean_other | p50_other | p90_other | max_other | mean_held | mean_avail |
+|---|---:|---:|---:|---:|---:|---:|
+| decode-0 | 13599 | 63 | 77189 | 90959 | 47124 | 31363 |
+| decode-1 | 21242 | 0 | 76854 | 91074 | 37024 | 33820 |
+| decode-2 | 39333 | 46841 | 82782 | 91996 | 17381 | 35372 |
+| decode-3 | 30543 | 15864 | 81512 | 91511 | 9584 | 51959 |
+| decode-4 | 32659 | 32365 | 72995 | 92082 | 7643 | 51784 |
+| decode-5 | 31745 | 20366 | 86341 | 91211 | 11305 | 49036 |
+| decode-6 | 24602 | 701 | 82291 | 91000 | 20967 | 46517 |
+
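+**Revised observations (the original draft's over-attribution removed)**:
+- **`other` is bimodal** (p50 near 0, p90 near 80K, mean 14-39K). This shape is real.
+- **mean_held / mean_other differ widely across Ds**, but ⚠️ **they cannot be labeled "session-heavy" or "transfer-heavy" directly**, because we do not know the ratio of radix-cache to working memory inside `other`. **The P1 breakdown is mandatory.**
+- Because `held` excludes radix-protected tokens, a low `mean_held` **does not mean** the sessions on that D occupy little; it only means their slot-private part is small. The shared prefix may be large and hidden entirely in `other`.
+
+### 3.1 `other` stays high for extended periods (EXP1 decode-2 sample)
+
+| t (s) | held | avail | other | sess_count | last_gen_throughput |
+|---:|---:|---:|---:|---:|---:|
+| 3 | 0 | 92086 | 0 | 0/0 | (not sampled) |
+| 273 | 65310 | 26776 | 0 | 1/1 | (not sampled) |
+| 543 | 15296 | 76589 | 201 | 1/1 | (not sampled) |
+| 812 | 0 | 92086 | 0 | 0/0 | (not sampled) |
+| 1082 | 52507 | 39579 | 0 | 1/1 | (not sampled) |
+| 1351 | 40985 | 30175 | 20926 | 2/2 | (not sampled) |
+| **1622** | **0** | 17703 | **74383** | **0/0** | **not checked** |
+| 1891 | 0 | 46376 | 45710 | 0/0 | (not sampled) |
+| 2161 | 0 | 27667 | 64419 | 0/0 | (not sampled) |
+| 2430 | 0 | 62224 | 29862 | 0/0 | (not sampled) |
+
+⚠️ **After t=1622 (for roughly 30+ ticks) the state stays at held=0/sess=0/other≈45-74K.** Such a persistent state is **not what a burst working set looks like** (a burst should be sub-second). More likely explanations include:
+- KV blocks of a stuck request that were never released
+- mooncake transfer buffers that were registered but never committed
+- a cleanup path that did not fire
+
+**The original draft did not verify `last_gen_throughput`**; the field is recorded in the timeseries but was not aligned in the analysis. **To be added together with P1.**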
+
+---
+
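+## 4. Temporal correlation between errors and saturation (EXP2 2P6D)
+
+### 4.1 Equal-count vs equal-time deciles (revised ⚠️)
+
+The original draft showed only equal-time binning, which creates the visual illusion that "the system recovers in decile 10". Both binnings side by side:
+
+| Decile | Equal-time (reqs / errs / rate) | Equal-count (reqs / errs / rate) |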
+|:---:|:---:|:---:|
+| 1 | 567 / 0 / 0.0% | 444 / 0 / 0.0% |
+| 2 | 268 / 0 / 0.0% | 445 / 0 / 0.0% |
+| 3 | 517 / 0 / 0.0% | 445 / 0 / 0.0% |
+| 4 | 189 / 0 / 0.0% | 445 / 0 / 0.0% |
+| 5 | 662 / 3 / 0.5% | 445 / 3 / 0.7% |
+| 6 | 417 / 27 / 6.5% | 445 / 28 / 6.3% |
+| 7 | 486 / 39 / 8.0% | 445 / 42 / 9.4% |
+| 8 | 612 / 177 / 28.9% | 445 / 114 / 25.6% |
+| 9 | 486 / 128 / 26.3% | 445 / 119 / 26.7% |
+| **10** | **245 / 41 / 16.7%** | **445 / 109 / 24.5%** |
+
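+⚠️ **Decile 10 is not "system recovery".** Equal-count binning shows a 24.5% error rate, on par with deciles 8/9. The "recovery" narrative in the original draft is a visual artifact of the denominators (245 vs 612).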
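
The two binnings differ only in how requests are assigned to bins. The sketch below is illustrative and not the analysis script used for the table: the function names and the `(timestamp_s, is_error)` record layout are assumptions, and the exact split of the equal-count bins may differ from the table's 444/445 partition.

```python
def _summarize(bins):
    """Turn lists of (timestamp_s, is_error) into (reqs, errs, error_rate) tuples."""
    out = []
    for b in bins:
        errs = sum(1 for _, is_err in b if is_err)
        out.append((len(b), errs, errs / len(b) if b else 0.0))
    return out


def equal_time_bins(records, n_bins=10):
    """Bins of equal wall-clock width; request counts per bin can vary widely."""
    times = [t for t, _ in records]
    t0, t1 = min(times), max(times)
    width = (t1 - t0) / n_bins
    bins = [[] for _ in range(n_bins)]
    for t, is_err in records:
        idx = 0 if width == 0 else min(int((t - t0) / width), n_bins - 1)
        bins[idx].append((t, is_err))
    return _summarize(bins)


def equal_count_bins(records, n_bins=10):
    """Bins with (nearly) equal request counts, in timestamp order."""
    ordered = sorted(records, key=lambda r: r[0])
    bins = [[] for _ in range(n_bins)]
    for i, rec in enumerate(ordered):
        bins[i * n_bins // len(ordered)].append(rec)
    return _summarize(bins)


# Synthetic example: a sparse tail makes the last equal-time bin tiny.
records = [(t, t >= 7) for t in range(9)] + [(100, True)]
print(equal_time_bins(records, 2))
print(equal_count_bins(records, 2))
```

With equal-time bins, a sparse tail yields a small denominator, so its error rate is not directly comparable with denser bins.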
+
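+### 4.2 Competing hypotheses side by side (revised; admission race is no longer singled out)
+
+Possible mechanisms behind the 415 EXP2 2P6D errors (ordered by strength of current evidence):
+
+**H1: Polling perturbs scheduler timing (leading hypothesis ⚠️)**
+- Evidence: 1:1 substitution of execution modes (session-cap-fb −356 / kvcache-centric +406).
+- Evidence: `/server_info` enters the scheduler main loop and iterates over session slots; 1 Hz × 8 workers is not zero overhead.
+- Test: **if all three P0 runs (baseline EXP2 reruns) yield ~9 errors, this hypothesis is confirmed**.
+
+**H2: v5 itself has an admission/transfer race**
+- The v5 baseline also produces 9 errors (all ReadTimeout), which suggests the race already exists in the baseline and profiling amplifies it.
+- Weakened evidence: the "admission race" proposed in the original draft (a stale admit_direct_append snapshot) conflicts with the data: **414 of the 415 errors have `kv_transfer_blocks > 0`**, so they all passed admission and died downstream. Even if a race exists, it is not on the admission side; it occurs after the P→D transfer and before generation starts.
+
+**H3: Structural workload failure of 18 specific sessions**
+- Failures concentrate on 18 of 52 sessions, all at high turn_id (median=70).
+- These sessions may have especially long inputs, or some trace structure may trigger a specific path.
+- Test: after the three P0 baseline reruns, check whether the same 18 sessions still fail.
+
+**H4: GPU/PCIe state perturbation in a single run**
+- ~21 hours apart, with different GPU temperature/clock.
+- Test: if all three P0 baselines yield ~9 errors, a single-run perturbation is ruled out as the dominant cause.
+
+⚠️ **The original draft was wrong to single out the admission race (H2).** The current data cannot decide which of H1-H4 is the main cause.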
+
+---
+
+## 5. 1P7D vs 2P6D global comparison
+
+| Config | total decode ticks | other p50 | other p90 | other>30K freq | other>50K freq | other>70K freq | held>60K freq |
+|---|---:|---:|---:|---:|---:|---:|---:|
+| 1P7D | 18865 | 663 | 79751 | 36.9% | 27.9% | 14.8% | 15.5% |
+| 2P6D | 14016 | 14459 | 77199 | 43.2% | 30.4% | 13.9% | 4.8% |
+
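+⚠️ **The original draft's "p50_other in 2P6D is 22× that of 1P7D → 2P puts more push pressure on D" is an over-reading.** Consider the denominator effect: the same total trace workload is shared by 6 Ds in 2P6D vs 7 Ds in 1P7D, so **each D is under more pressure to begin with**, with no direct causal link to the number of Ps. The data only supports "each D carries more load in 2P6D"; it does **not** support "2P is more aggressive than 1P on transfer".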
+
+---
+
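+## 6. Key interpretations (heavily revised)
+
+### 6.1 The real v5 bottleneck is still unclear
+The original draft claimed that "the bottleneck is D's KV pool being occupied by 'other' under pressure". ⚠️ **This conclusion is withdrawn.** Given that `held_tokens` is actually the slot-private (non-tree) part, the largest single component of `other` is **most likely the normal radix-tree shared prefix**. "Occupied by the running batch / in-flight transfers" is an **unverified guess**. The P1 breakdown instrumentation is needed to identify the real bottleneck.
+
+### 6.2 No reliable reading of LRU eviction behavior yet
+The original draft inferred from mean_held "plunging" under pressure that LRU was evicting aggressively. But `held` is the slot-private part, and a session may still be retained by the radix tree; a drop in `held` does not mean the session was evicted, it may only reflect a change in the `cache_protected_len` share. **No conclusion before the P1 breakdown.**
+
+### 6.3 v5+profile 1P7D being "faster than baseline" is a one-off coincidence
+The two runs are ~21 hours apart (the original draft wrongly said ~6h), and GPU temperature/PCIe state were not controlled. **N=1**; no performance difference < 30% can be claimed.
+
+### 6.4 EXP2 2P6D 415 errors: polling is the leading suspect (upgraded)
+The original draft listed polling as a "secondary possibility". ⚠️ **It is now the prime suspect**:
+- The 1:1 substitution of execution modes (session-cap-fb −356 / kvcache-centric +406) shows that polling **changed which path admission takes**.
+- `/server_info` is not a read-only side channel: it runs inside the scheduler loop and iterates over session slots to compute `is_idle`.
+- **The three P0 baseline reruns are mandatory**; v6 must not be touched before then.
+
+### 6.5 "Other" at 90% on P is not backup blocks
+SessionAwareCache is **not enabled** on `prefill-0` (replay data shows `held=0`), so "other" on P equals "all KV usage on P" (radix cache + running batch + backups). ⚠️ The current data **cannot tell** whether prefill-backup-policy actually released anything. A separate `prefill_backup_tokens` field is needed on P.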
+
+---
+
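+## 7. v6 action items (reordered, starting with P0)
+
+### **P0: verify that EXP2 errors=9 is reproducible** (highest priority, do first)
+**Action**: run the v5 baseline EXP2 three times (same v5 configuration, **polling off**) and compare the error distributions.
+- If all 3 runs yield ~9 errors → polling is confirmed as the main cause of the jump to 415. **Polling must be made lighter** (e.g. lower frequency, streaming push, or sidecar metrics instead of HTTP polling) before any follow-up work.
+- If all 3 runs yield ~400 errors → polling is not the main cause; the 415 is a combination of the v5 admission/transfer race and single-run GPU state perturbation.
+- If the 3 results are widely spread (e.g. 9 / 50 / 400) → run-to-run variance dominates and every single-run comparison is invalid.
+
+**Expected effort**: 1 new sweep script (EXP2 only, 3 runs) + ~3 × 50 min = ~2.5h of GPU time.
+**Risk**: none (pure rerun of an existing configuration).
+
+### **P1: break down D's `other` into a table** (code work in parallel while P0 runs)
+**Action**: change SGLang `scheduler.py:get_streaming_session_cache_status` and `session_aware_cache.py` to add the following to the returned dict:
+- `radix_protected_tokens` = `sum(slot.cache_protected_len for slot in slots)` ⚠️ the blind spot of the original draft; the key missing field exposed by the critic
+- `running_batch_tokens` = `sum(req.fill_ids size for req in running_batch.reqs)`
+- `inflight_transfer_tokens` = `sum(req.size for req in disagg_decode_transfer_queue.queue)`
+- `prealloc_tokens` = `sum(req.size for req in disagg_decode_prealloc_queue.queue)`
+- `retracted_tokens` = `sum(req.size for req in disagg_decode_prealloc_queue.retracted_queue)`
+- `last_gen_throughput` (already present) at finer grain: add `running_batch_size` (number of reqs)
+
+**Expected benefit**: `other_unaccounted = capacity − held − available − radix_protected − running_batch − inflight − prealloc − retracted` should be close to 0. Whatever remains is the truly "pathological" memory.
+**Risk**: low (read-only stats; admission logic is unchanged).
+**Effort**: ~80 lines of SGLang patch + matching updates to `_query_pool_snapshot` in replay.py + the analyzer.
+
+### **P2: if P0 shows polling is the main cause, change the polling implementation**
+- Option A: make `/server_info` an event-driven push (the scheduler writes stats to a ring buffer at the end of each step; polling only reads and never enters the scheduler queue)
+- Option B: lower the polling rate from 1Hz to one sample every 5s/10s, and verify on the P1 breakdown data that this is enough
+- Option C: separate the locks on the scheduler side so that reading stats and making admission decisions use different critical sections
+
+### **P3 (conditional, pending P0+P1 data)**: decide the real optimization direction
+The 5 priorities in §7 of the original draft **all need to be re-evaluated** now that the `other` model has been corrected. Rank them once the real breakdown data is available.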
+
+---
+
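+## 8. Limitations and confounders (expanded)
+
+1. ⚠️ The original draft inverted the semantics of `held_tokens`, which led to a wrong causal attribution for `other` (corrected; see §1.2).
+2. `other` is a derived field and is **not broken down**, so it cannot be attributed directly. The P1 instrumentation is needed to separate radix-cache, running batch, inflight, etc.
+3. ⚠️ The order-of-magnitude gap between the 415 EXP2+profile errors and the 9 baseline errors **cannot be deconfounded**; polling is the leading suspect but unconfirmed. **P0 is a required step.**
+4. **N=1** setup: any latency/failure difference between v5+profile and the v5 baseline is within plausible single-run variance and **cannot serve as a directional conclusion**.
+5. The trace is single-shot; the specific structure of 52 sessions × 4449 reqs may amplify certain paths.
+6. `capacity = 92086` is `token_to_kv_pool_allocator.size`, derived from `mem_fraction_static` (exact value not extracted); the gap to the physical limit of an H100 80GB is SGLang's safety margin.
+7. ⚠️ The sustained high `other` for 30+ ticks at t=1622 in §3.1 was **not cross-checked against `last_gen_throughput`**; the original draft's "running batch + in-flight transfers" explanation is a guess, not evidence.
+8. ⚠️ The characteristics of the 18/52 failing sessions (turn_id, input length, prefix shape) were **not compared**; we cannot rule out that a certain session type always triggers a fixed bug.
+9. 1Hz polling misses sub-second bursts, so the bimodality of `other` may be sharper than measured.
+10. The critic pointed out that `pd-router-d-session-reseed` moves in opposite directions, up in EXP1 (193 vs 152) and down in EXP2 (127 vs 152), and this was **not analyzed in the original draft**. It is a clear signal about admission/routing decisions and should be revisited after P1.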
+
+---
+
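+## 9. Next instructions (order updated)
+
+1. **P0**: run `scripts/sweep_tp1_v5_baseline_rerun_exp2.sh`: 3 EXP2 baseline runs, no polling.
+2. **P1**: in parallel, change SGLang to actually break `other` down.
+3. After P0+P1 are done:
+   - Rerun EXP2 once with the new instrumentation (same polling) to obtain the `other` breakdown.
+   - Compare against the error distribution of the three baseline reruns.
+   - Decide whether to roll back polling, tune admission, or go after the workload characteristics of the specific 18 sessions.
+4. Any v6 code change (optimizing admission / eviction / transfer) **must come after P0+P1**.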
+
+---
+
+## 10. Data artifacts
+
+```
+outputs/qwen3-30b-tp1-v5-optD-profile/
+├── exp{1,2}_*_metrics.jsonl                # 4449 lines per experiment
+├── exp{1,2}_*_summary.json
+├── exp{1,2}_*_pool_timeseries.jsonl        # 12 MB / 10 MB
+└── kvcache-centric-...20260429T{120847,125911}Z/  # original run dirs
+
+outputs/qwen3-30b-tp1-v5-optD/  # baseline for comparison (N=1)
+└── exp{1,2}_1p7d_kvc_optD_*
+
+# To be produced by P0:
+outputs/qwen3-30b-tp1-v5-optD-baseline-rerun/
+└── exp2_2p6d_run{1,2,3}_*
+```
+
+Analysis script: `scripts/analysis/analyze_pool_timeseries.py` (use `--json` for machine-readable output).
--- a/outputs/qwen3-30b-tp1-v3-kvaware/exp1_1p7d_kvc_kvaware_summary.json
+++ b/outputs/qwen3-30b-tp1-v3-kvaware/exp1_1p7d_kvc_kvaware_summary.json
@@ -0,0 +1,88 @@
+{
+  "actual_output_tokens_stats": {
+    "count": 4086.0,
+    "mean": 213.95105237395987,
+    "p50": 83.0,
+    "p90": 562.0,
+    "p99": 1346.0
+  },
+  "cache_hit_request_count": 3929,
+  "cached_tokens_stats": {
+    "count": 4449.0,
+    "mean": 22635.924702180266,
+    "p50": 20010.0,
+    "p90": 48002.0,
+    "p99": 65424.0
+  },
+  "decode_request_priorities": {},
+  "error_count": 363,
+  "execution_modes": {
+    "kvcache-centric": 363,
+    "kvcache-direct-to-d-session": 1716,
+    "pd-router-d-session-reseed": 23,
+    "pd-router-fallback-d-backpressure": 12,
+    "pd-router-fallback-large-append": 5,
+    "pd-router-fallback-large-append-seed-filter-early-turn": 51,
+    "pd-router-fallback-large-append-session-cap": 2148,
+    "pd-router-fallback-no-d-capacity": 7,
+    "pd-router-fallback-session-cap": 32,
+    "pd-router-large-append-reseed": 39,
+    "pd-router-large-append-reseed-after-eviction": 2,
+    "pd-router-turn1-d-backpressure": 1,
+    "pd-router-turn1-no-d-capacity": 3,
+    "pd-router-turn1-seed": 34,
+    "pd-router-turn1-session-cap": 13
+  },
+  "latency_stats_s": {
+    "count": 4086.0,
+    "mean": 4.8753733304192455,
+    "p50": 1.754677688702941,
+    "p90": 12.66968655679375,
+    "p99": 28.717210091650486
+  },
+  "mechanisms": {
+    "kvcache-centric": 4449
+  },
+  "per_decode_load": {
+    "decode-0": 616,
+    "decode-1": 658,
+    "decode-2": 674,
+    "decode-3": 582,
+    "decode-4": 656,
+    "decode-5": 662,
+    "decode-6": 601
+  },
+  "per_prefill_load": {
+    "prefill-0": 4449
+  },
+  "prefill_request_priorities": {
+    "-100": 98,
+    "100": 2272
+  },
+  "re_prefill_count": 0,
+  "request_count": 4449,
+  "reuse_expected_count": 4397,
+  "reuse_observed_count": 4397,
+  "router_url": "http://127.0.0.1:8000",
+  "session_reset_count": 0,
+  "session_reused_count": 1716,
+  "total_actual_kv_transfer_blocks": 62123,
+  "total_cached_tokens": 100707229,
+  "total_kv_transfer_blocks": 105235,
+  "tpot_stats_s": {
+    "count": 4086.0,
+    "mean": 0.005829451223571163,
+    "p50": 0.005684156496173296,
+    "p90": 0.007143743503740225,
+    "p99": 0.008634991403068266
+  },
+  "trace_path": "outputs/qwen3-30b-tp1-v3-kvaware/kvcache-centric-kv-aware-worker-admission-20260428T095141Z/sampled-trace.jsonl",
+  "truncated_request_count": 42,
+  "ttft_stats_s": {
+    "count": 4086.0,
+    "mean": 3.5955862397812597,
+    "p50": 0.36274072993546724,
+    "p90": 10.972254231572151,
+    "p99": 27.433656523004174
+  }
+}
--- a/outputs/qwen3-30b-tp1-v3-kvaware/exp2_2p6d_kvc_kvaware_summary.json
+++ b/outputs/qwen3-30b-tp1-v3-kvaware/exp2_2p6d_kvc_kvaware_summary.json
@@ -0,0 +1,85 @@
+{
+  "actual_output_tokens_stats": {
+    "count": 4440.0,
+    "mean": 225.87972972972972,
+    "p50": 86.0,
+    "p90": 576.0,
+    "p99": 1347.0
+  },
+  "cache_hit_request_count": 4201,
+  "cached_tokens_stats": {
+    "count": 4449.0,
+    "mean": 24345.55787817487,
+    "p50": 21504.0,
+    "p90": 48792.0,
+    "p99": 69120.0
+  },
+  "decode_request_priorities": {},
+  "error_count": 9,
+  "execution_modes": {
+    "kvcache-centric": 9,
+    "kvcache-direct-to-d-session": 1358,
+    "pd-router-d-session-reseed": 12,
+    "pd-router-fallback-d-backpressure": 2,
+    "pd-router-fallback-large-append-seed-filter-early-turn": 52,
+    "pd-router-fallback-large-append-session-cap": 2902,
+    "pd-router-fallback-session-cap": 25,
+    "pd-router-large-append-reseed": 34,
+    "pd-router-large-append-reseed-after-eviction": 4,
+    "pd-router-turn1-d-backpressure": 1,
+    "pd-router-turn1-seed": 30,
+    "pd-router-turn1-session-cap": 20
+  },
+  "latency_stats_s": {
+    "count": 4440.0,
+    "mean": 3.582334662846558,
+    "p50": 1.517257746309042,
+    "p90": 9.225348330102861,
+    "p99": 18.70269925892353
+  },
+  "mechanisms": {
+    "kvcache-centric": 4449
+  },
+  "per_decode_load": {
+    "decode-0": 710,
+    "decode-1": 630,
+    "decode-2": 763,
+    "decode-3": 737,
+    "decode-4": 879,
+    "decode-5": 730
+  },
+  "per_prefill_load": {
+    "prefill-0": 2225,
+    "prefill-1": 2224
+  },
+  "prefill_request_priorities": {
+    "-100": 80,
+    "100": 3002
+  },
+  "re_prefill_count": 0,
+  "request_count": 4449,
+  "reuse_expected_count": 4397,
+  "reuse_observed_count": 4397,
+  "router_url": "http://127.0.0.1:8000",
+  "session_reset_count": 0,
+  "session_reused_count": 1358,
+  "total_actual_kv_transfer_blocks": 78979,
+  "total_cached_tokens": 108313387,
+  "total_kv_transfer_blocks": 105235,
+  "tpot_stats_s": {
+    "count": 4440.0,
+    "mean": 0.005882534704321737,
+    "p50": 0.005807478777200416,
+    "p90": 0.00712956755887717,
+    "p99": 0.008372141476720572
+  },
+  "trace_path": "outputs/qwen3-30b-tp1-v3-kvaware/kvcache-centric-kv-aware-worker-admission-20260428T104343Z/sampled-trace.jsonl",
+  "truncated_request_count": 42,
+  "ttft_stats_s": {
+    "count": 4440.0,
+    "mean": 2.2045287611873334,
+    "p50": 0.32809355948120356,
+    "p90": 6.947275545448065,
+    "p99": 16.705802395939827
+  }
+}
--- a/outputs/qwen3-30b-tp1-v3-kvaware/sweep_results.txt
+++ b/outputs/qwen3-30b-tp1-v3-kvaware/sweep_results.txt
@@ -0,0 +1,189 @@
+[2026-04-28 17:51:41] Starting TP1 v3 sweep (KVC with kv-aware policy)
+[2026-04-28 17:51:41] Model: /mnt/kzlin/workflow/pd-hybrid/simm-swe-bench/models/Qwen3-30B-A3B-Instruct-2507
+[2026-04-28 17:51:41] Trace: outputs/qwen35-swebench-50sess.jsonl (4449 requests, 52 sessions)
+[2026-04-28 17:51:41] Key change: --policy kv-aware for KVC (was --policy default in v2)
+[2026-04-28 17:51:41] 
+[2026-04-28 17:51:41] === [EXP1] 1P7D KVC kv-aware ===
+[2026-04-28 18:43:43] === exp1_1p7d_kvc_kvaware COMPLETED ===
+[2026-04-28 18:43:43] Summary:
+{
+  "actual_output_tokens_stats": {
+    "count": 4086.0,
+    "mean": 213.95105237395987,
+    "p50": 83.0,
+    "p90": 562.0,
+    "p99": 1346.0
+  },
+  "cache_hit_request_count": 3929,
+  "cached_tokens_stats": {
+    "count": 4449.0,
+    "mean": 22635.924702180266,
+    "p50": 20010.0,
+    "p90": 48002.0,
+    "p99": 65424.0
+  },
+  "decode_request_priorities": {},
+  "error_count": 363,
+  "execution_modes": {
+    "kvcache-centric": 363,
+    "kvcache-direct-to-d-session": 1716,
+    "pd-router-d-session-reseed": 23,
+    "pd-router-fallback-d-backpressure": 12,
+    "pd-router-fallback-large-append": 5,
+    "pd-router-fallback-large-append-seed-filter-early-turn": 51,
+    "pd-router-fallback-large-append-session-cap": 2148,
+    "pd-router-fallback-no-d-capacity": 7,
+    "pd-router-fallback-session-cap": 32,
+    "pd-router-large-append-reseed": 39,
+    "pd-router-large-append-reseed-after-eviction": 2,
+    "pd-router-turn1-d-backpressure": 1,
+    "pd-router-turn1-no-d-capacity": 3,
+    "pd-router-turn1-seed": 34,
+    "pd-router-turn1-session-cap": 13
+  },
+  "latency_stats_s": {
+    "count": 4086.0,
+    "mean": 4.8753733304192455,
+    "p50": 1.754677688702941,
+    "p90": 12.66968655679375,
+    "p99": 28.717210091650486
+  },
+  "mechanisms": {
+    "kvcache-centric": 4449
+  },
+  "per_decode_load": {
+    "decode-0": 616,
+    "decode-1": 658,
+    "decode-2": 674,
+    "decode-3": 582,
+    "decode-4": 656,
+    "decode-5": 662,
+    "decode-6": 601
+  },
+  "per_prefill_load": {
+    "prefill-0": 4449
+  },
+  "prefill_request_priorities": {
+    "-100": 98,
+    "100": 2272
+  },
+  "re_prefill_count": 0,
+  "request_count": 4449,
+  "reuse_expected_count": 4397,
+  "reuse_observed_count": 4397,
+  "router_url": "http://127.0.0.1:8000",
+  "session_reset_count": 0,
+  "session_reused_count": 1716,
+  "total_actual_kv_transfer_blocks": 62123,
+  "total_cached_tokens": 100707229,
+  "total_kv_transfer_blocks": 105235,
+  "tpot_stats_s": {
+    "count": 4086.0,
+    "mean": 0.005829451223571163,
+    "p50": 0.005684156496173296,
+    "p90": 0.007143743503740225,
+    "p99": 0.008634991403068266
+  },
+  "trace_path": "outputs/qwen3-30b-tp1-v3-kvaware/kvcache-centric-kv-aware-worker-admission-20260428T095141Z/sampled-trace.jsonl",
+  "truncated_request_count": 42,
+  "ttft_stats_s": {
+    "count": 4086.0,
+    "mean": 3.5955862397812597,
+    "p50": 0.36274072993546724,
+    "p90": 10.972254231572151,
+    "p99": 27.433656523004174
+  }
+}
+[2026-04-28 18:43:43] Saved to outputs/qwen3-30b-tp1-v3-kvaware/exp1_1p7d_kvc_kvaware_summary.json + exp1_1p7d_kvc_kvaware_metrics.jsonl
+[2026-04-28 18:43:43] 
+[2026-04-28 18:43:43] === [EXP2] 2P6D KVC kv-aware ===
+[2026-04-28 19:30:38] === exp2_2p6d_kvc_kvaware COMPLETED ===
+[2026-04-28 19:30:38] Summary:
+{
+  "actual_output_tokens_stats": {
+    "count": 4440.0,
+    "mean": 225.87972972972972,
+    "p50": 86.0,
+    "p90": 576.0,
+    "p99": 1347.0
+  },
+  "cache_hit_request_count": 4201,
+  "cached_tokens_stats": {
+    "count": 4449.0,
+    "mean": 24345.55787817487,
+    "p50": 21504.0,
+    "p90": 48792.0,
+    "p99": 69120.0
+  },
+  "decode_request_priorities": {},
+  "error_count": 9,
+  "execution_modes": {
+    "kvcache-centric": 9,
+    "kvcache-direct-to-d-session": 1358,
+    "pd-router-d-session-reseed": 12,
+    "pd-router-fallback-d-backpressure": 2,
+    "pd-router-fallback-large-append-seed-filter-early-turn": 52,
+    "pd-router-fallback-large-append-session-cap": 2902,
+    "pd-router-fallback-session-cap": 25,
+    "pd-router-large-append-reseed": 34,
+    "pd-router-large-append-reseed-after-eviction": 4,
+    "pd-router-turn1-d-backpressure": 1,
+    "pd-router-turn1-seed": 30,
+    "pd-router-turn1-session-cap": 20
+  },
+  "latency_stats_s": {
+    "count": 4440.0,
+    "mean": 3.582334662846558,
+    "p50": 1.517257746309042,
+    "p90": 9.225348330102861,
+    "p99": 18.70269925892353
+  },
+  "mechanisms": {
+    "kvcache-centric": 4449
+  },
+  "per_decode_load": {
+    "decode-0": 710,
+    "decode-1": 630,
+    "decode-2": 763,
+    "decode-3": 737,
+    "decode-4": 879,
+    "decode-5": 730
+  },
+  "per_prefill_load": {
+    "prefill-0": 2225,
+    "prefill-1": 2224
+  },
+  "prefill_request_priorities": {
+    "-100": 80,
+    "100": 3002
+  },
+  "re_prefill_count": 0,
+  "request_count": 4449,
+  "reuse_expected_count": 4397,
+  "reuse_observed_count": 4397,
+  "router_url": "http://127.0.0.1:8000",
+  "session_reset_count": 0,
+  "session_reused_count": 1358,
+  "total_actual_kv_transfer_blocks": 78979,
+  "total_cached_tokens": 108313387,
+  "total_kv_transfer_blocks": 105235,
+  "tpot_stats_s": {
+    "count": 4440.0,
+    "mean": 0.005882534704321737,
+    "p50": 0.005807478777200416,
+    "p90": 0.00712956755887717,
+    "p99": 0.008372141476720572
+  },
+  "trace_path": "outputs/qwen3-30b-tp1-v3-kvaware/kvcache-centric-kv-aware-worker-admission-20260428T104343Z/sampled-trace.jsonl",
+  "truncated_request_count": 42,
+  "ttft_stats_s": {
+    "count": 4440.0,
+    "mean": 2.2045287611873334,
+    "p50": 0.32809355948120356,
+    "p90": 6.947275545448065,
+    "p99": 16.705802395939827
+  }
+}
+[2026-04-28 19:30:38] Saved to outputs/qwen3-30b-tp1-v3-kvaware/exp2_2p6d_kvc_kvaware_summary.json + exp2_2p6d_kvc_kvaware_metrics.jsonl
+[2026-04-28 19:30:38] 
+[2026-04-28 19:30:38] === ALL TP1 V3 SWEEP EXPERIMENTS DONE ===
--- a/outputs/qwen3-30b-tp1-v4-cap16/exp1_1p7d_kvc_cap16_summary.json
+++ b/outputs/qwen3-30b-tp1-v4-cap16/exp1_1p7d_kvc_cap16_summary.json
@@ -0,0 +1,88 @@
+{
+  "actual_output_tokens_stats": {
+    "count": 4014.0,
+    "mean": 215.048081714001,
+    "p50": 83.0,
+    "p90": 570.0,
+    "p99": 1343.0
+  },
+  "cache_hit_request_count": 3865,
+  "cached_tokens_stats": {
+    "count": 4449.0,
+    "mean": 21373.60867610699,
+    "p50": 18429.0,
+    "p90": 45643.0,
+    "p99": 65088.0
+  },
+  "decode_request_priorities": {},
+  "error_count": 435,
+  "execution_modes": {
+    "kvcache-centric": 435,
+    "kvcache-direct-to-d-session": 2180,
+    "pd-router-d-session-reseed": 44,
+    "pd-router-d-session-reseed-after-eviction": 1,
+    "pd-router-fallback-d-backpressure": 36,
+    "pd-router-fallback-large-append": 35,
+    "pd-router-fallback-large-append-seed-filter-early-turn": 52,
+    "pd-router-fallback-large-append-session-cap": 1500,
+    "pd-router-fallback-no-d-capacity": 13,
+    "pd-router-fallback-session-cap": 43,
+    "pd-router-large-append-reseed": 55,
+    "pd-router-large-append-reseed-after-eviction": 3,
+    "pd-router-turn1-d-backpressure": 1,
+    "pd-router-turn1-no-d-capacity": 5,
+    "pd-router-turn1-seed": 46
+  },
+  "latency_stats_s": {
+    "count": 4014.0,
+    "mean": 4.214657033050009,
+    "p50": 1.0827504023909569,
+    "p90": 13.380241627804935,
+    "p99": 24.453291333280504
+  },
+  "mechanisms": {
+    "kvcache-centric": 4449
+  },
+  "per_decode_load": {
+    "decode-0": 690,
+    "decode-1": 599,
+    "decode-2": 660,
+    "decode-3": 584,
+    "decode-4": 606,
+    "decode-5": 646,
+    "decode-6": 664
+  },
+  "per_prefill_load": {
+    "prefill-0": 4449
+  },
+  "prefill_request_priorities": {
+    "-100": 149,
+    "100": 1685
+  },
+  "re_prefill_count": 0,
+  "request_count": 4449,
+  "reuse_expected_count": 4397,
+  "reuse_observed_count": 4397,
+  "router_url": "http://127.0.0.1:8000",
+  "session_reset_count": 0,
+  "session_reused_count": 2180,
+  "total_actual_kv_transfer_blocks": 52857,
+  "total_cached_tokens": 95091185,
+  "total_kv_transfer_blocks": 105235,
+  "tpot_stats_s": {
+    "count": 4014.0,
+    "mean": 0.005804301410418847,
+    "p50": 0.005607025208882987,
+    "p90": 0.007293824862528552,
+    "p99": 0.008864479259402893
+  },
+  "trace_path": "outputs/qwen3-30b-tp1-v4-cap16/kvcache-centric-kv-aware-worker-admission-20260428T125022Z/sampled-trace.jsonl",
+  "truncated_request_count": 43,
+  "ttft_stats_s": {
+    "count": 4014.0,
+    "mean": 2.915135478307124,
+    "p50": 0.05643345229327679,
+    "p90": 11.900803190656006,
+    "p99": 22.758968392387033
+  }
+}
--- a/outputs/qwen3-30b-tp1-v4-cap16/exp2_2p6d_kvc_cap16_summary.json
+++ b/outputs/qwen3-30b-tp1-v4-cap16/exp2_2p6d_kvc_cap16_summary.json
@@ -0,0 +1,86 @@
+{
+  "actual_output_tokens_stats": {
+    "count": 4046.0,
+    "mean": 224.65002471576867,
+    "p50": 84.0,
+    "p90": 576.0,
+    "p99": 1349.0
+  },
+  "cache_hit_request_count": 3925,
+  "cached_tokens_stats": {
+    "count": 4449.0,
+    "mean": 22852.7439874129,
+    "p50": 19584.0,
+    "p90": 49009.0,
+    "p99": 67320.0
+  },
+  "decode_request_priorities": {},
+  "error_count": 403,
+  "execution_modes": {
+    "kvcache-centric": 403,
+    "kvcache-direct-to-d-session": 2348,
+    "pd-router-d-session-reseed": 28,
+    "pd-router-fallback-d-backpressure": 7,
+    "pd-router-fallback-large-append": 68,
+    "pd-router-fallback-large-append-seed-filter-early-turn": 45,
+    "pd-router-fallback-large-append-session-cap": 1403,
+    "pd-router-fallback-no-d-capacity": 9,
+    "pd-router-fallback-session-cap": 25,
+    "pd-router-large-append-reseed": 57,
+    "pd-router-large-append-reseed-after-eviction": 6,
+    "pd-router-turn1-no-d-capacity": 1,
+    "pd-router-turn1-seed": 49
+  },
+  "latency_stats_s": {
+    "count": 4046.0,
+    "mean": 2.505981629502371,
+    "p50": 0.8372491216287017,
+    "p90": 6.5139341270551085,
+    "p99": 18.335972285829484
+  },
+  "mechanisms": {
+    "kvcache-centric": 4449
+  },
+  "per_decode_load": {
+    "decode-0": 767,
+    "decode-1": 680,
+    "decode-2": 906,
+    "decode-3": 818,
+    "decode-4": 800,
+    "decode-5": 478
+  },
+  "per_prefill_load": {
+    "prefill-0": 2225,
+    "prefill-1": 2224
+  },
+  "prefill_request_priorities": {
+    "-100": 140,
+    "100": 1558
+  },
+  "re_prefill_count": 0,
+  "request_count": 4449,
+  "reuse_expected_count": 4397,
+  "reuse_observed_count": 4397,
+  "router_url": "http://127.0.0.1:8000",
+  "session_reset_count": 0,
+  "session_reused_count": 2348,
+  "total_actual_kv_transfer_blocks": 50727,
+  "total_cached_tokens": 101671858,
+  "total_kv_transfer_blocks": 105235,
+  "tpot_stats_s": {
+    "count": 4046.0,
+    "mean": 0.005708743129332261,
+    "p50": 0.005565466725497757,
+    "p90": 0.006912594398356141,
+    "p99": 0.008102089307750717
+  },
+  "trace_path": "outputs/qwen3-30b-tp1-v4-cap16/kvcache-centric-kv-aware-worker-admission-20260428T134057Z/sampled-trace.jsonl",
+  "truncated_request_count": 36,
+  "ttft_stats_s": {
+    "count": 4046.0,
+    "mean": 1.1653790952959129,
+    "p50": 0.05140436999499798,
+    "p90": 2.6447059931233525,
+    "p99": 15.121314341202378
+  }
+}
--- a/outputs/qwen3-30b-tp1-v4-cap16/sweep_results.txt
+++ b/outputs/qwen3-30b-tp1-v4-cap16/sweep_results.txt
@@ -0,0 +1,190 @@
+[2026-04-28 20:50:21] Starting TP1 v4 sweep (KVC kv-aware, session soft_cap raised 4->16)
+[2026-04-28 20:50:21] Model: /mnt/kzlin/workflow/pd-hybrid/simm-swe-bench/models/Qwen3-30B-A3B-Instruct-2507
+[2026-04-28 20:50:21] Trace: outputs/qwen35-swebench-50sess.jsonl (4449 requests, 52 sessions)
+[2026-04-28 20:50:21] Key change: _decode_session_soft_cap now min(16, ...) instead of min(4, ...)
+[2026-04-28 20:50:21] 
+[2026-04-28 20:50:21] === [EXP1] 1P7D KVC kv-aware cap=16 ===
+[2026-04-28 21:40:57] === exp1_1p7d_kvc_cap16 COMPLETED ===
+[2026-04-28 21:40:57] Summary:
+{
+  "actual_output_tokens_stats": {
+    "count": 4014.0,
+    "mean": 215.048081714001,
+    "p50": 83.0,
+    "p90": 570.0,
+    "p99": 1343.0
+  },
+  "cache_hit_request_count": 3865,
+  "cached_tokens_stats": {
+    "count": 4449.0,
+    "mean": 21373.60867610699,
+    "p50": 18429.0,
+    "p90": 45643.0,
+    "p99": 65088.0
+  },
+  "decode_request_priorities": {},
+  "error_count": 435,
+  "execution_modes": {
+    "kvcache-centric": 435,
+    "kvcache-direct-to-d-session": 2180,
+    "pd-router-d-session-reseed": 44,
+    "pd-router-d-session-reseed-after-eviction": 1,
+    "pd-router-fallback-d-backpressure": 36,
+    "pd-router-fallback-large-append": 35,
+    "pd-router-fallback-large-append-seed-filter-early-turn": 52,
+    "pd-router-fallback-large-append-session-cap": 1500,
+    "pd-router-fallback-no-d-capacity": 13,
+    "pd-router-fallback-session-cap": 43,
+    "pd-router-large-append-reseed": 55,
+    "pd-router-large-append-reseed-after-eviction": 3,
+    "pd-router-turn1-d-backpressure": 1,
+    "pd-router-turn1-no-d-capacity": 5,
+    "pd-router-turn1-seed": 46
+  },
+  "latency_stats_s": {
+    "count": 4014.0,
+    "mean": 4.214657033050009,
+    "p50": 1.0827504023909569,
+    "p90": 13.380241627804935,
+    "p99": 24.453291333280504
+  },
+  "mechanisms": {
+    "kvcache-centric": 4449
+  },
+  "per_decode_load": {
+    "decode-0": 690,
+    "decode-1": 599,
+    "decode-2": 660,
+    "decode-3": 584,
+    "decode-4": 606,
+    "decode-5": 646,
+    "decode-6": 664
+  },
+  "per_prefill_load": {
+    "prefill-0": 4449
+  },
+  "prefill_request_priorities": {
+    "-100": 149,
+    "100": 1685
+  },
+  "re_prefill_count": 0,
+  "request_count": 4449,
+  "reuse_expected_count": 4397,
+  "reuse_observed_count": 4397,
+  "router_url": "http://127.0.0.1:8000",
+  "session_reset_count": 0,
+  "session_reused_count": 2180,
+  "total_actual_kv_transfer_blocks": 52857,
+  "total_cached_tokens": 95091185,
+  "total_kv_transfer_blocks": 105235,
+  "tpot_stats_s": {
+    "count": 4014.0,
+    "mean": 0.005804301410418847,
+    "p50": 0.005607025208882987,
+    "p90": 0.007293824862528552,
+    "p99": 0.008864479259402893
+  },
+  "trace_path": "outputs/qwen3-30b-tp1-v4-cap16/kvcache-centric-kv-aware-worker-admission-20260428T125022Z/sampled-trace.jsonl",
+  "truncated_request_count": 43,
+  "ttft_stats_s": {
+    "count": 4014.0,
+    "mean": 2.915135478307124,
+    "p50": 0.05643345229327679,
+    "p90": 11.900803190656006,
+    "p99": 22.758968392387033
+  }
+}
+[2026-04-28 21:40:57] Saved to outputs/qwen3-30b-tp1-v4-cap16/exp1_1p7d_kvc_cap16_summary.json + exp1_1p7d_kvc_cap16_metrics.jsonl
+[2026-04-28 21:40:57] 
+[2026-04-28 21:40:57] === [EXP2] 2P6D KVC kv-aware cap=16 ===
+[2026-04-28 22:27:53] === exp2_2p6d_kvc_cap16 COMPLETED ===
+[2026-04-28 22:27:53] Summary:
+{
+  "actual_output_tokens_stats": {
+    "count": 4046.0,
+    "mean": 224.65002471576867,
+    "p50": 84.0,
+    "p90": 576.0,
+    "p99": 1349.0
+  },
+  "cache_hit_request_count": 3925,
+  "cached_tokens_stats": {
+    "count": 4449.0,
+    "mean": 22852.7439874129,
+    "p50": 19584.0,
+    "p90": 49009.0,
+    "p99": 67320.0
+  },
+  "decode_request_priorities": {},
+  "error_count": 403,
+  "execution_modes": {
+    "kvcache-centric": 403,
+    "kvcache-direct-to-d-session": 2348,
+    "pd-router-d-session-reseed": 28,
+    "pd-router-fallback-d-backpressure": 7,
+    "pd-router-fallback-large-append": 68,
+    "pd-router-fallback-large-append-seed-filter-early-turn": 45,
+    "pd-router-fallback-large-append-session-cap": 1403,
+    "pd-router-fallback-no-d-capacity": 9,
+    "pd-router-fallback-session-cap": 25,
+    "pd-router-large-append-reseed": 57,
+    "pd-router-large-append-reseed-after-eviction": 6,
+    "pd-router-turn1-no-d-capacity": 1,
+    "pd-router-turn1-seed": 49
+  },
+  "latency_stats_s": {
+    "count": 4046.0,
+    "mean": 2.505981629502371,
+    "p50": 0.8372491216287017,
+    "p90": 6.5139341270551085,
+    "p99": 18.335972285829484
+  },
+  "mechanisms": {
+    "kvcache-centric": 4449
+  },
+  "per_decode_load": {
+    "decode-0": 767,
+    "decode-1": 680,
+    "decode-2": 906,
+    "decode-3": 818,
+    "decode-4": 800,
+    "decode-5": 478
+  },
+  "per_prefill_load": {
+    "prefill-0": 2225,
+    "prefill-1": 2224
+  },
+  "prefill_request_priorities": {
+    "-100": 140,
+    "100": 1558
+  },
+  "re_prefill_count": 0,
+  "request_count": 4449,
+  "reuse_expected_count": 4397,
+  "reuse_observed_count": 4397,
+  "router_url": "http://127.0.0.1:8000",
+  "session_reset_count": 0,
+  "session_reused_count": 2348,
+  "total_actual_kv_transfer_blocks": 50727,
+  "total_cached_tokens": 101671858,
+  "total_kv_transfer_blocks": 105235,
+  "tpot_stats_s": {
+    "count": 4046.0,
+    "mean": 0.005708743129332261,
+    "p50": 0.005565466725497757,
+    "p90": 0.006912594398356141,
+    "p99": 0.008102089307750717
+  },
+  "trace_path": "outputs/qwen3-30b-tp1-v4-cap16/kvcache-centric-kv-aware-worker-admission-20260428T134057Z/sampled-trace.jsonl",
+  "truncated_request_count": 36,
+  "ttft_stats_s": {
+    "count": 4046.0,
+    "mean": 1.1653790952959129,
+    "p50": 0.05140436999499798,
+    "p90": 2.6447059931233525,
+    "p99": 15.121314341202378
+  }
+}
+[2026-04-28 22:27:53] Saved to outputs/qwen3-30b-tp1-v4-cap16/exp2_2p6d_kvc_cap16_summary.json + exp2_2p6d_kvc_cap16_metrics.jsonl
+[2026-04-28 22:27:53] 
+[2026-04-28 22:27:53] === ALL TP1 V4 SWEEP EXPERIMENTS DONE ===
--- a/scripts/analysis/analyze_backpressure_smoke.py
+++ b/scripts/analysis/analyze_backpressure_smoke.py
@@ -0,0 +1,191 @@
+#!/usr/bin/env python3
+"""Analyze backpressure smoke sweep outputs.
+
+For each run dir with a `request-metrics.jsonl` and the new `structural/`
+subdir (admission-events.jsonl, backpressure-events.jsonl,
+session-d-binding.jsonl), report:
+
+- Headline (errors, latency, ttft, direct-to-D rate)
+- Backpressure pause histogram (count, p50/p90 sleep, total pause time per D)
+- Admission probe stats (RPC count, mean RTT, queue_depth distribution,
+  pause_ms distribution)
+- Session pinning (distinct D per session, bimodal direct-to-D rate)
+"""
+from __future__ import annotations
+
+import argparse
+import json
+import statistics
+from collections import Counter, defaultdict
+from pathlib import Path
+
+
+def load_jsonl(path: Path) -> list[dict]:
+    if not path.exists():
+        return []
+    return [json.loads(line) for line in path.read_text(encoding="utf-8").split("\n") if line.strip()]
+
+
+def summarize_run(run_dir: Path) -> dict:
+    metrics_path = next(run_dir.rglob("request-metrics.jsonl"), None)
+    if metrics_path is None:
+        return {"run_dir": str(run_dir), "error": "no request-metrics.jsonl"}
+
+    summary_path = metrics_path.with_suffix(metrics_path.suffix + ".summary.json")
+    summary = (
+        json.loads(summary_path.read_text(encoding="utf-8")) if summary_path.exists() else {}
+    )
+
+    structural_dir = run_dir / "structural"
+    if not structural_dir.exists():
+        # try metrics dir's parent / structural
+        structural_dir = metrics_path.parent / "structural"
+
+    admission_events = load_jsonl(structural_dir / "admission-events.jsonl")
+    backpressure_events = load_jsonl(structural_dir / "backpressure-events.jsonl")
+    binding_events = load_jsonl(structural_dir / "session-d-binding.jsonl")
+
+    out: dict = {"run_dir": str(run_dir)}
+
+    # Headline metrics from summary.json
+    out["request_count"] = summary.get("request_count")
+    out["error_count"] = summary.get("error_count")
+    out["latency"] = summary.get("latency_stats_s")
+    out["ttft"] = summary.get("ttft_stats_s")
+    out["execution_modes"] = summary.get("execution_modes")
+    out["per_decode_load"] = summary.get("per_decode_load")
+    out["per_prefill_load"] = summary.get("per_prefill_load")
+
+    # Direct-to-D rate from execution_modes
+    em = summary.get("execution_modes", {}) or {}
+    direct = em.get("kvcache-direct-to-d-session", 0)
+    total = sum(em.values()) or 1
+    out["direct_to_d_rate"] = direct / total
+
+    # Session pinning
+    bind_per_session: dict[str, set[int]] = defaultdict(set)
+    for ev in binding_events:
+        bind_per_session[ev["session_id"]].add(ev["decode_worker_index"])
+    if bind_per_session:
+        out["session_count"] = len(bind_per_session)
+        out["avg_distinct_d_per_session"] = (
+            sum(len(v) for v in bind_per_session.values()) / len(bind_per_session)
+        )
+    else:
+        out["session_count"] = 0
+        out["avg_distinct_d_per_session"] = None
+
+    # Direct-to-D rate per session (bimodal check)
+    records = load_jsonl(metrics_path)
+    sess_records: dict[str, list[dict]] = defaultdict(list)
+    for r in records:
+        sess_records[r["session_id"]].append(r)
+    rates = []
+    for sid, turns in sess_records.items():
+        ndir = sum(
+            1 for t in turns if t.get("execution_mode") == "kvcache-direct-to-d-session"
+        )
+        rates.append(ndir / len(turns))
+    if rates:
+        buckets = [0, 0, 0, 0, 0]
+        for r in rates:
+            buckets[min(4, int(r * 5))] += 1
+        out["direct_to_d_rate_buckets"] = {
+            "0-20%": buckets[0],
+            "20-40%": buckets[1],
+            "40-60%": buckets[2],
+            "60-80%": buckets[3],
+            "80-100%": buckets[4],
+        }
+
+    # Backpressure events
+    if backpressure_events:
+        sleeps = [ev["sleep_s"] for ev in backpressure_events]
+        out["backpressure"] = {
+            "event_count": len(backpressure_events),
+            "total_sleep_s": round(sum(sleeps), 2),
+            "sleep_p50_s": round(statistics.median(sleeps), 4),
+            "sleep_p90_s": round(
+                sorted(sleeps)[int(len(sleeps) * 0.9)] if sleeps else 0, 4
+            ),
+            "events_per_d": dict(
+                Counter(ev["server_url"] for ev in backpressure_events).most_common()
+            ),
+        }
+    else:
+        out["backpressure"] = {"event_count": 0, "note": "no backpressure events"}
+
+    # Admission probe stats
+    if admission_events:
+        rtts = [ev["rtt_s"] for ev in admission_events]
+        depths = [ev.get("queue_depth", 0) for ev in admission_events]
+        pauses = [ev.get("recommended_pause_ms", 0) for ev in admission_events]
+        out["admission_probes"] = {
+            "count": len(admission_events),
+            "mean_rtt_s": round(sum(rtts) / len(rtts), 4),
+            "p99_rtt_s": round(sorted(rtts)[int(len(rtts) * 0.99)], 4),
+            "queue_depth_p50": int(statistics.median(depths)),
+            "queue_depth_p90": int(sorted(depths)[int(len(depths) * 0.9)]),
+            "queue_depth_max": max(depths),
+            "pause_ms_p50": int(statistics.median(pauses)),
+            "pause_ms_p90": int(sorted(pauses)[int(len(pauses) * 0.9)]),
+            "pause_ms_max": max(pauses),
+            "nonzero_pause_count": sum(1 for p in pauses if p > 0),
+            "by_reason": dict(
+                Counter(ev.get("reason") or "ok" for ev in admission_events).most_common()
+            ),
+        }
+
+    return out
+
+
+def main() -> None:
+    ap = argparse.ArgumentParser()
+    ap.add_argument("sweep_root", type=Path)
+    ap.add_argument("--json", action="store_true", help="emit JSON only")
+    args = ap.parse_args()
+
+    summaries = []
+    for run_dir in sorted(args.sweep_root.iterdir()):
+        if not run_dir.is_dir():
+            continue
+        summary = summarize_run(run_dir)
+        summaries.append(summary)
+
+    if args.json:
+        print(json.dumps(summaries, indent=2))
+        return
+
+    for s in summaries:
+        print(f"\n{'=' * 70}")
+        print(f"  {s['run_dir']}")
+        print(f"{'=' * 70}")
+        if "error" in s:
+            print(f"  ERROR: {s['error']}")
+            continue
+        print(f"  reqs={s.get('request_count')} errors={s.get('error_count')}")
+        if s.get("latency"):
+            lt = s["latency"]
+            print(
+                f"  latency: mean={lt.get('mean'):.3f} "
+                f"p50={lt.get('p50'):.3f} p90={lt.get('p90'):.3f} p99={lt.get('p99'):.3f}"
+            )
+        if s.get("ttft"):
+            tt = s["ttft"]
+            print(
+                f"  ttft:    mean={tt.get('mean'):.3f} "
+                f"p50={tt.get('p50'):.3f} p90={tt.get('p90'):.3f}"
+            )
+        print(f"  direct_to_d_rate: {s.get('direct_to_d_rate', 0) * 100:.1f}%")
+        print(f"  sessions: {s.get('session_count')} | "
+              f"avg distinct-D-per-session: {s.get('avg_distinct_d_per_session')}")
+        if s.get("direct_to_d_rate_buckets"):
+            print(f"  direct-to-D distribution by session: {s['direct_to_d_rate_buckets']}")
+        if s.get("backpressure"):
+            print(f"  backpressure: {s['backpressure']}")
+        if s.get("admission_probes"):
+            print(f"  admission probes: {s['admission_probes']}")
+
+
+if __name__ == "__main__":
+    main()
--- a/scripts/analysis/analyze_errors.py
+++ b/scripts/analysis/analyze_errors.py
@@ -0,0 +1,83 @@
+#!/usr/bin/env python3
+"""Deep dive into v3 vs. v4 errors: which path, which D, which session, which turn."""
+import json
+from collections import Counter, defaultdict
+from pathlib import Path
+
+
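+BASE = Path(__file__).resolve().parents[2] / "outputs" / "qwen3-30b-tp1-v4-cap16"  # metrics live under outputs/, not next to this script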
+
+def load_rows(jsonl_path):
+    rows = []
+    with open(jsonl_path) as f:
+        for line in f:
+            rows.append(json.loads(line))
+    return rows
+
+# Compare v3 and v4 errors
+for label, path in [
+    ("v3 1P7D", BASE.parent / "qwen3-30b-tp1-v3-kvaware/exp1_1p7d_kvc_kvaware_metrics.jsonl"),
+    ("v4 1P7D", BASE / "exp1_1p7d_kvc_cap16_metrics.jsonl"),
+    ("v3 2P6D", BASE.parent / "qwen3-30b-tp1-v3-kvaware/exp2_2p6d_kvc_kvaware_metrics.jsonl"),
+    ("v4 2P6D", BASE / "exp2_2p6d_kvc_cap16_metrics.jsonl"),
+]:
+    if not path.exists():
+        print(f"\nSKIP {label}: {path} not found")
+        continue
+    rows = load_rows(path)
+    err = [r for r in rows if r.get("error") is not None]
+    print(f"\n========== {label} ({len(err)} errors / {len(rows)} total = {len(err)/max(len(rows), 1)*100:.1f}%) ==========")
+
+    # Error finish_reason distribution
+    fr_counter = Counter()
+    for r in err:
+        fr = str(r.get("finish_reason") or r.get("error") or "?")
+        fr_counter[fr[:80]] += 1
+    print("finish_reason distribution:")
+    for fr, cnt in fr_counter.most_common():
+        print(f"  {cnt:>4}x  {fr}")
+
+    # Errors by execution mode (these are aborted before mode assignment usually)
+    mode_counter = Counter(r.get("execution_mode", "?") for r in err)
+    print("\nerror by execution_mode:")
+    for mode, cnt in mode_counter.most_common():
+        print(f"  {cnt:>4}x  {mode}")
+
+    # Errors per D worker
+    dw_counter = Counter(r.get("assigned_decode_node", "?") for r in err)
+    print("\nerror per assigned_decode_node:")
+    for dw, cnt in dw_counter.most_common():
+        print(f"  {cnt:>4}x  {dw}")
+
+    # Errors by turn distribution
+    turn_counter = Counter(r.get("turn_id", -1) for r in err)
+    early = sum(c for t, c in turn_counter.items() if t <= 5)
+    mid = sum(c for t, c in turn_counter.items() if 5 < t <= 30)
+    late = sum(c for t, c in turn_counter.items() if t > 30)
+    print(f"\nerror by turn: early(0-5)={early} mid(6-30)={mid} late(31+)={late}")
+
+    # Per-session error rate
+    per_sess_err = defaultdict(int)
+    per_sess_total = defaultdict(int)
+    for r in rows:
+        per_sess_total[r["session_id"]] += 1
+        if r.get("error") is not None:
+            per_sess_err[r["session_id"]] += 1
+    sess_with_err = [(sid, per_sess_err[sid], per_sess_total[sid]) for sid in per_sess_err]
+    sess_with_err.sort(key=lambda x: -x[1])
+    print("\ntop 5 sessions by error count:")
+    for sid, e, t in sess_with_err[:5]:
+        print(f"  session {sid}: {e}/{t} errors ({e/t*100:.0f}%)")
+
+    # Errors timeline: are they bursty?
+    err_ts = sorted([r.get("trace_timestamp_s", 0) for r in err])
+    if err_ts:
+        first_ts = err_ts[0]
+        last_ts = err_ts[-1]
+        all_ts = sorted([r.get("trace_timestamp_s", 0) for r in rows])
+        first_all = all_ts[0]
+        last_all = all_ts[-1]
+        run_duration = last_all - first_all
+        err_first_pct = (err_ts[0] - first_all) / run_duration * 100 if run_duration > 0 else 0
+        err_last_pct = (err_ts[-1] - first_all) / run_duration * 100 if run_duration > 0 else 0
+        print(f"\nerror time range (% of run): {err_first_pct:.1f}% - {err_last_pct:.1f}%")
--- a/scripts/analysis/analyze_pool_timeseries.py
+++ b/scripts/analysis/analyze_pool_timeseries.py
@@ -0,0 +1,346 @@
+#!/usr/bin/env python3
+"""Analyze d-pool-timeseries.jsonl produced by --pool-poll-interval-s.
+
+Answers v6's main question: where is D's KV pool actually spent?
+
+For each decode worker, decomposes capacity over the run wall-clock into:
+  - resident_held_active   = held - idle_evictable      (sessions in active use)
+  - resident_held_idle     = idle_evictable             (sessions kept around but evictable)
+  - prefill_backup_or_other = capacity - held - available (everything else: backup blocks,
+                                                          in-flight transfers, fragmentation)
+  - free_available         = available
+
+Also reports session residency churn (how many distinct sessions ever resided per D, and
+how often a session bounced between workers — a strong starvation signal).
+
+Usage:
+  python scripts/analysis/analyze_pool_timeseries.py <run_dir>
+or
+  python scripts/analysis/analyze_pool_timeseries.py <path/to/d-pool-timeseries.jsonl>
+
+Output: human-readable text. Add --json to also print a machine-readable summary.
+"""
+from __future__ import annotations
+
+import argparse
+import json
+import statistics
+from collections import Counter, defaultdict
+from pathlib import Path
+from typing import Any
+
+
+def _load_jsonl(path: Path) -> list[dict[str, Any]]:
+    rows: list[dict[str, Any]] = []
+    with path.open() as fh:
+        for line in fh:
+            line = line.strip()
+            if not line:
+                continue
+            rows.append(json.loads(line))
+    return rows
+
+
+def _resolve_input(path: Path) -> Path:
+    if path.is_file():
+        return path
+    if path.is_dir():
+        candidate = path / "d-pool-timeseries.jsonl"
+        if candidate.is_file():
+            return candidate
+        raise FileNotFoundError(
+            f"{candidate} not found; pass the file directly or a run dir containing it."
+        )
+    raise FileNotFoundError(path)
+
+
+def _percentile(values: list[float], p: float) -> float:
+    if not values:
+        return 0.0
+    s = sorted(values)
+    idx = min(len(s) - 1, max(0, int(round((len(s) - 1) * p))))
+    return s[idx]
+
+
+def _fmt_tokens(n: float) -> str:
+    if n >= 1_000_000:
+        return f"{n / 1_000_000:.2f}M"
+    if n >= 1_000:
+        return f"{n / 1_000:.1f}K"
+    return f"{int(n)}"
+
+
+def _fmt_pct(n: float, total: float) -> str:
+    if total <= 0:
+        return "  -  "
+    return f"{100 * n / total:5.1f}%"
+
+
+def analyze(timeseries_path: Path) -> dict[str, Any]:
+    rows = _load_jsonl(timeseries_path)
+    if not rows:
+        raise ValueError(f"empty timeseries: {timeseries_path}")
+
+    by_worker: dict[str, list[dict[str, Any]]] = defaultdict(list)
+    for row in rows:
+        if row.get("error") and "session_cache_enabled" not in row:
+            # poller failed at this tick — skip
+            continue
+        wid = row.get("worker_id") or "?"
+        by_worker[wid].append(row)
+
+    summary: dict[str, Any] = {
+        "timeseries_path": str(timeseries_path),
+        "total_rows": len(rows),
+        "tick_count": len(by_worker[next(iter(by_worker))]) if by_worker else 0,
+        "wall_s_span": (
+            max(r.get("wall_s", 0.0) for r in rows)
+            - min(r.get("wall_s", 0.0) for r in rows)
+        ),
+        "workers": {},
+    }
+
+    print(f"\n=== Pool timeseries: {timeseries_path}")
+    print(
+        f"  rows={summary['total_rows']}  workers={len(by_worker)}  "
+        f"span={summary['wall_s_span']:.1f}s"
+    )
+
+    # Print per-worker decomposition table
+    header = (
+        f"{'worker':<12} {'role':<8} {'cap':>8} | "
+        f"{'avg_active':>10} {'avg_idle':>10} {'avg_other':>10} {'avg_free':>10} | "
+        f"{'p90_held':>10} {'max_held':>10} {'p90_avail':>10}"
+    )
+    print(header)
+    print("-" * len(header))
+
+    for wid in sorted(by_worker.keys()):
+        ws = by_worker[wid]
+        role = ws[0].get("worker_role", "?")
+        cap_vals = [int(r.get("capacity_tokens") or 0) for r in ws]
+        held_vals = [int(r.get("held_tokens") or 0) for r in ws]
+        avail_vals = [int(r.get("available_tokens") or 0) for r in ws]
+        idle_vals = [int(r.get("idle_evictable_tokens") or 0) for r in ws]
+        # active = held - idle (sessions in active use)
+        active_vals = [max(0, h - i) for h, i in zip(held_vals, idle_vals)]
+        # other = capacity - held - available (prefill backup blocks, in-flight, fragmentation)
+        other_vals = [
+            max(0, c - h - a) for c, h, a in zip(cap_vals, held_vals, avail_vals)
+        ]
+        cap = max(cap_vals) if cap_vals else 0
+
+        avg_active = statistics.fmean(active_vals) if active_vals else 0.0
+        avg_idle = statistics.fmean(idle_vals) if idle_vals else 0.0
+        avg_other = statistics.fmean(other_vals) if other_vals else 0.0
+        avg_avail = statistics.fmean(avail_vals) if avail_vals else 0.0
+
+        p90_held = _percentile([float(v) for v in held_vals], 0.90)
+        max_held = max(held_vals) if held_vals else 0
+        p90_avail = _percentile([float(v) for v in avail_vals], 0.90)
+
+        sess_counts = [int(r.get("session_count") or 0) for r in ws]
+        resident_counts = [int(r.get("resident_session_count") or 0) for r in ws]
+
+        print(
+            f"{wid:<12} {role:<8} {_fmt_tokens(cap):>8} | "
+            f"{_fmt_tokens(avg_active):>4} {_fmt_pct(avg_active, cap):>5} "
+            f"{_fmt_tokens(avg_idle):>4} {_fmt_pct(avg_idle, cap):>5} "
+            f"{_fmt_tokens(avg_other):>4} {_fmt_pct(avg_other, cap):>5} "
+            f"{_fmt_tokens(avg_avail):>4} {_fmt_pct(avg_avail, cap):>5} | "
+            f"{_fmt_tokens(p90_held):>10} {_fmt_tokens(max_held):>10} "
+            f"{_fmt_tokens(p90_avail):>10}"
+        )
+
+        summary["workers"][wid] = {
+            "role": role,
+            "capacity_tokens": cap,
+            "avg_active_held_tokens": avg_active,
+            "avg_idle_evictable_tokens": avg_idle,
+            "avg_other_tokens": avg_other,
+            "avg_available_tokens": avg_avail,
+            "p90_held_tokens": p90_held,
+            "max_held_tokens": max_held,
+            "p90_available_tokens": p90_avail,
+            "max_session_count": max(sess_counts) if sess_counts else 0,
+            "max_resident_session_count": (
+                max(resident_counts) if resident_counts else 0
+            ),
+            "ticks": len(ws),
+        }
+
+    print(
+        "\nLegend: active=held-idle  idle=idle_evictable  "
+        "other=cap-held-avail (radix-protected + running-batch + in-flight + frag)"
+    )
+
+    # P1: decomposition of "other" using pool_breakdown fields (zeros if instrument absent)
+    has_breakdown = any(
+        any(r.get(k) for k in (
+            "radix_evictable_tokens",
+            "radix_protected_tokens",
+            "running_batch_kv_tokens",
+            "transfer_queue_tokens",
+            "prealloc_queue_tokens",
+            "retracted_queue_tokens",
+        ))
+        for r in rows
+    )
+
+    if has_breakdown:
+        print("\n=== P1 'other' decomposition (per worker, mean over run) ===")
+        print(
+            f"{'worker':<12} {'role':<8} | "
+            f"{'r_evictable':>11} {'r_protected':>11} {'slot_private':>12} | "
+            f"{'run_batch':>10} {'transfer':>9} {'prealloc':>9} {'retracted':>10} | "
+            f"{'unaccounted':>11}"
+        )
+        for wid in sorted(by_worker.keys()):
+            ws = by_worker[wid]
+            role = ws[0].get("worker_role", "?")
+            cap = max(int(r.get("capacity_tokens") or 0) for r in ws)
+
+            def m(field: str) -> float:
+                vals = [int(r.get(field) or 0) for r in ws]
+                return statistics.fmean(vals) if vals else 0.0
+
+            r_ev = m("radix_evictable_tokens")
+            r_pr = m("radix_protected_tokens")
+            slot = m("slot_private_held_tokens")
+            rb = m("running_batch_kv_tokens")
+            tq = m("transfer_queue_tokens")
+            pq = m("prealloc_queue_tokens")
+            rq = m("retracted_queue_tokens")
+            avail = m("available_tokens")
+            # `running_batch_kv_tokens` overlaps with radix_protected for tree-tracked
+            # reqs — do NOT subtract it again. Decomposition assumes:
+            # capacity ≈ avail + r_evictable + r_protected + slot_private
+            #           + transfer_queue + prealloc_queue + retracted_queue + unaccounted
+            unacc = max(
+                0,
+                cap - avail - r_ev - r_pr - slot - tq - pq - rq,
+            )
+            print(
+                f"{wid:<12} {role:<8} | "
+                f"{_fmt_tokens(r_ev):>11} {_fmt_tokens(r_pr):>11} {_fmt_tokens(slot):>12} | "
+                f"{_fmt_tokens(rb):>10} {_fmt_tokens(tq):>9} {_fmt_tokens(pq):>9} {_fmt_tokens(rq):>10} | "
+                f"{_fmt_tokens(unacc):>11}"
+            )
+
+            summary["workers"][wid]["pool_breakdown_avg"] = {
+                "radix_evictable": r_ev,
+                "radix_protected": r_pr,
+                "slot_private_held": slot,
+                "running_batch_kv": rb,
+                "transfer_queue": tq,
+                "prealloc_queue": pq,
+                "retracted_queue": rq,
+                "available": avail,
+                "unaccounted": unacc,
+            }
+        print(
+            "\nNote: running_batch_kv_tokens overlaps with radix_protected_tokens "
+            "(tree-tracked decode reqs are also in protected); not summed."
+        )
+    else:
+        print("\n(P1 instrument absent: pool_breakdown fields are all zero)")
+
+    # Session residency churn: how many distinct sessions ever sat on each worker,
+    # and how many sessions hopped across workers (= starvation indicator).
+    print("\n=== Session residency churn ===")
+    sessions_per_worker: dict[str, set[str]] = defaultdict(set)
+    workers_per_session: dict[str, set[str]] = defaultdict(set)
+    resident_ticks_per_session: Counter[tuple[str, str]] = Counter()
+    resident_ticks_per_worker: Counter[str] = Counter()
+
+    for row in rows:
+        wid = row.get("worker_id")
+        if wid is None or row.get("worker_role") != "decode":
+            continue
+        sessions = row.get("sessions") or []
+        if not isinstance(sessions, list):
+            continue
+        for entry in sessions:
+            if not isinstance(entry, dict):
+                continue
+            sid = entry.get("session_id")
+            if sid is None:
+                continue
+            if entry.get("resident"):
+                sessions_per_worker[wid].add(sid)
+                workers_per_session[sid].add(wid)
+                resident_ticks_per_session[(wid, sid)] += 1
+                resident_ticks_per_worker[wid] += 1
+
+    # Per-decode worker: distinct session count
+    print(f"  {'worker':<12} {'distinct_sess':>14} {'resident_ticks':>16}")
+    for wid in sorted(sessions_per_worker.keys()):
+        print(
+            f"  {wid:<12} {len(sessions_per_worker[wid]):>14} "
+            f"{resident_ticks_per_worker[wid]:>16}"
+        )
+
+    # Per session: how many workers it hopped across
+    hops = Counter(len(ws) for ws in workers_per_session.values())
+    print("\n  Sessions seen on N workers (decode side):")
+    for n, count in sorted(hops.items()):
+        print(f"    on {n} worker(s): {count} sessions")
+
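+    # workers_per_session only holds sessions that were resident at least once, so a
+    # len(ws) == 0 test can never match. Count sessions that decode workers listed
+    # but that were never resident on any of them.
+    listed_sessions = {
+        entry["session_id"]
+        for row in rows
+        if row.get("worker_id") is not None
+        and row.get("worker_role") == "decode"
+        and isinstance(row.get("sessions"), list)
+        for entry in row["sessions"]
+        if isinstance(entry, dict) and entry.get("session_id") is not None
+    }
+    starvation = [sid for sid in listed_sessions if sid not in workers_per_session]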
+    multi_hopper = sorted(
+        ((sid, ws) for sid, ws in workers_per_session.items() if len(ws) >= 2),
+        key=lambda x: -len(x[1]),
+    )[:10]
+    if multi_hopper:
+        print(
+            "\n  Top sessions seen resident on multiple workers (potential thrashing):"
+        )
+        for sid, ws in multi_hopper:
+            print(f"    {sid}: {len(ws)} workers ({sorted(ws)})")
+
+    summary["session_residency"] = {
+        "distinct_sessions_per_worker": {
+            wid: len(s) for wid, s in sessions_per_worker.items()
+        },
+        "session_hop_count_distribution": dict(hops),
+        "starvation_session_count": len(starvation),
+    }
+
+    # If a request-metrics file is co-located, also summarize how requests were
+    # distributed across execution modes (direct-to-D, fallbacks, ...).
+    metrics_path = timeseries_path.with_name("request-metrics.jsonl")
+    if metrics_path.exists():
+        print(f"\n=== Request-metrics summary ({metrics_path.name}) ===")
+        mrows = _load_jsonl(metrics_path)
+        modes = Counter(r.get("execution_mode") or "?" for r in mrows)
+        total = sum(modes.values())
+        for mode, count in modes.most_common():
+            print(f"  {count:>6} ({100 * count / total:5.1f}%)  {mode}")
+        summary["execution_modes"] = dict(modes)
+
+    return summary
+
+
+def main() -> None:
+    parser = argparse.ArgumentParser(description=__doc__)
+    parser.add_argument(
+        "path",
+        type=Path,
+        help="Path to d-pool-timeseries.jsonl OR a run dir containing it",
+    )
+    parser.add_argument(
+        "--json",
+        action="store_true",
+        help="Also print a machine-readable JSON summary",
+    )
+    args = parser.parse_args()
+
+    resolved = _resolve_input(args.path)
+    summary = analyze(resolved)
+    if args.json:
+        print("\n=== JSON summary ===")
+        print(json.dumps(summary, indent=2, sort_keys=True, default=str))
+
+
+if __name__ == "__main__":
+    main()
--- a/scripts/analysis/analyze_v3.py
+++ b/scripts/analysis/analyze_v3.py
@@ -0,0 +1,89 @@
+#!/usr/bin/env python3
+"""Analyze v3 (kv-aware) results — find why fallback-large-append-session-cap dominates."""
+import json
+import numpy as np
+from pathlib import Path
+from collections import Counter, defaultdict
+
+BASE = Path(__file__).parent
+
+def load_rows(jsonl_path):
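+    """Load a JSONL file into a list of dicts, skipping blank lines."""
+    with open(jsonl_path) as f:
+        # A stray blank line (e.g. a trailing newline) would make json.loads raise.
+        rows = [json.loads(line) for line in f if line.strip()]
+    return rows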
+
+exp1 = load_rows(BASE / "exp1_1p7d_kvc_kvaware_metrics.jsonl")
+exp2 = load_rows(BASE / "exp2_2p6d_kvc_kvaware_metrics.jsonl")
+
+for name, rows in [("Exp1 1P7D", exp1), ("Exp2 2P6D", exp2)]:
+    print(f"\n========== {name} ==========")
+    ok = [r for r in rows if r.get("error") is None]
+
+    # Execution mode breakdown by latency
+    modes = Counter(r["execution_mode"] for r in ok)
+    print(f"\nExecution modes (n={len(ok)}):")
+    for mode, count in modes.most_common():
+        mode_rows = [r for r in ok if r["execution_mode"] == mode]
+        lats = [r["latency_s"] for r in mode_rows]
+        ttfts = [r["ttft_s"] for r in mode_rows]
+        print(f"  {mode}: n={count} ({count/len(ok)*100:.1f}%) "
+              f"lat P50={np.percentile(lats,50):.3f}s P90={np.percentile(lats,90):.3f}s | "
+              f"ttft P50={np.percentile(ttfts,50):.3f}s P90={np.percentile(ttfts,90):.3f}s")
+
+    # Per-D session distribution
+    per_d_sessions = defaultdict(set)
+    for r in ok:
+        d = r.get("assigned_decode_node", "?")
+        per_d_sessions[d].add(r["session_id"])
+    print(f"\nSessions per D worker:")
+    for d in sorted(per_d_sessions.keys()):
+        print(f"  {d}: {len(per_d_sessions[d])} unique sessions")
+
+    # session-cap fallback analysis
+    sc_rows = [r for r in ok if r["execution_mode"] == "pd-router-fallback-large-append-session-cap"]
+    if sc_rows:
+        print(f"\nSession-cap fallback details (n={len(sc_rows)}):")
+        # Which sessions hit this most?
+        sc_per_sess = Counter(r["session_id"] for r in sc_rows)
+        print(f"  Sessions hitting session-cap (top 5):")
+        for sid, cnt in sc_per_sess.most_common(5):
+            print(f"    session {sid}: {cnt} times")
+        # Per-D distribution
+        sc_per_d = Counter(r.get("assigned_decode_node", "?") for r in sc_rows)
+        print(f"  Per-D distribution: {dict(sc_per_d.most_common())}")
+        # Input length distribution
+        inp = [r.get("input_length", 0) for r in sc_rows]
+        print(f"  Input length: P50={np.percentile(inp,50):.0f} P90={np.percentile(inp,90):.0f}")
+        # Turn distribution
+        turns = Counter(r.get("turn_id", -1) for r in sc_rows)
+        print(f"  Turn distribution (top 5): {dict(turns.most_common(5))}")
+
+    # Direct-to-D analysis (ideal path)
+    dd_rows = [r for r in ok if r["execution_mode"] == "kvcache-direct-to-d-session"]
+    if dd_rows:
+        lats = [r["latency_s"] for r in dd_rows]
+        ttfts = [r["ttft_s"] for r in dd_rows]
+        kv_blocks = [r.get("actual_kv_transfer_blocks", 0) for r in dd_rows]
+        cached = [r.get("cached_tokens", 0) for r in dd_rows]
+        print(f"\nDirect-to-D details (n={len(dd_rows)}):")
+        print(f"  lat P50={np.percentile(lats,50):.3f}s P90={np.percentile(lats,90):.3f}s P99={np.percentile(lats,99):.3f}s")
+        print(f"  ttft P50={np.percentile(ttfts,50):.3f}s P90={np.percentile(ttfts,90):.3f}s")
+        print(f"  KV transfer: P50={np.percentile(kv_blocks,50):.0f} (should be 0 — no P involved)")
+        print(f"  cached_tokens P50={np.percentile(cached,50):.0f}")
+
+    # Sessions: how many turns each, how many used direct-to-d
+    print(f"\nPer-session direct-to-D rate (top 10 by total turns):")
+    per_sess = defaultdict(list)
+    for r in ok:
+        per_sess[r["session_id"]].append(r)
+    sess_stats = []
+    for sid, sreqs in per_sess.items():
+        total = len(sreqs)
+        dd = sum(1 for r in sreqs if r["execution_mode"] == "kvcache-direct-to-d-session")
+        sc = sum(1 for r in sreqs if "session-cap" in r["execution_mode"])
+        sess_stats.append((sid, total, dd, sc))
+    sess_stats.sort(key=lambda x: -x[1])
+    for sid, total, dd, sc in sess_stats[:10]:
+        print(f"  session {sid}: {total} turns, {dd} direct-to-D ({dd/total*100:.0f}%), {sc} session-cap fallback ({sc/total*100:.0f}%)")
--- a/scripts/analysis/analyze_v4.py
+++ b/scripts/analysis/analyze_v4.py
@@ -0,0 +1,52 @@
+#!/usr/bin/env python3
+"""V4 results analysis: errors, execution modes, latency by mode."""
+import json
+import numpy as np
+from pathlib import Path
+from collections import Counter
+
+BASE = Path(__file__).parent
+
+def load_rows(jsonl_path):
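+    """Load a JSONL file into a list of dicts, skipping blank lines."""
+    with open(jsonl_path) as f:
+        # A stray blank line (e.g. a trailing newline) would make json.loads raise.
+        rows = [json.loads(line) for line in f if line.strip()]
+    return rows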
+
+for name, path in [
+    ("Exp1 1P7D cap=16", BASE / "exp1_1p7d_kvc_cap16_metrics.jsonl"),
+    ("Exp2 2P6D cap=16", BASE / "exp2_2p6d_kvc_cap16_metrics.jsonl"),
+]:
+    rows = load_rows(path)
+    print(f"\n========== {name} ==========")
+    ok = [r for r in rows if r.get("error") is None]
+    err = [r for r in rows if r.get("error") is not None]
+    print(f"Total: {len(rows)}, OK: {len(ok)}, Errors: {len(err)}")
+
+    # Errors finish_reason
+    if err:
+        finish_reasons = Counter()
+        for r in err:
+            fr = str(r.get("finish_reason") or r.get("error") or "?")
+            # Truncate long messages
+            short = fr[:120]
+            finish_reasons[short] += 1
+        print(f"\nError finish_reasons (top 5):")
+        for fr, cnt in finish_reasons.most_common(5):
+            print(f"  {cnt}x: {fr}")
+
+    # Execution mode latency breakdown
+    modes = Counter(r["execution_mode"] for r in ok)
+    print("\nTop execution modes by request count (with latency):")
+    print(f"{'mode':<55}{'n':<8}{'%':<8}{'P50 lat':<10}{'P90 lat':<10}{'TTFT P50':<10}")
+    for mode, count in modes.most_common(8):
+        mode_rows = [r for r in ok if r["execution_mode"] == mode]
+        lats = [r["latency_s"] for r in mode_rows]
+        ttfts = [r["ttft_s"] for r in mode_rows]
+        print(f"  {mode:<53}{count:<8}{count/len(ok)*100:>5.1f}%  {np.percentile(lats,50):>7.3f}s  {np.percentile(lats,90):>7.3f}s  {np.percentile(ttfts,50):>7.3f}s")
+
+    # Per-D load
+    per_d = Counter(r.get("assigned_decode_node", "?") for r in ok)
+    print(f"\nPer-D load: max/min ratio = {max(per_d.values(), default=0)/max(min(per_d.values(), default=0),1):.2f}x")
+    print(f"  {dict(per_d.most_common())}")
--- a/scripts/analysis/compare_no_error.py
+++ b/scripts/analysis/compare_no_error.py
@@ -0,0 +1,136 @@
+#!/usr/bin/env python3
+"""Compare KVC variants vs baseline, EXCLUDING errors and truncated requests."""
+import json
+import numpy as np
+from pathlib import Path
+
+OUT = Path("/mnt/kzlin/workflow/pd-hybrid/agentic-pd-hybrid/outputs")
+
+DATASETS = [
+    ("baseline 8DP", OUT / "qwen3-30b-tp1-v2-fixed/exp1_8way_dp_cache_aware_metrics.jsonl"),
+    ("v3 1P7D",     OUT / "qwen3-30b-tp1-v3-kvaware/exp1_1p7d_kvc_kvaware_metrics.jsonl"),
+    ("v3 2P6D",     OUT / "qwen3-30b-tp1-v3-kvaware/exp2_2p6d_kvc_kvaware_metrics.jsonl"),
+    ("v4 1P7D",     OUT / "qwen3-30b-tp1-v4-cap16/exp1_1p7d_kvc_cap16_metrics.jsonl"),
+    ("v4 2P6D",     OUT / "qwen3-30b-tp1-v4-cap16/exp2_2p6d_kvc_cap16_metrics.jsonl"),
+]
+
+def load_rows(path):
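+    """Load a JSONL file into a list of dicts, skipping blank lines."""
+    with open(path) as f:
+        # A stray blank line (e.g. a trailing newline) would make json.loads raise.
+        rows = [json.loads(line) for line in f if line.strip()]
+    return rows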
+
+def is_truncated(row):
+    a = row.get("actual_output_tokens")
+    r = row.get("requested_output_tokens")
+    if a is not None and r is not None and r > 1:
+        return a < r * 0.5
+    return False
+
+def stats(values):
+    if not values:
+        return {"n": 0}
+    a = np.array(values)
+    return {
+        "n":    len(a),
+        "mean": float(np.mean(a)),
+        "p50":  float(np.percentile(a, 50)),
+        "p90":  float(np.percentile(a, 90)),
+        "p99":  float(np.percentile(a, 99)),
+    }
+
+def fmt(s, key):
+    if s["n"] == 0:
+        return "N/A"
+    v = s[key]
+    return f"{v:.3f}s" if v < 100 else f"{v:.1f}s"
+
+results = []
+for label, path in DATASETS:
+    if not path.exists():
+        print(f"SKIP {label}")
+        continue
+    rows = load_rows(path)
+    total = len(rows)
+    err_n = sum(1 for r in rows if r.get("error") is not None)
+    trunc_n = sum(1 for r in rows if r.get("error") is None and is_truncated(r))
+
+    # Filter: error=None AND not truncated AND latency present
+    clean = [r for r in rows
+             if r.get("error") is None
+             and not is_truncated(r)
+             and r.get("latency_s") is not None]
+
+    lats = [r["latency_s"] for r in clean]
+    ttfts = [r["ttft_s"] for r in clean if r.get("ttft_s") is not None]
+
+    results.append({
+        "label": label,
+        "total": total,
+        "err": err_n,
+        "trunc": trunc_n,
+        "clean_n": len(clean),
+        "lat": stats(lats),
+        "ttft": stats(ttfts),
+    })
+
+# Print comparison table
+print(f"\n{'='*100}")
+print("LATENCY (excluding errors AND truncated)")
+print(f"{'='*100}")
+print(f"{'config':<16}{'total':>7}{'err':>6}{'trunc':>7}{'clean':>7}  {'mean':>9}{'P50':>9}{'P90':>9}{'P99':>9}")
+for r in results:
+    print(f"{r['label']:<16}{r['total']:>7}{r['err']:>6}{r['trunc']:>7}{r['clean_n']:>7}  "
+          f"{fmt(r['lat'],'mean'):>9}{fmt(r['lat'],'p50'):>9}{fmt(r['lat'],'p90'):>9}{fmt(r['lat'],'p99'):>9}")
+
+print(f"\n{'='*100}")
+print("TTFT (excluding errors AND truncated)")
+print(f"{'='*100}")
+print(f"{'config':<16}{'clean':>7}  {'mean':>9}{'P50':>9}{'P90':>9}{'P99':>9}")
+for r in results:
+    print(f"{r['label']:<16}{r['clean_n']:>7}  "
+          f"{fmt(r['ttft'],'mean'):>9}{fmt(r['ttft'],'p50'):>9}{fmt(r['ttft'],'p90'):>9}{fmt(r['ttft'],'p99'):>9}")
+
+# Also: per-execution-mode breakdown for v4 only (the most interesting)
+print(f"\n{'='*100}")
+print("V4 2P6D: per-execution-mode (excluding errors and truncated)")
+print(f"{'='*100}")
+v4_2p6d = next((p for l, p in DATASETS if l == "v4 2P6D"), None)
+if v4_2p6d is not None and v4_2p6d.exists():
+    rows = load_rows(v4_2p6d)
+    clean = [r for r in rows if r.get("error") is None and not is_truncated(r) and r.get("latency_s") is not None]
+    from collections import Counter
+    modes = Counter(r["execution_mode"] for r in clean)
+    print(f"{'mode':<55}{'n':>7}{'%':>7}  {'mean':>9}{'P50':>9}{'P90':>9}{'P99':>9}")
+    for mode, count in modes.most_common(10):
+        m_rows = [r for r in clean if r["execution_mode"] == mode]
+        s = stats([r["latency_s"] for r in m_rows])
+        pct = count/len(clean)*100
+        print(f"  {mode:<53}{count:>7}{pct:>6.1f}%  {fmt(s,'mean'):>9}{fmt(s,'p50'):>9}{fmt(s,'p90'):>9}{fmt(s,'p99'):>9}")
+
+# Also: WHAT IF we only count direct-to-D? (Pure KVC performance)
+print(f"\n{'='*100}")
+print("Pure KVC (kvcache-direct-to-d-session ONLY) vs Baseline")
+print(f"{'='*100}")
+for label, path in DATASETS:
+    if not path.exists() or ("1P7D" not in label and "2P6D" not in label):
+        continue
+    rows = load_rows(path)
+    direct = [r for r in rows
+              if r.get("error") is None and not is_truncated(r) and r.get("latency_s") is not None
+              and r.get("execution_mode") == "kvcache-direct-to-d-session"]
+    if not direct:
+        continue
+    s_lat = stats([r["latency_s"] for r in direct])
+    s_ttft = stats([r["ttft_s"] for r in direct if r.get("ttft_s") is not None])
+    print(f"{label:<16}n={s_lat['n']:>5}  lat: P50={fmt(s_lat,'p50')} P90={fmt(s_lat,'p90')}  ttft: P50={fmt(s_ttft,'p50')} P90={fmt(s_ttft,'p90')}")
+
+# Baseline for reference (already non-fallback by definition)
+print()
+baseline_path = OUT / "qwen3-30b-tp1-v2-fixed/exp1_8way_dp_cache_aware_metrics.jsonl"
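+baseline_rows = load_rows(baseline_path) if baseline_path.exists() else []  # tolerate a missing file, like the loop above
+clean = [r for r in baseline_rows if r.get("error") is None and not is_truncated(r) and r.get("latency_s") is not None]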
+s_lat = stats([r["latency_s"] for r in clean])
+s_ttft = stats([r["ttft_s"] for r in clean if r.get("ttft_s") is not None])
+print(f"{'baseline 8DP':<16}n={s_lat['n']:>5}  lat: P50={fmt(s_lat,'p50')} P90={fmt(s_lat,'p90')}  ttft: P50={fmt(s_ttft,'p50')} P90={fmt(s_ttft,'p90')}")
--- a/scripts/convert_audit_to_trace.py
+++ b/scripts/convert_audit_to_trace.py
@@ -0,0 +1,110 @@
+#!/usr/bin/env python3
+"""Convert sibench audit.jsonl to agentic-pd-hybrid trace format.
+
+Source format (sibench audit.jsonl):
+  {"instance_id": "...", "ts": float, "messages": [...],
+   "audit": {"prompt_tokens": int, "completion_tokens": int, ...}}
+
+Target format (agentic-pd-hybrid trace JSONL):
+  {"chat_id": int, "parent_chat_id": int, "timestamp": float,
+   "turn": int, "input_length": int, "output_length": int,
+   "type": str, "hash_ids": [int, ...]}
+"""
+
+import json
+import sys
+from collections import defaultdict
+from pathlib import Path
+
+BLOCK_TOKEN_BUDGET = 24  # tokens per block, matching trace.py default
+
+
+def convert(src: Path, dst: Path) -> None:
+    # Group lines by instance_id, preserving order within each instance
+    instances: dict[str, list[dict]] = defaultdict(list)
+    with src.open() as f:
+        for line in f:
+            line = line.strip()
+            if not line:
+                continue
+            rec = json.loads(line)
+            instances[rec["instance_id"]].append(rec)
+
+    # Sort each instance's turns by timestamp
+    for iid in instances:
+        instances[iid].sort(key=lambda r: r["ts"])
+
+    # Assign stable chat_id bases: each instance gets a block of IDs
+    # Max turns across all instances determines the spacing
+    max_turns = max((len(turns) for turns in instances.values()), default=0)
+    spacing = max_turns + 10  # extra headroom
+
+    total_written = 0
+    with dst.open("w") as out:
+        for inst_idx, (iid, turns) in enumerate(instances.items()):
+            base_chat_id = (inst_idx + 1) * spacing  # start from spacing to avoid 0
+            # Track cumulative hash_ids for prefix cache simulation
+            cumulative_hash_ids: list[int] = []
+            global_block_counter = inst_idx * 100_000  # unique block namespace per instance
+
+            for turn_idx, rec in enumerate(turns):
+                audit = rec.get("audit", {})
+                input_length = audit.get("prompt_tokens", 0)
+                output_length = audit.get("completion_tokens", 0)
+
+                if input_length <= 0:
+                    # Fallback: estimate from message content
+                    total_chars = sum(len(m.get("content") or "") for m in rec.get("messages", []))
+                    input_length = max(1, total_chars // 4)
+                if output_length <= 0:
+                    output_length = 128  # reasonable default
+
+                chat_id = base_chat_id + turn_idx
+                if turn_idx == 0:
+                    parent_chat_id = -1
+                else:
+                    parent_chat_id = base_chat_id + turn_idx - 1
+
+                # Build hash_ids: for turn 0, generate blocks for full input
+                # For turn N>0, keep previous blocks and add new ones for the delta
+                if turn_idx == 0:
+                    num_blocks = input_length // BLOCK_TOKEN_BUDGET
+                    cumulative_hash_ids = list(
+                        range(global_block_counter, global_block_counter + num_blocks)
+                    )
+                    global_block_counter += num_blocks
+                else:
+                    # The new input is the full prompt (cumulative), so the delta
+                    # is the new tokens beyond what was in the previous turn's prompt
+                    cur_prompt_tokens = audit.get("prompt_tokens", 0)
+                    prev_rec_audit = turns[turn_idx - 1].get("audit", {})
+                    prev_input_length = prev_rec_audit.get("prompt_tokens", 0)
+                    delta = max(0, cur_prompt_tokens - prev_input_length) if prev_input_length > 0 else 0
+                    new_blocks = delta // BLOCK_TOKEN_BUDGET
+                    new_ids = list(
+                        range(global_block_counter, global_block_counter + new_blocks)
+                    )
+                    global_block_counter += new_blocks
+                    cumulative_hash_ids = cumulative_hash_ids + new_ids
+
+                trace_line = {
+                    "chat_id": chat_id,
+                    "parent_chat_id": parent_chat_id,
+                    "timestamp": rec["ts"],
+                    "turn": turn_idx,
+                    "input_length": input_length,
+                    "output_length": output_length,
+                    "type": "chat",
+                    "hash_ids": cumulative_hash_ids,
+                }
+                out.write(json.dumps(trace_line, separators=(",", ":")) + "\n")
+                total_written += 1
+
+    print(f"Converted {total_written} lines from {len(instances)} instances -> {dst}")
+
+
+if __name__ == "__main__":
+    if len(sys.argv) != 3:
+        print(f"Usage: {sys.argv[0]} <input_audit.jsonl> <output_trace.jsonl>", file=sys.stderr)
+        sys.exit(1)
+    convert(Path(sys.argv[1]), Path(sys.argv[2]))
--- a/scripts/prepare_real_ali_samples.py
+++ b/scripts/prepare_real_ali_samples.py
@@ -0,0 +1,450 @@
+#!/usr/bin/env python3
+"""Prepare balanced real-Ali trace samples for KVC experiments.
+
+The generic sampler is duration-oriented and can be dominated by one long
+session.  This script keeps real request lengths/timestamps but caps turns per
+session so live sweeps can compare policies on a repeatable multi-session
+workload.
+"""
+from __future__ import annotations
+
+import argparse
+import json
+import statistics
+from collections import defaultdict
+from dataclasses import asdict, dataclass
+from pathlib import Path
+
+from agentic_pd_hybrid.trace import TraceRequest, load_trace
+
+
+@dataclass(frozen=True)
+class SampleSummary:
+    input_trace_path: str
+    output_trace_path: str
+    profile: str
+    request_count: int
+    session_count: int
+    multi_turn_session_count: int
+    turn2plus_count: int
+    direct_eligible_turn2plus_count: int
+    direct_eligible_turn2plus_ratio: float
+    missing_parent_count: int
+    max_sessions: int
+    max_turns_per_session: int
+    start_time_s: float
+    end_time_s: float
+    sampled_duration_s: float
+    rebased_timestamps: bool
+    input_tokens: dict[str, float] | None
+    output_tokens: dict[str, float] | None
+    append_tokens: dict[str, float] | None
+    inter_turn_gap_s: dict[str, float] | None
+    overlap_ratio: dict[str, float] | None
+
+
+def main() -> None:
+    parser = argparse.ArgumentParser()
+    parser.add_argument("--trace", type=Path, required=True)
+    parser.add_argument("--output-root", type=Path, required=True)
+    parser.add_argument("--max-sessions", type=int, default=64)
+    parser.add_argument("--max-turns-per-session", type=int, default=12)
+    parser.add_argument("--start-time-s", type=float, default=0.0)
+    parser.add_argument(
+        "--window-duration-s",
+        type=float,
+        default=None,
+        help=(
+            "If set, also write continuous-window samples that keep only requests "
+            "inside [start-time, start-time + window-duration]."
+        ),
+    )
+    parser.add_argument(
+        "--window-target-requests",
+        type=int,
+        default=None,
+        help=(
+            "For continuous-window samples, select whole sessions across time "
+            "buckets until at least this many requests are included. This keeps "
+            "the window span while making live runs tractable."
+        ),
+    )
+    parser.add_argument(
+        "--window-buckets",
+        type=int,
+        default=15,
+        help="Number of time buckets used with --window-target-requests.",
+    )
+    parser.add_argument(
+        "--window-min-turns",
+        type=int,
+        default=1,
+        help=(
+            "Minimum number of in-window turns per selected session for "
+            "continuous-window samples."
+        ),
+    )
+    parser.add_argument(
+        "--window-output-name",
+        default="ali-window.jsonl",
+        help="Output filename for the continuous-window sample.",
+    )
+    parser.add_argument(
+        "--max-sampled-duration-s",
+        type=float,
+        default=None,
+        help=(
+            "For balanced profile samples, drop requests after the first selected "
+            "timestamp plus this duration. Use only for quick smoke runs; headline "
+            "runs should preserve the full sampled span."
+        ),
+    )
+    parser.add_argument(
+        "--profiles",
+        nargs="+",
+        default=["representative-mt", "kvc-fit-smallappend"],
+        choices=["representative-mt", "kvc-fit-smallappend"],
+    )
+    parser.add_argument(
+        "--no-rebase-timestamps",
+        action="store_true",
+        help="Keep original timestamps instead of shifting the sample to start at 0.",
+    )
+    args = parser.parse_args()
+
+    requests = load_trace(args.trace)
+    sessions: dict[str, list[TraceRequest]] = defaultdict(list)
+    for request in requests:
+        sessions[request.session_id].append(request)
+
+    args.output_root.mkdir(parents=True, exist_ok=True)
+    if args.window_duration_s is not None:
+        if args.window_target_requests is None:
+            selected = _select_window(
+                requests=requests,
+                start_time_s=args.start_time_s,
+                window_duration_s=args.window_duration_s,
+            )
+            profile = "window"
+        else:
+            selected = _select_window_session_sample(
+                sessions=sessions,
+                start_time_s=args.start_time_s,
+                window_duration_s=args.window_duration_s,
+                target_requests=args.window_target_requests,
+                bucket_count=args.window_buckets,
+                min_turns=args.window_min_turns,
+            )
+            profile = (
+                "window-session-sample"
+                if args.window_min_turns <= 1
+                else f"window-session-sample-min{args.window_min_turns}turns"
+            )
+        output_path = args.output_root / args.window_output_name
+        summary = _write_sample(
+            selected=selected,
+            input_trace_path=args.trace,
+            output_path=output_path,
+            profile=profile,
+            max_sessions=args.max_sessions,
+            max_turns_per_session=args.max_turns_per_session,
+            rebase_timestamps=not args.no_rebase_timestamps,
+        )
+        print(
+            f"window: wrote {summary.request_count} requests from "
+            f"{summary.session_count} sessions to {output_path}"
+        )
+
+    for profile in args.profiles:
+        selected = _select_profile(
+            sessions=sessions,
+            profile=profile,
+            start_time_s=args.start_time_s,
+            max_sessions=args.max_sessions,
+            max_turns_per_session=args.max_turns_per_session,
+            max_sampled_duration_s=args.max_sampled_duration_s,
+        )
+        output_path = args.output_root / f"ali-{profile}.jsonl"
+        summary = _write_sample(
+            selected=selected,
+            input_trace_path=args.trace,
+            output_path=output_path,
+            profile=profile,
+            max_sessions=args.max_sessions,
+            max_turns_per_session=args.max_turns_per_session,
+            rebase_timestamps=not args.no_rebase_timestamps,
+        )
+        print(
+            f"{profile}: wrote {summary.request_count} requests from "
+            f"{summary.session_count} sessions to {output_path}"
+        )
+
+
+def _select_profile(
+    *,
+    sessions: dict[str, list[TraceRequest]],
+    profile: str,
+    start_time_s: float,
+    max_sessions: int,
+    max_turns_per_session: int,
+    max_sampled_duration_s: float | None,
+) -> list[TraceRequest]:
+    eligible: list[list[TraceRequest]] = []
+    for session_requests in sessions.values():
+        ordered = _ordered(session_requests)
+        if len(ordered) < 2:
+            continue
+        if ordered[0].timestamp_s < start_time_s:
+            continue
+        if profile == "kvc-fit-smallappend" and not _is_kvc_fit_smallappend(ordered):
+            continue
+        eligible.append(ordered[:max_turns_per_session])
+
+    eligible.sort(key=lambda items: (items[0].timestamp_s, items[0].session_id))
+    selected_sessions = eligible[:max_sessions]
+    selected = [request for items in selected_sessions for request in items]
+    selected.sort(key=lambda request: (request.timestamp_s, request.chat_id))
+    if selected and max_sampled_duration_s is not None:
+        first_ts = selected[0].timestamp_s
+        end_ts = first_ts + max_sampled_duration_s
+        selected = [
+            request for request in selected if request.timestamp_s <= end_ts
+        ]
+    return selected
+
+
+def _select_window(
+    *,
+    requests: list[TraceRequest],
+    start_time_s: float,
+    window_duration_s: float,
+) -> list[TraceRequest]:
+    end_time_s = start_time_s + window_duration_s
+    selected = [
+        request
+        for request in requests
+        if start_time_s <= request.timestamp_s <= end_time_s
+    ]
+    selected.sort(key=lambda request: (request.timestamp_s, request.chat_id))
+    return selected
+
+
+def _select_window_session_sample(
+    *,
+    sessions: dict[str, list[TraceRequest]],
+    start_time_s: float,
+    window_duration_s: float,
+    target_requests: int,
+    bucket_count: int,
+    min_turns: int,
+) -> list[TraceRequest]:
+    if target_requests <= 0:
+        raise ValueError("--window-target-requests must be positive")
+    if bucket_count <= 0:
+        raise ValueError("--window-buckets must be positive")
+    if min_turns <= 0:
+        raise ValueError("--window-min-turns must be positive")
+
+    end_time_s = start_time_s + window_duration_s
+    bucket_width_s = window_duration_s / bucket_count
+    buckets: list[list[list[TraceRequest]]] = [[] for _ in range(bucket_count)]
+    for session_requests in sessions.values():
+        ordered = _ordered(session_requests)
+        if not ordered:
+            continue
+        first = ordered[0]
+        if first.timestamp_s < start_time_s or first.timestamp_s > end_time_s:
+            continue
+        in_window = [
+            request
+            for request in ordered
+            if start_time_s <= request.timestamp_s <= end_time_s
+        ]
+        if len(in_window) < min_turns:
+            continue
+        bucket_index = min(
+            bucket_count - 1,
+            int((first.timestamp_s - start_time_s) / bucket_width_s),
+        )
+        buckets[bucket_index].append(in_window)
+
+    for bucket in buckets:
+        bucket.sort(key=lambda items: (items[0].timestamp_s, items[0].session_id))
+
+    selected_sessions: list[list[TraceRequest]] = []
+    selected_count = 0
+    positions = [0 for _ in range(bucket_count)]
+    while selected_count < target_requests:
+        progressed = False
+        for index, bucket in enumerate(buckets):
+            if positions[index] >= len(bucket):
+                continue
+            session_requests = bucket[positions[index]]
+            positions[index] += 1
+            selected_sessions.append(session_requests)
+            selected_count += len(session_requests)
+            progressed = True
+            if selected_count >= target_requests:
+                break
+        if not progressed:
+            break
+
+    selected = [request for items in selected_sessions for request in items]
+    selected.sort(key=lambda request: (request.timestamp_s, request.chat_id))
+    if len(selected) < target_requests:
+        raise ValueError(
+            f"window session sample selected only {len(selected)} requests; "
+            f"target was {target_requests}"
+        )
+    return selected
+
+
+def _is_kvc_fit_smallappend(session_requests: list[TraceRequest]) -> bool:
+    initial = session_requests[0]
+    if initial.input_length < 2048 or initial.input_length > 16000:
+        return False
+    for request in session_requests:
+        if request.output_length > 2048:
+            return False
+    for previous, current in zip(session_requests, session_requests[1:], strict=False):
+        append_tokens = current.input_length - (
+            previous.input_length + previous.output_length
+        )
+        if append_tokens <= 0 or append_tokens > 2048:
+            return False
+        if _overlap_ratio(previous, current) < 0.75:
+            return False
+    return True
+
+
+def _write_sample(
+    *,
+    selected: list[TraceRequest],
+    input_trace_path: Path,
+    output_path: Path,
+    profile: str,
+    max_sessions: int,
+    max_turns_per_session: int,
+    rebase_timestamps: bool,
+) -> SampleSummary:
+    if not selected:
+        raise ValueError(f"profile {profile!r} selected no requests")
+
+    first_ts = selected[0].timestamp_s
+    output_path.parent.mkdir(parents=True, exist_ok=True)
+    with output_path.open("w", encoding="utf-8") as handle:
+        for request in selected:
+            timestamp = request.timestamp_s - first_ts if rebase_timestamps else request.timestamp_s
+            payload = {
+                "chat_id": request.chat_id,
+                "parent_chat_id": request.parent_chat_id,
+                "timestamp": round(timestamp, 6),
+                "input_length": request.input_length,
+                "output_length": request.output_length,
+                "type": request.request_type,
+                "turn": request.turn_id,
+                "hash_ids": list(request.hash_ids),
+            }
+            handle.write(json.dumps(payload, sort_keys=True) + "\n")
+
+    sessions = defaultdict(list)
+    for request in selected:
+        sessions[request.session_id].append(request)
+
+    selected_chat_ids = {request.chat_id for request in selected}
+    missing_parent_count = sum(
+        1
+        for request in selected
+        if request.parent_chat_id >= 0 and request.parent_chat_id not in selected_chat_ids
+    )
+    append_values: list[float] = []
+    gap_values: list[float] = []
+    overlap_values: list[float] = []
+    direct_eligible_count = 0
+    for session_requests in sessions.values():
+        ordered = _ordered(session_requests)
+        for previous, current in zip(ordered, ordered[1:], strict=False):
+            append_tokens = current.input_length - (
+                previous.input_length + previous.output_length
+            )
+            overlap_ratio = _overlap_ratio(previous, current)
+            append_values.append(float(append_tokens))
+            gap_values.append(float(current.timestamp_s - previous.timestamp_s))
+            overlap_values.append(overlap_ratio)
+            if 0 < append_tokens <= 2048 and overlap_ratio > 0:
+                direct_eligible_count += 1
+
+    turn2plus_count = sum(max(0, len(items) - 1) for items in sessions.values())
+
+    start = min(request.timestamp_s for request in selected)
+    end = max(request.timestamp_s for request in selected)
+    summary = SampleSummary(
+        input_trace_path=str(input_trace_path),
+        output_trace_path=str(output_path),
+        profile=profile,
+        request_count=len(selected),
+        session_count=len(sessions),
+        multi_turn_session_count=sum(1 for items in sessions.values() if len(items) > 1),
+        turn2plus_count=turn2plus_count,
+        direct_eligible_turn2plus_count=direct_eligible_count,
+        direct_eligible_turn2plus_ratio=(
+            direct_eligible_count / turn2plus_count if turn2plus_count else 0.0
+        ),
+        missing_parent_count=missing_parent_count,
+        max_sessions=max_sessions,
+        max_turns_per_session=max_turns_per_session,
+        start_time_s=0.0 if rebase_timestamps else start,
+        end_time_s=end - start if rebase_timestamps else end,
+        sampled_duration_s=end - start,
+        rebased_timestamps=rebase_timestamps,
+        input_tokens=_stats([float(request.input_length) for request in selected]),
+        output_tokens=_stats([float(request.output_length) for request in selected]),
+        append_tokens=_stats(append_values),
+        inter_turn_gap_s=_stats(gap_values),
+        overlap_ratio=_stats(overlap_values),
+    )
+    with output_path.with_suffix(output_path.suffix + ".summary.json").open(
+        "w", encoding="utf-8"
+    ) as handle:
+        json.dump(asdict(summary), handle, indent=2, sort_keys=True)
+    return summary
+
+
+def _ordered(session_requests: list[TraceRequest]) -> list[TraceRequest]:
+    return sorted(
+        session_requests,
+        key=lambda request: (request.timestamp_s, request.turn_id, request.chat_id),
+    )
+
+
+def _overlap_ratio(previous: TraceRequest, current: TraceRequest) -> float:
+    if not current.hash_ids:
+        return 0.0
+    previous_blocks = set(previous.hash_ids)
+    overlap = sum(1 for block in current.hash_ids if block in previous_blocks)
+    return overlap / len(current.hash_ids)
+
+
+def _stats(values: list[float]) -> dict[str, float] | None:
+    if not values:
+        return None
+    ordered = sorted(values)
+    return {
+        "count": float(len(ordered)),
+        "mean": statistics.fmean(ordered),
+        "min": ordered[0],
+        "p50": _percentile(ordered, 0.50),
+        "p90": _percentile(ordered, 0.90),
+        "p99": _percentile(ordered, 0.99),
+        "max": ordered[-1],
+    }
+
+
+def _percentile(sorted_values: list[float], percentile: float) -> float:
+    if len(sorted_values) == 1:
+        return sorted_values[0]
+    return sorted_values[round((len(sorted_values) - 1) * percentile)]
+
+
+if __name__ == "__main__":
+    main()
--- a/scripts/run_all_experiments.sh
+++ b/scripts/run_all_experiments.sh
@@ -0,0 +1,73 @@
+#!/bin/bash
+# Run all 3 PD hybrid experiments sequentially
+# Uses 52 sessions / 4,449 requests (10% sample of 497 sessions)
+# Each experiment takes ~30-40 min
+set -euo pipefail
+cd "$(dirname "$0")/.."
+
+TRACE="outputs/qwen35-swebench-50sess.jsonl"
+MODEL="/mnt/kzlin/workflow/pd-hybrid/simm-swe-bench/models/Qwen3.5-35B-A3B"
+OUTPUT="outputs/swebench-exps"
+
+echo "=== Experiment A: pd-disaggregation ==="
+uv run agentic-pd-hybrid benchmark-live \
+  --trace "$TRACE" \
+  --output-root "$OUTPUT" \
+  --mechanism pd-disaggregation \
+  --policy default \
+  --model-path "$MODEL" \
+  --prefill-workers 1 --decode-workers 1 \
+  --prefill-tp-size 4 --decode-tp-size 4 \
+  --prefill-gpu-ids 0,1,2,3 --decode-gpu-ids 4,5,6,7 \
+  --transfer-backend mooncake \
+  --gpu-budget 8 \
+  --time-scale 10 \
+  --session-sample-rate 1.0 \
+  --target-duration-s 100000 \
+  --concurrency-limit 32 \
+  --timeout-s 900 \
+  --request-timeout-s 300
+
+echo "=== Experiment B: pd-colo ==="
+uv run agentic-pd-hybrid benchmark-live \
+  --trace "$TRACE" \
+  --output-root "$OUTPUT" \
+  --mechanism pd-colo \
+  --policy default \
+  --model-path "$MODEL" \
+  --prefill-workers 0 --decode-workers 0 \
+  --direct-workers 2 --direct-tp-size 4 \
+  --direct-gpu-ids 0,1,2,3,4,5,6,7 \
+  --transfer-backend mooncake \
+  --gpu-budget 8 \
+  --time-scale 10 \
+  --session-sample-rate 1.0 \
+  --target-duration-s 100000 \
+  --concurrency-limit 32 \
+  --timeout-s 900 \
+  --request-timeout-s 300
+
+echo "=== Experiment C: kvcache-centric ==="
+uv run agentic-pd-hybrid benchmark-live \
+  --trace "$TRACE" \
+  --output-root "$OUTPUT" \
+  --mechanism kvcache-centric \
+  --policy default \
+  --model-path "$MODEL" \
+  --prefill-workers 1 --decode-workers 1 \
+  --prefill-tp-size 4 --decode-tp-size 4 \
+  --prefill-gpu-ids 0,1,2,3 --decode-gpu-ids 4,5,6,7 \
+  --transfer-backend mooncake \
+  --gpu-budget 8 \
+  --time-scale 10 \
+  --session-sample-rate 1.0 \
+  --target-duration-s 100000 \
+  --concurrency-limit 32 \
+  --timeout-s 900 \
+  --request-timeout-s 300 \
+  --kvcache-admission-mode worker \
+  --kvcache-seed-min-turn-id 2 \
+  --kvcache-prefill-backup-policy release-after-transfer \
+  --kvcache-prefill-priority-eviction
+
+echo "=== All experiments complete ==="
--- a/scripts/run_exp_a_pd_disagg.sh
+++ b/scripts/run_exp_a_pd_disagg.sh
@@ -0,0 +1,24 @@
+#!/bin/bash
+# Experiment A: pd-disaggregation baseline
+# 1P(GPU 0-3) + 1D(GPU 4-7), TP4, mooncake TCP
+# Full 39K trace from SWE-Bench 500 instances
+set -euo pipefail
+cd "$(dirname "$0")/.."
+
+uv run agentic-pd-hybrid benchmark-live \
+  --trace outputs/qwen35-swebench-500.jsonl \
+  --output-root outputs/swebench-exps \
+  --mechanism pd-disaggregation \
+  --policy default \
+  --model-path /mnt/kzlin/workflow/pd-hybrid/simm-swe-bench/models/Qwen3.5-35B-A3B \
+  --prefill-workers 1 --decode-workers 1 \
+  --prefill-tp-size 4 --decode-tp-size 4 \
+  --prefill-gpu-ids 0,1,2,3 --decode-gpu-ids 4,5,6,7 \
+  --transfer-backend mooncake \
+  --gpu-budget 8 \
+  --time-scale 10 \
+  --session-sample-rate 1.0 \
+  --target-duration-s 100000 \
+  --concurrency-limit 64 \
+  --timeout-s 900 \
+  --request-timeout-s 300
--- a/scripts/run_exp_b1_dp_colo_rr.sh
+++ b/scripts/run_exp_b1_dp_colo_rr.sh
@@ -0,0 +1,23 @@
+#!/bin/bash
+# Experiment B1: Naive DP colocation — round-robin policy
+# 2 direct workers (GPU 0-3, 4-7), TP4, DP router with round-robin
+# No disaggregation — each worker does prefill+decode locally
+set -euo pipefail
+cd "$(dirname "$0")/.."
+
+uv run agentic-pd-hybrid benchmark-live \
+  --trace outputs/qwen35-swebench-50sess.jsonl \
+  --output-root outputs/swebench-exps \
+  --mechanism pd-colo \
+  --policy default \
+  --model-path /mnt/kzlin/workflow/pd-hybrid/simm-swe-bench/models/Qwen3.5-35B-A3B \
+  --prefill-workers 0 --decode-workers 0 \
+  --direct-workers 2 --direct-tp-size 4 \
+  --direct-gpu-ids 0,1,2,3,4,5,6,7 \
+  --gpu-budget 8 \
+  --time-scale 10 \
+  --session-sample-rate 1.0 \
+  --target-duration-s 100000 \
+  --concurrency-limit 32 \
+  --timeout-s 900 \
+  --request-timeout-s 300
--- a/scripts/run_exp_b2_dp_colo_cache_aware.sh
+++ b/scripts/run_exp_b2_dp_colo_cache_aware.sh
@@ -0,0 +1,23 @@
+#!/bin/bash
+# Experiment B2: Naive DP colocation — cache-aware (kv-aware) policy
+# 2 direct workers (GPU 0-3, 4-7), TP4, DP router with consistent-hashing
+# Replay kv-aware policy picks the worker with most prefix overlap
+set -euo pipefail
+cd "$(dirname "$0")/.."
+
+uv run agentic-pd-hybrid benchmark-live \
+  --trace outputs/qwen35-swebench-50sess.jsonl \
+  --output-root outputs/swebench-exps \
+  --mechanism pd-colo \
+  --policy kv-aware \
+  --model-path /mnt/kzlin/workflow/pd-hybrid/simm-swe-bench/models/Qwen3.5-35B-A3B \
+  --prefill-workers 0 --decode-workers 0 \
+  --direct-workers 2 --direct-tp-size 4 \
+  --direct-gpu-ids 0,1,2,3,4,5,6,7 \
+  --gpu-budget 8 \
+  --time-scale 10 \
+  --session-sample-rate 1.0 \
+  --target-duration-s 100000 \
+  --concurrency-limit 32 \
+  --timeout-s 900 \
+  --request-timeout-s 300
--- a/scripts/run_exp_b_pd_colo.sh
+++ b/scripts/run_exp_b_pd_colo.sh
@@ -0,0 +1,24 @@
+#!/bin/bash
+# Experiment B: pd-colo (direct/colocation)
+# 2 direct workers (GPU 0-3, 4-7), TP4, no router
+# Full 39K trace from SWE-Bench 500 instances
+set -euo pipefail
+cd "$(dirname "$0")/.."
+
+uv run agentic-pd-hybrid benchmark-live \
+  --trace outputs/qwen35-swebench-500.jsonl \
+  --output-root outputs/swebench-exps \
+  --mechanism pd-colo \
+  --policy default \
+  --model-path /mnt/kzlin/workflow/pd-hybrid/simm-swe-bench/models/Qwen3.5-35B-A3B \
+  --prefill-workers 0 --decode-workers 0 \
+  --direct-workers 2 --direct-tp-size 4 \
+  --direct-gpu-ids 0,1,2,3,4,5,6,7 \
+  --transfer-backend mooncake \
+  --gpu-budget 8 \
+  --time-scale 10 \
+  --session-sample-rate 1.0 \
+  --target-duration-s 100000 \
+  --concurrency-limit 64 \
+  --timeout-s 900 \
+  --request-timeout-s 300
--- a/scripts/run_exp_c_kvcache_centric.sh
+++ b/scripts/run_exp_c_kvcache_centric.sh
@@ -0,0 +1,28 @@
+#!/bin/bash
+# Experiment C: kvcache-centric (session-aware PD)
+# 1P(GPU 0-3) + 1D(GPU 4-7), TP4, mooncake TCP
+# Full 39K trace from SWE-Bench 500 instances
+set -euo pipefail
+cd "$(dirname "$0")/.."
+
+uv run agentic-pd-hybrid benchmark-live \
+  --trace outputs/qwen35-swebench-500.jsonl \
+  --output-root outputs/swebench-exps \
+  --mechanism kvcache-centric \
+  --policy default \
+  --model-path /mnt/kzlin/workflow/pd-hybrid/simm-swe-bench/models/Qwen3.5-35B-A3B \
+  --prefill-workers 1 --decode-workers 1 \
+  --prefill-tp-size 4 --decode-tp-size 4 \
+  --prefill-gpu-ids 0,1,2,3 --decode-gpu-ids 4,5,6,7 \
+  --transfer-backend mooncake \
+  --gpu-budget 8 \
+  --time-scale 10 \
+  --session-sample-rate 1.0 \
+  --target-duration-s 100000 \
+  --concurrency-limit 64 \
+  --timeout-s 900 \
+  --request-timeout-s 300 \
+  --kvcache-admission-mode worker \
+  --kvcache-seed-min-turn-id 2 \
+  --kvcache-prefill-backup-policy release-after-transfer \
+  --kvcache-prefill-priority-eviction
--- a/scripts/smoke_test.sh
+++ b/scripts/smoke_test.sh
@@ -0,0 +1,30 @@
+#!/bin/bash
+# Smoke test: pd-disaggregation with mooncake TCP, 100 requests
+set -euo pipefail
+cd "$(dirname "$0")/.."
+
+# Sample a small trace for smoke testing
+uv run agentic-pd-hybrid sample-sessions \
+  --trace outputs/qwen35-swebench-500.jsonl \
+  --output outputs/qwen35-smoke-3sess.jsonl \
+  --session-sample-rate 0.02 \
+  --min-turns 5 \
+  --target-duration-s 300 \
+  --max-requests 100
+
+# Run smoke test
+uv run agentic-pd-hybrid benchmark-live \
+  --trace outputs/qwen35-smoke-3sess.jsonl \
+  --output-root outputs/smoke \
+  --mechanism pd-disaggregation \
+  --policy default \
+  --model-path /mnt/kzlin/workflow/pd-hybrid/simm-swe-bench/models/Qwen3.5-35B-A3B \
+  --prefill-workers 1 --decode-workers 1 \
+  --prefill-tp-size 4 --decode-tp-size 4 \
+  --prefill-gpu-ids 0,1,2,3 --decode-gpu-ids 4,5,6,7 \
+  --transfer-backend mooncake \
+  --gpu-budget 8 \
+  --time-scale 10 \
+  --concurrency-limit 32 \
+  --timeout-s 900 \
+  --request-timeout-s 300
--- a/scripts/sweep_backpressure_smoke.sh
+++ b/scripts/sweep_backpressure_smoke.sh
@@ -0,0 +1,114 @@
+#!/usr/bin/env bash
+# Smoke sweep: validate backpressure code change on top of v5 Option D config.
+# Designed to fit in ~3-4h GPU budget (4 runs × ~30-60 min).
+#
+# Usage:
+#   bash scripts/sweep_backpressure_smoke.sh
+#
+# Prerequisites: GPUs available; trace at outputs/qwen35-swebench-50sess.jsonl;
+# model at $MODEL (default Qwen3-30B-A3B-Instruct-2507).
+set -euo pipefail
+
+REPO_ROOT="$(cd "$(dirname "${BASH_SOURCE[0]}")/.." && pwd)"
+cd "$REPO_ROOT"
+
+OUT_ROOT=${OUT_ROOT:-outputs/sweep_backpressure_smoke}
+TRACE=${TRACE:-outputs/qwen35-swebench-50sess.jsonl}
+MODEL=${MODEL:-/mnt/kzlin/workflow/pd-hybrid/simm-swe-bench/models/Qwen3-30B-A3B-Instruct-2507}
+
+mkdir -p "$OUT_ROOT"
+LOG="$OUT_ROOT/sweep.log"
+echo "[$(date '+%F %T')] Starting backpressure smoke sweep" | tee -a "$LOG"
+echo "  Trace: $TRACE" | tee -a "$LOG"
+echo "  Model: $MODEL" | tee -a "$LOG"
+echo "  Output root: $OUT_ROOT" | tee -a "$LOG"
+
+KVC_COMMON_ARGS=(
+    --trace "$TRACE"
+    --model-path "$MODEL"
+    --mechanism kvcache-centric
+    --policy kv-aware
+    --kvcache-admission-mode worker
+    --kvcache-seed-min-turn-id 1
+    --kvcache-seed-max-inflight-decode -1
+    --kvcache-prefill-backup-policy release-after-transfer
+    --kvcache-prefill-priority-eviction
+    --prefill-workers 2
+    --decode-workers 6
+    --prefill-gpu-ids 0,1
+    --decode-gpu-ids 2,3,4,5,6,7
+    --transfer-backend mooncake
+    --target-duration-s 2000
+    --session-sample-rate 1.0
+    --min-turns 2
+    --concurrency-limit 32
+)
+
+DP_COMMON_ARGS=(
+    --trace "$TRACE"
+    --model-path "$MODEL"
+    --mechanism pd-colo
+    --policy kv-aware
+    --direct-workers 8
+    --direct-gpu-ids 0,1,2,3,4,5,6,7
+    --transfer-backend mooncake
+    --target-duration-s 2000
+    --session-sample-rate 1.0
+    --min-turns 2
+    --concurrency-limit 32
+)
+
+run_kvc_baseline_ts10() {
+    local out="$OUT_ROOT/E1_kvc_baseline_ts10"
+    echo "[$(date '+%F %T')] === E1: KVC baseline (no backpressure) time-scale=10 ===" | tee -a "$LOG"
+    python -m agentic_pd_hybrid.cli benchmark-live \
+        "${KVC_COMMON_ARGS[@]}" \
+        --output-root "$out" \
+        --time-scale 10 \
+        2>&1 | tee -a "$LOG"
+}
+
+run_kvc_backpressure_ts10() {
+    local out="$OUT_ROOT/E2_kvc_backpressure_ts10"
+    echo "[$(date '+%F %T')] === E2: KVC + backpressure ON, time-scale=10 ===" | tee -a "$LOG"
+    python -m agentic_pd_hybrid.cli benchmark-live \
+        "${KVC_COMMON_ARGS[@]}" \
+        --output-root "$out" \
+        --time-scale 10 \
+        --enable-backpressure \
+        --backpressure-max-pause-s 2.0 \
+        2>&1 | tee -a "$LOG"
+}
+
+run_kvc_backpressure_ts1() {
+    local out="$OUT_ROOT/E3_kvc_backpressure_ts1_short"
+    echo "[$(date '+%F %T')] === E3: KVC + backpressure ON, time-scale=1, FIRST 1000 reqs ===" | tee -a "$LOG"
+    python -m agentic_pd_hybrid.cli benchmark-live \
+        "${KVC_COMMON_ARGS[@]}" \
+        --output-root "$out" \
+        --time-scale 1 \
+        --enable-backpressure \
+        --backpressure-max-pause-s 2.0 \
+        --target-duration-s 1800 \
+        2>&1 | tee -a "$LOG"
+}
+
+run_dp_baseline_ts1() {
+    local out="$OUT_ROOT/E4_dp_ts1_short"
+    echo "[$(date '+%F %T')] === E4: 8-way DP cache-aware, time-scale=1, FIRST 1000 reqs ===" | tee -a "$LOG"
+    python -m agentic_pd_hybrid.cli benchmark-live \
+        "${DP_COMMON_ARGS[@]}" \
+        --output-root "$out" \
+        --time-scale 1 \
+        --target-duration-s 1800 \
+        2>&1 | tee -a "$LOG"
+}
+
+# Sequence — add/remove as fits the budget.
+run_kvc_baseline_ts10
+run_kvc_backpressure_ts10
+run_kvc_backpressure_ts1
+run_dp_baseline_ts1
+
+echo "[$(date '+%F %T')] === sweep DONE ===" | tee -a "$LOG"
+echo "Run analysis with: python scripts/analysis/analyze_backpressure_smoke.py $OUT_ROOT" | tee -a "$LOG"
--- a/scripts/sweep_kvc_qwen3_30b.sh
+++ b/scripts/sweep_kvc_qwen3_30b.sh
@@ -0,0 +1,60 @@
+#!/bin/bash
+# KVC admission control parameter sweep on Qwen3-30B
+# 5 experiments, ~35 min each, ~3 hours total
+set -euo pipefail
+cd "$(dirname "$0")/.."
+
+MODEL=/mnt/kzlin/workflow/pd-hybrid/simm-swe-bench/models/Qwen3-30B-A3B-Instruct-2507
+TRACE=outputs/qwen35-swebench-50sess.jsonl
+OUTPUT=outputs/qwen3-30b-exps
+VENV_PYTHON=.venv/bin/python
+
+run_kvc() {
+  local label=$1
+  local inflight=$2
+  local min_turn=$3
+
+  echo "=== [$label] inflight=$inflight min_turn=$min_turn === $(date)"
+  PYTHONPATH=src:third_party/sglang/python \
+  "$VENV_PYTHON" -m agentic_pd_hybrid.cli benchmark-live \
+    --trace "$TRACE" \
+    --output-root "$OUTPUT" \
+    --mechanism kvcache-centric \
+    --policy default \
+    --model-path "$MODEL" \
+    --prefill-workers 1 --decode-workers 1 \
+    --prefill-tp-size 4 --decode-tp-size 4 \
+    --prefill-gpu-ids 0,1,2,3 --decode-gpu-ids 4,5,6,7 \
+    --transfer-backend mooncake \
+    --gpu-budget 8 \
+    --time-scale 10 \
+    --session-sample-rate 1.0 \
+    --target-duration-s 100000 \
+    --concurrency-limit 32 \
+    --timeout-s 900 \
+    --request-timeout-s 300 \
+    --kvcache-admission-mode worker \
+    --kvcache-seed-min-turn-id "$min_turn" \
+    --kvcache-seed-max-inflight-decode "$inflight" \
+    --kvcache-prefill-backup-policy release-after-transfer \
+    --kvcache-prefill-priority-eviction
+  echo "=== [$label] DONE === $(date)"
+  echo ""
+}
+
+# C1: inflight=8, min-turn=2
+run_kvc "C1" 8 2
+
+# C2: inflight=16, min-turn=2
+run_kvc "C2" 16 2
+
+# C3: inflight=-1 (disabled), min-turn=2
+run_kvc "C3" -1 2
+
+# C4: inflight=8, min-turn=1
+run_kvc "C4" 8 1
+
+# C5: inflight=-1 (disabled), min-turn=1
+run_kvc "C5" -1 1
+
+echo "=== ALL SWEEP EXPERIMENTS DONE === $(date)"
--- a/scripts/sweep_real_ali_kvc.sh
+++ b/scripts/sweep_real_ali_kvc.sh
@@ -0,0 +1,170 @@
+#!/usr/bin/env bash
+# Real Ali workload sweep for KVC pd-hybrid.
+#
+# This script expects a prebuilt sample trace and replays it exactly for every
+# mechanism.  It intentionally keeps pool polling disabled for performance runs.
+set -euo pipefail
+
+REPO_ROOT="$(cd "$(dirname "${BASH_SOURCE[0]}")/.." && pwd)"
+cd "$REPO_ROOT"
+
+MODEL=${MODEL:-/home/admin/cpfs/wjh/models/Qwen/Qwen3-Coder-30B-A3B-Instruct}
+TRACE=${TRACE:-outputs/real-ali-kvc-iter/samples-balanced/ali-kvc-fit-smallappend.jsonl}
+OUT_ROOT=${OUT_ROOT:-outputs/real-ali-kvc-iter/runs}
+TIME_SCALE=${TIME_SCALE:-1}
+CONCURRENCY=${CONCURRENCY:-32}
+REQUEST_TIMEOUT_S=${REQUEST_TIMEOUT_S:-300}
+STACK_TIMEOUT_S=${STACK_TIMEOUT_S:-1200}
+RUNS=${RUNS:-"dp kvc_bp"}
+EXTRA_SERVER_ARGS=${EXTRA_SERVER_ARGS:-}
+PREFILL_EXTRA_SERVER_ARGS=${PREFILL_EXTRA_SERVER_ARGS:-}
+DECODE_EXTRA_SERVER_ARGS=${DECODE_EXTRA_SERVER_ARGS:-}
+KVC_SEED_MIN_TURN_ID=${KVC_SEED_MIN_TURN_ID:-1}
+KVC_SEED_ONLY_MULTITURN=${KVC_SEED_ONLY_MULTITURN:-0}
+
+mkdir -p "$OUT_ROOT"
+LOG="$OUT_ROOT/sweep.log"
+
+log() {
+  echo "[$(date '+%F %T')] $*" | tee -a "$LOG"
+}
+
+common_args=(
+  --trace "$TRACE"
+  --model-path "$MODEL"
+  --output-root "$OUT_ROOT"
+  --use-trace-as-sample
+  --time-scale "$TIME_SCALE"
+  --concurrency-limit "$CONCURRENCY"
+  --timeout-s "$STACK_TIMEOUT_S"
+  --request-timeout-s "$REQUEST_TIMEOUT_S"
+)
+if [[ -n "$EXTRA_SERVER_ARGS" ]]; then
+  common_args+=(--extra-server-args "$EXTRA_SERVER_ARGS")
+fi
+if [[ -n "$PREFILL_EXTRA_SERVER_ARGS" ]]; then
+  common_args+=(--prefill-extra-server-args "$PREFILL_EXTRA_SERVER_ARGS")
+fi
+if [[ -n "$DECODE_EXTRA_SERVER_ARGS" ]]; then
+  common_args+=(--decode-extra-server-args "$DECODE_EXTRA_SERVER_ARGS")
+fi
+
+kvc_args=(
+  "${common_args[@]}"
+  --mechanism kvcache-centric
+  --policy kv-aware
+  --prefill-workers 2
+  --decode-workers 6
+  --prefill-tp-size 1
+  --decode-tp-size 1
+  --prefill-gpu-ids 0,1
+  --decode-gpu-ids 2,3,4,5,6,7
+  --transfer-backend mooncake
+  --gpu-budget 8
+  --kvcache-admission-mode worker
+  --kvcache-seed-min-turn-id "$KVC_SEED_MIN_TURN_ID"
+  --kvcache-seed-max-inflight-decode -1
+  --kvcache-prefill-backup-policy release-after-transfer
+  --kvcache-prefill-priority-eviction
+)
+if [[ "$KVC_SEED_ONLY_MULTITURN" == "1" ]]; then
+  kvc_args+=(--kvcache-seed-only-multiturn-sessions)
+fi
+
+run_dp() {
+  log "=== DP cache-aware baseline: 8 direct workers ==="
+  uv run agentic-pd-hybrid benchmark-live \
+    "${common_args[@]}" \
+    --mechanism pd-colo \
+    --policy kv-aware \
+    --prefill-workers 0 \
+    --decode-workers 0 \
+    --direct-workers 8 \
+    --direct-tp-size 1 \
+    --direct-gpu-ids 0,1,2,3,4,5,6,7 \
+    --gpu-budget 8
+}
+
+run_pd_disagg() {
+  log "=== PD-disaggregation baseline: 2P6D ==="
+  uv run agentic-pd-hybrid benchmark-live \
+    "${common_args[@]}" \
+    --mechanism pd-disaggregation \
+    --policy kv-aware \
+    --prefill-workers 2 \
+    --decode-workers 6 \
+    --prefill-tp-size 1 \
+    --decode-tp-size 1 \
+    --prefill-gpu-ids 0,1 \
+    --decode-gpu-ids 2,3,4,5,6,7 \
+    --transfer-backend mooncake \
+    --gpu-budget 8
+}
+
+run_pd_sticky() {
+  log "=== PD-disaggregation sticky baseline: 2P6D ==="
+  uv run agentic-pd-hybrid benchmark-live \
+    "${common_args[@]}" \
+    --mechanism pd-disaggregation \
+    --policy sticky \
+    --prefill-workers 2 \
+    --decode-workers 6 \
+    --prefill-tp-size 1 \
+    --decode-tp-size 1 \
+    --prefill-gpu-ids 0,1 \
+    --decode-gpu-ids 2,3,4,5,6,7 \
+    --transfer-backend mooncake \
+    --gpu-budget 8
+}
+
+run_kvc() {
+  log "=== KVC baseline: 2P6D worker admission, no backpressure ==="
+  uv run agentic-pd-hybrid benchmark-live "${kvc_args[@]}"
+}
+
+run_kvc_bp() {
+  log "=== KVC candidate: 2P6D worker admission + backpressure ==="
+  uv run agentic-pd-hybrid benchmark-live \
+    "${kvc_args[@]}" \
+    --enable-backpressure \
+    --backpressure-max-pause-s 2.0
+}
+
+summarize_latest() {
+  log "=== Latest summaries ==="
+  find "$OUT_ROOT" -maxdepth 2 -name 'request-metrics.jsonl.summary.json' -print \
+    | sort \
+    | while IFS= read -r summary; do
+        python - "$summary" <<'PY'
+import json, sys
+with open(sys.argv[1], encoding="utf-8") as handle:
+    d = json.load(handle)
+lat = d.get("latency_stats_s") or {}
+tt = d.get("ttft_stats_s") or {}
+em = d.get("execution_modes") or {}
+print(sys.argv[1])
+print("  reqs", d.get("request_count"), "errors", d.get("error_count"), "trunc", d.get("truncated_request_count"))
+print("  lat mean/p50/p90/p99", lat.get("mean"), lat.get("p50"), lat.get("p90"), lat.get("p99"))
+print("  ttft mean/p50/p90", tt.get("mean"), tt.get("p50"), tt.get("p90"))
+print("  modes", em)
+PY
+      done | tee -a "$LOG"
+}
+
+log "Trace: $TRACE"
+log "Model: $MODEL"
+log "Runs: $RUNS | time-scale=$TIME_SCALE concurrency=$CONCURRENCY | kvc-seed-min-turn-id=$KVC_SEED_MIN_TURN_ID | kvc-seed-only-multiturn=$KVC_SEED_ONLY_MULTITURN"
+
+for run in $RUNS; do
+  case "$run" in
+    dp) run_dp ;;
+    pd) run_pd_disagg ;;
+    pd_sticky) run_pd_sticky ;;
+    kvc) run_kvc ;;
+    kvc_bp) run_kvc_bp ;;
+    *) log "Unknown run name: $run"; exit 2 ;;
+  esac
+done
+
+summarize_latest
+log "DONE"
--- a/scripts/sweep_tp1_configs.sh
+++ b/scripts/sweep_tp1_configs.sh
@@ -0,0 +1,133 @@
+#!/bin/bash
+# TP1 configuration sweep: 8-way DP, 1P7D KVC, 2P6D KVC
+# Qwen3-30B-A3B TP=1, single GPU per worker
+# Most aggressive KVC admission: inflight=-1 (off), seed-min-turn=1
+set -euo pipefail
+cd "$(dirname "$0")/.."
+
+MODEL=/mnt/kzlin/workflow/pd-hybrid/simm-swe-bench/models/Qwen3-30B-A3B-Instruct-2507
+TRACE=outputs/qwen35-swebench-50sess.jsonl
+OUTPUT=outputs/qwen3-30b-tp1-exps
+VENV_PYTHON=.venv/bin/python
+RESULTS_FILE=$OUTPUT/sweep_results.txt
+
+mkdir -p $OUTPUT
+
+log() {
+  echo "[$(date '+%Y-%m-%d %H:%M:%S')] $*" | tee -a $RESULTS_FILE
+}
+
+save_result() {
+  local label=$1
+  local run_dir=$2
+  log "=== $label COMPLETED ==="
+  if [ -f "$run_dir/request-metrics.jsonl.summary.json" ]; then
+    log "Summary:"
+    cat "$run_dir/request-metrics.jsonl.summary.json" >> $RESULTS_FILE
+    echo "" >> $RESULTS_FILE
+    # Also copy summary to a named file for easy access
+    cp "$run_dir/request-metrics.jsonl.summary.json" "$OUTPUT/${label}_summary.json"
+    log "Saved to $OUTPUT/${label}_summary.json"
+  else
+    log "WARNING: No summary file found in $run_dir"
+  fi
+}
+
+log "Starting TP1 configuration sweep"
+log "Model: $MODEL"
+log "Trace: $TRACE (4449 requests, 52 sessions)"
+
+########################################
+# Experiment 1: 8-way DP cache-aware
+########################################
+log ""
+log "=== [EXP1] 8-way DP cache-aware (8 direct × TP1) ==="
+PYTHONPATH=src:third_party/sglang/python \
+$VENV_PYTHON -m agentic_pd_hybrid.cli benchmark-live \
+  --trace $TRACE \
+  --output-root $OUTPUT \
+  --mechanism pd-colo \
+  --policy kv-aware \
+  --model-path $MODEL \
+  --prefill-workers 0 --decode-workers 0 \
+  --direct-workers 8 --direct-tp-size 1 \
+  --direct-gpu-ids 0,1,2,3,4,5,6,7 \
+  --gpu-budget 8 \
+  --time-scale 10 \
+  --session-sample-rate 1.0 \
+  --target-duration-s 100000 \
+  --concurrency-limit 32 \
+  --timeout-s 900 \
+  --request-timeout-s 300
+
+# Find latest run dir for this experiment
+EXP1_DIR=$(ls -td $OUTPUT/pd-colo-kv-aware-*/ 2>/dev/null | head -1 || true)
+save_result "exp1_8way_dp_cache_aware" "$EXP1_DIR"
+
+########################################
+# Experiment 2: 1P + 7D KVC (most aggressive)
+########################################
+log ""
+log "=== [EXP2] 1P7D KVC (inflight=off, min-turn=1) ==="
+PYTHONPATH=src:third_party/sglang/python \
+$VENV_PYTHON -m agentic_pd_hybrid.cli benchmark-live \
+  --trace $TRACE \
+  --output-root $OUTPUT \
+  --mechanism kvcache-centric \
+  --policy default \
+  --model-path $MODEL \
+  --prefill-workers 1 --decode-workers 7 \
+  --prefill-tp-size 1 --decode-tp-size 1 \
+  --prefill-gpu-ids 0 --decode-gpu-ids 1,2,3,4,5,6,7 \
+  --transfer-backend mooncake \
+  --gpu-budget 8 \
+  --time-scale 10 \
+  --session-sample-rate 1.0 \
+  --target-duration-s 100000 \
+  --concurrency-limit 32 \
+  --timeout-s 900 \
+  --request-timeout-s 300 \
+  --kvcache-admission-mode worker \
+  --kvcache-seed-min-turn-id 1 \
+  --kvcache-seed-max-inflight-decode -1 \
+  --kvcache-prefill-backup-policy release-after-transfer \
+  --kvcache-prefill-priority-eviction
+
+EXP2_DIR=$(ls -td $OUTPUT/kvcache-centric-*/ 2>/dev/null | head -1 || true)
+save_result "exp2_1p7d_kvc_aggressive" "$EXP2_DIR"
+
+########################################
+# Experiment 3: 2P + 6D KVC (most aggressive)
+########################################
+log ""
+log "=== [EXP3] 2P6D KVC (inflight=off, min-turn=1) ==="
+PYTHONPATH=src:third_party/sglang/python \
+$VENV_PYTHON -m agentic_pd_hybrid.cli benchmark-live \
+  --trace $TRACE \
+  --output-root $OUTPUT \
+  --mechanism kvcache-centric \
+  --policy default \
+  --model-path $MODEL \
+  --prefill-workers 2 --decode-workers 6 \
+  --prefill-tp-size 1 --decode-tp-size 1 \
+  --prefill-gpu-ids 0,1 --decode-gpu-ids 2,3,4,5,6,7 \
+  --transfer-backend mooncake \
+  --gpu-budget 8 \
+  --time-scale 10 \
+  --session-sample-rate 1.0 \
+  --target-duration-s 100000 \
+  --concurrency-limit 32 \
+  --timeout-s 900 \
+  --request-timeout-s 300 \
+  --kvcache-admission-mode worker \
+  --kvcache-seed-min-turn-id 1 \
+  --kvcache-seed-max-inflight-decode -1 \
+  --kvcache-prefill-backup-policy release-after-transfer \
+  --kvcache-prefill-priority-eviction
+
+EXP3_DIR=$(ls -td $OUTPUT/kvcache-centric-*/ 2>/dev/null | head -1 || true)
+save_result "exp3_2p6d_kvc_aggressive" "$EXP3_DIR"
+
+########################################
+log ""
+log "=== ALL TP1 SWEEP EXPERIMENTS DONE ==="
--- a/scripts/sweep_tp1_v2_fixed.sh
+++ b/scripts/sweep_tp1_v2_fixed.sh
@@ -0,0 +1,131 @@
+#!/bin/bash
+# TP1 configuration sweep v2 — after session_params fix + audit fields
+# Qwen3-30B-A3B TP=1, single GPU per worker
+set -euo pipefail
+cd "$(dirname "$0")/.."
+
+MODEL=/mnt/kzlin/workflow/pd-hybrid/simm-swe-bench/models/Qwen3-30B-A3B-Instruct-2507
+TRACE=outputs/qwen35-swebench-50sess.jsonl
+OUTPUT=outputs/qwen3-30b-tp1-v2-fixed
+VENV_PYTHON=.venv/bin/python
+RESULTS_FILE=$OUTPUT/sweep_results.txt
+
+mkdir -p $OUTPUT
+
+log() {
+  echo "[$(date '+%Y-%m-%d %H:%M:%S')] $*" | tee -a $RESULTS_FILE
+}
+
+save_result() {
+  local label=$1
+  local run_dir=$2
+  log "=== $label COMPLETED ==="
+  if [ -f "$run_dir/request-metrics.jsonl.summary.json" ]; then
+    log "Summary:"
+    cat "$run_dir/request-metrics.jsonl.summary.json" >> $RESULTS_FILE
+    echo "" >> $RESULTS_FILE
+    cp "$run_dir/request-metrics.jsonl.summary.json" "$OUTPUT/${label}_summary.json"
+    cp "$run_dir/request-metrics.jsonl" "$OUTPUT/${label}_metrics.jsonl"
+    log "Saved to $OUTPUT/${label}_summary.json + ${label}_metrics.jsonl"
+  else
+    log "WARNING: No summary file found in $run_dir"
+  fi
+}
+
+log "Starting TP1 v2 sweep (session_params fix + audit fields)"
+log "Model: $MODEL"
+log "Trace: $TRACE (4449 requests, 52 sessions)"
+
+########################################
+# Experiment 1: 8-way DP cache-aware
+########################################
+log ""
+log "=== [EXP1] 8-way DP cache-aware (8 direct × TP1) ==="
+PYTHONPATH=src:third_party/sglang/python \
+$VENV_PYTHON -m agentic_pd_hybrid.cli benchmark-live \
+  --trace $TRACE \
+  --output-root $OUTPUT \
+  --mechanism pd-colo \
+  --policy kv-aware \
+  --model-path $MODEL \
+  --prefill-workers 0 --decode-workers 0 \
+  --direct-workers 8 --direct-tp-size 1 \
+  --direct-gpu-ids 0,1,2,3,4,5,6,7 \
+  --gpu-budget 8 \
+  --time-scale 10 \
+  --session-sample-rate 1.0 \
+  --target-duration-s 100000 \
+  --concurrency-limit 32 \
+  --timeout-s 900 \
+  --request-timeout-s 300
+
+EXP1_DIR=$(ls -td $OUTPUT/pd-colo-kv-aware-*/ 2>/dev/null | head -1 || true)
+save_result "exp1_8way_dp_cache_aware" "$EXP1_DIR"
+
+########################################
+# Experiment 2: 1P + 7D KVC (aggressive)
+########################################
+log ""
+log "=== [EXP2] 1P7D KVC (inflight=off, min-turn=1) ==="
+PYTHONPATH=src:third_party/sglang/python \
+$VENV_PYTHON -m agentic_pd_hybrid.cli benchmark-live \
+  --trace $TRACE \
+  --output-root $OUTPUT \
+  --mechanism kvcache-centric \
+  --policy default \
+  --model-path $MODEL \
+  --prefill-workers 1 --decode-workers 7 \
+  --prefill-tp-size 1 --decode-tp-size 1 \
+  --prefill-gpu-ids 0 --decode-gpu-ids 1,2,3,4,5,6,7 \
+  --transfer-backend mooncake \
+  --gpu-budget 8 \
+  --time-scale 10 \
+  --session-sample-rate 1.0 \
+  --target-duration-s 100000 \
+  --concurrency-limit 32 \
+  --timeout-s 900 \
+  --request-timeout-s 300 \
+  --kvcache-admission-mode worker \
+  --kvcache-seed-min-turn-id 1 \
+  --kvcache-seed-max-inflight-decode -1 \
+  --kvcache-prefill-backup-policy release-after-transfer \
+  --kvcache-prefill-priority-eviction
+
+EXP2_DIR=$(ls -td $OUTPUT/kvcache-centric-*/ 2>/dev/null | head -1 || true)
+save_result "exp2_1p7d_kvc_aggressive" "$EXP2_DIR"
+
+########################################
+# Experiment 3: 2P + 6D KVC (aggressive)
+########################################
+log ""
+log "=== [EXP3] 2P6D KVC (inflight=off, min-turn=1) ==="
+PYTHONPATH=src:third_party/sglang/python \
+$VENV_PYTHON -m agentic_pd_hybrid.cli benchmark-live \
+  --trace $TRACE \
+  --output-root $OUTPUT \
+  --mechanism kvcache-centric \
+  --policy default \
+  --model-path $MODEL \
+  --prefill-workers 2 --decode-workers 6 \
+  --prefill-tp-size 1 --decode-tp-size 1 \
+  --prefill-gpu-ids 0,1 --decode-gpu-ids 2,3,4,5,6,7 \
+  --transfer-backend mooncake \
+  --gpu-budget 8 \
+  --time-scale 10 \
+  --session-sample-rate 1.0 \
+  --target-duration-s 100000 \
+  --concurrency-limit 32 \
+  --timeout-s 900 \
+  --request-timeout-s 300 \
+  --kvcache-admission-mode worker \
+  --kvcache-seed-min-turn-id 1 \
+  --kvcache-seed-max-inflight-decode -1 \
+  --kvcache-prefill-backup-policy release-after-transfer \
+  --kvcache-prefill-priority-eviction
+
+EXP3_DIR=$(ls -td $OUTPUT/kvcache-centric-*/ 2>/dev/null | head -1 || true)
+save_result "exp3_2p6d_kvc_aggressive" "$EXP3_DIR"
+
+########################################
+log ""
+log "=== ALL TP1 V2 SWEEP EXPERIMENTS DONE ==="
--- a/scripts/sweep_tp1_v3_kvaware.sh
+++ b/scripts/sweep_tp1_v3_kvaware.sh
@@ -0,0 +1,108 @@
+#!/bin/bash
+# TP1 v3 sweep — KVC with kv-aware policy (fix routing mismatch)
+# v2 used --policy default for KVC experiments, causing session routing
+# mismatch: replay round-robin ≠ router round-robin → "session not found".
+# v3 uses --policy kv-aware for KVC to ensure session affinity.
+set -euo pipefail
+cd "$(dirname "$0")/.."
+
+MODEL=/mnt/kzlin/workflow/pd-hybrid/simm-swe-bench/models/Qwen3-30B-A3B-Instruct-2507
+TRACE=outputs/qwen35-swebench-50sess.jsonl
+OUTPUT=outputs/qwen3-30b-tp1-v3-kvaware
+VENV_PYTHON=.venv/bin/python
+RESULTS_FILE=$OUTPUT/sweep_results.txt
+
+mkdir -p $OUTPUT
+
+log() {
+  echo "[$(date '+%Y-%m-%d %H:%M:%S')] $*" | tee -a $RESULTS_FILE
+}
+
+save_result() {
+  local label=$1
+  local run_dir=$2
+  log "=== $label COMPLETED ==="
+  if [ -f "$run_dir/request-metrics.jsonl.summary.json" ]; then
+    log "Summary:"
+    cat "$run_dir/request-metrics.jsonl.summary.json" >> $RESULTS_FILE
+    echo "" >> $RESULTS_FILE
+    cp "$run_dir/request-metrics.jsonl.summary.json" "$OUTPUT/${label}_summary.json"
+    cp "$run_dir/request-metrics.jsonl" "$OUTPUT/${label}_metrics.jsonl"
+    log "Saved to $OUTPUT/${label}_summary.json + ${label}_metrics.jsonl"
+  else
+    log "WARNING: No summary file found in $run_dir"
+  fi
+}
+
+log "Starting TP1 v3 sweep (KVC with kv-aware policy)"
+log "Model: $MODEL"
+log "Trace: $TRACE (4449 requests, 52 sessions)"
+log "Key change: --policy kv-aware for KVC (was --policy default in v2)"
+
+########################################
+# Experiment 1: 1P + 7D KVC kv-aware
+########################################
+log ""
+log "=== [EXP1] 1P7D KVC kv-aware ==="
+PYTHONPATH=src:third_party/sglang/python \
+$VENV_PYTHON -m agentic_pd_hybrid.cli benchmark-live \
+  --trace $TRACE \
+  --output-root $OUTPUT \
+  --mechanism kvcache-centric \
+  --policy kv-aware \
+  --model-path $MODEL \
+  --prefill-workers 1 --decode-workers 7 \
+  --prefill-tp-size 1 --decode-tp-size 1 \
+  --prefill-gpu-ids 0 --decode-gpu-ids 1,2,3,4,5,6,7 \
+  --transfer-backend mooncake \
+  --gpu-budget 8 \
+  --time-scale 10 \
+  --session-sample-rate 1.0 \
+  --target-duration-s 100000 \
+  --concurrency-limit 32 \
+  --timeout-s 900 \
+  --request-timeout-s 300 \
+  --kvcache-admission-mode worker \
+  --kvcache-seed-min-turn-id 1 \
+  --kvcache-seed-max-inflight-decode -1 \
+  --kvcache-prefill-backup-policy release-after-transfer \
+  --kvcache-prefill-priority-eviction
+
+EXP1_DIR=$(ls -td $OUTPUT/kvcache-centric-*/ 2>/dev/null | head -1 || true)
+save_result "exp1_1p7d_kvc_kvaware" "$EXP1_DIR"
+
+########################################
+# Experiment 2: 2P + 6D KVC kv-aware
+########################################
+log ""
+log "=== [EXP2] 2P6D KVC kv-aware ==="
+PYTHONPATH=src:third_party/sglang/python \
+$VENV_PYTHON -m agentic_pd_hybrid.cli benchmark-live \
+  --trace $TRACE \
+  --output-root $OUTPUT \
+  --mechanism kvcache-centric \
+  --policy kv-aware \
+  --model-path $MODEL \
+  --prefill-workers 2 --decode-workers 6 \
+  --prefill-tp-size 1 --decode-tp-size 1 \
+  --prefill-gpu-ids 0,1 --decode-gpu-ids 2,3,4,5,6,7 \
+  --transfer-backend mooncake \
+  --gpu-budget 8 \
+  --time-scale 10 \
+  --session-sample-rate 1.0 \
+  --target-duration-s 100000 \
+  --concurrency-limit 32 \
+  --timeout-s 900 \
+  --request-timeout-s 300 \
+  --kvcache-admission-mode worker \
+  --kvcache-seed-min-turn-id 1 \
+  --kvcache-seed-max-inflight-decode -1 \
+  --kvcache-prefill-backup-policy release-after-transfer \
+  --kvcache-prefill-priority-eviction
+
+EXP2_DIR=$(ls -td $OUTPUT/kvcache-centric-*/ 2>/dev/null | head -1 || true)
+save_result "exp2_2p6d_kvc_kvaware" "$EXP2_DIR"
+
+########################################
+log ""
+log "=== ALL TP1 V3 SWEEP EXPERIMENTS DONE ==="
--- a/scripts/sweep_tp1_v4_cap16.sh
+++ b/scripts/sweep_tp1_v4_cap16.sh
@@ -0,0 +1,108 @@
+#!/bin/bash
+# TP1 v4 sweep — KVC with kv-aware policy + soft_cap raised from 4 to 16
+# v3 (kv-aware) fixed routing, but session-cap fallback still hit 52-65% of
+# requests. The hardcoded min(4, ...) in _decode_session_soft_cap was the
+# bottleneck: only 4*7=28 session slots (7 decode workers) for 52 trace sessions.
+# v4 raises the cap to 16 (4*7=28 -> 16*7=112 slots).
+set -euo pipefail
+cd "$(dirname "$0")/.."
+
+MODEL=/mnt/kzlin/workflow/pd-hybrid/simm-swe-bench/models/Qwen3-30B-A3B-Instruct-2507
+TRACE=outputs/qwen35-swebench-50sess.jsonl
+OUTPUT=outputs/qwen3-30b-tp1-v4-cap16
+VENV_PYTHON=.venv/bin/python
+RESULTS_FILE=$OUTPUT/sweep_results.txt
+
+mkdir -p $OUTPUT
+
+log() {
+  echo "[$(date '+%Y-%m-%d %H:%M:%S')] $*" | tee -a $RESULTS_FILE
+}
+
+save_result() {
+  local label=$1
+  local run_dir=$2
+  log "=== $label COMPLETED ==="
+  if [ -f "$run_dir/request-metrics.jsonl.summary.json" ]; then
+    log "Summary:"
+    cat "$run_dir/request-metrics.jsonl.summary.json" >> $RESULTS_FILE
+    echo "" >> $RESULTS_FILE
+    cp "$run_dir/request-metrics.jsonl.summary.json" "$OUTPUT/${label}_summary.json"
+    cp "$run_dir/request-metrics.jsonl" "$OUTPUT/${label}_metrics.jsonl"
+    log "Saved to $OUTPUT/${label}_summary.json + ${label}_metrics.jsonl"
+  else
+    log "WARNING: No summary file found in $run_dir"
+  fi
+}
+
+log "Starting TP1 v4 sweep (KVC kv-aware, session soft_cap raised 4->16)"
+log "Model: $MODEL"
+log "Trace: $TRACE (4449 requests, 52 sessions)"
+log "Key change: _decode_session_soft_cap now min(16, ...) instead of min(4, ...)"
+
+########################################
+# Experiment 1: 1P + 7D KVC kv-aware (cap=16)
+########################################
+log ""
+log "=== [EXP1] 1P7D KVC kv-aware cap=16 ==="
+PYTHONPATH=src:third_party/sglang/python \
+$VENV_PYTHON -m agentic_pd_hybrid.cli benchmark-live \
+  --trace $TRACE \
+  --output-root $OUTPUT \
+  --mechanism kvcache-centric \
+  --policy kv-aware \
+  --model-path $MODEL \
+  --prefill-workers 1 --decode-workers 7 \
+  --prefill-tp-size 1 --decode-tp-size 1 \
+  --prefill-gpu-ids 0 --decode-gpu-ids 1,2,3,4,5,6,7 \
+  --transfer-backend mooncake \
+  --gpu-budget 8 \
+  --time-scale 10 \
+  --session-sample-rate 1.0 \
+  --target-duration-s 100000 \
+  --concurrency-limit 32 \
+  --timeout-s 900 \
+  --request-timeout-s 300 \
+  --kvcache-admission-mode worker \
+  --kvcache-seed-min-turn-id 1 \
+  --kvcache-seed-max-inflight-decode -1 \
+  --kvcache-prefill-backup-policy release-after-transfer \
+  --kvcache-prefill-priority-eviction
+
+EXP1_DIR=$(ls -td $OUTPUT/kvcache-centric-*/ 2>/dev/null | head -1 || true)
+save_result "exp1_1p7d_kvc_cap16" "$EXP1_DIR"
+
+########################################
+# Experiment 2: 2P + 6D KVC kv-aware (cap=16)
+########################################
+log ""
+log "=== [EXP2] 2P6D KVC kv-aware cap=16 ==="
+PYTHONPATH=src:third_party/sglang/python \
+$VENV_PYTHON -m agentic_pd_hybrid.cli benchmark-live \
+  --trace $TRACE \
+  --output-root $OUTPUT \
+  --mechanism kvcache-centric \
+  --policy kv-aware \
+  --model-path $MODEL \
+  --prefill-workers 2 --decode-workers 6 \
+  --prefill-tp-size 1 --decode-tp-size 1 \
+  --prefill-gpu-ids 0,1 --decode-gpu-ids 2,3,4,5,6,7 \
+  --transfer-backend mooncake \
+  --gpu-budget 8 \
+  --time-scale 10 \
+  --session-sample-rate 1.0 \
+  --target-duration-s 100000 \
+  --concurrency-limit 32 \
+  --timeout-s 900 \
+  --request-timeout-s 300 \
+  --kvcache-admission-mode worker \
+  --kvcache-seed-min-turn-id 1 \
+  --kvcache-seed-max-inflight-decode -1 \
+  --kvcache-prefill-backup-policy release-after-transfer \
+  --kvcache-prefill-priority-eviction
+
+EXP2_DIR=$(ls -td $OUTPUT/kvcache-centric-*/ 2>/dev/null | head -1 || true)
+save_result "exp2_2p6d_kvc_cap16" "$EXP2_DIR"
+
+log ""
+log "=== ALL TP1 V4 SWEEP EXPERIMENTS DONE ==="
--- a/scripts/sweep_tp1_v5_baseline_rerun_exp2.sh
+++ b/scripts/sweep_tp1_v5_baseline_rerun_exp2.sh
@@ -0,0 +1,89 @@
+#!/bin/bash
+# P0: Re-run v5 baseline EXP2 (2P6D) three times to establish whether
+# errors=9 is a stable property of the v5 config or single-run variance.
+# A critique of V5_PROFILE_INVESTIGATION_ZH.md flagged that the 415 errors in
+# v5+profile EXP2 may have been polling-induced. We need 3 baseline runs
+# (no polling, identical config to original v5) to test reproducibility.
+#
+# Output:
+#   outputs/qwen3-30b-tp1-v5-optD-baseline-rerun/
+#     ├── exp2_2p6d_run{1,2,3}_summary.json
+#     ├── exp2_2p6d_run{1,2,3}_metrics.jsonl
+#     └── kvcache-centric-...<ts>/   (one per run)
+set -euo pipefail
+cd "$(dirname "$0")/.."
+
+MODEL=/mnt/kzlin/workflow/pd-hybrid/simm-swe-bench/models/Qwen3-30B-A3B-Instruct-2507
+TRACE=outputs/qwen35-swebench-50sess.jsonl
+OUTPUT=outputs/qwen3-30b-tp1-v5-optD-baseline-rerun
+VENV_PYTHON=.venv/bin/python
+RESULTS_FILE=$OUTPUT/sweep_results.txt
+
+mkdir -p $OUTPUT
+
+log() {
+  echo "[$(date '+%Y-%m-%d %H:%M:%S')] $*" | tee -a $RESULTS_FILE
+}
+
+run_exp2() {
+  local run_idx=$1
+  local label="exp2_2p6d_run${run_idx}"
+  log ""
+  log "=== [RUN ${run_idx}/3] EXP2 2P6D KVC kv-aware Option D (no polling) ==="
+  PYTHONPATH=src:third_party/sglang/python \
+  $VENV_PYTHON -m agentic_pd_hybrid.cli benchmark-live \
+    --trace $TRACE \
+    --output-root $OUTPUT \
+    --mechanism kvcache-centric \
+    --policy kv-aware \
+    --model-path $MODEL \
+    --prefill-workers 2 --decode-workers 6 \
+    --prefill-tp-size 1 --decode-tp-size 1 \
+    --prefill-gpu-ids 0,1 --decode-gpu-ids 2,3,4,5,6,7 \
+    --transfer-backend mooncake \
+    --gpu-budget 8 \
+    --time-scale 10 \
+    --session-sample-rate 1.0 \
+    --target-duration-s 100000 \
+    --concurrency-limit 32 \
+    --timeout-s 900 \
+    --request-timeout-s 300 \
+    --kvcache-admission-mode worker \
+    --kvcache-seed-min-turn-id 1 \
+    --kvcache-seed-max-inflight-decode -1 \
+    --kvcache-prefill-backup-policy release-after-transfer \
+    --kvcache-prefill-priority-eviction
+
+  local run_dir=$(ls -td $OUTPUT/kvcache-centric-*/ 2>/dev/null | head -1)
+  log "=== [RUN ${run_idx}/3] $label COMPLETED ==="
+  if [ -f "$run_dir/request-metrics.jsonl.summary.json" ]; then
+    cp "$run_dir/request-metrics.jsonl.summary.json" "$OUTPUT/${label}_summary.json"
+    cp "$run_dir/request-metrics.jsonl" "$OUTPUT/${label}_metrics.jsonl"
+    local errs=$($VENV_PYTHON -c "import json; d=json.load(open('$OUTPUT/${label}_summary.json')); print(d.get('error_count',0))")
+    log "  errors = $errs (baseline reference = 9)"
+    cat "$run_dir/request-metrics.jsonl.summary.json" >> $RESULTS_FILE
+    echo "" >> $RESULTS_FILE
+  else
+    log "WARNING: no summary file in $run_dir"
+  fi
+}
+
+log "=== P0: v5 baseline EXP2 reproducibility test (3 runs, no polling) ==="
+log "Model: $MODEL"
+log "Trace: $TRACE (4449 requests, 52 sessions)"
+log "Goal: confirm whether errors=9 in v5 baseline EXP2 is reproducible"
+log "      (v5+profile saw 415 errors; we need to know if polling was causal)"
+
+for i in 1 2 3; do
+  run_exp2 $i
+done
+
+log ""
+log "=== P0 SUMMARY: errors per run ==="
+for i in 1 2 3; do
+  if [ -f "$OUTPUT/exp2_2p6d_run${i}_summary.json" ]; then
+    e=$($VENV_PYTHON -c "import json; d=json.load(open('$OUTPUT/exp2_2p6d_run${i}_summary.json')); print(d.get('error_count',0))")
+    log "  run ${i}: errors = $e"
+  fi
+done
+log "=== P0 ALL DONE ==="
--- a/scripts/sweep_tp1_v5_optD.sh
+++ b/scripts/sweep_tp1_v5_optD.sh
@@ -0,0 +1,114 @@
+#!/bin/bash
+# TP1 v5 sweep — Option D: D-side admission for seed/reseed.
+#
+# v4 (cap=16) still saw 35% session-cap fallback because the local soft_cap
+# evaluates min(16, usable_capacity_tokens / target_tokens) and target_tokens
+# (= input + output) is 50-100K in agentic workloads, giving cap = 1-2.
+#
+# v5 makes worker admission_mode authoritative for ALL admission decisions
+# (direct_append AND seed/reseed). Replay calls D's
+# /session_cache/admit_direct_append with mode={direct_append|seed} and
+# defers to D's KV pool availability + LRU eviction. Replay's local
+# _decode_session_soft_cap is bypassed entirely under worker mode.
+set -euo pipefail
+cd "$(dirname "$0")/.."
+
+MODEL=/mnt/kzlin/workflow/pd-hybrid/simm-swe-bench/models/Qwen3-30B-A3B-Instruct-2507
+TRACE=outputs/qwen35-swebench-50sess.jsonl
+OUTPUT=outputs/qwen3-30b-tp1-v5-optD
+VENV_PYTHON=.venv/bin/python
+RESULTS_FILE=$OUTPUT/sweep_results.txt
+
+mkdir -p $OUTPUT
+
+log() {
+  echo "[$(date '+%Y-%m-%d %H:%M:%S')] $*" | tee -a $RESULTS_FILE
+}
+
+save_result() {
+  local label=$1
+  local run_dir=$2
+  log "=== $label COMPLETED ==="
+  if [ -f "$run_dir/request-metrics.jsonl.summary.json" ]; then
+    log "Summary:"
+    cat "$run_dir/request-metrics.jsonl.summary.json" >> $RESULTS_FILE
+    echo "" >> $RESULTS_FILE
+    cp "$run_dir/request-metrics.jsonl.summary.json" "$OUTPUT/${label}_summary.json"
+    cp "$run_dir/request-metrics.jsonl" "$OUTPUT/${label}_metrics.jsonl"
+    log "Saved to $OUTPUT/${label}_summary.json + ${label}_metrics.jsonl"
+  else
+    log "WARNING: No summary file found in $run_dir"
+  fi
+}
+
+log "Starting TP1 v5 sweep (Option D: D-side seed admission)"
+log "Model: $MODEL"
+log "Trace: $TRACE (4449 requests, 52 sessions)"
+log "Key change: worker admission_mode now drives seed/reseed via D's admit endpoint"
+
+########################################
+# Experiment 1: 1P + 7D KVC kv-aware Option D
+########################################
+log ""
+log "=== [EXP1] 1P7D KVC kv-aware Option D ==="
+PYTHONPATH=src:third_party/sglang/python \
+$VENV_PYTHON -m agentic_pd_hybrid.cli benchmark-live \
+  --trace $TRACE \
+  --output-root $OUTPUT \
+  --mechanism kvcache-centric \
+  --policy kv-aware \
+  --model-path $MODEL \
+  --prefill-workers 1 --decode-workers 7 \
+  --prefill-tp-size 1 --decode-tp-size 1 \
+  --prefill-gpu-ids 0 --decode-gpu-ids 1,2,3,4,5,6,7 \
+  --transfer-backend mooncake \
+  --gpu-budget 8 \
+  --time-scale 10 \
+  --session-sample-rate 1.0 \
+  --target-duration-s 100000 \
+  --concurrency-limit 32 \
+  --timeout-s 900 \
+  --request-timeout-s 300 \
+  --kvcache-admission-mode worker \
+  --kvcache-seed-min-turn-id 1 \
+  --kvcache-seed-max-inflight-decode -1 \
+  --kvcache-prefill-backup-policy release-after-transfer \
+  --kvcache-prefill-priority-eviction
+
+EXP1_DIR=$(ls -td $OUTPUT/kvcache-centric-*/ 2>/dev/null | head -1 || true)
+save_result "exp1_1p7d_kvc_optD" "$EXP1_DIR"
+
+########################################
+# Experiment 2: 2P + 6D KVC kv-aware Option D
+########################################
+log ""
+log "=== [EXP2] 2P6D KVC kv-aware Option D ==="
+PYTHONPATH=src:third_party/sglang/python \
+$VENV_PYTHON -m agentic_pd_hybrid.cli benchmark-live \
+  --trace $TRACE \
+  --output-root $OUTPUT \
+  --mechanism kvcache-centric \
+  --policy kv-aware \
+  --model-path $MODEL \
+  --prefill-workers 2 --decode-workers 6 \
+  --prefill-tp-size 1 --decode-tp-size 1 \
+  --prefill-gpu-ids 0,1 --decode-gpu-ids 2,3,4,5,6,7 \
+  --transfer-backend mooncake \
+  --gpu-budget 8 \
+  --time-scale 10 \
+  --session-sample-rate 1.0 \
+  --target-duration-s 100000 \
+  --concurrency-limit 32 \
+  --timeout-s 900 \
+  --request-timeout-s 300 \
+  --kvcache-admission-mode worker \
+  --kvcache-seed-min-turn-id 1 \
+  --kvcache-seed-max-inflight-decode -1 \
+  --kvcache-prefill-backup-policy release-after-transfer \
+  --kvcache-prefill-priority-eviction
+
+EXP2_DIR=$(ls -td $OUTPUT/kvcache-centric-*/ 2>/dev/null | head -1 || true)
+save_result "exp2_2p6d_kvc_optD" "$EXP2_DIR"
+
+log ""
+log "=== ALL TP1 V5 SWEEP EXPERIMENTS DONE ==="
--- a/scripts/sweep_tp1_v5_optD_profile.sh
+++ b/scripts/sweep_tp1_v5_optD_profile.sh
@@ -0,0 +1,125 @@
+#!/bin/bash
+# TP1 v5 + profiling — re-run the v5 (Option D) config with the new
+# d-pool-timeseries poller enabled, so we can attribute each session-cap
+# fallback to actual D KV pool occupancy (held vs available vs idle-evictable
+# vs prefill-backup) instead of guessing.
+#
+# Output:
+#   outputs/qwen3-30b-tp1-v5-optD-profile/
+#     ├── kvcache-centric-kv-aware-worker-admission-<ts>/
+#     │   ├── request-metrics.jsonl
+#     │   ├── request-metrics.jsonl.summary.json
+#     │   └── d-pool-timeseries.jsonl   ← NEW (1Hz P/D /server_info snapshots)
+#     ├── exp1_1p7d_kvc_optD_profile_metrics.jsonl
+#     └── exp2_2p6d_kvc_optD_profile_metrics.jsonl
+set -euo pipefail
+cd "$(dirname "$0")/.."
+
+MODEL=/mnt/kzlin/workflow/pd-hybrid/simm-swe-bench/models/Qwen3-30B-A3B-Instruct-2507
+TRACE=outputs/qwen35-swebench-50sess.jsonl
+OUTPUT=outputs/qwen3-30b-tp1-v5-optD-profile
+VENV_PYTHON=.venv/bin/python
+RESULTS_FILE=$OUTPUT/sweep_results.txt
+POLL_INTERVAL=1.0
+
+mkdir -p $OUTPUT
+
+log() {
+  echo "[$(date '+%Y-%m-%d %H:%M:%S')] $*" | tee -a $RESULTS_FILE
+}
+
+save_result() {
+  local label=$1
+  local run_dir=$2
+  log "=== $label COMPLETED ==="
+  if [ -f "$run_dir/request-metrics.jsonl.summary.json" ]; then
+    log "Summary:"
+    cat "$run_dir/request-metrics.jsonl.summary.json" >> $RESULTS_FILE
+    echo "" >> $RESULTS_FILE
+    cp "$run_dir/request-metrics.jsonl.summary.json" "$OUTPUT/${label}_summary.json"
+    cp "$run_dir/request-metrics.jsonl" "$OUTPUT/${label}_metrics.jsonl"
+    if [ -f "$run_dir/d-pool-timeseries.jsonl" ]; then
+      cp "$run_dir/d-pool-timeseries.jsonl" "$OUTPUT/${label}_pool_timeseries.jsonl"
+      log "Pool timeseries: $(wc -l < $OUTPUT/${label}_pool_timeseries.jsonl) rows"
+    else
+      log "WARNING: no d-pool-timeseries.jsonl produced"
+    fi
+    log "Saved to $OUTPUT/${label}_summary.json + ${label}_metrics.jsonl + ${label}_pool_timeseries.jsonl"
+  else
+    log "WARNING: No summary file found in $run_dir"
+  fi
+}
+
+log "Starting TP1 v5 + profile sweep (Option D + ${POLL_INTERVAL}s pool polling)"
+log "Model: $MODEL"
+log "Trace: $TRACE (4449 requests, 52 sessions)"
+log "Profiling: --pool-poll-interval-s $POLL_INTERVAL (writes d-pool-timeseries.jsonl)"
+
+########################################
+# Experiment 1: 1P + 7D KVC kv-aware Option D + profile
+########################################
+log ""
+log "=== [EXP1] 1P7D KVC kv-aware Option D + profile ==="
+PYTHONPATH=src:third_party/sglang/python \
+$VENV_PYTHON -m agentic_pd_hybrid.cli benchmark-live \
+  --trace $TRACE \
+  --output-root $OUTPUT \
+  --mechanism kvcache-centric \
+  --policy kv-aware \
+  --model-path $MODEL \
+  --prefill-workers 1 --decode-workers 7 \
+  --prefill-tp-size 1 --decode-tp-size 1 \
+  --prefill-gpu-ids 0 --decode-gpu-ids 1,2,3,4,5,6,7 \
+  --transfer-backend mooncake \
+  --gpu-budget 8 \
+  --time-scale 10 \
+  --session-sample-rate 1.0 \
+  --target-duration-s 100000 \
+  --concurrency-limit 32 \
+  --timeout-s 900 \
+  --request-timeout-s 300 \
+  --kvcache-admission-mode worker \
+  --kvcache-seed-min-turn-id 1 \
+  --kvcache-seed-max-inflight-decode -1 \
+  --kvcache-prefill-backup-policy release-after-transfer \
+  --kvcache-prefill-priority-eviction \
+  --pool-poll-interval-s $POLL_INTERVAL
+
+EXP1_DIR=$(ls -td $OUTPUT/kvcache-centric-*/ 2>/dev/null | head -1 || true)
+save_result "exp1_1p7d_kvc_optD_profile" "$EXP1_DIR"
+
+########################################
+# Experiment 2: 2P + 6D KVC kv-aware Option D + profile
+########################################
+log ""
+log "=== [EXP2] 2P6D KVC kv-aware Option D + profile ==="
+PYTHONPATH=src:third_party/sglang/python \
+$VENV_PYTHON -m agentic_pd_hybrid.cli benchmark-live \
+  --trace $TRACE \
+  --output-root $OUTPUT \
+  --mechanism kvcache-centric \
+  --policy kv-aware \
+  --model-path $MODEL \
+  --prefill-workers 2 --decode-workers 6 \
+  --prefill-tp-size 1 --decode-tp-size 1 \
+  --prefill-gpu-ids 0,1 --decode-gpu-ids 2,3,4,5,6,7 \
+  --transfer-backend mooncake \
+  --gpu-budget 8 \
+  --time-scale 10 \
+  --session-sample-rate 1.0 \
+  --target-duration-s 100000 \
+  --concurrency-limit 32 \
+  --timeout-s 900 \
+  --request-timeout-s 300 \
+  --kvcache-admission-mode worker \
+  --kvcache-seed-min-turn-id 1 \
+  --kvcache-seed-max-inflight-decode -1 \
+  --kvcache-prefill-backup-policy release-after-transfer \
+  --kvcache-prefill-priority-eviction \
+  --pool-poll-interval-s $POLL_INTERVAL
+
+EXP2_DIR=$(ls -td $OUTPUT/kvcache-centric-*/ 2>/dev/null | head -1 || true)
+save_result "exp2_2p6d_kvc_optD_profile" "$EXP2_DIR"
+
+log ""
+log "=== ALL TP1 V5+PROFILE EXPERIMENTS DONE ==="
--- a/scripts/sweep_tp1_v6_p1_profile.sh
+++ b/scripts/sweep_tp1_v6_p1_profile.sh
@@ -0,0 +1,129 @@
+#!/bin/bash
+# v6 P1: re-run the v5 (Option D) config with the pool_breakdown instrument
+# (commit 4978c0d) so d-pool-timeseries.jsonl carries radix_protected /
+# slot_private / running_batch / {transfer,prealloc,retracted}_queue tokens.
+#
+# This is the same config as scripts/sweep_tp1_v5_optD_profile.sh but writes
+# to a separate output dir, leaving the pre-instrument v5+profile run intact
+# for before/after comparison.
+#
+# Output:
+#   outputs/qwen3-30b-tp1-v6-p1-profile/
+#     ├── kvcache-centric-kv-aware-worker-admission-<ts>/
+#     │   ├── request-metrics.jsonl
+#     │   ├── request-metrics.jsonl.summary.json
+#     │   └── d-pool-timeseries.jsonl   ← now with pool_breakdown fields
+#     ├── exp{1,2}_*_metrics.jsonl
+#     └── exp{1,2}_*_pool_timeseries.jsonl
+set -euo pipefail
+cd "$(dirname "$0")/.."
+
+MODEL=/mnt/kzlin/workflow/pd-hybrid/simm-swe-bench/models/Qwen3-30B-A3B-Instruct-2507
+TRACE=outputs/qwen35-swebench-50sess.jsonl
+OUTPUT=outputs/qwen3-30b-tp1-v6-p1-profile
+VENV_PYTHON=.venv/bin/python
+RESULTS_FILE=$OUTPUT/sweep_results.txt
+POLL_INTERVAL=1.0
+
+mkdir -p $OUTPUT
+
+log() {
+  echo "[$(date '+%Y-%m-%d %H:%M:%S')] $*" | tee -a $RESULTS_FILE
+}
+
+save_result() {
+  local label=$1
+  local run_dir=$2
+  log "=== $label COMPLETED ==="
+  if [ -f "$run_dir/request-metrics.jsonl.summary.json" ]; then
+    log "Summary:"
+    cat "$run_dir/request-metrics.jsonl.summary.json" >> $RESULTS_FILE
+    echo "" >> $RESULTS_FILE
+    cp "$run_dir/request-metrics.jsonl.summary.json" "$OUTPUT/${label}_summary.json"
+    cp "$run_dir/request-metrics.jsonl" "$OUTPUT/${label}_metrics.jsonl"
+    if [ -f "$run_dir/d-pool-timeseries.jsonl" ]; then
+      cp "$run_dir/d-pool-timeseries.jsonl" "$OUTPUT/${label}_pool_timeseries.jsonl"
+      log "Pool timeseries: $(wc -l < $OUTPUT/${label}_pool_timeseries.jsonl) rows"
+    else
+      log "WARNING: no d-pool-timeseries.jsonl produced"
+    fi
+    log "Saved to $OUTPUT/${label}_summary.json + ${label}_metrics.jsonl + ${label}_pool_timeseries.jsonl"
+  else
+    log "WARNING: No summary file found in $run_dir"
+  fi
+}
+
+log "Starting v6 P1 sweep (v5 Option D config + ${POLL_INTERVAL}s pool polling + pool_breakdown)"
+log "Model: $MODEL"
+log "Trace: $TRACE (4449 requests, 52 sessions)"
+log "Goal: capture pool_breakdown fields (radix_protected / slot_private / running_batch / queues)"
+log "      to decompose 'other' on the v5 baseline workload"
+
+########################################
+# Experiment 1: 1P + 7D KVC kv-aware Option D + profile
+########################################
+log ""
+log "=== [EXP1] 1P7D KVC kv-aware Option D + profile ==="
+PYTHONPATH=src:third_party/sglang/python \
+"$VENV_PYTHON" -m agentic_pd_hybrid.cli benchmark-live \
+  --trace "$TRACE" \
+  --output-root "$OUTPUT" \
+  --mechanism kvcache-centric \
+  --policy kv-aware \
+  --model-path "$MODEL" \
+  --prefill-workers 1 --decode-workers 7 \
+  --prefill-tp-size 1 --decode-tp-size 1 \
+  --prefill-gpu-ids 0 --decode-gpu-ids 1,2,3,4,5,6,7 \
+  --transfer-backend mooncake \
+  --gpu-budget 8 \
+  --time-scale 10 \
+  --session-sample-rate 1.0 \
+  --target-duration-s 100000 \
+  --concurrency-limit 32 \
+  --timeout-s 900 \
+  --request-timeout-s 300 \
+  --kvcache-admission-mode worker \
+  --kvcache-seed-min-turn-id 1 \
+  --kvcache-seed-max-inflight-decode -1 \
+  --kvcache-prefill-backup-policy release-after-transfer \
+  --kvcache-prefill-priority-eviction \
+  --pool-poll-interval-s $POLL_INTERVAL
+
+EXP1_DIR=$(ls -td "$OUTPUT"/kvcache-centric-*/ 2>/dev/null | head -1 || true)
+save_result "exp1_1p7d_kvc_v6_p1" "$EXP1_DIR"
+
+########################################
+# Experiment 2: 2P + 6D KVC kv-aware Option D + profile
+########################################
+log ""
+log "=== [EXP2] 2P6D KVC kv-aware Option D + profile ==="
+PYTHONPATH=src:third_party/sglang/python \
+"$VENV_PYTHON" -m agentic_pd_hybrid.cli benchmark-live \
+  --trace "$TRACE" \
+  --output-root "$OUTPUT" \
+  --mechanism kvcache-centric \
+  --policy kv-aware \
+  --model-path "$MODEL" \
+  --prefill-workers 2 --decode-workers 6 \
+  --prefill-tp-size 1 --decode-tp-size 1 \
+  --prefill-gpu-ids 0,1 --decode-gpu-ids 2,3,4,5,6,7 \
+  --transfer-backend mooncake \
+  --gpu-budget 8 \
+  --time-scale 10 \
+  --session-sample-rate 1.0 \
+  --target-duration-s 100000 \
+  --concurrency-limit 32 \
+  --timeout-s 900 \
+  --request-timeout-s 300 \
+  --kvcache-admission-mode worker \
+  --kvcache-seed-min-turn-id 1 \
+  --kvcache-seed-max-inflight-decode -1 \
+  --kvcache-prefill-backup-policy release-after-transfer \
+  --kvcache-prefill-priority-eviction \
+  --pool-poll-interval-s $POLL_INTERVAL
+
+EXP2_DIR=$(ls -td "$OUTPUT"/kvcache-centric-*/ 2>/dev/null | head -1 || true)
+save_result "exp2_2p6d_kvc_v6_p1" "$EXP2_DIR"
+
+log ""
+log "=== ALL v6 P1 EXPERIMENTS DONE ==="
--- a/src/agentic_pd_hybrid/benchmark.py
+++ b/src/agentic_pd_hybrid/benchmark.py
@@ -3,13 +3,20 @@ from __future__ import annotations
 import asyncio
 import json
 import signal
+import shutil
+from collections import Counter
 from dataclasses import asdict, dataclass, replace
 from datetime import UTC, datetime
 from pathlib import Path

 from agentic_pd_hybrid.replay import ReplayConfig, replay_trace
-from agentic_pd_hybrid.sampling import SessionSampleConfig, sample_trace_sessions
+from agentic_pd_hybrid.sampling import (
+    SessionSampleConfig,
+    SessionSampleSummary,
+    sample_trace_sessions,
+)
 from agentic_pd_hybrid.stack import ManagedPdStack, launch_pd_stack
+from agentic_pd_hybrid.trace import load_trace
 from agentic_pd_hybrid.topology import SingleNodeTopology


@@ -43,12 +50,18 @@ class BenchmarkConfig:
    kvcache_prefill_priority_eviction: bool = False
    kvcache_prefill_direct_priority: int = -100
    kvcache_prefill_normal_priority: int = 100
+    pool_poll_interval_s: float = 0.0
+    pool_poll_include_sessions: bool = True
+    enable_backpressure: bool = False
+    backpressure_max_pause_s: float = 2.0
+    progress_interval_s: float = 30.0
    sample_profile: str = "default"
    min_initial_input_tokens: int | None = None
    max_initial_input_tokens: int | None = None
    max_append_input_tokens: int | None = None
    max_output_tokens: int | None = None
    min_overlap_ratio: float | None = None
+    use_trace_as_sample: bool = False
    launch_stack: bool = True


@@ -90,6 +103,21 @@ def run_live_benchmark(config: BenchmarkConfig) -> BenchmarkArtifacts:
        )

    sampled_trace_path = run_dir / "sampled-trace.jsonl"
+    if config.use_trace_as_sample:
+        shutil.copyfile(config.trace_path, sampled_trace_path)
+        sample_summary = _summarize_trace_sample(
+            input_trace_path=config.trace_path,
+            sampled_trace_path=sampled_trace_path,
+            profile=config.sample_profile,
+            session_sample_rate=config.session_sample_rate,
+            min_turns=config.min_turns,
+            min_initial_input_tokens=config.min_initial_input_tokens,
+            max_initial_input_tokens=config.max_initial_input_tokens,
+            max_append_input_tokens=config.max_append_input_tokens,
+            max_output_tokens=config.max_output_tokens,
+            min_overlap_ratio=config.min_overlap_ratio,
+        )
+    else:
        sample_summary = sample_trace_sessions(
            SessionSampleConfig(
                trace_path=config.trace_path,
@@ -119,6 +147,8 @@ def run_live_benchmark(config: BenchmarkConfig) -> BenchmarkArtifacts:
    try:
        signal.signal(signal.SIGINT, _handle_termination)
        signal.signal(signal.SIGTERM, _handle_termination)
+        _mechanisms_with_router = {"pd-disaggregation", "kvcache-centric", "pd-colo"}
+        _naive_dp = config.mechanism_name == "pd-colo"
        if config.launch_stack:
            stack = launch_pd_stack(
                topology=topology,
@@ -132,18 +162,19 @@ def run_live_benchmark(config: BenchmarkConfig) -> BenchmarkArtifacts:
                    else config.timeout_s
                ),
                include_router=(
-                    config.mechanism_name in {"pd-disaggregation", "kvcache-centric"}
+                    config.mechanism_name in _mechanisms_with_router
                ),
+                naive_dp=_naive_dp,
            )
            router_url = (
                stack.router_url
-                if config.mechanism_name in {"pd-disaggregation", "kvcache-centric"}
+                if config.mechanism_name in _mechanisms_with_router
                else None
            )
        else:
            router_url = (
                topology.router_url
-                if config.mechanism_name in {"pd-disaggregation", "kvcache-centric"}
+                if config.mechanism_name in _mechanisms_with_router
                else None
            )

@@ -187,6 +218,11 @@ def run_live_benchmark(config: BenchmarkConfig) -> BenchmarkArtifacts:
            ),
            kvcache_prefill_direct_priority=config.kvcache_prefill_direct_priority,
            kvcache_prefill_normal_priority=config.kvcache_prefill_normal_priority,
+            pool_poll_interval_s=config.pool_poll_interval_s,
+            pool_poll_include_sessions=config.pool_poll_include_sessions,
+            enable_backpressure=config.enable_backpressure,
+            backpressure_max_pause_s=config.backpressure_max_pause_s,
+            progress_interval_s=config.progress_interval_s,
        )
        if config.request_timeout_s is not None:
            replay_config = replace(
@@ -243,12 +279,18 @@ def run_live_benchmark(config: BenchmarkConfig) -> BenchmarkArtifacts:
                "kvcache_prefill_normal_priority": (
                    config.kvcache_prefill_normal_priority
                ),
+                "pool_poll_interval_s": config.pool_poll_interval_s,
+                "pool_poll_include_sessions": config.pool_poll_include_sessions,
+                "enable_backpressure": config.enable_backpressure,
+                "backpressure_max_pause_s": config.backpressure_max_pause_s,
+                "progress_interval_s": config.progress_interval_s,
                "sample_profile": config.sample_profile,
                "min_initial_input_tokens": config.min_initial_input_tokens,
                "max_initial_input_tokens": config.max_initial_input_tokens,
                "max_append_input_tokens": config.max_append_input_tokens,
                "max_output_tokens": config.max_output_tokens,
                "min_overlap_ratio": config.min_overlap_ratio,
+                "use_trace_as_sample": config.use_trace_as_sample,
                "sample_summary": asdict(sample_summary),
                "topology": {
                    "model_path": config.topology.model_path,
@@ -295,3 +337,44 @@ def _header_mode_for(policy_name: str) -> str:
    if policy_name == "kv-aware":
        return "target-worker"
    return "none"
+
+
+def _summarize_trace_sample(
+    *,
+    input_trace_path: Path,
+    sampled_trace_path: Path,
+    profile: str,
+    session_sample_rate: float,
+    min_turns: int,
+    min_initial_input_tokens: int | None,
+    max_initial_input_tokens: int | None,
+    max_append_input_tokens: int | None,
+    max_output_tokens: int | None,
+    min_overlap_ratio: float | None,
+) -> SessionSampleSummary:
+    requests = load_trace(sampled_trace_path)
+    if not requests:
+        raise ValueError(f"Trace sample is empty: {sampled_trace_path}")
+    session_turns = Counter(request.session_id for request in requests)
+    start_time_s = min(request.timestamp_s for request in requests)
+    end_time_s = max(request.timestamp_s for request in requests)
+    return SessionSampleSummary(
+        input_trace_path=str(input_trace_path),
+        output_trace_path=str(sampled_trace_path),
+        request_count=len(requests),
+        session_count=len(session_turns),
+        multi_turn_session_count=sum(1 for turns in session_turns.values() if turns > 1),
+        start_time_s=start_time_s,
+        end_time_s=end_time_s,
+        sampled_duration_s=end_time_s - start_time_s,
+        session_sample_rate=session_sample_rate,
+        min_turns=min_turns,
+        profile=profile,
+        min_initial_input_tokens=min_initial_input_tokens,
+        max_initial_input_tokens=max_initial_input_tokens,
+        max_append_input_tokens=max_append_input_tokens,
+        max_output_tokens=max_output_tokens,
+        min_overlap_ratio=min_overlap_ratio,
+        mean_append_input_tokens=None,
+        mean_turn_overlap_ratio=None,
+    )
--- a/src/agentic_pd_hybrid/cli.py
+++ b/src/agentic_pd_hybrid/cli.py
@@ -2,6 +2,7 @@ from __future__ import annotations

 import argparse
 import asyncio
+import shlex
 from pathlib import Path

 from agentic_pd_hybrid.benchmark import BenchmarkConfig, run_live_benchmark
@@ -228,6 +229,47 @@ def main() -> None:
    )
    replay.add_argument("--kvcache-prefill-direct-priority", type=int, default=-100)
    replay.add_argument("--kvcache-prefill-normal-priority", type=int, default=100)
+    replay.add_argument(
+        "--pool-poll-interval-s",
+        type=float,
+        default=0.0,
+        help=(
+            "Poll each P/D worker's /server_info every N seconds and write a "
+            "time-series snapshot to <run_dir>/d-pool-timeseries.jsonl. "
+            "0 disables polling."
+        ),
+    )
+    replay.add_argument(
+        "--pool-poll-no-sessions",
+        action="store_true",
+        help=(
+            "Disable per-session detail in the pool timeseries (smaller files)."
+        ),
+    )
+    replay.add_argument(
+        "--enable-backpressure",
+        action="store_true",
+        help=(
+            "Honor recommended_pause_ms hints from D's admission endpoint. "
+            "When set, replay sleeps before issuing requests to a saturated D. "
+            "Off by default, which preserves baseline behavior."
+        ),
+    )
+    replay.add_argument(
+        "--backpressure-max-pause-s",
+        type=float,
+        default=2.0,
+        help="Cap on per-request backpressure sleep, regardless of D hint.",
+    )
+    replay.add_argument(
+        "--progress-interval-s",
+        type=float,
+        default=30.0,
+        help=(
+            "Write client-side replay progress to <output_dir>/replay-progress.jsonl "
+            "every N seconds. 0 disables the heartbeat."
+        ),
+    )

    sample = subparsers.add_parser(
        "sample-sessions",
@@ -439,6 +481,45 @@ def main() -> None:
    )
    benchmark.add_argument("--kvcache-prefill-direct-priority", type=int, default=-100)
    benchmark.add_argument("--kvcache-prefill-normal-priority", type=int, default=100)
+    benchmark.add_argument(
+        "--pool-poll-interval-s",
+        type=float,
+        default=0.0,
+        help=(
+            "Poll each P/D worker's /server_info every N seconds and write a "
+            "time-series snapshot to <run_dir>/d-pool-timeseries.jsonl. "
+            "0 disables polling."
+        ),
+    )
+    benchmark.add_argument(
+        "--pool-poll-no-sessions",
+        action="store_true",
+        help=(
+            "Disable per-session detail in the pool timeseries (smaller files)."
+        ),
+    )
+    benchmark.add_argument(
+        "--enable-backpressure",
+        action="store_true",
+        help=(
+            "Honor recommended_pause_ms hints from D's admission endpoint."
+        ),
+    )
+    benchmark.add_argument(
+        "--backpressure-max-pause-s",
+        type=float,
+        default=2.0,
+        help="Cap on per-request backpressure sleep, regardless of D hint.",
+    )
+    benchmark.add_argument(
+        "--progress-interval-s",
+        type=float,
+        default=30.0,
+        help=(
+            "Write client-side replay progress to <run_dir>/replay-progress.jsonl "
+            "every N seconds. 0 disables the heartbeat."
+        ),
+    )
    benchmark.add_argument(
        "--sample-profile",
        choices=["default", "small-append"],
@@ -450,16 +531,31 @@ def main() -> None:
    benchmark.add_argument("--max-append-input-tokens", type=int, default=None)
    benchmark.add_argument("--max-output-tokens", type=int, default=None)
    benchmark.add_argument("--min-overlap-ratio", type=float, default=None)
+    benchmark.add_argument(
+        "--use-trace-as-sample",
+        action="store_true",
+        help=(
+            "Replay the provided --trace exactly instead of sampling sessions into "
+            "a new trace. Use this for prebuilt real-workload samples."
+        ),
+    )

    args = parser.parse_args()

    if args.command == "print-launch":
        topology = _topology_from_args(args)
+        has_pd = bool(topology.prefill_workers and topology.decode_workers)
+        has_direct_only = bool(
+            topology.direct_workers
+            and not topology.prefill_workers
+            and not topology.decode_workers
+        )
        plan = build_launch_plan(
            topology,
            prefill_policy=args.prefill_policy,
            decode_policy=args.decode_policy,
-            include_router=bool(topology.prefill_workers and topology.decode_workers),
+            include_router=has_pd or has_direct_only,
+            naive_dp=has_direct_only,
        )
        print(plan.render())
        return
@@ -513,6 +609,11 @@ def main() -> None:
            ),
            kvcache_prefill_direct_priority=args.kvcache_prefill_direct_priority,
            kvcache_prefill_normal_priority=args.kvcache_prefill_normal_priority,
+            pool_poll_interval_s=args.pool_poll_interval_s,
+            pool_poll_include_sessions=not args.pool_poll_no_sessions,
+            enable_backpressure=args.enable_backpressure,
+            backpressure_max_pause_s=args.backpressure_max_pause_s,
+            progress_interval_s=args.progress_interval_s,
        )
        results = asyncio.run(replay_trace(config))
        print(
@@ -655,12 +756,18 @@ def main() -> None:
                kvcache_prefill_normal_priority=(
                    args.kvcache_prefill_normal_priority
                ),
+                pool_poll_interval_s=args.pool_poll_interval_s,
+                pool_poll_include_sessions=not args.pool_poll_no_sessions,
+                enable_backpressure=args.enable_backpressure,
+                backpressure_max_pause_s=args.backpressure_max_pause_s,
+                progress_interval_s=args.progress_interval_s,
                sample_profile=args.sample_profile,
                min_initial_input_tokens=args.min_initial_input_tokens,
                max_initial_input_tokens=args.max_initial_input_tokens,
                max_append_input_tokens=args.max_append_input_tokens,
                max_output_tokens=args.max_output_tokens,
                min_overlap_ratio=args.min_overlap_ratio,
+                use_trace_as_sample=args.use_trace_as_sample,
                launch_stack=True,
            )
        )
@@ -720,6 +827,26 @@ def _add_topology_arguments(parser: argparse.ArgumentParser) -> None:
        "--no-trust-remote-code",
        action="store_true",
    )
+    parser.add_argument(
+        "--extra-server-args",
+        default="",
+        help="Extra arguments appended to every sglang.launch_server command.",
+    )
+    parser.add_argument(
+        "--prefill-extra-server-args",
+        default="",
+        help="Extra arguments appended only to prefill launch_server commands.",
+    )
+    parser.add_argument(
+        "--decode-extra-server-args",
+        default="",
+        help="Extra arguments appended only to decode launch_server commands.",
+    )
+    parser.add_argument(
+        "--direct-extra-server-args",
+        default="",
+        help="Extra arguments appended only to direct launch_server commands.",
+    )


 def _topology_from_args(args: argparse.Namespace):
@@ -749,7 +876,13 @@ def _topology_from_args(args: argparse.Namespace):
        force_rdma=args.force_rdma,
        trust_remote_code=not args.no_trust_remote_code,
        ib_device=args.ib_device,
-        direct_extra_server_args=("--enable-streaming-session",),
+        extra_server_args=tuple(shlex.split(args.extra_server_args)),
+        prefill_extra_server_args=tuple(shlex.split(args.prefill_extra_server_args)),
+        decode_extra_server_args=tuple(shlex.split(args.decode_extra_server_args)),
+        direct_extra_server_args=(
+            "--enable-streaming-session",
+            *tuple(shlex.split(args.direct_extra_server_args)),
+        ),
    )


--- a/src/agentic_pd_hybrid/launcher.py
+++ b/src/agentic_pd_hybrid/launcher.py
@@ -34,7 +34,24 @@ def build_launch_plan(
    decode_policy: str = "manual",
    include_router: bool = True,
    router_request_timeout_s: float | None = None,
+    naive_dp: bool = False,
 ) -> LaunchPlan:
+    router_command: tuple[str, ...] | None = None
+    if include_router:
+        if topology.prefill_workers and topology.decode_workers:
+            router_command = _build_router_command(
+                topology,
+                prefill_policy=prefill_policy,
+                decode_policy=decode_policy,
+                request_timeout_s=router_request_timeout_s,
+            )
+        elif naive_dp and topology.direct_workers:
+            router_command = _build_dp_router_command(
+                topology,
+                backend_policy=decode_policy,
+                request_timeout_s=router_request_timeout_s,
+            )
+
    return LaunchPlan(
        prefill_commands=tuple(
            _build_server_command(topology, worker) for worker in topology.prefill_workers
@@ -43,24 +60,17 @@ def build_launch_plan(
            _build_server_command(topology, worker) for worker in topology.decode_workers
        ),
        direct_commands=tuple(
-            _build_server_command(topology, worker) for worker in topology.direct_workers
-        ),
-        router_command=(
-            _build_router_command(
-                topology,
-                prefill_policy=prefill_policy,
-                decode_policy=decode_policy,
-                request_timeout_s=router_request_timeout_s,
-            )
-            if include_router and topology.prefill_workers and topology.decode_workers
-            else None
+            _build_server_command(topology, worker, naive_dp=naive_dp)
+            for worker in topology.direct_workers
        ),
+        router_command=router_command,
    )


 def _build_server_command(
    topology: SingleNodeTopology,
    worker: WorkerSpec,
+    naive_dp: bool = False,
 ) -> tuple[str, ...]:
    command = [
        sys.executable,
@@ -76,11 +86,15 @@ def _build_server_command(
        str(worker.port),
        "--base-gpu-id",
        str(worker.gpu_id),
+    ]
+    # Naive DP direct workers: no disaggregation flags at all
+    if not (naive_dp and worker.role == "direct"):
+        command.extend([
            "--disaggregation-mode",
            _disaggregation_mode_for(worker),
            "--disaggregation-transfer-backend",
            topology.transfer_backend,
-    ]
+        ])
    if worker.tp_size > 1:
        command.extend(["--tp-size", str(worker.tp_size)])
    if topology.trust_remote_code:
@@ -135,6 +149,32 @@ def _build_router_command(
    return tuple(command)


+def _build_dp_router_command(
+    topology: SingleNodeTopology,
+    *,
+    backend_policy: str,
+    request_timeout_s: float | None,
+) -> tuple[str, ...]:
+    command: list[str] = [
+        sys.executable,
+        "-B",
+        "-u",
+        "-m",
+        "agentic_pd_hybrid.pd_router",
+        "--host",
+        topology.router_host,
+        "--port",
+        str(topology.router_port),
+        "--backend-policy",
+        backend_policy,
+    ]
+    if request_timeout_s is not None:
+        command.extend(["--request-timeout-s", str(request_timeout_s)])
+    for worker in topology.direct_workers:
+        command.extend(["--backend", worker.url])
+    return tuple(command)
+
+
 def _render_named_command(name: str, command: tuple[str, ...]) -> str:
    return f"# {name}\n" + " ".join(shlex.quote(part) for part in command)

--- a/src/agentic_pd_hybrid/metrics.py
+++ b/src/agentic_pd_hybrid/metrics.py
@@ -43,6 +43,9 @@ class RequestMetrics:
    ttft_s: float | None
    tpot_s: float | None
    error: str | None = None
+    actual_output_tokens: int | None = None
+    requested_output_tokens: int | None = None
+    finish_reason: str | None = None

    @classmethod
    def from_decision(
@@ -63,6 +66,9 @@ class RequestMetrics:
        prefill_request_priority: int | None = None,
        decode_request_priority: int | None = None,
        error: str | None = None,
+        actual_output_tokens: int | None = None,
+        requested_output_tokens: int | None = None,
+        finish_reason: str | None = None,
    ) -> "RequestMetrics":
        return cls(
            request_id=request.request_id,
@@ -95,6 +101,9 @@ class RequestMetrics:
            ttft_s=ttft_s,
            tpot_s=tpot_s,
            error=error,
+            actual_output_tokens=actual_output_tokens,
+            requested_output_tokens=requested_output_tokens,
+            finish_reason=finish_reason,
        )


@@ -158,6 +167,17 @@ def write_summary_json(
            str(key): value for key, value in sorted(decode_priorities.items())
        },
        "error_count": sum(1 for row in rows if row.error is not None),
+        "truncated_request_count": sum(
+            1
+            for row in rows
+            if row.actual_output_tokens is not None
+            and row.requested_output_tokens is not None
+            and row.requested_output_tokens > 1
+            and row.actual_output_tokens < row.requested_output_tokens * 0.5
+        ),
+        "actual_output_tokens_stats": _stats(
+            [float(row.actual_output_tokens) for row in rows if row.actual_output_tokens is not None]
+        ),
    }
    path.parent.mkdir(parents=True, exist_ok=True)
    with path.open("w", encoding="utf-8") as handle:
--- a/src/agentic_pd_hybrid/pd_router.py
+++ b/src/agentic_pd_hybrid/pd_router.py
@@ -74,8 +74,58 @@ class RouterState:
        return idx


+@dataclass
+class DpRouterConfig:
+    host: str
+    port: int
+    backend_urls: list[str]
+    backend_policy: str = "round_robin"
+    request_timeout_s: float = 1800.0
+
+
+class DpRouterState:
+    """DP (data-parallel) router: forward each request to exactly one backend."""
+
+    def __init__(self, config: DpRouterConfig):
+        if not config.backend_urls:
+            raise ValueError("At least one backend worker is required")
+        self.config = config
+        self.cursor = 0
+        self.sticky_map: dict[str, int] = {}
+
+    def select_backend(self, headers: dict[str, str]) -> str:
+        idx = self._select_index(headers)
+        return self.config.backend_urls[idx]
+
+    def _select_index(self, headers: dict[str, str]) -> int:
+        target_worker = headers.get("x-smg-target-worker")
+        routing_key = headers.get("x-smg-routing-key")
+
+        if (
+            self.config.backend_policy == "consistent_hashing"
+            and target_worker is not None and target_worker.strip().isdecimal()
+        ):
+            idx = int(target_worker)  # safe: the header was checked to be a decimal string
+            if idx < len(self.config.backend_urls):
+                return idx
+
+        if self.config.backend_policy == "manual" and routing_key:
+            cached = self.sticky_map.get(routing_key)
+            if cached is not None:
+                return cached
+            idx = self.cursor % len(self.config.backend_urls)
+            self.cursor += 1
+            self.sticky_map[routing_key] = idx
+            return idx
+
+        idx = self.cursor % len(self.config.backend_urls)
+        self.cursor += 1
+        return idx
+
+
 app = FastAPI()
 router_state: RouterState | None = None
+dp_state: DpRouterState | None = None


@app.get("/health")
@@ -85,6 +135,16 @@ async def health() -> Response:

@app.get("/health_generate")
 async def health_generate() -> Response:
+    if dp_state is not None:
+        async with aiohttp.ClientSession() as session:
+            tasks = [
+                session.get(f"{url}/health_generate")
+                for url in dp_state.config.backend_urls
+            ]
+            for response in asyncio.as_completed(tasks):
+                async with await response:
+                    pass
+        return Response(status_code=200)
    state = _require_state()
    async with aiohttp.ClientSession() as session:
        tasks = []
@@ -101,6 +161,11 @@ async def health_generate() -> Response:

@app.get("/v1/models")
 async def models() -> ORJSONResponse:
+    if dp_state is not None:
+        async with aiohttp.ClientSession() as session:
+            async with session.get(f"{dp_state.config.backend_urls[0]}/v1/models") as resp:
+                payload = await resp.json()
+                return ORJSONResponse(payload, status_code=resp.status)
    state = _require_state()
    async with aiohttp.ClientSession() as session:
        async with session.get(f"{state.config.prefill_urls[0][0]}/v1/models") as response:
@@ -147,6 +212,15 @@ async def _forward_to_backend(
    headers: dict[str, str],
    endpoint_name: str,
 ) -> Response:
+    # DP mode: forward to a single backend
+    if dp_state is not None:
+        return await _forward_to_dp_backend(
+            request_data=request_data,
+            headers=headers,
+            endpoint_name=endpoint_name,
+        )
+
+    # PD mode: coordinate prefill + decode
    state = _require_state()
    prefill_server, bootstrap_port, decode_server = state.select_pair(headers)
    prefill_request, decode_request = _build_backend_requests(
@@ -186,6 +260,63 @@ async def _forward_to_backend(
            )


+async def _forward_to_dp_backend(
+    *,
+    request_data: dict,
+    headers: dict[str, str],
+    endpoint_name: str,
+) -> Response:
+    assert dp_state is not None
+    backend_server = dp_state.select_backend(headers)
+    cleaned = _strip_internal_fields(request_data)
+    timeout_s = dp_state.config.request_timeout_s
+
+    if request_data.get("stream", False):
+        return StreamingResponse(
+            _stream_dp_generate(
+                request_data=cleaned,
+                backend_server=backend_server,
+                endpoint_name=endpoint_name,
+                timeout_s=timeout_s,
+            ),
+            media_type="text/event-stream",
+        )
+
+    async with aiohttp.ClientSession(
+        timeout=aiohttp.ClientTimeout(total=timeout_s)
+    ) as session:
+        async with session.post(
+            f"{backend_server}/{endpoint_name}", json=cleaned
+        ) as response:
+            body = await response.read()
+            return Response(
+                content=body,
+                status_code=response.status,
+                media_type=response.content_type,
+            )
+
+
+async def _stream_dp_generate(
+    *,
+    request_data: dict,
+    backend_server: str,
+    endpoint_name: str,
+    timeout_s: float,
+) -> AsyncIterator[bytes]:
+    async with aiohttp.ClientSession(
+        timeout=aiohttp.ClientTimeout(total=timeout_s)
+    ) as session:
+        async with session.post(
+            f"{backend_server}/{endpoint_name}", json=request_data
+        ) as response:
+            if response.status != HTTPStatus.OK:
+                payload = await response.read()
+                yield payload
+                return
+            async for chunk in response.content.iter_chunked(_STREAM_CHUNK_SIZE):
+                yield chunk
+
+
 async def _stream_generate(
    *,
    prefill_request: dict,
@@ -241,6 +372,12 @@ def _build_backend_requests(
    prefill_request.update(bootstrap_payload)
    decode_request.update(bootstrap_payload)

+    # session_params is only meaningful for the decode worker (streaming session
+    # KV reuse).  Sending it to the prefill worker causes the D side to
+    # short-circuit with local-prefill on already-open sessions, returning
+    # truncated responses while P's KV transfer gets aborted.
+    prefill_request.pop("session_params", None)
+
    if prefill_priority is not None:
        prefill_request["priority"] = int(prefill_priority)
    if decode_priority is not None:
@@ -262,7 +399,7 @@ def _require_state() -> RouterState:


 def main() -> None:
-    parser = argparse.ArgumentParser(description="Minimal local PD router")
+    parser = argparse.ArgumentParser(description="Minimal local PD / DP router")
    parser.add_argument("--host", default="127.0.0.1")
    parser.add_argument("--port", type=int, default=8000)
    parser.add_argument(
@@ -270,19 +407,44 @@ def main() -> None:
        nargs=2,
        metavar=("URL", "BOOTSTRAP_PORT"),
        action="append",
-        required=True,
+        default=None,
    )
    parser.add_argument(
        "--decode",
        action="append",
-        required=True,
+        default=None,
    )
    parser.add_argument("--prefill-policy", default="round_robin")
    parser.add_argument("--decode-policy", default="manual")
+    parser.add_argument(
+        "--backend",
+        action="append",
+        default=None,
+        help="Backend URL for DP (data-parallel) mode. Repeat for each worker.",
+    )
+    parser.add_argument(
+        "--backend-policy",
+        default="round_robin",
+        help="Routing policy for DP mode: round_robin, manual, consistent_hashing.",
+    )
    parser.add_argument("--request-timeout-s", type=float, default=1800.0)
    args = parser.parse_args()

-    global router_state
+    global router_state, dp_state
+
+    if args.backend:
+        # DP mode: simple forward to one of N backends
+        dp_state = DpRouterState(
+            DpRouterConfig(
+                host=args.host,
+                port=args.port,
+                backend_urls=list(args.backend),
+                backend_policy=args.backend_policy,
+                request_timeout_s=args.request_timeout_s,
+            )
+        )
+    elif args.prefill and args.decode:
+        # PD mode: prefill/decode coordination
        router_state = RouterState(
            RouterConfig(
                host=args.host,
@@ -294,6 +456,9 @@ def main() -> None:
                request_timeout_s=args.request_timeout_s,
            )
        )
+    else:
+        parser.error("Either --backend (DP mode) or both --prefill and --decode (PD mode) are required")
+
    uvicorn.run(app, host=args.host, port=args.port, log_level="info")


--- a/src/agentic_pd_hybrid/replay.py
+++ b/src/agentic_pd_hybrid/replay.py
--- a/src/agentic_pd_hybrid/stack.py
+++ b/src/agentic_pd_hybrid/stack.py
@@ -66,6 +66,7 @@ def launch_pd_stack(
    timeout_s: float = 1200.0,
    router_request_timeout_s: float | None = None,
    include_router: bool = True,
+    naive_dp: bool = False,
 ) -> ManagedPdStack:
    run_dir.mkdir(parents=True, exist_ok=True)
    logs_dir = run_dir / "logs"
@@ -77,6 +78,7 @@ def launch_pd_stack(
        decode_policy=decode_policy,
        include_router=include_router,
        router_request_timeout_s=router_request_timeout_s,
+        naive_dp=naive_dp,
    )

    prefill_processes = [
@@ -195,6 +197,9 @@ def _build_process_env(topology: SingleNodeTopology) -> dict[str, str]:
        env["MC_MS_AUTO_DISC"] = "0"
        if topology.ib_device:
            env["MOONCAKE_DEVICE"] = topology.ib_device
+    elif topology.transfer_backend == "mooncake":
+        # Default to TCP when RDMA is not forced (e.g. loopback on same node)
+        env.setdefault("MOONCAKE_PROTOCOL", "tcp")

    repo_root = Path(__file__).resolve().parents[2]
    python_paths = [
--- a/third_party/sglang/python/sglang/srt/distributed/device_communicators/mooncake_transfer_engine.py
+++ b/third_party/sglang/python/sglang/srt/distributed/device_communicators/mooncake_transfer_engine.py
@@ -189,10 +189,11 @@ class MooncakeTransferEngine:
                device_name if device_name is not None else "",
            )
        else:
+            protocol = os.environ.get("MOONCAKE_PROTOCOL", "rdma")
            ret_value = self.engine.initialize(
                hostname,
                "P2PHANDSHAKE",
-                "rdma",
+                protocol,
                device_name if device_name is not None else "",
            )
        if ret_value != 0:
--- a/third_party/sglang/python/sglang/srt/managers/io_struct.py
+++ b/third_party/sglang/python/sglang/srt/managers/io_struct.py
@@ -1602,6 +1602,9 @@ class DirectAppendAdmissionReqInput(BaseReq):
    session_id: str
    uncached_input_tokens: int
    output_tokens: int
+    # "direct_append": existing behavior — require session resident on this D
+    # "seed": new admission for session not yet resident; do capacity check + LRU eviction
+    mode: str = "direct_append"


@dataclass
@@ -1619,6 +1622,9 @@ class DirectAppendAdmissionReqOutput(BaseReq):
    decode_prealloc_queue_reqs: int = 0
    decode_transfer_queue_reqs: int = 0
    decode_retracted_queue_reqs: int = 0
+    # Backpressure hint: if > 0, the caller should pause this many ms before
+    # sending more requests to this D. Computed from transfer-queue depth.
+    recommended_pause_ms: int = 0


@dataclass
--- a/third_party/sglang/python/sglang/srt/managers/scheduler.py
+++ b/third_party/sglang/python/sglang/srt/managers/scheduler.py
@@ -3181,6 +3181,89 @@ class Scheduler(
            success = False
        return success

+    def _compute_pool_breakdown_for_diagnostics(self) -> dict:
+        """Read-only KV pool decomposition for the agentic-pd-hybrid profiler.
+
+        Decomposes capacity into:
+          - radix_evictable_tokens / radix_protected_tokens: tree-managed
+          - slot_private_held_tokens: SessionAwareCache out-of-tree slot holds
+          - running_batch_kv_tokens: kv_allocated_len of currently-decoding reqs
+            (overlaps with radix_protected; not additive)
+          - {transfer,prealloc,retracted}_queue_{reqs,tokens}: disagg queues
+          - available_tokens: free pool
+
+        The caller computes "unaccounted = capacity - sum_of_known" to find leakage.
+        Best-effort: a component that cannot be read is omitted from the result.
+        """
+        breakdown: dict = {
+            "capacity_tokens": int(self.max_total_num_tokens or 0),
+            "available_tokens": int(self.token_to_kv_pool_allocator.available_size()),
+        }
+
+        # Radix tree (works for SessionAwareCache and most inner caches)
+        try:
+            ev = self.tree_cache.evictable_size()
+            pr = self.tree_cache.protected_size()
+            if isinstance(ev, tuple):
+                ev = ev[0]
+            if isinstance(pr, tuple):
+                pr = pr[0]
+            breakdown["radix_evictable_tokens"] = int(ev or 0)
+            breakdown["radix_protected_tokens"] = int(pr or 0)
+        except Exception:
+            pass
+
+        # SessionAwareCache slot-private holds (already in session_cache.held_tokens
+        # but mirrored here for one-stop decomposition)
+        try:
+            from sglang.srt.mem_cache.session_aware_cache import SessionAwareCache
+            if isinstance(self.tree_cache, SessionAwareCache):
+                breakdown["slot_private_held_tokens"] = int(
+                    self.tree_cache.session_held_tokens()
+                )
+                breakdown["session_slot_count"] = int(
+                    self.tree_cache.session_held_req_count()
+                )
+        except Exception:
+            pass
+
+        # Running batch KV (overlaps with radix_protected for tree-tracked reqs)
+        try:
+            running_reqs = self.running_batch.reqs
+            breakdown["running_batch_reqs"] = len(running_reqs)
+            breakdown["running_batch_kv_tokens"] = sum(
+                int(getattr(req, "kv_allocated_len", 0) or 0)
+                for req in running_reqs
+            )
+        except Exception:
+            pass
+
+        # Disagg decode queues
+        if self.disaggregation_mode == DisaggregationMode.DECODE:
+            try:
+                tq = self.disagg_decode_transfer_queue.queue
+                pq = self.disagg_decode_prealloc_queue.queue
+                rq = self.disagg_decode_prealloc_queue.retracted_queue
+                breakdown["transfer_queue_reqs"] = len(tq)
+                breakdown["transfer_queue_tokens"] = sum(
+                    int(getattr(getattr(dr, "req", None), "kv_allocated_len", 0) or 0)
+                    for dr in tq
+                )
+                breakdown["prealloc_queue_reqs"] = len(pq)
+                breakdown["prealloc_queue_tokens"] = sum(
+                    int(getattr(getattr(dr, "req", None), "kv_allocated_len", 0) or 0)
+                    for dr in pq
+                )
+                breakdown["retracted_queue_reqs"] = len(rq)
+                breakdown["retracted_queue_tokens"] = sum(
+                    int(getattr(req, "kv_allocated_len", 0) or 0)
+                    for req in rq
+                )
+            except Exception:
+                pass
+
+        return breakdown
+
    def get_internal_state(self, recv_req: GetInternalStateReq):
        ret = vars(get_global_server_args())
        ret["last_gen_throughput"] = self.last_gen_throughput
@@ -3196,6 +3279,7 @@ class Scheduler(
        ret["session_cache"] = (
            self.session_controller.get_streaming_session_cache_status()
        )
+        ret["pool_breakdown"] = self._compute_pool_breakdown_for_diagnostics()

        if not self.spec_algorithm.is_none() and self.spec_total_num_forward_ct > 0:
            ret["avg_spec_accept_length"] = (
@@ -3508,6 +3592,9 @@ class Scheduler(
                reason="unsupported",
            )

+        mode = getattr(recv_req, "mode", "direct_append") or "direct_append"
+        is_seed = mode == "seed"
+
        session_cache_status = self.session_controller.get_streaming_session_cache_status(
            recv_req.session_id
        )
@@ -3515,27 +3602,28 @@ class Scheduler(
        resident = bool(
            isinstance(target_session, dict) and target_session.get("resident")
        )
-        if not resident:
+        if not resident and not is_seed:
+            # direct_append requires the session already resident on this D.
+            # For seed we skip this check and let capacity decide.
+            transfer_queue_depth = len(self.disagg_decode_transfer_queue.queue)
+            retracted_queue_depth = len(self.disagg_decode_prealloc_queue.retracted_queue)
+            available_size = int(self.token_to_kv_pool_allocator.available_size())
+            token_usage = 1.0 - available_size / max(1, self.max_total_num_tokens)
            return DirectAppendAdmissionReqOutput(
                can_admit=False,
                resident=False,
                reason="session-not-resident",
-                available_tokens_before=int(
-                    self.token_to_kv_pool_allocator.available_size()
-                ),
-                available_tokens_after=int(
-                    self.token_to_kv_pool_allocator.available_size()
-                ),
-                token_usage=(
-                    1.0
-                    - self.token_to_kv_pool_allocator.available_size()
-                    / max(1, self.max_total_num_tokens)
-                ),
+                available_tokens_before=available_size,
+                available_tokens_after=available_size,
+                token_usage=token_usage,
                num_running_reqs=len(self.running_batch.reqs),
                decode_prealloc_queue_reqs=len(self.disagg_decode_prealloc_queue.queue),
-                decode_transfer_queue_reqs=len(self.disagg_decode_transfer_queue.queue),
-                decode_retracted_queue_reqs=len(
-                    self.disagg_decode_prealloc_queue.retracted_queue
+                decode_transfer_queue_reqs=transfer_queue_depth,
+                decode_retracted_queue_reqs=retracted_queue_depth,
+                recommended_pause_ms=self._compute_backpressure_pause_hint(
+                    transfer_queue_depth=transfer_queue_depth,
+                    retracted_queue_depth=retracted_queue_depth,
+                    token_usage_after=token_usage,
                ),
            )

@@ -3543,10 +3631,13 @@ class Scheduler(
            0, recv_req.output_tokens
        )
        available_tokens_before = int(self.token_to_kv_pool_allocator.available_size())
+        # Don't evict the session itself when it's already resident; for seed
+        # of a fresh session there is nothing to exclude.
+        exclude_ids = {recv_req.session_id} if resident else set()
        trim_result = self.maybe_trim_decode_session_cache(
            required_tokens=required_tokens,
            force=available_tokens_before < required_tokens,
-            exclude_session_ids={recv_req.session_id},
+            exclude_session_ids=exclude_ids,
        )
        available_tokens_after = int(self.token_to_kv_pool_allocator.available_size())
        decode_retracted_queue_reqs = len(self.disagg_decode_prealloc_queue.retracted_queue)
@@ -3556,6 +3647,7 @@ class Scheduler(
        )
        reason = None if can_admit else "no-space"

+        transfer_queue_depth = len(self.disagg_decode_transfer_queue.queue)
        return DirectAppendAdmissionReqOutput(
            can_admit=can_admit,
            resident=True,
@@ -3570,10 +3662,36 @@ class Scheduler(
            ),
            num_running_reqs=len(self.running_batch.reqs),
            decode_prealloc_queue_reqs=len(self.disagg_decode_prealloc_queue.queue),
-            decode_transfer_queue_reqs=len(self.disagg_decode_transfer_queue.queue),
+            decode_transfer_queue_reqs=transfer_queue_depth,
            decode_retracted_queue_reqs=decode_retracted_queue_reqs,
+            recommended_pause_ms=self._compute_backpressure_pause_hint(
+                transfer_queue_depth=transfer_queue_depth,
+                retracted_queue_depth=decode_retracted_queue_reqs,
+                token_usage_after=(
+                    1.0 - available_tokens_after / max(1, self.max_total_num_tokens)
+                ),
+            ),
        )

+    def _compute_backpressure_pause_hint(
+        self,
+        *,
+        transfer_queue_depth: int,
+        retracted_queue_depth: int,
+        token_usage_after: float,
+    ) -> int:
+        # If D is already retracting requests, pause aggressively.
+        if retracted_queue_depth > 0:
+            return 1500
+        # KV pool above 90%: pause proportional to overshoot.
+        if token_usage_after >= 0.90:
+            overshoot = int((token_usage_after - 0.90) * 10000)
+            return max(200, min(2000, overshoot * 5))
+        # Transfer queue heavy: pause linearly with depth.
+        if transfer_queue_depth >= 8:
+            return min(2000, transfer_queue_depth * 100)
+        return 0
+
    def maybe_sleep_on_idle(self):
        if self.idle_sleeper is not None:
            self.idle_sleeper.maybe_sleep()
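
The threshold logic in `_compute_backpressure_pause_hint` above is easier to check when pulled out of the scheduler. The following is a standalone sketch that copies the three branches as they appear in the hunk; the free-function name `compute_backpressure_pause_hint` is an assumption made for illustration and is not part of the patch.

```python
def compute_backpressure_pause_hint(
    *,
    transfer_queue_depth: int,
    retracted_queue_depth: int,
    token_usage_after: float,
) -> int:
    """Standalone copy of the pause heuristic; returns milliseconds."""
    # Branch 1: D is already retracting requests, so return a fixed pause.
    if retracted_queue_depth > 0:
        return 1500
    # Branch 2: KV pool at or above 90% usage, clamp to [200, 2000] ms.
    if token_usage_after >= 0.90:
        overshoot = int((token_usage_after - 0.90) * 10000)
        return max(200, min(2000, overshoot * 5))
    # Branch 3: deep transfer queue, 100 ms per queued request up to 2000 ms.
    if transfer_queue_depth >= 8:
        return min(2000, transfer_queue_depth * 100)
    return 0


if __name__ == "__main__":
    cases = [
        dict(transfer_queue_depth=0, retracted_queue_depth=0, token_usage_after=0.50),
        dict(transfer_queue_depth=8, retracted_queue_depth=0, token_usage_after=0.50),
        dict(transfer_queue_depth=0, retracted_queue_depth=0, token_usage_after=0.90),
        dict(transfer_queue_depth=0, retracted_queue_depth=2, token_usage_after=0.99),
    ]
    for case in cases:
        print(case, "->", compute_backpressure_pause_hint(**case), "ms")
```

Two properties follow directly from the branch order. The usage branch reaches its 2000 ms clamp at about 0.94 usage, so anything above that yields the same hint. The retracted-queue branch is checked first and returns a fixed 1500 ms, so a D that is both retracting and nearly full advises a shorter pause than a D that is only nearly full.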
Author	SHA1	Message	Date
Gahow Wang	7568e041ff	docs(kvc): record real Ali KVC experiment results	2026-05-12 05:28:06 +00:00
Gahow Wang	4e8f943875	feat(kvc): add real Ali replay workflow	2026-05-12 05:28:00 +00:00
kzlin	1d51704dad	docs(kvc): agentic-fit analysis, refactor plan, validation report Three new docs covering the structural-fit investigation: - AGENTIC_FIT_ANALYSIS_ZH.md: §1-§7 of structural design issues that surface KVC vs vanilla DP gap on real agentic workloads (SWE 50sess). Quantifies session pinning, LRU shortfall, P-side imbalance, time-scale distortion, etc., with code citations and N=3 rerun data. - REFACTOR_PLAN_ZH.md: KISS-edition refactor plan. After verifying the original "estimate inflation" and "resident_blocks aging" claims were not real bugs, scope shrinks to one code change (backpressure) plus a 4-run smoke sweep within an 8h budget. - STRUCTURAL_VALIDATION_REPORT_ZH.md: validates §1-§7 claims using existing v5 baseline rerun data + 8DP CA baseline. Each claim labeled fully-supported / indirect / retracted with the data source. Notes that backpressure E2E validation is pending GPU smoke run. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-06 21:30:11 +08:00
kzlin	7affb565b2	feat(kvc): add backpressure smoke sweep + analyzer (and v6 p1 profile script) scripts/sweep_backpressure_smoke.sh: 4-run smoke matrix (KVC baseline / KVC + backpressure / KVC + backpressure @ time-scale=1 / DP @ time-scale=1) designed to fit ~3-4h GPU budget. Validates §3 backpressure implementation and partially probes §7 time-scale distortion. scripts/analysis/analyze_backpressure_smoke.py: consumes the new structural/* jsonl files plus request-metrics; emits headline metrics, backpressure histograms, admission probe stats, and per-session pinning distribution. scripts/sweep_tp1_v6_p1_profile.sh: pre-existing v6 P1 profile sweep script (was untracked; included for completeness). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-06 21:29:56 +08:00
kzlin	c47adaf8e3	feat(kvc): honor admission backpressure hints + structural event logging Replay-side changes paired with the SGLang admission hint: - DecodeResidencyState gains pause_until_s; admission probe parses recommended_pause_ms and updates the per-D pause window. - _wait_for_decode_pause is invoked at request entry points (_invoke_router, _invoke_session_direct) so requests stall before hitting a saturated D instead of timing out via mooncake. - New CLI flags: --enable-backpressure (default off, baseline preserved), --backpressure-max-pause-s (cap on per-request sleep, default 2s). Structural instrumentation written under <run_dir>/structural/: - admission-events.jsonl: every admission probe (RTT, queue_depth, pause_ms, available_tokens, evicted_count) - backpressure-events.jsonl: every actual pause sleep - session-d-binding.jsonl: per-request policy decision Used to validate the structural claims documented separately. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-06 21:29:46 +08:00
kzlin	ca4b64c79a	feat(sglang): expose backpressure pause hint in admit_direct_append Add `recommended_pause_ms` field to DirectAppendAdmissionReqOutput so D can advise callers when its transfer queue is heavy or KV pool is near capacity. The hint is computed from transfer_queue_depth, retracted_queue_depth, and post-trim token_usage; thresholds are simple heuristics (>0.90 usage, >=8 queue depth, retracted>0). Default behavior is unchanged for callers that ignore the field. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-06 21:29:30 +08:00
kzlin	4978c0d0cd	profile(kvc): rewrite v5+profile report after critic audit + P0/P1 instrument Hostile audit of the original report flagged three load-bearing errors: 1. held_tokens semantic was inverted. session_held_tokens() at session_aware_cache.py:278-282 sums (kv_allocated_len - cache_protected_len) per slot, i.e. slot-private (NOT in radix tree). So "other = cap - held - avail" actually CONTAINS the radix-tree protected prefix cache (likely the single biggest component for shared agentic prefixes), not just running batch + in-flight as the original report claimed. 2. Admission-race causal hypothesis for the 415 EXP2+profile errors is contradicted by the data: 414/415 errors have kv_transfer_blocks > 0 — they passed admission and died downstream ("generate stream ended before producing any token", raised by the client when a 200 response had an empty stream). 3. Polling deconfound was too quickly dismissed. Mode counts shift ~1:1 (session-cap-fb -356 / kvcache-centric +406), and /server_info is not a passive read — it dispatches into the scheduler main loop and iterates every session slot. Plus: per-D error% confounded by sticky session affinity (only 18 unique sessions cause 415 errors, decode-3 had 0 errors only because no high-error session landed there); decile 10 "recovery" was an equal-time binning artifact (24.5% under equal-count); v5 vs v5+profile time gap was 21h not 6h; p50/p90 latency comparison is N=1. Rewritten report (docs/V5_PROFILE_INVESTIGATION_ZH.md) marks each correction with ⚠️ and demotes admission-race to one of four hypotheses (H1-H4). Action items split into P0 (verify, must do first) and P1 (instrument): P0 — scripts/sweep_tp1_v5_baseline_rerun_exp2.sh runs 3x v5 baseline EXP2 (no polling, identical config to the original v5 run) to test whether the 9-error baseline result is reproducible. If 3 runs give ~9 errors and profile gives 415, polling is the leading suspect. Currently running in background. 
P1 — scheduler.py:_compute_pool_breakdown_for_diagnostics adds a read-only "pool_breakdown" dict to /server_info covering: radix_evictable_tokens, radix_protected_tokens, slot_private_held_tokens, session_slot_count, running_batch_{reqs,kv_tokens}, transfer_queue_{reqs,tokens}, prealloc_queue_{reqs,tokens}, retracted_queue_{reqs,tokens}. With these, "unaccounted = cap - sum(known)" exposes true leakage. replay.py captures all fields into the per-tick row; analyzer prints the decomposition and gracefully handles old timeseries (prints "P1 instrument absent"). Mock-tested end-to-end. SGLang patch is read-only and does not affect admission/scheduling. Old v5+profile data still analyzes correctly. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-29 22:29:21 +08:00
kzlin	51f5386691	profile(kvc): add D KV pool timeseries poller + analyzer for v6 root-cause v5 dropped errors but pushed session-cap fallback to 46-51%. Before adding v6 mitigations we need to attribute that capacity loss to one of: (a) active sessions — real footprint (b) idle-evictable sessions — LRU not aggressive enough (c) prefill backup blocks / in-flight / fragmentation — release timing Without this it's all guessing. Plumb a 1Hz poller into replay that hits each P/D worker's /server_info, captures session_cache + memory_usage, and writes a per-worker time-series JSONL to <run_dir>/d-pool-timeseries.jsonl. Off by default (--pool-poll-interval-s 0); v5+profile sweep enables it at 1.0s. Per-tick HTTP cost is ~8 parallel /server_info calls — negligible relative to the 50min run. Analyzer (scripts/analysis/analyze_pool_timeseries.py) decomposes each D's capacity into active_held / idle_evictable / other (= cap-held-avail, the backup-blocks bucket) / free, and reports session residency churn across workers as a starvation/thrashing signal. Mock-tested poller end-to-end (cancellation clean, file flushed, sessions captured); analyzer validated against synthetic timeseries. Next: run scripts/sweep_tp1_v5_optD_profile.sh on hardware (~90min), then analyze results to pick a v6 direction. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-29 20:04:21 +08:00
kzlin	6572d7f3f4	docs: add v5 chapter (Option D worker-mode admission) and rename to V1_TO_V5 v5 sweep (sweep_tp1_v5_optD.sh) lands the previously-deferred Option D: worker admission_mode authoritative for direct_append + seed + reseed, bypassing replay's local _decode_session_soft_cap. Key findings now documented: - errors collapse from 9-10% to 0.2% (mooncake timeouts gone) - session-cap fallback rises 33-35% -> 46-51% — D's true KV pool is the binding constraint, not replay's estimator; v4's "low fallback" was hiding capacity overruns as transfer-timeout errors - direct-to-D subset latency unchanged from v4 (admission overhead negligible) - new bottleneck: D's physical KV pool — points v6 at prefill backup release timing, priority eviction tuning, chunked seed, cross-D session migration, and real RDMA Also adds a 5th lesson on errors-vs-fallback reciprocity and updates the code index with the v5 endpoint extension and new CLI knobs. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-29 16:13:25 +08:00
kzlin	6e5ed8da80	feat(kvc): Option D - delegate seed/reseed admission to D worker v4 (cap=16) saw 35% session-cap fallback because the local soft_cap min(16, usable / target) evaluates to 1-2 for large agentic inputs. The cap was hit not because D was full but because replay's heuristic underestimated capacity. This change makes worker admission_mode authoritative for ALL paths: SGLang side: - io_struct.py: DirectAppendAdmissionReqInput gains a `mode` field ("direct_append" \| "seed", default "direct_append" preserves prior behavior). - scheduler.py:admit_direct_append: when mode == "seed", skip the resident-on-D requirement and run the same capacity check + LRU eviction (maybe_trim_decode_session_cache) that direct_append uses. This lets D atomically decide if a new session can be admitted based on actual token_to_kv_pool_allocator state. Replay side (replay.py): - _query_decode_direct_admission gains a `mode` parameter. - _reserve_decode_session_capacity: in worker admission_mode, the seed/reseed branch now queries D with mode="seed" and trusts the result, instead of estimating capacity from the residency snapshot. - _should_admit_new_decode_session: in worker mode, skip the local soft_cap pre-check and let D decide. Same-D session fast-path is preserved. Effects: - Local hardcoded cap of 16 is bypassed under worker mode; D's real KV pool size is the only constraint. - LRU eviction runs in D's process atomically with admission, so starvation (the v3 bimodal "lucky vs starved sessions" pattern) should resolve. scripts/sweep_tp1_v5_optD.sh added to run the same 1P7D / 2P6D configs as v4 with the new admission path. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-28 23:40:03 +08:00
kzlin	74194e660a	docs: v4 final results, error analysis, and updated journey Add v4 sweep results and post-mortem analysis showing: - direct-to-D path: 54.3% (1P7D) / 58.0% (2P6D) of requests now use KVC cleanly. P50=0.5s and TTFT P50=0.043s; this path beats baseline 8DP across the board (P50 -24%, TTFT P50 -54%, TTFT P90 -79%). - Overall vs baseline (errors+truncated excluded): v4 2P6D P50=0.85s vs baseline 0.66s (28% slower). Reason is not errors -- 35% of requests still hit fallback-large-append-session-cap, where capacity-based cap = usable_tokens / target_tokens evaluates to 1-2 (not 16) for large agentic inputs. - 9-10% errors on KVC variants are mooncake TCP transfer timeouts, not SGLang logic bugs. Prefill log shows "Failed to send kv chunk ... 32s timeout ... session not alive". Errors concentrate in turn>=31 (large inputs) after run >44.8%. Track: - docs/KVC_DEBUG_JOURNEY_V1_TO_V4.md: append v4 results table, per-mode breakdown, and error root cause. - scripts/analysis/{analyze_v3,analyze_v4,analyze_errors,compare_no_error}.py - outputs/qwen3-30b-tp1-v{3,4}/exp_summary.json (force-added, small JSON; metrics.jsonl excluded due to size). - outputs/qwen3-30b-tp1-v{3,4}*/sweep_results.txt Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-28 23:34:01 +08:00
kzlin	c9d350b372	docs: KVC v1-v4 debug journey + raise session soft_cap to 16 Document the iterative debugging from v1 (broken KVC) through v4 (routing fixed + session cap raised), with code-level analysis of the two main bugs encountered: 1. v2 root cause (mis-diagnosed previously as `allow_local_prefill`): `--policy default` for KVC mechanism caused replay's round-robin policy and the PD router's round-robin to diverge, sending requests with `session_params` to a D worker that did not have the session open. Resulted in 56-61% truncation with finish_reason "session id X does not exist". Fix: use `--policy kv-aware` (sweep_tp1_v3_kvaware.sh) so replay emits `x-smg-target-worker` and PD router uses consistent_hashing. 2. v3 new bottleneck: `pd-router-fallback-large-append-session-cap` dominated 52-65% of requests. Root cause was hardcoded `min(4, ...)` in `_decode_session_soft_cap`. With 7 D workers x 4 sessions = 28 slots for 52 trace sessions, ~24 sessions starved permanently (bimodal direct-to-D rate of 0% or 99%). Fix: raise the cap to 16 (replay.py). Also includes the v3 finding that direct-to-d-session path P50=0.495s and TTFT P50=0.043s already beats the 8-way DP baseline (0.65s/0.093s) - the KVC core mechanism works when fallback paths are avoided. Files: - docs/KVC_DEBUG_JOURNEY_V1_TO_V4.md: full journey + code location index - docs/SWEBENCH_EXPERIMENT_{PROGRESS,RESULTS}.md: prior session notes - scripts/sweep_tp1_v{2,3,4}*.sh: experiment driver scripts - src/agentic_pd_hybrid/replay.py: cap 4 -> 16, audit fields - src/agentic_pd_hybrid/pd_router.py: strip session_params from prefill - src/agentic_pd_hybrid/metrics.py: truncated_request_count Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-28 21:10:41 +08:00
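
The P1 instrumentation commit (4978c0d0cd) says that with the `pool_breakdown` dict, "unaccounted = cap - sum(known)" exposes leakage. A minimal sketch of that caller-side computation follows. The key names come from `_compute_pool_breakdown_for_diagnostics`; the helper name `unaccounted_tokens`, the sample numbers, and the choice of which keys to sum are assumptions. `running_batch_kv_tokens` is excluded because the docstring states it overlaps with `radix_protected_tokens`; whether the queue buckets are disjoint from the other buckets is not confirmed by the source, so the key set is a parameter.

```python
# Keys summed by default. ASSUMPTION: these buckets do not overlap each other.
DEFAULT_ADDITIVE_KEYS = (
    "available_tokens",
    "radix_evictable_tokens",
    "radix_protected_tokens",
    "slot_private_held_tokens",
    "transfer_queue_tokens",
    "prealloc_queue_tokens",
    "retracted_queue_tokens",
)


def unaccounted_tokens(breakdown, additive_keys=DEFAULT_ADDITIVE_KEYS):
    """Return (capacity - sum of known buckets, list of keys that were absent).

    The scheduler omits a key when a component cannot be read, so absent keys
    are counted as 0 and reported back rather than raising.
    """
    capacity = int(breakdown.get("capacity_tokens", 0))
    missing = [key for key in additive_keys if key not in breakdown]
    known = sum(int(breakdown.get(key, 0)) for key in additive_keys)
    return capacity - known, missing


if __name__ == "__main__":
    # Hypothetical numbers for illustration only, not taken from any run.
    sample = {
        "capacity_tokens": 100_000,
        "available_tokens": 10_000,
        "radix_evictable_tokens": 20_000,
        "radix_protected_tokens": 30_000,
        "slot_private_held_tokens": 25_000,
        "running_batch_kv_tokens": 28_000,  # overlaps radix_protected, not summed
        "transfer_queue_tokens": 5_000,
    }
    gap, missing = unaccounted_tokens(sample)
    print("unaccounted:", gap, "missing keys:", missing)
```

Returning the absent keys alongside the gap matters for interpretation: a large unaccounted value paired with missing keys may reflect an unread component rather than leaked KV.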