docs(failures): consolidated 5-mode failure taxonomy

Consolidates failure modes scattered across V2_DEEP_ANALYSIS, E1_E2_RESULTS, E3_FINDINGS, KVC_EVICTION_GRANULARITY, REAL_ALI_KVC_EXPERIMENT into a single lookup table with five fields per mode: symptom → root cause → trigger → current mitigation → real fix.

Five modes covered:

A. Mooncake "instance not alive" cascade: E2 80%-failure pathology; admission no-space → seed burst → heartbeat drop → batch abort
B. Cold-D / overlap-pinning: shared boilerplate hash pins all sessions to a subset of D's; load_floor_bonus is a patch, the real fix is exclusive_overlap redefinition
C. Evict storm (session-level eviction): release_session frees 38–88K tokens in one shot; fix is BLOCK_LEVEL_EVICTION_DESIGN
C'. Reseed storm (turn-1 concurrent seeds): startup-phase mooncake burst; fix is per-D pending-seed budget, frequency drops after C
D. Streaming-session correction invariant crash (E3): schedule_batch.py:1646 landmine, hotfixed by 986f351, root-fix is removing the correction path entirely (BLOCK_LEVEL_EVICTION §3.7)

Each mode has a forensic link back to the original experiment doc that surfaced it. §6 adds a diagnostic cheat sheet: "if you see X, look at Y." §7 wires every mode to a roadmap item: Milestone 1 should graduate §1–§4 to "mitigated" and eliminate §5.

INDEX_ZH gets a new §1.6 section linking this and the SGLang patch inventory.

No code change. Reading dependency for anyone debugging a sweep or writing paper §Limitations.
docs(sglang): patch surface inventory + retire-after-refactor list
2026-05-13 00:43:58 +08:00 · 2026-05-13 00:42:22 +08:00 · 2026-05-12 23:58:56 +08:00 · 2026-05-12 23:57:57 +08:00 · 2026-05-12 23:57:13 +08:00 · 2026-05-12 23:55:57 +08:00
18 changed files with 2408 additions and 44 deletions
--- a/AGENTS.md
+++ b/AGENTS.md
@@ -1,9 +1,33 @@
 # AGENTS.md

+## For new collaborators / agents
+
+Before doing anything else, read [docs/INDEX_ZH.md](docs/INDEX_ZH.md). It points to the
+3 must-read docs and a role-based reading path (new SWE, paper reviewer,
+reproducing student, control-plane reader).
+
+Cross-branch progress, weaknesses, and roadmap live in
+[docs/AUDIT_AND_ROADMAP_ZH.md](docs/AUDIT_AND_ROADMAP_ZH.md). It is the single source of truth
+for "what's done, what's broken, what to do next."
+
+Two engineering work items are pre-specced and ready to pick up:
+- block-level eviction refactor — [docs/BLOCK_LEVEL_EVICTION_DESIGN_ZH.md](docs/BLOCK_LEVEL_EVICTION_DESIGN_ZH.md)
+- D→P incremental KV sync — [docs/D_TO_P_SYNC_CONTRACT_ZH.md](docs/D_TO_P_SYNC_CONTRACT_ZH.md)
+
+Evaluation protocol (paper-quality N, paired CI, stratification,
+baselines) is in [docs/EVALUATION_PROTOCOL_ZH.md](docs/EVALUATION_PROTOCOL_ZH.md).
+
 ## Environment

 Use `uv` to manage all Python environments. Use `uv add` to manage deps so that `uv sync` reproduces exactly the same runnable environment each time.

+Algorithm-layer unit tests (no GPU, no SGLang):
+
+```bash
+uv sync --group test
+uv run pytest
+```
+
 ## Goal

 Build a minimal prototype on top of **SGLang xPyD** to test whether **session-aware / KV-cache-aware P/D routing** can improve **end-to-end latency** for agentic coding workloads.
--- a/README.md
+++ b/README.md
@@ -6,6 +6,9 @@

 For a fuller but still concise description, see [docs/PROJECT_OVERVIEW.md](docs/PROJECT_OVERVIEW.md).

+New collaborators: start with [docs/INDEX_ZH.md](docs/INDEX_ZH.md) and pick the 3 must-read docs for your role.
+For an overview of current progress, weaknesses, and the roadmap, see [docs/AUDIT_AND_ROADMAP_ZH.md](docs/AUDIT_AND_ROADMAP_ZH.md).
+
 ## What has been done so far

 - Launch a single-node SGLang P/D stack.
@@ -99,3 +102,28 @@ uv run agentic-pd-hybrid replay \
 - SGLang changes: `feat(sglang): ...` / `fix(sglang): ...`.
 - The baseline for `third_party/sglang` is a clean SGLang `v0.5.10` snapshot.
 - Do not commit `outputs/`, logs, `__pycache__`, or virtual environments.
+
+## Unit tests (no GPU)
+
+The algorithm layer (policies, Algorithm 1 / Theorem 1) has pure-Python unit tests; running them requires neither a GPU nor SGLang:
+
+```bash
+uv sync --group test
+uv run pytest
+```
+
+See [tests/README.md](tests/README.md) for details.
+
+## Evaluation scripts
+
+After collecting data per [docs/EVALUATION_PROTOCOL_ZH.md](docs/EVALUATION_PROTOCOL_ZH.md):
+
+```bash
+# M3: bucket by turn_id / input_length / overlap_ratio / append_tokens
+scripts/analysis/stratified.py outputs/<run>/request-metrics.jsonl
+
+# M2: paired-on-same-trial bootstrap 95% CI
+scripts/analysis/paired_compare.py \
+    --baseline outputs/run-dp/request-metrics.jsonl \
+    --candidate outputs/run-kvc/request-metrics.jsonl
+```
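For orientation, the `# M2` command above refers to a paired-on-same-trial bootstrap 95% CI. The sketch below shows one way such an interval could be computed. It is not the contents of `scripts/analysis/paired_compare.py`: the function name, the dict-keyed-by-request-id input shape, and the resampling defaults are all assumptions made for illustration.

```python
import random
from statistics import mean


def paired_bootstrap_ci(baseline, candidate, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile-bootstrap CI for the mean of (candidate - baseline).

    Both arguments map a trial/request id to a latency in seconds. Only ids
    present in both runs are paired, so both means cover the same trials.
    """
    shared = sorted(baseline.keys() & candidate.keys())
    if not shared:
        raise ValueError("no trial ids shared between the two runs")
    # One difference per paired trial; resampling happens over these pairs.
    diffs = [candidate[k] - baseline[k] for k in shared]
    rng = random.Random(seed)  # fixed seed keeps the interval reproducible
    boot_means = sorted(
        mean(rng.choices(diffs, k=len(diffs))) for _ in range(n_resamples)
    )
    lo = boot_means[int((alpha / 2) * n_resamples)]
    hi = boot_means[min(n_resamples - 1, int((1 - alpha / 2) * n_resamples))]
    return mean(diffs), lo, hi, len(shared)
```

An interval that lies entirely below 0 would suggest the candidate is faster on the shared trials. Requests that failed in either run drop out of `shared`, so the pair count is returned and should be reported next to the interval; otherwise the "successful only" comparison bias that the audit lists under M2 can creep back in.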
--- a/docs/AUDIT_AND_ROADMAP_ZH.md
+++ b/docs/AUDIT_AND_ROADMAP_ZH.md
@@ -0,0 +1,140 @@
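+# Project-wide audit and next-phase roadmap
+
+**Date**: 2026-05-12
+**Branch base**: `improve/audit-and-foundations` (based on `h200-cu130`)
+**Nature**: cross-branch consolidation + roadmap, so collaborators can judge whether each commit is worth merging
+**Audience**: the project's next SWE / research agent, plus pre-reading for paper reviewers
+
+This document gathers in one place the progress, established contributions, weaknesses, and the roadmap toward SOSP/OSDI plus production grade for the five branches `main` / `kvc-debug-journey-v1-to-v4` / `feat/d-to-p-sync` / `h200-cu130` / `kvc-real-ali-iter-v1`, so that readers can align quickly.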
+
+---
+
+## 0. TL;DR
+
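+1. **Already established**: the v1 → v2 algorithm (reset-on-success, lexicographic Route, worker-mode Admit RPC) has a formal definition, two theorems, and measurements on SWE-Bench (50 sessions, ts=1) that beat 4DP CA on 6 of 8 metrics.
+2. **Core weaknesses**: (a) session-level eviction conflicts with the design intent of KVC; (b) incremental D→P KV sync does not exist, and this is the source of the TTFT p99 long tail; (c) the mooncake "instance not alive" cascade is a fundamental control-plane availability problem; (d) the evaluation still lacks multiple baselines, multiple traces, and strong statistics.
+3. **Work that can proceed without a GPU**: algorithm-layer unit tests, formal design docs (block-level evict, D→P sync interface contract), the evaluation protocol, stratified analysis tools, and consolidation of the documentation. Most of Milestone 1 in this roadmap falls into this category.
+4. **Required to get into OSDI/SOSP**: execute §S1 (block-level evict) + §S2 (D→P sync POC) + §M2/M3/M4 (paired protocol / full Ali / multiple baselines). Estimated 3–4 months for one or two people.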
+
+---
+
+## 1. Status overview of the five branches
+
+| Branch | Role | Current status | Key output |
+|---|---|---|---|
+| `main` | "Released" baseline | 18 commits behind origin; 2P4D + worker-admission + seed-min2 reports a 9% mean / 19% p90 improvement vs default PD | The two policy tiers in `KVCACHE_CENTRIC_PROGRESS_ZH.md`: latency-best vs stable |
+| `kvc-debug-journey-v1-to-v4` | Main working branch | Full v1→v5 algorithm evolution; `KVC_ROUTER_ALGORITHM.md` with three algorithms + two theorems | SWE-Bench 50 sessions, ts=1: v2 beats 4DP CA on 6/8 metrics; **TTFT p99 still loses by 3×** (1.28s vs 0.43s), diagnosed as the 8.3% reseed slow path |
+| `feat/d-to-p-sync` | Placeholder branch | No code, only `RESEED_SLOW_PATH_AND_D_TO_P_GAP_ZH.md` | Rules out the misconception that "capacity-backup is D→P sync"; lists 4 engineering subtasks |
+| `h200-cu130` | Real hardware + RDMA validation | E1/E2/E3 run on 4×H200 + mlx5_60 NDR 400 Gb/s | **E2 80% failure** (mooncake dead-link cascade); **E3 hits the SGLang patch invariant crash 16 min in**; the latest `KVC_EVICTION_GRANULARITY_DESIGN_ZH.md` elevates the root cause to "session-level is the wrong eviction granularity" |
+| `kvc-real-ali-iter-v1` | Real Ali trace validation | 8×H20, 179-req KVC-fit slice + 600-req/15min cold-window | KVC vs DP: KVC-fit p50 −46% ✅; real 15min p90 +19s ❌, 53 errors vs 1 for DP; KVC OOMs at the default mem-fraction, which must be lowered to 0.82 |
+
+---
+
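+## 2. Contributions that already hold up
+
+Measured by whether a reviewer could refute them:
+
+1. **Reset-on-success fixes v1 thrashing**: the v1 failure mode (permanent blacklist → endless migration loop) is backed by measurements, the Algorithm 3 formalization, and Theorem 1's no-starvation proof (`KVC_ROUTER_ALGORITHM.md` §3.4 / §4.1).
+2. **Clear division of labor across three algorithms**: Algorithm 1 (lexicographic Route) + Algorithm 2 (D-autonomous Admit RPC) + Algorithm 3 (Dispatch + reset-on-success). v5 moves admission from router-side estimation to a D RPC (Option D), which is the right layering because it decouples capacity ground truth from the routing score.
+3. **Deterministic hits on the direct-to-D fast path** (Theorem 2): whenever the three conditions residency ⊇ prefix ∧ append ≤ τ_append ∧ cap_ok all hold, the request must take the fast path; the 91.6% hit rate on SWE-Bench and TTFT p50 = 0.43s are structural results.
+4. **Every negative result has a forensic-grade explanation**: mooncake death, cold-D, the reseed slow path, and session-level evict each come with a code location + timeline + counterexample. This is a genuine plus for the paper.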
+
+---
+
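+## 3. Weaknesses that would let a reviewer sink the paper
+
+### 3.1 Evaluation methodology
+
+- **M1 Insufficient N**: for SWE-Bench v2 the baseline result is confirmed only categorically at N=3, and v2 itself has insufficient N; there is no bootstrap CI.
+- **M2 Comparison is not like-for-like**: with E2 failing 80%, latency is computed over "successful only" requests and compared against E1's full set; the paper must use paired-on-same-trial comparison.
+- **M3 Trace biased toward KVC**: the KVC-fit slice was filtered for small-append + high overlap; results after dilution on full Ali (turn2+ ratio 26%, very many single-turn sessions) have not been run.
+- **M4 Baselines not strong enough**: none of vLLM + prefix-cache, DistServe, SplitWise, or Mooncake-Master is included.
+- **M5 Single trace family**: no ShareGPT/Mooncake trace, no long-context tool-use agent benchmark, no synthetic adversarial trace.
+- **M6 Hardware coverage**: single-node with ≤ 8 GPUs only; no cross-node runs and no measurements on clusters of ≥ 32 GPUs.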
+
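+### 3.2 System design
+
+- **S1 Session-level eviction conflicts with the KVC design intent**: 90 evictions, 67K tokens freed per eviction on average, and 25 of 50 sessions forced to re-prefill 50–90K tokens. `KVC_EVICTION_GRANULARITY_DESIGN_ZH.md` identifies the problem, but no fix is implemented.
+- **S2 Incremental D→P sync does not exist**: 50% of the TTFT p99 long tail comes from P re-prefill. `capacity-backup` is a static seed-time snapshot, not D→P sync. The fix requires changing the single-producer assumption of the SGLang radix tree.
+- **S3 Mooncake cascading death**: admission no-space → continuous seed retries → heartbeat loss → SGLang aborts the whole batch (E2: 1054/1285 failed). A fundamental control-plane availability bug.
+- **S4 Admission RPC blocks synchronously**: no backoff / hedging / staleness budget. Any GIL jitter in the D scheduler stalls the router.
+- **S5 Cold-D / overlap-pinning**: the hash of a 24-token boilerplate block makes every session overlap with D0/D1 → D2/D3 get 0 bindings. The load-floor bonus is a patch, not a first-principles fix.
+- **S6 The local SGLang patch is already 785 lines across 10 files**, including hot-path invariant changes such as `schedule_batch.py:1646`; the E3 crash is exactly a latent landmine introduced by the vendored patch.
+- **S7 Failure recovery / idempotency**: idempotency of streaming sessions under chunked-prefill retry relies on `SessionSlot.restore_to_req`; there is no recovery protocol for worker crash / mooncake reconnect / partial KV corruption.
+- **S8 No multi-tenant / SLO-aware scheduling**: the algorithm's objective implicitly sets w_ttft=w_lat=1. In production, interactive / batch / background traffic must be tiered.
+- **S9 Topology fixed at boot**: the P/D ratio is a launch parameter. Production load needs elasticity.
+- **S10 Backpressure pause-hint signal is not closed-loop**: it fired 20 times, but under no-BP nothing responds; the control plane is not wired up.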
+
+### 3.3 Engineering infrastructure
+
+- **Observability**: metrics are jsonl + offline `recompute_summary.py`; production needs Prometheus + Grafana + OpenTelemetry traces.
+- **Formal testing**: the algorithm and state layers lack unit tests; idempotency of `SessionSlot.restore_to_req` is an invariant the authors themselves flagged.
+- **Chaos injection**: control-plane failures such as mooncake death require a fault injection harness.
+- **Code size**: `replay.py` is 2460 lines and combines orchestration / policy hooks / control plane / metrics. That is fine for a prototype but weak for a paper-quality artifact.
+
+---
+
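+## 4. Roadmap
+
+Three milestones. Each can be delivered independently (as paper sections or as an engineering release).
+
+### Milestone 1: Defensible SOSP/OSDI submission (3–4 months, one or two people)
+
+**Goal**: consolidate the existing algorithm + failure diagnosis into a draft that can survive the first PC round.
+
+1. **Execute §S1 (block-level eviction refactor)**: see `docs/BLOCK_LEVEL_EVICTION_DESIGN_ZH.md`.
+   - Streaming-session decode output is incrementally committed into the radix tree via `cache_finished_req` when each turn finishes.
+   - `SessionSlot` is reduced to pure metadata (holding only `last_node` + lock_ref).
+   - `release_session` becomes `dec_lock_ref` + slot deletion; eviction is left entirely to the SGLang radix LRU.
+   - Expected: eviction granularity drops from 67K tokens per eviction to 24 tokens per eviction; reseed frequency drops by an order of magnitude.
+2. **Execute §S2 (incremental D→P sync POC)**: see `docs/D_TO_P_SYNC_CONTRACT_ZH.md`.
+   - Microbench to demonstrate: after a D append completes, KV blocks are pushed asynchronously back into the P-side radix → the next reseed skips re-prefill.
+3. **Fix §S3 (mooncake death cascade)**: admission RPC backoff + jitter; per-D pending-seed budget; decouple the mooncake heartbeat from admission.
+4. **First-principles fix for §S5**: redefine `overlap` as "the number of prefix hashes the session holds exclusively on D" (dropping the contribution of shared boilerplate hashes), so that scores spread out naturally.
+5. **Redo the evaluation**: see `docs/EVALUATION_PROTOCOL_ZH.md`. N≥3 + bootstrap CI + multiple baselines + full Ali + stratified reporting.
+6. **Extend the formalization**: add Theorem 3 (upper bound on re-prefill cost under block-level evict) + Theorem 4 (relationship between the D→P sync staleness budget β and reseed cost).
+7. **Artifact**: one-command script + Dockerfile + reproduction of the core tables/figures in one hour on 4×A100.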
+
+### Milestone 2: Production-quality serving substrate (another 3–6 months, 2–3 people)
+
+8. **Control-plane layering**: split `replay.py` into `router/` / `control/` / `obs/` / `orch/`.
+9. **Elastic topology**: an autoscaling controller with inputs (P queue, D transfer queue, D KV usage).
+10. **Multi-tenant + SLO classes**: three tiers (interactive / batch / background), each with an independent admission budget.
+11. **Failure injection harness**: mooncake link flap / D OOM kill / router GC pause / partial KV corruption; each case has a recovery SLA.
+12. **Persistent KV tier**: CPU DRAM + NVMe + RDMA-attached pool; evict becomes demote.
+13. **Cross-node + heterogeneous**: mixed H100 + H200 + L40S with topology-aware routing.
+14. **Observability**: per-request OpenTelemetry + per-D Prometheus + a main Grafana dashboard.
+
+### Milestone 3: Research increments that can genuinely get into OSDI'27 (6–12 months)
+
+15. **Learning-based admission / migration**: multi-armed bandit / RL to control τ_reject and K; train a session-aliveness predictor from traces.
+16. **Cross-router residency consensus**: lightweight gossip to share `Σ.resident[d]`.
+17. **Provable competitive ratio**: under an oracle KV-residency model, prove that the ratio of KVC expected TTFT to the offline optimum is bounded.
+18. **Distributed prefix tree**: map a logical prefix to physical replicas on multiple Ds, supporting multi-tenant prefix sharing (system prompt / tool schema).
+19. **Energy-aware variant**: bring GPU SM utilization + PCIe/RDMA energy into the objective function.
+20. **End-to-end agent serving framing**: move from request-level latency up to agent task completion time (one coding-agent task spans 30+ turns).
+
+---
+
+## 5. Work that can proceed without a GPU
+
+Ordered by ROI:
+
+- [x] This roadmap (`AUDIT_AND_ROADMAP_ZH.md`).
+- [x] Collaborator entry point (`docs/INDEX_ZH.md`).
+- [x] Concrete block-level eviction design (`docs/BLOCK_LEVEL_EVICTION_DESIGN_ZH.md`).
+- [x] D→P sync interface contract (`docs/D_TO_P_SYNC_CONTRACT_ZH.md`).
+- [x] Evaluation protocol (`docs/EVALUATION_PROTOCOL_ZH.md`).
+- [x] Pure-function score extraction for `KvAwarePolicy` + unit tests (Algorithm 1).
+- [x] No-starvation property test (Theorem 1).
+- [x] Stratified analysis script (bucketing along three dimensions: turn-index / append-size / overlap).
+- [x] Paired-comparison protocol helper.
+- [ ] Reproducible mock harness for mooncake death (runnable without a GPU).
+- [ ] Categorized inventory of the SGLang patch surface (each patch labeled "required" / "experimental" / "retirable").
+- [ ] Failure-mode taxonomy document (cold-D, overlap-pin, mooncake death, reseed storm, evict storm).
+
+---
+
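+## 6. One-sentence conclusion
+
+> This project already has the material for a SOSP/OSDI workshop paper or poster; to reach the main track, it needs to make §S1 (block-level evict) and §S2 (D→P sync) real, fill in §M3 (full Ali) and §M4 (two strong baselines), and write the control-plane fix for §S3 (mooncake cascading death) into a reproducible artifact. If only one thing can be done, do the block-level eviction refactor first: it both fixes "reseeds are too frequent" and is the prerequisite for the multi-producer extension of the P-side radix.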
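Item 4 of Milestone 1 in the roadmap above (the first-principles fix for §S5) redefines `overlap` so that boilerplate hashes shared across sessions no longer contribute. A minimal sketch of that idea follows, assuming block hashes are plain integers and treating any hash seen in at least `min_sessions` distinct sessions as boilerplate. The names `shared_hashes`, `exclusive_overlap`, and the threshold are hypothetical and are not the project's `KvAwarePolicy` API.

```python
from collections import Counter


def shared_hashes(sessions, min_sessions=2):
    """Hashes that appear in at least `min_sessions` distinct sessions."""
    counts = Counter(h for hashes in sessions.values() for h in set(hashes))
    return {h for h, c in counts.items() if c >= min_sessions}


def exclusive_overlap(prefix_hashes, resident_on_d, boilerplate):
    """Count prefix hashes resident on one D worker, ignoring boilerplate."""
    return sum(
        1 for h in prefix_hashes if h in resident_on_d and h not in boilerplate
    )


def demo():
    # Hashes 1 and 2 stand in for the boilerplate block both sessions share.
    sessions = {"s1": [1, 2, 10, 11], "s2": [1, 2, 20, 21]}
    boilerplate = shared_hashes(sessions)
    resident = {"D0": {1, 2, 10, 11}, "D1": {1, 2}}
    return {
        sid: {d: exclusive_overlap(h, r, boilerplate) for d, r in resident.items()}
        for sid, h in sessions.items()
    }
```

In this toy setup a plain set-intersection overlap would give `s2` a score of 2 on both D0 and D1 purely from the boilerplate; with the exclusive definition `s2` scores 0 everywhere, so a load term can decide its placement, while `s1` still prefers D0, where its own blocks live.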
--- a/docs/BLOCK_LEVEL_EVICTION_DESIGN_ZH.md
+++ b/docs/BLOCK_LEVEL_EVICTION_DESIGN_ZH.md
@@ -0,0 +1,309 @@
+# Block-level Eviction Refactor: design document
+
+**Date**: 2026-05-12
+**Prerequisite**: [KVC_EVICTION_GRANULARITY_DESIGN_ZH.md](KVC_EVICTION_GRANULARITY_DESIGN_ZH.md) (architecture-level manifesto)
+**Nature**: implementation-level design + API draft + test plan, for the next collaborator to code directly from
+**Status**: draft, not implemented. All code is quoted from `third_party/sglang/python/sglang/srt/mem_cache/session_aware_cache.py @ origin/h200-cu130`
+
+---
+
+## 0. TL;DR
+
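+Change the current `SessionAwareCache` semantics of **freeing a streaming session's entire KV in one shot** to:
+
+1. Streaming-session decode output is **incrementally committed into the radix tree** at turn finish.
+2. `SessionSlot` is reduced to **pure metadata** (holding only `last_node` + lock_ref state) and no longer exclusively owns a KV range.
+3. `release_session` only calls dec_lock_ref and deletes the slot, **letting the standard SGLang radix LRU reclaim the KV at block granularity**.
+
+Expected benefit: eviction granularity drops from ~67K tokens at a time to ~24 tokens (page_size tokens), and reseed frequency drops by an order of magnitude; the refactor also makes the P-side radix tree able to accept externally fed data (paving the way for [D_TO_P_SYNC_CONTRACT_ZH.md](D_TO_P_SYNC_CONTRACT_ZH.md)).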
+
+---
+
+## 1. Walkthrough of the current code
+
+### 1.1 Key files and functions
+
+`third_party/sglang/python/sglang/srt/mem_cache/session_aware_cache.py`
+
+| Function / field | Current semantics |
+|---|---|
+| `SessionSlot.req_pool_idx` | req_pool slot exclusively owned by the streaming session |
+| `SessionSlot.kv_committed_len` | KV length committed when the previous turn finished (the part counted in cache_protected_len has entered the radix) |
+| `SessionSlot.kv_allocated_len` | KV length currently allocated but **not yet in the radix** (the "session-exclusive tail") |
+| `SessionSlot.cache_protected_len` | Protected boundary set when the first turn is committed to the radix |
+| `match_prefix(streaming req)` | On slot hit → returns `req_to_token[req_pool_idx, :prefix_len]`, bypassing the radix |
+| `cache_unfinished_req(streaming req)` | Subsequent turns → **skips inner entirely** (nothing enters the radix) |
+| `cache_finished_req(streaming req)` | Calls `slot.save_from_req`; **does not call inner.cache_finished_req** |
+| `release_session(sid)` | `dec_lock_ref(slot.last_node)` + `free(req_to_token[req_pool_idx, cache_protected_len:kv_allocated_len])` + reclaims the req_pool slot |
+
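+### 1.2 Why the current behavior is wrong (restated)
+
+`[cache_protected_len, kv_allocated_len)` holds all decode output accumulated after the first turn entered the radix, plus the extends of later turns. Measured on Inferact / SWE-Bench:
+
+- `cache_protected_len` ≈ first-turn boilerplate, ~12K
+- `kv_allocated_len` accumulates to 50–100K
+- each `release_session` frees 38–88K in one shot; this part **never entered the radix** and cannot benefit from gradual leaf-by-leaf eviction
+
+→ after a session is evicted, the full length must be re-prefilled from the client's original prompt and transferred in full over mooncake, which is equivalent to naive PD-disagg (see manifesto §1).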
+
+---
+
+## 2. Target behavior
+
+| Scenario | Current | Target |
+|---|---|---|
+| Session has accumulated 50K KV and D is full | `release_session` frees 38K in one shot | radix LRU evicts starting from the oldest leaf, ~24 tokens at a time |
+| Evicted session returns | Must reseed 50K | Re-prefill only the evicted leaves (typically ≤ 5K) |
+| Evicted session TTFT | 50–90K reseed ≈ 3–7s | 5K append-prefill ≈ 200ms |
+| Session that is not evicted | Turns within the session are append-only | Still append-only (unchanged) |
+| Direct-to-D fast path hit rate | 91.6% (SWE-Bench) / 38% (E3 Inferact) | Should be ≥ 85% even under saturation |
+
+---
+
+## 3. Design
+
+### 3.1 Slimming down the SessionSlot fields
+
+**After refactor**:
+
+```python
+@dataclass
+class SessionSlot:
+    virtual_node: _VirtualNode = field(default_factory=_VirtualNode)
+
+    # Pointer into the radix tree — the deepest node owned by this session's
+    # committed prefix. Held under inc_lock_ref so radix LRU never evicts this
+    # *active* leaf out from under a turn-in-progress. Released by
+    # release_session.
+    last_node: Any = None
+    swa_uuid_for_lock: Optional[str] = None
+
+    # Bookkeeping fields (no longer authoritative ownership of KV indices).
+    last_access_time: float = field(default_factory=time.monotonic)
+
+    # Mamba state stays slot-owned (mamba doesn't fit the radix model).
+    mamba_pool_idx: Any = None
+    mamba_ping_pong_track_buffer: Any = None
+    mamba_next_track_idx: Any = None
+    mamba_last_track_seqlen: Any = None
+    mamba_branching_seqlen: Any = None
+```
+
+**Removed**: `req_pool_idx`, `kv_committed_len`, `kv_allocated_len`, `cache_protected_len`, `swa_evicted_seqlen`. The ground truth for these fields is now maintained jointly by the radix tree + req_to_token_pool.
+
+### 3.2 Changes to `cache_finished_req`
+
+**After refactor**:
+
+```python
+def cache_finished_req(self, req: Req, is_insert: bool = True, **kwargs):
+    if not _is_streaming(req):
+        return self.inner.cache_finished_req(req, is_insert=is_insert, **kwargs)
+
+    session_id = req.session.session_id
+    slot = self.slots.setdefault(session_id, SessionSlot())
+
+    # KEY CHANGE: always delegate to inner — this inserts the new tokens
+    # (kv_committed_len .. fill_ids end) as radix-tree blocks. Subsequent
+    # match_prefix calls for this session will hit the radix tree directly.
+    result = self.inner.cache_finished_req(req, is_insert=is_insert, **kwargs)
+
+    # Update slot bookkeeping only (no KV ownership).
+    slot.last_node = req.last_node
+    slot.swa_uuid_for_lock = req.swa_uuid_for_lock
+    slot.last_access_time = time.monotonic()
+
+    # Mamba state still goes through slot.
+    slot.mamba_pool_idx = req.mamba_pool_idx
+    ...
+    return result
+```
+
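+**Invariants**:
+- `inner.cache_finished_req` inserts into the radix the KV in the range `[kv_committed_len_old, kv_committed_len_new)`, aligned to page_size. This semantics comes from the standard SGLang implementation, so inner needs no change.
+- `slot.last_node` now points to **the tail node of the session's committed prefix** and advances after every turn.
+- `inc_lock_ref(new_last_node)` and `dec_lock_ref(old_last_node)` must both be executed at each turn switch, with the inc first (see §6.2, risk 2).
+
+### 3.3 Changes to `cache_unfinished_req`
+
+Subsequent turns of a streaming session **no longer skip inner**. Reason: `match_prefix` now goes through the radix, so the intermediate state of chunked prefill must also be maintained by inner: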
+
+```python
+def cache_unfinished_req(self, req: Req, **kwargs):
+    if _is_streaming(req) and kwargs.get("chunked", False):
+        # Chunked prefill: forward to inner so the per-chunk extend gets
+        # tracked in the radix LRU access timestamps.
+        ...
+    self.inner.cache_unfinished_req(req, **kwargs)
+```
+
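+The chunked-handling details must keep the logic that rebuilds `prefix_indices` (see lines 215–225 of the current implementation), but the call to `inner.cache_unfinished_req` must not be skipped.
+
+### 3.4 Changes to `match_prefix`
+
+Reduced to a **pure forward to inner**, since SessionSlot no longer holds KV pointers: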
+
+```python
+def match_prefix(self, params: MatchPrefixParams) -> MatchResult:
+    # No more slot-fast-path. Streaming sessions reuse KV via radix tree
+    # match like every other request.
+    return self.inner.match_prefix(params)
+```
+
+Callers that need "the committed prefix length of this session" now derive it from `inner.match_prefix(...).device_indices.shape[0]`.
+
+### 3.5 Changes to `release_session`
+
+**After refactor**:
+
+```python
+def release_session(self, session_id: str) -> int:
+    slot = self.slots.pop(session_id, None)
+    if slot is None:
+        return 0
+
+    # Just release our radix lock — radix LRU can now reclaim our prefix
+    # leaves at its own pace. NO direct token_to_kv_pool free.
+    if slot.last_node is not None:
+        if slot.swa_uuid_for_lock is not None:
+            self.inner.dec_lock_ref(
+                slot.last_node,
+                DecLockRefParams(swa_uuid_for_lock=slot.swa_uuid_for_lock),
+            )
+        else:
+            self.inner.dec_lock_ref(slot.last_node)
+
+    # Mamba state still needs explicit cleanup if present.
+    if slot.mamba_pool_idx is not None:
+        ...
+
+    return 0  # "freed_tokens" no longer meaningful; radix LRU shed lazily
+```
+
+### 3.6 Changes to `get_session_status` / `list_session_statuses`
+
+The ground truth for `resident_tokens` now comes from the radix tree. Inner needs to expose a helper:
+
+```python
+# In BasePrefixCache / RadixCache:
+def tokens_under(self, node) -> int:
+    """Count tokens in the path from root to `node` (inclusive)."""
+    ...
+
+# In SessionAwareCache:
+def get_session_status(self, session_id: str) -> Optional[Dict[str, Any]]:
+    slot = self.slots.get(session_id)
+    if slot is None:
+        return None
+    resident_tokens = self.inner.tokens_under(slot.last_node) if slot.last_node else 0
+    return {
+        "session_id": session_id,
+        "resident": resident_tokens > 0,
+        "resident_tokens": int(resident_tokens),
+        "last_access_time": float(slot.last_access_time),
+    }
+```
+
+The capacity check in `admit_direct_append` switches to the radix ground truth of `resident_tokens` (removing the possibility of the two values `kv_committed_len / kv_allocated_len` disagreeing).
+
+### 3.7 Accompanying changes to the SGLang scheduling path
+
+See `schedule_batch.py:1572-1646`: the current streaming-session correction (introduced by commits b8e6f13 / 986f351) is built on SessionSlot owning a separate KV range. After the block-level refactor this correction path **no longer needs to exist at all**: the req's fill_ids / prefix_indices come directly, and consistently, from inner radix `match_prefix`.
+
+**Removals**:
+- The `actual_extend_len = max(0, len(fill_ids) - len(prefix_indices))` correction block at `schedule_batch.py:1572-1585`.
+- The `assert seq_len - pre_len == req.extend_input_len` at `schedule_batch.py:1646` (after the refactor this invariant holds structurally).
+- The latent landmine triggered by E3 ([E3_FINDINGS_ZH.md](E3_FINDINGS_ZH.md) §2) disappears with them.
+
+---
+
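+## 4. Invariants (must be covered by the PR's self-tests)
+
+| Inv | Statement |
+|---|---|
+| I1 | After `release_session(sid)`, the `match_prefix` behavior of the next request from the same session depends only on what is resident in the radix tree, not on the `slots` dict. |
+| I2 | After the `cache_finished_req` call for any (session_id, turn_id), the radix tree must contain a root→leaf path covering all committed tokens of that turn (i.e. `tokens_under(slot.last_node)` never decreases). |
+| I3 | `restore_to_req` must be **idempotent**: under chunked-prefill retry it can be called multiple times on the same req and the final req state is equivalent. The current implementation achieves this by "not clearing slot fields" → after the refactor it is guaranteed by the pure-function nature of radix `match_prefix`. |
+| I4 | Requests without a streaming session (`req.session is None`) behave **unchanged**: every path short-circuits to inner. |
+| I5 | After any turn ends, every `inc_lock_ref` on `slot.last_node` must have a matching `dec_lock_ref`, and `release_session` is the final release point. |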
+
+---
+
+## 5. Test plan (runnable without a GPU)
+
+### 5.1 Unit tests (mock inner cache)
+
+Write a `MockRadixCache(BasePrefixCache)` that records the sequence of all `cache_finished_req / cache_unfinished_req / match_prefix / evict / dec_lock_ref` calls. Then:
+
+| Test | Assertion |
+|---|---|
+| `test_release_session_no_direct_free` | After `release_session`, the mock records **no** direct `free(kv_indices)` call, only `dec_lock_ref` |
+| `test_subsequent_turn_inserts_radix` | Simulate three `cache_finished_req` calls for turns 0 → 1 → 2 and assert that each one triggers `inner.cache_finished_req` |
+| `test_match_prefix_uses_inner` | Both streaming and non-streaming requests go only through `inner.match_prefix` |
+| `test_restore_idempotent` | Simulate a chunked-prefill retry; two consecutive `match_prefix` calls return identical `device_indices` |
+| `test_eviction_under_pressure_is_block_level` | Inject a "pool full, must evict 24 tokens" state and assert that `release_session` is not triggered and inner's LRU takes a single step |
+
+### 5.2 Property-based tests
+
+```python
+@given(turns=lists(integers(min_value=24, max_value=2048), min_size=1, max_size=50))
+def test_committed_tokens_monotone(turns):
+    """tokens_under(slot.last_node) is monotonically non-decreasing across turns."""
+    ...
+```
+
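+### 5.3 Integration smoke (needs a GPU, but lives in the sweep script)
+
+Run `sweep_e2_kvc_v2_rdma.sh` on the same trace and configuration and compare:
+- total number of evictions (expected from 90 → < 10)
+- average tokens per eviction (expected from 67K → < 500)
+- TTFT p99 (expected from 1.28s → < 0.7s)
+- direct-to-D hit rate (expected ≥ 85%)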
+
+---
+
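+## 6. Effort and risks
+
+### 6.1 Effort
+
+| Work item | Estimate | Risk |
+|---|---|---|
+| §3.1–§3.6 SessionAwareCache refactor | 2–3 days | Medium: requires familiarity with the radix-internal lock_ref / evict protocol |
+| §3.7 schedule_batch cleanup | 0.5 day | Low: it only deletes code |
+| §4 invariant unit tests | 2 days | Low |
+| §5.3 GPU smoke + data comparison | 2 days | Medium: mooncake may still trigger the E2 cascading death, so it needs to run together with the §S3 fix |
+| **Total** | **~1 week** | |
+
+### 6.2 Key risks
+
+1. **Compatibility of `inner.cache_finished_req` with streaming-session reqs**: the standard SGLang radix currently assumes that a req reaching cache_finished_req has "fully completed prefill+decode". A streaming-session req still leaves an "unfinished conversation" at the end of each turn, so we must ensure inner does not treat decode-only tokens as a discardable tail when inserting. The implementation of `radix_cache.py:cache_finished_req` needs an audit.
+
+2. **lock_ref ordering**: the `match_prefix` at the start of turn N+1 → inc_lock_ref(new_node), and the end of turn N → dec_lock_ref(old_node); if the order is reversed, the LRU can, under concurrency, wrongly evict the leaf that was just committed. Suggest adding an assertion: `inc_lock_ref` must arrive before `dec_lock_ref`.
+
+3. **Chunked-prefill retry**: see I3. SGLang's current `restore_to_req` deliberately does not clear slot fields, precisely to support this retry. After the refactor we must confirm that inner radix `match_prefix` is also idempotent under retry (a standard radix tree is, but a test should lock this property in explicitly).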
+
+---
+
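+## 7. Relationship to the D→P sync work
+
+Block-level evict is a **prerequisite** for [D_TO_P_SYNC_CONTRACT_ZH.md](D_TO_P_SYNC_CONTRACT_ZH.md):
+
+- D→P sync requires the P-side radix tree to **accept externally fed KV blocks**.
+- The P-side radix currently assumes a single producer (the local worker's model output).
+- Once the block-level refactor is done, streaming-session KV already follows the standard radix path, so letting the radix tree accept an additional "externally fed" producer is merely an extension of the insert API rather than the invention of a new storage path.
+
+→ The two can be done in sequence: block-level evict first, then D→P sync.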
+
+---
+
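+## 8. Minimal actions for the agent taking over
+
+1. Fork a `feat/block-level-evict` branch (from `improve/audit-and-foundations` or `h200-cu130`).
+2. Implement §3.1–§3.6.
+3. Write the §5.1 + §5.2 unit tests.
+4. Run the §5.3 smoke on 8×H100 / H200 and compare eviction frequency and TTFT p99.
+5. If risk 1 in §6.2 materializes, go into SGLang `radix_cache.py` and check whether streaming-session reqs need an `is_session_active=True` flag to prevent "dropping the decode tail".
+
+---
+
+**Key sentence**: keep the session as the lifecycle boundary, but do **not** let it be the eviction boundary (hand that over to the radix LRU). This refactor removes two blockers at once: "reseeds are too frequent" and "the P-side radix cannot be fed externally".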
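The lock_ref handoff from §3.2 and §6.2 (risk 2) and the block-level release from §3.5 of the design above can be exercised without a GPU. The toy below is a sketch under stated assumptions: `ToyRadix` is a hypothetical single-chain stand-in, not SGLang's `RadixCache`. It models only per-node lock counts and leaf-first eviction, with insertion order standing in for LRU order, and `advance_lock` / `run_session` are illustrative names.

```python
PAGE = 24  # tokens per block, matching the page size quoted in the design


class Node:
    def __init__(self, tokens, parent=None):
        self.tokens = tokens
        self.parent = parent
        self.lock_ref = 0
        self.children = 0
        if parent is not None:
            parent.children += 1


class ToyRadix:
    """Hypothetical stand-in for the inner radix cache."""

    def __init__(self):
        self.root = Node(0)
        self.nodes = []  # insertion order doubles as LRU order in this toy

    def append(self, parent, tokens):
        node = Node(tokens, parent)
        self.nodes.append(node)
        return node

    def _walk(self, node, delta):
        # Lock counts apply to the whole root-to-node path, excluding root.
        while node is not None and node is not self.root:
            node.lock_ref += delta
            assert node.lock_ref >= 0, "dec_lock_ref without matching inc_lock_ref"
            node = node.parent

    def inc_lock_ref(self, node):
        self._walk(node, +1)

    def dec_lock_ref(self, node):
        self._walk(node, -1)

    def evict(self, want):
        """Free unlocked leaves one block at a time until `want` tokens are freed."""
        freed = 0
        while freed < want:
            victim = next(
                (n for n in self.nodes if n.children == 0 and n.lock_ref == 0), None
            )
            if victim is None:
                break
            self.nodes.remove(victim)
            victim.parent.children -= 1
            freed += victim.tokens
        return freed


def advance_lock(cache, old_tail, new_tail):
    cache.inc_lock_ref(new_tail)       # pin the new tail first
    if old_tail is not None:
        cache.dec_lock_ref(old_tail)   # only then release the old one


def run_session(turns=3):
    cache = ToyRadix()
    tail = None
    for _ in range(turns):
        new_tail = cache.append(tail or cache.root, PAGE)
        advance_lock(cache, tail, new_tail)
        tail = new_tail
    freed_while_locked = cache.evict(PAGE)   # active session: nothing evictable
    cache.dec_lock_ref(tail)                 # what release_session reduces to
    freed_after_release = cache.evict(PAGE)  # reclaims a single block
    resident = sum(n.tokens for n in cache.nodes)
    return freed_while_locked, freed_after_release, resident
```

With three one-page turns, eviction frees nothing while the session holds its lock; after the final `dec_lock_ref`, a 24-token request reclaims exactly one block and leaves the other 48 tokens resident. That is the behavior the "Target behavior" table asks for, in contrast to freeing the whole session tail in one shot.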
--- a/docs/D_TO_P_SYNC_CONTRACT_ZH.md
+++ b/docs/D_TO_P_SYNC_CONTRACT_ZH.md
@@ -0,0 +1,247 @@
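+# Incremental D→P KV sync: interface contract and rollout plan
+
+**Date**: 2026-05-12
+**Prerequisites**: [RESEED_SLOW_PATH_AND_D_TO_P_GAP_ZH.md](RESEED_SLOW_PATH_AND_D_TO_P_GAP_ZH.md) (gap analysis) + [BLOCK_LEVEL_EVICTION_DESIGN_ZH.md](BLOCK_LEVEL_EVICTION_DESIGN_ZH.md) (precondition)
+**Nature**: cross-layer interface contract + formalization of the staleness budget + phased rollout
+**Status**: draft. The `feat/d-to-p-sync` branch is currently empty; this is the design doc that branch should land first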
+
+---
+
+## 0. TL;DR
+
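+50% of the reseed slow path's time is spent in P re-prefill, so **fixing the transfer segment (enabling RDMA) solves only half of it**. The only way to eliminate the long tail completely is to have the P-side backup incrementally keep up with appends on D:
+
+> D finishes a turn on the direct-to-D path → asynchronously pushes the newly committed KV blocks back into the P-side radix → on the next reseed the P-side radix hits the full prefix, so no re-prefill is needed, only a single P→D transfer.
+
+This document gives the interface contracts for three layers (mooncake / SGLang / agentic-pd-hybrid), a formal definition of a **staleness budget β**, and a four-phase rollout plan, so that this work can proceed decoupled from block-level eviction.
+
+---
+
+## 1. Staleness budget β: formal definition
+
+Let the committed prefix length of session `s` on D be `L_D(s, t)` (the instantaneous value at time `t`), and the backup prefix length of the same session on P be `L_P(s, t)`.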
+
+```
+staleness(s, t) := L_D(s, t) - L_P(s, t)   ≥ 0
+```
+
+The **staleness budget β** is the upper bound the system commits to maintain:
+
+```
+∀ s, ∀ t :  staleness(s, t) ≤ β
+```
+
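+Intuition: the smaller β is → the more likely a reseed hits the P-side backup → the reseed reduces to a single P→D transfer + a re-prefill of ≤ β tokens.
+
+- **β = 0**: fully synchronous (D blocks waiting for P's ack after every committed block). High latency cost; not recommended.
+- **β = ∞**: the current state (the P-side backup is forever the static seed-time snapshot).
+- **β = one page (24 tokens)**: single-block sync. Theoretically the optimal granularity, but every append on D triggers a D→P RPC.
+- **β = O(append_len) (typically 1K–4K)**: batched sync. Recommended starting point: aggregate the decode output of one turn and push it as a single batch.
+- **β = O(turn_size) (typically ~50K)**: coarse-grained sync. Defeats the reseed bypass and only reduces transfer. Not advisable.
+
+→ For rollout, the recommended value is β = `max(page_size, min(committed_in_turn, β_max))`, with `β_max` defaulting to 4096.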
+
+---
+
+## 2. Interface contracts for the three layers
+
+### 2.1 Mooncake layer: dual roles
+
+**Current state** (see [RESEED_SLOW_PATH_AND_D_TO_P_GAP_ZH.md](RESEED_SLOW_PATH_AND_D_TO_P_GAP_ZH.md) §3):
+
+- `MooncakeKVManager` is bound at initialization to a single role by `disaggregation_mode ∈ {PREFILL, DECODE}`.
+- `MooncakeKVSender` is instantiated only in PREFILL mode, and `MooncakeKVReceiver` only in DECODE mode.
+- `add_transfer_request` contains the hard constraint `assert disaggregation_mode == PREFILL`.
+
+**Target interface**:
+
+```python
+# third_party/sglang/python/sglang/srt/disaggregation/base/conn.py
+class KVRole(Enum):  # assumes `from enum import Enum`; defined first so BaseKVManager can reference it
+    PREFILL = "prefill"
+    DECODE = "decode"
+    PREFILL_BACKUP_RECEIVER = "prefill_backup_receiver"   # new: P side receives D→P sync
+    DECODE_BACKUP_SENDER = "decode_backup_sender"         # new: D side sends D→P sync
+
+class BaseKVManager:
+    roles: set[KVRole]   # replaces the original single-valued field; allows {PREFILL, DECODE}
+```
+
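+**New classes** (~400 LOC at the implementation level):
+
+| Class | Role | Key methods |
+|---|---|---|
+| `DecodeKVSender` | D side pushes new post-append KV blocks back to P | `enqueue_sync(session_id, kv_blocks, target_p)` enqueues asynchronously and returns a `sync_id` |
+| `PrefillKVReceiver` | P side receives D→P sync packets | `recv_loop()` background thread; each packet triggers a callback that injects into the radix tree |
+
+**Bootstrap channel**: needs a second bootstrap socket independent of the existing P→D channel (to avoid conflicts in buffer-pointer negotiation). Configuration:
+- disabled by default; enabled via the ServerArgs flag `--enable-d2p-sync`
+- new port range `BOOTSTRAP_D2P_PORT_BASE = 22000`
+
+### 2.2 SGLang layer: multi-producer radix extension
+
+**Current state**: the P-side radix assumes a single producer (the local worker's model output). `RadixCache.cache_finished_req` takes KV indices directly from `req_to_token_pool[req_pool_idx, :]` and inserts them into the tree.
+
+**Target interface** (after [BLOCK_LEVEL_EVICTION_DESIGN_ZH.md](BLOCK_LEVEL_EVICTION_DESIGN_ZH.md) is complete):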
+
+```python
+class RadixCache(BasePrefixCache):
+    def insert_external(
+        self,
+        token_ids: Sequence[int],
+        kv_tensor: torch.Tensor,
+        *,
+        source_worker_id: str,
+        session_id: str,
+    ) -> InsertExternalResult:
+        """
+        Insert KV blocks supplied by an external worker (D→P sync).
+
+        Allocates fresh slots in token_to_kv_pool, copies kv_tensor into them,
+        and threads the resulting indices through the radix tree exactly like
+        cache_finished_req would for a local prefill.
+
+        Invariants:
+            - Same model layout (verified at handshake time, not per-call).
+            - On collision with existing radix path, no-op for the shared prefix
+              and only insert the diverging suffix.
+            - Inserted nodes get lock_ref += 1 if `pin=True`, default False.
+              D→P sync is best-effort; LRU is allowed to evict the inserted leaves.
+        """
+```
+
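+**Key design decisions**:
+
+| Decision | Options | Recommendation |
+|---|---|---|
+| KV index remapping | A) D sends its original indices and P remaps; B) D sends a densely packed tensor and P reallocates | **B**: avoids leaking indices across workers |
+| Failure handling | A) D→P failure → fall back to re-prefill; B) retry N times | **A**, plus taking the legacy path at a later reseed if P misses |
+| Reference counting | Is KV synced into P pinned? | **Not pinned**: the P-side LRU manages it naturally, preventing backups from squeezing out production KV |
+| Coordination with evict | What if P is full when a sync arrives? | Let the sync insert trigger inner.evict → fair LRU competition with locally produced KV |
+| Multiple P instances for one session | What if the router round-robins turns to different Ps? | **Accept multi-source**: each P maintains its own backup; at reseed pick the one with the smallest staleness |
+
+### 2.3 agentic-pd-hybrid layer: hooks and state machine
+
+**New CLI flags**: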
+
+```bash
+--enable-d2p-sync                     # off by default
+--d2p-staleness-budget-tokens 4096    # β_max
+--d2p-sync-batch-min-tokens 24        # trigger only once at least 1 page has accumulated
+--d2p-sync-target-policy {last_p, round_robin, broadcast}
+                                      # last_p: push back to the P that last seeded this session
+                                      # broadcast: push to every P (flexible at reseed, but bandwidth-heavy)
+```
+
+**New state fields** (`DirectSessionState` in `replay.py`):
+
+```python
+@dataclass
+class DirectSessionState:
+    ...
+    # NEW: per-P backup view, populated by D->P sync callbacks.
+    prefill_resident_tokens_by_p: dict[str, int] = field(default_factory=dict)
+    last_d2p_sync_at: float | None = None
+```
+
+**Hook after `_invoke_session_direct` completes**:
+
+```python
+async def _invoke_session_direct(...):
+    ...
+    response = await self._stream_direct_to_d(...)
+    if response.ok and self.config.enable_d2p_sync:
+        new_committed = response.kv_committed_len
+        prev_p_resident = max(session.prefill_resident_tokens_by_p.values(), default=0)
+        staleness = new_committed - prev_p_resident
+        if staleness >= self.config.d2p_sync_batch_min_tokens:
+            target_p = self._choose_d2p_target(session)
+            asyncio.create_task(
+                self._issue_d2p_sync(session, target_p, prev_p_resident, new_committed)
+            )
+```
+
+**Hook on the reseed path** (`_invoke_kvcache_seeded_router`):
+
+```python
+async def _invoke_kvcache_seeded_router(..., request):
+    ...
+    if self.config.enable_d2p_sync:
+        # Probe P-side residency before issuing full re-prefill.
+        probe = await self._probe_prefill_residency(session_id)
+        if probe.resident_tokens >= request.prefix_len - β_max:
+            # Use the up-to-date backup: skip re-prefill, just trigger P→D transfer.
+            return await self._invoke_p_to_d_transfer_only(...)
+    # Fall back to existing path.
+    return await self._invoke_kvcache_seeded_router_legacy(...)
+```
+
+---
+
+## 3. Properties (to be proven)
+
+### 3.1 Theorem 4 candidate (paper form)
+
+*Assume the staleness budget β is maintained. For a session `s` that has accumulated length L on D and is reseeded after being evicted:*
+
+```
+reseed_cost(s) ≤ T_p2d(L) + T_prefill(min(β, L))
+```
+
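+*where T_p2d is the P→D transfer time (roughly L · 4 ns/token over RDMA) and T_prefill is the prefill time (at a prefill throughput of roughly 50K tokens/s on H100 TP1 with Qwen3-30B). When β ≪ L, the bound degenerates to being dominated by a single P→D transfer.*
+
+**Compared with the baseline** (no D→P sync): `reseed_cost = T_p2d(L) + T_prefill(L − seed_size)`, which is dominated by re-prefill.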
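+
+As a rough worked example of the two cost expressions above, the sketch below plugs in the order-of-magnitude constants quoted in this section (4 ns/token transfer, 50K tokens/s prefill). The session length, β, and `seed_size` values are illustrative assumptions, not measurements.
+
+```python
+# Order-of-magnitude constants quoted in this section; not measurements.
+P2D_SECONDS_PER_TOKEN = 4e-9        # ~4 ns/token over RDMA
+PREFILL_TOKENS_PER_SECOND = 50_000  # ~50K tokens/s
+
+
+def t_p2d(tokens: int) -> float:
+    """P->D transfer time in seconds for `tokens` tokens."""
+    return tokens * P2D_SECONDS_PER_TOKEN
+
+
+def t_prefill(tokens: int) -> float:
+    """Prefill time in seconds for `tokens` tokens."""
+    return tokens / PREFILL_TOKENS_PER_SECOND
+
+
+def reseed_cost_with_sync(length: int, beta: int) -> float:
+    """Upper bound from the Theorem 4 candidate."""
+    return t_p2d(length) + t_prefill(min(beta, length))
+
+
+def reseed_cost_baseline(length: int, seed_size: int) -> float:
+    """Baseline without D->P sync: re-prefill everything past the seed."""
+    return t_p2d(length) + t_prefill(length - seed_size)
+
+
+if __name__ == "__main__":
+    length, beta, seed_size = 50_000, 4096, 0  # hypothetical session
+    print(f"with sync: <= {reseed_cost_with_sync(length, beta):.3f} s")
+    print(f"baseline:     {reseed_cost_baseline(length, seed_size):.3f} s")
+```
+
+Under these assumed numbers the synced bound is dominated by prefilling at most β tokens, while the baseline pays for re-prefilling nearly the whole session.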
+
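+### 3.2 Relation to Theorem 2
+
+Theorem 2 only guarantees fast hits on the direct-to-D path. Theorem 4 also pushes the fallback cost on a fast-path miss down to the sub-second range, which makes it possible for KVC to beat DP at **all quantiles**.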
+
+---
+
+## 4. Four-phase rollout
+
+| Phase | Scope | GPU requirement | Acceptance criteria |
+|---|---|---|---|
+| **P1** | block-level eviction refactor ([BLOCK_LEVEL_EVICTION_DESIGN_ZH.md](BLOCK_LEVEL_EVICTION_DESIGN_ZH.md)) | 4×H100 smoke | mean tokens freed per eviction ≤ 500 |
+| **P2** | dual-role mooncake + microbench (D→P single-packet RTT, bandwidth utilization) | single node + RDMA | D→P RTT < 50 ms (local); a single 16K-token block reaches ≥ 50% of the theoretical bandwidth ceiling |
+| **P3** | SGLang `insert_external` + agentic-pd-hybrid hook (best-effort only, no reseed probe) | 4×H100 + RDMA | more than 80% of triggered syncs complete within the same turn; no new failure mode introduced |
+| **P4** | reseed probe wired in + end-to-end evaluation | 4×H100 + RDMA | single reseed < 0.5 s (vs. 3–7 s today), TTFT p99 < 0.5 s |
+
+**Key decision point**: an audit is required between P1 and P2 to confirm that SGLang radix `insert_external` does not conflict with the streaming-session decode path. If a serious conflict is found, introduce a "P-only sync mode" as a placeholder and open it up once the architecture is stable.
+
+---
+
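+## 5. Risks and countermeasures
+
+| Risk | Impact | Countermeasure |
+|---|---|---|
+| Dual-role mooncake breaks the existing one-way P→D path | E2 already exposed the mooncake "instance not alive" cascade; adding another channel may amplify it | In P2, start with an independent bootstrap channel plus a feature flag; keep a disable path |
+| D→P sync consumes D egress bandwidth and hurts direct-to-D append-prefill latency | Directly degrades the main path | Run sync on a low-priority QP (RDMA SL=0), trigger it in batches, at most once per turn |
+| P-side radix fills up with backups and squeezes out locally produced KV | P-side prefill slows down | Sync inserts are not pinned (§2.2), so LRU competes fairly |
+| Coordinating multiple backup views across multiple P instances is complex | The router must account for staleness when choosing target_p | Start with the `last_p` policy (recency-biased), then decide on `broadcast` after observing the measured distribution |
+| `insert_external` drifts from the upstream API across SGLang patch upgrades | Maintenance burden | Confine the API to our vendor-patch boundary (without polluting upstream radix) and write a contract test |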
+
+---
+
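+## 6. Decoupling from block-level eviction
+
+| Work item | Depends on the other? |
+|---|---|
+| block-level eviction | Does not depend on D→P sync and can ship on its own. Reduces reseed frequency by itself |
+| D→P sync | **Depends on** block-level eviction: it needs the P-side radix to be the source of truth for streaming-session KV |
+| Both together | Largest gain: reseed frequency drops by an order of magnitude, and so does the time per reseed |
+
+→ Rollout order: block-level eviction lands first, then D→P sync proceeds on `feat/d-to-p-sync`. The two **should not** be merged in a single PR.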
+
+---
+
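+## 7. Minimum actions for the next agent
+
+1. Land this document on the `feat/d-to-p-sync` branch.
+2. Once block-level eviction is in main, start P2: dual-role mooncake + microbench (unit tests, no coupling to the SGLang main path).
+3. In P3, add `insert_external` and the hook; merge to main disabled by default.
+4. Decide the reseed probe policy (`last_p` vs `broadcast`) only after the P4 end-to-end evaluation.
+
+---
+
+**Key takeaway**: D→P incremental sync is more than "adding one more network channel". The crux is extending the P-side radix from a single producer to one that accepts best-effort external feeds. Block-level eviction is the prerequisite, so the two pieces of work can run one after the other but not in the reverse order.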
--- a/docs/EVALUATION_PROTOCOL_ZH.md
+++ b/docs/EVALUATION_PROTOCOL_ZH.md
@@ -0,0 +1,185 @@
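+# Evaluation protocol (paper-quality)
+
+**Date**: 2026-05-12
+**Nature**: evaluation protocol specification covering every weakness M1–M6 in [AUDIT_AND_ROADMAP_ZH.md](AUDIT_AND_ROADMAP_ZH.md) §3.1
+**Audience**: collaborators running experiments; paper authors; artifact reviewers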
+
+---
+
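+## 0. General principles
+
+> Every number in the paper must be able to answer two questions:
+> 1. **How large is the sampling error?** (bootstrap CI, N, std)
+> 2. **Is it fair?** (same trial, same trace, same token cap, same timeout, paired)
+
+The current sweep reports (`KVCACHE_CENTRIC_PROGRESS_ZH.md` / `V2_RESULTS_ZH.md`) satisfy neither. This document provides a compliant template.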
+
+---
+
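+## 1. Evaluation dimensions (one fix per M1–M6)
+
+### 1.1 M1: statistical significance
+
+| Decision | Rule |
+|---|---|
+| Minimum number of runs `N` per config | **3** (headline numbers) / **5** (final ablation values) |
+| Reported statistics | `mean ± std`, **plus a 2.5/97.5 percentile bootstrap CI** |
+| Aggregating multiple runs | Append the per-request latencies of every run and bootstrap over the pooled set; do not compute a per-run mean first and then average the means |
+| Significance of differences | paired bootstrap p-value (≥ 5000 samples) |
+| `N=1` is allowed only for | smoke / sanity checks; **never in a headline table** |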
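+
+The aggregation rule in the table can be sketched as follows: pool the per-request latencies of all runs, then take a percentile bootstrap CI of the mean over the pooled set. This is a minimal stdlib sketch; the function name `pooled_bootstrap_ci` and the sample latencies are hypothetical and not part of the repo's scripts.
+
+```python
+import random
+from statistics import mean
+
+
+def pooled_bootstrap_ci(runs, n_boot=5000, seed=0):
+    """95% percentile-bootstrap CI of the mean over pooled per-request latencies.
+
+    `runs` is a list of runs, each a list of per-request latencies in seconds.
+    """
+    # Pool first; never average per-run means.
+    pooled = [x for run in runs for x in run]
+    rng = random.Random(seed)
+    n = len(pooled)
+    # Resample the pooled set with replacement n_boot times.
+    boot_means = sorted(mean(rng.choices(pooled, k=n)) for _ in range(n_boot))
+    lo = boot_means[int(0.025 * (n_boot - 1))]
+    hi = boot_means[int(0.975 * (n_boot - 1))]
+    return mean(pooled), (lo, hi)
+
+
+if __name__ == "__main__":
+    # Three hypothetical runs of unequal size.
+    runs = [[0.8, 1.1, 0.9, 1.4], [1.0, 1.2], [0.7, 0.95, 1.3, 1.05, 0.85]]
+    point, (lo, hi) = pooled_bootstrap_ci(runs, n_boot=2000, seed=42)
+    print(f"mean={point:.3f}s  95% CI=[{lo:.3f}, {hi:.3f}]")
+```
+
+Pooling weights every request equally, so a short run cannot move the estimate as much as it would under a mean of per-run means.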
+
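+### 1.2 M2: fair paired comparison
+
+| Decision | Rule |
+|---|---|
+| trace fixity | Use the same `samples-*.jsonl` file; lock the replay with `--use-trace-as-sample` |
+| timeout | Same `--request-timeout-s` for every mechanism; one group must not use 600s while another uses 300s |
+| token cap | Same `--max-input-len` (take the minimum across all baselines and truncate explicitly) |
+| errors / aborts | Do **not** count successful requests only; list abort and timeout separately under `error_count`, and report metrics over the full set (errors included) or paired-on-same-trial-mask |
+| time window | Consistent `time_scale`; no changes within a sweep |
+| Worker count / GPU type | Identical; any topology difference must be annotated |
+
+**Counter-example**: the current `E1 vs E2` table ([E1_E2_RESULTS_ZH.md](E1_E2_RESULTS_ZH.md) §4) explicitly states "not a fair head-to-head": E2 fails 80% of requests, and its successful-only latency is compared against E1's full set. **A table like this cannot go into the paper as is.**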
+
+### 1.3 M3: trace stratification
+
+| Dimension | Suggested buckets |
+|---|---|
+| `turn_id` | `{1, 2-5, 6-20, 21+}` |
+| `append_len` | `{≤128, 128-1K, 1K-8K, >8K}` |
+| `overlap_ratio` | `{≤0.3, 0.3-0.7, >0.7}` |
+| `inter_turn_gap_s` | `{≤5, 5-30, 30-300, >300}` |
+| `input_len` | `{≤8K, 8K-64K, >64K}` |
+
+**Reporting requirement**: beyond the headline numbers, provide at least one `turn_id` × `append_len` heatmap so reviewers can see which slice the gain comes from.
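+
+The bucketing behind such a heatmap might look like the sketch below. It assumes each request row carries `turn_id` and `append_len` fields, that 1K means 1024 tokens, and that bucket upper bounds are inclusive; none of this is taken from `scripts/analysis/stratified.py`.
+
+```python
+from collections import Counter
+
+
+def turn_bucket(turn_id: int) -> str:
+    """Map a 1-based turn_id to the {1, 2-5, 6-20, 21+} buckets."""
+    if turn_id <= 1:
+        return "1"
+    if turn_id <= 5:
+        return "2-5"
+    if turn_id <= 20:
+        return "6-20"
+    return "21+"
+
+
+def append_bucket(append_len: int) -> str:
+    """Map append_len (tokens) to the {<=128, 128-1K, 1K-8K, >8K} buckets.
+
+    Assumes 1K = 1024 and inclusive upper bounds.
+    """
+    if append_len <= 128:
+        return "<=128"
+    if append_len <= 1024:
+        return "128-1K"
+    if append_len <= 8192:
+        return "1K-8K"
+    return ">8K"
+
+
+def heatmap_counts(rows):
+    """Count requests per (turn bucket, append bucket) cell."""
+    return Counter(
+        (turn_bucket(r["turn_id"]), append_bucket(r["append_len"])) for r in rows
+    )
+```
+
+The same cell keys can index per-cell latency lists, so each heatmap entry can carry its own bootstrap CI instead of a bare count.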
+
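+**Counter-example**: the current Real Ali experiment reports -46% p50 only on the KVC-fit slice (high overlap + small append + 100% direct-eligible). That is an upper bound, not an average. A paired table on the full Ali trace must be reported alongside it.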
+
+### 1.4 M4: baseline matrix
+
+Run at least **2** of the following baselines:
+
+| Baseline | Category | Library |
+|---|---|---|
+| vLLM + automatic prefix caching | same-model single-worker prefix cache | vLLM main |
+| SGLang DP cache-aware (4×TP1) | current primary baseline | SGLang vendored in this repo |
+| SGLang PD-disaggregation (kv-aware) | naive but cache-aware topology | this repo |
+| DistServe | P/D disaggregation baseline | DistServe upstream |
+| SplitWise | P/D split + adaptive routing | open-source impl |
+| Mooncake-Master scheduler | same-generation design | mooncake-master |
+
+**Also recommended**: run an "oracle" baseline that assumes `Σ.resident[d]` is perfectly known and admission never fails, as an upper-bound reference for KVC.
+
+### 1.5 M5: trace mix
+
+| Trace | Purpose |
+|---|---|
+| Ali coding agent (full) | main results; includes single-turn dilution |
+| Ali KVC-fit slice | demonstrates the KVC upper bound |
+| SWE-Bench 50 sess | already available; multi-turn, high-overlap workload |
+| ShareGPT | contrasting chat workload (short turns, low overlap). **Used to show that KVC does not degrade on an ill-suited workload** |
+| Inferact | tool-use-heavy agent workload |
+| Mooncake trace | baseline trace for single-turn LLM serving |
+| Synthetic adversarial | self-built: a burst of 100 new sessions seeding at once, to test robustness against mooncake death and of reset-on-success |
+
+**Minimum mix**: Ali full + SWE-Bench + ShareGPT + Synthetic adversarial.
+
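+### 1.6 M6: hardware coverage
+
+| Tier | Purpose |
+|---|---|
+| single node, ≤ 8 GPUs | all current results |
+| two nodes, NVLink + IB | validates cross-node D→P sync and mooncake behavior |
+| 4-node cluster (≥ 16 GPUs) | scaling numbers, cluster-scheduler assumptions |
+| heterogeneous (H100 + L40S) | topology-aware routing |
+
+**Minimum mix**: single-node 4×H200 plus two nodes with NVLink + IB. The remaining two tiers can be left to future work.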
+
+---
+
+## 2. Report templates
+
+### 2.1 Main results table (Table 1)
+
+```
+| Config | N | mean ± std | p50 [CI] | p90 [CI] | p99 [CI] | err% | timeout% |
+|--------|---|------------|----------|----------|----------|------|----------|
+```
+
+Annotate with: trace name, time_scale, `max_input_len`, `request_timeout_s`, and every other shared parameter.
+
+### 2.2 Paired delta table
+
+```
+| Pair | N pairs | mean delta [CI] | p50 delta [CI] | wins / losses | p-value |
+```
+
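+`N pairs` = number of trials that succeeded on both sides. `wins` = number of trials with `latency_kvc < latency_baseline`.
+
+### 2.3 Stratified tables (Table 2)
+
+One table per stratification dimension (§1.3).
+
+### 2.4 Negative-result section (mandatory)
+
+The paper must have a dedicated section listing:
+
+- The concrete numbers where KVC is slower than the baseline on ShareGPT.
+- The percentiles / slices of each trace where KVC does not win.
+- The diagnostic chain for failed sweeps (mooncake death, E3 crash).
+
+→ Honest negative results tend to improve a reviewer's impression noticeably. The draft in [V2_DEEP_ANALYSIS_ZH.md](V2_DEEP_ANALYSIS_ZH.md) §4 can be expanded into this section.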
+
+---
+
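+## 3. Tool support (scripts this repo needs)
+
+| Script | Status | Notes |
+|---|---|---|
+| `scripts/analysis/recompute_summary.py` | ✅ exists | fixes abort-contaminated latency; the main data entry point for this protocol |
+| `scripts/analysis/stratified.py` | ⏳ added on this branch | buckets by the §1.3 dimensions and emits tables |
+| `scripts/analysis/paired_compare.py` | ⏳ added on this branch | paired bootstrap, emits the §2.2 table |
+| `scripts/analysis/plot_*` | ✅ exists | TTFT PDF, GPU utilization, cache efficiency |
+
+→ Once this branch's stratified + paired scripts land, collaborators running experiments can produce the tables with a single command.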
+
+---
+
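+## 4. Artifact requirements (SOSP/OSDI AE)
+
+| Item | Standard |
+|---|---|
+| Dockerfile | a single `Dockerfile.artifact` that starts on 4×A100/H100 |
+| One-click script | `bash artifact/reproduce_main_table.sh`, produces Table 1 within 1 hour |
+| Dataset | ship a subset of `outputs/sample-*.jsonl` (can stay within ~5GB); full Ali is obtained by following instructions |
+| Reproducibility | bootstrap CIs that overlap with the paper's count as reproduced; bit-exact results are not required |
+| Documentation | `artifact/README.md`, listing the command behind every table / figure |
+
+→ Prepare the artifact only after roadmap item §M1 is fixed.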
+
+---
+
+## 5. Self-check list (use before submitting a paper draft)
+
+- [ ] Every table has N ≥ 3 and includes mean±std and a 95% CI.
+- [ ] The phrase "successful only" appears nowhere; all errors are counted in `err%`.
+- [ ] All baselines use the same `max_input_len` / `request_timeout_s` / `time_scale`.
+- [ ] At least 3 traces + 1 synthetic adversarial.
+- [ ] At least 1 non-SGLang baseline.
+- [ ] A negative-result section exists.
+- [ ] Dilution data for KVC on single-turn workloads is included.
+- [ ] Formal part: Algorithm 1/2/3 + Theorem 1/2, plus Theorem 4 once D→P sync is done.
+- [ ] Failure-mode forensics: mooncake death, E3 crash, and cold-D all go into §Limitations or §Discussion.
+
+---
+
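+## 6. Roadmap hookup
+
+- [ ] Phase A: implement `scripts/analysis/stratified.py` + `scripts/analysis/paired_compare.py` on this branch (no GPU needed).
+- [ ] Phase B: run the existing 600-req/15min data from `kvc-real-ali-iter-v1` through the new tools to produce stratified / paired tables and store them under `outputs/` (no GPU rerun needed).
+- [ ] Phase C: run the ShareGPT + Synthetic adversarial baselines (~12h of GPU).
+- [ ] Phase D: pick 1 non-SGLang baseline (vLLM + prefix caching recommended) to close M4 (~24h of GPU).
+
+---
+
+**Key takeaway**: current results "look like a win already", but once re-reported under this protocol the magnitude of the win will shrink, the winning slices will narrow, and the negative slices will surface. Every paper has to go through this; the earlier it happens, the cheaper it is.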
--- a/docs/FAILURE_MODES_ZH.md
+++ b/docs/FAILURE_MODES_ZH.md
@@ -0,0 +1,222 @@
+# Failure-mode Taxonomy
+
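+**Date**: 2026-05-13
+**Nature**: consolidated list + diagnostic handbook
+**Audience**: collaborators who hit a failure mid-experiment and need an immediate lookup; authors who need citations for the paper's §Limitations; the answer when a reviewer asks "why do you think it will be more stable this time"
+
+This document organizes the failure modes identified so far into one table: symptom → root cause → trigger → current mitigation → real fix. Every entry carries a forensic link to the original experiment doc.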
+
+---
+
+## 0. TL;DR
+
+Five identified failure modes, grouped by whether they block a paper claim:
+
+| Class | Name | Blocks paper | Real fix |
+|---|---|:---:|---|
+| **A. Control-plane cascade** | Mooncake "instance not alive" cascade | ✅ | admission backoff + per-D pending-seed budget |
+| **B. Routing bias** | Cold-D / overlap-pinning | ✅ | first-principles overlap term redefinition |
+| **C. KV churn** | Evict storm (session-level evict) | ✅ | [BLOCK_LEVEL_EVICTION_DESIGN_ZH.md](BLOCK_LEVEL_EVICTION_DESIGN_ZH.md) |
+| **C'. KV churn** | Reseed storm (concurrent large seeds at turn 1) | ✅ | per-D pending-seed budget + (frequency drops by itself once C is mitigated) |
+| **D. Vendor invariant** | streaming-session correction invariant crash (E3) | ❌ (hotfix already landed) | delete the correction path (after block-level evict is done) |
+
+A, B and C must be solved in Milestone 1; C' is a secondary cause of A; D has a stopgap in place, but its root fix is tied to C.
+
+---
+
+## 1. A — Mooncake "instance not alive" cascade
+
+### 1.1 Symptoms
+
+- Client side: `RuntimeError: generate stream ended before producing any token`
+- D scheduler log: `[mooncake] Decode instance could be dead, dropping ...`
+- The whole batch of requests is aborted; a single sweep goes from healthy to 80% failure within minutes ([E1_E2_RESULTS_ZH.md](E1_E2_RESULTS_ZH.md) E2: 1054 / 1285 failed)
+
+### 1.2 Root cause (forensic chain)
+
+```
+admission no-space (D KV pool full)
+    → router immediately falls back to the seed/reseed path
+    → several concurrent seeds hit mooncake P→D at once
+    → P→D egress queues up, handshake phase times out
+    → mooncake marks the peer dead
+    → SGLang aborts every in-flight req on the dead link
+    → client sees a mass generate-stream interruption
+```
+
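+### 1.3 Trigger conditions
+
+- D KV pool is nearly full (≥ ρ·K_d, with ρ = 0.95 by default)
+- the router fallback chain launches several reseeds within a millisecond-scale window
+- mooncake heartbeat times out (the default window is short)
+
+### 1.4 Current mitigation
+
+- `--kvcache-seed-min-turn-id=2` skips the large turn-1 seed and reduces the initial burst (stable config on main)
+- `--mc-transfer-timeout=1800s` as the default (commit 905d671) reduces false dead markings
+- `--request-timeout-s=180/300` keeps the client from hanging for a whole hour, but does not stop the cascade itself
+
+→ All of these treat symptoms, not the cause. E2 still fails 80% of requests on real 4×H200 NDR hardware ([E1_E2_RESULTS_ZH.md](E1_E2_RESULTS_ZH.md)).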
+
+### 1.5 Real fix (roadmap §S3)
+
+1. **admission RPC backoff + jitter**: on rejection, do not fall back immediately; give the D scheduler room to breathe.
+2. **per-D pending-seed budget**: at most K seeds sit in the transfer queue at any moment; the excess waits in line instead of bursting.
+3. **Decouple mooncake heartbeat from admission**: the admission path no longer implies "peer is alive".
+4. **Close the loop on the backpressure pause hint** ([SGLANG_PATCH_INVENTORY_ZH.md](SGLANG_PATCH_INVENTORY_ZH.md) §2.3, currently EXPERIMENTAL).
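+
+Items 1 and 2 of the list above could be prototyped along the following lines. This is only a sketch under stated assumptions: `PendingSeedBudget`, `backoff_delay`, and `admit_with_backoff` are hypothetical names, the delay constants are arbitrary, and nothing here reflects the router's actual code.
+
+```python
+from __future__ import annotations
+
+import asyncio
+import random
+
+
+class PendingSeedBudget:
+    """Caps concurrent seed transfers per decode worker (hypothetical helper)."""
+
+    def __init__(self, decode_ids, max_pending: int):
+        self._sems = {d: asyncio.Semaphore(max_pending) for d in decode_ids}
+
+    async def run_seed(self, decode_id, seed_coro_factory):
+        # Excess seeds wait here instead of bursting into the transfer queue.
+        async with self._sems[decode_id]:
+            return await seed_coro_factory()
+
+
+def backoff_delay(attempt: int, base_s: float = 0.05, cap_s: float = 2.0,
+                  rng: random.Random | None = None) -> float:
+    """Exponential backoff with full jitter: uniform in [0, min(cap, base * 2**attempt)]."""
+    source = rng if rng is not None else random
+    return source.uniform(0.0, min(cap_s, base_s * (2 ** attempt)))
+
+
+async def admit_with_backoff(try_admit, max_attempts: int = 4) -> bool:
+    """Retry a rejected admission a few times; the caller falls back only on False."""
+    for attempt in range(max_attempts):
+        if await try_admit():
+            return True
+        await asyncio.sleep(backoff_delay(attempt))
+    return False
+```
+
+Creating the budget inside the running event loop and wrapping every seed in `run_seed` bounds the number of in-flight transfers per D, which is the property the cascade in §1.2 lacks.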
+
+---
+
+## 2. B — Cold-D / overlap-pinning
+
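+### 2.1 Symptoms
+
+- N=k decode workers, but only ~k-1 actually carry traffic; some D has 0 bindings
+- The per-D load histogram is heavily skewed (E2: D0:600 / D1:685 / **D2:0**)
+- Overall throughput is capped by the busiest D; raw latency is not necessarily worse, but capacity utilization drops by 33% or more
+
+### 2.2 Root cause
+
+The Inferact / Ali coding agent traces open every session with ~12K tokens of "system prompt + tool schema", and these 24-token blocks share hashes across all sessions. The `overlap` term of the kv-aware policy reads them as "this D already holds these hashes", so the score pushes every new session toward D0/D1 (the first two to warm up); D2 always has 0 overlap, is never selected, and stays cold.
+
+### 2.3 Trigger conditions
+
+- multi-session workload + shared boilerplate prefix
+- `migration_reject_threshold > 0` and reject never fires (because D0/D1 are not full yet)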
+
+### 2.4 Current mitigation
+
+`KvAwarePolicy.load_floor_bonus` (commit 93fce42):
+
+```
+floor_bonus = K * max(0, mean - assigned) / max(1, mean)
+```
+
+In E3, D2's binding share rose from 0 to 22.5% ([E3_FINDINGS_ZH.md](E3_FINDINGS_ZH.md) §1).
+
+→ This is a patch, not a fix. `K` is a magic number; D2 still stays cold once the number of boilerplate hashes exceeds `K / sticky_bonus`.
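+
+To make the formula concrete, the sketch below evaluates it on the per-D binding counts reported for E2. The function signature is hypothetical, and `k=1.0` is an arbitrary placeholder, not the value used in `KvAwarePolicy`.
+
+```python
+def load_floor_bonus(assigned: dict[str, int], k: float) -> dict[str, float]:
+    """floor_bonus = K * max(0, mean - assigned) / max(1, mean), per decode worker."""
+    mean = sum(assigned.values()) / len(assigned)
+    return {
+        d: k * max(0.0, mean - n) / max(1.0, mean)
+        for d, n in assigned.items()
+    }
+
+
+if __name__ == "__main__":
+    # Per-D binding counts reported for E2; K = 1.0 is a placeholder.
+    e2_load = {"D0": 600, "D1": 685, "D2": 0}
+    for d, bonus in load_floor_bonus(e2_load, k=1.0).items():
+        print(d, round(bonus, 3))
+```
+
+A worker with zero bindings receives the full `K`, while workers at or above the mean receive nothing, so the bonus can only help if `K` outweighs the overlap score that the shared boilerplate contributes.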
+
+### 2.5 Real fix (roadmap §S5)
+
+Redefine `overlap` as **"the number of prefix hashes that this session holds exclusively on this D"**:
+
+```
+exclusive_overlap(s, d) := |prefix_hashes(s) ∩ resident[d] ∩ session_owned[s]|
+```
+
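+Here `session_owned[s]` excludes hashes that other sessions also hold. Shared boilerplate hashes do not enter `exclusive_overlap`, so scores spread out naturally. This requires the D side to return a sketch (Bloom filter / minhash) of the per-session resident hash set in the `admit_direct_append` response.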
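+
+The definition can be illustrated with exact Python sets standing in for the Bloom filter / minhash sketch. The `owners` mapping and the toy hashes are assumptions made for this example only.
+
+```python
+def plain_overlap(session, decode, prefix_hashes, resident):
+    """Current behaviour: shared boilerplate hashes count toward the overlap."""
+    return len(prefix_hashes[session] & resident[decode])
+
+
+def exclusive_overlap(session, decode, prefix_hashes, resident, owners):
+    """|prefix_hashes(s) ∩ resident[d] ∩ session_owned[s]| with exact sets.
+
+    owners[h] is the set of sessions holding hash h; a hash counts as
+    session-owned only when `session` is its sole holder.
+    """
+    session_owned = {h for h, held_by in owners.items() if held_by == {session}}
+    return len(prefix_hashes[session] & resident[decode] & session_owned)
+
+
+if __name__ == "__main__":
+    # b1/b2 are shared boilerplate blocks; h1 belongs to session s1 only.
+    prefix_hashes = {"s1": {"b1", "b2", "h1"}, "s2": {"b1", "b2"}}
+    resident = {"D0": {"b1", "b2", "h1"}, "D2": set()}
+    owners = {"b1": {"s1", "s2"}, "b2": {"s1", "s2"}, "h1": {"s1"}}
+    print(plain_overlap("s2", "D0", prefix_hashes, resident))
+    print(exclusive_overlap("s2", "D0", prefix_hashes, resident, owners))
+```
+
+For the new session `s2`, the plain overlap on D0 consists only of boilerplate, whereas the exclusive overlap is zero on every D, so D0 no longer has an edge over the cold D2.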
+
+---
+
+## 3. C — Evict storm（session-level eviction）
+
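+### 3.1 Symptoms
+
+- Under workloads that put memory pressure on D, a KV pool release spike of 30–90K tokens appears every 1–2 minutes
+- The next request of the same session then triggers a `Reseed`: P re-prefills 50K + mooncake transfers 50K (3–7s)
+- The TTFT long tail is dominated entirely by these reseeds ([V2_DEEP_ANALYSIS_ZH.md](V2_DEEP_ANALYSIS_ZH.md) §3.2)
+
+### 3.2 Root cause
+
+`SessionAwareCache.release_session` performs a single `free([cache_protected_len, kv_allocated_len))`, i.e. the entire session-exclusive tail. Measured in E3: 90 evictions, 67,726 tokens freed per eviction on average, 25/50 sessions affected ([KVC_EVICTION_GRANULARITY_DESIGN_ZH.md](KVC_EVICTION_GRANULARITY_DESIGN_ZH.md) §0).
+
+→ This contrasts sharply with the leaf-by-leaf incremental eviction of the standard SGLang radix. This KV never enters the radix, so it cannot benefit from fine-grained LRU nibbling.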
+
+### 3.3 Trigger conditions
+
+- D KV pool is nearly full
+- the scheduler triggers `maybe_trim_decode_session_cache` (when `DecodePreallocQueue` detects `available_size() <= 0`)
+
+### 3.4 Current mitigation
+
+- `--kvcache-session-soft-cap=N` (main): caps the number of resident sessions on D → trims early and avoids running into the ceiling
+- `--kvcache-direct-max-uncached-tokens=8192` (v2): slows the rate at which the direct path consumes KV
+
+→ Both only slow things down; neither addresses the root problem that a single free is too large.
+
+### 3.5 Real fix (roadmap §S1)
+
+[BLOCK_LEVEL_EVICTION_DESIGN_ZH.md](BLOCK_LEVEL_EVICTION_DESIGN_ZH.md): at the end of every turn, streaming-session decode output goes into the radix via `inner.cache_finished_req` → `release_session` reduces to `dec_lock_ref` + slot deletion → radix LRU nibbles away 24-token leaves.
+
+Expected: a single eviction drops from 67K to ≤ 500 tokens; reseed frequency drops by an order of magnitude.
+
+---
+
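+## 4. C' — Reseed storm (concurrent large seeds at turn 1)
+
+### 4.1 Symptoms
+
+- During workload ramp-up (the first 30–60s) every session issues turn 1 at the same time
+- Several concurrent `Seed`s (~50–90K tokens each) hit mooncake → overlaps with the §1 cascade
+
+### 4.2 Root cause
+
+During startup `resident[d]` is empty for every D in `KvAwarePolicy`, so all D's score the same, yet ε retries + per-trial admit do nothing to prevent concurrency.
+
+### 4.3 Trigger conditions
+
+- under `time_scale=1` replay, sessions start together at the trace's original arrival density
+- no per-D pending-seed rate limit
+
+### 4.4 Current mitigation
+
+- `--kvcache-seed-min-turn-id=2`: skips the turn-1 seed entirely (stable config on main)
+- Side effect: turn-1 KV injection is lost and turn 2 must reseed (which is actually more stable, because reseeds are spread out over time)
+
+### 4.5 Real fix
+
+- per-D pending-seed budget (same as item 2 of §1.5)
+- once §3.5 is done, eviction frequency drops by itself, which indirectly lowers reseed frequency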
+
+---
+
+## 5. D — Streaming-session correction invariant crash (E3 landmine)
+
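+### 5.1 Symptoms
+
+- D scheduler raises `AssertionError` at `schedule_batch.py:1646`: `seq_len - pre_len == req.extend_input_len`
+- The whole D worker process exits → the router sees the peer as dead → §1 cascade
+
+### 5.2 Root cause
+
+[E3_FINDINGS_ZH.md](E3_FINDINGS_ZH.md) §2: the streaming-session correction (commit b8e6f13) rewrites `extend_input_len` to `max(0, fill_len - prefix_len)`, but the downstream invariant is still computed from the original fill_ids/prefix_indices. When `fill_len < prefix_len` (the prefix accumulated over several turns exceeds the current turn's increment), the left-hand side of the invariant is negative while the clamped right-hand side is 0, so the invariant cannot hold.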
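+
+The arithmetic behind the crash can be reproduced in a few lines. The sketch assumes that `seq_len` and `pre_len` are simply the lengths of `fill_ids` and `prefix_indices`; the helper names are hypothetical and do not exist in `schedule_batch.py`.
+
+```python
+def corrected_extend_input_len(fill_len: int, prefix_len: int) -> int:
+    """The correction: clamp the extend length at zero."""
+    return max(0, fill_len - prefix_len)
+
+
+def invariant_holds(fill_len: int, prefix_len: int) -> bool:
+    """Downstream check, recomputed from the raw lengths (assumed mapping)."""
+    seq_len, pre_len = fill_len, prefix_len
+    return seq_len - pre_len == corrected_extend_input_len(fill_len, prefix_len)
+
+
+if __name__ == "__main__":
+    print(invariant_holds(fill_len=4096, prefix_len=1024))  # ordinary turn
+    print(invariant_holds(fill_len=512, prefix_len=2048))   # committed prefix exceeds fill
+```
+
+The first call satisfies the check; the second compares -1536 against the clamped 0 and fails, which is the shape of the assertion that takes down the worker.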
+
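+### 5.3 Trigger conditions
+
+- in a streaming session, the prefix committed across turns is longer than the fill_ids newly added in this turn
+- E2 never reached this state because its pipeline was blocked; E3 fixed the cold-D bottleneck → the pipeline ran faster → the landmine was exposed
+
+### 5.4 Current mitigation
+
+The pre-filter pass from commit 986f351 drops such reqs at the entry of `prepare_for_extend` (the client sees an error response instead of a worker crash). This is a stopgap.
+
+### 5.5 Real fix
+
+The whole correction path at `schedule_batch.py:1572–1646` is **structurally unnecessary** once the block-level eviction refactor is done: [BLOCK_LEVEL_EVICTION_DESIGN_ZH.md](BLOCK_LEVEL_EVICTION_DESIGN_ZH.md) §3.7 explains that after the refactor, fill_ids / prefix_indices consistency is guaranteed automatically by radix `match_prefix`.
+
+→ Do not add more correction clauses; delete the whole segment.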
+
+---
+
+## 6. Failure diagnosis cheat sheet
+
+Look up the table below while running a sweep:
+
+| You see | Most likely | Check first |
+|---|---|---|
+| client `RuntimeError: generate stream ended before...` | §1 cascade | search the D scheduler log for `instance could be dead` |
+| one D has `binding=0` while the others are busy | §2 cold-D | the `per_decode_load` histogram |
+| TTFT p99 suddenly jumps to the 5–8s range | §3 evict storm | `release_session` call frequency + mean tokens freed |
+| high failure rate at sweep start, low at steady state | §4 reseed storm | peak of the mooncake transfer queue in the first 30s |
+| D worker process exits abnormally | §5 invariant crash | search the scheduler log for `AssertionError`, `extend_input_len` |
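+
+The log-based rows of the table lend themselves to a small triage helper. The sketch below covers only the failure modes that leave a textual signature (§1 and §5); §2–§4 need metrics rather than log lines. The `triage` function is hypothetical and the patterns are copied from the table.
+
+```python
+import re
+
+# Signature strings taken from the cheat sheet; labels refer to sections of this doc.
+SIGNATURES = [
+    (re.compile(r"instance could be dead"), "§1 mooncake cascade"),
+    (re.compile(r"generate stream ended before"), "§1 mooncake cascade"),
+    (re.compile(r"AssertionError|extend_input_len"), "§5 invariant crash"),
+]
+
+
+def triage(log_lines):
+    """Return {failure mode: number of matching log lines}."""
+    hits: dict[str, int] = {}
+    for line in log_lines:
+        for pattern, mode in SIGNATURES:
+            if pattern.search(line):
+                hits[mode] = hits.get(mode, 0) + 1
+                break  # count each line at most once
+    return hits
+
+
+if __name__ == "__main__":
+    sample = [
+        "[mooncake] Decode instance could be dead, dropping ...",
+        "RuntimeError: generate stream ended before producing any token",
+        "INFO scheduler tick",
+    ]
+    print(triage(sample))
+```
+
+Since a §5 crash is itself followed by a §1 cascade, seeing both labels in one sweep suggests checking the D scheduler log for the assertion first.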
+
+---
+
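+## 7. Roadmap hookup
+
+- Items 1/3/4 of Milestone 1 in [AUDIT_AND_ROADMAP_ZH.md](AUDIT_AND_ROADMAP_ZH.md) correspond to the real fixes for C / A / B in this table. After Milestone 1, §1–§4 here should all move from "unfixed" to "mitigated", and §5 should disappear outright.
+- The paper's §Limitations must state the situation honestly: "we identify five failure modes; A/C are addressed by this work, B/C' are partially addressed, D is a transient artifact of the in-progress refactor."
+
+---
+
+**Key takeaway**: treat failure modes as first-class artifacts. Giving every failure the five fields "symptom → root cause → trigger → mitigation → real fix" is a key tool for moving a prototype toward production grade. A reviewer is more convinced by seeing that you can enumerate your failures than by seeing that you beat the baseline.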
--- a/docs/INDEX_ZH.md
+++ b/docs/INDEX_ZH.md
@@ -0,0 +1,119 @@
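+# Documentation index
+
+**Purpose**: let any collaborator find the document they need within 10 minutes; let reviewers know what to read first.
+
+---
+
+## 0. The 3 docs to read when short on time
+
+Reading these in order is enough to join the discussion:
+
+1. [AUDIT_AND_ROADMAP_ZH.md](AUDIT_AND_ROADMAP_ZH.md): current project progress, weaknesses, and roadmap.
+2. [KVC_ROUTER_ALGORITHM.md](KVC_ROUTER_ALGORITHM.md): formalization of the algorithms (Algorithm 1/2/3 + Theorem 1/2).
+3. [V2_DEEP_ANALYSIS_ZH.md](V2_DEEP_ANALYSIS_ZH.md) §0 + §6: current v2 win/lose snapshot.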
+
+---
+
+## 1. By topic
+
+### 1.1 Progress / current state
+
+| Document | Contents |
+|---|---|
+| [AUDIT_AND_ROADMAP_ZH.md](AUDIT_AND_ROADMAP_ZH.md) | cross-branch consolidation + roadmap (the main entry point of this branch) |
+| [PROJECT_OVERVIEW.md](PROJECT_OVERVIEW.md) | project goals + terminology that distinguishes the three mechanisms (pd-disagg / pd-colo / kvcache-centric) |
+| [ONBOARDING_NEXT_AGENT_ZH.md](ONBOARDING_NEXT_AGENT_ZH.md) | 30-minute onboarding handbook for the next agent (from `kvc-debug-journey-v1-to-v4`) |
+
+### 1.2 Algorithms / formalization
+
+| Document | Contents |
+|---|---|
+| [KVC_ROUTER_ALGORITHM.md](KVC_ROUTER_ALGORITHM.md) | Algorithm 1 (Route) / 2 (Admit) / 3 (Dispatch) + Theorem 1 (no starvation) + Theorem 2 (lower bound on fast-path hits) |
+| [MIGRATION_V1_FINDINGS_ZH.md](MIGRATION_V1_FINDINGS_ZH.md) | measurements of the v1 thrashing pathology + why reset-on-success is the key fix |
+
+### 1.3 Experimental results
+
+| Document | Contents |
+|---|---|
+| [V2_DEEP_ANALYSIS_ZH.md](V2_DEEP_ANALYSIS_ZH.md) | SWE-Bench 50 sess ts=1: v2 wins 6/8 against 4DP CA + why TTFT p99 lags |
+| [V2_RESULTS_ZH.md](V2_RESULTS_ZH.md) | original v2 report (headline numbers are slightly optimistic; read the deep analysis alongside) |
+| [E1_E2_RESULTS_ZH.md](E1_E2_RESULTS_ZH.md) | E1 (naive 1P3D + kv-aware) vs E2 (KVC v2) on H200 + RDMA; forensics of E2's 80% failure |
+| [E3_FINDINGS_ZH.md](E3_FINDINGS_ZH.md) | E3 (+load-floor bonus) triggers the SGLang patch invariant crash at 16 min |
+| [E1_E2_FIX_DESIGN_ZH.md](E1_E2_FIX_DESIGN_ZH.md) | fix design for Q1 (mooncake death) + Q2 (cold-D2) |
+
+### 1.4 Key design discussions in progress
+
+| Document | Contents |
+|---|---|
+| [KVC_EVICTION_GRANULARITY_DESIGN_ZH.md](KVC_EVICTION_GRANULARITY_DESIGN_ZH.md) | architecture-level reflection: session-level evict conflicts with the KVC continuity design |
+| [BLOCK_LEVEL_EVICTION_DESIGN_ZH.md](BLOCK_LEVEL_EVICTION_DESIGN_ZH.md) | concrete API / steps / test plan for the block-level evict refactor (new on this branch) |
+| [RESEED_SLOW_PATH_AND_D_TO_P_GAP_ZH.md](RESEED_SLOW_PATH_AND_D_TO_P_GAP_ZH.md) | timeline of the reseed slow path + forensics of the D→P sync gap |
+| [D_TO_P_SYNC_CONTRACT_ZH.md](D_TO_P_SYNC_CONTRACT_ZH.md) | interface contract, staleness budget, and rollout phases for D→P sync (new on this branch) |
+
+### 1.5 Evaluation / methodology
+
+| Document | Contents |
+|---|---|
+| [EVALUATION_PROTOCOL_ZH.md](EVALUATION_PROTOCOL_ZH.md) | paper-quality evaluation protocol (N, CI, paired, stratify, baseline list, trace mix); new on this branch |
+| [REFACTOR_PLAN_V1_ZH.md](REFACTOR_PLAN_V1_ZH.md) | why we switched from ts=10 to ts=1 |
+| [TEAM_REPORT_AGENTIC_PD_HYBRID_ZH.md](TEAM_REPORT_AGENTIC_PD_HYBRID_ZH.md) | list of structural problems from the ts=10 era (mostly superseded) |
+
+### 1.6 Engineering debt / failure modes
+
+| Document | Contents |
+|---|---|
+| [SGLANG_PATCH_INVENTORY_ZH.md](SGLANG_PATCH_INVENTORY_ZH.md) | categorized inventory of the 785-line vendored SGLang patch (MUST-HAVE / WORKAROUND / EXPERIMENTAL / INSTRUMENTATION); new on this branch |
+| [FAILURE_MODES_ZH.md](FAILURE_MODES_ZH.md) | diagnosis + mitigation + real fix for the 5 failure modes (mooncake cascade / cold-D / evict storm / reseed storm / E3 invariant); new on this branch |
+
+### 1.7 Environment
+
+| Document | Contents |
+|---|---|
+| [H200_DRIVER570_SETUP_ZH.md](H200_DRIVER570_SETUP_ZH.md) | H200 + driver 570 + cu12.8 environment setup + 11 lessons learned |
+
+### 1.8 Archive (historical reference only)
+
+Everything under `docs/archive/` has been superseded by newer documents and need not be read:
+
+- `AGENTIC_FIT_ANALYSIS_ZH.md`, `STRUCTURAL_VALIDATION_REPORT_ZH.md`: early ts=10 analyses.
+- `KVCACHE_CENTRIC_PROGRESS_ZH.md`: early project snapshot.
+- `KVC_DEBUG_JOURNEY_V1_TO_V5.md`, `V5_PROFILE_INVESTIGATION_ZH.md`: notes from the v1–v5 tuning process.
+- `REFACTOR_PLAN_ZH.md`: v0 refactor plan.
+- `SWEBENCH_EXPERIMENT_*.md`: early experiment logs.
+
+---
+
+## 2. Recommended reading paths by role
+
+### 2.1 I am a SWE / research agent taking over
+
+1. Start with the 3 docs in §0 of this index.
+2. Then read [AUDIT_AND_ROADMAP_ZH.md](AUDIT_AND_ROADMAP_ZH.md) §3 (weaknesses) + §5 (GPU-free work list).
+3. Pick a Milestone 1 sub-item and start. `docs/BLOCK_LEVEL_EVICTION_DESIGN_ZH.md` and `docs/D_TO_P_SYNC_CONTRACT_ZH.md` are the two engineering tracks that are ready to go.
+
+### 2.2 I am a paper reviewer / pre-reading for review
+
+1. [KVC_ROUTER_ALGORITHM.md](KVC_ROUTER_ALGORITHM.md): algorithms + theorems.
+2. [V2_DEEP_ANALYSIS_ZH.md](V2_DEEP_ANALYSIS_ZH.md): core measured comparison + the limitations we identified ourselves.
+3. [E1_E2_RESULTS_ZH.md](E1_E2_RESULTS_ZH.md): ablation on real hardware + RDMA (includes the forensics of E2's 80% failure, showing that we can explain our failures).
+4. [AUDIT_AND_ROADMAP_ZH.md](AUDIT_AND_ROADMAP_ZH.md) §3: the weaknesses and future work we list ourselves (no hidden problems).
+
+### 2.3 I am a student reproducing the experiments
+
+1. [H200_DRIVER570_SETUP_ZH.md](H200_DRIVER570_SETUP_ZH.md).
+2. [EVALUATION_PROTOCOL_ZH.md](EVALUATION_PROTOCOL_ZH.md): which sweeps to run and under which protocol to compare.
+3. `scripts/sweep_ts1_migration_v2.sh`: main v2 sweep; `scripts/sweep_e1_naive_1p3d.sh` / `scripts/sweep_e2_kvc_v2_rdma.sh`: E1/E2 ablation.
+
+### 2.4 I want to look at the control plane and admission
+
+1. `src/agentic_pd_hybrid/policies.py`: `KvAwarePolicy.select` implements Algorithm 1.
+2. `src/agentic_pd_hybrid/replay.py`: `_invoke_session_direct` / `_invoke_kvcache_seeded_router` orchestrate Algorithm 3.
+3. `third_party/sglang/python/sglang/srt/managers/scheduler.py`: the D-side `_admit_direct_append` implements Algorithm 2.
+
+---
+
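+## 3. Maintenance conventions for this index
+
+- Every new design / experiment doc must add a row to the §1 tables here.
+- When a document is archived (moved to `docs/archive/`), delete its entry here or mark it "archived".
+- This index carries no substantive content, only navigation; any in-depth explanation lives in the document it points to.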
--- a/docs/SGLANG_PATCH_INVENTORY_ZH.md
+++ b/docs/SGLANG_PATCH_INVENTORY_ZH.md
@@ -0,0 +1,165 @@
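+# Vendored SGLang Patch: Categorized Inventory
+
+**Date**: 2026-05-13
+**Baseline**: clean SGLang v0.5.10 snapshot @ `bded083`
+**Current HEAD**: `origin/h200-cu130` + this branch (785 lines added / 17 lines deleted / 10 files)
+**Purpose**: let reviewers and the next collaborator see at a glance which patches are core mechanism, which are workarounds, and which can be retired after the refactor. Corresponds to the engineering-debt items in [AUDIT_AND_ROADMAP_ZH.md](AUDIT_AND_ROADMAP_ZH.md) §3.2 / §S6.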
+
+---
+
+## 0. TL;DR
+
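+| Category | Files | Lines (est.) | Fate |
+|---|---:|---:|---|
+| MUST-HAVE: core mechanism (Algorithm 1/2/3, streaming session lifecycle, admit RPC) | 6 | ~450 | kept long-term; the core of the paper claim |
+| WORKAROUND: patches over identified latent problems, to be retired after the refactor | 2 | ~150 | largely deleted once the block-level eviction refactor is done |
+| EXPERIMENTAL: features whose loop is not closed; the paper does not depend on them | 1 | ~60 | can be retired or kept as a future-work hook |
+| INSTRUMENTATION: diagnostics / logging | 1 | ~50 | kept, but should be isolated into a debug build |
+| MINOR: miscellaneous | 1 | ~3 | does not affect decisions |
+
+**Key guidance**: when the block-level eviction refactor ([BLOCK_LEVEL_EVICTION_DESIGN_ZH.md](BLOCK_LEVEL_EVICTION_DESIGN_ZH.md)) is complete, the ~150 WORKAROUND lines should be deleted in the same change. The `schedule_batch.py` invariant landmine triggered by E3 is a product of this path; the right answer is to fix eviction granularity, not to patch the engine.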
+
+---
+
+## 1. Per-file inventory
+
+### 1.1 `mem_cache/session_aware_cache.py` — MUST-HAVE *(pending refactor)*
+
+| Item | Contents | Introduced | Category |
+|---|---|---|---|
+| `SessionSlot` dataclass | metadata for reusing KV across turns of a streaming session | b8e6f13 | MUST-HAVE |
+| `last_access_time` field | needed for LRU decisions | 6e5ed8d | MUST-HAVE |
+| streaming branches of `match_prefix` / `cache_finished_req` / `cache_unfinished_req` | fast path for session reuse | b8e6f13 | **MUST-HAVE → pending refactor** (semantics change substantially after block-level evict) |
+| `release_session` calling `free(kv_indices)` directly | returns all KV in one shot when the session exits | b8e6f13 | **WORKAROUND → replace** (after the refactor it only calls `dec_lock_ref`) |
+| `slot_held_tokens` / `get_session_status` / `list_session_statuses` | status queries | 6e5ed8d | MUST-HAVE |
+
+**Note**: this file is the hub of the KVC design, and it is exactly what the block-level eviction refactor ([BLOCK_LEVEL_EVICTION_DESIGN_ZH.md](BLOCK_LEVEL_EVICTION_DESIGN_ZH.md) §3.1–§3.6) reworks. The 5 KV-ownership fields of `SessionSlot` (`req_pool_idx` / `kv_committed_len` / `kv_allocated_len` / `cache_protected_len` / `swa_evicted_seqlen`) should be deleted after the refactor; that deletion **will be flagged separately in the commit message** to make rollback easy.
+
+### 1.2 `managers/scheduler.py` — mixed categories
+
+The D-worker implementation of Algorithm 2, containing several independent patches. Categorized at function / line-range granularity:
+
+| Function / line range | Contents | Category | When it can be retired |
+|---|---|---|---|
+| `admit_direct_append(...)` | D-side admission RPC handler of Algorithm 2 | **MUST-HAVE** | never (paper core) |
+| `_should_allow_local_prefill_on_decode(req)` | decides whether a decode worker accepts local append-prefill without bootstrap | **MUST-HAVE** | never |
+| `_decode_session_cache_low_watermark_tokens()` | reads the watermark parameter | **WORKAROUND** | replaced by radix LRU after block-level evict |
+| `_decode_session_cache_target_available_tokens()` | computes the target number of available tokens | **WORKAROUND** | same as above |
+| `maybe_trim_decode_session_cache(...)` | proactively trims sessions (triggers `release_session`) | **WORKAROUND** | same as above: after the refactor radix LRU nibbles naturally and trim is no longer needed |
+| `_compute_backpressure_pause_hint(...)` | pause hint for the router | **EXPERIMENTAL** | the signal loop is not closed ([REAL_ALI_KVC_EXPERIMENT_LOG_ZH.md](../docs/archive/) §4.3), roadmap §S10; can be kept as a future-work hook |
+| `_compute_pool_breakdown_for_diagnostics()` | pool state snapshot for `/server_info` | **INSTRUMENTATION** | kept long-term, but gating it behind a flag is recommended |
+
+### 1.3 `managers/schedule_batch.py` — WORKAROUND (to be deleted)
+
+| Item | Contents | Introduced | Category |
+|---|---|---|---|
+| streaming-session `extend_input_len` correction (lines ~1572–1585) | sets extend_input_len to 0 when fill_ids < prefix_indices | b8e6f13 | **WORKAROUND** |
+| pre-filter pass dropping `fill_ids < prefix_indices` reqs | hotfix after E3 triggered the assertion (commit 986f351) | 986f351 | **WORKAROUND** |
+| tolerance logic around the invariant assert `seq_len - pre_len == req.extend_input_len` | companion to the correction | b8e6f13 | **WORKAROUND** |
+
+**All** ~85 lines **should be deleted wholesale** once the block-level eviction refactor is done: `BLOCK_LEVEL_EVICTION_DESIGN_ZH §3.7` explains that after the refactor the invariant holds structurally, so the correction path has no reason to exist. The E3 landmine ([E3_FINDINGS_ZH.md](E3_FINDINGS_ZH.md) §2) is a product of this workaround.
+
+### 1.4 `managers/session_controller.py` — MUST-HAVE
+
+| Item | Contents | Category |
+|---|---|---|
+| streaming session lifecycle hooks (open / close / admit signal) | lets P/D workers know when a streaming session starts / ends | MUST-HAVE |
+| session ID routing | lets the admission RPC find the right SessionSlot | MUST-HAVE |
+
+Not to be retired.
+
+### 1.5 `managers/io_struct.py` — MUST-HAVE
+
+| Item | Contents | Category |
+|---|---|---|
+| `AdmitDirectAppendReqInput` / `AdmitDirectAppendReqOutput` | request / response message types of the admit RPC | MUST-HAVE |
+| backpressure pause hint field | optional field on the same messages | EXPERIMENTAL |
+
+The EXPERIMENTAL field can be folded into the MUST-HAVE messages to stay compatible; by itself it creates no pressure to retire anything.
+
+### 1.6 `managers/tokenizer_communicator_mixin.py` — MUST-HAVE
+
+Communicator-side glue for the admit RPC. 19 lines; not to be retired.
+
+### 1.7 `entrypoints/http_server.py` — MUST-HAVE
+
+Registers the `/admit_direct_append` HTTP endpoint. 6 lines.
+
+### 1.8 `disaggregation/decode.py` — mixed categories
+
+| Item | Contents | Category |
+|---|---|---|
+| `DecodeReqToTokenPool`: relaxed `assert len(reusing) <= 1` | lets local append-prefill reuse several req_pool_idx in one batch | **MUST-HAVE** |
+| `DecodePreallocQueue` gains `refresh_allocatable_tokens` + a `maybe_trim_decode_session_cache` trigger | proactively trims sessions when the pool is full | **WORKAROUND** (after the refactor, radix LRU sheds naturally instead) |
+| `--disaggregation-decode-allow-local-prefill` flag | server-side opt-in for local append-prefill | **MUST-HAVE** |
+
+The ~30 lines of trim-trigger logic should be deleted after the refactor.
+
+### 1.9 `server_args.py` — MUST-HAVE
+
+| Item | Contents | Category |
+|---|---|---|
+| `--radix-eviction-policy priority` option | needed by the E1/E2 experiments | MUST-HAVE |
+| `--disaggregation-decode-allow-local-prefill` flag | see §1.8 | MUST-HAVE |
+
+13 lines, all CLI interface extensions; not to be retired.
+
+### 1.10 `disaggregation/mooncake_transfer_engine.py` — MINOR
+
+A 3-line tweak. Not a decision point.
+
+---
+
+## 2. Summary by category
+
+### 2.1 MUST-HAVE (keep)
+
+About 6 files, 450 lines:
+- `admit_direct_append` main path (Algorithm 2): scheduler + io_struct + tokenizer_communicator_mixin + http_server + session_controller
+- `SessionSlot` main path (streaming session lifecycle): most fields of session_aware_cache, session_controller
+- CLI / server interface: server_args, `allow_local_prefill` in decode.py
+
+### 2.2 WORKAROUND (delete after the block-level evict refactor)
+
+Roughly 2.5 files' worth of code, ~150 lines, touching four files:
+- the token-free path of `session_aware_cache.release_session`
+- `_decode_session_cache_low_watermark_tokens` / `_decode_session_cache_target_available_tokens` + `maybe_trim_decode_session_cache` in `scheduler.py`
+- streaming-session correction + drop-pre-filter in `schedule_batch.py` (including the hotfix for the E3 landmine)
+- the trim trigger inside `DecodePreallocQueue` in `decode.py`
+
+→ These patches exist because of the current architecture; after the refactor they should be deleted wholesale rather than having small bugs fixed.
+
+### 2.3 EXPERIMENTAL (loop not closed)
+
+About 60 lines:
+- backpressure pause hint (`_compute_backpressure_pause_hint` + the io_struct field): can be kept as a hook for a future control-plane feedback mechanism; retire it if it is still not wired up after 1 month
+
+### 2.4 INSTRUMENTATION (keep long-term, but gate behind a flag)
+
+About 50 lines:
+- `_compute_pool_breakdown_for_diagnostics` + the related `/server_info` fields: adding an `--enable-diagnostic-pool-snapshot` flag is recommended so the prod path does not pay the diagnostic overhead
+
+### 2.5 MINOR
+
+About 3 lines: ignore.
+
+---
+
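+## 3. Maintenance conventions
+
+1. **Every new SGLang change must be recorded in this inventory**: prefix the commit message with `feat(sglang): ...` / `fix(sglang): ...` and state in the PR description which §2 category it falls under.
+2. **Do not overwrite upstream files directly**: every patch must apply with git apply on top of v0.5.10 (keep hunk headers clean).
+3. **Delete the doc entry when deleting a WORKAROUND**: the same PR that completes the refactor should strike the corresponding rows from this document.
+4. **Do not promote EXPERIMENTAL to the main path**: any patch whose loop is not closed must be disabled by default.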
+
+---
+
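+## 4. Roadmap hookup
+
+- When Milestone 1 ([AUDIT_AND_ROADMAP_ZH.md](AUDIT_AND_ROADMAP_ZH.md) §4) carries out the block-level eviction refactor, **all of §2.2 should disappear**; this is an objective measure of how complete the refactor is.
+- When Milestone 2 splits the control plane into layers (§4.8), the §2.3 backpressure pause hint should either be enabled or retired; leaving it dangling is not allowed.
+- When Milestone 3 introduces learning-based admission (§4.15), the `admit_direct_append` interface in §2.1 should stay stable, with the policy swapped on the router side rather than the D side.
+
+---
+
+**Key takeaway**: the 785 vendored SGLang lines are not a monolithic black box. By the §0 estimates, more than half (~450 lines) is core mechanism that the paper needs, and about a fifth (~150 lines) is workaround for the current architecture that can be deleted wholesale after the refactor. With this inventory a reviewer can tell at once which parts are the paper's real contribution and which are temporary scaffolding for the prototype.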
--- a/pyproject.toml
+++ b/pyproject.toml
@@ -20,8 +20,21 @@ build-backend = "setuptools.build_meta"
 [tool.setuptools.packages.find]
 where = ["src"]

+[dependency-groups]
+# Pure-Python unit tests. Install via:
+#   uv sync --group test
+# These tests deliberately import only the algorithm-layer modules
+# (policies, trace, topology) so they run without SGLang / GPU / CUDA.
+test = [
+    "pytest>=8.0",
+]
+
 [tool.uv]
 prerelease = "allow"

 [tool.uv.sources]
 sglang = { path = "third_party/sglang/python", editable = true }
+
+[tool.pytest.ini_options]
+testpaths = ["tests"]
+addopts = "-q"
--- a/scripts/analysis/paired_compare.py
+++ b/scripts/analysis/paired_compare.py
@@ -0,0 +1,225 @@
+#!/usr/bin/env python3
+"""Paired latency comparison with bootstrap CI.
+
+Implements docs/EVALUATION_PROTOCOL_ZH.md §2.2 (M2 fix): when comparing
+mechanism A vs B on the same trace, the only honest comparison is paired
+on same-trial-mask. This script joins two metrics.jsonl by request_id,
+keeps the rows where BOTH sides succeeded, and reports paired deltas
+with 95% bootstrap CIs.
+
+Compared with the existing `compare_no_error.py`:
+  - works on raw metrics.jsonl, not pre-aggregated summary.json
+  - bootstrap CIs (not just point estimates)
+  - reports paired-mask size + per-side failure counts so the reader
+    sees how many rows were dropped from the comparison
+
+Usage:
+  scripts/analysis/paired_compare.py \
+      --baseline outputs/run-dp/request-metrics.jsonl \
+      --candidate outputs/run-kvc/request-metrics.jsonl
+  scripts/analysis/paired_compare.py ... --bootstrap 5000 --seed 42
+  scripts/analysis/paired_compare.py ... --json > paired.json
+
+stdlib only — no scipy/numpy. Runs without GPU and without SGLang.
+"""
+
+from __future__ import annotations
+
+import argparse
+import json
+import math
+import random
+import sys
+from pathlib import Path
+
+
+def _load(path: Path) -> dict[str, dict]:
+    out: dict[str, dict] = {}
+    with path.open() as handle:
+        for line in handle:
+            line = line.strip()
+            if not line:
+                continue
+            row = json.loads(line)
+            rid = row.get("request_id")
+            if rid is None:
+                continue
+            out[rid] = row
+    return out
+
+
+def _ok(row: dict) -> bool:
+    return row.get("error") is None and row.get("latency_s") is not None
+
+
+def _quantile(values: list[float], q: float) -> float:
+    if not values:
+        return float("nan")
+    s = sorted(values)
+    if len(s) == 1:
+        return s[0]
+    pos = (len(s) - 1) * q
+    lo = math.floor(pos)
+    hi = math.ceil(pos)
+    if lo == hi:
+        return s[lo]
+    return s[lo] + (s[hi] - s[lo]) * (pos - lo)
+
+
+def _stats(deltas: list[float]) -> dict[str, float]:
+    if not deltas:
+        return {"mean": float("nan"), "p50": float("nan"), "p90": float("nan"), "p99": float("nan")}
+    return {
+        "mean": sum(deltas) / len(deltas),
+        "p50": _quantile(deltas, 0.50),
+        "p90": _quantile(deltas, 0.90),
+        "p99": _quantile(deltas, 0.99),
+    }
+
+
+def _bootstrap_ci(
+    deltas: list[float], statistic, n_boot: int, rng: random.Random
+) -> tuple[float, float]:
+    """Return (lo, hi) 95% CI for `statistic(deltas)`."""
+    if len(deltas) < 2:
+        return (float("nan"), float("nan"))
+    n = len(deltas)
+    samples = []
+    for _ in range(n_boot):
+        # resample with replacement
+        resample = [deltas[rng.randrange(n)] for _ in range(n)]
+        samples.append(statistic(resample))
+    samples.sort()
+    lo = samples[int(0.025 * (n_boot - 1))]
+    hi = samples[int(0.975 * (n_boot - 1))]
+    return (lo, hi)
+
+
+def compare(
+    baseline: dict[str, dict],
+    candidate: dict[str, dict],
+    *,
+    metric: str,
+    n_boot: int,
+    seed: int,
+) -> dict:
+    common_ids = set(baseline.keys()) & set(candidate.keys())
+    paired_ids = [
+        rid for rid in common_ids if _ok(baseline[rid]) and _ok(candidate[rid])
+    ]
+    paired_ids.sort()
+
+    base_only_fail = sum(1 for rid in common_ids if not _ok(baseline[rid]))
+    cand_only_fail = sum(1 for rid in common_ids if not _ok(candidate[rid]))
+
+    deltas = []
+    wins = losses = ties = 0
+    for rid in paired_ids:
+        b = baseline[rid].get(metric)
+        c = candidate[rid].get(metric)
+        if b is None or c is None:
+            continue
+        d = float(c) - float(b)
+        deltas.append(d)
+        if d < 0:
+            wins += 1
+        elif d > 0:
+            losses += 1
+        else:
+            ties += 1
+
+    rng = random.Random(seed)
+    stats = _stats(deltas)
+    ci_mean = _bootstrap_ci(deltas, lambda x: sum(x) / len(x), n_boot, rng)
+    ci_p50 = _bootstrap_ci(deltas, lambda x: _quantile(x, 0.50), n_boot, rng)
+    ci_p90 = _bootstrap_ci(deltas, lambda x: _quantile(x, 0.90), n_boot, rng)
+
+    return {
+        "metric": metric,
+        "baseline_size": len(baseline),
+        "candidate_size": len(candidate),
+        "intersection_size": len(common_ids),
+        "paired_size": len(paired_ids),
+        "baseline_fail_in_common": base_only_fail,
+        "candidate_fail_in_common": cand_only_fail,
+        "delta_stats": stats,
+        "delta_mean_ci95": ci_mean,
+        "delta_p50_ci95": ci_p50,
+        "delta_p90_ci95": ci_p90,
+        "wins_candidate": wins,
+        "losses_candidate": losses,
+        "ties": ties,
+    }
+
+
+def _fmt(x: float, w: int = 6) -> str:
+    if x is None or (isinstance(x, float) and math.isnan(x)):
+        return "  nan "
+    return f"{x:+{w}.3f}"
+
+
+def render(result: dict) -> str:
+    s = result["delta_stats"]
+    mlo, mhi = result["delta_mean_ci95"]
+    p5lo, p5hi = result["delta_p50_ci95"]
+    p9lo, p9hi = result["delta_p90_ci95"]
+    n = result["paired_size"]
+    lines = [
+        f"# paired comparison ({result['metric']})",
+        "",
+        f"baseline rows:           {result['baseline_size']}",
+        f"candidate rows:          {result['candidate_size']}",
+        f"intersection (rid):      {result['intersection_size']}",
+        f"paired (both ok):        {result['paired_size']}",
+        f"  baseline fails in common:  {result['baseline_fail_in_common']}",
+        f"  candidate fails in common: {result['candidate_fail_in_common']}",
+        "",
+        "## delta (candidate - baseline)  — negative = candidate is faster",
+        "",
+        "| stat | value | 95% CI |",
+        "|---|---:|---:|",
+        f"| mean | {_fmt(s['mean'])} | [{_fmt(mlo)}, {_fmt(mhi)}] |",
+        f"| p50  | {_fmt(s['p50'])}  | [{_fmt(p5lo)}, {_fmt(p5hi)}] |",
+        f"| p90  | {_fmt(s['p90'])}  | [{_fmt(p9lo)}, {_fmt(p9hi)}] |",
+        f"| p99  | {_fmt(s['p99'])}  | — |",
+        "",
+        f"win/loss/tie: {result['wins_candidate']} / {result['losses_candidate']} / {result['ties']} (of {n})",
+    ]
+    return "\n".join(lines)
+
+
+def main() -> None:
+    p = argparse.ArgumentParser(description=__doc__.split("\n\n")[0])
+    p.add_argument("--baseline", required=True, type=Path)
+    p.add_argument("--candidate", required=True, type=Path)
+    p.add_argument(
+        "--metric",
+        default="latency_s",
+        choices=["latency_s", "ttft_s", "tpot_s"],
+        help="which per-request field to compare (default: latency_s)",
+    )
+    p.add_argument("--bootstrap", type=int, default=2000)
+    p.add_argument("--seed", type=int, default=20260512)
+    p.add_argument("--json", action="store_true")
+    args = p.parse_args()
+
+    baseline = _load(args.baseline)
+    candidate = _load(args.candidate)
+    if not baseline or not candidate:
+        print("empty input on one side", file=sys.stderr)
+        sys.exit(1)
+
+    result = compare(
+        baseline, candidate,
+        metric=args.metric, n_boot=args.bootstrap, seed=args.seed,
+    )
+
+    if args.json:
+        json.dump(json.loads(json.dumps(result), parse_constant=lambda _: None), sys.stdout, indent=2)  # NaN -> null
+        sys.stdout.write("\n")
+    else:
+        print(render(result))
+
+
+if __name__ == "__main__":
+    main()
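
Note on the `--json` path of paired_compare.py: the standard-library `json` encoder writes a float NaN as the bare token `NaN` and only calls a `default=` hook for objects it cannot serialize, so floats never reach that hook. The sketch below shows the behaviour and one possible explicit conversion. `nan_to_none` and the sample `result` dict are hypothetical names used for illustration only; they are not part of this commit.

```python
import json
import math


def nan_to_none(obj):
    """Recursively replace float NaN with None so the output is strict JSON."""
    if isinstance(obj, float) and math.isnan(obj):
        return None
    if isinstance(obj, dict):
        return {key: nan_to_none(value) for key, value in obj.items()}
    if isinstance(obj, (list, tuple)):
        return [nan_to_none(value) for value in obj]
    return obj


# Hypothetical result shaped like the output of compare() on an empty pairing.
result = {
    "delta_stats": {"mean": float("nan")},
    "delta_mean_ci95": (float("nan"), 0.25),
}

# The default= hook is not consulted for floats, so NaN leaks into the output.
raw = json.dumps(result, default=lambda x: None)

# After the explicit pass, NaN is emitted as null.
clean = json.dumps(nan_to_none(result))
```

Strict JSON parsers reject the `NaN` token, so downstream consumers of `paired.json` may need the converted form.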
--- a/scripts/analysis/stratified.py
+++ b/scripts/analysis/stratified.py
@@ -0,0 +1,227 @@
+#!/usr/bin/env python3
+"""Stratified latency / TTFT reporter for paper-quality evaluation.
+
+Implements docs/EVALUATION_PROTOCOL_ZH.md §1.3 (M3 fix): every headline
+number must be accompanied by a stratified breakdown so reviewers can
+see which slice the gains come from.
+
+Buckets the request rows from one or more metrics.jsonl files along:
+  - turn_id        : {1, 2-5, 6-20, 21+}
+  - input_length   : {<=8K, 8K-64K, >64K}
+  - overlap_ratio  : {<=0.3, 0.3-0.7, >0.7}
+  - append_tokens  : {<=128, 128-1K, 1K-8K, >8K}; input_length - observed_overlap_blocks * BLOCK_SIZE
+
+For each bucket, reports:
+  - n (total rows in bucket)
+  - n_ok (rows with no error and latency_s set)
+  - latency_s mean / p50 / p90 / p99
+  - ttft_s    mean / p50 / p90 / p99
+  - err_pct   (1 - n_ok/n)
+
+Usage:
+  scripts/analysis/stratified.py outputs/<run>/request-metrics.jsonl \
+      [outputs/<other-run>/request-metrics.jsonl ...]
+  scripts/analysis/stratified.py --dim turn_id outputs/<run>/request-metrics.jsonl
+  scripts/analysis/stratified.py --json outputs/<run>/request-metrics.jsonl > strat.json
+
+stdlib only — no pandas/numpy. Runs without GPU and without SGLang.
+"""
+
+from __future__ import annotations
+
+import argparse
+import json
+import math
+import sys
+from collections import defaultdict
+from pathlib import Path
+from typing import Iterable
+
+BLOCK_SIZE = 24  # SGLang radix block, matches docs/KVC_ROUTER_ALGORITHM.md §2
+
+TURN_BUCKETS: list[tuple[str, tuple[int, int]]] = [
+    ("turn=1", (1, 1)),
+    ("turn=2-5", (2, 5)),
+    ("turn=6-20", (6, 20)),
+    ("turn=21+", (21, 10**9)),
+]
+INPUT_BUCKETS: list[tuple[str, tuple[int, int]]] = [
+    ("input<=8K", (0, 8 * 1024)),
+    ("input=8K-64K", (8 * 1024 + 1, 64 * 1024)),
+    ("input>64K", (64 * 1024 + 1, 10**9)),
+]
+OVERLAP_BUCKETS: list[tuple[str, tuple[float, float]]] = [
+    ("overlap<=0.3", (0.0, 0.3)),
+    ("overlap=0.3-0.7", (0.3, 0.7)),
+    ("overlap>0.7", (0.7, 1.0001)),
+]
+APPEND_BUCKETS: list[tuple[str, tuple[int, int]]] = [
+    ("append<=128", (0, 128)),
+    ("append=128-1K", (129, 1024)),
+    ("append=1K-8K", (1025, 8 * 1024)),
+    ("append>8K", (8 * 1024 + 1, 10**9)),
+]
+
+DIM_BUCKETS: dict[str, list[tuple[str, tuple]]] = {
+    "turn_id": TURN_BUCKETS,
+    "input_length": INPUT_BUCKETS,
+    "overlap_ratio": OVERLAP_BUCKETS,
+    "append_tokens": APPEND_BUCKETS,
+}
+
+
+def _quantile(values: list[float], q: float) -> float:
+    """Linear-interpolation quantile, stdlib only."""
+    if not values:
+        return float("nan")
+    s = sorted(values)
+    if len(s) == 1:
+        return s[0]
+    pos = (len(s) - 1) * q
+    lo = math.floor(pos)
+    hi = math.ceil(pos)
+    if lo == hi:
+        return s[lo]
+    return s[lo] + (s[hi] - s[lo]) * (pos - lo)
+
+
+def _stats(values: list[float]) -> dict[str, float]:
+    if not values:
+        return {"mean": float("nan"), "p50": float("nan"), "p90": float("nan"), "p99": float("nan")}
+    return {
+        "mean": sum(values) / len(values),
+        "p50": _quantile(values, 0.50),
+        "p90": _quantile(values, 0.90),
+        "p99": _quantile(values, 0.99),
+    }
+
+
+def _bucket_for(value: float | int, buckets: list[tuple[str, tuple]]) -> str:
+    for label, (lo, hi) in buckets:
+        if lo <= value <= hi:
+            return label
+    return "OOB"
+
+
+def _classify(row: dict, dim: str) -> str:
+    if dim == "turn_id":
+        return _bucket_for(int(row.get("turn_id", 0)), TURN_BUCKETS)
+    if dim == "input_length":
+        return _bucket_for(int(row.get("input_length", 0)), INPUT_BUCKETS)
+    if dim == "overlap_ratio":
+        inp = max(1, int(row.get("input_length", 0)))
+        cached = int(row.get("observed_overlap_blocks", 0)) * BLOCK_SIZE
+        ratio = min(1.0, cached / inp)
+        return _bucket_for(ratio, OVERLAP_BUCKETS)
+    if dim == "append_tokens":
+        inp = int(row.get("input_length", 0))
+        cached = int(row.get("observed_overlap_blocks", 0)) * BLOCK_SIZE
+        return _bucket_for(max(0, inp - cached), APPEND_BUCKETS)
+    raise ValueError(f"Unknown dim: {dim}")
+
+
+def load_rows(paths: Iterable[Path]) -> list[dict]:
+    rows: list[dict] = []
+    for path in paths:
+        with path.open() as handle:
+            for line in handle:
+                line = line.strip()
+                if not line:
+                    continue
+                rows.append(json.loads(line))
+    return rows
+
+
+def stratify(rows: list[dict], dim: str) -> dict[str, dict]:
+    by_bucket: dict[str, list[dict]] = defaultdict(list)
+    for row in rows:
+        by_bucket[_classify(row, dim)].append(row)
+
+    output: dict[str, dict] = {}
+    for label, _ in DIM_BUCKETS[dim]:
+        bucket_rows = by_bucket.get(label, [])
+        n = len(bucket_rows)
+        ok = [r for r in bucket_rows if r.get("error") is None and r.get("latency_s") is not None]
+        n_ok = len(ok)
+        lat = [float(r["latency_s"]) for r in ok]
+        ttft = [float(r["ttft_s"]) for r in ok if r.get("ttft_s") is not None]
+        output[label] = {
+            "n": n,
+            "n_ok": n_ok,
+            "err_pct": (n - n_ok) / n if n else 0.0,
+            "latency_s": _stats(lat),
+            "ttft_s": _stats(ttft),
+        }
+    return output
+
+
+def render_table(name: str, stats: dict[str, dict]) -> str:
+    lines = [
+        f"## stratified by {name}",
+        "",
+        "| bucket | n | n_ok | err% | lat mean | lat p50 | lat p90 | lat p99 | ttft mean | ttft p50 | ttft p90 | ttft p99 |",
+        "|---|---:|---:|---:|---:|---:|---:|---:|---:|---:|---:|---:|",
+    ]
+    for label, _ in DIM_BUCKETS[name]:
+        s = stats[label]
+        lat = s["latency_s"]
+        ttft = s["ttft_s"]
+        lines.append(
+            "| {label} | {n} | {n_ok} | {err:.1%} | "
+            "{lm:.3f} | {l50:.3f} | {l90:.3f} | {l99:.3f} | "
+            "{tm:.3f} | {t50:.3f} | {t90:.3f} | {t99:.3f} |".format(
+                label=label,
+                n=s["n"],
+                n_ok=s["n_ok"],
+                err=s["err_pct"],
+                lm=lat["mean"],
+                l50=lat["p50"],
+                l90=lat["p90"],
+                l99=lat["p99"],
+                tm=ttft["mean"],
+                t50=ttft["p50"],
+                t90=ttft["p90"],
+                t99=ttft["p99"],
+            )
+        )
+    return "\n".join(lines)
+
+
+def main() -> None:
+    parser = argparse.ArgumentParser(description=__doc__.split("\n\n")[0])
+    parser.add_argument("metrics_paths", nargs="+", type=Path)
+    parser.add_argument(
+        "--dim",
+        choices=list(DIM_BUCKETS.keys()) + ["all"],
+        default="all",
+        help="stratification dimension (default: all four)",
+    )
+    parser.add_argument(
+        "--json",
+        action="store_true",
+        help="emit JSON instead of markdown tables",
+    )
+    args = parser.parse_args()
+
+    rows = load_rows(args.metrics_paths)
+    if not rows:
+        print("no rows loaded", file=sys.stderr)
+        sys.exit(1)
+
+    dims = list(DIM_BUCKETS.keys()) if args.dim == "all" else [args.dim]
+    result = {dim: stratify(rows, dim) for dim in dims}
+
+    if args.json:
+        json.dump(json.loads(json.dumps(result), parse_constant=lambda _: None), sys.stdout, indent=2)  # NaN -> null
+        sys.stdout.write("\n")
+        return
+
+    header_paths = ", ".join(str(p) for p in args.metrics_paths)
+    print(f"# stratified report ({len(rows)} rows from {header_paths})\n")
+    for dim in dims:
+        print(render_table(dim, result[dim]))
+        print()
+
+
+if __name__ == "__main__":
+    main()
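
Worked example for the two derived dimensions in stratified.py (`overlap_ratio` and `append_tokens`). This is a minimal sketch that mirrors the arithmetic in `_classify`; the helper name `overlap_and_append` and the sample rows are assumptions made for illustration, and `BLOCK_SIZE = 24` is copied from the diff above.

```python
BLOCK_SIZE = 24  # same constant as stratified.py in this diff


def overlap_and_append(row: dict) -> tuple[float, int]:
    """Return (overlap_ratio, append_tokens) the way _classify derives them."""
    inp = int(row.get("input_length", 0))
    cached = int(row.get("observed_overlap_blocks", 0)) * BLOCK_SIZE
    ratio = min(1.0, cached / max(1, inp))  # capped at 1.0
    append = max(0, inp - cached)           # never negative
    return ratio, append


# 30 cached blocks * 24 tokens = 720 cached tokens out of 1000 input tokens.
typical = overlap_and_append({"input_length": 1000, "observed_overlap_blocks": 30})

# Cached tokens (240) exceed the input (100): ratio is capped, append is 0.
over_cached = overlap_and_append({"input_length": 100, "observed_overlap_blocks": 10})
```

With the bucket tables above, the first row has ratio 0.72 and 280 appended tokens, so it lands in `overlap>0.7` and `append=128-1K`; the second lands in `overlap>0.7` and `append<=128`.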
--- a/src/agentic_pd_hybrid/policies.py
+++ b/src/agentic_pd_hybrid/policies.py
@@ -152,6 +152,49 @@ class StickyDecodePolicy:
        )


+CandidateScore = tuple[int, int, int, int]
+
+
+def score_candidate(
+    *,
+    overlap: int,
+    sticky: bool,
+    inflight: int,
+    assigned: int,
+    mean_assigned: float,
+    sticky_bonus: int,
+    load_floor_bonus: int,
+) -> CandidateScore:
+    """Pure scoring function for KvAwarePolicy (Algorithm 1 in KVC_ROUTER_ALGORITHM.md).
+
+    Returns the 4-tuple compared lexicographically by `select()` to pick the
+    best D. Extracted as a top-level function so unit tests can exercise it
+    without constructing topology/state objects.
+
+    Score tuple positions:
+        0: overlap + sticky_bonus*sticky + floor_bonus  — primary, KV reuse aware
+        1: sticky                                         — tie-1, session locality
+        2: -inflight                                      — tie-2, prefer low load
+        3: -assigned                                      — tie-3, prefer rarely-picked
+
+    Load-floor bonus is gated on `not sticky` (turn-1+ sessions continue to
+    stick to their original D). The boost magnitude scales linearly with the
+    D's deficit relative to the running mean of decode_assignment_counts:
+        floor_bonus = int(load_floor_bonus * max(0, mean - assigned) / mean)
+    When mean == 0 (warmup) the bonus is 0 for all candidates (lex tiebreak
+    falls through to iteration order).
+
+    See docs/E1_E2_FIX_DESIGN_ZH.md §Q2 for the load-floor design and
+    docs/KVC_ROUTER_ALGORITHM.md §3.1 for the lex-score formalism.
+    """
+    floor_bonus = 0
+    if load_floor_bonus > 0 and not sticky and mean_assigned > 0:
+        deficit = max(0.0, mean_assigned - assigned)
+        floor_bonus = int(load_floor_bonus * deficit / mean_assigned)
+    primary = overlap + (sticky_bonus if sticky else 0) + floor_bonus
+    return (primary, int(sticky), -inflight, -assigned)
+
+
@dataclass(frozen=True)
 class KvAwarePolicy:
    name: str = "kv-aware"
@@ -161,27 +204,11 @@ class KvAwarePolicy:
    # 0 disables the mechanism. Default 3 picked empirically to allow brief
    # transient saturation without panicking, but to reroute persistent starvation.
    migration_reject_threshold: int = 3
-    # Load-floor bonus: graduated boost added to lex-score position 0 for
-    # under-loaded D workers, gated on `not sticky` so turn-1+ requests of an
-    # existing session continue to stick to their original D. The boost
-    # magnitude scales linearly with the D's deficit relative to the running
-    # mean of `decode_assignment_counts`:
-    #   floor_bonus = K * max(0, mean - assigned[D]) / max(1, mean)
-    # When mean=0 (warmup), bonus is 0 for all workers (lex tiebreak by
-    # iteration order). Once any D has been assigned, under-loaded D's get a
-    # bonus that approaches K as their deficit-to-mean ratio approaches 1.
-    # The bonus naturally decays as load equalises, leaving the original
-    # overlap+sticky scoring in charge of steady-state selection.
-    #
-    # Set this above the maximum cross-session boilerplate overlap you expect
-    # so that fresh sessions are routed to under-loaded D's even when those
-    # D's currently have 0 overlap, but below the magnitude of "real" prefix
-    # overlap (e.g., a session with 800-block per-session prefix on an
-    # already-warm D should still go there).
-    #
-    # 0 disables. See docs/E1_E2_FIX_DESIGN_ZH.md §Q2 for the full design and
-    # docs/E1_E2_RESULTS_ZH.md §5d for why this is needed on Inferact-shaped
-    # workloads where boilerplate overlap pins D2 cold forever.
+    # Load-floor bonus: see score_candidate() docstring for the exact formula.
+    # Set above the max cross-session boilerplate overlap you expect (so fresh
+    # sessions reach under-loaded D's even at 0 overlap), but below the
+    # magnitude of "real" prefix overlap (so a warm D still wins for its own
+    # session). 0 disables.
    load_floor_bonus: int = 0

    def select(
@@ -194,15 +221,12 @@ class KvAwarePolicy:
        prefill_worker_id = state.next_prefill_worker_id(topology)
        session = state.session_state.get(request.session_id)

-        # Pre-compute the running mean of decode assignments. Used by the
-        # load-floor bonus inside the candidate loop.
        n_route_workers = max(1, len(topology.route_workers))
        total_assigned = sum(state.decode_assignment_counts.values())
        mean_assigned = total_assigned / n_route_workers

        best_decode_worker_id: str | None = None
-        best_score: tuple[int, int, int, int] | None = None
-        candidates_considered = 0
+        best_score: CandidateScore | None = None
        for worker in topology.route_workers:
            # Migration: skip workers that have rejected this session too many times.
            # If all candidates get filtered (degenerate case), fall through to
@@ -213,25 +237,17 @@ class KvAwarePolicy:
                )
                if rejects >= self.migration_reject_threshold:
                    continue
-            candidates_considered += 1
-            overlap = _overlap_blocks(request, state, worker.worker_id)
-            sticky = int(session is not None and session.last_decode_worker == worker.worker_id)
-            inflight_penalty = -state.inflight_decode.get(worker.worker_id, 0)
-            worker_assigned = state.decode_assignment_counts.get(worker.worker_id, 0)
-            assignment_penalty = -worker_assigned
-
-            # Load-floor bonus: only for fresh placements (not sticky), and
-            # only when the knob is enabled. See docstring above.
-            floor_bonus = 0
-            if self.load_floor_bonus > 0 and not sticky and mean_assigned > 0:
-                deficit = max(0.0, mean_assigned - worker_assigned)
-                floor_bonus = int(self.load_floor_bonus * deficit / mean_assigned)
-
-            score = (
-                overlap + sticky * self.sticky_bonus + floor_bonus,
-                sticky,
-                inflight_penalty,
-                assignment_penalty,
+            score = score_candidate(
+                overlap=_overlap_blocks(request, state, worker.worker_id),
+                sticky=(
+                    session is not None
+                    and session.last_decode_worker == worker.worker_id
+                ),
+                inflight=state.inflight_decode.get(worker.worker_id, 0),
+                assigned=state.decode_assignment_counts.get(worker.worker_id, 0),
+                mean_assigned=mean_assigned,
+                sticky_bonus=self.sticky_bonus,
+                load_floor_bonus=self.load_floor_bonus,
            )
            if best_score is None or score > best_score:
                best_score = score
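
Worked example of the load-floor arithmetic in `score_candidate`. The function body below is a standalone copy of the logic in this diff (renamed `score_candidate_sketch` so it runs without the package); the numbers `load_floor_bonus=100`, a 50-block boilerplate overlap, and a mean of 10 assignments are illustrative assumptions, not measured values.

```python
def score_candidate_sketch(*, overlap, sticky, inflight, assigned,
                           mean_assigned, sticky_bonus, load_floor_bonus):
    """Standalone copy of the lex-score from the diff, for illustration."""
    floor_bonus = 0
    if load_floor_bonus > 0 and not sticky and mean_assigned > 0:
        deficit = max(0.0, mean_assigned - assigned)
        floor_bonus = int(load_floor_bonus * deficit / mean_assigned)
    primary = overlap + (sticky_bonus if sticky else 0) + floor_bonus
    return (primary, int(sticky), -inflight, -assigned)


common = dict(sticky=False, inflight=0, mean_assigned=10.0,
              sticky_bonus=1, load_floor_bonus=100)

# Cold D: no overlap, 8 below the mean -> floor bonus int(100 * 8 / 10) = 80.
cold = score_candidate_sketch(overlap=0, assigned=2, **common)

# Warm D holding only shared boilerplate, above the mean -> no bonus.
boilerplate = score_candidate_sketch(overlap=50, assigned=18, **common)

# Warm D holding a real 800-block session prefix.
real_prefix = score_candidate_sketch(overlap=800, assigned=18, **common)
```

Under these assumed numbers the tuples compare as `real_prefix > cold > boilerplate`: the bonus lifts the cold D past boilerplate-only overlap but stays below a genuine per-session prefix, which is the tuning window described in the `load_floor_bonus` comment.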
--- a/tests/README.md
+++ b/tests/README.md
@@ -0,0 +1,39 @@
+# Tests
+
+Pure-Python unit + property tests for the algorithm layer. These tests do
+**not** import SGLang and do **not** need a GPU — they validate the routing
+algorithm (Algorithm 1/2/3 in `docs/KVC_ROUTER_ALGORITHM.md`) and its
+theorems against the pure functions extracted from `policies.py`.
+
+## Run
+
+```bash
+uv sync --group test
+uv run pytest
+```
+
+Or, without uv:
+
+```bash
+pip install pytest
+PYTHONPATH=src pytest tests
+```
+
+## Scope
+
+- `test_policy_scoring.py` — Algorithm 1 lex-score properties (overlap
+  dominates sticky, load-floor gating, tie-breakers).
+- `test_no_starvation.py` — Theorem 1: bounded retries before some D either
+  accepts or the least-rejected D is forced through the degenerate path.
+
+Future:
+- block-level eviction `MockRadixCache` tests (see
+  `docs/BLOCK_LEVEL_EVICTION_DESIGN_ZH.md` §5).
+- D→P sync `staleness_budget` property tests (see
+  `docs/D_TO_P_SYNC_CONTRACT_ZH.md` §1).
+
+## Why no integration tests here
+
+Anything that needs SGLang, mooncake, or a real model is an integration
+test and must run on hardware. Those tests live as `scripts/sweep_*.sh`
+under the evaluation protocol in `docs/EVALUATION_PROTOCOL_ZH.md`.
--- a/tests/__init__.py
+++ b/tests/__init__.py
--- a/tests/_fixtures.py
+++ b/tests/_fixtures.py
@@ -0,0 +1,66 @@
+"""Lightweight fixtures for algorithm-layer tests.
+
+Builds minimal TraceRequest / SingleNodeTopology / RoutingState instances
+without invoking build_single_node_topology() (which validates GPU budgets
+we don't care about in unit tests).
+"""
+
+from __future__ import annotations
+
+from agentic_pd_hybrid.topology import SingleNodeTopology, WorkerSpec
+from agentic_pd_hybrid.trace import TraceRequest
+
+
+def make_topology(decode_count: int = 3, prefill_count: int = 1) -> SingleNodeTopology:
+    prefill_workers = tuple(
+        WorkerSpec(
+            role="prefill",
+            ordinal=i,
+            gpu_ids=(i,),
+            host="127.0.0.1",
+            port=30000 + i,
+        )
+        for i in range(prefill_count)
+    )
+    decode_workers = tuple(
+        WorkerSpec(
+            role="decode",
+            ordinal=i,
+            gpu_ids=(prefill_count + i,),
+            host="127.0.0.1",
+            port=31000 + i,
+        )
+        for i in range(decode_count)
+    )
+    return SingleNodeTopology(
+        model_path="/dev/null/test-model",
+        prefill_workers=prefill_workers,
+        decode_workers=decode_workers,
+        direct_workers=(),
+        router_host="127.0.0.1",
+        router_port=8000,
+        transfer_backend="mooncake",
+        trust_remote_code=True,
+    )
+
+
+def make_request(
+    *,
+    session_id: str = "sess-1",
+    turn_id: int = 0,
+    hash_ids: tuple[int, ...] = (),
+    input_length: int = 1024,
+    output_length: int = 64,
+) -> TraceRequest:
+    return TraceRequest(
+        request_id=f"{session_id}-t{turn_id}",
+        session_id=session_id,
+        chat_id=int(turn_id),
+        parent_chat_id=-1 if turn_id == 0 else int(turn_id - 1),
+        timestamp_s=float(turn_id),
+        input_length=input_length,
+        output_length=output_length,
+        request_type="user",
+        turn_id=turn_id,
+        hash_ids=hash_ids,
+    )
--- a/tests/test_no_starvation.py
+++ b/tests/test_no_starvation.py
@@ -0,0 +1,150 @@
+"""Theorem 1 — no permanent starvation under bounded retries.
+
+Reference: docs/KVC_ROUTER_ALGORITHM.md §4.1.
+
+    For any session s with τ_reject ≥ 1, after at most |D| · τ_reject
+    consecutive admission rejects on s, the routing policy MUST still
+    return a valid decision (via the degenerate "least-rejected D"
+    fallback). The session cannot be permanently starved at the policy
+    layer.
+
+We can't exercise the full Dispatch loop here (it lives in replay.py and
+needs HTTP, mooncake, etc.). What we CAN test is the policy-layer
+guarantee: after K = |D| · τ_reject reject bumps, select() never raises
+and never returns a worker that's both blacklisted *and* has positive
+overlap (the degenerate path chooses by least-rejected).
+
+This is the property-layer companion to test_policy_scoring.py's
+quantitative checks.
+"""
+
+from __future__ import annotations
+
+from agentic_pd_hybrid.policies import KvAwarePolicy, RoutingState
+
+from ._fixtures import make_request, make_topology
+
+
+def test_select_returns_valid_decision_under_full_blacklist():
+    """Bump all (s, d) reject counters past τ_reject. select() must still
+    pick a worker (degenerate fallback, no exception, no None)."""
+    topology = make_topology(decode_count=3)
+    state = RoutingState.create(topology)
+    request = make_request(session_id="s-stuck", turn_id=0)
+    policy = KvAwarePolicy(migration_reject_threshold=3)
+
+    # Pre-fill the blacklist for every D.
+    for worker in topology.route_workers:
+        for _ in range(3):
+            state.record_admission_reject(request.session_id, worker.worker_id)
+
+    decision = policy.select(request=request, topology=topology, state=state)
+    assert decision.decode_worker_id is not None
+    assert decision.decode_worker_id in {w.worker_id for w in topology.route_workers}
+
+
+def test_bounded_retries_to_force_degenerate_path():
+    """Theorem 1: at most |D| · τ_reject rejects suffice to either exhaust
+    every D or to force the degenerate fallback. Simulate the worst case
+    where each retry picks a fresh D and is immediately rejected."""
+    topology = make_topology(decode_count=4)
+    state = RoutingState.create(topology)
+    request = make_request(session_id="s-worst", turn_id=0)
+    threshold = 3
+    policy = KvAwarePolicy(migration_reject_threshold=threshold)
+
+    seen_decoders: set[str] = set()
+    max_retries = len(topology.route_workers) * threshold
+
+    for _ in range(max_retries):
+        decision = policy.select(request=request, topology=topology, state=state)
+        seen_decoders.add(decision.decode_worker_id)
+        # Adversary: this D rejects this session.
+        state.record_admission_reject(request.session_id, decision.decode_worker_id)
+
+    # After |D|·τ_reject rejects every D must be blacklisted, so the next
+    # select() takes the degenerate "least-rejected" branch and STILL
+    # returns a valid worker.
+    final = policy.select(request=request, topology=topology, state=state)
+    assert final.decode_worker_id in {w.worker_id for w in topology.route_workers}
+    # And we should have explored every D over the bounded retries — the
+    # algorithm cannot trap a session on a single D when all are rejecting.
+    assert seen_decoders == {w.worker_id for w in topology.route_workers}
+
+
+def test_least_rejected_d_chosen_when_all_blacklisted():
+    """When every D is past threshold, the degenerate fallback chooses the
+    one with the *fewest* rejects (Algorithm 1, line 4)."""
+    topology = make_topology(decode_count=3)
+    state = RoutingState.create(topology)
+    request = make_request(session_id="s-lr", turn_id=0)
+    policy = KvAwarePolicy(migration_reject_threshold=3)
+
+    # Skew rejections: decode-0 has 5, decode-1 has 10, decode-2 has 3.
+    # All are >= threshold=3, so the filter wipes out every candidate.
+    # The fallback should pick decode-2 (smallest rejection count).
+    workers = list(topology.route_workers)
+    bumps = {workers[0].worker_id: 5, workers[1].worker_id: 10, workers[2].worker_id: 3}
+    for wid, n in bumps.items():
+        for _ in range(n):
+            state.record_admission_reject(request.session_id, wid)
+
+    decision = policy.select(request=request, topology=topology, state=state)
+    assert decision.decode_worker_id == workers[2].worker_id
+
+
+def test_other_session_unaffected_by_blacklist():
+    """Algorithm 1's filter is per-(session, D), not per-D. Session A's
+    rejects must not influence session B's routing."""
+    topology = make_topology(decode_count=2)
+    state = RoutingState.create(topology)
+    policy = KvAwarePolicy(migration_reject_threshold=3)
+
+    # Blacklist decode-0 for session A.
+    workers = list(topology.route_workers)
+    for _ in range(3):
+        state.record_admission_reject("session-A", workers[0].worker_id)
+
+    # Session B sees a clean slate — should be able to pick decode-0
+    # (which is the iteration-order winner under empty state).
+    decision_b = policy.select(
+        request=make_request(session_id="session-B"),
+        topology=topology,
+        state=state,
+    )
+    # decode-0 wins iteration-order tiebreak when all scores are (0,0,0,0).
+    assert decision_b.decode_worker_id == workers[0].worker_id
+
+
+def test_threshold_zero_disables_blacklist():
+    """migration_reject_threshold=0 means the migration mechanism is off:
+    every D stays a candidate regardless of its reject count."""
+    topology = make_topology(decode_count=2)
+    state = RoutingState.create(topology)
+    request = make_request(session_id="s-no-mig")
+    policy = KvAwarePolicy(migration_reject_threshold=0)
+
+    workers = list(topology.route_workers)
+    # Pile a huge number of rejects on decode-0.
+    for _ in range(100):
+        state.record_admission_reject(request.session_id, workers[0].worker_id)
+
+    decision = policy.select(request=request, topology=topology, state=state)
+    # decode-0 should still be eligible; with empty overlap/sticky/inflight,
+    # iteration order picks decode-0 first.
+    assert decision.decode_worker_id == workers[0].worker_id
+
+
+def test_reject_counter_only_grows_on_record():
+    """RoutingState.record_admission_reject is the ONLY mutator for the
+    counter. select() must not silently bump it."""
+    topology = make_topology(decode_count=2)
+    state = RoutingState.create(topology)
+    request = make_request(session_id="s-clean")
+    policy = KvAwarePolicy()
+
+    for _ in range(5):
+        policy.select(request=request, topology=topology, state=state)
+
+    # No explicit record_admission_reject -> all counters stay zero.
+    assert sum(state.session_d_rejects.values()) == 0
--- a/tests/test_policy_scoring.py
+++ b/tests/test_policy_scoring.py
@@ -0,0 +1,189 @@
+"""Unit tests for Algorithm 1 (KvAwarePolicy score_candidate).
+
+Reference: docs/KVC_ROUTER_ALGORITHM.md §3.1. The lex-score is
+
+    (overlap + sticky_bonus*sticky + floor_bonus,
+     sticky,
+     -inflight,
+     -assigned)
+
+These tests pin down the qualitative properties that the algorithm's
+correctness arguments rely on. They run without SGLang/GPU.
+"""
+
+from __future__ import annotations
+
+from agentic_pd_hybrid.policies import score_candidate
+
+
+def _score(**overrides):
+    """Helper: build a score with all defaults and per-test overrides."""
+    args = dict(
+        overlap=0,
+        sticky=False,
+        inflight=0,
+        assigned=0,
+        mean_assigned=0.0,
+        sticky_bonus=1,
+        load_floor_bonus=0,
+    )
+    args.update(overrides)
+    return score_candidate(**args)
+
+
+# -- Determinism ----------------------------------------------------------------
+
+
+def test_score_is_pure():
+    """Same kwargs must produce the same tuple (no hidden state)."""
+    a = _score(overlap=3, sticky=True, inflight=1, assigned=7)
+    b = _score(overlap=3, sticky=True, inflight=1, assigned=7)
+    assert a == b
+
+
+def test_score_returns_4_tuple():
+    s = _score()
+    assert isinstance(s, tuple)
+    assert len(s) == 4
+    assert all(isinstance(x, int) for x in s)
+
+
+# -- Primary term: overlap dominates sticky --------------------------------------
+
+
+def test_overlap_strictly_dominates_pure_sticky():
+    """Theorem-2 building block: overlap greater than sticky_bonus on a
+    non-sticky D wins against a sticky-only D with zero overlap (sticky_bonus=1)."""
+    overlap = _score(overlap=2, sticky=False)
+    sticky_only = _score(overlap=0, sticky=True)
+    assert overlap > sticky_only
+
+
+def test_overlap_plus_sticky_beats_overlap_alone():
+    """Two D's with equal overlap: sticky one wins (sticky_bonus contributes
+    to primary AND wins tie-1)."""
+    sticky_d = _score(overlap=5, sticky=True)
+    fresh_d = _score(overlap=5, sticky=False)
+    assert sticky_d > fresh_d
+
+
+# -- Tie breakers ----------------------------------------------------------------
+
+
+def test_tiebreaker_inflight_lower_wins():
+    """Equal primary & sticky: prefer the D with fewer in-flight requests."""
+    low = _score(overlap=3, sticky=False, inflight=0, assigned=10)
+    high = _score(overlap=3, sticky=False, inflight=5, assigned=10)
+    assert low > high
+
+
+def test_tiebreaker_assigned_lower_wins():
+    """Equal primary & sticky & inflight: prefer rarely-picked D."""
+    rare = _score(overlap=3, sticky=False, inflight=2, assigned=1)
+    frequent = _score(overlap=3, sticky=False, inflight=2, assigned=99)
+    assert rare > frequent
+
+
+def test_tiebreaker_strict_lex_order():
+    """Sticky always beats non-sticky on tie-1 even if non-sticky has lower
+    inflight (the lex order is strict, position 1 outranks positions 2/3)."""
+    # With sticky_bonus=1 added to position 0, equal overlap would let the
+    # sticky D win on position 0 already (4+1=5 > 4). Lower the sticky D's
+    # overlap so both primaries equal 4 and position 1 decides.
+    sticky_busy = _score(overlap=3, sticky=True, inflight=10, assigned=10)
+    fresh_idle = _score(overlap=4, sticky=False, inflight=0, assigned=0)
+    # Equal primary (3+1 == 4), so the sticky flag at position 1 decides.
+    assert sticky_busy[0] == fresh_idle[0] == 4
+    assert sticky_busy[1] > fresh_idle[1]
+    assert sticky_busy > fresh_idle
+
+
+# -- Load-floor bonus ------------------------------------------------------------
+
+
+def test_load_floor_disabled_by_default():
+    """load_floor_bonus=0 → no contribution to primary."""
+    s = _score(overlap=0, sticky=False, mean_assigned=10, assigned=0)
+    assert s[0] == 0
+
+
+def test_load_floor_gated_off_when_sticky():
+    """Even with load_floor_bonus>0, sticky D does NOT receive the boost.
+    Otherwise a session would migrate away from its warm D under load."""
+    sticky_under_loaded = _score(
+        overlap=0, sticky=True, mean_assigned=10, assigned=0, load_floor_bonus=200
+    )
+    # primary = overlap(0) + sticky_bonus(1) + floor(0) = 1
+    assert sticky_under_loaded[0] == 1
+
+
+def test_load_floor_zero_when_mean_zero():
+    """Warmup case: mean_assigned=0 -> no D gets boost -> degenerate to lex
+    tiebreak by iteration order."""
+    s = _score(
+        overlap=0, sticky=False, mean_assigned=0, assigned=0, load_floor_bonus=200
+    )
+    assert s[0] == 0
+
+
+def test_load_floor_proportional_to_deficit():
+    """floor_bonus = K * deficit / mean. assigned=0, mean=10, K=200 -> 200."""
+    s_zero = _score(
+        overlap=0, sticky=False, mean_assigned=10, assigned=0, load_floor_bonus=200
+    )
+    s_half = _score(
+        overlap=0, sticky=False, mean_assigned=10, assigned=5, load_floor_bonus=200
+    )
+    s_full = _score(
+        overlap=0, sticky=False, mean_assigned=10, assigned=10, load_floor_bonus=200
+    )
+    # deficit = max(0, 10-0)=10 -> bonus = int(200*10/10) = 200
+    # deficit = max(0, 10-5)=5  -> bonus = int(200*5/10)  = 100
+    # deficit = max(0, 10-10)=0 -> bonus = 0
+    assert s_zero[0] == 200
+    assert s_half[0] == 100
+    assert s_full[0] == 0
+
+
+def test_load_floor_does_not_underflow_when_overloaded():
+    """assigned > mean -> deficit clamped to 0, no negative bonus."""
+    s = _score(
+        overlap=0, sticky=False, mean_assigned=10, assigned=50, load_floor_bonus=200
+    )
+    assert s[0] == 0
+
+
+# -- Routing intent: real overlap beats load-floor bonus -------------------------
+
+
+def test_real_prefix_overlap_beats_load_floor_on_warm_d():
+    """E1_E2_FIX_DESIGN_ZH §Q2: load_floor should be set such that
+    real per-session prefix overlap outweighs the cold-D bonus.
+    With overlap=800 (a per-session prefix) and load_floor_bonus=200,
+    a warm D (high overlap, possibly high load) should still win against
+    a cold D with floor bonus."""
+    warm = _score(
+        overlap=800, sticky=True, mean_assigned=10, assigned=10, load_floor_bonus=200
+    )
+    cold = _score(
+        overlap=0, sticky=False, mean_assigned=10, assigned=0, load_floor_bonus=200
+    )
+    # warm primary = 800 + 1 + 0 = 801. cold primary = 0 + 0 + 200 = 200.
+    assert warm[0] == 801
+    assert cold[0] == 200
+    assert warm > cold
+
+
+def test_boilerplate_overlap_loses_to_load_floor_for_cold_d():
+    """Same §Q2: load_floor should beat cross-session boilerplate overlap.
+    If load_floor_bonus=200 and the worst-case boilerplate overlap is ~50,
+    a fresh cold D should still win against a slightly-warm-from-boilerplate D."""
+    warm_boilerplate = _score(
+        overlap=50, sticky=False, mean_assigned=10, assigned=10, load_floor_bonus=200
+    )
+    cold_under_loaded = _score(
+        overlap=0, sticky=False, mean_assigned=10, assigned=0, load_floor_bonus=200
+    )
+    # warm_boilerplate primary = 50 + 0 + 0 = 50 (assigned=mean, no deficit).
+    # cold_under_loaded primary = 0 + 0 + 200 = 200.
+    assert cold_under_loaded > warm_boilerplate
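
The tests above pin down the lex-score from the outside. As a reading aid, here is a minimal sketch of a scoring function that would satisfy them. It is reconstructed only from the module docstring and the test comments (`int(K * deficit / mean)`, gated off when sticky, zero when `mean_assigned` is 0); the name `score_candidate_sketch` is hypothetical and this is not the code in `agentic_pd_hybrid.policies`.

```python
def score_candidate_sketch(
    *,
    overlap: int,
    sticky: bool,
    inflight: int,
    assigned: int,
    mean_assigned: float,
    sticky_bonus: int,
    load_floor_bonus: int,
) -> tuple[int, int, int, int]:
    """Hypothetical stand-in for score_candidate, inferred from the tests."""
    floor_bonus = 0
    # Assumption: the floor applies only to non-sticky candidates, only when
    # enabled, and only once some load exists (mean_assigned > 0).
    if load_floor_bonus > 0 and not sticky and mean_assigned > 0:
        deficit = max(0.0, mean_assigned - assigned)  # clamp: no negative bonus
        floor_bonus = int(load_floor_bonus * deficit / mean_assigned)
    primary = overlap + sticky_bonus * int(sticky) + floor_bonus
    # Tuples compare lexicographically; negate the load terms so that
    # fewer in-flight / fewer assigned requests rank higher.
    return (primary, int(sticky), -inflight, -assigned)


if __name__ == "__main__":
    common = dict(inflight=0, sticky_bonus=1, load_floor_bonus=200, mean_assigned=10)
    warm = score_candidate_sketch(overlap=800, sticky=True, assigned=10, **common)
    cold = score_candidate_sketch(overlap=0, sticky=False, assigned=0, **common)
    print(warm, cold, warm > cold)
```

Because the result is a plain tuple, candidate selection reduces to `max()` over the per-D scores, which is what makes the strict lex order in the tie-breaker tests fall out of Python's built-in tuple comparison.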
Author	SHA1	Message	Date
Gahow Wang	110bd68000	docs(failures): consolidated 5-mode failure taxonomy Consolidates failure modes scattered across V2_DEEP_ANALYSIS, E1_E2_RESULTS, E3_FINDINGS, KVC_EVICTION_GRANULARITY, REAL_ALI_KVC_EXPERIMENT into a single lookup table with five fields per mode: symptom → root cause → trigger → current mitigation → real fix. Five modes covered: A. Mooncake "instance not alive" cascade — E2 80%-failure pathology; admission no-space → seed burst → heartbeat drop → batch abort B. Cold-D / overlap-pinning — shared boilerplate hash pins all sessions to a subset of D's; load_floor_bonus is a patch, the real fix is exclusive_overlap redefinition C. Evict storm (session-level eviction) — release_session frees 38–88K tokens in one shot; fix is BLOCK_LEVEL_EVICTION_DESIGN C'. Reseed storm (turn-1 concurrent seeds) — startup-phase mooncake burst; fix is per-D pending-seed budget, frequency drops after C D. Streaming-session correction invariant crash (E3) — schedule_batch.py:1646 landmine, hotfixed by `986f351`, root-fix is removing the correction path entirely (BLOCK_LEVEL_EVICTION §3.7) Each mode has a forensic link back to the original experiment doc that surfaced it. §6 adds a diagnostic cheat sheet: "if you see X, look at Y." §7 wires every mode to a roadmap item — Milestone 1 should graduate §1–§4 to "mitigated" and eliminate §5. INDEX_ZH gets a new §1.6 section linking this and the SGLang patch inventory. No code change. Reading dependency for anyone debugging a sweep or writing paper §Limitations.	2026-05-13 00:43:58 +08:00
Gahow Wang	d93228e156	docs(sglang): patch surface inventory + retire-after-refactor list Resolves AUDIT_AND_ROADMAP §S6: the 785 lines of vendored SGLang patch are a known reviewer trust risk because the prototype touches scheduler.py / schedule_batch.py / session_aware_cache.py / disaggregation hot paths. Without classification readers cannot tell core mechanism from temporary scaffold. Classifies each of the 10 patched files into: MUST-HAVE — Algorithm 1/2/3, streaming session lifecycle, admit RPC. ~450 lines. Long-term retention. WORKAROUND — release_session token-free, maybe_trim_decode_session_cache, streaming-session extend_input_len correction (incl. the E3 landmine hotfix from commit `986f351`), DecodePreallocQueue trim trigger. ~150 lines. To DELETE entirely after block-level eviction refactor (BLOCK_LEVEL_EVICTION_DESIGN §3.7). EXPERIMENTAL — backpressure pause hint (_compute_backpressure_pause_hint). ~60 lines. Signal not closed-loop per REAL_ALI §4.3; retain as hook or retire in 1 month. INSTRUMENTATION — _compute_pool_breakdown_for_diagnostics. ~50 lines. Keep behind a flag. MINOR — ~3 lines. Ignore. The §2 summary gives reviewers a one-glance picture of what's core vs. scaffold. Maintenance convention in §3 mandates classifying every new (sglang) patch at commit time. §4 wires the classification into the roadmap: clearing the WORKAROUND bucket is the objective completion marker for block-level eviction refactor. No code change.	2026-05-13 00:42:22 +08:00
Gahow Wang	9a81c993ab	docs(onboarding): link new audit / design / eval docs from the root README + AGENTS.md Without this, the five docs added on this branch (AUDIT_AND_ROADMAP, INDEX, BLOCK_LEVEL_EVICTION_DESIGN, D_TO_P_SYNC_CONTRACT, EVALUATION_PROTOCOL) are reachable only by listing docs/. This wires them into the two entry points an agent or collaborator hits first. README.md changes: - top-of-page pointer to INDEX_ZH for new collaborators - pointer to AUDIT_AND_ROADMAP_ZH for project state - "单元测试 (无 GPU)" (unit tests, no GPU) section: how to run pytest - "评测脚本" (evaluation scripts) section: invocations for the two new analysis scripts AGENTS.md changes: - top section "For new collaborators / agents" before the existing "Environment" block, pointing at INDEX_ZH, AUDIT_AND_ROADMAP_ZH, the two ready-to-pick-up design docs, and EVALUATION_PROTOCOL_ZH - pytest invocation under Environment	2026-05-12 23:58:56 +08:00
Gahow Wang	dbb9eee471	feat(analysis): paired comparison with bootstrap CI Implements docs/EVALUATION_PROTOCOL_ZH.md §2.2 (M2 fix): mechanism A vs B comparisons on the same trace must be paired on same-trial-mask, with errors and aborts surfaced rather than silently dropped. How it differs from scripts/analysis/compare_no_error.py: - works on raw request-metrics.jsonl (not pre-aggregated summary.json) so it can recompute paired masks - reports 95% bootstrap CIs for mean / p50 / p90 - exposes intersection size + per-side failure count in the intersection so the reader can see how many rows were dropped from the comparison and whether the candidate's win came from selection effects stdlib only: random.Random for bootstrap, no scipy/numpy. Default 2000 bootstrap iterations; seed is configurable for reproducibility. Verified locally on a synthetic 20-row pair (5s constant delta + one candidate failure): correctly reports paired_size=19, candidate_fail_in_common=1, mean delta -5.000s, 19/0/0 win/loss/tie. CLI: scripts/analysis/paired_compare.py \ --baseline outputs/run-dp/request-metrics.jsonl \ --candidate outputs/run-kvc/request-metrics.jsonl \ [--metric latency_s|ttft_s|tpot_s] \ [--bootstrap 5000] [--seed 42] [--json]	2026-05-12 23:57:57 +08:00
Gahow Wang	4021f27ee2	feat(analysis): stratified latency / TTFT reporter Implements docs/EVALUATION_PROTOCOL_ZH.md §1.3 (M3 fix): headline numbers must be accompanied by stratified breakdowns so reviewers can see which slice the gains come from. The script reads one or more request-metrics.jsonl files and buckets rows along four orthogonal dimensions: - turn_id : {1, 2-5, 6-20, 21+} - input_length : {<=8K, 8K-64K, >64K} - overlap_ratio : {<=0.3, 0.3-0.7, >0.7} - append_tokens : {<=128, 128-1K, 1K-8K, >8K} Per bucket: n, n_ok, err_pct, latency/ttft mean+p50+p90+p99. Output is markdown by default, --json for machine read. stdlib only — no pandas/numpy. Verified on a synthetic 5-row jsonl (turn=1 with one error correctly reports 33.3% err% on the bucket). Why this script and not pandas: - the existing scripts/analysis/* are stdlib-only; keeping consistency - reviewers can run it on the artifact without pip-installing anything beyond pytest - speed irrelevant; runs in <1s on the largest existing sweep (4449 rows) Usage shown in EVALUATION_PROTOCOL_ZH §3.	2026-05-12 23:57:13 +08:00
Gahow Wang	c5f552e122	test(policy): Theorem 1 no-starvation property tests Adds the algorithm-layer guarantee tests for docs/KVC_ROUTER_ALGORITHM.md §4.1. The full Dispatch loop lives in replay.py (HTTP + mooncake), but the policy-layer guarantee is testable in isolation: under any reject sequence, select() must keep returning a valid worker. Cases: - select returns a valid decision even after every (s,d) is past τ_reject (degenerate fallback) - |D|·τ_reject rejects suffice to explore every D (cannot trap a session on one D under universal rejection) - degenerate fallback picks the least-rejected D (Algorithm 1 line 4) - per-(session, D) isolation: session A's blacklist does not affect session B - migration_reject_threshold=0 disables blacklist - select() does NOT silently bump the reject counter (the only mutator is record_admission_reject) Adds tests/_fixtures.py with minimal make_topology() and make_request() helpers that skip build_single_node_topology's GPU-budget validation (irrelevant in unit tests). Verified locally: 20/20 passing under pytest 9.0.3. The six new tests cover only Algorithm 1's policy-layer half of Theorem 1; the reset-on-success half lives in Algorithm 3 (replay.py) and is a future test target.	2026-05-12 23:55:57 +08:00
Gahow Wang	a785b83023	test(policy): unit tests for Algorithm 1 lex scoring Adds the project's first test suite. Covers the score_candidate() pure function from the previous refactor commit, validating the qualitative properties that KVC_ROUTER_ALGORITHM.md §3.1 and §4.2 rely on. Tests / properties: - determinism: same args -> same tuple - shape: 4-int tuple - primary term: overlap dominates pure sticky - primary term: sticky_bonus credited - tie-2 inflight: lower wins - tie-3 assigned: lower wins - strict lex order: sticky wins position-1 over fresh-idle - load_floor disabled by default - load_floor gated off when sticky=True - load_floor zero during warmup (mean=0) - load_floor proportional to deficit (200/100/0 at 0/50/100% load) - load_floor does not underflow when overloaded - real per-session overlap beats load_floor on warm D - boilerplate overlap loses to load_floor on cold D (the cold-D fix from E1_E2_FIX_DESIGN §Q2) Test infrastructure: - tests/ package with README explaining the GPU-free scope and the run instruction - pyproject.toml [dependency-groups] test = [pytest>=8] (install via `uv sync --group test`) - pyproject.toml [tool.pytest.ini_options] sets testpaths Verified locally: 14/14 passing under pytest 9.0.3 in an isolated 3.13 venv. No SGLang / GPU touched.	2026-05-12 23:54:48 +08:00
Gahow Wang	76a79dfdda	refactor(policy): extract pure score_candidate() from KvAwarePolicy Pulls the per-D score computation out of KvAwarePolicy.select into a top-level pure function that takes primitives. The in-method behavior is unchanged — the loop now calls score_candidate() instead of inlining the arithmetic. Motivation: Algorithm 1 (KVC_ROUTER_ALGORITHM.md §3.1) is the routing core. Until now its only API was select(), which requires building TraceRequest + SingleNodeTopology + RoutingState to test even a single lex-score property. After this extraction, unit tests can drive the four-tuple score directly with integers. What changed: - Added module-level CandidateScore type alias. - Added score_candidate(*, overlap, sticky, inflight, assigned, mean_assigned, sticky_bonus, load_floor_bonus) -> CandidateScore. - KvAwarePolicy.select() loop body collapsed to a score_candidate() call; sticky now bool (was int) inside the call site. - Moved the load-floor docstring from KvAwarePolicy onto score_candidate where the formula lives. Verified pure: - same kwargs -> same tuple - overlap=5 beats sticky-only (no load_floor): (5,0,0,0) > (1,1,0,0) - load_floor gated off when sticky=True No behavior change; follow-up commit adds the unit tests this refactor enables.	2026-05-12 23:53:17 +08:00
Gahow Wang	591cd6d382	docs(eval): paper-quality evaluation protocol (M1–M6) Codifies the methodology fixes for every weakness called out in AUDIT_AND_ROADMAP_ZH §3.1. Existing sweep reports (KVCACHE_CENTRIC_PROGRESS_ZH, V2_RESULTS_ZH) violate at least one of these; future runs must use this protocol. Contents: - §1.1 M1 — N≥3 + bootstrap CI; no N=1 in headline - §1.2 M2 — paired-on-same-trial-mask; same trace / timeout / max_input_len / time_scale; errors and aborts each get their own column - §1.3 M3 — required stratification dimensions (turn_id / append_len / overlap_ratio / inter_turn_gap / input_len) - §1.4 M4 — minimum 2 baselines from a 6-item list, including at least one non-SGLang baseline - §1.5 M5 — trace mix: Ali full + SWE-Bench + ShareGPT + synthetic adversarial - §1.6 M6 — hardware tiers; single-node 4xH200 + dual-node NVLink/IB as minimum - §2 report templates (main table, paired delta, stratified, negative-result section) - §3 tool support: marks the two scripts that the follow-up commits on this branch add - §4 SOSP/OSDI artifact requirements - §5 pre-submission self-checklist - §6 phased delivery plan for catching up to protocol No code change; reading dependency for the analyzer scripts that follow.	2026-05-12 23:51:46 +08:00
Gahow Wang	fd37eda367	docs(design): D->P sync interface contract + 4-phase rollout Companion to BLOCK_LEVEL_EVICTION_DESIGN_ZH. Specifies the three-layer contract (mooncake / SGLang / agentic-pd-hybrid) that the empty feat/d-to-p-sync branch is meant to fill. Contents: - §1 staleness budget β as a first-class system parameter, with recommended default (page_size .. 4096 tokens) - §2.1 mooncake double-role API: KVRole enum extension, DecodeKVSender / PrefillKVReceiver class shapes, independent bootstrap channel - §2.2 SGLang RadixCache.insert_external signature with five concrete design decisions (re-mapping policy, failure handling, lock_ref discipline, evict interaction, multi-P backup view) - §2.3 agentic-pd-hybrid CLI flags, DirectSessionState additions, hook points in _invoke_session_direct and _invoke_kvcache_seeded_router - §3 candidate Theorem 4 (reseed_cost upper bound under staleness budget β) - §4 P1..P4 rollout with validation criteria per phase - §5 five enumerated risks + mitigation - §6 explicit decoupling: block-level eviction first, then D->P sync; do NOT bundle in one PR Makes the feat/d-to-p-sync branch actionable for the next collaborator without GPU until P2 microbench phase.	2026-05-12 23:50:39 +08:00
Gahow Wang	683c44bd71	docs(design): block-level eviction refactor — concrete API plan Turns the architectural manifesto (KVC_EVICTION_GRANULARITY_DESIGN_ZH.md) into a function-by-function design the next collaborator can implement against. Contents: - §1 current SessionAwareCache state with exact field semantics (req_pool_idx / kv_committed_len / kv_allocated_len / cache_protected_len) - §3.1–§3.6 post-refactor source sketches for SessionSlot, cache_finished_req, cache_unfinished_req, match_prefix, release_session, get_session_status - §3.7 the schedule_batch.py:1572-1646 correction block we can remove (the E3 landmine) - §4 five invariants the PR must defend - §5 GPU-free unit + property test plan with a MockRadixCache shape - §6 ~1 week engineering estimate and three risks - §7 dependency relationship to the planned D->P sync work - §8 minimal step list for the implementing agent No code change yet. Future commits on a feat/block-level-evict branch will execute against this spec.	2026-05-12 23:49:18 +08:00
Gahow Wang	baa843a3f9	docs(index): collaborator-facing doc index Single navigation entry point. Existing docs were scattered across five branches with no clear reading order — this is the fix. Includes: - 3-doc fast path for anyone joining - topic-grouped table (algorithm / experiments / design discussions / evaluation / environment / archive) - role-based reading paths (new SWE, paper reviewer, reproducing student, control-plane reader) Index also references the four docs added later on this branch (AUDIT_AND_ROADMAP, BLOCK_LEVEL_EVICTION_DESIGN, D_TO_P_SYNC_CONTRACT, EVALUATION_PROTOCOL) so reviewers can see the planned layout up front.	2026-05-12 23:47:28 +08:00
Gahow Wang	6cdea52f28	docs(audit): cross-branch audit + 3-milestone roadmap Consolidates the state of the five working branches (main / kvc-debug-journey-v1-to-v4 / feat/d-to-p-sync / h200-cu130 / kvc-real-ali-iter-v1) into a single collaborator-facing document. Sections: - §1 per-branch state - §2 contributions a reviewer cannot refute - §3 weaknesses (M1–M6 methodology, S1–S10 system, infra) ranked by how badly they hurt at OSDI/SOSP - §4 3-milestone roadmap (defensible submission → production substrate → OSDI'27 increments) - §5 GPU-free work queue (what subsequent commits in this branch deliver) No code change. Acts as the index target for the follow-up commits on this branch.	2026-05-12 23:46:40 +08:00
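
The `feat(analysis): paired comparison with bootstrap CI` commit above describes pairing on a same-trial mask and reporting bootstrap confidence intervals using only the standard library. The sketch below illustrates that idea under stated assumptions: the row keys `request_id` and `ok`, the function names, and the percentile-style interval are hypothetical choices made for illustration, not the actual layout of `request-metrics.jsonl` or the code in `scripts/analysis/paired_compare.py`.

```python
import random
import statistics


def paired_deltas(baseline, candidate, metric="latency_s"):
    """Candidate-minus-baseline deltas over requests where both sides succeeded.

    Assumes each row is a dict with hypothetical keys "request_id" and "ok".
    """
    base = {row["request_id"]: row for row in baseline}
    cand = {row["request_id"]: row for row in candidate}
    common = sorted(base.keys() & cand.keys())
    deltas, base_fail, cand_fail = [], 0, 0
    for rid in common:
        b_ok, c_ok = bool(base[rid]["ok"]), bool(cand[rid]["ok"])
        base_fail += int(not b_ok)
        cand_fail += int(not c_ok)
        if b_ok and c_ok:  # same-trial mask: keep only rows valid on both sides
            deltas.append(cand[rid][metric] - base[rid][metric])
    info = {
        "common": len(common),
        "paired_size": len(deltas),
        "baseline_fail_in_common": base_fail,
        "candidate_fail_in_common": cand_fail,
    }
    return deltas, info


def bootstrap_mean_ci(deltas, iterations=2000, seed=42, alpha=0.05):
    """Percentile bootstrap CI for the mean delta, seeded for reproducibility."""
    if not deltas:
        raise ValueError("no paired rows to resample")
    rng = random.Random(seed)
    n = len(deltas)
    means = sorted(
        statistics.fmean(rng.choices(deltas, k=n)) for _ in range(iterations)
    )
    lo = means[int((alpha / 2) * iterations)]
    hi = means[min(iterations - 1, int((1 - alpha / 2) * iterations))]
    return statistics.fmean(deltas), lo, hi


if __name__ == "__main__":
    base = [{"request_id": i, "ok": True, "latency_s": 10.0 + i} for i in range(20)]
    cand = [{"request_id": i, "ok": i != 7, "latency_s": 5.0 + i} for i in range(20)]
    deltas, info = paired_deltas(base, cand)
    print(info, bootstrap_mean_ci(deltas))
```

Reporting the per-side failure counts next to the paired size lets a reader see how many rows the mask dropped, which is the selection-effect concern the commit message raises.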