docs(failures): consolidated 5-mode failure taxonomy

Consolidates failure modes scattered across V2_DEEP_ANALYSIS, E1_E2_RESULTS, E3_FINDINGS, KVC_EVICTION_GRANULARITY, REAL_ALI_KVC_EXPERIMENT into a single lookup table with five fields per mode: symptom → root cause → trigger → current mitigation → real fix.

Five modes covered:

A.  Mooncake "instance not alive" cascade: E2 80%-failure pathology; admission no-space → seed burst → heartbeat drop → batch abort
B.  Cold-D / overlap-pinning: shared boilerplate hash pins all sessions to a subset of D's; load_floor_bonus is a patch, the real fix is exclusive_overlap redefinition
C.  Evict storm (session-level eviction): release_session frees 38–88K tokens in one shot; fix is BLOCK_LEVEL_EVICTION_DESIGN
C'. Reseed storm (turn-1 concurrent seeds): startup-phase mooncake burst; fix is per-D pending-seed budget, frequency drops after C
D.  Streaming-session correction invariant crash (E3): schedule_batch.py:1646 landmine, hotfixed by 986f351, root-fix is removing the correction path entirely (BLOCK_LEVEL_EVICTION §3.7)

Each mode has a forensic link back to the original experiment doc that surfaced it. §6 adds a diagnostic cheat sheet: "if you see X, look at Y." §7 wires every mode to a roadmap item: Milestone 1 should graduate §1–§4 to "mitigated" and eliminate §5.

INDEX_ZH gets a new §1.6 section linking this and the SGLang patch inventory.

No code change. Reading dependency for anyone debugging a sweep or writing paper §Limitations.
docs(sglang): patch surface inventory + retire-after-refactor list
2026-05-13 00:43:58 +08:00 · 2026-05-13 00:42:22 +08:00 · 2026-05-12 23:58:56 +08:00 · 2026-05-12 23:57:57 +08:00 · 2026-05-12 23:57:13 +08:00 · 2026-05-12 23:55:57 +08:00
35 changed files with 4841 additions and 102 deletions
--- a/AGENTS.md
+++ b/AGENTS.md
@@ -1,9 +1,33 @@
 # AGENTS.md

+## For new collaborators / agents
+
+Before doing anything else, read [docs/INDEX_ZH.md](docs/INDEX_ZH.md). It points to the
+3 must-read docs and a role-based reading path (new SWE, paper reviewer,
+reproducing student, control-plane reader).
+
+Cross-branch progress, weaknesses, and roadmap live in
+[docs/AUDIT_AND_ROADMAP_ZH.md](docs/AUDIT_AND_ROADMAP_ZH.md). It is the single source of truth
+for "what's done, what's broken, what to do next."
+
+Two engineering work items are pre-specced and ready to pick up:
+- block-level eviction refactor — [docs/BLOCK_LEVEL_EVICTION_DESIGN_ZH.md](docs/BLOCK_LEVEL_EVICTION_DESIGN_ZH.md)
+- D→P incremental KV sync — [docs/D_TO_P_SYNC_CONTRACT_ZH.md](docs/D_TO_P_SYNC_CONTRACT_ZH.md)
+
+Evaluation protocol (paper-quality N, paired CI, stratification,
+baselines) is in [docs/EVALUATION_PROTOCOL_ZH.md](docs/EVALUATION_PROTOCOL_ZH.md).
+
 ## Environment

 Use `uv` to manage all python environment. `uv add` to manage deps so that we can `uv sync` to get exactly same runnable environment each time.

+Algorithm-layer unit tests (no GPU, no SGLang):
+
+```bash
+uv sync --group test
+uv run pytest
+```
+
 ## Goal

 Build a minimal prototype on top of **SGLang xPyD** to test whether **session-aware / KV-cache-aware P/D routing** can improve **end-to-end latency** for agentic coding workloads.
--- a/README.md
+++ b/README.md
@@ -6,6 +6,9 @@

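 For a fuller but still concise description, see [docs/PROJECT_OVERVIEW.md](docs/PROJECT_OVERVIEW.md).

+New collaborators: start with [docs/INDEX_ZH.md](docs/INDEX_ZH.md) and pick the 3 must-read docs for your role.
+For an overview of current progress, weaknesses, and the roadmap, see [docs/AUDIT_AND_ROADMAP_ZH.md](docs/AUDIT_AND_ROADMAP_ZH.md).
+
 ## What has been done so far

 - Launch a single-node SGLang P/D stack.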
@@ -99,3 +102,28 @@ uv run agentic-pd-hybrid replay \
 - SGLang changes: `feat(sglang): ...` / `fix(sglang): ...`.
 - The baseline for `third_party/sglang` is a clean SGLang `v0.5.10` snapshot.
 - Do not commit `outputs/`, logs, `__pycache__`, or virtual environments.
+
+## Unit tests (no GPU)
+
+The algorithm layer (policies, Algorithm 1 / Theorem 1) has pure-Python unit tests; running them needs neither a GPU nor SGLang:
+
+```bash
+uv sync --group test
+uv run pytest
+```
+
+See [tests/README.md](tests/README.md) for details.
+
+## Evaluation scripts
+
+After collecting data according to [docs/EVALUATION_PROTOCOL_ZH.md](docs/EVALUATION_PROTOCOL_ZH.md):
+
+```bash
+# M3: bucket by turn_id / input_length / overlap_ratio / append_tokens
+scripts/analysis/stratified.py outputs/<run>/request-metrics.jsonl
+
+# M2: paired-on-same-trial bootstrap 95% CI
+scripts/analysis/paired_compare.py \
+    --baseline outputs/run-dp/request-metrics.jsonl \
+    --candidate outputs/run-kvc/request-metrics.jsonl
+```
--- a/docs/AUDIT_AND_ROADMAP_ZH.md
+++ b/docs/AUDIT_AND_ROADMAP_ZH.md
@@ -0,0 +1,140 @@
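+# Project-wide Audit and Next-phase Roadmap
+
+**Date**: 2026-05-12
+**Branch base**: `improve/audit-and-foundations` (based on `h200-cu130`)
+**Nature**: cross-branch consolidation + roadmap, so collaborators can judge whether each commit is worth merging
+**Audience**: the project's next SWE / research agent, plus pre-reading for paper reviewers
+
+This document gathers, for the five branches `main` / `kvc-debug-journey-v1-to-v4` / `feat/d-to-p-sync` / `h200-cu130` / `kvc-real-ali-iter-v1`, their progress, established contributions, weaknesses, and the roadmap toward SOSP/OSDI plus production quality in one place, for quick alignment.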
+
+---
+
+## 0. TL;DR
+
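+1. **Already established**: the v1 → v2 algorithm (reset-on-success, lexicographic Route, worker-mode Admit RPC) has a formal definition, two theorems, and a measured result on SWE-Bench (50 sess, ts=1) where it beats 4DP CA on 6 of 8 metrics.
+2. **Core weaknesses**: (a) session-level eviction conflicts with KVC's design intent; (b) D→P incremental KV sync does not exist, which is the source of the TTFT p99 long tail; (c) the mooncake "instance not alive" cascade is a fundamental control-plane availability problem; (d) the evaluation still lacks multiple baselines, multiple traces, and strong statistics.
+3. **Work that can advance without a GPU**: algorithm-layer unit tests, formal design docs (block-level evict, D→P sync interface contract), the evaluation protocol, stratified-analysis tooling, and consolidating the doc set. Most of Milestone 1 in this roadmap falls into this category.
+4. **Required for OSDI/SOSP**: execute §S1 (block-level evict) + §S2 (D→P sync POC) + §M2/M3/M4 (paired protocol / full Ali / multiple baselines). Estimated 3–4 months for one or two people.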
+
+---
+
+## 1. Status of the five branches
+
+| Branch | Role | Current state | Key output |
+|---|---|---|---|
+| `main` | "Released" baseline | 18 commits behind origin; 2P4D + worker-admission + seed-min2 reports a 9% mean / 19% p90 improvement vs default PD | The two-tier strategy in `KVCACHE_CENTRIC_PROGRESS_ZH.md`: latency-best vs stable |
+| `kvc-debug-journey-v1-to-v4` | Main working branch | Full v1→v5 algorithm evolution; `KVC_ROUTER_ALGORITHM.md` has three algorithms + two theorems | SWE-Bench 50 sess ts=1: v2 beats 4DP CA on 6/8 metrics; **TTFT p99 is still 3× worse** (1.28s vs 0.43s), diagnosed as the 8.3% reseed slow path |
+| `feat/d-to-p-sync` | Placeholder branch | No code, only `RESEED_SLOW_PATH_AND_D_TO_P_GAP_ZH.md` | Rules out the misconception that "capacity-backup is D→P sync"; lists 4 engineering subtasks |
+| `h200-cu130` | Real hardware + RDMA validation | Ran E1/E2/E3 on 4×H200 + mlx5_60 NDR 400 Gb/s | **E2 80% failure** (mooncake dead-link cascade); **E3 hit the SGLang patch invariant crash at 16 min**; the latest `KVC_EVICTION_GRANULARITY_DESIGN_ZH.md` escalates the root cause to "session-level is the wrong eviction granularity" |
+| `kvc-real-ali-iter-v1` | Real Ali trace validation | 8×H20, 179-req KVC-fit slice + 600-req/15min cold-window | KVC vs DP: KVC-fit p50 −46% ✅; real 15min p90 +19s ❌, 53 errors vs 1 for DP; KVC OOMs at the default mem-fraction and must drop to 0.82 |
+
+---
+
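+## 2. Contributions that already hold up
+
+Measured by whether a reviewer could refute them:
+
+1. **Reset-on-success fixes v1 thrashing**: the v1 failure mode (permanent blacklist → migration livelock) has measurements, a formalization in Algorithm 3, and the starvation-freedom proof of Theorem 1 (`KVC_ROUTER_ALGORITHM.md` §3.4 / §4.1).
+2. **Clear division of labor across three algorithms**: Algorithm 1 (lexicographic Route) + Algorithm 2 (D-autonomous Admit RPC) + Algorithm 3 (Dispatch + reset-on-success). v5 moves admission from a router-side estimate to a D-side RPC (Option D), which is the right layering because it decouples capacity ground truth from the routing score.
+3. **Deterministic hits on the direct-to-D fast path** (Theorem 2): whenever the three conditions residency ⊇ prefix ∧ append ≤ τ_append ∧ cap_ok hold together, the request takes the fast path; the 91.6% hit rate and TTFT p50 = 0.43s on SWE-Bench are structural results.
+4. **Every negative result has a forensic-grade explanation**: mooncake death, cold-D, the reseed slow path, and session-level evict each come with a code location, a timeline, and a counterexample. This is a genuine plus for the paper.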
+
+---
+
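+## 3. Weaknesses a reviewer could reject on
+
+### 3.1 Evaluation methodology
+
+- **M1 Insufficient N**: on SWE-Bench, the baseline for v2 was run at N=3, which confirms the result only categorically; v2 itself has too few runs; bootstrap CIs are missing.
+- **M2 Unequal comparison basis**: with E2 at 80% failure, latency is computed over "successful only" requests and compared against the full E1 set; the paper must use paired-on-same-trial comparison.
+- **M3 Trace biased toward KVC**: the KVC-fit slice was filtered for small-append + high overlap; the diluted result on full Ali (turn2+ ratio 26%, very many single-turn sessions) has never been run.
+- **M4 Baselines not strong enough**: none of vLLM + prefix-cache, DistServe, SplitWise, or Mooncake-Master is included.
+- **M5 Single trace family**: no ShareGPT/Mooncake trace, no long-context tool-use agent benchmark, no synthetic adversarial trace.
+- **M6 Hardware coverage**: single-node ≤ 8 GPU only; no cross-node runs and no measurements on clusters of ≥ 32 GPUs.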
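
M1 and M2 above ask for bootstrap CIs over paired-on-same-trial comparisons. The following is a minimal sketch of that statistic, not the repository's `scripts/analysis/paired_compare.py`; the function name `paired_bootstrap_ci` and the sample latencies are made up for illustration.

```python
import random
import statistics

def paired_bootstrap_ci(baseline, candidate, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile bootstrap CI for the mean per-request latency difference.

    baseline[i] and candidate[i] must be the same request on the same trial.
    Requests that failed in either run must be handled before pairing,
    not silently dropped from one side only.
    """
    if len(baseline) != len(candidate) or not baseline:
        raise ValueError("inputs must be non-empty and paired one-to-one")
    diffs = [c - b for b, c in zip(baseline, candidate)]
    rng = random.Random(seed)
    means = []
    for _ in range(n_resamples):
        # Resample whole pairs with replacement so the pairing is preserved.
        sample = [diffs[rng.randrange(len(diffs))] for _ in diffs]
        means.append(statistics.fmean(sample))
    means.sort()
    lo = means[int((alpha / 2) * n_resamples)]
    hi = means[int((1 - alpha / 2) * n_resamples) - 1]
    return statistics.fmean(diffs), lo, hi

# Illustrative latencies in seconds (hypothetical, not measured data).
dp = [1.20, 0.95, 1.40, 1.10, 1.30, 1.05]
kvc = [0.90, 0.80, 1.10, 1.00, 0.95, 0.85]
mean_diff, lo, hi = paired_bootstrap_ci(dp, kvc)
print(f"mean diff {mean_diff:+.3f}s, 95% CI [{lo:+.3f}, {hi:+.3f}]")
```

A CI that excludes zero on the paired differences is the claim M2 asks for; comparing "successful only" latencies from one run against the full set of another does not produce it.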
+
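+### 3.2 System design
+
+- **S1 Session-level eviction conflicts with KVC's design intent**: 90 evictions, 67K tokens freed per eviction on average, and 25 of 50 sessions forced into a 50–90K re-prefill. `KVC_EVICTION_GRANULARITY_DESIGN_ZH.md` identifies this, but the fix is not implemented.
+- **S2 No D→P incremental sync**: 50% of the TTFT p99 long tail comes from P re-prefill. `capacity-backup` is a static seed-time snapshot, not D→P sync. The fix requires changing the single-producer assumption of the SGLang radix.
+- **S3 Mooncake cascading death**: admission no-space → continuous seed retries → heartbeat loss → SGLang aborts the whole batch (E2: 1054/1285 failed). A fundamental control-plane availability bug.
+- **S4 Admission RPC blocks synchronously**: no backoff / hedging / staleness budget. Any GIL jitter in the D scheduler stalls the router.
+- **S5 Cold-D / overlap-pinning**: the boilerplate 24-token block hash makes every session overlap with D0/D1 → D2/D3 get 0 bindings. The load-floor bonus is a patch, not a first-principles fix.
+- **S6 The local SGLang patch is already 785 lines across 10 files**, including hot-path invariant changes such as `schedule_batch.py:1646`; the E3 crash is a latent landmine introduced by the vendored patch.
+- **S7 Failure recovery / idempotency**: under chunked-prefill retry, streaming-session idempotency relies on `SessionSlot.restore_to_req`; there is no recovery protocol for worker crash, mooncake reconnect, or partial KV corruption.
+- **S8 No multi-tenant / SLO-aware scheduling**: the objective implicitly sets w_ttft=w_lat=1. Production needs separate tiers for interactive / batch / background.
+- **S9 Topology fixed at boot**: the P/D ratio is a launch parameter. Production load needs elasticity.
+- **S10 Backpressure pause hint is not closed-loop**: it fired 20 times, but with no-BP nothing responded; the control plane is not wired up.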
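
To make S5 concrete, here is a minimal sketch of an overlap score that ignores block hashes shared across sessions. It assumes a hash counts as boilerplate as soon as more than one session holds it; the function `exclusive_overlap` and the toy hashes are hypothetical and not taken from the router code.

```python
from collections import Counter

def exclusive_overlap(request_hashes, resident_on_d, sessions_per_hash):
    """Count request block hashes resident on D that belong to one session only."""
    return sum(
        1
        for h in request_hashes
        if h in resident_on_d and sessions_per_hash[h] <= 1
    )

# Three sessions share the boilerplate block "sys"; each has one private block.
sessions = {"s0": ["sys", "a1"], "s1": ["sys", "b1"], "s2": ["sys", "c1"]}
sessions_per_hash = Counter(h for hs in sessions.values() for h in set(hs))
resident = {"D0": {"sys", "a1"}, "D1": {"sys", "b1"}, "D2": set()}

# Scores for the not-yet-placed session s2 under both definitions.
plain = {d: len(set(sessions["s2"]) & r) for d, r in resident.items()}
exclusive = {
    d: exclusive_overlap(sessions["s2"], r, sessions_per_hash)
    for d, r in resident.items()
}
print(plain)      # {'D0': 1, 'D1': 1, 'D2': 0}
print(exclusive)  # {'D0': 0, 'D1': 0, 'D2': 0}
```

With the plain count, the shared block alone pulls `s2` toward D0/D1; with the exclusive count all three Ds tie, so a load-based tie-break can place the session on the idle D2.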
+
+### 3.3 Engineering infrastructure
+
+- **Observability**: metrics are jsonl + offline `recompute_summary.py`; production needs Prometheus + Grafana + OpenTelemetry traces.
+- **Formal testing**: the algorithm and state layers lack unit tests; the idempotency of `SessionSlot.restore_to_req` is an invariant flagged by the author.
+- **Chaos injection**: control-plane failures such as mooncake death need a fault injection harness.
+- **Code size**: `replay.py` is 2460 lines and bundles orchestration / policy hooks / control plane / metrics. Fine for a prototype, weak as a paper-quality artifact.
+
+---
+
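+## 4. Roadmap
+
+Three milestones. Each can be delivered independently (as a paper section or an engineering release).
+
+### Milestone 1: Defensible SOSP/OSDI submission (3–4 months, one or two people)
+
+**Goal**: consolidate the existing algorithm + failure diagnoses into a draft that can survive the first PC round.
+
+1. **Execute §S1 (block-level eviction refactor)**: see `docs/BLOCK_LEVEL_EVICTION_DESIGN_ZH.md`.
+   - Streaming-session decode output is incrementally committed into the radix tree via `cache_finished_req` at each turn finish.
+   - `SessionSlot` is reduced to pure metadata (holding only `last_node` + lock_ref).
+   - `release_session` becomes `dec_lock_ref` + slot deletion; eviction is left entirely to the SGLang radix LRU.
+   - Expected: eviction granularity drops from 67K tokens to 24 tokens per eviction; reseed frequency drops by an order of magnitude.
+2. **Execute §S2 (D→P incremental sync POC)**: see `docs/D_TO_P_SYNC_CONTRACT_ZH.md`.
+   - Microbench to show: after a D append completes, KV blocks are pushed asynchronously back to the P-side radix → the next reseed skips re-prefill.
+3. **Fix §S3 (mooncake death cascade)**: admission RPC backoff + jitter; per-D pending-seed budget; decouple the mooncake heartbeat from admission.
+4. **First-principles fix for §S5**: redefine `overlap` as "the number of prefix hashes the session holds exclusively on D" (dropping the contribution of shared boilerplate hashes), so scores spread out naturally.
+5. **Redo the evaluation**: see `docs/EVALUATION_PROTOCOL_ZH.md`. N≥3 + bootstrap CI + multiple baselines + full Ali + stratified reporting.
+6. **Extend the formalization**: add Theorem 3 (upper bound on re-prefill cost under block-level evict) + Theorem 4 (relation between the D→P sync staleness budget β and reseed cost).
+7. **Artifact**: one-command script + Dockerfile + reproduction of the core tables/figures in one hour on 4×A100.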
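
Item 3 names "admission RPC backoff + jitter" without spelling it out. A minimal sketch of full-jitter exponential backoff follows; `backoff_delays`, `admit_with_backoff`, and all default values are hypothetical, and a real router would await `asyncio.sleep` instead of the injected `sleep` callable.

```python
import random

def backoff_delays(base=0.05, cap=2.0, retries=5, rng=None):
    """Yield `retries` full-jitter delays: uniform in [0, min(cap, base * 2**k)]."""
    rng = rng or random.Random()
    for k in range(retries):
        yield rng.uniform(0.0, min(cap, base * 2 ** k))

def admit_with_backoff(try_admit, sleep, **backoff_kwargs):
    """Try once, then retry after each backoff delay; False once retries run out."""
    if try_admit():
        return True
    for delay in backoff_delays(**backoff_kwargs):
        sleep(delay)
        if try_admit():
            return True
    return False

# Usage with a fake D that reports no-space three times, then admits.
outcomes = iter([False, False, False, True])
slept = []
admitted = admit_with_backoff(
    lambda: next(outcomes), slept.append, rng=random.Random(0)
)
print(admitted, len(slept))  # True 3
```

Randomizing the whole delay, rather than adding a small jitter to a fixed schedule, is what keeps many sessions rejected at the same moment from retrying in lockstep, which is the seed-burst pattern described in §S3.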
+
+### Milestone 2: Production-quality serving substrate (a further 3–6 months, 2–3 people)
+
+8. **Control-plane layering**: split `replay.py` into `router/` / `control/` / `obs/` / `orch/`.
+9. **Elastic topology**: an autoscaling controller with inputs (P queue, D transfer queue, D KV usage).
+10. **Multi-tenant + SLO classes**: three tiers (interactive / batch / background), each with an independent admission budget.
+11. **Failure injection harness**: mooncake link flap / D OOM kill / router GC pause / partial KV corruption; each case has a recovery SLA.
+12. **Persistent KV tier**: CPU DRAM + NVMe + RDMA-attached pool; evict becomes demote.
+13. **Cross-node + heterogeneous**: mixed H100 + H200 + L40S with topology-aware routing.
+14. **Observability**: per-request OpenTelemetry + per-D Prometheus + a main Grafana dashboard.
+
+### Milestone 3: Research increments that could actually reach OSDI'27 (6–12 months)
+
+15. **Learning-based admission / migration**: multi-armed bandit / RL to control τ_reject and K; train a session-aliveness predictor from traces.
+16. **Cross-router residency consensus**: lightweight gossip to share `Σ.resident[d]`.
+17. **Provable competitive ratio**: under an oracle KV-residency model, prove that the ratio of KVC expected TTFT to the offline optimum is bounded.
+18. **Distributed prefix tree**: map a logical prefix onto physical replicas on multiple Ds, supporting multi-tenant prefix sharing (system prompt / tool schema).
+19. **Energy-aware variant**: add GPU SM utilization + PCIe/RDMA energy to the objective.
+20. **End-to-end agent serving framing**: move from request-level latency up to agent task completion time (a coding agent task runs 30+ turns).
+
+---
+
+## 5. Work that can advance without a GPU
+
+Ordered by ROI:
+
+- [x] This roadmap (`AUDIT_AND_ROADMAP_ZH.md`).
+- [x] Collaborator entry point (`docs/INDEX_ZH.md`).
+- [x] Concrete block-level eviction design (`docs/BLOCK_LEVEL_EVICTION_DESIGN_ZH.md`).
+- [x] D→P sync interface contract (`docs/D_TO_P_SYNC_CONTRACT_ZH.md`).
+- [x] Evaluation protocol (`docs/EVALUATION_PROTOCOL_ZH.md`).
+- [x] Pure-function score extraction for `KvAwarePolicy` + unit tests (Algorithm 1).
+- [x] Starvation-freedom property test (Theorem 1).
+- [x] Stratified analysis script (bucketed along turn-index / append-size / overlap).
+- [x] Paired-comparison protocol helper.
+- [ ] Reproducible mock harness for mooncake death (runnable without a GPU).
+- [ ] Classified inventory of the SGLang patch surface (each patch tagged "required" / "experimental" / "retirable").
+- [ ] Failure-mode taxonomy doc (cold-D, overlap-pin, mooncake death, reseed storm, evict storm).
+
+---
+
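+## 6. One-sentence conclusion
+
+> This project already has enough material for a SOSP/OSDI workshop paper or poster; to reach the main track it needs to make §S1 (block-level evict) and §S2 (D→P sync) real, fill in §M3 (full Ali) and §M4 (two strong baselines), and land the control-plane fix for §S3 (mooncake cascading death) in a reproducible artifact. If only one thing can be done, do the block-level eviction refactor first: it both fixes "reseed too frequent" and is the prerequisite for extending the P-side radix to multiple producers.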
--- a/docs/BLOCK_LEVEL_EVICTION_DESIGN_ZH.md
+++ b/docs/BLOCK_LEVEL_EVICTION_DESIGN_ZH.md
@@ -0,0 +1,309 @@
+# Block-level Eviction Refactor: Design Document
+
+**Date**: 2026-05-12
+**Prerequisite**: [KVC_EVICTION_GRANULARITY_DESIGN_ZH.md](KVC_EVICTION_GRANULARITY_DESIGN_ZH.md) (architecture-level manifesto)
+**Nature**: implementation-level design + API draft + test plan, for the next collaborator to code against directly
+**Status**: draft, not implemented. All code is quoted from `third_party/sglang/python/sglang/srt/mem_cache/session_aware_cache.py @ origin/h200-cu130`
+
+---
+
+## 0. TL;DR
+
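+Change the current `SessionAwareCache` semantics, where a streaming session's **entire KV is freed in one shot**, to:
+
+1. Streaming-session decode output is **incrementally committed into the radix tree** at turn finish.
+2. `SessionSlot` is reduced to **pure metadata** (holding only `last_node` + lock_ref state) and no longer owns a KV range exclusively.
+3. `release_session` only does dec_lock_ref + slot deletion, **letting the standard SGLang radix LRU reclaim at block granularity**.
+
+Expected benefit: eviction granularity drops from ~67K tokens per eviction to ~24 tokens (page_size tokens), and reseed frequency drops by an order of magnitude; it also prepares the P-side radix tree to accept externally fed data (paving the way for [D_TO_P_SYNC_CONTRACT_ZH.md](D_TO_P_SYNC_CONTRACT_ZH.md)).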
+
+---
+
+## 1. Walkthrough of the current code
+
+### 1.1 Key files and functions
+
+`third_party/sglang/python/sglang/srt/mem_cache/session_aware_cache.py`
+
+| Function / field | Current semantics |
+|---|---|
+| `SessionSlot.req_pool_idx` | req_pool slot owned exclusively by the streaming session |
+| `SessionSlot.kv_committed_len` | KV length committed when the previous turn finished (the part counted in cache_protected_len is already in the radix) |
+| `SessionSlot.kv_allocated_len` | KV length currently allocated but **not in the radix** (the "session-exclusive tail") |
+| `SessionSlot.cache_protected_len` | protected boundary set when the first turn was committed to the radix |
+| `match_prefix(streaming req)` | slot hit → returns `req_to_token[req_pool_idx, :prefix_len]`, bypassing the radix |
+| `cache_unfinished_req(streaming req)` | subsequent turns → **skips inner entirely** (nothing enters the radix) |
+| `cache_finished_req(streaming req)` | calls `slot.save_from_req`, **does not call inner.cache_finished_req** |
+| `release_session(sid)` | `dec_lock_ref(slot.last_node)` + `free(req_to_token[req_pool_idx, cache_protected_len:kv_allocated_len])` + reclaims the req_pool slot |
+
+### 1.2 Why the current behavior is wrong (restated)
+
+`[cache_protected_len, kv_allocated_len)` holds all decode output accumulated after the first turn entered the radix, plus the extends of later turns. Measured on Inferact / SWE-Bench:
+
+- `cache_protected_len` ≈ first-turn boilerplate, ~12K
+- `kv_allocated_len` accumulates to 50–100K
+- each `release_session` frees 38–88K in one shot; this part **never entered the radix**, so it cannot benefit from leaf-by-leaf gradual eviction
+
+→ After a session is evicted it must re-prefill the full length from the client's original prompt and redo the full-length mooncake transfer, which is equivalent to naive PD-disagg (see manifesto §1).
+
+---
+
+## 2. Target behavior
+
+| Scenario | Current | Target |
+|---|---|---|
+| Session has accumulated 50K KV and D is full | `release_session` frees 38K in one shot | radix LRU evicts starting from the oldest leaf, ~24 tokens at a time |
+| Evicted session returns | must reseed 50K | re-prefill only the evicted leaves (typically ≤ 5K) |
+| Evicted session TTFT | 50–90K reseed ≈ 3–7s | 5K append-prefill ≈ 200ms |
+| Session not evicted | turns within the session are append-only | still append-only (unchanged) |
+| Direct-to-D fast path hit rate | 91.6% (SWE-Bench) / 38% (E3 Inferact) | ≥ 85% even under saturation |
+
+---
+
+## 3. Design
+
+### 3.1 Slimming down the SessionSlot fields
+
+**After refactor**:
+
+```python
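+import time
+from dataclasses import dataclass, field
+from typing import Any, Optional
+
+# _VirtualNode comes from the existing session_aware_cache.py.
+@dataclass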
+class SessionSlot:
+    virtual_node: _VirtualNode = field(default_factory=_VirtualNode)
+
+    # Pointer into the radix tree — the deepest node owned by this session's
+    # committed prefix. Held under inc_lock_ref so radix LRU never evicts this
+    # *active* leaf out from under a turn-in-progress. Released by
+    # release_session.
+    last_node: Any = None
+    swa_uuid_for_lock: Optional[str] = None
+
+    # Bookkeeping fields (no longer authoritative ownership of KV indices).
+    last_access_time: float = field(default_factory=time.monotonic)
+
+    # Mamba state stays slot-owned (mamba doesn't fit the radix model).
+    mamba_pool_idx: Any = None
+    mamba_ping_pong_track_buffer: Any = None
+    mamba_next_track_idx: Any = None
+    mamba_last_track_seqlen: Any = None
+    mamba_branching_seqlen: Any = None
+```
+
+**Removed**: `req_pool_idx`, `kv_committed_len`, `kv_allocated_len`, `cache_protected_len`, `swa_evicted_seqlen`. The ground truth for these fields is now maintained jointly by the radix tree + req_to_token_pool.
+
+### 3.2 Reworking `cache_finished_req`
+
+**After refactor**:
+
+```python
+def cache_finished_req(self, req: Req, is_insert: bool = True, **kwargs):
+    if not _is_streaming(req):
+        return self.inner.cache_finished_req(req, is_insert=is_insert, **kwargs)
+
+    session_id = req.session.session_id
+    slot = self.slots.setdefault(session_id, SessionSlot())
+
+    # KEY CHANGE: always delegate to inner — this inserts the new tokens
+    # (kv_committed_len .. fill_ids end) as radix-tree blocks. Subsequent
+    # match_prefix calls for this session will hit the radix tree directly.
+    result = self.inner.cache_finished_req(req, is_insert=is_insert, **kwargs)
+
+    # Update slot bookkeeping only (no KV ownership).
+    slot.last_node = req.last_node
+    slot.swa_uuid_for_lock = req.swa_uuid_for_lock
+    slot.last_access_time = time.monotonic()
+
+    # Mamba state still goes through slot.
+    slot.mamba_pool_idx = req.mamba_pool_idx
+    ...
+    return result
+```
+
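+**Invariants**:
+- `inner.cache_finished_req` inserts the page_size-aligned KV in the range `[kv_committed_len_old, kv_committed_len_new)` into the radix. This semantics comes from the standard SGLang implementation; inner needs no change.
+- `slot.last_node` now points to **the tail node of the session's committed prefix** and advances after every turn.
+- `inc_lock_ref(new_last_node)` followed by `dec_lock_ref(old_last_node)` must run at each turn switch, in that order (see the lock_ref ordering risk in §6.2).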
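
The lock handoff in the last invariant can be sketched on a toy tree. This assumes, as a simplification, that locking a node also locks every ancestor on its path; `Node`, `advance_last_node`, and the dict-based slot are hypothetical stand-ins, not SGLang classes.

```python
class Node:
    """Toy radix node: only the fields needed to show lock accounting."""

    def __init__(self, parent=None):
        self.parent = parent
        self.lock_ref = 0

def inc_lock_ref(node):
    # Lock the node and every ancestor, so the whole prefix path is protected.
    while node is not None:
        node.lock_ref += 1
        node = node.parent

def dec_lock_ref(node):
    while node is not None:
        assert node.lock_ref > 0, "dec_lock_ref without a matching inc_lock_ref"
        node.lock_ref -= 1
        node = node.parent

def advance_last_node(slot, new_node):
    """Move the session's lock from slot['last_node'] to new_node."""
    old_node = slot.get("last_node")
    if new_node is old_node:
        return
    inc_lock_ref(new_node)        # lock the new tail first ...
    if old_node is not None:
        dec_lock_ref(old_node)    # ... and only then release the old tail
    slot["last_node"] = new_node

root = Node()
turn0 = Node(parent=root)
turn1 = Node(parent=turn0)
slot = {}
advance_last_node(slot, turn0)
advance_last_node(slot, turn1)
print(root.lock_ref, turn0.lock_ref, turn1.lock_ref)  # 1 1 1
```

Because the increment happens before the decrement, `turn0` never reaches `lock_ref == 0` during the handoff, so an LRU pass running in between cannot reclaim the prefix that was just committed.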
+
+### 3.3 Reworking `cache_unfinished_req`
+
+Subsequent turns of a streaming session **no longer skip inner**. Reason: `match_prefix` now goes through the radix, so the intermediate chunked-prefill state must also be maintained by inner:
+
+```python
+def cache_unfinished_req(self, req: Req, **kwargs):
+    if _is_streaming(req) and kwargs.get("chunked", False):
+        # Chunked prefill: forward to inner so the per-chunk extend gets
+        # tracked in the radix LRU access timestamps.
+        ...
+    self.inner.cache_unfinished_req(req, **kwargs)
+```
+
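+The chunked handling must keep the logic that rebuilds `prefix_indices` (see lines 215–225 of the current implementation), but the call to `inner.cache_unfinished_req` must not be skipped.
+
+### 3.4 Reworking `match_prefix`
+
+Reduced to a **pure forward to inner**, since SessionSlot no longer holds KV pointers: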
+
+```python
+def match_prefix(self, params: MatchPrefixParams) -> MatchResult:
+    # No more slot-fast-path. Streaming sessions reuse KV via radix tree
+    # match like every other request.
+    return self.inner.match_prefix(params)
+```
+
+The "committed prefix length of this session" that callers need is now derived from `inner.match_prefix(...).device_indices.shape[0]`.
+
+### 3.5 Reworking `release_session`
+
+**After refactor**:
+
+```python
+def release_session(self, session_id: str) -> int:
+    slot = self.slots.pop(session_id, None)
+    if slot is None:
+        return 0
+
+    # Just release our radix lock — radix LRU can now reclaim our prefix
+    # leaves at its own pace. NO direct token_to_kv_pool free.
+    if slot.last_node is not None:
+        if slot.swa_uuid_for_lock is not None:
+            self.inner.dec_lock_ref(
+                slot.last_node,
+                DecLockRefParams(swa_uuid_for_lock=slot.swa_uuid_for_lock),
+            )
+        else:
+            self.inner.dec_lock_ref(slot.last_node)
+
+    # Mamba state still needs explicit cleanup if present.
+    if slot.mamba_pool_idx is not None:
+        ...
+
+    return 0  # "freed_tokens" no longer meaningful; radix LRU shed lazily
+```
+
+### 3.6 Reworking `get_session_status` / `list_session_statuses`
+
+The ground truth for `resident_tokens` now comes from the radix tree. inner needs to expose a helper:
+
+```python
+# In BasePrefixCache / RadixCache:
+def tokens_under(self, node) -> int:
+    """Count tokens in the path from root to `node` (inclusive)."""
+    ...
+
+# In SessionAwareCache:
+def get_session_status(self, session_id: str) -> Optional[Dict[str, Any]]:
+    slot = self.slots.get(session_id)
+    if slot is None:
+        return None
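+    # Compare against None explicitly: a node object's truthiness is not a presence check.
+    resident_tokens = self.inner.tokens_under(slot.last_node) if slot.last_node is not None else 0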
+    return {
+        "session_id": session_id,
+        "resident": resident_tokens > 0,
+        "resident_tokens": int(resident_tokens),
+        "last_access_time": float(slot.last_access_time),
+    }
+```
+
+The capacity check in `admit_direct_append` switches to the radix ground truth for `resident_tokens` (removing the possibility of `kv_committed_len` / `kv_allocated_len` disagreeing).
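
One possible shape for the `tokens_under` helper, sketched on a toy node type. It assumes each node stores the token ids of its incoming edge in `key` and a `parent` pointer; the real radix node may differ, so treat the field names as assumptions.

```python
class Node:
    """Toy radix node: `key` holds the token ids on the edge into this node."""

    def __init__(self, key=(), parent=None):
        self.key = tuple(key)
        self.parent = parent

def tokens_under(node):
    """Count tokens on the path from the root to `node`, inclusive."""
    total = 0
    while node is not None:
        total += len(node.key)
        node = node.parent
    return total

root = Node()                          # the root carries no tokens
first_turn = Node(range(24), root)     # one 24-token page
second_turn = Node(range(48), first_turn)
print(tokens_under(second_turn))       # 72
```

The walk costs one step per node on the path, so calling it for every session in `list_session_statuses` stays cheap as long as paths are short relative to token counts.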
+
+### 3.7 Companion changes in the SGLang scheduling path
+
+See `schedule_batch.py:1572-1646`. The current streaming-session correction (introduced by commits b8e6f13 / 986f351) is built on SessionSlot owning a separate KV range. After the block-level refactor this correction path **no longer needs to exist**: the req's fill_ids / prefix_indices come out consistent directly from the inner radix `match_prefix`.
+
+**To remove**:
+- The `actual_extend_len = max(0, len(fill_ids) - len(prefix_indices))` correction block at `schedule_batch.py:1572-1585`.
+- The `assert seq_len - pre_len == req.extend_input_len` at `schedule_batch.py:1646` (after the refactor the invariant holds by construction).
+- The latent landmine triggered by E3 ([E3_FINDINGS_ZH.md](E3_FINDINGS_ZH.md) §2) disappears with it.
+
+---
+
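+## 4. Invariants (must be covered by the PR's self-tests)
+
+| Inv | Statement |
+|---|---|
+| I1 | After `release_session(sid)`, the `match_prefix` behavior of the next request from the same session depends only on what is resident in the radix tree, not on the `slots` dict. |
+| I2 | After the `cache_finished_req` call for any (session_id, turn_id), the radix tree must contain a root→leaf path covering all committed tokens of that turn (i.e. `tokens_under(slot.last_node)` is monotonically non-decreasing). |
+| I3 | `restore_to_req` must be **idempotent**: under chunked-prefill retry it may be called several times on the same req and the final req state is equivalent. The current implementation achieves this by not clearing slot fields → after the refactor it is guaranteed by the pure-function nature of radix `match_prefix`. |
+| I4 | Requests without a streaming session (`req.session is None`) behave **unchanged**: every path short-circuits to inner. |
+| I5 | After any turn ends, each `inc_lock_ref` on `slot.last_node` must have a matching `dec_lock_ref`, and `release_session` is the final release point. |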
+
+---
+
+## 5. Test plan (runnable without a GPU)
+
+### 5.1 Unit tests (mock inner cache)
+
+Write a `MockRadixCache(BasePrefixCache)` that records the sequence of all `cache_finished_req / cache_unfinished_req / match_prefix / evict / dec_lock_ref` calls. Then:
+
+| Test | Assertion |
+|---|---|
+| `test_release_session_no_direct_free` | after `release_session`, the mock saw **no** direct `free(kv_indices)` call, only `dec_lock_ref` |
+| `test_subsequent_turn_inserts_radix` | simulate three `cache_finished_req` calls for turns 0 → 1 → 2 and assert each triggers `inner.cache_finished_req` |
+| `test_match_prefix_uses_inner` | both streaming and non-streaming go only through `inner.match_prefix` |
+| `test_restore_idempotent` | simulate a chunked-prefill retry; two consecutive `match_prefix` calls return identical `device_indices` |
+| `test_eviction_under_pressure_is_block_level` | inject a "pool full, must evict 24 tokens" state; assert `release_session` is not triggered and inner's LRU advances one step |
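
As an illustration of the first row, a minimal call-recording mock and the matching test might look as follows. `SessionCacheSketch` is a hypothetical stand-in that reproduces only the refactored `release_session` from §3.5 (without the SWA and mamba branches); the mock here does not subclass the real `BasePrefixCache`.

```python
class MockRadixCache:
    """Records every call so tests can assert on the call sequence."""

    def __init__(self):
        self.calls = []

    def dec_lock_ref(self, node, *args):
        self.calls.append(("dec_lock_ref", node))

    def free(self, kv_indices):
        self.calls.append(("free", kv_indices))

class SessionCacheSketch:
    """Stand-in for the refactored SessionAwareCache.release_session only."""

    def __init__(self, inner):
        self.inner = inner
        self.slots = {}

    def release_session(self, session_id):
        slot = self.slots.pop(session_id, None)
        if slot is None:
            return 0
        if slot["last_node"] is not None:
            self.inner.dec_lock_ref(slot["last_node"])
        return 0  # no direct free; the radix LRU reclaims lazily

def test_release_session_no_direct_free():
    inner = MockRadixCache()
    cache = SessionCacheSketch(inner)
    cache.slots["s0"] = {"last_node": "leaf-7"}
    assert cache.release_session("s0") == 0
    names = [name for name, _ in inner.calls]
    assert names == ["dec_lock_ref"]
    assert "free" not in names
    assert cache.release_session("s0") == 0  # second release is a no-op
    assert len(inner.calls) == 1

test_release_session_no_direct_free()
```

Asserting on the recorded call names rather than on pool state keeps the test independent of any KV allocator, which is what makes it runnable without a GPU.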
+
+### 5.2 Property-based tests
+
+```python
+from hypothesis import given
+from hypothesis.strategies import integers, lists
+
+@given(turns=lists(integers(min_value=24, max_value=2048), min_size=1, max_size=50))
+def test_committed_tokens_monotone(turns):
+    """tokens_under(slot.last_node) is monotonically non-decreasing across turns."""
+    ...
+```
+
+### 5.3 Integration smoke (needs a GPU, but lives in the sweep script)
+
+Run `sweep_e2_kvc_v2_rdma.sh` on the same trace and configuration and compare:
+- total eviction count (expected 90 → < 10)
+- mean tokens per eviction (expected 67K → < 500)
+- TTFT p99 (expected 1.28s → < 0.7s)
+- direct-to-D hit rate (expected ≥ 85%)
+
+---
+
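+## 6. Effort and risks
+
+### 6.1 Effort
+
+| Work item | Estimate | Risk |
+|---|---|---|
+| §3.1–§3.6 SessionAwareCache rework | 2–3 days | Medium: requires familiarity with the radix-internal lock_ref / evict protocol |
+| §3.7 schedule_batch cleanup | 0.5 day | Low: it only deletes code |
+| §4 invariant unit tests | 2 days | Low |
+| §5.3 GPU smoke + data comparison | 2 days | Medium: mooncake may still trigger the E2 cascading death, so run it together with the §S3 fix |
+| **Total** | **6.5–7.5 days (~1.5 weeks)** | |
+
+### 6.2 Key risks
+
+1. **Compatibility of `inner.cache_finished_req` with streaming-session reqs**: the standard SGLang radix currently assumes that at cache_finished_req time the req has "fully completed prefill+decode". A streaming-session req still leaves an "unfinished conversation" at the end of each turn, so inner must not treat decode-only tokens as a discardable tail on insert. The implementation of `radix_cache.py:cache_finished_req` needs an audit.
+
+2. **lock_ref ordering**: the `match_prefix` that starts turn N+1 does inc_lock_ref(new_node) and the end of turn N does dec_lock_ref(old_node); if the order is reversed, the LRU can wrongly evict the just-committed leaf under concurrency. Suggested assertion: `inc_lock_ref` must arrive before `dec_lock_ref`.
+
+3. **Chunked-prefill retry**: see I3. The current `restore_to_req` does not clear slot fields precisely to support this retry. After the refactor, confirm that the inner radix `match_prefix` is also idempotent under retry (a standard radix tree is, but write a test that pins this property down).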
+
+---
+
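+## 7. Relation to the D→P sync work
+
+Block-level evict is a **prerequisite** for [D_TO_P_SYNC_CONTRACT_ZH.md](D_TO_P_SYNC_CONTRACT_ZH.md):
+
+- D→P sync needs the P-side radix tree to **accept externally fed KV blocks**.
+- The P-side radix currently assumes a single producer (this worker's model output).
+- Once the block-level refactor is done, streaming-session KV already follows the standard radix path, so letting the radix tree accept an additional "externally fed" producer only extends the insert API rather than inventing a new storage path.
+
+→ The two can be done in sequence: block-level evict first, then D→P sync.
+
+---
+
+## 8. Minimal actions for the agent taking over
+
+1. Fork a `feat/block-level-evict` branch (from `improve/audit-and-foundations` or `h200-cu130`).
+2. Implement §3.1–§3.6.
+3. Write the §5.1 + §5.2 unit tests.
+4. Run the §5.3 smoke on 8×H100 / H200 and compare eviction frequency and TTFT p99.
+5. If risk 1 in §6.2 materializes, look into SGLang `radix_cache.py` to see whether streaming-session reqs need an `is_session_active=True` flag that prevents dropping the decode tail.
+
+---
+
+**Core takeaway**: keep the session as the lifecycle boundary, but **do not** let it be the eviction boundary (hand that to the radix LRU). This refactor removes two blockers at once: "reseed too frequent" and "the P-side radix cannot be fed externally".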
--- a/docs/D_TO_P_SYNC_CONTRACT_ZH.md
+++ b/docs/D_TO_P_SYNC_CONTRACT_ZH.md
@@ -0,0 +1,247 @@
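+# D→P Incremental KV Sync: Interface Contract and Rollout Plan
+
+**Date**: 2026-05-12
+**Prerequisites**: [RESEED_SLOW_PATH_AND_D_TO_P_GAP_ZH.md](RESEED_SLOW_PATH_AND_D_TO_P_GAP_ZH.md) (gap analysis) + [BLOCK_LEVEL_EVICTION_DESIGN_ZH.md](BLOCK_LEVEL_EVICTION_DESIGN_ZH.md) (precondition)
+**Nature**: cross-layer interface contract + formalization of the staleness budget + phased rollout
+**Status**: draft. The `feat/d-to-p-sync` branch is currently empty; this is the design doc that branch should land first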
+
+---
+
+## 0. TL;DR
+
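+50% of the reseed slow path is spent in P re-prefill, so **fixing the transfer segment (enabling RDMA) solves only half the problem**. The only way to remove the long tail entirely is to have the P-side backup incrementally keep up with appends on D:
+
+> D finishes a turn on the direct-to-D path → asynchronously pushes the newly committed KV blocks back to the P-side radix → on the next reseed the P-side radix hits the full prefix, so no re-prefill is needed, only a single P→D transfer.
+
+This document gives the interface contract for three layers (mooncake / SGLang / agentic-pd-hybrid), a formal definition of the **staleness budget β**, and a four-phase rollout plan, so this work can proceed decoupled from block-level eviction.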
+
+---
+
+## 1. Staleness Budget β: Formal Definition
+
+Let `L_D(s, t)` be the committed prefix length of session `s` on D (its instantaneous value at time `t`), and `L_P(s, t)` the backup prefix length of the same session on P.
+
+```
+staleness(s, t) := L_D(s, t) - L_P(s, t)   ≥ 0
+```
+
+The **staleness budget β** is the upper bound the system commits to maintaining:
+
+```
+∀ s, ∀ t :  staleness(s, t) ≤ β
+```
+
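+Intuition: the smaller β is → the more likely a reseed hits the P-side backup → the reseed reduces to a single P→D transfer + a re-prefill of ≤ β tokens.
+
+- **β = 0**: fully synchronous (D blocks on a P ack for every committed block). High latency cost; not recommended.
+- **β = ∞**: the current state (the P-side backup stays a static seed-time snapshot forever).
+- **β = one page (24 tokens)**: single-block sync. The theoretically optimal granularity, but every append on D triggers a D→P RPC.
+- **β = O(append_len) (typically 1K–4K)**: batched sync. Recommended starting point; aggregate the decode output of one turn and push it as a batch.
+- **β = O(turn_size) (typically ~50K)**: coarse sync. It defeats the reseed bypass and only reduces transfer. Not acceptable.
+
+→ For rollout, the recommended setting is β = `max(page_size, min(committed_in_turn, β_max))`, with `β_max` defaulting to 4096.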
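
The recommended β formula can be written as a small helper. This is a sketch: the function name `sync_batch_tokens` is hypothetical, and the defaults simply mirror the page size of 24 and `β_max` of 4096 quoted above.

```python
def sync_batch_tokens(committed_in_turn: int, page_size: int = 24, beta_max: int = 4096) -> int:
    """β = max(page_size, min(committed_in_turn, β_max))."""
    if page_size <= 0 or beta_max < page_size:
        raise ValueError("need 0 < page_size <= beta_max")
    return max(page_size, min(committed_in_turn, beta_max))

for committed in (5, 24, 1500, 60_000):
    print(committed, "->", sync_batch_tokens(committed))
# 5 -> 24
# 24 -> 24
# 1500 -> 1500
# 60000 -> 4096
```

The lower clamp keeps β from falling below one page, and the upper clamp caps how far the P-side backup may lag behind D after a very long turn.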
+
+---
+
+## 2. Interface Contract across Three Layers
+
+### 2.1 Mooncake layer: dual roles
+
+**Current state** (see [RESEED_SLOW_PATH_AND_D_TO_P_GAP_ZH.md](RESEED_SLOW_PATH_AND_D_TO_P_GAP_ZH.md) §3):
+
+- `MooncakeKVManager` is bound to a single role at initialization by `disaggregation_mode ∈ {PREFILL, DECODE}`.
+- `MooncakeKVSender` is instantiated only in PREFILL mode, and `MooncakeKVReceiver` only in DECODE mode.
+- `add_transfer_request` contains the hard constraint `assert disaggregation_mode == PREFILL`.
+
+**Target interface**:
+
+```python
+# third_party/sglang/python/sglang/srt/disaggregation/base/conn.py
+from enum import Enum
+
+# Defined before BaseKVManager so the `roles` annotation resolves at class creation.
+class KVRole(Enum):
+    PREFILL = "prefill"
+    DECODE = "decode"
+    PREFILL_BACKUP_RECEIVER = "prefill_backup_receiver"   # new: P side receives D→P sync
+    DECODE_BACKUP_SENDER = "decode_backup_sender"         # new: D side sends D→P sync
+
+class BaseKVManager:
+    roles: set[KVRole]   # replaces the original single-valued field; allows {PREFILL, DECODE}
+
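+**New classes** (~400 LOC at the implementation layer):
+
+| Class | Role | Key method |
+|---|---|---|
+| `DecodeKVSender` | D side pushes newly appended KV blocks back to P | `enqueue_sync(session_id, kv_blocks, target_p)` enqueues asynchronously and returns a `sync_id` |
+| `PrefillKVReceiver` | P side receives D→P sync packets | `recv_loop()` background thread; each packet triggers a callback that injects into the radix tree |
+
+**Bootstrap channel**: needs a second bootstrap socket independent of the existing P→D channel (to avoid conflicts in buffer-pointer negotiation). Configuration:
+- disabled by default; enabled by the ServerArgs flag `--enable-d2p-sync`
+- new port range `BOOTSTRAP_D2P_PORT_BASE = 22000`
+
+### 2.2 SGLang layer: multi-producer radix extension
+
+**Current state**: the P-side radix assumes a single producer (this worker's model output). `RadixCache.cache_finished_req` internally takes KV indices straight from `req_to_token_pool[req_pool_idx, :]` and inserts them into the tree.
+
+**Target interface** (after [BLOCK_LEVEL_EVICTION_DESIGN_ZH.md](BLOCK_LEVEL_EVICTION_DESIGN_ZH.md) is complete):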
+
+```python
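+from typing import Sequence
+
+import torch
+
+class RadixCache(BasePrefixCache):
+    def insert_external(
+        self,
+        token_ids: Sequence[int],
+        kv_tensor: torch.Tensor,
+        *,
+        source_worker_id: str,
+        session_id: str,
+        pin: bool = False,  # the docstring below refers to this flag
+    ) -> InsertExternalResult: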
+        """
+        Insert KV blocks supplied by an external worker (D→P sync).
+
+        Allocates fresh slots in token_to_kv_pool, copies kv_tensor into them,
+        and threads the resulting indices through the radix tree exactly like
+        cache_finished_req would for a local prefill.
+
+        Invariants:
+            - Same model layout (verified at handshake time, not per-call).
+            - On collision with existing radix path, no-op for the shared prefix
+              and only insert the diverging suffix.
+            - Inserted nodes get lock_ref += 1 if `pin=True`, default False.
+              D→P sync is best-effort; LRU is allowed to evict the inserted leaves.
+        """
+```
+
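+**Key design decisions**:
+
+| Decision | Options | Recommendation |
+|---|---|---|
+| KV index remapping | A) D sends the original indices and P remaps; B) D sends a densely packed tensor and P reallocates | **B**: avoids leaking indices across workers |
+| Failure handling | A) D→P failure → fall back to re-prefill; B) retry N times | **A**, and on a later reseed take the old path if P misses |
+| Reference counting | Is KV synced into P pinned? | **Not pinned**: the P-side LRU manages it naturally, so backups do not crowd out production KV |
+| Coordination with evict | What if P is full when a sync arrives? | Let the sync insert trigger inner.evict → fair LRU competition with locally produced KV |
+| Multiple P instances per session | What if the router round-robins turns across different Ps? | **Accept multi-source**: each P keeps its own backup; on reseed pick the one with the smallest staleness |
+
+### 2.3 agentic-pd-hybrid layer: hooks and state machine
+
+**New CLI flags**: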
+
+```bash
+--enable-d2p-sync                     # off by default
+--d2p-staleness-budget-tokens 4096    # β_max
+--d2p-sync-batch-min-tokens 24        # trigger only at ≥ 1 page
+--d2p-sync-target-policy {last_p, round_robin, broadcast}
+                                      # last_p: push back to the P that last seeded this session
+                                      # broadcast: push to all Ps (flexible at reseed time but bandwidth-heavy)
+```
+
+**New state fields** (`DirectSessionState` in `replay.py`):
+
+```python
+@dataclass
+class DirectSessionState:
+    ...
+    # NEW: per-P backup view, populated by D->P sync callbacks.
+    prefill_resident_tokens_by_p: dict[str, int] = field(default_factory=dict)
+    last_d2p_sync_at: float | None = None
+```
+
+**Hook after `_invoke_session_direct` completes**:
+
+```python
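+async def _invoke_session_direct(...):
+    ...
+    response = await self._stream_direct_to_d(...)
+    if response.ok and self.config.enable_d2p_sync:
+        new_committed = response.kv_committed_len
+        # Measure staleness against the P we will actually sync to; taking the
+        # max over all Ps would leave a gap on a target that is further behind.
+        target_p = self._choose_d2p_target(session)
+        prev_p_resident = session.prefill_resident_tokens_by_p.get(target_p, 0)
+        staleness = new_committed - prev_p_resident
+        if staleness >= self.config.d2p_sync_batch_min_tokens:
+            # NOTE: keep a reference to the returned task; the event loop holds
+            # only weak references, so an unreferenced task can be collected.
+            asyncio.create_task(
+                self._issue_d2p_sync(session, target_p, prev_p_resident, new_committed)
+            )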
+
+**Hook on the reseed path** (`_invoke_kvcache_seeded_router`):
+
+```python
+async def _invoke_kvcache_seeded_router(..., request):
+    ...
+    if self.config.enable_d2p_sync:
+        # Probe P-side residency before issuing full re-prefill.
+        probe = await self._probe_prefill_residency(session_id)
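+        # β_max from --d2p-staleness-budget-tokens (config attribute name assumed).
+        if probe.resident_tokens >= request.prefix_len - self.config.d2p_staleness_budget_tokens: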
+            # Use the up-to-date backup: skip re-prefill, just trigger P→D transfer.
+            return await self._invoke_p_to_d_transfer_only(...)
+    # Fall back to existing path.
+    return await self._invoke_kvcache_seeded_router_legacy(...)
+```
+
+---
+
+## 3. Properties (to be proven)
+
+### 3.1 Theorem 4 candidate (paper form)
+
+*Assume the staleness budget β is maintained. For a session `s` that has accumulated length L on D and triggers a reseed after being evicted:*
+
+```
+reseed_cost(s) ≤ T_p2d(L) + T_prefill(min(β, L))
+```
+
+*where T_p2d is the P→D transfer time (~L · 4 ns/token over RDMA) and T_prefill is the prefill time (~50K tokens/s on H100 TP1 Qwen3-30B). When β ≪ L, the bound degenerates to being dominated by a single P→D transfer.*
+
+**Baseline for comparison** (no D→P sync): `reseed_cost = T_p2d(L) + T_prefill(L − seed_size)`, where re-prefill dominates.
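+
As a rough numeric check of the bound above, the sketch below plugs in the two constants quoted in this section (about 4 ns/token for P→D transfer, about 50K tokens/s for prefill). The function names and the example inputs (L = 60,000, β = 4096, `seed_size = 0`) are illustrative assumptions, not measurements.

```python
# Back-of-the-envelope evaluation of the Theorem 4 candidate bound.
# The two constants come from the text; L, beta and seed_size are illustrative.

P2D_SECONDS_PER_TOKEN = 4e-9        # ~4 ns/token over RDMA
PREFILL_TOKENS_PER_SECOND = 50_000  # ~50K tokens/s on H100 TP1


def t_p2d(tokens: int) -> float:
    """Estimated P->D transfer time in seconds."""
    return tokens * P2D_SECONDS_PER_TOKEN


def t_prefill(tokens: int) -> float:
    """Estimated prefill time in seconds."""
    return tokens / PREFILL_TOKENS_PER_SECOND


def reseed_cost_with_sync(length: int, beta: int) -> float:
    """Upper bound with D->P sync: transfer L, re-prefill at most min(beta, L)."""
    return t_p2d(length) + t_prefill(min(beta, length))


def reseed_cost_baseline(length: int, seed_size: int) -> float:
    """Baseline without D->P sync: transfer L, re-prefill L - seed_size."""
    return t_p2d(length) + t_prefill(length - seed_size)


if __name__ == "__main__":
    L, beta = 60_000, 4096
    print(f"with sync: {reseed_cost_with_sync(L, beta):.3f} s")
    print(f"baseline:  {reseed_cost_baseline(L, seed_size=0):.3f} s")
```

Under these assumed inputs the sync path comes out at roughly 0.08 s against roughly 1.2 s for the baseline; real numbers depend on the actual seed size and measured transfer rates.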
+
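+### 3.2 Relationship to Theorem 2
+
+Theorem 2 only guarantees fast hits on the direct-to-D path. Theorem 4 also pushes the "fallback cost on a fast-path miss" down to the sub-second range, which makes it possible for KVC to beat DP at **all quantiles**.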
+
+---
+
+## 4. Four-phase rollout
+
+| Phase | Scope | GPU requirement | Acceptance criteria |
+|---|---|---|---|
+| **P1** | block-level eviction refactor ([BLOCK_LEVEL_EVICTION_DESIGN_ZH.md](BLOCK_LEVEL_EVICTION_DESIGN_ZH.md)) | 4×H100 smoke | mean tokens freed per eviction ≤ 500 |
+| **P2** | dual-role mooncake + microbench (D→P single-packet RTT, bandwidth utilization) | single node + RDMA | P→D RTT < 50ms (local); bandwidth for a single 16K-token block ≥ 50% of the theoretical ceiling |
+| **P3** | SGLang `insert_external` + agentic-pd-hybrid hook (best-effort only, no reseed probe) | 4×H100 + RDMA | more than 80% of triggered syncs complete within the same turn; no new failure mode introduced |
+| **P4** | reseed probe wired up + end-to-end evaluation | 4×H100 + RDMA | single reseed < 0.5s (vs. 3–7s today), TTFT p99 < 0.5s |
+
+**Key decision point**: between P1 and P2 an audit is required to confirm that SGLang radix `insert_external` does not conflict with the streaming-session decode path. If a serious conflict is found, introduce a "P-only sync mode" as a placeholder and open it up once the architecture has stabilized.
+
+---
+
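+## 5. Risks and mitigations
+
+| Risk | Impact | Mitigation |
+|---|---|---|
+| Dual-role mooncake breaks the existing one-way P→D path | E2 already exposed the mooncake "instance not alive" cascade; adding another channel may amplify it | In P2, start with an independent bootstrap channel + feature flag; keep a disable path |
+| D→P sync consumes D egress bandwidth and hurts direct-to-D append-prefill latency | Directly degrades the main path | Run sync on a low-priority QP (RDMA SL=0), trigger it in batches, at most once per turn |
+| P-side radix fills up with backups and squeezes out locally produced KV | P-side prefill slows down | Sync inserts are not pinned (§2.2), so they compete fairly under LRU |
+| Coordinating multiple backup views across multiple P instances is complex | The router must account for staleness when choosing target_p | Start with the `last_p` policy (recency-biased); observe the measured distribution before deciding whether to adopt `broadcast` |
+| `insert_external` drifts from the upstream API across SGLang patch upgrades | Maintenance burden | Confine the API to our vendor patch boundary (no pollution of upstream radix) and write a contract test |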
+
+---
+
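+## 6. Decoupling from block-level eviction
+
+| Work item | Depends on the other? |
+|---|---|
+| block-level eviction | Does not depend on D→P sync and can ship independently. Reduces reseed frequency on its own |
+| D→P sync | **Depends** on block-level eviction: requires the P-side radix to be the source of truth for streaming-session KV |
+| Both together | Largest benefit: reseed frequency drops by an order of magnitude + time per reseed drops by an order of magnitude |
+
+→ Rollout order: block-level eviction lands first; D→P sync then proceeds on `feat/d-to-p-sync`. The two should **not** be combined in one PR.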
+
+---
+
+## 7. Minimum actions for the successor agent
+
+1. Land this document on the `feat/d-to-p-sync` branch.
+2. Once block-level eviction is in main, start P2: dual-role mooncake + microbench (unit tests only, no coupling to the SGLang main path).
+3. In P3, add `insert_external` and the hook; merge to main disabled by default.
+4. After the P4 end-to-end evaluation, decide on the reseed probe policy (`last_p` vs `broadcast`).
+
+---
+
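+**Core takeaway**: D→P incremental sync is not as simple as "adding one more network channel"; the key is extending the P-side radix from a single producer to one that accepts best-effort external feeds. Block-level eviction is a prerequisite for this, so the two pieces of work can proceed one after the other but cannot be reordered.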
--- a/docs/E1_E2_FIX_DESIGN_ZH.md
+++ b/docs/E1_E2_FIX_DESIGN_ZH.md
@@ -0,0 +1,137 @@
+# E1 / E2 Failure Modes — Fix Design Space (no code changes)
+
+**Status**: design proposal for review.
+**Branch**: `h200-cu130`.
+**Companion**: `docs/E1_E2_RESULTS_ZH.md` §5b–§5d for the forensic findings this design responds to.
+
+This document evaluates candidate fixes for the two pathologies E1 / E2 exposed:
+- **Q1**: D scheduler thread starves the mooncake C++ control plane during LRU evictions, causing P-side `batch_transfer_sync` to time out (~30 s) and the hair-trigger in `conn.py:1270` to permanently blacklist the D's mooncake_session_id.
+- **Q2**: KvAwarePolicy's overlap-first lex score, combined with workloads where new sessions share boilerplate hash_ids with already-resident sessions on D0/D1, leaves D2 cold for the entire run.
+
+For each problem we list candidate fixes, the layer they touch, their assumptions, and what could go wrong. **No code is committed** until a path is chosen.
+
+---
+
+## Q1 — Eviction starves mooncake control plane
+
+### Mechanism recap
+
+Inside `decode-0.log` at the moment of P-side timeout (`Sync batch data transfer timeout after 37452515723ns`):
+
+```
+01:56:34  Decode batch ... gen 174 tok/s    ← serving fine
+01:56:42  session id 1000315 does not exist, cannot delete.
+01:56:42  Trimmed decode session cache via LRU. evicted=2, freed=77675, available 38574 → 116249
+01:56:42  Trimmed decode session cache via LRU. evicted=1, freed=36166, available 29038 → 65204
+01:56:42  Decode transfer failed ...        ← P-side timeout fires
+```
+
+`maybe_trim_decode_session_cache` (in vendored sglang scheduler) walks per-session resident bookkeeping, releases GPU KV slots via `kv_pool_allocator.free()`, and updates `session_aware_cache` under lock. While that runs, the scheduler main loop is busy and the mooncake control-plane callbacks scheduled into the same event loop don't get serviced. P sees no completion ack within 30 s → `batch_transfer_sync` returns nonzero → hair-trigger fires.
+
+### Design space
+
+| # | Fix | Layer | Mechanism | Assumes | Risks |
+|---|---|---|---|---|---|
+| **Q1.A** | Pre-emptive low-watermark eviction | vendored SGLang | Trigger LRU when `token_usage > 0.7` in idle scheduler ticks, so admission rarely needs to evict inline. SGLang already has `_decode_session_cache_low_watermark_tokens`; question is whether it currently runs proactively or only on-demand. | Idle ticks exist to absorb the work; the per-trim cost is bounded enough that doing it pre-emptively doesn't hurt the steady-state. | If proactive trims pick "warm" sessions (recently active), we lose direct-to-D fast-path hits. Need careful watermark + LRU-priority tuning. |
+| **Q1.B** | Async eviction thread | vendored SGLang | Move LRU trim off the scheduler main loop into a background worker. Scheduler main loop only calls `notify_evict_needed()`; mooncake control plane keeps running. | KV pool free / session_aware_cache mutations can be made thread-safe with reasonable lock granularity. | Largest blast radius. Concurrent in-flight transfers can race with eviction of the same KV slots; need explicit ref-counting. Harder to reason about correctness. |
+| **Q1.C** | Bump mooncake transfer timeout | mooncake env / wheel patch | Set `MC_TRANSFER_TIMEOUT_NS` (or equivalent) from 30 s default → 120 s+, giving D's eviction more headroom before P gives up. | A real broken link won't go unnoticed for ≥120 s. | Pure defense-in-depth. Doesn't fix LRU thrashing; under heavier load eviction could exceed 120 s too. Slows real-failure detection. |
+| **Q1.D** | Windowed hair-trigger | vendored SGLang `conn.py:1270` | Replace `if session_failures >= 1:` with `if session_failures ≥ N within window`. Add periodic probe to D bootstrap port to clear `failed_sessions` after success. | Transient stalls are recoverable; real deaths are not. | Changes core failure semantics. We may keep dispatching to a D that is actually slow-dying. Adds windowed-state bookkeeping to a stable codepath. |
+| **Q1.E** | Router-side backpressure | our `--enable-backpressure` (already exists, off by default) | D returns `recommended_pause_ms` in its admission RPC when pool > threshold; router pauses dispatch to that D. Already implemented. | Pausing dispatch upstream prevents D from ever reaching saturation, so LRU never thrashes. | Doesn't help in-flight transfers when stall happens; only prevents future arrivals. Won't rescue requests already mid-mooncake when LRU fires. |
+| **Q1.F** | Upstream load balance (= Q2 fix) | our `policies.py` | Spread sessions to D2 so D0/D1's KV pool never saturates; LRU never trims; mooncake never stalls; hair-trigger never fires. | Q2 fix is sound and the workload's KV demand fits into 3 D's evenly. | The LRU+mooncake interaction stays latent. A different workload that still imbalances (e.g. a few sessions much larger than others) could re-trigger. |
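+
For reviewers weighing Q1.D, the windowed trigger could look roughly like the sketch below. It is a standalone illustration, not a patch against `conn.py`: the class name, the thresholds, and the `record_probe_success` hook are all hypothetical.

```python
import time
from collections import defaultdict, deque


class WindowedFailureTracker:
    """Hypothetical helper: blacklist a session only after N failures within a time window."""

    def __init__(self, max_failures: int = 3, window_s: float = 60.0, clock=time.monotonic):
        self.max_failures = max_failures
        self.window_s = window_s
        self.clock = clock
        self._failures = defaultdict(deque)  # session_id -> failure timestamps
        self.failed_sessions = set()

    def record_failure(self, session_id: str) -> bool:
        """Record one failed transfer; return True if the session is now blacklisted."""
        now = self.clock()
        stamps = self._failures[session_id]
        stamps.append(now)
        # Drop failures that have aged out of the window.
        while stamps and now - stamps[0] > self.window_s:
            stamps.popleft()
        if len(stamps) >= self.max_failures:
            self.failed_sessions.add(session_id)
        return session_id in self.failed_sessions

    def record_probe_success(self, session_id: str) -> None:
        """A successful probe clears the blacklist entry and the failure history."""
        self.failed_sessions.discard(session_id)
        self._failures.pop(session_id, None)
```

With `max_failures=1` this degenerates to the current behaviour (one failure blacklists the session), which makes the change easy to gate behind a flag. It does not address the risk noted in the table: a slow-dying D keeps receiving traffic until the window fills.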
+
+### Recommendation for Q1
+
+**Primary: Q1.F (do Q2 fix first).** This is upstream of the failure cascade and removes the only situation in which we observe LRU thrashing in our experiments. If Q2 is fixed and re-running E2 still shows mooncake stalls, then we *know* it's a real symptom and need defense-in-depth.
+
+**Defense-in-depth (cheap): Q1.C (bump mooncake timeout).** Single env-var change, gives 4× safety margin, costs nothing. Safe to do regardless.
+
+**Avoid for now: Q1.B and Q1.D.** Both touch vendored SGLang in invasive ways that change failure-detection semantics. Hold until Q1.F + Q1.C demonstrate they aren't enough.
+
+**Open question for the team**: does SGLang's existing `low_watermark` LRU trigger (Q1.A) already run proactively? If we read the scheduler loop and find it only trims on demand, Q1.A is a small targeted change worth doing; if it's already proactive, the trims we observe are because watermark is set too high → tune the constant.
+
+---
+
+## Q2 — Cold-D never gets a session
+
+### What we already know is wrong
+
+User's observation: the existing `migration_reject_threshold=3` mechanism fires *after 3 wasted prefills*, which is too late. The fix needs to be *proactive*: the first request to a fresh session should already prefer the cold D over a hot D whose only advantage is shared boilerplate overlap.
+
+### Design space
+
+Let `assigned[D] = state.decode_assignment_counts[D]` and `inflight[D] = state.inflight_decode[D]`. Lex score is currently:
+
+```
+score(D) = (overlap + α·sticky, sticky, -inflight, -assigned)
+```
+
+| # | Fix | Mechanism | Assumes | Risks |
+|---|---|---|---|---|
+| **Q2.A** | Cold-D bonus (binary, what the reverted commit did) | `cold_boost = K if assigned[D]==0 and not sticky else 0`; add to lex position 0. | Each D needs to be "popped" from cold once, after that the bonus disappears. | One-shot: only protects the first session per D. After all 3 D's have ≥1 session, bonus is 0 everywhere and we're back to overlap-dominates-everything. If new session pressure remains skewed (e.g. boilerplate keeps growing on D0/D1), we re-imbalance silently. |
+| **Q2.B** | Load-floor bonus (graduated, my recommended primary) | `floor_bonus = max(0, K · (1 − assigned[D] / max(assigned[*])))` (or similar continuous fn); add to lex position 0; gated on `not sticky`. | "Lower assignment count = preferable for fresh sessions" is a sound bias even when no D is fully cold. | Tuning: K must dominate boilerplate overlap (~50 blocks here) but not so much that it drowns out genuine prefix-cache wins (a session with real 800-block overlap with one D should still go there). Suggest K ≈ 4×median(overlap_for_fresh_sessions), i.e. K ≈ 200 here. |
+| **Q2.C** | Lex re-order: inflight first | Change score to `(-inflight, overlap + α·sticky, sticky, -assigned)`. | Idle D always wins ties → idle D2 wins fresh sessions immediately. | Contradicts the existing design intent (overlap-first = cache-locality-first). Hurts cache reuse when load *is* balanced. Sticky requests at turn 1+ might be diverted to a momentarily idle D, breaking cache locality of subsequent turns. |
+| **Q2.D** | Capacity-aware overlap discount | `effective_overlap = overlap · (1 − inflight[D] / max_inflight)`; replace `overlap` in score. | Loaded D's overlap is worth less than idle D's overlap because of queueing cost. Matches what theory says about cache-vs-load tradeoff. | More complex than Q2.B; needs `max_inflight` estimate (per-D? global?). Harder to reason about and tune. Saves only marginal modeling correctness over Q2.B. |
+| **Q2.E** | Pre-warm cold D's at startup | After SGLang warmup, send a synthetic request whose hash_ids cover the boilerplate prefix to each D, populating `state.resident[D]` evenly. | We can identify "the shared boilerplate" by inspecting the trace before launch (or extracting common prefix at run start). | Trace-aware / requires upstream knowledge. Doesn't help workloads with multiple distinct shared prefixes. Workload-coupled — feels brittle. |
+| **Q2.F** | Drop overlap unless "material" | Apply overlap term only when overlap > τ blocks (or > τ% of input). | Tiny overlap doesn't actually save meaningful prefill work. | Hides imbalance instead of solving it. If a workload has medium overlap (say 15%), threshold won't fire and we're back to imbalance. Doesn't address the bigger issue. |
+| **Q2.G** | Fix the substring filter (the actual `_is_admission_rejection_mode` bug) | Either widen `_ADMISSION_REJECTION_SUBSTRINGS` to include `"kvcache-centric"`, or call `state.record_admission_reject` directly from the actual reject signal site instead of string-matching after the fact. | Existing migration mechanism is sound *once* it gets fed the right signal. | User has explicitly said 3-reject threshold is too late. So Q2.G alone isn't enough. But it's still a real bug — fixing it is orthogonal cleanup. |
+
+### Recommendation for Q2
+
+**Primary: Q2.B (load-floor bonus, graduated).**
+- Continuous, not binary one-shot like Q2.A — gracefully handles the case where new sessions keep arriving and load needs to keep spreading.
+- Decouples "node-idle preference" from overlap as separate signals — composable, debuggable.
+- The bonus is gated on `not sticky`, so sticky routing is untouched and turn 1+ cache locality is not at risk.
+- Single knob (`K`) to tune.
+
+**Orthogonal cleanup: Q2.G (fix the reject-substring filter).** Independent of Q2.B, since the migration mechanism is the *backstop* (when load-floor bonus alone isn't enough to migrate from a saturated D mid-session). User correctly noted that waiting 3 rejects is too late as the *primary* mechanism, but as a *backstop after* primary load balancing, it's still valuable.
+
+**Avoid: Q2.C** (lex re-order destroys overlap-first design). **Avoid: Q2.E** (workload-coupled, brittle). **Q2.D / Q2.F** are reasonable but more complex than Q2.B with marginal gain.
+
+### Concrete shape of Q2.B (for review, not for merge)
+
+```python
+# In KvAwarePolicy.select, replacing the current score line:
+total_assigned = sum(state.decode_assignment_counts.values())
+n_decoders = max(1, len(topology.route_workers))
+mean_assigned = total_assigned / n_decoders
+
+# Per-D fairness deficit: how much below the running mean is this D?
+deficit = max(0, mean_assigned - state.decode_assignment_counts.get(worker.worker_id, 0))
+floor_bonus = int(self.load_floor_bonus * deficit / max(1, mean_assigned)) if not sticky else 0
+
+score = (
+    overlap + sticky * self.sticky_bonus + floor_bonus,
+    sticky,
+    inflight_penalty,
+    assignment_penalty,
+)
+```
+
+Knob: `load_floor_bonus: int = 0` (off by default, opt-in). When set to e.g. 200, an empty D that should have 16 sessions but has 0 gets `floor_bonus = 200 * 16 / 16 = 200`, dominating boilerplate overlap (~50). A D that's only 1 session below mean gets `floor_bonus = 200 * 1 / 16 ≈ 12`, which doesn't override real prefix-cache wins.
+
+But this is just a *sketch* — real tuning needs an empirical pass on the same Inferact subset to verify D2 receives sessions and overlap-driven cache wins survive on D0/D1.
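+
The arithmetic in the knob description can be reproduced with a self-contained version of the deficit-vs-mean formula. This is a sketch under one stated assumption: the `assigned` dict has an entry for every decode worker (including zeros), so `len(assigned)` stands in for `len(topology.route_workers)`. The function name and signature are illustrative.

```python
def load_floor_bonus(assigned: dict[str, int], worker_id: str, k: int, sticky: bool) -> int:
    """Graduated bonus for a decode worker that sits below the mean assignment count."""
    if sticky or not assigned:
        # Sticky requests never receive the bonus, so cache locality is preserved.
        return 0
    mean_assigned = sum(assigned.values()) / len(assigned)
    # Fairness deficit: how far below the running mean this worker is.
    deficit = max(0.0, mean_assigned - assigned.get(worker_id, 0))
    return int(k * deficit / max(1.0, mean_assigned))


if __name__ == "__main__":
    cold = {"D0": 24, "D1": 24, "D2": 0}   # mean 16, D2 is 16 below the mean
    print(load_floor_bonus(cold, "D2", k=200, sticky=False))
    near = {"D0": 17, "D1": 16, "D2": 15}  # mean 16, D2 is 1 below the mean
    print(load_floor_bonus(near, "D2", k=200, sticky=False))
```

The two cases correspond to the fully cold D (bonus 200) and the D that is one session below the mean (bonus 12) described above; workers at or above the mean get 0.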
+
+### Validation plan if we go with Q2.B
+
+1. Implement Q2.B + flag, default off.
+2. Re-run E2 on the same `outputs/inferact_50sess.jsonl` subset with `--kvcache-load-floor-bonus 200`.
+3. Check structural log: do D0/D1/D2 each get a non-trivial share of `session-d-binding.jsonl` rows?
+4. Check failure rate: drop from 1054 → < 100? (Hypothesis: yes, because the LRU thrash that triggered the mooncake hair-trigger was downstream of D0/D1 saturation.)
+5. Check direct-to-D rate: should stay similar or improve (load-balancing should not destroy cache reuse, since sticky still wins for known sessions).
+6. Re-evaluate H1 with E1 vs the new E2.
+
+---
+
+## Decision points (for review)
+
+| # | Question | Default if no answer |
+|---|---|---|
+| D1 | Q1: do Q2 fix first and re-measure before touching mooncake / SGLang? | **Yes** (recommended) |
+| D2 | Q1: bump mooncake `MC_TRANSFER_TIMEOUT_NS` to 120 s as cheap defense-in-depth? | Yes |
+| D3 | Q2: is Q2.B (load-floor bonus, graduated) the right shape, or should we pick a different option from the table? | Q2.B |
+| D4 | Q2: also do Q2.G (fix the reject-substring filter) as orthogonal cleanup? | Yes |
+| D5 | Q2.B: is the proposed deficit-vs-mean formula OK, or do you prefer a simpler "bonus = K · (max - mine) / max" form? | Defer |
+| D6 | Q2.B: bonus magnitude K = 200 reasonable, or want to grid-search a few values? | Try 200 first |
+| D7 | Validation: re-run E2 on same 50-session subset, or expand to 100 sessions for more headroom? | Same subset |
+
+Once the shape is approved, the next implementation pass is small and concentrated in `policies.py` + `replay.py` + `cli.py` (no SGLang vendor changes needed for the primary fix).
--- a/docs/E1_E2_RESULTS_ZH.md
+++ b/docs/E1_E2_RESULTS_ZH.md
@@ -0,0 +1,416 @@
+# E1 vs E2 Experiment Results — H200 + Driver 570
+
+**Status**: E1 ✅ complete (2026-05-12 01:48 UTC, wall 1h29min). E2 ✅ complete (2026-05-12 03:22 UTC, wall 1h33min).
+**Branch**: `h200-cu130`.
+**Trace**: `outputs/inferact_50sess.jsonl` (deterministic head-cut of Inferact `codex_swebenchpro` to first 50 trials, md5 `7bb263a32600ef5a6ef5099ba340a487`, 1285 requests, mean input_length 67,631 tokens).
+**Hardware**: 4× H200 80GB, driver 570.86.15 (cu12.8 API), Mellanox mlx5_60 RoCE 400 Gb/s NDR.
+**Model**: Qwen3-30B-A3B-Instruct-2507 (TP1).
+**Toolchain**: vendored SGLang 0.5.10 + cu12.8 nvcc local install (`~/cuda-12.8`) — see `docs/H200_DRIVER570_SETUP_ZH.md`.
+
+---
+
+## 1. Hypotheses being tested
+
+From `docs/ONBOARDING_NEXT_AGENT_ZH.md` §3.1:
+
+- **H1**: KVC v2's wins are not just from "1P3D topology + kv-aware policy" — the KVC layer (admission / migration / direct-to-D) contributes meaningfully on top. Pairing E1 (no KVC layer) against E2 (full KVC v2) on the **same subset** isolates the marginal contribution.
+- **H2/H3**: Enabling real RDMA pushes TTFT p99 down from the reported 1.28s (TCP loopback) toward ~0.7s. Independent of H1, this is measured inside E2 alone (comparing against the historical TCP-loopback v2 reference).
+
+---
+
+## 2. E1 results — naive 1P3D + kv-aware + RDMA
+
+**Configuration**: `mechanism=pd-disaggregation`, `policy=kv-aware`, 1P3D (GPU0=P, GPU1/2/3=D), `--force-rdma --ib-device mlx5_60`, `--concurrency-limit 32`, ts=1.
+
+| Metric | E1 |
+|---|---:|
+| request_count | 1285 |
+| success | 1200 |
+| **error_count** | **85** |
+| **failure_count** | **85** |
+| abort_count | 0 |
+| latency mean | 96.34 s |
+| latency p50 | 93.21 s |
+| latency p90 | 180.69 s |
+| latency p99 | 219.46 s |
+| ttft mean | 90.48 s |
+| ttft p50 | 88.62 s |
+| ttft p90 | 175.13 s |
+| **ttft p99** | **207.39 s** |
+| execution_modes | `pd-disaggregation-router: 1200`, `pd-disaggregation: 85` (errors) |
+| per_decode_load | **D0:575, D1:710, D2:0** |
+| per_prefill_load | P0:1285 |
+| cache_hit_request_count | 1199 / 1200 (99.9%) |
+
+### Key observations on E1
+
+1. **D2 was never bound to any session**. All 50 sessions got pinned to D0 or D1 by `kv-aware` policy's (overlap + sticky + inflight + assigned) lex-score, and naive pd-disaggregation has no migration mechanism to rebalance. Effective topology was **1P2D**, not 1P3D.
+2. **Massive queueing**. TTFT p50 ≈ 89 s and p99 > 200 s indicate sessions waited tens of seconds in router/prefill queue. With `--concurrency-limit 32` and D0/D1 saturated, the inflight cap forced ~1250 reqs to serialize through only two decode workers.
+3. **85 failures (6.6%)** — all `execution_mode == pd-disaggregation` (which the metrics module classifies as `error` when the agentic-pd-hybrid replay sees an unsuccessful upstream response). Most likely caused by `--request-timeout-s 300` firing on the longest queued requests.
+4. **Cache hit 99.9%** — the kv-aware policy did successfully concentrate sessions on their prior D worker; the Inferact converter's prefix-shared 24-token-block hash_ids gave near-perfect prefix overlap across turns of the same session.
+
+### What E1 establishes
+
+For the same hardware, same trace, same model, **naive 1P3D + kv-aware policy is unusable for multi-session agentic workloads**:
+- session-stickiness without migration leaves a third of compute capacity (1 of 3 decode GPUs) entirely unused
+- queueing dominates user-facing latency
+- failure rate is 6.6% even with a 5-minute per-request timeout
+
+This is *the baseline H1 needs* — it shows the KVC layer (E2) has something concrete to improve over.
+
+---
+
+## 3. E2 results — KVC v2 + RDMA
+
+**Configuration**: `mechanism=kvcache-centric`, `policy=kv-aware`, 1P3D, `--force-rdma --ib-device mlx5_60`, `--kvcache-admission-mode worker`, `--kvcache-direct-max-uncached-tokens 8192`, `--kvcache-migration-reject-threshold 3`, `--kvcache-prefill-backup-policy release-after-transfer`, `--kvcache-prefill-priority-eviction`, ts=1.
+
+| Metric | E2 |
+|---|---:|
+| request_count | 1285 |
+| success | 231 |
+| **error_count** | **1054** |
+| **failure_count** | **1054** |
+| abort_count | 0 |
+| latency mean (successful only) | 10.94 s |
+| latency p50 | 7.44 s |
+| latency p90 | 20.68 s |
+| latency p99 | 64.73 s |
+| ttft mean (successful only) | 1.76 s |
+| ttft p50 | 0.43 s |
+| ttft p90 | 6.56 s |
+| **ttft p99** | **8.74 s** |
+| execution_modes (succ.) | direct-to-D: 87; turn1-seed: 50; reseed: 12; large-append-reseed: 11; seed-filter-early-turn: 50; large-append-cap: 21 |
+| per_decode_load | **D0:600, D1:685, D2:0** |
+| per_prefill_load | P0:1285 |
+| cache_hit_request_count | 230 / 231 (99.6 %) |
+
+### Key observations on E2
+
+1. **D2 still has zero bindings** — same root cause as E1. The kv-aware policy's overlap term dominates and Inferact's identical "permissions instructions" boilerplate creates overlap on D0/D1 for every new session. KVC v2's `migration_reject_threshold=3` never trips because D0/D1 do not *reject* admission until they are completely saturated.
+2. **82 % failure rate (1054 / 1285)**. These are **not timeouts**: the actual root cause is a 3-layer cascade documented in §5b. Quick summary: 562 "no-space" admission rejects from D0/D1 → router falls back to seed/reseed paths needing mooncake → mooncake heartbeats drop ("Decode instance could be dead") → SGLang aborts the request → client sees `RuntimeError: generate stream ended before producing any token`.
+3. **Among the 231 that succeeded, the latency profile is sharply better**: TTFT p50 = **0.43 s** vs E1's 88.62 s (E2/E1 = 0.5 %), latency p50 = **7.44 s** vs E1's 93.21 s (8 %). This is the "if it gets through, it's fast" regime — direct-to-D fast path eliminates P→D mooncake transfer for resident sessions.
+4. **Direct-to-D fast path engaged 87 / 231 = 37.7 %** of successful requests. Lower than historical v2's 91.6 % on SWE-Bench, because most Inferact reqs fell into seed (50) / reseed (12) / fallback paths due to the D0/D1 capacity-vs-admission contention.
+
+---
+
+## 4. Comparison table — E1 vs E2
+
+Numbers below are over **all 1285 requests** for E1 (since failure rate is small) but **only the 231 successful** for E2 (since the bulk failed before producing any latency datapoint). This is **not a fair head-to-head**, see §6.
+
+| Metric | E1 | E2 (succ only) | E2 / E1 |
+|---|---:|---:|---:|
+| Total reqs | 1285 | 1285 | – |
+| Successful | 1200 | **231** | 0.19× |
+| **error_count** | 85 (6.6 %) | **1054 (82 %)** | **12.4× worse** |
+| lat mean | 96.34 s | 10.94 s | 0.114 |
+| lat p50 | 93.21 s | **7.44 s** | **0.080** |
+| lat p90 | 180.69 s | 20.68 s | 0.114 |
+| lat p99 | 219.46 s | 64.73 s | 0.295 |
+| ttft mean | 90.48 s | 1.76 s | 0.019 |
+| **ttft p50** | 88.62 s | **0.43 s** | **0.005** |
+| ttft p90 | 175.13 s | 6.56 s | 0.037 |
+| ttft p99 | 207.39 s | 8.74 s | 0.042 |
+| per_decode_load | D0:575, D1:710, D2:0 | D0:600, D1:685, D2:0 | both 1P2D |
+| direct-to-D % | N/A (no KVC) | 87/231 = 37.7 % | – |
+
+---
+
+## 5. Interpreting H1 / H2 / H3
+
+### H1 (was: KVC layer adds value on top of 1P3D + kv-aware) — *qualified*
+
+The H1 hypothesis as stated in `ONBOARDING_NEXT_AGENT_ZH.md` predicted E2 would clearly win on most metrics. The reality is **bimodal**: the small subset of E2 requests that successfully complete are dramatically faster than E1, but a much larger fraction (82 %) of E2 requests fail outright (via the cascade in §5b, not via timeouts). Net throughput on this workload is *worse* for E2 than E1.
+
+Two issues drove this:
+1. The D2 cold-start pathology, whose root cause is documented in §3. Both runs are de facto 1P2D, not 1P3D.
+2. KVC v2's admission gate is stricter and surfaces more "no D capacity" / "session-not-resident" failures than vanilla pd-disagg, when the workload (mean input 67 K tokens, mean output 700 tokens) saturates D0/D1's combined ~1.5 M KV pool.
+
+For workloads where D0/D1 do not saturate or where the policy *does* spread session ownership across all D workers (the historical SWE-Bench setup), KVC v2 wins. The Inferact `codex_swebenchpro` subset breaks both assumptions.
+
+### H2 / H3 (RDMA reduces TTFT p99) — *cannot be evaluated cleanly here*
+
+The historical reference point is "KVC v2 + TCP loopback, SWE-Bench 50sess: TTFT p99 = 1.28 s". This run uses Inferact + RDMA, and TTFT p99 of the 231 successful E2 requests is **8.74 s** — much higher than the TCP baseline. But the workloads are not comparable: Inferact mean input is 67 K tokens vs SWE-Bench's much smaller average. Per-request prefill + transfer is roughly 5× longer here. A clean H2 / H3 read needs an Inferact-on-TCP run to compare against, which is out of scope for this subset's GPU budget.
+
+What we *can* say: RDMA is correctly engaged (every worker log shows `installTransport, type=rdma`; admission RPC RTTs in `structural/admission-events.jsonl` are ~6 ms — consistent with one-hop RoCE).
+
+---
+
+## 5b. Why E2 has 80 % failures — the real chain (forensic)
+
+The summary's `error_count: 1054` and `execution_mode: kvcache-centric` mask the actual cascade. Pulling the underlying `request-metrics.jsonl`, `structural/admission-events.jsonl`, and per-worker SGLang logs gives the full picture.
+
+### Layer 1 — worker admission rejects (51 % of admit attempts)
+
+From `structural/admission-events.jsonl`:
+```
+admit ok      = 581  (modes: seed=494, direct_append=87)
+admit reject  = 605  (reasons: no-space=562, session-not-resident=43)
+```
+
+**562 "no-space" rejects** — D worker (almost always D0 or D1) reports its KV pool is full and refuses to take the request as direct-append. The router then re-routes the request to the seed/reseed path.
+
+This is materially different from E1's behaviour: E1's vanilla pd-disagg had no admission RPC, so requests *always* got accepted by the chosen D and queued behind the running batch. E1 paid for that as a 90-second TTFT but never saw a "no-space" failure.
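+
The headline percentages quoted across §2–§5b can be recomputed from the raw counts. The snippet below is only a consistency check; every number in it is copied from the tables and the admission-events summary above, and the variable names are illustrative.

```python
# Counts copied from the result tables and the admission-events summary.
REQUESTS = 1285
E1_ERRORS, E2_ERRORS = 85, 1054
ADMIT_OK, ADMIT_REJECT = 581, 605
NO_SPACE = 562
E2_SUCCESS, DIRECT_TO_D = 231, 87


def pct(part: int, whole: int) -> float:
    """Percentage rounded to one decimal place."""
    return round(100.0 * part / whole, 1)


e1_failure_pct = pct(E1_ERRORS, REQUESTS)                # E1 failure rate
e2_failure_pct = pct(E2_ERRORS, REQUESTS)                # E2 failure rate
reject_pct = pct(ADMIT_REJECT, ADMIT_OK + ADMIT_REJECT)  # share of admit attempts rejected
no_space_share_pct = pct(NO_SPACE, ADMIT_REJECT)         # share of rejects that are "no-space"
direct_to_d_pct = pct(DIRECT_TO_D, E2_SUCCESS)           # direct-to-D share of successes

if __name__ == "__main__":
    print(e1_failure_pct, e2_failure_pct, reject_pct, no_space_share_pct, direct_to_d_pct)
```

The E2 failure rate works out to 82.0 %, which is the figure used in the comparison table; "no-space" accounts for about 93 % of all admission rejects.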
+
+### Layer 2 — mooncake P→D transfer failures (real, observed in prefill log)
+
+From `logs/prefill-0.log`:
+```
+[01:56:42] Prefill transfer failed for request rank=0 req.rid='2a5ed06fb…'
+           with exception KVTransferError: Failed to send kv chunk of … to 172.18.112.37:46067
+[01:56:42] Prefill transfer failed for request rank=0 req.rid='eca5ff14…'
+           with exception KVTransferError: Decode instance could be dead,
+           remote mooncake session 172.18.112.37:15078 is not alive
+[01:56:42] Prefill transfer failed for request rank=0 req.rid='7ed9827b…'
+           Decode instance could be dead, remote mooncake session ... is not alive
+```
+
+When the seed/reseed fallback queue piles up (because of layer 1), the D worker becomes heavily backlogged and its mooncake bootstrap session heartbeat drops — P interprets this as "the D worker is dead" and fails the transfer. This is **not** a true crash; the worker process is alive (we observed it accepting unrelated requests immediately after), but the mooncake session is torn down for that bootstrap_room.
+
+### Layer 3 — client-visible error
+
+From `request-metrics.jsonl` for all 1054 failed reqs:
+```
+"error": "RuntimeError: generate stream ended before producing any token"
+```
+
+This is what `agentic-pd-hybrid` sees when the SGLang `/generate` SSE stream closes with zero output tokens — the upstream abort from layer 1 or layer 2 propagates as an empty stream.
+
+### The complete causal chain
+
+```
+Inferact shared "permissions instructions" boilerplate
+    ↓
+overlap term in kv-aware lex score never lets D2 win → D2 cold forever
+    ↓
+50 sessions all pinned to D0 / D1
+    ↓
+D0 / D1 KV pool saturates
+    ↓
+worker admission emits 562 × "no-space"  ← Layer 1
+    ↓
+router falls back to seed/reseed path (needs P→D mooncake transfer)
+    ↓
+P→D transfer queue piles up; D mooncake heartbeat drops
+    ↓
+"Decode instance could be dead" → KVTransferError  ← Layer 2
+    ↓
+SGLang aborts the req → SSE stream closes with 0 tokens
+    ↓
+agentic-pd-hybrid raises "generate stream ended ..." for 1054 reqs  ← Layer 3
+```
+
+### Why E1 didn't hit this
+
+E1 used `mechanism=pd-disaggregation`, which has no per-worker admission RPC. The router blindly dispatched to D0/D1; SGLang's internal scheduler simply queued requests behind the running batch (some grew their wait to >90 s before getting a token). Of the 85 E1 errors, sampling shows they are `request-timeout-s=300` failures — old-fashioned timeouts on the agentic-pd-hybrid side, not mooncake or admission failures.
+
+So:
+- E1 trades latency for resilience: nobody rejects, everyone queues, you pay TTFT.
+- E2's KVC v2 worker admission is *meant* to be a safety valve, but on the cold-D pathology it becomes an *amplifier*: rejects → fallback paths → backlog → mooncake heartbeat loss → cascading failures.
+
+### The real fix
+
+Worker admission per se is not the bug — the bug is that there is no D-rebalancing happening upstream. With balanced D load (e.g. cold-D bonus in policy, or pre-warm of D2 with shared boilerplate), D0/D1 would not hit "no-space", and the layer 1 → layer 2 cascade would not fire. The reseed long-tail TTFT (8.74 s p99 here) becomes the dominant cost — exactly the regime onboarding §3.1 H3 describes.
+
+---
+
+## 5c. Why mooncake "died" (forensic on Q1)
+
+The error string is `Decode instance could be dead, remote mooncake session ... is not alive`, which sounds like the D worker process crashed. **It did not.** Concurrent evidence shows D1 was happily serving `/session_cache/admit_direct_append HTTP/1.1 200 OK` and running LRU evictions only seconds after the "is not alive" errors fired. The real mechanism is a hair-trigger blacklist in SGLang's mooncake connector: a single failed transfer marks the session as dead.
+
+### What the SGLang mooncake conn.py actually does
+
+In `third_party/sglang/python/sglang/srt/disaggregation/mooncake/conn.py:1267-1276`:
+
+```python
+if ret != 0:                                    # one transfer slice failed
+    with self.session_lock:
+        self.session_failures[req.mooncake_session_id] += 1
+        # Failures should never happen if the session is not dead,
+        # if the session fails once, mark it as failed
+        if self.session_failures[req.mooncake_session_id] >= 1:
+            self.failed_sessions.add(req.mooncake_session_id)
+            logger.error(f"Session {req.mooncake_session_id} failed.")
+    ...
+```
+
+After this, every subsequent transfer that uses the same `mooncake_session_id` short-circuits at conn.py:1184:
+
+```python
+if req.mooncake_session_id in self.failed_sessions:
+    self.record_failure(kv_chunk.room,
+        f"Decode instance could be dead, remote mooncake session ... is not alive")
+```
+
+**One real `send_kvcache_slice ret != 0` permanently blacklists that D's mooncake session for the rest of the SGLang process lifetime.** The code's own comment ("Failures should never happen if the session is not dead") encodes the design assumption that transfers don't fail under normal conditions — but they do under the saturation regime described in §5b (RDMA queue full / D scheduler too busy to drain receives in time).
+
+### Connecting back to Q1 timeline
+
+Looking at decode-1.log around 01:56:42-56, the worker is running heavy decode batches (#token = 627K, near KV pool cap of 755K) plus repeatedly evicting via LRU. Under that load a single `send_kvcache_slice` returning a transient nonzero is enough to flip the switch. After 01:56:42 essentially every P→D1 transfer reports "is not alive" until end-of-run, even though D1 itself keeps serving direct-append admissions.
+
+### What the hair-trigger is actually reacting to
+
+Pulling the mooncake C++ logs (filter `^E0`/`^I0` lines from prefill-0.log) reveals the actual underlying error:
+
+```
+I0512 01:56:42.242062 transfer_engine_py.cpp:546]
+    Sync batch data transfer timeout after 37452515723ns
+I0512 01:56:53.335597 transfer_engine_py.cpp:546]
+    Sync batch data transfer timeout after 30892690400ns
+```
+
+**37.45 s** and **30.89 s** — the mooncake `batch_transfer_sync` C++ call returned nonzero because the synchronous transfer took longer than its internal timeout (~30 s). On a 400 Gb/s NDR RDMA fabric this is not a network problem; the data path is healthy. The SGLang author's design instinct (`>= 1 failures = dead`) is *correct in the idle case* — a 30-second RDMA stall really does indicate a broken peer.
+
+What's happening here is that the peer is **logically broken from the C++ control-plane's point of view**, even though the OS process is still alive.
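+
+The two durations quoted above can be recovered from the raw log with a few lines of Python. This is a minimal sketch that assumes the log format is exactly the one shown in the excerpt; `timeout_seconds` is an illustrative helper, not part of the repo.
+
+```python
+import re
+
+# Pattern for the mooncake timeout line quoted above; the duration is in nanoseconds.
+TIMEOUT_RE = re.compile(r"Sync batch data transfer timeout after (\d+)ns")
+
+
+def timeout_seconds(log_text):
+    """Return every reported sync-transfer timeout, converted to seconds."""
+    return [int(m.group(1)) / 1e9 for m in TIMEOUT_RE.finditer(log_text)]
+
+
+sample = """\
+I0512 01:56:42.242062 transfer_engine_py.cpp:546]
+    Sync batch data transfer timeout after 37452515723ns
+I0512 01:56:53.335597 transfer_engine_py.cpp:546]
+    Sync batch data transfer timeout after 30892690400ns
+"""
+durations = timeout_seconds(sample)
+print([round(s, 2) for s in durations])  # [37.45, 30.89]
+```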
+
+### Why does the D side stall the control plane for 30 s?
+
+Cross-referencing decode-0.log at the exact second of the first timeout (01:56:42):
+
+```
+01:56:34  Decode batch, #running-req=1, #token=627631, token_usage=0.83,
+          gen throughput=174.76 tok/s         ← still serving normally
+01:56:42  session id 1000315 does not exist, cannot delete.
+01:56:42  session id 1000360 does not exist, cannot delete.
+01:56:42  Trimmed decode session cache via LRU.
+            #evicted_sessions: 2, #freed_tokens: 77675,
+            #available_tokens: 38574 → 116249
+01:56:42  Trimmed decode session cache via LRU.
+            #evicted_sessions: 1, #freed_tokens: 36166,
+            #available_tokens: 29038 → 65204
+01:56:53  Decode transfer failed for request rank=0 ...
+            Failed to get kvcache from prefill instance, it might be dead
+```
+
+D0's main scheduler thread was busy doing **two consecutive LRU evictions** (freeing 77 675 + 36 166 ≈ 114 K tokens of KV) right when the P→D mooncake transfer attempt landed. Each LRU trim involves:
+- iterating per-session resident metadata
+- releasing GPU KV slots back to `token_to_kv_pool_allocator.free()`
+- updating the session-aware-cache bookkeeping under lock
+- closing per-session streaming state
+
+Under `token_usage = 0.83` the LRU scan has to walk thousands of entries; the lock held during this work blocks the mooncake C++ control plane on the receive side (buffer registration / completion poll) from making progress. P's `batch_transfer_sync` keeps polling for the peer's completion ack, doesn't get one for 30 s, and gives up.
+
+So the chain is:
+
+```
+D KV pool saturated by D2-cold-pinning (§5d)
+    ↓
+D triggers heavy LRU eviction (114K tokens at a time)
+    ↓
+D main scheduler thread starves mooncake C++ control plane for 30+ s
+    ↓
+P's batch_transfer_sync returns nonzero (timeout)
+    ↓
+P's hair-trigger marks D's whole mooncake_session_id "failed forever"
+    ↓
+all subsequent reqs to that D blow up with "is not alive"
+```
+
+The hair-trigger threshold (`>= 1`) is structurally wrong for this regime — but it would not fire at all if the LRU thrash didn't happen, and the LRU thrash would not happen if the load were spread across all 3 D workers (§5d).
+
+### Two layers of fix
+
+| Layer | What | Cost |
+|---|---|---|
+| Root cause | Spread load to D2 so D0/D1's KV never saturate, LRU never thrashes. See §5d and the cold-D bonus implementation in `policies.py` (next commit). | Low — pure policy change |
+| Defense in depth | In `mooncake/conn.py:1267-1276`, replace `>= 1` with a windowed threshold (e.g. ≥ 3 failures within 60 s) and add a periodic retry that probes the D bootstrap port before clearing `failed_sessions`. | Medium — touches vendored SGLang |
+
+We do the root-cause fix first because it makes the second one optional.
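+
+The defense-in-depth row can be sketched as a small sliding-window counter. This is an illustration under stated assumptions, not the vendored `conn.py` code: the class name `WindowedFailureTracker` and its methods are hypothetical, the 3-failures-in-60-s values come from the table above, and the bootstrap-port probe is left out (here a session simply recovers once its failures age out of the window).
+
+```python
+import time
+from collections import defaultdict, deque
+
+
+class WindowedFailureTracker:
+    """Mark a session failed only after `max_failures` failures inside `window_s` seconds."""
+
+    def __init__(self, max_failures=3, window_s=60.0, clock=time.monotonic):
+        self.max_failures = max_failures
+        self.window_s = window_s
+        self._clock = clock
+        self._events = defaultdict(deque)  # session_id -> failure timestamps
+
+    def _prune(self, events, now):
+        # Drop failures that have aged out of the window.
+        while events and events[0] <= now - self.window_s:
+            events.popleft()
+
+    def record_failure(self, session_id):
+        """Record one failed transfer; return True if the session is now over threshold."""
+        now = self._clock()
+        events = self._events[session_id]
+        events.append(now)
+        self._prune(events, now)
+        return len(events) >= self.max_failures
+
+    def is_failed(self, session_id):
+        events = self._events.get(session_id)
+        if not events:
+            return False
+        self._prune(events, self._clock())
+        return len(events) >= self.max_failures
+```
+
+With this shape, a single transient timeout such as the one at 01:56:42 would be recorded but would not blacklist the session on its own.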
+
+---
+
+## 5d. Why no session ever migrated to D2 (forensic on Q2)
+
+KVC v2's design (KVC_ROUTER_ALGORITHM §3.3) uses `state.session_d_rejects[(session_id, D)] += 1` after a rejection, then policy.select skips any D with `rejects >= migration_reject_threshold (=3)`. The mechanism is conceptually sound. The bug is in *which* failures count as rejections.
+
+### The substring filter is too narrow
+
+In `replay.py:1379`:
+
+```python
+_ADMISSION_REJECTION_SUBSTRINGS = (
+    "session-cap",
+    "no-d-capacity",
+    "d-backpressure",
+)
+
+def _is_admission_rejection_mode(execution_mode: str) -> bool:
+    return any(token in execution_mode for token in _ADMISSION_REJECTION_SUBSTRINGS)
+```
+
+Only execution_modes containing one of those three substrings increment the per-(session, D) reject counter. **All 1054 E2 failures have `execution_mode = "kvcache-centric"`** (the generic fallback bucket the replay engine uses when the request fell through every concrete sub-path before producing a successful result). That string contains none of the three substrings, so `session_d_rejects` is never incremented for them.
+
+### Empirical confirmation
+
+Counting from `structural/admission-events.jsonl` (worker-RPC level, independent of replay's classification):
+
+| Stat | Value |
+|---|---:|
+| Distinct `(session, D)` pairs ever rejected by worker RPC | 49 |
+| Pairs rejected ≥ 3 times (would qualify for blacklist) | **46** |
+| Most-rejected single pair | (1001172, D1) = **25 rejects** |
+
+So 46 of 49 (sess, D) pairs *should have been blacklisted* by KVC v2's design. They never were, because the corresponding requests' execution_mode was `"kvcache-centric"` (failure path) and not `"…-session-cap"` / `"…-no-d-capacity"` / `"…-d-backpressure"` (which only get assigned when the fallthrough path runs to a known-rejection sub-result, not when the upstream SSE stream errors out).
+
+Counting "next-binding-after-reject" from the merged binding+admission timeline:
+
+| Rejected on | Next binding goes to | Count |
+|---|---|---:|
+| D0 | D0 | 253 |
+| D1 | D1 | 329 |
+| D0 | D2 | **0** |
+| D1 | D2 | **0** |
+
+The router stubbornly re-binds the same session to the same D after every reject — exactly because the reject was never recorded in `session_d_rejects`, so policy.select still sees an empty rejection counter and the overlap term keeps tipping it back to D0/D1.
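+
+A count like the one in the table above can be produced by walking a merged timeline once. The sketch below assumes a hypothetical event schema (`ts`, `session`, `kind`, `d`); the real field names in `admission-events.jsonl` and the binding log may differ.
+
+```python
+from collections import Counter
+
+
+def next_binding_after_reject(events):
+    """Count (rejected_on, next_bound_to) pairs from a merged, timestamped event list."""
+    counts = Counter()
+    last_reject = {}  # session -> D of the most recent reject not yet followed by a binding
+    for ev in sorted(events, key=lambda e: e["ts"]):
+        if ev["kind"] == "reject":
+            last_reject[ev["session"]] = ev["d"]
+        elif ev["kind"] == "bind" and ev["session"] in last_reject:
+            counts[(last_reject.pop(ev["session"]), ev["d"])] += 1
+    return counts
+
+
+# Made-up events for one session that keeps returning to the D that rejected it.
+events = [
+    {"ts": 1.0, "session": 1001172, "kind": "reject", "d": "D1"},
+    {"ts": 2.0, "session": 1001172, "kind": "bind", "d": "D1"},
+    {"ts": 3.0, "session": 1001172, "kind": "reject", "d": "D1"},
+    {"ts": 4.0, "session": 1001172, "kind": "bind", "d": "D1"},
+]
+print(next_binding_after_reject(events))
+```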
+
+### The fix
+
+Two paths, in increasing scope:
+
+1. **Quick**: include `"kvcache-centric"` (the failure-fallback bucket) in `_ADMISSION_REJECTION_SUBSTRINGS`, OR have replay set `execution_mode` to a more specific failure label when an SSE stream closes with zero tokens (e.g. `"upstream-aborted"`) and add that to the substring set.
+2. **Better**: don't rely on string-matching at all. Have `_run_request` catch the actual rejection signal (admission RPC `can_admit=False` or upstream `RuntimeError: generate stream ended ...`) and call `state.record_admission_reject(...)` directly at that point. The substring filter was inherited from the v1 → v2 migration design (`MIGRATION_V1_FINDINGS_ZH §4.1`) when only specific fallback paths set those names.
+
+Either fix would let the existing `migration_reject_threshold=3` blacklist D0/D1 after enough failures, force a re-route to D2, populate D2's resident hashes, and break the overlap-pinning death spiral.
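+
+The "Better" path can be sketched as follows. Only `session_d_rejects`, `record_admission_reject`, and `migration_reject_threshold` are names from the text above; the `RouterState` class, the argument list, and `eligible_workers` are assumptions made for illustration.
+
+```python
+from collections import defaultdict
+
+
+class RouterState:
+    def __init__(self, migration_reject_threshold=3):
+        self.migration_reject_threshold = migration_reject_threshold
+        self.session_d_rejects = defaultdict(int)
+
+    def record_admission_reject(self, session_id, d):
+        # Called where the rejection signal is observed, not inferred from a label.
+        self.session_d_rejects[(session_id, d)] += 1
+
+    def eligible_workers(self, session_id, workers):
+        # Skip any D that has rejected this session `threshold` times or more.
+        return [
+            d for d in workers
+            if self.session_d_rejects.get((session_id, d), 0) < self.migration_reject_threshold
+        ]
+
+
+state = RouterState()
+for _ in range(3):
+    state.record_admission_reject(1001172, "D1")
+print(state.eligible_workers(1001172, ["D0", "D1", "D2"]))  # ['D0', 'D2']
+```
+
+Because the counter is driven by the signal itself, it no longer matters which `execution_mode` string the request ends up with.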
+
+---
+
+## 6. What this experiment actually shows
+
+1. **The H200 + driver 570 + cu12.8 toolchain works for production-scale SGLang xPyD workloads.** Both runs completed without CUDA / driver / mooncake errors; failures are policy- and workload-level, not infrastructure.
+2. **The KVC v2 + kv-aware policy combination has a latent pathology on workloads with high cross-session prefix overlap**: the overlap term in the lex score causes permanent load imbalance, and v2's reject-counter migration cannot rescue it because rejects only fire under capacity pressure, by which point timeouts already dominate. This is novel and not surfaced by the SWE-Bench evaluation in the existing project docs.
+3. **For Inferact-like workloads, a cold-D bonus (e.g. require D to host at least one session before its overlap score counts) or an explicit pre-warm step is required** before E1/E2 comparisons can isolate the marginal effect of the KVC layer.
+
+---
+
+## 7. Reproducibility
+
+- Trace: `outputs/inferact_50sess.jsonl`, md5 `7bb263a32600ef5a6ef5099ba340a487`, regenerable via `scripts/sample_trace_subset.py`.
+- E1: `bash scripts/sweep_e1_naive_1p3d.sh` (1h 29 min wall)
+- E2: `bash scripts/sweep_e2_kvc_v2_rdma.sh` (1h 33 min wall)
+- Summary JSON paths:
+  - `outputs/e1_naive_1p3d_kvaware_rdma_50sess/e1_naive_1p3d_kvaware_run1_summary.json`
+  - `outputs/e2_kvc_v2_rdma_50sess/e2_kvc_v2_rdma_run1_summary.json`
+- Per-request metrics JSONL alongside each summary, plus structural events under `*/structural/`.
+
+---
+
+## 8. Open follow-ups for the next agent
+
+1. **Add a cold-D bonus** to `KvAwarePolicy.select` (e.g. positive constant for D with `state.resident[D] == ∅`) and re-run E2 on the same subset. Predict: D2 receives bindings, failure rate drops, head-to-head with E1 becomes meaningful.
+2. **Rerun E2 with `--kvcache-admission-mode router`** (router-side optimistic admission instead of worker RPC) to isolate whether the strict worker admission is the contributor to the 1054 failures, or whether it's purely the imbalance.
+3. **Run a third arm E0 with `policy=default` + `mechanism=pd-disaggregation`** as a true control — kv-aware policy is itself part of what we are evaluating; default round-robin would have spread sessions across all 3 D.
+4. **Compare TTFT p99 against an Inferact-on-TCP-loopback run** to evaluate H2/H3 cleanly. Cost: 1 more E2-shaped sweep (~1.5 h).
+5. **Investigate the 1054 E2 failures** in `request-metrics.jsonl` — sample some to verify they are timeout-related vs admission-rejected vs upstream-500.
+
+---
+
+## 4. Comparison table — pending
+
+To be appended.
+
+---
+
+## 5. Open questions for the next iteration
+
+- Are the 85 E1 errors all timeouts? `request-metrics.jsonl` rows with `error` execution_mode should be sampled to confirm. (Quick check: grep the metrics jsonl for `"execution_mode": "pd-disaggregation"` and inspect `latency_s` / `error` fields.)
+- Does E2 produce the predicted ~91% direct-to-D rate seen in the historical SWE-Bench v2 run, or does the Inferact workload, with a similar session count (50 vs 52 there) but a very different per-session size distribution (mean 33 turns × ~2KB context growth per turn), push it lower?
+- Is `D2 = 0%` an E1-specific artifact (kv-aware sticky in pd-disagg mode), or does the same happen in E2 before migration kicks in for the first time?
--- a/docs/E3_FINDINGS_ZH.md
+++ b/docs/E3_FINDINGS_ZH.md
@@ -0,0 +1,129 @@
+# E3 — first run findings + bug exposure
+
+**Status**: E3 first attempt aborted at ~16 min wall by SGLang assertion crash on decode-1. Partial data confirms the load-floor bonus works as designed; the crash is an independent vendored-SGLang bug exposed by E3's new routing pattern.
+
+**Branch**: `h200-cu130`.
+**Companion**: `docs/E1_E2_RESULTS_ZH.md`, `docs/E1_E2_FIX_DESIGN_ZH.md`.
+
+---
+
+## 1. What worked: load-floor bonus (K=200)
+
+Within the first ~15 minutes of E3, before the crash:
+
+| | E1 (run1) | E2 (run1) | E3 (run1, partial) |
+|---|---:|---:|---:|
+| total bindings | 1285 | 1186 admit attempts | 1001 |
+| decode-0 bindings | 575 | 600 | 240 (24.0%) |
+| decode-1 bindings | 710 | 685 | 536 (53.5%) |
+| **decode-2 bindings** | **0** | **0** | **225 (22.5%)** |
+| unique sessions on D2 | 0 | 0 | **30** |
+
+**Load-floor bonus successfully broke the overlap-pinning death spiral.** D2 is finally getting traffic on Inferact's shared-boilerplate workload. The graduated formula (`K * deficit / mean`) plus the `not sticky` gate produces the intended behavior: fresh sessions land on under-loaded D's, established sessions keep going to their original D for cache locality.
+
+This validates the Q2.B design from `docs/E1_E2_FIX_DESIGN_ZH.md` empirically — but only as far as the run got. End-to-end metrics (lat / TTFT / failure rate) are not interpretable yet because the worker died.
+
+## 2. The new crash: SGLang streaming-session correction leaves an invariant violated
+
+At `01:51:21` (~5 min into the benchmark), decode-1 hit:
+
+```
+[01:51:21] Correcting streaming-session extend_input_len from 6648 to 0
+  (rid=6f4318e93dd543a49dbf19248cfc1e6f, session_id=1000195,
+   fill_len=6648, prefix_len=43459, kv_committed_len=43459)
+[01:51:21] Scheduler hit an exception: AssertionError
+  at third_party/sglang/python/sglang/srt/managers/schedule_batch.py:1646
+  → assert seq_len - pre_len == req.extend_input_len
+```
+
+### Mechanism
+
+With `--enable-streaming-session`, SGLang's session_aware_cache hands the scheduler a request whose `fill_ids` holds only the new tokens since the last turn (6648), while `prefix_indices` represents the prefix already cached on this D (43459 tokens). When `prefix_indices` is longer than `fill_ids` (e.g., the new turn's input is short relative to the conversation history already in cache), this code path fires at `schedule_batch.py:1572-1585`:
+
+```python
+actual_extend_len = max(0, len(req.fill_ids) - len(req.prefix_indices))
+if req.extend_input_len != actual_extend_len:
+    logger.warning("Correcting streaming-session extend_input_len from %d to %d ...")
+    req.set_extend_input_len(actual_extend_len)
+```
+
+So `req.extend_input_len` becomes `max(0, 6648 - 43459) = 0`.
+
+Then at line 1588-1590:
+
+```python
+seq_lens = [len(r.fill_ids) for r in reqs]       # 6648
+prefix_lens = [len(r.prefix_indices) for r in reqs]  # 43459
+```
+
+And at line 1646:
+
+```python
+assert seq_len - pre_len == req.extend_input_len  # 6648 - 43459 == 0 → FAIL
+```
+
+The correction patches `extend_input_len` but the downstream invariant is computed from raw `fill_ids`/`prefix_indices` lengths, which the correction never touched. The arithmetic check is fundamentally incompatible with the corrected state.
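+
+The mismatch can be reproduced with plain integers. The sketch below stands in for the vendored code: `Req` is a hypothetical stand-in that stores lengths directly instead of `fill_ids` / `prefix_indices` lists, and the numbers are the ones from the crash log.
+
+```python
+from dataclasses import dataclass
+
+
+@dataclass
+class Req:
+    fill_len: int          # stands in for len(req.fill_ids)
+    prefix_len: int        # stands in for len(req.prefix_indices)
+    extend_input_len: int
+
+
+def correct(req):
+    # Mirrors the correction: clamp the extend length at zero.
+    req.extend_input_len = max(0, req.fill_len - req.prefix_len)
+
+
+def invariant_holds(req):
+    # Mirrors the assertion: computed from the raw lengths, not the clamped value.
+    return req.fill_len - req.prefix_len == req.extend_input_len
+
+
+req = Req(fill_len=6648, prefix_len=43459, extend_input_len=6648)
+correct(req)
+print(req.extend_input_len, req.fill_len - req.prefix_len, invariant_holds(req))
+# 0 -36811 False
+```
+
+The clamp can never produce a negative value, so the invariant fails for every request with `fill_len < prefix_len`.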
+
+### Provenance
+
+The streaming-session correction (`schedule_batch.py:1572-1585`) and the assertion site (line 1646) are both inside the project's SGLang vendor patches — `git log` on this file shows the patch came from commit `b8e6f13 feat(sglang): support decode session cache admission`. So this is a regression in the project's own SGLang fork, not upstream SGLang.
+
+### Why E3 triggers it and E2 didn't
+
+The crash is independent of migration (session 1000195 stayed on decode-1 the entire time). Two factors combined to expose it in E3:
+
+1. **D1 was under more sustained load in E3** — 536 bindings on 17 unique sessions means high re-binding density per session, which means more concurrent turns of the same session at the scheduler, increasing the rate at which streaming-session corrections fire.
+2. **Faster overall dispatch** — with D2 actually consuming work, the prefill→decode pipeline moves faster, so streaming-session entries reach the corrected state more often than in E2's saturated cap-out regime.
+
+Both factors are consequences of the load-floor fix, which exposes the crash rather than causing it. The crash is a pre-existing landmine in the vendored streaming-session code that E1 and E2 happened to avoid because their pipelines stalled before sessions accumulated enough committed prefix to trigger the correction.
+
+---
+
+## 3. Decision space for the fix
+
+| # | Fix | Location | Change | Risk / notes |
+|---|---|---|---|---|
+| **A** | Patch the assertion to match the corrected state | vendored SGLang `schedule_batch.py:1646` | Add: `if req.extend_input_len == 0 and len(req.fill_ids) < len(req.prefix_indices): continue` to skip degenerate reqs before iterating. | Local, scoped, doesn't touch correctness elsewhere. Need to handle the skipped reqs (set `was_skipped` flag, drop from batch). |
+| **B** | Fix the correction site to also drop the req from the batch | vendored SGLang `schedule_batch.py:1572-1585` | When `actual_extend_len == 0` and req has nothing to extend, signal upstream to remove the req from this batch (defer or drop). | Slightly more invasive. The upstream call path needs to handle a "filtered" return. |
+| **C** | Compute `seq_lens` and `prefix_lens` consistently with the correction | vendored SGLang `schedule_batch.py:1588-1590` | After correction, recompute `seq_lens = [len(r.fill_ids[:pre_len] + extension)]` or align both sides. | Risky; affects all downstream tensor sizing. |
+| **D** | Workaround: disable session migration in E3 (the trigger combination) | our `cli` flag `--kvcache-migration-reject-threshold 0` | One-line config change in `sweep_e3_*.sh`. | Doesn't actually fix the crash — session 1000195 didn't migrate. May reduce but not eliminate. Might still hit it on a different session. |
+| **E** | Workaround: disable streaming session | server flag, remove `--enable-streaming-session` | Sidesteps the entire correction path. | Loses KVC's direct-to-D fast path (the central perf win we measure). Defeats the experiment. |
+
+### Recommendation
+
+**Fix A** — patch `schedule_batch.py:1646` to skip the malformed req before asserting. It's the minimal-blast-radius change and matches the apparent intent of the correction (graceful handling of the degenerate state).
+
+Concretely:
+
+```python
+# Just before the assertion at line ~1646
+if req.extend_input_len == 0:
+    # The streaming-session correction zeroed extend_input_len because
+    # prefix_indices already covers fill_ids. Skip this req from the
+    # extend batch — its KV is already committed; nothing to compute.
+    skip_indices.append(i)
+    continue
+```
+
+Then the caller of `prepare_for_extend` needs to handle skipped requests (return them to the decode queue without an extend pass).
+
+**Avoid Fix D/E** — D doesn't address the root cause (the failing session didn't migrate), and E loses the entire reason we're running this experiment.
+
+---
+
+## 4. Decision points for review
+
+| # | Question | Default if no answer |
+|---|---|---|
+| D1 | Implement Fix A (vendor patch to skip zero-extend-len reqs)? | **Yes** |
+| D2 | Re-run E3 with same K=200, same subset, after the fix? | Yes |
+| D3 | Add a structural log entry every time the correction fires so we can track its frequency? | Recommended |
+| D4 | File this as a separate `feat(sglang)` commit on the branch so the patch and the failure case it fixes are traceable? | Yes |
+
+---
+
+## 5. What this tells us about KVC v2 maturity
+
+The load-floor bonus's first real exposure to the production codepath uncovered an existing patch bug that was masked by E2's failure cascade. This is good news: the failure cascade in E2 was hiding *another* layer of breakage. Without rebalancing, sessions cap-out → cascade → never run long enough to commit deep prefixes → never hit the streaming-session correction → never crash. With rebalancing, sessions DO commit deep prefixes → trigger the correction → crash.
+
+Each fix tends to expose the next-shallowest bug. This is expected for a stack of ~6 interacting subsystems (kv-aware policy, KVC admission, session_aware_cache, streaming session, mooncake transfer, prefill batch prep). The path forward is to keep patching, re-running, and pushing the failure boundary out.
--- a/docs/EVALUATION_PROTOCOL_ZH.md
+++ b/docs/EVALUATION_PROTOCOL_ZH.md
@@ -0,0 +1,185 @@
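+# Evaluation Protocol (Paper-quality)
+
+**Date**: 2026-05-12
+**Type**: evaluation protocol specification; covers every weakness M1–M6 listed in [AUDIT_AND_ROADMAP_ZH.md](AUDIT_AND_ROADMAP_ZH.md) §3.1
+**Audience**: collaborators running experiments; paper authors; artifact reviewers
+
+---
+
+## 0. General principles
+
+> Every number in the paper must be able to answer two questions:
+> 1. **How large is the sampling error?** (bootstrap CI, N, std)
+> 2. **Is the comparison fair?** (same trial, same trace, same token cap, same timeout, paired)
+
+The current sweep reports (`KVCACHE_CENTRIC_PROGRESS_ZH.md` / `V2_RESULTS_ZH.md`) satisfy neither. This document provides a compliant template.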
+
+---
+
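+## 1. Evaluation dimensions (one fix per weakness, M1–M6)
+
+### 1.1 M1: statistical significance
+
+| Decision | Rule |
+|---|---|
+| Minimum runs `N` per config | **3** (headline numbers) / **5** (final ablation values) |
+| Reported statistics | `mean ± std`, **plus a 2.5/97.5 bootstrap CI** |
+| Multi-run aggregation | Append the per-request latencies of every run, then bootstrap over the pooled set; do not take a per-run mean and then average the means |
+| Significance of differences | Paired bootstrap p-value (≥ 5000 samples) |
+| `N=1` is allowed only for | smoke / sanity checks; **never in a headline table** |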
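+
+The pooled aggregation rule can be sketched with the standard library alone. This is a minimal percentile-bootstrap example, not the repo's analysis script; `bootstrap_ci` is a hypothetical helper and the latencies are made up.
+
+```python
+import random
+import statistics
+
+
+def bootstrap_ci(samples, stat=statistics.median, n_boot=2000, alpha=0.05, seed=0):
+    """Percentile bootstrap CI for `stat` over one pooled list of samples."""
+    rng = random.Random(seed)
+    n = len(samples)
+    stats = sorted(stat(rng.choices(samples, k=n)) for _ in range(n_boot))
+    lo = stats[int((alpha / 2) * n_boot)]
+    hi = stats[min(n_boot - 1, int((1 - alpha / 2) * n_boot))]
+    return lo, hi
+
+
+# Made-up per-request latencies (seconds) from three runs of one config.
+runs = [
+    [1.2, 1.4, 1.1, 3.9, 1.3],
+    [1.5, 1.2, 1.6, 1.3, 4.2],
+    [1.1, 1.3, 1.4, 1.2, 3.7],
+]
+pooled = [x for run in runs for x in run]  # append first, then resample
+lo, hi = bootstrap_ci(pooled)
+print(f"p50 = {statistics.median(pooled):.2f}, 95% CI = [{lo:.2f}, {hi:.2f}]")
+```
+
+Passing a different `stat` (for example a p99 function) reuses the same resampling loop for the other columns of the results table.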
+
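+### 1.2 M2: fair paired comparison
+
+| Decision | Rule |
+|---|---|
+| Trace fixity | Use the same `samples-*.jsonl` file; lock replay with `--use-trace-as-sample` |
+| Timeout | Same `--request-timeout-s` for every mechanism; one arm at 600s and another at 300s is not allowed |
+| Token cap | Same `--max-input-len` (take the minimum across all baselines and truncate explicitly) |
+| Errors / aborts | Do **not** count only successful requests; list aborts and timeouts separately under `error_count`, and report metrics over the full set (errors included) or paired on the same trial mask |
+| Time window | Consistent `time_scale`; no changes within a sweep |
+| Worker count / GPU type | Identical; any topology difference must be noted |
+
+**Counterexample**: the current `E1 vs E2` table ([E1_E2_RESULTS_ZH.md](E1_E2_RESULTS_ZH.md) §4) explicitly states "not a fair head-to-head": E2 fails 80% of requests, and its successful-only latency is compared against E1's full set. **A table like this cannot go into the paper as is.**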
+
+### 1.3 M3: trace stratification
+
+| Dimension | Suggested buckets |
+|---|---|
+| `turn_id` | `{1, 2-5, 6-20, 21+}` |
+| `append_len` | `{≤128, 128-1K, 1K-8K, >8K}` |
+| `overlap_ratio` | `{≤0.3, 0.3-0.7, >0.7}` |
+| `inter_turn_gap_s` | `{≤5, 5-30, 30-300, >300}` |
+| `input_len` | `{≤8K, 8K-64K, >64K}` |
+
+**Reporting requirement**: beyond the headline numbers, include at least one heatmap over `turn_id × append_len` so reviewers can see which slice the gain comes from.
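+
+The cells of such a heatmap can be counted with two bucketing functions. This sketch assumes 1K = 1024 tokens and inclusive upper bounds, neither of which the table specifies; the row fields `turn_id` and `append_len` follow the dimension names above, and the sample rows are made up.
+
+```python
+from collections import Counter
+
+
+def turn_bucket(turn_id):
+    if turn_id <= 1:
+        return "1"
+    if turn_id <= 5:
+        return "2-5"
+    if turn_id <= 20:
+        return "6-20"
+    return "21+"
+
+
+def append_bucket(append_len):
+    # Assumption: 1K = 1024 tokens and upper bounds are inclusive.
+    if append_len <= 128:
+        return "<=128"
+    if append_len <= 1024:
+        return "128-1K"
+    if append_len <= 8192:
+        return "1K-8K"
+    return ">8K"
+
+
+def heatmap_counts(rows):
+    """Count requests per (turn_id bucket, append_len bucket) cell."""
+    return Counter((turn_bucket(r["turn_id"]), append_bucket(r["append_len"])) for r in rows)
+
+
+rows = [
+    {"turn_id": 1, "append_len": 12000},
+    {"turn_id": 3, "append_len": 200},
+    {"turn_id": 4, "append_len": 900},
+    {"turn_id": 25, "append_len": 64},
+]
+print(heatmap_counts(rows))
+```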
+
+**Counterexample**: the current Real Ali experiment reports -46% p50 only on the KVC-fit slice (high overlap + small append + 100% direct-eligible). That is an upper bound, not an average. A paired table on full Ali must be reported alongside it.
+
+### 1.4 M4: baseline matrix
+
+Run at least **2** of the following baselines:
+
+| Baseline | Category | Source |
+|---|---|---|
+| vLLM + automatic prefix caching | Same-model, single-worker prefix cache | vLLM main |
+| SGLang DP cache-aware (4×TP1) | Current primary baseline | SGLang vendored in this repo |
+| SGLang PD-disaggregation (kv-aware) | Naive but cache-aware topology | This repo |
+| DistServe | P/D disaggregation baseline | DistServe upstream |
+| SplitWise | P/D split + adaptive routing | open-source impl |
+| Mooncake-Master scheduler | Same-generation design | mooncake-master |
+
+**Also recommended**: run an "oracle" baseline that assumes `Σ.resident[d]` is perfectly known and admission never fails, as an upper-bound reference for KVC.
+
+### 1.5 M5: trace mix
+
+| Trace | Purpose |
+|---|---|
+| Ali coding agent (full) | Main result; includes single-turn dilution |
+| Ali KVC-fit slice | Demonstrates the KVC upper bound |
+| SWE-Bench 50 sess | Already available; multi-turn, high-overlap workload |
+| ShareGPT | Contrasting chat workload (short turns, low overlap). **Used to show that KVC does not regress on an ill-suited workload** |
+| Inferact | Tool-use-heavy agent workload |
+| Mooncake trace | Baseline trace for single-turn LLM serving |
+| Synthetic adversarial | Self-built: a burst of 100 new sessions seeding simultaneously, to validate robustness against mooncake death and the robustness of reset-on-success |
+
+**Minimum mix**: Ali full + SWE-Bench + ShareGPT + Synthetic adversarial.
+
+### 1.6 M6: hardware coverage
+
+| Tier | Purpose |
+|---|---|
+| Single node, ≤ 8 GPU | All current results |
+| Two nodes, NVLink + IB | Validates cross-node D→P sync and mooncake behavior |
+| 4-node cluster (≥ 16 GPU) | Scaling numbers, cluster scheduler assumptions |
+| Heterogeneous (H100 + L40S) | Topology-aware routing |
+
+**Minimum mix**: single node 4×H200 + two nodes NVLink + IB. The remaining two tiers can go to future work.
+
+---
+
+## 2. Report templates
+
+### 2.1 Main results table (Table 1)
+
+```
+| Config | N | mean ± std | p50 [CI] | p90 [CI] | p99 [CI] | err% | timeout% |
+|--------|---|------------|----------|----------|----------|------|----------|
+```
+
+Annotate with: trace name, time_scale, `max_input_len`, `request_timeout_s`, and every other shared parameter.
+
+### 2.2 Paired delta table
+
+```
+| Pair | N pairs | mean delta [CI] | p50 delta [CI] | wins / losses | p-value |
+```
+
+`N pairs` = number of trials that succeeded on both sides. `wins` = number of trials with `latency_kvc < latency_baseline`.
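+
+The columns of the paired delta table can be computed as in the sketch below. It is an illustration, not the planned `paired_compare.py`: the function name, the dict-based input, and the sign-based two-sided p-value convention are all assumptions, and the latencies are made up.
+
+```python
+import random
+import statistics
+
+
+def paired_summary(kvc, baseline, n_boot=5000, seed=0):
+    """Paired bootstrap over trials that succeeded on both sides.
+
+    `kvc` and `baseline` map trial id -> latency in seconds, or None on failure.
+    """
+    shared = sorted(k for k in kvc.keys() & baseline.keys()
+                    if kvc[k] is not None and baseline[k] is not None)
+    deltas = [kvc[k] - baseline[k] for k in shared]
+    rng = random.Random(seed)
+    means = sorted(statistics.fmean(rng.choices(deltas, k=len(deltas)))
+                   for _ in range(n_boot))
+    # Two-sided p-value from the sign of the resampled mean delta (one common convention).
+    frac_nonneg = sum(m >= 0 for m in means) / n_boot
+    frac_nonpos = sum(m <= 0 for m in means) / n_boot
+    return {
+        "n_pairs": len(deltas),
+        "mean_delta": statistics.fmean(deltas),
+        "ci": (means[int(0.025 * n_boot)], means[min(n_boot - 1, int(0.975 * n_boot))]),
+        "wins": sum(d < 0 for d in deltas),
+        "losses": sum(d > 0 for d in deltas),
+        "p_value": min(1.0, 2 * min(frac_nonneg, frac_nonpos)),
+    }
+
+
+kvc = {"t1": 1.0, "t2": 2.0, "t3": None, "t4": 3.0}
+base = {"t1": 1.5, "t2": 1.8, "t3": 2.0, "t4": 3.5}
+summary = paired_summary(kvc, base)
+print(summary["n_pairs"], summary["wins"], summary["losses"])  # 3 2 1
+```
+
+Trial `t3` failed on the KVC side, so it is excluded from the pairs; it still has to appear in `err%` in the main table.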
+
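+### 2.3 Stratified tables (Table 2)
+
+One table per stratification dimension (§1.3).
+
+### 2.4 Negative-result section (mandatory)
+
+The paper must have a dedicated section listing:
+
+- The concrete numbers where KVC is slower than the baseline on ShareGPT.
+- Which percentiles / slices of each trace KVC does not win.
+- The diagnostic chain for failed sweeps (mooncake death, E3 crash).
+
+→ Reviewers who see honest negative results come away with a markedly better impression. The draft in [V2_DEEP_ANALYSIS_ZH.md](V2_DEEP_ANALYSIS_ZH.md) §4 can be expanded into this section.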
+
+---
+
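+## 3. Tooling support (scripts this repo needs)
+
+| Script | Status | Notes |
+|---|---|---|
+| `scripts/analysis/recompute_summary.py` | ✅ exists | Repairs abort-polluted latency; main data entry point for this protocol |
+| `scripts/analysis/stratified.py` | ⏳ new on this branch | Buckets by the §1.3 dimensions and emits tables |
+| `scripts/analysis/paired_compare.py` | ⏳ new on this branch | Paired bootstrap; emits the §2.2 table |
+| `scripts/analysis/plot_*` | ✅ exists | TTFT PDF, GPU utilization, cache efficiency |
+
+→ Once the stratified + paired scripts on this branch land, collaborators running experiments can produce the tables with a single command.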
+
+---
+
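+## 4. Artifact requirements (SOSP/OSDI AE)
+
+| Item | Standard |
+|---|---|
+| Dockerfile | A single `Dockerfile.artifact` that starts on 4×A100/H100 |
+| One-command script | `bash artifact/reproduce_main_table.sh` produces Table 1 within 1 hour |
+| Dataset | Ship an `outputs/sample-*.jsonl` subset (within ~5GB is acceptable); full Ali is obtained by following instructions |
+| Reproducibility bar | A bootstrap CI that overlaps the paper's counts as reproduced; bit-exact is not required |
+| Documentation | `artifact/README.md`, listing the command behind every table / figure |
+
+→ Prepare the artifact only after §M1 of the roadmap is fixed.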
+
+---
+
+## 5. Self-check list (use before submitting a paper draft)
+
+- [ ] Every table has N ≥ 3 and includes mean±std and a 95% CI.
+- [ ] The phrase "successful only" appears nowhere; all errors are counted in `err%`.
+- [ ] All baselines use the same `max_input_len` / `request_timeout_s` / `time_scale`.
+- [ ] At least 3 traces + 1 synthetic adversarial.
+- [ ] At least 1 non-SGLang baseline.
+- [ ] A negative-result section exists.
+- [ ] Dilution data for KVC on single-turn workloads is included.
+- [ ] Formalization: Algorithm 1/2/3 + Theorem 1/2, plus Theorem 4 once D→P sync is complete.
+- [ ] Failure-mode forensics: mooncake death, E3 crash, and cold-D all go into §Limitations or §Discussion.
+
+---
+
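+## 6. Link to the roadmap
+
+- [ ] Phase A: implement `scripts/analysis/stratified.py` + `scripts/analysis/paired_compare.py` on this branch (no GPU needed).
+- [ ] Phase B: regenerate stratified / paired tables from the existing `kvc-real-ali-iter-v1` 600-req/15min data with the new tools and store them under `outputs/` (no GPU rerun needed).
+- [ ] Phase C: run the ShareGPT + Synthetic adversarial baselines (~12h of GPU).
+- [ ] Phase D: pick 1 non-SGLang baseline (vLLM + prefix caching recommended) to complete M4 (~24h of GPU).
+
+---
+
+**Key takeaway**: the current results "look like a win", but once re-reported under this protocol the magnitude of the win will shrink, the winning slices will narrow, and the negative slices will be exposed. Every paper has to go through this; the earlier it is done, the cheaper it is.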
--- a/docs/FAILURE_MODES_ZH.md
+++ b/docs/FAILURE_MODES_ZH.md
@@ -0,0 +1,222 @@
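+# Failure-mode Taxonomy
+
+**Date**: 2026-05-13
+**Type**: consolidated checklist + diagnostic handbook
+**Audience**: collaborators who hit a failure mid-experiment and need an immediate lookup; authors who need citations for paper §Limitations; the answer for a reviewer who asks "why do you think this run will be more stable?"
+
+This document organizes the failure modes identified so far into one table, each as "symptom → root cause → trigger → current mitigation → real fix". Every entry carries a forensic link back to the original experiment doc.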
+
+---
+
+## 0. TL;DR
+
+Five identified failure modes, grouped by whether they block a paper claim:
+
+| Class | Name | Blocks paper | Real fix |
+|---|---|:---:|---|
+| **A. Control-plane cascade** | Mooncake "instance not alive" cascade | ✅ | admission backoff + per-D pending-seed budget |
+| **B. Routing bias** | Cold-D / overlap-pinning | ✅ | first-principles overlap term redefinition |
+| **C. KV thrash** | Evict storm (session-level evict) | ✅ | [BLOCK_LEVEL_EVICTION_DESIGN_ZH.md](BLOCK_LEVEL_EVICTION_DESIGN_ZH.md) |
+| **C'. KV thrash** | Reseed storm (concurrent large turn-1 seeds) | ✅ | per-D pending-seed budget + (frequency drops on its own once C is mitigated) |
+| **D. Vendor invariant** | streaming-session correction invariant crash (E3) | ❌ (hotfix landed) | remove the correction path (after block-level evict is done) |
+
+A, B, and C must be resolved in Milestone 1; C' is a secondary cause of A; D has a temporary hotfix, but its root fix is tied to C.
+
+---
+
+## 1. A: Mooncake "instance not alive" cascade
+
+### 1.1 Symptoms
+
+- Client side: `RuntimeError: generate stream ended before producing any token`
+- D scheduler log: `[mooncake] Decode instance could be dead, dropping ...`
+- The whole batch of requests is aborted; a single sweep drops from healthy to 80% failure within minutes ([E1_E2_RESULTS_ZH.md](E1_E2_RESULTS_ZH.md), E2: 1054 / 1285 failed)
+
+### 1.2 Root cause (forensic chain)
+
+```
+admission no-space (D KV pool full)
+    → router immediately falls back to the seed/reseed path
+    → multiple concurrent seeds hit mooncake P→D at once
+    → P→D egress queues up, handshake phase times out
+    → mooncake marks the peer dead
+    → SGLang aborts every in-flight req on the dead link
+    → client sees a batch of interrupted generate streams
+```
+
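+### 1.3 Trigger conditions
+
+- D KV pool is near full (≥ ρ·K_d, default 0.95)
+- The router fallback chain launches multiple reseeds within a millisecond-scale window
+- mooncake heartbeat times out (the default window is short)
+
+### 1.4 Current mitigations
+
+- `--kvcache-seed-min-turn-id=2` skips the large turn-1 seed and reduces the initial burst (stable config on the main branch)
+- `--mc-transfer-timeout=1800s` as the default (commit 905d671) reduces false "dead" verdicts
+- `--request-timeout-s=180/300` keeps the client from seeing hour-long hangs, but does not stop the cascade itself
+
+→ These all treat the symptom, not the cause. E2 still fails 80% of requests on real 4×H200 NDR hardware ([E1_E2_RESULTS_ZH.md](E1_E2_RESULTS_ZH.md)).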
+
+### 1.5 Real fix (roadmap §S3)
+
+1. **Admission RPC backoff + jitter**: on rejection, do not fall back immediately; give the D scheduler room to breathe.
+2. **Per-D pending-seed budget**: at most K seeds in the transfer queue at any moment; the excess waits in line instead of bursting.
+3. **Decouple mooncake heartbeat from admission**: the admission path no longer implies "peer is alive".
+4. **Close the loop on the backpressure pause hint** (currently EXPERIMENTAL, [SGLANG_PATCH_INVENTORY_ZH.md](SGLANG_PATCH_INVENTORY_ZH.md) §2.3).
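+
+Item 2 above can be sketched as one semaphore per decode worker. This is a toy model under stated assumptions: `PendingSeedBudget` and `run_seed` are hypothetical names, the seed itself is faked with a sleep, and the real router may not be asyncio-based at all.
+
+```python
+import asyncio
+
+
+class PendingSeedBudget:
+    """Allow at most `max_pending` concurrent seeds per decode worker."""
+
+    def __init__(self, workers, max_pending):
+        self._sems = {d: asyncio.Semaphore(max_pending) for d in workers}
+
+    async def run_seed(self, d, seed_fn):
+        # Seeds beyond the budget wait here instead of hitting the transfer path at once.
+        async with self._sems[d]:
+            return await seed_fn()
+
+
+async def demo():
+    budget = PendingSeedBudget(["D0"], max_pending=2)
+    in_flight = 0
+    peak = 0
+
+    async def fake_seed():
+        nonlocal in_flight, peak
+        in_flight += 1
+        peak = max(peak, in_flight)
+        await asyncio.sleep(0.01)  # stands in for the P→D transfer
+        in_flight -= 1
+        return "ok"
+
+    results = await asyncio.gather(*(budget.run_seed("D0", fake_seed) for _ in range(5)))
+    return peak, results
+
+
+peak, results = asyncio.run(demo())
+print(peak, len(results))  # 2 5
+```
+
+All five seeds complete, but no more than two are ever in the transfer step at the same time.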
+
+---
+
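+## 2. B: Cold-D / overlap-pinning
+
+### 2.1 Symptoms
+
+- With N=k decode workers, only ~k-1 actually carry traffic; some D receive 0 bindings
+- The per-D load histogram is heavily skewed (E2: D0:600 / D1:685 / **D2:0**)
+- Overall throughput is bounded by the busiest D; raw latency is not necessarily worse, but capacity utilization is off by 33%+
+
+### 2.2 Root cause
+
+The Inferact / Ali coding agent traces open every session with ~12K of "system prompt + tool schema", and these 24-token blocks share hashes across all sessions. The kv-aware policy's `overlap` term reads them as "this D already has these hashes resident" → every new session's score is pushed toward D0/D1 (the first two to warm up) → D2 always has 0 overlap → is never selected → stays cold forever.
+
+### 2.3 Trigger conditions
+
+- Multi-session workload + shared boilerplate prefix
+- `migration_reject_threshold > 0` and rejects never fire (because D0/D1 are not yet full)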
+
+### 2.4 Current mitigation
+
+`KvAwarePolicy.load_floor_bonus` (commit 93fce42):
+
+```
+floor_bonus = K * max(0, mean - assigned) / max(1, mean)
+```
+
+In E3, D2's share of bindings rose from 0 to 22.5% ([E3_FINDINGS_ZH.md](E3_FINDINGS_ZH.md) §1).
+
+→ This is a patch, not a fix. `K` is a magic number; once the number of boilerplate hashes exceeds `K / sticky_bonus`, D2 still stays cold.
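+
+The formula can be evaluated directly on the E2 binding counts. This sketch only reproduces the arithmetic: the function signature is an assumption, `K=200` is the value used in E3, and the `not sticky` gate mentioned in the E3 findings is left out.
+
+```python
+def load_floor_bonus(assigned, k=200.0):
+    """Per-D bonus: K * max(0, mean - assigned) / max(1, mean)."""
+    mean = sum(assigned.values()) / len(assigned)
+    return {d: k * max(0.0, mean - n) / max(1.0, mean) for d, n in assigned.items()}
+
+
+# E2 binding counts per decode worker.
+bonus = load_floor_bonus({"D0": 600, "D1": 685, "D2": 0})
+print(bonus)
+```
+
+A D with zero assignments gets the full `K`, and any D at or above the mean gets nothing, so the bonus only ever pulls traffic toward under-loaded workers.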
+
+### 2.5 Real fix (roadmap §S5)
+
+Redefine `overlap` as **"the number of prefix hashes this session holds exclusively on this D"**:
+
+```
+exclusive_overlap(s, d) := |prefix_hashes(s) ∩ resident[d] ∩ session_owned[s]|
+```
+
+Here `session_owned[s]` excludes hashes that other sessions also hold. Shared boilerplate hashes do not count toward `exclusive_overlap`, so scores spread out naturally. This requires the D side to return a sketch (Bloom filter / minhash) of the per-session resident hash set in the `admit_direct_append` response.
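+
+With exact sets in place of a Bloom filter or minhash sketch, the definition reduces to a few set operations. The `holders` map (hash → sessions holding it) is hypothetical bookkeeping introduced only for this example, and the hash names are made up.
+
+```python
+def exclusive_overlap(session, prefix_hashes, resident_d, holders):
+    """|prefix_hashes(s) ∩ resident[d] ∩ session_owned[s]| for one session and one D.
+
+    `holders` maps block hash -> set of sessions holding it (hypothetical bookkeeping).
+    """
+    session_owned = {h for h in prefix_hashes if holders.get(h, set()) == {session}}
+    return len(prefix_hashes & resident_d & session_owned)
+
+
+holders = {
+    "boiler-0": {"s1", "s2"},   # shared system-prompt block
+    "boiler-1": {"s1", "s2"},
+    "s1-tail-0": {"s1"},
+    "s1-tail-1": {"s1"},
+}
+prefix_s1 = {"boiler-0", "boiler-1", "s1-tail-0", "s1-tail-1"}
+resident_d0 = {"boiler-0", "boiler-1", "s1-tail-0"}
+
+plain = len(prefix_s1 & resident_d0)
+exclusive = exclusive_overlap("s1", prefix_s1, resident_d0, holders)
+print(plain, exclusive)  # 3 1
+```
+
+The plain overlap counts the two boilerplate blocks, while the exclusive version counts only the block that belongs to `s1` alone.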
+
+---
+
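+## 3. C: Evict storm (session-level eviction)
+
+### 3.1 Symptoms
+
+- Under workloads that put D memory under pressure, a KV pool release spike of 30–90K tokens appears every 1–2 minutes
+- The next request from the same session triggers a `Reseed`: P re-prefills 50K + mooncake transfers 50K (3–7s)
+- The TTFT long tail is dominated entirely by these reseeds ([V2_DEEP_ANALYSIS_ZH.md](V2_DEEP_ANALYSIS_ZH.md) §3.2)
+
+### 3.2 Root cause
+
+`SessionAwareCache.release_session` frees `[cache_protected_len, kv_allocated_len)` in one shot, that is, the entire session-exclusive tail. Measured in E3: 90 evictions, 67,726 tokens freed per eviction on average, 25/50 sessions affected ([KVC_EVICTION_GRANULARITY_DESIGN_ZH.md](KVC_EVICTION_GRANULARITY_DESIGN_ZH.md) §0).
+
+→ This contrasts sharply with the leaf-by-leaf incremental eviction of SGLang's standard radix cache. This KV never enters the radix tree, so it gets none of the fine-grained nibbling that LRU provides.
+
+### 3.3 Trigger conditions
+
+- D KV pool is near full
+- `maybe_trim_decode_session_cache` is triggered by the scheduler (when `DecodePreallocQueue` detects `available_size() <= 0`)
+
+### 3.4 Current mitigations
+
+- `--kvcache-session-soft-cap=N` (main branch): caps the number of resident sessions on a D → trims earlier and avoids hitting the ceiling
+- `--kvcache-direct-max-uncached-tokens=8192` (v2): slows the rate at which the direct path consumes KV
+
+→ Both only slow the pace; neither addresses the root problem that a single free is too large.
+
+### 3.5 Real fix (roadmap §S1)
+
+[BLOCK_LEVEL_EVICTION_DESIGN_ZH.md](BLOCK_LEVEL_EVICTION_DESIGN_ZH.md): at the end of every turn, streaming-session decode output goes into the radix tree via `inner.cache_finished_req` → `release_session` degenerates to `dec_lock_ref` + slot deletion → radix LRU nibbles away 24-token leaves.
+
+Expected: a single eviction drops from 67K to ≤ 500 tokens; reseed frequency drops by an order of magnitude.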
+
+---
+
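+## 4. C': Reseed storm (concurrent large turn-1 seeds)
+
+### 4.1 Symptoms
+
+- During workload startup (first 30–60s), every session fires turn 1 at the same time
+- Multiple concurrent `Seed`s (~50–90K tokens each) hit mooncake → overlaps with the §1 cascade
+
+### 4.2 Root cause
+
+During startup `KvAwarePolicy` sees an empty `resident[d]` everywhere, so all D score the same, and ε retries + per-trial admit do not prevent concurrency.
+
+### 4.3 Trigger conditions
+
+- Under `time_scale=1` trace replay, sessions start simultaneously at the original arrival density
+- No per-D pending-seed rate limit
+
+### 4.4 Current mitigation
+
+- `--kvcache-seed-min-turn-id=2`: skips the turn-1 seed entirely (stable config on the main branch)
+- Side effect: turn-1 KV injection is lost, so turn 2 must reseed (which is actually more stable, because reseeds are spread out over time)
+
+### 4.5 Real fix
+
+- Per-D pending-seed budget (same as item 2 of §1.5)
+- Once §3.5 is done, eviction frequency drops on its own, which indirectly lowers reseed frequency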
+
+---
+
+## 5. D: Streaming-session correction invariant crash (E3 landmine)
+
+### 5.1 Symptoms
+
+- D scheduler raises `AssertionError` at `schedule_batch.py:1646`: `seq_len - pre_len == req.extend_input_len`
+- The entire D worker process exits → router sees the peer die → §1 cascade
+
+### 5.2 Root cause
+
+[E3_FINDINGS_ZH.md](E3_FINDINGS_ZH.md) §2: the streaming-session correction (commit b8e6f13) rewrites `extend_input_len` to `max(0, fill_len - prefix_len)`, but the downstream invariant is still computed from the raw fill_ids/prefix_indices. When `fill_len < prefix_len` (the prefix accumulated over many turns exceeds the current turn's increment), the invariant cannot hold arithmetically.
+
+### 5.3 Trigger conditions
+
+- The prefix a streaming session has committed across turns is longer than this turn's newly added fill_ids
+- E2 never reached this state because its pipeline was blocked; E3 fixed the cold-D bottleneck → faster pipeline → landmine exposed
+
+### 5.4 Current mitigation
+
+The pre-filter pass in commit 986f351 drops such reqs at the entry of `prepare_for_extend` (the client sees an error response instead of a worker crash). This only stops the bleeding.
+
+### 5.5 Real fix
+
+The whole correction path at `schedule_batch.py:1572–1646` is **structurally unnecessary** once the block-level eviction refactor is complete: [BLOCK_LEVEL_EVICTION_DESIGN_ZH.md](BLOCK_LEVEL_EVICTION_DESIGN_ZH.md) §3.7 explains that after the refactor, fill_ids / prefix_indices consistency is guaranteed automatically by the radix `match_prefix`.
+
+→ Do not add more correction clauses; delete the whole path.
+
+---
+
+## 6. Failure-diagnosis cheat sheet
+
+Look up the table below while running a sweep:
+
+| You see | Most likely | Check first |
+|---|---|---|
+| Client-side `RuntimeError: generate stream ended before...` | §1 cascade | Search the D scheduler log for `instance could be dead` |
+| One D has `binding=0` while the other D's are busy | §2 cold-D | `per_decode_load` histogram |
+| TTFT p99 suddenly rises to the 5–8s range | §3 evict storm | `release_session` call frequency + average freed tokens |
+| High failure rate during sweep startup, low in steady state | §4 reseed storm | Peak of the mooncake transfer queue in the first 30s |
+| D worker process exits abnormally | §5 invariant crash | Search the scheduler log for `AssertionError` and `extend_input_len` |
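
The log-searchable rows of the table can be turned into a small triage helper. This is a sketch only: the search strings come from the table, while `triage`, `SIGNATURES`, and the sample log lines are made up for illustration and do not reproduce real log output.

```python
import re
from collections import Counter

# Search strings from the cheat sheet, mapped to the section they point at.
SIGNATURES = [
    (re.compile(r"instance could be dead"), "§1 cascade"),
    (re.compile(r"generate stream ended before"), "§1 cascade"),
    (re.compile(r"AssertionError"), "§5 invariant crash"),
    (re.compile(r"extend_input_len"), "§5 invariant crash"),
]


def triage(log_text: str) -> Counter:
    """Count, per failure mode, how many log lines carry one of its signatures."""
    hits: Counter = Counter()
    for line in log_text.splitlines():
        # A line counts once per failure mode even if several patterns match it.
        labels = {label for pattern, label in SIGNATURES if pattern.search(line)}
        hits.update(labels)
    return hits


# Fabricated sample lines, used only to exercise the function.
SAMPLE_LOG = """\
[decode-1] AssertionError: seq_len - pre_len == req.extend_input_len
[decode-0] transfer failed, instance could be dead
[client] RuntimeError: generate stream ended before <rest elided>
[decode-2] batch scheduled
"""

report = triage(SAMPLE_LOG)
```

Because a §5 crash also produces §1 symptoms downstream, a report that shows both labels is a hint to look at the invariant crash first.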
+
+---
+
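+## 7. Connection to the roadmap
+
+- Items 1/3/4 of Milestone 1 in [AUDIT_AND_ROADMAP_ZH.md](AUDIT_AND_ROADMAP_ZH.md) correspond to the real fixes for C / A / B in this table, respectively. Once Milestone 1 is done, §1–§4 should all move from "unfixed" to "mitigated", and §5 disappears outright.
+- The paper's §Limitations must state the current situation honestly: "we identify five failure modes; A/C are addressed by this work, B/C' are partially addressed, D is a transient artifact of the in-progress refactor."
+
+---
+
+**Key takeaway**: treat failure modes as first-class artifacts. Giving every failure the five fields "symptom → root cause → trigger → mitigation → real fix" is a key tool for moving a prototype to production grade. Reviewers are far more convinced by seeing that you can enumerate your failures than by seeing you beat a baseline.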
--- a/docs/H200_DRIVER570_SETUP_ZH.md
+++ b/docs/H200_DRIVER570_SETUP_ZH.md
@@ -0,0 +1,270 @@
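+# Environment setup for running this repo on H200 + driver 570 (with a log of pitfalls)
+
+**Scope**: a 4× H200 node + NVIDIA driver `570.86.15` + this repo at `kvc-debug-journey-v1-to-v4` or a later branch.
+**Audience**: the next SWE/research agent who gets a fresh H200 machine and needs to bring up the vendored sglang 0.5.10 + mooncake RDMA + agentic-pd-hybrid quickly.
+**Status**: finalized at `h200-cu130 @ initial commit`; the smoke test has passed over RDMA with 16 reqs / 0 errors.
+
+---
+
+## 0. TL;DR (5 points)
+
+1. **The "CUDA Version: 13.0" shown by `nvidia-smi` is a trap**: it is the upper bound on the runtime the driver can run via forward compatibility, not the driver's own API version. Driver `570.86.15` provides the **cu12.8** driver API.
+2. `jit_kernel/` in the vendored sglang 0.5.10 uses `tvm_ffi` + ninja + the nvcc binary to compile each kernel on its first call. The only system nvcc is in `/usr/local/cuda-13.0/bin/`; a .so compiled with cu13 has a NEEDED entry for `libcudart.so.13`, which driver 570 refuses to run → `cudaErrorInsufficientDriver`.
+3. The fix is to **install a local cu12.8 toolkit into `$HOME/cuda-12.8`** (no root required) so that tvm_ffi uses the cu12.8 nvcc; the build output then has a NEEDED entry for `libcudart.so.12`, which driver 570 fully supports.
+4. The mooncake wheel (`mooncake-transfer-engine 0.3.10.post2`) is also a cu12 build and needs `libcudart.so.12`, which the `nvidia-cuda-runtime-cu12` package already provides inside the venv.
+5. Every shell **must `source scripts/setup_env.sh`** before running SGLang. This is already wrapped up in the script.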
+
+---
+
+## 1. 一次性 setup（约 25min）
+
+```bash
+cd /path/to/agentic-pd-hybrid
+
+# (1) Python 环境 (~3min)
+uv sync
+
+# (2) cu12.8 toolkit 本地装（~5GB 下载 + 5min 解压 = ~15-20min）
+mkdir -p /tmp/cuda_dl && cd /tmp/cuda_dl
+wget https://developer.download.nvidia.com/compute/cuda/12.8.1/local_installers/cuda_12.8.1_570.124.06_linux.run
+sh cuda_12.8.1_570.124.06_linux.run \
+  --silent --toolkit --override \
+  --installpath=$HOME/cuda-12.8 \
+  --tmpdir=$HOME/tmp \
+  --no-drm --no-man-page
+
+# (3) 验证
+$HOME/cuda-12.8/bin/nvcc --version   # 应该看到 release 12.8, V12.8.93
+
+# (4) 回到 repo 根目录,首次 source（每个 shell 都要做）
+cd /path/to/agentic-pd-hybrid
+source scripts/setup_env.sh
+```
+
+`source scripts/setup_env.sh` 输出应是：
+```
+agentic-pd-hybrid env ready:
+  CUDA_HOME=/home/<user>/cuda-12.8 (12.8, V12.8.93)
+  libcudart.so.12 at .../.venv/lib/python3.12/site-packages/nvidia/cuda_runtime/lib
+  MC_TRANSFER_TIMEOUT=1800s
+```
+
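+**`MC_TRANSFER_TIMEOUT=1800` (30 min) replaces mooncake's default of 30s.** The E2 forensics found that LRU eviction on the D side can starve the mooncake C++ control plane for 30+s, which trips the hair-trigger at `conn.py:1270` and permanently blacklists the D's entire mooncake_session_id. 1800s leaves ample headroom: only a D that has not responded for 30 minutes is truly "dead". See `docs/E1_E2_RESULTS_ZH.md §5c` for details. `stack.py` also sets a default of the same name for worker subprocesses.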
+
+---
+
+## 2. Smoke test（验证整条链路）
+
+把 16 个合成 request 喂给 1P3D 拓扑，启用真 RDMA，跑通后才能动 E1/E2 实验。
+
+```bash
+# 假设已 source scripts/setup_env.sh
+mkdir -p outputs/smoke_rdma
+
+uv run --no-sync python -m agentic_pd_hybrid.cli make-small-append-trace \
+  --output outputs/smoke_rdma/mini_trace.jsonl \
+  --session-count 4 --turns-per-session 4 \
+  --initial-input-length 1024 --append-input-length 200 --output-length 50 \
+  --inter-turn-gap-s 2 --session-stagger-s 1
+
+uv run --no-sync python -m agentic_pd_hybrid.cli benchmark-live \
+  --trace outputs/smoke_rdma/mini_trace.jsonl \
+  --output-root outputs/smoke_rdma \
+  --mechanism pd-disaggregation --policy default \
+  --model-path /mnt/models/Qwen/Qwen3-30B-A3B-Instruct-2507 \
+  --prefill-workers 1 --decode-workers 3 \
+  --prefill-tp-size 1 --decode-tp-size 1 \
+  --prefill-gpu-ids 0 --decode-gpu-ids 1,2,3 \
+  --transfer-backend mooncake \
+  --force-rdma --ib-device mlx5_60 \
+  --gpu-budget 4 --time-scale 1 \
+  --concurrency-limit 4 --timeout-s 1800 --request-timeout-s 300 \
+  --session-sample-rate 1.0 --min-turns 1 --target-duration-s 600
+```
+
+**The first run is slow, 8-15min** (196s model load + 5-10 JIT kernels at ~10-30s compile each + warmup). Later runs take only ~3-5min.
+
+**Expected result**: `request_count=16, error=0, abort=0, failure=0, execution_modes={'pd-disaggregation-router': 16}`.
+
+Every worker's log should contain `installTransport, type=rdma`, which confirms that mooncake really uses RDMA rather than TCP loopback.
+
+---
+
+## 3. GPU ↔ RDMA HCA 映射（本机实测）
+
+8 块 ConnectX HCA，全部 ACTIVE / 400 Gb/s NDR / RoCE v2 (link_layer=Ethernet, GID Index 3)。Mooncake 按 NUMA / PCIe affinity 自动选 preferred：
+
+| GPU | preferred HCA | NUMA |
+|---|---|---|
+| cuda:0 | mlx5_60 | 0 |
+| cuda:1 | mlx5_88 | 0 |
+| cuda:2 | mlx5_98 | 1 |
+| cuda:3 | mlx5_42 | 1 |
+
+CLI 的 `--ib-device <name>` 只接单个设备名，给所有 worker 全局 override。Smoke test 默认填 `mlx5_60`（P worker 在 cuda:0 上 NUMA-local；D worker 在其它 GPU 上是 cross-NUMA 但能跑）。E1/E2 实验如果想最优，可以分 P/D worker 独立设环境变量，但目前 stack.py 不支持 per-worker `MOONCAKE_DEVICE`，要么所有 worker 同一个，要么走 mooncake auto（需把 `MC_MS_AUTO_DISC=0` 改回 1）。
+
+完整 8 块 HCA：`mlx5_22, _27, _42, _60, _88, _98, _126, _135`（NUMA 0/1/0/0/0/1/0/1 混杂）。
+
+---
+
+## 4. 踩过的坑（按时间线）
+
+### Pitfall 1: the "CUDA Version: 13.0" in `nvidia-smi` is misleading
+
+The `nvidia-smi` header shows `Driver Version: 570.86.15 / CUDA Version: 13.0`, which suggests the machine supports cu13. **That number is the upper bound on the CUDA runtime the driver can run via forward compatibility**, not the version of the driver's own API. The driver API of driver 570 tops out at cu12.8 (see NVIDIA's "CUDA Compatibility" matrix).
+
+**How to check correctly**: run `torch.cuda.is_available()`. With a cu13 build of torch installed it reports `The NVIDIA driver on your system is too old (found version 12080)`. The reported `12080` is the driver's own API version (cu12.8).
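
The `12080` value follows CUDA's integer version encoding, `1000 * major + 10 * minor`. The helper below decodes it and compares it against a toolkit version; `decode_cuda_version` and `driver_supports` are illustrative names, and the comparison is a simplification that ignores NVIDIA's compatibility packages.

```python
def decode_cuda_version(raw: int) -> tuple[int, int]:
    """Split a CUDA version integer (1000 * major + 10 * minor) into (major, minor)."""
    return raw // 1000, (raw % 1000) // 10


def driver_supports(raw_driver_version: int, toolkit_major: int, toolkit_minor: int) -> bool:
    """Simplified check: the driver API version must be at least the toolkit version."""
    return decode_cuda_version(raw_driver_version) >= (toolkit_major, toolkit_minor)


driver_api = decode_cuda_version(12080)       # the value reported by the torch error
cu13_ok = driver_supports(12080, 13, 0)       # cu13 runtime on this driver
cu128_ok = driver_supports(12080, 12, 8)      # cu12.8 toolkit on this driver
```

On a live machine the raw integer is what `cuDriverGetVersion` (driver API) or `cudaDriverGetVersion` (runtime API) returns, so the same decoding applies there.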
+
+### 坑 2：vendor sglang vs pip sglang 的 patch 差异
+
+仓库的 `third_party/sglang/python/` 是带项目自有 patches 的 SGLang 0.5.10 fork。**pip 上的 `sglang==0.5.10` 不包含核心 patches**——具体差异：
+
+| 文件 | pip 版 | vendor 版 |
+|---|---|---|
+| `srt/managers/scheduler.py` | 3621 行 | 3938 行 |
+| `admit_direct_append` 出现次数 | 2 | **11** |
+| `DirectAppendAdmissionReqInput/Output` | 没有 | **有**（核心 RPC） |
+| `_should_allow_local_prefill_on_decode` | 没有 | 有 |
+| `maybe_trim_decode_session_cache` | 没有 | 有 |
+| `decode_direct_waiting_queue` | 没有 | 有 |
+
+→ **必须用 vendor 版**。本分支已把 `pyproject.toml` 的 `sglang==0.5.10` 改成 `sglang` + `[tool.uv.sources] sglang = { path = "third_party/sglang/python", editable = true }`，`uv sync` 后会自动 editable 安装 vendor 版。
+
+历史上有些 sweep 脚本用 `PYTHONPATH=src:third_party/sglang/python` 在运行时切换，但用 `uv.sources` 把它装进 venv 更彻底，不会被 pip 的 sglang 偷偷 shadow。
+
+### 坑 3：cu13 切换是死路
+
+发现 driver 570 不兼容时第一个想到的路径是「装 cu13 PyTorch」。试过：
+
+1. 改 `pyproject.toml` 加 `[[tool.uv.index]]` 指向 `https://download.pytorch.org/whl/cu130`
+2. 同样改 vendor sglang 的 `pyproject.toml`（root 项目的 sources 不会传递给 transitive editable dep）
+3. `uv sync` 成功装上 `torch==2.9.1+cu130` 和 `nvidia-{nccl,nvjitlink,nvshmem,cusparselt,nvtx}-cu13`
+4. **但 driver 570 不支持 cu13 runtime**——`torch.cuda.is_available()=False`，CUDA init 报 `driver too old (12080)`
+
+→ cu13 路径需要 **driver 580+**。我们没有 root + 别人在用机器，所以放弃。本分支已 rollback 到 cu12 stack（pyproject 干净）。
+
+### 坑 4：`--disable-overlap-schedule` 不够
+
+第一次 smoke 崩在 `resolve_future_token_ids.cuh:49`，路径是 `event_loop_overlap_disagg_prefill`，怀疑是 overlap 模式特定 JIT kernel 问题。
+
+cli.py 给 PD worker 加了 `--disable-overlap-schedule` 后，event loop 切到 `event_loop_normal_disagg_prefill`，但**崩在另一个 kernel `fused_inplace_qknorm`**，错误码完全相同（`cudaErrorInsufficientDriver`）。
+
+→ 不是 overlap-specific，是 **整体 vendor sglang `jit_kernel/` 模块和 driver 570 不兼容**，任何 JIT kernel 都会崩在 `runtime.cuh:21` 的 `cudaOccupancyMaxActiveBlocksPerMultiprocessor` 调用（CUDA runtime 初始化时 driver feature 版本检查失败）。
+
+但 `--disable-overlap-schedule` 留着不会造成伤害，且能避免之后类似 overlap-path 特定问题。本分支保留它在 `cli.py:_topology_from_args`。
+
+### 坑 5：pip sgl_kernel vs vendor sglang/jit_kernel/ 是两套系统
+
+`pip install sglang-kernel` 提供 `.venv/lib/.../sgl_kernel/{flash_ops,flashmla_ops,spatial_ops}.abi3.so`——这是 AOT 预编译产物。
+
+`third_party/sglang/python/sglang/jit_kernel/` 是 vendor SGLang 0.5.10 内置的 **另一套 JIT 模块**，运行时用 tvm_ffi 编译。Smoke 崩在 vendor 的 jit_kernel，**降级 pip sgl_kernel 没用**（实测 0.4.0 / 0.4.1 同样崩）。
+
+### 坑 6：`nvidia-cuda-nvcc-cu12` PyPI 包没装 nvcc binary
+
+发现 cu13 nvcc 是 root cause 后，第一反应是 PyPI 装 cu12 nvcc 包：
+
+```bash
+uv pip install nvidia-cuda-nvcc-cu12==12.8.93
+```
+
+装上以后 `find .venv -name nvcc` **返回空**——这个 PyPI 包只装 `ptxas` 和 `nvvm/`，**没有 nvcc binary**（NVIDIA 出于分发限制不把 nvcc 放 PyPI）。
+
+→ 完整 nvcc 必须从 NVIDIA 官方 `.run` installer 或 apt 装。`.run` installer 可以装到 user-writable 路径不需要 root，本仓库选这条路。
+
+### Pitfall 7: tvm_ffi invokes nvcc through ninja
+
+The `jit_kernel/` in vendored sglang uses `tvm_ffi.cpp.extension`, whose source is at `~/.local/lib/python3.12/site-packages/tvm_ffi/cpp/extension.py`. The key path:
+
+```python
+def _find_cuda_home() -> str:
+    cuda_home = os.environ.get("CUDA_HOME") or os.environ.get("CUDA_PATH")
+    if cuda_home is None:
+        nvcc_path = shutil.which("nvcc")
+        if nvcc_path is not None:
+            cuda_home = str(Path(nvcc_path).parent.parent)
+    ...
+```
+
+It then builds the ninja file:
+```
+nvcc = {_find_cuda_home()}/bin/nvcc
+```
+
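+→ **Setting `CUDA_HOME=$HOME/cuda-12.8` hooks the entire compile chain.** `scripts/setup_env.sh` already sets it.
+
+JIT build outputs are cached in `~/.cache/tvm-ffi/sgl_kernel_jit_*/*.so`. If kernels were previously compiled with the cu13 nvcc, run `rm -rf ~/.cache/tvm-ffi/sgl_kernel_jit_*` first and then recompile with cu12.8.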
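
Before launching a run it can help to confirm which nvcc the lookup quoted above would pick. The function below mirrors only the quoted part of `_find_cuda_home` plus the `{cuda_home}/bin/nvcc` line; `resolve_nvcc` is a hypothetical helper, not part of tvm_ffi, and whatever the real function does in its elided remainder is not modeled.

```python
import shutil
from pathlib import Path
from typing import Mapping, Optional


def resolve_nvcc(env: Mapping[str, str]) -> Optional[Path]:
    """Return the nvcc path implied by CUDA_HOME / CUDA_PATH, else by PATH lookup."""
    cuda_home = env.get("CUDA_HOME") or env.get("CUDA_PATH")
    if cuda_home is None:
        # Fall back to whichever nvcc is first on PATH, then go up two levels.
        nvcc_path = shutil.which("nvcc", path=env.get("PATH", ""))
        if nvcc_path is None:
            return None
        cuda_home = str(Path(nvcc_path).parent.parent)
    return Path(cuda_home) / "bin" / "nvcc"


# CUDA_HOME wins even when a cu13 nvcc directory is on PATH.
with_home = resolve_nvcc(
    {"CUDA_HOME": "/home/user/cuda-12.8", "PATH": "/usr/local/cuda-13.0/bin"}
)
# CUDA_PATH is used when CUDA_HOME is absent.
with_cuda_path = resolve_nvcc({"CUDA_PATH": "/opt/cuda"})
```

Calling `resolve_nvcc(os.environ)` in a fresh shell versus one that has sourced `scripts/setup_env.sh` shows whether the cu12.8 toolchain is actually in effect.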
+
+### 坑 8：mooncake import path 与 onboarding 文档不一致
+
+`docs/ONBOARDING_NEXT_AGENT_ZH.md` §3.3 的环境验证写：
+```python
+from mooncake_transfer_engine import TransferEngine
+```
+
+但实际 PyPI `mooncake-transfer-engine 0.3.10.post2` wheel 的 import path 是：
+```python
+from mooncake.engine import TransferEngine
+```
+
+第一次 `from mooncake_transfer_engine` 会 `ModuleNotFoundError`。**ONBOARDING 文档应该更新**（本分支不动 onboarding，留给主 agent 决定）。
+
+### 坑 9：mooncake.engine import 必须有 libcudart.so.12
+
+`from mooncake.engine import TransferEngine` 在 fresh shell（未 source setup_env.sh）下报：
+```
+ImportError: libcudart.so.12: cannot open shared object file: No such file or directory
+```
+
+mooncake 的 `engine.so` 是 cu12 build，dynamic link `libcudart.so.12`。venv 里有但需要 LD_LIBRARY_PATH 暴露。`scripts/setup_env.sh` 已加。
+
+### Pitfall 10: the Inferact dataset schema does not match what agentic-pd-hybrid expects
+
+`huggingface.co/datasets/Inferact/codex_swebenchpro_traces` is in ShareGPT format (`{"from": "human/gpt", "value": "<text>"}`) and carries no token counts, hash_ids, or timestamps.
+
+`agentic-pd-hybrid` expects JSONL with `chat_id, parent_chat_id, timestamp, input_length, output_length, type, turn, hash_ids[]`.
+
+→ `scripts/convert_inferact_to_trace.py` has been written for this: it tokenizes (with the model's own tokenizer), cuts 24-token blocks with a rolling hash, and synthesizes timestamps. Processing 610 trials × ~33 turns takes about 37min and yields 20,230 reqs (exactly matching the "20,230 total LLM calls" in the Inferact README).
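
The rolling-hash step can be illustrated as chained hashing over 24-token blocks, so that two requests sharing a token prefix also share their leading `hash_ids`. This is a sketch of the general technique, not the code in `scripts/convert_inferact_to_trace.py`: the SHA-256 choice, the 64-bit truncation, and dropping the trailing partial block are all assumptions.

```python
import hashlib

BLOCK_SIZE = 24  # tokens per block, as stated in the text


def rolling_block_hashes(token_ids: list[int], block_size: int = BLOCK_SIZE) -> list[int]:
    """Hash each full block together with the previous block's digest."""
    hashes: list[int] = []
    prev_digest = b""
    full_len = len(token_ids) - len(token_ids) % block_size  # assumption: drop the partial tail
    for start in range(0, full_len, block_size):
        block = token_ids[start:start + block_size]
        payload = prev_digest + ",".join(map(str, block)).encode()
        digest = hashlib.sha256(payload).digest()
        hashes.append(int.from_bytes(digest[:8], "big"))  # keep 64 bits as the block id
        prev_digest = digest
    return hashes


turn_a = list(range(100))
turn_b = turn_a[:48] + [999] * 52  # same first two blocks, then diverges

hashes_a = rolling_block_hashes(turn_a)
hashes_b = rolling_block_hashes(turn_b)
```

Chaining the previous digest into each block means a block id identifies the whole prefix up to that point, which is what lets a router compare `hash_ids` lists to estimate prefix overlap.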
+
+输出 `outputs/inferact_codex_swebenchpro.jsonl`（1.3GB，被 `.gitignore` 排除不进仓库）。
+
+### Pitfall 11: sampling defaults to `--session-sample-rate 0.01`
+
+`benchmark-live` samples sessions internally before it runs. The default is 1%, i.e. 1 session drawn per 100. For the mini smoke trace, 4 sessions × 1% = 0 → `ValueError: Sampling produced no requests`.
+
+→ The smoke test command passes `--session-sample-rate 1.0 --target-duration-s 600` explicitly.
+
+---
+
+## 5. 后续给下个 agent
+
+跑 E1 / E2 sweep 之前**每个 shell 第一件事**：
+
+```bash
+cd /path/to/agentic-pd-hybrid
+source scripts/setup_env.sh
+```
+
+然后用 ONBOARDING §3 的 sweep 脚本（参考 `scripts/sweep_ts1_migration_v2.sh` 作为底版）。注意几处针对本机的修改：
+
+1. **MODEL 路径**改成 `/mnt/models/Qwen/Qwen3-30B-A3B-Instruct-2507`（onboarding 写的 `/mnt/kzlin/workflow/pd-hybrid/simm-swe-bench/models/...` 不存在）。
+2. **TRACE 路径**：`outputs/qwen35-swebench-50sess.jsonl` 不存在；用 `outputs/inferact_codex_swebenchpro.jsonl` （converter 跑完后产生）。
+3. **`--ib-device`** 选 `mlx5_60`（cuda:0 NUMA-local）或视实验需要自选；onboarding 写的 `mlx5_0` 在本机不存在。
+4. **保留 cli.py 的 `--disable-overlap-schedule`** 不要删——理论上 cu12.8 toolchain 应该让 overlap 也能跑，但目前未验证 overlap path 没有别的潜在问题，留着是 zero-cost 保险。
+
+---
+
+## 附录 A：本分支的代码改动
+
+- `pyproject.toml`：sglang dep 改用 `[tool.uv.sources]` path source 走 `third_party/sglang/python`（editable）。
+- `src/agentic_pd_hybrid/cli.py:_topology_from_args`：给 prefill/decode worker 自动加 `--disable-overlap-schedule`。
+- `scripts/setup_env.sh`：env wrapper，每个 shell `source` 一次。
+- `scripts/convert_inferact_to_trace.py`：Inferact ShareGPT → agentic-pd-hybrid JSONL schema converter。
+- `docs/H200_DRIVER570_SETUP_ZH.md`：本文档。
+
+## 附录 B：被 `.gitignore` 排除的产物
+
+- `outputs/inferact_codex_swebenchpro.jsonl`（1.3GB）——converter 输出，用 `scripts/convert_inferact_to_trace.py` 重新生成
+- `outputs/smoke_rdma/`（含 mini trace + smoke run artifacts）
+- `third_party/codex_swebenchpro_traces/`（209MB，HF dataset 下载）—— `hf download Inferact/codex_swebenchpro_traces --repo-type dataset --local-dir third_party/codex_swebenchpro_traces` 重下
+- `~/cuda-12.8/`——cu12.8 toolkit，用 §1 步骤 (2) 重装
+- `.venv/`——`uv sync` 重建
--- a/docs/INDEX_ZH.md
+++ b/docs/INDEX_ZH.md
@@ -0,0 +1,119 @@
+# 文档索引
+
+**目的**：让任何合作者在 10 分钟内找到他需要的文档；让 Reviewer 知道哪些先看。
+
+---
+
+## 0. 时间紧的 3 篇
+
+按这个顺序读完即可参与讨论：
+
+1. [AUDIT_AND_ROADMAP_ZH.md](AUDIT_AND_ROADMAP_ZH.md) — 项目当前进度、薄弱点、路线图。
+2. [KVC_ROUTER_ALGORITHM.md](KVC_ROUTER_ALGORITHM.md) — 算法形式化（Algorithm 1/2/3 + Theorem 1/2）。
+3. [V2_DEEP_ANALYSIS_ZH.md](V2_DEEP_ANALYSIS_ZH.md) §0 + §6 — v2 当前 win/lose snapshot。
+
+---
+
+## 1. 按主题分类
+
+### 1.1 进度 / 现状
+
+| 文档 | 内容 |
+|---|---|
+| [AUDIT_AND_ROADMAP_ZH.md](AUDIT_AND_ROADMAP_ZH.md) | 跨分支整合 + 路线图（本分支的总入口） |
+| [PROJECT_OVERVIEW.md](PROJECT_OVERVIEW.md) | 项目目标 + 三种 mechanism（pd-disagg / pd-colo / kvcache-centric）的术语区分 |
+| [ONBOARDING_NEXT_AGENT_ZH.md](ONBOARDING_NEXT_AGENT_ZH.md) | 接班 agent 30 分钟上手手册（来自 `kvc-debug-journey-v1-to-v4`） |
+
+### 1.2 算法 / 形式化
+
+| 文档 | 内容 |
+|---|---|
+| [KVC_ROUTER_ALGORITHM.md](KVC_ROUTER_ALGORITHM.md) | Algorithm 1（Route）/ 2（Admit）/ 3（Dispatch）+ Theorem 1（无饿死）+ Theorem 2（fast-path 命中下限） |
+| [MIGRATION_V1_FINDINGS_ZH.md](MIGRATION_V1_FINDINGS_ZH.md) | v1 thrashing pathology 的实测 + 为什么 reset-on-success 是关键修复 |
+
+### 1.3 实验结果
+
+| 文档 | 内容 |
+|---|---|
+| [V2_DEEP_ANALYSIS_ZH.md](V2_DEEP_ANALYSIS_ZH.md) | SWE-Bench 50 sess ts=1：v2 vs 4DP CA 的 6/8 win + TTFT p99 落后原因 |
+| [V2_RESULTS_ZH.md](V2_RESULTS_ZH.md) | v2 原始战报（headline 数字略乐观，请同时看 deep analysis） |
+| [E1_E2_RESULTS_ZH.md](E1_E2_RESULTS_ZH.md) | H200 + RDMA 上 E1（naive 1P3D + kv-aware）vs E2（KVC v2）；E2 80% failure 的 forensic |
+| [E3_FINDINGS_ZH.md](E3_FINDINGS_ZH.md) | E3（+load-floor bonus）16 min 触发 SGLang patch invariant crash |
+| [E1_E2_FIX_DESIGN_ZH.md](E1_E2_FIX_DESIGN_ZH.md) | Q1（mooncake death）+ Q2（cold-D2）的 fix 设计 |
+
+### 1.4 当前关键 design discussion
+
+| 文档 | 内容 |
+|---|---|
+| [KVC_EVICTION_GRANULARITY_DESIGN_ZH.md](KVC_EVICTION_GRANULARITY_DESIGN_ZH.md) | 架构层反思：session-level evict 与 KVC continuity 设计冲突 |
+| [BLOCK_LEVEL_EVICTION_DESIGN_ZH.md](BLOCK_LEVEL_EVICTION_DESIGN_ZH.md) | block-level evict refactor 的具体 API / 步骤 / 测试计划（本分支新增） |
+| [RESEED_SLOW_PATH_AND_D_TO_P_GAP_ZH.md](RESEED_SLOW_PATH_AND_D_TO_P_GAP_ZH.md) | reseed 慢路径时间线 + D→P 同步缺口的 forensic |
+| [D_TO_P_SYNC_CONTRACT_ZH.md](D_TO_P_SYNC_CONTRACT_ZH.md) | D→P sync 的接口契约、staleness budget、rollout 阶段（本分支新增） |
+
+### 1.5 评测 / 方法论
+
+| 文档 | 内容 |
+|---|---|
+| [EVALUATION_PROTOCOL_ZH.md](EVALUATION_PROTOCOL_ZH.md) | paper-quality 评测协议（N、CI、paired、stratify、baseline list、trace mix）—— 本分支新增 |
+| [REFACTOR_PLAN_V1_ZH.md](REFACTOR_PLAN_V1_ZH.md) | 为什么从 ts=10 切到 ts=1 |
+| [TEAM_REPORT_AGENTIC_PD_HYBRID_ZH.md](TEAM_REPORT_AGENTIC_PD_HYBRID_ZH.md) | ts=10 时代的结构性问题清单（多数已 supersede） |
+
+### 1.6 工程债 / 失败模式
+
+| 文档 | 内容 |
+|---|---|
+| [SGLANG_PATCH_INVENTORY_ZH.md](SGLANG_PATCH_INVENTORY_ZH.md) | 785 行 vendored SGLang patch 的归类清单（MUST-HAVE / WORKAROUND / EXPERIMENTAL / INSTRUMENTATION）—— 本分支新增 |
+| [FAILURE_MODES_ZH.md](FAILURE_MODES_ZH.md) | 5 类失败模式的诊断 + 缓解 + 真正修复（mooncake cascade / cold-D / evict storm / reseed storm / E3 invariant）—— 本分支新增 |
+
+### 1.7 环境
+
+| 文档 | 内容 |
+|---|---|
+| [H200_DRIVER570_SETUP_ZH.md](H200_DRIVER570_SETUP_ZH.md) | H200 + driver 570 + cu12.8 环境搭建 + 11 条 lesson learned |
+
+### 1.8 Archive (historical reference only)
+
+The content under `docs/archive/` has been superseded by newer docs and need not be read:
+
+- `AGENTIC_FIT_ANALYSIS_ZH.md`, `STRUCTURAL_VALIDATION_REPORT_ZH.md`: early ts=10 analysis.
+- `KVCACHE_CENTRIC_PROGRESS_ZH.md`: early project snapshot.
+- `KVC_DEBUG_JOURNEY_V1_TO_V5.md`, `V5_PROFILE_INVESTIGATION_ZH.md`: v1–v5 tuning notes.
+- `REFACTOR_PLAN_ZH.md`: v0 refactor plan.
+- `SWEBENCH_EXPERIMENT_*.md`: early experiment logs.
+
+---
+
+## 2. 按角色推荐阅读路径
+
+### 2.1 我是新接手的 SWE/research agent
+
+1. 先读本文 §0 的 3 篇。
+2. 再看 [AUDIT_AND_ROADMAP_ZH.md](AUDIT_AND_ROADMAP_ZH.md) §3（薄弱点）+ §5（GPU-free 工作清单）。
+3. 选一个 Milestone 1 子项开始做。`docs/BLOCK_LEVEL_EVICTION_DESIGN_ZH.md` 与 `docs/D_TO_P_SYNC_CONTRACT_ZH.md` 是已经准备好的两条工程主线。
+
+### 2.2 我是 paper reviewer / 审稿预读
+
+1. [KVC_ROUTER_ALGORITHM.md](KVC_ROUTER_ALGORITHM.md)：算法 + theorem。
+2. [V2_DEEP_ANALYSIS_ZH.md](V2_DEEP_ANALYSIS_ZH.md)：核心实测对比 + 我们自己识别的 limitation。
+3. [E1_E2_RESULTS_ZH.md](E1_E2_RESULTS_ZH.md)：真硬件 + RDMA 上的 ablation（含 E2 的 80% failure forensic，证明我们能解释失败）。
+4. [AUDIT_AND_ROADMAP_ZH.md](AUDIT_AND_ROADMAP_ZH.md) §3：我们自己列出的薄弱点与未来工作（不藏问题）。
+
+### 2.3 我是要复现实验的 student
+
+1. [H200_DRIVER570_SETUP_ZH.md](H200_DRIVER570_SETUP_ZH.md)。
+2. [EVALUATION_PROTOCOL_ZH.md](EVALUATION_PROTOCOL_ZH.md)：跑哪些 sweep、按什么协议比较。
+3. `scripts/sweep_ts1_migration_v2.sh`：v2 主 sweep；`scripts/sweep_e1_naive_1p3d.sh` / `scripts/sweep_e2_kvc_v2_rdma.sh`：E1/E2 ablation。
+
+### 2.4 我想看 control plane 与 admission
+
+1. `src/agentic_pd_hybrid/policies.py`：`KvAwarePolicy.select` 是 Algorithm 1 的实现。
+2. `src/agentic_pd_hybrid/replay.py`：`_invoke_session_direct` / `_invoke_kvcache_seeded_router` 是 Algorithm 3 的 orchestration。
+3. `third_party/sglang/python/sglang/srt/managers/scheduler.py`：D 端 `_admit_direct_append` 是 Algorithm 2 实现。
+
+---
+
+## 3. 这份索引的维护约定
+
+- 新加一份 design / experiment doc 必须在本文 §1 表格里加一行。
+- 文档归档（移到 `docs/archive/`）时本文同步删除条目或标 "已归档"。
+- 本文不写实质内容，只做导航；任何深入说明都在被指向的文档里。
--- a/docs/KVC_EVICTION_GRANULARITY_DESIGN_ZH.md
+++ b/docs/KVC_EVICTION_GRANULARITY_DESIGN_ZH.md
@@ -0,0 +1,228 @@
+# KVC Eviction Granularity — 设计审视 (架构层)
+
+**日期**: 2026-05-12
+**Status**: 架构审视 / 待 design discussion
+**Companion**: `docs/E1_E2_RESULTS_ZH.md`, `docs/E3_FINDINGS_ZH.md`, `docs/E1_E2_FIX_DESIGN_ZH.md`
+**Branch**: `h200-cu130`
+
+本文是 E2 → E3 迭代后的高层架构反思，**不是又一份 fix design**。前几轮 E2 → E3 我一直在加 local patches（load-floor bonus、Fix A skip-zero-extend、调 migration_reject_threshold 等），但 E3 实测数据迫使我们承认这些 patches 大局上看是 **KVC 在向 DP / naive PD-disagg 退化的轨迹**。
+
+---
+
+## 0. TL;DR
+
+1. **KVC's value proposition** is "pin each session to one D, accumulate KV continuously across turns, and serve a direct-to-D fast path with 0.04s TTFT".
+2. **`SessionAwareCache.release_session` frees the entire session-exclusive tail in one shot when trimming**: in E3 a single trim freed **67,726 tokens** on average (samples: 35K / 38K / 40K / 86K / 87K), not "a few leaf blocks".
+3. When an evicted session comes back, it must **re-prefill 50-90K from the client's original prompt** plus a 5-9 GB mooncake transfer → **exactly the same as naive PD-disagg**.
+4. → In the saturation regime, KVC's cache-continuity design is cancelled out by its own eviction. **Session-level eviction conflicts with KVC's design intent**.
+5. The real direction is not to pile on patches but to **change the eviction granularity**: let the decode output of a streaming session be **progressively committed into the radix tree**, and let SGLang's standard block-level LRU reclaim the oldest leaves. SessionSlot degrades to pure metadata.
+
+---
+
+## 1. 我们做对了什么，又错过了什么
+
+### KVC 的 design promise（来自 `KVC_ROUTER_ALGORITHM.md` §1）
+
+| Property | 设计意图 |
+|---|---|
+| Session 钉定 | Session `s` pin 在 `pin[s]` 这一个 D；同 session 的所有 turn 在同一个 D 上做 KV 累积 |
+| Direct-to-D 快路径 | `req.session ∈ M_d ∧ append_len ≤ τ_append ∧ cap_ok` → 仅 append 新 token，**不走 P→D mooncake transfer** |
+| TTFT 优势 | append-only path TTFT ≈ 40ms (历史 v2 在 SWE-Bench 的 fast-path p50) |
+| 集中 cache 而非 fragment | 同 session cache 集中在一个 D 上，命中率高 |
+
+### 我们当前实测在做什么（E3, killed at 1h12min）
+
+| 指标 | 实测值 | 与设计 promise 的偏离 |
+|---|---:|---|
+| Eviction 次数 | **90** | 设计假设 "session 一旦绑就持续累积" |
+| 平均每次 evict 释放 | **67,726 tokens** | 不是 "几个 leaf block"，是整段 session 尾部 |
+| 总释放 | **6,095,375 tokens** | 在 1h12min 里 trash 了 ≈ 8 个 session-pool 容量的 KV |
+| 触发 reseed 的 session 数 | 25 / 50 (50%) | 这些 session 每个被 evict-revisit 一次 = 付一次 50-90K re-prefill |
+| 单次 reseed 平均耗时 | 3-7s (P prefill + mooncake) | 跟 naive PD-disagg 持平 |
+
+**E1 对照**：0 eviction、0 retract、50 sessions 顺利完成。E1 用的是 `pd-disaggregation` mechanism，**没有 KVC 层、没有 admission RPC**，但反而保留了 cache continuity（router-side sticky 让 session 不挪窝）。
+
+> **Irony**: E1 (naive 1P3D + kv-aware policy) **unexpectedly** comes closer to the KVC design intent than E3 (KVC v2 + load-floor + RDMA): E1 has no admission feedback loop, so nothing triggers those 90 session-level evictions.
+
+---
+
+## 2. 为什么 session-level evict 是错的
+
+### `release_session` 实测语义（`session_aware_cache.py:250-281`）
+
+```python
+def release_session(self, session_id: str):
+    slot = self.slots.pop(session_id, None)
+    ...
+    if slot.last_node is not None:
+        self.inner.dec_lock_ref(slot.last_node, ...)        # 解 radix 锁 ✓
+
+    if slot.is_holding_kv:
+        start = slot.cache_protected_len
+        end = slot.kv_allocated_len
+        if start < end:
+            kv_indices = self.req_to_token_pool.req_to_token[
+                slot.req_pool_idx, start:end
+            ]
+            self.token_to_kv_pool_allocator.free(kv_indices)  # 显式 free 一段 KV
+        ...
+```
+
+`[cache_protected_len, kv_allocated_len)` 是 **session-exclusive 尾部**——从首 turn 提交 radix tree 之后所有累积的 decode output + 后续 turn 的 extend。在 Inferact workload 上：
+
+- `cache_protected_len` ≈ 首 turn 提交的 boilerplate 部分 (~12K)
+- `kv_allocated_len` ≈ 50-100K（多 turn 累积）
+- **释放范围 = 38-88K**
+
+这部分 KV **没有进 radix tree**，所以也享受不到 radix block-level LRU 的渐进式 shedding。`release_session` 一刀切。
+
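+### How this differs fundamentally from SGLang's standard radix LRU
+
+SGLang's standard `inner.evict()` (the `base_prefix_cache.py` interface, implemented by RadixCache):
+
+```
+Sort nodes by last_access_time and evict starting from the leaves (evicting an interior node would break the tree structure)
+Each step frees the KV indices of one leaf node
+Nodes with lock_ref > 0 cannot be evicted
+```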
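
The three rules above can be exercised on a toy tree. This is a simplified model written for illustration, not SGLang's RadixCache: `Node`, `add_child`, and `evict` are hypothetical, the victim is found by a linear scan rather than a heap, and lock propagation to ancestors is not modeled.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass(eq=False)  # identity comparison, so list.remove targets the exact node
class Node:
    name: str
    num_tokens: int
    last_access_time: float
    lock_ref: int = 0
    parent: Optional["Node"] = None
    children: list = field(default_factory=list)


def add_child(parent: Node, child: Node) -> Node:
    child.parent = parent
    parent.children.append(child)
    return child


def collect_leaves(root: Node) -> list:
    stack, leaves = [root], []
    while stack:
        node = stack.pop()
        if node.children:
            stack.extend(node.children)
        elif node is not root:
            leaves.append(node)
    return leaves


def evict(root: Node, required_tokens: int):
    """Free the least recently used unlocked leaves until enough tokens are freed."""
    freed, evicted = 0, []
    while freed < required_tokens:
        candidates = [n for n in collect_leaves(root) if n.lock_ref == 0]
        if not candidates:
            break  # everything left is locked or interior
        victim = min(candidates, key=lambda n: n.last_access_time)
        victim.parent.children.remove(victim)  # the parent may become a leaf next round
        freed += victim.num_tokens
        evicted.append(victim.name)
    return freed, evicted


root = Node("root", 0, 0.0)
a = add_child(root, Node("a", 24, 1.0))
add_child(a, Node("a1", 24, 2.0))
add_child(a, Node("a2", 24, 5.0, lock_ref=1))  # pinned by an active request
add_child(root, Node("b", 24, 3.0))

first_pass = evict(root, 48)   # takes a1, then b; a stays because a2 still hangs off it
second_pass = evict(root, 24)  # only the locked leaf a2 remains
```

Each step releases one 24-token node, which is the contrast with `release_session` freeing a whole session tail at once.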
+
+**特性对比**:
+
+| | session-level (current) | block-level (SGLang radix) |
+|---|---|---|
+| 单次释放粒度 | 整段 session 尾部 (35-87K) | 一个 leaf node (~24 tokens / page-size) |
+| Recent prefix 保留 | ❌ 全丢 | ✅ 保留 (recent 访问 → 时间戳新 → 不被先 evict) |
+| Evict-revisit 成本 | 50-90K re-prefill | 仅丢的 leaf 部分 (≪ 50K) |
+| 与 session lifecycle | 强绑定 (是 lifecycle 退出动作) | 解耦 (lifecycle 仅做 lock_ref 管理) |
+
+### 为什么会变这样：SessionAwareCache 的双重职责混淆
+
+`SessionAwareCache` 设计承担了**两个本应分离的职责**：
+
+1. **Session lifecycle 跟踪** (合理)：streaming session 跨多个 req 复用 KV，需要在 turn 间保留 `(req_pool_idx, kv_committed_len, kv_allocated_len, last_node)` 这些字段，恢复给下个 turn 的 req。
+2. **Eviction granularity 决策** (问题所在)：把 session 当成 evict 的最小单位，绕过了 SGLang 标准 LRU 的 leaf-by-leaf 渐进 shedding。
+
+第 2 个职责本不该存在于 SessionAwareCache 里。SGLang radix 已经能处理 block-level LRU——前提是 session 的 KV 真的进了 radix 树。但**因为 session-exclusive 尾部没 commit 进 radix tree**，radix LRU 看不到它们，只能由 release_session 一次性大块 free。
+
+---
+
+## 3. 我们前几轮 patches 的总体轨迹
+
+按 commit 时间线审视，每一步看似在修当下 issue，整体方向却是 KVC → DP 退化：
+
+| Iteration | 改动 | 局部目标 | 大局影响 |
+|---|---|---|---|
+| E2 baseline | mechanism=kvcache-centric, worker admission | 跑出 KVC v2 头条数字 | D2 cold + cascade → 1054 failures (KVC 设计前提崩塌) |
+| E3 load-floor bonus | 让 fresh session 均匀分到 D2 | 解 cold-start 偏置 | 触发 migration → 25 sessions reseed → 暴露 evict granularity 问题 |
+| E3 → Fix A | 修 vendored SGLang `prepare_for_extend` 的 fill_ids<prefix_indices invariant | 防 decode-1 assertion crash | Patch 局部 bug，没动 evict 设计 |
+| **我之前提议: disable migration** | `--kvcache-migration-reject-threshold 0` | "让 session 不挪窝" | **会让 KVC 退化成 pd-disagg + load-floor**（admission RPC 还在但 migration 不生效） |
+| **更早提议: disable admission** | 砍 admission RPC | "省掉那个 RPC overhead" | **直接砍 KVC 的 direct-to-D fast path** (KVC_ROUTER_ALGORITHM.md §3.2 Algorithm 2 不存在) |
+
+用户每次都正确地阻止了进一步退化。**没有人在审视 evict granularity 这个根本问题**——直到现在。
+
+---
+
+## 4. 正确方向（粗描）
+
+**核心思路**: 让 streaming session 的 decode 输出 **progressively commit 进 radix tree**，由 SGLang 标准 radix LRU 蚕食最老的 leaf。SessionSlot 退化成纯 metadata。
+
+### 4.1 目标行为
+
+| 场景 | 当前行为 | 目标行为 |
+|---|---|---|
+| Session 累积 50K KV，D 满了 | release_session 一次释放 38K (整段 session-exclusive 尾部) | radix LRU evict 最老 leaf (可能是首 turn 的 boilerplate tail，~24 tokens) |
+| Session 被 evict 后再到来 | 必须 reseed 50K (P prefill + mooncake) | 仅 re-prefill 被 evict 的 leaf 部分 (e.g. ~5K) |
+| TTFT 对 evicted session 的影响 | 50-90K reseed = 3-7s | 5K append-prefill = ~200ms |
+| 不被 evict 的 session | 同 session 内 turns append-only | 同样 append-only ✓ (不变) |
+| KVC fast-path 命中率 | 91.6% (历史 SWE-Bench) / 38% (E3 Inferact, 因为 evict-revisit) | 应稳定在 >85% 即使 saturation |
+
+### 4.2 需要的 refactor scope
+
+按依赖排序，每一步可独立做但有耦合：
+
+1. **Streaming session decode output 增量进 radix tree** (vendor SGLang)
+   - 当前: decode output 累积在 `kv_allocated_len` 维度，但 radix tree 只记录到 `cache_protected_len`
+   - 改: 每 turn finish 时把新的 decode tail 通过 radix `cache_finished_req` 路径插入 radix 树
+   - 影响: streaming session 在 radix 树里有持续 growing 的 chain，每个 24-token block 一个 node
+   - 牵涉: `radix_cache.py` 的 insert 路径、`schedule_batch.py` 的 cache_finished_req hook、SessionSlot.save_from_req
+
+2. **SessionSlot 退化成纯 metadata**
+   - 当前: SessionSlot 拥有 `req_pool_idx` + `[cache_protected_len, kv_allocated_len)` 范围的 KV 索引所有权
+   - 改: SessionSlot 仅持有 `last_node`（指向 radix 树某 node）和 lock_ref 状态，不直接管 KV 范围
+   - 影响: `restore_to_req` 改成基于 radix `match_prefix` 重建 req 状态，不直接 reuse req_pool_idx
+
+3. **`release_session` 改为仅 dec_lock_ref + 删 slot metadata**
+   - 当前: 还 free `[cache_protected_len, kv_allocated_len)` 范围 KV
+   - 改: 只 dec_lock_ref → 让 radix LRU 自然 evict
+   - 影响: `maybe_trim_decode_session_cache` 不再"按 session 释放"，而是用 SGLang 现有的 `tree_cache.evict(required_tokens)`
+
+4. **`admit_direct_append` 的 capacity 检查改用 radix-resident 长度**
+   - 当前: `current_tokens = session.resident_tokens` (来自 SessionSlot)
+   - 改: `current_tokens` = radix tree 上该 session 实际 commit 的长度 = `match_prefix(session.last_node).matched_length`
+   - 影响: admission 评估的 "uncached = input - radix-resident" 更精确，evict-revisit 场景下 admission 反映出"只丢了一部分"而不是"全丢"
+
+5. **`prepare_for_extend` 的 streaming-session correction 重新设计**
+   - 当前: Fix A patches 的 fill_ids/prefix_indices invariant 是基于 session-exclusive 尾部的复杂 fixup
+   - 改: 如果 SessionSlot 不再拥有独立 KV 范围，整个 correction 路径需要重写或可能不再必要
+
+### 4.3 与 onboarding §4.4 D→P sync 的关系
+
+`docs/RESEED_SLOW_PATH_AND_D_TO_P_GAP_ZH.md` §4 描述的 D→P 增量同步是**针对 reseed 自身成本**的 fix（让 P 端 backup 跟上，避免 reseed 时 P 重 prefill）。
+
+本文 §4 描述的 eviction granularity 是**针对 reseed 触发频率**的 fix（让 session 不被一次性 evict 整段，减少 evict-revisit）。
+
+**两者正交、互补**:
+- 单做 evict-granularity fix: reseed 频率下降，但偶发 reseed 仍然慢
+- 单做 D→P sync: reseed 自身快了，但仍然频繁触发
+- 都做: reseed 几乎消失、即使触发也快
+
+工程量都是 ~1-2 周量级，可并行启动。
+
+### 4.4 不是 local patch
+
+注意整个 §4.2 列表里没有"调一个 hyperparameter"或者"加一个 CLI flag"这种局部改动。这是 vendor SGLang 内部数据结构的 invariants 重新设计，不能通过更精确的 K 值或更宽的 substring filter 解决。
+
+---
+
+## 5. 我们不该再做的事 (anti-patterns)
+
+防止下个 agent 走同样的局部 patch 路径：
+
+1. **不要继续调整 `migration_reject_threshold`** — 这个参数只是控制"reject 后多久换 D"，跟 evict granularity 无关。调小让 migration 更频繁 → 更多 reseed → 更糟。调大 → blacklist 永久化 (v1 thrashing 问题)。
+2. **不要 disable migration** — 会让 KVC 退化到 sticky pd-disagg。失去 v2 的 reset-on-success 整体设计。
+3. **不要 disable admission** — 会砍掉 direct-to-D fast path 这个 KVC 唯一的差异化优势。
+4. **不要继续 tune `_decode_session_cache_low_watermark_tokens`** — 调高让 LRU 更激进 → 更多 evict → 更糟。调低让 LRU 不触发 → 顶到 retract decode → 更糟。是治标。
+5. **不要再加 `_ADMISSION_REJECTION_SUBSTRINGS`** — 之前修的 string filter bug (Q2 forensic) 让 migration counter 真的递增，反而暴露了 migration 本身的 reseed 成本。修这个 bug 没错，但显示出 migration 机制本身在 saturated 场景下是负收益。
+
+---
+
+## 6. 推荐 Decision Points
+
+| # | Question | 推荐 |
+|---|---|---|
+| D1 | 接受本文的诊断（session-level evict 是根本问题）？ | **Yes** |
+| D2 | 暂停 E1/E2/E3 ablation 线索，集中精力做 §4.2 refactor？ | **Yes** (current path 在用 GPU 时间确认已知结论) |
+| D3 | refactor 在 vendored SGLang 主线（kvc-debug-journey-v1-to-v4）还是新分支？ | 新分支 `feat/block-level-evict`（隔离 risk） |
+| D4 | 同时启动 §4.3 的 D→P sync（`feat/d-to-p-sync` 分支已预留）？ | 视团队带宽 |
+| D5 | 在 refactor 完成前对外的 paper 表述如何处理？ | 标"v2 系列在 saturation regime 下的 evict 行为是已识别的 limitation，§future-work 已 propose 修复" |
+
+---
+
+## 7. Handoff to the next agent
+
+**If you take over the §4.2 refactor**, read in this order:
+
+1. `KVC_ROUTER_ALGORITHM.md` §2-3: the KVC design intent
+2. §2 of this doc (its first two subsections): the measured eviction behavior
+3. SGLang vendor `mem_cache/radix_cache.py`: implementation details of the standard radix LRU
+4. SGLang vendor `mem_cache/session_aware_cache.py`: the current SessionSlot design
+5. SGLang vendor `managers/schedule_batch.py`: how prepare_for_extend uses session state
+6. `docs/RESEED_SLOW_PATH_AND_D_TO_P_GAP_ZH.md` §4: the engineering scope of D→P sync (complementary work)
+
+**Key invariant**: `SessionSlot.restore_to_req` must stay idempotent (a failed chunked prefill may retry several times). Every refactor must test this invariant.
+
+**Key testing pattern**: unit-test how a streaming session behaves under LRU pressure. Concretely: inject a fake `inner.evict()` that returns a state in which some leaves have been evicted, and assert that `SessionSlot.restore_to_req` still returns a valid req state (no assertion raised, a reasonable re-prefill length).
+
+---
+
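+**Key takeaway**: our first 3 rounds of patches all addressed secondary problems exposed by saturation (cold-D bias, the admission string bug, the streaming-session correction edge case), but **the real primary problem is that SessionAwareCache mixes session-lifecycle tracking with the eviction-granularity decision**. A session is a lifecycle boundary; **it should not be an eviction boundary**. Eviction should be handed back to the block-level radix LRU that SGLang already does well.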
--- a/docs/SGLANG_PATCH_INVENTORY_ZH.md
+++ b/docs/SGLANG_PATCH_INVENTORY_ZH.md
@@ -0,0 +1,165 @@
+# Vendored SGLang Patch — 归类清单
+
+**日期**：2026-05-13
+**基线**：clean SGLang v0.5.10 snapshot @ `bded083`
+**当前 HEAD**：`origin/h200-cu130` + 本分支 (785 行新增 / 17 行删除 / 10 文件)
+**目的**：让 reviewer 与下一个合作者一眼看清"哪些 patch 是核心机制、哪些是 workaround、哪些可以在 refactor 后下线"。对应 [AUDIT_AND_ROADMAP_ZH.md](AUDIT_AND_ROADMAP_ZH.md) §3.2 / §S6 的工程债项。
+
+---
+
+## 0. TL;DR
+
+| 分类 | 文件数 | 行数（估） | 命运 |
+|---|---:|---:|---|
+| MUST-HAVE — 核心机制（Algorithm 1/2/3、streaming session lifecycle、admit RPC） | 6 | ~450 | 长期保留，是 paper claim 的核心 |
+| WORKAROUND — 已识别的 latent 问题修补，应在 refactor 后下线 | 2 | ~150 | block-level eviction refactor 完成后大量删除 |
+| EXPERIMENTAL — 未闭环的特性，论文不依赖 | 1 | ~60 | 可下线或保留为 future-work hook |
+| INSTRUMENTATION — 诊断 / 日志 | 1 | ~50 | 保留但应隔离到 debug build |
+| MINOR — 杂项 | 1 | ~3 | 不影响决策 |
+
+**关键指引**：当 block-level eviction refactor（[BLOCK_LEVEL_EVICTION_DESIGN_ZH.md](BLOCK_LEVEL_EVICTION_DESIGN_ZH.md)）完成时，WORKAROUND 类的 ~150 行应同步删除。E3 触发的 `schedule_batch.py` invariant landmine 是这条路径上的产物，不修引擎而是修 evict 粒度才是正解。
+
+---
+
+## 1. 文件粒度清单
+
+### 1.1 `mem_cache/session_aware_cache.py` — MUST-HAVE *（待 refactor）*
+
+| 项目 | 内容 | 引入 | 分类 |
+|---|---|---|---|
+| `SessionSlot` dataclass | streaming session 跨 turn 复用 KV 的 metadata | b8e6f13 | MUST-HAVE |
+| `last_access_time` 字段 | LRU 决策需要 | 6e5ed8d | MUST-HAVE |
+| `match_prefix` / `cache_finished_req` / `cache_unfinished_req` 的 streaming 分支 | session 复用快路径 | b8e6f13 | **MUST-HAVE → 待 refactor**（block-level evict 后语义大改） |
+| `release_session` 直接 `free(kv_indices)` | session 退出时一次性归还 KV | b8e6f13 | **WORKAROUND → 替换**（refactor 后改为只 `dec_lock_ref`） |
+| `slot_held_tokens` / `get_session_status` / `list_session_statuses` | 状态查询 | 6e5ed8d | MUST-HAVE |
+
+**说明**：本文件是 KVC 设计的中枢。block-level eviction refactor（[BLOCK_LEVEL_EVICTION_DESIGN_ZH.md](BLOCK_LEVEL_EVICTION_DESIGN_ZH.md) §3.1–§3.6）改造的就是这里。`SessionSlot` 的 5 个 KV-ownership 字段（`req_pool_idx` / `kv_committed_len` / `kv_allocated_len` / `cache_protected_len` / `swa_evicted_seqlen`）应在 refactor 后删除；这部分**将由 commit message 单独标记**，方便回滚。
+
+### 1.2 `managers/scheduler.py` — mixed categories
+
+The D-worker side of the Algorithm 2 implementation, containing several independent patches. Classified at line-range granularity:
+
+| Function / line range | Description | Category | When it can be retired |
+|---|---|---|---|
+| `admit_direct_append(...)` | D-side admission RPC handler for Algorithm 2 | **MUST-HAVE** | Never (core of the paper) |
+| `_should_allow_local_prefill_on_decode(req)` | Decides whether a decode worker accepts a local append-prefill that has no bootstrap | **MUST-HAVE** | Never |
+| `_decode_session_cache_low_watermark_tokens()` | Reads the watermark parameter | **WORKAROUND** | Replaced by radix LRU after block-level evict |
+| `_decode_session_cache_target_available_tokens()` | Computes the target number of available tokens | **WORKAROUND** | Same as above |
+| `maybe_trim_decode_session_cache(...)` | Proactively trims sessions (triggers `release_session`) | **WORKAROUND** | Same as above: after the refactor radix LRU reclaims blocks incrementally, so trimming is no longer needed |
+| `_compute_backpressure_pause_hint(...)` | Pause hint for the router | **EXPERIMENTAL** | The signal loop is not closed ([REAL_ALI_KVC_EXPERIMENT_LOG_ZH.md](../docs/archive/) §4.3), roadmap §S10; can be kept as a future-work hook |
+| `_compute_pool_breakdown_for_diagnostics()` | Pool-state snapshot served via `/server_info` | **INSTRUMENTATION** | Keep long-term, but gating it behind a flag is recommended |
+
+### 1.3 `managers/schedule_batch.py` — WORKAROUND (to be deleted)
+
+| Item | Description | Introduced in | Category |
+|---|---|---|---|
+| streaming-session `extend_input_len` correction (lines ~1572–1585) | Sets extend_input_len to 0 when fill_ids < prefix_indices | b8e6f13 | **WORKAROUND** |
+| pre-filter pass dropping `fill_ids < prefix_indices` reqs | Hotfix applied after E3 triggered the assertion (commit 986f351) | 986f351 | **WORKAROUND** |
+| Tolerance logic around the invariant assert `seq_len - pre_len == req.extend_input_len` | Companion to the correction | b8e6f13 | **WORKAROUND** |
+
+**All** ~85 lines **should be deleted wholesale** once the block-level eviction refactor is complete: `BLOCK_LEVEL_EVICTION_DESIGN_ZH §3.7` explains that after the refactor the invariant holds by construction, so the correction path has no reason to exist. The E3 landmine ([E3_FINDINGS_ZH.md](E3_FINDINGS_ZH.md) §2) is a product of this workaround.
+
+### 1.4 `managers/session_controller.py` — MUST-HAVE
+
+| Item | Description | Category |
+|---|---|---|
+| streaming session lifecycle hooks (open / close / admit signal) | Lets P/D workers know when a streaming session starts / ends | MUST-HAVE |
+| Session ID routing | Lets the admission RPC find the correct SessionSlot | MUST-HAVE |
+
+Not to be retired.
+
+### 1.5 `managers/io_struct.py` — MUST-HAVE
+
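+| Item | Description | Category |
+|---|---|---|
+| `AdmitDirectAppendReqInput` / `AdmitDirectAppendReqOutput` | Request / response message types for the admit RPC | MUST-HAVE |
+| backpressure pause hint field | Optional field on the messages above | EXPERIMENTAL |
+
+The EXPERIMENTAL field can be folded into the MUST-HAVE messages to stay compatible; by itself it creates no pressure to retire anything.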
+
+### 1.6 `managers/tokenizer_communicator_mixin.py` — MUST-HAVE
+
+Communicator-side glue for the admit RPC. 19 lines; not to be retired.
+
+### 1.7 `entrypoints/http_server.py` — MUST-HAVE
+
+Registers the `/admit_direct_append` HTTP endpoint. 6 lines.
+
+### 1.8 `disaggregation/decode.py` — mixed categories
+
+| Item | Description | Category |
+|---|---|---|
+| `DecodeReqToTokenPool`: relaxes `assert len(reusing) <= 1` | Lets local append-prefill reuse multiple req_pool_idx within one batch | **MUST-HAVE** |
+| `DecodePreallocQueue` adds `refresh_allocatable_tokens` and triggers `maybe_trim_decode_session_cache` | Proactively trims sessions when the pool is full | **WORKAROUND** (after the refactor, radix LRU sheds load on its own) |
+| `--disaggregation-decode-allow-local-prefill` flag | Server-side opt-in for local append-prefill | **MUST-HAVE** |
+
+The ~30 lines of trim-trigger logic should be deleted after the refactor.
+
+### 1.9 `server_args.py` — MUST-HAVE
+
+| Item | Description | Category |
+|---|---|---|
+| `--radix-eviction-policy priority` option | Needed by the E1/E2 experiments | MUST-HAVE |
+| `--disaggregation-decode-allow-local-prefill` flag | See §1.8 | MUST-HAVE |
+
+13 lines, all CLI interface extensions; not to be retired.
+
+### 1.10 `disaggregation/mooncake_transfer_engine.py` — MINOR
+
+A 3-line tweak. Not a decision point.
+
+---
+
+## 2. Summary by category
+
+### 2.1 MUST-HAVE (keep)
+
+About 6 files, 450 lines:
+- `admit_direct_append` main path (Algorithm 2): scheduler + io_struct + tokenizer_communicator_mixin + http_server + session_controller
+- `SessionSlot` main path (streaming session lifecycle): most fields of session_aware_cache, plus session_controller
+- CLI / server interface: server_args, and `allow_local_prefill` in decode.py
+
+### 2.2 WORKAROUND (delete after the block-level evict refactor)
+
+About 2.5 files, 150 lines:
+- The token-free path of `session_aware_cache.release_session`
+- `_decode_session_cache_low_watermark_tokens` / `_decode_session_cache_target_available_tokens` + `maybe_trim_decode_session_cache` in `scheduler.py`
+- The streaming-session correction + drop-pre-filter in `schedule_batch.py` (including the hotfix for the E3 landmine)
+- The trim trigger inside `DecodePreallocQueue` in `decode.py`
+
+→ These patches exist only because of the current architecture; after the refactor they should be deleted wholesale rather than bug-fixed piecemeal.
+
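+### 2.3 EXPERIMENTAL (loop not closed)
+
+About 60 lines:
+- backpressure pause hint (`_compute_backpressure_pause_hint` + the io_struct field): can be kept as a hook for a future control-plane feedback mechanism; retire it if it is still not wired up after 1 month
+
+### 2.4 INSTRUMENTATION (keep long-term, but gate behind a flag)
+
+About 50 lines:
+- `_compute_pool_breakdown_for_diagnostics` + the related `/server_info` fields: an `--enable-diagnostic-pool-snapshot` flag is suggested so the prod path does not pay the diagnostic overhead
+
+### 2.5 MINOR
+
+About 3 lines: ignore.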
+
+---
+
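+## 3. Maintenance conventions
+
+1. **Every new SGLang change must be recorded in this table**: use the `feat(sglang): ...` / `fix(sglang): ...` prefix in the commit message, and state in the PR description which §2 category the change falls under.
+2. **Do not overwrite upstream files directly**: every patch must `git apply` cleanly on v0.5.10 (keep hunk headers tidy).
+3. **Delete the docs when deleting a WORKAROUND**: the PR that completes the refactor should also strike the corresponding rows from the tables in this document.
+4. **Do not promote EXPERIMENTAL patches to the main path**: patches whose loop is not closed must be disabled by default.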
+
+---
+
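+## 4. Connection to the roadmap
+
+- When Milestone 1 ([AUDIT_AND_ROADMAP_ZH.md](AUDIT_AND_ROADMAP_ZH.md) §4) carries out the block-level eviction refactor, **all of §2.2 should disappear**. This is an objective measure of how complete the refactor is.
+- When Milestone 2 splits the control plane into layers (§4.8), the §2.3 backpressure pause hint should be either enabled or retired; it must not be left dangling.
+- When Milestone 3 introduces learning-based admission (§4.15), the `admit_direct_append` interface in §2.1 should stay stable; the policy is swapped on the router side, not the D side.
+
+---
+
+**Bottom line**: the 785 lines of vendored SGLang are not a monolithic black box. Roughly two thirds are core mechanism (required by the paper) and roughly one third are workarounds for the current architecture (deletable wholesale after the refactor). With this table, a reviewer can tell at a glance which parts are the paper's real contribution and which are temporary scaffolding for the current prototype.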
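The flag gating suggested in §2.4 could look roughly like the sketch below. Only the flag name `--enable-diagnostic-pool-snapshot` comes from the document; `build_parser`, `compute_pool_breakdown`, and `server_info` are hypothetical stand-ins, not SGLang functions, and the snapshot values are placeholders.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser()
    # Flag name taken from the §2.4 suggestion; off by default so the
    # production path never pays for the snapshot.
    parser.add_argument("--enable-diagnostic-pool-snapshot", action="store_true")
    return parser


def compute_pool_breakdown() -> dict:
    # Hypothetical stand-in for the real pool walk; values are placeholders.
    return {"free_tokens": 0, "session_held_tokens": 0}


def server_info(args: argparse.Namespace) -> dict:
    info = {"status": "ok"}
    # Only compute and expose the breakdown when the operator opted in.
    if args.enable_diagnostic_pool_snapshot:
        info["pool_breakdown"] = compute_pool_breakdown()
    return info
```

With the flag absent, `server_info` returns no `pool_breakdown` key, so callers can treat the field as optional.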
--- a/pyproject.toml
+++ b/pyproject.toml
@@ -7,7 +7,7 @@ requires-python = ">=3.12"
 dependencies = [
    "httpx>=0.28.1",
    "mooncake-transfer-engine",
-    "sglang==0.5.10",
+    "sglang",
 ]

 [project.scripts]
@@ -20,5 +20,21 @@ build-backend = "setuptools.build_meta"
 [tool.setuptools.packages.find]
 where = ["src"]

+[dependency-groups]
+# Pure-Python unit tests. Install via:
+#   uv sync --group test
+# These tests deliberately import only the algorithm-layer modules
+# (policies, trace, topology) so they run without SGLang / GPU / CUDA.
+test = [
+    "pytest>=8.0",
+]
+
 [tool.uv]
 prerelease = "allow"
+
+[tool.uv.sources]
+sglang = { path = "third_party/sglang/python", editable = true }
+
+[tool.pytest.ini_options]
+testpaths = ["tests"]
+addopts = "-q"
--- a/scripts/analysis/paired_compare.py
+++ b/scripts/analysis/paired_compare.py
@@ -0,0 +1,225 @@
+#!/usr/bin/env python3
+"""Paired latency comparison with bootstrap CI.
+
+Implements docs/EVALUATION_PROTOCOL_ZH.md §2.2 (M2 fix): when comparing
+mechanism A vs B on the same trace, the only honest comparison is paired
+on same-trial-mask. This script joins two metrics.jsonl by request_id,
+keeps the rows where BOTH sides succeeded, and reports paired deltas
+with 95% bootstrap CIs.
+
+Differences from the existing `compare_no_error.py`:
+  - works on raw metrics.jsonl, not pre-aggregated summary.json
+  - bootstrap CIs (not just point estimates)
+  - reports paired-mask size + per-side failure counts so the reader
+    sees how many rows were dropped from the comparison
+
+Usage:
+  scripts/analysis/paired_compare.py \
+      --baseline outputs/run-dp/request-metrics.jsonl \
+      --candidate outputs/run-kvc/request-metrics.jsonl
+  scripts/analysis/paired_compare.py ... --bootstrap 5000 --seed 42
+  scripts/analysis/paired_compare.py ... --json > paired.json
+
+stdlib only — no scipy/numpy. Runs without GPU and without SGLang.
+"""
+
+from __future__ import annotations
+
+import argparse
+import json
+import math
+import random
+import sys
+from pathlib import Path
+
+
+def _load(path: Path) -> dict[str, dict]:
+    out: dict[str, dict] = {}
+    with path.open() as handle:
+        for line in handle:
+            line = line.strip()
+            if not line:
+                continue
+            row = json.loads(line)
+            rid = row.get("request_id")
+            if rid is None:
+                continue
+            out[rid] = row
+    return out
+
+
+def _ok(row: dict) -> bool:
+    return row.get("error") is None and row.get("latency_s") is not None
+
+
+def _quantile(values: list[float], q: float) -> float:
+    if not values:
+        return float("nan")
+    s = sorted(values)
+    if len(s) == 1:
+        return s[0]
+    pos = (len(s) - 1) * q
+    lo = math.floor(pos)
+    hi = math.ceil(pos)
+    if lo == hi:
+        return s[lo]
+    return s[lo] + (s[hi] - s[lo]) * (pos - lo)
+
+
+def _stats(deltas: list[float]) -> dict[str, float]:
+    if not deltas:
+        return {"mean": float("nan"), "p50": float("nan"), "p90": float("nan"), "p99": float("nan")}
+    return {
+        "mean": sum(deltas) / len(deltas),
+        "p50": _quantile(deltas, 0.50),
+        "p90": _quantile(deltas, 0.90),
+        "p99": _quantile(deltas, 0.99),
+    }
+
+
+def _bootstrap_ci(
+    deltas: list[float], statistic, n_boot: int, rng: random.Random
+) -> tuple[float, float]:
+    """Return (lo, hi) 95% CI for `statistic(deltas)`."""
+    if len(deltas) < 2:
+        return (float("nan"), float("nan"))
+    n = len(deltas)
+    samples = []
+    for _ in range(n_boot):
+        # resample with replacement
+        resample = [deltas[rng.randrange(n)] for _ in range(n)]
+        samples.append(statistic(resample))
+    samples.sort()
+    lo = samples[int(0.025 * (n_boot - 1))]
+    hi = samples[int(0.975 * (n_boot - 1))]
+    return (lo, hi)
+
+
+def compare(
+    baseline: dict[str, dict],
+    candidate: dict[str, dict],
+    *,
+    metric: str,
+    n_boot: int,
+    seed: int,
+) -> dict:
+    common_ids = set(baseline.keys()) & set(candidate.keys())
+    paired_ids = [
+        rid for rid in common_ids if _ok(baseline[rid]) and _ok(candidate[rid])
+    ]
+    paired_ids.sort()
+
+    base_only_fail = sum(1 for rid in common_ids if not _ok(baseline[rid]))
+    cand_only_fail = sum(1 for rid in common_ids if not _ok(candidate[rid]))
+
+    deltas = []
+    wins = losses = ties = 0
+    for rid in paired_ids:
+        b = baseline[rid].get(metric)
+        c = candidate[rid].get(metric)
+        if b is None or c is None:
+            continue
+        d = float(c) - float(b)
+        deltas.append(d)
+        if d < 0:
+            wins += 1
+        elif d > 0:
+            losses += 1
+        else:
+            ties += 1
+
+    rng = random.Random(seed)
+    stats = _stats(deltas)
+    ci_mean = _bootstrap_ci(deltas, lambda x: sum(x) / len(x), n_boot, rng)
+    ci_p50 = _bootstrap_ci(deltas, lambda x: _quantile(x, 0.50), n_boot, rng)
+    ci_p90 = _bootstrap_ci(deltas, lambda x: _quantile(x, 0.90), n_boot, rng)
+
+    return {
+        "metric": metric,
+        "baseline_size": len(baseline),
+        "candidate_size": len(candidate),
+        "intersection_size": len(common_ids),
+        "paired_size": len(paired_ids),
+        "baseline_fail_in_common": base_only_fail,
+        "candidate_fail_in_common": cand_only_fail,
+        "delta_stats": stats,
+        "delta_mean_ci95": ci_mean,
+        "delta_p50_ci95": ci_p50,
+        "delta_p90_ci95": ci_p90,
+        "wins_candidate": wins,
+        "losses_candidate": losses,
+        "ties": ties,
+    }
+
+
+def _fmt(x: float, w: int = 6) -> str:
+    if x is None or (isinstance(x, float) and math.isnan(x)):
+        return "  nan "
+    return f"{x:+{w}.3f}"
+
+
+def render(result: dict) -> str:
+    s = result["delta_stats"]
+    mlo, mhi = result["delta_mean_ci95"]
+    p5lo, p5hi = result["delta_p50_ci95"]
+    p9lo, p9hi = result["delta_p90_ci95"]
+    n = result["paired_size"]
+    lines = [
+        f"# paired comparison ({result['metric']})",
+        "",
+        f"baseline rows:           {result['baseline_size']}",
+        f"candidate rows:          {result['candidate_size']}",
+        f"intersection (rid):      {result['intersection_size']}",
+        f"paired (both ok):        {result['paired_size']}",
+        f"  baseline fails in common:  {result['baseline_fail_in_common']}",
+        f"  candidate fails in common: {result['candidate_fail_in_common']}",
+        "",
+        "## delta (candidate - baseline)  — negative = candidate is faster",
+        "",
+        "| stat | value | 95% CI |",
+        "|---|---:|---:|",
+        f"| mean | {_fmt(s['mean'])} | [{_fmt(mlo)}, {_fmt(mhi)}] |",
+        f"| p50  | {_fmt(s['p50'])}  | [{_fmt(p5lo)}, {_fmt(p5hi)}] |",
+        f"| p90  | {_fmt(s['p90'])}  | [{_fmt(p9lo)}, {_fmt(p9hi)}] |",
+        f"| p99  | {_fmt(s['p99'])}  | — |",
+        "",
+        f"win/loss/tie: {result['wins_candidate']} / {result['losses_candidate']} / {result['ties']} (of {n})",
+    ]
+    return "\n".join(lines)
+
+
+def main() -> None:
+    p = argparse.ArgumentParser(description=__doc__.split("\n\n")[0])
+    p.add_argument("--baseline", required=True, type=Path)
+    p.add_argument("--candidate", required=True, type=Path)
+    p.add_argument(
+        "--metric",
+        default="latency_s",
+        choices=["latency_s", "ttft_s", "tpot_s"],
+        help="which per-request field to compare (default: latency_s)",
+    )
+    p.add_argument("--bootstrap", type=int, default=2000)
+    p.add_argument("--seed", type=int, default=20260512)
+    p.add_argument("--json", action="store_true")
+    args = p.parse_args()
+
+    baseline = _load(args.baseline)
+    candidate = _load(args.candidate)
+    if not baseline or not candidate:
+        print("empty input on one side", file=sys.stderr)
+        sys.exit(1)
+
+    result = compare(
+        baseline, candidate,
+        metric=args.metric, n_boot=args.bootstrap, seed=args.seed,
+    )
+
+    if args.json:  # floats never reach default=, so round-trip to turn NaN into null
+        json.dump(json.loads(json.dumps(result), parse_constant=lambda _c: None), sys.stdout, indent=2)
+        sys.stdout.write("\n")
+    else:
+        print(render(result))
+
+
+if __name__ == "__main__":
+    main()
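A compact sketch of the paired-mask and percentile-bootstrap idea that the script implements, run on hypothetical in-memory rows instead of `request-metrics.jsonl` files. `paired_deltas` and `bootstrap_ci` are illustrative names, not functions from the script, and the sample rows are made up.

```python
import random


def paired_deltas(baseline: dict, candidate: dict, metric: str = "latency_s") -> list[float]:
    """Candidate-minus-baseline deltas over request ids where both sides succeeded."""
    deltas = []
    for rid in sorted(set(baseline) & set(candidate)):
        b, c = baseline[rid], candidate[rid]
        if b.get("error") is None and c.get("error") is None:
            if b.get(metric) is not None and c.get(metric) is not None:
                deltas.append(float(c[metric]) - float(b[metric]))
    return deltas


def bootstrap_ci(deltas: list[float], n_boot: int = 1000, seed: int = 0) -> tuple[float, float]:
    """Percentile 95% CI of the mean delta, resampling with replacement."""
    n = len(deltas)
    if n == 0:
        return (float("nan"), float("nan"))
    rng = random.Random(seed)
    means = sorted(
        sum(deltas[rng.randrange(n)] for _ in range(n)) / n for _ in range(n_boot)
    )
    return means[int(0.025 * (n_boot - 1))], means[int(0.975 * (n_boot - 1))]


baseline = {
    "r1": {"latency_s": 2.0, "error": None},
    "r2": {"latency_s": 3.0, "error": None},
    "r3": {"latency_s": 9.0, "error": "timeout"},  # dropped: baseline failed
}
candidate = {
    "r1": {"latency_s": 1.5, "error": None},
    "r2": {"latency_s": 2.5, "error": None},
    "r3": {"latency_s": 1.0, "error": None},
}
deltas = paired_deltas(baseline, candidate)
lo, hi = bootstrap_ci(deltas)
```

Request `r3` is excluded even though the candidate succeeded on it. That is the point of the paired mask: a side that fails more often cannot look faster simply by dropping its slow requests.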
--- a/scripts/analysis/stratified.py
+++ b/scripts/analysis/stratified.py
@@ -0,0 +1,227 @@
+#!/usr/bin/env python3
+"""Stratified latency / TTFT reporter for paper-quality evaluation.
+
+Implements docs/EVALUATION_PROTOCOL_ZH.md §1.3 (M3 fix): every headline
+number must be accompanied by a stratified breakdown so reviewers can
+see which slice the gains come from.
+
+Buckets the request rows from one or more metrics.jsonl files along:
+  - turn_id        : {1, 2-5, 6-20, 21+}
+  - input_length   : {<=8K, 8K-64K, >64K}
+  - overlap_ratio  : {<=0.3, 0.3-0.7, >0.7}
+  - append_tokens  : {<=128, 128-1K, 1K-8K, >8K}, where append = input_length - observed_overlap_blocks * BLOCK_SIZE
+
+For each bucket, reports:
+  - n (total rows in bucket)
+  - n_ok (rows with no error and latency_s set)
+  - latency_s mean / p50 / p90 / p99
+  - ttft_s    mean / p50 / p90 / p99
+  - err_pct   (1 - n_ok/n)
+
+Usage:
+  scripts/analysis/stratified.py outputs/<run>/request-metrics.jsonl \
+      [outputs/<other-run>/request-metrics.jsonl ...]
+  scripts/analysis/stratified.py --dim turn_id outputs/<run>/request-metrics.jsonl
+  scripts/analysis/stratified.py --json outputs/<run>/request-metrics.jsonl > strat.json
+
+stdlib only — no pandas/numpy. Runs without GPU and without SGLang.
+"""
+
+from __future__ import annotations
+
+import argparse
+import json
+import math
+import sys
+from collections import defaultdict
+from pathlib import Path
+from typing import Iterable
+
+BLOCK_SIZE = 24  # SGLang radix block, matches docs/KVC_ROUTER_ALGORITHM.md §2
+
+TURN_BUCKETS: list[tuple[str, tuple[int, int]]] = [
+    ("turn=1", (1, 1)),
+    ("turn=2-5", (2, 5)),
+    ("turn=6-20", (6, 20)),
+    ("turn=21+", (21, 10**9)),
+]
+INPUT_BUCKETS: list[tuple[str, tuple[int, int]]] = [
+    ("input<=8K", (0, 8 * 1024)),
+    ("input=8K-64K", (8 * 1024 + 1, 64 * 1024)),
+    ("input>64K", (64 * 1024 + 1, 10**9)),
+]
+OVERLAP_BUCKETS: list[tuple[str, tuple[float, float]]] = [
+    ("overlap<=0.3", (0.0, 0.3)),
+    ("overlap=0.3-0.7", (0.3, 0.7)),
+    ("overlap>0.7", (0.7, 1.0001)),
+]
+APPEND_BUCKETS: list[tuple[str, tuple[int, int]]] = [
+    ("append<=128", (0, 128)),
+    ("append=128-1K", (129, 1024)),
+    ("append=1K-8K", (1025, 8 * 1024)),
+    ("append>8K", (8 * 1024 + 1, 10**9)),
+]
+
+DIM_BUCKETS: dict[str, list[tuple[str, tuple]]] = {
+    "turn_id": TURN_BUCKETS,
+    "input_length": INPUT_BUCKETS,
+    "overlap_ratio": OVERLAP_BUCKETS,
+    "append_tokens": APPEND_BUCKETS,
+}
+
+
+def _quantile(values: list[float], q: float) -> float:
+    """Linear-interpolation quantile, stdlib only."""
+    if not values:
+        return float("nan")
+    s = sorted(values)
+    if len(s) == 1:
+        return s[0]
+    pos = (len(s) - 1) * q
+    lo = math.floor(pos)
+    hi = math.ceil(pos)
+    if lo == hi:
+        return s[lo]
+    return s[lo] + (s[hi] - s[lo]) * (pos - lo)
+
+
+def _stats(values: list[float]) -> dict[str, float]:
+    if not values:
+        return {"mean": float("nan"), "p50": float("nan"), "p90": float("nan"), "p99": float("nan")}
+    return {
+        "mean": sum(values) / len(values),
+        "p50": _quantile(values, 0.50),
+        "p90": _quantile(values, 0.90),
+        "p99": _quantile(values, 0.99),
+    }
+
+
+def _bucket_for(value: float | int, buckets: list[tuple[str, tuple]]) -> str:
+    for label, (lo, hi) in buckets:
+        if lo <= value <= hi:
+            return label
+    return "OOB"
+
+
+def _classify(row: dict, dim: str) -> str:
+    if dim == "turn_id":
+        return _bucket_for(int(row.get("turn_id", 0)), TURN_BUCKETS)
+    if dim == "input_length":
+        return _bucket_for(int(row.get("input_length", 0)), INPUT_BUCKETS)
+    if dim == "overlap_ratio":
+        inp = max(1, int(row.get("input_length", 0)))
+        cached = int(row.get("observed_overlap_blocks", 0)) * BLOCK_SIZE
+        ratio = min(1.0, cached / inp)
+        return _bucket_for(ratio, OVERLAP_BUCKETS)
+    if dim == "append_tokens":
+        inp = int(row.get("input_length", 0))
+        cached = int(row.get("observed_overlap_blocks", 0)) * BLOCK_SIZE
+        return _bucket_for(max(0, inp - cached), APPEND_BUCKETS)
+    raise ValueError(f"Unknown dim: {dim}")
+
+
+def load_rows(paths: Iterable[Path]) -> list[dict]:
+    rows: list[dict] = []
+    for path in paths:
+        with path.open() as handle:
+            for line in handle:
+                line = line.strip()
+                if not line:
+                    continue
+                rows.append(json.loads(line))
+    return rows
+
+
+def stratify(rows: list[dict], dim: str) -> dict[str, dict]:
+    by_bucket: dict[str, list[dict]] = defaultdict(list)
+    for row in rows:
+        by_bucket[_classify(row, dim)].append(row)
+
+    output: dict[str, dict] = {}
+    for label, _ in DIM_BUCKETS[dim]:
+        bucket_rows = by_bucket.get(label, [])
+        n = len(bucket_rows)
+        ok = [r for r in bucket_rows if r.get("error") is None and r.get("latency_s") is not None]
+        n_ok = len(ok)
+        lat = [float(r["latency_s"]) for r in ok]
+        ttft = [float(r["ttft_s"]) for r in ok if r.get("ttft_s") is not None]
+        output[label] = {
+            "n": n,
+            "n_ok": n_ok,
+            "err_pct": (n - n_ok) / n if n else 0.0,
+            "latency_s": _stats(lat),
+            "ttft_s": _stats(ttft),
+        }
+    return output
+
+
+def render_table(name: str, stats: dict[str, dict]) -> str:
+    lines = [
+        f"## stratified by {name}",
+        "",
+        "| bucket | n | n_ok | err% | lat mean | lat p50 | lat p90 | lat p99 | ttft mean | ttft p50 | ttft p90 | ttft p99 |",
+        "|---|---:|---:|---:|---:|---:|---:|---:|---:|---:|---:|---:|",
+    ]
+    for label, _ in DIM_BUCKETS[name]:
+        s = stats[label]
+        lat = s["latency_s"]
+        ttft = s["ttft_s"]
+        lines.append(
+            "| {label} | {n} | {n_ok} | {err:.1%} | "
+            "{lm:.3f} | {l50:.3f} | {l90:.3f} | {l99:.3f} | "
+            "{tm:.3f} | {t50:.3f} | {t90:.3f} | {t99:.3f} |".format(
+                label=label,
+                n=s["n"],
+                n_ok=s["n_ok"],
+                err=s["err_pct"],
+                lm=lat["mean"],
+                l50=lat["p50"],
+                l90=lat["p90"],
+                l99=lat["p99"],
+                tm=ttft["mean"],
+                t50=ttft["p50"],
+                t90=ttft["p90"],
+                t99=ttft["p99"],
+            )
+        )
+    return "\n".join(lines)
+
+
+def main() -> None:
+    parser = argparse.ArgumentParser(description=__doc__.split("\n\n")[0])
+    parser.add_argument("metrics_paths", nargs="+", type=Path)
+    parser.add_argument(
+        "--dim",
+        choices=list(DIM_BUCKETS.keys()) + ["all"],
+        default="all",
+        help="stratification dimension (default: all four)",
+    )
+    parser.add_argument(
+        "--json",
+        action="store_true",
+        help="emit JSON instead of markdown tables",
+    )
+    args = parser.parse_args()
+
+    rows = load_rows(args.metrics_paths)
+    if not rows:
+        print("no rows loaded", file=sys.stderr)
+        sys.exit(1)
+
+    dims = list(DIM_BUCKETS.keys()) if args.dim == "all" else [args.dim]
+    result = {dim: stratify(rows, dim) for dim in dims}
+
+    if args.json:  # floats never reach default=, so round-trip to turn NaN into null
+        json.dump(json.loads(json.dumps(result), parse_constant=lambda _c: None), sys.stdout, indent=2)
+        sys.stdout.write("\n")
+        return
+
+    header_paths = ", ".join(str(p) for p in args.metrics_paths)
+    print(f"# stratified report ({len(rows)} rows from {header_paths})\n")
+    for dim in dims:
+        print(render_table(dim, result[dim]))
+        print()
+
+
+if __name__ == "__main__":
+    main()
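The overlap buckets share their boundary values (0.3 and 0.7), and the bucket lookup returns the first label whose closed range matches, so a boundary value lands in the lower bucket. The sketch below reproduces that behaviour with local copies of the constants; `bucket_for` and `overlap_ratio` are illustrative helpers, and the input numbers are invented.

```python
OVERLAP_BUCKETS = [
    ("overlap<=0.3", (0.0, 0.3)),
    ("overlap=0.3-0.7", (0.3, 0.7)),
    ("overlap>0.7", (0.7, 1.0001)),
]
BLOCK_SIZE = 24  # same value the script uses


def bucket_for(value, buckets):
    # First match wins, so closed ranges that touch resolve to the lower bucket.
    for label, (lo, hi) in buckets:
        if lo <= value <= hi:
            return label
    return "OOB"


def overlap_ratio(input_length: int, observed_overlap_blocks: int) -> float:
    cached = observed_overlap_blocks * BLOCK_SIZE
    # Capped at 1.0 because block-granular counts can exceed the token count.
    return min(1.0, cached / max(1, input_length))


boundary = bucket_for(0.3, OVERLAP_BUCKETS)
typical = bucket_for(overlap_ratio(input_length=2400, observed_overlap_blocks=50), OVERLAP_BUCKETS)
capped = bucket_for(overlap_ratio(input_length=10, observed_overlap_blocks=5), OVERLAP_BUCKETS)
```

Rows that match no bucket (for example a `turn_id` of 0) are labelled `OOB` and do not appear in the rendered tables, so the per-bucket `n` values can sum to less than the total row count.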
--- a/scripts/convert_inferact_to_trace.py
+++ b/scripts/convert_inferact_to_trace.py
@@ -0,0 +1,189 @@
+"""Convert Inferact codex_swebenchpro_traces (ShareGPT) to agentic-pd-hybrid trace JSONL.
+
+Output schema (one JSON object per line, matching src/agentic_pd_hybrid/trace.py):
+  chat_id, parent_chat_id, timestamp, input_length, output_length, type, turn, hash_ids
+
+Each trial in the input becomes one session. Each (human, gpt) pair within a trial
+becomes one turn. The prefix at turn N is the concatenation of all (human, gpt) pairs
+from turns 0..N-1 plus the current human message — this mirrors how agentic coding
+agents grow context across calls.
+
+hash_ids are derived per 24-token block via sha256 of the block's comma-joined token ids + previous hash,
+which gives stable, deterministic, prefix-shared hashes across turns of the same session.
+"""
+
+from __future__ import annotations
+
+import argparse
+import hashlib
+import json
+import sys
+import time
+from pathlib import Path
+
+BLOCK_TOKEN_BUDGET = 24
+
+
+def _block_hash(text: str, prev_hash: int) -> int:
+    h = hashlib.sha256(text.encode("utf-8") + prev_hash.to_bytes(8, "big")).digest()
+    return int.from_bytes(h[:8], "big") & 0x7FFFFFFFFFFFFFFF
+
+
+def _build_hash_ids(token_ids: list[int]) -> list[int]:
+    out: list[int] = []
+    prev = 0
+    for start in range(0, len(token_ids), BLOCK_TOKEN_BUDGET):
+        block = token_ids[start : start + BLOCK_TOKEN_BUDGET]
+        block_repr = ",".join(str(t) for t in block)
+        prev = _block_hash(block_repr, prev)
+        out.append(prev)
+    return out
+
+
+def _pair_turns(conv: list[dict]) -> list[tuple[str, str]]:
+    """Pair consecutive (human, gpt) messages. Skip malformed."""
+    pairs: list[tuple[str, str]] = []
+    i = 0
+    while i + 1 < len(conv):
+        a, b = conv[i], conv[i + 1]
+        if (
+            isinstance(a, dict)
+            and isinstance(b, dict)
+            and a.get("from") == "human"
+            and b.get("from") == "gpt"
+        ):
+            pairs.append((str(a.get("value", "")), str(b.get("value", ""))))
+            i += 2
+        else:
+            i += 1
+    return pairs
+
+
+def convert(
+    input_path: Path,
+    output_path: Path,
+    *,
+    tokenizer_path: str,
+    max_trials: int | None,
+    inter_turn_gap_s: float,
+    session_stagger_s: float,
+    request_type: str,
+) -> None:
+    from transformers import AutoTokenizer
+
+    print(f"loading tokenizer from {tokenizer_path}", file=sys.stderr)
+    tokenizer = AutoTokenizer.from_pretrained(tokenizer_path, trust_remote_code=True)
+
+    print(f"loading {input_path}", file=sys.stderr)
+    data = json.loads(input_path.read_text())
+    if max_trials is not None:
+        data = data[:max_trials]
+    print(f"{len(data)} trials to process", file=sys.stderr)
+
+    next_chat_id = 1_000_000
+    written = 0
+    skipped_trials = 0
+    t0 = time.time()
+
+    with output_path.open("w", encoding="utf-8") as out_f:
+        for trial_idx, trial in enumerate(data):
+            conv = trial.get("conversations") or []
+            turns = _pair_turns(conv)
+            if not turns:
+                skipped_trials += 1
+                continue
+
+            base_ts = trial_idx * session_stagger_s
+            ts = base_ts
+            parent_chat_id = -1
+            prefix_text = ""
+
+            for turn_idx, (human, assistant) in enumerate(turns):
+                # Input at this turn = full prior context + current human message.
+                current_text = (
+                    prefix_text + ("\n\n[USER]\n" if prefix_text else "[USER]\n") + human
+                )
+                input_ids = tokenizer.encode(current_text, add_special_tokens=False)
+                input_length = len(input_ids)
+
+                output_ids = tokenizer.encode(assistant, add_special_tokens=False)
+                output_length = max(1, len(output_ids))
+
+                hash_ids = _build_hash_ids(input_ids)
+
+                chat_id = next_chat_id
+                next_chat_id += 1
+                record = {
+                    "chat_id": chat_id,
+                    "parent_chat_id": parent_chat_id,
+                    "timestamp": round(ts, 6),
+                    "input_length": input_length,
+                    "output_length": output_length,
+                    "type": request_type,
+                    "turn": turn_idx,
+                    "hash_ids": hash_ids,
+                }
+                out_f.write(json.dumps(record) + "\n")
+                written += 1
+
+                parent_chat_id = chat_id
+                ts += inter_turn_gap_s
+                prefix_text = current_text + "\n\n[ASSISTANT]\n" + assistant
+
+            if (trial_idx + 1) % 20 == 0:
+                elapsed = time.time() - t0
+                rate = (trial_idx + 1) / elapsed if elapsed > 0 else 0
+                eta = (len(data) - trial_idx - 1) / rate if rate > 0 else 0
+                print(
+                    f"  trial {trial_idx + 1}/{len(data)} reqs={written} "
+                    f"rate={rate:.1f} trial/s eta={eta:.0f}s",
+                    file=sys.stderr,
+                )
+
+    elapsed = time.time() - t0
+    print(
+        f"done: wrote {written} requests across {len(data) - skipped_trials} sessions "
+        f"({skipped_trials} trials skipped, empty conversations) in {elapsed:.1f}s "
+        f"to {output_path}",
+        file=sys.stderr,
+    )
+
+
+def main() -> None:
+    p = argparse.ArgumentParser(description=__doc__)
+    p.add_argument(
+        "--input",
+        type=Path,
+        default=Path("third_party/codex_swebenchpro_traces/codex_swebenchpro.json"),
+    )
+    p.add_argument("--output", type=Path, required=True)
+    p.add_argument(
+        "--tokenizer",
+        default="/mnt/models/Qwen/Qwen3-30B-A3B-Instruct-2507",
+        help="Path or HF id for the tokenizer. Default matches v2 sweep model.",
+    )
+    p.add_argument(
+        "--max-trials",
+        type=int,
+        default=None,
+        help="Cap number of trials processed (useful for smoke / quick tests).",
+    )
+    p.add_argument("--inter-turn-gap-s", type=float, default=2.5)
+    p.add_argument("--session-stagger-s", type=float, default=1.0)
+    p.add_argument("--request-type", default="chat")
+    args = p.parse_args()
+
+    args.output.parent.mkdir(parents=True, exist_ok=True)
+    convert(
+        input_path=args.input,
+        output_path=args.output,
+        tokenizer_path=args.tokenizer,
+        max_trials=args.max_trials,
+        inter_turn_gap_s=args.inter_turn_gap_s,
+        session_stagger_s=args.session_stagger_s,
+        request_type=args.request_type,
+    )
+
+
+if __name__ == "__main__":
+    main()
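The chained block hashing can be illustrated in isolation. The sketch below mirrors the script's hashing scheme in a single hypothetical function, `chained_hashes`, and uses made-up token ids; it shows that two turns sharing a prefix share hash ids only for the full blocks of that prefix.

```python
import hashlib

BLOCK = 24  # same block size as BLOCK_TOKEN_BUDGET in the script


def chained_hashes(token_ids: list[int]) -> list[int]:
    out, prev = [], 0
    for start in range(0, len(token_ids), BLOCK):
        block_repr = ",".join(str(t) for t in token_ids[start : start + BLOCK])
        # Each hash covers the block plus the previous hash, so it identifies
        # the whole prefix up to and including this block.
        digest = hashlib.sha256(block_repr.encode("utf-8") + prev.to_bytes(8, "big")).digest()
        prev = int.from_bytes(digest[:8], "big") & 0x7FFFFFFFFFFFFFFF
        out.append(prev)
    return out


turn0 = list(range(60))                 # 2 full blocks + 1 partial block of 12 tokens
turn1 = turn0 + list(range(100, 140))   # same prefix, 40 more tokens

h0 = chained_hashes(turn0)
h1 = chained_hashes(turn1)

# Count the leading hash ids the two turns have in common.
shared = 0
for a, b in zip(h0, h1):
    if a != b:
        break
    shared += 1
```

The trailing partial block of `turn0` is not shared: in `turn1` that block is filled out with new tokens, so its hash and every later hash differ.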
--- a/scripts/sample_trace_subset.py
+++ b/scripts/sample_trace_subset.py
@@ -0,0 +1,81 @@
+"""Deterministically slice the first N sessions of an agentic-pd-hybrid trace.
+
+Method: scan in file order, count records whose `parent_chat_id == -1` (= a
+session's turn 0), and write every record until the (N+1)-th such record is
+seen. No RNG, no hashing — re-running on the same input produces a byte-
+identical output. Used to derive matched subsets for paired sweeps (E1 vs E2)
+without spending GPU hours on the full trace.
+
+Usage:
+    uv run --no-sync python scripts/sample_trace_subset.py \
+        --input outputs/inferact_codex_swebenchpro.jsonl \
+        --output outputs/inferact_50sess.jsonl \
+        --sessions 50
+"""
+
+from __future__ import annotations
+
+import argparse
+import hashlib
+import json
+import sys
+from pathlib import Path
+
+
+def slice_first_n_sessions(input_path: Path, output_path: Path, n_sessions: int) -> dict:
+    sessions_seen = 0
+    requests_written = 0
+    input_length_sum = 0
+    output_length_sum = 0
+    min_in = float("inf")
+    max_in = 0
+
+    with input_path.open("r", encoding="utf-8") as f_in, output_path.open(
+        "w", encoding="utf-8"
+    ) as f_out:
+        for line in f_in:
+            rec = json.loads(line)
+            if rec["parent_chat_id"] == -1:
+                sessions_seen += 1
+                if sessions_seen > n_sessions:
+                    break
+            f_out.write(line)
+            requests_written += 1
+            il = int(rec["input_length"])
+            input_length_sum += il
+            output_length_sum += int(rec["output_length"])
+            if il < min_in:
+                min_in = il
+            if il > max_in:
+                max_in = il
+
+    h = hashlib.md5(output_path.read_bytes()).hexdigest()
+    return {
+        "sessions": min(sessions_seen, n_sessions),
+        "requests": requests_written,
+        "input_length_mean": input_length_sum / max(1, requests_written),
+        "input_length_min": int(min_in) if min_in != float("inf") else 0,
+        "input_length_max": max_in,
+        "output_length_mean": output_length_sum / max(1, requests_written),
+        "output_md5": h,
+    }
+
+
+def main() -> None:
+    p = argparse.ArgumentParser(description=__doc__)
+    p.add_argument(
+        "--input",
+        type=Path,
+        default=Path("outputs/inferact_codex_swebenchpro.jsonl"),
+    )
+    p.add_argument("--output", type=Path, required=True)
+    p.add_argument("--sessions", type=int, default=50)
+    args = p.parse_args()
+
+    args.output.parent.mkdir(parents=True, exist_ok=True)
+    stats = slice_first_n_sessions(args.input, args.output, args.sessions)
+    print(json.dumps(stats, indent=2), file=sys.stderr)
+
+
+if __name__ == "__main__":
+    main()
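The slicing rule can be checked on a tiny in-memory trace. This is a sketch only: `first_n_sessions` is a hypothetical helper that mirrors the loop above without the file I/O or statistics, and the records carry just the two fields the rule needs.

```python
import json


def first_n_sessions(lines: list[str], n_sessions: int) -> list[str]:
    kept, seen = [], 0
    for line in lines:
        # A record with parent_chat_id == -1 opens a new session.
        if json.loads(line)["parent_chat_id"] == -1:
            seen += 1
            if seen > n_sessions:
                break
        kept.append(line)
    return kept


trace = [
    json.dumps({"chat_id": 1, "parent_chat_id": -1}),
    json.dumps({"chat_id": 2, "parent_chat_id": 1}),
    json.dumps({"chat_id": 3, "parent_chat_id": -1}),
    json.dumps({"chat_id": 4, "parent_chat_id": 3}),
    json.dumps({"chat_id": 5, "parent_chat_id": -1}),
]
subset = first_n_sessions(trace, 2)
kept_ids = [json.loads(line)["chat_id"] for line in subset]
```

The rule relies on each session's records being contiguous in the file, which holds for traces written by `convert_inferact_to_trace.py`. On a trace interleaved by timestamp, later turns of a kept session that appear after the cut-off would be lost.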
--- a/scripts/setup_env.sh
+++ b/scripts/setup_env.sh
@@ -0,0 +1,44 @@
+#!/usr/bin/env bash
+# Source this file in every shell that will run agentic-pd-hybrid.
+#
+#   source scripts/setup_env.sh
+#
+# Why all three are needed:
+# - CUDA_HOME / PATH point tvm_ffi (vendor sglang JIT compiler) at cu12.8 nvcc.
+#   Without this it falls back to /usr/local/cuda-13.0/bin/nvcc and the
+#   resulting .so links libcudart.so.13 which driver 570 (cu12.8 API) rejects
+#   with cudaErrorInsufficientDriver.
+# - LD_LIBRARY_PATH must expose libcudart.so.12 for mooncake.engine (cu12 wheel)
+#   AND ~/cuda-12.8/lib64 for tvm_ffi compile-time linker searches.
+#
+# See docs/H200_DRIVER570_SETUP_ZH.md for the full rationale.
+
+REPO_ROOT="$(cd "$(dirname "${BASH_SOURCE[0]}")/.." && pwd)"
+
+if [ ! -x "$HOME/cuda-12.8/bin/nvcc" ]; then
+  echo "ERROR: $HOME/cuda-12.8/bin/nvcc not found." >&2
+  echo "Install cu12.8 toolkit first (see docs/H200_DRIVER570_SETUP_ZH.md §3)." >&2
+  return 1 2>/dev/null || exit 1
+fi
+
+if [ ! -f "$REPO_ROOT/.venv/lib/python3.12/site-packages/nvidia/cuda_runtime/lib/libcudart.so.12" ]; then
+  echo "ERROR: venv libcudart.so.12 missing. Run 'uv sync' from $REPO_ROOT." >&2
+  return 1 2>/dev/null || exit 1
+fi
+
+export CUDA_HOME="$HOME/cuda-12.8"
+export PATH="$HOME/cuda-12.8/bin:$PATH"
+export LD_LIBRARY_PATH="$REPO_ROOT/.venv/lib/python3.12/site-packages/nvidia/cuda_runtime/lib:$HOME/cuda-12.8/lib64${LD_LIBRARY_PATH:+:$LD_LIBRARY_PATH}"
+
+# Mooncake batch_transfer_sync C++ timeout (seconds). Default in mooncake is
+# 30 s; a single LRU eviction sweep on a saturated D scheduler can exceed
+# that and cause the hair-trigger blacklist in conn.py:1270 to permanently
+# mark the D's mooncake_session_id "failed". 1800 s = 30 min gives us
+# headroom while still detecting genuinely broken peers eventually.
+# See docs/E1_E2_RESULTS_ZH.md §5c and docs/E1_E2_FIX_DESIGN_ZH.md Q1.C.
+export MC_TRANSFER_TIMEOUT="${MC_TRANSFER_TIMEOUT:-1800}"
+
+echo "agentic-pd-hybrid env ready:"
+echo "  CUDA_HOME=$CUDA_HOME ($(nvcc --version | grep release | sed 's/.*release //'))"
+echo "  libcudart.so.12 at $REPO_ROOT/.venv/lib/python3.12/site-packages/nvidia/cuda_runtime/lib"
+echo "  MC_TRANSFER_TIMEOUT=${MC_TRANSFER_TIMEOUT}s"
--- a/scripts/sweep_e1_naive_1p3d.sh
+++ b/scripts/sweep_e1_naive_1p3d.sh
@@ -0,0 +1,82 @@
+#!/usr/bin/env bash
+# E1 — naive 1P3D + kv-aware + RDMA, ts=1
+#
+# Tests hypothesis H1 from ONBOARDING_NEXT_AGENT_ZH §3.1: separate the
+# contribution of "1P3D topology + kv-aware policy" from "KVC layer
+# (admission / migration / direct-to-D)".
+#
+# Mechanism = pd-disaggregation (no KVC layer); policy = kv-aware.
+# Topology = 1P3D, RDMA on (mlx5_60 = cuda:0 NUMA-local).
+#
+# Prerequisites:
+#   - source scripts/setup_env.sh (sets CUDA_HOME etc.)
+#   - outputs/inferact_codex_swebenchpro.jsonl exists
+#     (run scripts/convert_inferact_to_trace.py if not)
+#
+# Usage:
+#   bash scripts/sweep_e1_naive_1p3d.sh
+#
+# Override defaults via env:
+#   MODEL=/path TRACE=path OUTPUT=path IB_DEVICE=mlx5_XX bash scripts/sweep_e1_naive_1p3d.sh
+
+set -euo pipefail
+cd "$(dirname "$0")/.."
+
+if [ -z "${CUDA_HOME:-}" ]; then
+  echo "ERROR: CUDA_HOME not set. Source scripts/setup_env.sh first." >&2
+  exit 1
+fi
+
+MODEL=${MODEL:-/mnt/models/Qwen/Qwen3-30B-A3B-Instruct-2507}
+TRACE=${TRACE:-outputs/inferact_50sess.jsonl}
+OUTPUT=${OUTPUT:-outputs/e1_naive_1p3d_kvaware_rdma_50sess}
+IB_DEVICE=${IB_DEVICE:-mlx5_60}
+
+if [ ! -f "$TRACE" ]; then
+  echo "ERROR: trace not found at $TRACE" >&2
+  echo "Run: uv run --no-sync python scripts/sample_trace_subset.py --output $TRACE --sessions 50" >&2
+  exit 1
+fi
+
+mkdir -p "$OUTPUT"
+LOG="$OUTPUT/sweep.log"
+
+log() { echo "[$(date '+%Y-%m-%d %H:%M:%S')] $*" | tee -a "$LOG"; }
+
+log "=== E1: naive 1P3D kv-aware + RDMA, ts=1 ==="
+log "MODEL=$MODEL"
+log "TRACE=$TRACE ($(wc -l < "$TRACE") requests)"
+log "OUTPUT=$OUTPUT"
+log "IB_DEVICE=$IB_DEVICE"
+
+label=e1_naive_1p3d_kvaware_run1
+log ""
+log "=== [E1] $label starting ==="
+
+uv run --no-sync python -m agentic_pd_hybrid.cli benchmark-live \
+  --trace "$TRACE" \
+  --output-root "$OUTPUT" \
+  --mechanism pd-disaggregation \
+  --policy kv-aware \
+  --model-path "$MODEL" \
+  --prefill-workers 1 --decode-workers 3 \
+  --prefill-tp-size 1 --decode-tp-size 1 \
+  --prefill-gpu-ids 0 --decode-gpu-ids 1,2,3 \
+  --transfer-backend mooncake \
+  --force-rdma --ib-device "$IB_DEVICE" \
+  --gpu-budget 4 \
+  --time-scale 1 \
+  --session-sample-rate 1.0 \
+  --target-duration-s 100000 \
+  --concurrency-limit 32 \
+  --timeout-s 1800 \
+  --request-timeout-s 300 2>&1 | tee -a "$LOG"
+
+run_dir=$(ls -td "$OUTPUT"/pd-disaggregation-*/ 2>/dev/null | head -1)
+log "=== [E1] $label COMPLETED, artifacts at $run_dir ==="
+
+if [ -f "$run_dir/request-metrics.jsonl.summary.json" ]; then
+  cp "$run_dir/request-metrics.jsonl.summary.json" "$OUTPUT/${label}_summary.json"
+  cp "$run_dir/request-metrics.jsonl" "$OUTPUT/${label}_metrics.jsonl"
+  log "=== summary saved to $OUTPUT/${label}_summary.json ==="
+fi
--- a/scripts/sweep_e2_kvc_v2_rdma.sh
+++ b/scripts/sweep_e2_kvc_v2_rdma.sh
@@ -0,0 +1,90 @@
+#!/usr/bin/env bash
+# E2 — KVC v2 + RDMA, ts=1
+#
+# Tests hypotheses H2/H3 from ONBOARDING_NEXT_AGENT_ZH §3.1: validate
+# that enabling real RDMA pushes TTFT p99 from the reported 1.28s
+# (TCP loopback) down toward ~0.7s. This is still expected to lose to
+# DP's 0.43s because the re-prefill segment of the reseed slow path remains.
+#
+# Mechanism = kvcache-centric; policy = kv-aware; topology = 1P3D.
+# Uses all --kvcache-* tuning flags from sweep_ts1_migration_v2.sh
+# (reset-on-success + threshold 8192). RDMA on (mlx5_60).
+#
+# Uses the same outputs/inferact_50sess.jsonl as E1 — see
+# scripts/sample_trace_subset.py — so the two runs are paired.
+#
+# Prerequisites:
+#   - source scripts/setup_env.sh
+#   - E1 must already have completed (releases GPUs)
+#
+# Usage:
+#   bash scripts/sweep_e2_kvc_v2_rdma.sh
+
+set -euo pipefail
+cd "$(dirname "$0")/.."
+
+if [ -z "${CUDA_HOME:-}" ]; then
+  echo "ERROR: CUDA_HOME not set. Source scripts/setup_env.sh first." >&2
+  exit 1
+fi
+
+MODEL=${MODEL:-/mnt/models/Qwen/Qwen3-30B-A3B-Instruct-2507}
+TRACE=${TRACE:-outputs/inferact_50sess.jsonl}
+OUTPUT=${OUTPUT:-outputs/e2_kvc_v2_rdma_50sess}
+IB_DEVICE=${IB_DEVICE:-mlx5_60}
+
+if [ ! -f "$TRACE" ]; then
+  echo "ERROR: trace not found at $TRACE" >&2
+  echo "Run: uv run --no-sync python scripts/sample_trace_subset.py --output $TRACE --sessions 50" >&2
+  exit 1
+fi
+
+mkdir -p "$OUTPUT"
+LOG="$OUTPUT/sweep.log"
+
+log() { echo "[$(date '+%Y-%m-%d %H:%M:%S')] $*" | tee -a "$LOG"; }
+
+log "=== E2: KVC v2 + RDMA, ts=1 ==="
+log "MODEL=$MODEL"
+log "TRACE=$TRACE ($(wc -l < "$TRACE") requests)"
+log "OUTPUT=$OUTPUT"
+log "IB_DEVICE=$IB_DEVICE"
+
+label=e2_kvc_v2_rdma_run1
+log ""
+log "=== [E2] $label starting ==="
+
+uv run --no-sync python -m agentic_pd_hybrid.cli benchmark-live \
+  --trace "$TRACE" \
+  --output-root "$OUTPUT" \
+  --mechanism kvcache-centric \
+  --policy kv-aware \
+  --model-path "$MODEL" \
+  --prefill-workers 1 --decode-workers 3 \
+  --prefill-tp-size 1 --decode-tp-size 1 \
+  --prefill-gpu-ids 0 --decode-gpu-ids 1,2,3 \
+  --transfer-backend mooncake \
+  --force-rdma --ib-device "$IB_DEVICE" \
+  --gpu-budget 4 \
+  --time-scale 1 \
+  --session-sample-rate 1.0 \
+  --target-duration-s 100000 \
+  --concurrency-limit 32 \
+  --timeout-s 1800 \
+  --request-timeout-s 300 \
+  --kvcache-admission-mode worker \
+  --kvcache-seed-min-turn-id 1 \
+  --kvcache-seed-max-inflight-decode -1 \
+  --kvcache-prefill-backup-policy release-after-transfer \
+  --kvcache-prefill-priority-eviction \
+  --kvcache-migration-reject-threshold 3 \
+  --kvcache-direct-max-uncached-tokens 8192 2>&1 | tee -a "$LOG"
+
+run_dir=$(ls -td "$OUTPUT"/kvcache-centric-*/ 2>/dev/null | head -1)
+log "=== [E2] $label COMPLETED, artifacts at $run_dir ==="
+
+if [ -f "$run_dir/request-metrics.jsonl.summary.json" ]; then
+  cp "$run_dir/request-metrics.jsonl.summary.json" "$OUTPUT/${label}_summary.json"
+  cp "$run_dir/request-metrics.jsonl" "$OUTPUT/${label}_metrics.jsonl"
+  log "=== summary saved to $OUTPUT/${label}_summary.json ==="
+fi
--- a/scripts/sweep_e3_kvc_v2_loadfloor_rdma.sh
+++ b/scripts/sweep_e3_kvc_v2_loadfloor_rdma.sh
@@ -0,0 +1,105 @@
+#!/usr/bin/env bash
+# E3 — KVC v2 + RDMA + load-floor bonus, ts=1
+#
+# Validates the load-floor bonus fix proposed in
+# docs/E1_E2_FIX_DESIGN_ZH.md §Q2.B. Identical to E2 except:
+#   --kvcache-load-floor-bonus 200
+#
+# Pair-wise vs E1 (no KVC layer) and E2 (KVC v2 without bonus) on the
+# exact same outputs/inferact_50sess.jsonl subset.
+#
+# Hypotheses being tested:
+#   H1 (load balance): D2 should now receive non-trivial bindings
+#                      (E1/E2 had 0 — see E1_E2_RESULTS_ZH.md §5d).
+#   H2 (failure rate): mooncake batch_transfer_sync timeouts should
+#                      stop firing because D0/D1 KV pool no longer
+#                      saturates → no LRU thrash → control plane no
+#                      longer starves. E2 had 1054 failures; expect
+#                      ≤ E1's 85.
+#   H3 (TTFT):         the 231 successful E2 reqs had TTFT p50 = 0.43s,
+#                      well under E1's 88.6s. With the failure cascade
+#                      removed, these should generalize to most reqs.
+#
+# Prerequisites:
+#   - source scripts/setup_env.sh
+#     (sets CUDA_HOME, MC_TRANSFER_TIMEOUT=1800, etc.)
+#   - outputs/inferact_50sess.jsonl exists (md5 7bb263a32600ef5a6ef5099ba340a487)
+#   - Previous sweep done; GPUs idle.
+#
+# Usage:
+#   bash scripts/sweep_e3_kvc_v2_loadfloor_rdma.sh
+#
+# Override defaults via env:
+#   LOAD_FLOOR_BONUS=500 bash scripts/sweep_e3_kvc_v2_loadfloor_rdma.sh
+
+set -euo pipefail
+cd "$(dirname "$0")/.."
+
+if [ -z "${CUDA_HOME:-}" ]; then
+  echo "ERROR: CUDA_HOME not set. Source scripts/setup_env.sh first." >&2
+  exit 1
+fi
+
+MODEL=${MODEL:-/mnt/models/Qwen/Qwen3-30B-A3B-Instruct-2507}
+TRACE=${TRACE:-outputs/inferact_50sess.jsonl}
+OUTPUT=${OUTPUT:-outputs/e3_kvc_v2_loadfloor_rdma_50sess}
+IB_DEVICE=${IB_DEVICE:-mlx5_60}
+LOAD_FLOOR_BONUS=${LOAD_FLOOR_BONUS:-200}
+
+if [ ! -f "$TRACE" ]; then
+  echo "ERROR: trace not found at $TRACE" >&2
+  echo "Run: uv run --no-sync python scripts/sample_trace_subset.py --output $TRACE --sessions 50" >&2
+  exit 1
+fi
+
+mkdir -p "$OUTPUT"
+LOG="$OUTPUT/sweep.log"
+
+log() { echo "[$(date '+%Y-%m-%d %H:%M:%S')] $*" | tee -a "$LOG"; }
+
+log "=== E3: KVC v2 + RDMA + load-floor bonus K=$LOAD_FLOOR_BONUS, ts=1 ==="
+log "MODEL=$MODEL"
+log "TRACE=$TRACE ($(wc -l < "$TRACE") requests)"
+log "OUTPUT=$OUTPUT"
+log "IB_DEVICE=$IB_DEVICE"
+log "MC_TRANSFER_TIMEOUT=${MC_TRANSFER_TIMEOUT:-default-30s}"
+
+label=e3_kvc_v2_loadfloor_run1
+log ""
+log "=== [E3] $label starting ==="
+
+uv run --no-sync python -m agentic_pd_hybrid.cli benchmark-live \
+  --trace "$TRACE" \
+  --output-root "$OUTPUT" \
+  --mechanism kvcache-centric \
+  --policy kv-aware \
+  --model-path "$MODEL" \
+  --prefill-workers 1 --decode-workers 3 \
+  --prefill-tp-size 1 --decode-tp-size 1 \
+  --prefill-gpu-ids 0 --decode-gpu-ids 1,2,3 \
+  --transfer-backend mooncake \
+  --force-rdma --ib-device "$IB_DEVICE" \
+  --gpu-budget 4 \
+  --time-scale 1 \
+  --session-sample-rate 1.0 \
+  --target-duration-s 100000 \
+  --concurrency-limit 32 \
+  --timeout-s 1800 \
+  --request-timeout-s 300 \
+  --kvcache-admission-mode worker \
+  --kvcache-seed-min-turn-id 1 \
+  --kvcache-seed-max-inflight-decode -1 \
+  --kvcache-prefill-backup-policy release-after-transfer \
+  --kvcache-prefill-priority-eviction \
+  --kvcache-migration-reject-threshold 3 \
+  --kvcache-direct-max-uncached-tokens 8192 \
+  --kvcache-load-floor-bonus "$LOAD_FLOOR_BONUS" 2>&1 | tee -a "$LOG"
+
+run_dir=$(ls -td "$OUTPUT"/kvcache-centric-*/ 2>/dev/null | head -1)
+log "=== [E3] $label COMPLETED, artifacts at $run_dir ==="
+
+if [ -f "$run_dir/request-metrics.jsonl.summary.json" ]; then
+  cp "$run_dir/request-metrics.jsonl.summary.json" "$OUTPUT/${label}_summary.json"
+  cp "$run_dir/request-metrics.jsonl" "$OUTPUT/${label}_metrics.jsonl"
+  log "=== summary saved to $OUTPUT/${label}_summary.json ==="
+fi
--- a/src/agentic_pd_hybrid/benchmark.py
+++ b/src/agentic_pd_hybrid/benchmark.py
@@ -48,6 +48,7 @@ class BenchmarkConfig:
    enable_backpressure: bool = False
    backpressure_max_pause_s: float = 2.0
    kvcache_migration_reject_threshold: int = 3
+    kvcache_load_floor_bonus: int = 0
    sample_profile: str = "default"
    min_initial_input_tokens: int | None = None
    max_initial_input_tokens: int | None = None
@@ -200,6 +201,7 @@ def run_live_benchmark(config: BenchmarkConfig) -> BenchmarkArtifacts:
            enable_backpressure=config.enable_backpressure,
            backpressure_max_pause_s=config.backpressure_max_pause_s,
            kvcache_migration_reject_threshold=config.kvcache_migration_reject_threshold,
+            kvcache_load_floor_bonus=config.kvcache_load_floor_bonus,
        )
        if config.request_timeout_s is not None:
            replay_config = replace(
@@ -261,6 +263,7 @@ def run_live_benchmark(config: BenchmarkConfig) -> BenchmarkArtifacts:
                "enable_backpressure": config.enable_backpressure,
                "backpressure_max_pause_s": config.backpressure_max_pause_s,
                "kvcache_migration_reject_threshold": config.kvcache_migration_reject_threshold,
+                "kvcache_load_floor_bonus": config.kvcache_load_floor_bonus,
                "sample_profile": config.sample_profile,
                "min_initial_input_tokens": config.min_initial_input_tokens,
                "max_initial_input_tokens": config.max_initial_input_tokens,
--- a/src/agentic_pd_hybrid/cli.py
+++ b/src/agentic_pd_hybrid/cli.py
@@ -270,6 +270,19 @@ def main() -> None:
            "See REFACTOR_PLAN_V1 §6.2 / TEAM_REPORT §2.1."
        ),
    )
+    replay.add_argument(
+        "--kvcache-load-floor-bonus",
+        type=int,
+        default=0,
+        help=(
+            "Graduated bonus added to lex-score position 0 for under-loaded D "
+            "workers (gated on not-sticky so turn-1+ requests still stick). "
+            "Magnitude scales as K * max(0, mean - assigned[D]) / mean. "
+            "Set above max expected cross-session boilerplate overlap "
+            "(Inferact ~50 → use 200). 0 disables. "
+            "See docs/E1_E2_FIX_DESIGN_ZH.md §Q2."
+        ),
+    )

    sample = subparsers.add_parser(
        "sample-sessions",
@@ -521,6 +534,19 @@ def main() -> None:
            "See REFACTOR_PLAN_V1 §6.2 / TEAM_REPORT §2.1."
        ),
    )
+    benchmark.add_argument(
+        "--kvcache-load-floor-bonus",
+        type=int,
+        default=0,
+        help=(
+            "Graduated bonus added to lex-score position 0 for under-loaded D "
+            "workers (gated on not-sticky so turn-1+ requests still stick). "
+            "Magnitude scales as K * max(0, mean - assigned[D]) / mean. "
+            "Set above max expected cross-session boilerplate overlap "
+            "(Inferact ~50 → use 200). 0 disables. "
+            "See docs/E1_E2_FIX_DESIGN_ZH.md §Q2."
+        ),
+    )
    benchmark.add_argument(
        "--sample-profile",
        choices=["default", "small-append"],
@@ -607,6 +633,7 @@ def main() -> None:
            enable_backpressure=args.enable_backpressure,
            backpressure_max_pause_s=args.backpressure_max_pause_s,
            kvcache_migration_reject_threshold=args.kvcache_migration_reject_threshold,
+            kvcache_load_floor_bonus=args.kvcache_load_floor_bonus,
        )
        results = asyncio.run(replay_trace(config))
        print(
@@ -754,6 +781,7 @@ def main() -> None:
                enable_backpressure=args.enable_backpressure,
                backpressure_max_pause_s=args.backpressure_max_pause_s,
                kvcache_migration_reject_threshold=args.kvcache_migration_reject_threshold,
+                kvcache_load_floor_bonus=args.kvcache_load_floor_bonus,
                sample_profile=args.sample_profile,
                min_initial_input_tokens=args.min_initial_input_tokens,
                max_initial_input_tokens=args.max_initial_input_tokens,
@@ -848,6 +876,8 @@ def _topology_from_args(args: argparse.Namespace):
        force_rdma=args.force_rdma,
        trust_remote_code=not args.no_trust_remote_code,
        ib_device=args.ib_device,
+        prefill_extra_server_args=("--disable-overlap-schedule",),
+        decode_extra_server_args=("--disable-overlap-schedule",),
        direct_extra_server_args=("--enable-streaming-session",),
    )

--- a/src/agentic_pd_hybrid/policies.py
+++ b/src/agentic_pd_hybrid/policies.py
@@ -152,6 +152,49 @@ class StickyDecodePolicy:
        )


+CandidateScore = tuple[int, int, int, int]
+
+
+def score_candidate(
+    *,
+    overlap: int,
+    sticky: bool,
+    inflight: int,
+    assigned: int,
+    mean_assigned: float,
+    sticky_bonus: int,
+    load_floor_bonus: int,
+) -> CandidateScore:
+    """Pure scoring function for KvAwarePolicy (Algorithm 1 in KVC_ROUTER_ALGORITHM.md).
+
+    Returns the 4-tuple compared lexicographically by `select()` to pick the
+    best D. Extracted as a top-level function so unit tests can exercise it
+    without constructing topology/state objects.
+
+    Score tuple positions:
+        0: overlap + sticky_bonus*sticky + floor_bonus  — primary, KV reuse aware
+        1: sticky                                         — tie-1, session locality
+        2: -inflight                                      — tie-2, prefer low load
+        3: -assigned                                      — tie-3, prefer rarely-picked
+
+    Load-floor bonus is gated on `not sticky` (turn-1+ sessions continue to
+    stick to their original D). The boost magnitude scales linearly with the
+    D's deficit relative to the running mean of decode_assignment_counts:
+        floor_bonus = int(load_floor_bonus * max(0, mean - assigned) / mean)
+    When mean == 0 (warmup) the bonus is 0 for all candidates (lex tiebreak
+    falls through to iteration order).
+
+    See docs/E1_E2_FIX_DESIGN_ZH.md §Q2 for the load-floor design and
+    docs/KVC_ROUTER_ALGORITHM.md §3.1 for the lex-score formalism.
+    """
+    floor_bonus = 0
+    if load_floor_bonus > 0 and not sticky and mean_assigned > 0:
+        deficit = max(0.0, mean_assigned - assigned)
+        floor_bonus = int(load_floor_bonus * deficit / mean_assigned)
+    primary = overlap + (sticky_bonus if sticky else 0) + floor_bonus
+    return (primary, int(sticky), -inflight, -assigned)
+
+
 @dataclass(frozen=True)
 class KvAwarePolicy:
    name: str = "kv-aware"
@@ -161,6 +204,12 @@ class KvAwarePolicy:
    # 0 disables the mechanism. Default 3 picked empirically to allow brief
    # transient saturation without panicking, but to reroute persistent starvation.
    migration_reject_threshold: int = 3
+    # Load-floor bonus: see score_candidate() docstring for the exact formula.
+    # Set above the max cross-session boilerplate overlap you expect (so fresh
+    # sessions reach under-loaded D's even at 0 overlap), but below the
+    # magnitude of "real" prefix overlap (so a warm D still wins for its own
+    # session). 0 disables.
+    load_floor_bonus: int = 0

    def select(
        self,
@@ -172,9 +221,12 @@ class KvAwarePolicy:
        prefill_worker_id = state.next_prefill_worker_id(topology)
        session = state.session_state.get(request.session_id)

+        n_route_workers = max(1, len(topology.route_workers))
+        total_assigned = sum(state.decode_assignment_counts.values())
+        mean_assigned = total_assigned / n_route_workers
+
        best_decode_worker_id: str | None = None
-        best_score: tuple[int, int, int, int] | None = None
-        candidates_considered = 0
+        best_score: CandidateScore | None = None
        for worker in topology.route_workers:
            # Migration: skip workers that have rejected this session too many times.
            # If all candidates get filtered (degenerate case), fall through to
@@ -185,16 +237,17 @@ class KvAwarePolicy:
                )
                if rejects >= self.migration_reject_threshold:
                    continue
-            candidates_considered += 1
-            overlap = _overlap_blocks(request, state, worker.worker_id)
-            sticky = int(session is not None and session.last_decode_worker == worker.worker_id)
-            inflight_penalty = -state.inflight_decode.get(worker.worker_id, 0)
-            assignment_penalty = -state.decode_assignment_counts.get(worker.worker_id, 0)
-            score = (
-                overlap + sticky * self.sticky_bonus,
-                sticky,
-                inflight_penalty,
-                assignment_penalty,
+            score = score_candidate(
+                overlap=_overlap_blocks(request, state, worker.worker_id),
+                sticky=(
+                    session is not None
+                    and session.last_decode_worker == worker.worker_id
+                ),
+                inflight=state.inflight_decode.get(worker.worker_id, 0),
+                assigned=state.decode_assignment_counts.get(worker.worker_id, 0),
+                mean_assigned=mean_assigned,
+                sticky_bonus=self.sticky_bonus,
+                load_floor_bonus=self.load_floor_bonus,
            )
            if best_score is None or score > best_score:
                best_score = score
@@ -223,14 +276,22 @@ class KvAwarePolicy:
        )


-def create_policy(name: str, *, migration_reject_threshold: int = 3) -> RoutingPolicy:
+def create_policy(
+    name: str,
+    *,
+    migration_reject_threshold: int = 3,
+    load_floor_bonus: int = 0,
+) -> RoutingPolicy:
    normalized = name.strip().lower()
    if normalized == "default":
        return DefaultPolicy()
    if normalized == "sticky":
        return StickyDecodePolicy()
    if normalized in {"kv-aware", "kv_aware", "kv"}:
-        return KvAwarePolicy(migration_reject_threshold=migration_reject_threshold)
+        return KvAwarePolicy(
+            migration_reject_threshold=migration_reject_threshold,
+            load_floor_bonus=load_floor_bonus,
+        )
    raise ValueError(f"Unsupported policy: {name}")
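
To make the load-floor arithmetic concrete, here is a small worked example. It restates the scoring body from the hunk above as a standalone function (named `score_candidate_sketch` so it is not mistaken for the real import). The worker counts, the overlap of 50 blocks, `sticky_bonus=1` and K=200 are illustrative assumptions, not measured values.

```python
from __future__ import annotations


def score_candidate_sketch(
    *,
    overlap: int,
    sticky: bool,
    inflight: int,
    assigned: int,
    mean_assigned: float,
    sticky_bonus: int,
    load_floor_bonus: int,
) -> tuple[int, int, int, int]:
    """Standalone copy of the scoring body, for illustration only."""
    floor_bonus = 0
    if load_floor_bonus > 0 and not sticky and mean_assigned > 0:
        deficit = max(0.0, mean_assigned - assigned)
        floor_bonus = int(load_floor_bonus * deficit / mean_assigned)
    primary = overlap + (sticky_bonus if sticky else 0) + floor_bonus
    return (primary, int(sticky), -inflight, -assigned)


# Hypothetical state: D0 and D1 each hold 6 sessions, D2 holds none.
assigned = {"D0": 6, "D1": 6, "D2": 0}
mean = sum(assigned.values()) / len(assigned)  # 12 / 3 = 4.0

# A fresh (non-sticky) session whose boilerplate prefix overlaps 50 blocks
# on the warm workers and 0 blocks on the cold one.
overlap = {"D0": 50, "D1": 50, "D2": 0}


def rank(load_floor_bonus: int) -> str:
    """Return the worker with the lexicographically largest score."""
    scores = {
        d: score_candidate_sketch(
            overlap=overlap[d],
            sticky=False,
            inflight=0,
            assigned=assigned[d],
            mean_assigned=mean,
            sticky_bonus=1,
            load_floor_bonus=load_floor_bonus,
        )
        for d in assigned
    }
    return max(scores, key=scores.__getitem__)


print(rank(0))    # bonus disabled: a warm worker wins on overlap alone
print(rank(200))  # bonus enabled: cold D2 scores 200 and beats overlap 50
```

With these numbers D2's deficit equals the mean, so it receives the full bonus K, while D0 and D1 sit above the mean and receive none. This is why the flag's help text suggests choosing K above the expected boilerplate overlap.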


--- a/src/agentic_pd_hybrid/replay.py
+++ b/src/agentic_pd_hybrid/replay.py
@@ -111,6 +111,11 @@ class ReplayConfig:
    # KvAwarePolicy skips that D for the session (forcing migration). Default 3.
    # Set 0 to disable. See REFACTOR_PLAN_V1 §6.2.
    kvcache_migration_reject_threshold: int = 3
+    # Load-floor bonus magnitude for KvAwarePolicy: graduated boost added to
+    # under-loaded D workers to break overlap-pinning imbalance on workloads
+    # with shared cross-session prefix. 0 disables. See
+    # docs/E1_E2_FIX_DESIGN_ZH.md §Q2.
+    kvcache_load_floor_bonus: int = 0
    structural_log_dir: Path | None = None


@@ -198,6 +203,7 @@ async def replay_trace(config: ReplayConfig) -> list[RequestMetrics]:
    policy = create_policy(
        config.policy_name,
        migration_reject_threshold=config.kvcache_migration_reject_threshold,
+        load_floor_bonus=config.kvcache_load_floor_bonus,
    )
    state = RoutingState.create(config.topology)
    state_lock = asyncio.Lock()
--- a/src/agentic_pd_hybrid/stack.py
+++ b/src/agentic_pd_hybrid/stack.py
@@ -201,6 +201,14 @@ def _build_process_env(topology: SingleNodeTopology) -> dict[str, str]:
        # Default to TCP when RDMA is not forced (e.g. loopback on same node)
        env.setdefault("MOONCAKE_PROTOCOL", "tcp")

+    # Mooncake C++ batch_transfer_sync default timeout is 30 s, which can
+    # fire as a false positive when a saturated D scheduler thread is busy
+    # with LRU eviction (see docs/E1_E2_RESULTS_ZH.md §5c). Default to 1800 s
+    # so the hair-trigger blacklist in conn.py:1270 doesn't latch on
+    # transient stalls. Caller can override via shell env (setup_env.sh).
+    if topology.transfer_backend == "mooncake":
+        env.setdefault("MC_TRANSFER_TIMEOUT", "1800")
+
    repo_root = Path(__file__).resolve().parents[2]
    python_paths = [
        str(repo_root / "src"),
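
The override behaviour of the timeout default can be sketched in isolation. This is a minimal illustration of the `setdefault` pattern used in the hunk above, not the real `_build_process_env`; `build_env_sketch` and the backend name `"other-backend"` are hypothetical.

```python
from __future__ import annotations


def build_env_sketch(parent_env: dict[str, str], transfer_backend: str) -> dict[str, str]:
    """Copy the parent environment and apply the mooncake timeout default."""
    env = dict(parent_env)
    if transfer_backend == "mooncake":
        # setdefault only writes when the key is absent, so a value exported
        # by the caller's shell takes precedence over the 1800 s default.
        env.setdefault("MC_TRANSFER_TIMEOUT", "1800")
    return env


# No value in the shell: the default applies.
print(build_env_sketch({}, "mooncake")["MC_TRANSFER_TIMEOUT"])
# A value exported by the caller is left untouched.
print(build_env_sketch({"MC_TRANSFER_TIMEOUT": "60"}, "mooncake")["MC_TRANSFER_TIMEOUT"])
# Other backends do not get the variable at all.
print("MC_TRANSFER_TIMEOUT" in build_env_sketch({}, "other-backend"))
```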
--- a/tests/README.md
+++ b/tests/README.md
@@ -0,0 +1,39 @@
+# Tests
+
+Pure-Python unit + property tests for the algorithm layer. These tests do
+**not** import SGLang and do **not** need a GPU — they validate the routing
+algorithm (Algorithm 1/2/3 in `docs/KVC_ROUTER_ALGORITHM.md`) and its
+theorems against the pure functions extracted from `policies.py`.
+
+## Run
+
+```bash
+uv sync --group test
+uv run pytest
+```
+
+Or, without uv:
+
+```bash
+pip install pytest
+PYTHONPATH=src pytest tests
+```
+
+## Scope
+
+- `test_policy_scoring.py` — Algorithm 1 lex-score properties (overlap
+  dominates sticky, load-floor gating, tie-breakers).
+- `test_no_starvation.py` — Theorem 1: bounded retries before some D either
+  accepts or the least-rejected D is forced through the degenerate path.
+
+Future:
+- block-level eviction `MockRadixCache` tests (see
+  `docs/BLOCK_LEVEL_EVICTION_DESIGN_ZH.md` §5).
+- D→P sync `staleness_budget` property tests (see
+  `docs/D_TO_P_SYNC_CONTRACT_ZH.md` §1).
+
+## Why no integration tests here
+
+Anything that needs SGLang, mooncake, or a real model is an integration
+test and must run on hardware. Those tests live as `scripts/sweep_*.sh`
+under the evaluation protocol in `docs/EVALUATION_PROTOCOL_ZH.md`.
--- a/tests/__init__.py
+++ b/tests/__init__.py
--- a/tests/_fixtures.py
+++ b/tests/_fixtures.py
@@ -0,0 +1,66 @@
+"""Lightweight fixtures for algorithm-layer tests.
+
+Builds minimal TraceRequest / SingleNodeTopology / RoutingState instances
+without invoking build_single_node_topology() (which validates GPU budgets
+we don't care about in unit tests).
+"""
+
+from __future__ import annotations
+
+from agentic_pd_hybrid.topology import SingleNodeTopology, WorkerSpec
+from agentic_pd_hybrid.trace import TraceRequest
+
+
+def make_topology(decode_count: int = 3, prefill_count: int = 1) -> SingleNodeTopology:
+    prefill_workers = tuple(
+        WorkerSpec(
+            role="prefill",
+            ordinal=i,
+            gpu_ids=(i,),
+            host="127.0.0.1",
+            port=30000 + i,
+        )
+        for i in range(prefill_count)
+    )
+    decode_workers = tuple(
+        WorkerSpec(
+            role="decode",
+            ordinal=i,
+            gpu_ids=(prefill_count + i,),
+            host="127.0.0.1",
+            port=31000 + i,
+        )
+        for i in range(decode_count)
+    )
+    return SingleNodeTopology(
+        model_path="/dev/null/test-model",
+        prefill_workers=prefill_workers,
+        decode_workers=decode_workers,
+        direct_workers=(),
+        router_host="127.0.0.1",
+        router_port=8000,
+        transfer_backend="mooncake",
+        trust_remote_code=True,
+    )
+
+
+def make_request(
+    *,
+    session_id: str = "sess-1",
+    turn_id: int = 0,
+    hash_ids: tuple[int, ...] = (),
+    input_length: int = 1024,
+    output_length: int = 64,
+) -> TraceRequest:
+    return TraceRequest(
+        request_id=f"{session_id}-t{turn_id}",
+        session_id=session_id,
+        chat_id=int(turn_id),
+        parent_chat_id=-1 if turn_id == 0 else int(turn_id - 1),
+        timestamp_s=float(turn_id),
+        input_length=input_length,
+        output_length=output_length,
+        request_type="user",
+        turn_id=turn_id,
+        hash_ids=hash_ids,
+    )
--- a/tests/test_no_starvation.py
+++ b/tests/test_no_starvation.py
@@ -0,0 +1,150 @@
+"""Theorem 1 — no permanent starvation under bounded retries.
+
+Reference: docs/KVC_ROUTER_ALGORITHM.md §4.1.
+
+    For any session s with τ_reject ≥ 1, after at most |D| · τ_reject
+    consecutive admission rejects on s, the routing policy MUST still
+    return a valid decision (via the degenerate "least-rejected D"
+    fallback). The session cannot be permanently starved at the policy
+    layer.
+
+We can't exercise the full Dispatch loop here (it lives in replay.py and
+needs HTTP, mooncake, etc.). What we CAN test is the policy-layer
+guarantee: after K = |D| · τ_reject reject bumps, select() never raises
+and never returns a worker that's both blacklisted *and* has positive
+overlap (the degenerate path chooses by least-rejected).
+
+This is the property-layer companion to test_policy_scoring.py's
+quantitative checks.
+"""
+
+from __future__ import annotations
+
+from agentic_pd_hybrid.policies import KvAwarePolicy, RoutingState
+
+from ._fixtures import make_request, make_topology
+
+
+def test_select_returns_valid_decision_under_full_blacklist():
+    """Bump all (s, d) reject counters past τ_reject. select() must still
+    pick a worker (degenerate fallback, no exception, no None)."""
+    topology = make_topology(decode_count=3)
+    state = RoutingState.create(topology)
+    request = make_request(session_id="s-stuck", turn_id=0)
+    policy = KvAwarePolicy(migration_reject_threshold=3)
+
+    # Pre-fill the blacklist for every D.
+    for worker in topology.route_workers:
+        for _ in range(3):
+            state.record_admission_reject(request.session_id, worker.worker_id)
+
+    decision = policy.select(request=request, topology=topology, state=state)
+    assert decision.decode_worker_id is not None
+    assert decision.decode_worker_id in {w.worker_id for w in topology.route_workers}
+
+
+def test_bounded_retries_to_force_degenerate_path():
+    """Theorem 1: at most |D| · τ_reject rejects suffice to either exhaust
+    every D or to force the degenerate fallback. Simulate the worst case
+    where whichever D is selected immediately rejects the session."""
+    topology = make_topology(decode_count=4)
+    state = RoutingState.create(topology)
+    request = make_request(session_id="s-worst", turn_id=0)
+    threshold = 3
+    policy = KvAwarePolicy(migration_reject_threshold=threshold)
+
+    seen_decoders: set[str] = set()
+    max_retries = len(topology.route_workers) * threshold
+
+    for retry in range(max_retries):
+        decision = policy.select(request=request, topology=topology, state=state)
+        seen_decoders.add(decision.decode_worker_id)
+        # Adversary: this D rejects this session.
+        state.record_admission_reject(request.session_id, decision.decode_worker_id)
+
+    # After |D|·τ_reject rejects every D must be blacklisted, so the next
+    # select() takes the degenerate "least-rejected" branch and STILL
+    # returns a valid worker.
+    final = policy.select(request=request, topology=topology, state=state)
+    assert final.decode_worker_id in {w.worker_id for w in topology.route_workers}
+    # And we should have explored every D over the bounded retries — the
+    # algorithm cannot trap a session on a single D when all are rejecting.
+    assert seen_decoders == {w.worker_id for w in topology.route_workers}
+
+
+def test_least_rejected_d_chosen_when_all_blacklisted():
+    """When every D is past threshold, the degenerate fallback chooses the
+    one with the *fewest* rejects (Algorithm 1, line 4)."""
+    topology = make_topology(decode_count=3)
+    state = RoutingState.create(topology)
+    request = make_request(session_id="s-lr", turn_id=0)
+    policy = KvAwarePolicy(migration_reject_threshold=3)
+
+    # Skew rejections: decode-0 has 5, decode-1 has 10, decode-2 has 3.
+    # All are >= threshold=3, so the filter wipes out every candidate.
+    # The fallback should pick decode-2 (smallest rejection count).
+    workers = list(topology.route_workers)
+    bumps = {workers[0].worker_id: 5, workers[1].worker_id: 10, workers[2].worker_id: 3}
+    for wid, n in bumps.items():
+        for _ in range(n):
+            state.record_admission_reject(request.session_id, wid)
+
+    decision = policy.select(request=request, topology=topology, state=state)
+    assert decision.decode_worker_id == workers[2].worker_id
+
+
+def test_other_session_unaffected_by_blacklist():
+    """Algorithm 1's filter is per-(session, D), not per-D. Session A's
+    rejects must not influence session B's routing."""
+    topology = make_topology(decode_count=2)
+    state = RoutingState.create(topology)
+    policy = KvAwarePolicy(migration_reject_threshold=3)
+
+    # Blacklist decode-0 for session A.
+    workers = list(topology.route_workers)
+    for _ in range(3):
+        state.record_admission_reject("session-A", workers[0].worker_id)
+
+    # Session B sees a clean slate — should be able to pick decode-0
+    # (which is the iteration-order winner under empty state).
+    decision_b = policy.select(
+        request=make_request(session_id="session-B"),
+        topology=topology,
+        state=state,
+    )
+    # decode-0 wins iteration-order tiebreak when all scores are (0,0,0,0).
+    assert decision_b.decode_worker_id == workers[0].worker_id
+
+
+def test_threshold_zero_disables_blacklist():
+    """migration_reject_threshold=0 means the migration mechanism is off:
+    every D stays a candidate regardless of its reject count."""
+    topology = make_topology(decode_count=2)
+    state = RoutingState.create(topology)
+    request = make_request(session_id="s-no-mig")
+    policy = KvAwarePolicy(migration_reject_threshold=0)
+
+    workers = list(topology.route_workers)
+    # Pile a huge number of rejects on decode-0.
+    for _ in range(100):
+        state.record_admission_reject(request.session_id, workers[0].worker_id)
+
+    decision = policy.select(request=request, topology=topology, state=state)
+    # decode-0 should still be eligible; with empty overlap/sticky/inflight,
+    # iteration order picks decode-0 first.
+    assert decision.decode_worker_id == workers[0].worker_id
+
+
+def test_reject_counter_only_grows_on_record():
+    """RoutingState.record_admission_reject is the ONLY mutator for the
+    counter. select() must not silently bump it."""
+    topology = make_topology(decode_count=2)
+    state = RoutingState.create(topology)
+    request = make_request(session_id="s-clean")
+    policy = KvAwarePolicy()
+
+    for _ in range(5):
+        policy.select(request=request, topology=topology, state=state)
+
+    # No explicit record_admission_reject -> all counters stay zero.
+    assert sum(state.session_d_rejects.values()) == 0
--- a/tests/test_policy_scoring.py
+++ b/tests/test_policy_scoring.py
@@ -0,0 +1,189 @@
+"""Unit tests for Algorithm 1 (KvAwarePolicy score_candidate).
+
+Reference: docs/KVC_ROUTER_ALGORITHM.md §3.1. The lex-score is
+
+    (overlap + sticky_bonus*sticky + floor_bonus,
+     sticky,
+     -inflight,
+     -assigned)
+
+These tests pin down the qualitative properties that the algorithm's
+correctness arguments rely on. They run without SGLang/GPU.
+"""
+
+from __future__ import annotations
+
+from agentic_pd_hybrid.policies import score_candidate
+
+
+def _score(**overrides):
+    """Helper: build a score with all defaults and per-test overrides."""
+    args = dict(
+        overlap=0,
+        sticky=False,
+        inflight=0,
+        assigned=0,
+        mean_assigned=0.0,
+        sticky_bonus=1,
+        load_floor_bonus=0,
+    )
+    args.update(overrides)
+    return score_candidate(**args)
+
+
+# -- Determinism ----------------------------------------------------------------
+
+
+def test_score_is_pure():
+    """Same kwargs must produce the same tuple (no hidden state)."""
+    a = _score(overlap=3, sticky=True, inflight=1, assigned=7)
+    b = _score(overlap=3, sticky=True, inflight=1, assigned=7)
+    assert a == b
+
+
+def test_score_returns_4_tuple():
+    s = _score()
+    assert isinstance(s, tuple)
+    assert len(s) == 4
+    assert all(isinstance(x, int) for x in s)
+
+
+# -- Primary term: overlap dominates sticky --------------------------------------
+
+
+def test_overlap_strictly_dominates_pure_sticky():
+    """Theorem-2 building block: overlap > sticky_bonus on a non-sticky D beats a
+    sticky-only D with zero overlap (sticky_bonus=1; overlap=1 would tie primary)."""
+    overlap = _score(overlap=2, sticky=False)
+    sticky_only = _score(overlap=0, sticky=True)
+    assert overlap > sticky_only
+
+
+def test_overlap_plus_sticky_beats_overlap_alone():
+    """Two D's with equal overlap: sticky one wins (sticky_bonus contributes
+    to primary AND wins tie-1)."""
+    sticky_d = _score(overlap=5, sticky=True)
+    fresh_d = _score(overlap=5, sticky=False)
+    assert sticky_d > fresh_d
+
+
+# -- Tie breakers ----------------------------------------------------------------
+
+
+def test_tiebreaker_inflight_lower_wins():
+    """Equal primary & sticky: prefer the D with fewer in-flight requests."""
+    low = _score(overlap=3, sticky=False, inflight=0, assigned=10)
+    high = _score(overlap=3, sticky=False, inflight=5, assigned=10)
+    assert low > high
+
+
+def test_tiebreaker_assigned_lower_wins():
+    """Equal primary & sticky & inflight: prefer rarely-picked D."""
+    rare = _score(overlap=3, sticky=False, inflight=2, assigned=1)
+    frequent = _score(overlap=3, sticky=False, inflight=2, assigned=99)
+    assert rare > frequent
+
+
+def test_tiebreaker_strict_lex_order():
+    """Sticky always beats non-sticky on tie-1 even if non-sticky has lower
+    inflight (the lex order is strict, position 1 outranks positions 2/3)."""
+    # With sticky_bonus=1 added to position 0, equal raw overlap would let the
+    # sticky D win on position 0 already (4+1=5 > 4), which would not exercise
+    # tie-1. Give the sticky D one less overlap so the primaries are equal
+    # (3+1=4 vs 4) and position 1 has to decide.
+    sticky_busy = _score(overlap=3, sticky=True, inflight=10, assigned=10)
+    fresh_idle = _score(overlap=4, sticky=False, inflight=0, assigned=0)
+    assert sticky_busy[0] == fresh_idle[0]
+    # Equal primary: sticky wins position 1 despite worse inflight/assigned.
+    assert sticky_busy > fresh_idle
+
+
+# -- Load-floor bonus ------------------------------------------------------------
+
+
+def test_load_floor_disabled_by_default():
+    """load_floor_bonus=0 → no contribution to primary."""
+    s = _score(overlap=0, sticky=False, mean_assigned=10, assigned=0)
+    assert s[0] == 0
+
+
+def test_load_floor_gated_off_when_sticky():
+    """Even with load_floor_bonus>0, sticky D does NOT receive the boost.
+    Otherwise a session would migrate away from its warm D under load."""
+    sticky_under_loaded = _score(
+        overlap=0, sticky=True, mean_assigned=10, assigned=0, load_floor_bonus=200
+    )
+    # primary = overlap(0) + sticky_bonus(1) + floor(0) = 1
+    assert sticky_under_loaded[0] == 1
+
+
+def test_load_floor_zero_when_mean_zero():
+    """Warmup case: mean_assigned=0 -> no D gets boost -> degenerate to lex
+    tiebreak by iteration order."""
+    s = _score(
+        overlap=0, sticky=False, mean_assigned=0, assigned=0, load_floor_bonus=200
+    )
+    assert s[0] == 0
+
+
+def test_load_floor_proportional_to_deficit():
+    """floor_bonus = K * deficit / mean. assigned=0, mean=10, K=200 -> 200."""
+    s_zero = _score(
+        overlap=0, sticky=False, mean_assigned=10, assigned=0, load_floor_bonus=200
+    )
+    s_half = _score(
+        overlap=0, sticky=False, mean_assigned=10, assigned=5, load_floor_bonus=200
+    )
+    s_full = _score(
+        overlap=0, sticky=False, mean_assigned=10, assigned=10, load_floor_bonus=200
+    )
+    # deficit = max(0, 10-0)=10 -> bonus = int(200*10/10) = 200
+    # deficit = max(0, 10-5)=5  -> bonus = int(200*5/10)  = 100
+    # deficit = max(0, 10-10)=0 -> bonus = 0
+    assert s_zero[0] == 200
+    assert s_half[0] == 100
+    assert s_full[0] == 0
+
+
+def test_load_floor_does_not_underflow_when_overloaded():
+    """assigned > mean -> deficit clamped to 0, no negative bonus."""
+    s = _score(
+        overlap=0, sticky=False, mean_assigned=10, assigned=50, load_floor_bonus=200
+    )
+    assert s[0] == 0
+
+
+# -- Routing intent: real overlap beats load-floor bonus -------------------------
+
+
+def test_real_prefix_overlap_beats_load_floor_on_warm_d():
+    """E1_E2_FIX_DESIGN_ZH §Q2: load_floor should be set such that
+    real per-session prefix overlap outweighs the cold-D bonus.
+    With overlap=800 (a per-session prefix) and load_floor_bonus=200,
+    a warm D (high overlap, possibly high load) should still win against
+    a cold D with floor bonus."""
+    warm = _score(
+        overlap=800, sticky=True, mean_assigned=10, assigned=10, load_floor_bonus=200
+    )
+    cold = _score(
+        overlap=0, sticky=False, mean_assigned=10, assigned=0, load_floor_bonus=200
+    )
+    # warm primary = 800 + 1 + 0 = 801. cold primary = 0 + 0 + 200 = 200.
+    assert warm[0] == 801
+    assert cold[0] == 200
+    assert warm > cold
+
+
+def test_boilerplate_overlap_loses_to_load_floor_for_cold_d():
+    """Same §Q2: load_floor should beat cross-session boilerplate overlap.
+    If load_floor_bonus=200 and the worst-case boilerplate overlap is ~50,
+    a fresh cold D should still win against a slightly-warm-from-boilerplate D."""
+    warm_boilerplate = _score(
+        overlap=50, sticky=False, mean_assigned=10, assigned=10, load_floor_bonus=200
+    )
+    cold_under_loaded = _score(
+        overlap=0, sticky=False, mean_assigned=10, assigned=0, load_floor_bonus=200
+    )
+    # warm_boilerplate primary = 50 + 0 + 0 = 50 (assigned=mean, no deficit).
+    # cold_under_loaded primary = 0 + 0 + 200 = 200.
+    assert cold_under_loaded > warm_boilerplate
--- a/third_party/sglang/python/sglang/srt/managers/schedule_batch.py
+++ b/third_party/sglang/python/sglang/srt/managers/schedule_batch.py
@@ -1564,6 +1564,74 @@ class ScheduleBatch(ScheduleBatchDisaggregationDecodeMixin):
            # For DLLM, we use a separate forward mode
            self.forward_mode = ForwardMode.DLLM_EXTEND

+        # Pre-filter pass: drop streaming-session reqs whose committed prefix is
+        # longer than fill_ids. The streaming-session correction below would
+        # set extend_input_len = max(0, fill_len - prefix_len) = 0 for these
+        # reqs, but the downstream invariant at the per-req loop
+        # (`assert seq_len - pre_len == req.extend_input_len`) is computed from
+        # raw fill_ids/prefix_indices lengths and has no path to be satisfied
+        # when fill_len < prefix_len. Treat the condition as upstream state
+        # inconsistency, abort the affected reqs (so the client sees an error
+        # response instead of the worker crashing), and continue with the
+        # remaining batch. See docs/E3_FINDINGS_ZH.md for the failure mode
+        # this guards against.
+        if self.reqs:
+            kept_reqs = []
+            for req in self.reqs:
+                if (
+                    req.session is not None
+                    and req.session.streaming
+                    and len(req.fill_ids) < len(req.prefix_indices)
+                ):
+                    logger.error(
+                        "Dropping streaming-session req with fill_ids shorter than "
+                        "prefix_indices (rid=%s, session_id=%s, fill_len=%d, "
+                        "prefix_len=%d, kv_committed_len=%d). Upstream state "
+                        "inconsistency would crash prepare_for_extend's invariant; "
+                        "aborting this req. See docs/E3_FINDINGS_ZH.md.",
+                        req.rid,
+                        req.session.session_id,
+                        len(req.fill_ids),
+                        len(req.prefix_indices),
+                        req.kv_committed_len,
+                    )
+                    req.finished_reason = FINISH_ABORT(
+                        message=(
+                            "streaming-session inconsistency: fill_ids "
+                            f"({len(req.fill_ids)}) < prefix_indices "
+                            f"({len(req.prefix_indices)})"
+                        ),
+                    )
+                else:
+                    kept_reqs.append(req)
+            if len(kept_reqs) != len(self.reqs):
+                self.reqs = kept_reqs
+
+        if not self.reqs:
+            # Whole batch filtered. Set empty tensor / list state so
+            # downstream callers (model_runner.forward, batch_result handlers)
+            # see a valid no-op batch and skip the model pass cleanly.
+            _pin = is_pin_memory_available(self.device)
+            empty_long = torch.zeros(0, dtype=torch.int64, pin_memory=_pin).to(
+                self.device, non_blocking=True
+            )
+            empty_int = torch.zeros(0, dtype=torch.int32, pin_memory=_pin).to(
+                self.device, non_blocking=True
+            )
+            self.input_ids = empty_long
+            self.req_pool_indices = empty_int
+            self.seq_lens = empty_long
+            self.seq_lens_cpu = torch.zeros(0, dtype=torch.int64)
+            self.orig_seq_lens = empty_int
+            self.prefix_lens = []
+            self.extend_lens = []
+            self.extend_num_tokens = 0
+            self.out_cache_loc = empty_int
+            self.input_embeds = None
+            self.multimodal_inputs = []
+            self.token_type_ids = None
+            return
+
        # Init tensors
        reqs = self.reqs
        for req in reqs:
--- a/uv.lock
+++ b/uv.lock
@@ -2,15 +2,33 @@ version = 1
 revision = 3
 requires-python = ">=3.12"
 resolution-markers = [
-    "python_full_version >= '3.14' and sys_platform == 'win32'",
-    "python_full_version >= '3.14' and sys_platform == 'emscripten'",
-    "python_full_version >= '3.14' and sys_platform != 'emscripten' and sys_platform != 'win32'",
-    "python_full_version == '3.13.*' and sys_platform == 'win32'",
-    "python_full_version == '3.13.*' and sys_platform == 'emscripten'",
-    "python_full_version == '3.13.*' and sys_platform != 'emscripten' and sys_platform != 'win32'",
-    "python_full_version < '3.13' and sys_platform == 'win32'",
-    "python_full_version < '3.13' and sys_platform == 'emscripten'",
-    "python_full_version < '3.13' and sys_platform != 'emscripten' and sys_platform != 'win32'",
+    "python_full_version >= '3.14' and platform_machine == 'x86_64' and sys_platform == 'win32'",
+    "python_full_version >= '3.14' and platform_machine == 'aarch64' and sys_platform == 'win32'",
+    "python_full_version >= '3.14' and platform_machine != 'aarch64' and platform_machine != 'x86_64' and sys_platform == 'win32'",
+    "python_full_version >= '3.14' and platform_machine == 'x86_64' and sys_platform == 'emscripten'",
+    "python_full_version >= '3.14' and platform_machine == 'aarch64' and sys_platform == 'emscripten'",
+    "python_full_version >= '3.14' and platform_machine != 'aarch64' and platform_machine != 'x86_64' and sys_platform == 'emscripten'",
+    "python_full_version >= '3.14' and platform_machine == 'x86_64' and sys_platform != 'emscripten' and sys_platform != 'win32'",
+    "python_full_version >= '3.14' and platform_machine == 'aarch64' and sys_platform != 'emscripten' and sys_platform != 'win32'",
+    "python_full_version >= '3.14' and platform_machine != 'aarch64' and platform_machine != 'x86_64' and sys_platform != 'emscripten' and sys_platform != 'win32'",
+    "python_full_version == '3.13.*' and platform_machine == 'x86_64' and sys_platform == 'win32'",
+    "python_full_version == '3.13.*' and platform_machine == 'aarch64' and sys_platform == 'win32'",
+    "python_full_version == '3.13.*' and platform_machine != 'aarch64' and platform_machine != 'x86_64' and sys_platform == 'win32'",
+    "python_full_version == '3.13.*' and platform_machine == 'x86_64' and sys_platform == 'emscripten'",
+    "python_full_version == '3.13.*' and platform_machine == 'aarch64' and sys_platform == 'emscripten'",
+    "python_full_version == '3.13.*' and platform_machine != 'aarch64' and platform_machine != 'x86_64' and sys_platform == 'emscripten'",
+    "python_full_version == '3.13.*' and platform_machine == 'x86_64' and sys_platform != 'emscripten' and sys_platform != 'win32'",
+    "python_full_version == '3.13.*' and platform_machine == 'aarch64' and sys_platform != 'emscripten' and sys_platform != 'win32'",
+    "python_full_version == '3.13.*' and platform_machine != 'aarch64' and platform_machine != 'x86_64' and sys_platform != 'emscripten' and sys_platform != 'win32'",
+    "python_full_version < '3.13' and platform_machine == 'x86_64' and sys_platform == 'win32'",
+    "python_full_version < '3.13' and platform_machine == 'aarch64' and sys_platform == 'win32'",
+    "python_full_version < '3.13' and platform_machine != 'aarch64' and platform_machine != 'x86_64' and sys_platform == 'win32'",
+    "python_full_version < '3.13' and platform_machine == 'x86_64' and sys_platform == 'emscripten'",
+    "python_full_version < '3.13' and platform_machine == 'aarch64' and sys_platform == 'emscripten'",
+    "python_full_version < '3.13' and platform_machine != 'aarch64' and platform_machine != 'x86_64' and sys_platform == 'emscripten'",
+    "python_full_version < '3.13' and platform_machine == 'x86_64' and sys_platform != 'emscripten' and sys_platform != 'win32'",
+    "python_full_version < '3.13' and platform_machine == 'aarch64' and sys_platform != 'emscripten' and sys_platform != 'win32'",
+    "python_full_version < '3.13' and platform_machine != 'aarch64' and platform_machine != 'x86_64' and sys_platform != 'emscripten' and sys_platform != 'win32'",
 ]

 [options]
@@ -30,7 +48,7 @@ dependencies = [
 requires-dist = [
    { name = "httpx", specifier = ">=0.28.1" },
    { name = "mooncake-transfer-engine" },
-    { name = "sglang", specifier = "==0.5.10" },
+    { name = "sglang", editable = "third_party/sglang/python" },
 ]

 [[package]]
@@ -457,7 +475,8 @@ source = { registry = "https://pypi.org/simple" }
 dependencies = [
    { name = "loguru" },
    { name = "pydantic" },
-    { name = "torch" },
+    { name = "torch", version = "2.9.1", source = { registry = "https://pypi.org/simple" }, marker = "platform_machine != 'aarch64'" },
+    { name = "torch", version = "2.9.1+cu129", source = { registry = "https://download.pytorch.org/whl/cu129" }, marker = "platform_machine == 'aarch64'" },
    { name = "transformers" },
 ]
 sdist = { url = "https://files.pythonhosted.org/packages/98/c0/8fb99aa86bc538d3a025749633d1d0105d849b35eb240ba7ba30e22de49b/compressed_tensors-0.15.1a20260409.tar.gz", hash = "sha256:a9a477691c2887bc8d2c46aef82aa60c85fe1f014cacb2218b423904aff04f4d", size = 238217, upload-time = "2026-04-09T21:21:52.922Z" }
@@ -565,8 +584,8 @@ name = "decord2"
 version = "3.3.0"
 source = { registry = "https://pypi.org/simple" }
 dependencies = [
-    { name = "numpy", version = "2.3.5", source = { registry = "https://pypi.org/simple" }, marker = "python_full_version < '3.13' and sys_platform != 'emscripten' and sys_platform != 'win32'" },
-    { name = "numpy", version = "2.4.4", source = { registry = "https://pypi.org/simple" }, marker = "python_full_version >= '3.13' and sys_platform != 'emscripten' and sys_platform != 'win32'" },
+    { name = "numpy", version = "2.3.5", source = { registry = "https://pypi.org/simple" }, marker = "python_full_version < '3.13' and platform_machine != 'x86_64' and sys_platform != 'emscripten' and sys_platform != 'win32'" },
+    { name = "numpy", version = "2.4.4", source = { registry = "https://pypi.org/simple" }, marker = "python_full_version >= '3.13' and platform_machine != 'x86_64' and sys_platform != 'emscripten' and sys_platform != 'win32'" },
 ]
 wheels = [
    { url = "https://files.pythonhosted.org/packages/51/c3/fbc81c2cc18b2b7ca8a3a26ca2e8dfa243a2c7f5c4431f4b3839a8f12f0a/decord2-3.3.0-cp312-cp312-manylinux_2_28_aarch64.whl", hash = "sha256:3a67fb644041a031bc3f21b2e1adcf92b9742d980bd90f3bc45396c2a0ddcbfa", size = 25036754, upload-time = "2026-04-06T18:09:46.005Z" },
@@ -664,7 +683,8 @@ dependencies = [
    { name = "einops" },
    { name = "nvidia-cutlass-dsl" },
    { name = "quack-kernels" },
-    { name = "torch" },
+    { name = "torch", version = "2.9.1", source = { registry = "https://pypi.org/simple" }, marker = "platform_machine != 'aarch64'" },
+    { name = "torch", version = "2.9.1+cu129", source = { registry = "https://download.pytorch.org/whl/cu129" }, marker = "platform_machine == 'aarch64'" },
    { name = "torch-c-dlpack-ext" },
    { name = "typing-extensions" },
 ]
@@ -699,7 +719,8 @@ dependencies = [
    { name = "packaging" },
    { name = "requests" },
    { name = "tabulate" },
-    { name = "torch" },
+    { name = "torch", version = "2.9.1", source = { registry = "https://pypi.org/simple" }, marker = "platform_machine != 'aarch64'" },
+    { name = "torch", version = "2.9.1+cu129", source = { registry = "https://download.pytorch.org/whl/cu129" }, marker = "platform_machine == 'aarch64'" },
    { name = "tqdm" },
 ]
 sdist = { url = "https://files.pythonhosted.org/packages/cc/95/81eafb78574312db79ef7144a4e77f2fee015343f413ef3000f279c8a118/flashinfer_python-0.6.7.post2.tar.gz", hash = "sha256:924cb1788d0335225293eea384da40f40daa6b4e32b6a5ebc214ab679b4e2125", size = 6509418, upload-time = "2026-04-04T07:10:25.516Z" }
@@ -904,34 +925,34 @@ wheels = [

 [[package]]
 name = "hf-xet"
-version = "1.5.0.dev1"
+version = "1.5.0"
 source = { registry = "https://pypi.org/simple" }
-sdist = { url = "https://files.pythonhosted.org/packages/c9/b5/73db543ba19129c23b2ca52d837373eb4243f0332130093f31b3ecc6739f/hf_xet-1.5.0.dev1.tar.gz", hash = "sha256:a21c9c85869ee122747543dd93471826cc0e9b5f61b11411aabd4adf72e345b1", size = 823729, upload-time = "2026-04-17T08:22:19.349Z" }
+sdist = { url = "https://files.pythonhosted.org/packages/74/d8/5c06fc76461418326a7decf8367480c35be11a41fd938633929c60a9ec6b/hf_xet-1.5.0.tar.gz", hash = "sha256:e0fb0a34d9f406eed88233e829a67ec016bec5af19e480eac65a233ea289a948", size = 837196, upload-time = "2026-05-06T06:18:15.583Z" }
 wheels = [
-    { url = "https://files.pythonhosted.org/packages/79/c1/15fb7a67b1fad51b0d3e3a4e0a33ac2fca8197da842a922bf2f707521915/hf_xet-1.5.0.dev1-cp313-cp313t-macosx_10_12_x86_64.whl", hash = "sha256:41abc1601e9449c57880c203332221bc571a9c85154c1789a740259781ba9596", size = 6903797, upload-time = "2026-04-17T08:21:38.028Z" },
-    { url = "https://files.pythonhosted.org/packages/c5/a6/66924109da0089c803a0b42eeccd37f321906b0224bad6c220e46a9f6ad2/hf_xet-1.5.0.dev1-cp313-cp313t-macosx_11_0_arm64.whl", hash = "sha256:045c43a49776d1dc9836ee0782e85fecbd2e85a6f55ebc39a4a14eb9c83fc004", size = 6570723, upload-time = "2026-04-17T08:21:35.605Z" },
-    { url = "https://files.pythonhosted.org/packages/ad/19/c9d51b5512eae52dd3b6eac5f02552cfe78156410e71e1e3d1295f778a0c/hf_xet-1.5.0.dev1-cp313-cp313t-manylinux_2_17_x86_64.manylinux2014_x86_64.whl", hash = "sha256:908325bf4e53209dfe56d99a5cfed63907e677a32b1ba1f000cd72a8290871e4", size = 63298006, upload-time = "2026-04-17T08:21:12.867Z" },
-    { url = "https://files.pythonhosted.org/packages/66/a7/1781b5a465fb4cce525a96c8bf7719583d115eaf2ea4d4ef560a394801a2/hf_xet-1.5.0.dev1-cp313-cp313t-manylinux_2_28_aarch64.whl", hash = "sha256:d51c3c20460012540dca4094615b74e1b757a7d702910149c7b8175eda91567a", size = 58640118, upload-time = "2026-04-17T08:21:07.745Z" },
-    { url = "https://files.pythonhosted.org/packages/38/ef/2c02f7602b94b0f0454f66f9f52e7f37edaf81c3ccfa57073c17ee7e57d8/hf_xet-1.5.0.dev1-cp313-cp313t-musllinux_1_2_aarch64.whl", hash = "sha256:36d45543060cfda059a910cfa702fe2221cba88a49401d9359ae442ccb6fe8e7", size = 59133723, upload-time = "2026-04-17T08:21:51.701Z" },
-    { url = "https://files.pythonhosted.org/packages/7d/76/732941c4ce0c0f5991ec1962a1848325a4ee11da2942c2f85100b68cba28/hf_xet-1.5.0.dev1-cp313-cp313t-musllinux_1_2_x86_64.whl", hash = "sha256:3363073f1abc0a55027ba5e666bbdd0147681e856ed3ddda083428f8d81786cf", size = 60269392, upload-time = "2026-04-17T08:21:56.95Z" },
-    { url = "https://files.pythonhosted.org/packages/c3/22/65e1146977ddb940136ccd932675425a2fa1a13aef2a35fa54b969e07d77/hf_xet-1.5.0.dev1-cp313-cp313t-win_amd64.whl", hash = "sha256:aa93dcb1271a3cd2846ab07f9e37f27280604dd5c50ea299050553a4fe6fd60d", size = 3993380, upload-time = "2026-04-17T08:22:23.592Z" },
-    { url = "https://files.pythonhosted.org/packages/eb/8c/71bc286a6d52a53682c669abeea1d4dd3f320812d9c1816f8d71ad4e99ba/hf_xet-1.5.0.dev1-cp313-cp313t-win_arm64.whl", hash = "sha256:7928c15eef205aaa1786e63294331f184152e8e7d9f0f352047bf1b590f540cd", size = 3851055, upload-time = "2026-04-17T08:22:21.556Z" },
-    { url = "https://files.pythonhosted.org/packages/3c/79/42bace8f9651276eb96463b2ad275f6b53fe2b22ba3c5ea7f1819b580785/hf_xet-1.5.0.dev1-cp314-cp314t-macosx_10_12_x86_64.whl", hash = "sha256:11a00f8ec39f69c3cd32fb8980b86c91945aaf0588667079994edda9fa2e3cb2", size = 6897594, upload-time = "2026-04-17T08:21:47.543Z" },
-    { url = "https://files.pythonhosted.org/packages/c1/b0/7d950c8f68280c1907b146e848e244eec054300769b6645455cf92075094/hf_xet-1.5.0.dev1-cp314-cp314t-macosx_11_0_arm64.whl", hash = "sha256:d333be26f91cbfa573d24005c5502ce48eb19ec416982ebd5cf8212cdb549942", size = 6569370, upload-time = "2026-04-17T08:21:45.24Z" },
-    { url = "https://files.pythonhosted.org/packages/be/20/60828b7429397f5fe417e312b3b222f97a3293e129977c7d6c1fe07b14cc/hf_xet-1.5.0.dev1-cp314-cp314t-manylinux_2_17_x86_64.manylinux2014_x86_64.whl", hash = "sha256:44ca5ad2a82c60f1b749a65e361c006fa8c9feaab703e4c9e72b5ff830dca1f6", size = 63253090, upload-time = "2026-04-17T08:21:32.004Z" },
-    { url = "https://files.pythonhosted.org/packages/71/54/3fc89b6e47e9e43b86613e32c1cccb8cdeaaa5b19a99decc41d6b57f0d65/hf_xet-1.5.0.dev1-cp314-cp314t-manylinux_2_28_aarch64.whl", hash = "sha256:df5ba34b731c0be6eb5290cd46adb7b245583bdbf271f87caed60f3a3f65e859", size = 58659612, upload-time = "2026-04-17T08:21:27.084Z" },
-    { url = "https://files.pythonhosted.org/packages/18/76/2165625d83309a38dd2b91ce3b7ccb0384151f7f205b033575849b996546/hf_xet-1.5.0.dev1-cp314-cp314t-musllinux_1_2_aarch64.whl", hash = "sha256:c4661dd045f6d59f838119423948d9cec06ac498ac09a869f7df4abbe70f01aa", size = 59152315, upload-time = "2026-04-17T08:22:11.349Z" },
-    { url = "https://files.pythonhosted.org/packages/ef/b1/e0effd9fb1acbd142c6e9345db171254f953a701b16799b815535cae771c/hf_xet-1.5.0.dev1-cp314-cp314t-musllinux_1_2_x86_64.whl", hash = "sha256:2b07f87bb1d21cde3889d684f194e0c6047091c94b54c3e52d1b80e738d016ed", size = 60228716, upload-time = "2026-04-17T08:22:16.177Z" },
-    { url = "https://files.pythonhosted.org/packages/aa/9e/73921723685e27f6b54a016374894d69fb06eb0452fe7b7ada12b54b32fd/hf_xet-1.5.0.dev1-cp314-cp314t-win_amd64.whl", hash = "sha256:bb81277c04fcd49a4c3e93bc5bcf1d33a9604b32085f3f7e95f52edb9c2deca6", size = 3994035, upload-time = "2026-04-17T08:22:31.471Z" },
-    { url = "https://files.pythonhosted.org/packages/4c/7f/a2f422bb7d3050760d0aae59f4999dbfcb84708b822432f2d5bc3dd76234/hf_xet-1.5.0.dev1-cp314-cp314t-win_arm64.whl", hash = "sha256:724fa6f5f644295de503e6cdb1b1c96a7ad2512db6a641daa32b0f33888e88f7", size = 3851354, upload-time = "2026-04-17T08:22:29.647Z" },
-    { url = "https://files.pythonhosted.org/packages/85/fa/6c404999f13892e8ef2b75ec07af0b118fa1241a7bd278f6b93d61063746/hf_xet-1.5.0.dev1-cp37-abi3-macosx_10_12_x86_64.whl", hash = "sha256:5a180160a120357cabc0cd60167864f110bb8f0b1c38b71e0a93cde13839475e", size = 6907817, upload-time = "2026-04-17T08:21:42.228Z" },
-    { url = "https://files.pythonhosted.org/packages/ad/d1/6c828e215079a436d6e916d30248093b7b3ea911e4e6d40b954d21089fc8/hf_xet-1.5.0.dev1-cp37-abi3-macosx_11_0_arm64.whl", hash = "sha256:8701d2e1268c78a1c3cd0e4480b74c0a505cfa864269308efae9d73d0e2203f9", size = 6577425, upload-time = "2026-04-17T08:21:40.097Z" },
-    { url = "https://files.pythonhosted.org/packages/e3/c9/2b93ba287824948450ddf64e2596220b58633d019dda278c12abadbf7bb5/hf_xet-1.5.0.dev1-cp37-abi3-manylinux_2_17_x86_64.manylinux2014_x86_64.whl", hash = "sha256:e5480448001f9e59046ac4c463f2e25fb652066605dd183a82d2b5625b939487", size = 63137387, upload-time = "2026-04-17T08:21:21.775Z" },
-    { url = "https://files.pythonhosted.org/packages/dc/b5/c74899d4da67155db8b4f9d8b21110a919d969a15b75aceaec9502c8e7c3/hf_xet-1.5.0.dev1-cp37-abi3-manylinux_2_28_aarch64.whl", hash = "sha256:14e9773ade3fb48dcfa9f493c8ed065704dd3031d29a5a289fed58b8223f2409", size = 58503933, upload-time = "2026-04-17T08:21:17.434Z" },
-    { url = "https://files.pythonhosted.org/packages/27/42/d9d511d425696a8b54cf67af0d3de0f8564f81f81e046b107a967f35f00e/hf_xet-1.5.0.dev1-cp37-abi3-musllinux_1_2_aarch64.whl", hash = "sha256:21accf171949d78b18099bf57a4e8490db1ad88c0a4e907f8930c78ffe21f47d", size = 59035994, upload-time = "2026-04-17T08:22:01.526Z" },
-    { url = "https://files.pythonhosted.org/packages/8c/b6/49afbe73752f8d176231e49bc02b8b3fe96284ba82d856481c598b5343f4/hf_xet-1.5.0.dev1-cp37-abi3-musllinux_1_2_x86_64.whl", hash = "sha256:07d8ec5c300a7ce3a39fa8598024992f6d2fcfa167b71cc0cde07abdcd05ca01", size = 60139405, upload-time = "2026-04-17T08:22:06.759Z" },
-    { url = "https://files.pythonhosted.org/packages/98/ab/e243e97ba2d5e55c848cdb5622466300990d2d0380c4456132d209ce1252/hf_xet-1.5.0.dev1-cp37-abi3-win_amd64.whl", hash = "sha256:ad32cfd5aa66bdf922b7f8eb9a94eb9f64a8f68a31ffede803060b44bd4060f8", size = 4004017, upload-time = "2026-04-17T08:22:27.78Z" },
-    { url = "https://files.pythonhosted.org/packages/f7/08/645da274ebe22d06a1ad103667deae75eb658e2b8e493f3a04a8ab140e2d/hf_xet-1.5.0.dev1-cp37-abi3-win_arm64.whl", hash = "sha256:2093091921534e51e13cbeb956550cded7b97aa7ba1d774123c21d9b06f06231", size = 3859306, upload-time = "2026-04-17T08:22:25.602Z" },
+    { url = "https://files.pythonhosted.org/packages/68/9b/6912c99070915a4f28119e3c5b52a9abd1eec0ad5cb293b8c967a0c6f5a2/hf_xet-1.5.0-cp313-cp313t-macosx_10_12_x86_64.whl", hash = "sha256:7d70fe2ce97b9db73b9c9b9c81fe3693640aec83416a966c446afea54acfae3c", size = 4023383, upload-time = "2026-05-06T06:17:53.947Z" },
+    { url = "https://files.pythonhosted.org/packages/0f/6d/9563cfde59b5d8128a9c7ec972a087f4c782e4f7bac5a85234edfd5d5e49/hf_xet-1.5.0-cp313-cp313t-macosx_11_0_arm64.whl", hash = "sha256:73a0dae8c71de3b0633a45c73f4a4a5ed09e94b43441d82981a781d4f12baa42", size = 3792751, upload-time = "2026-05-06T06:17:51.791Z" },
+    { url = "https://files.pythonhosted.org/packages/07/a5/ed5a0cf35b49a0571af5a8f53416dad1877a718c021c9937c3a53cb45781/hf_xet-1.5.0-cp313-cp313t-manylinux2014_x86_64.manylinux_2_17_x86_64.whl", hash = "sha256:a60290ec57e9b71767fba7c3645ddafdd0759974b540441510c629c6db6db24a", size = 4456058, upload-time = "2026-05-06T06:17:40.735Z" },
+    { url = "https://files.pythonhosted.org/packages/60/fb/3ae8bf2a7a37a4197d0195d7247fd25b3952e15cb8a599e285dfaa6f52b3/hf_xet-1.5.0-cp313-cp313t-manylinux_2_28_aarch64.whl", hash = "sha256:e5de0f6deada0dada870bb376a11bcd1f08abf3a968a6d118f33e72d1b1eb480", size = 4250783, upload-time = "2026-05-06T06:17:38.412Z" },
+    { url = "https://files.pythonhosted.org/packages/a2/9b/8bae40d4d91525085137196e84eb0ed49cf65b5e96e5c3ecdadd8bd0fac2/hf_xet-1.5.0-cp313-cp313t-musllinux_1_2_aarch64.whl", hash = "sha256:c799d49f1a5544a0ef7591c0ee75e0d6b93d6f56dc7a4979f59f7518d2872216", size = 4445594, upload-time = "2026-05-06T06:18:04.219Z" },
+    { url = "https://files.pythonhosted.org/packages/13/59/c74efbbd4e8728172b2cc72a2bc014d2947a4b7bdced932fbd3f5da1a4e5/hf_xet-1.5.0-cp313-cp313t-musllinux_1_2_x86_64.whl", hash = "sha256:2baea1b0b989e5c152fe81425f7745ddc8901280ba3d97c98d8cdece7b706c60", size = 4663995, upload-time = "2026-05-06T06:18:06.1Z" },
+    { url = "https://files.pythonhosted.org/packages/73/32/8e1e0410af64cda9b139d1dcebdc993a8ff9c8c7c0e2696ae356d75ccc0d/hf_xet-1.5.0-cp313-cp313t-win_amd64.whl", hash = "sha256:526345b3ed45f374f6317349df489167606736c876241ba984105afe7fd4839d", size = 3966608, upload-time = "2026-05-06T06:18:19.74Z" },
+    { url = "https://files.pythonhosted.org/packages/fc/34/a8febc8f4edbea8b3e21b02ebc8b628679b84ba7e45cde624a7736b51500/hf_xet-1.5.0-cp313-cp313t-win_arm64.whl", hash = "sha256:786d28e2eb8315d5035544b9d137b4a842d600c434bb91bf7d0d953cce906ad4", size = 3796946, upload-time = "2026-05-06T06:18:17.568Z" },
+    { url = "https://files.pythonhosted.org/packages/2a/20/8fc8996afe5815fa1a6be8e9e5c02f24500f409d599e905800d498a4e14d/hf_xet-1.5.0-cp314-cp314t-macosx_10_12_x86_64.whl", hash = "sha256:872d5601e6deea30d15865ede55d29eac6daf5a534ab417b99b6ef6b076dd96c", size = 4023495, upload-time = "2026-05-06T06:18:01.94Z" },
+    { url = "https://files.pythonhosted.org/packages/32/6a/93d84463c00cecb561a7508aa6303e35ee2894294eac14245526924415fe/hf_xet-1.5.0-cp314-cp314t-macosx_11_0_arm64.whl", hash = "sha256:9929561f5abf4581c8ea79587881dfef6b8abb2a0d8a51915936fc2a614f4e73", size = 3792731, upload-time = "2026-05-06T06:18:00.021Z" },
+    { url = "https://files.pythonhosted.org/packages/9d/5a/8ec8e0c863b382d00b3c2e2af6ded6b06371be617144a625903a6d562f4b/hf_xet-1.5.0-cp314-cp314t-manylinux2014_x86_64.manylinux_2_17_x86_64.whl", hash = "sha256:f7b7bbae318e583a86fb21e5a4a175d6721d628a2874f4bd022d0e660c32a682", size = 4456738, upload-time = "2026-05-06T06:17:49.574Z" },
+    { url = "https://files.pythonhosted.org/packages/c5/ca/f7effa1a67717da2bcc6b6c28f71c6ca648c77acaec4e2c32f40cbe16d85/hf_xet-1.5.0-cp314-cp314t-manylinux_2_28_aarch64.whl", hash = "sha256:cf7b2dc6f31a4ea754bb50f74cde482dcf5d366d184076d8530b9872787f3761", size = 4251622, upload-time = "2026-05-06T06:17:47.096Z" },
+    { url = "https://files.pythonhosted.org/packages/65/f2/19247dba3e231cf77dec59ddfb878f00057635ff773d099c9b59d37812c3/hf_xet-1.5.0-cp314-cp314t-musllinux_1_2_aarch64.whl", hash = "sha256:8dbcbab554c9ef158ef2c991545c3e970ddd8cc7acdcd0a78c5a41095dab4ded", size = 4445667, upload-time = "2026-05-06T06:18:11.983Z" },
+    { url = "https://files.pythonhosted.org/packages/7f/64/6f116801a3bcfb6f59f5c251f48cadc47ea54026441c4a385079286a94fa/hf_xet-1.5.0-cp314-cp314t-musllinux_1_2_x86_64.whl", hash = "sha256:5906bf7718d3636dc13402914736abe723492cb730f744834f5f5b67d3a12702", size = 4664619, upload-time = "2026-05-06T06:18:13.771Z" },
+    { url = "https://files.pythonhosted.org/packages/5c/e8/069542d37946ed08669b127e1496fa99e78196d71de8d41eda5e9f1b7a58/hf_xet-1.5.0-cp314-cp314t-win_amd64.whl", hash = "sha256:5f3dc2248fc01cc0a00cd392ab497f1ca373fcbc7e3f2da1f452480b384e839e", size = 3966802, upload-time = "2026-05-06T06:18:28.162Z" },
+    { url = "https://files.pythonhosted.org/packages/f9/91/fc6fdec27b14d04e88c386ac0a0129732b53fa23f7c4a78f4b83a039c567/hf_xet-1.5.0-cp314-cp314t-win_arm64.whl", hash = "sha256:b285cea1b5bab46b758772716ba8d6854a1a0310fed1c249d678a8b38601e5a0", size = 3797168, upload-time = "2026-05-06T06:18:26.287Z" },
+    { url = "https://files.pythonhosted.org/packages/3d/fb/69ff198a82cae7eb1a69fb84d93b3a3e4816564d76817fe541ddc96874eb/hf_xet-1.5.0-cp37-abi3-macosx_10_12_x86_64.whl", hash = "sha256:dad0dc84e941b8ba3c860659fe1fdc35c049d47cce293f003287757e971a8f56", size = 4030814, upload-time = "2026-05-06T06:17:57.933Z" },
+    { url = "https://files.pythonhosted.org/packages/9b/ff/edcc2b40162bef3ff78e14ab637e5f3b89243d6aee72f5949d3bb6a5af83/hf_xet-1.5.0-cp37-abi3-macosx_11_0_arm64.whl", hash = "sha256:fd6e5a9b0fdac4ed03ed45ef79254a655b1aaab514a02202617fbf643f5fdf7a", size = 3798444, upload-time = "2026-05-06T06:17:55.79Z" },
+    { url = "https://files.pythonhosted.org/packages/49/4d/103f76b04310e5e57656696cc184690d20c466af0bca3ca88f8c8ea5d4f3/hf_xet-1.5.0-cp37-abi3-manylinux2014_x86_64.manylinux_2_17_x86_64.whl", hash = "sha256:3531b1823a0e6d77d80f9ed15ca0e00f0d115094f8ac033d5cae88f4564cc949", size = 4465986, upload-time = "2026-05-06T06:17:44.886Z" },
+    { url = "https://files.pythonhosted.org/packages/c4/a2/546f47f464737b3edbab6f8ddb57f2599b93d2cbb66f06abb475ccb48651/hf_xet-1.5.0-cp37-abi3-manylinux_2_28_aarch64.whl", hash = "sha256:9a0ee58cd18d5ea799f7ed11290bbccbe56bdd8b1d97ca74b9cc49a3945d7a3b", size = 4259865, upload-time = "2026-05-06T06:17:42.639Z" },
+    { url = "https://files.pythonhosted.org/packages/95/7f/1be593c1f28613be2e196473481cd81bfc5910795e30a34e8f744f6cac4f/hf_xet-1.5.0-cp37-abi3-musllinux_1_2_aarch64.whl", hash = "sha256:1e60df5a42e9bed8628b6416af2cba4cba57ae9f02de226a06b020d98e1aab18", size = 4459835, upload-time = "2026-05-06T06:18:08.026Z" },
+    { url = "https://files.pythonhosted.org/packages/aa/b2/703569fc881f3284487e68cda7b42179978480da3c438042a6bbbb4a671c/hf_xet-1.5.0-cp37-abi3-musllinux_1_2_x86_64.whl", hash = "sha256:4b35549ce62601b84da4ff9b24d970032ace3d4430f52d91bcbb26c901d6c690", size = 4672414, upload-time = "2026-05-06T06:18:09.864Z" },
+    { url = "https://files.pythonhosted.org/packages/af/37/1b6def445c567286b50aa3b33828158e135b1be44938dde59f11382a500c/hf_xet-1.5.0-cp37-abi3-win_amd64.whl", hash = "sha256:2806c7c17b4d23f8d88f7c4814f838c3b6150773fe339c20af23e1cfaf2797e4", size = 3977238, upload-time = "2026-05-06T06:18:23.621Z" },
+    { url = "https://files.pythonhosted.org/packages/62/94/3b66b148778ee100dcfd69c2ca22b57b41b44d3063ceec934f209e9184ce/hf_xet-1.5.0-cp37-abi3-win_arm64.whl", hash = "sha256:b6c9df403040248c76d808d3e047d64db2d923bae593eb244c41e425cf6cd7be", size = 3806916, upload-time = "2026-05-06T06:18:21.7Z" },
 ]

 [[package]]
@@ -1635,9 +1656,15 @@ name = "numpy"
 version = "2.3.5"
 source = { registry = "https://pypi.org/simple" }
 resolution-markers = [
-    "python_full_version < '3.13' and sys_platform == 'win32'",
-    "python_full_version < '3.13' and sys_platform == 'emscripten'",
-    "python_full_version < '3.13' and sys_platform != 'emscripten' and sys_platform != 'win32'",
+    "python_full_version < '3.13' and platform_machine == 'x86_64' and sys_platform == 'win32'",
+    "python_full_version < '3.13' and platform_machine == 'aarch64' and sys_platform == 'win32'",
+    "python_full_version < '3.13' and platform_machine != 'aarch64' and platform_machine != 'x86_64' and sys_platform == 'win32'",
+    "python_full_version < '3.13' and platform_machine == 'x86_64' and sys_platform == 'emscripten'",
+    "python_full_version < '3.13' and platform_machine == 'aarch64' and sys_platform == 'emscripten'",
+    "python_full_version < '3.13' and platform_machine != 'aarch64' and platform_machine != 'x86_64' and sys_platform == 'emscripten'",
+    "python_full_version < '3.13' and platform_machine == 'x86_64' and sys_platform != 'emscripten' and sys_platform != 'win32'",
+    "python_full_version < '3.13' and platform_machine == 'aarch64' and sys_platform != 'emscripten' and sys_platform != 'win32'",
+    "python_full_version < '3.13' and platform_machine != 'aarch64' and platform_machine != 'x86_64' and sys_platform != 'emscripten' and sys_platform != 'win32'",
 ]
 sdist = { url = "https://files.pythonhosted.org/packages/76/65/21b3bc86aac7b8f2862db1e808f1ea22b028e30a225a34a5ede9bf8678f2/numpy-2.3.5.tar.gz", hash = "sha256:784db1dcdab56bf0517743e746dfb0f885fc68d948aba86eeec2cba234bdf1c0", size = 20584950, upload-time = "2025-11-16T22:52:42.067Z" }
 wheels = [
@@ -1703,12 +1730,24 @@ name = "numpy"
 version = "2.4.4"
 source = { registry = "https://pypi.org/simple" }
 resolution-markers = [
-    "python_full_version >= '3.14' and sys_platform == 'win32'",
-    "python_full_version >= '3.14' and sys_platform == 'emscripten'",
-    "python_full_version >= '3.14' and sys_platform != 'emscripten' and sys_platform != 'win32'",
-    "python_full_version == '3.13.*' and sys_platform == 'win32'",
-    "python_full_version == '3.13.*' and sys_platform == 'emscripten'",
-    "python_full_version == '3.13.*' and sys_platform != 'emscripten' and sys_platform != 'win32'",
+    "python_full_version >= '3.14' and platform_machine == 'x86_64' and sys_platform == 'win32'",
+    "python_full_version >= '3.14' and platform_machine == 'aarch64' and sys_platform == 'win32'",
+    "python_full_version >= '3.14' and platform_machine != 'aarch64' and platform_machine != 'x86_64' and sys_platform == 'win32'",
+    "python_full_version >= '3.14' and platform_machine == 'x86_64' and sys_platform == 'emscripten'",
+    "python_full_version >= '3.14' and platform_machine == 'aarch64' and sys_platform == 'emscripten'",
+    "python_full_version >= '3.14' and platform_machine != 'aarch64' and platform_machine != 'x86_64' and sys_platform == 'emscripten'",
+    "python_full_version >= '3.14' and platform_machine == 'x86_64' and sys_platform != 'emscripten' and sys_platform != 'win32'",
+    "python_full_version >= '3.14' and platform_machine == 'aarch64' and sys_platform != 'emscripten' and sys_platform != 'win32'",
+    "python_full_version >= '3.14' and platform_machine != 'aarch64' and platform_machine != 'x86_64' and sys_platform != 'emscripten' and sys_platform != 'win32'",
+    "python_full_version == '3.13.*' and platform_machine == 'x86_64' and sys_platform == 'win32'",
+    "python_full_version == '3.13.*' and platform_machine == 'aarch64' and sys_platform == 'win32'",
+    "python_full_version == '3.13.*' and platform_machine != 'aarch64' and platform_machine != 'x86_64' and sys_platform == 'win32'",
+    "python_full_version == '3.13.*' and platform_machine == 'x86_64' and sys_platform == 'emscripten'",
+    "python_full_version == '3.13.*' and platform_machine == 'aarch64' and sys_platform == 'emscripten'",
+    "python_full_version == '3.13.*' and platform_machine != 'aarch64' and platform_machine != 'x86_64' and sys_platform == 'emscripten'",
+    "python_full_version == '3.13.*' and platform_machine == 'x86_64' and sys_platform != 'emscripten' and sys_platform != 'win32'",
+    "python_full_version == '3.13.*' and platform_machine == 'aarch64' and sys_platform != 'emscripten' and sys_platform != 'win32'",
+    "python_full_version == '3.13.*' and platform_machine != 'aarch64' and platform_machine != 'x86_64' and sys_platform != 'emscripten' and sys_platform != 'win32'",
 ]
 sdist = { url = "https://files.pythonhosted.org/packages/d7/9f/b8cef5bffa569759033adda9481211426f12f53299629b410340795c2514/numpy-2.4.4.tar.gz", hash = "sha256:2d390634c5182175533585cc89f3608a4682ccb173cc9bb940b2881c8d6f8fa0", size = 20731587, upload-time = "2026-03-29T13:22:01.298Z" }
 wheels = [
@@ -1771,42 +1810,116 @@ wheels = [
 name = "nvidia-cublas-cu12"
 version = "12.8.4.1"
 source = { registry = "https://pypi.org/simple" }
+resolution-markers = [
+    "python_full_version >= '3.14' and platform_machine == 'x86_64' and sys_platform != 'emscripten' and sys_platform != 'win32'",
+    "python_full_version == '3.13.*' and platform_machine == 'x86_64' and sys_platform != 'emscripten' and sys_platform != 'win32'",
+    "python_full_version < '3.13' and platform_machine == 'x86_64' and sys_platform != 'emscripten' and sys_platform != 'win32'",
+]
 wheels = [
    { url = "https://files.pythonhosted.org/packages/dc/61/e24b560ab2e2eaeb3c839129175fb330dfcfc29e5203196e5541a4c44682/nvidia_cublas_cu12-12.8.4.1-py3-none-manylinux_2_27_x86_64.whl", hash = "sha256:8ac4e771d5a348c551b2a426eda6193c19aa630236b418086020df5ba9667142", size = 594346921, upload-time = "2025-03-07T01:44:31.254Z" },
 ]

+[[package]]
+name = "nvidia-cublas-cu12"
+version = "12.9.1.4"
+source = { registry = "https://pypi.org/simple" }
+resolution-markers = [
+    "python_full_version >= '3.14' and platform_machine == 'aarch64' and sys_platform != 'emscripten' and sys_platform != 'win32'",
+    "python_full_version == '3.13.*' and platform_machine == 'aarch64' and sys_platform != 'emscripten' and sys_platform != 'win32'",
+    "python_full_version < '3.13' and platform_machine == 'aarch64' and sys_platform != 'emscripten' and sys_platform != 'win32'",
+]
+wheels = [
+    { url = "https://files.pythonhosted.org/packages/82/6c/90d3f532f608a03a13c1d6c16c266ffa3828e8011b1549d3b61db2ad59f5/nvidia_cublas_cu12-12.9.1.4-py3-none-manylinux_2_27_aarch64.whl", hash = "sha256:7a950dae01add3b415a5a5cdc4ec818fb5858263e9cca59004bb99fdbbd3a5d6", size = 575006342, upload-time = "2025-06-05T20:04:16.902Z" },
+]
+
 [[package]]
 name = "nvidia-cuda-cupti-cu12"
 version = "12.8.90"
 source = { registry = "https://pypi.org/simple" }
+resolution-markers = [
+    "python_full_version >= '3.14' and platform_machine == 'x86_64' and sys_platform != 'emscripten' and sys_platform != 'win32'",
+    "python_full_version == '3.13.*' and platform_machine == 'x86_64' and sys_platform != 'emscripten' and sys_platform != 'win32'",
+    "python_full_version < '3.13' and platform_machine == 'x86_64' and sys_platform != 'emscripten' and sys_platform != 'win32'",
+]
 wheels = [
    { url = "https://files.pythonhosted.org/packages/f8/02/2adcaa145158bf1a8295d83591d22e4103dbfd821bcaf6f3f53151ca4ffa/nvidia_cuda_cupti_cu12-12.8.90-py3-none-manylinux2014_x86_64.manylinux_2_17_x86_64.whl", hash = "sha256:ea0cb07ebda26bb9b29ba82cda34849e73c166c18162d3913575b0c9db9a6182", size = 10248621, upload-time = "2025-03-07T01:40:21.213Z" },
 ]

+[[package]]
+name = "nvidia-cuda-cupti-cu12"
+version = "12.9.79"
+source = { registry = "https://pypi.org/simple" }
+resolution-markers = [
+    "python_full_version >= '3.14' and platform_machine == 'aarch64' and sys_platform != 'emscripten' and sys_platform != 'win32'",
+    "python_full_version == '3.13.*' and platform_machine == 'aarch64' and sys_platform != 'emscripten' and sys_platform != 'win32'",
+    "python_full_version < '3.13' and platform_machine == 'aarch64' and sys_platform != 'emscripten' and sys_platform != 'win32'",
+]
+wheels = [
+    { url = "https://files.pythonhosted.org/packages/b4/78/351b5c8cdbd9a6b4fb0d6ee73fb176dcdc1b6b6ad47c2ffff5ae8ca4a1f7/nvidia_cuda_cupti_cu12-12.9.79-py3-none-manylinux_2_25_aarch64.whl", hash = "sha256:791853b030602c6a11d08b5578edfb957cadea06e9d3b26adbf8d036135a4afe", size = 10077166, upload-time = "2025-06-05T20:01:01.385Z" },
+]
+
 [[package]]
 name = "nvidia-cuda-nvrtc-cu12"
 version = "12.8.93"
 source = { registry = "https://pypi.org/simple" }
+resolution-markers = [
+    "python_full_version >= '3.14' and platform_machine == 'x86_64' and sys_platform != 'emscripten' and sys_platform != 'win32'",
+    "python_full_version == '3.13.*' and platform_machine == 'x86_64' and sys_platform != 'emscripten' and sys_platform != 'win32'",
+    "python_full_version < '3.13' and platform_machine == 'x86_64' and sys_platform != 'emscripten' and sys_platform != 'win32'",
+]
 wheels = [
    { url = "https://files.pythonhosted.org/packages/05/6b/32f747947df2da6994e999492ab306a903659555dddc0fbdeb9d71f75e52/nvidia_cuda_nvrtc_cu12-12.8.93-py3-none-manylinux2010_x86_64.manylinux_2_12_x86_64.whl", hash = "sha256:a7756528852ef889772a84c6cd89d41dfa74667e24cca16bb31f8f061e3e9994", size = 88040029, upload-time = "2025-03-07T01:42:13.562Z" },
 ]

+[[package]]
+name = "nvidia-cuda-nvrtc-cu12"
+version = "12.9.86"
+source = { registry = "https://pypi.org/simple" }
+resolution-markers = [
+    "python_full_version >= '3.14' and platform_machine == 'aarch64' and sys_platform != 'emscripten' and sys_platform != 'win32'",
+    "python_full_version == '3.13.*' and platform_machine == 'aarch64' and sys_platform != 'emscripten' and sys_platform != 'win32'",
+    "python_full_version < '3.13' and platform_machine == 'aarch64' and sys_platform != 'emscripten' and sys_platform != 'win32'",
+]
+wheels = [
+    { url = "https://files.pythonhosted.org/packages/64/eb/c2295044b8f3b3b08860e2f6a912b702fc92568a167259df5dddb78f325e/nvidia_cuda_nvrtc_cu12-12.9.86-py3-none-manylinux2014_aarch64.manylinux_2_17_aarch64.whl", hash = "sha256:096d4de6bda726415dfaf3198d4f5c522b8e70139c97feef5cd2ca6d4cd9cead", size = 44528905, upload-time = "2025-06-05T20:02:29.754Z" },
+]
+
 [[package]]
 name = "nvidia-cuda-runtime-cu12"
 version = "12.8.90"
 source = { registry = "https://pypi.org/simple" }
+resolution-markers = [
+    "python_full_version >= '3.14' and platform_machine == 'x86_64' and sys_platform != 'emscripten' and sys_platform != 'win32'",
+    "python_full_version == '3.13.*' and platform_machine == 'x86_64' and sys_platform != 'emscripten' and sys_platform != 'win32'",
+    "python_full_version < '3.13' and platform_machine == 'x86_64' and sys_platform != 'emscripten' and sys_platform != 'win32'",
+]
 wheels = [
    { url = "https://files.pythonhosted.org/packages/0d/9b/a997b638fcd068ad6e4d53b8551a7d30fe8b404d6f1804abf1df69838932/nvidia_cuda_runtime_cu12-12.8.90-py3-none-manylinux2014_x86_64.manylinux_2_17_x86_64.whl", hash = "sha256:adade8dcbd0edf427b7204d480d6066d33902cab2a4707dcfc48a2d0fd44ab90", size = 954765, upload-time = "2025-03-07T01:40:01.615Z" },
 ]

+[[package]]
+name = "nvidia-cuda-runtime-cu12"
+version = "12.9.79"
+source = { registry = "https://pypi.org/simple" }
+resolution-markers = [
+    "python_full_version >= '3.14' and platform_machine == 'aarch64' and sys_platform != 'emscripten' and sys_platform != 'win32'",
+    "python_full_version == '3.13.*' and platform_machine == 'aarch64' and sys_platform != 'emscripten' and sys_platform != 'win32'",
+    "python_full_version < '3.13' and platform_machine == 'aarch64' and sys_platform != 'emscripten' and sys_platform != 'win32'",
+]
+wheels = [
+    { url = "https://files.pythonhosted.org/packages/bc/e0/0279bd94539fda525e0c8538db29b72a5a8495b0c12173113471d28bce78/nvidia_cuda_runtime_cu12-12.9.79-py3-none-manylinux2014_aarch64.manylinux_2_17_aarch64.whl", hash = "sha256:83469a846206f2a733db0c42e223589ab62fd2fabac4432d2f8802de4bded0a4", size = 3515012, upload-time = "2025-06-05T20:00:35.519Z" },
+]
+
 [[package]]
 name = "nvidia-cudnn-cu12"
 version = "9.10.2.21"
 source = { registry = "https://pypi.org/simple" }
 dependencies = [
-    { name = "nvidia-cublas-cu12", marker = "sys_platform != 'emscripten' and sys_platform != 'win32'" },
+    { name = "nvidia-cublas-cu12", version = "12.8.4.1", source = { registry = "https://pypi.org/simple" }, marker = "platform_machine == 'x86_64' and sys_platform != 'emscripten' and sys_platform != 'win32'" },
+    { name = "nvidia-cublas-cu12", version = "12.9.1.4", source = { registry = "https://pypi.org/simple" }, marker = "platform_machine == 'aarch64' and sys_platform != 'emscripten' and sys_platform != 'win32'" },
 ]
 wheels = [
+    { url = "https://files.pythonhosted.org/packages/fa/41/e79269ce215c857c935fd86bcfe91a451a584dfc27f1e068f568b9ad1ab7/nvidia_cudnn_cu12-9.10.2.21-py3-none-manylinux_2_27_aarch64.whl", hash = "sha256:c9132cc3f8958447b4910a1720036d9eff5928cc3179b0a51fb6d167c6cc87d8", size = 705026878, upload-time = "2025-06-06T21:52:51.348Z" },
    { url = "https://files.pythonhosted.org/packages/ba/51/e123d997aa098c61d029f76663dedbfb9bc8dcf8c60cbd6adbe42f76d049/nvidia_cudnn_cu12-9.10.2.21-py3-none-manylinux_2_27_x86_64.whl", hash = "sha256:949452be657fa16687d0930933f032835951ef0892b37d2d53824d1a84dc97a8", size = 706758467, upload-time = "2025-06-06T21:54:08.597Z" },
 ]

@@ -1830,58 +1943,160 @@ wheels = [
 name = "nvidia-cufft-cu12"
 version = "11.3.3.83"
 source = { registry = "https://pypi.org/simple" }
+resolution-markers = [
+    "python_full_version >= '3.14' and platform_machine == 'x86_64' and sys_platform != 'emscripten' and sys_platform != 'win32'",
+    "python_full_version == '3.13.*' and platform_machine == 'x86_64' and sys_platform != 'emscripten' and sys_platform != 'win32'",
+    "python_full_version < '3.13' and platform_machine == 'x86_64' and sys_platform != 'emscripten' and sys_platform != 'win32'",
+]
 dependencies = [
-    { name = "nvidia-nvjitlink-cu12", marker = "sys_platform != 'emscripten' and sys_platform != 'win32'" },
+    { name = "nvidia-nvjitlink-cu12", version = "12.8.93", source = { registry = "https://pypi.org/simple" }, marker = "platform_machine == 'x86_64' and sys_platform != 'emscripten' and sys_platform != 'win32'" },
 ]
 wheels = [
    { url = "https://files.pythonhosted.org/packages/1f/13/ee4e00f30e676b66ae65b4f08cb5bcbb8392c03f54f2d5413ea99a5d1c80/nvidia_cufft_cu12-11.3.3.83-py3-none-manylinux2014_x86_64.manylinux_2_17_x86_64.whl", hash = "sha256:4d2dd21ec0b88cf61b62e6b43564355e5222e4a3fb394cac0db101f2dd0d4f74", size = 193118695, upload-time = "2025-03-07T01:45:27.821Z" },
 ]

+[[package]]
+name = "nvidia-cufft-cu12"
+version = "11.4.1.4"
+source = { registry = "https://pypi.org/simple" }
+resolution-markers = [
+    "python_full_version >= '3.14' and platform_machine == 'aarch64' and sys_platform != 'emscripten' and sys_platform != 'win32'",
+    "python_full_version == '3.13.*' and platform_machine == 'aarch64' and sys_platform != 'emscripten' and sys_platform != 'win32'",
+    "python_full_version < '3.13' and platform_machine == 'aarch64' and sys_platform != 'emscripten' and sys_platform != 'win32'",
+]
+dependencies = [
+    { name = "nvidia-nvjitlink-cu12", version = "12.9.86", source = { registry = "https://pypi.org/simple" }, marker = "platform_machine == 'aarch64' and sys_platform != 'emscripten' and sys_platform != 'win32'" },
+]
+wheels = [
+    { url = "https://files.pythonhosted.org/packages/9b/2b/76445b0af890da61b501fde30650a1a4bd910607261b209cccb5235d3daa/nvidia_cufft_cu12-11.4.1.4-py3-none-manylinux2014_aarch64.manylinux_2_17_aarch64.whl", hash = "sha256:1a28c9b12260a1aa7a8fd12f5ebd82d027963d635ba82ff39a1acfa7c4c0fbcf", size = 200822453, upload-time = "2025-06-05T20:05:27.889Z" },
+]
+
 [[package]]
 name = "nvidia-cufile-cu12"
 version = "1.13.1.3"
 source = { registry = "https://pypi.org/simple" }
+resolution-markers = [
+    "python_full_version >= '3.14' and platform_machine == 'x86_64' and sys_platform != 'emscripten' and sys_platform != 'win32'",
+    "python_full_version == '3.13.*' and platform_machine == 'x86_64' and sys_platform != 'emscripten' and sys_platform != 'win32'",
+    "python_full_version < '3.13' and platform_machine == 'x86_64' and sys_platform != 'emscripten' and sys_platform != 'win32'",
+]
 wheels = [
    { url = "https://files.pythonhosted.org/packages/bb/fe/1bcba1dfbfb8d01be8d93f07bfc502c93fa23afa6fd5ab3fc7c1df71038a/nvidia_cufile_cu12-1.13.1.3-py3-none-manylinux2014_x86_64.manylinux_2_17_x86_64.whl", hash = "sha256:1d069003be650e131b21c932ec3d8969c1715379251f8d23a1860554b1cb24fc", size = 1197834, upload-time = "2025-03-07T01:45:50.723Z" },
 ]

+[[package]]
+name = "nvidia-cufile-cu12"
+version = "1.14.1.1"
+source = { registry = "https://pypi.org/simple" }
+resolution-markers = [
+    "python_full_version >= '3.14' and platform_machine == 'aarch64' and sys_platform != 'emscripten' and sys_platform != 'win32'",
+    "python_full_version == '3.13.*' and platform_machine == 'aarch64' and sys_platform != 'emscripten' and sys_platform != 'win32'",
+    "python_full_version < '3.13' and platform_machine == 'aarch64' and sys_platform != 'emscripten' and sys_platform != 'win32'",
+]
+wheels = [
+    { url = "https://files.pythonhosted.org/packages/b9/d2/110af3a1f77999d5eebf6ffae5d2305ab839e53c76eec3696640cc25b35d/nvidia_cufile_cu12-1.14.1.1-py3-none-manylinux_2_27_aarch64.whl", hash = "sha256:8dea77590761e02cb6dd955a57cb6414c58aa3cb1b7adbf9919869a11509cf65", size = 1135994, upload-time = "2025-06-05T20:06:03.952Z" },
+]
+
 [[package]]
 name = "nvidia-curand-cu12"
 version = "10.3.9.90"
 source = { registry = "https://pypi.org/simple" }
+resolution-markers = [
+    "python_full_version >= '3.14' and platform_machine == 'x86_64' and sys_platform != 'emscripten' and sys_platform != 'win32'",
+    "python_full_version == '3.13.*' and platform_machine == 'x86_64' and sys_platform != 'emscripten' and sys_platform != 'win32'",
+    "python_full_version < '3.13' and platform_machine == 'x86_64' and sys_platform != 'emscripten' and sys_platform != 'win32'",
+]
 wheels = [
    { url = "https://files.pythonhosted.org/packages/fb/aa/6584b56dc84ebe9cf93226a5cde4d99080c8e90ab40f0c27bda7a0f29aa1/nvidia_curand_cu12-10.3.9.90-py3-none-manylinux_2_27_x86_64.whl", hash = "sha256:b32331d4f4df5d6eefa0554c565b626c7216f87a06a4f56fab27c3b68a830ec9", size = 63619976, upload-time = "2025-03-07T01:46:23.323Z" },
 ]

+[[package]]
+name = "nvidia-curand-cu12"
+version = "10.3.10.19"
+source = { registry = "https://pypi.org/simple" }
+resolution-markers = [
+    "python_full_version >= '3.14' and platform_machine == 'aarch64' and sys_platform != 'emscripten' and sys_platform != 'win32'",
+    "python_full_version == '3.13.*' and platform_machine == 'aarch64' and sys_platform != 'emscripten' and sys_platform != 'win32'",
+    "python_full_version < '3.13' and platform_machine == 'aarch64' and sys_platform != 'emscripten' and sys_platform != 'win32'",
+]
+wheels = [
+    { url = "https://files.pythonhosted.org/packages/14/1c/2a45afc614d99558d4a773fa740d8bb5471c8398eeed925fc0fcba020173/nvidia_curand_cu12-10.3.10.19-py3-none-manylinux_2_27_aarch64.whl", hash = "sha256:de663377feb1697e1d30ed587b07d5721fdd6d2015c738d7528a6002a6134d37", size = 68292066, upload-time = "2025-05-01T19:39:13.595Z" },
+]
+
 [[package]]
 name = "nvidia-cusolver-cu12"
 version = "11.7.3.90"
 source = { registry = "https://pypi.org/simple" }
+resolution-markers = [
+    "python_full_version >= '3.14' and platform_machine == 'x86_64' and sys_platform != 'emscripten' and sys_platform != 'win32'",
+    "python_full_version == '3.13.*' and platform_machine == 'x86_64' and sys_platform != 'emscripten' and sys_platform != 'win32'",
+    "python_full_version < '3.13' and platform_machine == 'x86_64' and sys_platform != 'emscripten' and sys_platform != 'win32'",
+]
 dependencies = [
-    { name = "nvidia-cublas-cu12", marker = "sys_platform != 'emscripten' and sys_platform != 'win32'" },
-    { name = "nvidia-cusparse-cu12", marker = "sys_platform != 'emscripten' and sys_platform != 'win32'" },
-    { name = "nvidia-nvjitlink-cu12", marker = "sys_platform != 'emscripten' and sys_platform != 'win32'" },
+    { name = "nvidia-cublas-cu12", version = "12.8.4.1", source = { registry = "https://pypi.org/simple" }, marker = "platform_machine == 'x86_64' and sys_platform != 'emscripten' and sys_platform != 'win32'" },
+    { name = "nvidia-cusparse-cu12", version = "12.5.8.93", source = { registry = "https://pypi.org/simple" }, marker = "platform_machine == 'x86_64' and sys_platform != 'emscripten' and sys_platform != 'win32'" },
+    { name = "nvidia-nvjitlink-cu12", version = "12.8.93", source = { registry = "https://pypi.org/simple" }, marker = "platform_machine == 'x86_64' and sys_platform != 'emscripten' and sys_platform != 'win32'" },
 ]
 wheels = [
    { url = "https://files.pythonhosted.org/packages/85/48/9a13d2975803e8cf2777d5ed57b87a0b6ca2cc795f9a4f59796a910bfb80/nvidia_cusolver_cu12-11.7.3.90-py3-none-manylinux_2_27_x86_64.whl", hash = "sha256:4376c11ad263152bd50ea295c05370360776f8c3427b30991df774f9fb26c450", size = 267506905, upload-time = "2025-03-07T01:47:16.273Z" },
 ]

+[[package]]
+name = "nvidia-cusolver-cu12"
+version = "11.7.5.82"
+source = { registry = "https://pypi.org/simple" }
+resolution-markers = [
+    "python_full_version >= '3.14' and platform_machine == 'aarch64' and sys_platform != 'emscripten' and sys_platform != 'win32'",
+    "python_full_version == '3.13.*' and platform_machine == 'aarch64' and sys_platform != 'emscripten' and sys_platform != 'win32'",
+    "python_full_version < '3.13' and platform_machine == 'aarch64' and sys_platform != 'emscripten' and sys_platform != 'win32'",
+]
+dependencies = [
+    { name = "nvidia-cublas-cu12", version = "12.9.1.4", source = { registry = "https://pypi.org/simple" }, marker = "platform_machine == 'aarch64' and sys_platform != 'emscripten' and sys_platform != 'win32'" },
+    { name = "nvidia-cusparse-cu12", version = "12.5.10.65", source = { registry = "https://pypi.org/simple" }, marker = "platform_machine == 'aarch64' and sys_platform != 'emscripten' and sys_platform != 'win32'" },
+    { name = "nvidia-nvjitlink-cu12", version = "12.9.86", source = { registry = "https://pypi.org/simple" }, marker = "platform_machine == 'aarch64' and sys_platform != 'emscripten' and sys_platform != 'win32'" },
+]
+wheels = [
+    { url = "https://files.pythonhosted.org/packages/03/99/686ff9bf3a82a531c62b1a5c614476e8dfa24a9d89067aeedf3592ee4538/nvidia_cusolver_cu12-11.7.5.82-py3-none-manylinux_2_27_aarch64.whl", hash = "sha256:62efa83e4ace59a4c734d052bb72158e888aa7b770e1a5f601682f16fe5b4fd2", size = 337869834, upload-time = "2025-06-05T20:06:53.125Z" },
+]
+
 [[package]]
 name = "nvidia-cusparse-cu12"
 version = "12.5.8.93"
 source = { registry = "https://pypi.org/simple" }
+resolution-markers = [
+    "python_full_version >= '3.14' and platform_machine == 'x86_64' and sys_platform != 'emscripten' and sys_platform != 'win32'",
+    "python_full_version == '3.13.*' and platform_machine == 'x86_64' and sys_platform != 'emscripten' and sys_platform != 'win32'",
+    "python_full_version < '3.13' and platform_machine == 'x86_64' and sys_platform != 'emscripten' and sys_platform != 'win32'",
+]
 dependencies = [
-    { name = "nvidia-nvjitlink-cu12", marker = "sys_platform != 'emscripten' and sys_platform != 'win32'" },
+    { name = "nvidia-nvjitlink-cu12", version = "12.8.93", source = { registry = "https://pypi.org/simple" }, marker = "platform_machine == 'x86_64' and sys_platform != 'emscripten' and sys_platform != 'win32'" },
 ]
 wheels = [
    { url = "https://files.pythonhosted.org/packages/c2/f5/e1854cb2f2bcd4280c44736c93550cc300ff4b8c95ebe370d0aa7d2b473d/nvidia_cusparse_cu12-12.5.8.93-py3-none-manylinux2014_x86_64.manylinux_2_17_x86_64.whl", hash = "sha256:1ec05d76bbbd8b61b06a80e1eaf8cf4959c3d4ce8e711b65ebd0443bb0ebb13b", size = 288216466, upload-time = "2025-03-07T01:48:13.779Z" },
 ]

+[[package]]
+name = "nvidia-cusparse-cu12"
+version = "12.5.10.65"
+source = { registry = "https://pypi.org/simple" }
+resolution-markers = [
+    "python_full_version >= '3.14' and platform_machine == 'aarch64' and sys_platform != 'emscripten' and sys_platform != 'win32'",
+    "python_full_version == '3.13.*' and platform_machine == 'aarch64' and sys_platform != 'emscripten' and sys_platform != 'win32'",
+    "python_full_version < '3.13' and platform_machine == 'aarch64' and sys_platform != 'emscripten' and sys_platform != 'win32'",
+]
+dependencies = [
+    { name = "nvidia-nvjitlink-cu12", version = "12.9.86", source = { registry = "https://pypi.org/simple" }, marker = "platform_machine == 'aarch64' and sys_platform != 'emscripten' and sys_platform != 'win32'" },
+]
+wheels = [
+    { url = "https://files.pythonhosted.org/packages/5e/6f/8710fbd17cdd1d0fc3fea7d36d5b65ce1933611c31e1861da330206b253a/nvidia_cusparse_cu12-12.5.10.65-py3-none-manylinux2014_aarch64.manylinux_2_17_aarch64.whl", hash = "sha256:221c73e7482dd93eda44e65ce567c031c07e2f93f6fa0ecd3ba876a195023e83", size = 366359408, upload-time = "2025-06-05T20:07:42.501Z" },
+]
+
 [[package]]
 name = "nvidia-cusparselt-cu12"
 version = "0.7.1"
 source = { registry = "https://pypi.org/simple" }
 wheels = [
+    { url = "https://files.pythonhosted.org/packages/73/b9/598f6ff36faaece4b3c50d26f50e38661499ff34346f00e057760b35cc9d/nvidia_cusparselt_cu12-0.7.1-py3-none-manylinux2014_aarch64.whl", hash = "sha256:8878dce784d0fac90131b6817b607e803c36e629ba34dc5b433471382196b6a5", size = 283835557, upload-time = "2025-02-26T00:16:54.265Z" },
    { url = "https://files.pythonhosted.org/packages/56/79/12978b96bd44274fe38b5dde5cfb660b1d114f70a65ef962bcbbed99b549/nvidia_cusparselt_cu12-0.7.1-py3-none-manylinux2014_x86_64.whl", hash = "sha256:f1bb701d6b930d5a7cea44c19ceb973311500847f81b634d802b7b539dc55623", size = 287193691, upload-time = "2025-02-26T00:15:44.104Z" },
 ]

@@ -1929,6 +2144,7 @@ name = "nvidia-nccl-cu12"
 version = "2.27.5"
 source = { registry = "https://pypi.org/simple" }
 wheels = [
+    { url = "https://files.pythonhosted.org/packages/bb/1c/857979db0ef194ca5e21478a0612bcdbbe59458d7694361882279947b349/nvidia_nccl_cu12-2.27.5-py3-none-manylinux2014_aarch64.manylinux_2_17_aarch64.whl", hash = "sha256:31432ad4d1fb1004eb0c56203dc9bc2178a1ba69d1d9e02d64a6938ab5e40e7a", size = 322400625, upload-time = "2025-06-26T04:11:04.496Z" },
    { url = "https://files.pythonhosted.org/packages/6e/89/f7a07dc961b60645dbbf42e80f2bc85ade7feb9a491b11a1e973aa00071f/nvidia_nccl_cu12-2.27.5-py3-none-manylinux2014_x86_64.manylinux_2_17_x86_64.whl", hash = "sha256:ad730cf15cb5d25fe849c6e6ca9eb5b76db16a80f13f425ac68d8e2e55624457", size = 322348229, upload-time = "2025-06-26T04:11:28.385Z" },
 ]

@@ -1936,15 +2152,34 @@ wheels = [
 name = "nvidia-nvjitlink-cu12"
 version = "12.8.93"
 source = { registry = "https://pypi.org/simple" }
+resolution-markers = [
+    "python_full_version >= '3.14' and platform_machine == 'x86_64' and sys_platform != 'emscripten' and sys_platform != 'win32'",
+    "python_full_version == '3.13.*' and platform_machine == 'x86_64' and sys_platform != 'emscripten' and sys_platform != 'win32'",
+    "python_full_version < '3.13' and platform_machine == 'x86_64' and sys_platform != 'emscripten' and sys_platform != 'win32'",
+]
 wheels = [
    { url = "https://files.pythonhosted.org/packages/f6/74/86a07f1d0f42998ca31312f998bd3b9a7eff7f52378f4f270c8679c77fb9/nvidia_nvjitlink_cu12-12.8.93-py3-none-manylinux2010_x86_64.manylinux_2_12_x86_64.whl", hash = "sha256:81ff63371a7ebd6e6451970684f916be2eab07321b73c9d244dc2b4da7f73b88", size = 39254836, upload-time = "2025-03-07T01:49:55.661Z" },
 ]

+[[package]]
+name = "nvidia-nvjitlink-cu12"
+version = "12.9.86"
+source = { registry = "https://pypi.org/simple" }
+resolution-markers = [
+    "python_full_version >= '3.14' and platform_machine == 'aarch64' and sys_platform != 'emscripten' and sys_platform != 'win32'",
+    "python_full_version == '3.13.*' and platform_machine == 'aarch64' and sys_platform != 'emscripten' and sys_platform != 'win32'",
+    "python_full_version < '3.13' and platform_machine == 'aarch64' and sys_platform != 'emscripten' and sys_platform != 'win32'",
+]
+wheels = [
+    { url = "https://files.pythonhosted.org/packages/97/bc/2dcba8e70cf3115b400fef54f213bcd6715a3195eba000f8330f11e40c45/nvidia_nvjitlink_cu12-12.9.86-py3-none-manylinux2014_aarch64.manylinux_2_17_aarch64.whl", hash = "sha256:994a05ef08ef4b0b299829cde613a424382aff7efb08a7172c1fa616cc3af2ca", size = 39514880, upload-time = "2025-06-05T20:10:04.89Z" },
+]
+
 [[package]]
 name = "nvidia-nvshmem-cu12"
 version = "3.3.20"
 source = { registry = "https://pypi.org/simple" }
 wheels = [
+    { url = "https://files.pythonhosted.org/packages/92/9d/3dd98852568fb845ec1f7902c90a22b240fe1cbabda411ccedf2fd737b7b/nvidia_nvshmem_cu12-3.3.20-py3-none-manylinux2014_aarch64.manylinux_2_17_aarch64.whl", hash = "sha256:0b0b960da3842212758e4fa4696b94f129090b30e5122fea3c5345916545cff0", size = 124484616, upload-time = "2025-08-04T20:24:59.172Z" },
    { url = "https://files.pythonhosted.org/packages/3b/6c/99acb2f9eb85c29fc6f3a7ac4dccfd992e22666dd08a642b303311326a97/nvidia_nvshmem_cu12-3.3.20-py3-none-manylinux2014_x86_64.manylinux_2_17_x86_64.whl", hash = "sha256:d00f26d3f9b2e3c3065be895e3059d6479ea5c638a3f38c9fec49b1b9dd7c1e5", size = 124657145, upload-time = "2025-08-04T20:25:19.995Z" },
 ]

@@ -1952,10 +2187,28 @@ wheels = [
 name = "nvidia-nvtx-cu12"
 version = "12.8.90"
 source = { registry = "https://pypi.org/simple" }
+resolution-markers = [
+    "python_full_version >= '3.14' and platform_machine == 'x86_64' and sys_platform != 'emscripten' and sys_platform != 'win32'",
+    "python_full_version == '3.13.*' and platform_machine == 'x86_64' and sys_platform != 'emscripten' and sys_platform != 'win32'",
+    "python_full_version < '3.13' and platform_machine == 'x86_64' and sys_platform != 'emscripten' and sys_platform != 'win32'",
+]
 wheels = [
    { url = "https://files.pythonhosted.org/packages/a2/eb/86626c1bbc2edb86323022371c39aa48df6fd8b0a1647bc274577f72e90b/nvidia_nvtx_cu12-12.8.90-py3-none-manylinux2014_x86_64.manylinux_2_17_x86_64.whl", hash = "sha256:5b17e2001cc0d751a5bc2c6ec6d26ad95913324a4adb86788c944f8ce9ba441f", size = 89954, upload-time = "2025-03-07T01:42:44.131Z" },
 ]

+[[package]]
+name = "nvidia-nvtx-cu12"
+version = "12.9.79"
+source = { registry = "https://pypi.org/simple" }
+resolution-markers = [
+    "python_full_version >= '3.14' and platform_machine == 'aarch64' and sys_platform != 'emscripten' and sys_platform != 'win32'",
+    "python_full_version == '3.13.*' and platform_machine == 'aarch64' and sys_platform != 'emscripten' and sys_platform != 'win32'",
+    "python_full_version < '3.13' and platform_machine == 'aarch64' and sys_platform != 'emscripten' and sys_platform != 'win32'",
+]
+wheels = [
+    { url = "https://files.pythonhosted.org/packages/c4/e4/82155e4aaedb41621087ba219c95e99c5e417f37a7649b4fb6ec32dcb14d/nvidia_nvtx_cu12-12.9.79-py3-none-manylinux2014_aarch64.manylinux_2_17_aarch64.whl", hash = "sha256:d1f258e752294acdb4f61c3d31fee87bd0f60e459f1e2f624376369b524cd15d", size = 86120, upload-time = "2025-06-05T20:02:51.838Z" },
+]
+
 [[package]]
 name = "openai"
 version = "2.6.1"
@@ -2072,7 +2325,8 @@ dependencies = [
    { name = "pydantic" },
    { name = "referencing" },
    { name = "requests" },
-    { name = "torch" },
+    { name = "torch", version = "2.9.1", source = { registry = "https://pypi.org/simple" }, marker = "platform_machine != 'aarch64'" },
+    { name = "torch", version = "2.9.1+cu129", source = { registry = "https://download.pytorch.org/whl/cu129" }, marker = "platform_machine == 'aarch64'" },
    { name = "tqdm" },
    { name = "typing-extensions" },
 ]
@@ -2893,7 +3147,8 @@ source = { registry = "https://pypi.org/simple" }
 dependencies = [
    { name = "apache-tvm-ffi" },
    { name = "nvidia-cutlass-dsl" },
-    { name = "torch" },
+    { name = "torch", version = "2.9.1", source = { registry = "https://pypi.org/simple" }, marker = "platform_machine != 'aarch64'" },
+    { name = "torch", version = "2.9.1+cu129", source = { registry = "https://download.pytorch.org/whl/cu129" }, marker = "platform_machine == 'aarch64'" },
    { name = "torch-c-dlpack-ext" },
 ]
 sdist = { url = "https://files.pythonhosted.org/packages/73/34/bcc87d1ee53cf245bf58ea563b276b9bd86a405bda5a42e7bd1386db9941/quack_kernels-0.3.11.tar.gz", hash = "sha256:d589417476030fb62e70730c4bd0732339a04b8bb91fd49bf4cc70e20a27170b", size = 246675, upload-time = "2026-04-20T01:08:12.269Z" }
@@ -3315,8 +3570,7 @@ wheels = [

 [[package]]
 name = "sglang"
-version = "0.5.10"
-source = { registry = "https://pypi.org/simple" }
+source = { editable = "third_party/sglang/python" }
 dependencies = [
    { name = "aiohttp" },
    { name = "anthropic" },
@@ -3369,7 +3623,8 @@ dependencies = [
    { name = "soundfile" },
    { name = "tiktoken" },
    { name = "timm" },
-    { name = "torch" },
+    { name = "torch", version = "2.9.1", source = { registry = "https://pypi.org/simple" }, marker = "platform_machine != 'aarch64'" },
+    { name = "torch", version = "2.9.1+cu129", source = { registry = "https://download.pytorch.org/whl/cu129" }, marker = "platform_machine == 'aarch64'" },
    { name = "torch-memory-saver" },
    { name = "torchao" },
    { name = "torchaudio" },
@@ -3382,10 +3637,118 @@ dependencies = [
    { name = "watchfiles" },
    { name = "xgrammar" },
 ]
-sdist = { url = "https://files.pythonhosted.org/packages/c8/4e/bd00d332098337ae13fa783a13258935d568dd5b7e1fd9df205184145224/sglang-0.5.10.tar.gz", hash = "sha256:db78367f41a1f385f8624a10e9506b671e788f9943978df6a37a486867c1edc7", size = 4700833, upload-time = "2026-04-05T23:57:27.556Z" }
-wheels = [
-    { url = "https://files.pythonhosted.org/packages/1f/ee/f7a946162ed538f47a1c5542f93410e5bf9a0c4ca6021d4000e6f9b87f7d/sglang-0.5.10-py3-none-any.whl", hash = "sha256:ac8855a5d57dac8831fee526bca5212f1ae451f378e2ab08b3baecbc4deb4076", size = 6064398, upload-time = "2026-04-05T23:57:25.28Z" },
+
+[package.metadata]
+requires-dist = [
+    { name = "accelerate", marker = "extra == 'test'" },
+    { name = "addict", marker = "extra == 'diffusion'", specifier = "==2.4.0" },
+    { name = "addict", marker = "extra == 'test'" },
+    { name = "aiohttp" },
+    { name = "anthropic", specifier = ">=0.20.0" },
+    { name = "apache-tvm-ffi", specifier = ">=0.1.5,<0.2" },
+    { name = "av", marker = "(platform_machine == 'aarch64' and sys_platform == 'linux') or (platform_machine == 'arm64' and sys_platform == 'linux') or (platform_machine == 'armv7l' and sys_platform == 'linux')" },
+    { name = "av", marker = "extra == 'diffusion'", specifier = "==16.1.0" },
+    { name = "bitsandbytes", marker = "extra == 'test'" },
+    { name = "blobfile", specifier = "==3.0.0" },
+    { name = "build" },
+    { name = "cache-dit", marker = "extra == 'diffusion'", specifier = "==1.3.0" },
+    { name = "checkpoint-engine", marker = "extra == 'checkpoint-engine'", specifier = "==0.1.2" },
+    { name = "cloudpickle", marker = "extra == 'diffusion'", specifier = "==3.1.2" },
+    { name = "compressed-tensors" },
+    { name = "cuda-python", specifier = "==12.9" },
+    { name = "datasets" },
+    { name = "decord2", marker = "(platform_machine == 'aarch64' and sys_platform == 'linux') or (platform_machine == 'arm64' and sys_platform == 'linux') or (platform_machine == 'armv7l' and sys_platform == 'linux')" },
+    { name = "diff-cover", marker = "extra == 'test'" },
+    { name = "diffusers", marker = "extra == 'diffusion'", specifier = "==0.37.0" },
+    { name = "einops" },
+    { name = "expecttest", marker = "extra == 'test'" },
+    { name = "fastapi" },
+    { name = "flash-attn-4", specifier = ">=4.0.0b4" },
+    { name = "flashinfer-cubin", specifier = "==0.6.7.post2" },
+    { name = "flashinfer-python", specifier = "==0.6.7.post2" },
+    { name = "gguf" },
+    { name = "imageio", marker = "extra == 'diffusion'", specifier = "==2.36.0" },
+    { name = "imageio-ffmpeg", marker = "extra == 'diffusion'", specifier = "==0.5.1" },
+    { name = "interegular" },
+    { name = "ipython" },
+    { name = "jsonlines", marker = "extra == 'test'" },
+    { name = "llguidance", specifier = ">=0.7.11,<0.8.0" },
+    { name = "lm-eval", extras = ["api"], marker = "extra == 'test'", specifier = ">=0.4.9.2" },
+    { name = "matplotlib", marker = "extra == 'test'" },
+    { name = "mistral-common", specifier = ">=1.9.0" },
+    { name = "modelscope" },
+    { name = "moviepy", marker = "extra == 'diffusion'", specifier = ">=2.0.0" },
+    { name = "msgspec" },
+    { name = "ninja" },
+    { name = "numpy" },
+    { name = "nvidia-cutlass-dsl", specifier = ">=4.4.1" },
+    { name = "nvidia-ml-py" },
+    { name = "openai", specifier = "==2.6.1" },
+    { name = "openai-harmony", specifier = "==0.0.4" },
+    { name = "opencv-python-headless", marker = "extra == 'diffusion'", specifier = "==4.10.0.84" },
+    { name = "opentelemetry-api", marker = "extra == 'tracing'" },
+    { name = "opentelemetry-exporter-otlp", marker = "extra == 'tracing'" },
+    { name = "opentelemetry-exporter-otlp-proto-grpc", marker = "extra == 'tracing'" },
+    { name = "opentelemetry-sdk", marker = "extra == 'tracing'" },
+    { name = "orjson" },
+    { name = "outlines", specifier = "==0.1.11" },
+    { name = "packaging" },
+    { name = "pandas", marker = "extra == 'test'" },
+    { name = "parameterized", marker = "extra == 'test'" },
+    { name = "partial-json-parser" },
+    { name = "peft", marker = "extra == 'test'", specifier = ">=0.18.0" },
+    { name = "pillow" },
+    { name = "polars", marker = "extra == 'test'" },
+    { name = "prometheus-client", specifier = ">=0.20.0" },
+    { name = "psutil" },
+    { name = "py-spy" },
+    { name = "pybase64" },
+    { name = "pydantic" },
+    { name = "pytest", marker = "extra == 'test'" },
+    { name = "pytest-cov", marker = "extra == 'test'" },
+    { name = "python-multipart" },
+    { name = "pyyaml", marker = "extra == 'diffusion'", specifier = "==6.0.1" },
+    { name = "pyzmq", specifier = ">=25.1.2" },
+    { name = "quack-kernels", specifier = ">=0.3.0" },
+    { name = "ray", extras = ["default"], marker = "extra == 'ray'", specifier = ">=2.54.0" },
+    { name = "remote-pdb", marker = "extra == 'diffusion'", specifier = "==2.1.0" },
+    { name = "requests" },
+    { name = "runai-model-streamer", marker = "extra == 'diffusion'", specifier = ">=0.15.7" },
+    { name = "runai-model-streamer", extras = ["azure", "gcs", "s3"], marker = "extra == 'runai'", specifier = ">=0.15.7" },
+    { name = "scikit-image", marker = "extra == 'diffusion'", specifier = "==0.25.2" },
+    { name = "scipy" },
+    { name = "sentence-transformers", marker = "extra == 'test'" },
+    { name = "sentencepiece" },
+    { name = "setproctitle" },
+    { name = "sglang", extras = ["diffusion"], marker = "extra == 'all'" },
+    { name = "sglang", extras = ["test"], marker = "extra == 'dev'" },
+    { name = "sglang", extras = ["tracing"], marker = "extra == 'all'" },
+    { name = "sglang-kernel", specifier = "==0.4.1" },
+    { name = "smg-grpc-servicer", specifier = ">=0.5.0" },
+    { name = "soundfile", specifier = "==0.13.1" },
+    { name = "st-attn", marker = "platform_machine != 'aarch64' and platform_machine != 'arm64' and extra == 'diffusion'", specifier = "==0.0.7" },
+    { name = "tabulate", marker = "extra == 'test'" },
+    { name = "tiktoken" },
+    { name = "timm", specifier = "==1.0.16" },
+    { name = "torch", marker = "platform_machine != 'aarch64' and platform_machine != 'x86_64'", specifier = "==2.9.1" },
+    { name = "torch", marker = "platform_machine == 'aarch64'", specifier = "==2.9.1", index = "https://download.pytorch.org/whl/cu129" },
+    { name = "torch", marker = "platform_machine == 'x86_64'", specifier = "==2.9.1", index = "https://pypi.org/simple" },
+    { name = "torch-memory-saver", specifier = "==0.0.9" },
+    { name = "torchao", specifier = "==0.9.0" },
+    { name = "torchaudio", specifier = "==2.9.1" },
+    { name = "torchcodec", marker = "(platform_machine != 'aarch64' and platform_machine != 'arm64' and platform_machine != 'armv7l') or sys_platform != 'linux'", specifier = "==0.9.1" },
+    { name = "torchvision" },
+    { name = "tqdm" },
+    { name = "transformers", specifier = "==5.3.0" },
+    { name = "trimesh", marker = "extra == 'diffusion'", specifier = ">=4.0.0" },
+    { name = "uvicorn" },
+    { name = "uvloop" },
+    { name = "vsa", marker = "platform_machine != 'aarch64' and platform_machine != 'arm64' and extra == 'diffusion'", specifier = "==0.0.4" },
+    { name = "watchfiles" },
+    { name = "xatlas", marker = "extra == 'diffusion'" },
+    { name = "xgrammar", specifier = "==0.1.32" },
 ]
+provides-extras = ["checkpoint-engine", "runai", "diffusion", "ray", "tracing", "test", "dev", "all"]

 [[package]]
 name = "sglang-kernel"
@@ -3574,7 +3937,8 @@ dependencies = [
    { name = "huggingface-hub" },
    { name = "pyyaml" },
    { name = "safetensors" },
-    { name = "torch" },
+    { name = "torch", version = "2.9.1", source = { registry = "https://pypi.org/simple" }, marker = "platform_machine != 'aarch64'" },
+    { name = "torch", version = "2.9.1+cu129", source = { registry = "https://download.pytorch.org/whl/cu129" }, marker = "platform_machine == 'aarch64'" },
    { name = "torchvision" },
 ]
 sdist = { url = "https://files.pythonhosted.org/packages/94/f6/4d7a8c261341fa6ad281920618739f2a650f41043afcedb570f24e99a776/timm-1.0.16.tar.gz", hash = "sha256:a3b8130dd2cb8dc3b9f5e3d09ab6d677a6315a8695fd5264eb6d52a4a46c1044", size = 2339999, upload-time = "2025-06-26T17:09:44.208Z" }
@@ -3612,30 +3976,50 @@ wheels = [
 name = "torch"
 version = "2.9.1"
 source = { registry = "https://pypi.org/simple" }
+resolution-markers = [
+    "python_full_version >= '3.14' and platform_machine == 'x86_64' and sys_platform == 'win32'",
+    "python_full_version >= '3.14' and platform_machine != 'aarch64' and platform_machine != 'x86_64' and sys_platform == 'win32'",
+    "python_full_version >= '3.14' and platform_machine == 'x86_64' and sys_platform == 'emscripten'",
+    "python_full_version >= '3.14' and platform_machine != 'aarch64' and platform_machine != 'x86_64' and sys_platform == 'emscripten'",
+    "python_full_version >= '3.14' and platform_machine == 'x86_64' and sys_platform != 'emscripten' and sys_platform != 'win32'",
+    "python_full_version >= '3.14' and platform_machine != 'aarch64' and platform_machine != 'x86_64' and sys_platform != 'emscripten' and sys_platform != 'win32'",
+    "python_full_version == '3.13.*' and platform_machine == 'x86_64' and sys_platform == 'win32'",
+    "python_full_version == '3.13.*' and platform_machine != 'aarch64' and platform_machine != 'x86_64' and sys_platform == 'win32'",
+    "python_full_version == '3.13.*' and platform_machine == 'x86_64' and sys_platform == 'emscripten'",
+    "python_full_version == '3.13.*' and platform_machine != 'aarch64' and platform_machine != 'x86_64' and sys_platform == 'emscripten'",
+    "python_full_version == '3.13.*' and platform_machine == 'x86_64' and sys_platform != 'emscripten' and sys_platform != 'win32'",
+    "python_full_version == '3.13.*' and platform_machine != 'aarch64' and platform_machine != 'x86_64' and sys_platform != 'emscripten' and sys_platform != 'win32'",
+    "python_full_version < '3.13' and platform_machine == 'x86_64' and sys_platform == 'win32'",
+    "python_full_version < '3.13' and platform_machine != 'aarch64' and platform_machine != 'x86_64' and sys_platform == 'win32'",
+    "python_full_version < '3.13' and platform_machine == 'x86_64' and sys_platform == 'emscripten'",
+    "python_full_version < '3.13' and platform_machine != 'aarch64' and platform_machine != 'x86_64' and sys_platform == 'emscripten'",
+    "python_full_version < '3.13' and platform_machine == 'x86_64' and sys_platform != 'emscripten' and sys_platform != 'win32'",
+    "python_full_version < '3.13' and platform_machine != 'aarch64' and platform_machine != 'x86_64' and sys_platform != 'emscripten' and sys_platform != 'win32'",
+]
 dependencies = [
-    { name = "filelock" },
-    { name = "fsspec" },
-    { name = "jinja2" },
-    { name = "networkx" },
-    { name = "nvidia-cublas-cu12", marker = "platform_machine == 'x86_64' and sys_platform == 'linux'" },
-    { name = "nvidia-cuda-cupti-cu12", marker = "platform_machine == 'x86_64' and sys_platform == 'linux'" },
-    { name = "nvidia-cuda-nvrtc-cu12", marker = "platform_machine == 'x86_64' and sys_platform == 'linux'" },
-    { name = "nvidia-cuda-runtime-cu12", marker = "platform_machine == 'x86_64' and sys_platform == 'linux'" },
+    { name = "filelock", marker = "platform_machine != 'aarch64'" },
+    { name = "fsspec", marker = "platform_machine != 'aarch64'" },
+    { name = "jinja2", marker = "platform_machine != 'aarch64'" },
+    { name = "networkx", marker = "platform_machine != 'aarch64'" },
+    { name = "nvidia-cublas-cu12", version = "12.8.4.1", source = { registry = "https://pypi.org/simple" }, marker = "platform_machine == 'x86_64' and sys_platform == 'linux'" },
+    { name = "nvidia-cuda-cupti-cu12", version = "12.8.90", source = { registry = "https://pypi.org/simple" }, marker = "platform_machine == 'x86_64' and sys_platform == 'linux'" },
+    { name = "nvidia-cuda-nvrtc-cu12", version = "12.8.93", source = { registry = "https://pypi.org/simple" }, marker = "platform_machine == 'x86_64' and sys_platform == 'linux'" },
+    { name = "nvidia-cuda-runtime-cu12", version = "12.8.90", source = { registry = "https://pypi.org/simple" }, marker = "platform_machine == 'x86_64' and sys_platform == 'linux'" },
    { name = "nvidia-cudnn-cu12", marker = "platform_machine == 'x86_64' and sys_platform == 'linux'" },
-    { name = "nvidia-cufft-cu12", marker = "platform_machine == 'x86_64' and sys_platform == 'linux'" },
-    { name = "nvidia-cufile-cu12", marker = "platform_machine == 'x86_64' and sys_platform == 'linux'" },
-    { name = "nvidia-curand-cu12", marker = "platform_machine == 'x86_64' and sys_platform == 'linux'" },
-    { name = "nvidia-cusolver-cu12", marker = "platform_machine == 'x86_64' and sys_platform == 'linux'" },
-    { name = "nvidia-cusparse-cu12", marker = "platform_machine == 'x86_64' and sys_platform == 'linux'" },
+    { name = "nvidia-cufft-cu12", version = "11.3.3.83", source = { registry = "https://pypi.org/simple" }, marker = "platform_machine == 'x86_64' and sys_platform == 'linux'" },
+    { name = "nvidia-cufile-cu12", version = "1.13.1.3", source = { registry = "https://pypi.org/simple" }, marker = "platform_machine == 'x86_64' and sys_platform == 'linux'" },
+    { name = "nvidia-curand-cu12", version = "10.3.9.90", source = { registry = "https://pypi.org/simple" }, marker = "platform_machine == 'x86_64' and sys_platform == 'linux'" },
+    { name = "nvidia-cusolver-cu12", version = "11.7.3.90", source = { registry = "https://pypi.org/simple" }, marker = "platform_machine == 'x86_64' and sys_platform == 'linux'" },
+    { name = "nvidia-cusparse-cu12", version = "12.5.8.93", source = { registry = "https://pypi.org/simple" }, marker = "platform_machine == 'x86_64' and sys_platform == 'linux'" },
    { name = "nvidia-cusparselt-cu12", marker = "platform_machine == 'x86_64' and sys_platform == 'linux'" },
    { name = "nvidia-nccl-cu12", marker = "platform_machine == 'x86_64' and sys_platform == 'linux'" },
-    { name = "nvidia-nvjitlink-cu12", marker = "platform_machine == 'x86_64' and sys_platform == 'linux'" },
+    { name = "nvidia-nvjitlink-cu12", version = "12.8.93", source = { registry = "https://pypi.org/simple" }, marker = "platform_machine == 'x86_64' and sys_platform == 'linux'" },
    { name = "nvidia-nvshmem-cu12", marker = "platform_machine == 'x86_64' and sys_platform == 'linux'" },
-    { name = "nvidia-nvtx-cu12", marker = "platform_machine == 'x86_64' and sys_platform == 'linux'" },
-    { name = "setuptools" },
-    { name = "sympy" },
+    { name = "nvidia-nvtx-cu12", version = "12.8.90", source = { registry = "https://pypi.org/simple" }, marker = "platform_machine == 'x86_64' and sys_platform == 'linux'" },
+    { name = "setuptools", marker = "platform_machine != 'aarch64'" },
+    { name = "sympy", marker = "platform_machine != 'aarch64'" },
    { name = "triton", marker = "platform_machine == 'x86_64' and sys_platform == 'linux'" },
-    { name = "typing-extensions" },
+    { name = "typing-extensions", marker = "platform_machine != 'aarch64'" },
 ]
 wheels = [
    { url = "https://files.pythonhosted.org/packages/0f/27/07c645c7673e73e53ded71705045d6cb5bae94c4b021b03aa8d03eee90ab/torch-2.9.1-cp312-cp312-manylinux_2_28_aarch64.whl", hash = "sha256:da5f6f4d7f4940a173e5572791af238cb0b9e21b1aab592bd8b26da4c99f1cd6", size = 104126592, upload-time = "2025-11-12T15:20:41.62Z" },
@@ -3660,12 +4044,61 @@ wheels = [
    { url = "https://files.pythonhosted.org/packages/db/2b/f7818f6ec88758dfd21da46b6cd46af9d1b3433e53ddbb19ad1e0da17f9b/torch-2.9.1-cp314-cp314t-win_amd64.whl", hash = "sha256:c88d3299ddeb2b35dcc31753305612db485ab6f1823e37fb29451c8b2732b87e", size = 111163659, upload-time = "2025-11-12T15:23:20.009Z" },
 ]

+[[package]]
+name = "torch"
+version = "2.9.1+cu129"
+source = { registry = "https://download.pytorch.org/whl/cu129" }
+resolution-markers = [
+    "python_full_version >= '3.14' and platform_machine == 'aarch64' and sys_platform == 'win32'",
+    "python_full_version >= '3.14' and platform_machine == 'aarch64' and sys_platform == 'emscripten'",
+    "python_full_version >= '3.14' and platform_machine == 'aarch64' and sys_platform != 'emscripten' and sys_platform != 'win32'",
+    "python_full_version == '3.13.*' and platform_machine == 'aarch64' and sys_platform == 'win32'",
+    "python_full_version == '3.13.*' and platform_machine == 'aarch64' and sys_platform == 'emscripten'",
+    "python_full_version == '3.13.*' and platform_machine == 'aarch64' and sys_platform != 'emscripten' and sys_platform != 'win32'",
+    "python_full_version < '3.13' and platform_machine == 'aarch64' and sys_platform == 'win32'",
+    "python_full_version < '3.13' and platform_machine == 'aarch64' and sys_platform == 'emscripten'",
+    "python_full_version < '3.13' and platform_machine == 'aarch64' and sys_platform != 'emscripten' and sys_platform != 'win32'",
+]
+dependencies = [
+    { name = "filelock", marker = "platform_machine == 'aarch64'" },
+    { name = "fsspec", marker = "platform_machine == 'aarch64'" },
+    { name = "jinja2", marker = "platform_machine == 'aarch64'" },
+    { name = "networkx", marker = "platform_machine == 'aarch64'" },
+    { name = "nvidia-cublas-cu12", version = "12.9.1.4", source = { registry = "https://pypi.org/simple" }, marker = "platform_machine == 'aarch64' and sys_platform == 'linux'" },
+    { name = "nvidia-cuda-cupti-cu12", version = "12.9.79", source = { registry = "https://pypi.org/simple" }, marker = "platform_machine == 'aarch64' and sys_platform == 'linux'" },
+    { name = "nvidia-cuda-nvrtc-cu12", version = "12.9.86", source = { registry = "https://pypi.org/simple" }, marker = "platform_machine == 'aarch64' and sys_platform == 'linux'" },
+    { name = "nvidia-cuda-runtime-cu12", version = "12.9.79", source = { registry = "https://pypi.org/simple" }, marker = "platform_machine == 'aarch64' and sys_platform == 'linux'" },
+    { name = "nvidia-cudnn-cu12", marker = "platform_machine == 'aarch64' and sys_platform == 'linux'" },
+    { name = "nvidia-cufft-cu12", version = "11.4.1.4", source = { registry = "https://pypi.org/simple" }, marker = "platform_machine == 'aarch64' and sys_platform == 'linux'" },
+    { name = "nvidia-cufile-cu12", version = "1.14.1.1", source = { registry = "https://pypi.org/simple" }, marker = "platform_machine == 'aarch64' and sys_platform == 'linux'" },
+    { name = "nvidia-curand-cu12", version = "10.3.10.19", source = { registry = "https://pypi.org/simple" }, marker = "platform_machine == 'aarch64' and sys_platform == 'linux'" },
+    { name = "nvidia-cusolver-cu12", version = "11.7.5.82", source = { registry = "https://pypi.org/simple" }, marker = "platform_machine == 'aarch64' and sys_platform == 'linux'" },
+    { name = "nvidia-cusparse-cu12", version = "12.5.10.65", source = { registry = "https://pypi.org/simple" }, marker = "platform_machine == 'aarch64' and sys_platform == 'linux'" },
+    { name = "nvidia-cusparselt-cu12", marker = "platform_machine == 'aarch64' and sys_platform == 'linux'" },
+    { name = "nvidia-nccl-cu12", marker = "platform_machine == 'aarch64' and sys_platform == 'linux'" },
+    { name = "nvidia-nvjitlink-cu12", version = "12.9.86", source = { registry = "https://pypi.org/simple" }, marker = "platform_machine == 'aarch64' and sys_platform == 'linux'" },
+    { name = "nvidia-nvshmem-cu12", marker = "platform_machine == 'aarch64' and sys_platform == 'linux'" },
+    { name = "nvidia-nvtx-cu12", version = "12.9.79", source = { registry = "https://pypi.org/simple" }, marker = "platform_machine == 'aarch64' and sys_platform == 'linux'" },
+    { name = "setuptools", marker = "platform_machine == 'aarch64'" },
+    { name = "sympy", marker = "platform_machine == 'aarch64'" },
+    { name = "triton", marker = "platform_machine == 'aarch64' and sys_platform == 'linux'" },
+    { name = "typing-extensions", marker = "platform_machine == 'aarch64'" },
+]
+wheels = [
+    { url = "https://download-r2.pytorch.org/whl/cu129/torch-2.9.1%2Bcu129-cp312-cp312-manylinux_2_28_aarch64.whl", hash = "sha256:c501c66fe5b0e2fc70f9d8a18e17a265f92ad1d1009dba03f5938d2f15a9066f", upload-time = "2026-01-26T17:26:29Z" },
+    { url = "https://download-r2.pytorch.org/whl/cu129/torch-2.9.1%2Bcu129-cp313-cp313-manylinux_2_28_aarch64.whl", hash = "sha256:ab44cf28e6ca2df679f0845fb4b950c81834431218840ca01c0a1583892a0986", upload-time = "2026-01-26T17:26:26Z" },
+    { url = "https://download-r2.pytorch.org/whl/cu129/torch-2.9.1%2Bcu129-cp313-cp313t-manylinux_2_28_aarch64.whl", hash = "sha256:794482180a4f2d92a960f470fcd47e066dbe2eeb27816880e618d3ce031805f7", upload-time = "2026-01-26T17:26:04Z" },
+    { url = "https://download-r2.pytorch.org/whl/cu129/torch-2.9.1%2Bcu129-cp314-cp314-manylinux_2_28_aarch64.whl", hash = "sha256:4559e1254e2c8e1a337758626d1cf33ca5a5ded3509fa012070334bf886b686b", upload-time = "2026-01-26T17:25:38Z" },
+    { url = "https://download-r2.pytorch.org/whl/cu129/torch-2.9.1%2Bcu129-cp314-cp314t-manylinux_2_28_aarch64.whl", hash = "sha256:cbe8955514ace826d3638a5d5dc1faa2f9dda1de4de74941d2e86b1a0859477c", upload-time = "2026-01-26T17:25:36Z" },
+]
+
 [[package]]
 name = "torch-c-dlpack-ext"
 version = "0.1.5"
 source = { registry = "https://pypi.org/simple" }
 dependencies = [
-    { name = "torch" },
+    { name = "torch", version = "2.9.1", source = { registry = "https://pypi.org/simple" }, marker = "platform_machine != 'aarch64'" },
+    { name = "torch", version = "2.9.1+cu129", source = { registry = "https://download.pytorch.org/whl/cu129" }, marker = "platform_machine == 'aarch64'" },
 ]
 sdist = { url = "https://files.pythonhosted.org/packages/37/de/921b6491efce5c389a5ef9bbed3d2d6660005840dae488124173180859ab/torch_c_dlpack_ext-0.1.5.tar.gz", hash = "sha256:d06f0357d575d22a168cc77acb9020fc4bae30968ceb6718a055dcbe92bacabe", size = 12913, upload-time = "2026-01-12T11:25:08.484Z" }
 wheels = [
@@ -3706,7 +4139,8 @@ name = "torchaudio"
 version = "2.9.1"
 source = { registry = "https://pypi.org/simple" }
 dependencies = [
-    { name = "torch" },
+    { name = "torch", version = "2.9.1", source = { registry = "https://pypi.org/simple" }, marker = "platform_machine != 'aarch64'" },
+    { name = "torch", version = "2.9.1+cu129", source = { registry = "https://download.pytorch.org/whl/cu129" }, marker = "platform_machine == 'aarch64'" },
 ]
 wheels = [
    { url = "https://files.pythonhosted.org/packages/f1/83/71cbadd7b66753818b5775f2088bad4f721d581de276996df4968000a626/torchaudio-2.9.1-cp312-cp312-macosx_11_0_arm64.whl", hash = "sha256:7581ef170794c599aed55918e00d0acd9e5c9a0f19400c9a9a840955180365c5", size = 808098, upload-time = "2025-11-12T15:26:01.408Z" },
@@ -3755,7 +4189,8 @@ dependencies = [
    { name = "numpy", version = "2.3.5", source = { registry = "https://pypi.org/simple" }, marker = "python_full_version < '3.13'" },
    { name = "numpy", version = "2.4.4", source = { registry = "https://pypi.org/simple" }, marker = "python_full_version >= '3.13'" },
    { name = "pillow" },
-    { name = "torch" },
+    { name = "torch", version = "2.9.1", source = { registry = "https://pypi.org/simple" }, marker = "platform_machine != 'aarch64'" },
+    { name = "torch", version = "2.9.1+cu129", source = { registry = "https://download.pytorch.org/whl/cu129" }, marker = "platform_machine == 'aarch64'" },
 ]
 wheels = [
    { url = "https://files.pythonhosted.org/packages/f0/af/18e2c6b9538a045f60718a0c5a058908ccb24f88fde8e6f0fc12d5ff7bd3/torchvision-0.24.1-cp312-cp312-macosx_11_0_arm64.whl", hash = "sha256:e48bf6a8ec95872eb45763f06499f87bd2fb246b9b96cb00aae260fda2f96193", size = 1891433, upload-time = "2025-11-12T15:25:03.232Z" },
@@ -3827,10 +4262,15 @@ name = "triton"
 version = "3.5.1"
 source = { registry = "https://pypi.org/simple" }
 wheels = [
+    { url = "https://files.pythonhosted.org/packages/db/53/2bcc46879910991f09c063eea07627baef2bc62fe725302ba8f46a2c1ae5/triton-3.5.1-cp312-cp312-manylinux_2_27_aarch64.manylinux_2_28_aarch64.whl", hash = "sha256:275a045b6ed670dd1bd005c3e6c2d61846c74c66f4512d6f33cc027b11de8fd4", size = 159940689, upload-time = "2025-11-11T17:51:55.938Z" },
    { url = "https://files.pythonhosted.org/packages/f2/50/9a8358d3ef58162c0a415d173cfb45b67de60176e1024f71fbc4d24c0b6d/triton-3.5.1-cp312-cp312-manylinux_2_27_x86_64.manylinux_2_28_x86_64.whl", hash = "sha256:d2c6b915a03888ab931a9fd3e55ba36785e1fe70cbea0b40c6ef93b20fc85232", size = 170470207, upload-time = "2025-11-11T17:41:00.253Z" },
+    { url = "https://files.pythonhosted.org/packages/f1/ba/805684a992ee32d486b7948d36aed2f5e3c643fc63883bf8bdca1c3f3980/triton-3.5.1-cp313-cp313-manylinux_2_27_aarch64.manylinux_2_28_aarch64.whl", hash = "sha256:56765ffe12c554cd560698398b8a268db1f616c120007bfd8829d27139abd24a", size = 159955460, upload-time = "2025-11-11T17:52:01.861Z" },
    { url = "https://files.pythonhosted.org/packages/27/46/8c3bbb5b0a19313f50edcaa363b599e5a1a5ac9683ead82b9b80fe497c8d/triton-3.5.1-cp313-cp313-manylinux_2_27_x86_64.manylinux_2_28_x86_64.whl", hash = "sha256:f3f4346b6ebbd4fad18773f5ba839114f4826037c9f2f34e0148894cd5dd3dba", size = 170470410, upload-time = "2025-11-11T17:41:06.319Z" },
+    { url = "https://files.pythonhosted.org/packages/84/1e/7df59baef41931e21159371c481c31a517ff4c2517343b62503d0cd2be99/triton-3.5.1-cp313-cp313t-manylinux_2_27_aarch64.manylinux_2_28_aarch64.whl", hash = "sha256:02c770856f5e407d24d28ddc66e33cf026e6f4d360dcb8b2fabe6ea1fc758621", size = 160072799, upload-time = "2025-11-11T17:52:07.293Z" },
    { url = "https://files.pythonhosted.org/packages/37/92/e97fcc6b2c27cdb87ce5ee063d77f8f26f19f06916aa680464c8104ef0f6/triton-3.5.1-cp313-cp313t-manylinux_2_27_x86_64.manylinux_2_28_x86_64.whl", hash = "sha256:0b4d2c70127fca6a23e247f9348b8adde979d2e7a20391bfbabaac6aebc7e6a8", size = 170579924, upload-time = "2025-11-11T17:41:12.455Z" },
+    { url = "https://files.pythonhosted.org/packages/14/f9/0430e879c1e63a1016cb843261528fd3187c872c3a9539132efc39514753/triton-3.5.1-cp314-cp314-manylinux_2_27_aarch64.manylinux_2_28_aarch64.whl", hash = "sha256:f617aa7925f9ea9968ec2e1adaf93e87864ff51549c8f04ce658f29bbdb71e2d", size = 159956163, upload-time = "2025-11-11T17:52:12.999Z" },
    { url = "https://files.pythonhosted.org/packages/a4/e6/c595c35e5c50c4bc56a7bac96493dad321e9e29b953b526bbbe20f9911d0/triton-3.5.1-cp314-cp314-manylinux_2_27_x86_64.manylinux_2_28_x86_64.whl", hash = "sha256:d0637b1efb1db599a8e9dc960d53ab6e4637db7d4ab6630a0974705d77b14b60", size = 170480488, upload-time = "2025-11-11T17:41:18.222Z" },
+    { url = "https://files.pythonhosted.org/packages/41/1e/63d367c576c75919e268e4fbc33c1cb33b6dc12bb85e8bfe531c2a8bd5d3/triton-3.5.1-cp314-cp314t-manylinux_2_27_aarch64.manylinux_2_28_aarch64.whl", hash = "sha256:8932391d7f93698dfe5bc9bead77c47a24f97329e9f20c10786bb230a9083f56", size = 160073620, upload-time = "2025-11-11T17:52:18.403Z" },
    { url = "https://files.pythonhosted.org/packages/16/b5/b0d3d8b901b6a04ca38df5e24c27e53afb15b93624d7fd7d658c7cd9352a/triton-3.5.1-cp314-cp314t-manylinux_2_27_x86_64.manylinux_2_28_x86_64.whl", hash = "sha256:bac7f7d959ad0f48c0e97d6643a1cc0fd5786fe61cb1f83b537c6b2d54776478", size = 170582192, upload-time = "2025-11-11T17:41:23.963Z" },
 ]

@@ -4029,7 +4469,8 @@ dependencies = [
    { name = "numpy", version = "2.3.5", source = { registry = "https://pypi.org/simple" }, marker = "python_full_version < '3.13'" },
    { name = "numpy", version = "2.4.4", source = { registry = "https://pypi.org/simple" }, marker = "python_full_version >= '3.13'" },
    { name = "pydantic" },
-    { name = "torch" },
+    { name = "torch", version = "2.9.1", source = { registry = "https://pypi.org/simple" }, marker = "platform_machine != 'aarch64'" },
+    { name = "torch", version = "2.9.1+cu129", source = { registry = "https://download.pytorch.org/whl/cu129" }, marker = "platform_machine == 'aarch64'" },
    { name = "transformers" },
    { name = "triton", marker = "platform_machine == 'x86_64' and sys_platform == 'linux'" },
    { name = "typing-extensions" },