docs: reframe PAPER_OUTLINE to GPU-hit-first + embed v2 figures

Reorganizes the outline from the EAR / dispatch-coupling framing (kept in git
history) into the GPU-hit-first structure:

- §1 background splits PD-colo / PD-disagg / KV storage hierarchy, each with a
  forward pointer to where it is used or refuted.
- §2 leads with the metric argument (request latency / TPS / GPU util, not
  TTFT/TPOT); dispatch coupling is demoted to that justification. §2.2 embeds
  the two new v2 figures -- the measured 4-tier hit hierarchy (GPU < CPU-local <
  remote-RDMA-store << miss) and the capacity->APC/latency knee (Evidence #1) --
  plus the cluster-scale correction to the working_set "14 nodes" number.
- §3 recasts the three optimizations as corollaries of GPU-hit-first: make
  PD-colocation default (3.1), biased KV-awareness routing (3.2), dedup via
  migration not replication (3.3).
- §5 related work now engages the storage-hierarchy camp directly.
- Validation-status table and work plan updated (top priority: wall-clock
  amplification sweep).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-05-30 13:34:19 +08:00
parent dc8e6dd5a8
commit d9cf3126c6
1 changed files with 275 additions and 336 deletions
--- a/PAPER_OUTLINE.md
+++ b/PAPER_OUTLINE.md
@@ -1,423 +1,362 @@
-# EAR: Elastic Affinity Routing for Agentic LLM Serving
+# GPU-Hit-First: Serving Agentic LLM Workloads by Keeping the Working Set in HBM

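-> **One-liner**: In agentic LLM workloads, 93% of KV reuse is intra-session, and the tool-call feedback coupling between turns amplifies per-request latency differences into throughput gaps, so locality becomes the dominant scheduling lever. Existing load balancing loses locality, static PD-disagg hits the D-side KV wall, and pure session-sticky routing creates hot pins. We propose **EAR (Elastic Affinity Router)**, a scheduler that combines session-affinity routing with session migration triggered by hot instances, so that a single design achieves both locality and balance.
+> **Thesis (one-liner)**: For agentic LLM workloads, the end-to-end metrics that users actually perceive are
+> **request latency / TPS / GPU utilization**, and they are dominated by one thing: **whether KV cache hits
+> land in GPU HBM**. 93% of agentic KV reuse is intra-session, and the active working set is small enough to
+> stay resident in a single node's HBM; the hit tiers rank `GPU ≫ CPU-local > remote-RDMA-store ≫ recompute`
+> (best to worst), and the cost gap between them widens with context length. This yields one unifying
+> principle, **GPU-hit-first**: keep the active working set in HBM rather than building a deep CPU/storage
+> hierarchy to chase the long tail. Three corollaries each fix one mismatch in existing systems: (3.1) make
+> PD-colocation the default again; (3.2) apply biased KV-cache-awareness in global routing; (3.3) deduplicate
+> GPU KV across instances with KV migration instead of replication.
+
+> **Framing note (2026-05-30)**: This outline replaces the earlier "EAR: Elastic Affinity Routing" version
+> (kept in git history). EAR's dispatch-coupling formalization is **demoted to the metric argument in §2**
+> (it explains why the metric is request latency rather than TTFT/TPOT) and is no longer the headline. The
+> headline is now the GPU-hit-first principle, with affinity routing and migration as two of its corollaries
+> (§3.2 / §3.3).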

 ---

-## 📊 Validation Status (2026-05-27)
+## 📊 Validation Status (2026-05-30)

-| Section | Existing data | To do |
-|---|---|---|
-| §2 Workload characterization | ✅ Complete (3 figures reused) | none |
-| §3.1 Load balancing loses locality | ✅ Complete (`f4a`) | none |
-| §3.2 Static PD-disagg hits the KV wall | ✅ Complete (`f4b`) | none |
-| §3.3 Sticky creates hot pins | ✅ Complete (`f4c`, `f4d`) | none |
-| §4.1-2 Affinity routing | ✅ Implemented (current `unified` algorithm) | none |
-| `kv_both` substrate cost | ✅ **VERIFIED net-positive** (2026-05-27, commit `ef9e010`) | TTFT p90 −18.6% w/o DR-fix, −36.6% w/ DR-fix |
-| §4.3 Migration mechanism (e2e) | 🚧 **PARTIAL** | Substrate works; e2e trigger + target-selection experiments not yet run |
-| §5.2 End-to-end | ⚠️ 5/6 baselines have data (`f6`) | Static PD-disagg missing; EAR column awaits migration |
-| §5.3 Ablation | 🚧 **PARTIAL DEFER** | Only affinity-only is feasible now; full awaits migration |
-| §5.4 Dispatch coupling validation | 🚧 **NEW DATA NEEDED** | Wall-clock rerun of 5 baselines (after the Phase 1 patch) |
-| §5.5 Sensitivity | 🚧 **PARTIAL DEFER** | λ/skew/KV pool feasible now; `T_hot`/`T_cool` await migration |
-| §5.6 Migration microbench | 🚧 **FULL DEFER** | Depends entirely on migration validation |
-
-**Background**: the team's four previous migration attempts were all reverted because of transfer overhead (see `analysis/unified_routing_fix_review.md`). The 2026-05-27 trace-replay A/B/C (`microbench/connector_tax/cache_sweep/REPORT_TRACE_REPLAY.md`) shows that the `kv_both` substrate has flipped: not only is the +45% penalty obsolete, the substrate itself is net positive (TTFT p90 −18.6% vs plain, −36.6% after the DR-fix). **The main root cause of the four earlier migration reverts is gone, but the e2e migration policy layer (the real benefit of trigger + target selection inside the feedback loop) has still not been validated directly. EAR's migration experiments no longer carry substrate risk, only policy-layer risk.**
+| Section | Claim | Evidence | Status |
+|---|---|---|---|
+| §1 | Background: PD-colo / PD-disagg / KV storage hierarchy | none | Writing |
+| §2 metric | Request latency rather than TTFT/TPOT; TPS; GPU util | dispatch coupling + amplification; `bench_report.py` | 🟡 Argument complete, wall-clock sweep pending |
+| §2.1 | KV hits are common and critical | `f2a` intra 93.2%, APC upper bound 79.6%, C1/C2 | ✅ |
+| **§2.2** | **GPU hit > CPU hit > RDMA-store hit ≫ miss** | **`v2/figs/exp_a_tier_latency.png` (four tiers, measured)** | ✅ **NEW** |
+| **§2.2 Ev#1** | **GPU is enough to keep the "valuable" working set resident** | **`v2/figs/exp_b_capacity_knee.png` (knee) + working_set + cluster-scale correction** | ✅ **NEW** |
+| §3.1 | Make PD-colocation great again | `PD_DISAGG_RESULTS`, `crossover_pd_advantage`, MB1/MB2, C2/C3 | ✅ |
+| §3.2 | Biased KV-aware global routing | LPWL (TTFT p90 −31%), LMetric multiplicative dilution, sticky hot-pin, ES ablation | ✅ |
+| §3.3 | GPU dedup via migration not replication | substrate net-positive (−18.6%/−36.6%); correctness smoke tests | 🟡 Substrate works, policy e2e not yet validated |
+| §4 | End-to-end eval of the integrated system | Scattered across mb1/mb2/mb5/crossover/lpwl, needs unifying | 🚧 |
+| §5 | Related work (including a direct response to the storage-hierarchy camp) | none | Writing |

 ---

-## §1 Introduction
+## §1 Background and System Setup

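-Agentic LLM workloads, in which an LLM drives itself through tool calls over multiple turns to complete a task, have become the dominant load on inference systems, yet scheduling policies designed for chatbots broadly fail under them. This paper first characterizes how agentic workloads differ from chatbot workloads, then explains why the three mainstream classes of scheduler fall short, and finally presents the EAR design.
+### §1.1 LLMs and the KV cache

-**Contributions**:
+Transformer autoregressive inference has two phases: **prefill** (computes the KV for the whole prompt in one
+pass; compute-bound) and **decode** (generates token by token; memory-bandwidth-bound). Each token's KV has to
+stay resident in GPU HBM for later attention steps to reuse it. Prefix caching (APC) lets an identical prompt
+prefix hit already-computed KV and skip the repeated prefill. This is the physical basis of every optimization in this paper.

- **C1 Dispatch coupling argument**: We formalize a feedback loop unique to agentic workloads: per-turn service time affects the number of concurrent sessions through an implicit Little's Law equation, which amplifies per-request latency differences into throughput gaps. Measured: the load-balance baseline shows **8x** wall-clock amplification on a 600s trace; EAR shows **TBDx**.
- **C2 EAR design**: A two-pillar scheduler. Affinity-default routing captures intra-session locality; session migration triggered by hot instances moves a session's entire KV to a lighter instance when a hotspot appears, avoiding hot pins.
- **C3 Evaluation**: On a real Qwen3-Coder agentic trace, EAR dominates 5 baselines on all five dimensions: TTFT, TPOT, APC, worst-worker p90, and wall-clock.
+> Qwen3-Coder-30B-A3B (GQA, 48 layers, 4 KV heads, head_dim 128, bf16): **KV = 96 KiB/token**, so 1 GiB holds
+> 10,923 tokens and one block (16 tokens) is 1.573 MB.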
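
The per-token figure above follows from the model configuration. A minimal sketch of the arithmetic, assuming the usual KV layout (one K and one V vector per layer per KV head) and the config values quoted in the note; the function and constant names are illustrative only:

```python
def kv_bytes_per_token(layers: int, kv_heads: int, head_dim: int, dtype_bytes: int) -> int:
    """Bytes of KV cache per token: one K and one V vector per layer per KV head."""
    return 2 * layers * kv_heads * head_dim * dtype_bytes


# Config quoted in the note: 48 layers, 4 KV heads (GQA), head_dim 128, bf16 (2 bytes).
PER_TOKEN = kv_bytes_per_token(layers=48, kv_heads=4, head_dim=128, dtype_bytes=2)
TOKENS_PER_GIB = round(2**30 / PER_TOKEN)
BLOCK_BYTES = 16 * PER_TOKEN  # one 16-token block

if __name__ == "__main__":
    print(f"{PER_TOKEN} B/token = {PER_TOKEN // 1024} KiB/token")
    print(f"{TOKENS_PER_GIB:,} tokens per GiB")
    print(f"{BLOCK_BYTES / 1e6:.3f} MB per 16-token block")
```

Note that the block size is quoted in decimal megabytes (1.573 MB), which is the same quantity as 1.5 MiB.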

-**Figure 1: Teaser — wall-clock vs trace-time across schedulers** — `figs/f1_teaser.png` **🚧 TBD (NEW DATA NEEDED)**
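-> Needs Phase 3 measurements: 5 baselines × 3 runs of trace replay, extract `amplification = wall_clock_s / trace_span_s` from each summary (Phase 1 patch already exposes the field). Plot as bar chart with y=1 reference line. The EAR row is TBD for now (pending migration validation).
+### §1.2 Agentic workflow
+
+An agentic workload is an LLM driving itself through tool calls over multiple turns to complete a task. The
+essential differences from a chatbot: each turn is triggered by the previous turn's tool-call result (no human
+think-time), and the workload is prefill-dominated (input/output ≈ 75×).
+
+**But it is a mixture, not "all multi-turn"** (C1, `figs/workload_chars/c1_session_mixture.png`):
+
+- **90.3%** of sessions are single-turn (mean 1.62 turns), yet multi-turn sessions (9.7%) account for **44.2%
+  of requests** and **66.9% of prefill mass**.
+- Continuation hazard (Lindy effect): only **10.2%** of sessions continue from turn 1 to turn 2, but 87% continue
+  from turn 5 to 6 and **94.3%** from turn 12 to 13. Heaviness is almost unpredictable at cold start
+  (corr(turn1_input, n_turns) = **0.04**).
+
+> **Routing implication (runs through §3.2)**: heaviness is unpredictable at session start, so routing must be
+> **reactive** (observe accumulated load) rather than proactively predictive. The optimal policies for the
+> single-turn ocean and the deep tail are opposite (load-balance for the former, affinity-pin for the latter),
+> and turn 1 cannot tell them apart, so the only workable policy is "everyone starts load-balanced and becomes
+> sticky as turns accumulate", which is exactly LPWL's emergent behavior.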
+
+### §1.3 Serving agents in the wild
+
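+- **PD colocation (8C)**: every instance is symmetric; prefill and decode interleave on the same GPU via
+  chunked prefill + continuous batching. The KV pool is elastic, with no static partition.
+- **PD disaggregation**: instances are statically split into a prefill pool (P) and a decode pool (D),
+  physically isolating the two phases (DistServe / Splitwise).
+- **KV cache storage hierarchy**: GPU HBM → CPU DRAM → remote pooled store (RDMA/SSD, e.g. Mooncake Store /
+  LMCache). Evicted or cross-instance KV is pushed down to slower but larger tiers, trading transfer for
+  recomputation.
+
+> Each of the three is answered by one section of this paper: PD-colo is "revived" as the default in §3.1;
+> PD-disagg is refuted in §3.1 (for the agentic regime); the storage hierarchy is quantitatively "bounded" in
+> §2.2 (GPU hits far outperform the lower tiers, and the active working set fits anyway).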

 ---

-## §2 Background and Workload Characterization
+## §2 GPU memory hit is the key to serving agents

-### §2.1 Agentic Workload Primer
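+### §2.0 The right metrics: request latency / TPS / GPU utilization (not TTFT/TPOT)

-Three essential differences between agentic and chatbot workloads:
+**Why not per-request TTFT/TPOT**: agentic turns are linked by a feedback loop, so per-turn latency
+**compounds** across turns into gaps in session end-to-end time and system throughput. Only **request/session
+latency, tokens per second, and GPU utilization** capture this.

- **Multi-turn, programmatic continuation**: each turn is triggered by the previous turn's tool-call result, with no human think-time
- **Prefill-dominated**: input/output token ratio **75x**, with 98% of compute in the prefill phase (1-10x for chatbots)
- **Skewed sessions** (from the Qwen3 production trace, n=1.3M sessions / 2.1M requests / 7200s): the top 1% contribute **46.5%** of input tokens, top 5% **66.5%**, top 10% **74.6%**, top 25% **87.5%**, top 50% **96.0%**, so half of the sessions account for nearly all input mass
+**Dispatch coupling (demoted to the argument of this section)**. Between turns there is an external gap
+`T_external` (for a chatbot, a human reading/thinking/typing; for an agent, tool execution). Little's Law gives
+`L = Λ · N · (W_turn(L) + T_external)`, where `L` is the number of concurrent sessions, `Λ` the session arrival
+rate, `N` the turns per session, and `W_turn(L)` the per-turn service time. When `W_turn ≳ T_external`,
+`W_turn(L)` couples to the concurrency `L` through KV contention, and an ε regression in the scheduler is
+amplified by the feedback loop into a several-fold wall-clock gap.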
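
To make the coupling concrete, here is a toy fixed-point iteration of the Little's Law relation. It is a sketch under stated assumptions, not a model of the real system: `W_turn(L)` is assumed linear (`w0 + k·L`), and every numeric value (arrival rate, turn count, `w0`, `k`) is hypothetical.

```python
def fixed_point_concurrency(arrival_rate, turns, t_external, w0, k,
                            eps=0.0, iters=10_000, tol=1e-12):
    """Iterate L <- Lambda*N*((1+eps)*W(L) + T_external) with a toy W(L) = w0 + k*L.

    eps scales the per-turn service time, standing in for a scheduler regression.
    Raises when the loop gain Lambda*N*W'(L) reaches 1 (no stable fixed point).
    """
    gain = arrival_rate * turns * (1.0 + eps) * k
    if gain >= 1.0:
        raise ValueError("loop gain >= 1: no stable fixed point (saturation)")
    concurrency = 0.0
    for _ in range(iters):
        w_turn = (1.0 + eps) * (w0 + k * concurrency)
        nxt = arrival_rate * turns * (w_turn + t_external)
        if abs(nxt - concurrency) < tol:
            return nxt
        concurrency = nxt
    return concurrency


if __name__ == "__main__":
    # Hypothetical numbers: 0.2 sessions/s, 20 turns, 0.5 s tool-call gap, 2 s base service time.
    for k in (0.0, 0.2):  # k = 0 means W_turn does not depend on concurrency (open loop)
        base = fixed_point_concurrency(0.2, 20, 0.5, w0=2.0, k=k)
        slow = fixed_point_concurrency(0.2, 20, 0.5, w0=2.0, k=k, eps=0.1)
        print(f"k={k}: L* {base:.1f} -> {slow:.1f} after a 10% per-turn slowdown")
```

With these toy numbers, a 10% per-turn slowdown moves the fixed point from 10 to 10.8 concurrent sessions when `W_turn` is independent of `L`, but from 50 to 90 when it is coupled, which is the amplification the paragraph describes.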

-The average session has TBD turns and TBD input tokens. Per-request KV footprint (Qwen3-Coder-30B-A3B, 98304 B/token): p50 **1.8 GiB**, p90 **8.0 GiB**, p95 **9.6 GiB**, p99 **11.5 GiB**. The per-instance KV pool is ≈ 0.4 × 96 GiB = **38.4 GiB** (the rest is 50% bf16 model params + 10% runtime activations), so one instance fits only **3 concurrent decodes** of a p99 request; switching to 4P+4D PD-disagg halves the system's decode capacity (system concurrency 24 → 12).
+`T_external` CDF measured on the production trace (`figs/f3a_inter_turn_gap.png`):

-### §2.2 KV Cache Reuse Topology
-
-Breakdown of KV reuse on the trace:
-
-| Class | Share |
-|---|---|
-| Intra-session | **93.2%** |
-| Cross-session | 5.7% |
-| Shared prefix | 1.1% |
-
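-Theoretical APC upper bound: any-session **80.3%**, intra-session-only **79.6%**, a gap of <1pp. **The cache is essentially session-local**; any scheduler that does not preserve session affinity gives up the vast majority of reuse opportunities.
-
-**Figure 2: Workload characterization (3 panels)**, existing data can be reused: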
-
-![F2a Reuse topology — intra 93.2% / cross 5.7% / shared 1.1%](figs/f2a_reuse_topology.png)
-
-![F2b Session input-token mass CDF — production trace top 1%/5%/10%/25%/50% = 46.5%/66.5%/74.6%/87.5%/96.0% (replay window overlaid for sanity)](figs/f2b_session_skew.png)
-
-![F2c Per-instance decode concurrency vs deployment (KV pool 38.4 GiB; p99 req fits only 3/inst; PD-disagg halves system decode capacity)](figs/f2c_kv_footprint_cdf.png)
-
-> 📝 Layout TBD: either combine the three into one 1×3 panel or spread them across §2.1/§2.2/§2.4, one each.
-
-### §2.3 Dispatch Coupling — Why Locality Dominates
-
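-This is the most intuition-dependent argument in the paper, so it gets its own section.
-
-**Intuition**. Between turns there is an **external gap** `T_external` (for a chatbot, a human reading + thinking + typing; for an agent, tool execution). The next turn arrives after `T_external`. Little's Law: `L = Λ · N · (W_turn(L) + T_external)`. Whether the system can avoid the feedback loop depends on whether `W_turn` is **much smaller than** `T_external`:
- If `W_turn ≪ T_external`: session residence time is dominated by `T_external`, small changes in scheduling speed barely move `L`, and the system is in the **open-loop regime**;
- If `W_turn ≳ T_external`: the `W_turn(L)` term couples to L through KV contention, Little's Law becomes an implicit equation in L (the **closed-loop regime**), and an ε regression in the scheduler is amplified by the feedback loop into a several-fold increase in L*.
-
-**Agentic and chatbot workloads differ not by a binary distinction but in the distribution of `T_external`**. Below is the CDF of `T_external = next.start − prev.end` measured on production traces (agentic = Qwen3-Coder, n=783k inter-turn gaps; chatbot = qwen3-max chat, n=42k gaps):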
-
-![F3a Inter-turn external gap CDF — agentic vs chatbot](figs/f3a_inter_turn_gap.png)
-
-| Metric | Agentic | Chatbot |
+| | Agentic | Chatbot |
 |---|---:|---:|
-| p25 | 0.69s | 4.85s |
-| **p50** | **1.6s** | **7.2s** |
-| p90 | 44s | 15s |
-| p99 | 738s | 43s |
+| p50 | **1.6 s** | 7.2 s |
 | gap < 1 s | **39%** | 4% |
 | gap < 5 s | 67% | 29% |
+| p99 | 738 s | 43 s |

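-The two distributions have completely different shapes:
- **Chatbot is unimodal**, tightly concentrated at 5-15s (the rhythm of human interaction);
- **Agentic is bimodal**: 39% of gaps are < 1s (autonomous tool-call mode; only 4% for chatbot) plus 13% of gaps are > 30s (session paused/abandoned; only 2% for chatbot).
+Agentic traffic has a **sub-second tool-call mass (39% vs 4%)** that chatbot traffic lacks, which puts it almost
+by construction in `W_turn ≫ T_external`, i.e. the closed-loop regime. **Measured**: lmetric takes 49 min of
+wall-clock to run a 600s trace = **8× amplification**. **Conclusion: small per-turn latency differences are
+amplified into order-of-magnitude end-to-end differences, so serving must be measured by request latency / TPS / GPU util.**

-**The danger in agentic workloads comes from the sub-second tool-call mode**: for this 39% of turns, `W_turn ≫ T_external` almost by construction, so dispatch coupling is necessarily active; chatbot traffic lacks this mass and only enters the closed loop if W_turn is pushed very high.
+- Tooling for **TPS / GPU util**: `microbench/fresh_setup/bench_report.py` (all percentiles of TTFT/TPOT/E2E + TPS +
+  per-worker GPU util). GPU util of ~0% during a PD stall vs 34% for colo (§3.1) is a direct example of GPU util serving as a metric.

-**Measured regime comparison**:
+> **🚧 To do**: wall-clock amplification sweep over 5 baselines × ≥3 runs (the replayer already emits the
+> `amplification` field) to pin down the empirical closure of this section. High priority.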

-| Scheduler | TTFT p90 | Agentic frac(W_turn > T_ext) | Chatbot frac(W_turn > T_ext) |
-|---|---:|---:|---:|
-| unified | 7.3s | **73%** | 32% |
-| lmetric | 15.7s | **80%** | 88% |
+### §2.1 KV$ hit is common and critical

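-unified pushes 73% of turns into the closed loop on agentic but only 32% on chatbot. lmetric reaches 80% on agentic and **88% even on chatbot**: lmetric's W_turn is large enough to push the chatbot workload itself into the closed loop, which is one direct root cause of lmetric underperforming on both workloads.
+Breakdown of KV reuse on the trace (`figs/f2a_reuse_topology.png`): **intra-session 93.2% / cross-session 5.7% /
+shared-prefix 1.1%**. Theoretical APC upper bound: intra-only **79.6%** vs any-session 80.3%, a gap of <1pp, so
+**the cache is essentially session-local**.

-**Concrete example**: a coding agent runs a 20-turn task; assume `T_external` is in the sub-second mode (tool call of 0.5s).
+Per-turn view (C2, `figs/workload_chars/c2_work_amortization.png`): resident context grows from 11k to 56k+
+tokens while new prefill collapses from 2.7k to ~200 tokens; per-turn reuse climbs to **99.6%**, and resident/new
+(the "PD tax") reaches ≈ 250× by turn 12 and ≈ 450× by turn 30. **The vast majority of prefill work can be skipped by a cache hit**; whether the hit happens directly determines TTFT.

- Fast policy: `W_turn` = 2s, 2.5s per turn in total, 50s per session, 10 concurrent sessions on average
- Slow policy (linear estimate): `W_turn` = 3s, 3.5s per turn, 70s per session, which implies 14 concurrent sessions
- Slow policy (actual): 14 concurrent sessions tighten the KV pool → `W_turn` rises to 4s → session takes 90s → 18 concurrent → `W_turn` 5s ... the feedback loop amplifies until it hits the wall or settles at a much worse fixed point
+### §2.2 Hits on the GPU matter more than hits on the CPU

-**Formalization**. Let `Λ` = session arrival rate, `N` = turns per session, and `W_turn(L)` = per-turn service time as an increasing function of the concurrency L (more concurrency means fiercer KV contention and a larger `W_turn`). Little's Law:
+**The cost of each hit tier is measured, not asserted** (Qwen3-Coder-30B-A3B / H20). TTFT (s, p50) for serving a
+reused prefix of length L from each KV tier: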

-```
-L = Λ · N · (W_turn(L) + T_external)
-```
+![Four-tier hit latency: GPU < CPU-local < remote-RDMA-store ≪ miss, with the gap widening as context grows](v2/figs/exp_a_tier_latency.png)

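-Suppose a policy change scales `W_turn` by a factor `(1+ε)` overall. Differentiating the fixed-point equation `L = Λ · N · ((1+ε) · W_turn(L) + T_external)` at ε = 0 and substituting `Λ · N = L* / (W_turn(L*) + T_external)` in the numerator gives the sensitivity of the fixed point L\*:

-```
-dL*/dε = [L* · W_turn(L*) / (W_turn(L*) + T_external)] / [1 − Λ · N · W'_turn(L*)]
-```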
+| prefix L | miss(recompute) | **remote RDMA store** | CPU-local(DRAM,PCIe) | GPU(HBM) | miss/RDMA | RDMA/CPU | CPU/GPU |
+|---:|---:|---:|---:|---:|---:|---:|---:|
+| 8k  | 0.588 | 0.151 | 0.076 | 0.053 | 3.9× | 2.0× | 1.5× |
+| 16k | 1.547 | 0.262 | 0.105 | 0.063 | 5.9× | 2.5× | 1.7× |
+| 32k | 4.604 | 0.680 | 0.158 | 0.080 | 6.8× | 4.3× | 2.0× |
+| **64k** | **15.23** | **0.97** | **0.27** | **0.11** | **15.8×** | **3.6×** | **2.4×** |

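-Two things to note:
-1. Numerator ∝ `W_turn / (W_turn + T_external)`: when `T_external ≫ W_turn` the sensitivity → 0 (open loop); as `T_external → 0` the sensitivity approaches its upper bound (closed loop). **So the agentic sub-second tool-call mass pushes the sensitivity to its upper bound, while the chatbot 5-15s mass pushes it down.**
-2. Denominator `... − Λ · N · W'_turn(L*)`: it approaches 0 near KV saturation, so **any scheduling regression near saturation is amplified without bound**. This is the root cause of lmetric's 8x wall-clock on the 600s trace.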
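+- **GPU hit is ~flat** (42→111 ms over 1k→64k): on a hit the whole prefix is already in HBM and only the last
+  token is recomputed.
+- **CPU-local hit** is transfer-bound (PCIe H2D measured at ~54 GB/s); CPU-hit ≈ GPU-hit + KV/PCIe + ~0.15s
+  overhead. (Native KV offload; hits verified at 100% via `vllm:external_prefix_cache_hits`.)
+- **Remote RDMA-store hit** = the Mooncake-Store mechanism (measured with two instances: B uses
+  `do_remote_prefill` to pull the cached prefix from A over RDMA instead of recomputing it;
+  `mb2_kv_transfer.py` / `v2/.../run_rdma.sh`). It is a big win over recompute (**up to 16×**, in the same
+  direction as the blog's 46×) but pays a **NIC tax** (effective ~5-7 GB/s, cf. MB2 raw ~9.7 GB/s; multi-NIC
+  pooling can raise it), so it is 3.6× slower than CPU-local and ~9× slower than GPU (at 64k), and **the cost
+  gap widens with context**.
+- **Conclusion: the hierarchy is strict and widens with context: `GPU < CPU-local < remote-RDMA-store ≪ miss`**.
+  A global KV store is genuinely useful (which is why that line of work exists), but each tier closer to the
+  GPU saves another 1.4-4× in TTFT. **The most valuable reuse is the GPU-resident kind.**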
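
The "transfer-bound" claim for the CPU-local tier can be sanity-checked from numbers already quoted: 96 KiB of KV per token and ~54 GB/s of measured PCIe H2D bandwidth. The sketch below compares the implied copy time with the measured CPU-minus-GPU gap from the table. It assumes that "8k" means 8 × 1024 tokens and that the whole prefix is copied in a single pass; both are assumptions, as are the helper names.

```python
KV_BYTES_PER_TOKEN = 96 * 1024   # 96 KiB/token (from §1.1)
PCIE_H2D_BYTES_PER_S = 54e9      # ~54 GB/s measured H2D bandwidth

# TTFT p50 in seconds copied from the table: prefix tokens -> (CPU-local hit, GPU hit).
# Assumption: "8k" means 8 * 1024 tokens.
MEASURED = {
    8 * 1024: (0.076, 0.053),
    16 * 1024: (0.105, 0.063),
    32 * 1024: (0.158, 0.080),
    64 * 1024: (0.27, 0.11),
}


def pcie_transfer_seconds(prefix_tokens: int) -> float:
    """Time to copy a prefix's KV from host DRAM to HBM at the measured PCIe rate."""
    return prefix_tokens * KV_BYTES_PER_TOKEN / PCIE_H2D_BYTES_PER_S


if __name__ == "__main__":
    for tokens, (cpu_hit, gpu_hit) in MEASURED.items():
        gap = cpu_hit - gpu_hit
        copy = pcie_transfer_seconds(tokens)
        print(f"{tokens // 1024:>3}k: copy {copy * 1e3:6.1f} ms, "
              f"measured CPU-GPU gap {gap * 1e3:6.1f} ms ({copy / gap:.0%} of gap)")
```

Under these assumptions the raw copy accounts for most of the gap at every prefix length, which is what a transfer-bound tier should look like.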

-**Figure 3: Dispatch coupling schematic** — `figs/f3_coupling_schematic.png` **🚧 TBD (CUSTOM DRAW)**
-> A new schematic needs to be drawn: the left half compares timelines (chatbot: `system → T_external (5-15s) → system`; agentic: `system → T_external (sub-second to long-tail) → system`); the right half shows the feedback loop `W_turn → L → W_turn`, annotated with the regime condition `W_turn ≷ T_external`.
+#### Evidence #1: GPU is sufficient to hold most KV requests

-### §2.4 Takeaway
+**Realized APC and latency saturate at a very small GPU capacity** (closed-loop multi-turn workload, concurrency 4, sweeping GPU KV capacity):

-Three properties, namely dominant intra-session locality (§2.2), long context + prefill-heavy requests (§2.1), and dispatch coupling (§2.3), together dictate that scheduling for agentic workloads must be **locality-first** and must tolerate the load imbalance across instances that skew causes.
+![Capacity knee: APC and TTFT p90 saturate at 3.6 GB (= the active working set) and stay dead-flat afterwards](v2/figs/exp_b_capacity_knee.png)
+
+| GPU KV (GB) | realized APC | TTFT p90 |
+|---:|---:|---:|
+| 1.2 | 7.4% | 13.00 s |
+| 2.4 | 36.3% | 4.62 s |
+| **3.6** | **80.3%** | **0.53 s** |
+| 9.7 | 72.9% | 0.65 s |
+| 14.5| 72.9% | 0.65 s |
+
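+**The knee sits at 3.6 GB, which is exactly the active working set (4 sessions × 0.91 GB)**: APC saturates at its upper bound and TTFT p90 collapses
+from 13.0s to 0.53s, then stays dead-flat. **HBM beyond the working set buys no extra benefit; by the same argument, a CPU/storage tier built to chase the long tail buys ≈ 0.**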
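
The knee location can be predicted from concurrency alone. A small sketch, assuming every concurrent session holds the same 0.91 GB of KV (the per-session figure quoted above) and using a small slack to absorb the rounding between 4 × 0.91 = 3.64 GB and the 3.6 GB sweep point; the slack value and function names are assumptions made for illustration:

```python
# (GPU KV capacity in GB, realized APC in %, TTFT p90 in s), copied from the sweep table.
SWEEP = [(1.2, 7.4, 13.00), (2.4, 36.3, 4.62), (3.6, 80.3, 0.53),
         (9.7, 72.9, 0.65), (14.5, 72.9, 0.65)]


def working_set_gb(sessions: int, per_session_gb: float = 0.91) -> float:
    """Active working set, assuming each concurrent session holds the same KV."""
    return sessions * per_session_gb


def predicted_knee(capacities, ws_gb, rounding_slack=0.05):
    """Smallest swept capacity that covers the working set (None if none does)."""
    covering = [c for c in sorted(capacities) if c + rounding_slack >= ws_gb]
    return covering[0] if covering else None


if __name__ == "__main__":
    capacities = [cap for cap, _, _ in SWEEP]
    for sessions in (4, 8):
        ws = working_set_gb(sessions)
        print(f"{sessions} sessions: working set {ws:.2f} GB, "
              f"predicted knee at {predicted_knee(capacities, ws)} GB")
```

For concurrency 4 this picks the 3.6 GB point, matching where TTFT p90 collapses in the table; doubling the concurrency moves the prediction to the next swept capacity, in line with the linear scaling argued in this section.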
+
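+**Cluster-scale correction (key)**: the working_set analysis (`analysis/working_set/`, `figs/working_set/`) shows
+that "holding the entire reuse tail of the 2h cluster trace takes ~14 nodes". **This does not motivate CPU offload**: that figure sizes a single replica to hold the reuse of the **whole
+cluster**, and the real cluster that produced the trace is far larger than 14 nodes (the trace is a cluster aggregate, see
+`project-trace-is-cluster-level`). Given the cluster's total HBM, the active working set is already GPU-resident (live KV
+533-1157 GB ≪ 1528 GB on a single node). **The knee position grows linearly with concurrency, i.e. with the cluster's GPU count**, and the cluster already provides that capacity.
+
+> ⚠ **Scope**: "fits" in this subsection refers to **the recent, high-value reuse generated by the active working set**, not "the entire reuse tail". Deep-tail
+> hits from cold sessions returning after a long gap (low value per byte and costly to fetch) are exactly what the storage-hierarchy camp pursues; our claim is that
+> under agentic workloads this tail does not justify building a deep hierarchy. §5 responds to that camp directly.
+
+### §2.3 Takeaway
+
+The right metrics (§2.0) + hits are cheap only when they land on the GPU (§2.2) + the active working set fits in HBM (Ev#1) ⇒ **GPU-hit-first**:
+the design goal is to maximize **GPU residency and hits** for the active working set rather than to build a deep CPU/storage hierarchy. §3 derives three corollaries.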

 ---

-## §3 Why Existing Schedulers Don't Fit
+## §3 Optimizing agent serving with the GPU-Hit-First Principle

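-Each of the three existing classes of scheduler runs into one of the three properties from §2:
+### §3.1 Make PD-colocation great again

-### §3.1 Load-balanced routing loses locality
+Static PD-disaggregation works for chatbots but fails structurally for agentic workloads; colocation should be the default.

-Round-robin and load-aware routing (e.g. LMetric, OSDI'26) maximize instance utilization but ignore session affinity. **Measured APC drops to 56.9%** (vs the 79.6% upper bound); the 23pp gap comes directly from lost intra-session cache hits. This violates §2.2.
+**End-to-end evidence** (`microbench/fresh_setup/PD_DISAGG_RESULTS.md`, 8×H20, trace replay): **no static P/D ratio beats
+8-way colocation (8C)**, and the failure mode shifts with the ratio:

-**Why "cache-aware load routing" is not enough either: LMetric's cache signal is diluted by its multiplicative score**. LMetric scores with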
+| Metric | **8C** | 6P+2D | 4P+4D | 2P+6D |
+|---|---:|---:|---:|---:|
+| completion | **100%** | 100% | 100% | **9%** 💀 |
+| wall-clock | **2994 s** | 3419 | 4171 | 5762 |
+| prefix-cache hit | **19.4%** | 0% | 0% | 0% |
+| TTFT p50 | **7.0 s** | 41.0 | 56.4 | 23.6 |
+| E2E p90 | **83.3 s** | 91.8 | 157.1 | 499 |

-```
-P = pending_prefill_tokens + (input_length - cache_hit)
-score = P × num_requests
-```
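+- **D-side bottleneck (4P+4D)**: the decode pool saturates at **97.5%** while the prefill pool sits at ~30%, so half the cluster's KV is stranded on the wrong side. Agentic
+  requests are large (p99 KV **11.5 GiB**), and 4D halves the system's decode capacity (24→12 concurrent; `figs/f2c_kv_footprint_cdf.png`,
+  `figs/f4b_pdsep_kv_wall.png`).
+- **P-side bottleneck (2P+6D)**: the prefill pool jams at 99.7%, 872 requests pile up, and **91% never complete**.
+- **Smarter routing does not rescue it** (§6.x): adding session affinity on the P side makes things worse (4P+4D completion 100%→36%), GPU util drops to ~0%,
+  and the cluster stalls on KV-transfer coordination rather than compute, reproducing producer hot-pinning.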
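
The 24→12 figure in the first bullet is plain capacity arithmetic. A minimal sketch, assuming a per-instance KV pool of 0.4 × 96 GiB (the split stated in the earlier EAR version of this outline) and that every slot must hold a full p99 request; the helper names are illustrative:

```python
import math

KV_POOL_GIB = 0.4 * 96     # assumption: 40% of a 96 GiB H20 is left for the KV pool
P99_REQUEST_GIB = 11.5     # p99 per-request KV footprint quoted in the bullet above


def decode_slots(kv_pool_gib: float, request_kv_gib: float) -> int:
    """Number of requests of a given KV footprint that fit in one instance's pool."""
    return math.floor(kv_pool_gib / request_kv_gib)


def system_decode_capacity(decode_instances: int) -> int:
    """Concurrent p99-sized decodes the whole system can hold."""
    return decode_instances * decode_slots(KV_POOL_GIB, P99_REQUEST_GIB)


if __name__ == "__main__":
    print("8C (all 8 instances decode):", system_decode_capacity(8))
    print("4P+4D (only 4 decode):      ", system_decode_capacity(4))
```

Because colocation lets all eight instances hold decode KV, the static 4P+4D split halves the number of p99-sized requests the system can keep in flight.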

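-`cache_hit` appears only as a subtracted term inside `P`, while `score` is **multiplicative**. An instance with session affinity keeps receiving that session's requests, so its `num_requests` rises and the product eats the cache benefit. Example: 8000 input tokens, warm instance cache_hit = 7500 vs cold instance cache_hit = 0, pending_prefill = 2000 on both, num_requests 5 vs 1. The LMetric score is then warm = `2500 × 5 = 12500` and cold = `10000 × 1 = 10000`, so **LMetric picks the cold instance** and throws away ~90% of the cache. Result:
+**Why colo wins (the correct argument, supported by C2/C3)**:

-| Policy | APC | vs load_only | Design point |
+- **Time-varying P:D demand**: agentic workloads have substantial work on both sides of the roofline at once: compute-bound prefill (~30% of time) +
+  memory-bound decode (**~70% of time**, after the C3 token≠time correction, `figs/workload_chars/c3_prefill_decode_balance.png`).
+  Colo's elastic pool absorbs whichever phase is hot at the moment; a static partition leaves P-instance bandwidth idle and D-instance compute idle.
+- **Resident KV stays local** (C2): the next turn's prefix = [prevPrompt+prevAnswer] spans both the P and D sides, so disagg must
+  gather/transfer it, while colo keeps it local for free.
+- **Transfer is not cheap and is topology-independent** (MB2, `figs/mb2_transfer_time_compare.png`): Mooncake `batch_transfer_sync_write`
+  always goes through the RDMA NIC (~9.7 GB/s), intra ≈ inter; PD-disagg's per-request transfer tax cannot be bought back with topology.
+- **Phase isolation is disagg's only real win, but it is outweighed** (MB1, `figs/mb1_interference.png`: a 32k prefill degrades
+  per-stream TPOT by 52×, 131k by 183×): the D-side capacity ceiling dominates it (see above).
+
+**Boundary (no overclaiming)**: the crossover sweep (`analysis/crossover/`, `figs/crossover_pd_advantage.png`) gives the input length
+at which colo stops dominating. Colo wins at the agentic operating point, and we know where the boundary is.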
+
+### §3.2 Biased KV-cache-awareness in global routing
+
+GPU-hit-first at the routing layer = **treat cache-awareness as a separate, biased routing path** rather than folding it into the load cost.
+
+**Counterexample: load-balanced / naive cache-aware-load routing loses locality** (`figs/f4a_apc_loss.png`). LMetric (OSDI'26) scores
+`P = pending_prefill + (input − cache_hit)`, `score = P × num_requests`. The cache appears only as a **subtracted term** in the cost model while the score
+is **multiplicative**, so an instance with affinity has its cache benefit eaten by the product because its num_requests is high:
+
+| Policy | APC | vs load_only | Cache handling |
 |---|---:|---:|---|
-| load_only | 53.9% | n/a | Pure load (`score = num_requests`) |
-| LMetric | 57.2% | **+3.3pp** | Cache as a subtracted cost-model term |
-| sticky | 77.7% | **+23.8pp** | Cache as a hard constraint |
-| unified | 78.7% | **+24.8pp** | Cache as a mix of hard and soft preference |
+| load_only | 53.9% | n/a | None |
+| LMetric | 57.2% | **+3.3pp** | Subtracted cost-model term (diluted) |
+| sticky | 77.7% | **+23.8pp** | Hard constraint |
+| unified | 78.7% | **+24.8pp** | Hard + soft mix |

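-**The +3.3pp from `load_only → LMetric` is almost negligible; the +20.5pp from `LMetric → sticky` is the payoff for handling the cache signal correctly**. Cache awareness cannot be swallowed as just one term of a cost model; it must be a **separate routing path** (sticky / unified hybrid). This is a more specific failure mode for §3.1 than "loses locality".
+The +3.3pp from `load_only→LMetric` is almost negligible; **the +20.5pp payoff (LMetric→sticky) comes from making the cache a separate routing path**.

-### §3.2 Static PD-disaggregation hits the D-side KV wall
+**Our method: LPWL (least-prefill-work, parameter-free)** (`project-lpwl-policy`, `analysis/lpwl_5policy_600s.md`):
+route by `new_uncached ≈ input − cache_hit`. When `new_uncached≈input` (cold/new session), requests spread by load automatically; when `new_uncached≈0`
+(warm session), they stick automatically. **Zero knobs**; it beats tuned unified+A+B on the 600s trace (TTFT p90 **−31%**) and ties it on the full w600.
+This is exactly the shape the C1 mixture calls for: it separates the single-turn ocean from the deep tail without a classifier.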
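
The contrast between the two scoring rules can be shown on a two-instance example. The LMetric formula below is the one quoted in this section; the LPWL cost is an assumption, since the outline only states that LPWL routes by `new_uncached ≈ input − cache_hit`, and is taken here to be queued prefill plus the request's uncached tokens. The instance numbers are illustrative, not measurements.

```python
from dataclasses import dataclass


@dataclass
class Instance:
    name: str
    pending_prefill: int   # prefill tokens already queued on the instance
    cache_hit: int         # tokens of this request's prefix cached on the instance
    num_requests: int      # requests currently running on the instance


def lmetric_score(inst: Instance, input_tokens: int) -> int:
    """LMetric as quoted: P = pending_prefill + (input - cache_hit); score = P * num_requests."""
    p = inst.pending_prefill + (input_tokens - inst.cache_hit)
    return p * inst.num_requests


def lpwl_cost(inst: Instance, input_tokens: int) -> int:
    """Assumed LPWL cost: total prefill work the request would wait for and add."""
    new_uncached = input_tokens - inst.cache_hit
    return inst.pending_prefill + new_uncached


def pick(instances, input_tokens, cost):
    """Route to the instance with the lowest cost under the given rule."""
    return min(instances, key=lambda inst: cost(inst, input_tokens))


if __name__ == "__main__":
    warm = Instance("warm", pending_prefill=2000, cache_hit=7500, num_requests=5)
    cold = Instance("cold", pending_prefill=2000, cache_hit=0, num_requests=1)
    for rule in (lmetric_score, lpwl_cost):
        chosen = pick([warm, cold], 8000, rule)
        print(f"{rule.__name__}: routes the 8000-token request to {chosen.name}")
```

On this example the multiplicative rule sends an 8000-token request with 7500 cached tokens to the cold instance, while the additive least-prefill-work rule keeps it on the warm one. With no cache hit anywhere, the same additive rule reduces to picking the instance with the least queued prefill.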

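-Statically splitting instances into a P pool and a D pool works for chatbots but fails for agentic workloads: agentic requests average 33.6k tokens and need **3.3GB** of KV; under 4D, a p90 request occupies **69%** of the D-side KV pool and a p99 request **overflows at 138%**. Result: **TTFT p50 blows up 62-72x** and the success rate falls from 99.5% to **52-68%**. This violates §2.1 (prefill-dominant + long context).
+**Two rejected levers (to save the reader time)**:
+
-### §3.3 The real failure of pure session-sticky: every worker is dragged down by hot sessions
+- **Real-time engine state is not a routing lever** (ES ablation, `project-es-ablation-sweep`): it helps only for one-shot placement
+  (sticky −26%) and hurts per-request load-chasing (load_only +27%).
+- **The derived-κ decode term is net-negative**: decode-awareness is the wrong lever.

-Hard session-to-instance binding restores locality (APC **77.2%**, 97% of the upper bound), but **absolute worker latency** is dragged down across the board, which is the real failure mode of pure sticky.
+**Pure sticky's failure is measured by absolute per-worker latency** (`figs/f4c_per_worker_ttft.png`; the max/median ratio must not be used):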

 | | median worker TTFT p90 | max worker | system e2e p90 |
 |---|---:|---:|---:|
-| `sticky` | **20.3 s** | 55.4 s | **34.6 s** |
-| `unified` (affinity + LMetric fallback) | **10.3 s** | 37.7 s | **18.0 s** |
-| `lmetric` | 14.0 s | 31.3 s | 24.8 s |
+| sticky | **20.3 s** | 55.4 s | **34.6 s** |
+| unified | **10.3 s** | 37.7 s | **18.0 s** |

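-Mechanism: on the production trace the top 1% of sessions hold 46.5% of input mass and the top 5% hold 66.5%, and hot sessions far outnumber instances (8); sticky's hash binding makes **every worker take on a hot session of its own**, so even the median worker slows to the 20s range. unified uses the LMetric fallback to reroute cold/new sessions to non-hot workers, preserving the speed of 7/8 workers. System p90 is determined by the majority of requests, so unified is ~2x faster than sticky on e2e p90.
+Hot sessions far outnumber instances (8), so sticky's hash binding makes **every worker take on a hot session of its own** and even the median slows down;
+biased routing reroutes cold/new sessions to non-hot workers and preserves the speed of 7/8 workers → e2e p90 is ~2× faster. **This leads to §3.3: sticky's residual
+hot-pin needs migration to resolve.**

-**§3.3 sub-finding**: hot-pin failure must be measured with **per-worker absolute latency** (median + max), **not a normalized ratio**. `max/median` penalizes an "affinity + escape" scheme like unified in the wrong direction: sticky's ratio of 2.73 is lower than unified's 3.67, but sticky's median is also higher (20.3s vs 10.3s for unified), so the lower ratio belongs to the worse system. All worker-balance comparisons in this paper use the (median, max) pair, never a single ratio.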
+### §3.3 GPU KV-cache dedup with migration instead of replication

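-This violates the skew-tolerance requirement of §2.4.
+**Perspective**: several instances each caching the same prefix means GPU capacity wasted on **replication**; GPU-hit-first calls for **keeping a single global copy +
+moving the session to that copy** (migration/dedup), which preserves GPU hits, balances load, and fixes the residual hot-pin from §3.2.
+
-**Figure 4: Three baselines, three failure modes**, split into three sub-figures placed in §3.1/§3.2/§3.3:
+- **Trigger**: the host's `pending_prefill > T_hot` and the session has not migrated within the last `T_cool`.
+- **Target**: pick the lightest instance by **observable pending prefill tokens**, **not** by a cost-model prediction (the mooncake cost
+  model is off by 10-21×; this sidesteps it by construction).
+- **Mechanism**: the current request is redirected to the target, the session KV migrates through the Mooncake `kv_connector`, the binding is updated, and later
+  turns route to the new host by affinity. The migration cost is amortized over the session's remaining turns.
+- **Thrashing prevention**: per-session cooldown `T_cool`.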
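
The trigger and target rules above can be sketched as two small functions. This is a hedged illustration, not the router's actual code: the candidate filter (enough free KV for the session, strictly lighter than the host) follows the earlier EAR design text in this outline, and the field names, threshold values, and units are hypothetical.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Inst:
    name: str
    pending_prefill: int   # observable queued prefill tokens
    free_kv_tokens: int    # remaining KV capacity, expressed in tokens


def should_migrate(host: Inst, last_migration_s: Optional[float], now_s: float,
                   t_hot: int, t_cool: float) -> bool:
    """Trigger: the host is hot AND the session is outside its cooldown window."""
    hot = host.pending_prefill > t_hot
    cooled = last_migration_s is None or (now_s - last_migration_s) >= t_cool
    return hot and cooled


def pick_target(host: Inst, instances: List[Inst], session_kv_tokens: int) -> Optional[Inst]:
    """Target: lightest instance by observed pending prefill, with no cost-model prediction.

    Returns None when no instance both fits the session's KV and is lighter than
    the host, in which case the binding stays where it is.
    """
    candidates = [inst for inst in instances
                  if inst is not host
                  and inst.free_kv_tokens >= session_kv_tokens
                  and inst.pending_prefill < host.pending_prefill]
    return min(candidates, key=lambda inst: inst.pending_prefill, default=None)


if __name__ == "__main__":
    host = Inst("e0", pending_prefill=50_000, free_kv_tokens=10_000)
    fleet = [host,
             Inst("e1", pending_prefill=10_000, free_kv_tokens=100_000),
             Inst("e2", pending_prefill=5_000, free_kv_tokens=1_000),    # too little room
             Inst("e3", pending_prefill=60_000, free_kv_tokens=200_000)]  # busier than host
    if should_migrate(host, last_migration_s=None, now_s=100.0, t_hot=30_000, t_cool=60.0):
        target = pick_target(host, fleet, session_kv_tokens=40_000)
        print("migrate to", target.name if target else "nobody (keep binding)")
```

Ranking by an observed queue length rather than a predicted transfer time is what lets the policy avoid the cost-model error noted in the Target bullet.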

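-§3.1: measured APC vs the 79.6% theoretical upper bound (lmetric 56.9%, load_only 54.1%, sticky 77.2%, unified 79.4%):
-![F4a APC loss](figs/f4a_apc_loss.png)
-
-§3.2: D-side KV pool occupancy vs per-request KV footprint; both 4P+4D and 6P+2D cross the 90% memory wall in the agentic regime:
-![F4b PD-sep KV memory wall](figs/f4b_pdsep_kv_wall.png)
-
-§3.3: per-worker TTFT p90 across 8 instances × 5 policies. Under sticky every worker is slowed (median 20.3s); unified concentrates the damage on e4 while the other workers stay fast (median 10.3s):
-![F4c Per-worker TTFT p90 distribution](figs/f4c_per_worker_ttft.png)
-
-
-> 📝 Optional supporting figure: prefill-decode interference (an 8k prefill on the same GPU degrades TPOT by 66x), placed in §3.3 to support the interference argument against sticky:
-![F4d PD interference](figs/f4d_pd_interference.png)
-
-### §3.4 Takeaway
-
-**The problem is not that any single baseline is too weak, but that no scheme satisfies all three properties from §2 at once**: preserve locality, respect D-side KV capacity, and tolerate the load imbalance caused by skew. To our knowledge, EAR is the first scheduler to do all three.
+> **Status**: **the substrate is verified net-positive** (the `kv_both` connector gives TTFT p90 **−18.6%** on trace replay,
+> **−36.6%** vs plain after the DR-fix; the transfer overhead once thought to be the blocker is gone, and with it the main cause of the 4 historical reverts).
+> Migration correctness smoke tests pass. **Policy-layer e2e (the real benefit of trigger + target inside the feedback loop) has still not been validated directly.**
+> This is the weakest link in the paper; the independent risk is "wrong decisions + cooldown thrashing". The affinity-only pillar (§3.2 LPWL) already stands on its own,
+> so even if migration remains marginal, the paper still has a strong-routing main line.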

 ---

-## §4 Design: EAR
+## §4 System Evaluation

-### §4.1 Architecture
+> **🚧 Key gap**: the evidence is currently scattered across mb1/mb2/mb5/crossover/lpwl/v2; §4 needs one **integrated system** (colocation +
+> biased routing + dedup-migration, under a single name) run end to end, evaluated with the new metrics from §2.0 (request latency / TPS / GPU util),
+> with §3.1/§3.2/§3.3 turned into ablations.

-EAR is a router sitting on top of N homogeneous instances. Each instance is symmetric and PD-colocated, with no static P/D partition. For each session the router maintains a **host binding**: the instance that currently holds the session's KV state. The binding is stable in steady state and changes only through migration when a hotspot triggers it.
+### §4.1 Setup
+- **Trace**: real Qwen3-Coder agentic trace (cluster-level, see `project-trace-is-cluster-level`); day-to-day iteration uses
+  `w600_r0.0015_st30` (~850 req / ~13 min / peak QPS ~1.6 / APC headroom ~76%).
+- **Hardware**: 8× H20 (96 GB); vLLM 0.18.1 (V1, chunked-prefill) + Mooncake.
+- **Baselines**: ① round-robin ② LMetric (OSDI'26) ③ kvcache-aware+load linear mix ④ sticky ⑤ static PD-disagg
+  (4P/4D) ⑥ **our system** (colo + LPWL + dedup-migration).
+- **Metrics**: request latency (mean/p50/p90/p99), **TPS**, **GPU util (median/max worker)**, APC,
+  wall-clock amplification (not TTFT/TPOT alone, §2.0).

-**Figure 5: EAR architecture and request flow** — `figs/f5_architecture.png` **🚧 TBD (CUSTOM DRAW)**
-> Component diagram: router (with session→host table) → N symmetric instances; affinity path as solid lines, migration path as dashed lines. Suitable for TikZ / draw.io.
+### §4.2 End-to-end
+- `figs/f6_e2e_latency_bars.png` / `f6_e2e_latency_full_grid.png` (4-5 baselines so far; 🚧 add the static PD-disagg
+  column + our system's column).

-### §4.2 Pillar 1: Affinity-Default Routing
+### §4.3 Ablation (switch off one GPU-hit-first corollary at a time)
+- −colocation (→ static PD-disagg, §3.1)
+- −biased routing (→ load-balance / LMetric, §3.2)
+- −dedup-migration (→ pure sticky, §3.3)
+- 🚧 migration-only / full await policy e2e validation.

- **Cold start**: when a new session arrives, the router assigns its initial host by load balance (the instance with the fewest pending prefill tokens)
- **Warm path**: every later turn of an established session routes to its current host
- **Effect**: intra-session KV reuse is preserved by construction, and APC approaches the 79.6% upper bound from §2.2
+### §4.4 Microbenchmark support (existing, reused)
+- §2.2 four-tier hits: `v2/figs/exp_a_tier_latency.png`
+- §2.2 capacity knee: `v2/figs/exp_b_capacity_knee.png`
+- §3.1 PD-disagg failure: `figs/f4b`, `PD_DISAGG_RESULTS`, `crossover_pd_advantage`
+- transfer cost：`figs/mb2_*`；phase interference：`figs/mb1_interference.png`

-### §4.3 Pillar 2: Hot-Triggered Session Migration 🚧 PARTIAL VALIDATION
-
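-The key mechanism that keeps Pillar 1 from degenerating into pure sticky.
-
-> **Status (updated 2026-05-27)**:
-> - **Substrate validation PASS** (commit `ef9e010`): the `kv_both` connector is net positive on trace replay (TTFT p90 −18.6%), with a further −22% after the DR-fix. The transfer overhead once thought to block migration is gone.
-> - **Policy-layer e2e validation PENDING**: the real benefit of the trigger threshold + target selection inside the agentic feedback loop has not been measured directly. The main cause (substrate overhead) of the 4 reverted migration attempts (`6b255fa`, `e991960/5772149`, `cc6e562`, `4c583f2`) is gone, but wrong trigger decisions + cooldown thrashing are independent risks that need a new round of e2e experiments.
-
-#### §4.3.1 Trigger signal
-
-EAR monitors each instance's **pending prefill tokens** in real time. When a new request arrives and affinity says it should route to host H, the router first checks:
- `H.pending_prefill > T_hot`? (hotspot detection)
- Has the session gone without a migration for the past `T_cool` seconds? (thrashing prevention, §4.3.4)
-
-Migration is considered only when both conditions hold. See the §5.5 sensitivity study for the values of `T_hot` and `T_cool`.
-
-#### §4.3.2 Target selection
-
-Candidate set: all instances that (a) have enough remaining KV capacity for the session's existing context and (b) have `pending_prefill` strictly lower than H's. Pick the one with the lowest `pending_prefill`.
-
-**Key design point**: we rank by **observable current load** rather than **predicted transfer time**. Both the literature and a colleague's data show the mooncake cost model's prediction error reaches 10-21x, whereas pending prefill tokens is a value the router observes directly, accurate by construction.
-
-If the candidate set is empty (no other instance has room, or all are busier than H), EAR keeps the current binding and continues serving the request on H: **migration is opportunistic, not mandatory**.
-
-#### §4.3.3 Mechanism
-
-When a migration triggers:
-
-1. The current request is redirected straight to the target instance T
-2. The session's accumulated KV state is transferred from source H to T through the Mooncake `kv_connector`
-3. The session's host binding is updated to T; later turns route to T automatically by affinity
-
-The KV transfer sits on the critical path of the request that triggers the migration, but it is amortized over the session's remaining TBD turns.
-
-#### §4.3.4 Thrashing prevention
-
-Each session keeps a `last_migration_timestamp` and may not migrate again within the cooldown `T_cool`. The cooldown bounds migrations at O(session_lifetime / T_cool).
-
-### §4.4 Implementation
-
-Built on vLLM 0.18.1 + Mooncake (vanilla kv_connector). EAR is a router process of ~TBD LoC. The session→host table is kept in TBD (in-memory dict / Redis).
+### §4.5 Sensitivity
+- Arrival rate λ / skew (Zipf α) / KV pool size (independent of migration, feasible now); `T_hot`/`T_cool` (depend on migration, deferred).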

 ---

-## §5 Evaluation
+## §5 Related Work

-### §5.1 Setup
-
- **Trace**: real Qwen3-Coder agentic trace, TBD requests / TBD seconds / r=0.0015 st=0.3, peak QPS ~1.6, APC headroom ~76%
- **Hardware**: TBD × H20 (96GB HBM)
- **Engine**: vLLM 0.18.1 + Mooncake `kv_connector`
- **Baselines** (6):
-  1. `load-balance`: round-robin
-  2. `LMetric`: OSDI'26 load-aware routing
-  3. `kvcache-aware + load-balance`: linear combination of cache score and load score
-  4. `sticky`: hard session-instance pinning
-  5. `static PD-disagg`: static 4P / 4D partition
-  6. `EAR`: this paper
- **Metrics**: TTFT (mean/p50/p90/p99), TPOT (same percentiles), E2E, APC, worker TTFT p90 (median + max), wall-clock vs trace-time
-
-### §5.2 End-to-end Performance
-
-**Figure 6 (headline, p90 only)**: ✅ (PARTIAL, PD-disagg column missing)
-
-![F6 E2E latency bars: 4 policies, p90 only](figs/f6_e2e_latency_bars.png)
-
-**Figure 6 full (mean / p50 / p90 / p99 × TTFT / TPOT / E2E)**: ✅ data complete:
-
-![F6 full latency grid: 4 percentiles × 3 metrics](figs/f6_e2e_latency_full_grid.png)
-
-> **🚧 TBD (NEW DATA)**: both figures lack the `static PD-disagg` column, and the EAR column is also TBD (needs migration validation). A version in the same format covering all 6 baselines still has to be produced. The headline figure puts the single p90 row in the main paper; the full grid can go in the appendix or supplementary material.
-
-| Scheduler | TTFT p50 | TTFT p90 | TPOT p90 | APC | Worker p90 (median / max) | Wall-clock factor |
-|---|---|---|---|---|---|---|
-| load-balance | TBD | TBD | TBD | TBD | TBD | TBD |
-| LMetric | TBD | TBD | TBD | 56.9% | 6.53 | ~8x |
-| kvcache+load | TBD | TBD | TBD | TBD | TBD | TBD |
-| sticky | TBD | 18.02s | TBD | 77.2% | 13.65 | TBD |
-| static PD-disagg | 62.8s | TBD | TBD | TBD | TBD | TBD |
-| **EAR** | TBD | **7.35s** | TBD | **79.4%** | TBD | TBD |
-
-(Bold numbers come from measurements of the existing "unified" prototype.)
-
-### §5.3 Ablation 🚧 PARTIAL DEFER
-
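-We disable each of the two pillars independently:
-
- **EAR (affinity only)**: equivalent to pure sticky; measures the contribution of locality alone
- **EAR (migration only)**: cold-balance + reactive migration, no affinity; measures whether migration can stand on its own
- **EAR (full)**: both pillars enabled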
-
-**Figure 7: Ablation** — `figs/f7_ablation.png` **🚧 TBD — DEFERRED (BLOCKED ON MIGRATION VALIDATION)**
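-> The full ablation needs three configurations: migration-only, both, and affinity-only. Migration-only and both depend on re-running migration. For now, a two-point comparison of affinity-only vs load-balance is possible (existing data: unified 79.4% APC vs lmetric 56.9% APC).
-
-Expected conclusion: affinity-only captures locality but doubles interference; migration-only fails to capture locality; both are required.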
-
-### §5.4 Dispatch Coupling Validation
-
-Closes the loop on the §2.3 argument. For each baseline we measure:
-
- Mean per-turn service time `W_turn` (x-axis)
- Wall-clock / trace-time amplification (y-axis)
-
-**Figure 8: Wall-clock amplification vs per-turn service time** — `figs/f8_coupling_measured.png` **🚧 TBD (NEW DATA)**
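-> Scatter plot: x = mean per-turn `W_turn` (TTFT + decode_time, computed from per-request metrics), y = amplification (`wall_clock / trace_span`, already exposed by the Phase 1 patch). One point per baseline. The theoretical curve `L*/(1 − Λ·N·W'(L*))` may be overlaid (optional). This is the empirical closure of the §2.3 argument and has the **highest priority**.
-
-Expected: EAR sits in the corner with the smallest `W_turn` and the lowest amplification factor.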
-
-### §5.5 Sensitivity
-
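-| Parameter | Range | What it tests |
-|---|---|---|
-| Arrival rate λ | TBD | Whether EAR dominates consistently under low and high load |
-| Skew (Zipf α) | TBD | Whether the gap between sticky and EAR widens with skew |
-| KV pool size | TBD | The boundary at which static PD-disagg hits the KV wall |
-| `T_hot` (migration threshold) | TBD | Trigger too loose → thrash; too strict → missed migrations |
-| `T_cool` (cooldown) | TBD | Same as above |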
-
-**Figure 9: Sensitivity heatmaps** — `figs/f9_sensitivity.png` **🚧 TBD (NEW DATA, PARTIAL DEFER)**
-> The arrival rate / skew / KV pool size axes can be done now (they do not depend on migration); the `T_hot` / `T_cool` axes depend on migration validation and are deferred.
-
-### §5.6 Migration Microbenchmark 🚧 FULL DEFER
-
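-Characterizes EAR's internal migration behavior:
-
- Migration trigger rate (% of requests)
- Mean KV transfer time
- Migration accuracy: the fraction of migrations after which the target instance stays non-hot for the next TBD turns
- Thrashing rate: the fraction of sessions that migrate more than once within a cooldown window (should be 0)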
-
-**Figure 10: Migration timeline** — `figs/f10_migration_timeline.png` **🚧 TBD — DEFERRED (BLOCKED ON MIGRATION VALIDATION)**
-> A heatmap of pending prefill tokens per instance along the time axis, with migration events marked by arrows. Depends entirely on re-running migration.
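+- **Serving systems**: vLLM, Mooncake, SGLang, DistServe, Splitwise. This work builds on vLLM+Mooncake and differs from
+  DistServe/Splitwise in that it **does not use static P/D partitioning** (§3.1 gives the refutation for the agentic regime plus the crossover boundary).
+- **Cache-aware routing**: LMCache, Production-Stack, LMetric (OSDI'26). This work argues that the cache signal cannot be folded into a multiplicative load
+  score (§3.2 LMetric dilution analysis) and must instead be a separate, biased routing path.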
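+- **KV storage hierarchy / offload (the main opposing camp, addressed head-on)**: Mooncake Store, LMCache, AttentionStore, and others push KV down
+  to CPU/SSD/remote pools. **This work does not dispute their benefit over recompute** (§2.2 measures remote-RDMA hits up to 16× faster than recompute,
+  in the same direction as the 46× reported in the Mooncake-Store blog); it argues that under **agentic** workloads (i) the hit hierarchy is GPU ≫ CPU > RDMA-store (§2.2) and
+  (ii) the active working set is already GPU-resident (§2.2 Ev#1) ⇒ deeper tiers yield low marginal benefit, so GPU residency should take priority over building a deep hierarchy.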
+- **Stateful migration**: Pollux, Gandiva (migration-as-rebalancing for RL training). This work carries that idea over to dedup of the LLM KV
+  cache (§3.3).

 ---

-## §6 Discussion and Limitations
+## §6 Conclusion

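- **Extreme skew**: if a single session alone can push any instance into the hot state, EAR degenerates to sticky. We have not stress-tested this regime.
- **Cost model accuracy**: EAR sidesteps prediction error by using observable load. If predictive admission control is introduced later, the 10-21x error of the mooncake cost model will have to be addressed.
- **Heterogeneous hardware / multi-model**: EAR assumes homogeneous instances. Mixed-model / mixed-GPU pools require extending the binding model.
- **Per-instance batch tuning (future)**: dynamically adjusting `max_batched_tokens` may further reduce prefill-decode interference within an instance; left as future work.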
-
---
-
-## §7 Related Work
-
- **LLM serving systems**: vLLM, Mooncake, SGLang, DistServe, Splitwise. EAR is implemented on vLLM + Mooncake and differs from DistServe/Splitwise in that it does not use static P/D partitioning.
- **Cache-aware routing**: LMCache, Production-Stack, LMetric (OSDI'26). These works minimize cross-instance cache misses but do not migrate state.
- **Stateful service migration**: Pollux, Gandiva (RL training). EAR borrows the migration-as-rebalancing idea and carries it over to the KV cache setting of LLM inference.
-
---
-
-## §8 Conclusion
-
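-For agentic LLM workloads, locality is the dominant scheduling lever. EAR captures it with session-affinity routing and protects it with hot-triggered session migration; a single design dominates the five baselines on all four dimensions at once: TTFT, APC, worst-worker p90, and wall-clock throughput.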
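+For agentic LLM workloads, the user-perceived request latency / TPS / GPU util are dominated by **whether KV hits land in GPU HBM**. The hit hierarchy is
+`GPU ≫ CPU > RDMA-store ≫ recompute`, with cost gaps that widen with context length (§2.2), while the active working set is small enough to be GPU-resident already
+(§2.2 Ev#1). The **GPU-hit-first** principle thereby unifies three things: reviving PD-colocation (§3.1), biased
+KV-awareness in routing (§3.2), and GPU dedup via migration rather than replication (§3.3).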

 ---

 ## Work Plan

 ### ✅ Done
+- §1/§2 background and metric argument; dispatch-coupling math; inter-turn gap CDF (`f3a`)
+- §2.1 reuse topology / C1–C3 (`f2a`, `workload_chars/*`)
+- **§2.2 four-tier hit measurement (`v2/exp_a_tier_latency`: GPU/CPU/RDMA/miss)**
+- **§2.2 Ev#1 capacity knee (`v2/exp_b_capacity_knee`) + working_set + cluster-scale correction**
+- §3.1 PD-disagg refutation (`PD_DISAGG_RESULTS`, crossover, MB1/MB2)
+- §3.2 LPWL (−31%), LMetric dilution, sticky hot-pin, ES ablation
+- §3.3 migration substrate net-positive + correctness smoke tests

- [x] §1 anchor sentence + contribution bullets
- [x] §2 outline + reuse existing characterization figures (`f2a`/`f2b`/`f2c`)
- [x] §3.1/§3.2/§3.3 outline + reuse existing baseline failure figures (`f4a`/`f4b`/`f4c`/`f4d`)
- [x] §4 design description (§4.3 pending empirical validation)
- [x] §5.2 partial figure (`f6` 5/6 baselines)
- [x] `replayer/replay.py` patched to emit `trace_span_s` + `amplification` in summary
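+### 🟢 Independent of migration, can be done now
+- §2.0 wall-clock amplification sweep (5 baselines × ≥3 runs): **highest priority**
+- §4 integrated-system naming + end-to-end baseline matrix (including the static PD-disagg column)
+- §4.5 λ / skew / KV pool sensitivity
+- Draft the §1–§3 body text (evidence and figures are in place)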

-### 🟢 Can do without migration (paper writing now possible)
+### 🚧 Deferred (pending migration policy e2e)
+- §3.3 validating the feedback-loop benefit of the migration trigger + target
+- §4.3 full / migration-only ablation
+- §4.5 `T_hot` / `T_cool` sensitivity

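- [ ] Draft the §1-§4 body text (all data available, figures already copied over)
- [ ] Draft the body text of the §2.3 dispatch coupling section (the math has already been derived in conversation)
- [ ] Draft the body text for the three failure modes in §3
- [ ] §5.4 wall-clock amplification measurement (5 baselines × ≥3 runs): **highest priority**, this is the empirical closure of §2.3
- [ ] §5.2 add static PD-disagg to the `f6` figure (re-run or merge the existing PD-sep data)
- [ ] §5.5 the three sensitivity axes: λ / skew / KV pool
- [ ] Decide the latex/markdown layout for each of the three §3 subfigures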
+### 🎨 To draw
+- §1.3 storage-hierarchy schematic (GPU HBM → CPU DRAM → RDMA store)
+- §2.0 dispatch-coupling schematic (chatbot vs agentic timeline + feedback loop)
+- Integrated-system architecture figure

-### 🚧 Deferred (pending migration validation)
-
- [ ] §4.3 migration mechanism e2e validation: the substrate works (commit `ef9e010`); policy-layer experiments for trigger + target selection are missing
- [ ] §5.3 full ablation (the migration-only and both configurations)
- [ ] §5.5 sensitivity along the `T_hot` / `T_cool` axes
- [ ] §5.6 the entire migration microbench
- [ ] §1 teaser figure (`f1`): the EAR column
- [ ] §5.2 table: the EAR row
- [ ] §4.3.1 / §4.3.4 values for `T_hot` and `T_cool`
-
-### 🎨 Custom drawings (paper-writing phase)
-
- [ ] `f3_coupling_schematic.png`: chatbot vs agentic timeline + feedback loop
- [ ] `f5_architecture.png`: EAR component diagram
-
-### ❓ Open design decisions
-
- [ ] §4.4 storage medium for the session→host table (in-memory dict vs Redis)
- [ ] §5.1 finalize the instance count and total trace length
+### ❓ Open
+- Final name of the integrated system (GPU-hit-first is the principle; the system name is TBD)
+- §4 finalize instance count / total trace length