agentic-kvc/PAPER_OUTLINE.md

# GPU-Hit-First: Serving Agentic LLM Workloads by Keeping the Working Set in HBM

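> **Thesis (one-liner)**: For agentic LLM workloads, the end-to-end metrics that users actually feel are
> **request latency / TPS / GPU utilization**, and one thing dominates all three: **whether KV-cache hits land
> in GPU HBM**. 93% of agentic KV reuse is intra-session, and the active working set is small enough to stay
> resident in a single node's HBM. The cost gap across hit tiers, `GPU > CPU-local > remote-RDMA-store ≫
> recompute`, widens with context length. This yields one unifying principle, **GPU-hit-first**: keep the
> active working set in HBM rather than building a deep CPU/storage hierarchy to chase the long tail. Three
> corollaries each fix one mismatch in existing systems: (3.1) make PD-colocation the default again; (3.2) do
> biased KV-cache-awareness in global routing; (3.3) deduplicate GPU KV across instances with KV migration
> rather than replication.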

> **Framing note (2026-05-30)**: This outline supersedes the earlier "EAR: Elastic Affinity Routing" version
> (preserved in git history). EAR's dispatch-coupling formalization is **demoted to the metric argument in §2**
> (explaining "why request latency rather than TTFT/TPOT") and is no longer the headline. The headline is now
> the GPU-hit-first principle, with affinity routing and migration as two of its corollaries (§3.2 / §3.3).

---

## 📊 Validation Status (2026-05-30)

| Section | Claim | Evidence | Status |
|---|---|---|---|
| §1 | Background: PD-colo / PD-disagg / KV storage hierarchy | - | Writing |
| §2 metric | Request latency rather than TTFT/TPOT; TPS; GPU util | dispatch coupling + amplification; `bench_report.py` | 🟡 Argument complete; wall-clock sweep still needed |
| §2.1 | KV hits are common and critical | `f2a` intra 93.2%, APC upper bound 79.6%, C1/C2 | ✅ |
| **§2.2** | **GPU hit > CPU hit > RDMA-store hit ≫ miss** | **`v2/figs/exp_a_tier_latency.png` (four tiers, measured)** | ✅ **NEW** |
| **§2.2 Ev#1** | **GPU is enough to keep the "valuable" working set resident** | **`v2/figs/exp_b_capacity_knee.png` (knee) + working_set + cluster-scale correction** | ✅ **NEW** |
| §3.1 | Make PD-colocation great again | `PD_DISAGG_RESULTS`, `crossover_pd_advantage`, MB1/MB2, C2/C3 | ✅ |
| §3.2 | Biased KV-aware global routing | LPWL (TTFT p90 −31%), LMetric multiplicative dilution, sticky hot-pin, ES ablation | ✅ |
| §3.3 | GPU dedup via migration not replication | substrate net-positive (−18.6%/−36.6%); correctness smoke tests | 🟡 Substrate works; policy e2e not yet validated |
| §4 | End-to-end eval of the integrated system | Scattered across mb1/mb2/mb5/crossover/lpwl; needs unifying | 🚧 |
| §5 | Related work (including a direct response to storage hierarchy) | - | Writing |

---

## §1 Background and System Setup

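### §1.1 LLMs and the KV cache

Transformer autoregressive inference has two phases: **prefill** (computes the KV for the whole prompt in one
pass, compute-bound) and **decode** (generates token by token, memory-bandwidth-bound). Each token's KV must
stay resident in GPU HBM to be reused by later attention steps. Prefix caching (APC) lets requests that share a
prompt prefix hit the already-computed KV and skip the repeated prefill; this is the physical basis of every
optimization in this paper.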

> Qwen3-Coder-30B-A3B (GQA, 48 layers, 4 KV heads, head_dim 128, bf16): **KV = 96 KiB/token**, 1 GiB ≈ 10,923
> tokens, block (16 tok) = 1.573 MB.
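The figures in the note above can be re-derived from the model shape. The following is a minimal sketch of that arithmetic; the constant names are illustrative, and the factor of 2 assumes K and V are stored separately at the same precision.

```python
# KV-cache footprint per token for a GQA model (shape taken from the note above).
LAYERS = 48
KV_HEADS = 4
HEAD_DIM = 128
BYTES_PER_ELEM = 2   # bf16
BLOCK_TOKENS = 16

# Factor 2: one K vector and one V vector per layer and KV head.
kv_bytes_per_token = 2 * LAYERS * KV_HEADS * HEAD_DIM * BYTES_PER_ELEM
tokens_per_gib = 2**30 / kv_bytes_per_token
block_bytes = BLOCK_TOKENS * kv_bytes_per_token

print(kv_bytes_per_token / 1024)   # KiB per token
print(round(tokens_per_gib))       # tokens per GiB
print(block_bytes / 1e6)           # decimal MB per 16-token block
```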

### §1.2 Agentic workflow

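An agentic workload = an LLM driving itself through tool calls over multiple turns to complete a task. The
essential differences from a chatbot: each turn is triggered by the previous turn's tool-call result (no human
think-time), and the workload is prefill-dominated (input/output ≈ 75×).

**But it is a mixture, not "all multi-turn"** (C1, `figs/workload_chars/c1_session_mixture.png`):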

- **90.3%** of sessions are single-turn (mean 1.62 turns), yet multi-turn sessions (9.7%) account for **44.2%
  of requests** and **66.9% of prefill mass**.
- Continuation hazard (Lindy): turn1→2 is only **10.2%**, turn5→6 is 87%, turn12→13 is **94.3%**. Heaviness is
  almost unpredictable at cold start (corr(turn1_input, n_turns) = **0.04**).
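The continuation hazard quoted above is the conditional probability that a session reaching turn k goes on to turn k+1. A minimal sketch of that computation follows; `continuation_hazard` and the `toy` session list are hypothetical illustrations, not the trace or the analysis script behind C1.

```python
def continuation_hazard(turn_counts, k):
    """P(session reaches turn k+1 | session reached turn k)."""
    reached_k = sum(1 for n in turn_counts if n >= k)
    reached_next = sum(1 for n in turn_counts if n >= k + 1)
    return reached_next / reached_k if reached_k else float("nan")

# Hypothetical toy data: a single-turn ocean plus a small deep tail.
toy = [1] * 90 + [2] * 2 + [6] * 3 + [20] * 5

for k in (1, 2, 6):
    print(k, continuation_hazard(toy, k))
```

On data with this shape the hazard is low at turn 1 and high once a session has survived a few turns, which is the Lindy pattern the bullet describes.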

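> **Routing implication (runs through §3.2)**: heaviness is unpredictable at session start → routing must be
> **reactive** (observe accumulated load), not proactive prediction. The single-turn ocean and the deep tail
> have opposite optimal policies (load-balance for the former, affinity-pin for the latter), and turn 1 cannot
> tell them apart → the only viable policy is "everyone starts load-balanced and becomes sticky as turns
> accumulate", which is exactly LPWL's emergent behavior.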

### §1.3 Serving agents in the wild

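- **PD colocation (8C)**: every instance is symmetric; prefill and decode interleave on the same GPU via
  chunked-prefill + continuous batching. Elastic KV pool, no static partitioning.
- **PD disaggregation**: statically splits instances into a prefill pool (P) and a decode pool (D), physically
  isolating the two phases (DistServe / Splitwise).
- **KV cache storage hierarchy**: GPU HBM → CPU DRAM → remote pooled store (RDMA/SSD, e.g. Mooncake Store /
  LMCache). Evicted or cross-instance KV sinks to slower but larger tiers, trading transfer for recompute.

> Each of the three is answered by one part of this paper: PD-colo is "revived" as the default in §3.1;
> PD-disagg is refuted in §3.1 (for the agentic regime); the storage hierarchy is quantitatively "bounded" in
> §2.2 (GPU hits beat the lower tiers by a wide margin, and the active working set already fits).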

---

## §2 GPU memory hit is the key to serving agents

### §2.0 The right metrics: request latency / TPS / GPU utilization (not TTFT/TPOT)

**Why not per-request TTFT/TPOT**: agentic turns sit inside a feedback loop, so single-turn latency
**compounds** across turns into gaps in session end-to-end time and system throughput. Only **request/session
latency, tokens-per-second, and GPU utilization** capture this.

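**Dispatch coupling (demoted to the argument of this section)**. Between turns there is an external gap
`T_external` (for a chatbot, a human reading/thinking/typing; for an agent, tool execution). Little's Law:
`L = Λ · N · (W_turn(L) + T_external)`, reading `L` as concurrency, `Λ` as the session arrival rate, and `N`
as turns per session. When `W_turn ≳ T_external`, `W_turn(L)` couples to the concurrency `L` through KV
contention, and a scheduler's ε regression is amplified by the feedback loop into a multi-× wall-clock gap.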
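To make the feedback loop concrete, here is a minimal sketch that iterates the Little's Law relation to a fixed point. Everything about `W_turn(L)` is an assumption: the linear contention model `w0 · (1 + L / kv_slots)`, the helper names, and all parameter values are hypothetical and chosen only to show the direction of the effect, not to reproduce any measurement.

```python
def steady_state_concurrency(arrival_rate, turns, t_external, w_turn, iters=500):
    """Iterate L <- Lambda * N * (W_turn(L) + T_external) until it settles."""
    L = 0.0
    for _ in range(iters):
        L = arrival_rate * turns * (w_turn(L) + t_external)
    return L

def linear_contention(w0, kv_slots):
    """Hypothetical W_turn(L): per-turn latency grows linearly with concurrency."""
    return lambda L: w0 * (1.0 + L / kv_slots)

base = linear_contention(w0=2.0, kv_slots=8.0)
slow = linear_contention(w0=2.2, kv_slots=8.0)   # scheduler that is 10% slower per turn

L_base = steady_state_concurrency(0.5, 4, 1.6, base)
L_slow = steady_state_concurrency(0.5, 4, 1.6, slow)

# Ratio of realized per-turn latency once concurrency has re-equilibrated.
amplification = slow(L_slow) / base(L_base)
print(L_base, L_slow, amplification)
```

Under these assumed numbers a 10% per-turn regression ends up as a larger realized slowdown, because the slower turns raise concurrency, which in turn slows every turn further. The iteration only converges while `Λ · N · w0 / kv_slots < 1`; past that point the toy model has no stable operating point.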

Measured `T_external` CDF from the production trace (`figs/f3a_inter_turn_gap.png`):

| | Agentic | Chatbot |
|---|---:|---:|
| p50 | **1.6 s** | 7.2 s |
| gap < 1 s | **39%** | 4% |
| gap < 5 s | 67% | 29% |
| p99 | 738 s | 43 s |

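Agentic workloads have a **sub-second tool-call mass (39% vs 4%)** that chatbots lack, so `W_turn ≫ T_external`
holds almost by default → the loop closes. **Measured**: lmetric took 49 min of wall-clock to replay a 600s
trace = **8× amplification**. **Conclusion: small per-turn latency differences are amplified into
order-of-magnitude end-to-end differences → measure with request latency / TPS / GPU util.**

- **TPS / GPU util** tooling: `microbench/fresh_setup/bench_report.py` (full TTFT/TPOT/E2E percentiles + TPS +
  per-worker GPU util). GPU at ~0% util during a PD stall vs 34% for colo (§3.1) is a direct instance of GPU
  util serving as the metric.

> **🚧 To do**: a wall-clock amplification sweep over 5 baselines × ≥3 runs (the replayer already emits an
> `amplification` field) to pin down the empirical closure of this section. High priority.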

### §2.1 KV$ hits are common and critical

Breakdown of KV reuse on the trace (`figs/f2a_reuse_topology.png`): **intra-session 93.2% / cross-session 5.7% /
shared-prefix 1.1%**. Theoretical APC upper bound: intra-only **79.6%** vs any-session 80.3%, a gap of <1pp.
**The cache is essentially session-local.**

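Per-turn view (C2, `figs/workload_chars/c2_work_amortization.png`): resident context grows from 11k to 56k+
tokens while new prefill collapses from 2.7k to ~200 tokens; per-turn reuse climbs to **99.6%**, and
resident/new (the "PD tax") reaches ≈ 250× by turn 12 and ≈ 450× by turn 30. **The vast majority of prefill
work can be saved by a hit**; whether the hit happens directly determines TTFT.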
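As a sanity check on these figures, the sketch below computes per-turn reuse and the resident/new ratio. It assumes reuse is defined as `resident / (resident + new)`, and it pairs the endpoint values quoted above (11k with 2.7k, 56k with 200) purely for illustration; the function names are hypothetical and the per-turn series behind C2 is not reproduced here.

```python
def per_turn_reuse(resident_tokens, new_tokens):
    """Fraction of a turn's context that is already-resident KV (assumed definition)."""
    return resident_tokens / (resident_tokens + new_tokens)

def pd_tax(resident_tokens, new_tokens):
    """Resident-to-new token ratio, the "PD tax" in the text."""
    return resident_tokens / new_tokens

early = per_turn_reuse(11_000, 2_700)
late = per_turn_reuse(56_000, 200)

print(round(early * 100, 1), round(late * 100, 1))
print(pd_tax(56_000, 200))
```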

### §2.2 GPU hits matter more than CPU hits

**The cost of each hit tier is measured, not asserted** (Qwen3-Coder-30B-A3B / H20). TTFT (s, p50) to serve a
reused prefix of length L from each KV tier:

![Four-tier hit latency: GPU < CPU-local < remote-RDMA-store ≪ miss, with the gap widening with context](v2/figs/exp_a_tier_latency.png)


| prefix L | miss(recompute) | **remote RDMA store** | CPU-local(DRAM,PCIe) | GPU(HBM) | miss/RDMA | RDMA/CPU | CPU/GPU |
|---:|---:|---:|---:|---:|---:|---:|---:|
| 8k  | 0.588 | 0.151 | 0.076 | 0.053 | 3.9× | 2.0× | 1.4× |
| 16k | 1.547 | 0.262 | 0.105 | 0.063 | 5.9× | 2.5× | 1.7× |
| 32k | 4.604 | 0.680 | 0.158 | 0.080 | 6.8× | 4.3× | 2.0× |
| **64k** | **15.23** | **0.97** | **0.27** | **0.11** | **15.8×** | **3.6×** | **2.4×** |

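- **GPU hit is ~flat** (42→111 ms over 1k→64k): a hit means the whole prefix is in HBM and only the last token
  is recomputed.
- **CPU-local hit** is transfer-bound (PCIe H2D measured at ~54 GB/s); CPU-hit ≈ GPU-hit + KV/PCIe + ~0.15s
  overhead. (Native KV offload; hits verified at 100% via `vllm:external_prefix_cache_hits`.)
- **Remote RDMA-store hit** = the Mooncake-Store mechanism (measured: two instances, B uses `do_remote_prefill`
  to pull the cached prefix from A over RDMA instead of recomputing; `mb2_kv_transfer.py` /
  `v2/.../run_rdma.sh`). This is a big win over recompute (**up to 16×**, in the same direction as the blog's
  46×), but it pays a **NIC tax** (effective ~5–7 GB/s, cf. MB2 raw ~9.7 GB/s; multi-NIC pooling can raise it),
  so it is 3.6× slower than CPU-local and ~9× slower than GPU (at 64k), and **the cost gap widens with
  context**.
- **Conclusion: the hierarchy is strict and widens with context: `GPU < CPU-local < remote-RDMA-store ≪
  miss`**. A global KV store is genuinely useful (which is why that line of work exists), but each tier closer
  to the GPU saves another 1.4–4× TTFT. **The most valuable reuse is the GPU-resident kind.**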
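The ratio columns can be recomputed directly from the TTFT columns. A minimal sketch, with the dictionary keys and the `tier_ratios` helper being illustrative names; note that the 64k row is given to two decimals in the table, so its recomputed ratios can differ from the printed ones in the last digit.

```python
# TTFT p50 in seconds per hit tier, copied from the table above.
TTFT_P50 = {
    "8k":  {"miss": 0.588, "rdma": 0.151, "cpu": 0.076, "gpu": 0.053},
    "16k": {"miss": 1.547, "rdma": 0.262, "cpu": 0.105, "gpu": 0.063},
    "32k": {"miss": 4.604, "rdma": 0.680, "cpu": 0.158, "gpu": 0.080},
    "64k": {"miss": 15.23, "rdma": 0.97,  "cpu": 0.27,  "gpu": 0.11},
}

def tier_ratios(row):
    """Cost ratio between each pair of adjacent tiers, slower over faster."""
    return {
        "miss/rdma": row["miss"] / row["rdma"],
        "rdma/cpu": row["rdma"] / row["cpu"],
        "cpu/gpu": row["cpu"] / row["gpu"],
    }

for prefix, row in TTFT_P50.items():
    print(prefix, {k: round(v, 1) for k, v in tier_ratios(row).items()})
```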

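#### Evidence #1: GPU HBM is sufficient to hold the KV that most requests reuse

**Realized APC and latency saturate at a very small GPU capacity** (closed-loop multi-turn workload,
concurrency 4, sweeping GPU KV capacity):

![Capacity knee: APC and TTFT p90 saturate at 3.6 GB (= active working set) and stay dead-flat beyond it](v2/figs/exp_b_capacity_knee.png)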

| GPU KV (GB) | realized APC | TTFT p90 |
|---:|---:|---:|
| 1.2 | 7.4% | 13.00 s |
| 2.4 | 36.3% | 4.62 s |
| **3.6** | **80.3%** | **0.53 s** |
| 9.7 | 72.9% | 0.65 s |
| 14.5| 72.9% | 0.65 s |

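**The knee sits at 3.6 GB, which is exactly the active working set (4 sessions × 0.91 GB)**: APC saturates at
its upper bound and TTFT p90 collapses from 13.0s to 0.53s; beyond that the curve is dead-flat (72.9% APC and
0.65 s at both 9.7 GB and 14.5 GB). **HBM beyond the working set buys no extra benefit; by the same logic, a
CPU/storage tier built to chase the long tail buys ≈ 0.**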
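The knee position follows from a one-line estimate: concurrent sessions times KV per session. The sketch below only restates that arithmetic; `knee_capacity_gb` is a hypothetical helper, and the assumption that every concurrent session holds the same 0.91 GB is a simplification of the closed-loop setup.

```python
def knee_capacity_gb(concurrent_sessions, kv_gb_per_session):
    """GPU KV capacity needed to keep every active session resident."""
    return concurrent_sessions * kv_gb_per_session

knee = knee_capacity_gb(4, 0.91)
print(knee)

# The estimate scales linearly with concurrency.
for sessions in (4, 8, 16):
    print(sessions, knee_capacity_gb(sessions, 0.91))
```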

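**Cluster-scale correction (key)**: the working_set analysis (`analysis/working_set/`, `figs/working_set/`)
shows that "holding the entire reuse tail of the full 2h cluster trace takes ~14 nodes". **This is not a
motivation for CPU offload**: that figure uses the capacity of one replica to hold the reuse of the **entire
cluster**, and the real cluster that produced the trace has far more than 14 nodes (the trace is a cluster-level
aggregate, see `project-trace-is-cluster-level`). Under the cluster's total HBM, the active working set is
already GPU-resident (live KV 533–1157 GB < 1528 GB for a single node). **The knee position grows linearly with
concurrency = grows with cluster GPU count**, and the cluster already provides that.

> ⚠ **Scope**: "fits" in this subsection refers to the **recent, high-value reuse produced by the active
> working set**, not to "the entire reuse tail". Deep-tail hits from cold sessions that return after a long gap
> (low value per byte and expensive to fetch) are exactly what the storage-hierarchy camp chases; this paper's
> claim is that under agentic workloads this tail is not worth building a deep hierarchy for. §5 responds to
> that camp directly.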

### §2.3 Takeaway

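The right metrics (§2.0) + hits are cheap only when they land on the GPU (§2.2) + the active working set fits
in HBM (Ev#1) ⇒ **GPU-hit-first**: the design goal is to maximize **GPU residency + hits** for the active
working set rather than to build a deep CPU/storage hierarchy. §3 gives three corollaries.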

---

## §3 Optimizing agent serving with the GPU-Hit-First Principle

### §3.1 Make PD-colocation great again

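Static PD-disaggregation works for chatbots but fails structurally for agentic workloads; colocation should be
the default.

**End-to-end evidence** (`microbench/fresh_setup/PD_DISAGG_RESULTS.md`, 8×H20, trace replay): **no static P/D
ratio beats 8-way colocation (8C)**, and the failure mode shifts with the ratio: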

| Metric | **8C** | 6P+2D | 4P+4D | 2P+6D |
|---|---:|---:|---:|---:|
| completion | **100%** | 100% | 100% | **9%** 💀 |
| wall-clock | **2994 s** | 3419 | 4171 | 5762 |
| prefix-cache hit | **19.4%** | 0% | 0% | 0% |
| TTFT p50 | **7.0 s** | 41.0 | 56.4 | 23.6 |
| E2E p90 | **83.3 s** | 91.8 | 157.1 | 499 |

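- **Decode-bound (4P+4D)**: decode pool saturated at **97.5%**, prefill pool at ~30%; half the cluster's KV is
  stranded on the wrong side. Agentic requests are large (p99 KV **11.5 GiB**), and 4D halves the system's
  decode capacity outright (24→12 concurrent, `figs/f2c_kv_footprint_cdf.png`,
  `figs/f4b_pdsep_kv_wall.png`).
- **Prefill-bound (2P+6D)**: prefill pool jammed at 99.7%, 872 requests pile up, **91% never complete**.
- **Smarter routing cannot rescue it** (§6.x): adding session-affinity on the P side makes things worse (4P+4D
  completion 100%→36%), with GPU at ~0% util and the cluster stuck on KV-transfer coordination rather than
  compute, reproducing producer hot-pinning.

**Why colo wins (the correct argument, backed by C2/C3)**: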

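- **Time-varying P:D demand**: agentic workloads have substantial work on both sides of the roofline at once:
  compute-bound prefill (~30% of time) + memory-bound decode (**~70% of time**, C3 token≠time correction,
  `figs/workload_chars/c3_prefill_decode_balance.png`). Colo's elastic pool absorbs whichever phase is hot at
  the moment; static partitioning leaves P-instance bandwidth idle and D-instance compute idle.
- **Resident KV stays local** (C2): the next turn's prefix = [prevPrompt+prevAnswer] spans both the P and D
  sides, so disagg must gather/transfer it while colo keeps it local for free.
- **Transfer is not cheap and is topology-independent** (MB2, `figs/mb2_transfer_time_compare.png`): Mooncake
  `batch_transfer_sync_write` always goes through the RDMA NIC (~9.7 GB/s), intra ≈ inter; the per-request
  transfer tax of PD-disagg cannot be bought back with topology.
- **Phase isolation is disagg's only real win, but it is overwhelmed** (MB1, `figs/mb1_interference.png`: a 32k
  prefill degrades per-stream TPOT by 52×, 131k → 183×); the D-side capacity ceiling outweighs it (see above).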
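To give the transfer tax a rough scale, the sketch below divides KV size by the NIC rate quoted in the list above. It is a lower-bound estimate under stated assumptions: 9.7 GB/s is read as decimal GB/s and treated as sustained, coordination overhead is ignored, "64k" is taken as 65,536 tokens, and the constant and function names are illustrative.

```python
KV_BYTES_PER_TOKEN = 96 * 1024   # 96 KiB/token, from §1.1
NIC_BYTES_PER_S = 9.7e9          # MB2 raw RDMA rate, assumed to be decimal GB/s

def transfer_seconds(kv_bytes, bandwidth=NIC_BYTES_PER_S):
    """Pure wire time to move a request's KV between instances."""
    return kv_bytes / bandwidth

p99_request = transfer_seconds(11.5 * 2**30)                 # p99 KV footprint (11.5 GiB)
ctx_64k = transfer_seconds(64 * 1024 * KV_BYTES_PER_TOKEN)   # a 64k-token context

print(round(p99_request, 3), round(ctx_64k, 3))
```

Because the rate is the same for intra-node and inter-node transfers in MB2, this cost applies to every P→D handoff regardless of placement.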

**Boundary (no overclaiming)**: the crossover sweep (`analysis/crossover/`, `figs/crossover_pd_advantage.png`)
gives the input length at which colo stops dominating. Colo wins at the agentic operating point, and we know
where the boundary is.

### §3.2 Biased KV-cache-awareness in global routing

GPU-hit-first at the routing layer = **treat cache-awareness as a biased, independent routing path** rather
than folding it into the load cost.

**Counterexample: load-balanced / naive cache-aware-load routing loses locality** (`figs/f4a_apc_loss.png`).
LMetric (OSDI'26) scores `P = pending_prefill + (input − cache_hit)`, `score = P × num_requests`. Cache enters
only as a **subtractive term** in the cost model, while the score is **multiplicative**, so an instance with
affinity has its cache benefit eaten by the product because its num_requests is high:

| Policy | APC | vs load_only | How cache is handled |
|---|---:|---:|---|
| load_only | 53.9% | - | none |
| LMetric | 57.2% | **+3.3pp** | subtractive cost-model term (diluted) |
| sticky | 77.7% | **+23.8pp** | hard constraint |
| unified | 78.7% | **+24.8pp** | hard + soft hybrid |

The +3.3pp from `load_only→LMetric` is nearly negligible; **the +20.5pp return (LMetric→sticky) comes from
making cache an independent routing path**.
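A small worked example shows how the multiplicative score can discard a large cache hit. This is a sketch under assumptions: it uses only the two formulas quoted above, assumes the router picks the lowest score, and the `Instance` type and all numbers are hypothetical.

```python
from dataclasses import dataclass

@dataclass
class Instance:
    name: str
    pending_prefill: int   # queued prefill tokens
    num_requests: int      # requests currently on the instance
    cache_hit: int         # tokens of this request's prefix cached here

def lmetric_score(inst, input_tokens):
    """score = P * num_requests, with P = pending_prefill + (input - cache_hit)."""
    p = inst.pending_prefill + (input_tokens - inst.cache_hit)
    return p * inst.num_requests

# Hypothetical state: the warm instance holds 8k of a 10k-token prompt but is busier.
warm = Instance("warm", pending_prefill=4000, num_requests=5, cache_hit=8000)
cold = Instance("cold", pending_prefill=1000, num_requests=2, cache_hit=0)
input_tokens = 10_000

choice = min((warm, cold), key=lambda i: lmetric_score(i, input_tokens))
print(lmetric_score(warm, input_tokens), lmetric_score(cold, input_tokens), choice.name)
```

In this example the request goes to the cold instance and recomputes 8k tokens that were already cached, even though the warm instance has less total prefill work (6,000 vs 11,000 tokens).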

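**Our method: LPWL (least-prefill-work, parameter-free)** (`project-lpwl-policy`,
`analysis/lpwl_5policy_600s.md`): route by `new_uncached ≈ input − cache_hit`. When `new_uncached≈input`
(cold/new session) requests spread by load automatically; when `new_uncached≈0` (warm session) they stick
automatically. **Zero knobs**; it beats tuned unified+A+B on the 600s trace (TTFT p90 **−31%**) and ties on the
full w600. This is exactly the shape the C1 mixture calls for: it separates the single-turn ocean from the deep
tail automatically, with no classifier.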
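One way to read "least-prefill-work" is to send each request to the instance where queued prefill plus its own uncached tokens is smallest. The sketch below implements that reading; the additive cost `pending + new_uncached`, the `lpwl_route` name, and the sample numbers are assumptions for illustration, not the policy code referenced above.

```python
def lpwl_route(pending_prefill, input_tokens, cache_hit):
    """Pick the instance with the least total prefill work for this request.

    pending_prefill: dict of instance name -> queued prefill tokens
    cache_hit:       dict of instance name -> tokens of this prompt already cached there
    """
    def work(name):
        new_uncached = input_tokens - cache_hit.get(name, 0)
        return pending_prefill[name] + new_uncached
    return min(pending_prefill, key=work)

pending = {"a": 4000, "b": 1000, "c": 2500}

cold = lpwl_route(pending, 10_000, cache_hit={})             # new session
warm = lpwl_route(pending, 10_000, cache_hit={"a": 9800})    # session cached on "a"
hot = lpwl_route({"a": 20_000, "b": 1000, "c": 2500}, 10_000, cache_hit={"a": 9800})

print(cold, warm, hot)
```

With no cache the request follows load, with a warm cache it sticks, and it only leaves the warm host once that host's queue outweighs the cached prefix, all without a tunable threshold.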

**Two rejected levers (to save the reader time)**:

- **Real-time engine state is not a routing lever** (ES ablation, `project-es-ablation-sweep`): it helps only
  for one-shot placement (sticky −26%) and hurts per-request load-chasing (load_only +27%).
- **The derived-κ decode term is net-negative**: decode-awareness is the wrong lever.

**Pure sticky's failure is measured in absolute per-worker latency** (`figs/f4c_per_worker_ttft.png`; the
max/median ratio must not be used):

| | median worker TTFT p90 | max worker | system e2e p90 |
|---|---:|---:|---:|
| sticky | **20.3 s** | 55.4 s | **34.6 s** |
| unified | **10.3 s** | 37.7 s | **18.0 s** |

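With hot sessions ≫ instances (8), sticky's hash binding makes **every worker pick up a hot session of its
own**, so the median is dragged down too; biased routing reroutes cold/new requests to non-hot workers and
preserves the speed of 7/8 workers → e2e p90 ~2× faster. **This leads to §3.3: sticky's residual hot-pin needs
migration to resolve.**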

### §3.3 GPU KV-cache dedup with migration instead of replication

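**Perspective**: multiple instances each caching the same prefix = GPU capacity wasted on **replication**;
GPU-hit-first requires **keeping a single global copy + moving the session to that copy** (migration/dedup),
which both preserves GPU hits and balances load, fixing the residual hot-pin of §3.2.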

- **Trigger**: the host's `pending_prefill > T_hot` and the session has not been migrated within `T_cool`.
- **Target**: pick the lightest instance by **observable pending prefill tokens**, **not** by cost-model
  prediction (the mooncake cost model is off by 10–21×; bypassed by construction).
- **Mechanism**: the current request is redirected to the target and the session KV migrates via the Mooncake
  `kv_connector`; the binding is updated; subsequent turns route to the new host by affinity. The migration
  cost is amortized over the session's remaining turns.
- **Thrashing prevention**: per-session cooldown `T_cool`.
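The four rules above can be sketched as a single decision function. This is a hedged illustration only: `Session`, `maybe_migrate`, and the threshold values are hypothetical, the KV transfer itself is not modeled, and the source does not say whether the real policy has extra conditions.

```python
from dataclasses import dataclass

@dataclass
class Session:
    host: str
    last_migrated_at: float = float("-inf")   # never migrated yet

def maybe_migrate(session, pending_prefill, now, t_hot, t_cool):
    """Return the new host name if the session should migrate, else None."""
    if pending_prefill[session.host] <= t_hot:
        return None                            # trigger: host is not hot
    if now - session.last_migrated_at < t_cool:
        return None                            # thrashing prevention: still cooling down
    target = min(pending_prefill, key=pending_prefill.get)   # observable load only
    if target == session.host:
        return None
    session.host = target                      # binding update; later turns follow it
    session.last_migrated_at = now
    return target

s = Session(host="a")
first = maybe_migrate(s, {"a": 50_000, "b": 3_000, "c": 9_000}, now=100.0, t_hot=20_000, t_cool=60.0)
second = maybe_migrate(s, {"a": 1_000, "b": 90_000, "c": 9_000}, now=120.0, t_hot=20_000, t_cool=60.0)
third = maybe_migrate(s, {"a": 1_000, "b": 90_000, "c": 9_000}, now=200.0, t_hot=20_000, t_cool=60.0)
print(first, second, third)
```

The second call is suppressed by the cooldown even though the new host has become hot, which is the behavior the per-session `T_cool` is meant to enforce.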

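> **Status**: **the substrate is validated as net-positive** (the `kv_both` connector gives TTFT p90 **−18.6%**
> on trace replay, **−36.6%** after the DR-fix, vs plain; the transfer overhead previously considered a blocker
> is gone, which removes the main cause of 4 historical reverts). Migration correctness smoke tests pass.
> **Policy-layer e2e (the real benefit of trigger + target inside the feedback loop) is still not directly
> validated**; this is the weakest link of the paper, and its independent risk is "wrong decisions + cooldown
> thrashing". The affinity-only pillar (§3.2 LPWL) already stands on its own, so even if migration remains
> marginal the paper still has a strong-routing main line.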

---

## §4 System Evaluation

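> **🚧 Key gap**: evidence is currently scattered across mb1/mb2/mb5/crossover/lpwl/v2; §4 needs one
> **integrated system** (colocation + biased routing + dedup-migration, under a unified name) run end to end,
> evaluated with the new metrics of §2.0 (request latency / TPS / GPU util), with §3.1/§3.2/§3.3 turned into
> ablations.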

### §4.1 Setup
- **Trace**: real Qwen3-Coder agentic trace (cluster-level, see `project-trace-is-cluster-level`); day-to-day
  iteration uses `w600_r0.0015_st30` (~850 req / ~13 min / peak QPS ~1.6 / APC headroom ~76%).
- **Hardware**: 8× H20 (96 GB); vLLM 0.18.1 (V1, chunked-prefill) + Mooncake.
- **Baselines**: ① round-robin ② LMetric (OSDI'26) ③ kvcache-aware+load linear blend ④ sticky ⑤ static
  PD-disagg (4P/4D) ⑥ **our system** (colo + LPWL + dedup-migration).
- **Metrics**: request latency (mean/p50/p90/p99), **TPS**, **GPU util (median/max worker)**, APC,
  wall-clock amplification (not TTFT/TPOT alone, §2.0).

### §4.2 End-to-end
- `figs/f6_e2e_latency_bars.png` / `f6_e2e_latency_full_grid.png` (currently 4–5 baselines; 🚧 add a static
  PD-disagg column + an our-system column).

### §4.3 Ablation (switch off one of the three GPU-hit-first corollaries at a time)
- −colocation (→ static PD-disagg, §3.1)
- −biased routing (→ load-balance / LMetric, §3.2)
- −dedup-migration (→ pure sticky, §3.3)
- 🚧 migration-only / full pending policy e2e validation.

### §4.4 Microbench support (existing, reused)
- §2.2 four-tier hits: `v2/figs/exp_a_tier_latency.png`
- §2.2 capacity knee: `v2/figs/exp_b_capacity_knee.png`
- §3.1 PD-disagg failure: `figs/f4b`, `PD_DISAGG_RESULTS`, `crossover_pd_advantage`
- transfer cost: `figs/mb2_*`; phase interference: `figs/mb1_interference.png`

### §4.5 Sensitivity
- Arrival rate λ / skew (Zipf α) / KV pool size (independent of migration, doable now); `T_hot`/`T_cool`
  (depends on migration, deferred).

---

## §5 Related Work

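- **Serving systems**: vLLM, Mooncake, SGLang, DistServe, Splitwise. This work builds on vLLM+Mooncake and
  differs from DistServe/Splitwise in **not doing static P/D partitioning** (§3.1 gives the refutation for the
  agentic regime + the crossover boundary).
- **Cache-aware routing**: LMCache, Production-Stack, LMetric (OSDI'26). This work shows that the cache signal
  must not be folded into a multiplicative load score (§3.2 LMetric dilution analysis); it has to be an
  independent, biased routing path.
- **KV storage hierarchy / offload (main opponent, answered directly)**: Mooncake Store, LMCache,
  AttentionStore, and others sink KV to CPU/SSD/remote pools. **This work does not deny their benefit over
  recompute** (§2.2 measures remote-RDMA hits up to 16× faster than recompute, in the same direction as the
  Mooncake-Store blog's 46×); it argues that under **agentic** workloads (i) the hit hierarchy is GPU >
  CPU-local > RDMA-store (§2.2) and (ii) the active working set is already GPU-resident (§2.2 Ev#1) ⇒ the
  marginal benefit of deep tiers is low, so GPU residency should come before building a deep hierarchy.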
- **Stateful migration**: Pollux, Gandiva (migration-as-rebalancing for DL training). This work carries the
  idea over to dedup of the LLM KV cache (§3.3).

---

## §6 Conclusion

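For agentic LLM workloads, the request latency / TPS / GPU util that users feel are dominated by **whether KV
hits land in GPU HBM**. The hit hierarchy is `GPU > CPU-local > RDMA-store ≫ recompute`, with the cost gap
widening with context (§2.2), while the active working set is small enough to already be GPU-resident (§2.2
Ev#1). The **GPU-hit-first** principle thus unifies three things: reviving PD-colocation (§3.1), biased
KV-awareness in routing (§3.2), and GPU dedup via migration rather than replication (§3.3).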

---

## Work Plan

### ✅ Done
- §1/§2 background and metric argument; dispatch-coupling math; inter-turn gap CDF (`f3a`)
- §2.1 reuse topology / C1–C3 (`f2a`, `workload_chars/*`)
- **§2.2 four-tier hit measurement (`v2/exp_a_tier_latency`: GPU/CPU/RDMA/miss)**
- **§2.2 Ev#1 capacity knee (`v2/exp_b_capacity_knee`) + working_set + cluster-scale correction**
- §3.1 PD-disagg refutation (`PD_DISAGG_RESULTS`, crossover, MB1/MB2)
- §3.2 LPWL (−31%), LMetric dilution, sticky hot-pin, ES ablation
- §3.3 migration substrate net-positive + correctness smoke tests

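### 🟢 Independent of migration, doable now
- §2.0 wall-clock amplification sweep (5 baselines × ≥3 runs): **highest priority**
- §4 integrated-system naming + end-to-end baseline matrix (including a static PD-disagg column)
- §4.5 λ / skew / KV pool sensitivity
- Draft the §1–§3 body text (evidence and figures are ready)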

### 🚧 Deferred (pending migration policy e2e)
- §3.3 validation of the feedback-loop benefit of the migration trigger + target
- §4.3 full / migration-only ablation
- §4.5 `T_hot` / `T_cool` sensitivity

### 🎨 Figures to draw
- §1.3 storage-hierarchy schematic (GPU HBM → CPU DRAM → RDMA store)
- §2.0 dispatch-coupling schematic (chatbot vs agentic timeline + feedback loop)
- Integrated-system architecture diagram

### ❓ Open
- Final name of the integrated system (GPU-hit-first is the principle; the system name is TBD)
- Finalize the §4 instance count / total trace length