Gahow Wang d9cf3126c6 docs: reframe PAPER_OUTLINE to GPU-hit-first + embed v2 figures

Reorganizes the outline from the EAR / dispatch-coupling framing (kept in
git history) into the GPU-hit-first structure:

- §1 background splits PD-colo / PD-disagg / KV storage hierarchy, each with
  a forward pointer to where it is used or refuted.
- §2 leads with the metric argument (request latency / TPS / GPU util, not
  TTFT/TPOT); dispatch coupling is demoted to that justification. §2.2 embeds
  the two new v2 figures -- the measured 4-tier hit hierarchy
  (GPU < CPU-local < remote-RDMA-store << miss) and the capacity->APC/latency
  knee (Evidence #1) -- plus the cluster-scale correction to the working_set
  "14 nodes" number.
- §3 recasts the three optimizations as corollaries of GPU-hit-first:
  make PD-colocation default (3.1), biased KV-awareness routing (3.2),
  dedup via migration not replication (3.3).
- §5 related work now engages the storage-hierarchy camp directly.
- Validation-status table and work plan updated (top priority: wall-clock
  amplification sweep).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

2026-05-30 13:34:19 +08:00

GPU-Hit-First: Serving Agentic LLM Workloads by Keeping the Working Set in HBM

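Thesis (one-liner): For agentic LLM workloads, the end-to-end metrics users actually perceive are request latency, TPS, and GPU utilization, and all three are dominated by one thing: whether the KV cache hit lands in GPU HBM. In agentic workloads, 93% of KV reuse is intra-session, and the active working set is small enough to stay resident in the HBM of a single node; the cost gap across the hit hierarchy (GPU ≫ CPU-local > remote-RDMA-store ≫ recompute) widens with context length. This yields one unifying principle, GPU-hit-first: keep the active working set in HBM instead of building a deep CPU/storage hierarchy to chase the long tail. Three corollaries each fix a mismatch in existing systems: (3.1) make PD-colocation the default again; (3.2) apply biased KV-cache-awareness in global routing; (3.3) deduplicate GPU KV across instances with KV migration rather than replication.

Framing note (2026-05-30): This outline supersedes the earlier "EAR: Elastic Affinity Routing" version (kept in git history). EAR's dispatch-coupling formalization is demoted here to the metric argument in §2 (explaining why request latency rather than TTFT/TPOT) and is no longer the headline. The headline is now the GPU-hit-first principle, with affinity routing and migration as two of its corollaries (§3.2 / §3.3).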

📊 Validation Status (2026-05-30)

Section	Claim	Evidence	Status
§1	Background: PD-colo / PD-disagg / KV storage hierarchy	-	Writing
§2 metric	Request latency rather than TTFT/TPOT; TPS; GPU util	dispatch coupling + amplification; `bench_report.py`	🟡 Argument complete; wall-clock sweep pending
§2.1	KV hits are common and critical	`f2a` intra 93.2%, APC upper bound 79.6%, C1/C2	✅
§2.2	GPU hit > CPU hit > RDMA-store hit ≫ miss	`v2/figs/exp_a_tier_latency.png` (four tiers, measured)	✅ NEW
§2.2 Ev#1	GPU is large enough to keep the "valuable" working set resident	`v2/figs/exp_b_capacity_knee.png` (knee) + working_set + cluster-scale correction	✅ NEW
§3.1	Make PD-colocation great again	`PD_DISAGG_RESULTS`, `crossover_pd_advantage`, MB1/MB2, C2/C3	✅
§3.2	Biased KV-aware global routing	LPWL (TTFT p90 −31%), LMetric multiplicative dilution, sticky hot-pin, ES ablation	✅
§3.3	GPU dedup via migration not replication	substrate net-positive (−18.6% / −36.6%); correctness smoke tests	🟡 Substrate works; policy e2e not yet validated
§4	Integrated-system end-to-end eval	Scattered across mb1/mb2/mb5/crossover/lpwl; needs unifying	🚧
§5	Related work (including a direct response to the storage-hierarchy camp)	-	Writing

§1 Background and System Setup

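§1.1 LLMs and the KV cache

Autoregressive Transformer inference has two phases: prefill (computes the KV for the whole prompt in one pass; compute-bound) and decode (generates token by token; memory-bandwidth-bound). Each token's KV must stay resident in GPU HBM to be reused by later attention steps. Prefix caching (APC) lets an identical prompt prefix hit already-computed KV directly, skipping the repeated prefill. This is the physical basis of every optimization in this paper.

Qwen3-Coder-30B-A3B (GQA, 48 layers, 4 KV heads, head_dim 128, bf16): KV = 96 KiB/token, 1 GiB ≈ 10,923 tokens, one block (16 tokens) = 1.573 MB.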
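The per-token figure above follows from the model shape. A minimal sketch of the arithmetic, assuming one K and one V vector per layer per KV head and 2 bytes per bf16 element (the constant and function names are illustrative, not taken from the repo):

```python
# Per-token KV footprint for a GQA model, using the model shape quoted above.
NUM_LAYERS = 48
NUM_KV_HEADS = 4
HEAD_DIM = 128
BYTES_PER_ELEM = 2   # bf16
BLOCK_TOKENS = 16    # tokens per KV block


def kv_bytes_per_token(layers=NUM_LAYERS, kv_heads=NUM_KV_HEADS,
                       head_dim=HEAD_DIM, elem_bytes=BYTES_PER_ELEM):
    # Factor 2: one K vector and one V vector per layer per KV head.
    return 2 * layers * kv_heads * head_dim * elem_bytes


per_token = kv_bytes_per_token()
tokens_per_gib = 2**30 / per_token
block_bytes = BLOCK_TOKENS * per_token

print(f"{per_token / 1024:.0f} KiB per token")
print(f"{tokens_per_gib:.0f} tokens per GiB")
print(f"{block_bytes / 1e6:.3f} MB per {BLOCK_TOKENS}-token block")
```

Note that the block size is quoted in decimal MB while the per-token size is in KiB; the same block is 1.5 MiB.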

§1.2 Agentic workflow

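An agentic workload is an LLM driving itself through tool calls over multiple turns to complete a task. The essential difference from a chatbot: each turn is triggered by the tool-call result of the previous turn (no human think-time), and the workload is prefill-dominated (input/output ≈ 75×).

But it is a mixture, not "all multi-turn" (C1, figs/workload_chars/c1_session_mixture.png):

90.3% of sessions are single-turn (mean 1.62 turns), but the multi-turn sessions (9.7%) account for 44.2% of requests and 66.9% of prefill mass.
Continuation hazard (Lindy): the probability of continuing is only 10.2% for turn 1→2, 87% for turn 5→6, and 94.3% for turn 12→13. Heaviness is almost unpredictable at cold start (corr(turn1_input, n_turns) = 0.04).

Routing implication (runs through §3.2): because heaviness is unpredictable at session start, routing must be reactive (observe accumulated load) rather than proactive (predict up front). The single-turn ocean and the deep tail want opposite policies (load-balance for the former, affinity-pin for the latter), and turn 1 cannot tell them apart. The only workable policy is therefore "everyone starts load-balanced and becomes sticky as turns accumulate", which is exactly the emergent behavior of LPWL.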

§1.3 Serving agents in the wild

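PD colocation (8C): every instance is symmetric; prefill and decode are interleaved on the same GPU by chunked prefill + continuous batching. The KV pool is elastic, with no static partition.
PD disaggregation: instances are statically split into a prefill pool (P) and a decode pool (D), physically isolating the two phases (DistServe / Splitwise).
KV cache storage hierarchy: GPU HBM → CPU DRAM → remote pooled store (RDMA/SSD, e.g. Mooncake Store / LMCache). Evicted or cross-instance KV is pushed down to slower but larger tiers, trading transfer for recomputation.

This paper responds to each of the three in its own section: PD-colo is "revived" as the default in §3.1; PD-disagg is refuted in §3.1 (for the agentic regime); the storage hierarchy is quantitatively bounded in §2.2 (GPU hits beat the lower tiers by a wide margin, and the active working set already fits).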

§2 GPU memory hit is the key to serving agents

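§2.0 The right metrics: request latency / TPS / GPU utilization (not TTFT/TPOT)

Why not per-request TTFT/TPOT: agentic turns are linked by a feedback loop, so per-turn latency compounds across turns into gaps in end-to-end session time and system throughput. Only request/session latency, tokens per second, and GPU utilization capture this.

Dispatch coupling (demoted to the argument for this section). Between turns there is an external gap T_external (for a chatbot, the human reading/thinking/typing; for an agent, tool execution). Little's Law: L = Λ · N · (W_turn(L) + T_external). When W_turn ≳ T_external, W_turn(L) is coupled to the concurrency L through KV contention, and an ε regression in the scheduler is amplified by the feedback loop into a multi-fold wall-clock difference.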
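The feedback loop in the Little's Law relation can be made concrete with a toy model. The sketch below is an illustration under stated assumptions, not the outline's formalization: it reads Λ as the session arrival rate and N as turns per session, and it assumes a linear contention model W_turn(L) = w0 + k·L. Both the model and every parameter value except the two T_external medians (1.6 s and 7.2 s, from the trace CDF) are hypothetical. Under this model the fixed point has a closed form, W_turn + T_external = (w0 + T_external) / (1 − Λ·N·k).

```python
def cycle_time(arrival_rate, turns, w0, k, t_external):
    """Steady-state (concurrency, W_turn + T_external) for W_turn(L) = w0 + k*L."""
    loop_gain = arrival_rate * turns * k
    if loop_gain >= 1.0:
        raise ValueError("no finite steady state: the feedback loop diverges")
    # Solve L = arrival_rate * turns * (w0 + k*L + t_external) for L.
    concurrency = arrival_rate * turns * (w0 + t_external) / (1.0 - loop_gain)
    w_turn = w0 + k * concurrency
    return concurrency, w_turn + t_external


def relative_slowdown(t_external, eps=0.10, arrival_rate=0.05, turns=10,
                      w0=1.0, k=1.0):
    """Relative growth of the per-turn cycle when the scheduler gets eps slower."""
    _, base = cycle_time(arrival_rate, turns, w0, k, t_external)
    _, slow = cycle_time(arrival_rate, turns, w0 * (1 + eps), k, t_external)
    return slow / base - 1.0


agentic = relative_slowdown(t_external=1.6)  # p50 agentic inter-turn gap
chatbot = relative_slowdown(t_external=7.2)  # p50 chatbot inter-turn gap
print(f"agentic: +{agentic:.1%}, chatbot: +{chatbot:.1%}")
```

In this toy model the absolute regression is multiplied by 1 / (1 − Λ·N·k), and the relative hit is larger when T_external is small, which is the qualitative point of the dispatch-coupling argument.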

Measured T_external CDF from the production trace (figs/f3a_inter_turn_gap.png):

	Agentic	Chatbot
p50	1.6 s	7.2 s
gap < 1 s	39%	4%
gap < 5 s	67%	29%
p99	738 s	43 s

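Agentic traffic has a sub-second tool-call mass that chatbot traffic lacks (39% vs 4%), so W_turn ≫ T_external holds almost by default and the loop closes. Measured: lmetric takes 49 min of wall-clock to run a 600 s trace = 8× amplification. Conclusion: small per-turn latency differences are amplified into order-of-magnitude end-to-end differences, so the system must be measured by request latency / TPS / GPU util.

Tooling for TPS / GPU util: microbench/fresh_setup/bench_report.py (all TTFT/TPOT/E2E percentiles + TPS + per-worker GPU util). GPU util of ~0% during a PD stall vs 34% for colo (§3.1) is a direct example of GPU util serving as a metric.

🚧 To do: a wall-clock amplification sweep over 5 baselines × ≥3 runs (the replayer already emits an amplification field) to nail down the empirical closure of this section. High priority.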

§2.1 KV$ hits are common and critical

Breakdown of KV reuse on the trace (figs/f2a_reuse_topology.png): intra-session 93.2% / cross-session 5.7% / shared-prefix 1.1%. Theoretical APC upper bound: 79.6% intra-only vs 80.3% any-session, a gap of <1 pp. The cache is essentially session-local.

Per-turn view (C2, figs/workload_chars/c2_work_amortization.png): resident context grows from 11k to 56k+ tokens while new prefill collapses from 2.7k to ~200 tokens; per-turn reuse climbs to 99.6%, and resident/new (the "PD tax") reaches ≈ 250× by turn 12 and ≈ 450× by turn 30. The vast majority of prefill work can be skipped by a hit; whether the hit happens directly determines TTFT.

§2.2 Hits on GPU matter more than hits on CPU

The cost of each hit tier is measured, not asserted (Qwen3-Coder-30B-A3B / H20). TTFT (s, p50) for serving a reused prefix of length L from each KV tier:

prefix L	miss(recompute)	remote RDMA store	CPU-local(DRAM,PCIe)	GPU(HBM)	miss/RDMA	RDMA/CPU	CPU/GPU
8k	0.588	0.151	0.076	0.053	3.9×	2.0×	1.4×
16k	1.547	0.262	0.105	0.063	5.9×	2.5×	1.7×
32k	4.604	0.680	0.158	0.080	6.8×	4.3×	2.0×
64k	15.23	0.97	0.27	0.11	15.8×	3.6×	2.4×

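GPU hit is ~flat (42→111 ms over 1k→64k): a hit means the whole prefix is already in HBM and only the last token is recomputed.
CPU-local hit is transfer-bound (measured PCIe H2D ~54 GB/s); CPU-hit ≈ GPU-hit + KV/PCIe + ~0.15 s overhead. (Native KV offload; hits verified at 100% via vllm:external_prefix_cache_hits.)
Remote RDMA-store hit = the Mooncake-Store mechanism (measured with two instances: B uses do_remote_prefill to pull the cached prefix from A over RDMA instead of recomputing; mb2_kv_transfer.py / v2/.../run_rdma.sh). It is a big win over recompute (up to 16×, in the same direction as the blog's 46×) but pays a NIC tax (effective ~5–7 GB/s, cf. MB2 raw ~9.7 GB/s; multi-NIC pooling could raise it), so at 64k it is 3.6× slower than CPU-local and ~9× slower than GPU, and the cost gap widens with context.
Conclusion: the hierarchy is strict and widens with context. In TTFT terms, GPU < CPU-local < remote-RDMA-store ≪ miss. A global KV store is genuinely useful (which is why that line of work exists), but each tier closer to the GPU saves another 1.4–4× TTFT. The most valuable reuse is the GPU-resident kind.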
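The ratio columns and the "~9× slower than GPU" figure can be recomputed directly from the measured TTFT values. A short sketch, using only the numbers in the table above (the dictionary layout and function name are illustrative):

```python
# TTFT p50 in seconds per reused-prefix length: (miss, remote RDMA, CPU-local, GPU).
TTFT_P50_S = {
    "8k": (0.588, 0.151, 0.076, 0.053),
    "16k": (1.547, 0.262, 0.105, 0.063),
    "32k": (4.604, 0.680, 0.158, 0.080),
    "64k": (15.23, 0.97, 0.27, 0.11),
}


def tier_ratios(miss, rdma, cpu, gpu):
    """Cost of each tier relative to the next faster one, plus RDMA vs GPU."""
    return {
        "miss/RDMA": miss / rdma,
        "RDMA/CPU": rdma / cpu,
        "CPU/GPU": cpu / gpu,
        "RDMA/GPU": rdma / gpu,
    }


ratios = {prefix: tier_ratios(*row) for prefix, row in TTFT_P50_S.items()}
for prefix, r in ratios.items():
    print(prefix, {name: round(value, 1) for name, value in r.items()})
```

Every row keeps the same ordering (GPU fastest, miss slowest), and the miss/RDMA and CPU/GPU ratios grow monotonically with prefix length.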

Evidence #1: GPU is sufficient to hold most KV requests

Realized APC and latency saturate at a very small GPU capacity (closed-loop multi-turn workload, concurrency 4, sweeping GPU KV capacity):

GPU KV (GB)	realized APC	TTFT p90
1.2	7.4%	13.00 s
2.4	36.3%	4.62 s
3.6	80.3%	0.53 s
9.7	72.9%	0.65 s
14.5	72.9%	0.65 s

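The knee appears at 3.6 GB, which is exactly the active working set (4 sessions × 0.91 GB): APC saturates to its upper bound and TTFT p90 collapses from 13.0 s to 0.53 s, then stays dead flat. HBM beyond the working set buys no additional benefit; by the same logic, a CPU/storage tier built to chase the long tail buys ≈ 0.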
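One way to locate the knee programmatically is to take the smallest capacity whose TTFT p90 is within some slack of the best value in the sweep. The sketch below applies that rule to the table above; the rule itself and the 1.5× slack are assumptions made for illustration, not the method used to produce exp_b_capacity_knee.png.

```python
# (GPU KV capacity in GB, realized APC, TTFT p90 in seconds), from the sweep above.
SWEEP = [
    (1.2, 0.074, 13.00),
    (2.4, 0.363, 4.62),
    (3.6, 0.803, 0.53),
    (9.7, 0.729, 0.65),
    (14.5, 0.729, 0.65),
]

SESSIONS = 4
KV_PER_SESSION_GB = 0.91


def find_knee(sweep, slack=1.5):
    """Smallest capacity whose TTFT p90 is within `slack` x the best observed."""
    best = min(ttft for _, _, ttft in sweep)
    for capacity, _, ttft in sorted(sweep):
        if ttft <= slack * best:
            return capacity
    raise ValueError("empty sweep")


knee_gb = find_knee(SWEEP)
working_set_gb = SESSIONS * KV_PER_SESSION_GB
print(f"knee at {knee_gb} GB, active working set {working_set_gb:.2f} GB")
```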

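Cluster-scale correction (key): the working_set analysis (analysis/working_set/, figs/working_set/) shows that "holding the entire reuse tail of the whole 2 h cluster trace needs ~14 nodes". This does not motivate CPU offload: that figure packs the whole cluster's reuse into the capacity of a single replica, while the real cluster that produced the trace has far more than 14 nodes (the trace is a cluster aggregate, see project-trace-is-cluster-level). Under the cluster's total HBM, the active working set is already GPU-resident (live KV 533–1157 GB ≪ 1528 GB for a single node). The knee position grows linearly with concurrency, i.e. with the number of GPUs in the cluster, and the cluster already provides them.

⚠ Scope: "fits" in this subsection refers to the recent, high-value reuse produced by the active working set, not "the entire reuse tail". Deep-tail hits from cold sessions returning after a long gap (both low value per byte and expensive to fetch) are exactly what the storage-hierarchy camp is after; our claim is that under agentic workloads this tail is not worth building a deep hierarchy for. §5 responds to that camp directly.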

§2.3 Takeaway

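The right metrics (§2.0) + hits are cheap only when they land on the GPU (§2.2) + the active working set fits in HBM (Ev#1) ⇒ GPU-hit-first: the design goal is to maximize GPU residency and hit rate of the active working set, not to build a deep CPU/storage hierarchy. §3 gives three corollaries.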

§3 Optimizing agent serving with the GPU-Hit-First Principle

§3.1 Make PD-colocation great again

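Static PD-disaggregation works for chatbots but fails structurally for agentic workloads; colocation should be the default.

End-to-end evidence (microbench/fresh_setup/PD_DISAGG_RESULTS.md, 8×H20, trace replay): no static P/D ratio beats 8-way colocation (8C), and the failure mode shifts with the ratio: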

Metric	8C	6P+2D	4P+4D	2P+6D
completion	100%	100%	100%	9% 💀
wall-clock (s)	2994	3419	4171	5762
prefix-cache hit	19.4%	0%	0%	0%
TTFT p50 (s)	7.0	41.0	56.4	23.6
E2E p90 (s)	83.3	91.8	157.1	499

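D-side bottleneck (4P+4D): the decode pool saturates at 97.5% while the prefill pool sits at ~30%, so half the cluster's KV is stranded on the wrong side. Agentic requests are large (p99 KV 11.5 GiB), and 4D directly halves the system's decode capacity (24→12 concurrent; figs/f2c_kv_footprint_cdf.png, figs/f4b_pdsep_kv_wall.png).
P-side bottleneck (2P+6D): the prefill pool jams at 99.7%, 872 requests pile up, and 91% never complete.
Smarter routing does not rescue it (§6.x): adding session affinity on the P side makes things worse (4P+4D completion 100%→36%), with GPU util at ~0% and the cluster stuck on KV-transfer coordination rather than compute, reproducing producer hot-pinning.

Why colo wins (the correct argument, supported by C2/C3):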

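Time-varying P:D demand: agentic workloads have substantial work on both sides of the roofline at once: compute-bound prefill (~30% of time) + memory-bound decode (~70% of time; C3 token≠time correction, figs/workload_chars/c3_prefill_decode_balance.png). Colo's elastic pool absorbs whichever phase is hot at the moment; a static partition leaves bandwidth idle on P instances and compute idle on D instances.
Resident KV stays local (C2): the next turn's prefix = [prevPrompt+prevAnswer] spans both the P and D sides, so disagg must gather/transfer it, while colo keeps it local for free.
Transfer is not cheap and is topology-independent (MB2, figs/mb2_transfer_time_compare.png): Mooncake batch_transfer_sync_write always goes through the RDMA NIC (~9.7 GB/s), so intra ≈ inter; the per-request transfer tax of PD-disagg cannot be bought back with topology.
Phase isolation is disagg's only real win (MB1, figs/mb1_interference.png: a 32k prefill degrades per-stream TPOT by 52×, 131k by 183×), but it is overwhelmed by the D-side capacity ceiling (see above).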
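To give the transfer tax a rough scale, the sketch below divides the p99 KV footprint quoted above by the MB2 raw NIC bandwidth. It assumes the whole footprint crosses the NIC once at the raw rate, so it is an optimistic lower bound that ignores coordination and the lower effective bandwidth measured in §2.2; the names are illustrative.

```python
GIB = 2**30

RDMA_RAW_BYTES_PER_S = 9.7e9   # MB2 raw NIC bandwidth quoted above
P99_KV_BYTES = 11.5 * GIB      # p99 per-request KV footprint quoted above


def transfer_seconds(kv_bytes, bandwidth_bytes_per_s):
    """Time to move kv_bytes once over a link of the given bandwidth."""
    return kv_bytes / bandwidth_bytes_per_s


p99_tax_s = transfer_seconds(P99_KV_BYTES, RDMA_RAW_BYTES_PER_S)
print(f"p99 request: at least {p99_tax_s:.2f} s of pure transfer")
```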

Boundary (no overclaiming): the crossover sweep (analysis/crossover/, figs/crossover_pd_advantage.png) gives the input length at which colo stops dominating. Colo wins at the agentic operating point, and we know where the boundary is.

§3.2 Biased KV-cache-awareness in global routing

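At the routing layer, GPU-hit-first means treating cache-awareness as a separate, biased routing path rather than folding it into the load cost.

Counterexample: load-balanced and naive cache-aware-load routing lose locality (figs/f4a_apc_loss.png). LMetric (OSDI'26) scores P = pending_prefill + (input − cache_hit), score = P × num_requests. Cache appears only as a subtracted term in the cost model, and because the score is multiplicative, an instance with affinity has a high num_requests that multiplies away the cache benefit: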

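Policy	APC	vs load_only	Cache handling
load_only	53.9%	-	none
LMetric	57.2%	+3.3pp	subtracted term in cost model (diluted)
sticky	77.7%	+23.8pp	hard constraint
unified	78.7%	+24.8pp	hard + soft hybrid

The +3.3 pp from load_only→LMetric is nearly negligible; the +20.5 pp payoff (sticky vs LMetric) comes from making cache a separate routing path.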

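Our method: LPWL (least-prefill-work, parameter-free) (project-lpwl-policy, analysis/lpwl_5policy_600s.md): route by new_uncached ≈ input − cache_hit. When new_uncached ≈ input (cold/new session) requests spread by load automatically; when new_uncached ≈ 0 (warm session) they stick automatically. With zero knobs it beats tuned unified+A+B on the 600 s trace (TTFT p90 −31%) and ties on the full w600. This is exactly the shape the C1 mixture calls for: it separates the single-turn ocean from the deep tail without a classifier.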
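The contrast between the multiplicative LMetric score and a least-prefill-work rule can be sketched on a two-instance toy fleet. The LMetric formula is the one quoted above; the LPWL score (queued prefill tokens plus this request's uncached tokens, lower is better) is an assumed reading of "least-prefill-work", and the `Instance` fields and all numbers are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Instance:
    name: str
    pending_prefill: int   # prefill tokens already queued on this instance
    num_requests: int      # requests currently in flight
    cached_prefix: dict    # session id -> cached prefix tokens (toy bookkeeping)


def new_uncached(inst, session, input_tokens):
    hit = min(inst.cached_prefix.get(session, 0), input_tokens)
    return input_tokens - hit


def lmetric_score(inst, session, input_tokens):
    # P = pending_prefill + (input - cache_hit); score = P * num_requests.
    p = inst.pending_prefill + new_uncached(inst, session, input_tokens)
    return p * inst.num_requests


def lpwl_score(inst, session, input_tokens):
    # Assumed form: total prefill work this instance would face with the request.
    return inst.pending_prefill + new_uncached(inst, session, input_tokens)


def route(instances, session, input_tokens, score):
    return min(instances, key=lambda i: score(i, session, input_tokens)).name


fleet = [
    Instance("warm", pending_prefill=2_000, num_requests=10,
             cached_prefix={"s1": 30_000}),
    Instance("idle", pending_prefill=1_000, num_requests=1, cached_prefix={}),
]

# Warm session s1 (30k of 32k tokens cached on "warm"):
print(route(fleet, "s1", 32_000, lmetric_score))  # multiplicative score
print(route(fleet, "s1", 32_000, lpwl_score))     # additive prefill work
# Cold session s2 (nothing cached anywhere):
print(route(fleet, "s2", 8_000, lpwl_score))
```

In this toy case the multiplicative score sends the warm session to the idle instance and forfeits 30k cached tokens, while the additive rule keeps it on the warm instance and still sends the cold session to the lighter one, with no tunable weight.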

Two rejected levers (to save the reader time):

Real-time engine state is not a routing lever (ES ablation, project-es-ablation-sweep): it helps only one-shot placement (sticky −26%) and hurts per-request load-chasing (load_only +27%).
The derived-κ decode term is net-negative: decode-awareness is the wrong lever.

The failure of pure sticky is measured by absolute per-worker latency (figs/f4c_per_worker_ttft.png; the max/median ratio must not be used):

	median worker TTFT p90	max worker	system e2e p90
sticky	20.3 s	55.4 s	34.6 s
unified	10.3 s	37.7 s	18.0 s

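The number of hot sessions ≫ the number of instances (8), so sticky's hash binding lands a hot session on every worker and slows even the median; biased routing reroutes cold/new requests to non-hot workers, preserving the speed of 7/8 workers → e2e p90 ~2× faster. This leads to §3.3: sticky's residual hot-pin needs migration to resolve.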

§3.3 GPU KV-cache dedup with migration instead of replication

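Perspective: multiple instances each caching the same prefix means GPU capacity wasted on replication. GPU-hit-first requires keeping a single global copy and moving the session to that copy (migration/dedup), which preserves GPU hits, balances load, and fixes the residual hot-pin from §3.2.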

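Trigger: the host's pending_prefill > T_hot and the session has not been migrated within T_cool.
Target: pick the lightest instance by observable pending prefill tokens, without cost-model prediction (the mooncake cost model errs by 10–21×; bypassed by construction).
Mechanism: the current request is redirected to the target and the session KV is migrated via the Mooncake kv_connector; the binding is updated; subsequent turns are routed to the new host by affinity. The migration cost is amortized over the session's remaining turns.
Thrashing prevention: per-session cooldown T_cool.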
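The trigger, target, and cooldown rules above can be sketched as a small decision function. This is a minimal illustration, not the implemented policy: the class name, the threshold values, and the bookkeeping dict are hypothetical, and the actual KV transfer through the connector is out of scope here.

```python
from dataclasses import dataclass, field


@dataclass
class MigrationPolicy:
    t_hot: int = 50_000    # pending-prefill tokens; placeholder threshold
    t_cool: float = 30.0   # seconds; placeholder cooldown
    last_migrated: dict = field(default_factory=dict)  # session -> timestamp

    def should_migrate(self, session, host_pending, now):
        # Trigger: host is hot AND the session was not migrated within t_cool.
        last = self.last_migrated.get(session, float("-inf"))
        return host_pending > self.t_hot and (now - last) >= self.t_cool

    def pick_target(self, pending_by_instance, host):
        # Target: lightest instance by observed pending prefill tokens.
        others = {k: v for k, v in pending_by_instance.items() if k != host}
        return min(others, key=others.get) if others else host

    def decide(self, session, host, pending_by_instance, now):
        """Return the instance that should serve the session's next request."""
        if not self.should_migrate(session, pending_by_instance[host], now):
            return host
        target = self.pick_target(pending_by_instance, host)
        if target != host:
            self.last_migrated[session] = now  # starts the per-session cooldown
        return target


policy = MigrationPolicy()
first = policy.decide("s", "a", {"a": 80_000, "b": 5_000, "c": 20_000}, now=0.0)
# "b" later becomes hot, but the session is still cooling down at t=10 s.
held = policy.decide("s", first, {"a": 1_000, "b": 90_000, "c": 20_000}, now=10.0)
moved = policy.decide("s", held, {"a": 1_000, "b": 90_000, "c": 20_000}, now=40.0)
print(first, held, moved)
```

The caller is expected to update the session binding to the returned instance, so later turns follow affinity to the new host.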

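Status: the substrate is validated as net-positive (the kv_both connector gives TTFT p90 −18.6% on trace replay, −36.6% vs plain after the DR-fix; the transfer overhead previously believed to be a blocker is gone, removing the main cause of the 4 historical reverts). Migration correctness smoke tests pass. The policy layer end to end (the real benefit of trigger + target inside the feedback loop) has still not been validated directly. This is the weakest link in the paper; the independent risk is "wrong decisions + cooldown thrashing". The affinity-only pillar (§3.2 LPWL) already stands on its own, so even if migration stays marginal, the paper still has a strong-routing main line.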

§4 System Evaluation

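🚧 Key gap: the evidence is currently scattered across mb1/mb2/mb5/crossover/lpwl/v2. §4 needs one integrated system (colocation + biased routing + dedup-migration, under a unified name) run end to end and evaluated with the new metrics of §2.0 (request latency / TPS / GPU util), with §3.1/§3.2/§3.3 turned into ablations.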

§4.1 Setup

Trace: real Qwen3-Coder agentic trace (cluster-level, see project-trace-is-cluster-level); day-to-day iteration uses w600_r0.0015_st30 (~850 req / ~13 min / peak QPS ~1.6 / APC headroom ~76%).
Hardware: 8× H20 (96 GB); vLLM 0.18.1 (V1, chunked-prefill) + Mooncake.
Baselines: ① round-robin ② LMetric (OSDI'26) ③ kvcache-aware+load linear mix ④ sticky ⑤ static PD-disagg (4P/4D) ⑥ this paper's system (colo + LPWL + dedup-migration).
Metrics: request latency (mean/p50/p90/p99), TPS, GPU util (median/max worker), APC, wall-clock amplification (not TTFT/TPOT alone, §2.0).

§4.2 End-to-end

figs/f6_e2e_latency_bars.png / f6_e2e_latency_full_grid.png (currently 4–5 baselines; 🚧 add the static PD-disagg column + this paper's system column).

§4.3 Ablation (switch off one of the three GPU-hit-first corollaries at a time)

−colocation (→ static PD-disagg, §3.1)
−biased routing (→ load-balance / LMetric, §3.2)
−dedup-migration (→ pure sticky, §3.3)
🚧 migration-only / full pending policy e2e validation.

§4.4 Microbench support (existing, reused)

§2.2 four-tier hits: v2/figs/exp_a_tier_latency.png
§2.2 capacity knee: v2/figs/exp_b_capacity_knee.png
§3.1 PD-disagg failure: figs/f4b, PD_DISAGG_RESULTS, crossover_pd_advantage
Transfer cost: figs/mb2_*; phase interference: figs/mb1_interference.png

§4.5 Sensitivity

Arrival rate λ / skew (Zipf α) / KV pool size (independent of migration, can be done now); T_hot/T_cool (depends on migration, deferred).

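§5 Related Work

Serving systems: vLLM, Mooncake, SGLang, DistServe, Splitwise. This work builds on vLLM+Mooncake and differs from DistServe/Splitwise in not using a static P/D partition (§3.1 gives the refutation for the agentic regime + the crossover boundary).
Cache-aware routing: LMCache, Production-Stack, LMetric (OSDI'26). We show that the cache signal cannot be folded into a multiplicative load score (§3.2 LMetric dilution analysis) and must be a separate biased routing path.
KV storage hierarchy / offload (main opponent, answered directly): Mooncake Store, LMCache, AttentionStore and others push KV down to CPU/SSD/remote pools. We do not deny their benefit over recompute (§2.2 measures remote-RDMA hits as up to 16× faster than recompute, in the same direction as the Mooncake-Store blog's 46×), but argue that under agentic workloads (i) the hit hierarchy is GPU ≫ CPU ≫ RDMA-store (§2.2) and (ii) the active working set is already GPU-resident (§2.2 Ev#1) ⇒ the marginal benefit of deep tiers is low, and GPU residency should take priority over building a deep hierarchy.
Stateful migration: Pollux, Gandiva (migration-as-rebalancing for DL training). We carry the idea over to dedup of the LLM KV cache (§3.3).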

§6 Conclusion

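For agentic LLM workloads, the user-perceived request latency / TPS / GPU util are dominated by whether KV hits land in GPU HBM. The hit hierarchy is GPU ≫ CPU > RDMA-store ≫ recompute, with a cost gap that widens with context (§2.2), and the active working set is small enough to be GPU-resident already (§2.2 Ev#1). The GPU-hit-first principle thereby unifies three things: reviving PD-colocation (§3.1), biased KV-awareness in routing (§3.2), and GPU dedup via migration rather than replication (§3.3).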

Work Plan

✅ Done

§1/§2 background and metric argument; dispatch-coupling math; inter-turn gap CDF (f3a)
§2.1 reuse topology / C1–C3 (f2a, workload_chars/*)
§2.2 four-tier hit measurements (v2/exp_a_tier_latency: GPU/CPU/RDMA/miss)
§2.2 Ev#1 capacity knee (v2/exp_b_capacity_knee) + working_set + cluster-scale correction
§3.1 PD-disagg refutation (PD_DISAGG_RESULTS, crossover, MB1/MB2)
§3.2 LPWL (−31%), LMetric dilution, sticky hot-pin, ES ablation
§3.3 migration substrate net-positive + correctness smoke tests

🟢 Independent of migration, can be done now

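§2.0 wall-clock amplification sweep (5 baselines × ≥3 runs), highest priority
§4 integrated-system naming + end-to-end baseline matrix (including the static PD-disagg column)
§4.5 λ / skew / KV pool sensitivity
Draft the body text of §1–§3 (evidence and figures are ready)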

🚧 Deferred (pending migration policy e2e)

§3.3 validation of the feedback-loop benefit of migration trigger + target
§4.3 full / migration-only ablation
§4.5 T_hot / T_cool sensitivity

🎨 To draw

§1.3 storage-hierarchy diagram (GPU HBM → CPU DRAM → RDMA store)
§2.0 dispatch-coupling schematic (chatbot vs agentic timeline + feedback loop)
Integrated-system architecture diagram

❓ Open

Final name of the integrated system (GPU-hit-first is the principle; the system name is TBD)
§4 finalize instance count / total trace length
