agentic-kvc/PAPER_OUTLINE.md
Gahow Wang 555cabcf1f f2c: switch to per-instance decode-concurrency view; correct KV pool ceiling
Old f2c plotted per-request KV footprint MiB against an "H20 ~95 GiB
usable" reference line. That ceiling was wrong — a 30B-A3B bf16
deployment burns roughly:
  ~50% HBM for model params (~48 GiB on 96 GiB H20)
  ~10% for runtime activation buffers
  ~40% left for the KV cache pool (~38.4 GiB)
so 95 GiB was overstating the available pool by 2.5×.

New f2c reframes the same data into the answer that actually motivates
the paper: how many concurrent decodes does a single instance hold,
and how does PD-disagg change that? Grouped bars per percentile show
system-wide concurrent decode capacity for three 8-GPU deployments:
  Combined 8C, PD-disagg 4P+4D (N_D=4), PD-disagg 6P+2D (N_D=2)

Key reads off the figure:
  p50 (1.8 GiB/req): 20 fit/inst → 160 / 80 / 40 system-wide
  p90 (8.0 GiB/req):  4 fit/inst →  32 / 16 /  8
  p95 (9.6 GiB/req):  4 fit/inst →  32 / 16 /  8
  p99 (11.5 GiB/req): 3 fit/inst →  24 / 12 /  6

PD-disagg 4P+4D literally halves the decode population at the same
per-request KV pressure — this is the concrete §3.2 "KV memory wall"
penalty stated in terms users care about (concurrency).

- analysis/characterization/render_window1_figures.py:
  fig_kv_footprint_cdf rewritten; reads same kv_footprint_summary.json
  but computes floor(KV_pool / req_size) × N_D and annotates the
  per-instance fit count below each percentile group.
- figs/f2c_kv_footprint_cdf.png: regenerated.
- MEETING.md / PAPER_OUTLINE.md §2.1, §2.4: prose updated with the
  new ceiling and the "3 p99 decodes per instance / halved by PD-disagg"
  framing.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-27 11:28:47 +08:00


# EAR: Elastic Affinity Routing for Agentic LLM Serving
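> **One-liner**: In agentic LLM workloads, 93% of KV reuse is intra-session, and tool-call feedback coupling between turns amplifies per-request latency differences into throughput gaps, so locality becomes the dominant scheduling lever. Existing load-balanced routing loses locality, static PD-disagg hits the decode-side KV wall, and pure session-sticky routing creates hot pins. We propose **EAR (Elastic Affinity Router)**, a scheduler that combines session-affinity routing with session migration triggered by hot instances, so that a single design achieves both locality and balance.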
---
## 📊 Validation Status (2026-05-27)
| Section | Existing data | To do |
|---|---|---|
| §2 Workload characterization | ✅ Complete (3 figures reused) | - |
| §3.1 Load-balance loses locality | ✅ Complete (`f4a`) | - |
| §3.2 Static PD-disagg hits the KV wall | ✅ Complete (`f4b`) | - |
| §3.3 Sticky creates hot pins | ✅ Complete (`f4c`, `f4d`) | - |
| §4.1-2 Affinity routing | ✅ Implemented (current `unified` algorithm) | - |
| `kv_both` substrate cost | ✅ **VERIFIED net-positive** (2026-05-27, commit `ef9e010`) | TTFT p90 18.6% lower w/o DR-fix, 36.6% lower w/ DR-fix |
| §4.3 Migration mechanism (e2e) | 🚧 **PARTIAL** | Substrate works; e2e trigger + target selection experiments not yet run |
| §5.2 End-to-end | ⚠️ Data for 5/6 baselines (`f6`) | Missing static PD-disagg; EAR column pending migration |
| §5.3 Ablation | 🚧 **PARTIAL DEFER** | Only affinity-only is feasible now; full ablation pending migration |
| §5.4 Dispatch coupling validation | 🚧 **NEW DATA NEEDED** | Re-run wall-clock for 5 baselines (after the Phase 1 patch) |
| §5.5 Sensitivity | 🚧 **PARTIAL DEFER** | λ / skew / KV pool feasible now; `T_hot` / `T_cool` pending migration |
| §5.6 Migration microbench | 🚧 **FULL DEFER** | Fully depends on migration validation |

**Background**: the team's 4 previous migration attempts were all reverted because of transfer overhead (see `analysis/unified_routing_fix_review.md`). The 2026-05-27 trace-replay A/B/C (`microbench/connector_tax/cache_sweep/REPORT_TRACE_REPLAY.md`) shows that the picture for the `kv_both` substrate has reversed: not only is the +45% penalty obsolete, the substrate itself is net positive (TTFT p90 18.6% lower than plain, 36.6% lower after the DR-fix). **The largest root cause of the 4 earlier migration reverts is gone, but the e2e migration policy layer (the real benefit of trigger + target selection inside the feedback loop) has still not been validated directly. EAR's migration experiments no longer carry substrate risk; only policy-layer risk remains.**
---
## §1 Introduction
Agentic LLM workloads, in which an LLM drives itself through tool calls over multiple turns to complete a task, have become the dominant load on inference systems, yet scheduling policies designed for chatbots generally fail under them. This paper first characterizes the essential differences between agentic and chatbot workloads, then explains why three mainstream classes of schedulers fall short, and finally presents the EAR design.

**Contributions**:
- **C1 Dispatch coupling argument**: we formalize a feedback loop unique to agentic workloads. Per-turn service time affects the number of concurrent sessions through an implicit Little's Law equation, which amplifies per-request latency differences into throughput gaps. Measured: the load-balance baseline shows **8x** wall-clock amplification on a 600s trace; EAR shows **TBDx**.
- **C2 EAR design**: a scheduler with two pillars. Affinity-default routing captures intra-session locality; hot-instance-triggered session migration moves a session's entire KV to a lighter instance when a hotspot appears, avoiding hot pins.
- **C3 Evaluation**: on a real Qwen3-Coder agentic trace, EAR dominates 5 baselines simultaneously on five dimensions: TTFT, TPOT, APC, worst-worker p90, and wall-clock.

**Figure 1: Teaser, wall-clock vs trace-time across schedulers** `figs/f1_teaser.png` **🚧 TBD (NEW DATA NEEDED)**
> Needs Phase 3 measurements: 5 baselines × 3 runs of trace replay, extract `amplification = wall_clock_s / trace_span_s` from each summary (Phase 1 patch already exposes the field). Plot as bar chart with y=1 reference line. The EAR row stays TBD pending migration validation.
---
## §2 Background and Workload Characterization
### §2.1 Agentic Workload Primer
Agentic workloads differ from chatbot workloads in three essential ways:
- **Multi-turn, programmatic continuation**: each turn is triggered by the previous turn's tool-call result, with no human think-time
- **Prefill-dominated**: the input/output token ratio is **75x** and 98% of compute is in the prefill phase (chatbots: 1-10x)
- **Skewed sessions** (from the Qwen3 production trace, n=1.3M sessions / 2.1M req / 7200s): the top 1% of sessions contribute **46.5%** of input tokens, top 5% **66.5%**, top 10% **74.6%**, top 25% **87.5%**, top 50% **96.0%**; half of the sessions account for almost all input mass

Average session length: TBD turns, TBD input tokens. Per-request KV footprint (Qwen3-Coder-30B-A3B, 98304 B/token): p50 **1.8 GiB**, p90 **8.0 GiB**, p95 **9.6 GiB**, p99 **11.5 GiB**. The per-instance KV pool is ≈ 0.4 × 96 GiB = **38.4 GiB** (the rest is 50% bf16 model params + 10% runtime activations), so at p99 request size one instance holds only **3 concurrent decodes**; switching to PD-disagg 4P+4D halves system decode capacity (system-wide concurrency 24 → 12).
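
The capacity arithmetic above can be reproduced with a short sketch. The pool size, the percentile footprints, and the three 8-GPU deployments are taken from this outline; the function and variable names are illustrative and are not the code in `render_window1_figures.py`. The p50 row is left out because the rounded 1.8 GiB value yields 21 rather than the 20 fit/inst reported for f2c, presumably a rounding effect in the underlying data.

```python
import math

KV_POOL_GIB = 38.4  # per-instance KV pool quoted above (0.4 x 96 GiB)

# Per-request KV footprint percentiles quoted above (GiB).
FOOTPRINT_GIB = {"p90": 8.0, "p95": 9.6, "p99": 11.5}

# Number of decode-capable instances (N_D) in each 8-GPU deployment.
DEPLOYMENTS = {"combined_8C": 8, "pd_4P4D": 4, "pd_6P2D": 2}


def fit_per_instance(req_gib: float, pool_gib: float = KV_POOL_GIB) -> int:
    """How many requests of this size fit in one instance's KV pool."""
    return math.floor(pool_gib / req_gib)


def system_capacity(req_gib: float) -> dict:
    """System-wide concurrent decodes: per-instance fit count times N_D."""
    fit = fit_per_instance(req_gib)
    return {name: fit * n_d for name, n_d in DEPLOYMENTS.items()}


for pct, gib in FOOTPRINT_GIB.items():
    print(pct, fit_per_instance(gib), system_capacity(gib))
```

Taking the floor per instance before multiplying by N_D matters: KV pools are not shared across instances, so fractional leftovers on each instance cannot be combined into an extra request.
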
### §2.2 KV Cache Reuse Topology
Breakdown of KV reuse on the trace:
| Class | Share |
|---|---|
| Intra-session | **93.2%** |
| Cross-session | 5.7% |
| Shared prefix | 1.1% |
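
Theoretical APC upper bound: any-session **80.3%**, intra-session-only **79.6%**, a gap of <1pp. **The cache is essentially session-local**: any scheduler that does not preserve session affinity throws away the vast majority of reuse opportunities.

**Figure 2: Workload characterization (3 panels)** (existing data can be reused)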
![F2a Reuse topology — intra 93.2% / cross 5.7% / shared 1.1%](figs/f2a_reuse_topology.png)
![F2b Session input-token mass CDF — production trace top 1%/5%/10%/25%/50% = 46.5%/66.5%/74.6%/87.5%/96.0% (replay window overlaid for sanity)](figs/f2b_session_skew.png)
![F2c Per-instance decode concurrency vs deployment (KV pool 38.4 GiB; p99 req fits only 3/inst; PD-disagg halves system decode capacity)](figs/f2c_kv_footprint_cdf.png)
> 📝 Layout TBD: combine the three into a 1×3 panel, or place one each in §2.1/§2.2/§2.4.
### §2.3 Dispatch Coupling — Why Locality Dominates
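This is the most intuition-dependent argument in the paper, so it gets its own section.

**Intuition**: in a chatbot, a human reads and types after each turn, so an **external clock** controls when the next turn arrives. An agentic LLM issues the next request as soon as it receives the tool-call result, so **the system's own speed determines when the next turn arrives**. A slow policy therefore not only slows individual requests; it also keeps sessions in the system longer → more concurrent sessions → fiercer KV contention → slower turns, which closes a feedback loop.

**Concrete example**: a coding agent with a 20-turn task:
- Fast policy: 2s per turn, 40s per session, 10 concurrent sessions on average
- Slow policy (linear estimate): 3s per turn, 60s per session, so concurrency should be 15
- Slow policy (actual): 15 concurrent sessions push each turn to 4s, sessions to 80s, and concurrency to 20; turns are then pushed to 5s, and so on, until the system hits the wall or settles at a much worse new equilibrium

For contrast, in a chatbot a human reads for 30s after each turn. Going from 2s to 3s per turn moves the per-turn cycle from 32s to 33s, a 3% difference with almost no feedback.

**Formalization**: let `Λ` = session arrival rate (given by the trace), `N` = turns per session, and `W_turn(L)` = per-turn service time, an increasing function of the current number of concurrent sessions `L` (more concurrency means fiercer KV contention and a larger `W_turn`).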
Chatbot Little's Law:
```
L = Λ · N · (W_turn(L) + T_human)
```
This is dominated by the large constant `T_human`; perturbations of `W_turn(L)` barely move `L`.

Agentic Little's Law (`T_human ≈ 0`):
```
L = Λ · N · W_turn(L)
```
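This is an implicit equation in `L`. Suppose a policy change scales `W_turn` uniformly by `(1+ε)`. Differentiating `L = Λ · N · (1+ε) · W_turn(L)` with respect to `ε` at `ε = 0`, where `L*` is the unperturbed solution, gives: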
```
dL/dε|_{ε=0} = L* / (1 - Λ · N · W'_turn(L*))
```
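**As the denominator `1 - Λ · N · W'_turn(L*)` approaches 0** (the system nears KV saturation), the amplification factor diverges. This is why lmetric shows 8x wall-clock amplification on the 600s trace.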
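
A small numerical sketch can make the amplification concrete. It assumes a linear contention model `W_turn(L) = w0 + k·L`, which is an illustrative choice and not something measured in this outline. The constants are picked so that the baseline matches the 20-turn example (2s per turn, 10 concurrent sessions); `LAM`, `W0`, and `K` are hypothetical values.

```python
def equilibrium(lam, n, w0, k, eps=0.0, iters=5000):
    """Solve L = lam * n * (1 + eps) * (w0 + k * L) by fixed-point iteration.

    Converges only while lam * n * (1 + eps) * k < 1, i.e. below saturation.
    """
    L = 0.0
    for _ in range(iters):
        L = lam * n * (1.0 + eps) * (w0 + k * L)
    return L


def predicted_sensitivity(lam, n, w0, k):
    """dL/d(eps) at eps = 0 from L* / (1 - lam * n * W'(L*)); W'(L) = k here."""
    return equilibrium(lam, n, w0, k) / (1.0 - lam * n * k)


LAM, N, W0, K = 0.25, 20, 1.0, 0.1  # sessions/s, turns, seconds, s per session

base = equilibrium(LAM, N, W0, K)           # baseline concurrency
slow = equilibrium(LAM, N, W0, K, eps=0.5)  # policy that is 1.5x slower per turn
h = 1e-6
numeric = (equilibrium(LAM, N, W0, K, eps=h) - base) / h  # finite difference

print(base, slow, numeric, predicted_sensitivity(LAM, N, W0, K))
```

Under these assumed constants the 1.5x slower policy settles at three times the baseline concurrency, twice what the linear estimate (1.5x) suggests, and the finite-difference slope agrees with the closed-form expression.
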
**Figure 3: Dispatch coupling schematic** `figs/f3_coupling_schematic.png` **🚧 TBD (CUSTOM DRAW)**
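> A new schematic is needed: the top half shows the chatbot timeline (`system → T_human → system → T_human → ...`), the bottom half shows the agentic timeline (`system → ε → system → ε → ...`), and a feedback-loop arrow `W_turn → Λ → L → W_turn` is overlaid on the right. Suitable for TikZ / draw.io / matplotlib annotate.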
### §2.4 Takeaway
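Three properties, namely intra-session locality dominance (§2.2), long context with prefill-heavy requests (§2.1), and dispatch coupling (§2.3), together imply that scheduling for agentic workloads must be **locality-first** and must tolerate the inter-instance load imbalance caused by skew.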
---
## §3 Why Existing Schedulers Don't Fit
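Each of the three classes of existing schedulers runs into one of the three properties from §2.
### §3.1 Load-balanced routing loses locality
Round-robin and load-aware routing (LMetric, OSDI'26) maximize instance utilization but ignore session affinity. **Measured APC drops to 56.9%** (vs the 79.6% upper bound); the 23pp gap comes directly from lost intra-session cache hits (violates §2.2).
### §3.2 Static PD-disaggregation hits the decode-side KV wall
Statically splitting instances into a P pool and a D pool works for chatbots but fails for agentic workloads: agentic requests average 33.6k tokens and need **3.3GB** of KV. Under the 4D configuration a p90 request occupies **69%** of the D-side KV pool and a p99 request **overflows it at 138%**. Result: **TTFT p50 blows up by 62-72x** and the success rate falls from 99.5% to **52-68%** (violates §2.1: prefill-dominant + long context).
### §3.3 The real failure of pure session-sticky: every worker is dragged down by hot sessions
Binding each session to an instance restores locality (APC **77.2%**, 97% of the upper bound), but **absolute worker latency** degrades on every worker. This is the real failure mode of pure sticky: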
| | median worker TTFT p90 | max worker | system e2e p90 |
|---|---:|---:|---:|
| `sticky` | **20.3 s** | 55.4 s | **34.6 s** |
| `unified` (affinity + LMetric fallback) | **10.3 s** | 37.7 s | **18.0 s** |
| `lmetric` | 14.0 s | 31.3 s | 24.8 s |
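
Mechanism: in the production trace the top 1% of sessions hold 46.5% of input mass (top 5%: 66.5%), and the number of hot sessions far exceeds the 8 instances. Sticky's hash binding makes **every worker carry its own share of hot sessions**, so even the median worker slows to the 20s range. Unified's LMetric fallback re-routes cold/new sessions to non-hot workers and preserves the speed of 7/8 workers. Because system p90 is determined by the majority of requests, unified's e2e p90 is ~2x faster than sticky's.

**§3.3 sub-finding**: hot-pin failure must be measured with **per-worker absolute latency** (median + max), **not a normalized ratio**. `max/median` penalizes "affinity + escape" schemes such as unified in the wrong direction: sticky's ratio is 2.73 and unified's is 3.67, but sticky's median is also higher (20.3s vs 10.3s for unified), so the lower ratio is actually the worse outcome. All worker-balance comparisons in this paper use the (median, max) pair rather than a single ratio.

(Violates the skew-tolerance requirement of §2.4.)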
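
The pitfall of the normalized ratio can be checked directly against the §3.3 table. The sketch below uses only the (median, max) values from that table; the helper names `ratio` and `dominates` are illustrative. Ratios computed from these rounded values can differ in the last digit from ratios derived from unrounded measurements.

```python
# (median worker TTFT p90, max worker TTFT p90) in seconds, from the table above.
SUMMARY = {
    "sticky": (20.3, 55.4),
    "unified": (10.3, 37.7),
    "lmetric": (14.0, 31.3),
}


def ratio(policy: str) -> float:
    """Normalized imbalance: max worker p90 divided by median worker p90."""
    med, worst = SUMMARY[policy]
    return worst / med


def dominates(a: str, b: str) -> bool:
    """True if a is no worse than b on both median and max, and better on one."""
    (a_med, a_max), (b_med, b_max) = SUMMARY[a], SUMMARY[b]
    return a_med <= b_med and a_max <= b_max and (a_med < b_med or a_max < b_max)


by_ratio = sorted(SUMMARY, key=ratio)  # "best" first if one trusts the ratio
print(by_ratio)
print(dominates("unified", "sticky"))
```

Ranking by ratio places sticky ahead of unified even though unified is better on both absolute numbers, which is the reason for reporting the pair.
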
**Figure 4: Three baselines, three failure modes** (split into three subfigures placed in §3.1, §3.2, and §3.3)

§3.1: measured APC vs the theoretical upper bound of 79.6% (lmetric 56.9%, load_only 54.1%, sticky 77.2%, unified 79.4%)
![F4a APC loss](figs/f4a_apc_loss.png)
§3.2: D-side KV pool occupancy vs per-request KV footprint; both 4P+4D and 6P+2D cross the 90% memory wall in the agentic regime
![F4b PD-sep KV memory wall](figs/f4b_pdsep_kv_wall.png)
§3.3: per-worker TTFT p90 across 8 instances × 5 policies. Under sticky every worker is slowed (median 20.3s); unified concentrates the damage on e4 while the other workers stay at a median of 10.3s
![F4c Per-worker TTFT p90 distribution](figs/f4c_per_worker_ttft.png)
> 📝 Optional supporting figure: prefill-decode interference (an 8k prefill on the same GPU degrades TPOT by 66x); place in §3.3 to support the interference argument against sticky:
![F4d PD interference](figs/f4d_pd_interference.png)
### §3.4 Takeaway
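**The problem is not that any single baseline is too weak, but that no scheme satisfies all three properties from §2 at once**: preserve locality, respect D-side KV capacity, and tolerate skew-induced load imbalance. To our knowledge, EAR is the first scheduler to do all three.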
---
## §4 Design: EAR
### §4.1 Architecture
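EAR is a router sitting on top of N homogeneous instances. Each instance is symmetric and PD-colocated, with no static P/D partition. For each session the router maintains a **host binding**: the instance that currently holds the session's KV state. The binding is stable in steady state and changes only through migration when a hotspot triggers it.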
**Figure 5: EAR architecture and request flow** `figs/f5_architecture.png` **🚧 TBD (CUSTOM DRAW)**
> Component diagram: router (with session→host table) → N symmetric instances; affinity path as solid lines, migration path as dashed lines. Suitable for TikZ / draw.io.
### §4.2 Pillar 1: Affinity-Default Routing
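- **Cold start**: when a new session arrives, the router load-balances to the instance with the fewest pending prefill tokens and assigns it as the initial host
- **Warm path**: every subsequent turn of an established session is routed to its current host
- **Effect**: intra-session KV reuse is preserved by construction, and APC approaches the §2.2 upper bound of 79.6%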
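
The cold-start and warm-path rules can be sketched in a few lines. This is a minimal illustration, not the `unified` implementation: the class names `Instance` and `AffinityRouter` and the field `pending_prefill` are assumptions introduced here.

```python
from dataclasses import dataclass, field


@dataclass
class Instance:
    name: str
    pending_prefill: int = 0  # pending prefill tokens as seen by the router


@dataclass
class AffinityRouter:
    instances: list
    host_of: dict = field(default_factory=dict)  # session id -> host Instance

    def route(self, session_id: str) -> Instance:
        host = self.host_of.get(session_id)
        if host is None:
            # Cold start: bind to the instance with the fewest pending prefill tokens.
            host = min(self.instances, key=lambda inst: inst.pending_prefill)
            self.host_of[session_id] = host
        # Warm path: an established session always returns to its host.
        return host


router = AffinityRouter([Instance("e0", 500), Instance("e1", 100)])
first = router.route("s1")       # cold start picks the lighter instance
first.pending_prefill += 10_000  # the host becomes busy
again = router.route("s1")       # affinity still returns the same host
other = router.route("s2")       # a new session goes to the now-lighter instance
```

Note that the warm path ignores load entirely, which is exactly why this pillar alone behaves like pure sticky.
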
### §4.3 Pillar 2: Hot-Triggered Session Migration 🚧 PARTIAL VALIDATION
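The key mechanism that keeps Pillar 1 from degenerating into pure sticky.
> **Status (updated 2026-05-27)**:
> - **Substrate validation PASS** (commit `ef9e010`): the `kv_both` connector is net positive on trace replay (TTFT p90 18.6% lower, and a further 22% lower after the DR-fix). The transfer overhead previously considered a migration blocker no longer exists.
> - **Policy-layer e2e validation PENDING**: the real benefit of trigger threshold + target selection inside the agentic feedback loop has not been measured directly. The main cause (substrate overhead) behind the reverts of the 4 earlier migration attempts (`6b255fa`, `e991960/5772149`, `cc6e562`, `4c583f2`) is gone, but wrong trigger decisions and cooldown thrashing are independent risks that need a new round of e2e experiments to confirm.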
#### §4.3.1 Trigger signal
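EAR monitors each instance's **pending prefill tokens** in real time. When a request arrives that affinity would route to host H, the router first checks:
- `H.pending_prefill > T_hot` (hotspot detection)
- the session has not migrated within the past `T_cool` seconds (thrashing prevention, §4.3.4)

Migration is considered only when both conditions hold. Values of `T_hot` and `T_cool` are discussed in §5.5 (sensitivity).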
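
The two-condition trigger can be written as a single predicate. This is a sketch under assumptions: the function name, its parameters, and the treatment of the exact cooldown boundary (`>=`) are choices made here, and the threshold values in the usage lines are placeholders, since `T_hot` and `T_cool` are still TBD.

```python
from typing import Optional


def should_consider_migration(
    host_pending_prefill: int,
    last_migration_ts: Optional[float],
    now: float,
    t_hot: int,
    t_cool: float,
) -> bool:
    """Both conditions must hold before migration is even considered."""
    # Condition 1: the affinity host is a hotspot.
    is_hot = host_pending_prefill > t_hot
    # Condition 2: the session is outside its cooldown window
    # (a session that has never migrated has no window).
    cooled_down = last_migration_ts is None or (now - last_migration_ts) >= t_cool
    return is_hot and cooled_down


# Placeholder thresholds for illustration only.
T_HOT, T_COOL = 32_000, 30.0
print(should_consider_migration(50_000, None, 100.0, T_HOT, T_COOL))
print(should_consider_migration(50_000, 90.0, 100.0, T_HOT, T_COOL))
```

Returning `True` here only opens the door to migration; whether a migration actually happens still depends on finding a suitable target.
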
#### §4.3.2 Target selection
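Candidate set: all instances whose (a) remaining KV capacity can hold the session's existing context and (b) `pending_prefill` is strictly lower than H's. Pick the candidate with the lowest `pending_prefill`.

**Key design point**: we rank by **observable current load** rather than **predicted transfer time**. Both the literature and a colleague's data show that the mooncake cost model's prediction error reaches 10-21x, whereas pending prefill tokens is a value the router observes directly (accuracy by construction).

If the candidate set is empty (every other instance either cannot hold the session or is busier than H), EAR keeps the current binding and continues serving the request on H: **migration is opportunistic, not mandatory**.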
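
Target selection as described is a filter followed by a minimum. The sketch below assumes the router tracks, per instance, the pending prefill tokens and the free KV bytes; `InstanceState`, `free_kv_bytes`, and `select_target` are hypothetical names, not identifiers from the EAR codebase.

```python
from dataclasses import dataclass
from typing import Optional

GIB = 1024 ** 3


@dataclass
class InstanceState:
    name: str
    pending_prefill: int  # pending prefill tokens (observed, not predicted)
    free_kv_bytes: int    # remaining KV capacity


def select_target(host, others, session_kv_bytes) -> Optional[InstanceState]:
    """Least-loaded instance that fits the session and is strictly lighter than host."""
    candidates = [
        inst for inst in others
        if inst.free_kv_bytes >= session_kv_bytes          # (a) KV fits
        and inst.pending_prefill < host.pending_prefill    # (b) strictly less busy
    ]
    if not candidates:
        return None  # opportunistic: keep the current binding
    return min(candidates, key=lambda inst: inst.pending_prefill)


host = InstanceState("e4", 90_000, 10 * GIB)
others = [
    InstanceState("e0", 20_000, 1 * GIB),   # lightest, but KV does not fit
    InstanceState("e1", 30_000, 8 * GIB),
    InstanceState("e2", 95_000, 20 * GIB),  # fits, but busier than the host
]
small = select_target(host, others, 4 * GIB)
large = select_target(host, others, 16 * GIB)
```

A `None` result corresponds to the empty candidate set, in which case the request stays on the current host.
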
#### §4.3.3 Mechanism
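When migration triggers:
1. The current request is redirected to target instance T
2. The session's accumulated KV state is transferred from source H to T via Mooncake `kv_connector`
3. The session's host binding is updated to T; subsequent turns are routed to T automatically by affinity

The KV transfer is on the critical path of the request that triggers the migration, but it is amortized over the session's remaining TBD turns.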
#### §4.3.4 Thrashing prevention
Each session maintains a `last_migration_timestamp` and is barred from migrating again within the cooldown `T_cool`. The cooldown bounds the number of migrations per session to O(session_lifetime / T_cool).
### §4.4 Implementation
Built on vLLM 0.18.1 + Mooncake (vanilla kv_connector). EAR is a single router process of ~TBD LoC. The session→host table is maintained in TBD (in-memory dict / Redis).
---
## §5 Evaluation
### §5.1 Setup
- **Trace**: real Qwen3-Coder agentic trace, TBD requests / TBD seconds / r=0.0015 st=0.3, peak QPS ~1.6, APC headroom ~76%
- **Hardware**: TBD × H20 (96GB HBM)
- **Engine**: vLLM 0.18.1 + Mooncake `kv_connector`
- **Baselines** (6):
  1. `load-balance`: round-robin
  2. `LMetric`: OSDI'26 load-aware routing
  3. `kvcache-aware + load-balance`: linear combination of cache score and load score
  4. `sticky`: session-instance pinning
  5. `static PD-disagg`: static 4P / 4D partition
  6. `EAR`: this work
- **Metrics**: TTFT (mean/p50/p90/p99), TPOT (same percentiles), E2E, APC, worker TTFT p90 (median + max), wall-clock vs trace-time
### §5.2 End-to-end Performance
**Figure 6 (headline, p90 only)** (PARTIAL: PD-disagg column missing)
![F6 E2E latency bars — 4 policies, p90 only](figs/f6_e2e_latency_bars.png)
**Figure 6 full (mean / p50 / p90 / p99 × TTFT / TPOT / E2E)** (data complete)
![F6 full latency grid — 4 percentiles × 3 metrics](figs/f6_e2e_latency_full_grid.png)
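> **🚧 TBD (NEW DATA)**: both figures lack the `static PD-disagg` column, and the EAR column is also TBD (needs migration validation). A version in the same format covering all 6 schedulers still has to be produced. The headline figure's p90 row goes in the main paper; the full grid can go in the appendix or supplementary material.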
| Scheduler | TTFT p50 | TTFT p90 | TPOT p90 | APC | Worker p90 (median / max) | Wall-clock factor |
|---|---|---|---|---|---|---|
| load-balance | TBD | TBD | TBD | TBD | TBD | TBD |
| LMetric | TBD | TBD | TBD | 56.9% | 6.53 | ~8x |
| kvcache+load | TBD | TBD | TBD | TBD | TBD | TBD |
| sticky | TBD | 18.02s | TBD | 77.2% | 13.65 | TBD |
| static PD-disagg | 62.8s | TBD | TBD | TBD | TBD | TBD |
| **EAR** | TBD | **7.35s** | TBD | **79.4%** | TBD | TBD |

(Bold numbers come from measurements of the existing "unified" prototype.)
### §5.3 Ablation 🚧 PARTIAL DEFER
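We disable the two pillars independently:
- **EAR (affinity only)**: equivalent to pure sticky; measures the standalone contribution of locality
- **EAR (migration only)**: cold-balance + reactive migration, without affinity; measures whether migration can stand on its own
- **EAR (full)**: both pillars enabled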
**Figure 7: Ablation** `figs/f7_ablation.png` **🚧 TBD DEFERRED (BLOCKED ON MIGRATION VALIDATION)**
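> The full ablation needs three configurations: migration-only / both / affinity-only. Migration-only and both depend on re-testing migration. For now, a two-point comparison of affinity-only vs load-balance is possible with existing data: unified 79.4% APC vs lmetric 56.9% APC.

Expected conclusion: affinity-only captures locality but doubles interference; migration-only fails to capture locality; both pillars are necessary.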
### §5.4 Dispatch Coupling Validation
To close the loop on the §2.3 argument, we measure for each baseline:
- Mean per-turn service time `W_turn` (x)
- Wall-clock / trace-time amplification (y)
**Figure 8: Wall-clock amplification vs per-turn service time** `figs/f8_coupling_measured.png` **🚧 TBD (NEW DATA)**
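> Scatter plot: x = mean per-turn `W_turn` (TTFT + decode_time, computed from per-request metrics), y = amplification (`wall_clock / trace_span`, already exposed by the Phase 1 patch). One point per baseline. Optionally overlay the theoretical curve `L*/(1 - Λ·N·W'(L*))`. This is the empirical closure of the §2.3 argument and has the **highest priority**.

Expected: EAR lands in the corner with the smallest `W_turn` and the lowest amplification factor.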
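
One point of the planned scatter could be computed roughly as follows. The `wall_clock_s` and `trace_span_s` field names follow the description of the replay summary in this outline; `w_turn_s` is a hypothetical field standing in for the mean per-turn TTFT + decode time, and the numbers in the usage lines are made up purely to exercise the function.

```python
from statistics import mean


def coupling_point(runs):
    """Average (W_turn, amplification) over repeated runs of one policy.

    Each run is a dict with `w_turn_s`, `wall_clock_s`, and `trace_span_s`.
    """
    x = mean(r["w_turn_s"] for r in runs)
    y = mean(r["wall_clock_s"] / r["trace_span_s"] for r in runs)
    return x, y


# Synthetic values, not measurements.
example_runs = [
    {"w_turn_s": 2.0, "wall_clock_s": 1200.0, "trace_span_s": 600.0},
    {"w_turn_s": 4.0, "wall_clock_s": 1800.0, "trace_span_s": 600.0},
]
point = coupling_point(example_runs)
print(point)
```

Averaging the per-run ratio, rather than dividing averaged totals, keeps each run's amplification comparable to the `amplification` value reported in its own summary.
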
### §5.5 Sensitivity
| Parameter | Range | What it tests |
|---|---|---|
| Arrival rate λ | TBD | Whether EAR dominates stably under low and high load |
| Skew (Zipf α) | TBD | Whether the gap between sticky and EAR widens with skew |
| KV pool size | TBD | The boundary where static PD-disagg hits the wall |
| `T_hot` (migration threshold) | TBD | Too loose → thrashing; too strict → missed hotspots |
| `T_cool` (cooldown) | TBD | Same as above |

**Figure 9: Sensitivity heatmaps** `figs/f9_sensitivity.png` **🚧 TBD (NEW DATA, PARTIAL DEFER)**
> The arrival rate / skew / KV pool size axes can be done now (no dependence on migration); the `T_hot` / `T_cool` axes depend on migration validation and are deferred.
### §5.6 Migration Microbenchmark 🚧 FULL DEFER
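Characterize EAR's internal migration behavior:
- Migration trigger rate (% of requests)
- Mean KV transfer time
- Migration accuracy (fraction of migrations after which the target instance stays non-hot for the next TBD turns)
- Thrashing rate (fraction of sessions that migrate more than once within a cooldown window; should be 0)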
**Figure 10: Migration timeline** `figs/f10_migration_timeline.png` **🚧 TBD DEFERRED (BLOCKED ON MIGRATION VALIDATION)**
> Heatmap of each instance's pending prefill tokens over time, with migration events marked by arrows. Fully depends on re-testing migration.
---
## §6 Discussion and Limitations
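- **Extreme skew**: if a single session alone makes any instance hot, EAR degenerates to sticky. We have not stress-tested this regime
- **Cost model accuracy**: EAR sidesteps prediction error by using observable load, but introducing predictive admission control in the future would require addressing the 10-21x error of the mooncake cost model
- **Heterogeneous hardware / multi-model**: EAR assumes homogeneous instances; mixed-model / mixed-GPU pools require extending the binding model
- **Per-instance batch tuning (future)**: dynamically adjusting `max_batched_tokens` might further reduce intra-instance prefill-decode interference; left as future work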
---
## §7 Related Work
- **LLM serving systems**: vLLM, Mooncake, SGLang, DistServe, Splitwise. EAR is built on vLLM + Mooncake and differs from DistServe/Splitwise in that it does no static P/D partitioning
- **Cache-aware routing**: LMCache, Production-Stack, LMetric (OSDI'26). These works minimize cross-instance cache misses but do not migrate state
- **Stateful service migration**: Pollux, Gandiva (DL training). EAR borrows the migration-as-rebalancing idea and carries it over to the KV cache setting of LLM inference
---
## §8 Conclusion
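For agentic LLM workloads, locality is the dominant scheduling lever. EAR captures it with session-affinity routing and protects it with hot-triggered session migration, so a single design dominates five baselines simultaneously on four dimensions: TTFT, APC, worst-worker p90, and wall-clock throughput.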
---
## Work Plan
### ✅ Done
- [x] §1 anchor sentence + contribution bullets
- [x] §2 outline + reuse existing characterization figures (`f2a`/`f2b`/`f2c`)
- [x] §3.1 / §3.2 / §3.3 outline + reuse existing baseline failure figures (`f4a`/`f4b`/`f4c`/`f4d`)
- [x] §4 design description (§4.3 pending empirical validation)
- [x] §5.2 partial figure (`f6` 5/6 baselines)
- [x] `replayer/replay.py` patched to emit `trace_span_s` + `amplification` in summary
### 🟢 Can do without migration (paper writing now possible)
- [ ] Draft the §1-§4 body text (all data available; copy the figures)
- [ ] Draft the body text of the §2.3 dispatch coupling section (the math has already been worked out in conversation)
- [ ] Draft the body text for the three failure modes in §3
- [ ] §5.4 wall-clock amplification measurements (5 baselines × 3 runs): **highest priority**, this is the empirical closure of §2.3
- [ ] §5.2: add static PD-disagg to the `f6` figure (re-run or merge existing PD-sep data)
- [ ] §5.5 sensitivity along the λ / skew / KV pool axes
- [ ] Decide the latex/markdown layout for each of the three §3 subfigures
### 🚧 Deferred (pending migration validation)
- [ ] §4.3 migration mechanism e2e validation (substrate works, commit `ef9e010`; policy-layer experiments for trigger + target selection still missing)
- [ ] §5.3 full ablation (the migration-only + both configurations)
- [ ] §5.5 sensitivity along the `T_hot` / `T_cool` axes
- [ ] §5.6 migration microbench (all of it)
- [ ] §1 teaser (`f1`): the EAR column
- [ ] §5.2 table: the EAR row
- [ ] §4.3.1 / §4.3.4: values of `T_hot` and `T_cool`
### 🎨 Custom drawings (paper-writing stage)
- [ ] `f3_coupling_schematic.png`: chatbot vs agentic timeline + feedback loop
- [ ] `f5_architecture.png`: EAR component diagram
### ❓ Open design decisions
- [ ] §4.4: storage medium for the session→host table (in-memory dict vs Redis)
- [ ] §5.1: final decision on instance count and total trace length