Gahow Wang b11dc30945 §2.3 reframe: dispatch coupling is regime-dependent, not binary chatbot/agentic

The previous §2.3 narrative said "chatbot has T_human ≈ 30 s think-time,
agentic has T_external ≈ 0, so agentic is always closed-loop and chatbot
never is". The new T_external measurements on the production chatbot
trace (qwen3-max, n=42 k inter-turn gaps from formatted parent_chat_id
sessions) show the binary framing is wrong:

  agentic   p50 1.6 s,  39% gaps < 1 s,  p99 738 s
  chatbot   p50 7.2 s,   4% gaps < 1 s,  p99  43 s

Both have nonzero T_external. The right distinction is the *shape*:
chatbot is unimodal around 5–15 s (human cadence); agentic is bimodal
with a sub-second tool-call mass (39 % vs chatbot's 4 %) plus a long-
pause tail (13 % > 30 s). The agentic sub-second mass is what activates
dispatch coupling — for any W_turn > 1 s scheduler those turns satisfy
W_turn ≫ T_external by construction.

The empirical regime split:
                 unified  TTFT p90 = 7.3 s   →  agentic 73% closed-loop, chatbot 32%
                 lmetric  TTFT p90 = 15.7s   →  agentic 80%,             chatbot 88%

lmetric is bad enough that it drags the chatbot regime into closed-loop
too. This is a direct empirical explanation for lmetric underperforming
on both workloads.

Updates:
- PAPER_OUTLINE.md §2.3: lead with the regime threshold W_turn ≷
  T_external, replace the "T_human dominates" Little's Law with the
  general form L = Λ · N · (W_turn(L) + T_external), embed f3a CDF,
  add the empirical regime table; correct the small-perturbation
  formula to include the +T_external dampening term.
- MEETING.md §1: same reframe, condensed (CDF figure, two-row regime
  table, one-line conclusion).

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

2026-05-27 16:51:38 +08:00

EAR: Elastic Affinity Routing for Agentic LLM Serving

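One-liner: In agentic LLM workloads, 93% of KV reuse is intra-session, and the tool-call feedback coupling between turns amplifies per-request latency differences into throughput gaps, so locality becomes the dominant scheduling lever. Existing load balancing discards locality, static PD-disaggregation hits the decode-side (D-side) KV wall, and pure session-sticky routing creates hot pins. We propose EAR (Elastic Affinity Router), a scheduler that combines session-affinity routing with session migration triggered by hot instances, so that a single design obtains both locality and balance.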

📊 Validation Status (2026-05-27)

Section	Data available	Still needed
§2 Workload characterization	✅ Complete (3 figures reused)	-
§3.1 Load balancing loses locality	✅ Complete (`f4a`)	-
§3.2 Static PD-disagg hits the KV wall	✅ Complete (`f4b`)	-
§3.3 Sticky creates hot pins	✅ Complete (`f4c`, `f4d`)	-
§4.1-2 Affinity routing	✅ Implemented (the current `unified` algorithm)	-
`kv_both` substrate cost	✅ VERIFIED net-positive (2026-05-27, commit `ef9e010`)	TTFT p90 −18.6% w/o DR-fix, −36.6% w/ DR-fix
§4.3 Migration mechanism (e2e)	🚧 PARTIAL	Substrate works; the e2e trigger + target-selection experiment has not been run
§5.2 End-to-end	⚠️ 5 of 6 baselines have data (`f6`)	Missing static PD-disagg; EAR column awaits migration
§5.3 Ablation	🚧 PARTIAL DEFER	Only affinity-only is feasible now; full awaits migration
§5.4 Dispatch coupling validation	🚧 NEW DATA NEEDED	Rerun wall-clock for 5 baselines (after the Phase 1 patch)
§5.5 Sensitivity	🚧 PARTIAL DEFER	λ / skew / KV pool are feasible now; `T_hot` / `T_cool` await migration
§5.6 Migration microbench	🚧 FULL DEFER	Depends entirely on migration validation

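Background: the team's four previous migration attempts were all reverted because of transfer overhead (see analysis/unified_routing_fix_review.md). The 2026-05-27 trace-replay A/B/C (microbench/connector_tax/cache_sweep/REPORT_TRACE_REPLAY.md) shows that the kv_both substrate has flipped: not only is the +45% penalty obsolete, the substrate itself is net positive (TTFT p90 −18.6% vs plain, −36.6% after the DR-fix). The main root cause of the four earlier migration reverts is gone, but the e2e migration policy layer (the real benefit of trigger + target selection inside the feedback loop) has still not been validated directly. EAR's migration experiments therefore no longer carry substrate risk, only policy-layer risk.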

§1 Introduction

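Agentic LLM workloads, in which an LLM drives itself through tool calls and completes a task over many turns, have become the dominant load on inference systems, yet scheduling policies designed for chatbots generally fail under them. This paper first characterizes the essential differences between agentic and chatbot workloads, then explains why the three mainstream classes of scheduler fall short, and finally presents the EAR design.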

Contributions:

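C1 Dispatch coupling argument: we formalize a feedback loop specific to agentic workloads, in which per-turn service time affects the number of concurrent sessions through an implicit Little's Law equation and thereby amplifies per-request latency differences into throughput gaps. Measured: the load-balance baseline shows 8x wall-clock amplification on a 600s trace; EAR shows TBDx.
C2 EAR design: a scheduler built on two pillars. Affinity-default routing captures intra-session locality; session migration triggered by hot instances moves a session's entire KV to a lighter instance when a hotspot appears, avoiding hot pins.
C3 Evaluation: on a real Qwen3-Coder agentic trace, EAR dominates the 5 baselines simultaneously on five dimensions: TTFT, TPOT, APC, worst-worker p90, and wall-clock.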

Figure 1: Teaser — wall-clock vs trace-time across schedulers — figs/f1_teaser.png 🚧 TBD (NEW DATA NEEDED)

Needs Phase 3 measurements: 5 baselines × 3 runs of trace replay, extract amplification = wall_clock_s / trace_span_s from each summary (Phase 1 patch already exposes the field). Plot as bar chart with y=1 reference line. The EAR row is TBD for now (pending migration validation).

§2 Background and Workload Characterization

§2.1 Agentic Workload Primer

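Three essential differences between agentic and chatbot workloads:

Multi-turn, programmatic continuation: each turn is triggered by the tool-call result of the previous turn, with no human think-time
Prefill-dominated: the input/output token ratio is 75x and 98% of compute is in the prefill phase (chatbot: 1-10x)
Skewed sessions (from the Qwen3 production trace, n=1.3M sessions / 2.1M requests / 7200s): the top 1% of sessions contribute 46.5% of input tokens, the top 5% 66.5%, the top 10% 74.6%, the top 25% 87.5%, and the top 50% 96.0%, so half of the sessions account for nearly all input mass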
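The skew statistic in the last item (share of input tokens held by the heaviest sessions) can be computed with a short sketch like the one below. The helper name `top_share` and the per-session token totals are hypothetical illustration values, not the production trace.

```python
def top_share(session_tokens, top_fraction):
    """Fraction of all input tokens contributed by the heaviest `top_fraction` of sessions."""
    if not session_tokens:
        raise ValueError("need at least one session")
    ranked = sorted(session_tokens, reverse=True)
    # Always keep at least one session so tiny fractions stay well defined.
    k = max(1, round(len(ranked) * top_fraction))
    return sum(ranked[:k]) / sum(ranked)


# Synthetic per-session input-token totals (hypothetical, for illustration only).
tokens = [1000, 400, 200, 100, 100, 50, 50, 40, 30, 30]
for frac in (0.1, 0.5):
    print(f"top {frac:.0%} of sessions hold {top_share(tokens, frac):.1%} of input tokens")
```

Running the same function over the real per-session totals at 1%, 5%, 10%, 25% and 50% would yield the five numbers quoted above.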

Average session length: TBD turns, TBD input tokens. Per-request KV footprint (Qwen3-Coder-30B-A3B, 98304 B/token): p50 1.8 GiB, p90 8.0 GiB, p95 9.6 GiB, p99 11.5 GiB. The per-instance KV pool is ≈ 0.4 × 96 GiB = 38.4 GiB (the remainder is 50% bf16 model parameters + 10% runtime activations), so one instance can hold only 3 concurrent decodes of p99 requests. Switching to 4P+4D PD-disagg halves system decode capacity outright (system concurrency 24 → 12).
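The capacity arithmetic above can be reproduced with the following sketch. The constants come from the text; the function names are hypothetical, and the count of 8 instances is taken from §3.3.

```python
GIB = 2**30
BYTES_PER_TOKEN = 98304   # Qwen3-Coder-30B-A3B KV bytes per token (from the text)
HBM_GIB = 96              # per-instance HBM
KV_FRACTION = 0.4         # share of HBM left for the KV pool (from the text)
NUM_INSTANCES = 8         # instance count used in §3.3


def tokens_for(footprint_gib):
    """Context length (tokens) that corresponds to a KV footprint in GiB."""
    return footprint_gib * GIB / BYTES_PER_TOKEN


def concurrent_decodes(footprint_gib, pool_gib=HBM_GIB * KV_FRACTION):
    """How many requests of this footprint fit in one instance's KV pool."""
    return int(pool_gib // footprint_gib)


per_instance = concurrent_decodes(11.5)        # p99 request
colocated = NUM_INSTANCES * per_instance       # all 8 instances decode
pd_4p4d = (NUM_INSTANCES // 2) * per_instance  # only the 4 D instances decode
print(f"pool = {HBM_GIB * KV_FRACTION:.1f} GiB, p99 decodes per instance = {per_instance}")
print(f"system p99 concurrency: colocated = {colocated}, 4P+4D = {pd_4p4d}")
print(f"p50 footprint 1.8 GiB is about {tokens_for(1.8):,.0f} tokens")
```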

§2.2 KV Cache Reuse Topology

Decomposition of KV reuse on the trace:

Class	Share
Intra-session	93.2%
Cross-session	5.7%
Shared prefix	1.1%

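Theoretical APC upper bound: 80.3% for any-session reuse, 79.6% for intra-session-only, a gap of <1pp. The cache is essentially session-local; any scheduler that does not preserve session affinity throws away the vast majority of reuse opportunities.

Figure 2: Workload characterization (3 panels). Existing data can be reused.

📝 Layout TBD: whether to combine the three into a 1×3 strip or place one each in §2.1/§2.2/§2.4.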

§2.3 Dispatch Coupling — Why Locality Dominates

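This is the argument in the paper that leans most on intuition, so it gets its own section.

Intuition. Between consecutive turns there is an external gap T_external (for a chatbot, a human reading, thinking, and typing; for an agent, tool execution). The next turn arrives after T_external. Little's Law: L = Λ · N · (W_turn(L) + T_external). Whether the system can avoid the feedback loop depends on whether W_turn is much smaller than T_external:

If W_turn ≪ T_external: session residence time is dominated by T_external, small changes in scheduling speed barely move L, and the system is in the open-loop regime;
If W_turn ≳ T_external: the W_turn(L) term is coupled to L through KV contention, Little's Law becomes an implicit equation in L, and the system is in the closed-loop regime, where an ε regression in the scheduler is amplified by the feedback loop into a several-fold larger L*.

The difference between agentic and chatbot workloads is not binary; it lies in the distribution of T_external. Below is the CDF of T_external = next.start − prev.end measured on production traces (agentic = Qwen3-Coder, n=783k inter-turn gaps; chatbot = qwen3-max chat, n=42k gaps):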

Metric	Agentic	Chatbot
p25	0.69s	4.85s
p50	1.6s	7.2s
p90	44s	15s
p99	738s	43s
gap < 1s	39%	4%
gap < 5s	67%	29%
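A minimal sketch of how the rows of this table could be derived from a trace, using the definition T_external = next.start − prev.end. The record layout `(session_id, start_s, end_s)` and the clamping of negative gaps to zero are assumptions for illustration; the actual trace schema is not given here.

```python
from collections import defaultdict
from statistics import median


def inter_turn_gaps(turns):
    """Return T_external = next.start - prev.end for consecutive turns of each session.

    `turns` is an iterable of (session_id, start_s, end_s) tuples (assumed layout).
    """
    by_session = defaultdict(list)
    for sid, start, end in turns:
        by_session[sid].append((start, end))
    gaps = []
    for spans in by_session.values():
        spans.sort()  # order turns by start time within the session
        for (_, prev_end), (next_start, _) in zip(spans, spans[1:]):
            # Assumption: overlapping turns count as a zero gap.
            gaps.append(max(0.0, next_start - prev_end))
    return gaps


def frac_below(gaps, threshold_s):
    """Share of gaps shorter than `threshold_s` (the "gap < 1s" style rows)."""
    return sum(g < threshold_s for g in gaps) / len(gaps)


# Tiny synthetic trace (hypothetical values).
turns = [("a", 0.0, 2.0), ("a", 2.5, 4.0), ("a", 40.0, 41.0),
         ("b", 1.0, 3.0), ("b", 10.0, 12.0)]
gaps = inter_turn_gaps(turns)
print(sorted(gaps), median(gaps), frac_below(gaps, 1.0))
```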

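The two distributions have completely different shapes:

Chatbot is unimodal, tightly concentrated at 5–15s (human interaction cadence);
Agentic is bimodal: 39% of gaps are < 1s (autonomous tool-call mode, vs only 4% for chatbot) plus 13% of gaps are > 30s (session paused/abandoned, vs only 2% for chatbot).

The danger in agentic workloads comes from the sub-second tool-call mode: for these 39% of turns, W_turn ≫ T_external holds almost by construction, so dispatch coupling necessarily activates. Chatbot has no such mass and enters the closed loop only when W_turn is pushed very high.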

Measured regime comparison:

Scheduler	TTFT p90	Agentic frac(W_turn > T_ext)	Chatbot frac(W_turn > T_ext)
unified	7.3s	73%	32%
lmetric	15.7s	80%	88%

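unified pushes 73% of turns into the closed loop on agentic but only 32% on chatbot. lmetric reaches 80% on agentic and 88% on chatbot: its W_turn is large enough to push even chatbot into the closed loop, which is one direct root cause of lmetric underperforming on both workloads.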

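Concrete example: a coding agent runs a 20-turn task, with T_external in the sub-second mode (tool call 0.5s).

Fast policy: W_turn = 2s, 2.5s per turn, 50s per session, 10 concurrent sessions on average
Slow policy (linear estimate): W_turn = 3s, 3.5s per turn, 70s per session, so concurrency should be 14
Slow policy (actual): 14 concurrent sessions tighten the KV pool → W_turn rises to 4s → session takes 90s → 18 concurrent → W_turn 5s ... the feedback loop amplifies until it hits the wall or settles at a far worse fixed point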
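The feedback loop in this example can be sketched as a fixed-point iteration on L = Λ·N·(W_turn(L) + T_external). The linear contention model W_turn(L) = w_base + w_slope·L and its coefficients are assumptions chosen so that the fast policy reproduces the numbers above (W_turn = 2s at 10 concurrent sessions, hence Λ = 0.2 sessions/s); they are not measured values.

```python
def steady_state_concurrency(arrival_rate, turns, w_base, w_slope, t_external,
                             max_iter=10_000, tol=1e-9):
    """Iterate L <- Λ·N·(W_turn(L) + T_external) with W_turn(L) = w_base + w_slope·L."""
    gain = arrival_rate * turns * w_slope  # loop gain Λ·N·W'_turn
    if gain >= 1.0:
        raise ValueError("loop gain >= 1: no finite fixed point (the system hits the wall)")
    L = 0.0
    for _ in range(max_iter):
        nxt = arrival_rate * turns * (w_base + w_slope * L + t_external)
        if abs(nxt - L) < tol:
            return nxt
        L = nxt
    raise RuntimeError("did not converge")


LAMBDA, N, T_EXT = 0.2, 20, 0.5

# Fast policy: W_turn(10) = 1.0 + 0.10 * 10 = 2 s.
fast = steady_state_concurrency(LAMBDA, N, w_base=1.0, w_slope=0.10, t_external=T_EXT)
# Slow policy: the same curve scaled by 1.5, so W_turn = 3 s at 10 sessions.
slow = steady_state_concurrency(LAMBDA, N, w_base=1.5, w_slope=0.15, t_external=T_EXT)
# Linear estimate that ignores the feedback through L.
linear_guess = LAMBDA * N * (3.0 + T_EXT)

print(f"fast L* = {fast:.1f}, slow linear estimate = {linear_guess:.1f}, slow actual L* = {slow:.1f}")
```

Under these assumed coefficients the slow policy settles at 20 concurrent sessions rather than the linear estimate of 14. With the steeper slope implied by the bullet itself (W_turn growing 1s per 4 extra sessions, i.e. `w_slope=0.25`), the loop gain is exactly 1 and the function raises, which corresponds to the "hits the wall" case.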

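Formalization. Let Λ = session arrival rate, N = turns per session, and W_turn(L) = per-turn service time as an increasing function of concurrency L (more concurrency means fiercer KV contention and larger W_turn). Little's Law: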

L = Λ · N · (W_turn(L) + T_external)

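Suppose a policy change scales W_turn by a factor (1+ε) overall. Differentiating L = Λ · N · ((1+ε) · W_turn(L) + T_external) at ε = 0 and substituting Λ · N = L* / (W_turn(L*) + T_external) gives the sensitivity of the fixed point L*:

dL*/dε = L* · W_turn(L*) / [(W_turn(L*) + T_external) · (1 − Λ · N · W'_turn(L*))]

Two things to note:

The factor W_turn / (W_turn + T_external): when T_external ≫ W_turn the sensitivity → 0 (open loop); when T_external → 0 the factor approaches its upper bound of 1 (closed loop). So the agentic sub-second tool-call mass pushes the sensitivity toward its upper bound, while the chatbot 5–15s mass suppresses it.
The denominator factor 1 − Λ · N · W'_turn(L*): it approaches 0 near KV saturation, so any scheduling regression near saturation is amplified without bound. This is the root cause of lmetric's 8x wall-clock on the 600s trace.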
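The sensitivity can be sanity-checked numerically. Implicitly differentiating L = Λ·N·((1+ε)·W_turn(L) + T_external) at ε = 0 gives dL*/dε = Λ·N·W_turn / (1 − Λ·N·W'_turn), which equals L*·W_turn / [(W_turn + T_external)·(1 − Λ·N·W'_turn)] because L* = Λ·N·(W_turn + T_external). The sketch below compares that expression with a central finite difference, under an assumed linear W_turn(L) = w_base + w_slope·L with hypothetical coefficients.

```python
def fixed_point(eps, lam_n=4.0, w_base=1.0, w_slope=0.10, t_ext=0.5):
    """Closed-form L* when W_turn(L) = (1 + eps) * (w_base + w_slope * L)."""
    scale = 1.0 + eps
    return lam_n * (scale * w_base + t_ext) / (1.0 - lam_n * scale * w_slope)


def sensitivity(lam_n=4.0, w_base=1.0, w_slope=0.10, t_ext=0.5):
    """dL*/d(eps) at eps = 0 from implicit differentiation of Little's Law."""
    L = fixed_point(0.0, lam_n, w_base, w_slope, t_ext)
    w = w_base + w_slope * L           # W_turn(L*)
    return lam_n * w / (1.0 - lam_n * w_slope)


h = 1e-6
numeric = (fixed_point(h) - fixed_point(-h)) / (2 * h)  # central difference
analytic = sensitivity()
L_star = fixed_point(0.0)
w_star = 1.0 + 0.10 * L_star
# Same quantity written with the W/(W+T) damping factor made explicit.
factored = L_star * w_star / ((w_star + 0.5) * (1.0 - 4.0 * 0.10))

print(f"numeric = {numeric:.4f}, analytic = {analytic:.4f}, factored = {factored:.4f}")
```

All three values agree for this toy model, and the `(1.0 - lam_n * w_slope)` term makes the blow-up near saturation visible: as the loop gain approaches 1, the sensitivity diverges.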

Figure 3: Dispatch coupling schematic — figs/f3_coupling_schematic.png 🚧 TBD (CUSTOM DRAW)

A new schematic is needed: left half, a timeline comparison (chatbot: system → T_external (5–15s) → system; agentic: system → T_external (sub-second to long-tail) → system); right half, the feedback loop W_turn → L → W_turn, annotated with the regime condition W_turn ≷ T_external.

§2.4 Takeaway

Three properties, intra-session locality dominant (§2.2), long context + prefill-heavy (§2.1), and dispatch coupling (§2.3), together dictate that scheduling for agentic workloads must put locality first while tolerating the inter-instance load imbalance that skew produces.

§3 Why Existing Schedulers Don't Fit

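Each of the three existing classes of scheduler runs into one of the three properties from §2:

§3.1 Load-balanced routing loses locality

Round-robin and load-aware routing (e.g. LMetric, OSDI'26) maximize instance utilization but ignore session affinity. Measured APC drops to 56.9% (vs the 79.6% upper bound); the 23pp gap comes directly from lost intra-session cache hits. Violates §2.2.

Why "cache-aware load routing" is not enough either: LMetric's cache signal is diluted by its multiplicative score. LMetric scores an instance as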

P = pending_prefill_tokens + (input_length - cache_hit)
score = P × num_requests

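cache_hit enters only as a subtracted term inside P, while the score is multiplicative. An instance with session affinity keeps receiving the session's requests, so its num_requests rises and the product eats the cache benefit. Example: 8000 input tokens, warm instance cache_hit = 7500 vs cold instance cache_hit = 0, pending_prefill 2000 on both, num_requests 5 vs 1. Then the LMetric score is 2500 × 5 = 12500 for warm and 10000 × 1 = 10000 for cold, so LMetric picks the cold instance and throws away a cache hit covering ~94% of the input (7500 of 8000 tokens). Result:

Policy	APC	vs load_only	Design point
load_only	53.9%	-	pure load (`score = num_requests`)
LMetric	57.2%	+3.3pp	cache as a subtracted cost-model term
sticky	77.7%	+23.8pp	cache as a hard constraint
unified	78.7%	+24.8pp	cache as a mix of hard + soft preference

The +3.3pp from load_only → LMetric is nearly negligible; the +20.5pp from LMetric → sticky is the payoff for handling the cache signal correctly. Cache awareness cannot be just one term swallowed inside a cost model; it must be an independent routing path (sticky / unified hybrid). This is the failure mode of §3.1, more specific than "loses locality".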
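The warm-vs-cold example can be replayed directly from the scoring formula quoted above. The function name `lmetric_score` is hypothetical, and the sketch assumes the lowest score wins, which is what the example implies.

```python
def lmetric_score(pending_prefill_tokens, input_length, cache_hit, num_requests):
    """score = P × num_requests, with P = pending_prefill_tokens + (input_length - cache_hit)."""
    p = pending_prefill_tokens + (input_length - cache_hit)
    return p * num_requests


# Values from the example in the text.
warm = lmetric_score(pending_prefill_tokens=2000, input_length=8000, cache_hit=7500, num_requests=5)
cold = lmetric_score(pending_prefill_tokens=2000, input_length=8000, cache_hit=0, num_requests=1)

# Assumption: the router picks the instance with the lowest score.
chosen = "warm" if warm < cold else "cold"
print(f"warm = {warm}, cold = {cold}, chosen = {chosen}")
```

The warm instance's cache hit shrinks P by a factor of 4, but its num_requests of 5 more than cancels that, so the cold instance is selected.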

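§3.2 Static PD-disaggregation hits the D-side KV wall

Statically splitting instances into a P pool and a D pool works for chatbot but fails for agentic: agentic requests average 33.6k tokens and need 3.3GB of KV; under the 4D configuration a p90 request occupies 69% of the D-side KV pool and a p99 request overflows it at 138%. Result: TTFT p50 blows up by 62-72x and the success rate falls from 99.5% to 52-68%. Violates §2.1 (prefill-dominant + long context).

§3.3 The real failure of pure session-sticky: every worker is dragged down by hot sessions

Hard session-instance binding restores locality (APC 77.2%, i.e. 97% of the upper bound), but absolute worker latency is dragged down across the board. That is the real failure mode of pure sticky.

Policy	median worker TTFT p90	max worker	system e2e p90
`sticky`	20.3 s	55.4 s	34.6 s
`unified` (affinity + LMetric fallback)	10.3 s	37.7 s	18.0 s
`lmetric`	14.0 s	31.3 s	24.8 s

Mechanism: on the production trace the top 1% of sessions hold 46.5% of input mass and the top 5% hold 66.5%, and the number of hot sessions far exceeds the number of instances (8). Sticky's hash binding makes every worker carry its own share of hot sessions, so even the median worker slows to the 20s range. unified uses the LMetric fallback to reroute cold/new sessions to non-hot workers, preserving the speed of 7 of 8 workers. System p90 is determined by the majority of requests, so unified is ~2x faster than sticky on e2e p90.

§3.3 sub-finding: hot-pin failure must be measured with per-worker absolute latency (median + max), not a normalized ratio. max/median penalizes "affinity + escape" designs like unified in the wrong direction: sticky's ratio of 2.73 is lower than unified's 3.67, but sticky's median is also higher (20.3s vs unified's 10.3s), so here the lower ratio belongs to the worse policy. All worker-balance comparisons in this paper use the (median, max) pair, never a single ratio.

Violates the skew-tolerance requirement of §2.4.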
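The sub-finding about max/median can be illustrated with the per-worker TTFT p90 values from the §3.3 table. The dictionary layout and helper name are hypothetical; the ratios are computed from the rounded table values, so they may differ in the last digit from ratios computed on unrounded measurements.

```python
# Per-worker TTFT p90 in seconds, copied from the §3.3 table.
workers = {
    "sticky":  {"median": 20.3, "max": 55.4},
    "unified": {"median": 10.3, "max": 37.7},
    "lmetric": {"median": 14.0, "max": 31.3},
}


def imbalance_ratio(stats):
    """Normalized max/median ratio (the metric the text argues against)."""
    return stats["max"] / stats["median"]


by_ratio = sorted(workers, key=lambda name: imbalance_ratio(workers[name]))
by_median = sorted(workers, key=lambda name: workers[name]["median"])

for name in workers:
    s = workers[name]
    print(f"{name:8s} median = {s['median']:5.1f}s  max = {s['max']:5.1f}s  "
          f"max/median = {imbalance_ratio(s):.2f}")
print("best to worst by ratio: ", by_ratio)
print("best to worst by median:", by_median)
```

Ranking by the ratio places unified last even though it has the fastest median worker, which is why the (median, max) pair is reported instead.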

Figure 4: Three baselines, three failure modes. Split into three sub-figures placed in §3.1/§3.2/§3.3:

§3.1: measured APC vs the theoretical upper bound 79.6% (lmetric 56.9%, load_only 54.1%, sticky 77.2%, unified 79.4%)

§3.2: D-side KV pool occupancy vs per-request KV footprint; both 4P+4D and 6P+2D cross the 90% memory wall in the agentic regime

§3.3: per-worker TTFT p90 across 8 instances × 5 policies. Under sticky every worker is slowed (median 20.3s); unified concentrates the damage on e4 and keeps the other workers fast (median 10.3s)

📝 Optional supporting figure: prefill-decode interference (an 8k prefill on the same GPU degrades TPOT by 66x), for §3.3 to support the interference argument against sticky

§3.4 Takeaway

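The problem is not that any single baseline is too weak, but that no scheme satisfies all three properties of §2 at once: preserving locality, respecting D-side KV capacity, and tolerating skew-induced load imbalance. To our knowledge, EAR is the first scheduler to do all three.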

§4 Design: EAR

§4.1 Architecture

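EAR is a router sitting above N homogeneous instances. Every instance is symmetric and PD-colocated, with no static P/D partition. For each session the router maintains a host binding: the instance that currently holds the session's KV state. The binding is stable in steady state and changes only through migration when a hotspot triggers.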

Figure 5: EAR architecture and request flow — figs/f5_architecture.png 🚧 TBD (CUSTOM DRAW)

Component diagram: router (with the session→host table) → N symmetric instances; affinity path as solid lines, migration path as dashed lines. Suitable for TikZ / draw.io.

§4.2 Pillar 1: Affinity-Default Routing

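Cold start: when a new session arrives, the router assigns its initial host by load balancing (the instance with the fewest pending prefill tokens)
Warm path: every subsequent turn of an established session is routed to its current host
Effect: intra-session KV reuse is preserved by construction, and APC approaches the 79.6% upper bound of §2.2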
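A minimal sketch of the cold-start and warm-path rules above, assuming an in-memory session→host dict and a per-instance pending-prefill counter. The class and method names are hypothetical and do not describe the actual EAR router code.

```python
class AffinityRouter:
    """Affinity-default routing: least-loaded host on cold start, sticky afterwards."""

    def __init__(self, instance_ids):
        self.pending_prefill = {i: 0 for i in instance_ids}  # tokens awaiting prefill
        self.host = {}                                       # session_id -> instance_id

    def route(self, session_id, prefill_tokens):
        inst = self.host.get(session_id)
        if inst is None:
            # Cold start: pick the instance with the fewest pending prefill tokens.
            inst = min(self.pending_prefill, key=self.pending_prefill.get)
            self.host[session_id] = inst
        # Warm path: reuse the bound host regardless of its current load.
        self.pending_prefill[inst] += prefill_tokens
        return inst

    def on_prefill_done(self, instance_id, prefill_tokens):
        """Release tokens once the instance has finished prefilling them."""
        remaining = self.pending_prefill[instance_id] - prefill_tokens
        self.pending_prefill[instance_id] = max(0, remaining)


router = AffinityRouter(["i0", "i1"])
first = router.route("s1", 8000)   # cold start on an idle cluster
second = router.route("s2", 500)   # cold start, lands on the emptier instance
third = router.route("s1", 500)    # warm path: stays on s1's host even though it is busier
print(first, second, third, router.pending_prefill)
```

The third call shows the property Pillar 2 has to compensate for: the warm path never leaves a busy host on its own.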

§4.3 Pillar 2: Hot-Triggered Session Migration 🚧 PARTIAL VALIDATION

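The key mechanism that keeps Pillar 1 from degenerating into pure sticky.

Status (updated 2026-05-27):

Substrate validation PASS (commit ef9e010): the kv_both connector is net positive on trace replay (TTFT p90 −18.6%), with a further −22% after the DR-fix (−36.6% in total). The transfer overhead previously believed to block migration no longer exists.

Policy-layer e2e validation PENDING: the real benefit of the trigger threshold + target selection inside the agentic feedback loop has still not been measured directly. The main reason the four earlier migration attempts (6b255fa, e991960/5772149, cc6e562, 4c583f2) were reverted (substrate overhead) is gone, but wrong trigger decisions and cooldown thrashing are independent risks that need a new round of e2e experiments to confirm.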

§4.3.1 Trigger signal

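EAR monitors each instance's pending prefill tokens in real time. When a new request arrives and affinity would route it to host H, the router first checks:

H.pending_prefill > T_hot? (hotspot detection)
Has the session gone without a migration in the past T_cool seconds? (thrashing prevention, §4.3.4)

Migration is considered only when both conditions hold. Values of T_hot and T_cool are discussed in the §5.5 sensitivity study.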
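The two-condition check can be sketched as a single predicate. The function name and argument names are hypothetical, and the threshold values in the usage lines are placeholders only, since T_hot and T_cool are still TBD.

```python
def should_consider_migration(host_pending_prefill, last_migration_ts, now, t_hot, t_cool):
    """True only if the host is hot AND the session is outside its cooldown window."""
    is_hot = host_pending_prefill > t_hot
    # A session that has never migrated has no timestamp and is always eligible.
    cooled_down = last_migration_ts is None or (now - last_migration_ts) >= t_cool
    return is_hot and cooled_down


# Placeholder thresholds (hypothetical; the real values are TBD in §5.5).
T_HOT_TOKENS = 50_000
T_COOL_S = 30.0

print(should_consider_migration(80_000, None, now=100.0, t_hot=T_HOT_TOKENS, t_cool=T_COOL_S))
print(should_consider_migration(80_000, 90.0, now=100.0, t_hot=T_HOT_TOKENS, t_cool=T_COOL_S))
print(should_consider_migration(10_000, None, now=100.0, t_hot=T_HOT_TOKENS, t_cool=T_COOL_S))
```

A `True` result only makes the session a migration candidate; whether it actually moves depends on target selection in §4.3.2.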

§4.3.2 Target selection

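Candidate set: all instances that (a) have enough remaining KV capacity to hold the session's existing context and (b) have pending_prefill strictly lower than H's. Pick the one with the lowest pending_prefill.

Key design point: we rank by observable current load rather than predicted transfer time. Both the literature and a colleague's data show that the mooncake cost model's prediction error reaches 10-21x, whereas pending prefill tokens is a value the router observes directly and is accurate by construction.

If the candidate set is empty (every other instance either cannot fit the session or is busier than H), EAR keeps the current binding and continues serving the request on H. Migration is opportunistic, not mandatory.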
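Target selection as described above reduces to a filter followed by a minimum. The sketch below assumes a simple `Instance` record with observable load and free KV bytes; the type, field names, and example numbers are hypothetical.

```python
from dataclasses import dataclass
from typing import Optional, Sequence

GIB = 2**30


@dataclass
class Instance:
    name: str
    pending_prefill: int   # tokens, observed directly by the router
    free_kv_bytes: int     # remaining KV pool capacity


def select_target(host: Instance, others: Sequence[Instance],
                  session_kv_bytes: int) -> Optional[Instance]:
    """Least-loaded instance that fits the session and is strictly less busy than the host."""
    candidates = [
        inst for inst in others
        if inst.free_kv_bytes >= session_kv_bytes          # condition (a)
        and inst.pending_prefill < host.pending_prefill    # condition (b)
    ]
    # Empty candidate set -> None: keep the current binding (opportunistic migration).
    return min(candidates, key=lambda inst: inst.pending_prefill, default=None)


host = Instance("H", pending_prefill=60_000, free_kv_bytes=0)
a = Instance("A", pending_prefill=10_000, free_kv_bytes=8 * GIB)
b = Instance("B", pending_prefill=5_000, free_kv_bytes=2**20)      # lightest, but cannot fit
c = Instance("C", pending_prefill=70_000, free_kv_bytes=64 * GIB)  # fits, but busier than H

target = select_target(host, [a, b, c], session_kv_bytes=3 * GIB)
fallback = select_target(host, [b, c], session_kv_bytes=3 * GIB)
print(target.name if target else None, fallback)
```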

§4.3.3 Mechanism

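When migration triggers:

The current request is redirected to the target instance T
The session's accumulated KV state is transferred from source H to T through the Mooncake kv_connector
The session's host binding is updated to T; subsequent turns route to T automatically by affinity

The KV transfer sits on the critical path of the request that triggered the migration, but is amortized over the session's remaining TBD turns.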

§4.3.4 Thrashing prevention

Each session maintains a last_migration_timestamp and may not migrate again within the cooldown T_cool. The cooldown bounds migration activity to O(session_lifetime / T_cool) per session.

§4.4 Implementation

Built on vLLM 0.18.1 + Mooncake (vanilla kv_connector). EAR is a router process of ~TBD LoC. The session→host table is maintained in TBD (in-memory dict / Redis).

§5 Evaluation

§5.1 Setup

Trace: real Qwen3-Coder agentic trace, TBD requests / TBD seconds / r=0.0015 st=0.3, peak QPS ~1.6, APC headroom ~76%
Hardware: TBD × H20 (96GB HBM)
Engine: vLLM 0.18.1 + Mooncake kv_connector
Baselines (6 configurations, including EAR):
1. load-balance: round-robin
2. LMetric: OSDI'26 load-aware routing
3. kvcache-aware + load-balance: linear combination of cache score and load score
4. sticky: hard session-instance pinning
5. static PD-disagg: 4P / 4D static partition
6. EAR: this paper
Metrics: TTFT (mean/p50/p90/p99), TPOT (same), E2E, APC, worker TTFT p90 (median + max), wall-clock vs trace-time

§5.2 End-to-end Performance

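Figure 6 (headline, p90 only): ✅ (PARTIAL, missing the PD-disagg column)

Figure 6 full (mean / p50 / p90 / p99 × TTFT / TPOT / E2E): ✅ data complete

🚧 TBD (NEW DATA): both figures lack the static PD-disagg column, and the EAR column is also TBD (needs migration validation). A version in the same format covering all 6 configurations still has to be produced. The headline figure uses the p90 row in the main paper; the full grid can go in the appendix or supplementary.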

Scheduler	TTFT p50	TTFT p90	TPOT p90	APC	Worker p90 (median / max)	Wall-clock factor
load-balance	TBD	TBD	TBD	TBD	TBD	TBD
LMetric	TBD	TBD	TBD	56.9%	6.53	~8x
kvcache+load	TBD	TBD	TBD	TBD	TBD	TBD
sticky	TBD	18.02s	TBD	77.2%	13.65	TBD
static PD-disagg	62.8s	TBD	TBD	TBD	TBD	TBD
EAR	TBD	7.35s	TBD	79.4%	TBD	TBD

(Bold numbers come from measurements of the existing "unified" prototype.)

§5.3 Ablation 🚧 PARTIAL DEFER

We disable the two pillars independently:

EAR (affinity only): equivalent to pure sticky; measures the contribution of locality alone
EAR (migration only): cold-balance + reactive migration, no affinity; measures whether migration can stand on its own
EAR (full): both pillars on

Figure 7: Ablation — figs/f7_ablation.png 🚧 TBD — DEFERRED (BLOCKED ON MIGRATION VALIDATION)

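The full ablation needs three configurations: migration-only / both / affinity-only. Migration-only and both depend on re-measuring migration. For now a two-point comparison of affinity-only vs load-balance is possible (existing data: unified 79.4% APC vs lmetric 56.9% APC).

Expected conclusion: affinity-only gets locality but doubles interference; migration-only cannot capture locality; both are required.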

§5.4 Dispatch Coupling Validation

Closes the loop on the §2.3 argument. For each baseline, measure:

Mean per-turn service time W_turn (x axis)
Wall-clock / trace-time amplification (y axis)

Figure 8: Wall-clock amplification vs per-turn service time — figs/f8_coupling_measured.png 🚧 TBD (NEW DATA)

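Scatter plot: x = mean per-turn W_turn (TTFT + decode_time computed from per-request metrics), y = amplification (wall_clock / trace_span, already exposed by the Phase 1 patch). One point per baseline. Optionally overlay the theoretical curve L*/(1 − Λ·N·W'(L*)). This is the empirical closure of the §2.3 argument and has the highest priority.

Expected: EAR sits in the corner with the smallest W_turn and the lowest amplification factor.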
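The y-axis value can be aggregated across repeated runs with a sketch like the following. It assumes each replay summary is a dict carrying `wall_clock_s` and `trace_span_s`; the exact summary layout emitted by replayer/replay.py is not shown in this outline, and the numbers below are made up for illustration.

```python
from statistics import mean


def amplification(summary):
    """amplification = wall_clock_s / trace_span_s for one replay run."""
    return summary["wall_clock_s"] / summary["trace_span_s"]


def mean_amplification(runs_by_scheduler):
    """Average the amplification over the runs of each scheduler."""
    return {
        name: mean(amplification(s) for s in runs)
        for name, runs in runs_by_scheduler.items()
    }


# Hypothetical summaries, NOT measured results.
runs = {
    "demo_a": [{"wall_clock_s": 1200.0, "trace_span_s": 600.0},
               {"wall_clock_s": 1800.0, "trace_span_s": 600.0}],
    "demo_b": [{"wall_clock_s": 600.0, "trace_span_s": 600.0}],
}
result = mean_amplification(runs)
print(result)
```

A value of 1.0 means the replay kept pace with the trace; anything above it is the amplification that Figure 8 plots against W_turn.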

§5.5 Sensitivity

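Parameter	Range	What it tests
Arrival rate λ	TBD	Whether EAR dominates consistently under low/high load
Skew (Zipf α)	TBD	Whether the sticky-vs-EAR gap widens with skew
KV pool size	TBD	Where static PD-disagg hits the wall
`T_hot` (migration threshold)	TBD	Too loose → thrash, too strict → missed hotspots
`T_cool` (cooldown)	TBD	Same as above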

Figure 9: Sensitivity heatmaps — figs/f9_sensitivity.png 🚧 TBD (NEW DATA, PARTIAL DEFER)

The three axes arrival rate / skew / KV pool size can be done now (no migration dependency); the T_hot / T_cool axes depend on migration validation and are deferred.

§5.6 Migration Microbenchmark 🚧 FULL DEFER

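Characterizes EAR's internal migration behavior:

Migration trigger rate (% of requests)
Mean KV transfer time
Migration accuracy: fraction of migrations after which the target instance stays non-hot for the next TBD turns
Thrashing rate: fraction of sessions that migrate more than once within a cooldown window (should be 0)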

Figure 10: Migration timeline — figs/f10_migration_timeline.png 🚧 TBD — DEFERRED (BLOCKED ON MIGRATION VALIDATION)

A heatmap of each instance's pending prefill tokens over time, with migration events marked as arrows. Depends entirely on re-measuring migration.

§6 Discussion and Limitations

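Extreme skew: if a single session alone makes any instance hot, EAR degenerates to sticky. We have not stress-tested this regime.
Cost model accuracy: EAR sidesteps prediction error by using observable load. If predictive admission control is introduced later, the 10-21x error of the mooncake cost model will have to be addressed.
Heterogeneous hardware / multi-model: EAR assumes homogeneous instances. Mixed-model / mixed-GPU pools require extending the binding model.
Per-instance batch tuning (future): dynamically adjusting max_batched_tokens may further reduce intra-instance prefill-decode interference; left as future work.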

§7 Related Work

LLM serving systems: vLLM, Mooncake, SGLang, DistServe, Splitwise. EAR is built on vLLM + Mooncake and differs from DistServe/Splitwise in that it does no static P/D partitioning.
Cache-aware routing: LMCache, Production-Stack, LMetric (OSDI'26). These works minimize cross-instance cache misses but do not migrate state.
Stateful service migration: Pollux, Gandiva (DL training). EAR borrows the migration-as-rebalancing idea and carries it over to the KV cache setting of LLM inference.

§8 Conclusion

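For agentic LLM workloads, locality is the dominant scheduling lever. EAR captures it with session-affinity routing and protects it with hot-triggered session migration, and a single design dominates the five baselines simultaneously on four dimensions: TTFT, APC, worst-worker p90, and wall-clock throughput.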

Work Plan

✅ Done

§1 anchor sentence + contribution bullets
§2 outline + reuse existing characterization figures (f2a/f2b/f2c)
§3.1/§3.2/§3.3 outline + reuse existing baseline failure figures (f4a/f4b/f4c/f4d)
§4 design description (§4.3 awaiting empirical validation)
§5.2 partial figure (f6 5/6 baselines)
replayer/replay.py patched to emit trace_span_s + amplification in summary

🟢 Can do without migration (paper writing now possible)

Draft the body text of §1-§4 (all data available, figures already copied)
Draft the body text of §2.3 dispatch coupling (the math is already worked out in conversation)
Draft the body text of the three §3 failure modes
§5.4 wall-clock amplification measurement (5 baselines × ≥3 runs): highest priority, this is the empirical closure of §2.3
§5.2 add static PD-disagg to the f6 figure (rerun or merge existing PD-sep data)
§5.5 sensitivity on the λ / skew / KV pool axes
Decide the LaTeX/markdown layout of each of the three §3 sub-figures

🚧 Deferred (pending migration validation)

§4.3 migration mechanism e2e validation: substrate works (commit ef9e010); the policy-layer experiment on trigger + target selection is missing
§5.3 full ablation (the migration-only and both configurations)
§5.5 sensitivity on the T_hot / T_cool axes
§5.6 migration microbench, all of it
§1 teaser figure (f1), EAR column
§5.2 table, EAR row
Values of T_hot and T_cool for §4.3.1 / §4.3.4

🎨 Custom drawings (paper-writing stage)

f3_coupling_schematic.png: chatbot vs agentic timeline + feedback loop
f5_architecture.png: EAR component diagram

❓ Open design decisions

§4.4 storage medium for the session→host table (in-memory dict vs Redis)
§5.1 final decision on instance count and total trace length
