Gahow Wang e2f94495a1 EAR paper outline: anchor + dispatch coupling motivation

Initial 8-section outline for "Elastic Affinity Router" — agentic LLM
scheduler with session-affinity routing + hot-triggered session migration.

Centerpiece is §2.3's dispatch coupling argument: agentic workloads close
Little's Law on themselves (no human think-time), so per-turn W enters Λ,
amplifying small latency differences into throughput differences. This is
the intellectual hook the design hangs on.

§3 attacks three baselines on three orthogonal failure modes (load-balance
loses locality, static PD-disagg hits D-side KV wall, pure sticky creates
hot pin). §4 frames EAR as the single scheduler that addresses all three.

All figures and several numbers (T_hot, T_cool, EAR wall-clock factor) are
TBD — see Open Items at bottom.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

2026-05-27 01:24:02 +08:00

EAR: Elastic Affinity Routing for Agentic LLM Serving

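One-liner: In agentic LLM workloads, 93% of KV reuse is intra-session, and the tool-call feedback coupling between turns amplifies per-request latency differences into throughput differences, which makes locality the dominant scheduling lever. Existing load-balanced routing loses locality, static PD-disaggregation hits the D-side KV wall, and pure session-sticky routing creates hot pins. We propose EAR (Elastic Affinity Router), a scheduler that combines session-affinity routing with hot-instance-triggered session migration, so that a single design gets both locality and balance.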

§1 Introduction

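Agentic LLM workloads, in which an LLM drives itself through tool calls to complete a task over many turns, have become the dominant load on inference systems, yet scheduling policies designed for chatbots generally fail under them. This paper first characterizes the essential differences between agentic and chatbot workloads, then explains why the three mainstream classes of schedulers fall short, and finally presents the EAR design.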

Contributions:

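C1 Dispatch coupling argument: we formalize a feedback loop unique to agentic workloads. Per-turn service time affects the number of concurrent sessions through an implicit Little's Law equation, which amplifies per-request latency differences into throughput differences. Measured: the load-balancing baseline (LMetric, §3.1) shows 8x wall-clock amplification on a 600s trace; EAR shows TBDx.
C2 EAR design: a scheduler with two pillars. Affinity-default routing captures intra-session locality; hot-instance-triggered session migration moves a session's entire KV to a less loaded instance when a hotspot appears, avoiding hot pins.
C3 Evaluation: on a real Qwen3-Coder agentic trace, EAR dominates all 5 baselines simultaneously on five dimensions: TTFT, TPOT, APC, hotspot index, and wall-clock.

TBD: plot the (wall-clock / trace-time) ratio of the 5 baselines + EAR as a bar chart or scatter plot, with a reference line at y=1.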

§2 Background and Workload Characterization

§2.1 Agentic Workload Primer

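Three essential differences between agentic and chatbot workloads:

Multi-turn, programmatic continuation: each turn is triggered by the previous turn's tool-call result, with no human think-time
Prefill-dominated: the input/output token ratio is 75x and 98% of compute is in the prefill phase (chatbot: 1-10x)
Skewed sessions: the top 1% of sessions contribute 46.5% of input tokens

Average session length is TBD turns and TBD input tokens; the p99 per-request KV footprint is 11.49 GiB (12% of the H20's 96GB HBM).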

§2.2 KV Cache Reuse Topology

Breakdown of KV reuse on the trace:

Class	Share
Intra-session	93.2%
Cross-session	5.7%
Shared prefix	1.1%

Theoretical APC upper bounds: 80.3% any-session and 79.6% intra-session-only, a gap of <1pp. The cache is essentially session-local; any scheduler that does not preserve session affinity throws away most of the reuse opportunity.
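
One way such upper bounds could be computed offline is to replay the trace against an infinite cache and count block hits twice: once when any earlier request may supply the block, and once when only the same session may. The sketch below is illustrative only. The trace format (a session id plus a list of block ids per request), the assumption that block ids are chained prefix hashes (so a matching id implies a matching prefix), and the name `apc_upper_bounds` are all hypothetical and not taken from the authors' tooling.

```python
from typing import Dict, Hashable, Iterable, Sequence, Set, Tuple

def apc_upper_bounds(
    requests: Iterable[Tuple[str, Sequence[Hashable]]]
) -> Tuple[float, float]:
    """Return (any_session_bound, intra_session_bound) as block hit ratios.

    Each request is (session_id, block_ids) in arrival order. A block is a hit
    if it was seen before: anywhere (any-session) or within the same session
    (intra-session-only). The cache is assumed to be infinite.
    """
    seen_global: Set[Hashable] = set()
    seen_by_session: Dict[str, Set[Hashable]] = {}
    total = any_hits = intra_hits = 0
    for session_id, blocks in requests:
        seen_local = seen_by_session.setdefault(session_id, set())
        for block in blocks:
            total += 1
            any_hits += block in seen_global    # bool counts as 0 or 1
            intra_hits += block in seen_local
            seen_global.add(block)
            seen_local.add(block)
    if total == 0:
        return 0.0, 0.0
    return any_hits / total, intra_hits / total

# Toy trace: s1 reuses its own prefix; s2 shares one block with s1.
toy_trace = [
    ("s1", ["a", "b"]),
    ("s1", ["a", "b", "c"]),
    ("s2", ["a", "x"]),
]
any_bound, intra_bound = apc_upper_bounds(toy_trace)
```

On the toy trace, 3 of 7 blocks hit under the any-session rule and 2 of 7 under the intra-session rule; the difference between the two is the cross-session share.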

TBD: one figure with two subplots. Left: reuse pie; right: session input-token CDF (highlight top 1% = 46.5%).

§2.3 Dispatch Coupling — Why Locality Dominates

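This is the paper's most intuition-driven argument, so it gets its own section.

Intuition: in a chatbot, after each turn a human has to read, think, and type, so an external clock controls when the next turn arrives. In an agentic workload, the LLM issues the next request as soon as the tool-call result comes back, so the system's own speed determines when the next turn arrives. A slow policy therefore not only slows each request, it also keeps sessions in the system longer → more concurrent sessions → fiercer KV contention → slower turns: a feedback loop.

Concrete example: a coding agent runs a 20-turn task.

Fast policy: 2s per turn, 40s per session, 10 concurrent sessions on average
Slow policy (linear estimate): 3s per turn, 60s per session, which should mean 15 concurrent sessions
Slow policy (actual): 15 concurrent sessions push each turn to 4s, sessions to 80s, concurrency to 20, turns to 5s, and so on, until the system hits the wall or settles at a far worse equilibrium

Chatbot contrast: the human reads for 30s after each turn. When a turn goes from 2s to 3s, the per-turn cycle goes from 32s to 33s, a 3% difference with almost no feedback.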
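
The runaway in the example can be reproduced with a small fixed-point iteration. This is a minimal sketch under loud assumptions: the linear service-time models `w_fast` and `w_slow` are hypothetical, chosen only so that they match the numbers quoted above (2s at 10 sessions; 3s, 4s, 5s at 10, 15, 20 sessions), the arrival rate of 0.25 sessions/s is the value implied by 10 concurrent sessions of 40s each, and the cap `L_CAP` stands in for "the wall".

```python
from typing import Callable, List

def iterate_concurrency(w_turn: Callable[[float], float],
                        arrival_rate: float,
                        turns: int,
                        l0: float,
                        steps: int = 50,
                        l_cap: float = 200.0) -> List[float]:
    """Iterate L <- arrival_rate * turns * w_turn(L), stopping at l_cap."""
    history = [l0]
    level = l0
    for _ in range(steps):
        level = arrival_rate * turns * w_turn(level)
        if level >= l_cap:
            history.append(l_cap)  # concurrency has hit the (assumed) wall
            break
        history.append(level)
    return history

ARRIVAL_RATE = 0.25  # sessions/s, implied by 10 concurrent sessions of 40 s
TURNS = 20
L_CAP = 200.0        # hypothetical saturation point, not from the source

def w_fast(level: float) -> float:
    return 1.0 + level / 10.0  # assumed model: 2 s per turn at L = 10

def w_slow(level: float) -> float:
    return 1.0 + level / 5.0   # assumed model: 3 s, 4 s, 5 s at L = 10, 15, 20

fast_history = iterate_concurrency(w_fast, ARRIVAL_RATE, TURNS, l0=10.0, l_cap=L_CAP)
slow_history = iterate_concurrency(w_slow, ARRIVAL_RATE, TURNS, l0=10.0, l_cap=L_CAP)
print(fast_history[:4])  # [10.0, 10.0, 10.0, 10.0]
print(slow_history[:4])  # [10.0, 15.0, 20.0, 25.0]
```

Under these assumed models the fast policy sits at a stable equilibrium of 10 sessions, while the slow policy gains 5 sessions per iteration and never settles before the cap.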

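Formalization. Let Λ = session arrival rate (given by the trace), N = turns per session, and W_turn(L) = per-turn service time, an increasing function of the current number of concurrent sessions L (more concurrency means fiercer KV contention and a larger W_turn).

Little's Law for chatbots: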

L = Λ · N · (W_turn(L) + T_human)

This is dominated by the large constant T_human, so perturbations in W_turn(L) barely move L.

Little's Law for agentic workloads (T_human ≈ 0):

L = Λ · N · W_turn(L)

This is an implicit equation in L. Suppose a policy change scales W_turn by an overall factor of (1+ε); a small-perturbation analysis gives:

dL/dε|_{ε=0} = L* / (1 − Λ · N · W'_turn(L*))

As the denominator approaches 0 (the system nears KV saturation), the amplification factor diverges. This is why LMetric shows 8x wall-clock amplification on the 600s trace.
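
The perturbation result can be checked by implicit differentiation. The derivation below uses only the symbols defined above and assumes that W_turn is differentiable at the unperturbed fixed point L*.

```latex
\begin{align*}
L(\varepsilon) &= \Lambda N (1+\varepsilon)\, W_{\mathrm{turn}}\bigl(L(\varepsilon)\bigr) \\
\frac{dL}{d\varepsilon} &= \Lambda N\, W_{\mathrm{turn}}(L)
  + \Lambda N (1+\varepsilon)\, W'_{\mathrm{turn}}(L)\, \frac{dL}{d\varepsilon} \\
\left.\frac{dL}{d\varepsilon}\right|_{\varepsilon=0}
  \bigl(1 - \Lambda N\, W'_{\mathrm{turn}}(L^*)\bigr)
  &= \Lambda N\, W_{\mathrm{turn}}(L^*) = L^* \\
\left.\frac{dL}{d\varepsilon}\right|_{\varepsilon=0}
  &= \frac{L^*}{1 - \Lambda N\, W'_{\mathrm{turn}}(L^*)}
\end{align*}
```

Dividing by L* gives the relative sensitivity 1/(1 − Λ·N·W'_turn(L*)). When W'_turn = 0 this equals 1, which recovers the linear estimate; it grows without bound as Λ·N·W'_turn(L*) approaches 1.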

TBD: a schematic that draws the chatbot timeline (system→human→system→human...) next to the agentic timeline (system→system→system...) and marks how the missing T_human buffer lets W_turn enter the feedback loop directly.

§2.4 Takeaway

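Three properties, namely dominant intra-session locality (§2.2), long-context prefill-heavy requests (§2.1), and dispatch coupling (§2.3), together imply that scheduling for agentic workloads must put locality first and must tolerate the cross-instance load imbalance caused by skew.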

§3 Why Existing Schedulers Don't Fit

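Each of the three existing classes of schedulers runs into one of the three properties from §2:

§3.1 Load-balanced routing loses locality

Round-robin and load-aware routing (e.g., LMetric, OSDI'26) maximize instance utilization but ignore session affinity. Measured APC drops to 56.9% (vs. the 79.6% upper bound); the 23pp gap comes directly from lost intra-session cache hits. Violates §2.2.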

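§3.2 Static PD-disaggregation hits the D-side KV wall

Statically splitting instances into a P pool and a D pool works for chatbots but fails for agentic workloads: the average agentic request has 33.6k tokens and needs 3.3GB of KV; under the 4D configuration a p90 request takes 69% of the D-side KV pool, and a p99 request overflows it at 138%. Result: TTFT p50 blows up 62-72x and the success rate drops from 99.5% to 52-68%. Violates §2.1 (prefill-dominated + long context).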

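§3.3 Pure session-sticky creates hot pins

Hard session-to-instance binding recovers locality (APC 77.2%, 97% of the upper bound) but locks the large sessions of the skewed distribution onto a single instance, and the interference index doubles from LMetric's 6.53 to 13.65 (same trace, same hardware). Violates the skew-tolerance requirement of §2.4.

TBD: a bar chart or radar of three baselines × three failure dimensions. It should be immediately visible that each baseline has one conspicuously high/low bar on the dimension where it fails.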

§3.4 Takeaway

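The problem is not that any single baseline is too weak, but that none of them satisfies all three requirements from §2 at once: preserve locality, respect D-side KV capacity, and tolerate the load imbalance caused by skew. To our knowledge, EAR is the first scheduler that does all three.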

§4 Design: EAR

§4.1 Architecture

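EAR is a router that sits on top of N homogeneous instances. Every instance is symmetric and PD-colocated; there is no static P/D partition. For each session the router maintains a host binding: the instance that currently holds the session's KV state. The binding is stable in steady state and changes only through migration when a hotspot triggers it.

TBD: a component diagram: router (with session→host table) → N symmetric instances; migration path drawn as a dashed line.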

§4.2 Pillar 1: Affinity-Default Routing

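Cold start: when a new session arrives, the router assigns its initial host by load balancing (the instance with the fewest pending prefill tokens)
Warm path: every subsequent turn of an established session is routed to its current host
Effect: intra-session KV reuse is preserved by construction, and APC approaches the §2.2 upper bound of 79.6%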
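
The cold-start and warm-path rules can be sketched as follows. This is a hedged illustration, not the EAR implementation: the class names `Instance` and `AffinityRouter`, the in-memory `bindings` dict, and the way pending prefill tokens are accounted for are all assumptions, and the completion callback that would decrement the counter is omitted.

```python
from dataclasses import dataclass, field
from typing import Dict, List

@dataclass
class Instance:
    name: str
    pending_prefill: int = 0  # pending prefill tokens as seen by the router

@dataclass
class AffinityRouter:
    instances: List[Instance]
    bindings: Dict[str, Instance] = field(default_factory=dict)  # session -> host

    def route(self, session_id: str, prefill_tokens: int) -> Instance:
        host = self.bindings.get(session_id)
        if host is None:
            # Cold start: pick the instance with the fewest pending prefill tokens.
            host = min(self.instances, key=lambda inst: inst.pending_prefill)
            self.bindings[session_id] = host
        # Warm path: an established session always goes to its current host.
        host.pending_prefill += prefill_tokens
        return host

router = AffinityRouter([Instance("i0"), Instance("i1")])
first = router.route("s1", 1000)   # cold start, tie broken by list order
other = router.route("s2", 500)    # cold start, lands on the idle instance
again = router.route("s1", 2000)   # warm path, stays on the same host
```

Note that the third call stays on the first host even though it is now the busier one; this is exactly the behavior that Pillar 2 has to correct when the host becomes hot.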

§4.3 Pillar 2: Hot-Triggered Session Migration

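The key mechanism that keeps Pillar 1 from degenerating into pure sticky.

§4.3.1 Trigger signal

EAR monitors each instance's pending prefill tokens in real time. When a new request arrives and affinity would route it to host H, the router first checks:

H.pending_prefill > T_hot? (hotspot detection)
Has the session gone without a migration for the past T_cool seconds? (thrashing prevention, §4.3.4)

Migration is considered only when both conditions hold. See the §5.5 sensitivity study for the values of T_hot and T_cool.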
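
The two-condition check might look like the sketch below. The function name, its signature, and the threshold values in the usage lines are placeholders for illustration; the source leaves T_hot and T_cool as TBD, and whether the cooldown boundary is inclusive is an assumption here.

```python
from typing import Optional

def should_consider_migration(host_pending_prefill: int,
                              last_migration_ts: Optional[float],
                              now: float,
                              t_hot: int,
                              t_cool: float) -> bool:
    """Return True only if the host is hot AND the session has cooled down."""
    is_hot = host_pending_prefill > t_hot
    # A session that has never migrated is treated as cooled down.
    cooled_down = (last_migration_ts is None
                   or now - last_migration_ts >= t_cool)
    return is_hot and cooled_down

# Placeholder thresholds, NOT values from the paper.
T_HOT_EXAMPLE = 32_000   # pending prefill tokens
T_COOL_EXAMPLE = 30.0    # seconds

hot_and_fresh = should_consider_migration(50_000, None, 100.0,
                                          T_HOT_EXAMPLE, T_COOL_EXAMPLE)
hot_but_recent = should_consider_migration(50_000, 90.0, 100.0,
                                           T_HOT_EXAMPLE, T_COOL_EXAMPLE)
```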

§4.3.2 Target selection

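Candidate set: all instances that (a) have enough free KV capacity to hold the session's existing context and (b) have a pending_prefill strictly lower than H's. Pick the one with the lowest pending_prefill.

Key design point: we rank by observable current load rather than predicted transfer time. Both the literature and colleagues' data show that the Mooncake cost model's prediction error reaches 10-21x, whereas pending prefill tokens is a value the router observes directly, so it is accurate by construction.

If the candidate set is empty (every other instance either cannot fit the session or is at least as busy as H), EAR keeps the current binding and continues serving the request on H. Migration is opportunistic, not mandatory.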
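
Target selection as described above reduces to a filter followed by an argmin. The sketch below assumes a hypothetical `InstanceState` snapshot with a `free_kv_bytes` field; how the router actually learns free KV capacity is not specified in the source.

```python
from dataclasses import dataclass
from typing import Optional, Sequence

@dataclass(frozen=True)
class InstanceState:
    name: str
    pending_prefill: int  # tokens, observed by the router
    free_kv_bytes: int    # assumed to be reported by the instance

def select_target(host: InstanceState,
                  instances: Sequence[InstanceState],
                  session_kv_bytes: int) -> Optional[InstanceState]:
    """Return the migration target, or None to keep the current binding."""
    candidates = [
        inst for inst in instances
        if inst.name != host.name
        and inst.free_kv_bytes >= session_kv_bytes     # (a) session fits
        and inst.pending_prefill < host.pending_prefill  # (b) strictly less busy
    ]
    if not candidates:
        return None  # opportunistic: stay on the host
    return min(candidates, key=lambda inst: inst.pending_prefill)

GIB = 1024 ** 3
host = InstanceState("h", pending_prefill=60_000, free_kv_bytes=2 * GIB)
pool = [
    host,
    InstanceState("a", pending_prefill=5_000, free_kv_bytes=1 * GIB),   # too small
    InstanceState("b", pending_prefill=20_000, free_kv_bytes=8 * GIB),
    InstanceState("c", pending_prefill=70_000, free_kv_bytes=8 * GIB),  # busier
]
target = select_target(host, pool, session_kv_bytes=4 * GIB)
```

In the usage example the least loaded instance is rejected because the session does not fit, so the next least loaded instance that satisfies both conditions is chosen.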

§4.3.3 Mechanism

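When a migration triggers:

The current request is redirected straight to the target instance T
The session's accumulated KV state is transferred from source H to T through the Mooncake kv_connector
The session's host binding is updated to T; subsequent turns are routed to T automatically by affinity

The KV transfer sits on the critical path of the request that triggered the migration, but its cost is amortized over the session's remaining TBD turns.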

§4.3.4 Thrashing prevention

Each session keeps a last_migration_timestamp and is barred from migrating again within the cooldown T_cool. The cooldown bounds the number of migrations per session to O(session_lifetime / T_cool).
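
The bound can be illustrated with a worst-case count: assume the host is hot and a target exists at every turn, so only the cooldown limits migration. The function name and the numbers in the usage lines are hypothetical; the source leaves T_cool as TBD.

```python
import math

def count_worst_case_migrations(session_lifetime: float,
                                turn_interval: float,
                                t_cool: float) -> int:
    """Count migrations when every turn would migrate if the cooldown allows it."""
    migrations = 0
    last_migration_ts = None
    now = 0.0
    while now <= session_lifetime:
        if last_migration_ts is None or now - last_migration_ts >= t_cool:
            migrations += 1
            last_migration_ts = now
        now += turn_interval
    return migrations

# Illustrative numbers only: an 80 s session, one turn every 4 s, 30 s cooldown.
worst_case = count_worst_case_migrations(80.0, 4.0, 30.0)
upper_bound = math.floor(80.0 / 30.0) + 1
```

Even in this adversarial setting the count stays within floor(session_lifetime / T_cool) + 1, independent of how many turns the session has.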

§4.4 Implementation

Built on vLLM 0.18.1 + Mooncake (vanilla kv_connector). EAR is a single router process, ~TBD LoC. The session→host table is kept in TBD (in-memory dict / Redis).

§5 Evaluation

§5.1 Setup

Trace: real Qwen3-Coder agentic trace, TBD requests / TBD seconds / r=0.0015 st=0.3, peak QPS ~1.6, APC headroom ~76%
Hardware: TBD × H20 (96GB HBM)
Engine: vLLM 0.18.1 + Mooncake kv_connector
Schedulers compared (5 baselines + EAR):
1. load-balance: round-robin
2. LMetric: OSDI'26 load-aware routing
3. kvcache-aware + load-balance: linear combination of cache score and load score
4. sticky: hard session-to-instance pinning
5. static PD-disagg: 4P / 4D static partition
6. EAR: this paper
Metrics: TTFT (mean/p50/p90/p99), TPOT (same percentiles), E2E, APC, hotspot index, wall-clock vs trace-time

§5.2 End-to-end Performance

TBD: four subplots (TTFT/TPOT/APC/hotspot) or a single radar chart.

Scheduler	TTFT p50	TTFT p90	TPOT p90	APC	Hotspot idx	Wall-clock factor
load-balance	TBD	TBD	TBD	TBD	TBD	TBD
LMetric	TBD	TBD	TBD	56.9%	6.53	~8x
kvcache+load	TBD	TBD	TBD	TBD	TBD	TBD
sticky	TBD	18.02s	TBD	77.2%	13.65	TBD
static PD-disagg	62.8s	TBD	TBD	TBD	TBD	TBD
EAR	TBD	7.35s	TBD	79.4%	TBD	TBD

(Bold numbers come from measurements on the existing "unified" prototype.)

§5.3 Ablation

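We disable the two pillars independently:

EAR (affinity only): equivalent to pure sticky; measures the contribution of locality alone
EAR (migration only): cold-balance + reactive migration, no affinity; tests whether migration can stand on its own
EAR (full): both pillars enabled

TBD: 2×2 grid or side-by-side bar chart.

Expected conclusion: affinity-only gets locality but doubles interference; migration-only fails to capture locality; both are required.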

§5.4 Dispatch Coupling Validation

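This closes the loop on the §2.3 argument. For each baseline we measure:

Mean per-turn service time W_turn (x axis)
Wall-clock / trace-time amplification (y axis)

TBD: one point per baseline, with the theoretical curve 1/(1 − Λ·N·W'(L*)) overlaid as a reference line.

Expected: EAR sits in the corner with the smallest W_turn and the lowest amplification factor.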
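
For the reference line, the theoretical factor could be evaluated with a helper like the one below. It is only a sketch: the function name is hypothetical, the slope W'(L*) would have to be estimated from measurements, and the example inputs reuse the illustrative values Λ = 0.25 sessions/s and N = 20 from the §2.3 example rather than any measured quantity.

```python
def amplification_factor(arrival_rate: float, turns: int, w_slope: float) -> float:
    """Evaluate 1 / (1 - arrival_rate * turns * w_slope).

    Returns infinity when the loop gain reaches 1, where the fixed point
    of the Little's Law equation no longer exists.
    """
    loop_gain = arrival_rate * turns * w_slope
    if loop_gain >= 1.0:
        return float("inf")
    return 1.0 / (1.0 - loop_gain)

mild = amplification_factor(0.25, 20, 0.1)       # loop gain 0.5
saturated = amplification_factor(0.25, 20, 0.2)  # loop gain 1.0
```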

§5.5 Sensitivity

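Parameter	Range	What it tests
Arrival rate λ	TBD	Whether EAR dominates consistently at low/high load
Skew (Zipf α)	TBD	Whether the gap between sticky and EAR widens with skew
KV pool size	TBD	The boundary where static PD-disagg hits the wall
`T_hot` (migration threshold)	TBD	Trigger too loose → thrash; too strict → missed hotspots
`T_cool` (cooldown)	TBD	Same as above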

§5.6 Migration Microbenchmark

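Characterizes EAR's internal migration behavior:

Migration trigger rate (% of requests)
Mean KV transfer time
Migration accuracy: the fraction of migrations after which the target instance stays non-hot for the next TBD turns
Thrashing rate: the share of sessions that migrate more than once within a cooldown window (should be 0)

TBD: a heatmap of each instance's pending prefill tokens over time, with migration events marked as arrows.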
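
Two of the metrics above, trigger rate and thrashing rate, could be derived from a migration event log as sketched here. The log format (a list of `(session_id, timestamp)` pairs), the function name, and the choice of all sessions as the thrashing-rate denominator are assumptions for illustration.

```python
from collections import defaultdict
from typing import Dict, List, Sequence, Tuple

def migration_stats(migrations: Sequence[Tuple[str, float]],
                    total_requests: int,
                    session_count: int,
                    t_cool: float) -> Dict[str, float]:
    """Compute trigger rate (per request) and thrashing rate (per session)."""
    by_session: Dict[str, List[float]] = defaultdict(list)
    for session_id, timestamp in migrations:
        by_session[session_id].append(timestamp)

    thrashing_sessions = 0
    for stamps in by_session.values():
        stamps.sort()
        # A session thrashes if two consecutive migrations are closer than t_cool.
        if any(later - earlier < t_cool
               for earlier, later in zip(stamps, stamps[1:])):
            thrashing_sessions += 1

    return {
        "trigger_rate": len(migrations) / total_requests if total_requests else 0.0,
        "thrashing_rate": thrashing_sessions / session_count if session_count else 0.0,
    }

clean_log = [("s1", 0.0), ("s1", 40.0), ("s2", 10.0)]
clean = migration_stats(clean_log, total_requests=100, session_count=4, t_cool=30.0)
noisy = migration_stats(clean_log + [("s2", 20.0)],
                        total_requests=100, session_count=4, t_cool=30.0)
```

With a correctly enforced cooldown the thrashing rate should be 0, as in the first log; the second log adds a migration that lands inside the cooldown window and is flagged.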

§6 Discussion and Limitations

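Extreme skew: if a single session can by itself make any instance hot, EAR degenerates to sticky. We have not stress-tested this regime.
Cost model accuracy: EAR sidesteps prediction error by using observable load. If predictive admission control is introduced in the future, the 10-21x error of the Mooncake cost model will have to be addressed.
Heterogeneous hardware / multi-model: EAR assumes homogeneous instances. Mixed-model or mixed-GPU pools require extending the binding model.
Per-instance batch tuning (future): dynamically adjusting max_batched_tokens may further reduce prefill-decode interference within an instance; left as future work.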

§7 Related Work

LLM serving systems: vLLM, Mooncake, SGLang, DistServe, Splitwise. EAR is built on vLLM + Mooncake and differs from DistServe/Splitwise in that it does not use a static P/D partition.
Cache-aware routing: LMCache, Production-Stack, LMetric (OSDI'26). These systems minimize cross-instance cache misses but do not migrate state.
Stateful service migration: Pollux, Gandiva (DL training). EAR borrows the migration-as-rebalancing idea and applies it to the KV cache in LLM inference.

§8 Conclusion

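For agentic LLM workloads, locality is the dominant scheduling lever. EAR captures it with session-affinity routing and protects it with hot-triggered session migration; as a single design it dominates the five baselines simultaneously on four dimensions: TTFT, APC, hotspot, and wall-clock throughput.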

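Open Items (must be resolved before writing the body text)

§4.3.1: actual value of T_hot (pending prefill token threshold)
§4.3.4: actual value of T_cool (cooldown in seconds / turns)
§5.1: number of instances, total trace length, final baseline configuration
§4.4: choice of storage medium for the session→host table
§1 / §5.2: measured wall-clock amplification for EAR (the key number)
Figures: f1–f10 are all TBD; placeholders only for now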
