Add EAR meeting pitch doc

Minimal one-page sell doc for advisor meeting. Leads with dispatch
coupling insight + 8x amplification number, then workload chars,
three baseline failure modes, EAR two-pillar design, progress/TODO/risk.

Uses the 8 figs already in figs/. Migration Pillar 2 explicitly marked
as design-complete-validation-pending (the 4 prior reverts + DR-fix
context).

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-27 01:48:53 +08:00
parent 52cdb80367
commit 0bb97c9dca

112
MEETING.md Normal file

@@ -0,0 +1,112 @@
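# EAR: Agentic Serving Scheduler Briefing
**One-liner**: In agentic workloads, 93% of KV reuse is intra-session, and the tool-call feedback coupling between turns amplifies per-request latency differences into throughput gaps, so locality is the dominant scheduling lever. Existing load-balancing loses locality, static PD-disagg hits the D-side KV wall, and pure sticky routing creates hot pins. We propose EAR (Elastic Affinity Router) = session-affinity routing + hot-instance-triggered session migration.
---
## 1. Key insight: Dispatch Coupling
Chatbot: human think-time separates turns, so system speed is independent of (⊥) the next-turn arrival rate.
Agentic: only the tool-call return (≈0) separates turns, so **a slower system → longer session residency → higher concurrency → tighter KV pool → even slower**.
Little's Law gives an implicit equation: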
```
L = Λ · N · W_turn(L) # agentic, T_human≈0
```
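Small-perturbation analysis: amplification = `1 / (1 - Λ·N·W'(L*))`, where `L*` is the fixed point of the equation above; it diverges as the system approaches KV saturation.
**Measured**: lmetric takes 49 min of wall-clock to replay a 600s trace = **8x amplification**. On the same hardware, unified drains sessions ~3x faster than lmetric. **Small differences in per-turn W are amplified into order-of-magnitude wall-clock gaps**, which means locality is not a nice-to-have but the dominant lever.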
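The fixed point and the amplification factor can be illustrated numerically. The sketch below is only an illustration under stated assumptions: the latency curve `w_turn` (and its `w0` and `capacity` parameters) is hypothetical, not fitted to any measurement, and the product Λ·N is passed as a single number `lam_n`.

```python
def w_turn(L, w0=2.0, capacity=100.0):
    # Hypothetical latency curve (an assumption, not measured data):
    # per-turn latency grows without bound as concurrency L nears capacity.
    return w0 / (1.0 - L / capacity)


def solve_concurrency(lam_n, w0=2.0, capacity=100.0, tol=1e-10, max_iter=100_000):
    # Fixed-point iteration on L = lam_n * W_turn(L), starting from an empty system.
    L = 0.0
    for _ in range(max_iter):
        nxt = lam_n * w_turn(L, w0, capacity)
        if nxt >= capacity:
            raise RuntimeError("offered load exceeds capacity: no stable operating point")
        if abs(nxt - L) < tol:
            return nxt
        L = nxt
    raise RuntimeError("fixed-point iteration did not converge")


def amplification(lam_n, L_star, w0=2.0, capacity=100.0, h=1e-4):
    # Central-difference estimate of W'(L*), then 1 / (1 - lam_n * W'(L*)).
    w_prime = (w_turn(L_star + h, w0, capacity) - w_turn(L_star - h, w0, capacity)) / (2 * h)
    return 1.0 / (1.0 - lam_n * w_prime)


for lam_n in (10.0, 12.0, 12.4):
    L_star = solve_concurrency(lam_n)
    print(f"lam_n={lam_n}: L*={L_star:.2f}, amplification={amplification(lam_n, L_star):.2f}")
```

With this toy curve, a small increase in offered load near saturation produces a disproportionately large increase in the amplification factor, which is the qualitative behavior the perturbation analysis describes.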
---
## 2. Workload evidence (three findings)
| | Data | Figure |
|---|---|---|
| KV reuse is almost entirely intra-session | intra 93.2% / cross 5.7% / shared 1.1% | ![](figs/f2a_reuse_topology.png) |
| Sessions are extremely skewed | top 1% = 46.5% input mass | ![](figs/f2b_session_skew.png) |
| Per-request KV footprint is already large | p99 = 11.8 GiB ≈ 12% of an H20 | ![](figs/f2c_kv_footprint_cdf.png) |
Theoretical APC upper bound = intra-session 79.6% / any-session 80.3%, a gap of <1pp. **Any scheduler without affinity loses the vast majority of reuse.**
---
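## 3. Three failure modes of existing schedulers
### Load-balanced (LMetric / round-robin / kv-aware): loses locality
![](figs/f4a_apc_loss.png)
LMetric reaches 56.9% APC, load_only 54.1%, and capped 31.6%, all far below the 79.6% upper bound. The 23pp gap comes directly from intra-session hits lost to cross-instance routing.
### Static PD-disagg: D-side KV capacity wall
![](figs/f4b_pdsep_kv_wall.pdf)
The average agentic request is 33.6k tokens ≈ 3.3GB KV; both 4P+4D and 6P+2D cross the 90% memory wall in the agentic regime. **TTFT p50 surges 62-72x and the success rate drops from 99.5% to 52-68%.**
### Pure sticky / current unified: hot pin
![](figs/f4c_apc_vs_hotspot_tradeoff.png)
APC rises to 77-79% (close to the upper bound), but the hotspot index roughly doubles (sticky 2.73, unified 3.66 vs. lmetric 2.25, load_only 1.29): the large sessions in the skewed tail are pinned to a single instance, causing prefill-decode interference.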
---
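## 4. EAR design
Two pillars. All instances are symmetric and PD-colocated, with no static P/D partition.
**Pillar 1: Affinity-default routing (implemented)**
A new session is assigned a host by load-balancing; subsequent turns are routed via the session→host binding.
This is the current `unified` algorithm (hybrid LMetric + high-cache affinity): APC 79.4%, reaching 97% of the upper bound.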
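The routing rule can be sketched in a few lines. This is a deliberately simplified illustration, not the `unified` implementation: the `Instance` and `AffinityRouter` names are hypothetical, and the first-turn load metric is assumed to be pending prefill tokens.

```python
from dataclasses import dataclass, field


@dataclass
class Instance:
    name: str
    pending_prefill_tokens: int = 0  # assumed load signal for first-turn placement


@dataclass
class AffinityRouter:
    instances: list
    binding: dict = field(default_factory=dict)  # session_id -> Instance

    def route(self, session_id, prompt_tokens):
        host = self.binding.get(session_id)
        if host is None:
            # New session: load-balance to the least-loaded instance and bind.
            host = min(self.instances, key=lambda i: i.pending_prefill_tokens)
            self.binding[session_id] = host
        # Later turns reuse the binding, so the session's KV stays local.
        host.pending_prefill_tokens += prompt_tokens
        return host


router = AffinityRouter([Instance("inst-0"), Instance("inst-1")])
first = router.route("session-a", 1000)
later = router.route("session-a", 4000)
print(first.name, later.name)
```

Once a session is bound, every later turn lands on the same instance regardless of load, which is exactly the behavior that preserves intra-session reuse and also what creates hot pins when a large session grows.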
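**Pillar 2: Hot-triggered session migration (empirical validation pending)**
When the host's `pending_prefill_tokens > T_hot`, migrate the entire session's KV via mooncake `kv_connector` to a less-loaded instance, then update the session binding so subsequent turns route to the new host.
Key design points:
- Target selection uses **observable pending prefill tokens**, **not** cost-model prediction (the mooncake cost model is off by 10-21x in measurements, so we bypass it)
- Per-session cooldown to prevent thrashing
- If no candidate instance can hold the session context, keep the current binding (migration is opportunistic, not mandatory)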
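The decision logic in the bullets above can be sketched as a pure function. All names here (`InstanceState`, `pick_migration_target`, `free_kv_tokens`, `cooldown_s`) are hypothetical, and the actual KV transfer through the connector is intentionally left out; only the trigger, cooldown, and target-selection rules are shown.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass
class InstanceState:
    name: str
    pending_prefill_tokens: int  # observable load signal
    free_kv_tokens: int          # assumed measure of spare KV capacity


def pick_migration_target(
    host: InstanceState,
    instances: Sequence[InstanceState],
    session_ctx_tokens: int,
    t_hot: int,
    last_migrated_at: Optional[float],
    now: float,
    cooldown_s: float,
) -> Optional[InstanceState]:
    """Return the instance to migrate to, or None to keep the current binding."""
    if host.pending_prefill_tokens <= t_hot:
        return None  # host is not hot
    if last_migrated_at is not None and now - last_migrated_at < cooldown_s:
        return None  # per-session cooldown prevents thrashing
    candidates = [
        inst for inst in instances
        if inst is not host
        and inst.free_kv_tokens >= session_ctx_tokens          # must fit the context
        and inst.pending_prefill_tokens < host.pending_prefill_tokens
    ]
    if not candidates:
        return None  # opportunistic: no fit means no migration
    # Choose by observed load only; no cost-model prediction is involved.
    return min(candidates, key=lambda inst: inst.pending_prefill_tokens)


host = InstanceState("hot", 9000, 0)
pool = [host, InstanceState("a", 1000, 50_000), InstanceState("b", 200, 10_000)]
target = pick_migration_target(host, pool, 30_000, 8000, None, 100.0, 30.0)
print(target.name if target else "stay")
```

Keeping the decision separate from the transfer makes the policy easy to test against replayed load snapshots before any migration mechanism is validated.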
---
## 5. Progress & TODO
### ✅ Done
- Workload characterization: evidence for all three findings is complete (`f2a/b/c`)
- Evidence for the three baseline failure classes is complete (`f4a/b/c/d`)
- Anchor + paper outline (`PAPER_OUTLINE.md`)
- Pillar 1 affinity routing is implemented and tested (current `unified` algorithm)
- Little's Law formalization of dispatch coupling
- `replayer/replay.py` patched to output `amplification`
### 🟢 Doable now, independent of migration
1. **5 baselines × 3 runs wall-clock sweep** (the patched replayer emits the amplification field directly): empirical closure for §2.3. **Highest priority**; can finish overnight.
2. Add static PD-disagg to the end-to-end comparison
3. Sensitivity along three axes: λ / skew / KV pool
4. Draft the §1-§4 body text (data is ready)
### 🚧 Pending migration validation
- §4.3 migration mechanism: re-measure `connector_tax` on top of the DR-fix
- Full ablation (migration-only + both)
- §5.6 migration microbench
### Risks
- The previous 4 migration attempts (`6b255fa`, `e991960/5772149`, `cc6e562`, `4c583f2`) were all swallowed by transfer overhead and reverted
- The recent DR-fix (`build_connector_meta` slope from +81 to -0.7 μs/1k blocks) has **not been validated on trace replay**
- If migration validation fails, the paper pivots to "affinity-only is enough": still publishable, but a notch weaker
---
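## 6. One-sentence summary: what we are selling
> **Agentic workloads turn locality from a nice-to-have into the dominant lever (the dispatch coupling argument). EAR uses affinity-default + hot-triggered migration as a single scheme to get both locality and balance. Pillar 1 is validated (APC 79.4%); Pillar 2 is design-complete, with validation pending a re-measurement on top of the DR-fix.**
Next main battleground: the wall-clock sweep, to nail down the §2.3 dispatch coupling argument.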