EAR outline: copy reusable figures, mark migration sections deferred

- replayer/replay.py: emit trace_span_s and amplification in summary
  (Phase 1 of the wall-clock amplification measurement plan; needed for
  §2.3 dispatch coupling empirical closure)
- figs/: 8 reusable figures copied from analysis/ with paper-spec names
  (f2a/b/c workload, f4a/b/c/d failure modes, f6 e2e partial)
- PAPER_OUTLINE.md: real figure paths, explicit TBD markers for
  custom drawings and pending data; new "Validation Status" table at top
  and reorganized "Work Plan" splitting can-do-now vs migration-deferred

Migration validation deferred per user: 4 prior attempts (6b255fa,
e991960/5772149, cc6e562, 4c583f2) were reverted due to transfer
overhead; pending re-test on top of connector_tax DR-fix.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-27 01:44:13 +08:00
parent e2f94495a1
commit 52cdb80367
10 changed files with 95 additions and 29 deletions


@@ -4,6 +4,26 @@
---
## 📊 Validation Status (2026-05-27)
| Section | Existing data | Still needed |
|---|---|---|
| §2 Workload characterization | ✅ Complete (3 figures reused) | none |
| §3.1 Load-balance loses locality | ✅ Complete (`f4a`) | none |
| §3.2 Static PD-disagg hits the KV wall | ✅ Complete (`f4b`) | none |
| §3.3 Sticky creates hot pins | ✅ Complete (`f4c`, `f4d`) | none |
| §4.1-2 Affinity routing | ✅ Implemented (current `unified` algorithm) | none |
| §4.3 Migration mechanism | 🚧 **DEFERRED** | Re-test after the connector_tax fix |
| §5.2 End-to-end | ⚠️ Data for 5/6 baselines (`f6`) | Static PD-disagg missing; EAR column waits on migration |
| §5.3 Ablation | 🚧 **PARTIAL DEFER** | Only affinity-only is doable now; full ablation waits on migration |
| §5.4 Dispatch coupling validation | 🚧 **NEW DATA NEEDED** | Re-run wall-clock for 5 baselines (after the Phase 1 patch) |
| §5.5 Sensitivity | 🚧 **PARTIAL DEFER** | λ / skew / KV pool doable now; `T_hot` / `T_cool` wait on migration |
| §5.6 Migration microbench | 🚧 **FULL DEFER** | Depends entirely on migration validation |
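**Background**: the team's 4 previous migration attempts were all reverted because of transfer overhead (see `analysis/unified_routing_fix_review.md`). The DR-fix from the recent `connector_tax` work cut the `build_connector_meta` overhead from 1.4 ms/step to nearly 0, but no full migration experiment has been run since. **EAR's migration component is currently design intent; empirical results will be written in after the re-test.**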
---
## §1 Introduction
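Agentic LLM workloads, in which an LLM drives itself through tool calls over multiple turns to complete a task, have become the dominant load on inference systems, yet scheduling policies designed for chatbots generally fail under agentic traffic. This paper first characterizes the essential differences between agentic and chatbot workloads, then explains why the three mainstream classes of scheduling all fall short, and finally presents the EAR design.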
@@ -14,8 +34,8 @@ Agentic LLM workload —— 由 LLM 通过 tool call 自驱、多 turn 完成任
- **C2 EAR design**: a scheduler built on two pillars. Affinity-default routing captures intra-session locality; hot-instance-triggered session migration moves a whole session's KV to a lighter instance when a hotspot appears, avoiding hot pins.
- **C3 Evaluation**: on a real Qwen3-Coder agentic trace, EAR simultaneously dominates the 5 baselines on five dimensions: TTFT, TPOT, APC, hotspot index, and wall-clock.
![Figure 1: Teaser — wall-clock vs trace-time across schedulers](figs/f1_teaser.png)
**TBD**: plot the (wall-clock / trace-time) ratio of the 5 baselines + EAR as a bar chart or scatter plot, with a y=1 reference line.
**Figure 1: Teaser, wall-clock vs trace-time across schedulers** (`figs/f1_teaser.png`) **🚧 TBD (NEW DATA NEEDED)**
> Needs Phase 3 measurements: 5 baselines × 3 runs of trace replay, extract `amplification = wall_clock_s / trace_span_s` from each summary (Phase 1 patch already exposes the field). Plot as bar chart with y=1 reference line. The EAR row stays TBD until migration validation.
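
As a minimal sketch of the extraction step described above, the snippet below averages the amplification ratio per policy across runs. It assumes each replay summary is a dict carrying `wall_clock_s` and `trace_span_s`; the actual summary layout of `replayer/replay.py` is not shown here, so the key names and the helpers `amplification` / `mean_amplification` are illustrative only.

```python
from statistics import mean


def amplification(summary: dict) -> float:
    """Return wall_clock_s / trace_span_s for one replay summary.

    Assumption: the summary is a dict with these two keys.
    """
    span = summary["trace_span_s"]
    if span <= 0:
        raise ValueError("trace_span_s must be positive")
    return summary["wall_clock_s"] / span


def mean_amplification(runs_by_policy: dict) -> dict:
    """Average the amplification over all runs of each policy."""
    return {
        policy: mean(amplification(s) for s in runs)
        for policy, runs in runs_by_policy.items()
    }
```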
---
@@ -43,8 +63,12 @@ Trace 上 KV reuse 的分解:
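Theoretical APC upper bound: any-session **80.3%**, intra-session-only **79.6%**, a gap of <1pp. **The cache is essentially session-local**: any scheduler that does not preserve session affinity throws away the vast majority of reuse opportunities.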
![Figure 2: KV reuse topology + session skew](figs/f2_workload.png)
**TBD**: one figure with two subplots: a reuse pie chart and a session input-token CDF (highlight top 1% = 46.5%).
**Figure 2: Workload characterization (3 panels)**: existing data can be reused
- `figs/f2a_reuse_topology.png`: bar chart of intra-session 93.2% / cross-session 5.7% / shared 1.1%
- `figs/f2b_session_skew.png`: input-token mass of the top 1% / 5% / 10% sessions
- `figs/f2c_kv_footprint_cdf.png`: per-request KV footprint p50/p90/p95/p99 (p99 = 11.8 GiB, about 12% of H20)
> 📝 Layout TBD: combine the three into a 1×3 figure, or spread them across §2.1 / §2.2 / §2.4, one each.
### §2.3 Dispatch Coupling — Why Locality Dominates
@@ -78,8 +102,8 @@ dL/dε|_{ε=0} = L* / (1 - Λ · N · W'_turn(L*))
```
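**A denominator approaching 0** means the system is approaching KV saturation and the amplification factor diverges. This is why lmetric ran at 8x wall-clock amplification on the 600s trace.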
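
To make the divergence concrete, here is a small numeric sketch of the sensitivity `L* / (1 - Λ·N·W'_turn(L*))`. It treats the four symbols as plain numbers; their definitions live in the part of §2.3 not shown in this hunk, and the function name `coupling_sensitivity` is hypothetical.

```python
def coupling_sensitivity(l_star: float, lam: float, n: int, w_prime: float) -> float:
    """Evaluate L* / (1 - lam * n * w_prime).

    The arguments stand for L*, Λ, N and W'_turn(L*) in the formula;
    they are plain numbers here, with no units assumed.
    """
    denom = 1.0 - lam * n * w_prime
    if denom <= 0.0:
        # At or past the point where the denominator vanishes the
        # linearised model no longer gives a finite answer.
        raise ValueError("lam * n * w_prime must stay below 1")
    return l_star / denom
```

With `L* = 10`, raising the product `Λ·N·W'` from 0.5 to 0.75 doubles the result (20 to 40), and each further halving of the remaining gap to 1 doubles it again.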
![Figure 3: Dispatch coupling — chatbot vs agentic timing schematic](figs/f3_coupling_schematic.png)
**TBD**: a schematic that draws a chatbot timeline (system → human → system → human ...) and an agentic timeline (system → system → system ...) side by side, marking how the missing `T_human` buffer lets `W_turn` feed directly into the `Λ` loop.
**Figure 3: Dispatch coupling schematic** (`figs/f3_coupling_schematic.png`) **🚧 TBD (CUSTOM DRAW)**
> A new schematic is needed: top half a chatbot timeline (`system → T_human → system → T_human → ...`), bottom half an agentic timeline (`system → ε → system → ε → ...`), with a feedback-loop arrow `W_turn → Λ → L → W_turn` overlaid on the right. TikZ, draw.io, or matplotlib annotate would all work.
### §2.4 Takeaway
@@ -103,8 +127,12 @@ Round-robin 和 load-aware routing如 LMetric, OSDI'26最大化 instance
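Session-to-instance binding restores locality: APC reaches **77.2%**, 97% of the upper bound. But it locks the large sessions in the skew onto a single instance: the **interference index doubles from LMetric's 6.53 to 13.65** (same trace, same hardware). This violates the skew-tolerance requirement of §2.4.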
![Figure 4: Three baselines, three failure modes](figs/f4_baseline_failures.png)
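**TBD**: a bar chart or radar of three baselines × three failure dimensions, so that it is immediately visible that each baseline has one conspicuously high or low bar on the dimension where it fails.
**Figure 4: Three baselines, three failure modes**: split into three subfigures placed in §3.1, §3.2, and §3.3 respectively
- §3.1 `figs/f4a_apc_loss.png`: measured APC vs the theoretical upper bound of 79.6% (lmetric 56.9%, load_only 54.1%, capped 31.6%, sticky 77.2%, unified 79.4%)
- §3.2 `figs/f4b_pdsep_kv_wall.pdf`: D KV pool occupancy vs per-request KV footprint; both 4P+4D and 6P+2D cross the 90% memory wall in the agentic regime
- §3.3 `figs/f4c_apc_vs_hotspot_tradeoff.png`: scatter of APC vs hotspot index; unified/sticky sit at high APC and high hotspot, lmetric/load_only at low APC and low hotspot
> 📝 Optional: `figs/f4d_pd_interference.png` ✅, prefill-decode interference (an 8k prefill on the same GPU degrades TPOT by 66x); place it in §3.3 to support the interference argument against sticky.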
### §3.4 Takeaway
@@ -118,8 +146,8 @@ Round-robin 和 load-aware routing如 LMetric, OSDI'26最大化 instance
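EAR is a router that sits above N homogeneous instances. Every instance is symmetric and PD-colocated, with no static P/D partition. For each session the router maintains a **host binding**: the instance that currently holds that session's KV state. The binding is stable in the normal case and changes only through migration when a hotspot triggers it.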
![Figure 5: EAR architecture and request flow](figs/f5_architecture.png)
**TBD**: a component diagram: router (with the session-to-host table) → N symmetric instances, with the migration path drawn as a dashed line.
**Figure 5: EAR architecture and request flow** (`figs/f5_architecture.png`) **🚧 TBD (CUSTOM DRAW)**
> Component diagram: router (with the session-to-host table) → N symmetric instances; affinity path as a solid line, migration path as a dashed line. TikZ or draw.io would work.
### §4.2 Pillar 1: Affinity-Default Routing
@@ -127,10 +155,12 @@ EAR 是位于 N 个同质 instance 之上的 router。每个 instance 是对称
- **Warm path**: every subsequent turn of an established session is routed to its current host.
- **Effect**: intra-session KV reuse is preserved by construction; APC approaches the §2.2 upper bound of 79.6%.
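
The warm path can be sketched as a lookup in a session-to-host table. The sketch below is illustrative, not the `unified` implementation: the class name `AffinityRouter` is made up, the cold-path rule (first turn goes to the least-loaded instance) is an assumption because that bullet falls outside this hunk, and "load" is simply a count of routed requests standing in for whatever signal the real router uses.

```python
class AffinityRouter:
    """Toy affinity-default router over n symmetric instances."""

    def __init__(self, n_instances: int) -> None:
        self.load = [0] * n_instances   # stand-in load signal (assumption)
        self.host: dict[str, int] = {}  # session id -> bound instance index

    def route(self, session_id: str) -> int:
        h = self.host.get(session_id)
        if h is None:
            # Cold path (assumed rule): bind to the least-loaded instance.
            h = min(range(len(self.load)), key=self.load.__getitem__)
            self.host[session_id] = h
        # Warm path: an established session always goes to its bound host.
        self.load[h] += 1
        return h
```

Because the binding is never rewritten here, this sketch behaves like pure sticky routing; Pillar 2 is what would change `host[session_id]`.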
### §4.3 Pillar 2: Hot-Triggered Session Migration
### §4.3 Pillar 2: Hot-Triggered Session Migration 🚧 DEFERRED VALIDATION
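The key mechanism that keeps Pillar 1 from degenerating into pure sticky.
> **Status**: the design description is complete, but empirical data awaits a re-test after the `connector_tax` DR-fix. The 4 previous migration attempts (`6b255fa`, `e991960/5772149`, `cc6e562`, `4c583f2`) were all reverted because of transfer overhead: until the DR-fix, the measured gains from migration were always swallowed by the overhead. A new round of validation has not been run.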
#### §4.3.1 Trigger signal
EAR monitors each instance's **pending prefill tokens** in real time. When a request arrives and affinity says it should be routed to host H, the router first checks:
@@ -185,8 +215,9 @@ KV transfer 发生在触发该 migration 的 request 的 critical path 上,但
### §5.2 End-to-end Performance
![Figure 6: End-to-end metrics across 6 schedulers](figs/f6_e2e.png)
**TBD**: four subplots (TTFT / TPOT / APC / hotspot) or a single radar chart.
**Figure 6: End-to-end performance** (`figs/f6_e2e_latency_bars.png`, PARTIAL)
> Existing: TTFT/TPOT/E2E p90 bar chart × 5 policies (lmetric / load_only / sticky / unified / capped).
> **🚧 TBD (NEW DATA)**: the `static PD-disagg` column is missing; the EAR column is also TBD and needs migration validation. A second figure in the same format covering all 6 baselines still has to be added.
| Scheduler | TTFT p50 | TTFT p90 | TPOT p90 | APC | Hotspot idx | Wall-clock factor |
|---|---|---|---|---|---|---|
@@ -199,7 +230,7 @@ KV transfer 发生在触发该 migration 的 request 的 critical path 上,但
(Bold numbers come from measurements of the existing "unified" prototype.)
### §5.3 Ablation
### §5.3 Ablation 🚧 PARTIAL DEFER
We disable each of the two pillars independently:
@@ -207,8 +238,8 @@ KV transfer 发生在触发该 migration 的 request 的 critical path 上,但
- **EAR (migration only)**: cold-balance + reactive migration, without affinity; measures whether migration can stand on its own.
- **EAR (full)**: both pillars enabled.
![Figure 7: Ablation across 4 configurations](figs/f7_ablation.png)
**TBD**: a 2×2 grid or side-by-side bar charts.
**Figure 7: Ablation** (`figs/f7_ablation.png`) **🚧 TBD, DEFERRED (BLOCKED ON MIGRATION VALIDATION)**
> The full ablation needs three configurations: migration-only / both / affinity-only. Migration-only and both depend on the migration re-test. For now a two-point comparison of affinity-only vs load-balance is possible (data already exists: unified 79.4% APC vs lmetric 56.9% APC).
Expected conclusion: affinity-only gets locality but doubles interference; migration-only cannot capture locality; both are required.
@@ -219,8 +250,8 @@ KV transfer 发生在触发该 migration 的 request 的 critical path 上,但
- Mean per-turn service time `W_turn` (x)
- Wall-clock / trace-time amplification (y)
![Figure 8: Wall-clock amplification vs per-turn service time](figs/f8_coupling_measured.png)
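**TBD**: one point per baseline, with the theoretical curve `1/(1 - Λ·N·W'(L*))` overlaid as a reference line.
**Figure 8: Wall-clock amplification vs per-turn service time** (`figs/f8_coupling_measured.png`) **🚧 TBD (NEW DATA)**
> Scatter: x = mean per-turn `W_turn` (computed as TTFT + decode_time from per-request metrics), y = amplification (`wall_clock / trace_span`, already exposed by the Phase 1 patch). One point per baseline. Theoretical curve `L*/(1 - Λ·N·W'(L*))` overlaid (optional). This is the empirical closure of the §2.3 argument and has the **highest priority**.
Expected: EAR sits in the corner with the smallest `W_turn` and the lowest amplification factor.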
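
One way to derive a single scatter point per baseline, as a rough sketch: x is the mean of TTFT + decode time over per-request records, y is read from the `amplification` field that the Phase 1 patch adds to the summary. The per-request key names `ttft_s` and `decode_time_s` and the helper `scatter_point` are assumptions for illustration; the real metric schema is not shown in this diff.

```python
from statistics import mean


def scatter_point(requests: list[dict], summary: dict) -> tuple[float, float]:
    """Return (mean W_turn, amplification) for one baseline run.

    requests: per-request metric dicts with hypothetical keys
              'ttft_s' and 'decode_time_s'.
    summary:  replay summary dict containing 'amplification'.
    """
    if not requests:
        raise ValueError("need at least one request record")
    w_turn = mean(r["ttft_s"] + r["decode_time_s"] for r in requests)
    return w_turn, summary["amplification"]
```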
@@ -234,9 +265,10 @@ KV transfer 发生在触发该 migration 的 request 的 critical path 上,但
| `T_hot` (migration threshold) | TBD | Trigger too loose → thrash; too strict → missed hotspots |
| `T_cool` (cooldown) | TBD | Same as above |
![Figure 9: Sensitivity heatmaps](figs/f9_sensitivity.png)
**Figure 9: Sensitivity heatmaps** (`figs/f9_sensitivity.png`) **🚧 TBD (NEW DATA, PARTIAL DEFER)**
> The three axes arrival rate / skew / KV pool size can be done now (no dependence on migration); the `T_hot` / `T_cool` axes depend on migration validation and are deferred.
### §5.6 Migration Microbenchmark
### §5.6 Migration Microbenchmark 🚧 FULL DEFER
Characterizes EAR's internal migration behavior:
@@ -245,8 +277,8 @@ KV transfer 发生在触发该 migration 的 request 的 critical path 上,但
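- Migration accuracy: the fraction of migrations after which the target instance stays non-hot for the next TBD turns
- Thrashing rate: the share of sessions migrated more than once within the cooldown window (should be 0)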
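
The thrashing-rate metric could be computed from a migration event log roughly as follows. This is a sketch under stated assumptions: events are `(session_id, timestamp_s)` pairs, the cooldown is expressed in seconds, and the denominator is the number of sessions that migrated at least once (the outline does not say whether it should instead be all sessions). The name `thrashing_rate` is hypothetical.

```python
from collections import defaultdict


def thrashing_rate(migrations, t_cool_s: float) -> float:
    """Share of migrated sessions with two migrations < t_cool_s apart.

    migrations: iterable of (session_id, timestamp_s) pairs (assumed format).
    """
    by_session = defaultdict(list)
    for session_id, ts in migrations:
        by_session[session_id].append(ts)
    if not by_session:
        return 0.0
    thrashed = 0
    for times in by_session.values():
        times.sort()
        # Any pair of consecutive migrations closer than the cooldown counts.
        if any(b - a < t_cool_s for a, b in zip(times, times[1:])):
            thrashed += 1
    return thrashed / len(by_session)
```

If the cooldown is enforced correctly by the router, this function should return 0.0 on any real log.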
![Figure 10: Migration timeline over a 60s window](figs/f10_migration_timeline.png)
**TBD**: a heatmap of each instance's pending prefill tokens over time, with migration events marked by arrows.
**Figure 10: Migration timeline** (`figs/f10_migration_timeline.png`) **🚧 TBD, DEFERRED (BLOCKED ON MIGRATION VALIDATION)**
> A heatmap of each instance's pending prefill tokens over time, with migration events marked by arrows. Depends entirely on the migration re-test.
---
@@ -273,11 +305,43 @@ KV transfer 发生在触发该 migration 的 request 的 critical path 上,但
---
## Open Items (must be resolved before writing the body text)
## Work Plan
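- [ ] **§4.3.1**: actual value of `T_hot` (pending prefill token threshold)
- [ ] **§4.3.4**: actual value of `T_cool` (cooldown in seconds / turns)
- [ ] **§5.1**: finalize instance count, total trace length, and baseline configuration
- [ ] **§4.4**: decide the storage medium for the session-to-host table
- [ ] **§1 / §5.2**: measured wall-clock amplification for EAR (the key number)
- [ ] **Figures**: f1 to f10 are all TBD (placeholders only for now)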
### ✅ Done
- [x] §1 anchor sentence + contribution bullets
- [x] §2 outline + reuse existing characterization figures (`f2a`/`f2b`/`f2c`)
- [x] §3.1 / §3.2 / §3.3 outline + reuse existing baseline failure figures (`f4a`/`f4b`/`f4c`/`f4d`)
- [x] §4 design description (§4.3 awaits empirical validation)
- [x] §5.2 partial figure (`f6` 5/6 baselines)
- [x] `replayer/replay.py` patched to emit `trace_span_s` + `amplification` in summary
### 🟢 Can do without migration (paper writing now possible)
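- [ ] Draft the body text of §1-§4 (all data available, figures copied)
- [ ] Draft the body text of the §2.3 dispatch coupling section (the math has already been derived in conversation)
- [ ] Draft the body text of the three §3 failure modes
- [ ] §5.4 wall-clock amplification measurement (5 baselines × 3 runs), **highest priority**: this is the empirical closure of §2.3
- [ ] §5.2: add static PD-disagg to the `f6` figure (re-run, or merge existing PD-sep data)
- [ ] §5.5 sensitivity on the three axes λ / skew / KV pool
- [ ] §3: decide the latex/markdown layout for the three independent subfigures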
### 🚧 Deferred (pending migration validation)
- [ ] §4.3 migration mechanism re-test (run after the `connector_tax` DR-fix)
- [ ] §5.3 full ablation (the migration-only and both configurations)
- [ ] §5.5 sensitivity on the `T_hot` / `T_cool` axes
- [ ] §5.6 migration microbench, in full
- [ ] §1 teaser (`f1`): the EAR column
- [ ] §5.2 table: the EAR row
- [ ] §4.3.1 / §4.3.4: values of `T_hot` and `T_cool`
### 🎨 Custom drawings (paper-writing stage)
- [ ] `f3_coupling_schematic.png`: chatbot vs agentic timeline + feedback loop
- [ ] `f5_architecture.png`: EAR component diagram
### ❓ Open design decisions
- [ ] §4.4 storage medium for the session-to-host table (in-memory dict vs Redis)
- [ ] §5.1 final values for instance count and total trace length