The previous §2.3 narrative said "chatbot has T_human ≈ 30 s think-time,
agentic has T_external ≈ 0, so agentic is always closed-loop and chatbot
never is". The new T_external measurements on the production chatbot
trace (qwen3-max, n = 42k inter-turn gaps from formatted parent_chat_id
sessions) show the binary framing is wrong:
    agentic  p50 1.6 s, 39% gaps < 1 s, p99 738 s
    chatbot  p50 7.2 s,  4% gaps < 1 s, p99  43 s
Both have nonzero T_external. The right distinction is the *shape*:
chatbot is unimodal around 5–15 s (human cadence); agentic is bimodal
with a sub-second tool-call mass (39% vs chatbot's 4%) plus a long-pause
tail (13% > 30 s). The agentic sub-second mass is what activates
dispatch coupling: under any scheduler with W_turn > 1 s, those turns
satisfy W_turn > T_external by construction.
The empirical regime split:
    unified TTFT p90 =  7.3 s → agentic 73% closed-loop, chatbot 32%
    lmetric TTFT p90 = 15.7 s → agentic 80% closed-loop, chatbot 88%
lmetric is bad enough that it drags the chatbot regime into closed-loop
too. This is a direct empirical explanation for lmetric underperforming
on both workloads.
Updates:
- PAPER_OUTLINE.md §2.3: lead with the regime threshold W_turn ≷
T_external, replace the "T_human dominates" Little's Law with the
general form L = Λ · N · (W_turn(L) + T_external), embed f3a CDF,
add the empirical regime table; correct the small-perturbation
formula to include the +T_external dampening term.
- MEETING.md §1: same reframe, condensed (CDF figure, two-row regime
table, one-line conclusion).
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
138 lines
8.5 KiB
Markdown
# EAR: Agentic Serving Scheduler Report

**One-liner**: In agentic workloads, 93% of KV reuse is intra-session, and the tool-call feedback coupling between turns amplifies per-request latency differences into throughput gaps, so locality is the dominant scheduling lever. Existing load balancing discards locality, static PD-disagg hits the decode-side (D) KV wall, and pure sticky routing creates hot pins. We propose EAR (Elastic Affinity Router) = session-affinity routing + session migration triggered by hot instances.

---
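## 1. Key Insight: Dispatch Coupling

Every pair of consecutive turns is separated by an external gap `T_external` (for chatbot, the human reading, thinking, and typing; for agentic, tool execution). **Little's Law `L = Λ · N · (W_turn + T_external)`** holds for both workloads; the difference lies in where the `T_external` distribution sits relative to `W_turn`:
- `T_external ≫ W_turn` → open-loop regime: a small scheduler regression does not move L
- `T_external ≲ W_turn` → closed-loop regime: `W_turn(L)` is coupled to L through KV contention, and the feedback loop amplifies a scheduler regression of ε several-fold
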
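
The two regimes can be made concrete with a small fixed-point sketch. It assumes that Λ is the session arrival rate, that N is the number of turns per session, and that `W_turn(L) = w_base + k·L` is a linear contention model; the linear form, the function names, and all numbers are illustrative assumptions, not taken from the trace or from the paper's derivation.

```python
def steady_state_load(arrival_rate, turns_per_session, t_external,
                      w_base, k_contention, iters=500):
    """Fixed point of L = Λ·N·(W_turn(L) + T_external).

    W_turn(L) = w_base + k_contention * L is a hypothetical linear
    contention model, chosen only to make the feedback loop explicit.
    """
    loop_gain = arrival_rate * turns_per_session * k_contention
    if loop_gain >= 1.0:
        raise ValueError("loop gain >= 1: the load has no stable fixed point")
    load = 0.0
    for _ in range(iters):
        w_turn = w_base + k_contention * load
        load = arrival_rate * turns_per_session * (w_turn + t_external)
    return load


def w_turn_amplification(eps, **params):
    """Closed-loop change in W_turn divided by the injected regression eps."""
    base = steady_state_load(**params)
    worse = steady_state_load(**{**params, "w_base": params["w_base"] + eps})
    return (eps + params["k_contention"] * (worse - base)) / eps


def relative_load_shift(eps, **params):
    """Relative change in L caused by the same regression eps."""
    base = steady_state_load(**params)
    worse = steady_state_load(**{**params, "w_base": params["w_base"] + eps})
    return (worse - base) / base


# Illustrative numbers only; none of them come from the trace.
agentic = dict(arrival_rate=0.05, turns_per_session=20, t_external=1.6,
               w_base=2.0, k_contention=0.75)
chatbot = {**agentic, "t_external": 30.0}

print(w_turn_amplification(0.1, **agentic))
print(relative_load_shift(0.1, **agentic), relative_load_shift(0.1, **chatbot))
```

Under this linear assumption the amplification equals `1 / (1 − Λ·N·k)`, so a loop gain of 0.75 turns a regression of ε into roughly 4ε, and a larger `T_external` shrinks the relative shift in L for the same ε.
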

**Measured `T_external` distribution on the production trace** (next.start − prev.end, with formatted session chains as ground truth):

![](figs/f3a_external_gap_cdf.png)

| | Agentic | Chatbot |
|---|---:|---:|
| p50 | **1.6s** | **7.2s** |
| gap < 1s | **39%** | 4% |
| gap < 5s | 67% | 29% |
| p99 | 738s | 43s |

The two distributions have completely different shapes: chatbot is unimodal, concentrated at 5–15s (human cadence); agentic is bimodal, with **39% of gaps sub-second (autonomous tool-call mode)** plus a long tail of 13% > 30s. **The agentic sub-second mass is essentially absent in chatbot (4%)**, and it is exactly what activates dispatch coupling.

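
A minimal sketch of how such gap statistics could be computed from per-turn records. The `(session_id, start_s, end_s)` tuple layout, the function names, and the nearest-rank percentile are assumptions made for illustration; the actual trace schema is not shown in this report.

```python
import math
from collections import defaultdict


def inter_turn_gaps(turns):
    """turns: iterable of (session_id, start_s, end_s) tuples (assumed layout)."""
    by_session = defaultdict(list)
    for session_id, start_s, end_s in turns:
        by_session[session_id].append((start_s, end_s))
    gaps = []
    for spans in by_session.values():
        spans.sort()  # order turns by start time within a session
        for (_, prev_end), (next_start, _) in zip(spans, spans[1:]):
            gaps.append(next_start - prev_end)  # next.start - prev.end
    return gaps


def summarize(gaps):
    ordered = sorted(gaps)
    n = len(ordered)

    def percentile(q):  # nearest-rank definition
        return ordered[max(0, math.ceil(q * n) - 1)]

    def share(predicate):
        return sum(1 for g in ordered if predicate(g)) / n

    return {
        "p50": percentile(0.50),
        "p99": percentile(0.99),
        "lt_1s": share(lambda g: g < 1.0),
        "lt_5s": share(lambda g: g < 5.0),
        "gt_30s": share(lambda g: g > 30.0),
    }
```

Running `summarize(inter_turn_gaps(turns))` once per workload would yield one column of the table above.
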
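
**Measured regime**: under unified (TTFT p90 = 7.3s), **73% of agentic turns push the system into closed loop** (W_turn > T_external), versus only 32% for chatbot. Under lmetric (15.7s), agentic reaches 80% and **chatbot also reaches 88%**: lmetric drags even chatbot into closed loop, which is the root cause of its underperformance on both workloads.
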
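
The closed-loop share is the fraction of turns for which `W_turn > T_external`. The sketch below assumes each turn comes with its own `(w_turn_s, t_external_s)` pair; how the report pairs a latency with each gap is not specified here, so the pairing and the sample values are hypothetical.

```python
def closed_loop_fraction(turn_pairs):
    """turn_pairs: iterable of (w_turn_s, t_external_s), one pair per turn."""
    pairs = list(turn_pairs)
    if not pairs:
        return 0.0
    closed = sum(1 for w_turn, t_external in pairs if w_turn > t_external)
    return closed / len(pairs)


# Hypothetical samples: two sub-second/short gaps and two long pauses.
sample = [(7.3, 0.4), (7.3, 1.6), (7.3, 12.0), (7.3, 45.0)]
print(closed_loop_fraction(sample))
```
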

**Result**: lmetric takes 49 min of wall-clock time to replay a 600s trace = **8x amplification**. **Small differences in per-turn W are amplified into order-of-magnitude wall-clock gaps**: locality is not a nice-to-have, it is the dominant lever.

---
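## 2. Workload Evidence (Three Findings)

| Finding | Data | Figure |
|---|---|---|
| KV reuse is almost entirely intra-session | intra 93.2% / cross 5.7% / shared 1.1% | ![](figs/f2a_kv_reuse_breakdown.png) |
| Sessions are extremely skewed | on the production trace, the top 1% / 5% / 10% / 25% / 50% of sessions hold **46.5% / 66.5% / 74.6% / 87.5% / 96.0%** of input mass | ![](figs/f2b_session_skew.png) |
| Per-request KV footprint is large, so a single instance's KV pool fills up quickly | per-instance KV pool ≈ **38 GiB** (0.4 × 96 GiB on H20; the remainder is 50% params + 10% activation); a p99 request takes 11.5 GiB → one instance fits only **3 p99 decodes**; 4P+4D halves the system's decode capacity | ![](figs/f2c_kv_footprint.png) |

Theoretical APC upper bound = intra-session 79.6% / any-session 80.3%, a difference of <1pp. **Any scheduler without affinity forfeits the vast majority of reuse.**
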
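
The KV-pool row of the table can be reproduced with a few lines of arithmetic. The figures 0.4, 96 GiB, and 11.5 GiB come from the table; the fleet size of 8 instances and the reading of "decode capacity" as total KV pool on decode-capable instances are assumptions of this sketch.

```python
import math

HBM_GIB = 96.0             # H20 memory per instance (from the table)
KV_FRACTION = 0.4          # share of HBM left for the KV pool (from the table)
P99_REQUEST_KV_GIB = 11.5  # KV footprint of a p99 request (from the table)
FLEET_SIZE = 8             # assumed fleet size

kv_pool_gib = KV_FRACTION * HBM_GIB
p99_decodes_per_instance = math.floor(kv_pool_gib / P99_REQUEST_KV_GIB)


def decode_kv_capacity_gib(decode_instances):
    """Total KV pool available on instances that are allowed to decode."""
    return decode_instances * kv_pool_gib


colocated = decode_kv_capacity_gib(FLEET_SIZE)  # every instance decodes
split_4p4d = decode_kv_capacity_gib(4)          # only the 4 D instances decode

print(f"KV pool per instance: {kv_pool_gib:.1f} GiB")
print(f"p99 decodes per instance: {p99_decodes_per_instance}")
print(f"4P+4D decode capacity vs colocated: {split_4p4d / colocated:.0%}")
```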

---

## 3. Three Failure Modes of Existing Schedulers

### Load-balanced (LMetric / round-robin / kv-aware): locality is lost

![](figs/f4a_apc_by_policy.png)

LMetric reaches 56.9% APC and load_only 54.1%, far below the 79.6% upper bound. The 23pp gap comes directly from intra-session hits lost to cross-instance routing.

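
Note that LMetric beats load_only by only **+3.3pp**. LMetric's score = `(pending_prefill + input − cache_hit) × num_requests`: cache_hit enters only as a subtractive term in the cost model, but the score is **multiplicative**, so an instance with affinity has its cache benefit swallowed by the multiplier when its num_requests is high, and LMetric still picks a cold instance. Sticky treats cache as a hard constraint and lifts APC straight to 77.2%. **Conclusion: cache-aware load routing is not enough. Affinity must be an independent routing path and cannot be folded into the load cost.**
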
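
A small numeric sketch of the multiplicative effect. The score formula is the one quoted above; the token counts, the request counts, and the assumption that the lowest score wins are illustrative.

```python
def lmetric_score(pending_prefill, input_tokens, cache_hit, num_requests):
    """Score as quoted above; lower is assumed to be preferred."""
    return (pending_prefill + input_tokens - cache_hit) * num_requests


# Warm instance: holds 28k of the 30k input tokens in cache, but is busy.
warm = lmetric_score(pending_prefill=8_000, input_tokens=30_000,
                     cache_hit=28_000, num_requests=6)
# Cold instance: no cached prefix, nearly idle.
cold = lmetric_score(pending_prefill=2_000, input_tokens=30_000,
                     cache_hit=0, num_requests=1)

chosen = "warm" if warm < cold else "cold"
print(warm, cold, chosen)
```

In this example the warm instance would need to prefill far fewer tokens (10k vs 32k including pending work), yet its score is higher because the request-count multiplier outweighs the cache term, so the cold instance is chosen.
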
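### Static PD-disagg: the decode-side KV capacity wall

![](figs/f4c_pd_disagg_kv_wall.png)

The average agentic request has 33.6k tokens and needs 3.3GB of KV; both 4P+4D and 6P+2D cross the 90% memory wall in the agentic regime. **TTFT p50 blows up by 62-72x, and the success rate drops from 99.5% to 52-68%.**
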
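### Pure sticky: every worker is dragged down by hot sessions

![](figs/f4b_sticky_hot_pin.png)

We deliberately measure worker imbalance with **two absolute numbers, (median, max)**, rather than the single ratio `max/median`. The ratio would rank unified (one worker sacrificed, the other 7 fast) as more imbalanced than sticky (all workers slow together), which inverts the actual ordering of system e2e p90. The absolute numbers:

| | median worker TTFT p90 | max worker TTFT p90 | system e2e p90 |
|---|---:|---:|---:|
| sticky | **20.3s** | 55.4s | **34.6s** |
| unified | **10.3s** | 37.7s | **18.0s** |

Mechanism: on the production trace the top 1% of sessions account for 46.5% of input volume, and hot sessions far outnumber the instances (8), so sticky's hash binding makes **every worker carry its own share of hot sessions**, and even the median worker slows down. Unified uses the LMetric fallback to reroute cold/new sessions to non-hot workers, preserving the speed of 7/8 workers. System p90 is determined by the majority of requests, so unified is almost 2x faster.
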
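
The preference for (median, max) over `max/median` can be checked against the table above: the ratio is about 2.73 for sticky and about 3.66 for unified, so the ratio alone would favor sticky even though unified has the lower system e2e p90. The snippet below only re-derives this from the table values; the dictionary layout is an illustrative choice.

```python
# Values copied from the table above (seconds).
workers = {
    "sticky":  {"median": 20.3, "max": 55.4, "e2e_p90": 34.6},
    "unified": {"median": 10.3, "max": 37.7, "e2e_p90": 18.0},
}


def max_over_median(row):
    return row["max"] / row["median"]


# Rank policies from "best" to "worst" under each metric.
by_ratio = sorted(workers, key=lambda name: max_over_median(workers[name]))
by_e2e = sorted(workers, key=lambda name: workers[name]["e2e_p90"])

for name, row in workers.items():
    print(f"{name}: max/median = {max_over_median(row):.2f}, "
          f"e2e p90 = {row['e2e_p90']}s")
print("best by ratio:", by_ratio[0], "| best by e2e p90:", by_e2e[0])
```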

---

## 4. EAR Design

Two pillars; all instances are symmetric and PD-colocated (no static P/D partition):

**Pillar 1: Affinity-default routing (implemented)**
A new session is assigned a host by load balancing; subsequent turns are routed by the session→host binding.
→ This is the current `unified` algorithm (hybrid LMetric + high-cache affinity), with APC 79.4%, reaching 97% of the upper bound.

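
The routing rule of Pillar 1 can be sketched as follows. This is a simplified illustration, not the `unified` implementation: the class name, the use of pending tokens as the load signal, and the least-loaded choice for new sessions are assumptions standing in for the hybrid LMetric logic.

```python
class AffinityRouter:
    """Session-affinity routing with load-balanced placement of new sessions."""

    def __init__(self, instance_ids):
        # Hypothetical load signal: pending tokens per instance.
        self.load = {instance_id: 0 for instance_id in instance_ids}
        self.binding = {}  # session_id -> instance_id

    def route(self, session_id, input_tokens):
        host = self.binding.get(session_id)
        if host is None:
            # New session: place it on the least-loaded instance.
            host = min(self.load, key=self.load.get)
            self.binding[session_id] = host
        self.load[host] += input_tokens
        return host

    def complete(self, host, input_tokens):
        # Called when a turn finishes; releases its share of the load.
        self.load[host] = max(0, self.load[host] - input_tokens)


router = AffinityRouter(["inst-0", "inst-1"])
first = router.route("session-a", 100)   # new session, least-loaded host
other = router.route("session-b", 50)    # new session goes to the idle host
again = router.route("session-a", 10)    # follows the binding, not the load
print(first, other, again)
```

The third call returns the bound host even though it is the more loaded one, which is the behavior that keeps intra-session KV reuse on a single instance.
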

**Pillar 2: Hot-triggered session migration (end-to-end evidence pending, substrate validated)**
When a host's `pending_prefill_tokens > T_hot`, the entire session's KV is migrated to a lighter instance through the mooncake `kv_connector`; the session binding is updated, and subsequent turns are routed to the new host.

> 🆕 **2026-05-27 data** (commit `ef9e010`): the `kv_both` substrate overhead previously considered the migration blocker no longer exists. A/B/C comparison on an 8×TP1 trace replay:
> - plain unified: TTFT p90 = 11.97s
> - unified + `kv_both` (without DR-fix): 9.74s (**−18.6%** vs plain)
> - unified + `kv_both` + DR-fix: 7.58s (**−36.6%** vs plain)
>
> That is, the "+45% kv_both penalty" in the original elastic_migration_v2 paper is obsolete; the current substrate is **net positive** (in connector mode, `delay_free_blocks=True` lengthens the cross-turn cache-hit window on a trace with 93% intra-session reuse). The main cause of the 4 earlier migration reverts is gone.

Key design points:
- Target selection uses **observable pending prefill tokens**, **not** a cost-model prediction (the mooncake cost model was measured to be off by 10-21x, so it is bypassed)
- Per-session cooldown to prevent thrashing
- If no candidate instance can hold the session context → keep the current binding; migration is opportunistic, not mandatory

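
The three design points above can be sketched as a single decision function. Everything named here (`Instance`, `free_kv_tokens`, `pick_migration_target`, the cooldown bookkeeping, and the extra requirement that the target be lighter than the host) is a hypothetical illustration of the policy layer, which the report notes is not yet validated end to end. Returning `None` means "keep the current binding".

```python
from dataclasses import dataclass


@dataclass
class Instance:
    name: str
    pending_prefill_tokens: int  # observable load signal
    free_kv_tokens: int          # assumed measure of spare KV capacity


def pick_migration_target(session_tokens, host, instances, t_hot,
                          last_migrated_at, now, cooldown_s):
    """Return the target Instance, or None to keep the current binding."""
    # Trigger: only migrate away from a hot host.
    if host.pending_prefill_tokens <= t_hot:
        return None
    # Per-session cooldown to prevent thrashing.
    if last_migrated_at is not None and now - last_migrated_at < cooldown_s:
        return None
    # Candidates must fit the session context and be lighter than the host.
    candidates = [
        inst for inst in instances
        if inst is not host
        and inst.free_kv_tokens >= session_tokens
        and inst.pending_prefill_tokens < host.pending_prefill_tokens
    ]
    if not candidates:
        return None  # opportunistic, not mandatory
    # Use the observable signal directly; no cost-model prediction.
    return min(candidates, key=lambda inst: inst.pending_prefill_tokens)


hot = Instance("inst-0", pending_prefill_tokens=90_000, free_kv_tokens=10_000)
roomy = Instance("inst-1", pending_prefill_tokens=5_000, free_kv_tokens=200_000)
tight = Instance("inst-2", pending_prefill_tokens=1_000, free_kv_tokens=20_000)
fleet = [hot, roomy, tight]

target = pick_migration_target(50_000, hot, fleet, t_hot=60_000,
                               last_migrated_at=None, now=100.0, cooldown_s=30.0)
print(target.name if target else "keep binding")
```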

---

## 5. Progress & TODO

### ✅ Done

- Evidence for all three workload characterization findings is complete (`f2a/b/c`)
- Evidence for the failure of the three baseline classes is complete (`f4a/b/c/d`)
- Anchor + paper outline (`PAPER_OUTLINE.md`)
- Pillar 1 affinity routing is implemented and tested (the current `unified` algorithm)
- Formal Little's Law derivation of dispatch coupling
- `replayer/replay.py` patched to output `amplification`
- 🆕 **kv_both substrate validation** (commit `ef9e010`): trace replay A/B/C shows the substrate is already net positive (TTFT p90 −18.6%, and −36.6% after DR-fix, vs plain); the original +45% penalty is obsolete

### 🟢 Can be done now, independent of migration

1. **5 baselines × 3 runs wall-clock sweep** (the patched replayer emits the amplification field directly): the empirical closure for §2.3, **highest priority**, can finish in one night
2. Add static PD-disagg to the end-to-end table
3. Sensitivity along three axes: λ / skew / KV pool
4. Draft the §1-§4 body text (data is complete)

### 🚧 Pending migration end-to-end validation

- §4.3 migration mechanism: e2e trigger + target selection experiments (the substrate works; only the policy layer is missing)
- Full ablation (migration-only + both)
- §5.6 migration microbench

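### Risks

- The 4 earlier migration attempts (`6b255fa`, `e991960/5772149`, `cc6e562`, `4c583f2`) were all swallowed by transfer overhead and reverted; **that overhead was verified on 2026-05-27 to no longer exist** (substrate net positive)
- It has not yet been directly verified that the e2e migration policy layer (trigger + target selection) yields a net gain inside the feedback loop; two further risk classes remain, "wrong decisions" and "cooldown thrashing", both independent of the substrate
- Even if migration e2e remains marginal, the evidence for the affinity-only pillar already stands on its own, so the paper has at least a strong-affinity storyline

---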
## 6. One-Sentence Summary of What We Are Selling

> **Agentic workloads turn locality from a nice-to-have into the dominant lever (the dispatch coupling argument); EAR uses affinity-default routing + hot-triggered migration as a single scheme to get both locality and balance. Pillar 1 is validated (APC 79.4%); Pillar 2 has a complete design, with validation pending a re-test on top of DR-fix.**

Next main battleground: run the wall-clock sweep to nail down the §2.3 dispatch coupling argument.