agentic-kvc/MEETING.md
Gahow Wang b11dc30945 §2.3 reframe: dispatch coupling is regime-dependent, not binary chatbot/agentic
The previous §2.3 narrative said "chatbot has T_human ≈ 30 s think-time,
agentic has T_external ≈ 0, so agentic is always closed-loop and chatbot
never is". The new T_external measurements on the production chatbot
trace (qwen3-max, n=42 k inter-turn gaps from formatted parent_chat_id
sessions) show the binary framing is wrong:

  agentic   p50 1.6 s,  39% gaps < 1 s,  p99 738 s
  chatbot   p50 7.2 s,   4% gaps < 1 s,  p99  43 s

Both have nonzero T_external. The right distinction is the *shape*:
chatbot is unimodal around 5–15 s (human cadence); agentic is bimodal
with a sub-second tool-call mass (39 % vs chatbot's 4 %) plus a long-
pause tail (13 % > 30 s). The agentic sub-second mass is what activates
dispatch coupling — for any W_turn > 1 s scheduler those turns satisfy
W_turn ≫ T_external by construction.

The empirical regime split:
                 unified  TTFT p90 = 7.3 s   →  agentic 73% closed-loop, chatbot 32%
                 lmetric  TTFT p90 = 15.7s   →  agentic 80%,             chatbot 88%

lmetric is bad enough that it drags the chatbot regime into closed-loop
too. This is a direct empirical explanation for lmetric underperforming
on both workloads.

Updates:
- PAPER_OUTLINE.md §2.3: lead with the regime threshold W_turn ≷
  T_external, replace the "T_human dominates" Little's Law with the
  general form L = Λ · N · (W_turn(L) + T_external), embed f3a CDF,
  add the empirical regime table; correct the small-perturbation
  formula to include the +T_external dampening term.
- MEETING.md §1: same reframe, condensed (CDF figure, two-row regime
  table, one-line conclusion).

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-27 16:51:38 +08:00


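# EAR: Agentic Serving Scheduler Report
**One-liner**: In agentic workloads, 93% of KV reuse is intra-session, and the tool-call feedback coupling between turns amplifies per-request latency differences into throughput gaps, so locality is the dominant scheduling lever. Existing load balancing loses locality, static PD-disagg hits the decode-side (D) KV wall, and pure sticky routing creates hot pins. We propose EAR (Elastic Affinity Router) = session-affinity routing + session migration triggered by hot instances.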
---
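## 1. Key Insight: Dispatch Coupling
Between consecutive turns there is an external gap `T_external` (for chatbot: a human reading, thinking, and typing; for agentic: tool execution). **Little's Law `L = Λ · N · (W_turn + T_external)`** holds for both workloads; the difference lies in where the `T_external` distribution sits relative to `W_turn`:
- `T_external ≫ W_turn` → open-loop regime: a small scheduler regression leaves L essentially unchanged
- `T_external ≲ W_turn` → closed-loop regime: `W_turn(L)` couples to L through KV contention, and the feedback loop amplifies a scheduler regression of ε severalfold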
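
As a rough illustration of the two regimes above, the sketch below evaluates Little's Law for fixed inputs and labels a turn by comparing `W_turn` with `T_external`. The function names, the reading of `Λ` as a session arrival rate and `N` as turns per session, and the strict `W_turn > T_external` cutoff are assumptions made for illustration; the note itself only gives the formula and the two asymptotic conditions.

```python
def little_load(session_rate: float, turns_per_session: float,
                w_turn: float, t_external: float) -> float:
    """L = Λ · N · (W_turn + T_external), with every input held fixed."""
    return session_rate * turns_per_session * (w_turn + t_external)


def regime(w_turn: float, t_external: float) -> str:
    """Label a turn closed-loop when its service time exceeds the external gap."""
    return "closed-loop" if w_turn > t_external else "open-loop"


def open_loop_sensitivity(w_turn: float, t_external: float, eps: float) -> float:
    """Relative change in L when W_turn grows by a fraction eps with no feedback.

    Since L is proportional to (W + T): dL / L = eps * W / (W + T).
    """
    return eps * w_turn / (w_turn + t_external)


if __name__ == "__main__":
    # Illustrative values only; they are not a reproduction of the trace analysis.
    print(regime(w_turn=7.3, t_external=1.6))
    print(regime(w_turn=7.3, t_external=30.0))
    print(open_loop_sensitivity(w_turn=1.0, t_external=9.0, eps=0.10))
```

When `T_external` is much larger than `W_turn`, the same ε regression moves L by only a small fraction, which is the open-loop damping. The closed-loop case needs a model of `W_turn(L)`, which this sketch does not attempt.
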
**Measured `T_external` distribution on the production trace** (`next.start - prev.end`, with formatted session chains as ground truth):
![](figs/f3a_inter_turn_gap.png)
| | Agentic | Chatbot |
|---|---:|---:|
| p50 | **1.6s** | **7.2s** |
| gap < 1s | **39%** | 4% |
| gap < 5s | 67% | 29% |
| p99 | 738s | 43s |
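The two distributions have completely different shapes: chatbot is unimodal, concentrated at 5–15s (human cadence); agentic is bimodal, with **39% of gaps sub-second (autonomous tool-call mode)** plus a long tail of 13% > 30s. **The agentic sub-second mass is something chatbot does not have**, and it is exactly what activates dispatch coupling.
**Measured regime**: under unified (TTFT p90 = 7.3s), **73% of agentic turns push the system into closed loop** (W_turn > T_external), versus only 32% for chatbot. Under lmetric (15.7s), agentic is at 80% and **chatbot also reaches 88%**: lmetric drags even the chatbot workload into closed loop, which is the root cause of its underperformance on both workloads.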
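
A minimal sketch of how a closed-loop share like the ones above could be computed from inter-turn gaps. It assumes a single scalar `w_turn_s` per scheduler (for example a TTFT percentile); how the reported 73% / 32% figures were actually derived is not stated here, and the sample gaps are hypothetical.

```python
from typing import Sequence


def closed_loop_fraction(gaps_s: Sequence[float], w_turn_s: float) -> float:
    """Fraction of inter-turn gaps that are shorter than the per-turn service time."""
    if not gaps_s:
        raise ValueError("need at least one inter-turn gap")
    closed = sum(1 for gap in gaps_s if w_turn_s > gap)
    return closed / len(gaps_s)


# Hypothetical gaps in seconds: a sub-second cluster plus a long tail.
sample_gaps = [0.2, 0.5, 0.9, 1.6, 4.0, 12.0, 45.0, 700.0]

share_at_7s = closed_loop_fraction(sample_gaps, w_turn_s=7.3)
share_at_16s = closed_loop_fraction(sample_gaps, w_turn_s=15.7)
print(share_at_7s, share_at_16s)
```

A slower scheduler raises `w_turn_s`, so more gaps fall below it and a larger share of turns counts as closed-loop, which matches the direction of the measured shift from unified to lmetric.
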
**Result**: lmetric takes 49 min of wall-clock time to run a 600s trace = **8x amplification**. **Small per-turn differences in W are amplified into order-of-magnitude wall-clock gaps**: locality is not a nice-to-have, it is the dominant lever.
---
## 2. Workload Evidence (Three Findings)
| Finding | Data | Figure |
|---|---|---|
| KV reuse is almost entirely intra-session | intra 93.2% / cross 5.7% / shared 1.1% | ![](figs/f2a_reuse_topology.png) |
| Sessions are extremely skewed | on the production trace, the top 1% / 5% / 10% / 25% / 50% of sessions hold **46.5% / 66.5% / 74.6% / 87.5% / 96.0%** of input mass | ![](figs/f2b_session_skew.png) |
| Per-request KV footprint is large, so a single instance's KV pool fills up quickly | per-instance KV pool ≈ **38 GiB** (0.4 × 96 GiB H20; the rest is 50% params + 10% activation); a p99 request takes 11.5 GiB → one instance holds only **3 p99 decodes**; 4P+4D halves the system's decode capacity outright | ![](figs/f2c_kv_footprint_cdf.png) |
Theoretical APC upper bound = intra-session 79.6% / any-session 80.3%, a difference of <1pp. **Any scheduler without affinity loses the vast majority of reuse.**
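
The KV footprint row in the table above is simple arithmetic; the sketch below reproduces it from the stated numbers (96 GiB H20, 0.4 KV fraction, 11.5 GiB p99 request, 8 instances). The function names are illustrative, and treating decode capacity as "KV pool divided by p99 footprint" is a simplification.

```python
import math


def kv_pool_gib(hbm_gib: float, kv_fraction: float) -> float:
    """KV pool size when a fixed fraction of device memory is reserved for KV."""
    return hbm_gib * kv_fraction


def requests_that_fit(pool_gib: float, per_request_gib: float) -> int:
    """Whole requests of a given KV footprint that fit in one pool."""
    return math.floor(pool_gib / per_request_gib)


pool = kv_pool_gib(hbm_gib=96.0, kv_fraction=0.4)        # about 38.4 GiB
per_instance = requests_that_fit(pool, per_request_gib=11.5)

# 8 symmetric colocated instances vs a 4P+4D split where only 4 can decode.
colocated_slots = 8 * per_instance
split_slots = 4 * per_instance
print(per_instance, colocated_slots, split_slots)
```

Under these assumptions each instance holds 3 concurrent p99 decodes, and a 4P+4D split leaves half as many decode slots as 8 colocated instances.
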
---
## 3. Three Failure Modes of Existing Schedulers
### Load-balanced (LMetric / round-robin / kv-aware): loses locality
![](figs/f4a_apc_loss.png)
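LMetric reaches 56.9% APC and load_only 54.1%, far below the 79.6% upper bound; the 23pp gap comes directly from intra-session hits lost to cross-instance routing.
Note that LMetric is only **+3.3pp** better than load_only. LMetric score = `(pending_prefill + input - cache_hit) × num_requests`: cache_hit enters only as a subtractive term in the cost model, while the score is **multiplicative**. When an instance with affinity has a high num_requests, the multiplication swallows the cache benefit and LMetric still picks a cold instance. sticky, which treats cache as a hard constraint, goes straight to 77.2%. **Conclusion: cache-aware-load routing is not enough. Affinity must be an independent routing path and cannot be folded into the load cost.**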
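
To make the multiplicative effect concrete, here is a small sketch of the score as written above. It assumes the router picks the instance with the lowest score; the `Instance` fields, the token counts, and `pick_affinity_first` (a stand-in for a cache-as-hard-constraint policy) are hypothetical.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Instance:
    name: str
    pending_prefill: int  # tokens waiting for prefill on this instance
    num_requests: int     # requests currently placed on this instance
    cache_hit: int        # tokens of the incoming request already cached here


def lmetric_score(inst: Instance, input_tokens: int) -> int:
    """(pending_prefill + input - cache_hit) * num_requests, lower assumed better."""
    return (inst.pending_prefill + input_tokens - inst.cache_hit) * inst.num_requests


def pick_lmetric(instances: List[Instance], input_tokens: int) -> Instance:
    return min(instances, key=lambda inst: lmetric_score(inst, input_tokens))


def pick_affinity_first(instances: List[Instance]) -> Instance:
    """Hypothetical affinity policy: cache hit decides, load is ignored."""
    return max(instances, key=lambda inst: inst.cache_hit)


warm = Instance("warm", pending_prefill=5_000, num_requests=6, cache_hit=28_000)
cold = Instance("cold", pending_prefill=2_000, num_requests=1, cache_hit=0)

print(lmetric_score(warm, 30_000), lmetric_score(cold, 30_000))
print(pick_lmetric([warm, cold], 30_000).name, pick_affinity_first([warm, cold]).name)
```

In this made-up case the warm instance would prefill only 7,000 tokens, but its six in-flight requests multiply that into a worse score than the cold instance's full 32,000-token prefill.
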
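### Static PD-disagg: decode-side (D) KV capacity wall
![](figs/f4b_pdsep_kv_wall.png)
The average agentic request is 33.6k tokens, about 3.3GB of KV; both 4P+4D and 6P+2D cross the 90% memory wall in the agentic regime. **TTFT p50 surges 62-72x and the success rate drops from 99.5% to 52-68%**.
### Pure sticky: every worker is dragged down by hot sessions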
![](figs/f4c_per_worker_ttft.png)
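We deliberately measure worker imbalance with **two absolute numbers, (median, max)**, rather than the single ratio `max/median`. The ratio would score unified (one worker sacrificed, the other 7 fast) as more imbalanced than sticky (all workers slow together), which inverts the actual ordering of system e2e p90. The absolute numbers are below: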
| | median worker TTFT p90 | max worker | system e2e p90 |
|---|---:|---:|---:|
| sticky | **20.3s** | 55.4s | **34.6s** |
| unified | **10.3s** | 37.7s | **18.0s** |
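Mechanism: on the production trace the top 1% of sessions account for 46.5% of input, and hot sessions far outnumber the 8 instances. Sticky hash binding means **every worker ends up hosting its own share of hot sessions**, so even the median worker is slowed down. Unified uses its LMetric fallback to reroute cold/new sessions to non-hot workers, preserving the speed of 7/8 workers. System p90 is determined by the majority of requests, so unified is almost 2x faster.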
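
The ordering inversion described above can be checked directly from the table; this short sketch ranks the two schedulers by `max/median` and by system e2e p90 using only the reported numbers (the dictionary layout is illustrative).

```python
# Per-worker TTFT p90 (seconds) and system e2e p90, copied from the table above.
workers = {
    "sticky":  {"median": 20.3, "max": 55.4, "system_e2e_p90": 34.6},
    "unified": {"median": 10.3, "max": 37.7, "system_e2e_p90": 18.0},
}


def max_over_median(row: dict) -> float:
    return row["max"] / row["median"]


# "Best first" under each metric: lower ratio and lower p90 both count as better.
by_ratio = sorted(workers, key=lambda name: max_over_median(workers[name]))
by_e2e = sorted(workers, key=lambda name: workers[name]["system_e2e_p90"])

for name, row in workers.items():
    print(name, round(max_over_median(row), 2), row["system_e2e_p90"])
print(by_ratio, by_e2e)
```

The ratio ranks sticky ahead of unified, while system e2e p90 ranks them the other way round, which is why the note reports the absolute pair instead.
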
---
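## 4. EAR Design
Two pillars. All instances are symmetric and PD-colocated, with no static P/D partitioning.
**Pillar 1: Affinity-default routing (implemented)**
A session is assigned a host by load balancing; subsequent turns are routed by the session→host binding.
This is the current `unified` algorithm (hybrid LMetric + high-cache affinity): APC 79.4%, reaching 97% of the upper bound.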
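
A minimal sketch of the affinity-default idea in Pillar 1: bind on the first turn, then follow the binding. It is not the `unified` implementation; the `AffinityRouter` class, the in-flight-turn load signal, and the least-loaded first placement are assumptions standing in for the hybrid LMetric logic.

```python
from typing import Dict, Iterable


class AffinityRouter:
    """Route the first turn by load, later turns by the session's binding."""

    def __init__(self, hosts: Iterable[str]) -> None:
        self.load: Dict[str, int] = {host: 0 for host in hosts}  # in-flight turns
        self.binding: Dict[str, str] = {}                        # session -> host

    def route(self, session_id: str) -> str:
        host = self.binding.get(session_id)
        if host is None:
            # New session: pick the least-loaded host and remember the choice.
            host = min(self.load, key=self.load.get)
            self.binding[session_id] = host
        self.load[host] += 1
        return host

    def finish(self, host: str) -> None:
        """Call when a turn completes so the load signal stays current."""
        self.load[host] -= 1


router = AffinityRouter(["h0", "h1"])
first = router.route("session-a")
other = router.route("session-b")
again = router.route("session-a")
print(first, other, again)
```

Keeping the binding in a separate map is what lets a later migration step change a session's host without touching the first-turn placement logic.
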
**Pillar 2: Hot-triggered session migration (end-to-end evidence pending; substrate validated)**
When a host's `pending_prefill_tokens > T_hot`, migrate the entire session's KV to a lighter instance via the mooncake `kv_connector`; the session binding is updated so that subsequent turns route to the new host.
> 🆕 **2026-05-27 data** (commit `ef9e010`): the `kv_both` substrate overhead previously believed to be the migration blocker no longer exists. A/B/C comparison on an 8×TP1 trace replay:
> - plain unified: TTFT p90 = 11.97s
> - unified + `kv_both` (without DR-fix): 9.74s (**-18.6%** vs plain)
> - unified + `kv_both` + DR-fix: 7.58s (**-36.6%** vs plain)
>
> That is, the "+45% kv_both penalty" in the original elastic_migration_v2 paper is obsolete, and the current substrate is **net positive**: in connector mode, `delay_free_blocks=True` lengthens the cross-turn cache-hit window on the 93% intra-session-reuse trace. The main cause of the 4 previous migration reverts is gone.
Key design points:
- Target selection uses **observable pending prefill tokens** and does **not** use cost-model prediction (the mooncake cost model was measured to be off by 10-21x, so we bypass it)
- Per-session cooldown to prevent thrashing
- If no candidate instance can fit the session context, keep the current binding (opportunistic, not mandatory)
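
The trigger and the three design points above can be sketched as a single decision function. Everything beyond `pending_prefill_tokens` and `T_hot` is hypothetical: the `Host` type, the `free_kv_tokens` capacity signal, the cooldown bookkeeping, and the requirement that the target be lighter than the source are illustrative choices, and the actual KV transfer through the connector is out of scope.

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass
class Host:
    pending_prefill_tokens: int  # observable queue depth, used instead of a cost model
    free_kv_tokens: int          # hypothetical capacity signal


def pick_migration_target(hosts: Dict[str, Host], current: str, session_tokens: int,
                          t_hot: int, last_migrated_at: Optional[float],
                          now: float, cooldown_s: float) -> Optional[str]:
    """Return a target host name, or None to keep the current binding."""
    source = hosts[current]
    if source.pending_prefill_tokens <= t_hot:
        return None  # host is not hot, no trigger
    if last_migrated_at is not None and now - last_migrated_at < cooldown_s:
        return None  # per-session cooldown guards against thrashing
    candidates = [
        name for name, host in hosts.items()
        if name != current
        and host.free_kv_tokens >= session_tokens
        and host.pending_prefill_tokens < source.pending_prefill_tokens
    ]
    if not candidates:
        return None  # opportunistic: nobody can take the session, stay put
    return min(candidates, key=lambda name: hosts[name].pending_prefill_tokens)


hosts = {
    "a": Host(pending_prefill_tokens=90_000, free_kv_tokens=10_000),
    "b": Host(pending_prefill_tokens=5_000, free_kv_tokens=50_000),
    "c": Host(pending_prefill_tokens=2_000, free_kv_tokens=20_000),
}
target = pick_migration_target(hosts, "a", session_tokens=33_600, t_hot=60_000,
                               last_migrated_at=None, now=100.0, cooldown_s=30.0)
print(target)
```

Here host `c` has the shortest queue but cannot hold the session, so the function falls back to `b`; returning `None` in every other case keeps migration optional rather than mandatory.
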
---
## 5. Progress & TODO
### ✅ Done
- Workload characterization: evidence for all three findings is complete (`f2a/b/c`)
- Evidence for the three classes of baseline failure is complete (`f4a/b/c/d`)
- Anchor + paper outline (`PAPER_OUTLINE.md`)
- Pillar 1 affinity routing is implemented and tested (the current `unified` algorithm)
- Formal derivation of dispatch coupling via Little's Law
- `replayer/replay.py` patched to output `amplification`
- 🆕 **kv_both substrate validation** (commit `ef9e010`): the trace replay A/B/C shows the substrate is already net positive (TTFT p90 -18.6%, and -36.6% with DR-fix, vs plain); the old +45% penalty is obsolete
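### 🟢 Can be done now, independent of migration
1. **5 baseline × 3 runs wall-clock sweep** (the patched replayer emits the amplification field directly): the empirical closure for §2.3. **Highest priority**; can finish in one night
2. Add static PD-disagg to the end-to-end results
3. Sensitivity along three axes: λ / skew / KV pool
4. Draft the §1-§4 text (the data is complete)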
### 🚧 Pending migration end-to-end validation
- §4.3 migration mechanism: e2e trigger + target-selection experiments (the substrate works; only the policy layer is missing)
- Full ablation (migration-only + both)
- §5.6 migration microbench
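### Risks
- The 4 previous migration attempts (`6b255fa`, `e991960/5772149`, `cc6e562`, `4c583f2`) were all swallowed by transfer overhead and reverted; **that overhead was verified on 2026-05-27 to no longer exist** (the substrate is net positive)
- Still not directly verified: whether the e2e migration policy layer (trigger + target selection) yields a net gain inside the feedback loop. Two classes of risk remain in between, "wrong decisions" and "cooldown thrashing", both independent of the substrate
- Even if migration e2e turns out marginal, the evidence for the affinity-only pillar stands on its own; the paper has at least a strong-affinity storyline to write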
---
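## 6. One-Sentence Summary of What We Are Selling
> **Agentic workloads turn locality from a nice-to-have into the dominant lever (the dispatch coupling argument). EAR uses affinity-default + hot-triggered migration as a single scheme that obtains both locality and balance. Pillar 1 is validated (APC 79.4%); the Pillar 2 design is complete, with validation pending a re-test on top of the DR-fix.**
Next step: the main battleground is the wall-clock sweep, to nail down the §2.3 dispatch coupling argument.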