agentic-kvc/MEETING.md
Gahow Wang 5e6e98aee7 Replace max/median hotspot index with (median, max) absolute pair
The max/median ratio inverts the actual user-facing p90 ranking:
  sticky:  hotspot=2.73 but system e2e p90 = 34.6s  (worst)
  unified: hotspot=3.67 but system e2e p90 = 18.0s  (best)
because sticky's median is also high (everyone slow) while unified
concentrates the damage on one worker and keeps the other 7 fast.
Any "imbalance" metric structurally punishes the affinity-then-escape
schemes that we actually want to advocate for.

Changes:
- analysis/characterization/render_window1_figures.py:
  fig_b3_per_worker_ttft now annotates each subplot with
  "median X.Xs · max Y.Ys" instead of "hotspot=Y.YY"; docstring
  documents why we drop the ratio.
- figs/f4c_per_worker_ttft.png: regenerated with new titles.
- figs/f4c_apc_vs_hotspot_tradeoff.png: deleted. The scatter's y-axis
  was the deprecated ratio; superseded by f4c per-worker bars + f6
  e2e bars which together carry the same information honestly.
- PAPER_OUTLINE.md: C3, §3.3, §4.1 wording, §5 metric list, §8
  conclusion — replace "hotspot index" mentions with
  "worst-worker p90" or "(median, max) worker p90"; promote the
  §3.3 methodology note to a top-level sub-finding ("hot pin
  failure must be measured with per-worker absolute latency,
  not normalized ratio").
- MEETING.md: §3.3 narrative reworded to lead with the (median, max)
  pair directly; explicit one-line note on why the ratio is dropped.

Conceptual uses of "hot session" / "hot instance" / "hot pin" remain
unchanged — only the *metric* called hotspot index is retired.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-27 11:07:12 +08:00


EAR: Agentic Serving Scheduler Report

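One-liner: In agentic workloads, 93% of KV reuse happens within a session, and the tool-call feedback coupling between turns amplifies per-request latency differences into throughput gaps, so locality is the dominant scheduling lever. Existing load balancing loses locality, static PD-disagg hits the D-side KV wall, and pure sticky creates hot pins. We propose EAR (Elastic Affinity Router) = session-affinity routing + hot-instance-triggered session migration.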


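1. Key insight: Dispatch Coupling

Chatbot: turns are separated by human think-time, so system speed ⊥ the arrival rate of the next turn. Agentic: turns are separated only by the tool-call return (≈0), so a slow system → sessions stay longer → more concurrency → tighter KV pool → even slower.

Little's Law as an implicit equation: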

L = Λ · N · W_turn(L)        # agentic, T_human≈0

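Small-perturbation analysis: amplification = 1 / (1 - Λ·N·W'(L*)), which diverges as the system approaches KV saturation.

Measured: lmetric needs 49 min of wall-clock to replay a 600s trace = 8x amplification. On the same hardware, unified drains sessions ~3x faster than lmetric. Small differences in per-turn W are amplified into order-of-magnitude wall-clock gaps, which means locality is not a nice-to-have but the dominant lever.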
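To make the fixed-point argument concrete, here is a minimal numerical sketch. It solves L = Λ·N·W_turn(L) by iteration and evaluates 1 / (1 - Λ·N·W'(L*)) at the solution. The latency curve `toy_w_turn` (with its `w0` and `l_max` parameters) is a hypothetical stand-in for a KV-pool-limited system, not a fitted model of our traces, and `load` stands for the product Λ·N.

```python
import math
from typing import Callable


def steady_state_occupancy(load: float, w_turn: Callable[[float], float],
                           tol: float = 1e-12, max_iter: int = 100_000) -> float:
    """Iterate L <- load * W_turn(L) starting from an empty system (load = Λ·N)."""
    occupancy = 0.0
    for _ in range(max_iter):
        nxt = load * w_turn(occupancy)
        if not math.isfinite(nxt):
            raise ValueError("no stable operating point: past KV saturation")
        if abs(nxt - occupancy) < tol:
            return nxt
        occupancy = nxt
    raise ValueError("fixed-point iteration did not converge")


def amplification(load: float, w_turn: Callable[[float], float],
                  occupancy: float, h: float = 1e-6) -> float:
    """Return 1 / (1 - load * W'(L*)), with W' from a central difference."""
    slope = (w_turn(occupancy + h) - w_turn(occupancy - h)) / (2 * h)
    return 1.0 / (1.0 - load * slope)


def toy_w_turn(occupancy: float, w0: float = 1.0, l_max: float = 100.0) -> float:
    """Hypothetical per-turn latency that blows up as occupancy nears l_max."""
    if occupancy >= l_max:
        return math.inf
    return w0 / (1.0 - occupancy / l_max)


if __name__ == "__main__":
    for load in (10.0, 24.0, 24.99):
        l_star = steady_state_occupancy(load, toy_w_turn)
        amp = amplification(load, toy_w_turn, l_star)
        print(f"load={load:6.2f}  L*={l_star:6.2f}  amplification={amp:6.2f}")
```

With this toy curve, a load of 24 settles at L* = 40 with an amplification of 3, and the factor grows sharply as the load approaches the point where the stable solution disappears (25 for these parameters).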


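2. Workload evidence (three findings)

| Finding | Data |
|---|---|
| KV reuse is almost entirely intra-session | intra 93.2% / cross 5.7% / shared 1.1% |
| Sessions are extremely skewed | On the production trace, top 1% / 5% / 10% / 25% / 50% = 46.5% / 66.5% / 74.6% / 87.5% / 96.0% of input mass |
| Per-request KV footprint is already large | p99 = 11.8 GiB ≈ 12% of an H20 |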

Theoretical APC upper bound = intra-session 79.6% / any-session 80.3%, a gap of <1pp. Any scheduler that does not preserve affinity loses the vast majority of reuse.
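One way such an upper bound can be computed is to replay the trace against an infinite cache, once scoped per session and once shared across all sessions. The sketch below assumes a hypothetical trace format, `(session_id, block_hashes)` in arrival order, where each block hash already encodes its full prefix (as prefix-chained hashing does), so set membership is enough to detect a prefix hit. It illustrates the idea and is not the analysis script used for the numbers above.

```python
from collections import defaultdict
from typing import Hashable, Iterable, Sequence


def prefix_hits(blocks: Sequence[Hashable], seen: set) -> int:
    """Count leading blocks already cached; a prefix hit stops at the first miss."""
    hits = 0
    for block in blocks:
        if block not in seen:
            break
        hits += 1
    return hits


def apc_upper_bounds(
    requests: Iterable[tuple[str, Sequence[Hashable]]],
) -> tuple[float, float]:
    """Return (intra_session_bound, any_session_bound) with an infinite cache."""
    seen_any: set = set()
    seen_by_session: dict[str, set] = defaultdict(set)
    total = intra = anywhere = 0
    for session_id, blocks in requests:
        total += len(blocks)
        intra += prefix_hits(blocks, seen_by_session[session_id])
        anywhere += prefix_hits(blocks, seen_any)
        seen_by_session[session_id].update(blocks)
        seen_any.update(blocks)
    if total == 0:
        return 0.0, 0.0
    return intra / total, anywhere / total


if __name__ == "__main__":
    # Hypothetical three-request trace: s1 grows its context, s2 shares one block.
    trace = [
        ("s1", ["a", "b"]),
        ("s2", ["a", "c"]),
        ("s1", ["a", "b", "d"]),
    ]
    print(apc_upper_bounds(trace))
```

When the two bounds are nearly equal, as they are on our trace, cross-session sharing adds almost nothing and session affinity alone can capture nearly all reachable hits.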


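3. Three failure modes of existing schedulers

Load-balanced (LMetric / round-robin / kv-aware): loses locality

LMetric reaches 56.9% APC and load_only 54.1%, far below the 79.6% upper bound. The 23pp gap comes directly from intra-session hits lost to cross-instance routing.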

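Static PD-disagg: D-side KV capacity wall

The average agentic request is 33.6k tokens and needs 3.3GB of KV; both 4P+4D and 6P+2D cross the 90% memory wall in the agentic regime. TTFT p50 blows up by 62-72x, and the success rate drops from 99.5% to 52-68%.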

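Pure sticky: every worker is dragged down by hot sessions

We deliberately measure worker imbalance with two absolute numbers, (median, max), rather than the single max/median ratio. The ratio scores unified (one worker sacrificed, the other 7 fast) as more imbalanced than sticky (everyone slow together), which is the reverse of the actual system e2e p90 ranking. The absolute numbers: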

| Scheme | median worker TTFT p90 | max worker TTFT p90 | system e2e p90 |
|---|---|---|---|
| sticky | 20.3s | 55.4s | 34.6s |
| unified | 10.3s | 37.7s | 18.0s |
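The inversion can be checked directly from the table. The short sketch below uses only the four numbers per scheme shown above; the dictionary layout and helper names are illustrative.

```python
# Per-scheme numbers copied from the table above (seconds).
rows = {
    "sticky": {"median": 20.3, "max": 55.4, "e2e_p90": 34.6},
    "unified": {"median": 10.3, "max": 37.7, "e2e_p90": 18.0},
}


def ratio(scheme: str) -> float:
    """The retired max/median 'hotspot' ratio."""
    return rows[scheme]["max"] / rows[scheme]["median"]


def dominates(a: str, b: str) -> bool:
    """True if scheme a is no worse than b on both median and max worker p90."""
    return (rows[a]["median"] <= rows[b]["median"]
            and rows[a]["max"] <= rows[b]["max"])


# Best-first orderings under each metric (lower is better for both).
best_by_ratio = sorted(rows, key=ratio)
best_by_e2e = sorted(rows, key=lambda s: rows[s]["e2e_p90"])

if __name__ == "__main__":
    for scheme in rows:
        print(f"{scheme:8s} ratio={ratio(scheme):.2f} "
              f"e2e_p90={rows[scheme]['e2e_p90']}s")
    print("best by ratio:", best_by_ratio[0])
    print("best by e2e  :", best_by_e2e[0])
    print("unified dominates sticky on (median, max):",
          dominates("unified", "sticky"))
```

The ratio ranks sticky first, while both the e2e p90 and the (median, max) pair favor unified, which is why the pair is reported instead.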

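Mechanism: on the production trace, the top 1% of sessions account for 46.5% of input mass, and hot sessions far outnumber instances (8). Sticky's hash binding therefore leaves every worker carrying its own share of hot sessions, so even the median worker is slowed down. Unified uses the LMetric fallback to reroute cold/new sessions to non-hot workers, preserving the speed of 7 of the 8 workers. System p90 is determined by the majority of requests, so unified is almost 2x faster.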


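4. EAR design

Two pillars (all instances are symmetric and PD-colocated, with no static P/D partition):

Pillar 1: Affinity-default routing (implemented). A new session is assigned a host by load balancing; subsequent turns are routed by the session→host binding. → This is the current unified algorithm (hybrid LMetric + high-cache affinity): APC 79.4%, reaching 97% of the upper bound.

Pillar 2: Hot-triggered session migration (end-to-end evidence pending; substrate validated). When a host's pending_prefill_tokens > T_hot, the whole session's KV is migrated to a lighter instance via the mooncake kv_connector and the session binding is updated; subsequent turns are routed to the new host.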

🆕 2026-05-27 data (commit ef9e010): the kv_both substrate overhead that we previously considered the migration blocker no longer exists. A/B/C comparison on an 8×TP1 trace replay:

  • plain unified: TTFT p90 = 11.97s
  • unified + kv_both (without DR-fix): 9.74s (-18.6% vs plain)
  • unified + kv_both + DR-fix: 7.58s (-36.6% vs plain)

That is, the "+45% kv_both penalty" in the original elastic_migration_v2 paper is obsolete and the current substrate is net positive (connector mode's delay_free_blocks=True lengthens the cross-turn cache-hit window on a trace with 93% intra-session reuse). The main cause of the 4 earlier migration reverts is gone.

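Key design points:

  • Target selection uses observable pending prefill tokens rather than cost-model prediction (the mooncake cost model was measured to be off by 10-21x, so we bypass it)
  • Per-session cooldown to prevent thrashing
  • If no candidate instance can fit the session context → keep the current binding (opportunistic, not mandatory)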
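The two pillars and the three design points above can be sketched as a single routing decision. This is a minimal illustration under stated assumptions, not the actual unified/EAR implementation: the `Host` and `EarRouter` classes, the `free_kv_tokens` capacity field, and `cooldown_s` are hypothetical names, new-session placement is simplified to "least pending prefill" rather than the hybrid LMetric policy, and the real KV transfer over the mooncake kv_connector is only marked by a comment.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Host:
    pending_prefill_tokens: int = 0  # observable load signal
    free_kv_tokens: int = 0          # hypothetical capacity signal


@dataclass
class EarRouter:
    hosts: dict[str, Host]
    t_hot: int            # hot threshold on pending_prefill_tokens
    cooldown_s: float     # minimum time between migrations of one session
    bindings: dict[str, str] = field(default_factory=dict)
    last_migrated: dict[str, float] = field(default_factory=dict)

    def _lightest(self, names: list[str]) -> Optional[str]:
        if not names:
            return None
        return min(names, key=lambda n: self.hosts[n].pending_prefill_tokens)

    def route(self, session_id: str, context_tokens: int, now: float) -> str:
        current = self.bindings.get(session_id)
        if current is None:
            # Pillar 1: a new session gets a host by load balancing, then sticks.
            current = self._lightest(list(self.hosts))
            self.bindings[session_id] = current
            return current
        load = self.hosts[current].pending_prefill_tokens
        if load <= self.t_hot:
            return current  # affinity is the default
        last = self.last_migrated.get(session_id, float("-inf"))
        if now - last < self.cooldown_s:
            return current  # per-session cooldown prevents thrashing
        # Pillar 2: candidates must be lighter and able to hold the session KV.
        candidates = [
            name for name, host in self.hosts.items()
            if name != current
            and host.pending_prefill_tokens < load
            and host.free_kv_tokens >= context_tokens
        ]
        target = self._lightest(candidates)
        if target is None:
            return current  # opportunistic: keep the binding if nothing fits
        # A real system would transfer the session KV to `target` here.
        self.bindings[session_id] = target
        self.last_migrated[session_id] = now
        return target
```

Using observed `pending_prefill_tokens` for both the trigger and the target keeps the decision independent of any transfer-cost prediction, in line with the first design point.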

5. Progress & TODO

Done

  • Evidence for all three workload-characterization findings is complete (f2a/b/c)
  • Evidence for the three baseline failure modes is complete (f4a/b/c/d)
  • Anchor + paper outline (PAPER_OUTLINE.md)
  • Pillar 1 affinity routing implemented and tested (the current unified algorithm)
  • Formal Little's Law derivation of dispatch coupling
  • replayer/replay.py patched to output amplification
  • 🆕 kv_both substrate validation (commit ef9e010): trace replay A/B/C shows the substrate is already net positive (TTFT p90 -18.6%, and -36.6% after DR-fix, vs plain); the original +45% penalty is obsolete

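🟢 Can be done now, independent of migration

  1. 5 baselines × 3 runs wall-clock sweep (the patched replayer emits the amplification field directly): the empirical closure for §2.3; highest priority, can finish overnight
  2. Add static PD-disagg to the end-to-end table
  3. Sensitivity along three axes: λ / skew / KV pool
  4. Draft the body of §1-§4 (data is complete)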

🚧 Blocked on migration end-to-end validation

  • §4.3 migration mechanism: e2e trigger + target-selection experiments (the substrate works; only the policy layer is missing)
  • Full ablation (migration-only + both)
  • §5.6 migration microbench

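Risks

  • The 4 earlier migration attempts (6b255fa, e991960/5772149, cc6e562, 4c583f2) were all reverted because transfer overhead swallowed the gains; that overhead was verified on 2026-05-27 to no longer exist (substrate net positive)
  • We have still not directly verified that the e2e migration policy layer (trigger + target selection) yields a net gain inside the feedback loop; two further risks, wrong decisions and cooldown thrashing, remain and are independent of the substrate
  • Even if e2e migration stays marginal, the evidence for the affinity-only pillar already stands on its own, so the paper has at least a strong-affinity storyline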

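6. One-sentence summary of what we are selling

Agentic workloads turn locality from a nice-to-have into the dominant lever (the dispatch-coupling argument); EAR uses a single scheme, affinity-default + hot-triggered migration, to get both locality and balance. Pillar 1 is validated (APC 79.4%); Pillar 2's design is complete, with validation pending a re-measurement on top of DR-fix.

Next main battleground: run the wall-clock sweep to nail down the §2.3 dispatch-coupling argument.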