
Gahow Wang 555cabcf1f f2c: switch to per-instance decode-concurrency view; correct KV pool ceiling

Old f2c plotted per-request KV footprint MiB against an "H20 ~95 GiB
usable" reference line. That ceiling was wrong — a 30B-A3B bf16
deployment burns roughly:
  ~50% HBM for model params (~48 GiB on 96 GiB H20)
  ~10% for runtime activation buffers
  ~40% left for the KV cache pool (~38.4 GiB)
so 95 GiB was overstating the available pool by 2.5×.

New f2c reframes the same data into the answer that actually motivates
the paper: how many concurrent decodes does a single instance hold,
and how does PD-disagg change that? Grouped bars per percentile show
system-wide concurrent decode capacity for three 8-GPU deployments:
  Combined 8C, PD-disagg 4P+4D (N_D=4), PD-disagg 6P+2D (N_D=2)

Key reads off the figure:
  p50 (1.8 GiB/req): 20 fit/inst → 160 / 80 / 40 system-wide
  p90 (8.0 GiB/req):  4 fit/inst →  32 / 16 /  8
  p95 (9.6 GiB/req):  4 fit/inst →  32 / 16 /  8
  p99 (11.5 GiB/req): 3 fit/inst →  24 / 12 /  6

PD-disagg 4P+4D literally halves the decode population at the same
per-request KV pressure — this is the concrete §3.2 "KV memory wall"
penalty stated in terms users care about (concurrency).

- analysis/characterization/render_window1_figures.py:
  fig_kv_footprint_cdf rewritten; reads same kv_footprint_summary.json
  but computes floor(KV_pool / req_size) × N_D and annotates the
  per-instance fit count below each percentile group.
- figs/f2c_kv_footprint_cdf.png: regenerated.
- MEETING.md / PAPER_OUTLINE.md §2.1, §2.4: prose updated with the
  new ceiling and the "3 p99 decodes per instance / halved by PD-disagg"
  framing.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

2026-05-27 11:28:47 +08:00

7.1 KiB


EAR: Agentic Serving Scheduler Report

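One-liner: In agentic workloads, 93% of KV reuse is intra-session, and the tool-call feedback coupling between turns amplifies per-request latency differences into throughput gaps, so locality is the dominant scheduling lever. Existing load balancing loses locality, static PD-disagg hits the decode-side KV wall, and pure sticky routing creates hot pins. We propose EAR (Elastic Affinity Router) = session-affinity routing + session migration triggered by hot instances.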

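1. Key insight: Dispatch Coupling

Chatbot: human think-time separates turns, so system speed is independent of (⊥) the next turn's arrival rate. Agentic: only the tool-call return (≈0) separates turns, so a slow system → sessions stay longer → more concurrency → tighter KV pool → even slower.

Little's Law as an implicit equation: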

L = Λ · N · W_turn(L)        # agentic, T_human≈0

Small-perturbation analysis: amplification = 1 / (1 − Λ·N·W'(L*)), which diverges as the system approaches KV saturation.
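The fixed point and the amplification factor can be explored numerically. The sketch below is illustrative only: the latency curve `W(L) = w0 / (1 - L / cap)` is a hypothetical saturation model, not the measured one, and `rate` stands for the product Λ·N (read here as session arrival rate × turns per session, which these notes do not spell out). All function names are made up for the example.

```python
def turn_latency(load, w0, cap):
    """Hypothetical W_turn(L): latency grows without bound as load nears cap."""
    return w0 / (1.0 - load / cap)


def turn_latency_slope(load, w0, cap):
    """Derivative W'(L) of the hypothetical latency curve above."""
    return w0 / (cap * (1.0 - load / cap) ** 2)


def operating_point(rate, w0, cap, tol=1e-12, max_iter=100_000):
    """Solve L = rate * W(L) by fixed-point iteration from an empty system."""
    load = 0.0
    for _ in range(max_iter):
        nxt = rate * turn_latency(load, w0, cap)
        if nxt >= cap:
            raise ValueError("no stable operating point: demand exceeds capacity")
        if abs(nxt - load) < tol:
            return nxt
        load = nxt
    raise RuntimeError("fixed-point iteration did not converge")


def amplification(rate, w0, cap):
    """1 / (1 - rate * W'(L*)), evaluated at the operating point L*."""
    load = operating_point(rate, w0, cap)
    return 1.0 / (1.0 - rate * turn_latency_slope(load, w0, cap))


if __name__ == "__main__":
    for rate in (9.0, 16.0, 24.0):
        print(rate, operating_point(rate, 1.0, 100.0), amplification(rate, 1.0, 100.0))
```

Under this toy curve the amplification stays close to 1 at light load and grows quickly as `rate * w0` approaches `cap / 4`, the point past which no stable operating point exists.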

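Measured: lmetric needed 49 min of wall-clock time to replay a 600s trace = 8x amplification. On the same hardware, unified drains sessions ~3x faster than lmetric. Small differences in per-turn W are amplified into order-of-magnitude wall-clock gaps, which means locality is not a nice-to-have but the dominant lever.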

2. Workload evidence (three findings)

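| Finding | Data | Figure |
|---|---|---|
| KV reuse is almost entirely intra-session | intra 93.2% / cross 5.7% / shared 1.1% | |
| Sessions are extremely skewed | on the production trace, the top 1% / 5% / 10% / 25% / 50% of sessions carry 46.5% / 66.5% / 74.6% / 87.5% / 96.0% of input mass | |
| Per-request KV footprint is large, so a single instance's KV pool fills quickly | per-instance KV pool ≈ 38 GiB (0.4 × 96 GiB H20; the rest is 50% params + 10% activation); p99 request 11.5 GiB → one instance holds only 3 p99 decodes; 4P+4D halves system decode capacity | |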
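The decode-capacity numbers follow from floor(KV_pool / req_size) × N_D. A minimal sketch of that arithmetic, assuming the 38.4 GiB pool and the rounded p90/p95/p99 footprints quoted in the f2c commit message; the function names are illustrative and are not those used in render_window1_figures.py. p50 is left out because its rounded 1.8 GiB figure does not reproduce the reported per-instance count exactly.

```python
import math

KV_POOL_GIB = 38.4  # assumed: 0.4 x 96 GiB H20, as stated in the notes

# Number of decode-capable instances (N_D) in each 8-GPU deployment.
DEPLOYMENTS = {"Combined 8C": 8, "PD-disagg 4P+4D": 4, "PD-disagg 6P+2D": 2}

# Per-request KV footprint in GiB (rounded values from the notes).
FOOTPRINT_GIB = {"p90": 8.0, "p95": 9.6, "p99": 11.5}


def fits_per_instance(req_gib, pool_gib=KV_POOL_GIB):
    """How many requests of this size fit in one instance's KV pool."""
    return math.floor(pool_gib / req_gib)


def system_capacity(req_gib, n_decode, pool_gib=KV_POOL_GIB):
    """System-wide concurrent decodes: per-instance fit count times N_D."""
    return fits_per_instance(req_gib, pool_gib) * n_decode


if __name__ == "__main__":
    for pct, gib in FOOTPRINT_GIB.items():
        row = {name: system_capacity(gib, n_d) for name, n_d in DEPLOYMENTS.items()}
        print(pct, fits_per_instance(gib), row)
```

Because the per-instance fit count does not depend on the deployment, moving from 8 colocated instances to 4P+4D halves the system-wide figure at every percentile.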

Theoretical APC upper bound = intra-session 79.6% / any-session 80.3%, a gap of <1pp. Any scheduler without affinity loses the vast majority of the reuse.

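3. Three failure modes of existing schedulers

Load-balanced (LMetric / round-robin / kv-aware): loses locality

LMetric reaches 56.9% APC and load_only 54.1%, far below the 79.6% upper bound. The 23pp gap comes directly from intra-session hits lost to cross-instance routing.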

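Static PD-disagg: decode-side KV capacity wall

The average agentic request has 33.6k tokens and needs 3.3GB of KV; in the agentic regime both 4P+4D and 6P+2D cross the 90% memory wall. TTFT p50 blows up 62-72x, and the success rate falls from 99.5% to 52-68%.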
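The 3.3GB figure is consistent with a simple per-token KV estimate. The sketch below assumes a Qwen3-30B-A3B-like attention shape (48 layers, 4 KV heads, head dimension 128, bf16). The notes only say "30B-A3B", so these shape values are an assumption and should be checked against the deployed model's config.

```python
def kv_bytes_per_token(num_layers=48, num_kv_heads=4, head_dim=128, dtype_bytes=2):
    """Bytes of KV cache per token: K and V tensors for every layer.

    Default shape values are assumed, not taken from the notes.
    """
    return 2 * num_layers * num_kv_heads * head_dim * dtype_bytes


def request_kv_gb(num_tokens, **shape):
    """KV footprint of one request in decimal GB."""
    return num_tokens * kv_bytes_per_token(**shape) / 1e9


if __name__ == "__main__":
    print(kv_bytes_per_token())        # bytes per token under the assumed shape
    print(request_kv_gb(33_600))       # average agentic request from the notes
```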

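Pure sticky: every worker is dragged down by hot sessions

We deliberately measure worker imbalance with two absolute numbers (median, max) rather than the single max/median ratio. The ratio would rate unified (one worker sacrificed, the other 7 fast) as more imbalanced than sticky (all workers slow together), which inverts the actual ordering of system e2e p90. The absolute numbers: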

| Policy | median worker TTFT p90 | max worker | system e2e p90 |
|---|---|---|---|
| sticky | 20.3s | 55.4s | 34.6s |
| unified | 10.3s | 37.7s | 18.0s |

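Mechanism: on the production trace the top 1% of sessions carry 46.5% of input volume, and hot sessions far outnumber the instances (8), so sticky's hash binding leaves every worker carrying its own share of hot sessions and even the median worker slows down. Unified uses the LMetric fallback to reroute cold/new sessions to non-hot workers, preserving the speed of 7/8 workers. System p90 is determined by the majority of requests, so unified is almost 2x faster.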
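The ranking inversion can be checked directly from the table. A small sketch using only the numbers above; the dictionary layout and function name are illustrative.

```python
# Seconds, copied from the worker-imbalance table in these notes.
WORKERS = {
    "sticky": {"median": 20.3, "max": 55.4, "e2e_p90": 34.6},
    "unified": {"median": 10.3, "max": 37.7, "e2e_p90": 18.0},
}


def imbalance_ratio(stats):
    """The single max/median ratio that the notes argue against."""
    return stats["max"] / stats["median"]


# Best-first orderings under each metric (lower is better for both).
by_ratio = sorted(WORKERS, key=lambda name: imbalance_ratio(WORKERS[name]))
by_e2e = sorted(WORKERS, key=lambda name: WORKERS[name]["e2e_p90"])

if __name__ == "__main__":
    for name, stats in WORKERS.items():
        print(name, round(imbalance_ratio(stats), 2), stats["e2e_p90"])
    print("best by ratio:", by_ratio[0], "| best by e2e p90:", by_e2e[0])
```

The ratio prefers sticky while system e2e p90 prefers unified, which is why the notes report the two absolute numbers instead.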

4. EAR design

Two pillars; all instances are symmetric and PD-colocated (no static P/D partition):

Pillar 1: Affinity-default routing (implemented). A new session is assigned a host by load balancing; subsequent turns are routed by the session→host binding. → This is the current unified algorithm (hybrid LMetric + high-cache affinity), with APC 79.4%, reaching 97% of the upper bound.

Pillar 2: Hot-triggered session migration (end-to-end evidence pending, substrate validated). When a host's pending_prefill_tokens > T_hot, the whole session's KV is migrated to a lighter instance via the mooncake kv_connector; the session binding is updated; subsequent turns are routed to the new host.

🆕 2026-05-27 data (commit ef9e010): the kv_both substrate overhead previously believed to block migration no longer exists. A/B/C comparison on an 8×TP1 trace replay:

plain unified: TTFT p90 = 11.97s

unified + kv_both (without DR-fix): 9.74s (−18.6% vs plain)

unified + kv_both + DR-fix: 7.58s (−36.6% vs plain)

So the "+45% kv_both penalty" in the original elastic_migration_v2 paper is obsolete; the current substrate is net positive (connector mode's delay_free_blocks=True lengthens the cross-turn cache-hit window on a trace with 93% intra-session reuse). The main cause of the 4 previous migration reverts is gone.

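Key design choices:

- Target selection uses observable pending prefill tokens rather than cost-model prediction (the mooncake cost model was measured to be off by 10-21x, so it is bypassed)
- Per-session cooldown to prevent thrashing
- If no candidate instance can hold the session context → keep the current binding; migration is opportunistic, not mandatory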
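The two pillars and the design choices above can be sketched as a small router. This is a toy model under stated assumptions, not the EAR implementation: `Instance`, `AffinityRouter`, and every field name are hypothetical, new sessions are placed by a stand-in "least pending prefill tokens" rule instead of LMetric, and the actual KV transfer through the mooncake kv_connector is reduced to a comment.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class Instance:
    name: str
    pending_prefill_tokens: int = 0   # observable load signal
    free_kv_gib: float = 38.0         # free KV pool (hypothetical bookkeeping)


@dataclass
class AffinityRouter:
    instances: List[Instance]
    t_hot: int                        # hot threshold on pending prefill tokens
    cooldown_s: float                 # per-session cooldown between migrations
    binding: Dict[str, Instance] = field(default_factory=dict)
    last_migrated: Dict[str, float] = field(default_factory=dict)

    def route(self, session_id: str, session_kv_gib: float, now: float) -> Instance:
        host = self.binding.get(session_id)
        if host is None:
            # New session: stand-in load balancer (least pending prefill tokens).
            host = min(self.instances, key=lambda i: i.pending_prefill_tokens)
            self.binding[session_id] = host
            return host
        if host.pending_prefill_tokens > self.t_hot and self._cooled(session_id, now):
            target = self._pick_target(host, session_kv_gib)
            if target is not None:
                # A real system would transfer the session's KV here.
                self.binding[session_id] = target
                self.last_migrated[session_id] = now
                return target
        return host  # default: stay on the bound host (affinity)

    def _cooled(self, session_id: str, now: float) -> bool:
        last = self.last_migrated.get(session_id, float("-inf"))
        return now - last >= self.cooldown_s

    def _pick_target(self, host: Instance, session_kv_gib: float) -> Optional[Instance]:
        # Only lighter instances that can hold the whole session context qualify.
        candidates = [
            i for i in self.instances
            if i is not host
            and i.free_kv_gib >= session_kv_gib
            and i.pending_prefill_tokens < host.pending_prefill_tokens
        ]
        return min(candidates, key=lambda i: i.pending_prefill_tokens, default=None)
```

Returning the bound host whenever no candidate qualifies is what makes migration opportunistic: a hot host with no viable target simply keeps its sessions.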

5. Progress & TODO

✅ Done

- Evidence for all three workload-characterization findings is complete (f2a/b/c)
- Evidence for the three baseline failure modes is complete (f4a/b/c/d)
- Anchor + paper outline (PAPER_OUTLINE.md)
- Pillar 1 affinity routing implemented and tested (the current unified algorithm)
- Little's Law formal derivation of dispatch coupling
- replayer/replay.py patched to output amplification
- 🆕 kv_both substrate validation (commit ef9e010): trace replay A/B/C shows the substrate is already net positive (TTFT p90 −18.6%, and −36.6% after DR-fix, vs plain); the original +45% penalty is obsolete

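🟢 Can be done now, independent of migration

- 5 baselines × 3 runs wall-clock sweep (the patched replayer emits the amplification field directly); this is the empirical closure for §2.3, highest priority, and can finish overnight
- Add static PD-disagg to the end-to-end table
- Sensitivity along three axes: λ / skew / KV pool
- Draft the §1-§4 body text (data is complete)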

🚧 Blocked on migration end-to-end validation

- §4.3 migration mechanism: e2e trigger + target selection experiments (the substrate works; only the policy layer is missing)
- Full ablation (migration-only + both)
- §5.6 migration microbench

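Risks

- All 4 previous migration attempts (6b255fa, e991960/5772149, cc6e562, 4c583f2) were reverted because transfer overhead swallowed the gain; that overhead was verified on 2026-05-27 to no longer exist (substrate net positive)
- It is still not directly verified that the e2e migration policy layer (trigger + target selection) yields a net gain inside the feedback loop; two further risks, wrong decisions and cooldown thrashing, remain and are independent of the substrate
- Even if migration e2e remains marginal, the affinity-only pillar's evidence stands on its own, so the paper at least has a strong-affinity storyline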

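6. One-sentence summary of what we are selling

Agentic workloads turn locality from a nice-to-have into the dominant lever (the dispatch-coupling argument); EAR combines affinity-default routing with hot-triggered migration in a single scheme that gets both locality and balance. Pillar 1 is validated (APC 79.4%); Pillar 2 has a complete design, with validation pending a re-test on top of DR-fix.

Next main battleground: run the wall-clock sweep to nail down the §2.3 dispatch-coupling argument.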
