Replace max/median hotspot index with (median, max) absolute pair

The max/median ratio inverts the actual user-facing p90 ranking:
  sticky:  hotspot=2.73 but system e2e p90 = 34.6s  (worst)
  unified: hotspot=3.67 but system e2e p90 = 18.0s  (best)
because sticky's median is also high (everyone slow) while unified
concentrates the damage on one worker and keeps the other 7 fast.
Any normalized "imbalance" ratio structurally punishes the
affinity-then-escape schemes that we actually want to advocate for.
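
The inversion described above can be reproduced with a small sketch. The per-worker values below are hypothetical and chosen only to resemble the pattern (all workers slow vs. one hot worker); they are not the measured trace data, and the function names are illustrative rather than taken from the repository.

```python
from statistics import median


def hotspot_ratio(per_worker_p90):
    """Retired metric: max / median of per-worker p90 latency."""
    return max(per_worker_p90) / median(per_worker_p90)


def median_max(per_worker_p90):
    """Replacement metric: absolute (median, max) pair, in seconds."""
    return median(per_worker_p90), max(per_worker_p90)


# Hypothetical per-worker TTFT p90 values in seconds (NOT measured data).
# "sticky": every worker is slow; "unified": one hot worker, seven fast.
sticky = [18.0, 19.0, 20.0, 20.0, 21.0, 22.0, 24.0, 55.0]
unified = [9.0, 10.0, 10.0, 10.0, 11.0, 11.0, 12.0, 38.0]

for name, workers in (("sticky", sticky), ("unified", unified)):
    med, mx = median_max(workers)
    ratio = hotspot_ratio(workers)
    print(f"{name}: ratio={ratio:.2f} median={med:.1f}s max={mx:.1f}s")
```

With these inputs the ratio ranks sticky as better balanced, while both the median and the max say sticky is slower, which is the ordering the absolute pair is meant to preserve.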

Changes:
- analysis/characterization/render_window1_figures.py:
  fig_b3_per_worker_ttft now annotates each subplot with
  "median X.Xs · max Y.Ys" instead of "hotspot=Y.YY"; docstring
  documents why we drop the ratio.
- figs/f4c_per_worker_ttft.png: regenerated with new titles.
- figs/f4c_apc_vs_hotspot_tradeoff.png: deleted. The scatter's y-axis
  was the deprecated ratio; superseded by f4c per-worker bars + f6
  e2e bars which together carry the same information honestly.
- PAPER_OUTLINE.md: C3, §3.3, §4.1 wording, §5 metric list, §8
  conclusion — replace "hotspot index" mentions with
  "worst-worker p90" or "(median, max) worker p90"; promote the
  §3.3 methodology note to a top-level sub-finding ("hot pin
  failure must be measured with per-worker absolute latency,
  not normalized ratio").
- MEETING.md: §3.3 narrative reworded to lead with the (median, max)
  pair directly; explicit one-line note on why the ratio is dropped.
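
A minimal sketch of how the new subplot annotation could be built, assuming the per-worker p90 values are available as a list of seconds. `worker_title` is a hypothetical helper for illustration and is not the actual code in render_window1_figures.py.

```python
from statistics import median


def worker_title(per_worker_p90_s):
    """Build an annotation in the 'median X.Xs · max Y.Ys' form."""
    med = median(per_worker_p90_s)
    worst = max(per_worker_p90_s)
    return f"median {med:.1f}s · max {worst:.1f}s"


# Hypothetical per-worker values in seconds (NOT measured data).
example = [18.0, 19.0, 20.0, 20.0, 21.0, 22.0, 24.0, 55.0]
print(worker_title(example))
```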

Conceptual uses of "hot session" / "hot instance" / "hot pin" remain
unchanged — only the *metric* called hotspot index is retired.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-27 11:07:12 +08:00
parent 9ddabee6ae
commit 5e6e98aee7
5 changed files with 19 additions and 11 deletions

@@ -33,7 +33,7 @@ Agentic LLM workload: self-driven by the LLM via tool calls, multi-turn, completing …
- **C1 Dispatch coupling argument**: we formalize a feedback loop unique to agentic workloads: per-turn service time affects the number of concurrent sessions through an implicit Little's Law equation, which amplifies per-request latency differences into throughput gaps. Measured: the load-balance baseline shows **8x** wall-clock amplification on a 600s trace; EAR shows **TBDx**.
- **C2 EAR design**: a scheduler with two pillars: affinity-default routing captures intra-session locality, and hot-instance-triggered session migration moves the whole session's KV to a lighter instance when a hotspot appears, avoiding hot pins.
-- **C3 Evaluation**: on a real Qwen3-Coder agentic trace, EAR simultaneously dominates 5 baselines on five dimensions: TTFT, TPOT, APC, hotspot index, and wall-clock.
+- **C3 Evaluation**: on a real Qwen3-Coder agentic trace, EAR simultaneously dominates 5 baselines on five dimensions: TTFT, TPOT, APC, worst-worker p90, and wall-clock.
**Figure 1: Teaser — wall-clock vs trace-time across schedulers**`figs/f1_teaser.png` **🚧 TBD (NEW DATA NEEDED)**
> Needs Phase 3 measurements: 5 baselines × 3 runs of trace replay, extract `amplification = wall_clock_s / trace_span_s` from each summary (Phase 1 patch already exposes the field). Plot as bar chart with y=1 reference line. The EAR row is TBD for now, pending migration validation.
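@@ -129,7 +129,7 @@ Round-robin and load-aware routing (e.g. LMetric, OSDI'26) maximize instance
### §3.3 The real failure of pure session-sticky: every worker is dragged down by hot sessions
-Session-instance binding restores locality (APC **77.2%**, reaching 97% of the upper bound), but **absolute worker latency**, not the max/median hotspot ratio, is the real failure mode of pure sticky.
+Session-instance binding restores locality (APC **77.2%**, reaching 97% of the upper bound), but **absolute worker latency** degrades for every worker: this is the real failure mode of pure sticky.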
| | median worker TTFT p90 | max worker | system e2e p90 |
|---|---:|---:|---:|
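@@ -139,7 +139,7 @@ Round-robin and load-aware routing (e.g. LMetric, OSDI'26) maximize instance
Mechanism: in the production trace, the top 1% of sessions account for 46.5% of input mass and the top 5% for 66.5%; the number of hot sessions far exceeds the 8 instances. Sticky hash binding makes **every worker take on its own share of hot sessions**, so even the median worker is slowed to the 20s range. Unified's LMetric fallback reroutes cold/new sessions to non-hot workers, preserving the speed of 7 of the 8 workers. System p90 is determined by the majority of requests, so unified's e2e p90 is ~2x faster than sticky's.
-**Note**: the hotspot ratio (max/median) is misleading on its own: sticky's 2.73 is *lower* than unified's 3.67 because sticky's median is also high (20.3s vs unified's 10.3s), so the system as a whole is slower. A useful §3.3 sub-finding: **hot pin failure must be measured with per-worker absolute latency, not a normalized ratio**.
+**§3.3 sub-finding**: hot pin failure must be measured with **per-worker absolute latency** (median + max), **not a normalized ratio**. `max/median` penalizes "affinity + escape" schemes such as unified in the wrong direction: sticky's ratio (2.73) is lower than unified's (3.67), yet sticky's median is also high (20.3s vs unified's 10.3s), so the lower ratio is actually the worse outcome. All worker-balance comparisons in this paper use the (median, max) pair rather than a single ratio.
Violates the §2.4 skew-tolerance requirement.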
@@ -154,8 +154,6 @@ Round-robin and load-aware routing (e.g. LMetric, OSDI'26) maximize instance
§3.3 Per-worker TTFT p90 across 8 instances × 5 policies: all of sticky's workers are slowed (median 20.3s); unified concentrates the damage on e4 (other workers: median 10.3s).
![F4c Per-worker TTFT p90 distribution](figs/f4c_per_worker_ttft.png)
-> 📝 Supplementary (not in main §3; could go in the §5 multi-policy summary or an appendix): APC vs hotspot ratio scatter:
-![F4c-supp APC vs hotspot tradeoff (supplementary)](figs/f4c_apc_vs_hotspot_tradeoff.png)
> 📝 Optional supporting figure: prefill-decode interference (an 8k prefill on the same GPU degrades TPOT by 66x); place in §3.3 to support the sticky interference argument:
![F4d PD interference](figs/f4d_pd_interference.png)
@@ -239,7 +237,7 @@ KV transfer happens on the critical path of the request that triggers the migration, but
4. `sticky`: session-instance pinning
5. `static PD-disagg`: 4P / 4D static partition
6. `EAR`: this paper
-- **Metrics**: TTFT (mean/p50/p90/p99), TPOT (same), E2E, APC, hotspot index, wall-clock vs trace-time
+- **Metrics**: TTFT (mean/p50/p90/p99), TPOT (same), E2E, APC, worker TTFT p90 (median + max), wall-clock vs trace-time
### §5.2 End-to-end Performance
@@ -331,7 +329,7 @@ KV transfer happens on the critical path of the request that triggers the migration, but
## §8 Conclusion
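-For agentic LLM workloads, locality is the dominant scheduling lever. EAR captures it with session-affinity routing and protects it with hot-triggered session migration; a single scheme simultaneously dominates five baselines on four dimensions: TTFT, APC, hotspot, and wall-clock throughput.
+For agentic LLM workloads, locality is the dominant scheduling lever. EAR captures it with session-affinity routing and protects it with hot-triggered session migration; a single scheme simultaneously dominates five baselines on four dimensions: TTFT, APC, worst-worker p90, and wall-clock throughput.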
---