Replace max/median hotspot index with (median, max) absolute pair

The max/median ratio inverts the actual user-facing p90 ranking:
  sticky:  hotspot=2.73 but system e2e p90 = 34.6s (worst)
  unified: hotspot=3.67 but system e2e p90 = 18.0s (best)
because sticky's median is also high (everyone slow) while unified
concentrates the damage on one worker and keeps the other 7 fast.
Any "imbalance" metric structurally punishes the affinity-then-escape
schemes that we actually want to advocate for.

Changes:
- analysis/characterization/render_window1_figures.py:
  fig_b3_per_worker_ttft now annotates each subplot with
  "median X.Xs · max Y.Ys" instead of "hotspot=Y.YY"; docstring
  documents why we drop the ratio.
- figs/f4c_per_worker_ttft.png: regenerated with new titles.
- figs/f4c_apc_vs_hotspot_tradeoff.png: deleted. The scatter's y-axis
  was the deprecated ratio; superseded by f4c per-worker bars + f6
  e2e bars which together carry the same information honestly.
- PAPER_OUTLINE.md: C3, §3.3, §4.1 wording, §5 metric list, §8
  conclusion: replace "hotspot index" mentions with
  "worst-worker p90" or "(median, max) worker p90"; promote the
  §3.3 methodology note to a top-level sub-finding ("hot pin
  failure must be measured with per-worker absolute latency,
  not normalized ratio").
- MEETING.md: §3.3 narrative reworded to lead with the (median, max)
  pair directly; explicit one-line note on why the ratio is dropped.

Conceptual uses of "hot session" / "hot instance" / "hot pin" remain
unchanged; only the *metric* called hotspot index is retired.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
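
The ranking inversion described in the commit message can be checked with a few lines of arithmetic. The sketch below is illustrative only: the ratio, median, and e2e numbers are the ones quoted in this commit, while the `POLICIES` table, the helper names, and the "implied max" (ratio × median) are assumptions introduced here, not measured values or code from the repository.

```python
# Numbers quoted in the commit message and the §3.3 text of this commit.
# The dictionary layout and helper names are hypothetical.
POLICIES = {
    "sticky":  {"hotspot_ratio": 2.73, "median_worker_p90_s": 20.3, "system_e2e_p90_s": 34.6},
    "unified": {"hotspot_ratio": 3.67, "median_worker_p90_s": 10.3, "system_e2e_p90_s": 18.0},
}


def rank_best_first(metric: str) -> list:
    """Order policies from best (lowest value) to worst for one metric."""
    return sorted(POLICIES, key=lambda p: POLICIES[p][metric])


def implied_max_worker_p90(policy: str) -> float:
    """Assumption: ratio = max / median, so max = ratio * median."""
    row = POLICIES[policy]
    return row["hotspot_ratio"] * row["median_worker_p90_s"]


by_ratio = rank_best_first("hotspot_ratio")         # what the retired metric says
by_e2e = rank_best_first("system_e2e_p90_s")        # what users actually see
by_median = rank_best_first("median_worker_p90_s")  # first half of the new pair

print("best-first by max/median ratio:", by_ratio)
print("best-first by system e2e p90:  ", by_e2e)
for pol in POLICIES:
    print(f"{pol}: median {POLICIES[pol]['median_worker_p90_s']:.1f}s, "
          f"implied max {implied_max_worker_p90(pol):.1f}s")
```

Under these assumptions the ratio ranks sticky ahead of unified, while both members of the (median, max) pair and the system e2e p90 rank unified ahead, which is the inversion the commit message describes.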
@@ -51,7 +51,7 @@ an average agentic request of 33.6k tokens needs 3.3GB of KV; 4P+4D / 6P+2D in the agentic regime
 
 
 
-Note that the hotspot index (the max/median ratio) is misleading on its own: sticky's hotspot=2.73 is *lower* than unified's 3.67, but the **absolute values** show that sticky is "everyone slow together" while unified is "one worker sacrificed, the other 7 fast":
+We deliberately measure worker imbalance with **the two absolute numbers (median, max)** rather than the single `max/median` ratio: the ratio scores unified (one worker sacrificed, the other 7 fast) as more imbalanced than sticky (everyone slow together), which reverses the actual system e2e p90 ranking. The absolute numbers:
 
 | | median worker TTFT p90 | max worker | system e2e p90 |
 |---|---:|---:|---:|
@@ -33,7 +33,7 @@ Agentic LLM workload: the LLM drives itself via tool calls across multiple turns to complete tasks
 
 - **C1 Dispatch coupling argument**: we formalize a feedback loop unique to agentic workloads: per-turn service time affects the number of concurrent sessions through an implicit Little's Law equation, amplifying per-request latency differences into throughput gaps. Measured: the load-balance baseline shows **8x** wall-clock amplification on a 600s trace; EAR shows **TBDx**.
 - **C2 EAR design**: a scheduler with two pillars: affinity-default routing captures intra-session locality, and hot-instance-triggered session migration moves a whole session's KV to a lighter instance when a hotspot appears, avoiding hot pin.
-- **C3 Evaluation**: on a real Qwen3-Coder agentic trace, EAR simultaneously dominates 5 baselines on five dimensions: TTFT, TPOT, APC, hotspot index, and wall-clock.
+- **C3 Evaluation**: on a real Qwen3-Coder agentic trace, EAR simultaneously dominates 5 baselines on five dimensions: TTFT, TPOT, APC, worst-worker p90, and wall-clock.
 
 **Figure 1: Teaser, wall-clock vs trace-time across schedulers**: `figs/f1_teaser.png` **🚧 TBD (NEW DATA NEEDED)**
 > Needs Phase 3 measurements: 5 baselines × 3 runs of trace replay, extract `amplification = wall_clock_s / trace_span_s` from each summary (Phase 1 patch already exposes the field). Plot as bar chart with y=1 reference line. The EAR row is TBD for now (pending migration validation).
@@ -129,7 +129,7 @@ Round-robin and load-aware routing (e.g. LMetric, OSDI'26) maximize instance
 
 ### §3.3 The real failure of pure session-sticky: every worker is dragged down by hot sessions
 
-Hard session-instance binding restores locality (APC **77.2%**, reaching 97% of the upper bound), but **absolute worker latency** is pure sticky's real failure mode, not the max/median hotspot ratio.
+Hard session-instance binding restores locality (APC **77.2%**, reaching 97% of the upper bound), but **absolute worker latency** is dragged down across all workers, which is pure sticky's real failure mode.
 
 | | median worker TTFT p90 | max worker | system e2e p90 |
 |---|---:|---:|---:|
@@ -139,7 +139,7 @@ Round-robin and load-aware routing (e.g. LMetric, OSDI'26) maximize instance
 
 Mechanism: on the production trace the top 1% of sessions account for 46.5% of input mass and the top 5% for 66.5%, and hot sessions far outnumber instances (8); sticky's hash binding makes **every worker take on its own share of hot sessions**, so even the median worker is slowed to the 20s range. unified uses the LMetric fallback to reroute cold/new sessions to non-hot workers, preserving the speed of 7/8 workers. System p90 is determined by the majority of requests, so unified is ~2x faster than sticky on e2e p90.
 
-**Note**: the hotspot ratio (max/median) is misleading on its own: sticky's 2.73 is *lower* than unified's 3.67, but because sticky's median is also high (20.3s vs unified's 10.3s), the system as a whole is slower. A useful §3.3 sub-finding: **hot pin failure must be measured with per-worker absolute latency, not a normalized ratio**.
+**§3.3 sub-finding**: hot pin failure must be measured with **per-worker absolute latency** (median + max), **not a normalized ratio**. `max/median` perversely penalizes "affinity + escape" schemes such as unified: sticky's ratio of 2.73 is lower than unified's 3.67, but sticky's median is also high (20.3s vs unified's 10.3s), so the lower ratio is actually the worse result. All worker-balance comparisons in this paper use the (median, max) pair, never a single ratio.
 
 This violates the skew-tolerance requirement of §2.4.
 
@@ -154,8 +154,6 @@ Round-robin and load-aware routing (e.g. LMetric, OSDI'26) maximize instance
 §3.3: Per-worker TTFT p90 across 8 instances × 5 policies. All of sticky's workers are slowed down (median 20.3s); unified concentrates the damage on e4 while the other workers stay fast (median 10.3s):
 
 
-> 📝 Supplementary (not for main §3; could go in the §5 multi-policy summary or the appendix): APC vs hotspot ratio scatter:
-
 
 > 📝 Optional supporting figure: prefill-decode interference (an 8k prefill on the same GPU degrades TPOT by 66x), placed in §3.3 to support the sticky interference argument:
 
@@ -239,7 +237,7 @@ KV transfer happens on the critical path of the request that triggers the migration, but
 4. `sticky`: hard session-instance pinning
 5. `static PD-disagg`: static 4P / 4D partition
 6. `EAR`: this paper
-- **Metrics**: TTFT (mean/p50/p90/p99), TPOT (same), E2E, APC, hotspot index, wall-clock vs trace-time
+- **Metrics**: TTFT (mean/p50/p90/p99), TPOT (same), E2E, APC, worker TTFT p90 (median + max), wall-clock vs trace-time
 
 ### §5.2 End-to-end Performance
 
@@ -331,7 +329,7 @@ KV transfer happens on the critical path of the request that triggers the migration, but
 
 ## §8 Conclusion
 
-For agentic LLM workloads, locality is the dominant scheduling lever. EAR captures it with session-affinity routing and protects it with hot-triggered session migration; a single scheme simultaneously dominates five baselines on four dimensions: TTFT, APC, hotspot, and wall-clock throughput.
+For agentic LLM workloads, locality is the dominant scheduling lever. EAR captures it with session-affinity routing and protects it with hot-triggered session migration; a single scheme simultaneously dominates five baselines on four dimensions: TTFT, APC, worst-worker p90, and wall-clock throughput.
 
 ---
 
@@ -147,7 +147,13 @@ def fig_b3_failure_breakdown(comp: dict, out: Path) -> None:
 
 
 def fig_b3_per_worker_ttft(results_dir: Path, comp: dict, out: Path) -> None:
-    """Per-worker TTFT p90 grouped bars; reads each policy's hotspot_index.json."""
+    """Per-worker TTFT p90 grouped bars; title shows median + max worker p90.
+
+    We deliberately do NOT report a max/median 'hotspot index' here: it is a
+    ratio and treats unified (most workers fast, one hot) as worse than
+    sticky (all workers slow), which inverts the actual user-facing p90.
+    """
+    import statistics
     by = {r["policy"]: r for r in comp["rows"]}
     pols = [p for p in POLICY_ORDER if p in by]
     fig, axes = plt.subplots(1, len(pols), figsize=(3 * len(pols), 4),
@@ -168,8 +174,12 @@ def fig_b3_per_worker_ttft(results_dir: Path, comp: dict, out: Path) -> None:
                edgecolor="black", linewidth=0.5)
         for i, v in enumerate(vals):
             ax.text(i, v, f"{v:.1f}", ha="center", va="bottom", fontsize=8)
-        ax.set_title(f"{pol}\nhotspot={by[pol]['hotspot_index_ttft_p90']:.2f}",
-                     fontsize=10)
+        median_v = statistics.median(vals)
+        max_v = max(vals)
+        ax.set_title(
+            f"{pol}\nmedian {median_v:.1f}s · max {max_v:.1f}s",
+            fontsize=10,
+        )
         ax.tick_params(axis="x", labelsize=8)
         ax.grid(alpha=0.3, axis="y")
     axes[0].set_ylabel("worker TTFT p90 (s)")
Binary file not shown.
Before size: 37 KiB
Binary file not shown.
Before size: 45 KiB | After size: 49 KiB
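
The new subplot annotation can be exercised without matplotlib. The following is a minimal sketch, assuming the title logic is pulled out into a standalone helper; `worker_p90_title` is a hypothetical name that does not exist in `render_window1_figures.py`, and the per-worker values are made-up illustration data, not measurements from this commit.

```python
import statistics


def worker_p90_title(pol: str, vals: list) -> str:
    """Build the '(median, max)' subplot title used in place of 'hotspot=Y.YY'.

    `vals` holds one TTFT p90 value (in seconds) per worker. Hypothetical
    helper: the real function computes this inline inside its plotting loop.
    """
    if not vals:
        raise ValueError("need at least one per-worker value")
    median_v = statistics.median(vals)
    max_v = max(vals)
    return f"{pol}\nmedian {median_v:.1f}s · max {max_v:.1f}s"


# Made-up per-worker values: seven fast workers and one hot worker.
example_vals = [9.8, 10.0, 10.1, 10.2, 10.4, 10.6, 11.0, 37.8]
title = worker_p90_title("unified", example_vals)
print(title)
```

Keeping both numbers in the title means a reader sees at a glance whether the typical worker is slow, the worst worker is slow, or both, which a single ratio cannot express.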