§3.3 reframe: hot pin failure is uniformly-slow workers, not max/median ratio

User pointed out the apparent paradox: in fig_b3_per_worker_ttft_p90, unified has hotspot index 3.67 while sticky has 2.73, yet unified e2e p90 is roughly half of sticky's.

Resolution: hotspot index (max/median) is a *ratio* and misleading on its own. Per-worker absolute TTFT p90:

  sticky:  median 20.3s, max 55.4s -> system e2e p90 34.6s
  unified: median 10.3s, max 37.7s -> system e2e p90 18.0s

Mechanism: top 1% sessions own 46.5% input mass and there are more hot sessions than instances (8), so sticky's hash binding gives *every* worker its own hot session and the median worker is also slow. Unified's LMetric fallback re-routes cold/new sessions away from hot affinity instances, preserving 7/8 worker speed. System p90 is dominated by the majority of requests landing on fast workers, hence the 2x e2e gap.

Changes:
- Replace §3.3 figure with figs/f4c_per_worker_ttft.png (per-worker bars) instead of figs/f4c_apc_vs_hotspot_tradeoff.png (the ratio scatter)
- §3.3 narrative in PAPER_OUTLINE.md and MEETING.md rewritten around absolute median + max + system e2e p90 instead of hotspot ratio
- Add a §3.3 sub-finding: "hot pin failure must be measured with per-worker absolute latency, not normalized ratio"
- Keep the scatter as supplementary for §5 multi-policy summary

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-27 10:10:23 +08:00
parent 18f1bd4240
commit 020a5c79a7
3 changed files with 29 additions and 7 deletions
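
The paradox described in the commit message can be checked with a few lines of arithmetic. The sketch below is illustrative only: it copies the per-worker TTFT p90 summaries from the §3.3 table in PAPER_OUTLINE.md, and the names `POLICIES`, `hotspot_index`, and `rank` are hypothetical helpers, not part of the repository. Because the inputs are the rounded table values, recomputed ratios may differ from the reported hotspot indices in the last digit.

```python
# Per-worker TTFT p90 summaries in seconds, copied from the §3.3 table.
# POLICIES, hotspot_index and rank are hypothetical names for this sketch.
POLICIES = {
    "sticky":  {"median": 20.3, "max": 55.4, "system_e2e_p90": 34.6},
    "unified": {"median": 10.3, "max": 37.7, "system_e2e_p90": 18.0},
    "lmetric": {"median": 14.0, "max": 31.3, "system_e2e_p90": 24.8},
}


def hotspot_index(stats):
    """Max/median ratio of per-worker TTFT p90 (dimensionless)."""
    return stats["max"] / stats["median"]


def rank(key_fn):
    """Policy names ordered from lowest (best) to highest (worst) score."""
    return sorted(POLICIES, key=lambda name: key_fn(POLICIES[name]))


by_ratio = rank(hotspot_index)
by_median = rank(lambda s: s["median"])
by_e2e = rank(lambda s: s["system_e2e_p90"])

for name, stats in POLICIES.items():
    print(f"{name:8s} ratio={hotspot_index(stats):.2f} "
          f"median={stats['median']}s e2e_p90={stats['system_e2e_p90']}s")
print("ranked by ratio: ", by_ratio)
print("ranked by median:", by_median)
print("ranked by e2e:   ", by_e2e)
```

On these three policies, ranking by absolute median worker latency gives the same order as ranking by system e2e p90, while ranking by the max/median ratio puts unified last. Three data points do not establish a general rule, but they show why the ratio alone is a poor proxy here.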
--- a/MEETING.md
+++ b/MEETING.md
@@ -47,11 +47,18 @@ LMetric 56.9%, load_only 54.1%, capped 31.6% APC, far below the 79.6% upper bound. 23

 The average agentic request is 33.6k tokens and needs 3.3GB of KV; both 4P+4D and 6P+2D cross the 90% memory wall in the agentic regime. **TTFT p50 surges 62-72x, and the success rate drops from 99.5% to 52-68%**.

-### Pure sticky / current unified: hot pin
+### Pure sticky: every worker is dragged down by hot sessions

-![](figs/f4c_apc_vs_hotspot_tradeoff.png)
+![](figs/f4c_per_worker_ttft.png)

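-APC rises to 77-79% (near the upper bound), but the hotspot index doubles: sticky 2.73 and unified 3.66 vs lmetric 2.25 and load_only 1.29. Large sessions in the skewed tail are pinned to a single instance, causing prefill-decode interference.
+Note that the hotspot index (max/median ratio) is misleading on its own: sticky's hotspot index of 2.73 is *lower* than unified's 3.67, but the **absolute values** show that sticky is "everyone slow together" while unified is "one worker sacrificed, the other 7 fast":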
+
+| | median worker TTFT p90 | max worker | system e2e p90 |
+|---|---:|---:|---:|
+| sticky | **20.3s** | 55.4s | **34.6s** |
+| unified | **10.3s** | 37.7s | **18.0s** |
+
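+Mechanism: the top 1% of sessions account for 46.5% of input volume, and there are more hot sessions than instances (8), so sticky's hash binding makes **every worker carry its own hot session**, and the median worker is slowed down too. Unified uses the LMetric fallback to re-route cold/new sessions to non-hot workers, preserving the speed of 7/8 workers. System p90 is determined by the majority of requests, so unified is nearly 2x faster.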

 ---

--- a/PAPER_OUTLINE.md
+++ b/PAPER_OUTLINE.md
@@ -127,9 +127,21 @@ Round-robin and load-aware routing (e.g. LMetric, OSDI'26) maximize instance

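 Statically splitting instances into a P pool and a D pool works for chatbot workloads but fails for agentic ones: agentic requests average 33.6k tokens and need **3.3GB** of KV; under the 4D configuration a p90 request takes **69%** of the D-side KV pool, and a p99 request **overflows it at 138%**. Result: **TTFT p50 surges 62-72x**, and the success rate falls from 99.5% to **52-68%**. This violates §2.1 (prefill-dominant + long context).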

-### §3.3 Pure session-sticky creates hot pins
+### §3.3 The real failure of pure session-sticky: every worker is dragged down by hot sessions

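-Hard session-instance binding restores locality (APC **77.2%**, 97% of the upper bound), but pins the large sessions in the skewed tail to a single instance, and the **interference index doubles from LMetric's 6.53 to 13.65** (same trace, same hardware). This violates the skew-tolerance requirement of §2.4.
+Hard session-instance binding restores locality (APC **77.2%**, 97% of the upper bound), but the real failure mode of pure sticky is **absolute worker latency**, not the max/median hotspot ratio.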
+
+| | median worker TTFT p90 | max worker | system e2e p90 |
+|---|---:|---:|---:|
+| `sticky` | **20.3 s** | 55.4 s | **34.6 s** |
+| `unified` (affinity + LMetric fallback) | **10.3 s** | 37.7 s | **18.0 s** |
+| `lmetric` | 14.0 s | 31.3 s | 24.8 s |
+
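+Mechanism: the top 1% of sessions account for 46.5% of input mass, and the number of hot sessions is ≥ the number of instances (8); sticky's hash binding makes **every worker carry its own hot session**, so the median worker is also slowed to the 20s range. unified uses the LMetric fallback to re-route cold/new sessions to non-hot workers, preserving the speed of 7/8 workers. System p90 is determined by the majority of requests, so unified is ~2x faster than sticky on e2e p90.
+
+**Note**: the hotspot ratio (max/median) is misleading on its own: sticky's 2.73 is *lower* than unified's 3.67, but because sticky's median is also high (20.3s vs unified's 10.3s), the system as a whole is slower. A useful §3.3 sub-finding: **hot pin failure must be measured with per-worker absolute latency, not a normalized ratio**.
+
+This violates the skew-tolerance requirement of §2.4.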

 **Figure 4: Three baselines, three failure modes**: split into three subfigures, placed in §3.1/§3.2/§3.3 respectively:

@@ -139,8 +151,11 @@ Round-robin and load-aware routing (e.g. LMetric, OSDI'26) maximize instance
 §3.2: D-side KV pool occupancy vs per-request KV footprint; both 4P+4D and 6P+2D cross the 90% memory wall in the agentic regime:
 ![F4b PD-sep KV memory wall](figs/f4b_pdsep_kv_wall.png)

-§3.3: APC vs hotspot index scatter (unified/sticky sit in the high-APC, high-hotspot region; lmetric/load_only in the low-APC, low-hotspot region):
-![F4c APC vs hotspot tradeoff](figs/f4c_apc_vs_hotspot_tradeoff.png)
+§3.3: Per-worker TTFT p90 across 8 instances × 5 policies. All of sticky's workers are slowed down (median 20.3s), while unified concentrates the damage on e4 and keeps the other workers fast (median 10.3s):
+![F4c Per-worker TTFT p90 distribution](figs/f4c_per_worker_ttft.png)
+
+> 📝 Supplementary (not in main §3; can go in the §5 multi-policy summary or the appendix): APC vs hotspot ratio scatter:
+![F4c-supp APC vs hotspot tradeoff (supplementary)](figs/f4c_apc_vs_hotspot_tradeoff.png)

 > 📝 Optional supporting figure: prefill-decode interference (an 8k prefill on the same GPU degrades TPOT by 66x), placed in §3.3 to support the interference argument against sticky:
 ![F4d PD interference](figs/f4d_pd_interference.png)
--- a/figs/f4c_per_worker_ttft.png
+++ b/figs/f4c_per_worker_ttft.png