§3.3 reframe: hot pin failure is uniformly-slow workers, not max/median ratio

A user pointed out an apparent paradox: in fig_b3_per_worker_ttft_p90, unified
has hotspot index 3.67 while sticky has 2.73, yet unified's e2e p90 is roughly
half of sticky's. Resolution: the hotspot index (max/median) is a *ratio* and
is misleading on its own. Per-worker absolute TTFT p90:

  sticky : median 20.3s, max 55.4s -> system e2e p90 34.6s
  unified: median 10.3s, max 37.7s -> system e2e p90 18.0s

Mechanism: the top 1% of sessions own 46.5% of input mass and there are more
hot sessions than instances (8), so sticky's hash binding gives *every*
worker its own hot session and the median worker is slow too. Unified's
LMetric fallback re-routes cold/new sessions away from hot affinity
instances, keeping 7 of 8 workers fast. System p90 is dominated by the
majority of requests landing on those fast workers, hence the 2x e2e gap.

Changes:
- Replace §3.3 figure with figs/f4c_per_worker_ttft.png (per-worker bars)
  instead of figs/f4c_apc_vs_hotspot_tradeoff.png (the ratio scatter)
- §3.3 narrative in PAPER_OUTLINE.md and MEETING.md rewritten around
  absolute median + max + system e2e p90 instead of hotspot ratio
- Add a §3.3 sub-finding: "hot pin failure must be measured with
  per-worker absolute latency, not normalized ratio"
- Keep the scatter as supplementary for §5 multi-policy summary

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-27 10:10:23 +08:00
parent 18f1bd4240
commit 020a5c79a7
3 changed files with 29 additions and 7 deletions


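@@ -47,11 +47,18 @@ LMetric 56.9%, load_only 54.1%, capped 31.6% APC, far below the 79.6% upper bound. 23
The average agentic request is 33.6k tokens and needs 3.3 GB of KV; both 4P+4D and 6P+2D cross the 90% memory wall in the agentic regime. **TTFT p50 blows up 62-72x and the success rate falls from 99.5% to 52-68%**.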
### Pure sticky / current unified: hot pin
### Pure sticky: every worker dragged down by hot sessions
![](figs/f4c_apc_vs_hotspot_tradeoff.png)
![](figs/f4c_per_worker_ttft.png)
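APC rises to 77-79%, close to the upper bound, but the hotspot index doubles (sticky 2.73, unified 3.66 vs. lmetric 2.25, load_only 1.29): large sessions in the skew are locked to a single instance, causing prefill-decode interference.
Note that the hotspot index (the max/median ratio) is misleading on its own: sticky's hotspot index of 2.73 is *lower* than unified's 3.67, but the **absolute values** show that sticky is "everyone slow together" while unified is "one worker sacrificed, the other 7 fast".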
| | median worker TTFT p90 | max worker | system e2e p90 |
|---|---:|---:|---:|
| sticky | **20.3s** | 55.4s | **34.6s** |
| unified | **10.3s** | 37.7s | **18.0s** |
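Mechanism: the top 1% of sessions account for 46.5% of input, and there are more hot sessions than instances (8), so sticky's hash binding makes **every worker take on its own hot session** and the median worker is slowed too. Unified's LMetric fallback re-routes cold/new sessions to non-hot workers, preserving the speed of 7/8 workers. System p90 is determined by the majority of requests, so unified is almost 2x faster.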
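
The routing mechanism described above can be sketched as a toy model. This is only an illustration under stated assumptions, not the actual router: `sticky_route` and `fallback_route` are hypothetical names, a plain modulo stands in for the real session hash, the hot-worker set is taken as given (worker 4, chosen arbitrarily), and load is counted in sessions rather than tokens.

```python
from collections import Counter

NUM_WORKERS = 8  # instance count from the text


def sticky_route(session_id: int, num_workers: int = NUM_WORKERS) -> int:
    """Pure sticky: bind a session to one worker (modulo stands in for the real hash)."""
    return session_id % num_workers


def fallback_route(session_id: int, hot_workers: set, load: Counter,
                   num_workers: int = NUM_WORKERS) -> int:
    """Hypothetical affinity-first routing with a load-based fallback for cold/new sessions."""
    home = sticky_route(session_id, num_workers)
    if home not in hot_workers:
        return home
    # The affinity target is hot: pick the least-loaded non-hot worker instead.
    candidates = [w for w in range(num_workers) if w not in hot_workers]
    if not candidates:  # every worker is hot, so there is nothing to fall back to
        return home
    return min(candidates, key=lambda w: (load[w], w))


# More hot sessions (12) than workers (8): sticky leaves no worker without one.
hot_per_worker = Counter(sticky_route(s) for s in range(12))
workers_with_hot = len(hot_per_worker)

# Assumed scenario for the fallback: only worker 4 is currently hot.
hot_workers = {4}
load = Counter()
placements = []
for session_id in range(100, 116):  # 16 cold sessions
    worker = fallback_route(session_id, hot_workers, load)
    load[worker] += 1
    placements.append(worker)

print("workers hosting a hot session under sticky:", workers_with_hot)
print("cold-session placements with fallback:", placements)
```

In this toy, the fallback only helps while some worker is not hot; once every worker hosts a hot session (the sticky case), `fallback_route` has no candidates and degenerates to plain sticky routing.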
---


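@@ -127,9 +127,21 @@ Round-robin and load-aware routing (e.g. LMetric, OSDI'26) maximize instance
Statically splitting instances into a P pool and a D pool works for chatbot workloads but fails for agentic ones: agentic requests average 33.6k tokens and need **3.3 GB** of KV; under the 4D configuration a p90 request takes **69%** of the D KV pool and a p99 request **overflows at 138%**. Result: **TTFT p50 blows up 62-72x** and the success rate drops from 99.5% to **52-68%**. This violates §2.1 (prefill-dominant + context).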
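### §3.3 Pure session-sticky creates hot pins
### §3.3 The real failure of pure session-sticky: every worker dragged down by hot sessions
Session-to-instance binding restores locality: APC reaches **77.2%** (97% of the upper bound). But large sessions in the skew are locked to a single instance, and the **interference index doubles from LMetric's 6.53 to 13.65** (same trace, same hardware). This violates the §2.4 skew-tolerance requirement.
Session-to-instance binding restores locality: APC reaches **77.2%** (97% of the upper bound). But **absolute worker latency** is the real failure mode of pure sticky, not the max/median hotspot ratio.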
| | median worker TTFT p90 | max worker | system e2e p90 |
|---|---:|---:|---:|
| `sticky` | **20.3 s** | 55.4 s | **34.6 s** |
| `unified` (affinity + LMetric fallback) | **10.3 s** | 37.7 s | **18.0 s** |
| `lmetric` | 14.0 s | 31.3 s | 24.8 s |
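Mechanism: the top 1% of sessions account for 46.5% of input mass, and there are more hot sessions than instances (8). Sticky's hash binding makes **every worker take on its own hot session**, so the median worker is also slowed to the 20 s range. Unified's LMetric fallback re-routes cold/new sessions to non-hot workers, preserving the speed of 7/8 workers. System p90 is determined by the majority of requests, so unified's e2e p90 is ~2x faster than sticky's.
**Note**: the hotspot ratio (max/median) is misleading on its own: sticky's 2.73 is *lower* than unified's 3.67, but because sticky's median is also high (20.3 s vs. unified's 10.3 s), the system as a whole is slower. A useful §3.3 sub-finding: **hot pin failure must be measured with per-worker absolute latency, not a normalized ratio**.
This violates the §2.4 skew-tolerance requirement.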
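
A quick numeric check of the ratio-versus-absolute argument, as a minimal sketch. It assumes the hotspot index is exactly max/median of per-worker TTFT p90 and recomputes it from the rounded values in the table above, so the last digit can differ from the reported index (unified comes out near 3.66 rather than 3.67). The `stats` layout and the `hotspot_ratio` helper are illustrative names only.

```python
# Per-worker TTFT p90 summary in seconds, copied from the table above.
stats = {
    "sticky":  {"median": 20.3, "max": 55.4, "system_e2e_p90": 34.6},
    "unified": {"median": 10.3, "max": 37.7, "system_e2e_p90": 18.0},
    "lmetric": {"median": 14.0, "max": 31.3, "system_e2e_p90": 24.8},
}


def hotspot_ratio(row: dict) -> float:
    """max/median of per-worker TTFT p90 (assumed definition of the hotspot index)."""
    return row["max"] / row["median"]


ratios = {policy: hotspot_ratio(row) for policy, row in stats.items()}

# Rank policies from best to worst under each metric.
by_ratio = sorted(stats, key=lambda p: ratios[p])
by_e2e = sorted(stats, key=lambda p: stats[p]["system_e2e_p90"])

for policy, row in stats.items():
    print(f"{policy:8s} ratio={ratios[policy]:.2f} "
          f"median={row['median']:.1f}s e2e_p90={row['system_e2e_p90']:.1f}s")
print("best-to-worst by ratio:  ", by_ratio)
print("best-to-worst by e2e p90:", by_e2e)
```

The two rankings disagree: the ratio places unified last, while system e2e p90 places it first. That disagreement is the reason to report the absolute median and max alongside the ratio.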
**Figure 4: Three baselines, three failure modes**, split into three sub-figures placed in §3.1, §3.2, and §3.3 respectively.
@@ -139,8 +151,11 @@ Round-robin and load-aware routing (e.g. LMetric, OSDI'26) maximize instance
§3.2: D KV pool occupancy vs. per-request KV footprint; both 4P+4D and 6P+2D cross the 90% memory wall in the agentic regime.
![F4b PD-sep KV memory wall](figs/f4b_pdsep_kv_wall.png)
§3.3: APC vs. hotspot index scatter; unified/sticky sit at high APC and hotspot, lmetric/load_only at low APC and hotspot.
![F4c APC vs hotspot tradeoff](figs/f4c_apc_vs_hotspot_tradeoff.png)
§3.3: Per-worker TTFT p90 across 8 instances × 5 policies. Under sticky every worker is slowed (median 20.3 s); unified concentrates the damage on e4 while the other workers stay fast (median 10.3 s).
![F4c Per-worker TTFT p90 distribution](figs/f4c_per_worker_ttft.png)
> 📝 Supplementary (not in main §3; can go in the §5 multi-policy summary or an appendix): APC vs. hotspot ratio scatter:
![F4c-supp APC vs hotspot tradeoff (supplementary)](figs/f4c_apc_vs_hotspot_tradeoff.png)
> 📝 Optional supporting figure: prefill-decode interference (an 8k prefill on the same GPU degrades TPOT by 66x); place in §3.3 to support the interference argument against sticky:
![F4d PD interference](figs/f4d_pd_interference.png)

Binary file not shown. Size: 52 KiB