§3.3 reframe: hot pin failure is uniformly-slow workers, not max/median ratio

User pointed out the apparent paradox: in fig_b3_per_worker_ttft_p90, unified
has hotspot index 3.67 while sticky has 2.73, yet unified e2e p90 is roughly
half of sticky's. Resolution: hotspot index (max/median) is a *ratio* and
misleading on its own. Per-worker absolute TTFT p90:

  sticky : median 20.3s, max 55.4s -> system e2e p90 34.6s
  unified: median 10.3s, max 37.7s -> system e2e p90 18.0s
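
A small sketch of the arithmetic behind the paradox, using only the four
per-worker numbers and two system numbers quoted above. The names `STATS`,
`hotspot_index` and `summarize` are illustrative and not from the repo. Note
that the rounded inputs give about 3.66 for unified; the 3.67 quoted above
presumably comes from the unrounded data.

```python
# Per-worker TTFT p90 in seconds, copied from the two summary lines above.
STATS = {
    "sticky": {"median": 20.3, "max": 55.4, "system_e2e_p90": 34.6},
    "unified": {"median": 10.3, "max": 37.7, "system_e2e_p90": 18.0},
}


def hotspot_index(median_s: float, max_s: float) -> float:
    """Hotspot index as used here: slowest worker divided by median worker."""
    return max_s / median_s


def summarize(stats: dict) -> dict:
    """Attach the hotspot index to each policy's absolute numbers."""
    return {
        policy: {**row, "hotspot_index": hotspot_index(row["median"], row["max"])}
        for policy, row in stats.items()
    }


if __name__ == "__main__":
    for policy, row in summarize(STATS).items():
        print(
            f"{policy:8s} ratio={row['hotspot_index']:.2f} "
            f"median={row['median']}s max={row['max']}s "
            f"e2e_p90={row['system_e2e_p90']}s"
        )
```

Unified has the higher ratio only because its denominator (the median
worker) is much smaller; both its median and its max are lower in absolute
terms.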

Mechanism: the top 1% of sessions own 46.5% of input mass and there are
more hot sessions than instances (8), so sticky's hash binding gives
*every* worker its own hot session and the median worker is also slow.
Unified's LMetric fallback re-routes cold/new sessions away from hot
affinity instances, keeping 7 of the 8 workers fast. System p90 is
dominated by the majority of requests landing on fast workers, hence the
roughly 2x e2e gap.
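
A rough way to sanity-check the "every worker gets a hot session" step,
assuming sticky's hash spreads hot sessions uniformly and independently
over the 8 instances (an assumption, not something measured here). The
function name and the hot-session counts in the loop are hypothetical; the
actual number of hot sessions is not stated in this commit.

```python
from math import comb


def p_every_worker_hot(n_hot: int, n_workers: int = 8) -> float:
    """P(every worker receives >= 1 hot session) under uniform random hashing.

    Inclusion-exclusion over the number of workers left without a hot session.
    """
    return sum(
        (-1) ** k * comb(n_workers, k) * ((n_workers - k) / n_workers) ** n_hot
        for k in range(n_workers + 1)
    )


if __name__ == "__main__":
    # Hypothetical hot-session counts, chosen only for illustration.
    for n_hot in (8, 16, 32, 64):
        print(f"n_hot={n_hot:3d}  P(all 8 workers hot)={p_every_worker_hot(n_hot):.4f}")
```

Under this model the probability climbs toward 1 as the number of hot
sessions grows well past the instance count, which is the regime the
mechanism above describes.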

Changes:
- Replace §3.3 figure with figs/f4c_per_worker_ttft.png (per-worker bars)
  instead of figs/f4c_apc_vs_hotspot_tradeoff.png (the ratio scatter)
- §3.3 narrative in PAPER_OUTLINE.md and MEETING.md rewritten around
  absolute median + max + system e2e p90 instead of hotspot ratio
- Add a §3.3 sub-finding: "hot pin failure must be measured with
  per-worker absolute latency, not normalized ratio"
- Keep the scatter as supplementary for §5 multi-policy summary

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-27 10:10:23 +08:00
parent 18f1bd4240
commit 020a5c79a7
3 changed files with 29 additions and 7 deletions


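@@ -47,11 +47,18 @@ LMetric 56.9%, load_only 54.1%, capped 31.6% APC, far below the 79.6% upper bound. 23
 The average agentic request is 33.6k tokens (3.3 GB of KV); both 4P+4D and 6P+2D cross the 90% memory wall in the agentic regime. **TTFT p50 surges 62-72x, and the success rate falls from 99.5% to 52-68%**.
-### Pure sticky / current unified: hot pin
+### Pure sticky: every worker is dragged down by hot sessions
-![](figs/f4c_apc_vs_hotspot_tradeoff.png)
+![](figs/f4c_per_worker_ttft.png)
-APC rises to 77-79%, close to the upper bound, but the hotspot index doubles (sticky 2.73, unified 3.66 vs. lmetric 2.25, load_only 1.29): large sessions in the skew are pinned to a single instance, causing prefill-decode interference.
+Note that the hotspot index (the max/median ratio) is misleading on its own: sticky's hotspot index of 2.73 is *lower* than unified's 3.67, but the **absolute values** show that sticky is "everyone slow together" while unified is "one worker sacrificed, the other 7 fast":
+| | median worker TTFT p90 | max worker TTFT p90 | system e2e p90 |
+|---|---:|---:|---:|
+| sticky | **20.3s** | 55.4s | **34.6s** |
+| unified | **10.3s** | 37.7s | **18.0s** |
+Mechanism: the top 1% of sessions account for 46.5% of input, and hot sessions outnumber the 8 instances, so sticky's hash binding makes **every worker carry its own hot session** and the median worker is slowed down too. Unified's LMetric fallback re-routes cold/new sessions to non-hot workers, preserving the speed of 7 of the 8 workers. System p90 is determined by the majority of requests, so unified is nearly 2x faster.
 ---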