§3.3 reframe: hot pin failure is uniformly-slow workers, not max/median ratio
User pointed out the apparent paradox: in fig_b3_per_worker_ttft_p90, unified has hotspot index 3.67 while sticky has 2.73, yet unified e2e p90 is roughly half of sticky's.

Resolution: hotspot index (max/median) is a *ratio* and misleading on its own. Per-worker absolute TTFT p90:

  sticky:  median 20.3s, max 55.4s -> system e2e p90 34.6s
  unified: median 10.3s, max 37.7s -> system e2e p90 18.0s

Mechanism: top 1% sessions own 46.5% input mass and there are more hot sessions than instances (8), so sticky's hash binding gives *every* worker its own hot session and the median worker is also slow. Unified's LMetric fallback re-routes cold/new sessions away from hot affinity instances, preserving 7/8 worker speed. System p90 is dominated by the majority of requests landing on fast workers, hence the 2x e2e gap.

Changes:
- Replace §3.3 figure with figs/f4c_per_worker_ttft.png (per-worker bars) instead of figs/f4c_apc_vs_hotspot_tradeoff.png (the ratio scatter)
- §3.3 narrative in PAPER_OUTLINE.md and MEETING.md rewritten around absolute median + max + system e2e p90 instead of hotspot ratio
- Add a §3.3 sub-finding: "hot pin failure must be measured with per-worker absolute latency, not normalized ratio"
- Keep the scatter as supplementary for §5 multi-policy summary

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
13
MEETING.md
@@ -47,11 +47,18 @@ LMetric 56.9%, load_only 54.1%, capped 31.6% APC, far below the 79.6% upper bound. 23
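The average agentic request is 33.6k tokens and needs 3.3GB of KV; both 4P+4D and 6P+2D break through the 90% memory wall in the agentic regime. **TTFT p50 surges 62-72x, and the success rate drops from 99.5% to 52-68%**.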
### Pure sticky / current unified: hot pin
### Pure sticky: every worker is dragged down by hot sessions


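APC is pulled up to 77-79% (close to the upper bound), but the hotspot index doubles: sticky 2.73 and unified 3.66, versus lmetric 2.25 and load_only 1.29. Large sessions in the skewed workload get locked onto a single instance, causing prefill-decode interference.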
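Note that the hotspot index (the max/median ratio) is misleading on its own: sticky's hotspot index of 2.73 is *lower* than unified's 3.67, but the **absolute values** show that sticky is "everyone slow together", while unified is "one worker sacrificed, the other 7 fast":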
| | median worker TTFT p90 | max worker TTFT p90 | system e2e p90 |
|---|---:|---:|---:|
| sticky | **20.3s** | 55.4s | **34.6s** |
| unified | **10.3s** | 37.7s | **18.0s** |
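
To make the ratio-versus-absolute distinction concrete, the sketch below recomputes the hotspot index (max/median) from the rounded values in the table above and prints it next to the absolute numbers. The dictionary layout and the function name are illustrative only and are not part of the project's tooling.

```python
# Per-worker TTFT p90 summary values copied from the table (seconds).
rows = {
    "sticky": {"median": 20.3, "max": 55.4, "system_e2e_p90": 34.6},
    "unified": {"median": 10.3, "max": 37.7, "system_e2e_p90": 18.0},
}


def hotspot_index(row: dict) -> float:
    """Hotspot index as defined in the text: max worker / median worker."""
    return row["max"] / row["median"]


for name, row in rows.items():
    print(
        f"{name}: hotspot={hotspot_index(row):.2f} "
        f"median={row['median']}s max={row['max']}s "
        f"system_e2e_p90={row['system_e2e_p90']}s"
    )

# How much slower sticky is end to end, from the same table.
e2e_ratio = rows["sticky"]["system_e2e_p90"] / rows["unified"]["system_e2e_p90"]
print(f"sticky / unified system e2e p90 = {e2e_ratio:.2f}x")
```

From these rounded entries the ratio comes out at about 2.73 for sticky and 3.66 for unified, so unified looks worse on the ratio even though its median, max, and system e2e p90 are all lower.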
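Mechanism: the top 1% of sessions account for 46.5% of input volume, and there are more hot sessions than instances (8), so sticky's hash binding makes **every worker take on its own hot session**, and the median worker is slowed down too. Unified uses the LMetric fallback to re-route cold/new sessions to non-hot workers, preserving the speed of 7/8 workers. System p90 is determined by the majority of requests, so unified is almost 2x faster.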
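
The routing difference described above can be sketched as follows. This is a toy model, not the project's router: `session_id % NUM_WORKERS` stands in for the real hash binding, the least-loaded pick stands in for LMetric (whose actual scoring is not given here), and the session IDs, loads, and `hot_workers` set are hypothetical.

```python
NUM_WORKERS = 8  # instance count stated in the text


def sticky_route(session_id: int) -> int:
    """Hash binding: a session always maps to the same worker, ignoring load."""
    # Assumption: plain modulo stands in for the real hash function.
    return session_id % NUM_WORKERS


def fallback_route(
    session_id: int,
    affinity: dict[int, int],
    loads: list[float],
    hot_workers: set[int],
) -> int:
    """Affinity first; cold/new sessions go to the least-loaded non-hot worker."""
    if session_id in affinity:
        return affinity[session_id]
    candidates = [w for w in range(NUM_WORKERS) if w not in hot_workers]
    if not candidates:  # every worker is hot: consider all of them
        candidates = list(range(NUM_WORKERS))
    # Assumption: least-loaded choice stands in for the LMetric score.
    return min(candidates, key=lambda w: loads[w])


# Hypothetical: 16 hot sessions, i.e. more hot sessions than workers.
hot_sessions = range(16)
sticky_hot_workers = {sticky_route(s) for s in hot_sessions}
print(len(sticky_hot_workers))  # 8: every worker hosts a hot session

# Hypothetical state: worker 0 holds hot session 100 and is heavily loaded.
loads = [9.0, 3.0, 1.0, 4.0, 2.0, 5.0, 6.0, 7.0]
affinity = {100: 0}
hot_workers = {0}
print(fallback_route(100, affinity, loads, hot_workers))  # 0: stays on its affinity worker
print(fallback_route(200, affinity, loads, hot_workers))  # 2: new session avoids worker 0
```

The first check illustrates why the median worker is slow under sticky binding once hot sessions outnumber workers; the second shows a new session being steered away from a hot instance, which is the behavior the text credits for the faster majority of workers.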
---