§3.3 reframe: hot pin failure is uniformly-slow workers, not max/median ratio
User pointed out an apparent paradox: in fig_b3_per_worker_ttft_p90, unified has hotspot index 3.67 while sticky has 2.73, yet unified's e2e p90 is roughly half of sticky's.

Resolution: the hotspot index (max/median) is a *ratio* and is misleading on its own.

Per-worker absolute TTFT p90:
  sticky:  median 20.3s, max 55.4s -> system e2e p90 34.6s
  unified: median 10.3s, max 37.7s -> system e2e p90 18.0s

Mechanism: the top 1% of sessions own 46.5% of input mass and there are more hot sessions than instances (8), so sticky's hash binding gives *every* worker its own hot session and the median worker is also slow. Unified's LMetric fallback re-routes cold/new sessions away from hot affinity instances, preserving 7/8 worker speed. System p90 is dominated by the majority of requests landing on fast workers, hence the 2x e2e gap.

Changes:
- Replace the §3.3 figure with figs/f4c_per_worker_ttft.png (per-worker bars) instead of figs/f4c_apc_vs_hotspot_tradeoff.png (the ratio scatter)
- Rewrite the §3.3 narrative in PAPER_OUTLINE.md and MEETING.md around absolute median + max + system e2e p90 instead of the hotspot ratio
- Add a §3.3 sub-finding: "hot pin failure must be measured with per-worker absolute latency, not normalized ratio"
- Keep the scatter as supplementary for the §5 multi-policy summary

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
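The paradox described in the commit message can be checked with a few lines of arithmetic. The sketch below is illustrative only: it hard-codes the rounded per-worker TTFT p90 figures reported in this commit (including the `lmetric` row from the §3.3 table), and the names `REPORTED`, `hotspot_index`, and `rank_by` are hypothetical helpers, not part of the project's tooling. Because the inputs are rounded, the recomputed ratio for `unified` comes out as 3.66 rather than the reported 3.67.

```python
# Rounded per-worker TTFT p90 figures (seconds) as reported in this commit.
REPORTED = {
    "sticky": {"median": 20.3, "max": 55.4, "system_e2e_p90": 34.6},
    "unified": {"median": 10.3, "max": 37.7, "system_e2e_p90": 18.0},
    "lmetric": {"median": 14.0, "max": 31.3, "system_e2e_p90": 24.8},
}


def hotspot_index(stats):
    """Normalized hotspot ratio: slowest worker divided by median worker."""
    return stats["max"] / stats["median"]


def rank_by(key_fn):
    """Return policy names ordered from best (lowest value) to worst."""
    return sorted(REPORTED, key=lambda name: key_fn(REPORTED[name]))


by_ratio = rank_by(hotspot_index)
by_median = rank_by(lambda s: s["median"])
by_system = rank_by(lambda s: s["system_e2e_p90"])

for name, stats in REPORTED.items():
    print(
        f"{name:8s} ratio={hotspot_index(stats):.2f} "
        f"median={stats['median']}s system_e2e_p90={stats['system_e2e_p90']}s"
    )
print("best-to-worst by ratio: ", by_ratio)
print("best-to-worst by median:", by_median)
print("best-to-worst by system:", by_system)
```

On these numbers, ordering the policies by the absolute median worker latency matches the ordering by system e2e p90, while ordering by the max/median ratio gives the opposite order, which is consistent with the sub-finding that the ratio alone should not be used to judge hot pin failure.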
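@@ -127,9 +127,21 @@ Round-robin and load-aware routing (e.g. LMetric, OSDI'26) maximize instance

Statically splitting instances into a P pool and a D pool works for chatbot workloads but fails for agentic ones: agentic requests average 33.6k tokens and need **3.3GB** of KV; under the 4D configuration a p90 request occupies **69%** of the D-side KV pool, and a p99 request **overflows it at 138%**. The result: **TTFT p50 blows up by 62-72x** and the success rate drops from 99.5% to **52-68%**. This violates §2.1 (prefill-dominant + long context).

### §3.3 Pure session-sticky creates hot pins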
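### §3.3 The real failure of pure session-sticky: every worker is dragged down by hot sessions

Hard session-to-instance binding restores locality (APC **77.2%**, reaching 97% of the upper bound), but it pins the large sessions of the skewed distribution to single instances, and the **interference index doubles from LMetric's 6.53 to 13.65** (same trace, same hardware). This violates the skew-tolerance requirement of §2.4.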
Hard session-to-instance binding restores locality (APC **77.2%**, reaching 97% of the upper bound), but the real failure mode of pure sticky is **absolute worker latency**, not the max/median hotspot ratio.

| policy | median worker TTFT p90 | max worker TTFT p90 | system e2e p90 |
|---|---:|---:|---:|
| `sticky` | **20.3 s** | 55.4 s | **34.6 s** |
| `unified` (affinity + LMetric fallback) | **10.3 s** | 37.7 s | **18.0 s** |
| `lmetric` | 14.0 s | 31.3 s | 24.8 s |

Mechanism: the top 1% of sessions account for 46.5% of input mass, and the number of hot sessions is ≥ the number of instances (8). sticky's hash binding therefore makes **every worker carry its own hot session**, so even the median worker is slowed to the 20 s range. unified uses the LMetric fallback to re-route cold/new sessions to non-hot workers, preserving the speed of 7/8 workers. System p90 is determined by where the majority of requests land, so unified is ~2x faster than sticky on e2e p90.

**Note**: the hotspot ratio (max/median) is misleading on its own. sticky's 2.73 is *lower* than unified's 3.67, but because sticky's median is also high (20.3 s vs unified's 10.3 s), the system as a whole is slower. A useful §3.3 sub-finding: **hot pin failure must be measured with per-worker absolute latency, not a normalized ratio**.

This violates the skew-tolerance requirement of §2.4.

**Figure 4: Three baselines, three failure modes**, split into three sub-figures placed in §3.1/§3.2/§3.3:

@@ -139,8 +151,11 @@ Round-robin and load-aware routing (e.g. LMetric, OSDI'26) maximize instance
§3.2: D-side KV pool occupancy vs per-request KV footprint; both 4P+4D and 6P+2D cross the 90% memory wall in the agentic regime:


§3.3: APC vs hotspot index scatter (unified/sticky sit in the high-APC, high-hotspot region; lmetric/load_only sit in the low-APC, low-hotspot region):

§3.3: Per-worker TTFT p90 across 8 instances × 5 policies. Every sticky worker is slowed down (median 20.3 s), whereas unified concentrates the damage on e4 and keeps the other workers fast (median 10.3 s):


> 📝 Supplementary (not in main §3; can go in the §5 multi-policy summary or an appendix): APC vs hotspot ratio scatter:


> 📝 Optional supporting figure: prefill-decode interference (an 8k prefill on the same GPU degrades TPOT by 66x), placed in §3.3 to support the sticky interference argument: