From cef914ecd4aabf11b5c89ded88b5e2a46727b938 Mon Sep 17 00:00:00 2001
From: Gahow Wang
Date: Wed, 27 May 2026 14:04:14 +0800
Subject: [PATCH] =?UTF-8?q?=C2=A73.1:=20add=20LMetric=20vs=20load=5Fonly?=
 =?UTF-8?q?=20design=20analysis=20(cache=20signal=20diluted=20by=20=C3=97s?=
 =?UTF-8?q?core)?=
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

Why the LMetric → load_only APC gap is only +3.3pp despite LMetric
explicitly being "cache-aware load routing":

    P     = pending_prefill_tokens + (input_length - cache_hit)
    score = P × num_requests          <-- multiplicative

cache_hit appears only as a reduction inside P. Because score is
multiplicative in num_requests, a session-affinity instance whose
num_requests has climbed will lose argmin to a cold instance even
when cache_hit on the warm one is ~90%. Worked example:

    warm: P=2500,  num_req=5  -> score 12500
    cold: P=10000, num_req=1  -> score 10000   <-- LMetric picks cold

    load_only  53.9% APC            (pure num_requests)
    LMetric    57.2%   +3.3pp       (cache as additive cost term)
    sticky     77.7%  +23.8pp       (cache as hard constraint)
    unified    78.7%  +24.8pp       (cache as hard+soft hybrid)

Lesson worth stating explicitly in §3.1: cache awareness folded into a
multiplicative load cost-model is structurally insufficient. Affinity
must be a separate routing branch (sticky / unified hybrid), not a
correction term inside a load score.

PAPER_OUTLINE.md §3.1 gets the design analysis + the new APC table;
MEETING.md gets a one-paragraph version of the same point.

Co-Authored-By: Claude Opus 4.7
---
 MEETING.md       |  2 ++
 PAPER_OUTLINE.md | 18 ++++++++++++++++++
 2 files changed, 20 insertions(+)

diff --git a/MEETING.md b/MEETING.md
index 5e93890..14b6129 100644
--- a/MEETING.md
+++ b/MEETING.md
@@ -41,6 +41,8 @@ L = Λ · N · W_turn(L) # agentic, T_human≈0
 
 LMetric reaches 56.9% APC and load_only 54.1%, far below the 79.6% upper bound. The 23pp gap comes directly from intra-session hits lost to cross-instance routing.
 
+Note that LMetric beats load_only by only **+3.3pp**. LMetric's score is `(pending_prefill + input − cache_hit) × num_requests`: cache_hit enters only as a subtracted term in the cost model, but the score is **multiplicative**. An instance with affinity accumulates a high num_requests, the product swallows the cache benefit, and LMetric still picks the cold instance. sticky treats cache as a hard constraint and goes straight to 77.2%. **Conclusion: cache-aware load routing is not enough; affinity must be a separate routing path and cannot be folded into the load cost.**
+
 ### Static PD-disagg: the D-side KV capacity wall
 
 ![](figs/f4b_pdsep_kv_wall.png)
diff --git a/PAPER_OUTLINE.md b/PAPER_OUTLINE.md
index 73728a4..ae8d59e 100644
--- a/PAPER_OUTLINE.md
+++ b/PAPER_OUTLINE.md
@@ -123,6 +123,24 @@ dL/dε|_{ε=0} = L* / (1 − Λ · N · W'_turn(L*))
 
 Round-robin and load-aware routing (e.g. LMetric, OSDI'26) maximize instance utilization but ignore session affinity. **Measured APC drops to 56.9%** (vs. the 79.6% upper bound); the 23pp gap comes directly from lost intra-session cache hits. Violates §2.2.
 
+**Why "cache-aware load routing" is not enough either: LMetric's cache signal is diluted by the multiplicative score.** LMetric scores each instance as
+
+```
+P = pending_prefill_tokens + (input_length - cache_hit)
+score = P × num_requests
+```
+
+cache_hit appears only as a subtracted term inside `P`, while `score` is **multiplicative**. An instance with session affinity keeps receiving its sessions, so its `num_requests` climbs and the product swallows the cache benefit. Example: an 8000-token input, cache_hit = 7500 on the warm instance vs. 0 on the cold one, pending_prefill = 2000 on both, and num_requests of 5 vs. 1. The LMetric score is `2500 × 5 = 12500` for warm and `10000 × 1 = 10000` for cold, so **LMetric picks cold** and throws away a ~90% cache hit. Results:
+
+| Policy | APC | vs load_only | Design point |
+|---|---:|---:|---|
+| load_only | 53.9% | n/a | pure load (`score = num_requests`) |
+| LMetric | 57.2% | **+3.3pp** | cache as a subtracted cost-model term |
+| sticky | 77.7% | **+23.8pp** | cache as a hard constraint |
+| unified | 78.7% | **+24.8pp** | cache as a hybrid of hard constraint and soft preference |
+
+**The +3.3pp from `load_only → LMetric` is nearly negligible; the +20.5pp from `LMetric → sticky` is the payoff for handling the cache signal correctly.** Cache awareness cannot be just one term that gets swallowed inside a cost model; it must be a **separate routing path** (sticky / unified hybrid). This is a more specific failure mode for §3.1 than "losing locality".
+
 ### §3.2 Static PD-disaggregation hits the D-side KV wall
 
 Statically splitting instances into a P pool and a D pool works for chatbot workloads but fails for agentic ones: agentic requests average 33.6k tokens and need **3.3GB** of KV; under the 4D configuration a p90 request occupies **69%** of the D-side KV pool, and a p99 request **overflows it at 138%**. Result: **TTFT p50 balloons by 62-72x** and the success rate drops from 99.5% to **52-68%**. Violates §2.1 (prefill-dominant + long context).
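
Reviewer note on the §3.1 worked example: the arithmetic can be reproduced with a short Python sketch. The score formula and the warm/cold numbers are taken from the patch; the `Instance` type, the function names, and in particular `route_sticky` are hypothetical illustrations. The patch only describes sticky as "cache as hard constraint", so the rule below is one assumed reading of it, not the evaluated policy.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Instance:
    """Per-instance state seen by the router (illustrative type, not from the patch)."""
    name: str
    pending_prefill_tokens: int
    cache_hit: int        # tokens of this request already cached on the instance
    num_requests: int


def lmetric_score(inst: Instance, input_length: int) -> int:
    # Formula as stated in the patch:
    #   P = pending_prefill_tokens + (input_length - cache_hit)
    #   score = P * num_requests
    p = inst.pending_prefill_tokens + (input_length - inst.cache_hit)
    return p * inst.num_requests


def route_lmetric(instances: list[Instance], input_length: int) -> Instance:
    # Lowest score wins (argmin).
    return min(instances, key=lambda i: lmetric_score(i, input_length))


def route_sticky(instances: list[Instance], input_length: int) -> Instance:
    # ASSUMPTION: affinity as a separate branch. If any instance already holds
    # cache for this request, pick the one with the largest hit; only fall back
    # to the load score when no instance is warm.
    warm = [i for i in instances if i.cache_hit > 0]
    if warm:
        return max(warm, key=lambda i: i.cache_hit)
    return route_lmetric(instances, input_length)


# Numbers from the worked example: 8000 input tokens, pending_prefill = 2000 on both.
INPUT_LENGTH = 8000
WARM = Instance("warm", pending_prefill_tokens=2000, cache_hit=7500, num_requests=5)
COLD = Instance("cold", pending_prefill_tokens=2000, cache_hit=0, num_requests=1)

if __name__ == "__main__":
    for inst in (WARM, COLD):
        print(inst.name, lmetric_score(inst, INPUT_LENGTH))
    print("LMetric picks:", route_lmetric([WARM, COLD], INPUT_LENGTH).name)
    print("sticky picks: ", route_sticky([WARM, COLD], INPUT_LENGTH).name)
```

With these inputs the multiplicative score sends the request to the cold instance even though the warm one holds 7500 of the 8000 input tokens, while the branch-first variant keeps it on the warm instance. The sketch ignores overload handling, which a real sticky or unified policy would presumably need.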