§3.1: add LMetric vs load_only design analysis (cache signal diluted by ×score)

Why the LMetric → load_only APC gap is only +3.3pp despite LMetric
explicitly being "cache-aware load routing":

  P = pending_prefill_tokens + (input_length - cache_hit)
  score = P × num_requests   <-- multiplicative

cache_hit appears only as a reduction inside P. Because score is
multiplicative in num_requests, a session-affinity instance whose
num_requests has climbed will lose argmin to a cold instance even
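when cache_hit on the warm one is ~90%. Worked example (8000-token
input, pending_prefill_tokens = 2000 on both instances; cache_hit =
7500 on warm, 0 on cold):

  warm: P=2500, num_req=5 -> score 12500
  cold: P=10000, num_req=1 -> score 10000   <-- LMetric picks cold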

  load_only 53.9% APC  (pure num_requests)
  LMetric   57.2%      +3.3pp (cache as subtractive cost term)
  sticky    77.7%     +23.8pp (cache as hard constraint)
  unified   78.7%     +24.8pp (cache as hard+soft hybrid)

Lesson worth stating explicitly in §3.1: cache awareness folded into
a multiplicative load cost-model is structurally insufficient. Affinity
must be a separate routing branch (sticky / unified hybrid), not a
correction term inside a load score.

PAPER_OUTLINE.md §3.1 gets the design analysis + the new APC table;
MEETING.md gets a one-paragraph version of the same point.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-27 14:04:14 +08:00
parent c33c825256
commit cef914ecd4
2 changed files with 20 additions and 0 deletions


@@ -41,6 +41,8 @@ L = Λ · N · W_turn(L) # agentic, T_human≈0
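LMetric (56.9%) and load_only (54.1%) APC sit far below the 79.6% upper bound; the 23pp gap comes directly from intra-session hits lost to cross-instance routing.
Note that LMetric is only **+3.3pp** better than load_only. The LMetric score is `(pending_prefill + input - cache_hit) × num_requests`: cache_hit enters only as a subtractive term in the cost model, while the score is **multiplicative**. An instance with affinity has a high num_requests, the product swallows the cache benefit, and LMetric still picks the cold instance. sticky treats cache as a hard constraint and reaches 77.2% directly. **Conclusion: cache-aware load routing is not enough; affinity must be a separate routing path and cannot be folded into the load cost.**
### Static PD-disagg: D-side KV capacity wall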
![](figs/f4b_pdsep_kv_wall.png)


@@ -123,6 +123,24 @@ dL/dε|_{ε=0} = L* / (1 − Λ · N · W'_turn(L*))
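Round-robin or load-aware routing (LMetric, OSDI'26) maximizes instance utilization but ignores session affinity. **Measured APC drops to 56.9%** (vs. the 79.6% upper bound); the 23pp gap comes directly from lost intra-session cache hits (violating §2.2).
**Why "cache-aware load routing" is also not enough: LMetric's cache signal is diluted by the multiplicative score.** LMetric scores each instance as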
```
P = pending_prefill_tokens + (input_length - cache_hit)
score = P × num_requests
```
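cache_hit appears only as a subtractive term inside `P`, while `score` is **multiplicative**. An instance with session affinity keeps receiving that session's requests, so its `num_requests` rises and the product eats the cache benefit. Example: for an 8000-token input, the warm instance has cache_hit = 7500 and the cold instance has cache_hit = 0, pending_prefill is 2000 on both, and num_requests is 5 vs. 1. The LMetric score is then `2500 × 5 = 12500` for warm and `10000 × 1 = 10000` for cold: **LMetric picks cold** and throws away ~90% of the cache. Results: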
| Policy | APC | vs load_only | Design point |
|---|---:|---:|---|
| load_only | 53.9% | | pure load (`score = num_requests`) |
| LMetric | 57.2% | **+3.3pp** | cache as a subtractive cost-model term |
| sticky | 77.7% | **+23.8pp** | cache as a hard constraint |
| unified | 78.7% | **+24.8pp** | cache as a hybrid of hard constraint and soft preference |
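The warm-vs-cold example above can be checked with a minimal Python sketch of the scoring rule. The `Instance` dataclass and the `lmetric_score` / `pick` helpers are hypothetical names introduced here for illustration; only the two formulas and the example numbers come from the text.

```python
from dataclasses import dataclass


@dataclass
class Instance:
    # Hypothetical container; field names mirror the terms of the formula.
    name: str
    pending_prefill_tokens: int
    cache_hit: int      # tokens of the incoming request already cached here
    num_requests: int


def lmetric_score(inst: Instance, input_length: int) -> int:
    # P = pending_prefill_tokens + (input_length - cache_hit)
    p = inst.pending_prefill_tokens + (input_length - inst.cache_hit)
    # score = P x num_requests (the multiplicative step)
    return p * inst.num_requests


def pick(instances, input_length: int) -> Instance:
    # The router takes the argmin of the score.
    return min(instances, key=lambda i: lmetric_score(i, input_length))


warm = Instance("warm", pending_prefill_tokens=2000, cache_hit=7500, num_requests=5)
cold = Instance("cold", pending_prefill_tokens=2000, cache_hit=0, num_requests=1)

scores = {i.name: lmetric_score(i, 8000) for i in (warm, cold)}
chosen = pick([warm, cold], 8000).name
print(scores, chosen)
```

With these inputs the cold instance's `P` is 4x the warm one's, so the warm instance loses the argmin as soon as its `num_requests` exceeds 4x the cold instance's, which the 5-vs-1 case does.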
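**The +3.3pp from `load_only → LMetric` is almost negligible; the +20.5pp from `LMetric → sticky` is the payoff of handling the cache signal correctly.** Cache awareness cannot be just one term swallowed by a cost model; it must be a **separate routing path** (sticky / unified hybrid). This is a more concrete failure mode behind the "locality" problem of §3.1.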
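One way to picture "affinity as a separate routing path" is the sketch below. It is an assumption-laden illustration, not the actual sticky or unified implementation, which this section does not specify: the `route` function, the `session_home` map, and the `overload_cap` threshold are all hypothetical. The affinity branch is evaluated first and never passes through a load score; load is consulted only as a fallback.

```python
from typing import Dict


def route(session_id: str,
          num_requests: Dict[str, int],
          session_home: Dict[str, str],
          overload_cap: int = 8) -> str:
    # Branch 1 (affinity): keep the session on its home instance unless
    # that instance is at or above the (hypothetical) overload cap.
    home = session_home.get(session_id)
    if home is not None and num_requests[home] < overload_cap:
        return home

    # Branch 2 (load): new or evicted sessions go to the least-loaded
    # instance, which becomes the session's new home.
    target = min(num_requests, key=num_requests.get)
    session_home[session_id] = target
    return target


load = {"warm": 5, "cold": 1}
homes = {"s1": "warm"}
print(route("s1", load, homes))  # existing session: affinity branch
print(route("s2", load, homes))  # new session: load branch
```

In the warm-vs-cold scenario, this routing keeps session `s1` on the warm instance at `num_requests = 5`, where a multiplicative score would have moved it.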
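### §3.2 Static PD-disaggregation hits the D-side KV wall
Statically splitting instances into a P pool and a D pool works for chatbots but fails for agentic workloads: an agentic request averages 33.6k tokens and needs **3.3GB** of KV. Under the 4D configuration, a p90 request occupies **69%** of the D KV pool and a p99 request outright **overflows at 138%**. As a result, **TTFT p50 blows up by 62-72x** and the success rate falls from 99.5% to **52-68%** (violating §2.1: prefill-dominant + context).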