§3.1: add LMetric vs load_only design analysis (cache signal diluted by ×score)
Why the LMetric → load_only APC gap is only +3.3pp despite LMetric
explicitly being "cache-aware load routing":

    P     = pending_prefill_tokens + (input_length - cache_hit)
    score = P × num_requests        <-- multiplicative

cache_hit appears only as a reduction inside P. Because score is
multiplicative in num_requests, a session-affinity instance whose
num_requests has climbed will lose argmin to a cold instance even when
cache_hit on the warm one is ~90%.

Worked example:

    warm: P=2500,  num_req=5  -> score 12500
    cold: P=10000, num_req=1  -> score 10000   <-- LMetric picks cold

    load_only  53.9% APC           (pure num_requests)
    LMetric    57.2%      +3.3pp   (cache as additive cost term)
    sticky     77.7%      +23.8pp  (cache as hard constraint)
    unified    78.7%      +24.8pp  (cache as hard+soft hybrid)

Lesson worth stating explicitly in §3.1: cache awareness folded into a
multiplicative load cost-model is structurally insufficient. Affinity
must be a separate routing branch (sticky / unified hybrid), not a
correction term inside a load score.

PAPER_OUTLINE.md §3.1 gets the design analysis + the new APC table;
MEETING.md gets a one-paragraph version of the same point.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
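The worked example in the message can be checked with a few lines of Python. This is a minimal sketch, not the router's actual code: the `Instance` container and its field names are hypothetical, and the split of each P into pending tokens plus uncached input (a 10000-token prompt, 9000 tokens cached on the warm instance) is an assumption chosen only to reproduce P=2500 and P=10000.

```python
from dataclasses import dataclass


@dataclass
class Instance:
    # Hypothetical container; field names are illustrative, not the router's API.
    name: str
    pending_prefill_tokens: int
    num_requests: int
    cache_hit: int  # tokens of the incoming prompt already cached on this instance


def lmetric_score(inst: Instance, input_length: int) -> int:
    # P = pending_prefill_tokens + (input_length - cache_hit); score = P * num_requests
    p = inst.pending_prefill_tokens + (input_length - inst.cache_hit)
    return p * inst.num_requests


INPUT_LENGTH = 10_000  # assumed prompt length
# Assumed decomposition: warm has a 90% hit (9000 of 10000 tokens) and P = 2500.
warm = Instance("warm", pending_prefill_tokens=1_500, num_requests=5, cache_hit=9_000)
cold = Instance("cold", pending_prefill_tokens=0, num_requests=1, cache_hit=0)

scores = {i.name: lmetric_score(i, INPUT_LENGTH) for i in (warm, cold)}
picked = min(scores, key=scores.get)  # argmin over the score
print(scores, "->", picked)
```

Under these assumed numbers the 90% hit shrinks P by 4x, but the 5x request count more than cancels it, so the argmin lands on the cold instance.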
@@ -41,6 +41,8 @@ L = Λ · N · W_turn(L) # agentic, T_human≈0
LMetric reaches 56.9% APC and load_only 54.1%, both far below the 79.6% upper bound. The 23pp gap comes directly from intra-session cache hits lost when requests are routed across instances.
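Note that LMetric beats load_only by only **+3.3pp**. LMetric's score = `(pending_prefill + input − cache_hit) × num_requests`: cache_hit enters only as a reduction term in the cost model, but the score is **multiplicative**. When an instance with affinity has a high num_requests, the product swallows its cache benefit and LMetric still picks the cold instance. sticky, which treats cache as a hard constraint, lifts APC directly to 77.2%. **Conclusion: cache-aware load routing is not enough. Affinity must be an independent routing path and cannot be folded into the load cost.**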
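The difference between a correction term and a separate routing path can be sketched as follows. This is an illustrative sketch only, not the sticky or unified policy as implemented: the `session_home` map, the `max_requests` overload guard, and the function names are hypothetical. The P values are held fixed across sessions for simplicity.

```python
from typing import Dict, Tuple

# name -> (P, num_requests); numbers taken from the commit's worked example
Instances = Dict[str, Tuple[int, int]]


def route_by_load_score(instances: Instances) -> str:
    # LMetric-style choice: argmin of P * num_requests.
    return min(instances, key=lambda n: instances[n][0] * instances[n][1])


def route_with_affinity(
    instances: Instances,
    session_home: Dict[str, str],
    session_id: str,
    max_requests: int = 8,  # hypothetical overload guard, not from the source
) -> str:
    # Affinity branch: decided before any load score is computed.
    home = session_home.get(session_id)
    if home in instances and instances[home][1] < max_requests:
        return home
    # No usable home instance: fall back to the load score and remember the choice.
    choice = route_by_load_score(instances)
    session_home[session_id] = choice
    return choice


instances = {"warm": (2500, 5), "cold": (10000, 1)}
session_home = {"s1": "warm"}

load_choice = route_by_load_score(instances)
sticky_choice = route_with_affinity(instances, session_home, "s1")
new_session_choice = route_with_affinity(instances, session_home, "s2")
print(load_choice, sticky_choice, new_session_choice)
```

In this sketch the returning session `s1` stays on its warm instance regardless of the score, while the new session `s2` has no home and goes through the load score.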
### Static PD-disagg: the D-side KV capacity wall