
tim 6d1c9237fa docs(architecture): KVC eviction granularity is the wrong abstraction

After E3 exposed massive session-level eviction (90 trims × avg
67K tokens/evict = 6.1M tokens trashed in 1h12min), we have to
acknowledge the local-patch sequence (E2→load-floor→Fix A →
proposed disable-migration → proposed disable-admission) was a
KVC-to-DP collapse trajectory, not a fix.

The fundamental issue: SessionAwareCache merged two responsibilities
that should be separate.

  1. Session lifecycle tracking (legitimate — streaming sessions
     reuse KV across turns and need per-session metadata).
  2. Eviction granularity decision (wrong — sessions should not be
     the eviction unit).

`release_session` frees the session-exclusive range
[cache_protected_len, kv_allocated_len), which is the post-radix-
commit tail accumulated over decode/extend. On Inferact's
50-session workload this is 35-87K tokens per session. The radix
tree never gets a chance to do block-level leaf-LRU on that range
because it was never committed there.

Effect: evict-revisit cycle forces full 50-90K re-prefill per
session per evict — which is exactly the per-request cost of naive
PD-disagg. KVC's direct-to-D fast-path advantage collapses.

The right fix is structural (not a patch): progressively commit
streaming-session decode output to the radix tree so SGLang's
block-level LRU can shed only the deepest leaves, preserving the
recent prefix that next-turn requests are most likely to match.
SessionSlot becomes pure metadata. Scope is ~1-2 weeks of vendored
SGLang refactor, orthogonal-and-complementary to the D→P sync work
proposed in RESEED_SLOW_PATH_AND_D_TO_P_GAP §4.

Doc lists five anti-patterns the next agent should avoid (tuning
migration_reject_threshold, disabling migration/admission, etc) —
all of those are local symptoms downstream of the eviction
granularity choice.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

2026-05-12 14:21:45 +08:00


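KVC Eviction Granularity: Design Review (Architecture Level)

Date: 2026-05-12
Status: architecture review / pending design discussion
Companion: docs/E1_E2_RESULTS_ZH.md, docs/E3_FINDINGS_ZH.md, docs/E1_E2_FIX_DESIGN_ZH.md
Branch: h200-cu130

This document is a high-level architectural retrospective after the E2 → E3 iterations, not yet another fix design. Over the previous E2 → E3 rounds I kept adding local patches (load-floor bonus, Fix A skip-zero-extend, tuning migration_reject_threshold, and so on), but the E3 measurements force us to admit that, taken together, these patches trace a trajectory of KVC degrading toward DP / naive PD-disagg.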

0. TL;DR

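KVC's value proposition is "the session is pinned to one D, KV accumulates continuously across turns, and the direct-to-D fast path gives 0.04s TTFT".
On trim, SessionAwareCache.release_session frees the entire session-exclusive tail in one shot: in E3 a single trim freed 67,726 tokens on average (samples: 35K / 38K / 40K / 86K / 87K), not "a few leaf blocks".
When an evicted session returns, it must re-prefill 50-90K tokens from the client's original prompt and pay a 5-9 GB mooncake transfer, which is identical to naive PD-disagg.
→ In the saturation regime, KVC's cache-continuity design is cancelled out by its own eviction. Session-level eviction conflicts with KVC's design intent.
The real direction is not stacking more patches but changing the eviction granularity: have the streaming session's decode output progressively commit into the radix tree and let SGLang's standard block-level LRU erode the oldest leaves. SessionSlot is reduced to pure metadata.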

1. What we got right, and what we missed

KVC's design promise (from `KVC_ROUTER_ALGORITHM.md` §1)

Property	Design intent
Session pinning	Session `s` is pinned to the single D `pin[s]`; all turns of the same session accumulate KV on that D
Direct-to-D fast path	`req.session ∈ M_d ∧ append_len ≤ τ_append ∧ cap_ok` → append only the new tokens, with no P→D mooncake transfer
TTFT advantage	append-only path TTFT ≈ 40ms (fast-path p50 of the historical v2 on SWE-Bench)
Concentrated rather than fragmented cache	A session's cache is concentrated on one D, so the hit rate is high
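
The fast-path condition in the table can be read as a three-way conjunction. The following is a minimal sketch of that predicate, not the router's actual code: `DecodeWorker`, `can_direct_append`, and the interpretation of `cap_ok` as "the append fits in the remaining KV capacity" are assumptions made for illustration.

```python
from dataclasses import dataclass, field
from typing import Set


@dataclass
class DecodeWorker:
    """Toy stand-in for one decode (D) instance; every field is hypothetical."""
    resident_sessions: Set[str] = field(default_factory=set)  # plays the role of M_d
    free_tokens: int = 0                                      # remaining KV capacity


def can_direct_append(session_id: str, append_len: int,
                      worker: DecodeWorker, tau_append: int) -> bool:
    """Return True when the request may skip the P->D transfer."""
    session_resident = session_id in worker.resident_sessions  # req.session ∈ M_d
    short_append = append_len <= tau_append                    # append_len ≤ τ_append
    cap_ok = append_len <= worker.free_tokens                  # assumed meaning of cap_ok
    return session_resident and short_append and cap_ok


d = DecodeWorker(resident_sessions={"s1"}, free_tokens=4096)
print(can_direct_append("s1", 512, d, tau_append=2048))  # resident, short, fits
print(can_direct_append("s2", 512, d, tau_append=2048))  # not resident on this D
```

An evicted session fails the first clause, which is why every session-level eviction turns the next turn into a slow-path request.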

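What we actually measured (E3, killed at 1h12min)

Metric	Measured	Deviation from the design promise
Eviction count	90	The design assumes "once a session is pinned, it keeps accumulating"
Average freed per eviction	67,726 tokens	Not "a few leaf blocks" but the whole session tail
Total freed	6,095,375 tokens	≈ 8 session-pool capacities of KV trashed in 1h12min
Sessions that triggered a reseed	25 / 50 (50%)	Each of these sessions went through one evict-revisit cycle = one 50-90K re-prefill
Average time per reseed	3-7s (P prefill + mooncake)	On par with naive PD-disagg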
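
The eviction figures in the table are mutually consistent, which a few lines of Python can confirm. The three constants are copied from the table; the variable names and the derived per-minute rate are added here only for illustration.

```python
evictions = 90
total_freed = 6_095_375   # tokens, "Total freed" row
wall_clock_min = 72       # E3 was killed at 1h12min

avg_per_evict = total_freed / evictions
freed_per_min = total_freed / wall_clock_min

print(f"average tokens per eviction: {avg_per_evict:,.0f}")
print(f"tokens freed per minute:     {freed_per_min:,.0f}")
```

The first value reproduces the 67,726-token average reported in the table.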

E1 for comparison: 0 evictions, 0 retracts, all 50 sessions completed. E1 uses the pd-disaggregation mechanism, with no KVC layer and no admission RPC, yet it preserved cache continuity (router-side stickiness keeps each session on the same D).

The irony: E1 (naive 1P2D + kv-aware policy) ended up closer to KVC's design intent than E3 (KVC v2 + load-floor + RDMA), because E1 has no admission feedback loop, so nothing triggers those 90 session-level evictions.

2. Why session-level eviction is wrong

Observed semantics of `release_session` (`session_aware_cache.py:250-281`)

def release_session(self, session_id: str):
    slot = self.slots.pop(session_id, None)
    ...
    if slot.last_node is not None:
        self.inner.dec_lock_ref(slot.last_node, ...)        # release the radix lock ✓

    if slot.is_holding_kv:
        start = slot.cache_protected_len
        end = slot.kv_allocated_len
        if start < end:
            kv_indices = self.req_to_token_pool.req_to_token[
                slot.req_pool_idx, start:end
            ]
            self.token_to_kv_pool_allocator.free(kv_indices)  # explicitly free the whole KV range
        ...

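[cache_protected_len, kv_allocated_len) is the session-exclusive tail: all decode output accumulated since the first turn was committed to the radix tree, plus the extends of later turns. On the Inferact workload:

cache_protected_len ≈ the boilerplate part committed by the first turn (~12K)
kv_allocated_len ≈ 50-100K (accumulated over multiple turns)
freed range = 38-88K

This KV never entered the radix tree, so it cannot benefit from the gradual shedding of radix block-level LRU. release_session cuts it all off at once.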

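The essential difference from SGLang's standard radix LRU

SGLang's standard inner.evict() (the base_prefix_cache.py interface, implemented by RadixCache):

Sorts nodes by last_access_time and evicts starting from the leaves (evicting an interior node would break the tree structure)
Each step frees the KV indices of one leaf node
Nodes with lock_ref > 0 cannot be evicted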
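
The three rules above can be sketched as a toy leaf-first LRU. This is an illustrative model under stated assumptions, not SGLang's RadixCache: the `Node` class, `collect_leaves`, and this `evict` signature are hypothetical, and each node stands for one 24-token block.

```python
import heapq
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass(eq=False)
class Node:
    """Toy radix node: one block of KV plus an LRU timestamp and a lock count."""
    name: str
    num_tokens: int
    last_access_time: float
    lock_ref: int = 0
    parent: Optional["Node"] = None
    children: List["Node"] = field(default_factory=list)

    def add_child(self, child: "Node") -> "Node":
        child.parent = self
        self.children.append(child)
        return child

    def __lt__(self, other: "Node") -> bool:
        # heapq pops the node with the oldest access time first.
        return self.last_access_time < other.last_access_time


def collect_leaves(root: Node) -> List[Node]:
    leaves, stack = [], list(root.children)
    while stack:
        node = stack.pop()
        if node.children:
            stack.extend(node.children)
        else:
            leaves.append(node)
    return leaves


def evict(root: Node, num_tokens: int) -> List[str]:
    """Free at least num_tokens by dropping unlocked leaves, oldest first."""
    leaves = collect_leaves(root)
    heapq.heapify(leaves)
    freed, evicted = 0, []
    while leaves and freed < num_tokens:
        leaf = heapq.heappop(leaves)
        if leaf.lock_ref > 0:
            continue  # pinned by a live request, never evicted
        parent = leaf.parent
        parent.children.remove(leaf)
        freed += leaf.num_tokens
        evicted.append(leaf.name)
        if parent is not root and not parent.children:
            heapq.heappush(leaves, parent)  # the parent just became a leaf
    return evicted


root = Node("root", 0, 0.0)
s1_a = root.add_child(Node("s1_a", 24, 1.0))
s1_b = s1_a.add_child(Node("s1_b", 24, 2.0))
s2_a = root.add_child(Node("s2_a", 24, 3.0))
s2_b = s2_a.add_child(Node("s2_b", 24, 4.0, lock_ref=1))  # in use right now

print(evict(root, 48))  # frees two 24-token blocks of the idle session s1
```

The point of the sketch is the unit of work: each loop iteration gives back one block, so a request for 48 tokens costs 48 tokens of cache rather than a whole session tail.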

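Comparison:

Aspect	session-level (current)	block-level (SGLang radix)
Granularity of one release	Whole session tail (35-87K)	One leaf node (~24 tokens / page-size)
Recent prefix retained	❌ all lost	✅ retained (recently accessed → newer timestamp → not evicted first)
Evict-revisit cost	50-90K re-prefill	Only the evicted leaves (≪ 50K)
Relation to session lifecycle	Tightly coupled (eviction is the lifecycle exit action)	Decoupled (the lifecycle only manages lock_ref)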

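Why it ended up this way: SessionAwareCache conflates two responsibilities

SessionAwareCache was designed to carry two responsibilities that should have been separate:

Session lifecycle tracking (reasonable): a streaming session reuses KV across multiple reqs, so the fields (req_pool_idx, kv_committed_len, kv_allocated_len, last_node) must be kept between turns and restored into the next turn's req.
Eviction granularity decision (the problem): it treats the session as the smallest unit of eviction, bypassing the leaf-by-leaf gradual shedding of SGLang's standard LRU.

The second responsibility should never have lived in SessionAwareCache. SGLang's radix cache can already do block-level LRU, provided the session's KV is actually in the radix tree. Because the session-exclusive tail is never committed to the radix tree, radix LRU cannot see it, and the only option left is for release_session to free it in one large chunk.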

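3. The overall trajectory of our previous patches

Reviewed along the commit timeline, each step appeared to fix the issue at hand, while the overall direction was a KVC → DP degradation:

Iteration	Change	Local goal	Big-picture effect
E2 baseline	mechanism=kvcache-centric, worker admission	Produce the KVC v2 headline numbers	D2 cold + cascade → 1054 failures (KVC's design premise collapses)
E3 load-floor bonus	Spread fresh sessions evenly onto D2	Fix the cold-start bias	Triggered migration → 25 sessions reseeded → exposed the evict granularity problem
E3 → Fix A	Fix the fill_ids<prefix_indices invariant in vendored SGLang's `prepare_for_extend`	Prevent the decode-1 assertion crash	Patches a local bug without touching the evict design
My earlier proposal: disable migration	`--kvcache-migration-reject-threshold 0`	"Keep sessions from moving"	Would degrade KVC to pd-disagg + load-floor (the admission RPC remains but migration never takes effect)
Even earlier proposal: disable admission	Remove the admission RPC	"Save that RPC overhead"	Directly removes KVC's direct-to-D fast path (Algorithm 2 of KVC_ROUTER_ALGORITHM.md §3.2 would no longer exist)

The user correctly blocked further degradation each time. Until now, nobody had examined the root issue of evict granularity.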

4. The right direction (rough sketch)

Core idea: have the streaming session's decode output progressively commit into the radix tree and let SGLang's standard radix LRU erode the oldest leaves. SessionSlot is reduced to pure metadata.

4.1 Target behavior

Scenario	Current behavior	Target behavior
Session has accumulated 50K KV and D is full	release_session frees 38K at once (the whole session-exclusive tail)	radix LRU evicts the oldest leaf (possibly the boilerplate tail of the first turn, ~24 tokens)
An evicted session returns	Must reseed 50K (P prefill + mooncake)	Re-prefill only the evicted leaves (e.g. ~5K)
TTFT impact on an evicted session	50-90K reseed = 3-7s	5K append-prefill = ~200ms
Sessions that are not evicted	Turns within the session are append-only	Still append-only ✓ (unchanged)
KVC fast-path hit rate	91.6% (historical SWE-Bench) / 38% (E3 Inferact, due to evict-revisit)	Should hold at >85% even under saturation

4.2 Required refactor scope

Ordered by dependency; each step can be done independently, but they are coupled:

Streaming session decode output goes into the radix tree incrementally (vendored SGLang)
- Current: decode output accumulates along the kv_allocated_len dimension, but the radix tree only records up to cache_protected_len
- Change: when each turn finishes, insert the new decode tail into the radix tree through the radix cache_finished_req path
- Effect: a streaming session has a continuously growing chain in the radix tree, one node per 24-token block
- Touches: the insert path in radix_cache.py, the cache_finished_req hook in schedule_batch.py, SessionSlot.save_from_req
SessionSlot is reduced to pure metadata
- Current: SessionSlot owns req_pool_idx and the KV indices in the range [cache_protected_len, kv_allocated_len)
- Change: SessionSlot holds only last_node (pointing at a node in the radix tree) and the lock_ref state, and does not manage a KV range directly
- Effect: restore_to_req rebuilds the req state from radix match_prefix instead of directly reusing req_pool_idx
release_session only does dec_lock_ref and deletes the slot metadata
- Current: it also frees the KV in the range [cache_protected_len, kv_allocated_len)
- Change: only dec_lock_ref → let radix LRU evict naturally
- Effect: maybe_trim_decode_session_cache no longer "frees by session" and instead uses SGLang's existing tree_cache.evict(required_tokens)
The capacity check in admit_direct_append uses the radix-resident length
- Current: current_tokens = session.resident_tokens (from SessionSlot)
- Change: current_tokens = the length this session has actually committed on the radix tree = match_prefix(session.last_node).matched_length
- Effect: the "uncached = input - radix-resident" estimate used by admission becomes more precise; in evict-revisit cases admission sees "only part was lost" rather than "everything was lost"
Redesign the streaming-session correction in prepare_for_extend
- Current: the fill_ids/prefix_indices invariant patched by Fix A is a complex fixup built on the session-exclusive tail
- Change: if SessionSlot no longer owns a separate KV range, the whole correction path needs to be rewritten or may no longer be necessary
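
The first four items can be pictured with a toy model. The sketch below is an assumption-laden illustration and not the vendored SGLang code: `ToySessionCache`, `commit_turn`, `resident_tokens`, and the way locks are walked up the chain are hypothetical, chosen only to show a slot that holds a tree pointer and a release that frees nothing.

```python
from dataclasses import dataclass
from typing import Dict, Optional

BLOCK = 24  # tokens per radix node, following the ~24-token figure used above


@dataclass(eq=False)
class Node:
    num_tokens: int
    parent: Optional["Node"] = None
    lock_ref: int = 0


@dataclass
class SessionSlot:
    """Pure metadata: a pointer into the tree, no KV index range."""
    last_node: Optional[Node] = None


class ToySessionCache:
    def __init__(self) -> None:
        self.root = Node(num_tokens=0)
        self.slots: Dict[str, SessionSlot] = {}

    def _add_lock(self, node: Optional[Node], delta: int) -> None:
        # Lock or unlock every node on the path from `node` up to the root.
        while node is not None and node is not self.root:
            node.lock_ref += delta
            node = node.parent

    def commit_turn(self, session_id: str, new_tokens: int) -> None:
        """Item 1: append this turn's tail to the tree, one node per block."""
        slot = self.slots.setdefault(session_id, SessionSlot())
        self._add_lock(slot.last_node, -1)  # the lock moves to the new tip
        node = slot.last_node or self.root
        while new_tokens > 0:
            size = min(BLOCK, new_tokens)
            node = Node(num_tokens=size, parent=node)
            new_tokens -= size
        slot.last_node = node
        self._add_lock(node, +1)

    def resident_tokens(self, session_id: str) -> int:
        """Item 4: capacity checks read the committed length off the tree."""
        node, total = self.slots[session_id].last_node, 0
        while node is not None:
            total += node.num_tokens
            node = node.parent
        return total

    def release_session(self, session_id: str) -> None:
        """Item 3: unlock and drop metadata; no allocator free happens here."""
        slot = self.slots.pop(session_id, None)
        if slot is not None:
            self._add_lock(slot.last_node, -1)


cache = ToySessionCache()
cache.commit_turn("s1", 50)  # nodes of 24, 24 and 2 tokens
cache.commit_turn("s1", 24)  # one more 24-token node
tip = cache.slots["s1"].last_node
print(cache.resident_tokens("s1"), tip.lock_ref)
cache.release_session("s1")
print(tip.lock_ref)  # unlocked, so a leaf-LRU pass may evict it block by block
```

After `release_session` the chain is still in the tree; only the lock is gone, which is what lets a block-level LRU reclaim it gradually.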

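4.3 Relationship to the D→P sync in onboarding §4.4

The D→P incremental sync described in docs/RESEED_SLOW_PATH_AND_D_TO_P_GAP_ZH.md §4 is a fix for the cost of a reseed itself (keep the P-side backup up to date so that P does not have to re-prefill on reseed).

The eviction granularity change described in §4 of this document is a fix for how often reseeds are triggered (a session is no longer evicted in one whole piece, so evict-revisit cycles become rarer).

The two are orthogonal and complementary:

Evict-granularity fix only: reseed frequency drops, but the occasional reseed is still slow
D→P sync only: each reseed gets faster, but reseeds are still triggered frequently
Both: reseeds almost disappear, and are fast even when triggered

Each is on the order of ~1-2 weeks of engineering, and they can start in parallel.

4.4 Not a local patch

Note that nothing in the §4.2 list is a local change such as "tune a hyperparameter" or "add a CLI flag". This is a redesign of the invariants of internal data structures in vendored SGLang; it cannot be solved by a more precise K value or a wider substring filter.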

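5. Things we should stop doing (anti-patterns)

To keep the next agent from going down the same local-patch path:

Do not keep tuning migration_reject_threshold: this parameter only controls "how long after rejects to switch D" and has nothing to do with evict granularity. Lowering it makes migration more frequent → more reseeds → worse. Raising it → the blacklist becomes permanent (the v1 thrashing problem).
Do not disable migration: KVC would degrade to sticky pd-disagg and lose the overall reset-on-success design of v2.
Do not disable admission: that removes the direct-to-D fast path, KVC's only differentiating advantage.
Do not keep tuning _decode_session_cache_low_watermark_tokens: raising it makes LRU more aggressive → more evictions → worse. Lowering it keeps LRU from triggering → the system runs into retract decode → worse. It only treats the symptom.
Do not add more _ADMISSION_REJECTION_SUBSTRINGS: the earlier string-filter bug fix (Q2 forensic) made the migration counter actually increment, which in turn exposed the reseed cost of migration itself. Fixing the bug was right, but it showed that the migration mechanism itself has negative returns in saturated scenarios.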

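6. Recommended decision points

#	Question	Recommendation
D1	Accept this document's diagnosis (session-level eviction is the root problem)?	Yes
D2	Pause the E1/E2/E3 ablation thread and focus on the §4.2 refactor?	Yes (the current path is spending GPU time to confirm known conclusions)
D3	Do the refactor on the vendored SGLang mainline (kvc-debug-journey-v1-to-v4) or on a new branch?	New branch `feat/block-level-evict` (isolates risk)
D4	Also start the D→P sync of §4.3 (the `feat/d-to-p-sync` branch is already reserved)?	Depends on team bandwidth
D5	How should the external paper wording be handled before the refactor lands?	State that "the eviction behavior of the v2 series in the saturation regime is an identified limitation, and a fix is proposed in §future-work"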

7. Handoff to the next agent

If you take over the §4.2 refactor, read in this order:

KVC_ROUTER_ALGORITHM.md §2-3: KVC design intent
This document §2.1, §2.2: measured evict behavior
Vendored SGLang mem_cache/radix_cache.py: implementation details of the standard radix LRU
Vendored SGLang mem_cache/session_aware_cache.py: the current SessionSlot design
Vendored SGLang managers/schedule_batch.py: how prepare_for_extend uses session state
docs/RESEED_SLOW_PATH_AND_D_TO_P_GAP_ZH.md §4: engineering scope of the D→P sync (complementary work)

Key invariant: SessionSlot.restore_to_req must stay idempotent (a failed chunked prefill may be retried several times). Any refactor must test this invariant.

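Key testing pattern: unit-test the behavior of a streaming session under LRU pressure. Concretely: inject a fake inner.evict() that returns a state in which some leaves have been evicted, and assert that SessionSlot.restore_to_req still returns a valid req state (no assertion is raised, and the re-prefill length is reasonable).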
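
One possible shape for such a test is sketched below. Everything in it is a stand-in: `FakeInner`, its `evict_tail` helper, `ReqState`, and `ToySessionSlot` are hypothetical names that only mimic the post-refactor contract described in §4.2, and the real `restore_to_req` signature may differ.

```python
from dataclasses import dataclass
from typing import Dict


class FakeInner:
    """Fake radix cache: each session's chain is reduced to a resident length."""

    def __init__(self, resident: Dict[str, int]) -> None:
        self.resident = dict(resident)

    def evict_tail(self, session_id: str, num_tokens: int) -> None:
        # Pretend block-level LRU already dropped part of this session's chain.
        self.resident[session_id] = max(0, self.resident[session_id] - num_tokens)

    def match_prefix(self, session_id: str) -> int:
        return self.resident.get(session_id, 0)


@dataclass(frozen=True)
class ReqState:
    prefix_len: int  # tokens reused from the cache
    extend_len: int  # tokens that must be prefilled again


class ToySessionSlot:
    def __init__(self, session_id: str, inner: FakeInner) -> None:
        self.session_id = session_id
        self.inner = inner

    def restore_to_req(self, input_len: int) -> ReqState:
        # Read-only on purpose: calling it twice must give the same answer.
        prefix = min(self.inner.match_prefix(self.session_id), input_len)
        return ReqState(prefix_len=prefix, extend_len=input_len - prefix)


def test_restore_after_partial_eviction() -> None:
    inner = FakeInner({"s1": 50_000})
    slot = ToySessionSlot("s1", inner)
    inner.evict_tail("s1", 5_000)  # some leaves are gone, most of the chain stays

    first = slot.restore_to_req(input_len=50_500)
    retry = slot.restore_to_req(input_len=50_500)  # e.g. chunked-prefill retry

    assert first == retry                                  # idempotent
    assert first.prefix_len + first.extend_len == 50_500   # state is consistent
    assert first.extend_len == 5_500                       # only the lost part plus the new turn


test_restore_after_partial_eviction()
```

The last assertion encodes the "re-prefill length is reasonable" requirement: losing 5K tokens should cost roughly 5K tokens of prefill, not the whole session.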

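Core takeaway: our previous three rounds of patches all addressed secondary problems exposed by saturation (cold-D bias, the admission string bug, streaming-session correction edge cases), but the real primary problem is that SessionAwareCache mixes session lifecycle tracking with the eviction granularity decision. A session is a lifecycle boundary; it should not be an eviction boundary. Eviction should be handed back to the block-level radix LRU that SGLang already does well.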
