Gahow Wang a57afa86b4 Initial commit: obsidian to gitea

2026-05-07 15:04:41 +08:00

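0. One-Sentence Positioning

I suggest splitting these two directions into a foundation project A and a policy project B:

Project	Name	Core problem	Paper/system contribution
A	HeteroCache: Hybrid-Attention-Aware KV State Manager	Under hybrid attention, the KV cache is no longer a set of homogeneous token blocks, so the PagedAttention abstraction falls short	A new KV cache abstraction plus joint GPU/CPU/SSD/recompute management
B	OwnerServe: Single-Owner KV Artifact Scheduling for Agentic PD-Hybrid Serving	Under agentic long-context workloads, P/D separation introduces KV duplication and state-migration cost	KV ownership + PD-hybrid routing + recompute/transfer planner

The relationship between the two is:

$$
\text{B depends on A}
$$

A addresses "how KV state should be represented and managed"; B addresses "who should own that state, when to migrate it, when to recompute it, and where each request should be routed".

The DeepSeek-V4 report provides strong motivation. It explicitly states that hybrid attention produces multiple kinds of KV entries whose sizes, update rules, and cache policies all differ, and that this violates the basic assumptions of PagedAttention and its variants. The report also divides the KV cache into a classical KV cache and a state cache, and, for on-disk shared-prefix reuse, designs compressed KV storage together with full, periodic, and zero SWA caching strategies.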

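1. Shared Background: Why This Is a New Problem Now

1.1 The Core Abstraction of Traditional LLM Serving

Until now, the core abstraction of serving engines such as vLLM and SGLang has been roughly:

$$
\text{Request} \rightarrow \text{Dense KV Blocks} \rightarrow \text{Decode Scheduler}
$$

The KV cache is usually viewed as follows:

Dimension	Traditional assumption
Structure	Every layer and every token produces homogeneous KV
Lifecycle	Created during prefill, appended to continuously during decode
Location	Mostly resident in GPU HBM
Reuse	Existing KV is reused after an exact prefix hit
Scheduling goal	Lower TTFT and TPOT, higher goodput
Cache management	Block allocator + eviction policy

This abstraction still works reasonably well in the dense attention / GQA / MLA era.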

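1.2 What DeepSeek-V4 Changes

The core change in DeepSeek-V4 is that, to support million-token context, it turns attention KV from dense token-level state into heterogeneous state that mixes several forms: compressed, sparse, local-window, and tail state.

According to the report, DeepSeek-V4 uses hybrid attention built from CSA and HCA. CSA compresses KV along the sequence dimension and then applies sparse selection; HCA compresses KV more aggressively but keeps dense attention. At 1M context, V4-Pro's single-token inference FLOPs are about $27\%$ of V3.2's, and its KV cache is about $10\%$ of V3.2's.

As a result, the object a new serving engine manages is no longer a single kind of KV block but:

$$
\text{KV State}
=
\text{CSA Main KV}
+ \text{CSA Indexer KV}
+ \text{HCA KV}
+ \text{SWA KV}
+ \text{Uncompressed Tail State}
$$

The report states explicitly that hybrid attention introduces multiple kinds of KV entries with different KV cache sizes and update rules; that SWA has its own hit/eviction policy; and that, in the compression branch, tokens and hidden states that have not yet filled a compression block must be buffered as sequence state.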

2. Project A Proposal: HeteroCache

2.1 Project Name

HeteroCache: A Heterogeneous KV State Manager for Hybrid-Attention Long-Context LLM Serving

2.2 Background

The key infra signal in the DeepSeek-V4 report is:

hybrid attention turns the KV cache from homogeneous dense blocks into heterogeneous KV states.

Figure 6 of the report splits the KV cache into two classes:

Component	Contents
Classical KV cache	CSA/HCA compressed KV blocks
State cache	SWA KV + CSA/HCA tail states that are not yet compressed

Each request is allocated a fixed-size state cache block. In the classical KV cache, each cache block covers $\mathrm{lcm}(m,m')$ original tokens and produces both CSA compressed tokens and HCA compressed tokens.
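As a worked example of the block-coverage rule, the sketch below computes how many original tokens one classical cache block spans and how many compressed entries it yields. It assumes that one compressed entry summarizes exactly `m` (CSA) or `m'` (HCA) original tokens; the function name `block_coverage` is hypothetical, and `m = 4`, `m' = 128` are only the illustrative ratios this note uses elsewhere, not confirmed model settings.

```python
from math import lcm  # available in Python 3.9+


def block_coverage(m: int, m_prime: int) -> dict:
    """Original tokens spanned by one cache block and the compressed
    entries it yields, assuming one compressed entry per m (or m') tokens."""
    span = lcm(m, m_prime)
    return {
        "original_tokens": span,
        "csa_compressed_tokens": span // m,
        "hca_compressed_tokens": span // m_prime,
    }


# Illustrative ratios only (assumption): m = 4, m' = 128.
print(block_coverage(4, 128))
```

Under these assumptions a block covers 128 original tokens and holds 32 CSA entries next to a single HCA entry, which is one way to see why a single homogeneous block size no longer describes the layout.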

This means the core assumption of traditional PagedAttention no longer holds:

$$
\text{one layer} \approx \text{one homogeneous KV layout}
$$

It is replaced by:

$$
\text{one request}
\rightarrow
\left\{
\text{layer-specific cache type},\;
\text{branch-specific compression ratio},\;
\text{state-specific eviction policy}
\right\}
$$

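2.3 Past Assumptions

Past assumption	Why it used to hold
The KV cache consists of homogeneous blocks	Under dense attention / GQA, every layer's KV shape is regular and the block size is fixed
A prefix cache hit is a binary event	If the prefix tokens match, the corresponding KV can be reused directly
GPU HBM is the primary cache tier	Contexts were short, so the KV cache was managed mostly within GPU memory
Eviction policy can be layer-agnostic	KV importance, size, and update rules were roughly the same across layers
The attention kernel only consumes the cache layout	The kernel did not impose strong constraints back on the cache layout
Tail recompute does not matter	Dense KV is materialized immediately for every token, so there is never an unfinished compression block

2.4 Assumptions That May No Longer Hold

Broken assumption	Counterexample in DeepSeek-V4
KV is homogeneous	CSA main KV, CSA indexer KV, HCA KV, SWA KV, and tail state all differ
A prefix hit lets the whole prefix be reused directly	An incomplete compression block still has to be recomputed
SWA and the main KV can be managed uniformly	The report says SWA KV is uncompressed, exists in every layer, is roughly $8\times$ the size of the CSA/HCA compressed KV, and needs its own policy
Cache block size is decided by the allocator alone	The sparse attention kernel has alignment requirements, which calls for cache layout + kernel co-design
On-disk KV is just swap	V4's on-disk KV is shared-prefix artifact reuse, not passive paging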

2.5 New Assumptions to Establish

Hypothesis A1: The KV cache should be modeled as heterogeneous state, not as dense tensor blocks

New abstraction:

$$
\text{KVState}_{r}
=
\left\{
s_i \mid
s_i = (\text{type}, \text{layer}, \text{range}, \text{precision}, \text{location}, \text{restore\_cost})
\right\}
$$

where $\text{type}$ can be:

Type	Example
`COMPRESSED_CSA_MAIN`	CSA compressed main KV
`COMPRESSED_CSA_INDEX`	CSA lightning indexer KV
`COMPRESSED_HCA`	HCA compressed KV
`WINDOW_SWA`	The most recent `n_{\mathrm{win}}` uncompressed KV entries
`TAIL_STATE`	Pending hidden states that have not yet filled `m` or `m'` tokens
`DENSE_KV`	Compatibility with traditional models
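One possible way to encode this abstraction is sketched below: an enum for the state types listed above and a record for each segment $s_i$. The class names `KVType` and `KVSegment`, the string values for `precision` and `location`, the millisecond unit, and the helper `total_restore_cost` are all assumptions made for illustration.

```python
from dataclasses import dataclass
from enum import Enum, auto


class KVType(Enum):
    COMPRESSED_CSA_MAIN = auto()
    COMPRESSED_CSA_INDEX = auto()
    COMPRESSED_HCA = auto()
    WINDOW_SWA = auto()
    TAIL_STATE = auto()
    DENSE_KV = auto()


@dataclass(frozen=True)
class KVSegment:
    type: KVType
    layer: int
    token_range: tuple      # half-open (start, end) in original-token positions
    precision: str          # e.g. "BF16", "FP8"
    location: str           # e.g. "gpu", "cpu", "ssd", "remote"
    restore_cost_ms: float  # estimated cost to make the segment usable on GPU


def total_restore_cost(state: list) -> float:
    """Sum the restore cost of every segment that is not already on the GPU."""
    return sum(s.restore_cost_ms for s in state if s.location != "gpu")


state = [
    KVSegment(KVType.COMPRESSED_HCA, 0, (0, 128), "FP8", "ssd", 3.0),
    KVSegment(KVType.WINDOW_SWA, 0, (96, 128), "BF16", "gpu", 0.0),
    KVSegment(KVType.TAIL_STATE, 0, (128, 130), "BF16", "cpu", 0.5),
]
print(total_restore_cost(state))
```

Keeping `location` and `restore_cost_ms` on every segment is what lets a planner reason about partially resident state instead of a single hit/miss bit.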

Hypothesis A2: Prefix reuse should be modeled by restore cost, not by hit/miss

Traditional prefix cache:

$$
\text{hit}(p)\in \{0,1\}
$$

New prefix artifact reuse:

$$
\text{reuse\_benefit}(p)
=
C_{\text{prefill}}(p)
- C_{\text{load}}(p)
- C_{\text{recompute}}(p)
- C_{\text{stall}}(p)
$$

where:

Term	Meaning
`C_{\text{prefill}}(p)`	Cost of re-running prefill if the prefix is not reused
`C_{\text{load}}(p)`	Cost of reading the artifact from GPU/CPU/SSD/remote
`C_{\text{recompute}}(p)`	Recompute cost for the incomplete block and for SWA state restore
`C_{\text{stall}}(p)`	Blocking imposed on the request by I/O or scheduling waits
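The benefit formula can be exercised with a few lines of code. The sketch below is illustrative only: the `ReuseCosts` record, the millisecond unit, and every number in the two examples are made up to show how a nominal prefix "hit" can still be a net loss once load, recompute, and stall costs are counted.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ReuseCosts:
    prefill: float    # C_prefill: cost of prefilling the prefix from scratch
    load: float       # C_load: cost of reading the artifact
    recompute: float  # C_recompute: incomplete block + SWA restore
    stall: float      # C_stall: blocking due to I/O or scheduling waits


def reuse_benefit(c: ReuseCosts) -> float:
    """reuse_benefit = C_prefill - C_load - C_recompute - C_stall."""
    return c.prefill - c.load - c.recompute - c.stall


def should_reuse(c: ReuseCosts) -> bool:
    return reuse_benefit(c) > 0


# Hypothetical numbers in milliseconds.
hot_gpu_artifact = ReuseCosts(prefill=1200, load=5, recompute=40, stall=0)
cold_ssd_artifact = ReuseCosts(prefill=300, load=250, recompute=40, stall=60)

print(reuse_benefit(hot_gpu_artifact), should_reuse(hot_gpu_artifact))
print(reuse_benefit(cold_ssd_artifact), should_reuse(cold_ssd_artifact))
```

In the second case the prefix matches exactly, yet reusing it costs more than re-running prefill, which is the situation a binary hit/miss model cannot express.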

Hypothesis A3: Cache policy must be model-aware, branch-aware, and kernel-aware

That is:

$$
\text{cache policy}
=
f(
\text{attention branch},
\text{compression ratio},
\text{kernel alignment},
\text{reuse pattern},
\text{memory pressure}
)
$$

rather than plain LRU/LFU.

2.6 Solution Design

2.6.1 System Overview

HeteroCache can be divided into four layers:

Request / Session
      ↓
KV State Abstraction Layer
      ↓
GPU / CPU / SSD Artifact Manager
      ↓
Kernel-Aware Layout Manager
      ↓
Attention Kernels / Serving Engine

2.6.2 Core Modules

Module	Function
Attention Spec Registry	Registers each layer's attention type, KV type, compression ratio, precision, and alignment
State Cache Manager	Manages SWA KV and uncompressed tail states
Compressed KV Allocator	Manages CSA/HCA compressed blocks
Artifact Store	Stores prefix compressed KV, which may live on GPU/CPU/SSD
Restore Planner	Decides the combination of load, recompute, and partial reuse
Kernel Layout Adapter	Organizes the block layout according to sparse attention kernel requirements
Policy Engine	Performs eviction / checkpoint / prefetch under memory pressure

2.6.3 Attention Spec Registry

Each model needs to declare:

LayerSpec {
  layer_id
  attention_type: CSA | HCA | SWA | Dense
  compression_ratio: m or m'
  kv_entry_size
  precision: BF16 | FP8 | FP4
  block_alignment
  state_size
  restore_fn
}

For a DeepSeek-V4-like model, this might be:

Layer Type	Cache Type	Compression
CSA layer	CSA main + CSA indexer + SWA + tail	`m=4`
HCA layer	HCA compressed + SWA + tail	`m'=128`
pure SWA layer	SWA only	windowed
dense layer	dense KV	none
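A minimal sketch of such a registry in Python follows. It keeps only a subset of the `LayerSpec` fields (the size, alignment, and `restore_fn` fields are omitted for brevity), and the class `AttentionSpecRegistry`, the `CACHE_TYPES` mapping, and the method names are hypothetical; the mapping simply mirrors the table above.

```python
from dataclasses import dataclass
from typing import Optional

# Cache types per attention type, mirroring the table above (illustrative).
CACHE_TYPES = {
    "CSA": ("CSA main", "CSA indexer", "SWA", "tail"),
    "HCA": ("HCA compressed", "SWA", "tail"),
    "SWA": ("SWA",),
    "Dense": ("dense KV",),
}


@dataclass(frozen=True)
class LayerSpec:
    layer_id: int
    attention_type: str                # "CSA" | "HCA" | "SWA" | "Dense"
    compression_ratio: Optional[int]   # m or m'; None when nothing is compressed
    precision: str = "BF16"


class AttentionSpecRegistry:
    """Maps layer ids to their declared attention spec."""

    def __init__(self) -> None:
        self._specs = {}

    def register(self, spec: LayerSpec) -> None:
        if spec.attention_type not in CACHE_TYPES:
            raise ValueError(f"unknown attention type: {spec.attention_type}")
        self._specs[spec.layer_id] = spec

    def cache_types(self, layer_id: int) -> tuple:
        return CACHE_TYPES[self._specs[layer_id].attention_type]


registry = AttentionSpecRegistry()
registry.register(LayerSpec(0, "CSA", 4))     # m = 4 is an illustrative value
registry.register(LayerSpec(1, "HCA", 128))   # m' = 128 is an illustrative value
print(registry.cache_types(0))
print(registry.cache_types(1))
```

Declaring the spec per layer, rather than per model, is what allows the cache manager to size and place each branch independently.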

2.6.4 State Cache Manager

The state cache manages:

$$
\text{StateCache}
=
\text{SWA}_{n_{\mathrm{win}}}
+ \text{Tail}_{<m}
+ \text{Tail}_{<m'}
$$

Core strategies:

Strategy	Description
fixed-size per-sequence state block	Consistent with the DeepSeek-V4 report: each request is allocated a bounded state block
tail-aware append	While the compression block is not full, only the tail state is updated
block-finalize trigger	Once `m` or `m'` tokens have accumulated, emit the compressed KV and release the tail
state spill policy	For long-idle sessions, the SWA checkpoint can be written to CPU/SSD
restore policy	On a prefix hit, choose load or recompute according to the SWA policy
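The tail-aware append and block-finalize trigger can be sketched as a small buffer. Everything here is an assumption for illustration: the class `TailBuffer` is hypothetical, hidden states are stand-in floats, and the mean in `_compress` is only a placeholder for the model's real compression operator.

```python
class TailBuffer:
    """Buffers pending hidden states until a full compression block exists."""

    def __init__(self, ratio: int) -> None:
        self.ratio = ratio    # m or m' for this branch
        self.pending = []     # tail state: tokens not yet compressed
        self.sealed = []      # compressed entries emitted so far

    def append(self, hidden: float) -> bool:
        """Append one token's state; return True if a block was finalized."""
        self.pending.append(hidden)
        if len(self.pending) == self.ratio:
            self.sealed.append(self._compress(self.pending))
            self.pending = []  # release the tail once the block is sealed
            return True
        return False

    @staticmethod
    def _compress(block: list) -> float:
        # Placeholder only: the real operator is model-specific.
        return sum(block) / len(block)


buf = TailBuffer(ratio=4)
for t in range(10):
    buf.append(float(t))
print(buf.sealed, buf.pending)
```

After ten tokens with ratio 4, two compressed entries exist and two tokens remain in the tail; only those two would need recomputation or a state transfer on restore.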

2.6.5 Artifact Store

Artifact ID:

$$
\text{ArtifactID}
=
H(
\text{model\_id},
\text{model\_revision},
\text{tokenizer},
\text{attention\_spec},
\text{prefix\_hash},
\text{block\_range},
\text{precision}
)
$$

Artifact metadata:

Field	Purpose
`prefix_hash`	exact prefix identity
`block_range`	Which original tokens are covered
`kv_type`	CSA/HCA/SWA/tail
`location`	GPU / CPU / SSD / remote
`size_bytes`	cache accounting
`restore_cost`	Used by the scheduler
`last_access`	eviction
`reuse_count`	utility estimation
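A minimal sketch of the artifact ID, assuming $H$ is SHA-256 over a canonical JSON encoding of the fields. The note does not fix the hash function or the encoding, so both choices, as well as the example argument values, are assumptions.

```python
import hashlib
import json


def artifact_id(model_id: str, model_revision: str, tokenizer: str,
                attention_spec: str, prefix_hash: str,
                block_range: tuple, precision: str) -> str:
    """Deterministic ID over all fields that affect KV contents."""
    payload = json.dumps(
        {
            "model_id": model_id,
            "model_revision": model_revision,
            "tokenizer": tokenizer,
            "attention_spec": attention_spec,
            "prefix_hash": prefix_hash,
            "block_range": list(block_range),
            "precision": precision,
        },
        sort_keys=True,             # canonical key order
        separators=(",", ":"),      # no whitespace variation
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


# Placeholder values, for illustration only.
a = artifact_id("model-x", "rev1", "tok-v1", "csa4-hca128", "abc123", (0, 128), "FP8")
b = artifact_id("model-x", "rev1", "tok-v1", "csa4-hca128", "abc123", (0, 128), "BF16")
print(a != b)
```

Including `model_revision`, `attention_spec`, and `precision` in the hash means an artifact produced under one configuration is never mistaken for one produced under another, even when the token prefix is identical.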

2.6.6 Restore Planner

When a new request hits a prefix, the Restore Planner solves:

$$
a^*
=
\arg\min_{a\in A}
\left(
T_{\text{load}}(a)
+ T_{\text{recompute}}(a)
+ T_{\text{queue}}(a)
+ \lambda M_{\text{HBM}}(a)
\right)
$$

Action set:

Action	Scenario
`LOAD_COMPRESSED_KV`	A compressed CSA/HCA artifact already exists
`RECOMPUTE_TAIL`	An incomplete compression block is not worth storing
`LOAD_SWA_CHECKPOINT`	A periodic checkpoint is hit
`RECOMPUTE_SWA`	Zero SWA caching, or the checkpoint is too stale
`PREFETCH_ARTIFACT`	The session is likely to continue, so the artifact is fetched ahead of time
`DROP_LOW_UTILITY_ARTIFACT`	Memory/SSD pressure is high
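The planner objective can be sketched as an argmin over candidate plans, where each plan is a combination of the actions above. The `RestorePlan` record, the units (milliseconds and GB), and all numbers are hypothetical; a real planner would estimate these terms from artifact metadata and current load.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RestorePlan:
    actions: tuple
    t_load: float       # ms
    t_recompute: float  # ms
    t_queue: float      # ms
    hbm_gb: float       # extra HBM the plan occupies


def plan_cost(p: RestorePlan, lam: float) -> float:
    """T_load + T_recompute + T_queue + lambda * M_HBM."""
    return p.t_load + p.t_recompute + p.t_queue + lam * p.hbm_gb


def choose_plan(plans: list, lam: float) -> RestorePlan:
    return min(plans, key=lambda p: plan_cost(p, lam))


load_everything = RestorePlan(
    ("LOAD_COMPRESSED_KV", "LOAD_SWA_CHECKPOINT"),
    t_load=30, t_recompute=0, t_queue=5, hbm_gb=4.0)
recompute_state = RestorePlan(
    ("LOAD_COMPRESSED_KV", "RECOMPUTE_TAIL", "RECOMPUTE_SWA"),
    t_load=20, t_recompute=25, t_queue=5, hbm_gb=2.0)

plans = [load_everything, recompute_state]
print(choose_plan(plans, lam=0.0).actions)   # memory is free: loading wins
print(choose_plan(plans, lam=10.0).actions)  # memory is scarce: recompute wins
```

With these made-up numbers the choice flips as $\lambda$ grows, which illustrates how memory pressure enters the decision rather than being handled by a separate eviction pass.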

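2.6.7 On-disk SWA Policy

This directly borrows and systematizes the three strategies from DeepSeek-V4:

Strategy	Storage cost	Recompute cost	When to use
Full SWA Caching	Highest	Lowest	Extremely hot prefixes, very strict TTFT requirements
Periodic Checkpointing	Medium	Medium	Default strategy; parameter `p` is tunable
Zero SWA Caching	Lowest	Highest	Heavy SSD write pressure, infrequent SWA restore

The DeepSeek-V4 report notes that under Zero SWA Caching, for an $L$-layer model, recomputing the last $n_{\mathrm{win}}\cdot L$ tokens on top of the cached CSA/HCA KV is enough to restore the last $n_{\mathrm{win}}$ SWA KV entries.

This can be turned into a tunable policy:

$$
p^*
=
\arg\min_p
\left(
\alpha \cdot \text{SSDWrite}(p)
+ \beta \cdot \text{RestoreLatency}(p)
+ \gamma \cdot \text{GPURecompute}(p)
\right)
$$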
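To make the trade-off concrete, here is a toy sweep over candidate values of $p$. It rests on several assumptions that do not come from the report: $p$ is read as "write one SWA checkpoint every $p$ tokens", the amortized write cost therefore scales as $1/p$, and a resume lands on average $p/2$ tokens past the latest checkpoint, so both restore latency and GPU recompute scale with $p/2$. The function names and all constants are hypothetical.

```python
def checkpoint_cost(p: int, alpha: float, beta: float, gamma: float,
                    write_per_ckpt: float = 1.0,
                    latency_per_token: float = 1.0,
                    compute_per_token: float = 1.0) -> float:
    """Toy objective: alpha*SSDWrite(p) + beta*RestoreLatency(p) + gamma*GPURecompute(p)."""
    ssd_write = write_per_ckpt / p           # amortized write cost per token
    expected_recompute_tokens = p / 2        # assumed average distance to a checkpoint
    restore_latency = latency_per_token * expected_recompute_tokens
    gpu_recompute = compute_per_token * expected_recompute_tokens
    return alpha * ssd_write + beta * restore_latency + gamma * gpu_recompute


def best_period(candidates: list, alpha: float, beta: float, gamma: float) -> int:
    return min(candidates, key=lambda p: checkpoint_cost(p, alpha, beta, gamma))


candidates = [16, 32, 64, 128, 256]
print(best_period(candidates, alpha=4096, beta=1, gamma=1))
```

With these weights the objective reduces to $4096/p + p$, which is minimized at $p = 64$ among the candidates. The point of the sketch is the shape of the curve (write cost falling, restore cost rising), not the specific constants, which would have to be measured.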

2.7 Actionable Implementation Plan

Phase A0: trace + simulator

Goal: show that the policy pays off before touching any complex kernel.

Task	Details
Trace replay	Use chat/thinking/coder traces, supplemented with agentic session traces
KV state simulator	Simulate the size and lifecycle of CSA/HCA/SWA/tail state
Cost model	GPU/CPU/SSD load, tail recompute, SWA restore
Policy comparison	LRU, full caching, periodic checkpoint, zero caching, cost-based

Phase A1: pluggable serving-engine prototype

Start by building an external KV artifact manager around SGLang or vLLM, without modifying the core kernels directly.

Item	Description
prefix hash manager	Records the session prefix chain
artifact metadata DB	SQLite, RocksDB, or Redis would all work
GPU/CPU/SSD cache tiers	Simulated first, persisted to real disk later
restore planner	Emits a load/recompute plan
integration shim	Plugs into the prefix cache hook

Phase A2: mock hybrid attention / DeepSeek-V4-like backend

If a DeepSeek-V4 inference implementation is available, integrate it step by step; otherwise start with a mock:

Mock layer	Purpose
dense KV → synthetic compressed KV	Simulates different compression ratios
SWA state	Maintains a real recent window
tail state	Manages incomplete blocks according to `m,m'`
sparse index artifact	Initially accounts only for size and load cost

This way the system value of the cache manager can be evaluated even without a complete V4 kernel.

Phase A3: kernel-aware layout

The real systems contribution comes at this later stage:

Task	Description
lcm block layout	Each block covers a multiple of `\mathrm{lcm}(m,m')`
alignment padding	Cache-line alignment for the sparse attention kernel
batch gather API	Efficient gather for the selected sparse KV indices
prefetch stream	Asynchronous SSD/CPU-to-GPU fetch of compressed KV

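2.8 Evaluation Design

Baselines

Baseline	Description
PagedAttention-style dense KV	Traditional vLLM/SGLang
GPU-only prefix cache	No disk tier; only GPU KV is reused
Full SWA caching	The low-recompute, high-storage strategy mentioned by DeepSeek-V4
Zero SWA caching	The low-storage, high-recompute strategy
Periodic checkpointing	Parameter `p` held fixed
HeteroCache adaptive	Your cost-based policy

Metrics

Metric	Meaning
TTFT	Restore time plus time to first token after a prefix hit
E2E latency	Time to complete a full agentic session
HBM per active session	GPU memory occupied by each active session
max concurrent long-context sessions	Number of sessions that fit in the same GPU memory
prefill tokens saved	Prefill reduction due to reuse
SSD write amplification	Write amplification of on-disk artifacts
restore latency breakdown	Split into load / recompute / queue
policy overhead	Cost of metadata lookup and planning

2.9 Expected Benefits

Benefit	Expected direction
Higher context concurrency	More `100K\sim1M` sessions in the same HBM
Lower repeated-prefix TTFT	Compressed artifacts are reused across repo/doc/agent sessions
Lower SSD write amplification	Avoids naive full SWA caching
More stable behavior under memory pressure	The state cache and the classical KV cache are managed separately
Broader model coverage	Supports dense, CSA/HCA, SWA, and MLA-like mixed KV
A clear paper contribution	Shows that the PagedAttention abstraction is insufficient and proposes a heterogeneous state abstraction

2.10 Risks and Mitigations

Risk	Mitigation
No complete DeepSeek-V4 kernel is available	Start with mock hybrid attention + a trace simulator
The real benefit of on-disk KV I/O is unstable	Build the cost model first, then run SSD microbenchmarks
The system implementation is too heavy	Start with an external artifact manager and avoid deep engine changes at the outset
Quality impact is hard to evaluate	Project A focuses on state management and does not deliberately change attention results
Reviewers see it as mere engineering	Emphasize the new abstraction: heterogeneous KV state, not a minor cache-policy tweak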

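3. Project B Proposal: OwnerServe

3.1 Project Name

OwnerServe: Single-Owner KV Artifact Scheduling for Agentic PD-Hybrid LLM Serving

3.2 Background

Your earlier judgment was right: in coding-agent and long-horizon agent scenarios, users usually care less about per-step TPOT and more about:

$$
T_{\mathrm{E2E}}
=
T_{\mathrm{LLM}}
+ T_{\mathrm{tool}}
+ T_{\mathrm{sandbox}}
+ T_{\mathrm{I/O}}
+ T_{\mathrm{retry}}
$$

Traditional P/D separation aims to keep prefill from interfering with decode. Agentic workloads, however, have several new characteristics:

Characteristic	Impact on P/D separation
Long multi-turn sessions	KV state has a very long lifetime
Few new tokens per turn relative to the prefix	Incremental prefill is often not a large prefill
Long gaps between tool calls	Decode is not continuously at full load
Very high prefix reuse	Repos, history, and tool traces can be reused extensively
Context can reach `100K+` or longer	The cost of KV duplication becomes significant

DeepSeek-V4 changes the PD problem further. If the model uses compressed attention, what moves between P and D is no longer just a complete dense KV, but:

$$
\text{Transfer State}
=
\text{Compressed CSA/HCA KV}
+ \text{Indexer KV}
+ \text{SWA State}
+ \text{Tail State}
$$

The compressed KV may be relatively small, but SWA and tail state are updated frequently and their restore policy is complex.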

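3.3 Past Assumptions

Past assumption	Explanation
P/D separation always helps decode latency	Prefill and decode have different compute profiles, and isolating them avoids interference
P produces KV, D consumes KV	The request lifecycle is simple: one prefill followed by one stretch of decode
KV duplication is a necessary cost	Both P and D may hold KV for the same prefix
Routing is mainly about load	Picking an idle P and an idle D is enough
KV transfer is a one-time cost	KV is transferred after prefill, and what follows is mostly decode
Sessions do not need strong ownership	Requests are weakly related, so cache locality is not the dominant factor

3.4 Assumptions That May No Longer Hold

Broken assumption	Reason
Static P/D separation is always best	Agentic decode is interleaved with tool gaps, so prefill does not necessarily have much opportunity to interfere with decode
Full KV transfer is acceptable	Long-context prefixes are large, and repeated transfer/duplication consumes HBM and bandwidth
KV is request-local	In multi-turn agentic sessions, KV is long-lived session state
Routing only looks at queue length	Prefix artifact locality may matter more than queue length
Prefill is one large one-shot task	Coding agents often issue many small incremental prefills
Reuse happens only inside the GPU	DeepSeek-V4 already brings on-disk shared-prefix reuse into the inference framework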

3.5 New Assumptions to Establish

Hypothesis B1: Long-context agent serving should be scheduled around KV ownership, not around requests

Traditional:

$$
\text{schedule}(\text{request})
$$

New paradigm:

$$
\text{schedule}(\text{session}, \text{KV\_owner}, \text{artifact\_location}, \text{restore\_cost})
$$

In other words, who owns the prefix state matters more than which worker is currently the least loaded.

Hypothesis B2: P/D separation should be a dynamic decision, not a static architecture

Action set:

$$
a \in
\{
\text{colocated-prefill},
\text{remote-prefill},
\text{decode-continuation},
\text{artifact-fetch},
\text{tail-recompute},
\text{state-migration}
\}
$$

Scheduling objective:

$$
a^*
=
\arg\min_a
\left(
T_{\text{queue}}
+ T_{\text{prefill}}
+ T_{\text{transfer}}
+ T_{\text{restore}}
+ \lambda M_{\text{duplicate}}
+ \mu I_{\text{decode}}
\right)
$$

where $I_{\text{decode}}$ is the interference that prefill causes to requests currently in decode.

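Hypothesis B3: A single-owner KV artifact can reduce HBM duplication while preserving prefix reuse

Core idea:

For a long-lived prefix, the system maintains exactly one authoritative owner; other nodes hold only a transient read cache, or restore the state through recompute/fetch.

Formally:

$$
\forall p,\quad
|\text{Owner}(p)| = 1
$$

while still allowing:

$$
|\text{ReplicaCache}(p)| \ge 0
$$

That is, the owner is unique and the number of cache replicas is controllable.

Hypothesis B4: Tail/SWA state is not always worth transferring; recomputing it is often cheaper

Under compressed attention in particular, the bulk compressed KV can be reused as an artifact, whereas tail state and SWA state may be better handled as follows:

State	Recommended strategy
compressed CSA/HCA KV	Store / transfer / reuse
indexer KV	Manage together with the compressed KV
incomplete tail	Recompute in most cases
SWA state	Keep for hot sessions; checkpoint or recompute for cold sessions
dense decode continuation KV	Keep on the D owner whenever possible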
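The transfer-versus-recompute comparison behind Hypothesis B4 can be sketched with back-of-the-envelope arithmetic. The model below is deliberately crude and entirely assumed: transfer time is bytes divided by link bandwidth, recompute time is tokens divided by prefill throughput, and fixed latencies are ignored. The function name and every number are hypothetical.

```python
def cheaper_to_recompute(state_bytes: float, link_bytes_per_s: float,
                         tokens: int, prefill_tokens_per_s: float) -> bool:
    """True when recomputing the state is faster than shipping it."""
    transfer_s = state_bytes / link_bytes_per_s
    recompute_s = tokens / prefill_tokens_per_s
    return recompute_s < transfer_s


# Hypothetical case 1: a large state (512 MB) that 4096 tokens of prefill can rebuild.
print(cheaper_to_recompute(512e6, 1e9, 4096, 20_000))

# Hypothetical case 2: a tiny state (2 MB) covering 100 tokens.
print(cheaper_to_recompute(2e6, 1e9, 100, 20_000))
```

In the first case recompute takes about 0.2 s against about 0.5 s of transfer, so recompute wins; in the second, the state is small enough that shipping it is cheaper. Which regime a real deployment sits in depends on measured state sizes and link speeds, which is why the table above gives defaults rather than fixed rules.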

3.6 Solution Design

3.6.1 System Overview

Global Router
    ↓
Ownership Directory
    ↓
Cost-Based PD-Hybrid Scheduler
    ↓
P Workers / D Workers / Hybrid Workers
    ↓
KV Artifact Store + HeteroCache

3.6.2 Core Concepts

KV Artifact

An artifact is immutable prefix state:

$$
A_p =
(\text{prefix\_hash}, \text{block\_range}, \text{model\_version}, \text{kv\_type}, \text{location})
$$

KV Owner

The owner is the authoritative holder of an artifact:

$$
\text{Owner}(A_p)=w_i
$$

Session State

Session state is the mutable continuation:

$$
S =
(A_{\text{stable prefix}},\; \text{SWA state},\; \text{tail state},\; \text{decode state})
$$

The following need to be distinguished:

State	Property
stable prefix artifact	immutable, shareable
tail state	mutable, short-lived
SWA state	mutable, but can be checkpointed
decode state	strongly locality-sensitive

3.6.3 Ownership Directory

It maintains:

Key	Value
`prefix_hash`	owner worker
`artifact_id`	location list
`session_id`	current D owner
`repo_id / doc_id`	hot prefix group
`lease_expiry`	owner lease
`ref_count`	Number of forked sessions
`last_access`	eviction and migration

It needs to support:

lookup(prefix_hash)
claim_owner(prefix_hash, worker)
renew_lease(prefix_hash, worker)
release(prefix_hash)
replicate_readonly(prefix_hash, target_worker)
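A minimal in-memory sketch of this interface follows. It is single-threaded and takes the current time as an explicit `now` argument so that lease expiry is easy to reason about; that extra argument, the `Entry` record, and the default lease length are assumptions added for illustration. A production directory would need atomic claim semantics in whatever store backs it.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Entry:
    owner: str
    lease_expiry: float
    replicas: set = field(default_factory=set)  # read-only cache holders


class OwnershipDirectory:
    def __init__(self, lease_seconds: float = 30.0) -> None:
        self.lease_seconds = lease_seconds
        self._entries = {}

    def lookup(self, prefix_hash: str, now: float) -> Optional[str]:
        """Return the current owner, or None if unowned or the lease expired."""
        e = self._entries.get(prefix_hash)
        if e is None or e.lease_expiry <= now:
            return None
        return e.owner

    def claim_owner(self, prefix_hash: str, worker: str, now: float) -> bool:
        current = self.lookup(prefix_hash, now)
        if current is None:
            self._entries[prefix_hash] = Entry(worker, now + self.lease_seconds)
            return True
        return current == worker  # at most one authoritative owner

    def renew_lease(self, prefix_hash: str, worker: str, now: float) -> bool:
        if self.lookup(prefix_hash, now) != worker:
            return False
        self._entries[prefix_hash].lease_expiry = now + self.lease_seconds
        return True

    def release(self, prefix_hash: str) -> None:
        self._entries.pop(prefix_hash, None)

    def replicate_readonly(self, prefix_hash: str, target_worker: str, now: float) -> bool:
        if self.lookup(prefix_hash, now) is None:
            return False
        self._entries[prefix_hash].replicas.add(target_worker)
        return True


d = OwnershipDirectory(lease_seconds=10.0)
print(d.claim_owner("p", "w0", now=0.0))   # first claim succeeds
print(d.claim_owner("p", "w1", now=5.0))   # second worker is refused
print(d.lookup("p", now=12.0))             # lease has expired
```

Replicas are tracked separately from the owner, so a read-only copy on another worker never changes who is authoritative.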

3.6.4 PD-Hybrid Scheduler

For each agent step, the scheduler first classifies the request:

Request type	Example	Default strategy
New long prefix	First load of a repo/doc	remote P or dedicated prefill
Small incremental turn	tool result + short instruction	colocated prefill on D
Long tool output	test log, large file diff	remote P or hybrid
Decode continuation	normal generation	stay on D owner
Forked rollout	`n` samples / RL rollout	immutable prefix artifact + many D readers
Cold resume	An idle session comes back	load compressed artifact + recompute tail/SWA

3.6.5 Decision Model

For a given session step, the cost of a candidate action $a$ is:

$$
C(a)
=
T_{\text{queue}}(a)
+ T_{\text{compute}}(a)
+ T_{\text{network}}(a)
+ T_{\text{restore}}(a)
+ \lambda \cdot M_{\text{HBM}}(a)
+ \mu \cdot I_{\text{decode}}(a)
$$

Choose:

$$
a^* = \arg\min_a C(a)
$$

where:

Term	Description
`T_{\text{queue}}`	Current queueing time on the worker
`T_{\text{compute}}`	Prefill/decode compute time
`T_{\text{network}}`	KV artifact transfer time
`T_{\text{restore}}`	Tail/SWA recompute or checkpoint load
`M_{\text{HBM}}`	Additional HBM footprint
`I_{\text{decode}}`	Interference of prefill with colocated decode
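The decision model can be illustrated with two of the request types from the scheduler table. The sketch below evaluates $C(a)$ for a colocated and a remote prefill option; the `ActionCost` record, the weights $\lambda = \mu = 1$, and every cost number are hypothetical and serve only to show how the same formula yields different choices for small and large increments.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ActionCost:
    name: str
    t_queue: float = 0.0
    t_compute: float = 0.0
    t_network: float = 0.0
    t_restore: float = 0.0
    m_hbm: float = 0.0
    i_decode: float = 0.0

    def total(self, lam: float, mu: float) -> float:
        return (self.t_queue + self.t_compute + self.t_network
                + self.t_restore + lam * self.m_hbm + mu * self.i_decode)


def pick_action(candidates: list, lam: float = 1.0, mu: float = 1.0) -> ActionCost:
    return min(candidates, key=lambda a: a.total(lam, mu))


# Small incremental turn: short prefill, little interference (made-up numbers).
small_turn = [
    ActionCost("colocated-prefill", t_compute=40, i_decode=10),
    ActionCost("remote-prefill", t_queue=5, t_compute=30, t_network=60, m_hbm=8),
]

# Long tool output: heavy prefill that would disturb colocated decode.
long_output = [
    ActionCost("colocated-prefill", t_compute=900, i_decode=400),
    ActionCost("remote-prefill", t_queue=20, t_compute=600, t_network=80, m_hbm=8),
]

print(pick_action(small_turn).name)
print(pick_action(long_output).name)
```

The classification table gives sensible defaults, while a cost comparison of this kind is what would override them when queues, link load, or memory pressure shift.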

3.6.6 Single-Owner Mechanism

Rule 1: stable prefix artifacts are immutable

Once the prefix reaches a compression block boundary, seal it into an artifact:

$$
A_{0:k} = \mathrm{seal}(S_{0:k})
$$

After sealing, the artifact cannot be modified; new artifacts can only be appended:

$$
A_{0:k} \rightarrow A_{0:k} + A_{k:k+\Delta}
$$
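A small sketch of where the seal boundary falls, assuming the boundary is a multiple of $\mathrm{lcm}(m, m')$ so that both compression branches are complete at the cut. That choice of boundary, the function name `next_seal`, and the example values `m = 4`, `m' = 128` are assumptions for illustration.

```python
from math import lcm  # available in Python 3.9+


def next_seal(sealed_upto: int, n_tokens: int, m: int, m_prime: int):
    """Return (new_artifact_range_or_None, tail_tokens).

    sealed_upto: end of the already sealed prefix A_{0:k}
    n_tokens:    current total number of tokens in the session
    """
    block = lcm(m, m_prime)
    boundary = (n_tokens // block) * block  # last full-block boundary
    new_artifact = (sealed_upto, boundary) if boundary > sealed_upto else None
    tail_tokens = n_tokens - max(boundary, sealed_upto)
    return new_artifact, tail_tokens


print(next_seal(0, 1000, 4, 128))     # first seal covers tokens [0, 896)
print(next_seal(896, 1000, 4, 128))   # nothing new to seal yet
print(next_seal(896, 1100, 4, 128))   # append-only: new artifact [896, 1024)
```

Each call either appends a new range that starts exactly where the sealed prefix ends or does nothing, so previously sealed artifacts are never rewritten.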

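Rule 2: the mutable tail belongs only to the session owner

Tail state is not shared globally:

$$
\text{Owner}(\text{tail}_s)=\text{D-owner}(s)
$$

Unless the tail is large, it is recomputed on migration rather than transferred.

Rule 3: forked sessions use copy-on-write

For multi-branch agent/RL workloads:

$$
S_i = A_{\text{shared prefix}} + \Delta_i
$$

All branches share the prefix artifact, and each maintains its own tail/SWA/decode state.

Rule 4: owner lease + failure recovery

The owner holds a lease:

$$
\text{lease}(A_p,w_i,t_{\text{expire}})
$$

After a worker crash, the directory reassigns the owner. If a copy of the artifact exists on SSD/CPU, it is restored from there; otherwise it is recomputed from the token log.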

3.7 Actionable Implementation Plan

Phase B0: trace-driven simulator

Do not start with deep changes to the serving engine. First use traces and a cost model to check whether single-owner has potential benefit.

Input	Contents
agent trace	session id, turn id, input length, output length, tool latency
prefix relation	How many tokens each turn shares with the previous turn
worker config	Number of P/D/hybrid workers, number of GPUs, network bandwidth
KV model	dense KV or hybrid KV
scheduling policy	static PD, colocated, OwnerServe

Output:

Metric	Meaning
duplicate HBM	How many copies of the same prefix exist
network traffic	Total P→D KV transfer volume
E2E latency	Completion time of each agent task
decode interference	Impact of colocated prefill on decode
owner hit rate	Fraction of requests routed to the KV owner

Phase B1: implement an xPyD prototype on a single `8\times` H20 node

This fits your existing direction very well.

Experimental setup:

Item	Recommendation
serving backend	SGLang xPyD or a custom proxy
hardware	Single node with `8\times` H20
constraint	`x+y\le 8`
P→D link	Emulate RDMA loopback even when local
workload	SWE-bench generated + internal agent trace
model	Start with Qwen3-Coder-30B-A3B or a similar model that can be run

Implementation modules:

Module	MVP
Global router	Python/Rust proxy
Ownership directory	Redis / in-memory map
Prefix hash	token-level rolling hash
KV ownership	Simulated as logical ownership at first
Transfer planner	Records actual or estimated KV bytes
Colocated fallback	Small incremental turns are prefilled directly on D
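For the "token-level rolling hash" entry, one simple option is a polynomial rolling hash over token ids, which yields a hash for every prefix length in a single pass. The base, the modulus, and the function names below are arbitrary choices made for this sketch; hash equality is probabilistic, so a real router would still verify tokens (or use a cryptographic hash) before trusting a match.

```python
MOD = (1 << 61) - 1   # large Mersenne prime (arbitrary choice)
BASE = 1_000_003      # arbitrary multiplier


def prefix_hashes(token_ids: list) -> list:
    """hashes[i] identifies the prefix token_ids[:i + 1]."""
    h = 0
    out = []
    for t in token_ids:
        h = (h * BASE + t + 1) % MOD  # +1 so token id 0 still changes the hash
        out.append(h)
    return out


def shared_prefix_len(a: list, b: list) -> int:
    """Length of the longest common prefix, judged by rolling hashes."""
    n = 0
    for x, y in zip(prefix_hashes(a), prefix_hashes(b)):
        if x != y:
            break
        n += 1
    return n


turn_1 = [11, 42, 7, 99]
turn_2 = [11, 42, 8, 99, 5]
print(shared_prefix_len(turn_1, turn_2))
```

Because each hash extends the previous one, appending a turn to a session only costs work proportional to the new tokens, which suits the many small incremental turns of agent workloads.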

Phase B2: integrate HeteroCache

Once Project A has a prototype, B can be upgraded from dense KV ownership to heterogeneous artifact ownership:

Artifact Type	Owner strategy
dense KV	D owner preferred
compressed CSA/HCA	artifact owner, shareable across D workers
SWA checkpoint	session owner or SSD
tail state	session-local, recompute preferred
indexer KV	Follows the compressed artifact

Phase B3: real agentic evaluation

Workloads:

Workload	Value
SWE-bench generated	Reproducible and can be made public
repo-level coding agent trace	Long prefix + multi-turn tool use
synthetic forked rollout	Tests multi-branch sharing of a prefix
long-doc QA multi-turn	Tests on-disk prefix reuse
internal Ali traces	Industrial realism

3.8 Baselines

Baseline	Description
Static PD	Fixed P nodes and D nodes; P transfers KV after finishing prefill
Colocated serving	No P/D separation; everything runs on one worker
Round-robin PD	Ignores KV locality
Prefix-aware but multi-owner	Prefix cache hits, but multiple copies are allowed
OwnerServe	single-owner + cost-based PD hybrid
Oracle	Optimal scheduling with knowledge of the future request sequence, used as an upper bound

3.9 Metrics

Metric	Why it matters
E2E task time	Primary metric for agentic workloads
p95/p99 step latency	What the user perceives each turn
HBM duplicate factor	Measures KV waste
network KV traffic	Cost of P→D transfer and remote fetch
prefix owner hit rate	Effectiveness of ownership routing
decode interference	Whether hybrid colocation hurts decode
tool idle utilization	Whether the GPU is better used during tool gaps
successful trajectory throughput	Number of successful agent tasks completed per GPU

3.10 Expected Benefits

Benefit	Expected direction
Less KV duplication	Long prefixes are no longer kept in multiple long-lived copies across P/D
Less P→D traffic	Transfer delta artifacts instead of the full KV each time
Lower E2E time	Small incremental turns are colocated, avoiding remote prefill + transfer
More concurrent sessions	Less HBM is taken up by duplicated prefixes
Better fit for forked rollout	Branches share an immutable prefix artifact
A better account of the agentic serving trade-off	Shifts from TTFT/TPOT to E2E + ownership + artifact locality

Conservatively, the hard win this project is most likely to deliver is not "a large drop in single-step latency", but:

$$
\text{same GPUs} \Rightarrow \text{more long-context sessions}
$$

and:

$$
\text{same E2E target} \Rightarrow \text{less KV transfer and duplication}
$$

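3.11 Risks and Mitigations

Risk	Mitigation
Single-owner may create hotspots	Support read-only replica caches while keeping the authoritative owner unique
Colocated prefill may interfere with decode	The cost model includes `I_{\text{decode}}` explicitly
KV ownership is hard to integrate into existing engines	Start with logical ownership + a simulator, then gradually connect to real KV hooks
The benefit on dense models is less visible than on hybrid models	First demonstrate agentic prefix reuse and PD duplication; later integrate HeteroCache to amplify the benefit
Reviewers see it as a router heuristic	Emphasize the ownership abstraction, artifact lifecycle, cost model, and trace-driven evidence

4. Combined Architecture of the Two Projects

The final system could look like this: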

Agent Request / Tool Result / Session Resume
                 ↓
          Session Router
                 ↓
       Ownership Directory
                 ↓
      PD-Hybrid Cost Planner
                 ↓
 ┌───────────────┴────────────────┐
 │                                │
P/Hybrid Worker              D/Hybrid Worker
 │                                │
 └───────────────┬────────────────┘
                 ↓
          HeteroCache Manager
                 ↓
   GPU HBM / CPU DRAM / SSD Artifact Store

Project A provides:

$$
\text{what state exists and how to restore it}
$$

Project B decides:

$$
\text{where the state should live and where the request should run}
$$

5. Suggested Paper Framing

5.1 Title Direction for the Project A Paper

HeteroCache: Managing Heterogeneous KV States for Hybrid-Attention Long-Context LLM Serving

Core contributions

Contribution	Description
Observation	hybrid attention breaks homogeneous KV cache assumption
Abstraction	KV state type registry + restore-cost based prefix reuse
System	GPU/CPU/SSD/recompute-aware heterogeneous KV manager
Evaluation	long-context traces + shared-prefix workloads
Result	higher session concurrency, lower restore latency, lower redundant prefill

5.2 Title Direction for the Project B Paper

OwnerServe: Single-Owner KV Artifact Scheduling for Agentic PD-Hybrid LLM Serving

Core contributions

Contribution	Description
Observation	static PD separation duplicates long-lived agentic KV state
Abstraction	KV artifact ownership and session-state lifecycle
System	ownership-aware router + PD-hybrid scheduler
Policy	recompute/fetch/transfer/colocate cost model
Evaluation	coding-agent traces, forked rollout, long-prefix sessions

6. My Recommendation: Do A First, Then Use B as the Killer Application

6.1 Why A Is More Fundamental

A addresses a problem that is becoming common:

long-context model architectures are breaking the existing KV cache abstraction.

As long as future models increasingly adopt CSA/HCA/SWA/MLA/sliding/dilated/sparse/hybrid attention, A's problem will persist.

Strengths of A:

Dimension	Assessment
Independence	Does not depend strongly on PD
Clear systems contribution	A new cache abstraction
Can be built incrementally	simulator → external manager → engine integration
Fits SOSP/OSDI framing	challenge old abstraction, propose new system primitive

6.2 Why B Is the Stronger Application Scenario

B is closer to what you have consistently cared about: agentic workload + PD hybrid + KV ownership.

B's difficulties, however, are:

Difficulty	Explanation
Needs real agent traces	Otherwise the gains are easily dismissed as synthetic
Needs a PD prototype	Higher implementation complexity
Needs to show decode is not harmed	Prefill/decode interference must be measured carefully
Needs to be combined with A	Otherwise ownership is done only on dense KV and the story is not novel enough

So the route I suggest is:

Stage 1: HeteroCache simulator + trace evidence
Stage 2: HeteroCache prototype
Stage 3: OwnerServe simulator
Stage 4: xPyD OwnerServe prototype
Stage 5: combine A+B into agentic long-context serving paper

7. Final Project Summaries

7.1 Project A Summary

HeteroCache argues that the KV cache of long-context hybrid-attention models has changed from homogeneous dense blocks into heterogeneous KV state. It proposes a model-aware KV state abstraction that brings CSA/HCA compressed KV, indexer KV, SWA KV, and tail state under joint GPU/CPU/SSD/recompute management, and it replaces hit/miss with restore cost as the core decision metric for prefix reuse.

7.2 Project B Summary

OwnerServe argues that the core bottleneck of agentic long-context serving is not the prefill/decode of an individual request but the ownership, duplication, migration, and reuse of long-lived session KV state. It proposes a single-owner KV artifact abstraction and uses a cost-based PD-hybrid scheduler to choose dynamically among colocated prefill, remote prefill, artifact fetch, tail recompute, and state migration, reducing HBM duplication and P→D traffic while optimizing E2E agent task time.

7.3 The Combined Thesis

The strongest overall thesis is:

For long-context agentic LLM serving, the primary scheduling object should shift from requests to KV artifacts.

That is:

$$
\text{Request-centric serving}
\quad\Longrightarrow\quad
\text{KV-artifact-centric serving}
$$

This framing fits your current research line well: it naturally connects DeepSeek-V4's million-token hybrid attention, the PD hybrid and KV cache ownership you care about, agentic E2E latency, and trace-driven reproducible evaluation.
