
## 0. One-line positioning

I suggest splitting these two directions into a **foundation project A** and a **policy project B**:

| Project | Name | Core problem | Paper/system contribution |
| ----- | ----- | ----- | ----- |
| **A** | **HeteroCache: Hybrid-Attention-Aware KV State Manager** | Under hybrid attention, the KV cache is no longer a set of homogeneous token blocks, so the PagedAttention abstraction is insufficient | A new KV cache abstraction plus joint GPU/CPU/SSD/recompute management |
| **B** | **OwnerServe: Single-Owner KV Artifact Scheduling for Agentic PD-Hybrid Serving** | For agentic long-context workloads, P/D separation introduces KV duplication and state-migration cost | KV ownership + PD hybrid routing + recompute/transfer planner |

The relationship between the two is:

$$
\text{B depends on A}
$$

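A answers **"how should KV state be represented and managed"**; B answers **"who should own this state, when should it migrate, when should it be recomputed, and where should each request be routed"**.

The DeepSeek-V4 report provides strong motivation. It states explicitly that hybrid attention produces multiple kinds of KV entries whose sizes, update rules, and cache policies differ, and that this violates the basic assumptions of PagedAttention and its variants. The report also divides the KV cache into a classical KV cache and a state cache, and for on-disk shared-prefix reuse it designs compressed KV storage together with full, periodic, and zero SWA caching strategies.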

---

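# 1. Shared background: why this is a new problem now

## 1.1 The core abstraction of traditional LLM serving

The core abstraction of serving engines such as vLLM and SGLang has roughly been: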

$$
\text{Request} \rightarrow \text{Dense KV Blocks} \rightarrow \text{Decode Scheduler}
$$

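The KV cache is usually viewed as:

| Dimension | Traditional assumption |
| -------- | -------- |
| **Structure** | Every layer and every token produces homogeneous KV |
| **Lifecycle** | Created during prefill, appended continuously during decode |
| **Location** | Mostly resident in GPU HBM |
| **Reuse** | Existing KV is reused after an exact prefix hit |
| **Scheduling goal** | Lower TTFT and TPOT, higher goodput |
| **Cache management** | block allocator + eviction policy |

This abstraction still works reasonably well in the dense attention / GQA / MLA era.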

---

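## 1.2 What DeepSeek-V4 changes

The core change in DeepSeek-V4 is this: **to support million-token context, it replaces dense token-level attention KV with heterogeneous state that mixes several compressed, sparse, local-window, and tail states**.

The report says DeepSeek-V4 uses hybrid attention built from CSA and HCA. CSA compresses KV along the sequence dimension and then applies sparse selection; HCA compresses KV more aggressively but keeps dense attention. At $1M$ context, V4-Pro's single-token inference FLOPs are about $27\%$ of V3.2's, and its KV cache is about $10\%$ of V3.2's.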

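The object a new serving engine manages is therefore no longer a single kind of KV block, but:

$$
\text{KV State}
=
\text{CSA Main KV}
+
\text{CSA Indexer KV}
+
\text{HCA KV}
+
\text{SWA KV}
+
\text{Uncompressed Tail State}
$$

The report states explicitly that hybrid attention introduces multiple kinds of KV entries with different KV cache sizes and update rules; that SWA has its own hit/eviction policy; and that, in the compression branches, tokens and hidden states that have not yet filled a compression block must be held temporarily as sequence state.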

---

# 2. Project A Proposal: HeteroCache

## 2.1 Project name

**HeteroCache: A Heterogeneous KV State Manager for Hybrid-Attention Long-Context LLM Serving**

---

## 2.2 Background

The most important infra signal in the DeepSeek-V4 report is:

> hybrid attention turns the KV cache from homogeneous dense blocks into heterogeneous KV states.

Figure 6 of the report splits the KV cache into two categories:

| Component | Contents |
| ---------------------- | ---------------------- |
| **Classical KV cache** | CSA/HCA compressed KV blocks |
| **State cache** | SWA KV + CSA/HCA tail states not yet compressed |

Each request is allocated a fixed-size state cache block. In the classical KV cache, each cache block covers $\mathrm{lcm}(m,m')$ original tokens and produces CSA compressed tokens and HCA compressed tokens respectively.
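As a worked example of the block-size rule, the sketch below computes how many original tokens one classical cache block covers and how many compressed entries each branch would produce. The ratios $m=4$ and $m'=128$ are illustrative values only, and the "one compressed entry per $m$ (or $m'$) original tokens" relationship is an assumption of this sketch rather than something the report is quoted as stating.

```python
import math


def block_geometry(m: int, m_prime: int) -> dict:
    """Token span of one classical cache block and the compressed entries it yields."""
    span = math.lcm(m, m_prime)  # original tokens covered by one block
    return {
        "tokens_per_block": span,
        "csa_entries": span // m,        # assumes one CSA entry per m original tokens
        "hca_entries": span // m_prime,  # assumes one HCA entry per m' original tokens
    }


# Illustrative ratios only (m = 4, m' = 128).
print(block_geometry(4, 128))
```

Because the span is a common multiple of both ratios, a block boundary is always a compression boundary for both branches, which is what lets a block be sealed and reused as a unit.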

This shows that a core assumption of traditional PagedAttention has been broken:

$$
\text{one layer} \approx \text{one homogeneous KV layout}
$$

It now becomes:

$$
\text{one request}
\rightarrow
\left\{
\text{layer-specific cache type},
\text{branch-specific compression ratio},
\text{state-specific eviction policy}
\right\}
$$

---

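## 2.3 Past assumptions

| Past assumption | Why it used to hold |
| ----- | ----- |
| **The KV cache consists of homogeneous blocks** | Under dense attention / GQA, each layer's KV shape is regular and the block size is fixed |
| **A prefix cache hit is a binary event** | If the prefix tokens are identical, the corresponding KV can be reused directly |
| **GPU HBM is the main cache tier** | Contexts were short, so the KV cache was managed mostly inside GPU memory |
| **Eviction policy can be layer-agnostic** | KV importance, size, and update rules are roughly the same across layers |
| **Attention kernels only consume the cache layout** | The kernel does not strongly constrain the cache layout in return |
| **Tail recompute does not matter** | Dense KV is materialized immediately for every token, so there is no unfinished compression block |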

---

## 2.4 Assumptions that may no longer hold

| Broken assumption | Counterexample in DeepSeek-V4 |
| ----- | ----- |
| **KV is homogeneous** | CSA main KV, CSA indexer KV, HCA KV, SWA KV, and tail state all differ |
| **A prefix hit lets the whole prefix be reused directly** | An incomplete compression block still has to be recomputed |
| **SWA and the main KV can be managed uniformly** | The report says SWA KV is uncompressed, present in every layer, and about $8\times$ the size of the CSA/HCA compressed KV, so it needs its own policy |
| **Cache block size is decided by the allocator alone** | Sparse attention kernels have alignment requirements, so cache layout and kernel must be co-designed |
| **On-disk KV is just swap** | V4's on-disk KV is shared-prefix artifact reuse, not passive paging |

---

## 2.5 New hypotheses to establish

### Hypothesis A1: the KV cache should be modeled as heterogeneous state, not as dense tensor blocks

New abstraction:

$$
\text{KVState}_{r}
=
\left\{
s_i
\mid
s_i =
(\text{type}, \text{layer}, \text{range}, \text{precision}, \text{location}, \text{restore_cost})
\right\}
$$

where $\text{type}$ can be:

| Type | Example |
| ---------------------- | ---------------------- |
| `COMPRESSED_CSA_MAIN` | CSA compressed main KV |
| `COMPRESSED_CSA_INDEX` | CSA lightning indexer KV |
| `COMPRESSED_HCA` | HCA compressed KV |
| `WINDOW_SWA` | recent $n_{\mathrm{win}}$ uncompressed KV |
| `TAIL_STATE` | pending hidden states that have not yet filled $m$ or $m'$ |
| `DENSE_KV` | compatibility with traditional models |

---

### Hypothesis A2: prefix reuse should be modeled by restore cost, not by hit/miss

Traditional prefix cache:

$$
\text{hit}(p)\in \{0,1\}
$$

New prefix artifact reuse:

$$
\text{reuse_benefit}(p)
=
C_{\text{prefill}}(p)
-
C_{\text{load}}(p)
-
C_{\text{recompute}}(p)
-
C_{\text{stall}}(p)
$$

where:

| Term | Meaning |
| ------------------------- | ------------------------- |
| $C_{\text{prefill}}(p)$ | Cost of a fresh prefill when nothing is reused |
| $C_{\text{load}}(p)$ | Cost of reading the artifact from GPU/CPU/SSD/remote |
| $C_{\text{recompute}}(p)$ | Recompute cost for incomplete blocks and SWA state restore |
| $C_{\text{stall}}(p)$ | Blocking imposed on the request by I/O or scheduling waits |
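A minimal sketch of this cost view: a prefix is worth reusing only when the prefill it avoids costs more than loading, partial recomputation, and stalls combined. The function names and the millisecond figures are hypothetical and exist only to show that a "hit" can still have negative benefit.

```python
def reuse_benefit(c_prefill: float, c_load: float,
                  c_recompute: float, c_stall: float) -> float:
    """Net saving from reusing a prefix artifact; all costs share one unit (e.g. ms)."""
    return c_prefill - c_load - c_recompute - c_stall


def should_reuse(c_prefill: float, c_load: float,
                 c_recompute: float, c_stall: float) -> bool:
    """Reuse only when the avoided prefill outweighs every restore-side cost."""
    return reuse_benefit(c_prefill, c_load, c_recompute, c_stall) > 0.0


# Hypothetical numbers: a long hot prefix versus a short prefix on a slow tier.
long_prefix = reuse_benefit(800.0, 120.0, 60.0, 20.0)
short_prefix = reuse_benefit(50.0, 90.0, 10.0, 5.0)
print(long_prefix, short_prefix)
```

In the second case the prefix matches exactly, yet recomputing from scratch is cheaper than restoring, which a binary hit/miss model cannot express.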

---

### Hypothesis A3: cache policy must be model-aware, branch-aware, and kernel-aware

That is:

$$
\text{cache policy}
=
f(
\text{attention branch},
\text{compression ratio},
\text{kernel alignment},
\text{reuse pattern},
\text{memory pressure}
)
$$

rather than plain LRU/LFU.

---

## 2.6 Design

## 2.6.1 System overview

HeteroCache can be split into four layers:

```text
Request / Session
      ↓
KV State Abstraction Layer
      ↓
GPU / CPU / SSD Artifact Manager
      ↓
Kernel-Aware Layout Manager
      ↓
Attention Kernels / Serving Engine
```

---

## 2.6.2 Core modules

| Module | Function |
| --------------------------- | --------------------------- |
| **Attention Spec Registry** | Registers each layer's attention type, KV type, compression ratio, precision, and alignment |
| **State Cache Manager** | Manages SWA KV and uncompressed tail states |
| **Compressed KV Allocator** | Manages CSA/HCA compressed blocks |
| **Artifact Store** | Stores prefix compressed KV; may live on GPU/CPU/SSD |
| **Restore Planner** | Chooses the combination of load, recompute, and partial reuse |
| **Kernel Layout Adapter** | Organizes block layout to meet sparse attention kernel requirements |
| **Policy Engine** | Performs eviction / checkpoint / prefetch under memory pressure |

---

## 2.6.3 Attention Spec Registry

Each model needs to declare:

```text
LayerSpec {
  layer_id
  attention_type: CSA | HCA | SWA | Dense
  compression_ratio: m or m'
  kv_entry_size
  precision: BF16 | FP8 | FP4
  block_alignment
  state_size
  restore_fn
}
```

For a DeepSeek-V4-like model, this might be:

| Layer Type     | Cache Type                          | Compression |
| -------------- | ----------------------------------- | ----------- |
| CSA layer      | CSA main + CSA indexer + SWA + tail | $m=4$       |
| HCA layer      | HCA compressed + SWA + tail         | $m'=128$    |
| pure SWA layer | SWA only                            | windowed    |
| dense layer    | dense KV                            | none        |
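One way the registry could look in code is sketched below. It mirrors a subset of the `LayerSpec` fields and the layer-type table; `state_size` and `restore_fn` are left out for brevity. The class `AttentionSpecRegistry`, the `CACHE_COMPONENTS` mapping, and the numeric values for `kv_entry_size` and `block_alignment` are placeholders invented for illustration, not values taken from DeepSeek-V4.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional

# Cache components per attention type, following the layer-type table.
CACHE_COMPONENTS: Dict[str, List[str]] = {
    "CSA": ["CSA_MAIN", "CSA_INDEXER", "SWA", "TAIL"],
    "HCA": ["HCA_COMPRESSED", "SWA", "TAIL"],
    "SWA": ["SWA"],
    "Dense": ["DENSE_KV"],
}


@dataclass(frozen=True)
class LayerSpec:
    layer_id: int
    attention_type: str               # "CSA" | "HCA" | "SWA" | "Dense"
    compression_ratio: Optional[int]  # m or m'; None when uncompressed
    kv_entry_size: int                # bytes per KV entry (placeholder values below)
    precision: str                    # "BF16" | "FP8" | "FP4"
    block_alignment: int              # alignment required by the kernel (placeholder)

    def __post_init__(self) -> None:
        if self.attention_type not in CACHE_COMPONENTS:
            raise ValueError(f"unknown attention type: {self.attention_type}")


class AttentionSpecRegistry:
    """Maps layer ids to their declared spec and derived cache components."""

    def __init__(self) -> None:
        self._specs: Dict[int, LayerSpec] = {}

    def register(self, spec: LayerSpec) -> None:
        self._specs[spec.layer_id] = spec

    def components(self, layer_id: int) -> List[str]:
        return CACHE_COMPONENTS[self._specs[layer_id].attention_type]


registry = AttentionSpecRegistry()
registry.register(LayerSpec(0, "CSA", 4, 576, "FP8", 64))
registry.register(LayerSpec(1, "HCA", 128, 576, "FP8", 64))
print(registry.components(0), registry.components(1))
```

Keeping the spec frozen means downstream managers can cache derived layouts without worrying about a layer's declaration changing at runtime.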

---

## 2.6.4 State Cache Manager

The state cache manages:

$$
\text{StateCache}
=
\text{SWA}_{n_{\mathrm{win}}}
+
\text{Tail}_{<m}
+
\text{Tail}_{<m'}
$$

Core strategies:

| Strategy | Description |
| ----- | ----- |
| **fixed-size per-sequence state block** | Consistent with the DeepSeek-V4 report: each request is allocated a bounded state block |
| **tail-aware append** | While a compression block is incomplete, only the tail state is updated |
| **block-finalize trigger** | Once $m$ or $m'$ tokens accumulate, emit the compressed KV and release the tail |
| **state spill policy** | For long-idle sessions, the SWA checkpoint can be written to CPU/SSD |
| **restore policy** | On a prefix hit, choose load or recompute according to the SWA policy |
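The tail-aware append and block-finalize trigger can be sketched as a small buffer. `TailBuffer` is a hypothetical name, and packing the pending tokens into a tuple is only a stand-in for the real compression step, which this sketch does not model.

```python
from typing import Any, List, Tuple


class TailBuffer:
    """Holds tokens that have not yet filled a compression block of size m."""

    def __init__(self, m: int) -> None:
        if m <= 0:
            raise ValueError("m must be positive")
        self.m = m
        self.pending: List[Any] = []             # tail state
        self.sealed: List[Tuple[Any, ...]] = []  # finalized blocks

    def append(self, token_state: Any) -> bool:
        """Append one token; return True when this append finalized a block."""
        self.pending.append(token_state)
        if len(self.pending) < self.m:
            return False  # tail-aware append: only the tail changed
        # Block-finalize trigger. The tuple stands in for compressed KV.
        self.sealed.append(tuple(self.pending))
        self.pending = []  # release the tail
        return True


buf = TailBuffer(m=4)
finalized = [buf.append(t) for t in range(10)]
print(len(buf.sealed), buf.pending)
```

After ten appends with $m=4$, two blocks are sealed and two tokens remain in the tail; those two tokens are exactly the part that a later prefix hit would have to recompute or restore separately.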

---

## 2.6.5 Artifact Store

Artifact ID:

$$
\text{ArtifactID}
=
H(
\text{model_id},
\text{model_revision},
\text{tokenizer},
\text{attention_spec},
\text{prefix_hash},
\text{block_range},
\text{precision}
)
$$

Artifact metadata:

| Field | Purpose |
| -------------- | -------------- |
| `prefix_hash` | exact prefix identity |
| `block_range` | which original tokens are covered |
| `kv_type` | CSA/HCA/SWA/tail |
| `location` | GPU / CPU / SSD / remote |
| `size_bytes` | cache accounting |
| `restore_cost` | used by the scheduler |
| `last_access` | eviction |
| `reuse_count` | utility estimation |
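A minimal sketch of the artifact ID, assuming $H$ is SHA-256 over a canonical JSON encoding of the fields. Both the hash function and the encoding are choices made for this example; the design above does not prescribe them.

```python
import hashlib
import json
from typing import Tuple


def artifact_id(model_id: str, model_revision: str, tokenizer: str,
                attention_spec: str, prefix_hash: str,
                block_range: Tuple[int, int], precision: str) -> str:
    """Derive a stable identifier from every field that affects KV contents."""
    fields = {
        "model_id": model_id,
        "model_revision": model_revision,
        "tokenizer": tokenizer,
        "attention_spec": attention_spec,
        "prefix_hash": prefix_hash,
        "block_range": list(block_range),
        "precision": precision,
    }
    # Sorted keys and fixed separators give identical bytes for identical fields.
    payload = json.dumps(fields, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


# Hypothetical field values, for illustration only.
fp8_id = artifact_id("demo-model", "r1", "tok-v1", "csa4-hca128", "abc123", (0, 128), "FP8")
bf16_id = artifact_id("demo-model", "r1", "tok-v1", "csa4-hca128", "abc123", (0, 128), "BF16")
print(fp8_id != bf16_id)
```

Including precision and attention spec in the hash means an artifact written under one configuration can never be mistaken for one written under another, at the cost of no cross-configuration sharing.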

---

## 2.6.6 Restore Planner

When a new request hits a prefix, the Restore Planner solves:

$$
a^*
=
\arg\min_{a\in A}
\left(
T_{\text{load}}(a)
+
T_{\text{recompute}}(a)
+
T_{\text{queue}}(a)
+
\lambda M_{\text{HBM}}(a)
\right)
$$

Action set:

| Action | Scenario |
| --------------------------- | --------------------------- |
| `LOAD_COMPRESSED_KV` | A compressed CSA/HCA artifact already exists |
| `RECOMPUTE_TAIL` | The incomplete compression block is not worth storing |
| `LOAD_SWA_CHECKPOINT` | A periodic checkpoint is hit |
| `RECOMPUTE_SWA` | Zero SWA caching, or the checkpoint is too old |
| `PREFETCH_ARTIFACT` | The session is likely to continue, so the artifact should be fetched ahead of time |
| `DROP_LOW_UTILITY_ARTIFACT` | Memory/SSD pressure is high |
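The planner's selection step can be sketched as a plain minimization over candidate plans. The `Plan` type, the plan names, and all numbers are hypothetical; times are in milliseconds, HBM use is in GiB, and $\lambda$ converts GiB into milliseconds of penalty.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass(frozen=True)
class Plan:
    name: str
    t_load_ms: float
    t_recompute_ms: float
    t_queue_ms: float
    hbm_gib: float


def plan_cost(plan: Plan, lam: float) -> float:
    """T_load + T_recompute + T_queue + lambda * M_HBM for one candidate plan."""
    return plan.t_load_ms + plan.t_recompute_ms + plan.t_queue_ms + lam * plan.hbm_gib


def choose_plan(plans: Iterable[Plan], lam: float) -> Plan:
    return min(plans, key=lambda p: plan_cost(p, lam))


plans = [
    Plan("load_everything", 40.0, 0.0, 5.0, 2.0),
    Plan("load_compressed_recompute_swa", 15.0, 30.0, 5.0, 1.0),
    Plan("recompute_everything", 0.0, 300.0, 5.0, 1.0),
]
print(choose_plan(plans, lam=10.0).name, choose_plan(plans, lam=0.0).name)
```

With these numbers, ignoring memory ($\lambda=0$) favors loading everything, while pricing HBM at 10 ms per GiB tips the choice toward loading only the compressed KV and recomputing SWA.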

---

## 2.6.7 On-disk SWA Policy

Directly borrow and systematize the three strategies from DeepSeek-V4:

| Strategy | Storage cost | Recompute cost | When to use |
| -------------------------- | ---: | ---: | ------------------------ |
| **Full SWA Caching** | Highest | Lowest | Extremely hot prefixes, very tight TTFT requirements |
| **Periodic Checkpointing** | Medium | Medium | Default strategy; parameter $p$ is tunable |
| **Zero SWA Caching** | Lowest | Highest | Heavy SSD write pressure, infrequent SWA restores |

The DeepSeek-V4 report notes that under Zero SWA Caching, for an $L$-layer model, recomputing the last $n_{\mathrm{win}}\cdot L$ tokens using the cached CSA/HCA KV is enough to recover the last $n_{\mathrm{win}}$ SWA KV entries.

This can be turned into a tunable policy:

$$
p^*
=
\arg\min_p
\left(
\alpha \cdot \text{SSDWrite}(p)
+
\beta \cdot \text{RestoreLatency}(p)
+
\gamma \cdot \text{GPURecompute}(p)
\right)
$$
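A sketch of how $p^*$ could be picked by grid search once the three cost terms are available as functions of $p$. The cost functions here are toy placeholders (write volume falling as $1/p$, restore and recompute growing linearly in $p$); they are assumptions for illustration and would need to be replaced by measured models.

```python
from typing import Callable, Iterable


def choose_period(candidates: Iterable[int],
                  ssd_write: Callable[[int], float],
                  restore_latency: Callable[[int], float],
                  gpu_recompute: Callable[[int], float],
                  alpha: float, beta: float, gamma: float) -> int:
    """Return the checkpoint period p that minimizes the weighted cost."""
    def cost(p: int) -> float:
        return alpha * ssd_write(p) + beta * restore_latency(p) + gamma * gpu_recompute(p)
    return min(candidates, key=cost)


# Toy placeholder models, NOT measurements.
def toy_ssd_write(p: int) -> float:
    return 1_000_000 / p   # fewer checkpoints as p grows


def toy_restore_latency(p: int) -> float:
    return 0.005 * p       # more tokens to replay as p grows


def toy_gpu_recompute(p: int) -> float:
    return 0.01 * p


periods = [256, 1024, 4096]
write_heavy = choose_period(periods, toy_ssd_write, toy_restore_latency,
                            toy_gpu_recompute, alpha=1.0, beta=1.0, gamma=1.0)
write_cheap = choose_period(periods, toy_ssd_write, toy_restore_latency,
                            toy_gpu_recompute, alpha=0.001, beta=1.0, gamma=1.0)
print(write_heavy, write_cheap)
```

Under these toy models, a high weight on SSD writes pushes toward sparse checkpoints, and a low weight pushes toward dense ones; the same function covers full and zero SWA caching as the two extremes of the candidate range.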

---

## 2.7 Executable implementation plan

### Phase A0: trace + simulator

Goal: avoid touching complex kernels at first, and show that the policy pays off.

| Work item | Contents |
| ------------------ | ------------------ |
| Trace replay | Use chat/thinking/coder traces, supplemented with agentic session traces |
| KV state simulator | Simulate the size and lifecycle of CSA/HCA/SWA/tail state |
| Cost model | GPU/CPU/SSD load, tail recompute, SWA restore |
| Policy comparison | LRU, full caching, periodic checkpoint, zero caching, cost-based |

---

### Phase A1: pluggable serving-engine prototype

I suggest first building an **external KV artifact manager** around SGLang or vLLM, without modifying the core kernels directly.

| Item | Description |
| ----------------------- | ----------------------- |
| prefix hash manager | Records the session prefix chain |
| artifact metadata DB | SQLite, RocksDB, or Redis would all work |
| GPU/CPU/SSD cache tiers | Simulate first, persist to real disk later |
| restore planner | Emits a load/recompute plan |
| integration shim | Hooks into the prefix cache hook |
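One possible shape for the prefix hash manager is a chain of block hashes, where each hash covers the previous hash plus the next block of tokens. The sketch below uses chained SHA-256 over fixed-size blocks as a stand-in; the block size, the function names, and the choice of hash are assumptions for illustration.

```python
import hashlib
from typing import List, Sequence


def prefix_chain(tokens: Sequence[int], block_size: int) -> List[str]:
    """Hash i identifies tokens[0 : (i + 1) * block_size]; a partial last block is skipped."""
    hashes: List[str] = []
    prev = b""
    full = len(tokens) - len(tokens) % block_size
    for start in range(0, full, block_size):
        block = tokens[start:start + block_size]
        h = hashlib.sha256()
        h.update(prev)  # chaining makes each hash depend on the whole prefix
        h.update(",".join(str(t) for t in block).encode("ascii"))
        prev = h.digest()
        hashes.append(prev.hex())
    return hashes


def shared_blocks(a: Sequence[str], b: Sequence[str]) -> int:
    """Number of leading blocks two prefix chains have in common."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n


turn1 = prefix_chain(list(range(10)), block_size=4)
turn2 = prefix_chain(list(range(8)) + [99, 100, 101, 102], block_size=4)
print(shared_blocks(turn1, turn2))
```

Aligning `block_size` with the compression block span would make every chain entry correspond to a sealable artifact, while the skipped partial block corresponds to the tail state.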

---

### Phase A2: mock hybrid attention / DeepSeek-V4-like backend

If a DeepSeek-V4 inference implementation is available, integrate it step by step; otherwise start with a mock:

| Mock layer | Purpose |
| ---------------------------------- | ---------------------------------- |
| dense KV → synthetic compressed KV | Simulate different compression ratios |
| SWA state | Maintain a real recent window |
| tail state | Manage incomplete blocks according to $m,m'$ |
| sparse index artifact | Initially account only for size and load cost |

This way, the system value of the cache manager can be evaluated even without a complete V4 kernel.

---

### Phase A3: kernel-aware layout

The real systems contribution comes later:

| Task | Description |
| ----------------- | ----------------- |
| lcm block layout | Blocks cover a multiple of $\mathrm{lcm}(m,m')$ |
| alignment padding | Cache-line alignment for sparse attention kernels |
| batch gather API | Efficient gather for the selected sparse KV indices |
| prefetch stream | Asynchronous SSD/CPU-to-GPU fetch of compressed KV |

---

## 2.8 Evaluation design

### Baselines

| Baseline | Description |
| --------------------------------- | --------------------------------- |
| **PagedAttention-style dense KV** | Traditional vLLM/SGLang |
| **GPU-only prefix cache** | No disk tier; reuse GPU KV only |
| **Full SWA caching** | The low-recompute, high-storage strategy mentioned in DeepSeek-V4 |
| **Zero SWA caching** | The low-storage, high-recompute strategy |
| **Periodic checkpointing** | Fixed parameter $p$ |
| **HeteroCache adaptive** | Your cost-based policy |

---

### Metrics

| Metric | Meaning |
| ---------------------------------------- | ---------------------------------------- |
| **TTFT** | Restore plus first-token time after a prefix hit |
| **E2E latency** | Time for an agentic session to complete end to end |
| **HBM per active session** | GPU memory used by each active session |
| **max concurrent long-context sessions** | Number of sessions that fit in the same GPU memory |
| **prefill tokens saved** | Prefill reduction due to reuse |
| **SSD write amplification** | Write amplification of on-disk artifacts |
| **restore latency breakdown** | Split into load / recompute / queue |
| **policy overhead** | Cost of metadata lookup and planning |

---

## 2.9 Expected benefits

| Benefit | Expected direction |
| --------------------------- | --------------------------- |
| **Higher context concurrency** | More $100K\sim1M$ sessions on the same HBM |
| **Lower repeated-prefix TTFT** | Reuse compressed artifacts for repo/doc/agent sessions |
| **Lower SSD write amplification** | Avoid naive full SWA caching |
| **More stable behavior under memory pressure** | State cache and classical KV cache are managed separately |
| **Broader model compatibility** | Supports dense, CSA/HCA, SWA, and MLA-like mixed KV |
| **A clear paper contribution** | Show that the PagedAttention abstraction falls short and propose a heterogeneous state abstraction |

---

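## 2.10 Risks and mitigations

| Risk | Mitigation |
| ------------------------- | ------------------------- |
| No complete DeepSeek-V4 kernel | Start with mock hybrid attention + trace simulator |
| Real gains from on-disk KV I/O are unstable | Build the cost model first, then run SSD microbenchmarks |
| The system is too heavy to implement | Start with the external artifact manager; avoid deep engine changes at the outset |
| Quality impact is hard to evaluate | Project A focuses on state management and does not deliberately change attention results |
| Reviewers see it as mere engineering | Emphasize the new abstraction, heterogeneous KV state, rather than a minor cache-policy tweak |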

---

# 3. Project B Proposal: OwnerServe

## 3.1 Project name

**OwnerServe: Single-Owner KV Artifact Scheduling for Agentic PD-Hybrid LLM Serving**

---

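## 3.2 Background

Your earlier judgment was right: in coding-agent / long-horizon agent scenarios, users usually care less about per-step TPOT and more about:

$$
T_{\mathrm{E2E}}
=
T_{\mathrm{LLM}}
+
T_{\mathrm{tool}}
+
T_{\mathrm{sandbox}}
+
T_{\mathrm{I/O}}
+
T_{\mathrm{retry}}
$$

The goal of traditional P/D separation is to keep prefill from interfering with decode. Agentic workloads, however, have several new characteristics:

| Characteristic | Impact on PD separation |
| ------------------------ | ------------------------ |
| Long multi-turn sessions | KV state lives for a long time |
| Few new tokens per turn relative to the prefix | Incremental prefill is often not a large prefill |
| Long gaps between tool calls | Decode is not continuously saturated |
| Very high prefix reuse | Repo, history, and tool traces can be reused heavily |
| Context can reach $100K+$ or longer | KV duplication cost becomes significant |

DeepSeek-V4 changes the PD problem further: if the model uses compressed attention, what moves between P and D is no longer just full dense KV, but:

$$
\text{Transfer State}
=
\text{Compressed CSA/HCA KV}
+
\text{Indexer KV}
+
\text{SWA State}
+
\text{Tail State}
$$

Here the compressed KV may be small, but SWA/tail state is updated frequently and has a complex restore policy.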

---

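## 3.3 Past assumptions

| Past assumption | Explanation |
| ------------------------------ | ------------------------------ |
| **P/D separation always helps decode latency** | Prefill and decode have different compute profiles, and isolation avoids interference |
| **P produces KV, D consumes KV** | The request lifecycle is simple: one prefill, then a stretch of decode |
| **KV duplication is a necessary cost** | Both P and D may hold the same prefix KV |
| **Routing mainly looks at load** | Picking an idle P and an idle D is enough |
| **KV transfer is a one-time cost** | KV is transferred after prefill, and mostly decode follows |
| **Sessions do not need strong ownership** | Requests are weakly related, and cache locality is not the dominant factor |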

---

## 3.4 Assumptions that may no longer hold

| Broken assumption | Reason |
| --------------------------- | --------------------------- |
| **Static P/D separation is always best** | Agentic decode has tool gaps, so prefill does not necessarily get many chances to interfere with decode |
| **Full KV transfer is acceptable** | Long-context prefixes are large; repeated transfer and duplication consume HBM and bandwidth |
| **KV is request-local** | In multi-turn agentic sessions, KV is long-lived session state |
| **Routing only looks at queue length** | Prefix artifact locality may matter more than queue length |
| **Prefill is one large one-off task** | Coding agents often issue many small incremental prefills |
| **Reuse happens only inside the GPU** | DeepSeek-V4 already brings on-disk shared-prefix reuse into the inference framework |

---

## 3.5 New hypotheses to establish

### Hypothesis B1: long-context agent serving should be scheduled around KV ownership, not around requests

Traditional:

$$
\text{schedule(request)}
$$

New paradigm:

$$
\text{schedule(session, KV_owner, artifact_location, restore_cost)}
$$

In other words: **who owns the prefix state matters more than which worker is currently the most idle.**

---

### Hypothesis B2: P/D separation should be a dynamic decision, not a static architecture

Action set:

$$
a
\in
\{
\text{colocated-prefill},
\text{remote-prefill},
\text{decode-continuation},
\text{artifact-fetch},
\text{tail-recompute},
\text{state-migration}
\}
$$

Scheduling objective:

$$
a^*
=
\arg\min_a
\left(
T_{\text{queue}}
+
T_{\text{prefill}}
+
T_{\text{transfer}}
+
T_{\text{restore}}
+
\lambda M_{\text{duplicate}}
+
\mu I_{\text{decode}}
\right)
$$

where $I_{\text{decode}}$ is the interference that prefill imposes on requests that are currently decoding.

---

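### Hypothesis B3: single-owner KV artifacts can reduce HBM duplication while preserving prefix reuse

Core idea:

> For a long-lived prefix, the system maintains exactly one authoritative owner; other nodes hold only a transient read cache or restore the state via recompute/fetch.

Formally:

$$
\forall p,\quad
|\text{Owner}(p)| = 1
$$

But we allow:

$$
|\text{ReplicaCache}(p)| \ge 0
$$

That is, the owner is unique and the number of cache replicas is kept under control.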

---

### Hypothesis B4: tail/SWA state is not always worth transferring; recomputing is often cheaper

Especially under compressed attention, bulk compressed KV can be reused as an artifact, whereas tail state and SWA state may be better handled as follows:

| State | Recommended strategy |
| ---------------------------- | ---------------------------- |
| compressed CSA/HCA KV | Store / transfer / reuse |
| indexer KV | Manage together with the compressed KV |
| incomplete tail | Recompute in most cases |
| SWA state | Keep for hot sessions; checkpoint or recompute for cold sessions |
| dense decode continuation KV | Keep on the D owner whenever possible |

---

## 3.6 Design

## 3.6.1 System overview

```text
Global Router
    ↓
Ownership Directory
    ↓
Cost-Based PD-Hybrid Scheduler
    ↓
P Workers / D Workers / Hybrid Workers
    ↓
KV Artifact Store + HeteroCache
```

---

## 3.6.2 Core concepts

### KV Artifact

An artifact is immutable prefix state:

$$
A_p =
(\text{prefix_hash}, \text{block_range}, \text{model_version}, \text{kv_type}, \text{location})
$$

### KV Owner

The owner is the authoritative holder of an artifact:

$$
\text{Owner}(A_p)=w_i
$$

### Session State

Session state is the mutable continuation:

$$
S =
(A_{\text{stable prefix}},\; \text{SWA state},\; \text{tail state},\; \text{decode state})
$$

A distinction is needed here:

| State | Property |
| ---------------------- | ---------------------- |
| stable prefix artifact | immutable, shareable |
| tail state | mutable, short-lived |
| SWA state | mutable, but can be checkpointed |
| decode state | strongly locality-sensitive |

---

## 3.6.3 Ownership Directory

It maintains:

| Key | Value |
| ------------------ | ------------------ |
| `prefix_hash` | owner worker |
| `artifact_id` | location list |
| `session_id` | current D owner |
| `repo_id / doc_id` | hot prefix group |
| `lease_expiry` | owner lease |
| `ref_count` | number of forked sessions |
| `last_access` | eviction and migration |

It needs to support:

```text
lookup(prefix_hash)
claim_owner(prefix_hash, worker)
renew_lease(prefix_hash, worker)
release(prefix_hash)
replicate_readonly(prefix_hash, target_worker)
```
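A single-process sketch of these five operations follows, assuming a lease-based owner record and an injectable clock for testing. A real directory would need distributed coordination (the MVP table later suggests Redis or an in-memory map); `OwnershipDirectory`, the 30-second default lease, and the boolean return values are assumptions of this example.

```python
import time
from typing import Callable, Dict, Optional, Set, Tuple


class OwnershipDirectory:
    """At most one unexpired owner per prefix hash (single-process sketch)."""

    def __init__(self, lease_seconds: float = 30.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._lease = lease_seconds
        self._clock = clock
        self._owners: Dict[str, Tuple[str, float]] = {}  # prefix -> (worker, expiry)
        self._replicas: Dict[str, Set[str]] = {}          # prefix -> read-only holders

    def lookup(self, prefix_hash: str) -> Optional[str]:
        entry = self._owners.get(prefix_hash)
        if entry is None:
            return None
        worker, expiry = entry
        if self._clock() >= expiry:  # lease expired: the prefix is unowned again
            self.release(prefix_hash)
            return None
        return worker

    def claim_owner(self, prefix_hash: str, worker: str) -> bool:
        current = self.lookup(prefix_hash)
        if current not in (None, worker):
            return False  # another worker holds a live lease
        self._owners[prefix_hash] = (worker, self._clock() + self._lease)
        return True

    def renew_lease(self, prefix_hash: str, worker: str) -> bool:
        if self.lookup(prefix_hash) != worker:
            return False
        self._owners[prefix_hash] = (worker, self._clock() + self._lease)
        return True

    def release(self, prefix_hash: str) -> None:
        self._owners.pop(prefix_hash, None)
        self._replicas.pop(prefix_hash, None)

    def replicate_readonly(self, prefix_hash: str, target_worker: str) -> bool:
        if self.lookup(prefix_hash) is None:
            return False  # nothing authoritative to replicate from
        self._replicas.setdefault(prefix_hash, set()).add(target_worker)
        return True
```

Dropping read-only replicas when the owner record is released keeps replicas from outliving the authority they were copied from; whether that is the right policy for hot prefixes is an open design choice.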

---

## 3.6.4 PD-Hybrid Scheduler

For each agent step, the scheduler first classifies the request:

| Request type | Example | Default strategy |
| -------------------------- | ------------------------------- | ------------------------------- |
| **New long prefix** | First load of a repo/doc | remote P or dedicated prefill |
| **Small incremental turn** | tool result + short instruction | colocated prefill on D |
| **Long tool output** | test log, large file diff | remote P or hybrid |
| **Decode continuation** | normal generation | stay on D owner |
| **Forked rollout** | $n$ samples / RL rollout | immutable prefix artifact + many D readers |
| **Cold resume** | An idle session returns | load compressed artifact + recompute tail/SWA |

---

## 3.6.5 Decision model

For a given session step, the cost of candidate action $a$ is:

$$
C(a)
=
T_{\text{queue}}(a)
+
T_{\text{compute}}(a)
+
T_{\text{network}}(a)
+
T_{\text{restore}}(a)
+
\lambda \cdot M_{\text{HBM}}(a)
+
\mu \cdot I_{\text{decode}}(a)
$$

Choose:

$$
a^* = \arg\min_a C(a)
$$

where:

| Term | Description |
| -------------------- | -------------------- |
| $T_{\text{queue}}$ | Current queueing time on the worker |
| $T_{\text{compute}}$ | prefill/decode compute time |
| $T_{\text{network}}$ | KV artifact transfer time |
| $T_{\text{restore}}$ | tail/SWA recompute or checkpoint load |
| $M_{\text{HBM}}$ | Additional HBM usage |
| $I_{\text{decode}}$ | Interference of prefill on colocated decode |
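To make the trade-off concrete, the sketch below scores two candidate actions for a small incremental turn and shows how the interference weight $\mu$ flips the decision. All figures are hypothetical: times in milliseconds, HBM in GiB, and $I_{\text{decode}}$ expressed as added decode delay in milliseconds.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass(frozen=True)
class Candidate:
    action: str
    t_queue: float
    t_compute: float
    t_network: float
    t_restore: float
    m_hbm: float      # extra HBM in GiB
    i_decode: float   # added delay on colocated decodes, in ms


def cost(c: Candidate, lam: float, mu: float) -> float:
    """C(a) as defined above, with lambda and mu as tunable weights."""
    return (c.t_queue + c.t_compute + c.t_network + c.t_restore
            + lam * c.m_hbm + mu * c.i_decode)


def choose(candidates: Iterable[Candidate], lam: float, mu: float) -> Candidate:
    return min(candidates, key=lambda c: cost(c, lam, mu))


candidates = [
    Candidate("colocated-prefill", 2.0, 30.0, 0.0, 0.0, 0.0, 10.0),
    Candidate("remote-prefill", 5.0, 20.0, 40.0, 0.0, 0.5, 0.0),
]
relaxed = choose(candidates, lam=10.0, mu=1.0)  # decode is mostly idle (tool gap)
strict = choose(candidates, lam=10.0, mu=5.0)   # decode latency is heavily weighted
print(relaxed.action, strict.action)
```

The same step is routed differently depending on how much the system currently values undisturbed decode, which is the sense in which P/D separation becomes a per-step decision rather than a fixed topology.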

---

## 3.6.6 Single-owner mechanism

### Rule 1: stable prefix artifacts are immutable

Once the prefix reaches a compression block boundary, seal it into an artifact:

$$
A_{0:k} = \mathrm{seal}(S_{0:k})
$$

After sealing, the artifact cannot be modified; new artifacts can only be appended:

$$
A_{0:k} \rightarrow A_{0:k} + A_{k:k+\Delta}
$$

---

### Rule 2: the mutable tail belongs only to the session owner

Tail state is not shared globally:

$$
\text{Owner}(\text{tail}_s)=\text{D-owner}(s)
$$

Unless the tail is large, it is recomputed on migration rather than transferred.

---

### Rule 3: forked sessions use copy-on-write

For multi-branch agent/RL workloads:

$$
S_i = A_{\text{shared prefix}} + \Delta_i
$$

All branches share the prefix artifact, and each maintains its own tail/SWA/decode state.
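A minimal sketch of the fork rule, using token ids as a stand-in for KV state. `PrefixArtifact` and `Session` are hypothetical types; the sealed prefix is shared by reference and never copied, while the small mutable delta is copied eagerly at fork time, which is a simplification of true copy-on-write.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass(frozen=True)
class PrefixArtifact:
    prefix_hash: str
    tokens: Tuple[int, ...]  # immutable once sealed


@dataclass
class Session:
    prefix: PrefixArtifact
    delta: List[int] = field(default_factory=list)  # per-branch mutable state

    def fork(self) -> "Session":
        # Share the sealed prefix by reference; copy only the mutable delta.
        return Session(prefix=self.prefix, delta=list(self.delta))

    def append(self, token: int) -> None:
        self.delta.append(token)

    def tokens(self) -> List[int]:
        return list(self.prefix.tokens) + self.delta


root = Session(PrefixArtifact("p0", (1, 2, 3)))
root.append(4)
branch_a, branch_b = root.fork(), root.fork()
branch_a.append(5)
branch_b.append(6)
print(branch_a.tokens(), branch_b.tokens(), branch_a.prefix is branch_b.prefix)
```

Because the artifact is frozen, any number of branches can read it concurrently without coordination; only the per-branch delta needs an owner.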

---

### Rule 4: owner lease + failure recovery

The owner holds a lease:

$$
\text{lease}(A_p,w_i,t_{\text{expire}})
$$

After a worker crash, the directory reassigns the owner. If the artifact has a copy on SSD/CPU, it is restored from there; otherwise it is recomputed from the token log.

---

## 3.7 Executable implementation plan

## Phase B0: trace-driven simulator

Do not start by deeply modifying the serving engine. First use traces and a cost model to check whether single-owner has potential benefit.

| Input | Contents |
| ----------------- | ----------------- |
| agent trace | session id, turn id, input length, output length, tool latency |
| prefix relation | How many tokens each turn shares with the previous one |
| worker config | Number of P/D/hybrid workers, number of GPUs, network bandwidth |
| KV model | dense KV or hybrid KV |
| scheduling policy | static PD, colocated, OwnerServe |

Outputs:

| Metric | Meaning |
| ------------------- | ------------------- |
| duplicate HBM | How many copies of the same prefix exist |
| network traffic | Total P→D KV transfer volume |
| E2E latency | Completion time of each agent task |
| decode interference | Impact of colocated prefill on decode |
| owner hit rate | Fraction of requests routed to the KV owner |

---

## Phase B1: implement an xPyD prototype on a single $8\times$ H20 node

This fits your current direction very well.

Experimental setup:

| Item | Suggestion |
| --------------- | --------------- |
| serving backend | SGLang xPyD or a custom proxy |
| hardware | Single node, $8\times$ H20 |
| constraint | $x+y\le 8$ |
| P→D link | Simulate RDMA loopback even when local |
| workload | SWE-bench generated + internal agent trace |
| model | Start with Qwen3-Coder-30B-A3B or a similar runnable model |

Implementation modules:

| Module | MVP |
| ------------------- | ------------------- |
| Global router | Python/Rust proxy |
| Ownership directory | Redis / in-memory map |
| Prefix hash | token-level rolling hash |
| KV ownership | Simulate with logical ownership first |
| Transfer planner | Record actual or estimated KV bytes |
| Colocated fallback | Small incremental turns prefill directly on D |

---

## Phase B2: integrate HeteroCache

Once Project A has a prototype, B can move from dense KV ownership to heterogeneous artifact ownership:

| Artifact Type | Owner strategy |
| ------------------ | ------------------ |
| dense KV | D owner first |
| compressed CSA/HCA | artifact owner; shareable across D workers |
| SWA checkpoint | session owner or SSD |
| tail state | session-local; prefer recompute |
| indexer KV | follows the compressed artifact |

---

## Phase B3: real agentic evaluation

Workloads:

| Workload | Value |
| ----------------------------- | ----------------------------- |
| SWE-bench generated | Reproducible and publishable |
| repo-level coding agent trace | Long prefix + multi-turn tool use |
| synthetic forked rollout | Tests multi-branch shared prefix |
| long-doc QA multi-turn | Tests on-disk prefix reuse |
| internal Ali traces | Industrial realism |

---

## 3.8 Baselines

| Baseline | Description |
| -------------------------------- | -------------------------------- |
| **Static PD** | Fixed P nodes and D nodes; P transfers KV after finishing prefill |
| **Colocated serving** | No P/D separation; everything runs on one worker |
| **Round-robin PD** | Ignores KV locality |
| **Prefix-aware but multi-owner** | Prefix cache hits, but multiple copies are allowed |
| **OwnerServe** | single-owner + cost-based PD hybrid |
| **Oracle** | Optimal schedule given knowledge of the future request sequence; used as an upper bound |

---

## 3.9 Metrics

| Metric | Why it matters |
| ------------------------------------ | ------------------------------------ |
| **E2E task time** | Primary metric for agentic workloads |
| **p95/p99 step latency** | Per-turn user-perceived latency |
| **HBM duplicate factor** | Measures wasted KV |
| **network KV traffic** | Cost of P→D transfer and remote fetch |
| **prefix owner hit rate** | Effectiveness of ownership routing |
| **decode interference** | Whether hybrid colocation hurts decode |
| **tool idle utilization** | Whether the GPU is better used during tool gaps |
| **successful trajectory throughput** | Successful agent tasks completed per GPU |

---

## 3.10 Expected benefits

| Benefit | Expected direction |
| ---------------------------------- | ---------------------------------- |
| **Less KV duplication** | Long prefixes are no longer replicated long-term across P and D |
| **Less P→D traffic** | Transfer delta artifacts instead of the full KV each time |
| **Lower E2E time** | Colocate small incremental turns, avoiding remote prefill + transfer |
| **More concurrent sessions** | Less HBM is consumed by duplicated prefixes |
| **Better fit for forked rollout** | Branches share an immutable prefix artifact |
| **Better explanation of agentic serving trade-offs** | Shift from TTFT/TPOT to E2E + ownership + artifact locality |

Conservatively, the hard win this project is most likely to deliver is not a large drop in single-step latency, but:

$$
\text{same GPUs} \Rightarrow \text{more long-context sessions}
$$

and:

$$
\text{same E2E target} \Rightarrow \text{less KV transfer and duplication}
$$

---

## 3.11 Risks and mitigations

| Risk | Mitigation |
| --------------------------------- | --------------------------------- |
| Single-owner may create hotspots | Support read-only replica caches while keeping a unique authoritative owner |
| Colocated prefill may interfere with decode | Include $I_{\text{decode}}$ explicitly in the cost model |
| KV ownership is hard to integrate into existing engines | Start with logical ownership + simulator, then wire in real KV hooks step by step |
| Gains on dense models are less pronounced than on hybrid models | First demonstrate agentic prefix reuse and PD duplication; integrate HeteroCache later to amplify the gains |
| Reviewers see it as a router heuristic | Emphasize the ownership abstraction, artifact lifecycle, cost model, and trace-driven evidence |

---

# 4. Combined architecture of the two projects

The final system could look like this:

```text
Agent Request / Tool Result / Session Resume
                 ↓
          Session Router
                 ↓
       Ownership Directory
                 ↓
      PD-Hybrid Cost Planner
                 ↓
 ┌───────────────┴────────────────┐
 │                                │
P/Hybrid Worker              D/Hybrid Worker
 │                                │
 └───────────────┬────────────────┘
                 ↓
          HeteroCache Manager
                 ↓
   GPU HBM / CPU DRAM / SSD Artifact Store
```

Project A provides:

$$
\text{what state exists and how to restore it}
$$

Project B decides:

$$
\text{where the state should live and where the request should run}
$$

---

# 5. Suggested paper framing

## 5.1 Title direction for Project A

**HeteroCache: Managing Heterogeneous KV States for Hybrid-Attention Long-Context LLM Serving**

### Core contributions

| Contribution    | Description                                                                |
| --------------- | -------------------------------------------------------------------------- |
| **Observation** | hybrid attention breaks homogeneous KV cache assumption                    |
| **Abstraction** | KV state type registry + restore-cost based prefix reuse                   |
| **System**      | GPU/CPU/SSD/recompute-aware heterogeneous KV manager                       |
| **Evaluation**  | long-context traces + shared-prefix workloads                              |
| **Result**      | higher session concurrency, lower restore latency, lower redundant prefill |

---

## 5.2 Title direction for Project B

**OwnerServe: Single-Owner KV Artifact Scheduling for Agentic PD-Hybrid LLM Serving**

### Core contributions

| Contribution    | Description                                                 |
| --------------- | ----------------------------------------------------------- |
| **Observation** | static PD separation duplicates long-lived agentic KV state |
| **Abstraction** | KV artifact ownership and session-state lifecycle           |
| **System**      | ownership-aware router + PD-hybrid scheduler                |
| **Policy**      | recompute/fetch/transfer/colocate cost model                |
| **Evaluation**  | coding-agent traces, forked rollout, long-prefix sessions   |

---

# 6. My recommendation: build A first, then use B as the killer application

## 6.1 Why A is more fundamental

A addresses a problem that is becoming common across models:

> Long-context model architectures are breaking the existing KV cache abstraction.

As long as more future models adopt CSA/HCA/SWA/MLA/sliding/dilated/sparse/hybrid attention, the problem A addresses will persist.

A's strengths:

| Dimension                       | Assessment                                              |
| ------------------------------- | ------------------------------------------------------- |
| **Independence**                | No strong dependency on PD                              |
| **Clear systems contribution**  | A new cache abstraction                                 |
| **Incremental implementation**  | simulator → external manager → engine integration       |
| **Fits SOSP/OSDI framing**      | challenge old abstraction, propose new system primitive |

---

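## 6.2 Why B is the stronger application scenario

B is closer to what you have consistently cared about: **agentic workload + PD hybrid + KV ownership**.

But B has the following difficulties:

| Difficulty                         | Explanation                                                          |
| ---------------------------------- | -------------------------------------------------------------------- |
| Needs real agent traces            | Otherwise the gains are easily dismissed as synthetic                |
| Needs a PD prototype               | Higher implementation complexity                                     |
| Must show decode is not harmed     | Prefill/decode interference has to be measured carefully             |
| Needs to be combined with A        | Otherwise it is only ownership on dense KV, and the story is not novel enough |

So the route I suggest is: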

```text
Stage 1: HeteroCache simulator + trace evidence
Stage 2: HeteroCache prototype
Stage 3: OwnerServe simulator
Stage 4: xPyD OwnerServe prototype
Stage 5: combine A+B into agentic long-context serving paper
```

---

# 7. Final project summaries

## 7.1 Project A summary

**HeteroCache** argues that the KV cache of long-context hybrid-attention models has changed from homogeneous dense blocks into heterogeneous KV state. It proposes a model-aware KV state abstraction that brings CSA/HCA compressed KV, indexer KV, SWA KV, and tail state under joint GPU/CPU/SSD/recompute management, and uses restore cost instead of hit/miss as the core decision metric for prefix reuse.
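
To illustrate what "restore cost instead of hit/miss" could mean in practice, here is a minimal sketch that compares loading a cached prefix from each tier against recomputing it. The `Tier` fields, the linear cost formulas, and all bandwidth, latency, and prefill-rate numbers are assumptions made for the example; they are not HeteroCache's actual model.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Tier:
    name: str
    bandwidth_gb_per_s: float  # assumed sustained read bandwidth to GPU
    latency_ms: float          # assumed fixed access latency


def load_cost_ms(size_gb: float, tier: Tier) -> float:
    """Time to bring `size_gb` of KV state from `tier` back into HBM."""
    return tier.latency_ms + size_gb / tier.bandwidth_gb_per_s * 1000.0


def recompute_cost_ms(tokens: int, prefill_tokens_per_s: float) -> float:
    """Time to rebuild the state by re-running prefill over `tokens`."""
    return tokens / prefill_tokens_per_s * 1000.0


def cheapest_restore(size_gb: float, tokens: int,
                     tiers_holding: list[Tier],
                     prefill_tokens_per_s: float) -> tuple[str, float]:
    """Pick the cheapest way to restore a prefix; a 'hit' on a slow tier
    can still lose to recomputation."""
    options = {"recompute": recompute_cost_ms(tokens, prefill_tokens_per_s)}
    for tier in tiers_holding:
        options[tier.name] = load_cost_ms(size_gb, tier)
    best = min(options, key=options.get)
    return best, options[best]


cpu = Tier("cpu_dram", bandwidth_gb_per_s=25.0, latency_ms=0.1)
ssd = Tier("ssd", bandwidth_gb_per_s=5.0, latency_ms=1.0)

# Long prefix cached in both tiers: loading from CPU DRAM wins.
print(cheapest_restore(2.0, 100_000, [cpu, ssd], prefill_tokens_per_s=20_000))
# Short prefix that is only on SSD: recomputing is cheaper than the "hit".
print(cheapest_restore(2.0, 1_000, [ssd], prefill_tokens_per_s=20_000))
```

The second call shows why a binary hit/miss signal is insufficient: the prefix is a cache hit on SSD, yet under these assumed numbers recomputation restores it faster.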

---

## 7.2 Project B summary

**OwnerServe** argues that the core bottleneck of agentic long-context serving is not the prefill/decode of a single request, but the ownership, duplication, migration, and reuse of long-lived session KV state. It proposes a single-owner KV artifact abstraction and uses a cost-based PD-hybrid scheduler to choose dynamically among colocated prefill, remote prefill, artifact fetch, tail recompute, and state migration, reducing HBM duplication and P→D traffic while optimizing E2E agent task time.
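
The cost-based choice among these actions can be sketched as scoring each candidate and taking the minimum. This is a simplified stand-in, not the project's cost model: the `Option` fields, the weighted-sum form, the weights `w_transfer_ms_per_gb` and `w_dup_ms_per_gb`, and every number below are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Option:
    name: str
    latency_ms: float              # estimated time to make the turn runnable
    decode_interference_ms: float  # penalty on co-running decodes (the I_decode term)
    transfer_gb: float             # KV bytes moved over the interconnect
    duplicate_hbm_gb: float        # extra HBM held by duplicated KV


def score(opt: Option,
          w_transfer_ms_per_gb: float = 20.0,
          w_dup_ms_per_gb: float = 5.0) -> float:
    """Weighted sum; the weights are assumed knobs, not fitted values."""
    return (opt.latency_ms
            + opt.decode_interference_ms
            + w_transfer_ms_per_gb * opt.transfer_gb
            + w_dup_ms_per_gb * opt.duplicate_hbm_gb)


def choose(options: list[Option]) -> Option:
    return min(options, key=score)


# Small-delta turn (e.g. a short tool result): colocating on the owner wins.
small_turn = [
    Option("colocated_prefill", 40.0, 15.0, 0.0, 0.0),
    Option("remote_prefill", 30.0, 0.0, 0.5, 4.0),
]
# Large-delta turn: interference dominates, so remote prefill wins.
large_turn = [
    Option("colocated_prefill", 900.0, 400.0, 0.0, 0.0),
    Option("remote_prefill", 300.0, 0.0, 3.0, 0.0),
]
print(choose(small_turn).name, choose(large_turn).name)
```

Artifact fetch, tail recompute, and state migration would enter the same way, as additional `Option` entries whose four cost terms are estimated per turn.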

---

## 7.3 The combined thesis

The strongest overall thesis is:

> **For long-context agentic LLM serving, the primary scheduling object should shift from requests to KV artifacts.**

That is:

$$
\text{Request-centric serving}
\quad\Longrightarrow\quad
\text{KV-artifact-centric serving}
$$

This framing fits your current research line well: it naturally connects DeepSeek-V4's million-token hybrid attention with the PD hybrid, KV cache ownership, agentic E2E latency, and trace-driven reproducible evaluation that you care about.