## 0. One-sentence positioning

I suggest splitting these two directions into **foundation project A** and **policy project B**:

| Project | Name | Core problem | Paper/system contribution |
| --- | --- | --- | --- |
| **A** | **HeteroCache: Hybrid-Attention-Aware KV State Manager** | Under hybrid attention, the KV cache is no longer made of homogeneous token blocks; the PagedAttention abstraction is insufficient | New KV cache abstraction + joint GPU/CPU/SSD/recompute management |
| **B** | **OwnerServe: Single-Owner KV Artifact Scheduling for Agentic PD-Hybrid Serving** | Under agentic long context, P/D disaggregation incurs KV duplication and state-migration cost | KV ownership + PD hybrid routing + recompute/transfer planner |

The relationship between the two:

$$
\text{B depends on A}
$$

A answers **"how KV state should be represented and managed"**; B answers **"who should own that state, when to migrate it, when to recompute it, and where requests should be routed"**.

The DeepSeek-V4 report provides strong motivation: it explicitly states that hybrid attention produces multiple kinds of KV entries whose sizes, update rules, and cache policies all differ, violating the basic assumptions of PagedAttention and its variants; the report also splits the KV cache into a classical KV cache and a state cache, and designs compressed KV storage and SWA full/periodic/zero caching strategies for on-disk shared-prefix reuse.

---
# 1. Shared background: why this is a new problem now

## 1.1 The core abstraction of traditional LLM serving

Serving engines such as vLLM/SGLang have historically been built around roughly this abstraction:

$$
\text{Request} \rightarrow \text{Dense KV Blocks} \rightarrow \text{Decode Scheduler}
$$

The KV cache is typically viewed as:

| Dimension | Traditional assumption |
| --- | --- |
| **Structure** | every layer and every token produces homogeneous KV |
| **Lifecycle** | generated at prefill, continuously appended during decode |
| **Location** | resident mostly in GPU HBM |
| **Reuse** | existing KV reused after an exact prefix hit |
| **Scheduling goal** | lower TTFT/TPOT, higher goodput |
| **Cache management** | block allocator + eviction policy |

This abstraction remained fairly effective through the dense attention / GQA / MLA era.

---
## 1.2 What DeepSeek-V4 changes

DeepSeek-V4's core change: **to support million-token contexts, it turns attention KV from dense token-level state into a heterogeneous mix of compressed, sparse, local-window, and tail states**.

The report states that DeepSeek-V4 uses CSA and HCA hybrid attention: CSA compresses KV along the sequence dimension and then does sparse selection, while HCA compresses KV more aggressively but keeps dense attention; at a $1M$ context, V4-Pro's single-token inference FLOPs are about $27\%$ of V3.2's and its KV cache about $10\%$.

Consequently, the object a new serving engine must manage is no longer a single kind of KV block, but:

$$
\text{KV State} = \text{CSA Main KV} + \text{CSA Indexer KV} + \text{HCA KV} + \text{SWA KV} + \text{Uncompressed Tail State}
$$

The report explicitly notes that hybrid attention introduces multiple kinds of KV entries whose KV cache sizes and update rules differ; SWA has its own hit/eviction policy; and in the compression branch, tokens and hidden states that have not yet filled a compression block must be staged as sequence state.

---
# 2. Project A Proposal: HeteroCache

## 2.1 Project name

**HeteroCache: A Heterogeneous KV State Manager for Hybrid-Attention Long-Context LLM Serving**

---

## 2.2 Background

The most important infra signal in the DeepSeek-V4 report:

> Hybrid attention turns the KV cache from homogeneous dense blocks into heterogeneous KV states.

Figure 6 of the report splits the KV cache into two parts:

| Component | Contents |
| --- | --- |
| **Classical KV cache** | CSA/HCA compressed KV blocks |
| **State cache** | SWA KV + CSA/HCA tail states not yet compressed |

Each request is allocated a fixed-size state cache block; in the classical KV cache, each cache block covers $\mathrm{lcm}(m,m')$ original tokens and yields both CSA compressed tokens and HCA compressed tokens.

This shows that the core assumption of traditional PagedAttention is broken:

$$
\text{one layer} \approx \text{one homogeneous KV layout}
$$

has now become:

$$
\text{one request} \rightarrow \left\{ \text{layer-specific cache type},\ \text{branch-specific compression ratio},\ \text{state-specific eviction policy} \right\}
$$

---
## 2.3 Past assumptions

| Past assumption | Why it used to hold |
| --- | --- |
| **The KV cache consists of homogeneous blocks** | under dense attention / GQA, per-layer KV shapes are regular and block size is fixed |
| **A prefix cache hit is a binary event** | if the prefix tokens match, the corresponding KV can be reused directly |
| **GPU HBM is the primary cache tier** | contexts were short; KV was managed mostly inside GPU memory |
| **Eviction policy can be layer-agnostic** | per-layer KV importance, size, and update rules were roughly uniform |
| **Attention kernels only consume the cache layout** | kernels did not strongly constrain the cache layout in return |
| **Tail recompute does not matter** | dense KV materializes every token immediately; there is no unfinished compression block |

---

## 2.4 Assumptions that may no longer hold

| Broken assumption | Counterexample in DeepSeek-V4 |
| --- | --- |
| **KV is homogeneous** | CSA main KV, CSA indexer KV, HCA KV, SWA KV, and tail state all differ |
| **A prefix hit lets the whole prefix be reused directly** | incomplete compression blocks still require recompute |
| **SWA and main KV can be managed uniformly** | the report says SWA KV is uncompressed and present at every layer, about $8\times$ the size of CSA/HCA compressed KV, and needs its own policy |
| **Cache block size is decided by the allocator alone** | sparse attention kernels impose alignment requirements, calling for cache layout + kernel co-design |
| **On-disk KV is just swap** | V4's on-disk KV is shared-prefix artifact reuse, not passive paging |

---
## 2.5 New assumptions to establish

### Hypothesis A1: the KV cache should be modeled as heterogeneous state, not dense tensor blocks

New abstraction:

$$
\text{KVState}_{r} = \left\{ s_i \mid s_i = (\text{type}, \text{layer}, \text{range}, \text{precision}, \text{location}, \text{restore\_cost}) \right\}
$$

where $\text{type}$ can be:

| Type | Example |
| --- | --- |
| `COMPRESSED_CSA_MAIN` | CSA compressed main KV |
| `COMPRESSED_CSA_INDEX` | CSA lightning indexer KV |
| `COMPRESSED_HCA` | HCA compressed KV |
| `WINDOW_SWA` | recent $n_{\mathrm{win}}$ uncompressed KV |
| `TAIL_STATE` | pending hidden states short of $m$ or $m'$ |
| `DENSE_KV` | compatibility with traditional models |

---
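To make Hypothesis A1 concrete, the set-of-states abstraction can be sketched as a small data model (a hypothetical sketch; `KVStateEntry` and its fields mirror the tuple in the formula above and are not an existing engine API):

```python
from dataclasses import dataclass
from enum import Enum, auto

class KVType(Enum):
    COMPRESSED_CSA_MAIN = auto()
    COMPRESSED_CSA_INDEX = auto()
    COMPRESSED_HCA = auto()
    WINDOW_SWA = auto()
    TAIL_STATE = auto()
    DENSE_KV = auto()

@dataclass(frozen=True)
class KVStateEntry:
    """One element s_i of KVState_r: a typed, located piece of per-request KV state."""
    type: KVType
    layer: int
    token_range: tuple[int, int]   # [start, end) over original tokens
    precision: str                 # e.g. "BF16", "FP8", "FP4"
    location: str                  # "GPU" | "CPU" | "SSD" | "REMOTE"
    restore_cost: float            # estimated seconds to make it GPU-resident

def total_restore_cost(state: list[KVStateEntry]) -> float:
    """Cost to bring a request's full KV state back onto the GPU."""
    return sum(s.restore_cost for s in state if s.location != "GPU")
```

Keeping `restore_cost` on every entry is what lets later components (Restore Planner, Policy Engine) reason about state uniformly across types.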
### Hypothesis A2: prefix reuse should be modeled by restore cost, not hit/miss

Traditional prefix cache:

$$
\text{hit}(p)\in \{0,1\}
$$

New prefix artifact reuse:

$$
\text{reuse\_benefit}(p) = C_{\text{prefill}}(p) - C_{\text{load}}(p) - C_{\text{recompute}}(p) - C_{\text{stall}}(p)
$$

where:

| Term | Meaning |
| --- | --- |
| $C_{\text{prefill}}(p)$ | cost of re-running prefill if the artifact is not reused |
| $C_{\text{load}}(p)$ | cost of reading the artifact from GPU/CPU/SSD/remote |
| $C_{\text{recompute}}(p)$ | recompute cost for incomplete blocks and SWA state restore |
| $C_{\text{stall}}(p)$ | blocking imposed on the request by I/O or scheduling waits |

---
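The restore-cost criterion reduces to a one-line decision rule; a minimal sketch (the cost arguments are assumed to come from fitted cost models, all in seconds):

```python
def reuse_benefit(c_prefill: float, c_load: float,
                  c_recompute: float, c_stall: float) -> float:
    """reuse_benefit(p) = C_prefill(p) - C_load(p) - C_recompute(p) - C_stall(p)."""
    return c_prefill - c_load - c_recompute - c_stall

def should_reuse_artifact(c_prefill: float, c_load: float,
                          c_recompute: float, c_stall: float) -> bool:
    # Reuse only if loading and patching the artifact is strictly cheaper
    # than re-running prefill from scratch.
    return reuse_benefit(c_prefill, c_load, c_recompute, c_stall) > 0.0
```

Note the asymmetry with a binary hit: a "hit" on an SSD-resident artifact with an expensive SWA restore can still lose to plain re-prefill.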
### Hypothesis A3: cache policy must be model-aware, branch-aware, and kernel-aware

That is:

$$
\text{cache policy} = f( \text{attention branch},\ \text{compression ratio},\ \text{kernel alignment},\ \text{reuse pattern},\ \text{memory pressure} )
$$

rather than plain LRU/LFU.

---
## 2.6 Design

## 2.6.1 System overview

HeteroCache decomposes into four layers:

```text
Request / Session
        ↓
KV State Abstraction Layer
        ↓
GPU / CPU / SSD Artifact Manager
        ↓
Kernel-Aware Layout Manager
        ↓
Attention Kernels / Serving Engine
```

---

## 2.6.2 Core modules

| Module | Function |
| --- | --- |
| **Attention Spec Registry** | registers per-layer attention type, KV type, compression ratio, precision, alignment |
| **State Cache Manager** | manages SWA KV and uncompressed tail states |
| **Compressed KV Allocator** | manages CSA/HCA compressed blocks |
| **Artifact Store** | stores prefix compressed KV on GPU/CPU/SSD |
| **Restore Planner** | decides the combination of load, recompute, and partial reuse |
| **Kernel Layout Adapter** | organizes block layout to match sparse attention kernel requirements |
| **Policy Engine** | performs eviction / checkpoint / prefetch under memory pressure |

---
## 2.6.3 Attention Spec Registry

Each model must declare:

```text
LayerSpec {
  layer_id
  attention_type: CSA | HCA | SWA | Dense
  compression_ratio: m or m'
  kv_entry_size
  precision: BF16 | FP8 | FP4
  block_alignment
  state_size
  restore_fn
}
```

For a DeepSeek-V4-like model this might be:

| Layer Type | Cache Type | Compression |
| --- | --- | --- |
| CSA layer | CSA main + CSA indexer + SWA + tail | $m=4$ |
| HCA layer | HCA compressed + SWA + tail | $m'=128$ |
| pure SWA layer | SWA only | windowed |
| dense layer | dense KV | none |

---
## 2.6.4 State Cache Manager

The state cache manages:

$$
\text{StateCache} = \text{SWA}_{n_{\mathrm{win}}} + \text{Tail}_{<m} + \text{Tail}_{<m'}
$$

Core policies:

| Policy | Description |
| --- | --- |
| **fixed-size per-sequence state block** | matches the DeepSeek-V4 report: each request gets a bounded state block |
| **tail-aware append** | while a compression block is unfilled, only the tail state is updated |
| **block-finalize trigger** | once $m$ or $m'$ tokens accumulate, emit the compressed KV and free the tail |
| **state spill policy** | long-idle sessions may checkpoint SWA state to CPU/SSD |
| **restore policy** | on a prefix hit, choose load vs. recompute according to the SWA policy |

---
## 2.6.5 Artifact Store

Artifact ID:

$$
\text{ArtifactID} = H( \text{model\_id},\ \text{model\_revision},\ \text{tokenizer},\ \text{attention\_spec},\ \text{prefix\_hash},\ \text{block\_range},\ \text{precision} )
$$

Artifact metadata:

| Field | Purpose |
| --- | --- |
| `prefix_hash` | exact prefix identity |
| `block_range` | which original tokens are covered |
| `kv_type` | CSA/HCA/SWA/tail |
| `location` | GPU / CPU / SSD / remote |
| `size_bytes` | cache accounting |
| `restore_cost` | used by the scheduler |
| `last_access` | eviction |
| `reuse_count` | utility estimation |

---
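The artifact ID is a content address over everything that affects KV validity. A minimal sketch (SHA-256 as the assumed $H$; the field list follows the formula above):

```python
import hashlib

def artifact_id(model_id: str, model_revision: str, tokenizer: str,
                attention_spec: str, prefix_hash: str,
                block_range: tuple[int, int], precision: str) -> str:
    """H(model_id, model_revision, tokenizer, attention_spec,
         prefix_hash, block_range, precision) as a hex digest.
    Changing any field (e.g. the model revision) yields a new ID,
    so stale artifacts cannot be reused across incompatible configs."""
    h = hashlib.sha256()
    for field in (model_id, model_revision, tokenizer, attention_spec,
                  prefix_hash, f"{block_range[0]}:{block_range[1]}", precision):
        h.update(field.encode("utf-8"))
        h.update(b"\x00")  # separator keeps field boundaries unambiguous
    return h.hexdigest()
```

Making the ID cover `attention_spec` and `precision` is what allows the same token prefix to coexist as several artifacts (e.g. FP8 vs. BF16) without aliasing.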
## 2.6.6 Restore Planner

When a new request hits a prefix, the Restore Planner solves:

$$
a^* = \arg\min_{a\in A} \left( T_{\text{load}}(a) + T_{\text{recompute}}(a) + T_{\text{queue}}(a) + \lambda M_{\text{HBM}}(a) \right)
$$

Action set:

| Action | Scenario |
| --- | --- |
| `LOAD_COMPRESSED_KV` | a compressed CSA/HCA artifact already exists |
| `RECOMPUTE_TAIL` | the incomplete compression block is not worth storing |
| `LOAD_SWA_CHECKPOINT` | a periodic checkpoint hit |
| `RECOMPUTE_SWA` | zero SWA caching, or the checkpoint is too stale |
| `PREFETCH_ARTIFACT` | the session will likely continue; pull ahead of time |
| `DROP_LOW_UTILITY_ARTIFACT` | memory/SSD pressure is high |

---
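The arg-min above can be sketched directly (a hypothetical sketch: action names follow the table, the per-action cost terms are assumed to come from the cost model, and $\lambda$ appears as `hbm_weight`):

```python
def plan_restore(candidates: dict, hbm_weight: float = 0.1) -> str:
    """a* = argmin_a ( T_load(a) + T_recompute(a) + T_queue(a) + lambda * M_HBM(a) )."""
    def cost(action: str) -> float:
        c = candidates[action]
        return (c["t_load"] + c["t_recompute"] + c["t_queue"]
                + hbm_weight * c["m_hbm_gb"])
    return min(candidates, key=cost)

# Example: two candidate plans for the same missing state —
# a cheap recompute can beat an SSD load that also pins more HBM.
best = plan_restore({
    "LOAD_COMPRESSED_KV": {"t_load": 0.8, "t_recompute": 0.0,
                           "t_queue": 0.1, "m_hbm_gb": 2.0},
    "RECOMPUTE_TAIL":     {"t_load": 0.0, "t_recompute": 0.3,
                           "t_queue": 0.1, "m_hbm_gb": 0.2},
})
```

In a full planner the action set would be a combination per state entry, but the scalarized objective stays the same.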
## 2.6.7 On-disk SWA Policy

Borrow and systematize DeepSeek-V4's three strategies:

| Strategy | Storage cost | Recompute cost | When to use |
| --- | ---: | ---: | --- |
| **Full SWA Caching** | highest | lowest | very hot prefixes, very tight TTFT targets |
| **Periodic Checkpointing** | medium | medium | default; the period $p$ is tunable |
| **Zero SWA Caching** | lowest | highest | heavy SSD write pressure, infrequent SWA restores |

The DeepSeek-V4 report notes that under Zero SWA Caching, for an $L$-layer model, recomputing the last $n_{\mathrm{win}}\cdot L$ tokens on top of the cached CSA/HCA KV suffices to rebuild the last $n_{\mathrm{win}}$ SWA KV entries.

This becomes a tunable policy:

$$
p^* = \arg\min_p \left( \alpha \cdot \text{SSDWrite}(p) + \beta \cdot \text{RestoreLatency}(p) + \gamma \cdot \text{GPURecompute}(p) \right)
$$

---
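The period $p$ can then be chosen by a simple sweep; a sketch under purely illustrative, assumed cost curves (the three callables would be workload-fitted models in practice):

```python
def choose_checkpoint_period(periods, ssd_write, restore_latency,
                             gpu_recompute, alpha=1.0, beta=1.0, gamma=1.0):
    """p* = argmin_p ( alpha*SSDWrite(p) + beta*RestoreLatency(p) + gamma*GPURecompute(p) )."""
    return min(periods, key=lambda p: (alpha * ssd_write(p)
                                       + beta * restore_latency(p)
                                       + gamma * gpu_recompute(p)))

# Illustrative trade-off: frequent checkpoints cost SSD writes,
# sparse checkpoints cost recompute of the gap since the last one.
best_p = choose_checkpoint_period(
    periods=[16, 32, 64, 128],
    ssd_write=lambda p: 100.0 / p,       # writes per token fall as p grows
    restore_latency=lambda p: 0.01 * p,  # staler checkpoint -> longer restore
    gpu_recompute=lambda p: 0.05 * p,    # recompute the uncheckpointed gap
)
```

The weights α, β, γ are where deployment priorities (SSD endurance vs. TTFT vs. GPU time) enter the policy.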
## 2.7 Executable implementation plan

### Phase A0: trace + simulator

Goal: prove the policy pays off before touching any complex kernels.

| Work item | Content |
| --- | --- |
| Trace replay | use chat/thinking/coder traces, supplemented with agentic session traces |
| KV state simulator | simulate CSA/HCA/SWA/tail state sizes and lifecycles |
| Cost model | GPU/CPU/SSD load, tail recompute, SWA restore |
| Policy comparison | LRU, full caching, periodic checkpoint, zero caching, cost-based |

---
### Phase A1: pluggable serving-engine prototype

Start with an **external KV artifact manager** around SGLang or vLLM, without modifying core kernels.

| Item | Description |
| --- | --- |
| prefix hash manager | records session prefix chains |
| artifact metadata DB | SQLite/RocksDB/Redis all work |
| GPU/CPU/SSD cache tiers | simulated first, real storage later |
| restore planner | emits load/recompute plans |
| integration shim | hooks into the prefix cache |

---
### Phase A2: mock hybrid attention / DeepSeek-V4-like backend

If a DeepSeek-V4 inference implementation is available, integrate it gradually; otherwise mock it first:

| Mock layer | Role |
| --- | --- |
| dense KV → synthetic compressed KV | simulates different compression ratios |
| SWA state | maintains a real recent window |
| tail state | manages incomplete blocks by $m,m'$ |
| sparse index artifact | initially only accounts for size and load cost |

Even without the full V4 kernels, this lets you evaluate the system value of the cache manager.

---
### Phase A3: kernel-aware layout

The real systems contribution comes later:

| Task | Description |
| --- | --- |
| lcm block layout | blocks cover multiples of $\mathrm{lcm}(m,m')$ |
| alignment padding | cache-line alignment for sparse attention kernels |
| batch gather API | efficient gather over selected sparse KV indices |
| prefetch stream | asynchronous SSD/CPU → GPU pulls of compressed KV |

---
## 2.8 Evaluation design

### Baselines

| Baseline | Description |
| --- | --- |
| **PagedAttention-style dense KV** | traditional vLLM/SGLang |
| **GPU-only prefix cache** | no spill to disk; only GPU KV reuse |
| **Full SWA caching** | DeepSeek-V4's low-recompute, high-storage strategy |
| **Zero SWA caching** | low-storage, high-recompute strategy |
| **Periodic checkpointing** | fixed period $p$ |
| **HeteroCache adaptive** | your cost-based policy |

---

### Metrics

| Metric | Meaning |
| --- | --- |
| **TTFT** | restore-to-first-token time after a prefix hit |
| **E2E latency** | full completion time of an agentic session |
| **HBM per active session** | GPU memory used per active session |
| **max concurrent long-context sessions** | sessions supported under the same GPU memory |
| **prefill tokens saved** | prefill reduction from reuse |
| **SSD write amplification** | write amplification of on-disk artifacts |
| **restore latency breakdown** | load / recompute / queue decomposition |
| **policy overhead** | metadata lookup and planning cost |

---
## 2.9 Expected benefits

| Benefit | Expected direction |
| --- | --- |
| **Higher context concurrency** | more $100K\sim1M$ sessions under the same HBM |
| **Lower repeated-prefix TTFT** | reuse compressed artifacts for repo/doc/agent sessions |
| **Lower SSD write amplification** | avoids naive full SWA caching |
| **More stable behavior under memory pressure** | state cache and classical KV cache managed separately |
| **Better model adaptability** | supports dense, CSA/HCA, SWA, and MLA-like mixed KV |
| **Clear paper contribution** | shows the PagedAttention abstraction is insufficient; proposes a heterogeneous state abstraction |

---
## 2.10 Risks and mitigations

| Risk | Mitigation |
| --- | --- |
| no complete DeepSeek-V4 kernels | start with mock hybrid attention + trace simulator |
| real gains of on-disk KV I/O may be unstable | build the cost model first, then SSD microbenchmarks |
| system implementation too heavy | start with an external artifact manager; avoid deep engine changes at first |
| quality impact hard to assess | Project A focuses on state management and does not alter attention results |
| reviewers may see it as mere engineering | stress the new abstraction: heterogeneous KV state, not incremental cache-policy tweaks |

---
# 3. Project B Proposal: OwnerServe

## 3.1 Project name

**OwnerServe: Single-Owner KV Artifact Scheduling for Agentic PD-Hybrid LLM Serving**

---

## 3.2 Background

Your earlier judgment holds: in coding-agent / long-horizon agent scenarios, users rarely care about per-step TPOT; they care about:

$$
T_{\mathrm{E2E}} = T_{\mathrm{LLM}} + T_{\mathrm{tool}} + T_{\mathrm{sandbox}} + T_{\mathrm{I/O}} + T_{\mathrm{retry}}
$$

Traditional P/D disaggregation aims to keep prefill from interfering with decode. But agentic workloads have new characteristics:

| Characteristic | Impact on P/D disaggregation |
| --- | --- |
| long multi-turn sessions | KV state lives for a long time |
| per-turn new tokens are small relative to the prefix | incremental prefill is often not a big prefill |
| long gaps between tool calls | decode is not continuously saturated |
| extremely high prefix reuse | repo, history, and tool traces are heavily reusable |
| contexts reach $100K+$ or beyond | KV duplication cost becomes significant |

DeepSeek-V4 changes the PD problem further: with compressed attention, P and D no longer exchange full dense KV but rather:

$$
\text{Transfer State} = \text{Compressed CSA/HCA KV} + \text{Indexer KV} + \text{SWA State} + \text{Tail State}
$$

The compressed KV may be small, but the SWA/tail state is frequently updated and its restore policy is complex.

---
## 3.3 Past assumptions

| Past assumption | Rationale |
| --- | --- |
| **P/D disaggregation always helps decode latency** | prefill and decode have different compute shapes; isolation avoids interference |
| **P produces KV, D consumes KV** | simple request lifecycle: one prefill, one stretch of decode |
| **KV duplication is a necessary cost** | both P and D may hold the same prefix KV |
| **Routing is mainly load-based** | pick an idle P and an idle D |
| **KV transfer is a one-time cost** | transfer KV after prefill, then mostly decode |
| **Sessions need no strong ownership** | requests are weakly related; cache locality is not dominant |

---

## 3.4 Assumptions that may no longer hold

| Broken assumption | Reason |
| --- | --- |
| **Static P/D separation is always best** | agentic decode has tool gaps; prefill may not get much chance to disturb decode |
| **Full KV transfer is acceptable** | long-context prefixes are large; repeated transfer/duplication burns HBM and bandwidth |
| **KV is request-local** | in multi-turn agentic sessions, KV is long-lived session state |
| **Routing only looks at queue length** | prefix artifact locality can matter more than queue length |
| **Prefill is a single large one-shot task** | coding agents issue many small incremental prefills |
| **Reuse happens only inside the GPU** | DeepSeek-V4 already folds on-disk shared-prefix reuse into the inference framework |

---
## 3.5 New assumptions to establish

### Hypothesis B1: long-context agent serving should schedule around KV ownership, not around requests

Traditional:

$$
\text{schedule}(\text{request})
$$

New paradigm:

$$
\text{schedule}(\text{session},\ \text{KV\_owner},\ \text{artifact\_location},\ \text{restore\_cost})
$$

In other words: **who owns the prefix state matters more than which worker is currently the most idle.**

---
### Hypothesis B2: P/D separation should be a dynamic decision, not a static architecture

Action set:

$$
a \in \{ \text{colocated-prefill},\ \text{remote-prefill},\ \text{decode-continuation},\ \text{artifact-fetch},\ \text{tail-recompute},\ \text{state-migration} \}
$$

Scheduling objective:

$$
a^* = \arg\min_a \left( T_{\text{queue}} + T_{\text{prefill}} + T_{\text{transfer}} + T_{\text{restore}} + \lambda M_{\text{duplicate}} + \mu I_{\text{decode}} \right)
$$

where $I_{\text{decode}}$ is the interference prefill inflicts on requests currently decoding.

---
### Hypothesis B3: single-owner KV artifacts can reduce HBM duplication while preserving prefix reuse

Core idea:

> For a long-lived prefix, the system maintains exactly one authoritative owner; other nodes hold only transient read caches or recover the state via recompute/fetch.

Formally:

$$
\forall p,\quad |\text{Owner}(p)| = 1
$$

while allowing:

$$
|\text{ReplicaCache}(p)| \ge 0
$$

i.e., the owner is unique and cache replicas are kept under control.

---
### Hypothesis B4: tail/SWA state is often not worth transferring; recompute is frequently cheaper

Especially under compressed attention, bulk compressed KV is reusable as an artifact, but tail state and SWA state are often better handled as follows:

| State | Recommended policy |
| --- | --- |
| compressed CSA/HCA KV | store / transfer / reuse |
| indexer KV | managed alongside the compressed KV |
| incomplete tail | recompute in most cases |
| SWA state | keep for hot sessions; checkpoint or recompute for cold sessions |
| dense decode continuation KV | keep on the D owner whenever possible |

---
## 3.6 Design

## 3.6.1 System overview

```text
Global Router
      ↓
Ownership Directory
      ↓
Cost-Based PD-Hybrid Scheduler
      ↓
P Workers / D Workers / Hybrid Workers
      ↓
KV Artifact Store + HeteroCache
```

---
## 3.6.2 Core concepts

### KV Artifact

An artifact is immutable prefix state:

$$
A_p = (\text{prefix\_hash},\ \text{block\_range},\ \text{model\_version},\ \text{kv\_type},\ \text{location})
$$

### KV Owner

The owner is the artifact's authoritative holder:

$$
\text{Owner}(A_p)=w_i
$$

### Session State

Session state is the mutable continuation:

$$
S = (A_{\text{stable prefix}},\; \text{SWA state},\; \text{tail state},\; \text{decode state})
$$

The distinction to draw:

| State | Property |
| --- | --- |
| stable prefix artifact | immutable, shareable |
| tail state | mutable, short-lived |
| SWA state | mutable, but checkpointable |
| decode state | strongly locality-sensitive |

---
## 3.6.3 Ownership Directory

Maintains:

| Key | Value |
| --- | --- |
| `prefix_hash` | owner worker |
| `artifact_id` | location list |
| `session_id` | current D owner |
| `repo_id / doc_id` | hot prefix group |
| `lease_expiry` | owner lease |
| `ref_count` | number of forked sessions |
| `last_access` | eviction and migration |

Must support:

```text
lookup(prefix_hash)
claim_owner(prefix_hash, worker)
renew_lease(prefix_hash, worker)
release(prefix_hash)
replicate_readonly(prefix_hash, target_worker)
```

---
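A minimal in-memory sketch of the directory API above, with lease-based single ownership (in practice Redis or a replicated store would back this; the class and timings here are illustrative):

```python
import time

class OwnershipDirectory:
    """One authoritative owner per prefix; leases let a crashed worker's
    artifacts be re-claimed after expiry instead of staying orphaned."""
    def __init__(self, lease_seconds: float = 30.0):
        self.lease_seconds = lease_seconds
        self._owners = {}    # prefix_hash -> (worker, lease_expiry)
        self._replicas = {}  # prefix_hash -> set of read-only holders

    def lookup(self, prefix_hash):
        entry = self._owners.get(prefix_hash)
        if entry and entry[1] > time.monotonic():
            return entry[0]
        return None          # no owner, or the lease expired

    def claim_owner(self, prefix_hash, worker) -> bool:
        if self.lookup(prefix_hash) not in (None, worker):
            return False     # someone else holds a live lease
        self._owners[prefix_hash] = (worker, time.monotonic() + self.lease_seconds)
        return True

    def renew_lease(self, prefix_hash, worker) -> bool:
        if self.lookup(prefix_hash) != worker:
            return False
        self._owners[prefix_hash] = (worker, time.monotonic() + self.lease_seconds)
        return True

    def release(self, prefix_hash):
        self._owners.pop(prefix_hash, None)

    def replicate_readonly(self, prefix_hash, target_worker):
        # Replicas are advisory: they never become authoritative.
        self._replicas.setdefault(prefix_hash, set()).add(target_worker)
```

The invariant |Owner(p)| = 1 is enforced by `claim_owner` rejecting claims against a live lease; replicas model the |ReplicaCache(p)| ≥ 0 side.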
## 3.6.4 PD-Hybrid Scheduler

For each agent step, the scheduler first classifies the request:

| Request type | Example | Default policy |
| --- | --- | --- |
| **New long prefix** | first load of a repo/doc | remote P or dedicated prefill |
| **Small incremental turn** | tool result + short instruction | colocated prefill on D |
| **Long tool output** | test log, large file diff | remote P or hybrid |
| **Decode continuation** | normal generation | stay on the D owner |
| **Forked rollout** | $n$ samples / RL rollout | immutable prefix artifact + many D readers |
| **Cold resume** | idle session returns | load compressed artifact + recompute tail/SWA |

---
## 3.6.5 Decision model

For a session step, the cost of a candidate action $a$:

$$
C(a) = T_{\text{queue}}(a) + T_{\text{compute}}(a) + T_{\text{network}}(a) + T_{\text{restore}}(a) + \lambda \cdot M_{\text{HBM}}(a) + \mu \cdot I_{\text{decode}}(a)
$$

Selection:

$$
a^* = \arg\min_a C(a)
$$

where:

| Term | Description |
| --- | --- |
| $T_{\text{queue}}$ | current queueing time on the worker |
| $T_{\text{compute}}$ | prefill/decode compute time |
| $T_{\text{network}}$ | KV artifact transfer time |
| $T_{\text{restore}}$ | tail/SWA recompute or checkpoint load |
| $M_{\text{HBM}}$ | extra HBM footprint |
| $I_{\text{decode}}$ | prefill interference on colocated decode |

---
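The decision model reduces to a weighted arg-min; a sketch (action names and the cost numbers are illustrative; $\lambda$ and $\mu$ appear as `hbm_weight` and `interference_weight`):

```python
def choose_action(actions: dict, hbm_weight: float = 0.05,
                  interference_weight: float = 2.0) -> str:
    """a* = argmin_a C(a), with
    C(a) = T_queue + T_compute + T_network + T_restore
           + hbm_weight * M_HBM + interference_weight * I_decode."""
    def cost(a: str) -> float:
        c = actions[a]
        return (c["t_queue"] + c["t_compute"] + c["t_network"]
                + c["t_restore"]
                + hbm_weight * c["m_hbm_gb"]
                + interference_weight * c["i_decode"])
    return min(actions, key=cost)

# A small incremental turn: the prefix KV already lives on the D owner,
# so colocated prefill avoids transfer at the price of mild interference.
best = choose_action({
    "colocated-prefill": {"t_queue": 0.0, "t_compute": 0.2, "t_network": 0.0,
                          "t_restore": 0.0, "m_hbm_gb": 0.5, "i_decode": 0.05},
    "remote-prefill":    {"t_queue": 0.1, "t_compute": 0.2, "t_network": 0.6,
                          "t_restore": 0.1, "m_hbm_gb": 3.0, "i_decode": 0.0},
})
```

Tuning μ upward pushes the scheduler back toward static PD behavior, which makes the static/dynamic trade-off an explicit knob rather than an architectural commitment.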
## 3.6.6 Single-owner mechanism

### Rule 1: stable prefix artifacts are immutable

Once a prefix reaches a compression block boundary, seal it into an artifact:

$$
A_{0:k} = \mathrm{seal}(S_{0:k})
$$

After sealing it cannot be modified; new artifacts can only be appended:

$$
A_{0:k} \rightarrow A_{0:k} + A_{k:k+\Delta}
$$

---

### Rule 2: the mutable tail belongs only to the session owner

Tail state is not globally shared:

$$
\text{Owner}(\text{tail}_s)=\text{D-owner}(s)
$$

Unless the tail is large, recompute it on migration rather than transferring it.

---

### Rule 3: forked sessions use copy-on-write

For agent/RL branching:

$$
S_i = A_{\text{shared prefix}} + \Delta_i
$$

All branches share the prefix artifact and maintain their own tail/SWA/decode state.

---

### Rule 4: owner lease + failure recovery

Owners hold leases:

$$
\text{lease}(A_p,w_i,t_{\text{expire}})
$$

After a worker crash, the directory reassigns the owner; if the artifact has an SSD/CPU replica, recover from it; otherwise recompute from the token log.

---
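Rule 3's copy-on-write fork can be sketched as follows (a hypothetical `ForkedSession`; sealed artifacts are shared by reference, and only the mutable delta is copied at fork time):

```python
class ForkedSession:
    """Branches share the immutable sealed-prefix artifact list by reference
    and keep only their own mutable continuation (tail/decode delta)."""
    def __init__(self, sealed_artifacts, delta=None):
        self.sealed = sealed_artifacts        # never mutated after seal
        self.delta = [] if delta is None else delta

    def fork(self):
        # Share `sealed` by reference; snapshot this branch's private delta.
        return ForkedSession(self.sealed, list(self.delta))

    def append(self, token):
        self.delta.append(token)              # mutates only this branch
```

All branches of a rollout then point at one physical prefix artifact, which is exactly the $S_i = A_{\text{shared prefix}} + \Delta_i$ decomposition above.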
## 3.7 Executable implementation plan

### Phase B0: trace-driven simulator

Do not deep-modify a serving engine right away. First use traces and a cost model to validate whether single-owner has real upside.

| Input | Content |
| --- | --- |
| agent trace | session id, turn id, input length, output length, tool latency |
| prefix relation | how many tokens each turn shares with the previous one |
| worker config | number of P/D/hybrid workers, GPU count, network bandwidth |
| KV model | dense KV or hybrid KV |
| scheduling policy | static PD, colocated, OwnerServe |

Output:

| Metric | Meaning |
| --- | --- |
| duplicate HBM | how many copies of the same prefix exist |
| network traffic | total P→D KV transfer |
| E2E latency | completion time of each agent task |
| decode interference | impact of colocated prefill on decode |
| owner hit rate | fraction of requests routed to the KV owner |

---
### Phase B1: xPyD prototype on a single $8\times$ H20 machine

This aligns closely with your current direction.

Experimental setup:

| Item | Suggestion |
| --- | --- |
| serving backend | SGLang xPyD or a custom proxy |
| hardware | single machine, $8\times$ H20 |
| constraint | $x+y\le 8$ |
| P→D link | simulate RDMA loopback even locally |
| workload | SWE-bench generated + internal agent traces |
| model | start with Qwen3-Coder-30B-A3B or a similarly runnable model |

Implementation modules:

| Module | MVP |
| --- | --- |
| Global router | Python/Rust proxy |
| Ownership directory | Redis / in-memory map |
| Prefix hash | token-level rolling hash |
| KV ownership | simulate with logical ownership first |
| Transfer planner | record actual or estimated KV bytes |
| Colocated fallback | prefill small incremental turns directly on D |

---
### Phase B2: integrate HeteroCache

Once Project A has a prototype, B upgrades from dense KV ownership to heterogeneous artifact ownership:

| Artifact Type | Owner policy |
| --- | --- |
| dense KV | D owner preferred |
| compressed CSA/HCA | artifact owner, shareable across D |
| SWA checkpoint | session owner or SSD |
| tail state | session-local, recompute preferred |
| indexer KV | follows the compressed artifact |

---
### Phase B3: real agentic evaluation

Workloads:

| Workload | Value |
| --- | --- |
| SWE-bench generated | reproducible, publishable |
| repo-level coding agent traces | long prefixes + multi-turn tools |
| synthetic forked rollout | tests shared-prefix branching |
| long-doc QA multi-turn | tests on-disk prefix reuse |
| internal Ali traces | industrial realism |

---
## 3.8 Baselines

| Baseline | Description |
| --- | --- |
| **Static PD** | fixed P and D nodes; P transfers KV after prefill |
| **Colocated serving** | no P/D separation; everything on one worker |
| **Round-robin PD** | ignores KV locality |
| **Prefix-aware but multi-owner** | prefix cache hits, but multiple replicas allowed |
| **OwnerServe** | single-owner + cost-based PD hybrid |
| **Oracle** | optimal scheduling with knowledge of future requests, as an upper bound |

---
## 3.9 Metrics

| Metric | Why it matters |
| --- | --- |
| **E2E task time** | the primary metric for agentic workloads |
| **p95/p99 step latency** | per-turn user perception |
| **HBM duplicate factor** | measures KV waste |
| **network KV traffic** | P→D / remote fetch cost |
| **prefix owner hit rate** | effectiveness of ownership routing |
| **decode interference** | whether hybrid colocation hurts decode |
| **tool idle utilization** | whether GPUs are better used during tool gaps |
| **successful trajectory throughput** | successful agent tasks completed per GPU |

---
## 3.10 Expected benefits

| Benefit | Expected direction |
| --- | --- |
| **Lower KV duplication** | long prefixes no longer replicated long-term across P/D |
| **Lower P→D traffic** | transfer delta artifacts instead of full KV each time |
| **Lower E2E time** | colocate small incremental turns, avoiding remote prefill + transfer |
| **More concurrent sessions** | less HBM consumed by prefix duplication |
| **Better fit for forked rollouts** | branches share immutable prefix artifacts |
| **Clearer agentic serving trade-offs** | shift from TTFT/TPOT to E2E + ownership + artifact locality |

Conservatively, the most attainable hard win from this project is not "large single-step latency drops" but:

$$
\text{same GPUs} \Rightarrow \text{more long-context sessions}
$$

and:

$$
\text{same E2E target} \Rightarrow \text{less KV transfer and duplication}
$$

---
## 3.11 Risks and mitigations

| Risk | Mitigation |
| --- | --- |
| single-owner may create hotspots | allow read-only replica caches, with a unique authoritative owner |
| colocated prefill may disturb decode | the cost model explicitly includes $I_{\text{decode}}$ |
| KV ownership is hard to retrofit into existing engines | logical ownership + simulator first, then real KV hooks incrementally |
| gains on dense models are smaller than on hybrid models | first prove agentic prefix reuse and PD duplication; later amplify via HeteroCache |
| reviewers may see it as a router heuristic | stress the ownership abstraction, artifact lifecycle, cost model, and trace-driven evidence |

---
# 4. Combined architecture of the two projects

The final system can look like this:

```text
Agent Request / Tool Result / Session Resume
                 ↓
           Session Router
                 ↓
        Ownership Directory
                 ↓
       PD-Hybrid Cost Planner
                 ↓
 ┌───────────────┴────────────────┐
 │                                │
P/Hybrid Worker            D/Hybrid Worker
 │                                │
 └───────────────┬────────────────┘
                 ↓
        HeteroCache Manager
                 ↓
GPU HBM / CPU DRAM / SSD Artifact Store
```

Project A provides:

$$
\text{what state exists and how to restore it}
$$

Project B decides:

$$
\text{where the state should live and where the request should run}
$$

---
# 5. Suggested paper framing

## 5.1 Paper title direction for Project A

**HeteroCache: Managing Heterogeneous KV States for Hybrid-Attention Long-Context LLM Serving**

### Core contributions

| Contribution | Description |
| --- | --- |
| **Observation** | hybrid attention breaks the homogeneous KV cache assumption |
| **Abstraction** | KV state type registry + restore-cost-based prefix reuse |
| **System** | GPU/CPU/SSD/recompute-aware heterogeneous KV manager |
| **Evaluation** | long-context traces + shared-prefix workloads |
| **Result** | higher session concurrency, lower restore latency, less redundant prefill |

---

## 5.2 Paper title direction for Project B

**OwnerServe: Single-Owner KV Artifact Scheduling for Agentic PD-Hybrid LLM Serving**

### Core contributions

| Contribution | Description |
| --- | --- |
| **Observation** | static PD separation duplicates long-lived agentic KV state |
| **Abstraction** | KV artifact ownership and session-state lifecycle |
| **System** | ownership-aware router + PD-hybrid scheduler |
| **Policy** | recompute/fetch/transfer/colocate cost model |
| **Evaluation** | coding-agent traces, forked rollouts, long-prefix sessions |

---
# 6. My recommendation: build A first, then make B the killer application

## 6.1 Why A is more fundamental

A addresses a problem that is becoming universal:

> Long-context model architectures are breaking the existing KV cache abstraction.

As long as future models increasingly adopt CSA/HCA/SWA/MLA/sliding/dilated/sparse/hybrid attention, A's problem persists.

A's advantages:

| Dimension | Assessment |
| --- | --- |
| **Independence** | no hard dependency on PD |
| **Clear systems contribution** | a new cache abstraction |
| **Incremental implementability** | simulator → external manager → engine integration |
| **Fits SOSP/OSDI framing** | challenge an old abstraction, propose a new system primitive |

---

## 6.2 Why B is the stronger application scenario

B is closer to the **agentic workload + PD hybrid + KV ownership** themes you have consistently cared about.

But B's difficulties:

| Difficulty | Description |
| --- | --- |
| needs real agent traces | otherwise the gains look synthetic |
| needs a PD prototype | higher implementation complexity |
| must show decode is unharmed | prefill/decode interference needs careful measurement |
| needs to combine with A | ownership over dense KV alone is not a novel enough story |

So my recommended route is:

```text
Stage 1: HeteroCache simulator + trace evidence
Stage 2: HeteroCache prototype
Stage 3: OwnerServe simulator
Stage 4: xPyD OwnerServe prototype
Stage 5: combine A+B into agentic long-context serving paper
```

---
# 7. Final project summaries

## 7.1 Project A summary

**HeteroCache** argues that the KV cache of long-context hybrid-attention models has shifted from homogeneous dense blocks to heterogeneous KV state. It proposes a model-aware KV state abstraction that brings CSA/HCA compressed KV, indexer KV, SWA KV, and tail state under joint GPU/CPU/SSD/recompute management, replacing hit/miss with restore cost as the core decision metric for prefix reuse.

---

## 7.2 Project B summary

**OwnerServe** argues that the core bottleneck of agentic long-context serving is not the prefill/decode of a single request but the ownership, duplication, migration, and reuse of long-lived session KV state. It proposes a single-owner KV artifact abstraction and a cost-based PD-hybrid scheduler that dynamically chooses among colocated prefill, remote prefill, artifact fetch, tail recompute, and state migration, reducing HBM duplication and P→D traffic while optimizing E2E agent task time.

---

## 7.3 The combined thesis

The strongest overall claim:

> **For long-context agentic LLM serving, the primary scheduling object should shift from requests to KV artifacts.**

That is:

$$
\text{Request-centric serving} \quad\Longrightarrow\quad \text{KV-artifact-centric serving}
$$

This framing fits your current research line well: it naturally connects DeepSeek-V4's million-token hybrid attention, the PD hybrid you care about, KVCache ownership, agentic E2E latency, and trace-driven reproducible evaluation.