# Phase 11: GPU-Resident KV Cache — Design Document

> **Note**: The original plan was "Paged Attention + KV Cache Manager". What was actually implemented is a contiguous, pre-allocated GPU KV cache (not paged). Paged allocation is left for a later optimization.

## Goal

Move the KV cache from a CPU-side `Vec` to the GPU, eliminating the CPU round-trip on every decode step (currently one of the largest performance bottlenecks in the KV cache path).

## Current Problem

The KV cache path on each decode step:
```
GPU tensor (K_new) → CPU (per-head Vec append) → reconstruct → CPU tensor → GPU tensor
```
This involves 2 GPU↔CPU copies × 36 layers × 2 (K, V) = 144 transfers per token.

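The transfer count above can be reproduced with a small helper. This is only a sketch of the arithmetic; the function name and parameter names are hypothetical and do not come from the project.

```rust
/// Host<->device copies per generated token on the CPU round-trip path.
///
/// `tensors_per_layer` is 2 (K and V); `copies_per_tensor` is 2
/// (GPU -> CPU for the append, then CPU -> GPU for the rebuilt tensor).
/// Hypothetical helper, used only to illustrate the count.
pub fn transfers_per_token(
    num_layers: usize,
    tensors_per_layer: usize,
    copies_per_tensor: usize,
) -> usize {
    num_layers * tensors_per_layer * copies_per_tensor
}
```

Calling `transfers_per_token(36, 2, 2)` gives the 144 transfers per token quoted above.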
## Target Design

The KV cache lives directly on the GPU, and decode performs only a GPU→GPU append:
```
GPU tensor (K_new) → GPU KV cache (in-place append, no CPU)
```

## Implementation Plan

### GPU KV Cache (Simplified, Non-Paged)

First implement a contiguously allocated GPU KV cache (pre-allocated to `max_seq_len`) to eliminate the CPU round-trip. Paged allocation is left for a later optimization.

```rust
pub struct GpuKVCache {
    // Pre-allocated on the GPU; logical layout:
    // [num_layers, 2, num_kv_heads, max_seq_len, head_dim]
    k_caches: Vec<Tensor>, // per layer: [1, num_kv_heads, max_seq_len, head_dim]
    v_caches: Vec<Tensor>, // per layer, same shape as k_caches
    seq_len: usize,        // number of positions filled so far
    max_seq_len: usize,    // capacity along the sequence dimension
}
```

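Because the cache is pre-allocated to `max_seq_len`, its memory footprint is fixed up front regardless of how many positions are filled. A rough sketch of that sizing follows. `KvCacheShape` and its methods are hypothetical names, and every value in the usage note other than the 36 layers is an assumption chosen for illustration.

```rust
/// Shape parameters for a contiguous, pre-allocated KV cache.
/// Hypothetical helper type; not part of the project's API.
#[derive(Clone, Copy, Debug)]
pub struct KvCacheShape {
    pub num_layers: usize,
    pub num_kv_heads: usize,
    pub max_seq_len: usize,
    pub head_dim: usize,
}

impl KvCacheShape {
    /// Elements in one per-layer K (or V) tensor:
    /// [1, num_kv_heads, max_seq_len, head_dim].
    pub fn elems_per_layer_tensor(&self) -> usize {
        self.num_kv_heads * self.max_seq_len * self.head_dim
    }

    /// Total bytes reserved for K and V across all layers.
    pub fn total_bytes(&self, bytes_per_elem: usize) -> usize {
        self.num_layers * 2 * self.elems_per_layer_tensor() * bytes_per_elem
    }
}
```

For example, assuming 36 layers, 8 KV heads, `max_seq_len = 4096`, `head_dim = 128`, and 2-byte elements, `total_bytes(2)` returns 603,979,776 bytes (576 MiB). That cost is paid even for short sequences, which is the trade-off a paged allocator would later address.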
### Append Operation

Use a device-to-device (D2D) `cudaMemcpy` to write the new K/V into the cache at the correct offset:
```
k_cache[layer][0, :, seq_len:seq_len+new, :] = k_new[0, :, :, :]
```

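The offset arithmetic behind this append can be sketched on the host. The code below uses a plain `Vec<f32>` as a stand-in for one layer's GPU buffer, so `LayerCache` and `append` are hypothetical names rather than the project's API. It assumes a row-major `[num_kv_heads, max_seq_len, head_dim]` layout (batch dimension of 1 dropped). Under that layout the destination region is contiguous within a head but not across heads, so the sketch issues one copy per head. On the GPU each of these would be a D2D copy; a single pitched copy such as `cudaMemcpy2D` may be able to cover all heads at once.

```rust
/// Host-side stand-in for one layer's K (or V) cache, row-major
/// [num_kv_heads, max_seq_len, head_dim]. Hypothetical type for illustration.
pub struct LayerCache {
    pub data: Vec<f32>,
    pub num_kv_heads: usize,
    pub max_seq_len: usize,
    pub head_dim: usize,
    pub seq_len: usize, // positions filled so far
}

impl LayerCache {
    pub fn new(num_kv_heads: usize, max_seq_len: usize, head_dim: usize) -> Self {
        Self {
            data: vec![0.0; num_kv_heads * max_seq_len * head_dim],
            num_kv_heads,
            max_seq_len,
            head_dim,
            seq_len: 0,
        }
    }

    /// Append `new_tokens` positions from `k_new`, laid out as
    /// [num_kv_heads, new_tokens, head_dim]. One contiguous copy per head.
    pub fn append(&mut self, k_new: &[f32], new_tokens: usize) -> Result<(), String> {
        if k_new.len() != self.num_kv_heads * new_tokens * self.head_dim {
            return Err("k_new has the wrong number of elements".to_string());
        }
        if self.seq_len + new_tokens > self.max_seq_len {
            return Err("append would exceed max_seq_len".to_string());
        }
        let chunk = new_tokens * self.head_dim;
        for h in 0..self.num_kv_heads {
            // Destination: head h, starting at position seq_len.
            let dst = (h * self.max_seq_len + self.seq_len) * self.head_dim;
            let src = h * chunk;
            self.data[dst..dst + chunk].copy_from_slice(&k_new[src..src + chunk]);
        }
        self.seq_len += new_tokens;
        Ok(())
    }
}
```

In the `GpuKVCache` design, `seq_len` is shared across layers, so a real implementation would presumably advance it once after all layers have appended rather than per layer as in this sketch.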
### Read Operation

No copy is needed: return a view/slice of `[0, :, 0:seq_len, :]` directly as a GPU tensor.

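A view of this kind keeps the original strides and only shrinks the shape along the sequence dimension. The following host-side sketch shows the idea; `CacheView` is a hypothetical type, and the project's `Tensor` slicing API may look quite different. One consequence worth noting: with more than one KV head and `seq_len < max_seq_len`, the view is not contiguous, so any kernel consuming it has to honor the strides instead of assuming a dense layout.

```rust
/// Non-owning view over the filled prefix of a buffer laid out as
/// [num_kv_heads, max_seq_len, head_dim]. Hypothetical type for illustration.
pub struct CacheView<'a> {
    data: &'a [f32],
    pub shape: [usize; 3],   // [num_kv_heads, seq_len, head_dim]
    pub strides: [usize; 3], // in elements, inherited from the full buffer
}

impl<'a> CacheView<'a> {
    pub fn new(
        data: &'a [f32],
        num_kv_heads: usize,
        max_seq_len: usize,
        head_dim: usize,
        seq_len: usize,
    ) -> Option<Self> {
        if seq_len > max_seq_len || data.len() != num_kv_heads * max_seq_len * head_dim {
            return None;
        }
        Some(Self {
            data,
            shape: [num_kv_heads, seq_len, head_dim],
            strides: [max_seq_len * head_dim, head_dim, 1],
        })
    }

    /// Bounds-checked element access through the strides.
    pub fn get(&self, head: usize, pos: usize, d: usize) -> Option<f32> {
        if head >= self.shape[0] || pos >= self.shape[1] || d >= self.shape[2] {
            return None;
        }
        let idx = head * self.strides[0] + pos * self.strides[1] + d * self.strides[2];
        Some(self.data[idx])
    }

    /// True only when the viewed elements occupy memory without gaps.
    pub fn is_contiguous(&self) -> bool {
        self.shape[0] <= 1 || self.shape[1] * self.shape[2] == self.strides[0]
    }
}
```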
## Required New Functionality

1. Tensor slice support (a view into a sub-range of one dimension)
2. GPU D2D copy at an offset (writing into a given position in the cache)
3. Remove the CPU round-trip KV cache path from the Qwen3/GPT-2 forward passes

## Test Plan

- [ ] GPU KV cache output is bit-identical to the CPU KV cache output
- [ ] Benchmark: TBT (time between tokens) should drop significantly, since the 144 CPU round-trips are eliminated
- [ ] 50-prompt correctness re-validation
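
For the bit-identical check, comparing `f32` values with `==` is not quite the same thing: it treats `0.0` and `-0.0` as equal and never considers NaN equal to itself. A minimal sketch of a stricter comparison on raw bit patterns follows, assuming both outputs have already been copied back to host slices. The function name is hypothetical.

```rust
/// Returns the index of the first element whose bit pattern differs,
/// or `None` if the two slices are bit-identical.
/// A length mismatch is reported at the end of the shorter slice.
pub fn first_bit_mismatch(a: &[f32], b: &[f32]) -> Option<usize> {
    if a.len() != b.len() {
        return Some(a.len().min(b.len()));
    }
    a.iter()
        .zip(b)
        .position(|(x, y)| x.to_bits() != y.to_bits())
}
```

Returning the mismatch index rather than a plain `bool` makes it easier to map a failure back to a specific layer, head, and position.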