
Gahow Wang 6cc1c9332d docs: Phase 14 design doc + benchmark, fix Phase 11/12 honesty

Phase 14 (Flash Attention):
- Design doc: FA2 algorithm, SM120 hardware constraints (FA4 incompatible),
  kernel config (BR=BC=64, 32KB smem), GQA mapping, causal tile-skip,
  known limitations and optimization roadmap
- Benchmark doc: correctness (9/10 top-1 match, identical to pre-FA baseline),
  performance tracking (6.9→10.3→12.9 tok/s across phases), memory savings
  analysis, remaining bottleneck breakdown

Phase 11 doc: title corrected from "Paged Attention" to "GPU-Resident KV Cache"
with explicit note that paged allocation was not implemented.

Phase 12 doc: "当前状态" ("Current status") updated from "未实现" ("not implemented") to reflect actual state:
iteration-level scheduling implemented + verified (6.0x concurrent speedup),
batched GPU forward explicitly marked as not yet implemented.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

2026-05-22 18:51:29 +08:00


Phase 11: GPU-Resident KV Cache — Design Document

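Note: the original plan was "Paged Attention + KV Cache Manager"; what was actually implemented is a contiguous, preallocated GPU KV cache (not paged). Paged allocation is deferred to a later optimization.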

Goal

Move the KV cache from CPU Vec storage to the GPU, eliminating the CPU round-trip on every decode step (currently one of the largest performance bottlenecks in the KV cache path).

Current Problem

The KV cache path on each decode step:

GPU tensor (K_new) → CPU (per-head Vec append) → reconstruct → CPU tensor → GPU tensor

This involves 2 GPU↔CPU copies × 36 layers × 2 (K, V) = 144 transfers per token.
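The count above can be reproduced with a small helper. This is only an illustrative sketch; the function and parameter names are hypothetical and do not come from the codebase.

```rust
/// Number of host<->device transfers per decoded token on the CPU-side
/// KV cache path. Names are illustrative, not taken from the project.
pub fn transfers_per_token(
    copies_per_tensor: usize, // GPU->CPU plus CPU->GPU = 2
    num_layers: usize,        // 36 in this document
    tensors_per_layer: usize, // K and V = 2
) -> usize {
    copies_per_tensor * num_layers * tensors_per_layer
}
```

Calling `transfers_per_token(2, 36, 2)` gives the 144 transfers per token quoted above.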

Target Design

The KV cache lives directly on the GPU, and decode performs only a GPU→GPU append:

GPU tensor (K_new) → GPU KV cache (in-place append, no CPU)

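Implementation Plan

GPU KV Cache (simplified, non-paged)

First implement a contiguously allocated GPU KV cache (preallocated to max_seq_len) to eliminate the CPU round-trip. Paged allocation is deferred to a later optimization.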

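pub struct GpuKVCache {
    // Preallocated on GPU; logical shape overall:
    // [num_layers, 2, num_kv_heads, max_seq_len, head_dim]
    k_caches: Vec<Tensor>,  // one per layer: [1, num_kv_heads, max_seq_len, head_dim]
    v_caches: Vec<Tensor>,  // same per-layer shape as k_caches
    seq_len: usize,         // number of positions filled so far
    max_seq_len: usize,     // preallocated capacity along the sequence dimension
}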
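Because the cache is preallocated to max_seq_len, its memory cost is fixed up front regardless of how many tokens are actually generated. The sketch below estimates that cost; the function name and the example values other than the 36 layers (8 KV heads, head_dim 128, max_seq_len 4096, 2-byte elements) are assumptions chosen for illustration, not figures from this project.

```rust
/// Bytes required to preallocate K and V caches for all layers.
/// Hypothetical helper: the name and parameters are illustrative.
pub fn kv_cache_bytes(
    num_layers: usize,
    num_kv_heads: usize,
    max_seq_len: usize,
    head_dim: usize,
    bytes_per_elem: usize, // e.g. 2 for fp16/bf16, 4 for fp32
) -> usize {
    // One per-layer tensor has shape [1, num_kv_heads, max_seq_len, head_dim].
    let per_tensor = num_kv_heads * max_seq_len * head_dim * bytes_per_elem;
    // Two tensors (K and V) per layer.
    2 * num_layers * per_tensor
}
```

Under those assumed values, `kv_cache_bytes(36, 8, 4096, 128, 2)` is 603,979,776 bytes (576 MiB). This fixed reservation is the cost that paged allocation would later reduce.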

Append Operation

Use a cudaMemcpy device-to-device (D2D) copy to write the new K/V into the cache at the correct offset:

k_cache[layer][0, :, seq_len:seq_len+new, :] = k_new[0, :, :, :]
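One detail worth spelling out: with a row-major [1, num_kv_heads, max_seq_len, head_dim] layout, the destination slice is not a single contiguous range. It is num_kv_heads separate chunks of new × head_dim elements, spaced max_seq_len × head_dim elements apart. On the GPU this would presumably map to one D2D copy per head, or a single pitched copy such as cudaMemcpy2D. The sketch below models the offset arithmetic on a host buffer; `HostKvCache` is a hypothetical stand-in for one layer's K (or V) tensor, not the project's type.

```rust
/// Host-side model of one layer's cache tensor with row-major layout
/// [1, num_kv_heads, max_seq_len, head_dim]. Hypothetical type used only
/// to illustrate offsets; a real implementation would issue D2D copies.
pub struct HostKvCache {
    pub data: Vec<f32>,
    pub num_kv_heads: usize,
    pub max_seq_len: usize,
    pub head_dim: usize,
    pub seq_len: usize,
}

impl HostKvCache {
    pub fn new(num_kv_heads: usize, max_seq_len: usize, head_dim: usize) -> Self {
        Self {
            data: vec![0.0; num_kv_heads * max_seq_len * head_dim],
            num_kv_heads,
            max_seq_len,
            head_dim,
            seq_len: 0,
        }
    }

    /// Appends `new_tokens` positions from `k_new`, which is laid out as
    /// [1, num_kv_heads, new_tokens, head_dim].
    pub fn append(&mut self, k_new: &[f32], new_tokens: usize) -> Result<(), String> {
        if k_new.len() != self.num_kv_heads * new_tokens * self.head_dim {
            return Err("k_new has unexpected length".to_string());
        }
        if self.seq_len + new_tokens > self.max_seq_len {
            return Err("cache capacity exceeded".to_string());
        }
        let chunk = new_tokens * self.head_dim; // contiguous elements per head
        for h in 0..self.num_kv_heads {
            // Destination: head h, starting at position seq_len.
            let dst = (h * self.max_seq_len + self.seq_len) * self.head_dim;
            // Source: head h in the densely packed new tensor.
            let src = h * chunk;
            self.data[dst..dst + chunk].copy_from_slice(&k_new[src..src + chunk]);
        }
        // In this single-tensor model the length advances here; a full cache
        // would advance it once after all layers and both K and V are written.
        self.seq_len += new_tokens;
        Ok(())
    }
}
```

The capacity check matters for a preallocated design: once seq_len reaches max_seq_len there is nowhere to append, so the caller has to stop or evict.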

Read Operation

No copy is needed: return a GPU tensor covering [0, :, 0:seq_len, :] directly via view/slice.
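A view of this kind keeps the original buffer and only changes the shape metadata: the head stride stays at max_seq_len × head_dim while the visible length becomes seq_len. The following host-side sketch shows the idea with a borrowed slice; `KvView` is a hypothetical type for illustration and says nothing about how the project's Tensor implements slicing.

```rust
/// Borrowed, strided view over the filled prefix of a buffer laid out as
/// [1, num_kv_heads, max_seq_len, head_dim]. Hypothetical illustration type.
pub struct KvView<'a> {
    data: &'a [f32],
    pub num_kv_heads: usize,
    pub seq_len: usize,
    pub head_dim: usize,
    head_stride: usize, // elements between consecutive heads in `data`
}

impl<'a> KvView<'a> {
    pub fn new(
        data: &'a [f32],
        num_kv_heads: usize,
        max_seq_len: usize,
        head_dim: usize,
        seq_len: usize,
    ) -> Self {
        assert!(seq_len <= max_seq_len, "view exceeds cache capacity");
        assert_eq!(data.len(), num_kv_heads * max_seq_len * head_dim);
        Self {
            data,
            num_kv_heads,
            seq_len,
            head_dim,
            head_stride: max_seq_len * head_dim,
        }
    }

    /// Returns the filled rows of head `h` without copying.
    pub fn head(&self, h: usize) -> &'a [f32] {
        let start = h * self.head_stride;
        &self.data[start..start + self.seq_len * self.head_dim]
    }
}
```

Note that the resulting view is not contiguous across heads, so any kernel consuming it has to honor the head stride rather than assume a dense [1, num_kv_heads, seq_len, head_dim] tensor.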

Required New Features

Tensor slice support (view into a sub-range of a dimension)
GPU D2D copy at an offset (write into a specified position in the cache)
Remove the CPU round-trip KV cache path from the Qwen3/GPT-2 forward pass

Test Plan

GPU KV cache output is bit-identical to the CPU KV cache output
Benchmark: TBT (time between tokens) should drop significantly, since the 144 GPU↔CPU transfers per token are eliminated
50-prompt correctness re-validation
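For the bit-identical check, comparing floats with `==` is not quite the same as comparing bits: `0.0 == -0.0` is true although the bit patterns differ, and NaN never equals itself. A minimal sketch of a stricter comparison, assuming both outputs have been copied back to host `f32` slices (the function name is hypothetical):

```rust
/// Returns the index of the first element whose bit pattern differs,
/// or None if the two slices are bit-identical. If the lengths differ,
/// the shorter length is reported as the mismatch position.
pub fn first_bit_mismatch(a: &[f32], b: &[f32]) -> Option<usize> {
    if a.len() != b.len() {
        return Some(a.len().min(b.len()));
    }
    a.iter()
        .zip(b.iter())
        .position(|(x, y)| x.to_bits() != y.to_bits())
}
```

A test would then assert that `first_bit_mismatch(gpu_out, cpu_out)` is `None` for each layer's output or for the final logits.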
