6cc1c9332d98428b9da3639206b1f79362c6479b
Phase 14 (Flash Attention): - Design doc: FA2 algorithm, SM120 hardware constraints (FA4 incompatible), kernel config (BR=BC=64, 32KB smem), GQA mapping, causal tile-skip, known limitations and optimization roadmap - Benchmark doc: correctness (9/10 top-1 match, identical to pre-FA baseline), performance tracking (6.9→10.3→12.9 tok/s across phases), memory savings analysis, remaining bottleneck breakdown Phase 11 doc: title corrected from "Paged Attention" to "GPU-Resident KV Cache" with explicit note that paged allocation was not implemented. Phase 12 doc: "当前状态" updated from "未实现" to reflect actual state — iteration-level scheduling implemented + verified (6.0x concurrent speedup), batched GPU forward explicitly marked as not yet implemented. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Description
No description provided
Languages
Rust
67.5%
Python
15.1%
Cuda
13.5%
Shell
3.9%