xserv

Gahow Wang 6cc1c9332d docs: Phase 14 design doc + benchmark, fix Phase 11/12 honesty

Phase 14 (Flash Attention):
- Design doc: FA2 algorithm, SM120 hardware constraints (FA4 incompatible),
  kernel config (BR=BC=64, 32KB smem), GQA mapping, causal tile-skip,
  known limitations and optimization roadmap
- Benchmark doc: correctness (9/10 top-1 match, identical to pre-FA baseline),
  performance tracking (6.9→10.3→12.9 tok/s across phases), memory savings
  analysis, remaining bottleneck breakdown

Phase 11 doc: title corrected from "Paged Attention" to "GPU-Resident KV Cache"
with explicit note that paged allocation was not implemented.

Phase 12 doc: "当前状态" ("Current status") updated from "未实现" ("Not implemented") to reflect actual state:
iteration-level scheduling implemented + verified (6.0x concurrent speedup),
batched GPU forward explicitly marked as not yet implemented.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

2026-05-22 18:51:29 +08:00

crates

phase 14: Flash Attention 2 for SM120 (RTX 5090)

2026-05-22 18:27:39 +08:00

csrc

phase 14: Flash Attention 2 for SM120 (RTX 5090)

2026-05-22 18:27:39 +08:00

docs

docs: Phase 14 design doc + benchmark, fix Phase 11/12 honesty

2026-05-22 18:51:29 +08:00

tools

fix: comprehensive review + 14 bug fixes + Phase 12/14 overhaul