xserv

gahow/xserv

Commit Graph

Author	SHA1	Message	Date

Gahow Wang

6cc1c9332d

docs: Phase 14 design doc + benchmark, fix Phase 11/12 honesty

Phase 14 (Flash Attention):
- Design doc: FA2 algorithm, SM120 hardware constraints (FA4 incompatible),
  kernel config (BR=BC=64, 32KB smem), GQA mapping, causal tile-skip,
  known limitations and optimization roadmap
- Benchmark doc: correctness (9/10 top-1 match, identical to pre-FA baseline),
  performance tracking (6.9→10.3→12.9 tok/s across phases), memory savings
  analysis, remaining bottleneck breakdown
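
The GQA mapping and causal tile-skip named in the Phase 14 notes can be illustrated with a small host-side sketch. This is not the xserv kernel: the function names `kv_head_for` and `causal_kv_tiles` are hypothetical, the head counts in the usage note are illustrative, and the sketch assumes prefill with equal query and key lengths, where query row i attends to keys 0..=i. Only BR = BC = 64 comes from the commit message.

```rust
/// Tile sizes quoted in the commit message (BR = BC = 64).
const BR: usize = 64; // query rows per tile
const BC: usize = 64; // key/value columns per tile

/// Maps a query head to the KV head it shares under grouped-query attention.
/// Assumption: num_q_heads is a multiple of num_kv_heads.
fn kv_head_for(q_head: usize, num_q_heads: usize, num_kv_heads: usize) -> usize {
    assert!(num_kv_heads > 0 && num_q_heads % num_kv_heads == 0);
    let group_size = num_q_heads / num_kv_heads;
    q_head / group_size
}

/// Number of KV tiles a query tile has to visit under a causal mask.
/// Tiles that lie entirely above the diagonal are skipped.
/// Assumption: query row i attends to key positions 0..=i.
fn causal_kv_tiles(q_tile: usize, seq_len: usize) -> usize {
    assert!(q_tile * BR < seq_len, "query tile is out of range");
    // Last query row covered by this tile, clamped to the sequence length.
    let last_row = ((q_tile + 1) * BR).min(seq_len) - 1;
    // That row attends to keys 0..=last_row, which span this many KV tiles.
    last_row / BC + 1
}
```

With these definitions, a 256-token prefill has four query tiles that visit 1, 2, 3, and 4 KV tiles respectively, so 10 tile pairs are processed instead of 16. With 32 query heads and 8 KV heads (illustrative values), `kv_head_for(5, 32, 8)` returns 1.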

Phase 11 doc: title corrected from "Paged Attention" to "GPU-Resident KV Cache"
with explicit note that paged allocation was not implemented.

Phase 12 doc: "当前状态" (current status) updated from "未实现" (not implemented) to reflect actual state:
iteration-level scheduling implemented + verified (6.0x concurrent speedup),
batched GPU forward explicitly marked as not yet implemented.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

2026-05-22 18:51:29 +08:00

Gahow Wang

2d48f25e66

phase 11: GPU-resident KV cache

- GpuKVCache: pre-allocated GPU buffers, D2D copy append at offset
- Per-head strided layout [num_kv_heads, max_seq_len, head_dim]
- Fixed critical bug: seq_len must advance AFTER all layers write
  (not inside the loop per-layer)
- GpuBuffer::copy_from_device_at for offset-based D2D copy
- Tensor::from_storage constructor for wrapping raw GPU buffers
- Exported Storage and Dims from xserv-tensor
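
The layout and the seq_len fix listed above can be sketched on the CPU as follows. This is a stand-in that uses a `Vec<f32>` in place of GPU buffers and a slice copy in place of the D2D copy; the type and method names (`LayerCache`, `KvCacheSketch`, `write_layer`, `advance`) are hypothetical and are not the xserv API. It only illustrates the `[num_kv_heads, max_seq_len, head_dim]` offset arithmetic and why the position counter advances once per token, after every layer has written.

```rust
/// CPU stand-in for one layer's K (or V) buffer, laid out as
/// [num_kv_heads, max_seq_len, head_dim].
struct LayerCache {
    data: Vec<f32>,
    num_kv_heads: usize,
    max_seq_len: usize,
    head_dim: usize,
}

impl LayerCache {
    fn new(num_kv_heads: usize, max_seq_len: usize, head_dim: usize) -> Self {
        LayerCache {
            data: vec![0.0; num_kv_heads * max_seq_len * head_dim],
            num_kv_heads,
            max_seq_len,
            head_dim,
        }
    }

    /// Element offset of the vector stored for (head, pos).
    fn offset(&self, head: usize, pos: usize) -> usize {
        (head * self.max_seq_len + pos) * self.head_dim
    }

    /// Writes one token, given as contiguous [num_kv_heads, head_dim],
    /// into position `pos` of every head's strided region.
    fn write_token(&mut self, pos: usize, token: &[f32]) {
        assert!(pos < self.max_seq_len, "cache is full");
        assert_eq!(token.len(), self.num_kv_heads * self.head_dim);
        let dim = self.head_dim;
        for head in 0..self.num_kv_heads {
            let dst = self.offset(head, pos);
            let src = head * dim;
            self.data[dst..dst + dim].copy_from_slice(&token[src..src + dim]);
        }
    }
}

struct KvCacheSketch {
    layers: Vec<LayerCache>,
    seq_len: usize,
}

impl KvCacheSketch {
    fn new(num_layers: usize, num_kv_heads: usize, max_seq_len: usize, head_dim: usize) -> Self {
        let layers = (0..num_layers)
            .map(|_| LayerCache::new(num_kv_heads, max_seq_len, head_dim))
            .collect();
        KvCacheSketch { layers, seq_len: 0 }
    }

    /// Called once per layer during a forward step. Every layer writes at
    /// the same position because seq_len is not touched here.
    fn write_layer(&mut self, layer: usize, token: &[f32]) {
        let pos = self.seq_len;
        self.layers[layer].write_token(pos, token);
    }

    /// Called once per token, after all layers have written.
    fn advance(&mut self) {
        self.seq_len += 1;
    }
}
```

If `advance` were called inside the per-layer loop instead, layer 1 would write at position 1, layer 2 at position 2, and so on, which is the kind of misalignment the bug fix above describes.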

Correctness: GPU KV cache vs CPU KV cache = 50/50 bit-identical
Performance: ~neutral (KV cache was never the main bottleneck —
reshape/merge/transpose CPU round-trips dominate for Qwen3-8B)

TTFT: 122ms, TBT: 142ms, 7.0 tok/s (marginal change from 7.3)
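
As a quick consistency check on the figures above, steady-state decode throughput is roughly the reciprocal of the time between tokens. The helper below is a hypothetical illustration, not part of xserv, and it ignores TTFT.

```rust
/// Steady-state decode throughput implied by a time-between-tokens value.
/// Assumption: TTFT is ignored, so this describes the decode phase only.
fn decode_tokens_per_second(tbt_ms: f64) -> f64 {
    1000.0 / tbt_ms
}
```

For TBT = 142 ms this gives about 7.04 tok/s, consistent with the reported 7.0 tok/s.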

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

2026-05-22 11:50:12 +08:00

2 Commits