xserv

Go to file

Gahow Wang 64084d3489 phase 9: KV cache + autoregressive generation

- KVCache: per-layer, per-head storage with append + reconstruct
- forward_with_cache: prefill (full prompt) + decode (single token) modes
- Fixed data layout bug: per-head vectors avoid cross-head interleaving
- CLI updated to use KV cache by default
- bench-gpt2 supports --no-cache flag for comparison

Benchmark results (50 prompts × 20 tokens):
- KV cache vs no-cache: 50/50 bit-identical (cache is correct)
- 18x speedup: TTFT 400→24ms, TBT 407→22ms, throughput 2.5→44 tok/s
- vs HF transformers: 40/50 match (10 are FP divergence, avg logit gap 0.20)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

2026-05-21 23:39:41 +08:00

crates

phase 9: KV cache + autoregressive generation

2026-05-21 23:39:41 +08:00

csrc

phase 5: naive multi-head attention

2026-05-21 21:17:23 +08:00

docs

phase 9: KV cache + autoregressive generation

2026-05-21 23:39:41 +08:00

tools

phase 9: KV cache + autoregressive generation