Files
xserv/docs/benchmarks/phase9-kv-cache.md
Gahow Wang 64084d3489 phase 9: KV cache + autoregressive generation
- KVCache: per-layer, per-head storage with append + reconstruct
- forward_with_cache: prefill (full prompt) + decode (single token) modes
- Fixed data layout bug: per-head vectors avoid cross-head interleaving
- CLI updated to use KV cache by default
- bench-gpt2 supports --no-cache flag for comparison

Benchmark results (50 prompts × 20 tokens):
- KV cache vs no-cache: 50/50 bit-identical (cache is correct)
- 18x speedup: TTFT 400→24ms, TBT 407→22ms, throughput 2.5→44 tok/s
- vs HF transformers: 40/50 match (10 are FP divergence, avg logit gap 0.20)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-05-21 23:39:41 +08:00

1.6 KiB
Raw Blame History

Phase 9 Benchmark: KV Cache

Date: 2026-05-21 Hardware: RTX 5090 (32GB, CC 12.0) Model: GPT-2 124M (FP32) Config: 50 prompts × 20 generated tokens, greedy decoding

Correctness

Metric Result
xserv KV-cache vs xserv no-cache 50/50 (100.0%) — bit-identical
xserv vs HF transformers 40/50 (80.0%)

The 10 mismatches vs HF are floating point divergence (different CUDA kernels, computation order). Logit gap at divergence points: min=0.04, max=0.56, avg=0.20. Not a correctness bug.

Performance

Metric Phase 8 (no cache) Phase 9 (KV cache) Improvement HF transformers
TTFT (avg) 400.6 ms 24.2 ms 16.5x 4.0 ms
TBT (avg) 407.2 ms 22.6 ms 18.0x 3.9 ms
Throughput 2.5 tok/s 44.3 tok/s 17.7x 257.7 tok/s
vs HF ratio 0.01x 0.17x 1.0x

Analysis

KV cache delivers ~18x speedup by eliminating redundant computation:

  • Before: every decode step recomputed all layers for all tokens O(S²)
  • After: decode step only computes 1 new token, reads K/V from cache O(S)

Remaining gap vs HF (~6x slower):

  1. CPU round-trips still present (~100 per forward pass)
  2. cuBLAS handle created per matmul
  3. KV cache stored on CPU (rebuilt as GPU tensor each step)
  4. No kernel fusion

Tracking

Phase TTFT (ms) TBT (ms) tok/s Correctness Notes
8 (baseline) 400.6 407.2 2.5 50/50 vs HF No KV cache
9 (KV cache) 24.2 22.6 44.3 50/50 self-consistent 18x speedup