xserv/docs/benchmarks/phase8-gpt2-baseline.md
Gahow Wang cb12250ef0 phase 8: add benchmark framework + baseline results
- bench-gpt2 binary: runs 50 prompts, measures TTFT/TBT per prompt, outputs JSON
- bench_compare.py: compares xserv vs transformers token-by-token + timing
- Baseline results: 50/50 correctness, 400ms TTFT / 407ms TBT (100x slower than PyTorch)
- Bottlenecks documented: no KV cache, CPU round-trips, cuBLAS handle churn

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-05-21 23:29:41 +08:00


Phase 8 Benchmark: GPT-2 124M Baseline

Date: 2026-05-21
Hardware: RTX 5090 (32 GB, CC 12.0, 170 SMs)
Model: GPT-2 124M (FP32)
Config: 50 prompts × 20 generated tokens, greedy decoding, no KV cache

Correctness

| Metric | Result |
| --- | --- |
| Prompts tested | 50 |
| Token-level match vs transformers | 50/50 (100.0%) |
| Mismatches | 0 |
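The token-level check can be sketched as follows. This is a minimal illustration, not the actual `bench_compare.py` logic: the function names `first_mismatch` and `match_rate` are hypothetical, and it assumes a prompt counts as a match only when every generated token ID equals the reference.

```python
from typing import List, Optional


def first_mismatch(a: List[int], b: List[int]) -> Optional[int]:
    """Return the index of the first differing token, or None if identical."""
    for i, (x, y) in enumerate(zip(a, b)):
        if x != y:
            return i
    # One sequence is a strict prefix of the other: they diverge at the shorter length.
    if len(a) != len(b):
        return min(len(a), len(b))
    return None


def match_rate(xserv_runs: List[List[int]], ref_runs: List[List[int]]) -> float:
    """Fraction of prompts whose generated token IDs match the reference exactly."""
    if not ref_runs or len(xserv_runs) != len(ref_runs):
        raise ValueError("both engines must be run on the same non-empty prompt set")
    matches = sum(first_mismatch(a, b) is None for a, b in zip(xserv_runs, ref_runs))
    return matches / len(ref_runs)


# Hypothetical token IDs for two prompts; the second diverges at position 1.
xserv = [[11, 22, 33], [44, 55, 66]]
reference = [[11, 22, 33], [44, 99, 66]]
print(first_mismatch(xserv[1], reference[1]))
print(match_rate(xserv, reference))
```

Reporting the first mismatch index rather than only a pass/fail flag helps when greedy decoding diverges: every token after the first difference is conditioned on a different prefix, so only the first index is informative.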

Performance

| Metric | xserv | transformers (PyTorch) | Ratio |
| --- | --- | --- | --- |
| TTFT (avg) | 400.6 ms | 4.0 ms | 100x slower |
| TBT (avg) | 407.2 ms | 3.8 ms | 106x slower |
| Throughput | 2.5 tok/s | 260 tok/s | 0.01x |
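The throughput and ratio figures can be rechecked from the latencies with a few lines of arithmetic. The sketch below assumes throughput is the steady-state decode rate, 1000 / TBT in ms; the report does not state how it was computed, and the helper names are hypothetical. With the rounded table values the TBT ratio comes out near 107x rather than 106x, which is presumably a rounding effect in the published latencies.

```python
def throughput_tok_per_s(tbt_ms: float) -> float:
    """Decode throughput implied by an average time between tokens (TBT)."""
    return 1000.0 / tbt_ms


def slowdown(xserv_ms: float, ref_ms: float) -> float:
    """How many times slower xserv is than the reference for one latency metric."""
    return xserv_ms / ref_ms


# Rounded values copied from the performance table.
xserv_ttft_ms, ref_ttft_ms = 400.6, 4.0
xserv_tbt_ms, ref_tbt_ms = 407.2, 3.8

print(round(throughput_tok_per_s(xserv_tbt_ms), 1))  # 2.5 tok/s
print(round(throughput_tok_per_s(ref_tbt_ms)))       # 263 tok/s (table rounds to 260)
print(round(slowdown(xserv_ttft_ms, ref_ttft_ms)))   # 100
print(round(slowdown(xserv_tbt_ms, ref_tbt_ms)))     # 107
```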

Known Bottlenecks

  1. No KV cache: every decode step recomputes the full forward pass over all S tokens in the sequence, including O(S²) attention
  2. CPU round-trips: ~100 GPU→CPU→GPU transfers per forward pass, for add/bias/split_qkv/merge_heads
  3. cuBLAS handle per matmul: a handle is created and destroyed around each matmul, ~50 times per forward pass
  4. No kernel fusion: every op is a separate kernel launch followed by a sync
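To give a rough sense of the first bottleneck, the sketch below counts how many token positions pass through the model with and without a KV cache. It is a cost model only, not xserv code: `positions_processed` is a hypothetical helper, and the prompt length of 10 is an assumed value, since the report does not list prompt lengths.

```python
def positions_processed(prompt_len: int, new_tokens: int, kv_cache: bool) -> int:
    """Count token positions pushed through the model to generate `new_tokens` tokens.

    Decode step t (0-based) starts with prompt_len + t tokens of context.
    """
    if new_tokens <= 0:
        return 0
    if kv_cache:
        # Prefill the prompt once, then feed only the newest token on each later step.
        return prompt_len + (new_tokens - 1)
    # Without a cache, every step re-runs the model over the whole sequence so far.
    return sum(prompt_len + t for t in range(new_tokens))


prompt_len = 10   # assumed; not taken from the benchmark
new_tokens = 20   # matches the benchmark config

print(positions_processed(prompt_len, new_tokens, kv_cache=False))  # 390
print(positions_processed(prompt_len, new_tokens, kv_cache=True))   # 29
```

Under these assumptions a cache cuts the positions processed by roughly 13x. The measured TBT being about equal to TTFT is consistent with each decode step costing as much as a full prefill, though the fixed per-op overheads in items 2 to 4 likely contribute as well.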

Tracking

| Phase | TTFT (ms) | TBT (ms) | tok/s | Correctness | Notes |
| --- | --- | --- | --- | --- | --- |
| 8 (baseline) | 400.6 | 407.2 | 2.5 | 50/50 | No KV cache, CPU round-trips |