
Gahow Wang cb12250ef0 phase 8: add benchmark framework + baseline results

- bench-gpt2 binary: runs 50 prompts, measures TTFT/TBT per prompt, outputs JSON
- bench_compare.py: compares xserv vs transformers token-by-token + timing
- Baseline results: 50/50 correctness, 400ms TTFT / 407ms TBT (100x slower than PyTorch)
- Bottlenecks documented: no KV cache, CPU round-trips, cuBLAS handle churn

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

2026-05-21 23:29:41 +08:00


Phase 8 Benchmark: GPT-2 124M Baseline

Date: 2026-05-21
Hardware: RTX 5090 (32 GB, CC 12.0, 170 SMs)
Model: GPT-2 124M (FP32)
Config: 50 prompts × 20 generated tokens, greedy decoding, no KV cache

Correctness

Metric	Result
Prompts tested	50
Token-level match vs transformers	50/50 (100.0%)
Mismatches	0
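
The token-level match criterion can be sketched as below. This is a minimal illustration, not the actual bench_compare.py: the function names `first_mismatch` and `match_summary` are hypothetical, and it assumes each run yields one list of token IDs per prompt.

```python
from typing import List, Optional, Sequence


def first_mismatch(ours: Sequence[int], ref: Sequence[int]) -> Optional[int]:
    """Return the index of the first differing token, or None if identical."""
    for i, (a, b) in enumerate(zip(ours, ref)):
        if a != b:
            return i
    # A sequence that is a strict prefix of the other also counts as a mismatch.
    if len(ours) != len(ref):
        return min(len(ours), len(ref))
    return None


def match_summary(ours: List[List[int]], ref: List[List[int]]) -> str:
    """Summarize how many prompts match token-for-token, e.g. '50/50 (100.0%)'."""
    if len(ours) != len(ref):
        raise ValueError("both runs must cover the same prompts")
    matched = sum(first_mismatch(a, b) is None for a, b in zip(ours, ref))
    total = len(ours)
    pct = 100.0 * matched / total if total else 0.0
    return f"{matched}/{total} ({pct:.1f}%)"
```

Reporting the index of the first mismatch, rather than only a pass/fail flag, helps here because with greedy decoding every token after the first divergence is generated from a different context and stops being comparable.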

Performance

Metric	xserv	transformers (PyTorch)	Ratio
TTFT (avg)	400.6 ms	4.0 ms	100x slower
TBT (avg)	407.2 ms	3.8 ms	106x slower
Throughput	2.5 tok/s	260 tok/s	0.01x (about 100x lower)
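
The metrics in this table can be derived from per-token timestamps roughly as follows. This is a sketch under stated assumptions, not the bench-gpt2 implementation: the function names are hypothetical, timestamps are assumed to be in seconds, and throughput is assumed to be decode-only (1000 / TBT), which is consistent with the figures above but is not stated in the source.

```python
from statistics import mean
from typing import List, Tuple


def ttft_and_tbt_ms(start_s: float, token_times_s: List[float]) -> Tuple[float, float]:
    """Return (TTFT, mean TBT) in milliseconds for one prompt.

    start_s:       time the request was submitted
    token_times_s: time each generated token became available, in order
    """
    if not token_times_s:
        raise ValueError("need at least one generated token")
    ttft = (token_times_s[0] - start_s) * 1000.0
    # TBT is the gap between consecutive tokens, averaged over the generation.
    gaps = [b - a for a, b in zip(token_times_s, token_times_s[1:])]
    tbt = mean(gaps) * 1000.0 if gaps else 0.0
    return ttft, tbt


def slowdown(ours_ms: float, ref_ms: float) -> float:
    """How many times slower ours is than the reference."""
    return ours_ms / ref_ms


def decode_tok_per_s(tbt_ms: float) -> float:
    """Assumed throughput definition: tokens per second during decoding."""
    return 1000.0 / tbt_ms


if __name__ == "__main__":
    print(f"TTFT slowdown: {slowdown(400.6, 4.0):.0f}x")
    print(f"xserv throughput: {decode_tok_per_s(407.2):.1f} tok/s")
```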

Known Bottlenecks

No KV cache: every decode step recomputes the full sequence, so attention costs O(S²) per step for sequence length S
CPU round-trips: ~100 GPU→CPU→GPU transfers per forward pass for add/bias/split_qkv/merge_heads
cuBLAS handle per matmul: a handle is created and destroyed for every matmul, ~50 times per forward pass
No kernel fusion: every op is a separate kernel launch followed by a sync
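
To make the first bottleneck concrete, the sketch below counts query-key dot products (per head, per layer) over one generation, with and without a KV cache. It is a simplified cost model, not a measurement: the function names are hypothetical, the causal mask is ignored, and the prompt length in the usage example is an assumption, since the prompt lengths are not given here.

```python
def attention_scores_no_cache(prompt_len: int, new_tokens: int) -> int:
    """Dot products when every step recomputes attention over the whole sequence."""
    total = 0
    for step in range(new_tokens):
        s = prompt_len + step   # sequence length fed to this forward pass
        total += s * s          # full S x S score matrix, recomputed each step
    return total


def attention_scores_with_cache(prompt_len: int, new_tokens: int) -> int:
    """Dot products when keys and values from earlier steps are cached."""
    total = prompt_len * prompt_len   # prefill: the S x S matrix is computed once
    for step in range(1, new_tokens):
        s = prompt_len + step
        total += s                    # only the newest query attends to s keys
    return total


if __name__ == "__main__":
    prompt_len = 10    # assumed for illustration only
    new_tokens = 20    # matches the 20 generated tokens in the config
    print("no cache:  ", attention_scores_no_cache(prompt_len, new_tokens))
    print("with cache:", attention_scores_with_cache(prompt_len, new_tokens))
```

Without a cache, every decode step is effectively another full forward pass over the sequence, which is consistent with the measured TBT being about the same as TTFT.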

Tracking

Phase	TTFT (ms)	TBT (ms)	tok/s	Correctness	Notes
8 (baseline)	400.6	407.2	2.5	50/50	No KV cache, CPU round-trips
