Phase 8 Benchmark: GPT-2 124M Baseline
Date: 2026-05-21
Hardware: RTX 5090 (32GB, CC 12.0, 170 SMs)
Model: GPT-2 124M (FP32)
Config: 50 prompts × 20 generated tokens, greedy decoding, no KV cache
Correctness
| Metric | Result |
| --- | --- |
| Prompts tested | 50 |
| Token-level match vs transformers | 50/50 (100.0%) |
| Mismatches | 0 |
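
A minimal sketch of how a match count like the one above could be computed. It assumes that a prompt counts as matching only when every generated token ID is identical to the reference output; the function name `count_matching_prompts` and the toy token IDs are hypothetical and not taken from the xserv code or the benchmark data.

```python
from typing import Sequence, Tuple


def count_matching_prompts(
    ours: Sequence[Sequence[int]],
    reference: Sequence[Sequence[int]],
) -> Tuple[int, int]:
    """Return (matched, total) over prompts.

    Assumption: a prompt matches only if its full generated token-ID
    sequence equals the reference sequence.
    """
    if len(ours) != len(reference):
        raise ValueError("both runs must cover the same set of prompts")
    matched = sum(1 for a, b in zip(ours, reference) if list(a) == list(b))
    return matched, len(ours)


# Toy data for illustration only (not benchmark output).
ours = [[464, 3290, 318], [40, 716, 257]]
reference = [[464, 3290, 318], [40, 716, 257]]

matched, total = count_matching_prompts(ours, reference)
print(f"{matched}/{total} ({100.0 * matched / total:.1f}%)")
```

With greedy decoding, a single differing token usually changes every token after it, so comparing whole sequences per prompt is a strict but simple check.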
Performance
| Metric | xserv | transformers (PyTorch) | Ratio |
| --- | --- | --- | --- |
| TTFT (avg) | 400.6 ms | 4.0 ms | 100x slower |
| TBT (avg) | 407.2 ms | 3.8 ms | 106x slower |
| Throughput | 2.5 tok/s | 260 tok/s | 0.01x |
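
The ratio and throughput columns can be roughly cross-checked from the latency figures. The sketch below assumes throughput is approximately the inverse of the average time between tokens (TBT), which may not be how the benchmark script actually computes it; the helper names are hypothetical.

```python
def tokens_per_second(tbt_ms: float) -> float:
    """Approximate decode throughput as the inverse of average TBT (an assumption)."""
    return 1000.0 / tbt_ms


def slowdown(ours_ms: float, reference_ms: float) -> float:
    """How many times slower our latency is than the reference latency."""
    return ours_ms / reference_ms


# Rounded values copied from the table above.
xserv_ttft_ms, xserv_tbt_ms = 400.6, 407.2
hf_ttft_ms, hf_tbt_ms = 4.0, 3.8

print(f"xserv throughput:        {tokens_per_second(xserv_tbt_ms):.1f} tok/s")
print(f"transformers throughput: {tokens_per_second(hf_tbt_ms):.0f} tok/s")
print(f"TTFT slowdown:           {slowdown(xserv_ttft_ms, hf_ttft_ms):.0f}x")
print(f"TBT slowdown:            {slowdown(xserv_tbt_ms, hf_tbt_ms):.0f}x")
```

Recomputed from the rounded table values, this gives about 2.5 tok/s and 263 tok/s, and slowdowns of about 100x and 107x. The small gaps from the reported 260 tok/s and 106x presumably come from rounding of the underlying measurements.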
Known Bottlenecks
- No KV Cache: full recompute per token (O(S²) attention every step)
- CPU round-trips: ~100 GPU→CPU→GPU transfers per forward pass for add/bias/split_qkv/merge_heads
- cuBLAS handle per matmul: ~50 handle create/destroy per forward pass
- No kernel fusion: every op is a separate kernel launch + sync
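
To give a rough sense of the first bottleneck, the sketch below counts causal attention score computations (query-key pairs, per head and per layer) for a decode run with and without a KV cache. The prompt length of 10 is an assumption, since the benchmark prompt lengths are not listed here; the function names are hypothetical. It counts attention pairs only, so it understates the total saving: without a cache, the projections and MLP are also recomputed for every position at every step.

```python
def attention_pairs_no_cache(prompt_len: int, new_tokens: int) -> int:
    """Every decode step re-runs causal attention over the whole sequence."""
    total = 0
    for step in range(new_tokens):
        seq_len = prompt_len + step  # tokens fed to the model at this step
        # Causal mask: query i attends to keys 0..i, so S * (S + 1) / 2 pairs.
        total += seq_len * (seq_len + 1) // 2
    return total


def attention_pairs_with_cache(prompt_len: int, new_tokens: int) -> int:
    """Prefill once, then each step scores only the newest query against cached keys."""
    total = prompt_len * (prompt_len + 1) // 2  # one-time prefill over the prompt
    for step in range(1, new_tokens):
        total += prompt_len + step  # one new query over all keys so far, itself included
    return total


# Assumed prompt length of 10 tokens; 20 generated tokens as in the benchmark config.
prompt_len, new_tokens = 10, 20
no_cache = attention_pairs_no_cache(prompt_len, new_tokens)
with_cache = attention_pairs_with_cache(prompt_len, new_tokens)
print(f"no cache:   {no_cache} pairs")
print(f"with cache: {with_cache} pairs")
print(f"reduction:  {no_cache / with_cache:.1f}x")
```

Under these assumptions the uncached run scores 4330 pairs against 435 with a cache, roughly a 10x reduction in attention work alone, and the gap widens as sequences get longer.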
Tracking
| Phase | TTFT (ms) | TBT (ms) | tok/s | Correctness | Notes |
| --- | --- | --- | --- | --- | --- |
| 8 (baseline) | 400.6 | 407.2 | 2.5 | 50/50 | No KV cache, CPU round-trips |