docs: Phase 15 design doc + benchmark report

Design document (docs/15-performance.md):
- Roofline analysis: 112 tok/s theoretical at 1.79 TB/s
- Bottleneck quantification: cuBLAS M=1 GEMV at 8% bandwidth → 77% of step time
- Six optimizations with rationale, implementation details, and expected impact
- Ablation table with per-optimization delta measurements
- Remaining 55% roofline gap breakdown with next-step priorities

Benchmark report (docs/benchmarks/phase15-performance.md):
- Full ablation: 12.9 → 50.3 tok/s across 6 optimizations
- Per-prompt detail (8 prompts, 46-51 tok/s range)
- Concurrent throughput analysis (batch=4 vs serial)
- Phase-over-phase tracking from Phase 8 to Phase 15 (2.5 → 50.3 tok/s)
- Correctness verification (9/10 top-1 match, 52/52 API pass)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
85
docs/benchmarks/phase15-performance.md
Normal file
@@ -0,0 +1,85 @@
# Phase 15 Benchmark: Performance Optimization

**Date**: 2026-05-23
**Hardware**: RTX 5090 (32GB GDDR7, SM120 CC 12.0, 170 SMs, 1.79 TB/s)
**Model**: Qwen3-8B (BF16, 36 layers, 4096 hidden, 32 Q / 8 KV GQA heads, head_dim=128)
**Config**: greedy decoding (temperature=0), max_tokens=64, serial (batch=1)

## Ablation: Each Optimization Measured Independently

Rows are cumulative: each adds one optimization on top of the row above. Delta is the tok/s change relative to the previous row, and Roofline is tok/s as a fraction of the 112 tok/s bandwidth bound.

| # | Optimization | tok/s | Delta | ms/token | Roofline |
|---|-------------|-------|-------|----------|----------|
| 0 | Phase 14 baseline (FA2 + naive cuBLAS GEMV) | 12.9 | - | 77.5 | 12% |
| 1 | + Decode attention kernel (256 threads) | 12.9 | +0% | 77.5 | 12% |
| 2 | + Fused SiLU×Mul | 13.0 | +1% | 76.9 | 12% |
| 3 | + Fused Add+RMSNorm | 13.2 | +2% | 75.8 | 12% |
| 4 | + Custom GEMV (M=1, K-split tiled) | 46.6 | +253% | 21.5 | 42% |
| 5 | + Tensor::empty (skip cudaMemset) | **50.3** | **+8%** | **19.9** | **45%** |
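The derived columns of the ablation table can be reproduced from the tok/s column alone. The sketch below assumes that Delta is computed against the previous row and that Roofline is relative to the 112 tok/s bound quoted in this report; the function names are illustrative and are not taken from the xserv codebase.

```python
# tok/s values copied from the ablation table, in row order (rows 0 to 5).
TOK_PER_S = [12.9, 12.9, 13.0, 13.2, 46.6, 50.3]
ROOFLINE_TOK_PER_S = 112.0  # bandwidth bound quoted in this report


def ms_per_token(tok_s: float) -> float:
    """Latency per generated token, in milliseconds."""
    return 1000.0 / tok_s


def delta_vs_previous(prev: float, cur: float) -> float:
    """Relative throughput change against the previous row, in percent."""
    return (cur / prev - 1.0) * 100.0


def roofline_fraction(tok_s: float) -> float:
    """Throughput as a percentage of the bandwidth roofline."""
    return tok_s / ROOFLINE_TOK_PER_S * 100.0


def derive_rows(tok_s_values):
    """Return (tok/s, ms/token, delta % or None, roofline %) for each row."""
    rows = []
    for i, cur in enumerate(tok_s_values):
        delta = None if i == 0 else delta_vs_previous(tok_s_values[i - 1], cur)
        rows.append((cur, ms_per_token(cur), delta, roofline_fraction(cur)))
    return rows


if __name__ == "__main__":
    for tok_s, ms, delta, roof in derive_rows(TOK_PER_S):
        delta_text = "n/a" if delta is None else f"{delta:+.0f}%"
        print(f"{tok_s:5.1f} tok/s  {ms:5.1f} ms/token  {delta_text:>6}  {roof:3.0f}%")
```

Computing the columns this way makes it easy to check that a new row stays consistent when the table is extended in a later phase.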

## Comparison with HuggingFace transformers

8 prompts (short/medium/long) × max_tokens=64, greedy, serial:

| System | tok/s | ms/token | Roofline |
|--------|-------|----------|----------|
| HF transformers (BF16, torch 2.8, SDPA) | 36.0 | 27.8 | 32% |
| **xserv Phase 15** | **50.3** | **19.9** | **45%** |
| Roofline (1.79 TB/s, 16GB model) | 112.0 | 8.9 | 100% |

**xserv reaches 140% of HF transformers throughput (50.3 vs 36.0 tok/s).**
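The roofline row follows from a simple bandwidth argument: in serial decode every generated token has to stream the full set of weights from memory once, so throughput is bounded by bandwidth divided by model size. A minimal sketch of that arithmetic, assuming decimal units (1 TB = 1e12 bytes, 1 GB = 1e9 bytes) and ignoring KV-cache and activation traffic:

```python
def roofline_tok_per_s(bandwidth_bytes_per_s: float, model_bytes: float) -> float:
    """Upper bound on decode throughput when each token reads all weights once."""
    return bandwidth_bytes_per_s / model_bytes


BANDWIDTH = 1.79e12  # 1.79 TB/s, from the hardware line of this report
MODEL_BYTES = 16e9   # "16GB model" from the table; decimal GB is an assumption

bound = roofline_tok_per_s(BANDWIDTH, MODEL_BYTES)
xserv_fraction = 50.3 / bound   # xserv Phase 15
hf_fraction = 36.0 / bound      # HF transformers
xserv_vs_hf = 50.3 / 36.0       # relative throughput

if __name__ == "__main__":
    print(f"roofline: {bound:.1f} tok/s")
    print(f"xserv:    {xserv_fraction:.0%} of roofline")
    print(f"HF:       {hf_fraction:.0%} of roofline")
    print(f"xserv/HF: {xserv_vs_hf:.0%}")
```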

## Per-Prompt Detail (Phase 15 Final)

| # | Prompt | Prompt tokens | Completion tokens | Time | tok/s |
|---|--------|---------------|-------------------|------|-------|
| 1 | What is gravity? | 12 | 64 | 1.39s | 46.0 |
| 2 | Hello, how are you? | 14 | 64 | 1.27s | 50.5 |
| 3 | Explain DNA briefly. | 13 | 64 | 1.25s | 51.2 |
| 4 | Write a detailed explanation of photosynthesis... | 27 | 64 | 1.26s | 50.7 |
| 5 | Describe machine learning. | 13 | 64 | 1.25s | 51.2 |
| 6 | What causes earthquakes? | 12 | 64 | 1.25s | 51.1 |
| 7 | How does the internet work? | 14 | 64 | 1.25s | 51.1 |
| 8 | What is the speed of light? | 15 | 64 | 1.25s | 51.0 |

Prompt 1 is slower (46.0 vs 50.5-51.2 tok/s) due to first-request warmup (caching allocator cold start).
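The headline 50.3 tok/s figure is consistent with total completion tokens divided by total time across the eight prompts. The sketch below recomputes it from the table; because the Time column is rounded to two decimals, individual per-prompt values can differ from the reported tok/s by a few tenths. Excluding prompt 1 to estimate warm throughput is an illustration here, not something the report itself does.

```python
# (completion_tokens, seconds) per prompt, copied from the per-prompt table.
RUNS = [
    (64, 1.39), (64, 1.27), (64, 1.25), (64, 1.26),
    (64, 1.25), (64, 1.25), (64, 1.25), (64, 1.25),
]


def tok_per_s(tokens: int, seconds: float) -> float:
    """Decode throughput of a single request."""
    return tokens / seconds


def aggregate_tok_per_s(runs) -> float:
    """Total tokens over total time for a list of serial runs."""
    total_tokens = sum(tokens for tokens, _ in runs)
    total_seconds = sum(seconds for _, seconds in runs)
    return total_tokens / total_seconds


overall = aggregate_tok_per_s(RUNS)
warm_only = aggregate_tok_per_s(RUNS[1:])  # drop the cold first request

if __name__ == "__main__":
    print(f"first request: {tok_per_s(*RUNS[0]):.1f} tok/s")
    print(f"all prompts:   {overall:.1f} tok/s")
    print(f"warm only:     {warm_only:.1f} tok/s")
```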

## Concurrent Throughput

8 concurrent requests, max_batch=4:

| Config | tok/s | Wall clock | Speedup |
|--------|-------|-----------|---------|
| Serial (batch=1, custom GEMV) | 50.3 | - | - |
| Concurrent (batch=4, cuBLAS M=4) | 28.2 | 9.09s | 6.47x scheduling |
| Concurrent (batch=4, custom GEMV) | 35.1* | ~7.3s | ~6x scheduling |

*Note: batch=4 with custom GEMV is slower than serial because:
1. The batched decode path uses cuBLAS for M>1 matmuls, losing the GEMV advantage.
2. Per-sequence attention/reshape overhead in the batched path adds ~2 ms/step.
3. Custom GEMV already saturates bandwidth at M=1.

Serial decode with custom GEMV is the optimal path for the current architecture.
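Reason 1 above amounts to a dispatch rule on the number of activation rows M. A rough sketch of that rule, using hypothetical backend labels (`custom_gemv`, `cublas_gemm`) that are not taken from the xserv source:

```python
def select_matmul_path(m: int) -> str:
    """Pick a matmul backend from the number of rows M in the activation matrix.

    Hypothetical dispatch rule: the hand-written GEMV kernel handles only M == 1,
    so any batched decode step (M > 1) falls back to the cuBLAS GEMM path.
    """
    if m < 1:
        raise ValueError("M must be at least 1")
    return "custom_gemv" if m == 1 else "cublas_gemm"


if __name__ == "__main__":
    for batch in (1, 4):
        print(f"batch={batch}: {select_matmul_path(batch)}")
```

Under a rule like this, a batch=4 step never reaches the custom kernel, which matches the observation that batching loses the GEMV advantage.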

## Correctness Verification

| Test | Result |
|------|--------|
| Top-1 logits match vs HF (10 prompts) | 9/10 (90%) |
| Top-5 overlap vs HF (10 prompts) | 4.0/5 avg |
| vs pre-optimization baseline | Identical (same 9/10) |
| API generation (52 prompts) | 52/52 pass |
| SSE streaming | Working |
| Chinese prompts | Working |
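The top-1 and top-5 metrics can be computed from two logit vectors over the same vocabulary. The sketch below is one plausible definition, assuming the comparison is made per prompt on a single position's logits; the report does not say which position is compared, and the toy vectors are made up for illustration.

```python
def top_k_ids(logits, k):
    """Indices of the k largest logits, highest first."""
    return sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:k]


def top1_match(reference, candidate) -> bool:
    """True when both logit vectors agree on the argmax token."""
    return top_k_ids(reference, 1) == top_k_ids(candidate, 1)


def top_k_overlap(reference, candidate, k=5) -> int:
    """Number of token ids shared by the two top-k sets (0 to k)."""
    return len(set(top_k_ids(reference, k)) & set(top_k_ids(candidate, k)))


# Toy 8-token vocabulary; values are illustrative, not measured logits.
reference_logits = [0.1, 2.0, 1.5, 0.3, 0.9, 1.1, -0.5, 0.0]
candidate_logits = [0.2, 1.9, 1.6, 0.1, 0.8, 1.2, 0.4, -0.3]

if __name__ == "__main__":
    print("top-1 match:", top1_match(reference_logits, candidate_logits))
    print("top-5 overlap:", top_k_overlap(reference_logits, candidate_logits))
```

Averaging `top_k_overlap` over all prompts would give a figure in the same form as the "4.0/5 avg" row.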

## Phase-over-Phase Performance Tracking

| Phase | Key Change | tok/s | vs HF | Roofline |
|-------|-----------|-------|-------|----------|
| 8 | GPT-2 inference (no cache) | 2.5 | 7% | - |
| 9 | + KV cache (CPU) | 44.3 (GPT-2) | - | - |
| 10 | Qwen3-8B (CPU KV cache) | 6.9 | 19% | 6% |
| 11 | + GPU KV cache | 10.3 | 29% | 9% |
| 14 | + Flash Attention 2 | 12.9 | 36% | 12% |
| **15** | **+ Custom GEMV + fused + empty** | **50.3** | **140%** | **45%** |

Total speedup from Phase 10 to Phase 15: **7.3x** (6.9 → 50.3 tok/s).