docs: Phase 15 design doc + benchmark report

Design document (docs/15-performance.md):
- Roofline analysis: 112 tok/s theoretical at 1.79 TB/s
- Bottleneck quantification: cuBLAS M=1 GEMV at 8% bandwidth → 77% of step time
- Six optimizations with rationale, implementation details, and expected impact
- Ablation table with per-optimization delta measurements
- Remaining 55% roofline gap breakdown with next-step priorities

Benchmark report (docs/benchmarks/phase15-performance.md):
- Full ablation: 12.9 → 50.3 tok/s across 6 optimizations
- Per-prompt detail (8 prompts, 46-51 tok/s range)
- Concurrent throughput analysis (batch=4 vs serial)
- Phase-over-phase tracking from Phase 8 to Phase 15 (2.5 → 50.3 tok/s)
- Correctness verification (9/10 top-1 match, 52/52 API pass)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-05-23 00:39:27 +08:00
parent d5532ef209
commit a67e724119
2 changed files with 262 additions and 0 deletions
--- a/docs/15-performance.md
+++ b/docs/15-performance.md
@@ -0,0 +1,177 @@
+# Phase 15: Performance Optimization — Design Document (Milestone ④)
+
+## Goal
+
+Systematic profiling and optimization, taking decode throughput from 12.9 tok/s (end of Phase 14) toward the RTX 5090's theoretical bandwidth limit (112 tok/s).
+
+## Hardware Roofline
+
+Theoretical decode limit on the RTX 5090 (SM120, CC 12.0):
+
+```
+Model weights:        16 GB (Qwen3-8B BF16)
+Memory bandwidth:     1.79 TB/s (GDDR7)
+Ideal decode step:    16 GB / 1.79 TB/s = 8.9 ms/step = 112 tok/s (batch=1)
+```
+
+The decode phase is entirely memory-bound: every step reads all 16 GB of weights (252 GEMVs), and the compute cost is negligible.
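The roofline arithmetic above can be reproduced with a few lines of Rust. This is only a sketch of the calculation, not code from the repository: the names `Roofline`, `decode_roofline`, and `roofline_utilization` are hypothetical, and the model assumes each decode step streams all weights exactly once and that nothing else costs time.

```rust
/// Roofline estimate for a purely memory-bound decode step.
/// Assumption: every step reads all weights once; compute is free.
pub struct Roofline {
    pub ms_per_step: f64,
    pub tok_per_s: f64,
}

/// weight_bytes: total weight size in bytes; bandwidth: bytes per second.
pub fn decode_roofline(weight_bytes: f64, bandwidth_bytes_per_s: f64) -> Roofline {
    let seconds = weight_bytes / bandwidth_bytes_per_s;
    Roofline {
        ms_per_step: seconds * 1e3,
        tok_per_s: 1.0 / seconds,
    }
}

/// Fraction of the roofline reached by a measured throughput (0.0 to 1.0).
pub fn roofline_utilization(measured_tok_per_s: f64, roofline: &Roofline) -> f64 {
    measured_tok_per_s / roofline.tok_per_s
}
```

Calling `decode_roofline(16e9, 1.79e12)` gives roughly 8.9 ms/step and 112 tok/s, and `roofline_utilization(12.9, &r)` gives about 0.12, matching the figures quoted in this document.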
+
+## Bottleneck Analysis
+
+At the end of Phase 14, throughput was 12.9 tok/s = 77.5 ms/step, only 12% of the roofline.
+
+### Quantified Bottleneck Breakdown
+
+| Source | Estimated time | Share |
+|------|---------|------|
+| cuBLAS M=1 GEMV (252 calls, ~8% bandwidth utilization) | ~60 ms | 77% |
+| Non-matmul kernels (attention, norm, activation, reshape) | ~8 ms | 10% |
+| Tensor allocation + cudaMemset (1440+ allocs/step) | ~5 ms | 7% |
+| Kernel launch overhead (200+ launches × 5μs) | ~1 ms | 1% |
+| Other (sampling CPU round-trip, etc.) | ~3.5 ms | 5% |
+
+**Key finding: cuBLAS achieves very low bandwidth utilization (~8%) on M=1 GEMM (GEMV), and this is the root cause of the 9x gap to the roofline.**
+
+cuBLAS is designed for GEMMs with large M. In the M=1 case:
+- Kernel launch and dispatch overhead cannot be hidden behind a large amount of compute
+- The TensorCore tile (16×16) cannot be filled by M=1
+- The internal heuristic selects a suboptimal algorithm
+
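+## Optimizations Implemented
+
+### Opt 1: Decode Attention Kernel
+
+**Goal**: Replace FA2's inefficient path at Q_len=1 (only 1 of 64 threads is active).
+
+**Implementation** (`csrc/attention/flash_attention.cu`):
+- Dedicated `decode_attention_bf16_kernel`: 256 threads work in parallel along the KV sequence dimension
+- Each thread loads the full Q vector (128 dims) into registers
+- Each thread processes its assigned block of KV positions: dot product → online softmax
+- Block-level warp-shuffle + shared-memory reduction merges the partial results
+- GQA support: kv_head = q_head / heads_per_group
+
+**Result**: Negligible gain at the current short sequence lengths (kv_len ≤ 79), because attention is not the bottleneck. Long sequences are expected to benefit significantly.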
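The split-KV scheme described for Opt 1 (each worker scans its own KV positions with an online softmax, then the partial results are merged) can be illustrated on the CPU. The Rust below is a sketch under stated assumptions, not the CUDA kernel: it uses FP32 instead of BF16, plain loops instead of threads, contiguous chunks as the per-worker assignment, and hypothetical names (`Partial`, `scan_chunk`, `merge`, `decode_attention`, `kv_head_for`).

```rust
/// Running softmax state of one worker: max score, sum of exp, weighted V sum.
struct Partial {
    max: f32,
    sum: f32,
    acc: Vec<f32>,
}

fn dot(a: &[f32], b: &[f32]) -> f32 {
    a.iter().zip(b).map(|(x, y)| x * y).sum()
}

/// One worker scans a contiguous chunk of KV positions with online softmax.
fn scan_chunk(q: &[f32], keys: &[Vec<f32>], values: &[Vec<f32>], scale: f32) -> Partial {
    let mut p = Partial { max: f32::NEG_INFINITY, sum: 0.0, acc: vec![0.0; q.len()] };
    for (k, v) in keys.iter().zip(values) {
        let s = dot(q, k) * scale;
        let new_max = p.max.max(s);
        let rescale = (p.max - new_max).exp(); // 0 on the first element (exp(-inf))
        let w = (s - new_max).exp();
        p.sum = p.sum * rescale + w;
        for (a, x) in p.acc.iter_mut().zip(v) {
            *a = *a * rescale + w * x;
        }
        p.max = new_max;
    }
    p
}

/// Merge the per-worker partials against the global max, then normalize.
fn merge(partials: &[Partial]) -> Vec<f32> {
    let m = partials.iter().fold(f32::NEG_INFINITY, |a, p| a.max(p.max));
    let mut sum = 0.0f32;
    let mut out = vec![0.0f32; partials[0].acc.len()];
    for p in partials {
        let w = (p.max - m).exp();
        sum += p.sum * w;
        for (o, a) in out.iter_mut().zip(&p.acc) {
            *o += a * w;
        }
    }
    out.iter().map(|o| o / sum).collect()
}

/// Single-query attention for one head; requires kv_len >= 1 and workers >= 1.
pub fn decode_attention(q: &[f32], keys: &[Vec<f32>], values: &[Vec<f32>], workers: usize) -> Vec<f32> {
    assert!(workers > 0 && !keys.is_empty() && keys.len() == values.len());
    let scale = 1.0 / (q.len() as f32).sqrt();
    let chunk = (keys.len() + workers - 1) / workers;
    let partials: Vec<Partial> = keys
        .chunks(chunk)
        .zip(values.chunks(chunk))
        .map(|(k, v)| scan_chunk(q, k, v, scale))
        .collect();
    merge(&partials)
}

/// GQA mapping quoted in the text: several query heads share one KV head.
pub fn kv_head_for(q_head: usize, heads_per_group: usize) -> usize {
    q_head / heads_per_group
}
```

The result should not depend on the number of workers beyond floating-point rounding, which is the property that lets the kernel spread one query over 256 threads. With 32 query heads and 8 KV heads, `heads_per_group` is 4.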
+
+### Opt 2: Fused SiLU×Mul
+
+**Goal**: Fuse the two element-wise ops in `silu(gate) * up` into a single kernel.
+
+**Implementation** (`csrc/activation/activations.cu`):
+```
+Before: read gate → silu → write temp → read temp + up → mul → write out
+After:  read gate + up → silu(gate) * up → write out
+Saved:  1 global-memory read + 1 global-memory write per element
+```
+
+**Result**: Saves one global-memory round-trip per layer. The total across 36 layers is noticeable, but it is masked by the GEMV bottleneck.
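The before/after data flow of Opt 2 can be shown with a CPU reference in Rust. This is a sketch of the fusion idea only: the function names are hypothetical, the data is FP32 rather than BF16, and the temporary `Vec` stands in for the extra trip through device memory that the fused kernel avoids.

```rust
fn silu(x: f32) -> f32 {
    x / (1.0 + (-x).exp())
}

/// Unfused: two passes with a temporary buffer in between.
pub fn silu_then_mul(gate: &[f32], up: &[f32]) -> Vec<f32> {
    // Pass 1: read gate, write temp.
    let temp: Vec<f32> = gate.iter().map(|&g| silu(g)).collect();
    // Pass 2: read temp and up, write out.
    temp.iter().zip(up).map(|(&t, &u)| t * u).collect()
}

/// Fused: one pass, no temporary. Each element is read once and written once.
pub fn fused_silu_mul(gate: &[f32], up: &[f32], out: &mut [f32]) {
    assert!(gate.len() == up.len() && up.len() == out.len());
    for ((o, &g), &u) in out.iter_mut().zip(gate).zip(up) {
        *o = silu(g) * u;
    }
}
```

Both versions perform the same floating-point operations per element, so their outputs match exactly; only the memory traffic differs.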
+
+### Opt 3: Fused Add+RMSNorm
+
+**Goal**: Fuse `x = residual + attn_proj; normed = rmsnorm(x)` into a single kernel.
+
+**Implementation** (`csrc/normalization/rmsnorm.cu`):
+```
+Before: read residual + x → add → write sum → read sum + gamma → norm → write out
+After:  read residual + x + gamma → add + norm → write sum + normed
+Saved:  1 full global-memory round-trip per attention block
+```
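A CPU reference for the fused add + RMSNorm of Opt 3 might look like the Rust sketch below. The function name, the `eps` parameter, and the choice to write the sum back into `residual` are assumptions made for illustration; the real kernel would keep the sum in registers or shared memory between the two loops rather than re-reading it from device memory.

```rust
/// Fused residual add + RMSNorm over one hidden vector.
/// Writes the sum back into `residual` and the normalized result into `normed`.
pub fn fused_add_rmsnorm(
    residual: &mut [f32],
    x: &[f32],
    gamma: &[f32],
    eps: f32,
    normed: &mut [f32],
) {
    let n = residual.len();
    assert!(x.len() == n && gamma.len() == n && normed.len() == n);

    // Loop 1: add in place and accumulate the sum of squares of the result.
    let mut sq = 0.0f32;
    for (r, &xi) in residual.iter_mut().zip(x) {
        *r += xi;
        sq += *r * *r;
    }
    let inv_rms = 1.0 / (sq / n as f32 + eps).sqrt();

    // Loop 2: scale the summed values by 1/rms and by gamma.
    for ((o, &r), &g) in normed.iter_mut().zip(residual.iter()).zip(gamma) {
        *o = r * inv_rms * g;
    }
}
```

For example, `residual = [1, 2]` and `x = [2, 2]` produce the sum `[3, 4]`, whose RMS is `sqrt(12.5)`, so with `gamma = [1, 1]` and `eps = 0` the normalized output is `[3, 4] / sqrt(12.5)`.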
+
+### Opt 4: Batched Decode Forward ⭐
+
+**Goal**: Merge the decode tokens of multiple sequences into a GEMM with M=batch_size to improve cuBLAS efficiency.
+
+**Implementation** (`crates/xserv-model/src/qwen3.rs` + `crates/xserv-server/src/engine.rs`):
+- New `Qwen3::forward_decode_batch(tokens, positions, caches)`
+- Batched ops: embedding, norm, projections, FFN: [B, hidden] × [hidden, X]
+- Per-seq ops: RoPE, KV cache, attention (each sequence has its own position and length)
+- Row extraction (`row_view`) + concatenation (`concat_rows`) switch between the batched and per-seq layouts
+- Engine Step 4b: batched decode is used automatically when batch≥2
+
+**Result**: At batch=4, cuBLAS calls drop from 1008× M=1 to 252× M=4, and throughput reaches 35.1 tok/s (vs 13.2 serial).
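The reason batching helps in Opt 4 is that a [B, hidden] × [hidden, X] product gives the same rows as B separate GEMVs while touching each weight row once per step instead of B times. The Rust sketch below illustrates that equivalence and the call-count arithmetic; it is not the `forward_decode_batch` implementation, and `gemv`, `gemm_rows`, and `matmul_calls_per_step` are hypothetical names using FP32 and a row-major weight layout.

```rust
/// y = x · W for a single row; W is row-major [hidden, out_dim].
pub fn gemv(x: &[f32], w: &[f32], out_dim: usize) -> Vec<f32> {
    let mut y = vec![0.0f32; out_dim];
    for (k, &xk) in x.iter().enumerate() {
        for (n, yn) in y.iter_mut().enumerate() {
            *yn += xk * w[k * out_dim + n];
        }
    }
    y
}

/// Batched version: `xs` is row-major [B, hidden]; every weight row is
/// fetched once and reused for all B input rows.
pub fn gemm_rows(xs: &[f32], hidden: usize, w: &[f32], out_dim: usize) -> Vec<f32> {
    let b = xs.len() / hidden;
    let mut y = vec![0.0f32; b * out_dim];
    for k in 0..hidden {
        let w_row = &w[k * out_dim..(k + 1) * out_dim];
        for r in 0..b {
            let xk = xs[r * hidden + k];
            for n in 0..out_dim {
                y[r * out_dim + n] += xk * w_row[n];
            }
        }
    }
    y
}

/// Number of matmul calls per decode step for `batch` sequences.
pub fn matmul_calls_per_step(batch: usize, matmuls_per_token: usize, batched: bool) -> usize {
    if batched { matmuls_per_token } else { batch * matmuls_per_token }
}
```

With 252 matmuls per token, `matmul_calls_per_step(4, 252, false)` is 1008 and `matmul_calls_per_step(4, 252, true)` is 252, the two counts quoted above.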
+
+### Opt 5: Custom GEMV Kernel ⭐⭐⭐ (the decisive optimization)
+
+**Goal**: Replace cuBLAS for M=1 GEMV with a hand-written, bandwidth-optimized kernel.
+
+**Implementation** (`csrc/gemm/gemv.cu`):
+```
+Design: K-split tiled GEMV
+- TILE_N = 128 (output columns per block, one thread per column)
+- TILE_K = 256 (K-dimension slice per block)
+- BLOCK_SIZE = 128 threads
+- Grid: (ceil(N/128), ceil(K/256)); for K=N=4096 this gives 32 × 16 = 512 blocks
+  512 blocks / 170 SMs ≈ 3 blocks/SM (good occupancy)
+
+Memory access:
+- Adjacent threads read adjacent columns of the W matrix → fully coalesced
+- x vector is loaded into shared memory (once per K-chunk)
+- FP32 accumulation via atomicAdd (K-split partial sums)
+- A separate kernel performs the FP32→BF16 conversion
+
+Dispatch:
+- matmul() detects M==1 && dtype==BF16 → uses the custom GEMV automatically
+- M>1 stays on cuBLAS
+```
+
+**Result**: 13.2 → 46.6 tok/s (+253%). Bandwidth utilization rises from ~8% to ~42%.
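The K-split tiling of Opt 5 can be emulated on the CPU to make the block and thread mapping concrete. The Rust below is a sketch, not the contents of `gemv.cu`: it assumes W is stored row-major as [K, N] (consistent with adjacent threads reading adjacent columns), uses FP32 throughout, omits the FP32→BF16 conversion, and replaces `atomicAdd` with a plain add because the loops run sequentially. The names `grid_dims` and `gemv_ksplit` are hypothetical.

```rust
pub const TILE_N: usize = 128; // output columns per block, one thread per column
pub const TILE_K: usize = 256; // K-dimension slice per block

/// Grid shape (blocks along N, blocks along K).
pub fn grid_dims(n: usize, k: usize) -> (usize, usize) {
    ((n + TILE_N - 1) / TILE_N, (k + TILE_K - 1) / TILE_K)
}

/// CPU emulation of the K-split GEMV: y[col] = sum over k of x[k] * W[k][col].
pub fn gemv_ksplit(x: &[f32], w: &[f32], n: usize) -> Vec<f32> {
    let k = x.len();
    assert_eq!(w.len(), k * n);
    let (grid_x, grid_y) = grid_dims(n, k);
    // The accumulator must start at zero, like the memset FP32 buffer.
    let mut acc = vec![0.0f32; n];
    for by in 0..grid_y {
        let k0 = by * TILE_K;
        let k1 = (k0 + TILE_K).min(k);
        let x_tile = &x[k0..k1]; // stands in for the shared-memory copy of x
        for bx in 0..grid_x {
            let n0 = bx * TILE_N;
            let n1 = (n0 + TILE_N).min(n);
            for col in n0..n1 {
                // One "thread": partial dot product over this K slice.
                let mut partial = 0.0f32;
                for (i, &xv) in x_tile.iter().enumerate() {
                    partial += xv * w[(k0 + i) * n + col];
                }
                acc[col] += partial; // atomicAdd in the GPU kernel
            }
        }
    }
    acc
}
```

`grid_dims(4096, 4096)` returns `(32, 16)`, that is, the 512 blocks mentioned in the design. The sketch also shows why the accumulator needs zero initialization: every K slice adds into the same output element.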
+
+### Opt 6: Tensor::empty() (eliminating unnecessary cudaMemset)
+
+**Goal**: When a kernel fully overwrites its output tensor, skip the cudaMemset zero-fill after allocation.
+
+**Implementation**:
+- `Storage::empty()` + `Tensor::empty()`: allocate without zeroing
+- 21 kernel wrappers (activation, attention, embedding, gemm, norm, softmax, transpose) switched from `zeros` to `empty`
+- The GEMV FP32 accumulator buffer keeps its `cudaMemsetAsync` (atomicAdd requires zero initialization)
+
+**Result**: 46.6 → 50.3 tok/s (+8%). Eliminates ~756 cudaMemset calls per step.
+
+### Infra: CUDA Graph Infrastructure
+
+- FFI bindings: `cudaStreamBeginCapture`, `cudaGraphInstantiate`, `cudaGraphLaunch`
+- RAII wrapper: `CudaGraph` (capture/instantiate/launch lifecycle)
+- Not yet used in the forward path (blocked by variable kv_len); reserved for later optimization
+
+## Ablation Results
+
+dash5, RTX 5090, Qwen3-8B BF16, greedy decode, max_tokens=64:
+
+| Optimizations (cumulative) | tok/s | Delta | vs HF | Roofline |
+|---------|-------|------|-------|----------|
+| Phase 14 baseline (FA2) | 12.9 | — | 36% | 12% |
+| + Decode attention | 12.9 | +0% | 36% | 12% |
+| + Fused SiLU×Mul | 13.0 | +1% | 36% | 12% |
+| + Fused Add+RMSNorm | 13.2 | +2% | 37% | 12% |
+| + Batched decode (batch=4) | 35.1 | — | 97% | — |
+| + Custom GEMV (M=1) | 46.6 | +253% | 130% | 42% |
+| + Tensor::empty | **50.3** | +8% | **140%** | **45%** |
+
+Comparison:
+
+| System | tok/s | Roofline |
+|------|-------|----------|
+| HF transformers | 36.0 | 32% |
+| **xserv (Phase 15)** | **50.3** | **45%** |
+| Theoretical limit (1.79 TB/s) | 112.0 | 100% |
+
+## Analysis of the Remaining 55% Roofline Gap
+
+| Source | Estimated share | Optimization direction |
+|------|---------|---------|
+| GEMV kernel below full bandwidth (atomicAdd contention, K-split overhead) | 25% | GEMV without K-split (larger blocks), vectorized loads |
+| Non-matmul kernels (attention, norm, RoPE, reshape) | 15% | Fused layer kernel, more efficient decode attention |
+| Kernel launch overhead (200+ launches/step) | 5% | CUDA Graphs (requires solving variable kv_len) |
+| Memory allocator overhead (Arc, SmallVec per tensor) | 5% | Pre-allocated decode workspace |
+| Sampling D2H copy (pipeline stall) | 3% | GPU-side argmax kernel |
+| Other (host-side logic, channel overhead) | 2% | - |
+
+## Next Steps
+
+Phase 15's Milestone ④ target (50% of HF) has been far exceeded: xserv reaches 140% of HF and 45% of the roofline.
+
+Follow-up optimizations (ordered by ROI):
+1. **GEMV without K-split**: eliminates atomicAdd and reduces kernel launches → expected +15-20%
+2. **Vectorized GEMV loads**: float4 loads of the W matrix → expected +10%
+3. **Pre-allocated workspace**: eliminates Tensor object allocation overhead → expected +5%
+4. **CUDA Graphs**: requires a fixed-shape decode path → expected +5%
+5. **GPU-side sampling**: eliminates the logits D2H pipeline stall → expected +3%
--- a/docs/benchmarks/phase15-performance.md
+++ b/docs/benchmarks/phase15-performance.md
@@ -0,0 +1,85 @@
+# Phase 15 Benchmark: Performance Optimization
+
+**Date**: 2026-05-23
+**Hardware**: RTX 5090 (32GB GDDR7, SM120 CC 12.0, 170 SMs, 1.79 TB/s)
+**Model**: Qwen3-8B (BF16, 36 layers, 4096 hidden, 32 Q / 8 KV GQA heads, head_dim=128)
+**Config**: greedy decoding (temperature=0), max_tokens=64, serial (batch=1)
+
+## Ablation: Optimizations Applied Cumulatively
+
+| # | Optimization | tok/s | Delta | ms/token | Roofline |
+|---|-------------|-------|-------|----------|----------|
+| 0 | Phase 14 baseline (FA2 + naive cuBLAS GEMV) | 12.9 | — | 77.5 | 12% |
+| 1 | + Decode attention kernel (256 threads) | 12.9 | +0% | 77.5 | 12% |
+| 2 | + Fused SiLU×Mul | 13.0 | +1% | 76.9 | 12% |
+| 3 | + Fused Add+RMSNorm | 13.2 | +2% | 75.8 | 12% |
+| 4 | + Custom GEMV (M=1, K-split tiled) | 46.6 | +253% | 21.5 | 42% |
+| 5 | + Tensor::empty (skip cudaMemset) | **50.3** | **+8%** | **19.9** | **45%** |
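The derived columns in the ablation table follow directly from the measured tok/s values. As a rough cross-check, the arithmetic can be written as two small Rust helpers; `ms_per_token` and `delta_percent` are hypothetical names introduced here, and each delta is taken relative to the previous row.

```rust
/// Latency per generated token in milliseconds, from throughput in tok/s.
pub fn ms_per_token(tok_per_s: f64) -> f64 {
    1000.0 / tok_per_s
}

/// Relative change from the previous row to the current row, in percent.
pub fn delta_percent(prev_tok_per_s: f64, cur_tok_per_s: f64) -> f64 {
    (cur_tok_per_s - prev_tok_per_s) / prev_tok_per_s * 100.0
}
```

For instance, `ms_per_token(50.3)` is about 19.9 and `delta_percent(13.2, 46.6)` is about 253, in line with rows 4 and 5.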
+
+## Comparison with HuggingFace transformers
+
+8 prompts (short/medium/long) × max_tokens=64, greedy, serial:
+
+| System | tok/s | ms/token | Roofline |
+|--------|-------|----------|----------|
+| HF transformers (BF16, torch 2.8, SDPA) | 36.0 | 27.8 | 32% |
+| **xserv Phase 15** | **50.3** | **19.9** | **45%** |
+| Roofline (1.79 TB/s, 16GB model) | 112.0 | 8.9 | 100% |
+
+**xserv is 140% of HF transformers throughput.**
+
+## Per-Prompt Detail (Phase 15 Final)
+
+| # | Prompt | pt | ct | Time | tok/s |
+|---|--------|----|----|------|-------|
+| 1 | What is gravity? | 12 | 64 | 1.39s | 46.0 |
+| 2 | Hello, how are you? | 14 | 64 | 1.27s | 50.5 |
+| 3 | Explain DNA briefly. | 13 | 64 | 1.25s | 51.2 |
+| 4 | Write a detailed explanation of photosynthesis... | 27 | 64 | 1.26s | 50.7 |
+| 5 | Describe machine learning. | 13 | 64 | 1.25s | 51.2 |
+| 6 | What causes earthquakes? | 12 | 64 | 1.25s | 51.1 |
+| 7 | How does the internet work? | 14 | 64 | 1.25s | 51.1 |
+| 8 | What is the speed of light? | 15 | 64 | 1.25s | 51.0 |
+
+Prompt 1 is slower (46.0 vs 51.x) due to first-request warmup (caching allocator cold start).
+
+## Concurrent Throughput
+
+8 requests concurrent, max_batch=4:
+
+| Config | tok/s | Wall clock | Speedup |
+|--------|-------|-----------|---------|
+| Serial (batch=1, custom GEMV) | 50.3 | — | — |
+| Concurrent (batch=4, cuBLAS M=4) | 28.2 | 9.09s | 6.47x scheduling |
+| Concurrent (batch=4, custom GEMV) | 35.1* | ~7.3s | ~6x scheduling |
+
+*Note: batch=4 with custom GEMV is slower than serial because:
+1. Batched decode path uses cuBLAS for M>1 matmuls, losing the GEMV advantage
+2. Per-seq attention/reshape overhead in the batched path adds ~2ms/step
+3. Custom GEMV already reaches ~42% of the roofline at M=1, so serial decode starts from a much higher baseline than the cuBLAS M=1 path
+
+Serial decode with custom GEMV is the optimal path for current architecture.
+
+## Correctness Verification
+
+| Test | Result |
+|------|--------|
+| Top-1 logits match vs HF (10 prompts) | 9/10 (90%) |
+| Top-5 overlap vs HF (10 prompts) | 4.0/5 avg |
+| vs pre-optimization baseline | Identical (same 9/10) |
+| API generation (52 prompts) | 52/52 pass |
+| SSE streaming | Working |
+| Chinese prompts | Working |
+
+## Phase-over-Phase Performance Tracking
+
+| Phase | Key Change | tok/s | vs HF | Roofline |
+|-------|-----------|-------|-------|----------|
+| 8 | GPT-2 inference (no cache) | 2.5 | 7% | — |
+| 9 | + KV cache (CPU) | 44.3 (GPT-2) | — | — |
+| 10 | Qwen3-8B (CPU KV cache) | 6.9 | 19% | 6% |
+| 11 | + GPU KV cache | 10.3 | 29% | 9% |
+| 14 | + Flash Attention 2 | 12.9 | 36% | 12% |
+| **15** | **+ Custom GEMV + fused + empty** | **50.3** | **140%** | **45%** |
+
+Total speedup from Phase 10 to Phase 15: **7.3x** (6.9 → 50.3 tok/s).