xserv/docs/15-performance.md
Gahow Wang a67e724119 docs: Phase 15 design doc + benchmark report
Design document (docs/15-performance.md):
- Roofline analysis: 112 tok/s theoretical at 1.79 TB/s
- Bottleneck quantification: cuBLAS M=1 GEMV at 8% bandwidth → 77% of step time
- Six optimizations with rationale, implementation details, and expected impact
- Ablation table with per-optimization delta measurements
- Remaining 55% roofline gap breakdown with next-step priorities

Benchmark report (docs/benchmarks/phase15-performance.md):
- Full ablation: 12.9 → 50.3 tok/s across 6 optimizations
- Per-prompt detail (8 prompts, 46-51 tok/s range)
- Concurrent throughput analysis (batch=4 vs serial)
- Phase-over-phase tracking from Phase 8 to Phase 15 (2.5 → 50.3 tok/s)
- Correctness verification (9/10 top-1 match, 52/52 API pass)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-05-23 00:39:27 +08:00

# Phase 15: Performance Optimization — Design Document (Milestone ④)
## Goal
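Systematically profile and optimize decode, starting from 12.9 tok/s (end of Phase 14) and approaching the RTX 5090's theoretical memory-bandwidth limit (112 tok/s).
## Hardware Roofline
Theoretical decode limit for the RTX 5090 (SM120, CC 12.0):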
```
Model weights:    16 GB (Qwen3-8B BF16)
Memory bandwidth: 1.79 TB/s (GDDR7)
Theoretical best decode: 16 GB / 1.79 TB/s = 8.9 ms/step = 112 tok/s (batch=1)
```
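
The roofline arithmetic above can be reproduced with a few lines of Rust. This is a minimal sketch; the function names `decode_roofline` and `roofline_utilization` are illustrative and are not part of xserv.

```rust
/// Upper bound on single-sequence decode speed when every step has to
/// stream all model weights from device memory exactly once.
/// Returns (milliseconds per step, tokens per second).
pub fn decode_roofline(weight_bytes: f64, bandwidth_bytes_per_s: f64) -> (f64, f64) {
    let step_s = weight_bytes / bandwidth_bytes_per_s;
    (step_s * 1e3, 1.0 / step_s)
}

/// Fraction of the roofline that a measured decode speed achieves.
pub fn roofline_utilization(measured_tok_s: f64, roofline_tok_s: f64) -> f64 {
    measured_tok_s / roofline_tok_s
}
```

Calling `decode_roofline(16e9, 1.79e12)` gives roughly 8.9 ms/step and 112 tok/s, and `roofline_utilization(12.9, 112.0)` gives about 0.12, matching the figures used throughout this document.
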
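The decode phase is 100% memory-bound: every step reads the full 16 GB of weights (252 GEMVs), and the compute cost is negligible.
## Bottleneck Analysis
At the end of Phase 14, performance was 12.9 tok/s = 77.5 ms/step, which is only 12% roofline utilization.
### Quantified Bottleneck Breakdown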
| Source | Estimated time | Share |
|--------|----------------|-------|
| cuBLAS M=1 GEMV (252 calls, ~8% bandwidth utilization) | ~60 ms | 77% |
| Non-matmul kernels (attention, norm, activation, reshape) | ~8 ms | 10% |
| Tensor allocation + cudaMemset (1440+ allocs/step) | ~5 ms | 7% |
| Kernel launch overhead (200+ launches × 5 μs) | ~1 ms | 1% |
| Other (sampling CPU round-trip, etc.) | ~3.5 ms | 5% |

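**Key finding: cuBLAS achieves very low bandwidth utilization (~8%) on M=1 GEMM (GEMV), and this is the root cause of the 9x gap.**

cuBLAS is designed for large-M GEMM. In the M=1 case:
- Kernel launch and dispatch overhead cannot be hidden behind a large amount of compute
- Tensor Core tiles (16×16) cannot be filled with M=1
- The internal heuristic picks a suboptimal algorithm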
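## Optimizations Implemented
### Opt 1: Decode Attention Kernel
**Goal**: Replace FA2's inefficient path at Q_len=1, where only 1 of 64 threads is active.
**Implementation** (`csrc/attention/flash_attention.cu`):
- Dedicated `decode_attention_bf16_kernel`: 256 threads parallelize along the KV sequence dimension
- Each thread loads the full Q vector (128 dims) into registers
- Each thread processes its assigned block of KV positions: dot product → online softmax
- Block-level warp-shuffle + shared-memory reduction merges the results
- GQA support: kv_head = q_head / heads_per_group

**Effect**: Negligible at the current short sequence lengths (kv_len ≤ 79), because attention is not the bottleneck. It should pay off significantly on long sequences.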
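
The per-thread online softmax and the block-level merge can be illustrated with a CPU reference in Rust. This is a sketch under stated assumptions, not the CUDA kernel: it uses `f32` instead of BF16, models each thread as a sequential "worker" over a contiguous chunk of KV positions, scales scores by 1/sqrt(dim), and handles a single head. The names `Partial`, `decode_attention`, and `kv_head_for` are hypothetical.

```rust
/// Running online-softmax state over some subset of KV positions.
#[derive(Clone, Debug)]
pub struct Partial {
    pub max: f32,      // largest score seen so far
    pub sum: f32,      // sum of exp(score - max)
    pub acc: Vec<f32>, // sum of exp(score - max) * v
}

impl Partial {
    pub fn new(dim: usize) -> Self {
        Partial { max: f32::NEG_INFINITY, sum: 0.0, acc: vec![0.0; dim] }
    }

    /// Fold one KV position into the running state.
    pub fn push(&mut self, score: f32, v: &[f32]) {
        let new_max = self.max.max(score);
        let rescale = (self.max - new_max).exp(); // exp(-inf) = 0 on the first push
        let w = (score - new_max).exp();
        self.sum = self.sum * rescale + w;
        for (a, &x) in self.acc.iter_mut().zip(v) {
            *a = *a * rescale + w * x;
        }
        self.max = new_max;
    }

    /// Combine two partial states (the role of the block-level reduction).
    pub fn merge(&self, other: &Partial) -> Partial {
        let max = self.max.max(other.max);
        if max == f32::NEG_INFINITY {
            return self.clone(); // both sides are empty
        }
        let (sa, sb) = ((self.max - max).exp(), (other.max - max).exp());
        let acc = self.acc.iter().zip(&other.acc).map(|(a, b)| a * sa + b * sb).collect();
        Partial { max, sum: self.sum * sa + other.sum * sb, acc }
    }
}

/// Single-head decode attention. `k` and `v` are row-major [kv_len, dim].
pub fn decode_attention(q: &[f32], k: &[f32], v: &[f32], workers: usize) -> Vec<f32> {
    assert!(workers > 0 && !q.is_empty() && k.len() == v.len());
    let dim = q.len();
    let kv_len = k.len() / dim;
    let scale = 1.0 / (dim as f32).sqrt();
    let chunk = (kv_len + workers - 1) / workers;
    let mut total = Partial::new(dim);
    for w in 0..workers {
        // Contiguous chunk of KV positions owned by this worker (may be empty).
        let start = (w * chunk).min(kv_len);
        let end = (start + chunk).min(kv_len);
        let mut part = Partial::new(dim);
        for pos in start..end {
            let (lo, hi) = (pos * dim, (pos + 1) * dim);
            let score = q.iter().zip(&k[lo..hi]).map(|(a, b)| a * b).sum::<f32>() * scale;
            part.push(score, &v[lo..hi]);
        }
        total = total.merge(&part);
    }
    total.acc.iter().map(|a| a / total.sum).collect()
}

/// GQA mapping from a query head to the KV head it shares.
pub fn kv_head_for(q_head: usize, heads_per_group: usize) -> usize {
    q_head / heads_per_group
}
```

Because `merge` rescales both sides to a common maximum, the result does not depend on how the KV positions are split across workers, up to floating-point rounding. That property is what lets 256 threads each handle a slice of the sequence independently.
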
### Opt 2: Fused SiLU×Mul
**Goal**: Fuse the two element-wise ops in `silu(gate) * up` into a single kernel.
**Implementation** (`csrc/activation/activations.cu`):
```
Before: read gate → silu → write temp → read temp + up → mul → write out
After: read gate + up → silu(gate) * up → write out
Saved: 1 HBM read + 1 HBM write per element
```
**Effect**: Saves one HBM round-trip per layer. Across 36 layers the total is noticeable, but it is masked by the GEMV bottleneck.
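
The before/after data flow can be sketched on the CPU in Rust. This is an illustration in `f32`, not the CUDA kernel; the function names are hypothetical.

```rust
/// SiLU activation: x * sigmoid(x).
fn silu(x: f32) -> f32 {
    x / (1.0 + (-x).exp())
}

/// Unfused: two passes with an intermediate buffer in between.
pub fn silu_mul_unfused(gate: &[f32], up: &[f32]) -> Vec<f32> {
    let temp: Vec<f32> = gate.iter().map(|&g| silu(g)).collect(); // pass 1: write temp
    temp.iter().zip(up).map(|(&t, &u)| t * u).collect() // pass 2: read temp + up
}

/// Fused: one pass, no intermediate buffer.
pub fn silu_mul_fused(gate: &[f32], up: &[f32], out: &mut [f32]) {
    assert_eq!(gate.len(), up.len());
    assert_eq!(gate.len(), out.len());
    for ((o, &g), &u) in out.iter_mut().zip(gate).zip(up) {
        *o = silu(g) * u;
    }
}
```

Both versions compute the same values; the fused one simply never materializes `temp`, which is the saved write and read per element.
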
### Opt 3: Fused Add+RMSNorm
**Goal**: Fuse `x = residual + attn_proj; normed = rmsnorm(x)` into a single kernel.
**Implementation** (`csrc/normalization/rmsnorm.cu`):
```
Before: read residual + x → add → write sum → read sum + gamma → norm → write out
After: read residual + x + gamma → add + norm → write sum + normed
Saved: 1 full HBM round-trip per attention block
```
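
A CPU sketch of the fused operation in Rust follows. It assumes the sum is written back in place over `residual` and that RMSNorm uses an `eps` term inside the square root; neither detail is stated above, and the function name is hypothetical.

```rust
/// Fused residual add + RMSNorm over one hidden vector.
/// On return, `residual` holds `residual + x` and `normed` holds the
/// normalized sum scaled by `gamma`.
pub fn add_rmsnorm_fused(
    residual: &mut [f32],
    x: &[f32],
    gamma: &[f32],
    eps: f32,
    normed: &mut [f32],
) {
    assert_eq!(residual.len(), x.len());
    assert_eq!(residual.len(), gamma.len());
    assert_eq!(residual.len(), normed.len());

    // Add and accumulate the sum of squares in the same loop.
    let mut sum_sq = 0.0f32;
    for (r, &xi) in residual.iter_mut().zip(x) {
        *r += xi;
        sum_sq += *r * *r;
    }

    // Normalize using the sum that was just produced.
    let inv_rms = 1.0 / (sum_sq / residual.len() as f32 + eps).sqrt();
    for ((n, &r), &g) in normed.iter_mut().zip(residual.iter()).zip(gamma) {
        *n = r * inv_rms * g;
    }
}
```

The point of the fusion is that the sum is consumed by the norm without first being written out and read back by a separate kernel.
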
### Opt 4: Batched Decode Forward ⭐
**Goal**: Merge the decode tokens of multiple sequences into a single GEMM with M=batch_size to improve cuBLAS efficiency.
**Implementation** (`crates/xserv-model/src/qwen3.rs` + `crates/xserv-server/src/engine.rs`):
- Added `Qwen3::forward_decode_batch(tokens, positions, caches)`
- Batched ops: embedding, norm, projections, FFN, computed as [B, hidden] × [hidden, X]
- Per-seq ops: RoPE, KV cache, attention (each sequence has a different position and length)
- Row extraction (`row_view`) + concatenation (`concat_rows`) switch between the batched and per-seq layouts
- Engine Step 4b: batched decode is used automatically when batch ≥ 2

**Effect**: At batch=4, cuBLAS calls go from 1008× M=1 to 252× M=4, and throughput reaches 35.1 tok/s (vs 13.2 serial).
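
The switch between per-sequence rows and a batched matrix can be sketched as follows. `Mat` is a hypothetical stand-in for the project's tensor type, and the signatures of `concat_rows` and `row_view` here are assumptions for illustration; only the names come from the list above.

```rust
/// Row-major matrix; a stand-in for the project's tensor type.
pub struct Mat {
    pub rows: usize,
    pub cols: usize,
    pub data: Vec<f32>,
}

/// Stack per-sequence row vectors into one [B, cols] matrix.
pub fn concat_rows(rows: &[Vec<f32>]) -> Mat {
    let cols = rows[0].len();
    let mut data = Vec::with_capacity(rows.len() * cols);
    for r in rows {
        assert_eq!(r.len(), cols);
        data.extend_from_slice(r);
    }
    Mat { rows: rows.len(), cols, data }
}

/// Borrow row `i`, e.g. to run a per-sequence op on it.
pub fn row_view(m: &Mat, i: usize) -> &[f32] {
    &m.data[i * m.cols..(i + 1) * m.cols]
}

/// [M, K] x [K, N] -> [M, N].
pub fn matmul(x: &Mat, w: &Mat) -> Mat {
    assert_eq!(x.cols, w.rows);
    let mut out = vec![0.0f32; x.rows * w.cols];
    for k in 0..w.rows {
        for n in 0..w.cols {
            let wkn = w.data[k * w.cols + n]; // each weight is read once...
            for m in 0..x.rows {
                // ...and reused for every sequence in the batch
                out[m * w.cols + n] += x.data[m * x.cols + k] * wkn;
            }
        }
    }
    Mat { rows: x.rows, cols: w.cols, data: out }
}
```

The loop order makes the benefit visible: with B rows stacked, each weight element is loaded once and serves all B sequences, whereas B separate M=1 calls would stream the whole weight matrix B times.
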
### Opt 5: Custom GEMV Kernel ⭐⭐⭐ (the decisive optimization)
**Goal**: Replace cuBLAS for M=1 GEMV with a hand-written, bandwidth-optimized kernel.
**Implementation** (`csrc/gemm/gemv.cu`):
```
Design: K-split tiled GEMV
- TILE_N = 128 (output columns per block, one thread per column)
- TILE_K = 256 (K-dimension slice per block)
- BLOCK_SIZE = 128 threads
- Grid: (ceil(N/128), ceil(K/256)); for K=N=4096 this gives 32 × 16 = 512 blocks
  512 blocks / 170 SMs ≈ 3 blocks/SM (good occupancy)
Memory access:
- Adjacent threads read adjacent columns of the W matrix → perfectly coalesced
- x vector is loaded into shared memory (only once per K-chunk)
- FP32 accumulation via atomicAdd (K-split partial sums)
- A separate kernel does the FP32→BF16 conversion
Dispatch:
- matmul() detects M==1 && dtype==BF16 → automatically uses the custom GEMV
- M>1 stays on cuBLAS
```
**Effect**: 13.2 → 46.6 tok/s (+253%). Bandwidth utilization rises from ~8% to ~42%.
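
The tiling and K-split accumulation can be modeled on the CPU. The Rust sketch below runs the grid sequentially, uses `f32` instead of BF16, and assumes W is stored row-major as [K, N] so that adjacent columns are adjacent in memory. It is a reference for the indexing only and says nothing about GPU performance; the function names are hypothetical.

```rust
pub const TILE_N: usize = 128; // output columns per block
pub const TILE_K: usize = 256; // K-dimension slice per block

/// Grid shape (blocks along N, blocks along K).
pub fn grid_dims(n: usize, k: usize) -> (usize, usize) {
    ((n + TILE_N - 1) / TILE_N, (k + TILE_K - 1) / TILE_K)
}

/// y[col] = sum over row of x[row] * w[row * n + col], with W row-major [K, N].
pub fn gemv_k_split(x: &[f32], w: &[f32], n: usize) -> Vec<f32> {
    let k = x.len();
    assert_eq!(w.len(), k * n);
    let (grid_x, grid_y) = grid_dims(n, k);
    // The accumulator must start at zero because every K-slice adds into it.
    let mut acc = vec![0.0f32; n];
    for by in 0..grid_y {
        let k0 = by * TILE_K;
        let k1 = (k0 + TILE_K).min(k);
        let x_tile = &x[k0..k1]; // stands in for the shared-memory copy of x
        for bx in 0..grid_x {
            let n0 = bx * TILE_N;
            let n1 = (n0 + TILE_N).min(n);
            for col in n0..n1 {
                // One "thread" per output column.
                let mut partial = 0.0f32;
                for (i, &xv) in x_tile.iter().enumerate() {
                    partial += xv * w[(k0 + i) * n + col];
                }
                acc[col] += partial; // atomicAdd in the real kernel
            }
        }
    }
    acc
}
```

`grid_dims(4096, 4096)` returns `(32, 16)`, i.e. the 512 blocks quoted above. On the GPU the order in which K-slices add into `acc` is not fixed, so FP32 results can differ in the last bits from run to run; this sequential model does not show that effect.
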
### Opt 6: Tensor::empty() (eliminating unnecessary cudaMemset)
**Goal**: Skip the post-allocation cudaMemset zero-fill when a kernel overwrites its entire output tensor.
**Implementation**:
- `Storage::empty()` + `Tensor::empty()`: allocate without zeroing
- 21 kernel wrappers (activation, attention, embedding, gemm, norm, softmax, transpose) switched from `zeros` to `empty`
- The GEMV FP32 accumulator buffer keeps its `cudaMemsetAsync` (atomicAdd requires zero initialization)

**Effect**: 46.6 → 50.3 tok/s (+8%). Eliminates ~756 cudaMemset calls per step.
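
Why `empty` is safe for overwriting kernels but not for the GEMV accumulator can be shown with a small safe-Rust model. Here an "uncleared" buffer is simulated by filling it with NaN as a visible garbage marker; real uninitialized device memory could contain anything. All names are hypothetical.

```rust
/// Simulated uncleared allocation: contents are arbitrary (NaN as a marker).
pub fn empty(len: usize) -> Vec<f32> {
    vec![f32::NAN; len]
}

/// Zero-filled allocation.
pub fn zeros(len: usize) -> Vec<f32> {
    vec![0.0; len]
}

/// Overwriting kernel: every output element is assigned, never read.
pub fn scale_kernel(input: &[f32], factor: f32, out: &mut [f32]) {
    assert_eq!(input.len(), out.len());
    for (o, &x) in out.iter_mut().zip(input) {
        *o = x * factor;
    }
}

/// Accumulating kernel: adds to whatever is already in `out`,
/// like K-split partial sums combined with atomicAdd.
pub fn accumulate_kernel(partials: &[f32], out: &mut [f32]) {
    assert_eq!(partials.len(), out.len());
    for (o, &p) in out.iter_mut().zip(partials) {
        *o += p;
    }
}
```

Running `scale_kernel` into an `empty` buffer yields correct results because the old contents are never read. Running `accumulate_kernel` into an `empty` buffer propagates the garbage, which is why the accumulator keeps its zero-fill.
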
### Infra: CUDA Graph Infrastructure
- FFI bindings: `cudaStreamBeginCapture`, `cudaGraphInstantiate`, `cudaGraphLaunch`
- RAII wrapper: `CudaGraph` (capture/instantiate/launch lifecycle)
- Not yet used in the forward path (blocked by variable kv_len); reserved for future optimization
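## Ablation Results
dash5, RTX 5090, Qwen3-8B BF16, greedy decode, max_tokens=64:

| Cumulative optimizations | tok/s | Delta | vs HF | Roofline |
|--------------------------|-------|-------|-------|----------|
| Phase 14 baseline (FA2) | 12.9 | n/a | 36% | 12% |
| + Decode attention | 12.9 | +0% | 36% | 12% |
| + Fused SiLU×Mul | 13.0 | +1% | 36% | 12% |
| + Fused Add+RMSNorm | 13.2 | +2% | 37% | 12% |
| + Batched decode (batch=4) | 35.1 | n/a | 97% | n/a |
| + Custom GEMV (M=1) | 46.6 | +253% | 130% | 42% |
| + Tensor::empty | **50.3** | +8% | **140%** | **45%** |

The batched-decode row is aggregate throughput at batch=4, so it has no single-sequence delta or roofline figure. The +253% for the custom GEMV is measured against the 13.2 tok/s single-sequence result, not against the batched row.
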
Comparison:

| System | tok/s | Roofline |
|--------|-------|----------|
| HF transformers | 36.0 | 32% |
| **xserv (Phase 15)** | **50.3** | **45%** |
| Theoretical limit (1.79 TB/s) | 112.0 | 100% |
## Remaining 55% Roofline Gap
| Source | Estimated share | Optimization direction |
|--------|-----------------|------------------------|
| GEMV kernel below full bandwidth (atomicAdd contention, K-split overhead) | 25% | GEMV without K-split (larger blocks), vectorized loads |
| Non-matmul kernels (attention, norm, RoPE, reshape) | 15% | Fused layer kernel, more efficient decode attention |
| Kernel launch overhead (200+ launches/step) | 5% | CUDA Graphs (requires solving variable kv_len) |
| Memory allocator overhead (Arc, SmallVec per tensor) | 5% | Pre-allocated decode workspace |
| Sampling D2H copy (pipeline stall) | 3% | GPU-side argmax kernel |
| Other (host-side logic, channel overhead) | 2% | n/a |
## Next Steps
The Milestone ④ target for Phase 15 (50% of HF) has been far exceeded: xserv reaches 140% of HF and 45% of roofline.

Follow-up optimizations, ordered by ROI:
1. **GEMV without K-split**: removes atomicAdd and reduces kernel launches → expected +15-20%
2. **Vectorized GEMV loads**: float4 loads of the W matrix → expected +10%
3. **Pre-allocated workspace**: removes Tensor object allocation overhead → expected +5%
4. **CUDA Graphs**: requires a fixed-shape decode path → expected +5%
5. **GPU-side sampling**: removes the logits D2H pipeline stall → expected +3%
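
For item 5, the operation a GPU-side sampler would have to reproduce under greedy decode is a plain argmax over the logits. The CPU reference below is only a sketch of that semantics; the tie-breaking rule (lowest index wins) and the NaN handling are assumptions, and `argmax` is a hypothetical name.

```rust
/// Index of the largest logit. Ties resolve to the lowest index and NaN
/// entries are skipped. Returns None for an empty or all-NaN input.
pub fn argmax(logits: &[f32]) -> Option<usize> {
    let mut best: Option<(usize, f32)> = None;
    for (i, &x) in logits.iter().enumerate() {
        if x.is_nan() {
            continue;
        }
        match best {
            Some((_, b)) if x <= b => {} // keep the earlier, larger-or-equal entry
            _ => best = Some((i, x)),
        }
    }
    best.map(|(i, _)| i)
}
```

With the reduction done on the device, only the resulting index has to cross to the host each step instead of the full logits vector, which is the D2H stall that item 5 targets.
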