Files
xserv/docs/benchmarks/phase14-flash-attention.md
Gahow Wang 6cc1c9332d docs: Phase 14 design doc + benchmark, fix Phase 11/12 honesty
Phase 14 (Flash Attention):
- Design doc: FA2 algorithm, SM120 hardware constraints (FA4 incompatible),
  kernel config (BR=BC=64, 32KB smem), GQA mapping, causal tile-skip,
  known limitations and optimization roadmap
- Benchmark doc: correctness (9/10 top-1 match, identical to pre-FA baseline),
  performance tracking (6.9→10.3→12.9 tok/s across phases), memory savings
  analysis, remaining bottleneck breakdown

Phase 11 doc: title corrected from "Paged Attention" to "GPU-Resident KV Cache"
with explicit note that paged allocation was not implemented.

Phase 12 doc: "当前状态" updated from "未实现" to reflect actual state —
iteration-level scheduling implemented + verified (6.0x concurrent speedup),
batched GPU forward explicitly marked as not yet implemented.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-05-22 18:51:29 +08:00

110 lines
4.4 KiB
Markdown
Raw Blame History

This file contains ambiguous Unicode characters

This file contains Unicode characters that might be confused with other characters. If you think that this is intentional, you can safely ignore this warning. Use the Escape button to reveal them.

# Phase 14 Benchmark: Flash Attention 2
**Date**: 2026-05-22
**Hardware**: RTX 5090 (32GB GDDR7, SM120 CC 12.0, 170 SMs)
**Model**: Qwen3-8B (BF16, 36 layers, 4096 hidden, 32 Q / 8 KV GQA heads, head_dim=128)
**Config**: greedy decoding (temperature=0), max_tokens=64, single-request serial
## Correctness
Logits comparison with HuggingFace transformers (10 prompts, raw text without ChatML):
| Metric | Result |
|--------|--------|
| Prefill Top-1 match vs HF | **9/10 (90%)** |
| Avg Top-5 overlap vs HF | **4.0/5** |
| Result vs pre-FA2 naive attention | **Identical** (same 9/10 top-1, same 4.0/5 overlap) |
The single top-1 mismatch ("Explain quantum computing.") has logits differing by 0.125
(22.000 vs 21.875) — within BF16 precision. The top-5 sets are identical (5/5 overlap).
FA2 introduces no precision degradation compared to the naive attention path.
## API Generation
52 diverse prompts (English, Chinese, code) via `/v1/chat/completions`:
| Metric | Result |
|--------|--------|
| Success rate | **52/52 (100%)** |
| SSE streaming | **Working** (role chunk, content chunks, finish_reason, [DONE]) |
| Usage stats | Correct (prompt_tokens + completion_tokens = total_tokens) |
## Performance
### xserv vs HuggingFace transformers
8 prompts (short/medium/long) × max_tokens=64, greedy:
| Category | Prompt Tokens | xserv (tok/s) | HF (tok/s) | Ratio |
|----------|--------------|---------------|------------|-------|
| Short (~12 tok) | 12-14 | 12.5 | 38.5 | 0.32x |
| Medium (~28 tok) | 27-28 | 13.6 | 44.1 | 0.31x |
| Long (~60 tok) | 58-64 | 13.0 | 36.0 | 0.36x |
| **Overall** | — | **12.9** | **36.6** | **0.35x** |
### Phase-over-Phase Improvement
| Phase | Attention | repeat_kv | tok/s | vs HF |
|-------|-----------|-----------|-------|-------|
| 10 | Naive (O(S²), cuBLAS batched) | CPU round-trip | 6.9 | 15% |
| 11 | Naive + GPU KV cache | GPU repeat_kv | 10.3 | 30% |
| **14** | **FA2 (O(1), fused kernel)** | **None (GQA in kernel)** | **12.9** | **35%** |
Phase 14 vs Phase 11: **+25% throughput** (10.3 → 12.9 tok/s).
### Improvement Breakdown (estimated)
| Factor | Contribution |
|--------|-------------|
| Eliminating repeat_kv GPU alloc + copy (per layer) | ~10% |
| Eliminating K^T transpose + contiguous | ~5% |
| Eliminating S×S score matrix alloc | ~5% |
| Fused kernel (1 launch vs 6) | ~5% |
### Concurrent Requests
8 concurrent requests, max_batch=4:
| Metric | Result |
|--------|--------|
| Wall clock | 22.5s |
| Sum of individual latencies | 135.0s |
| Scheduling speedup | **6.0x** |
| Throughput | 11.4 tok/s |
Continuous batching scheduling confirmed working (decode batch_size=4 in logs).
## Remaining Performance Gap
35% of HF throughput. Main bottlenecks:
| Bottleneck | Impact | Fix |
|-----------|--------|-----|
| **Decode Q_len=1 inefficiency** | FA2 kernel: 64 threads, only 1 active (owns_row=true for single query) | Specialized decode attention kernel (vector-dot against KV, parallel reduction along S) |
| **No kernel fusion** | RMSNorm+residual, SiLU*up: separate kernels, redundant HBM reads/writes | Fused kernels (Phase 15) |
| **No CUDA Graphs** | ~100+ kernel launches per decode step, each has host-side overhead | Capture decode iteration as CUDA Graph (Phase 15) |
| **Per-seq forward (no batched decode)** | With batch=4, 4 serial forward passes per iteration | Batched projections + per-seq attention (Phase 15, depends on FA2 decode kernel) |
| **No vectorized loads in FA2** | Scalar bf16→f32 conversion in dot product loop | float4 / bfloat162 vectorized loads |
## Memory Usage
| Component | Naive (Phase 11) | FA2 (Phase 14) |
|-----------|-----------------|----------------|
| Score matrix [1, 32, S, S] | S² × 32 × 2B | **0** |
| repeat_kv K/V [1, 32, S, 128] | 2 × S × 32 × 128 × 2B per layer | **0** |
| K^T contiguous copy | S × 32 × 128 × 2B per layer | **0** |
For S=256 (current max): savings ~6 MB per layer × 36 layers ≈ 216 MB.
For S=2048: savings ~384 MB per layer × 36 layers ≈ 13.5 GB (naive would OOM).
## Tracking
| Phase | Attention | tok/s | vs HF | Correctness |
|-------|-----------|-------|-------|-------------|
| 8 | Naive (no cache) | 2.5 | 5% | 50/50 vs HF |
| 9 | Naive + CPU KV cache | 44.3 (GPT-2) | — | 50/50 self |
| 10 | Naive + CPU KV cache | 6.9 (Qwen3-8B) | 15% | 100% top-5 |
| 11 | Naive + GPU KV cache | 10.3 | 30% | 9/10 top-1 |
| **14** | **FA2 + GQA in kernel** | **12.9** | **35%** | **9/10 top-1** |