Phase 14 (Flash Attention): - Design doc: FA2 algorithm, SM120 hardware constraints (FA4 incompatible), kernel config (BR=BC=64, 32KB smem), GQA mapping, causal tile-skip, known limitations and optimization roadmap - Benchmark doc: correctness (9/10 top-1 match, identical to pre-FA baseline), performance tracking (6.9→10.3→12.9 tok/s across phases), memory savings analysis, remaining bottleneck breakdown Phase 11 doc: title corrected from "Paged Attention" to "GPU-Resident KV Cache" with explicit note that paged allocation was not implemented. Phase 12 doc: "当前状态" updated from "未实现" to reflect actual state — iteration-level scheduling implemented + verified (6.0x concurrent speedup), batched GPU forward explicitly marked as not yet implemented. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
4.4 KiB
Phase 14 Benchmark: Flash Attention 2
Date: 2026-05-22 Hardware: RTX 5090 (32GB GDDR7, SM120 CC 12.0, 170 SMs) Model: Qwen3-8B (BF16, 36 layers, 4096 hidden, 32 Q / 8 KV GQA heads, head_dim=128) Config: greedy decoding (temperature=0), max_tokens=64, single-request serial
Correctness
Logits comparison with HuggingFace transformers (10 prompts, raw text without ChatML):
| Metric | Result |
|---|---|
| Prefill Top-1 match vs HF | 9/10 (90%) |
| Avg Top-5 overlap vs HF | 4.0/5 |
| Result vs pre-FA2 naive attention | Identical (same 9/10 top-1, same 4.0/5 overlap) |
The single top-1 mismatch ("Explain quantum computing.") has logits differing by 0.125 (22.000 vs 21.875) — within BF16 precision. The top-5 sets are identical (5/5 overlap).
FA2 introduces no precision degradation compared to the naive attention path.
API Generation
52 diverse prompts (English, Chinese, code) via /v1/chat/completions:
| Metric | Result |
|---|---|
| Success rate | 52/52 (100%) |
| SSE streaming | Working (role chunk, content chunks, finish_reason, [DONE]) |
| Usage stats | Correct (prompt_tokens + completion_tokens = total_tokens) |
Performance
xserv vs HuggingFace transformers
8 prompts (short/medium/long) × max_tokens=64, greedy:
| Category | Prompt Tokens | xserv (tok/s) | HF (tok/s) | Ratio |
|---|---|---|---|---|
| Short (~12 tok) | 12-14 | 12.5 | 38.5 | 0.32x |
| Medium (~28 tok) | 27-28 | 13.6 | 44.1 | 0.31x |
| Long (~60 tok) | 58-64 | 13.0 | 36.0 | 0.36x |
| Overall | — | 12.9 | 36.6 | 0.35x |
Phase-over-Phase Improvement
| Phase | Attention | repeat_kv | tok/s | vs HF |
|---|---|---|---|---|
| 10 | Naive (O(S²), cuBLAS batched) | CPU round-trip | 6.9 | 15% |
| 11 | Naive + GPU KV cache | GPU repeat_kv | 10.3 | 30% |
| 14 | FA2 (O(1), fused kernel) | None (GQA in kernel) | 12.9 | 35% |
Phase 14 vs Phase 11: +25% throughput (10.3 → 12.9 tok/s).
Improvement Breakdown (estimated)
| Factor | Contribution |
|---|---|
| Eliminating repeat_kv GPU alloc + copy (per layer) | ~10% |
| Eliminating K^T transpose + contiguous | ~5% |
| Eliminating S×S score matrix alloc | ~5% |
| Fused kernel (1 launch vs 6) | ~5% |
Concurrent Requests
8 concurrent requests, max_batch=4:
| Metric | Result |
|---|---|
| Wall clock | 22.5s |
| Sum of individual latencies | 135.0s |
| Scheduling speedup | 6.0x |
| Throughput | 11.4 tok/s |
Continuous batching scheduling confirmed working (decode batch_size=4 in logs).
Remaining Performance Gap
35% of HF throughput. Main bottlenecks:
| Bottleneck | Impact | Fix |
|---|---|---|
| Decode Q_len=1 inefficiency | FA2 kernel: 64 threads, only 1 active (owns_row=true for single query) | Specialized decode attention kernel (vector-dot against KV, parallel reduction along S) |
| No kernel fusion | RMSNorm+residual, SiLU*up: separate kernels, redundant HBM reads/writes | Fused kernels (Phase 15) |
| No CUDA Graphs | ~100+ kernel launches per decode step, each has host-side overhead | Capture decode iteration as CUDA Graph (Phase 15) |
| Per-seq forward (no batched decode) | With batch=4, 4 serial forward passes per iteration | Batched projections + per-seq attention (Phase 15, depends on FA2 decode kernel) |
| No vectorized loads in FA2 | Scalar bf16→f32 conversion in dot product loop | float4 / bfloat162 vectorized loads |
Memory Usage
| Component | Naive (Phase 11) | FA2 (Phase 14) |
|---|---|---|
| Score matrix [1, 32, S, S] | S² × 32 × 2B | 0 |
| repeat_kv K/V [1, 32, S, 128] | 2 × S × 32 × 128 × 2B per layer | 0 |
| K^T contiguous copy | S × 32 × 128 × 2B per layer | 0 |
For S=256 (current max): savings ~6 MB per layer × 36 layers ≈ 216 MB. For S=2048: savings ~384 MB per layer × 36 layers ≈ 13.5 GB (naive would OOM).
Tracking
| Phase | Attention | tok/s | vs HF | Correctness |
|---|---|---|---|---|
| 8 | Naive (no cache) | 2.5 | 5% | 50/50 vs HF |
| 9 | Naive + CPU KV cache | 44.3 (GPT-2) | — | 50/50 self |
| 10 | Naive + CPU KV cache | 6.9 (Qwen3-8B) | 15% | 100% top-5 |
| 11 | Naive + GPU KV cache | 10.3 | 30% | 9/10 top-1 |
| 14 | FA2 + GQA in kernel | 12.9 | 35% | 9/10 top-1 |