Files
xserv/docs/benchmarks/phase14-flash-attention.md
Gahow Wang 6cc1c9332d docs: Phase 14 design doc + benchmark, fix Phase 11/12 honesty
Phase 14 (Flash Attention):
- Design doc: FA2 algorithm, SM120 hardware constraints (FA4 incompatible),
  kernel config (BR=BC=64, 32KB smem), GQA mapping, causal tile-skip,
  known limitations and optimization roadmap
- Benchmark doc: correctness (9/10 top-1 match, identical to pre-FA baseline),
  performance tracking (6.9→10.3→12.9 tok/s across phases), memory savings
  analysis, remaining bottleneck breakdown

Phase 11 doc: title corrected from "Paged Attention" to "GPU-Resident KV Cache"
with explicit note that paged allocation was not implemented.

Phase 12 doc: "当前状态" updated from "未实现" to reflect actual state —
iteration-level scheduling implemented + verified (6.0x concurrent speedup),
batched GPU forward explicitly marked as not yet implemented.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-05-22 18:51:29 +08:00

4.4 KiB
Raw Permalink Blame History

Phase 14 Benchmark: Flash Attention 2

Date: 2026-05-22 Hardware: RTX 5090 (32GB GDDR7, SM120 CC 12.0, 170 SMs) Model: Qwen3-8B (BF16, 36 layers, 4096 hidden, 32 Q / 8 KV GQA heads, head_dim=128) Config: greedy decoding (temperature=0), max_tokens=64, single-request serial

Correctness

Logits comparison with HuggingFace transformers (10 prompts, raw text without ChatML):

Metric Result
Prefill Top-1 match vs HF 9/10 (90%)
Avg Top-5 overlap vs HF 4.0/5
Result vs pre-FA2 naive attention Identical (same 9/10 top-1, same 4.0/5 overlap)

The single top-1 mismatch ("Explain quantum computing.") has logits differing by 0.125 (22.000 vs 21.875) — within BF16 precision. The top-5 sets are identical (5/5 overlap).

FA2 introduces no precision degradation compared to the naive attention path.

API Generation

52 diverse prompts (English, Chinese, code) via /v1/chat/completions:

Metric Result
Success rate 52/52 (100%)
SSE streaming Working (role chunk, content chunks, finish_reason, [DONE])
Usage stats Correct (prompt_tokens + completion_tokens = total_tokens)

Performance

xserv vs HuggingFace transformers

8 prompts (short/medium/long) × max_tokens=64, greedy:

Category Prompt Tokens xserv (tok/s) HF (tok/s) Ratio
Short (~12 tok) 12-14 12.5 38.5 0.32x
Medium (~28 tok) 27-28 13.6 44.1 0.31x
Long (~60 tok) 58-64 13.0 36.0 0.36x
Overall 12.9 36.6 0.35x

Phase-over-Phase Improvement

Phase Attention repeat_kv tok/s vs HF
10 Naive (O(S²), cuBLAS batched) CPU round-trip 6.9 15%
11 Naive + GPU KV cache GPU repeat_kv 10.3 30%
14 FA2 (O(1), fused kernel) None (GQA in kernel) 12.9 35%

Phase 14 vs Phase 11: +25% throughput (10.3 → 12.9 tok/s).

Improvement Breakdown (estimated)

Factor Contribution
Eliminating repeat_kv GPU alloc + copy (per layer) ~10%
Eliminating K^T transpose + contiguous ~5%
Eliminating S×S score matrix alloc ~5%
Fused kernel (1 launch vs 6) ~5%

Concurrent Requests

8 concurrent requests, max_batch=4:

Metric Result
Wall clock 22.5s
Sum of individual latencies 135.0s
Scheduling speedup 6.0x
Throughput 11.4 tok/s

Continuous batching scheduling confirmed working (decode batch_size=4 in logs).

Remaining Performance Gap

35% of HF throughput. Main bottlenecks:

Bottleneck Impact Fix
Decode Q_len=1 inefficiency FA2 kernel: 64 threads, only 1 active (owns_row=true for single query) Specialized decode attention kernel (vector-dot against KV, parallel reduction along S)
No kernel fusion RMSNorm+residual, SiLU*up: separate kernels, redundant HBM reads/writes Fused kernels (Phase 15)
No CUDA Graphs ~100+ kernel launches per decode step, each has host-side overhead Capture decode iteration as CUDA Graph (Phase 15)
Per-seq forward (no batched decode) With batch=4, 4 serial forward passes per iteration Batched projections + per-seq attention (Phase 15, depends on FA2 decode kernel)
No vectorized loads in FA2 Scalar bf16→f32 conversion in dot product loop float4 / bfloat162 vectorized loads

Memory Usage

Component Naive (Phase 11) FA2 (Phase 14)
Score matrix [1, 32, S, S] × 32 × 2B 0
repeat_kv K/V [1, 32, S, 128] 2 × S × 32 × 128 × 2B per layer 0
K^T contiguous copy S × 32 × 128 × 2B per layer 0

For S=256 (current max): savings ~6 MB per layer × 36 layers ≈ 216 MB. For S=2048: savings ~384 MB per layer × 36 layers ≈ 13.5 GB (naive would OOM).

Tracking

Phase Attention tok/s vs HF Correctness
8 Naive (no cache) 2.5 5% 50/50 vs HF
9 Naive + CPU KV cache 44.3 (GPT-2) 50/50 self
10 Naive + CPU KV cache 6.9 (Qwen3-8B) 15% 100% top-5
11 Naive + GPU KV cache 10.3 30% 9/10 top-1
14 FA2 + GQA in kernel 12.9 35% 9/10 top-1