Phase 14 (Flash Attention):
- Design doc: FA2 algorithm, SM120 hardware constraints (FA4 incompatible), kernel config (BR=BC=64, 32KB smem), GQA mapping, causal tile-skip, known limitations and optimization roadmap
- Benchmark doc: correctness (9/10 top-1 match, identical to pre-FA baseline), performance tracking (6.9→10.3→12.9 tok/s across phases), memory savings analysis, remaining bottleneck breakdown

Phase 11 doc: title corrected from "Paged Attention" to "GPU-Resident KV Cache" with explicit note that paged allocation was not implemented.

Phase 12 doc: "当前状态" ("Current Status") updated from "未实现" ("Not implemented") to reflect the actual state: iteration-level scheduling implemented + verified (6.0x concurrent speedup), batched GPU forward explicitly marked as not yet implemented.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

# Phase 14 Benchmark: Flash Attention 2

**Date**: 2026-05-22
**Hardware**: RTX 5090 (32GB GDDR7, SM120 CC 12.0, 170 SMs)
**Model**: Qwen3-8B (BF16, 36 layers, 4096 hidden, 32 Q / 8 KV GQA heads, head_dim=128)
**Config**: greedy decoding (temperature=0), max_tokens=64, single-request serial

## Correctness

Logits comparison with HuggingFace transformers (10 prompts, raw text without ChatML):

| Metric | Result |
|--------|--------|
| Prefill Top-1 match vs HF | **9/10 (90%)** |
| Avg Top-5 overlap vs HF | **4.0/5** |
| Result vs pre-FA2 naive attention | **Identical** (same 9/10 top-1, same 4.0/5 overlap) |

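The two metrics in the table can be computed per prompt roughly as follows. This is a minimal sketch, not the project's actual comparison script: the function names `top_k_ids` and `compare_logits` are hypothetical, and the 8-entry logit vectors are illustrative values, not measured data.

```python
from typing import Sequence


def top_k_ids(logits: Sequence[float], k: int) -> list[int]:
    """Indices of the k largest logits, highest first (ties go to the lower index)."""
    return sorted(range(len(logits)), key=lambda i: (-logits[i], i))[:k]


def compare_logits(ours: Sequence[float], ref: Sequence[float], k: int = 5) -> tuple[bool, int]:
    """Return (top-1 match, size of the top-k overlap) for one prompt."""
    a, b = top_k_ids(ours, k), top_k_ids(ref, k)
    return a[0] == b[0], len(set(a) & set(b))


# Toy 8-token vocabulary (illustrative values only): the two leading
# candidates are swapped, but the top-5 sets are the same.
ref = [22.0, 21.875, 10.0, 9.0, 8.0, 7.0, 1.0, 0.0]
ours = [21.875, 22.0, 10.0, 9.0, 8.0, 7.0, 1.0, 0.0]

match, overlap = compare_logits(ours, ref)
print(match, overlap)  # False 5
```

Averaging `overlap` over all prompts gives the "Avg Top-5 overlap" row; counting `match` gives the top-1 row.
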
The single top-1 mismatch ("Explain quantum computing.") has logits differing by 0.125
(22.000 vs 21.875), which is exactly one BF16 ULP at that magnitude. The top-5 sets are identical (5/5 overlap).

On this prompt set, FA2 introduces no precision degradation compared to the naive attention path.

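To see why a 0.125 gap counts as "within BF16 precision", it helps to compute the spacing between adjacent bfloat16 values near 22. The sketch below relies only on the bfloat16 layout (7 fraction bits); the helper name `bf16_ulp` is hypothetical.

```python
import math

BF16_FRACTION_BITS = 7  # bfloat16 layout: 1 sign, 8 exponent, 7 fraction bits


def bf16_ulp(x: float) -> float:
    """Spacing between adjacent bfloat16 values around a normal, nonzero x."""
    # frexp returns (m, e) with x = m * 2**e and 0.5 <= |m| < 1,
    # so the binary exponent of x is e - 1.
    _, e = math.frexp(abs(x))
    return 2.0 ** (e - 1 - BF16_FRACTION_BITS)


print(bf16_ulp(22.0))   # 0.125
print(22.0 - 21.875)    # 0.125
```

Both values lie in [16, 32), where representable bfloat16 numbers are 0.125 apart, so 22.000 and 21.875 are neighbouring values and cannot be distinguished any more finely in that format.
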
## API Generation

52 diverse prompts (English, Chinese, code) via `/v1/chat/completions`:

| Metric | Result |
|--------|--------|
| Success rate | **52/52 (100%)** |
| SSE streaming | **Working** (role chunk, content chunks, finish_reason, [DONE]) |
| Usage stats | Correct (prompt_tokens + completion_tokens = total_tokens) |

## Performance
### xserv vs HuggingFace transformers

8 prompts (short/medium/long) × max_tokens=64, greedy:

| Category | Prompt Tokens | xserv (tok/s) | HF (tok/s) | Ratio (xserv/HF) |
|----------|--------------|---------------|------------|-------|
| Short (~12 tok) | 12-14 | 12.5 | 38.5 | 0.32x |
| Medium (~28 tok) | 27-28 | 13.6 | 44.1 | 0.31x |
| Long (~60 tok) | 58-64 | 13.0 | 36.0 | 0.36x |
| **Overall** | n/a | **12.9** | **36.6** | **0.35x** |

### Phase-over-Phase Improvement

| Phase | Attention | repeat_kv | tok/s | vs HF |
|-------|-----------|-----------|-------|-------|
| 10 | Naive (O(S²) score matrix, cuBLAS batched) | CPU round-trip | 6.9 | 15% |
| 11 | Naive + GPU KV cache | GPU repeat_kv | 10.3 | 30% |
| **14** | **FA2 (no S×S score matrix in global memory, fused kernel)** | **None (GQA in kernel)** | **12.9** | **35%** |

Phase 14 vs Phase 11: **+25% throughput** (10.3 → 12.9 tok/s).

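"GQA in kernel" means each query head indexes its shared KV head directly, so no expanded K/V copy is built. A minimal sketch of that index mapping, checked against a reference `repeat_kv`, is shown here. It assumes the contiguous grouping used by HuggingFace's `repeat_kv` (query heads 0-3 share KV head 0, and so on); the function names are hypothetical and the real kernel's indexing may be organised differently.

```python
NUM_Q_HEADS = 32    # from the model description: 32 Q heads
NUM_KV_HEADS = 8    # and 8 KV heads
GROUP_SIZE = NUM_Q_HEADS // NUM_KV_HEADS  # 4 query heads share one KV head


def kv_head_for(q_head: int) -> int:
    """KV head read by a given query head (contiguous grouping assumed)."""
    if not 0 <= q_head < NUM_Q_HEADS:
        raise ValueError(f"q_head out of range: {q_head}")
    return q_head // GROUP_SIZE


def repeat_kv(kv_heads: list, n_rep: int) -> list:
    """Reference path: materialise every KV head n_rep times in a new list."""
    return [head for head in kv_heads for _ in range(n_rep)]


expanded = repeat_kv(list(range(NUM_KV_HEADS)), GROUP_SIZE)
print(len(expanded))                      # 32
print(kv_head_for(0), kv_head_for(31))    # 0 7
```

The index lookup returns the same head that the expanded copy would hold at position `q_head`, which is why the copy can be dropped.
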
### Improvement Breakdown (estimated)

| Factor | Contribution |
|--------|-------------|
| Eliminating repeat_kv GPU alloc + copy (per layer) | ~10% |
| Eliminating K^T transpose + contiguous | ~5% |
| Eliminating S×S score matrix alloc | ~5% |
| Fused kernel (1 launch vs 6) | ~5% |

### Concurrent Requests
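
8 concurrent requests, max_batch=4:

| Metric | Result |
|--------|--------|
| Wall clock | 22.5s |
| Sum of individual latencies | 135.0s |
| Scheduling speedup (sum of latencies / wall clock) | **6.0x** |
| Throughput | 11.4 tok/s |

Continuous batching scheduling is confirmed working (decode batch_size=4 in logs). The forward pass is still executed per sequence, so aggregate throughput (11.4 tok/s) stays below the single-request rate (12.9 tok/s); the 6.0x figure measures how much request lifetimes overlap, not a gain in token throughput.
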
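Iteration-level scheduling, and the way the "sum of latencies / wall clock" ratio behaves under it, can be illustrated with a toy simulation. This is a sketch under stated assumptions, not xserv's scheduler: all requests arrive at time zero, every request generates the same number of tokens, each iteration takes one time unit, and `simulate` is a hypothetical name.

```python
from collections import deque


def simulate(num_requests: int, tokens_per_request: int, max_batch: int):
    """Iteration-level scheduling: every running sequence decodes one token
    per iteration, and a finished sequence frees its slot immediately."""
    waiting = deque(range(num_requests))
    remaining = {}     # request id -> tokens still to generate
    finish_step = {}   # request id -> iteration at which it finished
    step = 0
    while waiting or remaining:
        # Admit waiting requests into free slots before each iteration.
        while waiting and len(remaining) < max_batch:
            remaining[waiting.popleft()] = tokens_per_request
        step += 1
        for rid in list(remaining):
            remaining[rid] -= 1
            if remaining[rid] == 0:
                del remaining[rid]
                finish_step[rid] = step
    return step, finish_step


steps, finish = simulate(num_requests=8, tokens_per_request=64, max_batch=4)
# Latency of each request is measured from time zero, so it includes queueing.
ratio = sum(finish.values()) / steps
print(steps, ratio)  # 128 6.0
```

In this idealised model the ratio can exceed `max_batch` because the latencies of queued requests include their waiting time. The simulation says nothing about the measured wall-clock time or throughput.
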
## Remaining Performance Gap

xserv reaches 35% of HF throughput. Main bottlenecks:

| Bottleneck | Impact | Fix |
|-----------|--------|-----|
| **Decode Q_len=1 inefficiency** | FA2 kernel launches 64 threads but only 1 is active (owns_row=true for the single query row) | Specialized decode attention kernel (vector-dot against KV, parallel reduction along S) |
| **No kernel fusion** | RMSNorm+residual, SiLU*up: separate kernels, redundant global-memory reads/writes | Fused kernels (Phase 15) |
| **No CUDA Graphs** | 100+ kernel launches per decode step, each with host-side overhead | Capture decode iteration as CUDA Graph (Phase 15) |
| **Per-seq forward (no batched decode)** | With batch=4, 4 serial forward passes per iteration | Batched projections + per-seq attention (Phase 15, depends on FA2 decode kernel) |
| **No vectorized loads in FA2** | Scalar bf16→f32 conversion in dot product loop | float4 / `__nv_bfloat162` vectorized loads |

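The proposed decode kernel (vector-dot against KV, parallel reduction along S) rests on the fact that softmax attention for a single query can be computed on independent chunks of the sequence and then merged. The sketch below shows that arithmetic in plain Python as a reference for the idea, not the planned CUDA kernel; all function names, the chunk size, and the synthetic inputs are assumptions made for illustration.

```python
import functools
import math


def naive_decode_attention(q, keys, values, scale):
    """Reference: softmax(scale * q·K^T) · V for a single query vector."""
    scores = [scale * sum(a * b for a, b in zip(q, k)) for k in keys]
    m = max(scores)
    weights = [math.exp(s - m) for s in scores]
    total = sum(weights)
    dim = len(values[0])
    return [sum(w * v[d] for w, v in zip(weights, values)) / total for d in range(dim)]


def partial_attention(q, keys, values, scale):
    """One chunk of S: returns (max score, sum of exps, unnormalised output)."""
    scores = [scale * sum(a * b for a, b in zip(q, k)) for k in keys]
    m = max(scores)
    weights = [math.exp(s - m) for s in scores]
    dim = len(values[0])
    acc = [sum(w * v[d] for w, v in zip(weights, values)) for d in range(dim)]
    return m, sum(weights), acc


def merge(a, b):
    """Combine two partial results; associative, so it suits a tree reduction."""
    m_a, l_a, acc_a = a
    m_b, l_b, acc_b = b
    m = max(m_a, m_b)
    s_a, s_b = math.exp(m_a - m), math.exp(m_b - m)  # rescale to the shared max
    return m, l_a * s_a + l_b * s_b, [x * s_a + y * s_b for x, y in zip(acc_a, acc_b)]


def split_decode_attention(q, keys, values, scale, chunk=4):
    """Process S in chunks (one per thread group on a GPU), then reduce."""
    partials = [
        partial_attention(q, keys[i:i + chunk], values[i:i + chunk], scale)
        for i in range(0, len(keys), chunk)
    ]
    _, total, acc = functools.reduce(merge, partials)
    return [x / total for x in acc]


# Synthetic inputs: head_dim=8, S=10 (illustrative only).
dim, seq = 8, 10
q = [0.1 * (d + 1) for d in range(dim)]
keys = [[math.sin(j + d) for d in range(dim)] for j in range(seq)]
values = [[math.cos(j * d) for d in range(dim)] for j in range(seq)]
scale = 1.0 / math.sqrt(dim)

ref = naive_decode_attention(q, keys, values, scale)
out = split_decode_attention(q, keys, values, scale)
print(max(abs(a - b) for a, b in zip(ref, out)) < 1e-9)  # True
```

Because `merge` only needs the running max, the exp-sum, and the unnormalised accumulator, every chunk can be handled by a different group of threads, which addresses the "only 1 active thread" problem in the table.
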
## Memory Usage

| Component | Naive (Phase 11) | FA2 (Phase 14) |
|-----------|-----------------|----------------|
| Score matrix [1, 32, S, S] | S² × 32 × 2B per layer | **0** |
| repeat_kv K/V [1, 32, S, 128] | 2 × S × 32 × 128 × 2B per layer | **0** |
| K^T contiguous copy | S × 32 × 128 × 2B per layer | **0** |

Summing the three rows:

For S=256 (current max): 4 + 4 + 2 = 10 MiB per layer × 36 layers = 360 MiB.
For S=2048: 256 + 32 + 16 = 304 MiB per layer × 36 layers ≈ 10.7 GiB (the naive path would be at risk of OOM).

The ×36 totals assume every layer's buffers are live at once; if buffers are freed after each layer, the peak saving is the per-layer figure.

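The per-layer figures can be checked by evaluating the three formulas from the table directly. This is a small sketch whose result is only as accurate as those formulas; the function name and dictionary keys are hypothetical, and BF16 (2 bytes per element) is taken from the model description.

```python
BYTES_BF16 = 2
NUM_Q_HEADS = 32
HEAD_DIM = 128
NUM_LAYERS = 36
MIB = 1024 * 1024


def naive_extra_bytes_per_layer(seq_len: int) -> dict[str, int]:
    """Buffers the naive path allocates per layer and FA2 does not (table formulas)."""
    return {
        "score_matrix": seq_len * seq_len * NUM_Q_HEADS * BYTES_BF16,
        "repeat_kv": 2 * seq_len * NUM_Q_HEADS * HEAD_DIM * BYTES_BF16,
        "k_transpose_copy": seq_len * NUM_Q_HEADS * HEAD_DIM * BYTES_BF16,
    }


for s in (256, 2048):
    parts = naive_extra_bytes_per_layer(s)
    per_layer = sum(parts.values()) // MIB
    # Upper bound: assumes all 36 layers' buffers are live at the same time.
    print(s, per_layer, per_layer * NUM_LAYERS)
# 256 10 360
# 2048 304 10944
```

The score matrix grows with S² while the other two terms grow linearly, so it dominates at longer sequence lengths.
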
## Tracking

| Phase | Attention | tok/s | vs HF | Correctness |
|-------|-----------|-------|-------|-------------|
| 8 | Naive (no cache) | 2.5 | 5% | 50/50 vs HF |
| 9 | Naive + CPU KV cache | 44.3 (GPT-2) | n/a | 50/50 self |
| 10 | Naive + CPU KV cache | 6.9 (Qwen3-8B) | 15% | 100% top-5 |
| 11 | Naive + GPU KV cache | 10.3 | 30% | 9/10 top-1 |
| **14** | **FA2 + GQA in kernel** | **12.9** | **35%** | **9/10 top-1** |