xserv/docs/benchmarks/phase14-flash-attention.md

# Phase 14 Benchmark: Flash Attention 2

**Date**: 2026-05-22
**Hardware**: RTX 5090 (32GB GDDR7, SM120 CC 12.0, 170 SMs)
**Model**: Qwen3-8B (BF16, 36 layers, 4096 hidden, 32 Q / 8 KV GQA heads, head_dim=128)
**Config**: greedy decoding (temperature=0), max_tokens=64, single-request serial

## Correctness

Logits comparison with HuggingFace transformers (10 prompts, raw text without ChatML):

| Metric | Result |
|--------|--------|
| Prefill Top-1 match vs HF | **9/10 (90%)** |
| Avg Top-5 overlap vs HF | **4.0/5** |
| Result vs pre-FA2 naive attention | **Identical** (same 9/10 top-1, same 4.0/5 overlap) |

The single top-1 mismatch ("Explain quantum computing.") has logits differing by 0.125
(22.000 vs 21.875) — within BF16 precision. The top-5 sets are identical (5/5 overlap).

FA2 introduces no precision degradation compared to the naive attention path.

## API Generation

52 diverse prompts (English, Chinese, code) via `/v1/chat/completions`:

| Metric | Result |
|--------|--------|
| Success rate | **52/52 (100%)** |
| SSE streaming | **Working** (role chunk, content chunks, finish_reason, [DONE]) |
| Usage stats | Correct (prompt_tokens + completion_tokens = total_tokens) |

## Performance

### xserv vs HuggingFace transformers

8 prompts (short/medium/long) × max_tokens=64, greedy:

| Category | Prompt Tokens | xserv (tok/s) | HF (tok/s) | Ratio |
|----------|--------------|---------------|------------|-------|
| Short (~12 tok) | 12-14 | 12.5 | 38.5 | 0.32x |
| Medium (~28 tok) | 27-28 | 13.6 | 44.1 | 0.31x |
| Long (~60 tok) | 58-64 | 13.0 | 36.0 | 0.36x |
| **Overall** | — | **12.9** | **36.6** | **0.35x** |

### Phase-over-Phase Improvement

| Phase | Attention | repeat_kv | tok/s | vs HF |
|-------|-----------|-----------|-------|-------|
| 10 | Naive (O(S²), cuBLAS batched) | CPU round-trip | 6.9 | 15% |
| 11 | Naive + GPU KV cache | GPU repeat_kv | 10.3 | 30% |
| **14** | **FA2 (O(1), fused kernel)** | **None (GQA in kernel)** | **12.9** | **35%** |

Phase 14 vs Phase 11: **+25% throughput** (10.3 → 12.9 tok/s).

### Improvement Breakdown (estimated)

| Factor | Contribution |
|--------|-------------|
| Eliminating repeat_kv GPU alloc + copy (per layer) | ~10% |
| Eliminating K^T transpose + contiguous | ~5% |
| Eliminating S×S score matrix alloc | ~5% |
| Fused kernel (1 launch vs 6) | ~5% |

### Concurrent Requests

8 concurrent requests, max_batch=4:

| Metric | Result |
|--------|--------|
| Wall clock | 22.5s |
| Sum of individual latencies | 135.0s |
| Scheduling speedup | **6.0x** |
| Throughput | 11.4 tok/s |

Continuous batching scheduling confirmed working (decode batch_size=4 in logs).

## Remaining Performance Gap

35% of HF throughput. Main bottlenecks:

| Bottleneck | Impact | Fix |
|-----------|--------|-----|
| **Decode Q_len=1 inefficiency** | FA2 kernel: 64 threads, only 1 active (owns_row=true for single query) | Specialized decode attention kernel (vector-dot against KV, parallel reduction along S) |
| **No kernel fusion** | RMSNorm+residual, SiLU*up: separate kernels, redundant HBM reads/writes | Fused kernels (Phase 15) |
| **No CUDA Graphs** | ~100+ kernel launches per decode step, each has host-side overhead | Capture decode iteration as CUDA Graph (Phase 15) |
| **Per-seq forward (no batched decode)** | With batch=4, 4 serial forward passes per iteration | Batched projections + per-seq attention (Phase 15, depends on FA2 decode kernel) |
| **No vectorized loads in FA2** | Scalar bf16→f32 conversion in dot product loop | float4 / bfloat162 vectorized loads |

## Memory Usage

| Component | Naive (Phase 11) | FA2 (Phase 14) |
|-----------|-----------------|----------------|
| Score matrix [1, 32, S, S] | S² × 32 × 2B | **0** |
| repeat_kv K/V [1, 32, S, 128] | 2 × S × 32 × 128 × 2B per layer | **0** |
| K^T contiguous copy | S × 32 × 128 × 2B per layer | **0** |

For S=256 (current max): savings ~6 MB per layer × 36 layers ≈ 216 MB.
For S=2048: savings ~384 MB per layer × 36 layers ≈ 13.5 GB (naive would OOM).

## Tracking

| Phase | Attention | tok/s | vs HF | Correctness |
|-------|-----------|-------|-------|-------------|
| 8 | Naive (no cache) | 2.5 | 5% | 50/50 vs HF |
| 9 | Naive + CPU KV cache | 44.3 (GPT-2) | — | 50/50 self |
| 10 | Naive + CPU KV cache | 6.9 (Qwen3-8B) | 15% | 100% top-5 |
| 11 | Naive + GPU KV cache | 10.3 | 30% | 9/10 top-1 |
| **14** | **FA2 + GQA in kernel** | **12.9** | **35%** | **9/10 top-1** |