Files

Gahow Wang 6cc1c9332d docs: Phase 14 design doc + benchmark, fix Phase 11/12 honesty

Phase 14 (Flash Attention):
- Design doc: FA2 algorithm, SM120 hardware constraints (FA4 incompatible),
  kernel config (BR=BC=64, 32KB smem), GQA mapping, causal tile-skip,
  known limitations and optimization roadmap
- Benchmark doc: correctness (9/10 top-1 match, identical to pre-FA baseline),
  performance tracking (6.9→10.3→12.9 tok/s across phases), memory savings
  analysis, remaining bottleneck breakdown

Phase 11 doc: title corrected from "Paged Attention" to "GPU-Resident KV Cache"
with explicit note that paged allocation was not implemented.

Phase 12 doc: "当前状态" updated from "未实现" to reflect actual state —
iteration-level scheduling implemented + verified (6.0x concurrent speedup),
batched GPU forward explicitly marked as not yet implemented.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

2026-05-22 18:51:29 +08:00

4.4 KiB

Raw Permalink Blame History

Phase 14 Benchmark: Flash Attention 2

Date: 2026-05-22 Hardware: RTX 5090 (32GB GDDR7, SM120 CC 12.0, 170 SMs) Model: Qwen3-8B (BF16, 36 layers, 4096 hidden, 32 Q / 8 KV GQA heads, head_dim=128) Config: greedy decoding (temperature=0), max_tokens=64, single-request serial

Correctness

Logits comparison with HuggingFace transformers (10 prompts, raw text without ChatML):

Metric	Result
Prefill Top-1 match vs HF	9/10 (90%)
Avg Top-5 overlap vs HF	4.0/5
Result vs pre-FA2 naive attention	Identical (same 9/10 top-1, same 4.0/5 overlap)

The single top-1 mismatch ("Explain quantum computing.") has logits differing by 0.125 (22.000 vs 21.875) — within BF16 precision. The top-5 sets are identical (5/5 overlap).

FA2 introduces no precision degradation compared to the naive attention path.

API Generation

52 diverse prompts (English, Chinese, code) via /v1/chat/completions:

Metric	Result
Success rate	52/52 (100%)
SSE streaming	Working (role chunk, content chunks, finish_reason, [DONE])
Usage stats	Correct (prompt_tokens + completion_tokens = total_tokens)

Performance

xserv vs HuggingFace transformers

8 prompts (short/medium/long) × max_tokens=64, greedy:

Category	Prompt Tokens	xserv (tok/s)	HF (tok/s)	Ratio
Short (~12 tok)	12-14	12.5	38.5	0.32x
Medium (~28 tok)	27-28	13.6	44.1	0.31x
Long (~60 tok)	58-64	13.0	36.0	0.36x
Overall	—	12.9	36.6	0.35x

Phase-over-Phase Improvement

Phase	Attention	repeat_kv	tok/s	vs HF
10	Naive (O(S²), cuBLAS batched)	CPU round-trip	6.9	15%
11	Naive + GPU KV cache	GPU repeat_kv	10.3	30%
14	FA2 (O(1), fused kernel)	None (GQA in kernel)	12.9	35%

Phase 14 vs Phase 11: +25% throughput (10.3 → 12.9 tok/s).

Improvement Breakdown (estimated)

Factor	Contribution
Eliminating repeat_kv GPU alloc + copy (per layer)	~10%
Eliminating K^T transpose + contiguous	~5%
Eliminating S×S score matrix alloc	~5%
Fused kernel (1 launch vs 6)	~5%

Concurrent Requests

8 concurrent requests, max_batch=4:

Metric	Result
Wall clock	22.5s
Sum of individual latencies	135.0s
Scheduling speedup	6.0x
Throughput	11.4 tok/s

Continuous batching scheduling confirmed working (decode batch_size=4 in logs).

Remaining Performance Gap

35% of HF throughput. Main bottlenecks:

Bottleneck	Impact	Fix
Decode Q_len=1 inefficiency	FA2 kernel: 64 threads, only 1 active (owns_row=true for single query)	Specialized decode attention kernel (vector-dot against KV, parallel reduction along S)
No kernel fusion	RMSNorm+residual, SiLU*up: separate kernels, redundant HBM reads/writes	Fused kernels (Phase 15)
No CUDA Graphs	~100+ kernel launches per decode step, each has host-side overhead	Capture decode iteration as CUDA Graph (Phase 15)
Per-seq forward (no batched decode)	With batch=4, 4 serial forward passes per iteration	Batched projections + per-seq attention (Phase 15, depends on FA2 decode kernel)
No vectorized loads in FA2	Scalar bf16→f32 conversion in dot product loop	float4 / bfloat162 vectorized loads

Memory Usage

Component	Naive (Phase 11)	FA2 (Phase 14)
Score matrix [1, 32, S, S]	S² × 32 × 2B	0
repeat_kv K/V [1, 32, S, 128]	2 × S × 32 × 128 × 2B per layer	0
K^T contiguous copy	S × 32 × 128 × 2B per layer	0

For S=256 (current max): savings ~6 MB per layer × 36 layers ≈ 216 MB. For S=2048: savings ~384 MB per layer × 36 layers ≈ 13.5 GB (naive would OOM).

Tracking

Phase	Attention	tok/s	vs HF	Correctness
8	Naive (no cache)	2.5	5%	50/50 vs HF
9	Naive + CPU KV cache	44.3 (GPT-2)	—	50/50 self
10	Naive + CPU KV cache	6.9 (Qwen3-8B)	15%	100% top-5
11	Naive + GPU KV cache	10.3	30%	9/10 top-1
14	FA2 + GQA in kernel	12.9	35%	9/10 top-1

4.4 KiB Raw Permalink Blame History Unescape Escape