docs: Phase 21 — decode CUDA graph + GPU argmax results

dash5, gpt-oss-20b FP8, warm-server vs llama.cpp MXFP4 (6 reps):
TP=2 TPOT 5.76-5.89 vs 7.42-8.45 ms (xserv 1.26-1.47x), TTFT 2.4x
ahead short/medium; TP=1 5.78-5.95 vs 2.80-3.22 ms (gap 2.5x -> 2.0x,
TTFT now ahead short/medium). GSM8K-50 through the graph path: 94%.
Lesson recorded: graphs bought ~0.6 ms (launches were already hidden
by async execution), the GPU argmax ~1 ms — measure, don't guess.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
2026-06-12 20:12:37 +08:00
parent 8414f8d1e6
commit 013465fc06
4 changed files with 149 additions and 20 deletions


@@ -78,19 +78,34 @@ reasons, both instructive:
noise; W8A16 removes activation-quantization error so ≥ dense is expected).
Avg 1.3 s/problem also reflects the decode speedup.
## Remaining gaps / next levers (to catch llama TP=1 at 2.9 ms)
## Phase 21 update: decode CUDA graph + GPU argmax (docs/21-cuda-graph-decode.md)
Sparse MoE removed the dominant cost; the residual ~7 ms splits roughly into
~3 ms HBM reads and ~4 ms fixed overhead. In impact order:
The whole batch=1 decode step now replays as one CUDA graph, and greedy
sampling uses the GPU argmax kernel (4-byte D2H instead of a 402 KB logits
copy + 201k-element host scan). In-process A/B: the graph saves ~0.6 ms and
the GPU argmax ~1.0 ms. Warm-server head-to-head (same harness/GPUs, 6 reps):
1. **CUDA graphs for decode** (~2-4 ms): with experts down to ~1-2 ms, the
~200 un-graphed launches/token are now the single largest cost. (The old
"graphs ≈ useless" conclusion was relative to a 13 ms dense TPOT — no
longer true.)
2. **Quantize non-expert weights** (~1-1.5 ms): attn qkv/o + the 1.16 GB BF16
| | xserv FP8 (graph) | llama MXFP4 | |
|---|---|---|---|
| TP=2 TPOT | **5.76-5.89 ms** (170-174 tok/s) | 7.42-8.45 ms | **xserv 1.26-1.47×** |
| TP=2 TTFT s/m/l | **25 / 28 / 51 ms** | 63 / 66 / 45 ms | xserv 2.4× s/m; long ~par |
| TP=1 TPOT | 5.78-5.95 ms | **2.80-3.22 ms** | llama 2.0× (was 2.5×) |
| TP=1 TTFT s/m | **32 / 35 ms** | 34 / 36 ms | xserv slightly ahead |
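
As a cross-check on the derived columns, the short sketch below recomputes tokens per second and the ratio ranges from the TPOT figures in the table. The helper names are illustrative only, and the pairing of range endpoints (slowest against fastest) is an assumption about how the quoted ratios were formed.

```python
# TPOT ranges copied from the table above, in milliseconds per output token.
xserv_tp2 = (5.76, 5.89)
llama_tp2 = (7.42, 8.45)
xserv_tp1 = (5.78, 5.95)
llama_tp1 = (2.80, 3.22)


def tokens_per_second(tpot_ms):
    """Convert time per output token (ms) to throughput (tok/s)."""
    return 1000.0 / tpot_ms


def ratio_range(slow, fast):
    """Smallest and largest ratio obtainable from two (min, max) ranges."""
    return slow[0] / fast[1], slow[1] / fast[0]


# Slowest TPOT gives the lowest throughput, so reverse the range first.
tp2_toks = tuple(round(tokens_per_second(t)) for t in reversed(xserv_tp2))
tp2_speedup = tuple(round(r, 2) for r in ratio_range(llama_tp2, xserv_tp2))
tp1_gap = tuple(round(r, 2) for r in ratio_range(xserv_tp1, llama_tp1))

print(tp2_toks)     # (170, 174)
print(tp2_speedup)  # (1.26, 1.47)
print(tp1_gap)
```

Under this pairing the TP=2 row reproduces exactly, and the TP=1 gap comes out as a range of roughly 1.8× to 2.1×, which brackets the rounded 2.0× in the table.
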
GSM8K-50 through the graph path: 47/50 = 94% (unchanged). Note: GPU argmax
breaks exact-tie logits differently than the host scan, so greedy trajectories
can legitimately diverge at a tie token.
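
To illustrate the tie note, here is a minimal host-side sketch in Python rather than the actual kernel. `host_scan_argmax` and `tree_argmax` are hypothetical names, and the tie rule inside the tree reduction is an assumption chosen to show the effect; the real GPU kernel's rule is not described here.

```python
def host_scan_argmax(logits):
    """Sequential scan; strict '>' keeps the first maximum on a tie."""
    best = 0
    for i in range(1, len(logits)):
        if logits[i] > logits[best]:
            best = i
    return best


def tree_argmax(logits):
    """Pairwise tree reduction, shaped like a GPU block reduce.

    Assumed tie rule: '>=' lets the right-hand candidate win.
    """
    idx = list(range(len(logits)))
    while len(idx) > 1:
        nxt = []
        for j in range(0, len(idx) - 1, 2):
            a, b = idx[j], idx[j + 1]
            nxt.append(b if logits[b] >= logits[a] else a)
        if len(idx) % 2:
            nxt.append(idx[-1])  # odd element passes through unchanged
        idx = nxt
    return idx[0]


logits = [0.5, 2.0, 1.0, 2.0, -1.0]  # exact tie between index 1 and index 3
print(host_scan_argmax(logits))  # 1
print(tree_argmax(logits))       # 3
```

Both results point at the same maximum logit, so either token is a valid greedy choice; the two paths only agree token-for-token when the maximum is unique.
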
## Remaining gaps / next levers (to catch llama TP=1 at 2.8 ms)
Per-token fixed overhead is now mostly gone; the residual ~5.8 ms is
dominated by HBM bytes and kernel efficiency. In impact order:
1. **Quantize non-expert weights** (~1.5 ms): attn qkv/o + the 1.16 GB BF16
lm_head read every token; FP8/MXFP4 them like llama quantizes everything.
2. **GEMV/attention bandwidth tuning**: effective BW of the hand GEMVs is
well under peak; llama's 2.8 ms implies ~85%+ efficiency on ~1.3 GB.
3. **Sparse prefill** (permute tokens by expert + grouped GEMM): long-prompt
TTFT 94-120 ms → llama's ~30 ms territory.
TTFT 51-75 ms → llama's ~30 ms territory.
4. **W4A4 FP4 tensor cores / bandwidth-tuned MXFP4 GEMV**: make 4-bit experts
actually beat FP8 (today sparse MXFP4 is 8.4 ms vs FP8 7.6 ms — the 4-bit
GEMV's lower effective bandwidth still cancels its byte advantage).
actually beat FP8.
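
The bandwidth reasoning behind levers 1 and 2 can be sketched as simple arithmetic. The figures below come from the list above (2.8 ms on ~1.3 GB, ~85% efficiency, a 1.16 GB BF16 lm_head); the implied peak bandwidth is derived from those numbers rather than measured, and the function names are illustrative.

```python
GB = 1e9  # decimal gigabytes, matching the sizes quoted above


def effective_bandwidth_gbps(bytes_read, time_ms):
    """Bandwidth a kernel sequence achieves if it reads bytes_read in time_ms."""
    return bytes_read / GB / (time_ms / 1e3)


def read_time_ms(bytes_read, bandwidth_gbps):
    """Time needed just to stream bytes_read at the given bandwidth."""
    return bytes_read / GB / bandwidth_gbps * 1e3


# llama.cpp TP=1: ~1.3 GB read per token in 2.8 ms.
llama_bw = effective_bandwidth_gbps(1.3 * GB, 2.8)

# Assumption: 85% efficiency, so the implied peak is llama_bw / 0.85.
implied_peak = llama_bw / 0.85

# Halving the BF16 lm_head to 8 bits removes 0.58 GB of reads per token.
lm_head_bf16 = 1.16 * GB
lm_head_fp8 = lm_head_bf16 / 2
saving_at_peak_ms = read_time_ms(lm_head_bf16 - lm_head_fp8, implied_peak)

print(f"llama effective BW  ~{llama_bw:.0f} GB/s")
print(f"implied peak BW     ~{implied_peak:.0f} GB/s")
print(f"lm_head FP8 saving  ~{saving_at_peak_ms:.2f} ms at peak")
```

Under these assumptions the lm_head alone accounts for about 1 ms even at peak bandwidth, which is in line with the ~1.5 ms estimate once attn qkv/o is included and the GEMVs run below peak.
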