docs: Phase 21 — decode CUDA graph + GPU argmax results

dash5, gpt-oss-20b FP8, warm-server vs llama.cpp MXFP4 (6 reps):
TP=2: TPOT 5.76-5.89 vs 7.42-8.45 ms (xserv 1.26-1.47x), TTFT 2.4x ahead short/medium.
TP=1: TPOT 5.78-5.95 vs 2.80-3.22 ms (gap 2.5x -> 2.0x), TTFT now ahead short/medium.

GSM8K-50 through the graph path: 94%.

Lesson recorded: graphs bought ~0.6 ms (launches were already hidden by async execution), the GPU argmax ~1 ms. Measure, don't guess.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
2026-06-12 20:12:37 +08:00
parent 8414f8d1e6
commit 013465fc06
4 changed files with 149 additions and 20 deletions
--- a/docs/benchmarks/sparse-moe.md
+++ b/docs/benchmarks/sparse-moe.md
@@ -78,19 +78,34 @@ reasons, both instructive:
  noise; W8A16 removes activation-quantization error so ≥ dense is expected).
  Avg 1.3 s/problem also reflects the decode speedup.

-## Remaining gaps / next levers (to catch llama TP=1 at 2.9 ms)
+## Phase 21 update: decode CUDA graph + GPU argmax (docs/21-cuda-graph-decode.md)

-Sparse MoE removed the dominant cost; the residual ~7 ms splits roughly into
-~3 ms HBM reads and ~4 ms fixed overhead. In impact order:
+The whole batch=1 decode step now replays as one CUDA graph, and greedy
+sampling uses the GPU argmax kernel (4-byte D2H instead of a 402 KB logits
+copy + 201k-element host scan). In-process A/B: graph −0.6 ms, GPU argmax
+−1.0 ms. Warm-server head-to-head (same harness/GPUs, 6 reps):

-1. **CUDA graphs for decode** (~2–4 ms): with experts down to ~1–2 ms, the
-   ~200 un-graphed launches/token are now the single largest cost. (The old
-   "graphs ≈ useless" conclusion was relative to a 13 ms dense TPOT — no
-   longer true.)
-2. **Quantize non-expert weights** (~1–1.5 ms): attn qkv/o + the 1.16 GB BF16
+| | xserv FP8 (graph) | llama MXFP4 | |
+|---|---|---|---|
+| TP=2 TPOT | **5.76–5.89 ms** (170–174 tok/s) | 7.42–8.45 ms | **xserv 1.26–1.47×** |
+| TP=2 TTFT s/m/l | **25 / 28 / 51 ms** | 63 / 66 / 45 ms | xserv ~2.4× s/m; llama ~1.1× long |
+| TP=1 TPOT | 5.78–5.95 ms | **2.80–3.22 ms** | llama 2.0× (was 2.5×) |
+| TP=1 TTFT s/m | **32 / 35 ms** | 34 / 36 ms | xserv slightly ahead |
+
+GSM8K-50 through the graph path: 47/50 = 94% (unchanged). Note: the GPU argmax
+resolves exact ties between logits differently than the host scan, so greedy
+trajectories can legitimately diverge at a token where the top logits tie.
+
+## Remaining gaps / next levers (to catch llama TP=1 at 2.8 ms)
+
+Per-token fixed overhead is now mostly gone; the residual ~5.8 ms is
+dominated by HBM bytes and kernel efficiency. In impact order:
+
+1. **Quantize non-expert weights** (~1.5 ms): attn qkv/o + the 1.16 GB BF16
   lm_head read every token; FP8/MXFP4 them like llama quantizes everything.
+2. **GEMV/attention bandwidth tuning**: effective bandwidth of the hand-written
+   GEMVs is well under peak; llama's 2.8 ms implies ~85%+ efficiency on ~1.3 GB.
 3. **Sparse prefill** (permute tokens by expert + grouped GEMM): long-prompt
-   TTFT 94–120 ms → llama's ~30 ms territory.
+   TTFT 51–75 ms → llama's ~30 ms territory.
 4. **W4A4 FP4 tensor cores / bandwidth-tuned MXFP4 GEMV**: make 4-bit experts
-   actually beat FP8 (today sparse MXFP4 is 8.4 ms vs FP8 7.6 ms — the 4-bit
-   GEMV's lower effective bandwidth still cancels its byte advantage).
+   actually beat FP8.
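
The Phase 21 note on argmax tie-breaking can be illustrated with a small host-side sketch. It is not the kernel from this commit: the diff does not state which tie rule the GPU argmax uses, so the `>=` comparator in `tree_reduce_argmax` below is a hypothetical choice, and both function names are made up for illustration. The point is only that a sequential scan and a pairwise reduction can agree on the maximum value yet return different indices when two logits are exactly equal.

```python
from typing import List, Sequence, Tuple


def host_scan_argmax(logits: Sequence[float]) -> int:
    """Sequential scan with strict '>': the lowest index wins an exact tie.

    Assumes a non-empty input.
    """
    best = 0
    for i in range(1, len(logits)):
        if logits[i] > logits[best]:
            best = i
    return best


def tree_reduce_argmax(logits: Sequence[float]) -> int:
    """Pairwise tree reduction, the shape a parallel kernel might use.

    Hypothetical tie rule: on equal values keep the right-hand candidate
    ('>='). A real kernel may use a different rule. Assumes non-empty input.
    """
    cands: List[Tuple[float, int]] = [(v, i) for i, v in enumerate(logits)]
    while len(cands) > 1:
        nxt: List[Tuple[float, int]] = []
        # Combine neighbours pairwise, as one reduction level would.
        for j in range(0, len(cands) - 1, 2):
            left, right = cands[j], cands[j + 1]
            nxt.append(right if right[0] >= left[0] else left)
        # An odd leftover candidate passes through to the next level.
        if len(cands) % 2:
            nxt.append(cands[-1])
        cands = nxt
    return cands[0][1]


if __name__ == "__main__":
    tied = [0.1, 2.5, 0.3, 2.5, -1.0]   # indices 1 and 3 tie exactly
    print(host_scan_argmax(tied), tree_reduce_argmax(tied))
```

With distinct logits the two functions return the same index, so accuracy metrics such as the GSM8K score should be unaffected; only sequences that hit an exact tie can take a different greedy path.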