docs: Phase 21 — decode CUDA graph + GPU argmax results
dash5, gpt-oss-20b FP8, warm-server vs llama.cpp MXFP4 (6 reps):
- TP=2: TPOT 5.76-5.89 vs 7.42-8.45 ms (xserv 1.26-1.47x), TTFT 2.4x ahead short/medium
- TP=1: TPOT 5.78-5.95 vs 2.80-3.22 ms (gap 2.5x -> 2.0x, TTFT now ahead short/medium)

GSM8K-50 through the graph path: 94%.

Lesson recorded: graphs bought ~0.6 ms (launches were already hidden by async execution), the GPU argmax ~1 ms. Measure, don't guess.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
@@ -78,19 +78,34 @@ reasons, both instructive:
 noise; W8A16 removes activation-quantization error so ≥ dense is expected).
 Avg 1.3 s/problem also reflects the decode speedup.
 
-## Remaining gaps / next levers (to catch llama TP=1 at 2.9 ms)
+## Phase 21 update: decode CUDA graph + GPU argmax (docs/21-cuda-graph-decode.md)
 
-Sparse MoE removed the dominant cost; the residual ~7 ms splits roughly into
-~3 ms HBM reads and ~4 ms fixed overhead. In impact order:
+The whole batch=1 decode step now replays as one CUDA graph, and greedy
+sampling uses the GPU argmax kernel (4-byte D2H instead of a 402 KB logits
+copy + 201k-element host scan). In-process A/B: graph −0.6 ms, GPU argmax
+−1.0 ms. Warm-server head-to-head (same harness/GPUs, 6 reps):
 
-1. **CUDA graphs for decode** (~2–4 ms): with experts down to ~1–2 ms, the
-~200 un-graphed launches/token are now the single largest cost. (The old
-"graphs ≈ useless" conclusion was relative to a 13 ms dense TPOT — no
-longer true.)
-2. **Quantize non-expert weights** (~1–1.5 ms): attn qkv/o + the 1.16 GB BF16
+| | xserv FP8 (graph) | llama MXFP4 | |
+|---|---|---|---|
+| TP=2 TPOT | **5.76–5.89 ms** (170–174 tok/s) | 7.42–8.45 ms | **xserv 1.26–1.47×** |
+| TP=2 TTFT s/m/l | **25 / 28 / 51 ms** | 63 / 66 / 45 ms | xserv 2.4× s/m; long ~par |
+| TP=1 TPOT | 5.78–5.95 ms | **2.80–3.22 ms** | llama 2.0× (was 2.5×) |
+| TP=1 TTFT s/m | **32 / 35 ms** | 34 / 36 ms | xserv slightly ahead |
+
+GSM8K-50 through the graph path: 47/50 = 94% (unchanged). Note: GPU argmax
+breaks exact-tie logits differently than the host scan, so greedy trajectories
+can legitimately diverge at a tie token.
+
+## Remaining gaps / next levers (to catch llama TP=1 at 2.8 ms)
+
+Per-token fixed overhead is now mostly gone; the residual ~5.8 ms is
+dominated by HBM bytes and kernel efficiency. In impact order:
+
+1. **Quantize non-expert weights** (~1.5 ms): attn qkv/o + the 1.16 GB BF16
 lm_head read every token; FP8/MXFP4 them like llama quantizes everything.
+2. **GEMV/attention bandwidth tuning**: effective BW of the hand GEMVs is
+well under peak; llama's 2.8 ms implies ~85%+ efficiency on ~1.3 GB.
 3. **Sparse prefill** (permute tokens by expert + grouped GEMM): long-prompt
-TTFT 94–120 ms → llama's ~30 ms territory.
+TTFT 51–75 ms → llama's ~30 ms territory.
 4. **W4A4 FP4 tensor cores / bandwidth-tuned MXFP4 GEMV**: make 4-bit experts
-actually beat FP8 (today sparse MXFP4 is 8.4 ms vs FP8 7.6 ms — the 4-bit
-GEMV's lower effective bandwidth still cancels its byte advantage).
+actually beat FP8.
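
The note in the added text about exact-tie logits can be made concrete with a small CPU-side simulation. This is a sketch only: the source does not show the kernel, so `parallel_argmax`, its strided scan, its tree reduction, and the `prefer_lower_index` flag are all hypothetical stand-ins for whatever the real GPU argmax does. It illustrates how a reduction that keeps its current candidate on a tie can return a different index than a left-to-right host scan.

```python
def host_argmax(logits):
    """Left-to-right scan; on an exact tie the lowest index wins."""
    best = 0
    for i in range(1, len(logits)):
        if logits[i] > logits[best]:
            best = i
    return best


def parallel_argmax(logits, num_threads=4, prefer_lower_index=False):
    """Hypothetical GPU-style argmax, simulated sequentially.

    Phase 1: each simulated thread scans a strided slice of the logits.
    Phase 2: a tree reduction merges the per-thread candidates.
    """
    # Assumptions of this sketch: power-of-two thread count, enough elements.
    assert num_threads & (num_threads - 1) == 0
    assert len(logits) >= num_threads

    # Phase 1: thread t looks at indices t, t + T, t + 2T, ...
    partial = []
    for t in range(num_threads):
        best = t
        for i in range(t + num_threads, len(logits), num_threads):
            if logits[i] > logits[best]:
                best = i
        partial.append(best)

    # Phase 2: pairwise merge, halving the active thread count each round.
    stride = num_threads // 2
    while stride > 0:
        for t in range(stride):
            mine, other = partial[t], partial[t + stride]
            take_other = logits[other] > logits[mine]
            if prefer_lower_index and logits[other] == logits[mine]:
                take_other = other < mine
            if take_other:
                partial[t] = other
        stride //= 2
    return partial[0]


# Two exactly equal maxima, at indices 3 and 4.
logits = [0.0, 1.0, 0.5, 3.0, 3.0, 0.2, 0.1, 0.3]

host_pick = host_argmax(logits)
gpu_pick = parallel_argmax(logits)
gpu_pick_stable = parallel_argmax(logits, prefer_lower_index=True)
```

Here the host scan and the plain reduction disagree on which of the tied tokens to emit, while comparing on (value, index) makes the reduction independent of thread layout. The commit treats the divergence as legitimate and does not claim to use such a tie-break.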
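
The headline figures quoted in the commit message and table can be re-derived from the raw ranges. The sketch below assumes the conservative pairing for the speedup range (llama.cpp's best against xserv's worst, and the reverse) and assumes 2 bytes per logit, which is what 402 KB over 201k elements implies; neither assumption is stated in the source, and the variable names are illustrative.

```python
def tok_per_s(tpot_ms):
    """Convert time per output token (ms) into tokens per second."""
    return 1000.0 / tpot_ms


# TP=2 TPOT ranges in ms, taken from the table.
xserv_tp2 = (5.76, 5.89)
llama_tp2 = (7.42, 8.45)

# Throughput implied by the xserv TPOT range.
xserv_tok_s = (tok_per_s(xserv_tp2[1]), tok_per_s(xserv_tp2[0]))

# Speedup range: smallest ratio pairs llama's best with xserv's worst.
speedup_lo = llama_tp2[0] / xserv_tp2[1]
speedup_hi = llama_tp2[1] / xserv_tp2[0]

# Device-to-host traffic per token for greedy sampling.
vocab_elems = 201_000      # "201k-element" from the source
bytes_per_logit = 2        # assumption: 16-bit logits
logits_copy_bytes = vocab_elems * bytes_per_logit
argmax_copy_bytes = 4      # one token id, per the source
d2h_reduction = logits_copy_bytes / argmax_copy_bytes

# Bandwidth llama.cpp must sustain to read ~1.3 GB in 2.8 ms.
llama_gb_per_s = 1.3 / 2.8e-3

print(f"xserv TP=2: {xserv_tok_s[0]:.0f}-{xserv_tok_s[1]:.0f} tok/s")
print(f"speedup: {speedup_lo:.2f}x-{speedup_hi:.2f}x")
print(f"D2H bytes: {logits_copy_bytes} -> {argmax_copy_bytes}")
print(f"llama effective bandwidth: {llama_gb_per_s:.0f} GB/s")
```

The results line up with the table's 170–174 tok/s and 1.26–1.47× entries, and the last figure shows the sustained read rate behind the "~85%+ efficiency" estimate without assuming a particular GPU's peak bandwidth.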