xserv

Gahow Wang 013465fc06 docs: Phase 21 — decode CUDA graph + GPU argmax results

dash5, gpt-oss-20b FP8, warm-server vs llama.cpp MXFP4 (6 reps):
TP=2 TPOT 5.76-5.89 vs 7.42-8.45 ms (xserv 1.26-1.47x), TTFT 2.4x
ahead short/medium; TP=1 5.78-5.95 vs 2.80-3.22 ms (gap 2.5x -> 2.0x,
TTFT now ahead short/medium). GSM8K-50 through the graph path: 94%.
Lesson recorded: graphs bought ~0.6 ms (launches were already hidden
by async execution), the GPU argmax ~1 ms — measure, don't guess.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

2026-06-12 20:12:37 +08:00

fp8-quantization.md

quantization: single strided-batched FP8 MoE GEMM — cut per-token launches ~768→48

2026-06-12 01:23:29 +08:00

llama-cpp-comparison.md

xserv-chat: support gpt-oss-20b with TP; fix GEMV precision bug

2026-06-02 00:58:10 +08:00

mxfp4-and-llama-decode.md

quantization: MXFP4 W4A16 expert weights (memory-optimization foundation)

2026-06-12 15:01:42 +08:00

phase8-gpt2-baseline.md

phase 8: add benchmark framework + baseline results