dash5, gpt-oss-20b FP8, warm-server vs llama.cpp MXFP4 (6 reps):
TP=2: TPOT 5.76-5.89 vs 7.42-8.45 ms (xserv 1.26-1.47x faster), TTFT 2.4x
ahead at short/medium. TP=1: TPOT 5.78-5.95 vs 2.80-3.22 ms (xserv still
behind; gap narrowed 2.5x -> 2.0x), TTFT now ahead at short/medium.
GSM8K-50 through the graph path: 94%.
Lesson recorded: graphs bought ~0.6 ms (launches were already hidden
by async execution), while the GPU argmax bought ~1 ms. Measure, don't guess.
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
- docs/19-gpt-oss-moe.md: the numbered series jumped 18->20; write up
gpt-oss arch deltas, harmony pitfalls, and the two CUDA debugging
postmortems (fully-masked-tile NaN in flash-attention sinks;
pre-__syncthreads early return reading uninitialized smem in the
decode GEMV) — the highest-value learning content of that phase.
- README: models/perf/capabilities were frozen at the Qwen3-only era;
now lists gpt-oss MoE, TP/PP, FP8/MXFP4, sparse MoE, and the
llama.cpp standing.
- Roadmap: record where reality diverged from the plan at Phase 18+,
add milestone entries and the ranked next-phase candidates
(21 CUDA-graph MoE decode, 22 non-expert quant, 23 sparse prefill).
- sparse-moe benchmark doc: update to the post-review-fix numbers.
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
Project intro, architecture, build, basic usage (HTTP server / CLI / bench),
and the llama.cpp comparison workflow.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>