- Add ChatModel enum dispatching between Qwen3 and GptOss based on config.is_moe(), following the TP engine pattern. - Add --tp N flag for tensor-parallel inference (required for 39GB gpt-oss-20b which doesn't fit on a single 32GB GPU). - Add gpt-oss harmony chat template with channel/message format. - Replace hardcoded is_stop_token() with tokenizer.is_eos() for multi-model EOS support. - Restore gpt-oss hardcoded prompt template in server api.rs, lost during the Jinja template refactor. - Fix GEMV race condition: the K-split kernel zeroed the FP32 accumulator inside the kernel (block k=0) while other blocks atomicAdd'd concurrently. Pre-zero with cudaMemsetAsync instead. - Update benchmark docs with post-fix results. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
5.7 KiB
Benchmark: xserv vs llama.cpp (Qwen3-8B)
What this adds. A standing baseline that compares xserv against llama.cpp on both response quality (correctness) and performance (TTFT / TPOT / throughput), using the same model weights and standard public datasets. This replaces HF transformers as our reference point — xserv already beat HF, so it is no longer a useful performance bar.
- Baseline engine: llama.cpp, vendored as a submodule pinned to
b9371, built with CUDA for SM120 (RTX 5090). - Same weights: the Qwen3-8B safetensors are converted to a BF16 GGUF
(
convert_hf_to_gguf.py --outtype bf16) — no quantization, so the comparison is apples-to-apples. - Standard quality datasets: AIME 2025 (30 competition-math problems, exact-match boxed integer) and GSM8K (grade-school math, exact-match).
- Black-box HTTP: both engines are driven through the OpenAI-compatible streaming API; the driver measures TTFT/TPOT/throughput and scores answers.
See docs/16-llama-cpp-comparison.md for the design and tools/bench/ for the
driver. One-click: tools/sync-and-build.sh bench.
How it runs
The GPU host (dash5) has no outbound network, so datasets are fetched locally
(tools/bench/fetch_datasets.py) into JSON and the llama.cpp source is shipped
over with the project; everything builds and runs on the GPU host. The driver
runs one engine at a time (two BF16 8B models do not co-reside on a 32GB
GPU, and a resident idle engine would distort the other's numbers).
Generation mode is matched: xserv hardcodes Qwen3 thinking off, so the
driver sends chat_template_kwargs={enable_thinking:false} to llama.cpp.
Results (RTX 5090, BF16, greedy, 8192 ctx, max_batch 4)
Performance — llama.cpp is the stronger baseline
| scenario | metric | xserv | llama.cpp | xserv ÷ llama.cpp |
|---|---|---|---|---|
| single / medium | TTFT p50 (ms) | 28.0 | 17.7 | 0.63× |
| single / medium | TPOT p50 (ms/tok) | 17.5 | 10.4 | 0.60× |
| single / medium | throughput (tok/s) | 56.6 | 95.1 | 0.60× |
| concurrent-4 | throughput (tok/s) | 135.2 | 317.1 | 0.43× |
| concurrent-8 | throughput (tok/s) | 135.5 | 322.5 | 0.42× |
xserv runs at ~0.42–0.60× llama.cpp. It saturates at max_batch (~135 tok/s)
while llama.cpp keeps scaling under load (~322 tok/s). This is the honest new bar.
The ratio is the same at 4096 and 8192 — TPOT is bandwidth-bound, not
context-bound at these sizes.
Quality — parity, confirming xserv's numerical fidelity
| task | n | xserv | llama.cpp |
|---|---|---|---|
| GSM8K | 50 | 100.0% (50/50) | 96.0% (48/50) |
| AIME 2025 | 30 | 16.7% (5/30) | 23.3% (7/30) |
With equal context the two engines land at comparable AIME accuracy (within the ±2-problem greedy-decode wobble band) and xserv edges ahead on GSM8K. At 8192 both generate full-length solutions (mean ~4.2k tokens), so neither is truncated. The AIME difference (2 problems) is entirely within the run-to-run non-determinism documented below. Per-problem analysis shows the disagreements are due to different greedy-decode paths (different token at position ~500+ cascades into a different solution), not systematic precision errors.
On GSM8K, xserv strictly dominates: it gets 2 problems right that llama.cpp misses, and never misses one that llama.cpp gets.
Findings the benchmark surfaced
- Context must be provisioned per-request, not total. A first run showed
xserv 20.0% vs llama.cpp 3.3% on AIME — an artifact: llama.cpp divides total
-cacross--parallelslots, so-c 4096 --parallel 4gave each request only 1024 tokens, truncating long AIME solutions before the boxed answer (capped at ~940 generated tokens). GSM8K (~280 tokens) was unaffected, which is how we caught it. Fixed: per-slot context =max_seq_len(total-c = max_seq_len × parallel). After the fix, AIME is at parity (above). - xserv OOM'd at
--max-seq-len 8192— now fixed. xserv used to eagerly pre-allocate its paged-KV pool (blocks_per_seq × max_batch × 2, ~9GB at 8192) on top of the 16GB weights, exceeding 32GB at startup. Fixed by sizing the pool to available VRAM (cudaMemGetInfo) instead of worst-case demand, plus vLLM-style swap to pinned host memory: when running sequences grow past the GPU pool, the newest are evicted to host and swapped back when blocks free up (--swap-space-gb, default 8). The results above run at 8192 with 0 swap events — the VRAM-sized pool alone covers this load; swap is the overload safety net (verified lossless under a forced-small pool). - xserv decode is not run-to-run deterministic. The same greedy (temp 0) AIME config produced 6/30 / 7/30 / 6/30 across runs — non-deterministic CUDA reductions flip an argmax over long (~3k-token) generations. Harmless for serving, but it explains why long-sequence accuracy wobbles by a problem.
- GEMV race condition corrupted decode outputs — now fixed. The custom
K-split GEMV kernel (used for all M=1 decode-step projections with N≥256)
had a race condition: block k=0 zeroed the FP32 accumulator (
y_fp32[col] = 0.0) while other K-blocks were already atomicAdding to it. Since CUDA provides no inter-block ordering within a single kernel launch, the zero could land before, during, or after other blocks' writes. Fix:cudaMemsetAsyncon the stream before the kernel launch, which guarantees the buffer is zeroed before any block executes. This bug was introduced after the initial benchmark and caused systematic decode-time precision errors that degraded GSM8K accuracy from 98→80% range.
Raw artifacts (per-request timings, per-problem prediction/gold) are written to
bench-out/ as comparison-<stamp>.{md,json} (gitignored).