Re-ran the full comparison at --max-seq-len 8192 now that xserv handles it: - OOM finding resolved — pool sized to available VRAM + vLLM-style host swap; 8192 runs with 0 swap events (swap is the overload safety net). - Quality at parity with equal context: AIME 20.0% vs 20.0%, GSM8K 98% vs 96%. - Speed unchanged relative to llama.cpp (~0.42-0.60x); TPOT is bandwidth-bound. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
4.8 KiB
Benchmark: xserv vs llama.cpp (Qwen3-8B)
What this adds. A standing baseline that compares xserv against llama.cpp on both response quality (correctness) and performance (TTFT / TPOT / throughput), using the same model weights and standard public datasets. This replaces HF transformers as our reference point — xserv already beat HF, so it is no longer a useful performance bar.
- Baseline engine: llama.cpp, vendored as a submodule pinned to
b9371, built with CUDA for SM120 (RTX 5090). - Same weights: the Qwen3-8B safetensors are converted to a BF16 GGUF
(
convert_hf_to_gguf.py --outtype bf16) — no quantization, so the comparison is apples-to-apples. - Standard quality datasets: AIME 2025 (30 competition-math problems, exact-match boxed integer) and GSM8K (grade-school math, exact-match).
- Black-box HTTP: both engines are driven through the OpenAI-compatible streaming API; the driver measures TTFT/TPOT/throughput and scores answers.
See docs/16-llama-cpp-comparison.md for the design and tools/bench/ for the
driver. One-click: tools/sync-and-build.sh bench.
How it runs
The GPU host (dash5) has no outbound network, so datasets are fetched locally
(tools/bench/fetch_datasets.py) into JSON and the llama.cpp source is shipped
over with the project; everything builds and runs on the GPU host. The driver
runs one engine at a time (two BF16 8B models do not co-reside on a 32GB
GPU, and a resident idle engine would distort the other's numbers).
Generation mode is matched: xserv hardcodes Qwen3 thinking off, so the
driver sends chat_template_kwargs={enable_thinking:false} to llama.cpp.
Results (RTX 5090, BF16, greedy, 8192 ctx, max_batch 4)
Performance — llama.cpp is the stronger baseline
| scenario | metric | xserv | llama.cpp | xserv ÷ llama.cpp |
|---|---|---|---|---|
| single / medium | TTFT p50 (ms) | 28.0 | 17.7 | 0.63× |
| single / medium | TPOT p50 (ms/tok) | 17.5 | 10.4 | 0.60× |
| single / medium | throughput (tok/s) | 56.6 | 95.1 | 0.60× |
| concurrent-4 | throughput (tok/s) | 135.2 | 317.1 | 0.43× |
| concurrent-8 | throughput (tok/s) | 135.5 | 322.5 | 0.42× |
xserv runs at ~0.42–0.60× llama.cpp. It saturates at max_batch (~135 tok/s)
while llama.cpp keeps scaling under load (~322 tok/s). This is the honest new bar.
The ratio is the same at 4096 and 8192 — TPOT is bandwidth-bound, not
context-bound at these sizes.
Quality — parity, confirming xserv's numerical fidelity
| task | n | xserv | llama.cpp |
|---|---|---|---|
| GSM8K | 50 | 98.0% (49/50) | 96.0% (48/50) |
| AIME 2025 | 30 | 20.0% (6/30) | 20.0% (6/30) |
With equal context the two engines land at identical AIME accuracy and within one problem on GSM8K. At 8192 both generate full-length solutions (mean ~3.4k / ~4.2k tokens), so neither is truncated. Two independent engines agreeing at ~20% confirms that's genuine Qwen3-8B (thinking-off) capability and that xserv is numerically faithful. Response prefixes are byte-identical (same prompt templating); the only run-to-run wobble is greedy-decode divergence / nondeterminism on long (~3k-token) sequences (see finding 3).
Findings the benchmark surfaced
- Context must be provisioned per-request, not total. A first run showed
xserv 20.0% vs llama.cpp 3.3% on AIME — an artifact: llama.cpp divides total
-cacross--parallelslots, so-c 4096 --parallel 4gave each request only 1024 tokens, truncating long AIME solutions before the boxed answer (capped at ~940 generated tokens). GSM8K (~280 tokens) was unaffected, which is how we caught it. Fixed: per-slot context =max_seq_len(total-c = max_seq_len × parallel). After the fix, AIME is at parity (above). - xserv OOM'd at
--max-seq-len 8192— now fixed. xserv used to eagerly pre-allocate its paged-KV pool (blocks_per_seq × max_batch × 2, ~9GB at 8192) on top of the 16GB weights, exceeding 32GB at startup. Fixed by sizing the pool to available VRAM (cudaMemGetInfo) instead of worst-case demand, plus vLLM-style swap to pinned host memory: when running sequences grow past the GPU pool, the newest are evicted to host and swapped back when blocks free up (--swap-space-gb, default 8). The results above run at 8192 with 0 swap events — the VRAM-sized pool alone covers this load; swap is the overload safety net (verified lossless under a forced-small pool). - xserv decode is not run-to-run deterministic. The same greedy (temp 0) AIME config produced 6/30 / 7/30 / 6/30 across runs — non-deterministic CUDA reductions flip an argmax over long (~3k-token) generations. Harmless for serving, but it explains why long-sequence accuracy wobbles by a problem.
Raw artifacts (per-request timings, per-problem prediction/gold) are written to
bench-out/ as comparison-<stamp>.{md,json} (gitignored).