Files

Gahow Wang 80157e614a docs: update llama.cpp comparison with 8192 results (OOM fixed)

Re-ran the full comparison at --max-seq-len 8192 now that xserv handles it:
- OOM finding resolved — pool sized to available VRAM + vLLM-style host swap;
  8192 runs with 0 swap events (swap is the overload safety net).
- Quality at parity with equal context: AIME 20.0% vs 20.0%, GSM8K 98% vs 96%.
- Speed unchanged relative to llama.cpp (~0.42-0.60x); TPOT is bandwidth-bound.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

2026-05-28 21:32:14 +08:00

4.8 KiB

Raw Blame History

Benchmark: xserv vs llama.cpp (Qwen3-8B)

What this adds. A standing baseline that compares xserv against llama.cpp on both response quality (correctness) and performance (TTFT / TPOT / throughput), using the same model weights and standard public datasets. This replaces HF transformers as our reference point — xserv already beat HF, so it is no longer a useful performance bar.

Baseline engine: llama.cpp, vendored as a submodule pinned to b9371, built with CUDA for SM120 (RTX 5090).
Same weights: the Qwen3-8B safetensors are converted to a BF16 GGUF (convert_hf_to_gguf.py --outtype bf16) — no quantization, so the comparison is apples-to-apples.
Standard quality datasets: AIME 2025 (30 competition-math problems, exact-match boxed integer) and GSM8K (grade-school math, exact-match).
Black-box HTTP: both engines are driven through the OpenAI-compatible streaming API; the driver measures TTFT/TPOT/throughput and scores answers.

See docs/16-llama-cpp-comparison.md for the design and tools/bench/ for the driver. One-click: tools/sync-and-build.sh bench.

How it runs

The GPU host (dash5) has no outbound network, so datasets are fetched locally (tools/bench/fetch_datasets.py) into JSON and the llama.cpp source is shipped over with the project; everything builds and runs on the GPU host. The driver runs one engine at a time (two BF16 8B models do not co-reside on a 32GB GPU, and a resident idle engine would distort the other's numbers).

Generation mode is matched: xserv hardcodes Qwen3 thinking off, so the driver sends chat_template_kwargs={enable_thinking:false} to llama.cpp.

Results (RTX 5090, BF16, greedy, 8192 ctx, max_batch 4)

Performance — llama.cpp is the stronger baseline

scenario	metric	xserv	llama.cpp	xserv ÷ llama.cpp
single / medium	TTFT p50 (ms)	28.0	17.7	0.63×
single / medium	TPOT p50 (ms/tok)	17.5	10.4	0.60×
single / medium	throughput (tok/s)	56.6	95.1	0.60×
concurrent-4	throughput (tok/s)	135.2	317.1	0.43×
concurrent-8	throughput (tok/s)	135.5	322.5	0.42×

xserv runs at ~0.42–0.60× llama.cpp. It saturates at max_batch (~135 tok/s) while llama.cpp keeps scaling under load (~322 tok/s). This is the honest new bar. The ratio is the same at 4096 and 8192 — TPOT is bandwidth-bound, not context-bound at these sizes.

Quality — parity, confirming xserv's numerical fidelity

task	n	xserv	llama.cpp
GSM8K	50	98.0% (49/50)	96.0% (48/50)
AIME 2025	30	20.0% (6/30)	20.0% (6/30)

With equal context the two engines land at identical AIME accuracy and within one problem on GSM8K. At 8192 both generate full-length solutions (mean ~3.4k / ~4.2k tokens), so neither is truncated. Two independent engines agreeing at ~20% confirms that's genuine Qwen3-8B (thinking-off) capability and that xserv is numerically faithful. Response prefixes are byte-identical (same prompt templating); the only run-to-run wobble is greedy-decode divergence / nondeterminism on long (~3k-token) sequences (see finding 3).

Findings the benchmark surfaced

Context must be provisioned per-request, not total. A first run showed xserv 20.0% vs llama.cpp 3.3% on AIME — an artifact: llama.cpp divides total -c across --parallel slots, so -c 4096 --parallel 4 gave each request only 1024 tokens, truncating long AIME solutions before the boxed answer (capped at ~940 generated tokens). GSM8K (~280 tokens) was unaffected, which is how we caught it. Fixed: per-slot context = max_seq_len (total -c = max_seq_len × parallel). After the fix, AIME is at parity (above).
xserv OOM'd at --max-seq-len 8192 — now fixed. xserv used to eagerly pre-allocate its paged-KV pool (blocks_per_seq × max_batch × 2, ~9GB at 8192) on top of the 16GB weights, exceeding 32GB at startup. Fixed by sizing the pool to available VRAM (cudaMemGetInfo) instead of worst-case demand, plus vLLM-style swap to pinned host memory: when running sequences grow past the GPU pool, the newest are evicted to host and swapped back when blocks free up (--swap-space-gb, default 8). The results above run at 8192 with 0 swap events — the VRAM-sized pool alone covers this load; swap is the overload safety net (verified lossless under a forced-small pool).
xserv decode is not run-to-run deterministic. The same greedy (temp 0) AIME config produced 6/30 / 7/30 / 6/30 across runs — non-deterministic CUDA reductions flip an argmax over long (~3k-token) generations. Harmless for serving, but it explains why long-sequence accuracy wobbles by a problem.

Raw artifacts (per-request timings, per-problem prediction/gold) are written to bench-out/ as comparison-<stamp>.{md,json} (gitignored).

4.8 KiB Raw Blame History Unescape Escape