docs: llama.cpp vs xserv benchmark results + summary

Record what the new baseline adds (llama.cpp pinned b9371, same BF16 weights, AIME 2025 + GSM8K) and the measured results: performance (xserv ~0.45-0.61x llama.cpp throughput) and quality parity (GSM8K 94% vs 96%, AIME 23.3% vs 20% after the context fix), plus the findings the bench surfaced.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-28 15:06:21 +08:00
parent 950ccf3822
commit 3f1c3d429a
2 changed files with 80 additions and 0 deletions
--- a/.gitignore
+++ b/.gitignore
@@ -16,6 +16,7 @@
 # Benchmark output + fetched datasets (transferred to GPU host, not committed)
 /bench-out/
 /tools/bench/data/
+/tools/__pycache__/
 /tools/bench/__pycache__/
 /tools/bench/**/__pycache__/

--- a/docs/benchmarks/llama-cpp-comparison.md
+++ b/docs/benchmarks/llama-cpp-comparison.md
@@ -0,0 +1,79 @@
+# Benchmark: xserv vs llama.cpp (Qwen3-8B)
+
+**What this adds.** A standing baseline that compares xserv against **llama.cpp**
+on both **response quality (correctness)** and **performance (time to first
+token, TTFT; time per output token, TPOT; throughput)**, using the same model
+weights and standard public datasets. This replaces HF transformers as our
+reference point: xserv already beat HF, so HF is no longer a useful performance
+bar.
+
+- **Baseline engine**: llama.cpp, vendored as a submodule pinned to `b9371`,
+  built with CUDA for SM120 (RTX 5090).
+- **Same weights**: the Qwen3-8B safetensors are converted to a **BF16 GGUF**
+  (`convert_hf_to_gguf.py --outtype bf16`) — no quantization, so the comparison
+  is apples-to-apples.
+- **Standard quality datasets**: **AIME 2025** (30 competition-math problems,
+  exact-match boxed integer) and **GSM8K** (grade-school math, exact-match).
+- **Black-box HTTP**: both engines are driven through the OpenAI-compatible
+  streaming API; the driver measures TTFT/TPOT/throughput and scores answers.
+
+See `docs/16-llama-cpp-comparison.md` for the design and `tools/bench/` for the
+driver. One-click: `tools/sync-and-build.sh bench`.
+
+## How it runs
+
+The GPU host (dash5) has no outbound network, so datasets are fetched locally
+(`tools/bench/fetch_datasets.py`) into JSON and the llama.cpp source is shipped
+over with the project; everything builds and runs on the GPU host. The driver
+runs **one engine at a time** (two BF16 8B models do not co-reside on a 32GB
+GPU, and a resident idle engine would distort the other's numbers).
+
+Generation mode is matched: xserv hardcodes Qwen3 **thinking off**, so the
+driver sends `"chat_template_kwargs": {"enable_thinking": false}` in the request
+body to llama.cpp.
+
+## Results (RTX 5090, BF16, greedy, 4096 ctx, max_batch 4)
+
+### Performance — llama.cpp is the stronger baseline
+
+| scenario | metric | xserv | llama.cpp | xserv relative speed (1.0× = parity) |
+|---|---|---|---|---|
+| single / medium | TTFT p50 (ms) | 26.8 | 18.0 | 0.67× |
+| single / medium | TPOT p50 (ms/tok) | 17.1 | 10.4 | 0.61× |
+| single / medium | throughput (tok/s) | 58.1 | 94.9 | 0.61× |
+| concurrent-4 | throughput (tok/s) | 143.4 | 317.7 | 0.45× |
+| concurrent-8 | throughput (tok/s) | 142.9 | 321.7 | 0.44× |
+
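+On TPOT and throughput, xserv runs at **~0.44–0.61×** llama.cpp's speed (0.67×
+on TTFT). Its throughput plateaus at ~143 tok/s once concurrency reaches
+`max_batch` (4), while llama.cpp sustains ~318–322 tok/s at the same
+concurrency. This is the honest new bar.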
+
+### Quality — parity, confirming xserv's numerical fidelity
+
+| task | n | xserv | llama.cpp |
+|---|---|---|---|
+| GSM8K | 50 | 94.0% (47/50) | 96.0% (48/50) |
+| AIME 2025 | 30 | 23.3% (7/30) | 20.0% (6/30) |
+
+With equal context, the two engines score within one problem of each other on
+both tasks. Response prefixes are byte-identical (same prompt templating), so
+the small residual difference is greedy-decode divergence on long sequences —
+not an engine quality gap.
+
+## Findings the benchmark surfaced
+
+1. **Context must be provisioned per-request, not total.** A first run showed
+   xserv 20.0% vs llama.cpp 3.3% on AIME — an artifact: llama.cpp divides total
+   `-c` across `--parallel` slots, so `-c 4096 --parallel 4` gave each request
+   only **1024 tokens**, truncating long AIME solutions before the boxed answer
+   (capped at ~940 generated tokens). GSM8K (~280 tokens) was unaffected, which
+   is how we caught it. Fixed: per-slot context = `max_seq_len` (total
+   `-c = max_seq_len × parallel`). After the fix, AIME is at parity (above).
+2. **xserv OOMs at `--max-seq-len 8192` + `--max-batch 4`.** xserv eagerly
+   pre-allocates its paged-KV pool (~9GB at 8192) on top of the 16GB weights,
+   exceeding 32GB at startup; llama.cpp allocates KV lazily and fits 8192. The
+   comparison above runs at 4096 (xserv peaks ~28GB). Tracked as a follow-up.
+3. **xserv decode is not run-to-run deterministic.** The same greedy (temp 0)
+   AIME config produced 6/30 then 7/30 across runs — non-deterministic CUDA
+   reductions flip an argmax over long (~2400-token) generations. Harmless for
+   serving, but it explains why long-sequence accuracy wobbles by a problem.
+
+Raw artifacts (per-request timings, per-problem prediction/gold) are written to
+`bench-out/` as `comparison-<stamp>.{md,json}` (gitignored).
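
The context arithmetic behind finding 1 can be sketched as follows. This is a minimal illustration, not the driver's code: the helper names `per_slot_context` and `total_context_for` are hypothetical, and the only behavior assumed is the one described in the doc, namely that llama.cpp splits the total `-c` evenly across `--parallel` slots.

```python
def per_slot_context(total_ctx: int, parallel: int) -> int:
    """Tokens each request gets when the total context is split evenly across slots."""
    if parallel < 1:
        raise ValueError("parallel must be >= 1")
    return total_ctx // parallel


def total_context_for(max_seq_len: int, parallel: int) -> int:
    """Total -c value needed so that every slot gets max_seq_len tokens."""
    if parallel < 1:
        raise ValueError("parallel must be >= 1")
    return max_seq_len * parallel


# First run: `-c 4096 --parallel 4` leaves each request a quarter of the context.
misconfigured_slot = per_slot_context(4096, 4)

# Fix: size the total so that per-slot context equals max_seq_len.
fixed_total = total_context_for(4096, 4)
fixed_slot = per_slot_context(fixed_total, 4)

print(misconfigured_slot, fixed_total, fixed_slot)
```

With the values from the doc, the misconfigured run gives 1024 tokens per slot, and the fix requires a total `-c` of 16384 to restore 4096 tokens per request.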