docs: llama.cpp vs xserv benchmark results + summary

Record what the new baseline adds (llama.cpp pinned b9371, same BF16 weights,
AIME 2025 + GSM8K) and the measured results: performance (xserv ~0.45-0.61x
llama.cpp throughput) and quality parity (GSM8K 94% vs 96%, AIME 23.3% vs 20%
after the context fix), plus the findings the bench surfaced.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
This commit is contained in:
2026-05-28 15:06:21 +08:00
parent 950ccf3822
commit 3f1c3d429a
2 changed files with 80 additions and 0 deletions

1
.gitignore vendored
View File

@@ -16,6 +16,7 @@
# Benchmark output + fetched datasets (transferred to GPU host, not committed)
/bench-out/
/tools/bench/data/
/tools/__pycache__/
/tools/bench/__pycache__/
/tools/bench/**/__pycache__/

View File

@@ -0,0 +1,79 @@
# Benchmark: xserv vs llama.cpp (Qwen3-8B)
**What this adds.** A standing baseline that compares xserv against **llama.cpp**
on both **response quality (correctness)** and **performance (TTFT / TPOT /
throughput)**, using the same model weights and standard public datasets. This
replaces HF transformers as our reference point — xserv already beat HF, so it
is no longer a useful performance bar.
- **Baseline engine**: llama.cpp, vendored as a submodule pinned to `b9371`,
built with CUDA for SM120 (RTX 5090).
- **Same weights**: the Qwen3-8B safetensors are converted to a **BF16 GGUF**
(`convert_hf_to_gguf.py --outtype bf16`) — no quantization, so the comparison
is apples-to-apples.
- **Standard quality datasets**: **AIME 2025** (30 competition-math problems,
exact-match boxed integer) and **GSM8K** (grade-school math, exact-match).
- **Black-box HTTP**: both engines are driven through the OpenAI-compatible
streaming API; the driver measures TTFT/TPOT/throughput and scores answers.
See `docs/16-llama-cpp-comparison.md` for the design and `tools/bench/` for the
driver. One-click: `tools/sync-and-build.sh bench`.
## How it runs
The GPU host (dash5) has no outbound network, so datasets are fetched locally
(`tools/bench/fetch_datasets.py`) into JSON and the llama.cpp source is shipped
over with the project; everything builds and runs on the GPU host. The driver
runs **one engine at a time** (two BF16 8B models do not co-reside on a 32GB
GPU, and a resident idle engine would distort the other's numbers).
Generation mode is matched: xserv hardcodes Qwen3 **thinking off**, so the
driver sends `chat_template_kwargs={enable_thinking:false}` to llama.cpp.
## Results (RTX 5090, BF16, greedy, 4096 ctx, max_batch 4)
### Performance — llama.cpp is the stronger baseline
| scenario | metric | xserv | llama.cpp | xserv ÷ llama.cpp |
|---|---|---|---|---|
| single / medium | TTFT p50 (ms) | 26.8 | 18.0 | 0.67× |
| single / medium | TPOT p50 (ms/tok) | 17.1 | 10.4 | 0.61× |
| single / medium | throughput (tok/s) | 58.1 | 94.9 | 0.61× |
| concurrent-4 | throughput (tok/s) | 143.4 | 317.7 | 0.45× |
| concurrent-8 | throughput (tok/s) | 142.9 | 321.7 | 0.44× |
xserv runs at **~0.450.61×** llama.cpp. It saturates at `max_batch` (143 tok/s)
while llama.cpp keeps scaling under load (322 tok/s). This is the honest new bar.
### Quality — parity, confirming xserv's numerical fidelity
| task | n | xserv | llama.cpp |
|---|---|---|---|
| GSM8K | 50 | 94.0% (47/50) | 96.0% (48/50) |
| AIME 2025 | 30 | 23.3% (7/30) | 20.0% (6/30) |
With equal context, the two engines score within one problem of each other on
both tasks. Response prefixes are byte-identical (same prompt templating), so
the small residual difference is greedy-decode divergence on long sequences —
not an engine quality gap.
## Findings the benchmark surfaced
1. **Context must be provisioned per-request, not total.** A first run showed
xserv 20.0% vs llama.cpp 3.3% on AIME — an artifact: llama.cpp divides total
`-c` across `--parallel` slots, so `-c 4096 --parallel 4` gave each request
only **1024 tokens**, truncating long AIME solutions before the boxed answer
(capped at ~940 generated tokens). GSM8K (~280 tokens) was unaffected, which
is how we caught it. Fixed: per-slot context = `max_seq_len` (total
`-c = max_seq_len × parallel`). After the fix, AIME is at parity (above).
2. **xserv OOMs at `--max-seq-len 8192` + `--max-batch 4`.** xserv eagerly
pre-allocates its paged-KV pool (~9GB at 8192) on top of the 16GB weights,
exceeding 32GB at startup; llama.cpp allocates KV lazily and fits 8192. The
comparison above runs at 4096 (xserv peaks ~28GB). Tracked as a follow-up.
3. **xserv decode is not run-to-run deterministic.** The same greedy (temp 0)
AIME config produced 6/30 then 7/30 across runs — non-deterministic CUDA
reductions flip an argmax over long (~2400-token) generations. Harmless for
serving, but it explains why long-sequence accuracy wobbles by a problem.
Raw artifacts (per-request timings, per-problem prediction/gold) are written to
`bench-out/` as `comparison-<stamp>.{md,json}` (gitignored).