docs: llama.cpp vs xserv benchmark results + summary
Record what the new baseline adds (llama.cpp pinned b9371, same BF16 weights, AIME 2025 + GSM8K) and the measured results: performance (xserv ~0.45-0.61x llama.cpp throughput) and quality parity (GSM8K 94% vs 96%, AIME 23.3% vs 20% after the context fix), plus the findings the bench surfaced. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
This commit is contained in:
1
.gitignore
vendored
1
.gitignore
vendored
@@ -16,6 +16,7 @@
|
||||
# Benchmark output + fetched datasets (transferred to GPU host, not committed)
|
||||
/bench-out/
|
||||
/tools/bench/data/
|
||||
/tools/__pycache__/
|
||||
/tools/bench/__pycache__/
|
||||
/tools/bench/**/__pycache__/
|
||||
|
||||
|
||||
79
docs/benchmarks/llama-cpp-comparison.md
Normal file
79
docs/benchmarks/llama-cpp-comparison.md
Normal file
@@ -0,0 +1,79 @@
|
||||
# Benchmark: xserv vs llama.cpp (Qwen3-8B)
|
||||
|
||||
**What this adds.** A standing baseline that compares xserv against **llama.cpp**
|
||||
on both **response quality (correctness)** and **performance (TTFT / TPOT /
|
||||
throughput)**, using the same model weights and standard public datasets. This
|
||||
replaces HF transformers as our reference point — xserv already beat HF, so it
|
||||
is no longer a useful performance bar.
|
||||
|
||||
- **Baseline engine**: llama.cpp, vendored as a submodule pinned to `b9371`,
|
||||
built with CUDA for SM120 (RTX 5090).
|
||||
- **Same weights**: the Qwen3-8B safetensors are converted to a **BF16 GGUF**
|
||||
(`convert_hf_to_gguf.py --outtype bf16`) — no quantization, so the comparison
|
||||
is apples-to-apples.
|
||||
- **Standard quality datasets**: **AIME 2025** (30 competition-math problems,
|
||||
exact-match boxed integer) and **GSM8K** (grade-school math, exact-match).
|
||||
- **Black-box HTTP**: both engines are driven through the OpenAI-compatible
|
||||
streaming API; the driver measures TTFT/TPOT/throughput and scores answers.
|
||||
|
||||
See `docs/16-llama-cpp-comparison.md` for the design and `tools/bench/` for the
|
||||
driver. One-click: `tools/sync-and-build.sh bench`.
|
||||
|
||||
## How it runs
|
||||
|
||||
The GPU host (dash5) has no outbound network, so datasets are fetched locally
|
||||
(`tools/bench/fetch_datasets.py`) into JSON and the llama.cpp source is shipped
|
||||
over with the project; everything builds and runs on the GPU host. The driver
|
||||
runs **one engine at a time** (two BF16 8B models do not co-reside on a 32GB
|
||||
GPU, and a resident idle engine would distort the other's numbers).
|
||||
|
||||
Generation mode is matched: xserv hardcodes Qwen3 **thinking off**, so the
|
||||
driver sends `chat_template_kwargs={enable_thinking:false}` to llama.cpp.
|
||||
|
||||
## Results (RTX 5090, BF16, greedy, 4096 ctx, max_batch 4)
|
||||
|
||||
### Performance — llama.cpp is the stronger baseline
|
||||
|
||||
| scenario | metric | xserv | llama.cpp | xserv ÷ llama.cpp |
|
||||
|---|---|---|---|---|
|
||||
| single / medium | TTFT p50 (ms) | 26.8 | 18.0 | 0.67× |
|
||||
| single / medium | TPOT p50 (ms/tok) | 17.1 | 10.4 | 0.61× |
|
||||
| single / medium | throughput (tok/s) | 58.1 | 94.9 | 0.61× |
|
||||
| concurrent-4 | throughput (tok/s) | 143.4 | 317.7 | 0.45× |
|
||||
| concurrent-8 | throughput (tok/s) | 142.9 | 321.7 | 0.44× |
|
||||
|
||||
xserv runs at **~0.45–0.61×** llama.cpp. It saturates at `max_batch` (143 tok/s)
|
||||
while llama.cpp keeps scaling under load (322 tok/s). This is the honest new bar.
|
||||
|
||||
### Quality — parity, confirming xserv's numerical fidelity
|
||||
|
||||
| task | n | xserv | llama.cpp |
|
||||
|---|---|---|---|
|
||||
| GSM8K | 50 | 94.0% (47/50) | 96.0% (48/50) |
|
||||
| AIME 2025 | 30 | 23.3% (7/30) | 20.0% (6/30) |
|
||||
|
||||
With equal context, the two engines score within one problem of each other on
|
||||
both tasks. Response prefixes are byte-identical (same prompt templating), so
|
||||
the small residual difference is greedy-decode divergence on long sequences —
|
||||
not an engine quality gap.
|
||||
|
||||
## Findings the benchmark surfaced
|
||||
|
||||
1. **Context must be provisioned per-request, not total.** A first run showed
|
||||
xserv 20.0% vs llama.cpp 3.3% on AIME — an artifact: llama.cpp divides total
|
||||
`-c` across `--parallel` slots, so `-c 4096 --parallel 4` gave each request
|
||||
only **1024 tokens**, truncating long AIME solutions before the boxed answer
|
||||
(capped at ~940 generated tokens). GSM8K (~280 tokens) was unaffected, which
|
||||
is how we caught it. Fixed: per-slot context = `max_seq_len` (total
|
||||
`-c = max_seq_len × parallel`). After the fix, AIME is at parity (above).
|
||||
2. **xserv OOMs at `--max-seq-len 8192` + `--max-batch 4`.** xserv eagerly
|
||||
pre-allocates its paged-KV pool (~9GB at 8192) on top of the 16GB weights,
|
||||
exceeding 32GB at startup; llama.cpp allocates KV lazily and fits 8192. The
|
||||
comparison above runs at 4096 (xserv peaks ~28GB). Tracked as a follow-up.
|
||||
3. **xserv decode is not run-to-run deterministic.** The same greedy (temp 0)
|
||||
AIME config produced 6/30 then 7/30 across runs — non-deterministic CUDA
|
||||
reductions flip an argmax over long (~2400-token) generations. Harmless for
|
||||
serving, but it explains why long-sequence accuracy wobbles by a problem.
|
||||
|
||||
Raw artifacts (per-request timings, per-problem prediction/gold) are written to
|
||||
`bench-out/` as `comparison-<stamp>.{md,json}` (gitignored).
|
||||
Reference in New Issue
Block a user