From 3f1c3d429a6aaa554a692ecef14b817cf3b239fa Mon Sep 17 00:00:00 2001
From: Gahow Wang
Date: Thu, 28 May 2026 15:06:21 +0800
Subject: [PATCH] docs: llama.cpp vs xserv benchmark results + summary

Record what the new baseline adds (llama.cpp pinned b9371, same BF16
weights, AIME 2025 + GSM8K) and the measured results: performance
(xserv ~0.45-0.61x llama.cpp throughput) and quality parity (GSM8K 94%
vs 96%, AIME 23.3% vs 20% after the context fix), plus the findings the
bench surfaced.

Co-Authored-By: Claude Opus 4.7
---
 .gitignore                              |  1 +
 docs/benchmarks/llama-cpp-comparison.md | 79 +++++++++++++++++++++++++
 2 files changed, 80 insertions(+)
 create mode 100644 docs/benchmarks/llama-cpp-comparison.md

diff --git a/.gitignore b/.gitignore
index 6866abf..9866984 100644
--- a/.gitignore
+++ b/.gitignore
@@ -16,6 +16,7 @@
 # Benchmark output + fetched datasets (transferred to GPU host, not committed)
 /bench-out/
 /tools/bench/data/
+/tools/__pycache__/
 /tools/bench/__pycache__/
 /tools/bench/**/__pycache__/

diff --git a/docs/benchmarks/llama-cpp-comparison.md b/docs/benchmarks/llama-cpp-comparison.md
new file mode 100644
index 0000000..6cf48b6
--- /dev/null
+++ b/docs/benchmarks/llama-cpp-comparison.md
@@ -0,0 +1,79 @@
+# Benchmark: xserv vs llama.cpp (Qwen3-8B)
+
+**What this adds.** A standing baseline that compares xserv against **llama.cpp**
+on both **response quality (correctness)** and **performance (TTFT / TPOT /
+throughput)**, using the same model weights and standard public datasets. This
+replaces HF transformers as our reference point — xserv already beat HF, so it
+is no longer a useful performance bar.
+
+- **Baseline engine**: llama.cpp, vendored as a submodule pinned to `b9371`,
+  built with CUDA for SM120 (RTX 5090).
+- **Same weights**: the Qwen3-8B safetensors are converted to a **BF16 GGUF**
+  (`convert_hf_to_gguf.py --outtype bf16`) — no quantization, so the comparison
+  is apples-to-apples.
+- **Standard quality datasets**: **AIME 2025** (30 competition-math problems,
+  exact-match boxed integer) and **GSM8K** (grade-school math, exact-match).
+- **Black-box HTTP**: both engines are driven through the OpenAI-compatible
+  streaming API; the driver measures TTFT/TPOT/throughput and scores answers.
+
+See `docs/16-llama-cpp-comparison.md` for the design and `tools/bench/` for the
+driver. One-click: `tools/sync-and-build.sh bench`.
+
+## How it runs
+
+The GPU host (dash5) has no outbound network, so datasets are fetched locally
+(`tools/bench/fetch_datasets.py`) into JSON and the llama.cpp source is shipped
+over with the project; everything builds and runs on the GPU host. The driver
+runs **one engine at a time** (two BF16 8B models do not co-reside on a 32GB
+GPU, and a resident idle engine would distort the other's numbers).
+
+Generation mode is matched: xserv hardcodes Qwen3 **thinking off**, so the
+driver sends `chat_template_kwargs={enable_thinking:false}` to llama.cpp.
+
+## Results (RTX 5090, BF16, greedy, 4096 ctx, max_batch 4)
+
+### Performance — llama.cpp is the stronger baseline
+
+| scenario | metric | xserv | llama.cpp | xserv relative speed (llama.cpp = 1×) |
+|---|---|---|---|---|
+| single / medium | TTFT p50 (ms) | 26.8 | 18.0 | 0.67× |
+| single / medium | TPOT p50 (ms/tok) | 17.1 | 10.4 | 0.61× |
+| single / medium | throughput (tok/s) | 58.1 | 94.9 | 0.61× |
+| concurrent-4 | throughput (tok/s) | 143.4 | 317.7 | 0.45× |
+| concurrent-8 | throughput (tok/s) | 142.9 | 321.7 | 0.44× |
+
+xserv runs at **~0.45–0.61×** llama.cpp. It saturates at `max_batch` (143 tok/s)
+while llama.cpp keeps scaling under load (322 tok/s). This is the honest new bar.
+
+### Quality — parity, confirming xserv's numerical fidelity
+
+| task | n | xserv | llama.cpp |
+|---|---|---|---|
+| GSM8K | 50 | 94.0% (47/50) | 96.0% (48/50) |
+| AIME 2025 | 30 | 23.3% (7/30) | 20.0% (6/30) |
+
+With equal context, the two engines score within one problem of each other on
+both tasks. Response prefixes are byte-identical (same prompt templating), so
+the small residual difference is greedy-decode divergence on long sequences —
+not an engine quality gap.
+
+## Findings the benchmark surfaced
+
+1. **Context must be provisioned per-request, not total.** A first run showed
+   xserv 20.0% vs llama.cpp 3.3% on AIME — an artifact: llama.cpp divides total
+   `-c` across `--parallel` slots, so `-c 4096 --parallel 4` gave each request
+   only **1024 tokens**, truncating long AIME solutions before the boxed answer
+   (capped at ~940 generated tokens). GSM8K (~280 tokens) was unaffected, which
+   is how we caught it. Fixed: per-slot context = `max_seq_len` (total
+   `-c = max_seq_len × parallel`). After the fix, AIME is at parity (above).
+2. **xserv OOMs at `--max-seq-len 8192` + `--max-batch 4`.** xserv eagerly
+   pre-allocates its paged-KV pool (~9GB at 8192) on top of the 16GB weights,
+   exceeding 32GB at startup; llama.cpp allocates KV lazily and fits 8192. The
+   comparison above runs at 4096 (xserv peaks ~28GB). Tracked as a follow-up.
+3. **xserv decode is not run-to-run deterministic.** The same greedy (temp 0)
+   AIME config produced 6/30 then 7/30 across runs — non-deterministic CUDA
+   reductions flip an argmax over long (~2400-token) generations. Harmless for
+   serving, but it explains why long-sequence accuracy wobbles by a problem.
+
+Raw artifacts (per-request timings, per-problem prediction/gold) are written to
+`bench-out/` as `comparison-.{md,json}` (gitignored).
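Note on finding 1 (not part of the patch): the context arithmetic can be checked with a minimal sketch. The helper names `per_slot_context` and `total_context_for` are hypothetical and do not come from `tools/bench/`. The sketch assumes the total `-c` is split evenly across `--parallel` slots, as the finding describes, and that the fixed run keeps 4 slots with a 4096-token `max_seq_len`.

```python
def per_slot_context(total_ctx: int, parallel: int) -> int:
    """Tokens available to one request when the total context is split evenly across slots."""
    return total_ctx // parallel


def total_context_for(max_seq_len: int, parallel: int) -> int:
    """Total context to request so that every slot gets max_seq_len tokens."""
    return max_seq_len * parallel


# First run: `-c 4096 --parallel 4` leaves each request a quarter of the total.
first_run_slot = per_slot_context(4096, 4)

# Fixed run (assumed values): size the total so each slot holds a full 4096-token sequence.
fixed_total = total_context_for(4096, 4)
fixed_slot = per_slot_context(fixed_total, 4)

print(first_run_slot, fixed_total, fixed_slot)
```

Under these assumptions the first run gets 1024 tokens per slot, which matches the truncation near 940 generated tokens once the prompt is counted, while the fixed configuration requests a total of 16384 so that each slot holds 4096.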