Files
xserv/docs/benchmarks/llama-cpp-comparison.md
Gahow Wang ae08896f46 xserv-chat: support gpt-oss-20b with TP; fix GEMV precision bug
- Add ChatModel enum dispatching between Qwen3 and GptOss based on
  config.is_moe(), following the TP engine pattern.
- Add --tp N flag for tensor-parallel inference (required for 39GB
  gpt-oss-20b which doesn't fit on a single 32GB GPU).
- Add gpt-oss harmony chat template with channel/message format.
- Replace hardcoded is_stop_token() with tokenizer.is_eos() for
  multi-model EOS support.
- Restore gpt-oss hardcoded prompt template in server api.rs, lost
  during the Jinja template refactor.
- Fix GEMV race condition: the K-split kernel zeroed the FP32
  accumulator inside the kernel (block k=0) while other blocks
  atomicAdd'd concurrently. Pre-zero with cudaMemsetAsync instead.
- Update benchmark docs with post-fix results.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-06-02 00:58:10 +08:00

5.7 KiB
Raw Permalink Blame History

Benchmark: xserv vs llama.cpp (Qwen3-8B)

What this adds. A standing baseline that compares xserv against llama.cpp on both response quality (correctness) and performance (TTFT / TPOT / throughput), using the same model weights and standard public datasets. This replaces HF transformers as our reference point — xserv already beat HF, so it is no longer a useful performance bar.

  • Baseline engine: llama.cpp, vendored as a submodule pinned to b9371, built with CUDA for SM120 (RTX 5090).
  • Same weights: the Qwen3-8B safetensors are converted to a BF16 GGUF (convert_hf_to_gguf.py --outtype bf16) — no quantization, so the comparison is apples-to-apples.
  • Standard quality datasets: AIME 2025 (30 competition-math problems, exact-match boxed integer) and GSM8K (grade-school math, exact-match).
  • Black-box HTTP: both engines are driven through the OpenAI-compatible streaming API; the driver measures TTFT/TPOT/throughput and scores answers.

See docs/16-llama-cpp-comparison.md for the design and tools/bench/ for the driver. One-click: tools/sync-and-build.sh bench.

How it runs

The GPU host (dash5) has no outbound network, so datasets are fetched locally (tools/bench/fetch_datasets.py) into JSON and the llama.cpp source is shipped over with the project; everything builds and runs on the GPU host. The driver runs one engine at a time (two BF16 8B models do not co-reside on a 32GB GPU, and a resident idle engine would distort the other's numbers).

Generation mode is matched: xserv hardcodes Qwen3 thinking off, so the driver sends chat_template_kwargs={enable_thinking:false} to llama.cpp.

Results (RTX 5090, BF16, greedy, 8192 ctx, max_batch 4)

Performance — llama.cpp is the stronger baseline

scenario metric xserv llama.cpp xserv ÷ llama.cpp
single / medium TTFT p50 (ms) 28.0 17.7 0.63×
single / medium TPOT p50 (ms/tok) 17.5 10.4 0.60×
single / medium throughput (tok/s) 56.6 95.1 0.60×
concurrent-4 throughput (tok/s) 135.2 317.1 0.43×
concurrent-8 throughput (tok/s) 135.5 322.5 0.42×

xserv runs at ~0.420.60× llama.cpp. It saturates at max_batch (~135 tok/s) while llama.cpp keeps scaling under load (~322 tok/s). This is the honest new bar. The ratio is the same at 4096 and 8192 — TPOT is bandwidth-bound, not context-bound at these sizes.

Quality — parity, confirming xserv's numerical fidelity

task n xserv llama.cpp
GSM8K 50 100.0% (50/50) 96.0% (48/50)
AIME 2025 30 16.7% (5/30) 23.3% (7/30)

With equal context the two engines land at comparable AIME accuracy (within the ±2-problem greedy-decode wobble band) and xserv edges ahead on GSM8K. At 8192 both generate full-length solutions (mean ~4.2k tokens), so neither is truncated. The AIME difference (2 problems) is entirely within the run-to-run non-determinism documented below. Per-problem analysis shows the disagreements are due to different greedy-decode paths (different token at position ~500+ cascades into a different solution), not systematic precision errors.

On GSM8K, xserv strictly dominates: it gets 2 problems right that llama.cpp misses, and never misses one that llama.cpp gets.

Findings the benchmark surfaced

  1. Context must be provisioned per-request, not total. A first run showed xserv 20.0% vs llama.cpp 3.3% on AIME — an artifact: llama.cpp divides total -c across --parallel slots, so -c 4096 --parallel 4 gave each request only 1024 tokens, truncating long AIME solutions before the boxed answer (capped at ~940 generated tokens). GSM8K (~280 tokens) was unaffected, which is how we caught it. Fixed: per-slot context = max_seq_len (total -c = max_seq_len × parallel). After the fix, AIME is at parity (above).
  2. xserv OOM'd at --max-seq-len 8192 — now fixed. xserv used to eagerly pre-allocate its paged-KV pool (blocks_per_seq × max_batch × 2, ~9GB at 8192) on top of the 16GB weights, exceeding 32GB at startup. Fixed by sizing the pool to available VRAM (cudaMemGetInfo) instead of worst-case demand, plus vLLM-style swap to pinned host memory: when running sequences grow past the GPU pool, the newest are evicted to host and swapped back when blocks free up (--swap-space-gb, default 8). The results above run at 8192 with 0 swap events — the VRAM-sized pool alone covers this load; swap is the overload safety net (verified lossless under a forced-small pool).
  3. xserv decode is not run-to-run deterministic. The same greedy (temp 0) AIME config produced 6/30 / 7/30 / 6/30 across runs — non-deterministic CUDA reductions flip an argmax over long (~3k-token) generations. Harmless for serving, but it explains why long-sequence accuracy wobbles by a problem.
  4. GEMV race condition corrupted decode outputs — now fixed. The custom K-split GEMV kernel (used for all M=1 decode-step projections with N≥256) had a race condition: block k=0 zeroed the FP32 accumulator (y_fp32[col] = 0.0) while other K-blocks were already atomicAdding to it. Since CUDA provides no inter-block ordering within a single kernel launch, the zero could land before, during, or after other blocks' writes. Fix: cudaMemsetAsync on the stream before the kernel launch, which guarantees the buffer is zeroed before any block executes. This bug was introduced after the initial benchmark and caused systematic decode-time precision errors that degraded GSM8K accuracy from 98→80% range.

Raw artifacts (per-request timings, per-problem prediction/gold) are written to bench-out/ as comparison-<stamp>.{md,json} (gitignored).