
Gahow Wang 7b8b520cda docs: TP=1/2/4 xserv vs llama.cpp benchmark results

AIME 2025 + GSM8K at TP=1/2/4. Quality on par across engines/TP. Opposite
perf scaling: xserv TPOT improves with TP (21->17->15ms) while llama.cpp
row-split regresses over PCIe (10->19->20ms), crossing over so xserv is faster
at TP=4. Includes the clean same-path bench-tp scaling (58/76/86 tok/s).

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

2026-05-29 11:10:52 +08:00



Benchmark: Tensor Parallelism (TP=1/2/4) — xserv vs llama.cpp

Setup. Qwen3-8B BF16 on 8× RTX 5090 (PCIe Gen5, no NVLink; GPUs grouped 0-3 / 4-7 by PHB). Both engines driven over the same OpenAI HTTP harness, same scorers, thinking-off, greedy (temp 0), max_tokens 2048. Datasets: AIME 2025 (30) + GSM8K (30). The two engines run concurrently on disjoint groups — xserv on GPU 0..N-1, llama.cpp (--split-mode row) on GPU 4..4+N-1 (tools/bench/run_tp_parallel.sh).
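The GPU assignment described above can be sketched as follows. This is only an illustration of the disjoint-group layout; the helper name `gpu_groups` and the idea of passing the result through `CUDA_VISIBLE_DEVICES` are assumptions, and tools/bench/run_tp_parallel.sh may do this differently.

```python
def gpu_groups(tp, xserv_base=0, llama_base=4, group_size=4):
    """Return comma-separated GPU id lists for both engines at TP degree `tp`.

    Hypothetical helper: xserv takes GPU 0..tp-1 and llama.cpp takes
    GPU 4..4+tp-1, so the two runs never share a device.
    """
    if not 1 <= tp <= group_size:
        raise ValueError(f"tp must be between 1 and {group_size}, got {tp}")
    xserv = range(xserv_base, xserv_base + tp)
    llama = range(llama_base, llama_base + tp)
    return {
        "xserv": ",".join(str(g) for g in xserv),
        "llama.cpp": ",".join(str(g) for g in llama),
    }


if __name__ == "__main__":
    for tp in (1, 2, 4):
        groups = gpu_groups(tp)
        # These strings could be exported as CUDA_VISIBLE_DEVICES for each process.
        print(f"TP={tp}: xserv={groups['xserv']}  llama.cpp={groups['llama.cpp']}")
```

Keeping each engine inside one PHB group (0-3 or 4-7) means the two concurrent runs do not compete for the same GPUs.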

Correctness — on par across engines and TP

TP	task	xserv	llama.cpp
1	AIME 2025	16.7% (5/30)	13.3% (4/30)
1	GSM8K	96.7% (29/30)	96.7% (29/30)
2	AIME 2025	13.3% (4/30)	13.3% (4/30)
2	GSM8K	93.3% (28/30)	96.7% (29/30)
4	AIME 2025	16.7% (5/30)	13.3% (4/30)
4	GSM8K	96.7% (29/30)	96.7% (29/30)

Within ±1 problem everywhere — TP changes nothing about quality on either engine, and the two engines agree. (AIME is low for both: Qwen3-8B thinking-off, capped at 2048 tokens.)
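The "within ±1 problem" claim can be checked directly from the table. The sketch below hard-codes the correct-answer counts shown above (the dictionary layout is just an illustration, not the format of the raw artifacts) and reports the largest gap between engines.

```python
N_PROBLEMS = 30  # each dataset slice has 30 problems

# (TP, task) -> (xserv correct, llama.cpp correct), copied from the table above
CORRECT = {
    (1, "AIME 2025"): (5, 4),
    (1, "GSM8K"): (29, 29),
    (2, "AIME 2025"): (4, 4),
    (2, "GSM8K"): (28, 29),
    (4, "AIME 2025"): (5, 4),
    (4, "GSM8K"): (29, 29),
}


def accuracy(correct, total=N_PROBLEMS):
    """Accuracy in percent."""
    return 100.0 * correct / total


# Largest difference in solved problems between the two engines on any row.
max_gap = max(abs(x - l) for x, l in CORRECT.values())

if __name__ == "__main__":
    for (tp, task), (x, l) in CORRECT.items():
        print(f"TP={tp} {task:10s} xserv {accuracy(x):5.1f}%  llama.cpp {accuracy(l):5.1f}%")
    print(f"max gap: {max_gap} problem(s) = {accuracy(max_gap):.1f} points")
```

With 30 problems per task, a single problem is worth about 3.3 percentage points, so the 13.3% vs 16.7% AIME differences are one problem each.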

Performance — TPOT (ms/token, lower is better)

TP	xserv AIME / GSM8K	llama.cpp AIME / GSM8K
1	21.0 / 17.8	10.4 / 10.3
2	17.2 / 13.9	19.0 / 18.9
4	15.2 / 12.1	20.2 / 20.2

Opposite TP scaling, with a crossover:

xserv scales positively with TP: TPOT 21.0 → 17.2 → 15.2 ms (AIME), 17.8 → 13.9 → 12.1 ms (GSM8K), so TP=4 is ~1.4–1.5× faster than TP=1. GPU 0-3 all ~82% utilized. (Sublinear because of the 72 PCIe AllReduces/token, a count consistent with two per layer across Qwen3-8B's 36 layers.)
llama.cpp row-split regresses: TPOT 10.4 → 19.0 → 20.2 ms — TP=1 is its best; TP=2/4 nearly double the latency. GPU 4-7 only ~24% utilized (communication-bound). Row-split's per-layer cross-GPU traffic over PCIe without NVLink dominates.
Crossover: llama.cpp is ~2× faster at TP=1 (10.4 vs 21.0 ms AIME), but xserv is already ahead at TP=2 (17.2 vs 19.0 ms AIME, ~1.1×) and ~1.3× faster at TP=4 (15.2 vs 20.2 ms AIME). AIME-30 wall clock: xserv 1046 → 846 → 730 s (falling), llama.cpp 520 → 952 → 1012 s (rising).

On an NVLink-less PCIe box, xserv's TP is a genuine win and llama.cpp's row split is counterproductive, which is what the topology predicts.
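The ratios in this section follow directly from the TPOT table. A minimal sketch, with the table values hard-coded and the function names chosen only for illustration:

```python
# TP -> engine -> (AIME TPOT ms, GSM8K TPOT ms), copied from the table above
TPOT = {
    1: {"xserv": (21.0, 17.8), "llama.cpp": (10.4, 10.3)},
    2: {"xserv": (17.2, 13.9), "llama.cpp": (19.0, 18.9)},
    4: {"xserv": (15.2, 12.1), "llama.cpp": (20.2, 20.2)},
}
TASKS = ("AIME", "GSM8K")


def xserv_advantage(tp, task_idx):
    """llama.cpp TPOT divided by xserv TPOT; a value above 1 means xserv is faster."""
    return TPOT[tp]["llama.cpp"][task_idx] / TPOT[tp]["xserv"][task_idx]


def self_scaling(engine, tp, task_idx):
    """Speedup of `engine` at `tp` relative to its own TP=1 TPOT."""
    return TPOT[1][engine][task_idx] / TPOT[tp][engine][task_idx]


def first_xserv_win(task_idx):
    """Smallest measured TP at which xserv has the lower TPOT, or None."""
    for tp in sorted(TPOT):
        if xserv_advantage(tp, task_idx) > 1.0:
            return tp
    return None


if __name__ == "__main__":
    for i, task in enumerate(TASKS):
        for tp in sorted(TPOT):
            print(
                f"{task} TP={tp}: xserv advantage {xserv_advantage(tp, i):.2f}x, "
                f"xserv scaling {self_scaling('xserv', tp, i):.2f}x, "
                f"llama.cpp scaling {self_scaling('llama.cpp', tp, i):.2f}x"
            )
        print(f"{task}: xserv first ahead at TP={first_xserv_win(i)}")
```

On these numbers the xserv advantage on AIME is about 0.50× at TP=1, 1.10× at TP=2 and 1.33× at TP=4.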

Clean same-path xserv scaling (bench-tp)

The HTTP numbers above mix two xserv code paths (TP=1 uses the production continuous-batching engine; TP≥2 uses the serial TP coordinator). The single-stream, same-code-path scaling from bench-tp (greedy, 8 prompts × 64 tokens):

TP	xserv decode tok/s	speedup	TTFT
1	58.5	1.00×	18.0 ms
2	75.7	1.29×	13.4 ms
4	86.1	1.47×	11.5 ms
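The speedup column, plus the parallel efficiency it implies, can be recomputed from the decode throughput alone. A short sketch using the table values (the `scaling` helper is illustrative, not part of bench-tp):

```python
# TP -> xserv decode throughput in tokens/s, copied from the table above
TOK_PER_S = {1: 58.5, 2: 75.7, 4: 86.1}


def scaling(tp):
    """Return (speedup vs TP=1, parallel efficiency, ms per decoded token)."""
    speedup = TOK_PER_S[tp] / TOK_PER_S[1]
    efficiency = speedup / tp           # 1.0 would be perfect linear scaling
    ms_per_token = 1000.0 / TOK_PER_S[tp]
    return speedup, efficiency, ms_per_token


if __name__ == "__main__":
    for tp in sorted(TOK_PER_S):
        speedup, efficiency, ms = scaling(tp)
        print(f"TP={tp}: {speedup:.2f}x, efficiency {efficiency:.0%}, {ms:.1f} ms/token")
```

Efficiency falls from roughly 65% at TP=2 to roughly 37% at TP=4, which is the sublinear scaling expected when every token pays for AllReduces over PCIe.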

Caveats

xserv TP=1 uses the production Engine, TP≥2 the serial tp_engine coordinator — different per-token paths, so the HTTP TP=1→2 step has an engine confound. The clean same-path scaling (bench-tp, above) confirms the trend.
xserv TTFT is weaker on long AIME prompts (~460–500 ms vs ~100–190 ms for llama.cpp); prefill is a known optimization target.
llama.cpp uses --split-mode row (its tensor-parallel mode); the default layer split only memory-splits, without parallel compute.
The TP HTTP server processes requests serially (sufficient for this serial quality benchmark); continuous-batching TP is future work.

Raw artifacts: bench-out/tp{1,2,4}-{xserv,llama}/comparison-*.{md,json}.
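The brace pattern above expands to twelve glob patterns (3 TP degrees × 2 engines × 2 extensions). A small sketch for enumerating them and collecting whatever files exist; the function names are hypothetical and nothing is assumed about the contents of the files:

```python
import glob
from itertools import product

TP_DEGREES = (1, 2, 4)
ENGINES = ("xserv", "llama")
EXTENSIONS = ("md", "json")


def artifact_patterns(root="bench-out"):
    """Expand bench-out/tp{1,2,4}-{xserv,llama}/comparison-*.{md,json} into glob patterns."""
    return [
        f"{root}/tp{tp}-{engine}/comparison-*.{ext}"
        for tp, engine, ext in product(TP_DEGREES, ENGINES, EXTENSIONS)
    ]


def find_artifacts(root="bench-out"):
    """Return the sorted list of existing files matching any of the patterns."""
    found = []
    for pattern in artifact_patterns(root):
        found.extend(glob.glob(pattern))
    return sorted(found)


if __name__ == "__main__":
    for path in find_artifacts():
        print(path)
```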
