docs: TP=1/2/4 xserv vs llama.cpp benchmark results
AIME 2025 + GSM8K at TP=1/2/4. Quality on par across engines/TP. Opposite perf scaling: xserv TPOT improves with TP (21->17->15ms) while llama.cpp row-split regresses over PCIe (10->19->20ms), crossing over so xserv is faster at TP=4. Includes the clean same-path bench-tp scaling (58/76/86 tok/s).

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
73
docs/benchmarks/tensor-parallelism.md
Normal file
@@ -0,0 +1,73 @@
# Benchmark: Tensor Parallelism (TP=1/2/4), xserv vs llama.cpp

**Setup.** Qwen3-8B BF16 on 8× RTX 5090 (PCIe Gen5, **no NVLink**; GPUs grouped
0-3 / 4-7 by PHB). Both engines are driven through the same OpenAI-style HTTP
harness with the same scorers, thinking off, greedy decoding (temp 0), and
`max_tokens` 2048. Datasets: **AIME 2025** (30 problems) + **GSM8K** (30
problems). The two engines run **concurrently on disjoint GPU groups**: xserv on
GPU 0..N-1, llama.cpp (`--split-mode row`) on GPU 4..4+N-1
(`tools/bench/run_tp_parallel.sh`).

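
The GPU layout described in the setup can be sketched as follows. This is only an illustration of the index arithmetic: the helper names `gpu_groups` and `visible_devices` are hypothetical, and pinning via a `CUDA_VISIBLE_DEVICES`-style string is an assumption, not a description of what `tools/bench/run_tp_parallel.sh` actually does.

```python
def gpu_groups(tp: int, group_size: int = 4) -> dict[str, list[int]]:
    """Return the disjoint GPU index lists for one TP degree (hypothetical helper)."""
    if not 1 <= tp <= group_size:
        raise ValueError(f"tp must be in 1..{group_size}, got {tp}")
    return {
        "xserv": list(range(0, tp)),                            # GPU 0..N-1
        "llama.cpp": list(range(group_size, group_size + tp)),  # GPU 4..4+N-1
    }


def visible_devices(gpus: list[int]) -> str:
    """Format a GPU list as a comma-separated device string (assumed pinning format)."""
    return ",".join(str(g) for g in gpus)


for tp in (1, 2, 4):
    groups = gpu_groups(tp)
    # The two engines must never share a GPU, or they would contend for it.
    assert not set(groups["xserv"]) & set(groups["llama.cpp"])
    print(tp, {name: visible_devices(g) for name, g in groups.items()})
```

Keeping each engine inside one PHB group (0-3 or 4-7) means its cross-GPU traffic stays within that group even at TP=4.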
## Correctness: on par across engines and TP

| TP | Task | xserv | llama.cpp |
|----|------|-------|-----------|
| 1 | AIME 2025 | 16.7% (5/30) | 13.3% (4/30) |
| 1 | GSM8K | 96.7% (29/30) | 96.7% (29/30) |
| 2 | AIME 2025 | 13.3% (4/30) | 13.3% (4/30) |
| 2 | GSM8K | 93.3% (28/30) | 96.7% (29/30) |
| 4 | AIME 2025 | 16.7% (5/30) | 13.3% (4/30) |
| 4 | GSM8K | 96.7% (29/30) | 96.7% (29/30) |

Scores differ by at most one problem everywhere, across engines and across TP
degrees. At this sample size (30 problems per task), TP makes no measurable
difference to quality on either engine, and the two engines agree. (AIME is low
for both: Qwen3-8B with thinking off, capped at 2048 tokens.)

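
As a quick sanity check on the "within one problem" reading, the counts from the table can be re-derived in a few lines. This is a minimal sketch: the dictionary is transcribed by hand from the table, and `accuracy_pct` and `max_spread` are illustrative names, not part of the benchmark harness.

```python
# Correct-answer counts out of 30, copied from the table above.
N_PROBLEMS = 30
correct = {
    ("AIME 2025", 1): {"xserv": 5, "llama.cpp": 4},
    ("AIME 2025", 2): {"xserv": 4, "llama.cpp": 4},
    ("AIME 2025", 4): {"xserv": 5, "llama.cpp": 4},
    ("GSM8K", 1): {"xserv": 29, "llama.cpp": 29},
    ("GSM8K", 2): {"xserv": 28, "llama.cpp": 29},
    ("GSM8K", 4): {"xserv": 29, "llama.cpp": 29},
}


def accuracy_pct(n_correct: int, total: int = N_PROBLEMS) -> float:
    """Accuracy as a percentage, rounded to one decimal like the table."""
    return round(100.0 * n_correct / total, 1)


def max_spread(task: str) -> int:
    """Largest gap in correct answers across all engines and TP degrees for a task."""
    counts = [c for (t, _), row in correct.items() if t == task for c in row.values()]
    return max(counts) - min(counts)


for task in ("AIME 2025", "GSM8K"):
    print(task, "max spread:", max_spread(task), "problem(s)")
```

With 30 problems per task, a single problem is worth about 3.3 percentage points, so the 13.3% vs 16.7% and 93.3% vs 96.7% gaps are each exactly one problem.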
## Performance: TPOT (ms/token, lower is better)

| TP | xserv AIME / GSM8K | llama.cpp AIME / GSM8K |
|----|--------------------|------------------------|
| 1 | 21.0 / 17.8 | **10.4 / 10.3** |
| 2 | 17.2 / 13.9 | 19.0 / 18.9 |
| 4 | **15.2 / 12.1** | 20.2 / 20.2 |

**Opposite TP scaling, with a crossover:**

- **xserv scales positively with TP**: TPOT 21.0 → 17.2 → 15.2 ms (AIME) and
  17.8 → 13.9 → 12.1 ms (GSM8K), so TP=4 is ~1.4–1.5× faster than TP=1. GPUs 0-3
  are all ~82% utilized. (Scaling is sublinear because of the 72 PCIe
  AllReduces per token.)
- **llama.cpp row-split regresses**: TPOT 10.4 → 19.0 → 20.2 ms (AIME). TP=1 is
  its best; TP=2/4 nearly double the latency. GPUs 4-7 are only ~24% utilized
  (communication-bound): row-split's per-layer cross-GPU traffic over PCIe
  without NVLink dominates.
- **Crossover**: llama.cpp is ~2× faster at TP=1 (10.4 vs 21.0 ms AIME), but
  xserv is already ahead at TP=2 (17.2 vs 19.0 ms AIME) and ~1.3× faster at TP=4
  (15.2 vs 20.2 ms AIME). AIME-30 wall clock: xserv 1046 → 846 → 730 s
  (falling), llama.cpp 520 → 952 → 1012 s (rising).

On an NVLink-less PCIe box, **xserv's TP is a genuine win and llama.cpp's
row split is counterproductive**, which is what the topology predicts.

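
The ratios quoted in this section can be recomputed directly from the TPOT table. The sketch below assumes nothing beyond the table values, which are transcribed by hand; the function names are illustrative only.

```python
# TPOT in ms/token as (AIME, GSM8K), copied from the table above.
tpot_ms = {
    "xserv": {1: (21.0, 17.8), 2: (17.2, 13.9), 4: (15.2, 12.1)},
    "llama.cpp": {1: (10.4, 10.3), 2: (19.0, 18.9), 4: (20.2, 20.2)},
}


def self_speedup(engine: str, tp: int) -> tuple[float, ...]:
    """TPOT at TP=1 divided by TPOT at `tp`, per task (>1 means TP helps)."""
    base, cur = tpot_ms[engine][1], tpot_ms[engine][tp]
    return tuple(round(b / c, 2) for b, c in zip(base, cur))


def xserv_advantage(tp: int) -> tuple[float, ...]:
    """llama.cpp TPOT divided by xserv TPOT, per task (>1 means xserv is faster)."""
    xs, ll = tpot_ms["xserv"][tp], tpot_ms["llama.cpp"][tp]
    return tuple(round(l / x, 2) for l, x in zip(ll, xs))


for tp in (1, 2, 4):
    print(f"TP={tp}: xserv self-speedup {self_speedup('xserv', tp)}, "
          f"llama.cpp self-speedup {self_speedup('llama.cpp', tp)}, "
          f"xserv advantage {xserv_advantage(tp)}")
```

An `xserv_advantage` value below 1 means llama.cpp is faster at that TP degree and above 1 means xserv is faster, so the crossover is the TP degree at which the value first exceeds 1.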
### Clean same-path xserv scaling (bench-tp)

The xserv HTTP numbers above mix two code paths: TP=1 uses the production
continuous-batching engine, while TP≥2 uses the serial TP coordinator.
`bench-tp` measures single-stream scaling on one code path (greedy, 8 prompts ×
64 tokens):

| TP | xserv decode tok/s | Speedup | TTFT |
|----|--------------------|---------|------|
| 1 | 58.5 | 1.00× | 18.0 ms |
| 2 | 75.7 | 1.29× | 13.4 ms |
| 4 | 86.1 | 1.47× | 11.5 ms |

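
To line these throughput figures up against the TPOT table, tok/s can be converted to ms/token, and the speedup can be expressed as parallel efficiency (speedup divided by GPU count). This is a small sketch over the table values only; the function names are illustrative, and the prompt mix here differs from the HTTP runs, so the converted latencies are not directly comparable to the AIME/GSM8K TPOT.

```python
# Decode throughput in tokens/s per TP degree, copied from the table above.
decode_tok_s = {1: 58.5, 2: 75.7, 4: 86.1}


def ms_per_token(tok_s: float) -> float:
    """Convert single-stream throughput (tok/s) to per-token latency (ms)."""
    return 1000.0 / tok_s


def speedup(tp: int) -> float:
    """Throughput at `tp` relative to TP=1."""
    return decode_tok_s[tp] / decode_tok_s[1]


def parallel_efficiency(tp: int) -> float:
    """Fraction of ideal linear scaling achieved (1.0 would be perfect)."""
    return speedup(tp) / tp


for tp in decode_tok_s:
    print(f"TP={tp}: {ms_per_token(decode_tok_s[tp]):.1f} ms/token, "
          f"{speedup(tp):.2f}x, efficiency {parallel_efficiency(tp):.0%}")
```

Efficiency falls as TP grows, which is the usual signature of a fixed per-token communication cost that does not shrink as compute is divided across more GPUs.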
## Caveats

- xserv **TP=1 uses the production `Engine`**, while TP≥2 uses the serial
  `tp_engine` coordinator. The per-token paths differ, so the HTTP TP=1→2 step
  has an engine confound. The clean same-path scaling (bench-tp, above)
  confirms the trend.
- xserv **TTFT is weaker** on long AIME prompts (~460–500 ms vs ~100–190 ms for
  llama.cpp); prefill is a known optimization target.
- llama.cpp uses `--split-mode row` (its tensor-parallel mode); the default
  `layer` split only splits memory across GPUs, without parallel compute.
- The TP HTTP server processes requests **serially** (sufficient for this serial
  quality benchmark); continuous-batching TP is future work.

Raw artifacts: `bench-out/tp{1,2,4}-{xserv,llama}/comparison-*.{md,json}`.