docs: update llama.cpp comparison with 8192 results (OOM fixed)
Re-ran the full comparison at --max-seq-len 8192 now that xserv handles it:

- OOM finding resolved — pool sized to available VRAM + vLLM-style host swap;
  8192 runs with 0 swap events (swap is the overload safety net).
- Quality at parity with equal context: AIME 20.0% vs 20.0%, GSM8K 98% vs 96%.
- Speed unchanged relative to llama.cpp (~0.42-0.60x); TPOT is bandwidth-bound.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
@@ -30,32 +30,37 @@ GPU, and a resident idle engine would distort the other's numbers).
 Generation mode is matched: xserv hardcodes Qwen3 **thinking off**, so the
 driver sends `chat_template_kwargs={enable_thinking:false}` to llama.cpp.
 
-## Results (RTX 5090, BF16, greedy, 4096 ctx, max_batch 4)
+## Results (RTX 5090, BF16, greedy, 8192 ctx, max_batch 4)
 
 ### Performance — llama.cpp is the stronger baseline
 
 | scenario | metric | xserv | llama.cpp | xserv ÷ llama.cpp |
 |---|---|---|---|---|
-| single / medium | TTFT p50 (ms) | 26.8 | 18.0 | 0.67× |
-| single / medium | TPOT p50 (ms/tok) | 17.1 | 10.4 | 0.61× |
-| single / medium | throughput (tok/s) | 58.1 | 94.9 | 0.61× |
-| concurrent-4 | throughput (tok/s) | 143.4 | 317.7 | 0.45× |
-| concurrent-8 | throughput (tok/s) | 142.9 | 321.7 | 0.44× |
+| single / medium | TTFT p50 (ms) | 28.0 | 17.7 | 0.63× |
+| single / medium | TPOT p50 (ms/tok) | 17.5 | 10.4 | 0.60× |
+| single / medium | throughput (tok/s) | 56.6 | 95.1 | 0.60× |
+| concurrent-4 | throughput (tok/s) | 135.2 | 317.1 | 0.43× |
+| concurrent-8 | throughput (tok/s) | 135.5 | 322.5 | 0.42× |
 
-xserv runs at **~0.45–0.61×** llama.cpp. It saturates at `max_batch` (143 tok/s)
-while llama.cpp keeps scaling under load (322 tok/s). This is the honest new bar.
+xserv runs at **~0.42–0.60×** llama.cpp. It saturates at `max_batch` (~135 tok/s)
+while llama.cpp keeps scaling under load (~322 tok/s). This is the honest new bar.
+The ratio is the same at 4096 and 8192 — TPOT is bandwidth-bound, not
+context-bound at these sizes.
 
 ### Quality — parity, confirming xserv's numerical fidelity
 
 | task | n | xserv | llama.cpp |
 |---|---|---|---|
-| GSM8K | 50 | 94.0% (47/50) | 96.0% (48/50) |
-| AIME 2025 | 30 | 23.3% (7/30) | 20.0% (6/30) |
+| GSM8K | 50 | 98.0% (49/50) | 96.0% (48/50) |
+| AIME 2025 | 30 | 20.0% (6/30) | 20.0% (6/30) |
 
-With equal context, the two engines score within one problem of each other on
-both tasks. Response prefixes are byte-identical (same prompt templating), so
-the small residual difference is greedy-decode divergence on long sequences —
-not an engine quality gap.
+With equal context the two engines land at identical AIME accuracy and
+within one problem on GSM8K. At 8192 both generate full-length solutions
+(mean ~3.4k / ~4.2k tokens), so neither is truncated. Two independent engines
+agreeing at ~20% confirms that's genuine Qwen3-8B (thinking-off) capability and
+that xserv is numerically faithful. Response prefixes are byte-identical (same
+prompt templating); the only run-to-run wobble is greedy-decode divergence /
+nondeterminism on long (~3k-token) sequences (see finding 3).
 
 ## Findings the benchmark surfaced
 
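The ratio column in the performance table can be audited by recomputing it from the published values. The sketch below assumes the column reports relative speed: for latency metrics (TTFT, TPOT) it divides llama.cpp's time by xserv's, and for throughput it divides xserv's rate by llama.cpp's. That reading is inferred from the numbers in the table, not stated in the source, and the names `relative_speed`, `ratio_table`, and `ROWS_8192` are hypothetical.

```python
def relative_speed(xserv, llama_cpp, lower_is_better):
    """Return xserv's speed as a fraction of llama.cpp's (1.0 means parity)."""
    if lower_is_better:
        # Latency: a smaller number is faster, so the ratio is inverted.
        return llama_cpp / xserv
    return xserv / llama_cpp


# (scenario, metric, xserv, llama.cpp, lower_is_better), copied from the 8192 rows.
ROWS_8192 = [
    ("single / medium", "TTFT p50 (ms)", 28.0, 17.7, True),
    ("single / medium", "TPOT p50 (ms/tok)", 17.5, 10.4, True),
    ("single / medium", "throughput (tok/s)", 56.6, 95.1, False),
    ("concurrent-4", "throughput (tok/s)", 135.2, 317.1, False),
    ("concurrent-8", "throughput (tok/s)", 135.5, 322.5, False),
]


def ratio_table(rows):
    """Recompute the ratio column, rounded to two decimals like the table."""
    return [
        (scenario, metric, round(relative_speed(x, l, lower), 2))
        for scenario, metric, x, l, lower in rows
    ]


if __name__ == "__main__":
    for scenario, metric, ratio in ratio_table(ROWS_8192):
        print(f"{scenario:16s} {metric:20s} {ratio:.2f}x")
```

Recomputing from the rounded table values gives 0.59 for the TPOT row where the table shows 0.60; the published figure was presumably derived from unrounded timings. The other four rows reproduce exactly.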
@@ -66,13 +71,18 @@ not an engine quality gap.
 (capped at ~940 generated tokens). GSM8K (~280 tokens) was unaffected, which
 is how we caught it. Fixed: per-slot context = `max_seq_len` (total
 `-c = max_seq_len × parallel`). After the fix, AIME is at parity (above).
-2. **xserv OOMs at `--max-seq-len 8192` + `--max-batch 4`.** xserv eagerly
-pre-allocates its paged-KV pool (~9GB at 8192) on top of the 16GB weights,
-exceeding 32GB at startup; llama.cpp allocates KV lazily and fits 8192. The
-comparison above runs at 4096 (xserv peaks ~28GB). Tracked as a follow-up.
+2. **xserv OOM'd at `--max-seq-len 8192` — now fixed.** xserv used to eagerly
+pre-allocate its paged-KV pool (`blocks_per_seq × max_batch × 2`, ~9GB at
+8192) on top of the 16GB weights, exceeding 32GB at startup. Fixed by sizing
+the pool to *available VRAM* (`cudaMemGetInfo`) instead of worst-case demand,
+plus vLLM-style **swap to pinned host memory**: when running sequences grow
+past the GPU pool, the newest are evicted to host and swapped back when blocks
+free up (`--swap-space-gb`, default 8). The results above run at 8192 with **0
+swap events** — the VRAM-sized pool alone covers this load; swap is the
+overload safety net (verified lossless under a forced-small pool).
 3. **xserv decode is not run-to-run deterministic.** The same greedy (temp 0)
-AIME config produced 6/30 then 7/30 across runs — non-deterministic CUDA
-reductions flip an argmax over long (~2400-token) generations. Harmless for
+AIME config produced 6/30 / 7/30 / 6/30 across runs — non-deterministic CUDA
+reductions flip an argmax over long (~3k-token) generations. Harmless for
 serving, but it explains why long-sequence accuracy wobbles by a problem.
 
 Raw artifacts (per-request timings, per-problem prediction/gold) are written to
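Finding 2 quotes the old worst-case pool formula (`blocks_per_seq × max_batch × 2`) and the new approach of sizing the pool to available VRAM. The arithmetic can be sketched as below. Everything concrete here is an assumption for illustration: the model dimensions (36 layers, 8 KV heads, head dimension 128, taken from Qwen3-8B's published config), the 16-token block size, the `reserve_bytes` headroom, and all function names. In a real engine `free_bytes` would come from `cudaMemGetInfo`; here it is a plain parameter so the sketch runs without a GPU. This is not xserv's code.

```python
GIB = 1 << 30


def kv_bytes_per_token(num_layers, num_kv_heads, head_dim, dtype_bytes):
    """Bytes of KV cache one token occupies across all layers."""
    # Factor 2: one K vector and one V vector per layer.
    return 2 * num_layers * num_kv_heads * head_dim * dtype_bytes


def worst_case_blocks(max_seq_len, max_batch, block_tokens):
    """Old policy: the demand formula quoted in finding 2."""
    blocks_per_seq = -(-max_seq_len // block_tokens)  # ceiling division
    return blocks_per_seq * max_batch * 2


def vram_sized_blocks(free_bytes, reserve_bytes, block_bytes):
    """New policy (sketch): fit the pool into whatever VRAM is actually free."""
    usable = max(free_bytes - reserve_bytes, 0)
    return usable // block_bytes


if __name__ == "__main__":
    block_tokens = 16  # assumed block size
    per_token = kv_bytes_per_token(36, 8, 128, 2)  # BF16 = 2 bytes
    block_bytes = per_token * block_tokens

    demand = worst_case_blocks(8192, 4, block_tokens)
    print(f"worst-case pool: {demand} blocks, {demand * block_bytes / GIB:.1f} GiB")

    # Hypothetical free VRAM after loading weights, with 1 GiB held back.
    fitted = vram_sized_blocks(12 * GIB, 1 * GIB, block_bytes)
    print(f"VRAM-sized pool: {fitted} blocks, {fitted * block_bytes / GIB:.1f} GiB")
```

Under these assumed dimensions the worst-case formula comes to 9 GiB at 8192 with `max_batch` 4, which is consistent with the ~9GB quoted in the finding.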
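The swap policy described in finding 2 (evict the newest running sequences to host when the GPU pool is exhausted, bring them back when blocks free up) can be illustrated with a toy block-accounting model. This is a minimal sketch under stated assumptions, not xserv's or vLLM's scheduler: the class `SwapScheduler`, its methods, and the one-block-per-step growth are all hypothetical, and no data is actually copied.

```python
class SwapScheduler:
    """Toy accounting for a GPU block pool with a host-side parking area."""

    def __init__(self, gpu_blocks):
        self.free_blocks = gpu_blocks
        self.running = {}   # seq_id -> GPU blocks held; dict order = admission order
        self.swapped = {}   # seq_id -> blocks parked in host memory
        self.swap_outs = 0  # number of evictions to host

    def admit(self, seq_id, blocks):
        if blocks < 1:
            raise ValueError("a sequence needs at least one block")
        if blocks > self.free_blocks:
            raise MemoryError("not enough free GPU blocks to admit sequence")
        self.running[seq_id] = blocks
        self.free_blocks -= blocks

    def _swap_out(self, seq_id):
        blocks = self.running.pop(seq_id)
        self.swapped[seq_id] = blocks
        self.free_blocks += blocks
        self.swap_outs += 1

    def step(self):
        """Grow every running sequence by one block, oldest first."""
        for seq_id in list(self.running):
            if seq_id not in self.running:
                continue  # evicted earlier in this step
            while self.free_blocks == 0:
                # Newest running sequence is the eviction victim
                # (reversed() on a dict needs Python 3.8+).
                self._swap_out(next(reversed(self.running)))
            if seq_id in self.running:
                self.running[seq_id] += 1
                self.free_blocks -= 1

    def finish(self, seq_id):
        """Release a finished sequence, then swap parked ones back in if they fit."""
        self.free_blocks += self.running.pop(seq_id)
        for parked in list(self.swapped):  # oldest eviction first
            if self.swapped[parked] <= self.free_blocks:
                blocks = self.swapped.pop(parked)
                self.running[parked] = blocks
                self.free_blocks -= blocks


if __name__ == "__main__":
    tight = SwapScheduler(gpu_blocks=4)
    tight.admit("a", 2)
    tight.admit("b", 2)
    tight.step()
    print(tight.running, tight.swapped, tight.swap_outs)

    roomy = SwapScheduler(gpu_blocks=16)
    roomy.admit("a", 2)
    roomy.admit("b", 2)
    roomy.step()
    print(roomy.running, roomy.swapped, roomy.swap_outs)
```

The second scenario mirrors the reported benchmark run: when the pool is large enough for the load, the eviction path is never taken and the swap counter stays at zero.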
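Finding 3 attributes the run-to-run wobble to non-deterministic reductions flipping an argmax. The mechanism is that floating-point addition is not associative, so summing the same values in a different order can change the last bit of a logit, and a near-tie between two tokens can then resolve differently. The sketch below shows this with double-precision Python floats; it is a toy illustration of the effect, not a reproduction of xserv's CUDA kernels, and the function names are hypothetical.

```python
def sum_in_order(values):
    """Accumulate left to right, as one particular reduction order would."""
    total = 0.0
    for v in values:
        total += v
    return total


def argmax(logits):
    """Index of the largest logit; ties go to the lowest index."""
    return max(range(len(logits)), key=logits.__getitem__)


if __name__ == "__main__":
    terms = [0.1, 0.2, 0.3]

    forward = sum_in_order(terms)             # (0.1 + 0.2) + 0.3
    backward = sum_in_order(terms[::-1])      # (0.3 + 0.2) + 0.1
    print(forward, backward, forward == backward)

    # Token 0 has a fixed logit; token 1's logit is the reduced sum.
    # The reduction order alone decides which token greedy decoding picks.
    print(argmax([0.6, forward]), argmax([0.6, backward]))
```

Once a single token differs, every later token is conditioned on a different prefix, which is why the effect shows up on long generations rather than short ones.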