docs: update llama.cpp comparison with 8192 results (OOM fixed)

Re-ran the full comparison at --max-seq-len 8192 now that xserv handles it:
- OOM finding resolved — pool sized to available VRAM + vLLM-style host swap;
  8192 runs with 0 swap events (swap is the overload safety net).
- Quality at parity with equal context: AIME 20.0% vs 20.0%, GSM8K 98% vs 96%.
- Speed unchanged relative to llama.cpp (~0.42-0.60x); TPOT is bandwidth-bound.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-28 21:32:14 +08:00
parent fc1900a745
commit 80157e614a
2 changed files with 43 additions and 33 deletions

@@ -126,8 +126,7 @@ HF_ENDPOINT=https://hf-mirror.com python3 -m tools.bench.fetch_datasets
 **Full sweep on dash5 (recommended):**
 ```bash
-# 4096 ctx because xserv OOMs at 8192 (see Known constraints)
-./tools/sync-and-build.sh bench -- --max-seq-len 4096 --quality-limit 50
+./tools/sync-and-build.sh bench -- --max-seq-len 8192 --quality-limit 50
 ./tools/sync-and-build.sh fetch-bench-out
 open bench-out/comparison-*.md
 ```
@@ -179,17 +178,18 @@ python3 -m tools.bench.runner \
 ## Known constraints / findings
-- **xserv OOMs at `--max-seq-len 8192` + `--max-batch 4`.** xserv eagerly
-pre-allocates its paged-KV pool (`total_blocks = blocks_per_seq · max_batch ·
-2`, ≈9GB at 8192) on top of the 16GB weights, exceeding 32GB at startup
-(`paged_kv_cache.rs` `alloc paged K pool: OutOfMemory`). llama.cpp allocates
-KV lazily and fits 8192 easily. Until xserv's sizing is fixed, run the
-comparison at `--max-seq-len 4096` (xserv peaks ~28GB there). The benchmark
-surfaced this — it's tracked as a follow-up fix.
-- When the xserv engine thread dies, the request handler panics on the poisoned
-`engine_sender` mutex and every subsequent request fails with "server
-disconnected". The driver records these as per-request errors (no crash), so a
-broken engine shows up as `errs=N` / `accuracy 0%` rather than a hung run.
+- **xserv OOM at `--max-seq-len 8192` — fixed.** xserv used to pre-allocate its
+paged-KV pool (`total_blocks = blocks_per_seq · max_batch · 2`, ≈9GB at 8192)
+on top of the 16GB weights, exceeding 32GB at startup (`paged_kv_cache.rs`
+`alloc paged K pool: OutOfMemory`). Now the pool is sized to *available VRAM*
+(`cudaMemGetInfo`) and overflow is swapped to pinned host memory (vLLM-style
+preemption, `--swap-space-gb`). The 8192 comparison runs cleanly with 0 swap
+events; swap is verified separately under a forced-small pool. The benchmark
+surfaced the OOM — a good example of the baseline doing its job.
+- When the xserv engine thread dies, the API now returns a clean 503 (the
+request handler uses a poison-tolerant lock instead of cascading
+mutex-poison panics). The driver records any failure as a per-request error,
+so a broken engine shows up as `errs=N` / `accuracy 0%` rather than a hung run.
 ## Future extensions
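Reviewer note on the hunk above: the `total_blocks = blocks_per_seq · max_batch · 2` rule and the ≈9GB figure can be sanity-checked by hand. The sketch below is not xserv's code. The layer count, KV-head count and head dimension are assumptions taken from the published Qwen3-8B config, `BLOCK_SIZE` is a hypothetical block granularity, and `vram_capped_blocks` is only one plausible reading of "sized to available VRAM" (in the real engine the free-byte figure would come from `cudaMemGetInfo`; here it is passed in as a plain number).

```python
import math

# Assumed Qwen3-8B attention geometry (published model config, not read from this repo).
NUM_LAYERS = 36
NUM_KV_HEADS = 8
HEAD_DIM = 128
BYTES_PER_ELEM = 2   # BF16
BLOCK_SIZE = 16      # hypothetical tokens per paged-KV block


def kv_bytes_per_token() -> int:
    """Bytes of K plus V cache stored for one token across all layers."""
    return 2 * NUM_LAYERS * NUM_KV_HEADS * HEAD_DIM * BYTES_PER_ELEM


def worst_case_blocks(max_seq_len: int, max_batch: int) -> int:
    """Old eager sizing: blocks_per_seq * max_batch * 2."""
    blocks_per_seq = math.ceil(max_seq_len / BLOCK_SIZE)
    return blocks_per_seq * max_batch * 2


def pool_bytes(num_blocks: int) -> int:
    """Total bytes occupied by a pool of num_blocks blocks."""
    return num_blocks * BLOCK_SIZE * kv_bytes_per_token()


def vram_capped_blocks(demand_blocks: int, free_bytes: int, reserve_bytes: int) -> int:
    """Hypothetical new sizing: never ask for more blocks than fit in free VRAM."""
    block_bytes = BLOCK_SIZE * kv_bytes_per_token()
    fit = max(0, free_bytes - reserve_bytes) // block_bytes
    return min(demand_blocks, fit)


if __name__ == "__main__":
    demand = worst_case_blocks(8192, 4)
    print(demand, pool_bytes(demand) / 2**30)  # 4096 9.0
```

Under these assumptions the worst-case pool at 8192 is exactly 9 GiB, which matches the quoted ≈9GB. Capping to free VRAM shrinks the pool instead of failing at startup; any demand above the cap is what the swap path would then have to absorb.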

@@ -30,32 +30,37 @@ GPU, and a resident idle engine would distort the other's numbers).
 Generation mode is matched: xserv hardcodes Qwen3 **thinking off**, so the
 driver sends `chat_template_kwargs={enable_thinking:false}` to llama.cpp.
-## Results (RTX 5090, BF16, greedy, 4096 ctx, max_batch 4)
+## Results (RTX 5090, BF16, greedy, 8192 ctx, max_batch 4)
 ### Performance — llama.cpp is the stronger baseline
 | scenario | metric | xserv | llama.cpp | xserv ÷ llama.cpp |
 |---|---|---|---|---|
-| single / medium | TTFT p50 (ms) | 26.8 | 18.0 | 0.67× |
-| single / medium | TPOT p50 (ms/tok) | 17.1 | 10.4 | 0.61× |
-| single / medium | throughput (tok/s) | 58.1 | 94.9 | 0.61× |
-| concurrent-4 | throughput (tok/s) | 143.4 | 317.7 | 0.45× |
-| concurrent-8 | throughput (tok/s) | 142.9 | 321.7 | 0.44× |
+| single / medium | TTFT p50 (ms) | 28.0 | 17.7 | 0.63× |
+| single / medium | TPOT p50 (ms/tok) | 17.5 | 10.4 | 0.60× |
+| single / medium | throughput (tok/s) | 56.6 | 95.1 | 0.60× |
+| concurrent-4 | throughput (tok/s) | 135.2 | 317.1 | 0.43× |
+| concurrent-8 | throughput (tok/s) | 135.5 | 322.5 | 0.42× |
-xserv runs at **~0.45–0.61×** llama.cpp. It saturates at `max_batch` (143 tok/s)
-while llama.cpp keeps scaling under load (322 tok/s). This is the honest new bar.
+xserv runs at **~0.42–0.60×** llama.cpp. It saturates at `max_batch` (~135 tok/s)
+while llama.cpp keeps scaling under load (~322 tok/s). This is the honest new bar.
+The ratio is the same at 4096 and 8192 — TPOT is bandwidth-bound, not
+context-bound at these sizes.
 ### Quality — parity, confirming xserv's numerical fidelity
 | task | n | xserv | llama.cpp |
 |---|---|---|---|
-| GSM8K | 50 | 94.0% (47/50) | 96.0% (48/50) |
-| AIME 2025 | 30 | 23.3% (7/30) | 20.0% (6/30) |
+| GSM8K | 50 | 98.0% (49/50) | 96.0% (48/50) |
+| AIME 2025 | 30 | 20.0% (6/30) | 20.0% (6/30) |
-With equal context, the two engines score within one problem of each other on
-both tasks. Response prefixes are byte-identical (same prompt templating), so
-the small residual difference is greedy-decode divergence on long sequences —
-not an engine quality gap.
+With equal context the two engines land at identical AIME accuracy and
+within one problem on GSM8K. At 8192 both generate full-length solutions
+(mean ~3.4k / ~4.2k tokens), so neither is truncated. Two independent engines
+agreeing at ~20% confirms that's genuine Qwen3-8B (thinking-off) capability and
+that xserv is numerically faithful. Response prefixes are byte-identical (same
+prompt templating); the only run-to-run wobble is greedy-decode divergence /
+nondeterminism on long (~3k-token) sequences (see finding 3).
 ## Findings the benchmark surfaced
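Reviewer note on the performance table in the hunk above: the last column is headed `xserv ÷ llama.cpp`, but for the latency rows (TTFT, TPOT) the published values only make sense as llama.cpp ÷ xserv, i.e. a relative speed where higher always favors xserv. That reading is inferred from the numbers, not stated in the source. The sketch below recomputes the column from the rounded 8192 values; because the inputs are already rounded, the last digit can differ from the published figure (TPOT recomputes to about 0.59 against a published 0.60).

```python
# Rows copied from the 8192 results table:
# (scenario, metric, xserv, llama_cpp, published_ratio)
ROWS = [
    ("single / medium", "TTFT p50 (ms)", 28.0, 17.7, 0.63),
    ("single / medium", "TPOT p50 (ms/tok)", 17.5, 10.4, 0.60),
    ("single / medium", "throughput (tok/s)", 56.6, 95.1, 0.60),
    ("concurrent-4", "throughput (tok/s)", 135.2, 317.1, 0.43),
    ("concurrent-8", "throughput (tok/s)", 135.5, 322.5, 0.42),
]


def is_latency(metric: str) -> bool:
    """TTFT and TPOT are latencies: lower is better."""
    return metric.startswith(("TTFT", "TPOT"))


def relative_speed(metric: str, xserv: float, llama_cpp: float) -> float:
    """Speed of xserv relative to llama.cpp; values below 1.0 mean xserv is slower."""
    if is_latency(metric):
        return llama_cpp / xserv   # invert so that higher still means faster
    return xserv / llama_cpp


if __name__ == "__main__":
    for scenario, metric, xserv, llama_cpp, published in ROWS:
        ratio = relative_speed(metric, xserv, llama_cpp)
        print(f"{scenario:16s} {metric:20s} {ratio:.3f} (published {published:.2f})")
```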
@@ -66,13 +71,18 @@ not an engine quality gap.
 (capped at ~940 generated tokens). GSM8K (~280 tokens) was unaffected, which
 is how we caught it. Fixed: per-slot context = `max_seq_len` (total
 `-c = max_seq_len × parallel`). After the fix, AIME is at parity (above).
-2. **xserv OOMs at `--max-seq-len 8192` + `--max-batch 4`.** xserv eagerly
-pre-allocates its paged-KV pool (~9GB at 8192) on top of the 16GB weights,
-exceeding 32GB at startup; llama.cpp allocates KV lazily and fits 8192. The
-comparison above runs at 4096 (xserv peaks ~28GB). Tracked as a follow-up.
+2. **xserv OOM'd at `--max-seq-len 8192` — now fixed.** xserv used to eagerly
+pre-allocate its paged-KV pool (`blocks_per_seq × max_batch × 2`, ~9GB at
+8192) on top of the 16GB weights, exceeding 32GB at startup. Fixed by sizing
+the pool to *available VRAM* (`cudaMemGetInfo`) instead of worst-case demand,
+plus vLLM-style **swap to pinned host memory**: when running sequences grow
+past the GPU pool, the newest are evicted to host and swapped back when blocks
+free up (`--swap-space-gb`, default 8). The results above run at 8192 with **0
+swap events** — the VRAM-sized pool alone covers this load; swap is the
+overload safety net (verified lossless under a forced-small pool).
 3. **xserv decode is not run-to-run deterministic.** The same greedy (temp 0)
-AIME config produced 6/30 then 7/30 across runs — non-deterministic CUDA
-reductions flip an argmax over long (~2400-token) generations. Harmless for
+AIME config produced 6/30 / 7/30 / 6/30 across runs — non-deterministic CUDA
+reductions flip an argmax over long (~3k-token) generations. Harmless for
 serving, but it explains why long-sequence accuracy wobbles by a problem.
 Raw artifacts (per-request timings, per-problem prediction/gold) are written to
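Reviewer note on finding 3 in the hunk above: the mechanism it describes (reduction order changing which token wins a greedy argmax) follows from floating-point addition not being associative. The toy below is plain Python, not xserv's CUDA kernels, and the values are deliberately extreme so the effect shows up in a single step. In a real decode the differences are tiny and only matter when two logits are nearly tied.

```python
def reduce_in_order(terms, order):
    """Sum the terms in the given index order, as one reduction schedule might."""
    total = 0.0
    for i in order:
        total += terms[i]
    return total


def argmax(values):
    """Index of the largest value (first one wins on ties)."""
    return max(range(len(values)), key=lambda i: values[i])


# Partial products contributing to one token's logit (made-up numbers).
TERMS = [1e16, 1.0, -1e16]
RIVAL_LOGIT = 0.5  # logit of a competing token, also made up

# Schedule A adds 1.0 to 1e16 first; the 1.0 is lost to rounding.
logit_a = reduce_in_order(TERMS, [0, 1, 2])
# Schedule B cancels the large terms first; the 1.0 survives.
logit_b = reduce_in_order(TERMS, [0, 2, 1])

if __name__ == "__main__":
    print(logit_a, logit_b)  # 0.0 1.0
    print(argmax([logit_a, RIVAL_LOGIT]), argmax([logit_b, RIVAL_LOGIT]))  # 1 0
```

Same inputs and the same greedy rule, but a different summation order picks a different token. Over a generation of a few thousand tokens, one such flip is enough to send the rest of the sequence down a different path.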