docs: update llama.cpp comparison with 8192 results (OOM fixed)

Re-ran the full comparison at --max-seq-len 8192 now that xserv handles it:
- OOM finding resolved — pool sized to available VRAM + vLLM-style host swap;
  8192 runs with 0 swap events (swap is the overload safety net).
- Quality at parity with equal context: AIME 20.0% vs 20.0%, GSM8K 98% vs 96%.
- Speed unchanged relative to llama.cpp (~0.42-0.60x); TPOT is bandwidth-bound.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-28 21:32:14 +08:00
parent fc1900a745
commit 80157e614a
2 changed files with 43 additions and 33 deletions

@@ -126,8 +126,7 @@ HF_ENDPOINT=https://hf-mirror.com python3 -m tools.bench.fetch_datasets
**Full sweep on dash5 (recommended):**
```bash
# 4096 ctx because xserv OOMs at 8192 (see Known constraints)
./tools/sync-and-build.sh bench -- --max-seq-len 4096 --quality-limit 50
./tools/sync-and-build.sh bench -- --max-seq-len 8192 --quality-limit 50
./tools/sync-and-build.sh fetch-bench-out
open bench-out/comparison-*.md
```
@@ -179,17 +178,18 @@ python3 -m tools.bench.runner \
## Known constraints / findings
- **xserv OOMs at `--max-seq-len 8192` + `--max-batch 4`.** xserv eagerly
pre-allocates its paged-KV pool (`total_blocks = blocks_per_seq · max_batch ·
2`, ≈9GB at 8192) on top of the 16GB weights, exceeding 32GB at startup
(`paged_kv_cache.rs` `alloc paged K pool: OutOfMemory`). llama.cpp allocates
KV lazily and fits 8192 easily. Until xserv's sizing is fixed, run the
comparison at `--max-seq-len 4096` (xserv peaks ~28GB there). The benchmark
surfaced this — it's tracked as a follow-up fix.
- When the xserv engine thread dies, the request handler panics on the poisoned
`engine_sender` mutex and every subsequent request fails with "server
disconnected". The driver records these as per-request errors (no crash), so a
broken engine shows up as `errs=N` / `accuracy 0%` rather than a hung run.
- **xserv OOM at `--max-seq-len 8192` — fixed.** xserv used to pre-allocate its
paged-KV pool (`total_blocks = blocks_per_seq · max_batch · 2`, ≈9GB at 8192)
on top of the 16GB weights, exceeding 32GB at startup (`paged_kv_cache.rs`
`alloc paged K pool: OutOfMemory`). Now the pool is sized to *available VRAM*
(`cudaMemGetInfo`) and overflow is swapped to pinned host memory (vLLM-style
preemption, `--swap-space-gb`). The 8192 comparison runs cleanly with 0 swap
events; swap is verified separately under a forced-small pool. The benchmark
surfaced the OOM — a good example of the baseline doing its job.
- When the xserv engine thread dies, the API now returns a clean 503 (the
request handler uses a poison-tolerant lock instead of cascading
mutex-poison panics). The driver records any failure as a per-request error,
so a broken engine shows up as `errs=N` / `accuracy 0%` rather than a hung run.
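
The ≈9GB pool figure can be sanity-checked with back-of-envelope arithmetic. The sketch below assumes Qwen3-8B's published shape (36 layers, 8 KV heads, head dimension 128) and a BF16 KV cache. None of these values are stated in this document, and treating the `· 2` term as a plain multiplier on the per-batch demand is also an assumption.

```python
# Assumed model shape (Qwen3-8B config; not stated in this document).
NUM_LAYERS = 36
NUM_KV_HEADS = 8
HEAD_DIM = 128
BYTES_PER_ELEM = 2  # BF16

GIB = 1024 ** 3


def kv_bytes_per_token() -> int:
    """Bytes of K plus V stored for one token across all layers."""
    return 2 * NUM_LAYERS * NUM_KV_HEADS * HEAD_DIM * BYTES_PER_ELEM


def eager_pool_bytes(max_seq_len: int, max_batch: int, factor: int = 2) -> int:
    """Worst-case pool size, mirroring blocks_per_seq * max_batch * 2 in tokens."""
    return kv_bytes_per_token() * max_seq_len * max_batch * factor


pool_8192_gib = eager_pool_bytes(8192, 4) / GIB
pool_4096_gib = eager_pool_bytes(4096, 4) / GIB
print(f"8192 ctx: {pool_8192_gib:.1f} GiB, 4096 ctx: {pool_4096_gib:.1f} GiB")
```

Under these assumptions the eager pool comes to 9.0 GiB at 8192 and 4.5 GiB at 4096, in line with the ≈9GB quoted above.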
## Future extensions

@@ -30,32 +30,37 @@ GPU, and a resident idle engine would distort the other's numbers).
Generation mode is matched: xserv hardcodes Qwen3 **thinking off**, so the
driver sends `chat_template_kwargs={enable_thinking:false}` to llama.cpp.
## Results (RTX 5090, BF16, greedy, 4096 ctx, max_batch 4)
## Results (RTX 5090, BF16, greedy, 8192 ctx, max_batch 4)
### Performance — llama.cpp is the stronger baseline
| scenario | metric | xserv | llama.cpp | xserv ÷ llama.cpp |
|---|---|---|---|---|
| single / medium | TTFT p50 (ms) | 26.8 | 18.0 | 0.67× |
| single / medium | TPOT p50 (ms/tok) | 17.1 | 10.4 | 0.61× |
| single / medium | throughput (tok/s) | 58.1 | 94.9 | 0.61× |
| concurrent-4 | throughput (tok/s) | 143.4 | 317.7 | 0.45× |
| concurrent-8 | throughput (tok/s) | 142.9 | 321.7 | 0.44× |
| single / medium | TTFT p50 (ms) | 28.0 | 17.7 | 0.63× |
| single / medium | TPOT p50 (ms/tok) | 17.5 | 10.4 | 0.60× |
| single / medium | throughput (tok/s) | 56.6 | 95.1 | 0.60× |
| concurrent-4 | throughput (tok/s) | 135.2 | 317.1 | 0.43× |
| concurrent-8 | throughput (tok/s) | 135.5 | 322.5 | 0.42× |
xserv runs at **~0.45–0.61×** llama.cpp. It saturates at `max_batch` (143 tok/s)
while llama.cpp keeps scaling under load (322 tok/s). This is the honest new bar.
xserv runs at **~0.42–0.60×** llama.cpp. It saturates at `max_batch` (~135 tok/s)
while llama.cpp keeps scaling under load (~322 tok/s). This is the honest new bar.
The ratio is the same at 4096 and 8192 — TPOT is bandwidth-bound, not
context-bound at these sizes.
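
The ratio column reads as relative speed: it is inverted for latency rows (lower is better) and direct for throughput rows. A small sketch of that convention, using the rounded 8192 values from the table. The helper name is made up for illustration, and the TPOT row is left out because the rounded inputs give 0.59 rather than the table's 0.60 (presumably the table was computed from unrounded timings).

```python
def speed_ratio(xserv: float, llama: float, lower_is_better: bool) -> float:
    """Relative speed of xserv vs llama.cpp; 1.0 means parity."""
    return llama / xserv if lower_is_better else xserv / llama


# (label, xserv value, llama.cpp value, lower_is_better)
rows = [
    ("single/medium TTFT p50 (ms)", 28.0, 17.7, True),
    ("single/medium throughput (tok/s)", 56.6, 95.1, False),
    ("concurrent-4 throughput (tok/s)", 135.2, 317.1, False),
    ("concurrent-8 throughput (tok/s)", 135.5, 322.5, False),
]

ratios = {
    label: round(speed_ratio(x, l, lower), 2) for label, x, l, lower in rows
}
for label, ratio in ratios.items():
    print(f"{label}: {ratio:.2f}x")
```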
### Quality — parity, confirming xserv's numerical fidelity
| task | n | xserv | llama.cpp |
|---|---|---|---|
| GSM8K | 50 | 94.0% (47/50) | 96.0% (48/50) |
| AIME 2025 | 30 | 23.3% (7/30) | 20.0% (6/30) |
| GSM8K | 50 | 98.0% (49/50) | 96.0% (48/50) |
| AIME 2025 | 30 | 20.0% (6/30) | 20.0% (6/30) |
With equal context, the two engines score within one problem of each other on
both tasks. Response prefixes are byte-identical (same prompt templating), so
the small residual difference is greedy-decode divergence on long sequences —
not an engine quality gap.
With equal context the two engines land at identical AIME accuracy and
within one problem on GSM8K. At 8192 both generate full-length solutions
(mean ~3.4k / ~4.2k tokens), so neither is truncated. Two independent engines
agreeing at ~20% confirms that's genuine Qwen3-8B (thinking-off) capability and
that xserv is numerically faithful. Response prefixes are byte-identical (same
prompt templating); the only run-to-run wobble is greedy-decode divergence /
nondeterminism on long (~3k-token) sequences (see finding 3).
## Findings the benchmark surfaced
@@ -66,13 +71,18 @@ not an engine quality gap.
(capped at ~940 generated tokens). GSM8K (~280 tokens) was unaffected, which
is how we caught it. Fixed: per-slot context = `max_seq_len` (total
`-c = max_seq_len × parallel`). After the fix, AIME is at parity (above).
2. **xserv OOMs at `--max-seq-len 8192` + `--max-batch 4`.** xserv eagerly
pre-allocates its paged-KV pool (~9GB at 8192) on top of the 16GB weights,
exceeding 32GB at startup; llama.cpp allocates KV lazily and fits 8192. The
comparison above runs at 4096 (xserv peaks ~28GB). Tracked as a follow-up.
2. **xserv OOM'd at `--max-seq-len 8192` — now fixed.** xserv used to eagerly
pre-allocate its paged-KV pool (`blocks_per_seq × max_batch × 2`, ~9GB at
8192) on top of the 16GB weights, exceeding 32GB at startup. Fixed by sizing
the pool to *available VRAM* (`cudaMemGetInfo`) instead of worst-case demand,
plus vLLM-style **swap to pinned host memory**: when running sequences grow
past the GPU pool, the newest are evicted to host and swapped back when blocks
free up (`--swap-space-gb`, default 8). The results above run at 8192 with **0
swap events** — the VRAM-sized pool alone covers this load; swap is the
overload safety net (verified lossless under a forced-small pool).
3. **xserv decode is not run-to-run deterministic.** The same greedy (temp 0)
AIME config produced 6/30 then 7/30 across runs — non-deterministic CUDA
reductions flip an argmax over long (~2400-token) generations. Harmless for
AIME config produced 6/30 / 7/30 / 6/30 across runs — non-deterministic CUDA
reductions flip an argmax over long (~3k-token) generations. Harmless for
serving, but it explains why long-sequence accuracy wobbles by a problem.
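
Finding 3 comes down to floating-point addition not being associative: a reduction that sums the same values in a different order can land on a slightly different result, and a near-tie between logits can then flip the argmax. A minimal CPU-side illustration in plain Python (this is not xserv code, and the values are chosen only to make the effect visible):

```python
# Same three addends, two summation orders.
left_to_right = (0.1 + 0.2) + 0.3
right_to_left = 0.1 + (0.2 + 0.3)
print(left_to_right == right_to_left)  # the two orders disagree in the last bit


def argmax(xs):
    """Index of the largest value; ties go to the lowest index."""
    return max(range(len(xs)), key=xs.__getitem__)


# Hypothetical near-tied logits: token 0 has a fixed logit, token 1's logit
# is the result of the reduction above.
rival = 0.6
pick_a = argmax([rival, left_to_right])
pick_b = argmax([rival, right_to_left])
print(pick_a, pick_b)  # different tokens chosen from the "same" inputs
```

Once one token differs, every later token is conditioned on a different prefix, which is how a last-bit difference can change the final answer on a long generation.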
Raw artifacts (per-request timings, per-problem prediction/gold) are written to