docs: update llama.cpp comparison with 8192 results (OOM fixed)

Re-ran the full comparison at --max-seq-len 8192 now that xserv handles it:

- OOM finding resolved — pool sized to available VRAM + vLLM-style host swap;
  8192 runs with 0 swap events (swap is the overload safety net).
- Quality at parity with equal context: AIME 20.0% vs 20.0%, GSM8K 98% vs 96%.
- Speed unchanged relative to llama.cpp (~0.42-0.60x); TPOT is bandwidth-bound.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-28 21:32:14 +08:00
parent fc1900a745
commit 80157e614a
2 changed files with 43 additions and 33 deletions
--- a/docs/16-llama-cpp-comparison.md
+++ b/docs/16-llama-cpp-comparison.md
@@ -126,8 +126,7 @@ HF_ENDPOINT=https://hf-mirror.com python3 -m tools.bench.fetch_datasets
 **Full sweep on dash5 (recommended):**
 ```bash
-# 4096 ctx because xserv OOMs at 8192 (see Known constraints)
+./tools/sync-and-build.sh bench -- --max-seq-len 8192 --quality-limit 50
-./tools/sync-and-build.sh bench -- --max-seq-len 4096 --quality-limit 50
 ./tools/sync-and-build.sh fetch-bench-out
 open bench-out/comparison-*.md
 ```
@@ -179,17 +178,18 @@ python3 -m tools.bench.runner \
 ## Known constraints / findings
- **xserv OOMs at `--max-seq-len 8192` + `--max-batch 4`.** xserv eagerly
+- **xserv OOM at `--max-seq-len 8192` — fixed.** xserv used to pre-allocate its
-  pre-allocates its paged-KV pool (`total_blocks = blocks_per_seq · max_batch ·
+  paged-KV pool (`total_blocks = blocks_per_seq · max_batch · 2`, ≈9GB at 8192)
-  2`, ≈9GB at 8192) on top of the 16GB weights, exceeding 32GB at startup
+  on top of the 16GB weights, exceeding 32GB at startup (`paged_kv_cache.rs`
-  (`paged_kv_cache.rs` `alloc paged K pool: OutOfMemory`). llama.cpp allocates
+  `alloc paged K pool: OutOfMemory`). Now the pool is sized to *available VRAM*
-  KV lazily and fits 8192 easily. Until xserv's sizing is fixed, run the
+  (`cudaMemGetInfo`) and overflow is swapped to pinned host memory (vLLM-style
-  comparison at `--max-seq-len 4096` (xserv peaks ~28GB there). The benchmark
+  preemption, `--swap-space-gb`). The 8192 comparison runs cleanly with 0 swap
-  surfaced this — it's tracked as a follow-up fix.
+  events; swap is verified separately under a forced-small pool. The benchmark
- When the xserv engine thread dies, the request handler panics on the poisoned
+  surfaced the OOM — a good example of the baseline doing its job.
-  `engine_sender` mutex and every subsequent request fails with "server
+- When the xserv engine thread dies, the API now returns a clean 503 (the
-  disconnected". The driver records these as per-request errors (no crash), so a
+  request handler uses a poison-tolerant lock instead of cascading
-  broken engine shows up as `errs=N` / `accuracy 0%` rather than a hung run.
+  mutex-poison panics). The driver records any failure as a per-request error,
+  so a broken engine shows up as `errs=N` / `accuracy 0%` rather than a hung run.
 ## Future extensions
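
The pool-sizing change described in the hunk above (worst-case `blocks_per_seq · max_batch · 2` versus a pool capped to free VRAM) can be sketched roughly as follows. This is an illustration only, not xserv's code: the KV geometry (36 layers, 8 KV heads, head dim 128, BF16) is an assumed Qwen3-8B-like shape, and `block_tokens`, `headroom_bytes`, and all function names are hypothetical. In the real engine the free-byte figure would come from `cudaMemGetInfo`; here it is passed in as a plain integer.

```python
from dataclasses import dataclass

GIB = 1024 ** 3


@dataclass(frozen=True)
class KvShape:
    # Assumed Qwen3-8B-like geometry; not read from the xserv source.
    num_layers: int = 36
    num_kv_heads: int = 8
    head_dim: int = 128
    dtype_bytes: int = 2    # BF16
    block_tokens: int = 16  # hypothetical paged-KV block size

    def bytes_per_block(self) -> int:
        # K and V each hold layers x kv_heads x head_dim values per token.
        per_token = 2 * self.num_layers * self.num_kv_heads * self.head_dim * self.dtype_bytes
        return per_token * self.block_tokens


def worst_case_blocks(shape: KvShape, max_seq_len: int, max_batch: int) -> int:
    # Old behaviour per the doc: blocks_per_seq * max_batch * 2, allocated up front.
    blocks_per_seq = -(-max_seq_len // shape.block_tokens)  # ceiling division
    return blocks_per_seq * max_batch * 2


def pool_blocks(shape: KvShape, max_seq_len: int, max_batch: int,
                free_bytes: int, headroom_bytes: int = GIB) -> int:
    # New behaviour (sketch): never ask for more than fits in free VRAM,
    # keeping a hypothetical headroom for activations and scratch buffers.
    budget = max(free_bytes - headroom_bytes, 0)
    fits = budget // shape.bytes_per_block()
    return min(worst_case_blocks(shape, max_seq_len, max_batch), fits)


if __name__ == "__main__":
    shape = KvShape()
    demand = worst_case_blocks(shape, 8192, 4)
    print(demand, demand * shape.bytes_per_block() / GIB)
    print(pool_blocks(shape, 8192, 4, free_bytes=4 * GIB))
```

Under these assumed numbers the worst-case demand at 8192 with `max_batch` 4 works out to 9 GiB, in line with the ≈9GB figure quoted in the doc. Sequences that outgrow the capped pool are what the host swap path is meant to absorb.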
--- a/docs/benchmarks/llama-cpp-comparison.md
+++ b/docs/benchmarks/llama-cpp-comparison.md
@@ -30,32 +30,37 @@ GPU, and a resident idle engine would distort the other's numbers).
 Generation mode is matched: xserv hardcodes Qwen3 **thinking off**, so the
 driver sends `chat_template_kwargs={enable_thinking:false}` to llama.cpp.
-## Results (RTX 5090, BF16, greedy, 4096 ctx, max_batch 4)
+## Results (RTX 5090, BF16, greedy, 8192 ctx, max_batch 4)
 ### Performance — llama.cpp is the stronger baseline
 | scenario | metric | xserv | llama.cpp | xserv ÷ llama.cpp |
 |---|---|---|---|---|
-| single / medium | TTFT p50 (ms) | 26.8 | 18.0 | 0.67× |
+| single / medium | TTFT p50 (ms) | 28.0 | 17.7 | 0.63× |
-| single / medium | TPOT p50 (ms/tok) | 17.1 | 10.4 | 0.61× |
+| single / medium | TPOT p50 (ms/tok) | 17.5 | 10.4 | 0.60× |
-| single / medium | throughput (tok/s) | 58.1 | 94.9 | 0.61× |
+| single / medium | throughput (tok/s) | 56.6 | 95.1 | 0.60× |
-| concurrent-4 | throughput (tok/s) | 143.4 | 317.7 | 0.45× |
+| concurrent-4 | throughput (tok/s) | 135.2 | 317.1 | 0.43× |
-| concurrent-8 | throughput (tok/s) | 142.9 | 321.7 | 0.44× |
+| concurrent-8 | throughput (tok/s) | 135.5 | 322.5 | 0.42× |
-xserv runs at **~0.45–0.61×** llama.cpp. It saturates at `max_batch` (143 tok/s)
+xserv runs at **~0.42–0.60×** llama.cpp. It saturates at `max_batch` (~135 tok/s)
-while llama.cpp keeps scaling under load (322 tok/s). This is the honest new bar.
+while llama.cpp keeps scaling under load (~322 tok/s). This is the honest new bar.
+The ratio is the same at 4096 and 8192 — TPOT is bandwidth-bound, not
+context-bound at these sizes.
 ### Quality — parity, confirming xserv's numerical fidelity
 | task | n | xserv | llama.cpp |
 |---|---|---|---|
-| GSM8K | 50 | 94.0% (47/50) | 96.0% (48/50) |
+| GSM8K | 50 | 98.0% (49/50) | 96.0% (48/50) |
-| AIME 2025 | 30 | 23.3% (7/30) | 20.0% (6/30) |
+| AIME 2025 | 30 | 20.0% (6/30) | 20.0% (6/30) |
-With equal context, the two engines score within one problem of each other on
+With equal context the two engines land at identical AIME accuracy and
-both tasks. Response prefixes are byte-identical (same prompt templating), so
+within one problem on GSM8K. At 8192 both generate full-length solutions
-the small residual difference is greedy-decode divergence on long sequences —
+(mean ~3.4k / ~4.2k tokens), so neither is truncated. Two independent engines
-not an engine quality gap.
+agreeing at ~20% confirms that's genuine Qwen3-8B (thinking-off) capability and
+that xserv is numerically faithful. Response prefixes are byte-identical (same
+prompt templating); the only run-to-run wobble is greedy-decode divergence /
+nondeterminism on long (~3k-token) sequences (see finding 3).
 ## Findings the benchmark surfaced
@@ -66,13 +71,18 @@ not an engine quality gap.
   (capped at ~940 generated tokens). GSM8K (~280 tokens) was unaffected, which
   is how we caught it. Fixed: per-slot context = `max_seq_len` (total
   `-c = max_seq_len × parallel`). After the fix, AIME is at parity (above).
-2. **xserv OOMs at `--max-seq-len 8192` + `--max-batch 4`.** xserv eagerly
+2. **xserv OOM'd at `--max-seq-len 8192` — now fixed.** xserv used to eagerly
-   pre-allocates its paged-KV pool (~9GB at 8192) on top of the 16GB weights,
+   pre-allocate its paged-KV pool (`blocks_per_seq × max_batch × 2`, ~9GB at
-   exceeding 32GB at startup; llama.cpp allocates KV lazily and fits 8192. The
+   8192) on top of the 16GB weights, exceeding 32GB at startup. Fixed by sizing
-   comparison above runs at 4096 (xserv peaks ~28GB). Tracked as a follow-up.
+   the pool to *available VRAM* (`cudaMemGetInfo`) instead of worst-case demand,
+   plus vLLM-style **swap to pinned host memory**: when running sequences grow
+   past the GPU pool, the newest are evicted to host and swapped back when blocks
+   free up (`--swap-space-gb`, default 8). The results above run at 8192 with **0
+   swap events** — the VRAM-sized pool alone covers this load; swap is the
+   overload safety net (verified lossless under a forced-small pool).
 3. **xserv decode is not run-to-run deterministic.** The same greedy (temp 0)
-   AIME config produced 6/30 then 7/30 across runs — non-deterministic CUDA
+   AIME config produced 6/30 / 7/30 / 6/30 across runs — non-deterministic CUDA
-   reductions flip an argmax over long (~2400-token) generations. Harmless for
+   reductions flip an argmax over long (~3k-token) generations. Harmless for
   serving, but it explains why long-sequence accuracy wobbles by a problem.
 Raw artifacts (per-request timings, per-problem prediction/gold) are written to