docs: update llama.cpp comparison with 8192 results (OOM fixed)

Re-ran the full comparison at --max-seq-len 8192 now that xserv handles it:
- OOM finding resolved: pool sized to available VRAM + vLLM-style host swap; 8192 runs with 0 swap events (swap is the overload safety net).
- Quality at parity with equal context: AIME 20.0% vs 20.0%, GSM8K 98% vs 96%.
- Speed unchanged relative to llama.cpp (~0.42-0.60x); TPOT is bandwidth-bound.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
@@ -126,8 +126,7 @@ HF_ENDPOINT=https://hf-mirror.com python3 -m tools.bench.fetch_datasets
 
 **Full sweep on dash5 (recommended):**
 ```bash
-# 4096 ctx because xserv OOMs at 8192 (see Known constraints)
-./tools/sync-and-build.sh bench -- --max-seq-len 4096 --quality-limit 50
+./tools/sync-and-build.sh bench -- --max-seq-len 8192 --quality-limit 50
 ./tools/sync-and-build.sh fetch-bench-out
 open bench-out/comparison-*.md
 ```
@@ -179,17 +178,18 @@ python3 -m tools.bench.runner \
 
 ## Known constraints / findings
 
-- **xserv OOMs at `--max-seq-len 8192` + `--max-batch 4`.** xserv eagerly
-pre-allocates its paged-KV pool (`total_blocks = blocks_per_seq · max_batch ·
-2`, ≈9GB at 8192) on top of the 16GB weights, exceeding 32GB at startup
-(`paged_kv_cache.rs` `alloc paged K pool: OutOfMemory`). llama.cpp allocates
-KV lazily and fits 8192 easily. Until xserv's sizing is fixed, run the
-comparison at `--max-seq-len 4096` (xserv peaks ~28GB there). The benchmark
-surfaced this — it's tracked as a follow-up fix.
-- When the xserv engine thread dies, the request handler panics on the poisoned
-`engine_sender` mutex and every subsequent request fails with "server
-disconnected". The driver records these as per-request errors (no crash), so a
-broken engine shows up as `errs=N` / `accuracy 0%` rather than a hung run.
+- **xserv OOM at `--max-seq-len 8192` — fixed.** xserv used to pre-allocate its
+paged-KV pool (`total_blocks = blocks_per_seq · max_batch · 2`, ≈9GB at 8192)
+on top of the 16GB weights, exceeding 32GB at startup (`paged_kv_cache.rs`
+`alloc paged K pool: OutOfMemory`). Now the pool is sized to *available VRAM*
+(`cudaMemGetInfo`) and overflow is swapped to pinned host memory (vLLM-style
+preemption, `--swap-space-gb`). The 8192 comparison runs cleanly with 0 swap
+events; swap is verified separately under a forced-small pool. The benchmark
+surfaced the OOM — a good example of the baseline doing its job.
+- When the xserv engine thread dies, the API now returns a clean 503 (the
+request handler uses a poison-tolerant lock instead of cascading
+mutex-poison panics). The driver records any failure as a per-request error,
+so a broken engine shows up as `errs=N` / `accuracy 0%` rather than a hung run.
 
 ## Future extensions
 
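Reviewer note on the pool-size figure in the hunk above: the ≈9GB at 8192 can be reproduced with a short calculation. This is a sketch under assumptions the commit does not state. The Qwen3-8B dimensions (36 layers, 8 KV heads, head dim 128) come from the published model config, `BLOCK_SIZE = 16` is hypothetical, and reading the trailing `· 2` as an overprovisioning factor (with K and V both counted in the per-token bytes) is only one interpretation that happens to match the quoted figure. `vram_sized_blocks` and `headroom_bytes` are illustrative names, not xserv's API.

```python
# Assumed Qwen3-8B dimensions (from the published model config, not from this commit).
NUM_LAYERS = 36
NUM_KV_HEADS = 8
HEAD_DIM = 128
BYTES_PER_ELEM = 2   # BF16
BLOCK_SIZE = 16      # hypothetical tokens per KV block
GIB = 1024 ** 3


def kv_bytes_per_token():
    # K and V each hold num_layers * num_kv_heads * head_dim elements per token.
    return 2 * NUM_LAYERS * NUM_KV_HEADS * HEAD_DIM * BYTES_PER_ELEM


def eager_pool_bytes(max_seq_len, max_batch):
    # Old behaviour as described: total_blocks = blocks_per_seq * max_batch * 2,
    # allocated up front regardless of how much VRAM is actually free.
    blocks_per_seq = -(-max_seq_len // BLOCK_SIZE)  # ceiling division
    total_blocks = blocks_per_seq * max_batch * 2
    return total_blocks * BLOCK_SIZE * kv_bytes_per_token()


def vram_sized_blocks(free_bytes, headroom_bytes):
    # New behaviour in spirit: fit the pool to whatever is free (minus some
    # headroom) instead of to worst-case demand. Never returns a negative count.
    bytes_per_block = BLOCK_SIZE * kv_bytes_per_token()
    return max(0, (free_bytes - headroom_bytes) // bytes_per_block)


if __name__ == "__main__":
    for ctx in (4096, 8192):
        print(ctx, eager_pool_bytes(ctx, 4) / GIB, "GiB")
```

Under these assumptions the eager pool doubles when the context doubles, which is consistent with 4096 fitting and 8192 not fitting next to the 16GB of weights; a pool sized from free memory cannot overshoot in the same way.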
@@ -30,32 +30,37 @@ GPU, and a resident idle engine would distort the other's numbers).
 Generation mode is matched: xserv hardcodes Qwen3 **thinking off**, so the
 driver sends `chat_template_kwargs={enable_thinking:false}` to llama.cpp.
 
-## Results (RTX 5090, BF16, greedy, 4096 ctx, max_batch 4)
+## Results (RTX 5090, BF16, greedy, 8192 ctx, max_batch 4)
 
 ### Performance — llama.cpp is the stronger baseline
 
 | scenario | metric | xserv | llama.cpp | xserv ÷ llama.cpp |
 |---|---|---|---|---|
-| single / medium | TTFT p50 (ms) | 26.8 | 18.0 | 0.67× |
-| single / medium | TPOT p50 (ms/tok) | 17.1 | 10.4 | 0.61× |
-| single / medium | throughput (tok/s) | 58.1 | 94.9 | 0.61× |
-| concurrent-4 | throughput (tok/s) | 143.4 | 317.7 | 0.45× |
-| concurrent-8 | throughput (tok/s) | 142.9 | 321.7 | 0.44× |
+| single / medium | TTFT p50 (ms) | 28.0 | 17.7 | 0.63× |
+| single / medium | TPOT p50 (ms/tok) | 17.5 | 10.4 | 0.60× |
+| single / medium | throughput (tok/s) | 56.6 | 95.1 | 0.60× |
+| concurrent-4 | throughput (tok/s) | 135.2 | 317.1 | 0.43× |
+| concurrent-8 | throughput (tok/s) | 135.5 | 322.5 | 0.42× |
 
-xserv runs at **~0.45–0.61×** llama.cpp. It saturates at `max_batch` (143 tok/s)
-while llama.cpp keeps scaling under load (322 tok/s). This is the honest new bar.
+xserv runs at **~0.42–0.60×** llama.cpp. It saturates at `max_batch` (~135 tok/s)
+while llama.cpp keeps scaling under load (~322 tok/s). This is the honest new bar.
+The ratio is the same at 4096 and 8192 — TPOT is bandwidth-bound, not
+context-bound at these sizes.
 
 ### Quality — parity, confirming xserv's numerical fidelity
 
 | task | n | xserv | llama.cpp |
 |---|---|---|---|
-| GSM8K | 50 | 94.0% (47/50) | 96.0% (48/50) |
-| AIME 2025 | 30 | 23.3% (7/30) | 20.0% (6/30) |
+| GSM8K | 50 | 98.0% (49/50) | 96.0% (48/50) |
+| AIME 2025 | 30 | 20.0% (6/30) | 20.0% (6/30) |
 
-With equal context, the two engines score within one problem of each other on
-both tasks. Response prefixes are byte-identical (same prompt templating), so
-the small residual difference is greedy-decode divergence on long sequences —
-not an engine quality gap.
+With equal context the two engines land at identical AIME accuracy and
+within one problem on GSM8K. At 8192 both generate full-length solutions
+(mean ~3.4k / ~4.2k tokens), so neither is truncated. Two independent engines
+agreeing at ~20% confirms that's genuine Qwen3-8B (thinking-off) capability and
+that xserv is numerically faithful. Response prefixes are byte-identical (same
+prompt templating); the only run-to-run wobble is greedy-decode divergence /
+nondeterminism on long (~3k-token) sequences (see finding 3).
 
 ## Findings the benchmark surfaced
 
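Reviewer note on the `xserv ÷ llama.cpp` column in the hunk above: the column appears to be a relative-speed ratio rather than a literal quotient, since the latency rows (TTFT, TPOT) only match if the division is inverted. The sketch below recomputes it from the rounded 8192 table values under that assumption; `speed_ratio` and the `lower_is_better` flag are illustrative names, not part of `tools.bench`.

```python
# (scenario, metric, xserv, llama_cpp, reported_ratio, lower_is_better)
# Values copied from the 8192-context table in this commit.
ROWS = [
    ("single / medium", "TTFT p50 (ms)", 28.0, 17.7, 0.63, True),
    ("single / medium", "TPOT p50 (ms/tok)", 17.5, 10.4, 0.60, True),
    ("single / medium", "throughput (tok/s)", 56.6, 95.1, 0.60, False),
    ("concurrent-4", "throughput (tok/s)", 135.2, 317.1, 0.43, False),
    ("concurrent-8", "throughput (tok/s)", 135.5, 322.5, 0.42, False),
]


def speed_ratio(xserv, llama_cpp, lower_is_better):
    """Relative speed of xserv vs llama.cpp; below 1.0 means xserv is slower."""
    if lower_is_better:
        # Latency: a smaller number is faster, so invert the quotient.
        return llama_cpp / xserv
    return xserv / llama_cpp


if __name__ == "__main__":
    for scenario, metric, x, l, reported, lower in ROWS:
        ratio = speed_ratio(x, l, lower)
        print(f"{scenario:16s} {metric:20s} {ratio:.3f} (reported {reported:.2f})")
```

Recomputed this way, every row lands within 0.01 of the reported column. The TPOT row comes out slightly below the reported 0.60, presumably because the table was generated from unrounded timings.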
@@ -66,13 +71,18 @@ not an engine quality gap.
 (capped at ~940 generated tokens). GSM8K (~280 tokens) was unaffected, which
 is how we caught it. Fixed: per-slot context = `max_seq_len` (total
 `-c = max_seq_len × parallel`). After the fix, AIME is at parity (above).
-2. **xserv OOMs at `--max-seq-len 8192` + `--max-batch 4`.** xserv eagerly
-pre-allocates its paged-KV pool (~9GB at 8192) on top of the 16GB weights,
-exceeding 32GB at startup; llama.cpp allocates KV lazily and fits 8192. The
-comparison above runs at 4096 (xserv peaks ~28GB). Tracked as a follow-up.
+2. **xserv OOM'd at `--max-seq-len 8192` — now fixed.** xserv used to eagerly
+pre-allocate its paged-KV pool (`blocks_per_seq × max_batch × 2`, ~9GB at
+8192) on top of the 16GB weights, exceeding 32GB at startup. Fixed by sizing
+the pool to *available VRAM* (`cudaMemGetInfo`) instead of worst-case demand,
+plus vLLM-style **swap to pinned host memory**: when running sequences grow
+past the GPU pool, the newest are evicted to host and swapped back when blocks
+free up (`--swap-space-gb`, default 8). The results above run at 8192 with **0
+swap events** — the VRAM-sized pool alone covers this load; swap is the
+overload safety net (verified lossless under a forced-small pool).
 3. **xserv decode is not run-to-run deterministic.** The same greedy (temp 0)
-AIME config produced 6/30 then 7/30 across runs — non-deterministic CUDA
-reductions flip an argmax over long (~2400-token) generations. Harmless for
+AIME config produced 6/30 / 7/30 / 6/30 across runs — non-deterministic CUDA
+reductions flip an argmax over long (~3k-token) generations. Harmless for
 serving, but it explains why long-sequence accuracy wobbles by a problem.
 
 Raw artifacts (per-request timings, per-problem prediction/gold) are written to
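Reviewer note on finding 3 in the hunk above: the mechanism it names (reduction order changing which token wins the argmax) can be shown in a few lines. This is a toy illustration in plain Python doubles, not CUDA or BF16, and it makes no claim about which xserv kernel is responsible; `argmax`, `terms`, and `rival` are made up for the example.

```python
def argmax(values):
    # Index of the largest value; ties resolve to the lowest index.
    return max(range(len(values)), key=lambda i: values[i])


# The same three partial sums, reduced in two different orders.
terms = [0.1, 0.2, 0.3]
left_to_right = (terms[0] + terms[1]) + terms[2]
right_to_left = terms[0] + (terms[1] + terms[2])

# A competing logit that sits exactly on one of the two results.
rival = 0.6

pick_a = argmax([rival, left_to_right])
pick_b = argmax([rival, right_to_left])

if __name__ == "__main__":
    print("sums equal:", left_to_right == right_to_left)
    print("greedy pick with order A:", pick_a)
    print("greedy pick with order B:", pick_b)
```

Floating-point addition is not associative, so the two orders disagree in the last bit, and that is enough to change the greedy pick when two candidates are nearly tied. Over thousands of generated tokens, one such flip sends the rest of the sequence down a different path, which fits the one-problem wobble reported on AIME.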