xserv

Gahow Wang 80157e614a docs: update llama.cpp comparison with 8192 results (OOM fixed)

Re-ran the full comparison at --max-seq-len 8192 now that xserv handles it:
- OOM finding resolved — pool sized to available VRAM + vLLM-style host swap;
  8192 runs with 0 swap events (swap is the overload safety net).
- Quality at parity with equal context: AIME 20.0% vs 20.0%, GSM8K 98% vs 96%.
- Speed unchanged relative to llama.cpp (~0.42-0.60x); TPOT is bandwidth-bound.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

2026-05-28 21:32:14 +08:00

crates

server: VRAM-sized KV pool + vLLM-style swap scheduler

2026-05-28 19:59:06 +08:00

csrc

kernels/cuda: paged-attention kernel, dispatch, pinned host memory

2026-05-28 19:58:36 +08:00

docs

docs: update llama.cpp comparison with 8192 results (OOM fixed)

2026-05-28 21:32:14 +08:00

third_party

tools: add llama.cpp comparison baseline + standard benchmark suite