xserv

Gahow Wang fc1900a745 server: VRAM-sized KV pool + vLLM-style swap scheduler

Fixes the paged-KV OOM at large --max-seq-len and adds elastic memory:

- Size the GPU block pool to available VRAM (cudaMemGetInfo) instead of
  reserving the worst case, blocks_per_seq * max_batch * 2, which OOM'd at
  --max-seq-len 8192.
- Scheduler tracks waiting/running/swapped sets: block-aware admission,
  swap-in of resumable sequences when blocks free, and preemption of the
  newest running sequence to host when the pool can't cover a decode step.
- --swap-space-gb (default 8) sizes the pinned host swap pool;
  XSERV_MAX_KV_BLOCKS forces a small pool to exercise swapping.
- api: poison-tolerant lock + clean 503 when the engine thread is gone,
  instead of cascading mutex-poison panics.
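
The pool sizing and the waiting/running/swapped policy described above can be illustrated with a minimal sketch. It is not the xserv implementation: `Seq`, `Scheduler`, `pool_blocks`, the headroom parameter, and the rule that every running sequence needs one new block per decode step are all assumptions made for illustration. The free-byte count would come from cudaMemGetInfo in the real server; here it is passed in as a plain number.

```rust
use std::collections::VecDeque;

/// Hypothetical sequence record; field names are illustrative, not from xserv.
#[derive(Debug, Clone, PartialEq)]
struct Seq {
    id: u64,
    /// KV blocks the sequence holds while running, or needs in order to start/resume.
    blocks: usize,
}

/// Hypothetical scheduler state with the three sets named in the commit message.
struct Scheduler {
    free_blocks: usize,
    waiting: VecDeque<Seq>,
    running: Vec<Seq>, // ordered oldest -> newest
    swapped: VecDeque<Seq>,
}

/// Number of KV blocks that fit in free VRAM after leaving some headroom.
/// The headroom policy is an assumption; returns 0 for a zero block size.
fn pool_blocks(free_bytes: u64, block_bytes: u64, headroom_bytes: u64) -> usize {
    if block_bytes == 0 {
        return 0;
    }
    (free_bytes.saturating_sub(headroom_bytes) / block_bytes) as usize
}

impl Scheduler {
    /// One scheduling step. Assumes (as a simplification) that each running
    /// sequence needs exactly one new block for the next decode step.
    /// Returns how many sequences were preempted to host.
    fn step(&mut self) -> usize {
        // 1. Preempt the newest running sequences until the decode step fits.
        let mut preempted = 0;
        while self.free_blocks < self.running.len() {
            let victim = self.running.pop().expect("running is non-empty here");
            self.free_blocks += victim.blocks; // its blocks move to the host pool
            self.swapped.push_back(victim);
            preempted += 1;
        }

        // 2. Only when nothing was preempted: swap in first, then admit new work.
        //    A candidate needs its own blocks plus one block for its decode step.
        if preempted == 0 {
            while let Some(need) = self.swapped.front().map(|s| s.blocks) {
                if self.free_blocks < self.running.len() + need + 1 {
                    break;
                }
                let seq = self.swapped.pop_front().unwrap();
                self.free_blocks -= seq.blocks;
                self.running.push(seq);
            }
            // Swapped sequences take priority over waiting ones.
            while self.swapped.is_empty() {
                let Some(need) = self.waiting.front().map(|s| s.blocks) else { break };
                if self.free_blocks < self.running.len() + need + 1 {
                    break;
                }
                let seq = self.waiting.pop_front().unwrap();
                self.free_blocks -= seq.blocks;
                self.running.push(seq);
            }
        }

        // 3. Hand one new block to every running sequence for this decode step.
        self.free_blocks -= self.running.len();
        for s in &mut self.running {
            s.blocks += 1;
        }
        preempted
    }
}
```

Preempting the newest sequence first keeps the oldest requests making progress, and skipping admission in a step that preempted avoids immediately swapping a victim back in.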

Verified on RTX 5090: serves at --max-seq-len 8192 (previously OOM), and a
forced 40-block pool drives 48 lossless swap-out/swap-in cycles under
concurrency with coherent output.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

2026-05-28 19:59:06 +08:00

crates

server: VRAM-sized KV pool + vLLM-style swap scheduler

2026-05-28 19:59:06 +08:00

csrc

kernels/cuda: paged-attention kernel, dispatch, pinned host memory

2026-05-28 19:58:36 +08:00

docs

docs: llama.cpp vs xserv benchmark results + summary

2026-05-28 15:06:21 +08:00

third_party

tools: add llama.cpp comparison baseline + standard benchmark suite