fc1900a74548116e6f8790b7460c31d3f31630d7
Fixes the paged-KV OOM at large --max-seq-len and adds elastic memory: - Size the GPU block pool to available VRAM (cudaMemGetInfo) instead of the worst-case blocks_per_seq * max_batch * 2 reservation, which OOM'd at 8192. - Scheduler tracks waiting/running/swapped sets: block-aware admission, swap-in of resumable sequences when blocks free, and preemption of the newest running sequence to host when the pool can't cover a decode step. - --swap-space-gb (default 8) sizes the pinned host swap pool; XSERV_MAX_KV_BLOCKS forces a small pool to exercise swapping. - api: poison-tolerant lock + clean 503 when the engine thread is gone, instead of cascading mutex-poison panics. Verified on RTX 5090: serves at --max-seq-len 8192 (previously OOM), and a forced 40-block pool drives 48 lossless swap-out/swap-in cycles under concurrency with coherent output. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Description
No description provided
Languages
Rust
67.5%
Python
15.1%
Cuda
13.5%
Shell
3.9%