d52baa0006e3dafa21723ea5e46d7c9352dca498
- paged_kv_cache: new block-paged KV cache; adds a pinned-host swap pool with
a second BlockAllocator, per-sequence Location {Gpu,Cpu}, and lossless
swap_out/swap_in (block-granular D2H/H2D) for vLLM-style preemption.
bytes_per_block helper exposes per-block cost for VRAM-based sizing.
- decode_graph: CUDA-graph decode path.
- qwen3/gpt2/kv_cache: paged prefill/decode forward + related updates.
- tokenizer/bins: BPE updates, new xserv-chat CLI, bench-qwen3 tweaks.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Description
No description provided
Languages
Rust
67.5%
Python
15.1%
Cuda
13.5%
Shell
3.9%