xserv

Gahow Wang d52baa0006 model: paged KV cache with CPU swap pool, decode graph, qwen3 updates

- paged_kv_cache: new block-paged KV cache; adds a pinned-host swap pool with
  a second BlockAllocator, per-sequence Location {Gpu,Cpu}, and lossless
  swap_out/swap_in (block-granular D2H/H2D) for vLLM-style preemption.
  bytes_per_block helper exposes per-block cost for VRAM-based sizing.
- decode_graph: CUDA-graph decode path.
- qwen3/gpt2/kv_cache: paged prefill/decode forward + related updates.
- tokenizer/bins: BPE updates, new xserv-chat CLI, bench-qwen3 tweaks.
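
The swap bookkeeping described in the first bullet can be sketched roughly as below. This is a minimal illustration, not the code in this commit: only the names `BlockAllocator`, `Location`, `swap_out`, `swap_in`, and `bytes_per_block` come from the message above, while every signature, field, and the `Sequence`/`PagedKvCache` layout are assumptions. The actual D2H/H2D copies are left as comments, and the `bytes_per_block` formula assumes a conventional layout with one K and one V tensor per layer.

```rust
/// Free-list allocator over a fixed pool of block ids (hypothetical layout).
#[derive(Debug)]
struct BlockAllocator {
    free: Vec<u32>,
}

impl BlockAllocator {
    fn new(num_blocks: u32) -> Self {
        // Reversed so that low ids sit at the tail and are handed out first.
        Self { free: (0..num_blocks).rev().collect() }
    }

    /// Takes `n` blocks, or returns None (leaving the pool untouched) if too few are free.
    fn alloc(&mut self, n: usize) -> Option<Vec<u32>> {
        if self.free.len() < n {
            return None;
        }
        let at = self.free.len() - n;
        Some(self.free.split_off(at))
    }

    fn release(&mut self, blocks: Vec<u32>) {
        self.free.extend(blocks);
    }

    fn available(&self) -> usize {
        self.free.len()
    }
}

/// Which pool currently holds a sequence's KV blocks.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum Location {
    Gpu,
    Cpu,
}

/// Assumed per-sequence state: a location tag plus a block table.
struct Sequence {
    location: Location,
    blocks: Vec<u32>,
}

/// Assumed cache shape: one allocator per pool (device and pinned host).
struct PagedKvCache {
    gpu: BlockAllocator,
    cpu: BlockAllocator,
}

impl PagedKvCache {
    /// Moves a sequence's blocks to the host pool; returns false if it cannot.
    fn swap_out(&mut self, seq: &mut Sequence) -> bool {
        if seq.location != Location::Gpu {
            return false;
        }
        let Some(host) = self.cpu.alloc(seq.blocks.len()) else {
            return false;
        };
        // A real implementation would copy each block device-to-host here.
        let device = std::mem::replace(&mut seq.blocks, host);
        self.gpu.release(device);
        seq.location = Location::Cpu;
        true
    }

    /// Moves a swapped-out sequence back to the device pool.
    fn swap_in(&mut self, seq: &mut Sequence) -> bool {
        if seq.location != Location::Cpu {
            return false;
        }
        let Some(device) = self.gpu.alloc(seq.blocks.len()) else {
            return false;
        };
        // A real implementation would copy each block host-to-device here.
        let host = std::mem::replace(&mut seq.blocks, device);
        self.cpu.release(host);
        seq.location = Location::Gpu;
        true
    }
}

/// Bytes for one block across all layers, assuming K and V are both stored.
fn bytes_per_block(
    num_layers: usize,
    num_kv_heads: usize,
    head_dim: usize,
    block_size: usize,
    dtype_bytes: usize,
) -> usize {
    2 * num_layers * num_kv_heads * head_dim * block_size * dtype_bytes
}
```

Under these assumptions, a VRAM budget divided by `bytes_per_block(...)` gives the number of device blocks to hand to the GPU allocator. Both swap directions return `false` without changing any state when the target pool is too small, which leaves the preemption decision to the caller.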

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

2026-05-28 19:58:54 +08:00

crates

model: paged KV cache with CPU swap pool, decode graph, qwen3 updates

2026-05-28 19:58:54 +08:00

csrc

kernels/cuda: paged-attention kernel, dispatch, pinned host memory

2026-05-28 19:58:36 +08:00

docs

docs: llama.cpp vs xserv benchmark results + summary

2026-05-28 15:06:21 +08:00

third_party

tools: add llama.cpp comparison baseline + standard benchmark suite