xserv

Files

Gahow Wang ce7229f4fe speculative: Qwen3 draft-model v0 with paged verify parity

Phase 22 lands a correctness-only speculative decoding loop for Qwen3
target + Qwen3 small draft (batch=1, greedy, gamma=4). Phase 23 turns
verify logits into the authoritative acceptance signal so mirror-decode
per accepted token is no longer needed.

- paged_kv_cache: truncate_sequence(slot, new_len) shrinks a registered
  sequence, freeing whole physical blocks no longer reachable and
  leaving the slot registered. Covered by a CUDA-gated unit test.
- qwen3: forward_verify_paged_decode_attention writes the draft window
  into the target cache, runs the same paged decode attention kernel per
  draft token, and uses matmul_rows_gemv so linear layers follow the
  single-token decode BF16 rounding path.
- bench-speculative: new bench binary drives the state machine with
  --gamma / --gen-tokens / --prompts / --use-verify-logits /
  --verify-path flash|paged-decode / --dump-verify-mismatches, and
  compares baseline vs spec token sequences plus TPOT / tok/s / speedup.
- docs/22 records the decode-authoritative v0 result and dash5 numbers
  (matched=true, speedup_e2e ~0.29x, verify_decode_mismatches>0 under
  --use-verify-logits).
- docs/23 records the paged-decode verify path (matched=true,
  verify_decode_mismatches=0, 50x64 speedup_e2e ~0.44x) and the
  next-step performance TODO.

2026-07-01 14:16:30 +08:00

xserv-cuda

style: format Rust workspace

2026-06-18 18:11:58 +08:00

xserv-distributed

style: format Rust workspace

2026-06-18 18:11:58 +08:00

xserv-kernels

cuda: deterministic BF16 gemv + paged attention reductions

2026-07-01 14:16:28 +08:00

xserv-model

speculative: Qwen3 draft-model v0 with paged verify parity