xserv

Files

Gahow Wang 6da0972740 speculative: copy_kv_position primitive for tree drafting KV remap

SGLang-style "write-all, copy-move on acceptance" approach: after tree
verification, physically copy an accepted sibling's K/V from its
physical cache slot to the canonical sequential position.

New CUDA kernel: copy_kv_position_kernel in reshape_and_cache.cu.
For one token (src_pos → dst_pos), copies head_dim × num_kv_heads BF16
elements in both K and V pools. Grid = num_kv_heads, block = head_dim.
Cost for one token across 36 layers: ~5.3 MB D2D copy @ 900 GB/s = <6μs.

Rust FFI: copy_kv_position(k_pool, v_pool, block_ids, src_pos, dst_pos,
num_kv_heads, head_dim, block_size, stream).

PagedKVCache method: copy_kv_position(slot, src_pos, dst_pos) — uploads
block_ids for the sequence, calls the kernel per layer. This is the
primitive needed by tree drafting: when a non-primary sibling at cache
position P+2 is accepted as the "true" token for target position P+1,
call copy_kv_position(slot, P+2, P+1) then truncate to P+2.

Next: wire into bench-eagle3 tree drafting loop with top-2 siblings.

2026-07-01 23:09:35 +08:00

activation

gpt-oss: drop debug syncs from forward; GPU broadcast bias-add

2026-06-12 17:02:59 +08:00

attention

speculative: copy_kv_position primitive for tree drafting KV remap

2026-07-01 23:09:35 +08:00

embedding

kernels/cuda: paged-attention kernel, dispatch, pinned host memory

2026-05-28 19:58:36 +08:00

gemm

speculative: batched-GEMV kernel for verify path (Phase 24 step 1)