xserv/crates/xserv-kernels/src/lib.rs
Gahow Wang 6da0972740 speculative: copy_kv_position primitive for tree drafting KV remap
SGLang-style "write-all, copy-move on acceptance" approach: after tree
verification, physically copy an accepted sibling's K/V from its
physical cache slot to the canonical sequential position.

New CUDA kernel: copy_kv_position_kernel in reshape_and_cache.cu.
For one token (src_pos → dst_pos), copies head_dim × num_kv_heads BF16
elements in both K and V pools. Grid = num_kv_heads, block = head_dim.
Cost for one token across 36 layers: ~5.3 MB D2D copy @ 900 GB/s = <6μs.

Rust FFI: copy_kv_position(k_pool, v_pool, block_ids, src_pos, dst_pos,
num_kv_heads, head_dim, block_size, stream).

PagedKVCache method: copy_kv_position(slot, src_pos, dst_pos) — uploads
block_ids for the sequence, calls the kernel per layer. This is the
primitive needed by tree drafting: when a non-primary sibling at cache
position P+2 is accepted as the "true" token for target position P+1,
call copy_kv_position(slot, P+2, P+1) then truncate to P+2.
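
As a rough illustration of the remap described above, here is a minimal CPU reference sketch in Rust. It is not the CUDA kernel or the actual FFI. It assumes a pool layout of [num_blocks, block_size, num_kv_heads, head_dim], stores BF16 values as raw u16 bits, and uses the hypothetical names copy_kv_position_ref and slot_offset, none of which come from the repository.

```rust
/// Element offset of a logical position's first element, assuming the pool is
/// laid out as [num_blocks, block_size, num_kv_heads, head_dim] (an assumption,
/// not confirmed by the source).
fn slot_offset(block_ids: &[usize], pos: usize, block_size: usize, token_elems: usize) -> usize {
    // Logical position -> physical block via the sequence's block table.
    let block = block_ids[pos / block_size];
    (block * block_size + pos % block_size) * token_elems
}

/// Hypothetical CPU reference: copy one token's K and V rows
/// (num_kv_heads * head_dim elements each) from src_pos to dst_pos.
/// BF16 values are carried as raw u16 bit patterns.
fn copy_kv_position_ref(
    k_pool: &mut [u16],
    v_pool: &mut [u16],
    block_ids: &[usize],
    src_pos: usize,
    dst_pos: usize,
    num_kv_heads: usize,
    head_dim: usize,
    block_size: usize,
) {
    let token_elems = num_kv_heads * head_dim;
    let src = slot_offset(block_ids, src_pos, block_size, token_elems);
    let dst = slot_offset(block_ids, dst_pos, block_size, token_elems);
    // copy_within is safe even if the two ranges overlap.
    k_pool.copy_within(src..src + token_elems, dst);
    v_pool.copy_within(src..src + token_elems, dst);
}
```

In the tree-drafting case from the message, accepting the sibling stored at P+2 for target position P+1 would correspond to calling this with src_pos = P+2 and dst_pos = P+1 for every layer's pools, then truncating the sequence length to P+2.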

Next: wire into bench-eagle3 tree drafting loop with top-2 siblings.
2026-07-01 23:09:35 +08:00


pub mod activation;
pub mod argmax;
pub mod attention;
pub mod dispatch;
pub mod embedding;
pub mod gemm;
pub mod layernorm;
pub mod moe;
pub mod quantization;
pub mod rmsnorm;
pub mod rope;
pub mod softmax;
pub mod transpose;
pub use activation::{add, bias_add_2d, gelu, gpt_oss_glu, mul, scale, silu, silu_mul};
pub use argmax::{argmax_bf16_single, argmax_bf16_to_host};
pub use attention::{
    attention, copy_kv_position, decode_attention, flash_attention, flash_attention_sinks,
    paged_decode_attention, paged_decode_attention_sinks, paged_decode_attention_tree,
    reshape_and_cache_batched_bf16, reshape_and_cache_bf16,
};
pub use embedding::{embedding, embedding_device_ids};
pub use gemm::{GemmBackend, batched_matmul, matmul, matmul_batched_gemv};
pub use layernorm::layernorm;
pub use rmsnorm::{add_rmsnorm, rmsnorm};
pub use rope::{RopeCache, rope_inplace, rope_inplace_device_pos};
pub use softmax::softmax;
pub use transpose::{
    merge_heads_gpu, repeat_kv_gpu, reshape_heads_gpu, strided_to_contiguous_gpu,
    transpose_for_rope_gpu, transpose_from_rope_gpu,
};
/// Register GPU kernels with the tensor crate. Call once at startup.
pub fn init() {
    xserv_tensor::register_gpu_contiguous(strided_to_contiguous_gpu);
}