SGLang-style "write-all, copy-move on acceptance" approach: after tree verification, physically copy an accepted sibling's K/V from its physical cache slot to the canonical sequential position.

New CUDA kernel: copy_kv_position_kernel in reshape_and_cache.cu. For one token (src_pos → dst_pos), it copies head_dim × num_kv_heads BF16 elements in both the K and V pools. Grid = num_kv_heads, block = head_dim. Cost for one token across 36 layers: ~5.3 MB D2D copy @ 900 GB/s = <6 μs.

Rust FFI: copy_kv_position(k_pool, v_pool, block_ids, src_pos, dst_pos, num_kv_heads, head_dim, block_size, stream).

PagedKVCache method: copy_kv_position(slot, src_pos, dst_pos): uploads block_ids for the sequence, then calls the kernel once per layer.

This is the primitive needed by tree drafting: when a non-primary sibling at cache position P+2 is accepted as the "true" token for target position P+1, call copy_kv_position(slot, P+2, P+1), then truncate to P+2.

Next: wire into bench-eagle3 tree drafting loop with top-2 siblings.
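The copy-then-truncate step described above can be illustrated with a small CPU reference model. This is only a sketch and not the CUDA kernel or the FFI wrapper from this commit: the `Layout` struct, the `copy_kv_position_ref` and `accept_sibling` helpers, and the pool layout `[num_blocks][block_size][num_kv_heads][head_dim]` are all assumptions made for illustration, and BF16 values are stood in for by raw `u16` bits.

```rust
/// Hypothetical description of one layer's paged K/V pool.
/// Assumed layout: [num_blocks][block_size][num_kv_heads][head_dim].
pub struct Layout {
    pub num_kv_heads: usize,
    pub head_dim: usize,
    pub block_size: usize,
}

impl Layout {
    /// Number of elements one token occupies in a single pool (K or V).
    pub fn token_elems(&self) -> usize {
        self.num_kv_heads * self.head_dim
    }

    /// Element offset of logical position `pos`, resolved through the
    /// sequence's page table `block_ids`.
    pub fn offset(&self, block_ids: &[u32], pos: usize) -> usize {
        let block = block_ids[pos / self.block_size] as usize;
        let slot_in_block = pos % self.block_size;
        (block * self.block_size + slot_in_block) * self.token_elems()
    }
}

/// CPU stand-in for the per-layer copy: moves one token's K and V
/// from `src_pos` to `dst_pos` within the same sequence.
pub fn copy_kv_position_ref(
    k_pool: &mut [u16],
    v_pool: &mut [u16],
    block_ids: &[u32],
    src_pos: usize,
    dst_pos: usize,
    layout: &Layout,
) {
    let n = layout.token_elems();
    let src = layout.offset(block_ids, src_pos);
    let dst = layout.offset(block_ids, dst_pos);
    k_pool.copy_within(src..src + n, dst);
    v_pool.copy_within(src..src + n, dst);
}

/// A non-primary sibling stored at cache position P+2 was accepted as the
/// token for target position P+1: copy it into place and return the
/// truncated sequence length (P+2).
pub fn accept_sibling(
    k_pool: &mut [u16],
    v_pool: &mut [u16],
    block_ids: &[u32],
    p: usize,
    layout: &Layout,
) -> usize {
    copy_kv_position_ref(k_pool, v_pool, block_ids, p + 2, p + 1, layout);
    p + 2
}
```

In the real cache the same copy would be issued once per layer on the GPU. The model above only shows the index arithmetic through the page table and why truncating to P+2 afterwards discards the now-stale sibling slot.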
37 lines
1.2 KiB
Rust
pub mod activation;
pub mod argmax;
pub mod attention;
pub mod dispatch;
pub mod embedding;
pub mod gemm;
pub mod layernorm;
pub mod moe;
pub mod quantization;
pub mod rmsnorm;
pub mod rope;
pub mod softmax;
pub mod transpose;

pub use activation::{add, bias_add_2d, gelu, gpt_oss_glu, mul, scale, silu, silu_mul};
pub use argmax::{argmax_bf16_single, argmax_bf16_to_host};
pub use attention::{
    attention, copy_kv_position, decode_attention, flash_attention, flash_attention_sinks,
    paged_decode_attention, paged_decode_attention_sinks, paged_decode_attention_tree,
    reshape_and_cache_batched_bf16, reshape_and_cache_bf16,
};
pub use embedding::{embedding, embedding_device_ids};
pub use gemm::{GemmBackend, batched_matmul, matmul, matmul_batched_gemv};
pub use layernorm::layernorm;
pub use rmsnorm::{add_rmsnorm, rmsnorm};
pub use rope::{RopeCache, rope_inplace, rope_inplace_device_pos};
pub use softmax::softmax;
pub use transpose::{
    merge_heads_gpu, repeat_kv_gpu, reshape_heads_gpu, strided_to_contiguous_gpu,
    transpose_for_rope_gpu, transpose_from_rope_gpu,
};

/// Register GPU kernels with the tensor crate. Call once at startup.
pub fn init() {
    xserv_tensor::register_gpu_contiguous(strided_to_contiguous_gpu);
}