SGLang-style "write-all, copy-move on acceptance" approach: after tree
verification, physically copy an accepted sibling's K/V from its
physical cache slot to the canonical sequential position.
New CUDA kernel: copy_kv_position_kernel in reshape_and_cache.cu.
For one token (src_pos → dst_pos), copies head_dim × num_kv_heads BF16
elements in both K and V pools. Grid = num_kv_heads, block = head_dim.
Cost for one token across 36 layers: ~5.3 MB D2D copy @ 900 GB/s = <6μs.
Rust FFI: copy_kv_position(k_pool, v_pool, block_ids, src_pos, dst_pos,
num_kv_heads, head_dim, block_size, stream).
PagedKVCache method: copy_kv_position(slot, src_pos, dst_pos) — uploads
block_ids for the sequence, calls the kernel per layer. This is the
primitive needed by tree drafting: when a non-primary sibling at cache
position P+2 is accepted as the "true" token for target position P+1,
call copy_kv_position(slot, P+2, P+1) then truncate to P+2.
Next: wire into bench-eagle3 tree drafting loop with top-2 siblings.
New CUDA kernel paged_decode_attention_tree_bf16_kernel: same as base
paged_decode_attention but with a per-query mask over the newly-written
K/V region. `tree_mask[i][j] != 0` iff query i attends to newly-written
K/V at slot j. Positions before `tree_start` are always attended.
Motivation: speculative decoding with tree drafting needs siblings at
the same target position to attend to their own branch's history, not
each other's K/V.
Rust binding: paged_decode_attention_tree(...) mirrors
paged_decode_attention plus tree_mask_ptr, tree_start, tree_len.
Forward path: Qwen3::forward_verify_paged_decode_attention_tree_with_hidden
takes explicit positions, kv_lens, and a flattened [N*N] tree_mask.
Sanity check: bench-eagle3's γ_multi path now routes through the tree
kernel with a causal mask (mask[i][j]=1 iff j<=i), producing bit-
equivalent output to the non-tree variant. matched=false pattern +
acceptance rate + speedup all identical to previous run within noise
(11.3% acceptance, 1.00× speedup with the mask-check overhead).
--tree CLI flag is parsed but reserved. Real tree drafting (siblings
sharing a target position) is blocked by KV cache position rigidity:
paged_cache stores K/V at cache-position ≡ target-position, so an
accepted sibling at target position P+1 has its K/V physically at
cache position P+2 (its unique slot in the batched write). Continuing
decode at P+1 would see the WRONG K/V (top-1 sibling's, not accepted
top-2 sibling's). Fix requires either KV-slot remap on acceptance or
a virtual position layer.
Infrastructure is in place, next step is tackling that remap.
Add launch_gemv_bf16_batched: runs M m=1 GEMVs in a single 3D grid
launch (z = batch row) with numerically identical output to M sequential
launch_gemv_bf16 calls — same K-block partial accumulation, same
fixed-order reduction. Verified on dash5 with 10 prompts × 32 tokens:
matched=true, verify_decode_mismatches=0.
Expose as matmul_batched_gemv(a: [M,K], b: [K,N]) → [M,N] in
xserv-kernels. Replace the old matmul_rows_gemv helper in qwen3
forward_verify_paged_decode_attention; the per-row loop over matmul_2d +
concat_rows is replaced by a single matmul_batched_gemv call that
allocates the partials buffer in one shot and launches 2 kernels instead
of 2*M.
Current speedup_e2e is 0.47× (same ballpark as Phase 23 0.44×);
the batched launch saves ~3 ms overhead but this is small relative to
the total 28 ms spec cost. The path forward (per docs/24 §4) is
higher acceptance rate or cheaper draft, not further kernel optimization.
- Thread-local launch stream (xserv_cuda::stream): every kernel
wrapper, cublasSetStream, and NCCL collective now launches on
current_stream_raw() — the legacy null stream by default (behavior
unchanged), or the capture stream installed via push_stream during
graph capture. Capture is impossible on the legacy stream.
- Allocator retain mode: blocks freed inside a retain window are
quarantined (RetainedBlocks) instead of pooled, so an instantiated
graph keeps exclusive ownership of every intermediate buffer it
references across replays.
- Capture mode GLOBAL -> THREAD_LOCAL: concurrent TP rank threads
must not poison each other's captures with their own cudaMallocs.
- embedding_device_ids / rope_inplace_device_pos: variants reading
token ids / positions from persistent device buffers, replacing the
per-call host upload that a captured region cannot contain.
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
Decode carried three leftover cudaDeviceSynchronize (prefill one) from
NaN debugging — the Qwen3 path has none and the logits D2H in sample()
already orders against the null stream.
add_bias for rows>1 round-tripped the bias through the CPU (D2H + host
tile loop + H2D) on every call — 96 times per prefill across q/k/v/o.
Replace with a bias_add_2d broadcast kernel.
dash5, FP8 TP=2, warm-server: TTFT 35/49/94 -> 29/42/79 ms
(short/medium/long), TPOT 7.19-7.32 -> 6.99-7.21 ms. Greedy tokens
unchanged; GSM8K-50 94%.
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
Replace the per-token CPU-routed MoE forward with an all-GPU path:
1. moe_topk_softmax: GPU top-k + softmax (was CPU sort + softmax)
2. moe_replicate: broadcast input to all local experts
3. cublasGemmStridedBatchedEx: batched expert matmul (was per-expert cuBLAS)
4. moe_weighted_sum: FP32-accumulated weighted sum on GPU (was GPU→CPU→F32→BF16→GPU)
Expert weights stored as contiguous 3D tensors for strided batched GEMM.
Zero CPU↔GPU transfers per MoE layer (was ~40 per token per layer).
Also: configurable geglu_alpha, LayerNorm bias auto-detect, unused-weight
diagnostic at load time.
GSM8K 30-problem: 11/30 → 23/30 (76.7%) vs llama.cpp 30/30 (100%).
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Add flash_attention_sinks_bf16 prefill kernel that folds the per-head
attention sink into the softmax denominator (exactly as the decode sink
kernel) and supports an optional sliding-window mask matching HF gpt-oss.
Wire it through xserv-kernels (flash_attention_sinks) and use it in
GptOss prefill, replacing the post-hoc sink approximation for an exact
match against the reference math.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Three new CUDA kernels and one rewrite:
- reshape_and_cache: scatter K/V into paged pool in a single kernel per
layer, replacing the Rust-side per-token per-head cudaMemcpy loop.
Includes both single-sequence (prefill) and batched (decode) variants.
- argmax: GPU-side BF16 argmax with warp-shuffle reduction. Greedy
decode now only D2H-transfers B×4 bytes (token ids) instead of the
full [B, vocab] logits tensor.
- GEMV rewrite: fused zero-init inside the K-split kernel eliminates
the cudaMemsetAsync call, reducing launches from 3 to 2 per GEMV.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
CUDA layer for the paged-KV + swap work:
- csrc: new paged_attention.cu plus updates across attention/gemm/norm/
activation/embedding/reduce kernels and common.cuh.
- xserv-kernels: new dispatch module and kernel-binding updates.
- xserv-cuda: cudaMallocHost/FreeHost bindings + PinnedBuffer (host swap
pool backing) and offset-aware D2H/H2D copies used to move KV blocks
between the GPU pool and pinned host memory.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Three performance optimizations targeting decode throughput:
1. Decode Attention Kernel (csrc/attention/flash_attention.cu):
- Specialized kernel for Q_len=1 (decode step)
- 256 threads parallelize across KV sequence dimension
- Online softmax with block-level warp-shuffle reduction
- Replaces FA2 kernel which wasted 63/64 threads for decode
- flash_attention() auto-dispatches when q_len==1
2. Fused SiLU×Mul (csrc/activation/activations.cu):
- Single kernel: out = silu(gate) * up
- Saves 1 HBM read + 1 HBM write per FFN layer (N elements)
- Eliminates intermediate tensor allocation
3. Fused Add+RMSNorm (csrc/normalization/rmsnorm.cu):
- Single kernel: (normed, sum) = (rmsnorm(x+residual), x+residual)
- Saves 1 full HBM round-trip per attention block
- Eliminates separate add + rmsnorm kernel pair
Performance analysis:
- At current short sequences (max 79 tokens), these optimizations provide
marginal benefit because the bottleneck is cuBLAS GEMV overhead:
252 weight matrix reads × ~32MB each = 15.5 GB per decode step.
Theoretical minimum at 1.79 TB/s = 8.7ms, actual ~78ms (9x gap).
- The fused kernels and decode attention will show larger gains at
longer sequences where attention and element-wise ops dominate.
- Next optimization target: CUDA Graphs to eliminate kernel launch
overhead, or custom GEMV kernels to replace cuBLAS for M=1.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Kernel additions:
- add_f32/bf16, mul_f32/bf16 CUDA kernels (element-wise, on GPU)
- Refactored activation.rs with dispatch_unary/dispatch_binary helpers
- Qwen3 and GPT-2 now use GPU add/mul instead of CPU round-trips
GPT-2 add_bias also moved to GPU (broadcast via tile + GPU add)
BF16 precision analysis (docs/benchmarks/phase10-qwen3.md):
- Root cause: separate attention kernels materialize BF16 intermediates
(QK^T→BF16→scale→BF16→mask→BF16→softmax→BF16 vs HF's fused FP32 path)
- HF itself SDPA vs Eager also differs by ~0.125 logit
- xserv vs HF: ~1-2 logit systematic offset, but same top-1 in 84% cases
- Industry standard for BF16: top-5 overlap (we achieve 100%)
- Fix path: Flash Attention (Phase 14) to fuse attention in FP32
Performance: TTFT 138→119ms, TBT 144→137ms (GPU ops faster than CPU)
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- Naive GEMM kernel: one thread per output element (F32 + BF16)
- Tiled GEMM kernel: 32x32 shared memory tiles (F32 + BF16)
- cuBLAS wrapper: cublasGemmEx with row-major trick
- GemmBackend enum for runtime backend selection
- CublasContext RAII handle
- Made error::check public for cross-crate use
- 17 GEMM tests: small/medium/rect sizes, all backends, F32+BF16
- Cross-backend consistency verified (naive vs tiled vs cuBLAS)
- All 44 tests pass across all crates
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>