xserv

Files

Gahow Wang 9c98c169ff kernels: flash attention with gpt-oss sinks + sliding window

Add flash_attention_sinks_bf16 prefill kernel that folds the per-head
attention sink into the softmax denominator (exactly as the decode sink
kernel does) and supports an optional sliding-window mask matching HF gpt-oss.

Wire it through xserv-kernels (flash_attention_sinks) and use it in
GptOss prefill, replacing the post-hoc sink approximation for an exact
match against the reference math.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
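
For readers unfamiliar with the sink term, the math the commit message describes can be sketched as an unfused, single-head f32 reference. This is not the bf16 flash kernel from the commit: the function name `attention_with_sink`, its signature, and the row-major layout are hypothetical, and the window convention (each query sees at most its last `w` keys, itself included) is an assumption about how the HF gpt-oss mask behaves. The point is only that the sink logit enters the max and the denominator but contributes no value vector.

```rust
/// Reference (unfused) causal attention for ONE head, with a sink logit
/// folded into the softmax denominator and an optional sliding window.
/// Hypothetical helper for illustration; `q`, `k`, `v` are row-major
/// [seq_len, head_dim] and the result has the same shape.
fn attention_with_sink(
    q: &[f32],
    k: &[f32],
    v: &[f32],
    seq_len: usize,
    head_dim: usize,
    sink: f32,
    window: Option<usize>,
) -> Vec<f32> {
    let scale = 1.0 / (head_dim as f32).sqrt();
    let mut out = vec![0.0f32; seq_len * head_dim];

    for i in 0..seq_len {
        // Causal mask: keys 0..=i. With a window of size w (assumed
        // convention), only the last w of those keys remain visible.
        let start = match window {
            Some(w) => (i + 1).saturating_sub(w),
            None => 0,
        };

        // Scaled dot-product scores for the visible keys.
        let scores: Vec<f32> = (start..=i)
            .map(|j| {
                let dot: f32 = (0..head_dim)
                    .map(|d| q[i * head_dim + d] * k[j * head_dim + d])
                    .sum();
                dot * scale
            })
            .collect();

        // The sink takes part in the max (for numerical stability) and in
        // the denominator, exactly like one extra key...
        let m = scores.iter().copied().fold(sink, f32::max);
        let denom: f32 =
            scores.iter().map(|s| (s - m).exp()).sum::<f32>() + (sink - m).exp();

        // ...but it has no value row, so its probability mass is dropped.
        for (idx, j) in (start..=i).enumerate() {
            let p = (scores[idx] - m).exp() / denom;
            for d in 0..head_dim {
                out[i * head_dim + d] += p * v[j * head_dim + d];
            }
        }
    }
    out
}
```

Under these assumptions, a single token with a zero score and a zero sink receives weight 0.5 rather than 1.0, because the sink absorbs half of the softmax mass. Passing `f32::NEG_INFINITY` as the sink recovers ordinary causal softmax attention.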

2026-05-31 00:56:10 +08:00

src

kernels: flash attention with gpt-oss sinks + sliding window

2026-05-31 00:56:10 +08:00

tests

kernels/cuda: paged-attention kernel, dispatch, pinned host memory

2026-05-28 19:58:36 +08:00

build.rs

kernels: reshape_and_cache, GPU argmax, single-launch GEMV

2026-05-30 12:50:17 +08:00

Cargo.toml

phase 3: GEMM kernels (naive, tiled, cuBLAS)

2026-05-21 19:48:05 +08:00