xserv

Files

Gahow Wang 5157b2cd30 kernels: fix NaN in flash-attention sinks on fully-masked window tiles

flash_attention_sinks_bf16_kernel skipped only fully-future KV tiles (the
causal `continue`); an early tile entirely outside the sliding window was
still processed with every key masked to -inf, so row_max == -INFINITY.
Folding that into the online softmax computed expf(-inf - (-inf)) = NaN,
and the next valid tile's 0*NaN correction then poisoned the whole row.
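
The arithmetic behind this can be reproduced on the host. The sketch below is not the kernel code: `rescale_factor` is a hypothetical helper name standing in for the correction term of the online softmax, and it assumes IEEE 754 float semantics (no fast-math flags).

```cpp
#include <cmath>
#include <limits>

// Hypothetical helper (not from the kernel): the factor by which the running
// sum and accumulator are rescaled when the running maximum moves from
// old_max to new_max in an online softmax.
inline float rescale_factor(float old_max, float new_max) {
    return std::exp(old_max - new_max);
}

// Replays the two steps described above and returns the running sum after them.
inline float running_sum_after_masked_then_valid_tile() {
    const float neg_inf = -std::numeric_limits<float>::infinity();

    // Tile 1: every key masked, so the tile max and the running max are both -inf.
    // -inf - (-inf) is NaN, and exp(NaN) is NaN.
    float running_sum = 0.0f * rescale_factor(neg_inf, neg_inf);

    // Tile 2: a valid tile with max 0. The correction exp(-inf - 0) is exactly 0,
    // but 0 * NaN is still NaN, so the row never recovers.
    running_sum *= rescale_factor(neg_inf, 0.0f);
    return running_sum;
}
```

Once the running sum is NaN, no later rescale can clear it, which is why a single fully-masked tile is enough to poison the row.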

Result: the gpt-oss prefill produced all-NaN logits for any query whose
sliding window (128 tokens) starts past the first KV tile, i.e. at longer
context, collapsing generation into a single repeated token (argmax of
all-NaN logits: vocab_size-1 in bench, token 0 "!" in the chat). This was
the cause of the residual multi-turn/long-context collapse.

Fix: skip a fully-masked tile (row_max == -INFINITY) — it contributes
nothing to the softmax. The decode kernel already guards
local_max == -INFINITY, so it was unaffected.
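
A host-side reference of the guard, as a minimal sketch rather than the kernel itself: it handles one query row with a scalar value per key, omits the attention sinks and bf16 handling, and assumes at least one unmasked key. `attend_row` and its parameters are hypothetical names.

```cpp
#include <algorithm>
#include <cmath>
#include <cstddef>
#include <limits>
#include <vector>

// Online softmax over KV tiles for one query row (hypothetical reference).
// scores[i] == -infinity marks a masked key; values[i] is a scalar "V" entry.
// With skip_fully_masked == false this reproduces the NaN described above.
float attend_row(const std::vector<float>& scores,
                 const std::vector<float>& values,
                 std::size_t tile,
                 bool skip_fully_masked) {
    const float neg_inf = -std::numeric_limits<float>::infinity();
    float running_max = neg_inf;
    float running_sum = 0.0f;  // softmax denominator so far
    float acc = 0.0f;          // weighted sum of values so far

    for (std::size_t start = 0; start < scores.size(); start += tile) {
        const std::size_t end = std::min(start + tile, scores.size());

        float row_max = neg_inf;
        for (std::size_t i = start; i < end; ++i) {
            row_max = std::max(row_max, scores[i]);
        }

        // The guard: a tile with every key masked contributes nothing.
        if (skip_fully_masked && row_max == neg_inf) continue;

        const float new_max = std::max(running_max, row_max);
        const float correction = std::exp(running_max - new_max);
        running_sum *= correction;
        acc *= correction;

        for (std::size_t i = start; i < end; ++i) {
            const float p = std::exp(scores[i] - new_max);
            running_sum += p;
            acc += p * values[i];
        }
        running_max = new_max;
    }
    return acc / running_sum;
}
```

For example, with scores {-inf, -inf, 0, 0}, values {5, 7, 1, 3}, and a tile size of 2, the guarded call averages the two visible values, while the unguarded call returns NaN.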

Verified on dash5 (TP=2): the prefill that previously went all-NaN now
produces clean logits; multi-turn gpt-oss chat (e.g. a haiku after a long
prior answer) completes correctly instead of emitting "!!!!".

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

2026-06-02 16:09:43 +08:00

causal_mask.cu

kernels/cuda: paged-attention kernel, dispatch, pinned host memory

2026-05-28 19:58:36 +08:00

flash_attention.cu

kernels: fix NaN in flash-attention sinks on fully-masked window tiles

2026-06-02 16:09:43 +08:00

paged_attention.cu

phase19: MoE support — gpt-oss-20b end-to-end inference with TP=2

2026-05-30 15:18:01 +08:00

reshape_and_cache.cu

kernels: reshape_and_cache, GPU argmax, single-launch GEMV

2026-05-30 12:50:17 +08:00