flash_attention_sinks_bf16_kernel skipped only fully-future KV tiles (the
causal `continue`); an early tile entirely outside the sliding window was
still processed with every key masked to -inf, so row_max == -INFINITY.
Folding that into the online softmax computed expf(-inf - (-inf)) = NaN,
and the next valid tile's 0*NaN correction then poisoned the whole row.
Result: the gpt-oss prefill produced all-NaN logits for any query whose
sliding window (128) starts past the first KV tile — i.e. at longer
context — collapsing generation into a single repeated token (argmax of
all-NaN logits: vocab_size-1 in bench, token 0 "!" in the chat). This was
the residual multi-turn/long-context collapse.
Fix: skip a fully-masked tile (row_max == -INFINITY) — it contributes
nothing to the softmax. The decode kernel already guards
local_max == -INFINITY, so it was unaffected.
Verified on dash5 (TP=2): the prefill that previously went all-NaN now
produces clean logits; multi-turn gpt-oss chat (e.g. a haiku after a long
prior answer) completes correctly instead of emitting "!!!!".
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>