xtrain

Files

Gahow Wang 0e20821633 autodiff+model: flash-attention op + --flash opt-in wiring

Add an ops::flash_attention autograd node: the forward pass caches the
O(N) per-row logsumexp instead of the O(N²) probability matrix, and the
backward pass goes through Tensor::flash_attention_backward. Model gains
a use_flash bool and a with_flash(bool) builder; the SDPA core in
attention() picks ops::flash_attention or ops::attention accordingly.
The flag is threaded through block_forward so the recompute (T13)
segment also runs flash. Default is off: the composed path is used and
the graph is unchanged.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

2026-06-17 23:10:32 +08:00
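
The idea in the commit message above can be illustrated with a minimal CPU sketch. This is not the repository's code: the `Model` struct, the flat row-major `&[f32]` layout, the single-head non-causal setup, and both free functions are assumptions made for illustration, and only the names `use_flash`, `with_flash`, `attention`, and `flash_attention` are borrowed from the message. It shows a composed path that materialises the N x N probabilities, a streaming path that keeps only one logsumexp value per query row, and an opt-in switch that defaults to off.

```rust
/// Composed path (illustrative): materialises the full N x N probability matrix.
/// q, k, v are row-major N x D buffers.
pub fn composed_attention(q: &[f32], k: &[f32], v: &[f32], n: usize, d: usize) -> Vec<f32> {
    let scale = 1.0 / (d as f32).sqrt();
    let mut probs = vec![0.0f32; n * n]; // O(N^2) storage
    for i in 0..n {
        let mut max = f32::NEG_INFINITY;
        for j in 0..n {
            let s = (0..d).map(|c| q[i * d + c] * k[j * d + c]).sum::<f32>() * scale;
            probs[i * n + j] = s;
            max = max.max(s);
        }
        let mut sum = 0.0f32;
        for j in 0..n {
            let e = (probs[i * n + j] - max).exp();
            probs[i * n + j] = e;
            sum += e;
        }
        for j in 0..n {
            probs[i * n + j] /= sum;
        }
    }
    let mut out = vec![0.0f32; n * d];
    for i in 0..n {
        for j in 0..n {
            for c in 0..d {
                out[i * d + c] += probs[i * n + j] * v[j * d + c];
            }
        }
    }
    out
}

/// Flash-style path (illustrative): streams over keys with an online softmax
/// and returns (output, logsumexp). Only N logsumexp values are kept; a
/// backward pass could rebuild any probability as exp(score - lse[i]).
pub fn flash_attention(q: &[f32], k: &[f32], v: &[f32], n: usize, d: usize) -> (Vec<f32>, Vec<f32>) {
    let scale = 1.0 / (d as f32).sqrt();
    let mut out = vec![0.0f32; n * d];
    let mut lse = vec![0.0f32; n]; // O(N) storage
    for i in 0..n {
        let mut m = f32::NEG_INFINITY; // running row maximum
        let mut l = 0.0f32; // running sum of exp(score - m)
        let acc = &mut out[i * d..(i + 1) * d];
        for j in 0..n {
            let s = (0..d).map(|c| q[i * d + c] * k[j * d + c]).sum::<f32>() * scale;
            let m_new = m.max(s);
            let correction = (m - m_new).exp(); // exp(-inf) = 0 on the first key
            let p = (s - m_new).exp();
            l = l * correction + p;
            for c in 0..d {
                acc[c] = acc[c] * correction + p * v[j * d + c];
            }
            m = m_new;
        }
        for c in 0..d {
            acc[c] /= l;
        }
        lse[i] = m + l.ln();
    }
    (out, lse)
}

/// Hypothetical stand-in for the model; the flag defaults to off.
#[derive(Clone, Copy, Debug, Default)]
pub struct Model {
    pub use_flash: bool,
}

impl Model {
    /// Builder-style opt-in, mirroring the with_flash(bool) name in the commit.
    pub fn with_flash(mut self, on: bool) -> Self {
        self.use_flash = on;
        self
    }

    /// SDPA core: picks the flash or the composed path based on the flag.
    pub fn attention(&self, q: &[f32], k: &[f32], v: &[f32], n: usize, d: usize) -> Vec<f32> {
        if self.use_flash {
            flash_attention(q, k, v, n, d).0
        } else {
            composed_attention(q, k, v, n, d)
        }
    }
}
```

Under these assumptions, `Model::default().with_flash(true).attention(..)` and `Model::default().attention(..)` should agree up to floating-point rounding, which is a cheap way to check an opt-in path against the default one.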

xtrain-autodiff

autodiff+model: flash-attention op + --flash opt-in wiring

2026-06-17 23:10:32 +08:00

xtrain-cuda

cuda: fused flash-attention kernel (fwd + flash-style bwd)

2026-06-17 23:10:25 +08:00

xtrain-distributed

model: per-block activation recompute (--recompute)

2026-06-17 09:42:42 +08:00

xtrain-model

autodiff+model: flash-attention op + --flash opt-in wiring