xtrain

Files

Gahow Wang c88e2ab88c post-train: M2 — decode primitives (rope_at + decode_attention)

Two forward-only Tensor primitives the KV-cache decode engine is built on,
each gated by an isolated correctness test:

- rope_at(theta, pos0): RoPE at an absolute position (pos = pos0 + row, no
  modulo) for a single decode token. The training rope_k (pos = row % period)
  is left untouched. New forward-only CUDA kernel, so no training-path risk.
  Gate: bit-identical to the corresponding row of the full-sequence rope.
- decode_attention(k, v, scale): single-query × cached-K/V SDPA, composed from
  the existing strided batched GEMM + plain (non-causal) softmax, so no new
  kernel is needed. Gate: matches the last query row of full causal attention
  (max |Δ| 6e-8).
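
The two gates above can be illustrated with a small CPU reference in plain Rust. This is a minimal sketch, not the repository's CUDA implementation. The interleaved-pair RoPE layout, the theta^(-2i/d) frequency schedule, the free-function signatures (with the query passed explicitly), and the helper names rope_row, softmax_weighted_sum and causal_attention are all assumptions made for illustration.

```rust
/// Dot product of two equal-length rows.
fn dot(a: &[f32], b: &[f32]) -> f32 {
    a.iter().zip(b).map(|(x, y)| x * y).sum()
}

/// Rotate one row in place by the RoPE angles for position `pos`.
/// Assumption: interleaved pairs (x[2i], x[2i+1]), frequency theta^(-2i/d).
fn rope_row(x: &mut [f32], theta: f32, pos: usize) {
    let d = x.len();
    for i in 0..d / 2 {
        let freq = theta.powf(-((2 * i) as f32) / d as f32);
        let (sin, cos) = (pos as f32 * freq).sin_cos();
        let (a, b) = (x[2 * i], x[2 * i + 1]);
        x[2 * i] = a * cos - b * sin;
        x[2 * i + 1] = a * sin + b * cos;
    }
}

/// Training-style rope: the position wraps (pos = row % period).
fn rope_k(x: &mut [Vec<f32>], theta: f32, period: usize) {
    for (row, r) in x.iter_mut().enumerate() {
        rope_row(r, theta, row % period);
    }
}

/// Decode-style rope: absolute position, no modulo (pos = pos0 + row).
fn rope_at(x: &mut [Vec<f32>], theta: f32, pos0: usize) {
    for (row, r) in x.iter_mut().enumerate() {
        rope_row(r, theta, pos0 + row);
    }
}

/// Numerically stable softmax over `scores`, then a weighted sum of `v` rows.
/// A score of -inf contributes a weight of exactly zero.
fn softmax_weighted_sum(scores: &[f32], v: &[Vec<f32>]) -> Vec<f32> {
    let max = scores.iter().cloned().fold(f32::NEG_INFINITY, f32::max);
    let exps: Vec<f32> = scores.iter().map(|s| (s - max).exp()).collect();
    let sum: f32 = exps.iter().sum();
    let mut out = vec![0.0f32; v[0].len()];
    for (w, vr) in exps.iter().zip(v) {
        for (o, x) in out.iter_mut().zip(vr) {
            *o += (w / sum) * x;
        }
    }
    out
}

/// Single query against all cached K/V rows with a plain (non-causal) softmax.
fn decode_attention(q: &[f32], k: &[Vec<f32>], v: &[Vec<f32>], scale: f32) -> Vec<f32> {
    let scores: Vec<f32> = k.iter().map(|kr| dot(q, kr) * scale).collect();
    softmax_weighted_sum(&scores, v)
}

/// Full causal attention: query row t is masked from keys j > t.
fn causal_attention(
    q: &[Vec<f32>],
    k: &[Vec<f32>],
    v: &[Vec<f32>],
    scale: f32,
) -> Vec<Vec<f32>> {
    (0..q.len())
        .map(|t| {
            let scores: Vec<f32> = (0..k.len())
                .map(|j| if j <= t { dot(&q[t], &k[j]) * scale } else { f32::NEG_INFINITY })
                .collect();
            softmax_weighted_sum(&scores, v)
        })
        .collect()
}
```

To mirror the first gate, apply rope_k with a period at least as long as the sequence to all T rows, apply rope_at with pos0 = t to a copy of row t alone, and compare the two rows bit for bit (for example with f32::to_bits). To mirror the second, compare the last row of causal_attention with decode_attention on the last query and the full K/V. The last query row masks nothing, which is why a plain softmax is enough on the decode path. On the GPU the two paths run through different code, so a small tolerance like the one quoted in the commit message is the natural form of that check.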

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

2026-06-30 12:00:03 +08:00

src         post-train: M2 — decode primitives (rope_at + decode_attention)                      2026-06-30 12:00:03 +08:00
tests       T1: scaffold repo + Rust/CUDA build chain (vecadd smoke test)                        2026-06-15 14:42:43 +08:00
build.rs    gqa: real grouped-query attention (repeat_kv op + both SDPA paths + wiring + tests)  2026-06-18 01:37:37 +08:00
Cargo.toml  T1: scaffold repo + Rust/CUDA build chain (vecadd smoke test)                        2026-06-15 14:42:43 +08:00