Phase T5 structural ops on top of the T4 set, needed to assemble the
tiny transformer:
- embedding: gather rows by I32 ids (CUDA kernel) / scatter-add backward
(atomic, so repeated ids accumulate). csrc/ops/model.cu + ffi.
- reshape: contiguous metadata-only view (Tensor::reshape), no kernel.
- transpose_3d01: [a,b,c]->[b,a,c] for the multi-head layout (kernel).
- autograd nodes: embedding/reshape/transpose_3d01/transpose_2d, plus
split_heads (->Vec<Var>) / merge_heads for per-head attention.
- tape: Var::zero_grad + set_value so a hand-written GD step can update
params and clear grads between steps.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>