Design doc covering the tiled forward pass, the dA/dB math and how the
transpose is handled (materialize, then reuse the forward pass), the
cuBLAS row-major reference, and the finite-difference harness design,
including how T4 reuses it per op.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Design doc for the minimal tensor layer: DType/shape/Storage/Tensor,
host↔device copy, and one elementwise kernel (scale) wired end-to-end.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
T1 shipped without a design doc; capture the Rust↔CUDA build chain after
the fact (build.rs + nvcc, no_cuda cfg pattern, RAII GpuBuffer,
gitea↔dash5 flow).
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>