Design doc covering the tiled forward pass, the dA/dB math and how the
transpose is handled (materialize it, then reuse the forward pass), the
cuBLAS row-major reference, and the finite-difference harness design,
including how T4 reuses it per op.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>