Var is Rc<RefCell<VarNode>> on a define-by-run tape. Each node holds a
value, an optional grad, its parents, and a backward closure. backward()
seeds the scalar loss, walks the graph in reverse topological order, and
pushes grads to parents. push_grad always sums into the grad slot, which
is the fan-out accumulation path T3 lacked.

Each crate's build.rs emits the no_cuda cfg itself, since a cfg set by
one crate's build script does not propagate to other crates. The engine
is gated on it; grad_check stays host-only.
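
For reviewers, a minimal scalar sketch of the backward pass described
above. It is not the code in this commit: the f64 payload, the field
names, the closure signature, and the leaf/add/mul helpers are all
assumptions made for illustration. Only the overall shape (seed, reverse
topological walk, summing push_grad) follows the description.

```rust
use std::cell::RefCell;
use std::collections::HashSet;
use std::rc::Rc;

// Hypothetical node layout; the real VarNode may differ.
pub struct VarNode {
    pub value: f64,
    pub grad: Option<f64>,
    pub parents: Vec<Var>,
    // Assumed signature: maps the node's grad to one contribution per parent.
    pub backward: Option<Box<dyn Fn(f64) -> Vec<f64>>>,
}

pub type Var = Rc<RefCell<VarNode>>;

fn node(value: f64, parents: Vec<Var>, backward: Option<Box<dyn Fn(f64) -> Vec<f64>>>) -> Var {
    Rc::new(RefCell::new(VarNode { value, grad: None, parents, backward }))
}

pub fn leaf(value: f64) -> Var {
    node(value, Vec::new(), None)
}

pub fn add(a: &Var, b: &Var) -> Var {
    let value = a.borrow().value + b.borrow().value;
    node(value, vec![a.clone(), b.clone()], Some(Box::new(|g| vec![g, g])))
}

pub fn mul(a: &Var, b: &Var) -> Var {
    let (av, bv) = (a.borrow().value, b.borrow().value);
    node(av * bv, vec![a.clone(), b.clone()], Some(Box::new(move |g| vec![g * bv, g * av])))
}

// Always sums into the grad slot, so a node with several consumers
// receives every contribution instead of only the last one.
fn push_grad(v: &Var, g: f64) {
    let mut n = v.borrow_mut();
    let prev = n.grad.unwrap_or(0.0);
    n.grad = Some(prev + g);
}

// Depth-first post-order: every node is pushed after all of its parents.
fn topo(v: &Var, seen: &mut HashSet<*const RefCell<VarNode>>, order: &mut Vec<Var>) {
    if !seen.insert(Rc::as_ptr(v)) {
        return;
    }
    for p in v.borrow().parents.iter() {
        topo(p, seen, order);
    }
    order.push(v.clone());
}

pub fn backward(loss: &Var) {
    let mut order = Vec::new();
    topo(loss, &mut HashSet::new(), &mut order);
    loss.borrow_mut().grad = Some(1.0); // seed d(loss)/d(loss)

    // Reversed post-order visits each node after all of its consumers.
    for v in order.iter().rev() {
        // Gather contributions under a shared borrow, then release it
        // before mutably borrowing the parents.
        let contributions: Vec<(Var, f64)> = {
            let n = v.borrow();
            let g = n.grad.unwrap_or(0.0);
            let mut out = Vec::new();
            if let Some(f) = &n.backward {
                for (p, pg) in n.parents.iter().zip(f(g)) {
                    out.push((p.clone(), pg));
                }
            }
            out
        };
        for (p, pg) in contributions {
            push_grad(&p, pg);
        }
    }
}
```

With x = leaf(3.0) and y = add(&mul(&x, &x), &x), x fans out to three
uses. After backward(&y), x's grad is the sum of all three contributions,
2x + 1. An overwriting push_grad would keep only one of them.
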
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>