Default-stream kernels run in order and every host read goes through a
stream-ordered cudaMemcpy (to_device), so the per-op cudaDeviceSynchronize
after each kernel was pure overhead — remove all 21 in tensor.rs. Host
data is still correctly ordered by the D2H memcpy that reads it.
Also zero op-output buffers with cudaMemset (device-side, async) instead of
a blocking H2D memcpy of a host zero buffer on every allocation — that
copy was itself a hidden per-op sync point.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>