Confirms snapshot_link works for CUDA device pointers, not just host
memory. A sender on cuda:0 pushes to a receiver on cuda:1 via RDMA over
mlx5_60. All five sizes (16 KB, 1 MB, 16 MB, 64 MB, 256 MB) pass SHA
verification.
  Size     Time      Throughput
  16 KB    8.3 ms    0.016 Gbps  (cold openSegment)
  1 MB     0.10 ms   87.6 Gbps
  16 MB    0.84 ms   159 Gbps
  64 MB    2.52 ms   213 Gbps
  256 MB   8.54 ms   251 Gbps    (~60% NDR400 line rate)
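The throughput column can be re-derived from size and time. A minimal
sketch, assuming binary units (1 MB = 2^20 bytes) and Gbps = 10^9 bits/s;
the helper name throughput_gbps is illustrative and not part of the
scripts in this change:

```python
KIB = 1 << 10
MIB = 1 << 20


def throughput_gbps(size_bytes: int, elapsed_ms: float) -> float:
    """Throughput in Gbps (10^9 bits/s) for one transfer."""
    return size_bytes * 8 / (elapsed_ms * 1e-3) / 1e9


# (size in bytes, time in ms) copied from the measurements above
measurements = [
    (16 * KIB, 8.3),
    (1 * MIB, 0.10),
    (16 * MIB, 0.84),
    (64 * MIB, 2.52),
    (256 * MIB, 8.54),
]

for size, ms in measurements:
    print(f"{size:>10d} B  {ms:5.2f} ms  {throughput_gbps(size, ms):8.3f} Gbps")
```

The 1 MB row comes out slightly lower than the reported 87.6 Gbps here,
presumably because the listed 0.10 ms is itself rounded.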
For Inferact-scale sessions (~50K tokens × ~80 KB layer-per-token =
~4 GB), this projects D→P transfer time at ~130 ms, assuming the 256 MB
rate (~251 Gbps) holds at that size. That is within the "reseed-savings"
envelope sketched in design doc §3.2.
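The projection is straight arithmetic. A rough sketch, assuming the
256 MB throughput (~251 Gbps) is sustained for the whole session and
that 80 KB means 80 × 1024 bytes; it ignores per-transfer setup such as
the cold openSegment cost. The function name is hypothetical:

```python
def projected_transfer_ms(tokens: int, bytes_per_token: int, gbps: float) -> float:
    """Estimated transfer time in ms at a sustained rate in Gbps (10^9 bits/s)."""
    total_bits = tokens * bytes_per_token * 8
    return total_bits / (gbps * 1e9) * 1e3


session_bytes = 50_000 * 80 * 1024  # ~4 GB session payload
estimate_ms = projected_transfer_ms(50_000, 80 * 1024, 251.0)
print(f"{session_bytes / 1e9:.2f} GB -> {estimate_ms:.1f} ms")
```

With decimal kilobytes (80,000 bytes per token) the estimate drops by a
few milliseconds; either way it rounds to ~130 ms.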
Files:
scripts/snapshot_link_receiver_gpu.py
scripts/smoke_snapshot_link_gpu.py
Next: SGLang scheduler integration for D-side dump + P-side ingest.