- Thread-local launch stream (xserv_cuda::stream): every kernel
wrapper, cublasSetStream, and NCCL collective now launches on
current_stream_raw() — the legacy null stream by default (behavior
unchanged), or the capture stream installed via push_stream during
graph capture. Capture is impossible on the legacy stream.
- Allocator retain mode: blocks freed inside a retain window are
quarantined (RetainedBlocks) instead of pooled, so an instantiated
graph keeps exclusive ownership of every intermediate buffer it
references across replays.
- Capture mode GLOBAL -> THREAD_LOCAL: concurrent TP rank threads
must not poison each other's captures with their own cudaMallocs.
- embedding_device_ids / rope_inplace_device_pos: variants reading
token ids / positions from persistent device buffers, replacing the
per-call host upload that a captured region cannot contain.
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>