Files
xserv/crates/xserv-server
Gahow Wang 34224c7c93 gpt-oss: replay the whole batch=1 decode step as one CUDA graph
Split forward_decode_paged into host bookkeeping (decode_prepare +
ids/pos upload + advance_seq_len) and a pure-GPU decode_core. The
paged-KV and sparse-MoE designs already read every per-step variable
(block table, context lens, expert ids) from stable-address device
buffers, so decode_core captures as-is.

GptOssDecodeGraph captures lazily on the second decode step (the
first eager step warms cuBLAS) after a "retained warmup": the step
runs once with the allocator quarantine on, stocking the pool with a
dedicated block for every allocation so the capture itself never
pool-misses (a cudaMalloc while capturing is illegal — and the
capture's own quarantine is what would otherwise starve the pool).
NCCL all-reduces capture cleanly; TP=2 replays in lockstep.

Wired into tp_engine, bench-gpt-oss, and xserv-chat via
GraphedGptOssDecoder (batch>1 falls back to eager;
XSERV_DECODE_GRAPH=0 disables). Greedy tokens identical to eager.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
2026-06-12 20:12:37 +08:00
..