All three engines emitted tokens with blocking_send on the single
decode/coordinator OS thread. A streaming client that drains slower than
generation fills its 64-slot channel, and blocking_send then blocks the whole
thread: under continuous batching one slow consumer stalls every other running
sequence (and in the serial TP/PP path it blocks admission of the next request
too). The whole point of continuous batching is defeated.
Fix: switch to try_send. engine.rs sets a client_stalled flag on Full/Closed,
reaped by is_finished() next iteration; tp_engine/pp_engine emit_text returns
bool and the decode loop breaks with finish_reason "error". When the
sequence/request is dropped its sender drops too, closing the channel so the
client receive loop ends rather than hanging. A slow client now only loses its
own sequence, never the batch.
Verified on dash5: gpt-oss FP8 TP=1 streaming via tp_engine still streams
correctly (SSE chunks, coherent content, no hang); bench-gpt-oss TP=2 5.9ms
TPOT unchanged.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Split forward_decode_paged into host bookkeeping (decode_prepare +
ids/pos upload + advance_seq_len) and a pure-GPU decode_core. The
paged-KV and sparse-MoE designs already read every per-step variable
(block table, context lens, expert ids) from stable-address device
buffers, so decode_core captures as-is.
GptOssDecodeGraph captures lazily on the second decode step (the
first eager step warms cuBLAS) after a "retained warmup": the step
runs once with the allocator quarantine on, stocking the pool with a
dedicated block for every allocation so the capture itself never
pool-misses (a cudaMalloc while capturing is illegal — and the
capture's own quarantine is what would otherwise starve the pool).
NCCL all-reduces capture cleanly; TP=2 replays in lockstep.
Wired into tp_engine, bench-gpt-oss, and xserv-chat via
GraphedGptOssDecoder (batch>1 falls back to eager;
XSERV_DECODE_GRAPH=0 disables). Greedy tokens identical to eager.
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
gpt-oss has no single-GPU engine path, so --tp 1 fell through to the
Qwen3-only engine and every request 503'd. Route gpt_oss to run_tp
even at tp=1: NCCL world-1 init works and all_reduce already no-ops
(bench-gpt-oss --tp 1 exercised this path). Quantized gpt-oss (22 GB
FP8 / 13 GB MXFP4) now serves on one 32 GB 5090.
Also fix eval_gsm8k_fast.py --gpu to accept a device list ("2,3"):
it was type=int, so any --tp 2 run pinned CUDA_VISIBLE_DEVICES to one
GPU and rank 1's set_device panicked while rank 0 spun in NCCL init.
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
Use tokenizer.is_eos() (multi-eos) for generation termination in both PP
and TP engines instead of a single eos id, so gpt-oss stops on <|return|>
/<|call|>/<|endoftext|>.
In the TP engine, optionally apply a repetition penalty on the greedy
decode path (XSERV_REP_PENALTY>1 over XSERV_REP_WINDOW recent tokens; off
by default) to break greedy repetition loops.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
- tp_engine.rs: TpModel enum dispatches between Qwen3 and GptOss based on
config.is_moe(). Server auto-detects model type on startup.
- tools/run_gpt_oss_bench.sh: one-click benchmark comparing xserv (TP=2)
vs llama.cpp (BF16 GGUF) on GSM8K quality + speed
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
tp_engine: rank-0 coordinator owns the scheduler and broadcasts per-token
commands (Register/Prefill/Decode/Free) to worker rank threads; the sampled
token always comes from rank 0, so it's correct for greedy and stochastic
sampling. Serial single-request path (sufficient for the quality benchmark).
--tp N selects it; TP=1 keeps the existing single-GPU Engine unchanged.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>