Three related hardening changes for the API surface:
- validate_request rejects NaN/negative temperature, out-of-range top_p,
and absurd top_k before those values reach the CUDA sampling paths.
Prevents NaN logits from downstream sampling and matches typical
OpenAI-compatible server behavior (400 instead of 500).
- normalize_finish_reason maps engine strings to the OpenAI-standard
subset. Currently only "error" (from tp/pp engine client-stall) needs
normalization — it collapses to null so SDK clients see a clean stream
close instead of an unknown finish_reason value. Applied to both
streaming (SSE) and non-streaming JSON responses.
- Replace the unbounded std::sync::mpsc engine channel with a bounded
sync_channel(256) and switch submit_to_engine to try_send. A saturated
engine now returns 503 "engine is busy" instead of letting requests
pile up in RAM. Also add axum DefaultBodyLimit(4 MiB) so a malicious
or misbehaving client cannot exhaust memory with an arbitrary JSON POST.
All three engines emitted tokens with blocking_send on the single
decode/coordinator OS thread. A streaming client that drains slower than
generation fills its 64-slot channel, and blocking_send then blocks the whole
thread: under continuous batching one slow consumer stalls every other running
sequence (and in the serial TP/PP path it blocks admission of the next request
too). The whole point of continuous batching is defeated.
Fix: switch to try_send. engine.rs sets a client_stalled flag on Full/Closed,
reaped by is_finished() next iteration; tp_engine/pp_engine emit_text returns
bool and the decode loop breaks with finish_reason "error". When the
sequence/request is dropped its sender drops too, closing the channel so the
client receive loop ends rather than hanging. A slow client now only loses its
own sequence, never the batch.
Verified on dash5: gpt-oss FP8 TP=1 streaming via tp_engine still streams
correctly (SSE chunks, coherent content, no hang); bench-gpt-oss TP=2 5.9ms
TPOT unchanged.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Split forward_decode_paged into host bookkeeping (decode_prepare +
ids/pos upload + advance_seq_len) and a pure-GPU decode_core. The
paged-KV and sparse-MoE designs already read every per-step variable
(block table, context lens, expert ids) from stable-address device
buffers, so decode_core captures as-is.
GptOssDecodeGraph captures lazily on the second decode step (the
first eager step warms cuBLAS) after a "retained warmup": the step
runs once with the allocator quarantine on, stocking the pool with a
dedicated block for every allocation so the capture itself never
pool-misses (a cudaMalloc while capturing is illegal — and the
capture's own quarantine is what would otherwise starve the pool).
NCCL all-reduces capture cleanly; TP=2 replays in lockstep.
Wired into tp_engine, bench-gpt-oss, and xserv-chat via
GraphedGptOssDecoder (batch>1 falls back to eager;
XSERV_DECODE_GRAPH=0 disables). Greedy tokens identical to eager.
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
- --pp with gpt-oss now fails with a clear message instead of a
cryptic missing-weight panic inside the Qwen3-only PP engine.
- Sparse GEMV wrappers assert K%16==0 (FP8) / K%32==0 (MXFP4) — the
uint4-vectorized kernels would silently drop a tail otherwise.
- Document the topk_ids buffer holding i32 under an F32 dtype label
(DType has no I32).
- Drop unused imports/locals and the cuBLASLt scale-mode constants
orphaned by the strided-batched FP8 rework (e631a71).
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
gpt-oss has no single-GPU engine path, so --tp 1 fell through to the
Qwen3-only engine and every request 503'd. Route gpt_oss to run_tp
even at tp=1: NCCL world-1 init works and all_reduce already no-ops
(bench-gpt-oss --tp 1 exercised this path). Quantized gpt-oss (22 GB
FP8 / 13 GB MXFP4) now serves on one 32 GB 5090.
Also fix eval_gsm8k_fast.py --gpu to accept a device list ("2,3"):
it was type=int, so any --tp 2 run pinned CUDA_VISIBLE_DEVICES to one
GPU and rank 1's set_device panicked while rank 0 spun in NCCL init.
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
build_prompt_gpt_oss (the hardcoded builder used when a gpt-oss model
ships no Jinja chat template) emitted the same malformed "You are a
helpful assistant." system message that destabilized the CLI. Replace it
with the canonical harmony system message (identity / knowledge cutoff /
current date via strftime_now / Reasoning: low / channels), matching the
chat_template.jinja build_system_message macro and the xserv-chat fix.
Dormant for gpt-oss-20b (it ships a Jinja template, so the template path
runs), but correct now for any gpt-oss model without one.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
- Let the model generate its own <|channel|> routing instead of forcing
<|channel|>final<|message|> — matches the GGUF chat template behavior.
- State machine tracks harmony channels: analysis channel rendered gray,
final channel printed normally, <|end|> stops on final channel only.
- Add repetition penalty (default 1.3 for MoE, 1.0 for Qwen) with 512
token window to prevent greedy decode loops. Configurable via
XSERV_REP_PENALTY and XSERV_REP_WINDOW env vars.
- Fix Length path: use <|end|> instead of <|im_end|> for gpt-oss to
avoid poisoning the KV cache with garbage tokens on truncation.
- Server api.rs: append <|channel|>final<|message|> to the hardcoded
gpt-oss prompt (server expects to post-process the JSON output).
- Add startswith filter to minijinja for harmony template compatibility.
Known issue: gpt-oss multi-turn NaN when total context exceeds ~256
tokens — likely a flash_attention_sinks kernel bug with sliding window
layers at large kv_len + small q_len. Single-turn and short multi-turn
conversations work correctly.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- Add ChatModel enum dispatching between Qwen3 and GptOss based on
config.is_moe(), following the TP engine pattern.
- Add --tp N flag for tensor-parallel inference (required for 39GB
gpt-oss-20b which doesn't fit on a single 32GB GPU).
- Add gpt-oss harmony chat template with channel/message format.
- Replace hardcoded is_stop_token() with tokenizer.is_eos() for
multi-model EOS support.
- Restore gpt-oss hardcoded prompt template in server api.rs, lost
during the Jinja template refactor.
- Fix GEMV race condition: the K-split kernel zeroed the FP32
accumulator inside the kernel (block k=0) while other blocks
atomicAdd'd concurrently. Pre-zero with cudaMemsetAsync instead.
- Update benchmark docs with post-fix results.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Load the model's chat_template.jinja (or tokenizer_config.json
chat_template field) at startup and render it with minijinja instead of
hardcoded per-model prompt builders.
Custom Jinja functions: strftime_now (date formatting), raise_exception
(template validation errors). Falls back to Qwen3 ChatML template if
no Jinja template is found.
Removes the hardcoded build_prompt_gpt_oss() — the model's own template
now drives prompt formatting, matching llama.cpp's behavior exactly.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Route caller-supplied system messages into a harmony 'developer'
instructions block (<|start|>developer<|message|># Instructions...),
keeping the fixed system/meta block for the channel declaration. Harmony
puts user instructions on the developer role, not system.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Use tokenizer.is_eos() (multi-eos) for generation termination in both PP
and TP engines instead of a single eos id, so gpt-oss stops on <|return|>
/<|call|>/<|endoftext|>.
In the TP engine, optionally apply a repetition penalty on the greedy
decode path (XSERV_REP_PENALTY>1 over XSERV_REP_WINDOW recent tokens; off
by default) to break greedy repetition loops.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
The gpt-oss model requires a specific prompt format with <|start|>,
<|message|>, <|end|>, <|channel|> tokens. Without this, the model
produces degenerate output. Auto-detected via config.model_type.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- tp_engine.rs: TpModel enum dispatches between Qwen3 and GptOss based on
config.is_moe(). Server auto-detects model type on startup.
- tools/run_gpt_oss_bench.sh: one-click benchmark comparing xserv (TP=2)
vs llama.cpp (BF16 GGUF) on GSM8K quality + speed
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Adds --pp N for layer-wise pipeline parallelism via NCCL P2P send/recv.
Each stage holds layers [s*L, (s+1)*L), stage 0 owns embedding, last
stage owns norm/lm_head. v1 serial (one request at a time) — correctness
+ per-GPU memory savings (~1/N). Refactors model to unfused QKV/gate_up
projections and removes unused kernels (argmax, reshape_and_cache).
When all active sequences use temperature=0, run argmax on the GPU and
only D2H the token ids (~B×4 bytes) instead of the full [B, vocab_size]
BF16 logits (~1.2 MB at B=4, Qwen3 vocab=152K). Mixed-sampling batches
fall back to the existing CPU path.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
pp_engine::run_pp: stage-0 coordinator (scheduler/tokenizer/sampling +
stop logic) on the calling thread, worker stage threads for 1..P. Each
step the coordinator embeds + runs its layers, then the hidden state is
handed stage->stage over NCCL P2P; the last stage samples and returns
the token to stage 0 over an in-process channel. v1 is serial (one
request, one token/step) — correctness first; throughput via microbatch
overlap is future work.
main: wire --pp N (mutually exclusive with --tp).
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
tp_engine: rank-0 coordinator owns the scheduler and broadcasts per-token
commands (Register/Prefill/Decode/Free) to worker rank threads; the sampled
token always comes from rank 0, so it's correct for greedy and stochastic
sampling. Serial single-request path (sufficient for the quality benchmark).
--tp N selects it; TP=1 keeps the existing single-GPU Engine unchanged.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Fixes the paged-KV OOM at large --max-seq-len and adds elastic memory:
- Size the GPU block pool to available VRAM (cudaMemGetInfo) instead of the
worst-case blocks_per_seq * max_batch * 2 reservation, which OOM'd at 8192.
- Scheduler tracks waiting/running/swapped sets: block-aware admission,
swap-in of resumable sequences when blocks free, and preemption of the
newest running sequence to host when the pool can't cover a decode step.
- --swap-space-gb (default 8) sizes the pinned host swap pool;
XSERV_MAX_KV_BLOCKS forces a small pool to exercise swapping.
- api: poison-tolerant lock + clean 503 when the engine thread is gone,
instead of cascading mutex-poison panics.
Verified on RTX 5090: serves at --max-seq-len 8192 (previously OOM), and a
forced 40-block pool drives 48 lossless swap-out/swap-in cycles under
concurrency with coherent output.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
New crate: xserv-server
- Engine thread: loads Qwen3-8B, processes requests sequentially
- axum HTTP server: /health, /v1/models, /v1/chat/completions
- tokio::sync::mpsc channel between API and engine threads
- Non-streaming JSON response (streaming SSE to be added later)
API is OpenAI-compatible:
POST /v1/chat/completions {"messages": [...], "max_tokens": N}
→ {"choices": [{"message": {"content": "..."}}]}
Verified: "Hi" → ", I'm" (3 tokens), model runs correctly via HTTP.
Key learnings:
- std::sync::mpsc::SyncSender is Send but NOT Sync → wrap in Mutex for Arc<AppState>
- MutexGuard must not live across await points (scope carefully)
- axum 0.8 Extension<Arc<T>> requires T: Send + Sync
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>