The gpt-oss harmony format generates internal control tokens
(<|channel|>, <|start|>, <|end|>, <|message|>) that should not appear
in the user-facing output. Additionally, <|end|> marks the end of a
response segment but was not in the model's EOS list, causing the
model to self-prompt into analysis channels and loop.
Fix: treat <|end|> as a stop token, skip all harmony special tokens
from the output stream.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- Add ChatModel enum dispatching between Qwen3 and GptOss based on
config.is_moe(), following the TP engine pattern.
- Add --tp N flag for tensor-parallel inference (required for 39GB
gpt-oss-20b which doesn't fit on a single 32GB GPU).
- Add gpt-oss harmony chat template with channel/message format.
- Replace hardcoded is_stop_token() with tokenizer.is_eos() for
multi-model EOS support.
- Restore gpt-oss hardcoded prompt template in server api.rs, lost
during the Jinja template refactor.
- Fix GEMV race condition: the K-split kernel zeroed the FP32
accumulator inside the kernel (block k=0) while other blocks
atomicAdd'd concurrently. Pre-zero with cudaMemsetAsync instead.
- Update benchmark docs with post-fix results.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Load the model's chat_template.jinja (or tokenizer_config.json
chat_template field) at startup and render it with minijinja instead of
hardcoded per-model prompt builders.
Custom Jinja functions: strftime_now (date formatting), raise_exception
(template validation errors). Falls back to Qwen3 ChatML template if
no Jinja template is found.
Removes the hardcoded build_prompt_gpt_oss() — the model's own template
now drives prompt formatting, matching llama.cpp's behavior exactly.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Replace the per-token CPU-routed MoE forward with an all-GPU path:
1. moe_topk_softmax: GPU top-k + softmax (was CPU sort + softmax)
2. moe_replicate: broadcast input to all local experts
3. cublasGemmStridedBatchedEx: batched expert matmul (was per-expert cuBLAS)
4. moe_weighted_sum: FP32-accumulated weighted sum on GPU (was GPU→CPU→F32→BF16→GPU)
Expert weights stored as contiguous 3D tensors for strided batched GEMM.
Zero CPU↔GPU transfers per MoE layer (was ~40 per token per layer).
Also: configurable geglu_alpha, LayerNorm bias auto-detect, unused-weight
diagnostic at load time.
GSM8K 30-problem: 11/30 → 23/30 (76.7%) vs llama.cpp 30/30 (100%).
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Parse the model's `pre_tokenizer` section to extract its Split regex
instead of hardcoding the GPT-2 pattern. The gpt-oss-20b model uses
a GPT-4-style regex that produces different word boundaries, causing a
1-token prompt mismatch vs HuggingFace (136 → 135 tokens, now aligned).
Unsupported lookahead `(?!\S)` is stripped — it only affects trailing
whitespace edge cases. Falls back to the old GPT-2/Qwen heuristic if
the model regex fails to compile or is absent.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Add --prompt to override the fixed prompt, and two teacher-forced
diagnostics: --forced runs prefill over prompt+oracle ids and reports
per-position top-1 agreement; --forced-decode walks the oracle trajectory
through the decode path with per-position agreement bucketed by position,
to localize long-context decode divergence from the reference.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Route caller-supplied system messages into a harmony 'developer'
instructions block (<|start|>developer<|message|># Instructions...),
keeping the fixed system/meta block for the channel declaration. Harmony
puts user instructions on the developer role, not system.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Use tokenizer.is_eos() (multi-eos) for generation termination in both PP
and TP engines instead of a single eos id, so gpt-oss stops on <|return|>
/<|call|>/<|endoftext|>.
In the TP engine, optionally apply a repetition penalty on the greedy
decode path (XSERV_REP_PENALTY>1 over XSERV_REP_WINDOW recent tokens; off
by default) to break greedy repetition loops.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Make argmax skip NaN logits (warn once) instead of panicking the engine
thread on a single NaN. Add sample_greedy_penalized() applying an
HF-style repetition penalty over recent ids on the greedy path, to break
greedy repetition loops on reasoning models without touching the forward
pass.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Track an ordered eos_token_ids list (not just one id) and add is_eos().
gpt-oss/harmony ends the assistant turn on <|return|> and also treats
<|call|> and <|endoftext|> as terminators (generation_config.json
eos_token_id = [200002, 199999, 200012]); single-eos families are
unchanged.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Add flash_attention_sinks_bf16 prefill kernel that folds the per-head
attention sink into the softmax denominator (exactly as the decode sink
kernel) and supports an optional sliding-window mask matching HF gpt-oss.
Wire it through xserv-kernels (flash_attention_sinks) and use it in
GptOss prefill, replacing the post-hoc sink approximation for an exact
match against the reference math.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
The gpt-oss model requires a specific prompt format with <|start|>,
<|message|>, <|end|>, <|channel|> tokens. Without this, the model
produces degenerate output. Auto-detected via config.model_type.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- tp_engine.rs: TpModel enum dispatches between Qwen3 and GptOss based on
config.is_moe(). Server auto-detects model type on startup.
- tools/run_gpt_oss_bench.sh: one-click benchmark comparing xserv (TP=2)
vs llama.cpp (BF16 GGUF) on GSM8K quality + speed
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
The custom launch_gemv_bf16 kernel produces NaN when output dimension N
is small (e.g. N=32 for the MoE router). Fall back to cuBLAS GemmEx for
N < 256. Also removes the padding workaround in gpt_oss MoE forward.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Adds --pp N for layer-wise pipeline parallelism via NCCL P2P send/recv.
Each stage holds layers [s*L, (s+1)*L), stage 0 owns embedding, last
stage owns norm/lm_head. v1 serial (one request at a time) — correctness
+ per-GPU memory savings (~1/N). Refactors model to unfused QKV/gate_up
projections and removes unused kernels (argmax, reshape_and_cache).
When all active sequences use temperature=0, run argmax on the GPU and
only D2H the token ids (~B×4 bytes) instead of the full [B, vocab_size]
BF16 logits (~1.2 MB at B=4, Qwen3 vocab=152K). Mixed-sampling batches
fall back to the existing CPU path.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Weight fusion at load time:
- q/k/v_proj → single qkv_proj_wt, GEMV once then narrow() to split
- gate/up_proj → single gate_up_proj_wt, same pattern
- Reduces GEMV calls from 7 to 4 per layer (36 layers → 108 fewer launches)
Batched decode refactor (forward_decode_paged):
- Per-head RMSNorm: reshape to [B*H, D], one rmsnorm call
- Batched RoPE: one call for all sequences
- Batched KV scatter: one reshape_and_cache kernel per layer
- Eliminates the per-sequence loop entirely
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Replace the Rust cudaMemcpy loop in append_tokens() with the new
reshape_and_cache kernel. Add append_tokens_batched() for the decode
path using the batched variant.
Fix: use data_ptr() instead of storage().gpu_buffer().as_ptr() so that
tensor offset is respected. The old code silently read from storage base
(element 0) instead of the tensor's logical start, which produced wrong
results when K/V tensors were narrow() views into a fused QKV buffer.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Three new CUDA kernels and one rewrite:
- reshape_and_cache: scatter K/V into paged pool in a single kernel per
layer, replacing the Rust-side per-token per-head cudaMemcpy loop.
Includes both single-sequence (prefill) and batched (decode) variants.
- argmax: GPU-side BF16 argmax with warp-shuffle reduction. Greedy
decode now only D2H-transfers B×4 bytes (token ids) instead of the
full [B, vocab] logits tensor.
- GEMV rewrite: fused zero-init inside the K-split kernel eliminates
the cudaMemsetAsync call, reducing launches from 3 to 2 per GEMV.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Exposes the caching allocator's trim() through a public free function.
Called after weight fusion during model loading to free temporary buffers
that would otherwise sit in the pool and cause OOM.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
narrow(dim, start, len) creates a zero-copy slice along any dimension.
is_contiguous() now ignores stride mismatches on dimensions of size 1,
since those dimensions are never stepped. This avoids unnecessary GPU
strided copies when slicing fused projection outputs at batch=1.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
pp_engine::run_pp: stage-0 coordinator (scheduler/tokenizer/sampling +
stop logic) on the calling thread, worker stage threads for 1..P. Each
step the coordinator embeds + runs its layers, then the hidden state is
handed stage->stage over NCCL P2P; the last stage samples and returns
the token to stage 0 over an in-process channel. v1 is serial (one
request, one token/step) — correctness first; throughput via microbatch
overlap is future work.
main: wire --pp N (mutually exclusive with --tp).
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Layer-wise split: each stage loads only its contiguous layer range
[s*L, (s+1)*L); stage 0 keeps embed_tokens, the last stage keeps
norm/lm_head (others get a 1x1 placeholder). Heads are NOT split
(PP is orthogonal to TP). Adds embed/head and forward_layers_prefill/
forward_layers_decode that take and return the [tokens, hidden] hidden
state; per-stage PagedKVCache is indexed by local layer id.
sampling: derive Clone on SamplingParams (carried in the PP command enum).
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Add ncclSend/ncclRecv FFI and a PpContext that initializes a NCCL
communicator across P pipeline stages and hands the hidden state to
neighbour stages on the null stream. Mirrors TpContext; the collective
differs (point-to-point hand-off vs in-layer AllReduce).
tests/sendrecv.rs: 2-GPU stage0->stage1 send/recv smoke test.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Cooked-mode read_line() left line editing to the terminal, so Backspace on a
multi-byte 汉字/かな/한글 deleted a byte (or behaved inconsistently across TTYs).
Replace with a raw-mode reader (libc termios): Backspace pops a whole char,
multi-byte input is reassembled from its continuation bytes, and a full-line
redraw renders double-width glyphs correctly. Non-TTY input falls back to a
plain read; raw mode is restored after each line. libc is already a locked
transitive dep, so this builds offline.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
tp_engine: rank-0 coordinator owns the scheduler and broadcasts per-token
commands (Register/Prefill/Decode/Free) to worker rank threads; the sampled
token always comes from rank 0, so it's correct for greedy and stochastic
sampling. Serial single-request path (sufficient for the quality benchmark).
--tp N selects it; TP=1 keeps the existing single-GPU Engine unchanged.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
from_weights_tp shards each rank's weights (column-split q/k/v/gate/up,
row-split o/down; replicate norms/embed/lm_head) and the paged forward uses
local head counts + AllReduces after o_proj and down_proj. PagedKVCache::new_tp
sizes the pool for the rank's local KV heads (KV is sharded too). TP=1 is the
identity path. New bench-tp binary runs E2E multi-GPU generation per TP degree.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
New xserv-distributed crate: hand-written NCCL FFI, TpContext (one rank per
thread, bound to one GPU), and in-place BF16 AllReduce on the null stream so
it orders naturally with the model's kernels. 2-GPU AllReduce test included.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Fixes the paged-KV OOM at large --max-seq-len and adds elastic memory:
- Size the GPU block pool to available VRAM (cudaMemGetInfo) instead of the
worst-case blocks_per_seq * max_batch * 2 reservation, which OOM'd at 8192.
- Scheduler tracks waiting/running/swapped sets: block-aware admission,
swap-in of resumable sequences when blocks free, and preemption of the
newest running sequence to host when the pool can't cover a decode step.
- --swap-space-gb (default 8) sizes the pinned host swap pool;
XSERV_MAX_KV_BLOCKS forces a small pool to exercise swapping.
- api: poison-tolerant lock + clean 503 when the engine thread is gone,
instead of cascading mutex-poison panics.
Verified on RTX 5090: serves at --max-seq-len 8192 (previously OOM), and a
forced 40-block pool drives 48 lossless swap-out/swap-in cycles under
concurrency with coherent output.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
- paged_kv_cache: new block-paged KV cache; adds a pinned-host swap pool with
a second BlockAllocator, per-sequence Location {Gpu,Cpu}, and lossless
swap_out/swap_in (block-granular D2H/H2D) for vLLM-style preemption.
bytes_per_block helper exposes per-block cost for VRAM-based sizing.
- decode_graph: CUDA-graph decode path.
- qwen3/gpt2/kv_cache: paged prefill/decode forward + related updates.
- tokenizer/bins: BPE updates, new xserv-chat CLI, bench-qwen3 tweaks.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
CUDA layer for the paged-KV + swap work:
- csrc: new paged_attention.cu plus updates across attention/gemm/norm/
activation/embedding/reduce kernels and common.cuh.
- xserv-kernels: new dispatch module and kernel-binding updates.
- xserv-cuda: cudaMallocHost/FreeHost bindings + PinnedBuffer (host swap
pool backing) and offset-aware D2H/H2D copies used to move KV blocks
between the GPU pool and pinned host memory.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Three performance optimizations targeting decode throughput:
1. Decode Attention Kernel (csrc/attention/flash_attention.cu):
- Specialized kernel for Q_len=1 (decode step)
- 256 threads parallelize across KV sequence dimension
- Online softmax with block-level warp-shuffle reduction
- Replaces FA2 kernel which wasted 63/64 threads for decode
- flash_attention() auto-dispatches when q_len==1
2. Fused SiLU×Mul (csrc/activation/activations.cu):
- Single kernel: out = silu(gate) * up
- Saves 1 HBM read + 1 HBM write per FFN layer (N elements)
- Eliminates intermediate tensor allocation
3. Fused Add+RMSNorm (csrc/normalization/rmsnorm.cu):
- Single kernel: (normed, sum) = (rmsnorm(x+residual), x+residual)
- Saves 1 full HBM round-trip per attention block
- Eliminates separate add + rmsnorm kernel pair
Performance analysis:
- At current short sequences (max 79 tokens), these optimizations provide
marginal benefit because the bottleneck is cuBLAS GEMV overhead:
252 weight matrix reads × ~32MB each = 15.5 GB per decode step.
Theoretical minimum at 1.79 TB/s = 8.7ms, actual ~78ms (9x gap).
- The fused kernels and decode attention will show larger gains at
longer sequences where attention and element-wise ops dominate.
- Next optimization target: CUDA Graphs to eliminate kernel launch
overhead, or custom GEMV kernels to replace cuBLAS for M=1.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
New crate: xserv-server
- Engine thread: loads Qwen3-8B, processes requests sequentially
- axum HTTP server: /health, /v1/models, /v1/chat/completions
- tokio::sync::mpsc channel between API and engine threads
- Non-streaming JSON response (streaming SSE to be added later)
API is OpenAI-compatible:
POST /v1/chat/completions {"messages": [...], "max_tokens": N}
→ {"choices": [{"message": {"content": "..."}}]}
Verified: "Hi" → ", I'm" (3 tokens), model runs correctly via HTTP.
Key learnings:
- std::sync::mpsc::SyncSender is Send but NOT Sync → wrap in Mutex for Arc<AppState>
- MutexGuard must not live across await points (scope carefully)
- axum 0.8 Extension<Arc<T>> requires T: Send + Sync
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- GpuKVCache: pre-allocated GPU buffers, D2D copy append at offset
- Per-head strided layout [num_kv_heads, max_seq_len, head_dim]
- Fixed critical bug: seq_len must advance AFTER all layers write
(not inside the loop per-layer)
- GpuBuffer::copy_from_device_at for offset-based D2D copy
- Tensor::from_storage constructor for wrapping raw GPU buffers
- Exported Storage and Dims from xserv-tensor
Correctness: GPU KV cache vs CPU KV cache = 50/50 bit-identical
Performance: ~neutral (KV cache was never the main bottleneck —
reshape/merge/transpose CPU round-trips dominate for Qwen3-8B)
TTFT: 122ms, TBT: 142ms, 7.0 tok/s (marginal change from 7.3)
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Kernel additions:
- add_f32/bf16, mul_f32/bf16 CUDA kernels (element-wise, on GPU)
- Refactored activation.rs with dispatch_unary/dispatch_binary helpers
- Qwen3 and GPT-2 now use GPU add/mul instead of CPU round-trips
GPT-2 add_bias also moved to GPU (broadcast via tile + GPU add)
BF16 precision analysis (docs/benchmarks/phase10-qwen3.md):
- Root cause: separate attention kernels materialize BF16 intermediates
(QK^T→BF16→scale→BF16→mask→BF16→softmax→BF16 vs HF's fused FP32 path)
- HF itself SDPA vs Eager also differs by ~0.125 logit
- xserv vs HF: ~1-2 logit systematic offset, but same top-1 in 84% cases
- Industry standard for BF16: top-5 overlap (we achieve 100%)
- Fix path: Flash Attention (Phase 14) to fuse attention in FP32
Performance: TTFT 138→119ms, TBT 144→137ms (GPU ops faster than CPU)
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Qwen3 model (qwen3.rs):
- RMSNorm + QK normalization (per-head q_norm/k_norm)
- GQA: 32 Q heads, 8 KV heads, repeat_kv for attention
- SwiGLU FFN: gate_proj → SiLU → * up_proj → down_proj
- RoPE with transpose for [1,H,S,D] ↔ [S,H,D] layout
- BF16 forward pass, [out,in] weight layout via linear_t
- No attention bias (attention_bias=false)
Tokenizer fixes:
- Fixed unicode_to_byte: shifted bytes now use correct inverse lookup table
- MergeEntry supports both string and array formats
- Both GPT-2 and Qwen3 tokenizers work correctly (English + Chinese)
KVCache refactored:
- Dtype-agnostic: stores raw bytes per-head, works for F32 and BF16
- append_kv_tensor/get_kv_tensors use Tensor directly
CLI updated:
- Auto-detects model type from config.json (gpt2 vs qwen3)
- Supports both GPT-2 (F32) and Qwen3 (BF16)
Verified: Qwen3-8B generates coherent English and Chinese on single RTX 5090.
61/61 tests pass, GPT-2 performance no regression.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Phase 6 — Model Loading (xserv-model):
- safetensors parser with single/sharded file support
- ModelConfig with dual naming (GPT-2 n_embd/n_head + modern HF naming)
- Weight loading flow: safetensors → mmap → CPU Tensor → GPU
Phase 7 — BPE Tokenizer (xserv-tokenizer):
- Full BPE encode/decode from tokenizer.json
- GPT-2 byte-to-unicode mapping (printable ASCII identity + shifted bytes)
- Pre-tokenization regex, special token handling
- Chat template support structure
Phase 8 — GPT-2 Complete Inference:
- GPT-2 model definition: wte, wpe, 12 transformer blocks, ln_f
- Forward pass: embedding → (LayerNorm → MHA → residual → LayerNorm → MLP → residual) × 12 → LN → logits
- QKV split with correct [batch, heads, seq, dim] layout (fixed reshape bug)
- Greedy sampling from last-position logits
- Interactive CLI: xserv-cli <model-dir> [--max-tokens N]
Verified: GPT-2 124M generates coherent English text on RTX 5090.
"The future of AI is uncertain. The future of AI is uncertain..."
"Once upon a time, the world was a place of great beauty..."
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>