The dense MoE kernels (moe_replicate, moe_bias_add_3d, moe_weighted_sum)
computed total / expert_stride / element indices in int32. gpt-oss prefill
runs the whole prompt through the dense path unchunked (SPARSE_MAX_TOKENS=8),
so local_experts*num_tokens*hidden (and batch*num_tokens*dim, and
local_id*expert_stride) overflow int32 at ~3.6k-23k prefill tokens
(TP-dependent) — well inside the supported context window. The launch then
fails silently because CUDA_CHECK_LAST_ERROR was ((void)0) under NDEBUG, so
the bias / weighted-sum simply never runs and the forward pass is corrupted
with no error reported.
Fix: switch the three kernels and their launchers to long long, mirroring the
(long long) indexing already used in moe_sparse.cu. Also make
CUDA_CHECK_LAST_ERROR always-on — cudaGetLastError does not sync, so the
per-launch host cost is negligible, and a silent launch failure is exactly
the class of bug this one was.
Verified on dash5 (RTX 5090): a direct kernel test at 2.21B elements (>2^31)
for both moe_replicate and moe_bias_add_3d produces correct results with no
launch error; bench-gpt-oss TP=2 holds at 5.9ms TPOT, output unchanged.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
dash5, gpt-oss-20b FP8, warm-server vs llama.cpp MXFP4 (6 reps):
TP=2 TPOT 5.76-5.89 vs 7.42-8.45 ms (xserv 1.26-1.47x), TTFT 2.4x
ahead short/medium; TP=1 5.78-5.95 vs 2.80-3.22 ms (gap 2.5x -> 2.0x,
TTFT now ahead short/medium). GSM8K-50 through the graph path: 94%.
Lesson recorded: graphs bought ~0.6 ms (launches were already hidden
by async execution), the GPU argmax ~1 ms — measure, don't guess.
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
sample() at temperature 0 copied the full [seq, 201088] BF16 logits
to the host and scanned them every token (~1 ms/token). Use the
Phase 15 argmax kernel (block reduction + 4-byte D2H) when logits are
BF16 on GPU; bench-gpt-oss's greedy sampler likewise. Exact-tie
logits may break differently than the host scan — greedy trajectories
can legitimately diverge at a tie token (GSM8K unchanged).
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
Split forward_decode_paged into host bookkeeping (decode_prepare +
ids/pos upload + advance_seq_len) and a pure-GPU decode_core. The
paged-KV and sparse-MoE designs already read every per-step variable
(block table, context lens, expert ids) from stable-address device
buffers, so decode_core captures as-is.
GptOssDecodeGraph captures lazily on the second decode step (the
first eager step warms cuBLAS) after a "retained warmup": the step
runs once with the allocator quarantine on, stocking the pool with a
dedicated block for every allocation so the capture itself never
pool-misses (a cudaMalloc while capturing is illegal — and the
capture's own quarantine is what would otherwise starve the pool).
NCCL all-reduces capture cleanly; TP=2 replays in lockstep.
Wired into tp_engine, bench-gpt-oss, and xserv-chat via
GraphedGptOssDecoder (batch>1 falls back to eager;
XSERV_DECODE_GRAPH=0 disables). Greedy tokens identical to eager.
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
- Thread-local launch stream (xserv_cuda::stream): every kernel
wrapper, cublasSetStream, and NCCL collective now launches on
current_stream_raw() — the legacy null stream by default (behavior
unchanged), or the capture stream installed via push_stream during
graph capture. Capture is impossible on the legacy stream.
- Allocator retain mode: blocks freed inside a retain window are
quarantined (RetainedBlocks) instead of pooled, so an instantiated
graph keeps exclusive ownership of every intermediate buffer it
references across replays.
- Capture mode GLOBAL -> THREAD_LOCAL: concurrent TP rank threads
must not poison each other's captures with their own cudaMallocs.
- embedding_device_ids / rope_inplace_device_pos: variants reading
token ids / positions from persistent device buffers, replacing the
per-call host upload that a captured region cannot contain.
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
- docs/19-gpt-oss-moe.md: the numbered series jumped 18->20; write up
gpt-oss arch deltas, harmony pitfalls, and the two CUDA debugging
postmortems (fully-masked-tile NaN in flash-attention sinks;
pre-__syncthreads early return reading uninitialized smem in the
decode GEMV) — the highest-value learning content of that phase.
- README: models/perf/capabilities were frozen at the Qwen3-only era;
now lists gpt-oss MoE, TP/PP, FP8/MXFP4, sparse MoE, and the
llama.cpp standing.
- Roadmap: record where reality diverged from the plan at Phase 18+,
add milestone entries and the ranked next-phase candidates
(21 CUDA-graph MoE decode, 22 non-expert quant, 23 sparse prefill).
- sparse-moe benchmark doc: post-review-fix numbers.
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
- --pp with gpt-oss now fails with a clear message instead of a
cryptic missing-weight panic inside the Qwen3-only PP engine.
- Sparse GEMV wrappers assert K%16==0 (FP8) / K%32==0 (MXFP4) — the
uint4-vectorized kernels would silently drop a tail otherwise.
- Document the topk_ids buffer holding i32 under an F32 dtype label
(DType has no I32).
- Drop unused imports/locals and the cuBLASLt scale-mode constants
orphaned by the strided-batched FP8 rework (e631a71).
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
Decode carried three leftover cudaDeviceSynchronize (prefill one) from
NaN debugging — the Qwen3 path has none and the logits D2H in sample()
already orders against the null stream.
add_bias for rows>1 round-tripped the bias through the CPU (D2H + host
tile loop + H2D) on every call — 96 times per prefill across q/k/v/o.
Replace with a bias_add_2d broadcast kernel.
dash5, FP8 TP=2, warm-server: TTFT 35/49/94 -> 29/42/79 ms
(short/medium/long), TPOT 7.19-7.32 -> 6.99-7.21 ms. Greedy tokens
unchanged; GSM8K-50 94%.
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
gpt-oss has no single-GPU engine path, so --tp 1 fell through to the
Qwen3-only engine and every request 503'd. Route gpt_oss to run_tp
even at tp=1: NCCL world-1 init works and all_reduce already no-ops
(bench-gpt-oss --tp 1 exercised this path). Quantized gpt-oss (22 GB
FP8 / 13 GB MXFP4) now serves on one 32 GB 5090.
Also fix eval_gsm8k_fast.py --gpu to accept a device list ("2,3"):
it was type=int, so any --tp 2 run pinned CUDA_VISIBLE_DEVICES to one
GPU and rank 1's set_device panicked while rank 0 spun in NCCL init.
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
Dense MoE replicated x across all 16 local experts and ran the full
batched GEMM, reading every expert's weights per token; the weighted
sum then discarded 12 of 16 results. Decode is memory-bound, so this
was ~8x wasted expert bytes — the entire decode gap vs llama.cpp.
New fused expert-indexed GEMVs (csrc/moe/moe_sparse.cu) read
topk_ids on-device (no host sync) and early-return block-uniformly
for experts other ranks own. FP8 runs W8A16 (activations stay BF16 —
tensor cores are irrelevant at M=1, and activation quantization error
disappears); MXFP4 runs W4A16. Per-expert bias + scale fused into the
GEMV epilogue; slot-indexed weighted sum skips (never multiplies)
unwritten non-local slots. Dense path retained for num_tokens > 8
(prefill) and via XSERV_DENSE_MOE=1 for A/B.
dash5 (RTX 5090), gpt-oss-20b FP8, TP=2: decode TPOT 13.9 -> 7.6 ms.
Warm-server vs llama.cpp MXFP4 TP=2: TPOT 7.19-7.32 vs 7.54-8.42 ms —
first config where xserv wins decode outright. GSM8K-100: 96% (dense
FP8: 91%). llama TP=1 (2.9 ms) remains ahead: next levers are decode
CUDA graphs, non-expert quantization, sparse prefill (docs/20).
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
xserv_vs_llama.py runs each server one at a time on the same GPUs (drains VRAM
between), streams identical prompts through /v1/chat/completions, and reports
median TTFT/TPOT/throughput. Counts llama's reasoning_content as real decode
tokens so the gpt-oss CoT is measured fairly.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Weight-only 4-bit for the gpt-oss MoE experts: weights stored MXFP4 (E2M1 +
per-32-element UE8M0 block scale, tools/quantize_mxfp4.py), a fused kernel reads
the 4-bit weights and dequantizes on-chip to BF16. Decode (M=1) uses a fused
dequant-GEMV (batched_gemv_mxfp4) with shared-memory activation tiling; prefill
(M>1) dequantizes to BF16 then reuses the BF16 batched GEMM. MXFP4 is detected
by the scale tensor's rank (3-D [E,N,K/32]) vs FP8's 1-D [E].
Verified on dash5 (gpt-oss-20b, TP=2, 5090): byte-identical greedy tokens to
FP8/BF16, smallest footprint (13 GB vs 22 GB FP8, 39 GB BF16) — fits one 32 GB
5090 with room for KV cache.
NOT a decode speedup: the hand-written W4A16 GEMV (no tensor cores) is less
efficient than cuBLASLt's FP8 tensor-core GEMM, so even at half the weight bytes
decode is 17.0 ms vs FP8 13.5 ms (faster than BF16 18.8 ms); prefill regresses
(350 vs 134 ms, dequant fallback). Committed as a correct memory-optimization
foundation. Beating FP8 on speed needs FP4 tensor cores (W4A4, cuBLASLt
block-scaled MXFP4) or a Marlin-class kernel; see
docs/benchmarks/mxfp4-and-llama-decode.md.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
The plan-cache fix removed the per-expert heuristic churn but still issued one
cublasLtMatmul per expert: ~768 tiny launches per decoded token (16 local
experts × 2 GEMMs × 24 layers), which capped the FP8 decode win at ~1.05× over
BF16. Collapse each MoE GEMM into ONE strided-batched cuBLASLt FP8 matmul
(BATCH_COUNT + strided-batch offsets on all four layouts) → ~48 launches/token.
A single strided call can't carry a per-batch scalar B-scale, so the per-expert
weight scale moves out of the GEMM epilogue into a fused post-scale kernel
(rowwise_scale_moe_bf16) that applies a_scale[token]·b_scale[expert] in one
pass. This is precision-equivalent: BF16's relative error is scale-invariant, so
scaling the unscaled GEMM output afterward loses nothing vs scaling in-epilogue.
Measured on dash5 (gpt-oss-20b, TP=2, 5090), warm-server GSM8K:
decode TPOT 17.45 → 13.08 ms (FP8 now 1.41× vs BF16 18.39 ms),
throughput 57.3 → 76.4 tok/s, accuracy unchanged (FP8 91.0% vs BF16 90.0%).
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
fp8_compare.py launches one xserv-server per model (same GPUs / TP for a
fair comparison), gates readiness on a real generation (not /health),
and streams GSM8K through /v1/chat/completions measuring per-request
TTFT (time to first token) and TPOT (mean inter-token latency) plus
exact-match accuracy. docs/benchmarks/fp8-quantization.md records the
quantization scheme, the perf-bug fix, and the dash5 results.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
batched_gemm_fp8 rebuilt the cuBLASLt matmul descriptor, four matrix
layouts, a preference, and a 4-byte scale alloc, AND ran the algo
heuristic search — once per expert, per GEMM, per layer, on every
forward (~1500 heuristic searches per decoded token). FP8 decode ran at
27.0 ms/tok vs BF16 18.8 ms, i.e. slower than the path it was meant to
accelerate.
Cache the full plan (descriptor + layouts + heuristically-chosen algo)
in a thread-local map keyed by (M, N, K) so the heuristic runs once per
shape and is reused across experts and forwards; allocate the 1.0 scale
buffer once; pass each expert's weight scale via the cuBLASLt B-scale
device pointer instead of folding it into alpha (identical FP32-epilogue
precision, and no host readback of b_scales). The per-expert loop now
issues only cublasLtMatmul.
Measured on dash5 (gpt-oss-20b, TP=2, 5090): FP8 decode TPOT 27.0 -> 17.9
ms, now faster than BF16 (18.8 ms); GSM8K-200 accuracy unchanged
(FP8 93.0% vs BF16 90.5%, within noise).
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Replace the W8A16 dequant→BF16-GEMM path with native FP8×FP8→BF16 GEMM
using cuBLASLt on Blackwell (RTX 5090). Both weights (static FP8 E4M3)
and activations (dynamically quantized per-row) are processed directly
on FP8 tensor cores.
Key implementation details:
- cuBLASLt on Blackwell requires transA=T for FP8, so expert weights
are transposed during model loading ([E,K,N] → [E,N,K])
- Per-row activation quantization kernel (absmax/448 → FP8 E4M3)
- Post-GEMM row-wise rescaling recovers per-token precision
- Per-expert loop (not batched) due to cuBLASLt FP8 scale constraints
The same FP8 quantized model files work — no re-quantization needed.
Activation quantization happens dynamically at inference time.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Close the <think> block when EOS or max_tokens interrupts an analysis
channel, and flush stdout after each analysis token so --think streams
smoothly instead of dumping in buffer-sized chunks.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
The gpt-oss harmony `analysis` channel is the model's reasoning, analogous
to Qwen3's <think>. With --think, wrap it in a `<think>\n…\n</think>\n\n`
block (gray when color is on, like Qwen3) and then print the final-channel
answer; without --think, suppress the analysis and show only the answer.
Replaces the previous color-gated behavior (analysis shown gray only on a
TTY, with no markers). Analysis is still excluded from the multi-turn
history (answer_ids), so re-prefill drops CoT as before.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
gemv_bf16_fused_kernel returned early on out-of-range columns
(`if (col >= N) return;`) BEFORE the cooperative load of x into shared
memory and the `__syncthreads()`. When N is not a multiple of GEMV_TILE_N
(128), the last column-block's out-of-range threads exited without loading
their slice of x_shared, so the in-range threads then read uninitialized
shared memory in the dot product — and __syncthreads with exited threads is
itself UB. Result: intermittent huge/garbage outputs (~1e33) that, after
the next RMSNorm, collapsed the whole forward pass to a degenerate logit
distribution (argmax → vocab_size-1, or NaN), derailing generation.
This hit every M=1 BF16 GEMV (n>=256) with n % 128 != 0 — i.e. gpt-oss
decode o_proj and the MoE projections (n=2880). q/k/v (4096) and lm_head
(201088) are 128-aligned and were unaffected, as is Qwen3 (hidden 4096),
which is why this manifested as intermittent gpt-oss-only decode failures.
Fix: all threads participate in the shared-memory load and reach the
barrier; the col>=N check moves to AFTER __syncthreads.
Verified on dash5 (TP=2): a prompt that reliably produced garbage ~70% of
runs now yields clean logits 16/16; the multi-turn Chinese chat that
collapsed mid-conversation completes coherently with 0 NaN warnings.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
flash_attention_sinks_bf16_kernel skipped only fully-future KV tiles (the
causal `continue`); an early tile entirely outside the sliding window was
still processed with every key masked to -inf, so row_max == -INFINITY.
Folding that into the online softmax computed expf(-inf - (-inf)) = NaN,
and the next valid tile's 0*NaN correction then poisoned the whole row.
Result: the gpt-oss prefill produced all-NaN logits for any query whose
sliding window (128) starts past the first KV tile — i.e. at longer
context — collapsing generation into a single repeated token (argmax of
all-NaN logits: vocab_size-1 in bench, token 0 "!" in the chat). This was
the residual multi-turn/long-context collapse.
Fix: skip a fully-masked tile (row_max == -INFINITY) — it contributes
nothing to the softmax. The decode kernel already guards
local_max == -INFINITY, so it was unaffected.
Verified on dash5 (TP=2): the prefill that previously went all-NaN now
produces clean logits; multi-turn gpt-oss chat (e.g. a haiku after a long
prior answer) completes correctly instead of emitting "!!!!".
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Re-render the whole conversation each turn and re-prefill into a freshly
cleared slot, with past assistant messages rendered as completed `final`
channels (analysis dropped, terminated with <|end|> not the <|return|>
stop token) — matching the model's training format and the server's
builder. The previous incremental cache kept every turn's chain-of-thought
plus <|return|> in context, which is out of distribution for harmony
multi-turn. The generator now returns the final-channel text to feed back
as history. Qwen3 keeps the incremental cache (its ChatML format is
unaffected); reset_slot factors out the free+re-register.
NOTE: this corrects the multi-turn *format* but does NOT cure the
long-context collapse on some inputs. That is a forward-pass numerical bug
(NaN / degenerate logits) reproducible in clean bench-gpt-oss independent
of the chat layer — the collapse token is vocab_size-1 (201087), the
all-NaN argmax tie-break. Tracked separately.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
build_prompt_gpt_oss (the hardcoded builder used when a gpt-oss model
ships no Jinja chat template) emitted the same malformed "You are a
helpful assistant." system message that destabilized the CLI. Replace it
with the canonical harmony system message (identity / knowledge cutoff /
current date via strftime_now / Reasoning: low / channels), matching the
chat_template.jinja build_system_message macro and the xserv-chat fix.
Dormant for gpt-oss-20b (it ships a Jinja template, so the template path
runs), but correct now for any gpt-oss model without one.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
The hand-rolled gpt-oss system message dropped the canonical harmony
structure (identity / knowledge cutoff / current date / Reasoning level),
putting the model out of distribution — greedy decoding then flipped into
garbage or analysis loops on ~half of single-turn requests. Emit the
canonical system message (matching the model's chat_template.jinja
build_system_message macro) with Reasoning: low, plus a today_ymd() date
helper.
Also:
- Default the repetition penalty to off (1.0). Penalizing the harmony
control tokens (<|channel|>/<|message|>/<|start|>) that must repeat to
open the final channel made gpt-oss stop right after analysis, emitting
nothing.
- Suppress the literal "assistant" role header emitted between the
analysis and final channels (only print in the final channel, moe only;
non-moe Qwen3 stays in Normal and prints as before).
Verified on dash5 (TP=2): single-turn "capital of France" is now stable
across runs with a clean final answer; Qwen3 chat unaffected.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
- Let the model generate its own <|channel|> routing instead of forcing
<|channel|>final<|message|> — matches the GGUF chat template behavior.
- State machine tracks harmony channels: analysis channel rendered gray,
final channel printed normally, <|end|> stops on final channel only.
- Add repetition penalty (default 1.3 for MoE, 1.0 for Qwen) with 512
token window to prevent greedy decode loops. Configurable via
XSERV_REP_PENALTY and XSERV_REP_WINDOW env vars.
- Fix Length path: use <|end|> instead of <|im_end|> for gpt-oss to
avoid poisoning the KV cache with garbage tokens on truncation.
- Server api.rs: append <|channel|>final<|message|> to the hardcoded
gpt-oss prompt (server expects to post-process the JSON output).
- Add startswith filter to minijinja for harmony template compatibility.
Known issue: gpt-oss multi-turn NaN when total context exceeds ~256
tokens — likely a flash_attention_sinks kernel bug with sliding window
layers at large kv_len + small q_len. Single-turn and short multi-turn
conversations work correctly.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
The gpt-oss harmony format generates internal control tokens
(<|channel|>, <|start|>, <|end|>, <|message|>) that should not appear
in the user-facing output. Additionally, <|end|> marks the end of a
response segment but was not in the model's EOS list, causing the
model to self-prompt into analysis channels and loop.
Fix: treat <|end|> as a stop token, skip all harmony special tokens
from the output stream.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- Add ChatModel enum dispatching between Qwen3 and GptOss based on
config.is_moe(), following the TP engine pattern.
- Add --tp N flag for tensor-parallel inference (required for 39GB
gpt-oss-20b which doesn't fit on a single 32GB GPU).
- Add gpt-oss harmony chat template with channel/message format.
- Replace hardcoded is_stop_token() with tokenizer.is_eos() for
multi-model EOS support.
- Restore gpt-oss hardcoded prompt template in server api.rs, lost
during the Jinja template refactor.
- Fix GEMV race condition: the K-split kernel zeroed the FP32
accumulator inside the kernel (block k=0) while other blocks
atomicAdd'd concurrently. Pre-zero with cudaMemsetAsync instead.
- Update benchmark docs with post-fix results.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Load the model's chat_template.jinja (or tokenizer_config.json
chat_template field) at startup and render it with minijinja instead of
hardcoded per-model prompt builders.
Custom Jinja functions: strftime_now (date formatting), raise_exception
(template validation errors). Falls back to Qwen3 ChatML template if
no Jinja template is found.
Removes the hardcoded build_prompt_gpt_oss() — the model's own template
now drives prompt formatting, matching llama.cpp's behavior exactly.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Replace the per-token CPU-routed MoE forward with an all-GPU path:
1. moe_topk_softmax: GPU top-k + softmax (was CPU sort + softmax)
2. moe_replicate: broadcast input to all local experts
3. cublasGemmStridedBatchedEx: batched expert matmul (was per-expert cuBLAS)
4. moe_weighted_sum: FP32-accumulated weighted sum on GPU (was GPU→CPU→F32→BF16→GPU)
Expert weights stored as contiguous 3D tensors for strided batched GEMM.
Zero CPU↔GPU transfers per MoE layer (was ~40 per token per layer).
Also: configurable geglu_alpha, LayerNorm bias auto-detect, unused-weight
diagnostic at load time.
GSM8K 30-problem: 11/30 → 23/30 (76.7%) vs llama.cpp 30/30 (100%).
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Parse the model's `pre_tokenizer` section to extract its Split regex
instead of hardcoding the GPT-2 pattern. The gpt-oss-20b model uses
a GPT-4-style regex that produces different word boundaries, causing a
1-token prompt mismatch vs HuggingFace (136 → 135 tokens, now aligned).
Unsupported lookahead `(?!\S)` is stripped — it only affects trailing
whitespace edge cases. Falls back to the old GPT-2/Qwen heuristic if
the model regex fails to compile or is absent.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Add --prompt to override the fixed prompt, and two teacher-forced
diagnostics: --forced runs prefill over prompt+oracle ids and reports
per-position top-1 agreement; --forced-decode walks the oracle trajectory
through the decode path with per-position agreement bucketed by position,
to localize long-context decode divergence from the reference.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Route caller-supplied system messages into a harmony 'developer'
instructions block (<|start|>developer<|message|># Instructions...),
keeping the fixed system/meta block for the channel declaration. Harmony
puts user instructions on the developer role, not system.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Use tokenizer.is_eos() (multi-eos) for generation termination in both PP
and TP engines instead of a single eos id, so gpt-oss stops on <|return|>
/<|call|>/<|endoftext|>.
In the TP engine, optionally apply a repetition penalty on the greedy
decode path (XSERV_REP_PENALTY>1 over XSERV_REP_WINDOW recent tokens; off
by default) to break greedy repetition loops.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Make argmax skip NaN logits (warn once) instead of panicking the engine
thread on a single NaN. Add sample_greedy_penalized() applying an
HF-style repetition penalty over recent ids on the greedy path, to break
greedy repetition loops on reasoning models without touching the forward
pass.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Track an ordered eos_token_ids list (not just one id) and add is_eos().
gpt-oss/harmony ends the assistant turn on <|return|> and also treats
<|call|> and <|endoftext|> as terminators (generation_config.json
eos_token_id = [200002, 199999, 200012]); single-eos families are
unchanged.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Add flash_attention_sinks_bf16 prefill kernel that folds the per-head
attention sink into the softmax denominator (exactly as the decode sink
kernel) and supports an optional sliding-window mask matching HF gpt-oss.
Wire it through xserv-kernels (flash_attention_sinks) and use it in
GptOss prefill, replacing the post-hoc sink approximation for an exact
match against the reference math.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
The gpt-oss model requires a specific prompt format with <|start|>,
<|message|>, <|end|>, <|channel|> tokens. Without this, the model
produces degenerate output. Auto-detected via config.model_type.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- tp_engine.rs: TpModel enum dispatches between Qwen3 and GptOss based on
config.is_moe(). Server auto-detects model type on startup.
- tools/run_gpt_oss_bench.sh: one-click benchmark comparing xserv (TP=2)
vs llama.cpp (BF16 GGUF) on GSM8K quality + speed
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
The custom launch_gemv_bf16 kernel produces NaN when output dimension N
is small (e.g. N=32 for the MoE router). Fall back to cuBLAS GemmEx for
N < 256. Also removes the padding workaround in gpt_oss MoE forward.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Adds --pp N for layer-wise pipeline parallelism via NCCL P2P send/recv.
Each stage holds layers [s*L, (s+1)*L), stage 0 owns embedding, last
stage owns norm/lm_head. v1 serial (one request at a time) — correctness
+ per-GPU memory savings (~1/N). Refactors model to unfused QKV/gate_up
projections and removes unused kernels (argmax, reshape_and_cache).
When all active sequences use temperature=0, run argmax on the GPU and
only D2H the token ids (~B×4 bytes) instead of the full [B, vocab_size]
BF16 logits (~1.2 MB at B=4, Qwen3 vocab=152K). Mixed-sampling batches
fall back to the existing CPU path.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Weight fusion at load time:
- q/k/v_proj → single qkv_proj_wt, GEMV once then narrow() to split
- gate/up_proj → single gate_up_proj_wt, same pattern
- Reduces GEMV calls from 7 to 4 per layer (36 layers → 108 fewer launches)
Batched decode refactor (forward_decode_paged):
- Per-head RMSNorm: reshape to [B*H, D], one rmsnorm call
- Batched RoPE: one call for all sequences
- Batched KV scatter: one reshape_and_cache kernel per layer
- Eliminates the per-sequence loop entirely
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Replace the Rust cudaMemcpy loop in append_tokens() with the new
reshape_and_cache kernel. Add append_tokens_batched() for the decode
path using the batched variant.
Fix: use data_ptr() instead of storage().gpu_buffer().as_ptr() so that
tensor offset is respected. The old code silently read from storage base
(element 0) instead of the tensor's logical start, which produced wrong
results when K/V tensors were narrow() views into a fused QKV buffer.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Three new CUDA kernels and one rewrite:
- reshape_and_cache: scatter K/V into paged pool in a single kernel per
layer, replacing the Rust-side per-token per-head cudaMemcpy loop.
Includes both single-sequence (prefill) and batched (decode) variants.
- argmax: GPU-side BF16 argmax with warp-shuffle reduction. Greedy
decode now only D2H-transfers B×4 bytes (token ids) instead of the
full [B, vocab] logits tensor.
- GEMV rewrite: fused zero-init inside the K-split kernel eliminates
the cudaMemsetAsync call, reducing launches from 3 to 2 per GEMV.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Exposes the caching allocator's trim() through a public free function.
Called after weight fusion during model loading to free temporary buffers
that would otherwise sit in the pool and cause OOM.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
narrow(dim, start, len) creates a zero-copy slice along any dimension.
is_contiguous() now ignores stride mismatches on dimensions of size 1,
since those dimensions are never stepped. This avoids unnecessary GPU
strided copies when slicing fused projection outputs at batch=1.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>