- --pp with gpt-oss now fails with a clear message instead of a
cryptic missing-weight panic inside the Qwen3-only PP engine.
- Sparse GEMV wrappers assert K%16==0 (FP8) / K%32==0 (MXFP4) — the
uint4-vectorized kernels would silently drop a tail otherwise.
- Document the topk_ids buffer holding i32 under an F32 dtype label
(DType has no I32).
- Drop unused imports/locals and the cuBLASLt scale-mode constants
orphaned by the strided-batched FP8 rework (e631a71).
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
Decode carried three leftover cudaDeviceSynchronize (prefill one) from
NaN debugging — the Qwen3 path has none and the logits D2H in sample()
already orders against the null stream.
add_bias for rows>1 round-tripped the bias through the CPU (D2H + host
tile loop + H2D) on every call — 96 times per prefill across q/k/v/o.
Replace with a bias_add_2d broadcast kernel.
dash5, FP8 TP=2, warm-server: TTFT 35/49/94 -> 29/42/79 ms
(short/medium/long), TPOT 7.19-7.32 -> 6.99-7.21 ms. Greedy tokens
unchanged; GSM8K-50 94%.
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
Dense MoE replicated x across all 16 local experts and ran the full
batched GEMM, reading every expert's weights per token; the weighted
sum then discarded 12 of 16 results. Decode is memory-bound, so this
was ~8x wasted expert bytes — the entire decode gap vs llama.cpp.
New fused expert-indexed GEMVs (csrc/moe/moe_sparse.cu) read
topk_ids on-device (no host sync) and early-return block-uniformly
for experts other ranks own. FP8 runs W8A16 (activations stay BF16 —
tensor cores are irrelevant at M=1, and activation quantization error
disappears); MXFP4 runs W4A16. Per-expert bias + scale fused into the
GEMV epilogue; slot-indexed weighted sum skips (never multiplies)
unwritten non-local slots. Dense path retained for num_tokens > 8
(prefill) and via XSERV_DENSE_MOE=1 for A/B.
dash5 (RTX 5090), gpt-oss-20b FP8, TP=2: decode TPOT 13.9 -> 7.6 ms.
Warm-server vs llama.cpp MXFP4 TP=2: TPOT 7.19-7.32 vs 7.54-8.42 ms —
first config where xserv wins decode outright. GSM8K-100: 96% (dense
FP8: 91%). llama TP=1 (2.9 ms) remains ahead: next levers are decode
CUDA graphs, non-expert quantization, sparse prefill (docs/20).
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
Weight-only 4-bit for the gpt-oss MoE experts: weights stored MXFP4 (E2M1 +
per-32-element UE8M0 block scale, tools/quantize_mxfp4.py), a fused kernel reads
the 4-bit weights and dequantizes on-chip to BF16. Decode (M=1) uses a fused
dequant-GEMV (batched_gemv_mxfp4) with shared-memory activation tiling; prefill
(M>1) dequantizes to BF16 then reuses the BF16 batched GEMM. MXFP4 is detected
by the scale tensor's rank (3-D [E,N,K/32]) vs FP8's 1-D [E].
Verified on dash5 (gpt-oss-20b, TP=2, 5090): byte-identical greedy tokens to
FP8/BF16, smallest footprint (13 GB vs 22 GB FP8, 39 GB BF16) — fits one 32 GB
5090 with room for KV cache.
NOT a decode speedup: the hand-written W4A16 GEMV (no tensor cores) is less
efficient than cuBLASLt's FP8 tensor-core GEMM, so even at half the weight bytes
decode is 17.0 ms vs FP8 13.5 ms (faster than BF16 18.8 ms); prefill regresses
(350 vs 134 ms, dequant fallback). Committed as a correct memory-optimization
foundation. Beating FP8 on speed needs FP4 tensor cores (W4A4, cuBLASLt
block-scaled MXFP4) or a Marlin-class kernel; see
docs/benchmarks/mxfp4-and-llama-decode.md.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
The plan-cache fix removed the per-expert heuristic churn but still issued one
cublasLtMatmul per expert: ~768 tiny launches per decoded token (16 local
experts × 2 GEMMs × 24 layers), which capped the FP8 decode win at ~1.05× over
BF16. Collapse each MoE GEMM into ONE strided-batched cuBLASLt FP8 matmul
(BATCH_COUNT + strided-batch offsets on all four layouts) → ~48 launches/token.
A single strided call can't carry a per-batch scalar B-scale, so the per-expert
weight scale moves out of the GEMM epilogue into a fused post-scale kernel
(rowwise_scale_moe_bf16) that applies a_scale[token]·b_scale[expert] in one
pass. This is precision-equivalent: BF16's relative error is scale-invariant, so
scaling the unscaled GEMM output afterward loses nothing vs scaling in-epilogue.
Measured on dash5 (gpt-oss-20b, TP=2, 5090), warm-server GSM8K:
decode TPOT 17.45 → 13.08 ms (FP8 now 1.41× vs BF16 18.39 ms),
throughput 57.3 → 76.4 tok/s, accuracy unchanged (FP8 91.0% vs BF16 90.0%).
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Replace the W8A16 dequant→BF16-GEMM path with native FP8×FP8→BF16 GEMM
using cuBLASLt on Blackwell (RTX 5090). Both weights (static FP8 E4M3)
and activations (dynamically quantized per-row) are processed directly
on FP8 tensor cores.
Key implementation details:
- cuBLASLt on Blackwell requires transA=T for FP8, so expert weights
are transposed during model loading ([E,K,N] → [E,N,K])
- Per-row activation quantization kernel (absmax/448 → FP8 E4M3)
- Post-GEMM row-wise rescaling recovers per-token precision
- Per-expert loop (not batched) due to cuBLASLt FP8 scale constraints
The same FP8 quantized model files work — no re-quantization needed.
Activation quantization happens dynamically at inference time.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
gemv_bf16_fused_kernel returned early on out-of-range columns
(`if (col >= N) return;`) BEFORE the cooperative load of x into shared
memory and the `__syncthreads()`. When N is not a multiple of GEMV_TILE_N
(128), the last column-block's out-of-range threads exited without loading
their slice of x_shared, so the in-range threads then read uninitialized
shared memory in the dot product — and __syncthreads with exited threads is
itself UB. Result: intermittent huge/garbage outputs (~1e33) that, after
the next RMSNorm, collapsed the whole forward pass to a degenerate logit
distribution (argmax → vocab_size-1, or NaN), derailing generation.
This hit every M=1 BF16 GEMV (n>=256) with n % 128 != 0 — i.e. gpt-oss
decode o_proj and the MoE projections (n=2880). q/k/v (4096) and lm_head
(201088) are 128-aligned and were unaffected, as is Qwen3 (hidden 4096),
which is why this manifested as intermittent gpt-oss-only decode failures.
Fix: all threads participate in the shared-memory load and reach the
barrier; the col>=N check moves to AFTER __syncthreads.
Verified on dash5 (TP=2): a prompt that reliably produced garbage ~70% of
runs now yields clean logits 16/16; the multi-turn Chinese chat that
collapsed mid-conversation completes coherently with 0 NaN warnings.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
flash_attention_sinks_bf16_kernel skipped only fully-future KV tiles (the
causal `continue`); an early tile entirely outside the sliding window was
still processed with every key masked to -inf, so row_max == -INFINITY.
Folding that into the online softmax computed expf(-inf - (-inf)) = NaN,
and the next valid tile's 0*NaN correction then poisoned the whole row.
Result: the gpt-oss prefill produced all-NaN logits for any query whose
sliding window (128) starts past the first KV tile — i.e. at longer
context — collapsing generation into a single repeated token (argmax of
all-NaN logits: vocab_size-1 in bench, token 0 "!" in the chat). This was
the residual multi-turn/long-context collapse.
Fix: skip a fully-masked tile (row_max == -INFINITY) — it contributes
nothing to the softmax. The decode kernel already guards
local_max == -INFINITY, so it was unaffected.
Verified on dash5 (TP=2): the prefill that previously went all-NaN now
produces clean logits; multi-turn gpt-oss chat (e.g. a haiku after a long
prior answer) completes correctly instead of emitting "!!!!".
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
- Add ChatModel enum dispatching between Qwen3 and GptOss based on
config.is_moe(), following the TP engine pattern.
- Add --tp N flag for tensor-parallel inference (required for 39GB
gpt-oss-20b which doesn't fit on a single 32GB GPU).
- Add gpt-oss harmony chat template with channel/message format.
- Replace hardcoded is_stop_token() with tokenizer.is_eos() for
multi-model EOS support.
- Restore gpt-oss hardcoded prompt template in server api.rs, lost
during the Jinja template refactor.
- Fix GEMV race condition: the K-split kernel zeroed the FP32
accumulator inside the kernel (block k=0) while other blocks
atomicAdd'd concurrently. Pre-zero with cudaMemsetAsync instead.
- Update benchmark docs with post-fix results.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Replace the per-token CPU-routed MoE forward with an all-GPU path:
1. moe_topk_softmax: GPU top-k + softmax (was CPU sort + softmax)
2. moe_replicate: broadcast input to all local experts
3. cublasGemmStridedBatchedEx: batched expert matmul (was per-expert cuBLAS)
4. moe_weighted_sum: FP32-accumulated weighted sum on GPU (was GPU→CPU→F32→BF16→GPU)
Expert weights stored as contiguous 3D tensors for strided batched GEMM.
Zero CPU↔GPU transfers per MoE layer (was ~40 per token per layer).
Also: configurable geglu_alpha, LayerNorm bias auto-detect, unused-weight
diagnostic at load time.
GSM8K 30-problem: 11/30 → 23/30 (76.7%) vs llama.cpp 30/30 (100%).
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Add flash_attention_sinks_bf16 prefill kernel that folds the per-head
attention sink into the softmax denominator (exactly as the decode sink
kernel) and supports an optional sliding-window mask matching HF gpt-oss.
Wire it through xserv-kernels (flash_attention_sinks) and use it in
GptOss prefill, replacing the post-hoc sink approximation for an exact
match against the reference math.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Three new CUDA kernels and one rewrite:
- reshape_and_cache: scatter K/V into paged pool in a single kernel per
layer, replacing the Rust-side per-token per-head cudaMemcpy loop.
Includes both single-sequence (prefill) and batched (decode) variants.
- argmax: GPU-side BF16 argmax with warp-shuffle reduction. Greedy
decode now only D2H-transfers B×4 bytes (token ids) instead of the
full [B, vocab] logits tensor.
- GEMV rewrite: fused zero-init inside the K-split kernel eliminates
the cudaMemsetAsync call, reducing launches from 3 to 2 per GEMV.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
CUDA layer for the paged-KV + swap work:
- csrc: new paged_attention.cu plus updates across attention/gemm/norm/
activation/embedding/reduce kernels and common.cuh.
- xserv-kernels: new dispatch module and kernel-binding updates.
- xserv-cuda: cudaMallocHost/FreeHost bindings + PinnedBuffer (host swap
pool backing) and offset-aware D2H/H2D copies used to move KV blocks
between the GPU pool and pinned host memory.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Three performance optimizations targeting decode throughput:
1. Decode Attention Kernel (csrc/attention/flash_attention.cu):
- Specialized kernel for Q_len=1 (decode step)
- 256 threads parallelize across KV sequence dimension
- Online softmax with block-level warp-shuffle reduction
- Replaces FA2 kernel which wasted 63/64 threads for decode
- flash_attention() auto-dispatches when q_len==1
2. Fused SiLU×Mul (csrc/activation/activations.cu):
- Single kernel: out = silu(gate) * up
- Saves 1 HBM read + 1 HBM write per FFN layer (N elements)
- Eliminates intermediate tensor allocation
3. Fused Add+RMSNorm (csrc/normalization/rmsnorm.cu):
- Single kernel: (normed, sum) = (rmsnorm(x+residual), x+residual)
- Saves 1 full HBM round-trip per attention block
- Eliminates separate add + rmsnorm kernel pair
Performance analysis:
- At current short sequences (max 79 tokens), these optimizations provide
marginal benefit because the bottleneck is cuBLAS GEMV overhead:
252 weight matrix reads × ~32MB each = 15.5 GB per decode step.
Theoretical minimum at 1.79 TB/s = 8.7ms, actual ~78ms (9x gap).
- The fused kernels and decode attention will show larger gains at
longer sequences where attention and element-wise ops dominate.
- Next optimization target: CUDA Graphs to eliminate kernel launch
overhead, or custom GEMV kernels to replace cuBLAS for M=1.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Kernel additions:
- add_f32/bf16, mul_f32/bf16 CUDA kernels (element-wise, on GPU)
- Refactored activation.rs with dispatch_unary/dispatch_binary helpers
- Qwen3 and GPT-2 now use GPU add/mul instead of CPU round-trips
GPT-2 add_bias also moved to GPU (broadcast via tile + GPU add)
BF16 precision analysis (docs/benchmarks/phase10-qwen3.md):
- Root cause: separate attention kernels materialize BF16 intermediates
(QK^T→BF16→scale→BF16→mask→BF16→softmax→BF16 vs HF's fused FP32 path)
- HF itself SDPA vs Eager also differs by ~0.125 logit
- xserv vs HF: ~1-2 logit systematic offset, but same top-1 in 84% cases
- Industry standard for BF16: top-5 overlap (we achieve 100%)
- Fix path: Flash Attention (Phase 14) to fuse attention in FP32
Performance: TTFT 138→119ms, TBT 144→137ms (GPU ops faster than CPU)
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- Naive GEMM kernel: one thread per output element (F32 + BF16)
- Tiled GEMM kernel: 32x32 shared memory tiles (F32 + BF16)
- cuBLAS wrapper: cublasGemmEx with row-major trick
- GemmBackend enum for runtime backend selection
- CublasContext RAII handle
- Made error::check public for cross-crate use
- 17 GEMM tests: small/medium/rect sizes, all backends, F32+BF16
- Cross-backend consistency verified (naive vs tiled vs cuBLAS)
- All 44 tests pass across all crates
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- Cargo workspace with xserv-cuda crate
- CUDA FFI bindings (cudart: memory, stream, device, error)
- GpuBuffer RAII wrapper with H2D/D2H/D2D copy
- CudaStream wrapper with RAII Drop
- CachingAllocator with size-bucketed free lists
- PinnedBuffer for page-locked host memory
- Device info query via cudaDeviceGetAttribute
- Vector-add CUDA kernel smoke test
- Integration test suite (11 tests)
- build.rs: cc crate compiles .cu for SM 12.0
- sync-and-build.sh for remote build on dash5
- Roadmap doc (docs/00-roadmap.md) and Phase 0+1 design doc
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>