Commit Graph

105 Commits

Author SHA1 Message Date
5f060902f6 cuda: fix remaining int32-address and nondeterministic-reduction bugs
Three CUDA bugs from the review after 5b350ee / cfbd64d that were missed
by those commits:

- flash_attention.cu decode_attention_bf16_kernel used atomicAdd to
  merge per-warp partials into smem_O — same nondeterminism pattern
  that 5b350ee already fixed in paged_attention.cu and gemv.cu. This
  kernel is on the legacy forward_gpu_cache path plus the speculative
  bench baseline, so verify/decode parity depended on it. Replace with
  smem_O_warp[32][HEAD_DIM_MAX] partials reduced in fixed warp-id order.
- causal_mask.cu computed the flat address as
  `batch_idx * rows * cols + row * cols + col` in int; batch=128 heads=28
  seq=32768 already overflows int32. Promote the index to long long.
- quantization/dequant_fp8.cu had `int total = num_experts * rows * cols`
  and `int expert_stride = rows * cols`; 32 experts × 8k × 8k overflows.
  Same fix pattern as the MoE dense kernels in cfbd64d — 64-bit total /
  idx / expert_stride, and grid computed in long long.
2026-07-01 15:13:07 +08:00
a67753f516 softmax: cap block size at 512 threads
launch_softmax_{f32,bf16} clamped block to 1024 threads when cols was
larger. Halving the ceiling to 512 keeps two blocks per SM resident on
the large vocab kernels that dominate speculative verify workloads
without changing rows/block indexing, and never exceeds cols.
2026-07-01 14:16:32 +08:00
f5ec10c2c3 xserv-cli: expose sampling params and greedy repetition penalty
Interactive REPL used to always call sample_greedy_last on both the
paged and legacy KV paths, so temperature/top-k/top-p and the repetition
penalty added in the sampling module were unreachable from the CLI.

- flag() helper parses --max-tokens / --temperature / --top-k / --top-p
  / --rep-penalty / --rep-window (defaults preserve prior behavior:
  temperature 0, top-p 1, penalty 1, window 512).
- pick_next() dispatches to sample_greedy_penalized only when
  temperature==0 and rep_penalty>1, otherwise to sample().
- Both Qwen3/GPT-2 paths and the GptOss paged path share the same
  sampler and both feed the rolling history window used for the penalty.
- Prompt input now unescapes literal "\n" so multi-turn prompts can be
  typed on one line.
2026-07-01 14:16:31 +08:00
ce7229f4fe speculative: Qwen3 draft-model v0 with paged verify parity
Phase 22 lands a correctness-only speculative decoding loop for Qwen3
target + Qwen3 small draft (batch=1, greedy, gamma=4). Phase 23 turns
verify logits into the authoritative acceptance signal so mirror-decode
per accepted token is no longer needed.

- paged_kv_cache: truncate_sequence(slot, new_len) shrinks a registered
  sequence, freeing whole physical blocks no longer reachable and
  leaving the slot registered. Covered by a CUDA-gated unit test.
- qwen3: forward_verify_paged_decode_attention writes the draft window
  into the target cache, runs the same paged decode attention kernel per
  draft token, and uses matmul_rows_gemv so linear layers follow the
  single-token decode BF16 rounding path.
- bench-speculative: new bench binary drives the state machine with
  --gamma / --gen-tokens / --prompts / --use-verify-logits /
  --verify-path flash|paged-decode / --dump-verify-mismatches, and
  compares baseline vs spec token sequences plus TPOT / tok/s / speedup.
- docs/22 records the decode-authoritative v0 result and dash5 numbers
  (matched=true, speedup_e2e ~0.29x, verify_decode_mismatches>0 under
  --use-verify-logits).
- docs/23 records the paged-decode verify path (matched=true,
  verify_decode_mismatches=0, 50x64 speedup_e2e ~0.44x) and the
  next-step performance TODO.
2026-07-01 14:16:30 +08:00
5b350ee5f0 cuda: deterministic BF16 gemv + paged attention reductions
BF16 greedy decode was sensitive to inter-block scheduling when logits
were close, which broke speculative-decoding verify-vs-decode parity.

- gemv.cu: write per-K-block partials, then reduce in fixed block order
  in a second kernel instead of atomicAdd across K-blocks. Scratch
  buffer size is now n * ceil(k / GEMV_TILE_K); gemv_scratch_elems()
  exposes this to callers, and decode_graph.rs sizes fp32_hidden/q/kv/
  intermediate/vocab from it.
- paged_attention.cu: replace atomicAdd merge of warp outputs with
  per-warp shared partials reduced in warp-id order for both the base
  and sinks kernels.
2026-07-01 14:16:28 +08:00
0314b4f3ac server: non-blocking stream send — stop one slow client stalling the batch
All three engines emitted tokens with blocking_send on the single
decode/coordinator OS thread. A streaming client that drains slower than
generation fills its 64-slot channel, and blocking_send then blocks the whole
thread: under continuous batching one slow consumer stalls every other running
sequence (and in the serial TP/PP path it blocks admission of the next request
too). The whole point of continuous batching is defeated.

Fix: switch to try_send. engine.rs sets a client_stalled flag on Full/Closed,
reaped by is_finished() next iteration; tp_engine/pp_engine emit_text returns
bool and the decode loop breaks with finish_reason "error". When the
sequence/request is dropped its sender drops too, closing the channel so the
client receive loop ends rather than hanging. A slow client now only loses its
own sequence, never the batch.

Verified on dash5: gpt-oss FP8 TP=1 streaming via tp_engine still streams
correctly (SSE chunks, coherent content, no hang); bench-gpt-oss TP=2 5.9ms
TPOT unchanged.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-07-01 12:37:32 +08:00
cfbd64d206 cuda: fix int32 overflow in MoE dense kernels; surface launch errors in release
The dense MoE kernels (moe_replicate, moe_bias_add_3d, moe_weighted_sum)
computed total / expert_stride / element indices in int32. gpt-oss prefill
runs the whole prompt through the dense path unchunked (SPARSE_MAX_TOKENS=8),
so local_experts*num_tokens*hidden (and batch*num_tokens*dim, and
local_id*expert_stride) overflow int32 at ~3.6k-23k prefill tokens
(TP-dependent) — well inside the supported context window. The launch then
fails silently because CUDA_CHECK_LAST_ERROR was ((void)0) under NDEBUG, so
the bias / weighted-sum simply never runs and the forward pass is corrupted
with no error reported.

Fix: switch the three kernels and their launchers to long long, mirroring the
(long long) indexing already used in moe_sparse.cu. Also make
CUDA_CHECK_LAST_ERROR always-on — cudaGetLastError does not sync, so the
per-launch host cost is negligible, and a silent launch failure is exactly
the class of bug this one was.

Verified on dash5 (RTX 5090): a direct kernel test at 2.21B elements (>2^31)
for both moe_replicate and moe_bias_add_3d produces correct results with no
launch error; bench-gpt-oss TP=2 holds at 5.9ms TPOT, output unchanged.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-07-01 12:37:21 +08:00
531cd3fe08 style: format Rust workspace 2026-06-18 18:11:58 +08:00
013465fc06 docs: Phase 21 — decode CUDA graph + GPU argmax results
dash5, gpt-oss-20b FP8, warm-server vs llama.cpp MXFP4 (6 reps):
TP=2 TPOT 5.76-5.89 vs 7.42-8.45 ms (xserv 1.26-1.47x), TTFT 2.4x
ahead short/medium; TP=1 5.78-5.95 vs 2.80-3.22 ms (gap 2.5x -> 2.0x,
TTFT now ahead short/medium). GSM8K-50 through the graph path: 94%.
Lesson recorded: graphs bought ~0.6 ms (launches were already hidden
by async execution), the GPU argmax ~1 ms — measure, don't guess.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
2026-06-12 20:12:37 +08:00
8414f8d1e6 sampling: GPU argmax fast path for greedy decode
sample() at temperature 0 copied the full [seq, 201088] BF16 logits
to the host and scanned them every token (~1 ms/token). Use the
Phase 15 argmax kernel (block reduction + 4-byte D2H) when logits are
BF16 on GPU; bench-gpt-oss's greedy sampler likewise. Exact-tie
logits may break differently than the host scan — greedy trajectories
can legitimately diverge at a tie token (GSM8K unchanged).

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
2026-06-12 20:12:37 +08:00
34224c7c93 gpt-oss: replay the whole batch=1 decode step as one CUDA graph
Split forward_decode_paged into host bookkeeping (decode_prepare +
ids/pos upload + advance_seq_len) and a pure-GPU decode_core. The
paged-KV and sparse-MoE designs already read every per-step variable
(block table, context lens, expert ids) from stable-address device
buffers, so decode_core captures as-is.

GptOssDecodeGraph captures lazily on the second decode step (the
first eager step warms cuBLAS) after a "retained warmup": the step
runs once with the allocator quarantine on, stocking the pool with a
dedicated block for every allocation so the capture itself never
pool-misses (a cudaMalloc while capturing is illegal — and the
capture's own quarantine is what would otherwise starve the pool).
NCCL all-reduces capture cleanly; TP=2 replays in lockstep.

Wired into tp_engine, bench-gpt-oss, and xserv-chat via
GraphedGptOssDecoder (batch>1 falls back to eager;
XSERV_DECODE_GRAPH=0 disables). Greedy tokens identical to eager.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
2026-06-12 20:12:37 +08:00
4088f49b7d cuda: infrastructure for whole-step CUDA graph capture
- Thread-local launch stream (xserv_cuda::stream): every kernel
  wrapper, cublasSetStream, and NCCL collective now launches on
  current_stream_raw() — the legacy null stream by default (behavior
  unchanged), or the capture stream installed via push_stream during
  graph capture. Capture is impossible on the legacy stream.
- Allocator retain mode: blocks freed inside a retain window are
  quarantined (RetainedBlocks) instead of pooled, so an instantiated
  graph keeps exclusive ownership of every intermediate buffer it
  references across replays.
- Capture mode GLOBAL -> THREAD_LOCAL: concurrent TP rank threads
  must not poison each other's captures with their own cudaMallocs.
- embedding_device_ids / rope_inplace_device_pos: variants reading
  token ids / positions from persistent device buffers, replacing the
  per-call host upload that a captured region cannot contain.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
2026-06-12 20:12:37 +08:00
2a92f268a9 docs: fill the Phase 19 gap, refresh README/roadmap to actual state
- docs/19-gpt-oss-moe.md: the numbered series jumped 18->20; write up
  gpt-oss arch deltas, harmony pitfalls, and the two CUDA debugging
  postmortems (fully-masked-tile NaN in flash-attention sinks;
  pre-__syncthreads early return reading uninitialized smem in the
  decode GEMV) — the highest-value learning content of that phase.
- README: models/perf/capabilities were frozen at the Qwen3-only era;
  now lists gpt-oss MoE, TP/PP, FP8/MXFP4, sparse MoE, and the
  llama.cpp standing.
- Roadmap: record where reality diverged from the plan at Phase 18+,
  add milestone entries and the ranked next-phase candidates
  (21 CUDA-graph MoE decode, 22 non-expert quant, 23 sparse prefill).
- sparse-moe benchmark doc: post-review-fix numbers.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
2026-06-12 17:02:59 +08:00
5343391dbd review cleanups: pp+gpt-oss guard, sparse GEMV asserts, warnings
- --pp with gpt-oss now fails with a clear message instead of a
  cryptic missing-weight panic inside the Qwen3-only PP engine.
- Sparse GEMV wrappers assert K%16==0 (FP8) / K%32==0 (MXFP4) — the
  uint4-vectorized kernels would silently drop a tail otherwise.
- Document the topk_ids buffer holding i32 under an F32 dtype label
  (DType has no I32).
- Drop unused imports/locals and the cuBLASLt scale-mode constants
  orphaned by the strided-batched FP8 rework (e631a71).

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
2026-06-12 17:02:59 +08:00
1897b2e17a gpt-oss: drop debug syncs from forward; GPU broadcast bias-add
Decode carried three leftover cudaDeviceSynchronize (prefill one) from
NaN debugging — the Qwen3 path has none and the logits D2H in sample()
already orders against the null stream.

add_bias for rows>1 round-tripped the bias through the CPU (D2H + host
tile loop + H2D) on every call — 96 times per prefill across q/k/v/o.
Replace with a bias_add_2d broadcast kernel.

dash5, FP8 TP=2, warm-server: TTFT 35/49/94 -> 29/42/79 ms
(short/medium/long), TPOT 7.19-7.32 -> 6.99-7.21 ms. Greedy tokens
unchanged; GSM8K-50 94%.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
2026-06-12 17:02:59 +08:00
63f5599717 server: serve gpt-oss on a single GPU via the TP engine (world=1)
gpt-oss has no single-GPU engine path, so --tp 1 fell through to the
Qwen3-only engine and every request 503'd. Route gpt_oss to run_tp
even at tp=1: NCCL world-1 init works and all_reduce already no-ops
(bench-gpt-oss --tp 1 exercised this path). Quantized gpt-oss (22 GB
FP8 / 13 GB MXFP4) now serves on one 32 GB 5090.

Also fix eval_gsm8k_fast.py --gpu to accept a device list ("2,3"):
it was type=int, so any --tp 2 run pinned CUDA_VISIBLE_DEVICES to one
GPU and rank 1's set_device panicked while rank 0 spun in NCCL init.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
2026-06-12 16:29:10 +08:00
fb20178992 moe: sparse top-k decode — compute only routed experts (1.8x, beats llama TP=2)
Dense MoE replicated x across all 16 local experts and ran the full
batched GEMM, reading every expert's weights per token; the weighted
sum then discarded 12 of 16 results. Decode is memory-bound, so this
was ~8x wasted expert bytes — the entire decode gap vs llama.cpp.

New fused expert-indexed GEMVs (csrc/moe/moe_sparse.cu) read
topk_ids on-device (no host sync) and early-return block-uniformly
for experts other ranks own. FP8 runs W8A16 (activations stay BF16 —
tensor cores are irrelevant at M=1, and activation quantization error
disappears); MXFP4 runs W4A16. Per-expert bias + scale fused into the
GEMV epilogue; slot-indexed weighted sum skips (never multiplies)
unwritten non-local slots. Dense path retained for num_tokens > 8
(prefill) and via XSERV_DENSE_MOE=1 for A/B.

dash5 (RTX 5090), gpt-oss-20b FP8, TP=2: decode TPOT 13.9 -> 7.6 ms.
Warm-server vs llama.cpp MXFP4 TP=2: TPOT 7.19-7.32 vs 7.54-8.42 ms —
first config where xserv wins decode outright. GSM8K-100: 96% (dense
FP8: 91%). llama TP=1 (2.9 ms) remains ahead: next levers are decode
CUDA graphs, non-expert quantization, sparse prefill (docs/20).

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
2026-06-12 16:29:10 +08:00
cf1e9e41db tools: single-stream decode benchmark vs llama.cpp
xserv_vs_llama.py runs each server one at a time on the same GPUs (drains VRAM
between), streams identical prompts through /v1/chat/completions, and reports
median TTFT/TPOT/throughput. Counts llama's reasoning_content as real decode
tokens so the gpt-oss CoT is measured fairly.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-12 15:01:42 +08:00
d33220498a quantization: MXFP4 W4A16 expert weights (memory-optimization foundation)
Weight-only 4-bit for the gpt-oss MoE experts: weights stored MXFP4 (E2M1 +
per-32-element UE8M0 block scale, tools/quantize_mxfp4.py), a fused kernel reads
the 4-bit weights and dequantizes on-chip to BF16. Decode (M=1) uses a fused
dequant-GEMV (batched_gemv_mxfp4) with shared-memory activation tiling; prefill
(M>1) dequantizes to BF16 then reuses the BF16 batched GEMM. MXFP4 is detected
by the scale tensor's rank (3-D [E,N,K/32]) vs FP8's 1-D [E].

Verified on dash5 (gpt-oss-20b, TP=2, 5090): byte-identical greedy tokens to
FP8/BF16, smallest footprint (13 GB vs 22 GB FP8, 39 GB BF16) — fits one 32 GB
5090 with room for KV cache.

NOT a decode speedup: the hand-written W4A16 GEMV (no tensor cores) is less
efficient than cuBLASLt's FP8 tensor-core GEMM, so even at half the weight bytes
decode is 17.0 ms vs FP8 13.5 ms (faster than BF16 18.8 ms); prefill regresses
(350 vs 134 ms, dequant fallback). Committed as a correct memory-optimization
foundation. Beating FP8 on speed needs FP4 tensor cores (W4A4, cuBLASLt
block-scaled MXFP4) or a Marlin-class kernel; see
docs/benchmarks/mxfp4-and-llama-decode.md.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-12 15:01:42 +08:00
e631a71b68 quantization: single strided-batched FP8 MoE GEMM — cut per-token launches ~768→48
The plan-cache fix removed the per-expert heuristic churn but still issued one
cublasLtMatmul per expert: ~768 tiny launches per decoded token (16 local
experts × 2 GEMMs × 24 layers), which capped the FP8 decode win at ~1.05× over
BF16. Collapse each MoE GEMM into ONE strided-batched cuBLASLt FP8 matmul
(BATCH_COUNT + strided-batch offsets on all four layouts) → ~48 launches/token.

A single strided call can't carry a per-batch scalar B-scale, so the per-expert
weight scale moves out of the GEMM epilogue into a fused post-scale kernel
(rowwise_scale_moe_bf16) that applies a_scale[token]·b_scale[expert] in one
pass. This is precision-equivalent: BF16's relative error is scale-invariant, so
scaling the unscaled GEMM output afterward loses nothing vs scaling in-epilogue.

Measured on dash5 (gpt-oss-20b, TP=2, 5090), warm-server GSM8K:
  decode TPOT 17.45 → 13.08 ms (FP8 now 1.41× vs BF16 18.39 ms),
  throughput 57.3 → 76.4 tok/s, accuracy unchanged (FP8 91.0% vs BF16 90.0%).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-12 01:23:29 +08:00
24c49c31c2 tools: warm-server FP8 vs BF16 benchmark + results doc
fp8_compare.py launches one xserv-server per model (same GPUs / TP for a
fair comparison), gates readiness on a real generation (not /health),
and streams GSM8K through /v1/chat/completions measuring per-request
TTFT (time to first token) and TPOT (mean inter-token latency) plus
exact-match accuracy. docs/benchmarks/fp8-quantization.md records the
quantization scheme, the perf-bug fix, and the dash5 results.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-12 00:58:46 +08:00
5a16225c1f quantization: cache cuBLASLt FP8 plan per shape — fix per-expert heuristic churn
batched_gemm_fp8 rebuilt the cuBLASLt matmul descriptor, four matrix
layouts, a preference, and a 4-byte scale alloc, AND ran the algo
heuristic search — once per expert, per GEMM, per layer, on every
forward (~1500 heuristic searches per decoded token). FP8 decode ran at
27.0 ms/tok vs BF16 18.8 ms, i.e. slower than the path it was meant to
accelerate.

Cache the full plan (descriptor + layouts + heuristically-chosen algo)
in a thread-local map keyed by (M, N, K) so the heuristic runs once per
shape and is reused across experts and forwards; allocate the 1.0 scale
buffer once; pass each expert's weight scale via the cuBLASLt B-scale
device pointer instead of folding it into alpha (identical FP32-epilogue
precision, and no host readback of b_scales). The per-expert loop now
issues only cublasLtMatmul.

Measured on dash5 (gpt-oss-20b, TP=2, 5090): FP8 decode TPOT 27.0 -> 17.9
ms, now faster than BF16 (18.8 ms); GSM8K-200 accuracy unchanged
(FP8 93.0% vs BF16 90.5%, within noise).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-12 00:58:46 +08:00
3a530956af tools: add FP8 vs BF16 benchmark and GSM8K eval harness
bench_fp8.py — head-to-head comparison of FP8 and BF16 models on
  GSM8K / AIME2025 accuracy plus TTFT/TPOT performance measurement.

eval_gsm8k_batch.sh — lightweight GSM8K accuracy evaluator that
  pipes one problem per xserv-chat invocation and scores with
  \boxed{} / last-number extraction.

Benchmark results (gpt-oss-20b, 50-problem GSM8K):
  FP8 W8A8 TP1 : 94.0%  (single RTX 5090, 25 GB)
  FP8 W8A16 TP1: 94.0%
  BF16 TP2     : 94.0%  (requires 2× RTX 5090)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-06-08 15:43:04 +08:00
76487b7963 quantization: W8A8 FP8 compute via cuBLASLt tensor cores
Replace the W8A16 dequant→BF16-GEMM path with native FP8×FP8→BF16 GEMM
using cuBLASLt on Blackwell (RTX 5090). Both weights (static FP8 E4M3)
and activations (dynamically quantized per-row) are processed directly
on FP8 tensor cores.

Key implementation details:
- cuBLASLt on Blackwell requires transA=T for FP8, so expert weights
  are transposed during model loading ([E,K,N] → [E,N,K])
- Per-row activation quantization kernel (absmax/448 → FP8 E4M3)
- Post-GEMM row-wise rescaling recovers per-token precision
- Per-expert loop (not batched) due to cuBLASLt FP8 scale constraints

The same FP8 quantized model files work — no re-quantization needed.
Activation quantization happens dynamically at inference time.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-06-07 20:38:26 +08:00
9f1fbbb98b quantization: add FP8 E4M3 W8A16 for gpt-oss MoE expert weights
Store expert gate_up_proj and down_proj weights in FP8 E4M3 (1 byte/elem)
with per-expert FP32 scale factors. At inference, a fused CUDA kernel
dequantizes to BF16 before the existing cuBLAS batched GEMM.

Results on gpt-oss-20b (50-problem GSM8K subset):
  - FP8 TP=1: 47/50 = 94.0% (single RTX 5090, ~25 GB VRAM)
  - BF16 TP=2: 47/50 = 94.0% (requires 2× RTX 5090, ~39 GB total)

No measurable accuracy degradation. Model size: 41.8 GB → 22.7 GB (−46%).

New files:
  - tools/quantize_fp8.py: offline BF16→FP8 conversion script
  - csrc/quantization/dequant_fp8.cu: per-expert-scale dequant kernel
  - crates/xserv-kernels/src/quantization.rs: Rust FFI wrapper
  - tools/eval_gsm8k_batch.sh: GSM8K accuracy evaluation harness

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-06-07 19:33:07 +08:00
e1eb77baa4 xserv-chat: fix unclosed <think> on early termination and flush analysis tokens
Close the <think> block when EOS or max_tokens interrupts an analysis
channel, and flush stdout after each analysis token so --think streams
smoothly instead of dumping in buffer-sized chunks.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-06-03 01:01:41 +08:00
34e9bee375 xserv-chat: render gpt-oss analysis as a Qwen3-style <think> block
The gpt-oss harmony `analysis` channel is the model's reasoning, analogous
to Qwen3's <think>. With --think, wrap it in a `<think>\n…\n</think>\n\n`
block (gray when color is on, like Qwen3) and then print the final-channel
answer; without --think, suppress the analysis and show only the answer.
Replaces the previous color-gated behavior (analysis shown gray only on a
TTY, with no markers). Analysis is still excluded from the multi-turn
history (answer_ids), so re-prefill drops CoT as before.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-02 21:37:28 +08:00
3b9e32e6cd kernels: fix uninitialized shared-memory read in M=1 decode GEMV
gemv_bf16_fused_kernel returned early on out-of-range columns
(`if (col >= N) return;`) BEFORE the cooperative load of x into shared
memory and the `__syncthreads()`. When N is not a multiple of GEMV_TILE_N
(128), the last column-block's out-of-range threads exited without loading
their slice of x_shared, so the in-range threads then read uninitialized
shared memory in the dot product — and __syncthreads with exited threads is
itself UB. Result: intermittent huge/garbage outputs (~1e33) that, after
the next RMSNorm, collapsed the whole forward pass to a degenerate logit
distribution (argmax → vocab_size-1, or NaN), derailing generation.

This hit every M=1 BF16 GEMV (n>=256) with n % 128 != 0 — i.e. gpt-oss
decode o_proj and the MoE projections (n=2880). q/k/v (4096) and lm_head
(201088) are 128-aligned and were unaffected, as is Qwen3 (hidden 4096),
which is why this manifested as intermittent gpt-oss-only decode failures.

Fix: all threads participate in the shared-memory load and reach the
barrier; the col>=N check moves to AFTER __syncthreads.

Verified on dash5 (TP=2): a prompt that reliably produced garbage ~70% of
runs now yields clean logits 16/16; the multi-turn Chinese chat that
collapsed mid-conversation completes coherently with 0 NaN warnings.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-02 17:18:37 +08:00
5157b2cd30 kernels: fix NaN in flash-attention sinks on fully-masked window tiles
flash_attention_sinks_bf16_kernel skipped only fully-future KV tiles (the
causal `continue`); an early tile entirely outside the sliding window was
still processed with every key masked to -inf, so row_max == -INFINITY.
Folding that into the online softmax computed expf(-inf - (-inf)) = NaN,
and the next valid tile's 0*NaN correction then poisoned the whole row.

Result: the gpt-oss prefill produced all-NaN logits for any query whose
sliding window (128) starts past the first KV tile — i.e. at longer
context — collapsing generation into a single repeated token (argmax of
all-NaN logits: vocab_size-1 in bench, token 0 "!" in the chat). This was
the residual multi-turn/long-context collapse.

Fix: skip a fully-masked tile (row_max == -INFINITY) — it contributes
nothing to the softmax. The decode kernel already guards
local_max == -INFINITY, so it was unaffected.

Verified on dash5 (TP=2): the prefill that previously went all-NaN now
produces clean logits; multi-turn gpt-oss chat (e.g. a haiku after a long
prior answer) completes correctly instead of emitting "!!!!".

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-02 16:09:43 +08:00
ea5d8ba7ea xserv-chat: render gpt-oss multi-turn as canonical harmony (drop CoT)
Re-render the whole conversation each turn and re-prefill into a freshly
cleared slot, with past assistant messages rendered as completed `final`
channels (analysis dropped, terminated with <|end|> not the <|return|>
stop token) — matching the model's training format and the server's
builder. The previous incremental cache kept every turn's chain-of-thought
plus <|return|> in context, which is out of distribution for harmony
multi-turn. The generator now returns the final-channel text to feed back
as history. Qwen3 keeps the incremental cache (its ChatML format is
unaffected); reset_slot factors out the free+re-register.

NOTE: this corrects the multi-turn *format* but does NOT cure the
long-context collapse on some inputs. That is a forward-pass numerical bug
(NaN / degenerate logits) reproducible in clean bench-gpt-oss independent
of the chat layer — the collapse token is vocab_size-1 (201087), the
all-NaN argmax tie-break. Tracked separately.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-02 15:39:24 +08:00
c0a81c84e7 server: canonical harmony system message in gpt-oss fallback
build_prompt_gpt_oss (the hardcoded builder used when a gpt-oss model
ships no Jinja chat template) emitted the same malformed "You are a
helpful assistant." system message that destabilized the CLI. Replace it
with the canonical harmony system message (identity / knowledge cutoff /
current date via strftime_now / Reasoning: low / channels), matching the
chat_template.jinja build_system_message macro and the xserv-chat fix.

Dormant for gpt-oss-20b (it ships a Jinja template, so the template path
runs), but correct now for any gpt-oss model without one.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-02 15:19:50 +08:00
3d6bb1918e xserv-chat: fix gpt-oss harmony chat (canonical system prompt + routing)
The hand-rolled gpt-oss system message dropped the canonical harmony
structure (identity / knowledge cutoff / current date / Reasoning level),
putting the model out of distribution — greedy decoding then flipped into
garbage or analysis loops on ~half of single-turn requests. Emit the
canonical system message (matching the model's chat_template.jinja
build_system_message macro) with Reasoning: low, plus a today_ymd() date
helper.

Also:
- Default the repetition penalty to off (1.0). Penalizing the harmony
  control tokens (<|channel|>/<|message|>/<|start|>) that must repeat to
  open the final channel made gpt-oss stop right after analysis, emitting
  nothing.
- Suppress the literal "assistant" role header emitted between the
  analysis and final channels (only print in the final channel, moe only;
  non-moe Qwen3 stays in Normal and prints as before).

Verified on dash5 (TP=2): single-turn "capital of France" is now stable
across runs with a clean final answer; Qwen3 chat unaffected.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-02 15:19:07 +08:00
f2e60218b4 xserv-chat: harmony channel routing + repetition penalty for gpt-oss
- Let the model generate its own <|channel|> routing instead of forcing
  <|channel|>final<|message|> — matches the GGUF chat template behavior.
- State machine tracks harmony channels: analysis channel rendered gray,
  final channel printed normally, <|end|> stops on final channel only.
- Add repetition penalty (default 1.3 for MoE, 1.0 for Qwen) with 512
  token window to prevent greedy decode loops. Configurable via
  XSERV_REP_PENALTY and XSERV_REP_WINDOW env vars.
- Fix Length path: use <|end|> instead of <|im_end|> for gpt-oss to
  avoid poisoning the KV cache with garbage tokens on truncation.
- Server api.rs: append <|channel|>final<|message|> to the hardcoded
  gpt-oss prompt (server expects to post-process the JSON output).
- Add startswith filter to minijinja for harmony template compatibility.

Known issue: gpt-oss multi-turn NaN when total context exceeds ~256
tokens — likely a flash_attention_sinks kernel bug with sliding window
layers at large kv_len + small q_len. Single-turn and short multi-turn
conversations work correctly.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-06-02 12:40:17 +08:00
3ee8df2c0f xserv-chat: filter harmony control tokens + stop at <|end|> for gpt-oss
The gpt-oss harmony format generates internal control tokens
(<|channel|>, <|start|>, <|end|>, <|message|>) that should not appear
in the user-facing output. Additionally, <|end|> marks the end of a
response segment but was not in the model's EOS list, causing the
model to self-prompt into analysis channels and loop.

Fix: treat <|end|> as a stop token, skip all harmony special tokens
from the output stream.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-06-02 12:05:07 +08:00
ae08896f46 xserv-chat: support gpt-oss-20b with TP; fix GEMV precision bug
- Add ChatModel enum dispatching between Qwen3 and GptOss based on
  config.is_moe(), following the TP engine pattern.
- Add --tp N flag for tensor-parallel inference (required for 39GB
  gpt-oss-20b which doesn't fit on a single 32GB GPU).
- Add gpt-oss harmony chat template with channel/message format.
- Replace hardcoded is_stop_token() with tokenizer.is_eos() for
  multi-model EOS support.
- Restore gpt-oss hardcoded prompt template in server api.rs, lost
  during the Jinja template refactor.
- Fix GEMV race condition: the K-split kernel zeroed the FP32
  accumulator inside the kernel (block k=0) while other blocks
  atomicAdd'd concurrently. Pre-zero with cudaMemsetAsync instead.
- Update benchmark docs with post-fix results.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-06-02 00:58:10 +08:00
Gahow Wang
1d0ec32e8d server: Jinja chat template rendering via minijinja
Load the model's chat_template.jinja (or tokenizer_config.json
chat_template field) at startup and render it with minijinja instead of
hardcoded per-model prompt builders.

Custom Jinja functions: strftime_now (date formatting), raise_exception
(template validation errors).  Falls back to Qwen3 ChatML template if
no Jinja template is found.

Removes the hardcoded build_prompt_gpt_oss() — the model's own template
now drives prompt formatting, matching llama.cpp's behavior exactly.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-05-31 13:23:18 +08:00
Gahow Wang
4368e79695 model: fused GPU MoE kernel — eliminate CPU roundtrip
Replace the per-token CPU-routed MoE forward with an all-GPU path:

  1. moe_topk_softmax: GPU top-k + softmax (was CPU sort + softmax)
  2. moe_replicate: broadcast input to all local experts
  3. cublasGemmStridedBatchedEx: batched expert matmul (was per-expert cuBLAS)
  4. moe_weighted_sum: FP32-accumulated weighted sum on GPU (was GPU→CPU→F32→BF16→GPU)

Expert weights stored as contiguous 3D tensors for strided batched GEMM.
Zero CPU↔GPU transfers per MoE layer (was ~40 per token per layer).

Also: configurable geglu_alpha, LayerNorm bias auto-detect, unused-weight
diagnostic at load time.

GSM8K 30-problem: 11/30 → 23/30 (76.7%) vs llama.cpp 30/30 (100%).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-05-31 13:22:59 +08:00
Gahow Wang
377a04b81f tokenizer: read pre-tokenizer regex from tokenizer.json
Parse the model's `pre_tokenizer` section to extract its Split regex
instead of hardcoding the GPT-2 pattern.  The gpt-oss-20b model uses
a GPT-4-style regex that produces different word boundaries, causing a
1-token prompt mismatch vs HuggingFace (136 → 135 tokens, now aligned).

Unsupported lookahead `(?!\S)` is stripped — it only affects trailing
whitespace edge cases.  Falls back to the old GPT-2/Qwen heuristic if
the model regex fails to compile or is absent.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-05-31 13:22:35 +08:00
Gahow Wang
241009a96c docs: remove TO-BE-FIXED.md — all listed issues have been resolved
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-05-31 01:06:26 +08:00
0c6135aea3 bench-gpt-oss: teacher-forced diagnostics + --prompt flag
Add --prompt to override the fixed prompt, and two teacher-forced
diagnostics: --forced runs prefill over prompt+oracle ids and reports
per-position top-1 agreement; --forced-decode walks the oracle trajectory
through the decode path with per-position agreement bucketed by position,
to localize long-context decode divergence from the reference.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-05-31 00:56:46 +08:00
ffd90ce7fb server: emit harmony developer instructions block for gpt-oss
Route caller-supplied system messages into a harmony 'developer'
instructions block (<|start|>developer<|message|># Instructions...),
keeping the fixed system/meta block for the channel declaration. Harmony
puts user instructions on the developer role, not system.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-05-31 00:56:39 +08:00
3c9d5e260e server: harmony termination via is_eos + TP repetition penalty
Use tokenizer.is_eos() (multi-eos) for generation termination in both PP
and TP engines instead of a single eos id, so gpt-oss stops on <|return|>
/<|call|>/<|endoftext|>.

In the TP engine, optionally apply a repetition penalty on the greedy
decode path (XSERV_REP_PENALTY>1 over XSERV_REP_WINDOW recent tokens; off
by default) to break greedy repetition loops.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-05-31 00:56:33 +08:00
99b212e6c1 model/sampling: NaN-safe argmax + optional repetition penalty
Make argmax skip NaN logits (warn once) instead of panicking the engine
thread on a single NaN. Add sample_greedy_penalized() applying an
HF-style repetition penalty over recent ids on the greedy path, to break
greedy repetition loops on reasoning models without touching the forward
pass.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-05-31 00:56:27 +08:00
e11f15e009 tokenizer: support multiple end-of-generation tokens
Track an ordered eos_token_ids list (not just one id) and add is_eos().
gpt-oss/harmony ends the assistant turn on <|return|> and also treats
<|call|> and <|endoftext|> as terminators (generation_config.json
eos_token_id = [200002, 199999, 200012]); single-eos families are
unchanged.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-05-31 00:56:21 +08:00
9c98c169ff kernels: flash attention with gpt-oss sinks + sliding window
Add flash_attention_sinks_bf16 prefill kernel that folds the per-head
attention sink into the softmax denominator (exactly as the decode sink
kernel) and supports an optional sliding-window mask matching HF gpt-oss.

Wire it through xserv-kernels (flash_attention_sinks) and use it in
GptOss prefill, replacing the post-hoc sink approximation for an exact
match against the reference math.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-05-31 00:56:10 +08:00
Gahow Wang
5cb3cf28f9 server: add gpt-oss chat template for proper prompt formatting
The gpt-oss model requires a specific prompt format with <|start|>,
<|message|>, <|end|>, <|channel|> tokens. Without this, the model
produces degenerate output. Auto-detected via config.model_type.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-05-30 15:43:29 +08:00
Gahow Wang
15c51f143e server: support GptOss in TP engine + benchmark script
- tp_engine.rs: TpModel enum dispatches between Qwen3 and GptOss based on
  config.is_moe(). Server auto-detects model type on startup.
- tools/run_gpt_oss_bench.sh: one-click benchmark comparing xserv (TP=2)
  vs llama.cpp (BF16 GGUF) on GSM8K quality + speed

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-05-30 15:39:44 +08:00
Gahow Wang
d29c39d74e fix: GEMV NaN bug — skip custom kernel for small N (<256)
The custom launch_gemv_bf16 kernel produces NaN when output dimension N
is small (e.g. N=32 for the MoE router). Fall back to cuBLAS GemmEx for
N < 256. Also removes the padding workaround in gpt_oss MoE forward.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-05-30 15:20:04 +08:00
Gahow Wang
9ad91a4a92 phase19: MoE support — gpt-oss-20b end-to-end inference with TP=2
Add Mixture-of-Experts support for the gpt-oss-20b model (20.9B params,
32 experts × top-4 routing). Key additions:

- ModelConfig: MoE fields (num_local_experts, layer_types, sliding_window,
  attention_bias, explicit head_dim, rope_scaling, swiglu_limit)
- YaRN RoPE: RopeCache::new_yarn() with correct frequency interpolation
  and attention_scaling = 0.1*ln(factor)+1
- Custom GLU kernel: gpt_oss_glu_bf16 (clamped sigmoid gate activation)
- Paged attention with sinks + sliding window kernel variant
- GptOss model struct with expert-parallel TP (split 32 experts across ranks)
- bench-gpt-oss binary for TP inference benchmarking

Verified on dash5 with 2x RTX 5090: 63.6 tok/s decode, ~160ms TTFT.
Model generates topically-coherent output (needs chat template for quality).

Known issues:
- Custom GEMV kernel produces NaN with small N (workaround: pad to M=2)
- Prefill doesn't use attention sinks (uses standard flash attention)
- Output quality requires chat template formatting

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-05-30 15:18:01 +08:00
Gahow Wang
46bfb59f30 Merge branch 'phase18-pipeline-parallelism': pipeline-parallel inference
Adds --pp N for layer-wise pipeline parallelism via NCCL P2P send/recv.
Each stage holds layers [s*L, (s+1)*L), stage 0 owns embedding, last
stage owns norm/lm_head. v1 serial (one request at a time) — correctness
+ per-GPU memory savings (~1/N). Refactors model to unfused QKV/gate_up
projections and removes unused kernels (argmax, reshape_and_cache).
2026-05-30 13:13:05 +08:00