xserv

Author	SHA1	Message	Date
Gahow Wang	24c49c31c2	tools: warm-server FP8 vs BF16 benchmark + results doc fp8_compare.py launches one xserv-server per model (same GPUs / TP for a fair comparison), gates readiness on a real generation (not /health), and streams GSM8K through /v1/chat/completions measuring per-request TTFT (time to first token) and TPOT (mean inter-token latency) plus exact-match accuracy. docs/benchmarks/fp8-quantization.md records the quantization scheme, the perf-bug fix, and the dash5 results. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-12 00:58:46 +08:00
Gahow Wang	5a16225c1f	quantization: cache cuBLASLt FP8 plan per shape — fix per-expert heuristic churn batched_gemm_fp8 rebuilt the cuBLASLt matmul descriptor, four matrix layouts, a preference, and a 4-byte scale alloc, AND ran the algo heuristic search — once per expert, per GEMM, per layer, on every forward (~1500 heuristic searches per decoded token). FP8 decode ran at 27.0 ms/tok vs BF16 18.8 ms, i.e. slower than the path it was meant to accelerate. Cache the full plan (descriptor + layouts + heuristically-chosen algo) in a thread-local map keyed by (M, N, K) so the heuristic runs once per shape and is reused across experts and forwards; allocate the 1.0 scale buffer once; pass each expert's weight scale via the cuBLASLt B-scale device pointer instead of folding it into alpha (identical FP32-epilogue precision, and no host readback of b_scales). The per-expert loop now issues only cublasLtMatmul. Measured on dash5 (gpt-oss-20b, TP=2, 5090): FP8 decode TPOT 27.0 -> 17.9 ms, now faster than BF16 (18.8 ms); GSM8K-200 accuracy unchanged (FP8 93.0% vs BF16 90.5%, within noise). Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-12 00:58:46 +08:00
Gahow Wang	3a530956af	tools: add FP8 vs BF16 benchmark and GSM8K eval harness bench_fp8.py — head-to-head comparison of FP8 and BF16 models on GSM8K / AIME2025 accuracy plus TTFT/TPOT performance measurement. eval_gsm8k_batch.sh — lightweight GSM8K accuracy evaluator that pipes one problem per xserv-chat invocation and scores with \boxed{} / last-number extraction. Benchmark results (gpt-oss-20b, 50-problem GSM8K): FP8 W8A8 TP1 : 94.0% (single RTX 5090, 25 GB) FP8 W8A16 TP1: 94.0% BF16 TP2 : 94.0% (requires 2× RTX 5090) Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-06-08 15:43:04 +08:00
Gahow Wang	76487b7963	quantization: W8A8 FP8 compute via cuBLASLt tensor cores Replace the W8A16 dequant→BF16-GEMM path with native FP8×FP8→BF16 GEMM using cuBLASLt on Blackwell (RTX 5090). Both weights (static FP8 E4M3) and activations (dynamically quantized per-row) are processed directly on FP8 tensor cores. Key implementation details: - cuBLASLt on Blackwell requires transA=T for FP8, so expert weights are transposed during model loading ([E,K,N] → [E,N,K]) - Per-row activation quantization kernel (absmax/448 → FP8 E4M3) - Post-GEMM row-wise rescaling recovers per-token precision - Per-expert loop (not batched) due to cuBLASLt FP8 scale constraints The same FP8 quantized model files work — no re-quantization needed. Activation quantization happens dynamically at inference time. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-06-07 20:38:26 +08:00
Gahow Wang	9f1fbbb98b	quantization: add FP8 E4M3 W8A16 for gpt-oss MoE expert weights Store expert gate_up_proj and down_proj weights in FP8 E4M3 (1 byte/elem) with per-expert FP32 scale factors. At inference, a fused CUDA kernel dequantizes to BF16 before the existing cuBLAS batched GEMM. Results on gpt-oss-20b (50-problem GSM8K subset): - FP8 TP=1: 47/50 = 94.0% (single RTX 5090, ~25 GB VRAM) - BF16 TP=2: 47/50 = 94.0% (requires 2× RTX 5090, ~39 GB total) No measurable accuracy degradation. Model size: 41.8 GB → 22.7 GB (−46%). New files: - tools/quantize_fp8.py: offline BF16→FP8 conversion script - csrc/quantization/dequant_fp8.cu: per-expert-scale dequant kernel - crates/xserv-kernels/src/quantization.rs: Rust FFI wrapper - tools/eval_gsm8k_batch.sh: GSM8K accuracy evaluation harness Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-06-07 19:33:07 +08:00
Gahow Wang	e1eb77baa4	xserv-chat: fix unclosed <think> on early termination and flush analysis tokens Close the <think> block when EOS or max_tokens interrupts an analysis channel, and flush stdout after each analysis token so --think streams smoothly instead of dumping in buffer-sized chunks. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-06-03 01:01:41 +08:00
Gahow Wang	34e9bee375	xserv-chat: render gpt-oss analysis as a Qwen3-style <think> block The gpt-oss harmony `analysis` channel is the model's reasoning, analogous to Qwen3's <think>. With --think, wrap it in a `<think>\n…\n</think>\n\n` block (gray when color is on, like Qwen3) and then print the final-channel answer; without --think, suppress the analysis and show only the answer. Replaces the previous color-gated behavior (analysis shown gray only on a TTY, with no markers). Analysis is still excluded from the multi-turn history (answer_ids), so re-prefill drops CoT as before. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-02 21:37:28 +08:00
Gahow Wang	3b9e32e6cd	kernels: fix uninitialized shared-memory read in M=1 decode GEMV gemv_bf16_fused_kernel returned early on out-of-range columns (`if (col >= N) return;`) BEFORE the cooperative load of x into shared memory and the `__syncthreads()`. When N is not a multiple of GEMV_TILE_N (128), the last column-block's out-of-range threads exited without loading their slice of x_shared, so the in-range threads then read uninitialized shared memory in the dot product — and __syncthreads with exited threads is itself UB. Result: intermittent huge/garbage outputs (~1e33) that, after the next RMSNorm, collapsed the whole forward pass to a degenerate logit distribution (argmax → vocab_size-1, or NaN), derailing generation. This hit every M=1 BF16 GEMV (n>=256) with n % 128 != 0 — i.e. gpt-oss decode o_proj and the MoE projections (n=2880). q/k/v (4096) and lm_head (201088) are 128-aligned and were unaffected, as is Qwen3 (hidden 4096), which is why this manifested as intermittent gpt-oss-only decode failures. Fix: all threads participate in the shared-memory load and reach the barrier; the col>=N check moves to AFTER __syncthreads. Verified on dash5 (TP=2): a prompt that reliably produced garbage ~70% of runs now yields clean logits 16/16; the multi-turn Chinese chat that collapsed mid-conversation completes coherently with 0 NaN warnings. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-02 17:18:37 +08:00
Gahow Wang	5157b2cd30	kernels: fix NaN in flash-attention sinks on fully-masked window tiles flash_attention_sinks_bf16_kernel skipped only fully-future KV tiles (the causal `continue`); an early tile entirely outside the sliding window was still processed with every key masked to -inf, so row_max == -INFINITY. Folding that into the online softmax computed expf(-inf - (-inf)) = NaN, and the next valid tile's 0*NaN correction then poisoned the whole row. Result: the gpt-oss prefill produced all-NaN logits for any query whose sliding window (128) starts past the first KV tile — i.e. at longer context — collapsing generation into a single repeated token (argmax of all-NaN logits: vocab_size-1 in bench, token 0 "!" in the chat). This was the residual multi-turn/long-context collapse. Fix: skip a fully-masked tile (row_max == -INFINITY) — it contributes nothing to the softmax. The decode kernel already guards local_max == -INFINITY, so it was unaffected. Verified on dash5 (TP=2): the prefill that previously went all-NaN now produces clean logits; multi-turn gpt-oss chat (e.g. a haiku after a long prior answer) completes correctly instead of emitting "!!!!". Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-02 16:09:43 +08:00
Gahow Wang	ea5d8ba7ea	xserv-chat: render gpt-oss multi-turn as canonical harmony (drop CoT) Re-render the whole conversation each turn and re-prefill into a freshly cleared slot, with past assistant messages rendered as completed `final` channels (analysis dropped, terminated with <\|end\|> not the <\|return\|> stop token) — matching the model's training format and the server's builder. The previous incremental cache kept every turn's chain-of-thought plus <\|return\|> in context, which is out of distribution for harmony multi-turn. The generator now returns the final-channel text to feed back as history. Qwen3 keeps the incremental cache (its ChatML format is unaffected); reset_slot factors out the free+re-register. NOTE: this corrects the multi-turn format but does NOT cure the long-context collapse on some inputs. That is a forward-pass numerical bug (NaN / degenerate logits) reproducible in clean bench-gpt-oss independent of the chat layer — the collapse token is vocab_size-1 (201087), the all-NaN argmax tie-break. Tracked separately. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-02 15:39:24 +08:00
Gahow Wang	c0a81c84e7	server: canonical harmony system message in gpt-oss fallback build_prompt_gpt_oss (the hardcoded builder used when a gpt-oss model ships no Jinja chat template) emitted the same malformed "You are a helpful assistant." system message that destabilized the CLI. Replace it with the canonical harmony system message (identity / knowledge cutoff / current date via strftime_now / Reasoning: low / channels), matching the chat_template.jinja build_system_message macro and the xserv-chat fix. Dormant for gpt-oss-20b (it ships a Jinja template, so the template path runs), but correct now for any gpt-oss model without one. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-02 15:19:50 +08:00
Gahow Wang	3d6bb1918e	xserv-chat: fix gpt-oss harmony chat (canonical system prompt + routing) The hand-rolled gpt-oss system message dropped the canonical harmony structure (identity / knowledge cutoff / current date / Reasoning level), putting the model out of distribution — greedy decoding then flipped into garbage or analysis loops on ~half of single-turn requests. Emit the canonical system message (matching the model's chat_template.jinja build_system_message macro) with Reasoning: low, plus a today_ymd() date helper. Also: - Default the repetition penalty to off (1.0). Penalizing the harmony control tokens (<\|channel\|>/<\|message\|>/<\|start\|>) that must repeat to open the final channel made gpt-oss stop right after analysis, emitting nothing. - Suppress the literal "assistant" role header emitted between the analysis and final channels (only print in the final channel, moe only; non-moe Qwen3 stays in Normal and prints as before). Verified on dash5 (TP=2): single-turn "capital of France" is now stable across runs with a clean final answer; Qwen3 chat unaffected. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-02 15:19:07 +08:00
Gahow Wang	f2e60218b4	xserv-chat: harmony channel routing + repetition penalty for gpt-oss - Let the model generate its own <\|channel\|> routing instead of forcing <\|channel\|>final<\|message\|> — matches the GGUF chat template behavior. - State machine tracks harmony channels: analysis channel rendered gray, final channel printed normally, <\|end\|> stops on final channel only. - Add repetition penalty (default 1.3 for MoE, 1.0 for Qwen) with 512 token window to prevent greedy decode loops. Configurable via XSERV_REP_PENALTY and XSERV_REP_WINDOW env vars. - Fix Length path: use <\|end\|> instead of <\|im_end\|> for gpt-oss to avoid poisoning the KV cache with garbage tokens on truncation. - Server api.rs: append <\|channel\|>final<\|message\|> to the hardcoded gpt-oss prompt (server expects to post-process the JSON output). - Add startswith filter to minijinja for harmony template compatibility. Known issue: gpt-oss multi-turn NaN when total context exceeds ~256 tokens — likely a flash_attention_sinks kernel bug with sliding window layers at large kv_len + small q_len. Single-turn and short multi-turn conversations work correctly. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-06-02 12:40:17 +08:00
Gahow Wang	3ee8df2c0f	xserv-chat: filter harmony control tokens + stop at <\|end\|> for gpt-oss The gpt-oss harmony format generates internal control tokens (<\|channel\|>, <\|start\|>, <\|end\|>, <\|message\|>) that should not appear in the user-facing output. Additionally, <\|end\|> marks the end of a response segment but was not in the model's EOS list, causing the model to self-prompt into analysis channels and loop. Fix: treat <\|end\|> as a stop token, skip all harmony special tokens from the output stream. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-06-02 12:05:07 +08:00
Gahow Wang	ae08896f46	xserv-chat: support gpt-oss-20b with TP; fix GEMV precision bug - Add ChatModel enum dispatching between Qwen3 and GptOss based on config.is_moe(), following the TP engine pattern. - Add --tp N flag for tensor-parallel inference (required for 39GB gpt-oss-20b which doesn't fit on a single 32GB GPU). - Add gpt-oss harmony chat template with channel/message format. - Replace hardcoded is_stop_token() with tokenizer.is_eos() for multi-model EOS support. - Restore gpt-oss hardcoded prompt template in server api.rs, lost during the Jinja template refactor. - Fix GEMV race condition: the K-split kernel zeroed the FP32 accumulator inside the kernel (block k=0) while other blocks atomicAdd'd concurrently. Pre-zero with cudaMemsetAsync instead. - Update benchmark docs with post-fix results. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-06-02 00:58:10 +08:00
Gahow Wang	1d0ec32e8d	server: Jinja chat template rendering via minijinja Load the model's chat_template.jinja (or tokenizer_config.json chat_template field) at startup and render it with minijinja instead of hardcoded per-model prompt builders. Custom Jinja functions: strftime_now (date formatting), raise_exception (template validation errors). Falls back to Qwen3 ChatML template if no Jinja template is found. Removes the hardcoded build_prompt_gpt_oss() — the model's own template now drives prompt formatting, matching llama.cpp's behavior exactly. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-05-31 13:23:18 +08:00
Gahow Wang	4368e79695	model: fused GPU MoE kernel — eliminate CPU roundtrip Replace the per-token CPU-routed MoE forward with an all-GPU path: 1. moe_topk_softmax: GPU top-k + softmax (was CPU sort + softmax) 2. moe_replicate: broadcast input to all local experts 3. cublasGemmStridedBatchedEx: batched expert matmul (was per-expert cuBLAS) 4. moe_weighted_sum: FP32-accumulated weighted sum on GPU (was GPU→CPU→F32→BF16→GPU) Expert weights stored as contiguous 3D tensors for strided batched GEMM. Zero CPU↔GPU transfers per MoE layer (was ~40 per token per layer). Also: configurable geglu_alpha, LayerNorm bias auto-detect, unused-weight diagnostic at load time. GSM8K 30-problem: 11/30 → 23/30 (76.7%) vs llama.cpp 30/30 (100%). Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-05-31 13:22:59 +08:00
Gahow Wang	377a04b81f	tokenizer: read pre-tokenizer regex from tokenizer.json Parse the model's `pre_tokenizer` section to extract its Split regex instead of hardcoding the GPT-2 pattern. The gpt-oss-20b model uses a GPT-4-style regex that produces different word boundaries, causing a 1-token prompt mismatch vs HuggingFace (136 → 135 tokens, now aligned). Unsupported lookahead `(?!\S)` is stripped — it only affects trailing whitespace edge cases. Falls back to the old GPT-2/Qwen heuristic if the model regex fails to compile or is absent. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-05-31 13:22:35 +08:00
Gahow Wang	241009a96c	docs: remove TO-BE-FIXED.md — all listed issues have been resolved Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-05-31 01:06:26 +08:00
Gahow Wang	0c6135aea3	bench-gpt-oss: teacher-forced diagnostics + --prompt flag Add --prompt to override the fixed prompt, and two teacher-forced diagnostics: --forced runs prefill over prompt+oracle ids and reports per-position top-1 agreement; --forced-decode walks the oracle trajectory through the decode path with per-position agreement bucketed by position, to localize long-context decode divergence from the reference. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-05-31 00:56:46 +08:00
Gahow Wang	ffd90ce7fb	server: emit harmony developer instructions block for gpt-oss Route caller-supplied system messages into a harmony 'developer' instructions block (<\|start\|>developer<\|message\|># Instructions...), keeping the fixed system/meta block for the channel declaration. Harmony puts user instructions on the developer role, not system. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-05-31 00:56:39 +08:00
Gahow Wang	3c9d5e260e	server: harmony termination via is_eos + TP repetition penalty Use tokenizer.is_eos() (multi-eos) for generation termination in both PP and TP engines instead of a single eos id, so gpt-oss stops on <\|return\|> /<\|call\|>/<\|endoftext\|>. In the TP engine, optionally apply a repetition penalty on the greedy decode path (XSERV_REP_PENALTY>1 over XSERV_REP_WINDOW recent tokens; off by default) to break greedy repetition loops. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-05-31 00:56:33 +08:00
Gahow Wang	99b212e6c1	model/sampling: NaN-safe argmax + optional repetition penalty Make argmax skip NaN logits (warn once) instead of panicking the engine thread on a single NaN. Add sample_greedy_penalized() applying an HF-style repetition penalty over recent ids on the greedy path, to break greedy repetition loops on reasoning models without touching the forward pass. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-05-31 00:56:27 +08:00
Gahow Wang	e11f15e009	tokenizer: support multiple end-of-generation tokens Track an ordered eos_token_ids list (not just one id) and add is_eos(). gpt-oss/harmony ends the assistant turn on <\|return\|> and also treats <\|call\|> and <\|endoftext\|> as terminators (generation_config.json eos_token_id = [200002, 199999, 200012]); single-eos families are unchanged. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-05-31 00:56:21 +08:00
Gahow Wang	9c98c169ff	kernels: flash attention with gpt-oss sinks + sliding window Add flash_attention_sinks_bf16 prefill kernel that folds the per-head attention sink into the softmax denominator (exactly as the decode sink kernel) and supports an optional sliding-window mask matching HF gpt-oss. Wire it through xserv-kernels (flash_attention_sinks) and use it in GptOss prefill, replacing the post-hoc sink approximation for an exact match against the reference math. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-05-31 00:56:10 +08:00
Gahow Wang	5cb3cf28f9	server: add gpt-oss chat template for proper prompt formatting The gpt-oss model requires a specific prompt format with <\|start\|>, <\|message\|>, <\|end\|>, <\|channel\|> tokens. Without this, the model produces degenerate output. Auto-detected via config.model_type. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-05-30 15:43:29 +08:00
Gahow Wang	15c51f143e	server: support GptOss in TP engine + benchmark script - tp_engine.rs: TpModel enum dispatches between Qwen3 and GptOss based on config.is_moe(). Server auto-detects model type on startup. - tools/run_gpt_oss_bench.sh: one-click benchmark comparing xserv (TP=2) vs llama.cpp (BF16 GGUF) on GSM8K quality + speed Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-05-30 15:39:44 +08:00
Gahow Wang	d29c39d74e	fix: GEMV NaN bug — skip custom kernel for small N (<256) The custom launch_gemv_bf16 kernel produces NaN when output dimension N is small (e.g. N=32 for the MoE router). Fall back to cuBLAS GemmEx for N < 256. Also removes the padding workaround in gpt_oss MoE forward. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-05-30 15:20:04 +08:00
Gahow Wang	9ad91a4a92	phase19: MoE support — gpt-oss-20b end-to-end inference with TP=2 Add Mixture-of-Experts support for the gpt-oss-20b model (20.9B params, 32 experts × top-4 routing). Key additions: - ModelConfig: MoE fields (num_local_experts, layer_types, sliding_window, attention_bias, explicit head_dim, rope_scaling, swiglu_limit) - YaRN RoPE: RopeCache::new_yarn() with correct frequency interpolation and attention_scaling = 0.1*ln(factor)+1 - Custom GLU kernel: gpt_oss_glu_bf16 (clamped sigmoid gate activation) - Paged attention with sinks + sliding window kernel variant - GptOss model struct with expert-parallel TP (split 32 experts across ranks) - bench-gpt-oss binary for TP inference benchmarking Verified on dash5 with 2x RTX 5090: 63.6 tok/s decode, ~160ms TTFT. Model generates topically-coherent output (needs chat template for quality). Known issues: - Custom GEMV kernel produces NaN with small N (workaround: pad to M=2) - Prefill doesn't use attention sinks (uses standard flash attention) - Output quality requires chat template formatting Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-05-30 15:18:01 +08:00
Gahow Wang	46bfb59f30	Merge branch 'phase18-pipeline-parallelism': pipeline-parallel inference Adds --pp N for layer-wise pipeline parallelism via NCCL P2P send/recv. Each stage holds layers [sL, (s+1)L), stage 0 owns embedding, last stage owns norm/lm_head. v1 serial (one request at a time) — correctness + per-GPU memory savings (~1/N). Refactors model to unfused QKV/gate_up projections and removes unused kernels (argmax, reshape_and_cache).	2026-05-30 13:13:05 +08:00
Gahow Wang	9a01c60100	server: GPU argmax fast path for greedy decode When all active sequences use temperature=0, run argmax on the GPU and only D2H the token ids (~B×4 bytes) instead of the full [B, vocab_size] BF16 logits (~1.2 MB at B=4, Qwen3 vocab=152K). Mixed-sampling batches fall back to the existing CPU path. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-05-30 12:50:47 +08:00
Gahow Wang	c679f618fd	model: fuse QKV/gate_up projections, batched decode ops Weight fusion at load time: - q/k/v_proj → single qkv_proj_wt, GEMV once then narrow() to split - gate/up_proj → single gate_up_proj_wt, same pattern - Reduces GEMV calls from 7 to 4 per layer (36 layers → 108 fewer launches) Batched decode refactor (forward_decode_paged): - Per-head RMSNorm: reshape to [B*H, D], one rmsnorm call - Batched RoPE: one call for all sequences - Batched KV scatter: one reshape_and_cache kernel per layer - Eliminates the per-sequence loop entirely Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-05-30 12:50:39 +08:00
Gahow Wang	cc4bd4cfe5	paged-kv: kernel-based scatter + fix data_ptr offset bug Replace the Rust cudaMemcpy loop in append_tokens() with the new reshape_and_cache kernel. Add append_tokens_batched() for the decode path using the batched variant. Fix: use data_ptr() instead of storage().gpu_buffer().as_ptr() so that tensor offset is respected. The old code silently read from storage base (element 0) instead of the tensor's logical start, which produced wrong results when K/V tensors were narrow() views into a fused QKV buffer. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-05-30 12:50:28 +08:00
Gahow Wang	13ae3de69e	kernels: reshape_and_cache, GPU argmax, single-launch GEMV Three new CUDA kernels and one rewrite: - reshape_and_cache: scatter K/V into paged pool in a single kernel per layer, replacing the Rust-side per-token per-head cudaMemcpy loop. Includes both single-sequence (prefill) and batched (decode) variants. - argmax: GPU-side BF16 argmax with warp-shuffle reduction. Greedy decode now only D2H-transfers B×4 bytes (token ids) instead of the full [B, vocab] logits tensor. - GEMV rewrite: fused zero-init inside the K-split kernel eliminates the cudaMemsetAsync call, reducing launches from 3 to 2 per GEMV. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-05-30 12:50:17 +08:00
Gahow Wang	6ce21345be	cuda: add cached_trim() to release pooled GPU buffers Exposes the caching allocator's trim() through a public free function. Called after weight fusion during model loading to free temporary buffers that would otherwise sit in the pool and cause OOM. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-05-30 12:50:04 +08:00
Gahow Wang	1ab6ca9c09	tensor: add narrow() view and relax is_contiguous for size-1 dims narrow(dim, start, len) creates a zero-copy slice along any dimension. is_contiguous() now ignores stride mismatches on dimensions of size 1, since those dimensions are never stepped. This avoids unnecessary GPU strided copies when slicing fused projection outputs at batch=1. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-05-30 12:49:57 +08:00
Gahow Wang	11e0154e4d	docs: Phase 18 pipeline parallelism — design + benchmark results docs/18-pipeline-parallelism.md: PP design (layer split, NCCL P2P, per-stage KV, engine/threading model). docs/benchmarks/pp-sweep.md: measured on dash5 (8x RTX 5090, Qwen3-8B BF16) — single-stream latency + per-GPU VRAM (~1/N), byte-exact correctness (single x2 vs pp4 x2 control), and the full AIME-30 + GSM8K-30 quality matrix (xserv & llama.cpp PP=1/2/4): GSM8K 29/30 in every cell, TPOT flat across PP. README: multi-card (TP/PP) section + roadmap to Phase 18. gitignore: /.claude/ runtime state. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-05-29 18:57:09 +08:00
Gahow Wang	d5dcf1a5ab	bench: PP harness (xserv --pp vs llama.cpp -sm layer) runner/servers: add --pp for both engines (xserv --pp N; llama.cpp -sm layer over N GPUs). New drivers: pp_final.sh (sequential latency + per-GPU VRAM + byte-exact correctness), pp_diag.sh (single x2 vs pp4 x2 determinism control), pp_quality_full.sh / pp_llama_47.sh (AIME+GSM8K matrix, xserv on 0-3 \|\| llama on 4-7), summarize_pp/summarize_fullq, pp_time.py latency probe. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-05-29 18:45:59 +08:00
Gahow Wang	824cc58daa	server: pipeline-parallel HTTP engine (--pp N) pp_engine::run_pp: stage-0 coordinator (scheduler/tokenizer/sampling + stop logic) on the calling thread, worker stage threads for 1..P. Each step the coordinator embeds + runs its layers, then the hidden state is handed stage->stage over NCCL P2P; the last stage samples and returns the token to stage 0 over an in-process channel. v1 is serial (one request, one token/step) — correctness first; throughput via microbatch overlap is future work. main: wire --pp N (mutually exclusive with --tp). Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-05-29 18:45:52 +08:00
Gahow Wang	da3aaa134a	model: pipeline-parallel Qwen3 (from_weights_pp + stage forward) Layer-wise split: each stage loads only its contiguous layer range [sL, (s+1)L); stage 0 keeps embed_tokens, the last stage keeps norm/lm_head (others get a 1x1 placeholder). Heads are NOT split (PP is orthogonal to TP). Adds embed/head and forward_layers_prefill/ forward_layers_decode that take and return the [tokens, hidden] hidden state; per-stage PagedKVCache is indexed by local layer id. sampling: derive Clone on SamplingParams (carried in the PP command enum). Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-05-29 18:45:47 +08:00
Gahow Wang	859c0cc0b6	distributed: NCCL P2P primitives (PpContext + send/recv) Add ncclSend/ncclRecv FFI and a PpContext that initializes a NCCL communicator across P pipeline stages and hands the hidden state to neighbour stages on the null stream. Mirrors TpContext; the collective differs (point-to-point hand-off vs in-layer AllReduce). tests/sendrecv.rs: 2-GPU stage0->stage1 send/recv smoke test. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-05-29 18:45:42 +08:00
Gahow Wang	c2362df1f1	fix(xserv-chat): UTF-8/CJK-aware line input Cooked-mode read_line() left line editing to the terminal, so Backspace on a multi-byte 汉字/かな/한글 deleted a byte (or behaved inconsistently across TTYs). Replace with a raw-mode reader (libc termios): Backspace pops a whole char, multi-byte input is reassembled from its continuation bytes, and a full-line redraw renders double-width glyphs correctly. Non-TTY input falls back to a plain read; raw mode is restored after each line. libc is already a locked transitive dep, so this builds offline. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-05-29 11:36:54 +08:00
Gahow Wang	7b8b520cda	docs: TP=1/2/4 xserv vs llama.cpp benchmark results AIME 2025 + GSM8K at TP=1/2/4. Quality on par across engines/TP. Opposite perf scaling: xserv TPOT improves with TP (21->17->15ms) while llama.cpp row-split regresses over PCIe (10->19->20ms), crossing over so xserv is faster at TP=4. Includes the clean same-path bench-tp scaling (58/76/86 tok/s). Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-05-29 11:10:52 +08:00
Gahow Wang	a4a171d425	bench: TP sweep harness (xserv --tp, llama row-split, concurrent groups) runner/servers gain --tp (xserv --tp N; llama.cpp --split-mode row) and --llama-devices so llama can run on a disjoint GPU group. run_tp_parallel.sh runs xserv (GPU 0..N-1) and llama.cpp (GPU 4..4+N-1) concurrently per TP, matching the box's 0-3 / 4-7 PHB groups. summarize_tp.py tabulates the sweep. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-05-29 11:10:43 +08:00
Gahow Wang	95eb61d639	server: tensor-parallel HTTP engine (--tp N) tp_engine: rank-0 coordinator owns the scheduler and broadcasts per-token commands (Register/Prefill/Decode/Free) to worker rank threads; the sampled token always comes from rank 0, so it's correct for greedy and stochastic sampling. Serial single-request path (sufficient for the quality benchmark). --tp N selects it; TP=1 keeps the existing single-GPU Engine unchanged. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-05-29 11:10:33 +08:00
Gahow Wang	f17011129e	model: tensor-parallel Qwen3 (sharded weights + AllReduce) from_weights_tp shards each rank's weights (column-split q/k/v/gate/up, row-split o/down; replicate norms/embed/lm_head) and the paged forward uses local head counts + AllReduces after o_proj and down_proj. PagedKVCache::new_tp sizes the pool for the rank's local KV heads (KV is sharded too). TP=1 is the identity path. New bench-tp binary runs E2E multi-GPU generation per TP degree. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-05-29 11:10:24 +08:00
Gahow Wang	453520d622	distributed: NCCL tensor-parallel primitives (TpContext + AllReduce) New xserv-distributed crate: hand-written NCCL FFI, TpContext (one rank per thread, bound to one GPU), and in-place BF16 AllReduce on the null stream so it orders naturally with the model's kernels. 2-GPU AllReduce test included. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-05-29 11:10:14 +08:00
Gahow Wang	76fffb3b68	docs: Phase 17 tensor parallelism design Megatron-style TP for Qwen3 on the 8x5090 (no-NVLink, PCIe) box: column/row split per layer, 2 AllReduces/layer, multi-thread one-rank-per-GPU model, NCCL, sharded weights, and the incremental implementation + verification plan. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-05-29 11:10:03 +08:00
Gahow Wang	14a44b503e	docs: add Chinese README (overview + usage) Project intro, architecture, build, basic usage (HTTP server / CLI / bench), and the llama.cpp comparison workflow. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-05-28 21:38:20 +08:00
Gahow Wang	80157e614a	docs: update llama.cpp comparison with 8192 results (OOM fixed) Re-ran the full comparison at --max-seq-len 8192 now that xserv handles it: - OOM finding resolved — pool sized to available VRAM + vLLM-style host swap; 8192 runs with 0 swap events (swap is the overload safety net). - Quality at parity with equal context: AIME 20.0% vs 20.0%, GSM8K 98% vs 96%. - Speed unchanged relative to llama.cpp (~0.42-0.60x); TPOT is bandwidth-bound. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-05-28 21:32:14 +08:00

1 2

85 Commits