xserv

Author	SHA1	Message	Date
Gahow Wang	013465fc06	docs: Phase 21 — decode CUDA graph + GPU argmax results dash5, gpt-oss-20b FP8, warm-server vs llama.cpp MXFP4 (6 reps): TP=2 TPOT 5.76-5.89 vs 7.42-8.45 ms (xserv 1.26-1.47x), TTFT 2.4x ahead short/medium; TP=1 5.78-5.95 vs 2.80-3.22 ms (gap 2.5x -> 2.0x, TTFT now ahead short/medium). GSM8K-50 through the graph path: 94%. Lesson recorded: graphs bought ~0.6 ms (launches were already hidden by async execution), the GPU argmax ~1 ms — measure, don't guess. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>	2026-06-12 20:12:37 +08:00
Gahow Wang	2a92f268a9	docs: fill the Phase 19 gap, refresh README/roadmap to actual state - docs/19-gpt-oss-moe.md: the numbered series jumped 18->20; write up gpt-oss arch deltas, harmony pitfalls, and the two CUDA debugging postmortems (fully-masked-tile NaN in flash-attention sinks; pre-__syncthreads early return reading uninitialized smem in the decode GEMV) — the highest-value learning content of that phase. - README: models/perf/capabilities were frozen at the Qwen3-only era; now lists gpt-oss MoE, TP/PP, FP8/MXFP4, sparse MoE, and the llama.cpp standing. - Roadmap: record where reality diverged from the plan at Phase 18+, add milestone entries and the ranked next-phase candidates (21 CUDA-graph MoE decode, 22 non-expert quant, 23 sparse prefill). - sparse-moe benchmark doc: post-review-fix numbers. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>	2026-06-12 17:02:59 +08:00
Gahow Wang	fb20178992	moe: sparse top-k decode — compute only routed experts (1.8x, beats llama TP=2) Dense MoE replicated x across all 16 local experts and ran the full batched GEMM, reading every expert's weights per token; the weighted sum then discarded 12 of 16 results. Decode is memory-bound, so this was ~8x wasted expert bytes — the entire decode gap vs llama.cpp. New fused expert-indexed GEMVs (csrc/moe/moe_sparse.cu) read topk_ids on-device (no host sync) and early-return block-uniformly for experts other ranks own. FP8 runs W8A16 (activations stay BF16 — tensor cores are irrelevant at M=1, and activation quantization error disappears); MXFP4 runs W4A16. Per-expert bias + scale fused into the GEMV epilogue; slot-indexed weighted sum skips (never multiplies) unwritten non-local slots. Dense path retained for num_tokens > 8 (prefill) and via XSERV_DENSE_MOE=1 for A/B. dash5 (RTX 5090), gpt-oss-20b FP8, TP=2: decode TPOT 13.9 -> 7.6 ms. Warm-server vs llama.cpp MXFP4 TP=2: TPOT 7.19-7.32 vs 7.54-8.42 ms — first config where xserv wins decode outright. GSM8K-100: 96% (dense FP8: 91%). llama TP=1 (2.9 ms) remains ahead: next levers are decode CUDA graphs, non-expert quantization, sparse prefill (docs/20). Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>	2026-06-12 16:29:10 +08:00
Gahow Wang	d33220498a	quantization: MXFP4 W4A16 expert weights (memory-optimization foundation) Weight-only 4-bit for the gpt-oss MoE experts: weights stored MXFP4 (E2M1 + per-32-element UE8M0 block scale, tools/quantize_mxfp4.py), a fused kernel reads the 4-bit weights and dequantizes on-chip to BF16. Decode (M=1) uses a fused dequant-GEMV (batched_gemv_mxfp4) with shared-memory activation tiling; prefill (M>1) dequantizes to BF16 then reuses the BF16 batched GEMM. MXFP4 is detected by the scale tensor's rank (3-D [E,N,K/32]) vs FP8's 1-D [E]. Verified on dash5 (gpt-oss-20b, TP=2, 5090): byte-identical greedy tokens to FP8/BF16, smallest footprint (13 GB vs 22 GB FP8, 39 GB BF16) — fits one 32 GB 5090 with room for KV cache. NOT a decode speedup: the hand-written W4A16 GEMV (no tensor cores) is less efficient than cuBLASLt's FP8 tensor-core GEMM, so even at half the weight bytes decode is 17.0 ms vs FP8 13.5 ms (faster than BF16 18.8 ms); prefill regresses (350 vs 134 ms, dequant fallback). Committed as a correct memory-optimization foundation. Beating FP8 on speed needs FP4 tensor cores (W4A4, cuBLASLt block-scaled MXFP4) or a Marlin-class kernel; see docs/benchmarks/mxfp4-and-llama-decode.md. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-12 15:01:42 +08:00
Gahow Wang	e631a71b68	quantization: single strided-batched FP8 MoE GEMM — cut per-token launches ~768→48 The plan-cache fix removed the per-expert heuristic churn but still issued one cublasLtMatmul per expert: ~768 tiny launches per decoded token (16 local experts × 2 GEMMs × 24 layers), which capped the FP8 decode win at ~1.05× over BF16. Collapse each MoE GEMM into ONE strided-batched cuBLASLt FP8 matmul (BATCH_COUNT + strided-batch offsets on all four layouts) → ~48 launches/token. A single strided call can't carry a per-batch scalar B-scale, so the per-expert weight scale moves out of the GEMM epilogue into a fused post-scale kernel (rowwise_scale_moe_bf16) that applies a_scale[token]·b_scale[expert] in one pass. This is precision-equivalent: BF16's relative error is scale-invariant, so scaling the unscaled GEMM output afterward loses nothing vs scaling in-epilogue. Measured on dash5 (gpt-oss-20b, TP=2, 5090), warm-server GSM8K: decode TPOT 17.45 → 13.08 ms (FP8 now 1.41× vs BF16 18.39 ms), throughput 57.3 → 76.4 tok/s, accuracy unchanged (FP8 91.0% vs BF16 90.0%). Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-12 01:23:29 +08:00
Gahow Wang	24c49c31c2	tools: warm-server FP8 vs BF16 benchmark + results doc fp8_compare.py launches one xserv-server per model (same GPUs / TP for a fair comparison), gates readiness on a real generation (not /health), and streams GSM8K through /v1/chat/completions measuring per-request TTFT (time to first token) and TPOT (mean inter-token latency) plus exact-match accuracy. docs/benchmarks/fp8-quantization.md records the quantization scheme, the perf-bug fix, and the dash5 results. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-12 00:58:46 +08:00
Gahow Wang	ae08896f46	xserv-chat: support gpt-oss-20b with TP; fix GEMV precision bug - Add ChatModel enum dispatching between Qwen3 and GptOss based on config.is_moe(), following the TP engine pattern. - Add --tp N flag for tensor-parallel inference (required for 39GB gpt-oss-20b which doesn't fit on a single 32GB GPU). - Add gpt-oss harmony chat template with channel/message format. - Replace hardcoded is_stop_token() with tokenizer.is_eos() for multi-model EOS support. - Restore gpt-oss hardcoded prompt template in server api.rs, lost during the Jinja template refactor. - Fix GEMV race condition: the K-split kernel zeroed the FP32 accumulator inside the kernel (block k=0) while other blocks atomicAdd'd concurrently. Pre-zero with cudaMemsetAsync instead. - Update benchmark docs with post-fix results. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-06-02 00:58:10 +08:00
Gahow Wang	11e0154e4d	docs: Phase 18 pipeline parallelism — design + benchmark results docs/18-pipeline-parallelism.md: PP design (layer split, NCCL P2P, per-stage KV, engine/threading model). docs/benchmarks/pp-sweep.md: measured on dash5 (8x RTX 5090, Qwen3-8B BF16) — single-stream latency + per-GPU VRAM (~1/N), byte-exact correctness (single x2 vs pp4 x2 control), and the full AIME-30 + GSM8K-30 quality matrix (xserv & llama.cpp PP=1/2/4): GSM8K 29/30 in every cell, TPOT flat across PP. README: multi-card (TP/PP) section + roadmap to Phase 18. gitignore: /.claude/ runtime state. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-05-29 18:57:09 +08:00
Gahow Wang	7b8b520cda	docs: TP=1/2/4 xserv vs llama.cpp benchmark results AIME 2025 + GSM8K at TP=1/2/4. Quality on par across engines/TP. Opposite perf scaling: xserv TPOT improves with TP (21->17->15ms) while llama.cpp row-split regresses over PCIe (10->19->20ms), crossing over so xserv is faster at TP=4. Includes the clean same-path bench-tp scaling (58/76/86 tok/s). Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-05-29 11:10:52 +08:00
Gahow Wang	80157e614a	docs: update llama.cpp comparison with 8192 results (OOM fixed) Re-ran the full comparison at --max-seq-len 8192 now that xserv handles it: - OOM finding resolved — pool sized to available VRAM + vLLM-style host swap; 8192 runs with 0 swap events (swap is the overload safety net). - Quality at parity with equal context: AIME 20.0% vs 20.0%, GSM8K 98% vs 96%. - Speed unchanged relative to llama.cpp (~0.42-0.60x); TPOT is bandwidth-bound. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-05-28 21:32:14 +08:00
Gahow Wang	3f1c3d429a	docs: llama.cpp vs xserv benchmark results + summary Record what the new baseline adds (llama.cpp pinned b9371, same BF16 weights, AIME 2025 + GSM8K) and the measured results: performance (xserv ~0.45-0.61x llama.cpp throughput) and quality parity (GSM8K 94% vs 96%, AIME 23.3% vs 20% after the context fix), plus the findings the bench surfaced. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-05-28 15:06:21 +08:00
Gahow Wang	a67e724119	docs: Phase 15 design doc + benchmark report Design document (docs/15-performance.md): - Roofline analysis: 112 tok/s theoretical at 1.79 TB/s - Bottleneck quantification: cuBLAS M=1 GEMV at 8% bandwidth → 77% of step time - Six optimizations with rationale, implementation details, and expected impact - Ablation table with per-optimization delta measurements - Remaining 55% roofline gap breakdown with next-step priorities Benchmark report (docs/benchmarks/phase15-performance.md): - Full ablation: 12.9 → 50.3 tok/s across 6 optimizations - Per-prompt detail (8 prompts, 46-51 tok/s range) - Concurrent throughput analysis (batch=4 vs serial) - Phase-over-phase tracking from Phase 8 to Phase 15 (2.5 → 50.3 tok/s) - Correctness verification (9/10 top-1 match, 52/52 API pass) Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-05-23 00:39:27 +08:00
Gahow Wang	6cc1c9332d	docs: Phase 14 design doc + benchmark, fix Phase 11/12 honesty Phase 14 (Flash Attention): - Design doc: FA2 algorithm, SM120 hardware constraints (FA4 incompatible), kernel config (BR=BC=64, 32KB smem), GQA mapping, causal tile-skip, known limitations and optimization roadmap - Benchmark doc: correctness (9/10 top-1 match, identical to pre-FA baseline), performance tracking (6.9→10.3→12.9 tok/s across phases), memory savings analysis, remaining bottleneck breakdown Phase 11 doc: title corrected from "Paged Attention" to "GPU-Resident KV Cache" with explicit note that paged allocation was not implemented. Phase 12 doc: "当前状态" ("current status") updated from "未实现" ("not implemented") to reflect actual state: iteration-level scheduling implemented + verified (6.0x concurrent speedup), batched GPU forward explicitly marked as not yet implemented. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-05-22 18:51:29 +08:00
Gahow Wang	268e40d764	phase 10: add Qwen3-8B benchmark + performance fix Benchmark infrastructure: - bench-qwen3 binary: 50 prompts × 20 tokens with KV cache - bench_compare_qwen3.py: comparison against HF transformers (BF16) Performance fix: - Precompute transposed weights at model load time (eliminated per-token weight transpose CPU round-trip: was 252 transposes × 32MB each = 8GB/token) - Result: from "infinite" (>10 min/token) to 144ms/token Results (50 prompts): - Prefill top-1: 42/50 (84%), top-5: 50/50 (100%) vs HF transformers - Greedy sequence: 0/50 exact match (BF16 precision drift over 36 layers) - Performance: TTFT=138ms, TBT=144ms, 6.9 tok/s (HF: 21ms, 45.6 tok/s) - All outputs are coherent English/Chinese Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-05-22 10:25:33 +08:00
Gahow Wang	64084d3489	phase 9: KV cache + autoregressive generation - KVCache: per-layer, per-head storage with append + reconstruct - forward_with_cache: prefill (full prompt) + decode (single token) modes - Fixed data layout bug: per-head vectors avoid cross-head interleaving - CLI updated to use KV cache by default - bench-gpt2 supports --no-cache flag for comparison Benchmark results (50 prompts × 20 tokens): - KV cache vs no-cache: 50/50 bit-identical (cache is correct) - 18x speedup: TTFT 400→24ms, TBT 407→22ms, throughput 2.5→44 tok/s - vs HF transformers: 40/50 match (10 are FP divergence, avg logit gap 0.20) Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-05-21 23:39:41 +08:00
Gahow Wang	cb12250ef0	phase 8: add benchmark framework + baseline results - bench-gpt2 binary: runs 50 prompts, measures TTFT/TBT per prompt, outputs JSON - bench_compare.py: compares xserv vs transformers token-by-token + timing - Baseline results: 50/50 correctness, 400ms TTFT / 407ms TBT (100x slower than PyTorch) - Bottlenecks documented: no KV cache, CPU round-trips, cuBLAS handle churn Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-05-21 23:29:41 +08:00

16 Commits