xserv

Author	SHA1	Message	Date
Gahow Wang	63f5599717	server: serve gpt-oss on a single GPU via the TP engine (world=1) gpt-oss has no single-GPU engine path, so --tp 1 fell through to the Qwen3-only engine and every request 503'd. Route gpt_oss to run_tp even at tp=1: NCCL world-1 init works and all_reduce already no-ops (bench-gpt-oss --tp 1 exercised this path). Quantized gpt-oss (22 GB FP8 / 13 GB MXFP4) now serves on one 32 GB 5090. Also fix eval_gsm8k_fast.py --gpu to accept a device list ("2,3"): it was type=int, so any --tp 2 run pinned CUDA_VISIBLE_DEVICES to one GPU and rank 1's set_device panicked while rank 0 spun in NCCL init. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>	2026-06-12 16:29:10 +08:00
Gahow Wang	cf1e9e41db	tools: single-stream decode benchmark vs llama.cpp xserv_vs_llama.py runs each server one at a time on the same GPUs (drains VRAM between), streams identical prompts through /v1/chat/completions, and reports median TTFT/TPOT/throughput. Counts llama's reasoning_content as real decode tokens so the gpt-oss CoT is measured fairly. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-12 15:01:42 +08:00
Gahow Wang	d33220498a	quantization: MXFP4 W4A16 expert weights (memory-optimization foundation) Weight-only 4-bit for the gpt-oss MoE experts: weights stored MXFP4 (E2M1 + per-32-element UE8M0 block scale, tools/quantize_mxfp4.py), a fused kernel reads the 4-bit weights and dequantizes on-chip to BF16. Decode (M=1) uses a fused dequant-GEMV (batched_gemv_mxfp4) with shared-memory activation tiling; prefill (M>1) dequantizes to BF16 then reuses the BF16 batched GEMM. MXFP4 is detected by the scale tensor's rank (3-D [E,N,K/32]) vs FP8's 1-D [E]. Verified on dash5 (gpt-oss-20b, TP=2, 5090): byte-identical greedy tokens to FP8/BF16, smallest footprint (13 GB vs 22 GB FP8, 39 GB BF16) — fits one 32 GB 5090 with room for KV cache. NOT a decode speedup: the hand-written W4A16 GEMV (no tensor cores) is less efficient than cuBLASLt's FP8 tensor-core GEMM, so even at half the weight bytes decode is 17.0 ms vs FP8 13.5 ms (faster than BF16 18.8 ms); prefill regresses (350 vs 134 ms, dequant fallback). Committed as a correct memory-optimization foundation. Beating FP8 on speed needs FP4 tensor cores (W4A4, cuBLASLt block-scaled MXFP4) or a Marlin-class kernel; see docs/benchmarks/mxfp4-and-llama-decode.md. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-12 15:01:42 +08:00
Gahow Wang	24c49c31c2	tools: warm-server FP8 vs BF16 benchmark + results doc fp8_compare.py launches one xserv-server per model (same GPUs / TP for a fair comparison), gates readiness on a real generation (not /health), and streams GSM8K through /v1/chat/completions measuring per-request TTFT (time to first token) and TPOT (mean inter-token latency) plus exact-match accuracy. docs/benchmarks/fp8-quantization.md records the quantization scheme, the perf-bug fix, and the dash5 results. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-12 00:58:46 +08:00
Gahow Wang	3a530956af	tools: add FP8 vs BF16 benchmark and GSM8K eval harness bench_fp8.py — head-to-head comparison of FP8 and BF16 models on GSM8K / AIME2025 accuracy plus TTFT/TPOT performance measurement. eval_gsm8k_batch.sh — lightweight GSM8K accuracy evaluator that pipes one problem per xserv-chat invocation and scores with \boxed{} / last-number extraction. Benchmark results (gpt-oss-20b, 50-problem GSM8K): FP8 W8A8 TP1 : 94.0% (single RTX 5090, 25 GB) FP8 W8A16 TP1: 94.0% BF16 TP2 : 94.0% (requires 2× RTX 5090) Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-06-08 15:43:04 +08:00
Gahow Wang	9f1fbbb98b	quantization: add FP8 E4M3 W8A16 for gpt-oss MoE expert weights Store expert gate_up_proj and down_proj weights in FP8 E4M3 (1 byte/elem) with per-expert FP32 scale factors. At inference, a fused CUDA kernel dequantizes to BF16 before the existing cuBLAS batched GEMM. Results on gpt-oss-20b (50-problem GSM8K subset): - FP8 TP=1: 47/50 = 94.0% (single RTX 5090, ~25 GB VRAM) - BF16 TP=2: 47/50 = 94.0% (requires 2× RTX 5090, ~39 GB total) No measurable accuracy degradation. Model size: 41.8 GB → 22.7 GB (−46%). New files: - tools/quantize_fp8.py: offline BF16→FP8 conversion script - csrc/quantization/dequant_fp8.cu: per-expert-scale dequant kernel - crates/xserv-kernels/src/quantization.rs: Rust FFI wrapper - tools/eval_gsm8k_batch.sh: GSM8K accuracy evaluation harness Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-06-07 19:33:07 +08:00
Gahow Wang	15c51f143e	server: support GptOss in TP engine + benchmark script - tp_engine.rs: TpModel enum dispatches between Qwen3 and GptOss based on config.is_moe(). Server auto-detects model type on startup. - tools/run_gpt_oss_bench.sh: one-click benchmark comparing xserv (TP=2) vs llama.cpp (BF16 GGUF) on GSM8K quality + speed Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-05-30 15:39:44 +08:00
Gahow Wang	d5dcf1a5ab	bench: PP harness (xserv --pp vs llama.cpp -sm layer) runner/servers: add --pp for both engines (xserv --pp N; llama.cpp -sm layer over N GPUs). New drivers: pp_final.sh (sequential latency + per-GPU VRAM + byte-exact correctness), pp_diag.sh (single x2 vs pp4 x2 determinism control), pp_quality_full.sh / pp_llama_47.sh (AIME+GSM8K matrix, xserv on 0-3 || llama on 4-7), summarize_pp/summarize_fullq, pp_time.py latency probe. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-05-29 18:45:59 +08:00
Gahow Wang	a4a171d425	bench: TP sweep harness (xserv --tp, llama row-split, concurrent groups) runner/servers gain --tp (xserv --tp N; llama.cpp --split-mode row) and --llama-devices so llama can run on a disjoint GPU group. run_tp_parallel.sh runs xserv (GPU 0..N-1) and llama.cpp (GPU 4..4+N-1) concurrently per TP, matching the box's 0-3 / 4-7 PHB groups. summarize_tp.py tabulates the sweep. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-05-29 11:10:43 +08:00
Gahow Wang	950ccf3822	bench: fix llama.cpp per-slot context (was 1/parallel of intended) llama.cpp divides total -c across --parallel slots, so -c 4096 --parallel 4 gave each request only 1024 tokens — truncating long AIME generations before the boxed answer and making xserv look artificially better (20% vs 3.3%). Set total -c = max_seq_len * n_parallel so per-slot context equals xserv's per-sequence max_seq_len. Also drop --log-disable; its startup log reports the per-slot n_ctx that catches exactly this misconfiguration. After the fix, AIME is at parity (xserv 23.3% vs llama.cpp 20.0%), matching the GSM8K parity and confirming the gap was a config artifact, not engine quality. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-05-28 15:06:12 +08:00
Gahow Wang	7cb9ee3870	bench: run one server at a time, match thinking mode, fix tools package Refinements from end-to-end bring-up on the GPU host: - Run each system start→suites→stop in sequence. Two BF16 8B models don't co-reside on one 32GB GPU, and a resident idle engine would distort the other's latency/throughput. - Match generation mode: xserv hardcodes Qwen3 thinking off, so send chat_template_kwargs={enable_thinking:false} to llama.cpp via a per-endpoint extra_body. --enable-thinking opts back into thinking mode. - Add tools/__init__.py so `python3 -m tools.bench.runner` resolves our package instead of a site-packages `tools` (nvfuser ships one that shadowed it). - Document offline-GPU-host workflow, thinking-match, and the xserv 8192 OOM finding that the bench surfaced. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-05-28 11:40:07 +08:00
Gahow Wang	49c7653222	tools: add llama.cpp comparison baseline + standard benchmark suite Vendor llama.cpp as a submodule pinned to b9371 and add a one-click benchmark driver that compares xserv against it on identical workloads: - setup-llama-cpp.sh: network-optional CUDA build (SM120); convert-to-gguf.sh converts the same safetensors to BF16 GGUF for an apples-to-apples baseline. - tools/bench/: black-box OpenAI-API driver measuring TTFT/TPOT/throughput (single-stream + concurrent) and response quality on AIME 2025 + GSM8K. - fetch_datasets.py pulls datasets to local JSON (GPU host has no network); task loaders prefer the local JSON. - sync-and-build.sh: `bench` subcommand transfers source + datasets to the GPU host via tar-over-ssh (no rsync there), builds, and runs the suite. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-05-28 11:18:52 +08:00
Gahow Wang	9bb5c5c328	tools: add correctness + performance test scripts for Qwen3-8B - test_correctness.py: compare prefill logits top-20 vs HF transformers - bench_server.py: HTTP API benchmark (throughput, streaming, concurrent, EOS leak check) Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-05-23 14:13:49 +08:00
Gahow Wang	ee68d3565d	fix: comprehensive review + 14 bug fixes + Phase 12/14 overhaul Strict code review identified 30+ issues across correctness, performance, and architecture. This commit addresses 14 of them with verified fixes, restructures Phase 12 for honest continuous batching, and updates Phase 14 to target FA2 (RTX 5090 SM120 lacks TMEM required by FA4). Bug fixes: - FIX-01: Global cuBLAS handle (thread-local singleton, was per-call) - FIX-02: Remove 19 unnecessary cudaDeviceSynchronize calls from kernels - FIX-03: Qwen3 ChatML template (was plain text concatenation) - FIX-04: EOS token from tokenizer (was hardcoded 151645) - FIX-05: Storage tracks actual GPU device ordinal (was always Cuda(0)) - FIX-06: unsqueeze stride preserves contiguous layout - FIX-08: CudaDeviceProp replaced with heap buffer (was UB-prone padding) - FIX-09: Tokenizer byte_fallback to <0xNN> tokens (was panic) Feature additions: - FIX-10: SSE streaming (/v1/chat/completions, OpenAI-compatible) - FIX-11: Correct usage statistics (prompt/completion/total tokens) - FIX-13: Temperature / top-k / top-p sampling with SamplingParams Performance improvements: - FIX-07: Caching allocator wired up (thread-local pool, pooled flag) - FIX-12: KV cache staging buffers (zero-alloc get_kv_len via borrow_raw) - FIX-14: GPU strided copy kernel (eliminates contiguous() CPU round-trip) Architecture: - Phase 12 engine restructured: prefill/decode separation, honest TODO for batched GPU forward (requires Flash Attention) - Phase 14 updated: FA2 for SM120 (FA4 requires TMEM, absent on 5090) - Qwen3-7B → Qwen3-8B typo fixed across all docs (36 layers, hidden 4096) Validated on dash5 (8x RTX 5090): - 52/52 API prompts pass (EN/CN/code), SSE streaming verified - Logits match HF transformers 9/10 top-1, 4.0/5 avg top-5 overlap - 8 concurrent requests: 5.99x scheduling speedup (batch_size=4) - Throughput: 10.3 tok/s (serial), 30% of HF baseline Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-05-22 17:53:28 +08:00
Gahow Wang	d8493bd70f	phase 12: implement real continuous batching scheduler Rewrote engine.rs from scratch: - Scheduler loop: admit → prefill → decode → finish → check new requests - Multiple sequences run concurrently (max_batch_size configurable) - Each sequence has independent GpuKVCache - Non-blocking try_recv() for new requests during decode iterations - Dynamic join: new requests enter batch immediately, don't wait for others Verified with concurrent test (tools/test_concurrent.py): - 3 concurrent requests: wall_time=3.8s, concurrency_ratio=2.82x ✓ - 5 concurrent requests: wall_time=6.1s, concurrency_ratio=4.04x ✓ - All outputs are coherent and correct Design doc (docs/12-continuous-batching.md) fully rewritten with: - Detailed scheduler loop pseudocode - Data structures (Sequence, Scheduler) - Acceptance criteria with specific test cases - Clear separation from Phase 13 (HTTP layer) Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-05-22 13:44:26 +08:00
Gahow Wang	268e40d764	phase 10: add Qwen3-8B benchmark + performance fix Benchmark infrastructure: - bench-qwen3 binary: 50 prompts × 20 tokens with KV cache - bench_compare_qwen3.py: comparison against HF transformers (BF16) Performance fix: - Precompute transposed weights at model load time (eliminated per-token weight transpose CPU round-trip: was 252 transposes × 32MB each = 8GB/token) - Result: from "infinite" (>10 min/token) to 144ms/token Results (50 prompts): - Prefill top-1: 42/50 (84%), top-5: 50/50 (100%) vs HF transformers - Greedy sequence: 0/50 exact match (BF16 precision drift over 36 layers) - Performance: TTFT=138ms, TBT=144ms, 6.9 tok/s (HF: 21ms, 45.6 tok/s) - All outputs are coherent English/Chinese Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-05-22 10:25:33 +08:00
Gahow Wang	64084d3489	phase 9: KV cache + autoregressive generation - KVCache: per-layer, per-head storage with append + reconstruct - forward_with_cache: prefill (full prompt) + decode (single token) modes - Fixed data layout bug: per-head vectors avoid cross-head interleaving - CLI updated to use KV cache by default - bench-gpt2 supports --no-cache flag for comparison Benchmark results (50 prompts × 20 tokens): - KV cache vs no-cache: 50/50 bit-identical (cache is correct) - 18x speedup: TTFT 400→24ms, TBT 407→22ms, throughput 2.5→44 tok/s - vs HF transformers: 40/50 match (10 are FP divergence, avg logit gap 0.20) Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-05-21 23:39:41 +08:00
Gahow Wang	cb12250ef0	phase 8: add benchmark framework + baseline results - bench-gpt2 binary: runs 50 prompts, measures TTFT/TBT per prompt, outputs JSON - bench_compare.py: compares xserv vs transformers token-by-token + timing - Baseline results: 50/50 correctness, 400ms TTFT / 407ms TBT (100x slower than PyTorch) - Bottlenecks documented: no KV cache, CPU round-trips, cuBLAS handle churn Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-05-21 23:29:41 +08:00
Gahow Wang	9806b4db35	phase 0+1: project scaffold + xserv-cuda crate - Cargo workspace with xserv-cuda crate - CUDA FFI bindings (cudart: memory, stream, device, error) - GpuBuffer RAII wrapper with H2D/D2H/D2D copy - CudaStream wrapper with RAII Drop - CachingAllocator with size-bucketed free lists - PinnedBuffer for page-locked host memory - Device info query via cudaDeviceGetAttribute - Vector-add CUDA kernel smoke test - Integration test suite (11 tests) - build.rs: cc crate compiles .cu for SM 12.0 - sync-and-build.sh for remote build on dash5 - Roadmap doc (docs/00-roadmap.md) and Phase 0+1 design doc Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-05-21 18:40:22 +08:00
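
Commit d33220498a describes the MXFP4 weight layout as E2M1 elements with one UE8M0 scale per 32-element block. The following is a minimal CPU-side sketch of how one such block could be dequantized, not the repository's kernel or `tools/quantize_mxfp4.py`. The function names are hypothetical, and the nibble order (low nibble first) is an assumption that may differ from the actual packing.

```python
# Sketch of MXFP4 block dequantization (E2M1 values + UE8M0 block scale).
# All names here are illustrative; packing order is an assumption.

# Magnitudes representable by E2M1 (2 exponent bits, 1 mantissa bit).
E2M1_MAGNITUDES = (0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0)
BLOCK = 32  # elements sharing one scale


def decode_e2m1(code: int) -> float:
    """Decode a 4-bit E2M1 code: bit 3 is the sign, bits 0-2 index the magnitude."""
    sign = -1.0 if code & 0x8 else 1.0
    return sign * E2M1_MAGNITUDES[code & 0x7]


def decode_ue8m0(byte: int) -> float:
    """Decode an unsigned 8-bit exponent-only scale: 2**(byte - 127); 0xFF is NaN."""
    if byte == 0xFF:
        return float("nan")
    return 2.0 ** (byte - 127)


def dequant_block(packed: bytes, scale_byte: int) -> list[float]:
    """Dequantize one 32-element block stored as 16 bytes (two codes per byte)."""
    assert len(packed) == BLOCK // 2
    scale = decode_ue8m0(scale_byte)
    out = []
    for b in packed:
        out.append(decode_e2m1(b & 0x0F) * scale)  # low nibble first (assumption)
        out.append(decode_e2m1(b >> 4) * scale)
    return out
```

A fused GPU kernel would perform the same per-element arithmetic on-chip and emit BF16 rather than Python floats.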

19 Commits
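
The per-slot context fix in commit 950ccf3822 comes down to simple arithmetic: llama.cpp divides the total `-c` value across `--parallel` slots, so the total must be scaled up to give each slot xserv's `max_seq_len`. The helper names below are hypothetical and only illustrate that calculation.

```python
# Sketch of the per-slot context arithmetic; helper names are illustrative.

def per_slot_ctx(total_ctx: int, n_parallel: int) -> int:
    """Context each request gets when the total is split evenly across slots."""
    return total_ctx // n_parallel


def total_ctx_for(max_seq_len: int, n_parallel: int) -> int:
    """Total context to request so every slot gets max_seq_len tokens."""
    return max_seq_len * n_parallel


if __name__ == "__main__":
    # Misconfigured case from the commit message: -c 4096 --parallel 4.
    print(per_slot_ctx(4096, 4))
    # Corrected total for max_seq_len=4096 with 4 slots.
    print(total_ctx_for(4096, 4))
```

With the values from the commit message, the misconfigured run leaves 1024 tokens per slot, while the corrected total of 16384 restores 4096 per slot.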