xserv

Author	SHA1	Message	Date
Gahow Wang	fcf531a9b2	style: rustfmt server engine files	2026-07-01 15:13:35 +08:00
Gahow Wang	d96ee0766c	server: sampling-param validation, finish_reason normalization, backpressure Three related hardening changes for the API surface: - validate_request rejects NaN/negative temperature, out-of-range top_p, and absurd top_k before those values reach the CUDA sampling paths. Prevents NaN logits from downstream sampling and matches typical OpenAI-compatible server behavior (400 instead of 500). - normalize_finish_reason maps engine strings to the OpenAI-standard subset. Currently only "error" (from tp/pp engine client-stall) needs normalization — it collapses to null so SDK clients see a clean stream close instead of an unknown finish_reason value. Applied to both streaming (SSE) and non-streaming JSON responses. - Replace the unbounded std::sync::mpsc engine channel with a bounded sync_channel(256) and switch submit_to_engine to try_send. A saturated engine now returns 503 "engine is busy" instead of letting requests pile up in RAM. Also add axum DefaultBodyLimit(4 MiB) so a malicious or misbehaving client cannot exhaust memory with an arbitrary JSON POST.	2026-07-01 15:13:24 +08:00
Gahow Wang	0314b4f3ac	server: non-blocking stream send — stop one slow client stalling the batch All three engines emitted tokens with blocking_send on the single decode/coordinator OS thread. A streaming client that drains slower than generation fills its 64-slot channel, and blocking_send then blocks the whole thread: under continuous batching one slow consumer stalls every other running sequence (and in the serial TP/PP path it blocks admission of the next request too). The whole point of continuous batching is defeated. Fix: switch to try_send. engine.rs sets a client_stalled flag on Full/Closed, reaped by is_finished() next iteration; tp_engine/pp_engine emit_text returns bool and the decode loop breaks with finish_reason "error". When the sequence/request is dropped its sender drops too, closing the channel so the client receive loop ends rather than hanging. A slow client now only loses its own sequence, never the batch. Verified on dash5: gpt-oss FP8 TP=1 streaming via tp_engine still streams correctly (SSE chunks, coherent content, no hang); bench-gpt-oss TP=2 5.9ms TPOT unchanged. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-07-01 12:37:32 +08:00
Gahow Wang	531cd3fe08	style: format Rust workspace	2026-06-18 18:11:58 +08:00
Gahow Wang	34224c7c93	gpt-oss: replay the whole batch=1 decode step as one CUDA graph Split forward_decode_paged into host bookkeeping (decode_prepare + ids/pos upload + advance_seq_len) and a pure-GPU decode_core. The paged-KV and sparse-MoE designs already read every per-step variable (block table, context lens, expert ids) from stable-address device buffers, so decode_core captures as-is. GptOssDecodeGraph captures lazily on the second decode step (the first eager step warms cuBLAS) after a "retained warmup": the step runs once with the allocator quarantine on, stocking the pool with a dedicated block for every allocation so the capture itself never pool-misses (a cudaMalloc while capturing is illegal — and the capture's own quarantine is what would otherwise starve the pool). NCCL all-reduces capture cleanly; TP=2 replays in lockstep. Wired into tp_engine, bench-gpt-oss, and xserv-chat via GraphedGptOssDecoder (batch>1 falls back to eager; XSERV_DECODE_GRAPH=0 disables). Greedy tokens identical to eager. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>	2026-06-12 20:12:37 +08:00
Gahow Wang	5343391dbd	review cleanups: pp+gpt-oss guard, sparse GEMV asserts, warnings - --pp with gpt-oss now fails with a clear message instead of a cryptic missing-weight panic inside the Qwen3-only PP engine. - Sparse GEMV wrappers assert K%16==0 (FP8) / K%32==0 (MXFP4) — the uint4-vectorized kernels would silently drop a tail otherwise. - Document the topk_ids buffer holding i32 under an F32 dtype label (DType has no I32). - Drop unused imports/locals and the cuBLASLt scale-mode constants orphaned by the strided-batched FP8 rework (`e631a71`). Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>	2026-06-12 17:02:59 +08:00
Gahow Wang	63f5599717	server: serve gpt-oss on a single GPU via the TP engine (world=1) gpt-oss has no single-GPU engine path, so --tp 1 fell through to the Qwen3-only engine and every request 503'd. Route gpt_oss to run_tp even at tp=1: NCCL world-1 init works and all_reduce already no-ops (bench-gpt-oss --tp 1 exercised this path). Quantized gpt-oss (22 GB FP8 / 13 GB MXFP4) now serves on one 32 GB 5090. Also fix eval_gsm8k_fast.py --gpu to accept a device list ("2,3"): it was type=int, so any --tp 2 run pinned CUDA_VISIBLE_DEVICES to one GPU and rank 1's set_device panicked while rank 0 spun in NCCL init. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>	2026-06-12 16:29:10 +08:00
Gahow Wang	c0a81c84e7	server: canonical harmony system message in gpt-oss fallback build_prompt_gpt_oss (the hardcoded builder used when a gpt-oss model ships no Jinja chat template) emitted the same malformed "You are a helpful assistant." system message that destabilized the CLI. Replace it with the canonical harmony system message (identity / knowledge cutoff / current date via strftime_now / Reasoning: low / channels), matching the chat_template.jinja build_system_message macro and the xserv-chat fix. Dormant for gpt-oss-20b (it ships a Jinja template, so the template path runs), but correct now for any gpt-oss model without one. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-02 15:19:50 +08:00
Gahow Wang	f2e60218b4	xserv-chat: harmony channel routing + repetition penalty for gpt-oss - Let the model generate its own <\|channel\|> routing instead of forcing <\|channel\|>final<\|message\|> — matches the GGUF chat template behavior. - State machine tracks harmony channels: analysis channel rendered gray, final channel printed normally, <\|end\|> stops on final channel only. - Add repetition penalty (default 1.3 for MoE, 1.0 for Qwen) with 512 token window to prevent greedy decode loops. Configurable via XSERV_REP_PENALTY and XSERV_REP_WINDOW env vars. - Fix Length path: use <\|end\|> instead of <\|im_end\|> for gpt-oss to avoid poisoning the KV cache with garbage tokens on truncation. - Server api.rs: append <\|channel\|>final<\|message\|> to the hardcoded gpt-oss prompt (server expects to post-process the JSON output). - Add startswith filter to minijinja for harmony template compatibility. Known issue: gpt-oss multi-turn NaN when total context exceeds ~256 tokens — likely a flash_attention_sinks kernel bug with sliding window layers at large kv_len + small q_len. Single-turn and short multi-turn conversations work correctly. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-06-02 12:40:17 +08:00
Gahow Wang	ae08896f46	xserv-chat: support gpt-oss-20b with TP; fix GEMV precision bug - Add ChatModel enum dispatching between Qwen3 and GptOss based on config.is_moe(), following the TP engine pattern. - Add --tp N flag for tensor-parallel inference (required for 39GB gpt-oss-20b which doesn't fit on a single 32GB GPU). - Add gpt-oss harmony chat template with channel/message format. - Replace hardcoded is_stop_token() with tokenizer.is_eos() for multi-model EOS support. - Restore gpt-oss hardcoded prompt template in server api.rs, lost during the Jinja template refactor. - Fix GEMV race condition: the K-split kernel zeroed the FP32 accumulator inside the kernel (block k=0) while other blocks atomicAdd'd concurrently. Pre-zero with cudaMemsetAsync instead. - Update benchmark docs with post-fix results. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-06-02 00:58:10 +08:00
Gahow Wang	1d0ec32e8d	server: Jinja chat template rendering via minijinja Load the model's chat_template.jinja (or tokenizer_config.json chat_template field) at startup and render it with minijinja instead of hardcoded per-model prompt builders. Custom Jinja functions: strftime_now (date formatting), raise_exception (template validation errors). Falls back to Qwen3 ChatML template if no Jinja template is found. Removes the hardcoded build_prompt_gpt_oss() — the model's own template now drives prompt formatting, matching llama.cpp's behavior exactly. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-05-31 13:23:18 +08:00
Gahow Wang	ffd90ce7fb	server: emit harmony developer instructions block for gpt-oss Route caller-supplied system messages into a harmony 'developer' instructions block (<\|start\|>developer<\|message\|># Instructions...), keeping the fixed system/meta block for the channel declaration. Harmony puts user instructions on the developer role, not system. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-05-31 00:56:39 +08:00
Gahow Wang	3c9d5e260e	server: harmony termination via is_eos + TP repetition penalty Use tokenizer.is_eos() (multi-eos) for generation termination in both PP and TP engines instead of a single eos id, so gpt-oss stops on <\|return\|> /<\|call\|>/<\|endoftext\|>. In the TP engine, optionally apply a repetition penalty on the greedy decode path (XSERV_REP_PENALTY>1 over XSERV_REP_WINDOW recent tokens; off by default) to break greedy repetition loops. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-05-31 00:56:33 +08:00
Gahow Wang	5cb3cf28f9	server: add gpt-oss chat template for proper prompt formatting The gpt-oss model requires a specific prompt format with <\|start\|>, <\|message\|>, <\|end\|>, <\|channel\|> tokens. Without this, the model produces degenerate output. Auto-detected via config.model_type. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-05-30 15:43:29 +08:00
Gahow Wang	15c51f143e	server: support GptOss in TP engine + benchmark script - tp_engine.rs: TpModel enum dispatches between Qwen3 and GptOss based on config.is_moe(). Server auto-detects model type on startup. - tools/run_gpt_oss_bench.sh: one-click benchmark comparing xserv (TP=2) vs llama.cpp (BF16 GGUF) on GSM8K quality + speed Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-05-30 15:39:44 +08:00
Gahow Wang	46bfb59f30	Merge branch 'phase18-pipeline-parallelism': pipeline-parallel inference Adds --pp N for layer-wise pipeline parallelism via NCCL P2P send/recv. Each stage holds layers [sL, (s+1)L), stage 0 owns embedding, last stage owns norm/lm_head. v1 serial (one request at a time) — correctness + per-GPU memory savings (~1/N). Refactors model to unfused QKV/gate_up projections and removes unused kernels (argmax, reshape_and_cache).	2026-05-30 13:13:05 +08:00
Gahow Wang	9a01c60100	server: GPU argmax fast path for greedy decode When all active sequences use temperature=0, run argmax on the GPU and only D2H the token ids (~B×4 bytes) instead of the full [B, vocab_size] BF16 logits (~1.2 MB at B=4, Qwen3 vocab=152K). Mixed-sampling batches fall back to the existing CPU path. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-05-30 12:50:47 +08:00
Gahow Wang	824cc58daa	server: pipeline-parallel HTTP engine (--pp N) pp_engine::run_pp: stage-0 coordinator (scheduler/tokenizer/sampling + stop logic) on the calling thread, worker stage threads for 1..P. Each step the coordinator embeds + runs its layers, then the hidden state is handed stage->stage over NCCL P2P; the last stage samples and returns the token to stage 0 over an in-process channel. v1 is serial (one request, one token/step) — correctness first; throughput via microbatch overlap is future work. main: wire --pp N (mutually exclusive with --tp). Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-05-29 18:45:52 +08:00
Gahow Wang	95eb61d639	server: tensor-parallel HTTP engine (--tp N) tp_engine: rank-0 coordinator owns the scheduler and broadcasts per-token commands (Register/Prefill/Decode/Free) to worker rank threads; the sampled token always comes from rank 0, so it's correct for greedy and stochastic sampling. Serial single-request path (sufficient for the quality benchmark). --tp N selects it; TP=1 keeps the existing single-GPU Engine unchanged. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-05-29 11:10:33 +08:00
Gahow Wang	fc1900a745	server: VRAM-sized KV pool + vLLM-style swap scheduler Fixes the paged-KV OOM at large --max-seq-len and adds elastic memory: - Size the GPU block pool to available VRAM (cudaMemGetInfo) instead of the worst-case blocks_per_seq * max_batch * 2 reservation, which OOM'd at 8192. - Scheduler tracks waiting/running/swapped sets: block-aware admission, swap-in of resumable sequences when blocks free, and preemption of the newest running sequence to host when the pool can't cover a decode step. - --swap-space-gb (default 8) sizes the pinned host swap pool; XSERV_MAX_KV_BLOCKS forces a small pool to exercise swapping. - api: poison-tolerant lock + clean 503 when the engine thread is gone, instead of cascading mutex-poison panics. Verified on RTX 5090: serves at --max-seq-len 8192 (previously OOM), and a forced 40-block pool drives 48 lossless swap-out/swap-in cycles under concurrency with coherent output. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-05-28 19:59:06 +08:00
Gahow Wang	986a289616	fix: 12 bug fixes from comprehensive review — 51 tok/s verified on RTX 5090 P0 fixes (blocking usability): - FIX-01: thread-local cuBLAS handle (was creating/destroying per matmul) - FIX-16: EOS token no longer leaks into API responses - FIX-17: max_seq_len configurable via --max-seq-len (default 2048, was hardcoded 256) - FIX-18: max_tokens clamped to available seq space, prompt overflow returns 400 P1 fixes (bugs & performance): - FIX-07: CachingAllocator wired into all hot paths (to_device, embedding, rope, concat) - FIX-08: CudaDeviceProp buffer increased to 32KB for CUDA 12.9 safety - FIX-09: tokenizer byte_fallback graceful degradation (was panic) - FIX-19: causal mask uses -INFINITY instead of -1e9 (BF16 supports inf) - FIX-20: LayerNorm rewritten to numerically stable two-pass algorithm - FIX-21: min block size guard (32 threads) for LayerNorm/RMSNorm launches P2 fixes (improvements): - FIX-22: Option<GpuKVCache> + take() eliminates dummy KV cache allocations - FIX-23: RoPE cache no longer artificially capped at 8192 positions Verified on dash5 (RTX 5090): 51 tok/s batch=1, 74 tok/s 2-concurrent, 1.7-3.3x HF transformers. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-05-23 14:13:43 +08:00
Gahow Wang	876d3f5d6a	phase 15: batched decode forward — 35 tok/s (97% of HF transformers) Implement batched decode that processes multiple sequences' tokens in one forward pass. The key insight: cuBLAS M=4 GEMM is dramatically faster than 4× M=1 GEMV due to better TensorCore utilization and amortized kernel launch overhead. New method Qwen3::forward_decode_batch(&tokens, &positions, &mut caches): - Batched embedding, norm, projections, FFN: [B, hidden] × [hidden, X] → one cuBLAS call per weight matrix instead of B calls - Per-sequence attention: RoPE, KV cache, decode_attention remain per-seq (each has different position and KV length) - Row extraction (row_view) and concatenation (concat_rows) for batched↔per-seq transitions Engine Step 4b: - batch_size >= 2: extracts caches via std::mem::replace, calls forward_decode_batch, restores caches, samples per-sequence - batch_size == 1: falls back to per-seq forward_gpu_cache (no overhead) Ablation results (dash5, RTX 5090, Qwen3-8B BF16): \| Scenario \| Throughput \| vs HF \| \|----------\|-----------\|-------\| \| Serial (batch=1) \| 13.2 tok/s \| 37% \| \| Concurrent (batch=4) \| 35.1 tok/s \| 97% \| \| HF transformers \| 36.0 tok/s \| 100% \| The 2.66x throughput improvement (13.2 → 35.1) for concurrent requests comes from cuBLAS going from 1008 M=1 GEMVs to 252 M=4 GEMMs per step, which cuBLAS handles ~4x more efficiently on TensorCores. Milestone ④ target (50% of vLLM/HF throughput) achieved with 97%. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-05-22 20:07:43 +08:00
Gahow Wang	ee68d3565d	fix: comprehensive review + 14 bug fixes + Phase 12/14 overhaul Strict code review identified 30+ issues across correctness, performance, and architecture. This commit addresses 14 of them with verified fixes, restructures Phase 12 for honest continuous batching, and updates Phase 14 to target FA2 (RTX 5090 SM120 lacks TMEM required by FA4). Bug fixes: - FIX-01: Global cuBLAS handle (thread-local singleton, was per-call) - FIX-02: Remove 19 unnecessary cudaDeviceSynchronize calls from kernels - FIX-03: Qwen3 ChatML template (was plain text concatenation) - FIX-04: EOS token from tokenizer (was hardcoded 151645) - FIX-05: Storage tracks actual GPU device ordinal (was always Cuda(0)) - FIX-06: unsqueeze stride preserves contiguous layout - FIX-08: CudaDeviceProp replaced with heap buffer (was UB-prone padding) - FIX-09: Tokenizer byte_fallback to <0xNN> tokens (was panic) Feature additions: - FIX-10: SSE streaming (/v1/chat/completions, OpenAI-compatible) - FIX-11: Correct usage statistics (prompt/completion/total tokens) - FIX-13: Temperature / top-k / top-p sampling with SamplingParams Performance improvements: - FIX-07: Caching allocator wired up (thread-local pool, pooled flag) - FIX-12: KV cache staging buffers (zero-alloc get_kv_len via borrow_raw) - FIX-14: GPU strided copy kernel (eliminates contiguous() CPU round-trip) Architecture: - Phase 12 engine restructured: prefill/decode separation, honest TODO for batched GPU forward (requires Flash Attention) - Phase 14 updated: FA2 for SM120 (FA4 requires TMEM, absent on 5090) - Qwen3-7B → Qwen3-8B typo fixed across all docs (36 layers, hidden 4096) Validated on dash5 (8x RTX 5090): - 52/52 API prompts pass (EN/CN/code), SSE streaming verified - Logits match HF transformers 9/10 top-1, 4.0/5 avg top-5 overlap - 8 concurrent requests: 5.99x scheduling speedup (batch_size=4) - Throughput: 10.3 tok/s (serial), 30% of HF baseline Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-05-22 17:53:28 +08:00
Gahow Wang	d8493bd70f	phase 12: implement real continuous batching scheduler Rewrote engine.rs from scratch: - Scheduler loop: admit → prefill → decode → finish → check new requests - Multiple sequences run concurrently (max_batch_size configurable) - Each sequence has independent GpuKVCache - Non-blocking try_recv() for new requests during decode iterations - Dynamic join: new requests enter batch immediately, don't wait for others Verified with concurrent test (tools/test_concurrent.py): - 3 concurrent requests: wall_time=3.8s, concurrency_ratio=2.82x ✓ - 5 concurrent requests: wall_time=6.1s, concurrency_ratio=4.04x ✓ - All outputs are coherent and correct Design doc (docs/12-continuous-batching.md) fully rewritten with: - Detailed scheduler loop pseudocode - Data structures (Sequence, Scheduler) - Acceptance criteria with specific test cases - Clear separation from Phase 13 (HTTP layer) Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-05-22 13:44:26 +08:00
Gahow Wang	da043554ba	phase 12+13: HTTP API server with OpenAI-compatible endpoint (Milestone ③) New crate: xserv-server - Engine thread: loads Qwen3-8B, processes requests sequentially - axum HTTP server: /health, /v1/models, /v1/chat/completions - tokio::sync::mpsc channel between API and engine threads - Non-streaming JSON response (streaming SSE to be added later) API is OpenAI-compatible: POST /v1/chat/completions {"messages": [...], "max_tokens": N} → {"choices": [{"message": {"content": "..."}}]} Verified: "Hi" → ", I'm" (3 tokens), model runs correctly via HTTP. Key learnings: - std::sync::mpsc::SyncSender is Send but NOT Sync → wrap in Mutex for Arc<AppState> - MutexGuard must not live across await points (scope carefully) - axum 0.8 Extension<Arc<T>> requires T: Send + Sync Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-05-22 12:55:19 +08:00

25 Commits