xserv

Author	SHA1	Message	Date
Gahow Wang	e11f15e009	tokenizer: support multiple end-of-generation tokens Track an ordered eos_token_ids list (not just one id) and add is_eos(). gpt-oss/harmony ends the assistant turn on <\|return\|> and also treats <\|call\|> and <\|endoftext\|> as terminators (generation_config.json eos_token_id = [200002, 199999, 200012]); single-eos families are unchanged. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-05-31 00:56:21 +08:00
Gahow Wang	9c98c169ff	kernels: flash attention with gpt-oss sinks + sliding window Add flash_attention_sinks_bf16 prefill kernel that folds the per-head attention sink into the softmax denominator (exactly as the decode sink kernel) and supports an optional sliding-window mask matching HF gpt-oss. Wire it through xserv-kernels (flash_attention_sinks) and use it in GptOss prefill, replacing the post-hoc sink approximation for an exact match against the reference math. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-05-31 00:56:10 +08:00
Gahow Wang	5cb3cf28f9	server: add gpt-oss chat template for proper prompt formatting The gpt-oss model requires a specific prompt format with <\|start\|>, <\|message\|>, <\|end\|>, <\|channel\|> tokens. Without this, the model produces degenerate output. Auto-detected via config.model_type. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-05-30 15:43:29 +08:00
Gahow Wang	15c51f143e	server: support GptOss in TP engine + benchmark script - tp_engine.rs: TpModel enum dispatches between Qwen3 and GptOss based on config.is_moe(). Server auto-detects model type on startup. - tools/run_gpt_oss_bench.sh: one-click benchmark comparing xserv (TP=2) vs llama.cpp (BF16 GGUF) on GSM8K quality + speed Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-05-30 15:39:44 +08:00
Gahow Wang	d29c39d74e	fix: GEMV NaN bug — skip custom kernel for small N (<256) The custom launch_gemv_bf16 kernel produces NaN when output dimension N is small (e.g. N=32 for the MoE router). Fall back to cuBLAS GemmEx for N < 256. Also removes the padding workaround in gpt_oss MoE forward. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-05-30 15:20:04 +08:00
Gahow Wang	9ad91a4a92	phase19: MoE support — gpt-oss-20b end-to-end inference with TP=2 Add Mixture-of-Experts support for the gpt-oss-20b model (20.9B params, 32 experts × top-4 routing). Key additions: - ModelConfig: MoE fields (num_local_experts, layer_types, sliding_window, attention_bias, explicit head_dim, rope_scaling, swiglu_limit) - YaRN RoPE: RopeCache::new_yarn() with correct frequency interpolation and attention_scaling = 0.1*ln(factor)+1 - Custom GLU kernel: gpt_oss_glu_bf16 (clamped sigmoid gate activation) - Paged attention with sinks + sliding window kernel variant - GptOss model struct with expert-parallel TP (split 32 experts across ranks) - bench-gpt-oss binary for TP inference benchmarking Verified on dash5 with 2x RTX 5090: 63.6 tok/s decode, ~160ms TTFT. Model generates topically-coherent output (needs chat template for quality). Known issues: - Custom GEMV kernel produces NaN with small N (workaround: pad to M=2) - Prefill doesn't use attention sinks (uses standard flash attention) - Output quality requires chat template formatting Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-05-30 15:18:01 +08:00
Gahow Wang	46bfb59f30	Merge branch 'phase18-pipeline-parallelism': pipeline-parallel inference Adds --pp N for layer-wise pipeline parallelism via NCCL P2P send/recv. Each stage holds layers [sL, (s+1)L), stage 0 owns embedding, last stage owns norm/lm_head. v1 serial (one request at a time) — correctness + per-GPU memory savings (~1/N). Refactors model to unfused QKV/gate_up projections and removes unused kernels (argmax, reshape_and_cache).	2026-05-30 13:13:05 +08:00
Gahow Wang	9a01c60100	server: GPU argmax fast path for greedy decode When all active sequences use temperature=0, run argmax on the GPU and only D2H the token ids (~B×4 bytes) instead of the full [B, vocab_size] BF16 logits (~1.2 MB at B=4, Qwen3 vocab=152K). Mixed-sampling batches fall back to the existing CPU path. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-05-30 12:50:47 +08:00
Gahow Wang	c679f618fd	model: fuse QKV/gate_up projections, batched decode ops Weight fusion at load time: - q/k/v_proj → single qkv_proj_wt, GEMV once then narrow() to split - gate/up_proj → single gate_up_proj_wt, same pattern - Reduces GEMV calls from 7 to 4 per layer (36 layers → 108 fewer launches) Batched decode refactor (forward_decode_paged): - Per-head RMSNorm: reshape to [B*H, D], one rmsnorm call - Batched RoPE: one call for all sequences - Batched KV scatter: one reshape_and_cache kernel per layer - Eliminates the per-sequence loop entirely Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-05-30 12:50:39 +08:00
Gahow Wang	cc4bd4cfe5	paged-kv: kernel-based scatter + fix data_ptr offset bug Replace the Rust cudaMemcpy loop in append_tokens() with the new reshape_and_cache kernel. Add append_tokens_batched() for the decode path using the batched variant. Fix: use data_ptr() instead of storage().gpu_buffer().as_ptr() so that tensor offset is respected. The old code silently read from storage base (element 0) instead of the tensor's logical start, which produced wrong results when K/V tensors were narrow() views into a fused QKV buffer. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-05-30 12:50:28 +08:00
Gahow Wang	13ae3de69e	kernels: reshape_and_cache, GPU argmax, single-launch GEMV Three new CUDA kernels and one rewrite: - reshape_and_cache: scatter K/V into paged pool in a single kernel per layer, replacing the Rust-side per-token per-head cudaMemcpy loop. Includes both single-sequence (prefill) and batched (decode) variants. - argmax: GPU-side BF16 argmax with warp-shuffle reduction. Greedy decode now only D2H-transfers B×4 bytes (token ids) instead of the full [B, vocab] logits tensor. - GEMV rewrite: fused zero-init inside the K-split kernel eliminates the cudaMemsetAsync call, reducing launches from 3 to 2 per GEMV. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-05-30 12:50:17 +08:00
Gahow Wang	6ce21345be	cuda: add cached_trim() to release pooled GPU buffers Exposes the caching allocator's trim() through a public free function. Called after weight fusion during model loading to free temporary buffers that would otherwise sit in the pool and cause OOM. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-05-30 12:50:04 +08:00
Gahow Wang	1ab6ca9c09	tensor: add narrow() view and relax is_contiguous for size-1 dims narrow(dim, start, len) creates a zero-copy slice along any dimension. is_contiguous() now ignores stride mismatches on dimensions of size 1, since those dimensions are never stepped. This avoids unnecessary GPU strided copies when slicing fused projection outputs at batch=1. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-05-30 12:49:57 +08:00
Gahow Wang	824cc58daa	server: pipeline-parallel HTTP engine (--pp N) pp_engine::run_pp: stage-0 coordinator (scheduler/tokenizer/sampling + stop logic) on the calling thread, worker stage threads for 1..P. Each step the coordinator embeds + runs its layers, then the hidden state is handed stage->stage over NCCL P2P; the last stage samples and returns the token to stage 0 over an in-process channel. v1 is serial (one request, one token/step) — correctness first; throughput via microbatch overlap is future work. main: wire --pp N (mutually exclusive with --tp). Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-05-29 18:45:52 +08:00
Gahow Wang	da3aaa134a	model: pipeline-parallel Qwen3 (from_weights_pp + stage forward) Layer-wise split: each stage loads only its contiguous layer range [sL, (s+1)L); stage 0 keeps embed_tokens, the last stage keeps norm/lm_head (others get a 1x1 placeholder). Heads are NOT split (PP is orthogonal to TP). Adds embed/head and forward_layers_prefill/ forward_layers_decode that take and return the [tokens, hidden] hidden state; per-stage PagedKVCache is indexed by local layer id. sampling: derive Clone on SamplingParams (carried in the PP command enum). Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-05-29 18:45:47 +08:00
Gahow Wang	859c0cc0b6	distributed: NCCL P2P primitives (PpContext + send/recv) Add ncclSend/ncclRecv FFI and a PpContext that initializes a NCCL communicator across P pipeline stages and hands the hidden state to neighbour stages on the null stream. Mirrors TpContext; the collective differs (point-to-point hand-off vs in-layer AllReduce). tests/sendrecv.rs: 2-GPU stage0->stage1 send/recv smoke test. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-05-29 18:45:42 +08:00
Gahow Wang	c2362df1f1	fix(xserv-chat): UTF-8/CJK-aware line input Cooked-mode read_line() left line editing to the terminal, so Backspace on a multi-byte 汉字/かな/한글 deleted a byte (or behaved inconsistently across TTYs). Replace with a raw-mode reader (libc termios): Backspace pops a whole char, multi-byte input is reassembled from its continuation bytes, and a full-line redraw renders double-width glyphs correctly. Non-TTY input falls back to a plain read; raw mode is restored after each line. libc is already a locked transitive dep, so this builds offline. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-05-29 11:36:54 +08:00
Gahow Wang	95eb61d639	server: tensor-parallel HTTP engine (--tp N) tp_engine: rank-0 coordinator owns the scheduler and broadcasts per-token commands (Register/Prefill/Decode/Free) to worker rank threads; the sampled token always comes from rank 0, so it's correct for greedy and stochastic sampling. Serial single-request path (sufficient for the quality benchmark). --tp N selects it; TP=1 keeps the existing single-GPU Engine unchanged. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-05-29 11:10:33 +08:00
Gahow Wang	f17011129e	model: tensor-parallel Qwen3 (sharded weights + AllReduce) from_weights_tp shards each rank's weights (column-split q/k/v/gate/up, row-split o/down; replicate norms/embed/lm_head) and the paged forward uses local head counts + AllReduces after o_proj and down_proj. PagedKVCache::new_tp sizes the pool for the rank's local KV heads (KV is sharded too). TP=1 is the identity path. New bench-tp binary runs E2E multi-GPU generation per TP degree. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-05-29 11:10:24 +08:00
Gahow Wang	453520d622	distributed: NCCL tensor-parallel primitives (TpContext + AllReduce) New xserv-distributed crate: hand-written NCCL FFI, TpContext (one rank per thread, bound to one GPU), and in-place BF16 AllReduce on the null stream so it orders naturally with the model's kernels. 2-GPU AllReduce test included. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-05-29 11:10:14 +08:00
Gahow Wang	fc1900a745	server: VRAM-sized KV pool + vLLM-style swap scheduler Fixes the paged-KV OOM at large --max-seq-len and adds elastic memory: - Size the GPU block pool to available VRAM (cudaMemGetInfo) instead of the worst-case blocks_per_seq * max_batch * 2 reservation, which OOM'd at 8192. - Scheduler tracks waiting/running/swapped sets: block-aware admission, swap-in of resumable sequences when blocks free, and preemption of the newest running sequence to host when the pool can't cover a decode step. - --swap-space-gb (default 8) sizes the pinned host swap pool; XSERV_MAX_KV_BLOCKS forces a small pool to exercise swapping. - api: poison-tolerant lock + clean 503 when the engine thread is gone, instead of cascading mutex-poison panics. Verified on RTX 5090: serves at --max-seq-len 8192 (previously OOM), and a forced 40-block pool drives 48 lossless swap-out/swap-in cycles under concurrency with coherent output. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-05-28 19:59:06 +08:00
Gahow Wang	d52baa0006	model: paged KV cache with CPU swap pool, decode graph, qwen3 updates - paged_kv_cache: new block-paged KV cache; adds a pinned-host swap pool with a second BlockAllocator, per-sequence Location {Gpu,Cpu}, and lossless swap_out/swap_in (block-granular D2H/H2D) for vLLM-style preemption. bytes_per_block helper exposes per-block cost for VRAM-based sizing. - decode_graph: CUDA-graph decode path. - qwen3/gpt2/kv_cache: paged prefill/decode forward + related updates. - tokenizer/bins: BPE updates, new xserv-chat CLI, bench-qwen3 tweaks. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-05-28 19:58:54 +08:00
Gahow Wang	4c3f914459	kernels/cuda: paged-attention kernel, dispatch, pinned host memory CUDA layer for the paged-KV + swap work: - csrc: new paged_attention.cu plus updates across attention/gemm/norm/ activation/embedding/reduce kernels and common.cuh. - xserv-kernels: new dispatch module and kernel-binding updates. - xserv-cuda: cudaMallocHost/FreeHost bindings + PinnedBuffer (host swap pool backing) and offset-aware D2H/H2D copies used to move KV blocks between the GPU pool and pinned host memory. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-05-28 19:58:36 +08:00
Gahow Wang	986a289616	fix: 12 bug fixes from comprehensive review — 51 tok/s verified on RTX 5090 P0 fixes (blocking usability): - FIX-01: thread-local cuBLAS handle (was creating/destroying per matmul) - FIX-16: EOS token no longer leaks into API responses - FIX-17: max_seq_len configurable via --max-seq-len (default 2048, was hardcoded 256) - FIX-18: max_tokens clamped to available seq space, prompt overflow returns 400 P1 fixes (bugs & performance): - FIX-07: CachingAllocator wired into all hot paths (to_device, embedding, rope, concat) - FIX-08: CudaDeviceProp buffer increased to 32KB for CUDA 12.9 safety - FIX-09: tokenizer byte_fallback graceful degradation (was panic) - FIX-19: causal mask uses -INFINITY instead of -1e9 (BF16 supports inf) - FIX-20: LayerNorm rewritten to numerically stable two-pass algorithm - FIX-21: min block size guard (32 threads) for LayerNorm/RMSNorm launches P2 fixes (improvements): - FIX-22: Option<GpuKVCache> + take() eliminates dummy KV cache allocations - FIX-23: RoPE cache no longer artificially capped at 8192 positions Verified on dash5 (RTX 5090): 51 tok/s batch=1, 74 tok/s 2-concurrent, 1.7-3.3x HF transformers. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-05-23 14:13:43 +08:00
Gahow Wang	d5532ef209	phase 15: Tensor::empty + CUDA Graph infra — 50.3 tok/s (140% of HF, 45% roofline) Two optimizations: 1. Tensor::empty() — skip cudaMemset for output tensors All kernel wrappers that fully overwrite their output now use Tensor::empty() instead of Tensor::zeros(). Eliminates ~756 cudaMemset calls per decode step (21 per layer × 36 layers). Improvement: 46.6 → 50.3 tok/s (+8%). 2. CUDA Graph infrastructure (for future use) Added FFI bindings (cudaStreamBeginCapture, cudaGraphInstantiate, cudaGraphLaunch) and RAII CudaGraph wrapper. Not yet used in the forward pass due to variable kv_len, but provides foundation for future graph-based decode optimization. Ablation (dash5, RTX 5090, Qwen3-8B BF16, serial decode): \| Optimization \| tok/s \| vs HF \| Roofline \| \|-------------\|-------\|-------\|----------\| \| Phase 14 baseline \| 12.9 \| 36% \| 12% \| \| + Fused kernels \| 13.2 \| 37% \| 12% \| \| + Batched decode \| 13.2 (serial) \| 37% \| 12% \| \| + Custom GEMV \| 46.6 \| 130% \| 42% \| \| + Tensor::empty \| 50.3 \| 140% \| 45% \| Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-05-22 23:57:34 +08:00
Gahow Wang	e207523e21	phase 15: custom GEMV kernel — 46.6 tok/s serial (3.5x improvement, 130% of HF) Custom bandwidth-optimized GEMV kernel for M=1 BF16 decode, replacing cuBLAS which achieves only ~8% bandwidth utilization for tiny M=1 GEMMs. Kernel design (csrc/gemm/gemv.cu): - K-split tiled: TILE_N=128, TILE_K=256, Grid=(N/128, K/256)=512 blocks - High occupancy: 512 blocks / 170 SMs = ~3 blocks/SM - Coalesced memory access: adjacent threads read adjacent columns of W - Shared memory for x vector (avoids redundant global reads) - FP32 accumulation via atomicAdd (K-split partial sums) - Separate fp32→bf16 conversion kernel Integration: - matmul() auto-dispatches to custom GEMV when M==1 && dtype==BF16 - Batched decode (M>1) continues to use cuBLAS - Caching allocator provides FP32 temp buffer (pooled, no per-call malloc) Ablation results (dash5, RTX 5090, Qwen3-8B BF16): \| Config \| tok/s \| vs HF (36) \| vs roofline (112) \| \|--------\|-------\|-----------\|-------------------\| \| Phase 14 (cuBLAS M=1) \| 13.2 \| 37% \| 12% \| \| + Custom GEMV (M=1) \| 46.6 \| 130% \| 42% \| \| Concurrent batch=4 \| 28.2 \| 78% \| — \| Single-request throughput now EXCEEDS HuggingFace transformers by 30%. The custom GEMV achieves ~42% of the theoretical roofline (vs 12% before). Note: concurrent batch=4 (28.2 tok/s) is slower than serial (46.6 tok/s) because the per-seq attention/reshape overhead in batched decode outweighs the cuBLAS M=4 benefit when the custom GEMV already handles M=1 efficiently. Engine should prefer serial decode when custom GEMV is available. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-05-22 22:22:31 +08:00
Gahow Wang	876d3f5d6a	phase 15: batched decode forward — 35 tok/s (97% of HF transformers) Implement batched decode that processes multiple sequences' tokens in one forward pass. The key insight: cuBLAS M=4 GEMM is dramatically faster than 4× M=1 GEMV due to better TensorCore utilization and amortized kernel launch overhead. New method Qwen3::forward_decode_batch(&tokens, &positions, &mut caches): - Batched embedding, norm, projections, FFN: [B, hidden] × [hidden, X] → one cuBLAS call per weight matrix instead of B calls - Per-sequence attention: RoPE, KV cache, decode_attention remain per-seq (each has different position and KV length) - Row extraction (row_view) and concatenation (concat_rows) for batched↔per-seq transitions Engine Step 4b: - batch_size >= 2: extracts caches via std::mem::replace, calls forward_decode_batch, restores caches, samples per-sequence - batch_size == 1: falls back to per-seq forward_gpu_cache (no overhead) Ablation results (dash5, RTX 5090, Qwen3-8B BF16): \| Scenario \| Throughput \| vs HF \| \|----------\|-----------\|-------\| \| Serial (batch=1) \| 13.2 tok/s \| 37% \| \| Concurrent (batch=4) \| 35.1 tok/s \| 97% \| \| HF transformers \| 36.0 tok/s \| 100% \| The 2.66x throughput improvement (13.2 → 35.1) for concurrent requests comes from cuBLAS going from 1008 M=1 GEMVs to 252 M=4 GEMMs per step, which cuBLAS handles ~4x more efficiently on TensorCores. Milestone ④ target (50% of vLLM/HF throughput) achieved with 97%. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-05-22 20:07:43 +08:00
Gahow Wang	9783fcf410	phase 15: decode attention kernel + fused silu_mul + fused add_rmsnorm Three performance optimizations targeting decode throughput: 1. Decode Attention Kernel (csrc/attention/flash_attention.cu): - Specialized kernel for Q_len=1 (decode step) - 256 threads parallelize across KV sequence dimension - Online softmax with block-level warp-shuffle reduction - Replaces FA2 kernel which wasted 63/64 threads for decode - flash_attention() auto-dispatches when q_len==1 2. Fused SiLU×Mul (csrc/activation/activations.cu): - Single kernel: out = silu(gate) * up - Saves 1 HBM read + 1 HBM write per FFN layer (N elements) - Eliminates intermediate tensor allocation 3. Fused Add+RMSNorm (csrc/normalization/rmsnorm.cu): - Single kernel: (normed, sum) = (rmsnorm(x+residual), x+residual) - Saves 1 full HBM round-trip per attention block - Eliminates separate add + rmsnorm kernel pair Performance analysis: - At current short sequences (max 79 tokens), these optimizations provide marginal benefit because the bottleneck is cuBLAS GEMV overhead: 252 weight matrix reads × ~32MB each = 15.5 GB per decode step. Theoretical minimum at 1.79 TB/s = 8.7ms, actual ~78ms (9x gap). - The fused kernels and decode attention will show larger gains at longer sequences where attention and element-wise ops dominate. - Next optimization target: CUDA Graphs to eliminate kernel launch overhead, or custom GEMV kernels to replace cuBLAS for M=1. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-05-22 19:40:56 +08:00
Gahow Wang	d67dda404e	phase 14: Flash Attention 2 for SM120 (RTX 5090) Implement Flash Attention 2 forward kernel targeting SM120 (CC 12.0). FA4 requires TMEM (only on data-center Blackwell SM100), so FA2 is the correct target for consumer Blackwell GPUs like the RTX 5090. CUDA kernel (csrc/attention/flash_attention.cu): - Online softmax with tiled Q/K/V — O(1) extra memory, no S×S matrix - Tile sizes: BR=BC=64, head_dim up to 128 (runtime parameter) - BF16 input, FP32 accumulation, BF16 output - Native GQA: kv_head = q_head / (num_q_heads / num_kv_heads) - Causal mask with tile-level skip optimization - Shared memory: 32 KB (Q_tile 16KB + KV_tile 16KB, fits in 48KB default) - Grid: (q_tiles, batch × num_q_heads), Block: 128 threads Integration: - flash_attention() Rust wrapper in xserv-kernels with shape/dtype validation - Qwen3 forward_gpu_cache uses flash_attention directly (no repeat_kv_gpu) - Eliminates repeat_kv memory allocation + copy per layer per step - Naive attention() preserved for testing/comparison Validated on dash5 (RTX 5090, CUDA 12.9): - Correctness: 9/10 top-1 match vs HF (identical to pre-FA baseline) - Throughput: 12.9 tok/s (up from 10.3, +25% improvement) - Now at 35% of HF transformers baseline (up from 30%) Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-05-22 18:27:39 +08:00
Gahow Wang	ee68d3565d	fix: comprehensive review + 14 bug fixes + Phase 12/14 overhaul Strict code review identified 30+ issues across correctness, performance, and architecture. This commit addresses 14 of them with verified fixes, restructures Phase 12 for honest continuous batching, and updates Phase 14 to target FA2 (RTX 5090 SM120 lacks TMEM required by FA4). Bug fixes: - FIX-01: Global cuBLAS handle (thread-local singleton, was per-call) - FIX-02: Remove 19 unnecessary cudaDeviceSynchronize calls from kernels - FIX-03: Qwen3 ChatML template (was plain text concatenation) - FIX-04: EOS token from tokenizer (was hardcoded 151645) - FIX-05: Storage tracks actual GPU device ordinal (was always Cuda(0)) - FIX-06: unsqueeze stride preserves contiguous layout - FIX-08: CudaDeviceProp replaced with heap buffer (was UB-prone padding) - FIX-09: Tokenizer byte_fallback to <0xNN> tokens (was panic) Feature additions: - FIX-10: SSE streaming (/v1/chat/completions, OpenAI-compatible) - FIX-11: Correct usage statistics (prompt/completion/total tokens) - FIX-13: Temperature / top-k / top-p sampling with SamplingParams Performance improvements: - FIX-07: Caching allocator wired up (thread-local pool, pooled flag) - FIX-12: KV cache staging buffers (zero-alloc get_kv_len via borrow_raw) - FIX-14: GPU strided copy kernel (eliminates contiguous() CPU round-trip) Architecture: - Phase 12 engine restructured: prefill/decode separation, honest TODO for batched GPU forward (requires Flash Attention) - Phase 14 updated: FA2 for SM120 (FA4 requires TMEM, absent on 5090) - Qwen3-7B → Qwen3-8B typo fixed across all docs (36 layers, hidden 4096) Validated on dash5 (8x RTX 5090): - 52/52 API prompts pass (EN/CN/code), SSE streaming verified - Logits match HF transformers 9/10 top-1, 4.0/5 avg top-5 overlap - 8 concurrent requests: 5.99x scheduling speedup (batch_size=4) - Throughput: 10.3 tok/s (serial), 30% of HF baseline Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-05-22 17:53:28 +08:00
Gahow Wang	d8493bd70f	phase 12: implement real continuous batching scheduler Rewrote engine.rs from scratch: - Scheduler loop: admit → prefill → decode → finish → check new requests - Multiple sequences run concurrently (max_batch_size configurable) - Each sequence has independent GpuKVCache - Non-blocking try_recv() for new requests during decode iterations - Dynamic join: new requests enter batch immediately, don't wait for others Verified with concurrent test (tools/test_concurrent.py): - 3 concurrent requests: wall_time=3.8s, concurrency_ratio=2.82x ✓ - 5 concurrent requests: wall_time=6.1s, concurrency_ratio=4.04x ✓ - All outputs are coherent and correct Design doc (docs/12-continuous-batching.md) fully rewritten with: - Detailed scheduler loop pseudocode - Data structures (Sequence, Scheduler) - Acceptance criteria with specific test cases - Clear separation from Phase 13 (HTTP layer) Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-05-22 13:44:26 +08:00
Gahow Wang	da043554ba	phase 12+13: HTTP API server with OpenAI-compatible endpoint (Milestone ③) New crate: xserv-server - Engine thread: loads Qwen3-8B, processes requests sequentially - axum HTTP server: /health, /v1/models, /v1/chat/completions - tokio::sync::mpsc channel between API and engine threads - Non-streaming JSON response (streaming SSE to be added later) API is OpenAI-compatible: POST /v1/chat/completions {"messages": [...], "max_tokens": N} → {"choices": [{"message": {"content": "..."}}]} Verified: "Hi" → ", I'm" (3 tokens), model runs correctly via HTTP. Key learnings: - std::sync::mpsc::SyncSender is Send but NOT Sync → wrap in Mutex for Arc<AppState> - MutexGuard must not live across await points (scope carefully) - axum 0.8 Extension<Arc<T>> requires T: Send + Sync Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-05-22 12:55:19 +08:00
Gahow Wang	2be27d6d94	perf: GPU transpose/reshape/repeat_kv kernels (eliminate CPU round-trips) New CUDA kernels (csrc/embedding/transpose.cu): - reshape_heads_bf16: [S, HD] → [1, H, S, D] - merge_heads_bf16: [1, H, S, D] → [S, HD] - transpose_hsd_to_shd_bf16: [1, H, S, D] → [S, H, D] (for RoPE) - transpose_shd_to_hsd_bf16: [S, H, D] → [1, H, S, D] (from RoPE) - repeat_kv_bf16: [1, KV_H, S, D] → [1, KV_H*n_rep, S, D] Rust wrappers (xserv-kernels/src/transpose.rs): - reshape_heads_gpu, merge_heads_gpu, transpose_for/from_rope_gpu, repeat_kv_gpu Qwen3 forward_gpu_cache now uses all GPU kernels — zero CPU data round-trips. Result: 50/50 self-consistent, 3-5% faster (TBT 142→137ms) Remaining bottleneck: ~900 device::synchronize() calls + 252 cuBLAS handle creations per token (Phase 15 targets) Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-05-22 12:01:07 +08:00
Gahow Wang	2d48f25e66	phase 11: GPU-resident KV cache - GpuKVCache: pre-allocated GPU buffers, D2D copy append at offset - Per-head strided layout [num_kv_heads, max_seq_len, head_dim] - Fixed critical bug: seq_len must advance AFTER all layers write (not inside the loop per-layer) - GpuBuffer::copy_from_device_at for offset-based D2D copy - Tensor::from_storage constructor for wrapping raw GPU buffers - Exported Storage and Dims from xserv-tensor Correctness: GPU KV cache vs CPU KV cache = 50/50 bit-identical Performance: ~neutral (KV cache was never the main bottleneck — reshape/merge/transpose CPU round-trips dominate for Qwen3-8B) TTFT: 122ms, TBT: 142ms, 7.0 tok/s (marginal change from 7.3) Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-05-22 11:50:12 +08:00
Gahow Wang	be5c64ea8a	phase 10: GPU add/mul kernels + BF16 precision analysis Kernel additions: - add_f32/bf16, mul_f32/bf16 CUDA kernels (element-wise, on GPU) - Refactored activation.rs with dispatch_unary/dispatch_binary helpers - Qwen3 and GPT-2 now use GPU add/mul instead of CPU round-trips GPT-2 add_bias also moved to GPU (broadcast via tile + GPU add) BF16 precision analysis (docs/benchmarks/phase10-qwen3.md): - Root cause: separate attention kernels materialize BF16 intermediates (QK^T→BF16→scale→BF16→mask→BF16→softmax→BF16 vs HF's fused FP32 path) - HF itself SDPA vs Eager also differs by ~0.125 logit - xserv vs HF: ~1-2 logit systematic offset, but same top-1 in 84% cases - Industry standard for BF16: top-5 overlap (we achieve 100%) - Fix path: Flash Attention (Phase 14) to fuse attention in FP32 Performance: TTFT 138→119ms, TBT 144→137ms (GPU ops faster than CPU) Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-05-22 11:35:26 +08:00
Gahow Wang	268e40d764	phase 10: add Qwen3-8B benchmark + performance fix Benchmark infrastructure: - bench-qwen3 binary: 50 prompts × 20 tokens with KV cache - bench_compare_qwen3.py: comparison against HF transformers (BF16) Performance fix: - Precompute transposed weights at model load time (eliminated per-token weight transpose CPU round-trip: was 252 transposes × 32MB each = 8GB/token) - Result: from "infinite" (>10 min/token) to 144ms/token Results (50 prompts): - Prefill top-1: 42/50 (84%), top-5: 50/50 (100%) vs HF transformers - Greedy sequence: 0/50 exact match (BF16 precision drift over 36 layers) - Performance: TTFT=138ms, TBT=144ms, 6.9 tok/s (HF: 21ms, 45.6 tok/s) - All outputs are coherent English/Chinese Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-05-22 10:25:33 +08:00
Gahow Wang	246ae1c590	phase 10: Qwen3-8B support (Milestone ②) Qwen3 model (qwen3.rs): - RMSNorm + QK normalization (per-head q_norm/k_norm) - GQA: 32 Q heads, 8 KV heads, repeat_kv for attention - SwiGLU FFN: gate_proj → SiLU → * up_proj → down_proj - RoPE with transpose for [1,H,S,D] ↔ [S,H,D] layout - BF16 forward pass, [out,in] weight layout via linear_t - No attention bias (attention_bias=false) Tokenizer fixes: - Fixed unicode_to_byte: shifted bytes now use correct inverse lookup table - MergeEntry supports both string and array formats - Both GPT-2 and Qwen3 tokenizers work correctly (English + Chinese) KVCache refactored: - Dtype-agnostic: stores raw bytes per-head, works for F32 and BF16 - append_kv_tensor/get_kv_tensors use Tensor directly CLI updated: - Auto-detects model type from config.json (gpt2 vs qwen3) - Supports both GPT-2 (F32) and Qwen3 (BF16) Verified: Qwen3-8B generates coherent English and Chinese on single RTX 5090. 61/61 tests pass, GPT-2 performance no regression. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-05-22 00:46:37 +08:00
Gahow Wang	64084d3489	phase 9: KV cache + autoregressive generation - KVCache: per-layer, per-head storage with append + reconstruct - forward_with_cache: prefill (full prompt) + decode (single token) modes - Fixed data layout bug: per-head vectors avoid cross-head interleaving - CLI updated to use KV cache by default - bench-gpt2 supports --no-cache flag for comparison Benchmark results (50 prompts × 20 tokens): - KV cache vs no-cache: 50/50 bit-identical (cache is correct) - 18x speedup: TTFT 400→24ms, TBT 407→22ms, throughput 2.5→44 tok/s - vs HF transformers: 40/50 match (10 are FP divergence, avg logit gap 0.20) Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-05-21 23:39:41 +08:00
Gahow Wang	cb12250ef0	phase 8: add benchmark framework + baseline results - bench-gpt2 binary: runs 50 prompts, measures TTFT/TBT per prompt, outputs JSON - bench_compare.py: compares xserv vs transformers token-by-token + timing - Baseline results: 50/50 correctness, 400ms TTFT / 407ms TBT (100x slower than PyTorch) - Bottlenecks documented: no KV cache, CPU round-trips, cuBLAS handle churn Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-05-21 23:29:41 +08:00
Gahow Wang	e1e75fc7f6	phase 6+7+8: model loading, BPE tokenizer, GPT-2 inference (Milestone ①) Phase 6 — Model Loading (xserv-model): - safetensors parser with single/sharded file support - ModelConfig with dual naming (GPT-2 n_embd/n_head + modern HF naming) - Weight loading flow: safetensors → mmap → CPU Tensor → GPU Phase 7 — BPE Tokenizer (xserv-tokenizer): - Full BPE encode/decode from tokenizer.json - GPT-2 byte-to-unicode mapping (printable ASCII identity + shifted bytes) - Pre-tokenization regex, special token handling - Chat template support structure Phase 8 — GPT-2 Complete Inference: - GPT-2 model definition: wte, wpe, 12 transformer blocks, ln_f - Forward pass: embedding → (LayerNorm → MHA → residual → LayerNorm → MLP → residual) × 12 → LN → logits - QKV split with correct [batch, heads, seq, dim] layout (fixed reshape bug) - Greedy sampling from last-position logits - Interactive CLI: xserv-cli <model-dir> [--max-tokens N] Verified: GPT-2 124M generates coherent English text on RTX 5090. "The future of AI is uncertain. The future of AI is uncertain..." "Once upon a time, the world was a place of great beauty..." Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-05-21 22:04:00 +08:00
Gahow Wang	6035ffdc0b	phase 5: naive multi-head attention - Batched GEMM via cublasGemmStridedBatchedEx - Causal mask CUDA kernel (F32 + BF16) - Element-wise scale CUDA kernel (F32 + BF16) - attention() composing: batched_matmul + scale + causal_mask + softmax - Fixed to_device/contiguous infinite recursion (GPU contiguous via CPU round-trip) - 5 attention tests passing (max_err < 3e-7 F32) - Total: 61 tests passing across all crates Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-05-21 21:17:23 +08:00
Gahow Wang	c8e8153702	phase 4: transformer core kernels CUDA kernels (csrc/): - common.cuh: shared warp_reduce_sum/max, block_reduce_sum/max - normalization/rmsnorm.cu: RMSNorm (F32 + BF16) - normalization/layernorm.cu: LayerNorm with Welford (F32 + BF16) - activation/activations.cu: GELU tanh-approx + SiLU (F32 + BF16) - reduce/softmax.cu: safe softmax, 3-pass (F32 + BF16) - embedding/embedding.cu: gather lookup (F32 + BF16) - embedding/rope.cu: RoPE in-place + precomputed cos/sin cache (F32 + BF16) Rust wrappers (xserv-kernels/src/): - rmsnorm.rs, layernorm.rs, activation.rs, softmax.rs, embedding.rs, rope.rs - RopeCache struct with GPU-side precomputation Tests: 12 new tests (ops_test.rs), all passing with good precision: - F32: max_err 1e-6 ~ 1e-9 - BF16: max_err 2e-3 ~ 7e-3 Total: 29 kernel tests + 27 prior = 56 tests passing Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-05-21 21:07:24 +08:00
Gahow Wang	d77f921a12	phase 3: GEMM kernels (naive, tiled, cuBLAS) - Naive GEMM kernel: one thread per output element (F32 + BF16) - Tiled GEMM kernel: 32x32 shared memory tiles (F32 + BF16) - cuBLAS wrapper: cublasGemmEx with row-major trick - GemmBackend enum for runtime backend selection - CublasContext RAII handle - Made error::check public for cross-crate use - 17 GEMM tests: small/medium/rect sizes, all backends, F32+BF16 - Cross-backend consistency verified (naive vs tiled vs cuBLAS) - All 44 tests pass across all crates Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-05-21 19:48:05 +08:00
Gahow Wang	a83971fa25	phase 2: tensor abstraction layer - DType enum (F32, F16, BF16) with TensorDType trait - Shape utilities: contiguous_strides, broadcast_shape, broadcast_strides - Storage with Arc reference counting (CPU Vec<u8> or GPU GpuBuffer) - Device enum (Cpu, Cuda(id)) with to_device transfer - Tensor type with strided layout: reshape, transpose, squeeze, unsqueeze - contiguous() copies non-contiguous views to contiguous layout - from_slice, zeros, ones constructors - as_slice<T> for typed CPU read access, data_ptr for GPU kernel launch - CPU↔GPU roundtrip verified - All 27 tests pass (12 cuda + 4 shape + 11 tensor) Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-05-21 19:45:22 +08:00
Gahow Wang	c8f7bc0c3c	phase 0+1: fix Rust 2024 edition compat + memory query - unsafe extern "C" blocks (Rust 2024 requirement) - unsafe blocks inside unsafe fn bodies - Use cudaMemGetInfo for accurate GPU memory reporting - Remove cc "cuda" feature (doesn't exist, built-in) - All 12 tests pass on RTX 5090 (CC 12.0, 170 SMs, 32GB) Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-05-21 19:40:49 +08:00
Gahow Wang	9806b4db35	phase 0+1: project scaffold + xserv-cuda crate - Cargo workspace with xserv-cuda crate - CUDA FFI bindings (cudart: memory, stream, device, error) - GpuBuffer RAII wrapper with H2D/D2H/D2D copy - CudaStream wrapper with RAII Drop - CachingAllocator with size-bucketed free lists - PinnedBuffer for page-locked host memory - Device info query via cudaDeviceGetAttribute - Vector-add CUDA kernel smoke test - Integration test suite (11 tests) - build.rs: cc crate compiles .cu for SM 12.0 - sync-and-build.sh for remote build on dash5 - Roadmap doc (docs/00-roadmap.md) and Phase 0+1 design doc Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-05-21 18:40:22 +08:00

1 2

96 Commits