xserv

Author	SHA1	Message	Date
Gahow Wang	be5c64ea8a	phase 10: GPU add/mul kernels + BF16 precision analysis Kernel additions: - add_f32/bf16, mul_f32/bf16 CUDA kernels (element-wise, on GPU) - Refactored activation.rs with dispatch_unary/dispatch_binary helpers - Qwen3 and GPT-2 now use GPU add/mul instead of CPU round-trips GPT-2 add_bias also moved to GPU (broadcast via tile + GPU add) BF16 precision analysis (docs/benchmarks/phase10-qwen3.md): - Root cause: separate attention kernels materialize BF16 intermediates (QK^T→BF16→scale→BF16→mask→BF16→softmax→BF16 vs HF's fused FP32 path) - HF's own SDPA and Eager paths also differ by ~0.125 logit - xserv vs HF: ~1-2 logit systematic offset, but same top-1 in 84% of cases - Industry standard for BF16: top-5 overlap (we achieve 100%) - Fix path: Flash Attention (Phase 14) to fuse attention in FP32 Performance: TTFT 138→119ms, TBT 144→137ms (GPU ops faster than CPU) Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> phase10	2026-05-22 11:35:26 +08:00
Gahow Wang	268e40d764	phase 10: add Qwen3-8B benchmark + performance fix Benchmark infrastructure: - bench-qwen3 binary: 50 prompts × 20 tokens with KV cache - bench_compare_qwen3.py: comparison against HF transformers (BF16) Performance fix: - Precompute transposed weights at model load time (eliminated per-token weight transpose CPU round-trip: was 252 transposes × 32MB each = 8GB/token) - Result: from "infinite" (>10 min/token) to 144ms/token Results (50 prompts): - Prefill top-1: 42/50 (84%), top-5: 50/50 (100%) vs HF transformers - Greedy sequence: 0/50 exact match (BF16 precision drift over 36 layers) - Performance: TTFT=138ms, TBT=144ms, 6.9 tok/s (HF: 21ms, 45.6 tok/s) - All outputs are coherent English/Chinese Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-05-22 10:25:33 +08:00
Gahow Wang	246ae1c590	phase 10: Qwen3-8B support (Milestone ②) Qwen3 model (qwen3.rs): - RMSNorm + QK normalization (per-head q_norm/k_norm) - GQA: 32 Q heads, 8 KV heads, repeat_kv for attention - SwiGLU FFN: SiLU(gate_proj(x)) * up_proj(x) → down_proj - RoPE with transpose for [1,H,S,D] ↔ [S,H,D] layout - BF16 forward pass, [out,in] weight layout via linear_t - No attention bias (attention_bias=false) Tokenizer fixes: - Fixed unicode_to_byte: shifted bytes now use correct inverse lookup table - MergeEntry supports both string and array formats - Both GPT-2 and Qwen3 tokenizers work correctly (English + Chinese) KVCache refactored: - Dtype-agnostic: stores raw bytes per-head, works for F32 and BF16 - append_kv_tensor/get_kv_tensors use Tensor directly CLI updated: - Auto-detects model type from config.json (gpt2 vs qwen3) - Supports both GPT-2 (F32) and Qwen3 (BF16) Verified: Qwen3-8B generates coherent English and Chinese on a single RTX 5090. 61/61 tests pass; no GPT-2 performance regression. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-05-22 00:46:37 +08:00
Gahow Wang	64084d3489	phase 9: KV cache + autoregressive generation - KVCache: per-layer, per-head storage with append + reconstruct - forward_with_cache: prefill (full prompt) + decode (single token) modes - Fixed data layout bug: per-head vectors avoid cross-head interleaving - CLI updated to use KV cache by default - bench-gpt2 supports --no-cache flag for comparison Benchmark results (50 prompts × 20 tokens): - KV cache vs no-cache: 50/50 bit-identical (cache is correct) - 18x speedup: TTFT 400→24ms, TBT 407→22ms, throughput 2.5→44 tok/s - vs HF transformers: 40/50 match (10 are FP divergence, avg logit gap 0.20) Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> phase9	2026-05-21 23:39:41 +08:00
Gahow Wang	cb12250ef0	phase 8: add benchmark framework + baseline results - bench-gpt2 binary: runs 50 prompts, measures TTFT/TBT per prompt, outputs JSON - bench_compare.py: compares xserv vs transformers token-by-token + timing - Baseline results: 50/50 correctness, 400ms TTFT / 407ms TBT (100x slower than PyTorch) - Bottlenecks documented: no KV cache, CPU round-trips, cuBLAS handle churn Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-05-21 23:29:41 +08:00
Gahow Wang	e1e75fc7f6	phase 6+7+8: model loading, BPE tokenizer, GPT-2 inference (Milestone ①) Phase 6 — Model Loading (xserv-model): - safetensors parser with single/sharded file support - ModelConfig with dual naming (GPT-2 n_embd/n_head + modern HF naming) - Weight loading flow: safetensors → mmap → CPU Tensor → GPU Phase 7 — BPE Tokenizer (xserv-tokenizer): - Full BPE encode/decode from tokenizer.json - GPT-2 byte-to-unicode mapping (printable ASCII identity + shifted bytes) - Pre-tokenization regex, special token handling - Chat template support structure Phase 8 — GPT-2 Complete Inference: - GPT-2 model definition: wte, wpe, 12 transformer blocks, ln_f - Forward pass: embedding → (LayerNorm → MHA → residual → LayerNorm → MLP → residual) × 12 → LN → logits - QKV split with correct [batch, heads, seq, dim] layout (fixed reshape bug) - Greedy sampling from last-position logits - Interactive CLI: xserv-cli <model-dir> [--max-tokens N] Verified: GPT-2 124M generates coherent English text on RTX 5090. "The future of AI is uncertain. The future of AI is uncertain..." "Once upon a time, the world was a place of great beauty..." Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> phase8	2026-05-21 22:04:00 +08:00
Gahow Wang	6035ffdc0b	phase 5: naive multi-head attention - Batched GEMM via cublasGemmStridedBatchedEx - Causal mask CUDA kernel (F32 + BF16) - Element-wise scale CUDA kernel (F32 + BF16) - attention() composing: batched_matmul + scale + causal_mask + softmax - Fixed to_device/contiguous infinite recursion (GPU contiguous via CPU round-trip) - 5 attention tests passing (max_err < 3e-7 F32) - Total: 61 tests passing across all crates Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> phase5	2026-05-21 21:17:23 +08:00
Gahow Wang	c8e8153702	phase 4: transformer core kernels CUDA kernels (csrc/): - common.cuh: shared warp_reduce_sum/max, block_reduce_sum/max - normalization/rmsnorm.cu: RMSNorm (F32 + BF16) - normalization/layernorm.cu: LayerNorm with Welford (F32 + BF16) - activation/activations.cu: GELU tanh-approx + SiLU (F32 + BF16) - reduce/softmax.cu: safe softmax, 3-pass (F32 + BF16) - embedding/embedding.cu: gather lookup (F32 + BF16) - embedding/rope.cu: RoPE in-place + precomputed cos/sin cache (F32 + BF16) Rust wrappers (xserv-kernels/src/): - rmsnorm.rs, layernorm.rs, activation.rs, softmax.rs, embedding.rs, rope.rs - RopeCache struct with GPU-side precomputation Tests: 12 new tests (ops_test.rs), all passing with good precision: - F32: max_err 1e-6 ~ 1e-9 - BF16: max_err 2e-3 ~ 7e-3 Total: 29 kernel tests + 27 prior = 56 tests passing Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> phase4	2026-05-21 21:07:24 +08:00
Gahow Wang	51a0f2eb14	docs: add design docs + takeaways for Phase 2 and Phase 3 - docs/01-cuda-ffi.md: added takeaways (struct layout pitfall, Rust 2024 unsafe changes, caching allocator strategy, etc.) - docs/02-tensor.md: design doc + takeaways for tensor abstraction - docs/03-gemm.md: design doc + takeaways for GEMM kernels Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-05-21 20:59:45 +08:00
Gahow Wang	d77f921a12	phase 3: GEMM kernels (naive, tiled, cuBLAS) - Naive GEMM kernel: one thread per output element (F32 + BF16) - Tiled GEMM kernel: 32x32 shared memory tiles (F32 + BF16) - cuBLAS wrapper: cublasGemmEx with row-major trick - GemmBackend enum for runtime backend selection - CublasContext RAII handle - Made error::check public for cross-crate use - 17 GEMM tests: small/medium/rect sizes, all backends, F32+BF16 - Cross-backend consistency verified (naive vs tiled vs cuBLAS) - All 44 tests pass across all crates Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> phase3	2026-05-21 19:48:05 +08:00
Gahow Wang	a83971fa25	phase 2: tensor abstraction layer - DType enum (F32, F16, BF16) with TensorDType trait - Shape utilities: contiguous_strides, broadcast_shape, broadcast_strides - Storage with Arc reference counting (CPU Vec<u8> or GPU GpuBuffer) - Device enum (Cpu, Cuda(id)) with to_device transfer - Tensor type with strided layout: reshape, transpose, squeeze, unsqueeze - contiguous() copies non-contiguous views to contiguous layout - from_slice, zeros, ones constructors - as_slice<T> for typed CPU read access, data_ptr for GPU kernel launch - CPU↔GPU roundtrip verified - All 27 tests pass (12 cuda + 4 shape + 11 tensor) Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> phase2	2026-05-21 19:45:22 +08:00
Gahow Wang	c8f7bc0c3c	phase 0+1: fix Rust 2024 edition compat + memory query - unsafe extern "C" blocks (Rust 2024 requirement) - unsafe blocks inside unsafe fn bodies - Use cudaMemGetInfo for accurate GPU memory reporting - Remove cc "cuda" feature (doesn't exist, built-in) - All 12 tests pass on RTX 5090 (CC 12.0, 170 SMs, 32GB) Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> phase0-1	2026-05-21 19:40:49 +08:00
Gahow Wang	9806b4db35	phase 0+1: project scaffold + xserv-cuda crate - Cargo workspace with xserv-cuda crate - CUDA FFI bindings (cudart: memory, stream, device, error) - GpuBuffer RAII wrapper with H2D/D2H/D2D copy - CudaStream wrapper with RAII Drop - CachingAllocator with size-bucketed free lists - PinnedBuffer for page-locked host memory - Device info query via cudaDeviceGetAttribute - Vector-add CUDA kernel smoke test - Integration test suite (11 tests) - build.rs: cc crate compiles .cu for SM 12.0 - sync-and-build.sh for remote build on dash5 - Roadmap doc (docs/00-roadmap.md) and Phase 0+1 design doc Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-05-21 18:40:22 +08:00
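
Commit be5c64ea8a attributes the logit offset against HF to attention kernels that write a BF16 intermediate after every stage instead of keeping one fused FP32 path. The CPU-only sketch below illustrates that effect on a single row of attention scores. It is not code from the repository: `round_to_bf16`, `scaled_softmax`, the sample scores, and the head dimension of 128 are all assumptions chosen for illustration, and the causal mask stage is left out.

```rust
/// Round an f32 to the nearest BF16 value (round-to-nearest-even) and widen
/// it back to f32. NaN handling is omitted for brevity.
fn round_to_bf16(x: f32) -> f32 {
    let bits = x.to_bits();
    // Add 0x7FFF plus the lowest kept bit so that exact ties round to even.
    let round_bias = 0x7FFF + ((bits >> 16) & 1);
    f32::from_bits(bits.wrapping_add(round_bias) & 0xFFFF_0000)
}

/// Pass-through used to model a fused path with no intermediate rounding.
fn identity(x: f32) -> f32 {
    x
}

/// Scaled softmax over one row of scores. `round` is applied wherever a
/// separate kernel would write its result out to memory.
fn scaled_softmax(scores: &[f32], scale: f32, round: fn(f32) -> f32) -> Vec<f32> {
    // Stage 1: the QK^T result is stored.
    let stored: Vec<f32> = scores.iter().map(|&s| round(s)).collect();
    // Stage 2: the scale kernel stores its output.
    let scaled: Vec<f32> = stored.iter().map(|&s| round(s * scale)).collect();
    // Stage 3: safe softmax (subtract the row max), output stored once more.
    let max = scaled.iter().cloned().fold(f32::NEG_INFINITY, f32::max);
    let exps: Vec<f32> = scaled.iter().map(|&s| (s - max).exp()).collect();
    let sum: f32 = exps.iter().sum();
    exps.iter().map(|&e| round(e / sum)).collect()
}

/// Largest element-wise absolute difference between two rows.
fn max_abs_diff(a: &[f32], b: &[f32]) -> f32 {
    a.iter()
        .zip(b.iter())
        .map(|(x, y)| (x - y).abs())
        .fold(0.0, f32::max)
}

fn main() {
    // Hypothetical attention scores and head dimension.
    let scores = [12.3456_f32, 11.9876, 3.14159, -7.5];
    let scale = 1.0 / 128.0_f32.sqrt();

    let fused = scaled_softmax(&scores, scale, identity);
    let staged = scaled_softmax(&scores, scale, round_to_bf16);
    println!("max |fused - staged| = {:e}", max_abs_diff(&fused, &staged));
}
```

A single row only shows a small per-element difference. The commit's point is that this rounding is repeated at every stage of every layer, which is why it proposes fusing attention in FP32.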

13 Commits
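
Commits e1e75fc7f6 and 246ae1c590 mention the GPT-2 byte-to-unicode mapping (printable bytes map to themselves, the rest are shifted) and a fix to the inverse `unicode_to_byte` lookup. As a rough illustration of that scheme, here is a minimal standalone sketch. It follows the publicly known GPT-2 convention rather than the xserv-tokenizer source; the helper names `is_identity_byte`, `encode_bytes`, and `decode_bytes` are hypothetical.

```rust
use std::collections::HashMap;

/// Byte ranges that GPT-2 maps to themselves: '!'..='~', '¡'..='¬', '®'..='ÿ'.
fn is_identity_byte(b: u8) -> bool {
    matches!(b, 0x21..=0x7E | 0xA1..=0xAC | 0xAE..=0xFF)
}

/// Forward table: every byte gets a printable code point. Bytes outside the
/// identity ranges are assigned U+0100, U+0101, ... in ascending byte order.
fn byte_to_unicode() -> [char; 256] {
    let mut table = ['\0'; 256];
    let mut shifted = 0u32;
    for b in 0..=255u8 {
        table[b as usize] = if is_identity_byte(b) {
            char::from(b)
        } else {
            let c = char::from_u32(256 + shifted).expect("valid scalar value");
            shifted += 1;
            c
        };
    }
    table
}

/// Inverse table built directly from the forward table, so shifted bytes
/// resolve to their original byte value rather than to `code point - 256`.
fn unicode_to_byte(table: &[char; 256]) -> HashMap<char, u8> {
    table.iter().enumerate().map(|(b, &c)| (c, b as u8)).collect()
}

/// Map the UTF-8 bytes of `text` to their printable stand-ins.
fn encode_bytes(text: &str, table: &[char; 256]) -> String {
    text.bytes().map(|b| table[b as usize]).collect()
}

/// Undo `encode_bytes`; returns None for unknown characters or invalid UTF-8.
fn decode_bytes(mapped: &str, inverse: &HashMap<char, u8>) -> Option<String> {
    let bytes: Option<Vec<u8>> = mapped.chars().map(|c| inverse.get(&c).copied()).collect();
    String::from_utf8(bytes?).ok()
}

fn main() {
    let table = byte_to_unicode();
    let inverse = unicode_to_byte(&table);
    let mapped = encode_bytes("hi 你好", &table);
    let restored = decode_bytes(&mapped, &inverse);
    println!("{mapped} -> {restored:?}");
}
```

Building the inverse from the forward table, instead of recomputing it with arithmetic, keeps the two directions consistent by construction. That is one plausible way to avoid the kind of shifted-byte bug the commit describes.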