13 Commits

Author SHA1 Message Date
be5c64ea8a phase 10: GPU add/mul kernels + BF16 precision analysis
Kernel additions:
- add_f32/bf16, mul_f32/bf16 CUDA kernels (element-wise, on GPU)
- Refactored activation.rs with dispatch_unary/dispatch_binary helpers
- Qwen3 and GPT-2 now use GPU add/mul instead of CPU round-trips

GPT-2 add_bias also moved to GPU (broadcast via tile + GPU add)

BF16 precision analysis (docs/benchmarks/phase10-qwen3.md):
- Root cause: separate attention kernels materialize BF16 intermediates
  (QK^T→BF16→scale→BF16→mask→BF16→softmax→BF16 vs HF's fused FP32 path)
- HF itself SDPA vs Eager also differs by ~0.125 logit
- xserv vs HF: ~1-2 logit systematic offset, but same top-1 in 84% cases
- Industry standard for BF16: top-5 overlap (we achieve 100%)
- Fix path: Flash Attention (Phase 14) to fuse attention in FP32

Performance: TTFT 138→119ms, TBT 144→137ms (GPU ops faster than CPU)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
phase10
2026-05-22 11:35:26 +08:00
268e40d764 phase 10: add Qwen3-8B benchmark + performance fix
Benchmark infrastructure:
- bench-qwen3 binary: 50 prompts × 20 tokens with KV cache
- bench_compare_qwen3.py: comparison against HF transformers (BF16)

Performance fix:
- Precompute transposed weights at model load time (eliminated per-token
  weight transpose CPU round-trip: was 252 transposes × 32MB each = 8GB/token)
- Result: from "infinite" (>10 min/token) to 144ms/token

Results (50 prompts):
- Prefill top-1: 42/50 (84%), top-5: 50/50 (100%) vs HF transformers
- Greedy sequence: 0/50 exact match (BF16 precision drift over 36 layers)
- Performance: TTFT=138ms, TBT=144ms, 6.9 tok/s (HF: 21ms, 45.6 tok/s)
- All outputs are coherent English/Chinese

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-05-22 10:25:33 +08:00
246ae1c590 phase 10: Qwen3-8B support (Milestone ②)
Qwen3 model (qwen3.rs):
- RMSNorm + QK normalization (per-head q_norm/k_norm)
- GQA: 32 Q heads, 8 KV heads, repeat_kv for attention
- SwiGLU FFN: gate_proj → SiLU → * up_proj → down_proj
- RoPE with transpose for [1,H,S,D] ↔ [S,H,D] layout
- BF16 forward pass, [out,in] weight layout via linear_t
- No attention bias (attention_bias=false)

Tokenizer fixes:
- Fixed unicode_to_byte: shifted bytes now use correct inverse lookup table
- MergeEntry supports both string and array formats
- Both GPT-2 and Qwen3 tokenizers work correctly (English + Chinese)

KVCache refactored:
- Dtype-agnostic: stores raw bytes per-head, works for F32 and BF16
- append_kv_tensor/get_kv_tensors use Tensor directly

CLI updated:
- Auto-detects model type from config.json (gpt2 vs qwen3)
- Supports both GPT-2 (F32) and Qwen3 (BF16)

Verified: Qwen3-8B generates coherent English and Chinese on single RTX 5090.
61/61 tests pass, GPT-2 performance no regression.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-05-22 00:46:37 +08:00
64084d3489 phase 9: KV cache + autoregressive generation
- KVCache: per-layer, per-head storage with append + reconstruct
- forward_with_cache: prefill (full prompt) + decode (single token) modes
- Fixed data layout bug: per-head vectors avoid cross-head interleaving
- CLI updated to use KV cache by default
- bench-gpt2 supports --no-cache flag for comparison

Benchmark results (50 prompts × 20 tokens):
- KV cache vs no-cache: 50/50 bit-identical (cache is correct)
- 18x speedup: TTFT 400→24ms, TBT 407→22ms, throughput 2.5→44 tok/s
- vs HF transformers: 40/50 match (10 are FP divergence, avg logit gap 0.20)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
phase9
2026-05-21 23:39:41 +08:00
cb12250ef0 phase 8: add benchmark framework + baseline results
- bench-gpt2 binary: runs 50 prompts, measures TTFT/TBT per prompt, outputs JSON
- bench_compare.py: compares xserv vs transformers token-by-token + timing
- Baseline results: 50/50 correctness, 400ms TTFT / 407ms TBT (100x slower than PyTorch)
- Bottlenecks documented: no KV cache, CPU round-trips, cuBLAS handle churn

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-05-21 23:29:41 +08:00
e1e75fc7f6 phase 6+7+8: model loading, BPE tokenizer, GPT-2 inference (Milestone ①)
Phase 6 — Model Loading (xserv-model):
- safetensors parser with single/sharded file support
- ModelConfig with dual naming (GPT-2 n_embd/n_head + modern HF naming)
- Weight loading flow: safetensors → mmap → CPU Tensor → GPU

Phase 7 — BPE Tokenizer (xserv-tokenizer):
- Full BPE encode/decode from tokenizer.json
- GPT-2 byte-to-unicode mapping (printable ASCII identity + shifted bytes)
- Pre-tokenization regex, special token handling
- Chat template support structure

Phase 8 — GPT-2 Complete Inference:
- GPT-2 model definition: wte, wpe, 12 transformer blocks, ln_f
- Forward pass: embedding → (LayerNorm → MHA → residual → LayerNorm → MLP → residual) × 12 → LN → logits
- QKV split with correct [batch, heads, seq, dim] layout (fixed reshape bug)
- Greedy sampling from last-position logits
- Interactive CLI: xserv-cli <model-dir> [--max-tokens N]

Verified: GPT-2 124M generates coherent English text on RTX 5090.
"The future of AI is uncertain. The future of AI is uncertain..."
"Once upon a time, the world was a place of great beauty..."

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
phase8
2026-05-21 22:04:00 +08:00
6035ffdc0b phase 5: naive multi-head attention
- Batched GEMM via cublasGemmStridedBatchedEx
- Causal mask CUDA kernel (F32 + BF16)
- Element-wise scale CUDA kernel (F32 + BF16)
- attention() composing: batched_matmul + scale + causal_mask + softmax
- Fixed to_device/contiguous infinite recursion (GPU contiguous via CPU round-trip)
- 5 attention tests passing (max_err < 3e-7 F32)
- Total: 61 tests passing across all crates

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
phase5
2026-05-21 21:17:23 +08:00
c8e8153702 phase 4: transformer core kernels
CUDA kernels (csrc/):
- common.cuh: shared warp_reduce_sum/max, block_reduce_sum/max
- normalization/rmsnorm.cu: RMSNorm (F32 + BF16)
- normalization/layernorm.cu: LayerNorm with Welford (F32 + BF16)
- activation/activations.cu: GELU tanh-approx + SiLU (F32 + BF16)
- reduce/softmax.cu: safe softmax, 3-pass (F32 + BF16)
- embedding/embedding.cu: gather lookup (F32 + BF16)
- embedding/rope.cu: RoPE in-place + precomputed cos/sin cache (F32 + BF16)

Rust wrappers (xserv-kernels/src/):
- rmsnorm.rs, layernorm.rs, activation.rs, softmax.rs, embedding.rs, rope.rs
- RopeCache struct with GPU-side precomputation

Tests: 12 new tests (ops_test.rs), all passing with good precision:
- F32: max_err 1e-6 ~ 1e-9
- BF16: max_err 2e-3 ~ 7e-3
Total: 29 kernel tests + 27 prior = 56 tests passing

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
phase4
2026-05-21 21:07:24 +08:00
51a0f2eb14 docs: add design docs + takeaways for Phase 2 and Phase 3
- docs/01-cuda-ffi.md: added takeaways (struct layout pitfall,
  Rust 2024 unsafe changes, caching allocator strategy, etc.)
- docs/02-tensor.md: design doc + takeaways for tensor abstraction
- docs/03-gemm.md: design doc + takeaways for GEMM kernels

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-05-21 20:59:45 +08:00
d77f921a12 phase 3: GEMM kernels (naive, tiled, cuBLAS)
- Naive GEMM kernel: one thread per output element (F32 + BF16)
- Tiled GEMM kernel: 32x32 shared memory tiles (F32 + BF16)
- cuBLAS wrapper: cublasGemmEx with row-major trick
- GemmBackend enum for runtime backend selection
- CublasContext RAII handle
- Made error::check public for cross-crate use
- 17 GEMM tests: small/medium/rect sizes, all backends, F32+BF16
- Cross-backend consistency verified (naive vs tiled vs cuBLAS)
- All 44 tests pass across all crates

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
phase3
2026-05-21 19:48:05 +08:00
a83971fa25 phase 2: tensor abstraction layer
- DType enum (F32, F16, BF16) with TensorDType trait
- Shape utilities: contiguous_strides, broadcast_shape, broadcast_strides
- Storage with Arc reference counting (CPU Vec<u8> or GPU GpuBuffer)
- Device enum (Cpu, Cuda(id)) with to_device transfer
- Tensor type with strided layout: reshape, transpose, squeeze, unsqueeze
- contiguous() copies non-contiguous views to contiguous layout
- from_slice, zeros, ones constructors
- as_slice<T> for typed CPU read access, data_ptr for GPU kernel launch
- CPU↔GPU roundtrip verified
- All 27 tests pass (12 cuda + 4 shape + 11 tensor)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
phase2
2026-05-21 19:45:22 +08:00
c8f7bc0c3c phase 0+1: fix Rust 2024 edition compat + memory query
- unsafe extern "C" blocks (Rust 2024 requirement)
- unsafe blocks inside unsafe fn bodies
- Use cudaMemGetInfo for accurate GPU memory reporting
- Remove cc "cuda" feature (doesn't exist, built-in)
- All 12 tests pass on RTX 5090 (CC 12.0, 170 SMs, 32GB)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
phase0-1
2026-05-21 19:40:49 +08:00
9806b4db35 phase 0+1: project scaffold + xserv-cuda crate
- Cargo workspace with xserv-cuda crate
- CUDA FFI bindings (cudart: memory, stream, device, error)
- GpuBuffer RAII wrapper with H2D/D2H/D2D copy
- CudaStream wrapper with RAII Drop
- CachingAllocator with size-bucketed free lists
- PinnedBuffer for page-locked host memory
- Device info query via cudaDeviceGetAttribute
- Vector-add CUDA kernel smoke test
- Integration test suite (11 tests)
- build.rs: cc crate compiles .cu for SM 12.0
- sync-and-build.sh for remote build on dash5
- Roadmap doc (docs/00-roadmap.md) and Phase 0+1 design doc

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-05-21 18:40:22 +08:00