xserv

Author	SHA1	Message	Date
Gahow Wang	268e40d764	phase 10: add Qwen3-8B benchmark + performance fix Benchmark infrastructure: - bench-qwen3 binary: 50 prompts × 20 tokens with KV cache - bench_compare_qwen3.py: comparison against HF transformers (BF16) Performance fix: - Precompute transposed weights at model load time (eliminated per-token weight transpose CPU round-trip: was 252 transposes × 32MB each = 8GB/token) - Result: from "infinite" (>10 min/token) to 144ms/token Results (50 prompts): - Prefill top-1: 42/50 (84%), top-5: 50/50 (100%) vs HF transformers - Greedy sequence: 0/50 exact match (BF16 precision drift over 36 layers) - Performance: TTFT=138ms, TBT=144ms, 6.9 tok/s (HF: 21ms, 45.6 tok/s) - All outputs are coherent English/Chinese Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-05-22 10:25:33 +08:00
Gahow Wang	64084d3489	phase 9: KV cache + autoregressive generation - KVCache: per-layer, per-head storage with append + reconstruct - forward_with_cache: prefill (full prompt) + decode (single token) modes - Fixed data layout bug: per-head vectors avoid cross-head interleaving - CLI updated to use KV cache by default - bench-gpt2 supports --no-cache flag for comparison Benchmark results (50 prompts × 20 tokens): - KV cache vs no-cache: 50/50 bit-identical (cache is correct) - 18x speedup: TTFT 400→24ms, TBT 407→22ms, throughput 2.5→44 tok/s - vs HF transformers: 40/50 match (10 are FP divergence, avg logit gap 0.20) Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-05-21 23:39:41 +08:00
Gahow Wang	cb12250ef0	phase 8: add benchmark framework + baseline results - bench-gpt2 binary: runs 50 prompts, measures TTFT/TBT per prompt, outputs JSON - bench_compare.py: compares xserv vs transformers token-by-token + timing - Baseline results: 50/50 correctness, 400ms TTFT / 407ms TBT (100x slower than PyTorch) - Bottlenecks documented: no KV cache, CPU round-trips, cuBLAS handle churn Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-05-21 23:29:41 +08:00
Gahow Wang	9806b4db35	phase 0+1: project scaffold + xserv-cuda crate - Cargo workspace with xserv-cuda crate - CUDA FFI bindings (cudart: memory, stream, device, error) - GpuBuffer RAII wrapper with H2D/D2H/D2D copy - CudaStream wrapper with RAII Drop - CachingAllocator with size-bucketed free lists - PinnedBuffer for page-locked host memory - Device info query via cudaDeviceGetAttribute - Vector-add CUDA kernel smoke test - Integration test suite (11 tests) - build.rs: cc crate compiles .cu for SM 12.0 - sync-and-build.sh for remote build on dash5 - Roadmap doc (docs/00-roadmap.md) and Phase 0+1 design doc Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-05-21 18:40:22 +08:00

4 Commits