xserv

Files

Gahow Wang 268e40d764 phase 10: add Qwen3-8B benchmark + performance fix

Benchmark infrastructure:
- bench-qwen3 binary: 50 prompts × 20 tokens with KV cache
- bench_compare_qwen3.py: comparison against HF transformers (BF16)

Performance fix:
- Precompute transposed weights at model load time. This eliminates the per-token
  weight transpose and its CPU round-trip (previously 252 transposes × 32 MB each ≈ 8 GB/token)
- Result: per-token latency drops from effectively unusable (>10 min/token) to 144 ms/token
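
The fix above amounts to moving a per-token transform to load time. A minimal Python sketch of that idea follows, using plain nested lists in place of GPU tensors. The `Linear` class and `transpose` helper are hypothetical names for illustration and are not taken from the xserv source; only the 252 and 32 MB figures come from the commit message.

```python
def transpose(matrix):
    """Return the transpose of a row-major nested-list matrix."""
    return [list(col) for col in zip(*matrix)]


class Linear:
    """Hypothetical layer that stores W^T once, at load time."""

    def __init__(self, weight):
        # One-time cost at model load instead of once per generated token.
        self.weight_t = transpose(weight)
        self.transposes_done = 1

    def forward(self, x):
        # x has length in_features; weight_t is in_features x out_features.
        # Computes y = W @ x using the cached transposed layout.
        out_features = len(self.weight_t[0])
        return [
            sum(x[i] * self.weight_t[i][j] for i in range(len(x)))
            for j in range(out_features)
        ]


# Data movement removed per token, using the figures from the commit message.
TRANSPOSES_PER_TOKEN = 252
MB_PER_TRANSPOSE = 32
saved_mb_per_token = TRANSPOSES_PER_TOKEN * MB_PER_TRANSPOSE  # 8064 MB, about 8 GB

layer = Linear([[1, 2, 3], [4, 5, 6]])  # W is 2 x 3 (out_features x in_features)
y = layer.forward([1, 0, 1])            # [1 + 3, 4 + 6]
```

However many times `forward` is called, the transpose runs once per layer, which is the property the commit relies on.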

Results (50 prompts):
- Prefill top-1: 42/50 (84%), top-5: 50/50 (100%) vs HF transformers
- Greedy sequence: 0/50 exact match (BF16 precision drift over 36 layers)
- Performance: TTFT=138ms, TBT=144ms, 6.9 tok/s (HF: 21ms, 45.6 tok/s)
- All outputs are coherent English/Chinese
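
For readers unfamiliar with the metrics above, here is a rough sketch of how a top-k agreement count and a tokens-per-second figure can be derived. It assumes the reference token for each prompt is the HF top-1 token and that throughput is the reciprocal of TBT. The function names and toy data are hypothetical, and `bench_compare_qwen3.py` may compute these differently.

```python
def topk_agreement(reference_ids, candidate_logits, k):
    """Count prompts whose reference token is within the candidate's top-k."""
    hits = 0
    for ref_id, logits in zip(reference_ids, candidate_logits):
        ranked = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)
        if ref_id in ranked[:k]:
            hits += 1
    return hits


def tokens_per_second(tbt_ms):
    """Convert time-between-tokens in milliseconds to decode throughput."""
    return 1000.0 / tbt_ms


# Toy data: 3 prompts over a 4-token vocabulary (illustrative values only).
reference_ids = [2, 0, 1]
candidate_logits = [
    [0.1, 0.2, 0.9, 0.0],  # argmax is 2: top-1 hit
    [0.5, 0.6, 0.1, 0.0],  # argmax is 1, token 0 ranks second: top-2 hit only
    [0.3, 0.1, 0.2, 0.9],  # token 1 ranks last: miss for k <= 3
]
top1 = topk_agreement(reference_ids, candidate_logits, k=1)
top2 = topk_agreement(reference_ids, candidate_logits, k=2)
throughput = round(tokens_per_second(144), 1)  # TBT from the results above
```

With TBT = 144 ms this gives 6.9 tok/s, matching the reported figure.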

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

2026-05-22 10:25:33 +08:00

benchmarks

phase 10: add Qwen3-8B benchmark + performance fix

2026-05-22 10:25:33 +08:00

00-roadmap.md

phase 0+1: project scaffold + xserv-cuda crate

2026-05-21 18:40:22 +08:00

01-cuda-ffi.md

docs: add design docs + takeaways for Phase 2 and Phase 3

2026-05-21 20:59:45 +08:00

02-tensor.md

docs: add design docs + takeaways for Phase 2 and Phase 3