xserv

Gahow Wang 268e40d764 phase 10: add Qwen3-8B benchmark + performance fix

Benchmark infrastructure:
- bench-qwen3 binary: 50 prompts × 20 tokens with KV cache
- bench_compare_qwen3.py: comparison against HF transformers (BF16)
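
A minimal sketch of how per-token timings from a run like this might be reduced to time to first token (TTFT), time between tokens (TBT), and throughput. The names `time_steps`, `summarize`, and `Latency` are hypothetical and not taken from bench-qwen3; the sketch assumes TTFT is the first step's latency, TBT is the mean of the remaining steps, and tok/s is the reciprocal of TBT.

```rust
use std::time::{Duration, Instant};

/// Summary of one generation run (hypothetical type, for illustration only).
#[derive(Debug, PartialEq)]
struct Latency {
    ttft_ms: f64,
    tbt_ms: f64,
    tokens_per_sec: f64,
}

/// Runs `step` once per generated token and records each call's wall-clock time.
/// `step(0)` stands in for prefill plus the first token, later calls for decode steps.
fn time_steps<F: FnMut(usize)>(n_tokens: usize, mut step: F) -> Vec<Duration> {
    (0..n_tokens)
        .map(|i| {
            let start = Instant::now();
            step(i);
            start.elapsed()
        })
        .collect()
}

/// Reduces per-step timings to TTFT, mean TBT, and decode throughput.
/// Returns `None` when fewer than two steps were recorded.
fn summarize(step_times: &[Duration]) -> Option<Latency> {
    let (first, rest) = step_times.split_first()?;
    if rest.is_empty() {
        return None;
    }
    let decode_total: Duration = rest.iter().sum();
    let tbt_ms = decode_total.as_secs_f64() * 1e3 / rest.len() as f64;
    Some(Latency {
        ttft_ms: first.as_secs_f64() * 1e3,
        tbt_ms,
        tokens_per_sec: 1e3 / tbt_ms,
    })
}
```

Under these assumptions a 144ms TBT corresponds to roughly 6.9 tok/s.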

Performance fix:
- Precompute transposed weights at model load time, eliminating the per-token
  weight-transpose CPU round-trip (previously 252 transposes × 32MB each ≈ 8GB/token)
- Result: per-token latency drops from effectively unbounded (>10 min/token) to 144ms/token
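
The idea behind the fix can be sketched as follows: transpose each weight once when the model is loaded, and keep only the transposed copy for the forward pass. This is an illustrative CPU-only sketch, not the code from this commit; `Matrix`, `Linear`, `from_checkpoint`, and `forward` are hypothetical names, and the checkpoint layout is assumed to be `[out_features, in_features]`.

```rust
/// Row-major f32 matrix (a stand-in for the real tensor type).
#[derive(Debug, Clone, PartialEq)]
struct Matrix {
    rows: usize,
    cols: usize,
    data: Vec<f32>,
}

impl Matrix {
    fn new(rows: usize, cols: usize, data: Vec<f32>) -> Self {
        assert_eq!(data.len(), rows * cols, "shape does not match data length");
        Matrix { rows, cols, data }
    }

    /// Allocates and returns the transpose. This is the expensive step
    /// that should run once per weight rather than once per token.
    fn transposed(&self) -> Matrix {
        let mut out = vec![0.0; self.data.len()];
        for r in 0..self.rows {
            for c in 0..self.cols {
                out[c * self.rows + r] = self.data[r * self.cols + c];
            }
        }
        Matrix::new(self.cols, self.rows, out)
    }
}

/// Hypothetical linear layer that stores only the pre-transposed weight.
struct Linear {
    weight_t: Matrix, // shape [in_features, out_features]
}

impl Linear {
    /// Load-time constructor: transposes the [out, in] weight exactly once.
    fn from_checkpoint(weight: &Matrix) -> Self {
        Linear { weight_t: weight.transposed() }
    }

    /// Computes y = x · Wᵀ for one input row; no transpose on the hot path.
    fn forward(&self, x: &[f32]) -> Vec<f32> {
        assert_eq!(x.len(), self.weight_t.rows, "input width mismatch");
        let cols = self.weight_t.cols;
        let mut y = vec![0.0; cols];
        for (i, xi) in x.iter().enumerate() {
            let row = &self.weight_t.data[i * cols..(i + 1) * cols];
            for (yj, w) in y.iter_mut().zip(row) {
                *yj += xi * w;
            }
        }
        y
    }
}
```

The trade-off in this sketch is one extra transposed copy built at load time in exchange for no per-token transpose work.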

Results (50 prompts):
- Prefill top-1: 42/50 (84%), top-5: 50/50 (100%) vs HF transformers
- Greedy sequence: 0/50 exact match (BF16 precision drift over 36 layers)
- Performance: TTFT=138ms, TBT=144ms, 6.9 tok/s (HF: 21ms, 45.6 tok/s)
- All outputs are coherent English/Chinese
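
One way the top-1 / top-5 agreement figures above could be computed is sketched below. It assumes that "top-5" means the reference implementation's top-1 token appears among our five highest logits; the commit does not spell out the definition, and `top_k`, `score`, and `Agreement` are hypothetical names rather than functions from bench_compare_qwen3.py.

```rust
/// Indices of the `k` largest logits, highest first (ties keep the lower index first).
fn top_k(logits: &[f32], k: usize) -> Vec<usize> {
    let mut idx: Vec<usize> = (0..logits.len()).collect();
    idx.sort_by(|&a, &b| logits[b].total_cmp(&logits[a]));
    idx.truncate(k);
    idx
}

/// Agreement counters over a set of prompts.
#[derive(Debug, Default, PartialEq)]
struct Agreement {
    top1: usize,
    top5: usize,
    total: usize,
}

/// Each case pairs our prefill logits with the reference's top-1 token id.
fn score(cases: &[(Vec<f32>, usize)]) -> Agreement {
    let mut agreement = Agreement::default();
    for (logits, reference) in cases {
        let top = top_k(logits, 5);
        agreement.total += 1;
        if top.first() == Some(reference) {
            agreement.top1 += 1;
        }
        if top.contains(reference) {
            agreement.top5 += 1;
        }
    }
    agreement
}
```

Under this definition a prompt can miss top-1 yet still count toward top-5, so a result such as 42/50 top-1 with 50/50 top-5 would mean every disagreement was a near miss among the highest-ranked tokens.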

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

2026-05-22 10:25:33 +08:00

crates

phase 10: add Qwen3-8B benchmark + performance fix

2026-05-22 10:25:33 +08:00

csrc

phase 5: naive multi-head attention

2026-05-21 21:17:23 +08:00

docs

phase 10: add Qwen3-8B benchmark + performance fix

2026-05-22 10:25:33 +08:00

tools

phase 10: add Qwen3-8B benchmark + performance fix