Gahow Wang
268e40d764
phase 10: add Qwen3-8B benchmark + performance fix
Benchmark infrastructure:
- bench-qwen3 binary: 50 prompts × 20 tokens with KV cache
- bench_compare_qwen3.py: comparison against HF transformers (BF16)
Performance fix:
- Precompute transposed weights once at model load time. This eliminates the
  per-token CPU round-trip for weight transposes (previously 252 transposes ×
  32MB each ≈ 8GB of transfers per token)
- Result: per-token latency went from effectively unusable (>10 min/token) to 144ms/token
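The fix above can be sketched as follows. This is a minimal illustration in plain Python, not the code from this commit: the `Linear` class, its `weight_t` field, and the `transpose_traffic_mb` helper are hypothetical names, and the real implementation presumably works on device buffers rather than Python lists. It only shows the idea of transposing once at load time and reusing the cached copy on every token.

```python
from dataclasses import dataclass, field

Matrix = list[list[float]]


def transpose(m: Matrix) -> Matrix:
    """Return the transpose of a row-major matrix."""
    return [list(col) for col in zip(*m)]


@dataclass
class Linear:
    """Hypothetical linear layer; weight is stored as [out_features][in_features]."""
    weight: Matrix
    weight_t: Matrix = field(init=False)

    def __post_init__(self) -> None:
        # Pay the transpose cost once, at load time, instead of once per token.
        self.weight_t = transpose(self.weight)

    def forward(self, x: list[float]) -> list[float]:
        # y = x @ W^T, computed from the cached [in_features][out_features] copy.
        out_features = len(self.weight)
        y = [0.0] * out_features
        for i, xi in enumerate(x):
            row = self.weight_t[i]
            for j in range(out_features):
                y[j] += xi * row[j]
        return y


def transpose_traffic_mb(num_transposes: int, mb_each: int) -> int:
    """Data moved per token if every weight were transposed on each step."""
    return num_transposes * mb_each


layer = Linear([[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]])
print(layer.forward([1.0, 1.0]))
print(transpose_traffic_mb(252, 32), "MB per token before the fix")
```

With the figures quoted above, 252 × 32MB comes to 8064MB, which matches the roughly 8GB/token stated in the message.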
Results (50 prompts):
- Prefill top-1: 42/50 (84%), top-5: 50/50 (100%) vs HF transformers
- Greedy sequence: 0/50 exact match (BF16 precision drift over 36 layers)
- Performance: TTFT=138ms, TBT=144ms, 6.9 tok/s (HF: 21ms, 45.6 tok/s)
- All outputs are coherent English/Chinese
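For reference, here is one plausible way the agreement and throughput figures above could be computed. This is a sketch under assumptions: it treats "top-k" as "the reference model's top-1 token appears in the candidate's top-k", which may differ from what bench_compare_qwen3.py actually does, and the function names are hypothetical.

```python
def top_k_ids(logits: list[float], k: int) -> list[int]:
    """Indices of the k largest logits, highest first."""
    return sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:k]


def agreement(ref_logits: list[list[float]],
              cand_logits: list[list[float]],
              k: int) -> int:
    """Count prompts where the reference top-1 token is in the candidate top-k."""
    hits = 0
    for ref, cand in zip(ref_logits, cand_logits):
        ref_top1 = top_k_ids(ref, 1)[0]
        if ref_top1 in top_k_ids(cand, k):
            hits += 1
    return hits


def tokens_per_second(tbt_ms: float) -> float:
    """Convert time between tokens (ms) into decode throughput."""
    return 1000.0 / tbt_ms


# Toy logits for two prompts over a 3-token vocabulary (made-up values).
ref = [[0.1, 0.9, 0.0], [0.5, 0.2, 0.3]]
cand = [[0.2, 0.8, 0.0], [0.3, 0.1, 0.6]]
print(agreement(ref, cand, k=1), agreement(ref, cand, k=2))
print(round(tokens_per_second(144), 1))
```

A TBT of 144ms corresponds to 1000 / 144 ≈ 6.9 tok/s, consistent with the throughput reported above.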
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-05-22 10:25:33 +08:00