docs: llama.cpp vs xserv benchmark results + summary

Record what the new baseline adds (llama.cpp pinned b9371, same BF16 weights, AIME 2025 + GSM8K) and the measured results: performance (xserv ~0.45-0.61x llama.cpp throughput) and quality parity (GSM8K 94% vs 96%, AIME 23.3% vs 20% after the context fix), plus the findings the bench surfaced.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-28 15:06:21 +08:00
parent 950ccf3822
commit 3f1c3d429a
2 changed files with 80 additions and 0 deletions
--- a/.gitignore
+++ b/.gitignore
@@ -16,6 +16,7 @@
 # Benchmark output + fetched datasets (transferred to GPU host, not committed)
 /bench-out/
 /tools/bench/data/
+/tools/__pycache__/
 /tools/bench/__pycache__/
 /tools/bench/**/__pycache__/