tools: add llama.cpp comparison baseline + standard benchmark suite

Vendor llama.cpp as a submodule pinned to b9371 and add a one-click benchmark driver that compares xserv against it on identical workloads:

- setup-llama-cpp.sh: network-optional CUDA build (SM120); convert-to-gguf.sh converts the same safetensors to BF16 GGUF for an apples-to-apples baseline.
- tools/bench/: black-box OpenAI-API driver measuring TTFT/TPOT/throughput (single-stream + concurrent) and response quality on AIME 2025 + GSM8K.
- fetch_datasets.py pulls datasets to local JSON (GPU host has no network); task loaders prefer the local JSON.
- sync-and-build.sh: `bench` subcommand transfers source + datasets to the GPU host via tar-over-ssh (no rsync there), builds, and runs the suite.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
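
For readers unfamiliar with the latency metrics named above, here is a minimal sketch of how TTFT, TPOT, and throughput are commonly derived from per-token arrival times on a streaming response. It is not the code in tools/bench/. The names `StreamTiming` and `summarize_stream` are hypothetical, and the definitions used (TTFT as time to the first token, TPOT as mean gap between subsequent tokens) are an assumption about the conventional meaning rather than something this commit states.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass
class StreamTiming:
    """Latency summary for one streamed completion (hypothetical type)."""
    ttft_s: float        # time to first token, in seconds
    tpot_s: float        # mean time per output token after the first
    tokens_per_s: float  # output tokens divided by total request time


def summarize_stream(request_start: float,
                     token_times: Sequence[float]) -> StreamTiming:
    """Derive TTFT/TPOT/throughput from monotonic timestamps.

    request_start: timestamp taken just before the request was sent.
    token_times:   timestamp at which each streamed token arrived.
    """
    if not token_times:
        raise ValueError("no tokens received")

    n = len(token_times)
    ttft = token_times[0] - request_start
    total = token_times[-1] - request_start

    # TPOT excludes the first token, so it needs at least two tokens.
    tpot = (token_times[-1] - token_times[0]) / (n - 1) if n > 1 else 0.0
    tps = n / total if total > 0 else float("inf")
    return StreamTiming(ttft_s=ttft, tpot_s=tpot, tokens_per_s=tps)
```

In practice the timestamps would come from something like `time.perf_counter()` sampled as each streamed chunk arrives. Treating the server as a black box this way lets the same measurement apply to both xserv and llama.cpp.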
2026-05-28 11:18:52 +08:00
parent 9bb5c5c328
commit 49c7653222
20 changed files with 1690 additions and 14 deletions
--- a/.gitignore
+++ b/.gitignore
@@ -7,3 +7,15 @@
 **/*.rs.bk
 .env
 *.npy
+
+# llama.cpp baseline (cloned/submoduled by tools/setup-llama-cpp.sh)
+/third_party/llama.cpp/build/
+/third_party/llama.cpp/models/
+*.gguf
+
+# Benchmark output + fetched datasets (transferred to GPU host, not committed)
+/bench-out/
+/tools/bench/data/
+/tools/bench/__pycache__/
+/tools/bench/**/__pycache__/
+