Vendor llama.cpp as a submodule pinned to b9371 and add a one-click
benchmark driver that compares xserv against it on identical workloads:
- setup-llama-cpp.sh: network-optional CUDA build (SM120).
- convert-to-gguf.sh: converts the same safetensors weights to BF16 GGUF
for an apples-to-apples baseline.
- tools/bench/: black-box OpenAI-API driver measuring TTFT/TPOT/throughput
(single-stream + concurrent) and response quality on AIME 2025 + GSM8K.
- fetch_datasets.py pulls datasets to local JSON (GPU host has no network);
task loaders prefer the local JSON.
- sync-and-build.sh: `bench` subcommand transfers source + datasets to the
GPU host via tar-over-ssh (no rsync there), builds, and runs the suite.
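
For reference, a minimal sketch of how the latency metrics named above can
be derived from one streamed response. This is not the code in tools/bench/;
the function name and signature are hypothetical, and it assumes TPOT is
averaged over the gaps after the first token (definitions vary).

```python
from typing import Optional, Sequence


def latency_metrics(request_start: float, token_times: Sequence[float]) -> dict:
    """Derive TTFT, TPOT, and throughput for one streamed response.

    request_start: clock reading (seconds) when the request was sent.
    token_times:   clock reading (seconds) when each output token arrived,
                   in arrival order.
    Hypothetical helper for illustration only.
    """
    if not token_times:
        raise ValueError("no tokens received")

    n = len(token_times)
    # TTFT: delay between sending the request and the first token arriving.
    ttft = token_times[0] - request_start

    # TPOT: mean gap between consecutive tokens after the first one.
    # Undefined when only a single token was produced.
    tpot: Optional[float] = None
    if n > 1:
        tpot = (token_times[-1] - token_times[0]) / (n - 1)

    # Throughput: output tokens per second over the whole request.
    total = token_times[-1] - request_start
    tokens_per_s = n / total if total > 0 else None

    return {"ttft": ttft, "tpot": tpot, "tokens_per_s": tokens_per_s}
```

In practice the timestamps would come from time.perf_counter() readings
taken as each streamed chunk arrives.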

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>