Vendor llama.cpp as a submodule pinned to b9371 and add a one-click benchmark driver that compares xserv against it on identical workloads:

- setup-llama-cpp.sh: network-optional CUDA build (SM120); convert-to-gguf.sh converts the same safetensors to BF16 GGUF for an apples-to-apples baseline.
- tools/bench/: black-box OpenAI-API driver measuring TTFT/TPOT/throughput (single-stream + concurrent) and response quality on AIME 2025 + GSM8K.
- fetch_datasets.py pulls datasets to local JSON (GPU host has no network); task loaders prefer the local JSON.
- sync-and-build.sh: `bench` subcommand transfers source + datasets to the GPU host via tar-over-ssh (no rsync there), builds, and runs the suite.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
22 lines
362 B
Plaintext
/target
*.o
*.so
*.a
*.ptx
*.cubin
**/*.rs.bk
.env
*.npy

# llama.cpp baseline (cloned/submoduled by tools/setup-llama-cpp.sh)
/third_party/llama.cpp/build/
/third_party/llama.cpp/models/
*.gguf

# Benchmark output + fetched datasets (transferred to GPU host, not committed)
/bench-out/
/tools/bench/data/
/tools/bench/__pycache__/
/tools/bench/**/__pycache__/