tools: add llama.cpp comparison baseline + standard benchmark suite

Vendor llama.cpp as a submodule pinned to b9371 and add a one-click benchmark driver that compares xserv against it on identical workloads: - setup-llama-cpp.sh: network-optional CUDA build (SM120); convert-to-gguf.sh converts the same safetensors to BF16 GGUF for an apples-to-apples baseline. - tools/bench/: black-box OpenAI-API driver measuring TTFT/TPOT/throughput (single-stream + concurrent) and response quality on AIME 2025 + GSM8K. - fetch_datasets.py pulls datasets to local JSON (GPU host has no network); task loaders prefer the local JSON. - sync-and-build.sh: `bench` subcommand transfers source + datasets to the GPU host via tar-over-ssh (no rsync there), builds, and runs the suite. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-28 11:18:52 +08:00
parent 9bb5c5c328
commit 49c7653222
20 changed files with 1690 additions and 14 deletions
--- a/docs/16-llama-cpp-comparison.md
+++ b/docs/16-llama-cpp-comparison.md
@@ -0,0 +1,153 @@
+# Phase 16: llama.cpp Comparison Baseline
+
+> **Goal.** Replace HF transformers with **llama.cpp** as the standing
+> performance baseline, and add a standard quality (response correctness)
+> benchmark suite (AIME 2025, GSM8K). Provide a one-click entrypoint that runs
+> both systems under identical workloads and emits a side-by-side report.
+
+## Motivation
+
+xserv has cleared 140% of HF transformers throughput on Qwen3-8B (Phase 15).
+HF is no longer a useful performance bar — it's a *correctness* baseline.
+
+**llama.cpp** is the right next bar because:
+- It's a serious C++/CUDA inference engine with active optimization
+- Same OpenAI-compatible API → black-box, fair comparison
+- Same GGUF↔safetensors weight source (we convert BF16, no quantization shortcuts)
+- Used widely as a reference point in the community
+
+We also need **quality benchmarks** so that performance improvements don't
+silently regress model quality (numerical precision, sampling, prompt
+formatting). AIME and GSM8K are the cheapest credible signals.
+
+## Architecture
+
+```
+xserv/
+├── third_party/llama.cpp/         # cloned by setup-llama-cpp.sh
+│   └── build/bin/llama-server     # CUDA build (SM120)
+├── tools/
+│   ├── setup-llama-cpp.sh         # clone + cmake build (idempotent)
+│   ├── convert-to-gguf.sh         # safetensors → BF16 GGUF (same weights)
+│   ├── sync-and-build.sh          # extended with `bench` subcommand
+│   └── bench/                     # Python benchmark driver
+│       ├── runner.py              # entrypoint
+│       ├── servers.py             # subprocess lifecycle (start/stop both)
+│       ├── client.py              # OpenAI streaming client + TTFT/TPOT
+│       ├── speed.py               # speed suite
+│       ├── quality.py             # quality suite
+│       ├── tasks/{aime,gsm8k}.py  # dataset loaders + scorers
+│       ├── report.py              # markdown + json output
+│       └── requirements.txt       # httpx, datasets
+└── bench-out/                     # report artifacts (gitignored)
+    ├── comparison-<stamp>.md
+    ├── comparison-<stamp>.json
+    └── logs/{xserv,llama_cpp}.log
+```
+
+Both systems are treated as **black-box HTTP servers** speaking the OpenAI
+streaming chat API. No in-process integration, no shared Python bindings. This
+keeps the comparison fair (same protocol, same prompt-template path) and
+isolates the test harness from internal API churn on either side.
+
+## Workflow
+
+```
+local repo                            dash5 (GPU host)
+──────────                            ────────────────
+tools/sync-and-build.sh bench   →  rsync project (excl. target, third_party, bench-out)
+                                   →  setup-llama-cpp.sh    (no-op if built)
+                                   →  convert-to-gguf.sh    (no-op if .gguf exists)
+                                   →  cargo build --release
+                                   →  python3 -m tools.bench.runner ...
+                                   →  bench-out/comparison-<stamp>.md
+tools/sync-and-build.sh fetch-bench-out  ←  rsync bench-out back
+```
+
+## What gets measured
+
+### Speed (TTFT / TPOT / throughput)
+
+- **Single-stream**, three prompt lengths (short / medium / long), `cfg.speed_prompts` repeats each
+  - `TTFT p50/p95`, `TPOT p50/p95`, per-request throughput
+- **Concurrent**, fixed medium prompt, sweep `concurrency ∈ {1, 2, 4, 8}`
+  - Aggregate `tok/s`, `TTFT p95`, error count
+- Both at `temperature=0`, `max_tokens=128` by default.
+
+### Quality (response correctness)
+
+| Task | N | Source | Scoring | Why |
+|---|---|---|---|---|
+| AIME 2025 | 30 | `MathArena/aime_2025` (HF) | exact-match boxed integer (0..999) | reasoning + math, hard signal |
+| GSM8K | 1319 | `openai/gsm8k` (HF), `test` split | exact-match `\boxed{n}` or last number | broad sanity, decimals allowed |
+
+Same `temperature=0` sampling across both systems. Max tokens: 16384 for AIME
+(reasoning long), 2048 for GSM8K. Subsample with `--quality-limit N` for smoke.
+
+### Report
+
+`bench-out/comparison-<stamp>.md` contains:
+- Environment (GPU, driver, xserv commit, python)
+- Speed table per scenario (xserv | llama.cpp | xserv÷llama.cpp speedup)
+- Quality table per task (n, correct, accuracy, mean tokens, TTFT, TPOT, wall)
+
+A sibling `.json` holds all per-request raw rows and per-problem case detail
+(prediction, gold, response preview) so we can diff regressions in CI later.
+
+## Running it
+
+**Full sweep on dash5 (recommended):**
+```bash
+./tools/sync-and-build.sh bench
+./tools/sync-and-build.sh fetch-bench-out
+open bench-out/comparison-*.md
+```
+
+**Speed-only smoke (fast):**
+```bash
+./tools/sync-and-build.sh bench -- --suite speed --speed-prompts 2
+```
+
+**Quality smoke with 5 problems each:**
+```bash
+./tools/sync-and-build.sh bench -- --suite quality --quality-limit 5
+```
+
+**On a host that already has both servers running** (e.g. local dev with two
+shells open):
+```bash
+python3 -m tools.bench.runner \
+    --xserv-base-url http://127.0.0.1:8080 \
+    --llama-base-url http://127.0.0.1:8081 \
+    --suite all
+```
+
+## Design choices
+
+1. **Black-box HTTP, not FFI.** Both engines bind the same OpenAI surface and
+   real serving traffic uses HTTP. Anything that doesn't show up over the wire
+   doesn't matter for serving.
+2. **Same BF16 weights.** We convert the same safetensors with llama.cpp's
+   `convert_hf_to_gguf.py --outtype bf16`. No quantization at this stage; if we
+   want a quant comparison later we'll add a separate column, not replace this
+   one.
+3. **Streaming everywhere.** TTFT and TPOT only make sense with streaming. We
+   ask both servers for `stream=true` with `include_usage` so we can read
+   server-reported token counts when available.
+4. **Idempotent setup.** `setup-llama-cpp.sh` and `convert-to-gguf.sh` are
+   safe to re-run — they no-op when the build / file already exists. The
+   `bench` subcommand wires them so the first run does a full setup and
+   subsequent runs are fast.
+5. **Subprocess lifecycle owned by the driver.** We spawn each server in its
+   own process group and SIGTERM the group on exit so half-dead llama-server
+   children don't survive. If the user is already running a server somewhere,
+   pass `--xserv-base-url` / `--llama-base-url` to skip launch.
+
+## Future extensions
+
+- Add quant runs (Q8_0, Q4_K_M) as separate "system" columns
+- Wire to GitHub Actions for nightly regression
+- Track results across commits to flag regressions (per-commit JSON in
+  `docs/benchmarks/history/`)
+- Add MMLU-Pro / HumanEval when budget allows
+- Long-context benchmark (8K, 32K prompts) to compare prefill scaling