tools: add llama.cpp comparison baseline + standard benchmark suite

Vendor llama.cpp as a submodule pinned to b9371 and add a one-click
benchmark driver that compares xserv against it on identical workloads:

- setup-llama-cpp.sh: network-optional CUDA build (SM120); convert-to-gguf.sh
  converts the same safetensors to BF16 GGUF for an apples-to-apples baseline.
- tools/bench/: black-box OpenAI-API driver measuring TTFT/TPOT/throughput
  (single-stream + concurrent) and response quality on AIME 2025 + GSM8K.
- fetch_datasets.py pulls datasets to local JSON (GPU host has no network);
  task loaders prefer the local JSON.
- sync-and-build.sh: `bench` subcommand transfers source + datasets to the
  GPU host via tar-over-ssh (no rsync there), builds, and runs the suite.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
This commit is contained in:
2026-05-28 11:18:52 +08:00
parent 9bb5c5c328
commit 49c7653222
20 changed files with 1690 additions and 14 deletions

View File

@@ -0,0 +1,153 @@
# Phase 16: llama.cpp Comparison Baseline
> **Goal.** Replace HF transformers with **llama.cpp** as the standing
> performance baseline, and add a standard quality (response correctness)
> benchmark suite (AIME 2025, GSM8K). Provide a one-click entrypoint that runs
> both systems under identical workloads and emits a side-by-side report.
## Motivation
xserv has cleared 140% of HF transformers throughput on Qwen3-8B (Phase 15).
HF is no longer a useful performance bar — it's a *correctness* baseline.
**llama.cpp** is the right next bar because:
- It's a serious C++/CUDA inference engine with active optimization
- Same OpenAI-compatible API → black-box, fair comparison
- Same GGUF↔safetensors weight source (we convert BF16, no quantization shortcuts)
- Used widely as a reference point in the community
We also need **quality benchmarks** so that performance improvements don't
silently regress model quality (numerical precision, sampling, prompt
formatting). AIME and GSM8K are the cheapest credible signals.
## Architecture
```
xserv/
├── third_party/llama.cpp/ # cloned by setup-llama-cpp.sh
│ └── build/bin/llama-server # CUDA build (SM120)
├── tools/
│ ├── setup-llama-cpp.sh # clone + cmake build (idempotent)
│ ├── convert-to-gguf.sh # safetensors → BF16 GGUF (same weights)
│ ├── sync-and-build.sh # extended with `bench` subcommand
│ └── bench/ # Python benchmark driver
│ ├── runner.py # entrypoint
│ ├── servers.py # subprocess lifecycle (start/stop both)
│ ├── client.py # OpenAI streaming client + TTFT/TPOT
│ ├── speed.py # speed suite
│ ├── quality.py # quality suite
│ ├── tasks/{aime,gsm8k}.py # dataset loaders + scorers
│ ├── report.py # markdown + json output
│ └── requirements.txt # httpx, datasets
└── bench-out/ # report artifacts (gitignored)
├── comparison-<stamp>.md
├── comparison-<stamp>.json
└── logs/{xserv,llama_cpp}.log
```
Both systems are treated as **black-box HTTP servers** speaking the OpenAI
streaming chat API. No in-process integration, no shared Python bindings. This
keeps the comparison fair (same protocol, same prompt-template path) and
isolates the test harness from internal API churn on either side.
## Workflow
```
local repo dash5 (GPU host)
────────── ────────────────
tools/sync-and-build.sh bench → rsync project (excl. target, third_party, bench-out)
→ setup-llama-cpp.sh (no-op if built)
→ convert-to-gguf.sh (no-op if .gguf exists)
→ cargo build --release
→ python3 -m tools.bench.runner ...
→ bench-out/comparison-<stamp>.md
tools/sync-and-build.sh fetch-bench-out ← rsync bench-out back
```
## What gets measured
### Speed (TTFT / TPOT / throughput)
- **Single-stream**, three prompt lengths (short / medium / long), `cfg.speed_prompts` repeats each
- `TTFT p50/p95`, `TPOT p50/p95`, per-request throughput
- **Concurrent**, fixed medium prompt, sweep `concurrency ∈ {1, 2, 4, 8}`
- Aggregate `tok/s`, `TTFT p95`, error count
- Both at `temperature=0`, `max_tokens=128` by default.
### Quality (response correctness)
| Task | N | Source | Scoring | Why |
|---|---|---|---|---|
| AIME 2025 | 30 | `MathArena/aime_2025` (HF) | exact-match boxed integer (0..999) | reasoning + math, hard signal |
| GSM8K | 1319 | `openai/gsm8k` (HF), `test` split | exact-match `\boxed{n}` or last number | broad sanity, decimals allowed |
Same `temperature=0` sampling across both systems. Max tokens: 16384 for AIME
(reasoning long), 2048 for GSM8K. Subsample with `--quality-limit N` for smoke.
### Report
`bench-out/comparison-<stamp>.md` contains:
- Environment (GPU, driver, xserv commit, python)
- Speed table per scenario (xserv | llama.cpp | xserv÷llama.cpp speedup)
- Quality table per task (n, correct, accuracy, mean tokens, TTFT, TPOT, wall)
A sibling `.json` holds all per-request raw rows and per-problem case detail
(prediction, gold, response preview) so we can diff regressions in CI later.
## Running it
**Full sweep on dash5 (recommended):**
```bash
./tools/sync-and-build.sh bench
./tools/sync-and-build.sh fetch-bench-out
open bench-out/comparison-*.md
```
**Speed-only smoke (fast):**
```bash
./tools/sync-and-build.sh bench -- --suite speed --speed-prompts 2
```
**Quality smoke with 5 problems each:**
```bash
./tools/sync-and-build.sh bench -- --suite quality --quality-limit 5
```
**On a host that already has both servers running** (e.g. local dev with two
shells open):
```bash
python3 -m tools.bench.runner \
--xserv-base-url http://127.0.0.1:8080 \
--llama-base-url http://127.0.0.1:8081 \
--suite all
```
## Design choices
1. **Black-box HTTP, not FFI.** Both engines bind the same OpenAI surface and
real serving traffic uses HTTP. Anything that doesn't show up over the wire
doesn't matter for serving.
2. **Same BF16 weights.** We convert the same safetensors with llama.cpp's
`convert_hf_to_gguf.py --outtype bf16`. No quantization at this stage; if we
want a quant comparison later we'll add a separate column, not replace this
one.
3. **Streaming everywhere.** TTFT and TPOT only make sense with streaming. We
ask both servers for `stream=true` with `include_usage` so we can read
server-reported token counts when available.
4. **Idempotent setup.** `setup-llama-cpp.sh` and `convert-to-gguf.sh` are
safe to re-run — they no-op when the build / file already exists. The
`bench` subcommand wires them so the first run does a full setup and
subsequent runs are fast.
5. **Subprocess lifecycle owned by the driver.** We spawn each server in its
own process group and SIGTERM the group on exit so half-dead llama-server
children don't survive. If the user is already running a server somewhere,
pass `--xserv-base-url` / `--llama-base-url` to skip launch.
## Future extensions
- Add quant runs (Q8_0, Q4_K_M) as separate "system" columns
- Wire to GitHub Actions for nightly regression
- Track results across commits to flag regressions (per-commit JSON in
`docs/benchmarks/history/`)
- Add MMLU-Pro / HumanEval when budget allows
- Long-context benchmark (8K, 32K prompts) to compare prefill scaling