tools: add llama.cpp comparison baseline + standard benchmark suite
Vendor llama.cpp as a submodule pinned to b9371 and add a one-click
benchmark driver that compares xserv against it on identical workloads:

- setup-llama-cpp.sh: network-optional CUDA build (SM120);
  convert-to-gguf.sh converts the same safetensors to BF16 GGUF for an
  apples-to-apples baseline.
- tools/bench/: black-box OpenAI-API driver measuring TTFT/TPOT/throughput
  (single-stream + concurrent) and response quality on AIME 2025 + GSM8K.
- fetch_datasets.py pulls datasets to local JSON (GPU host has no network);
  task loaders prefer the local JSON.
- sync-and-build.sh: `bench` subcommand transfers source + datasets to the
  GPU host via tar-over-ssh (no rsync there), builds, and runs the suite.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
153
docs/16-llama-cpp-comparison.md
Normal file
@@ -0,0 +1,153 @@
# Phase 16: llama.cpp Comparison Baseline

> **Goal.** Replace HF transformers with **llama.cpp** as the standing
> performance baseline, and add a standard quality (response correctness)
> benchmark suite (AIME 2025, GSM8K). Provide a one-click entrypoint that runs
> both systems under identical workloads and emits a side-by-side report.

## Motivation

xserv has cleared 140% of HF transformers throughput on Qwen3-8B (Phase 15).
HF is no longer a useful performance bar; it now serves as a *correctness*
baseline.

**llama.cpp** is the right next bar because:
- It is a mature C++/CUDA inference engine under active optimization.
- It exposes the same OpenAI-compatible API, so the comparison can be black-box and fair.
- It runs the same weights: we convert the safetensors checkpoint to BF16 GGUF, with no quantization shortcuts.
- It is widely used as a reference point in the community.

We also need **quality benchmarks** so that performance improvements don't
silently regress model quality (numerical precision, sampling, prompt
formatting). AIME and GSM8K are the cheapest credible signals.

## Architecture

```
xserv/
├── third_party/llama.cpp/          # cloned by setup-llama-cpp.sh
│   └── build/bin/llama-server      # CUDA build (SM120)
├── tools/
│   ├── setup-llama-cpp.sh          # clone + cmake build (idempotent)
│   ├── convert-to-gguf.sh          # safetensors → BF16 GGUF (same weights)
│   ├── sync-and-build.sh           # extended with `bench` subcommand
│   └── bench/                      # Python benchmark driver
│       ├── runner.py               # entrypoint
│       ├── servers.py              # subprocess lifecycle (start/stop both)
│       ├── client.py               # OpenAI streaming client + TTFT/TPOT
│       ├── speed.py                # speed suite
│       ├── quality.py              # quality suite
│       ├── tasks/{aime,gsm8k}.py   # dataset loaders + scorers
│       ├── report.py               # markdown + json output
│       └── requirements.txt        # httpx, datasets
└── bench-out/                      # report artifacts (gitignored)
    ├── comparison-<stamp>.md
    ├── comparison-<stamp>.json
    └── logs/{xserv,llama_cpp}.log
```

Both systems are treated as **black-box HTTP servers** speaking the OpenAI
streaming chat API. No in-process integration, no shared Python bindings. This
keeps the comparison fair (same protocol, same prompt-template path) and
isolates the test harness from internal API churn on either side.
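
To make the black-box contract concrete, here is a minimal sketch of how the streamed response body could be decoded. It is not the actual `client.py`; `parse_sse_stream` is a hypothetical name, and the sketch assumes both servers emit OpenAI-style `data: {...}` server-sent-event lines terminated by `data: [DONE]`, with an optional final chunk carrying `usage`.

```python
import json
from typing import Iterable, Optional, Tuple


def parse_sse_stream(lines: Iterable[str]) -> Tuple[str, Optional[dict]]:
    """Collect the streamed text and the last usage block from SSE lines.

    Hypothetical helper: assumes OpenAI-style chat-completion chunks.
    """
    text_parts = []
    usage = None
    for raw in lines:
        line = raw.strip()
        if not line.startswith("data:"):
            continue  # skip blank separators and SSE comments
        payload = line[len("data:"):].strip()
        if payload == "[DONE]":
            break  # end-of-stream sentinel
        chunk = json.loads(payload)
        if chunk.get("usage"):
            usage = chunk["usage"]  # server-reported token counts, if sent
        for choice in chunk.get("choices") or []:
            content = (choice.get("delta") or {}).get("content")
            if content:
                text_parts.append(content)
    return "".join(text_parts), usage
```

Because the function only consumes an iterable of lines, the same code can be fed from an HTTP client's line iterator or from a recorded log when debugging one of the servers.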

## Workflow

```
local repo                                  dash5 (GPU host)
──────────                                  ────────────────
tools/sync-and-build.sh bench            →  tar-over-ssh project (excl. target, third_party, bench-out)
                                         →  setup-llama-cpp.sh   (no-op if built)
                                         →  convert-to-gguf.sh   (no-op if .gguf exists)
                                         →  cargo build --release
                                         →  python3 -m tools.bench.runner ...
                                         →  bench-out/comparison-<stamp>.md
tools/sync-and-build.sh fetch-bench-out  ←  copy bench-out back
```

## What gets measured

### Speed (TTFT / TPOT / throughput)

- **Single-stream**, three prompt lengths (short / medium / long), each repeated `cfg.speed_prompts` times
  - `TTFT p50/p95`, `TPOT p50/p95`, per-request throughput
- **Concurrent**, fixed medium prompt, sweep `concurrency ∈ {1, 2, 4, 8}`
  - Aggregate `tok/s`, `TTFT p95`, error count
- Both at `temperature=0`, `max_tokens=128` by default.
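
The document does not spell out the formulas behind these metrics, so the sketch below uses common definitions as an assumption: TTFT is the delay from sending the request to the first streamed token, TPOT is the mean gap between subsequent tokens, and percentiles use the nearest-rank method. The function names are hypothetical and the timestamps are per streamed chunk, which may not map one-to-one onto tokens.

```python
import math
from typing import List, Optional, Sequence


def ttft(t_send: float, token_times: List[float]) -> float:
    """Time to first token: first arrival minus request send time."""
    return token_times[0] - t_send


def tpot(token_times: List[float]) -> Optional[float]:
    """Mean time per output token after the first; None if undefined."""
    if len(token_times) < 2:
        return None
    return (token_times[-1] - token_times[0]) / (len(token_times) - 1)


def percentile(values: Sequence[float], p: float) -> float:
    """Nearest-rank percentile (p in 0..100) of a non-empty sequence."""
    ordered = sorted(values)
    rank = max(1, math.ceil(p * len(ordered) / 100))
    return ordered[rank - 1]
```

Nearest-rank always returns an observed sample, which keeps p95 meaningful even when a scenario has only a handful of requests.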

### Quality (response correctness)

| Task | N | Source | Scoring | Why |
|---|---|---|---|---|
| AIME 2025 | 30 | `MathArena/aime_2025` (HF) | exact-match boxed integer (0..999) | reasoning + math, hard signal |
| GSM8K | 1319 | `openai/gsm8k` (HF), `test` split | exact-match `\boxed{n}` or last number | broad sanity, decimals allowed |

Both systems use the same `temperature=0` sampling. Max tokens: 16384 for AIME
(its reasoning traces are long) and 2048 for GSM8K. Subsample with
`--quality-limit N` for a smoke run.
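
The "`\boxed{n}` or last number" rule from the table can be sketched as follows. This is an illustration under assumptions, not the scorer in `tasks/`: `extract_answer` and `is_correct` are hypothetical names, the last `\boxed{...}` wins when several appear, thousands separators are stripped, and comparison is numeric so that `12.50` matches `12.5`.

```python
import re
from typing import Optional

# Innermost \boxed{...} without nested braces (assumed sufficient for numbers).
_BOXED = re.compile(r"\\boxed\{([^{}]*)\}")
# Integer or decimal, optionally negative, optionally with thousands commas.
_NUMBER = re.compile(r"-?\d[\d,]*(?:\.\d+)?")


def extract_answer(response: str) -> Optional[str]:
    """Return the number in the last \\boxed{}, else the last number in the text."""
    boxed = _BOXED.findall(response)
    candidate = boxed[-1] if boxed else response
    numbers = _NUMBER.findall(candidate)
    if not numbers:
        return None
    return numbers[-1].replace(",", "")


def is_correct(response: str, gold: str) -> bool:
    """Exact match after numeric normalization."""
    pred = extract_answer(response)
    if pred is None:
        return False
    try:
        return float(pred) == float(gold.replace(",", ""))
    except ValueError:
        return False
```

For AIME, a stricter variant could additionally require an integer in 0..999 before comparing, matching the scoring column above.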

### Report

`bench-out/comparison-<stamp>.md` contains:
- Environment (GPU, driver, xserv commit, python)
- Speed table per scenario (xserv | llama.cpp | xserv÷llama.cpp speedup)
- Quality table per task (n, correct, accuracy, mean tokens, TTFT, TPOT, wall)

A sibling `.json` holds all per-request raw rows and per-problem case detail
(prediction, gold, response preview) so we can diff regressions in CI later.
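
As a rough idea of what diffing regressions could look like, the sketch below compares the per-problem cases of two runs and lists problems that flipped from correct to incorrect. The JSON schema is not given in this document, so the `id` and `correct` fields and the `find_regressions` name are purely hypothetical.

```python
from typing import Dict, List


def find_regressions(old_cases: List[Dict], new_cases: List[Dict]) -> List[str]:
    """IDs that were correct in the old run and are incorrect in the new one.

    Assumes each case is a dict with hypothetical keys "id" and "correct".
    """
    previously_correct = {case["id"] for case in old_cases if case["correct"]}
    return sorted(
        case["id"]
        for case in new_cases
        if case["id"] in previously_correct and not case["correct"]
    )
```

Problems that are new in the second run, or that were already failing, are deliberately ignored so the output only flags genuine regressions.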

## Running it

**Full sweep on dash5 (recommended):**
```bash
./tools/sync-and-build.sh bench
./tools/sync-and-build.sh fetch-bench-out
open bench-out/comparison-*.md
```

**Speed-only smoke (fast):**
```bash
./tools/sync-and-build.sh bench -- --suite speed --speed-prompts 2
```

**Quality smoke with 5 problems each:**
```bash
./tools/sync-and-build.sh bench -- --suite quality --quality-limit 5
```

**On a host that already has both servers running** (e.g. local dev with two
shells open):
```bash
python3 -m tools.bench.runner \
  --xserv-base-url http://127.0.0.1:8080 \
  --llama-base-url http://127.0.0.1:8081 \
  --suite all
```

## Design choices

1. **Black-box HTTP, not FFI.** Both engines expose the same OpenAI-compatible
   surface and real serving traffic uses HTTP. Anything that doesn't show up
   over the wire doesn't matter for serving.
2. **Same BF16 weights.** We convert the same safetensors with llama.cpp's
   `convert_hf_to_gguf.py --outtype bf16`. No quantization at this stage; if we
   want a quant comparison later we'll add a separate column, not replace this
   one.
3. **Streaming everywhere.** TTFT and TPOT only make sense with streaming. We
   ask both servers for `stream=true` with `include_usage` so we can read
   server-reported token counts when available.
4. **Idempotent setup.** `setup-llama-cpp.sh` and `convert-to-gguf.sh` are
   safe to re-run: they no-op when the build / file already exists. The
   `bench` subcommand wires them so the first run does a full setup and
   subsequent runs are fast.
5. **Subprocess lifecycle owned by the driver.** We spawn each server in its
   own process group and SIGTERM the group on exit so half-dead llama-server
   children don't survive. If the user is already running a server somewhere,
   pass `--xserv-base-url` / `--llama-base-url` to skip launch.
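
The process-group handling in item 5 could look roughly like the sketch below. It is not the code in `servers.py`; `start_server` and `stop_server` are hypothetical names, the SIGKILL fallback after a timeout is an added assumption, and the approach is POSIX-only.

```python
import os
import signal
import subprocess
from typing import List, Optional


def start_server(cmd: List[str]) -> subprocess.Popen:
    """Launch cmd as the leader of a new session and process group."""
    # start_new_session=True calls setsid() in the child, so the child's
    # process-group ID equals its PID and covers anything it forks later.
    return subprocess.Popen(cmd, start_new_session=True)


def stop_server(proc: subprocess.Popen, timeout: float = 10.0) -> Optional[int]:
    """SIGTERM the whole group; escalate to SIGKILL if it does not exit."""
    if proc.poll() is not None:
        return proc.returncode  # already exited
    try:
        os.killpg(proc.pid, signal.SIGTERM)
    except ProcessLookupError:
        return proc.wait()  # group vanished between poll() and killpg()
    try:
        return proc.wait(timeout=timeout)
    except subprocess.TimeoutExpired:
        os.killpg(proc.pid, signal.SIGKILL)  # assumed fallback, not from the doc
        return proc.wait()
```

Signalling the group rather than the single PID is what prevents helper processes spawned by a server from outliving the benchmark run.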

## Future extensions

- Add quant runs (Q8_0, Q4_K_M) as separate "system" columns
- Wire to GitHub Actions for nightly regression
- Track results across commits to flag regressions (per-commit JSON in
  `docs/benchmarks/history/`)
- Add MMLU-Pro / HumanEval when budget allows
- Long-context benchmark (8K, 32K prompts) to compare prefill scaling