xserv/docs/16-llama-cpp-comparison.md
Gahow Wang 80157e614a docs: update llama.cpp comparison with 8192 results (OOM fixed)
Re-ran the full comparison at --max-seq-len 8192 now that xserv handles it:
- OOM finding resolved — pool sized to available VRAM + vLLM-style host swap;
  8192 runs with 0 swap events (swap is the overload safety net).
- Quality at parity with equal context: AIME 20.0% vs 20.0%, GSM8K 98% vs 96%.
- Speed unchanged relative to llama.cpp (~0.42-0.60x); TPOT is bandwidth-bound.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-28 21:32:14 +08:00


# Phase 16: llama.cpp Comparison Baseline
> **Goal.** Replace HF transformers with **llama.cpp** as the standing
> performance baseline, and add a standard quality (response correctness)
> benchmark suite (AIME 2025, GSM8K). Provide a one-click entrypoint that runs
> both systems under identical workloads and emits a side-by-side report.
## Motivation
xserv has cleared 140% of HF transformers throughput on Qwen3-8B (Phase 15).
HF is no longer a useful performance bar — it's a *correctness* baseline.
**llama.cpp** is the right next bar because:
- It's a serious C++/CUDA inference engine with active optimization
- Same OpenAI-compatible API → black-box, fair comparison
- Same GGUF↔safetensors weight source (we convert BF16, no quantization shortcuts)
- Used widely as a reference point in the community

We also need **quality benchmarks** so that performance improvements don't
silently regress model quality (numerical precision, sampling, prompt
formatting). AIME and GSM8K are the cheapest credible signals.
## Architecture
```
xserv/
├── third_party/llama.cpp/          # cloned by setup-llama-cpp.sh
│   └── build/bin/llama-server      # CUDA build (SM120)
├── tools/
│   ├── setup-llama-cpp.sh          # clone + cmake build (idempotent)
│   ├── convert-to-gguf.sh          # safetensors → BF16 GGUF (same weights)
│   ├── sync-and-build.sh           # extended with `bench` subcommand
│   └── bench/                      # Python benchmark driver
│       ├── runner.py               # entrypoint
│       ├── servers.py              # subprocess lifecycle (start/stop both)
│       ├── client.py               # OpenAI streaming client + TTFT/TPOT
│       ├── speed.py                # speed suite
│       ├── quality.py              # quality suite
│       ├── tasks/{aime,gsm8k}.py   # dataset loaders + scorers
│       ├── report.py               # markdown + json output
│       └── requirements.txt        # httpx, datasets
└── bench-out/                      # report artifacts (gitignored)
    ├── comparison-<stamp>.md
    ├── comparison-<stamp>.json
    └── logs/{xserv,llama_cpp}.log
```
Both systems are treated as **black-box HTTP servers** speaking the OpenAI
streaming chat API. No in-process integration, no shared Python bindings. This
keeps the comparison fair (same protocol, same prompt-template path) and
isolates the test harness from internal API churn on either side.
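As a rough illustration of what "black-box" means for the client, the sketch below parses an OpenAI-style server-sent-event stream into the generated text plus the optional usage block. The function name `parse_sse_stream` is hypothetical and is not taken from `client.py`; it assumes the usual `data: {...}` / `data: [DONE]` framing and that a usage-bearing chunk may arrive with an empty `choices` list.

```python
import json
from typing import Iterable, Optional, Tuple


def parse_sse_stream(lines: Iterable[str]) -> Tuple[str, Optional[dict]]:
    """Collect streamed text and the optional usage block from SSE lines.

    Hypothetical helper: assumes OpenAI-style chat completion chunks.
    """
    parts = []
    usage = None
    for raw in lines:
        line = raw.strip()
        if not line.startswith("data:"):
            continue  # skip blank separators and SSE comments
        payload = line[len("data:"):].strip()
        if payload == "[DONE]":
            break  # end-of-stream sentinel
        chunk = json.loads(payload)
        if chunk.get("usage"):
            usage = chunk["usage"]  # server-reported token counts, if sent
        for choice in chunk.get("choices") or []:
            delta = choice.get("delta") or {}
            if delta.get("content"):
                parts.append(delta["content"])
    return "".join(parts), usage
```

Because the parser only looks at the wire format, the same code path would serve both engines, which is the property the comparison relies on.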
## Workflow
The GPU host (dash5) has **no outbound network and no rsync**, so anything from
the internet is fetched locally and shipped over via tar-over-ssh.
```
local repo (has network)                     dash5 (GPU host, no network)
────────────────────────                     ────────────────────────────
# one-time, on a networked machine:
python3 -m tools.bench.fetch_datasets    →   tools/bench/data/{aime2025,gsm8k}.json
git submodule update --init …            →   third_party/llama.cpp source
tools/sync-and-build.sh bench            →   tar project (excl. target, third_party, bench-out)
                                         →   tar llama.cpp source (excl. build, .git)
                                         →   setup-llama-cpp.sh (build-only; no-op if built)
                                         →   convert-to-gguf.sh (no-op if .gguf exists)
                                         →   cargo build --release
                                         →   python3 -m tools.bench.runner ...
                                         →   bench-out/comparison-<stamp>.md
tools/sync-and-build.sh fetch-bench-out  ←   tar bench-out back
```
Behind a flaky proxy, fetch datasets through the HF mirror:
`HF_ENDPOINT=https://hf-mirror.com python3 -m tools.bench.fetch_datasets`.

`tools/__init__.py` exists so `python3 -m tools.bench.runner` resolves our
package: some site-packages (e.g. nvfuser) ship a regular top-level `tools`
package that would otherwise shadow a namespace `tools`.
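To check which `tools` the interpreter would actually pick up, a small diagnostic along these lines may help. It is a sketch rather than part of the driver: `describe_package` is a hypothetical name, and it relies on the fact that a namespace package has search locations but no `__init__.py` origin.

```python
import importlib.util


def describe_package(name: str) -> dict:
    """Report whether `name` resolves to a regular or a namespace package."""
    spec = importlib.util.find_spec(name)
    if spec is None:
        return {"found": False, "regular": False, "origin": None}
    # Packages expose submodule_search_locations; plain modules do not.
    is_package = spec.submodule_search_locations is not None
    # A regular package has an __init__.py origin; a namespace package has none.
    regular = is_package and spec.origin is not None
    return {"found": True, "regular": regular, "origin": spec.origin}
```

Running `describe_package("tools")` from the repo root should report an origin inside the repo; an origin under site-packages would indicate the shadowing described above.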
## What gets measured
### Speed (TTFT / TPOT / throughput)
- **Single-stream**, three prompt lengths (short / medium / long), each repeated `cfg.speed_prompts` times
  - `TTFT p50/p95`, `TPOT p50/p95`, per-request throughput
- **Concurrent**, fixed medium prompt, sweep `concurrency ∈ {1, 2, 4, 8}`
  - Aggregate `tok/s`, `TTFT p95`, error count
- Both at `temperature=0`, `max_tokens=128` by default.
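The metrics above can be derived from per-token arrival timestamps. The sketch below uses the common definitions (TTFT as the delay to the first token, TPOT as the mean gap between subsequent tokens) and a nearest-rank percentile. This document does not spell out the exact formulas the driver uses, so treat the function names and definitions here as assumptions; it also assumes one timestamp per generated token, which may not hold if a server batches several tokens into one stream chunk.

```python
import math
from typing import Optional, Sequence, Tuple


def ttft_and_tpot(
    request_start: float, token_times: Sequence[float]
) -> Tuple[float, Optional[float]]:
    """Return (TTFT, TPOT) in the same time unit as the inputs."""
    if not token_times:
        raise ValueError("no tokens received")
    ttft = token_times[0] - request_start
    if len(token_times) < 2:
        return ttft, None  # TPOT is undefined for a single token
    # Mean inter-token gap, excluding the first token (covered by TTFT).
    tpot = (token_times[-1] - token_times[0]) / (len(token_times) - 1)
    return ttft, tpot


def percentile(values: Sequence[float], q: int) -> float:
    """Nearest-rank percentile for an integer q in 1..100."""
    if not values:
        raise ValueError("empty sample")
    ordered = sorted(values)
    rank = max(1, math.ceil(q * len(ordered) / 100))
    return ordered[rank - 1]
```

With only a handful of repeats per prompt length, p95 under nearest-rank is simply the slowest sample, which is worth keeping in mind when reading the speed table.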
### Quality (response correctness)
| Task | N | Source | Scoring | Why |
|---|---|---|---|---|
| AIME 2025 | 30 | `MathArena/aime_2025`, fallback `yentinglin/aime_2025` (HF) | exact-match boxed integer (0..999) | reasoning + math, hard signal |
| GSM8K | 1319 | `openai/gsm8k` (HF), `test` split | exact-match `\boxed{n}` or last number | broad sanity, decimals allowed |
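
The scoring rules in the table can be sketched as follows. This is an illustrative scorer, not the code in `tasks/{aime,gsm8k}.py`: the names `extract_answer` and `is_correct` are hypothetical, and two details are assumptions, namely that the last `\boxed{...}` wins when several appear and that thousands separators are stripped before comparison.

```python
import re
from typing import Optional

_BOXED = re.compile(r"\\boxed\{([^{}]*)\}")
_NUMBER = re.compile(r"-?\d[\d,]*(?:\.\d+)?")


def extract_answer(response: str) -> Optional[str]:
    """Prefer the last \\boxed{...}; otherwise fall back to the last number."""
    boxed = _BOXED.findall(response)
    candidate = boxed[-1] if boxed else response
    numbers = _NUMBER.findall(candidate)
    if not numbers:
        return None
    return numbers[-1].replace(",", "")  # "1,250" -> "1250"


def is_correct(response: str, gold: str) -> bool:
    """Numeric exact match, so "3.50" and "3.5" compare equal."""
    pred = extract_answer(response)
    if pred is None:
        return False
    try:
        return float(pred) == float(gold.replace(",", ""))
    except ValueError:
        return False
```

For AIME, where answers are integers in 0..999, a stricter variant would likely drop the last-number fallback and accept only a boxed integer.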

Same `temperature=0` sampling across both systems. Max tokens: 16384 for AIME
(long reasoning chains), 2048 for GSM8K. Subsample with `--quality-limit N` for
a smoke run.

**Generation mode must match.** xserv's prompt builder hardcodes Qwen3 thinking
OFF (it appends an empty `<think></think>` block). llama-server applies the
GGUF's Qwen3 jinja template, which has thinking ON by default. The driver
therefore sends `chat_template_kwargs={"enable_thinking": false}` to llama.cpp
so both engines run the model in the same mode. Pass `--enable-thinking` to
compare in thinking mode instead (xserv would need a matching change first).
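A minimal sketch of how the request body might differ per engine is shown below. The helper `build_chat_payload` and the engine labels `"xserv"` / `"llama_cpp"` are hypothetical; only the `chat_template_kwargs={"enable_thinking": false}` field for llama.cpp comes from the text above, and the streaming options follow the standard OpenAI request shape.

```python
from typing import Any, Dict, List


def build_chat_payload(
    model: str,
    messages: List[Dict[str, str]],
    *,
    engine: str,
    enable_thinking: bool = False,
    max_tokens: int = 2048,
) -> Dict[str, Any]:
    """Build an OpenAI-style streaming chat request for one engine."""
    payload: Dict[str, Any] = {
        "model": model,
        "messages": messages,
        "temperature": 0,
        "max_tokens": max_tokens,
        "stream": True,
        "stream_options": {"include_usage": True},
    }
    if engine == "llama_cpp":
        # llama-server applies the GGUF's jinja template, so the thinking
        # mode has to be set explicitly to match xserv's behaviour.
        payload["chat_template_kwargs"] = {"enable_thinking": enable_thinking}
    return payload
```

Keeping every other field identical across engines makes the thinking switch the only intentional difference between the two requests.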
### Report
`bench-out/comparison-<stamp>.md` contains:
- Environment (GPU, driver, xserv commit, python)
- Speed table per scenario (xserv | llama.cpp | xserv÷llama.cpp speedup)
- Quality table per task (n, correct, accuracy, mean tokens, TTFT, TPOT, wall)

A sibling `.json` holds all per-request raw rows and per-problem case detail
(prediction, gold, response preview) so we can diff regressions in CI later.
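As one way the per-problem detail could be diffed later, the sketch below lists problems that were answered correctly in an older report and incorrectly in a newer one. The JSON schema is not described in this document, so the `id` and `correct` field names are purely hypothetical placeholders.

```python
from typing import Dict, Iterable, List


def find_regressions(
    old_cases: Iterable[Dict], new_cases: Iterable[Dict]
) -> List[str]:
    """Return ids that were correct in the old run and are wrong in the new one.

    Assumes each case is a dict with hypothetical keys "id" and "correct".
    """
    previously_correct = {c["id"] for c in old_cases if c["correct"]}
    return sorted(
        c["id"]
        for c in new_cases
        if c["id"] in previously_correct and not c["correct"]
    )
```

Comparing per-problem outcomes rather than aggregate accuracy would catch the case where one fix and one regression cancel out in the headline number.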
## Running it
**One-time prerequisites (on a networked machine):**
```bash
git submodule update --init third_party/llama.cpp # pinned to b9371
HF_ENDPOINT=https://hf-mirror.com python3 -m tools.bench.fetch_datasets
```
**Full sweep on dash5 (recommended):**
```bash
./tools/sync-and-build.sh bench -- --max-seq-len 8192 --quality-limit 50
./tools/sync-and-build.sh fetch-bench-out
open bench-out/comparison-*.md
```
**Speed-only smoke (fast):**
```bash
./tools/sync-and-build.sh bench -- --suite speed --speed-prompts 2
```
**Quality smoke with 5 problems each:**
```bash
./tools/sync-and-build.sh bench -- --suite quality --quality-limit 5
```
**On a host that already has both servers running** (e.g. local dev with two
shells open):
```bash
python3 -m tools.bench.runner \
--xserv-base-url http://127.0.0.1:8080 \
--llama-base-url http://127.0.0.1:8081 \
--suite all
```
## Design choices
1. **Black-box HTTP, not FFI.** Both engines bind the same OpenAI surface and
real serving traffic uses HTTP. Anything that doesn't show up over the wire
doesn't matter for serving.
2. **Same BF16 weights.** We convert the same safetensors with llama.cpp's
`convert_hf_to_gguf.py --outtype bf16`. No quantization at this stage; if we
want a quant comparison later we'll add a separate column, not replace this
one.
3. **Streaming everywhere.** TTFT and TPOT only make sense with streaming. We
ask both servers for `stream=true` with `include_usage` so we can read
server-reported token counts when available.
4. **Idempotent setup.** `setup-llama-cpp.sh` and `convert-to-gguf.sh` are
safe to re-run — they no-op when the build / file already exists. The
`bench` subcommand wires them so the first run does a full setup and
subsequent runs are fast.
5. **Subprocess lifecycle owned by the driver.** We spawn each server in its
own process group and SIGTERM the group on exit so half-dead llama-server
children don't survive. If the user is already running a server somewhere,
pass `--xserv-base-url` / `--llama-base-url` to skip launch.
6. **One server at a time.** The driver starts a system, runs every suite
against it, stops it, then moves to the next. Two BF16 8B models (~16GB each)
do not co-reside on a single 32GB GPU, and a resident idle engine would
distort the other's latency/throughput. This serialization is why the report
is assembled from per-system passes rather than a single interleaved run.
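
The process-group handling in choice 5 can be sketched with the standard library alone. This is an illustration under assumptions, not the contents of `servers.py`: the function names are hypothetical, it is POSIX-only, and the SIGKILL fallback after a timeout is an addition that the text above does not mention.

```python
import os
import signal
import subprocess
from typing import List


def start_server(argv: List[str]) -> subprocess.Popen:
    """Launch a server as the leader of a new session and process group."""
    return subprocess.Popen(argv, start_new_session=True)


def stop_server(proc: subprocess.Popen, timeout: float = 10.0) -> int:
    """SIGTERM the whole group so helper children exit with the server."""
    if proc.poll() is None:
        try:
            os.killpg(os.getpgid(proc.pid), signal.SIGTERM)
        except ProcessLookupError:
            pass  # the process exited between poll() and killpg()
    try:
        return proc.wait(timeout=timeout)
    except subprocess.TimeoutExpired:
        # Assumed fallback: escalate if the group ignores SIGTERM.
        os.killpg(os.getpgid(proc.pid), signal.SIGKILL)
        return proc.wait()
```

Signalling the group rather than the single PID is what prevents orphaned children from holding on to GPU memory before the next system starts.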
## Known constraints / findings
- **xserv OOM at `--max-seq-len 8192` — fixed.** xserv used to pre-allocate its
paged-KV pool (`total_blocks = blocks_per_seq · max_batch · 2`, ≈9GB at 8192)
on top of the 16GB weights, exceeding 32GB at startup (`paged_kv_cache.rs`
`alloc paged K pool: OutOfMemory`). Now the pool is sized to *available VRAM*
(`cudaMemGetInfo`) and overflow is swapped to pinned host memory (vLLM-style
preemption, `--swap-space-gb`). The 8192 comparison runs cleanly with 0 swap
events; swap is verified separately under a forced-small pool. The benchmark
surfaced the OOM — a good example of the baseline doing its job.
- When the xserv engine thread dies, the API now returns a clean 503 (the
request handler uses a poison-tolerant lock instead of cascading
mutex-poison panics). The driver records any failure as a per-request error,
so a broken engine shows up as `errs=N` / `accuracy 0%` rather than a hung run.
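
For a feel of the numbers behind the OOM finding, the back-of-envelope sketch below estimates KV-cache bytes per token and how many tokens fit in a given amount of free VRAM. Everything model-specific is an assumption: it uses Qwen3-8B's published configuration (36 layers, 8 KV heads, head dim 128) with BF16 KV entries, and none of these values or function names are read from xserv. How xserv's `total_blocks` formula maps onto these terms is not stated in this document.

```python
def kv_bytes_per_token(
    num_layers: int, num_kv_heads: int, head_dim: int, dtype_bytes: int = 2
) -> int:
    """Bytes of K plus V stored for one token across all layers."""
    return 2 * num_layers * num_kv_heads * head_dim * dtype_bytes


def tokens_that_fit(free_bytes: int, per_token_bytes: int) -> int:
    """How many tokens of KV cache fit in the given free memory."""
    return max(0, free_bytes) // per_token_bytes


# Assumed Qwen3-8B configuration (not read from xserv).
PER_TOKEN = kv_bytes_per_token(num_layers=36, num_kv_heads=8, head_dim=128)
GIB = 2**30

per_sequence_gib = PER_TOKEN * 8192 / GIB       # one full 8192-token sequence
capacity = tokens_that_fit(14 * GIB, PER_TOKEN)  # e.g. 14 GiB left for KV
```

Under these assumptions one 8192-token sequence needs about 1.125 GiB of KV, so eight of them come to 9 GiB, which is in the same range as the ≈9GB figure above. Sizing the pool from free memory instead turns the fixed up-front allocation into a capacity limit, with host swap covering the overflow.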
## Future extensions
- Add quant runs (Q8_0, Q4_K_M) as separate "system" columns
- Wire to GitHub Actions for nightly regression
- Track results across commits to flag regressions (per-commit JSON in
`docs/benchmarks/history/`)
- Add MMLU-Pro / HumanEval when budget allows
- Long-context benchmark (8K, 32K prompts) to compare prefill scaling