# Phase 16: llama.cpp Comparison Baseline

> **Goal.** Replace HF transformers with **llama.cpp** as the standing
> performance baseline, and add a standard quality (response correctness)
> benchmark suite (AIME 2025, GSM8K). Provide a one-click entrypoint that runs
> both systems under identical workloads and emits a side-by-side report.

## Motivation

xserv has cleared 140% of HF transformers throughput on Qwen3-8B (Phase 15).
HF is no longer a useful performance bar; it's a *correctness* baseline.

**llama.cpp** is the right next bar because:
- It's a serious C++/CUDA inference engine with active optimization
- Same OpenAI-compatible API → black-box, fair comparison
- Same GGUF↔safetensors weight source (we convert BF16, no quantization shortcuts)
- Used widely as a reference point in the community

We also need **quality benchmarks** so that performance improvements don't
silently regress model quality (numerical precision, sampling, prompt
formatting). AIME and GSM8K are the cheapest credible signals.

## Architecture

```
xserv/
├── third_party/llama.cpp/          # cloned by setup-llama-cpp.sh
│   └── build/bin/llama-server      # CUDA build (SM120)
├── tools/
│   ├── setup-llama-cpp.sh          # clone + cmake build (idempotent)
│   ├── convert-to-gguf.sh          # safetensors → BF16 GGUF (same weights)
│   ├── sync-and-build.sh           # extended with `bench` subcommand
│   └── bench/                      # Python benchmark driver
│       ├── runner.py               # entrypoint
│       ├── servers.py              # subprocess lifecycle (start/stop both)
│       ├── client.py               # OpenAI streaming client + TTFT/TPOT
│       ├── speed.py                # speed suite
│       ├── quality.py              # quality suite
│       ├── tasks/{aime,gsm8k}.py   # dataset loaders + scorers
│       ├── report.py               # markdown + json output
│       └── requirements.txt        # httpx, datasets
└── bench-out/                      # report artifacts (gitignored)
    ├── comparison-<stamp>.md
    ├── comparison-<stamp>.json
    └── logs/{xserv,llama_cpp}.log
```

Both systems are treated as **black-box HTTP servers** speaking the OpenAI
streaming chat API. No in-process integration, no shared Python bindings. This
keeps the comparison fair (same protocol, same prompt-template path) and
isolates the test harness from internal API churn on either side.

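To make the black-box framing concrete, here is a minimal sketch of how a driver could decode one engine's streamed response. It assumes the usual OpenAI server-sent-events framing (`data: {json}` lines terminated by `data: [DONE]`); the helper name `iter_stream_deltas` is hypothetical and is not taken from `client.py`.

```python
import json
from typing import Iterable, Iterator, Optional, Tuple


def iter_stream_deltas(lines: Iterable[str]) -> Iterator[Tuple[str, Optional[dict]]]:
    """Yield (text_delta, usage) pairs from OpenAI-style SSE lines.

    Hypothetical helper: assumes `data: {json}` framing terminated by
    `data: [DONE]`, which both servers are expected to emit.
    """
    for raw in lines:
        line = raw.strip()
        if not line.startswith("data:"):
            continue  # skip blank separators and SSE comments
        payload = line[len("data:"):].strip()
        if payload == "[DONE]":
            return
        chunk = json.loads(payload)
        choices = chunk.get("choices") or []
        delta = choices[0].get("delta", {}) if choices else {}
        # With include_usage, the final chunk carries `usage` and no choices.
        yield delta.get("content") or "", chunk.get("usage")
```

Because the function only sees text lines, the same code path would serve both engines, which is the point of treating them as opaque HTTP endpoints.
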
## Workflow

The GPU host (dash5) has **no outbound network and no rsync**, so anything from
the internet is fetched locally and shipped over via tar-over-ssh.

```
local repo (has network)                    dash5 (GPU host, no network)
────────────────────────                    ────────────────────────────
# one-time, on a networked machine:
python3 -m tools.bench.fetch_datasets    →  tools/bench/data/{aime2025,gsm8k}.json
git submodule update --init …            →  third_party/llama.cpp source

tools/sync-and-build.sh bench            →  tar project (excl. target, third_party, bench-out)
                                         →  tar llama.cpp source (excl. build, .git)
                                         →  setup-llama-cpp.sh (build-only; no-op if built)
                                         →  convert-to-gguf.sh (no-op if .gguf exists)
                                         →  cargo build --release
                                         →  python3 -m tools.bench.runner ...
                                         →  bench-out/comparison-<stamp>.md
tools/sync-and-build.sh fetch-bench-out  ←  tar bench-out back
```

Behind a flaky proxy, fetch datasets through the HF mirror:
`HF_ENDPOINT=https://hf-mirror.com python3 -m tools.bench.fetch_datasets`.

`tools/__init__.py` exists so `python3 -m tools.bench.runner` resolves our
package: some site-packages (e.g. nvfuser) ship a regular top-level `tools`
package that would otherwise shadow a namespace `tools`.

## What gets measured

### Speed (TTFT / TPOT / throughput)

- **Single-stream**, three prompt lengths (short / medium / long), each repeated `cfg.speed_prompts` times
  - `TTFT p50/p95`, `TPOT p50/p95`, per-request throughput
- **Concurrent**, fixed medium prompt, sweep `concurrency ∈ {1, 2, 4, 8}`
  - Aggregate `tok/s`, `TTFT p95`, error count
- Both at `temperature=0`, `max_tokens=128` by default.

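The latency metrics above can be derived from per-token arrival timestamps. The sketch below assumes the common definitions (TTFT is the delay to the first token, TPOT is the mean gap between later tokens) and a linear-interpolation percentile; the function names are illustrative and `speed.py` may compute these differently.

```python
from typing import List, Sequence, Tuple


def ttft_and_tpot(t_send: float, token_times: Sequence[float]) -> Tuple[float, float]:
    """Return (TTFT, TPOT) in seconds for one streamed request.

    TTFT: time from sending the request to the first token.
    TPOT: mean inter-token gap after the first token (assumed definition).
    """
    if not token_times:
        raise ValueError("no tokens received")
    ttft = token_times[0] - t_send
    n = len(token_times)
    tpot = (token_times[-1] - token_times[0]) / (n - 1) if n > 1 else 0.0
    return ttft, tpot


def percentile(values: Sequence[float], p: float) -> float:
    """Linear-interpolation percentile, with p in [0, 100]."""
    xs: List[float] = sorted(values)
    if not xs:
        raise ValueError("empty sample")
    rank = (len(xs) - 1) * p / 100.0
    lo = int(rank)
    hi = min(lo + 1, len(xs) - 1)
    return xs[lo] + (xs[hi] - xs[lo]) * (rank - lo)
```

With only a handful of repeats per prompt length, p95 sits close to the maximum, so it is worth reading alongside p50 rather than on its own.
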
### Quality (response correctness)

| Task | N | Source | Scoring | Why |
|---|---|---|---|---|
| AIME 2025 | 30 | `MathArena/aime_2025`, fallback `yentinglin/aime_2025` (HF) | exact-match boxed integer (0..999) | reasoning + math, hard signal |
| GSM8K | 1319 | `openai/gsm8k` (HF), `test` split | exact-match `\boxed{n}` or last number | broad sanity, decimals allowed |

Both systems use the same `temperature=0` sampling. Max tokens: 16384 for AIME
(long reasoning traces), 2048 for GSM8K. Subsample with `--quality-limit N` for
smoke runs.

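The scoring rules in the table can be sketched as two small functions. This is an illustrative reading of "exact-match boxed integer" and "`\boxed{n}` or last number", not the code in `tasks/{aime,gsm8k}.py`; the regexes and function names are assumptions, and nested braces inside `\boxed{...}` are deliberately not handled.

```python
import re
from typing import Optional

_BOXED = re.compile(r"\\boxed\{([^{}]*)\}")
_NUMBER = re.compile(r"-?\d[\d,]*(?:\.\d+)?")


def last_boxed(text: str) -> Optional[str]:
    """Return the contents of the last boxed answer in the text, if any."""
    hits = _BOXED.findall(text)
    return hits[-1].strip() if hits else None


def score_aime(response: str, gold: int) -> bool:
    """Exact match on a boxed integer in 0..999."""
    boxed = last_boxed(response)
    if boxed is None or not re.fullmatch(r"\d{1,3}", boxed):
        return False
    return int(boxed) == gold


def score_gsm8k(response: str, gold: str) -> bool:
    """Prefer the boxed answer; otherwise fall back to the last number."""
    candidate = last_boxed(response)
    if candidate is None:
        numbers = _NUMBER.findall(response)
        candidate = numbers[-1] if numbers else None
    if candidate is None:
        return False
    try:
        # Compare numerically so "1,234.50" matches "1234.5".
        return float(candidate.replace(",", "")) == float(gold.replace(",", ""))
    except ValueError:
        return False
```

Taking the last boxed answer rather than the first is a design choice here: models often restate intermediate results before the final one.
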
**Generation mode must match.** xserv's prompt builder hardcodes Qwen3 thinking
OFF (it appends an empty `<think></think>` block). llama-server applies the
GGUF's Qwen3 jinja template, which has thinking ON by default. The driver
therefore sends `chat_template_kwargs={"enable_thinking": false}` to llama.cpp
so both engines run the model in the same mode. Pass `--enable-thinking` to
compare in thinking mode instead (xserv would need a matching change first).

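A minimal sketch of how the per-system request body might differ is shown below. The `chat_template_kwargs` field comes from the paragraph above and `stream_options.include_usage` is the standard OpenAI streaming option; the function name and the `"xserv"` / `"llama_cpp"` system labels are assumptions for illustration.

```python
from typing import Any, Dict, List


def build_chat_payload(
    system: str,
    model: str,
    messages: List[Dict[str, str]],
    max_tokens: int,
    enable_thinking: bool = False,
) -> Dict[str, Any]:
    """Build an OpenAI-style chat request body for one system under test."""
    payload: Dict[str, Any] = {
        "model": model,
        "messages": messages,
        "temperature": 0,
        "max_tokens": max_tokens,
        "stream": True,
        "stream_options": {"include_usage": True},
    }
    if system == "llama_cpp":
        # llama-server's Qwen3 template defaults to thinking ON, so the
        # mode is pinned explicitly; xserv hardcodes thinking OFF itself.
        payload["chat_template_kwargs"] = {"enable_thinking": enable_thinking}
    return payload
```

Everything except the template override is identical across systems, which keeps the sampling settings from drifting between the two runs.
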
### Report

`bench-out/comparison-<stamp>.md` contains:
- Environment (GPU, driver, xserv commit, python)
- Speed table per scenario (xserv | llama.cpp | xserv÷llama.cpp speedup)
- Quality table per task (n, correct, accuracy, mean tokens, TTFT, TPOT, wall)

A sibling `.json` holds all per-request raw rows and per-problem case detail
(prediction, gold, response preview) so we can diff regressions in CI later.

## Running it

**One-time prerequisites (on a networked machine):**
```bash
git submodule update --init third_party/llama.cpp # pinned to b9371
HF_ENDPOINT=https://hf-mirror.com python3 -m tools.bench.fetch_datasets
```

**Full sweep on dash5 (recommended):**
```bash
./tools/sync-and-build.sh bench -- --max-seq-len 8192 --quality-limit 50
./tools/sync-and-build.sh fetch-bench-out
open bench-out/comparison-*.md
```

**Speed-only smoke (fast):**
```bash
./tools/sync-and-build.sh bench -- --suite speed --speed-prompts 2
```

**Quality smoke with 5 problems each:**
```bash
./tools/sync-and-build.sh bench -- --suite quality --quality-limit 5
```

**On a host that already has both servers running** (e.g. local dev with two
shells open):
```bash
python3 -m tools.bench.runner \
  --xserv-base-url http://127.0.0.1:8080 \
  --llama-base-url http://127.0.0.1:8081 \
  --suite all
```

## Design choices

1. **Black-box HTTP, not FFI.** Both engines bind the same OpenAI surface and
   real serving traffic uses HTTP. Anything that doesn't show up over the wire
   doesn't matter for serving.
2. **Same BF16 weights.** We convert the same safetensors with llama.cpp's
   `convert_hf_to_gguf.py --outtype bf16`. No quantization at this stage; if we
   want a quant comparison later we'll add a separate column, not replace this
   one.
3. **Streaming everywhere.** TTFT and TPOT only make sense with streaming. We
   ask both servers for `stream=true` with `include_usage` so we can read
   server-reported token counts when available.
4. **Idempotent setup.** `setup-llama-cpp.sh` and `convert-to-gguf.sh` are
   safe to re-run: they no-op when the build / file already exists. The
   `bench` subcommand wires them so the first run does a full setup and
   subsequent runs are fast.
5. **Subprocess lifecycle owned by the driver.** We spawn each server in its
   own process group and SIGTERM the group on exit so half-dead llama-server
   children don't survive. If the user is already running a server somewhere,
   pass `--xserv-base-url` / `--llama-base-url` to skip launch.
6. **One server at a time.** The driver starts a system, runs every suite
   against it, stops it, then moves to the next. Two BF16 8B models (~16GB each)
   do not co-reside on a single 32GB GPU, and a resident idle engine would
   distort the other's latency/throughput. This serialization is why the report
   is assembled from per-system passes rather than a single interleaved run.

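The process-group lifecycle from design choice 5 can be sketched with the standard library alone. This is a POSIX-only illustration, not the contents of `servers.py`: the function names are hypothetical, and the escalation to SIGKILL after a timeout is an added assumption rather than something this document states.

```python
import os
import signal
import subprocess
from typing import List


def start_server(cmd: List[str], log_path: str) -> subprocess.Popen:
    """Launch a server as the leader of its own session and process group."""
    log = open(log_path, "ab")
    try:
        return subprocess.Popen(
            cmd,
            stdout=log,
            stderr=subprocess.STDOUT,
            start_new_session=True,  # child gets a new group; pgid == pid
        )
    finally:
        log.close()  # the child keeps its own copy of the descriptor


def stop_server(proc: subprocess.Popen, timeout: float = 10.0) -> int:
    """SIGTERM the whole group; escalate to SIGKILL if it does not exit."""
    if proc.poll() is None:
        try:
            os.killpg(proc.pid, signal.SIGTERM)
        except ProcessLookupError:
            pass  # the group exited between poll() and killpg()
        try:
            proc.wait(timeout=timeout)
        except subprocess.TimeoutExpired:
            os.killpg(proc.pid, signal.SIGKILL)
            proc.wait()
    return proc.returncode
```

Signalling the group rather than the single PID is what catches helper processes a server may have forked; a driver would typically call `stop_server` from a `finally` block so that a failed suite still cleans up before the next system starts.
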
## Known constraints / findings

- **xserv OOM at `--max-seq-len 8192` (fixed).** xserv used to pre-allocate its
  paged-KV pool (`total_blocks = blocks_per_seq · max_batch · 2`, ≈9GB at 8192)
  on top of the 16GB weights, exceeding 32GB at startup (`paged_kv_cache.rs`
  `alloc paged K pool: OutOfMemory`). Now the pool is sized to *available VRAM*
  (`cudaMemGetInfo`) and overflow is swapped to pinned host memory (vLLM-style
  preemption, `--swap-space-gb`). The 8192 comparison runs cleanly with 0 swap
  events; swap is verified separately under a forced-small pool. The benchmark
  surfaced the OOM, a good example of the baseline doing its job.
- When the xserv engine thread dies, the API now returns a clean 503 (the
  request handler uses a poison-tolerant lock instead of cascading
  mutex-poison panics). The driver records any failure as a per-request error,
  so a broken engine shows up as `errs=N` / `accuracy 0%` rather than a hung run.

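For a feel of the numbers behind the KV-pool finding, here is a back-of-the-envelope sketch. It does not reproduce xserv's actual formula: the model shape (36 layers, 8 KV heads, head dimension 128, 2-byte BF16) is the Qwen3-8B configuration as assumed here, and the 16-token block size and 1 GiB headroom are made-up illustration values.

```python
GIB = 1 << 30


def kv_bytes_per_token(layers: int, kv_heads: int, head_dim: int, dtype_bytes: int = 2) -> int:
    """Bytes of K plus V cache that one token occupies across all layers."""
    return 2 * layers * kv_heads * head_dim * dtype_bytes


def blocks_that_fit(free_bytes: int, block_tokens: int, bytes_per_token: int,
                    headroom_bytes: int = GIB) -> int:
    """KV blocks that fit in free VRAM after reserving some headroom."""
    usable = max(free_bytes - headroom_bytes, 0)
    return usable // (block_tokens * bytes_per_token)


# Assumed Qwen3-8B shape; not read from the xserv source.
per_token = kv_bytes_per_token(layers=36, kv_heads=8, head_dim=128)
per_seq_gib = per_token * 8192 / GIB  # one full 8192-token sequence

# Hypothetical: 15 GiB reported free after weights, 16-token blocks.
pool_blocks = blocks_that_fit(15 * GIB, block_tokens=16, bytes_per_token=per_token)
```

Under these assumptions one full-length sequence needs about 1.1 GiB of KV cache, so a pool sized for several worst-case sequences up front lands in the multi-gigabyte range; sizing from the free-memory figure instead caps the pool at whatever the device can actually hold.
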
## Future extensions

- Add quant runs (Q8_0, Q4_K_M) as separate "system" columns
- Wire to GitHub Actions for nightly regression
- Track results across commits to flag regressions (per-commit JSON in
  `docs/benchmarks/history/`)
- Add MMLU-Pro / HumanEval when budget allows
- Long-context benchmark (8K, 32K prompts) to compare prefill scaling