
# Phase 16: llama.cpp Comparison Baseline

> **Goal.** Replace HF transformers with **llama.cpp** as the standing
> performance baseline, and add a standard quality (response correctness)
> benchmark suite (AIME 2025, GSM8K). Provide a one-click entrypoint that runs
> both systems under identical workloads and emits a side-by-side report.

## Motivation

xserv now exceeds 140% of HF transformers throughput on Qwen3-8B (Phase 15),
so HF is no longer a useful performance bar; it remains useful as a
*correctness* baseline.

**llama.cpp** is the right next bar because:
- It's a serious C++/CUDA inference engine with active optimization
- Same OpenAI-compatible API → black-box, fair comparison
- Same weights: we convert the safetensors checkpoint to BF16 GGUF, with no quantization shortcuts
- Used widely as a reference point in the community

We also need **quality benchmarks** so that performance improvements don't
silently regress model quality (numerical precision, sampling, prompt
formatting). AIME and GSM8K are the cheapest credible signals.

## Architecture

```
xserv/
├── third_party/llama.cpp/         # cloned by setup-llama-cpp.sh
│   └── build/bin/llama-server     # CUDA build (SM120)
├── tools/
│   ├── setup-llama-cpp.sh         # clone + cmake build (idempotent)
│   ├── convert-to-gguf.sh         # safetensors → BF16 GGUF (same weights)
│   ├── sync-and-build.sh          # extended with `bench` subcommand
│   └── bench/                     # Python benchmark driver
│       ├── runner.py              # entrypoint
│       ├── servers.py             # subprocess lifecycle (start/stop both)
│       ├── client.py              # OpenAI streaming client + TTFT/TPOT
│       ├── speed.py               # speed suite
│       ├── quality.py             # quality suite
│       ├── tasks/{aime,gsm8k}.py  # dataset loaders + scorers
│       ├── report.py              # markdown + json output
│       └── requirements.txt       # httpx, datasets
└── bench-out/                     # report artifacts (gitignored)
    ├── comparison-<stamp>.md
    ├── comparison-<stamp>.json
    └── logs/{xserv,llama_cpp}.log
```

Both systems are treated as **black-box HTTP servers** speaking the OpenAI
streaming chat API. No in-process integration, no shared Python bindings. This
keeps the comparison fair (same protocol, same prompt-template path) and
isolates the test harness from internal API churn on either side.
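
As a rough illustration of what the black-box contract looks like on the wire, the sketch below parses an OpenAI-style server-sent-event stream into text deltas plus the optional usage block. It is not the actual `client.py`: the function name `parse_sse_lines` is hypothetical, and the only thing assumed is the usual OpenAI streaming convention (`data: {json}` lines, a `data: [DONE]` terminator, and a final chunk with empty `choices` that carries `usage`).

```python
import json
from typing import Iterable, Optional


def parse_sse_lines(lines: Iterable[str]) -> tuple[list[str], Optional[dict]]:
    """Collect content deltas and the optional usage block from SSE lines.

    Hypothetical helper for illustration; not taken from tools/bench/client.py.
    """
    deltas: list[str] = []
    usage: Optional[dict] = None
    for raw in lines:
        line = raw.strip()
        if not line.startswith("data:"):
            continue  # skip blank separators and SSE comment lines
        payload = line[len("data:"):].strip()
        if payload == "[DONE]":
            break  # end-of-stream sentinel
        chunk = json.loads(payload)
        if chunk.get("usage"):
            usage = chunk["usage"]  # server-reported token counts, if sent
        for choice in chunk.get("choices") or []:
            text = (choice.get("delta") or {}).get("content")
            if text:
                deltas.append(text)
    return deltas, usage
```

Because the parser only depends on the protocol, the same code path can serve both engines, which is the point of keeping the harness out of process.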

## Workflow

The GPU host (dash5) has **no outbound network and no rsync**, so anything from
the internet is fetched locally and shipped over via tar-over-ssh.

```
local repo (has network)              dash5 (GPU host, no network)
────────────────────────              ────────────────────────────
# one-time, on a networked machine:
python3 -m tools.bench.fetch_datasets  →  tools/bench/data/{aime2025,gsm8k}.json
git submodule update --init …          →  third_party/llama.cpp source

tools/sync-and-build.sh bench   →  tar project   (excl. target, third_party, bench-out)
                                →  tar llama.cpp source (excl. build, .git)
                                →  setup-llama-cpp.sh   (build-only; no-op if built)
                                →  convert-to-gguf.sh   (no-op if .gguf exists)
                                →  cargo build --release
                                →  python3 -m tools.bench.runner ...
                                →  bench-out/comparison-<stamp>.md
tools/sync-and-build.sh fetch-bench-out  ←  tar bench-out back
```

Behind a flaky proxy, fetch datasets through the HF mirror:
`HF_ENDPOINT=https://hf-mirror.com python3 -m tools.bench.fetch_datasets`.

`tools/__init__.py` exists so `python3 -m tools.bench.runner` resolves our
package: some site-packages (e.g. nvfuser) ship a regular top-level `tools`
package that would otherwise shadow a namespace `tools`.
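
If `python3 -m tools.bench.runner` ever fails to import, a quick way to see which `tools` wins is to ask the import system directly. The helper below is a small diagnostic sketch (the name `describe_package` is made up for this example); it relies only on `importlib.util.find_spec`, where a namespace package has no `origin` and a regular package reports its `__init__.py`.

```python
import importlib.util


def describe_package(name: str) -> str:
    """Report whether `name` resolves to a regular or a namespace package."""
    spec = importlib.util.find_spec(name)
    if spec is None:
        return f"{name}: not importable"
    if spec.origin is None:
        # Namespace package: a directory without __init__.py
        locations = list(spec.submodule_search_locations or [])
        return f"{name}: namespace package at {locations}"
    return f"{name}: regular package or module at {spec.origin}"


if __name__ == "__main__":
    # Run from the repo root; the path should point into this repo,
    # not into site-packages.
    print(describe_package("tools"))
```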

## What gets measured

### Speed (TTFT / TPOT / throughput)

- **Single-stream**: three prompt lengths (short / medium / long), each repeated `cfg.speed_prompts` times
  - `TTFT p50/p95`, `TPOT p50/p95`, per-request throughput
- **Concurrent**: fixed medium prompt, sweeping `concurrency ∈ {1, 2, 4, 8}`
  - Aggregate `tok/s`, `TTFT p95`, error count
- Both modes run at `temperature=0` with `max_tokens=128` by default.
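
For readers new to the metrics, here is one plausible way to derive them from client-side timestamps. This is a sketch, not the definitions used in `client.py` (which this document does not spell out): it treats each streamed content chunk as one token, which is only approximately true, and uses a nearest-rank percentile. All function names are illustrative.

```python
import math
from typing import Sequence


def ttft(sent_at: float, arrivals: Sequence[float]) -> float:
    """Time to first token: first content chunk arrival minus send time."""
    return arrivals[0] - sent_at


def tpot(arrivals: Sequence[float]) -> float:
    """Time per output token: mean gap between chunks after the first."""
    if len(arrivals) < 2:
        return 0.0
    return (arrivals[-1] - arrivals[0]) / (len(arrivals) - 1)


def percentile(values: Sequence[float], p: float) -> float:
    """Nearest-rank percentile for p in (0, 100]."""
    ordered = sorted(values)
    rank = max(1, math.ceil(p / 100 * len(ordered)))
    return ordered[rank - 1]
```

With only a handful of repeats per prompt length, p95 under nearest-rank is effectively the worst observed sample, so it is best read as a tail indicator rather than a precise statistic.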

### Quality (response correctness)

| Task | N | Source | Scoring | Why |
|---|---|---|---|---|
| AIME 2025 | 30 | `MathArena/aime_2025`, fallback `yentinglin/aime_2025` (HF) | exact-match boxed integer (0..999) | reasoning + math, hard signal |
| GSM8K | 1319 | `openai/gsm8k` (HF), `test` split | exact-match `\boxed{n}` or last number | broad sanity, decimals allowed |

Both systems sample at `temperature=0`. The token budget is 16384 for AIME
(long reasoning chains) and 2048 for GSM8K. Pass `--quality-limit N` to
subsample each task for smoke runs.
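
The scoring rules in the table can be sketched as follows. This is an illustrative reading of "exact-match boxed integer" and "`\boxed{n}` or last number", not the code in `tasks/{aime,gsm8k}.py`; the function names are hypothetical, and nested braces inside `\boxed{...}` are deliberately not handled.

```python
import re
from decimal import Decimal, InvalidOperation
from typing import Optional

_BOXED = re.compile(r"\\boxed\{([^{}]*)\}")
_NUMBER = re.compile(r"-?\d[\d,]*(?:\.\d+)?")


def last_boxed(text: str) -> Optional[str]:
    """Return the content of the last \\boxed{...} in the text, if any."""
    matches = _BOXED.findall(text)
    return matches[-1].strip() if matches else None


def score_aime(response: str, gold: int) -> bool:
    """Exact match on a boxed integer in the AIME range 0..999."""
    boxed = last_boxed(response)
    if boxed is None or not re.fullmatch(r"[0-9]+", boxed):
        return False
    value = int(boxed)
    return 0 <= value <= 999 and value == gold


def _to_decimal(text: str) -> Optional[Decimal]:
    try:
        return Decimal(text.replace(",", ""))  # drop thousands separators
    except InvalidOperation:
        return None


def score_gsm8k(response: str, gold: str) -> bool:
    """Prefer the boxed answer; otherwise fall back to the last number."""
    candidate = last_boxed(response)
    if candidate is None:
        numbers = _NUMBER.findall(response)
        candidate = numbers[-1] if numbers else None
    if candidate is None:
        return False
    pred, want = _to_decimal(candidate), _to_decimal(gold)
    return pred is not None and pred == want
```

Comparing through `Decimal` rather than strings lets `18`, `18.0`, and `1,250` versus `1250` count as matches, which is what "decimals allowed" suggests.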

**Generation mode must match.** xserv's prompt builder hardcodes Qwen3 thinking
OFF (it appends an empty `<think></think>` block). llama-server applies the
GGUF's Qwen3 jinja template, which has thinking ON by default. The driver
therefore sends `chat_template_kwargs={"enable_thinking": false}` to llama.cpp
so both engines run the model in the same mode. Pass `--enable-thinking` to
compare in thinking mode instead (xserv would need a matching change first).
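
A minimal sketch of how the per-engine request bodies might differ is shown below. The builder name, the `engine` labels, and the default `model` string are assumptions for illustration; the only parts taken from this document are `temperature=0`, streaming, and the `chat_template_kwargs` field sent to llama.cpp.

```python
from typing import Any


def build_chat_payload(
    engine: str,  # hypothetical label: "xserv" or "llama_cpp"
    messages: list[dict[str, str]],
    *,
    enable_thinking: bool = False,
    max_tokens: int = 2048,
    model: str = "qwen3-8b",  # placeholder model name
) -> dict[str, Any]:
    """Build an OpenAI-style chat request body for one engine."""
    payload: dict[str, Any] = {
        "model": model,
        "messages": messages,
        "temperature": 0,
        "max_tokens": max_tokens,
        "stream": True,
        "stream_options": {"include_usage": True},
    }
    if engine == "llama_cpp":
        # llama-server forwards these kwargs to the GGUF's jinja template.
        payload["chat_template_kwargs"] = {"enable_thinking": enable_thinking}
    return payload
```

Keeping the engine-specific field in one place makes it easy to audit that the two request bodies differ only in the thinking switch.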

### Report

`bench-out/comparison-<stamp>.md` contains:
- Environment (GPU, driver, xserv commit, python)
- Speed table per scenario (xserv | llama.cpp | xserv÷llama.cpp speedup)
- Quality table per task (n, correct, accuracy, mean tokens, TTFT, TPOT, wall)

A sibling `.json` holds all per-request raw rows and per-problem case detail
(prediction, gold, response preview) so we can diff regressions in CI later.

## Running it

**One-time prerequisites (on a networked machine):**
```bash
git submodule update --init third_party/llama.cpp     # pinned to b9371
HF_ENDPOINT=https://hf-mirror.com python3 -m tools.bench.fetch_datasets
```

**Full sweep on dash5 (recommended):**
```bash
./tools/sync-and-build.sh bench -- --max-seq-len 8192 --quality-limit 50
./tools/sync-and-build.sh fetch-bench-out
open bench-out/comparison-*.md
```

**Speed-only smoke (fast):**
```bash
./tools/sync-and-build.sh bench -- --suite speed --speed-prompts 2
```

**Quality smoke with 5 problems each:**
```bash
./tools/sync-and-build.sh bench -- --suite quality --quality-limit 5
```

**On a host that already has both servers running** (e.g. local dev with two
shells open):
```bash
python3 -m tools.bench.runner \
    --xserv-base-url http://127.0.0.1:8080 \
    --llama-base-url http://127.0.0.1:8081 \
    --suite all
```

## Design choices

1. **Black-box HTTP, not FFI.** Both engines expose the same OpenAI-compatible
   API, and real serving traffic arrives over HTTP. Anything that doesn't show
   up over the wire doesn't matter for serving.
2. **Same BF16 weights.** We convert the same safetensors with llama.cpp's
   `convert_hf_to_gguf.py --outtype bf16`. No quantization at this stage; if we
   want a quant comparison later we'll add a separate column, not replace this
   one.
3. **Streaming everywhere.** TTFT and TPOT only make sense with streaming. We
   ask both servers for `stream=true` with
   `stream_options={"include_usage": true}` so we can read server-reported
   token counts when available.
4. **Idempotent setup.** `setup-llama-cpp.sh` and `convert-to-gguf.sh` are
   safe to re-run — they no-op when the build / file already exists. The
   `bench` subcommand wires them so the first run does a full setup and
   subsequent runs are fast.
5. **Subprocess lifecycle owned by the driver.** We spawn each server in its
   own process group and SIGTERM the group on exit so half-dead llama-server
   children don't survive. If the user is already running a server somewhere,
   pass `--xserv-base-url` / `--llama-base-url` to skip launch.
6. **One server at a time.** The driver starts a system, runs every suite
   against it, stops it, then moves to the next. Two BF16 8B models (~16GB each)
   would fill a single 32GB GPU with weights alone, leaving no room for KV
   cache, and a resident idle engine would distort the other's
   latency/throughput. This serialization is why the report is assembled from
   per-system passes rather than a single interleaved run.
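
The process-group handling in choice 5 can be sketched with the standard library alone. This is an assumed shape for `servers.py`, not a copy of it: the function names and the 10-second grace period are illustrative, and the approach is POSIX-only.

```python
import os
import signal
import subprocess


def start_server(cmd: list[str]) -> subprocess.Popen:
    """Launch a server as the leader of its own session and process group."""
    # start_new_session=True runs setsid() in the child before exec.
    return subprocess.Popen(cmd, start_new_session=True)


def stop_server(proc: subprocess.Popen, grace_s: float = 10.0) -> None:
    """SIGTERM the whole group, then SIGKILL if it does not exit in time."""
    if proc.poll() is not None:
        return  # already exited and reaped
    try:
        os.killpg(os.getpgid(proc.pid), signal.SIGTERM)
    except ProcessLookupError:
        return  # group vanished between poll() and killpg()
    try:
        proc.wait(timeout=grace_s)
    except subprocess.TimeoutExpired:
        os.killpg(os.getpgid(proc.pid), signal.SIGKILL)
        proc.wait()
```

Signalling the group rather than the single PID is what catches helper processes the server may have forked; wrapping `stop_server` in a `try/finally` around each system's pass keeps the GPU free for the next one even when a suite raises.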

## Known constraints / findings

- **xserv OOM at `--max-seq-len 8192` — fixed.** xserv used to pre-allocate its
  paged-KV pool (`total_blocks = blocks_per_seq · max_batch · 2`, ≈9GB at 8192)
  on top of the 16GB weights, exceeding 32GB at startup (`paged_kv_cache.rs`
  `alloc paged K pool: OutOfMemory`). Now the pool is sized to *available VRAM*
  (`cudaMemGetInfo`) and overflow is swapped to pinned host memory (vLLM-style
  preemption, `--swap-space-gb`). The 8192 comparison runs cleanly with 0 swap
  events; swap is verified separately under a forced-small pool. The benchmark
  surfaced the OOM — a good example of the baseline doing its job.
- When the xserv engine thread dies, the API now returns a clean 503 (the
  request handler uses a poison-tolerant lock instead of cascading
  mutex-poison panics). The driver records any failure as a per-request error,
  so a broken engine shows up as `errs=N` / `accuracy 0%` rather than a hung run.
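
For a sense of scale, the back-of-envelope below estimates KV-cache size per token and for a full batch. Everything in it is an assumption rather than something this document states: the attention shape is the commonly published Qwen3-8B configuration (36 layers, 8 KV heads, head dimension 128), and `max_batch = 8` is borrowed from the top of the concurrency sweep. It does not attempt to reproduce xserv's `total_blocks` formula or to say which allocations the ≈9GB figure covers.

```python
# Assumed Qwen3-8B attention shape (not stated in this document).
NUM_LAYERS = 36
NUM_KV_HEADS = 8
HEAD_DIM = 128
BYTES_PER_ELEM = 2  # BF16
GIB = 1024 ** 3


def kv_bytes_per_token() -> int:
    """Bytes of K plus V stored for one token across all layers."""
    per_layer_one_tensor = NUM_KV_HEADS * HEAD_DIM * BYTES_PER_ELEM
    return 2 * NUM_LAYERS * per_layer_one_tensor  # factor 2: K and V


def kv_pool_bytes(max_seq_len: int, max_batch: int) -> int:
    """KV bytes needed to hold max_batch sequences at full length."""
    return kv_bytes_per_token() * max_seq_len * max_batch


if __name__ == "__main__":
    print(kv_bytes_per_token())            # 147456 bytes (144 KiB) per token
    print(kv_pool_bytes(8192, 8) / GIB)    # 9.0 GiB under these assumptions
```

Under these assumptions a worst-case pre-allocation scales linearly in both `max_seq_len` and batch size, which is why sizing the pool from free VRAM is the more robust default.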

## Future extensions

- Add quant runs (Q8_0, Q4_K_M) as separate "system" columns
- Wire to GitHub Actions for nightly regression
- Track results across commits to flag regressions (per-commit JSON in
  `docs/benchmarks/history/`)
- Add MMLU-Pro / HumanEval when budget allows
- Long-context benchmark (8K, 32K prompts) to compare prefill scaling