Files
xserv/docs/16-llama-cpp-comparison.md
Gahow Wang 49c7653222 tools: add llama.cpp comparison baseline + standard benchmark suite
Vendor llama.cpp as a submodule pinned to b9371 and add a one-click
benchmark driver that compares xserv against it on identical workloads:

- setup-llama-cpp.sh: network-optional CUDA build (SM120); convert-to-gguf.sh
  converts the same safetensors to BF16 GGUF for an apples-to-apples baseline.
- tools/bench/: black-box OpenAI-API driver measuring TTFT/TPOT/throughput
  (single-stream + concurrent) and response quality on AIME 2025 + GSM8K.
- fetch_datasets.py pulls datasets to local JSON (GPU host has no network);
  task loaders prefer the local JSON.
- sync-and-build.sh: `bench` subcommand transfers source + datasets to the
  GPU host via tar-over-ssh (no rsync there), builds, and runs the suite.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-28 11:18:52 +08:00

6.6 KiB

Phase 16: llama.cpp Comparison Baseline

Goal. Replace HF transformers with llama.cpp as the standing performance baseline, and add a standard quality (response correctness) benchmark suite (AIME 2025, GSM8K). Provide a one-click entrypoint that runs both systems under identical workloads and emits a side-by-side report.

Motivation

xserv has cleared 140% of HF transformers throughput on Qwen3-8B (Phase 15). HF is no longer a useful performance bar — it's a correctness baseline.

llama.cpp is the right next bar because:

  • It's a serious C++/CUDA inference engine with active optimization
  • Same OpenAI-compatible API → black-box, fair comparison
  • Same GGUF↔safetensors weight source (we convert BF16, no quantization shortcuts)
  • Used widely as a reference point in the community

We also need quality benchmarks so that performance improvements don't silently regress model quality (numerical precision, sampling, prompt formatting). AIME and GSM8K are the cheapest credible signals.

Architecture

xserv/
├── third_party/llama.cpp/         # cloned by setup-llama-cpp.sh
│   └── build/bin/llama-server     # CUDA build (SM120)
├── tools/
│   ├── setup-llama-cpp.sh         # clone + cmake build (idempotent)
│   ├── convert-to-gguf.sh         # safetensors → BF16 GGUF (same weights)
│   ├── sync-and-build.sh          # extended with `bench` subcommand
│   └── bench/                     # Python benchmark driver
│       ├── runner.py              # entrypoint
│       ├── servers.py             # subprocess lifecycle (start/stop both)
│       ├── client.py              # OpenAI streaming client + TTFT/TPOT
│       ├── speed.py               # speed suite
│       ├── quality.py             # quality suite
│       ├── tasks/{aime,gsm8k}.py  # dataset loaders + scorers
│       ├── report.py              # markdown + json output
│       └── requirements.txt       # httpx, datasets
└── bench-out/                     # report artifacts (gitignored)
    ├── comparison-<stamp>.md
    ├── comparison-<stamp>.json
    └── logs/{xserv,llama_cpp}.log

Both systems are treated as black-box HTTP servers speaking the OpenAI streaming chat API. No in-process integration, no shared Python bindings. This keeps the comparison fair (same protocol, same prompt-template path) and isolates the test harness from internal API churn on either side.

Workflow

local repo                            dash5 (GPU host)
──────────                            ────────────────
tools/sync-and-build.sh bench   →  rsync project (excl. target, third_party, bench-out)
                                   →  setup-llama-cpp.sh    (no-op if built)
                                   →  convert-to-gguf.sh    (no-op if .gguf exists)
                                   →  cargo build --release
                                   →  python3 -m tools.bench.runner ...
                                   →  bench-out/comparison-<stamp>.md
tools/sync-and-build.sh fetch-bench-out  ←  rsync bench-out back

What gets measured

Speed (TTFT / TPOT / throughput)

  • Single-stream, three prompt lengths (short / medium / long), cfg.speed_prompts repeats each
    • TTFT p50/p95, TPOT p50/p95, per-request throughput
  • Concurrent, fixed medium prompt, sweep concurrency ∈ {1, 2, 4, 8}
    • Aggregate tok/s, TTFT p95, error count
  • Both at temperature=0, max_tokens=128 by default.

Quality (response correctness)

Task N Source Scoring Why
AIME 2025 30 MathArena/aime_2025 (HF) exact-match boxed integer (0..999) reasoning + math, hard signal
GSM8K 1319 openai/gsm8k (HF), test split exact-match \boxed{n} or last number broad sanity, decimals allowed

Same temperature=0 sampling across both systems. Max tokens: 16384 for AIME (reasoning long), 2048 for GSM8K. Subsample with --quality-limit N for smoke.

Report

bench-out/comparison-<stamp>.md contains:

  • Environment (GPU, driver, xserv commit, python)
  • Speed table per scenario (xserv | llama.cpp | xserv÷llama.cpp speedup)
  • Quality table per task (n, correct, accuracy, mean tokens, TTFT, TPOT, wall)

A sibling .json holds all per-request raw rows and per-problem case detail (prediction, gold, response preview) so we can diff regressions in CI later.

Running it

Full sweep on dash5 (recommended):

./tools/sync-and-build.sh bench
./tools/sync-and-build.sh fetch-bench-out
open bench-out/comparison-*.md

Speed-only smoke (fast):

./tools/sync-and-build.sh bench -- --suite speed --speed-prompts 2

Quality smoke with 5 problems each:

./tools/sync-and-build.sh bench -- --suite quality --quality-limit 5

On a host that already has both servers running (e.g. local dev with two shells open):

python3 -m tools.bench.runner \
    --xserv-base-url http://127.0.0.1:8080 \
    --llama-base-url http://127.0.0.1:8081 \
    --suite all

Design choices

  1. Black-box HTTP, not FFI. Both engines bind the same OpenAI surface and real serving traffic uses HTTP. Anything that doesn't show up over the wire doesn't matter for serving.
  2. Same BF16 weights. We convert the same safetensors with llama.cpp's convert_hf_to_gguf.py --outtype bf16. No quantization at this stage; if we want a quant comparison later we'll add a separate column, not replace this one.
  3. Streaming everywhere. TTFT and TPOT only make sense with streaming. We ask both servers for stream=true with include_usage so we can read server-reported token counts when available.
  4. Idempotent setup. setup-llama-cpp.sh and convert-to-gguf.sh are safe to re-run — they no-op when the build / file already exists. The bench subcommand wires them so the first run does a full setup and subsequent runs are fast.
  5. Subprocess lifecycle owned by the driver. We spawn each server in its own process group and SIGTERM the group on exit so half-dead llama-server children don't survive. If the user is already running a server somewhere, pass --xserv-base-url / --llama-base-url to skip launch.

Future extensions

  • Add quant runs (Q8_0, Q4_K_M) as separate "system" columns
  • Wire to GitHub Actions for nightly regression
  • Track results across commits to flag regressions (per-commit JSON in docs/benchmarks/history/)
  • Add MMLU-Pro / HumanEval when budget allows
  • Long-context benchmark (8K, 32K prompts) to compare prefill scaling