Gahow Wang 7cb9ee3870 bench: run one server at a time, match thinking mode, fix tools package

Refinements from end-to-end bring-up on the GPU host:

- Run each system start→suites→stop in sequence. Two BF16 8B models don't
  co-reside on one 32GB GPU, and a resident idle engine would distort the
  other's latency/throughput.
- Match generation mode: xserv hardcodes Qwen3 thinking off, so send
  chat_template_kwargs={enable_thinking:false} to llama.cpp via a per-endpoint
  extra_body. --enable-thinking opts back into thinking mode.
- Add tools/__init__.py so `python3 -m tools.bench.runner` resolves our package
  instead of a site-packages `tools` (nvfuser ships one that shadowed it).
- Document offline-GPU-host workflow, thinking-match, and the xserv 8192 OOM
  finding that the bench surfaced.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

2026-05-28 11:40:07 +08:00


Phase 16: llama.cpp Comparison Baseline

Goal. Replace HF transformers with llama.cpp as the standing performance baseline, and add a standard quality (response correctness) benchmark suite (AIME 2025, GSM8K). Provide a one-click entrypoint that runs both systems under identical workloads and emits a side-by-side report.

Motivation

xserv has cleared 140% of HF transformers throughput on Qwen3-8B (Phase 15). HF is no longer a useful performance bar — it's a correctness baseline.

llama.cpp is the right next bar because:

- It's a serious C++/CUDA inference engine under active optimization.
- It exposes the same OpenAI-compatible API, so the comparison is black-box and fair.
- It runs the same weights: we convert the safetensors checkpoint to a BF16 GGUF, with no quantization shortcuts.
- It's widely used as a reference point in the community.

We also need quality benchmarks so that performance improvements don't silently regress model quality (numerical precision, sampling, prompt formatting). AIME and GSM8K are the cheapest credible signals.

Architecture

xserv/
├── third_party/llama.cpp/         # cloned by setup-llama-cpp.sh
│   └── build/bin/llama-server     # CUDA build (SM120)
├── tools/
│   ├── setup-llama-cpp.sh         # clone + cmake build (idempotent)
│   ├── convert-to-gguf.sh         # safetensors → BF16 GGUF (same weights)
│   ├── sync-and-build.sh          # extended with `bench` subcommand
│   └── bench/                     # Python benchmark driver
│       ├── runner.py              # entrypoint
│       ├── servers.py             # subprocess lifecycle (start/stop both)
│       ├── client.py              # OpenAI streaming client + TTFT/TPOT
│       ├── speed.py               # speed suite
│       ├── quality.py             # quality suite
│       ├── tasks/{aime,gsm8k}.py  # dataset loaders + scorers
│       ├── report.py              # markdown + json output
│       └── requirements.txt       # httpx, datasets
└── bench-out/                     # report artifacts (gitignored)
    ├── comparison-<stamp>.md
    ├── comparison-<stamp>.json
    └── logs/{xserv,llama_cpp}.log

Both systems are treated as black-box HTTP servers speaking the OpenAI streaming chat API. No in-process integration, no shared Python bindings. This keeps the comparison fair (same protocol, same prompt-template path) and isolates the test harness from internal API churn on either side.
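
To make the black-box contract concrete, here is a minimal sketch of consuming an OpenAI-style streaming reply as plain server-sent-event lines. The function names are hypothetical and not taken from client.py; the sketch relies only on the wire format itself (`data:` lines, the `[DONE]` terminator, `choices[].delta.content`, and an optional final `usage` chunk).

```python
import json


def iter_sse_chunks(lines):
    """Yield decoded JSON chunks from OpenAI-style SSE lines."""
    for raw in lines:
        line = raw.strip()
        if not line.startswith("data:"):
            continue  # skip blank separators and SSE comments
        payload = line[len("data:"):].strip()
        if payload == "[DONE]":
            return  # end-of-stream marker
        yield json.loads(payload)


def collect_stream(lines):
    """Return (generated_text, usage_or_None) for one streamed reply."""
    parts, usage = [], None
    for chunk in iter_sse_chunks(lines):
        if chunk.get("usage"):
            usage = chunk["usage"]  # server-reported token counts, if sent
        for choice in chunk.get("choices") or []:
            piece = (choice.get("delta") or {}).get("content")
            if piece:
                parts.append(piece)
    return "".join(parts), usage
```

Because nothing here depends on which engine produced the lines, the same parser can serve both systems.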

Workflow

The GPU host (dash5) has no outbound network and no rsync, so anything from the internet is fetched locally and shipped over via tar-over-ssh.

local repo (has network)              dash5 (GPU host, no network)
────────────────────────              ────────────────────────────
# one-time, on a networked machine:
python3 -m tools.bench.fetch_datasets  →  tools/bench/data/{aime2025,gsm8k}.json
git submodule update --init …          →  third_party/llama.cpp source

tools/sync-and-build.sh bench   →  tar project   (excl. target, third_party, bench-out)
                                →  tar llama.cpp source (excl. build, .git)
                                →  setup-llama-cpp.sh   (build-only; no-op if built)
                                →  convert-to-gguf.sh   (no-op if .gguf exists)
                                →  cargo build --release
                                →  python3 -m tools.bench.runner ...
                                →  bench-out/comparison-<stamp>.md
tools/sync-and-build.sh fetch-bench-out  ←  tar bench-out back

Behind a flaky proxy, fetch datasets through the HF mirror: HF_ENDPOINT=https://hf-mirror.com python3 -m tools.bench.fetch_datasets.

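tools/__init__.py exists so python3 -m tools.bench.runner resolves our package. Without it, our tools directory is only a namespace package, and Python prefers any regular top-level tools package it finds on sys.path. Some site-packages (e.g. nvfuser) ship exactly such a package, which would otherwise shadow ours.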
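
The shadowing can be reproduced in isolation. The sketch below builds two throwaway directories, one standing in for the repo and one for site-packages, and uses a made-up package name (`shadow_demo_pkg`) so it cannot collide with anything real. It shows that a regular package later on sys.path beats a namespace package earlier on it, and that adding `__init__.py` to the earlier one flips the result.

```python
import importlib
import sys
import tempfile
from pathlib import Path


def _fresh_import(name):
    """Import `name` from scratch, ignoring any previously cached module."""
    sys.modules.pop(name, None)
    importlib.invalidate_caches()
    return importlib.import_module(name)


def which_package_wins(name="shadow_demo_pkg"):
    """Return the import origin before and after the repo copy gets __init__.py."""
    with tempfile.TemporaryDirectory() as repo, tempfile.TemporaryDirectory() as site:
        (Path(repo) / name).mkdir()  # no __init__.py: namespace package
        (Path(site) / name).mkdir()
        (Path(site) / name / "__init__.py").write_text("ORIGIN = 'site-packages'\n")
        sys.path[:0] = [repo, site]  # repo first, site-packages second
        try:
            before = _fresh_import(name).ORIGIN
            (Path(repo) / name / "__init__.py").write_text("ORIGIN = 'repo'\n")
            after = _fresh_import(name).ORIGIN
        finally:
            sys.path.remove(repo)
            sys.path.remove(site)
            sys.modules.pop(name, None)
    return before, after
```

Calling `which_package_wins()` should report the site-packages copy first and the repo copy once `__init__.py` is present.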

What gets measured

Speed (TTFT / TPOT / throughput)

- Single-stream: three prompt lengths (short / medium / long), cfg.speed_prompts repeats each.
  - Reports TTFT p50/p95, TPOT p50/p95, per-request throughput.
- Concurrent: fixed medium prompt, sweep concurrency ∈ {1, 2, 4, 8}.
  - Reports aggregate tok/s, TTFT p95, error count.

Both run at temperature=0, max_tokens=128 by default.
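
As a sketch of how these metrics can be derived from a stream, the helpers below take the request start time and the arrival time of each token. The definitions are assumptions, not read from speed.py: TTFT is the delay to the first token, TPOT is the mean gap between subsequent tokens, and percentiles use the nearest-rank method.

```python
import math


def ttft_and_tpot(request_start, token_times):
    """Return (ttft, tpot) in seconds from token arrival timestamps."""
    if not token_times:
        return None, None
    ttft = token_times[0] - request_start
    if len(token_times) < 2:
        return ttft, None  # TPOT is undefined for a single token
    tpot = (token_times[-1] - token_times[0]) / (len(token_times) - 1)
    return ttft, tpot


def percentile(values, q):
    """Nearest-rank percentile for q in (0, 100]."""
    if not values:
        raise ValueError("percentile of empty sequence")
    ordered = sorted(values)
    rank = max(1, math.ceil(q / 100 * len(ordered)))
    return ordered[rank - 1]
```

With only a handful of repeats per prompt length, a nearest-rank p95 is effectively the maximum, so small `--speed-prompts` values are best read as smoke signals rather than stable tail latencies.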

Quality (response correctness)

| Task | N | Source | Scoring | Why |
|---|---|---|---|---|
| AIME 2025 | 30 | `MathArena/aime_2025`, fallback `yentinglin/aime_2025` (HF) | exact-match boxed integer (0..999) | reasoning + math, hard signal |
| GSM8K | 1319 | `openai/gsm8k` (HF), `test` split | exact-match `\boxed{n}` or last number | broad sanity, decimals allowed |

Both systems use the same temperature=0 sampling. Max tokens: 16384 for AIME (long reasoning chains), 2048 for GSM8K. Subsample with --quality-limit N for smoke runs.
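
The scoring rules in the table can be sketched as follows. This is an illustration under assumptions, not the code in tasks/{aime,gsm8k}.py: the regular expressions, the comma stripping, and the function names are all hypothetical. The last-number fallback applies to GSM8K only; for AIME the table requires a boxed integer in 0..999.

```python
import re
from decimal import Decimal, InvalidOperation

_BOXED = re.compile(r"\\boxed\{([^{}]*)\}")
_NUMBER = re.compile(r"-?\d[\d,]*(?:\.\d+)?")


def extract_answer(response):
    """Prefer the last boxed value; otherwise use the last number in the text."""
    boxed = _BOXED.findall(response)
    haystack = boxed[-1] if boxed else response
    numbers = _NUMBER.findall(haystack)
    return numbers[-1].replace(",", "") if numbers else None


def is_correct(prediction, gold):
    """Exact numeric match, so '18' and '18.0' compare equal."""
    if prediction is None:
        return False
    try:
        return Decimal(prediction) == Decimal(gold)
    except InvalidOperation:
        return False


def is_valid_aime(prediction):
    """AIME answers are integers in 0..999."""
    return prediction is not None and prediction.isdigit() and int(prediction) <= 999
```

Comparing through `Decimal` rather than string equality is one way to honor "decimals allowed" without penalizing a trailing `.0`.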

Generation mode must match. xserv's prompt builder hardcodes Qwen3 thinking OFF (it appends an empty <think></think> block). llama-server applies the GGUF's Qwen3 jinja template, which has thinking ON by default. The driver therefore sends chat_template_kwargs={"enable_thinking": false} to llama.cpp so both engines run the model in the same mode. Pass --enable-thinking to compare in thinking mode instead (xserv would need a matching change first).
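
A minimal sketch of the per-endpoint request body, assuming the two systems are keyed as `xserv` and `llama_cpp` (the names used for the log files). The function name and the `model` value are hypothetical; `chat_template_kwargs` and `enable_thinking` are the fields named above.

```python
def build_chat_payload(system, messages, *, enable_thinking=False,
                       max_tokens=128, model="qwen3-8b"):
    """Build the JSON body for one streaming chat request."""
    payload = {
        "model": model,  # hypothetical model name
        "messages": messages,
        "temperature": 0,
        "max_tokens": max_tokens,
        "stream": True,
        "stream_options": {"include_usage": True},
    }
    extra_body = {}
    if system == "llama_cpp" and not enable_thinking:
        # The GGUF's template defaults to thinking ON, so turn it off explicitly.
        extra_body["chat_template_kwargs"] = {"enable_thinking": False}
    payload.update(extra_body)
    return payload
```

Keeping the override in a separate `extra_body` dict means the shared fields stay byte-identical across both engines, which makes any difference in the request easy to audit.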

Report

bench-out/comparison-<stamp>.md contains:

- Environment (GPU, driver, xserv commit, python)
- Speed table per scenario (xserv | llama.cpp | xserv÷llama.cpp speedup)
- Quality table per task (n, correct, accuracy, mean tokens, TTFT, TPOT, wall)

A sibling .json holds all per-request raw rows and per-problem case detail (prediction, gold, response preview) so we can diff regressions in CI later.
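
For illustration, one row of the speed table could be rendered as below. The helper names and number formatting are assumptions, not taken from report.py. Note that the xserv÷llama.cpp ratio reads as a speedup only for higher-is-better metrics such as tok/s; for latencies like TTFT, a ratio below 1 is the good direction.

```python
def ratio(xserv_value, llama_value):
    """xserv divided by llama.cpp, or None when the ratio is undefined."""
    if xserv_value is None or not llama_value:
        return None
    return xserv_value / llama_value


def _fmt(value):
    return "n/a" if value is None else f"{value:.1f}"


def speed_row(metric, xserv_value, llama_value):
    """Render one markdown table row: metric | xserv | llama.cpp | ratio."""
    r = ratio(xserv_value, llama_value)
    shown = "n/a" if r is None else f"{r:.2f}x"
    return f"| {metric} | {_fmt(xserv_value)} | {_fmt(llama_value)} | {shown} |"
```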

Running it

One-time prerequisites (on a networked machine):

git submodule update --init third_party/llama.cpp     # pinned to b9371
HF_ENDPOINT=https://hf-mirror.com python3 -m tools.bench.fetch_datasets

Full sweep on dash5 (recommended):

# 4096 ctx because xserv OOMs at 8192 (see Known constraints)
./tools/sync-and-build.sh bench -- --max-seq-len 4096 --quality-limit 50
./tools/sync-and-build.sh fetch-bench-out
open bench-out/comparison-*.md

Speed-only smoke (fast):

./tools/sync-and-build.sh bench -- --suite speed --speed-prompts 2

Quality smoke with 5 problems each:

./tools/sync-and-build.sh bench -- --suite quality --quality-limit 5

On a host that already has both servers running (e.g. local dev with two shells open):

python3 -m tools.bench.runner \
    --xserv-base-url http://127.0.0.1:8080 \
    --llama-base-url http://127.0.0.1:8081 \
    --suite all
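
The flags used in the commands above could be declared along these lines. This is a sketch, not the parser in runner.py: the flag names come from this document, but every default value is an assumption, and `systems_to_launch` is a hypothetical helper illustrating the "skip launch when a base URL is given" rule.

```python
import argparse


def build_parser():
    """Declare the runner flags mentioned in this document."""
    p = argparse.ArgumentParser(prog="tools.bench.runner")
    p.add_argument("--suite", choices=["speed", "quality", "all"], default="all")
    p.add_argument("--xserv-base-url", default=None)
    p.add_argument("--llama-base-url", default=None)
    p.add_argument("--max-seq-len", type=int, default=4096)    # assumed default
    p.add_argument("--speed-prompts", type=int, default=5)     # assumed default
    p.add_argument("--quality-limit", type=int, default=None)  # None = full set
    p.add_argument("--enable-thinking", action="store_true")
    return p


def systems_to_launch(args):
    """Only spawn servers for which no external base URL was supplied."""
    candidates = (("xserv", args.xserv_base_url), ("llama_cpp", args.llama_base_url))
    return [name for name, url in candidates if url is None]
```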

Design choices

- Black-box HTTP, not FFI. Both engines bind the same OpenAI surface and real serving traffic uses HTTP. Anything that doesn't show up over the wire doesn't matter for serving.
- Same BF16 weights. We convert the same safetensors with llama.cpp's convert_hf_to_gguf.py --outtype bf16. No quantization at this stage; if we want a quant comparison later we'll add a separate column, not replace this one.
- Streaming everywhere. TTFT and TPOT only make sense with streaming. We ask both servers for stream=true with include_usage so we can read server-reported token counts when available.
- Idempotent setup. setup-llama-cpp.sh and convert-to-gguf.sh are safe to re-run: they no-op when the build / file already exists. The bench subcommand wires them so the first run does a full setup and subsequent runs are fast.
- Subprocess lifecycle owned by the driver. We spawn each server in its own process group and SIGTERM the group on exit so half-dead llama-server children don't survive. If the user is already running a server somewhere, pass --xserv-base-url / --llama-base-url to skip launch.
- One server at a time. The driver starts a system, runs every suite against it, stops it, then moves to the next. Two BF16 8B models (~16GB each) do not co-reside on a single 32GB GPU, and a resident idle engine would distort the other's latency/throughput. This serialization is why the report is assembled from per-system passes rather than a single interleaved run.
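
The last two choices (process-group lifecycle and one server at a time) can be sketched together. This is a POSIX-only illustration, not the code in servers.py: the names, the grace period, and the SIGKILL escalation are assumptions, and a real driver would also wait for the server to become ready before running suites.

```python
import os
import signal
import subprocess
from contextlib import contextmanager


def _signal_group(proc, sig):
    """Signal the whole process group led by `proc`, ignoring 'already gone'."""
    try:
        os.killpg(proc.pid, sig)  # start_new_session makes pid == pgid
    except (ProcessLookupError, PermissionError):
        pass


@contextmanager
def managed_server(cmd, grace_seconds=10.0):
    """Run `cmd` in its own process group and tear the group down on exit."""
    proc = subprocess.Popen(cmd, start_new_session=True)
    try:
        yield proc
    finally:
        _signal_group(proc, signal.SIGTERM)
        try:
            proc.wait(timeout=grace_seconds)
        except subprocess.TimeoutExpired:
            _signal_group(proc, signal.SIGKILL)  # assumed escalation
            proc.wait()


def run_sequentially(commands, run_suites):
    """Start, benchmark, and stop each system before touching the next."""
    results = {}
    for name, cmd in commands.items():
        with managed_server(cmd):
            results[name] = run_suites(name)
    return results
```

Because each `with` block exits before the next system starts, at most one engine holds GPU memory at any time.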

Known constraints / findings

- xserv OOMs at --max-seq-len 8192 + --max-batch 4. xserv eagerly pre-allocates its paged-KV pool (total_blocks = blocks_per_seq · max_batch · 2, ≈9GB at 8192) on top of the 16GB weights, exceeding 32GB at startup (paged_kv_cache.rs alloc paged K pool: OutOfMemory). llama.cpp allocates KV lazily and fits 8192 easily. Until xserv's sizing is fixed, run the comparison at --max-seq-len 4096 (xserv peaks ~28GB there). The benchmark surfaced this; it's tracked as a follow-up fix.
- When the xserv engine thread dies, the request handler panics on the poisoned engine_sender mutex and every subsequent request fails with "server disconnected". The driver records these as per-request errors (no crash), so a broken engine shows up as errs=N / accuracy 0% rather than a hung run.
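
The ≈9GB pool figure can be checked with back-of-the-envelope arithmetic. The sketch below assumes Qwen3-8B's attention shape (36 layers, 8 KV heads, head dimension 128), BF16 storage, and a block size of 16 tokens; none of these values are stated in this document, and the function names are hypothetical. Only the `· 2` factor in total_blocks comes from the formula quoted above.

```python
def kv_bytes_per_token(num_layers=36, num_kv_heads=8, head_dim=128, dtype_bytes=2):
    """Bytes of K plus V cached per token (assumed Qwen3-8B shape, BF16)."""
    return 2 * num_layers * num_kv_heads * head_dim * dtype_bytes


def paged_kv_pool_bytes(max_seq_len, max_batch, block_size=16):
    """Size of an eagerly allocated pool using the formula quoted above."""
    blocks_per_seq = -(-max_seq_len // block_size)  # ceiling division
    total_blocks = blocks_per_seq * max_batch * 2   # factor 2 as in the doc
    return total_blocks * block_size * kv_bytes_per_token()


GIB = 2 ** 30
pool_8192 = paged_kv_pool_bytes(8192, 4) / GIB  # 9.0
pool_4096 = paged_kv_pool_bytes(4096, 4) / GIB  # 4.5
```

Under these assumptions the pool is 9 GiB at 8192 and 4.5 GiB at 4096. This reproduces only the pool figure; it does not model the rest of xserv's footprint, so it should not be read as a full explanation of the 32GB overrun.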

Future extensions

- Add quant runs (Q8_0, Q4_K_M) as separate "system" columns
- Wire to GitHub Actions for nightly regression
- Track results across commits to flag regressions (per-commit JSON in docs/benchmarks/history/)
- Add MMLU-Pro / HumanEval when budget allows
- Long-context benchmark (8K, 32K prompts) to compare prefill scaling
