bench: run one server at a time, match thinking mode, fix tools package

Refinements from end-to-end bring-up on the GPU host:

- Run each system start→suites→stop in sequence. Two BF16 8B models don't
  co-reside on one 32GB GPU, and a resident idle engine would distort the
  other's latency/throughput.
- Match generation mode: xserv hardcodes Qwen3 thinking off, so send
  chat_template_kwargs={enable_thinking:false} to llama.cpp via a per-endpoint
  extra_body. --enable-thinking opts back into thinking mode.
- Add tools/__init__.py so `python3 -m tools.bench.runner` resolves our package
  instead of a site-packages `tools` (nvfuser ships one that shadowed it).
- Document offline-GPU-host workflow, thinking-match, and the xserv 8192 OOM
  finding that the bench surfaced.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
This commit is contained in:
2026-05-28 11:40:07 +08:00
parent 49c7653222
commit 7cb9ee3870
7 changed files with 102 additions and 28 deletions

View File

@@ -52,18 +52,33 @@ isolates the test harness from internal API churn on either side.
## Workflow
The GPU host (dash5) has **no outbound network and no rsync**, so anything from
the internet is fetched locally and shipped over via tar-over-ssh.
```
local repo dash5 (GPU host)
────────── ────────────────
tools/sync-and-build.sh bench → rsync project (excl. target, third_party, bench-out)
→ setup-llama-cpp.sh (no-op if built)
→ convert-to-gguf.sh (no-op if .gguf exists)
→ cargo build --release
→ python3 -m tools.bench.runner ...
→ bench-out/comparison-<stamp>.md
tools/sync-and-build.sh fetch-bench-out ← rsync bench-out back
local repo (has network) dash5 (GPU host, no network)
──────────────────────── ────────────────────────────
# one-time, on a networked machine:
python3 -m tools.bench.fetch_datasets tools/bench/data/{aime2025,gsm8k}.json
git submodule update --init … → third_party/llama.cpp source
tools/sync-and-build.sh bench tar project (excl. target, third_party, bench-out)
→ tar llama.cpp source (excl. build, .git)
→ setup-llama-cpp.sh (build-only; no-op if built)
→ convert-to-gguf.sh (no-op if .gguf exists)
→ cargo build --release
→ python3 -m tools.bench.runner ...
→ bench-out/comparison-<stamp>.md
tools/sync-and-build.sh fetch-bench-out ← tar bench-out back
```
Behind a flaky proxy, fetch datasets through the HF mirror:
`HF_ENDPOINT=https://hf-mirror.com python3 -m tools.bench.fetch_datasets`.
`tools/__init__.py` exists so `python3 -m tools.bench.runner` resolves our
package: some site-packages (e.g. nvfuser) ship a regular top-level `tools`
package that would otherwise shadow a namespace `tools`.
## What gets measured
### Speed (TTFT / TPOT / throughput)
@@ -78,12 +93,19 @@ tools/sync-and-build.sh fetch-bench-out ← rsync bench-out back
| Task | N | Source | Scoring | Why |
|---|---|---|---|---|
| AIME 2025 | 30 | `MathArena/aime_2025` (HF) | exact-match boxed integer (0..999) | reasoning + math, hard signal |
| AIME 2025 | 30 | `MathArena/aime_2025`, fallback `yentinglin/aime_2025` (HF) | exact-match boxed integer (0..999) | reasoning + math, hard signal |
| GSM8K | 1319 | `openai/gsm8k` (HF), `test` split | exact-match `\boxed{n}` or last number | broad sanity, decimals allowed |
Same `temperature=0` sampling across both systems. Max tokens: 16384 for AIME
(reasoning long), 2048 for GSM8K. Subsample with `--quality-limit N` for smoke.
**Generation mode must match.** xserv's prompt builder hardcodes Qwen3 thinking
OFF (it appends an empty `<think></think>` block). llama-server applies the
GGUF's Qwen3 jinja template, which has thinking ON by default. The driver
therefore sends `chat_template_kwargs={"enable_thinking": false}` to llama.cpp
so both engines run the model in the same mode. Pass `--enable-thinking` to
compare in thinking mode instead (xserv would need a matching change first).
### Report
`bench-out/comparison-<stamp>.md` contains:
@@ -96,9 +118,16 @@ A sibling `.json` holds all per-request raw rows and per-problem case detail
## Running it
**One-time prerequisites (on a networked machine):**
```bash
git submodule update --init third_party/llama.cpp # pinned to b9371
HF_ENDPOINT=https://hf-mirror.com python3 -m tools.bench.fetch_datasets
```
**Full sweep on dash5 (recommended):**
```bash
./tools/sync-and-build.sh bench
# 4096 ctx because xserv OOMs at 8192 (see Known constraints)
./tools/sync-and-build.sh bench -- --max-seq-len 4096 --quality-limit 50
./tools/sync-and-build.sh fetch-bench-out
open bench-out/comparison-*.md
```
@@ -142,6 +171,25 @@ python3 -m tools.bench.runner \
own process group and SIGTERM the group on exit so half-dead llama-server
children don't survive. If the user is already running a server somewhere,
pass `--xserv-base-url` / `--llama-base-url` to skip launch.
6. **One server at a time.** The driver starts a system, runs every suite
against it, stops it, then moves to the next. Two BF16 8B models (~16GB each)
do not co-reside on a single 32GB GPU, and a resident idle engine would
distort the other's latency/throughput. This serialization is why the report
is assembled from per-system passes rather than a single interleaved run.
## Known constraints / findings
- **xserv OOMs at `--max-seq-len 8192` + `--max-batch 4`.** xserv eagerly
pre-allocates its paged-KV pool (`total_blocks = blocks_per_seq · max_batch ·
2`, ≈9GB at 8192) on top of the 16GB weights, exceeding 32GB at startup
(`paged_kv_cache.rs` `alloc paged K pool: OutOfMemory`). llama.cpp allocates
KV lazily and fits 8192 easily. Until xserv's sizing is fixed, run the
comparison at `--max-seq-len 4096` (xserv peaks ~28GB there). The benchmark
surfaced this — it's tracked as a follow-up fix.
- When the xserv engine thread dies, the request handler panics on the poisoned
`engine_sender` mutex and every subsequent request fails with "server
disconnected". The driver records these as per-request errors (no crash), so a
broken engine shows up as `errs=N` / `accuracy 0%` rather than a hung run.
## Future extensions