bench: run one server at a time, match thinking mode, fix tools package
Refinements from end-to-end bring-up on the GPU host:
- Run each system start→suites→stop in sequence. Two BF16 8B models don't
co-reside on one 32GB GPU, and a resident idle engine would distort the
other's latency/throughput.
- Match generation mode: xserv hardcodes Qwen3 thinking off, so send
chat_template_kwargs={enable_thinking:false} to llama.cpp via a per-endpoint
extra_body. --enable-thinking opts back into thinking mode.
- Add tools/__init__.py so `python3 -m tools.bench.runner` resolves our package
instead of a site-packages `tools` (nvfuser ships one that shadowed it).
- Document offline-GPU-host workflow, thinking-match, and the xserv 8192 OOM
finding that the bench surfaced.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
This commit is contained in:
@@ -52,18 +52,33 @@ isolates the test harness from internal API churn on either side.
|
||||
|
||||
## Workflow
|
||||
|
||||
The GPU host (dash5) has **no outbound network and no rsync**, so anything from
|
||||
the internet is fetched locally and shipped over via tar-over-ssh.
|
||||
|
||||
```
|
||||
local repo dash5 (GPU host)
|
||||
────────── ────────────────
|
||||
tools/sync-and-build.sh bench → rsync project (excl. target, third_party, bench-out)
|
||||
→ setup-llama-cpp.sh (no-op if built)
|
||||
→ convert-to-gguf.sh (no-op if .gguf exists)
|
||||
→ cargo build --release
|
||||
→ python3 -m tools.bench.runner ...
|
||||
→ bench-out/comparison-<stamp>.md
|
||||
tools/sync-and-build.sh fetch-bench-out ← rsync bench-out back
|
||||
local repo (has network) dash5 (GPU host, no network)
|
||||
──────────────────────── ────────────────────────────
|
||||
# one-time, on a networked machine:
|
||||
python3 -m tools.bench.fetch_datasets → tools/bench/data/{aime2025,gsm8k}.json
|
||||
git submodule update --init … → third_party/llama.cpp source
|
||||
|
||||
tools/sync-and-build.sh bench → tar project (excl. target, third_party, bench-out)
|
||||
→ tar llama.cpp source (excl. build, .git)
|
||||
→ setup-llama-cpp.sh (build-only; no-op if built)
|
||||
→ convert-to-gguf.sh (no-op if .gguf exists)
|
||||
→ cargo build --release
|
||||
→ python3 -m tools.bench.runner ...
|
||||
→ bench-out/comparison-<stamp>.md
|
||||
tools/sync-and-build.sh fetch-bench-out ← tar bench-out back
|
||||
```
|
||||
|
||||
Behind a flaky proxy, fetch datasets through the HF mirror:
|
||||
`HF_ENDPOINT=https://hf-mirror.com python3 -m tools.bench.fetch_datasets`.
|
||||
|
||||
`tools/__init__.py` exists so `python3 -m tools.bench.runner` resolves our
|
||||
package: some site-packages (e.g. nvfuser) ship a regular top-level `tools`
|
||||
package that would otherwise shadow a namespace `tools`.
|
||||
|
||||
## What gets measured
|
||||
|
||||
### Speed (TTFT / TPOT / throughput)
|
||||
@@ -78,12 +93,19 @@ tools/sync-and-build.sh fetch-bench-out ← rsync bench-out back
|
||||
|
||||
| Task | N | Source | Scoring | Why |
|
||||
|---|---|---|---|---|
|
||||
| AIME 2025 | 30 | `MathArena/aime_2025` (HF) | exact-match boxed integer (0..999) | reasoning + math, hard signal |
|
||||
| AIME 2025 | 30 | `MathArena/aime_2025`, fallback `yentinglin/aime_2025` (HF) | exact-match boxed integer (0..999) | reasoning + math, hard signal |
|
||||
| GSM8K | 1319 | `openai/gsm8k` (HF), `test` split | exact-match `\boxed{n}` or last number | broad sanity, decimals allowed |
|
||||
|
||||
Same `temperature=0` sampling across both systems. Max tokens: 16384 for AIME
|
||||
(reasoning long), 2048 for GSM8K. Subsample with `--quality-limit N` for smoke.
|
||||
|
||||
**Generation mode must match.** xserv's prompt builder hardcodes Qwen3 thinking
|
||||
OFF (it appends an empty `<think></think>` block). llama-server applies the
|
||||
GGUF's Qwen3 jinja template, which has thinking ON by default. The driver
|
||||
therefore sends `chat_template_kwargs={"enable_thinking": false}` to llama.cpp
|
||||
so both engines run the model in the same mode. Pass `--enable-thinking` to
|
||||
compare in thinking mode instead (xserv would need a matching change first).
|
||||
|
||||
### Report
|
||||
|
||||
`bench-out/comparison-<stamp>.md` contains:
|
||||
@@ -96,9 +118,16 @@ A sibling `.json` holds all per-request raw rows and per-problem case detail
|
||||
|
||||
## Running it
|
||||
|
||||
**One-time prerequisites (on a networked machine):**
|
||||
```bash
|
||||
git submodule update --init third_party/llama.cpp # pinned to b9371
|
||||
HF_ENDPOINT=https://hf-mirror.com python3 -m tools.bench.fetch_datasets
|
||||
```
|
||||
|
||||
**Full sweep on dash5 (recommended):**
|
||||
```bash
|
||||
./tools/sync-and-build.sh bench
|
||||
# 4096 ctx because xserv OOMs at 8192 (see Known constraints)
|
||||
./tools/sync-and-build.sh bench -- --max-seq-len 4096 --quality-limit 50
|
||||
./tools/sync-and-build.sh fetch-bench-out
|
||||
open bench-out/comparison-*.md
|
||||
```
|
||||
@@ -142,6 +171,25 @@ python3 -m tools.bench.runner \
|
||||
own process group and SIGTERM the group on exit so half-dead llama-server
|
||||
children don't survive. If the user is already running a server somewhere,
|
||||
pass `--xserv-base-url` / `--llama-base-url` to skip launch.
|
||||
6. **One server at a time.** The driver starts a system, runs every suite
|
||||
against it, stops it, then moves to the next. Two BF16 8B models (~16GB each)
|
||||
do not co-reside on a single 32GB GPU, and a resident idle engine would
|
||||
distort the other's latency/throughput. This serialization is why the report
|
||||
is assembled from per-system passes rather than a single interleaved run.
|
||||
|
||||
## Known constraints / findings
|
||||
|
||||
- **xserv OOMs at `--max-seq-len 8192` + `--max-batch 4`.** xserv eagerly
|
||||
pre-allocates its paged-KV pool (`total_blocks = blocks_per_seq · max_batch ·
|
||||
2`, ≈9GB at 8192) on top of the 16GB weights, exceeding 32GB at startup
|
||||
(`paged_kv_cache.rs` `alloc paged K pool: OutOfMemory`). llama.cpp allocates
|
||||
KV lazily and fits 8192 easily. Until xserv's sizing is fixed, run the
|
||||
comparison at `--max-seq-len 4096` (xserv peaks ~28GB there). The benchmark
|
||||
surfaced this — it's tracked as a follow-up fix.
|
||||
- When the xserv engine thread dies, the request handler panics on the poisoned
|
||||
`engine_sender` mutex and every subsequent request fails with "server
|
||||
disconnected". The driver records these as per-request errors (no crash), so a
|
||||
broken engine shows up as `errs=N` / `accuracy 0%` rather than a hung run.
|
||||
|
||||
## Future extensions
|
||||
|
||||
|
||||
Reference in New Issue
Block a user