bench: run one server at a time, match thinking mode, fix tools package

Refinements from end-to-end bring-up on the GPU host:

- Run each system start→suites→stop in sequence. Two BF16 8B models don't
  co-reside on one 32GB GPU, and a resident idle engine would distort the
  other's latency/throughput.
- Match generation mode: xserv hardcodes Qwen3 thinking off, so send
  chat_template_kwargs={enable_thinking:false} to llama.cpp via a per-endpoint
  extra_body. --enable-thinking opts back into thinking mode.
- Add tools/__init__.py so `python3 -m tools.bench.runner` resolves our package
  instead of a site-packages `tools` (nvfuser ships one that shadowed it).
- Document offline-GPU-host workflow, thinking-match, and the xserv 8192 OOM
  finding that the bench surfaced.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-28 11:40:07 +08:00
parent 49c7653222
commit 7cb9ee3870
7 changed files with 102 additions and 28 deletions
--- a/docs/16-llama-cpp-comparison.md
+++ b/docs/16-llama-cpp-comparison.md
@@ -52,18 +52,33 @@ isolates the test harness from internal API churn on either side.

 ## Workflow

+The GPU host (dash5) has **no outbound network and no rsync**, so anything from
+the internet is fetched locally and shipped over via tar-over-ssh.
+
 ```
-local repo                            dash5 (GPU host)
-──────────                            ────────────────
-tools/sync-and-build.sh bench   →  rsync project (excl. target, third_party, bench-out)
-                                   →  setup-llama-cpp.sh    (no-op if built)
-                                   →  convert-to-gguf.sh    (no-op if .gguf exists)
-                                   →  cargo build --release
-                                   →  python3 -m tools.bench.runner ...
-                                   →  bench-out/comparison-<stamp>.md
-tools/sync-and-build.sh fetch-bench-out  ←  rsync bench-out back
+local repo (has network)              dash5 (GPU host, no network)
+────────────────────────              ────────────────────────────
+# one-time, on a networked machine:
+python3 -m tools.bench.fetch_datasets  →  tools/bench/data/{aime2025,gsm8k}.json
+git submodule update --init …          →  third_party/llama.cpp source
+
+tools/sync-and-build.sh bench   →  tar project   (excl. target, third_party, bench-out)
+                                →  tar llama.cpp source (excl. build, .git)
+                                →  setup-llama-cpp.sh   (build-only; no-op if built)
+                                →  convert-to-gguf.sh   (no-op if .gguf exists)
+                                →  cargo build --release
+                                →  python3 -m tools.bench.runner ...
+                                →  bench-out/comparison-<stamp>.md
+tools/sync-and-build.sh fetch-bench-out  ←  tar bench-out back
 ```

+Behind a flaky proxy, fetch datasets through the HF mirror:
+`HF_ENDPOINT=https://hf-mirror.com python3 -m tools.bench.fetch_datasets`.
+
+`tools/__init__.py` exists so `python3 -m tools.bench.runner` resolves our
+package: some site-packages (e.g. nvfuser) ship a regular top-level `tools`
+package that would otherwise shadow a namespace `tools`.
+
 ## What gets measured

 ### Speed (TTFT / TPOT / throughput)
@@ -78,12 +93,19 @@ tools/sync-and-build.sh fetch-bench-out  ←  rsync bench-out back

 | Task | N | Source | Scoring | Why |
 |---|---|---|---|---|
-| AIME 2025 | 30 | `MathArena/aime_2025` (HF) | exact-match boxed integer (0..999) | reasoning + math, hard signal |
+| AIME 2025 | 30 | `MathArena/aime_2025`, fallback `yentinglin/aime_2025` (HF) | exact-match boxed integer (0..999) | reasoning + math, hard signal |
 | GSM8K | 1319 | `openai/gsm8k` (HF), `test` split | exact-match `\boxed{n}` or last number | broad sanity, decimals allowed |

 Same `temperature=0` sampling across both systems. Max tokens: 16384 for AIME
 (reasoning long), 2048 for GSM8K. Subsample with `--quality-limit N` for smoke.

+**Generation mode must match.** xserv's prompt builder hardcodes Qwen3 thinking
+OFF (it appends an empty `<think></think>` block). llama-server applies the
+GGUF's Qwen3 jinja template, which has thinking ON by default. The driver
+therefore sends `chat_template_kwargs={"enable_thinking": false}` to llama.cpp
+so both engines run the model in the same mode. Pass `--enable-thinking` to
+compare in thinking mode instead (xserv would need a matching change first).
+
 ### Report

 `bench-out/comparison-<stamp>.md` contains:
@@ -96,9 +118,16 @@ A sibling `.json` holds all per-request raw rows and per-problem case detail

 ## Running it

+**One-time prerequisites (on a networked machine):**
+```bash
+git submodule update --init third_party/llama.cpp     # pinned to b9371
+HF_ENDPOINT=https://hf-mirror.com python3 -m tools.bench.fetch_datasets
+```
+
 **Full sweep on dash5 (recommended):**
 ```bash
-./tools/sync-and-build.sh bench
+# 4096 ctx because xserv OOMs at 8192 (see Known constraints)
+./tools/sync-and-build.sh bench -- --max-seq-len 4096 --quality-limit 50
 ./tools/sync-and-build.sh fetch-bench-out
 open bench-out/comparison-*.md
 ```
@@ -142,6 +171,25 @@ python3 -m tools.bench.runner \
   own process group and SIGTERM the group on exit so half-dead llama-server
   children don't survive. If the user is already running a server somewhere,
   pass `--xserv-base-url` / `--llama-base-url` to skip launch.
+6. **One server at a time.** The driver starts a system, runs every suite
+   against it, stops it, then moves to the next. Two BF16 8B models (~16GB each)
+   do not co-reside on a single 32GB GPU, and a resident idle engine would
+   distort the other's latency/throughput. This serialization is why the report
+   is assembled from per-system passes rather than a single interleaved run.
+
+## Known constraints / findings
+
+- **xserv OOMs at `--max-seq-len 8192` + `--max-batch 4`.** xserv eagerly
+  pre-allocates its paged-KV pool (`total_blocks = blocks_per_seq · max_batch ·
+  2`, ≈9GB at 8192) on top of the 16GB weights, exceeding 32GB at startup
+  (`paged_kv_cache.rs` `alloc paged K pool: OutOfMemory`). llama.cpp allocates
+  KV lazily and fits 8192 easily. Until xserv's sizing is fixed, run the
+  comparison at `--max-seq-len 4096` (xserv peaks ~28GB there). The benchmark
+  surfaced this — it's tracked as a follow-up fix.
+- When the xserv engine thread dies, the request handler panics on the poisoned
+  `engine_sender` mutex and every subsequent request fails with "server
+  disconnected". The driver records these as per-request errors (no crash), so a
+  broken engine shows up as `errs=N` / `accuracy 0%` rather than a hung run.

 ## Future extensions
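
Note on the "Known constraints / findings" hunk above: the ≈9GB pool figure can be checked with a short back-of-envelope calculation. The sketch below is not xserv code. It assumes the model is Qwen3-8B with its published shape (36 layers, 8 KV heads, head dimension 128), that the cache is BF16, that K and V are pooled separately, and that the trailing `· 2` in `total_blocks` is a headroom factor on the block count. None of these are stated in the diff, and `KvShape` and `kv_pool_bytes` are hypothetical names used only for illustration.

```python
from dataclasses import dataclass

GIB = 1024 ** 3


@dataclass(frozen=True)
class KvShape:
    """Assumed Qwen3-8B KV-cache shape (an assumption, not taken from the diff)."""
    num_layers: int = 36
    num_kv_heads: int = 8
    head_dim: int = 128
    bytes_per_elem: int = 2  # BF16

    def bytes_per_token(self) -> int:
        # One K vector and one V vector per layer, each num_kv_heads * head_dim wide.
        return 2 * self.num_layers * self.num_kv_heads * self.head_dim * self.bytes_per_elem


def kv_pool_bytes(shape: KvShape, max_seq_len: int, max_batch: int, headroom: int = 2) -> int:
    """Bytes pre-allocated for total_blocks = blocks_per_seq * max_batch * headroom.

    blocks_per_seq * block_size equals max_seq_len when max_seq_len is a
    multiple of the block size, so the block size cancels out here.
    """
    return shape.bytes_per_token() * max_seq_len * max_batch * headroom


if __name__ == "__main__":
    shape = KvShape()
    for seq_len in (4096, 8192):
        pool = kv_pool_bytes(shape, max_seq_len=seq_len, max_batch=4)
        print(f"max_seq_len={seq_len}: KV pool = {pool / GIB:.1f} GiB")
```

Under these assumptions the 8192 case comes to exactly 9 GiB and the 4096 case to 4.5 GiB, which agrees with the ≈9GB quoted in the doc. The gap between weights plus pool and the observed peak would then be other allocations, which this sketch does not model.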