tools: add llama.cpp comparison baseline + standard benchmark suite

Vendor llama.cpp as a submodule pinned to b9371 and add a one-click benchmark
driver that compares xserv against it on identical workloads:

- setup-llama-cpp.sh: network-optional CUDA build (SM120); convert-to-gguf.sh
  converts the same safetensors to BF16 GGUF for an apples-to-apples baseline.
- tools/bench/: black-box OpenAI-API driver measuring TTFT/TPOT/throughput
  (single-stream + concurrent) and response quality on AIME 2025 + GSM8K.
- fetch_datasets.py pulls datasets to local JSON (GPU host has no network);
  task loaders prefer the local JSON.
- sync-and-build.sh: `bench` subcommand transfers source + datasets to the GPU
  host via tar-over-ssh (no rsync there), builds, and runs the suite.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-28 11:18:52 +08:00
parent 9bb5c5c328
commit 49c7653222
20 changed files with 1690 additions and 14 deletions
--- a/.gitignore
+++ b/.gitignore
@@ -7,3 +7,15 @@
 **/*.rs.bk
 .env
 *.npy
+
+# llama.cpp baseline (cloned/submoduled by tools/setup-llama-cpp.sh)
+/third_party/llama.cpp/build/
+/third_party/llama.cpp/models/
+*.gguf
+
+# Benchmark output + fetched datasets (transferred to GPU host, not committed)
+/bench-out/
+/tools/bench/data/
+/tools/bench/__pycache__/
+/tools/bench/**/__pycache__/
+
--- a/.gitmodules
+++ b/.gitmodules
@@ -0,0 +1,3 @@
+[submodule "third_party/llama.cpp"]
+	path = third_party/llama.cpp
+	url = https://github.com/ggerganov/llama.cpp
--- a/docs/16-llama-cpp-comparison.md
+++ b/docs/16-llama-cpp-comparison.md
@@ -0,0 +1,153 @@
+# Phase 16: llama.cpp Comparison Baseline
+
+> **Goal.** Replace HF transformers with **llama.cpp** as the standing
+> performance baseline, and add a standard quality (response correctness)
+> benchmark suite (AIME 2025, GSM8K). Provide a one-click entrypoint that runs
+> both systems under identical workloads and emits a side-by-side report.
+
+## Motivation
+
+xserv has cleared 140% of HF transformers throughput on Qwen3-8B (Phase 15).
+HF is no longer a useful performance bar — it's a *correctness* baseline.
+
+**llama.cpp** is the right next bar because:
+- It's a serious C++/CUDA inference engine with active optimization
+- Same OpenAI-compatible API → black-box, fair comparison
+- Same weight source: the same safetensors, converted to BF16 GGUF (no quantization shortcuts)
+- Used widely as a reference point in the community
+
+We also need **quality benchmarks** so that performance improvements don't
+silently regress model quality (numerical precision, sampling, prompt
+formatting). AIME and GSM8K are the cheapest credible signals.
+
+## Architecture
+
+```
+xserv/
+├── third_party/llama.cpp/         # cloned by setup-llama-cpp.sh
+│   └── build/bin/llama-server     # CUDA build (SM120)
+├── tools/
+│   ├── setup-llama-cpp.sh         # clone + cmake build (idempotent)
+│   ├── convert-to-gguf.sh         # safetensors → BF16 GGUF (same weights)
+│   ├── sync-and-build.sh          # extended with `bench` subcommand
+│   └── bench/                     # Python benchmark driver
+│       ├── runner.py              # entrypoint
+│       ├── servers.py             # subprocess lifecycle (start/stop both)
+│       ├── client.py              # OpenAI streaming client + TTFT/TPOT
+│       ├── speed.py               # speed suite
+│       ├── quality.py             # quality suite
+│       ├── tasks/{aime,gsm8k}.py  # dataset loaders + scorers
+│       ├── report.py              # markdown + json output
+│       └── requirements.txt       # httpx, datasets
+└── bench-out/                     # report artifacts (gitignored)
+    ├── comparison-<stamp>.md
+    ├── comparison-<stamp>.json
+    └── logs/{xserv,llama_cpp}.log
+```
+
+Both systems are treated as **black-box HTTP servers** speaking the OpenAI
+streaming chat API. No in-process integration, no shared Python bindings. This
+keeps the comparison fair (same protocol, same prompt-template path) and
+isolates the test harness from internal API churn on either side.
+
+## Workflow
+
+```
+local repo                            dash5 (GPU host)
+──────────                            ────────────────
+tools/sync-and-build.sh bench   →  tar-over-ssh project (excl. target, third_party, bench-out)
+                                   →  setup-llama-cpp.sh    (no-op if built)
+                                   →  convert-to-gguf.sh    (no-op if .gguf exists)
+                                   →  cargo build --release
+                                   →  python3 -m tools.bench.runner ...
+                                   →  bench-out/comparison-<stamp>.md
+tools/sync-and-build.sh fetch-bench-out  ←  copy bench-out back
+```
+
+## What gets measured
+
+### Speed (TTFT / TPOT / throughput)
+
+- **Single-stream**, three prompt lengths (short / medium / long), `cfg.speed_prompts` repeats each
+  - `TTFT p50/p95`, `TPOT p50/p95`, per-request throughput
+- **Concurrent**, fixed medium prompt, sweep `concurrency ∈ {1, 2, 4, 8}`
+  - Aggregate `tok/s`, `TTFT p95`, error count
+- Both at `temperature=0`, `max_tokens=128` by default.
+
+### Quality (response correctness)
+
+| Task | N | Source | Scoring | Why |
+|---|---|---|---|---|
+| AIME 2025 | 30 | `MathArena/aime_2025` (HF) | exact-match boxed integer (0..999) | reasoning + math, hard signal |
+| GSM8K | 1319 | `openai/gsm8k` (HF), `test` split | exact-match `\boxed{n}` or last number | broad sanity, decimals allowed |
+
+Same `temperature=0` sampling across both systems. Max tokens: 16384 for AIME
+(long reasoning traces), 2048 for GSM8K. Subsample with `--quality-limit N`
+for a smoke test.
+
+### Report
+
+`bench-out/comparison-<stamp>.md` contains:
+- Environment (GPU, driver, xserv commit, python)
+- Speed table per scenario (xserv | llama.cpp | speedup, oriented so that >1× means xserv is better)
+- Quality table per task (n, correct, accuracy, mean tokens, TTFT, TPOT, wall)
+
+A sibling `.json` holds all per-request raw rows and per-problem case detail
+(prediction, gold, response preview) so we can diff regressions in CI later.
+
+## Running it
+
+**Full sweep on dash5 (recommended):**
+```bash
+./tools/sync-and-build.sh bench
+./tools/sync-and-build.sh fetch-bench-out
+open bench-out/comparison-*.md
+```
+
+**Speed-only smoke (fast):**
+```bash
+./tools/sync-and-build.sh bench -- --suite speed --speed-prompts 2
+```
+
+**Quality smoke with 5 problems each:**
+```bash
+./tools/sync-and-build.sh bench -- --suite quality --quality-limit 5
+```
+
+**On a host that already has both servers running** (e.g. local dev with two
+shells open):
+```bash
+python3 -m tools.bench.runner \
+    --xserv-base-url http://127.0.0.1:8080 \
+    --llama-base-url http://127.0.0.1:8081 \
+    --suite all
+```
+
+## Design choices
+
+1. **Black-box HTTP, not FFI.** Both engines bind the same OpenAI surface and
+   real serving traffic uses HTTP. Anything that doesn't show up over the wire
+   doesn't matter for serving.
+2. **Same BF16 weights.** We convert the same safetensors with llama.cpp's
+   `convert_hf_to_gguf.py --outtype bf16`. No quantization at this stage; if we
+   want a quant comparison later we'll add a separate column, not replace this
+   one.
+3. **Streaming everywhere.** TTFT and TPOT only make sense with streaming. We
+   ask both servers for `stream=true` with `stream_options.include_usage` so
+   we can read server-reported token counts when available.
+4. **Idempotent setup.** `setup-llama-cpp.sh` and `convert-to-gguf.sh` are
+   safe to re-run — they no-op when the build / file already exists. The
+   `bench` subcommand wires them so the first run does a full setup and
+   subsequent runs are fast.
+5. **Subprocess lifecycle owned by the driver.** We spawn each server in its
+   own process group and SIGTERM the group on exit so half-dead llama-server
+   children don't survive. If the user is already running a server somewhere,
+   pass `--xserv-base-url` / `--llama-base-url` to skip launch.
+
+## Future extensions
+
+- Add quant runs (Q8_0, Q4_K_M) as separate "system" columns
+- Wire to GitHub Actions for nightly regression
+- Track results across commits to flag regressions (per-commit JSON in
+  `docs/benchmarks/history/`)
+- Add MMLU-Pro / HumanEval when budget allows
+- Long-context benchmark (8K, 32K prompts) to compare prefill scaling
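Note on design choice 5 in the doc above: `tools/bench/servers.py` is not shown in this part of the diff, so the following is only a minimal sketch of the process-group lifecycle that the doc describes. The helper names `start_in_own_group` and `stop_group` and the `grace_s` parameter are hypothetical, and the SIGKILL escalation is an assumption rather than something the doc states. It relies on POSIX behaviour: `start_new_session=True` makes the child the leader of a fresh process group, so its PID doubles as the group ID.

```python
import os
import signal
import subprocess
import sys


def start_in_own_group(cmd: list[str]) -> subprocess.Popen:
    """Launch cmd as the leader of a new session and process group (POSIX)."""
    return subprocess.Popen(cmd, start_new_session=True)


def _signal_group(pgid: int, sig: int) -> None:
    """Signal every process in the group, ignoring a group that is already gone."""
    try:
        os.killpg(pgid, sig)
    except (ProcessLookupError, PermissionError):
        pass


def stop_group(proc: subprocess.Popen, grace_s: float = 10.0) -> int:
    """SIGTERM the whole group; escalate to SIGKILL if it outlives grace_s."""
    _signal_group(proc.pid, signal.SIGTERM)
    try:
        return proc.wait(timeout=grace_s)
    except subprocess.TimeoutExpired:
        _signal_group(proc.pid, signal.SIGKILL)
        return proc.wait()


if __name__ == "__main__":
    # Stand-in for a server binary: a child that would otherwise sleep for a minute.
    child = start_in_own_group([sys.executable, "-c", "import time; time.sleep(60)"])
    print("exit code:", stop_group(child))
```

Signalling the group rather than the single PID is what keeps helper processes spawned by the server from surviving the benchmark run.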
--- a/third_party/llama.cpp
+++ b/third_party/llama.cpp
--- a/tools/bench/__init__.py
+++ b/tools/bench/__init__.py
--- a/tools/bench/client.py
+++ b/tools/bench/client.py
@@ -0,0 +1,154 @@
+"""HTTP client for OpenAI-compatible /v1/chat/completions.
+
+Records per-request: TTFT (time to first content token), TPOT (mean
+inter-token latency over the decode phase), and end-to-end throughput.
+
+We parse only as much of the OpenAI envelope as we need to get the deltas,
+finish_reason, and token counts.
+"""
+
+from __future__ import annotations
+
+import asyncio
+import json
+import time
+from dataclasses import dataclass, field
+from typing import Any
+
+import httpx
+
+
+@dataclass
+class StreamResult:
+    text: str = ""
+    completion_tokens: int = 0
+    prompt_tokens: int = 0
+    finish_reason: str | None = None
+    # Timings (seconds; -1 means not measured)
+    ttft_s: float = -1.0
+    e2e_s: float = -1.0
+    chunk_times: list[float] = field(default_factory=list)   # absolute monotonic times of content chunks
+    error: str | None = None
+
+    @property
+    def tpot_s(self) -> float:
+        """Mean gap between consecutive content chunks, in seconds.
+
+        This equals seconds per token only when each chunk carries one token.
+        """
+        if len(self.chunk_times) < 2:
+            return -1.0
+        deltas = [self.chunk_times[i] - self.chunk_times[i - 1] for i in range(1, len(self.chunk_times))]
+        return sum(deltas) / len(deltas)
+
+    @property
+    def throughput_tok_s(self) -> float:
+        if self.e2e_s <= 0 or self.completion_tokens <= 0:
+            return -1.0
+        return self.completion_tokens / self.e2e_s
+
+
+async def chat_stream(
+    client: httpx.AsyncClient,
+    base_url: str,
+    model: str,
+    messages: list[dict[str, str]],
+    *,
+    max_tokens: int,
+    temperature: float = 0.0,
+    api_key: str | None = None,
+    timeout: float = 1800.0,
+) -> StreamResult:
+    payload: dict[str, Any] = {
+        "model": model,
+        "messages": messages,
+        "max_tokens": max_tokens,
+        "temperature": temperature,
+        "stream": True,
+    }
+    # llama-server returns usage in the final stream chunk when this is set;
+    # xserv ignores unknown fields, so this is harmless there.
+    payload["stream_options"] = {"include_usage": True}
+
+    headers = {"Content-Type": "application/json"}
+    if api_key:
+        headers["Authorization"] = f"Bearer {api_key}"
+
+    url = base_url.rstrip("/") + "/v1/chat/completions"
+    res = StreamResult()
+    t_start = time.perf_counter()
+
+    try:
+        async with client.stream(
+            "POST", url, json=payload, headers=headers, timeout=timeout,
+        ) as resp:
+            if resp.status_code != 200:
+                body = await resp.aread()
+                res.error = f"HTTP {resp.status_code}: {body.decode(errors='replace')[:400]}"
+                res.e2e_s = time.perf_counter() - t_start
+                return res
+
+            async for line in resp.aiter_lines():
+                if not line or not line.startswith("data:"):
+                    continue
+                data = line[len("data:"):].strip()
+                if data == "[DONE]":
+                    break
+                try:
+                    chunk = json.loads(data)
+                except json.JSONDecodeError:
+                    continue
+
+                if "usage" in chunk and chunk["usage"]:
+                    usage = chunk["usage"]
+                    res.prompt_tokens = usage.get("prompt_tokens", res.prompt_tokens)
+                    res.completion_tokens = usage.get("completion_tokens", res.completion_tokens)
+
+                choices = chunk.get("choices") or []
+                if not choices:
+                    continue
+                choice = choices[0]
+                delta = choice.get("delta") or {}
+                content = delta.get("content")
+                if content:
+                    now = time.perf_counter()
+                    if res.ttft_s < 0:
+                        res.ttft_s = now - t_start
+                    res.chunk_times.append(now)
+                    res.text += content
+                if choice.get("finish_reason"):
+                    res.finish_reason = choice["finish_reason"]
+    except Exception as e:  # noqa: BLE001 — surface any failure to the report
+        res.error = f"{type(e).__name__}: {e}"
+
+    res.e2e_s = time.perf_counter() - t_start
+    # Fall back to chunk count when server doesn't report usage (xserv stream path).
+    if res.completion_tokens == 0:
+        res.completion_tokens = len(res.chunk_times)
+    return res
+
+
+async def chat_concurrent(
+    base_url: str,
+    model: str,
+    prompts: list[list[dict[str, str]]],
+    *,
+    max_tokens: int,
+    temperature: float = 0.0,
+    api_key: str | None = None,
+    timeout: float = 1800.0,
+    concurrency: int,
+) -> tuple[list[StreamResult], float]:
+    """Run every entry in `prompts`, keeping at most `concurrency` requests in
+    flight (semaphore-bounded, not lock-step waves). Returns per-request
+    results plus wall-clock elapsed time of the entire batch."""
+    sem = asyncio.Semaphore(concurrency)
+    limits = httpx.Limits(max_connections=concurrency * 2, max_keepalive_connections=concurrency)
+    async with httpx.AsyncClient(timeout=timeout, limits=limits) as client:
+        async def one(messages: list[dict[str, str]]) -> StreamResult:
+            async with sem:
+                return await chat_stream(
+                    client, base_url, model, messages,
+                    max_tokens=max_tokens, temperature=temperature,
+                    api_key=api_key, timeout=timeout,
+                )
+        t0 = time.perf_counter()
+        results = await asyncio.gather(*(one(p) for p in prompts))
+        wall = time.perf_counter() - t0
+    return results, wall
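Note on the timing math in `client.py`: the sketch below restates how TTFT and TPOT fall out of the chunk timestamps that `StreamResult` records. `ttft_and_tpot` is a hypothetical standalone helper, not a function from the driver. It uses the fact that the mean of consecutive gaps telescopes to `(last - first) / (n - 1)`, which gives the same value as averaging the per-chunk deltas in `tpot_s`, up to floating-point rounding.

```python
def ttft_and_tpot(t_start: float, chunk_times: list[float]) -> tuple[float, float]:
    """Return (TTFT, TPOT) in seconds; -1.0 marks a value that cannot be measured."""
    if not chunk_times:
        return -1.0, -1.0
    # TTFT: delay between sending the request and the first content chunk.
    ttft = chunk_times[0] - t_start
    if len(chunk_times) < 2:
        return ttft, -1.0
    # TPOT: the mean of consecutive gaps telescopes to (last - first) / (n - 1).
    tpot = (chunk_times[-1] - chunk_times[0]) / (len(chunk_times) - 1)
    return ttft, tpot


if __name__ == "__main__":
    # Request sent at t=10.0 s; four content chunks arrive 100 ms apart.
    ttft, tpot = ttft_and_tpot(10.0, [10.5, 10.6, 10.7, 10.8])
    print(f"TTFT={ttft * 1000:.0f} ms  TPOT={tpot * 1000:.0f} ms/chunk")
```

Because the timestamps are per content chunk, the TPOT figure is per chunk; it matches per-token latency only if a server emits one token per chunk.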
--- a/tools/bench/config.py
+++ b/tools/bench/config.py
@@ -0,0 +1,51 @@
+"""Defaults + CLI argument shapes for the benchmark driver.
+
+All paths default to the dash5 layout (/opt/wjh/...) because that's where the
+GPU lives — see docs/16-llama-cpp-comparison.md.
+"""
+
+from __future__ import annotations
+
+import os
+from dataclasses import dataclass, field
+
+
+# Names used in reports and as logical keys throughout the driver.
+SYSTEM_XSERV = "xserv"
+SYSTEM_LLAMA_CPP = "llama.cpp"
+DEFAULT_SYSTEMS = (SYSTEM_XSERV, SYSTEM_LLAMA_CPP)
+
+
+@dataclass
+class SystemEndpoint:
+    """How to reach (or how to start) one of the systems under test."""
+
+    name: str
+    base_url: str                  # http://host:port  (OpenAI-compatible root, no /v1)
+    model_id: str                  # what to put in the request body's "model" field
+    api_key: str | None = None     # llama-server doesn't need one; xserv ignores it
+    # Process supervision is optional — if base_url is already serving, we skip launch.
+    launch_cmd: list[str] | None = None
+    launch_env: dict[str, str] = field(default_factory=dict)
+    launch_cwd: str | None = None
+    health_path: str = "/health"
+    ready_timeout_s: float = 600.0   # cold loads of 8B BF16 take a while
+
+
+@dataclass
+class BenchConfig:
+    out_dir: str = "bench-out"
+    # Speed suite
+    speed_prompts: int = 8           # synthetic prompts per length bucket
+    speed_max_tokens: int = 128
+    speed_concurrency: tuple[int, ...] = (1, 2, 4, 8)
+    # Quality suite
+    quality_max_tokens_aime: int = 16384
+    quality_max_tokens_gsm8k: int = 2048
+    quality_limit: int | None = None   # subsample for smoke tests; None = all
+    quality_temperature: float = 0.0
+    request_timeout_s: float = 1800.0
+
+
+def env_default(key: str, fallback: str) -> str:
+    return os.environ.get(key, fallback)
--- a/tools/bench/fetch_datasets.py
+++ b/tools/bench/fetch_datasets.py
@@ -0,0 +1,40 @@
+"""Pre-fetch quality-benchmark datasets into local JSON.
+
+Run this on a machine WITH network (e.g. your laptop). The resulting
+tools/bench/data/*.json files are then shipped to the GPU host (which has no
+network) by the bench sync step.
+
+Usage:
+  python3 -m tools.bench.fetch_datasets               # all tasks
+  python3 -m tools.bench.fetch_datasets aime2025      # one task
+"""
+
+from __future__ import annotations
+
+import os
+import sys
+
+if __package__ in (None, ""):
+    sys.path.insert(0, os.path.dirname(os.path.dirname(os.path.dirname(os.path.abspath(__file__)))))
+
+from tools.bench.tasks import aime, gsm8k, save_local
+
+FETCHERS = {
+    "aime2025": aime.load_remote,
+    "gsm8k": gsm8k.load_remote,
+}
+
+
+def main() -> None:
+    wanted = sys.argv[1:] or list(FETCHERS)
+    for name in wanted:
+        if name not in FETCHERS:
+            raise SystemExit(f"unknown task: {name} (have: {', '.join(FETCHERS)})")
+        print(f"[fetch] {name} ...")
+        records = FETCHERS[name]()
+        path = save_local(name, records)
+        print(f"[fetch] {name}: {len(records)} records -> {path}")
+
+
+if __name__ == "__main__":
+    main()
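Note on the offline-dataset flow: the commit message says the task loaders prefer the local JSON written by `fetch_datasets.py`. The `tools/bench/tasks` package is not shown in this part of the diff, so the following is only a minimal sketch of that local-first pattern. `save_local_json`, `load_prefer_local`, and the `data_dir` parameter are hypothetical names and not the package's actual API.

```python
import json
import os
import tempfile
from typing import Callable

Record = dict[str, str]


def save_local_json(name: str, records: list[Record], data_dir: str) -> str:
    """Write records to <data_dir>/<name>.json and return the path."""
    os.makedirs(data_dir, exist_ok=True)
    path = os.path.join(data_dir, f"{name}.json")
    with open(path, "w", encoding="utf-8") as f:
        json.dump(records, f, ensure_ascii=False, indent=2)
    return path


def load_prefer_local(
    name: str, load_remote: Callable[[], list[Record]], data_dir: str,
) -> list[Record]:
    """Return the local JSON if present; otherwise fall back to the remote loader."""
    path = os.path.join(data_dir, f"{name}.json")
    if os.path.exists(path):
        with open(path, encoding="utf-8") as f:
            return json.load(f)
    return load_remote()


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        save_local_json("demo", [{"id": "1", "problem": "1+1?", "answer": "2"}], tmp)
        # The remote loader is never called here because the local file exists.
        print(load_prefer_local("demo", lambda: [], tmp))
```

Checking the local file first means a host without network access never reaches the remote loader once the JSON has been shipped.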
--- a/tools/bench/quality.py
+++ b/tools/bench/quality.py
@@ -0,0 +1,146 @@
+"""Quality suite — run dataset tasks against each system, score, report.
+
+Each task module exposes the same surface:
+    load() -> list[{id, problem, answer, source}]
+    make_messages(problem) -> list[dict]
+    extract_answer(text) -> str | None
+    score(pred, gold) -> bool
+
+Concurrency is fixed at 1 per system for quality runs. Mixing concurrent
+requests with quality scoring is fine (deterministic temperature=0) but the
+extra moving parts aren't worth it for the first iteration.
+"""
+
+from __future__ import annotations
+
+import asyncio
+import statistics
+import time
+from dataclasses import asdict, dataclass
+from typing import Any
+
+import httpx
+
+from .client import chat_stream
+from .config import BenchConfig, SystemEndpoint
+from .tasks import aime, gsm8k
+
+TASKS = {
+    "aime2025": (aime, "quality_max_tokens_aime"),
+    "gsm8k":    (gsm8k, "quality_max_tokens_gsm8k"),
+}
+
+
+@dataclass
+class QualityRow:
+    system: str
+    task: str
+    n_total: int
+    n_correct: int
+    n_errors: int
+    accuracy: float
+    mean_completion_tokens: float
+    mean_ttft_ms: float
+    mean_tpot_ms: float
+    wall_s: float
+
+
+@dataclass
+class QualityCase:
+    system: str
+    task: str
+    problem_id: str
+    gold: str
+    pred: str | None
+    correct: bool
+    completion_tokens: int
+    ttft_ms: float
+    tpot_ms: float
+    e2e_s: float
+    error: str | None
+    response_preview: str
+
+
+async def _run_one_task(
+    ep: SystemEndpoint, task_name: str, task_mod, max_tokens: int, cfg: BenchConfig,
+) -> tuple[QualityRow, list[QualityCase]]:
+    problems = task_mod.load()
+    if cfg.quality_limit is not None:
+        problems = problems[: cfg.quality_limit]
+    print(f"[quality] {ep.name} / {task_name}: {len(problems)} problems "
+          f"(max_tokens={max_tokens})")
+
+    cases: list[QualityCase] = []
+    t_wall = time.perf_counter()
+    async with httpx.AsyncClient(timeout=cfg.request_timeout_s) as client:
+        for prob in problems:
+            messages = task_mod.make_messages(prob["problem"])
+            r = await chat_stream(
+                client, ep.base_url, ep.model_id, messages,
+                max_tokens=max_tokens,
+                temperature=cfg.quality_temperature,
+                api_key=ep.api_key,
+                timeout=cfg.request_timeout_s,
+            )
+            pred = task_mod.extract_answer(r.text) if r.error is None else None
+            correct = task_mod.score(pred, prob["answer"]) if r.error is None else False
+            cases.append(QualityCase(
+                system=ep.name, task=task_name,
+                problem_id=prob["id"], gold=prob["answer"], pred=pred,
+                correct=correct, completion_tokens=r.completion_tokens,
+                ttft_ms=r.ttft_s * 1000 if r.ttft_s > 0 else -1.0,
+                tpot_ms=r.tpot_s * 1000 if r.tpot_s > 0 else -1.0,
+                e2e_s=r.e2e_s, error=r.error,
+                response_preview=(r.text or "")[:240].replace("\n", " "),
+            ))
+            mark = "✓" if correct else ("E" if r.error else "✗")
+            print(f"  [{mark}] {prob['id']:>4s}  gold={prob['answer']:>6s}  "
+                  f"pred={str(pred):>6s}  tok={r.completion_tokens:5d}  "
+                  f"{r.e2e_s:6.1f}s")
+    wall = time.perf_counter() - t_wall
+
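+    ok = [c for c in cases if c.error is None]
+    correct = sum(1 for c in cases if c.correct)
+    errors = sum(1 for c in cases if c.error)
+    # Filter before averaging: statistics.mean raises StatisticsError on an
+    # empty sequence, which happens when no successful case has a valid timing
+    # (for example, every response arrived as fewer than two content chunks).
+    ttfts = [c.ttft_ms for c in ok if c.ttft_ms > 0]
+    tpots = [c.tpot_ms for c in ok if c.tpot_ms > 0]
+    row = QualityRow(
+        system=ep.name,
+        task=task_name,
+        n_total=len(cases),
+        n_correct=correct,
+        n_errors=errors,
+        # Errored requests are excluded from the accuracy denominator.
+        accuracy=correct / max(len(cases) - errors, 1),
+        mean_completion_tokens=statistics.mean(c.completion_tokens for c in ok) if ok else 0.0,
+        mean_ttft_ms=statistics.mean(ttfts) if ttfts else -1.0,
+        mean_tpot_ms=statistics.mean(tpots) if tpots else -1.0,
+        wall_s=wall,
+    )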
+    return row, cases
+
+
+def run_quality(
+    endpoints: list[SystemEndpoint], cfg: BenchConfig, tasks: list[str],
+) -> tuple[list[QualityRow], list[QualityCase]]:
+    all_rows: list[QualityRow] = []
+    all_cases: list[QualityCase] = []
+    for ep in endpoints:
+        print(f"[quality] === {ep.name} ===")
+        for task_name in tasks:
+            if task_name not in TASKS:
+                raise ValueError(f"unknown task: {task_name}")
+            task_mod, max_tok_attr = TASKS[task_name]
+            row, cases = asyncio.run(_run_one_task(
+                ep, task_name, task_mod, getattr(cfg, max_tok_attr), cfg,
+            ))
+            all_rows.append(row)
+            all_cases.extend(cases)
+            print(f"  -> {row.task}: {row.n_correct}/{row.n_total} = "
+                  f"{row.accuracy * 100:.1f}%  ({row.wall_s:.1f}s wall)")
+    return all_rows, all_cases
+
+
+def rows_to_dicts(rows: list[QualityRow]) -> list[dict[str, Any]]:
+    return [asdict(r) for r in rows]
+
+
+def cases_to_dicts(cases: list[QualityCase]) -> list[dict[str, Any]]:
+    return [asdict(c) for c in cases]
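Note on the task-module surface listed in the `quality.py` docstring: `tasks/aime.py` and `tasks/gsm8k.py` are not shown in this part of the diff, so the following is only a sketch of what an `extract_answer` / `score` pair could look like for the GSM8K rule described in the doc (exact match on `\boxed{n}`, else the last number, decimals allowed). The regular expressions and the comma-stripping are assumptions, not the real scorers.

```python
import re
from typing import Optional

# Innermost \boxed{...} payload without nested braces.
_BOXED = re.compile(r"\\boxed\{([^{}]*)\}")
# Integers or decimals, optionally negative, with optional thousands separators.
_NUMBER = re.compile(r"-?\d[\d,]*(?:\.\d+)?")


def extract_answer(text: str) -> Optional[str]:
    """Prefer the last \\boxed{...}; otherwise fall back to the last number."""
    boxed = _BOXED.findall(text)
    if boxed:
        candidate = boxed[-1]
    else:
        numbers = _NUMBER.findall(text)
        if not numbers:
            return None
        candidate = numbers[-1]
    return candidate.replace(",", "").strip() or None


def score(pred: Optional[str], gold: str) -> bool:
    """Exact match, compared numerically when both sides parse as numbers."""
    if pred is None:
        return False
    try:
        return float(pred) == float(gold)
    except ValueError:
        return pred.strip() == gold.strip()


if __name__ == "__main__":
    reply = r"Adding the two parts gives \boxed{204}."
    print(extract_answer(reply), score(extract_answer(reply), "204"))
```

Comparing numerically lets `18.0` match a gold answer of `18`, while the string fallback still handles non-numeric answers.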
--- a/tools/bench/report.py
+++ b/tools/bench/report.py
@@ -0,0 +1,122 @@
+"""Combined speed + quality report (markdown + json side-cars)."""
+
+from __future__ import annotations
+
+import datetime as dt
+import json
+import os
+from typing import Any
+
+from .config import DEFAULT_SYSTEMS
+
+
+def _fmt(x: float | None, nd: int = 1) -> str:
+    if x is None or x < 0:
+        return "—"
+    return f"{x:.{nd}f}"
+
+
+def _speed_table(rows: list[dict[str, Any]]) -> str:
+    if not rows:
+        return "_(no speed results)_\n"
+
+    # scenarios in stable order
+    scenarios: list[str] = []
+    for r in rows:
+        if r["scenario"] not in scenarios:
+            scenarios.append(r["scenario"])
+    systems: list[str] = []
+    for r in rows:
+        if r["system"] not in systems:
+            systems.append(r["system"])
+
+    by = {(r["system"], r["scenario"]): r for r in rows}
+    out = []
+    out.append("| scenario | metric | " + " | ".join(systems) + " | speedup (>1× = xserv better) |")
+    out.append("|---|---|" + "|".join(["---"] * (len(systems) + 1)) + "|")
+
+    metrics = [
+        ("ttft_ms_p50", "TTFT p50 (ms)", "lower"),
+        ("ttft_ms_p95", "TTFT p95 (ms)", "lower"),
+        ("tpot_ms_p50", "TPOT p50 (ms/tok)", "lower"),
+        ("throughput_tok_s", "Throughput (tok/s)", "higher"),
+    ]
+    for sc in scenarios:
+        for key, label, direction in metrics:
+            cells = []
+            vals = {}
+            for s in systems:
+                row = by.get((s, sc))
+                v = row[key] if row else -1.0
+                vals[s] = v
+                cells.append(_fmt(v, 2 if "tpot" in key else 1))
+            xs = vals.get("xserv", -1.0)
+            lc = vals.get("llama.cpp", -1.0)
+            if xs > 0 and lc > 0:
+                # Orient the ratio so that > 1 always means xserv is better.
+                ratio = (xs / lc) if direction == "higher" else (lc / xs)
+                cells.append(f"{ratio:.2f}×")
+            else:
+                cells.append("—")
+            out.append(f"| {sc} | {label} | " + " | ".join(cells) + " |")
+    return "\n".join(out) + "\n"
+
+
+def _quality_table(rows: list[dict[str, Any]]) -> str:
+    if not rows:
+        return "_(no quality results)_\n"
+    by_task: dict[str, list[dict[str, Any]]] = {}
+    for r in rows:
+        by_task.setdefault(r["task"], []).append(r)
+    out: list[str] = []
+    out.append("| task | system | n | correct | accuracy | mean tokens | TTFT (ms) | TPOT (ms/tok) | wall (s) |")
+    out.append("|---|---|---|---|---|---|---|---|---|")
+    for task, task_rows in by_task.items():
+        for r in task_rows:
+            out.append(
+                f"| {task} | {r['system']} | {r['n_total']} | {r['n_correct']} | "
+                f"{r['accuracy'] * 100:.1f}% | {r['mean_completion_tokens']:.0f} | "
+                f"{_fmt(r['mean_ttft_ms'])} | {_fmt(r['mean_tpot_ms'], 2)} | {r['wall_s']:.1f} |"
+            )
+    return "\n".join(out) + "\n"
+
+
+def write_report(
+    out_dir: str,
+    speed_rows: list[dict[str, Any]],
+    speed_raw: list[dict[str, Any]],
+    quality_rows: list[dict[str, Any]],
+    quality_cases: list[dict[str, Any]],
+    env: dict[str, Any],
+) -> str:
+    os.makedirs(out_dir, exist_ok=True)
+    stamp = dt.datetime.now().strftime("%Y%m%d-%H%M%S")
+    md_path = os.path.join(out_dir, f"comparison-{stamp}.md")
+    json_path = os.path.join(out_dir, f"comparison-{stamp}.json")
+
+    with open(json_path, "w") as f:
+        json.dump({
+            "stamp": stamp,
+            "env": env,
+            "speed": {"summary": speed_rows, "raw": speed_raw},
+            "quality": {"summary": quality_rows, "cases": quality_cases},
+        }, f, indent=2)
+
+    lines: list[str] = []
+    lines.append(f"# xserv vs llama.cpp — comparison\n")
+    lines.append(f"_Generated: {stamp}_\n")
+    lines.append("## Environment\n")
+    for k, v in env.items():
+        lines.append(f"- **{k}**: {v}")
+    lines.append("")
+    lines.append("## Speed\n")
+    lines.append(_speed_table(speed_rows))
+    lines.append("\n## Quality\n")
+    lines.append(_quality_table(quality_rows))
+    lines.append(f"\n_Raw results: `{os.path.basename(json_path)}`_\n")
+
+    with open(md_path, "w") as f:
+        f.write("\n".join(lines))
+
+    print(f"\n[report] wrote {md_path}")
+    print(f"[report] wrote {json_path}")
+    return md_path
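Note on the speedup column in `_speed_table`: the ratio is inverted for lower-is-better metrics, so a value above 1 always favours xserv. The sketch below isolates that rule with made-up numbers. `speedup` is a hypothetical standalone helper, not a function in `report.py`.

```python
from typing import Optional


def speedup(xserv: float, llama: float, direction: str) -> Optional[float]:
    """Ratio oriented so that > 1 means xserv is better; None if either value is missing."""
    if xserv <= 0 or llama <= 0:
        return None  # the report uses non-positive values as "not measured"
    if direction == "higher":
        return xserv / llama  # throughput: more is better
    return llama / xserv      # latency: less is better, so invert


if __name__ == "__main__":
    # Illustrative inputs only, not measured results.
    print(speedup(120.0, 100.0, "higher"))  # throughput in tok/s
    print(speedup(50.0, 100.0, "lower"))    # TTFT in ms
    print(speedup(-1.0, 100.0, "lower"))    # missing measurement
```

For latency rows the quotient is therefore llama.cpp divided by xserv, even though the column is read as an xserv speedup.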
--- a/tools/bench/requirements.txt
+++ b/tools/bench/requirements.txt
@@ -0,0 +1,2 @@
+httpx>=0.27
+datasets>=2.20
--- a/tools/bench/runner.py
+++ b/tools/bench/runner.py
@@ -0,0 +1,202 @@
+"""One-click entrypoint: spin up both servers, run suites, write report.
+
+Usage examples:
+
+  # Full sweep against both systems
+  python3 -m tools.bench.runner \
+      --xserv-bin ./target/release/xserv-server \
+      --xserv-model /opt/wjh/models/qwen3-8b \
+      --llama-bin third_party/llama.cpp/build/bin/llama-server \
+      --llama-gguf /opt/wjh/models/qwen3-8b/qwen3-8b-bf16.gguf \
+      --suite all
+
+  # Speed-only smoke test
+  python3 -m tools.bench.runner ... --suite speed
+
+  # Quality with 5-problem subsample
+  python3 -m tools.bench.runner ... --suite quality --quality-limit 5
+"""
+
+from __future__ import annotations
+
+import argparse
+import os
+import platform
+import subprocess
+import sys
+from contextlib import ExitStack
+from typing import Any
+
+# Allow running as `python3 tools/bench/runner.py` from repo root.
+if __package__ in (None, ""):
+    sys.path.insert(0, os.path.dirname(os.path.dirname(os.path.dirname(os.path.abspath(__file__)))))
+
+from tools.bench.config import (
+    BenchConfig, SystemEndpoint, SYSTEM_XSERV, SYSTEM_LLAMA_CPP,
+)
+from tools.bench.servers import (
+    ServerHandle, start_server, stop_server,
+    xserv_launch_cmd, llama_cpp_launch_cmd,
+)
+from tools.bench.speed import run_speed, rows_to_dicts as speed_rows_to_dicts
+from tools.bench.quality import (
+    run_quality, rows_to_dicts as q_rows_to_dicts, cases_to_dicts,
+)
+from tools.bench.report import write_report
+
+
+def parse_args() -> argparse.Namespace:
+    p = argparse.ArgumentParser(description="xserv vs llama.cpp benchmark suite")
+    # Targets
+    p.add_argument("--xserv-bin", default="./target/release/xserv-server")
+    p.add_argument("--xserv-model", required=False,
+                   help="HF model directory for xserv-server (defaults to $XSERV_MODEL_DIR)")
+    p.add_argument("--xserv-port", type=int, default=18080)
+    p.add_argument("--xserv-base-url", default=None,
+                   help="If set, skip launching xserv and target this URL.")
+    p.add_argument("--xserv-model-id", default="qwen3-8b")
+
+    p.add_argument("--llama-bin", default="third_party/llama.cpp/build/bin/llama-server")
+    p.add_argument("--llama-gguf", required=False,
+                   help="GGUF model for llama-server (defaults to $LLAMA_GGUF)")
+    p.add_argument("--llama-port", type=int, default=18081)
+    p.add_argument("--llama-base-url", default=None,
+                   help="If set, skip launching llama-server and target this URL.")
+    p.add_argument("--llama-model-id", default="qwen3-8b",
+                   help="String to send in OpenAI 'model' field; llama-server is permissive.")
+
+    # Shared
+    p.add_argument("--max-batch", type=int, default=4)
+    p.add_argument("--max-seq-len", type=int, default=8192)
+    p.add_argument("--systems", default="xserv,llama.cpp",
+                   help="Comma-separated subset to run, e.g. 'xserv' to skip llama.cpp")
+
+    # Suites
+    p.add_argument("--suite", choices=["speed", "quality", "all"], default="all")
+    p.add_argument("--quality-tasks", default="aime2025,gsm8k")
+    p.add_argument("--quality-limit", type=int, default=None,
+                   help="Cap problems per task (smoke test). None = all problems.")
+    p.add_argument("--speed-prompts", type=int, default=8)
+    p.add_argument("--speed-max-tokens", type=int, default=128)
+    p.add_argument("--speed-concurrency", default="1,2,4,8")
+
+    p.add_argument("--out-dir", default="bench-out")
+    return p.parse_args()
+
+
+def build_endpoints(args) -> list[SystemEndpoint]:
+    wanted = set(s.strip() for s in args.systems.split(",") if s.strip())
+    eps: list[SystemEndpoint] = []
+
+    if SYSTEM_XSERV in wanted:
+        if args.xserv_base_url:
+            eps.append(SystemEndpoint(
+                name=SYSTEM_XSERV, base_url=args.xserv_base_url,
+                model_id=args.xserv_model_id, launch_cmd=None,
+            ))
+        else:
+            model_dir = args.xserv_model or os.environ.get("XSERV_MODEL_DIR")
+            if not model_dir:
+                raise SystemExit("--xserv-model or XSERV_MODEL_DIR required (or pass --xserv-base-url)")
+            eps.append(SystemEndpoint(
+                name=SYSTEM_XSERV,
+                base_url=f"http://127.0.0.1:{args.xserv_port}",
+                model_id=args.xserv_model_id,
+                launch_cmd=xserv_launch_cmd(
+                    args.xserv_bin, model_dir, args.xserv_port,
+                    max_batch=args.max_batch, max_seq_len=args.max_seq_len,
+                ),
+                health_path="/health",
+                ready_timeout_s=900.0,
+            ))
+
+    if SYSTEM_LLAMA_CPP in wanted:
+        if args.llama_base_url:
+            eps.append(SystemEndpoint(
+                name=SYSTEM_LLAMA_CPP, base_url=args.llama_base_url,
+                model_id=args.llama_model_id, launch_cmd=None,
+            ))
+        else:
+            gguf = args.llama_gguf or os.environ.get("LLAMA_GGUF")
+            if not gguf:
+                raise SystemExit("--llama-gguf or LLAMA_GGUF required (or pass --llama-base-url)")
+            eps.append(SystemEndpoint(
+                name=SYSTEM_LLAMA_CPP,
+                base_url=f"http://127.0.0.1:{args.llama_port}",
+                model_id=args.llama_model_id,
+                launch_cmd=llama_cpp_launch_cmd(
+                    args.llama_bin, gguf, args.llama_port,
+                    n_parallel=args.max_batch, ctx_size=args.max_seq_len,
+                ),
+                # llama-server's health endpoint also returns 200 only when model is loaded.
+                health_path="/health",
+                ready_timeout_s=900.0,
+            ))
+    return eps
+
+
+def collect_env() -> dict[str, Any]:
+    env: dict[str, Any] = {
+        "platform": platform.platform(),
+        "python": sys.version.split()[0],
+    }
+    for cmd, key in [
+        (["nvidia-smi", "--query-gpu=name,driver_version,memory.total", "--format=csv,noheader"], "gpu"),
+        (["git", "rev-parse", "HEAD"], "xserv_commit"),
+    ]:
+        try:
+            out = subprocess.check_output(cmd, text=True, stderr=subprocess.DEVNULL, timeout=5).strip()
+            env[key] = out.splitlines()[0] if out else "?"
+        except (subprocess.CalledProcessError, FileNotFoundError, subprocess.TimeoutExpired):
+            env[key] = "?"
+    return env
+
+
+def main() -> None:
+    args = parse_args()
+    endpoints = build_endpoints(args)
+    if not endpoints:
+        raise SystemExit("no systems selected (check --systems)")
+
+    cfg = BenchConfig(
+        out_dir=args.out_dir,
+        speed_prompts=args.speed_prompts,
+        speed_max_tokens=args.speed_max_tokens,
+        speed_concurrency=tuple(int(c) for c in args.speed_concurrency.split(",") if c.strip()),
+        quality_limit=args.quality_limit,
+    )
+
+    os.makedirs(args.out_dir, exist_ok=True)
+    log_dir = os.path.join(args.out_dir, "logs")
+
+    handles: list[ServerHandle] = []
+    speed_rows: list[Any] = []
+    speed_raw: list[dict[str, Any]] = []
+    quality_rows: list[Any] = []
+    quality_cases: list[Any] = []
+
+    with ExitStack() as stack:
+        for ep in endpoints:
+            h = start_server(ep, log_dir)
+            handles.append(h)
+            stack.callback(stop_server, h)
+
+        if args.suite in ("speed", "all"):
+            speed_rows, speed_raw = run_speed(endpoints, cfg)
+
+        if args.suite in ("quality", "all"):
+            tasks = [t.strip() for t in args.quality_tasks.split(",") if t.strip()]
+            quality_rows, quality_cases = run_quality(endpoints, cfg, tasks)
+
+    write_report(
+        out_dir=args.out_dir,
+        speed_rows=speed_rows_to_dicts(speed_rows) if speed_rows else [],
+        speed_raw=speed_raw,
+        quality_rows=q_rows_to_dicts(quality_rows) if quality_rows else [],
+        quality_cases=cases_to_dicts(quality_cases) if quality_cases else [],
+        env=collect_env(),
+    )
+
+
+if __name__ == "__main__":
+    main()
--- a/tools/bench/servers.py
+++ b/tools/bench/servers.py
@@ -0,0 +1,145 @@
+"""Start/stop xserv-server and llama-server as subprocesses.
+
+The benchmark driver treats both systems as black-box HTTP servers — it does
+not import their Rust/C++ code. This keeps the comparison fair (same wire
+protocol, no in-process shortcut) and avoids coupling the bench harness to
+internal APIs.
+"""
+
+from __future__ import annotations
+
+import contextlib
+import os
+import signal
+import subprocess
+import sys
+import time
+import urllib.error
+import urllib.request
+from dataclasses import dataclass
+
+from .config import SystemEndpoint
+
+
+@dataclass
+class ServerHandle:
+    endpoint: SystemEndpoint
+    proc: subprocess.Popen[bytes] | None
+    log_path: str | None
+
+
+def _wait_ready(base_url: str, health_path: str, timeout_s: float) -> bool:
+    url = base_url.rstrip("/") + health_path
+    deadline = time.monotonic() + timeout_s
+    last_err = ""
+    while time.monotonic() < deadline:
+        try:
+            with urllib.request.urlopen(url, timeout=5) as r:
+                if r.status == 200:
+                    return True
+        except (urllib.error.URLError, ConnectionError, TimeoutError) as e:
+            last_err = repr(e)
+        time.sleep(1.0)  # back off on errors and on non-200 replies alike
+    print(f"[servers] not ready after {timeout_s}s ({url}): {last_err}", file=sys.stderr)
+    return False
+
+
+def start_server(ep: SystemEndpoint, log_dir: str) -> ServerHandle:
+    """Launch `ep.launch_cmd` if set; otherwise assume it's already running."""
+    if ep.launch_cmd is None:
+        if _wait_ready(ep.base_url, ep.health_path, timeout_s=10.0):
+            print(f"[servers] reusing already-running {ep.name} at {ep.base_url}")
+            return ServerHandle(endpoint=ep, proc=None, log_path=None)
+        raise RuntimeError(f"{ep.name}: no launch_cmd and not reachable at {ep.base_url}")
+
+    os.makedirs(log_dir, exist_ok=True)
+    log_path = os.path.join(log_dir, f"{ep.name.replace('.', '_')}.log")
+    log_f = open(log_path, "wb")
+    env = os.environ.copy()
+    env.update(ep.launch_env)
+
+    print(f"[servers] launching {ep.name}: {' '.join(ep.launch_cmd)}")
+    print(f"[servers]   log: {log_path}")
+    proc = subprocess.Popen(
+        ep.launch_cmd,
+        cwd=ep.launch_cwd,
+        env=env,
+        stdout=log_f,
+        stderr=subprocess.STDOUT,
+        # Own process group so SIGTERM kills children (llama-server in particular).
+        start_new_session=True,
+    )
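+    log_f.close()  # the child keeps its own copy of the descriptor
+    ok = _wait_ready(ep.base_url, ep.health_path, timeout_s=ep.ready_timeout_s)
+    if not ok:
+        # Tear down the whole process group before raising; the log file
+        # stays on disk at log_path for the caller to inspect.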
+        try:
+            os.killpg(proc.pid, signal.SIGTERM)
+        except ProcessLookupError:
+            pass
+        raise RuntimeError(
+            f"{ep.name} failed to become ready (see {log_path}). "
+            "Common causes: model path wrong, port already in use, OOM."
+        )
+
+    return ServerHandle(endpoint=ep, proc=proc, log_path=log_path)
+
+
+def stop_server(h: ServerHandle, *, grace_s: float = 10.0) -> None:
+    if h.proc is None:
+        return
+    print(f"[servers] stopping {h.endpoint.name} (pid {h.proc.pid})")
+    try:
+        os.killpg(h.proc.pid, signal.SIGTERM)
+    except ProcessLookupError:
+        return
+    try:
+        h.proc.wait(timeout=grace_s)
+    except subprocess.TimeoutExpired:
+        print(f"[servers] {h.endpoint.name} did not exit, sending SIGKILL")
+        with contextlib.suppress(ProcessLookupError):
+            os.killpg(h.proc.pid, signal.SIGKILL)
+        h.proc.wait(timeout=5)
+
+
+# ---------- launch-command builders ----------
+
+
+def xserv_launch_cmd(
+    bin_path: str,
+    model_dir: str,
+    port: int,
+    *,
+    max_batch: int,
+    max_seq_len: int,
+) -> list[str]:
+    return [
+        bin_path,
+        model_dir,
+        "--port", str(port),
+        "--max-batch", str(max_batch),
+        "--max-seq-len", str(max_seq_len),
+    ]
+
+
+def llama_cpp_launch_cmd(
+    bin_path: str,
+    gguf_path: str,
+    port: int,
+    *,
+    n_parallel: int,
+    ctx_size: int,
+    n_gpu_layers: int = 99,
+) -> list[str]:
+    return [
+        bin_path,
+        "-m", gguf_path,
+        "--port", str(port),
+        "--host", "0.0.0.0",
+        "-c", str(ctx_size),
+        "-ngl", str(n_gpu_layers),
+        "--parallel", str(n_parallel),
+        # Be quiet by default; the log file already captures stderr.
+        "--log-disable",
+    ]
--- a/tools/bench/speed.py
+++ b/tools/bench/speed.py
@@ -0,0 +1,169 @@
+"""Speed suite: TTFT, TPOT, throughput; serial and concurrent.
+
+Single-stream and concurrent throughput are reported separately because they
+stress different things: single-stream TTFT/TPOT is kernel/latency bound, while
+throughput at high concurrency is scheduler/batching bound.
+"""
+
+from __future__ import annotations
+
+import asyncio
+import statistics
+from dataclasses import asdict, dataclass
+from typing import Any
+
+from .client import StreamResult, chat_concurrent
+from .config import BenchConfig, SystemEndpoint
+
+
+# Three prompt-length buckets cover the common interesting points:
+# short = greeting-style; medium = QA; long = summarize-ish (prefill-heavy).
+SPEED_PROMPTS = {
+    "short":  "What is 2 + 2?",
+    "medium": "Explain the difference between TCP and UDP, briefly.",
+    "long": (
+        "Write a detailed comparison of Python and Rust for systems programming. "
+        "Cover memory management, performance, ergonomics, ecosystem, and typical "
+        "use cases. Be specific."
+    ),
+}
+
+
+@dataclass
+class SpeedRow:
+    system: str
+    scenario: str          # e.g. "single/short", "concurrent-4"
+    requests: int
+    completion_tokens_total: int
+    wall_s: float
+    ttft_ms_p50: float
+    ttft_ms_p95: float
+    tpot_ms_p50: float
+    tpot_ms_p95: float
+    throughput_tok_s: float    # aggregate completion_tokens / wall
+    per_req_throughput_tok_s_mean: float
+    errors: int
+
+
+def _percentile(values: list[float], p: float) -> float:
+    if not values:
+        return -1.0
+    s = sorted(values)
+    idx = max(0, min(len(s) - 1, int(round((p / 100.0) * (len(s) - 1)))))
+    return s[idx]
+
+
+def _summarize(system: str, scenario: str, results: list[StreamResult], wall_s: float) -> SpeedRow:
+    ok = [r for r in results if r.error is None]
+    ttft_ms = [r.ttft_s * 1000 for r in ok if r.ttft_s >= 0]
+    tpot_ms = [r.tpot_s * 1000 for r in ok if r.tpot_s >= 0]
+    per_req_tps = [r.throughput_tok_s for r in ok if r.throughput_tok_s > 0]
+    total_tokens = sum(r.completion_tokens for r in ok)
+    return SpeedRow(
+        system=system,
+        scenario=scenario,
+        requests=len(results),
+        completion_tokens_total=total_tokens,
+        wall_s=wall_s,
+        ttft_ms_p50=_percentile(ttft_ms, 50),
+        ttft_ms_p95=_percentile(ttft_ms, 95),
+        tpot_ms_p50=_percentile(tpot_ms, 50),
+        tpot_ms_p95=_percentile(tpot_ms, 95),
+        throughput_tok_s=total_tokens / wall_s if wall_s > 0 else -1.0,
+        per_req_throughput_tok_s_mean=statistics.mean(per_req_tps) if per_req_tps else -1.0,
+        errors=len(results) - len(ok),
+    )
+
+
+async def run_single_stream(
+    ep: SystemEndpoint, cfg: BenchConfig,
+) -> tuple[list[SpeedRow], list[dict[str, Any]]]:
+    """One request at a time, three prompt lengths. Repeat each `cfg.speed_prompts` times."""
+    rows: list[SpeedRow] = []
+    raw: list[dict[str, Any]] = []
+    for bucket, prompt in SPEED_PROMPTS.items():
+        messages = [[{"role": "user", "content": prompt}]] * cfg.speed_prompts
+        results, wall = await chat_concurrent(
+            ep.base_url, ep.model_id, messages,
+            max_tokens=cfg.speed_max_tokens,
+            temperature=0.0,
+            api_key=ep.api_key,
+            timeout=cfg.request_timeout_s,
+            concurrency=1,
+        )
+        rows.append(_summarize(ep.name, f"single/{bucket}", results, wall))
+        for i, r in enumerate(results):
+            raw.append({
+                "system": ep.name, "scenario": f"single/{bucket}", "i": i,
+                "ttft_s": r.ttft_s, "tpot_s": r.tpot_s,
+                "completion_tokens": r.completion_tokens,
+                "e2e_s": r.e2e_s, "error": r.error,
+                "finish_reason": r.finish_reason,
+            })
+    return rows, raw
+
+
+async def run_concurrent(
+    ep: SystemEndpoint, cfg: BenchConfig,
+) -> tuple[list[SpeedRow], list[dict[str, Any]]]:
+    """Fixed medium-length prompt, sweep concurrency."""
+    rows: list[SpeedRow] = []
+    raw: list[dict[str, Any]] = []
+    prompt = SPEED_PROMPTS["medium"]
+    for c in cfg.speed_concurrency:
+        # Send 4x concurrency requests (and never fewer than 8) so the scheduler
+        # sees sustained load, not just one wave.
+        n = max(c * 4, 8)
+        messages = [[{"role": "user", "content": prompt}]] * n
+        results, wall = await chat_concurrent(
+            ep.base_url, ep.model_id, messages,
+            max_tokens=cfg.speed_max_tokens,
+            temperature=0.0,
+            api_key=ep.api_key,
+            timeout=cfg.request_timeout_s,
+            concurrency=c,
+        )
+        rows.append(_summarize(ep.name, f"concurrent-{c}", results, wall))
+        for i, r in enumerate(results):
+            raw.append({
+                "system": ep.name, "scenario": f"concurrent-{c}", "i": i,
+                "ttft_s": r.ttft_s, "tpot_s": r.tpot_s,
+                "completion_tokens": r.completion_tokens,
+                "e2e_s": r.e2e_s, "error": r.error,
+                "finish_reason": r.finish_reason,
+            })
+    return rows, raw
+
+
+def run_speed(
+    endpoints: list[SystemEndpoint], cfg: BenchConfig,
+) -> tuple[list[SpeedRow], list[dict[str, Any]]]:
+    all_rows: list[SpeedRow] = []
+    all_raw: list[dict[str, Any]] = []
+    for ep in endpoints:
+        print(f"[speed] === {ep.name} ===")
+        # Tiny warmup so the first row isn't penalized by lazy cache allocation.
+        warm_messages = [[{"role": "user", "content": "Hello"}]]
+        asyncio.run(chat_concurrent(
+            ep.base_url, ep.model_id, warm_messages,
+            max_tokens=8, temperature=0.0, api_key=ep.api_key,
+            timeout=120, concurrency=1,
+        ))
+
+        rows1, raw1 = asyncio.run(run_single_stream(ep, cfg))
+        all_rows.extend(rows1); all_raw.extend(raw1)
+        for r in rows1:
+            print(f"  {r.scenario:18s} ttft p50={r.ttft_ms_p50:7.1f}ms  "
+                  f"tpot p50={r.tpot_ms_p50:6.2f}ms  thpt={r.throughput_tok_s:6.1f} tok/s")
+
+        rows2, raw2 = asyncio.run(run_concurrent(ep, cfg))
+        all_rows.extend(rows2); all_raw.extend(raw2)
+        for r in rows2:
+            print(f"  {r.scenario:18s} reqs={r.requests:3d}  thpt={r.throughput_tok_s:6.1f} tok/s  "
+                  f"ttft p95={r.ttft_ms_p95:7.1f}ms  errs={r.errors}")
+
+    return all_rows, all_raw
+
+
+def rows_to_dicts(rows: list[SpeedRow]) -> list[dict[str, Any]]:
+    return [asdict(r) for r in rows]
--- a/tools/bench/tasks/__init__.py
+++ b/tools/bench/tasks/__init__.py
@@ -0,0 +1,46 @@
+"""Shared helpers for quality tasks.
+
+Each task can be backed by a pre-fetched local JSON file (so the GPU host
+doesn't need network). The JSON is a list of records:
+    [{"id": str, "problem": str, "answer": str, "source": str}, ...]
+
+Use tools/bench/fetch_datasets.py on a networked machine to produce these
+files, then ship them to the GPU host (the bench sync does this automatically).
+"""
+
+from __future__ import annotations
+
+import json
+import os
+from typing import Any
+
+
+def data_dir() -> str:
+    """Directory holding pre-fetched dataset JSON. Override via BENCH_DATA_DIR."""
+    return os.environ.get(
+        "BENCH_DATA_DIR",
+        os.path.join(os.path.dirname(os.path.abspath(__file__)), "..", "data"),
+    )
+
+
+def local_json_path(task_name: str) -> str:
+    return os.path.normpath(os.path.join(data_dir(), f"{task_name}.json"))
+
+
+def load_local(task_name: str) -> list[dict[str, Any]] | None:
+    """Return records from the local JSON file if present, else None."""
+    path = local_json_path(task_name)
+    if not os.path.isfile(path):
+        return None
+    with open(path) as f:
+        records = json.load(f)
+    print(f"[tasks] loaded {len(records)} records from {path}")
+    return records
+
+
+def save_local(task_name: str, records: list[dict[str, Any]]) -> str:
+    path = local_json_path(task_name)
+    os.makedirs(os.path.dirname(path), exist_ok=True)
+    with open(path, "w") as f:
+        json.dump(records, f, ensure_ascii=False, indent=1)
+    return path
--- a/tools/bench/tasks/aime.py
+++ b/tools/bench/tasks/aime.py
@@ -0,0 +1,114 @@
+"""AIME 2025 — 30 problems, integer answers 0..999.
+
+Scoring: exact-match of the integer in the last `\\boxed{...}` in the response,
+falling back to the last standalone integer in the response. Matches the
+convention used by most reasoning-model leaderboards.
+"""
+
+from __future__ import annotations
+
+import re
+from typing import Any
+
+from . import load_local
+
+TASK_NAME = "aime2025"
+
+# Tried in order; first one to load wins (commonly used HF copies of AIME 2025).
+# NOTE: the last entry pins config "AIME2025-I", so that fallback is part I only.
+DATASET_CANDIDATES = [
+    ("MathArena/aime_2025", None, "test"),
+    ("yentinglin/aime_2025", None, "train"),
+    ("opencompass/AIME2025", "AIME2025-I", "test"),
+]
+
+
+def load() -> list[dict[str, Any]]:
+    # Prefer the pre-fetched local JSON (GPU host has no network).
+    local = load_local(TASK_NAME)
+    if local is not None:
+        return local
+    return load_remote()
+
+
+def load_remote() -> list[dict[str, Any]]:
+    """Fetch from HuggingFace. Requires network — used by fetch_datasets.py."""
+    from datasets import load_dataset  # noqa: PLC0415 — optional dep, see requirements.txt
+
+    last_err: Exception | None = None
+    for repo, config, split in DATASET_CANDIDATES:
+        try:
+            ds = load_dataset(repo, config, split=split) if config else load_dataset(repo, split=split)
+        except Exception as e:  # noqa: BLE001 — try the next candidate
+            last_err = e
+            continue
+
+        problems: list[dict[str, Any]] = []
+        for i, row in enumerate(ds):
+            problem = row.get("problem") or row.get("question") or row.get("Problem")
+            answer = next((row[k] for k in ("answer", "Answer", "solution_int") if row.get(k) is not None), None)  # 0 is valid
+            if problem is None or answer is None:
+                continue
+            problems.append({
+                "id": str(row.get("id") or row.get("ID") or i),
+                "problem": problem,
+                "answer": str(answer).strip(),
+                "source": repo,
+            })
+        if problems:
+            return problems
+
+    raise RuntimeError(
+        f"Could not load AIME 2025 from any of {[c[0] for c in DATASET_CANDIDATES]} "
+        f"(last error: {last_err!r}). Set HF_HOME / HF_TOKEN if needed."
+    )
+
+
+SYSTEM_PROMPT = (
+    "You are a careful math problem solver. Solve the problem step by step. "
+    "Put your final integer answer (an integer from 0 to 999) inside \\boxed{}."
+)
+
+
+def make_messages(problem: str) -> list[dict[str, str]]:
+    return [
+        {"role": "system", "content": SYSTEM_PROMPT},
+        {"role": "user", "content": problem},
+    ]
+
+
+_BOXED_RE = re.compile(r"\\boxed\s*\{([^{}]*)\}")
+_INT_RE = re.compile(r"-?\d+")
+
+
+def extract_answer(text: str) -> str | None:
+    """Return canonical integer string, or None if nothing parseable."""
+    if not text:
+        return None
+    boxed = _BOXED_RE.findall(text)
+    candidates: list[str] = []
+    if boxed:
+        # Inside the \boxed{} there may be extra latex; grab the last integer.
+        ints = _INT_RE.findall(boxed[-1])
+        if ints:
+            candidates.append(ints[-1])
+    # Fallback: the last integer anywhere in the response.
+    if not candidates:
+        ints = _INT_RE.findall(text)
+        if ints:
+            candidates.append(ints[-1])
+    if not candidates:
+        return None
+    try:
+        return str(int(candidates[-1]))
+    except ValueError:
+        return None
+
+
+def score(pred: str | None, gold: str) -> bool:
+    if pred is None:
+        return False
+    try:
+        return int(pred) == int(gold)
+    except ValueError:
+        return False
--- a/tools/bench/tasks/gsm8k.py
+++ b/tools/bench/tasks/gsm8k.py
@@ -0,0 +1,90 @@
+"""GSM8K — 1319 grade-school math problems with integer/decimal answers.
+
+Gold answers in the dataset are in the form `... #### 42`. We score by
+exact-match of the final number, with the same `\\boxed{}` / last-number
+extraction used for AIME, since for instruction-tuned models the response
+follows the prompt instructions, not the dataset's `####` convention.
+"""
+
+from __future__ import annotations
+
+import re
+from typing import Any
+
+from . import load_local
+
+TASK_NAME = "gsm8k"
+
+
+def load() -> list[dict[str, Any]]:
+    local = load_local(TASK_NAME)
+    if local is not None:
+        return local
+    return load_remote()
+
+
+def load_remote() -> list[dict[str, Any]]:
+    """Fetch from HuggingFace. Requires network — used by fetch_datasets.py."""
+    from datasets import load_dataset  # noqa: PLC0415
+
+    ds = load_dataset("openai/gsm8k", "main", split="test")
+    out: list[dict[str, Any]] = []
+    for i, row in enumerate(ds):
+        ans_full: str = row["answer"]
+        # gold format: "<chain of thought>\n#### 42"
+        gold = ans_full.split("####")[-1].strip().replace(",", "")
+        out.append({
+            "id": str(i),
+            "problem": row["question"],
+            "answer": gold,
+            "source": "openai/gsm8k",
+        })
+    return out
+
+
+SYSTEM_PROMPT = (
+    "You are a careful math problem solver. Solve the problem step by step. "
+    "Put your final numeric answer inside \\boxed{}."
+)
+
+
+def make_messages(problem: str) -> list[dict[str, str]]:
+    return [
+        {"role": "system", "content": SYSTEM_PROMPT},
+        {"role": "user", "content": problem},
+    ]
+
+
+_BOXED_RE = re.compile(r"\\boxed\s*\{([^{}]*)\}")
+# Allow comma-grouped thousands (e.g. "3,500"); _normalize_num strips them.
+_NUM_RE = re.compile(r"-?\d+(?:,\d{3})*(?:\.\d+)?")
+
+
+def _normalize_num(s: str) -> str | None:
+    s = s.replace(",", "").strip()
+    try:
+        f = float(s)
+    except ValueError:
+        return None
+    return str(int(f)) if f.is_integer() else repr(f)
+
+
+def extract_answer(text: str) -> str | None:
+    if not text:
+        return None
+    boxed = _BOXED_RE.findall(text)
+    if boxed:
+        nums = _NUM_RE.findall(boxed[-1])
+        if nums:
+            return _normalize_num(nums[-1])
+    nums = _NUM_RE.findall(text)
+    if nums:
+        return _normalize_num(nums[-1])
+    return None
+
+
+def score(pred: str | None, gold: str) -> bool:
+    if pred is None:
+        return False
+    gold_norm = _normalize_num(gold)
+    return gold_norm is not None and pred == gold_norm
--- a/tools/convert-to-gguf.sh
+++ b/tools/convert-to-gguf.sh
@@ -0,0 +1,55 @@
+#!/usr/bin/env bash
+# Convert a HuggingFace safetensors model dir into a BF16 GGUF for llama.cpp.
+#
+# Why BF16: we run xserv in BF16, so the baseline must run BF16 too. If we
+# compared xserv-BF16 against llama.cpp-Q4_K_M the speed delta would be
+# dominated by quantization, not by our kernels — that's not an apples-to-
+# apples comparison.
+#
+# Usage:
+#   tools/convert-to-gguf.sh <hf-model-dir> [out.gguf]
+#
+# Example:
+#   tools/convert-to-gguf.sh /opt/wjh/models/qwen3-8b
+#   # → /opt/wjh/models/qwen3-8b/qwen3-8b-bf16.gguf
+
+set -euo pipefail
+
+if [ "$#" -lt 1 ]; then
+    echo "Usage: $0 <hf-model-dir> [out.gguf]" >&2
+    exit 1
+fi
+
+SRC="$(realpath "$1")"
+ROOT_DIR="$(cd "$(dirname "$0")/.." && pwd)"
+CONVERT_PY="$ROOT_DIR/third_party/llama.cpp/convert_hf_to_gguf.py"
+
+if [ ! -f "$CONVERT_PY" ]; then
+    echo "convert script not found: $CONVERT_PY" >&2
+    echo "Run tools/setup-llama-cpp.sh first." >&2
+    exit 1
+fi
+
+if [ ! -d "$SRC" ]; then
+    echo "source model dir not found: $SRC" >&2
+    exit 1
+fi
+
+if [ "$#" -ge 2 ]; then
+    OUT="$2"
+else
+    BASENAME="$(basename "$SRC")"
+    OUT="$SRC/${BASENAME}-bf16.gguf"
+fi
+
+if [ -f "$OUT" ]; then
+    echo "==> already exists: $OUT (skipping; remove to force re-convert)"
+    echo "$OUT"
+    exit 0
+fi
+
+echo "==> converting $SRC -> $OUT (BF16)"
+python3 "$CONVERT_PY" "$SRC" --outfile "$OUT" --outtype bf16
+
+echo "=== done ==="
+echo "$OUT"
--- a/tools/setup-llama-cpp.sh
+++ b/tools/setup-llama-cpp.sh
@@ -0,0 +1,94 @@
+#!/usr/bin/env bash
+# Build the llama.cpp baseline (third_party/llama.cpp) with CUDA.
+#
+# Source is vendored as a git submodule pinned to a fixed tag (see .gitmodules
+# and the recorded gitlink commit). This script does NOT fetch from the network
+# by default — it expects the source to already be present, either via:
+#   - `git submodule update --init` (on a host with network), or
+#   - tar-over-ssh transfer (how it reaches dash5, which has no network or rsync).
+#
+# It only fetches as a convenience fallback when the source is missing AND
+# network is reachable.
+#
+# Idempotent. Safe to re-run.
+#
+# Usage:
+#   tools/setup-llama-cpp.sh            # build (configure if needed)
+#   tools/setup-llama-cpp.sh --rebuild  # wipe build dir, reconfigure, rebuild
+#
+# Env:
+#   CUDA_ARCH   CUDA architectures for cmake (default 120-real = RTX 5090 SM120)
+#   CUDA_HOME   CUDA toolkit root (auto-detected: /usr/local/cuda-12.9, then /usr/local/cuda)
+
+set -euo pipefail
+
+ROOT_DIR="$(cd "$(dirname "$0")/.." && pwd)"
+VENDOR_DIR="$ROOT_DIR/third_party/llama.cpp"
+CUDA_ARCH="${CUDA_ARCH:-120-real}"
+REBUILD=0
+for arg in "$@"; do
+    case "$arg" in
+        --rebuild) REBUILD=1 ;;
+        --help|-h) sed -n '2,/^$/s/^# \{0,1\}//p' "$0"; exit 0 ;;
+    esac
+done
+
+if [ -d /usr/local/cuda-12.9 ]; then
+    export CUDA_HOME="${CUDA_HOME:-/usr/local/cuda-12.9}"
+elif [ -d /usr/local/cuda ]; then
+    export CUDA_HOME="${CUDA_HOME:-/usr/local/cuda}"
+fi
+[ -n "${CUDA_HOME:-}" ] && export PATH="$CUDA_HOME/bin:$PATH"
+
+echo "=== llama.cpp build ==="
+echo "  vendor dir : $VENDOR_DIR"
+echo "  CUDA arch  : $CUDA_ARCH"
+echo "  CUDA_HOME  : ${CUDA_HOME:-<not set>}"
+
+# --- Ensure source is present ---
+if [ ! -f "$VENDOR_DIR/CMakeLists.txt" ]; then
+    echo "==> source missing at $VENDOR_DIR"
+    if git -C "$ROOT_DIR" rev-parse --git-dir >/dev/null 2>&1 \
+        && timeout 8 git ls-remote https://github.com/ggerganov/llama.cpp HEAD >/dev/null 2>&1; then
+        echo "==> network OK, initializing submodule"
+        git -C "$ROOT_DIR" submodule update --init --recursive third_party/llama.cpp
+    else
+        echo "ERROR: llama.cpp source not present and network unavailable." >&2
+        echo "  On a networked host run: git submodule update --init third_party/llama.cpp" >&2
+        echo "  Then transfer the source here (the bench tooling does this via tar-over-ssh)." >&2
+        exit 1
+    fi
+fi
+
+BUILD_DIR="$VENDOR_DIR/build"
+if [ "$REBUILD" -eq 1 ] && [ -d "$BUILD_DIR" ]; then
+    echo "==> --rebuild: removing $BUILD_DIR"
+    rm -rf "$BUILD_DIR"
+fi
+
+SERVER_BIN="$BUILD_DIR/bin/llama-server"
+if [ -x "$SERVER_BIN" ] && [ "$REBUILD" -eq 0 ]; then
+    echo "==> already built: $SERVER_BIN (use --rebuild to force)"
+    echo "$SERVER_BIN"
+    exit 0
+fi
+
+echo "==> cmake configure"
+cmake -S "$VENDOR_DIR" -B "$BUILD_DIR" \
+    -DGGML_CUDA=ON \
+    -DLLAMA_CURL=OFF \
+    -DLLAMA_BUILD_TESTS=OFF \
+    -DLLAMA_BUILD_EXAMPLES=OFF \
+    -DCMAKE_BUILD_TYPE=Release \
+    -DCMAKE_CUDA_ARCHITECTURES="$CUDA_ARCH"
+
+echo "==> build llama-server llama-cli (jobs: $(nproc))"
+cmake --build "$BUILD_DIR" --target llama-server llama-cli -j "$(nproc)"
+
+if [ ! -x "$SERVER_BIN" ]; then
+    echo "ERROR: llama-server did not build at $SERVER_BIN" >&2
+    exit 1
+fi
+
+echo "=== done ==="
+echo "$SERVER_BIN"
--- a/tools/sync-and-build.sh
+++ b/tools/sync-and-build.sh
@@ -1,25 +1,102 @@
 #!/bin/bash
-# Sync local project to dash5 and build/test there.
-# Usage: ./tools/sync-and-build.sh [test|build|run]
+# Sync local project to dash5 and build/test/bench there.
+#
+# Usage:
+#   ./tools/sync-and-build.sh [test|build|run|check|clippy]
+#       Runs `cargo <action> --release` on dash5.
+#
+#   ./tools/sync-and-build.sh bench [-- <extra runner args>]
+#       Ensures llama.cpp is built (tools/setup-llama-cpp.sh) and a BF16 GGUF
+#       exists, then runs tools/bench/runner.py against both xserv-server and
+#       llama-server. Result lands in $REMOTE_DIR/bench-out/.
+#
+#   ./tools/sync-and-build.sh fetch-bench-out
+#       Copies dash5:$REMOTE_DIR/bench-out/ back to ./bench-out/.

 set -e

 REMOTE="dash5"
 REMOTE_DIR="/opt/wjh/projects/xserv"
+REMOTE_MODEL_DIR="${REMOTE_MODEL_DIR:-/opt/wjh/models/qwen3-8b}"
 LOCAL_DIR="$(cd "$(dirname "$0")/.." && pwd)"

 ACTION="${1:-build}"
+shift || true; if [ "${1:-}" = "--" ]; then shift; fi

-echo "=== Syncing to $REMOTE:$REMOTE_DIR ==="
-ssh "$REMOTE" "mkdir -p $REMOTE_DIR"
-rsync -az --delete \
-    --exclude target \
-    --exclude .git \
-    "$LOCAL_DIR/" "$REMOTE:$REMOTE_DIR/"
+cuda_env='if [ -d /usr/local/cuda-12.9 ]; then export CUDA_HOME=/usr/local/cuda-12.9; else export CUDA_HOME=/usr/local/cuda; fi && export PATH=$CUDA_HOME/bin:/usr/local/cuda/bin:$PATH'

-echo "=== Running: cargo $ACTION ==="
-ssh "$REMOTE" "source \$HOME/.cargo/env && \
-    export PATH=/usr/local/cuda/bin:\$PATH && \
-    export CUDA_HOME=/usr/local/cuda && \
-    cd $REMOTE_DIR && \
-    cargo $ACTION --release 2>&1"
+sync_project() {
+    echo "=== Syncing to $REMOTE:$REMOTE_DIR ==="
+    # Preserve `target/`, `third_party/` (large + arch-specific) and `bench-out/`
+    # on the remote side. Everything else is wiped + replaced.
+    ssh "$REMOTE" "mkdir -p $REMOTE_DIR && find $REMOTE_DIR -mindepth 1 -maxdepth 1 ! -name target ! -name third_party ! -name bench-out -exec rm -rf {} +"
+    tar --exclude target --exclude third_party --exclude bench-out --exclude .git \
+        -C "$LOCAL_DIR" -czf - . \
+        | ssh "$REMOTE" "tar -xzf - -C $REMOTE_DIR"
+}
+
+sync_llama_src() {
+    # dash5 has no network (and no rsync), so we transfer the llama.cpp submodule
+    # working tree (source only — never the build dir or .git) via tar-over-ssh.
+    local src="$LOCAL_DIR/third_party/llama.cpp"
+    if [ ! -f "$src/CMakeLists.txt" ]; then
+        echo "ERROR: llama.cpp source not found at $src" >&2
+        echo "  Run: git submodule update --init third_party/llama.cpp" >&2
+        exit 1
+    fi
+    echo "=== Syncing llama.cpp source to $REMOTE (tar) ==="
+    # Preserve the remote build/ dir; only refresh source files.
+    ssh "$REMOTE" "mkdir -p $REMOTE_DIR/third_party/llama.cpp"
+    tar --exclude build --exclude .git --exclude '*.gguf' \
+        -C "$src" -czf - . \
+        | ssh "$REMOTE" "tar -xzf - -C $REMOTE_DIR/third_party/llama.cpp"
+}
+
+case "$ACTION" in
+    test|build|run|check|clippy)
+        sync_project
+        echo "=== Running: cargo $ACTION ==="
+        ssh "$REMOTE" "source \$HOME/.cargo/env && $cuda_env && cd $REMOTE_DIR && cargo $ACTION --release 2>&1"
+        ;;
+
+    bench)
+        sync_project
+        sync_llama_src
+        echo "=== Ensuring llama.cpp baseline is built ==="
+        ssh "$REMOTE" "source \$HOME/.cargo/env && $cuda_env && cd $REMOTE_DIR && \
+            ./tools/setup-llama-cpp.sh 2>&1"
+
+        echo "=== Ensuring BF16 GGUF exists for $REMOTE_MODEL_DIR ==="
+        # Last stdout line is the GGUF path; pipefail makes a failed conversion abort here.
+        GGUF_PATH=$(ssh "$REMOTE" "set -o pipefail; $cuda_env && cd $REMOTE_DIR && \
+            ./tools/convert-to-gguf.sh $REMOTE_MODEL_DIR | tail -1")
+        echo "    gguf: $GGUF_PATH"
+
+        echo "=== Building xserv (release) ==="
+        ssh "$REMOTE" "source \$HOME/.cargo/env && $cuda_env && cd $REMOTE_DIR && \
+            cargo build --release 2>&1"
+
+        echo "=== Running benchmark suite ==="
+        ssh "$REMOTE" "$cuda_env && cd $REMOTE_DIR && \
+            python3 -m tools.bench.runner \
+                --xserv-bin ./target/release/xserv-server \
+                --xserv-model $REMOTE_MODEL_DIR \
+                --llama-bin third_party/llama.cpp/build/bin/llama-server \
+                --llama-gguf $GGUF_PATH \
+                $* 2>&1"
+        ;;
+
+    fetch-bench-out)
+        mkdir -p "$LOCAL_DIR/bench-out"
+        echo "=== Fetching bench-out from $REMOTE:$REMOTE_DIR/bench-out (tar) ==="
+        ssh "$REMOTE" "tar -C $REMOTE_DIR/bench-out -czf - ." \
+            | tar -xzf - -C "$LOCAL_DIR/bench-out"
+        echo "    -> $LOCAL_DIR/bench-out/"
+        ;;
+
+    *)
+        echo "Unknown action: $ACTION" >&2
+        echo "Usage: $0 {build|test|run|check|clippy|bench|fetch-bench-out} [-- extra args]" >&2
+        exit 2
+        ;;
+esac