xserv

Files

Gahow Wang 7cb9ee3870 bench: run one server at a time, match thinking mode, fix tools package

Refinements from end-to-end bring-up on the GPU host:

- Run each system start→suites→stop in sequence. Two BF16 8B models don't
  co-reside on one 32GB GPU, and a resident idle engine would distort the
  other's latency/throughput.
- Match generation mode: xserv hardcodes Qwen3 thinking off, so send
  chat_template_kwargs={enable_thinking:false} to llama.cpp via a per-endpoint
  extra_body. --enable-thinking opts back into thinking mode.
- Add tools/__init__.py so `python3 -m tools.bench.runner` resolves our package
  instead of a site-packages `tools` (nvfuser ships one that shadowed it).
- Document offline-GPU-host workflow, thinking-match, and the xserv 8192 OOM
  finding that the bench surfaced.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

2026-05-28 11:40:07 +08:00

bench

bench: run one server at a time, match thinking mode, fix tools package

2026-05-28 11:40:07 +08:00

__init__.py

bench: run one server at a time, match thinking mode, fix tools package

2026-05-28 11:40:07 +08:00

analyze_divergence.py

phase 9: KV cache + autoregressive generation

2026-05-21 23:39:41 +08:00

bench_compare_qwen3.py

phase 10: add Qwen3-8B benchmark + performance fix