xserv

gahow/xserv

Fork 0

Commit Graph

Author	SHA1	Message	Date
Gahow Wang	80157e614a	docs: update llama.cpp comparison with 8192 results (OOM fixed) Re-ran the full comparison at --max-seq-len 8192 now that xserv handles it: - OOM finding resolved — pool sized to available VRAM + vLLM-style host swap; 8192 runs with 0 swap events (swap is the overload safety net). - Quality at parity with equal context: AIME 20.0% vs 20.0%, GSM8K 98% vs 96%. - Speed unchanged relative to llama.cpp (~0.42-0.60x); TPOT is bandwidth-bound. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-05-28 21:32:14 +08:00
Gahow Wang	7cb9ee3870	bench: run one server at a time, match thinking mode, fix tools package Refinements from end-to-end bring-up on the GPU host: - Run each system start→suites→stop in sequence. Two BF16 8B models don't co-reside on one 32GB GPU, and a resident idle engine would distort the other's latency/throughput. - Match generation mode: xserv hardcodes Qwen3 thinking off, so send chat_template_kwargs={enable_thinking:false} to llama.cpp via a per-endpoint extra_body. --enable-thinking opts back into thinking mode. - Add tools/__init__.py so `python3 -m tools.bench.runner` resolves our package instead of a site-packages `tools` (nvfuser ships one that shadowed it). - Document offline-GPU-host workflow, thinking-match, and the xserv 8192 OOM finding that the bench surfaced. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-05-28 11:40:07 +08:00
Gahow Wang	49c7653222	tools: add llama.cpp comparison baseline + standard benchmark suite Vendor llama.cpp as a submodule pinned to b9371 and add a one-click benchmark driver that compares xserv against it on identical workloads: - setup-llama-cpp.sh: network-optional CUDA build (SM120); convert-to-gguf.sh converts the same safetensors to BF16 GGUF for an apples-to-apples baseline. - tools/bench/: black-box OpenAI-API driver measuring TTFT/TPOT/throughput (single-stream + concurrent) and response quality on AIME 2025 + GSM8K. - fetch_datasets.py pulls datasets to local JSON (GPU host has no network); task loaders prefer the local JSON. - sync-and-build.sh: `bench` subcommand transfers source + datasets to the GPU host via tar-over-ssh (no rsync there), builds, and runs the suite. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-05-28 11:18:52 +08:00

Author

SHA1

Message

Date

Gahow Wang

80157e614a

docs: update llama.cpp comparison with 8192 results (OOM fixed)

Re-ran the full comparison at --max-seq-len 8192 now that xserv handles it:
- OOM finding resolved — pool sized to available VRAM + vLLM-style host swap;
  8192 runs with 0 swap events (swap is the overload safety net).
- Quality at parity with equal context: AIME 20.0% vs 20.0%, GSM8K 98% vs 96%.
- Speed unchanged relative to llama.cpp (~0.42-0.60x); TPOT is bandwidth-bound.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

2026-05-28 21:32:14 +08:00

Gahow Wang

7cb9ee3870

bench: run one server at a time, match thinking mode, fix tools package

Refinements from end-to-end bring-up on the GPU host:

- Run each system start→suites→stop in sequence. Two BF16 8B models don't
  co-reside on one 32GB GPU, and a resident idle engine would distort the
  other's latency/throughput.
- Match generation mode: xserv hardcodes Qwen3 thinking off, so send
  chat_template_kwargs={enable_thinking:false} to llama.cpp via a per-endpoint
  extra_body. --enable-thinking opts back into thinking mode.
- Add tools/__init__.py so `python3 -m tools.bench.runner` resolves our package
  instead of a site-packages `tools` (nvfuser ships one that shadowed it).
- Document offline-GPU-host workflow, thinking-match, and the xserv 8192 OOM
  finding that the bench surfaced.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

2026-05-28 11:40:07 +08:00

Gahow Wang

49c7653222

tools: add llama.cpp comparison baseline + standard benchmark suite

Vendor llama.cpp as a submodule pinned to b9371 and add a one-click
benchmark driver that compares xserv against it on identical workloads:

- setup-llama-cpp.sh: network-optional CUDA build (SM120); convert-to-gguf.sh
  converts the same safetensors to BF16 GGUF for an apples-to-apples baseline.
- tools/bench/: black-box OpenAI-API driver measuring TTFT/TPOT/throughput
  (single-stream + concurrent) and response quality on AIME 2025 + GSM8K.
- fetch_datasets.py pulls datasets to local JSON (GPU host has no network);
  task loaders prefer the local JSON.
- sync-and-build.sh: `bench` subcommand transfers source + datasets to the
  GPU host via tar-over-ssh (no rsync there), builds, and runs the suite.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

2026-05-28 11:18:52 +08:00

3 Commits