xserv

Files

Gahow Wang 950ccf3822 bench: fix llama.cpp per-slot context (was 1/parallel of intended)

llama.cpp divides the total -c across --parallel slots, so -c 4096 --parallel 4
gave each request only 1024 tokens, truncating long AIME generations before
the boxed answer and making xserv look artificially better (xserv 20% vs llama.cpp 3.3%).
Set total -c = max_seq_len * n_parallel so per-slot context equals xserv's
per-sequence max_seq_len. Also drop --log-disable; its startup log reports the
per-slot n_ctx that catches exactly this misconfiguration.
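
The arithmetic behind the fix can be sketched as follows. This is a minimal illustration, not the benchmark's actual code: the helper names (`per_slot_ctx`, `total_ctx_for`, `llama_ctx_flags`) are hypothetical, and the even split of -c across slots is taken from the description above.

```python
def per_slot_ctx(total_ctx: int, n_parallel: int) -> int:
    """Context each slot gets when the total -c is split evenly across slots."""
    return total_ctx // n_parallel


def total_ctx_for(max_seq_len: int, n_parallel: int) -> int:
    """Total -c needed so that every slot gets max_seq_len tokens."""
    return max_seq_len * n_parallel


def llama_ctx_flags(max_seq_len: int, n_parallel: int) -> list[str]:
    """Context-related flags only (hypothetical helper, not the bench script)."""
    total = total_ctx_for(max_seq_len, n_parallel)
    return ["-c", str(total), "--parallel", str(n_parallel)]


if __name__ == "__main__":
    # Before the fix: -c 4096 --parallel 4 leaves 1024 tokens per slot.
    print(per_slot_ctx(4096, 4))
    # After the fix: total -c is scaled so each slot gets the full 4096.
    flags = llama_ctx_flags(4096, 4)
    print(flags)
    print(per_slot_ctx(int(flags[1]), 4))
```

With max_seq_len = 4096 and 4 slots, the total becomes 16384, and dividing it back across the slots gives 4096 per request. The per-slot n_ctx in the startup log is a useful cross-check on this calculation.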

After the fix, AIME is at parity (xserv 23.3% vs llama.cpp 20.0%), matching the
GSM8K parity and confirming the gap was a config artifact, not engine quality.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

2026-05-28 15:06:12 +08:00

bench

bench: fix llama.cpp per-slot context (was 1/parallel of intended)

2026-05-28 15:06:12 +08:00

__init__.py

bench: run one server at a time, match thinking mode, fix tools package

2026-05-28 11:40:07 +08:00

analyze_divergence.py

phase 9: KV cache + autoregressive generation

2026-05-21 23:39:41 +08:00

bench_compare_qwen3.py

phase 10: add Qwen3-8B benchmark + performance fix