gpt-oss has no single-GPU engine path, so --tp 1 fell through to the
Qwen3-only engine and every request 503'd. Route gpt_oss to run_tp
even at tp=1: NCCL world-1 init works and all_reduce already no-ops
(bench-gpt-oss --tp 1 exercised this path). Quantized gpt-oss (22 GB
FP8 / 13 GB MXFP4) now serves on one 32 GB 5090.
Also fix eval_gsm8k_fast.py --gpu to accept a device list ("2,3"):
it was type=int, so any --tp 2 run pinned CUDA_VISIBLE_DEVICES to one
GPU and rank 1's set_device panicked while rank 0 spun in NCCL init.
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
xserv_vs_llama.py runs each server one at a time on the same GPUs (drains VRAM
between), streams identical prompts through /v1/chat/completions, and reports
median TTFT/TPOT/throughput. Counts llama's reasoning_content as real decode
tokens so the gpt-oss CoT is measured fairly.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Weight-only 4-bit for the gpt-oss MoE experts: weights stored MXFP4 (E2M1 +
per-32-element UE8M0 block scale, tools/quantize_mxfp4.py), a fused kernel reads
the 4-bit weights and dequantizes on-chip to BF16. Decode (M=1) uses a fused
dequant-GEMV (batched_gemv_mxfp4) with shared-memory activation tiling; prefill
(M>1) dequantizes to BF16 then reuses the BF16 batched GEMM. MXFP4 is detected
by the scale tensor's rank (3-D [E,N,K/32]) vs FP8's 1-D [E].
Verified on dash5 (gpt-oss-20b, TP=2, 5090): byte-identical greedy tokens to
FP8/BF16, smallest footprint (13 GB vs 22 GB FP8, 39 GB BF16) — fits one 32 GB
5090 with room for KV cache.
NOT a decode speedup: the hand-written W4A16 GEMV (no tensor cores) is less
efficient than cuBLASLt's FP8 tensor-core GEMM, so even at half the weight bytes
decode is 17.0 ms vs FP8 13.5 ms (faster than BF16 18.8 ms); prefill regresses
(350 vs 134 ms, dequant fallback). Committed as a correct memory-optimization
foundation. Beating FP8 on speed needs FP4 tensor cores (W4A4, cuBLASLt
block-scaled MXFP4) or a Marlin-class kernel; see
docs/benchmarks/mxfp4-and-llama-decode.md.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
fp8_compare.py launches one xserv-server per model (same GPUs / TP for a
fair comparison), gates readiness on a real generation (not /health),
and streams GSM8K through /v1/chat/completions measuring per-request
TTFT (time to first token) and TPOT (mean inter-token latency) plus
exact-match accuracy. docs/benchmarks/fp8-quantization.md records the
quantization scheme, the perf-bug fix, and the dash5 results.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
- tp_engine.rs: TpModel enum dispatches between Qwen3 and GptOss based on
config.is_moe(). Server auto-detects model type on startup.
- tools/run_gpt_oss_bench.sh: one-click benchmark comparing xserv (TP=2)
vs llama.cpp (BF16 GGUF) on GSM8K quality + speed
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
runner/servers gain --tp (xserv --tp N; llama.cpp --split-mode row) and
--llama-devices so llama can run on a disjoint GPU group. run_tp_parallel.sh
runs xserv (GPU 0..N-1) and llama.cpp (GPU 4..4+N-1) concurrently per TP,
matching the box's 0-3 / 4-7 PHB groups. summarize_tp.py tabulates the sweep.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
llama.cpp divides total -c across --parallel slots, so -c 4096 --parallel 4
gave each request only 1024 tokens — truncating long AIME generations before
the boxed answer and making xserv look artificially better (20% vs 3.3%).
Set total -c = max_seq_len * n_parallel so per-slot context equals xserv's
per-sequence max_seq_len. Also drop --log-disable; its startup log reports the
per-slot n_ctx that catches exactly this misconfiguration.
After the fix, AIME is at parity (xserv 23.3% vs llama.cpp 20.0%), matching the
GSM8K parity and confirming the gap was a config artifact, not engine quality.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Refinements from end-to-end bring-up on the GPU host:
- Run each system start→suites→stop in sequence. Two BF16 8B models don't
co-reside on one 32GB GPU, and a resident idle engine would distort the
other's latency/throughput.
- Match generation mode: xserv hardcodes Qwen3 thinking off, so send
chat_template_kwargs={enable_thinking:false} to llama.cpp via a per-endpoint
extra_body. --enable-thinking opts back into thinking mode.
- Add tools/__init__.py so `python3 -m tools.bench.runner` resolves our package
instead of a site-packages `tools` (nvfuser ships one that shadowed it).
- Document offline-GPU-host workflow, thinking-match, and the xserv 8192 OOM
finding that the bench surfaced.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Vendor llama.cpp as a submodule pinned to b9371 and add a one-click
benchmark driver that compares xserv against it on identical workloads:
- setup-llama-cpp.sh: network-optional CUDA build (SM120); convert-to-gguf.sh
converts the same safetensors to BF16 GGUF for an apples-to-apples baseline.
- tools/bench/: black-box OpenAI-API driver measuring TTFT/TPOT/throughput
(single-stream + concurrent) and response quality on AIME 2025 + GSM8K.
- fetch_datasets.py pulls datasets to local JSON (GPU host has no network);
task loaders prefer the local JSON.
- sync-and-build.sh: `bench` subcommand transfers source + datasets to the
GPU host via tar-over-ssh (no rsync there), builds, and runs the suite.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
- Cargo workspace with xserv-cuda crate
- CUDA FFI bindings (cudart: memory, stream, device, error)
- GpuBuffer RAII wrapper with H2D/D2H/D2D copy
- CudaStream wrapper with RAII Drop
- CachingAllocator with size-bucketed free lists
- PinnedBuffer for page-locked host memory
- Device info query via cudaDeviceGetAttribute
- Vector-add CUDA kernel smoke test
- Integration test suite (11 tests)
- build.rs: cc crate compiles .cu for SM 12.0
- sync-and-build.sh for remote build on dash5
- Roadmap doc (docs/00-roadmap.md) and Phase 0+1 design doc
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>