xserv/tools at 42e13f33dd06c6b2a3003fc4eb91151f9cff5bd5

gahow/xserv

Gahow Wang 63f5599717 server: serve gpt-oss on a single GPU via the TP engine (world=1)

gpt-oss has no single-GPU engine path, so --tp 1 fell through to the
Qwen3-only engine and every request 503'd. Route gpt_oss to run_tp
even at tp=1: NCCL world-1 init works and all_reduce already no-ops
(bench-gpt-oss --tp 1 exercised this path). Quantized gpt-oss (22 GB
FP8 / 13 GB MXFP4) now serves on one 32 GB 5090.

Also fix eval_gsm8k_fast.py --gpu to accept a device list ("2,3"):
it was type=int, so any --tp 2 run pinned CUDA_VISIBLE_DEVICES to one
GPU and rank 1's set_device panicked while rank 0 spun in NCCL init.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

2026-06-12 16:29:10 +08:00
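The `--gpu` fix described in the commit message above can be illustrated with a minimal sketch. This is not the actual code in eval_gsm8k_fast.py, which is not shown on this page; the helper names `parse_gpu_list`, `build_parser`, and `child_env` are hypothetical, as are the defaults. The sketch assumes the script forwards the value to `CUDA_VISIBLE_DEVICES`, as the message implies, and shows why a string-typed device list is needed once `--tp` exceeds 1.

```python
import argparse
import os


def parse_gpu_list(value: str) -> str:
    """Validate a comma-separated list of GPU indices such as "2,3".

    Hypothetical helper: with type=int, "2,3" is rejected and only a
    single device can ever be named, so a --tp 2 run sees one GPU.
    """
    parts = [p.strip() for p in value.split(",")]
    if any(not p.isdigit() for p in parts):
        raise argparse.ArgumentTypeError(
            f"expected comma-separated GPU indices, got {value!r}"
        )
    return ",".join(parts)


def build_parser() -> argparse.ArgumentParser:
    # Flag names follow the commit message; defaults are assumptions.
    parser = argparse.ArgumentParser()
    parser.add_argument("--gpu", type=parse_gpu_list, default="0")
    parser.add_argument("--tp", type=int, default=1)
    return parser


def child_env(args: argparse.Namespace, base_env=None) -> dict:
    """Build the environment for the server process (assumed behavior)."""
    env = dict(os.environ if base_env is None else base_env)
    # Every rank must be able to see its device, so pass the whole list.
    env["CUDA_VISIBLE_DEVICES"] = args.gpu
    return env
```

With this shape, `--gpu 2,3 --tp 2` exposes both devices to the child process, while a single index such as `--gpu 1` still works for `--tp 1`.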

bench: PP harness (xserv --pp vs llama.cpp -sm layer)

2026-05-29 18:45:59 +08:00

__init__.py

bench: run one server at a time, match thinking mode, fix tools package

2026-05-28 11:40:07 +08:00

analyze_divergence.py

phase 9: KV cache + autoregressive generation

2026-05-21 23:39:41 +08:00

bench_compare_qwen3.py

phase 10: add Qwen3-8B benchmark + performance fix

2026-05-22 10:25:33 +08:00

bench_compare.py

phase 8: add benchmark framework + baseline results

2026-05-21 23:29:41 +08:00

bench_fp8.py

tools: add FP8 vs BF16 benchmark and GSM8K eval harness

2026-06-08 15:43:04 +08:00

bench_gpt_oss.sh

server: support GptOss in TP engine + benchmark script

2026-05-30 15:39:44 +08:00

bench_server.py

tools: add correctness + performance test scripts for Qwen3-8B

2026-05-23 14:13:49 +08:00

bench_vs_hf.py

fix: comprehensive review + 14 bug fixes + Phase 12/14 overhaul

2026-05-22 17:53:28 +08:00

compare_logits.py

fix: comprehensive review + 14 bug fixes + Phase 12/14 overhaul

2026-05-22 17:53:28 +08:00

convert-to-gguf.sh

tools: add llama.cpp comparison baseline + standard benchmark suite

2026-05-28 11:18:52 +08:00

e2e_validate.py

fix: comprehensive review + 14 bug fixes + Phase 12/14 overhaul

2026-05-22 17:53:28 +08:00

eval_gsm8k_batch.sh

quantization: add FP8 E4M3 W8A16 for gpt-oss MoE expert weights

2026-06-07 19:33:07 +08:00

eval_gsm8k_fast.py

server: serve gpt-oss on a single GPU via the TP engine (world=1)

2026-06-12 16:29:10 +08:00

eval_gsm8k.py

tools: add FP8 vs BF16 benchmark and GSM8K eval harness

2026-06-08 15:43:04 +08:00

fp8_compare.py

tools: warm-server FP8 vs BF16 benchmark + results doc

2026-06-12 00:58:46 +08:00

pp_diag.sh

bench: PP harness (xserv --pp vs llama.cpp -sm layer)

2026-05-29 18:45:59 +08:00

pp_final.sh

bench: PP harness (xserv --pp vs llama.cpp -sm layer)

2026-05-29 18:45:59 +08:00

pp_llama_47.sh

bench: PP harness (xserv --pp vs llama.cpp -sm layer)

2026-05-29 18:45:59 +08:00

pp_quality_full.sh

bench: PP harness (xserv --pp vs llama.cpp -sm layer)

2026-05-29 18:45:59 +08:00

pp_verify.sh

bench: PP harness (xserv --pp vs llama.cpp -sm layer)

2026-05-29 18:45:59 +08:00

quantize_fp8.py

quantization: add FP8 E4M3 W8A16 for gpt-oss MoE expert weights

2026-06-07 19:33:07 +08:00

quantize_mxfp4.py

quantization: MXFP4 W4A16 expert weights (memory-optimization foundation)

2026-06-12 15:01:42 +08:00

run_gpt_oss_bench.sh

server: support GptOss in TP engine + benchmark script

2026-05-30 15:39:44 +08:00

setup-llama-cpp.sh

tools: add llama.cpp comparison baseline + standard benchmark suite

2026-05-28 11:18:52 +08:00

sync-and-build.sh

tools: add llama.cpp comparison baseline + standard benchmark suite

2026-05-28 11:18:52 +08:00

test_concurrent.py

fix: comprehensive review + 14 bug fixes + Phase 12/14 overhaul

2026-05-22 17:53:28 +08:00

test_correctness.py

tools: add correctness + performance test scripts for Qwen3-8B

2026-05-23 14:13:49 +08:00

test_fp8_gemm.cu

tools: add FP8 vs BF16 benchmark and GSM8K eval harness

2026-06-08 15:43:04 +08:00

xserv_vs_llama.py

tools: single-stream decode benchmark vs llama.cpp

2026-06-12 15:01:42 +08:00