xserv

Files

Gahow Wang ae08896f46 xserv-chat: support gpt-oss-20b with TP; fix GEMV precision bug

- Add ChatModel enum dispatching between Qwen3 and GptOss based on
  config.is_moe(), following the TP engine pattern.
- Add --tp N flag for tensor-parallel inference (required for the 39GB
  gpt-oss-20b, which doesn't fit on a single 32GB GPU).
- Add gpt-oss harmony chat template with channel/message format.
- Replace hardcoded is_stop_token() with tokenizer.is_eos() for
  multi-model EOS support.
- Restore gpt-oss hardcoded prompt template in server api.rs, lost
  during the Jinja template refactor.
- Fix GEMV race condition: block k=0 of the K-split kernel zeroed the
  FP32 accumulator inside the kernel while other blocks could already
  be atomicAdd-ing into it, so their partial sums could be wiped.
  Pre-zero with cudaMemsetAsync instead.
- Update benchmark docs with post-fix results.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
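
The enum dispatch described in the first bullet might look roughly like the sketch below. Only the names `ChatModel`, `Qwen3`, `GptOss`, and `is_moe()` come from the commit message. The `Config` struct, its `num_experts` field, the `from_config` constructor, and the `name()` method are hypothetical stand-ins, not xserv's actual types.

```rust
// Hypothetical config: the `num_experts` field is an assumption for this sketch.
struct Config {
    num_experts: usize, // 0 is taken to mean a dense (non-MoE) model
}

impl Config {
    // Assumed definition: a model counts as MoE when it has any experts.
    fn is_moe(&self) -> bool {
        self.num_experts > 0
    }
}

// Placeholder model types; the real ones would hold weights and state.
struct Qwen3;
struct GptOss;

impl Qwen3 {
    fn name(&self) -> &'static str {
        "qwen3"
    }
}

impl GptOss {
    fn name(&self) -> &'static str {
        "gpt-oss"
    }
}

// One enum wraps both model families so callers hold a single type.
enum ChatModel {
    Qwen3(Qwen3),
    GptOss(GptOss),
}

impl ChatModel {
    // Pick the variant from the config, as the commit message describes.
    fn from_config(config: &Config) -> Self {
        if config.is_moe() {
            ChatModel::GptOss(GptOss)
        } else {
            ChatModel::Qwen3(Qwen3)
        }
    }

    // Every shared operation forwards to the wrapped model via `match`.
    fn name(&self) -> &'static str {
        match self {
            ChatModel::Qwen3(m) => m.name(),
            ChatModel::GptOss(m) => m.name(),
        }
    }
}

fn main() {
    let dense = Config { num_experts: 0 };
    let moe = Config { num_experts: 32 };
    println!("{}", ChatModel::from_config(&dense).name());
    println!("{}", ChatModel::from_config(&moe).name());
}
```

An enum keeps dispatch static and exhaustive: adding a third model family makes the compiler flag every `match` that has not been updated, which a trait object would not do.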

2026-06-02 00:58:10 +08:00
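
The GEMV fix in the commit above can be illustrated with a CPU analogy. This is a minimal sketch in Rust, not the CUDA kernel itself: threads stand in for thread blocks, a compare-and-swap loop stands in for `atomicAdd` on a float, and the store before the threads start plays the role of `cudaMemsetAsync`. The function names `atomic_add_f32` and `split_k_dot` are hypothetical.

```rust
use std::sync::atomic::{AtomicU32, Ordering};
use std::thread;

// Atomic f32 add via a compare-and-swap loop on the bit pattern,
// a CPU stand-in for CUDA's atomicAdd on float.
fn atomic_add_f32(cell: &AtomicU32, value: f32) {
    let mut current = cell.load(Ordering::Relaxed);
    loop {
        let next = (f32::from_bits(current) + value).to_bits();
        match cell.compare_exchange_weak(current, next, Ordering::Relaxed, Ordering::Relaxed) {
            Ok(_) => return,
            Err(observed) => current = observed,
        }
    }
}

// Split-K dot product: each "block" (thread) reduces one slice of K and
// atomically adds its partial sum into the shared accumulator.
fn split_k_dot(a: &[f32], x: &[f32], blocks: usize, acc: &AtomicU32) {
    // Pre-zero BEFORE any block runs. Zeroing from inside one of the
    // blocks would race with the other blocks' adds, because nothing
    // orders the blocks relative to each other.
    acc.store(0f32.to_bits(), Ordering::Relaxed);

    let chunk = ((a.len() + blocks - 1) / blocks).max(1);
    thread::scope(|s| {
        for (a_part, x_part) in a.chunks(chunk).zip(x.chunks(chunk)) {
            s.spawn(move || {
                let partial: f32 = a_part.iter().zip(x_part).map(|(p, q)| p * q).sum();
                atomic_add_f32(acc, partial);
            });
        }
    });
}

fn main() {
    // Start with a stale value to show that the pre-zero step matters.
    let acc = AtomicU32::new(123.0f32.to_bits());
    let a: Vec<f32> = (1..=8).map(|v| v as f32).collect();
    let x = vec![2.0f32; 8];
    split_k_dot(&a, &x, 4, &acc);
    println!("{}", f32::from_bits(acc.load(Ordering::Relaxed)));
}
```

In the buggy pattern the commit describes, the zeroing happened inside the block with k=0. If another block's add landed before that zero, its contribution was lost, and the result varied from run to run.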

llama-cpp-comparison.md

xserv-chat: support gpt-oss-20b with TP; fix GEMV precision bug

2026-06-02 00:58:10 +08:00

phase8-gpt2-baseline.md

phase 8: add benchmark framework + baseline results

2026-05-21 23:29:41 +08:00

phase9-kv-cache.md

phase 9: KV cache + autoregressive generation

2026-05-21 23:39:41 +08:00

phase10-qwen3.md

phase 10: add Qwen3-8B benchmark + performance fix