Files
xserv/docs/benchmarks
Gahow Wang ae08896f46 xserv-chat: support gpt-oss-20b with TP; fix GEMV precision bug
- Add ChatModel enum dispatching between Qwen3 and GptOss based on
  config.is_moe(), following the TP engine pattern.
- Add --tp N flag for tensor-parallel inference (required for 39GB
  gpt-oss-20b which doesn't fit on a single 32GB GPU).
- Add gpt-oss harmony chat template with channel/message format.
- Replace hardcoded is_stop_token() with tokenizer.is_eos() for
  multi-model EOS support.
- Restore gpt-oss hardcoded prompt template in server api.rs, lost
  during the Jinja template refactor.
- Fix GEMV race condition: the K-split kernel zeroed the FP32
  accumulator inside the kernel (block k=0) while other blocks
  atomicAdd'd concurrently. Pre-zero with cudaMemsetAsync instead.
- Update benchmark docs with post-fix results.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-06-02 00:58:10 +08:00
..