- Add ChatModel enum dispatching between Qwen3 and GptOss based on
config.is_moe(), following the TP engine pattern.
- Add --tp N flag for tensor-parallel inference (required for 39GB
gpt-oss-20b which doesn't fit on a single 32GB GPU).
- Add gpt-oss harmony chat template with channel/message format.
- Replace hardcoded is_stop_token() with tokenizer.is_eos() for
multi-model EOS support.
- Restore gpt-oss hardcoded prompt template in server api.rs, lost
during the Jinja template refactor.
- Fix GEMV race condition: the K-split kernel zeroed the FP32
accumulator inside the kernel (block k=0) while other blocks
atomicAdd'd concurrently. Pre-zero with cudaMemsetAsync instead.
- Update benchmark docs with post-fix results.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>