- Let the model generate its own <|channel|> routing instead of forcing
<|channel|>final<|message|>; this matches the GGUF chat template behavior.
- State machine tracks harmony channels: the analysis channel is rendered
gray, the final channel is printed normally, and <|end|> stops
generation only on the final channel.
- Add repetition penalty (default 1.3 for MoE, 1.0 for Qwen) over a
512-token window to prevent greedy-decoding loops. Configurable via the
XSERV_REP_PENALTY and XSERV_REP_WINDOW env vars.
- Fix Length path: use <|end|> instead of <|im_end|> for gpt-oss to
avoid poisoning the KV cache with garbage tokens on truncation.
- Server api.rs: append <|channel|>final<|message|> to the hardcoded
gpt-oss prompt (server expects to post-process the JSON output).
- Add startswith filter to minijinja for harmony template compatibility.
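
For reference, the repetition penalty described above can be sketched
roughly as follows. This is a minimal illustration, not the code in this
commit: the helper names env_or and apply_repetition_penalty are
hypothetical, and the divide-if-positive / multiply-if-negative rule is
the common convention, assumed here rather than confirmed by the source.

```rust
use std::collections::HashSet;
use std::env;
use std::str::FromStr;

/// Hypothetical helper: read `name` from the environment and parse it,
/// falling back to `default` when the variable is unset or malformed.
fn env_or<T: FromStr>(name: &str, default: T) -> T {
    env::var(name)
        .ok()
        .and_then(|v| v.parse().ok())
        .unwrap_or(default)
}

/// Hypothetical sketch: penalize the logit of every token that occurs in
/// the last `window` entries of `history`. Each distinct token is
/// penalized once, however often it repeats inside the window.
fn apply_repetition_penalty(
    logits: &mut [f32],
    history: &[u32],
    penalty: f32,
    window: usize,
) {
    // A penalty of 1.0 (the stated Qwen default) is a no-op.
    if penalty == 1.0 || window == 0 {
        return;
    }
    let start = history.len().saturating_sub(window);
    let recent: HashSet<u32> = history[start..].iter().copied().collect();
    for tok in recent {
        // Ignore token ids that fall outside the vocabulary.
        if let Some(logit) = logits.get_mut(tok as usize) {
            // Assumed convention: shrink positive logits, push negative
            // logits further down, so the token always becomes less likely.
            *logit = if *logit > 0.0 {
                *logit / penalty
            } else {
                *logit * penalty
            };
        }
    }
}
```

A caller would presumably resolve the settings once per session, e.g.
env_or("XSERV_REP_PENALTY", 1.3_f32) and env_or("XSERV_REP_WINDOW",
512_usize), and apply the penalty to the logits before the greedy argmax.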

Known issue: gpt-oss multi-turn conversations produce NaN once total
context exceeds ~256 tokens. This is likely a flash_attention_sinks
kernel bug in sliding-window layers at large kv_len with small q_len.
Single-turn and short multi-turn conversations work correctly.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>