Author: Gahow Wang
Commit: ea5d8ba7ea

xserv-chat: render gpt-oss multi-turn as canonical harmony (drop CoT)

Re-render the whole conversation each turn and re-prefill it into a
freshly cleared slot. Past assistant messages are rendered as completed
`final` channels: the analysis channel is dropped, and each message is
terminated with <|end|> rather than the <|return|> stop token. This
matches the model's training format and the server's builder. The
previous incremental cache kept every turn's chain-of-thought plus
<|return|> in context, which is out of distribution for harmony
multi-turn. The generator now returns the final-channel text so it can be
fed back as history. Qwen3 keeps the incremental cache (its ChatML format
is unaffected). reset_slot factors out the free + re-register sequence.
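
A minimal sketch of the per-turn rendering described above follows. It assumes the OpenAI harmony token layout (<|start|>{role}<|channel|>{channel}<|message|>{content}<|end|>). `Role`, `Message` and `render_harmony` are hypothetical names used only for illustration. They are not xserv-chat's actual types. Real harmony system messages also carry header content, such as reasoning effort and the valid channels, and this sketch omits it.

```rust
/// Hypothetical message type for this sketch; xserv-chat's real types may differ.
enum Role {
    System,
    User,
    Assistant,
}

struct Message {
    role: Role,
    /// For assistant turns: only the final-channel text the generator returned.
    content: String,
}

/// Render the whole conversation as one harmony prompt, ending with the
/// header that asks the model for its next assistant turn.
fn render_harmony(history: &[Message]) -> String {
    let mut out = String::new();
    for m in history {
        match m.role {
            Role::System => out.push_str("<|start|>system<|message|>"),
            Role::User => out.push_str("<|start|>user<|message|>"),
            // Past assistant turns: `final` channel only (analysis / CoT dropped).
            Role::Assistant => out.push_str("<|start|>assistant<|channel|>final<|message|>"),
        }
        out.push_str(&m.content);
        // Completed messages close with <|end|>, never the <|return|> stop token.
        out.push_str("<|end|>");
    }
    // Generation prompt: the model itself emits the channel and message.
    out.push_str("<|start|>assistant");
    out
}
```

The prompt is rebuilt from history on every turn, so it has to be prefilled into a cleared slot. Appending it to the previous turn's cached state would not be correct.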

NOTE: this corrects the multi-turn *format* but does NOT cure the
long-context collapse seen on some inputs. That collapse is a forward-pass
numerical bug (NaN or degenerate logits), reproducible in a clean
bench-gpt-oss run independently of the chat layer. The collapse token is
vocab_size-1 (201087), which is what the argmax tie-break picks when every
logit is NaN. Tracked separately.
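
The sketch below shows one way an all-NaN row can decode to the last vocabulary index. It is an illustration only, not the sampler xserv actually uses. The sketch assumes an argmax built on `Iterator::max_by` that treats unordered (NaN) comparisons as ties. `max_by` returns the last of several equal maxima, so on an all-NaN row every comparison is a tie and the final index wins. `naive_argmax` and `checked_argmax` are hypothetical names. The guarded variant reports NaN instead of silently emitting a token.

```rust
use std::cmp::Ordering;

/// Argmax that treats NaN comparisons as ties. `max_by` keeps the *last*
/// of equal maxima, so an all-NaN slice yields the final index.
fn naive_argmax(logits: &[f32]) -> Option<usize> {
    logits
        .iter()
        .enumerate()
        .max_by(|a, b| a.1.partial_cmp(b.1).unwrap_or(Ordering::Equal))
        .map(|(i, _)| i)
}

/// Guarded variant: refuse to pick a token if any logit is NaN.
fn checked_argmax(logits: &[f32]) -> Result<usize, String> {
    if let Some(i) = logits.iter().position(|x| x.is_nan()) {
        return Err(format!("NaN logit at index {i}"));
    }
    naive_argmax(logits).ok_or_else(|| "empty logits".to_string())
}
```

With a vocabulary of 201088 entries, `naive_argmax` on an all-NaN row returns 201087. That matches the collapse token above.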

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-02 15:39:24 +08:00