xserv

Files

Gahow Wang ea5d8ba7ea xserv-chat: render gpt-oss multi-turn as canonical harmony (drop CoT)

Re-render the whole conversation each turn and re-prefill into a freshly
cleared slot, with past assistant messages rendered as completed `final`
channels (analysis dropped, terminated with <|end|> not the <|return|>
stop token) — matching the model's training format and the server's
builder. The previous incremental cache kept every turn's chain-of-thought
plus <|return|> in context, which is out of distribution for harmony
multi-turn. The generator now returns the final-channel text to feed back
as history. Qwen3 keeps the incremental cache (its ChatML format is
unaffected); reset_slot factors out the free+re-register.
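
The re-render described above can be sketched roughly as follows. This is a minimal illustration, not the code in xserv-chat: `Turn` and `render_history` are hypothetical names, the token layout is assumed to follow the published harmony format, and system/developer messages are left out for brevity.

```python
from dataclasses import dataclass


@dataclass
class Turn:
    role: str  # "user" or "assistant"
    text: str  # for assistant turns: final-channel text only, analysis already dropped


def render_history(turns):
    """Render the whole conversation from scratch for a fresh prefill (hypothetical helper)."""
    parts = []
    for turn in turns:
        if turn.role == "assistant":
            # Past assistant turns become completed `final` channels,
            # closed with <|end|> instead of the <|return|> stop token.
            parts.append(
                f"<|start|>assistant<|channel|>final<|message|>{turn.text}<|end|>"
            )
        else:
            parts.append(f"<|start|>{turn.role}<|message|>{turn.text}<|end|>")
    # Open a new assistant turn so the model generates the next reply.
    parts.append("<|start|>assistant")
    return "".join(parts)


history = [
    Turn("user", "What is 2 + 2?"),
    Turn("assistant", "4"),
    Turn("user", "And doubled?"),
]
prompt = render_history(history)
```

Because the full prompt is rebuilt every turn, the slot must be cleared before prefill; an incremental cache would still hold the earlier chain-of-thought tokens.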

NOTE: this corrects the multi-turn *format* but does NOT cure the
long-context collapse on some inputs. That is a forward-pass numerical bug
(NaN / degenerate logits) reproducible in clean bench-gpt-oss independent
of the chat layer — the collapse token is vocab_size-1 (201087), the
all-NaN argmax tie-break. Tracked separately.
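
To illustrate why an all-NaN logit vector can surface as token vocab_size-1, here is one hypothetical argmax form that shows the same symptom. The actual sampler comparison is not shown in this commit, so `argmax_last_on_ties` is an assumption for demonstration only; the vocabulary size 201088 is inferred from the collapse token 201087 quoted above.

```python
import math


def argmax_last_on_ties(logits):
    """Argmax that keeps the later index unless the candidate is strictly smaller.

    Every comparison against NaN is False, so `not (x < best)` is always True
    on all-NaN input and the last index wins. (Assumed form, not the xserv kernel.)
    """
    best_idx = 0
    best = logits[0]
    for i in range(1, len(logits)):
        if not (logits[i] < best):
            best_idx, best = i, logits[i]
    return best_idx


def all_nan(logits):
    """Cheap guard a bench harness could run before sampling."""
    return all(math.isnan(x) for x in logits)


VOCAB_SIZE = 201088  # assumption: collapse token 201087 == VOCAB_SIZE - 1
nan_logits = [float("nan")] * VOCAB_SIZE
collapse_token = argmax_last_on_ties(nan_logits)
```

A check like `all_nan` before sampling would separate a numerical failure in the forward pass from a genuine prediction of the last vocabulary entry.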

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

2026-06-02 15:39:24 +08:00

xserv-cuda

cuda: add cached_trim() to release pooled GPU buffers

2026-05-30 12:50:04 +08:00

xserv-distributed

distributed: NCCL P2P primitives (PpContext + send/recv)

2026-05-29 18:45:42 +08:00

xserv-kernels

model: fused GPU MoE kernel — eliminate CPU roundtrip

2026-05-31 13:22:59 +08:00

xserv-model

xserv-chat: render gpt-oss multi-turn as canonical harmony (drop CoT)