xserv

Files

Gahow Wang 3d6bb1918e xserv-chat: fix gpt-oss harmony chat (canonical system prompt + routing)

The hand-rolled gpt-oss system message dropped the canonical harmony
structure (identity / knowledge cutoff / current date / Reasoning level),
putting the model out of distribution — greedy decoding then flipped into
garbage or analysis loops on ~half of single-turn requests. Emit the
canonical system message (matching the model's chat_template.jinja
build_system_message macro) with Reasoning: low, plus a today_ymd() date
helper.

Also:
- Default the repetition penalty to off (1.0). Penalizing the harmony
  control tokens (<|channel|>/<|message|>/<|start|>) that must repeat to
  open the final channel made gpt-oss stop right after analysis, emitting
  nothing.
- Suppress the literal "assistant" role header emitted between the
  analysis and final channels (only print in the final channel, moe only;
  non-moe Qwen3 stays in Normal and prints as before).

Verified on dash5 (TP=2): single-turn "capital of France" is now stable
across runs with a clean final answer; Qwen3 chat unaffected.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

2026-06-02 15:19:07 +08:00

xserv-cuda

cuda: add cached_trim() to release pooled GPU buffers

2026-05-30 12:50:04 +08:00

xserv-distributed

distributed: NCCL P2P primitives (PpContext + send/recv)

2026-05-29 18:45:42 +08:00

xserv-kernels

model: fused GPU MoE kernel — eliminate CPU roundtrip

2026-05-31 13:22:59 +08:00

xserv-model

xserv-chat: fix gpt-oss harmony chat (canonical system prompt + routing)