xserv

Files

Gahow Wang 3c9d5e260e server: harmony termination via is_eos + TP repetition penalty

Use tokenizer.is_eos() (multi-EOS) for generation termination in both the
PP and TP engines instead of a single EOS id, so gpt-oss stops on
<|return|>, <|call|>, or <|endoftext|>.

In the TP engine, optionally apply a repetition penalty on the greedy
decode path to break greedy repetition loops. It is enabled when
XSERV_REP_PENALTY > 1 and applies to the most recent XSERV_REP_WINDOW
tokens; it is off by default.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

2026-05-31 00:56:33 +08:00
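
The commit message above describes two decode-loop behaviours: stopping on any of several EOS ids, and penalizing recently emitted tokens before the greedy pick. The sketch below shows one way these could look. It is not the code from this commit. The function names, the free-function `is_eos` (standing in for `tokenizer.is_eos()`), and the penalty rule (divide positive logits, multiply negative ones) are assumptions for illustration; the `penalty` and `window` parameters stand in for the values of XSERV_REP_PENALTY and XSERV_REP_WINDOW.

```rust
use std::collections::HashSet;

/// Hypothetical stand-in for `tokenizer.is_eos()`: true if `token` is any of
/// the model's end-of-sequence ids rather than one fixed id.
fn is_eos(eos_ids: &[u32], token: u32) -> bool {
    eos_ids.contains(&token)
}

/// Penalize, in place, the logits of tokens seen in the last `window`
/// entries of `history`. A penalty that is not greater than 1.0 (including
/// NaN) or a zero window leaves the logits untouched, which models the
/// "off by default" behaviour.
fn apply_repetition_penalty(logits: &mut [f32], history: &[u32], penalty: f32, window: usize) {
    if !(penalty > 1.0) || window == 0 {
        return;
    }
    let start = history.len().saturating_sub(window);
    // Deduplicate so a token repeated inside the window is penalized once.
    let recent: HashSet<u32> = history[start..].iter().copied().collect();
    for tok in recent {
        // Ignore ids that fall outside the vocabulary.
        if let Some(l) = logits.get_mut(tok as usize) {
            // Assumed rule: shrink positive logits, push negative ones lower.
            *l = if *l > 0.0 { *l / penalty } else { *l * penalty };
        }
    }
}

/// Greedy pick that skips NaN entries; returns None if every logit is NaN
/// or the slice is empty. Ties resolve to the lowest index.
fn argmax(logits: &[f32]) -> Option<u32> {
    let mut best: Option<(usize, f32)> = None;
    for (i, &v) in logits.iter().enumerate() {
        if v.is_nan() {
            continue;
        }
        match best {
            Some((_, b)) if v <= b => {}
            _ => best = Some((i, v)),
        }
    }
    best.map(|(i, _)| i as u32)
}
```

In a decode loop, one step would call `apply_repetition_penalty` on the fresh logits, take `argmax`, append the result to the history, and stop once `is_eos` returns true for it.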

xserv-cuda

cuda: add cached_trim() to release pooled GPU buffers

2026-05-30 12:50:04 +08:00

xserv-distributed

distributed: NCCL P2P primitives (PpContext + send/recv)

2026-05-29 18:45:42 +08:00

xserv-kernels

kernels: flash attention with gpt-oss sinks + sliding window

2026-05-31 00:56:10 +08:00

xserv-model

model/sampling: NaN-safe argmax + optional repetition penalty