xserv

Files

Gahow Wang 9ad91a4a92 phase19: MoE support — gpt-oss-20b end-to-end inference with TP=2

Add Mixture-of-Experts support for the gpt-oss-20b model (20.9B params,
32 experts × top-4 routing). Key additions:

- ModelConfig: MoE fields (num_local_experts, layer_types, sliding_window,
  attention_bias, explicit head_dim, rope_scaling, swiglu_limit)
- YaRN RoPE: RopeCache::new_yarn() with correct frequency interpolation
  and attention_scaling = 0.1*ln(factor)+1
- Custom GLU kernel: gpt_oss_glu_bf16 (clamped sigmoid gate activation)
- Paged attention with sinks + sliding window kernel variant
- GptOss model struct with expert-parallel TP (split 32 experts across ranks)
- bench-gpt-oss binary for TP inference benchmarking

Verified on dash5 with 2x RTX 5090: 63.6 tok/s decode, ~160ms TTFT.
Model generates topically-coherent output (needs chat template for quality).

Known issues:
- Custom GEMV kernel produces NaN with small N (workaround: pad to M=2)
- Prefill doesn't use attention sinks (uses standard flash attention)
- Output quality requires chat template formatting

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

2026-05-30 15:18:01 +08:00

activations.cu

phase19: MoE support — gpt-oss-20b end-to-end inference with TP=2

2026-05-30 15:18:01 +08:00