xserv

Files

Gahow Wang da3aaa134a model: pipeline-parallel Qwen3 (from_weights_pp + stage forward)

Layer-wise split: each stage loads only its contiguous layer range
[s*L, (s+1)*L); stage 0 keeps embed_tokens, the last stage keeps
norm/lm_head (others get a 1x1 placeholder). Heads are NOT split
(PP is orthogonal to TP). Adds embed/head and forward_layers_prefill/
forward_layers_decode that take and return the [tokens, hidden] hidden
state; per-stage PagedKVCache is indexed by local layer id.

sampling: derive Clone on SamplingParams (carried in the PP command enum).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

2026-05-29 18:45:47 +08:00

xserv-cuda

kernels/cuda: paged-attention kernel, dispatch, pinned host memory

2026-05-28 19:58:36 +08:00

xserv-distributed

distributed: NCCL P2P primitives (PpContext + send/recv)

2026-05-29 18:45:42 +08:00

xserv-kernels

kernels/cuda: paged-attention kernel, dispatch, pinned host memory

2026-05-28 19:58:36 +08:00

xserv-model

model: pipeline-parallel Qwen3 (from_weights_pp + stage forward)