xserv

Files

Gahow Wang 46bfb59f30 Merge branch 'phase18-pipeline-parallelism': pipeline-parallel inference

Adds --pp N for layer-wise pipeline parallelism via NCCL P2P send/recv.
With L layers per stage, stage s holds layers [s*L, (s+1)*L); stage 0
owns the embedding and the last stage owns norm/lm_head. v1 is serial
(one request at a time): the goals are correctness and per-GPU memory
savings (~1/N). Refactors the model to unfused QKV/gate_up projections
and removes unused kernels (argmax, reshape_and_cache).
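
The layer split described in the commit message can be illustrated with a minimal sketch. This is not code from the repository: the function names `stage_layer_range` and `stage_plan` are hypothetical, and the sketch assumes the layer count divides evenly by the number of stages, which the commit message does not state.

```python
def stage_layer_range(stage: int, num_stages: int, num_layers: int) -> tuple[int, int]:
    """Return the half-open layer interval [start, end) held by `stage`.

    Assumption (not confirmed by the source): num_layers is an exact
    multiple of num_stages, so every stage holds the same number of layers.
    """
    if not 0 <= stage < num_stages:
        raise ValueError("stage must be in [0, num_stages)")
    if num_layers % num_stages != 0:
        raise ValueError("this sketch requires num_layers % num_stages == 0")
    per_stage = num_layers // num_stages  # the 'L' in [s*L, (s+1)*L)
    return stage * per_stage, (stage + 1) * per_stage


def stage_plan(stage: int, num_stages: int, num_layers: int) -> dict:
    """Describe what one pipeline stage owns: its layers plus the
    embedding (first stage only) and norm/lm_head (last stage only)."""
    start, end = stage_layer_range(stage, num_stages, num_layers)
    return {
        "layers": list(range(start, end)),
        "owns_embedding": stage == 0,
        "owns_norm_and_lm_head": stage == num_stages - 1,
    }


if __name__ == "__main__":
    # Example: a 32-layer model split across 4 stages (as with --pp 4).
    for s in range(4):
        print(s, stage_plan(s, 4, 32))
```

With one stage (`num_stages == 1`), the same stage owns the embedding, all layers, and norm/lm_head, so the plan reduces to the non-pipelined case.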

2026-05-30 13:13:05 +08:00

xserv-cuda

cuda: add cached_trim() to release pooled GPU buffers

2026-05-30 12:50:04 +08:00

xserv-distributed

distributed: NCCL P2P primitives (PpContext + send/recv)

2026-05-29 18:45:42 +08:00

xserv-kernels

kernels: reshape_and_cache, GPU argmax, single-launch GEMV

2026-05-30 12:50:17 +08:00

xserv-model

Merge branch 'phase18-pipeline-parallelism': pipeline-parallel inference