Files
xserv/crates
Gahow Wang 46bfb59f30 Merge branch 'phase18-pipeline-parallelism': pipeline-parallel inference
Adds --pp N for layer-wise pipeline parallelism via NCCL P2P send/recv.
Each stage holds layers [s*L, (s+1)*L), stage 0 owns embedding, last
stage owns norm/lm_head. v1 serial (one request at a time) — correctness
+ per-GPU memory savings (~1/N). Refactors model to unfused QKV/gate_up
projections and removes unused kernels (argmax, reshape_and_cache).
2026-05-30 13:13:05 +08:00
..