Adds --pp N for layer-wise pipeline parallelism using NCCL P2P send/recv between stages. Stage s holds layers [s*L, (s+1)*L), where L is the number of layers per stage; stage 0 also owns the embedding, and the last stage owns norm/lm_head. v1 is serial (one request at a time): the goal is correctness plus per-GPU memory savings (each GPU holds ~1/N of the layers). Also refactors the model to unfused QKV/gate_up projections and removes unused kernels (argmax, reshape_and_cache).
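
For reviewers, the layer split described above can be sketched roughly as follows. This is an illustration only, not the code in this change: `StagePlan` and `plan_stage` are hypothetical names, and the sketch assumes the layer count divides evenly by the `--pp` value.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class StagePlan:
    """Hypothetical description of what one pipeline stage owns."""
    stage: int
    first_layer: int      # inclusive
    end_layer: int        # exclusive
    owns_embedding: bool  # True only for stage 0
    owns_head: bool       # final norm + lm_head, True only for the last stage


def plan_stage(stage: int, pp: int, num_layers: int) -> StagePlan:
    """Return the half-open layer range [s*L, (s+1)*L) for one stage."""
    if pp < 1 or not 0 <= stage < pp:
        raise ValueError(f"stage {stage} is out of range for pp={pp}")
    if num_layers % pp != 0:
        # Assumption of this sketch: an even split with no remainder handling.
        raise ValueError("num_layers must be divisible by pp")
    per_stage = num_layers // pp  # this is L in the description above
    return StagePlan(
        stage=stage,
        first_layer=stage * per_stage,
        end_layer=(stage + 1) * per_stage,
        owns_embedding=(stage == 0),
        owns_head=(stage == pp - 1),
    )


if __name__ == "__main__":
    # Example: a 32-layer model split across 4 stages.
    for s in range(4):
        print(plan_stage(s, pp=4, num_layers=32))
```

With `pp=1` the single stage owns the embedding, every layer, and the head, which matches the non-pipelined case.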