pp_engine::run_pp: stage-0 coordinator (scheduler/tokenizer/sampling +
stop logic) on the calling thread, worker stage threads for 1..P. Each
step the coordinator embeds + runs its layers, then the hidden state is
handed stage->stage over NCCL P2P; the last stage samples and returns
the token to stage 0 over an in-process channel. v1 is serial (one
request, one token/step) — correctness first; throughput via microbatch
overlap is future work.
main: wire --pp N (mutually exclusive with --tp).
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>