Gahow Wang 11e0154e4d docs: Phase 18 pipeline parallelism — design + benchmark results

docs/18-pipeline-parallelism.md: PP design (layer split, NCCL P2P,
per-stage KV, engine/threading model).
docs/benchmarks/pp-sweep.md: measured on dash5 (8x RTX 5090, Qwen3-8B
BF16) — single-stream latency + per-GPU VRAM (~1/N), byte-exact
correctness (single x2 vs pp4 x2 control), and the full AIME-30 +
GSM8K-30 quality matrix (xserv & llama.cpp PP=1/2/4): GSM8K 29/30 in
every cell, TPOT flat across PP.
README: multi-card (TP/PP) section + roadmap to Phase 18.
gitignore: /.claude/ runtime state.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

2026-05-29 18:57:09 +08:00


PP sweep — xserv vs llama.cpp (Qwen3-8B BF16, 8×RTX 5090)

Pipeline parallelism (layer split), verified end-to-end on dash5. Qwen3-8B BF16, greedy, single stream, no NVLink (hand-off / split traffic over PCIe Gen5). xserv --pp N puts stage s on GPU s and hands the hidden state stage→stage over NCCL P2P; llama.cpp uses -sm layer (its default pipeline split) over N GPUs.
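The layer split described above can be sketched as a contiguous assignment of decoder layers to stages. This is an illustration only, not xserv's actual code: `stage_layers` is a hypothetical helper, the 36-layer depth is an assumption based on the commonly published Qwen3-8B configuration, and the remainder handling (the first stages take one extra layer) is one possible policy rather than something v1 is described as supporting.

```python
def stage_layers(num_layers: int, pp: int) -> list:
    """Split layer indices 0..num_layers-1 into pp contiguous stages.

    If num_layers is not divisible by pp, the first (num_layers % pp)
    stages each take one extra layer (an assumed policy, for illustration).
    """
    if pp < 1 or pp > num_layers:
        raise ValueError("pp must be between 1 and num_layers")
    base, extra = divmod(num_layers, pp)
    stages = []
    start = 0
    for s in range(pp):
        size = base + (1 if s < extra else 0)
        stages.append(range(start, start + size))
        start += size
    return stages


if __name__ == "__main__":
    # Assumption: 36 decoder layers (not stated in this document).
    for s, layers in enumerate(stage_layers(36, 4)):
        print(f"stage {s} -> GPU {s}: layers {layers.start}..{layers.stop - 1}")
```

Under that assumption, PP=4 gives nine layers per stage, and stage s would hand its hidden state to stage s+1 after its last layer.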

Single-stream latency + per-GPU VRAM (measured, `--max-seq-len 2048`)

Measured strictly sequentially, one server at a time, each config gated on a real successful generation (so VRAM snapshots are post-load). Driver: tools/pp_final.sh.

engine	PP	TTFT_ms	TPOT_ms	tok/s	per-GPU VRAM (MiB)
xserv	1	33.2	17.39	57.5	24010
xserv	2	35.9	18.07	55.3	11580, 13632
xserv	4	36.1	17.91	55.8	7298, 5250, 5250, 9350
llama	1	133.3	9.38	106.7	15604
llama	2	131.4	9.10	109.9	7862, 8494
llama	4	161.2	8.88	112.6	4476, 4090, 4090, 5108

(xserv VRAM was measured with XSERV_MAX_KV_BLOCKS=160, so each number is weights plus a minimal KV pool. tok/s = 1000 / TPOT_ms. The TTFT from this latency probe differs from the quality-suite TTFT below because the suite includes scheduler/HTTP overhead.)
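As a quick check of the tok/s column, the relation tok/s = 1000 / TPOT can be applied to the TPOT values in the table. The sketch below only re-derives numbers already shown; `tokens_per_second` is an illustrative helper, and a result may differ from the table in the last digit if the table was computed from unrounded TPOT.

```python
# TPOT values in milliseconds, copied from the latency table.
TPOT_MS = {
    ("xserv", 1): 17.39, ("xserv", 2): 18.07, ("xserv", 4): 17.91,
    ("llama", 1): 9.38, ("llama", 2): 9.10, ("llama", 4): 8.88,
}


def tokens_per_second(tpot_ms: float) -> float:
    """Decode rate implied by a time-per-output-token given in milliseconds."""
    return 1000.0 / tpot_ms


if __name__ == "__main__":
    for (engine, pp), tpot in TPOT_MS.items():
        print(f"{engine} PP={pp}: {tokens_per_second(tpot):.1f} tok/s")
```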

Correctness — PP is numerically exact

The hidden-state hand-off between stages is a bit-exact BF16 P2P copy and each stage runs the same kernels over its layers, so PP should reproduce a single-GPU result. Verified by byte-comparing generated text (greedy, temp 0), running each config twice to separate PP effects from run-to-run GEMM noise:

comparison	result
single run A == single run B	DIFFER (cuBLAS GEMM is not bit-reproducible run-to-run)
pp4 run A == pp4 run B	IDENTICAL
single run A == pp4 run A	IDENTICAL
single == pp2 (single run each)	IDENTICAL

Takeaway: single-GPU itself is non-deterministic under greedy (a 1-ULP logit difference flips a late argmax and the suffix changes), so a one-shot single-vs-PP byte compare can spuriously "DIFFER". In the 2×2 control the two PP=4 runs matched each other while the two single-GPU runs did not, and PP=4 lands exactly on a single-GPU trajectory (run A). NCCL P2P (tests/sendrecv.rs) and AllReduce (tests/allreduce.rs) unit tests pass.
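The logic of the 2×2 control can be sketched as follows. This is not the content of tools/pp_diag.sh, which is not shown here; the function names are hypothetical and the byte strings in the demo are toy placeholders, not real model output. The point is the acceptance rule: a PP run passes if it matches at least one single-GPU run, so that a single-vs-single difference is not blamed on PP.

```python
def compare(a: bytes, b: bytes) -> str:
    """Byte-exact comparison, reported in the same vocabulary as the table."""
    return "IDENTICAL" if a == b else "DIFFER"


def control_2x2(single_a: bytes, single_b: bytes,
                pp_a: bytes, pp_b: bytes) -> dict:
    """Run the three pairwise comparisons of the 2x2 control."""
    return {
        "single A == single B": compare(single_a, single_b),
        "pp A == pp B": compare(pp_a, pp_b),
        "single A == pp A": compare(single_a, pp_a),
    }


def lands_on_single_trajectory(single_runs: list, pp_runs: list) -> bool:
    """True if every PP output equals at least one single-GPU output."""
    return all(any(p == s for s in single_runs) for p in pp_runs)


if __name__ == "__main__":
    # Toy placeholders standing in for generated text.
    single_a, single_b = b"prefix ... suffix-1", b"prefix ... suffix-2"
    pp_a, pp_b = b"prefix ... suffix-1", b"prefix ... suffix-1"
    for name, result in control_2x2(single_a, single_b, pp_a, pp_b).items():
        print(f"{name}: {result}")
    print("PP on a single-GPU trajectory:",
          lands_on_single_trajectory([single_a, single_b], [pp_a, pp_b]))
```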

Quality matrix — AIME 2025 (30) + GSM8K (30), greedy, both engines × PP=1/2/4

Full measured matrix (tools/bench/summarize_fullq.py; raw in bench-out/FULLQ_SUMMARY.txt). Qwen3-8B BF16, thinking OFF, max_seq_len 4096. xserv on GPUs 0-3, llama.cpp on GPUs 4-7 (disjoint groups, run in parallel).

engine	PP	AIME 2025	GSM8K	AIME mean_tok	TTFT_ms	TPOT_ms
xserv	1	8/30 (26.7%)	29/30 (96.7%)	2383	485	22.42
xserv	2	7/30 (23.3%)	29/30 (96.7%)	2367	457	22.55
xserv	4	7/30 (23.3%)	29/30 (96.7%)	2652	494	23.31
llama	1	7/30 (23.3%)	29/30 (96.7%)	2651	119	10.37
llama	2	7/30 (23.3%)	29/30 (96.7%)	2651	118	10.41
llama	4	7/30 (23.3%)	29/30 (96.7%)	2651	119	10.39

Reading the matrix:

GSM8K = 29/30 (96.7%) in every cell — identical across both engines and all PP levels. xserv's accuracy matches llama.cpp exactly on the same weights.
AIME = 7/30 (23.3%) everywhere except xserv PP=1 (8/30). That single +1 is the run-to-run greedy nondeterminism documented above (an AIME solution is ~2400 tokens; one late argmax flip changes one problem's outcome) — not a PP or engine effect. AIME accuracy is low because this is an 8B model with thinking disabled; the point here is the cross-engine / cross-PP agreement, which holds.
TPOT is flat across PP for both engines (xserv 22.4→23.3 ms, llama ~10.4 ms at every PP level), reconfirming PP doesn't slow single-stream decode. The ~2.2× TPOT gap to llama.cpp is the single-GPU gap (llama-cpp-comparison.md), orthogonal to PP.
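The "flat TPOT" and "~2.2×" readings can be re-derived from the matrix with a few lines of arithmetic. The helper name `relative_spread` is illustrative; the inputs are the TPOT_ms values from the quality matrix.

```python
# TPOT (ms) from the quality matrix, keyed by engine and PP degree.
TPOT_MS = {
    "xserv": {1: 22.42, 2: 22.55, 4: 23.31},
    "llama": {1: 10.37, 2: 10.41, 4: 10.39},
}


def relative_spread(values) -> float:
    """(max - min) / min: how far the slowest config is from the fastest."""
    values = list(values)
    return (max(values) - min(values)) / min(values)


def single_gpu_ratio(tpot: dict) -> float:
    """xserv / llama TPOT ratio at PP=1."""
    return tpot["xserv"][1] / tpot["llama"][1]


if __name__ == "__main__":
    for engine, by_pp in TPOT_MS.items():
        print(f"{engine}: spread across PP = {relative_spread(by_pp.values()):.1%}")
    print(f"xserv / llama at PP=1 = {single_gpu_ratio(TPOT_MS):.2f}x")
```

On these numbers the spread across PP is about 4% for xserv and under 0.5% for llama, and the PP=1 ratio is about 2.16.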

Takeaways

Memory is the win. Per-GPU weights+KV scale ~1/P: xserv 24.0 GB (1 GPU) → ~11–14 GB (PP=2) → ~5–9 GB (PP=4); llama 15.6 → ~8 → ~4–5 GB. The two end stages sit higher (stage 0 holds embed_tokens, the last stage norm+lm_head, ~1.1 GB each). This is what PP buys: a model / context that does not fit on one card fits across P.
Single-stream latency is flat, not faster. v1 PP is serial across stages (no microbatch overlap): per-token latency = sum of all stages' compute + (P-1) P2P hops + a blocking sync per stage. The [1, hidden] BF16 hop (8 KB) over PCIe is cheap relative to per-token compute, so TPOT is ~constant across P. PP does not speed up single-stream decode; it trades (almost no) latency for large memory headroom.
Quality is preserved and matches llama.cpp. GSM8K 96.7% in all 6 configurations (2 engines × PP=1/2/4); AIME within the greedy noise band. PP=1/2/4 agree, and xserv tracks llama.cpp.
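The serial latency model in the second takeaway can be written out as a small sketch. The hidden size of 4096 is an assumption consistent with the 8 KB hop quoted above (4096 elements × 2 bytes for BF16). The compute, hop, and sync times in the demo are illustrative placeholders, not measurements, and the even compute split across stages is also an assumption.

```python
def hop_bytes(hidden_size: int, bytes_per_elem: int = 2, tokens: int = 1) -> int:
    """Size of the [tokens, hidden] activation handed from one stage to the next."""
    return tokens * hidden_size * bytes_per_elem


def serial_pp_token_latency_ms(total_compute_ms: float, pp: int,
                               hop_ms: float, sync_ms: float) -> float:
    """Per-token latency for serial PP with no microbatch overlap.

    latency = sum of stage compute + (P - 1) hops + one blocking sync per stage.
    Stage compute is assumed to split evenly, so its sum does not depend on P.
    """
    per_stage_ms = total_compute_ms / pp
    return pp * per_stage_ms + (pp - 1) * hop_ms + pp * sync_ms


if __name__ == "__main__":
    print("hop size:", hop_bytes(4096), "bytes")  # assumed hidden size
    # Placeholder timings, chosen only to show the shape of the model.
    for pp in (1, 2, 4):
        t = serial_pp_token_latency_ms(17.0, pp, hop_ms=0.02, sync_ms=0.01)
        print(f"PP={pp}: {t:.2f} ms per token")
```

Because the compute term is constant in P, only the hop and sync terms grow with the stage count, which is why TPOT stays close to flat when those terms are small relative to compute.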

Reproduce

./tools/sync-and-build.sh build
# latency + VRAM + byte-exact correctness (writes bench-out/PP_FINAL.md):
ssh <host> 'cd <repo> && bash tools/pp_final.sh'
# determinism control (single×2 vs pp4×2):
ssh <host> 'cd <repo> && bash tools/pp_diag.sh'
# NCCL P2P + AllReduce unit tests:
ssh <host> 'cd <repo> && cargo test -p xserv-distributed --release'
# full quality matrix AIME-30 + GSM8K-30 (xserv 0-3 serial; or parallel w/ llama 4-7):
ssh <host> 'cd <repo> && bash tools/pp_quality_full.sh'   # xserv+llama serial, GPU 0-3
ssh <host> 'cd <repo> && bash tools/pp_llama_47.sh'        # llama on GPU 4-7 (parallel)
python3 tools/bench/summarize_fullq.py bench-out

Next (where PP actually raises throughput)

Microbatch / 1F1B overlap: while stage 1 runs microbatch A, stage 0 runs B. This is the only thing that turns PP into a throughput win; v1 is serial, so P GPUs give 1 GPU's single-stream rate (but P× the memory headroom / batch room).
Persistent per-stage recv buffers (drop the per-token CPU alloc + H2D) and event-based ordering instead of a full device sync per hop.
2D TP×PP, and layers % P != 0 non-uniform splits.
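To illustrate why microbatch overlap is what raises throughput, here is a minimal forward-only pipeline schedule. It is a sketch under simplifying assumptions (every stage takes one equal tick per microbatch, hand-off cost is ignored) and not a design for xserv; `pipeline_schedule` and `utilization` are hypothetical names. In decode, the microbatches would have to be independent sequences, since token t+1 of one sequence depends on token t.

```python
from typing import List, Optional


def pipeline_schedule(num_stages: int,
                      num_microbatches: int) -> List[List[Optional[int]]]:
    """schedule[t][s] is the microbatch stage s works on at tick t, or None if idle.

    Microbatch m enters stage 0 at tick m and reaches stage s at tick m + s.
    """
    ticks = num_microbatches + num_stages - 1
    return [
        [(t - s) if 0 <= t - s < num_microbatches else None
         for s in range(num_stages)]
        for t in range(ticks)
    ]


def utilization(num_stages: int, num_microbatches: int) -> float:
    """Fraction of stage-ticks doing useful work: M / (M + P - 1)."""
    return num_microbatches / (num_microbatches + num_stages - 1)


if __name__ == "__main__":
    for t, row in enumerate(pipeline_schedule(4, 4)):
        cells = " ".join("." if m is None else str(m) for m in row)
        print(f"tick {t}: {cells}")
    for m in (1, 4, 16):
        print(f"P=4, M={m}: utilization {utilization(4, m):.2f}")
```

With a single microbatch (the v1 serial case) only one of the P stages is busy at any tick, so utilization is 1/P; it approaches 1 as the number of in-flight microbatches grows.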

🤖 Generated with Claude Code
