# PP sweep: xserv vs llama.cpp (Qwen3-8B BF16, 8×RTX 5090)

Pipeline parallelism (layer split), verified end-to-end on dash5. Qwen3-8B BF16,
greedy, single stream, no NVLink (hand-off / split traffic over PCIe Gen5).
xserv `--pp N` puts stage `s` on GPU `s` and hands the hidden state stage→stage
over NCCL P2P; llama.cpp uses `-sm layer` (its default pipeline split) over N GPUs.

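As a rough illustration of the layer split described above, the sketch below assigns contiguous layer ranges to stages, with stage `s` mapped to GPU `s`. `split_layers` is a hypothetical helper, not xserv's API, and the 36-layer count is an assumption taken from the public Qwen3-8B config. It rejects uneven splits because this document lists `layers % P != 0` as future work.

```python
def split_layers(num_layers: int, pp: int) -> list[range]:
    """Assign contiguous layer ranges to stages 0..pp-1 (stage s -> GPU s)."""
    if pp < 1:
        raise ValueError("pp must be >= 1")
    if num_layers % pp != 0:
        # Non-uniform splits are listed as future work in this document.
        raise ValueError("num_layers must be divisible by pp")
    per_stage = num_layers // pp
    return [range(s * per_stage, (s + 1) * per_stage) for s in range(pp)]


# Assumed: 36 decoder layers (public Qwen3-8B config, not stated in this doc).
for pp in (1, 2, 4):
    stages = split_layers(36, pp)
    print(pp, [(r.start, r.stop - 1) for r in stages])
```

With these assumptions each stage owns 36, 18, or 9 layers for PP=1/2/4.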
## Single-stream latency + per-GPU VRAM (measured, `--max-seq-len 2048`)

Measured strictly sequentially, one server at a time, each config gated on a real
successful generation (so VRAM snapshots are post-load). Driver:
`tools/pp_final.sh`.

| engine | PP | TTFT_ms | TPOT_ms | tok/s | per-GPU VRAM (MiB) |
|--------|----|---------|---------|-------|--------------------|
| xserv | 1 | 33.2 | 17.39 | 57.5 | 24010 |
| xserv | 2 | 35.9 | 18.07 | 55.3 | 11580, 13632 |
| xserv | 4 | 36.1 | 17.91 | 55.8 | 7298, 5250, 5250, 9350 |
| llama | 1 | 133.3 | 9.38 | 106.7 | 15604 |
| llama | 2 | 131.4 | 9.10 | 109.9 | 7862, 8494 |
| llama | 4 | 161.2 | 8.88 | 112.6 | 4476, 4090, 4090, 5108 |

(xserv VRAM is measured with `XSERV_MAX_KV_BLOCKS=160`, so the number is weights + a
minimal KV pool. `tok/s = 1000 / TPOT_ms`. This latency probe's TTFT is not comparable
to the quality-suite TTFT below, because the suite includes scheduler/HTTP overhead.)

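The `tok/s` column follows directly from TPOT. The sketch below recomputes it from the rounded TPOT values in the table; `tokens_per_second` is an illustrative helper, not part of the benchmark tooling. Because the inputs are already rounded to two decimals, a recomputed value can differ from the table in the last digit.

```python
def tokens_per_second(tpot_ms: float) -> float:
    """Decode throughput for one stream: 1000 ms divided by time per output token."""
    if tpot_ms <= 0:
        raise ValueError("tpot_ms must be positive")
    return 1000.0 / tpot_ms


# TPOT_ms values copied from the latency table above.
tpot_ms = {
    ("xserv", 1): 17.39, ("xserv", 2): 18.07, ("xserv", 4): 17.91,
    ("llama", 1): 9.38, ("llama", 2): 9.10, ("llama", 4): 8.88,
}
for (engine, pp), tpot in tpot_ms.items():
    print(f"{engine} PP={pp}: {tokens_per_second(tpot):.1f} tok/s")
```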
## Correctness: PP is numerically exact

The hidden-state hand-off between stages is a bit-exact BF16 P2P copy and each
stage runs the same kernels over its layers, so PP should reproduce a single-GPU
result exactly. Verified by byte-comparing generated text (greedy, temp 0), running
each config **twice** to separate PP effects from run-to-run GEMM noise:

| comparison | result |
|------------|--------|
| single run A == single run B | **DIFFER** (cuBLAS GEMM is not bit-reproducible run-to-run) |
| pp4 run A == pp4 run B | **IDENTICAL** |
| single run A == pp4 run A | **IDENTICAL** |
| single == pp2 (single run each) | **IDENTICAL** |

Takeaway: **single-GPU itself is non-deterministic** under greedy (a 1-ULP logit
difference flips a late argmax and the suffix changes), so a one-shot single-vs-PP
byte compare can spuriously "DIFFER". In the 2×2 control, PP=4 was *more*
reproducible than re-running single-GPU, and it landed exactly on a single-GPU
trajectory (run A). NCCL P2P (`tests/sendrecv.rs`) and AllReduce (`tests/allreduce.rs`)
unit tests pass.

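The 2×2 control above can be sketched as a pairwise byte comparison plus a verdict that judges PP against the set of single-GPU outputs rather than one sample. This is an illustration only: `pairwise_equal` and `pp_verdict` are hypothetical names (the real check lives in `tools/pp_diag.sh`, whose internals are not shown here), and the byte strings are toy stand-ins for generated text.

```python
from itertools import combinations


def pairwise_equal(outputs: dict[str, bytes]) -> dict[tuple[str, str], bool]:
    """Byte-compare every pair of runs; True means IDENTICAL."""
    return {(a, b): outputs[a] == outputs[b] for a, b in combinations(outputs, 2)}


def pp_verdict(single_runs: list[bytes], pp_runs: list[bytes]) -> str:
    """Judge PP against all single-GPU outputs, not a single sample."""
    if all(p in single_runs for p in pp_runs):
        return "exact"          # every PP run lands on a single-GPU trajectory
    if len(set(single_runs)) > 1:
        return "inconclusive"   # the baseline itself varies run to run
    return "differs"            # stable baseline, PP output is different


# Toy stand-ins for generated text (not real model output).
outputs = {
    "single_A": b"... answer is 42",
    "single_B": b"... answer is 41",
    "pp4_A": b"... answer is 42",
    "pp4_B": b"... answer is 42",
}
for pair, same in pairwise_equal(outputs).items():
    print(pair, "IDENTICAL" if same else "DIFFER")
print(pp_verdict([outputs["single_A"], outputs["single_B"]],
                 [outputs["pp4_A"], outputs["pp4_B"]]))
```

The "inconclusive" branch is the case the takeaway warns about: a one-shot mismatch says nothing about PP when the single-GPU baseline does not agree with itself.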
## Quality matrix: AIME 2025 (30) + GSM8K (30), greedy, both engines × PP=1/2/4

Full measured matrix (`tools/bench/summarize_fullq.py`; raw in
`bench-out/FULLQ_SUMMARY.txt`). Qwen3-8B BF16, thinking OFF, `max_seq_len 4096`.
xserv on GPUs 0-3, llama.cpp on GPUs 4-7 (disjoint groups, run in parallel).

| engine | PP | AIME 2025 | GSM8K | AIME mean_tok | TTFT_ms | TPOT_ms |
|--------|----|-----------|-------|---------------|---------|---------|
| xserv | 1 | 8/30 (26.7%) | 29/30 (96.7%) | 2383 | 485 | 22.42 |
| xserv | 2 | 7/30 (23.3%) | 29/30 (96.7%) | 2367 | 457 | 22.55 |
| xserv | 4 | 7/30 (23.3%) | 29/30 (96.7%) | 2652 | 494 | 23.31 |
| llama | 1 | 7/30 (23.3%) | 29/30 (96.7%) | 2651 | 119 | 10.37 |
| llama | 2 | 7/30 (23.3%) | 29/30 (96.7%) | 2651 | 118 | 10.41 |
| llama | 4 | 7/30 (23.3%) | 29/30 (96.7%) | 2651 | 119 | 10.39 |

Reading the matrix:

- **GSM8K = 29/30 (96.7%) in every cell**: identical across both engines and all
  PP levels. xserv's accuracy matches llama.cpp exactly on the same weights.
- **AIME = 7/30 (23.3%) everywhere except xserv PP=1 (8/30)**. That single +1 is
  consistent with the run-to-run greedy nondeterminism documented above (an AIME
  solution is ~2400–2650 tokens; one late argmax flip changes one problem's
  outcome), not a PP or engine effect. AIME accuracy is low because this is an 8B
  model with thinking disabled; the point here is the *cross-engine / cross-PP
  agreement*, which holds.
- **TPOT is flat across PP** for both engines (xserv 22.42→23.31 ms, llama
  10.37–10.41 ms), reconfirming PP doesn't slow single-stream decode. The ~2.2×
  TPOT gap to llama.cpp is the single-GPU gap (`llama-cpp-comparison.md`),
  orthogonal to PP.

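To put numbers on "flat" and on the cross-engine gap, the sketch below computes the relative TPOT spread across PP levels and the xserv/llama ratio from the quality-matrix values. `spread_pct` is an illustrative helper, not part of `tools/bench/summarize_fullq.py`.

```python
def spread_pct(values: list[float]) -> float:
    """Relative spread (max - min) / min, in percent."""
    lo, hi = min(values), max(values)
    return 100.0 * (hi - lo) / lo


# TPOT_ms by PP level, copied from the quality matrix above.
tpot_ms = {
    "xserv": {1: 22.42, 2: 22.55, 4: 23.31},
    "llama": {1: 10.37, 2: 10.41, 4: 10.39},
}
for engine, by_pp in tpot_ms.items():
    print(f"{engine}: TPOT spread across PP = {spread_pct(list(by_pp.values())):.1f}%")
for pp in (1, 2, 4):
    ratio = tpot_ms["xserv"][pp] / tpot_ms["llama"][pp]
    print(f"PP={pp}: xserv/llama TPOT ratio = {ratio:.2f}x")
```

On these values the spread is about 4% for xserv and under 1% for llama, and the ratio stays between roughly 2.16× and 2.24×.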
## Takeaways

- **Memory is the win.** Per-GPU weights+KV scale ~1/P: xserv 24,010 MiB (1 GPU) →
  11,580–13,632 MiB (PP=2) → 5,250–9,350 MiB (PP=4); llama 15,604 → 7,862–8,494 →
  4,090–5,108 MiB. The two end stages sit higher (stage 0 holds `embed_tokens`, the
  last stage `norm`+`lm_head`, ~1.1 GB each). This is what PP buys: a model / context
  that does not fit on one card fits across P.
- **Single-stream latency is flat, not faster.** v1 PP is serial across stages
  (no microbatch overlap): per-token latency = sum of all stages' compute +
  (P-1) P2P hops + a blocking sync per stage. The `[1, hidden]` BF16 hop (8 KB)
  over PCIe is cheap relative to per-token compute, so TPOT is ~constant across P.
  PP does **not** speed up single-stream decode; it trades (almost no) latency for
  large memory headroom.
- **Quality is preserved and matches llama.cpp.** GSM8K 96.7% in all 6
  configurations; AIME within the greedy noise band (7–8/30). PP=1/2/4 agree, and
  xserv tracks llama.cpp.

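The serial latency formula in the takeaways can be written out as a small model. This is a sketch under stated assumptions, not xserv code: `hidden = 4096` is taken from the public Qwen3-8B config (it matches the 8 KB hop quoted above), and the per-stage compute, hop, and sync times passed in are made-up placeholders, not measurements.

```python
def hop_bytes(hidden: int, dtype_bytes: int = 2) -> int:
    """Size of one [1, hidden] hand-off; BF16 is 2 bytes per element."""
    return hidden * dtype_bytes


def serial_pp_token_latency_ms(stage_compute_ms: list[float],
                               hop_ms: float,
                               sync_ms: float) -> float:
    """Serial (v1) PP: all stage compute + (P-1) hops + one blocking sync per stage."""
    p = len(stage_compute_ms)
    return sum(stage_compute_ms) + (p - 1) * hop_ms + p * sync_ms


print(hop_bytes(4096))  # 8192 bytes for the assumed hidden size

# Placeholder timings (hypothetical): the same total compute split over 1, 2, 4 stages.
total_compute_ms = 16.0
for p in (1, 2, 4):
    stages = [total_compute_ms / p] * p
    print(p, round(serial_pp_token_latency_ms(stages, hop_ms=0.01, sync_ms=0.02), 2))
```

As long as the hop and sync terms are small next to the compute term, the total barely moves with P, which is the flat TPOT seen in the tables.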
## Reproduce

```bash
./tools/sync-and-build.sh build
# latency + VRAM + byte-exact correctness (writes bench-out/PP_FINAL.md):
ssh <host> 'cd <repo> && bash tools/pp_final.sh'
# determinism control (single×2 vs pp4×2):
ssh <host> 'cd <repo> && bash tools/pp_diag.sh'
# NCCL P2P + AllReduce unit tests:
ssh <host> 'cd <repo> && cargo test -p xserv-distributed --release'
# full quality matrix AIME-30 + GSM8K-30 (xserv 0-3 serial; or parallel w/ llama 4-7):
ssh <host> 'cd <repo> && bash tools/pp_quality_full.sh'   # xserv+llama serial, GPU 0-3
ssh <host> 'cd <repo> && bash tools/pp_llama_47.sh'       # llama on GPU 4-7 (parallel)
python3 tools/bench/summarize_fullq.py bench-out
```

## Next (where PP actually raises throughput)

- **Microbatch / 1F1B overlap**: while stage 1 runs microbatch A, stage 0 runs B.
  This is the only thing that turns PP into a *throughput* win; v1 is serial, so
  P GPUs give 1 GPU's single-stream rate (but P× the memory headroom / batch room).
- Persistent per-stage recv buffers (drop the per-token CPU alloc + H2D) and
  event-based ordering instead of a full device sync per hop.
- 2D TP×PP, and `layers % P != 0` non-uniform splits.

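To see why microbatch overlap is what turns PP into a throughput win, the sketch below compares the makespan of a serial schedule with an idealized forward-only overlapped schedule. It is a toy model, not the planned xserv scheduler: every stage takes the same time `t`, hop and sync costs are ignored, and the function names are hypothetical.

```python
def serial_makespan(p: int, m: int, t: float) -> float:
    """v1 behaviour: each microbatch crosses all P stages before the next starts."""
    return m * p * t


def pipelined_makespan(p: int, m: int, t: float) -> float:
    """Forward-only overlap: a stage starts a microbatch once its input has
    arrived from the previous stage and the stage itself is free."""
    finish = [[0.0] * m for _ in range(p)]
    for s in range(p):
        for b in range(m):
            input_ready = finish[s - 1][b] if s > 0 else 0.0
            stage_free = finish[s][b - 1] if b > 0 else 0.0
            finish[s][b] = max(input_ready, stage_free) + t
    return finish[p - 1][m - 1]


p, m, t = 4, 8, 1.0  # 4 stages, 8 microbatches, unit stage time (illustrative)
print(serial_makespan(p, m, t), pipelined_makespan(p, m, t))
```

In this model the overlapped makespan is `(P + M - 1) * t` versus `M * P * t` for the serial schedule, so the gain approaches P× only when many microbatches are in flight. A single stream (M = 1) gets no benefit, consistent with the flat TPOT measured above.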