xserv/docs/benchmarks/pp-sweep.md
Gahow Wang 11e0154e4d docs: Phase 18 pipeline parallelism — design + benchmark results
docs/18-pipeline-parallelism.md: PP design (layer split, NCCL P2P,
per-stage KV, engine/threading model).
docs/benchmarks/pp-sweep.md: measured on dash5 (8x RTX 5090, Qwen3-8B
BF16) — single-stream latency + per-GPU VRAM (~1/N), byte-exact
correctness (single x2 vs pp4 x2 control), and the full AIME-30 +
GSM8K-30 quality matrix (xserv & llama.cpp PP=1/2/4): GSM8K 29/30 in
every cell, TPOT flat across PP.
README: multi-card (TP/PP) section + roadmap to Phase 18.
gitignore: /.claude/ runtime state.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-05-29 18:57:09 +08:00

# PP sweep — xserv vs llama.cpp (Qwen3-8B BF16, 8×RTX 5090)
Pipeline parallelism (layer split), verified end-to-end on dash5. Qwen3-8B BF16,
greedy, single stream, no NVLink (hand-off / split traffic over PCIe Gen5).
xserv `--pp N` puts stage `s` on GPU `s` and hands the hidden state stage→stage
over NCCL P2P; llama.cpp uses `-sm layer` (its default pipeline split) over N GPUs.
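
The stage-to-GPU mapping described above can be sketched as follows. This is a minimal illustration, not xserv's actual code: `split_layers`, `stage_extras`, and the layer count of 36 are assumptions introduced here, and only uniform splits are modelled. The placement of `embed_tokens`, `norm`, and `lm_head` follows the Takeaways section of this document.

```python
def split_layers(num_layers: int, num_stages: int) -> list[range]:
    """Return one contiguous, equally sized layer range per pipeline stage."""
    if num_stages < 1 or num_layers % num_stages != 0:
        raise ValueError("this sketch only models uniform splits")
    per_stage = num_layers // num_stages
    return [range(s * per_stage, (s + 1) * per_stage) for s in range(num_stages)]


def stage_extras(stage: int, num_stages: int) -> list[str]:
    """Non-layer weights pinned to the first and last stage."""
    extras = []
    if stage == 0:
        extras.append("embed_tokens")
    if stage == num_stages - 1:
        extras += ["norm", "lm_head"]
    return extras


NUM_LAYERS = 36  # assumption: decoder-layer count for Qwen3-8B

for pp in (1, 2, 4):
    for stage, layers in enumerate(split_layers(NUM_LAYERS, pp)):
        extras = ", ".join(stage_extras(stage, pp)) or "-"
        # Stage s lives on GPU s, so the stage index doubles as the GPU index.
        print(f"pp={pp} stage/GPU {stage}: "
              f"layers {layers.start}..{layers.stop - 1} + {extras}")
```

With `pp=1` the single stage carries both the embedding and the output head, which is why the end stages in the VRAM table below sit higher than the middle ones.
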
## Single-stream latency + per-GPU VRAM (measured, `--max-seq-len 2048`)
Measured strictly sequentially, one server at a time, each config gated on a real
successful generation (so VRAM snapshots are post-load). Driver:
`tools/pp_final.sh`.
| engine | PP | TTFT_ms | TPOT_ms | tok/s | per-GPU VRAM (MiB) |
|--------|----|---------|---------|-------|--------------------|
| xserv | 1 | 33.2 | 17.39 | 57.5 | 24010 |
| xserv | 2 | 35.9 | 18.07 | 55.3 | 11580, 13632 |
| xserv | 4 | 36.1 | 17.91 | 55.8 | 7298, 5250, 5250, 9350 |
| llama | 1 | 133.3 | 9.38 | 106.7 | 15604 |
| llama | 2 | 131.4 | 9.10 | 109.9 | 7862, 8494 |
| llama | 4 | 161.2 | 8.88 | 112.6 | 4476, 4090, 4090, 5108 |
(xserv VRAM is measured with `XSERV_MAX_KV_BLOCKS=160`, so the number is weights
plus a minimal KV pool. `tok/s = 1000 / TPOT`. TTFT in this latency probe is not
directly comparable to the quality-suite TTFT below, because the suite's figure
includes scheduler/HTTP overhead.)
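
As a quick sanity check on the table, the derived columns can be recomputed from the raw ones. The sketch below copies the table's numbers verbatim; the helper names (`tokens_per_second`, `peak_vram_fraction`) are made up for illustration and are not part of `tools/pp_final.sh`.

```python
# (engine, PP, TPOT_ms, reported tok/s, per-GPU VRAM in MiB), copied from the table.
ROWS = [
    ("xserv", 1, 17.39, 57.5, [24010]),
    ("xserv", 2, 18.07, 55.3, [11580, 13632]),
    ("xserv", 4, 17.91, 55.8, [7298, 5250, 5250, 9350]),
    ("llama", 1, 9.38, 106.7, [15604]),
    ("llama", 2, 9.10, 109.9, [7862, 8494]),
    ("llama", 4, 8.88, 112.6, [4476, 4090, 4090, 5108]),
]


def tokens_per_second(tpot_ms: float) -> float:
    """tok/s = 1000 / TPOT, with TPOT in milliseconds."""
    return 1000.0 / tpot_ms


def peak_vram_fraction(engine: str, pp: int) -> float:
    """Largest per-GPU footprint at this PP, relative to the engine's PP=1 footprint."""
    vram = {(e, p): v for e, p, _, _, v in ROWS}
    return max(vram[(engine, pp)]) / vram[(engine, 1)][0]


for engine, pp, tpot, reported, _ in ROWS:
    print(f"{engine} pp={pp}: {tokens_per_second(tpot):6.1f} tok/s "
          f"(table: {reported}), "
          f"peak VRAM = {peak_vram_fraction(engine, pp):.2f}x of PP=1")
```

The recomputed tok/s values agree with the table to within rounding of the TPOT column. The peak per-GPU fraction sits somewhat above 1/P because the heaviest stage also carries non-layer weights.
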
## Correctness — PP is numerically exact
The hidden-state hand-off between stages is a bit-exact BF16 P2P copy and each
stage runs the same kernels over its layers, so PP must reproduce the single-GPU
result. Verified by byte-comparing generated text (greedy, temp 0), running each
config **twice** to separate PP effects from run-to-run GEMM noise:
| comparison | result |
|------------|--------|
| single run A == single run B | **DIFFER** (cuBLAS GEMM is not bit-reproducible run-to-run) |
| pp4 run A == pp4 run B | **IDENTICAL** |
| single run A == pp4 run A | **IDENTICAL** |
| single == pp2 (single run each) | **IDENTICAL** |
Takeaway: **single-GPU itself is non-deterministic** under greedy (a 1-ULP logit
difference flips a late argmax and the suffix changes), so a one-shot single-vs-PP
byte compare can spuriously "DIFFER". The 2×2 control shows PP=4 is *more*
reproducible than re-running single-GPU, and it lands exactly on a single-GPU
trajectory. NCCL P2P (`tests/sendrecv.rs`) and AllReduce (`tests/allreduce.rs`)
unit tests pass.
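
The 2×2 control above amounts to a handful of byte comparisons. A minimal sketch, assuming each run's generated text has already been captured as `bytes`; the function names and the toy strings are illustrative and are not what `tools/pp_diag.sh` actually does.

```python
def first_difference(a: bytes, b: bytes):
    """Byte offset of the first mismatch, or None if the outputs are identical."""
    if a == b:
        return None
    for i, (x, y) in enumerate(zip(a, b)):
        if x != y:
            return i
    return min(len(a), len(b))  # one output is a strict prefix of the other


def control_2x2(single_runs, pp_runs):
    """Compare two single-GPU runs and two PP runs of the same prompt."""
    single_a, single_b = single_runs
    pp_a, pp_b = pp_runs
    return {
        "single A == single B": single_a == single_b,
        "pp A == pp B": pp_a == pp_b,
        "single A == pp A": single_a == pp_a,
        # PP is cleared if it reproduces at least one single-GPU trajectory.
        "pp on a single-GPU trajectory": any(
            p == s for p in pp_runs for s in single_runs
        ),
    }


# Toy outputs standing in for generated text; not real model output.
single = (b"step 1 ... answer: 42", b"step 1 ... answer: 41")
pp4 = (b"step 1 ... answer: 42", b"step 1 ... answer: 42")

for name, ok in control_2x2(single, pp4).items():
    print(f"{name}: {ok}")
print("single A vs single B first differ at byte", first_difference(*single))
```

The last check is the one that matters: a one-shot `single == pp` comparison can fail purely because the single-GPU baseline moved, whereas matching any single-GPU run shows PP landed on a valid trajectory.
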
## Quality matrix — AIME 2025 (30) + GSM8K (30), greedy, both engines × PP=1/2/4
Full measured matrix (`tools/bench/summarize_fullq.py`; raw in
`bench-out/FULLQ_SUMMARY.txt`). Qwen3-8B BF16, thinking OFF, `max_seq_len 4096`.
xserv on GPUs 0-3, llama.cpp on GPUs 4-7 (disjoint groups, run in parallel).
| engine | PP | AIME 2025 | GSM8K | AIME mean_tok | TTFT_ms | TPOT_ms |
|--------|----|-----------|-------|---------------|---------|---------|
| xserv | 1 | 8/30 (26.7%) | 29/30 (96.7%) | 2383 | 485 | 22.42 |
| xserv | 2 | 7/30 (23.3%) | 29/30 (96.7%) | 2367 | 457 | 22.55 |
| xserv | 4 | 7/30 (23.3%) | 29/30 (96.7%) | 2652 | 494 | 23.31 |
| llama | 1 | 7/30 (23.3%) | 29/30 (96.7%) | 2651 | 119 | 10.37 |
| llama | 2 | 7/30 (23.3%) | 29/30 (96.7%) | 2651 | 118 | 10.41 |
| llama | 4 | 7/30 (23.3%) | 29/30 (96.7%) | 2651 | 119 | 10.39 |
Reading the matrix:
- **GSM8K = 29/30 (96.7%) in every cell** — identical across both engines and all
PP levels. xserv's accuracy matches llama.cpp exactly on the same weights.
- **AIME = 7/30 (23.3%) everywhere except xserv PP=1 (8/30)**. That single +1 is
the run-to-run greedy nondeterminism documented above (an AIME solution is
~2400 tokens; one late argmax flip changes one problem's outcome) — not a PP or
engine effect. AIME accuracy is low because this is an 8B model with thinking
disabled; the point here is the *cross-engine / cross-PP agreement*, which holds.
- **TPOT is flat across PP** for both engines (xserv 22.4→23.3 ms, llama
10.37→10.41 ms), reconfirming that PP doesn't slow single-stream decode. The ~2.2×
TPOT gap to llama.cpp is the single-GPU gap (`llama-cpp-comparison.md`),
orthogonal to PP.
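
The "flat across PP" and "~2.2× gap" readings can be checked directly from the TPOT column. The sketch below uses the matrix values as-is; `pp_drift_percent` and `engine_gap` are hypothetical helper names, not functions from `tools/bench/summarize_fullq.py`.

```python
# TPOT_ms from the quality matrix, keyed by engine and PP degree.
TPOT_MS = {
    "xserv": {1: 22.42, 2: 22.55, 4: 23.31},
    "llama": {1: 10.37, 2: 10.41, 4: 10.39},
}


def pp_drift_percent(engine: str, pp: int) -> float:
    """Relative TPOT change at this PP versus the same engine at PP=1."""
    base = TPOT_MS[engine][1]
    return 100.0 * (TPOT_MS[engine][pp] - base) / base


def engine_gap(pp: int) -> float:
    """xserv TPOT divided by llama.cpp TPOT at the same PP."""
    return TPOT_MS["xserv"][pp] / TPOT_MS["llama"][pp]


for pp in (1, 2, 4):
    print(f"PP={pp}: xserv drift {pp_drift_percent('xserv', pp):+.1f}%, "
          f"llama drift {pp_drift_percent('llama', pp):+.1f}%, "
          f"xserv/llama = {engine_gap(pp):.2f}x")
```

On these numbers xserv's TPOT moves by about 4% from PP=1 to PP=4 and llama.cpp's by well under 1%, while the cross-engine ratio stays between roughly 2.16× and 2.24×.
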
## Takeaways
- **Memory is the win.** Per-GPU weights+KV scale ~1/P: xserv 24.0 GB (1 GPU) →
~11-14 GB (PP=2) → ~5-9 GB (PP=4); llama 15.6 → ~8 → ~4-5 GB. The two end
stages sit higher (stage 0 holds `embed_tokens`, the last stage `norm`+`lm_head`,
~1.1 GB each). This is what PP buys: a model / context that does not fit on one
card fits across P.
- **Single-stream latency is flat, not faster.** v1 PP is serial across stages
(no microbatch overlap): per-token latency = sum of all stages' compute +
(P-1) P2P hops + a blocking sync per stage. The `[1, hidden]` BF16 hop (8 KB)
over PCIe is cheap relative to per-token compute, so TPOT is ~constant across P.
PP does **not** speed up single-stream decode; it trades (almost no) latency for
large memory headroom.
- **Quality is preserved and matches llama.cpp.** GSM8K 96.7% in all 6 engine × PP
cells; AIME within the greedy noise band. PP=1/2/4 agree, and xserv tracks llama.cpp.
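
The serial latency model in the second bullet can be written out as a small calculation. This is a sketch under stated assumptions: the hidden size of 4096 is assumed for Qwen3-8B (it is consistent with the 8 KB hop quoted above), and the hop and sync costs are placeholders chosen for illustration, not measurements.

```python
BF16_BYTES = 2
HIDDEN_SIZE = 4096  # assumption: Qwen3-8B hidden width


def hop_bytes(hidden_size: int = HIDDEN_SIZE) -> int:
    """Size of one [1, hidden] BF16 hidden-state hand-off."""
    return hidden_size * BF16_BYTES


def serial_pp_token_latency_ms(stage_compute_ms, hop_ms, sync_ms):
    """Sum of stage compute + (P-1) P2P hops + one blocking sync per stage."""
    p = len(stage_compute_ms)
    return sum(stage_compute_ms) + (p - 1) * hop_ms + p * sync_ms


TOTAL_COMPUTE_MS = 17.39      # xserv PP=1 TPOT, used as a stand-in for total compute
HOP_MS, SYNC_MS = 0.02, 0.05  # hypothetical per-hop and per-sync costs

print(f"hop payload: {hop_bytes()} bytes")
for p in (1, 2, 4):
    stages = [TOTAL_COMPUTE_MS / p] * p  # uniform split of the same total work
    latency = serial_pp_token_latency_ms(stages, HOP_MS, SYNC_MS)
    print(f"P={p}: modelled per-token latency {latency:.2f} ms")
```

Because the total compute term does not shrink with P and the hop and sync terms are small, the modelled latency barely moves, which is the same shape as the measured TPOT column.
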
## Reproduce
```bash
./tools/sync-and-build.sh build
# latency + VRAM + byte-exact correctness (writes bench-out/PP_FINAL.md):
ssh <host> 'cd <repo> && bash tools/pp_final.sh'
# determinism control (single×2 vs pp4×2):
ssh <host> 'cd <repo> && bash tools/pp_diag.sh'
# NCCL P2P + AllReduce unit tests:
ssh <host> 'cd <repo> && cargo test -p xserv-distributed --release'
# full quality matrix AIME-30 + GSM8K-30 (xserv 0-3 serial; or parallel w/ llama 4-7):
ssh <host> 'cd <repo> && bash tools/pp_quality_full.sh' # xserv+llama serial, GPU 0-3
ssh <host> 'cd <repo> && bash tools/pp_llama_47.sh' # llama on GPU 4-7 (parallel)
python3 tools/bench/summarize_fullq.py bench-out
```
## Next (where PP actually raises throughput)
- **Microbatch / 1F1B overlap**: while stage 1 runs microbatch A, stage 0 runs B.
This is the only thing that turns PP into a *throughput* win; v1 is serial, so
P GPUs give 1 GPU's single-stream rate (but P× the memory headroom / batch room).
- Persistent per-stage recv buffers (drop the per-token CPU alloc + H2D) and
event-based ordering instead of a full device sync per hop.
- 2D TP×PP, and `layers % P != 0` non-uniform splits.
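
To make the microbatch-overlap item concrete, here is a toy forward-only schedule. It assumes every stage takes one equal time step per microbatch and ignores hop cost; the function names are illustrative, and this is not a description of a planned xserv implementation. In decode, the microbatches would have to come from different sequences, since a sequence's next token depends on its previous one.

```python
def overlapped_schedule(num_stages: int, num_microbatches: int):
    """grid[t][s] = microbatch index run by stage s at step t, or None if idle."""
    steps = num_stages + num_microbatches - 1
    return [
        [t - s if 0 <= t - s < num_microbatches else None
         for s in range(num_stages)]
        for t in range(steps)
    ]


def serial_steps(num_stages: int, num_microbatches: int) -> int:
    """v1 behaviour: a microbatch clears every stage before the next one starts."""
    return num_stages * num_microbatches


def overlapped_steps(num_stages: int, num_microbatches: int) -> int:
    """Pipelined: fill the pipe once, then one microbatch completes per step."""
    return num_stages + num_microbatches - 1


P, M = 4, 8
for t, row in enumerate(overlapped_schedule(P, M)):
    cells = " ".join("." if mb is None else chr(ord("A") + mb) for mb in row)
    print(f"step {t:2d}: {cells}")
print(f"serial: {serial_steps(P, M)} steps, "
      f"overlapped: {overlapped_steps(P, M)} steps")
```

Each printed row is one time step and each column one stage; idle slots (`.`) appear only while the pipe fills and drains, so the step count approaches M rather than P×M as M grows.
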
🤖 Generated with [Claude Code](https://claude.com/claude-code)