xserv

Gahow Wang 11e0154e4d docs: Phase 18 pipeline parallelism — design + benchmark results

docs/18-pipeline-parallelism.md: PP design (layer split, NCCL P2P,
per-stage KV, engine/threading model).
docs/benchmarks/pp-sweep.md: measured on dash5 (8x RTX 5090, Qwen3-8B
BF16) — single-stream latency + per-GPU VRAM (~1/N), byte-exact
correctness (single x2 vs pp4 x2 control), and the full AIME-30 +
GSM8K-30 quality matrix (xserv & llama.cpp PP=1/2/4): GSM8K 29/30 in
every cell, TPOT flat across PP.
README: multi-card (TP/PP) section + roadmap to Phase 18.
gitignore: /.claude/ runtime state.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

2026-05-29 18:57:09 +08:00

llama-cpp-comparison.md

docs: update llama.cpp comparison with 8192 results (OOM fixed)

2026-05-28 21:32:14 +08:00

phase8-gpt2-baseline.md

phase 8: add benchmark framework + baseline results

2026-05-21 23:29:41 +08:00

phase9-kv-cache.md

phase 9: KV cache + autoregressive generation

2026-05-21 23:39:41 +08:00

phase10-qwen3.md

phase 10: add Qwen3-8B benchmark + performance fix