xserv/.gitignore
Gahow Wang 11e0154e4d docs: Phase 18 pipeline parallelism — design + benchmark results
docs/18-pipeline-parallelism.md: PP design (layer split, NCCL P2P,
per-stage KV, engine/threading model).
docs/benchmarks/pp-sweep.md: measured on dash5 (8x RTX 5090, Qwen3-8B
BF16) — single-stream latency + per-GPU VRAM (~1/N), byte-exact
correctness (single x2 vs pp4 x2 control), and the full AIME-30 +
GSM8K-30 quality matrix (xserv & llama.cpp PP=1/2/4): GSM8K 29/30 in
every cell, TPOT flat across PP.
README: multi-card (TP/PP) section + roadmap to Phase 18.
gitignore: /.claude/ runtime state.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-05-29 18:57:09 +08:00

/target
*.o
*.so
*.a
*.ptx
*.cubin
**/*.rs.bk
.env
*.npy
# llama.cpp baseline (cloned/submoduled by tools/setup-llama-cpp.sh)
/third_party/llama.cpp/build/
/third_party/llama.cpp/models/
*.gguf
# Claude Code runtime state
/.claude/
# Benchmark output + fetched datasets (transferred to GPU host, not committed)
/bench-out/
/tools/bench/data/
/tools/__pycache__/
/tools/bench/__pycache__/
/tools/bench/**/__pycache__/