docs/18-pipeline-parallelism.md: PP design (layer split, NCCL P2P, per-stage KV, engine/threading model). docs/benchmarks/pp-sweep.md: measured on dash5 (8x RTX 5090, Qwen3-8B BF16) — single-stream latency + per-GPU VRAM (~1/N), byte-exact correctness (single x2 vs pp4 x2 control), and the full AIME-30 + GSM8K-30 quality matrix (xserv & llama.cpp PP=1/2/4): GSM8K 29/30 in every cell, TPOT flat across PP. README: multi-card (TP/PP) section + roadmap to Phase 18. gitignore: /.claude/ runtime state. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
26 lines
421 B
Plaintext
26 lines
421 B
Plaintext
/target
|
|
*.o
|
|
*.so
|
|
*.a
|
|
*.ptx
|
|
*.cubin
|
|
**/*.rs.bk
|
|
.env
|
|
*.npy
|
|
|
|
# llama.cpp baseline (cloned/submoduled by tools/setup-llama-cpp.sh)
|
|
/third_party/llama.cpp/build/
|
|
/third_party/llama.cpp/models/
|
|
*.gguf
|
|
|
|
# Claude Code runtime state
|
|
/.claude/
|
|
|
|
# Benchmark output + fetched datasets (transferred to GPU host, not committed)
|
|
/bench-out/
|
|
/tools/bench/data/
|
|
/tools/__pycache__/
|
|
/tools/bench/__pycache__/
|
|
/tools/bench/**/__pycache__/
|
|
|