xserv

gahow/xserv


Commit Graph


Author

SHA1

Message

Date

Gahow Wang

2a92f268a9

docs: fill the Phase 19 gap, refresh README/roadmap to actual state

- docs/19-gpt-oss-moe.md: the numbered series jumped 18->20; write up
  gpt-oss arch deltas, harmony pitfalls, and the two CUDA debugging
  postmortems (fully-masked-tile NaN in flash-attention sinks;
  pre-__syncthreads early return reading uninitialized smem in the
  decode GEMV) — the highest-value learning content of that phase.
- README: models/perf/capabilities were frozen at the Qwen3-only era;
  now lists gpt-oss MoE, TP/PP, FP8/MXFP4, sparse MoE, and the
  llama.cpp standing.
- Roadmap: record where reality diverged from the plan at Phase 18+,
  add milestone entries and the ranked next-phase candidates
  (21 CUDA-graph MoE decode, 22 non-expert quant, 23 sparse prefill).
- sparse-moe benchmark doc: post-review-fix numbers.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

2026-06-12 17:02:59 +08:00

Gahow Wang

11e0154e4d

docs: Phase 18 pipeline parallelism — design + benchmark results

docs/18-pipeline-parallelism.md: PP design (layer split, NCCL P2P,
per-stage KV, engine/threading model).
docs/benchmarks/pp-sweep.md: measured on dash5 (8x RTX 5090, Qwen3-8B
BF16) — single-stream latency + per-GPU VRAM (~1/N), byte-exact
correctness (single x2 vs pp4 x2 control), and the full AIME-30 +
GSM8K-30 quality matrix (xserv & llama.cpp PP=1/2/4): GSM8K 29/30 in
every cell, TPOT flat across PP.
README: multi-card (TP/PP) section + roadmap to Phase 18.
gitignore: /.claude/ runtime state.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

2026-05-29 18:57:09 +08:00

Gahow Wang

14a44b503e

docs: add Chinese README (overview + usage)

Project intro, architecture, build, basic usage (HTTP server / CLI / bench),
and the llama.cpp comparison workflow.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

2026-05-28 21:38:20 +08:00

3 Commits