Gahow Wang a67753f516 softmax: cap block size at 512 threads

launch_softmax_{f32,bf16} clamped block to 1024 threads when cols was
larger. Halving the ceiling to 512 keeps two blocks per SM resident on
the large vocab kernels that dominate speculative verify workloads
without changing rows/block indexing, and never exceeds cols.
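
The clamp described in this commit message can be sketched in Rust as follows. This is a minimal illustration and not the actual launcher under csrc: the names `softmax_block_threads` and `MAX_BLOCK_THREADS` are hypothetical, `cols` is assumed to be at least 1, and any rounding the real kernel applies (for example to a warp multiple) is not shown.

```rust
/// Hypothetical ceiling on threads per block (the commit lowers it from 1024 to 512).
const MAX_BLOCK_THREADS: u32 = 512;

/// Pick the thread count for one softmax block: one thread per column,
/// capped at the ceiling, so the result never exceeds `cols`.
fn softmax_block_threads(cols: u32) -> u32 {
    cols.min(MAX_BLOCK_THREADS)
}
```

With this sketch a 128-column row launches 128 threads, while a large-vocabulary row launches 512.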

2026-07-01 14:16:32 +08:00

crates

xserv-cli: expose sampling params and greedy repetition penalty

2026-07-01 14:16:31 +08:00

csrc

softmax: cap block size at 512 threads

2026-07-01 14:16:32 +08:00

docs

speculative: Qwen3 draft-model v0 with paged verify parity

2026-07-01 14:16:30 +08:00

third_party

tools: add llama.cpp comparison baseline + standard benchmark suite

2026-05-28 11:18:52 +08:00

tools

server: serve gpt-oss on a single GPU via the TP engine (world=1)

2026-06-12 16:29:10 +08:00

.gitignore

docs: Phase 18 pipeline parallelism — design + benchmark results

2026-05-29 18:57:09 +08:00

.gitmodules

tools: add llama.cpp comparison baseline + standard benchmark suite

2026-05-28 11:18:52 +08:00

Cargo.lock

server: Jinja chat template rendering via minijinja

2026-05-31 13:23:18 +08:00

Cargo.toml

server: Jinja chat template rendering via minijinja

2026-05-31 13:23:18 +08:00

README.md

docs: Phase 21 — decode CUDA graph + GPU argmax results

2026-06-12 20:12:37 +08:00

README.md

xserv

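An LLM inference engine built from scratch in Rust + CUDA, with the goal of understanding the full LLM serving stack in depth.

xserv does not depend on off-the-shelf frameworks such as PyTorch, vLLM, or TensorRT. It implements its own tensor abstraction, CUDA kernels, tokenizer, model forward pass, KV cache, scheduler, and OpenAI-compatible HTTP server. It supports Qwen3-8B (BF16) and gpt-oss-20b (MoE, with BF16/FP8/MXFP4 quantization), multi-GPU TP/PP, and ships a standard benchmark suite that compares correctness and performance against llama.cpp.

Status at a glance

Models: GPT-2 (124M), Qwen3-8B (BF16), gpt-oss-20b (32-expert top-4 MoE, harmony format)
Performance (RTX 5090, greedy, single stream):
- Qwen3-8B BF16, single GPU: about 56 tok/s (1.4× HF transformers)
- gpt-oss-20b FP8 sparse MoE + CUDA Graph decode: TPOT 5.8 ms (~172 tok/s, same speed at TP=1 and TP=2); in the same configuration, TP=2 is faster than llama.cpp across the board (1.26-1.47×); llama's single-GPU mode (2.8 ms) still leads, by about 2.0×
Accuracy: on the full GSM8K set, on par with llama.cpp using the same weights (94.5% vs 94.4%); no regression from FP8/MXFP4 quantization
Serving: OpenAI-compatible /v1/chat/completions with SSE streaming; after quantization, gpt-oss can be served on a single 32 GB GPU
Key capabilities: hand-written GEMM / Flash-Attention 2 (SM120, with attention sinks + sliding window) / Paged-Attention kernels, paged KV cache (with swap-out/swap-in to CPU), continuous batching, CUDA Graph decode (Qwen3 single-GPU + gpt-oss full-path whole-graph replay), Tensor/Pipeline parallelism (NCCL, TP=1/2/4, PP=2/4), FP8 W8A8 / MXFP4 W4A16 quantization, sparse top-k MoE decode (only the routed experts are computed)

This is primarily a learning project. It advances Phase by Phase, with numerical and end-to-end validation at every step.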

Architecture

xserv/
├── csrc/                  # CUDA sources (.cu/.cuh)
│   ├── gemm/              #   GEMM (naive / tiled / gemv)
│   ├── attention/         #   Flash-Attention 2 (SM120), Paged-Attention, causal mask
│   ├── normalization/     #   LayerNorm / RMSNorm
│   ├── activation/        #   GELU / SiLU / gpt-oss GLU
│   ├── embedding/         #   embedding lookup / RoPE / transpose
│   ├── moe/               #   MoE top-k routing, sparse expert GEMV, weighted sum
│   ├── quantization/      #   FP8 quantize/dequantize, cuBLASLt FP8 GEMM, MXFP4 GEMV
│   └── reduce/            #   softmax
├── crates/
│   ├── xserv-cuda/        # CUDA FFI, Stream, device memory allocator, pinned memory, CUDA Graph
│   ├── xserv-tensor/      # Tensor type (strided layout, BF16/F16/F32, CPU↔GPU)
│   ├── xserv-kernels/     # kernel registry (switchable between hand-written kernels and cuBLAS)
│   ├── xserv-tokenizer/   # BPE tokenizer
│   ├── xserv-distributed/ # NCCL FFI, TP context (AllReduce)
│   ├── xserv-model/       # model definitions (GPT-2 / Qwen3 / gpt-oss MoE), weight loading, KV cache, sampling
│   └── xserv-server/      # tokio + axum HTTP server, scheduler, TP/PP engines
├── tools/                 # helper scripts + benchmark suite (see below)
└── docs/                  # per-Phase design docs + benchmark reports

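Requirements

GPU: NVIDIA, compute capability SM120 (RTX 5090 / Blackwell). Other architectures require adjusting CUDA_ARCH.
CUDA Toolkit: 12.9 (nvcc must be on PATH; building the .cu files depends on it)
Rust: edition 2024 (a recent stable toolchain is recommended)
Models: HuggingFace directory layout (with config.json, tokenizer.json, *.safetensors)

Build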

export CUDA_HOME=/usr/local/cuda-12.9
export PATH=$CUDA_HOME/bin:$PATH
cargo build --release

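If you have no local GPU/CUDA, the remote build script can sync the code to a GPU machine and build, run, or test it there:

./tools/sync-and-build.sh build      # remote cargo build --release
./tools/sync-and-build.sh test       # remote cargo test

(The remote host, directory, and model path are configured at the top of tools/sync-and-build.sh.)

Basic usage

1. Start the HTTP server (OpenAI-compatible)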

./target/release/xserv-server /path/to/qwen3-8b \
    --port 8080 \
    --max-batch 4 \
    --max-seq-len 8192 \
    --swap-space-gb 8

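Parameters:

Parameter	Meaning	Default
`--port`	Listening port	8080
`--max-batch`	Decode batch size (concurrency limit)	4
`--max-seq-len`	Maximum length of a single sequence	2048
`--swap-space-gb`	Size of the pinned CPU memory used for KV swap-out (0 disables it)	8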
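
To get a feel for how `--max-batch` and `--max-seq-len` translate into KV cache memory, the following is a rough sizing sketch. It is not xserv's allocator: the `KvShape` type and its methods are hypothetical, and the Qwen3-8B shape used here (36 layers, 8 KV heads, head dimension 128, BF16) is an assumption that should be checked against the model's config.json.

```rust
/// Model shape needed to size a KV cache. All fields are illustrative.
struct KvShape {
    layers: u64,
    kv_heads: u64,
    head_dim: u64,
    bytes_per_elem: u64, // 2 for BF16
}

impl KvShape {
    /// Bytes of K plus V stored for one token across all layers.
    fn bytes_per_token(&self) -> u64 {
        2 * self.layers * self.kv_heads * self.head_dim * self.bytes_per_elem
    }

    /// Worst-case KV bytes when every slot in the batch reaches `max_seq_len`.
    fn worst_case_bytes(&self, max_batch: u64, max_seq_len: u64) -> u64 {
        self.bytes_per_token() * max_batch * max_seq_len
    }
}

/// Assumed Qwen3-8B shape; verify against the model's config.json.
fn qwen3_8b_assumed() -> KvShape {
    KvShape { layers: 36, kv_heads: 8, head_dim: 128, bytes_per_elem: 2 }
}
```

Under these assumptions one token costs 144 KiB of KV, so `--max-batch 4 --max-seq-len 8192` needs 4.5 GiB in the worst case, and the defaults (4 × 2048) need about 1.1 GiB. A paged cache only allocates blocks as sequences grow, so actual usage is usually lower.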

Example request (streaming):

curl http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "qwen3-8b",
    "messages": [{"role": "user", "content": "Explain the attention mechanism in one sentence"}],
    "max_tokens": 256,
    "temperature": 0,
    "stream": true
  }'

Other endpoints: GET /health, GET /v1/models.
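
With `"stream": true` the response arrives as Server-Sent Events. In the OpenAI streaming convention each event is a `data: ` line carrying a JSON chunk, and the stream ends with `data: [DONE]`. The sketch below assumes xserv follows that convention because it describes itself as OpenAI-compatible. The helper name `sse_payloads` is hypothetical, and the code handles only an already-received body, not the HTTP transport.

```rust
/// Collect the JSON payloads from a raw SSE body, stopping at the `[DONE]` sentinel.
fn sse_payloads(body: &str) -> Vec<&str> {
    let mut out = Vec::new();
    for line in body.lines() {
        // Only `data:` lines carry payloads; blank lines separate events.
        let Some(rest) = line.strip_prefix("data:") else { continue };
        let payload = rest.trim();
        if payload == "[DONE]" {
            break;
        }
        if !payload.is_empty() {
            out.push(payload);
        }
    }
    out
}
```

Each returned string can then be handed to a JSON parser to read the incremental text of that chunk.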

2. Command-line inference

# Single-turn generation
cargo run --release --bin xserv-cli -- /path/to/qwen3-8b --max-tokens 256

# Interactive multi-turn chat
cargo run --release --bin xserv-chat -- /path/to/qwen3-8b

3. Single-machine performance benchmark

# Prints TTFT / TBT / TPOT for each prompt (JSON)
cargo run --release --bin bench-qwen3 -- /path/to/qwen3-8b --gen-tokens 64 [--cuda-graph]
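
For readers unfamiliar with these metrics, here is a minimal sketch of how they are commonly derived from token arrival times: TTFT is the delay to the first token, TBT is the gap between consecutive tokens, and TPOT is the mean of those gaps. How bench-qwen3 computes them is not stated here, so these definitions, the `Latency` type, and both function names are assumptions.

```rust
/// Latency metrics for one generation, all in milliseconds.
#[derive(Debug, PartialEq)]
struct Latency {
    ttft_ms: f64,
    tbt_ms: Vec<f64>,
    tpot_ms: f64,
}

/// `start_ms` is when the request was sent; `token_ms` holds the arrival
/// time of each generated token. Returns None if fewer than two tokens arrived.
fn latency(start_ms: f64, token_ms: &[f64]) -> Option<Latency> {
    if token_ms.len() < 2 {
        return None;
    }
    let ttft_ms = token_ms[0] - start_ms;
    // Gaps between consecutive tokens.
    let tbt_ms: Vec<f64> = token_ms.windows(2).map(|w| w[1] - w[0]).collect();
    // Mean gap after the first token.
    let tpot_ms = tbt_ms.iter().sum::<f64>() / tbt_ms.len() as f64;
    Some(Latency { ttft_ms, tbt_ms, tpot_ms })
}

/// Convert a per-token latency into single-stream throughput.
fn tokens_per_second(tpot_ms: f64) -> f64 {
    1000.0 / tpot_ms
}
```

As a sanity check, `tokens_per_second(5.8)` is roughly 172.4, in line with the 5.8 ms and ~172 tok/s figures quoted for gpt-oss-20b in this README.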

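Benchmark comparison with llama.cpp

tools/bench/ provides a one-command comparison suite that puts xserv and llama.cpp (same BF16 weights) under the same load and compares them as black boxes through the OpenAI API:

Performance: TTFT, TPOT, throughput (single stream + various concurrency levels)
Accuracy: AIME 2025, GSM8K (standard datasets, exact-match scoring)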
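
Exact-match scoring can be sketched as below. GSM8K reference solutions end with `#### <answer>`. Taking the last number in the model's response as its prediction is an assumption made for this sketch, and every function name is hypothetical; the scorer in tools/bench may extract and normalize answers differently.

```rust
/// Extract the reference answer from a GSM8K solution (the text after "####").
fn gsm8k_reference(solution: &str) -> Option<String> {
    let (_, tail) = solution.rsplit_once("####")?;
    Some(normalize(tail))
}

/// Take the last number that appears in a model response.
fn last_number(response: &str) -> Option<String> {
    response
        .split(|c: char| !(c.is_ascii_digit() || c == '.' || c == ',' || c == '-'))
        .filter(|tok| tok.chars().any(|c| c.is_ascii_digit()))
        .last()
        .map(normalize)
}

/// Strip whitespace, thousands separators, and a trailing period.
fn normalize(s: &str) -> String {
    s.trim().replace(',', "").trim_end_matches('.').to_string()
}

/// Exact match: the normalized prediction must equal the normalized reference.
fn exact_match(response: &str, solution: &str) -> bool {
    match (last_number(response), gsm8k_reference(solution)) {
        (Some(p), Some(r)) => p == r,
        _ => false,
    }
}
```

The comparison is purely textual, so `72.0` and `72` would not match here; a stricter scorer would parse both sides as numbers first.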

# One-time setup (on a machine with internet access): fetch the llama.cpp submodule + download the datasets
git submodule update --init third_party/llama.cpp        # pinned to tag b9371
HF_ENDPOINT=https://hf-mirror.com python3 -m tools.bench.fetch_datasets

# One-command comparison (build llama.cpp + convert to GGUF + build xserv + run both suites + generate the report)
./tools/sync-and-build.sh bench -- --max-seq-len 8192 --quality-limit 50
./tools/sync-and-build.sh fetch-bench-out
# Report artifacts: bench-out/comparison-<timestamp>.{md,json}

Design details are in docs/16-llama-cpp-comparison.md; the results report is in docs/benchmarks/llama-cpp-comparison.md.

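Documentation

docs/00-roadmap.md: overall roadmap and the design of each Phase
docs/01..15-*.md: per-Phase design documents covering CUDA FFI / Tensor / GEMM / Attention / KV cache / performance optimization and more
docs/16-llama-cpp-comparison.md: design of the llama.cpp comparison benchmark
docs/17-tensor-parallelism.md: tensor parallelism (TP) design
docs/18-pipeline-parallelism.md: pipeline parallelism (PP) design
docs/benchmarks/: benchmark reports for each stage (including pp-sweep.md)

Multi-GPU parallelism (TP / PP)

Single machine, multiple GPUs, built on NCCL (crate xserv-distributed). The two partitioning schemes are orthogonal, and you pick one of them:

Tensor parallelism --tp N: splits every layer along the head / intermediate dimension and aggregates within the layer using AllReduce (2 × the number of layers per token).
Pipeline parallelism --pp N: splits the layers into N stages; adjacent stages pass the hidden state via NCCL P2P (only N-1 transfers per token). The communication volume is far smaller than with AllReduce, which makes it friendlier to PCIe systems without NVLink.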
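
The per-token communication counts quoted for the two schemes can be written down directly. This is a back-of-the-envelope sketch with hypothetical function names. The Qwen3-8B values used in the note after the code (36 layers, hidden size 4096, BF16) are assumptions to be checked against the model's config.json.

```rust
/// AllReduce calls per generated token under tensor parallelism (two per layer).
fn tp_allreduces_per_token(num_layers: u32) -> u32 {
    2 * num_layers
}

/// P2P hand-offs per generated token under pipeline parallelism (one per stage boundary).
fn pp_transfers_per_token(num_stages: u32) -> u32 {
    num_stages.saturating_sub(1)
}

/// Bytes moved by one single-token hidden-state hand-off between pipeline stages.
fn hidden_state_bytes(hidden_size: u64, bytes_per_elem: u64) -> u64 {
    hidden_size * bytes_per_elem
}
```

Under those assumptions, TP issues 72 AllReduce calls per decoded token, while PP=4 issues 3 point-to-point transfers of 8 KiB each.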

# GPUs 0-3 as one group: 4-GPU tensor parallelism / 4-GPU pipeline parallelism
CUDA_VISIBLE_DEVICES=0,1,2,3 ./target/release/xserv-server /path/to/qwen3-8b --tp 4
CUDA_VISIBLE_DEVICES=0,1,2,3 ./target/release/xserv-server /path/to/qwen3-8b --pp 4

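PP measurements (dash5, Qwen3-8B BF16, single-stream greedy; per-GPU memory is weights + a minimal KV pool):

Config	TTFT	TPOT	tok/s	Per-GPU memory
Single GPU	33ms	17.4ms	57.5	24.0 GB
PP=2	36ms	18.1ms	55.3	11.6 / 13.6 GB
PP=4	36ms	17.9ms	55.8	7.3 / 5.3 / 5.3 / 9.4 GB

Quality comparison (30 AIME 2025 problems + 30 GSM8K problems, greedy; xserv on GPUs 0-3 and llama.cpp on GPUs 4-7, run in parallel):

Engine	PP	AIME	GSM8K
xserv	1/2/4	8 / 7 / 7 (/30)	29/30 (96.7%), all identical
llama	1/2/4	7 / 7 / 7 (/30)	29/30 (96.7%), all identical

Correctness: the hidden state crosses stage boundaries as a bit-exact BF16 P2P copy, and PP=4 output is byte-for-byte identical to single-GPU output. This was confirmed with a "single GPU ×2 vs PP=4 ×2" comparison: single-GPU output itself varies run to run because of cuBLAS non-determinism, whereas PP=4 is reproducible and lands on one of the single-GPU trajectories.

All 12 GSM8K cells are 29/30, so xserv and llama.cpp agree exactly. The ±1 on AIME reflects the sensitivity of greedy decoding to GEMM jitter over long generations; it is not a PP or engine effect.

The benefit is in memory (per-GPU weights + KV ≈ 1/N). v1 is a serial pipeline, so single-stream TPOT is roughly flat and no better than a single GPU; real throughput gains require microbatch / 1F1B overlap in later work. Full data is in docs/benchmarks/pp-sweep.md.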
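
As an illustration of splitting layers into N pipeline stages, the sketch below assigns contiguous layer ranges whose sizes differ by at most one. How xserv actually partitions layers is not described in this README, so the even split and the name `stage_layers` are assumptions.

```rust
use std::ops::Range;

/// Split `num_layers` into `num_stages` contiguous ranges whose sizes differ by at most one.
fn stage_layers(num_layers: usize, num_stages: usize) -> Vec<Range<usize>> {
    assert!(num_stages > 0, "need at least one stage");
    let base = num_layers / num_stages;
    let extra = num_layers % num_stages; // the first `extra` stages get one more layer
    let mut start = 0;
    (0..num_stages)
        .map(|stage| {
            let len = base + usize::from(stage < extra);
            let range = start..start + len;
            start += len;
            range
        })
        .collect()
}
```

For a 36-layer model and 4 stages this yields four ranges of 9 layers each. In a typical pipeline the first stage also holds the embedding and the last stage the output head, which would be consistent with the end stages using more memory than the middle ones in the table above.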

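Roadmap (excerpt)

Completed Phases 0–21: CUDA infrastructure → Tensor → GEMM → Transformer kernels → Attention → model loading → tokenizer → GPT-2 → KV cache → Qwen3-8B → Paged Attention → continuous batching → HTTP API → Flash Attention 2 → performance optimization → tensor parallelism (TP) → pipeline parallelism (PP) → gpt-oss MoE + FP8/MXFP4 quantization → sparse top-k MoE decode → decode CUDA Graph whole-graph replay. Supporting infrastructure such as the llama.cpp comparison benchmark and KV swap-out to CPU was added along the way.

Future directions: quantization of non-expert weights (lm_head/qkv/o), sparse prefill (grouped GEMM), server-side harmony channel separation, PP microbatch/1F1B, speculative decoding, multimodality. See the progress log in docs/00-roadmap.md for details.

License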

MIT
