dash5, gpt-oss-20b FP8, warm-server vs llama.cpp MXFP4 (6 reps): TP=2 TPOT 5.76-5.89 vs 7.42-8.45 ms (xserv 1.26-1.47x), TTFT 2.4x ahead short/medium; TP=1 5.78-5.95 vs 2.80-3.22 ms (gap 2.5x -> 2.0x, TTFT now ahead short/medium). GSM8K-50 through the graph path: 94%. Lesson recorded: graphs bought ~0.6 ms (launches were already hidden by async execution), the GPU argmax ~1 ms — measure, don't guess. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
1804 lines
62 KiB
Markdown
1804 lines
62 KiB
Markdown
# xserv — LLM Inference Engine Roadmap
|
||
|
||
> 从零用 Rust + CUDA 构建一个完整的 LLM 推理引擎,目标是深入理解 LLM Serving 全栈技术。
|
||
|
||
## 设计决策
|
||
|
||
| 决策项 | 选择 | 备注 |
|
||
|--------|------|------|
|
||
| 抽象层级 | Level 0.5 | 自写 CUDA kernel + cuBLAS 可切换,便于 benchmark 对比 |
|
||
| 硬件 | 8×RTX 5090 (Blackwell, CC 12.0, 32GB GDDR7) | 纯 PCIe Gen5 x16 互联,无 NVLink (详见下方硬件拓扑) |
|
||
| 语言 | Rust + CUDA (C/C++) | Rust FFI 调用 CUDA |
|
||
| 起步模型 | GPT-2 124M → Qwen3-8B | 从简单到实用 |
|
||
| 精度 | BF16/FP16 | 后期扩展 FP8 |
|
||
| Tensor | 自己实现 | 完整学习 tensor 抽象设计 |
|
||
| Tokenizer | 自己实现 BPE | 学习分词机制 |
|
||
| 权重格式 | safetensors | Rust 友好,零拷贝 mmap |
|
||
| Async Runtime | tokio | 成熟稳定,不引入性能问题 |
|
||
| API | OpenAI 兼容 | `/v1/chat/completions`,SSE streaming |
|
||
| 时间线 | 不限 | 学习为主,每步验证 |
|
||
|
||
## 硬件拓扑 (dash5, 已确认 2026-05-21)
|
||
|
||
**GPU**: 8× NVIDIA GeForce RTX 5090, 32607 MiB, Compute Capability 12.0
|
||
**CUDA Toolkit**: 12.9 (安装于 `/usr/local/cuda-12.9`,需将 `bin/` 加入 PATH)
|
||
**PCIe**: Gen 5 x16 (理论单向 ~64 GB/s,空闲时降频至 Gen 1)
|
||
|
||
**互联拓扑** (`nvidia-smi topo -m`):
|
||
```
|
||
GPU0 GPU1 GPU2 GPU3 GPU4 GPU5 GPU6 GPU7
|
||
GPU0 X PHB PHB PHB NODE NODE NODE NODE
|
||
GPU1 PHB X PHB PHB NODE NODE NODE NODE
|
||
GPU2 PHB PHB X PHB NODE NODE NODE NODE
|
||
GPU3 PHB PHB PHB X NODE NODE NODE NODE
|
||
GPU4 NODE NODE NODE NODE X PHB PHB PHB
|
||
GPU5 NODE NODE NODE NODE PHB X PHB PHB
|
||
GPU6 NODE NODE NODE NODE PHB PHB X PHB
|
||
GPU7 NODE NODE NODE NODE PHB PHB PHB X
|
||
|
||
PHB = 同一 PCIe Host Bridge(同组,延迟低)
|
||
NODE = 跨 PCIe Host Bridge(跨组,延迟较高)
|
||
```
|
||
|
||
**分组**: GPU 0-3 为一组, GPU 4-7 为一组。组内 PHB 互联,跨组 NODE 互联。
|
||
|
||
**对设计的影响**:
|
||
- **无 NVLink**: AllReduce 带宽受限于 PCIe (~64 GB/s vs NVLink ~450 GB/s)
|
||
- **TP 策略**: 当前阶段目标 TP=1/2/4,在同组内 (0-3 或 4-7) 执行,全 PHB 互联
|
||
- **跨组并行 (TP=8, PP 等)**: 留待后续扩展
|
||
- **CPU Affinity**: GPU 0-3 亲和 CPU 0-127, GPU 4-7 亲和 CPU 0-186(NUMA 0-1)
|
||
|
||
## 项目结构
|
||
|
||
```
|
||
xserv/
|
||
├── Cargo.toml # workspace root
|
||
├── csrc/ # CUDA 源文件 (.cu / .cuh)
|
||
│ ├── gemm/ # GEMM kernels (naive, tiled, tensor core)
|
||
│ ├── attention/ # Attention kernels (naive, flash, paged)
|
||
│ ├── normalization/ # LayerNorm, RMSNorm
|
||
│ ├── activation/ # GELU, SiLU
|
||
│ ├── embedding/ # Embedding lookup, RoPE
|
||
│ ├── reduce/ # Softmax, argmax, sampling
|
||
│ └── quantize/ # FP8/INT8 kernels
|
||
├── crates/
|
||
│ ├── xserv-cuda/ # Phase 1: CUDA FFI, context, stream, allocator
|
||
│ ├── xserv-tensor/ # Phase 2: Tensor type, ops dispatch, DType
|
||
│ ├── xserv-kernels/ # Phase 3-5: kernel registry (custom + cuBLAS)
|
||
│ ├── xserv-tokenizer/ # Phase 7: BPE tokenizer
|
||
│ ├── xserv-model/ # Phase 6,8,10: model def + weight loading
|
||
│ ├── xserv-runtime/ # Phase 9,11,12: KV cache, paging, scheduler
|
||
│ ├── xserv-engine/ # Phase 13: inference engine orchestration
|
||
│ ├── xserv-api/ # Phase 13: HTTP server + OpenAI compat
|
||
│ ├── xserv-speculative/ # Phase 16: speculative decoding
|
||
│ └── xserv-distributed/ # Phase 17: tensor parallelism, NCCL
|
||
├── tests/ # integration tests
|
||
├── benches/ # criterion benchmarks
|
||
├── tools/ # 辅助脚本 (PyTorch reference output 生成等)
|
||
└── docs/ # 每个 phase 的设计文档
|
||
```
|
||
|
||
## Phase 依赖图
|
||
|
||
```
|
||
Phase 0: 项目脚手架 + 环境验证
|
||
│
|
||
Phase 1: CUDA FFI 基础设施
|
||
│
|
||
Phase 2: Tensor 抽象层
|
||
│
|
||
Phase 3: GEMM (naive → tiled → tensor core → cuBLAS)
|
||
│
|
||
Phase 4: Transformer Kernels (Norm, Activation, Embedding, RoPE, Softmax)
|
||
│
|
||
Phase 5: Attention Kernel (naive MHA)
|
||
│
|
||
Phase 6: 模型加载 (safetensors + HF config)
|
||
│ │
|
||
│ Phase 7: BPE Tokenizer (可与 Phase 6 并行)
|
||
│ │
|
||
Phase 8: GPT-2 完整推理 ◄──────────── 里程碑 ① CLI 文本生成
|
||
│
|
||
Phase 9: KV Cache + Autoregressive Generation
|
||
│
|
||
Phase 10: Qwen3-8B 支持 ◄─────────── 里程碑 ② 8B 模型推理
|
||
│
|
||
Phase 11: Paged Attention + KV Cache Manager
|
||
│
|
||
Phase 12: Continuous Batching + Request Scheduler
|
||
│
|
||
Phase 13: HTTP API + SSE Streaming ◄── 里程碑 ③ 端到端 API 可用
|
||
│
|
||
Phase 14: Flash Attention (FA2 for SM120)
|
||
│
|
||
Phase 15: 性能优化 ◄──────────────── 里程碑 ④ 50% vLLM throughput
|
||
│
|
||
Phase 16: Speculative Decoding
|
||
│
|
||
Phase 17: Tensor Parallelism (TP=1/2/4) ◄── 里程碑 ⑤ 多卡推理
|
||
│
|
||
Phase 18: 量化 (FP8 / INT8)
|
||
│
|
||
Phase 19: Multimodal ◄────────────── 里程碑 ⑥ 视觉问答
|
||
```
|
||
|
||
---
|
||
|
||
## Phase 0: 项目脚手架 + 环境验证
|
||
|
||
**目标**: 搭建 Cargo workspace,验证 CUDA 工具链,确保开发环境就绪。
|
||
|
||
**技术要点**:
|
||
- Cargo workspace 配置,所有 crate 共享依赖版本
|
||
- `build.rs` 中用 `cc` crate 编译 `.cu` 文件的 pipeline
|
||
- 验证 CUDA toolkit 版本: **已确认 CUDA 12.9** (`/usr/local/cuda-12.9`)
|
||
- 验证 GPU compute capability: **已确认 CC 12.0** (Blackwell)
|
||
- 确认 `nvcc` 在 PATH 中 (需要 `export PATH=/usr/local/cuda-12.9/bin:$PATH`)
|
||
- ~~运行 `nvidia-smi topo -m` 确认互联拓扑~~ **已确认: 纯 PCIe Gen5, 无 NVLink**
|
||
|
||
**外部依赖**: `cc` crate(编译 CUDA)
|
||
|
||
**测试验收**:
|
||
- `cargo build` 通过
|
||
- 一个最小的 `.cu` kernel(向量加法)能从 Rust 调用并返回正确结果
|
||
- 输出 GPU 信息(名称、显存、compute capability)
|
||
|
||
**设计文档**: `docs/01-cuda-ffi.md`(与 Phase 1 合并)
|
||
|
||
---
|
||
|
||
## Phase 1: CUDA FFI 基础设施
|
||
|
||
**Crate**: `xserv-cuda`
|
||
|
||
**目标**: 封装 CUDA Runtime API,提供安全的 Rust 抽象层。
|
||
|
||
### 模块划分
|
||
|
||
```
|
||
xserv-cuda/src/
|
||
├── lib.rs
|
||
├── error.rs # CudaError 类型, cudaGetLastError 封装
|
||
├── device.rs # Device 枚举, 设备查询 (属性/数量/当前设备)
|
||
├── context.rs # CUDA context 管理
|
||
├── stream.rs # CudaStream (异步操作流)
|
||
├── memory.rs # GPU 内存分配/释放/拷贝 (H2D, D2H, D2D)
|
||
├── allocator.rs # Caching Allocator (显存池)
|
||
└── module.rs # cuModuleLoad (加载 PTX/cubin, 可选)
|
||
```
|
||
|
||
### 关键技术点
|
||
|
||
1. **FFI 绑定策略**:
|
||
- 手写 `extern "C"` 绑定核心 CUDA Runtime API(~30 个函数)
|
||
- 不用 bindgen,保持可控和可读
|
||
- 需要绑定的 API: `cudaMalloc`, `cudaFree`, `cudaMemcpy`, `cudaMemcpyAsync`,
|
||
`cudaStreamCreate`, `cudaStreamSynchronize`, `cudaGetDeviceProperties`,
|
||
`cudaSetDevice`, `cudaDeviceSynchronize`, `cudaGetLastError` 等
|
||
|
||
2. **GpuBuffer 抽象**:
|
||
```rust
|
||
pub struct GpuBuffer {
|
||
ptr: *mut c_void,
|
||
size_bytes: usize,
|
||
device: usize,
|
||
}
|
||
|
||
impl Drop for GpuBuffer {
|
||
fn drop(&mut self) { /* cudaFree or return to allocator */ }
|
||
}
|
||
```
|
||
- `Drop` trait 自动释放,防止 GPU 内存泄漏
|
||
- 不实现 `Clone`(显式 `copy_from` 代替)
|
||
|
||
3. **Caching Allocator**:
|
||
- 维护 free list(按大小分桶,桶边界: 512B, 1KB, 2KB, ..., 1GB)
|
||
- `alloc(size)`: 在对应桶中找 >= size 的 free block,miss 时 `cudaMalloc`
|
||
- `free(ptr, size)`: 不调 `cudaFree`,放回 free list
|
||
- `trim()`: 真正释放所有 free blocks(OOM 恢复时用)
|
||
- 这是性能关键组件——频繁 `cudaMalloc/cudaFree` 会严重影响 throughput
|
||
- 参考: PyTorch 的 `CUDACachingAllocator` 设计
|
||
|
||
4. **Stream 管理**:
|
||
- 每个 stream 是独立的 GPU 执行队列
|
||
- Kernel launch 和 memcpy 是异步的(提交到 stream 后立即返回)
|
||
- `stream.synchronize()` 等待该 stream 上所有操作完成
|
||
- 后续用于 overlap compute 和 memory transfer
|
||
|
||
5. **Error Handling**:
|
||
```rust
|
||
#[derive(Debug)]
|
||
pub enum CudaError {
|
||
OutOfMemory,
|
||
InvalidDevice,
|
||
LaunchFailure,
|
||
Raw { code: i32, message: String },
|
||
}
|
||
|
||
// 所有 CUDA 调用包装为 Result
|
||
pub(crate) fn check(code: cudaError_t) -> Result<(), CudaError>;
|
||
```
|
||
|
||
### 测试验收
|
||
|
||
- [ ] 分配 1GB GPU 内存,H2D 拷贝一个大数组,D2H 拷回,验证数据一致
|
||
- [ ] Caching allocator: alloc → free → re-alloc same size,第二次不触发 `cudaMalloc`(通过内部计数验证)
|
||
- [ ] 多 stream 并发拷贝两个数组,验证结果正确
|
||
- [ ] 设备查询: 打印 GPU name, total memory, compute capability, SM count
|
||
- [ ] Benchmark: caching allocator vs 裸 `cudaMalloc` 的分配延迟对比(100 次 alloc/free 循环)
|
||
|
||
---
|
||
|
||
## Phase 2: Tensor 抽象层
|
||
|
||
**Crate**: `xserv-tensor`
|
||
|
||
**目标**: 实现核心 Tensor 类型,支持 CPU/GPU 存储、多种数据类型、视图操作。
|
||
|
||
### 核心数据结构
|
||
|
||
```rust
|
||
// --- 数据类型 ---
|
||
#[derive(Clone, Copy, PartialEq)]
|
||
pub enum DType {
|
||
F32,
|
||
F16,
|
||
BF16,
|
||
// 后期: U8, I8, F8E4M3, F8E5M2
|
||
}
|
||
|
||
impl DType {
|
||
pub fn size_bytes(&self) -> usize; // F32=4, F16=2, BF16=2
|
||
}
|
||
|
||
// --- 设备 ---
|
||
#[derive(Clone, Copy, PartialEq)]
|
||
pub enum Device {
|
||
Cpu,
|
||
Cuda(usize), // device ordinal
|
||
}
|
||
|
||
// --- 存储 ---
|
||
// 引用计数,支持 view(多个 Tensor 共享同一 Storage)
|
||
pub struct Storage(Arc<StorageInner>);
|
||
|
||
enum StorageInner {
|
||
Cpu { data: Vec<u8> },
|
||
Cuda { buffer: GpuBuffer, device: usize },
|
||
}
|
||
|
||
// --- Tensor ---
|
||
pub struct Tensor {
|
||
storage: Storage,
|
||
shape: SmallVec<[usize; 4]>, // 维度(大多数 tensor <= 4D)
|
||
strides: SmallVec<[usize; 4]>, // 步长(以元素为单位)
|
||
offset: usize, // storage 中的起始偏移(元素数)
|
||
dtype: DType,
|
||
device: Device,
|
||
}
|
||
```
|
||
|
||
### 关键技术点
|
||
|
||
1. **Strided Layout**:
|
||
- 支持 `transpose`, `slice`, `permute` 等操作不拷贝数据,只改 strides/offset
|
||
- `is_contiguous()`: strides 从右到左依次为 1, shape[-1], shape[-1]*shape[-2], ...
|
||
- 非 contiguous tensor 在送入 CUDA kernel 前需要 `contiguous()` 拷贝为连续布局
|
||
- 例: `[3,4]` tensor 的 strides = `[4, 1]`;transpose 后 shape=`[4,3]`, strides=`[1, 4]`
|
||
|
||
2. **BF16/F16 在 Rust 中的表示**:
|
||
- 使用 `half` crate 的 `bf16` 和 `f16` 类型
|
||
- GPU kernel 中使用 `__nv_bfloat16` / `__half`
|
||
- Tensor 内部存储为 raw bytes,通过 DType dispatch 解释
|
||
|
||
3. **设备间拷贝**:
|
||
```rust
|
||
impl Tensor {
|
||
pub fn to(&self, device: Device) -> Tensor; // CPU↔GPU 拷贝
|
||
pub fn to_dtype(&self, dtype: DType) -> Tensor; // 类型转换
|
||
}
|
||
```
|
||
|
||
4. **基础操作**(此阶段实现):
|
||
- **创建**: `zeros`, `ones`, `from_slice`, `rand`, `full`, `arange`
|
||
- **形状**: `reshape`, `view`, `transpose`, `squeeze`, `unsqueeze`, `contiguous`
|
||
- **逐元素** (CPU + GPU kernel): `add`, `mul`, `sub`, `div`
|
||
- **广播 (Broadcasting)**: NumPy 语义,维度从尾部对齐
|
||
- **归约**: `sum`, `max`, `mean`(沿指定轴)
|
||
|
||
5. **Op Dispatch 机制**:
|
||
```rust
|
||
// 根据 device 和 dtype dispatch 到不同实现
|
||
pub fn add(a: &Tensor, b: &Tensor) -> Tensor {
|
||
match (a.device(), b.device()) {
|
||
(Device::Cpu, Device::Cpu) => cpu_ops::add(a, b),
|
||
(Device::Cuda(_), Device::Cuda(_)) => cuda_ops::add(a, b),
|
||
_ => panic!("device mismatch"),
|
||
}
|
||
}
|
||
```
|
||
|
||
### 测试验收
|
||
|
||
- [ ] 创建 tensor, reshape, transpose, slice,验证 shape/strides 计算正确
|
||
- [ ] 广播加法: `[3,1] + [1,4]` → `[3,4]`,与 numpy 结果对比
|
||
- [ ] CPU ↔ GPU 拷贝往返,数据一致
|
||
- [ ] BF16 tensor 的基础运算精度验证(与 FP32 结果对比 relative error)
|
||
- [ ] View 共享存储: 修改 view 的数据,原 tensor 也应变化
|
||
- [ ] Benchmark: GPU 逐元素 kernel vs CPU 的加速比(大数组)
|
||
|
||
---
|
||
|
||
## Phase 3: GEMM — 矩阵乘法
|
||
|
||
**Crate**: `xserv-kernels`
|
||
**CUDA 源码**: `csrc/gemm/`
|
||
|
||
**目标**: 实现 GEMM 的多个版本,从 naive 到 tensor core,同时封装 cuBLAS,建立 benchmark 对比框架。
|
||
|
||
这是 CUDA kernel 编程的第一个"修罗场",会深刻理解 GPU 编程的核心概念。
|
||
|
||
### 实现路线(4 个递进版本)
|
||
|
||
#### Version 1: Naive GEMM
|
||
- 每个 thread 计算输出矩阵 C 的一个元素: `C[i][j] = sum(A[i][k] * B[k][j])`
|
||
- grid 维度: `(M/BLOCK, N/BLOCK)`, block 维度: `(BLOCK, BLOCK)`
|
||
- **学到**: grid/block 维度规划, global memory access pattern
|
||
- **问题**: global memory 访问完全没有局部性,bandwidth 利用率极低
|
||
- **预期性能**: ~1-2% cuBLAS
|
||
|
||
#### Version 2: Tiled GEMM (shared memory)
|
||
- 将 A, B 分成 TILE×TILE 的小块,加载到 shared memory
|
||
- 每个 thread block 计算 C 的一个 TILE×TILE 输出块
|
||
- 内层循环沿 K 维度滑动 tile
|
||
- **学到**: shared memory 使用, `__syncthreads()`, bank conflict, memory coalescing
|
||
- **关键**: A 的 tile 要按行加载(coalesced),B 的 tile 按列访问需要注意 bank conflict
|
||
- **预期性能**: ~10-20% cuBLAS
|
||
|
||
#### Version 3: Register Tiling + 向量化
|
||
- 每个 thread 计算多个输出元素(如 4×4 或 8×8)
|
||
- 使用寄存器存储中间结果,减少 shared memory 访问
|
||
- 向量化加载: `float4` 一次读 128 bit
|
||
- **学到**: register pressure, ILP (Instruction-Level Parallelism), occupancy vs. ILP tradeoff
|
||
- **预期性能**: ~30-50% cuBLAS
|
||
|
||
#### Version 4: Tensor Core GEMM (WMMA)
|
||
- 使用 CUDA WMMA API 调用 Tensor Core
|
||
- BF16 输入, FP32 累加
|
||
- 每次 wmma::mma_sync 计算 16×16×16 矩阵乘
|
||
- **学到**: WMMA fragment layout, Tensor Core 编程模型, warp-level 协作
|
||
- **关键**: 5090 Blackwell (CC 12.0) 的 Tensor Core 支持 BF16 和 FP8
|
||
- **预期性能**: ~60-80% cuBLAS
|
||
|
||
### cuBLAS 封装
|
||
|
||
```rust
|
||
// 需要封装的 cuBLAS API
|
||
extern "C" {
|
||
fn cublasCreate_v2(handle: *mut cublasHandle_t) -> cublasStatus_t;
|
||
fn cublasSetStream_v2(handle: cublasHandle_t, stream: cudaStream_t) -> cublasStatus_t;
|
||
fn cublasGemmEx(
|
||
handle: cublasHandle_t,
|
||
transa: cublasOperation_t, transb: cublasOperation_t,
|
||
m: i32, n: i32, k: i32,
|
||
alpha: *const c_void,
|
||
A: *const c_void, Atype: cudaDataType, lda: i32,
|
||
B: *const c_void, Btype: cudaDataType, ldb: i32,
|
||
beta: *const c_void,
|
||
C: *mut c_void, Ctype: cudaDataType, ldc: i32,
|
||
computeType: cublasComputeType_t,
|
||
algo: cublasGemmAlgo_t,
|
||
) -> cublasStatus_t;
|
||
}
|
||
```
|
||
|
||
支持: BF16×BF16→BF16 (compute=FP32), FP16×FP16→FP16, FP32×FP32→FP32
|
||
|
||
### Kernel Registry(运行时可切换 backend)
|
||
|
||
```rust
|
||
#[derive(Clone, Copy)]
|
||
pub enum GemmBackend {
|
||
Naive,
|
||
Tiled,
|
||
RegisterTiled,
|
||
TensorCore,
|
||
CuBlas,
|
||
}
|
||
|
||
pub fn matmul(a: &Tensor, b: &Tensor, backend: GemmBackend) -> Tensor;
|
||
|
||
// 全局默认 backend(可配置)
|
||
pub fn set_default_gemm_backend(backend: GemmBackend);
|
||
```
|
||
|
||
### 测试验收
|
||
|
||
- [ ] 正确性: 所有 5 个 backend 的输出与 cuBLAS 对比,max absolute error < 1e-3 (BF16)
|
||
- [ ] Benchmark 表格(用 `criterion` crate):
|
||
|
||
| Backend | M=N=K=1024 | M=N=K=4096 | % of cuBLAS |
|
||
|---------|------------|------------|-------------|
|
||
| Naive | ms | ms | % |
|
||
| Tiled | ms | ms | % |
|
||
| RegisterTiled | ms | ms | % |
|
||
| TensorCore | ms | ms | % |
|
||
| cuBLAS | ms | ms | 100% |
|
||
|
||
- [ ] Profile: 用 `nsys`/`ncu` 分析 naive vs tiled 的 memory throughput 差异
|
||
- [ ] 非方阵测试: M=1, N=4096, K=4096 (decode 阶段的典型 shape)
|
||
|
||
---
|
||
|
||
## Phase 4: Transformer 核心 Kernels
|
||
|
||
**Crate**: `xserv-kernels`
|
||
**CUDA 源码**: `csrc/normalization/`, `csrc/activation/`, `csrc/embedding/`, `csrc/reduce/`
|
||
|
||
**目标**: 实现 Transformer 所需的所有非 Attention 算子,每个都有自定义 CUDA kernel。
|
||
|
||
### Kernel 清单
|
||
|
||
| Kernel | 用途 | CUDA 文件 | 核心优化点 |
|
||
|--------|------|-----------|-----------|
|
||
| LayerNorm | GPT-2 | `normalization/layernorm.cu` | Online Welford 算法, warp reduce, 向量化加载 |
|
||
| RMSNorm | Qwen3/LLaMA | `normalization/rmsnorm.cu` | 比 LayerNorm 简单(无 mean), rsqrt |
|
||
| GELU | GPT-2 激活 | `activation/gelu.cu` | tanh 近似 vs 精确, 向量化 |
|
||
| SiLU (Swish) | Qwen3 激活 | `activation/silu.cu` | `x * sigmoid(x)`, 逐元素 |
|
||
| SwiGLU | Qwen3 FFN | `activation/swiglu.cu` | `SiLU(gate) * up`, fused 逐元素 |
|
||
| Embedding | token→vector | `embedding/embedding.cu` | Gather 操作, coalesced access |
|
||
| RoPE | Qwen3 位置编码 | `embedding/rope.cu` | 复数旋转, precompute freq |
|
||
| Softmax | Attention 内 | `reduce/softmax.cu` | Online safe softmax, 数值稳定 |
|
||
| Argmax | Greedy sampling | `reduce/argmax.cu` | Parallel reduction |
|
||
| TopK | TopK sampling | `reduce/topk.cu` | Bitonic sort 或 radix select |
|
||
|
||
### 关键学习主题
|
||
|
||
**Reduction Pattern(核心中的核心)**:
|
||
|
||
LayerNorm, RMSNorm, Softmax 都涉及对某个维度求和/求最大值。GPU reduction 是分层的:
|
||
|
||
```
|
||
Thread-level: 每个 thread 处理多个元素,本地累加
|
||
↓
|
||
Warp-level: __shfl_down_sync() 在 warp (32 threads) 内规约
|
||
↓
|
||
Block-level: shared memory 存各 warp 的结果,再规约
|
||
↓
|
||
Grid-level: (如果需要) atomic 或两遍 kernel
|
||
```
|
||
|
||
对于 Norm/Softmax,通常 hidden_dim <= 8192,一个 block 就够,不需要 grid-level reduction。
|
||
|
||
**向量化内存访问**:
|
||
- `float4` (128-bit) 一次加载 4 个 float 或 8 个 bf16
|
||
- `__nv_bfloat162` 一次处理 2 个 bf16
|
||
- 提升 memory throughput,减少 load/store 指令数
|
||
|
||
**每个 kernel 都实现两个版本**:
|
||
1. Custom CUDA kernel(自己写,深入理解)
|
||
2. Reference 实现(简单的 Python/numpy,生成 reference output 用于验证)
|
||
|
||
### 测试验收
|
||
|
||
- [ ] 每个 kernel 的输出与 PyTorch 参考实现对比
|
||
- 写一个 `tools/generate_reference.py` 脚本,为每个 op 生成 reference input/output,保存为 `.npy`
|
||
- Rust 测试中加载 `.npy` 对比
|
||
- [ ] 数值精度: BF16 下 max relative error < 1e-2, FP32 下 < 1e-5
|
||
- [ ] RoPE: 验证旋转后的向量与 HF transformers 的 `apply_rotary_pos_emb` 结果一致
|
||
- [ ] Softmax: 验证 `sum(output, dim=-1) == 1.0`,验证 numerical stability(大值输入不 overflow)
|
||
- [ ] Benchmark: 每个 kernel 与 PyTorch 对应操作的延迟对比
|
||
|
||
---
|
||
|
||
## Phase 5: Attention Kernel (Naive 版)
|
||
|
||
**Crate**: `xserv-kernels`
|
||
**CUDA 源码**: `csrc/attention/naive_attention.cu`
|
||
|
||
**目标**: 实现标准 Multi-Head Attention,不做 Flash/Paged 优化。理解 attention 机制的计算基础。
|
||
|
||
### 计算流程
|
||
|
||
```
|
||
Input: Q [B, H, S, D], K [B, H, S, D], V [B, H, S, D]
|
||
其中 B=batch, H=num_heads, S=seq_len, D=head_dim
|
||
|
||
1. scores = Q @ K^T / sqrt(D) → [B, H, S, S]
|
||
2. scores = scores + causal_mask → 上三角置为 -inf
|
||
3. weights = softmax(scores, dim=-1) → [B, H, S, S]
|
||
4. output = weights @ V → [B, H, S, D]
|
||
```
|
||
|
||
### 实现方式
|
||
|
||
**方式一: 组合式(先跑通)**
|
||
- 用 Phase 3 的 GEMM (Q@K^T) + Phase 4 的 Softmax + GEMM (weights@V)
|
||
- 简单但 materialize 了 S×S 矩阵,内存 O(S²)
|
||
|
||
**方式二: Fused kernel(理解 Flash Attention 的前置)**
|
||
- 一个 kernel 完成整个 attention
|
||
- 仍然 materialize S×S(不做 tiling),但减少 kernel launch 和 global memory 读写次数
|
||
|
||
### Causal Mask
|
||
|
||
- 不显式构造 mask 矩阵(浪费内存)
|
||
- 在 softmax 前对 `scores[i][j]` where `j > i` 写 `-inf`(`-1e9` for BF16)
|
||
- 编译期条件判断: `if (col > row) score = -inf;`
|
||
|
||
### GQA 预备
|
||
|
||
- 本阶段实现标准 MHA: `num_kv_heads == num_heads`
|
||
- Phase 10 扩展为 GQA: `num_kv_heads < num_heads`
|
||
- GQA 时 K/V 需要 repeat: 每个 KV head 服务 `num_heads / num_kv_heads` 个 Q head
|
||
- 实际实现: 不真正 repeat 数据,在 kernel 中用 `kv_head_idx = q_head_idx / num_groups` 索引
|
||
|
||
### 测试验收
|
||
|
||
- [ ] 随机 Q, K, V,输出与 PyTorch `F.scaled_dot_product_attention(is_causal=True)` 对比
|
||
- [ ] 验证 causal mask: attention weight 矩阵的上三角全为 0
|
||
- [ ] Benchmark 表(记录为 Flash Attention 的 baseline):
|
||
|
||
| Seq Length | Latency (ms) | GPU Memory (MB) |
|
||
|------------|-------------|-----------------|
|
||
| 128 | | |
|
||
| 512 | | |
|
||
| 2048 | | |
|
||
| 4096 | | |
|
||
| 8192 | | OOM? |
|
||
|
||
- [ ] 输出 attention weights 可视化(小规模,验证 causal pattern)
|
||
|
||
---
|
||
|
||
## Phase 6: 模型加载
|
||
|
||
**Crate**: `xserv-model`
|
||
|
||
**目标**: 从 HuggingFace safetensors 文件加载模型权重到 GPU Tensor。
|
||
|
||
### 核心组件
|
||
|
||
#### 1. safetensors 解析
|
||
|
||
- 使用 `safetensors` crate 读取文件
|
||
- 文件结构: 8 bytes header_size + JSON header + raw tensor data
|
||
- 支持 mmap 零拷贝读取
|
||
- 支持 sharded 文件: `model-00001-of-00003.safetensors`
|
||
- 通过 `model.safetensors.index.json` 查找 tensor → file 的映射
|
||
|
||
#### 2. HF Config 解析
|
||
|
||
```rust
|
||
#[derive(Deserialize)]
|
||
pub struct ModelConfig {
|
||
pub architectures: Vec<String>,
|
||
pub hidden_size: usize,
|
||
pub intermediate_size: usize,
|
||
pub num_attention_heads: usize,
|
||
pub num_key_value_heads: usize, // GQA: 可能 < num_attention_heads
|
||
pub num_hidden_layers: usize,
|
||
pub vocab_size: usize,
|
||
pub max_position_embeddings: usize,
|
||
pub rms_norm_eps: f64, // Qwen3 用
|
||
pub rope_theta: f64, // RoPE base frequency
|
||
pub tie_word_embeddings: bool,
|
||
// ... 其他字段按模型按需添加
|
||
}
|
||
```
|
||
|
||
#### 3. 权重映射
|
||
|
||
HuggingFace 命名规范 (以 Qwen3 为例):
|
||
```
|
||
model.embed_tokens.weight → embedding
|
||
model.layers.{i}.self_attn.q_proj.weight → layer[i].attn.q_proj
|
||
model.layers.{i}.self_attn.k_proj.weight → layer[i].attn.k_proj
|
||
model.layers.{i}.self_attn.v_proj.weight → layer[i].attn.v_proj
|
||
model.layers.{i}.self_attn.o_proj.weight → layer[i].attn.o_proj
|
||
model.layers.{i}.mlp.gate_proj.weight → layer[i].mlp.gate
|
||
model.layers.{i}.mlp.up_proj.weight → layer[i].mlp.up
|
||
model.layers.{i}.mlp.down_proj.weight → layer[i].mlp.down
|
||
model.layers.{i}.input_layernorm.weight → layer[i].attn_norm
|
||
model.layers.{i}.post_attention_layernorm.weight → layer[i].ffn_norm
|
||
model.norm.weight → final_norm
|
||
lm_head.weight → lm_head
|
||
```
|
||
|
||
#### 4. 加载流程
|
||
|
||
```
|
||
safetensors file (disk)
|
||
→ mmap (host memory, 零拷贝)
|
||
→ dtype check/cast (如 FP32 → BF16)
|
||
→ H2D copy → GPU Tensor
|
||
→ 按 layer 组织成模型结构
|
||
```
|
||
|
||
### 外部依赖
|
||
|
||
- `safetensors` crate
|
||
- `serde` + `serde_json` (解析 config.json)
|
||
- `memmap2` (mmap 支持,safetensors crate 可能内置)
|
||
|
||
### 测试验收
|
||
|
||
- [ ] 加载 GPT-2 124M (`openai-community/gpt2`),打印所有 tensor name, shape, dtype
|
||
- [ ] 抽查几个 tensor 的前 10 个值,与 PyTorch `from_pretrained` 对比
|
||
- [ ] 加载 Qwen3-8B sharded 权重,验证所有 tensor 都成功加载
|
||
- [ ] 性能: 测量 8B 模型权重加载时间 (mmap → GPU 全流程)
|
||
- [ ] 错误处理: 缺少 tensor、dtype 不匹配、文件不存在等情况
|
||
|
||
---
|
||
|
||
## Phase 7: BPE Tokenizer
|
||
|
||
**Crate**: `xserv-tokenizer`
|
||
|
||
**目标**: 从零实现 Byte-Pair Encoding tokenizer,兼容 HuggingFace tokenizer.json 格式。
|
||
|
||
### BPE 算法核心
|
||
|
||
#### 编码 (encode)
|
||
|
||
```
|
||
输入: "Hello world"
|
||
→ pre-tokenize (regex split): ["Hello", " world"]
|
||
→ 每个词转为 byte 序列: [72, 101, 108, 108, 111], [32, 119, 111, ...]
|
||
→ 初始 token 序列: 每个 byte 是一个 token
|
||
→ 反复合并:
|
||
1. 找当前序列中优先级最高的 byte-pair (从 merges 表查)
|
||
2. 合并该 pair
|
||
3. 重复直到无可合并
|
||
→ 输出: token IDs
|
||
```
|
||
|
||
#### 解码 (decode)
|
||
|
||
```
|
||
token IDs → 查 vocab 得到 byte 序列 → 拼接 → UTF-8 decode → 文本
|
||
```
|
||
|
||
### 需要处理的细节
|
||
|
||
1. **Pre-tokenization**:
|
||
- GPT-2 regex: `'s|'t|'re|'ve|'m|'ll|'d| ?\p{L}+| ?\p{N}+| ?[^\s\p{L}\p{N}]+|\s+`
|
||
- Qwen3 可能使用不同的 regex pattern(从 `tokenizer.json` 的 `pre_tokenizer` 字段读取)
|
||
- 用 `regex` crate 实现
|
||
|
||
2. **tokenizer.json 解析**:
|
||
```json
|
||
{
|
||
"model": {
|
||
"type": "BPE",
|
||
"vocab": {"Hello": 0, "world": 1, ...},
|
||
"merges": ["H e", "He l", "Hel lo", ...]
|
||
},
|
||
"added_tokens": [...],
|
||
"pre_tokenizer": {...},
|
||
"post_processor": {...}
|
||
}
|
||
```
|
||
|
||
3. **Special Tokens**:
|
||
- `<|endoftext|>` (GPT-2 EOS)
|
||
- `<|im_start|>`, `<|im_end|>` (Qwen3 ChatML)
|
||
- `<|endoftext|>` (Qwen3 EOS)
|
||
- Special tokens 不参与 BPE merge,直接映射到 ID
|
||
|
||
4. **Chat Template** (Qwen3 格式):
|
||
```
|
||
<|im_start|>system
|
||
You are a helpful assistant.<|im_end|>
|
||
<|im_start|>user
|
||
Hello<|im_end|>
|
||
<|im_start|>assistant
|
||
```
|
||
|
||
5. **性能优化**:
|
||
- Merge rules 用 `HashMap<(TokenId, TokenId), MergePriority>` 预索引
|
||
- 对于长文本,考虑 priority queue 加速 pair 查找
|
||
|
||
### 测试验收
|
||
|
||
- [ ] 加载 GPT-2 tokenizer,encode + decode 一批测试文本,与 Python `AutoTokenizer` 逐 token 对比
|
||
- [ ] 加载 Qwen3 tokenizer,同样逐 token 对比
|
||
- [ ] 边界情况: 空字符串、纯 emoji (🎉🔥)、中英混合、超长文本 (1MB)
|
||
- [ ] Chat template: 给定 messages 列表,生成与 HF `apply_chat_template` 一致的 token 序列
|
||
- [ ] Benchmark: encode 1MB 文本的延迟
|
||
|
||
---
|
||
|
||
## Phase 8: GPT-2 完整推理 — 里程碑 ①
|
||
|
||
**Crate**: `xserv-model`
|
||
|
||
**目标**: 将所有组件串联,实现 GPT-2 的完整推理 pipeline。这是第一次看到模型"说话"。
|
||
|
||
### 模型结构
|
||
|
||
```rust
|
||
pub struct GPT2 {
|
||
config: GPT2Config,
|
||
wte: Tensor, // token embedding [vocab_size, hidden_size]
|
||
wpe: Tensor, // position embedding [max_seq_len, hidden_size]
|
||
layers: Vec<GPT2Block>,
|
||
ln_f: LayerNorm, // final layer norm
|
||
// lm_head 与 wte 共享权重 (tied embeddings)
|
||
}
|
||
|
||
pub struct GPT2Block {
|
||
ln_1: LayerNorm,
|
||
attn: GPT2Attention, // MHA: q_proj, k_proj, v_proj, o_proj
|
||
ln_2: LayerNorm,
|
||
mlp: GPT2MLP, // fc1 (4H) → GELU → fc2 (H)
|
||
}
|
||
```
|
||
|
||
### Forward Pass 流程
|
||
|
||
```
|
||
tokens [B, S]
|
||
→ wte[tokens] + wpe[0..S] → hidden [B, S, 768]
|
||
→ for each layer:
|
||
→ residual = hidden
|
||
→ hidden = ln_1(hidden)
|
||
→ hidden = attention(hidden) # Q, K, V 从 hidden 线性变换
|
||
→ hidden = hidden + residual # residual connection
|
||
→ residual = hidden
|
||
→ hidden = ln_2(hidden)
|
||
→ hidden = mlp(hidden) # Linear→GELU→Linear
|
||
→ hidden = hidden + residual
|
||
→ hidden = ln_f(hidden)
|
||
→ logits = hidden @ wte.T → [B, S, vocab_size]
|
||
→ next_token = sample(logits[:, -1, :]) # 只取最后一个 position
|
||
```
|
||
|
||
### Sampling 策略
|
||
|
||
```rust
|
||
pub struct SamplingParams {
|
||
pub temperature: f32, // default 1.0
|
||
pub top_k: usize, // default 50
|
||
pub top_p: f32, // default 1.0 (disabled)
|
||
pub max_tokens: usize, // default 256
|
||
pub repetition_penalty: f32, // default 1.0 (disabled)
|
||
}
|
||
```
|
||
|
||
实现:
|
||
1. **Greedy**: `argmax(logits)`
|
||
2. **Temperature**: `logits = logits / temperature` → softmax → sample
|
||
3. **Top-K**: 保留 top-k logits,其余置为 -inf → softmax → sample
|
||
4. **Top-P (Nucleus)**: 按概率降序排列,累加到概率 >= p → 截断 → 重新 normalize → sample
|
||
5. 以上可以组合: temperature → top-k → top-p → sample
|
||
|
||
### CLI 交互
|
||
|
||
```
|
||
$ cargo run --release --bin xserv-cli -- --model openai-community/gpt2
|
||
|
||
xserv> The future of AI is
|
||
GPT-2> The future of AI is not just about the technology, but about the people
|
||
who are building it. The question is whether we can...
|
||
|
||
xserv> Once upon a time
|
||
GPT-2> Once upon a time, there was a young man who lived in a small village...
|
||
```
|
||
|
||
### 测试验收
|
||
|
||
- [ ] 加载 `openai-community/gpt2`,prefill "The future of AI is"
|
||
- [ ] Prefill logits 与 PyTorch 对比: top-5 token IDs 和对应 logit 值一致
|
||
- [ ] Greedy decode 50 tokens,结果应该是连贯英文
|
||
- [ ] Temperature/TopK/TopP sampling: 生成多次结果应有变化
|
||
- [ ] CLI 交互模式可用
|
||
|
||
---
|
||
|
||
## Phase 9: KV Cache + Autoregressive 优化
|
||
|
||
**Crate**: `xserv-runtime`
|
||
|
||
**目标**: 实现 KV Cache,将 decode 从 O(S²) 降到 O(S) per step。
|
||
|
||
### 核心概念: Prefill vs Decode
|
||
|
||
**Prefill(首次处理 prompt)**:
|
||
- 输入: 完整 prompt `[B, S, D]`
|
||
- 计算: 所有 token 的 Q, K, V
|
||
- Attention: `Q[B,H,S,D] @ K[B,H,S,D]^T` → 完整 S×S 矩阵
|
||
- 输出: 缓存 K, V 到 KV cache
|
||
- 特点: **Compute-bound**(大矩阵乘法)
|
||
|
||
**Decode(逐 token 生成)**:
|
||
- 输入: 上一步生成的 1 个 token `[B, 1, D]`
|
||
- 计算: 只计算新 token 的 Q, K, V
|
||
- K_new, V_new append 到 cache
|
||
- Attention: `Q[B,H,1,D] @ K_cache[B,H,S+1,D]^T` → `[B,H,1,S+1]`
|
||
- 特点: **Memory-bound**(Q 只有 1 行,瓶颈在读 K/V cache)
|
||
|
||
### KV Cache 设计(简单版,非 paged)
|
||
|
||
```rust
|
||
pub struct KVCache {
|
||
// 每层一对 K/V tensor
|
||
// shape: [batch_size, num_kv_heads, max_seq_len, head_dim]
|
||
k_caches: Vec<Tensor>, // 索引 = layer_idx
|
||
v_caches: Vec<Tensor>,
|
||
seq_len: usize, // 当前已填充的长度
|
||
max_seq_len: usize, // 预分配的最大长度
|
||
}
|
||
|
||
impl KVCache {
|
||
// prefill 时: 写入 [0..prompt_len] 的 K, V
|
||
pub fn fill(&mut self, layer: usize, k: &Tensor, v: &Tensor);
|
||
|
||
// decode 时: 在 seq_len 位置写入新的 K, V,返回完整 cache
|
||
pub fn append(&mut self, layer: usize, k: &Tensor, v: &Tensor) -> (&Tensor, &Tensor);
|
||
}
|
||
```
|
||
|
||
### Decode Attention Kernel
|
||
|
||
与 prefill attention 不同,decode 时 Q 只有 1 行:
|
||
```
|
||
Q [B, H, 1, D] × K_cache [B, H, S, D]^T → scores [B, H, 1, S]
|
||
→ softmax → weights [B, H, 1, S]
|
||
weights × V_cache [B, H, S, D] → output [B, H, 1, D]
|
||
```
|
||
|
||
优化方向:
|
||
- 每个 warp 处理一个 head
|
||
- 沿 S 维度做 parallel reduction (dot product + online softmax)
|
||
- 重点优化 memory bandwidth(K/V cache 的读取是瓶颈)
|
||
|
||
### 测试验收
|
||
|
||
- [ ] 对比有/无 KV cache 的生成结果 → **必须完全一致**(bit-exact for greedy)
|
||
- [ ] Benchmark decode 延迟:
|
||
|
||
| Seq Length | Without Cache (ms/token) | With Cache (ms/token) | Speedup |
|
||
|------------|--------------------------|----------------------|---------|
|
||
| 128 | | | |
|
||
| 512 | | | |
|
||
| 2048 | | | |
|
||
|
||
- [ ] 显存占用: KV cache 的实际显存与理论值 (`2 * num_layers * num_kv_heads * seq_len * head_dim * sizeof(bf16)`) 对比
|
||
- [ ] GPT-2 decode throughput (tokens/s) 记录为 baseline
|
||
|
||
---
|
||
|
||
## Phase 10: Qwen3-8B 支持 — 里程碑 ②
|
||
|
||
**Crate**: `xserv-model`
|
||
|
||
**目标**: 扩展模型定义以支持 Qwen3-8B,验证输出正确性。
|
||
|
||
### 架构对比
|
||
|
||
| 特性 | GPT-2 (124M) | Qwen3-8B |
|
||
|------|-------------|----------|
|
||
| Normalization | LayerNorm (pre-LN) | RMSNorm (pre-LN) |
|
||
| Position Encoding | Learned absolute (wpe) | RoPE (无单独参数) |
|
||
| Attention | MHA (12 heads, 12 KV heads) | GQA (如 32 Q heads, 8 KV heads) |
|
||
| Activation | GELU | SwiGLU (SiLU gate) |
|
||
| FFN | Linear(H→4H) → GELU → Linear(4H→H) | gate_proj + up_proj → SiLU gate → down_proj |
|
||
| Vocab Size | 50,257 | ~152,000 |
|
||
| Hidden Size | 768 | 4,096 (8B) |
|
||
| Layers | 12 | 36 |
|
||
| Tied Embeddings | Yes | No |
|
||
|
||
### 需要新增/修改的组件
|
||
|
||
#### 1. GQA (Grouped Query Attention)
|
||
```
|
||
num_heads = 32, num_kv_heads = 8
|
||
每个 KV head 服务 32/8 = 4 个 Q head
|
||
|
||
Q: [B, 32, S, 128]
|
||
K: [B, 8, S, 128] ← 只有 8 个 KV heads
|
||
V: [B, 8, S, 128]
|
||
|
||
Attention 时:
|
||
kv_head_idx = q_head_idx / (num_heads / num_kv_heads)
|
||
不需要真正 repeat K/V 数据,kernel 中做索引映射
|
||
```
|
||
|
||
#### 2. RoPE (Rotary Position Embedding)
|
||
```
|
||
对 Q, K 的每对相邻元素做旋转:
|
||
Q' = Q * cos(θ) + rotate_half(Q) * sin(θ)
|
||
K' = K * cos(θ) + rotate_half(K) * sin(θ)
|
||
其中 θ_i = pos / (rope_theta^(2i/d))
|
||
|
||
预计算: freqs = 1.0 / (rope_theta^(2i/d)) for i in 0..d/2
|
||
运行时: cos_cache[pos][i] = cos(pos * freqs[i])
|
||
```
|
||
|
||
#### 3. SwiGLU FFN
|
||
```
|
||
x → gate_proj(x) → SiLU → ⊙ up_proj(x) → down_proj → output
|
||
|
||
三个 Linear:
|
||
gate_proj: [hidden_size, intermediate_size]
|
||
up_proj: [hidden_size, intermediate_size]
|
||
down_proj: [intermediate_size, hidden_size]
|
||
```
|
||
|
||
### 模型结构
|
||
|
||
```rust
|
||
pub struct Qwen3Model {
|
||
config: Qwen3Config,
|
||
embed_tokens: Tensor, // [vocab_size, hidden_size]
|
||
layers: Vec<Qwen3DecoderLayer>,
|
||
norm: RMSNorm, // final RMSNorm
|
||
lm_head: Tensor, // [vocab_size, hidden_size] (not tied)
|
||
}
|
||
|
||
pub struct Qwen3DecoderLayer {
|
||
input_layernorm: RMSNorm,
|
||
self_attn: Qwen3Attention, // GQA with RoPE
|
||
post_attention_layernorm: RMSNorm,
|
||
mlp: Qwen3MLP, // SwiGLU
|
||
}
|
||
```
|
||
|
||
### 显存预算 (BF16, 单卡 5090 32GB)
|
||
|
||
```
|
||
模型权重: 8B × 2B = ~16 GB
|
||
KV cache: 36 layers × 2(KV) × 8 heads × 4096 tokens × 128 dim × 2B ≈ 5.6 GB
|
||
Activation (单请求): ~1 GB
|
||
────────────────────────
|
||
总计: ~22.6 GB (单请求),剩余 ~10 GB 可用于更多并发
|
||
```
|
||
|
||
### 测试验收
|
||
|
||
- [ ] 加载 Qwen3-8B 权重到单张 5090,打印模型结构和参数量
|
||
- [ ] Prefill logits 与 HF transformers 对比: 输入 "你好" → top-5 logits 一致
|
||
- [ ] 英文生成: "What is the capital of France?" → 生成合理回答
|
||
- [ ] 中文生成: "请介绍一下量子计算" → 生成通顺中文
|
||
- [ ] 多轮对话:
|
||
```
|
||
<|im_start|>user\nHello<|im_end|>\n<|im_start|>assistant\n
|
||
```
|
||
验证 chat template 格式正确
|
||
- [ ] 单请求性能 baseline: prefill latency (ms), decode throughput (tokens/s)
|
||
|
||
---
|
||
|
||
## Phase 11: Paged Attention + KV Cache Manager
|
||
|
||
**Crate**: `xserv-runtime`
|
||
**CUDA 源码**: `csrc/attention/paged_attention.cu`
|
||
|
||
**目标**: 实现 vLLM 的核心创新 — PagedAttention,解决 KV cache 内存碎片化问题。
|
||
|
||
### 问题
|
||
|
||
Phase 9 的简单 KV cache 为每个请求预分配 `max_seq_len` 的连续内存:
|
||
- 请求只用了 100 tokens 但占了 4096 tokens 的空间 → 内存利用率低
|
||
- 连续分配导致碎片化 → 并发请求数受限
|
||
|
||
### PagedAttention 设计
|
||
|
||
**核心思想**: 像操作系统的虚拟内存一样,将 KV cache 分成固定大小的 page (block)。
|
||
|
||
```rust
|
||
pub const BLOCK_SIZE: usize = 16; // 每个 block 存 16 个 token 的 KV
|
||
|
||
// 物理 KV cache: 预分配的大块 GPU 内存
|
||
// k_cache shape: [num_physical_blocks, num_kv_heads, block_size, head_dim]
|
||
// v_cache shape: [num_physical_blocks, num_kv_heads, block_size, head_dim]
|
||
|
||
pub struct BlockAllocator {
|
||
free_blocks: Vec<usize>, // 空闲物理 block ID 列表
|
||
num_total_blocks: usize,
|
||
ref_counts: Vec<usize>, // 每个 block 的引用计数 (CoW 用)
|
||
}
|
||
|
||
pub struct BlockTable {
|
||
// logical_block_idx → physical_block_idx
|
||
// 例: 一个 seq_len=50 的请求有 4 个 block (50/16=3.125, 向上取整)
|
||
blocks: Vec<usize>,
|
||
}
|
||
|
||
pub struct PagedKVCacheManager {
|
||
k_cache: Tensor, // 所有物理 blocks
|
||
v_cache: Tensor,
|
||
allocator: BlockAllocator,
|
||
block_tables: HashMap<SeqId, BlockTable>,
|
||
}
|
||
```
|
||
|
||
### Paged Attention Kernel
|
||
|
||
与普通 attention 的区别: K/V 不是连续存储,需要通过 block table 间接寻址。
|
||
|
||
```
|
||
输入:
|
||
Q: [num_seqs, num_heads, head_dim] (decode 时每个 seq 只有 1 个 query)
|
||
k_cache: [num_blocks, num_kv_heads, block_size, head_dim] (物理存储)
|
||
v_cache: [num_blocks, num_kv_heads, block_size, head_dim]
|
||
block_tables: [num_seqs, max_num_blocks] (间接寻址表)
|
||
seq_lens: [num_seqs] (每个 seq 的实际长度)
|
||
|
||
每个 thread block 处理:
|
||
1 个 seq 的 1 个 attention head
|
||
遍历该 seq 的所有 logical blocks
|
||
对每个 block: 查 block_table 得到 physical_block_id → 读取 K/V
|
||
online softmax 累加
|
||
输出: [num_seqs, num_heads, head_dim]
|
||
```
|
||
|
||
### Copy-on-Write (高级,可选)
|
||
|
||
- 多个 sequence 共享相同 prefix 的 KV blocks(beam search, prompt caching)
|
||
- 写入时: 如果 ref_count > 1,先复制该 block 再修改
|
||
- 这阶段先不实现,标记为后续优化
|
||
|
||
### 测试验收
|
||
|
||
- [ ] 正确性: paged attention 输出与 Phase 9 简单 KV cache 完全一致
|
||
- [ ] 内存效率对比:
|
||
|
||
| 场景 | Naive KV Cache | Paged KV Cache |
|
||
|------|---------------|----------------|
|
||
| 1 req, seq=100 | 分配 4096 tokens | 分配 7 blocks (112 tokens) |
|
||
| 10 req, seq=100-500 | 10×4096 | 按需分配 |
|
||
| 最大并发数 (32GB) | N 个 | M 个 (M >> N) |
|
||
|
||
- [ ] Block allocator: alloc/free 循环,无内存泄漏
|
||
- [ ] Benchmark: paged attention kernel vs naive decode attention 延迟对比
|
||
|
||
---
|
||
|
||
## Phase 12: Continuous Batching + Request Scheduler
|
||
|
||
**Crate**: `xserv-runtime`
|
||
|
||
**目标**: 实现 iteration-level 调度,支持请求的动态加入和退出。
|
||
|
||
### Static Batching vs Continuous Batching
|
||
|
||
**Static (朴素)**:
|
||
```
|
||
Batch 1: [req1, req2, req3] → 等 req1, req2, req3 全部完成
|
||
Batch 2: [req4, req5, req6] → ...
|
||
问题: req1 完成了但 req3 还在生成 → GPU 空转
|
||
```
|
||
|
||
**Continuous (Orca 论文)**:
|
||
```
|
||
Iteration 1: [req1, req2, req3] → req1 完成!
|
||
Iteration 2: [req2, req3, req4] → req4 动态加入
|
||
Iteration 3: [req2, req3, req4] → req3 完成!
|
||
Iteration 4: [req2, req4, req5] → req5 动态加入
|
||
```
|
||
|
||
### 核心数据结构
|
||
|
||
```rust
|
||
#[derive(Clone, Copy, PartialEq)]
|
||
pub enum SequenceStatus {
|
||
Waiting, // 在等待队列中
|
||
Prefilling, // 正在做 prefill
|
||
Decoding, // 正在做 decode
|
||
Finished, // 已完成 (EOS / max_len / stop string)
|
||
Preempted, // 被抢占(显存不够,KV cache 被换出)
|
||
}
|
||
|
||
pub struct Sequence {
|
||
pub id: SeqId,
|
||
pub prompt_tokens: Vec<u32>,
|
||
pub generated_tokens: Vec<u32>,
|
||
pub status: SequenceStatus,
|
||
pub sampling_params: SamplingParams,
|
||
pub block_table: BlockTable,
|
||
pub arrival_time: Instant,
|
||
// 用于 streaming 输出
|
||
pub output_sender: tokio::sync::mpsc::Sender<GeneratedToken>,
|
||
}
|
||
|
||
pub struct Scheduler {
|
||
waiting: VecDeque<Sequence>,
|
||
running: Vec<Sequence>,
|
||
max_num_seqs: usize, // 最大并发 batch size
|
||
max_num_tokens: usize, // 单次 iteration 最大 token 数
|
||
block_manager: PagedKVCacheManager,
|
||
}
|
||
```
|
||
|
||
### 调度循环 (Engine 主循环)
|
||
|
||
```rust
|
||
loop {
|
||
// Step 1: 回收已完成的 sequence
|
||
// - 释放其 KV cache blocks
|
||
// - 从 running 移除
|
||
|
||
// Step 2: 检查能否 admit 新请求
|
||
// - 条件: running.len() < max_num_seqs
|
||
// && 有足够的 free blocks 给新请求的 prompt
|
||
// - FCFS 从 waiting 取
|
||
|
||
// Step 3: 划分 prefill / decode
|
||
// - 新加入的 sequence: prefill (处理完整 prompt)
|
||
// - 已在 running 的: decode (生成 1 个 token)
|
||
|
||
// Step 4: 组装 batch input
|
||
// - Prefill: 各 seq 的 prompt tokens, 需要 padding 或 ragged batch
|
||
// - Decode: 各 seq 的最后一个 token
|
||
|
||
// Step 5: Forward pass
|
||
// - prefill 和 decode 可以混合在一个 forward 中
|
||
// - 或者分开处理 (先 prefill, 再 decode)
|
||
|
||
// Step 6: Sampling
|
||
// - 对每个 seq 的 logits 进行采样
|
||
|
||
// Step 7: 更新状态
|
||
// - 将新 token append 到 sequence
|
||
// - 检查是否完成 (EOS / max_len)
|
||
// - 通过 channel 发送新 token 给 API 层
|
||
}
|
||
```
|
||
|
||
### Preemption(显存不足时的抢占)
|
||
|
||
当显存不足以 admit 新请求时:
|
||
1. **Swap**: 将低优先级 seq 的 KV cache 从 GPU 换到 CPU(复杂,后续再做)
|
||
2. **Recompute**: 丢弃低优先级 seq 的 KV cache,后续重新 prefill(简单,先实现这个)
|
||
|
||
### 测试验收
|
||
|
||
- [ ] 模拟 10 个请求在不同时间到达,所有请求都得到正确的生成结果
|
||
- [ ] 短请求完成后,新请求立即加入 batch(观察 log)
|
||
- [ ] Throughput 对比:
|
||
|
||
| 方式 | 20 请求总耗时 | Token/s |
|
||
|------|-------------|---------|
|
||
| 串行 (batch=1) | | |
|
||
| Static batch=4 | | |
|
||
| Continuous batch | | |
|
||
|
||
- [ ] 压力测试: 100 个并发请求,全部正确完成,无 hang/crash
|
||
|
||
---
|
||
|
||
## Phase 13: HTTP API + SSE Streaming — 里程碑 ③
|
||
|
||
**Crate**: `xserv-engine`, `xserv-api`
|
||
|
||
**目标**: 提供 OpenAI 兼容的 HTTP API,支持 SSE streaming。第一个端到端可用的里程碑。
|
||
|
||
### 技术栈
|
||
|
||
- **HTTP**: `axum` (Rust async web framework)
|
||
- **Async**: `tokio`
|
||
- **JSON**: `serde_json`
|
||
- **SSE**: `axum` 内置 SSE 支持 (`axum::response::sse`)
|
||
|
||
### API 端点
|
||
|
||
```
|
||
POST /v1/chat/completions # 主要端点 (ChatML 格式)
|
||
POST /v1/completions # 纯文本补全
|
||
GET /v1/models # 列出可用模型
|
||
GET /health # 健康检查
|
||
```
|
||
|
||
### 请求/响应格式 (OpenAI 兼容)
|
||
|
||
**Chat Completion Request**:
|
||
```json
|
||
{
|
||
"model": "qwen3-8b",
|
||
"messages": [
|
||
{"role": "system", "content": "You are a helpful assistant."},
|
||
{"role": "user", "content": "What is 1+1?"}
|
||
],
|
||
"stream": true,
|
||
"temperature": 0.7,
|
||
"top_p": 0.9,
|
||
"max_tokens": 256,
|
||
"stop": ["\n\n"]
|
||
}
|
||
```
|
||
|
||
**SSE Streaming Response**:
|
||
```
|
||
data: {"id":"chatcmpl-xxx","object":"chat.completion.chunk","created":1234567890,"model":"qwen3-8b","choices":[{"index":0,"delta":{"role":"assistant","content":""},"finish_reason":null}]}
|
||
|
||
data: {"id":"chatcmpl-xxx","object":"chat.completion.chunk","created":1234567890,"model":"qwen3-8b","choices":[{"index":0,"delta":{"content":"The"},"finish_reason":null}]}
|
||
|
||
data: {"id":"chatcmpl-xxx","object":"chat.completion.chunk","created":1234567890,"model":"qwen3-8b","choices":[{"index":0,"delta":{"content":" answer"},"finish_reason":null}]}
|
||
|
||
data: {"id":"chatcmpl-xxx","object":"chat.completion.chunk","created":1234567890,"model":"qwen3-8b","choices":[{"index":0,"delta":{},"finish_reason":"stop"}]}
|
||
|
||
data: [DONE]
|
||
```
|
||
|
||
**Non-streaming Response**:
|
||
```json
|
||
{
|
||
"id": "chatcmpl-xxx",
|
||
"object": "chat.completion",
|
||
"created": 1234567890,
|
||
"model": "qwen3-8b",
|
||
"choices": [{
|
||
"index": 0,
|
||
"message": {"role": "assistant", "content": "The answer is 2."},
|
||
"finish_reason": "stop"
|
||
}],
|
||
"usage": {
|
||
"prompt_tokens": 25,
|
||
"completion_tokens": 8,
|
||
"total_tokens": 33
|
||
}
|
||
}
|
||
```
|
||
|
||
### 架构分层
|
||
|
||
```
|
||
Client (curl / Python OpenAI SDK)
|
||
│
|
||
▼
|
||
┌─────────────────────────────────────┐
|
||
│ xserv-api (axum HTTP server) │
|
||
│ - 解析请求, 验证参数 │
|
||
│ - apply chat template │
|
||
│ - 将请求提交给 engine │
|
||
│ - 从 channel 接收 token, 编码为 SSE│
|
||
└────────────┬────────────────────────┘
|
||
│ InferenceRequest (通过 channel)
|
||
▼
|
||
┌─────────────────────────────────────┐
|
||
│ xserv-engine (推理引擎) │
|
||
│ - 独立的 OS thread (非 async) │
|
||
│ - 运行 scheduler 调度循环 │
|
||
│ - 管理 model + KV cache │
|
||
│ - 每生成一个 token, 通过 channel │
|
||
│ 发送给 API 层 │
|
||
└─────────────────────────────────────┘
|
||
```
|
||
|
||
**关键设计决策**:
|
||
- Engine 跑在独立 OS thread(避免 GPU 同步操作 block tokio runtime)
|
||
- API ↔ Engine 通过 `tokio::sync::mpsc` channel 通信
|
||
- 每个请求有独立的 `mpsc::Sender/Receiver` 用于 token streaming
|
||
|
||
### 测试验收
|
||
|
||
- [ ] `curl` 测试:
|
||
```bash
|
||
curl http://localhost:8080/v1/chat/completions \
|
||
-H "Content-Type: application/json" \
|
||
-d '{"model":"qwen3-8b","messages":[{"role":"user","content":"Hello"}],"stream":true}'
|
||
```
|
||
看到 SSE 逐 token 输出
|
||
|
||
- [ ] Python OpenAI SDK 测试:
|
||
```python
|
||
from openai import OpenAI
|
||
client = OpenAI(base_url="http://localhost:8080/v1", api_key="unused")
|
||
for chunk in client.chat.completions.create(
|
||
model="qwen3-8b",
|
||
messages=[{"role": "user", "content": "What is 1+1?"}],
|
||
stream=True
|
||
):
|
||
print(chunk.choices[0].delta.content or "", end="", flush=True)
|
||
```
|
||
|
||
- [ ] 非 streaming 模式也能正常工作
|
||
- [ ] 并发 10 个请求,全部正确完成
|
||
- [ ] 多轮对话: 连续发两轮消息(第二轮包含 history),验证上下文连贯
|
||
- [ ] `/v1/models` 返回已加载的模型列表
|
||
- [ ] 错误处理: 无效参数返回 400, 模型不存在返回 404
|
||
|
||
---
|
||
|
||
## Phase 14: Flash Attention (FA2 for SM120)
|
||
|
||
**Crate**: `xserv-kernels`
|
||
**CUDA 源码**: `csrc/attention/flash_attention.cu`
|
||
|
||
**目标**: 实现 Flash Attention 的 CUDA kernel,大幅降低 attention 的显存占用并提升速度。
|
||
|
||
### 硬件适配说明
|
||
|
||
Flash Attention 已发展到第 4 代 (FA4, arxiv 2603.05451),但各版本有明确的硬件依赖:
|
||
|
||
| 版本 | 目标架构 | 关键硬件特性 | RTX 5090 兼容 |
|
||
|------|---------|------------|--------------|
|
||
| FA2 | 通用 CUDA (SM75+) | 标准 shared memory + HMMA | **是** ✅ |
|
||
| FA3 | Hopper SM90 (H100) | TMA + WGMMA + warp specialization | 否 |
|
||
| FA4 | Blackwell SM100 (B200/B300) | TMEM + async MMA + 2-CTA mode | 否 |
|
||
|
||
**RTX 5090 (SM120, CC 12.0) 使用的是消费级 Blackwell 架构 (GB202),与数据中心 Blackwell (B200, SM100) 是不同的硅片设计。SM120 物理上没有 TMEM (Tensor Memory) 子系统,因此 FA4 的 kernel 无法在 5090 上运行。这不是软件限制,是硬件级差异。**
|
||
|
||
因此本项目实现 **FA2 算法**,使用标准 CUDA (shared memory + HMMA)。FA2 的核心优化——online softmax tiling、O(1) 显存占用——在任何架构上都有效。
|
||
|
||
### 核心思想
|
||
|
||
标准 attention 的问题:
|
||
```
|
||
S = Q @ K^T ← 需要 O(S²) 显存存储 S×S 矩阵
|
||
P = softmax(S) ← 需要完整的 S×S 才能做 softmax
|
||
O = P @ V
|
||
```
|
||
|
||
Flash Attention 的解法:
|
||
- **不 materialize S×S 矩阵**
|
||
- 将 Q, K, V 分成 tiles,在 SRAM (shared memory) 中计算
|
||
- 使用 **online softmax trick**: 边算边更新 running max 和 running sum
|
||
|
||
### 算法 (Forward Pass, FA2)
|
||
|
||
FA2 相比 FA1 的改进: 外层循环遍历 Q tiles (而非 K/V),减少 HBM 读写次数。
|
||
|
||
```
|
||
Br, Bc = tile sizes for Q and K/V respectively
|
||
|
||
for each Q tile (q_start..q_start+Br): ← 外层: Q tiles
|
||
load Q_tile [Br, D] to shared memory
|
||
initialize: O_tile = 0, l = 0, m = -inf // running sum and max
|
||
|
||
for each K,V tile (kv_start..kv_start+Bc): ← 内层: K/V tiles
|
||
load K_tile [Bc, D], V_tile [Bc, D] to shared memory
|
||
|
||
// Compute attention scores for this tile pair
|
||
S_tile = Q_tile @ K_tile^T // [Br, Bc], in registers/SRAM
|
||
|
||
// Apply causal mask (skip if kv_start > q_start + Br)
|
||
if causal: mask upper triangle of S_tile
|
||
|
||
// Online softmax update
|
||
m_new = max(m, rowmax(S_tile)) // new running max
|
||
P_tile = exp(S_tile - m_new) // safe exp
|
||
l_new = exp(m - m_new) * l + rowsum(P_tile) // update running sum
|
||
|
||
// Rescale and accumulate output
|
||
O_tile = diag(exp(m - m_new)) * O_tile + P_tile @ V_tile
|
||
m = m_new
|
||
l = l_new
|
||
|
||
O_tile = O_tile / l // final normalization
|
||
write O_tile [Br, D] to global memory (HBM)
|
||
```
|
||
|
||
### 实现要点
|
||
|
||
1. **Tile 大小选择**:
|
||
- 5090 SM120: shared memory per SM = 100 KB (需实测确认)
|
||
- 需同时存 Q_tile, K_tile, V_tile, S_tile
|
||
- BF16: Q_tile [Br, D] = Br × 128 × 2B; K_tile [Bc, D] = Bc × 128 × 2B
|
||
- S_tile [Br, Bc] 保持 FP32 = Br × Bc × 4B
|
||
- 推荐起步: Br=Bc=64, head_dim=128 → 共需 ~100KB shared memory
|
||
- 优化版: Br=Bc=128 需要更多 shared memory, 可能需要拆分
|
||
|
||
2. **Causal mask 优化**:
|
||
- 如果 K/V tile 完全在 Q tile 的"未来"(kv_start > q_end)→ 跳过整个 tile
|
||
- 减少约 50% 的计算量
|
||
|
||
3. **BF16 精度**:
|
||
- S_tile, P_tile 的计算在 FP32 中进行(累加精度)
|
||
- Q, K, V 的加载用 BF16(节省 bandwidth)
|
||
- 最终 O 转回 BF16 写出
|
||
|
||
4. **GQA 支持**:
|
||
- K/V heads 数量 < Q heads 时,kernel 中做 `kv_head = q_head / num_groups` 索引
|
||
- 不需要 repeat_kv 操作,直接在 kernel 内部解决
|
||
|
||
5. **Decode attention 特化**:
|
||
- Decode 时 Q 只有 1 行 (Br=1),退化为 vector-matrix attention
|
||
- 可以写一个专门的 decode attention kernel (类似 FlashDecoding)
|
||
- 沿 KV sequence 维度做 parallel reduction
|
||
|
||
### 测试验收
|
||
|
||
- [ ] 正确性: 与 Phase 5 naive attention 对比, max error < 1e-2 (BF16)
|
||
- [ ] 显存: Flash Attention 不随 S 平方增长
|
||
|
||
| Seq Length | Naive VRAM | Flash VRAM | Naive Time | Flash Time |
|
||
|------------|-----------|------------|------------|------------|
|
||
| 512 | MB | MB | ms | ms |
|
||
| 2048 | MB | MB | ms | ms |
|
||
| 8192 | OOM? | MB | OOM? | ms |
|
||
| 32768 | OOM | MB | OOM | ms |
|
||
|
||
- [ ] 集成到 Qwen3-8B,端到端 decode latency 对比
|
||
- [ ] Profile: `ncu` 分析 compute utilization, memory throughput
|
||
- [ ] GQA 支持: 无 repeat_kv 开销
|
||
|
||
---
|
||
|
||
## Phase 15: 性能优化 — 里程碑 ④
|
||
|
||
**目标**: 系统性 profiling + 优化,向 50% vLLM throughput 目标冲刺。
|
||
|
||
### 优化方向
|
||
|
||
#### 1. Kernel Fusion
|
||
减少 memory-bound kernel 之间的 global memory 读写:
|
||
- **Residual Add + RMSNorm**: 一次读 hidden + residual,写出 normed output
|
||
- **SiLU + Elementwise Mul** (SwiGLU 内部): 一次读 gate + up, 写出 SiLU(gate)*up
|
||
- **Bias Add + Activation**: Linear 的 bias + 激活函数合并
|
||
|
||
原则: 两个逐元素 kernel 之间如果有 global memory 读写,就值得融合。
|
||
|
||
#### 2. CUDA Graphs
|
||
- Decode 阶段每步的 kernel 序列是固定的(shape 不变)
|
||
- 用 `cudaStreamBeginCapture` / `cudaStreamEndCapture` 捕获一次
|
||
- 后续用 `cudaGraphLaunch` 重放(消除 kernel launch overhead)
|
||
- **注意**: batch size 变化时需要重新 capture
|
||
|
||
#### 3. Memory 优化
|
||
- **权重预加载**: 启动时加载到 GPU,推理路径上零分配
|
||
- **Activation reuse**: 同一层的中间结果用完立即释放/复用
|
||
- **Pinned memory**: H2D/D2H 用 `cudaMallocHost`(pinned)提升拷贝带宽
|
||
|
||
#### 4. Compute 优化
|
||
- 确保所有 GEMM 走 Tensor Core (BF16)
|
||
- Decode attention: 优化 memory bandwidth 利用率
|
||
- Prefill: chunked processing(控制峰值显存,允许更大 batch)
|
||
|
||
#### 5. Scheduling 优化
|
||
- Prefill-Decode disaggregation: prefill 和 decode 分开 batch
|
||
- 原因: prefill 是 compute-bound, decode 是 memory-bound, 混合导致两边都不优
|
||
- Dynamic batch size: 根据当前 running seqs 的 seq_len 动态调整
|
||
|
||
### Profiling 工具使用
|
||
|
||
```bash
|
||
# 整体 timeline (哪个 kernel 最耗时)
|
||
nsys profile --stats=true ./target/release/xserv-server
|
||
|
||
# 单个 kernel 分析 (occupancy, memory throughput)
|
||
ncu --target-processes all --set full ./target/release/xserv-server
|
||
|
||
# 自定义 Rust 计时
|
||
# 在 engine 循环中记录每个 phase 的耗时
|
||
```
|
||
|
||
### 测试验收
|
||
|
||
- [ ] 安装 vLLM,同一台机器跑 Qwen3-8B
|
||
- [ ] Benchmark 对比:
|
||
|
||
| Metric | vLLM | xserv | Ratio |
|
||
|--------|------|-------|-------|
|
||
| Prefill latency (ms, 128 tokens) | | | |
|
||
| Decode throughput (tokens/s, batch=1) | | | |
|
||
| Decode throughput (tokens/s, batch=16) | | | |
|
||
| Max concurrent requests (32GB) | | | |
|
||
|
||
- [ ] 目标: xserv throughput >= 50% vLLM
|
||
- [ ] Profiling 报告: 每个组件的耗时占比 pie chart
|
||
- [ ] 无功能回归: 所有之前的集成测试通过
|
||
|
||
---
|
||
|
||
## Phase 16: Speculative Decoding
|
||
|
||
**Crate**: `xserv-speculative`
|
||
|
||
**目标**: 用小模型(draft model)加速大模型(target model)的 decode 阶段。
|
||
|
||
### 算法
|
||
|
||
```
|
||
γ = 4 (speculative tokens 数量)
|
||
|
||
1. Draft model 自回归生成 γ 个 token: t1, t2, t3, t4
|
||
2. Target model 一次 forward 处理这 γ+1 个 position
|
||
(等价于一次 prefill: [last_accepted_token, t1, t2, t3, t4])
|
||
得到 γ+1 个 logits
|
||
3. Rejection sampling:
|
||
for i in 0..γ:
|
||
p = target_prob[i][t_{i+1}] // target model 给 draft token 的概率
|
||
q = draft_prob[i][t_{i+1}] // draft model 给该 token 的概率
|
||
if random() < min(1, p/q):
|
||
accept t_{i+1}
|
||
else:
|
||
reject, resample from adjusted distribution
|
||
break
|
||
4. 至少接受 1 个 token, 期望接受 ~γ×acceptance_rate 个
|
||
```
|
||
|
||
### 关键点
|
||
|
||
- **无损**: rejection sampling 保证输出分布与纯 target model 一致
|
||
- **加速条件**: draft model 足够快且与 target 分布接近
|
||
- **Draft model 选择**: Qwen3-0.5B / Qwen3-1.5B 作为 Qwen3-8B 的 draft
|
||
|
||
### KV Cache 处理
|
||
|
||
- Draft model 有自己的 KV cache
|
||
- Target model 验证时,accepted tokens 的 KV 可以复用(不用重算)
|
||
- Rejected 位置之后的 KV 需要丢弃
|
||
|
||
### 测试验收
|
||
|
||
- [ ] 验证: speculative decode 100 条不同 prompt,输出分布与标准 decode 无统计差异
|
||
- [ ] Acceptance rate 统计 (期望 60-80% per token)
|
||
- [ ] 端到端加速:
|
||
|
||
| Method | Tokens/s (batch=1) | Speedup |
|
||
|--------|--------------------|---------|
|
||
| Standard decode | | 1.0x |
|
||
| Speculative (γ=4) | | ~2-3x |
|
||
|
||
---
|
||
|
||
## Phase 17: Tensor Parallelism (TP=1/2/4) — 里程碑 ⑤
|
||
|
||
**Crate**: `xserv-distributed`
|
||
|
||
**目标**: 实现 Tensor Parallelism,支持 TP=2 和 TP=4(同组 GPU 内),跑通多卡推理。PP 及更大规模并行留待后续扩展。
|
||
|
||
### 通信后端: NCCL
|
||
|
||
NVIDIA Collective Communication Library,提供高效的 multi-GPU 通信原语。
|
||
|
||
需要封装的操作:
|
||
```rust
|
||
// NCCL FFI
|
||
ncclAllReduce(sendbuff, recvbuff, count, datatype, op, comm, stream)
|
||
ncclAllGather(sendbuff, recvbuff, count, datatype, comm, stream)
|
||
ncclBroadcast(sendbuff, recvbuff, count, datatype, root, comm, stream)
|
||
```
|
||
|
||
### Tensor Parallelism 策略 (Megatron-LM 风格)
|
||
|
||
每层只需要 **2 次 AllReduce**:
|
||
|
||
#### Attention 部分
|
||
```
|
||
Column Parallel: Q/K/V proj 按 head 维度切分
|
||
GPU 0: q_proj[:, :hidden/TP], k_proj[:, :hidden/TP], v_proj[:, :hidden/TP]
|
||
GPU 1: q_proj[:, hidden/TP:], k_proj[:, hidden/TP:], v_proj[:, hidden/TP:]
|
||
→ 每卡计算自己那部分 heads 的 attention → 无需通信
|
||
|
||
Row Parallel: o_proj 按行切分
|
||
每卡计算部分输出 → AllReduce 求和 → 得到完整 output
|
||
```
|
||
|
||
#### FFN (SwiGLU) 部分
|
||
```
|
||
Column Parallel: gate_proj, up_proj 按列切分
|
||
每卡计算部分 intermediate features → 无需通信
|
||
|
||
Row Parallel: down_proj 按行切分
|
||
每卡计算部分输出 → AllReduce 求和 → 得到完整 output
|
||
```
|
||
|
||
#### 其他
|
||
- **Embedding**: vocab 按行切分,AllGather 拼接
|
||
- **RMSNorm**: 每卡独立计算(输入是 AllReduce 后的完整 tensor)
|
||
- **lm_head**: 按列切分,AllGather 拼接 logits
|
||
|
||
### 权重分片
|
||
|
||
启动时:
|
||
1. Rank 0 加载完整权重
|
||
2. 按 TP 策略切分
|
||
3. AllGather 或 Scatter 分发到各卡
|
||
|
||
或者:
|
||
- 每个 rank 独立加载,只读取属于自己的那部分(更高效)
|
||
|
||
### 互联拓扑 (已确认)
|
||
|
||
**纯 PCIe Gen5 x16, 无 NVLink**。GPU 分两组: 0-3 (PHB) 和 4-7 (PHB),跨组走 NODE。
|
||
|
||
**TP 部署策略** (当前阶段目标: TP=1/2/4):
|
||
- **TP=2**: 同组内任意两卡 (如 GPU0+GPU1),PCIe PHB 延迟最低
|
||
- **TP=4**: 同组 4 卡 (GPU 0-3 或 GPU 4-7),全 PHB 互联
|
||
- **PCIe Gen5 x16 带宽**: 理论 ~64 GB/s 单向,实测 AllReduce 有效带宽约 40-50 GB/s
|
||
- **后续扩展**: TP=8 (跨组) 和 Pipeline Parallelism 留待后续阶段
|
||
|
||
### 测试验收
|
||
|
||
- [ ] TP=2: Qwen3-8B 输出与单卡 (TP=1) 完全一致
|
||
- [ ] TP=4: 每卡权重显存占用约 1/4
|
||
- [ ] Scaling benchmark (同组 GPU 0-3):
|
||
|
||
| TP Size | Prefill (tokens/s) | Decode (tokens/s) | Scaling Efficiency |
|
||
|---------|--------------------|--------------------|-------------------|
|
||
| 1 | | | 100% |
|
||
| 2 | | | % |
|
||
| 4 | | | % |
|
||
|
||
- [ ] AllReduce latency 测量 (不同消息大小,同组 PHB 互联)
|
||
|
||
---
|
||
|
||
## Phase 18: 量化 (FP8 / INT8)
|
||
|
||
**Crate**: `xserv-kernels` (新增量化 kernel), `xserv-model` (量化加载)
|
||
**CUDA 源码**: `csrc/quantize/`
|
||
|
||
**目标**: 降低模型显存占用和提升计算吞吐。
|
||
|
||
### 量化方式
|
||
|
||
#### 1. Weight-Only INT8
|
||
- 只量化权重,activation 保持 BF16
|
||
- Per-channel scale: 每个输出 channel 一个 `scale` 和 `zero_point`
|
||
- GEMM: INT8 × BF16 (dequantize on-the-fly)
|
||
- 适用于 memory-bound 场景(decode)
|
||
|
||
#### 2. FP8 (E4M3 / E5M2)
|
||
- 5090 Blackwell (CC 12.0) 原生支持 FP8 Tensor Core
|
||
- 权重和 activation 都量化为 FP8
|
||
- Dynamic scaling: 每个 tensor 运行时计算 `amax`,确定 scale factor
|
||
- GEMM: FP8 × FP8 → BF16/FP32 accumulate
|
||
- 适用于 compute-bound 场景(prefill)
|
||
|
||
#### 3. GPTQ / AWQ (高级, 可选)
|
||
- INT4 weight quantization
|
||
- 需要 calibration data
|
||
- 更复杂但压缩率更高
|
||
|
||
### FP8 实现要点
|
||
|
||
```rust
|
||
pub enum DType {
|
||
// ... existing
|
||
F8E4M3, // 1 sign + 4 exponent + 3 mantissa (范围小,精度高)
|
||
F8E5M2, // 1 sign + 5 exponent + 2 mantissa (范围大,精度低)
|
||
}
|
||
```
|
||
|
||
Dynamic scaling:
|
||
```
|
||
scale = amax(tensor) / fp8_max // fp8_max = 448.0 for E4M3
|
||
tensor_fp8 = cast_to_fp8(tensor / scale)
|
||
// GEMM 后: output = output * scale_A * scale_B
|
||
```
|
||
|
||
### 测试验收
|
||
|
||
- [ ] 精度:
|
||
|
||
| Quantization | Perplexity (WikiText-2) | vs BF16 |
|
||
|-------------|------------------------|---------|
|
||
| BF16 (baseline) | X.XX | — |
|
||
| FP8 E4M3 | X.XX | +0.XX |
|
||
| INT8 weight-only | X.XX | +0.XX |
|
||
|
||
- [ ] 显存: FP8 权重占用约 BF16 的一半 (~8 GB for 8B model)
|
||
- [ ] 性能: FP8 GEMM throughput vs BF16 GEMM
|
||
|
||
---
|
||
|
||
## Phase 19: Multimodal — 里程碑 ⑥
|
||
|
||
**Crate**: `xserv-model` (新增 vision encoder)
|
||
|
||
**目标**: 支持 vision-language 模型,接受图片+文本输入。
|
||
|
||
### 目标模型
|
||
|
||
Qwen-VL 系列(或类似 architecture)。典型结构:
|
||
```
|
||
Image → ViT Encoder → Visual Tokens → Projector → LLM Input
|
||
Text → Tokenizer → Text Tokens ────────────→ LLM Input
|
||
```
|
||
|
||
### 需要新增的组件
|
||
|
||
#### 1. ViT (Vision Transformer) Encoder
|
||
- **Patch Embedding**: 将图片切成 14×14 patches, 每个 patch 线性投影
|
||
- 输入: [B, 3, H, W] → 输出: [B, num_patches, hidden_dim]
|
||
- **ViT Blocks**: 标准 Transformer Encoder blocks
|
||
- Multi-Head Self-Attention (无 causal mask)
|
||
- FFN
|
||
- LayerNorm
|
||
- 参数量: 通常 300M-600M
|
||
|
||
#### 2. Visual-Language Projector
|
||
- MLP: 将 ViT 输出维度映射到 LLM embedding 维度
|
||
- 可能包含 pooling / resampling (减少 visual token 数量)
|
||
|
||
#### 3. 图片预处理
|
||
- Resize to target resolution (e.g., 448×448)
|
||
- Normalize: ImageNet mean/std
|
||
- 使用 `image` crate 在 CPU 上完成
|
||
|
||
#### 4. 输入拼接
|
||
```
|
||
[<image_token>] × num_visual_tokens + [text tokens]
|
||
```
|
||
- 在 embedding 层面拼接
|
||
- LLM 处理混合 sequence
|
||
|
||
### API 扩展
|
||
|
||
```json
|
||
{
|
||
"model": "qwen-vl",
|
||
"messages": [{
|
||
"role": "user",
|
||
"content": [
|
||
{"type": "image_url", "image_url": {"url": "data:image/png;base64,..."}},
|
||
{"type": "text", "text": "What is in this image?"}
|
||
]
|
||
}]
|
||
}
|
||
```
|
||
|
||
### 外部依赖
|
||
|
||
- `image` crate (图片加载和预处理)
|
||
- `base64` crate (解码 base64 图片)
|
||
|
||
### 测试验收
|
||
|
||
- [ ] 加载 Qwen-VL 模型,输入一张猫的图片 + "What is in this image?"
|
||
- [ ] 生成合理的图片描述
|
||
- [ ] 与 HF transformers 输出对比
|
||
- [ ] API 端到端: HTTP POST 含 base64 图片 → streaming 文字回答
|
||
- [ ] 纯文本请求不受影响(向后兼容)
|
||
|
||
---
|
||
|
||
## 实际进展记录(与原计划的分叉,2026-06 更新)
|
||
|
||
Phase 0–17 按计划完成。Phase 18 起实际路线偏离了上面的原计划
|
||
(speculative decoding 与多模态推迟),实际走向是 MoE + 量化 + 稀疏化:
|
||
|
||
| 实际 Phase | 内容 | 文档 |
|
||
|---|---|---|
|
||
| 18 | Pipeline Parallelism(PP=2/4) | `18-pipeline-parallelism.md`、`benchmarks/pp-sweep.md` |
|
||
| 19 | **gpt-oss-20b MoE**:harmony 格式、attention sinks + sliding window、YaRN;两个 CUDA bug 实战(prefill sinks NaN、GEMV 未初始化 smem);GSM8K 94.5% 对齐 llama.cpp;FP8 W8A8 / MXFP4 W4A16 量化 | `19-gpt-oss-moe.md`、`benchmarks/{fp8-quantization,mxfp4-and-llama-decode}.md` |
|
||
| 20 | **稀疏 top-k MoE decode**:只算被路由的专家,decode 13.9→7.0ms,TP=2 下 decode/TTFT 全面快于 llama.cpp 同配置;gpt-oss 单卡 serving | `20-sparse-moe.md`、`benchmarks/sparse-moe.md` |
|
||
| 21 | **decode CUDA Graph + GPU argmax**:整个 decode step 录成一个图回放(thread-local launch stream、retained-warmup 分配策略、NCCL capture);greedy 采样换 GPU argmax。TPOT 7.5→5.9ms(TP=1)/ 5.8ms(TP=2);TP=2 全面领先 llama(1.26-1.47×),TP=1 差距 2.5×→2.0× | `21-cuda-graph-decode.md` |
|
||
|
||
**下一步候选(按预期收益排序):**
|
||
|
||
| 候选 Phase | 内容 | 预期 |
|
||
|---|---|---|
|
||
| 22 | **非专家权重量化**:qkv/o + lm_head(1.16GB/token)仍是 BF16 | TPOT 再省 ~1.5ms |
|
||
| 23 | **稀疏 prefill**(按专家 permute + grouped GEMM) | 长 prompt TTFT 51-75 → ~30ms |
|
||
| 24 | server 侧 harmony channel 分离(`reasoning_content` 流式输出,对齐 llama-server 行为) | API 易用性 |
|
||
| — | Speculative Decoding、多模态(原 16/19) | 推迟 |
|
||
|
||
## 里程碑总结
|
||
|
||
| 里程碑 | Phase | 验收标准 |
|
||
|--------|-------|---------|
|
||
| ① GPT-2 推理 | 8 | CLI 输入 prompt, GPT-2 生成连贯文本, logits 与 PyTorch 一致 |
|
||
| ② Qwen3-8B 推理 | 10 | 8B 模型中英文对话, 多轮 chat template 正确 |
|
||
| ③ E2E API | 13 | HTTP streaming API, Python OpenAI SDK 可调用, 10 并发正确 |
|
||
| ④ 性能达标 | 15 | throughput >= 50% vLLM, profiling 报告完成 |
|
||
| ⑤ 多卡推理 | 17 | TP=2/4 同组 GPU 推理正确, scaling benchmark 完成 |
|
||
| ⑥ MoE 模型(实际) | 19 | gpt-oss-20b 端到端正确, GSM8K 与 llama.cpp 持平 ✅ |
|
||
| ⑦ 性能反超(实际) | 20 | 同配置 decode 快于 llama.cpp(TP=2 达成;单卡是 Phase 21+ 目标) ✅ |
|
||
| ⑧ 多模态 | 推迟 | 图片输入 → 文字回答, API 端到端 |
|
||
|
||
## 外部依赖清单
|
||
|
||
| Crate | 用途 | 引入 Phase |
|
||
|-------|------|-----------|
|
||
| `cc` | build.rs 编译 .cu 文件 | 0 |
|
||
| `half` | f16 / bf16 Rust 类型 | 2 |
|
||
| `smallvec` | Tensor shape / strides (栈分配) | 2 |
|
||
| `safetensors` | 权重文件解析 | 6 |
|
||
| `serde` + `serde_json` | JSON 序列化 | 6 |
|
||
| `memmap2` | 文件 mmap (safetensors 可能内置) | 6 |
|
||
| `regex` | BPE pre-tokenization | 7 |
|
||
| `rand` | Sampling (随机数) | 8 |
|
||
| `tokio` | Async runtime | 13 |
|
||
| `axum` | HTTP server | 13 |
|
||
| `criterion` | Benchmark framework | 3+ |
|
||
| `image` | 图片加载 (multimodal) | 19 |
|
||
| `base64` | Base64 decode (multimodal API) | 19 |
|
||
|
||
**不使用**: `candle`, `burn`, `tch`, `tokenizers`, `cudarc` — 核心组件全部自己实现。
|