Files

Gahow Wang ee68d3565d fix: comprehensive review + 14 bug fixes + Phase 12/14 overhaul

Strict code review identified 30+ issues across correctness, performance,
and architecture. This commit addresses 14 of them with verified fixes,
restructures Phase 12 for honest continuous batching, and updates Phase 14
to target FA2 (RTX 5090 SM120 lacks TMEM required by FA4).

Bug fixes:
- FIX-01: Global cuBLAS handle (thread-local singleton, was per-call)
- FIX-02: Remove 19 unnecessary cudaDeviceSynchronize calls from kernels
- FIX-03: Qwen3 ChatML template (was plain text concatenation)
- FIX-04: EOS token from tokenizer (was hardcoded 151645)
- FIX-05: Storage tracks actual GPU device ordinal (was always Cuda(0))
- FIX-06: unsqueeze stride preserves contiguous layout
- FIX-08: CudaDeviceProp replaced with heap buffer (was UB-prone padding)
- FIX-09: Tokenizer byte_fallback to <0xNN> tokens (was panic)

Feature additions:
- FIX-10: SSE streaming (/v1/chat/completions, OpenAI-compatible)
- FIX-11: Correct usage statistics (prompt/completion/total tokens)
- FIX-13: Temperature / top-k / top-p sampling with SamplingParams

Performance improvements:
- FIX-07: Caching allocator wired up (thread-local pool, pooled flag)
- FIX-12: KV cache staging buffers (zero-alloc get_kv_len via borrow_raw)
- FIX-14: GPU strided copy kernel (eliminates contiguous() CPU round-trip)

Architecture:
- Phase 12 engine restructured: prefill/decode separation, honest TODO
  for batched GPU forward (requires Flash Attention)
- Phase 14 updated: FA2 for SM120 (FA4 requires TMEM, absent on 5090)
- Qwen3-7B → Qwen3-8B typo fixed across all docs (36 layers, hidden 4096)

Validated on dash5 (8x RTX 5090):
- 52/52 API prompts pass (EN/CN/code), SSE streaming verified
- Logits match HF transformers 9/10 top-1, 4.0/5 avg top-5 overlap
- 8 concurrent requests: 5.99x scheduling speedup (batch_size=4)
- Throughput: 10.3 tok/s (serial), 30% of HF baseline

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

2026-05-22 17:53:28 +08:00

60 KiB

Raw Blame History

xserv — LLM Inference Engine Roadmap

从零用 Rust + CUDA 构建一个完整的 LLM 推理引擎，目标是深入理解 LLM Serving 全栈技术。

设计决策

决策项	选择	备注
抽象层级	Level 0.5	自写 CUDA kernel + cuBLAS 可切换，便于 benchmark 对比
硬件	8×RTX 5090 (Blackwell, CC 12.0, 32GB GDDR7)	纯 PCIe Gen5 x16 互联，无 NVLink (详见下方硬件拓扑)
语言	Rust + CUDA (C/C++)	Rust FFI 调用 CUDA
起步模型	GPT-2 124M → Qwen3-8B	从简单到实用
精度	BF16/FP16	后期扩展 FP8
Tensor	自己实现	完整学习 tensor 抽象设计
Tokenizer	自己实现 BPE	学习分词机制
权重格式	safetensors	Rust 友好，零拷贝 mmap
Async Runtime	tokio	成熟稳定，不引入性能问题
API	OpenAI 兼容	`/v1/chat/completions`，SSE streaming
时间线	不限	学习为主，每步验证

硬件拓扑 (dash5, 已确认 2026-05-21)

GPU: 8× NVIDIA GeForce RTX 5090, 32607 MiB, Compute Capability 12.0 CUDA Toolkit: 12.9 (安装于 /usr/local/cuda-12.9，需将 bin/ 加入 PATH) PCIe: Gen 5 x16 (理论单向 ~64 GB/s，空闲时降频至 Gen 1)

互联拓扑 (nvidia-smi topo -m):

        GPU0  GPU1  GPU2  GPU3  GPU4  GPU5  GPU6  GPU7
GPU0     X    PHB   PHB   PHB   NODE  NODE  NODE  NODE
GPU1    PHB    X    PHB   PHB   NODE  NODE  NODE  NODE
GPU2    PHB   PHB    X    PHB   NODE  NODE  NODE  NODE
GPU3    PHB   PHB   PHB    X    NODE  NODE  NODE  NODE
GPU4    NODE  NODE  NODE  NODE   X    PHB   PHB   PHB
GPU5    NODE  NODE  NODE  NODE  PHB    X    PHB   PHB
GPU6    NODE  NODE  NODE  NODE  PHB   PHB    X    PHB
GPU7    NODE  NODE  NODE  NODE  PHB   PHB   PHB    X

PHB  = 同一 PCIe Host Bridge（同组，延迟低）
NODE = 跨 PCIe Host Bridge（跨组，延迟较高）

分组: GPU 0-3 为一组, GPU 4-7 为一组。组内 PHB 互联，跨组 NODE 互联。

对设计的影响:

无 NVLink: AllReduce 带宽受限于 PCIe (~64 GB/s vs NVLink ~450 GB/s)
TP 策略: 当前阶段目标 TP=1/2/4，在同组内 (0-3 或 4-7) 执行，全 PHB 互联
跨组并行 (TP=8, PP 等): 留待后续扩展
CPU Affinity: GPU 0-3 亲和 CPU 0-127, GPU 4-7 亲和 CPU 0-186（NUMA 0-1）

项目结构

xserv/
├── Cargo.toml                    # workspace root
├── csrc/                         # CUDA 源文件 (.cu / .cuh)
│   ├── gemm/                     #   GEMM kernels (naive, tiled, tensor core)
│   ├── attention/                #   Attention kernels (naive, flash, paged)
│   ├── normalization/            #   LayerNorm, RMSNorm
│   ├── activation/               #   GELU, SiLU
│   ├── embedding/                #   Embedding lookup, RoPE
│   ├── reduce/                   #   Softmax, argmax, sampling
│   └── quantize/                 #   FP8/INT8 kernels
├── crates/
│   ├── xserv-cuda/               # Phase 1: CUDA FFI, context, stream, allocator
│   ├── xserv-tensor/             # Phase 2: Tensor type, ops dispatch, DType
│   ├── xserv-kernels/            # Phase 3-5: kernel registry (custom + cuBLAS)
│   ├── xserv-tokenizer/          # Phase 7: BPE tokenizer
│   ├── xserv-model/              # Phase 6,8,10: model def + weight loading
│   ├── xserv-runtime/            # Phase 9,11,12: KV cache, paging, scheduler
│   ├── xserv-engine/             # Phase 13: inference engine orchestration
│   ├── xserv-api/                # Phase 13: HTTP server + OpenAI compat
│   ├── xserv-speculative/        # Phase 16: speculative decoding
│   └── xserv-distributed/        # Phase 17: tensor parallelism, NCCL
├── tests/                        # integration tests
├── benches/                      # criterion benchmarks
├── tools/                        # 辅助脚本 (PyTorch reference output 生成等)
└── docs/                         # 每个 phase 的设计文档

Phase 依赖图

Phase 0: 项目脚手架 + 环境验证
    │
Phase 1: CUDA FFI 基础设施
    │
Phase 2: Tensor 抽象层
    │
Phase 3: GEMM (naive → tiled → tensor core → cuBLAS)
    │
Phase 4: Transformer Kernels (Norm, Activation, Embedding, RoPE, Softmax)
    │
Phase 5: Attention Kernel (naive MHA)
    │
Phase 6: 模型加载 (safetensors + HF config)
    │              │
    │         Phase 7: BPE Tokenizer (可与 Phase 6 并行)
    │              │
Phase 8: GPT-2 完整推理 ◄──────────── 里程碑 ① CLI 文本生成
    │
Phase 9: KV Cache + Autoregressive Generation
    │
Phase 10: Qwen3-8B 支持 ◄─────────── 里程碑 ② 8B 模型推理
    │
Phase 11: Paged Attention + KV Cache Manager
    │
Phase 12: Continuous Batching + Request Scheduler
    │
Phase 13: HTTP API + SSE Streaming ◄── 里程碑 ③ 端到端 API 可用
    │
Phase 14: Flash Attention (FA2 for SM120)
    │
Phase 15: 性能优化 ◄──────────────── 里程碑 ④ 50% vLLM throughput
    │
Phase 16: Speculative Decoding
    │
Phase 17: Tensor Parallelism (TP=1/2/4) ◄── 里程碑 ⑤ 多卡推理
    │
Phase 18: 量化 (FP8 / INT8)
    │
Phase 19: Multimodal ◄────────────── 里程碑 ⑥ 视觉问答

Phase 0: 项目脚手架 + 环境验证

目标: 搭建 Cargo workspace，验证 CUDA 工具链，确保开发环境就绪。

技术要点:

Cargo workspace 配置，所有 crate 共享依赖版本
build.rs 中用 cc crate 编译 .cu 文件的 pipeline
验证 CUDA toolkit 版本: 已确认 CUDA 12.9 (/usr/local/cuda-12.9)
验证 GPU compute capability: 已确认 CC 12.0 (Blackwell)
确认 nvcc 在 PATH 中 (需要 export PATH=/usr/local/cuda-12.9/bin:$PATH)
~~运行 nvidia-smi topo -m 确认互联拓扑~~ 已确认: 纯 PCIe Gen5, 无 NVLink

外部依赖: cc crate（编译 CUDA）

测试验收:

cargo build 通过
一个最小的 .cu kernel（向量加法）能从 Rust 调用并返回正确结果
输出 GPU 信息（名称、显存、compute capability）

设计文档: docs/01-cuda-ffi.md（与 Phase 1 合并）

Phase 1: CUDA FFI 基础设施

Crate: xserv-cuda

目标: 封装 CUDA Runtime API，提供安全的 Rust 抽象层。

模块划分

xserv-cuda/src/
├── lib.rs
├── error.rs        # CudaError 类型, cudaGetLastError 封装
├── device.rs       # Device 枚举, 设备查询 (属性/数量/当前设备)
├── context.rs      # CUDA context 管理
├── stream.rs       # CudaStream (异步操作流)
├── memory.rs       # GPU 内存分配/释放/拷贝 (H2D, D2H, D2D)
├── allocator.rs    # Caching Allocator (显存池)
└── module.rs       # cuModuleLoad (加载 PTX/cubin, 可选)

关键技术点

FFI 绑定策略:
- 手写 extern "C" 绑定核心 CUDA Runtime API（~30 个函数）
- 不用 bindgen，保持可控和可读
- 需要绑定的 API: cudaMalloc, cudaFree, cudaMemcpy, cudaMemcpyAsync, cudaStreamCreate, cudaStreamSynchronize, cudaGetDeviceProperties, cudaSetDevice, cudaDeviceSynchronize, cudaGetLastError 等

GpuBuffer 抽象:

pub struct GpuBuffer {
    ptr: *mut c_void,
    size_bytes: usize,
    device: usize,
}

impl Drop for GpuBuffer {
    fn drop(&mut self) { /* cudaFree or return to allocator */ }
}

Drop trait 自动释放，防止 GPU 内存泄漏
不实现 Clone（显式 copy_from 代替）

Caching Allocator:
- 维护 free list（按大小分桶，桶边界: 512B, 1KB, 2KB, ..., 1GB）
- alloc(size): 在对应桶中找 >= size 的 free block，miss 时 cudaMalloc
- free(ptr, size): 不调 cudaFree，放回 free list
- trim(): 真正释放所有 free blocks（OOM 恢复时用）
- 这是性能关键组件——频繁 cudaMalloc/cudaFree 会严重影响 throughput
- 参考: PyTorch 的 CUDACachingAllocator 设计
Stream 管理:
- 每个 stream 是独立的 GPU 执行队列
- Kernel launch 和 memcpy 是异步的（提交到 stream 后立即返回）
- stream.synchronize() 等待该 stream 上所有操作完成
- 后续用于 overlap compute 和 memory transfer

Error Handling:

#[derive(Debug)]
pub enum CudaError {
    OutOfMemory,
    InvalidDevice,
    LaunchFailure,
    Raw { code: i32, message: String },
}

// 所有 CUDA 调用包装为 Result
pub(crate) fn check(code: cudaError_t) -> Result<(), CudaError>;

测试验收

分配 1GB GPU 内存，H2D 拷贝一个大数组，D2H 拷回，验证数据一致
Caching allocator: alloc → free → re-alloc same size，第二次不触发 cudaMalloc（通过内部计数验证）
多 stream 并发拷贝两个数组，验证结果正确
设备查询: 打印 GPU name, total memory, compute capability, SM count
Benchmark: caching allocator vs 裸 cudaMalloc 的分配延迟对比（100 次 alloc/free 循环）

Phase 2: Tensor 抽象层

Crate: xserv-tensor

目标: 实现核心 Tensor 类型，支持 CPU/GPU 存储、多种数据类型、视图操作。

核心数据结构

// --- 数据类型 ---
#[derive(Clone, Copy, PartialEq)]
pub enum DType {
    F32,
    F16,
    BF16,
    // 后期: U8, I8, F8E4M3, F8E5M2
}

impl DType {
    pub fn size_bytes(&self) -> usize; // F32=4, F16=2, BF16=2
}

// --- 设备 ---
#[derive(Clone, Copy, PartialEq)]
pub enum Device {
    Cpu,
    Cuda(usize),  // device ordinal
}

// --- 存储 ---
// 引用计数，支持 view（多个 Tensor 共享同一 Storage）
pub struct Storage(Arc<StorageInner>);

enum StorageInner {
    Cpu { data: Vec<u8> },
    Cuda { buffer: GpuBuffer, device: usize },
}

// --- Tensor ---
pub struct Tensor {
    storage: Storage,
    shape: SmallVec<[usize; 4]>,    // 维度（大多数 tensor <= 4D）
    strides: SmallVec<[usize; 4]>,  // 步长（以元素为单位）
    offset: usize,                   // storage 中的起始偏移（元素数）
    dtype: DType,
    device: Device,
}

关键技术点

Strided Layout:
- 支持 transpose, slice, permute 等操作不拷贝数据，只改 strides/offset
- is_contiguous(): strides 从右到左依次为 1, shape[-1], shape[-1]*shape[-2], ...
- 非 contiguous tensor 在送入 CUDA kernel 前需要 contiguous() 拷贝为连续布局
- 例: [3,4] tensor 的 strides = [4, 1]；transpose 后 shape=[4,3], strides=[1, 4]
BF16/F16 在 Rust 中的表示:
- 使用 half crate 的 bf16 和 f16 类型
- GPU kernel 中使用 __nv_bfloat16 / __half
- Tensor 内部存储为 raw bytes，通过 DType dispatch 解释

设备间拷贝:

impl Tensor {
    pub fn to(&self, device: Device) -> Tensor; // CPU↔GPU 拷贝
    pub fn to_dtype(&self, dtype: DType) -> Tensor; // 类型转换
}

基础操作（此阶段实现）:
- 创建: zeros, ones, from_slice, rand, full, arange
- 形状: reshape, view, transpose, squeeze, unsqueeze, contiguous
- 逐元素 (CPU + GPU kernel): add, mul, sub, div
- 广播 (Broadcasting): NumPy 语义，维度从尾部对齐
- 归约: sum, max, mean（沿指定轴）

Op Dispatch 机制:

// 根据 device 和 dtype dispatch 到不同实现
pub fn add(a: &Tensor, b: &Tensor) -> Tensor {
    match (a.device(), b.device()) {
        (Device::Cpu, Device::Cpu) => cpu_ops::add(a, b),
        (Device::Cuda(_), Device::Cuda(_)) => cuda_ops::add(a, b),
        _ => panic!("device mismatch"),
    }
}

测试验收

创建 tensor, reshape, transpose, slice，验证 shape/strides 计算正确
广播加法: [3,1] + [1,4] → [3,4]，与 numpy 结果对比
CPU ↔ GPU 拷贝往返，数据一致
BF16 tensor 的基础运算精度验证（与 FP32 结果对比 relative error）
View 共享存储: 修改 view 的数据，原 tensor 也应变化
Benchmark: GPU 逐元素 kernel vs CPU 的加速比（大数组）

Phase 3: GEMM — 矩阵乘法

Crate: xserv-kernels CUDA 源码: csrc/gemm/

目标: 实现 GEMM 的多个版本，从 naive 到 tensor core，同时封装 cuBLAS，建立 benchmark 对比框架。

这是 CUDA kernel 编程的第一个"修罗场"，会深刻理解 GPU 编程的核心概念。

实现路线（4 个递进版本）

Version 1: Naive GEMM

每个 thread 计算输出矩阵 C 的一个元素: C[i][j] = sum(A[i][k] * B[k][j])
grid 维度: (M/BLOCK, N/BLOCK), block 维度: (BLOCK, BLOCK)
学到: grid/block 维度规划, global memory access pattern
问题: global memory 访问完全没有局部性，bandwidth 利用率极低
预期性能: ~1-2% cuBLAS

Version 2: Tiled GEMM (shared memory)

将 A, B 分成 TILE×TILE 的小块，加载到 shared memory
每个 thread block 计算 C 的一个 TILE×TILE 输出块
内层循环沿 K 维度滑动 tile
学到: shared memory 使用, __syncthreads(), bank conflict, memory coalescing
关键: A 的 tile 要按行加载（coalesced），B 的 tile 按列访问需要注意 bank conflict
预期性能: ~10-20% cuBLAS

Version 3: Register Tiling + 向量化

每个 thread 计算多个输出元素（如 4×4 或 8×8）
使用寄存器存储中间结果，减少 shared memory 访问
向量化加载: float4 一次读 128 bit
学到: register pressure, ILP (Instruction-Level Parallelism), occupancy vs. ILP tradeoff
预期性能: ~30-50% cuBLAS

Version 4: Tensor Core GEMM (WMMA)

使用 CUDA WMMA API 调用 Tensor Core
BF16 输入, FP32 累加
每次 wmma::mma_sync 计算 16×16×16 矩阵乘
学到: WMMA fragment layout, Tensor Core 编程模型, warp-level 协作
关键: 5090 Blackwell (CC 12.0) 的 Tensor Core 支持 BF16 和 FP8
预期性能: ~60-80% cuBLAS

cuBLAS 封装

// 需要封装的 cuBLAS API
extern "C" {
    fn cublasCreate_v2(handle: *mut cublasHandle_t) -> cublasStatus_t;
    fn cublasSetStream_v2(handle: cublasHandle_t, stream: cudaStream_t) -> cublasStatus_t;
    fn cublasGemmEx(
        handle: cublasHandle_t,
        transa: cublasOperation_t, transb: cublasOperation_t,
        m: i32, n: i32, k: i32,
        alpha: *const c_void,
        A: *const c_void, Atype: cudaDataType, lda: i32,
        B: *const c_void, Btype: cudaDataType, ldb: i32,
        beta: *const c_void,
        C: *mut c_void, Ctype: cudaDataType, ldc: i32,
        computeType: cublasComputeType_t,
        algo: cublasGemmAlgo_t,
    ) -> cublasStatus_t;
}

支持: BF16×BF16→BF16 (compute=FP32), FP16×FP16→FP16, FP32×FP32→FP32

Kernel Registry（运行时可切换 backend）

#[derive(Clone, Copy)]
pub enum GemmBackend {
    Naive,
    Tiled,
    RegisterTiled,
    TensorCore,
    CuBlas,
}

pub fn matmul(a: &Tensor, b: &Tensor, backend: GemmBackend) -> Tensor;

// 全局默认 backend（可配置）
pub fn set_default_gemm_backend(backend: GemmBackend);

测试验收

正确性: 所有 5 个 backend 的输出与 cuBLAS 对比，max absolute error < 1e-3 (BF16)
Benchmark 表格（用 criterion crate）:

Backend	M=N=K=1024	M=N=K=4096	% of cuBLAS
Naive	ms	ms	%
Tiled	ms	ms	%
RegisterTiled	ms	ms	%
TensorCore	ms	ms	%
cuBLAS	ms	ms	100%

Profile: 用 nsys/ncu 分析 naive vs tiled 的 memory throughput 差异
非方阵测试: M=1, N=4096, K=4096 (decode 阶段的典型 shape)

Phase 4: Transformer 核心 Kernels

Crate: xserv-kernels CUDA 源码: csrc/normalization/, csrc/activation/, csrc/embedding/, csrc/reduce/

目标: 实现 Transformer 所需的所有非 Attention 算子，每个都有自定义 CUDA kernel。

Kernel 清单

Kernel	用途	CUDA 文件	核心优化点
LayerNorm	GPT-2	`normalization/layernorm.cu`	Online Welford 算法, warp reduce, 向量化加载
RMSNorm	Qwen3/LLaMA	`normalization/rmsnorm.cu`	比 LayerNorm 简单（无 mean）, rsqrt
GELU	GPT-2 激活	`activation/gelu.cu`	tanh 近似 vs 精确, 向量化
SiLU (Swish)	Qwen3 激活	`activation/silu.cu`	`x * sigmoid(x)`, 逐元素
SwiGLU	Qwen3 FFN	`activation/swiglu.cu`	`SiLU(gate) * up`, fused 逐元素
Embedding	token→vector	`embedding/embedding.cu`	Gather 操作, coalesced access
RoPE	Qwen3 位置编码	`embedding/rope.cu`	复数旋转, precompute freq
Softmax	Attention 内	`reduce/softmax.cu`	Online safe softmax, 数值稳定
Argmax	Greedy sampling	`reduce/argmax.cu`	Parallel reduction
TopK	TopK sampling	`reduce/topk.cu`	Bitonic sort 或 radix select

关键学习主题

Reduction Pattern（核心中的核心）:

LayerNorm, RMSNorm, Softmax 都涉及对某个维度求和/求最大值。GPU reduction 是分层的:

Thread-level:   每个 thread 处理多个元素，本地累加
    ↓
Warp-level:     __shfl_down_sync() 在 warp (32 threads) 内规约
    ↓
Block-level:    shared memory 存各 warp 的结果，再规约
    ↓
Grid-level:     (如果需要) atomic 或两遍 kernel

对于 Norm/Softmax，通常 hidden_dim <= 8192，一个 block 就够，不需要 grid-level reduction。

向量化内存访问:

float4 (128-bit) 一次加载 4 个 float 或 8 个 bf16
__nv_bfloat162 一次处理 2 个 bf16
提升 memory throughput，减少 load/store 指令数

每个 kernel 都实现两个版本:

Custom CUDA kernel（自己写，深入理解）
Reference 实现（简单的 Python/numpy，生成 reference output 用于验证）

测试验收

每个 kernel 的输出与 PyTorch 参考实现对比
- 写一个 tools/generate_reference.py 脚本，为每个 op 生成 reference input/output，保存为 .npy
- Rust 测试中加载 .npy 对比
数值精度: BF16 下 max relative error < 1e-2, FP32 下 < 1e-5
RoPE: 验证旋转后的向量与 HF transformers 的 apply_rotary_pos_emb 结果一致
Softmax: 验证 sum(output, dim=-1) == 1.0，验证 numerical stability（大值输入不 overflow）
Benchmark: 每个 kernel 与 PyTorch 对应操作的延迟对比

Phase 5: Attention Kernel (Naive 版)

Crate: xserv-kernels CUDA 源码: csrc/attention/naive_attention.cu

目标: 实现标准 Multi-Head Attention，不做 Flash/Paged 优化。理解 attention 机制的计算基础。

计算流程

Input: Q [B, H, S, D], K [B, H, S, D], V [B, H, S, D]
       其中 B=batch, H=num_heads, S=seq_len, D=head_dim

1. scores = Q @ K^T / sqrt(D)           → [B, H, S, S]
2. scores = scores + causal_mask         → 上三角置为 -inf
3. weights = softmax(scores, dim=-1)     → [B, H, S, S]
4. output = weights @ V                  → [B, H, S, D]

实现方式

方式一: 组合式（先跑通）

用 Phase 3 的 GEMM (Q@K^T) + Phase 4 的 Softmax + GEMM (weights@V)
简单但 materialize 了 S×S 矩阵，内存 O(S²)

方式二: Fused kernel（理解 Flash Attention 的前置）

一个 kernel 完成整个 attention
仍然 materialize S×S（不做 tiling），但减少 kernel launch 和 global memory 读写次数

Causal Mask

不显式构造 mask 矩阵（浪费内存）
在 softmax 前对 scores[i][j] where j > i 写 -inf（-1e9 for BF16）
编译期条件判断: if (col > row) score = -inf;

GQA 预备

本阶段实现标准 MHA: num_kv_heads == num_heads
Phase 10 扩展为 GQA: num_kv_heads < num_heads
GQA 时 K/V 需要 repeat: 每个 KV head 服务 num_heads / num_kv_heads 个 Q head
实际实现: 不真正 repeat 数据，在 kernel 中用 kv_head_idx = q_head_idx / num_groups 索引

测试验收

随机 Q, K, V，输出与 PyTorch F.scaled_dot_product_attention(is_causal=True) 对比
验证 causal mask: attention weight 矩阵的上三角全为 0
Benchmark 表（记录为 Flash Attention 的 baseline）:

Seq Length	Latency (ms)	GPU Memory (MB)
128
512
2048
4096
8192		OOM?

输出 attention weights 可视化（小规模，验证 causal pattern）

Phase 6: 模型加载

Crate: xserv-model

目标: 从 HuggingFace safetensors 文件加载模型权重到 GPU Tensor。

核心组件

1. safetensors 解析

使用 safetensors crate 读取文件
文件结构: 8 bytes header_size + JSON header + raw tensor data
支持 mmap 零拷贝读取
支持 sharded 文件: model-00001-of-00003.safetensors
- 通过 model.safetensors.index.json 查找 tensor → file 的映射

2. HF Config 解析

#[derive(Deserialize)]
pub struct ModelConfig {
    pub architectures: Vec<String>,
    pub hidden_size: usize,
    pub intermediate_size: usize,
    pub num_attention_heads: usize,
    pub num_key_value_heads: usize,    // GQA: 可能 < num_attention_heads
    pub num_hidden_layers: usize,
    pub vocab_size: usize,
    pub max_position_embeddings: usize,
    pub rms_norm_eps: f64,             // Qwen3 用
    pub rope_theta: f64,               // RoPE base frequency
    pub tie_word_embeddings: bool,
    // ... 其他字段按模型按需添加
}

3. 权重映射

HuggingFace 命名规范 (以 Qwen3 为例):

model.embed_tokens.weight                          → embedding
model.layers.{i}.self_attn.q_proj.weight           → layer[i].attn.q_proj
model.layers.{i}.self_attn.k_proj.weight           → layer[i].attn.k_proj
model.layers.{i}.self_attn.v_proj.weight           → layer[i].attn.v_proj
model.layers.{i}.self_attn.o_proj.weight           → layer[i].attn.o_proj
model.layers.{i}.mlp.gate_proj.weight              → layer[i].mlp.gate
model.layers.{i}.mlp.up_proj.weight                → layer[i].mlp.up
model.layers.{i}.mlp.down_proj.weight              → layer[i].mlp.down
model.layers.{i}.input_layernorm.weight             → layer[i].attn_norm
model.layers.{i}.post_attention_layernorm.weight    → layer[i].ffn_norm
model.norm.weight                                   → final_norm
lm_head.weight                                      → lm_head

4. 加载流程

safetensors file (disk)
  → mmap (host memory, 零拷贝)
  → dtype check/cast (如 FP32 → BF16)
  → H2D copy → GPU Tensor
  → 按 layer 组织成模型结构

外部依赖

safetensors crate
serde + serde_json (解析 config.json)
memmap2 (mmap 支持，safetensors crate 可能内置)

测试验收

加载 GPT-2 124M (openai-community/gpt2)，打印所有 tensor name, shape, dtype
抽查几个 tensor 的前 10 个值，与 PyTorch from_pretrained 对比
加载 Qwen3-8B sharded 权重，验证所有 tensor 都成功加载
性能: 测量 8B 模型权重加载时间 (mmap → GPU 全流程)
错误处理: 缺少 tensor、dtype 不匹配、文件不存在等情况

Phase 7: BPE Tokenizer

Crate: xserv-tokenizer

目标: 从零实现 Byte-Pair Encoding tokenizer，兼容 HuggingFace tokenizer.json 格式。

BPE 算法核心

编码 (encode)

输入: "Hello world"
  → pre-tokenize (regex split): ["Hello", " world"]
  → 每个词转为 byte 序列: [72, 101, 108, 108, 111], [32, 119, 111, ...]
  → 初始 token 序列: 每个 byte 是一个 token
  → 反复合并:
      1. 找当前序列中优先级最高的 byte-pair (从 merges 表查)
      2. 合并该 pair
      3. 重复直到无可合并
  → 输出: token IDs

解码 (decode)

token IDs → 查 vocab 得到 byte 序列 → 拼接 → UTF-8 decode → 文本

需要处理的细节

Pre-tokenization:
- GPT-2 regex: 's|'t|'re|'ve|'m|'ll|'d| ?\p{L}+| ?\p{N}+| ?[^\s\p{L}\p{N}]+|\s+
- Qwen3 可能使用不同的 regex pattern（从 tokenizer.json 的 pre_tokenizer 字段读取）
- 用 regex crate 实现

tokenizer.json 解析:

{
  "model": {
    "type": "BPE",
    "vocab": {"Hello": 0, "world": 1, ...},
    "merges": ["H e", "He l", "Hel lo", ...]
  },
  "added_tokens": [...],
  "pre_tokenizer": {...},
  "post_processor": {...}
}

Special Tokens:
- <|endoftext|> (GPT-2 EOS)
- <|im_start|>, <|im_end|> (Qwen3 ChatML)
- <|endoftext|> (Qwen3 EOS)
- Special tokens 不参与 BPE merge，直接映射到 ID

Chat Template (Qwen3 格式):

<|im_start|>system
You are a helpful assistant.<|im_end|>
<|im_start|>user
Hello<|im_end|>
<|im_start|>assistant

性能优化:
- Merge rules 用 HashMap<(TokenId, TokenId), MergePriority> 预索引
- 对于长文本，考虑 priority queue 加速 pair 查找

测试验收

加载 GPT-2 tokenizer，encode + decode 一批测试文本，与 Python AutoTokenizer 逐 token 对比
加载 Qwen3 tokenizer，同样逐 token 对比
边界情况: 空字符串、纯 emoji (🎉🔥)、中英混合、超长文本 (1MB)
Chat template: 给定 messages 列表，生成与 HF apply_chat_template 一致的 token 序列
Benchmark: encode 1MB 文本的延迟

Phase 8: GPT-2 完整推理 — 里程碑 ①

Crate: xserv-model

目标: 将所有组件串联，实现 GPT-2 的完整推理 pipeline。这是第一次看到模型"说话"。

模型结构

pub struct GPT2 {
    config: GPT2Config,
    wte: Tensor,           // token embedding [vocab_size, hidden_size]
    wpe: Tensor,           // position embedding [max_seq_len, hidden_size]
    layers: Vec<GPT2Block>,
    ln_f: LayerNorm,       // final layer norm
    // lm_head 与 wte 共享权重 (tied embeddings)
}

pub struct GPT2Block {
    ln_1: LayerNorm,
    attn: GPT2Attention,   // MHA: q_proj, k_proj, v_proj, o_proj
    ln_2: LayerNorm,
    mlp: GPT2MLP,          // fc1 (4H) → GELU → fc2 (H)
}

Forward Pass 流程

tokens [B, S]
  → wte[tokens] + wpe[0..S]              → hidden [B, S, 768]
  → for each layer:
      → residual = hidden
      → hidden = ln_1(hidden)
      → hidden = attention(hidden)         # Q, K, V 从 hidden 线性变换
      → hidden = hidden + residual         # residual connection
      → residual = hidden
      → hidden = ln_2(hidden)
      → hidden = mlp(hidden)               # Linear→GELU→Linear
      → hidden = hidden + residual
  → hidden = ln_f(hidden)
  → logits = hidden @ wte.T               → [B, S, vocab_size]
  → next_token = sample(logits[:, -1, :]) # 只取最后一个 position

Sampling 策略

pub struct SamplingParams {
    pub temperature: f32,      // default 1.0
    pub top_k: usize,         // default 50
    pub top_p: f32,            // default 1.0 (disabled)
    pub max_tokens: usize,     // default 256
    pub repetition_penalty: f32, // default 1.0 (disabled)
}

实现:

Greedy: argmax(logits)
Temperature: logits = logits / temperature → softmax → sample
Top-K: 保留 top-k logits，其余置为 -inf → softmax → sample
Top-P (Nucleus): 按概率降序排列，累加到概率 >= p → 截断 → 重新 normalize → sample
以上可以组合: temperature → top-k → top-p → sample

CLI 交互

$ cargo run --release --bin xserv-cli -- --model openai-community/gpt2

xserv> The future of AI is
GPT-2> The future of AI is not just about the technology, but about the people
       who are building it. The question is whether we can...

xserv> Once upon a time
GPT-2> Once upon a time, there was a young man who lived in a small village...

测试验收

加载 openai-community/gpt2，prefill "The future of AI is"
Prefill logits 与 PyTorch 对比: top-5 token IDs 和对应 logit 值一致
Greedy decode 50 tokens，结果应该是连贯英文
Temperature/TopK/TopP sampling: 生成多次结果应有变化
CLI 交互模式可用

Phase 9: KV Cache + Autoregressive 优化

Crate: xserv-runtime

目标: 实现 KV Cache，将 decode 从 O(S²) 降到 O(S) per step。

核心概念: Prefill vs Decode

Prefill（首次处理 prompt）:

输入: 完整 prompt [B, S, D]
计算: 所有 token 的 Q, K, V
Attention: Q[B,H,S,D] @ K[B,H,S,D]^T → 完整 S×S 矩阵
输出: 缓存 K, V 到 KV cache
特点: Compute-bound（大矩阵乘法）

Decode（逐 token 生成）:

输入: 上一步生成的 1 个 token [B, 1, D]
计算: 只计算新 token 的 Q, K, V
K_new, V_new append 到 cache
Attention: Q[B,H,1,D] @ K_cache[B,H,S+1,D]^T → [B,H,1,S+1]
特点: Memory-bound（Q 只有 1 行，瓶颈在读 K/V cache）

KV Cache 设计（简单版，非 paged）

pub struct KVCache {
    // 每层一对 K/V tensor
    // shape: [batch_size, num_kv_heads, max_seq_len, head_dim]
    k_caches: Vec<Tensor>,   // 索引 = layer_idx
    v_caches: Vec<Tensor>,
    seq_len: usize,          // 当前已填充的长度
    max_seq_len: usize,      // 预分配的最大长度
}

impl KVCache {
    // prefill 时: 写入 [0..prompt_len] 的 K, V
    pub fn fill(&mut self, layer: usize, k: &Tensor, v: &Tensor);

    // decode 时: 在 seq_len 位置写入新的 K, V，返回完整 cache
    pub fn append(&mut self, layer: usize, k: &Tensor, v: &Tensor) -> (&Tensor, &Tensor);
}

Decode Attention Kernel

与 prefill attention 不同，decode 时 Q 只有 1 行:

Q [B, H, 1, D] × K_cache [B, H, S, D]^T → scores [B, H, 1, S]
→ softmax → weights [B, H, 1, S]
weights × V_cache [B, H, S, D] → output [B, H, 1, D]

优化方向:

每个 warp 处理一个 head
沿 S 维度做 parallel reduction (dot product + online softmax)
重点优化 memory bandwidth（K/V cache 的读取是瓶颈）

测试验收

对比有/无 KV cache 的生成结果 → 必须完全一致（bit-exact for greedy）
Benchmark decode 延迟:

Seq Length	Without Cache (ms/token)	With Cache (ms/token)	Speedup
128
512
2048

显存占用: KV cache 的实际显存与理论值 (2 * num_layers * num_kv_heads * seq_len * head_dim * sizeof(bf16)) 对比
GPT-2 decode throughput (tokens/s) 记录为 baseline

Phase 10: Qwen3-8B 支持 — 里程碑 ②

Crate: xserv-model

目标: 扩展模型定义以支持 Qwen3-8B，验证输出正确性。

架构对比

特性	GPT-2 (124M)	Qwen3-8B
Normalization	LayerNorm (pre-LN)	RMSNorm (pre-LN)
Position Encoding	Learned absolute (wpe)	RoPE (无单独参数)
Attention	MHA (12 heads, 12 KV heads)	GQA (如 32 Q heads, 8 KV heads)
Activation	GELU	SwiGLU (SiLU gate)
FFN	Linear(H→4H) → GELU → Linear(4H→H)	gate_proj + up_proj → SiLU gate → down_proj
Vocab Size	50,257	~152,000
Hidden Size	768	4,096 (8B)
Layers	12	36
Tied Embeddings	Yes	No

需要新增/修改的组件

1. GQA (Grouped Query Attention)

num_heads = 32, num_kv_heads = 8
每个 KV head 服务 32/8 = 4 个 Q head

Q: [B, 32, S, 128]
K: [B, 8, S, 128]   ← 只有 8 个 KV heads
V: [B, 8, S, 128]

Attention 时:
  kv_head_idx = q_head_idx / (num_heads / num_kv_heads)
  不需要真正 repeat K/V 数据，kernel 中做索引映射

2. RoPE (Rotary Position Embedding)

对 Q, K 的每对相邻元素做旋转:
  Q' = Q * cos(θ) + rotate_half(Q) * sin(θ)
  K' = K * cos(θ) + rotate_half(K) * sin(θ)
其中 θ_i = pos / (rope_theta^(2i/d))

预计算: freqs = 1.0 / (rope_theta^(2i/d)) for i in 0..d/2
运行时: cos_cache[pos][i] = cos(pos * freqs[i])

3. SwiGLU FFN

x → gate_proj(x) → SiLU → ⊙ up_proj(x) → down_proj → output

三个 Linear:
  gate_proj: [hidden_size, intermediate_size]
  up_proj:   [hidden_size, intermediate_size]
  down_proj: [intermediate_size, hidden_size]

模型结构

pub struct Qwen3Model {
    config: Qwen3Config,
    embed_tokens: Tensor,           // [vocab_size, hidden_size]
    layers: Vec<Qwen3DecoderLayer>,
    norm: RMSNorm,                  // final RMSNorm
    lm_head: Tensor,               // [vocab_size, hidden_size] (not tied)
}

pub struct Qwen3DecoderLayer {
    input_layernorm: RMSNorm,
    self_attn: Qwen3Attention,     // GQA with RoPE
    post_attention_layernorm: RMSNorm,
    mlp: Qwen3MLP,                 // SwiGLU
}

显存预算 (BF16, 单卡 5090 32GB)

模型权重:  8B × 2B = ~16 GB
KV cache:  36 layers × 2(KV) × 8 heads × 4096 tokens × 128 dim × 2B ≈ 5.6 GB
Activation (单请求): ~1 GB
────────────────────────
总计: ~22.6 GB (单请求)，剩余 ~10 GB 可用于更多并发

测试验收

加载 Qwen3-8B 权重到单张 5090，打印模型结构和参数量
Prefill logits 与 HF transformers 对比: 输入 "你好" → top-5 logits 一致
英文生成: "What is the capital of France?" → 生成合理回答
中文生成: "请介绍一下量子计算" → 生成通顺中文

多轮对话:

<|im_start|>user\nHello<|im_end|>\n<|im_start|>assistant\n

验证 chat template 格式正确

单请求性能 baseline: prefill latency (ms), decode throughput (tokens/s)

Phase 11: Paged Attention + KV Cache Manager

Crate: xserv-runtime CUDA 源码: csrc/attention/paged_attention.cu

目标: 实现 vLLM 的核心创新 — PagedAttention，解决 KV cache 内存碎片化问题。

问题

Phase 9 的简单 KV cache 为每个请求预分配 max_seq_len 的连续内存:

请求只用了 100 tokens 但占了 4096 tokens 的空间 → 内存利用率低
连续分配导致碎片化 → 并发请求数受限

PagedAttention 设计

核心思想: 像操作系统的虚拟内存一样，将 KV cache 分成固定大小的 page (block)。

pub const BLOCK_SIZE: usize = 16;  // 每个 block 存 16 个 token 的 KV

// 物理 KV cache: 预分配的大块 GPU 内存
// k_cache shape: [num_physical_blocks, num_kv_heads, block_size, head_dim]
// v_cache shape: [num_physical_blocks, num_kv_heads, block_size, head_dim]

pub struct BlockAllocator {
    free_blocks: Vec<usize>,          // 空闲物理 block ID 列表
    num_total_blocks: usize,
    ref_counts: Vec<usize>,           // 每个 block 的引用计数 (CoW 用)
}

pub struct BlockTable {
    // logical_block_idx → physical_block_idx
    // 例: 一个 seq_len=50 的请求有 4 个 block (50/16=3.125, 向上取整)
    blocks: Vec<usize>,
}

pub struct PagedKVCacheManager {
    k_cache: Tensor,                  // 所有物理 blocks
    v_cache: Tensor,
    allocator: BlockAllocator,
    block_tables: HashMap<SeqId, BlockTable>,
}

Paged Attention Kernel

与普通 attention 的区别: K/V 不是连续存储，需要通过 block table 间接寻址。

输入:
  Q:           [num_seqs, num_heads, head_dim]        (decode 时每个 seq 只有 1 个 query)
  k_cache:     [num_blocks, num_kv_heads, block_size, head_dim]  (物理存储)
  v_cache:     [num_blocks, num_kv_heads, block_size, head_dim]
  block_tables: [num_seqs, max_num_blocks]             (间接寻址表)
  seq_lens:    [num_seqs]                              (每个 seq 的实际长度)

每个 thread block 处理:
  1 个 seq 的 1 个 attention head
  遍历该 seq 的所有 logical blocks
  对每个 block: 查 block_table 得到 physical_block_id → 读取 K/V
  online softmax 累加
  输出: [num_seqs, num_heads, head_dim]

Copy-on-Write (高级，可选)

多个 sequence 共享相同 prefix 的 KV blocks（beam search, prompt caching）
写入时: 如果 ref_count > 1，先复制该 block 再修改
这阶段先不实现，标记为后续优化

测试验收

正确性: paged attention 输出与 Phase 9 简单 KV cache 完全一致
内存效率对比:

场景	Naive KV Cache	Paged KV Cache
1 req, seq=100	分配 4096 tokens	分配 7 blocks (112 tokens)
10 req, seq=100-500	10×4096	按需分配
最大并发数 (32GB)	N 个	M 个 (M >> N)

Block allocator: alloc/free 循环，无内存泄漏
Benchmark: paged attention kernel vs naive decode attention 延迟对比

Phase 12: Continuous Batching + Request Scheduler

Crate: xserv-runtime

目标: 实现 iteration-level 调度，支持请求的动态加入和退出。

Static Batching vs Continuous Batching

Static (朴素):

Batch 1: [req1, req2, req3] → 等 req1, req2, req3 全部完成
Batch 2: [req4, req5, req6] → ...
问题: req1 完成了但 req3 还在生成 → GPU 空转

Continuous (Orca 论文):

Iteration 1: [req1, req2, req3] → req1 完成!
Iteration 2: [req2, req3, req4] → req4 动态加入
Iteration 3: [req2, req3, req4] → req3 完成!
Iteration 4: [req2, req4, req5] → req5 动态加入

核心数据结构

#[derive(Clone, Copy, PartialEq)]
pub enum SequenceStatus {
    Waiting,      // 在等待队列中
    Prefilling,   // 正在做 prefill
    Decoding,     // 正在做 decode
    Finished,     // 已完成 (EOS / max_len / stop string)
    Preempted,    // 被抢占（显存不够，KV cache 被换出）
}

pub struct Sequence {
    pub id: SeqId,
    pub prompt_tokens: Vec<u32>,
    pub generated_tokens: Vec<u32>,
    pub status: SequenceStatus,
    pub sampling_params: SamplingParams,
    pub block_table: BlockTable,
    pub arrival_time: Instant,
    // 用于 streaming 输出
    pub output_sender: tokio::sync::mpsc::Sender<GeneratedToken>,
}

pub struct Scheduler {
    waiting: VecDeque<Sequence>,
    running: Vec<Sequence>,
    max_num_seqs: usize,           // 最大并发 batch size
    max_num_tokens: usize,         // 单次 iteration 最大 token 数
    block_manager: PagedKVCacheManager,
}

调度循环 (Engine 主循环)

loop {
    // Step 1: 回收已完成的 sequence
    //   - 释放其 KV cache blocks
    //   - 从 running 移除

    // Step 2: 检查能否 admit 新请求
    //   - 条件: running.len() < max_num_seqs
    //           && 有足够的 free blocks 给新请求的 prompt
    //   - FCFS 从 waiting 取

    // Step 3: 划分 prefill / decode
    //   - 新加入的 sequence: prefill (处理完整 prompt)
    //   - 已在 running 的: decode (生成 1 个 token)

    // Step 4: 组装 batch input
    //   - Prefill: 各 seq 的 prompt tokens, 需要 padding 或 ragged batch
    //   - Decode: 各 seq 的最后一个 token

    // Step 5: Forward pass
    //   - prefill 和 decode 可以混合在一个 forward 中
    //   - 或者分开处理 (先 prefill, 再 decode)

    // Step 6: Sampling
    //   - 对每个 seq 的 logits 进行采样

    // Step 7: 更新状态
    //   - 将新 token append 到 sequence
    //   - 检查是否完成 (EOS / max_len)
    //   - 通过 channel 发送新 token 给 API 层
}

Preemption（显存不足时的抢占）

当显存不足以 admit 新请求时:

Swap: 将低优先级 seq 的 KV cache 从 GPU 换到 CPU（复杂，后续再做）
Recompute: 丢弃低优先级 seq 的 KV cache，后续重新 prefill（简单，先实现这个）

测试验收

模拟 10 个请求在不同时间到达，所有请求都得到正确的生成结果
短请求完成后，新请求立即加入 batch（观察 log）
Throughput 对比:

方式	20 请求总耗时	Token/s
串行 (batch=1)
Static batch=4
Continuous batch

压力测试: 100 个并发请求，全部正确完成，无 hang/crash

Phase 13: HTTP API + SSE Streaming — 里程碑 ③

Crate: xserv-engine, xserv-api

目标: 提供 OpenAI 兼容的 HTTP API，支持 SSE streaming。第一个端到端可用的里程碑。

技术栈

HTTP: axum (Rust async web framework)
Async: tokio
JSON: serde_json
SSE: axum 内置 SSE 支持 (axum::response::sse)

API 端点

POST /v1/chat/completions    # 主要端点 (ChatML 格式)
POST /v1/completions         # 纯文本补全
GET  /v1/models              # 列出可用模型
GET  /health                 # 健康检查

请求/响应格式 (OpenAI 兼容)

Chat Completion Request:

{
  "model": "qwen3-8b",
  "messages": [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "What is 1+1?"}
  ],
  "stream": true,
  "temperature": 0.7,
  "top_p": 0.9,
  "max_tokens": 256,
  "stop": ["\n\n"]
}

SSE Streaming Response:

data: {"id":"chatcmpl-xxx","object":"chat.completion.chunk","created":1234567890,"model":"qwen3-8b","choices":[{"index":0,"delta":{"role":"assistant","content":""},"finish_reason":null}]}

data: {"id":"chatcmpl-xxx","object":"chat.completion.chunk","created":1234567890,"model":"qwen3-8b","choices":[{"index":0,"delta":{"content":"The"},"finish_reason":null}]}

data: {"id":"chatcmpl-xxx","object":"chat.completion.chunk","created":1234567890,"model":"qwen3-8b","choices":[{"index":0,"delta":{"content":" answer"},"finish_reason":null}]}

data: {"id":"chatcmpl-xxx","object":"chat.completion.chunk","created":1234567890,"model":"qwen3-8b","choices":[{"index":0,"delta":{},"finish_reason":"stop"}]}

data: [DONE]

Non-streaming Response:

{
  "id": "chatcmpl-xxx",
  "object": "chat.completion",
  "created": 1234567890,
  "model": "qwen3-8b",
  "choices": [{
    "index": 0,
    "message": {"role": "assistant", "content": "The answer is 2."},
    "finish_reason": "stop"
  }],
  "usage": {
    "prompt_tokens": 25,
    "completion_tokens": 8,
    "total_tokens": 33
  }
}

架构分层

Client (curl / Python OpenAI SDK)
    │
    ▼
┌─────────────────────────────────────┐
│  xserv-api (axum HTTP server)       │
│  - 解析请求, 验证参数               │
│  - apply chat template              │
│  - 将请求提交给 engine              │
│  - 从 channel 接收 token, 编码为 SSE│
└────────────┬────────────────────────┘
             │ InferenceRequest (通过 channel)
             ▼
┌─────────────────────────────────────┐
│  xserv-engine (推理引擎)            │
│  - 独立的 OS thread (非 async)      │
│  - 运行 scheduler 调度循环          │
│  - 管理 model + KV cache            │
│  - 每生成一个 token, 通过 channel   │
│    发送给 API 层                     │
└─────────────────────────────────────┘

关键设计决策:

Engine 跑在独立 OS thread（避免 GPU 同步操作 block tokio runtime）
API ↔ Engine 通过 tokio::sync::mpsc channel 通信
每个请求有独立的 mpsc::Sender/Receiver 用于 token streaming

测试验收

curl 测试:

curl http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model":"qwen3-8b","messages":[{"role":"user","content":"Hello"}],"stream":true}'

看到 SSE 逐 token 输出

Python OpenAI SDK 测试:

from openai import OpenAI
client = OpenAI(base_url="http://localhost:8080/v1", api_key="unused")
for chunk in client.chat.completions.create(
    model="qwen3-8b",
    messages=[{"role": "user", "content": "What is 1+1?"}],
    stream=True
):
    print(chunk.choices[0].delta.content or "", end="", flush=True)

非 streaming 模式也能正常工作
并发 10 个请求，全部正确完成
多轮对话: 连续发两轮消息（第二轮包含 history），验证上下文连贯
/v1/models 返回已加载的模型列表
错误处理: 无效参数返回 400, 模型不存在返回 404

Phase 14: Flash Attention (FA2 for SM120)

Crate: xserv-kernels CUDA 源码: csrc/attention/flash_attention.cu

目标: 实现 Flash Attention 的 CUDA kernel，大幅降低 attention 的显存占用并提升速度。

硬件适配说明

Flash Attention 已发展到第 4 代 (FA4, arxiv 2603.05451)，但各版本有明确的硬件依赖:

版本	目标架构	关键硬件特性	RTX 5090 兼容
FA2	通用 CUDA (SM75+)	标准 shared memory + HMMA	是 ✅
FA3	Hopper SM90 (H100)	TMA + WGMMA + warp specialization	否
FA4	Blackwell SM100 (B200/B300)	TMEM + async MMA + 2-CTA mode	否

RTX 5090 (SM120, CC 12.0) 使用的是消费级 Blackwell 架构 (GB202)，与数据中心 Blackwell (B200, SM100) 是不同的硅片设计。SM120 物理上没有 TMEM (Tensor Memory) 子系统，因此 FA4 的 kernel 无法在 5090 上运行。这不是软件限制，是硬件级差异。

因此本项目实现 FA2 算法，使用标准 CUDA (shared memory + HMMA)。FA2 的核心优化——online softmax tiling、O(1) 显存占用——在任何架构上都有效。

核心思想

标准 attention 的问题:

S = Q @ K^T         ← 需要 O(S²) 显存存储 S×S 矩阵
P = softmax(S)      ← 需要完整的 S×S 才能做 softmax
O = P @ V

Flash Attention 的解法:

不 materialize S×S 矩阵
将 Q, K, V 分成 tiles，在 SRAM (shared memory) 中计算
使用 online softmax trick: 边算边更新 running max 和 running sum

算法 (Forward Pass, FA2)

FA2 相比 FA1 的改进: 外层循环遍历 Q tiles (而非 K/V)，减少 HBM 读写次数。

Br, Bc = tile sizes for Q and K/V respectively

for each Q tile (q_start..q_start+Br):                    ← 外层: Q tiles
    load Q_tile [Br, D] to shared memory
    initialize: O_tile = 0, l = 0, m = -inf               // running sum and max

    for each K,V tile (kv_start..kv_start+Bc):             ← 内层: K/V tiles
        load K_tile [Bc, D], V_tile [Bc, D] to shared memory

        // Compute attention scores for this tile pair
        S_tile = Q_tile @ K_tile^T              // [Br, Bc], in registers/SRAM

        // Apply causal mask (skip if kv_start > q_start + Br)
        if causal: mask upper triangle of S_tile

        // Online softmax update
        m_new = max(m, rowmax(S_tile))          // new running max
        P_tile = exp(S_tile - m_new)            // safe exp
        l_new = exp(m - m_new) * l + rowsum(P_tile)  // update running sum

        // Rescale and accumulate output
        O_tile = diag(exp(m - m_new)) * O_tile + P_tile @ V_tile
        m = m_new
        l = l_new

    O_tile = O_tile / l                         // final normalization
    write O_tile [Br, D] to global memory (HBM)

实现要点

Tile 大小选择:
- 5090 SM120: shared memory per SM = 100 KB (需实测确认)
- 需同时存 Q_tile, K_tile, V_tile, S_tile
- BF16: Q_tile [Br, D] = Br × 128 × 2B; K_tile [Bc, D] = Bc × 128 × 2B
- S_tile [Br, Bc] 保持 FP32 = Br × Bc × 4B
- 推荐起步: Br=Bc=64, head_dim=128 → 共需 ~100KB shared memory
- 优化版: Br=Bc=128 需要更多 shared memory, 可能需要拆分
Causal mask 优化:
- 如果 K/V tile 完全在 Q tile 的"未来"（kv_start > q_end）→ 跳过整个 tile
- 减少约 50% 的计算量
BF16 精度:
- S_tile, P_tile 的计算在 FP32 中进行（累加精度）
- Q, K, V 的加载用 BF16（节省 bandwidth）
- 最终 O 转回 BF16 写出
GQA 支持:
- K/V heads 数量 < Q heads 时，kernel 中做 kv_head = q_head / num_groups 索引
- 不需要 repeat_kv 操作，直接在 kernel 内部解决
Decode attention 特化:
- Decode 时 Q 只有 1 行 (Br=1)，退化为 vector-matrix attention
- 可以写一个专门的 decode attention kernel (类似 FlashDecoding)
- 沿 KV sequence 维度做 parallel reduction

测试验收

正确性: 与 Phase 5 naive attention 对比, max error < 1e-2 (BF16)
显存: Flash Attention 不随 S 平方增长

Seq Length	Naive VRAM	Flash VRAM	Naive Time	Flash Time
512	MB	MB	ms	ms
2048	MB	MB	ms	ms
8192	OOM?	MB	OOM?	ms
32768	OOM	MB	OOM	ms

集成到 Qwen3-8B，端到端 decode latency 对比
Profile: ncu 分析 compute utilization, memory throughput
GQA 支持: 无 repeat_kv 开销

Phase 15: 性能优化 — 里程碑 ④

目标: 系统性 profiling + 优化，向 50% vLLM throughput 目标冲刺。

优化方向

1. Kernel Fusion

减少 memory-bound kernel 之间的 global memory 读写:

Residual Add + RMSNorm: 一次读 hidden + residual，写出 normed output
SiLU + Elementwise Mul (SwiGLU 内部): 一次读 gate + up, 写出 SiLU(gate)*up
Bias Add + Activation: Linear 的 bias + 激活函数合并

原则: 两个逐元素 kernel 之间如果有 global memory 读写，就值得融合。

2. CUDA Graphs

Decode 阶段每步的 kernel 序列是固定的（shape 不变）
用 cudaStreamBeginCapture / cudaStreamEndCapture 捕获一次
后续用 cudaGraphLaunch 重放（消除 kernel launch overhead）
注意: batch size 变化时需要重新 capture

3. Memory 优化

权重预加载: 启动时加载到 GPU，推理路径上零分配
Activation reuse: 同一层的中间结果用完立即释放/复用
Pinned memory: H2D/D2H 用 cudaMallocHost（pinned）提升拷贝带宽

4. Compute 优化

确保所有 GEMM 走 Tensor Core (BF16)
Decode attention: 优化 memory bandwidth 利用率
Prefill: chunked processing（控制峰值显存，允许更大 batch）

5. Scheduling 优化

Prefill-Decode disaggregation: prefill 和 decode 分开 batch
- 原因: prefill 是 compute-bound, decode 是 memory-bound, 混合导致两边都不优
Dynamic batch size: 根据当前 running seqs 的 seq_len 动态调整

Profiling 工具使用

# 整体 timeline (哪个 kernel 最耗时)
nsys profile --stats=true ./target/release/xserv-server

# 单个 kernel 分析 (occupancy, memory throughput)
ncu --target-processes all --set full ./target/release/xserv-server

# 自定义 Rust 计时
# 在 engine 循环中记录每个 phase 的耗时

测试验收

安装 vLLM，同一台机器跑 Qwen3-8B
Benchmark 对比:

Metric	vLLM	xserv	Ratio
Prefill latency (ms, 128 tokens)
Decode throughput (tokens/s, batch=1)
Decode throughput (tokens/s, batch=16)
Max concurrent requests (32GB)

目标: xserv throughput >= 50% vLLM
Profiling 报告: 每个组件的耗时占比 pie chart
无功能回归: 所有之前的集成测试通过

Phase 16: Speculative Decoding

Crate: xserv-speculative

目标: 用小模型（draft model）加速大模型（target model）的 decode 阶段。

算法

γ = 4 (speculative tokens 数量)

1. Draft model 自回归生成 γ 个 token: t1, t2, t3, t4
2. Target model 一次 forward 处理这 γ+1 个 position
   (等价于一次 prefill: [last_accepted_token, t1, t2, t3, t4])
   得到 γ+1 个 logits
3. Rejection sampling:
   for i in 0..γ:
     p = target_prob[i][t_{i+1}]   // target model 给 draft token 的概率
     q = draft_prob[i][t_{i+1}]    // draft model 给该 token 的概率
     if random() < min(1, p/q):
       accept t_{i+1}
     else:
       reject, resample from adjusted distribution
       break
4. 至少接受 1 个 token, 期望接受 ~γ×acceptance_rate 个

关键点

无损: rejection sampling 保证输出分布与纯 target model 一致
加速条件: draft model 足够快且与 target 分布接近
Draft model 选择: Qwen3-0.5B / Qwen3-1.5B 作为 Qwen3-8B 的 draft

KV Cache 处理

Draft model 有自己的 KV cache
Target model 验证时，accepted tokens 的 KV 可以复用（不用重算）
Rejected 位置之后的 KV 需要丢弃

测试验收

验证: speculative decode 100 条不同 prompt，输出分布与标准 decode 无统计差异
Acceptance rate 统计 (期望 60-80% per token)
端到端加速:

Method	Tokens/s (batch=1)	Speedup
Standard decode		1.0x
Speculative (γ=4)		~2-3x

Phase 17: Tensor Parallelism (TP=1/2/4) — 里程碑 ⑤

Crate: xserv-distributed

目标: 实现 Tensor Parallelism，支持 TP=2 和 TP=4（同组 GPU 内），跑通多卡推理。PP 及更大规模并行留待后续扩展。

通信后端: NCCL

NVIDIA Collective Communication Library，提供高效的 multi-GPU 通信原语。

需要封装的操作:

// NCCL FFI
ncclAllReduce(sendbuff, recvbuff, count, datatype, op, comm, stream)
ncclAllGather(sendbuff, recvbuff, count, datatype, comm, stream)
ncclBroadcast(sendbuff, recvbuff, count, datatype, root, comm, stream)

Tensor Parallelism 策略 (Megatron-LM 风格)

每层只需要 2 次 AllReduce:

Attention 部分

Column Parallel: Q/K/V proj 按 head 维度切分
  GPU 0: q_proj[:, :hidden/TP], k_proj[:, :hidden/TP], v_proj[:, :hidden/TP]
  GPU 1: q_proj[:, hidden/TP:], k_proj[:, hidden/TP:], v_proj[:, hidden/TP:]
  → 每卡计算自己那部分 heads 的 attention → 无需通信

Row Parallel: o_proj 按行切分
  每卡计算部分输出 → AllReduce 求和 → 得到完整 output

FFN (SwiGLU) 部分

Column Parallel: gate_proj, up_proj 按列切分
  每卡计算部分 intermediate features → 无需通信

Row Parallel: down_proj 按行切分
  每卡计算部分输出 → AllReduce 求和 → 得到完整 output

其他

Embedding: vocab 按行切分，AllGather 拼接
RMSNorm: 每卡独立计算（输入是 AllReduce 后的完整 tensor）
lm_head: 按列切分，AllGather 拼接 logits

权重分片

启动时:

Rank 0 加载完整权重
按 TP 策略切分
AllGather 或 Scatter 分发到各卡

或者:

每个 rank 独立加载，只读取属于自己的那部分（更高效）

互联拓扑 (已确认)

纯 PCIe Gen5 x16, 无 NVLink。GPU 分两组: 0-3 (PHB) 和 4-7 (PHB)，跨组走 NODE。

TP 部署策略 (当前阶段目标: TP=1/2/4):

TP=2: 同组内任意两卡 (如 GPU0+GPU1)，PCIe PHB 延迟最低
TP=4: 同组 4 卡 (GPU 0-3 或 GPU 4-7)，全 PHB 互联
PCIe Gen5 x16 带宽: 理论 ~64 GB/s 单向，实测 AllReduce 有效带宽约 40-50 GB/s
后续扩展: TP=8 (跨组) 和 Pipeline Parallelism 留待后续阶段

测试验收

TP=2: Qwen3-8B 输出与单卡 (TP=1) 完全一致
TP=4: 每卡权重显存占用约 1/4
Scaling benchmark (同组 GPU 0-3):

TP Size	Prefill (tokens/s)	Decode (tokens/s)	Scaling Efficiency
1			100%
2			%
4			%

AllReduce latency 测量 (不同消息大小，同组 PHB 互联)

Phase 18: 量化 (FP8 / INT8)

Crate: xserv-kernels (新增量化 kernel), xserv-model (量化加载) CUDA 源码: csrc/quantize/

目标: 降低模型显存占用和提升计算吞吐。

量化方式

1. Weight-Only INT8

只量化权重，activation 保持 BF16
Per-channel scale: 每个输出 channel 一个 scale 和 zero_point
GEMM: INT8 × BF16 (dequantize on-the-fly)
适用于 memory-bound 场景（decode）

2. FP8 (E4M3 / E5M2)

5090 Blackwell (CC 12.0) 原生支持 FP8 Tensor Core
权重和 activation 都量化为 FP8
Dynamic scaling: 每个 tensor 运行时计算 amax，确定 scale factor
GEMM: FP8 × FP8 → BF16/FP32 accumulate
适用于 compute-bound 场景（prefill）

3. GPTQ / AWQ (高级, 可选)

INT4 weight quantization
需要 calibration data
更复杂但压缩率更高

FP8 实现要点

pub enum DType {
    // ... existing
    F8E4M3,    // 1 sign + 4 exponent + 3 mantissa (范围小，精度高)
    F8E5M2,    // 1 sign + 5 exponent + 2 mantissa (范围大，精度低)
}

Dynamic scaling:

scale = amax(tensor) / fp8_max   // fp8_max = 448.0 for E4M3
tensor_fp8 = cast_to_fp8(tensor / scale)
// GEMM 后: output = output * scale_A * scale_B

测试验收

精度:

Quantization	Perplexity (WikiText-2)	vs BF16
BF16 (baseline)	X.XX	—
FP8 E4M3	X.XX	+0.XX
INT8 weight-only	X.XX	+0.XX

显存: FP8 权重占用约 BF16 的一半 (~8 GB for 8B model)
性能: FP8 GEMM throughput vs BF16 GEMM

Phase 19: Multimodal — 里程碑 ⑥

Crate: xserv-model (新增 vision encoder)

目标: 支持 vision-language 模型，接受图片+文本输入。

目标模型

Qwen-VL 系列（或类似 architecture）。典型结构:

Image → ViT Encoder → Visual Tokens → Projector → LLM Input
Text  → Tokenizer   → Text Tokens   ────────────→ LLM Input

需要新增的组件

1. ViT (Vision Transformer) Encoder

Patch Embedding: 将图片切成 14×14 patches, 每个 patch 线性投影
- 输入: [B, 3, H, W] → 输出: [B, num_patches, hidden_dim]
ViT Blocks: 标准 Transformer Encoder blocks
- Multi-Head Self-Attention (无 causal mask)
- FFN
- LayerNorm
参数量: 通常 300M-600M

2. Visual-Language Projector

MLP: 将 ViT 输出维度映射到 LLM embedding 维度
可能包含 pooling / resampling (减少 visual token 数量)

3. 图片预处理

Resize to target resolution (e.g., 448×448)
Normalize: ImageNet mean/std
使用 image crate 在 CPU 上完成

4. 输入拼接

[<image_token>] × num_visual_tokens + [text tokens]

在 embedding 层面拼接
LLM 处理混合 sequence

API 扩展

{
  "model": "qwen-vl",
  "messages": [{
    "role": "user",
    "content": [
      {"type": "image_url", "image_url": {"url": "data:image/png;base64,..."}},
      {"type": "text", "text": "What is in this image?"}
    ]
  }]
}

外部依赖

image crate (图片加载和预处理)
base64 crate (解码 base64 图片)

测试验收

加载 Qwen-VL 模型，输入一张猫的图片 + "What is in this image?"
生成合理的图片描述
与 HF transformers 输出对比
API 端到端: HTTP POST 含 base64 图片 → streaming 文字回答
纯文本请求不受影响（向后兼容）

里程碑总结

里程碑	Phase	验收标准
① GPT-2 推理	8	CLI 输入 prompt, GPT-2 生成连贯文本, logits 与 PyTorch 一致
② Qwen3-8B 推理	10	8B 模型中英文对话, 多轮 chat template 正确
③ E2E API	13	HTTP streaming API, Python OpenAI SDK 可调用, 10 并发正确
④ 性能达标	15	throughput >= 50% vLLM, profiling 报告完成
⑤ 多卡推理	17	TP=2/4 同组 GPU 推理正确, scaling benchmark 完成
⑥ 多模态	19	图片输入 → 文字回答, API 端到端

外部依赖清单

Crate	用途	引入 Phase
`cc`	build.rs 编译 .cu 文件	0
`half`	f16 / bf16 Rust 类型	2
`smallvec`	Tensor shape / strides (栈分配)	2
`safetensors`	权重文件解析	6
`serde` + `serde_json`	JSON 序列化	6
`memmap2`	文件 mmap (safetensors 可能内置)	6
`regex`	BPE pre-tokenization	7
`rand`	Sampling (随机数)	8
`tokio`	Async runtime	13
`axum`	HTTP server	13
`criterion`	Benchmark framework	3+
`image`	图片加载 (multimodal)	19
`base64`	Base64 decode (multimodal API)	19

不使用: candle, burn, tch, tokenizers, cudarc — 核心组件全部自己实现。

60 KiB Raw Blame History Unescape Escape

xserv — LLM Inference Engine Roadmap

设计决策

硬件拓扑 (dash5, 已确认 2026-05-21)

项目结构

Phase 依赖图

Phase 0: 项目脚手架 + 环境验证

Phase 1: CUDA FFI 基础设施

模块划分

关键技术点

测试验收

Phase 2: Tensor 抽象层

核心数据结构

关键技术点

测试验收

Phase 3: GEMM — 矩阵乘法

实现路线（4 个递进版本）

Version 1: Naive GEMM

Version 2: Tiled GEMM (shared memory)

Version 3: Register Tiling + 向量化

Version 4: Tensor Core GEMM (WMMA)

cuBLAS 封装

Kernel Registry（运行时可切换 backend）

测试验收

Phase 4: Transformer 核心 Kernels

Kernel 清单

关键学习主题

测试验收

Phase 5: Attention Kernel (Naive 版)

计算流程

实现方式

Causal Mask

GQA 预备

测试验收

Phase 6: 模型加载

核心组件

1. safetensors 解析

2. HF Config 解析

3. 权重映射

4. 加载流程

外部依赖

测试验收

Phase 7: BPE Tokenizer

BPE 算法核心

编码 (encode)

解码 (decode)

需要处理的细节

测试验收

Phase 8: GPT-2 完整推理 — 里程碑 ①

模型结构

Forward Pass 流程

Sampling 策略

CLI 交互

测试验收

Phase 9: KV Cache + Autoregressive 优化

核心概念: Prefill vs Decode

KV Cache 设计（简单版，非 paged）

Decode Attention Kernel

测试验收

Phase 10: Qwen3-8B 支持 — 里程碑 ②

架构对比

需要新增/修改的组件

1. GQA (Grouped Query Attention)

2. RoPE (Rotary Position Embedding)

3. SwiGLU FFN

模型结构

显存预算 (BF16, 单卡 5090 32GB)

测试验收

Phase 11: Paged Attention + KV Cache Manager

问题

PagedAttention 设计

Paged Attention Kernel

Copy-on-Write (高级，可选)

测试验收

Phase 12: Continuous Batching + Request Scheduler

Static Batching vs Continuous Batching

核心数据结构

调度循环 (Engine 主循环)

Preemption（显存不足时的抢占）

测试验收

60 KiB

Raw Blame History