xserv/docs/00-roadmap.md

# xserv — LLM Inference Engine Roadmap

> 从零用 Rust + CUDA 构建一个完整的 LLM 推理引擎，目标是深入理解 LLM Serving 全栈技术。

## 设计决策

| 决策项 | 选择 | 备注 |
|--------|------|------|
| 抽象层级 | Level 0.5 | 自写 CUDA kernel + cuBLAS 可切换，便于 benchmark 对比 |
| 硬件 | 8×RTX 5090 (Blackwell, CC 12.0, 32GB GDDR7) | 纯 PCIe Gen5 x16 互联，无 NVLink (详见下方硬件拓扑) |
| 语言 | Rust + CUDA (C/C++) | Rust FFI 调用 CUDA |
| 起步模型 | GPT-2 124M → Qwen3-8B | 从简单到实用 |
| 精度 | BF16/FP16 | 后期扩展 FP8 |
| Tensor | 自己实现 | 完整学习 tensor 抽象设计 |
| Tokenizer | 自己实现 BPE | 学习分词机制 |
| 权重格式 | safetensors | Rust 友好，零拷贝 mmap |
| Async Runtime | tokio | 成熟稳定，不引入性能问题 |
| API | OpenAI 兼容 | `/v1/chat/completions`，SSE streaming |
| 时间线 | 不限 | 学习为主，每步验证 |

## 硬件拓扑 (dash5, 已确认 2026-05-21)

**GPU**: 8× NVIDIA GeForce RTX 5090, 32607 MiB, Compute Capability 12.0
**CUDA Toolkit**: 12.9 (安装于 `/usr/local/cuda-12.9`，需将 `bin/` 加入 PATH)
**PCIe**: Gen 5 x16 (理论单向 ~64 GB/s，空闲时降频至 Gen 1)

**互联拓扑** (`nvidia-smi topo -m`):
```
        GPU0  GPU1  GPU2  GPU3  GPU4  GPU5  GPU6  GPU7
GPU0     X    PHB   PHB   PHB   NODE  NODE  NODE  NODE
GPU1    PHB    X    PHB   PHB   NODE  NODE  NODE  NODE
GPU2    PHB   PHB    X    PHB   NODE  NODE  NODE  NODE
GPU3    PHB   PHB   PHB    X    NODE  NODE  NODE  NODE
GPU4    NODE  NODE  NODE  NODE   X    PHB   PHB   PHB
GPU5    NODE  NODE  NODE  NODE  PHB    X    PHB   PHB
GPU6    NODE  NODE  NODE  NODE  PHB   PHB    X    PHB
GPU7    NODE  NODE  NODE  NODE  PHB   PHB   PHB    X

PHB  = 同一 PCIe Host Bridge（同组，延迟低）
NODE = 跨 PCIe Host Bridge（跨组，延迟较高）
```

**分组**: GPU 0-3 为一组, GPU 4-7 为一组。组内 PHB 互联，跨组 NODE 互联。

**对设计的影响**:
- **无 NVLink**: AllReduce 带宽受限于 PCIe (~64 GB/s vs NVLink ~450 GB/s)
- **TP 策略**: 当前阶段目标 TP=1/2/4，在同组内 (0-3 或 4-7) 执行，全 PHB 互联
- **跨组并行 (TP=8, PP 等)**: 留待后续扩展
- **CPU Affinity**: GPU 0-3 亲和 CPU 0-127, GPU 4-7 亲和 CPU 0-186（NUMA 0-1）

## 项目结构

```
xserv/
├── Cargo.toml                    # workspace root
├── csrc/                         # CUDA 源文件 (.cu / .cuh)
│   ├── gemm/                     #   GEMM kernels (naive, tiled, tensor core)
│   ├── attention/                #   Attention kernels (naive, flash, paged)
│   ├── normalization/            #   LayerNorm, RMSNorm
│   ├── activation/               #   GELU, SiLU
│   ├── embedding/                #   Embedding lookup, RoPE
│   ├── reduce/                   #   Softmax, argmax, sampling
│   └── quantize/                 #   FP8/INT8 kernels
├── crates/
│   ├── xserv-cuda/               # Phase 1: CUDA FFI, context, stream, allocator
│   ├── xserv-tensor/             # Phase 2: Tensor type, ops dispatch, DType
│   ├── xserv-kernels/            # Phase 3-5: kernel registry (custom + cuBLAS)
│   ├── xserv-tokenizer/          # Phase 7: BPE tokenizer
│   ├── xserv-model/              # Phase 6,8,10: model def + weight loading
│   ├── xserv-runtime/            # Phase 9,11,12: KV cache, paging, scheduler
│   ├── xserv-engine/             # Phase 13: inference engine orchestration
│   ├── xserv-api/                # Phase 13: HTTP server + OpenAI compat
│   ├── xserv-speculative/        # Phase 16: speculative decoding
│   └── xserv-distributed/        # Phase 17: tensor parallelism, NCCL
├── tests/                        # integration tests
├── benches/                      # criterion benchmarks
├── tools/                        # 辅助脚本 (PyTorch reference output 生成等)
└── docs/                         # 每个 phase 的设计文档
```

## Phase 依赖图

```
Phase 0: 项目脚手架 + 环境验证
    │
Phase 1: CUDA FFI 基础设施
    │
Phase 2: Tensor 抽象层
    │
Phase 3: GEMM (naive → tiled → tensor core → cuBLAS)
    │
Phase 4: Transformer Kernels (Norm, Activation, Embedding, RoPE, Softmax)
    │
Phase 5: Attention Kernel (naive MHA)
    │
Phase 6: 模型加载 (safetensors + HF config)
    │              │
    │         Phase 7: BPE Tokenizer (可与 Phase 6 并行)
    │              │
Phase 8: GPT-2 完整推理 ◄──────────── 里程碑 ① CLI 文本生成
    │
Phase 9: KV Cache + Autoregressive Generation
    │
Phase 10: Qwen3-8B 支持 ◄─────────── 里程碑 ② 8B 模型推理
    │
Phase 11: Paged Attention + KV Cache Manager
    │
Phase 12: Continuous Batching + Request Scheduler
    │
Phase 13: HTTP API + SSE Streaming ◄── 里程碑 ③ 端到端 API 可用
    │
Phase 14: Flash Attention (FA2 for SM120)
    │
Phase 15: 性能优化 ◄──────────────── 里程碑 ④ 50% vLLM throughput
    │
Phase 16: Speculative Decoding
    │
Phase 17: Tensor Parallelism (TP=1/2/4) ◄── 里程碑 ⑤ 多卡推理
    │
Phase 18: 量化 (FP8 / INT8)
    │
Phase 19: Multimodal ◄────────────── 里程碑 ⑥ 视觉问答
```

---

## Phase 0: 项目脚手架 + 环境验证

**目标**: 搭建 Cargo workspace，验证 CUDA 工具链，确保开发环境就绪。

**技术要点**:
- Cargo workspace 配置，所有 crate 共享依赖版本
- `build.rs` 中用 `cc` crate 编译 `.cu` 文件的 pipeline
- 验证 CUDA toolkit 版本: **已确认 CUDA 12.9** (`/usr/local/cuda-12.9`)
- 验证 GPU compute capability: **已确认 CC 12.0** (Blackwell)
- 确认 `nvcc` 在 PATH 中 (需要 `export PATH=/usr/local/cuda-12.9/bin:$PATH`)
- ~~运行 `nvidia-smi topo -m` 确认互联拓扑~~ **已确认: 纯 PCIe Gen5, 无 NVLink**

**外部依赖**: `cc` crate（编译 CUDA）

**测试验收**:
- `cargo build` 通过
- 一个最小的 `.cu` kernel（向量加法）能从 Rust 调用并返回正确结果
- 输出 GPU 信息（名称、显存、compute capability）

**设计文档**: `docs/01-cuda-ffi.md`（与 Phase 1 合并）

---

## Phase 1: CUDA FFI 基础设施

**Crate**: `xserv-cuda`

**目标**: 封装 CUDA Runtime API，提供安全的 Rust 抽象层。

### 模块划分

```
xserv-cuda/src/
├── lib.rs
├── error.rs        # CudaError 类型, cudaGetLastError 封装
├── device.rs       # Device 枚举, 设备查询 (属性/数量/当前设备)
├── context.rs      # CUDA context 管理
├── stream.rs       # CudaStream (异步操作流)
├── memory.rs       # GPU 内存分配/释放/拷贝 (H2D, D2H, D2D)
├── allocator.rs    # Caching Allocator (显存池)
└── module.rs       # cuModuleLoad (加载 PTX/cubin, 可选)
```

### 关键技术点

1. **FFI 绑定策略**:
   - 手写 `extern "C"` 绑定核心 CUDA Runtime API（~30 个函数）
   - 不用 bindgen，保持可控和可读
   - 需要绑定的 API: `cudaMalloc`, `cudaFree`, `cudaMemcpy`, `cudaMemcpyAsync`,
     `cudaStreamCreate`, `cudaStreamSynchronize`, `cudaGetDeviceProperties`,
     `cudaSetDevice`, `cudaDeviceSynchronize`, `cudaGetLastError` 等

2. **GpuBuffer 抽象**:
   ```rust
   pub struct GpuBuffer {
       ptr: *mut c_void,
       size_bytes: usize,
       device: usize,
   }

   impl Drop for GpuBuffer {
       fn drop(&mut self) { /* cudaFree or return to allocator */ }
   }
   ```
   - `Drop` trait 自动释放，防止 GPU 内存泄漏
   - 不实现 `Clone`（显式 `copy_from` 代替）

3. **Caching Allocator**:
   - 维护 free list（按大小分桶，桶边界: 512B, 1KB, 2KB, ..., 1GB）
   - `alloc(size)`: 在对应桶中找 >= size 的 free block，miss 时 `cudaMalloc`
   - `free(ptr, size)`: 不调 `cudaFree`，放回 free list
   - `trim()`: 真正释放所有 free blocks（OOM 恢复时用）
   - 这是性能关键组件——频繁 `cudaMalloc/cudaFree` 会严重影响 throughput
   - 参考: PyTorch 的 `CUDACachingAllocator` 设计

4. **Stream 管理**:
   - 每个 stream 是独立的 GPU 执行队列
   - Kernel launch 和 memcpy 是异步的（提交到 stream 后立即返回）
   - `stream.synchronize()` 等待该 stream 上所有操作完成
   - 后续用于 overlap compute 和 memory transfer

5. **Error Handling**:
   ```rust
   #[derive(Debug)]
   pub enum CudaError {
       OutOfMemory,
       InvalidDevice,
       LaunchFailure,
       Raw { code: i32, message: String },
   }

   // 所有 CUDA 调用包装为 Result
   pub(crate) fn check(code: cudaError_t) -> Result<(), CudaError>;
   ```

### 测试验收

- [ ] 分配 1GB GPU 内存，H2D 拷贝一个大数组，D2H 拷回，验证数据一致
- [ ] Caching allocator: alloc → free → re-alloc same size，第二次不触发 `cudaMalloc`（通过内部计数验证）
- [ ] 多 stream 并发拷贝两个数组，验证结果正确
- [ ] 设备查询: 打印 GPU name, total memory, compute capability, SM count
- [ ] Benchmark: caching allocator vs 裸 `cudaMalloc` 的分配延迟对比（100 次 alloc/free 循环）

---

## Phase 2: Tensor 抽象层

**Crate**: `xserv-tensor`

**目标**: 实现核心 Tensor 类型，支持 CPU/GPU 存储、多种数据类型、视图操作。

### 核心数据结构

```rust
// --- 数据类型 ---
#[derive(Clone, Copy, PartialEq)]
pub enum DType {
    F32,
    F16,
    BF16,
    // 后期: U8, I8, F8E4M3, F8E5M2
}

impl DType {
    pub fn size_bytes(&self) -> usize; // F32=4, F16=2, BF16=2
}

// --- 设备 ---
#[derive(Clone, Copy, PartialEq)]
pub enum Device {
    Cpu,
    Cuda(usize),  // device ordinal
}

// --- 存储 ---
// 引用计数，支持 view（多个 Tensor 共享同一 Storage）
pub struct Storage(Arc<StorageInner>);

enum StorageInner {
    Cpu { data: Vec<u8> },
    Cuda { buffer: GpuBuffer, device: usize },
}

// --- Tensor ---
pub struct Tensor {
    storage: Storage,
    shape: SmallVec<[usize; 4]>,    // 维度（大多数 tensor <= 4D）
    strides: SmallVec<[usize; 4]>,  // 步长（以元素为单位）
    offset: usize,                   // storage 中的起始偏移（元素数）
    dtype: DType,
    device: Device,
}
```

### 关键技术点

1. **Strided Layout**:
   - 支持 `transpose`, `slice`, `permute` 等操作不拷贝数据，只改 strides/offset
   - `is_contiguous()`: strides 从右到左依次为 1, shape[-1], shape[-1]*shape[-2], ...
   - 非 contiguous tensor 在送入 CUDA kernel 前需要 `contiguous()` 拷贝为连续布局
   - 例: `[3,4]` tensor 的 strides = `[4, 1]`；transpose 后 shape=`[4,3]`, strides=`[1, 4]`

2. **BF16/F16 在 Rust 中的表示**:
   - 使用 `half` crate 的 `bf16` 和 `f16` 类型
   - GPU kernel 中使用 `__nv_bfloat16` / `__half`
   - Tensor 内部存储为 raw bytes，通过 DType dispatch 解释

3. **设备间拷贝**:
   ```rust
   impl Tensor {
       pub fn to(&self, device: Device) -> Tensor; // CPU↔GPU 拷贝
       pub fn to_dtype(&self, dtype: DType) -> Tensor; // 类型转换
   }
   ```

4. **基础操作**（此阶段实现）:
   - **创建**: `zeros`, `ones`, `from_slice`, `rand`, `full`, `arange`
   - **形状**: `reshape`, `view`, `transpose`, `squeeze`, `unsqueeze`, `contiguous`
   - **逐元素** (CPU + GPU kernel): `add`, `mul`, `sub`, `div`
   - **广播 (Broadcasting)**: NumPy 语义，维度从尾部对齐
   - **归约**: `sum`, `max`, `mean`（沿指定轴）

5. **Op Dispatch 机制**:
   ```rust
   // 根据 device 和 dtype dispatch 到不同实现
   pub fn add(a: &Tensor, b: &Tensor) -> Tensor {
       match (a.device(), b.device()) {
           (Device::Cpu, Device::Cpu) => cpu_ops::add(a, b),
           (Device::Cuda(_), Device::Cuda(_)) => cuda_ops::add(a, b),
           _ => panic!("device mismatch"),
       }
   }
   ```

### 测试验收

- [ ] 创建 tensor, reshape, transpose, slice，验证 shape/strides 计算正确
- [ ] 广播加法: `[3,1] + [1,4]` → `[3,4]`，与 numpy 结果对比
- [ ] CPU ↔ GPU 拷贝往返，数据一致
- [ ] BF16 tensor 的基础运算精度验证（与 FP32 结果对比 relative error）
- [ ] View 共享存储: 修改 view 的数据，原 tensor 也应变化
- [ ] Benchmark: GPU 逐元素 kernel vs CPU 的加速比（大数组）

---

## Phase 3: GEMM — 矩阵乘法

**Crate**: `xserv-kernels`
**CUDA 源码**: `csrc/gemm/`

**目标**: 实现 GEMM 的多个版本，从 naive 到 tensor core，同时封装 cuBLAS，建立 benchmark 对比框架。

这是 CUDA kernel 编程的第一个"修罗场"，会深刻理解 GPU 编程的核心概念。

### 实现路线（4 个递进版本）

#### Version 1: Naive GEMM
- 每个 thread 计算输出矩阵 C 的一个元素: `C[i][j] = sum(A[i][k] * B[k][j])`
- grid 维度: `(M/BLOCK, N/BLOCK)`, block 维度: `(BLOCK, BLOCK)`
- **学到**: grid/block 维度规划, global memory access pattern
- **问题**: global memory 访问完全没有局部性，bandwidth 利用率极低
- **预期性能**: ~1-2% cuBLAS

#### Version 2: Tiled GEMM (shared memory)
- 将 A, B 分成 TILE×TILE 的小块，加载到 shared memory
- 每个 thread block 计算 C 的一个 TILE×TILE 输出块
- 内层循环沿 K 维度滑动 tile
- **学到**: shared memory 使用, `__syncthreads()`, bank conflict, memory coalescing
- **关键**: A 的 tile 要按行加载（coalesced），B 的 tile 按列访问需要注意 bank conflict
- **预期性能**: ~10-20% cuBLAS

#### Version 3: Register Tiling + 向量化
- 每个 thread 计算多个输出元素（如 4×4 或 8×8）
- 使用寄存器存储中间结果，减少 shared memory 访问
- 向量化加载: `float4` 一次读 128 bit
- **学到**: register pressure, ILP (Instruction-Level Parallelism), occupancy vs. ILP tradeoff
- **预期性能**: ~30-50% cuBLAS

#### Version 4: Tensor Core GEMM (WMMA)
- 使用 CUDA WMMA API 调用 Tensor Core
- BF16 输入, FP32 累加
- 每次 wmma::mma_sync 计算 16×16×16 矩阵乘
- **学到**: WMMA fragment layout, Tensor Core 编程模型, warp-level 协作
- **关键**: 5090 Blackwell (CC 12.0) 的 Tensor Core 支持 BF16 和 FP8
- **预期性能**: ~60-80% cuBLAS

### cuBLAS 封装

```rust
// 需要封装的 cuBLAS API
extern "C" {
    fn cublasCreate_v2(handle: *mut cublasHandle_t) -> cublasStatus_t;
    fn cublasSetStream_v2(handle: cublasHandle_t, stream: cudaStream_t) -> cublasStatus_t;
    fn cublasGemmEx(
        handle: cublasHandle_t,
        transa: cublasOperation_t, transb: cublasOperation_t,
        m: i32, n: i32, k: i32,
        alpha: *const c_void,
        A: *const c_void, Atype: cudaDataType, lda: i32,
        B: *const c_void, Btype: cudaDataType, ldb: i32,
        beta: *const c_void,
        C: *mut c_void, Ctype: cudaDataType, ldc: i32,
        computeType: cublasComputeType_t,
        algo: cublasGemmAlgo_t,
    ) -> cublasStatus_t;
}
```

支持: BF16×BF16→BF16 (compute=FP32), FP16×FP16→FP16, FP32×FP32→FP32

### Kernel Registry（运行时可切换 backend）

```rust
#[derive(Clone, Copy)]
pub enum GemmBackend {
    Naive,
    Tiled,
    RegisterTiled,
    TensorCore,
    CuBlas,
}

pub fn matmul(a: &Tensor, b: &Tensor, backend: GemmBackend) -> Tensor;

// 全局默认 backend（可配置）
pub fn set_default_gemm_backend(backend: GemmBackend);
```

### 测试验收

- [ ] 正确性: 所有 5 个 backend 的输出与 cuBLAS 对比，max absolute error < 1e-3 (BF16)
- [ ] Benchmark 表格（用 `criterion` crate）:

| Backend | M=N=K=1024 | M=N=K=4096 | % of cuBLAS |
|---------|------------|------------|-------------|
| Naive | ms | ms | % |
| Tiled | ms | ms | % |
| RegisterTiled | ms | ms | % |
| TensorCore | ms | ms | % |
| cuBLAS | ms | ms | 100% |

- [ ] Profile: 用 `nsys`/`ncu` 分析 naive vs tiled 的 memory throughput 差异
- [ ] 非方阵测试: M=1, N=4096, K=4096 (decode 阶段的典型 shape)

---

## Phase 4: Transformer 核心 Kernels

**Crate**: `xserv-kernels`
**CUDA 源码**: `csrc/normalization/`, `csrc/activation/`, `csrc/embedding/`, `csrc/reduce/`

**目标**: 实现 Transformer 所需的所有非 Attention 算子，每个都有自定义 CUDA kernel。

### Kernel 清单

| Kernel | 用途 | CUDA 文件 | 核心优化点 |
|--------|------|-----------|-----------|
| LayerNorm | GPT-2 | `normalization/layernorm.cu` | Online Welford 算法, warp reduce, 向量化加载 |
| RMSNorm | Qwen3/LLaMA | `normalization/rmsnorm.cu` | 比 LayerNorm 简单（无 mean）, rsqrt |
| GELU | GPT-2 激活 | `activation/gelu.cu` | tanh 近似 vs 精确, 向量化 |
| SiLU (Swish) | Qwen3 激活 | `activation/silu.cu` | `x * sigmoid(x)`, 逐元素 |
| SwiGLU | Qwen3 FFN | `activation/swiglu.cu` | `SiLU(gate) * up`, fused 逐元素 |
| Embedding | token→vector | `embedding/embedding.cu` | Gather 操作, coalesced access |
| RoPE | Qwen3 位置编码 | `embedding/rope.cu` | 复数旋转, precompute freq |
| Softmax | Attention 内 | `reduce/softmax.cu` | Online safe softmax, 数值稳定 |
| Argmax | Greedy sampling | `reduce/argmax.cu` | Parallel reduction |
| TopK | TopK sampling | `reduce/topk.cu` | Bitonic sort 或 radix select |

### 关键学习主题

**Reduction Pattern（核心中的核心）**:

LayerNorm, RMSNorm, Softmax 都涉及对某个维度求和/求最大值。GPU reduction 是分层的:

```
Thread-level:   每个 thread 处理多个元素，本地累加
    ↓
Warp-level:     __shfl_down_sync() 在 warp (32 threads) 内规约
    ↓
Block-level:    shared memory 存各 warp 的结果，再规约
    ↓
Grid-level:     (如果需要) atomic 或两遍 kernel
```

对于 Norm/Softmax，通常 hidden_dim <= 8192，一个 block 就够，不需要 grid-level reduction。

**向量化内存访问**:
- `float4` (128-bit) 一次加载 4 个 float 或 8 个 bf16
- `__nv_bfloat162` 一次处理 2 个 bf16
- 提升 memory throughput，减少 load/store 指令数

**每个 kernel 都实现两个版本**:
1. Custom CUDA kernel（自己写，深入理解）
2. Reference 实现（简单的 Python/numpy，生成 reference output 用于验证）

### 测试验收

- [ ] 每个 kernel 的输出与 PyTorch 参考实现对比
  - 写一个 `tools/generate_reference.py` 脚本，为每个 op 生成 reference input/output，保存为 `.npy`
  - Rust 测试中加载 `.npy` 对比
- [ ] 数值精度: BF16 下 max relative error < 1e-2, FP32 下 < 1e-5
- [ ] RoPE: 验证旋转后的向量与 HF transformers 的 `apply_rotary_pos_emb` 结果一致
- [ ] Softmax: 验证 `sum(output, dim=-1) == 1.0`，验证 numerical stability（大值输入不 overflow）
- [ ] Benchmark: 每个 kernel 与 PyTorch 对应操作的延迟对比

---

## Phase 5: Attention Kernel (Naive 版)

**Crate**: `xserv-kernels`
**CUDA 源码**: `csrc/attention/naive_attention.cu`

**目标**: 实现标准 Multi-Head Attention，不做 Flash/Paged 优化。理解 attention 机制的计算基础。

### 计算流程

```
Input: Q [B, H, S, D], K [B, H, S, D], V [B, H, S, D]
       其中 B=batch, H=num_heads, S=seq_len, D=head_dim

1. scores = Q @ K^T / sqrt(D)           → [B, H, S, S]
2. scores = scores + causal_mask         → 上三角置为 -inf
3. weights = softmax(scores, dim=-1)     → [B, H, S, S]
4. output = weights @ V                  → [B, H, S, D]
```

### 实现方式

**方式一: 组合式（先跑通）**
- 用 Phase 3 的 GEMM (Q@K^T) + Phase 4 的 Softmax + GEMM (weights@V)
- 简单但 materialize 了 S×S 矩阵，内存 O(S²)

**方式二: Fused kernel（理解 Flash Attention 的前置）**
- 一个 kernel 完成整个 attention
- 仍然 materialize S×S（不做 tiling），但减少 kernel launch 和 global memory 读写次数

### Causal Mask

- 不显式构造 mask 矩阵（浪费内存）
- 在 softmax 前对 `scores[i][j]` where `j > i` 写 `-inf`（`-1e9` for BF16）
- 编译期条件判断: `if (col > row) score = -inf;`

### GQA 预备

- 本阶段实现标准 MHA: `num_kv_heads == num_heads`
- Phase 10 扩展为 GQA: `num_kv_heads < num_heads`
- GQA 时 K/V 需要 repeat: 每个 KV head 服务 `num_heads / num_kv_heads` 个 Q head
- 实际实现: 不真正 repeat 数据，在 kernel 中用 `kv_head_idx = q_head_idx / num_groups` 索引

### 测试验收

- [ ] 随机 Q, K, V，输出与 PyTorch `F.scaled_dot_product_attention(is_causal=True)` 对比
- [ ] 验证 causal mask: attention weight 矩阵的上三角全为 0
- [ ] Benchmark 表（记录为 Flash Attention 的 baseline）:

| Seq Length | Latency (ms) | GPU Memory (MB) |
|------------|-------------|-----------------|
| 128 | | |
| 512 | | |
| 2048 | | |
| 4096 | | |
| 8192 | | OOM? |

- [ ] 输出 attention weights 可视化（小规模，验证 causal pattern）

---

## Phase 6: 模型加载

**Crate**: `xserv-model`

**目标**: 从 HuggingFace safetensors 文件加载模型权重到 GPU Tensor。

### 核心组件

#### 1. safetensors 解析

- 使用 `safetensors` crate 读取文件
- 文件结构: 8 bytes header_size + JSON header + raw tensor data
- 支持 mmap 零拷贝读取
- 支持 sharded 文件: `model-00001-of-00003.safetensors`
  - 通过 `model.safetensors.index.json` 查找 tensor → file 的映射

#### 2. HF Config 解析

```rust
#[derive(Deserialize)]
pub struct ModelConfig {
    pub architectures: Vec<String>,
    pub hidden_size: usize,
    pub intermediate_size: usize,
    pub num_attention_heads: usize,
    pub num_key_value_heads: usize,    // GQA: 可能 < num_attention_heads
    pub num_hidden_layers: usize,
    pub vocab_size: usize,
    pub max_position_embeddings: usize,
    pub rms_norm_eps: f64,             // Qwen3 用
    pub rope_theta: f64,               // RoPE base frequency
    pub tie_word_embeddings: bool,
    // ... 其他字段按模型按需添加
}
```

#### 3. 权重映射

HuggingFace 命名规范 (以 Qwen3 为例):
```
model.embed_tokens.weight                          → embedding
model.layers.{i}.self_attn.q_proj.weight           → layer[i].attn.q_proj
model.layers.{i}.self_attn.k_proj.weight           → layer[i].attn.k_proj
model.layers.{i}.self_attn.v_proj.weight           → layer[i].attn.v_proj
model.layers.{i}.self_attn.o_proj.weight           → layer[i].attn.o_proj
model.layers.{i}.mlp.gate_proj.weight              → layer[i].mlp.gate
model.layers.{i}.mlp.up_proj.weight                → layer[i].mlp.up
model.layers.{i}.mlp.down_proj.weight              → layer[i].mlp.down
model.layers.{i}.input_layernorm.weight             → layer[i].attn_norm
model.layers.{i}.post_attention_layernorm.weight    → layer[i].ffn_norm
model.norm.weight                                   → final_norm
lm_head.weight                                      → lm_head
```

#### 4. 加载流程

```
safetensors file (disk)
  → mmap (host memory, 零拷贝)
  → dtype check/cast (如 FP32 → BF16)
  → H2D copy → GPU Tensor
  → 按 layer 组织成模型结构
```

### 外部依赖

- `safetensors` crate
- `serde` + `serde_json` (解析 config.json)
- `memmap2` (mmap 支持，safetensors crate 可能内置)

### 测试验收

- [ ] 加载 GPT-2 124M (`openai-community/gpt2`)，打印所有 tensor name, shape, dtype
- [ ] 抽查几个 tensor 的前 10 个值，与 PyTorch `from_pretrained` 对比
- [ ] 加载 Qwen3-8B sharded 权重，验证所有 tensor 都成功加载
- [ ] 性能: 测量 8B 模型权重加载时间 (mmap → GPU 全流程)
- [ ] 错误处理: 缺少 tensor、dtype 不匹配、文件不存在等情况

---

## Phase 7: BPE Tokenizer

**Crate**: `xserv-tokenizer`

**目标**: 从零实现 Byte-Pair Encoding tokenizer，兼容 HuggingFace tokenizer.json 格式。

### BPE 算法核心

#### 编码 (encode)

```
输入: "Hello world"
  → pre-tokenize (regex split): ["Hello", " world"]
  → 每个词转为 byte 序列: [72, 101, 108, 108, 111], [32, 119, 111, ...]
  → 初始 token 序列: 每个 byte 是一个 token
  → 反复合并:
      1. 找当前序列中优先级最高的 byte-pair (从 merges 表查)
      2. 合并该 pair
      3. 重复直到无可合并
  → 输出: token IDs
```

#### 解码 (decode)

```
token IDs → 查 vocab 得到 byte 序列 → 拼接 → UTF-8 decode → 文本
```

### 需要处理的细节

1. **Pre-tokenization**:
   - GPT-2 regex: `'s|'t|'re|'ve|'m|'ll|'d| ?\p{L}+| ?\p{N}+| ?[^\s\p{L}\p{N}]+|\s+`
   - Qwen3 可能使用不同的 regex pattern（从 `tokenizer.json` 的 `pre_tokenizer` 字段读取）
   - 用 `regex` crate 实现

2. **tokenizer.json 解析**:
   ```json
   {
     "model": {
       "type": "BPE",
       "vocab": {"Hello": 0, "world": 1, ...},
       "merges": ["H e", "He l", "Hel lo", ...]
     },
     "added_tokens": [...],
     "pre_tokenizer": {...},
     "post_processor": {...}
   }
   ```

3. **Special Tokens**:
   - `<|endoftext|>` (GPT-2 EOS)
   - `<|im_start|>`, `<|im_end|>` (Qwen3 ChatML)
   - `<|endoftext|>` (Qwen3 EOS)
   - Special tokens 不参与 BPE merge，直接映射到 ID

4. **Chat Template** (Qwen3 格式):
   ```
   <|im_start|>system
   You are a helpful assistant.<|im_end|>
   <|im_start|>user
   Hello<|im_end|>
   <|im_start|>assistant
   ```

5. **性能优化**:
   - Merge rules 用 `HashMap<(TokenId, TokenId), MergePriority>` 预索引
   - 对于长文本，考虑 priority queue 加速 pair 查找

### 测试验收

- [ ] 加载 GPT-2 tokenizer，encode + decode 一批测试文本，与 Python `AutoTokenizer` 逐 token 对比
- [ ] 加载 Qwen3 tokenizer，同样逐 token 对比
- [ ] 边界情况: 空字符串、纯 emoji (🎉🔥)、中英混合、超长文本 (1MB)
- [ ] Chat template: 给定 messages 列表，生成与 HF `apply_chat_template` 一致的 token 序列
- [ ] Benchmark: encode 1MB 文本的延迟

---

## Phase 8: GPT-2 完整推理 — 里程碑 ①

**Crate**: `xserv-model`

**目标**: 将所有组件串联，实现 GPT-2 的完整推理 pipeline。这是第一次看到模型"说话"。

### 模型结构

```rust
pub struct GPT2 {
    config: GPT2Config,
    wte: Tensor,           // token embedding [vocab_size, hidden_size]
    wpe: Tensor,           // position embedding [max_seq_len, hidden_size]
    layers: Vec<GPT2Block>,
    ln_f: LayerNorm,       // final layer norm
    // lm_head 与 wte 共享权重 (tied embeddings)
}

pub struct GPT2Block {
    ln_1: LayerNorm,
    attn: GPT2Attention,   // MHA: q_proj, k_proj, v_proj, o_proj
    ln_2: LayerNorm,
    mlp: GPT2MLP,          // fc1 (4H) → GELU → fc2 (H)
}
```

### Forward Pass 流程

```
tokens [B, S]
  → wte[tokens] + wpe[0..S]              → hidden [B, S, 768]
  → for each layer:
      → residual = hidden
      → hidden = ln_1(hidden)
      → hidden = attention(hidden)         # Q, K, V 从 hidden 线性变换
      → hidden = hidden + residual         # residual connection
      → residual = hidden
      → hidden = ln_2(hidden)
      → hidden = mlp(hidden)               # Linear→GELU→Linear
      → hidden = hidden + residual
  → hidden = ln_f(hidden)
  → logits = hidden @ wte.T               → [B, S, vocab_size]
  → next_token = sample(logits[:, -1, :]) # 只取最后一个 position
```

### Sampling 策略

```rust
pub struct SamplingParams {
    pub temperature: f32,      // default 1.0
    pub top_k: usize,         // default 50
    pub top_p: f32,            // default 1.0 (disabled)
    pub max_tokens: usize,     // default 256
    pub repetition_penalty: f32, // default 1.0 (disabled)
}
```

实现:
1. **Greedy**: `argmax(logits)`
2. **Temperature**: `logits = logits / temperature` → softmax → sample
3. **Top-K**: 保留 top-k logits，其余置为 -inf → softmax → sample
4. **Top-P (Nucleus)**: 按概率降序排列，累加到概率 >= p → 截断 → 重新 normalize → sample
5. 以上可以组合: temperature → top-k → top-p → sample

### CLI 交互

```
$ cargo run --release --bin xserv-cli -- --model openai-community/gpt2

xserv> The future of AI is
GPT-2> The future of AI is not just about the technology, but about the people
       who are building it. The question is whether we can...

xserv> Once upon a time
GPT-2> Once upon a time, there was a young man who lived in a small village...
```

### 测试验收

- [ ] 加载 `openai-community/gpt2`，prefill "The future of AI is"
- [ ] Prefill logits 与 PyTorch 对比: top-5 token IDs 和对应 logit 值一致
- [ ] Greedy decode 50 tokens，结果应该是连贯英文
- [ ] Temperature/TopK/TopP sampling: 生成多次结果应有变化
- [ ] CLI 交互模式可用

---

## Phase 9: KV Cache + Autoregressive 优化

**Crate**: `xserv-runtime`

**目标**: 实现 KV Cache，将 decode 从 O(S²) 降到 O(S) per step。

### 核心概念: Prefill vs Decode

**Prefill（首次处理 prompt）**:
- 输入: 完整 prompt `[B, S, D]`
- 计算: 所有 token 的 Q, K, V
- Attention: `Q[B,H,S,D] @ K[B,H,S,D]^T` → 完整 S×S 矩阵
- 输出: 缓存 K, V 到 KV cache
- 特点: **Compute-bound**（大矩阵乘法）

**Decode（逐 token 生成）**:
- 输入: 上一步生成的 1 个 token `[B, 1, D]`
- 计算: 只计算新 token 的 Q, K, V
- K_new, V_new append 到 cache
- Attention: `Q[B,H,1,D] @ K_cache[B,H,S+1,D]^T` → `[B,H,1,S+1]`
- 特点: **Memory-bound**（Q 只有 1 行，瓶颈在读 K/V cache）

### KV Cache 设计（简单版，非 paged）

```rust
pub struct KVCache {
    // 每层一对 K/V tensor
    // shape: [batch_size, num_kv_heads, max_seq_len, head_dim]
    k_caches: Vec<Tensor>,   // 索引 = layer_idx
    v_caches: Vec<Tensor>,
    seq_len: usize,          // 当前已填充的长度
    max_seq_len: usize,      // 预分配的最大长度
}

impl KVCache {
    // prefill 时: 写入 [0..prompt_len] 的 K, V
    pub fn fill(&mut self, layer: usize, k: &Tensor, v: &Tensor);

    // decode 时: 在 seq_len 位置写入新的 K, V，返回完整 cache
    pub fn append(&mut self, layer: usize, k: &Tensor, v: &Tensor) -> (&Tensor, &Tensor);
}
```

### Decode Attention Kernel

与 prefill attention 不同，decode 时 Q 只有 1 行:
```
Q [B, H, 1, D] × K_cache [B, H, S, D]^T → scores [B, H, 1, S]
→ softmax → weights [B, H, 1, S]
weights × V_cache [B, H, S, D] → output [B, H, 1, D]
```

优化方向:
- 每个 warp 处理一个 head
- 沿 S 维度做 parallel reduction (dot product + online softmax)
- 重点优化 memory bandwidth（K/V cache 的读取是瓶颈）

### 测试验收

- [ ] 对比有/无 KV cache 的生成结果 → **必须完全一致**（bit-exact for greedy）
- [ ] Benchmark decode 延迟:

| Seq Length | Without Cache (ms/token) | With Cache (ms/token) | Speedup |
|------------|--------------------------|----------------------|---------|
| 128 | | | |
| 512 | | | |
| 2048 | | | |

- [ ] 显存占用: KV cache 的实际显存与理论值 (`2 * num_layers * num_kv_heads * seq_len * head_dim * sizeof(bf16)`) 对比
- [ ] GPT-2 decode throughput (tokens/s) 记录为 baseline

---

## Phase 10: Qwen3-8B 支持 — 里程碑 ②

**Crate**: `xserv-model`

**目标**: 扩展模型定义以支持 Qwen3-8B，验证输出正确性。

### 架构对比

| 特性 | GPT-2 (124M) | Qwen3-8B |
|------|-------------|----------|
| Normalization | LayerNorm (pre-LN) | RMSNorm (pre-LN) |
| Position Encoding | Learned absolute (wpe) | RoPE (无单独参数) |
| Attention | MHA (12 heads, 12 KV heads) | GQA (如 32 Q heads, 8 KV heads) |
| Activation | GELU | SwiGLU (SiLU gate) |
| FFN | Linear(H→4H) → GELU → Linear(4H→H) | gate_proj + up_proj → SiLU gate → down_proj |
| Vocab Size | 50,257 | ~152,000 |
| Hidden Size | 768 | 4,096 (8B) |
| Layers | 12 | 36 |
| Tied Embeddings | Yes | No |

### 需要新增/修改的组件

#### 1. GQA (Grouped Query Attention)
```
num_heads = 32, num_kv_heads = 8
每个 KV head 服务 32/8 = 4 个 Q head

Q: [B, 32, S, 128]
K: [B, 8, S, 128]   ← 只有 8 个 KV heads
V: [B, 8, S, 128]

Attention 时:
  kv_head_idx = q_head_idx / (num_heads / num_kv_heads)
  不需要真正 repeat K/V 数据，kernel 中做索引映射
```

#### 2. RoPE (Rotary Position Embedding)
```
对 Q, K 的每对相邻元素做旋转:
  Q' = Q * cos(θ) + rotate_half(Q) * sin(θ)
  K' = K * cos(θ) + rotate_half(K) * sin(θ)
其中 θ_i = pos / (rope_theta^(2i/d))

预计算: freqs = 1.0 / (rope_theta^(2i/d)) for i in 0..d/2
运行时: cos_cache[pos][i] = cos(pos * freqs[i])
```

#### 3. SwiGLU FFN
```
x → gate_proj(x) → SiLU → ⊙ up_proj(x) → down_proj → output

三个 Linear:
  gate_proj: [hidden_size, intermediate_size]
  up_proj:   [hidden_size, intermediate_size]
  down_proj: [intermediate_size, hidden_size]
```

### 模型结构

```rust
pub struct Qwen3Model {
    config: Qwen3Config,
    embed_tokens: Tensor,           // [vocab_size, hidden_size]
    layers: Vec<Qwen3DecoderLayer>,
    norm: RMSNorm,                  // final RMSNorm
    lm_head: Tensor,               // [vocab_size, hidden_size] (not tied)
}

pub struct Qwen3DecoderLayer {
    input_layernorm: RMSNorm,
    self_attn: Qwen3Attention,     // GQA with RoPE
    post_attention_layernorm: RMSNorm,
    mlp: Qwen3MLP,                 // SwiGLU
}
```

### 显存预算 (BF16, 单卡 5090 32GB)

```
模型权重:  8B × 2B = ~16 GB
KV cache:  36 layers × 2(KV) × 8 heads × 4096 tokens × 128 dim × 2B ≈ 5.6 GB
Activation (单请求): ~1 GB
────────────────────────
总计: ~22.6 GB (单请求)，剩余 ~10 GB 可用于更多并发
```

### 测试验收

- [ ] 加载 Qwen3-8B 权重到单张 5090，打印模型结构和参数量
- [ ] Prefill logits 与 HF transformers 对比: 输入 "你好" → top-5 logits 一致
- [ ] 英文生成: "What is the capital of France?" → 生成合理回答
- [ ] 中文生成: "请介绍一下量子计算" → 生成通顺中文
- [ ] 多轮对话:
  ```
  <|im_start|>user\nHello<|im_end|>\n<|im_start|>assistant\n
  ```
  验证 chat template 格式正确
- [ ] 单请求性能 baseline: prefill latency (ms), decode throughput (tokens/s)

---

## Phase 11: Paged Attention + KV Cache Manager

**Crate**: `xserv-runtime`
**CUDA 源码**: `csrc/attention/paged_attention.cu`

**目标**: 实现 vLLM 的核心创新 — PagedAttention，解决 KV cache 内存碎片化问题。

### 问题

Phase 9 的简单 KV cache 为每个请求预分配 `max_seq_len` 的连续内存:
- 请求只用了 100 tokens 但占了 4096 tokens 的空间 → 内存利用率低
- 连续分配导致碎片化 → 并发请求数受限

### PagedAttention 设计

**核心思想**: 像操作系统的虚拟内存一样，将 KV cache 分成固定大小的 page (block)。

```rust
pub const BLOCK_SIZE: usize = 16;  // 每个 block 存 16 个 token 的 KV

// 物理 KV cache: 预分配的大块 GPU 内存
// k_cache shape: [num_physical_blocks, num_kv_heads, block_size, head_dim]
// v_cache shape: [num_physical_blocks, num_kv_heads, block_size, head_dim]

pub struct BlockAllocator {
    free_blocks: Vec<usize>,          // 空闲物理 block ID 列表
    num_total_blocks: usize,
    ref_counts: Vec<usize>,           // 每个 block 的引用计数 (CoW 用)
}

pub struct BlockTable {
    // logical_block_idx → physical_block_idx
    // 例: 一个 seq_len=50 的请求有 4 个 block (50/16=3.125, 向上取整)
    blocks: Vec<usize>,
}

pub struct PagedKVCacheManager {
    k_cache: Tensor,                  // 所有物理 blocks
    v_cache: Tensor,
    allocator: BlockAllocator,
    block_tables: HashMap<SeqId, BlockTable>,
}
```

### Paged Attention Kernel

与普通 attention 的区别: K/V 不是连续存储，需要通过 block table 间接寻址。

```
输入:
  Q:           [num_seqs, num_heads, head_dim]        (decode 时每个 seq 只有 1 个 query)
  k_cache:     [num_blocks, num_kv_heads, block_size, head_dim]  (物理存储)
  v_cache:     [num_blocks, num_kv_heads, block_size, head_dim]
  block_tables: [num_seqs, max_num_blocks]             (间接寻址表)
  seq_lens:    [num_seqs]                              (每个 seq 的实际长度)

每个 thread block 处理:
  1 个 seq 的 1 个 attention head
  遍历该 seq 的所有 logical blocks
  对每个 block: 查 block_table 得到 physical_block_id → 读取 K/V
  online softmax 累加
  输出: [num_seqs, num_heads, head_dim]
```

### Copy-on-Write (高级，可选)

- 多个 sequence 共享相同 prefix 的 KV blocks（beam search, prompt caching）
- 写入时: 如果 ref_count > 1，先复制该 block 再修改
- 这阶段先不实现，标记为后续优化

### 测试验收

- [ ] 正确性: paged attention 输出与 Phase 9 简单 KV cache 完全一致
- [ ] 内存效率对比:

| 场景 | Naive KV Cache | Paged KV Cache |
|------|---------------|----------------|
| 1 req, seq=100 | 分配 4096 tokens | 分配 7 blocks (112 tokens) |
| 10 req, seq=100-500 | 10×4096 | 按需分配 |
| 最大并发数 (32GB) | N 个 | M 个 (M >> N) |

- [ ] Block allocator: alloc/free 循环，无内存泄漏
- [ ] Benchmark: paged attention kernel vs naive decode attention 延迟对比

---

## Phase 12: Continuous Batching + Request Scheduler

**Crate**: `xserv-runtime`

**目标**: 实现 iteration-level 调度，支持请求的动态加入和退出。

### Static Batching vs Continuous Batching

**Static (朴素)**:
```
Batch 1: [req1, req2, req3] → 等 req1, req2, req3 全部完成
Batch 2: [req4, req5, req6] → ...
问题: req1 完成了但 req3 还在生成 → GPU 空转
```

**Continuous (Orca 论文)**:
```
Iteration 1: [req1, req2, req3] → req1 完成!
Iteration 2: [req2, req3, req4] → req4 动态加入
Iteration 3: [req2, req3, req4] → req3 完成!
Iteration 4: [req2, req4, req5] → req5 动态加入
```

### 核心数据结构

```rust
#[derive(Clone, Copy, PartialEq)]
pub enum SequenceStatus {
    Waiting,      // 在等待队列中
    Prefilling,   // 正在做 prefill
    Decoding,     // 正在做 decode
    Finished,     // 已完成 (EOS / max_len / stop string)
    Preempted,    // 被抢占（显存不够，KV cache 被换出）
}

pub struct Sequence {
    pub id: SeqId,
    pub prompt_tokens: Vec<u32>,
    pub generated_tokens: Vec<u32>,
    pub status: SequenceStatus,
    pub sampling_params: SamplingParams,
    pub block_table: BlockTable,
    pub arrival_time: Instant,
    // 用于 streaming 输出
    pub output_sender: tokio::sync::mpsc::Sender<GeneratedToken>,
}

pub struct Scheduler {
    waiting: VecDeque<Sequence>,
    running: Vec<Sequence>,
    max_num_seqs: usize,           // 最大并发 batch size
    max_num_tokens: usize,         // 单次 iteration 最大 token 数
    block_manager: PagedKVCacheManager,
}
```

### 调度循环 (Engine 主循环)

```rust
loop {
    // Step 1: 回收已完成的 sequence
    //   - 释放其 KV cache blocks
    //   - 从 running 移除

    // Step 2: 检查能否 admit 新请求
    //   - 条件: running.len() < max_num_seqs
    //           && 有足够的 free blocks 给新请求的 prompt
    //   - FCFS 从 waiting 取

    // Step 3: 划分 prefill / decode
    //   - 新加入的 sequence: prefill (处理完整 prompt)
    //   - 已在 running 的: decode (生成 1 个 token)

    // Step 4: 组装 batch input
    //   - Prefill: 各 seq 的 prompt tokens, 需要 padding 或 ragged batch
    //   - Decode: 各 seq 的最后一个 token

    // Step 5: Forward pass
    //   - prefill 和 decode 可以混合在一个 forward 中
    //   - 或者分开处理 (先 prefill, 再 decode)

    // Step 6: Sampling
    //   - 对每个 seq 的 logits 进行采样

    // Step 7: 更新状态
    //   - 将新 token append 到 sequence
    //   - 检查是否完成 (EOS / max_len)
    //   - 通过 channel 发送新 token 给 API 层
}
```

### Preemption（显存不足时的抢占）

当显存不足以 admit 新请求时:
1. **Swap**: 将低优先级 seq 的 KV cache 从 GPU 换到 CPU（复杂，后续再做）
2. **Recompute**: 丢弃低优先级 seq 的 KV cache，后续重新 prefill（简单，先实现这个）

### 测试验收

- [ ] 模拟 10 个请求在不同时间到达，所有请求都得到正确的生成结果
- [ ] 短请求完成后，新请求立即加入 batch（观察 log）
- [ ] Throughput 对比:

| 方式 | 20 请求总耗时 | Token/s |
|------|-------------|---------|
| 串行 (batch=1) | | |
| Static batch=4 | | |
| Continuous batch | | |

- [ ] 压力测试: 100 个并发请求，全部正确完成，无 hang/crash

---

## Phase 13: HTTP API + SSE Streaming — 里程碑 ③

**Crate**: `xserv-engine`, `xserv-api`

**目标**: 提供 OpenAI 兼容的 HTTP API，支持 SSE streaming。第一个端到端可用的里程碑。

### 技术栈

- **HTTP**: `axum` (Rust async web framework)
- **Async**: `tokio`
- **JSON**: `serde_json`
- **SSE**: `axum` 内置 SSE 支持 (`axum::response::sse`)

### API 端点

```
POST /v1/chat/completions    # 主要端点 (ChatML 格式)
POST /v1/completions         # 纯文本补全
GET  /v1/models              # 列出可用模型
GET  /health                 # 健康检查
```

### 请求/响应格式 (OpenAI 兼容)

**Chat Completion Request**:
```json
{
  "model": "qwen3-8b",
  "messages": [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "What is 1+1?"}
  ],
  "stream": true,
  "temperature": 0.7,
  "top_p": 0.9,
  "max_tokens": 256,
  "stop": ["\n\n"]
}
```

**SSE Streaming Response**:
```
data: {"id":"chatcmpl-xxx","object":"chat.completion.chunk","created":1234567890,"model":"qwen3-8b","choices":[{"index":0,"delta":{"role":"assistant","content":""},"finish_reason":null}]}

data: {"id":"chatcmpl-xxx","object":"chat.completion.chunk","created":1234567890,"model":"qwen3-8b","choices":[{"index":0,"delta":{"content":"The"},"finish_reason":null}]}

data: {"id":"chatcmpl-xxx","object":"chat.completion.chunk","created":1234567890,"model":"qwen3-8b","choices":[{"index":0,"delta":{"content":" answer"},"finish_reason":null}]}

data: {"id":"chatcmpl-xxx","object":"chat.completion.chunk","created":1234567890,"model":"qwen3-8b","choices":[{"index":0,"delta":{},"finish_reason":"stop"}]}

data: [DONE]
```

**Non-streaming Response**:
```json
{
  "id": "chatcmpl-xxx",
  "object": "chat.completion",
  "created": 1234567890,
  "model": "qwen3-8b",
  "choices": [{
    "index": 0,
    "message": {"role": "assistant", "content": "The answer is 2."},
    "finish_reason": "stop"
  }],
  "usage": {
    "prompt_tokens": 25,
    "completion_tokens": 8,
    "total_tokens": 33
  }
}
```

### 架构分层

```
Client (curl / Python OpenAI SDK)
    │
    ▼
┌─────────────────────────────────────┐
│  xserv-api (axum HTTP server)       │
│  - 解析请求, 验证参数               │
│  - apply chat template              │
│  - 将请求提交给 engine              │
│  - 从 channel 接收 token, 编码为 SSE│
└────────────┬────────────────────────┘
             │ InferenceRequest (通过 channel)
             ▼
┌─────────────────────────────────────┐
│  xserv-engine (推理引擎)            │
│  - 独立的 OS thread (非 async)      │
│  - 运行 scheduler 调度循环          │
│  - 管理 model + KV cache            │
│  - 每生成一个 token, 通过 channel   │
│    发送给 API 层                     │
└─────────────────────────────────────┘
```

**关键设计决策**:
- Engine 跑在独立 OS thread（避免 GPU 同步操作 block tokio runtime）
- API ↔ Engine 通过 `tokio::sync::mpsc` channel 通信
- 每个请求有独立的 `mpsc::Sender/Receiver` 用于 token streaming

### 测试验收

- [ ] `curl` 测试:
  ```bash
  curl http://localhost:8080/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d '{"model":"qwen3-8b","messages":[{"role":"user","content":"Hello"}],"stream":true}'
  ```
  看到 SSE 逐 token 输出

- [ ] Python OpenAI SDK 测试:
  ```python
  from openai import OpenAI
  client = OpenAI(base_url="http://localhost:8080/v1", api_key="unused")
  for chunk in client.chat.completions.create(
      model="qwen3-8b",
      messages=[{"role": "user", "content": "What is 1+1?"}],
      stream=True
  ):
      print(chunk.choices[0].delta.content or "", end="", flush=True)
  ```

- [ ] 非 streaming 模式也能正常工作
- [ ] 并发 10 个请求，全部正确完成
- [ ] 多轮对话: 连续发两轮消息（第二轮包含 history），验证上下文连贯
- [ ] `/v1/models` 返回已加载的模型列表
- [ ] 错误处理: 无效参数返回 400, 模型不存在返回 404

---

## Phase 14: Flash Attention (FA2 for SM120)

**Crate**: `xserv-kernels`
**CUDA 源码**: `csrc/attention/flash_attention.cu`

**目标**: 实现 Flash Attention 的 CUDA kernel，大幅降低 attention 的显存占用并提升速度。

### 硬件适配说明

Flash Attention 已发展到第 4 代 (FA4, arxiv 2603.05451)，但各版本有明确的硬件依赖:

| 版本 | 目标架构 | 关键硬件特性 | RTX 5090 兼容 |
|------|---------|------------|--------------|
| FA2 | 通用 CUDA (SM75+) | 标准 shared memory + HMMA | **是** ✅ |
| FA3 | Hopper SM90 (H100) | TMA + WGMMA + warp specialization | 否 |
| FA4 | Blackwell SM100 (B200/B300) | TMEM + async MMA + 2-CTA mode | 否 |

**RTX 5090 (SM120, CC 12.0) 使用的是消费级 Blackwell 架构 (GB202)，与数据中心 Blackwell (B200, SM100) 是不同的硅片设计。SM120 物理上没有 TMEM (Tensor Memory) 子系统，因此 FA4 的 kernel 无法在 5090 上运行。这不是软件限制，是硬件级差异。**

因此本项目实现 **FA2 算法**，使用标准 CUDA (shared memory + HMMA)。FA2 的核心优化——online softmax tiling、O(1) 显存占用——在任何架构上都有效。

### 核心思想

标准 attention 的问题:
```
S = Q @ K^T         ← 需要 O(S²) 显存存储 S×S 矩阵
P = softmax(S)      ← 需要完整的 S×S 才能做 softmax
O = P @ V
```

Flash Attention 的解法:
- **不 materialize S×S 矩阵**
- 将 Q, K, V 分成 tiles，在 SRAM (shared memory) 中计算
- 使用 **online softmax trick**: 边算边更新 running max 和 running sum

### 算法 (Forward Pass, FA2)

FA2 相比 FA1 的改进: 外层循环遍历 Q tiles (而非 K/V)，减少 HBM 读写次数。

```
Br, Bc = tile sizes for Q and K/V respectively

for each Q tile (q_start..q_start+Br):                    ← 外层: Q tiles
    load Q_tile [Br, D] to shared memory
    initialize: O_tile = 0, l = 0, m = -inf               // running sum and max

    for each K,V tile (kv_start..kv_start+Bc):             ← 内层: K/V tiles
        load K_tile [Bc, D], V_tile [Bc, D] to shared memory

        // Compute attention scores for this tile pair
        S_tile = Q_tile @ K_tile^T              // [Br, Bc], in registers/SRAM

        // Apply causal mask (skip if kv_start > q_start + Br)
        if causal: mask upper triangle of S_tile

        // Online softmax update
        m_new = max(m, rowmax(S_tile))          // new running max
        P_tile = exp(S_tile - m_new)            // safe exp
        l_new = exp(m - m_new) * l + rowsum(P_tile)  // update running sum

        // Rescale and accumulate output
        O_tile = diag(exp(m - m_new)) * O_tile + P_tile @ V_tile
        m = m_new
        l = l_new

    O_tile = O_tile / l                         // final normalization
    write O_tile [Br, D] to global memory (HBM)
```

### 实现要点

1. **Tile 大小选择**:
   - 5090 SM120: shared memory per SM = 100 KB (需实测确认)
   - 需同时存 Q_tile, K_tile, V_tile, S_tile
   - BF16: Q_tile [Br, D] = Br × 128 × 2B; K_tile [Bc, D] = Bc × 128 × 2B
   - S_tile [Br, Bc] 保持 FP32 = Br × Bc × 4B
   - 推荐起步: Br=Bc=64, head_dim=128 → 共需 ~100KB shared memory
   - 优化版: Br=Bc=128 需要更多 shared memory, 可能需要拆分

2. **Causal mask 优化**:
   - 如果 K/V tile 完全在 Q tile 的"未来"（kv_start > q_end）→ 跳过整个 tile
   - 减少约 50% 的计算量

3. **BF16 精度**:
   - S_tile, P_tile 的计算在 FP32 中进行（累加精度）
   - Q, K, V 的加载用 BF16（节省 bandwidth）
   - 最终 O 转回 BF16 写出

4. **GQA 支持**:
   - K/V heads 数量 < Q heads 时，kernel 中做 `kv_head = q_head / num_groups` 索引
   - 不需要 repeat_kv 操作，直接在 kernel 内部解决

5. **Decode attention 特化**:
   - Decode 时 Q 只有 1 行 (Br=1)，退化为 vector-matrix attention
   - 可以写一个专门的 decode attention kernel (类似 FlashDecoding)
   - 沿 KV sequence 维度做 parallel reduction

### 测试验收

- [ ] 正确性: 与 Phase 5 naive attention 对比, max error < 1e-2 (BF16)
- [ ] 显存: Flash Attention 不随 S 平方增长

| Seq Length | Naive VRAM | Flash VRAM | Naive Time | Flash Time |
|------------|-----------|------------|------------|------------|
| 512 | MB | MB | ms | ms |
| 2048 | MB | MB | ms | ms |
| 8192 | OOM? | MB | OOM? | ms |
| 32768 | OOM | MB | OOM | ms |

- [ ] 集成到 Qwen3-8B，端到端 decode latency 对比
- [ ] Profile: `ncu` 分析 compute utilization, memory throughput
- [ ] GQA 支持: 无 repeat_kv 开销

---

## Phase 15: 性能优化 — 里程碑 ④

**目标**: 系统性 profiling + 优化，向 50% vLLM throughput 目标冲刺。

### 优化方向

#### 1. Kernel Fusion
减少 memory-bound kernel 之间的 global memory 读写:
- **Residual Add + RMSNorm**: 一次读 hidden + residual，写出 normed output
- **SiLU + Elementwise Mul** (SwiGLU 内部): 一次读 gate + up, 写出 SiLU(gate)*up
- **Bias Add + Activation**: Linear 的 bias + 激活函数合并

原则: 两个逐元素 kernel 之间如果有 global memory 读写，就值得融合。

#### 2. CUDA Graphs
- Decode 阶段每步的 kernel 序列是固定的（shape 不变）
- 用 `cudaStreamBeginCapture` / `cudaStreamEndCapture` 捕获一次
- 后续用 `cudaGraphLaunch` 重放（消除 kernel launch overhead）
- **注意**: batch size 变化时需要重新 capture

#### 3. Memory 优化
- **权重预加载**: 启动时加载到 GPU，推理路径上零分配
- **Activation reuse**: 同一层的中间结果用完立即释放/复用
- **Pinned memory**: H2D/D2H 用 `cudaMallocHost`（pinned）提升拷贝带宽

#### 4. Compute 优化
- 确保所有 GEMM 走 Tensor Core (BF16)
- Decode attention: 优化 memory bandwidth 利用率
- Prefill: chunked processing（控制峰值显存，允许更大 batch）

#### 5. Scheduling 优化
- Prefill-Decode disaggregation: prefill 和 decode 分开 batch
  - 原因: prefill 是 compute-bound, decode 是 memory-bound, 混合导致两边都不优
- Dynamic batch size: 根据当前 running seqs 的 seq_len 动态调整

### Profiling 工具使用

```bash
# 整体 timeline (哪个 kernel 最耗时)
nsys profile --stats=true ./target/release/xserv-server

# 单个 kernel 分析 (occupancy, memory throughput)
ncu --target-processes all --set full ./target/release/xserv-server

# 自定义 Rust 计时
# 在 engine 循环中记录每个 phase 的耗时
```

### 测试验收

- [ ] 安装 vLLM，同一台机器跑 Qwen3-8B
- [ ] Benchmark 对比:

| Metric | vLLM | xserv | Ratio |
|--------|------|-------|-------|
| Prefill latency (ms, 128 tokens) | | | |
| Decode throughput (tokens/s, batch=1) | | | |
| Decode throughput (tokens/s, batch=16) | | | |
| Max concurrent requests (32GB) | | | |

- [ ] 目标: xserv throughput >= 50% vLLM
- [ ] Profiling 报告: 每个组件的耗时占比 pie chart
- [ ] 无功能回归: 所有之前的集成测试通过

---

## Phase 16: Speculative Decoding

**Crate**: `xserv-speculative`

**目标**: 用小模型（draft model）加速大模型（target model）的 decode 阶段。

### 算法

```
γ = 4 (speculative tokens 数量)

1. Draft model 自回归生成 γ 个 token: t1, t2, t3, t4
2. Target model 一次 forward 处理这 γ+1 个 position
   (等价于一次 prefill: [last_accepted_token, t1, t2, t3, t4])
   得到 γ+1 个 logits
3. Rejection sampling:
   for i in 0..γ:
     p = target_prob[i][t_{i+1}]   // target model 给 draft token 的概率
     q = draft_prob[i][t_{i+1}]    // draft model 给该 token 的概率
     if random() < min(1, p/q):
       accept t_{i+1}
     else:
       reject, resample from adjusted distribution
       break
4. 至少接受 1 个 token, 期望接受 ~γ×acceptance_rate 个
```

### 关键点

- **无损**: rejection sampling 保证输出分布与纯 target model 一致
- **加速条件**: draft model 足够快且与 target 分布接近
- **Draft model 选择**: Qwen3-0.5B / Qwen3-1.5B 作为 Qwen3-8B 的 draft

### KV Cache 处理

- Draft model 有自己的 KV cache
- Target model 验证时，accepted tokens 的 KV 可以复用（不用重算）
- Rejected 位置之后的 KV 需要丢弃

### 测试验收

- [ ] 验证: speculative decode 100 条不同 prompt，输出分布与标准 decode 无统计差异
- [ ] Acceptance rate 统计 (期望 60-80% per token)
- [ ] 端到端加速:

| Method | Tokens/s (batch=1) | Speedup |
|--------|--------------------|---------|
| Standard decode | | 1.0x |
| Speculative (γ=4) | | ~2-3x |

---

## Phase 17: Tensor Parallelism (TP=1/2/4) — 里程碑 ⑤

**Crate**: `xserv-distributed`

**目标**: 实现 Tensor Parallelism，支持 TP=2 和 TP=4（同组 GPU 内），跑通多卡推理。PP 及更大规模并行留待后续扩展。

### 通信后端: NCCL

NVIDIA Collective Communication Library，提供高效的 multi-GPU 通信原语。

需要封装的操作:
```rust
// NCCL FFI
ncclAllReduce(sendbuff, recvbuff, count, datatype, op, comm, stream)
ncclAllGather(sendbuff, recvbuff, count, datatype, comm, stream)
ncclBroadcast(sendbuff, recvbuff, count, datatype, root, comm, stream)
```

### Tensor Parallelism 策略 (Megatron-LM 风格)

每层只需要 **2 次 AllReduce**:

#### Attention 部分
```
Column Parallel: Q/K/V proj 按 head 维度切分
  GPU 0: q_proj[:, :hidden/TP], k_proj[:, :hidden/TP], v_proj[:, :hidden/TP]
  GPU 1: q_proj[:, hidden/TP:], k_proj[:, hidden/TP:], v_proj[:, hidden/TP:]
  → 每卡计算自己那部分 heads 的 attention → 无需通信

Row Parallel: o_proj 按行切分
  每卡计算部分输出 → AllReduce 求和 → 得到完整 output
```

#### FFN (SwiGLU) 部分
```
Column Parallel: gate_proj, up_proj 按列切分
  每卡计算部分 intermediate features → 无需通信

Row Parallel: down_proj 按行切分
  每卡计算部分输出 → AllReduce 求和 → 得到完整 output
```

#### 其他
- **Embedding**: vocab 按行切分，AllGather 拼接
- **RMSNorm**: 每卡独立计算（输入是 AllReduce 后的完整 tensor）
- **lm_head**: 按列切分，AllGather 拼接 logits

### 权重分片

启动时:
1. Rank 0 加载完整权重
2. 按 TP 策略切分
3. AllGather 或 Scatter 分发到各卡

或者:
- 每个 rank 独立加载，只读取属于自己的那部分（更高效）

### 互联拓扑 (已确认)

**纯 PCIe Gen5 x16, 无 NVLink**。GPU 分两组: 0-3 (PHB) 和 4-7 (PHB)，跨组走 NODE。

**TP 部署策略** (当前阶段目标: TP=1/2/4):
- **TP=2**: 同组内任意两卡 (如 GPU0+GPU1)，PCIe PHB 延迟最低
- **TP=4**: 同组 4 卡 (GPU 0-3 或 GPU 4-7)，全 PHB 互联
- **PCIe Gen5 x16 带宽**: 理论 ~64 GB/s 单向，实测 AllReduce 有效带宽约 40-50 GB/s
- **后续扩展**: TP=8 (跨组) 和 Pipeline Parallelism 留待后续阶段

### 测试验收

- [ ] TP=2: Qwen3-8B 输出与单卡 (TP=1) 完全一致
- [ ] TP=4: 每卡权重显存占用约 1/4
- [ ] Scaling benchmark (同组 GPU 0-3):

| TP Size | Prefill (tokens/s) | Decode (tokens/s) | Scaling Efficiency |
|---------|--------------------|--------------------|-------------------|
| 1 | | | 100% |
| 2 | | | % |
| 4 | | | % |

- [ ] AllReduce latency 测量 (不同消息大小，同组 PHB 互联)

---

## Phase 18: 量化 (FP8 / INT8)

**Crate**: `xserv-kernels` (新增量化 kernel), `xserv-model` (量化加载)
**CUDA 源码**: `csrc/quantize/`

**目标**: 降低模型显存占用和提升计算吞吐。

### 量化方式

#### 1. Weight-Only INT8
- 只量化权重，activation 保持 BF16
- Per-channel scale: 每个输出 channel 一个 `scale` 和 `zero_point`
- GEMM: INT8 × BF16 (dequantize on-the-fly)
- 适用于 memory-bound 场景（decode）

#### 2. FP8 (E4M3 / E5M2)
- 5090 Blackwell (CC 12.0) 原生支持 FP8 Tensor Core
- 权重和 activation 都量化为 FP8
- Dynamic scaling: 每个 tensor 运行时计算 `amax`，确定 scale factor
- GEMM: FP8 × FP8 → BF16/FP32 accumulate
- 适用于 compute-bound 场景（prefill）

#### 3. GPTQ / AWQ (高级, 可选)
- INT4 weight quantization
- 需要 calibration data
- 更复杂但压缩率更高

### FP8 实现要点

```rust
pub enum DType {
    // ... existing
    F8E4M3,    // 1 sign + 4 exponent + 3 mantissa (范围小，精度高)
    F8E5M2,    // 1 sign + 5 exponent + 2 mantissa (范围大，精度低)
}
```

Dynamic scaling:
```
scale = amax(tensor) / fp8_max   // fp8_max = 448.0 for E4M3
tensor_fp8 = cast_to_fp8(tensor / scale)
// GEMM 后: output = output * scale_A * scale_B
```

### 测试验收

- [ ] 精度:

| Quantization | Perplexity (WikiText-2) | vs BF16 |
|-------------|------------------------|---------|
| BF16 (baseline) | X.XX | — |
| FP8 E4M3 | X.XX | +0.XX |
| INT8 weight-only | X.XX | +0.XX |

- [ ] 显存: FP8 权重占用约 BF16 的一半 (~8 GB for 8B model)
- [ ] 性能: FP8 GEMM throughput vs BF16 GEMM

---

## Phase 19: Multimodal — 里程碑 ⑥

**Crate**: `xserv-model` (新增 vision encoder)

**目标**: 支持 vision-language 模型，接受图片+文本输入。

### 目标模型

Qwen-VL 系列（或类似 architecture）。典型结构:
```
Image → ViT Encoder → Visual Tokens → Projector → LLM Input
Text  → Tokenizer   → Text Tokens   ────────────→ LLM Input
```

### 需要新增的组件

#### 1. ViT (Vision Transformer) Encoder
- **Patch Embedding**: 将图片切成 14×14 patches, 每个 patch 线性投影
  - 输入: [B, 3, H, W] → 输出: [B, num_patches, hidden_dim]
- **ViT Blocks**: 标准 Transformer Encoder blocks
  - Multi-Head Self-Attention (无 causal mask)
  - FFN
  - LayerNorm
- 参数量: 通常 300M-600M

#### 2. Visual-Language Projector
- MLP: 将 ViT 输出维度映射到 LLM embedding 维度
- 可能包含 pooling / resampling (减少 visual token 数量)

#### 3. 图片预处理
- Resize to target resolution (e.g., 448×448)
- Normalize: ImageNet mean/std
- 使用 `image` crate 在 CPU 上完成

#### 4. 输入拼接
```
[<image_token>] × num_visual_tokens + [text tokens]
```
- 在 embedding 层面拼接
- LLM 处理混合 sequence

### API 扩展

```json
{
  "model": "qwen-vl",
  "messages": [{
    "role": "user",
    "content": [
      {"type": "image_url", "image_url": {"url": "data:image/png;base64,..."}},
      {"type": "text", "text": "What is in this image?"}
    ]
  }]
}
```

### 外部依赖

- `image` crate (图片加载和预处理)
- `base64` crate (解码 base64 图片)

### 测试验收

- [ ] 加载 Qwen-VL 模型，输入一张猫的图片 + "What is in this image?"
- [ ] 生成合理的图片描述
- [ ] 与 HF transformers 输出对比
- [ ] API 端到端: HTTP POST 含 base64 图片 → streaming 文字回答
- [ ] 纯文本请求不受影响（向后兼容）

---

## 实际进展记录(与原计划的分叉,2026-06 更新)

Phase 0–17 按计划完成。Phase 18 起实际路线偏离了上面的原计划
(speculative decoding 与多模态推迟),实际走向是 MoE + 量化 + 稀疏化:

| 实际 Phase | 内容 | 文档 |
|---|---|---|
| 18 | Pipeline Parallelism(PP=2/4) | `18-pipeline-parallelism.md`、`benchmarks/pp-sweep.md` |
| 19 | **gpt-oss-20b MoE**:harmony 格式、attention sinks + sliding window、YaRN;两个 CUDA bug 实战(prefill sinks NaN、GEMV 未初始化 smem);GSM8K 94.5% 对齐 llama.cpp;FP8 W8A8 / MXFP4 W4A16 量化 | `19-gpt-oss-moe.md`、`benchmarks/{fp8-quantization,mxfp4-and-llama-decode}.md` |
| 20 | **稀疏 top-k MoE decode**:只算被路由的专家,decode 13.9→7.0ms,TP=2 下 decode/TTFT 全面快于 llama.cpp 同配置;gpt-oss 单卡 serving | `20-sparse-moe.md`、`benchmarks/sparse-moe.md` |
| 21 | **decode CUDA Graph + GPU argmax**:整个 decode step 录成一个图回放(thread-local launch stream、retained-warmup 分配策略、NCCL capture);greedy 采样换 GPU argmax。TPOT 7.5→5.9ms(TP=1)/ 5.8ms(TP=2);TP=2 全面领先 llama(1.26-1.47×),TP=1 差距 2.5×→2.0× | `21-cuda-graph-decode.md` |

**下一步候选(按预期收益排序):**

| 候选 Phase | 内容 | 预期 |
|---|---|---|
| 22 | **非专家权重量化**:qkv/o + lm_head(1.16GB/token)仍是 BF16 | TPOT 再省 ~1.5ms |
| 23 | **稀疏 prefill**(按专家 permute + grouped GEMM) | 长 prompt TTFT 51-75 → ~30ms |
| 24 | server 侧 harmony channel 分离(`reasoning_content` 流式输出,对齐 llama-server 行为) | API 易用性 |
| — | Speculative Decoding、多模态(原 16/19) | 推迟 |

## 里程碑总结

| 里程碑 | Phase | 验收标准 |
|--------|-------|---------|
| ① GPT-2 推理 | 8 | CLI 输入 prompt, GPT-2 生成连贯文本, logits 与 PyTorch 一致 |
| ② Qwen3-8B 推理 | 10 | 8B 模型中英文对话, 多轮 chat template 正确 |
| ③ E2E API | 13 | HTTP streaming API, Python OpenAI SDK 可调用, 10 并发正确 |
| ④ 性能达标 | 15 | throughput >= 50% vLLM, profiling 报告完成 |
| ⑤ 多卡推理 | 17 | TP=2/4 同组 GPU 推理正确, scaling benchmark 完成 |
| ⑥ MoE 模型(实际) | 19 | gpt-oss-20b 端到端正确, GSM8K 与 llama.cpp 持平 ✅ |
| ⑦ 性能反超(实际) | 20 | 同配置 decode 快于 llama.cpp(TP=2 达成;单卡是 Phase 21+ 目标) ✅ |
| ⑧ 多模态 | 推迟 | 图片输入 → 文字回答, API 端到端 |

## 外部依赖清单

| Crate | 用途 | 引入 Phase |
|-------|------|-----------|
| `cc` | build.rs 编译 .cu 文件 | 0 |
| `half` | f16 / bf16 Rust 类型 | 2 |
| `smallvec` | Tensor shape / strides (栈分配) | 2 |
| `safetensors` | 权重文件解析 | 6 |
| `serde` + `serde_json` | JSON 序列化 | 6 |
| `memmap2` | 文件 mmap (safetensors 可能内置) | 6 |
| `regex` | BPE pre-tokenization | 7 |
| `rand` | Sampling (随机数) | 8 |
| `tokio` | Async runtime | 13 |
| `axum` | HTTP server | 13 |
| `criterion` | Benchmark framework | 3+ |
| `image` | 图片加载 (multimodal) | 19 |
| `base64` | Base64 decode (multimodal API) | 19 |

**不使用**: `candle`, `burn`, `tch`, `tokenizers`, `cudarc` — 核心组件全部自己实现。