docs: split Phase 12 and Phase 13 into separate design documents

- docs/12-continuous-batching.md: scheduler, sequence management, batching strategy (currently single-request, expandable) - docs/13-http-api.md: HTTP server, OpenAI-compatible API, axum architecture, SSE streaming (TODO) Phase 12 = WHAT to compute (scheduling decisions) Phase 13 = HOW to expose it (HTTP protocol layer) Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-05-22 13:15:27 +08:00
parent da043554ba
commit 7d05ececa0
3 changed files with 234 additions and 98 deletions
--- a/docs/12-13-serving.md
+++ b/docs/12-13-serving.md
@@ -1,98 +0,0 @@
-# Phase 12+13: Continuous Batching + HTTP API — Design Document (Milestone ③)
-
-## Goal
-
-实现 HTTP serving 层：接收请求、调度执行、streaming 返回结果。OpenAI 兼容 API。
-
-由于当前是单请求引擎（无 multi-GPU、无并发），Phase 12 (continuous batching) 和 Phase 13 (HTTP API) 合并实现：先实现单请求 serving，scheduler 作为 placeholder 留待后续扩展。
-
-## Architecture
-
-```
-Client (curl / OpenAI SDK)
-    │
-    ▼  HTTP POST /v1/chat/completions
-┌─────────────────────────────────────┐
-│  xserv-api (axum)                   │
-│  - Parse request                    │
-│  - Apply chat template              │
-│  - Submit to engine via channel     │
-│  - Stream SSE chunks from channel   │
-└────────────┬────────────────────────┘
-             │  InferenceRequest → mpsc channel
-             ▼
-┌─────────────────────────────────────┐
-│  xserv-engine (dedicated thread)    │
-│  - Receive requests                 │
-│  - Run model forward (prefill+decode)│
-│  - Send tokens back via channel     │
-└─────────────────────────────────────┘
-```
-
-## Crates
-
- `xserv-engine`: inference orchestration (model + cache + generate loop)
- `xserv-api`: HTTP server with axum
-
-Both in one binary: `xserv-server`
-
-## API Endpoints
-
-```
-POST /v1/chat/completions   # main endpoint
-GET  /v1/models             # list models
-GET  /health                # health check
-```
-
-## Request/Response (OpenAI compatible)
-
-Request:
-```json
-{
-  "model": "qwen3-8b",
-  "messages": [{"role": "user", "content": "Hello"}],
-  "stream": true,
-  "max_tokens": 256,
-  "temperature": 1.0
-}
-```
-
-SSE Response:
-```
-data: {"id":"...","choices":[{"delta":{"content":"Hi"},"index":0}]}
-
-data: {"id":"...","choices":[{"delta":{},"finish_reason":"stop"}]}
-
-data: [DONE]
-```
-
-## Engine Design
-
-```rust
-pub struct Engine {
-    model: Qwen3,
-    config: ModelConfig,
-    tokenizer: Tokenizer,
-}
-
-impl Engine {
-    pub fn generate(&self, prompt_tokens: &[u32], params: &SamplingParams,
-                    sender: mpsc::Sender<Token>) { ... }
-}
-```
-
-Engine runs on a dedicated OS thread (avoids async/GPU conflicts).
-API handlers communicate via `tokio::sync::mpsc` channels.
-
-## Sampling
-
-For this phase: greedy only (temperature=0 or 1 with argmax).
-Top-k/top-p sampling added later.
-
-## Test Plan
-
- [ ] curl streaming request → get SSE chunks
- [ ] Python OpenAI SDK client works
- [ ] /v1/models returns model info
- [ ] /health returns 200
- [ ] Multiple sequential requests work
--- a/docs/12-continuous-batching.md
+++ b/docs/12-continuous-batching.md
@@ -0,0 +1,101 @@
+# Phase 12: Continuous Batching + Request Scheduler — Design Document
+
+## Goal
+
+实现 iteration-level 请求调度器，支持多请求并发执行和动态 batch 管理。这是 LLM serving 系统的核心调度逻辑。
+
+## 核心概念
+
+### Static Batching vs Continuous Batching
+
+**Static（朴素）**:
+```
+Batch 1: [req1, req2, req3] → 等所有完成才开始下一批
+问题: req1 10 token 就完了，req3 要 200 token → req1 的 slot 空转
+```
+
+**Continuous（本阶段目标）**:
+```
+Iteration 1: [req1, req2, req3]  → req1 完成! slot 释放
+Iteration 2: [req2, req3, req4]  → req4 立即填入
+每一个 iteration（一次 forward pass）重新决定哪些请求参与
+```
+
+## 核心组件
+
+### Sequence
+
+```rust
+pub struct Sequence {
+    pub id: SeqId,
+    pub prompt_tokens: Vec<u32>,
+    pub generated_tokens: Vec<u32>,
+    pub status: SequenceStatus,
+    pub sampling_params: SamplingParams,
+    pub kv_cache_handle: KVCacheHandle,  // 该 seq 的 KV cache 资源
+    pub arrival_time: Instant,
+    pub output_sender: tokio::sync::mpsc::Sender<GenerateEvent>,
+}
+
+pub enum SequenceStatus {
+    Waiting,       // 等待调度
+    Prefilling,    // 正在 prefill
+    Decoding,      // 正在逐 token decode
+    Finished,      // 完成 (EOS / max_len)
+}
+```
+
+### Scheduler
+
+```rust
+pub struct Scheduler {
+    waiting: VecDeque<Sequence>,      // 等待队列
+    running: Vec<Sequence>,           // 正在执行
+    max_batch_size: usize,            // 最大并发数
+    block_manager: BlockManager,      // KV cache 资源管理
+}
+```
+
+### 调度循环
+
+```rust
+loop {
+    // 1. 回收已完成的 sequence，释放 KV cache
+    // 2. 从 waiting 中 admit 新请求（如果有空位+显存）
+    // 3. 对 running 中的所有 seq 做一步 forward
+    //    - 新加入的做 prefill
+    //    - 已在运行的做 decode
+    // 4. 对每个 seq 的 logits 做 sampling
+    // 5. 发送新 token / 完成信号
+}
+```
+
+## 当前状态 (Phase 12 初版)
+
+当前实现是 **单请求顺序执行**（max_batch_size=1），是 continuous batching 的退化形式：
+- 一次只处理一个请求
+- 完成后才接受下一个
+- 无 preemption、无 batching
+
+这是合理的起步——先跑通单请求 E2E，后续扩展为真正的并发 batching。
+
+## 后续扩展 (Phase 15+)
+
+1. **多请求 batch forward**: 将多个 seq 的 token 拼接为一个 batch 输入
+2. **Prefill-Decode 分离**: prefill (compute-bound) 和 decode (memory-bound) 分开调度
+3. **Preemption**: 显存不足时暂停低优先级 seq
+4. **动态 batch size**: 根据 KV cache 使用量调整
+
+## Test Plan
+
+- [x] 单请求 E2E: 提交请求 → 收到 token 流 → 完成信号
+- [ ] (后续) 多请求并发: 提交多个请求，验证都能正确完成
+- [ ] (后续) 短请求完成后新请求立即加入
+
+## Takeaways
+
+1. **单请求是 continuous batching 的特殊情况 (batch_size=1)**：当前实现的 engine 循环已经是正确的调度结构——receive request → prefill → decode loop → done → next request。扩展为多请求只需在 decode loop 中处理多个 sequence。
+
+2. **Engine 在独立 OS thread 上跑是正确的设计**：GPU 操作是同步阻塞的（cudaDeviceSynchronize），如果放在 tokio runtime 中会 block 整个 async runtime。独立线程 + channel 通信是标准模式。
+
+3. **std::sync::mpsc::SyncSender(capacity=1) 实现了天然的背压**：当 engine 忙时，新请求会 block 在 channel send 上，不会积压。
--- a/docs/13-http-api.md
+++ b/docs/13-http-api.md
@@ -0,0 +1,133 @@
+# Phase 13: HTTP API + Streaming — Design Document (Milestone ③)
+
+## Goal
+
+提供 OpenAI 兼容的 HTTP API，让 xserv 可以作为一个 serving 后端被任何 OpenAI SDK 调用。
+
+## 职责划分
+
+| 组件 | 职责 |
+|------|------|
+| Phase 12 (Scheduler/Engine) | 模型推理 + 请求调度 + token 生成循环 |
+| **Phase 13 (HTTP API)** | HTTP 请求解析 → 内部格式 → 提交给 engine → 从 channel 接收 token → 编码为 HTTP 响应 |
+
+Phase 13 不关心模型如何推理，只负责 HTTP 协议层。
+
+## 技术栈
+
+- **HTTP framework**: axum 0.8
+- **Async runtime**: tokio
+- **Serialization**: serde_json
+- **Channel**: tokio::sync::mpsc (API ↔ Engine)
+
+## API 端点
+
+```
+GET  /health                    → "ok"
+GET  /v1/models                 → {"data": [{"id": "qwen3-8b", ...}]}
+POST /v1/chat/completions       → JSON response (non-streaming)
+POST /v1/chat/completions       → SSE stream (streaming, TODO)
+```
+
+## Architecture
+
+```
+Client
+  │ HTTP POST /v1/chat/completions
+  ▼
+┌──────────────────────────────┐
+│  axum handler                │
+│  1. Deserialize ChatRequest  │
+│  2. Build prompt text        │
+│  3. Tokenize (Mutex<Tokenizer>)│
+│  4. Create mpsc channel      │
+│  5. Submit GenerateRequest   │
+│  6. await tokens from rx     │
+│  7. Build JSON response      │
+└──────────────────────────────┘
+  │ GenerateRequest via SyncSender
+  ▼
+┌──────────────────────────────┐
+│  Engine thread (Phase 12)    │
+│  - recv() request            │
+│  - model.forward_gpu_cache() │
+│  - blocking_send() tokens    │
+└──────────────────────────────┘
+```
+
+## OpenAI 兼容格式
+
+### Request
+```json
+{
+  "model": "qwen3-8b",
+  "messages": [
+    {"role": "system", "content": "You are helpful."},
+    {"role": "user", "content": "Hello"}
+  ],
+  "max_tokens": 256,
+  "stream": false
+}
+```
+
+### Response (non-streaming)
+```json
+{
+  "id": "chatcmpl-xxx",
+  "object": "chat.completion",
+  "created": 1234567890,
+  "model": "qwen3-8b",
+  "choices": [{
+    "index": 0,
+    "message": {"role": "assistant", "content": "Hi there!"},
+    "finish_reason": "stop"
+  }],
+  "usage": {"prompt_tokens": 5, "completion_tokens": 3, "total_tokens": 8}
+}
+```
+
+### SSE Streaming (TODO)
+```
+data: {"choices":[{"delta":{"content":"Hi"}}]}
+
+data: {"choices":[{"delta":{},"finish_reason":"stop"}]}
+
+data: [DONE]
+```
+
+## 当前实现状态
+
+- [x] `/health` — 健康检查
+- [x] `/v1/models` — 模型列表
+- [x] `/v1/chat/completions` (non-streaming) — JSON response
+- [ ] `/v1/chat/completions` (streaming) — SSE
+- [ ] 完整的 `usage` 统计 (token 计数)
+- [ ] 错误处理 (400 for bad request, etc.)
+- [ ] 多轮对话 chat template
+
+## Key Design Decisions
+
+1. **Extension vs State**: 用 `axum::Extension<Arc<AppState>>` 而不是 `Router::with_state`，因为 `SyncSender` 不是 `Sync`（需要 Mutex 包装）。
+
+2. **Engine 在独立 thread**: GPU 同步操作 block 线程，不能放在 tokio runtime 中。
+
+3. **tokio::sync::mpsc 做 token 传输**: Engine (std thread) 用 `blocking_send()`，API (async) 用 `.recv().await`。跨 async/sync 边界通信。
+
+## Test Plan
+
+- [x] curl /health → "ok"
+- [x] curl /v1/models → JSON model list
+- [x] curl /v1/chat/completions → JSON with generated text
+- [ ] Python OpenAI SDK 兼容性测试
+- [ ] SSE streaming 测试
+- [ ] 多轮对话测试
+
+## Takeaways
+
+1. **axum 0.8 的 Handler trait 对 Send 很严格**：async fn 返回的 Future 必须是 Send。`std::sync::MutexGuard` 不是 Send，必须确保它不活过 await point（用 scope 或显式 drop）。
+
+2. **std::sync::mpsc::SyncSender 不是 Sync**：不能直接放在 `Arc<T>` 中被多个 async task 共享。解决方案：`Mutex<SyncSender>` 或换用 `tokio::sync::mpsc::Sender`（是 Sync 的）。
+
+3. **非 streaming 更简单，先跑通再加 SSE**：SSE streaming 涉及 `Stream` trait、lifetime 问题和复杂的类型推导。先用 collect-all-then-respond 跑通 E2E，streaming 作为增量优化。
+
+4. **Engine 加载时间 ~20s（Qwen3-8B）**：需要在 server 启动后等 engine ready 才接受请求，否则请求会 hang 在 channel send 上。当前靠 sync_channel(1) 的背压天然处理。