docs: split Phase 12 and Phase 13 into separate design documents
- docs/12-continuous-batching.md: scheduler, sequence management, batching strategy (currently single-request, expandable) - docs/13-http-api.md: HTTP server, OpenAI-compatible API, axum architecture, SSE streaming (TODO) Phase 12 = WHAT to compute (scheduling decisions) Phase 13 = HOW to expose it (HTTP protocol layer) Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
This commit is contained in:
@@ -1,98 +0,0 @@
|
||||
# Phase 12+13: Continuous Batching + HTTP API — Design Document (Milestone ③)
|
||||
|
||||
## Goal
|
||||
|
||||
实现 HTTP serving 层:接收请求、调度执行、streaming 返回结果。OpenAI 兼容 API。
|
||||
|
||||
由于当前是单请求引擎(无 multi-GPU、无并发),Phase 12 (continuous batching) 和 Phase 13 (HTTP API) 合并实现:先实现单请求 serving,scheduler 作为 placeholder 留待后续扩展。
|
||||
|
||||
## Architecture
|
||||
|
||||
```
|
||||
Client (curl / OpenAI SDK)
|
||||
│
|
||||
▼ HTTP POST /v1/chat/completions
|
||||
┌─────────────────────────────────────┐
|
||||
│ xserv-api (axum) │
|
||||
│ - Parse request │
|
||||
│ - Apply chat template │
|
||||
│ - Submit to engine via channel │
|
||||
│ - Stream SSE chunks from channel │
|
||||
└────────────┬────────────────────────┘
|
||||
│ InferenceRequest → mpsc channel
|
||||
▼
|
||||
┌─────────────────────────────────────┐
|
||||
│ xserv-engine (dedicated thread) │
|
||||
│ - Receive requests │
|
||||
│ - Run model forward (prefill+decode)│
|
||||
│ - Send tokens back via channel │
|
||||
└─────────────────────────────────────┘
|
||||
```
|
||||
|
||||
## Crates
|
||||
|
||||
- `xserv-engine`: inference orchestration (model + cache + generate loop)
|
||||
- `xserv-api`: HTTP server with axum
|
||||
|
||||
Both in one binary: `xserv-server`
|
||||
|
||||
## API Endpoints
|
||||
|
||||
```
|
||||
POST /v1/chat/completions # main endpoint
|
||||
GET /v1/models # list models
|
||||
GET /health # health check
|
||||
```
|
||||
|
||||
## Request/Response (OpenAI compatible)
|
||||
|
||||
Request:
|
||||
```json
|
||||
{
|
||||
"model": "qwen3-8b",
|
||||
"messages": [{"role": "user", "content": "Hello"}],
|
||||
"stream": true,
|
||||
"max_tokens": 256,
|
||||
"temperature": 1.0
|
||||
}
|
||||
```
|
||||
|
||||
SSE Response:
|
||||
```
|
||||
data: {"id":"...","choices":[{"delta":{"content":"Hi"},"index":0}]}
|
||||
|
||||
data: {"id":"...","choices":[{"delta":{},"finish_reason":"stop"}]}
|
||||
|
||||
data: [DONE]
|
||||
```
|
||||
|
||||
## Engine Design
|
||||
|
||||
```rust
|
||||
pub struct Engine {
|
||||
model: Qwen3,
|
||||
config: ModelConfig,
|
||||
tokenizer: Tokenizer,
|
||||
}
|
||||
|
||||
impl Engine {
|
||||
pub fn generate(&self, prompt_tokens: &[u32], params: &SamplingParams,
|
||||
sender: mpsc::Sender<Token>) { ... }
|
||||
}
|
||||
```
|
||||
|
||||
Engine runs on a dedicated OS thread (avoids async/GPU conflicts).
|
||||
API handlers communicate via `tokio::sync::mpsc` channels.
|
||||
|
||||
## Sampling
|
||||
|
||||
For this phase: greedy only (temperature=0 or 1 with argmax).
|
||||
Top-k/top-p sampling added later.
|
||||
|
||||
## Test Plan
|
||||
|
||||
- [ ] curl streaming request → get SSE chunks
|
||||
- [ ] Python OpenAI SDK client works
|
||||
- [ ] /v1/models returns model info
|
||||
- [ ] /health returns 200
|
||||
- [ ] Multiple sequential requests work
|
||||
101
docs/12-continuous-batching.md
Normal file
101
docs/12-continuous-batching.md
Normal file
@@ -0,0 +1,101 @@
|
||||
# Phase 12: Continuous Batching + Request Scheduler — Design Document
|
||||
|
||||
## Goal
|
||||
|
||||
实现 iteration-level 请求调度器,支持多请求并发执行和动态 batch 管理。这是 LLM serving 系统的核心调度逻辑。
|
||||
|
||||
## 核心概念
|
||||
|
||||
### Static Batching vs Continuous Batching
|
||||
|
||||
**Static(朴素)**:
|
||||
```
|
||||
Batch 1: [req1, req2, req3] → 等所有完成才开始下一批
|
||||
问题: req1 10 token 就完了,req3 要 200 token → req1 的 slot 空转
|
||||
```
|
||||
|
||||
**Continuous(本阶段目标)**:
|
||||
```
|
||||
Iteration 1: [req1, req2, req3] → req1 完成! slot 释放
|
||||
Iteration 2: [req2, req3, req4] → req4 立即填入
|
||||
每一个 iteration(一次 forward pass)重新决定哪些请求参与
|
||||
```
|
||||
|
||||
## 核心组件
|
||||
|
||||
### Sequence
|
||||
|
||||
```rust
|
||||
pub struct Sequence {
|
||||
pub id: SeqId,
|
||||
pub prompt_tokens: Vec<u32>,
|
||||
pub generated_tokens: Vec<u32>,
|
||||
pub status: SequenceStatus,
|
||||
pub sampling_params: SamplingParams,
|
||||
pub kv_cache_handle: KVCacheHandle, // 该 seq 的 KV cache 资源
|
||||
pub arrival_time: Instant,
|
||||
pub output_sender: tokio::sync::mpsc::Sender<GenerateEvent>,
|
||||
}
|
||||
|
||||
pub enum SequenceStatus {
|
||||
Waiting, // 等待调度
|
||||
Prefilling, // 正在 prefill
|
||||
Decoding, // 正在逐 token decode
|
||||
Finished, // 完成 (EOS / max_len)
|
||||
}
|
||||
```
|
||||
|
||||
### Scheduler
|
||||
|
||||
```rust
|
||||
pub struct Scheduler {
|
||||
waiting: VecDeque<Sequence>, // 等待队列
|
||||
running: Vec<Sequence>, // 正在执行
|
||||
max_batch_size: usize, // 最大并发数
|
||||
block_manager: BlockManager, // KV cache 资源管理
|
||||
}
|
||||
```
|
||||
|
||||
### 调度循环
|
||||
|
||||
```rust
|
||||
loop {
|
||||
// 1. 回收已完成的 sequence,释放 KV cache
|
||||
// 2. 从 waiting 中 admit 新请求(如果有空位+显存)
|
||||
// 3. 对 running 中的所有 seq 做一步 forward
|
||||
// - 新加入的做 prefill
|
||||
// - 已在运行的做 decode
|
||||
// 4. 对每个 seq 的 logits 做 sampling
|
||||
// 5. 发送新 token / 完成信号
|
||||
}
|
||||
```
|
||||
|
||||
## 当前状态 (Phase 12 初版)
|
||||
|
||||
当前实现是 **单请求顺序执行**(max_batch_size=1),是 continuous batching 的退化形式:
|
||||
- 一次只处理一个请求
|
||||
- 完成后才接受下一个
|
||||
- 无 preemption、无 batching
|
||||
|
||||
这是合理的起步——先跑通单请求 E2E,后续扩展为真正的并发 batching。
|
||||
|
||||
## 后续扩展 (Phase 15+)
|
||||
|
||||
1. **多请求 batch forward**: 将多个 seq 的 token 拼接为一个 batch 输入
|
||||
2. **Prefill-Decode 分离**: prefill (compute-bound) 和 decode (memory-bound) 分开调度
|
||||
3. **Preemption**: 显存不足时暂停低优先级 seq
|
||||
4. **动态 batch size**: 根据 KV cache 使用量调整
|
||||
|
||||
## Test Plan
|
||||
|
||||
- [x] 单请求 E2E: 提交请求 → 收到 token 流 → 完成信号
|
||||
- [ ] (后续) 多请求并发: 提交多个请求,验证都能正确完成
|
||||
- [ ] (后续) 短请求完成后新请求立即加入
|
||||
|
||||
## Takeaways
|
||||
|
||||
1. **单请求是 continuous batching 的特殊情况 (batch_size=1)**:当前实现的 engine 循环已经是正确的调度结构——receive request → prefill → decode loop → done → next request。扩展为多请求只需在 decode loop 中处理多个 sequence。
|
||||
|
||||
2. **Engine 在独立 OS thread 上跑是正确的设计**:GPU 操作是同步阻塞的(cudaDeviceSynchronize),如果放在 tokio runtime 中会 block 整个 async runtime。独立线程 + channel 通信是标准模式。
|
||||
|
||||
3. **std::sync::mpsc::SyncSender(capacity=1) 实现了天然的背压**:当 engine 忙时,新请求会 block 在 channel send 上,不会积压。
|
||||
133
docs/13-http-api.md
Normal file
133
docs/13-http-api.md
Normal file
@@ -0,0 +1,133 @@
|
||||
# Phase 13: HTTP API + Streaming — Design Document (Milestone ③)
|
||||
|
||||
## Goal
|
||||
|
||||
提供 OpenAI 兼容的 HTTP API,让 xserv 可以作为一个 serving 后端被任何 OpenAI SDK 调用。
|
||||
|
||||
## 职责划分
|
||||
|
||||
| 组件 | 职责 |
|
||||
|------|------|
|
||||
| Phase 12 (Scheduler/Engine) | 模型推理 + 请求调度 + token 生成循环 |
|
||||
| **Phase 13 (HTTP API)** | HTTP 请求解析 → 内部格式 → 提交给 engine → 从 channel 接收 token → 编码为 HTTP 响应 |
|
||||
|
||||
Phase 13 不关心模型如何推理,只负责 HTTP 协议层。
|
||||
|
||||
## 技术栈
|
||||
|
||||
- **HTTP framework**: axum 0.8
|
||||
- **Async runtime**: tokio
|
||||
- **Serialization**: serde_json
|
||||
- **Channel**: tokio::sync::mpsc (API ↔ Engine)
|
||||
|
||||
## API 端点
|
||||
|
||||
```
|
||||
GET /health → "ok"
|
||||
GET /v1/models → {"data": [{"id": "qwen3-8b", ...}]}
|
||||
POST /v1/chat/completions → JSON response (non-streaming)
|
||||
POST /v1/chat/completions → SSE stream (streaming, TODO)
|
||||
```
|
||||
|
||||
## Architecture
|
||||
|
||||
```
|
||||
Client
|
||||
│ HTTP POST /v1/chat/completions
|
||||
▼
|
||||
┌──────────────────────────────┐
|
||||
│ axum handler │
|
||||
│ 1. Deserialize ChatRequest │
|
||||
│ 2. Build prompt text │
|
||||
│ 3. Tokenize (Mutex<Tokenizer>)│
|
||||
│ 4. Create mpsc channel │
|
||||
│ 5. Submit GenerateRequest │
|
||||
│ 6. await tokens from rx │
|
||||
│ 7. Build JSON response │
|
||||
└──────────────────────────────┘
|
||||
│ GenerateRequest via SyncSender
|
||||
▼
|
||||
┌──────────────────────────────┐
|
||||
│ Engine thread (Phase 12) │
|
||||
│ - recv() request │
|
||||
│ - model.forward_gpu_cache() │
|
||||
│ - blocking_send() tokens │
|
||||
└──────────────────────────────┘
|
||||
```
|
||||
|
||||
## OpenAI 兼容格式
|
||||
|
||||
### Request
|
||||
```json
|
||||
{
|
||||
"model": "qwen3-8b",
|
||||
"messages": [
|
||||
{"role": "system", "content": "You are helpful."},
|
||||
{"role": "user", "content": "Hello"}
|
||||
],
|
||||
"max_tokens": 256,
|
||||
"stream": false
|
||||
}
|
||||
```
|
||||
|
||||
### Response (non-streaming)
|
||||
```json
|
||||
{
|
||||
"id": "chatcmpl-xxx",
|
||||
"object": "chat.completion",
|
||||
"created": 1234567890,
|
||||
"model": "qwen3-8b",
|
||||
"choices": [{
|
||||
"index": 0,
|
||||
"message": {"role": "assistant", "content": "Hi there!"},
|
||||
"finish_reason": "stop"
|
||||
}],
|
||||
"usage": {"prompt_tokens": 5, "completion_tokens": 3, "total_tokens": 8}
|
||||
}
|
||||
```
|
||||
|
||||
### SSE Streaming (TODO)
|
||||
```
|
||||
data: {"choices":[{"delta":{"content":"Hi"}}]}
|
||||
|
||||
data: {"choices":[{"delta":{},"finish_reason":"stop"}]}
|
||||
|
||||
data: [DONE]
|
||||
```
|
||||
|
||||
## 当前实现状态
|
||||
|
||||
- [x] `/health` — 健康检查
|
||||
- [x] `/v1/models` — 模型列表
|
||||
- [x] `/v1/chat/completions` (non-streaming) — JSON response
|
||||
- [ ] `/v1/chat/completions` (streaming) — SSE
|
||||
- [ ] 完整的 `usage` 统计 (token 计数)
|
||||
- [ ] 错误处理 (400 for bad request, etc.)
|
||||
- [ ] 多轮对话 chat template
|
||||
|
||||
## Key Design Decisions
|
||||
|
||||
1. **Extension vs State**: 用 `axum::Extension<Arc<AppState>>` 而不是 `Router::with_state`,因为 `SyncSender` 不是 `Sync`(需要 Mutex 包装)。
|
||||
|
||||
2. **Engine 在独立 thread**: GPU 同步操作 block 线程,不能放在 tokio runtime 中。
|
||||
|
||||
3. **tokio::sync::mpsc 做 token 传输**: Engine (std thread) 用 `blocking_send()`,API (async) 用 `.recv().await`。跨 async/sync 边界通信。
|
||||
|
||||
## Test Plan
|
||||
|
||||
- [x] curl /health → "ok"
|
||||
- [x] curl /v1/models → JSON model list
|
||||
- [x] curl /v1/chat/completions → JSON with generated text
|
||||
- [ ] Python OpenAI SDK 兼容性测试
|
||||
- [ ] SSE streaming 测试
|
||||
- [ ] 多轮对话测试
|
||||
|
||||
## Takeaways
|
||||
|
||||
1. **axum 0.8 的 Handler trait 对 Send 很严格**:async fn 返回的 Future 必须是 Send。`std::sync::MutexGuard` 不是 Send,必须确保它不活过 await point(用 scope 或显式 drop)。
|
||||
|
||||
2. **std::sync::mpsc::SyncSender 不是 Sync**:不能直接放在 `Arc<T>` 中被多个 async task 共享。解决方案:`Mutex<SyncSender>` 或换用 `tokio::sync::mpsc::Sender`(是 Sync 的)。
|
||||
|
||||
3. **非 streaming 更简单,先跑通再加 SSE**:SSE streaming 涉及 `Stream` trait、lifetime 问题和复杂的类型推导。先用 collect-all-then-respond 跑通 E2E,streaming 作为增量优化。
|
||||
|
||||
4. **Engine 加载时间 ~20s(Qwen3-8B)**:需要在 server 启动后等 engine ready 才接受请求,否则请求会 hang 在 channel send 上。当前靠 sync_channel(1) 的背压天然处理。
|
||||
Reference in New Issue
Block a user