New crate: xserv-server
- Engine thread: loads Qwen3-8B, processes requests sequentially
- axum HTTP server: /health, /v1/models, /v1/chat/completions
- tokio::sync::mpsc channel between API and engine threads
- Non-streaming JSON response (streaming SSE to be added later)
API is OpenAI-compatible:
POST /v1/chat/completions {"messages": [...], "max_tokens": N}
→ {"choices": [{"message": {"content": "..."}}]}
Verified: "Hi" → ", I'm" (3 tokens), model runs correctly via HTTP.
Key learnings:
- std::sync::mpsc::SyncSender is Send but NOT Sync → wrap in Mutex for Arc<AppState>
- MutexGuard must not live across await points (scope carefully)
- axum 0.8 Extension<Arc<T>> requires T: Send + Sync
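
The Mutex wrapping and guard scoping described above can be sketched with the standard library alone. This is a minimal illustration, not the code in this commit: `AppState`, `InferenceRequest`, `sender` and `submit` are hypothetical names, and the real handlers are async, which is where the guard scoping matters.

```rust
use std::sync::mpsc::SyncSender;
use std::sync::{Arc, Mutex};

// Hypothetical request type; the real one would carry messages and sampling parameters.
pub struct InferenceRequest {
    pub prompt: String,
}

// Hypothetical shared state, held by handlers as Arc<AppState>.
pub struct AppState {
    engine_tx: Mutex<SyncSender<InferenceRequest>>,
}

impl AppState {
    pub fn new(tx: SyncSender<InferenceRequest>) -> Self {
        Self { engine_tx: Mutex::new(tx) }
    }

    // Clone the sender while holding the lock, then return. The MutexGuard is
    // dropped at the end of this function, so the caller never holds it across
    // later work (in an async handler, across an .await).
    pub fn sender(&self) -> SyncSender<InferenceRequest> {
        let guard = self.engine_tx.lock().expect("engine_tx mutex poisoned");
        (*guard).clone()
    }
}

// Compile-time check that the state satisfies the Send + Sync bound.
fn assert_send_sync<T: Send + Sync>() {}

// Returns false if the engine side of the channel has been dropped.
pub fn submit(state: &Arc<AppState>, prompt: &str) -> bool {
    assert_send_sync::<AppState>();
    let tx = state.sender(); // the guard is already released here
    tx.send(InferenceRequest { prompt: prompt.to_string() }).is_ok()
}
```

Cloning the sender out of the lock keeps the critical section to a single clone, so no guard is alive while the request is being sent.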
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Phase 12+13: Continuous Batching + HTTP API — Design Document (Milestone ③)
Goal
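Implement the HTTP serving layer: accept requests, schedule execution, and stream results back. The API is OpenAI-compatible.
Because the current engine handles a single request at a time (no multi-GPU, no concurrency), Phase 12 (continuous batching) and Phase 13 (HTTP API) are implemented together: single-request serving comes first, and the scheduler is left as a placeholder for later extension.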
Architecture
Client (curl / OpenAI SDK)
             │
             ▼ HTTP POST /v1/chat/completions
┌─────────────────────────────────────┐
│ xserv-api (axum)                    │
│ - Parse request                     │
│ - Apply chat template               │
│ - Submit to engine via channel      │
│ - Stream SSE chunks from channel    │
└────────────┬────────────────────────┘
             │ InferenceRequest → mpsc channel
             ▼
┌─────────────────────────────────────┐
│ xserv-engine (dedicated thread)     │
│ - Receive requests                  │
│ - Run model forward (prefill+decode)│
│ - Send tokens back via channel      │
└─────────────────────────────────────┘
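
The request flow in the diagram can be sketched as follows. This is a simplified stand-in under stated assumptions: it uses `std::sync::mpsc` instead of `tokio::sync::mpsc`, the "model" just echoes the words of the prompt, and the field names on `InferenceRequest` as well as `spawn_engine` and `complete` are hypothetical.

```rust
use std::sync::mpsc::{channel, sync_channel, Receiver, Sender, SyncSender};
use std::thread;

// Hypothetical message type; field names are illustrative.
pub struct InferenceRequest {
    pub prompt: String,
    pub max_tokens: usize,
    // Per-request channel the engine uses to send generated pieces back.
    pub reply: Sender<String>,
}

// Spawn the engine on its own OS thread. Requests are handled one at a time,
// in arrival order. The "model" is a stand-in that echoes the prompt's words.
pub fn spawn_engine(
    queue_depth: usize,
) -> (SyncSender<InferenceRequest>, thread::JoinHandle<()>) {
    let (tx, rx): (SyncSender<InferenceRequest>, Receiver<InferenceRequest>) =
        sync_channel(queue_depth);
    let handle = thread::spawn(move || {
        // recv() returns Err once every sender is dropped, which ends the loop.
        while let Ok(req) = rx.recv() {
            for word in req.prompt.split_whitespace().take(req.max_tokens) {
                // A failed send means the client went away; stop generating.
                if req.reply.send(word.to_string()).is_err() {
                    break;
                }
            }
            // `req` is dropped here, which closes the reply channel and
            // signals the end of the stream for this request.
        }
    });
    (tx, handle)
}

// API-side helper: submit one request and collect the whole reply.
pub fn complete(
    engine: &SyncSender<InferenceRequest>,
    prompt: &str,
    max_tokens: usize,
) -> Vec<String> {
    let (reply, tokens) = channel();
    engine
        .send(InferenceRequest { prompt: prompt.to_string(), max_tokens, reply })
        .expect("engine thread has exited");
    // Iteration ends when the engine drops its end of the reply channel.
    tokens.iter().collect()
}
```

A streaming handler would forward each received piece as it arrives instead of collecting them, but the channel topology stays the same.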
Crates
- xserv-engine: inference orchestration (model + cache + generate loop)
- xserv-api: HTTP server with axum
Both in one binary: xserv-server
API Endpoints
POST /v1/chat/completions # main endpoint
GET /v1/models # list models
GET /health # health check
Request/Response (OpenAI compatible)
Request:
{
"model": "qwen3-8b",
"messages": [{"role": "user", "content": "Hello"}],
"stream": true,
"max_tokens": 256,
"temperature": 1.0
}
SSE Response:
data: {"id":"...","choices":[{"delta":{"content":"Hi"},"index":0}]}
data: {"id":"...","choices":[{"delta":{},"finish_reason":"stop"}]}
data: [DONE]
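
The SSE framing above can be sketched with plain string formatting. This is an illustration only: the function names are hypothetical, and a real implementation would more likely build the JSON with a serialization library. Each event is a `data:` line terminated by a blank line.

```rust
// Escape a string for inclusion inside a JSON string literal.
pub fn json_escape(s: &str) -> String {
    let mut out = String::with_capacity(s.len());
    for c in s.chars() {
        match c {
            '"' => out.push_str("\\\""),
            '\\' => out.push_str("\\\\"),
            '\n' => out.push_str("\\n"),
            '\r' => out.push_str("\\r"),
            '\t' => out.push_str("\\t"),
            // Remaining control characters use the \u00XX form.
            c if (c as u32) < 0x20 => out.push_str(&format!("\\u{:04x}", c as u32)),
            c => out.push(c),
        }
    }
    out
}

// One content chunk, matching the first SSE line shown above.
pub fn sse_delta(id: &str, content: &str) -> String {
    format!(
        "data: {{\"id\":\"{}\",\"choices\":[{{\"delta\":{{\"content\":\"{}\"}},\"index\":0}}]}}\n\n",
        json_escape(id),
        json_escape(content)
    )
}

// The final chunk: empty delta plus finish_reason.
pub fn sse_finish(id: &str) -> String {
    format!(
        "data: {{\"id\":\"{}\",\"choices\":[{{\"delta\":{{}},\"finish_reason\":\"stop\"}}]}}\n\n",
        json_escape(id)
    )
}

// The sentinel that ends the stream.
pub fn sse_done() -> &'static str {
    "data: [DONE]\n\n"
}
```

Escaping the content matters because generated text routinely contains quotes and newlines, and a raw newline inside a `data:` line would split the event.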
Engine Design
pub struct Engine {
    model: Qwen3,
    config: ModelConfig,
    tokenizer: Tokenizer,
}
impl Engine {
    pub fn generate(&self, prompt_tokens: &[u32], params: &SamplingParams,
                    sender: mpsc::Sender<Token>) { ... }
}
Engine runs on a dedicated OS thread (avoids async/GPU conflicts).
API handlers communicate via tokio::sync::mpsc channels.
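
A generate loop with the shape of `Engine::generate` might look like the sketch below. It is written under several assumptions: the `NextToken` trait is a hypothetical stand-in for the model, the `eos` parameter and the `max_tokens` field are illustrative, a `std::sync::mpsc::Sender` replaces the tokio one, and the loop recomputes from the full token history instead of using a prefill step and a KV cache.

```rust
use std::sync::mpsc::Sender;

pub type Token = u32;

// Hypothetical sampling parameters; only the token budget is modeled here.
pub struct SamplingParams {
    pub max_tokens: usize,
}

// Hypothetical stand-in for the model: picks the next token from the history.
pub trait NextToken {
    fn next_token(&self, tokens: &[Token]) -> Token;
}

// Generate until the budget is spent, EOS is produced, or the receiver is gone.
// Returns the number of tokens sent.
pub fn generate<M: NextToken>(
    model: &M,
    prompt_tokens: &[Token],
    params: &SamplingParams,
    eos: Token,
    sender: Sender<Token>,
) -> usize {
    let mut tokens = prompt_tokens.to_vec();
    let mut produced = 0;
    while produced < params.max_tokens {
        let next = model.next_token(&tokens);
        if next == eos {
            break; // end of sequence: do not forward the EOS token
        }
        if sender.send(next).is_err() {
            break; // the API side dropped the receiver (client disconnected)
        }
        tokens.push(next);
        produced += 1;
    }
    produced
    // `sender` is dropped on return, which closes the stream for the receiver.
}
```

Taking the sender by value means the stream closes as soon as generation returns, so the API side needs no separate completion signal.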
Sampling
For this phase, decoding is greedy only: the next token is always the argmax of the logits, whether the request sets temperature to 0 or 1. Top-k/top-p sampling will be added later.
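
Greedy selection reduces to an argmax over the logits. The sketch below is a minimal version with a hypothetical function name; how ties and NaN values are handled is a choice made for this example, not something the design specifies.

```rust
// Index of the largest logit. Ties resolve to the lowest index, NaN values are
// never preferred over a number, and an empty slice yields None.
pub fn argmax(logits: &[f32]) -> Option<usize> {
    let mut best: Option<(usize, f32)> = None;
    for (i, &x) in logits.iter().enumerate() {
        match best {
            // Keep the current best unless x is strictly larger.
            Some((_, b)) if x <= b || x.is_nan() => {}
            _ => best = Some((i, x)),
        }
    }
    best.map(|(i, _)| i)
}
```

Dividing every logit by the same positive temperature preserves their order, so with argmax selection the chosen token does not depend on the temperature value.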
Test Plan
- curl streaming request → get SSE chunks
- Python OpenAI SDK client works
- /v1/models returns model info
- /health returns 200
- Multiple sequential requests work