
Gahow Wang da043554ba phase 12+13: HTTP API server with OpenAI-compatible endpoint (Milestone ③)

New crate: xserv-server
- Engine thread: loads Qwen3-8B, processes requests sequentially
- axum HTTP server: /health, /v1/models, /v1/chat/completions
- tokio::sync::mpsc channel between API and engine threads
- Non-streaming JSON response (streaming SSE to be added later)

API is OpenAI-compatible:
  POST /v1/chat/completions {"messages": [...], "max_tokens": N}
  → {"choices": [{"message": {"content": "..."}}]}

Verified: "Hi" → ", I'm" (3 tokens), model runs correctly via HTTP.

Key learnings:
- std::sync::mpsc::SyncSender is Send but NOT Sync → wrap in Mutex for Arc<AppState>
- MutexGuard must not live across await points (scope carefully)
- axum 0.8 Extension<Arc<T>> requires T: Send + Sync

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

2026-05-22 12:55:19 +08:00

2.8 KiB


Phase 12+13: Continuous Batching + HTTP API — Design Document (Milestone ③)

Goal

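Implement the HTTP serving layer: accept requests, schedule their execution, and stream the results back. The API is OpenAI-compatible.

Because the engine currently handles a single request at a time (no multi-GPU, no concurrency), Phase 12 (continuous batching) and Phase 13 (HTTP API) are implemented together: single-request serving comes first, and the scheduler is left as a placeholder for later extension.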

Architecture

Client (curl / OpenAI SDK)
    │
    ▼  HTTP POST /v1/chat/completions
┌─────────────────────────────────────┐
│  xserv-api (axum)                   │
│  - Parse request                    │
│  - Apply chat template              │
│  - Submit to engine via channel     │
│  - Stream SSE chunks from channel   │
└────────────┬────────────────────────┘
             │  InferenceRequest → mpsc channel
             ▼
┌─────────────────────────────────────┐
│  xserv-engine (dedicated thread)    │
│  - Receive requests                 │
│  - Run forward (prefill + decode)   │
│  - Send tokens back via channel     │
└─────────────────────────────────────┘
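
The "Apply chat template" step in the diagram can be sketched as follows. This is a minimal illustration that assumes the plain ChatML layout used by Qwen chat models (`<|im_start|>role ... <|im_end|>`); the template shipped with the Qwen3 tokenizer also covers system prompts, tool calls and thinking tags, which are left out here. `ChatMessage` and `apply_chat_template` are hypothetical names, not taken from the xserv code.

```rust
/// One chat message, mirroring an entry of the request's `messages` array.
/// (Hypothetical type for this sketch.)
pub struct ChatMessage {
    pub role: String,
    pub content: String,
}

/// Render the messages in ChatML form and open an assistant turn so the
/// model continues from there. Assumption: plain ChatML, no system prompt.
pub fn apply_chat_template(messages: &[ChatMessage]) -> String {
    let mut prompt = String::new();
    for m in messages {
        prompt.push_str("<|im_start|>");
        prompt.push_str(&m.role);
        prompt.push('\n');
        prompt.push_str(&m.content);
        prompt.push_str("<|im_end|>\n");
    }
    // Generation prompt: the reply is decoded after this header.
    prompt.push_str("<|im_start|>assistant\n");
    prompt
}
```

The resulting string is what would be tokenized and handed to the engine as `prompt_tokens`.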

Crates

xserv-engine: inference orchestration (model + cache + generate loop)
xserv-api: HTTP server with axum

Both in one binary: xserv-server

API Endpoints

POST /v1/chat/completions   # main endpoint
GET  /v1/models             # list models
GET  /health                # health check

Request/Response (OpenAI compatible)

Request:

{
  "model": "qwen3-8b",
  "messages": [{"role": "user", "content": "Hello"}],
  "stream": true,
  "max_tokens": 256,
  "temperature": 1.0
}

SSE Response:

data: {"id":"...","choices":[{"delta":{"content":"Hi"},"index":0}]}

data: {"id":"...","choices":[{"delta":{},"finish_reason":"stop"}]}

data: [DONE]
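
For reference, the SSE framing above can be produced with a few string helpers. This is a std-only sketch: each event is a `data: ` line followed by a blank line, and the JSON is assembled by hand to keep the example free of dependencies. In the real server a JSON library and axum's `Sse` response type would likely handle serialization and framing instead. The names `json_escape`, `sse_delta`, `sse_finish` and `SSE_DONE` are hypothetical.

```rust
/// Escape a string for embedding inside a JSON string literal.
pub fn json_escape(s: &str) -> String {
    let mut out = String::with_capacity(s.len());
    for c in s.chars() {
        match c {
            '"' => out.push_str("\\\""),
            '\\' => out.push_str("\\\\"),
            '\n' => out.push_str("\\n"),
            '\r' => out.push_str("\\r"),
            '\t' => out.push_str("\\t"),
            // Remaining control characters use the \u00XX form.
            c if (c as u32) < 0x20 => out.push_str(&format!("\\u{:04x}", c as u32)),
            c => out.push(c),
        }
    }
    out
}

/// One streamed chunk carrying a piece of generated text.
pub fn sse_delta(id: &str, content: &str) -> String {
    let json = format!(
        r#"{{"id":"{}","choices":[{{"delta":{{"content":"{}"}},"index":0}}]}}"#,
        json_escape(id),
        json_escape(content)
    );
    format!("data: {}\n\n", json)
}

/// Final chunk: empty delta plus the finish reason.
pub fn sse_finish(id: &str) -> String {
    let json = format!(
        r#"{{"id":"{}","choices":[{{"delta":{{}},"finish_reason":"stop"}}]}}"#,
        json_escape(id)
    );
    format!("data: {}\n\n", json)
}

/// Stream terminator expected by OpenAI-style clients.
pub const SSE_DONE: &str = "data: [DONE]\n\n";
```

A handler would emit `sse_delta` once per decoded text piece, then `sse_finish`, then `SSE_DONE`.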

Engine Design

pub struct Engine {
    model: Qwen3,
    config: ModelConfig,
    tokenizer: Tokenizer,
}

impl Engine {
    pub fn generate(&self, prompt_tokens: &[u32], params: &SamplingParams,
                    sender: mpsc::Sender<Token>) { ... }
}

The engine runs on a dedicated OS thread so that blocking GPU work never stalls the async runtime. API handlers communicate with it over tokio::sync::mpsc channels.
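
The thread-plus-channel arrangement can be illustrated with a std-only sketch. It uses `std::sync::mpsc` as a stand-in so the example has no dependencies; with `tokio::sync::mpsc` the engine thread would call `blocking_recv` and `blocking_send` instead. The fields of `InferenceRequest`, the `spawn_engine` function and the `next_token` closure (standing in for the model forward pass plus sampling) are assumptions made for this illustration.

```rust
use std::sync::mpsc;
use std::thread;

/// Work item sent from an API handler to the engine thread.
/// (Field layout is an assumption for this sketch.)
pub struct InferenceRequest {
    pub prompt_tokens: Vec<u32>,
    pub max_tokens: usize,
    /// Per-request channel on which generated tokens are streamed back.
    pub reply: mpsc::Sender<u32>,
}

/// Spawn the engine on its own OS thread and return the submission handle.
/// `next_token` stands in for one decode step of the real model.
pub fn spawn_engine<F>(mut next_token: F) -> mpsc::Sender<InferenceRequest>
where
    F: FnMut(&[u32]) -> u32 + Send + 'static,
{
    let (tx, rx) = mpsc::channel::<InferenceRequest>();
    thread::spawn(move || {
        // Requests are processed one at a time, in arrival order.
        // The loop ends once every submission handle has been dropped.
        for req in rx {
            let mut context = req.prompt_tokens;
            for _ in 0..req.max_tokens {
                let tok = next_token(&context);
                context.push(tok);
                // A send error means the receiver is gone; stop generating.
                if req.reply.send(tok).is_err() {
                    break;
                }
            }
            // `req.reply` is dropped here, which closes this request's stream.
        }
    });
    tx
}
```

Because each request carries its own reply sender, the handler sees the end of generation as the channel closing and needs no separate "done" message.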

Sampling

For this phase, decoding is greedy only: the next token is always the argmax of the logits, whether the request sets temperature to 0 or 1. Top-k/top-p sampling will be added later.
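
Greedy selection reduces to an argmax over the logits of the last position. A minimal sketch, assuming the logits are available on the host as an `f32` slice (the function name and return type are illustrative, and NaN logits are not treated specially):

```rust
/// Index of the largest logit, i.e. the greedy next-token choice.
/// Returns `None` for an empty slice; ties resolve to the lowest index.
pub fn argmax(logits: &[f32]) -> Option<u32> {
    let mut best: Option<(usize, f32)> = None;
    for (i, &x) in logits.iter().enumerate() {
        let is_better = match best {
            None => true,
            Some((_, b)) => x > b,
        };
        if is_better {
            best = Some((i, x));
        }
    }
    best.map(|(i, _)| i as u32)
}
```

Dividing the logits by a positive temperature does not change which index is largest, which is why the temperature value has no effect under greedy decoding.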

Test Plan

curl streaming request → get SSE chunks
Python OpenAI SDK client works
/v1/models returns model info
/health returns 200
Multiple sequential requests work
