Gahow Wang
d8493bd70f
phase 12: implement real continuous batching scheduler
Rewrote engine.rs from scratch:
- Scheduler loop: admit → prefill → decode → finish → check new requests
- Multiple sequences run concurrently (max_batch_size configurable)
- Each sequence has independent GpuKVCache
- Non-blocking try_recv() for new requests during decode iterations
- Dynamic join: new requests enter the batch immediately instead of waiting for running sequences to finish
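The loop described above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the actual contents of engine.rs: the `Request`, `KvCache`, `Sequence`, and `Scheduler` types and the `step`, `poll_requests`, and `run_demo` functions are hypothetical, `KvCache` is a CPU stand-in for the per-sequence GpuKVCache, and the "decode" step emits a placeholder token instead of running a model.

```rust
use std::collections::VecDeque;
use std::sync::mpsc::{channel, Receiver, TryRecvError};

/// Hypothetical request: prompt tokens plus a generation budget.
struct Request { id: u32, prompt: Vec<u32>, max_new_tokens: usize }

/// Stand-in for the per-sequence GPU KV cache; only tracks cached positions.
struct KvCache { len: usize }

struct Sequence { id: u32, cache: KvCache, generated: Vec<u32>, max_new_tokens: usize }

struct Scheduler {
    rx: Receiver<Request>,
    waiting: VecDeque<Request>,
    running: Vec<Sequence>,
    max_batch_size: usize,
    closed: bool,
}

impl Scheduler {
    fn new(rx: Receiver<Request>, max_batch_size: usize) -> Self {
        Scheduler { rx, waiting: VecDeque::new(), running: Vec::new(), max_batch_size, closed: false }
    }

    /// Drain pending requests without blocking the decode loop.
    fn poll_requests(&mut self) {
        loop {
            match self.rx.try_recv() {
                Ok(req) => self.waiting.push_back(req),
                Err(TryRecvError::Empty) => break,
                Err(TryRecvError::Disconnected) => { self.closed = true; break; }
            }
        }
    }

    /// One iteration: check new requests, admit + prefill, decode, finish.
    /// Returns the ids of sequences that finished in this iteration.
    fn step(&mut self) -> Vec<u32> {
        self.poll_requests();
        // Admit: fill free batch slots; "prefill" just sizes the cache to the prompt.
        while self.running.len() < self.max_batch_size {
            let Some(req) = self.waiting.pop_front() else { break };
            self.running.push(Sequence {
                id: req.id,
                cache: KvCache { len: req.prompt.len() },
                generated: Vec::new(),
                max_new_tokens: req.max_new_tokens,
            });
        }
        // Decode: one token per running sequence (placeholder, no real model).
        for seq in &mut self.running {
            seq.generated.push(seq.cache.len as u32);
            seq.cache.len += 1;
        }
        // Finish: drop completed sequences so their slots free up next iteration.
        let mut finished = Vec::new();
        self.running.retain(|s| {
            let done = s.generated.len() >= s.max_new_tokens;
            if done { finished.push(s.id); }
            !done
        });
        finished
    }

    fn is_idle(&self) -> bool {
        self.closed && self.waiting.is_empty() && self.running.is_empty()
    }
}

/// Runs three requests through a batch of size 2; request 3 arrives mid-run.
fn run_demo() -> Vec<Vec<u32>> {
    let (tx, rx) = channel();
    let mut sched = Scheduler::new(rx, 2);
    tx.send(Request { id: 1, prompt: vec![1, 2, 3], max_new_tokens: 3 }).unwrap();
    tx.send(Request { id: 2, prompt: vec![4], max_new_tokens: 1 }).unwrap();
    let mut finished_per_step = vec![sched.step()];
    // Request 3 arrives while request 1 is still decoding and takes the freed slot.
    tx.send(Request { id: 3, prompt: vec![5, 6], max_new_tokens: 1 }).unwrap();
    drop(tx);
    while !sched.is_idle() {
        finished_per_step.push(sched.step());
    }
    finished_per_step
}

fn main() {
    println!("{:?}", run_demo());
}
```

In this sketch the late request joins on the very next iteration because `try_recv()` never blocks, which is the property the "dynamic join" bullet describes.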
Verified with concurrent test (tools/test_concurrent.py):
- 3 concurrent requests: wall_time=3.8s, concurrency_ratio=2.82x ✓
- 5 concurrent requests: wall_time=6.1s, concurrency_ratio=4.04x ✓
- All outputs are coherent and correct
Design doc (docs/12-continuous-batching.md) fully rewritten with:
- Detailed scheduler loop pseudocode
- Data structures (Sequence, Scheduler)
- Acceptance criteria with specific test cases
- Clear separation from Phase 13 (HTTP layer)
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-05-22 13:44:26 +08:00