docs: document project design and status
94
AGENTS.md
Normal file
@@ -0,0 +1,94 @@
# AGENTS.md

## Environment

Use `uv` to manage the Python environment. Add dependencies with `uv add` so that `uv sync` reproduces exactly the same runnable environment every time.

## Goal

Build a minimal prototype on top of **SGLang xPyD** to test whether **session-aware / KV-cache-aware P/D routing** can improve **end-to-end latency** for agentic coding workloads.

Current setup:

- SGLang: `v0.5.10`
- Model: `Qwen3-Coder-30B-A3B-Instruct` (`~/models/Qwen/Qwen3-Coder-30B-A3B-Instruct`)
- xPyD runs on this single 8-GPU node, so the current constraint is **$x + y \le 8$** ($x$ prefill workers, $y$ decode workers)
- Even in local experiments, the implementation should preserve the **P -> D RDMA-style data path** semantics as much as possible. Local runs should treat this as a loopback-based stand-in rather than collapsing P/D into a special in-process shortcut.
- Traces:
  - Ali coding agent (`~/ali-trace/trace-qwen3-coder-formatted/041715-041717.jsonl`)
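
To make the $x + y \le 8$ constraint concrete, the sketch below enumerates the candidate layouts. It assumes one GPU per worker and at least one worker of each kind, neither of which is stated above; `valid_xpyd_configs` is a hypothetical helper, not part of SGLang.

```python
from itertools import product

NUM_GPUS = 8  # single 8-GPU node, from the setup above


def valid_xpyd_configs(num_gpus: int = NUM_GPUS) -> list[tuple[int, int]]:
    """Return every (x, y) with x >= 1, y >= 1 and x + y <= num_gpus.

    Assumption: each prefill or decode worker occupies exactly one GPU.
    """
    return [
        (x, y)
        for x, y in product(range(1, num_gpus), repeat=2)
        if x + y <= num_gpus
    ]


configs = valid_xpyd_configs()
```

If a worker needs more than one GPU (for example with tensor parallelism), the bound tightens accordingly and the list above no longer applies as-is.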

---

## MVP Scope

We only do the following:

1. Run **SGLang xPyD** correctly on one machine
2. Add a **baseline router**
   - `turn1`: default routing
   - `turn2+`: prefer previous `D` node for the same session
3. Add a **KV-cache-aware routing** policy
4. Replay traces and compare policies with the same evaluation pipeline
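
The baseline router in item 2 can be sketched as follows. This is a minimal illustration, not the project's implementation: round-robin stands in for "default routing" (an assumption), and `StickyDecodeRouter` and the node names are hypothetical.

```python
from itertools import cycle


class StickyDecodeRouter:
    """turn1: default routing; turn2+: reuse the session's previous D node."""

    def __init__(self, decode_nodes: list[str]) -> None:
        # Assumption: round-robin is used as a stand-in for "default routing".
        self._default = cycle(decode_nodes)
        # session id -> D node used on the previous turn
        self._last_d: dict[str, str] = {}

    def pick_decode(self, session_id: str) -> str:
        node = self._last_d.get(session_id)
        if node is None:
            # First turn of this session: fall back to default routing.
            node = next(self._default)
        self._last_d[session_id] = node
        return node


router = StickyDecodeRouter(["d0", "d1"])
first = router.pick_decode("session-a")
again = router.pick_decode("session-a")  # same node as `first`
```

The sketch only covers the choice of `D`; the choice of `P` is left to default routing.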

Out of scope for now:

- autoscaling
- fault tolerance
- large-scale cluster scheduler
- production hardening
- general multi-tenant serving

---

## What matters

Primary metric:

- **E2E latency**

Secondary metrics:

- TTFT (time to first token)
- TPOT (time per output token)
- KV transfer volume
- cache hit / reuse
- re-prefill count
- per-node load

Do not optimize TTFT alone if E2E does not improve.
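
As a rough illustration of how the three latency metrics relate, the sketch below derives them from client-side timestamps. The `RequestTiming` fields are hypothetical names, and TPOT is computed with one common convention (decode time averaged over the tokens after the first); the project may define it differently.

```python
from dataclasses import dataclass


@dataclass
class RequestTiming:
    t_send: float         # client send time, in seconds
    t_first_token: float  # arrival time of the first output token
    t_last_token: float   # arrival time of the last output token
    output_tokens: int    # number of generated tokens


def latency_metrics(t: RequestTiming) -> dict[str, float]:
    e2e = t.t_last_token - t.t_send
    ttft = t.t_first_token - t.t_send
    # Average decode time over the tokens that follow the first one;
    # guard against single-token outputs to avoid dividing by zero.
    decode_tokens = max(t.output_tokens - 1, 1)
    tpot = (t.t_last_token - t.t_first_token) / decode_tokens
    return {"e2e": e2e, "ttft": ttft, "tpot": tpot}


metrics = latency_metrics(RequestTiming(0.0, 0.5, 2.5, 11))
```

Under this convention E2E is TTFT plus the decode time, which is why a lower TTFT does not help if the decode phase gets slower by the same amount.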

---

## Development Order

Implement in this order:

1. **Bring up xPyD**
2. **Add trace replay + metrics logging**
3. **Implement sticky-to-D baseline**
4. **Implement KV-cache-aware routing**
5. **Analyze gains and failure cases**

Do not skip step 2.
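
For step 2, one possible first piece of the replay path is grouping trace records into sessions and numbering their turns. The trace schema is not described here, so the field names `session_id` and `timestamp` are hypothetical placeholders and must be adapted to the actual JSONL format.

```python
import json
from collections import defaultdict
from typing import Iterable


def group_turns(lines: Iterable[str]) -> dict[str, list[dict]]:
    """Group JSONL records by session and attach a 1-based turn id.

    Assumption: each record has "session_id" and "timestamp" keys.
    """
    sessions: dict[str, list[dict]] = defaultdict(list)
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        rec = json.loads(line)
        sessions[rec["session_id"]].append(rec)
    for turns in sessions.values():
        # Order turns in time, then number them within the session.
        turns.sort(key=lambda r: r["timestamp"])
        for i, rec in enumerate(turns, start=1):
            rec["turn_id"] = i
    return dict(sessions)
```

A replay driver could then open the trace file and pass the file object directly as `lines`.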

---

## Core Rules

### 1. Keep policy separate from mechanism
- mechanism = how requests / KV / xPyD work
- policy = how we choose `P` and `D`

Do not mix them unless necessary.
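
One way to keep this boundary visible in code is a single small policy interface that the mechanism calls. This is a sketch under assumptions: `RoutingPolicy`, `RouteDecision`, `FirstNodePolicy`, and `dispatch` are hypothetical names, not SGLang APIs.

```python
from dataclasses import dataclass
from typing import Protocol


@dataclass(frozen=True)
class RouteDecision:
    p_node: str
    d_node: str


class RoutingPolicy(Protocol):
    """Policy side: decides which P and D to use, nothing else."""

    def route(self, session_id: str, turn_id: int) -> RouteDecision: ...


class FirstNodePolicy:
    """Trivial placeholder policy: always pick the first P and D node."""

    def __init__(self, p_nodes: list[str], d_nodes: list[str]) -> None:
        self.p_nodes = p_nodes
        self.d_nodes = d_nodes

    def route(self, session_id: str, turn_id: int) -> RouteDecision:
        return RouteDecision(self.p_nodes[0], self.d_nodes[0])


def dispatch(policy: RoutingPolicy, session_id: str, turn_id: int) -> RouteDecision:
    # Mechanism side: in a real system this is where the request would be
    # sent to the chosen P node and its KV handed to the chosen D node.
    return policy.route(session_id, turn_id)


decision = dispatch(FirstNodePolicy(["p0"], ["d0", "d1"]), "session-a", 1)
```

Swapping policies then means passing a different object to `dispatch`, with no change on the mechanism side.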

### 2. Prefer simple, debuggable logic
Start with simple heuristics before complex scoring.
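
As an example of what a simple heuristic could look like, the sketch below picks the `D` node whose cached tokens share the longest prefix with the new prompt and breaks ties by lower load. The `cached` and `load` tables are assumed to be router-side bookkeeping; this is not how SGLang tracks its cache, and the function names are hypothetical.

```python
def common_prefix_len(a: list[int], b: list[int]) -> int:
    """Length of the shared leading run of two token lists."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n


def pick_by_prefix(
    prompt_tokens: list[int],
    cached: dict[str, list[int]],  # node -> tokens the router believes are cached
    load: dict[str, int],          # node -> in-flight requests
) -> str:
    # Longest shared prefix wins; among equals, the less loaded node wins.
    return max(
        cached,
        key=lambda node: (
            common_prefix_len(prompt_tokens, cached[node]),
            -load.get(node, 0),
        ),
    )


choice = pick_by_prefix([1, 2, 3, 4], {"d0": [1, 2, 3], "d1": [1, 2, 9]}, {})
```

Because the decision is a single comparison of two numbers per node, each routing choice can be explained directly from the logged prefix length and load.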

### 3. Log everything needed to explain results
Each request should log:
- request id
- session id
- turn id
- assigned P node
- assigned D node
- latency
- whether reuse was expected / observed
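
The fields above map naturally onto one flat record per request. The following is a minimal sketch; the field names, the seconds unit, and the JSON Lines output are assumptions rather than a fixed schema.

```python
import json
from dataclasses import asdict, dataclass


@dataclass
class RequestLog:
    request_id: str
    session_id: str
    turn_id: int
    p_node: str
    d_node: str
    latency_s: float       # assumed unit: seconds
    reuse_expected: bool   # did the policy expect KV reuse?
    reuse_observed: bool   # was reuse actually observed?


def to_jsonl(rec: RequestLog) -> str:
    """Serialize one record as a single JSON line."""
    return json.dumps(asdict(rec), sort_keys=True)


line = to_jsonl(RequestLog("r1", "s1", 2, "p0", "d1", 1.25, True, False))
```

Keeping expected and observed reuse as separate fields makes it possible to count the cases where the policy's prediction was wrong.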

### 4. Small interfaces only
Avoid over-abstraction.