# AGENTS.md

## Environment

Use `uv` to manage the Python environment. Add dependencies with `uv add` so that `uv sync` reproduces exactly the same runnable environment every time.

## Goal

Build a minimal prototype on top of **SGLang xPyD** to test whether **session-aware / KV-cache-aware P/D routing** can improve **end-to-end latency** for agentic coding workloads.

Current setup:
- SGLang: `v0.5.10`
- Model: `Qwen3-Coder-30B-A3B-Instruct` (`~/models/Qwen/Qwen3-Coder-30B-A3B-Instruct`)
- xPyD runs on this single 8-GPU node, so the current constraint is **$x + y \le 8$** (`x` prefill instances plus `y` decode instances)
- Even in local experiments, preserve the semantics of the **P -> D RDMA-style data path** as far as possible: treat a local run as a loopback-based stand-in for the real transfer, and do not collapse P/D into a special in-process shortcut
- Traces:
  - Ali coding agent (`~/ali-trace/trace-qwen3-coder-formatted/041715-041717.jsonl`)

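To make the $x + y \le 8$ constraint concrete, the following is a minimal sketch that enumerates the candidate xPyD layouts for this node. It assumes one GPU per instance and at least one P and one D instance; `valid_xpyd_configs` and `TOTAL_GPUS` are illustrative names, not part of SGLang.

```python
from itertools import product

TOTAL_GPUS = 8  # single 8-GPU node, as stated in the setup above


def valid_xpyd_configs(total_gpus: int = TOTAL_GPUS) -> list[tuple[int, int]]:
    """Return every (x, y) with x >= 1 prefill, y >= 1 decode and x + y <= total_gpus.

    Assumption: each instance occupies exactly one GPU.
    """
    return [
        (x, y)
        for x, y in product(range(1, total_gpus), repeat=2)
        if x + y <= total_gpus
    ]


configs = valid_xpyd_configs()
```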
---

## MVP Scope

We only do the following:

1. Run **SGLang xPyD** correctly on one machine
2. Add a **baseline router**
   - `turn1`: default routing
   - `turn2+`: prefer previous `D` node for the same session
3. Add a **KV-cache-aware routing** policy
4. Replay traces and compare policies with the same evaluation pipeline

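The baseline router in item 2 could look roughly like the sketch below. It is a hedged illustration, not the project's implementation: `StickyDecodeRouter` is a hypothetical name, and round-robin stands in for "default routing" because the actual default is not specified here.

```python
import itertools


class StickyDecodeRouter:
    """Hypothetical baseline: round-robin on turn 1, previous D node afterwards."""

    def __init__(self, prefill_nodes: list[str], decode_nodes: list[str]) -> None:
        # Round-robin is an assumed stand-in for the default routing.
        self._prefill_cycle = itertools.cycle(prefill_nodes)
        self._decode_cycle = itertools.cycle(decode_nodes)
        self._last_decode: dict[str, str] = {}  # session id -> last D node used

    def route(self, session_id: str) -> tuple[str, str]:
        """Return the (P node, D node) pair for the next turn of a session."""
        prefill = next(self._prefill_cycle)
        decode = self._last_decode.get(session_id)
        if decode is None:  # first turn of this session
            decode = next(self._decode_cycle)
        self._last_decode[session_id] = decode
        return prefill, decode


router = StickyDecodeRouter(["p0", "p1"], ["d0", "d1"])
first = router.route("s1")
other = router.route("s2")
second = router.route("s1")
```

This sketch always reuses the previous `D` node. Since the rule says "prefer", a load check that overrides stickiness would be a natural next step.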
Out of scope for now:
- autoscaling
- fault tolerance
- large-scale cluster scheduler
- production hardening
- general multi-tenant serving

---

## What matters

Primary metric:
- **E2E latency**

Secondary metrics:
- TTFT (time to first token)
- TPOT (time per output token)
- KV transfer volume
- cache hit / reuse
- re-prefill count
- per-node load

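As a sketch of how the three latency metrics relate, the class below derives them from per-request timestamps. The field names are placeholders, and TPOT is taken as the mean gap between output tokens after the first, which is one common definition. Under that definition E2E = TTFT + TPOT × (output tokens − 1), so a lower TTFT does not by itself guarantee a lower E2E.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RequestTiming:
    """Client-side timestamps in seconds (hypothetical field names)."""

    sent_at: float
    first_token_at: float
    last_token_at: float
    output_tokens: int

    @property
    def e2e(self) -> float:
        # End-to-end latency: request sent until the last token arrives.
        return self.last_token_at - self.sent_at

    @property
    def ttft(self) -> float:
        # Time to first token.
        return self.first_token_at - self.sent_at

    @property
    def tpot(self) -> float:
        # Mean time per output token after the first; 0.0 for single-token outputs.
        if self.output_tokens <= 1:
            return 0.0
        return (self.last_token_at - self.first_token_at) / (self.output_tokens - 1)


t = RequestTiming(sent_at=0.0, first_token_at=0.5, last_token_at=2.5, output_tokens=11)
```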

Do not optimize TTFT alone if E2E does not improve.

---

## Development Order

Implement in this order:

1. **Bring up xPyD**
2. **Add trace replay + metrics logging**
3. **Implement sticky-to-D baseline**
4. **Implement KV-cache-aware routing**
5. **Analyze gains and failure cases**

Do not skip step 2: without trace replay and metrics logging, the policies from steps 3 and 4 cannot be compared.

---

## Core Rules

### 1. Keep policy separate from mechanism
- mechanism = how requests / KV / xPyD work
- policy = how we choose `P` and `D`

Do not mix them unless necessary.

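One possible way to keep that boundary visible in code is a small policy interface that the mechanism calls into. Everything named here (`RoutingPolicy`, `FirstNodePolicy`, `dispatch`) is hypothetical and only illustrates the split; a real mechanism would forward the request to SGLang where this sketch returns a record.

```python
from typing import Protocol, Sequence


class RoutingPolicy(Protocol):
    """Policy side: decides which P and D node serve a request."""

    def choose(
        self, session_id: str, prefill: Sequence[str], decode: Sequence[str]
    ) -> tuple[str, str]:
        ...


class FirstNodePolicy:
    """Trivial placeholder policy: always pick the first P and D node."""

    def choose(
        self, session_id: str, prefill: Sequence[str], decode: Sequence[str]
    ) -> tuple[str, str]:
        return prefill[0], decode[0]


def dispatch(
    policy: RoutingPolicy,
    session_id: str,
    prefill: Sequence[str],
    decode: Sequence[str],
) -> dict[str, str]:
    """Mechanism side: ask the policy for a decision, then act on it.

    This sketch only returns the decision instead of sending a request.
    """
    p_node, d_node = policy.choose(session_id, prefill, decode)
    return {"session_id": session_id, "prefill": p_node, "decode": d_node}


decision = dispatch(FirstNodePolicy(), "s1", ["p0", "p1"], ["d0"])
```

Swapping policies then means passing a different object to `dispatch`, with no change to the mechanism code.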
### 2. Prefer simple, debuggable logic
Start with simple heuristics before complex scoring.

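As an illustration of a "simple heuristic" in this sense, the sketch below picks the `D` node with the longest cached prefix and breaks ties by load. The inputs are assumptions: how cached prefix lengths and active request counts would be obtained from SGLang is not covered here, and `pick_decode_node` is a hypothetical name.

```python
def pick_decode_node(
    cached_prefix_tokens: dict[str, int],
    active_requests: dict[str, int],
) -> str:
    """Pick the D node with the longest cached prefix for this session.

    Ties are broken by the fewest active requests, then by node name so the
    result is deterministic and easy to debug.
    """
    return min(
        cached_prefix_tokens,
        key=lambda node: (
            -cached_prefix_tokens[node],
            active_requests.get(node, 0),
            node,
        ),
    )


best = pick_decode_node(
    {"d0": 100, "d1": 400, "d2": 400},
    {"d0": 0, "d1": 3, "d2": 1},
)
```

A single lexicographic key like this can be logged alongside the decision, which makes each routing choice straightforward to explain after the fact.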
### 3. Log everything needed to explain results
Each request should log:
- request id
- session id
- turn id
- assigned P node
- assigned D node
- latency (E2E, plus TTFT / TPOT where available)
- whether reuse was expected / observed

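A per-request record covering those fields could be sketched as a dataclass serialized to one JSON line. The field names and the JSONL output format are assumptions chosen for illustration, not an existing schema.

```python
import json
from dataclasses import asdict, dataclass


@dataclass
class RequestLog:
    """One log record per request (hypothetical field names)."""

    request_id: str
    session_id: str
    turn_id: int
    prefill_node: str
    decode_node: str
    latency_s: float  # E2E latency in seconds
    reuse_expected: bool
    reuse_observed: bool

    def to_jsonl(self) -> str:
        # One JSON object per line; sorted keys keep diffs stable.
        return json.dumps(asdict(self), sort_keys=True)


line = RequestLog("r1", "s1", 2, "p0", "d1", 1.25, True, False).to_jsonl()
```

Logging expected and observed reuse as separate fields makes mismatches easy to filter when analyzing failure cases.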
### 4. Small interfaces only
Avoid over-abstraction.