docs: document project design and status

2026-04-24 12:17:55 +00:00
parent 4bca741f32
commit 78f0d15221
3 changed files with 375 additions and 0 deletions

94
AGENTS.md Normal file

@@ -0,0 +1,94 @@
# AGENTS.md
## Environment
Use `uv` to manage the Python environment. Add dependencies with `uv add` so that `uv sync` reproduces exactly the same runnable environment each time.
## Goal
Build a minimal prototype on top of **SGLang xPyD** to test whether **session-aware / KV-cache-aware P/D routing** can improve **end-to-end latency** for agentic coding workloads.
Current setup:
- SGLang: `v0.5.10`
- Model: `Qwen3-Coder-30B-A3B-Instruct` (`~/models/Qwen/Qwen3-Coder-30B-A3B-Instruct`)
- xPyD (`x` prefill instances, `y` decode instances) runs on this single 8-GPU node, so the current constraint is **$x + y \le 8$**
- Even in local experiments, the implementation should preserve the **P -> D RDMA-style data path** semantics as far as possible. Local runs should treat the transfer as a loopback-based stand-in and must not collapse P/D into a special in-process shortcut
- Traces:
  - Ali coding agent (`~/ali-trace/trace-qwen3-coder-formatted/041715-041717.jsonl`)
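To make the $x + y \le 8$ constraint concrete, the candidate splits can be enumerated. This is only a sketch: it assumes each P or D instance occupies exactly one GPU, which this document does not confirm, and `candidate_splits` is a hypothetical helper name.

```python
# Enumerate candidate xPyD splits on a single 8-GPU node.
# Assumption (not from this document): one GPU per P or D instance.
NUM_GPUS = 8


def candidate_splits(num_gpus: int = NUM_GPUS) -> list[tuple[int, int]]:
    """Return every (x, y) with x >= 1, y >= 1 and x + y <= num_gpus."""
    return [
        (x, y)
        for x in range(1, num_gpus)
        for y in range(1, num_gpus - x + 1)
    ]


splits = candidate_splits()
```

If instances use more than one GPU each, scale `x` and `y` by the per-instance GPU count before checking the bound.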
---
## MVP Scope
We only do the following:
1. Run **SGLang xPyD** correctly on one machine
2. Add a **baseline router**
   - `turn1`: default routing
   - `turn2+`: prefer previous `D` node for the same session
3. Add a **KV-cache-aware routing** policy
4. Replay traces and compare policies with the same evaluation pipeline
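A minimal sketch of the baseline router from item 2 might look like the following. Round-robin is used here as a stand-in for "default routing", which is an assumption. `StickyDRouter` and `pick_d` are hypothetical names, not SGLang APIs.

```python
from dataclasses import dataclass, field
from itertools import cycle


@dataclass
class StickyDRouter:
    """Baseline: round-robin on turn 1, then reuse the session's previous D node."""

    d_nodes: list[str]
    # session id -> D node used on the previous turn
    last_d: dict[str, str] = field(default_factory=dict)

    def __post_init__(self) -> None:
        # Assumption: "default routing" is approximated by round-robin.
        self._rr = cycle(self.d_nodes)

    def pick_d(self, session_id: str) -> str:
        node = self.last_d.get(session_id)
        if node is None:  # turn 1: no history for this session
            node = next(self._rr)
        self.last_d[session_id] = node
        return node
```

The router "prefers" the previous node unconditionally here. A real version would likely need a fallback when that node is overloaded or gone.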
Out of scope for now:
- autoscaling
- fault tolerance
- large-scale cluster scheduler
- production hardening
- general multi-tenant serving
---
## What matters
Primary metric:
- **E2E latency**
Secondary metrics:
- TTFT
- TPOT
- KV transfer volume
- cache hit / reuse
- re-prefill count
- per-node load
Do not optimize TTFT alone if E2E does not improve.
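For reference, one common way to derive the latency metrics from client-side timestamps is sketched below. The definitions (especially TPOT as decode time divided by output tokens minus one) are a widely used convention and an assumption here, not something this document fixes. `Timing` is a hypothetical record type.

```python
from dataclasses import dataclass


@dataclass
class Timing:
    t_send: float         # request sent, in seconds
    t_first_token: float  # first output token received
    t_done: float         # last output token received
    output_tokens: int    # number of generated tokens


def e2e(t: Timing) -> float:
    """End-to-end latency of one request."""
    return t.t_done - t.t_send


def ttft(t: Timing) -> float:
    """Time to first token."""
    return t.t_first_token - t.t_send


def tpot(t: Timing) -> float:
    """Mean time per output token after the first one."""
    if t.output_tokens <= 1:
        return 0.0
    return (t.t_done - t.t_first_token) / (t.output_tokens - 1)
```

Under these definitions E2E equals TTFT plus TPOT times (output tokens minus one), so a routing change that lowers TTFT but raises TPOT can still lose on E2E.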
---
## Development Order
Implement in this order:
1. **Bring up xPyD**
2. **Add trace replay + metrics logging**
3. **Implement sticky-to-D baseline**
4. **Implement KV-cache-aware routing**
5. **Analyze gains and failure cases**
Do not skip step 2.
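As a starting point for step 2, trace records can be grouped into per-session turn sequences before replay. The field names `session_id` and `timestamp` are hypothetical; the actual schema of the trace file is not described here and must be checked first.

```python
import json
from collections import defaultdict
from typing import Iterable


def group_turns(lines: Iterable[str]) -> dict[str, list[dict]]:
    """Group JSONL records by session and number the turns in time order.

    Assumption: each record has `session_id` and `timestamp` keys.
    """
    sessions: dict[str, list[dict]] = defaultdict(list)
    for line in lines:
        line = line.strip()
        if not line:  # tolerate blank lines
            continue
        rec = json.loads(line)
        sessions[rec["session_id"]].append(rec)
    for turns in sessions.values():
        turns.sort(key=lambda r: r["timestamp"])
        for i, rec in enumerate(turns, start=1):
            rec["turn_id"] = i  # 1-based, so turn 1 uses default routing
    return dict(sessions)
```

Pass an open file handle as `lines` to stream the trace instead of loading it into memory as one string.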
---
## Core Rules
### 1. Keep policy separate from mechanism
- mechanism = how requests / KV / xPyD work
- policy = how we choose `P` and `D`
Do not mix them unless necessary.
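One way to keep the two apart is a single-method policy interface that the mechanism calls. This is a sketch only: `RoutingPolicy`, `FirstNodePolicy`, `dispatch`, and the `send` callable are hypothetical names for illustration.

```python
from typing import Callable, Protocol


class RoutingPolicy(Protocol):
    """Policy: decides which P and D node serve a request."""

    def choose(
        self, session_id: str, p_nodes: list[str], d_nodes: list[str]
    ) -> tuple[str, str]: ...


class FirstNodePolicy:
    """Trivial placeholder policy: always pick the first P and first D."""

    def choose(
        self, session_id: str, p_nodes: list[str], d_nodes: list[str]
    ) -> tuple[str, str]:
        return p_nodes[0], d_nodes[0]


def dispatch(
    policy: RoutingPolicy,
    session_id: str,
    p_nodes: list[str],
    d_nodes: list[str],
    send: Callable[[str, str], str],
) -> str:
    """Mechanism: asks the policy for a (P, D) pair, then performs the send."""
    p, d = policy.choose(session_id, p_nodes, d_nodes)
    return send(p, d)
```

Swapping the sticky baseline for the KV-cache-aware policy then changes only the object passed as `policy`.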
### 2. Prefer simple, debuggable logic
Start with simple heuristics before complex scoring.
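As an example of a simple heuristic, a first KV-cache-aware rule could pick the `D` node with the longest cached token prefix and break ties by lowest load. The inputs `cached` and `load` are assumed bookkeeping kept by the router; how they are obtained from SGLang is not covered here.

```python
def common_prefix_len(a: list[int], b: list[int]) -> int:
    """Number of leading token ids shared by a and b."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n


def pick_d_kv_aware(
    prompt: list[int],
    cached: dict[str, list[int]],  # node -> token ids believed cached there
    load: dict[str, int],          # node -> in-flight request count
) -> str:
    """Prefer the longest prefix match; among equals, the least-loaded node."""
    return max(
        cached,
        key=lambda node: (common_prefix_len(prompt, cached[node]), -load[node]),
    )
```

Each decision is explained by two numbers per node (prefix length and load), which is easy to log and debug.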
### 3. Log everything needed to explain results
Each request should log:
- request id
- session id
- turn id
- assigned P node
- assigned D node
- latency
- whether reuse was expected / observed
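The fields above map naturally onto one JSONL record per request. The field names and units below are illustrative assumptions, not a fixed schema.

```python
import json
from dataclasses import asdict, dataclass


@dataclass
class RequestLog:
    request_id: str
    session_id: str
    turn_id: int
    p_node: str
    d_node: str
    latency_s: float      # assumed unit: seconds
    reuse_expected: bool  # policy expected KV reuse on the chosen D node
    reuse_observed: bool  # reuse was actually observed


def to_jsonl(rec: RequestLog) -> str:
    """Serialize one record as a single JSON line with stable key order."""
    return json.dumps(asdict(rec), sort_keys=True)
```

Logging expected and observed reuse as separate fields makes it possible to count the cases where the policy predicted reuse and did not get it.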
### 4. Small interfaces only
Avoid over-abstraction.