docs: document project design and status

2026-04-24 12:17:55 +00:00
parent 4bca741f32
commit 78f0d15221
3 changed files with 375 additions and 0 deletions
--- a/AGENTS.md
+++ b/AGENTS.md
@@ -0,0 +1,94 @@
+# AGENTS.md
+
+## Environment
+
+Use `uv` to manage the Python environment. Add dependencies with `uv add` so that `uv sync` reproduces exactly the same runnable environment each time.
+
+## Goal
+
+Build a minimal prototype on top of **SGLang xPyD** to test whether **session-aware / KV-cache-aware P/D routing** can improve **end-to-end latency** for agentic coding workloads.
+
+Current setup:
+- SGLang: `v0.5.10`
+- Model: `Qwen3-Coder-30B-A3B-Instruct` (`~/models/Qwen/Qwen3-Coder-30B-A3B-Instruct`)
+- xPyD runs on this single 8-GPU node, so the current constraint is **$x + y \le 8$**
+- Even in local experiments, the implementation should preserve the **P -> D RDMA-style data path** semantics as much as possible; local runs should treat this as a loopback-based stand-in rather than collapsing P/D into a special in-process shortcut
+- Traces:
+  - Ali coding agent (`~/ali-trace/trace-qwen3-coder-formatted/041715-041717.jsonl`)
+
+---
+
+## MVP Scope
+
+We only do the following:
+
+1. Run **SGLang xPyD** correctly on one machine
+2. Add a **baseline router**
+   - `turn1`: default routing
+   - `turn2+`: prefer the previous `D` node for the same session
+3. Add a **KV-cache-aware routing** policy
+4. Replay traces and compare policies with the same evaluation pipeline
+
+Out of scope for now:
+- autoscaling
+- fault tolerance
+- large-scale cluster scheduler
+- production hardening
+- general multi-tenant serving
+
+---
+
+## What matters
+
+Primary metric:
+- **E2E latency**
+
+Secondary metrics:
+- TTFT
+- TPOT
+- KV transfer volume
+- cache hit / reuse
+- re-prefill count
+- per-node load
+
+Do not optimize TTFT alone if E2E does not improve.
+
+---
+
+## Development Order
+
+Implement in this order:
+
+1. **Bring up xPyD**
+2. **Add trace replay + metrics logging**
+3. **Implement sticky-to-D baseline**
+4. **Implement KV-cache-aware routing**
+5. **Analyze gains and failure cases**
+
+Do not skip step 2: without trace replay and metrics logging, the policies cannot be compared.
+
+---
+
+## Core Rules
+
+### 1. Keep policy separate from mechanism
+- mechanism = how requests / KV / xPyD work
+- policy = how we choose `P` and `D`
+
+Do not mix them unless necessary.
+
+### 2. Prefer simple, debuggable logic
+Start with simple heuristics before complex scoring.
+
+### 3. Log everything needed to explain results
+Each request should log:
+- request id
+- session id
+- turn id
+- assigned P node
+- assigned D node
+- latency (E2E, plus TTFT / TPOT)
+- whether reuse was expected / observed
+
+### 4. Small interfaces only
+Avoid over-abstraction.
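
Note on the AGENTS.md hunk above: the sticky-to-D baseline (MVP Scope, item 2) and the per-request log fields (Core Rules, rule 3) could be sketched roughly as below. This is a minimal illustration under stated assumptions, not the project's implementation: `StickyDRouter`, `RequestLog`, the node names, and the use of round-robin as a stand-in for "default routing" are all hypothetical, and no SGLang API is involved.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class RequestLog:
    """One record per request, mirroring the fields listed under Core Rules."""
    request_id: str
    session_id: str
    turn_id: int
    p_node: str
    d_node: str
    reuse_expected: bool
    reuse_observed: Optional[bool] = None  # filled in after the request finishes
    latency_s: Optional[float] = None      # filled in after the request finishes


@dataclass
class StickyDRouter:
    """Baseline policy: default routing on turn 1, previous D node afterwards."""
    p_nodes: List[str]
    d_nodes: List[str]
    _last_d: Dict[str, str] = field(default_factory=dict)  # session id -> D node
    _next_p: int = 0
    _next_d: int = 0

    def route(self, request_id: str, session_id: str, turn_id: int) -> RequestLog:
        # Assumption: round-robin stands in for whatever "default routing" is.
        p_node = self.p_nodes[self._next_p % len(self.p_nodes)]
        self._next_p += 1

        previous_d = self._last_d.get(session_id)
        if turn_id > 1 and previous_d is not None:
            # turn2+: stick to the D node that served this session before.
            d_node = previous_d
            reuse_expected = True
        else:
            # turn1 (or unknown session): fall back to default routing.
            d_node = self.d_nodes[self._next_d % len(self.d_nodes)]
            self._next_d += 1
            reuse_expected = False

        self._last_d[session_id] = d_node
        return RequestLog(request_id, session_id, turn_id, p_node, d_node, reuse_expected)


router = StickyDRouter(p_nodes=["P0", "P1"], d_nodes=["D0", "D1"])
first = router.route("r1", "s1", turn_id=1)
other = router.route("r2", "s2", turn_id=1)
second = router.route("r3", "s1", turn_id=2)  # same session as r1
```

The router only decides placement and returns a log record; sending the request and filling in `latency_s` and `reuse_observed` would belong to the mechanism side, which keeps policy and mechanism separate as Core Rule 1 asks.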