# AGENTS.md

## Environment

Use `uv` to manage the Python environment. Manage dependencies with `uv add` so that `uv sync` reproduces exactly the same runnable environment each time.

## Goal

Build a minimal prototype on top of **SGLang xPyD** to test whether **session-aware / KV-cache-aware P/D routing** can improve **end-to-end latency** for agentic coding workloads.

Current setup:

- SGLang: `v0.5.10`
- Model: `Qwen3-Coder-30B-A3B-Instruct` (`~/models/Qwen/Qwen3-Coder-30B-A3B-Instruct`)
- xPyD runs on this single 8-GPU node, so the current constraint is **$x + y \le 8$**
- Even in local experiments, the implementation should preserve the **P -> D RDMA-style data path** semantics as much as possible; local runs should treat this as a loopback-based stand-in rather than collapsing P/D into a special in-process shortcut
- Traces:
  - Ali coding agent (`~/ali-trace/trace-qwen3-coder-formatted/041715-041717.jsonl`)

---

## MVP Scope

We only do the following:

1. Run **SGLang xPyD** correctly on one machine
2. Add a **baseline router**
   - `turn1`: default routing
   - `turn2+`: prefer the previous `D` node for the same session
3. Add a **KV-cache-aware routing** policy
4. Replay traces and compare policies with the same evaluation pipeline

Out of scope for now:

- autoscaling
- fault tolerance
- large-scale cluster scheduler
- production hardening
- general multi-tenant serving

---

## What matters

Primary metric:

- **E2E latency**

Secondary metrics:

- TTFT
- TPOT
- KV transfer volume
- cache hit / reuse
- re-prefill count
- per-node load

Do not optimize TTFT alone if E2E does not improve.

---

## Development Order

Implement in this order:

1. **Bring up xPyD**
2. **Add trace replay + metrics logging**
3. **Implement sticky-to-D baseline**
4. **Implement KV-cache-aware routing**
5. **Analyze gains and failure cases**

Do not skip step 2.

---

## Core Rules

### 1. Keep policy separate from mechanism

- mechanism = how requests / KV / xPyD work
- policy = how we choose `P` and `D`

Do not mix them unless necessary.

### 2. Prefer simple, debuggable logic

Start with simple heuristics before complex scoring.

### 3. Log everything needed to explain results

Each request should log:

- request id
- session id
- turn id
- assigned P node
- assigned D node
- latency
- whether reuse was expected / observed

### 4. Small interfaces only

Avoid over-abstraction.
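
As a rough illustration of the rules above, the sticky-to-D baseline and the per-request log record could be sketched as follows. This is a minimal sketch under stated assumptions, not SGLang's router API: `RequestLog`, `StickyToDPolicy`, the field names, and the node labels are all hypothetical, and round-robin is only a stand-in for whatever "default routing" ends up meaning in the real setup. The policy only picks node names; it never touches requests or KV transfer, so mechanism stays separate.

```python
import itertools
import json
from dataclasses import asdict, dataclass
from typing import Dict, List, Optional


@dataclass
class RequestLog:
    """One record per request, mirroring the fields listed in rule 3 (names are hypothetical)."""
    request_id: str
    session_id: str
    turn_id: int
    p_node: str
    d_node: str
    reuse_expected: bool
    latency_s: Optional[float] = None       # filled in after the request completes
    reuse_observed: Optional[bool] = None   # filled in from mechanism-side metrics


class StickyToDPolicy:
    """Baseline policy: turn 1 uses default routing, later turns prefer the session's previous D node."""

    def __init__(self, p_nodes: List[str], d_nodes: List[str]) -> None:
        if not p_nodes or not d_nodes:
            raise ValueError("need at least one P node and one D node")
        # Assumption: round-robin stands in for the real default routing.
        self._p_cycle = itertools.cycle(p_nodes)
        self._d_cycle = itertools.cycle(d_nodes)
        self._last_d: Dict[str, str] = {}  # session id -> last assigned D node

    def route(self, request_id: str, session_id: str, turn_id: int) -> RequestLog:
        p_node = next(self._p_cycle)
        prev_d = self._last_d.get(session_id)
        if turn_id > 1 and prev_d is not None:
            d_node, reuse_expected = prev_d, True
        else:
            d_node, reuse_expected = next(self._d_cycle), False
        self._last_d[session_id] = d_node
        return RequestLog(request_id, session_id, turn_id, p_node, d_node, reuse_expected)


if __name__ == "__main__":
    policy = StickyToDPolicy(["p0", "p1"], ["d0", "d1", "d2"])
    for rid, sid, turn in [("r1", "s1", 1), ("r2", "s2", 1), ("r3", "s1", 2)]:
        # One JSON object per line keeps the log easy to grep and to load later.
        print(json.dumps(asdict(policy.route(rid, sid, turn))))
```

A KV-cache-aware policy could expose the same `route` method and return the same record, so the replay and evaluation pipeline would not need to change between policies.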