# AGENTS.md

## Environment

Use `uv` to manage the entire Python environment. Add dependencies with `uv add` so that `uv sync` reproduces exactly the same runnable environment every time.

## Goal

Build a minimal prototype on top of **SGLang xPyD** to test whether **session-aware / KV-cache-aware P/D routing** can improve **end-to-end latency** for agentic coding workloads.

Current setup:
- SGLang: `v0.5.10`
- Model: `Qwen3-Coder-30B-A3B-Instruct` (`~/models/Qwen/Qwen3-Coder-30B-A3B-Instruct`)
- xPyD (`x` prefill workers, `y` decode workers) runs on this single 8-GPU node, so the current constraint is **$x + y \le 8$**
- Even in local experiments, preserve the semantics of the **P -> D RDMA-style data path** as far as possible: treat local runs as a loopback-based stand-in, and do not collapse P/D into a special in-process shortcut
- Traces:
  - Ali coding agent (`~/ali-trace/trace-qwen3-coder-formatted/041715-041717.jsonl`)
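
The $x + y \le 8$ constraint is easy to violate when sweeping layouts, so a small guard can help. The sketch below is only illustrative: `validate_layout` and `candidate_layouts` are hypothetical helper names, and it assumes every layout needs at least one P and one D worker.

```python
from typing import List, Tuple

TOTAL_GPUS = 8  # single 8-GPU node, per the current setup


def validate_layout(num_prefill: int, num_decode: int,
                    total_gpus: int = TOTAL_GPUS) -> None:
    """Raise ValueError if an xPyD layout breaks x + y <= total_gpus."""
    # Assumption: a usable layout has at least one P and one D worker.
    if num_prefill < 1 or num_decode < 1:
        raise ValueError("need at least one P worker and one D worker")
    if num_prefill + num_decode > total_gpus:
        raise ValueError(
            f"{num_prefill}P{num_decode}D needs {num_prefill + num_decode} "
            f"GPUs, only {total_gpus} available"
        )


def candidate_layouts(total_gpus: int = TOTAL_GPUS) -> List[Tuple[int, int]]:
    """Enumerate every (x, y) with x >= 1, y >= 1 and x + y <= total_gpus."""
    return [
        (x, y)
        for x in range(1, total_gpus)
        for y in range(1, total_gpus - x + 1)
    ]
```

Calling `validate_layout(x, y)` before launching workers turns a misconfigured sweep into an immediate error instead of a partially started run.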

---

## MVP Scope

We only do the following:

1. Run **SGLang xPyD** correctly on one machine
2. Add a **baseline router**
   - `turn1`: default routing
   - `turn2+`: prefer previous `D` node for the same session
3. Add a **KV-cache-aware routing** policy
4. Replay traces and compare policies with the same evaluation pipeline
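
A minimal sketch of the baseline router from item 2, under stated assumptions: `StickyDecodeRouter` is a hypothetical class, node names are plain strings, and round-robin stands in for whatever "default routing" turns out to be in the real setup. It only picks the `D` node; `P` selection is left to the default path.

```python
import itertools
from typing import Dict, List


class StickyDecodeRouter:
    """turn1: default routing; turn2+: reuse the session's previous D node."""

    def __init__(self, decode_nodes: List[str]) -> None:
        if not decode_nodes:
            raise ValueError("at least one decode node is required")
        # Assumption: round-robin is a stand-in for the default routing.
        self._default = itertools.cycle(list(decode_nodes))
        self._last_d: Dict[str, str] = {}  # session_id -> last D node

    def pick_decode(self, session_id: str) -> str:
        node = self._last_d.get(session_id)
        if node is None:
            # First turn of this session: fall back to default routing.
            node = next(self._default)
        self._last_d[session_id] = node
        return node
```

For example, with `StickyDecodeRouter(["d0", "d1"])`, the first request of session `s1` lands on `d0`, and every later turn of `s1` returns `d0` again regardless of what other sessions do in between.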

Out of scope for now:
- autoscaling
- fault tolerance
- large-scale cluster scheduler
- production hardening
- general multi-tenant serving

---

## What matters

Primary metric:
- **E2E latency**

Secondary metrics:
- TTFT
- TPOT
- KV transfer volume
- cache hit / reuse
- re-prefill count
- per-node load

Do not optimize TTFT alone if E2E does not improve.
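
To keep the latency metrics comparable across policies, it helps to derive them from the same three client-side timestamps. The sketch below uses the common definitions (TTFT up to the first token, TPOT averaged over the remaining tokens); `RequestTiming` and its field names are hypothetical, and the replay client is assumed to record timestamps in seconds.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class RequestTiming:
    sent_at: float         # when the replay client sent the request
    first_token_at: float  # when the first output token arrived
    finished_at: float     # when the last output token arrived
    output_tokens: int     # number of generated tokens


def e2e_latency(t: RequestTiming) -> float:
    """End-to-end latency: the primary metric."""
    return t.finished_at - t.sent_at


def ttft(t: RequestTiming) -> float:
    """Time to first token."""
    return t.first_token_at - t.sent_at


def tpot(t: RequestTiming) -> Optional[float]:
    """Average time per output token after the first; None if undefined."""
    if t.output_tokens < 2:
        return None
    return (t.finished_at - t.first_token_at) / (t.output_tokens - 1)
```

Reporting all three from one record makes it obvious when a policy lowers TTFT but leaves E2E unchanged.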

---

## Development Order

Implement in this order:

1. **Bring up xPyD**
2. **Add trace replay + metrics logging**
3. **Implement sticky-to-D baseline**
4. **Implement KV-cache-aware routing**
5. **Analyze gains and failure cases**

Do not skip step 2.

---

## Core Rules

### 1. Keep policy separate from mechanism
- mechanism = how requests / KV / xPyD work
- policy = how we choose `P` and `D`

Do not mix them unless necessary.
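
One way to keep the two apart is to make the policy a single small callable surface that the mechanism code invokes. The following is a sketch, not the project's actual interface: `RoutingRequest`, `RoutingPolicy`, `FirstNodePolicy`, and `route` are all hypothetical names.

```python
from dataclasses import dataclass
from typing import Protocol, Sequence, Tuple


@dataclass(frozen=True)
class RoutingRequest:
    request_id: str
    session_id: str
    turn_id: int


class RoutingPolicy(Protocol):
    """Policy side: decides which P and D node serve a request."""

    def choose(self, req: RoutingRequest, prefill_nodes: Sequence[str],
               decode_nodes: Sequence[str]) -> Tuple[str, str]:
        ...


class FirstNodePolicy:
    """Trivial placeholder policy, useful for bring-up tests."""

    def choose(self, req: RoutingRequest, prefill_nodes: Sequence[str],
               decode_nodes: Sequence[str]) -> Tuple[str, str]:
        return prefill_nodes[0], decode_nodes[0]


def route(policy: RoutingPolicy, req: RoutingRequest,
          prefill_nodes: Sequence[str],
          decode_nodes: Sequence[str]) -> Tuple[str, str]:
    """Mechanism side: ask the policy, then sanity-check its answer."""
    p_node, d_node = policy.choose(req, prefill_nodes, decode_nodes)
    if p_node not in prefill_nodes or d_node not in decode_nodes:
        raise ValueError(f"policy returned unknown node: {p_node}, {d_node}")
    return p_node, d_node
```

With this split, swapping the baseline for a KV-cache-aware policy only changes the object passed to `route`, not the dispatch code.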

### 2. Prefer simple, debuggable logic
Start with simple heuristics before complex scoring.
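
As an illustration of what "simple" could mean for the KV-cache-aware policy, the sketch below picks the `D` node with the longest cached token prefix and breaks ties by load. Everything here is an assumption for illustration: the function names are hypothetical, and it presumes the router can see one cached token sequence and one load counter per node, which the real system may expose differently or not at all.

```python
from typing import Dict, Sequence


def shared_prefix_len(a: Sequence[int], b: Sequence[int]) -> int:
    """Number of leading tokens that two sequences have in common."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n


def pick_decode_by_prefix(prompt_tokens: Sequence[int],
                          cached_tokens_by_node: Dict[str, Sequence[int]],
                          load_by_node: Dict[str, int]) -> str:
    """Prefer the longest cached prefix, then the lowest load, then name."""
    if not load_by_node:
        raise ValueError("no decode nodes available")

    def sort_key(node: str):
        cached = cached_tokens_by_node.get(node, ())
        overlap = shared_prefix_len(prompt_tokens, cached)
        # Negate overlap so that min() favors the largest overlap.
        return (-overlap, load_by_node[node], node)

    return min(load_by_node, key=sort_key)
```

A lexicographic key like this stays easy to debug: logging the tuple for each node explains every routing decision directly.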

### 3. Log everything needed to explain results
Each request should log:
- request id
- session id
- turn id
- assigned P node
- assigned D node
- latency
- whether reuse was expected / observed
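
The fields above map naturally onto one JSON line per request. The sketch below is one possible shape; `RequestLog`, its field names, and `write_log` are hypothetical, and latency is assumed to be recorded in seconds.

```python
import json
from dataclasses import asdict, dataclass
from typing import Optional, TextIO


@dataclass
class RequestLog:
    request_id: str
    session_id: str
    turn_id: int
    p_node: str
    d_node: str
    latency_s: float                       # assumption: seconds
    reuse_expected: bool                   # did the policy expect KV reuse?
    reuse_observed: Optional[bool] = None  # None until it can be measured


def write_log(record: RequestLog, sink: TextIO) -> None:
    """Append one record to the sink as a single JSON line."""
    sink.write(json.dumps(asdict(record), sort_keys=True) + "\n")
```

Keeping `reuse_expected` and `reuse_observed` as separate fields makes mismatches between the policy's assumption and the engine's behavior easy to filter for during analysis.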

### 4. Small interfaces only
Avoid over-abstraction.