AGENTS.md

Environment

Use uv to manage the Python environment. Add dependencies with uv add so that uv sync reproduces exactly the same runnable environment each time.

Goal

Build a minimal prototype on top of SGLang xPyD (x prefill instances, y decode instances) to test whether session-aware / KV-cache-aware P/D routing can improve end-to-end latency for agentic coding workloads.

Current setup:

SGLang: v0.5.10
Model: Qwen3-Coder-30B-A3B-Instruct (~/models/Qwen/Qwen3-Coder-30B-A3B-Instruct)
xPyD runs on this single 8-GPU node, so the current constraint is $x + y \le 8$
Even in local experiments, the implementation should preserve the P -> D RDMA-style data path semantics as much as possible. Local runs should treat the transfer as a loopback-based stand-in for RDMA and must not collapse P/D into a special in-process shortcut.
Traces:
- Ali coding agent (~/ali-trace/trace-qwen3-coder-formatted/041715-041717.jsonl)
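
As a quick illustration of the GPU budget above, the sketch below enumerates every xPyD split that fits on the node. It assumes each P or D instance occupies exactly one GPU (no tensor parallelism); that is an assumption made for the example, not something this document fixes. The name valid_splits is hypothetical.

```python
# Enumerate prefill/decode splits that satisfy x + y <= 8.
# Assumption: one GPU per P or D instance (no tensor parallelism).
TOTAL_GPUS = 8


def valid_splits(total_gpus: int = TOTAL_GPUS) -> list[tuple[int, int]]:
    """Return every (x, y) with x >= 1, y >= 1 and x + y <= total_gpus."""
    return [
        (x, y)
        for x in range(1, total_gpus)          # at least one GPU left for decode
        for y in range(1, total_gpus - x + 1)  # decode count that still fits
    ]
```

Calling valid_splits() yields pairs such as (1, 7) or (4, 4); splits that leave GPUs idole, such as (2, 2), are included because the constraint is an inequality.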

MVP Scope

We only do the following:

Run SGLang xPyD correctly on one machine
Add a baseline router
- turn 1: default routing
- turn 2+: prefer the previous D node for the same session
Add a KV-cache-aware routing policy
Replay traces and compare policies with the same evaluation pipeline
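
The baseline router in the list above could be sketched roughly as follows. This is a minimal illustration, not the SGLang router API: the class name StickyDecodeRouter is hypothetical, node identifiers are plain strings, and "default routing" is approximated by round-robin, which is an assumption.

```python
from itertools import cycle


class StickyDecodeRouter:
    """Turn 1: default routing. Turn 2+: reuse the session's previous D node."""

    def __init__(self, prefill_nodes: list[str], decode_nodes: list[str]) -> None:
        # Assumption: round-robin stands in for "default routing".
        self._next_p = cycle(prefill_nodes)
        self._next_d = cycle(decode_nodes)
        self._last_d: dict[str, str] = {}  # session id -> D node used last time

    def route(self, session_id: str) -> tuple[str, str]:
        """Return (P node, D node) for the next turn of this session."""
        p_node = next(self._next_p)
        d_node = self._last_d.get(session_id)
        if d_node is None:
            # First turn of the session: fall back to default routing.
            d_node = next(self._next_d)
            self._last_d[session_id] = d_node
        return p_node, d_node
```

Only the D choice is sticky here; the P node still rotates on every turn. A real version would also need to handle a D node that has disappeared or is overloaded, which this sketch ignores.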

Out of scope for now:

autoscaling
fault tolerance
large-scale cluster scheduler
production hardening
general multi-tenant serving

What matters

Primary metric:

E2E latency

Secondary metrics:

TTFT
TPOT
KV transfer volume
cache hit / reuse
re-prefill count
per-node load

Do not optimize TTFT alone if E2E does not improve.
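
To make the relationship between these metrics concrete, the sketch below derives E2E, TTFT, and TPOT from three timestamps per request. The RequestTiming fields are hypothetical names, and TPOT is taken as the mean gap between output tokens after the first one, which is a common convention but an assumption here. Under that definition, E2E = TTFT + TPOT * (output_tokens - 1), so a TTFT gain can be cancelled by slower decoding.

```python
from dataclasses import dataclass


@dataclass
class RequestTiming:
    # Hypothetical field names; all timestamps are in seconds.
    t_arrival: float      # request received by the router
    t_first_token: float  # first output token emitted
    t_last_token: float   # last output token emitted
    output_tokens: int    # number of generated tokens


def e2e(t: RequestTiming) -> float:
    """End-to-end latency: arrival to last token."""
    return t.t_last_token - t.t_arrival


def ttft(t: RequestTiming) -> float:
    """Time to first token: arrival to first token."""
    return t.t_first_token - t.t_arrival


def tpot(t: RequestTiming) -> float:
    """Mean time per output token after the first one."""
    if t.output_tokens <= 1:
        return 0.0  # no inter-token gaps to average
    return (t.t_last_token - t.t_first_token) / (t.output_tokens - 1)
```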

Development Order

Implement in this order:

1. Bring up xPyD
2. Add trace replay + metrics logging
3. Implement sticky-to-D baseline
4. Implement KV-cache-aware routing
5. Analyze gains and failure cases

Do not skip step 2.

Core Rules

1. Keep policy separate from mechanism

mechanism = how requests / KV / xPyD work
policy = how we choose P and D

Do not mix them unless necessary.
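
One way to keep the two apart is to give policies a single small interface and leave transport to a separate function. The sketch below is illustrative only: RoutingRequest, RoutingPolicy, FirstNodePolicy, and dispatch are hypothetical names, and the send callable stands in for whatever mechanism actually forwards the request to SGLang.

```python
from dataclasses import dataclass
from typing import Callable, Protocol


@dataclass(frozen=True)
class RoutingRequest:
    request_id: str
    session_id: str
    turn_id: int


class RoutingPolicy(Protocol):
    """Policy: decides which P and D node to use, and nothing else."""

    def choose(
        self, req: RoutingRequest, p_nodes: list[str], d_nodes: list[str]
    ) -> tuple[str, str]: ...


class FirstNodePolicy:
    """Trivial placeholder policy: always pick the first P and D node."""

    def choose(
        self, req: RoutingRequest, p_nodes: list[str], d_nodes: list[str]
    ) -> tuple[str, str]:
        return p_nodes[0], d_nodes[0]


def dispatch(
    req: RoutingRequest,
    policy: RoutingPolicy,
    p_nodes: list[str],
    d_nodes: list[str],
    send: Callable[[RoutingRequest, str, str], None],
) -> tuple[str, str]:
    """Mechanism: ask the policy for a placement, then hand off to `send`."""
    p_node, d_node = policy.choose(req, p_nodes, d_nodes)
    send(req, p_node, d_node)
    return p_node, d_node
```

With this split, swapping the sticky baseline for a KV-cache-aware policy means changing only the object passed as policy, so both run through the same replay and logging path.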

2. Prefer simple, debuggable logic

Start with simple heuristics before complex scoring.
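
As an example of a simple heuristic, a first KV-cache-aware rule might pick the D node holding the longest cached prefix for the session and break ties by load. The function and both input dictionaries below are hypothetical; how cached-prefix length and in-flight counts would be obtained from SGLang is not specified in this document.

```python
def pick_decode_node(
    cached_prefix_tokens: dict[str, int],
    in_flight: dict[str, int],
) -> str:
    """Choose a D node: most cached prefix tokens, then fewest in-flight requests.

    cached_prefix_tokens: D node -> tokens of this session's prefix cached there
                          (nodes missing from the dict count as 0).
    in_flight:            D node -> requests currently being decoded; its keys
                          define the candidate set.
    """
    return min(
        in_flight,
        # Negate the cache size so that larger caches sort first;
        # the node name is a final deterministic tie-breaker.
        key=lambda node: (-cached_prefix_tokens.get(node, 0), in_flight[node], node),
    )
```

Because every decision reduces to one sortable tuple per node, logging that tuple alongside the choice is enough to explain it afterwards. Note that this rule ignores load whenever any node has a cache advantage, which is one of the failure cases worth checking in the analysis step.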

3. Log everything needed to explain results

Each request should log:

request id
session id
turn id
assigned P node
assigned D node
latency
whether reuse was expected / observed
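
The per-request fields above map naturally onto one JSON line per request. The sketch below is one possible shape; the field names, the latency unit (seconds), and the helper to_jsonl are assumptions for illustration.

```python
import json
from dataclasses import asdict, dataclass


@dataclass
class RequestLog:
    # Hypothetical field names mirroring the list above.
    request_id: str
    session_id: str
    turn_id: int
    p_node: str
    d_node: str
    latency_s: float      # assumed to be E2E latency in seconds
    reuse_expected: bool  # the policy predicted a KV-cache hit
    reuse_observed: bool  # a KV-cache hit actually happened


def to_jsonl(record: RequestLog) -> str:
    """Serialize one record as a single JSON line with stable key order."""
    return json.dumps(asdict(record), sort_keys=True)
```

Keeping reuse_expected and reuse_observed as separate fields makes mispredictions easy to filter for when analyzing failure cases.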

4. Small interfaces only

Avoid over-abstraction.
