AGENTS.md
Environment
Use uv to manage the entire Python environment. Add dependencies with uv add so that uv sync reproduces exactly the same runnable environment every time.
Goal
Build a minimal prototype on top of SGLang xPyD to test whether session-aware / KV-cache-aware P/D routing can improve end-to-end latency for agentic coding workloads.
Current setup:
- SGLang: v0.5.10
- Model: Qwen3-Coder-30B-A3B-Instruct (~/models/Qwen/Qwen3-Coder-30B-A3B-Instruct)
- xPyD runs on this single 8-GPU node, so the current constraint is $x + y \le 8$
- Even in local experiments, the implementation should preserve the P -> D RDMA-style data path semantics as much as possible; local runs should treat this as a loopback-based stand-in rather than collapsing P/D into a special in-process shortcut
- Traces:
  - Ali coding agent (~/ali-trace/trace-qwen3-coder-formatted/041715-041717.jsonl)
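As a quick sanity check on the $x + y \le 8$ constraint, the sketch below enumerates the xPyD shapes that fit on the node. It assumes each P or D worker occupies exactly one GPU and that at least one worker of each kind is required; `valid_xpyd_configs` is a hypothetical helper name, not part of SGLang.

```python
from typing import Iterator

TOTAL_GPUS = 8  # single 8-GPU node, from the setup above


def valid_xpyd_configs(total_gpus: int = TOTAL_GPUS) -> Iterator[tuple[int, int]]:
    """Yield (x, y) with x prefill and y decode workers such that x + y <= total_gpus.

    Assumption: one GPU per worker, and at least one P and one D worker.
    """
    for x in range(1, total_gpus):
        for y in range(1, total_gpus - x + 1):
            yield (x, y)


if __name__ == "__main__":
    for x, y in valid_xpyd_configs():
        print(f"{x}P{y}D")
```

If workers later use tensor parallelism, the constraint becomes a bound on GPUs rather than on worker count, and this helper would need to change accordingly.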
MVP Scope
We only do the following:
- Run SGLang xPyD correctly on one machine
- Add a baseline router
  - turn 1: default routing
  - turn 2+: prefer the previous D node for the same session
- Add a KV-cache-aware routing policy
- Replay traces and compare policies with the same evaluation pipeline
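The baseline router in the list above can be sketched in a few lines. This is a minimal illustration, not the SGLang router: the class name `StickyToDRouter` is hypothetical, and "default routing" is modeled here as round-robin over D nodes, which is an assumption.

```python
from itertools import cycle


class StickyToDRouter:
    """Baseline: turn 1 uses default routing, turn 2+ reuses the session's previous D node."""

    def __init__(self, d_nodes: list[str]) -> None:
        if not d_nodes:
            raise ValueError("at least one D node is required")
        # Assumption: default routing is plain round-robin over the D nodes.
        self._default = cycle(d_nodes)
        # session id -> D node that served the session's previous turn
        self._last_d: dict[str, str] = {}

    def route(self, session_id: str) -> str:
        d_node = self._last_d.get(session_id)
        if d_node is None:
            # First turn of this session: fall back to default routing.
            d_node = next(self._default)
            self._last_d[session_id] = d_node
        return d_node
```

Usage: `StickyToDRouter(["d0", "d1"]).route("session-a")` returns the same D node on every later call for `"session-a"`. The sketch ignores node load and node failure, both of which are out of scope for the MVP.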
Out of scope for now:
- autoscaling
- fault tolerance
- large-scale cluster scheduler
- production hardening
- general multi-tenant serving
What matters
Primary metric:
- E2E latency
Secondary metrics:
- TTFT
- TPOT
- KV transfer volume
- cache hit / reuse
- re-prefill count
- per-node load
Do not optimize TTFT alone if E2E does not improve.
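To keep the latency metrics unambiguous, the sketch below derives E2E, TTFT, and TPOT from three client-side timestamps. The field names are hypothetical, and the TPOT definition (decode time divided by the number of tokens after the first) is a common convention assumed here, not something this document fixes.

```python
from dataclasses import dataclass
from statistics import fmean


@dataclass(frozen=True)
class RequestTiming:
    sent_at: float         # seconds: client sends the request
    first_token_at: float  # seconds: first output token arrives
    finished_at: float     # seconds: last output token arrives
    output_tokens: int

    @property
    def e2e(self) -> float:
        """End-to-end latency, the primary metric."""
        return self.finished_at - self.sent_at

    @property
    def ttft(self) -> float:
        """Time to first token."""
        return self.first_token_at - self.sent_at

    @property
    def tpot(self) -> float:
        """Time per output token, averaged over tokens after the first (assumed definition)."""
        if self.output_tokens <= 1:
            return 0.0
        return (self.finished_at - self.first_token_at) / (self.output_tokens - 1)


def mean_e2e(timings: list[RequestTiming]) -> float:
    """Mean E2E latency over a replay; compare policies on this before looking at TTFT."""
    return fmean(t.e2e for t in timings)
```

Comparing two policies on `mean_e2e` first, and only then on TTFT, keeps the evaluation aligned with the rule above.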
Development Order
Implement in this order:
1. Bring up xPyD
2. Add trace replay + metrics logging
3. Implement sticky-to-D baseline
4. Implement KV-cache-aware routing
5. Analyze gains and failure cases
Do not skip step 2.
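The trace replay step can start from something as small as the sketch below, which groups JSONL records into sessions and assigns turn ids. The field names `session_id` and `timestamp` are assumptions for illustration; the actual trace schema is not described here and must be checked against the trace file.

```python
import json
from collections import defaultdict
from typing import Iterable


def group_turns_by_session(jsonl_lines: Iterable[str]) -> dict[str, list[dict]]:
    """Group trace records by session and number the turns from 1 in timestamp order.

    Assumption: each record has "session_id" and "timestamp" fields (hypothetical schema).
    """
    sessions: dict[str, list[dict]] = defaultdict(list)
    for line in jsonl_lines:
        line = line.strip()
        if not line:
            continue  # tolerate blank lines in the trace file
        record = json.loads(line)
        sessions[record["session_id"]].append(record)

    for turns in sessions.values():
        turns.sort(key=lambda r: r["timestamp"])
        for turn_id, record in enumerate(turns, start=1):
            record["turn_id"] = turn_id
    return dict(sessions)
```

Feeding every policy the same grouped sessions is what makes the comparison fair: the only thing that changes between runs is the routing decision.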
Core Rules
1. Keep policy separate from mechanism
- mechanism = how requests / KV / xPyD work
- policy = how we choose P and D
Do not mix them unless necessary.
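One way to keep the two apart is a single narrow policy interface that the mechanism calls. The sketch below is illustrative only: `RoutingRequest`, `Assignment`, `RoutingPolicy`, `FirstNodePolicy`, and `dispatch` are hypothetical names, and the actual forwarding of the request to SGLang is left out.

```python
from dataclasses import dataclass
from typing import Protocol


@dataclass(frozen=True)
class RoutingRequest:
    request_id: str
    session_id: str
    turn_id: int


@dataclass(frozen=True)
class Assignment:
    p_node: str
    d_node: str


class RoutingPolicy(Protocol):
    """Policy side: decides which P and D node serve a request, nothing else."""

    def choose(
        self, request: RoutingRequest, p_nodes: list[str], d_nodes: list[str]
    ) -> Assignment: ...


class FirstNodePolicy:
    """Trivial placeholder policy: always pick the first P and first D node."""

    def choose(
        self, request: RoutingRequest, p_nodes: list[str], d_nodes: list[str]
    ) -> Assignment:
        return Assignment(p_node=p_nodes[0], d_node=d_nodes[0])


def dispatch(
    request: RoutingRequest,
    policy: RoutingPolicy,
    p_nodes: list[str],
    d_nodes: list[str],
) -> Assignment:
    """Mechanism side: ask the policy for a placement, then act on it."""
    assignment = policy.choose(request, p_nodes, d_nodes)
    # The mechanism would forward the request and trigger the P -> D KV transfer here.
    return assignment
```

Swapping the baseline for the KV-cache-aware policy then means passing a different object to `dispatch`, with no change to the mechanism code.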
2. Prefer simple, debuggable logic
Start with simple heuristics before complex scoring.
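As an example of a heuristic that stays easy to debug, the sketch below picks the D node holding the most cached tokens for the session and breaks ties by lower load. The inputs (`cached_tokens`, `inflight`) are hypothetical; how they are obtained from SGLang is not covered here.

```python
def choose_d_node(
    d_nodes: list[str],
    cached_tokens: dict[str, int],
    inflight: dict[str, int],
) -> str:
    """Pick the D node with the most reusable KV cache; break ties by fewer in-flight requests.

    cached_tokens: node -> tokens of this session's prefix already cached there (assumed input).
    inflight: node -> number of requests currently running on the node (assumed input).
    """
    if not d_nodes:
        raise ValueError("at least one D node is required")
    return max(
        d_nodes,
        key=lambda node: (cached_tokens.get(node, 0), -inflight.get(node, 0)),
    )
```

Because the decision is a single lexicographic comparison, any surprising placement can be explained by printing the two numbers per node.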
3. Log everything needed to explain results
Each request should log:
- request id
- session id
- turn id
- assigned P node
- assigned D node
- latency
- whether reuse was expected / observed
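The fields above map naturally onto one flat record per request, written as a JSONL line. This is a sketch with hypothetical field names; latency is assumed to be recorded in seconds.

```python
import json
from dataclasses import asdict, dataclass
from typing import Optional


@dataclass
class RequestLog:
    request_id: str
    session_id: str
    turn_id: int
    p_node: str
    d_node: str
    latency_s: float                       # assumed unit: seconds
    reuse_expected: bool                   # what the policy predicted
    reuse_observed: Optional[bool] = None  # None until the mechanism reports it


def to_jsonl(record: RequestLog) -> str:
    """Serialize one record as a single JSON line with stable key order."""
    return json.dumps(asdict(record), sort_keys=True)


if __name__ == "__main__":
    print(to_jsonl(RequestLog("r1", "s1", 2, "p0", "d1", 1.25, True, False)))
```

Logging expected and observed reuse side by side makes mispredictions by the policy directly countable during analysis.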
4. Small interfaces only
Avoid over-abstraction.