AGENTS.md

Environment

Use uv to manage the Python environment. Add dependencies with uv add so that uv sync reproduces exactly the same runnable environment every time.

Goal

Build a minimal prototype on top of SGLang xPyD to test whether session-aware / KV-cache-aware P/D routing can improve end-to-end latency for agentic coding workloads.

Current setup:

  • SGLang: v0.5.10
  • Model: Qwen3-Coder-30B-A3B-Instruct (~/models/Qwen/Qwen3-Coder-30B-A3B-Instruct)
  • xPyD runs on this single 8-GPU node, so the current constraint is $x + y \le 8$
  • Even in local experiments, preserve the P -> D RDMA-style data-path semantics as far as possible: local runs use a loopback-based stand-in for the transfer and must not collapse P/D into a special in-process shortcut
  • Traces:
    • Ali coding agent (~/ali-trace/trace-qwen3-coder-formatted/041715-041717.jsonl)
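
The single-node constraint above, $x + y \le 8$, bounds which xPyD layouts can be tried at all. The sketch below enumerates them. It assumes each P or D instance occupies exactly one GPU (no tensor parallelism), which the setup above does not state; `valid_xpyd_configs` is a hypothetical helper name.

```python
NUM_GPUS = 8  # from the current setup: a single 8-GPU node


def valid_xpyd_configs(num_gpus: int = NUM_GPUS) -> list[tuple[int, int]]:
    """List every (x, y) with at least one P and one D instance and x + y <= num_gpus.

    Assumption: each instance occupies exactly one GPU (no tensor parallelism).
    """
    return [
        (x, y)
        for x in range(1, num_gpus)          # x prefill instances
        for y in range(1, num_gpus - x + 1)  # y decode instances, so x + y <= num_gpus
    ]


if __name__ == "__main__":
    for x, y in valid_xpyd_configs():
        print(f"{x}P{y}D uses {x + y} of {NUM_GPUS} GPUs")
```

If instances span several GPUs, the same enumeration applies with `num_gpus` replaced by the number of instance slots.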

MVP Scope

We only do the following:

  1. Run SGLang xPyD correctly on one machine
  2. Add a baseline router
    • turn 1: default routing
    • turn 2+: prefer the previous D node for the same session
  3. Add a KV-cache-aware routing policy
  4. Replay traces and compare policies with the same evaluation pipeline
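
The baseline router in item 2 can be sketched as follows. This is a minimal illustration, not SGLang's router API: `StickyDRouter` and its method names are hypothetical, and "default routing" is assumed here to mean round-robin.

```python
import itertools
from typing import Dict, Iterator, Sequence, Tuple


class StickyDRouter:
    """Baseline: the first turn uses default routing, later turns reuse the session's D node."""

    def __init__(self, p_nodes: Sequence[str], d_nodes: Sequence[str]) -> None:
        # Assumption: "default routing" is plain round-robin over each pool.
        self._p_cycle: Iterator[str] = itertools.cycle(p_nodes)
        self._d_cycle: Iterator[str] = itertools.cycle(d_nodes)
        self._last_d: Dict[str, str] = {}  # session id -> D node used last time

    def route(self, session_id: str) -> Tuple[str, str]:
        """Return (p_node, d_node) for the next request of this session."""
        p_node = next(self._p_cycle)
        d_node = self._last_d.get(session_id)
        if d_node is None:
            # First turn of the session: fall back to default routing.
            d_node = next(self._d_cycle)
        self._last_d[session_id] = d_node
        return p_node, d_node
```

This sketch always sticks to the previous D node. Since the policy only says "prefer", a real router would likely need an escape hatch, for example when that node is overloaded.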

Out of scope for now:

  • autoscaling
  • fault tolerance
  • large-scale cluster scheduler
  • production hardening
  • general multi-tenant serving

What matters

Primary metric:

  • E2E latency

Secondary metrics:

  • TTFT
  • TPOT
  • KV transfer volume
  • cache hit / reuse
  • re-prefill count
  • per-node load

Do not optimize TTFT in isolation: a TTFT gain that does not improve E2E latency does not count.
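
To keep the latency metrics comparable across policies, it helps to derive them from the same three timestamps per request. The sketch below uses one common set of definitions. The `RequestTiming` fields are hypothetical names, and TPOT is taken as decode time divided by the number of tokens after the first.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RequestTiming:
    """Timestamps in seconds for one request (hypothetical field names)."""
    t_submit: float       # request sent by the replay client
    t_first_token: float  # first output token received
    t_done: float         # last output token received
    output_tokens: int    # number of generated tokens


def e2e_latency(t: RequestTiming) -> float:
    """End-to-end latency: submit to last token."""
    return t.t_done - t.t_submit


def ttft(t: RequestTiming) -> float:
    """Time to first token: submit to first token."""
    return t.t_first_token - t.t_submit


def tpot(t: RequestTiming) -> float:
    """Time per output token, averaged over the tokens after the first one."""
    if t.output_tokens <= 1:
        return 0.0  # no decode steps after the first token
    return (t.t_done - t.t_first_token) / (t.output_tokens - 1)
```

With these definitions, E2E = TTFT + TPOT × (output_tokens − 1), so a TTFT gain that is offset by slower decode shows up directly in E2E.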


Development Order

Implement in this order:

  1. Bring up xPyD
  2. Add trace replay + metrics logging
  3. Implement sticky-to-D baseline
  4. Implement KV-cache-aware routing
  5. Analyze gains and failure cases

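Do not skip step 2: without trace replay and metrics logging, the policies cannot be compared with the same evaluation pipeline.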


Core Rules

1. Keep policy separate from mechanism

  • mechanism = how requests / KV / xPyD work
  • policy = how we choose P and D

Do not mix them unless necessary.
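
One way to keep the two apart is to have the policy return only a (P, D) choice and leave everything else to the mechanism. The sketch below illustrates this. All names (`Request`, `RoutingPolicy`, `FirstNodePolicy`, `dispatch`, `send`) are hypothetical and are not part of SGLang.

```python
from dataclasses import dataclass
from typing import Callable, Protocol, Sequence, Tuple


@dataclass(frozen=True)
class Request:
    request_id: str
    session_id: str
    turn_id: int


class RoutingPolicy(Protocol):
    """Policy: decides which P and D node to use, and nothing else."""

    def choose(
        self, req: Request, p_nodes: Sequence[str], d_nodes: Sequence[str]
    ) -> Tuple[str, str]: ...


class FirstNodePolicy:
    """Trivial placeholder policy: always pick the first P and the first D node."""

    def choose(
        self, req: Request, p_nodes: Sequence[str], d_nodes: Sequence[str]
    ) -> Tuple[str, str]:
        return p_nodes[0], d_nodes[0]


def dispatch(
    req: Request,
    policy: RoutingPolicy,
    p_nodes: Sequence[str],
    d_nodes: Sequence[str],
    send: Callable[[Request, str, str], None],
) -> Tuple[str, str]:
    """Mechanism: ask the policy for a choice, then perform the actual send."""
    p_node, d_node = policy.choose(req, p_nodes, d_nodes)
    send(req, p_node, d_node)  # how the request and KV move lives behind `send`
    return p_node, d_node
```

Swapping the sticky baseline for a KV-cache-aware policy then means passing a different `policy` object, with no change to `dispatch` or `send`.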

2. Prefer simple, debuggable logic

Start with simple heuristics before complex scoring.
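
As an example of a simple heuristic, a first KV-cache-aware D-node choice could be a plain lexicographic rule: most cached tokens for the session first, then lowest load. This is only a sketch. `pick_d_node`, its inputs, and the optional `max_load` cutoff are assumptions, and how cached-token counts are obtained is left open.

```python
from typing import Dict, Optional


def pick_d_node(
    cached_tokens: Dict[str, int],
    load: Dict[str, int],
    max_load: Optional[int] = None,
) -> str:
    """Pick the D node with the most cached tokens, breaking ties by lowest load.

    cached_tokens: node -> tokens of this session's prefix assumed to be cached there
    load:          node -> in-flight requests; its keys define the candidate nodes
    max_load:      optional cutoff; nodes above it are skipped when any alternative exists
    """
    candidates = [n for n in load if max_load is None or load[n] <= max_load]
    if not candidates:
        candidates = list(load)  # every node is over the cutoff: consider them all
    # Sort key: more cached tokens first, then lower load, then name for determinism.
    return min(candidates, key=lambda n: (-cached_tokens.get(n, 0), load[n], n))
```

Every decision made by a rule like this can be explained from two logged numbers per node, which is the point of starting here rather than with a weighted score.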

3. Log everything needed to explain results

Each request should log:

  • request id
  • session id
  • turn id
  • assigned P node
  • assigned D node
  • latency
  • whether KV-cache reuse was expected, and whether it was observed
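
The fields above map naturally onto one JSON object per request, written as JSONL. The sketch below is one possible shape. The field names and the `RequestLog` / `to_jsonl_line` helpers are hypothetical, and latency is assumed to be recorded in seconds.

```python
import json
from dataclasses import asdict, dataclass
from typing import Optional


@dataclass
class RequestLog:
    """One log record per request (hypothetical field names)."""
    request_id: str
    session_id: str
    turn_id: int
    p_node: str
    d_node: str
    latency_s: float                       # assumed unit: seconds
    reuse_expected: bool                   # did the router expect KV reuse?
    reuse_observed: Optional[bool] = None  # None until the outcome is known


def to_jsonl_line(record: RequestLog) -> str:
    """Serialize one record as a single JSON line with a stable key order."""
    return json.dumps(asdict(record), sort_keys=True)


if __name__ == "__main__":
    rec = RequestLog("r1", "s1", 2, "p0", "d1", 1.25, True, False)
    print(to_jsonl_line(rec))
```

Keeping expected and observed reuse as separate fields makes mispredictions (expected but not observed) easy to filter when analyzing failure cases.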

4. Small interfaces only

Avoid over-abstraction.