Agentic PD Hybrid
Minimal prototype scaffold for evaluating session-aware and KV-cache-aware prefill/decode routing on top of SGLang PD disaggregation.
For a concise description of the project design, implemented features, current findings, and known limits, see docs/PROJECT_OVERVIEW.md.
Current implementation covers the initial MVP path in AGENTS.md:
- One-node PD/xPyD launch planning
- Trace replay plus request-level metrics logging
- Real end-to-end benchmark orchestration
Routing policy is kept separate from mechanism:
- agentic_pd_hybrid.topology and agentic_pd_hybrid.launcher handle cluster shape and SGLang command generation.
- agentic_pd_hybrid.policies handles decode selection heuristics.
- agentic_pd_hybrid.replay handles trace pacing, synthetic prompt generation, and metrics.
- agentic_pd_hybrid.sampling handles session-granularity trace sampling for live tests.
- agentic_pd_hybrid.stack / agentic_pd_hybrid.benchmark handle launching and tearing down a real PD stack.
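As a rough illustration of the kind of decode selection heuristic that belongs on the policy side, the sketch below contrasts a sticky choice with an overlap-based one. The function names, the least-loaded fallback, and the data shapes are assumptions made for this example; they are not the actual agentic_pd_hybrid.policies API.

```python
from typing import Dict, List, Set


def pick_sticky(session_id: str, assigned: Dict[str, int], loads: List[int]) -> int:
    """Hypothetical sticky policy.

    The first turn of a session goes to the least-loaded decode worker;
    later turns return to the worker recorded for that session.
    """
    if session_id in assigned:
        return assigned[session_id]
    worker = min(range(len(loads)), key=lambda i: loads[i])
    assigned[session_id] = worker
    return worker


def pick_kv_aware(hash_ids: List[int], resident: List[Set[int]], loads: List[int]) -> int:
    """Hypothetical kv-aware policy.

    Prefer the decode worker whose resident blocks overlap most with the
    request's hash_ids; break ties in favor of the lower load.
    """
    request_blocks = set(hash_ids)
    return max(
        range(len(resident)),
        key=lambda i: (len(request_blocks & resident[i]), -loads[i]),
    )
```

Both functions only return a worker index. How that index reaches the router (for example as a header) is a mechanism concern and is deliberately left out of the sketch.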
Environment
Use uv for all environment management.
Sync the environment:
uv sync
third_party/sglang vendors a clean SGLang v0.5.10 snapshot plus our local
PD/session-cache patches in later commits. Keep SGLang changes scoped under that
directory and commit them with feat(sglang): ... or fix(sglang): ... so they
stay easy to review against the vendor baseline.
CLI
Print one-node PD launch commands:
uv run agentic-pd-hybrid print-launch \
--model-path ~/models/Qwen/Qwen3-Coder-30B-A3B-Instruct \
--prefill-workers 2 \
--decode-workers 2 \
--transfer-backend mooncake
Replay the Ali trace in dry-run mode and emit request logs plus a summary:
uv run agentic-pd-hybrid replay \
--trace ~/ali-trace/trace-qwen3-coder-formatted/041715-041717.jsonl \
--policy sticky \
--prefill-workers 2 \
--decode-workers 2 \
--output outputs/sticky.jsonl
Sample a 10-minute shard at session granularity:
uv run agentic-pd-hybrid sample-sessions \
--trace ~/ali-trace/trace-qwen3-coder-formatted/041715-041717.jsonl \
--output outputs/sampled-10min.jsonl \
--target-duration-s 600 \
--session-sample-rate 0.01
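Sampling at session granularity means a session is either kept with all of its turns or dropped entirely, so multi-turn reuse patterns survive the sampling. A minimal sketch of one way to do this follows, assuming each trace record is a dict with a session_id field (an assumed field name) and using a hash of the session id as the sampling coin. It does not model --target-duration-s, and it is not necessarily how sample-sessions is implemented.

```python
import hashlib
from typing import Dict, Iterable, Iterator


def keep_session(session_id: str, sample_rate: float) -> bool:
    """Map the session id to a stable point in [0, 1) and compare it to the rate."""
    digest = hashlib.sha256(session_id.encode("utf-8")).digest()
    # Use the top 53 bits so the division is exact and the result stays below 1.0.
    bucket = (int.from_bytes(digest[:8], "big") >> 11) / 2**53
    return bucket < sample_rate


def sample_sessions(records: Iterable[Dict], sample_rate: float) -> Iterator[Dict]:
    """Yield every record whose session was selected, preserving input order."""
    for record in records:
        if keep_session(str(record["session_id"]), sample_rate):
            yield record
```

Because the decision depends only on the session id, re-running the sampler on the same trace selects the same sessions.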
Sample Ali sessions that keep the small-append KV reuse shape used by the micro-benchmark:
uv run agentic-pd-hybrid sample-sessions \
--trace ~/ali-trace/trace-qwen3-coder-formatted/041715-041717.jsonl \
--output outputs/ali-small-append.jsonl \
--profile small-append \
--target-duration-s 600 \
--session-sample-rate 0.01 \
--min-turns 2
Replay against a live router:
uv run agentic-pd-hybrid replay \
--trace ~/ali-trace/trace-qwen3-coder-formatted/041715-041717.jsonl \
--policy sticky \
--router-url http://127.0.0.1:8000 \
--model Qwen3-Coder-30B-A3B-Instruct \
--output outputs/sticky-live.jsonl
Launch a real PD stack and collect live performance numbers:
uv run agentic-pd-hybrid benchmark-live \
--trace ~/ali-trace/trace-qwen3-coder-formatted/041715-041717.jsonl \
--policy sticky \
--mechanism kvcache-centric \
--kvcache-admission-mode router \
--sample-profile small-append \
--prefill-workers 1 \
--decode-workers 1 \
--transfer-backend mooncake \
--target-duration-s 600 \
--session-sample-rate 0.01 \
--output-root outputs/live
Notes:
- The provided Ali release trace contains lengths and hash_ids, not raw prompts. Replay therefore synthesizes deterministic prompt text from hash_ids so repeated blocks remain repeated across turns.
- sticky mode emits x-smg-routing-key=<session_id>, which matches the upstream gateway's manual policy semantics for "turn1 default, turn2+ sticky".
- kv-aware computes decode placement from observed hash_ids overlap and can emit x-smg-target-worker=<index> when --header-mode target-worker is used with a compatible router decode policy.
- Live benchmarking uses the repo-local agentic_pd_hybrid.pd_router, which preserves the real prefill/decode double-request path over loopback without depending on the upstream Rust router build.
- Managed live benchmarking prefers the vendored third_party/sglang/python/sglang source tree, so local SGLang changes apply immediately without packaging a wheel.
- Live benchmarking currently targets the mooncake transfer backend, because mooncake-transfer-engine is installed and usable on this node.
- benchmark-live and replay support streaming by default for TTFT/TPOT measurement. Use --no-stream for E2E-only runs.
- kvcache-centric defaults to router-managed admission (--kvcache-admission-mode router). This keeps a router-side shadow of decode session residency and capacity, so the critical path does not issue per-request worker /server_info and /v1/loads probes. Use --kvcache-admission-mode worker only as an A/B baseline for the older worker-managed admission path.
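The first note, on synthesizing prompt text from hash_ids, can be illustrated with a small sketch. It assumes hash_ids is a list of integers with one id per block, and expands each id into a fixed word sequence, so a turn that repeats a prefix of block ids also repeats the corresponding text prefix. The vocabulary, the words-per-block count, and the function names are hypothetical, and the sketch makes no attempt to match the token lengths recorded in the trace.

```python
import hashlib
from typing import List

# Hypothetical vocabulary; any fixed word list works as long as it never changes.
_WORDS = ["alpha", "bravo", "charlie", "delta", "echo", "foxtrot", "golf", "hotel"]


def block_text(hash_id: int, words_per_block: int = 16) -> str:
    """Expand one block id into a fixed, reproducible word sequence."""
    words = []
    for i in range(words_per_block):
        digest = hashlib.sha256(f"{hash_id}:{i}".encode("utf-8")).digest()
        words.append(_WORDS[digest[0] % len(_WORDS)])
    return " ".join(words)


def synthesize_prompt(hash_ids: List[int]) -> str:
    """Concatenate per-block text so shared id prefixes yield shared text prefixes."""
    return "\n".join(block_text(h) for h in hash_ids)
```

With this construction, the prompt for hash_ids [1, 2, 3] starts with the full prompt for [1, 2], which is the property a prefix cache needs in order to see repeated blocks across turns.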
Output
Each replay writes:
- request-level metrics JSONL at the requested output path
- summary JSON at <output>.summary.json
Each request log contains:
- request id
- session id
- turn id
- assigned prefill node
- assigned decode node
- latency fields when a live router is used
- whether reuse was expected and whether block overlap was observed
- expected KV transfer blocks
- per-node load snapshot at assignment time
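For a quick look at a request log, a sketch like the following counts how many requests each decode node received. The key name decode_node is an assumption made for this example; check the actual field names in the emitted JSONL before using it.

```python
import json
from collections import Counter
from typing import Iterable


def decode_assignment_counts(lines: Iterable[str], key: str = "decode_node") -> Counter:
    """Count requests per decode node from JSONL lines, skipping blank lines."""
    counts: Counter = Counter()
    for line in lines:
        line = line.strip()
        if not line:
            continue
        record = json.loads(line)
        # Records without the key are grouped under None.
        counts[record.get(key)] += 1
    return counts
```

The function accepts any iterable of lines, so an open file handle for a path such as outputs/sticky.jsonl can be passed to it directly.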