## Agentic PD Hybrid

Minimal prototype scaffold for evaluating session-aware and KV-cache-aware prefill/decode routing on top of SGLang PD disaggregation.

For a concise description of the project design, implemented features, current findings, and known limits, see [docs/PROJECT_OVERVIEW.md](docs/PROJECT_OVERVIEW.md).

The current implementation covers the initial MVP path in `AGENTS.md`:

1. One-node PD/xPyD launch planning
2. Trace replay plus request-level metrics logging
3. Real end-to-end benchmark orchestration

Routing policy is kept separate from mechanism:

- `agentic_pd_hybrid.topology` and `agentic_pd_hybrid.launcher` handle cluster shape and SGLang command generation.
- `agentic_pd_hybrid.policies` handles decode selection heuristics.
- `agentic_pd_hybrid.replay` handles trace pacing, synthetic prompt generation, and metrics.
- `agentic_pd_hybrid.sampling` handles session-granularity trace sampling for live tests.
- `agentic_pd_hybrid.stack` and `agentic_pd_hybrid.benchmark` handle launching and tearing down a real PD stack.

## Environment

Use `uv` for all environment management.

Sync the environment:

```bash
uv sync
```

`third_party/sglang` vendors a clean SGLang `v0.5.10` snapshot plus our local PD/session-cache patches in later commits. Keep SGLang changes scoped under that directory and commit them with `feat(sglang): ...` or `fix(sglang): ...` so they stay easy to review against the vendor baseline.

## CLI

Print one-node PD launch commands:

```bash
uv run agentic-pd-hybrid print-launch \
  --model-path ~/models/Qwen/Qwen3-Coder-30B-A3B-Instruct \
  --prefill-workers 2 \
  --decode-workers 2 \
  --transfer-backend mooncake
```

Replay the Ali trace in dry-run mode and emit request logs plus a summary:

```bash
uv run agentic-pd-hybrid replay \
  --trace ~/ali-trace/trace-qwen3-coder-formatted/041715-041717.jsonl \
  --policy sticky \
  --prefill-workers 2 \
  --decode-workers 2 \
  --output outputs/sticky.jsonl
```

Sample a 10-minute shard at session granularity:

```bash
uv run agentic-pd-hybrid sample-sessions \
  --trace ~/ali-trace/trace-qwen3-coder-formatted/041715-041717.jsonl \
  --output outputs/sampled-10min.jsonl \
  --target-duration-s 600 \
  --session-sample-rate 0.01
```

Sample Ali sessions that keep the small-append KV reuse shape used by the micro-benchmark:

```bash
uv run agentic-pd-hybrid sample-sessions \
  --trace ~/ali-trace/trace-qwen3-coder-formatted/041715-041717.jsonl \
  --output outputs/ali-small-append.jsonl \
  --profile small-append \
  --target-duration-s 600 \
  --session-sample-rate 0.01 \
  --min-turns 2
```

Replay against a live router:

```bash
uv run agentic-pd-hybrid replay \
  --trace ~/ali-trace/trace-qwen3-coder-formatted/041715-041717.jsonl \
  --policy sticky \
  --router-url http://127.0.0.1:8000 \
  --model Qwen3-Coder-30B-A3B-Instruct \
  --output outputs/sticky-live.jsonl
```

Launch a real PD stack and collect live performance numbers:

```bash
uv run agentic-pd-hybrid benchmark-live \
  --trace ~/ali-trace/trace-qwen3-coder-formatted/041715-041717.jsonl \
  --policy sticky \
  --mechanism kvcache-centric \
  --kvcache-admission-mode router \
  --sample-profile small-append \
  --prefill-workers 1 \
  --decode-workers 1 \
  --transfer-backend mooncake \
  --target-duration-s 600 \
  --session-sample-rate 0.01 \
  --output-root outputs/live
```

Notes:

- The provided Ali release trace contains lengths and `hash_ids`, not raw prompts. Replay therefore synthesizes deterministic prompt text from `hash_ids` so repeated blocks remain repeated across turns.
- `sticky` mode emits `x-smg-routing-key=`, which matches the upstream gateway's `manual` policy semantics for "turn1 default, turn2+ sticky".
- `kv-aware` computes decode placement from observed `hash_ids` overlap and can emit `x-smg-target-worker=` when `--header-mode target-worker` is used with a compatible router decode policy.
- Live benchmarking uses the repo-local `agentic_pd_hybrid.pd_router`, which preserves the real prefill/decode double-request path over loopback without depending on the upstream Rust router build.
- Managed live benchmarking prefers the vendored `third_party/sglang/python/sglang` source tree, so local SGLang changes apply immediately without packaging a wheel.
- Live benchmarking currently targets the `mooncake` transfer backend, because `mooncake-transfer-engine` is installed and usable on this node.
- `benchmark-live` and `replay` support streaming by default for TTFT/TPOT measurement. Use `--no-stream` for E2E-only runs.
- `kvcache-centric` defaults to router-managed admission (`--kvcache-admission-mode router`). This keeps a router-side shadow of decode session residency and capacity, so the critical path does not issue per-request worker `/server_info` and `/v1/loads` probes. Use `--kvcache-admission-mode worker` only as an A/B baseline for the older worker-managed admission path.

## Output

Each replay writes:

- request-level metrics JSONL at the requested output path
- summary JSON at `.summary.json`

Each request log contains:

- request id
- session id
- turn id
- assigned prefill node
- assigned decode node
- latency fields when a live router is used
- whether reuse was expected and whether block overlap was observed
- expected KV transfer blocks
- per-node load snapshot at assignment time
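
The request-level JSONL described above can be inspected with a few lines of Python. The following is a minimal sketch, not part of the package: it measures how often a follow-up turn stayed on the same decode node as the previous turn of its session. The key names `session_id`, `turn_id`, and `decode_node` are assumptions chosen for illustration; check an actual log line for the real schema and pass the matching keys.

```python
import json
from collections import defaultdict
from pathlib import Path


def load_records(path):
    """Read one JSON object per non-empty line of a JSONL file."""
    with Path(path).open(encoding="utf-8") as fh:
        return [json.loads(line) for line in fh if line.strip()]


def decode_stickiness(records, session_key="session_id",
                      turn_key="turn_id", decode_key="decode_node"):
    """Fraction of follow-up turns served by the previous turn's decode node.

    The default key names are hypothetical, not the tool's confirmed schema.
    Returns None when no session has more than one turn.
    """
    # Group request records by session.
    by_session = defaultdict(list)
    for rec in records:
        by_session[rec[session_key]].append(rec)

    same = total = 0
    for turns in by_session.values():
        # Compare each turn with the one immediately before it.
        turns.sort(key=lambda r: r[turn_key])
        for prev, cur in zip(turns, turns[1:]):
            total += 1
            same += prev[decode_key] == cur[decode_key]
    return same / total if total else None
```

For example, `decode_stickiness(load_records("outputs/sticky.jsonl"))` would be expected to sit close to 1.0 for a `sticky` run if the assumed keys match the log, which makes it a quick sanity check before comparing latency fields across policies.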