Agentic PD Hybrid

Minimal prototype scaffold for evaluating session-aware and KV-cache-aware prefill/decode routing on top of SGLang PD disaggregation.

For a concise description of the project design, implemented features, current findings, and known limits, see docs/PROJECT_OVERVIEW.md.

The current implementation covers the initial MVP path described in AGENTS.md:

  1. One-node PD/xPyD launch planning
  2. Trace replay plus request-level metrics logging
  3. Real end-to-end benchmark orchestration

Routing policy is kept separate from mechanism:

  • agentic_pd_hybrid.topology and agentic_pd_hybrid.launcher handle cluster shape and SGLang command generation.
  • agentic_pd_hybrid.policies handles decode selection heuristics.
  • agentic_pd_hybrid.replay handles trace pacing, synthetic prompt generation, and metrics.
  • agentic_pd_hybrid.sampling handles session-granularity trace sampling for live tests.
  • agentic_pd_hybrid.stack and agentic_pd_hybrid.benchmark handle launching and tearing down a real PD stack.
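
To make the policy/mechanism split concrete, here is a minimal sketch of what a decode selection heuristic could look like when it knows nothing about launch commands or HTTP. The names Request, DecodePolicy, StickyPolicy, and KvAwarePolicy are hypothetical and are not taken from agentic_pd_hybrid.policies; the tie-breaking rule in the KV-aware variant is also an assumption.

```python
from dataclasses import dataclass, field
from typing import Protocol, Sequence


@dataclass
class Request:
    # Hypothetical request shape: one turn of one session plus its block hashes.
    session_id: str
    turn_id: int
    hash_ids: Sequence[int]


class DecodePolicy(Protocol):
    # A policy only answers "which decode worker index?"; it never launches anything.
    def select_decode(self, request: Request, num_decode_workers: int) -> int: ...


@dataclass
class StickyPolicy:
    """Pin every turn of a session to the worker chosen on its first turn."""

    assignments: dict[str, int] = field(default_factory=dict)
    next_worker: int = 0

    def select_decode(self, request: Request, num_decode_workers: int) -> int:
        if request.session_id not in self.assignments:
            # First turn: round-robin placement (an assumption for this sketch).
            self.assignments[request.session_id] = self.next_worker % num_decode_workers
            self.next_worker += 1
        return self.assignments[request.session_id]


@dataclass
class KvAwarePolicy:
    """Prefer the worker whose previously observed blocks overlap the request most."""

    seen: dict[int, set[int]] = field(default_factory=dict)

    def select_decode(self, request: Request, num_decode_workers: int) -> int:
        blocks = set(request.hash_ids)

        def score(worker: int) -> tuple[int, int, int]:
            resident = self.seen.get(worker, set())
            # Highest overlap wins; ties go to the emptier worker, then the lower index.
            return (len(blocks & resident), -len(resident), -worker)

        best = max(range(num_decode_workers), key=score)
        self.seen.setdefault(best, set()).update(blocks)
        return best
```

Because both classes satisfy the same protocol, a replay loop written against DecodePolicy could swap heuristics without touching topology or launcher code.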

Environment

Use uv for all environment management.

Sync the environment:

uv sync

Local experiments can use a repo-local third_party/sglang checkout of SGLang v0.5.10, but that heavyweight checkout is intentionally not committed here.

CLI

Print one-node PD launch commands:

uv run agentic-pd-hybrid print-launch \
  --model-path ~/models/Qwen/Qwen3-Coder-30B-A3B-Instruct \
  --prefill-workers 2 \
  --decode-workers 2 \
  --transfer-backend mooncake

Replay the Ali trace in dry-run mode and emit request logs plus a summary:

uv run agentic-pd-hybrid replay \
  --trace ~/ali-trace/trace-qwen3-coder-formatted/041715-041717.jsonl \
  --policy sticky \
  --prefill-workers 2 \
  --decode-workers 2 \
  --output outputs/sticky.jsonl

Sample a 10-minute shard at session granularity:

uv run agentic-pd-hybrid sample-sessions \
  --trace ~/ali-trace/trace-qwen3-coder-formatted/041715-041717.jsonl \
  --output outputs/sampled-10min.jsonl \
  --target-duration-s 600 \
  --session-sample-rate 0.01

Sample Ali sessions that keep the small-append KV reuse shape used by the micro-benchmark:

uv run agentic-pd-hybrid sample-sessions \
  --trace ~/ali-trace/trace-qwen3-coder-formatted/041715-041717.jsonl \
  --output outputs/ali-small-append.jsonl \
  --profile small-append \
  --target-duration-s 600 \
  --session-sample-rate 0.01 \
  --min-turns 2

Replay against a live router:

uv run agentic-pd-hybrid replay \
  --trace ~/ali-trace/trace-qwen3-coder-formatted/041715-041717.jsonl \
  --policy sticky \
  --router-url http://127.0.0.1:8000 \
  --model Qwen3-Coder-30B-A3B-Instruct \
  --output outputs/sticky-live.jsonl

Launch a real PD stack and collect live performance numbers:

uv run agentic-pd-hybrid benchmark-live \
  --trace ~/ali-trace/trace-qwen3-coder-formatted/041715-041717.jsonl \
  --policy sticky \
  --mechanism kvcache-centric \
  --kvcache-admission-mode router \
  --sample-profile small-append \
  --prefill-workers 1 \
  --decode-workers 1 \
  --transfer-backend mooncake \
  --target-duration-s 600 \
  --session-sample-rate 0.01 \
  --output-root outputs/live

Notes:

  • The provided Ali release trace contains lengths and hash_ids, not raw prompts. Replay therefore synthesizes deterministic prompt text from hash_ids so repeated blocks remain repeated across turns.
  • sticky mode emits x-smg-routing-key=<session_id>, which matches the upstream gateway's manual policy semantics for "turn1 default, turn2+ sticky".
  • kv-aware computes decode placement from observed hash_ids overlap and can emit x-smg-target-worker=<index> when --header-mode target-worker is used with a compatible router decode policy.
  • Live benchmarking uses the repo-local agentic_pd_hybrid.pd_router, which preserves the real prefill/decode double-request path over loopback without depending on the upstream Rust router build.
  • Managed live benchmarking prefers a local third_party/sglang/python/sglang checkout when it exists, so local SGLang source changes take effect immediately without packaging a wheel.
  • Live benchmarking currently targets the mooncake transfer backend, because mooncake-transfer-engine is installed and usable on this node.
  • benchmark-live and replay stream responses by default so that TTFT and TPOT can be measured. Use --no-stream for runs that only need end-to-end (E2E) latency.
  • kvcache-centric defaults to router-managed admission (--kvcache-admission-mode router). This keeps a router-side shadow of decode session residency and capacity, so the critical path does not issue per-request worker /server_info and /v1/loads probes. Use --kvcache-admission-mode worker only as an A/B baseline for the older worker-managed admission path.
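
The first note above says prompts are synthesized deterministically from hash_ids. One way this could work is sketched below: each hash id is expanded into a fixed run of words, so a later turn that repeats the same leading hash_ids also repeats the same leading text. The function names, the word list, and the words-per-block value are all hypothetical; the actual replay module may map blocks to tokens differently.

```python
import hashlib
from typing import Iterable

# Hypothetical vocabulary; any fixed list works as long as it never changes.
WORDS = (
    "alpha", "bravo", "charlie", "delta", "echo", "foxtrot", "golf", "hotel",
    "india", "juliet", "kilo", "lima", "mike", "november", "oscar", "papa",
)


def block_text(hash_id: int, words_per_block: int = 16) -> str:
    """Expand one block hash id into a fixed, reproducible run of words."""
    words: list[str] = []
    counter = 0
    while len(words) < words_per_block:
        # Hashing (hash_id, counter) gives 32 bytes that depend only on the inputs.
        digest = hashlib.sha256(f"{hash_id}:{counter}".encode("utf-8")).digest()
        words.extend(WORDS[byte % len(WORDS)] for byte in digest)
        counter += 1
    return " ".join(words[:words_per_block])


def synthesize_prompt(hash_ids: Iterable[int]) -> str:
    """Concatenate per-block text so shared hash_id prefixes give shared text prefixes."""
    return "\n".join(block_text(h) for h in hash_ids)


turn1 = synthesize_prompt([101, 102])
turn2 = synthesize_prompt([101, 102, 103])
# The second turn extends the first, which is the property prefix caching relies on.
print(turn2.startswith(turn1))
```

Words are only a stand-in for tokens here, so the sketch does not reproduce the trace's recorded lengths exactly.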

Output

Each replay writes:

  • request-level metrics JSONL at the requested output path
  • summary JSON at <output>.summary.json

Each request log contains:

  • request id
  • session id
  • turn id
  • assigned prefill node
  • assigned decode node
  • latency fields when a live router is used
  • whether reuse was expected and whether block overlap was observed
  • expected KV transfer blocks
  • per-node load snapshot at assignment time
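
As a rough illustration of how the request-level JSONL might be consumed, the sketch below loads a log and checks two things: how requests were spread across decode nodes, and whether any session was split across more than one decode node. The key names session_id and decode_node are assumptions made for this example; check an actual log for the real field names before reusing it.

```python
import json
from collections import Counter
from pathlib import Path
from typing import Union


def load_request_log(path: Union[str, Path]) -> list[dict]:
    """Read one JSON object per non-empty line."""
    with open(path, encoding="utf-8") as fh:
        return [json.loads(line) for line in fh if line.strip()]


def decode_assignment_counts(records: list[dict]) -> Counter:
    """Count requests per assigned decode node (assumed key: 'decode_node')."""
    return Counter(r["decode_node"] for r in records)


def split_sessions(records: list[dict]) -> list[str]:
    """Return sessions whose turns landed on more than one decode node."""
    nodes_by_session: dict[str, set] = {}
    for r in records:
        nodes_by_session.setdefault(r["session_id"], set()).add(r["decode_node"])
    return sorted(s for s, nodes in nodes_by_session.items() if len(nodes) > 1)
```

Under a sticky policy, split_sessions would be expected to return an empty list, which makes it a cheap sanity check after a replay.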