## Agentic PD Hybrid

Minimal prototype scaffold for evaluating session-aware and KV-cache-aware
prefill/decode routing on top of SGLang PD disaggregation.

For a concise description of the project design, implemented features, current
findings, and known limits, see [docs/PROJECT_OVERVIEW.md](docs/PROJECT_OVERVIEW.md).

Current implementation covers the initial MVP path in `AGENTS.md`:

1. One-node PD/xPyD launch planning
2. Trace replay plus request-level metrics logging
3. Real end-to-end benchmark orchestration

Routing policy is kept separate from mechanism:

- `agentic_pd_hybrid.topology` and `agentic_pd_hybrid.launcher`
  handle cluster shape and SGLang command generation.
- `agentic_pd_hybrid.policies`
  handles decode selection heuristics.
- `agentic_pd_hybrid.replay`
  handles trace pacing, synthetic prompt generation, and metrics.
- `agentic_pd_hybrid.sampling`
  handles session-granularity trace sampling for live tests.
- `agentic_pd_hybrid.stack` and `agentic_pd_hybrid.benchmark`
  handle launching and tearing down a real PD stack.

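
The policy/mechanism split above can be illustrated with a small sketch. This is not the code in `agentic_pd_hybrid.policies`: the names `Request`, `DecodePolicy`, `StickyPolicy`, and `KVAwarePolicy` are assumptions made for illustration, and the tie-breaking rule is one plausible choice. It only shows how a decode-selection heuristic can stay independent of how workers are launched or how requests are sent.

```python
from dataclasses import dataclass, field
from typing import Protocol


@dataclass(frozen=True)
class Request:
    """Hypothetical view of a trace request as seen by a policy."""
    session_id: str
    turn_id: int
    hash_ids: tuple[int, ...]


class DecodePolicy(Protocol):
    """A policy only answers: which decode worker index gets this request?"""
    def select_decode(self, request: Request, num_decode: int) -> int: ...


@dataclass
class StickyPolicy:
    """Pin every turn of a session to the worker chosen on its first turn."""
    _assigned: dict[str, int] = field(default_factory=dict)
    _next: int = 0

    def select_decode(self, request: Request, num_decode: int) -> int:
        if request.session_id not in self._assigned:
            # First turn of a session: use round-robin as the default choice.
            self._assigned[request.session_id] = self._next % num_decode
            self._next += 1
        return self._assigned[request.session_id]


@dataclass
class KVAwarePolicy:
    """Prefer the worker whose previously seen blocks overlap the request most."""
    _seen: dict[int, set[int]] = field(default_factory=dict)

    def select_decode(self, request: Request, num_decode: int) -> int:
        blocks = set(request.hash_ids)

        def score(worker: int) -> tuple[int, int, int]:
            seen = self._seen.get(worker, set())
            # Highest overlap wins; ties go to the worker holding fewer blocks,
            # then to the lowest index.
            return (len(blocks & seen), -len(seen), -worker)

        best = max(range(num_decode), key=score)
        self._seen.setdefault(best, set()).update(blocks)
        return best
```

Because both classes satisfy the same `DecodePolicy` protocol, a replay loop written against that protocol could swap heuristics without touching launch or transport code.
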
## Environment

Use `uv` for all environment management.

Sync the environment:

```bash
uv sync
```

Local experiments can use a repo-local `third_party/sglang` checkout of SGLang
`v0.5.10`, but that heavyweight checkout is intentionally not committed here.

## CLI

Print one-node PD launch commands:

```bash
uv run agentic-pd-hybrid print-launch \
  --model-path ~/models/Qwen/Qwen3-Coder-30B-A3B-Instruct \
  --prefill-workers 2 \
  --decode-workers 2 \
  --transfer-backend mooncake
```

Replay the Ali trace in dry-run mode and emit request logs plus a summary:

```bash
uv run agentic-pd-hybrid replay \
  --trace ~/ali-trace/trace-qwen3-coder-formatted/041715-041717.jsonl \
  --policy sticky \
  --prefill-workers 2 \
  --decode-workers 2 \
  --output outputs/sticky.jsonl
```

Sample a 10-minute shard at session granularity:

```bash
uv run agentic-pd-hybrid sample-sessions \
  --trace ~/ali-trace/trace-qwen3-coder-formatted/041715-041717.jsonl \
  --output outputs/sampled-10min.jsonl \
  --target-duration-s 600 \
  --session-sample-rate 0.01
```

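
To make "session granularity" concrete, here is a minimal sketch of one way such sampling could work. It is not the implementation in `agentic_pd_hybrid.sampling`. It assumes each trace record is a dict with hypothetical `session_id` and `timestamp` keys, that a session is kept or dropped as a whole, and that a session belongs to the shard when its first request falls inside the target window. Hashing the session id keeps the choice reproducible across runs.

```python
import hashlib
from typing import Iterable


def keep_session(session_id: str, rate: float, seed: str = "0") -> bool:
    """Deterministically keep roughly `rate` of all sessions."""
    digest = hashlib.sha256(f"{seed}:{session_id}".encode()).digest()
    # Map the first 8 bytes of the digest to a float in [0, 1).
    bucket = int.from_bytes(digest[:8], "big") / 2**64
    return bucket < rate


def sample_sessions(
    records: Iterable[dict], rate: float, target_duration_s: float
) -> list[dict]:
    """Return all requests of the sampled sessions, in timestamp order."""
    ordered = sorted(records, key=lambda r: r["timestamp"])
    if not ordered:
        return []
    start = ordered[0]["timestamp"]

    # Record when each session first appears in the trace.
    first_seen: dict[str, float] = {}
    for record in ordered:
        first_seen.setdefault(record["session_id"], record["timestamp"])

    kept = {
        sid
        for sid, t0 in first_seen.items()
        if t0 - start < target_duration_s and keep_session(sid, rate)
    }
    # Sessions are never split: later turns of a kept session are retained
    # even if they arrive after the window closes.
    return [r for r in ordered if r["session_id"] in kept]
```
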
Sample Ali sessions that keep the small-append KV reuse shape used by the
micro-benchmark:

```bash
uv run agentic-pd-hybrid sample-sessions \
  --trace ~/ali-trace/trace-qwen3-coder-formatted/041715-041717.jsonl \
  --output outputs/ali-small-append.jsonl \
  --profile small-append \
  --target-duration-s 600 \
  --session-sample-rate 0.01 \
  --min-turns 2
```

Replay against a live router:

```bash
uv run agentic-pd-hybrid replay \
  --trace ~/ali-trace/trace-qwen3-coder-formatted/041715-041717.jsonl \
  --policy sticky \
  --router-url http://127.0.0.1:8000 \
  --model Qwen3-Coder-30B-A3B-Instruct \
  --output outputs/sticky-live.jsonl
```

Launch a real PD stack and collect live performance numbers:

```bash
uv run agentic-pd-hybrid benchmark-live \
  --trace ~/ali-trace/trace-qwen3-coder-formatted/041715-041717.jsonl \
  --policy sticky \
  --mechanism kvcache-centric \
  --kvcache-admission-mode router \
  --sample-profile small-append \
  --prefill-workers 1 \
  --decode-workers 1 \
  --transfer-backend mooncake \
  --target-duration-s 600 \
  --session-sample-rate 0.01 \
  --output-root outputs/live
```

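
Live runs measure TTFT and TPOT from the streamed response. The sketch below shows the usual way these numbers are derived from arrival timestamps. It is an illustration under stated assumptions, not the project's measurement code: `Latency` and `latency_from_stream` are hypothetical names, and each timestamp is treated as one output token, which is only an approximation when a streamed chunk carries several tokens.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass(frozen=True)
class Latency:
    ttft_s: float            # time to first token
    tpot_s: Optional[float]  # mean time per output token after the first
    e2e_s: float             # time until the last token arrived


def latency_from_stream(sent_at: float, token_times: Sequence[float]) -> Latency:
    """Derive TTFT, TPOT, and E2E latency from token arrival timestamps."""
    if not token_times:
        raise ValueError("need at least one token arrival time")
    ttft = token_times[0] - sent_at
    e2e = token_times[-1] - sent_at
    n = len(token_times)
    # TPOT is undefined for a single-token response.
    tpot = (e2e - ttft) / (n - 1) if n > 1 else None
    return Latency(ttft_s=ttft, tpot_s=tpot, e2e_s=e2e)
```
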
Notes:

- The provided Ali release trace contains lengths and `hash_ids`, not raw
  prompts. Replay therefore synthesizes deterministic prompt text from
  `hash_ids` so repeated blocks remain repeated across turns.
- `sticky` mode emits `x-smg-routing-key=<session_id>`, which matches the
  upstream gateway's `manual` policy semantics for "turn1 default, turn2+
  sticky".
- `kv-aware` computes decode placement from observed `hash_ids` overlap and
  can emit `x-smg-target-worker=<index>` when `--header-mode target-worker` is
  used with a compatible router decode policy.
- Live benchmarking uses the repo-local `agentic_pd_hybrid.pd_router`, which
  preserves the real prefill/decode double-request path over loopback without
  depending on the upstream Rust router build.
- Managed live benchmarking prefers a local
  `third_party/sglang/python/sglang` checkout when it exists, so local SGLang
  source changes take effect immediately without packaging a wheel.
- Live benchmarking currently targets the `mooncake` transfer backend, because
  `mooncake-transfer-engine` is installed and usable on this node.
- `benchmark-live` and `replay` stream responses by default so that TTFT and
  TPOT can be measured. Use `--no-stream` for runs that only need end-to-end
  (E2E) latency.
- `kvcache-centric` defaults to router-managed admission
  (`--kvcache-admission-mode router`). This keeps a router-side shadow of
  decode session residency and capacity, so the critical path does not issue
  per-request worker `/server_info` and `/v1/loads` probes. Use
  `--kvcache-admission-mode worker` only as an A/B baseline for the older
  worker-managed admission path.

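
The first note, on synthesizing prompts from `hash_ids`, can be sketched as follows. This is an illustration and not the code in `agentic_pd_hybrid.replay`: `block_text`, `synthesize_prompt`, and `WORDS_PER_BLOCK` are hypothetical, and the sketch does not try to match the token lengths recorded in the trace. The point is that each hash id always expands to the same text, so a block shared between turns yields an identical prompt segment.

```python
import hashlib

WORDS_PER_BLOCK = 16  # hypothetical block size, chosen only for illustration


def block_text(hash_id: int, words: int = WORDS_PER_BLOCK) -> str:
    """Return the same pseudo-text for the same hash id on every call."""
    pieces = []
    for i in range(words):
        digest = hashlib.sha256(f"{hash_id}:{i}".encode()).hexdigest()
        pieces.append(digest[:6])
    return " ".join(pieces)


def synthesize_prompt(hash_ids: list[int]) -> str:
    """Concatenate per-block text in trace order, one block per line."""
    return "\n".join(block_text(h) for h in hash_ids)


# A later turn that appends one block keeps the earlier turn as an exact prefix.
turn1 = synthesize_prompt([101, 102])
turn2 = synthesize_prompt([101, 102, 103])
assert turn2.startswith(turn1)
```

With prompts built this way, a prefix cache on the serving side sees the same leading text whenever the trace repeats the same leading `hash_ids`.
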
## Output

Each replay writes:

- request-level metrics JSONL at the requested output path
- summary JSON at `<output>.summary.json`

Each request log contains:

- request id
- session id
- turn id
- assigned prefill node
- assigned decode node
- latency fields when a live router is used
- whether reuse was expected and whether block overlap was observed
- expected KV transfer blocks
- per-node load snapshot at assignment time
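
A request log in this shape is easy to post-process. The sketch below reads a JSONL file and counts requests per decode node plus how often expected reuse showed block overlap. The key names `decode_node`, `reuse_expected`, and `overlap_observed` are assumptions made for illustration; check a real log line for the actual field names before using it.

```python
import json
from collections import Counter
from pathlib import Path


def load_request_log(path: str) -> list[dict]:
    """Read one JSON object per non-empty line."""
    with Path(path).open() as f:
        return [json.loads(line) for line in f if line.strip()]


def summarize(records: list[dict]) -> dict:
    """Aggregate decode placement and reuse hits (field names are hypothetical)."""
    per_decode = Counter(r["decode_node"] for r in records)
    expected = [r for r in records if r["reuse_expected"]]
    hits = sum(1 for r in expected if r["overlap_observed"])
    return {
        "requests": len(records),
        "per_decode_node": dict(per_decode),
        # None when no request expected reuse, to avoid dividing by zero.
        "reuse_hit_rate": hits / len(expected) if expected else None,
    }
```

For example, `summarize(load_request_log("outputs/sticky.jsonl"))` would summarize the dry-run replay shown in the CLI section, provided the log uses these keys.
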