## Agentic PD Hybrid

Minimal prototype scaffold for evaluating session-aware and KV-cache-aware
prefill/decode routing on top of SGLang PD disaggregation.

For a concise description of the project design, implemented features, current
findings, and known limits, see [docs/PROJECT_OVERVIEW.md](docs/PROJECT_OVERVIEW.md).

The current implementation covers the initial MVP path described in `AGENTS.md`:

1. One-node PD/xPyD launch planning
2. Trace replay plus request-level metrics logging
3. Real end-to-end benchmark orchestration

Routing policy is kept separate from mechanism:

- `agentic_pd_hybrid.topology` and `agentic_pd_hybrid.launcher`
  handle cluster shape and SGLang command generation.
- `agentic_pd_hybrid.policies`
  handles decode selection heuristics.
- `agentic_pd_hybrid.replay`
  handles trace pacing, synthetic prompt generation, and metrics.
- `agentic_pd_hybrid.sampling`
  handles session-granularity trace sampling for live tests.
- `agentic_pd_hybrid.stack` / `agentic_pd_hybrid.benchmark`
  handle launching and tearing down a real PD stack.
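
To illustrate what keeping policy separate from mechanism can look like, a decode selection heuristic can be written as a pure function over router-visible state. The sketch below is not the code in `agentic_pd_hybrid.policies`; the function names `choose_sticky` and `choose_kv_aware`, the per-worker `loads` list, and the `resident` block sets are all hypothetical.

```python
from typing import Dict, List, Set


def choose_sticky(session_id: str, assignments: Dict[str, int],
                  loads: List[int]) -> int:
    """Reuse the session's previous decode worker, else take the least loaded one."""
    if session_id in assignments:
        return assignments[session_id]
    # First turn of the session: fall back to a simple least-loaded default.
    worker = min(range(len(loads)), key=lambda i: loads[i])
    assignments[session_id] = worker
    return worker


def choose_kv_aware(hash_ids: List[int], resident: List[Set[int]],
                    loads: List[int]) -> int:
    """Pick the worker whose resident blocks overlap most with the request.

    Ties on overlap are broken in favor of the lower load.
    """
    wanted = set(hash_ids)
    return max(
        range(len(resident)),
        key=lambda i: (len(wanted & resident[i]), -loads[i]),
    )
```

Because neither function touches process launch or HTTP plumbing, heuristics of this shape can be exercised in dry-run replay without a live stack.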

## Environment

Use `uv` for all environment management.

Sync the environment:

```bash
uv sync
```

`third_party/sglang` vendors a clean SGLang `v0.5.10` snapshot plus our local
PD/session-cache patches in later commits. Keep SGLang changes scoped under that
directory and commit them with `feat(sglang): ...` or `fix(sglang): ...` so they
stay easy to review against the vendor baseline.

## CLI

Print one-node PD launch commands:

```bash
uv run agentic-pd-hybrid print-launch \
  --model-path ~/models/Qwen/Qwen3-Coder-30B-A3B-Instruct \
  --prefill-workers 2 \
  --decode-workers 2 \
  --transfer-backend mooncake
```

Replay the Ali trace in dry-run mode and emit request logs plus a summary:

```bash
uv run agentic-pd-hybrid replay \
  --trace ~/ali-trace/trace-qwen3-coder-formatted/041715-041717.jsonl \
  --policy sticky \
  --prefill-workers 2 \
  --decode-workers 2 \
  --output outputs/sticky.jsonl
```

Sample a 10-minute shard at session granularity:

```bash
uv run agentic-pd-hybrid sample-sessions \
  --trace ~/ali-trace/trace-qwen3-coder-formatted/041715-041717.jsonl \
  --output outputs/sampled-10min.jsonl \
  --target-duration-s 600 \
  --session-sample-rate 0.01
```

Sample Ali sessions that keep the small-append KV reuse shape used by the
micro-benchmark:

```bash
uv run agentic-pd-hybrid sample-sessions \
  --trace ~/ali-trace/trace-qwen3-coder-formatted/041715-041717.jsonl \
  --output outputs/ali-small-append.jsonl \
  --profile small-append \
  --target-duration-s 600 \
  --session-sample-rate 0.01 \
  --min-turns 2
```

Replay against a live router:

```bash
uv run agentic-pd-hybrid replay \
  --trace ~/ali-trace/trace-qwen3-coder-formatted/041715-041717.jsonl \
  --policy sticky \
  --router-url http://127.0.0.1:8000 \
  --model Qwen3-Coder-30B-A3B-Instruct \
  --output outputs/sticky-live.jsonl
```

Launch a real PD stack and collect live performance numbers:

```bash
uv run agentic-pd-hybrid benchmark-live \
  --trace ~/ali-trace/trace-qwen3-coder-formatted/041715-041717.jsonl \
  --policy sticky \
  --mechanism kvcache-centric \
  --kvcache-admission-mode router \
  --sample-profile small-append \
  --prefill-workers 1 \
  --decode-workers 1 \
  --transfer-backend mooncake \
  --target-duration-s 600 \
  --session-sample-rate 0.01 \
  --output-root outputs/live
```

Notes:

- The provided Ali release trace contains lengths and `hash_ids`, not raw
  prompts. Replay therefore synthesizes deterministic prompt text from
  `hash_ids` so repeated blocks remain repeated across turns.
- `sticky` mode emits `x-smg-routing-key=<session_id>`, which matches the
  upstream gateway's `manual` policy semantics: turn 1 takes the default
  placement, and turn 2 onward stays sticky.
- `kv-aware` computes decode placement from observed `hash_ids` overlap and
  can emit `x-smg-target-worker=<index>` when `--header-mode target-worker` is
  used with a compatible router decode policy.
- Live benchmarking uses the repo-local `agentic_pd_hybrid.pd_router`, which
  preserves the real prefill/decode double-request path over loopback without
  depending on the upstream Rust router build.
- Managed live benchmarking prefers the vendored
  `third_party/sglang/python/sglang` source tree, so local SGLang changes apply
  immediately without packaging a wheel.
- Live benchmarking currently targets the `mooncake` transfer backend, because
  `mooncake-transfer-engine` is installed and usable on the current
  development node.
- `benchmark-live` and `replay` stream responses by default so that TTFT and
  TPOT can be measured. Pass `--no-stream` when only end-to-end latency is
  needed.
- `kvcache-centric` defaults to router-managed admission
  (`--kvcache-admission-mode router`). This keeps a router-side shadow of
  decode session residency and capacity, so the critical path does not issue
  per-request worker `/server_info` and `/v1/loads` probes. Use
  `--kvcache-admission-mode worker` only as an A/B baseline for the older
  worker-managed admission path.
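
The first note above says that prompts are synthesized deterministically from `hash_ids`. One way to get that property is to derive each block's text from a hash of its id, so the same id always yields the same text. This is a minimal sketch under that assumption, not the replay module's actual generator; `block_text`, `synthesize_prompt`, the word list, and `words_per_block` are hypothetical, and the sketch makes no attempt to match the trace's recorded token lengths.

```python
import hashlib
from typing import List

# Hypothetical vocabulary; any fixed word list would do.
_WORDS = ["alpha", "bravo", "charlie", "delta", "echo", "foxtrot", "golf", "hotel"]


def block_text(hash_id: int, words_per_block: int = 16) -> str:
    """Map one hash id to a fixed run of words derived from its SHA-256 digest."""
    digest = hashlib.sha256(str(hash_id).encode("utf-8")).digest()
    return " ".join(
        _WORDS[digest[i % len(digest)] % len(_WORDS)]
        for i in range(words_per_block)
    )


def synthesize_prompt(hash_ids: List[int]) -> str:
    """Concatenate per-block text, one block per line, in trace order."""
    return "\n".join(block_text(h) for h in hash_ids)
```

With this construction, a later turn whose `hash_ids` extend an earlier turn's list produces a prompt that begins with the earlier prompt, which is the shared-prefix shape that KV reuse depends on.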

## Output

Each replay writes:

- request-level metrics JSONL at the requested output path
- summary JSON at `<output>.summary.json`

Each request log contains:

- request id
- session id
- turn id
- assigned prefill node
- assigned decode node
- latency fields when a live router is used
- whether reuse was expected and whether block overlap was observed
- expected KV transfer blocks
- per-node load snapshot at assignment time
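
For quick inspection, the two output files can be read with the standard library alone. The helpers below are a sketch: `read_request_log`, `summary_path`, and `count_by` are hypothetical names, the summary path assumes that `.summary.json` is appended to the full output path, and the JSON key passed to `count_by` has to match whatever field name the log actually uses.

```python
import json
from collections import Counter
from pathlib import Path
from typing import Iterable, Iterator


def read_request_log(path: Path) -> Iterator[dict]:
    """Yield one dict per non-empty line of a request-level JSONL log."""
    with path.open(encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if line:
                yield json.loads(line)


def summary_path(output: Path) -> Path:
    """Assumed convention: the summary sits at '<output>.summary.json'."""
    return Path(str(output) + ".summary.json")


def count_by(records: Iterable[dict], key: str) -> Counter:
    """Tally records by the value stored under `key` (missing keys count as 'None')."""
    return Counter(str(r.get(key)) for r in records)
```

For example, `count_by(read_request_log(Path("outputs/sticky.jsonl")), "decode_node")` would tally requests per decode node, assuming the assigned decode node is stored under a key named `decode_node`.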