diff --git a/AGENTS.md b/AGENTS.md
new file mode 100644
index 0000000..d6b696a
--- /dev/null
+++ b/AGENTS.md
@@ -0,0 +1,94 @@
+# AGENTS.md
+
+## Environment
+
+Use `uv` to manage the entire Python environment. Add dependencies with `uv add` so that `uv sync` reproduces exactly the same runnable environment each time.
+
+## Goal
+
+Build a minimal prototype on top of **SGLang xPyD** to test whether **session-aware / KV-cache-aware P/D routing** can improve **end-to-end latency** for agentic coding workloads.
+
+Current setup:
+- SGLang: `v0.5.10`
+- Model: `Qwen3-Coder-30B-A3B-Instruct` (`~/models/Qwen/Qwen3-Coder-30B-A3B-Instruct`)
+- xPyD runs on this single 8-GPU node, so the current constraint is **$x + y \le 8$**
+- Even in local experiments, the implementation should preserve the **P -> D RDMA-style data path** semantics as much as possible; local runs should treat this as a loopback-based stand-in rather than collapsing P/D into a special in-process shortcut
+- Traces:
+  - Ali coding agent (`~/ali-trace/trace-qwen3-coder-formatted/041715-041717.jsonl`)
+
+---
+
+## MVP Scope
+
+We only do the following:
+
+1. Run **SGLang xPyD** correctly on one machine
+2. Add a **baseline router**
+   - `turn1`: default routing
+   - `turn2+`: prefer previous `D` node for the same session
+3. Add a **KV-cache-aware routing** policy
+4. Replay traces and compare policies with the same evaluation pipeline
+
+Out of scope for now:
+- autoscaling
+- fault tolerance
+- large-scale cluster scheduler
+- production hardening
+- general multi-tenant serving
+
+---
+
+## What matters
+
+Primary metric:
+- **E2E latency**
+
+Secondary metrics:
+- TTFT
+- TPOT
+- KV transfer volume
+- cache hit / reuse
+- re-prefill count
+- per-node load
+
+Do not optimize TTFT alone if E2E does not improve.
+
+---
+
+## Development Order
+
+Implement in this order:
+
+1. **Bring up xPyD**
+2. **Add trace replay + metrics logging**
+3. **Implement sticky-to-D baseline**
+4. **Implement KV-cache-aware routing**
+5. **Analyze gains and failure cases**
+
+Do not skip step 2.
+
+---
+
+## Core Rules
+
+### 1. Keep policy separate from mechanism
+- mechanism = how requests / KV / xPyD work
+- policy = how we choose `P` and `D`
+
+Do not mix them unless necessary.
+
+### 2. Prefer simple, debuggable logic
+Start with simple heuristics before complex scoring.
+
+### 3. Log everything needed to explain results
+Each request should log:
+- request id
+- session id
+- turn id
+- assigned P node
+- assigned D node
+- latency
+- whether reuse was expected / observed
+
+### 4. Small interfaces only
+Avoid over-abstraction.
diff --git a/README.md b/README.md
new file mode 100644
index 0000000..04be8f8
--- /dev/null
+++ b/README.md
@@ -0,0 +1,160 @@
+## Agentic PD Hybrid
+
+Minimal prototype scaffold for evaluating session-aware and KV-cache-aware
+prefill/decode routing on top of SGLang PD disaggregation.
+
+For a concise description of the project design, implemented features, current
+findings, and known limits, see [docs/PROJECT_OVERVIEW.md](docs/PROJECT_OVERVIEW.md).
+
+Current implementation covers the initial MVP path in `AGENTS.md`:
+
+1. One-node PD/xPyD launch planning
+2. Trace replay plus request-level metrics logging
+3. Real end-to-end benchmark orchestration
+
+Routing policy is kept separate from mechanism:
+
+- `agentic_pd_hybrid.topology` and `agentic_pd_hybrid.launcher`
+  handle cluster shape and SGLang command generation.
+- `agentic_pd_hybrid.policies`
+  handles decode selection heuristics.
+- `agentic_pd_hybrid.replay`
+  handles trace pacing, synthetic prompt generation, and metrics.
+- `agentic_pd_hybrid.sampling`
+  handles session-granularity trace sampling for live tests.
+- `agentic_pd_hybrid.stack` / `agentic_pd_hybrid.benchmark`
+  handle launching and tearing down a real PD stack.
+
+## Environment
+
+Use `uv` for all environment management.
+
+Sync the environment:
+
+```bash
+uv sync
+```
+
+Local experiments can use a repo-local `third_party/sglang` checkout of SGLang
+`v0.5.10`, but that heavyweight checkout is intentionally not committed here.
+
+## CLI
+
+Print one-node PD launch commands:
+
+```bash
+uv run agentic-pd-hybrid print-launch \
+  --model-path ~/models/Qwen/Qwen3-Coder-30B-A3B-Instruct \
+  --prefill-workers 2 \
+  --decode-workers 2 \
+  --transfer-backend mooncake
+```
+
+Replay the Ali trace in dry-run mode and emit request logs plus a summary:
+
+```bash
+uv run agentic-pd-hybrid replay \
+  --trace ~/ali-trace/trace-qwen3-coder-formatted/041715-041717.jsonl \
+  --policy sticky \
+  --prefill-workers 2 \
+  --decode-workers 2 \
+  --output outputs/sticky.jsonl
+```
+
+Sample a 10-minute shard at session granularity:
+
+```bash
+uv run agentic-pd-hybrid sample-sessions \
+  --trace ~/ali-trace/trace-qwen3-coder-formatted/041715-041717.jsonl \
+  --output outputs/sampled-10min.jsonl \
+  --target-duration-s 600 \
+  --session-sample-rate 0.01
+```
+
+Sample Ali sessions that keep the small-append KV reuse shape used by the
+micro-benchmark:
+
+```bash
+uv run agentic-pd-hybrid sample-sessions \
+  --trace ~/ali-trace/trace-qwen3-coder-formatted/041715-041717.jsonl \
+  --output outputs/ali-small-append.jsonl \
+  --profile small-append \
+  --target-duration-s 600 \
+  --session-sample-rate 0.01 \
+  --min-turns 2
+```
+
+Replay against a live router:
+
+```bash
+uv run agentic-pd-hybrid replay \
+  --trace ~/ali-trace/trace-qwen3-coder-formatted/041715-041717.jsonl \
+  --policy sticky \
+  --router-url http://127.0.0.1:8000 \
+  --model Qwen3-Coder-30B-A3B-Instruct \
+  --output outputs/sticky-live.jsonl
+```
+
+Launch a real PD stack and collect live performance numbers:
+
+```bash
+uv run agentic-pd-hybrid benchmark-live \
+  --trace ~/ali-trace/trace-qwen3-coder-formatted/041715-041717.jsonl \
+  --policy sticky \
+  --mechanism kvcache-centric \
+  --kvcache-admission-mode router \
+  --sample-profile small-append \
+  --prefill-workers 1 \
+  --decode-workers 1 \
+  --transfer-backend mooncake \
+  --target-duration-s 600 \
+  --session-sample-rate 0.01 \
+  --output-root outputs/live
+```
+
+Notes:
+
+- The provided Ali release trace contains lengths and `hash_ids`, not raw
+  prompts. Replay therefore synthesizes deterministic prompt text from
+  `hash_ids` so repeated blocks remain repeated across turns.
+- `sticky` mode emits `x-smg-routing-key=`, which matches the
+  upstream gateway's `manual` policy semantics for "turn1 default, turn2+
+  sticky".
+- `kv-aware` computes decode placement from observed `hash_ids` overlap and
+  can emit `x-smg-target-worker=` when `--header-mode target-worker` is
+  used with a compatible router decode policy.
+- Live benchmarking uses the repo-local `agentic_pd_hybrid.pd_router`, which
+  preserves the real prefill/decode double-request path over loopback without
+  depending on the upstream Rust router build.
+- Managed live benchmarking prefers a local
+  `third_party/sglang/python/sglang` checkout when it exists, so local SGLang
+  source changes can apply immediately without packaging a wheel.
+- Live benchmarking currently targets the `mooncake` transfer backend, because
+  `mooncake-transfer-engine` is installed and usable on this node.
+- `benchmark-live` and `replay` support streaming by default for TTFT/TPOT
+  measurement. Use `--no-stream` for E2E-only runs.
+- `kvcache-centric` defaults to router-managed admission
+  (`--kvcache-admission-mode router`). This keeps a router-side shadow of
+  decode session residency and capacity, so the critical path does not issue
+  per-request worker `/server_info` and `/v1/loads` probes. Use
+  `--kvcache-admission-mode worker` only as an A/B baseline for the older
+  worker-managed admission path.
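As an illustration of the `kv-aware` note above, here is a minimal sketch of choosing a decode worker from `hash_ids` overlap. It is not the code in `agentic_pd_hybrid.policies`: the function names, the worker names, and the choice to score by shared leading blocks with a load tie-break are all assumptions made for this example.

```python
from collections.abc import Iterable, Mapping


def shared_prefix_blocks(
    request_hash_ids: list[int], resident_hash_ids: Iterable[int]
) -> int:
    """Count leading request blocks that are already resident on a worker."""
    resident = set(resident_hash_ids)
    count = 0
    for block in request_hash_ids:
        if block not in resident:
            break  # KV reuse is assumed to be prefix-based, so stop at the first miss
        count += 1
    return count


def choose_decode_worker(
    request_hash_ids: list[int],
    resident_by_worker: Mapping[str, Iterable[int]],
    load_by_worker: Mapping[str, int],
) -> str:
    """Pick the worker with the longest resident prefix; tie-break on load, then name."""

    def score(worker: str) -> tuple[int, int, str]:
        overlap = shared_prefix_blocks(
            request_hash_ids, resident_by_worker.get(worker, ())
        )
        return (-overlap, load_by_worker.get(worker, 0), worker)

    return min(load_by_worker, key=score)


# Hypothetical residency and load snapshots for two decode workers.
resident = {"decode-0": [1, 2, 3], "decode-1": [1, 2]}
load = {"decode-0": 5, "decode-1": 1}

print(choose_decode_worker([1, 2, 3, 4], resident, load))  # decode-0 (3 shared blocks vs 2)
print(choose_decode_worker([9, 10], resident, load))       # decode-1 (no overlap, lower load)
```

Returning a sortable tuple keeps the heuristic easy to log and debug, which matches the "simple heuristics before complex scoring" rule in `AGENTS.md`.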
+
+## Output
+
+Each replay writes:
+
+- request-level metrics JSONL at the requested output path
+- summary JSON at `.summary.json`
+
+Each request log contains:
+
+- request id
+- session id
+- turn id
+- assigned prefill node
+- assigned decode node
+- latency fields when a live router is used
+- whether reuse was expected and whether block overlap was observed
+- expected KV transfer blocks
+- per-node load snapshot at assignment time
diff --git a/docs/PROJECT_OVERVIEW.md b/docs/PROJECT_OVERVIEW.md
new file mode 100644
index 0000000..ef95eec
--- /dev/null
+++ b/docs/PROJECT_OVERVIEW.md
@@ -0,0 +1,121 @@
+# Project Overview
+
+This repository is a minimal research prototype for evaluating whether
+session-aware and KV-cache-aware prefill/decode routing can improve end-to-end
+latency for agentic coding workloads on top of SGLang xPyD.
+
+The current target environment is a single 8-GPU node running SGLang `v0.5.10`
+with Qwen3-Coder-30B-A3B-Instruct. The local setup keeps the P -> D transfer
+path through SGLang disaggregation and Mooncake loopback instead of replacing it
+with an in-process shortcut.
+
+## Design
+
+The code keeps policy separate from mechanism.
+
+- Mechanism code launches SGLang workers, sends requests, manages streaming
+  sessions, and records request-level metrics.
+- Policy code decides which prefill worker and decode worker should receive a
+  request.
+- Replay and benchmark code preserve trace arrival times unless explicitly
+  configured otherwise, so concurrency comes from the workload shape rather than
+  from an artificial fixed-concurrency driver.
+
+The main comparison points are:
+
+- `pd-disaggregation`: normal router-managed P/D serving.
+- `kvcache-centric`: worker/router-assisted session-aware routing that can keep
+  a decode streaming session resident and send later small appends directly to D.
+- `pd-colo`: direct colocated serving baseline for experiments that do not use
+  the P/D router path.
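The policy/mechanism split in this Design section can be pictured as a small interface that only answers "which D node?", leaving request sending to the mechanism layer. The sketch below is hypothetical: `Request`, `DecodePolicy`, and `StickyDecodePolicy` are illustrative names rather than the repository's classes, and round-robin stands in for whatever "default routing" means on turn 1.

```python
from dataclasses import dataclass, field
from typing import Protocol


@dataclass(frozen=True)
class Request:
    request_id: str
    session_id: str
    turn_id: int


class DecodePolicy(Protocol):
    """Policy side only: choose a decode worker, never send traffic."""

    def choose_decode(self, request: Request, decode_workers: list[str]) -> str: ...


@dataclass
class StickyDecodePolicy:
    """Turn 1 uses round-robin (assumed default); later turns reuse the previous D node."""

    last_decode: dict[str, str] = field(default_factory=dict)
    next_index: int = 0

    def choose_decode(self, request: Request, decode_workers: list[str]) -> str:
        previous = self.last_decode.get(request.session_id)
        if previous in decode_workers:
            return previous  # sticky hit: same session goes back to the same D node
        worker = decode_workers[self.next_index % len(decode_workers)]
        self.next_index += 1
        self.last_decode[request.session_id] = worker
        return worker


policy: DecodePolicy = StickyDecodePolicy()
workers = ["decode-0", "decode-1"]
first = policy.choose_decode(Request("r1", "s1", 1), workers)
other = policy.choose_decode(Request("r2", "s2", 1), workers)
again = policy.choose_decode(Request("r3", "s1", 2), workers)
print(first, other, again)  # decode-0 decode-1 decode-0
```

Keeping the interface to a single method makes it straightforward to swap `default`, `sticky`, and `kv-aware` behind the same replay and benchmark pipeline.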
+
+## Implemented
+
+The prototype currently includes:
+
+- One-node P/D launch planning and managed stack lifecycle.
+- A lightweight Python PD router used for live local experiments.
+- Ali trace loading, session-granularity sampling, and synthetic prompt
+  generation from `hash_ids`.
+- Trace replay with natural pacing, request dependencies inside a session, and
+  request-level metrics JSONL plus summary JSON.
+- Routing policies:
+  - `default`: simple baseline placement.
+  - `sticky`: turn2+ prefers the previous D node for the same session.
+  - `kv-aware`: uses observed block overlap/session state to choose D placement.
+- Live benchmark orchestration through `benchmark-live`.
+- Small-append synthetic trace generation for micro-benchmarks.
+- KV-cache-centric worker admission modes:
+  - router shadow-state admission.
+  - worker-queried admission.
+  - session-level D residency soft cap for worker-managed admission, so only a
+    small hot set is kept as decode streaming sessions while the rest fall back
+    to normal PD routing.
+- P-side prefill backup bookkeeping for experiments where D evictions can retain
+  a lower-priority copy on P.
+- Fail-fast handling for empty streaming responses and a shorter SGLang
+  disaggregation wait timeout to avoid treating transfer hangs as successful
+  long-tail responses.
+
+## Current Findings
+
+The micro-benchmark can make KV-cache-centric routing look better than
+`pd-disaggregation` because the active sessions fit in D KV cache. Later turns
+can then bypass P and use `kvcache-direct-to-d-session`, reducing TTFT.
+
+On the larger 316-request, variable-turn workload, there are 58 sessions and the
+working set is larger than the useful D residency budget. A naive worker-managed
+KV-cache-centric policy repeatedly evicts and reseeds whole sessions, adding
+TTFT and transfer pressure. Aggressive P-backup also increases tail risk when it
+keeps too much state around.
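The evict-and-reseed pattern described above can be reproduced with a toy residency model. This is an illustration only: it assumes plain LRU eviction and counts sessions rather than tokens, neither of which is claimed to match the prototype's worker-managed admission logic, and `count_session_seeds` is a hypothetical helper.

```python
from collections import OrderedDict


def count_session_seeds(session_sequence: list[str], resident_cap: int) -> int:
    """Count turns whose session is not resident on D under an LRU cap."""
    resident: OrderedDict[str, None] = OrderedDict()
    seeds = 0
    for session in session_sequence:
        if session in resident:
            resident.move_to_end(session)  # hit: refresh recency, no transfer needed
            continue
        seeds += 1  # miss: the whole session must be seeded (or reseeded) onto D
        resident[session] = None
        if len(resident) > resident_cap:
            resident.popitem(last=False)  # evict the least recently used session
    return seeds


# Four sessions with three turns each, interleaved round-robin.
sequence = ["s0", "s1", "s2", "s3"] * 3

print(count_session_seeds(sequence, resident_cap=4))  # 4: only first turns miss
print(count_session_seeds(sequence, resident_cap=3))  # 12: every turn misses
```

In this toy model, once the working set exceeds the cap by a single session, interleaved turns turn every request into a full reseed, which is the qualitative behavior the finding describes.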
+
+The current soft-cap optimization improves worker-managed KV-cache-centric
+relative to the older worker-managed path, but `pd-disaggregation` is still
+slightly better on the sampled Ali workload because most requests fall back to
+normal PD routing while a few retained D sessions still consume token budget.
+
+## Useful Commands
+
+Run a live benchmark with natural arrival timing:
+
+```bash
+uv run agentic-pd-hybrid benchmark-live \
+  --trace outputs/micro-serveable-varturn-30k-1k-256-20260424T0756Z.jsonl \
+  --output-root outputs/live-serveable-varturn-30k-1k-256-hotcap \
+  --mechanism kvcache-centric \
+  --policy kv-aware \
+  --kvcache-admission-mode worker \
+  --prefill-workers 1 \
+  --decode-workers 1 \
+  --prefill-gpu-ids 0 \
+  --decode-gpu-ids 1 \
+  --transfer-backend mooncake \
+  --target-duration-s 2000 \
+  --session-sample-rate 1.0 \
+  --min-turns 2 \
+  --time-scale 1 \
+  --concurrency-limit 1000
+```
+
+Generate a 30k input, 1k append, 256 output small-append trace:
+
+```bash
+uv run agentic-pd-hybrid make-small-append-trace \
+  --output outputs/smoke-hotcap-30k-1k-256.jsonl \
+  --session-count 4 \
+  --turns-per-session 3 \
+  --initial-input-length 30000 \
+  --append-input-length 1000 \
+  --output-length 256
+```
+
+## Known Limits
+
+- This is not production routing code.
+- The current evaluation is single-node and constrained by `prefill + decode <=
+  8` GPUs.
+- Trace prompts are synthetic because the Ali trace used here contains lengths
+  and `hash_ids`, not raw prompts.
+- KV-cache-centric admission still needs better hot-session prediction. The next
+  useful step is inter-turn-gap-aware admission and aging, so D cache is held
+  only for sessions likely to reuse it soon.
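One possible shape for the inter-turn-gap-aware admission and aging mentioned in the last limit is sketched below. Nothing here exists in the prototype: the `GapAwareAdmission` class, the 30-second threshold, and the exponentially weighted moving average of gaps are assumptions chosen to keep the example small.

```python
from dataclasses import dataclass, field


@dataclass
class GapAwareAdmission:
    """Keep a session resident on D only if its next turn is predicted to arrive soon."""

    max_gap_s: float = 30.0  # hypothetical residency threshold
    alpha: float = 0.5       # hypothetical EWMA smoothing factor
    last_seen_s: dict[str, float] = field(default_factory=dict)
    predicted_gap_s: dict[str, float] = field(default_factory=dict)

    def observe_turn(self, session_id: str, now_s: float) -> None:
        """Update the session's predicted inter-turn gap from a new arrival."""
        previous = self.last_seen_s.get(session_id)
        if previous is not None:
            gap = now_s - previous
            old = self.predicted_gap_s.get(session_id, gap)
            self.predicted_gap_s[session_id] = self.alpha * gap + (1 - self.alpha) * old
        self.last_seen_s[session_id] = now_s

    def should_keep_resident(self, session_id: str) -> bool:
        """Admission: sessions with no gap history or long gaps are not held on D."""
        predicted = self.predicted_gap_s.get(session_id)
        return predicted is not None and predicted <= self.max_gap_s

    def expired(self, session_id: str, now_s: float) -> bool:
        """Aging: release a session that has been idle longer than the threshold."""
        last = self.last_seen_s.get(session_id)
        return last is None or now_s - last > self.max_gap_s


admission = GapAwareAdmission(max_gap_s=30.0)
for t in (0.0, 10.0, 20.0):
    admission.observe_turn("fast", t)
for t in (0.0, 120.0):
    admission.observe_turn("slow", t)

print(admission.should_keep_resident("fast"))  # True
print(admission.should_keep_resident("slow"))  # False
print(admission.expired("fast", now_s=100.0))  # True
```

A real version would likely need to weigh predicted gap against session size and current D token budget; this sketch only shows the gap signal in isolation.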