docs: document project design and status

2026-04-24 12:17:55 +00:00
parent 4bca741f32
commit 78f0d15221
3 changed files with 375 additions and 0 deletions
--- a/AGENTS.md
+++ b/AGENTS.md
@@ -0,0 +1,94 @@
+# AGENTS.md
+
+## Environment
+
+Use `uv` to manage the Python environment. Add dependencies with `uv add` so that `uv sync` reproduces exactly the same runnable environment each time.
+
+## Goal
+
+Build a minimal prototype on top of **SGLang xPyD** to test whether **session-aware / KV-cache-aware P/D routing** can improve **end-to-end latency** for agentic coding workloads.
+
+Current setup:
+- SGLang: `v0.5.10`
+- Model: `Qwen3-Coder-30B-A3B-Instruct` (`~/models/Qwen/Qwen3-Coder-30B-A3B-Instruct`)
+- xPyD runs on this single 8-GPU node, so the current constraint is **$x + y \le 8$**
+- Even in local experiments, the implementation should preserve the **P -> D RDMA-style data path** semantics as much as possible; local runs should treat this as a loopback-based stand-in rather than collapsing P/D into a special in-process shortcut
+- Traces:
+  - Ali coding agent (`~/ali-trace/trace-qwen3-coder-formatted/041715-041717.jsonl`)
+
+---
+
+## MVP Scope
+
+We only do the following:
+
+1. Run **SGLang xPyD** correctly on one machine
+2. Add a **baseline router**
+   - `turn1`: default routing
+   - `turn2+`: prefer previous `D` node for the same session
+3. Add a **KV-cache-aware routing** policy
+4. Replay traces and compare policies with the same evaluation pipeline
+
+Out of scope for now:
+- autoscaling
+- fault tolerance
+- large-scale cluster scheduler
+- production hardening
+- general multi-tenant serving
+
+---
+
+## What matters
+
+Primary metric:
+- **E2E latency**
+
+Secondary metrics:
+- TTFT
+- TPOT
+- KV transfer volume
+- cache hit / reuse
+- re-prefill count
+- per-node load
+
+Do not optimize TTFT alone if E2E does not improve.
+
+---
+
+## Development Order
+
+Implement in this order:
+
+1. **Bring up xPyD**
+2. **Add trace replay + metrics logging**
+3. **Implement sticky-to-D baseline**
+4. **Implement KV-cache-aware routing**
+5. **Analyze gains and failure cases**
+
+Do not skip step 2.
+
+---
+
+## Core Rules
+
+### 1. Keep policy separate from mechanism
+- mechanism = how requests / KV / xPyD work
+- policy = how we choose `P` and `D`
+
+Do not mix them unless necessary.
+
+### 2. Prefer simple, debuggable logic
+Start with simple heuristics before complex scoring.
+
+### 3. Log everything needed to explain results
+Each request should log:
+- request id
+- session id
+- turn id
+- assigned P node
+- assigned D node
+- latency
+- whether reuse was expected / observed
+
+### 4. Small interfaces only
+Avoid over-abstraction.
--- a/README.md
+++ b/README.md
@@ -0,0 +1,160 @@
+# Agentic PD Hybrid
+
+Minimal prototype scaffold for evaluating session-aware and KV-cache-aware
+prefill/decode routing on top of SGLang PD disaggregation.
+
+For a concise description of the project design, implemented features, current
+findings, and known limits, see [docs/PROJECT_OVERVIEW.md](docs/PROJECT_OVERVIEW.md).
+
+Current implementation covers the initial MVP path in `AGENTS.md`:
+
+1. One-node PD/xPyD launch planning
+2. Trace replay plus request-level metrics logging
+3. Real end-to-end benchmark orchestration
+
+Routing policy is kept separate from mechanism:
+
+- `agentic_pd_hybrid.topology` and `agentic_pd_hybrid.launcher`
+  handle cluster shape and SGLang command generation.
+- `agentic_pd_hybrid.policies`
+  handles decode selection heuristics.
+- `agentic_pd_hybrid.replay`
+  handles trace pacing, synthetic prompt generation, and metrics.
+- `agentic_pd_hybrid.sampling`
+  handles session-granularity trace sampling for live tests.
+- `agentic_pd_hybrid.stack` / `agentic_pd_hybrid.benchmark`
+  handle launching and tearing down a real PD stack.
+
+## Environment
+
+Use `uv` for all environment management.
+
+Sync the environment:
+
+```bash
+uv sync
+```
+
+Local experiments can use a repo-local `third_party/sglang` checkout of SGLang
+`v0.5.10`, but that heavyweight checkout is intentionally not committed here.
+
+## CLI
+
+Print one-node PD launch commands:
+
+```bash
+uv run agentic-pd-hybrid print-launch \
+  --model-path ~/models/Qwen/Qwen3-Coder-30B-A3B-Instruct \
+  --prefill-workers 2 \
+  --decode-workers 2 \
+  --transfer-backend mooncake
+```
+
+Replay the Ali trace in dry-run mode and emit request logs plus a summary:
+
+```bash
+uv run agentic-pd-hybrid replay \
+  --trace ~/ali-trace/trace-qwen3-coder-formatted/041715-041717.jsonl \
+  --policy sticky \
+  --prefill-workers 2 \
+  --decode-workers 2 \
+  --output outputs/sticky.jsonl
+```
+
+Sample a 10-minute shard at session granularity:
+
+```bash
+uv run agentic-pd-hybrid sample-sessions \
+  --trace ~/ali-trace/trace-qwen3-coder-formatted/041715-041717.jsonl \
+  --output outputs/sampled-10min.jsonl \
+  --target-duration-s 600 \
+  --session-sample-rate 0.01
+```
+
+Sample Ali sessions that keep the small-append KV reuse shape used by the
+micro-benchmark:
+
+```bash
+uv run agentic-pd-hybrid sample-sessions \
+  --trace ~/ali-trace/trace-qwen3-coder-formatted/041715-041717.jsonl \
+  --output outputs/ali-small-append.jsonl \
+  --profile small-append \
+  --target-duration-s 600 \
+  --session-sample-rate 0.01 \
+  --min-turns 2
+```
+
+Replay against a live router:
+
+```bash
+uv run agentic-pd-hybrid replay \
+  --trace ~/ali-trace/trace-qwen3-coder-formatted/041715-041717.jsonl \
+  --policy sticky \
+  --router-url http://127.0.0.1:8000 \
+  --model Qwen3-Coder-30B-A3B-Instruct \
+  --output outputs/sticky-live.jsonl
+```
+
+Launch a real PD stack and collect live performance numbers:
+
+```bash
+uv run agentic-pd-hybrid benchmark-live \
+  --trace ~/ali-trace/trace-qwen3-coder-formatted/041715-041717.jsonl \
+  --policy sticky \
+  --mechanism kvcache-centric \
+  --kvcache-admission-mode router \
+  --sample-profile small-append \
+  --prefill-workers 1 \
+  --decode-workers 1 \
+  --transfer-backend mooncake \
+  --target-duration-s 600 \
+  --session-sample-rate 0.01 \
+  --output-root outputs/live
+```
+
+Notes:
+
+- The provided Ali release trace contains lengths and `hash_ids`, not raw
+  prompts. Replay therefore synthesizes deterministic prompt text from
+  `hash_ids` so repeated blocks remain repeated across turns.
+- `sticky` mode emits `x-smg-routing-key=<session_id>`, which matches the
+  upstream gateway's `manual` policy semantics for "turn1 default, turn2+
+  sticky".
+- `kv-aware` computes decode placement from observed `hash_ids` overlap and
+  can emit `x-smg-target-worker=<index>` when `--header-mode target-worker` is
+  used with a compatible router decode policy.
+- Live benchmarking uses the repo-local `agentic_pd_hybrid.pd_router`, which
+  preserves the real prefill/decode double-request path over loopback without
+  depending on the upstream Rust router build.
+- Managed live benchmarking prefers a local
+  `third_party/sglang/python/sglang` checkout when it exists, so local SGLang
+  source changes can apply immediately without packaging a wheel.
+- Live benchmarking currently targets the `mooncake` transfer backend, because
+  `mooncake-transfer-engine` is installed and usable on this node.
+- `benchmark-live` and `replay` stream responses by default so that TTFT and
+  TPOT can be measured. Use `--no-stream` for E2E-only runs.
+- `kvcache-centric` defaults to router-managed admission
+  (`--kvcache-admission-mode router`). This keeps a router-side shadow of
+  decode session residency and capacity, so the critical path does not issue
+  per-request worker `/server_info` and `/v1/loads` probes. Use
+  `--kvcache-admission-mode worker` only as an A/B baseline for the older
+  worker-managed admission path.
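+
+The prompt synthesis described in the first note can be sketched roughly as
+follows. This is a minimal illustration, not the code in
+`agentic_pd_hybrid.replay`: the word list, the `block_text` and
+`synthesize_prompt` names, and the assumption that `hash_ids` are integers are
+all hypothetical. The point is only that each block id always maps to the same
+text, so a later turn that repeats earlier `hash_ids` also repeats the same
+prompt prefix.
+
+```python
+import hashlib
+from typing import List
+
+# Hypothetical filler vocabulary; the real replay code may use another scheme.
+WORDS = ("alpha", "beta", "gamma", "delta", "epsilon", "zeta", "eta", "theta")
+
+
+def block_text(hash_id: int, words_per_block: int = 16) -> str:
+    """Return deterministic filler text for one trace block id."""
+    digest = hashlib.sha256(str(hash_id).encode("utf-8")).digest()
+    # Pick one word per digest byte, cycling if more words are requested.
+    picks = (digest[i % len(digest)] % len(WORDS) for i in range(words_per_block))
+    return " ".join(WORDS[p] for p in picks)
+
+
+def synthesize_prompt(hash_ids: List[int]) -> str:
+    """Concatenate per-block text so shared block prefixes give shared text."""
+    return "\n".join(block_text(h) for h in hash_ids)
+```
+
+With this shape, `synthesize_prompt([1, 2, 3])` starts with
+`synthesize_prompt([1, 2])`, which is what lets prefix caching see repeated
+blocks across turns.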
+
+## Output
+
+Each replay writes:
+
+- request-level metrics JSONL at the requested output path
+- summary JSON at `<output>.summary.json`
+
+Each request log contains:
+
+- request id
+- session id
+- turn id
+- assigned prefill node
+- assigned decode node
+- latency fields when a live router is used
+- whether reuse was expected and whether block overlap was observed
+- expected KV transfer blocks
+- per-node load snapshot at assignment time
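+
+As a rough illustration of the request log fields listed above, one JSONL line
+could be modelled like this. The class name, field names, and types are
+assumptions made for the sketch; the actual schema written by `replay` may
+differ.
+
+```python
+import json
+from dataclasses import asdict, dataclass, field
+from typing import Dict, Optional
+
+
+@dataclass
+class RequestLog:
+    """Hypothetical shape of one request-level metrics record."""
+
+    request_id: str
+    session_id: str
+    turn_id: int
+    prefill_node: int
+    decode_node: int
+    latency_s: Optional[float] = None  # stays None in dry-run replays
+    reuse_expected: bool = False
+    overlap_observed: bool = False
+    expected_kv_transfer_blocks: int = 0
+    node_load: Dict[str, int] = field(default_factory=dict)
+
+
+def to_jsonl_line(record: RequestLog) -> str:
+    """Serialize one record as a single JSON line with stable key order."""
+    return json.dumps(asdict(record), sort_keys=True)
+```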
--- a/docs/PROJECT_OVERVIEW.md
+++ b/docs/PROJECT_OVERVIEW.md
@@ -0,0 +1,121 @@
+# Project Overview
+
+This repository is a minimal research prototype for evaluating whether
+session-aware and KV-cache-aware prefill/decode routing can improve end-to-end
+latency for agentic coding workloads on top of SGLang xPyD.
+
+The current target environment is a single 8-GPU node running SGLang `v0.5.10`
+with Qwen3-Coder-30B-A3B-Instruct. The local setup keeps the P -> D transfer
+path through SGLang disaggregation and Mooncake loopback instead of replacing it
+with an in-process shortcut.
+
+## Design
+
+The code keeps policy separate from mechanism.
+
+- Mechanism code launches SGLang workers, sends requests, manages streaming
+  sessions, and records request-level metrics.
+- Policy code decides which prefill worker and decode worker should receive a
+  request.
+- Replay and benchmark code preserve trace arrival times unless explicitly
+  configured otherwise, so concurrency comes from the workload shape rather than
+  from an artificial fixed-concurrency driver.
+
+The main comparison points are:
+
+- `pd-disaggregation`: normal router-managed P/D serving.
+- `kvcache-centric`: worker- or router-assisted session-aware routing that can
+  keep a decode streaming session resident and send later small appends
+  directly to D.
+- `pd-colo`: direct colocated serving baseline for experiments that do not use
+  the P/D router path.
+
+## Implemented
+
+The prototype currently includes:
+
+- One-node P/D launch planning and managed stack lifecycle.
+- A lightweight Python PD router used for live local experiments.
+- Ali trace loading, session-granularity sampling, and synthetic prompt
+  generation from `hash_ids`.
+- Trace replay with natural pacing, request dependencies inside a session, and
+  request-level metrics JSONL plus summary JSON.
+- Routing policies:
+  - `default`: simple baseline placement.
+  - `sticky`: turn2+ prefers the previous D node for the same session.
+  - `kv-aware`: uses observed block overlap/session state to choose D placement.
+- Live benchmark orchestration through `benchmark-live`.
+- Small-append synthetic trace generation for micro-benchmarks.
+- KV-cache-centric admission modes:
+  - router shadow-state admission.
+  - worker-queried admission.
+  - session-level D residency soft cap for worker-managed admission, so only a
+    small hot set is kept as decode streaming sessions while the rest fall back
+    to normal PD routing.
+- P-side prefill backup bookkeeping for experiments where D evictions can retain
+  a lower-priority copy on P.
+- Fail-fast handling for empty streaming responses and a shorter SGLang
+  disaggregation wait timeout to avoid treating transfer hangs as successful
+  long-tail responses.
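+
+The `sticky` and `kv-aware` policies listed above can be sketched as two small
+decode-selection functions. This is a simplified illustration under stated
+assumptions, not the code in `agentic_pd_hybrid.policies`: the function names,
+the per-worker resident block sets, and the least-loaded fallback standing in
+for "default routing" are all hypothetical.
+
+```python
+from typing import List, Optional, Set
+
+
+def least_loaded(load: List[int]) -> int:
+    """Assumed default placement: lowest load, ties go to the lowest index."""
+    return min(range(len(load)), key=lambda i: (load[i], i))
+
+
+def choose_sticky(previous_decode: Optional[int], load: List[int]) -> int:
+    """turn1 uses default placement; turn2+ reuses the session's previous D."""
+    if previous_decode is not None:
+        return previous_decode
+    return least_loaded(load)
+
+
+def choose_kv_aware(
+    hash_ids: List[int], resident: List[Set[int]], load: List[int]
+) -> int:
+    """Pick the D worker whose resident blocks overlap the request most."""
+    overlaps = [len(blocks.intersection(hash_ids)) for blocks in resident]
+    best = max(overlaps)
+    if best == 0:
+        # Nothing to reuse anywhere, so fall back to default placement.
+        return least_loaded(load)
+    candidates = [i for i, n in enumerate(overlaps) if n == best]
+    # Break overlap ties by load, then by index.
+    return min(candidates, key=lambda i: (load[i], i))
+```
+
+For example, `choose_kv_aware([1, 2, 3], [{1}, {1, 2}], [0, 5])` selects
+worker 1 even though it is busier, because it holds two of the three requested
+blocks. Weighing that overlap against load is where a real scoring policy
+would go further.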
+
+## Current Findings
+
+The micro-benchmark can make KV-cache-centric routing look better than
+`pd-disaggregation` because the active sessions fit in D KV cache. Later turns
+can then bypass P and use `kvcache-direct-to-d-session`, reducing TTFT.
+
+On the larger 316-request, variable-turn workload, there are 58 sessions and the
+working set is larger than the useful D residency budget. A naive worker-managed
+KV-cache-centric policy repeatedly evicts and reseeds whole sessions, adding
+TTFT and transfer pressure. Aggressive P-backup also increases tail risk when it
+keeps too much state around.
+
+The current soft-cap optimization improves worker-managed KV-cache-centric
+routing relative to the older worker-managed path. `pd-disaggregation` is still
+slightly better on the sampled Ali workload, because most requests fall back to
+normal PD routing while the few retained D sessions still consume token budget.
+
+## Useful Commands
+
+Run a live benchmark with natural arrival timing:
+
+```bash
+uv run agentic-pd-hybrid benchmark-live \
+  --trace outputs/micro-serveable-varturn-30k-1k-256-20260424T0756Z.jsonl \
+  --output-root outputs/live-serveable-varturn-30k-1k-256-hotcap \
+  --mechanism kvcache-centric \
+  --policy kv-aware \
+  --kvcache-admission-mode worker \
+  --prefill-workers 1 \
+  --decode-workers 1 \
+  --prefill-gpu-ids 0 \
+  --decode-gpu-ids 1 \
+  --transfer-backend mooncake \
+  --target-duration-s 2000 \
+  --session-sample-rate 1.0 \
+  --min-turns 2 \
+  --time-scale 1 \
+  --concurrency-limit 1000
+```
+
+Generate a 30k input, 1k append, 256 output small-append trace:
+
+```bash
+uv run agentic-pd-hybrid make-small-append-trace \
+  --output outputs/smoke-hotcap-30k-1k-256.jsonl \
+  --session-count 4 \
+  --turns-per-session 3 \
+  --initial-input-length 30000 \
+  --append-input-length 1000 \
+  --output-length 256
+```
+
+## Known Limits
+
+- This is not production routing code.
+- The current evaluation is single-node and constrained by
+  `prefill + decode <= 8` GPUs.
+- Trace prompts are synthetic because the Ali trace used here contains lengths
+  and `hash_ids`, not raw prompts.
+- KV-cache-centric admission still needs better hot-session prediction. The next
+  useful step is inter-turn-gap-aware admission and aging, so D cache is held
+  only for sessions likely to reuse it soon.