docs: document project design and status

2026-04-24 12:17:55 +00:00
parent 4bca741f32
commit 78f0d15221
3 changed files with 375 additions and 0 deletions
--- a/AGENTS.md
+++ b/AGENTS.md
@@ -0,0 +1,94 @@
 # AGENTS.md
 ## Environment
 Use `uv` to manage the Python environment. Add dependencies with `uv add` so that `uv sync` reproduces exactly the same runnable environment every time.
 ## Goal
 Build a minimal prototype on top of **SGLang xPyD** to test whether **session-aware / KV-cache-aware P/D routing** can improve **end-to-end latency** for agentic coding workloads.
 Current setup:
 - SGLang: `v0.5.10`
 - Model: `Qwen3-Coder-30B-A3B-Instruct` (`~/models/Qwen/Qwen3-Coder-30B-A3B-Instruct`)
 - xPyD runs on this single 8-GPU node, so the current constraint is **$x + y \le 8$**
 - Even in local experiments, the implementation should preserve the **P -> D RDMA-style data path** semantics as much as possible; local runs should treat this as a loopback-based stand-in rather than collapsing P/D into a special in-process shortcut
 - Traces:
  - Ali coding agent (`~/ali-trace/trace-qwen3-coder-formatted/041715-041717.jsonl`)
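The single-node constraint above can be checked before any worker is launched. The following is a minimal sketch, assuming one GPU per worker; `plan_gpu_ids` and `NODE_GPUS` are hypothetical names used for illustration, not part of the repository's API.

```python
NODE_GPUS = 8  # single 8-GPU node, as described in the setup above


def plan_gpu_ids(prefill_workers: int, decode_workers: int) -> tuple[list[int], list[int]]:
    """Return disjoint GPU ids for P and D workers (assumes one GPU per worker)."""
    if prefill_workers < 1 or decode_workers < 1:
        raise ValueError("need at least one prefill and one decode worker")
    needed = prefill_workers + decode_workers
    if needed > NODE_GPUS:
        raise ValueError(f"x + y = {needed} exceeds the {NODE_GPUS} GPUs on this node")
    ids = list(range(needed))
    # Prefill workers take the lowest ids, decode workers take the rest.
    return ids[:prefill_workers], ids[prefill_workers:]
```

For example, `plan_gpu_ids(2, 2)` yields prefill GPUs `[0, 1]` and decode GPUs `[2, 3]`, while `plan_gpu_ids(5, 4)` is rejected.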
 ---
 ## MVP Scope
 We only do the following:
 1. Run **SGLang xPyD** correctly on one machine
 2. Add a **baseline router**
   - `turn1`: default routing
   - `turn2+`: prefer previous `D` node for the same session
 3. Add a **KV-cache-aware routing** policy
 4. Replay traces and compare policies with the same evaluation pipeline
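The baseline router in step 2 can be sketched as follows. This is an illustration only: it assumes that "default routing" for `turn1` means round-robin over decode workers, and `StickyDecodePolicy` is a hypothetical name rather than the repository's class.

```python
class StickyDecodePolicy:
    """turn1: round-robin (assumed default); turn2+: reuse the session's previous D."""

    def __init__(self, decode_workers: int) -> None:
        if decode_workers < 1:
            raise ValueError("need at least one decode worker")
        self.decode_workers = decode_workers
        self._last_decode: dict[str, int] = {}  # session id -> D worker index
        self._next = 0

    def choose_decode(self, session_id: str) -> int:
        # turn2+: the session has been seen, so stick to its previous D node.
        if session_id in self._last_decode:
            return self._last_decode[session_id]
        # turn1: fall back to the assumed default placement and remember it.
        worker = self._next % self.decode_workers
        self._next += 1
        self._last_decode[session_id] = worker
        return worker
```

Keeping the per-session map inside the policy object leaves the request path (mechanism) unaware of how the decode worker was chosen.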
 Out of scope for now:
 - autoscaling
 - fault tolerance
 - large-scale cluster scheduler
 - production hardening
 - general multi-tenant serving
 ---
 ## What matters
 Primary metric:
 - **E2E latency**
 Secondary metrics:
 - TTFT
 - TPOT
 - KV transfer volume
 - cache hit / reuse
 - re-prefill count
 - per-node load
 Do not optimize TTFT alone if E2E does not improve.
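For reference, the three latency metrics are related as in the sketch below. It uses the commonly used definitions (TTFT up to the first token, TPOT averaged over the remaining tokens); the repository's own accounting may differ in detail, and `latency_metrics` is a hypothetical helper.

```python
def latency_metrics(sent_at: float, token_times: list[float]) -> dict[str, float]:
    """Derive TTFT, TPOT, and E2E latency (seconds) from per-token arrival times."""
    if not token_times:
        raise ValueError("no tokens received")
    ttft = token_times[0] - sent_at   # time to first token
    e2e = token_times[-1] - sent_at   # end-to-end latency
    n = len(token_times)
    # TPOT: average gap between output tokens after the first one.
    tpot = (e2e - ttft) / (n - 1) if n > 1 else 0.0
    return {"ttft_s": ttft, "tpot_s": tpot, "e2e_s": e2e}
```

Because E2E is TTFT plus the decode time, a policy that lowers TTFT but slows decode (for example by overloading one D node) can leave E2E unchanged or worse.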
 ---
 ## Development Order
 Implement in this order:
 1. **Bring up xPyD**
 2. **Add trace replay + metrics logging**
 3. **Implement sticky-to-D baseline**
 4. **Implement KV-cache-aware routing**
 5. **Analyze gains and failure cases**
 Do not skip step 2.
 ---
 ## Core Rules
 ### 1. Keep policy separate from mechanism
 - mechanism = how requests / KV / xPyD work
 - policy = how we choose `P` and `D`
 Do not mix them unless necessary.
 ### 2. Prefer simple, debuggable logic
 Start with simple heuristics before complex scoring.
 ### 3. Log everything needed to explain results
 Each request should log:
 - request id
 - session id
 - turn id
 - assigned P node
 - assigned D node
 - latency
 - whether reuse was expected / observed
 ### 4. Small interfaces only
 Avoid over-abstraction.
--- a/README.md
+++ b/README.md
@@ -0,0 +1,160 @@
 ## Agentic PD Hybrid
 Minimal prototype scaffold for evaluating session-aware and KV-cache-aware
 prefill/decode routing on top of SGLang PD disaggregation.
 For a concise description of the project design, implemented features, current
 findings, and known limits, see [docs/PROJECT_OVERVIEW.md](docs/PROJECT_OVERVIEW.md).
 Current implementation covers the initial MVP path in `AGENTS.md`:
 1. One-node PD/xPyD launch planning
 2. Trace replay plus request-level metrics logging
 3. Real end-to-end benchmark orchestration
 Routing policy is kept separate from mechanism:
 - `agentic_pd_hybrid.topology` and `agentic_pd_hybrid.launcher`
  handle cluster shape and SGLang command generation.
 - `agentic_pd_hybrid.policies`
  handles decode selection heuristics.
 - `agentic_pd_hybrid.replay`
  handles trace pacing, synthetic prompt generation, and metrics.
 - `agentic_pd_hybrid.sampling`
  handles session-granularity trace sampling for live tests.
 - `agentic_pd_hybrid.stack` / `agentic_pd_hybrid.benchmark`
  handle launching and tearing down a real PD stack.
 ## Environment
 Use `uv` for all environment management.
 Sync the environment:
 ```bash
 uv sync
 ```
 Local experiments can use a repo-local `third_party/sglang` checkout of SGLang
 `v0.5.10`, but that heavyweight checkout is intentionally not committed here.
 ## CLI
 Print one-node PD launch commands:
 ```bash
 uv run agentic-pd-hybrid print-launch \
  --model-path ~/models/Qwen/Qwen3-Coder-30B-A3B-Instruct \
  --prefill-workers 2 \
  --decode-workers 2 \
  --transfer-backend mooncake
 ```
 Replay the Ali trace in dry-run mode and emit request logs plus a summary:
 ```bash
 uv run agentic-pd-hybrid replay \
  --trace ~/ali-trace/trace-qwen3-coder-formatted/041715-041717.jsonl \
  --policy sticky \
  --prefill-workers 2 \
  --decode-workers 2 \
  --output outputs/sticky.jsonl
 ```
 Sample a 10-minute shard at session granularity:
 ```bash
 uv run agentic-pd-hybrid sample-sessions \
  --trace ~/ali-trace/trace-qwen3-coder-formatted/041715-041717.jsonl \
  --output outputs/sampled-10min.jsonl \
  --target-duration-s 600 \
  --session-sample-rate 0.01
 ```
 Sample Ali sessions that keep the small-append KV reuse shape used by the
 micro-benchmark:
 ```bash
 uv run agentic-pd-hybrid sample-sessions \
  --trace ~/ali-trace/trace-qwen3-coder-formatted/041715-041717.jsonl \
  --output outputs/ali-small-append.jsonl \
  --profile small-append \
  --target-duration-s 600 \
  --session-sample-rate 0.01 \
  --min-turns 2
 ```
 Replay against a live router:
 ```bash
 uv run agentic-pd-hybrid replay \
  --trace ~/ali-trace/trace-qwen3-coder-formatted/041715-041717.jsonl \
  --policy sticky \
  --router-url http://127.0.0.1:8000 \
  --model Qwen3-Coder-30B-A3B-Instruct \
  --output outputs/sticky-live.jsonl
 ```
 Launch a real PD stack and collect live performance numbers:
 ```bash
 uv run agentic-pd-hybrid benchmark-live \
  --trace ~/ali-trace/trace-qwen3-coder-formatted/041715-041717.jsonl \
  --policy sticky \
  --mechanism kvcache-centric \
  --kvcache-admission-mode router \
  --sample-profile small-append \
  --prefill-workers 1 \
  --decode-workers 1 \
  --transfer-backend mooncake \
  --target-duration-s 600 \
  --session-sample-rate 0.01 \
  --output-root outputs/live
 ```
 Notes:
 - The provided Ali release trace contains lengths and `hash_ids`, not raw
  prompts. Replay therefore synthesizes deterministic prompt text from
  `hash_ids` so repeated blocks remain repeated across turns.
 - `sticky` mode emits `x-smg-routing-key=<session_id>`, which matches the
  upstream gateway's `manual` policy semantics for "turn1 default, turn2+
  sticky".
 - `kv-aware` computes decode placement from observed `hash_ids` overlap and
  can emit `x-smg-target-worker=<index>` when `--header-mode target-worker` is
  used with a compatible router decode policy.
 - Live benchmarking uses the repo-local `agentic_pd_hybrid.pd_router`, which
  preserves the real prefill/decode double-request path over loopback without
  depending on the upstream Rust router build.
 - Managed live benchmarking prefers a local
  `third_party/sglang/python/sglang` checkout when it exists, so local SGLang
  source changes can apply immediately without packaging a wheel.
 - Live benchmarking currently targets the `mooncake` transfer backend, because
  `mooncake-transfer-engine` is installed and usable on this node.
 - `benchmark-live` and `replay` stream responses by default so that TTFT and
  TPOT can be measured. Use `--no-stream` for E2E-only runs.
 - `kvcache-centric` defaults to router-managed admission
  (`--kvcache-admission-mode router`). This keeps a router-side shadow of
  decode session residency and capacity, so the critical path does not issue
  per-request worker `/server_info` and `/v1/loads` probes. Use
  `--kvcache-admission-mode worker` only as an A/B baseline for the older
  worker-managed admission path.
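The first note above, on synthesizing deterministic prompt text from `hash_ids`, can be illustrated with the sketch below. It assumes `hash_ids` is a list of integers, one per block, and it does not try to match the real token count per block; `block_text` and `synthesize_prompt` are hypothetical names, not the functions in `agentic_pd_hybrid.replay`.

```python
import hashlib


def block_text(hash_id: int) -> str:
    """Map one block id to a fixed pseudo-text chunk (16 short hex 'words')."""
    digest = hashlib.sha256(str(hash_id).encode("utf-8")).hexdigest()  # 64 hex chars
    return " ".join(digest[i:i + 4] for i in range(0, 64, 4))


def synthesize_prompt(hash_ids: list[int]) -> str:
    """Concatenate block texts so a shared id prefix gives a shared text prefix."""
    return " ".join(block_text(h) for h in hash_ids)
```

Because each block's text depends only on its id, a later turn whose `hash_ids` extend an earlier turn's list produces a prompt that starts with the earlier prompt, which is what lets prefix reuse show up during replay.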
 ## Output
 Each replay writes:
 - request-level metrics JSONL at the requested output path
 - summary JSON at `<output>.summary.json`
 Each request log contains:
 - request id
 - session id
 - turn id
 - assigned prefill node
 - assigned decode node
 - latency fields when a live router is used
 - whether reuse was expected and whether block overlap was observed
 - expected KV transfer blocks
 - per-node load snapshot at assignment time
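The output layout above can be pictured with the following sketch. The field names and the summary contents are assumptions made for illustration; only the JSONL-plus-`<output>.summary.json` layout comes from the description above.

```python
import json
from dataclasses import asdict, dataclass, field
from pathlib import Path
from typing import Optional


@dataclass
class RequestLog:
    request_id: str
    session_id: str
    turn_id: int
    prefill_node: int
    decode_node: int
    reuse_expected: bool
    overlap_observed: bool
    expected_kv_transfer_blocks: int
    node_load: dict[str, int] = field(default_factory=dict)  # snapshot at assignment
    e2e_latency_s: Optional[float] = None  # only set when a live router is used


def write_request_logs(logs: list[RequestLog], output: Path) -> Path:
    """Write one JSON object per line, plus a summary at <output>.summary.json."""
    with output.open("w", encoding="utf-8") as fh:
        for log in logs:
            fh.write(json.dumps(asdict(log)) + "\n")
    # Hypothetical summary: the real summary likely carries latency statistics too.
    summary = {"requests": len(logs), "sessions": len({log.session_id for log in logs})}
    summary_path = Path(str(output) + ".summary.json")
    summary_path.write_text(json.dumps(summary), encoding="utf-8")
    return summary_path
```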
--- a/docs/PROJECT_OVERVIEW.md
+++ b/docs/PROJECT_OVERVIEW.md
@@ -0,0 +1,121 @@
 # Project Overview
 This repository is a minimal research prototype for evaluating whether
 session-aware and KV-cache-aware prefill/decode routing can improve end-to-end
 latency for agentic coding workloads on top of SGLang xPyD.
 The current target environment is a single 8-GPU node running SGLang `v0.5.10`
 with Qwen3-Coder-30B-A3B-Instruct. The local setup keeps the P -> D transfer
 path through SGLang disaggregation and Mooncake loopback instead of replacing it
 with an in-process shortcut.
 ## Design
 The code keeps policy separate from mechanism.
 - Mechanism code launches SGLang workers, sends requests, manages streaming
  sessions, and records request-level metrics.
 - Policy code decides which prefill worker and decode worker should receive a
  request.
 - Replay and benchmark code preserve trace arrival times unless explicitly
  configured otherwise, so concurrency comes from the workload shape rather than
  from an artificial fixed-concurrency driver.
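The arrival-time-preserving replay described in the last bullet can be sketched as below. This is a simplified illustration: it ignores request dependencies inside a session, it assumes a time scale acts as a plain multiplier on arrival offsets, and `replay_with_arrival_times` is a hypothetical name.

```python
import asyncio
from typing import Any, Awaitable, Callable


async def replay_with_arrival_times(
    arrivals: list[tuple[float, Any]],
    send: Callable[[Any], Awaitable[Any]],
    time_scale: float = 1.0,
) -> list[Any]:
    """Issue each request at its trace offset; concurrency follows the trace."""
    loop = asyncio.get_running_loop()
    start = loop.time()
    tasks = []
    for offset_s, request in sorted(arrivals, key=lambda item: item[0]):
        # Sleep until this request's scheduled time, measured from replay start.
        delay = start + offset_s * time_scale - loop.time()
        if delay > 0:
            await asyncio.sleep(delay)
        # Do not await the response here, so slow requests overlap with later arrivals.
        tasks.append(asyncio.create_task(send(request)))
    return await asyncio.gather(*tasks)
```

The number of in-flight requests is never set directly; it emerges from how quickly the trace issues requests relative to how long each one takes.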
 The main comparison points are:
 - `pd-disaggregation`: normal router-managed P/D serving.
 - `kvcache-centric`: worker- or router-assisted session-aware routing that can
  keep a decode streaming session resident and send later small appends directly
  to D.
 - `pd-colo`: direct colocated serving baseline for experiments that do not use
  the P/D router path.
 ## Implemented
 The prototype currently includes:
 - One-node P/D launch planning and managed stack lifecycle.
 - A lightweight Python PD router used for live local experiments.
 - Ali trace loading, session-granularity sampling, and synthetic prompt
  generation from `hash_ids`.
 - Trace replay with natural pacing, request dependencies inside a session, and
  request-level metrics JSONL plus summary JSON.
 - Routing policies:
  - `default`: simple baseline placement.
  - `sticky`: turn2+ prefers the previous D node for the same session.
  - `kv-aware`: uses observed block overlap/session state to choose D placement.
 - Live benchmark orchestration through `benchmark-live`.
 - Small-append synthetic trace generation for micro-benchmarks.
 - KV-cache-centric admission modes:
  - router shadow-state admission.
  - worker-queried admission.
  - session-level D residency soft cap for worker-managed admission, so only a
    small hot set is kept as decode streaming sessions while the rest fall back
    to normal PD routing.
 - P-side prefill backup bookkeeping for experiments where D evictions can retain
  a lower-priority copy on P.
 - Fail-fast handling for empty streaming responses and a shorter SGLang
  disaggregation wait timeout to avoid treating transfer hangs as successful
  long-tail responses.
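Among the routing policies listed above, `kv-aware` places a request by block overlap. The sketch below shows one plausible scoring rule, assuming the router tracks a set of resident `hash_ids` per decode worker and an in-flight count per worker; the tie-breaking order and the name `choose_decode_by_overlap` are assumptions, not the repository's implementation.

```python
def choose_decode_by_overlap(
    request_hash_ids: list[int],
    resident_blocks: dict[int, set[int]],
    load: dict[int, int],
) -> int:
    """Pick the D worker with the most overlapping blocks, then the lowest load."""
    if not load:
        raise ValueError("no decode workers available")
    wanted = set(request_hash_ids)

    def score(worker: int) -> tuple[int, int, int]:
        overlap = len(wanted & resident_blocks.get(worker, set()))
        # Sort key: more overlap first, then fewer in-flight requests, then index.
        return (-overlap, load[worker], worker)

    return min(load, key=score)
```

With no overlap anywhere, this degrades to least-loaded placement, which keeps the heuristic easy to reason about when debugging a run.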
 ## Current Findings
 The micro-benchmark can make KV-cache-centric routing look better than
 `pd-disaggregation` because the active sessions fit in D KV cache. Later turns
 can then bypass P and use `kvcache-direct-to-d-session`, reducing TTFT.
 On the larger 316-request, variable-turn workload, there are 58 sessions and the
 working set is larger than the useful D residency budget. A naive worker-managed
 KV-cache-centric policy repeatedly evicts and reseeds whole sessions, adding
 TTFT and transfer pressure. Aggressive P-backup also increases tail risk when it
 keeps too much state around.
 The current soft-cap optimization improves worker-managed KV-cache-centric
 routing relative to the older worker-managed path, but `pd-disaggregation` is
 still slightly better on the sampled Ali workload because most requests fall
 back to normal PD routing while a few retained D sessions still consume token
 budget.
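The soft-cap idea can be sketched as follows. This is an assumption-laden illustration: it caps by session count rather than by token budget, it never evicts a resident session to admit a new one (to avoid the evict-and-reseed churn described above), and `HotSessionSet` is a hypothetical name.

```python
class HotSessionSet:
    """Keep at most `soft_cap` sessions resident on D; others use normal PD routing."""

    def __init__(self, soft_cap: int) -> None:
        if soft_cap < 0:
            raise ValueError("soft_cap must be non-negative")
        self.soft_cap = soft_cap
        self._resident: set[str] = set()

    def admit(self, session_id: str) -> bool:
        """Return True if the session is (or becomes) a resident D session."""
        if session_id in self._resident:
            return True
        if len(self._resident) < self.soft_cap:
            self._resident.add(session_id)
            return True
        return False  # over the cap: caller falls back to normal PD routing

    def release(self, session_id: str) -> None:
        """Free the slot when a session ends or goes idle."""
        self._resident.discard(session_id)
```

Under this scheme a workload with many more sessions than the cap sends most requests through the fallback path, which is consistent with the result above that `pd-disaggregation` stays competitive.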
 ## Useful Commands
 Run a live benchmark with natural arrival timing:
 ```bash
 uv run agentic-pd-hybrid benchmark-live \
  --trace outputs/micro-serveable-varturn-30k-1k-256-20260424T0756Z.jsonl \
  --output-root outputs/live-serveable-varturn-30k-1k-256-hotcap \
  --mechanism kvcache-centric \
  --policy kv-aware \
  --kvcache-admission-mode worker \
  --prefill-workers 1 \
  --decode-workers 1 \
  --prefill-gpu-ids 0 \
  --decode-gpu-ids 1 \
  --transfer-backend mooncake \
  --target-duration-s 2000 \
  --session-sample-rate 1.0 \
  --min-turns 2 \
  --time-scale 1 \
  --concurrency-limit 1000
 ```
 Generate a 30k input, 1k append, 256 output small-append trace:
 ```bash
 uv run agentic-pd-hybrid make-small-append-trace \
  --output outputs/smoke-hotcap-30k-1k-256.jsonl \
  --session-count 4 \
  --turns-per-session 3 \
  --initial-input-length 30000 \
  --append-input-length 1000 \
  --output-length 256
 ```
 ## Known Limits
 - This is not production routing code.
 - The current evaluation is single-node and constrained by
  `prefill + decode <= 8` GPUs.
 - Trace prompts are synthetic because the Ali trace used here contains lengths
  and `hash_ids`, not raw prompts.
 - KV-cache-centric admission still needs better hot-session prediction. The next
  useful step is inter-turn-gap-aware admission and aging, so D cache is held
  only for sessions likely to reuse it soon.