docs: document project design and status
AGENTS.md
@@ -0,0 +1,94 @@
# AGENTS.md

## Environment

Use `uv` to manage the Python environment. Add dependencies with `uv add` so that `uv sync` reproduces exactly the same runnable environment each time.

## Goal

Build a minimal prototype on top of **SGLang xPyD** to test whether **session-aware / KV-cache-aware P/D routing** can improve **end-to-end latency** for agentic coding workloads.

Current setup:
- SGLang: `v0.5.10`
- Model: `Qwen3-Coder-30B-A3B-Instruct` (`~/models/Qwen/Qwen3-Coder-30B-A3B-Instruct`)
- xPyD runs on this single 8-GPU node, so the current constraint is **$x + y \le 8$**
- Even in local experiments, preserve the **P -> D RDMA-style data path** semantics as far as possible: treat a local run as a loopback-based stand-in, not as a reason to collapse P/D into a special in-process shortcut
- Traces:
  - Ali coding agent (`~/ali-trace/trace-qwen3-coder-formatted/041715-041717.jsonl`)

---

## MVP Scope

We only do the following:

1. Run **SGLang xPyD** correctly on one machine
2. Add a **baseline router**
   - `turn1`: default routing
   - `turn2+`: prefer previous `D` node for the same session
3. Add a **KV-cache-aware routing** policy
4. Replay traces and compare policies with the same evaluation pipeline

Out of scope for now:
- autoscaling
- fault tolerance
- large-scale cluster scheduler
- production hardening
- general multi-tenant serving
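
The baseline router in item 2 above can be sketched as a very small policy object. This is only an illustration under stated assumptions: `StickyDecodePolicy` and `choose_decode` are hypothetical names, and "default routing" for `turn1` is modeled as plain round-robin, which may differ from what the real router does.

```python
from dataclasses import dataclass, field


@dataclass
class StickyDecodePolicy:
    """Hypothetical policy: default routing on turn 1, previous D afterwards."""

    num_decode_workers: int
    _last_decode: dict[str, int] = field(default_factory=dict)
    _next_default: int = 0

    def choose_decode(self, session_id: str) -> int:
        # turn2+: prefer the D node that already served this session.
        if session_id in self._last_decode:
            return self._last_decode[session_id]
        # turn1: "default routing" is assumed to be round-robin in this sketch.
        worker = self._next_default % self.num_decode_workers
        self._next_default += 1
        self._last_decode[session_id] = worker
        return worker


policy = StickyDecodePolicy(num_decode_workers=2)
assignments = [policy.choose_decode(s) for s in ["a", "b", "a", "c", "b"]]
print(assignments)  # [0, 1, 0, 0, 1]
```
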

---

## What matters

Primary metric:
- **E2E latency**

Secondary metrics:
- TTFT
- TPOT
- KV transfer volume
- cache hit / reuse
- re-prefill count
- per-node load

Do not optimize TTFT alone if E2E does not improve.

---

## Development Order

Implement in this order:

1. **Bring up xPyD**
2. **Add trace replay + metrics logging**
3. **Implement sticky-to-D baseline**
4. **Implement KV-cache-aware routing**
5. **Analyze gains and failure cases**

Do not skip step 2.

---

## Core Rules

### 1. Keep policy separate from mechanism
- mechanism = how requests / KV / xPyD work
- policy = how we choose `P` and `D`

Do not mix them unless necessary.

### 2. Prefer simple, debuggable logic
Start with simple heuristics before complex scoring.

### 3. Log everything needed to explain results
Each request should log:
- request id
- session id
- turn id
- assigned P node
- assigned D node
- latency
- whether reuse was expected / observed

### 4. Small interfaces only
Avoid over-abstraction.
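
The per-request fields listed under rule 3 map naturally onto one flat record per JSONL line. The sketch below is an assumption about shape only: `RequestLog`, its field names, and `to_jsonl_line` are hypothetical and are not taken from the repository.

```python
import json
from dataclasses import asdict, dataclass
from typing import Optional


@dataclass
class RequestLog:
    """Hypothetical record holding the fields rule 3 asks for."""

    request_id: str
    session_id: str
    turn_id: int
    prefill_node: int
    decode_node: int
    latency_s: Optional[float]  # assumed to be None when nothing was measured
    reuse_expected: bool
    reuse_observed: Optional[bool]


def to_jsonl_line(record: RequestLog) -> str:
    # One JSON object per line keeps the log greppable and easy to diff.
    return json.dumps(asdict(record), sort_keys=True)


line = to_jsonl_line(RequestLog("r1", "s1", 2, 0, 1, 1.25, True, True))
print(line)
```
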
README.md
@@ -0,0 +1,160 @@
## Agentic PD Hybrid

Minimal prototype scaffold for evaluating session-aware and KV-cache-aware
prefill/decode routing on top of SGLang PD disaggregation.

For a concise description of the project design, implemented features, current
findings, and known limits, see [docs/PROJECT_OVERVIEW.md](docs/PROJECT_OVERVIEW.md).

The current implementation covers the initial MVP path in `AGENTS.md`:

1. One-node PD/xPyD launch planning
2. Trace replay plus request-level metrics logging
3. Real end-to-end benchmark orchestration

Routing policy is kept separate from mechanism:

- `agentic_pd_hybrid.topology` and `agentic_pd_hybrid.launcher`
  handle cluster shape and SGLang command generation.
- `agentic_pd_hybrid.policies`
  handles decode selection heuristics.
- `agentic_pd_hybrid.replay`
  handles trace pacing, synthetic prompt generation, and metrics.
- `agentic_pd_hybrid.sampling`
  handles session-granularity trace sampling for live tests.
- `agentic_pd_hybrid.stack` / `agentic_pd_hybrid.benchmark`
  handle launching and tearing down a real PD stack.

## Environment

Use `uv` for all environment management.

Sync the environment:

```bash
uv sync
```

Local experiments can use a repo-local `third_party/sglang` checkout of SGLang
`v0.5.10`, but that heavyweight checkout is intentionally not committed here.

## CLI

Print one-node PD launch commands:

```bash
uv run agentic-pd-hybrid print-launch \
  --model-path ~/models/Qwen/Qwen3-Coder-30B-A3B-Instruct \
  --prefill-workers 2 \
  --decode-workers 2 \
  --transfer-backend mooncake
```

Replay the Ali trace in dry-run mode and emit request logs plus a summary:

```bash
uv run agentic-pd-hybrid replay \
  --trace ~/ali-trace/trace-qwen3-coder-formatted/041715-041717.jsonl \
  --policy sticky \
  --prefill-workers 2 \
  --decode-workers 2 \
  --output outputs/sticky.jsonl
```

Sample a 10-minute shard at session granularity:

```bash
uv run agentic-pd-hybrid sample-sessions \
  --trace ~/ali-trace/trace-qwen3-coder-formatted/041715-041717.jsonl \
  --output outputs/sampled-10min.jsonl \
  --target-duration-s 600 \
  --session-sample-rate 0.01
```

Sample Ali sessions that keep the small-append KV reuse shape used by the
micro-benchmark:

```bash
uv run agentic-pd-hybrid sample-sessions \
  --trace ~/ali-trace/trace-qwen3-coder-formatted/041715-041717.jsonl \
  --output outputs/ali-small-append.jsonl \
  --profile small-append \
  --target-duration-s 600 \
  --session-sample-rate 0.01 \
  --min-turns 2
```

Replay against a live router:

```bash
uv run agentic-pd-hybrid replay \
  --trace ~/ali-trace/trace-qwen3-coder-formatted/041715-041717.jsonl \
  --policy sticky \
  --router-url http://127.0.0.1:8000 \
  --model Qwen3-Coder-30B-A3B-Instruct \
  --output outputs/sticky-live.jsonl
```

Launch a real PD stack and collect live performance numbers:

```bash
uv run agentic-pd-hybrid benchmark-live \
  --trace ~/ali-trace/trace-qwen3-coder-formatted/041715-041717.jsonl \
  --policy sticky \
  --mechanism kvcache-centric \
  --kvcache-admission-mode router \
  --sample-profile small-append \
  --prefill-workers 1 \
  --decode-workers 1 \
  --transfer-backend mooncake \
  --target-duration-s 600 \
  --session-sample-rate 0.01 \
  --output-root outputs/live
```

Notes:

- The provided Ali release trace contains lengths and `hash_ids`, not raw
  prompts. Replay therefore synthesizes deterministic prompt text from
  `hash_ids` so repeated blocks remain repeated across turns.
- `sticky` mode emits `x-smg-routing-key=<session_id>`, which matches the
  upstream gateway's `manual` policy semantics for "turn1 default, turn2+
  sticky".
- `kv-aware` computes decode placement from observed `hash_ids` overlap and
  can emit `x-smg-target-worker=<index>` when `--header-mode target-worker` is
  used with a compatible router decode policy.
- Live benchmarking uses the repo-local `agentic_pd_hybrid.pd_router`, which
  preserves the real prefill/decode double-request path over loopback without
  depending on the upstream Rust router build.
- Managed live benchmarking prefers a local
  `third_party/sglang/python/sglang` checkout when it exists, so local SGLang
  source changes can apply immediately without packaging a wheel.
- Live benchmarking currently targets the `mooncake` transfer backend, because
  `mooncake-transfer-engine` is installed and usable on this node.
- `benchmark-live` and `replay` stream responses by default so that TTFT and
  TPOT can be measured. Use `--no-stream` for E2E-only runs.
- `kvcache-centric` defaults to router-managed admission
  (`--kvcache-admission-mode router`). This keeps a router-side shadow of
  decode session residency and capacity, so the critical path does not issue
  per-request worker `/server_info` and `/v1/loads` probes. Use
  `--kvcache-admission-mode worker` only as an A/B baseline for the older
  worker-managed admission path.
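
The first note says prompts are synthesized deterministically from `hash_ids`. One way this could work is sketched below. It is not the repository's implementation: `block_text` and `synthesize_prompt` are hypothetical names, and the sketch ignores the per-request token lengths that a real replay would also need to match.

```python
import hashlib


def block_text(hash_id: int, words_per_block: int = 8) -> str:
    """Expand one hash id into a fixed, reproducible run of pseudo-words."""
    words = []
    for i in range(words_per_block):
        digest = hashlib.sha256(f"{hash_id}:{i}".encode()).hexdigest()
        words.append(digest[:6])
    return " ".join(words)


def synthesize_prompt(hash_ids: list[int]) -> str:
    # The same hash id always yields the same text, so a shared block prefix
    # across turns becomes a shared text prefix.
    return " ".join(block_text(h) for h in hash_ids)


turn1 = synthesize_prompt([11, 22])
turn2 = synthesize_prompt([11, 22, 33])
print(turn2.startswith(turn1))  # True
```

Because each block depends only on its own hash id, a later turn that repeats the earlier blocks reproduces the earlier prompt as a prefix, which is what prefix-based KV reuse needs.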

## Output

Each replay writes:

- request-level metrics JSONL at the requested output path
- summary JSON at `<output>.summary.json`

Each request log contains:

- request id
- session id
- turn id
- assigned prefill node
- assigned decode node
- latency fields when a live router is used
- whether reuse was expected and whether block overlap was observed
- expected KV transfer blocks
- per-node load snapshot at assignment time
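
To illustrate how a summary could be derived from the request-level JSONL, here is a minimal sketch. The field name `latency_s` and the summary keys are assumptions made for this example. The actual schema written by `replay` is not shown in this document.

```python
import json
import statistics


def summarize(jsonl_lines: list[str]) -> dict:
    """Aggregate request-level records into a small summary dict."""
    records = [json.loads(line) for line in jsonl_lines if line.strip()]
    # Dry-run records are assumed to carry no latency, so they are skipped here.
    latencies = sorted(
        r["latency_s"] for r in records if r.get("latency_s") is not None
    )
    summary = {"requests": len(records), "with_latency": len(latencies)}
    if latencies:
        summary["mean_latency_s"] = statistics.fmean(latencies)
        summary["median_latency_s"] = statistics.median(latencies)
        summary["max_latency_s"] = latencies[-1]
    return summary


sample = [
    '{"request_id": "r1", "latency_s": 1.0}',
    '{"request_id": "r2", "latency_s": 3.0}',
    '{"request_id": "r3", "latency_s": null}',
]
result = summarize(sample)
print(result)
```
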
docs/PROJECT_OVERVIEW.md
@@ -0,0 +1,121 @@
# Project Overview

This repository is a minimal research prototype for evaluating whether
session-aware and KV-cache-aware prefill/decode routing can improve end-to-end
latency for agentic coding workloads on top of SGLang xPyD.

The current target environment is a single 8-GPU node running SGLang `v0.5.10`
with Qwen3-Coder-30B-A3B-Instruct. The local setup keeps the P -> D transfer
path through SGLang disaggregation and Mooncake loopback instead of replacing it
with an in-process shortcut.

## Design

The code keeps policy separate from mechanism.

- Mechanism code launches SGLang workers, sends requests, manages streaming
  sessions, and records request-level metrics.
- Policy code decides which prefill worker and decode worker should receive a
  request.
- Replay and benchmark code preserve trace arrival times unless explicitly
  configured otherwise, so concurrency comes from the workload shape rather than
  from an artificial fixed-concurrency driver.

The main comparison points are:

- `pd-disaggregation`: normal router-managed P/D serving.
- `kvcache-centric`: worker/router-assisted session-aware routing that can keep
  a decode streaming session resident and send later small appends directly to D.
- `pd-colo`: direct colocated serving baseline for experiments that do not use
  the P/D router path.
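
The point about preserving trace arrival times can be made concrete with a small `asyncio` sketch. It is an illustration, not the repository's replay loop: `replay`, `send`, the record keys (`arrival_s`, `session_id`, `request_id`), and the `time_scale` parameter are all hypothetical names chosen for this example.

```python
import asyncio
import time


async def replay(trace, send, time_scale=1.0):
    """Fire each request at its trace arrival time; serialize turns per session."""
    start = time.monotonic()
    session_tail = {}  # session_id -> task for that session's latest turn

    async def run_one(req, previous):
        # Wait for the (scaled) trace arrival time, not for a free worker slot.
        delay = req["arrival_s"] * time_scale - (time.monotonic() - start)
        if delay > 0:
            await asyncio.sleep(delay)
        # A turn cannot start before the previous turn of its session finished.
        if previous is not None:
            await previous
        return await send(req)

    tasks = []
    for req in sorted(trace, key=lambda r: r["arrival_s"]):
        task = asyncio.create_task(run_one(req, session_tail.get(req["session_id"])))
        session_tail[req["session_id"]] = task
        tasks.append(task)
    return await asyncio.gather(*tasks)


async def fake_send(req):
    # Stand-in for a real HTTP call to the router.
    await asyncio.sleep(0.01)
    return req["request_id"]


trace = [
    {"request_id": "a1", "session_id": "a", "arrival_s": 0.00},
    {"request_id": "b1", "session_id": "b", "arrival_s": 0.01},
    {"request_id": "a2", "session_id": "a", "arrival_s": 0.02},
]
order = asyncio.run(replay(trace, fake_send))
print(order)  # results come back in submission order: ['a1', 'b1', 'a2']
```

In this shape, concurrency is whatever the trace produces: requests from different sessions overlap freely, while turns within one session stay ordered.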

## Implemented

The prototype currently includes:

- One-node P/D launch planning and managed stack lifecycle.
- A lightweight Python PD router used for live local experiments.
- Ali trace loading, session-granularity sampling, and synthetic prompt
  generation from `hash_ids`.
- Trace replay with natural pacing, request dependencies inside a session, and
  request-level metrics JSONL plus summary JSON.
- Routing policies:
  - `default`: simple baseline placement.
  - `sticky`: turn2+ prefers the previous D node for the same session.
  - `kv-aware`: uses observed block overlap/session state to choose D placement.
- Live benchmark orchestration through `benchmark-live`.
- Small-append synthetic trace generation for micro-benchmarks.
- KV-cache-centric worker admission modes:
  - router shadow-state admission.
  - worker-queried admission.
  - session-level D residency soft cap for worker-managed admission, so only a
    small hot set is kept as decode streaming sessions while the rest fall back
    to normal PD routing.
- P-side prefill backup bookkeeping for experiments where D evictions can retain
  a lower-priority copy on P.
- Fail-fast handling for empty streaming responses and a shorter SGLang
  disaggregation wait timeout to avoid treating transfer hangs as successful
  long-tail responses.
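
As an illustration of the `kv-aware` idea listed above, the sketch below picks the decode worker whose believed-resident blocks overlap most with a request's `hash_ids`. It is a simplified assumption, not the repository's scoring: `choose_decode_by_overlap`, the `resident_blocks` bookkeeping, and the load-based tie-break are hypothetical.

```python
def choose_decode_by_overlap(request_hash_ids, resident_blocks, loads):
    """Return the decode worker index with the largest block overlap.

    resident_blocks: one set per decode worker of hash ids believed cached there.
    loads: in-flight request count per decode worker, used only to break ties.
    """
    wanted = set(request_hash_ids)

    def score(worker):
        overlap = len(wanted & resident_blocks[worker])
        # Prefer more overlap, then lower load, then the lower worker index.
        return (overlap, -loads[worker], -worker)

    return max(range(len(resident_blocks)), key=score)


resident = [{1, 2, 3}, {1, 2, 3, 4, 5}]
first = choose_decode_by_overlap([1, 2, 3, 4, 9], resident, loads=[0, 3])
second = choose_decode_by_overlap([7, 8], resident, loads=[0, 3])
print(first, second)  # 1 0
```

Keeping the score a plain tuple follows the "simple, debuggable logic" rule: each decision can be explained by logging the three components.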

## Current Findings

The micro-benchmark can make KV-cache-centric routing look better than
`pd-disaggregation` because the active sessions fit in D KV cache. Later turns
can then bypass P and use `kvcache-direct-to-d-session`, reducing TTFT.

On the larger 316-request, variable-turn workload, there are 58 sessions and the
working set is larger than the useful D residency budget. A naive worker-managed
KV-cache-centric policy repeatedly evicts and reseeds whole sessions, adding
TTFT and transfer pressure. Aggressive P-backup also increases tail risk when it
keeps too much state around.

The current soft-cap optimization improves worker-managed KV-cache-centric
routing relative to the older worker-managed path, but `pd-disaggregation` is
still slightly better on the sampled Ali workload because most requests fall
back to normal PD routing while a few retained D sessions still consume token
budget.
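
The soft-cap behavior described above (keep a small hot set resident on D, send everything else down the normal PD path instead of evicting and reseeding) can be sketched as follows. `HotSessionSet`, its method names, and the returned labels are hypothetical, and admitting sessions first-come is an assumption of this sketch rather than the prototype's actual selection rule.

```python
from collections import OrderedDict


class HotSessionSet:
    """Keep at most `cap` sessions resident on D; others fall back to normal PD."""

    def __init__(self, cap: int):
        self.cap = cap
        self._resident = OrderedDict()  # session_id -> None, ordered by recency

    def admit(self, session_id: str) -> str:
        if session_id in self._resident:
            self._resident.move_to_end(session_id)
            return "direct-to-d"
        if len(self._resident) < self.cap:
            self._resident[session_id] = None
            return "seed"
        # Over the cap: leave resident sessions alone and fall back instead,
        # which avoids the evict-and-reseed churn described above.
        return "pd-fallback"

    def release(self, session_id: str) -> None:
        self._resident.pop(session_id, None)


hot = HotSessionSet(cap=2)
decisions = [hot.admit(s) for s in ["a", "b", "c", "a", "c"]]
print(decisions)
# ['seed', 'seed', 'pd-fallback', 'direct-to-d', 'pd-fallback']
```
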

## Useful Commands

Run a live benchmark with natural arrival timing:

```bash
uv run agentic-pd-hybrid benchmark-live \
  --trace outputs/micro-serveable-varturn-30k-1k-256-20260424T0756Z.jsonl \
  --output-root outputs/live-serveable-varturn-30k-1k-256-hotcap \
  --mechanism kvcache-centric \
  --policy kv-aware \
  --kvcache-admission-mode worker \
  --prefill-workers 1 \
  --decode-workers 1 \
  --prefill-gpu-ids 0 \
  --decode-gpu-ids 1 \
  --transfer-backend mooncake \
  --target-duration-s 2000 \
  --session-sample-rate 1.0 \
  --min-turns 2 \
  --time-scale 1 \
  --concurrency-limit 1000
```

Generate a 30k input, 1k append, 256 output small-append trace:

```bash
uv run agentic-pd-hybrid make-small-append-trace \
  --output outputs/smoke-hotcap-30k-1k-256.jsonl \
  --session-count 4 \
  --turns-per-session 3 \
  --initial-input-length 30000 \
  --append-input-length 1000 \
  --output-length 256
```

## Known Limits

- This is not production routing code.
- The current evaluation is single-node and constrained by
  `prefill + decode <= 8` GPUs.
- Trace prompts are synthetic because the Ali trace used here contains lengths
  and `hash_ids`, not raw prompts.
- KV-cache-centric admission still needs better hot-session prediction. The next
  useful step is inter-turn-gap-aware admission and aging, so D cache is held
  only for sessions likely to reuse it soon.
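
One possible shape for the inter-turn-gap-aware admission mentioned in the last item is sketched below. It is speculative and not implemented in the prototype: `GapTracker`, the exponentially weighted gap estimate, and the `max_gap_s` and `slack` thresholds are all assumptions made for illustration.

```python
class GapTracker:
    """Track a smoothed inter-turn gap per session (hypothetical helper)."""

    def __init__(self, alpha: float = 0.5):
        self.alpha = alpha
        self._last_seen = {}
        self._ewma_gap = {}

    def observe(self, session_id: str, now_s: float) -> None:
        last = self._last_seen.get(session_id)
        if last is not None:
            gap = now_s - last
            prev = self._ewma_gap.get(session_id, gap)
            self._ewma_gap[session_id] = self.alpha * gap + (1 - self.alpha) * prev
        self._last_seen[session_id] = now_s

    def should_hold(self, session_id: str, max_gap_s: float) -> bool:
        # Admission: hold D cache only if the next turn is expected soon.
        gap = self._ewma_gap.get(session_id)
        return gap is not None and gap <= max_gap_s

    def expired(self, session_id: str, now_s: float, slack: float = 2.0) -> bool:
        # Aging: release a session idle for much longer than its usual gap.
        gap = self._ewma_gap.get(session_id)
        last = self._last_seen.get(session_id)
        if gap is None or last is None:
            return False
        return now_s - last > slack * gap


tracker = GapTracker()
for t in (0.0, 10.0, 30.0):
    tracker.observe("s", t)
print(tracker.should_hold("s", max_gap_s=20.0), tracker.expired("s", now_s=70.0))
```

With arrivals at 0 s, 10 s, and 30 s the smoothed gap is 15 s, so the session would be held under a 20 s threshold and released once it has been idle for more than 30 s.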