docs: rewrite project docs in concise chinese
195
README.md
@@ -1,162 +1,101 @@
|
|||||||
## Agentic PD Hybrid
|
# Agentic PD Hybrid
|
||||||
|
|
||||||
Minimal prototype scaffold for evaluating session-aware and KV-cache-aware
|
This project is a minimal experiment framework built on SGLang xPyD. It aims to answer:
|
||||||
prefill/decode routing on top of SGLang PD disaggregation.
|
|
||||||
|
|
||||||
For a concise description of the project design, implemented features, current
|
**Can session-aware / KV-cache-aware P/D routing reduce end-to-end latency for agentic coding workloads?**
|
||||||
findings, and known limits, see [docs/PROJECT_OVERVIEW.md](docs/PROJECT_OVERVIEW.md).
|
|
||||||
|
|
||||||
Current implementation covers the initial MVP path in `AGENTS.md`:
|
For a fuller but still concise description, see [docs/PROJECT_OVERVIEW.md](docs/PROJECT_OVERVIEW.md).
|
||||||
|
|
||||||
1. One-node PD/xPyD launch planning
|
## What Is Implemented
|
||||||
2. Trace replay plus request-level metrics logging
|
|
||||||
3. Real end-to-end benchmark orchestration
|
|
||||||
|
|
||||||
Routing policy is kept separate from mechanism:
|
- Launches a single-node SGLang P/D stack.
|
||||||
|
- Replays the Ali coding agent trace and records request-level metrics.
|
||||||
|
- Supports the `default`, `sticky`, and `kv-aware` routing policies.
|
||||||
|
- Supports comparing the `pd-disaggregation`, `kvcache-centric`, and `pd-colo` mechanisms.
|
||||||
|
- Supports micro-benchmark traces with small appends and multi-turn sessions.
|
||||||
|
- Maintains local patches on top of SGLang `v0.5.10` under `third_party/sglang`.
|
||||||
|
|
||||||
- `agentic_pd_hybrid.topology` and `agentic_pd_hybrid.launcher`
|
## Environment
|
||||||
handle cluster shape and SGLang command generation.
|
|
||||||
- `agentic_pd_hybrid.policies`
|
|
||||||
handles decode selection heuristics.
|
|
||||||
- `agentic_pd_hybrid.replay`
|
|
||||||
handles trace pacing, synthetic prompt generation, and metrics.
|
|
||||||
- `agentic_pd_hybrid.sampling`
|
|
||||||
handles session-granularity trace sampling for live tests.
|
|
||||||
- `agentic_pd_hybrid.stack` / `agentic_pd_hybrid.benchmark`
|
|
||||||
handles launching and tearing down a real PD stack.
|
|
||||||
|
|
||||||
## Environment
|
Use `uv` for everything:
|
||||||
|
|
||||||
Use `uv` for all environment management.
|
|
||||||
|
|
||||||
Sync the environment:
|
|
||||||
|
|
||||||
```bash
|
```bash
|
||||||
uv sync
|
uv sync
|
||||||
```
|
```
|
||||||
|
|
||||||
`third_party/sglang` vendors a clean SGLang `v0.5.10` snapshot plus our local
|
Default model path:
|
||||||
PD/session-cache patches in later commits. Keep SGLang changes scoped under that
|
|
||||||
directory and commit them with `feat(sglang): ...` or `fix(sglang): ...` so they
|
|
||||||
stay easy to review against the vendor baseline.
|
|
||||||
|
|
||||||
## CLI
|
```text
|
||||||
|
~/models/Qwen/Qwen3-Coder-30B-A3B-Instruct
|
||||||
Print one-node PD launch commands:
|
|
||||||
|
|
||||||
```bash
|
|
||||||
uv run agentic-pd-hybrid print-launch \
|
|
||||||
--model-path ~/models/Qwen/Qwen3-Coder-30B-A3B-Instruct \
|
|
||||||
--prefill-workers 2 \
|
|
||||||
--decode-workers 2 \
|
|
||||||
--transfer-backend mooncake
|
|
||||||
```
|
```
|
||||||
|
|
||||||
Replay the Ali trace in dry-run mode and emit request logs plus a summary:
|
The main test environment is currently a single node with 8 GPUs, under the constraint `prefill + decode <= 8`.
|
||||||
|
|
||||||
|
## Common Commands
|
||||||
|
|
||||||
|
Generate a small-append trace:
|
||||||
|
|
||||||
```bash
|
```bash
|
||||||
uv run agentic-pd-hybrid replay \
|
uv run agentic-pd-hybrid make-small-append-trace \
|
||||||
--trace ~/ali-trace/trace-qwen3-coder-formatted/041715-041717.jsonl \
|
--output outputs/smoke-hotcap-30k-1k-256.jsonl \
|
||||||
--policy sticky \
|
--session-count 4 \
|
||||||
--prefill-workers 2 \
|
--turns-per-session 3 \
|
||||||
--decode-workers 2 \
|
--initial-input-length 30000 \
|
||||||
--output outputs/sticky.jsonl
|
--append-input-length 1000 \
|
||||||
|
--output-length 256
|
||||||
```
|
```
|
||||||
|
|
||||||
Sample a 10-minute shard at session granularity:
|
Run a live benchmark:
|
||||||
|
|
||||||
```bash
|
|
||||||
uv run agentic-pd-hybrid sample-sessions \
|
|
||||||
--trace ~/ali-trace/trace-qwen3-coder-formatted/041715-041717.jsonl \
|
|
||||||
--output outputs/sampled-10min.jsonl \
|
|
||||||
--target-duration-s 600 \
|
|
||||||
--session-sample-rate 0.01
|
|
||||||
```
|
|
||||||
|
|
||||||
Sample Ali sessions that keep the small-append KV reuse shape used by the
|
|
||||||
micro-benchmark:
|
|
||||||
|
|
||||||
```bash
|
|
||||||
uv run agentic-pd-hybrid sample-sessions \
|
|
||||||
--trace ~/ali-trace/trace-qwen3-coder-formatted/041715-041717.jsonl \
|
|
||||||
--output outputs/ali-small-append.jsonl \
|
|
||||||
--profile small-append \
|
|
||||||
--target-duration-s 600 \
|
|
||||||
--session-sample-rate 0.01 \
|
|
||||||
--min-turns 2
|
|
||||||
```
|
|
||||||
|
|
||||||
Replay against a live router:
|
|
||||||
|
|
||||||
```bash
|
|
||||||
uv run agentic-pd-hybrid replay \
|
|
||||||
--trace ~/ali-trace/trace-qwen3-coder-formatted/041715-041717.jsonl \
|
|
||||||
--policy sticky \
|
|
||||||
--router-url http://127.0.0.1:8000 \
|
|
||||||
--model Qwen3-Coder-30B-A3B-Instruct \
|
|
||||||
--output outputs/sticky-live.jsonl
|
|
||||||
```
|
|
||||||
|
|
||||||
Launch a real PD stack and collect live performance numbers:
|
|
||||||
|
|
||||||
```bash
|
```bash
|
||||||
uv run agentic-pd-hybrid benchmark-live \
|
uv run agentic-pd-hybrid benchmark-live \
|
||||||
--trace ~/ali-trace/trace-qwen3-coder-formatted/041715-041717.jsonl \
|
--trace outputs/micro-serveable-varturn-30k-1k-256-20260424T0756Z.jsonl \
|
||||||
--policy sticky \
|
--output-root outputs/live-serveable-varturn-30k-1k-256-hotcap \
|
||||||
--mechanism kvcache-centric \
|
--mechanism kvcache-centric \
|
||||||
--kvcache-admission-mode router \
|
--policy kv-aware \
|
||||||
--sample-profile small-append \
|
--kvcache-admission-mode worker \
|
||||||
--prefill-workers 1 \
|
--prefill-workers 1 \
|
||||||
--decode-workers 1 \
|
--decode-workers 1 \
|
||||||
|
--prefill-gpu-ids 0 \
|
||||||
|
--decode-gpu-ids 1 \
|
||||||
--transfer-backend mooncake \
|
--transfer-backend mooncake \
|
||||||
--target-duration-s 600 \
|
--target-duration-s 2000 \
|
||||||
--session-sample-rate 0.01 \
|
--session-sample-rate 1.0 \
|
||||||
--output-root outputs/live
|
--min-turns 2 \
|
||||||
|
--time-scale 1 \
|
||||||
|
--concurrency-limit 1000
|
||||||
```
|
```
|
||||||
|
|
||||||
Notes:
|
Replay only and write metrics:
|
||||||
|
|
||||||
- The provided Ali release trace contains lengths and `hash_ids`, not raw
|
```bash
|
||||||
prompts. Replay therefore synthesizes deterministic prompt text from
|
uv run agentic-pd-hybrid replay \
|
||||||
`hash_ids` so repeated blocks remain repeated across turns.
|
--trace path/to/trace.jsonl \
|
||||||
- `sticky` mode emits `x-smg-routing-key=<session_id>`, which matches the
|
--policy kv-aware \
|
||||||
upstream gateway's `manual` policy semantics for "turn1 default, turn2+
|
--mechanism pd-disaggregation \
|
||||||
sticky".
|
--router-url http://127.0.0.1:8000 \
|
||||||
- `kv-aware` computes decode placement from observed `hash_ids` overlap and
|
--output outputs/replay.jsonl
|
||||||
can emit `x-smg-target-worker=<index>` when `--header-mode target-worker` is
|
```
|
||||||
used with a compatible router decode policy.
|
|
||||||
- Live benchmarking uses the repo-local `agentic_pd_hybrid.pd_router`, which
|
|
||||||
preserves the real prefill/decode double-request path over loopback without
|
|
||||||
depending on the upstream Rust router build.
|
|
||||||
- Managed live benchmarking prefers the vendored
|
|
||||||
`third_party/sglang/python/sglang` source tree, so local SGLang changes apply
|
|
||||||
immediately without packaging a wheel.
|
|
||||||
- Live benchmarking currently targets the `mooncake` transfer backend, because
|
|
||||||
`mooncake-transfer-engine` is installed and usable on this node.
|
|
||||||
- `benchmark-live` and `replay` support streaming by default for TTFT/TPOT
|
|
||||||
measurement. Use `--no-stream` for E2E-only runs.
|
|
||||||
- `kvcache-centric` defaults to router-managed admission
|
|
||||||
(`--kvcache-admission-mode router`). This keeps a router-side shadow of
|
|
||||||
decode session residency and capacity, so the critical path does not issue
|
|
||||||
per-request worker `/server_info` and `/v1/loads` probes. Use
|
|
||||||
`--kvcache-admission-mode worker` only as an A/B baseline for the older
|
|
||||||
worker-managed admission path.
|
|
||||||
|
|
||||||
## Output
|
## Output
|
||||||
|
|
||||||
Each replay writes:
|
Each replay/benchmark run writes:
|
||||||
|
|
||||||
- request-level metrics JSONL at the requested output path
|
- request metrics: `request-metrics.jsonl`
|
||||||
- summary JSON at `<output>.summary.json`
|
- summary: `request-metrics.jsonl.summary.json`
|
||||||
|
|
||||||
Each request log contains:
|
Key fields to check:
|
||||||
|
|
||||||
- request id
|
- E2E latency
|
||||||
- session id
|
- TTFT / TPOT
|
||||||
- turn id
|
- execution mode
|
||||||
- assigned prefill node
|
- cached tokens
|
||||||
- assigned decode node
|
- KV transfer blocks
|
||||||
- latency fields when a live router is used
|
- error
|
||||||
- whether reuse was expected and whether block overlap was observed
|
|
||||||
- expected KV transfer blocks
|
## Maintenance Conventions
|
||||||
- per-node load snapshot at assignment time
|
|
||||||
|
- Project code changes: `feat:` / `fix:` / `docs:`.
|
||||||
|
- SGLang changes: `feat(sglang): ...` / `fix(sglang): ...`.
|
||||||
|
- The baseline for `third_party/sglang` is a clean SGLang `v0.5.10` snapshot.
|
||||||
|
- Do not commit `outputs/`, logs, `__pycache__`, or virtual environments.
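
The notes in this README describe `sticky` as preferring the previous D node from turn 2 onward and `kv-aware` as computing decode placement from observed `hash_ids` overlap. A minimal Python sketch of that idea is given here for orientation only. The function names `choose_decode_worker` and `record_assignment`, the `seen_blocks` mapping, and the tie-breaking rule are illustrative assumptions, not the code in `agentic_pd_hybrid.policies`.

```python
from typing import Dict, Iterable, Optional, Set


def choose_decode_worker(
    hash_ids: Iterable[int],
    seen_blocks: Dict[int, Set[int]],
    previous_worker: Optional[int] = None,
) -> int:
    """Pick the decode worker whose observed blocks overlap the request most.

    seen_blocks maps a hypothetical worker index to the set of hash_ids that
    worker has already served. Ties go to the session's previous worker
    (sticky behaviour), then to the lowest worker index.
    """
    request_blocks = set(hash_ids)

    def score(worker: int):
        overlap = len(request_blocks & seen_blocks[worker])
        sticky_bonus = 1 if worker == previous_worker else 0
        # Larger overlap wins; sticky breaks ties; lower index breaks the rest.
        return (overlap, sticky_bonus, -worker)

    return max(seen_blocks, key=score)


def record_assignment(
    hash_ids: Iterable[int], worker: int, seen_blocks: Dict[int, Set[int]]
) -> None:
    """Remember that `worker` has now served the request's blocks."""
    seen_blocks[worker].update(hash_ids)
```

Scoring with a tuple keeps the policy a pure function of the observed state, which fits the README's goal of keeping routing policy separate from the launch and transfer mechanism.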
|
||||||
|
|||||||
@@ -1,141 +1,98 @@
|
|||||||
# Project Overview
|
# Project Overview
|
||||||
|
|
||||||
This repository is a minimal research prototype for evaluating whether
|
This project tests one question:
|
||||||
session-aware and KV-cache-aware prefill/decode routing can improve end-to-end
|
|
||||||
latency for agentic coding workloads on top of SGLang xPyD.
|
|
||||||
|
|
||||||
The current target environment is a single 8-GPU node running SGLang `v0.5.10`
|
**For agentic coding workloads, can P/D serving achieve lower end-to-end latency if the router is more aware of sessions and the KV cache?**
|
||||||
with Qwen3-Coder-30B-A3B-Instruct. The repo vendors SGLang under
|
|
||||||
`third_party/sglang` so our xPyD/session-cache changes are maintained together
|
|
||||||
with the benchmark harness. The local setup keeps the P -> D transfer path
|
|
||||||
through SGLang disaggregation and Mooncake loopback instead of replacing it with
|
|
||||||
an in-process shortcut.
|
|
||||||
|
|
||||||
## Design
|
Currently based on:
|
||||||
|
|
||||||
The code keeps policy separate from mechanism.
|
- SGLang `v0.5.10`
|
||||||
|
- Qwen3-Coder-30B-A3B-Instruct
|
||||||
|
- A single node with 8 GPUs
|
||||||
|
- Mooncake loopback to simulate P -> D transfer
|
||||||
|
|
||||||
- Mechanism code launches SGLang workers, sends requests, manages streaming
|
## Design
|
||||||
sessions, and records request-level metrics.
|
|
||||||
- Policy code decides which prefill worker and decode worker should receive a
|
|
||||||
request.
|
|
||||||
- Replay and benchmark code preserve trace arrival times unless explicitly
|
|
||||||
configured otherwise, so concurrency comes from the workload shape rather than
|
|
||||||
from an artificial fixed-concurrency driver.
|
|
||||||
|
|
||||||
The main comparison points are:
|
The code is split into two layers:
|
||||||
|
|
||||||
- `pd-disaggregation`: normal router-managed P/D serving.
|
- **Mechanism**: launches SGLang, sends requests, manages sessions, and collects metrics.
|
||||||
- `kvcache-centric`: worker/router assisted session-aware routing that can keep
|
- **Policy**: decides which P node and which D node a request goes to.
|
||||||
a decode streaming session resident and send later small appends directly to D.
|
|
||||||
- `pd-colo`: direct colocated serving baseline for experiments that do not use
|
|
||||||
the P/D router path.
|
|
||||||
|
|
||||||
## Implemented
|
This way the routing policy can later be changed on its own, without entangling it with the SGLang/xPyD mechanism.
|
||||||
|
|
||||||
The prototype currently includes:
|
## Implemented
|
||||||
|
|
||||||
- One-node P/D launch planning and managed stack lifecycle.
|
- Single-node P/D stack launch and teardown.
|
||||||
- A lightweight Python PD router used for live local experiments.
|
- A local Python PD router.
|
||||||
- Ali trace loading, session-granularity sampling, and synthetic prompt
|
- Ali trace loading, session-level sampling, and synthetic prompt generation.
|
||||||
generation from `hash_ids`.
|
- Replay at the trace's original arrival times, rather than forcing load with a fixed concurrency.
|
||||||
- Trace replay with natural pacing, request dependencies inside a session, and
|
- Request-level metrics and a summary.
|
||||||
request-level metrics JSONL plus summary JSON.
|
- Routing policies:
|
||||||
- Routing policies:
|
- `default`
|
||||||
- `default`: simple baseline placement.
|
- `sticky`
|
||||||
- `sticky`: turn2+ prefers the previous D node for the same session.
|
- `kv-aware`
|
||||||
- `kv-aware`: uses observed block overlap/session state to choose D placement.
|
- Serving mechanisms:
|
||||||
- Live benchmark orchestration through `benchmark-live`.
|
- `pd-disaggregation`
|
||||||
- Small-append synthetic trace generation for micro-benchmarks.
|
- `kvcache-centric`
|
||||||
- KV-cache-centric worker admission modes:
|
- `pd-colo`
|
||||||
- router shadow-state admission.
|
- Micro-benchmark trace generation.
|
||||||
- worker queried admission.
|
- Comparison of worker-managed and router-managed KV admission.
|
||||||
- session-level D residency soft cap for worker-managed admission, so only a
|
- A D session soft cap under worker-managed admission, so that not every session crowds into the D KV cache.
|
||||||
small hot set is kept as decode streaming sessions while the rest fall back
|
- SGLang patch:
|
||||||
to normal PD routing.
|
- decode workers support local append-prefill in PD mode;
|
||||||
- P-side prefill backup bookkeeping for experiments where D evictions can retain
|
- streaming session cache status is exposed;
|
||||||
a lower-priority copy on P.
|
- idle streaming sessions can be evicted at session granularity;
|
||||||
- Fail-fast handling for empty streaming responses and a shorter SGLang
|
- direct append admission queries are supported.
|
||||||
disaggregation wait timeout to avoid treating transfer hangs as successful
|
|
||||||
long-tail responses.
|
|
||||||
|
|
||||||
## SGLang Maintenance
|
## Current Findings
|
||||||
|
|
||||||
SGLang is tracked directly in this repository:
|
On the micro-benchmark, `kvcache-centric` can outperform `pd-disaggregation`.
|
||||||
|
|
||||||
- `chore: vendor sglang v0.5.10 snapshot` records the clean upstream baseline.
|
The reason is simple: there are few sessions, they fit in the D KV cache, and turn2+ can go straight to the D session, skipping part of the P/D path overhead.
|
||||||
- Later `feat(sglang): ...` / `fix(sglang): ...` commits should contain only
|
|
||||||
local SGLang changes.
|
|
||||||
- Generated files such as `__pycache__` and benchmark outputs stay ignored.
|
|
||||||
|
|
||||||
The current SGLang patch adds the worker-side mechanisms needed by
|
On the test with 300+ requests and 58 sessions, the picture is different:
|
||||||
KV-cache-centric experiments:
|
|
||||||
|
|
||||||
- decode workers can optionally accept local append-prefill requests in PD mode;
|
- The D KV cache cannot hold the working set of all sessions.
|
||||||
- streaming session cache status is exposed for router/admission decisions;
|
- Naive worker-managed admission frequently evicts and reseeds whole sessions.
|
||||||
- idle streaming sessions can be evicted at session granularity;
|
- Reseed and transfer pressure offset the gains from KV reuse.
|
||||||
- direct append admission can check resident session state and D token pressure
|
- Aggressive P-backup increases tail-latency risk.
|
||||||
before the replay path bypasses P.
|
|
||||||
|
|
||||||
## Current Findings
|
After the current soft-cap optimization:
|
||||||
|
|
||||||
The micro-benchmark can make KV-cache-centric routing look better than
|
- worker-managed admission is more stable than the older version;
|
||||||
`pd-disaggregation` because the active sessions fit in D KV cache. Later turns
|
- TTFT drops noticeably;
|
||||||
can then bypass P and use `kvcache-direct-to-d-session`, reducing TTFT.
|
- 600s transfer hangs are no longer counted as successful responses;
|
||||||
|
- but on the sampled Ali trace, `pd-disaggregation` is still slightly better.
|
||||||
|
|
||||||
On the larger 316-request, variable-turn workload, there are 58 sessions and the
|
Current assessment:
|
||||||
working set is larger than the useful D residency budget. A naive worker-managed
|
|
||||||
KV-cache-centric policy repeatedly evicts and reseeds whole sessions, adding
|
|
||||||
TTFT and transfer pressure. Aggressive P-backup also increases tail risk when it
|
|
||||||
keeps too much state around.
|
|
||||||
|
|
||||||
The current soft-cap optimization improves worker-managed KV-cache-centric
|
**KV-cache-centric routing should keep only truly hot sessions. Not every session deserves space in the D KV cache.**
|
||||||
relative to the older worker-managed path, but `pd-disaggregation` is still
|
|
||||||
slightly better on the sampled Ali workload because most requests fall back to
|
|
||||||
normal PD routing while a few retained D sessions still consume token budget.
|
|
||||||
|
|
||||||
## Useful Commands
|
The most valuable next steps are:
|
||||||
|
|
||||||
Run a live benchmark with natural arrival timing:
|
- inter-turn-gap-aware admission;
|
||||||
|
- session aging;
|
||||||
|
- more accurate prediction of which sessions will reuse their KV cache soon.
|
||||||
|
|
||||||
```bash
|
## SGLang Maintenance
|
||||||
uv run agentic-pd-hybrid benchmark-live \
|
|
||||||
--trace outputs/micro-serveable-varturn-30k-1k-256-20260424T0756Z.jsonl \
|
|
||||||
--output-root outputs/live-serveable-varturn-30k-1k-256-hotcap \
|
|
||||||
--mechanism kvcache-centric \
|
|
||||||
--policy kv-aware \
|
|
||||||
--kvcache-admission-mode worker \
|
|
||||||
--prefill-workers 1 \
|
|
||||||
--decode-workers 1 \
|
|
||||||
--prefill-gpu-ids 0 \
|
|
||||||
--decode-gpu-ids 1 \
|
|
||||||
--transfer-backend mooncake \
|
|
||||||
--target-duration-s 2000 \
|
|
||||||
--session-sample-rate 1.0 \
|
|
||||||
--min-turns 2 \
|
|
||||||
--time-scale 1 \
|
|
||||||
--concurrency-limit 1000
|
|
||||||
```
|
|
||||||
|
|
||||||
Generate a 30k input, 1k append, 256 output small-append trace:
|
`third_party/sglang` is tracked in the main repository.
|
||||||
|
|
||||||
```bash
|
Commit history structure:
|
||||||
uv run agentic-pd-hybrid make-small-append-trace \
|
|
||||||
--output outputs/smoke-hotcap-30k-1k-256.jsonl \
|
|
||||||
--session-count 4 \
|
|
||||||
--turns-per-session 3 \
|
|
||||||
--initial-input-length 30000 \
|
|
||||||
--append-input-length 1000 \
|
|
||||||
--output-length 256
|
|
||||||
```
|
|
||||||
|
|
||||||
## Known Limits
|
- `chore: vendor sglang v0.5.10 snapshot`: the clean upstream baseline.
|
||||||
|
- `feat(sglang): ...` / `fix(sglang): ...`: our SGLang patches.
|
||||||
|
|
||||||
- This is not production routing code.
|
When changing SGLang later:
|
||||||
- The current evaluation is single-node and constrained by `prefill + decode <=
|
|
||||||
8` GPUs.
|
- change only the relevant files under `third_party/sglang`;
|
||||||
- Trace prompts are synthetic because the Ali trace used here contains lengths
|
- commit them separately;
|
||||||
and `hash_ids`, not raw prompts.
|
- include `(sglang)` in the commit message;
|
||||||
- KV-cache-centric admission still needs better hot-session prediction. The next
|
- keep benchmark outputs, pyc files, and logs out of the commit.
|
||||||
useful step is inter-turn-gap-aware admission and aging, so D cache is held
|
|
||||||
only for sessions likely to reuse it soon.
|
## Known Limits
|
||||||
|
|
||||||
|
- This is an experimental prototype, not a production router.
|
||||||
|
- Validation so far is mainly on a single node with 8 GPUs.
|
||||||
|
- The Ali trace has no raw prompts, so prompts can only be synthesized from `hash_ids`.
|
||||||
|
- Routing still lacks real hot-session prediction.
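
The findings in this overview argue that only truly hot sessions should stay resident on D, with the rest falling back to normal PD routing. One way to picture a session-level soft cap is a small least-recently-used set. The sketch below is an illustration under stated assumptions: the class `HotSessionSet`, its `cap` parameter, the LRU eviction rule, and the route labels `"direct-to-d"` and `"pd"` are all hypothetical, and the soft-cap logic in the repository may work differently.

```python
from collections import OrderedDict
from typing import Optional


class HotSessionSet:
    """Track at most `cap` decode-resident sessions, evicting the least recently used."""

    def __init__(self, cap: int) -> None:
        if cap < 1:
            raise ValueError("cap must be at least 1")
        self.cap = cap
        # Maps session id -> last-used timestamp, ordered from coldest to hottest.
        self._resident: "OrderedDict[str, float]" = OrderedDict()

    def is_resident(self, session_id: str) -> bool:
        return session_id in self._resident

    def touch(self, session_id: str, now: float) -> Optional[str]:
        """Mark a session as used at time `now`; return the evicted session, if any."""
        self._resident[session_id] = now
        self._resident.move_to_end(session_id)
        if len(self._resident) > self.cap:
            evicted, _ = self._resident.popitem(last=False)
            return evicted
        return None

    def route(self, session_id: str) -> str:
        """Resident sessions go direct to D; everything else takes the normal P -> D path."""
        return "direct-to-d" if self.is_resident(session_id) else "pd"
```

Plain LRU ignores how soon a session is likely to return. The inter-turn-gap-aware admission and session aging listed as next steps would replace the recency rule with a prediction of near-term reuse.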
|
||||||
|
|||||||