docs: rewrite project docs in concise chinese
195
README.md
@@ -1,162 +1,101 @@
## Agentic PD Hybrid
# Agentic PD Hybrid

Minimal prototype scaffold for evaluating session-aware and KV-cache-aware
prefill/decode routing on top of SGLang PD disaggregation.
This project is a minimal experiment framework built on SGLang xPyD, used to determine:

For a concise description of the project design, implemented features, current
findings, and known limits, see [docs/PROJECT_OVERVIEW.md](docs/PROJECT_OVERVIEW.md).
**Whether session-aware / KV-cache-aware P/D routing for agentic coding workloads can reduce end-to-end latency.**

Current implementation covers the initial MVP path in `AGENTS.md`:
A fuller but still concise description is in [docs/PROJECT_OVERVIEW.md](docs/PROJECT_OVERVIEW.md).

1. One-node PD/xPyD launch planning
2. Trace replay plus request-level metrics logging
3. Real end-to-end benchmark orchestration
## What has been done so far

Routing policy is kept separate from mechanism:
- Launches a single-node SGLang P/D stack.
- Replays the Ali coding agent trace and records request-level metrics.
- Supports the `default`, `sticky`, and `kv-aware` routing policies.
- Supports comparing `pd-disaggregation`, `kvcache-centric`, and `pd-colo`.
- Supports micro-benchmark traces with small appends and multi-turn sessions.
- Maintains local patches on top of SGLang `v0.5.10`, kept in `third_party/sglang`.

- `agentic_pd_hybrid.topology` and `agentic_pd_hybrid.launcher`
  handle cluster shape and SGLang command generation.
- `agentic_pd_hybrid.policies`
  handles decode selection heuristics.
- `agentic_pd_hybrid.replay`
  handles trace pacing, synthetic prompt generation, and metrics.
- `agentic_pd_hybrid.sampling`
  handles session-granularity trace sampling for live tests.
- `agentic_pd_hybrid.stack` / `agentic_pd_hybrid.benchmark`
  handles launching and tearing down a real PD stack.
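
A decode selection heuristic of the kind listed above could look roughly like the sketch below. It is an illustration only, not the code in `agentic_pd_hybrid.policies`: the `KvAwareSelector` class, its fields, and the tie-breaking rule are assumptions. Each decode worker is scored by how many of a request's `hash_ids` it has already seen, with ties broken by lower in-flight load and then by lower worker index.

```python
class KvAwareSelector:
    """Hypothetical KV-aware decode selector (names are illustrative)."""

    def __init__(self, num_decode_workers: int) -> None:
        # Block hashes each decode worker has served so far.
        self.seen: list[set[int]] = [set() for _ in range(num_decode_workers)]
        # Requests currently assigned to each decode worker.
        self.inflight: list[int] = [0] * num_decode_workers

    def select(self, hash_ids: list[int]) -> int:
        blocks = set(hash_ids)
        # Prefer the largest overlap, then the lowest load, then the lowest index.
        best = max(
            range(len(self.seen)),
            key=lambda i: (len(blocks & self.seen[i]), -self.inflight[i], -i),
        )
        self.seen[best] |= blocks
        self.inflight[best] += 1
        return best

    def finish(self, worker: int) -> None:
        # Call when a request completes so load-based tie-breaking stays accurate.
        self.inflight[worker] -= 1
```

With two workers, a first request lands on worker 0, an unrelated second request lands on the idle worker 1, and a follow-up turn that repeats the first request's blocks returns to worker 0.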
## Environment

Use `uv` for all environment management.

Sync the environment:
Use `uv` throughout:

```bash
uv sync
```

`third_party/sglang` vendors a clean SGLang `v0.5.10` snapshot plus our local
PD/session-cache patches in later commits. Keep SGLang changes scoped under that
directory and commit them with `feat(sglang): ...` or `fix(sglang): ...` so they
stay easy to review against the vendor baseline.
Default model path:

## CLI

Print one-node PD launch commands:

```bash
uv run agentic-pd-hybrid print-launch \
  --model-path ~/models/Qwen/Qwen3-Coder-30B-A3B-Instruct \
  --prefill-workers 2 \
  --decode-workers 2 \
  --transfer-backend mooncake
```text
~/models/Qwen/Qwen3-Coder-30B-A3B-Instruct
```

Replay the Ali trace in dry-run mode and emit request logs plus a summary:
The main test environment is currently a single node with 8 GPUs, under the constraint `prefill + decode <= 8`.
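
The worker-count constraint can be checked before launching anything. The sketch below is a hypothetical helper, not part of the project's CLI; it assumes one GPU per worker, and the `Topology` name and `validate` method are made up for illustration.

```python
from dataclasses import dataclass

MAX_GPUS = 8  # single node with 8 GPUs, as stated above


@dataclass(frozen=True)
class Topology:
    """Hypothetical worker layout; assumes one GPU per worker."""

    prefill_workers: int
    decode_workers: int

    def validate(self, max_gpus: int = MAX_GPUS) -> None:
        if self.prefill_workers < 1 or self.decode_workers < 1:
            raise ValueError("need at least one prefill and one decode worker")
        total = self.prefill_workers + self.decode_workers
        if total > max_gpus:
            raise ValueError(
                f"prefill + decode = {total} exceeds the {max_gpus} available GPUs"
            )
```

`Topology(2, 2).validate()` passes silently, while `Topology(5, 4).validate()` raises `ValueError`.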
## Common commands

Generate a small-append trace:

```bash
uv run agentic-pd-hybrid replay \
  --trace ~/ali-trace/trace-qwen3-coder-formatted/041715-041717.jsonl \
  --policy sticky \
  --prefill-workers 2 \
  --decode-workers 2 \
  --output outputs/sticky.jsonl
uv run agentic-pd-hybrid make-small-append-trace \
  --output outputs/smoke-hotcap-30k-1k-256.jsonl \
  --session-count 4 \
  --turns-per-session 3 \
  --initial-input-length 30000 \
  --append-input-length 1000 \
  --output-length 256
```

Sample a 10-minute shard at session granularity:

```bash
uv run agentic-pd-hybrid sample-sessions \
  --trace ~/ali-trace/trace-qwen3-coder-formatted/041715-041717.jsonl \
  --output outputs/sampled-10min.jsonl \
  --target-duration-s 600 \
  --session-sample-rate 0.01
```
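
Sampling at session granularity means the keep/drop decision is made once per session, so every turn of a kept session survives together. The sketch below shows one way this could work; it is not the `sample-sessions` implementation. The record keys `session_id` and `timestamp` (seconds), the hash-based selection, and the way the duration window is applied are all assumptions.

```python
import hashlib


def keep_session(session_id: str, rate: float) -> bool:
    """Deterministically keep roughly `rate` of all sessions."""
    digest = hashlib.sha256(session_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big")  # uniform in [0, 2**64)
    return bucket < rate * 2**64


def sample_sessions(
    records: list[dict], rate: float, target_duration_s: float
) -> list[dict]:
    """Keep records of sampled sessions that fall inside the time window."""
    if not records:
        return []
    start = min(r["timestamp"] for r in records)
    return [
        r
        for r in records
        if r["timestamp"] - start <= target_duration_s
        and keep_session(r["session_id"], rate)
    ]
```

Hashing the session id rather than drawing a random number per record keeps the sample reproducible across runs and guarantees that multi-turn sessions are never split.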

Sample Ali sessions that keep the small-append KV reuse shape used by the
micro-benchmark:

```bash
uv run agentic-pd-hybrid sample-sessions \
  --trace ~/ali-trace/trace-qwen3-coder-formatted/041715-041717.jsonl \
  --output outputs/ali-small-append.jsonl \
  --profile small-append \
  --target-duration-s 600 \
  --session-sample-rate 0.01 \
  --min-turns 2
```

Replay against a live router:

```bash
uv run agentic-pd-hybrid replay \
  --trace ~/ali-trace/trace-qwen3-coder-formatted/041715-041717.jsonl \
  --policy sticky \
  --router-url http://127.0.0.1:8000 \
  --model Qwen3-Coder-30B-A3B-Instruct \
  --output outputs/sticky-live.jsonl
```

Launch a real PD stack and collect live performance numbers:
Run a live benchmark:

```bash
uv run agentic-pd-hybrid benchmark-live \
  --trace ~/ali-trace/trace-qwen3-coder-formatted/041715-041717.jsonl \
  --policy sticky \
  --trace outputs/micro-serveable-varturn-30k-1k-256-20260424T0756Z.jsonl \
  --output-root outputs/live-serveable-varturn-30k-1k-256-hotcap \
  --mechanism kvcache-centric \
  --kvcache-admission-mode router \
  --sample-profile small-append \
  --policy kv-aware \
  --kvcache-admission-mode worker \
  --prefill-workers 1 \
  --decode-workers 1 \
  --prefill-gpu-ids 0 \
  --decode-gpu-ids 1 \
  --transfer-backend mooncake \
  --target-duration-s 600 \
  --session-sample-rate 0.01 \
  --output-root outputs/live
  --target-duration-s 2000 \
  --session-sample-rate 1.0 \
  --min-turns 2 \
  --time-scale 1 \
  --concurrency-limit 1000
```

Notes:
Replay only and write metrics:

- The provided Ali release trace contains lengths and `hash_ids`, not raw
  prompts. Replay therefore synthesizes deterministic prompt text from
  `hash_ids` so repeated blocks remain repeated across turns.
- `sticky` mode emits `x-smg-routing-key=<session_id>`, which matches the
  upstream gateway's `manual` policy semantics for "turn1 default, turn2+
  sticky".
- `kv-aware` computes decode placement from observed `hash_ids` overlap and
  can emit `x-smg-target-worker=<index>` when `--header-mode target-worker` is
  used with a compatible router decode policy.
- Live benchmarking uses the repo-local `agentic_pd_hybrid.pd_router`, which
  preserves the real prefill/decode double-request path over loopback without
  depending on the upstream Rust router build.
- Managed live benchmarking prefers the vendored
  `third_party/sglang/python/sglang` source tree, so local SGLang changes apply
  immediately without packaging a wheel.
- Live benchmarking currently targets the `mooncake` transfer backend, because
  `mooncake-transfer-engine` is installed and usable on this node.
- `benchmark-live` and `replay` support streaming by default for TTFT/TPOT
  measurement. Use `--no-stream` for E2E-only runs.
- `kvcache-centric` defaults to router-managed admission
  (`--kvcache-admission-mode router`). This keeps a router-side shadow of
  decode session residency and capacity, so the critical path does not issue
  per-request worker `/server_info` and `/v1/loads` probes. Use
  `--kvcache-admission-mode worker` only as an A/B baseline for the older
  worker-managed admission path.
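
The first note above, synthesizing deterministic prompt text from `hash_ids`, can be illustrated with a small sketch. This is not the replay module's actual generator: `block_text`, `synthesize_prompt`, and the fixed `WORDS_PER_BLOCK` are hypothetical, and a real generator would also need to match the token lengths recorded in the trace. The only property shown here is that the same hash id always yields the same text, so a later turn that repeats earlier blocks shares an identical prefix.

```python
import hashlib

WORDS_PER_BLOCK = 8  # illustrative size, not taken from the trace format


def block_text(hash_id: int, words_per_block: int = WORDS_PER_BLOCK) -> str:
    """Derive a fixed pseudo-text block from a single hash id."""
    words = []
    for i in range(words_per_block):
        digest = hashlib.sha256(f"{hash_id}:{i}".encode("utf-8")).hexdigest()
        words.append(digest[:8])
    return " ".join(words)


def synthesize_prompt(hash_ids: list[int]) -> str:
    """Concatenate per-block text so repeated hash ids give repeated text."""
    return "\n".join(block_text(h) for h in hash_ids)
```

For example, `synthesize_prompt([1, 2, 3])` starts with exactly the text of `synthesize_prompt([1, 2])`, which is the shape a prefix cache needs in order to register a hit on the second turn.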
```bash
uv run agentic-pd-hybrid replay \
  --trace path/to/trace.jsonl \
  --policy kv-aware \
  --mechanism pd-disaggregation \
  --router-url http://127.0.0.1:8000 \
  --output outputs/replay.jsonl
```

## Output

Each replay writes:
Each replay/benchmark run writes:

- request-level metrics JSONL at the requested output path
- summary JSON at `<output>.summary.json`
- request metrics: `request-metrics.jsonl`
- summary results: `request-metrics.jsonl.summary.json`
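
For a quick look at a run without opening the summary file, the request-metrics JSONL can be aggregated directly. The sketch below is an assumption-laden example: the field names `e2e_latency_s` and `error` are hypothetical and may not match the real log schema, so adjust them to whatever keys the file actually contains.

```python
import json
import statistics
from pathlib import Path


def summarize(metrics_path: Path) -> dict:
    """Count requests and errors and average E2E latency over successes."""
    latencies: list[float] = []
    total = 0
    errors = 0
    with metrics_path.open(encoding="utf-8") as fh:
        for line in fh:
            line = line.strip()
            if not line:
                continue  # tolerate blank lines
            record = json.loads(line)
            total += 1
            if record.get("error"):  # hypothetical field name
                errors += 1
            elif record.get("e2e_latency_s") is not None:  # hypothetical field name
                latencies.append(record["e2e_latency_s"])
    mean = statistics.fmean(latencies) if latencies else None
    return {"requests": total, "errors": errors, "mean_e2e_latency_s": mean}
```

Failed requests are excluded from the latency average so that a timeout does not skew the figure for requests that actually completed.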

Each request log contains:
Key fields to look at:

- request id
- session id
- turn id
- assigned prefill node
- assigned decode node
- latency fields when a live router is used
- whether reuse was expected and whether block overlap was observed
- expected KV transfer blocks
- per-node load snapshot at assignment time
- E2E latency
- TTFT / TPOT
- execution mode
- cached tokens
- KV transfer blocks
- error
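
TTFT and TPOT in the list above are usually derived from the arrival times of streamed tokens. The sketch below uses the common definitions (time to first token, and mean gap between subsequent tokens); the project's exact definitions are not stated here, and the function name and arguments are hypothetical.

```python
from typing import Optional


def ttft_tpot(
    request_start_s: float, token_times_s: list[float]
) -> tuple[float, Optional[float]]:
    """Return (TTFT, TPOT) in seconds from streamed token arrival times."""
    if not token_times_s:
        raise ValueError("no tokens were received")
    ttft = token_times_s[0] - request_start_s
    if len(token_times_s) == 1:
        return ttft, None  # TPOT is undefined for a single token
    decode_span = token_times_s[-1] - token_times_s[0]
    tpot = decode_span / (len(token_times_s) - 1)
    return ttft, tpot
```

This is why streaming is needed for these two metrics: without per-token timestamps only the E2E latency can be observed.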

## Maintenance conventions

- Project code changes: `feat:` / `fix:` / `docs:`.
- SGLang changes: `feat(sglang): ...` / `fix(sglang): ...`.
- The baseline for `third_party/sglang` is a clean SGLang `v0.5.10` snapshot.
- Do not commit `outputs/`, logs, `__pycache__`, or virtual environments.