# Agentic PD Hybrid

This project is a minimal experimental framework built on SGLang xPyD. It aims to answer one question:

**Can session-aware / KV-cache-aware P/D routing reduce end-to-end latency for agentic coding workloads?**

A fuller but still concise description is in [docs/PROJECT_OVERVIEW.md](docs/PROJECT_OVERVIEW.md).

## What it currently does

- Launches a single-node SGLang P/D stack.
- Replays the Ali coding agent trace and records request-level metrics.
- Supports the `default`, `sticky`, and `kv-aware` routing policies.
- Supports comparing the `pd-disaggregation`, `kvcache-centric`, and `pd-colo` mechanisms.
- Supports micro-benchmark traces with small appends and multi-turn sessions.
- Maintains a local patch based on SGLang `v0.5.10`, kept in `third_party/sglang`.

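To make the difference between the three routing policies concrete, here is a minimal sketch of how a router might choose a worker. It is an illustration only and is not taken from this project's router: the `Worker` class, its `load` and `cached_tokens` fields, and the `route` function are all hypothetical names, and the tie-breaking rules are assumptions.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class Worker:
    """Hypothetical router-side view of one worker."""
    name: str
    load: int = 0  # number of in-flight requests (assumed metric)
    # session_id -> tokens of that session believed to be in this worker's KV cache
    cached_tokens: dict[str, int] = field(default_factory=dict)


def route(policy: str, workers: list[Worker], session_id: str,
          sticky_map: dict[str, str]) -> Worker:
    """Pick a worker for one request under the given policy (illustrative only)."""
    least_loaded = min(workers, key=lambda w: w.load)
    if policy == "default":
        # Session-agnostic: balance on load alone.
        return least_loaded
    if policy == "sticky":
        # Pin each session to the worker that served its first turn.
        name = sticky_map.setdefault(session_id, least_loaded.name)
        return next(w for w in workers if w.name == name)
    if policy == "kv-aware":
        # Prefer the worker holding the most cached tokens for this session;
        # break ties by lower load.
        return max(workers,
                   key=lambda w: (w.cached_tokens.get(session_id, 0), -w.load))
    raise ValueError(f"unknown policy: {policy}")


workers = [Worker("a", load=3, cached_tokens={"s1": 30000}), Worker("b", load=1)]
print(route("default", workers, "s1", {}).name)
print(route("kv-aware", workers, "s1", {}).name)
```

In this toy setup, `default` picks the lightly loaded worker `b`, while `kv-aware` picks `a` because it already holds the session's prefix.
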
## Environment

Use `uv` for everything:

```bash
uv sync
```

Default model path:

```text
~/models/Qwen/Qwen3-Coder-30B-A3B-Instruct
```

The main test environment is currently a single node with 8 GPUs, under the constraint `prefill + decode <= 8`.

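The constraint above can be enumerated directly. The sketch below lists every `(prefill, decode)` worker count that fits on the node, assuming one GPU per worker and at least one worker of each kind; both assumptions, and the function name `valid_splits`, are illustrative rather than taken from the project.

```python
def valid_splits(total_gpus: int = 8) -> list[tuple[int, int]]:
    """Return all (prefill, decode) worker counts with prefill + decode <= total_gpus.

    Assumes one GPU per worker and at least one prefill and one decode worker.
    """
    return [
        (prefill, decode)
        for prefill in range(1, total_gpus)
        for decode in range(1, total_gpus - prefill + 1)
    ]


splits = valid_splits(8)
print(len(splits))  # 28 candidate configurations on an 8-GPU node
print(splits[:3])   # [(1, 1), (1, 2), (1, 3)]
```
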
## Common commands

Generate a small-append trace:

```bash
uv run agentic-pd-hybrid make-small-append-trace \
  --output outputs/smoke-hotcap-30k-1k-256.jsonl \
  --session-count 4 \
  --turns-per-session 3 \
  --initial-input-length 30000 \
  --append-input-length 1000 \
  --output-length 256
```

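As a rough illustration of what these flags describe, the sketch below builds an in-memory trace with the same shape: a few sessions, each with a long first prompt followed by short appends. The record fields (`session_id`, `turn`, `input_length`, `output_length`) are hypothetical and are not the project's actual trace schema. The sketch also assumes the input grows only by the appended tokens and ignores previous outputs, which the real generator may handle differently.

```python
import json


def make_small_append_trace(session_count: int, turns_per_session: int,
                            initial_input_length: int, append_input_length: int,
                            output_length: int) -> list[dict]:
    """Build hypothetical trace records for multi-turn, small-append sessions."""
    records = []
    for s in range(session_count):
        for t in range(turns_per_session):
            records.append({
                "session_id": f"session-{s}",
                "turn": t,
                # Turn 0 carries the full initial prompt; each later turn
                # adds append_input_length tokens on top (assumed model).
                "input_length": initial_input_length + t * append_input_length,
                "output_length": output_length,
            })
    return records


records = make_small_append_trace(4, 3, 30000, 1000, 256)
for record in records[:3]:
    print(json.dumps(record))  # one JSON object per line, as in a JSONL file
```

Under these assumptions, each session's prompt is mostly a repeat of its previous turn, which is the case where KV cache reuse should matter most.
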
Run a live benchmark:

```bash
uv run agentic-pd-hybrid benchmark-live \
  --trace outputs/micro-serveable-varturn-30k-1k-256-20260424T0756Z.jsonl \
  --output-root outputs/live-serveable-varturn-30k-1k-256-hotcap \
  --mechanism kvcache-centric \
  --policy kv-aware \
  --kvcache-admission-mode worker \
  --prefill-workers 1 \
  --decode-workers 1 \
  --prefill-gpu-ids 0 \
  --decode-gpu-ids 1 \
  --transfer-backend mooncake \
  --target-duration-s 2000 \
  --session-sample-rate 1.0 \
  --min-turns 2 \
  --time-scale 1 \
  --concurrency-limit 1000
```

Replay only and write metrics:

```bash
uv run agentic-pd-hybrid replay \
  --trace path/to/trace.jsonl \
  --policy kv-aware \
  --mechanism pd-disaggregation \
  --router-url http://127.0.0.1:8000 \
  --output outputs/replay.jsonl
```

## Output

Each replay/benchmark run writes:

- Request metrics: `request-metrics.jsonl`
- Summary: `request-metrics.jsonl.summary.json`

Key things to look at:

- E2E latency
- TTFT / TPOT
- execution mode
- cached tokens
- KV transfer blocks
- error

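For a quick look at the per-request file without opening the summary, a short script along the following lines may help. It is a sketch under assumptions: the field names `e2e_latency_s` and `error` are hypothetical stand-ins for whatever keys the metrics file actually uses, and the percentile is a simple nearest-rank estimate.

```python
import json
import math
from typing import Iterable, Optional


def summarize(lines: Iterable[str], latency_field: str = "e2e_latency_s") -> dict:
    """Summarize one latency field from JSONL request metrics (field names assumed)."""
    latencies, errors = [], 0
    for line in lines:
        line = line.strip()
        if not line:
            continue
        record = json.loads(line)
        if record.get("error"):
            # Count failed requests separately so they do not skew latency.
            errors += 1
            continue
        latencies.append(float(record[latency_field]))
    latencies.sort()

    def percentile(q: float) -> Optional[float]:
        # Nearest-rank percentile over the successful requests.
        if not latencies:
            return None
        rank = max(1, math.ceil(q / 100 * len(latencies)))
        return latencies[rank - 1]

    return {
        "count": len(latencies),
        "errors": errors,
        "mean": sum(latencies) / len(latencies) if latencies else None,
        "p50": percentile(50),
        "p99": percentile(99),
    }


sample = [
    '{"e2e_latency_s": 1.0}',
    '{"e2e_latency_s": 2.0}',
    '{"e2e_latency_s": 3.0}',
    '{"e2e_latency_s": 4.0}',
    '{"error": "timeout"}',
]
print(summarize(sample))
```

To run it on a real file, pass an open file handle instead of the sample list, for example `summarize(open("request-metrics.jsonl"))`, after adjusting `latency_field` to the key the file actually contains.
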
## Maintenance conventions

- Project code changes: `feat:` / `fix:` / `docs:`.
- SGLang changes: `feat(sglang): ...` / `fix(sglang): ...`.
- The baseline for `third_party/sglang` is a clean SGLang `v0.5.10` snapshot.
- Do not commit `outputs/`, logs, `__pycache__`, or virtual environments.