# Agentic PD Hybrid
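This project is a minimal experimental framework built on SGLang xPyD. It is meant to answer one question:
**Can session-aware / KV-cache-aware P/D routing reduce end-to-end latency for agentic coding workloads?**
A fuller but still concise description is in [docs/PROJECT_OVERVIEW.md](docs/PROJECT_OVERVIEW.md).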
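## What it currently does
- Launches a single-node SGLang P/D stack.
- Replays the Ali coding agent trace and records request-level metrics.
- Supports the `default`, `sticky`, and `kv-aware` routing policies.
- Supports comparing the `pd-disaggregation`, `kvcache-centric`, and `pd-colo` mechanisms.
- Supports micro-benchmark traces with small appends and multi-turn sessions.
- Maintains a local patch based on SGLang `v0.5.10`, kept in `third_party/sglang`.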
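
To make the difference between the policy names concrete, here is a minimal sketch of what a session-sticky choice and a KV-cache-aware choice could look like. It is not this project's router: the function names, the worker labels, and the idea of scoring workers by cached token count are all assumptions made for illustration.

```python
import hashlib


def sticky_worker(session_id: str, workers: list[str]) -> str:
    """Hypothetical sticky policy: the same session always maps to the same worker."""
    digest = hashlib.sha256(session_id.encode("utf-8")).digest()
    index = int.from_bytes(digest[:8], "big") % len(workers)
    return workers[index]


def kv_aware_worker(cached_tokens_by_worker: dict[str, int]) -> str:
    """Hypothetical KV-aware policy: pick the worker holding the most cached tokens.

    Ties are broken by worker name so the choice is deterministic.
    """
    names = sorted(cached_tokens_by_worker)
    return max(names, key=lambda name: cached_tokens_by_worker[name])


if __name__ == "__main__":
    workers = ["decode-0", "decode-1"]  # illustrative labels only
    print(sticky_worker("session-a", workers))
    print(kv_aware_worker({"decode-0": 0, "decode-1": 30000}))
```

A sticky policy needs no cache state at all, while a KV-aware policy depends on some view of what each worker has cached; that trade-off is what the comparison between policies is meant to measure.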
## Environment
Use `uv` for everything:
```bash
uv sync
```
Default model path:
```text
~/models/Qwen/Qwen3-Coder-30B-A3B-Instruct
```
The main test environment is currently a single node with 8 GPUs, so the constraint is `prefill + decode <= 8`.
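
The constraint can be checked before launching anything. The sketch below is only an illustration: the helper names are hypothetical, and it assumes each prefill or decode worker occupies one GPU and that at least one worker of each kind is required.

```python
def fits_on_node(prefill_workers: int, decode_workers: int, total_gpus: int = 8) -> bool:
    """Return True if the worker split satisfies prefill + decode <= total_gpus."""
    if prefill_workers < 0 or decode_workers < 0:
        raise ValueError("worker counts must be non-negative")
    return prefill_workers + decode_workers <= total_gpus


def feasible_splits(total_gpus: int = 8) -> list[tuple[int, int]]:
    """Enumerate (prefill, decode) splits with at least one worker of each kind."""
    return [
        (prefill, decode)
        for prefill in range(1, total_gpus)
        for decode in range(1, total_gpus - prefill + 1)
    ]


if __name__ == "__main__":
    print(fits_on_node(1, 1))
    print(fits_on_node(5, 4))
    print(len(feasible_splits()))
```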
## Common commands
Generate a small-append trace:
```bash
uv run agentic-pd-hybrid make-small-append-trace \
  --output outputs/smoke-hotcap-30k-1k-256.jsonl \
  --session-count 4 \
  --turns-per-session 3 \
  --initial-input-length 30000 \
  --append-input-length 1000 \
  --output-length 256
```
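
For intuition about what these lengths imply, the sketch below computes the per-turn context length of one session and the fraction of each turn that an earlier turn's KV cache could cover. It does not reproduce the trace generator: the function names are hypothetical, and whether earlier outputs are carried into later prompts is an assumption exposed as the `carry_output` flag.

```python
def turn_context_lengths(
    initial_input_length: int,
    append_input_length: int,
    output_length: int,
    turns: int,
    carry_output: bool = True,
) -> list[int]:
    """Context length (in tokens) seen at each turn of one session."""
    lengths = []
    context = initial_input_length
    for turn in range(turns):
        if turn > 0:
            # Each later turn appends new input on top of the existing context.
            context += append_input_length
            if carry_output:
                # Assumption: the previous turn's output stays in the context.
                context += output_length
        lengths.append(context)
    return lengths


def reusable_fraction(lengths: list[int], append_input_length: int) -> list[float]:
    """Share of each turn's context that precedes the newly appended input."""
    fractions = [0.0]  # the first turn has nothing to reuse
    for length in lengths[1:]:
        fractions.append((length - append_input_length) / length)
    return fractions


if __name__ == "__main__":
    lengths = turn_context_lengths(30000, 1000, 256, turns=3)
    print(lengths)
    print(reusable_fraction(lengths, 1000))
```

Under these assumptions almost all of every later turn is a prefix already seen, which is the regime where routing a session back to a worker that still holds its KV cache should matter most.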
Run a live benchmark:
```bash
uv run agentic-pd-hybrid benchmark-live \
  --trace outputs/micro-serveable-varturn-30k-1k-256-20260424T0756Z.jsonl \
  --output-root outputs/live-serveable-varturn-30k-1k-256-hotcap \
  --mechanism kvcache-centric \
  --policy kv-aware \
  --kvcache-admission-mode worker \
  --prefill-workers 1 \
  --decode-workers 1 \
  --prefill-gpu-ids 0 \
  --decode-gpu-ids 1 \
  --transfer-backend mooncake \
  --target-duration-s 2000 \
  --session-sample-rate 1.0 \
  --min-turns 2 \
  --time-scale 1 \
  --concurrency-limit 1000
```
Replay only and write metrics:
```bash
uv run agentic-pd-hybrid replay \
  --trace path/to/trace.jsonl \
  --policy kv-aware \
  --mechanism pd-disaggregation \
  --router-url http://127.0.0.1:8000 \
  --output outputs/replay.jsonl
```
## Output
Each replay/benchmark run writes:
- request metrics: `request-metrics.jsonl`
- summary: `request-metrics.jsonl.summary.json`
Key fields to look at:
- E2E latency
- TTFT / TPOT
- execution mode
- cached tokens
- KV transfer blocks
- error
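
For a quick look at a metrics file without the summary JSON, a sketch like the following can aggregate the per-request records. The field names `e2e_latency_s` and `error` are assumptions, not the project's documented schema; adjust them to match the keys actually present in `request-metrics.jsonl`.

```python
import json
import math


def percentile(values: list[float], q: float) -> float:
    """Nearest-rank percentile for q in (0, 100]."""
    ordered = sorted(values)
    rank = max(1, math.ceil(q / 100 * len(ordered)))
    return ordered[rank - 1]


def summarize(lines, latency_key: str = "e2e_latency_s", error_key: str = "error") -> dict:
    """Aggregate JSONL request records; both key names are hypothetical."""
    latencies = []
    total = 0
    errors = 0
    for line in lines:
        line = line.strip()
        if not line:
            continue  # tolerate blank lines
        record = json.loads(line)
        total += 1
        if record.get(error_key):
            errors += 1
            continue  # failed requests are excluded from latency statistics
        value = record.get(latency_key)
        if value is not None:
            latencies.append(float(value))
    summary = {"requests": total, "errors": errors}
    if latencies:
        summary["mean"] = sum(latencies) / len(latencies)
        summary["p50"] = percentile(latencies, 50)
        summary["p99"] = percentile(latencies, 99)
    return summary


if __name__ == "__main__":
    sample = ['{"e2e_latency_s": 1.0}', '{"e2e_latency_s": 3.0}']
    print(summarize(sample))
```

To run it on a real file, pass an open file object, for example `summarize(open("request-metrics.jsonl", encoding="utf-8"))`.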
## Maintenance conventions
- Project code changes: `feat:` / `fix:` / `docs:`
- SGLang changes: `feat(sglang): ...` / `fix(sglang): ...`
- The baseline for `third_party/sglang` is a clean SGLang `v0.5.10` snapshot.
- Do not commit `outputs/`, logs, `__pycache__`, or virtual environments.