diff --git a/README.md b/README.md
index 12eb036..63e3193 100644
--- a/README.md
+++ b/README.md
@@ -1,162 +1,101 @@
-## Agentic PD Hybrid
+# Agentic PD Hybrid
 
-Minimal prototype scaffold for evaluating session-aware and KV-cache-aware
-prefill/decode routing on top of SGLang PD disaggregation.
+This project is a minimal experimental framework on top of SGLang xPyD, built to answer one question:
 
-For a concise description of the project design, implemented features, current
-findings, and known limits, see [docs/PROJECT_OVERVIEW.md](docs/PROJECT_OVERVIEW.md).
+**Can session-aware / KV-cache-aware P/D routing reduce end-to-end latency for agentic coding workloads?**
 
-Current implementation covers the initial MVP path in `AGENTS.md`:
+For a fuller but still concise description, see [docs/PROJECT_OVERVIEW.md](docs/PROJECT_OVERVIEW.md).
 
-1. One-node PD/xPyD launch planning
-2. Trace replay plus request-level metrics logging
-3. Real end-to-end benchmark orchestration
+## What it currently does
 
-Routing policy is kept separate from mechanism:
+- Launches a single-node SGLang P/D stack.
+- Replays the Ali coding agent trace and records request-level metrics.
+- Supports the `default`, `sticky`, and `kv-aware` routing policies.
+- Supports comparing `pd-disaggregation`, `kvcache-centric`, and `pd-colo`.
+- Supports micro-benchmark traces with small appends and multi-turn sessions.
+- Maintains a local patch based on SGLang `v0.5.10`, kept under `third_party/sglang`.
 
-- `agentic_pd_hybrid.topology` and `agentic_pd_hybrid.launcher`
-  handle cluster shape and SGLang command generation.
-- `agentic_pd_hybrid.policies`
-  handles decode selection heuristics.
-- `agentic_pd_hybrid.replay`
-  handles trace pacing, synthetic prompt generation, and metrics.
-- `agentic_pd_hybrid.sampling`
-  handles session-granularity trace sampling for live tests.
-- `agentic_pd_hybrid.stack` / `agentic_pd_hybrid.benchmark`
-  handles launching and tearing down a real PD stack.
+## Environment
 
-## Environment
-
-Use `uv` for all environment management.
-
-Sync the environment:
+Use `uv` for everything:
 
 ```bash
 uv sync
 ```
 
-`third_party/sglang` vendors a clean SGLang `v0.5.10` snapshot plus our local
-PD/session-cache patches in later commits. Keep SGLang changes scoped under that
-directory and commit them with `feat(sglang): ...` or `fix(sglang): ...` so they
-stay easy to review against the vendor baseline.
+Default model path:
 
-## CLI
-
-Print one-node PD launch commands:
-
-```bash
-uv run agentic-pd-hybrid print-launch \
-  --model-path ~/models/Qwen/Qwen3-Coder-30B-A3B-Instruct \
-  --prefill-workers 2 \
-  --decode-workers 2 \
-  --transfer-backend mooncake
+```text
+~/models/Qwen/Qwen3-Coder-30B-A3B-Instruct
 ```
 
-Replay the Ali trace in dry-run mode and emit request logs plus a summary:
+The main test environment is currently a single 8-GPU node, with the constraint `prefill + decode <= 8`.
+
+## Common commands
+
+Generate a small-append trace:
 
 ```bash
-uv run agentic-pd-hybrid replay \
-  --trace ~/ali-trace/trace-qwen3-coder-formatted/041715-041717.jsonl \
-  --policy sticky \
-  --prefill-workers 2 \
-  --decode-workers 2 \
-  --output outputs/sticky.jsonl
+uv run agentic-pd-hybrid make-small-append-trace \
+  --output outputs/smoke-hotcap-30k-1k-256.jsonl \
+  --session-count 4 \
+  --turns-per-session 3 \
+  --initial-input-length 30000 \
+  --append-input-length 1000 \
+  --output-length 256
 ```
 
-Sample a 10-minute shard at session granularity:
-
-```bash
-uv run agentic-pd-hybrid sample-sessions \
-  --trace ~/ali-trace/trace-qwen3-coder-formatted/041715-041717.jsonl \
-  --output outputs/sampled-10min.jsonl \
-  --target-duration-s 600 \
-  --session-sample-rate 0.01
-```
-
-Sample Ali sessions that keep the small-append KV reuse shape used by the
-micro-benchmark:
-
-```bash
-uv run agentic-pd-hybrid sample-sessions \
-  --trace ~/ali-trace/trace-qwen3-coder-formatted/041715-041717.jsonl \
-  --output outputs/ali-small-append.jsonl \
-  --profile small-append \
-  --target-duration-s 600 \
-  --session-sample-rate 0.01 \
-  --min-turns 2
-```
-
-Replay against a live router:
-
-```bash
-uv run agentic-pd-hybrid replay \
-  --trace ~/ali-trace/trace-qwen3-coder-formatted/041715-041717.jsonl \
-  --policy sticky \
-  --router-url http://127.0.0.1:8000 \
-  --model Qwen3-Coder-30B-A3B-Instruct \
-  --output outputs/sticky-live.jsonl
-```
-
-Launch a real PD stack and collect live performance numbers:
+Run a live benchmark:
 
 ```bash
 uv run agentic-pd-hybrid benchmark-live \
-  --trace ~/ali-trace/trace-qwen3-coder-formatted/041715-041717.jsonl \
-  --policy sticky \
+  --trace outputs/micro-serveable-varturn-30k-1k-256-20260424T0756Z.jsonl \
+  --output-root outputs/live-serveable-varturn-30k-1k-256-hotcap \
   --mechanism kvcache-centric \
-  --kvcache-admission-mode router \
-  --sample-profile small-append \
+  --policy kv-aware \
+  --kvcache-admission-mode worker \
   --prefill-workers 1 \
   --decode-workers 1 \
+  --prefill-gpu-ids 0 \
+  --decode-gpu-ids 1 \
   --transfer-backend mooncake \
-  --target-duration-s 600 \
-  --session-sample-rate 0.01 \
-  --output-root outputs/live
+  --target-duration-s 2000 \
+  --session-sample-rate 1.0 \
+  --min-turns 2 \
+  --time-scale 1 \
+  --concurrency-limit 1000
 ```
 
-Notes:
+Replay only and write metrics:
 
-- The provided Ali release trace contains lengths and `hash_ids`, not raw
-  prompts. Replay therefore synthesizes deterministic prompt text from
-  `hash_ids` so repeated blocks remain repeated across turns.
-- `sticky` mode emits `x-smg-routing-key=`, which matches the
-  upstream gateway's `manual` policy semantics for "turn1 default, turn2+
-  sticky".
-- `kv-aware` computes decode placement from observed `hash_ids` overlap and
-  can emit `x-smg-target-worker=` when `--header-mode target-worker` is
-  used with a compatible router decode policy.
-- Live benchmarking uses the repo-local `agentic_pd_hybrid.pd_router`, which
-  preserves the real prefill/decode double-request path over loopback without
-  depending on the upstream Rust router build.
-- Managed live benchmarking prefers the vendored
-  `third_party/sglang/python/sglang` source tree, so local SGLang changes apply
-  immediately without packaging a wheel.
-- Live benchmarking currently targets the `mooncake` transfer backend, because
-  `mooncake-transfer-engine` is installed and usable on this node.
-- `benchmark-live` and `replay` support streaming by default for TTFT/TPOT
-  measurement. Use `--no-stream` for E2E-only runs.
-- `kvcache-centric` defaults to router-managed admission
-  (`--kvcache-admission-mode router`). This keeps a router-side shadow of
-  decode session residency and capacity, so the critical path does not issue
-  per-request worker `/server_info` and `/v1/loads` probes. Use
-  `--kvcache-admission-mode worker` only as an A/B baseline for the older
-  worker-managed admission path.
+```bash
+uv run agentic-pd-hybrid replay \
+  --trace path/to/trace.jsonl \
+  --policy kv-aware \
+  --mechanism pd-disaggregation \
+  --router-url http://127.0.0.1:8000 \
+  --output outputs/replay.jsonl
+```
 
-## Output
+## Output
 
-Each replay writes:
+Each replay/benchmark run writes:
 
-- request-level metrics JSONL at the requested output path
-- summary JSON at `.summary.json`
+- Request metrics: `request-metrics.jsonl`
+- Summary: `request-metrics.jsonl.summary.json`
 
-Each request log contains:
+Fields to focus on:
 
-- request id
-- session id
-- turn id
-- assigned prefill node
-- assigned decode node
-- latency fields when a live router is used
-- whether reuse was expected and whether block overlap was observed
-- expected KV transfer blocks
-- per-node load snapshot at assignment time
+- E2E latency
+- TTFT / TPOT
+- execution mode
+- cached tokens
+- KV transfer blocks
+- error
+
+## Maintenance conventions
+
+- Project code changes: `feat:` / `fix:` / `docs:`.
+- SGLang changes: `feat(sglang): ...` / `fix(sglang): ...`.
+- The baseline for `third_party/sglang` is a clean SGLang `v0.5.10` snapshot.
+- Do not commit `outputs/`, logs, `__pycache__`, or virtual environments.
diff --git a/docs/PROJECT_OVERVIEW.md b/docs/PROJECT_OVERVIEW.md
index 4ae0e5a..a60cd3f 100644
--- a/docs/PROJECT_OVERVIEW.md
+++ b/docs/PROJECT_OVERVIEW.md
@@ -1,141 +1,98 @@
-# Project Overview
+# Project Overview
 
-This repository is a minimal research prototype for evaluating whether
-session-aware and KV-cache-aware prefill/decode routing can improve end-to-end
-latency for agentic coding workloads on top of SGLang xPyD.
+This project tests one question:
 
-The current target environment is a single 8-GPU node running SGLang `v0.5.10`
-with Qwen3-Coder-30B-A3B-Instruct. The repo vendors SGLang under
-`third_party/sglang` so our xPyD/session-cache changes are maintained together
-with the benchmark harness. The local setup keeps the P -> D transfer path
-through SGLang disaggregation and Mooncake loopback instead of replacing it with
-an in-process shortcut.
+**For agentic coding workloads, can P/D serving achieve lower end-to-end latency if the router is more aware of sessions and the KV cache?**
 
-## Design
+It is currently based on:
 
-The code keeps policy separate from mechanism.
+- SGLang `v0.5.10`
+- Qwen3-Coder-30B-A3B-Instruct
+- A single 8-GPU node
+- Mooncake loopback to emulate the P -> D transfer
 
-- Mechanism code launches SGLang workers, sends requests, manages streaming
-  sessions, and records request-level metrics.
-- Policy code decides which prefill worker and decode worker should receive a
-  request.
-- Replay and benchmark code preserve trace arrival times unless explicitly
-  configured otherwise, so concurrency comes from the workload shape rather than
-  from an artificial fixed-concurrency driver.
+## Design
 
-The main comparison points are:
+The code is split into two layers:
 
-- `pd-disaggregation`: normal router-managed P/D serving.
-- `kvcache-centric`: worker/router assisted session-aware routing that can keep
-  a decode streaming session resident and send later small appends directly to D.
-- `pd-colo`: direct colocated serving baseline for experiments that do not use
-  the P/D router path.
+- **Mechanism**: launches SGLang, sends requests, manages sessions, and collects metrics.
+- **Policy**: decides which P node and which D node a request goes to.
 
-## Implemented
+This way the routing policy can later be changed on its own, without entangling it with the SGLang/xPyD mechanism.
 
-The prototype currently includes:
+## Implemented
 
-- One-node P/D launch planning and managed stack lifecycle.
-- A lightweight Python PD router used for live local experiments.
-- Ali trace loading, session-granularity sampling, and synthetic prompt
-  generation from `hash_ids`.
-- Trace replay with natural pacing, request dependencies inside a session, and
-  request-level metrics JSONL plus summary JSON.
-- Routing policies:
-  - `default`: simple baseline placement.
-  - `sticky`: turn2+ prefers the previous D node for the same session.
-  - `kv-aware`: uses observed block overlap/session state to choose D placement.
-- Live benchmark orchestration through `benchmark-live`.
-- Small-append synthetic trace generation for micro-benchmarks.
-- KV-cache-centric worker admission modes:
-  - router shadow-state admission.
-  - worker queried admission.
-  - session-level D residency soft cap for worker-managed admission, so only a
-    small hot set is kept as decode streaming sessions while the rest fall back
-    to normal PD routing.
-- P-side prefill backup bookkeeping for experiments where D evictions can retain
-  a lower-priority copy on P.
-- Fail-fast handling for empty streaming responses and a shorter SGLang
-  disaggregation wait timeout to avoid treating transfer hangs as successful
-  long-tail responses.
+- Single-node P/D stack startup and shutdown.
+- A local Python PD router.
+- Ali trace loading, session-level sampling, and synthetic prompt generation.
+- Replay at the trace's original arrival times, rather than forcing load with a fixed concurrency.
+- Request-level metrics and a summary.
+- Routing policies:
+  - `default`
+  - `sticky`
+  - `kv-aware`
+- Serving mechanisms:
+  - `pd-disaggregation`
+  - `kvcache-centric`
+  - `pd-colo`
+- Micro-benchmark trace generation.
+- Comparison of worker-managed and router-managed KV admission.
+- A D session soft cap under worker-managed admission, so that not every session crowds into D KV.
+- SGLang patch:
+  - decode workers support local append-prefill in PD mode;
+  - streaming session cache status is exposed;
+  - idle streaming sessions can be evicted at session granularity;
+  - direct append admission queries are supported.
 
-## SGLang Maintenance
+## Current findings
 
-SGLang is tracked directly in this repository:
+On the micro-benchmark, `kvcache-centric` can outperform `pd-disaggregation`.
 
-- `chore: vendor sglang v0.5.10 snapshot` records the clean upstream baseline.
-- Later `feat(sglang): ...` / `fix(sglang): ...` commits should contain only
-  local SGLang changes.
-- Generated files such as `__pycache__` and benchmark outputs stay ignored.
+The reason is simple: with few sessions, everything fits in D KV, so turn2+ can go straight to the D session and skip part of the P/D path overhead.
 
-The current SGLang patch adds the worker-side mechanisms needed by
-KV-cache-centric experiments:
+On the test with 300+ requests and 58 sessions, the picture is different:
 
-- decode workers can optionally accept local append-prefill requests in PD mode;
-- streaming session cache status is exposed for router/admission decisions;
-- idle streaming sessions can be evicted at session granularity;
-- direct append admission can check resident session state and D token pressure
-  before the replay path bypasses P.
+- D KV cannot hold the working set of all sessions.
+- Naive worker-managed admission frequently evicts and reseeds whole sessions.
+- Reseed and transfer pressure offsets the gains from KV reuse.
+- Aggressive P-backup increases tail-latency risk.
 
-## Current Findings
+With the current soft-cap optimization:
 
-The micro-benchmark can make KV-cache-centric routing look better than
-`pd-disaggregation` because the active sessions fit in D KV cache. Later turns
-can then bypass P and use `kvcache-direct-to-d-session`, reducing TTFT.
+- worker-managed admission is more stable than the previous version;
+- TTFT drops noticeably;
+- 600s transfer hangs are no longer treated as successful responses;
+- but on the sampled Ali trace, `pd-disaggregation` is still slightly better.
 
-On the larger 316-request, variable-turn workload, there are 58 sessions and the
-working set is larger than the useful D residency budget. A naive worker-managed
-KV-cache-centric policy repeatedly evicts and reseeds whole sessions, adding
-TTFT and transfer pressure. Aggressive P-backup also increases tail risk when it
-keeps too much state around.
+Current assessment:
 
-The current soft-cap optimization improves worker-managed KV-cache-centric
-relative to the older worker-managed path, but `pd-disaggregation` is still
-slightly better on the sampled Ali workload because most requests fall back to
-normal PD routing while a few retained D sessions still consume token budget.
+**KV-cache-centric should retain only the sessions that are truly hot. Not every session is worth occupying D KV.**
 
-## Useful Commands
+The most valuable next steps are:
 
-Run a live benchmark with natural arrival timing:
+- inter-turn-gap-aware admission;
+- session aging;
+- more accurate prediction of which sessions will reuse KV soon.
 
-```bash
-uv run agentic-pd-hybrid benchmark-live \
-  --trace outputs/micro-serveable-varturn-30k-1k-256-20260424T0756Z.jsonl \
-  --output-root outputs/live-serveable-varturn-30k-1k-256-hotcap \
-  --mechanism kvcache-centric \
-  --policy kv-aware \
-  --kvcache-admission-mode worker \
-  --prefill-workers 1 \
-  --decode-workers 1 \
-  --prefill-gpu-ids 0 \
-  --decode-gpu-ids 1 \
-  --transfer-backend mooncake \
-  --target-duration-s 2000 \
-  --session-sample-rate 1.0 \
-  --min-turns 2 \
-  --time-scale 1 \
-  --concurrency-limit 1000
-```
+## SGLang maintenance
 
-Generate a 30k input, 1k append, 256 output small-append trace:
+`third_party/sglang` is tracked in the main repository.
 
-```bash
-uv run agentic-pd-hybrid make-small-append-trace \
-  --output outputs/smoke-hotcap-30k-1k-256.jsonl \
-  --session-count 4 \
-  --turns-per-session 3 \
-  --initial-input-length 30000 \
-  --append-input-length 1000 \
-  --output-length 256
-```
+Commit history structure:
 
-## Known Limits
+- `chore: vendor sglang v0.5.10 snapshot`: the clean upstream baseline.
+- `feat(sglang): ...` / `fix(sglang): ...`: our SGLang patches.
 
-- This is not production routing code.
-- The current evaluation is single-node and constrained by `prefill + decode <=
-  8` GPUs.
-- Trace prompts are synthetic because the Ali trace used here contains lengths
-  and `hash_ids`, not raw prompts.
-- KV-cache-centric admission still needs better hot-session prediction. The next
-  useful step is inter-turn-gap-aware admission and aging, so D cache is held
-  only for sessions likely to reuse it soon.
+When changing SGLang in the future:
+
+- change only the relevant files under `third_party/sglang`;
+- commit them separately;
+- include `(sglang)` in the commit message;
+- keep benchmark outputs, pyc files, and logs out of the commit.
+
+## Known limits
+
+- This is an experimental prototype, not a production router.
+- Validation so far is mainly on a single 8-GPU node.
+- The Ali trace has no raw prompts, so prompts have to be synthesized from `hash_ids`.
+- Routing still lacks real hot-session prediction.
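
Note on the `hash_ids` limit in the diff above: since the Ali trace carries only lengths and `hash_ids`, replay has to build prompt text from those ids. The following is a minimal Python sketch of one way this could be done deterministically, so that a block repeated across turns yields identical text and a later turn extends the earlier prompt as a prefix. It is an illustration under stated assumptions, not the repository's implementation: `block_text`, `synthesize_prompt`, the word list, and the block size are all hypothetical names and values.

```python
import hashlib

# Hypothetical filler vocabulary; any fixed word list would do.
_WORDS = ("alpha", "bravo", "charlie", "delta", "echo", "foxtrot", "golf", "hotel")


def block_text(hash_id: int, words_per_block: int = 16) -> str:
    """Return filler text that depends only on hash_id (assumed block size)."""
    words: list[str] = []
    counter = 0
    while len(words) < words_per_block:
        # Hash the id plus a counter so longer blocks can draw more bytes.
        digest = hashlib.sha256(f"{hash_id}:{counter}".encode()).digest()
        words.extend(_WORDS[b % len(_WORDS)] for b in digest)
        counter += 1
    return " ".join(words[:words_per_block])


def synthesize_prompt(hash_ids: list[int], words_per_block: int = 16) -> str:
    """Join one deterministic text block per hash id, in trace order."""
    return "\n".join(block_text(h, words_per_block) for h in hash_ids)


if __name__ == "__main__":
    turn1 = synthesize_prompt([101, 102, 103])
    turn2 = synthesize_prompt([101, 102, 103, 104])  # small append
    # The shared blocks produce the same text, so turn2 extends turn1.
    print(turn2.startswith(turn1))
```

Because each block depends only on its id, two turns that share a `hash_ids` prefix share a text prefix, which is the property a prefix-reuse experiment needs. Token counts per block are not controlled in this sketch; matching the trace's recorded lengths would need a tokenizer-aware step.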