docs: rewrite project docs in concise chinese

2026-04-24 12:41:52 +00:00
parent 5bdc0ed4f0
commit 08b13d22bc
2 changed files with 139 additions and 243 deletions
--- a/README.md
+++ b/README.md
@@ -1,162 +1,101 @@
-## Agentic PD Hybrid
+# Agentic PD Hybrid

-Minimal prototype scaffold for evaluating session-aware and KV-cache-aware
-prefill/decode routing on top of SGLang PD disaggregation.
+This project is a minimal experimental framework on top of SGLang xPyD, built to answer one question:

-For a concise description of the project design, implemented features, current
-findings, and known limits, see [docs/PROJECT_OVERVIEW.md](docs/PROJECT_OVERVIEW.md).
+**Can session-aware / KV-cache-aware P/D routing reduce end-to-end latency for agentic coding workloads?**

-Current implementation covers the initial MVP path in `AGENTS.md`:
+For a fuller but still concise description, see [docs/PROJECT_OVERVIEW.md](docs/PROJECT_OVERVIEW.md).

-1. One-node PD/xPyD launch planning
-2. Trace replay plus request-level metrics logging
-3. Real end-to-end benchmark orchestration
+## What Is Implemented

-Routing policy is kept separate from mechanism:
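+- Launches a single-node SGLang P/D stack.
+- Replays the Ali coding agent trace and records request-level metrics.
+- Supports the `default`, `sticky`, and `kv-aware` routing policies.
+- Supports comparing the `pd-disaggregation`, `kvcache-centric`, and `pd-colo` mechanisms.
+- Supports micro-benchmark traces with small appends and multi-turn sessions.
+- Maintains local patches based on SGLang `v0.5.10` under `third_party/sglang`.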

- `agentic_pd_hybrid.topology` and `agentic_pd_hybrid.launcher`
-  handle cluster shape and SGLang command generation.
- `agentic_pd_hybrid.policies`
-  handles decode selection heuristics.
- `agentic_pd_hybrid.replay`
-  handles trace pacing, synthetic prompt generation, and metrics.
- `agentic_pd_hybrid.sampling`
-  handles session-granularity trace sampling for live tests.
- `agentic_pd_hybrid.stack` / `agentic_pd_hybrid.benchmark`
-  handles launching and tearing down a real PD stack.
+## Environment

-## Environment
-
-Use `uv` for all environment management.
-
-Sync the environment:
+Use `uv` for all environment management:

 ```bash
 uv sync
 ```

-`third_party/sglang` vendors a clean SGLang `v0.5.10` snapshot plus our local
-PD/session-cache patches in later commits. Keep SGLang changes scoped under that
-directory and commit them with `feat(sglang): ...` or `fix(sglang): ...` so they
-stay easy to review against the vendor baseline.
+Default model path:

-## CLI
-
-Print one-node PD launch commands:
-
-```bash
-uv run agentic-pd-hybrid print-launch \
-  --model-path ~/models/Qwen/Qwen3-Coder-30B-A3B-Instruct \
-  --prefill-workers 2 \
-  --decode-workers 2 \
-  --transfer-backend mooncake
+```text
+~/models/Qwen/Qwen3-Coder-30B-A3B-Instruct
 ```

-Replay the Ali trace in dry-run mode and emit request logs plus a summary:
+The main test environment is a single 8-GPU node, so worker counts are constrained by `prefill + decode <= 8`.
+
+## Common Commands
+
+Generate a small-append trace:

 ```bash
-uv run agentic-pd-hybrid replay \
-  --trace ~/ali-trace/trace-qwen3-coder-formatted/041715-041717.jsonl \
-  --policy sticky \
-  --prefill-workers 2 \
-  --decode-workers 2 \
-  --output outputs/sticky.jsonl
+uv run agentic-pd-hybrid make-small-append-trace \
+  --output outputs/smoke-hotcap-30k-1k-256.jsonl \
+  --session-count 4 \
+  --turns-per-session 3 \
+  --initial-input-length 30000 \
+  --append-input-length 1000 \
+  --output-length 256
 ```

-Sample a 10-minute shard at session granularity:
-
-```bash
-uv run agentic-pd-hybrid sample-sessions \
-  --trace ~/ali-trace/trace-qwen3-coder-formatted/041715-041717.jsonl \
-  --output outputs/sampled-10min.jsonl \
-  --target-duration-s 600 \
-  --session-sample-rate 0.01
-```
-
-Sample Ali sessions that keep the small-append KV reuse shape used by the
-micro-benchmark:
-
-```bash
-uv run agentic-pd-hybrid sample-sessions \
-  --trace ~/ali-trace/trace-qwen3-coder-formatted/041715-041717.jsonl \
-  --output outputs/ali-small-append.jsonl \
-  --profile small-append \
-  --target-duration-s 600 \
-  --session-sample-rate 0.01 \
-  --min-turns 2
-```
-
-Replay against a live router:
-
-```bash
-uv run agentic-pd-hybrid replay \
-  --trace ~/ali-trace/trace-qwen3-coder-formatted/041715-041717.jsonl \
-  --policy sticky \
-  --router-url http://127.0.0.1:8000 \
-  --model Qwen3-Coder-30B-A3B-Instruct \
-  --output outputs/sticky-live.jsonl
-```
-
-Launch a real PD stack and collect live performance numbers:
+Run a live benchmark:

 ```bash
 uv run agentic-pd-hybrid benchmark-live \
-  --trace ~/ali-trace/trace-qwen3-coder-formatted/041715-041717.jsonl \
-  --policy sticky \
+  --trace outputs/micro-serveable-varturn-30k-1k-256-20260424T0756Z.jsonl \
+  --output-root outputs/live-serveable-varturn-30k-1k-256-hotcap \
  --mechanism kvcache-centric \
-  --kvcache-admission-mode router \
-  --sample-profile small-append \
+  --policy kv-aware \
+  --kvcache-admission-mode worker \
  --prefill-workers 1 \
  --decode-workers 1 \
+  --prefill-gpu-ids 0 \
+  --decode-gpu-ids 1 \
  --transfer-backend mooncake \
-  --target-duration-s 600 \
-  --session-sample-rate 0.01 \
-  --output-root outputs/live
+  --target-duration-s 2000 \
+  --session-sample-rate 1.0 \
+  --min-turns 2 \
+  --time-scale 1 \
+  --concurrency-limit 1000
 ```

-Notes:
+Replay only and write metrics:

- The provided Ali release trace contains lengths and `hash_ids`, not raw
-  prompts. Replay therefore synthesizes deterministic prompt text from
-  `hash_ids` so repeated blocks remain repeated across turns.
- `sticky` mode emits `x-smg-routing-key=<session_id>`, which matches the
-  upstream gateway's `manual` policy semantics for "turn1 default, turn2+
-  sticky".
- `kv-aware` computes decode placement from observed `hash_ids` overlap and
-  can emit `x-smg-target-worker=<index>` when `--header-mode target-worker` is
-  used with a compatible router decode policy.
- Live benchmarking uses the repo-local `agentic_pd_hybrid.pd_router`, which
-  preserves the real prefill/decode double-request path over loopback without
-  depending on the upstream Rust router build.
- Managed live benchmarking prefers the vendored
-  `third_party/sglang/python/sglang` source tree, so local SGLang changes apply
-  immediately without packaging a wheel.
- Live benchmarking currently targets the `mooncake` transfer backend, because
-  `mooncake-transfer-engine` is installed and usable on this node.
- `benchmark-live` and `replay` support streaming by default for TTFT/TPOT
-  measurement. Use `--no-stream` for E2E-only runs.
- `kvcache-centric` defaults to router-managed admission
-  (`--kvcache-admission-mode router`). This keeps a router-side shadow of
-  decode session residency and capacity, so the critical path does not issue
-  per-request worker `/server_info` and `/v1/loads` probes. Use
-  `--kvcache-admission-mode worker` only as an A/B baseline for the older
-  worker-managed admission path.
+```bash
+uv run agentic-pd-hybrid replay \
+  --trace path/to/trace.jsonl \
+  --policy kv-aware \
+  --mechanism pd-disaggregation \
+  --router-url http://127.0.0.1:8000 \
+  --output outputs/replay.jsonl
+```

-## Output
+## Output

-Each replay writes:
+Each replay/benchmark run writes:

- request-level metrics JSONL at the requested output path
- summary JSON at `<output>.summary.json`
+- Request metrics: `request-metrics.jsonl`
+- Summary: `request-metrics.jsonl.summary.json`

-Each request log contains:
+Key fields to check:

- request id
- session id
- turn id
- assigned prefill node
- assigned decode node
- latency fields when a live router is used
- whether reuse was expected and whether block overlap was observed
- expected KV transfer blocks
- per-node load snapshot at assignment time
+- E2E latency
+- TTFT / TPOT
+- execution mode
+- cached tokens
+- KV transfer blocks
+- error
+
+## Maintenance Conventions
+
+- Project code changes: `feat:` / `fix:` / `docs:`.
+- SGLang changes: `feat(sglang): ...` / `fix(sglang): ...`.
+- The baseline for `third_party/sglang` is a clean SGLang `v0.5.10` snapshot.
+- Do not commit `outputs/`, logs, `__pycache__`, or virtual environments.
--- a/docs/PROJECT_OVERVIEW.md
+++ b/docs/PROJECT_OVERVIEW.md
@@ -1,141 +1,98 @@
-# Project Overview
+# Project Overview

-This repository is a minimal research prototype for evaluating whether
-session-aware and KV-cache-aware prefill/decode routing can improve end-to-end
-latency for agentic coding workloads on top of SGLang xPyD.
+This project tests one question:

-The current target environment is a single 8-GPU node running SGLang `v0.5.10`
-with Qwen3-Coder-30B-A3B-Instruct. The repo vendors SGLang under
-`third_party/sglang` so our xPyD/session-cache changes are maintained together
-with the benchmark harness. The local setup keeps the P -> D transfer path
-through SGLang disaggregation and Mooncake loopback instead of replacing it with
-an in-process shortcut.
+**For agentic coding workloads, can P/D serving achieve lower end-to-end latency if the router is more aware of sessions and KV cache?**

-## Design
+Currently based on:

-The code keeps policy separate from mechanism.
+- SGLang `v0.5.10`
+- Qwen3-Coder-30B-A3B-Instruct
+- A single 8-GPU node
+- Mooncake loopback to emulate P -> D transfer

- Mechanism code launches SGLang workers, sends requests, manages streaming
-  sessions, and records request-level metrics.
- Policy code decides which prefill worker and decode worker should receive a
-  request.
- Replay and benchmark code preserve trace arrival times unless explicitly
-  configured otherwise, so concurrency comes from the workload shape rather than
-  from an artificial fixed-concurrency driver.
+## Design

-The main comparison points are:
+The code is split into two layers:

- `pd-disaggregation`: normal router-managed P/D serving.
- `kvcache-centric`: worker/router assisted session-aware routing that can keep
-  a decode streaming session resident and send later small appends directly to D.
- `pd-colo`: direct colocated serving baseline for experiments that do not use
-  the P/D router path.
+- **Mechanism**: launches SGLang, sends requests, manages sessions, and collects metrics.
+- **Policy**: decides which P node and which D node a request goes to.

-## Implemented
+This lets the routing policy be changed on its own later, without entangling it with the SGLang/xPyD mechanism.

-The prototype currently includes:
+## Implemented

- One-node P/D launch planning and managed stack lifecycle.
- A lightweight Python PD router used for live local experiments.
- Ali trace loading, session-granularity sampling, and synthetic prompt
-  generation from `hash_ids`.
- Trace replay with natural pacing, request dependencies inside a session, and
-  request-level metrics JSONL plus summary JSON.
- Routing policies:
-  - `default`: simple baseline placement.
-  - `sticky`: turn2+ prefers the previous D node for the same session.
-  - `kv-aware`: uses observed block overlap/session state to choose D placement.
- Live benchmark orchestration through `benchmark-live`.
- Small-append synthetic trace generation for micro-benchmarks.
- KV-cache-centric worker admission modes:
-  - router shadow-state admission.
-  - worker queried admission.
-  - session-level D residency soft cap for worker-managed admission, so only a
-    small hot set is kept as decode streaming sessions while the rest fall back
-    to normal PD routing.
- P-side prefill backup bookkeeping for experiments where D evictions can retain
-  a lower-priority copy on P.
- Fail-fast handling for empty streaming responses and a shorter SGLang
-  disaggregation wait timeout to avoid treating transfer hangs as successful
-  long-tail responses.
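+- Single-node P/D stack launch and teardown.
+- A local Python PD router.
+- Ali trace loading, session-level sampling, and synthetic prompt generation.
+- Replay at the trace's original arrival times, rather than forcing load with a fixed-concurrency driver.
+- Request-level metrics and summary.
+- Routing policies: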
+  - `default`
+  - `sticky`
+  - `kv-aware`
+- Serving mechanisms:
+  - `pd-disaggregation`
+  - `kvcache-centric`
+  - `pd-colo`
+- Micro-benchmark trace generation.
+- Worker-managed vs. router-managed KV admission comparison.
+- A D-session soft cap under worker-managed admission, so that not every session crowds into D KV.
+- SGLang patches:
+  - decode workers support local append-prefill in PD mode;
+  - streaming session cache status is exposed;
+  - idle streaming sessions can be evicted at session granularity;
+  - direct append admission queries are supported.

-## SGLang Maintenance
+## Current Findings

-SGLang is tracked directly in this repository:
+On the micro-benchmark, `kvcache-centric` can outperform `pd-disaggregation`.

- `chore: vendor sglang v0.5.10 snapshot` records the clean upstream baseline.
- Later `feat(sglang): ...` / `fix(sglang): ...` commits should contain only
-  local SGLang changes.
- Generated files such as `__pycache__` and benchmark outputs stay ignored.
+The reason is simple: with few sessions, everything fits in D KV, so turn 2+ can go straight to the D session and skip part of the P/D path overhead.

-The current SGLang patch adds the worker-side mechanisms needed by
-KV-cache-centric experiments:
+On the test with 300+ requests and 58 sessions, the picture is different:

- decode workers can optionally accept local append-prefill requests in PD mode;
- streaming session cache status is exposed for router/admission decisions;
- idle streaming sessions can be evicted at session granularity;
- direct append admission can check resident session state and D token pressure
-  before the replay path bypasses P.
+- D KV cannot hold the working set of all sessions.
+- Naive worker-managed admission frequently evicts and reseeds whole sessions.
+- Reseed and transfer pressure offsets the gains from KV reuse.
+- Aggressive P-backup increases tail-latency risk.

-## Current Findings
+After the current soft-cap optimization:

-The micro-benchmark can make KV-cache-centric routing look better than
-`pd-disaggregation` because the active sessions fit in D KV cache. Later turns
-can then bypass P and use `kvcache-direct-to-d-session`, reducing TTFT.
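+- worker-managed admission is more stable than the older version;
+- TTFT drops noticeably;
+- 600s transfer hangs are no longer treated as successful responses;
+- but on the sampled Ali trace, `pd-disaggregation` is still slightly better.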

-On the larger 316-request, variable-turn workload, there are 58 sessions and the
-working set is larger than the useful D residency budget. A naive worker-managed
-KV-cache-centric policy repeatedly evicts and reseeds whole sessions, adding
-TTFT and transfer pressure. Aggressive P-backup also increases tail risk when it
-keeps too much state around.
+Current assessment:

-The current soft-cap optimization improves worker-managed KV-cache-centric
-relative to the older worker-managed path, but `pd-disaggregation` is still
-slightly better on the sampled Ali workload because most requests fall back to
-normal PD routing while a few retained D sessions still consume token budget.
+**KV-cache-centric routing should retain only sessions that are truly hot. Not every session is worth occupying D KV.**

-## Useful Commands
+The most valuable next steps are:

-Run a live benchmark with natural arrival timing:
+- inter-turn-gap-aware admission;
+- session aging;
+- more accurate prediction of which sessions will reuse KV soon.

-```bash
-uv run agentic-pd-hybrid benchmark-live \
-  --trace outputs/micro-serveable-varturn-30k-1k-256-20260424T0756Z.jsonl \
-  --output-root outputs/live-serveable-varturn-30k-1k-256-hotcap \
-  --mechanism kvcache-centric \
-  --policy kv-aware \
-  --kvcache-admission-mode worker \
-  --prefill-workers 1 \
-  --decode-workers 1 \
-  --prefill-gpu-ids 0 \
-  --decode-gpu-ids 1 \
-  --transfer-backend mooncake \
-  --target-duration-s 2000 \
-  --session-sample-rate 1.0 \
-  --min-turns 2 \
-  --time-scale 1 \
-  --concurrency-limit 1000
-```
+## SGLang Maintenance

-Generate a 30k input, 1k append, 256 output small-append trace:
+`third_party/sglang` is tracked in the main repository.

-```bash
-uv run agentic-pd-hybrid make-small-append-trace \
-  --output outputs/smoke-hotcap-30k-1k-256.jsonl \
-  --session-count 4 \
-  --turns-per-session 3 \
-  --initial-input-length 30000 \
-  --append-input-length 1000 \
-  --output-length 256
-```
+Commit history structure:

-## Known Limits
+- `chore: vendor sglang v0.5.10 snapshot`: the clean upstream baseline.
+- `feat(sglang): ...` / `fix(sglang): ...`: our SGLang patches.

- This is not production routing code.
- The current evaluation is single-node and constrained by `prefill + decode <=
-  8` GPUs.
- Trace prompts are synthetic because the Ali trace used here contains lengths
-  and `hash_ids`, not raw prompts.
- KV-cache-centric admission still needs better hot-session prediction. The next
-  useful step is inter-turn-gap-aware admission and aging, so D cache is held
-  only for sessions likely to reuse it soon.
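+When changing SGLang later:
+
+- change only the relevant files under `third_party/sglang`;
+- commit separately;
+- include `(sglang)` in the commit message;
+- do not mix benchmark outputs, pyc files, or logs into the commit.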
+
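+## Known Limits
+
+- This is an experimental prototype, not a production router.
+- Current validation is mainly on a single 8-GPU node.
+- The Ali trace has no raw prompts, so prompts can only be synthesized from `hash_ids`.
+- The current routing still lacks real hot-session prediction.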