docs: rewrite project docs in concise chinese

2026-04-24 12:41:52 +00:00
parent 5bdc0ed4f0
commit 08b13d22bc
2 changed files with 139 additions and 243 deletions


@@ -1,141 +1,98 @@
# Project Overview
# Project Overview
This repository is a minimal research prototype for evaluating whether
session-aware and KV-cache-aware prefill/decode routing can improve end-to-end
latency for agentic coding workloads on top of SGLang xPyD.
This project tests one question:
The current target environment is a single 8-GPU node running SGLang `v0.5.10`
with Qwen3-Coder-30B-A3B-Instruct. The repo vendors SGLang under
`third_party/sglang` so our xPyD/session-cache changes are maintained together
with the benchmark harness. The local setup keeps the P -> D transfer path
through SGLang disaggregation and Mooncake loopback instead of replacing it with
an in-process shortcut.
**In agentic coding workloads, can end-to-end P/D serving latency be lowered if the router is more aware of sessions and the KV cache?**
## Design
Currently based on:
The code keeps policy separate from mechanism.
- SGLang `v0.5.10`
- Qwen3-Coder-30B-A3B-Instruct
- A single node with 8 GPUs
- Mooncake loopback to emulate the P -> D transfer
- Mechanism code launches SGLang workers, sends requests, manages streaming
sessions, and records request-level metrics.
- Policy code decides which prefill worker and decode worker should receive a
request.
- Replay and benchmark code preserve trace arrival times unless explicitly
configured otherwise, so concurrency comes from the workload shape rather than
from an artificial fixed-concurrency driver.
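
The policy/mechanism split described above can be sketched as a small interface. This is a minimal illustration, not the repository's actual code: `Request`, `Placement`, `RoutingPolicy`, `RoundRobinPolicy`, and `StickyPolicy` are hypothetical names, and the sticky rule only assumes what the text states (turn2+ prefers the previous D node for the same session).

```python
from dataclasses import dataclass, field
from typing import Protocol


@dataclass(frozen=True)
class Request:
    session_id: str
    turn: int  # 1-based turn index inside the session (assumed convention)


@dataclass(frozen=True)
class Placement:
    prefill_worker: int
    decode_worker: int


class RoutingPolicy(Protocol):
    """Policy side: only decides where a request goes."""

    def place(self, request: Request) -> Placement: ...


@dataclass
class RoundRobinPolicy:
    """Hypothetical baseline: rotate over the P and D workers."""

    prefill_workers: int
    decode_workers: int
    _next: int = 0

    def place(self, request: Request) -> Placement:
        placement = Placement(
            prefill_worker=self._next % self.prefill_workers,
            decode_worker=self._next % self.decode_workers,
        )
        self._next += 1
        return placement


@dataclass
class StickyPolicy:
    """From turn 2 on, reuse the D worker that served the session before."""

    fallback: RoutingPolicy
    _last_decode: dict[str, int] = field(default_factory=dict)

    def place(self, request: Request) -> Placement:
        placement = self.fallback.place(request)
        previous = self._last_decode.get(request.session_id)
        if request.turn >= 2 and previous is not None:
            placement = Placement(placement.prefill_worker, previous)
        self._last_decode[request.session_id] = placement.decode_worker
        return placement
```

Because the mechanism side would depend only on `place`, a different policy could be swapped in without touching worker launch or request handling.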
## Design
The main comparison points are:
The code is split into two layers:
- `pd-disaggregation`: normal router-managed P/D serving.
- `kvcache-centric`: worker/router assisted session-aware routing that can keep
a decode streaming session resident and send later small appends directly to D.
- `pd-colo`: direct colocated serving baseline for experiments that do not use
the P/D router path.
- **Mechanism**: launch SGLang, send requests, manage sessions, and collect metrics.
- **Policy**: decide which P node and which D node a request goes to.
## Implemented
This lets the routing policy be changed on its own later, without mixing it into the SGLang/xPyD mechanism.
The prototype currently includes:
## Implemented
- One-node P/D launch planning and managed stack lifecycle.
- A lightweight Python PD router used for live local experiments.
- Ali trace loading, session-granularity sampling, and synthetic prompt
generation from `hash_ids`.
- Trace replay with natural pacing, request dependencies inside a session, and
request-level metrics JSONL plus summary JSON.
- Routing policies:
  - `default`: simple baseline placement.
  - `sticky`: turn2+ prefers the previous D node for the same session.
  - `kv-aware`: uses observed block overlap/session state to choose D placement.
- Live benchmark orchestration through `benchmark-live`.
- Small-append synthetic trace generation for micro-benchmarks.
- KV-cache-centric worker admission modes:
  - router shadow-state admission.
  - worker queried admission.
  - session-level D residency soft cap for worker-managed admission, so only a
    small hot set is kept as decode streaming sessions while the rest fall back
    to normal PD routing.
- P-side prefill backup bookkeeping for experiments where D evictions can retain
a lower-priority copy on P.
- Fail-fast handling for empty streaming responses and a shorter SGLang
disaggregation wait timeout to avoid treating transfer hangs as successful
long-tail responses.
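
As an illustration of the `kv-aware` idea in the list above, the sketch below picks the decode worker whose observed blocks overlap the request's `hash_ids` the most. It is a simplified assumption of how such scoring could look: `pick_decode_worker` and the `resident_blocks` map are hypothetical, and the real policy also uses session state, which is not modeled here.

```python
def pick_decode_worker(
    request_blocks: list[int],
    resident_blocks: dict[int, set[int]],
) -> int:
    """Return the decode worker id with the largest block overlap.

    request_blocks: block hash ids of the incoming request (e.g. trace hash_ids).
    resident_blocks: hypothetical router-side view, worker id -> observed block ids.
    """
    if not resident_blocks:
        raise ValueError("no decode workers available")
    wanted = set(request_blocks)
    # sorted() makes ties deterministic: max() keeps the first best worker id.
    return max(
        sorted(resident_blocks),
        key=lambda worker: len(wanted & resident_blocks[worker]),
    )
```

For example, with `resident_blocks = {0: {1, 2}, 1: {1, 2, 3, 4}}`, a request with blocks `[1, 2, 3, 9]` overlaps worker 1 on three blocks and worker 0 on two, so worker 1 is chosen.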
- Launch and shutdown of the single-node P/D stack.
- A local Python PD router.
- Ali trace loading, session-level sampling, and synthetic prompt generation.
- Replay at the trace's original arrival times, instead of forcing load through a fixed concurrency.
- Request-level metrics and summary.
- Routing policies:
  - `default`
  - `sticky`
  - `kv-aware`
- Serving mechanisms:
  - `pd-disaggregation`
  - `kvcache-centric`
  - `pd-colo`
- Micro-benchmark trace generation.
- Comparison of worker-managed and router-managed KV admission.
- A D session soft cap under worker-managed admission, so that not every session crowds into D KV.
- SGLang patch:
  - decode workers support local append-prefill in PD mode;
  - streaming session cache status is exposed;
  - idle streaming sessions can be evicted at session granularity;
  - direct append admission queries are supported.
## SGLang Maintenance
## Current Findings
SGLang is tracked directly in this repository:
On the micro-benchmark, `kvcache-centric` can beat `pd-disaggregation`.
- `chore: vendor sglang v0.5.10 snapshot` records the clean upstream baseline.
- Later `feat(sglang): ...` / `fix(sglang): ...` commits should contain only
local SGLang changes.
- Generated files such as `__pycache__` and benchmark outputs stay ignored.
The reason is simple: with few sessions, D KV has room for all of them, so turn2+ can go straight to the D session and skip part of the P/D path overhead.
The current SGLang patch adds the worker-side mechanisms needed by
KV-cache-centric experiments:
On the test with 300+ requests and 58 sessions, however, the picture is different:
- decode workers can optionally accept local append-prefill requests in PD mode;
- streaming session cache status is exposed for router/admission decisions;
- idle streaming sessions can be evicted at session granularity;
- direct append admission can check resident session state and D token pressure
before the replay path bypasses P.
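
The direct append admission check described in the last bullet can be pictured roughly as follows. This is a sketch under stated assumptions, not the patched SGLang API: `DecodeStatus`, `admit_direct_append`, and the `max_utilization` threshold are hypothetical names and values chosen for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DecodeStatus:
    """Hypothetical snapshot of what a decode worker reports."""

    resident_sessions: frozenset[str]
    used_tokens: int
    capacity_tokens: int


def admit_direct_append(
    session_id: str,
    append_tokens: int,
    output_tokens: int,
    status: DecodeStatus,
    max_utilization: float = 0.9,  # assumed pressure threshold
) -> bool:
    """Allow bypassing P only if the session is resident and D has headroom."""
    if session_id not in status.resident_sessions:
        return False  # no resident state on D: use the normal P -> D path
    projected = status.used_tokens + append_tokens + output_tokens
    return projected <= max_utilization * status.capacity_tokens
```

A request that fails this check would simply fall back to normal PD routing.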
- D KV cannot hold the working set of all sessions.
- Naive worker-managed admission frequently evicts and reseeds whole sessions.
- Reseed and transfer pressure cancels out the gains from KV reuse.
- Aggressive P-backup increases tail-latency risk.
## Current Findings
After the current soft-cap optimization:
The micro-benchmark can make KV-cache-centric routing look better than
`pd-disaggregation` because the active sessions fit in D KV cache. Later turns
can then bypass P and use `kvcache-direct-to-d-session`, reducing TTFT.
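- worker-managed is more stable than the old version;
- TTFT drops noticeably;
- the 600 s transfer hang that used to be counted as a successful response no longer occurs;
- but on the sampled Ali trace, `pd-disaggregation` is still slightly better.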
On the larger 316-request, variable-turn workload, there are 58 sessions and the
working set is larger than the useful D residency budget. A naive worker-managed
KV-cache-centric policy repeatedly evicts and reseeds whole sessions, adding
TTFT and transfer pressure. Aggressive P-backup also increases tail risk when it
keeps too much state around.
Current assessment:
The current soft-cap optimization improves worker-managed KV-cache-centric
relative to the older worker-managed path, but `pd-disaggregation` is still
slightly better on the sampled Ali workload because most requests fall back to
normal PD routing while a few retained D sessions still consume token budget.
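
The evict/reseed thrash and the soft-cap fallback can be illustrated with a toy replay. This is a deliberately simplified model, not the benchmark harness: `replay_naive_lru` and `replay_soft_cap` are hypothetical helpers, each turn is reduced to a session id, and idle eviction under the soft cap is not modeled.

```python
from collections import OrderedDict


def replay_naive_lru(turns: list[str], capacity: int) -> tuple[int, int]:
    """Always admit; evict the least recently used session when D is full.

    Returns (direct_hits, seeds), where every seed moves a whole session to D.
    """
    resident: OrderedDict[str, None] = OrderedDict()
    direct = seeds = 0
    for session_id in turns:
        if session_id in resident:
            resident.move_to_end(session_id)
            direct += 1
            continue
        if len(resident) >= capacity:
            resident.popitem(last=False)  # evict the LRU session
        resident[session_id] = None
        seeds += 1
    return direct, seeds


def replay_soft_cap(turns: list[str], cap: int) -> tuple[int, int, int]:
    """Admit only while below the cap; other sessions use normal PD routing.

    Returns (direct_hits, seeds, fallbacks).
    """
    resident: set[str] = set()
    direct = seeds = fallbacks = 0
    for session_id in turns:
        if session_id in resident:
            direct += 1
        elif len(resident) < cap:
            resident.add(session_id)
            seeds += 1
        else:
            fallbacks += 1
    return direct, seeds, fallbacks
```

With three sessions taking turns (`["a", "b", "c"] * 3`) and room for two, the naive variant never gets a direct hit and seeds on every turn, while the capped variant seeds twice and routes the third session through normal PD. The toy says nothing about absolute latency; it only shows why a working set larger than the residency budget turns reuse into reseeding.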
**KV-cache-centric should keep only the truly hot sessions resident. Not every session deserves space in D KV.**
## Useful Commands
The most valuable next steps are:
Run a live benchmark with natural arrival timing:
- inter-turn-gap-aware admission;
- session aging;
- more accurate prediction of which sessions will reuse their KV soon.
```bash
uv run agentic-pd-hybrid benchmark-live \
--trace outputs/micro-serveable-varturn-30k-1k-256-20260424T0756Z.jsonl \
--output-root outputs/live-serveable-varturn-30k-1k-256-hotcap \
--mechanism kvcache-centric \
--policy kv-aware \
--kvcache-admission-mode worker \
--prefill-workers 1 \
--decode-workers 1 \
--prefill-gpu-ids 0 \
--decode-gpu-ids 1 \
--transfer-backend mooncake \
--target-duration-s 2000 \
--session-sample-rate 1.0 \
--min-turns 2 \
--time-scale 1 \
--concurrency-limit 1000
```
## SGLang Maintenance
Generate a 30k input, 1k append, 256 output small-append trace:
`third_party/sglang` is now part of the main repository.
```bash
uv run agentic-pd-hybrid make-small-append-trace \
--output outputs/smoke-hotcap-30k-1k-256.jsonl \
--session-count 4 \
--turns-per-session 3 \
--initial-input-length 30000 \
--append-input-length 1000 \
--output-length 256
```
Commit history structure:
## Known Limits
- `chore: vendor sglang v0.5.10 snapshot`: the clean upstream baseline.
- `feat(sglang): ...` / `fix(sglang): ...`: our SGLang patches.
- This is not production routing code.
- The current evaluation is single-node and constrained by `prefill + decode <=
8` GPUs.
- Trace prompts are synthetic because the Ali trace used here contains lengths
and `hash_ids`, not raw prompts.
- KV-cache-centric admission still needs better hot-session prediction. The next
useful step is inter-turn-gap-aware admission and aging, so D cache is held
only for sessions likely to reuse it soon.
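
One possible shape for the inter-turn-gap-aware admission named in the last item is sketched below. It is purely hypothetical, since the document only proposes this step: `GapTracker`, the EWMA smoothing factor, and the `max_gap_s` threshold are all assumptions.

```python
from dataclasses import dataclass, field


@dataclass
class GapTracker:
    """Tracks a smoothed inter-turn gap per session (hypothetical helper)."""

    alpha: float = 0.5  # EWMA weight of the newest gap (assumed value)
    _last_arrival: dict[str, float] = field(default_factory=dict)
    _ewma_gap: dict[str, float] = field(default_factory=dict)

    def observe(self, session_id: str, arrival_s: float) -> None:
        """Record a turn arrival and update the session's smoothed gap."""
        last = self._last_arrival.get(session_id)
        if last is not None:
            gap = arrival_s - last
            previous = self._ewma_gap.get(session_id)
            self._ewma_gap[session_id] = (
                gap if previous is None
                else self.alpha * gap + (1 - self.alpha) * previous
            )
        self._last_arrival[session_id] = arrival_s

    def should_keep_resident(self, session_id: str, max_gap_s: float) -> bool:
        """Hold D cache only for sessions expected to come back soon."""
        gap = self._ewma_gap.get(session_id)
        return gap is not None and gap <= max_gap_s
```

A session seen only once has no gap estimate yet, so this sketch does not keep it resident; whether that is the right default is an open design choice.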
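When changing SGLang later:
- only touch the relevant files under `third_party/sglang`;
- commit the change separately;
- include `(sglang)` in the commit message;
- keep benchmark outputs, pyc files, and logs out of the commit.
## Known Limits
- This is an experimental prototype, not a production router.
- Validation so far is mainly on a single node with 8 GPUs.
- The Ali trace has no raw prompts, so prompts can only be synthesized from `hash_ids`.
- The current routing still lacks real hot-session prediction.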