
Gahow Wang 9a81c993ab docs(onboarding): link new audit / design / eval docs from the root README + AGENTS.md

Without this, the five docs added on this branch
(AUDIT_AND_ROADMAP, INDEX, BLOCK_LEVEL_EVICTION_DESIGN,
D_TO_P_SYNC_CONTRACT, EVALUATION_PROTOCOL) are reachable
only by listing docs/. This wires them into the two entry
points an agent or collaborator hits first.

README.md changes:
  - top-of-page pointer to INDEX_ZH for new collaborators
  - pointer to AUDIT_AND_ROADMAP_ZH for project state
  - "Unit tests (no GPU)" section: how to run pytest
  - "Evaluation scripts" section: invocations for the two new
    analysis scripts

AGENTS.md changes:
  - top section "For new collaborators / agents" before
    the existing "Environment" block, pointing at INDEX_ZH,
    AUDIT_AND_ROADMAP_ZH, the two ready-to-pick-up design
    docs, and EVALUATION_PROTOCOL_ZH
  - pytest invocation under Environment

2026-05-12 23:58:56 +08:00

3.4 KiB


Agentic PD Hybrid

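This project builds a minimal experimental framework on top of SGLang xPyD to determine:

whether session-aware / KV-cache-aware P/D routing for agentic coding workloads can reduce end-to-end latency.

For a more complete (but still concise) description, see docs/PROJECT_OVERVIEW.md.

New collaborators: start with docs/INDEX_ZH.md and pick the 3 must-read documents listed for your role ("who am I"). For the project's current progress, weak points, and roadmap overview, see docs/AUDIT_AND_ROADMAP_ZH.md.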

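What is implemented so far

Launches a single-node SGLang P/D stack.
Replays the Ali coding agent trace and records request-level metrics.
Supports the default, sticky, and kv-aware routing policies.
Supports comparing the pd-disaggregation, kvcache-centric, and pd-colo mechanisms.
Supports micro-benchmark traces with small appends and multi-turn sessions.
Maintains local patches on top of SGLang v0.5.10 in third_party/sglang.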
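To make the difference between the sticky and kv-aware policies concrete, here is a minimal sketch. It is not the project's router: the `Worker` class, its per-session `cached_tokens` map, and breaking ties by lowest load are assumptions for illustration only.

```python
import hashlib
from dataclasses import dataclass, field


@dataclass
class Worker:
    # Hypothetical worker view: tokens already cached per session, plus in-flight requests.
    name: str
    cached_tokens: dict = field(default_factory=dict)
    load: int = 0


def sticky_route(session_id: str, workers: list) -> Worker:
    # Sticky: a stable hash pins every turn of a session to the same worker.
    digest = hashlib.sha256(session_id.encode()).hexdigest()
    return workers[int(digest, 16) % len(workers)]


def kv_aware_route(session_id: str, workers: list) -> Worker:
    # KV-aware (illustrative): prefer the worker holding the most cached tokens
    # for this session; among equals, pick the least-loaded worker.
    return max(workers, key=lambda w: (w.cached_tokens.get(session_id, 0), -w.load))
```

Sticky routing only guarantees that a session keeps landing on the same worker; the kv-aware variant sketched here also lets a new session go to whichever worker is idle, at the cost of needing a view of each worker's cache.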

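Environment

Use uv for everything:

uv sync

Default model path:

~/models/Qwen/Qwen3-Coder-30B-A3B-Instruct

The main test environment is currently a single node with 8 GPUs, so the constraint is prefill + decode <= 8 (GPUs used by prefill and decode workers combined).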
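A pre-flight check of this GPU budget can be sketched as below. `check_gpu_split` is a hypothetical helper, not part of the CLI; it takes the id lists you would pass as --prefill-gpu-ids and --decode-gpu-ids. Rejecting a GPU that appears in both lists is an assumption that fits disaggregated runs and may not apply to every mechanism.

```python
def check_gpu_split(prefill_gpu_ids, decode_gpu_ids, num_gpus=8):
    """Hypothetical check of the single-node 'prefill + decode <= 8' budget."""
    ids = list(prefill_gpu_ids) + list(decode_gpu_ids)
    if len(ids) > num_gpus:
        raise ValueError(f"prefill + decode uses {len(ids)} GPUs, budget is {num_gpus}")
    out_of_range = [i for i in ids if not 0 <= i < num_gpus]
    if out_of_range:
        raise ValueError(f"GPU ids out of range: {out_of_range}")
    # Assumption: in disaggregated runs a GPU serves either prefill or decode, not both.
    shared = set(prefill_gpu_ids) & set(decode_gpu_ids)
    if shared:
        raise ValueError(f"GPU ids used by both prefill and decode: {sorted(shared)}")
    return len(prefill_gpu_ids), len(decode_gpu_ids)
```

For example, `check_gpu_split([0], [1])` returns `(1, 1)`, matching a 1P1D layout.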

Common commands

Generate a small-append trace:

uv run agentic-pd-hybrid make-small-append-trace \
  --output outputs/smoke-hotcap-30k-1k-256.jsonl \
  --session-count 4 \
  --turns-per-session 3 \
  --initial-input-length 30000 \
  --append-input-length 1000 \
  --output-length 256
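The shape of such a trace can be sketched as follows. This is not the CLI's real output schema: the field names (borrowed from the bucketing keys used by the evaluation scripts) and the rule that each turn's prompt is the previous prompt plus its output plus the new append are assumptions.

```python
def small_append_trace(session_count, turns_per_session, initial_input_length,
                       append_input_length, output_length):
    """Hypothetical sketch of a small-append, multi-turn trace (not the real schema)."""
    records = []
    for s in range(session_count):
        context = 0  # tokens this session has already sent or received (the shared prefix)
        for turn in range(turns_per_session):
            append = initial_input_length if turn == 0 else append_input_length
            input_length = context + append
            records.append({
                "session_id": f"session-{s}",
                "turn_id": turn,
                "input_length": input_length,
                "append_tokens": append,
                "overlap_ratio": context / input_length,  # share of the prompt that is a reused prefix
                "output_length": output_length,
            })
            # Assumption: the previous prompt and its output become the next turn's prefix.
            context = input_length + output_length
    return records
```

Under these assumptions, the parameters in the command above give 12 requests whose per-session prompts grow 30000, 31256, 32512 tokens, with about 97% of each later prompt being reusable prefix; that reuse is what a KV-cache-aware policy tries to exploit.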

Run a live benchmark:

uv run agentic-pd-hybrid benchmark-live \
  --trace outputs/micro-serveable-varturn-30k-1k-256-20260424T0756Z.jsonl \
  --output-root outputs/live-serveable-varturn-30k-1k-256-hotcap \
  --mechanism kvcache-centric \
  --policy kv-aware \
  --kvcache-admission-mode worker \
  --prefill-workers 1 \
  --decode-workers 1 \
  --prefill-gpu-ids 0 \
  --decode-gpu-ids 1 \
  --transfer-backend mooncake \
  --target-duration-s 2000 \
  --session-sample-rate 1.0 \
  --min-turns 2 \
  --time-scale 1 \
  --concurrency-limit 1000

Replay only and write metrics:

uv run agentic-pd-hybrid replay \
  --trace path/to/trace.jsonl \
  --policy kv-aware \
  --mechanism pd-disaggregation \
  --router-url http://127.0.0.1:8000 \
  --output outputs/replay.jsonl

Outputs

Each replay/benchmark run writes:

Request metrics: request-metrics.jsonl
Summary: request-metrics.jsonl.summary.json

Key fields to check:

E2E latency
TTFT / TPOT (time to first token / time per output token)
execution mode
cached tokens
KV transfer blocks
error
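A summary over request-metrics.jsonl can be sketched as below. The field names (`e2e_s`, `ttft_s`, `output_tokens`, `input_tokens`, `cached_tokens`, `error`) are assumptions and may not match the real file; TPOT uses the common definition (E2E minus TTFT, divided by output tokens after the first), and percentiles use the nearest-rank method.

```python
import json
import math
import statistics


def load_records(path):
    with open(path) as f:
        return [json.loads(line) for line in f if line.strip()]


def tpot_s(rec):
    # Time per output token: decode time spread over the tokens after the first one.
    n = rec["output_tokens"]
    return (rec["e2e_s"] - rec["ttft_s"]) / (n - 1) if n > 1 else math.nan


def percentile(values, q):
    # Nearest-rank percentile, q in 0..100.
    xs = sorted(values)
    return xs[max(0, math.ceil(q / 100 * len(xs)) - 1)]


def summarize(records):
    ok = [r for r in records if not r.get("error")]  # failed requests are counted, not timed
    if not ok:
        raise ValueError("no successful requests to summarize")
    tpots = [t for t in map(tpot_s, ok) if not math.isnan(t)]
    return {
        "requests": len(records),
        "errors": len(records) - len(ok),
        "e2e_p50_s": percentile([r["e2e_s"] for r in ok], 50),
        "e2e_p90_s": percentile([r["e2e_s"] for r in ok], 90),
        "ttft_p50_s": percentile([r["ttft_s"] for r in ok], 50),
        "tpot_mean_s": statistics.fmean(tpots) if tpots else math.nan,
        "cached_token_ratio": sum(r["cached_tokens"] for r in ok)
        / sum(r["input_tokens"] for r in ok),
    }
```

Keeping errors out of the latency statistics but reporting their count avoids a run looking faster simply because its slowest requests failed.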

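Maintenance conventions

Commit prefixes for project code: feat: / fix: / docs:
Commit prefixes for SGLang changes: feat(sglang): ... / fix(sglang): ...
The third_party/sglang baseline is a clean SGLang v0.5.10 snapshot.
Do not commit outputs/, logs, __pycache__, or virtual environments.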
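These conventions can be checked mechanically. Below is a minimal sketch of a hypothetical pre-commit helper that is not part of the repository. It accepts an optional scope such as `docs(onboarding):`, requires the `(sglang)` scope when a change touches third_party/sglang/, and flags files under outputs/, `*.log` files, `__pycache__`, and `.venv` (uv's default virtual-environment directory). Treating "logs" as `*.log` files is an assumption.

```python
import re

# Hypothetical rules encoding the conventions above.
SUBJECT = re.compile(r"^(feat|fix|docs)(\((?P<scope>[\w-]+)\))?: \S")
FORBIDDEN_PREFIXES = ("outputs/",)
FORBIDDEN_SUFFIXES = (".log",)
FORBIDDEN_PARTS = ("__pycache__", ".venv")


def check_commit(subject, paths):
    """Return a list of convention violations (empty when the commit looks fine)."""
    problems = []
    m = SUBJECT.match(subject)
    if not m:
        problems.append("subject must start with feat:, fix: or docs:")
    touches_sglang = any(p.startswith("third_party/sglang/") for p in paths)
    if m and touches_sglang and m.group("scope") != "sglang":
        problems.append("changes under third_party/sglang need the (sglang) scope")
    for p in paths:
        parts = p.split("/")
        if (p.startswith(FORBIDDEN_PREFIXES) or p.endswith(FORBIDDEN_SUFFIXES)
                or any(x in parts for x in FORBIDDEN_PARTS)):
            problems.append(f"do not commit {p}")
    return problems
```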

Unit tests (no GPU)

The algorithm layer (policies, Algorithm 1 / Theorem 1) has pure-Python unit tests; running them requires neither a GPU nor SGLang:

uv sync --group test
uv run pytest

See tests/README.md for details.

Evaluation scripts

After collecting runs according to docs/EVALUATION_PROTOCOL_ZH.md:

# M3: bucket by turn_id / input_length / overlap_ratio / append_tokens
uv run scripts/analysis/stratified.py outputs/<run>/request-metrics.jsonl

# M2: paired-on-same-trial bootstrap 95% CI
uv run scripts/analysis/paired_compare.py \
    --baseline outputs/run-dp/request-metrics.jsonl \
    --candidate outputs/run-kvc/request-metrics.jsonl
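For readers who want to see what these analyses compute, here is a minimal sketch. It is not the code in scripts/analysis/: the `request_id` and `e2e_s` field names, the percentile bootstrap, and the bucketing callback are assumptions. `paired_bootstrap_ci` pairs requests by id across a baseline and a candidate run of the same trace, resamples the per-request differences (candidate minus baseline, so negative means the candidate is faster), and returns the mean difference with a 95% percentile interval; `stratify` averages a metric per bucket.

```python
import random
import statistics
from collections import defaultdict


def stratify(records, bucket_of, metric="e2e_s"):
    """Mean of `metric` per bucket; bucket_of maps a record to its bucket key."""
    buckets = defaultdict(list)
    for r in records:
        buckets[bucket_of(r)].append(r[metric])
    return {k: statistics.mean(v) for k, v in sorted(buckets.items())}


def paired_bootstrap_ci(baseline, candidate, metric="e2e_s", n_boot=2000, alpha=0.05, seed=0):
    """Mean paired difference (candidate - baseline) and its percentile bootstrap CI."""
    base = {r["request_id"]: r[metric] for r in baseline}
    diffs = [r[metric] - base[r["request_id"]] for r in candidate if r["request_id"] in base]
    if not diffs:
        raise ValueError("no requests present in both runs")
    rng = random.Random(seed)
    # Resample whole pairs with replacement and record the mean difference each time.
    means = sorted(statistics.fmean(rng.choices(diffs, k=len(diffs))) for _ in range(n_boot))
    lo_idx = int(n_boot * alpha / 2)
    return statistics.fmean(diffs), (means[lo_idx], means[n_boot - 1 - lo_idx])
```

For example, `stratify(records, lambda r: r["turn_id"])` buckets by turn, and `lambda r: round(r["overlap_ratio"], 1)` bins a continuous key. Pairing on the same trial removes per-request variation (prompt length, turn position) that an unpaired comparison of two means would fold into the interval.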
