Gahow Wang d93228e156 docs(sglang): patch surface inventory + retire-after-refactor list
Resolves AUDIT_AND_ROADMAP §S6: the 785 lines of vendored
SGLang patch are a known reviewer trust risk because the
prototype touches scheduler.py / schedule_batch.py /
session_aware_cache.py / disaggregation hot paths. Without
classification readers cannot tell core mechanism from
temporary scaffold.

Classifies each of the 10 patched files into:
  MUST-HAVE         — Algorithm 1/2/3, streaming session
                       lifecycle, admit RPC. ~450 lines.
                       Long-term retention.
  WORKAROUND        — release_session token-free,
                       maybe_trim_decode_session_cache,
                       streaming-session extend_input_len
                       correction (incl. the E3 landmine
                       hotfix from commit 986f351),
                       DecodePreallocQueue trim trigger.
                       ~150 lines. To DELETE entirely
                       after block-level eviction refactor
                       (BLOCK_LEVEL_EVICTION_DESIGN §3.7).
  EXPERIMENTAL      — backpressure pause hint
                       (_compute_backpressure_pause_hint).
                       ~60 lines. Signal not closed-loop
                       per REAL_ALI §4.3; retain as hook
                       or retire in 1 month.
  INSTRUMENTATION   — _compute_pool_breakdown_for_diagnostics.
                       ~50 lines. Keep behind a flag.
  MINOR             — ~3 lines. Ignore.

The §2 summary gives reviewers a one-glance picture of
what's core vs. scaffold. Maintenance convention in §3
mandates classifying every new (sglang) patch at commit
time.

§4 wires the classification into the roadmap: clearing
the WORKAROUND bucket is the objective completion marker
for block-level eviction refactor.

No code change.
2026-05-13 00:42:22 +08:00
2026-04-24 12:17:40 +00:00

Agentic PD Hybrid

This project is a minimal experimental framework built on SGLang xPyD to answer one question:

Can session-aware / KV-cache-aware P/D routing reduce end-to-end latency for agentic coding workloads?

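For a fuller but still concise description, see docs/PROJECT_OVERVIEW.md.

New collaborators: start with docs/INDEX_ZH.md and pick the 3 must-read documents listed under "Who am I". For an overview of current progress, weak points, and the roadmap, see docs/AUDIT_AND_ROADMAP_ZH.md.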

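What it currently does

  • Launches a single-node SGLang P/D stack.
  • Replays the Ali coding agent trace and records request-level metrics.
  • Supports the default, sticky, and kv-aware routing policies.
  • Supports comparing the pd-disaggregation, kvcache-centric, and pd-colo mechanisms.
  • Supports micro-benchmark traces with small appends and multi-turn sessions.
  • Maintains a local patch based on SGLang v0.5.10, kept in third_party/sglang.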

Environment

Use uv for everything:

uv sync

Default model path:

~/models/Qwen/Qwen3-Coder-30B-A3B-Instruct

The main test environment is currently a single node with 8 GPUs; the constraint is prefill + decode <= 8.
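
The constraint above can be checked before launching a stack. The following is a minimal sketch, not code from this repository: the function name validate_gpu_split and its parameters are hypothetical, and it assumes the constraint counts one GPU id per worker, as in the --prefill-gpu-ids / --decode-gpu-ids flags.

```python
def validate_gpu_split(prefill_gpu_ids, decode_gpu_ids, total_gpus=8):
    """Check a prefill/decode GPU assignment against the single-node budget.

    Hypothetical helper: assumes one GPU id per worker and that the budget
    is expressed as prefill + decode <= total_gpus.
    """
    prefill = list(prefill_gpu_ids)
    decode = list(decode_gpu_ids)
    used = len(prefill) + len(decode)
    if used > total_gpus:
        raise ValueError(
            f"prefill ({len(prefill)}) + decode ({len(decode)}) = {used} "
            f"exceeds the {total_gpus}-GPU budget"
        )
    # Every id must name a GPU that exists on the node.
    bad = [g for g in prefill + decode if not 0 <= g < total_gpus]
    if bad:
        raise ValueError(f"GPU ids out of range 0..{total_gpus - 1}: {bad}")
    return used


# Example: 1 prefill worker on GPU 0 and 1 decode worker on GPU 1.
workers = validate_gpu_split([0], [1])
```

The sketch deliberately does not reject a GPU id shared between prefill and decode, since whether that is valid may depend on the mechanism being tested.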

Common commands

Generate a small-append trace

uv run agentic-pd-hybrid make-small-append-trace \
  --output outputs/smoke-hotcap-30k-1k-256.jsonl \
  --session-count 4 \
  --turns-per-session 3 \
  --initial-input-length 30000 \
  --append-input-length 1000 \
  --output-length 256

Run a live benchmark

uv run agentic-pd-hybrid benchmark-live \
  --trace outputs/micro-serveable-varturn-30k-1k-256-20260424T0756Z.jsonl \
  --output-root outputs/live-serveable-varturn-30k-1k-256-hotcap \
  --mechanism kvcache-centric \
  --policy kv-aware \
  --kvcache-admission-mode worker \
  --prefill-workers 1 \
  --decode-workers 1 \
  --prefill-gpu-ids 0 \
  --decode-gpu-ids 1 \
  --transfer-backend mooncake \
  --target-duration-s 2000 \
  --session-sample-rate 1.0 \
  --min-turns 2 \
  --time-scale 1 \
  --concurrency-limit 1000

Replay only and write metrics

uv run agentic-pd-hybrid replay \
  --trace path/to/trace.jsonl \
  --policy kv-aware \
  --mechanism pd-disaggregation \
  --router-url http://127.0.0.1:8000 \
  --output outputs/replay.jsonl

Output

Each replay/benchmark run writes:

  • Request metrics: request-metrics.jsonl
  • Summary: request-metrics.jsonl.summary.json

Key fields to look at:

  • E2E latency
  • TTFT / TPOT
  • execution mode
  • cached tokens
  • KV transfer blocks
  • error
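
For a quick look at the fields above without opening the summary file, a per-request JSONL file can be reduced with a few lines of Python. This is a rough sketch only: the field names e2e_latency_s and error are assumptions for illustration and may not match the actual schema of request-metrics.jsonl.

```python
import json
import math


def load_records(path):
    """Read one JSON object per non-empty line."""
    with open(path, encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]


def percentile(values, p):
    """Nearest-rank percentile of a non-empty list, with p in (0, 100]."""
    ordered = sorted(values)
    rank = max(1, math.ceil(p / 100 * len(ordered)))
    return ordered[rank - 1]


def summarize(records, field="e2e_latency_s", error_field="error"):
    """Summarize one numeric field over requests that did not fail.

    The field names are hypothetical placeholders.
    """
    ok = [
        r[field]
        for r in records
        if not r.get(error_field) and r.get(field) is not None
    ]
    return {
        "count": len(records),
        "errors": sum(1 for r in records if r.get(error_field)),
        "mean": sum(ok) / len(ok) if ok else None,
        "p50": percentile(ok, 50) if ok else None,
        "p99": percentile(ok, 99) if ok else None,
    }


# Example with in-memory records instead of a file.
demo = [
    {"e2e_latency_s": 1.0},
    {"e2e_latency_s": 3.0},
    {"e2e_latency_s": 2.0, "error": "timeout"},
]
stats = summarize(demo)
```

Failed requests are counted but excluded from the latency statistics, so a run with many errors does not look artificially fast or slow.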

Maintenance conventions

  • Project code changes: feat: / fix: / docs:
  • SGLang changes: feat(sglang): ... / fix(sglang): ...
  • The baseline for third_party/sglang is a clean SGLang v0.5.10 snapshot.
  • Do not commit outputs/, logs, __pycache__, or virtual environments.

Unit tests (no GPU)

The algorithm layer (policies, Algorithm 1 / Theorem 1) has pure-Python unit tests; running them requires neither a GPU nor SGLang:

uv sync --group test
uv run pytest

See tests/README.md for details.

Evaluation scripts

After producing data per docs/EVALUATION_PROTOCOL_ZH.md:

# M3: bucket by turn_id / input_length / overlap_ratio / append_tokens
scripts/analysis/stratified.py outputs/<run>/request-metrics.jsonl

# M2: paired-on-same-trial bootstrap 95% CI
scripts/analysis/paired_compare.py \
    --baseline outputs/run-dp/request-metrics.jsonl \
    --candidate outputs/run-kvc/request-metrics.jsonl
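
To make the M2 idea concrete, here is a minimal sketch of a paired bootstrap 95% CI on the mean per-request difference. It is not the implementation in scripts/analysis/paired_compare.py: the function name and parameters are hypothetical, and it assumes the two latency lists are already aligned so that index i refers to the same request in both runs.

```python
import random


def paired_bootstrap_ci(baseline, candidate, n_resamples=10000, alpha=0.05, seed=0):
    """Percentile-bootstrap CI for mean(candidate - baseline) on paired samples.

    Hypothetical helper: assumes baseline[i] and candidate[i] measure the
    same request under the two configurations.
    """
    if len(baseline) != len(candidate) or not baseline:
        raise ValueError("need two non-empty, equally sized paired samples")
    diffs = [c - b for b, c in zip(baseline, candidate)]
    n = len(diffs)
    rng = random.Random(seed)  # fixed seed keeps the interval reproducible
    means = []
    for _ in range(n_resamples):
        # Resample whole pairs (their differences) with replacement.
        sample = [diffs[rng.randrange(n)] for _ in range(n)]
        means.append(sum(sample) / n)
    means.sort()
    lo = means[int((alpha / 2) * n_resamples)]
    hi = means[min(n_resamples - 1, int((1 - alpha / 2) * n_resamples))]
    return sum(diffs) / n, lo, hi


# Example: candidate is faster on every paired request.
mean_diff, ci_lo, ci_hi = paired_bootstrap_ci(
    [10.0, 12.0, 11.0, 13.0], [8.0, 9.0, 9.0, 10.0], n_resamples=2000
)
```

Resampling the differences rather than each run separately keeps the pairing intact, so per-request variation that affects both runs equally cancels out. An interval that lies entirely below zero suggests the candidate reduces latency on this trace.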