Claude Code Agent 86412bb174 feat(sglang): D→P snapshot link integration — controller + RPC handlers

Phase 2 of the D→P sync feature (Phase 1 in dc4867c verified the
underlying RDMA link in isolation). This commit wires that link into
each SGLang worker's scheduler so D and P can exchange session KV
without going through the PD prefill pipeline.

New module:
  third_party/sglang/python/sglang/srt/disaggregation/snapshot/
    controller.py — SnapshotLinkController owns one mooncake transfer
                    engine per worker, pre-registers all kv_pool layer
                    buffers, and exposes prepare_receive() and
                    push_session_kv() APIs. Receive bookkeeping via
                    a session_id → SnapshotIngestRecord side-table.

Three RPC types added to io_struct.py and full plumbing wired through:
  SnapshotPrepareReceiveReqInput/Output   P-side alloc + return layout
  SnapshotDumpReqInput/Output             D-side read kv_pool + RDMA push
  SnapshotFinalizeIngestReqInput/Output   P-side radix tree insert

Files touched:
  managers/io_struct.py                   3 new ReqInput/ReqOutput pairs
  managers/tokenizer_communicator_mixin.py  3 communicators, 3 awaitables
  managers/scheduler.py                   init controller + 3 handlers
  entrypoints/http_server.py              3 HTTP endpoints under /_snapshot

Activation: set SGLANG_SNAPSHOT_LINK_ENABLE=1 (and
SGLANG_SNAPSHOT_LINK_HOST / _PORT / _IB_DEVICE) per worker. Controller
init is opt-in and defaults off, so production PD pipeline is
untouched.

Subsequent work (Phase 3): agentic-pd-hybrid orchestration in
_invoke_kvcache_seeded_router to call prepare_receive on P, dump on
D-old, finalize_ingest on P, then trigger the existing P→D' transfer
which will now hit P's radix cache (skipping re-prefill).
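The Phase 3 call order described above can be sketched as follows. This is only an illustration of the sequencing, not the orchestration code itself: `SnapshotWorker`, `RecordingWorker`, and `sync_session_d_to_p` are hypothetical names, the layout dictionary is a placeholder, and the real requests go through the `/_snapshot` HTTP endpoints, whose exact paths and payloads are not shown here.

```python
from dataclasses import dataclass, field
from typing import Protocol


class SnapshotWorker(Protocol):
    """Hypothetical client interface for the three snapshot RPCs."""

    def prepare_receive(self, session_id: str) -> dict: ...
    def dump(self, session_id: str, layout: dict) -> None: ...
    def finalize_ingest(self, session_id: str) -> None: ...


@dataclass
class RecordingWorker:
    """In-memory stand-in that only records which calls were made."""

    name: str
    log: list = field(default_factory=list)

    def prepare_receive(self, session_id: str) -> dict:
        self.log.append((self.name, "prepare_receive", session_id))
        # Placeholder: the real output describes the P-side receive layout.
        return {"session_id": session_id}

    def dump(self, session_id: str, layout: dict) -> None:
        self.log.append((self.name, "dump", session_id))

    def finalize_ingest(self, session_id: str) -> None:
        self.log.append((self.name, "finalize_ingest", session_id))


def sync_session_d_to_p(p: SnapshotWorker, d_old: SnapshotWorker, session_id: str) -> None:
    """Run the three steps in the order the commit message lists them."""
    layout = p.prepare_receive(session_id)   # P allocates and returns its layout
    d_old.dump(session_id, layout)           # D-old reads kv_pool and pushes over RDMA
    p.finalize_ingest(session_id)            # P inserts the received KV into its radix tree


calls = []
sync_session_d_to_p(RecordingWorker("P", calls), RecordingWorker("D-old", calls), "s1")
for worker, op, _ in calls:
    print(worker, op)
```

After `finalize_ingest` returns, the existing P→D' transfer would be triggered as a separate step; that part is outside this sketch.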

2026-05-13 08:12:04 +08:00

docs

feat(snapshot): D→P RDMA Phase 1b — GPU pointer path verified

2026-05-13 00:59:43 +08:00

outputs

docs: v4 final results, error analysis, and updated journey

2026-04-28 23:34:01 +08:00

scripts

feat(snapshot): D→P RDMA Phase 1b — GPU pointer path verified

2026-05-13 00:59:43 +08:00

src/agentic_pd_hybrid

feat(snapshot): D→P RDMA link Phase 1 — minimal byte transport

2026-05-13 00:55:55 +08:00

third_party/sglang

feat(sglang): D→P snapshot link integration — controller + RPC handlers

2026-05-13 08:12:04 +08:00

.gitignore

chore: vendor sglang v0.5.10 snapshot

2026-04-24 12:29:36 +00:00

.python-version

chore: initialize repo hygiene

2026-04-24 12:17:40 +00:00

AGENTS.md

docs: document project design and status

2026-04-24 12:17:55 +00:00

pyproject.toml

feat(env): install vendored SGLang via uv path source

2026-05-12 00:09:50 +08:00

README.md

docs: rewrite project docs in concise chinese

2026-04-24 12:41:52 +00:00

uv.lock

feat(env): install vendored SGLang via uv path source

2026-05-12 00:09:50 +08:00

README.md

Agentic PD Hybrid

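This project is a minimal experimental framework built on SGLang xPyD. It is used to determine:

whether session-aware / KV-cache-aware P/D routing for agentic coding workloads can reduce end-to-end latency.

A fuller but still concise description is in docs/PROJECT_OVERVIEW.md.

What it currently does

Launches a single-node SGLang P/D stack.
Replays the Ali coding agent trace and records request-level metrics.
Supports the default, sticky, and kv-aware routing policies.
Supports comparing the pd-disaggregation, kvcache-centric, and pd-colo mechanisms.
Supports micro-benchmark traces with small appends and multi-turn sessions.
Maintains a local patch set based on SGLang v0.5.10 under third_party/sglang.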

Environment

Use uv for everything:

uv sync

Default model path:

~/models/Qwen/Qwen3-Coder-30B-A3B-Instruct

The main test environment is currently a single node with 8 GPUs, subject to the constraint prefill + decode <= 8.
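The constraint above can be checked before launching a run. The following is a minimal sketch that assumes each worker occupies one GPU, as in the benchmark command in this README; `check_worker_budget` is a hypothetical helper, not part of the agentic-pd-hybrid CLI.

```python
def check_worker_budget(prefill_workers: int, decode_workers: int, total_gpus: int = 8) -> None:
    """Raise ValueError if the worker counts violate prefill + decode <= total_gpus."""
    if prefill_workers < 1 or decode_workers < 1:
        raise ValueError("need at least one prefill and one decode worker")
    used = prefill_workers + decode_workers
    if used > total_gpus:
        raise ValueError(
            f"prefill + decode = {used} exceeds the {total_gpus} available GPUs"
        )


check_worker_budget(prefill_workers=1, decode_workers=1)   # fits
check_worker_budget(prefill_workers=3, decode_workers=5)   # fits exactly
```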

Common commands

Generate a small-append trace:

uv run agentic-pd-hybrid make-small-append-trace \
  --output outputs/smoke-hotcap-30k-1k-256.jsonl \
  --session-count 4 \
  --turns-per-session 3 \
  --initial-input-length 30000 \
  --append-input-length 1000 \
  --output-length 256

Run a live benchmark:

uv run agentic-pd-hybrid benchmark-live \
  --trace outputs/micro-serveable-varturn-30k-1k-256-20260424T0756Z.jsonl \
  --output-root outputs/live-serveable-varturn-30k-1k-256-hotcap \
  --mechanism kvcache-centric \
  --policy kv-aware \
  --kvcache-admission-mode worker \
  --prefill-workers 1 \
  --decode-workers 1 \
  --prefill-gpu-ids 0 \
  --decode-gpu-ids 1 \
  --transfer-backend mooncake \
  --target-duration-s 2000 \
  --session-sample-rate 1.0 \
  --min-turns 2 \
  --time-scale 1 \
  --concurrency-limit 1000

Replay only and write metrics:

uv run agentic-pd-hybrid replay \
  --trace path/to/trace.jsonl \
  --policy kv-aware \
  --mechanism pd-disaggregation \
  --router-url http://127.0.0.1:8000 \
  --output outputs/replay.jsonl

Output

Each replay/benchmark run writes:

Request metrics: request-metrics.jsonl
Summary: request-metrics.jsonl.summary.json

Fields to focus on:

E2E latency
TTFT / TPOT
execution mode
cached tokens
KV transfer blocks
error
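For a quick look at one of these fields without opening the summary file, the JSONL can be scanned directly. This is a minimal sketch: `summarize_field` is a hypothetical helper, and the key name `e2e_latency_s` in the usage line is an assumption, since the actual field names in request-metrics.jsonl are not listed here. Check a record first and substitute the real key.

```python
import json
from pathlib import Path
from statistics import mean, median


def summarize_field(path, key):
    """Return count, mean, and median of a numeric field across JSONL records."""
    values = []
    with Path(path).open(encoding="utf-8") as fh:
        for line in fh:
            line = line.strip()
            if not line:
                continue  # tolerate blank lines
            record = json.loads(line)
            value = record.get(key)
            # Skip records where the field is missing, null, or non-numeric.
            if isinstance(value, (int, float)) and not isinstance(value, bool):
                values.append(float(value))
    if not values:
        return {"count": 0}
    return {"count": len(values), "mean": mean(values), "median": median(values)}


# Usage (the key name is an assumption):
# print(summarize_field("outputs/replay.jsonl", "e2e_latency_s"))
```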

Maintenance conventions

Project code changes: feat: / fix: / docs:.
SGLang changes: feat(sglang): ... / fix(sglang): ....
The baseline for third_party/sglang is a clean SGLang v0.5.10 snapshot.
Do not commit outputs/, logs, __pycache__, or virtual environments.
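The prefix conventions above could be checked mechanically, for example in a commit-msg hook. The sketch below is a strict reading of only the prefixes listed in this section; `classify_subject` is a hypothetical helper. The repository history also contains subjects such as chore: and feat(snapshot):, which this strict version would reject, so the patterns may need widening.

```python
import re

# Prefixes for project code changes: feat: / fix: / docs:
PROJECT_PREFIX = re.compile(r"^(feat|fix|docs): \S")
# Prefixes for changes under third_party/sglang: feat(sglang): / fix(sglang):
SGLANG_PREFIX = re.compile(r"^(feat|fix)\(sglang\): \S")


def classify_subject(subject: str):
    """Return 'sglang', 'project', or None if the subject matches neither form."""
    if SGLANG_PREFIX.match(subject):
        return "sglang"
    if PROJECT_PREFIX.match(subject):
        return "project"
    return None


print(classify_subject("feat(sglang): snapshot link integration"))
print(classify_subject("docs: rewrite project docs"))
print(classify_subject("update stuff"))
```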
