kzlin 9ccd853066 docs(kvc): correct reseed cost decomposition + flag D->P sync gap
After an independent Opus-agent forensic audit, the previous "(c)
incremental fetch (substantial engineering effort, not implemented)"
line in V2_DEEP_ANALYSIS §4.2 turned out to understate the gap. The
audit confirmed:

- No D->P KV transfer code exists in the framework at any layer
  (agentic_pd_hybrid orchestration, vendored SGLang disaggregation,
  or mooncake transport).
- Mooncake MooncakeKVManager has a hard role split: PREFILL = sender,
  DECODE = receiver-only loop. `add_transfer_request` asserts the
  disaggregation_mode is PREFILL.
- The BaseKVSender / BaseKVReceiver abstraction has no bidirectional slot.
- session_aware_cache.release_session only calls kv_pool_allocator.free()
  on eviction -- no serialization, no outbound network call.
- _commit_prefill_backup_residency is only called from the seed/reseed
  path (_invoke_kvcache_seeded_router). The direct-to-D path never
  updates P-side backup state.
- "capacity-backup" policy semantics: it only skips the close on P after
  reseed -- the backup is the seed-time static snapshot, never refreshed
  by D-side append-prefill activity.

V2_DEEP_ANALYSIS §4.2:
- Decomposed the 3-7s reseed cost into the P-side re-prefill segment
  (1.5-3s, dominant) and the P->D mooncake transfer segment (1.5-4s).
- Quantified the realistic effect of enabling RDMA: only the transfer
  segment shrinks, reseed reduces to 1.7-3.2s, TTFT p99 ~0.7s, still
  loses to DP's 0.43s.
- Replaced the throwaway "(c) incremental fetch" line with a full
  paragraph explaining what D->P sync would require, why it's the
  largest engineering gap, and that the blocker is SGLang's radix-tree
  single-producer assumption, not the network layer.

KVC_ROUTER_ALGORITHM §9:
- Refined Open Question 3 (RDMA) to clarify it only helps the transfer
  segment, not the re-prefill segment.
- Added Open Question 4: D->P incremental KV sync as the central
  future-work contribution gap, with cited evidence for why it doesn't
  currently exist.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-11 22:07:14 +08:00
2026-04-24 12:17:40 +00:00

Agentic PD Hybrid

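This project builds a minimal experimental framework on top of SGLang xPyD to answer one question:

Can session-aware / KV-cache-aware P/D routing reduce end-to-end latency for agentic coding workloads?

For a fuller but still concise description, see docs/PROJECT_OVERVIEW.md.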

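What it currently does

  • Launches a single-node SGLang P/D stack.
  • Replays the Ali coding agent trace and records request-level metrics.
  • Supports the default, sticky, and kv-aware routing policies.
  • Supports comparing pd-disaggregation, kvcache-centric, and pd-colo.
  • Supports micro-benchmark traces with small appends and multi-turn sessions.
  • Maintains local patches based on SGLang v0.5.10, kept in third_party/sglang.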

Environment

Use uv throughout:

uv sync

Default model path:

~/models/Qwen/Qwen3-Coder-30B-A3B-Instruct

The main test environment is currently a single machine with 8 GPUs; the constraint is prefill + decode <= 8.
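The budget above can be checked before launching a stack. The following is a minimal sketch, not part of the project: the function name `check_gpu_budget` is hypothetical, and it assumes each worker occupies exactly one GPU, which the README does not state.

```python
def check_gpu_budget(prefill_workers: int, decode_workers: int, total_gpus: int = 8) -> int:
    """Return the number of GPUs a split would use, or raise if it exceeds the budget.

    Assumption (not from the project): one GPU per worker.
    """
    if prefill_workers < 0 or decode_workers < 0:
        raise ValueError("worker counts must be non-negative")
    required = prefill_workers + decode_workers
    if required > total_gpus:
        raise ValueError(
            f"prefill + decode = {required} exceeds the {total_gpus}-GPU budget"
        )
    return required


if __name__ == "__main__":
    # The 1P + 1D layout used in the live benchmark command fits easily.
    print(check_gpu_budget(prefill_workers=1, decode_workers=1))
```

Under this assumption, a 4P + 4D split is the largest symmetric layout that fits on the 8-GPU machine.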

Common commands

Generate a small-append trace

uv run agentic-pd-hybrid make-small-append-trace \
  --output outputs/smoke-hotcap-30k-1k-256.jsonl \
  --session-count 4 \
  --turns-per-session 3 \
  --initial-input-length 30000 \
  --append-input-length 1000 \
  --output-length 256
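To get a feel for the workload this command describes, the per-turn input lengths can be estimated by hand. The sketch below is only an illustration under a stated assumption: that each turn's input is the previous turn's input plus the previous output plus the appended tokens. The actual generator may define turn lengths differently, and the function name `estimate_turn_input_lengths` is hypothetical.

```python
def estimate_turn_input_lengths(
    initial_input_length: int,
    append_input_length: int,
    output_length: int,
    turns: int,
) -> list[int]:
    """Estimate the input length at each turn of one session.

    Assumption (not confirmed by the project): earlier outputs stay in the
    context, so each later turn sees previous input + previous output + append.
    """
    lengths = []
    context = initial_input_length
    for turn in range(turns):
        if turn > 0:
            context += append_input_length  # new user/tool tokens for this turn
        lengths.append(context)
        context += output_length  # generated tokens become part of the context
    return lengths


if __name__ == "__main__":
    session_count = 4
    per_session = estimate_turn_input_lengths(30000, 1000, 256, turns=3)
    print(per_session)                       # estimated input length per turn
    print(session_count * len(per_session))  # total requests in the trace
```

With the flags shown above, the trace contains 4 sessions of 3 turns each, so 12 requests in total; only the first turn of each session has no reusable prefix.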

Run a live benchmark

uv run agentic-pd-hybrid benchmark-live \
  --trace outputs/micro-serveable-varturn-30k-1k-256-20260424T0756Z.jsonl \
  --output-root outputs/live-serveable-varturn-30k-1k-256-hotcap \
  --mechanism kvcache-centric \
  --policy kv-aware \
  --kvcache-admission-mode worker \
  --prefill-workers 1 \
  --decode-workers 1 \
  --prefill-gpu-ids 0 \
  --decode-gpu-ids 1 \
  --transfer-backend mooncake \
  --target-duration-s 2000 \
  --session-sample-rate 1.0 \
  --min-turns 2 \
  --time-scale 1 \
  --concurrency-limit 1000

Replay only and write metrics

uv run agentic-pd-hybrid replay \
  --trace path/to/trace.jsonl \
  --policy kv-aware \
  --mechanism pd-disaggregation \
  --router-url http://127.0.0.1:8000 \
  --output outputs/replay.jsonl

Output

Each replay/benchmark run writes:

  • Request metrics: request-metrics.jsonl
  • Summary: request-metrics.jsonl.summary.json

Key things to look at:

  • E2E latency
  • TTFT / TPOT
  • execution mode
  • cached tokens
  • KV transfer blocks
  • error
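A quick way to inspect these values is to compute percentiles straight from the JSONL file. The sketch below is an assumption-laden example, not the project's own tooling: the field names `ttft_s` and `error` are hypothetical placeholders, so substitute whatever keys request-metrics.jsonl actually contains.

```python
import json
import math


def percentile(values: list[float], q: float) -> float:
    """Nearest-rank percentile for q in (0, 100]."""
    if not values:
        raise ValueError("no values to summarize")
    ordered = sorted(values)
    rank = max(1, math.ceil(q / 100 * len(ordered)))
    return ordered[rank - 1]


def summarize_metric(path: str, field: str = "ttft_s") -> dict:
    """Summarize one numeric field from a JSONL metrics file.

    `field` and the `error` key are hypothetical names; adjust to the real schema.
    Rows with a truthy `error` are counted separately and excluded.
    """
    values, errors = [], 0
    with open(path, encoding="utf-8") as fh:
        for line in fh:
            line = line.strip()
            if not line:
                continue
            row = json.loads(line)
            if row.get("error"):
                errors += 1
                continue
            if row.get(field) is not None:
                values.append(float(row[field]))
    summary = {"count": len(values), "errors": errors}
    if values:
        summary["p50"] = percentile(values, 50)
        summary["p99"] = percentile(values, 99)
    return summary
```

Typical use would be `summarize_metric("outputs/.../request-metrics.jsonl", field="ttft_s")`, repeated per routing policy to compare tail latency side by side.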

Maintenance conventions

  • Project code changes: feat: / fix: / docs:
  • SGLang changes: feat(sglang): ... / fix(sglang): ...
  • The baseline for third_party/sglang is a clean SGLang v0.5.10 snapshot.
  • Do not commit outputs/, logs, __pycache__, or virtual environments.