docs(design): D->P sync interface contract + 4-phase rollout
Companion to BLOCK_LEVEL_EVICTION_DESIGN_ZH. Specifies the
three-layer contract (mooncake / SGLang / agentic-pd-hybrid)
that the empty feat/d-to-p-sync branch is meant to fill.
Contents:
- §1 staleness budget β as a first-class system parameter,
with recommended default (page_size .. 4096 tokens)
- §2.1 mooncake double-role API: KVRole enum extension,
DecodeKVSender / PrefillKVReceiver class shapes,
independent bootstrap channel
- §2.2 SGLang RadixCache.insert_external signature with
five concrete design decisions (re-mapping policy,
failure handling, lock_ref discipline, evict
interaction, multi-P backup view)
- §2.3 agentic-pd-hybrid CLI flags, DirectSessionState
additions, hook points in _invoke_session_direct
and _invoke_kvcache_seeded_router
- §3 candidate Theorem 4 (reseed_cost upper bound under
staleness budget β)
- §4 P1..P4 rollout with validation criteria per phase
- §5 five enumerated risks + mitigation
- §6 explicit decoupling: block-level eviction first,
then D->P sync; do NOT bundle in one PR
Makes the feat/d-to-p-sync branch actionable for the next
collaborator without GPU until P2 microbench phase.
247
docs/D_TO_P_SYNC_CONTRACT_ZH.md
Normal file
@@ -0,0 +1,247 @@
# D→P Incremental KV Sync: Interface Contract and Rollout Plan

**Date**: 2026-05-12
**Prerequisites**: [RESEED_SLOW_PATH_AND_D_TO_P_GAP_ZH.md](RESEED_SLOW_PATH_AND_D_TO_P_GAP_ZH.md) (locates the gap) + [BLOCK_LEVEL_EVICTION_DESIGN_ZH.md](BLOCK_LEVEL_EVICTION_DESIGN_ZH.md) (precondition)
**Nature**: cross-layer interface contract + formalization of the staleness budget + phased rollout
**Status**: Draft. The `feat/d-to-p-sync` branch is currently empty; this is the design document that should land on it first.

---

## 0. TL;DR

50% of the reseed slow-path time is spent re-prefilling on P, so **fixing the transfer segment (enabling RDMA) solves only half of the problem**. The only way to eliminate the long tail completely is to let the P-side backup incrementally keep up with the appends on D:

> D finishes a turn on the direct-to-D path → asynchronously pushes the newly committed KV blocks back into the P-side radix → on the next reseed the P-side radix hits the full prefix, so no re-prefill is needed, only a single P→D transfer.

This document gives the interface contract for the three layers (mooncake / SGLang / agentic-pd-hybrid), a formal definition of a **staleness budget β**, and a four-phase rollout plan, so that this work can proceed decoupled from block-level eviction.

---

## 1. Staleness Budget β: Formal Definition

Let `L_D(s, t)` be the committed prefix length of session `s` on D (the instantaneous value at time `t`), and let `L_P(s, t)` be the backup prefix length of the same session on P.

```
staleness(s, t) := L_D(s, t) - L_P(s, t) ≥ 0
```

The **staleness budget β** is the upper bound the system commits to maintaining:

```
∀ s, ∀ t : staleness(s, t) ≤ β
```

Intuition: the smaller β is → the more likely a reseed hits the P-side backup → the reseed degenerates into a single P→D transfer plus a re-prefill of ≤ β tokens.

- **β = 0**: fully synchronous (D blocks on a P ack for every committed block). High latency cost; not recommended.
- **β = ∞**: the current state (the P-side backup is forever the static snapshot taken at seed time).
- **β = one page (24 tokens)**: single-block sync. The theoretically optimal granularity, but every append on D triggers a D→P RPC.
- **β = O(append_len) (typically 1K–4K)**: batched sync. The recommended starting point: aggregate the decode output of one turn and push it as a single batch.
- **β = O(turn_size) (typically ~50K)**: coarse-grained sync. This defeats the reseed bypass and only reduces transfer. Not viable.

→ Rollout recommendation: β = `max(page_size, min(committed_in_turn, β_max))`, with `β_max` defaulting to 4096.

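
The definitions above are simple enough to check by hand, but a small sketch may help when wiring them into tests. The function names below are illustrative only and do not come from the codebase; the default `page_size=24` and `beta_max=4096` are the values quoted in this section.

```python
def staleness(committed_on_d: int, resident_on_p: int) -> int:
    """staleness(s, t) := L_D(s, t) - L_P(s, t), which must be non-negative."""
    if resident_on_p > committed_on_d:
        raise ValueError("P-side backup cannot be ahead of D-side committed prefix")
    return committed_on_d - resident_on_p


def recommended_beta(committed_in_turn: int, page_size: int = 24, beta_max: int = 4096) -> int:
    """beta = max(page_size, min(committed_in_turn, beta_max))."""
    return max(page_size, min(committed_in_turn, beta_max))


def within_budget(committed_on_d: int, resident_on_p: int, beta: int) -> bool:
    """True when the invariant staleness(s, t) <= beta holds for this sample."""
    return staleness(committed_on_d, resident_on_p) <= beta
```

With these defaults, a turn that commits fewer tokens than one page is clamped up to the page size, and a very long turn is clamped down to `beta_max`.
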

---

## 2. Three-Layer Interface Contract

### 2.1 Mooncake Layer: Dual-Role Support

**Current state** (see [RESEED_SLOW_PATH_AND_D_TO_P_GAP_ZH.md](RESEED_SLOW_PATH_AND_D_TO_P_GAP_ZH.md) §3 for details):

- `MooncakeKVManager` is bound to a single role at initialization via `disaggregation_mode ∈ {PREFILL, DECODE}`.
- `MooncakeKVSender` is instantiated only in PREFILL mode; `MooncakeKVReceiver` is instantiated only in DECODE mode.
- `add_transfer_request` carries the hard constraint `assert disaggregation_mode == PREFILL`.

**Target interface**:

```python
# third_party/sglang/python/sglang/srt/disaggregation/base/conn.py
from enum import Enum


class KVRole(Enum):  # declared first so BaseKVManager can reference it
    PREFILL = "prefill"
    DECODE = "decode"
    PREFILL_BACKUP_RECEIVER = "prefill_backup_receiver"  # new: P side receives D→P sync
    DECODE_BACKUP_SENDER = "decode_backup_sender"  # new: D side sends D→P sync


class BaseKVManager:
    roles: set[KVRole]  # replaces the former single-valued field; allows {PREFILL, DECODE}
```

**New classes** (~400 LOC at the implementation layer):

| Class | Role | Key method |
|---|---|---|
| `DecodeKVSender` | D side pushes the newly appended KV blocks back to P | `enqueue_sync(session_id, kv_blocks, target_p)` enqueues asynchronously and returns a `sync_id` |
| `PrefillKVReceiver` | P side receives D→P sync packets | `recv_loop()` runs in a background thread; each packet triggers a callback that injects it into the radix tree |

**Bootstrap channel**: a second bootstrap socket, independent of the existing P→D channel, is required (to avoid conflicts in buffer pointer negotiation). Configuration:

- Disabled by default; enabled by the ServerArgs flag `--enable-d2p-sync`
- New port range `BOOTSTRAP_D2P_PORT_BASE = 22000`

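
To make the `enqueue_sync` shape concrete, here is a minimal, transport-free sketch. It only models the role check and the "enqueue, return a `sync_id`" behaviour described in the table; the `SyncRequest` record, the in-memory queue, and the constructor signature are assumptions for illustration and say nothing about how mooncake would actually move the bytes.

```python
import itertools
import queue
from dataclasses import dataclass
from enum import Enum
from typing import Iterable


class KVRole(Enum):
    DECODE = "decode"
    DECODE_BACKUP_SENDER = "decode_backup_sender"


@dataclass(frozen=True)
class SyncRequest:  # hypothetical record, not part of the contract
    sync_id: int
    session_id: str
    kv_blocks: tuple
    target_p: str


class DecodeKVSender:
    """Toy stand-in: queues D→P sync requests without touching any transport."""

    def __init__(self, roles: set) -> None:
        # A role set replaces the single-mode assert: the sender exists
        # only if the manager was started with the backup-sender role.
        if KVRole.DECODE_BACKUP_SENDER not in roles:
            raise ValueError("manager was not started with DECODE_BACKUP_SENDER")
        self._ids = itertools.count(1)
        self.pending: queue.Queue = queue.Queue()

    def enqueue_sync(self, session_id: str, kv_blocks: Iterable, target_p: str) -> int:
        """Enqueue without blocking the caller and return the new sync_id."""
        req = SyncRequest(next(self._ids), session_id, tuple(kv_blocks), target_p)
        self.pending.put_nowait(req)
        return req.sync_id
```

A background worker would drain `pending` and perform the actual transfer; that part is deliberately left out here.
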
### 2.2 SGLang Layer: Multi-Producer Radix Extension

**Current state**: the P-side radix assumes a single producer (the model output of the local worker). `RadixCache.cache_finished_req` reads the KV indices directly from `req_to_token_pool[req_pool_idx, :]` and inserts them into the tree.

**Target interface** (after [BLOCK_LEVEL_EVICTION_DESIGN_ZH.md](BLOCK_LEVEL_EVICTION_DESIGN_ZH.md) is complete):

```python
from collections.abc import Sequence

import torch


class RadixCache(BasePrefixCache):
    def insert_external(
        self,
        token_ids: Sequence[int],
        kv_tensor: torch.Tensor,
        *,
        source_worker_id: str,
        session_id: str,
        pin: bool = False,  # referenced by the invariants below
    ) -> InsertExternalResult:
        """
        Insert KV blocks supplied by an external worker (D→P sync).

        Allocates fresh slots in token_to_kv_pool, copies kv_tensor into them,
        and threads the resulting indices through the radix tree exactly like
        cache_finished_req would for a local prefill.

        Invariants:
        - Same model layout (verified at handshake time, not per-call).
        - On collision with existing radix path, no-op for the shared prefix
          and only insert the diverging suffix.
        - Inserted nodes get lock_ref += 1 if `pin=True`, default False.
          D→P sync is best-effort; LRU is allowed to evict the inserted leaves.
        """
```

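**Key design points**:

| Decision | Options | Recommendation |
|---|---|---|
| KV index remapping | A) D sends its original indices and P remaps them; B) D sends a densely packed tensor and P allocates fresh slots | **B**: avoids leaking indices across workers |
| Failure handling | A) on D→P failure, fall back to re-prefill; B) retry N times | **A**, plus: if P misses on a later reseed, take the legacy path |
| Reference counting | Should KV synced into P be pinned? | **Do not pin**: the P-side LRU manages it naturally, which keeps backups from crowding out production KV |
| Coordination with eviction | What happens if P is full when a sync arrives? | Let the sync insert trigger inner.evict → fair LRU competition with locally produced KV |
| Multiple P instances for one session | What if router round-robin dispatches turns to different P instances? | **Accept multi-source**: each P maintains its own backup; at reseed time, pick the one with the smallest staleness |
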
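
The collision invariant ("no-op for the shared prefix, insert only the diverging suffix") combined with option B (P allocates fresh slots) can be illustrated on a toy structure. The sketch below is an uncompressed one-token-per-node prefix tree with a Python list standing in for the KV pool; it is not SGLang's `RadixCache`, and `ToyPrefixTree` and its return shape are invented for the example.

```python
from dataclasses import dataclass, field
from typing import Any, Optional, Sequence


@dataclass
class Node:
    kv_slot: Optional[int] = None  # index into the local pool, never a remote index
    children: dict = field(default_factory=dict)


class ToyPrefixTree:
    def __init__(self) -> None:
        self.root = Node()
        self.pool: list = []  # stand-in for the P-side KV pool

    def _alloc(self, kv_row: Any) -> int:
        """Allocate a fresh local slot and copy the incoming row into it."""
        self.pool.append(kv_row)
        return len(self.pool) - 1

    def insert_external(self, token_ids: Sequence[int], kv_rows: Sequence[Any]) -> tuple:
        """Return (shared_prefix_len, inserted_len) for one external insert."""
        if len(token_ids) != len(kv_rows):
            raise ValueError("one KV row is required per token")
        node, shared = self.root, 0
        # Walk the existing path: the shared prefix is left untouched.
        for tok in token_ids:
            child = node.children.get(tok)
            if child is None:
                break
            node, shared = child, shared + 1
        # Only the diverging suffix consumes new slots.
        for tok, row in zip(token_ids[shared:], kv_rows[shared:]):
            child = Node(kv_slot=self._alloc(row))
            node.children[tok] = child
            node = child
        return shared, len(token_ids) - shared
```

Re-sending an already resident prefix therefore costs no pool slots, which is what makes a best-effort, possibly duplicated D→P push safe in this toy model.
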
### 2.3 agentic-pd-hybrid Layer: Hooks and State Machine

**New CLI flags**:

```bash
--enable-d2p-sync                        # off by default
--d2p-staleness-budget-tokens 4096       # β_max
--d2p-sync-batch-min-tokens 24           # trigger only once at least one page has accumulated
--d2p-sync-target-policy {last_p, round_robin, broadcast}
    # last_p:    push back to the P that last seeded this session
    # broadcast: push to every P (flexible at reseed time, but bandwidth-heavy)
```

**New state fields** (`DirectSessionState` in `replay.py`):

```python
from dataclasses import dataclass, field


@dataclass
class DirectSessionState:
    ...
    # NEW: per-P backup view, populated by D->P sync callbacks.
    prefill_resident_tokens_by_p: dict[str, int] = field(default_factory=dict)
    last_d2p_sync_at: float | None = None
```

**Hook after `_invoke_session_direct` completes**:

```python
async def _invoke_session_direct(...):
    ...
    response = await self._stream_direct_to_d(...)
    if response.ok and self.config.enable_d2p_sync:
        new_committed = response.kv_committed_len
        # Freshest backup across all P instances that hold this session.
        prev_p_resident = max(session.prefill_resident_tokens_by_p.values(), default=0)
        staleness = new_committed - prev_p_resident
        if staleness >= self.config.d2p_sync_batch_min_tokens:
            target_p = self._choose_d2p_target(session)
            # Asynchronous: the sync must not block the turn's response path.
            asyncio.create_task(
                self._issue_d2p_sync(session, target_p, prev_p_resident, new_committed)
            )
```

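
The hook above delegates target selection to `_choose_d2p_target`, whose body is not specified here. One way the three `--d2p-sync-target-policy` values could map to targets is sketched below as a free function. The semantics of `round_robin` (rotate over all P workers per sync) are an assumption, since only `last_p` and `broadcast` are described in the flag comments, and every name in the sketch is hypothetical.

```python
import itertools
from typing import Iterator, Optional, Sequence


def choose_d2p_targets(
    policy: str,
    all_p: Sequence[str],
    last_seed_p: Optional[str],
    rr: Iterator[str],
) -> list:
    """Return the P workers that should receive this sync batch."""
    if policy == "last_p":
        if last_seed_p is None:
            raise ValueError("last_p policy needs the P that last seeded the session")
        return [last_seed_p]
    if policy == "round_robin":
        # Assumed meaning: advance a shared rotation over all P workers.
        return [next(rr)]
    if policy == "broadcast":
        return list(all_p)
    raise ValueError(f"unknown d2p sync target policy: {policy!r}")


# Usage: the caller owns the rotation state, e.g. one cycle per router.
workers = ["p0", "p1", "p2"]
rotation = itertools.cycle(workers)
targets = choose_d2p_targets("last_p", workers, "p1", rotation)
```

Returning a list keeps `broadcast` and the single-target policies behind one call site; the hook would then issue one sync task per returned worker.
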

**Hook on the reseed path** (`_invoke_kvcache_seeded_router`):

```python
async def _invoke_kvcache_seeded_router(..., request):
    ...
    if self.config.enable_d2p_sync:
        # Probe P-side residency before issuing full re-prefill.
        probe = await self._probe_prefill_residency(session_id)
        beta_max = self.config.d2p_staleness_budget_tokens  # β_max
        if probe.resident_tokens >= request.prefix_len - beta_max:
            # Use the up-to-date backup: skip re-prefill, just trigger P→D transfer.
            return await self._invoke_p_to_d_transfer_only(...)
    # Fall back to existing path.
    return await self._invoke_kvcache_seeded_router_legacy(...)
```

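
The probe condition, together with the multi-P recommendation from §2.2 (pick the backup with the smallest staleness), reduces to a small pure function over the per-P view. The sketch below is one possible reading, not the router's actual logic; `pick_reseed_source` is a made-up name, and the input dict mirrors `prefill_resident_tokens_by_p`.

```python
from typing import Optional


def pick_reseed_source(
    resident_by_p: dict,
    prefix_len: int,
    beta_max: int,
) -> Optional[str]:
    """Return the P whose backup is fresh enough for a transfer-only reseed.

    Returns None when no backup is within beta_max tokens of the requested
    prefix, in which case the caller takes the legacy re-prefill path.
    """
    if not resident_by_p:
        return None
    # Smallest staleness is the same as largest resident prefix.
    best_p = max(resident_by_p, key=lambda p: resident_by_p[p])
    if resident_by_p[best_p] >= prefix_len - beta_max:
        return best_p
    return None
```

For a 50,000-token prefix and `beta_max=4096`, a backup holding 47,000 tokens qualifies while one holding 40,000 does not.
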

---

## 3. Properties (To Be Proven)

### 3.1 Candidate Theorem 4 (paper form)

*Assume the staleness budget β holds. For a session `s` that has accumulated length L on D, is evicted, and then triggers a reseed:*

```
reseed_cost(s) ≤ T_p2d(L) + T_prefill(min(β, L))
```

*where T_p2d is the P→D transfer time (~L · 4 ns/token under RDMA) and T_prefill is the prefill time (at ~50K tokens/s on H100 TP1 with Qwen3-30B). When β ≪ L, the bound is dominated by the single P→D transfer.*

**Baseline for comparison** (no D→P sync): `reseed_cost = T_p2d(L) + T_prefill(L − seed_size)`, dominated by the re-prefill.

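
Plugging the two ballpark constants quoted above into both expressions gives a feel for the gap. This is arithmetic on the stated model only, not a measurement; the session length, β, and `seed_size` used in the usage lines are arbitrary illustration values.

```python
P2D_NS_PER_TOKEN = 4.0           # "~L · 4 ns/token" from the theorem statement
PREFILL_TOKENS_PER_S = 50_000.0  # "~50K tokens/s" from the theorem statement


def t_p2d(tokens: int) -> float:
    """Modelled P→D transfer time in seconds."""
    return tokens * P2D_NS_PER_TOKEN * 1e-9


def t_prefill(tokens: int) -> float:
    """Modelled prefill time in seconds."""
    return tokens / PREFILL_TOKENS_PER_S


def reseed_bound_with_sync(length: int, beta: int) -> float:
    """Upper bound T_p2d(L) + T_prefill(min(beta, L))."""
    return t_p2d(length) + t_prefill(min(beta, length))


def reseed_baseline(length: int, seed_size: int) -> float:
    """Baseline T_p2d(L) + T_prefill(L - seed_size) without D→P sync."""
    return t_p2d(length) + t_prefill(length - seed_size)


# Illustration values only: L = 50,000 tokens, beta = 4096, seed_size = 10,000.
with_sync = reseed_bound_with_sync(50_000, 4096)
baseline = reseed_baseline(50_000, 10_000)
```

Under these assumptions the bound with sync is about 0.082 s against about 0.80 s for the baseline, and in both cases the prefill term dwarfs the modelled transfer term.
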
### 3.2 Relationship to Theorem 2

Theorem 2 only guarantees fast hits on the direct-to-D path. Theorem 4 additionally pushes the fallback cost on a fast-path miss down to the sub-second range, which makes it possible for KVC to beat DP at **every quantile**.

---

## 4. Four-Phase Rollout

| Phase | Scope | GPU requirement | Acceptance criteria |
|---|---|---|---|
| **P1** | block-level eviction refactor ([BLOCK_LEVEL_EVICTION_DESIGN_ZH.md](BLOCK_LEVEL_EVICTION_DESIGN_ZH.md)) | 4×H100 smoke | average ≤ 500 tokens per eviction |
| **P2** | mooncake dual-role support + microbench (D→P single-packet RTT, bandwidth utilization) | single node + RDMA | D→P RTT < 50 ms (local); bandwidth for a single 16K-token block ≥ 50% of the theoretical limit |
| **P3** | SGLang `insert_external` + agentic-pd-hybrid hook (best-effort only, no reseed probe) | 4×H100 + RDMA | sync trigger rate > 80%, with syncs completing within the same turn; no new failure mode introduced |
| **P4** | reseed probe wired up + end-to-end evaluation | 4×H100 + RDMA | single reseed < 0.5 s (vs. 3–7 s today); TTFT p99 < 0.5 s |

**Key decision point**: between P1 and P2, run an audit to confirm that the SGLang radix `insert_external` does not conflict with the streaming-session decode path. If a serious conflict is found, introduce a "P-only sync mode" as a placeholder and open it up once the architecture has stabilized.

---

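## 5. Risks and Mitigations

| Risk | Impact | Mitigation |
|---|---|---|
| Mooncake dual-role support breaks the existing one-way P→D path | E2 already exposed the mooncake "instance not alive" cascade; adding another channel may amplify it | In P2, start with an independent bootstrap channel + feature flag; keep the disable path |
| D→P sync consumes D's egress bandwidth and hurts direct-to-D append-prefill latency | Directly degrades the main path | Run sync on a low-priority QP (RDMA SL=0), trigger it in batches, at most once per turn |
| The P-side radix fills up with backups and crowds out locally produced KV | P-side prefill slows down | Sync inserts are not pinned (§2.2), so LRU competes fairly |
| Coordinating multiple backup views across multiple P instances is complex | The router must take staleness into account when choosing target_p | Start with the `last_p` policy (recency-biased); decide whether to move to `broadcast` after observing the measured distribution |
| `insert_external` drifts from the upstream API across SGLang patch upgrades | Maintenance burden | Confine the API to our vendor patch boundary (without polluting the upstream radix) and write a contract test |

---
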
## 6. Decoupling from Block-Level Eviction

| Work item | Depends on the other? |
|---|---|
| block-level eviction | Does not depend on D→P sync and can ship independently. On its own it reduces reseed frequency |
| D→P sync | **Depends on** block-level eviction: it requires the P-side radix to be the source of truth for streaming-session KV |
| Both together | Largest gain: reseed frequency drops by an order of magnitude and per-reseed time drops by an order of magnitude |

→ Rollout order: block-level eviction lands first; D→P sync then proceeds on `feat/d-to-p-sync`. The two should **not** be combined into one PR.

---

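## 7. Minimal Actions for the Next Agent

1. Land this document on the `feat/d-to-p-sync` branch.
2. Once block-level eviction is in main, start phase P2: mooncake dual-role support + microbench (unit tests only, no coupling to the SGLang main path).
3. In phase P3, add `insert_external` and the hooks; merge to main disabled by default.
4. After the P4 end-to-end evaluation, decide on the reseed probe policy (`last_p` vs `broadcast`).

---

**Key takeaway**: D→P incremental sync is not as simple as "adding one more network channel"; the crux is extending the P-side radix from a single producer to one that accepts best-effort external feeds. Block-level eviction is the precondition for this, so the two pieces of work can proceed one after the other, but not in the reverse order.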