docs(design): D->P sync interface contract + 4-phase rollout

Companion to BLOCK_LEVEL_EVICTION_DESIGN_ZH. Specifies the three-layer contract (mooncake / SGLang / agentic-pd-hybrid) that the empty feat/d-to-p-sync branch is meant to fill.

Contents:
- §1 staleness budget β as a first-class system parameter, with recommended default (page_size .. 4096 tokens)
- §2.1 mooncake double-role API: KVRole enum extension, DecodeKVSender / PrefillKVReceiver class shapes, independent bootstrap channel
- §2.2 SGLang RadixCache.insert_external signature with five concrete design decisions (re-mapping policy, failure handling, lock_ref discipline, evict interaction, multi-P backup view)
- §2.3 agentic-pd-hybrid CLI flags, DirectSessionState additions, hook points in _invoke_session_direct and _invoke_kvcache_seeded_router
- §3 candidate Theorem 4 (reseed_cost upper bound under staleness budget β)
- §4 P1..P4 rollout with validation criteria per phase
- §5 five enumerated risks + mitigation
- §6 explicit decoupling: block-level eviction first, then D->P sync; do NOT bundle in one PR

Makes the feat/d-to-p-sync branch actionable for the next collaborator without GPU until P2 microbench phase.
2026-05-12 23:50:39 +08:00
parent 683c44bd71
commit fd37eda367
1 changed files with 247 additions and 0 deletions
--- a/docs/D_TO_P_SYNC_CONTRACT_ZH.md
+++ b/docs/D_TO_P_SYNC_CONTRACT_ZH.md
@@ -0,0 +1,247 @@
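+# D→P Incremental KV Sync: Interface Contract and Rollout Plan
+
+**Date**: 2026-05-12
+**Prerequisites**: [RESEED_SLOW_PATH_AND_D_TO_P_GAP_ZH.md](RESEED_SLOW_PATH_AND_D_TO_P_GAP_ZH.md) (gap analysis) + [BLOCK_LEVEL_EVICTION_DESIGN_ZH.md](BLOCK_LEVEL_EVICTION_DESIGN_ZH.md) (precondition)
+**Type**: cross-layer interface contract + formalization of the staleness budget + phased rollout
+**Status**: Draft. The `feat/d-to-p-sync` branch is currently empty; this document is the first thing that should land on it.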
+
+---
+
+## 0. TL;DR
+
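+About 50% of the reseed slow path is spent on P re-running prefill, so **fixing the transfer segment (enabling RDMA) removes only half of the cost**. The only way to eliminate the long tail entirely is to have the P-side backup incrementally track D-side appends:
+
+> When D finishes a turn on the direct-to-D path, it asynchronously pushes the newly committed KV blocks back into the P-side radix. On the next reseed, the P-side radix hits the full prefix, so no re-prefill is needed, only a single P→D transfer.
+
+This document gives the interface contract for the three layers (mooncake / SGLang / agentic-pd-hybrid), a formal definition of the **staleness budget β**, and a four-phase rollout plan, so that this work can proceed decoupled from block-level eviction.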
+
+---
+
+## 1. Staleness Budget β: Formal Definition
+
+Let `L_D(s, t)` be the committed prefix length of session `s` on D (the instantaneous value at time `t`), and let `L_P(s, t)` be the backup prefix length of the same session on P.
+
+```
+staleness(s, t) := L_D(s, t) - L_P(s, t)   ≥ 0
+```
+
+The **staleness budget β** is the upper bound that the system commits to maintaining:
+
+```
+∀ s, ∀ t :  staleness(s, t) ≤ β
+```
+
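+Intuition: the smaller β is, the more likely a reseed hits the P-side backup, and the reseed degenerates into a single P→D transfer plus a re-prefill of at most β tokens.
+
+- **β = 0**: fully synchronous (D blocks on P's ack for every committed block). High latency cost; not recommended.
+- **β = ∞**: the current state (the P-side backup is forever a static seed-time snapshot).
+- **β = one page (24 tokens)**: per-block sync. The theoretically optimal granularity, but every append on D triggers a D→P RPC.
+- **β = O(append_len) (typically 1K–4K)**: batched sync. Recommended starting point: aggregate the decode output of one turn and push it as a single batch.
+- **β = O(turn_size) (typically ~50K)**: coarse-grained sync. Defeats the reseed bypass and only reduces transfer. Not viable.
+
+→ The rollout recommends β = `max(page_size, min(committed_in_turn, β_max))`, with `β_max` defaulting to 4096.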
+
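Review note: the β rule and the budget invariant from §1 can be sketched in a few lines of Python. This is a minimal illustration, not code from the branch; the function names `choose_beta`, `staleness`, and `within_budget` are hypothetical, and only the formula `max(page_size, min(committed_in_turn, β_max))` comes from the document.

```python
def choose_beta(page_size: int, committed_in_turn: int, beta_max: int = 4096) -> int:
    """Recommended budget: max(page_size, min(committed_in_turn, beta_max))."""
    if page_size <= 0 or beta_max <= 0:
        raise ValueError("page_size and beta_max must be positive")
    return max(page_size, min(committed_in_turn, beta_max))


def staleness(l_d: int, l_p: int) -> int:
    """staleness(s, t) = L_D(s, t) - L_P(s, t); the definition requires it to be >= 0."""
    if l_p > l_d:
        raise ValueError("P-side backup cannot be ahead of the D-side committed prefix")
    return l_d - l_p


def within_budget(l_d: int, l_p: int, beta: int) -> bool:
    """True when the invariant staleness(s, t) <= beta holds for this session."""
    return staleness(l_d, l_p) <= beta
```

With a 24-token page, a turn that commits fewer than 24 tokens is floored at one page, and a very long turn is capped at `beta_max`.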
+---
+
+## 2. Three-Layer Interface Contract
+
+### 2.1 Mooncake layer: dual-role support
+
+**Current state** (see [RESEED_SLOW_PATH_AND_D_TO_P_GAP_ZH.md](RESEED_SLOW_PATH_AND_D_TO_P_GAP_ZH.md) §3):
+
+- `MooncakeKVManager` is bound to a single role at initialization, according to `disaggregation_mode ∈ {PREFILL, DECODE}`.
+- `MooncakeKVSender` is instantiated only in PREFILL mode, and `MooncakeKVReceiver` only in DECODE mode.
+- `add_transfer_request` carries the hard constraint `assert disaggregation_mode == PREFILL`.
+
+**Target interface**:
+
+```python
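+# third_party/sglang/python/sglang/srt/disaggregation/base/conn.py
+from enum import Enum
+
+class KVRole(Enum):
+    PREFILL = "prefill"
+    DECODE = "decode"
+    PREFILL_BACKUP_RECEIVER = "prefill_backup_receiver"   # new: P side receives D→P sync
+    DECODE_BACKUP_SENDER = "decode_backup_sender"         # new: D side sends D→P sync
+
+class BaseKVManager:
+    # Replaces the original single-valued field; allows e.g. {PREFILL, DECODE}.
+    # KVRole is defined first so that this annotation resolves at class creation.
+    roles: set[KVRole]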
+```
+
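+**New classes** (~400 LOC of implementation):
+
+| Class | Role | Key methods |
+|---|---|---|
+| `DecodeKVSender` | D side pushes newly appended KV blocks back to P | `enqueue_sync(session_id, kv_blocks, target_p)` enqueues asynchronously and returns a `sync_id` |
+| `PrefillKVReceiver` | P side receives D→P sync packets | `recv_loop()` runs on a background thread; each packet triggers a callback that injects into the radix tree |
+
+**Bootstrap channel**: a second bootstrap socket, independent of the existing P→D channel, is required (to avoid conflicts in buffer-pointer negotiation). Configuration:
+- Disabled by default; enabled by the ServerArgs flag `--enable-d2p-sync`
+- New port range `BOOTSTRAP_D2P_PORT_BASE = 22000`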
+
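Review note: the sender/receiver shapes in the table above can be illustrated with a pure-Python stand-in. This is only a sketch under loud assumptions: an in-process `queue.Queue` replaces the real RDMA/bootstrap transport, `bytes` replaces the packed KV payload, and the names `SyncPacket`, `DecodeKVSenderSketch`, `PrefillKVReceiverSketch`, `start`, and `stop` are hypothetical. Only `enqueue_sync(session_id, kv_blocks, target_p)` returning a `sync_id` and the background `recv_loop()` with a per-packet callback come from the document.

```python
import itertools
import queue
import threading
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class SyncPacket:
    sync_id: int
    session_id: str
    kv_blocks: bytes   # stand-in for the packed KV payload
    target_p: str


class DecodeKVSenderSketch:
    """D side: enqueue and return at once, so the decode path never waits on P."""

    def __init__(self, channel: "queue.Queue") -> None:
        self._channel = channel
        self._ids = itertools.count(1)

    def enqueue_sync(self, session_id: str, kv_blocks: bytes, target_p: str) -> int:
        sync_id = next(self._ids)
        self._channel.put(SyncPacket(sync_id, session_id, kv_blocks, target_p))
        return sync_id


class PrefillKVReceiverSketch:
    """P side: a background thread drains packets and hands each to a callback."""

    def __init__(self, channel: "queue.Queue", on_packet: Callable[[SyncPacket], None]) -> None:
        self._channel = channel
        self._on_packet = on_packet
        self._thread = threading.Thread(target=self.recv_loop, daemon=True)

    def start(self) -> None:
        self._thread.start()

    def recv_loop(self) -> None:
        while True:
            packet = self._channel.get()
            if packet is None:          # shutdown sentinel
                break
            self._on_packet(packet)     # in the real design: inject into the radix tree

    def stop(self) -> None:
        self._channel.put(None)
        self._thread.join()
```

The point of the sketch is the asymmetry: `enqueue_sync` is fire-and-forget on D, while all radix mutation happens on P inside the callback.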
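+### 2.2 SGLang layer: multi-producer radix extension
+
+**Current state**: the P-side radix assumes a single producer (this worker's own model output). `RadixCache.cache_finished_req` reads KV indices directly from `req_to_token_pool[req_pool_idx, :]` and inserts them into the tree.
+
+**Target interface** (after [BLOCK_LEVEL_EVICTION_DESIGN_ZH.md](BLOCK_LEVEL_EVICTION_DESIGN_ZH.md) is complete):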
+
+```python
+class RadixCache(BasePrefixCache):
+    def insert_external(
+        self,
+        token_ids: Sequence[int],
+        kv_tensor: torch.Tensor,
+        *,
+        source_worker_id: str,
+        session_id: str,
+        pin: bool = False,   # referenced by the docstring below; False keeps synced leaves evictable
+    ) -> InsertExternalResult:
+        """
+        Insert KV blocks supplied by an external worker (D→P sync).
+
+        Allocates fresh slots in token_to_kv_pool, copies kv_tensor into them,
+        and threads the resulting indices through the radix tree exactly like
+        cache_finished_req would for a local prefill.
+
+        Invariants:
+            - Same model layout (verified at handshake time, not per-call).
+            - On collision with existing radix path, no-op for the shared prefix
+              and only insert the diverging suffix.
+            - Inserted nodes get lock_ref += 1 if `pin=True`, default False.
+              D→P sync is best-effort; LRU is allowed to evict the inserted leaves.
+        """
+```
+
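+**Key design decisions**:
+
+| Decision | Options | Recommendation |
+|---|---|---|
+| KV index remapping | A) D sends original indices, P remaps; B) D sends a tightly packed tensor, P reallocates | **B**: avoids leaking indices across workers |
+| Failure handling | A) D→P failure → fall back to re-prefill; B) retry N times | **A**, plus taking the legacy path if P misses at a later reseed |
+| Reference counting | Are KV blocks synced into P pinned? | **Not pinned**: P-side LRU manages them naturally, so backup does not crowd out production KV |
+| Coordination with evict | What if P is full when a sync arrives? | Let the sync insert trigger inner.evict → fair LRU competition with locally produced KV |
+| Multiple P instances for one session | What if the router round-robins turns to different P instances? | **Accept multi-source**: each P maintains its own backup; at reseed time pick the one with the lowest staleness |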
+
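Review note: the collision invariant of `insert_external` (no-op on the shared prefix, insert only the diverging suffix, unpinned by default) and decision B (P allocates its own slots) can be illustrated with a toy page-granular prefix tree. Everything here is an assumption made for illustration: `ToyRadix`, `Node`, the list-backed pool, and the choice to drop a trailing partial page are hypothetical and are not SGLang's `RadixCache` implementation.

```python
from dataclasses import dataclass, field
from typing import Optional, Sequence


@dataclass
class Node:
    children: dict = field(default_factory=dict)   # page tuple -> Node
    kv_slot: Optional[int] = None                  # index into the toy KV pool
    lock_ref: int = 0                              # 0 means evictable by LRU


class ToyRadix:
    def __init__(self, page_size: int) -> None:
        self.page_size = page_size
        self.root = Node()
        self.pool: list = []   # stand-in for token_to_kv_pool

    def insert_external(self, token_ids: Sequence[int], kv_pages: Sequence[object],
                        *, pin: bool = False) -> int:
        """Insert page by page; return how many tokens were newly inserted."""
        n_pages = len(token_ids) // self.page_size   # assumption: drop a partial tail page
        if len(kv_pages) < n_pages:
            raise ValueError("one KV payload is required per full page")
        node, inserted = self.root, 0
        for i in range(n_pages):
            key = tuple(token_ids[i * self.page_size:(i + 1) * self.page_size])
            child = node.children.get(key)
            if child is None:
                # Diverging suffix: allocate a fresh local slot and copy the payload in.
                self.pool.append(kv_pages[i])
                child = Node(kv_slot=len(self.pool) - 1)
                if pin:
                    child.lock_ref += 1
                node.children[key] = child
                inserted += self.page_size
            # Shared prefix: existing node is reused untouched (no-op).
            node = child
        return inserted
```

Because slots are always allocated locally, no D-side index ever appears in P's tree, which is the property decision B is after.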
+### 2.3 agentic-pd-hybrid layer: hooks and state machine
+
+**New CLI flags**:
+
+```bash
+--enable-d2p-sync                     # off by default
+--d2p-staleness-budget-tokens 4096    # β_max
+--d2p-sync-batch-min-tokens 24        # trigger only once ≥ 1 page has accumulated
+--d2p-sync-target-policy {last_p, round_robin, broadcast}
+                                      # last_p: push back to the P that last seeded this session
+                                      # broadcast: push to every P (flexible at reseed time, but bandwidth-heavy)
+```
+
+**New state fields** (`DirectSessionState` in `replay.py`):
+
+```python
+@dataclass
+class DirectSessionState:
+    ...
+    # NEW: per-P backup view, populated by D->P sync callbacks.
+    prefill_resident_tokens_by_p: dict[str, int] = field(default_factory=dict)
+    last_d2p_sync_at: float | None = None
+```
+
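Review note: given the per-P backup view `prefill_resident_tokens_by_p`, the "pick the P with the lowest staleness at reseed time" rule from §2.2 could look roughly like the sketch below. The helper names `pick_reseed_source` and `can_skip_full_reprefill` are hypothetical, and ties are broken by dictionary insertion order, which is an arbitrary choice made here.

```python
from typing import Optional


def pick_reseed_source(
    resident_by_p: dict[str, int], committed_len: int
) -> Optional[tuple[str, int]]:
    """Return (p_id, staleness) for the P holding the longest backup, or None."""
    if not resident_by_p:
        return None   # no backup anywhere: caller falls back to the legacy path
    best_p = max(resident_by_p, key=resident_by_p.get)
    return best_p, committed_len - resident_by_p[best_p]


def can_skip_full_reprefill(staleness: int, beta_max: int) -> bool:
    """The reseed probe condition: the best backup is within the budget."""
    return staleness <= beta_max
```

For a session with views `{"p0": 1000, "p1": 3000}` and 3500 committed tokens, `p1` is chosen with a staleness of 500 tokens.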
+**Hook after `_invoke_session_direct` completes**:
+
+```python
+async def _invoke_session_direct(...):
+    ...
+    response = await self._stream_direct_to_d(...)
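+    if response.ok and self.config.enable_d2p_sync:
+        new_committed = response.kv_committed_len
+        # Pick the target first so staleness is measured against that P's own
+        # backup view; a delta computed from another P's view would leave a gap.
+        target_p = self._choose_d2p_target(session)
+        prev_p_resident = session.prefill_resident_tokens_by_p.get(target_p, 0)
+        staleness = new_committed - prev_p_resident
+        if staleness >= self.config.d2p_sync_batch_min_tokens:
+            # Keep a strong reference: the event loop holds tasks only weakly.
+            # `_d2p_sync_tasks` is a hypothetical set[asyncio.Task] attribute.
+            task = asyncio.create_task(
+                self._issue_d2p_sync(session, target_p, prev_p_resident, new_committed)
+            )
+            self._d2p_sync_tasks.add(task)
+            task.add_done_callback(self._d2p_sync_tasks.discard)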
+```
+
+**Hook on the reseed path** (`_invoke_kvcache_seeded_router`):
+
+```python
+async def _invoke_kvcache_seeded_router(..., request):
+    ...
+    if self.config.enable_d2p_sync:
+        # Probe P-side residency before issuing full re-prefill.
+        probe = await self._probe_prefill_residency(session_id)
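+        beta_max = self.config.d2p_staleness_budget_tokens  # β_max, set by the CLI flag
+        if probe.resident_tokens >= request.prefix_len - beta_max:
+            # Backup is within budget: skip the full re-prefill and trigger the
+            # P→D transfer; at most β_max trailing tokens still need prefill.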
+            return await self._invoke_p_to_d_transfer_only(...)
+    # Fall back to existing path.
+    return await self._invoke_kvcache_seeded_router_legacy(...)
+```
+
+---
+
+## 3. Properties (to be proven)
+
+### 3.1 Candidate Theorem 4 (paper form)
+
+*Assume the staleness budget β holds. For a session `s` that has accumulated length L on D and triggers a reseed after being evicted:*
+
+```
+reseed_cost(s) ≤ T_p2d(L) + T_prefill(min(β, L))
+```
+
+*where T_p2d is the P→D transfer time (about L · 4 ns/token under RDMA) and T_prefill is the prefill time (at a prefill throughput of about 50K tokens/s on H100 TP1 with Qwen3-30B). When β ≪ L, the cost is dominated by the single P→D transfer.*
+
+**Compared with the baseline** (no D→P sync): `reseed_cost = T_p2d(L) + T_prefill(L − seed_size)`, where re-prefill dominates.
+
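Review note: to compare the candidate bound against the baseline for concrete workloads, both expressions can be evaluated directly. The sketch below assumes linear cost models, `T_p2d(L) = L * p2d_s_per_token` and `T_prefill(n) = n / prefill_tokens_per_s`, which is how the per-token and tokens-per-second figures in the text read; the function and parameter names are hypothetical, and the rates are left to the caller and should come from measurement.

```python
def reseed_cost_bound(length: int, beta: int,
                      p2d_s_per_token: float, prefill_tokens_per_s: float) -> float:
    """Candidate Theorem 4 bound: T_p2d(L) + T_prefill(min(beta, L)), in seconds."""
    return length * p2d_s_per_token + min(beta, length) / prefill_tokens_per_s


def reseed_cost_baseline(length: int, seed_size: int,
                         p2d_s_per_token: float, prefill_tokens_per_s: float) -> float:
    """Baseline without D->P sync: T_p2d(L) + T_prefill(L - seed_size), in seconds."""
    reprefill_tokens = max(length - seed_size, 0)
    return length * p2d_s_per_token + reprefill_tokens / prefill_tokens_per_s
```

Both functions share the transfer term, so the difference between them is exactly the prefill work saved: `(L - seed_size) - min(β, L)` tokens at the measured prefill rate.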
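+### 3.2 Relationship to Theorem 2
+
+Theorem 2 guarantees only the fast hit on the direct-to-D path. Theorem 4 additionally pushes the "fallback cost on a fast-path miss" down to the sub-second range, which makes it possible for KVC to beat DP at **every quantile**.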
+
+---
+
+## 4. Four-Phase Rollout
+
+| Phase | Scope | GPU requirement | Acceptance criteria |
+|---|---|---|---|
+| **P1** | block-level eviction refactor ([BLOCK_LEVEL_EVICTION_DESIGN_ZH.md](BLOCK_LEVEL_EVICTION_DESIGN_ZH.md)) | 4×H100 smoke | average size of a single evict ≤ 500 tokens |
+| **P2** | mooncake dual-role support + microbench (D→P single-packet RTT, bandwidth utilization) | single node + RDMA | D→P RTT < 50 ms (local); bandwidth for a single 16K-token block ≥ 50% of the theoretical maximum |
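+| **P3** | SGLang `insert_external` + agentic-pd-hybrid hook (best-effort only, no reseed probe) | 4×H100 + RDMA | more than 80% of syncs are triggered and complete within the same turn; no new failure mode is introduced |
+| **P4** | reseed probe wired in + end-to-end evaluation | 4×H100 + RDMA | single reseed < 0.5 s (vs. 3–7 s today), TTFT p99 < 0.5 s |
+
+**Key decision point**: an audit is required between P1 and P2 to confirm that SGLang radix `insert_external` will not conflict with the streaming-session decode path. If a serious conflict is found, introduce a "P-only sync mode" as a placeholder and open up the full path once the architecture has stabilized.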
+
+---
+
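+## 5. Risks and Mitigations
+
+| Risk | Impact | Mitigation |
+|---|---|---|
+| Dual-role mooncake breaks the existing one-way P→D path | E2 already exposed cascading mooncake "instance not alive" failures; adding another channel may amplify them | In P2, start with an independent bootstrap channel + feature flag; keep the disabled path available |
+| D→P sync consumes D's egress bandwidth and affects direct-to-D append-prefill latency | Directly degrades the main path | Run sync on a low-priority QP (RDMA SL=0), trigger it in batches, at most once per turn |
+| The P-side radix fills up with backup and evicts locally produced KV | P-side prefill slows down | Sync inserts are not pinned (§2.2), so LRU competition stays fair |
+| Coordinating multiple backup views across multiple P instances is complex | The router must account for staleness when choosing target_p | Start with the `last_p` policy (recency-biased), then decide whether to adopt `broadcast` after observing the measured distribution |
+| `insert_external` drifts from the upstream API across SGLang patch upgrades | Maintenance burden | Confine the API to our vendor patch boundary (do not pollute upstream radix) and write a contract test |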
+
+---
+
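+## 6. Decoupling from block-level eviction
+
+| Work item | Depends on the other? |
+|---|---|
+| block-level eviction | Does not depend on D→P sync and can ship independently. On its own it reduces reseed frequency |
+| D→P sync | **Depends** on block-level eviction: requires the P-side radix to be the source of truth for streaming-session KV |
+| Both together | Largest gain: reseed frequency drops by an order of magnitude and per-reseed time drops by an order of magnitude |
+
+→ Rollout order: block-level eviction lands first; D→P sync then proceeds on `feat/d-to-p-sync`. The two should **not** be combined in one PR.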
+
+---
+
+## 7. Minimal actions for the next agent
+
+1. Land this document on the `feat/d-to-p-sync` branch.
+2. Once block-level eviction is in main, start P2: mooncake dual-role support + microbench (unit tests only, no coupling to the SGLang main path).
+3. In P3, add `insert_external` and the hooks; merge to main disabled by default.
+4. After the P4 end-to-end evaluation, decide the reseed probe policy (`last_p` vs `broadcast`).
+
+---
+
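+**Key takeaway**: D→P incremental sync is not as simple as "adding one more network channel"; the crux is extending the P-side radix from a single producer to one that accepts best-effort external input. Block-level eviction is the precondition for this, so the two pieces of work can go one after the other, but not in the reverse order.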