docs(d2p): SnapshotStore refactor design — dedicated GPU buffer

Captures the architectural fix for the P-side alloc-failed problem that killed every D→P sync attempt in E4-v4/v5. Designs a dedicated GPU snapshot_buf with a slab allocator, decoupling reception from kv_pool, and defers kv_pool alloc to finalize_ingest time when the snapshot bytes are already in hand. ~365 LOC across controller, io_struct, agentic. Smoke + E4-v6 expected to show first non-zero D→P OK rate.
2026-05-13 14:14:00 +08:00
parent f926a7b87d
commit 6be5f9b57e
1 changed files with 174 additions and 0 deletions
--- a/docs/SNAPSHOT_STORE_REFACTOR_ZH.md
+++ b/docs/SNAPSHOT_STORE_REFACTOR_ZH.md
@@ -0,0 +1,174 @@
+# SnapshotStore refactor (breaking the P-side alloc-failed deadlock)
+
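+**Date**: 2026-05-13
+**Status**: design phase, implementation starting
+**Root cause**: `docs/E4_VS_E1_RESULTS_ZH.md` §3 and the E4-v4/v5 forensics show 0 OK out of 167 D→P sync attempts, 148 of them failing because `prepare_receive` tries to take N slots via `token_to_kv_pool_allocator.alloc(N)` while P's pool is filled by its own prefill work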
+
+---
+
+## 0. TL;DR
+
+- The current P-side `prepare_receive` grabs kv_pool slots via `token_to_kv_pool_allocator.alloc(N)`, competing directly with P's own prefill work, so most attempts end in alloc-failed (88% in E4-v5)
+- Refactor direction: **P receives snapshots into a dedicated GPU buffer**, decoupled from kv_pool
+- Snapshot bytes are copied into kv_pool slots only at finalize_ingest, which can wait for a better moment
+- ~250 LOC of new code, mostly in `disaggregation/snapshot/controller.py`
+
+---
+
+## 1. The deadlock in the current implementation
+
+```
+prepare_receive(sid, num_tokens=50000):
+    indices = self.token_to_kv_pool_allocator.alloc(50000)
+    if indices is None:
+        return ok=False, reason="alloc-failed"   ← taken in 88% of E4-v5 attempts
+    return slot_indices = indices.tolist()
+```
+
+`alloc(50000)` looks for 50000 contiguous free slots in P's pool. While P is prefilling its own requests (its normal state), most slots in the pool are locked, so 50K free slots cannot be found and the call fails.
+
+E4-v5 breakdown of the 167 sync attempts:
+- 148 alloc-failed (**88%**)
+- 19 session-not-resident (already evicted on D)
+- 0 OK
+
+---
+
+## 2. New design: PrefillSnapshotStore side table
+
+```
+   ┌─────────────────────────────────────────────────────────────────┐
+   │ P worker scheduler                                               │
+   │                                                                  │
+   │  kv_pool (existing, owned by P's prefill work)                  │
+   │  ┌────────────────────────────────────────────────┐             │
+   │  │ k_buffer[0..L]: (max_tokens, head, dim)        │             │
+   │  │ v_buffer[0..L]: (max_tokens, head, dim)        │             │
+   │  └────────────────────────────────────────────────┘             │
+   │                                                                  │
+   │  snapshot_buf (NEW, dedicated for D→P snapshot reception)       │
+   │  ┌────────────────────────────────────────────────┐             │
+   │  │ pinned GPU tensor of size SNAPSHOT_BUF_BYTES   │             │
+   │  │ (default 8 GB)                                  │             │
+   │  │ • registered with mooncake (one-time at init)  │             │
+   │  │ • slab-allocator manages free space             │             │
+   │  └────────────────────────────────────────────────┘             │
+   └─────────────────────────────────────────────────────────────────┘
+
+Flow:
+  1. prepare_receive(sid, N):
+       slab = snapshot_buf_allocator.alloc(N * per_token_bytes_total)
+       record = (sid, slab_offset, N)
+       return (snapshot_buf_base + slab_offset for K_L, V_L per layer)
+       ← never blocks on kv_pool
+
+  2. (out-of-band) D pushes KV bytes into the slab via mooncake RDMA
+
+  3. finalize_ingest(sid, token_ids):
+       record = pop ingest_record[sid]
+       slots = token_to_kv_pool_allocator.alloc(N)  ← can fail here
+       if alloc-failed:
+           snapshot_buf_allocator.free(record.slab)
+           return ok=False, reason=alloc-failed-on-finalize
+       # copy snapshot_buf[layer L][token range] → kv_pool.k_buffer[L][slots]
+       for L in range(layer_num):
+           kv_pool.k_buffer[L][slots] = snapshot_buf[K_L_offset : K_L_offset + N * K_stride].view(N, head, dim)
+           kv_pool.v_buffer[L][slots] = snapshot_buf[V_L_offset : V_L_offset + N * V_stride].view(N, head, dim)
+       tree_cache.insert(InsertParams(key=token_ids, value=slots))
+       snapshot_buf_allocator.free(record.slab)
+       return ok=True
+```
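+
+The copy loop in step 3 can be illustrated without a GPU. The following is a minimal sketch, assuming `bytearray` objects stand in for `snapshot_buf` and for one layer of `kv_pool`. The helper name `scatter_layer` is hypothetical, and the real implementation would presumably use indexed tensor assignment on device, as in the pseudocode above.
+
+```python
+from typing import List
+
+
+def scatter_layer(
+    snapshot_buf: bytearray,
+    layer_offset: int,
+    pool_layer: bytearray,
+    slots: List[int],
+    stride_bytes: int,
+) -> None:
+    """Copy token i of one layer from snapshot_buf into pool slot slots[i].
+
+    Hypothetical stand-in for the per-layer copy in finalize_ingest:
+    tokens are packed back to back starting at layer_offset, while the
+    destination slots may be scattered anywhere in the pool.
+    """
+    src = memoryview(snapshot_buf)
+    dst = memoryview(pool_layer)
+    for i, slot in enumerate(slots):
+        s = layer_offset + i * stride_bytes  # packed position in the slab
+        d = slot * stride_bytes              # scattered position in the pool
+        dst[d : d + stride_bytes] = src[s : s + stride_bytes]
+
+
+# Three 2-byte tokens packed after a 2-byte prefix, scattered to slots 3, 0, 4.
+buf = bytearray(b"xxAABBCC")
+pool = bytearray(10)
+scatter_layer(buf, layer_offset=2, pool_layer=pool, slots=[3, 0, 4], stride_bytes=2)
+```
+
+Because tokens sit back to back in the slab but may land in arbitrary kv_pool slots, the copy is a scatter rather than a single contiguous memcpy.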
+
+---
+
+## 3. Key design choices
+
+| Decision | Choice | Rationale |
+|---|---|---|
+| Where the snapshot buffer lives | GPU memory | Symmetric with the RDMA source on D (D's KV is also on GPU); avoids host↔device copies |
+| Default size | **8 GB** | For Qwen3-30B, the KV of one ~50K-token session is ~5 GB; 8 GB holds at least one session plus partial headroom |
+| Allocation granularity | One contiguous slab holding a session's entire KV | Simplifies the slab allocator and allows a single batch transfer |
+| Layout | K-all-layers concat, then V-all-layers concat | Aligns with mooncake's batch_transfer interface |
+| Free policy | Free immediately after finalize | Once the snapshot is ingested into kv_pool, the snapshot_buf copy is no longer needed |
+| When the buffer is full | prepare_receive returns ok=False, reason=snapshot-buf-full | Lets the caller fall back to re-prefill |
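+
+The layout and sizing rows can be made concrete with a small calculation. This is a sketch under stated assumptions: the function `compute_layer_offsets` and its signature are hypothetical (the planned `_compute_layer_offsets` is not specified here), and the model shape used in the example (48 layers, 4 KV heads, head dim 128, 2-byte elements) is an illustrative assumption chosen to land near the ~5 GB figure.
+
+```python
+from typing import List, Tuple
+
+
+def compute_layer_offsets(
+    slab_offset: int,
+    num_tokens: int,
+    layer_num: int,
+    stride_k_bytes: int,
+    stride_v_bytes: int,
+) -> Tuple[List[int], List[int], int]:
+    """Return (k_layer_offsets, v_layer_offsets, total_bytes) for one session.
+
+    Layout: K of all layers concatenated first, then V of all layers.
+    Offsets are byte offsets relative to the start of snapshot_buf.
+    """
+    k_layer_bytes = num_tokens * stride_k_bytes
+    v_layer_bytes = num_tokens * stride_v_bytes
+    k_offsets = [slab_offset + layer * k_layer_bytes for layer in range(layer_num)]
+    v_start = slab_offset + layer_num * k_layer_bytes
+    v_offsets = [v_start + layer * v_layer_bytes for layer in range(layer_num)]
+    total_bytes = layer_num * (k_layer_bytes + v_layer_bytes)
+    return k_offsets, v_offsets, total_bytes
+
+
+# Assumed shape: 4 KV heads * 128 head dim * 2 bytes = 1024 bytes per token per layer.
+stride = 4 * 128 * 2
+k_offs, v_offs, total = compute_layer_offsets(0, 50_000, 48, stride, stride)
+print(f"{total / 1e9:.2f} GB for one 50K-token session")
+```
+
+Under these assumed dimensions a single session takes roughly 4.9 GB, so an 8 GB buffer cannot hold two such sessions at once.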
+
+---
+
+## 4. Interface changes
+
+### 4.1 SnapshotPrepareReceiveReqOutput
+
+Old:
+```
+k_base_ptrs: List[int]   # k_buffer.data_ptr() of each layer
+v_base_ptrs: List[int]
+slot_indices: List[int]  # slots allocated in kv_pool
+stride_k_bytes / stride_v_bytes
+```
+
+New:
+```
+snapshot_buf_base_ptr: int  # snapshot_buf.data_ptr()
+k_layer_offsets: List[int]  # byte offset of each layer's K within snapshot_buf
+v_layer_offsets: List[int]  # byte offset of each layer's V within snapshot_buf
+num_tokens: int
+stride_k_bytes / stride_v_bytes
+slab_handle: int            # opaque handle for finalize/abort
+```
+
+### 4.2 SnapshotFinalizeIngestReqInput
+
+Old:
+```
+session_id, token_ids, slot_indices
+```
+
+New:
+```
+session_id, token_ids, slab_handle   # P looks up the record by handle, then allocs from kv_pool + copies + inserts
+```
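+
+As a rough illustration, the two new message shapes could be written as dataclasses. This sketch includes only the fields listed above; the field types for `session_id` and the strides, and the helper methods `k_layer_ptrs` / `v_layer_ptrs`, are assumptions added for illustration and are not part of the proposed io_struct.
+
+```python
+from dataclasses import dataclass
+from typing import List
+
+
+@dataclass
+class SnapshotPrepareReceiveReqOutput:
+    snapshot_buf_base_ptr: int   # snapshot_buf.data_ptr()
+    k_layer_offsets: List[int]   # byte offset of each layer's K in snapshot_buf
+    v_layer_offsets: List[int]   # byte offset of each layer's V in snapshot_buf
+    num_tokens: int
+    stride_k_bytes: int          # assumed to be an int byte count
+    stride_v_bytes: int
+    slab_handle: int             # opaque handle for finalize/abort
+
+    def k_layer_ptrs(self) -> List[int]:
+        """Absolute destination address of each layer's K region (hypothetical helper)."""
+        return [self.snapshot_buf_base_ptr + off for off in self.k_layer_offsets]
+
+    def v_layer_ptrs(self) -> List[int]:
+        """Absolute destination address of each layer's V region (hypothetical helper)."""
+        return [self.snapshot_buf_base_ptr + off for off in self.v_layer_offsets]
+
+
+@dataclass
+class SnapshotFinalizeIngestReqInput:
+    session_id: str              # type assumed
+    token_ids: List[int]
+    slab_handle: int             # replaces the old slot_indices field
+
+
+out = SnapshotPrepareReceiveReqOutput(
+    snapshot_buf_base_ptr=0x1000,
+    k_layer_offsets=[0, 64],
+    v_layer_offsets=[128, 192],
+    num_tokens=4,
+    stride_k_bytes=16,
+    stride_v_bytes=16,
+    slab_handle=7,
+)
+req = SnapshotFinalizeIngestReqInput("sess-1", [1, 2, 3, 4], out.slab_handle)
+```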
+
+### 4.3 D-side push logic (agentic)
+
+Old: D computes the src_slot[L] → dst_slot[L] mapping, then batch_transfer
+
+New: D computes the mapping from src_slot[L] to k_layer_offsets[L] / v_layer_offsets[L] in snapshot_buf, then batch_transfer. No dst slot indices are needed at all.
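+
+The new mapping for a single layer can be sketched as address arithmetic. This is an illustration only: `build_layer_transfers` is a hypothetical helper, it produces plain `(src_addr, dst_addr, length)` tuples rather than calling any mooncake API, and the merging of adjacent runs is an optional optimization assumed here, not something the design requires.
+
+```python
+from typing import List, Tuple
+
+
+def build_layer_transfers(
+    src_base_ptr: int,
+    src_slots: List[int],
+    dst_base_ptr: int,
+    dst_layer_offset: int,
+    stride_bytes: int,
+) -> List[Tuple[int, int, int]]:
+    """Return (src_addr, dst_addr, length) runs for one layer's K or V.
+
+    Token i lives in D's slot src_slots[i] and lands at position i of the
+    layer's region in snapshot_buf. Runs that are contiguous on both sides
+    are merged into one larger transfer.
+    """
+    transfers: List[Tuple[int, int, int]] = []
+    for i, slot in enumerate(src_slots):
+        src = src_base_ptr + slot * stride_bytes
+        dst = dst_base_ptr + dst_layer_offset + i * stride_bytes
+        if transfers:
+            p_src, p_dst, p_len = transfers[-1]
+            if p_src + p_len == src and p_dst + p_len == dst:
+                transfers[-1] = (p_src, p_dst, p_len + stride_bytes)
+                continue
+        transfers.append((src, dst, stride_bytes))
+    return transfers
+
+
+# Slots 5 and 6 are adjacent on D, so they collapse into one 20-byte transfer.
+runs = build_layer_transfers(1000, [5, 6, 9], 5000, 100, 10)
+```
+
+The destination side is always densely packed, so the number of transfers depends only on how fragmented the source slots are on D.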
+
+---
+
+## 5. Implementation steps
+
+| # | Step | Estimated LOC |
+|---|---|---:|
+| 1 | `SnapshotBufAllocator` class (slab/bump allocator) | 80 |
+| 2 | Allocate and register snapshot_buf in `SnapshotLinkController.__init__` | 30 |
+| 3 | Rewrite `prepare_receive`; add `_compute_layer_offsets` | 60 |
+| 4 | Add `finalize_with_snapshot_buf`; remove the old `finalize_ingest` | 70 |
+| 5 | Update io_struct fields; remove old fields | 30 |
+| 6 | Update agentic `_attempt_d_to_p_sync` to use the new fields | 40 |
+| 7 | Count snapshot_buf in the mem leak check | 5 |
+| 8 | Unit smoke test | 50 |
+
+Total: ~365 LOC
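+
+Step 1 is the largest single item, so a rough shape may help. The following is a minimal sketch of one possible approach, a first-fit free list over byte offsets with coalescing on free. The class name `SlabAllocatorSketch`, the 256-byte alignment, and the use of the offset as the handle are all assumptions; the planned `SnapshotBufAllocator` may differ.
+
+```python
+from typing import Dict, List, Optional, Tuple
+
+
+class SlabAllocatorSketch:
+    """First-fit allocator over the byte range [0, capacity). Hypothetical."""
+
+    def __init__(self, capacity: int, align: int = 256) -> None:
+        self.capacity = capacity
+        self.align = align
+        self.free_list: List[Tuple[int, int]] = [(0, capacity)]  # (offset, size)
+        self.live: Dict[int, int] = {}  # offset -> size of live slabs
+
+    def alloc(self, nbytes: int) -> Optional[int]:
+        """Return the slab offset, or None when no free range is large enough."""
+        size = -(-nbytes // self.align) * self.align  # round up to alignment
+        for i, (off, avail) in enumerate(self.free_list):
+            if avail >= size:
+                if avail == size:
+                    self.free_list.pop(i)
+                else:
+                    self.free_list[i] = (off + size, avail - size)
+                self.live[off] = size
+                return off
+        return None  # caller would report snapshot-buf-full
+
+    def free(self, offset: int) -> None:
+        """Release a slab and merge it with adjacent free ranges."""
+        size = self.live.pop(offset)
+        self.free_list.append((offset, size))
+        self.free_list.sort()
+        merged: List[Tuple[int, int]] = []
+        for off, sz in self.free_list:
+            if merged and merged[-1][0] + merged[-1][1] == off:
+                merged[-1] = (merged[-1][0], merged[-1][1] + sz)
+            else:
+                merged.append((off, sz))
+        self.free_list = merged
+
+
+allocator = SlabAllocatorSketch(capacity=1024, align=256)
+first = allocator.alloc(100)    # rounded up to 256 bytes
+second = allocator.alloc(300)   # rounded up to 512 bytes
+third = allocator.alloc(512)    # only 256 bytes remain, so this fails
+```
+
+With whole-session slabs and free-on-finalize, the number of live slabs stays small, which is why a simple linear free list is assumed to be sufficient here.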
+
+---
+
+## 6. Risks
+
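+| Risk | Mitigation |
+|---|---|
+| 8 GB of GPU memory | User-configurable; mem-fraction-static already leaves headroom |
+| Multiple sessions contending for snapshot_buf | Slab allocator + LRU eviction of old snapshots |
+| GPU→GPU copy performance | ~5 GB @ 3 TB/s = 1.7 ms, negligible |
+| Large interface change breaks smoke | Land all interface changes in one commit and update smoke alongside |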
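+
+The copy-time estimate in the table is a simple bandwidth division. The sketch below reproduces it and adds a second figure under the assumption that an on-device copy both reads and writes every byte, so it consumes twice the memory traffic. That doubling is an assumption of this sketch, not a measurement, and `copy_time_ms` is a hypothetical helper.
+
+```python
+def copy_time_ms(num_bytes: float, bandwidth_bytes_per_s: float) -> float:
+    """Idealized transfer time in milliseconds, ignoring launch overhead."""
+    return num_bytes / bandwidth_bytes_per_s * 1e3
+
+
+snapshot_bytes = 5e9   # ~5 GB snapshot, from the table
+bandwidth = 3e12       # 3 TB/s, from the table
+
+one_pass_ms = copy_time_ms(snapshot_bytes, bandwidth)
+# Assumption: read + write traffic counted separately doubles the bytes moved.
+two_pass_ms = copy_time_ms(2 * snapshot_bytes, bandwidth)
+print(f"{one_pass_ms:.2f} ms (one pass), {two_pass_ms:.2f} ms (read + write)")
+```
+
+Either way the copy stays in the low single-digit millisecond range under these idealized numbers.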
+
+---
+
+## 7. Acceptance criteria
+
+- [ ] `scripts/smoke_snapshot_sglang_integration.py` passes against the new interface (prepare_receive no longer returns alloc-failed)
+- [ ] E4-v6 replays the same trace and d-to-p-sync.jsonl shows OK events ≥ 30% (vs 0% today)
+
+---
+
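+**Key takeaway**: receiving D's pushes into a dedicated snapshot_buf on the GPU removes the fundamental alloc conflict of competing for P's kv_pool and defers the alloc decision to finalize time, giving D→P a real chance to succeed.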