docs(d2p): SnapshotStore refactor design — dedicated GPU buffer

Captures the architectural fix for the P-side alloc-failed problem
that killed every D→P sync attempt in E4-v4/v5. Designs a dedicated
GPU snapshot_buf with a slab allocator, decoupling reception from
kv_pool, and defers kv_pool alloc to finalize_ingest time when the
snapshot bytes are already in hand. ~365 LOC across controller,
io_struct, agentic. Smoke + E4-v6 expected to show first non-zero
D→P OK rate.
Claude Code Agent
2026-05-13 14:14:00 +08:00
parent f926a7b87d
commit 6be5f9b57e


@@ -0,0 +1,174 @@
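# SnapshotStore refactor (resolving the P-side alloc-failed dead end)
**Date**: 2026-05-13
**Status**: design phase; implementation starting
**Root cause**: `docs/E4_VS_E1_RESULTS_ZH.md` §3 and the E4-v4/v5 forensics show that 0 of 167 D→P sync attempts succeeded. The large majority failed because `prepare_receive` tries to obtain N slots from `token_to_kv_pool_allocator.alloc(N)` while P's pool is already filled by its own prefill work.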
---
## 0. TL;DR
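- Today, P-side `prepare_receive` calls `token_to_kv_pool_allocator.alloc(N)` to grab kv_pool slots, competing directly with P's own prefill work → alloc-failed about 89% of the time (148 of 167 attempts in E4-v5)
- Direction of the refactor: **P receives snapshots into a dedicated GPU buffer**, decoupled from kv_pool
- Snapshot bytes are copied into kv_pool slots only at finalize_ingest, at which point P can wait for a better moment
- ~250 LOC of new code, mostly in `disaggregation/snapshot/controller.py` (~365 LOC in total across all steps in §5)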
---
## 1. The dead end in the current implementation
```
prepare_receive(sid, num_tokens=50000):
    indices = self.token_to_kv_pool_allocator.alloc(50000)
    if indices is None:
        return ok=False, reason="alloc-failed"   ← taken ~89% of the time in E4-v5
    return slot_indices = indices.tolist()
```
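`alloc(50000)` must find 50,000 contiguous free slots in P's pool. While P is prefilling its own requests (its normal state), most slots in the pool are locked, so 50K free slots cannot be found and the call fails.
Breakdown of the 167 sync attempts in E4-v5:
- 148 alloc-failed (**88.6%**)
- 19 session-not-resident (already evicted on the D side)
- 0 OK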
---
## 2. New design: PrefillSnapshotStore side table
```
┌─────────────────────────────────────────────────────────────────┐
│ P worker scheduler │
│ │
│ kv_pool (existing, owned by P's prefill work) │
│ ┌────────────────────────────────────────────────┐ │
│ │ k_buffer[0..L]: (max_tokens, head, dim) │ │
│ │ v_buffer[0..L]: (max_tokens, head, dim) │ │
│ └────────────────────────────────────────────────┘ │
│ │
│ snapshot_buf (NEW, dedicated for D→P snapshot reception) │
│ ┌────────────────────────────────────────────────┐ │
│ │ pinned GPU tensor of size SNAPSHOT_BUF_BYTES │ │
│ │ (default 8 GB) │ │
│ │ • registered with mooncake (one-time at init) │ │
│ │ • slab-allocator manages free space │ │
│ └────────────────────────────────────────────────┘ │
└─────────────────────────────────────────────────────────────────┘
Flow:
  1. prepare_receive(sid, N):
       slab = snapshot_buf_allocator.alloc(N * per_token_bytes_total)
       if slab is None:
           return ok=False, reason=snapshot-buf-full
       record = (sid, slab_offset, N)
       return (snapshot_buf_base + slab_offset for K_L, V_L per layer)
       ← never blocks on kv_pool
  2. (out-of-band) D pushes KV bytes into the slab via mooncake RDMA
  3. finalize_ingest(sid, token_ids):
       record = pop ingest_record[sid]
       slots = token_to_kv_pool_allocator.alloc(N)   ← can fail here
       if alloc-failed:
           snapshot_buf_allocator.free(record.slab)
           return ok=False, reason=alloc-failed-on-finalize
       # copy snapshot_buf[layer L][token range] → kv_pool.k_buffer[L][slots]
       for L in range(layer_num):
           kv_pool.k_buffer[L][slots] = snapshot_buf[K_L_offset : K_L_offset + N * K_stride].view(N, head, dim)
           kv_pool.v_buffer[L][slots] = snapshot_buf[V_L_offset : V_L_offset + N * V_stride].view(N, head, dim)
       tree_cache.insert(InsertParams(key=token_ids, value=slots))
       snapshot_buf_allocator.free(record.slab)
       return ok=True
```
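The flow above can be sketched in plain Python. This is a toy model and not the SGLang implementation: `ToyKvPool`, `ToySnapshotStore`, the bump-pointer slab, and the "one byte per token per layer, K only" layout are all assumptions made to keep the example self-contained, and the RDMA push is simulated by writing into a `bytearray`.

```python
from dataclasses import dataclass


class ToyKvPool:
    """Stand-in for kv_pool: a fixed set of token slots per layer (hypothetical)."""

    def __init__(self, num_slots: int, layer_num: int):
        self.free_slots = list(range(num_slots))
        self.k_buffer = [[None] * num_slots for _ in range(layer_num)]

    def alloc(self, n: int):
        if n > len(self.free_slots):
            return None  # same contract as in the doc: None means alloc-failed
        slots, self.free_slots = self.free_slots[:n], self.free_slots[n:]
        return slots


@dataclass
class IngestRecord:
    sid: str
    offset: int
    num_tokens: int


class ToySnapshotStore:
    """Receives snapshots into a dedicated buffer; touches kv_pool only at finalize."""

    def __init__(self, kv_pool: ToyKvPool, layer_num: int, buf_bytes: int):
        self.kv_pool = kv_pool
        self.layer_num = layer_num
        self.buf = bytearray(buf_bytes)  # stands in for the GPU snapshot_buf
        self.used = 0                    # naive bump pointer
        self.records = {}

    def prepare_receive(self, sid: str, n: int) -> dict:
        need = n * self.layer_num  # assumption: 1 byte per token per layer
        if self.used + need > len(self.buf):
            return {"ok": False, "reason": "snapshot-buf-full"}
        self.records[sid] = IngestRecord(sid, self.used, n)
        self.used += need
        return {"ok": True, "offset": self.records[sid].offset}

    def finalize_ingest(self, sid: str, token_ids: list) -> dict:
        rec = self.records.pop(sid)
        slots = self.kv_pool.alloc(rec.num_tokens)  # the only kv_pool allocation
        if slots is not None:
            for layer in range(self.layer_num):
                start = rec.offset + layer * rec.num_tokens
                for i, slot in enumerate(slots):
                    self.kv_pool.k_buffer[layer][slot] = self.buf[start + i]
            # tree_cache.insert(token_ids, slots) would go here; omitted in this toy
        if not self.records:
            self.used = 0  # naive reclamation: rewind once no snapshot is pending
        if slots is None:
            return {"ok": False, "reason": "alloc-failed-on-finalize"}
        return {"ok": True, "slots": slots}
```

With a pool that has only one free slot, `prepare_receive("s1", 2)` still succeeds in this sketch, and the shortage only surfaces as `alloc-failed-on-finalize`, which is the behavioural change the design is after.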
---
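## 3. Key design choices
| Decision | Choice | Rationale |
|---|---|---|
| Where the snapshot buffer lives | GPU memory | Symmetric with the D side of the RDMA transfer (D's KV is also on the GPU); avoids host↔device copies |
| Default size | **8 GB** | On Qwen3-30B, the KV of one ~50K-token session is ~5 GB; 8 GB holds at least one session plus some spare room |
| Allocation granularity | One contiguous slab holding all KV of a session | Simplifies the slab allocator and allows a single batch transfer |
| Layout | K for all layers concatenated, then V for all layers concatenated | Aligns with mooncake's batch_transfer interface |
| Free policy | Free immediately after finalize | Once the snapshot has been ingested into kv_pool, the snapshot_buf copy is no longer needed |
| When the buffer is full | prepare_receive returns ok=False, reason=snapshot-buf-full | Lets the caller fall back to re-prefill |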
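The ~5 GB figure behind the 8 GB default can be sanity-checked with a short calculation. The model shape used here (48 layers, 4 KV heads, head dimension 128, 2-byte elements) is an assumption about the Qwen3-30B configuration and is not stated in this doc; substitute the values from the model config actually deployed.

```python
def kv_bytes_per_token(layer_num: int, kv_heads: int, head_dim: int, dtype_bytes: int) -> int:
    # K and V each store kv_heads * head_dim elements per layer.
    return 2 * layer_num * kv_heads * head_dim * dtype_bytes


def session_kv_bytes(num_tokens: int, **shape) -> int:
    return num_tokens * kv_bytes_per_token(**shape)


# Assumed shape, not taken from this doc.
ASSUMED_SHAPE = dict(layer_num=48, kv_heads=4, head_dim=128, dtype_bytes=2)

per_token = kv_bytes_per_token(**ASSUMED_SHAPE)      # 98_304 bytes (96 KiB)
session = session_kv_bytes(50_000, **ASSUMED_SHAPE)  # 4_915_200_000 bytes (~4.9 GB)
spare = 8 * 10**9 - session                          # room left in an 8 GB buffer
```

Under these assumptions one 50K-token session fits with about 3 GB to spare, but two such sessions do not, which matches the "at least one plus some spare room" wording in the table.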
---
## 4. Interface changes
### 4.1 SnapshotPrepareReceiveReqOutput
Old:
```
k_base_ptrs: List[int]      # k_buffer.data_ptr() of each layer
v_base_ptrs: List[int]
slot_indices: List[int]     # slots allocated in kv_pool
stride_k_bytes / stride_v_bytes
```
New:
```
snapshot_buf_base_ptr: int     # snapshot_buf.data_ptr()
k_layer_offsets: List[int]     # byte offset of each layer's K within snapshot_buf
v_layer_offsets: List[int]     # byte offset of each layer's V
num_tokens: int
stride_k_bytes / stride_v_bytes
slab_handle: int # opaque handle for finalize/abort
```
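A minimal sketch of how `k_layer_offsets` and `v_layer_offsets` might be derived for the "K for all layers, then V for all layers" layout. `compute_layer_offsets` and its signature are hypothetical (the doc names `_compute_layer_offsets` but does not specify it), and strides are taken to mean bytes per token per layer.

```python
from typing import List, Tuple


def compute_layer_offsets(
    slab_offset: int,
    num_tokens: int,
    layer_num: int,
    stride_k_bytes: int,
    stride_v_bytes: int,
) -> Tuple[List[int], List[int], int]:
    """Return (k_layer_offsets, v_layer_offsets, total_bytes) for one slab."""
    k_region = num_tokens * stride_k_bytes  # bytes of K for one layer
    v_region = num_tokens * stride_v_bytes  # bytes of V for one layer
    k_offsets = [slab_offset + layer * k_region for layer in range(layer_num)]
    v_base = slab_offset + layer_num * k_region  # V starts after all K layers
    v_offsets = [v_base + layer * v_region for layer in range(layer_num)]
    total_bytes = layer_num * (k_region + v_region)
    return k_offsets, v_offsets, total_bytes
```

The returned `total_bytes` is the size that would be requested from the slab allocator, so the same helper can serve both the allocation request and the response fields.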
### 4.2 SnapshotFinalizeIngestReqInput
Old:
```
session_id, token_ids, slot_indices
```
New:
```
session_id, token_ids, slab_handle   # P uses the handle to look up the record, then allocs in kv_pool + copies + inserts
```
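### 4.3 D-side push logic (agentic)
Old: D computes the src_slot[L] → dst_slot[L] mapping, then calls batch_transfer.
New: D computes the mapping from src_slot[L] to k_layer_offsets[L] / v_layer_offsets[L] in snapshot_buf, then calls batch_transfer. Destination slot indices are no longer needed at all.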
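As an illustration of the new mapping, the sketch below turns D-side source slots into three parallel lists (source addresses, destination addresses, lengths), which is the general shape a batched transfer call takes. `contiguous_runs` and `build_transfer_descriptors` are hypothetical helpers, coalescing adjacent source slots into one descriptor is an optional optimisation added here, and the actual mooncake call is deliberately not shown.

```python
from typing import List, Tuple


def contiguous_runs(slots: List[int]) -> List[Tuple[int, int, int]]:
    """Group source slots into (token_index, first_slot, run_length) runs."""
    runs = []
    i = 0
    while i < len(slots):
        j = i
        while j + 1 < len(slots) and slots[j + 1] == slots[j] + 1:
            j += 1
        runs.append((i, slots[i], j - i + 1))
        i = j + 1
    return runs


def build_transfer_descriptors(
    src_layer_ptrs: List[int],     # per-layer base address of D's K (or V) buffer
    src_slots: List[int],          # D-side kv_pool slots of the session, in token order
    dst_base_ptr: int,             # snapshot_buf_base_ptr from prepare_receive
    dst_layer_offsets: List[int],  # k_layer_offsets (or v_layer_offsets)
    stride_bytes: int,             # bytes per token per layer
) -> Tuple[List[int], List[int], List[int]]:
    src_addrs, dst_addrs, lengths = [], [], []
    runs = contiguous_runs(src_slots)
    for src_ptr, dst_off in zip(src_layer_ptrs, dst_layer_offsets):
        for token_index, first_slot, run_len in runs:
            src_addrs.append(src_ptr + first_slot * stride_bytes)
            # Destination is dense: token i of the session lands at index i.
            dst_addrs.append(dst_base_ptr + dst_off + token_index * stride_bytes)
            lengths.append(run_len * stride_bytes)
    return src_addrs, dst_addrs, lengths
```

The helper would be called once for K and once for V. Because the destination is addressed purely by token position inside the slab, nothing in it depends on P's kv_pool state.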
---
## 5. Implementation steps
| # | Step | Est. LOC |
|---|---|---:|
| 1 | `SnapshotBufAllocator` (slab/bump allocator) | 80 |
| 2 | Allocate and register snapshot_buf in `SnapshotLinkController.__init__` | 30 |
| 3 | Rewrite `prepare_receive`; add `_compute_layer_offsets` | 60 |
| 4 | Add `finalize_with_snapshot_buf`; remove the old `finalize_ingest` | 70 |
| 5 | Update io_struct fields; remove the old fields | 30 |
| 6 | Update agentic `_attempt_d_to_p_sync` to use the new fields | 40 |
| 7 | Include snapshot_buf in the mem leak check | 5 |
| 8 | Unit smoke test | 50 |
Total: ~365 LOC
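For step 1, one possible shape of the allocator is a first-fit free list with coalescing on free. This is a sketch under assumptions rather than the planned implementation: the method signatures, the integer handle scheme, and the 256-byte alignment are all illustrative choices not specified in this doc.

```python
from typing import Dict, List, Optional, Tuple


class SnapshotBufAllocator:
    """First-fit allocator over a byte range [0, capacity)."""

    def __init__(self, capacity: int, alignment: int = 256):
        self.capacity = capacity
        self.alignment = alignment
        self.free_list: List[Tuple[int, int]] = [(0, capacity)]  # sorted (offset, size)
        self.live: Dict[int, Tuple[int, int]] = {}               # handle -> (offset, size)
        self.next_handle = 1

    def alloc(self, nbytes: int) -> Optional[Tuple[int, int]]:
        """Return (slab_handle, offset), or None when no free range is large enough."""
        if nbytes <= 0:
            raise ValueError("nbytes must be positive")
        size = -(-nbytes // self.alignment) * self.alignment  # round up to alignment
        for i, (off, free) in enumerate(self.free_list):
            if free >= size:
                if free == size:
                    del self.free_list[i]
                else:
                    self.free_list[i] = (off + size, free - size)
                handle = self.next_handle
                self.next_handle += 1
                self.live[handle] = (off, size)
                return handle, off
        return None  # caller reports reason=snapshot-buf-full

    def free(self, handle: int) -> None:
        off, size = self.live.pop(handle)
        self.free_list.append((off, size))
        self.free_list.sort()
        merged = [self.free_list[0]]
        for o, s in self.free_list[1:]:
            prev_off, prev_size = merged[-1]
            if prev_off + prev_size == o:  # adjacent ranges: coalesce
                merged[-1] = (prev_off, prev_size + s)
            else:
                merged.append((o, s))
        self.free_list = merged
```

Since each session takes one contiguous slab and slabs are freed right after finalize, the free list stays short, so a linear first-fit scan is likely sufficient at this scale.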
---
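## 6. Risks
| Risk | Mitigation |
|---|---|
| 8 GB of GPU memory | User-configurable; mem-fraction-static already leaves headroom |
| Multiple sessions contending for snapshot_buf | Slab allocator plus LRU eviction of old snapshots |
| GPU→GPU copy performance | ~5 GB @ 3 TB/s ≈ 1.7 ms; negligible |
| Large interface change affecting smoke | Land all interface changes within one commit and update smoke at the same time |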
---
## 7. Acceptance criteria
- [ ] `scripts/smoke_snapshot_sglang_integration.py` passes against the new interface; prepare_receive no longer returns alloc-failed
- [ ] E4-v6, replaying the same trace, shows OK events ≥ 30% in d-to-p-sync.jsonl (vs. 0% today)
---
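**Key point**: Receiving D-side pushes into a dedicated snapshot_buf on the GPU removes the fundamental allocation conflict of competing for P's kv_pool and defers the alloc decision to finalize time, giving D→P sync a real chance to succeed.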