Implements the design in docs/SNAPSHOT_STORE_REFACTOR_ZH.md to fix
the alloc-failed death loop that killed D→P in E4-v4/v5 (167 sync
attempts, 0 succeeded, because P's kv_pool was busy with its own prefill).

Mechanism change:
  OLD prepare_receive: token_to_kv_pool_allocator.alloc(N) (90%+ failure)
  NEW prepare_receive: SnapshotBufAllocator.alloc(slab_bytes) carves a
                       range from an 8 GB GPU buffer dedicated to
                       snapshot reception, decoupled from kv_pool
  OLD finalize_ingest: just radix.insert with pre-alloc'd slots
  NEW finalize_ingest: kv_pool.alloc NOW + GPU memcpy snapshot_buf →
                       k_buffer/v_buffer + radix.insert
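The point of the two-phase change above is that reception no longer competes with prefill for kv_pool slots. A toy sketch of that ordering, not the actual implementation: `ToyReceiver` and all of its fields are hypothetical, plain Python lists stand in for GPU buffers, staging uses a simple bump pointer instead of the real free-list, and returning False when kv_pool is full at ingest is an assumption.

```python
from typing import Dict, List, Optional


class ToyReceiver:
    """Hypothetical model of P's receive path; lists stand in for GPU memory."""

    def __init__(self, staging_slots: int, kv_slots: int) -> None:
        # Dedicated staging area (plays the role of snapshot_buf).
        self.staging: List[Optional[str]] = [None] * staging_slots
        self.staging_used = 0
        # Shared pool that prefill also allocates from (plays the role of kv_pool).
        self.kv_pool: List[Optional[str]] = [None] * kv_slots
        self.kv_free: List[int] = list(range(kv_slots))
        self.radix: Dict[str, List[int]] = {}

    def prepare_receive(self, num_tokens: int) -> Optional[int]:
        # Only touches the staging area, so a busy kv_pool cannot fail it.
        if self.staging_used + num_tokens > len(self.staging):
            return None
        start = self.staging_used
        self.staging_used += num_tokens
        return start

    def finalize_ingest(self, key: str, start: int, num_tokens: int) -> bool:
        # kv_pool slots are allocated only now, at ingest time.
        if len(self.kv_free) < num_tokens:
            return False
        slots = [self.kv_free.pop() for _ in range(num_tokens)]
        for i, slot in enumerate(slots):
            self.kv_pool[slot] = self.staging[start + i]  # staging -> pool copy
        self.radix[key] = slots  # stands in for radix.insert
        return True
```

In this model, `prepare_receive` succeeds even while every kv_pool slot is held by prefill; only `finalize_ingest` depends on pool availability.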

Wire schema changed (clean break, no back-compat):
  PrepareReceiveReqOutput  swaps k/v_base_ptrs + slot_indices for
                           snapshot_buf_base_ptr + k/v_layer_offsets +
                           num_tokens
  DumpReqInput             swaps target_k/v_base_ptrs +
                           target_slot_indices for
                           target_snapshot_buf_base +
                           target_k/v_layer_offsets
  FinalizeIngestReqInput   drops slot_indices (P resolves at ingest)

Controller adds:
  SnapshotBufAllocator: first-fit free-list with 4 KB alignment
  ingest_snapshot_into_kvpool: GPU→GPU copy + radix insert
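For readers unfamiliar with the technique, a first-fit free-list with 4 KB alignment can be sketched roughly as follows. This is a minimal illustration that tracks byte offsets only; the class name `FirstFitAllocator`, the `release` method, and the coalescing strategy are assumptions, not the code in this commit.

```python
from typing import List, Optional, Tuple

ALIGN = 4096  # 4 KB alignment, as stated in the commit message


def align_up(n: int, align: int = ALIGN) -> int:
    """Round n up to the next multiple of align."""
    return (n + align - 1) // align * align


class FirstFitAllocator:
    """Hypothetical stand-in for SnapshotBufAllocator (offsets only)."""

    def __init__(self, capacity: int) -> None:
        self.capacity = capacity
        # Free ranges as (offset, size), kept sorted by offset.
        self.free: List[Tuple[int, int]] = [(0, capacity)]

    def alloc(self, nbytes: int) -> Optional[int]:
        size = align_up(nbytes)
        # First fit: take the lowest-offset range that is large enough.
        for i, (off, avail) in enumerate(self.free):
            if avail >= size:
                if avail == size:
                    del self.free[i]
                else:
                    self.free[i] = (off + size, avail - size)
                return off
        return None  # exhausted; the caller decides how to react

    def release(self, off: int, nbytes: int) -> None:
        # Return the range, then merge neighbours that touch each other.
        self.free.append((off, align_up(nbytes)))
        self.free.sort()
        merged: List[Tuple[int, int]] = []
        for o, s in self.free:
            if merged and merged[-1][0] + merged[-1][1] == o:
                merged[-1] = (merged[-1][0], merged[-1][1] + s)
            else:
                merged.append((o, s))
        self.free = merged
```

Because every size is rounded up to 4 KB and the buffer starts at offset 0, every returned offset is 4 KB aligned.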

Buffer size is configurable via the SGLANG_SNAPSHOT_LINK_BUF_BYTES env
var (default 8 GB; scales down to 1 GB if allocation fails).
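One way the sizing fallback could look, as a sketch under stated assumptions: the helper `reserve_snapshot_buf` and the `try_alloc` callback are hypothetical, and halving on each failed attempt is a guess, since the message only says the size scales down to 1 GB.

```python
import os
from typing import Callable

GB = 1 << 30
ENV_NAME = "SGLANG_SNAPSHOT_LINK_BUF_BYTES"  # env var named in the commit


def reserve_snapshot_buf(
    try_alloc: Callable[[int], bool],  # hypothetical: True if nbytes fit on the GPU
    default_bytes: int = 8 * GB,
    floor_bytes: int = 1 * GB,
) -> int:
    """Return the buffer size that was successfully reserved."""
    size = int(os.environ.get(ENV_NAME, default_bytes))
    while True:
        if try_alloc(size):
            return size
        if size <= floor_bytes:
            raise MemoryError(f"could not reserve {size} bytes for snapshot buffer")
        # Assumption: halve on failure, but never go below the 1 GB floor.
        size = max(size // 2, floor_bytes)
```

In real code `try_alloc` would wrap the actual GPU allocation and report out-of-memory as False; here it is left abstract so the retry policy can be exercised without a GPU.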

Removed the runtime leak-check accommodation since prepare_receive no
longer touches kv_pool.

Total: ~365 LOC including alloc helper; smoke-test verification next.