Phase 2 prepare_receive allocates kv_pool slots that aren't visible
to radix / session bookkeeping until finalize_ingest. Without this
fix, the scheduler's idle self_check fires:
    ValueError: token_to_kv_pool_allocator memory leak detected!
    available=288391, evictable=5, protected=0, session_held=0
    (expected sum == 288460)
_check_radix_cache_memory now subtracts
    sum(len(rec.slot_indices) for rec in ctrl._ingest_records.values())
from the expected total before flagging a leak. The snapshot_reserved
count is also printed in the leak message for diagnostics.
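The adjusted check can be sketched as follows. This is a minimal
illustration, not the actual _check_radix_cache_memory body: IngestRecord
and check_pool_accounting are hypothetical stand-ins, and only the
subtraction expression and the message fields come from the text above.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class IngestRecord:
    # Hypothetical stand-in for a controller ingest record; only
    # slot_indices is taken from the commit text.
    slot_indices: List[int]


def snapshot_reserved(ingest_records: Dict[str, IngestRecord]) -> int:
    # Slots allocated by prepare_receive that radix / session bookkeeping
    # cannot see until finalize_ingest runs.
    return sum(len(rec.slot_indices) for rec in ingest_records.values())


def check_pool_accounting(total, available, evictable, protected,
                          session_held, ingest_records):
    reserved = snapshot_reserved(ingest_records)
    # Subtract in-flight snapshot slots from the expected total before
    # deciding whether anything leaked.
    expected = total - reserved
    if available + evictable + protected + session_held != expected:
        raise ValueError(
            "token_to_kv_pool_allocator memory leak detected! "
            f"available={available}, evictable={evictable}, "
            f"protected={protected}, session_held={session_held}, "
            f"snapshot_reserved={reserved} (expected sum == {expected})"
        )
    return reserved
```

With the numbers from the message above, the gap is
288460 - 288391 - 5 = 64 slots, so a pending ingest holding 64 slots
(an illustrative figure) would make the check pass.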
Smoke confirmed (scripts/smoke_snapshot_sglang_integration.py):
    [smoke] prepare_receive on P → 200: ok=true (96 layer bufs)
    [smoke] dump on D → 200: ok=false, reason=session-not-resident
    [smoke] finalize on P → 200: ok=true, inserted_prefix_len=0
    [smoke] OVERALL: PASS
End-to-end KV-correctness (snapshot ingest yields cache hit on next
prefill) still requires the agentic+router stack — covered in the E4
sweep, not this smoke.
Phase 3 — wires the SGLang-side snapshot RPCs (committed in 86412bb)
into the agentic reseed slow-path. On _invoke_kvcache_seeded_router:
  1. POST {prefill_url}/_snapshot/prepare_receive      alloc P-side slots
  2. POST {old_decode_url}/_snapshot/dump              RDMA push session KV
  3. POST {prefill_url}/_snapshot/finalize_ingest      insert into P radix
After step 3, P's radix tree has the session prefix cached, so the
subsequent SGLang router-driven prefill on P hits the cache instead of
recomputing. Any RPC failure short-circuits to the existing seeded_router
fallback (re-prefill from scratch). All steps are best-effort and emit
structured logs for post-hoc analysis.
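The three-step sequence with its fallback can be sketched as below. This
is an illustration under stated assumptions, not the real
_attempt_d_to_p_sync helper: the {"session_id": ...} request body, the
"ok"/"reason" reply fields being treated as the failure signal, and the
names attempt_d_to_p_sync_sketch and http_post_json are all hypothetical.
Only the three endpoint paths and their order come from the list above.

```python
import json
import urllib.request
from typing import Callable


def http_post_json(url: str, payload: dict, timeout: float = 10.0) -> dict:
    # Plain stdlib JSON POST; the real client may differ.
    req = urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(req, timeout=timeout) as resp:
        return json.loads(resp.read().decode("utf-8"))


def attempt_d_to_p_sync_sketch(prefill_url: str, old_decode_url: str,
                               session_id: str,
                               post: Callable[[str, dict], dict] = http_post_json,
                               log: Callable[[str], None] = print) -> bool:
    """Return True if all three RPCs succeed; False means 'use the fallback'."""
    steps = [
        ("prepare_receive", f"{prefill_url}/_snapshot/prepare_receive"),
        ("dump", f"{old_decode_url}/_snapshot/dump"),
        ("finalize_ingest", f"{prefill_url}/_snapshot/finalize_ingest"),
    ]
    for name, url in steps:
        try:
            # Assumed request body; the actual schema is not given here.
            reply = post(url, {"session_id": session_id})
        except Exception as exc:  # best-effort: never propagate
            log(json.dumps({"step": name, "ok": False, "error": repr(exc)}))
            return False
        ok = bool(reply.get("ok"))
        log(json.dumps({"step": name, "ok": ok, "reason": reply.get("reason")}))
        if not ok:
            return False  # short-circuit to re-prefill from scratch
    return True
```

A caller would continue with the normal seeded_router path either way; a
False return only means the prefill on P is not expected to hit the cache.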
Flag plumbing:
  cli.py        --enable-d-to-p-sync (replay + benchmark)
  topology.py   SingleNodeTopology.enable_d_to_p_sync
  stack.py      SGLANG_SNAPSHOT_LINK_ENABLE=1 injection per worker
  replay.py     ReplayConfig.enable_d_to_p_sync +
                _attempt_d_to_p_sync helper
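One way the flag could travel from the CLI to the worker environment is
sketched below. The flag name and the environment variable are taken from
the list above; build_parser and worker_env are hypothetical helpers, and
the real cli.py / stack.py wiring may be structured differently.

```python
import argparse
from typing import Dict, Optional


def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser()
    # Off by default; argparse exposes it as args.enable_d_to_p_sync.
    parser.add_argument("--enable-d-to-p-sync", action="store_true")
    return parser


def worker_env(enable_d_to_p_sync: bool,
               base_env: Optional[Dict[str, str]] = None) -> Dict[str, str]:
    # Copy so the caller's environment mapping is not mutated.
    env = dict(base_env or {})
    if enable_d_to_p_sync:
        env["SGLANG_SNAPSHOT_LINK_ENABLE"] = "1"
    return env


args = build_parser().parse_args(["--enable-d-to-p-sync"])
env = worker_env(args.enable_d_to_p_sync)
```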
Snapshot port per worker derives from disaggregation_bootstrap_port +
1000 (set in third_party/.../snapshot/controller.py), so different
workers get distinct mooncake snapshot engines on the same node.
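The port derivation amounts to the following. The +1000 offset is from
the text above; the function name and the example bootstrap ports are
illustrative assumptions.

```python
SNAPSHOT_PORT_OFFSET = 1000


def snapshot_port(disaggregation_bootstrap_port: int) -> int:
    # Each worker's snapshot engine listens at a fixed offset from its
    # own bootstrap port.
    return disaggregation_bootstrap_port + SNAPSHOT_PORT_OFFSET


# Hypothetical per-worker bootstrap ports on one node.
bootstrap_ports = {"prefill-0": 8998, "decode-0": 8999}
snapshot_ports = {w: snapshot_port(p) for w, p in bootstrap_ports.items()}
```

Distinctness holds as long as the workers' bootstrap ports are themselves
distinct and no bootstrap port sits exactly 1000 above another worker's.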
Smoke (next): scripts/smoke_snapshot_sglang_integration.py spawns one
D + one P, exercises the 3 RPCs end-to-end, checks cache_tokens on a
follow-up generate request.
See docs/D_TO_P_SYNC_DESIGN_ZH.md for the full design.