agentic-pd-hybrid

Author SHA1 Message Date


Claude Code Agent

a369722efe

fix(sglang): account snapshot-reserved slots in radix mem leak check

Phase 2 prepare_receive allocates kv_pool slots that aren't visible
to radix / session bookkeeping until finalize_ingest. Without this
fix, the scheduler's idle self_check fires:

  ValueError: token_to_kv_pool_allocator memory leak detected!
    available=288391, evictable=5, protected=0, session_held=0
    (expected sum == 288460)

_check_radix_cache_memory now subtracts
  sum(len(rec.slot_indices) for rec in ctrl._ingest_records.values())
from the expected total before flagging a leak. The snapshot_reserved
count is also printed in the leak message for diagnostics.
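
The accounting above can be sketched as follows. This is a minimal illustration, not the SGLang implementation: `IngestRecord`, `snapshot_reserved`, and `check_radix_cache_memory` are hypothetical stand-ins, and only the `sum(len(rec.slot_indices) ...)` expression and the message fields come from the commit text.

```python
from dataclasses import dataclass, field


@dataclass
class IngestRecord:
    # Hypothetical stand-in for one entry of ctrl._ingest_records.
    slot_indices: list[int] = field(default_factory=list)


def snapshot_reserved(ingest_records: dict[str, IngestRecord]) -> int:
    """Count kv_pool slots held by prepare_receive but not yet finalized."""
    return sum(len(rec.slot_indices) for rec in ingest_records.values())


def check_radix_cache_memory(available: int, evictable: int, protected: int,
                             session_held: int, total: int,
                             ingest_records: dict[str, IngestRecord]) -> int:
    """Raise if the tracked slots do not add up; return the reserved count."""
    reserved = snapshot_reserved(ingest_records)
    # Snapshot-reserved slots are subtracted from the expected total,
    # because radix / session bookkeeping cannot see them yet.
    expected = total - reserved
    observed = available + evictable + protected + session_held
    if observed != expected:
        raise ValueError(
            "token_to_kv_pool_allocator memory leak detected! "
            f"available={available}, evictable={evictable}, "
            f"protected={protected}, session_held={session_held}, "
            f"snapshot_reserved={reserved} (expected sum == {expected})"
        )
    return reserved
```

In the error quoted above, 288391 + 5 + 0 + 0 = 288396, which is 64 short of 288460. Assuming nothing else was outstanding, a single pending record holding 64 slots would make this sketch pass with those numbers.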

Smoke confirmed (scripts/smoke_snapshot_sglang_integration.py):
  [smoke] prepare_receive on P → 200: ok=true (96 layer bufs)
  [smoke] dump on D → 200: ok=false, reason=session-not-resident
  [smoke] finalize on P → 200: ok=true, inserted_prefix_len=0
  [smoke] OVERALL: PASS

End-to-end KV-correctness (snapshot ingest yields cache hit on next
prefill) still requires the agentic+router stack — covered in the E4
sweep, not this smoke.

2026-05-13 08:26:16 +08:00

Claude Code Agent

b9b0cf0fac

feat(agentic): D→P snapshot orchestration in reseed path + CLI flag

Phase 3 — wires the SGLang-side snapshot RPCs (committed in 86412bb)
into the agentic reseed slow-path. On _invoke_kvcache_seeded_router:

  1. POST {prefill_url}/_snapshot/prepare_receive   alloc P-side slots
  2. POST {old_decode_url}/_snapshot/dump           RDMA push session KV
  3. POST {prefill_url}/_snapshot/finalize_ingest   insert into P radix

After step 3, P's radix tree has the session prefix cached, so the
subsequent SGLang router-driven prefill on P hits the cache instead of
recomputing.

Any RPC failure short-circuits to the existing seeded_router fallback
(re-prefill from scratch). All steps are best-effort and structurally
logged for post-hoc analysis.
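
The three-step sequence and its fallback behaviour can be sketched as below. This is an illustration under assumptions, not the code in replay.py: the `post` callable, the `{"session_id": ...}` payload, the `ok` / `reason` reply fields as a failure signal, and the function name `attempt_d_to_p_sync` are all hypothetical. Only the three endpoint paths and their order come from the commit message.

```python
from typing import Callable, Optional

# Hypothetical transport: post(url, payload) returns the parsed JSON reply.
PostFn = Callable[[str, dict], dict]


def attempt_d_to_p_sync(post: PostFn, prefill_url: str, old_decode_url: str,
                        session_id: str,
                        log: Optional[list] = None) -> bool:
    """Run the three snapshot RPCs in order; return False on the first failure.

    A False result means the caller should take the existing seeded_router
    fallback (re-prefill from scratch).
    """
    if log is None:
        log = []
    steps = [
        ("prepare_receive", f"{prefill_url}/_snapshot/prepare_receive"),
        ("dump", f"{old_decode_url}/_snapshot/dump"),
        ("finalize_ingest", f"{prefill_url}/_snapshot/finalize_ingest"),
    ]
    for name, url in steps:
        try:
            reply = post(url, {"session_id": session_id})  # payload is assumed
        except Exception as exc:  # best-effort: never propagate to the caller
            record = {"step": name, "ok": False, "reason": repr(exc)}
        else:
            record = {"step": name, "ok": bool(reply.get("ok")),
                      "reason": reply.get("reason")}
        log.append(record)  # structured record for post-hoc analysis
        if not record["ok"]:
            return False  # short-circuit; remaining steps are skipped
    return True
```

Passing the transport in as a callable keeps the sketch testable without a running P or D worker. Whether the real helper treats an `ok=false` reply the same way as a transport error is an assumption here.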

Flag plumbing:
  cli.py             --enable-d-to-p-sync          (replay + benchmark)
  topology.py        SingleNodeTopology.enable_d_to_p_sync
  stack.py           SGLANG_SNAPSHOT_LINK_ENABLE=1 injection per worker
  replay.py          ReplayConfig.enable_d_to_p_sync +
                     _attempt_d_to_p_sync helper
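
A rough sketch of how such a flag might flow from the CLI to the worker environment. The flag name, the `enable_d_to_p_sync` field, and the `SGLANG_SNAPSHOT_LINK_ENABLE=1` variable come from the list above. The `ReplayConfig` here is a simplified stand-in, and `build_parser` and `worker_env` are hypothetical helpers rather than the project's actual functions.

```python
import argparse
from dataclasses import dataclass


@dataclass
class ReplayConfig:
    # Simplified stand-in; the real config presumably has more fields.
    enable_d_to_p_sync: bool = False


def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser()
    # argparse maps --enable-d-to-p-sync to args.enable_d_to_p_sync.
    parser.add_argument("--enable-d-to-p-sync", action="store_true")
    return parser


def worker_env(base_env: dict, enable_d_to_p_sync: bool) -> dict:
    """Return a copy of base_env with the snapshot link switched on if requested."""
    env = dict(base_env)
    if enable_d_to_p_sync:
        env["SGLANG_SNAPSHOT_LINK_ENABLE"] = "1"
    return env
```

For example, `build_parser().parse_args(["--enable-d-to-p-sync"])` yields `enable_d_to_p_sync=True`, which can then be copied into the config and passed to `worker_env` once per spawned worker.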

Snapshot port per worker derives from disaggregation_bootstrap_port +
1000 (set in third_party/.../snapshot/controller.py), so different
workers get distinct mooncake snapshot engines on the same node.
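
The port derivation amounts to a fixed offset, sketched below. The `+ 1000` offset is from the commit text. The function names, worker labels, and example bootstrap ports are hypothetical, and the sketch assumes each worker on the node already has a distinct disaggregation_bootstrap_port.

```python
SNAPSHOT_PORT_OFFSET = 1000  # offset stated in the commit message


def snapshot_port(disaggregation_bootstrap_port: int) -> int:
    """Derive a worker's snapshot port from its bootstrap port."""
    return disaggregation_bootstrap_port + SNAPSHOT_PORT_OFFSET


def snapshot_ports(bootstrap_ports: dict[str, int]) -> dict[str, int]:
    """Map worker name -> snapshot port, rejecting collisions on one node."""
    ports = {name: snapshot_port(port) for name, port in bootstrap_ports.items()}
    if len(set(ports.values())) != len(ports):
        raise ValueError(f"snapshot port collision: {ports}")
    return ports


# Hypothetical bootstrap ports for one prefill and one decode worker.
print(snapshot_ports({"prefill-0": 8998, "decode-0": 8999}))
```

Because the offset is constant, distinct bootstrap ports always map to distinct snapshot ports. The collision check only guards against two workers being configured with the same bootstrap port.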

Smoke (next): scripts/smoke_snapshot_sglang_integration.py spawns one
D + one P, exercises the 3 RPCs end-to-end, checks cache_tokens on a
follow-up generate request.

See docs/D_TO_P_SYNC_DESIGN_ZH.md for the full design.

2026-05-13 08:16:46 +08:00

2 Commits