Implements the design in docs/SNAPSHOT_STORE_REFACTOR_ZH.md to fix
the alloc-failed death loop that killed D→P sync in E4-v4/v5 (167 sync
attempts, 0 succeeded, because P's kv_pool was busy with its own prefill).
Mechanism change:
  OLD prepare_receive: token_to_kv_pool_allocator.alloc(N) (90%+ failure)
  NEW prepare_receive: SnapshotBufAllocator.alloc(slab_bytes) carves a
                       range from an 8 GB GPU buffer dedicated to
                       snapshot reception, decoupled from kv_pool
  OLD finalize_ingest: just radix.insert with pre-alloc'd slots
  NEW finalize_ingest: kv_pool.alloc NOW + GPU memcpy snapshot_buf →
                       k_buffer/v_buffer + radix.insert
Wire schema changed (clean break, no back-compat):
  PrepareReceiveReqOutput  swaps k/v_base_ptrs + slot_indices for
                           snapshot_buf_base_ptr + k/v_layer_offsets +
                           num_tokens
  DumpReqInput             swaps target_k/v_base_ptrs + target_slot_indices
                           for target_snapshot_buf_base +
                           target_k/v_layer_offsets
  FinalizeIngestReqInput   drops slot_indices (P resolves at ingest)
Controller adds:
  SnapshotBufAllocator: first-fit free-list with 4 KB alignment
  ingest_snapshot_into_kvpool: GPU→GPU copy + radix insert
  Configurable buffer size via SGLANG_SNAPSHOT_LINK_BUF_BYTES env
  (default 8 GB, scales down to 1 GB if alloc fails).
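A minimal sketch of a first-fit free-list allocator with 4 KB alignment, to
illustrate the idea behind SnapshotBufAllocator. The class name, method
names, and coalescing strategy here are assumptions for illustration; the
sketch tracks byte offsets only and allocates no GPU memory.

```python
from typing import List, Optional, Tuple

ALIGN = 4096  # 4 KB alignment, as stated in the commit message


def align_up(nbytes: int, align: int = ALIGN) -> int:
    """Round nbytes up to the next multiple of align."""
    return (nbytes + align - 1) // align * align


class FirstFitAllocator:
    """Hypothetical stand-in for SnapshotBufAllocator (offsets only)."""

    def __init__(self, capacity: int) -> None:
        # Assumes capacity is itself a multiple of ALIGN.
        self.free: List[Tuple[int, int]] = [(0, capacity)]  # sorted (offset, size)

    def alloc(self, nbytes: int) -> Optional[int]:
        size = align_up(nbytes)
        for i, (off, length) in enumerate(self.free):
            if length >= size:  # first range that fits wins
                if length == size:
                    self.free.pop(i)
                else:
                    self.free[i] = (off + size, length - size)
                return off
        return None  # buffer exhausted; kv_pool is never touched

    def free_range(self, off: int, nbytes: int) -> None:
        self.free.append((off, align_up(nbytes)))
        self.free.sort()
        merged: List[Tuple[int, int]] = []
        for o, s in self.free:
            # Coalesce ranges that touch so large slabs can be reused.
            if merged and merged[-1][0] + merged[-1][1] == o:
                merged[-1] = (merged[-1][0], merged[-1][1] + s)
            else:
                merged.append((o, s))
        self.free = merged
```

Because a failed alloc only returns None, a full snapshot buffer fails one
sync attempt without competing with prefill for kv_pool slots.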
Removed the runtime leak-check accommodation, since prepare_receive no
longer touches kv_pool.
Total: ~365 LOC including the alloc helper; smoke-test verification is next.
Add the 54 MB SWE 50sess replay trace to the repo under
third_party/traces/ so it travels with `git clone` to GPU nodes that
can't reach the sandbox network. Previously the trace only lived under
outputs/ which is .gitignored.
Whitelist third_party/traces/ in .gitignore (same pattern as the
existing third_party/sglang/ allowlist).
After cloning on a new host, either symlink the file into outputs/ for
backward compatibility:
    ln -sf ../third_party/traces/qwen35-swebench-50sess.jsonl \
        outputs/qwen35-swebench-50sess.jsonl
or update sweep scripts to point --trace at third_party/traces/.
The README in the new directory documents the file's lineage
(SiCo → SiBench → audit.jsonl → convert_audit_to_trace.py) and warns
about the 100 MB GitLab single-file limit for future trace additions.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Phase 2 prepare_receive allocates kv_pool slots that aren't visible
to radix / session bookkeeping until finalize_ingest. Without this
fix, the scheduler's idle self_check fires:
    ValueError: token_to_kv_pool_allocator memory leak detected!
    available=288391, evictable=5, protected=0, session_held=0
    (expected sum == 288460)
_check_radix_cache_memory now subtracts
    sum(len(rec.slot_indices) for rec in ctrl._ingest_records.values())
from the expected total before flagging a leak. snapshot_reserved is
also printed in the leak message for diagnostics.
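The adjusted check can be sketched as follows. The function name and
signature are hypothetical. The counters come from the error message quoted
in this commit, where the accounted sum (288396) falls 64 short of the
expected 288460; the sketch assumes those 64 slots are the ones held by
pending ingest records.

```python
def pool_is_consistent(
    total: int,
    available: int,
    evictable: int,
    protected: int,
    session_held: int,
    snapshot_reserved: int = 0,
) -> bool:
    """Return True when every slot in the pool is accounted for.

    snapshot_reserved is subtracted from the expected total, mirroring
    the adjustment described in the commit message.
    """
    accounted = available + evictable + protected + session_held
    return accounted == total - snapshot_reserved


# Counters from the quoted leak message: the unadjusted check fails.
leak_flagged = not pool_is_consistent(288460, 288391, 5, 0, 0)

# Assuming 64 slots are reserved by prepare_receive, the check passes.
ok_with_reserve = pool_is_consistent(288460, 288391, 5, 0, 0, snapshot_reserved=64)
```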
Smoke confirmed (scripts/smoke_snapshot_sglang_integration.py):
    [smoke] prepare_receive on P → 200: ok=true (96 layer bufs)
    [smoke] dump on D → 200: ok=false, reason=session-not-resident
    [smoke] finalize on P → 200: ok=true, inserted_prefix_len=0
    [smoke] OVERALL: PASS
End-to-end KV-correctness (snapshot ingest yields cache hit on next
prefill) still requires the agentic+router stack — covered in the E4
sweep, not this smoke.
Phase 2 of the D→P sync feature (Phase 1 in dc4867c verified the
underlying RDMA link in isolation). This commit wires that link into
each SGLang worker's scheduler so D and P can exchange session KV
without going through the PD prefill pipeline.
New module:
third_party/sglang/python/sglang/srt/disaggregation/snapshot/
controller.py — SnapshotLinkController owns one mooncake transfer
engine per worker, pre-registers all kv_pool layer
buffers, and exposes prepare_receive() and
push_session_kv() APIs. Receive bookkeeping via
a session_id → SnapshotIngestRecord side-table.
Three RPC types added to io_struct.py and full plumbing wired through:
  SnapshotPrepareReceiveReqInput/Output   P-side alloc + return layout
  SnapshotDumpReqInput/Output             D-side read kv_pool + RDMA push
  SnapshotFinalizeIngestReqInput/Output   P-side radix tree insert
Files touched:
  managers/io_struct.py                     3 new ReqInput/ReqOutput pairs
  managers/tokenizer_communicator_mixin.py  3 communicators, 3 awaitables
  managers/scheduler.py                     init controller + 3 handlers
  entrypoints/http_server.py                3 HTTP endpoints under /_snapshot
Activation: set SGLANG_SNAPSHOT_LINK_ENABLE=1 (and
SGLANG_SNAPSHOT_LINK_HOST / _PORT / _IB_DEVICE) per worker. Controller
init is opt-in and defaults off, so production PD pipeline is
untouched.
Subsequent work (Phase 3): agentic-pd-hybrid orchestration in
_invoke_kvcache_seeded_router to call prepare_receive on P, dump on
D-old, finalize_ingest on P, then trigger the existing P→D' transfer
which will now hit P's radix cache (skipping re-prefill).
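The planned three-step sequence could look roughly like the sketch below.
The rpc callable, the endpoint labels, and the payload fields are all
hypothetical placeholders, not the actual HTTP API. Finalize is called even
when the dump fails, on the assumption that P should always get a chance to
release its reservation (the smoke test shows finalize succeeding with
inserted_prefix_len=0 after a failed dump).

```python
from typing import Callable, Dict

# (worker, endpoint, payload) -> response dict; hypothetical transport.
Rpc = Callable[[str, str, Dict], Dict]


def sync_session_d_to_p(rpc: Rpc, p_worker: str, d_old: str, session_id: str) -> bool:
    """Push one session's KV from D-old to P; True only if every step succeeds."""
    layout = rpc(p_worker, "prepare_receive", {"session_id": session_id})
    if not layout.get("ok"):
        return False  # nothing was reserved on P, so nothing to clean up

    dumped = rpc(d_old, "dump", {"session_id": session_id, "target": layout})

    # Always finalize so P can release or commit its reservation.
    final = rpc(p_worker, "finalize_ingest", {"session_id": session_id})
    return bool(dumped.get("ok")) and bool(final.get("ok"))
```

Only after this returns True would the caller expect the following P→D'
transfer to hit P's radix cache.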
Fix A from docs/E3_FINDINGS_ZH.md §3. The existing streaming-session
correction at the top of ScheduleBatch.prepare_for_extend zeroes
req.extend_input_len when len(fill_ids) <= len(prefix_indices), but
the per-req invariant later in the same function (assert
seq_len - pre_len == req.extend_input_len) is computed from raw
fill_ids/prefix_indices lengths and has no path to be satisfied
when fill_len < prefix_len. The result is an AssertionError that
crashes the entire decode worker.
Add a pre-filter pass at the start of prepare_for_extend that
detects this state, marks the affected reqs with FINISH_ABORT (so
the client gets an error response instead of the worker hanging),
and drops them from the batch before the correction loop runs. If
all reqs are filtered, populate empty tensor/list state and return
early so downstream model.forward sees a valid no-op batch.
This treats len(fill_ids) < len(prefix_indices) as an upstream state
inconsistency that should be reported to the client rather than
silently miscomputed. The narrower invariant after this filter:
prepare_for_extend's body only ever sees streaming-session reqs
where actual_extend_len > 0, which is the regime the existing
correction logic was designed for.
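A minimal sketch of the pre-filter pass, assuming a simplified request
object. The Req dataclass and the finished_reason string are stand-ins for
illustration; the real code marks reqs with FINISH_ABORT on SGLang's own
request type.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class Req:  # hypothetical stand-in for SGLang's request object
    rid: str
    fill_ids: List[int]
    prefix_indices: List[int]
    finished_reason: Optional[str] = None


def prefilter_inconsistent(reqs: List[Req]) -> Tuple[List[Req], List[Req]]:
    """Split reqs into (kept, aborted) before the correction loop runs."""
    kept: List[Req] = []
    aborted: List[Req] = []
    for req in reqs:
        if len(req.fill_ids) < len(req.prefix_indices):
            # The per-req invariant cannot hold here, so report an error
            # to the client instead of tripping the assertion later.
            req.finished_reason = "FINISH_ABORT"
            aborted.append(req)
        else:
            kept.append(req)
    return kept, aborted
```

If kept comes back empty, the caller would populate empty batch state and
return early, as described above.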
Reproduced by E3 first run on 2026-05-12 02:51:21 UTC (rid
6f4318e93dd543a49dbf19248cfc1e6f, session 1000195, fill_len=6648,
prefix_len=43459) — masked in E1/E2 because the cap-out failure
cascade prevented sessions from accumulating deep enough committed
prefix to trigger the inconsistency.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Add `recommended_pause_ms` field to DirectAppendAdmissionReqOutput so D
can advise callers when its transfer queue is heavy or KV pool is near
capacity. The hint is computed from transfer_queue_depth,
retracted_queue_depth, and post-trim token_usage; thresholds are simple
heuristics (>0.90 usage, >=8 queue depth, retracted>0).
Default behavior is unchanged for callers that ignore the field.
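One way the hint could be derived from the three signals is sketched below.
The thresholds are the ones listed above; the pause magnitude (base_ms per
triggered condition) is purely an assumption for illustration, since this
message does not state the actual values.

```python
def recommended_pause_ms(
    token_usage: float,
    transfer_queue_depth: int,
    retracted_queue_depth: int,
    base_ms: int = 50,  # hypothetical step size, not from the commit
) -> int:
    """Return an advisory pause in milliseconds; 0 means no back-off needed."""
    triggered = 0
    if token_usage > 0.90:          # KV pool near capacity (post-trim)
        triggered += 1
    if transfer_queue_depth >= 8:   # transfer queue is heavy
        triggered += 1
    if retracted_queue_depth > 0:   # requests were already retracted
        triggered += 1
    return base_ms * triggered
```

A caller that ignores the field sees no change; one that honors it might
sleep for the returned duration before retrying admission.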
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Hostile audit of the original report flagged three load-bearing errors:
1. held_tokens semantics were inverted. session_held_tokens() at
session_aware_cache.py:278-282 sums (kv_allocated_len - cache_protected_len)
per slot, i.e. slot-private (NOT in radix tree). So "other = cap - held -
avail" actually CONTAINS the radix-tree protected prefix cache (likely the
single biggest component for shared agentic prefixes), not just running
batch + in-flight as the original report claimed.
2. Admission-race causal hypothesis for the 415 EXP2+profile errors is
contradicted by the data: 414/415 errors have kv_transfer_blocks > 0 — they
passed admission and died downstream ("generate stream ended before
producing any token", raised by the client when a 200 response had an empty
stream).
3. Polling as a confound was dismissed too quickly. Mode counts shift ~1:1
(session-cap-fb -356 / kvcache-centric +406), and /server_info is not a
passive read — it dispatches into the scheduler main loop and iterates
every session slot.
Additional issues: per-D error% is confounded by sticky session affinity
(only 18 unique sessions cause the 415 errors, and decode-3 had 0 errors only
because no high-error session landed there); the decile 10 "recovery" was an
equal-time binning artifact (24.5% under equal-count); the v5 vs v5+profile
time gap was 21h, not 6h; the p50/p90 latency comparison is N=1.
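The corrected reading of held_tokens from point 1 can be illustrated with a
small sketch. The Slot dataclass is a hypothetical stand-in and all numbers
are made up; only the per-slot formula comes from the description above.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass
class Slot:  # hypothetical stand-in for a session slot
    kv_allocated_len: int
    cache_protected_len: int


def session_held_tokens(slots: Iterable[Slot]) -> int:
    """Slot-private tokens only: allocated KV that is NOT in the radix tree."""
    return sum(s.kv_allocated_len - s.cache_protected_len for s in slots)


def other_tokens(cap: int, held: int, avail: int) -> int:
    """Residual after held and available; this still contains the
    radix-protected prefix cache, not just running batch + in-flight."""
    return cap - held - avail


# Illustrative numbers: two slots with large protected prefixes.
slots = [Slot(1000, 800), Slot(500, 500)]
held = session_held_tokens(slots)
other = other_tokens(cap=10_000, held=held, avail=6_000)
```

Here held is small even though the sessions pin many tokens through the
radix tree, which is why the residual cannot be read as running batch plus
in-flight alone.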
Rewritten report (docs/V5_PROFILE_INVESTIGATION_ZH.md) marks each correction
with ⚠️ and demotes admission-race to one of four hypotheses (H1-H4).
Action items split into P0 (verify, must do first) and P1 (instrument):
P0 — scripts/sweep_tp1_v5_baseline_rerun_exp2.sh runs 3x v5 baseline EXP2
(no polling, identical config to the original v5 run) to test whether the
9-error baseline result is reproducible. If 3 runs give ~9 errors and
profile gives 415, polling is the leading suspect. Currently running
in background.
P1 — scheduler.py:_compute_pool_breakdown_for_diagnostics adds a read-only
"pool_breakdown" dict to /server_info covering: radix_evictable_tokens,
radix_protected_tokens, slot_private_held_tokens, session_slot_count,
running_batch_{reqs,kv_tokens}, transfer_queue_{reqs,tokens},
prealloc_queue_{reqs,tokens}, retracted_queue_{reqs,tokens}. With these,
"unaccounted = cap - sum(known)" exposes true leakage. replay.py captures
all fields into the per-tick row; analyzer prints the decomposition and
gracefully handles old timeseries (prints "P1 instrument absent").
Mock-tested end-to-end. SGLang patch is read-only and does not affect
admission/scheduling. Old v5+profile data still analyzes correctly.
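The decomposition the analyzer performs might look like this sketch. The
row layout, the avail argument, and the function name are assumptions; the
token field names are taken from the list above. It also assumes the listed
categories do not overlap, which would need to be confirmed against the
scheduler.

```python
from typing import Dict, Optional

TOKEN_FIELDS = (
    "radix_evictable_tokens",
    "radix_protected_tokens",
    "slot_private_held_tokens",
    "running_batch_kv_tokens",
    "transfer_queue_tokens",
    "prealloc_queue_tokens",
    "retracted_queue_tokens",
)


def unaccounted_tokens(cap: int, avail: int, row: Dict) -> Optional[int]:
    """cap - avail - sum(known); None for old rows without pool_breakdown."""
    breakdown = row.get("pool_breakdown")
    if breakdown is None:
        return None  # the analyzer would print "P1 instrument absent"
    known = sum(int(breakdown.get(name, 0)) for name in TOKEN_FIELDS)
    return cap - avail - known
```

A persistently positive result across ticks would point to true leakage
rather than a category the original report mislabeled.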
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
v4 (cap=16) saw 35% session-cap fallback because the local soft_cap
min(16, usable / target) evaluates to 1-2 for large agentic inputs.
The cap was hit not because D was full but because replay's heuristic
underestimated capacity.
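To see why the local soft cap collapses for large inputs, consider this
sketch. The function name, the use of integer division, and the token
counts are illustrative assumptions; only the min(16, usable / target)
shape comes from the description above.

```python
def local_soft_cap(usable_tokens: int, target_tokens_per_session: int, hard_cap: int = 16) -> int:
    """Replay-side estimate of how many sessions one D worker can hold."""
    return min(hard_cap, usable_tokens // target_tokens_per_session)


# Hypothetical pool of 200k usable tokens.
small_sessions = local_soft_cap(200_000, 4_000)     # hard cap applies
large_sessions = local_soft_cap(200_000, 80_000)    # estimate collapses
huge_sessions = local_soft_cap(200_000, 120_000)
```

With large per-session targets the estimate drops to 2 or 1, so the
fallback fires long before D's pool is actually full.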
This change makes worker admission_mode authoritative for ALL paths:
SGLang side:
- io_struct.py: DirectAppendAdmissionReqInput gains a `mode` field
("direct_append" | "seed", default "direct_append" preserves prior
behavior).
- scheduler.py:admit_direct_append: when mode == "seed", skip the
resident-on-D requirement and run the same capacity check + LRU
eviction (maybe_trim_decode_session_cache) that direct_append uses.
This lets D atomically decide if a new session can be admitted based
on actual token_to_kv_pool_allocator state.
Replay side (replay.py):
- _query_decode_direct_admission gains a `mode` parameter.
- _reserve_decode_session_capacity: in worker admission_mode, the
seed/reseed branch now queries D with mode="seed" and trusts the
result, instead of estimating capacity from the residency snapshot.
- _should_admit_new_decode_session: in worker mode, skip the local
soft_cap pre-check and let D decide. Same-D session fast-path is
preserved.
Effects:
- Local hardcoded cap of 16 is bypassed under worker mode; D's real
KV pool size is the only constraint.
- LRU eviction runs in D's process atomically with admission, so
starvation (the v3 bimodal "lucky vs starved sessions" pattern)
should resolve.
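A toy model of the worker-side decision, to show how the mode field changes
admission. The DecodeWorker class is hypothetical and tracks only token
counts; the real path goes through admit_direct_append and
maybe_trim_decode_session_cache.

```python
from collections import OrderedDict


class DecodeWorker:
    """Hypothetical toy D worker: session_id -> resident tokens, LRU first."""

    def __init__(self, capacity: int) -> None:
        self.capacity = capacity
        self.sessions: "OrderedDict[str, int]" = OrderedDict()

    def used(self) -> int:
        return sum(self.sessions.values())

    def admit(self, session_id: str, tokens: int, mode: str = "direct_append") -> bool:
        if mode not in ("direct_append", "seed"):
            raise ValueError(f"unknown mode: {mode}")
        # direct_append requires the session to already be resident on D;
        # seed skips that requirement.
        if mode == "direct_append" and session_id not in self.sessions:
            return False
        own = self.sessions.get(session_id, 0)
        if own + tokens > self.capacity:
            return False  # cannot fit even after evicting everyone else
        # Evict least-recently-used other sessions, in the same step as
        # the admission decision.
        while self.used() + tokens > self.capacity:
            victim = next(s for s in self.sessions if s != session_id)
            del self.sessions[victim]
        self.sessions[session_id] = own + tokens
        self.sessions.move_to_end(session_id)  # mark as most recently used
        return True
```

Because eviction and admission happen in one call against the worker's own
state, the caller never has to guess capacity from a stale snapshot.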
scripts/sweep_tp1_v5_optD.sh added to run the same 1P7D / 2P6D
configs as v4 with the new admission path.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Document the iterative debugging from v1 (broken KVC) through v4
(routing fixed + session cap raised), with code-level analysis of
the two main bugs encountered:
1. v2 root cause (mis-diagnosed previously as `allow_local_prefill`):
`--policy default` for KVC mechanism caused replay's round-robin
policy and the PD router's round-robin to diverge, sending requests
with `session_params` to a D worker that did not have the session
open. Resulted in 56-61% truncation with finish_reason
"session id X does not exist".
Fix: use `--policy kv-aware` (sweep_tp1_v3_kvaware.sh) so replay
emits `x-smg-target-worker` and PD router uses consistent_hashing.
2. v3 new bottleneck: `pd-router-fallback-large-append-session-cap`
dominated 52-65% of requests. Root cause was hardcoded
`min(4, ...)` in `_decode_session_soft_cap`. With 7 D workers x 4
sessions = 28 slots for 52 trace sessions, ~24 sessions starved
permanently (bimodal direct-to-D rate of 0% or 99%).
Fix: raise the cap to 16 (replay.py).
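The slot arithmetic behind the starvation in item 2 can be checked with a
short sketch. The function name is hypothetical, and it assumes sessions
that win a slot keep it for the whole run, which is what the bimodal
0%/99% pattern suggests.

```python
def starved_sessions(num_workers: int, cap_per_worker: int, num_sessions: int) -> int:
    """Sessions left without a slot when slots are never handed over."""
    total_slots = num_workers * cap_per_worker
    return max(0, num_sessions - total_slots)


v3_starved = starved_sessions(num_workers=7, cap_per_worker=4, num_sessions=52)
v4_starved = starved_sessions(num_workers=7, cap_per_worker=16, num_sessions=52)
```

With the cap at 4 there are 28 slots for 52 sessions; at 16 there are 112,
so the slot count alone no longer forces any session onto the fallback path.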
Also includes the v3 finding that the direct-to-d-session path (P50=0.495s,
TTFT P50=0.043s) already beats the 8-way DP baseline (0.65s/0.093s),
i.e. the KVC core mechanism works when fallback paths are avoided.
Files:
- docs/KVC_DEBUG_JOURNEY_V1_TO_V4.md: full journey + code location index
- docs/SWEBENCH_EXPERIMENT_{PROGRESS,RESULTS}.md: prior session notes
- scripts/sweep_tp1_v{2,3,4}*.sh: experiment driver scripts
- src/agentic_pd_hybrid/replay.py: cap 4 -> 16, audit fields
- src/agentic_pd_hybrid/pd_router.py: strip session_params from prefill
- src/agentic_pd_hybrid/metrics.py: truncated_request_count
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>