agentic-pd-hybrid

Author	SHA1	Message	Date
Claude Code Agent	2dfe22ab20	refactor(snapshot): dedicated GPU snapshot_buf replaces kv_pool alloc Implements the design in docs/SNAPSHOT_STORE_REFACTOR_ZH.md to fix the alloc-failed death loop that killed D→P in E4-v4/v5 (167 sync attempts, 0 OK because P's kv_pool was busy with its own prefill). Mechanism change: (1) OLD prepare_receive: token_to_kv_pool_allocator.alloc(N), 90%+ failure; NEW prepare_receive: SnapshotBufAllocator.alloc(slab_bytes) carves a range from an 8 GB GPU buffer dedicated to snapshot reception, decoupled from kv_pool. (2) OLD finalize_ingest: just radix.insert with pre-alloc'd slots; NEW finalize_ingest: kv_pool.alloc NOW + GPU memcpy snapshot_buf → k_buffer/v_buffer + radix.insert. Wire schema changed (clean break, no back-compat): PrepareReceiveReqOutput swaps k/v_base_ptrs + slot_indices for snapshot_buf_base_ptr + k/v_layer_offsets + num_tokens; DumpReqInput swaps target_k/v_base_ptrs + target_slot_indices for target_snapshot_buf_base + target_k/v_layer_offsets; FinalizeIngestReqInput drops slot_indices (P resolves at ingest). Controller adds: SnapshotBufAllocator (first-fit free-list with 4 KB alignment) and ingest_snapshot_into_kvpool (GPU→GPU copy + radix insert). Configurable buffer size via SGLANG_SNAPSHOT_LINK_BUF_BYTES env (default 8 GB, scales down to 1 GB if alloc fails). Removed runtime leak-check accommodation since prepare_receive no longer touches kv_pool. Total: ~365 LOC including alloc helper; smoke-test verification next.	2026-05-13 14:18:23 +08:00
kzlin	f926a7b87d	data: include qwen35-swebench-50sess trace under third_party/traces/ Add the 54 MB SWE 50sess replay trace to the repo under third_party/traces/ so it travels with `git clone` to GPU nodes that can't reach the sandbox network. Previously the trace only lived under outputs/ which is .gitignored. Whitelist third_party/traces/ in .gitignore (same pattern as the existing third_party/sglang/ allowlist). After cloning on a new host, either symlink the file into outputs/ for backward compatibility: ln -sf ../third_party/traces/qwen35-swebench-50sess.jsonl outputs/qwen35-swebench-50sess.jsonl or update sweep scripts to point --trace at third_party/traces/. README in the new directory documents the file's lineage (SiCo → SiBench → audit.jsonl → convert_audit_to_trace.py) and the 100 MB GitLab single-file limit warning for future trace additions. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-13 14:07:05 +08:00
Claude Code Agent	552f3f564e	chore(submodule): add third_party/agentic-kvcache submodule Pinned to scaleaisys/projects/agentic-kvcache.git HEAD. Whitelisted in .gitignore alongside third_party/sglang/.	2026-05-13 13:59:05 +08:00
Claude Code Agent	a369722efe	fix(sglang): account snapshot-reserved slots in radix mem leak check Phase 2 prepare_receive allocates kv_pool slots that aren't visible to radix / session bookkeeping until finalize_ingest. Without this fix, the scheduler's idle self_check fires: ValueError: token_to_kv_pool_allocator memory leak detected! available=288391, evictable=5, protected=0, session_held=0 (expected sum == 288460) _check_radix_cache_memory now subtracts sum(len(rec.slot_indices) for rec in ctrl._ingest_records.values()) from the expected total before flagging a leak. Snapshot_reserved is also printed in the leak message for diagnostics. Smoke confirmed (scripts/smoke_snapshot_sglang_integration.py): [smoke] prepare_receive on P → 200: ok=true (96 layer bufs) [smoke] dump on D → 200: ok=false, reason=session-not-resident [smoke] finalize on P → 200: ok=true, inserted_prefix_len=0 [smoke] OVERALL: PASS End-to-end KV-correctness (snapshot ingest yields cache hit on next prefill) still requires the agentic+router stack — covered in the E4 sweep, not this smoke.	2026-05-13 08:26:16 +08:00
Claude Code Agent	86412bb174	feat(sglang): D→P snapshot link integration — controller + RPC handlers Phase 2 of the D→P sync feature (Phase 1 in `dc4867c` verified the underlying RDMA link in isolation). This commit wires that link into each SGLang worker's scheduler so D and P can exchange session KV without going through the PD prefill pipeline. New module: third_party/sglang/python/sglang/srt/disaggregation/snapshot/ controller.py — SnapshotLinkController owns one mooncake transfer engine per worker, pre-registers all kv_pool layer buffers, and exposes prepare_receive() and push_session_kv() APIs. Receive bookkeeping via a session_id → SnapshotIngestRecord side-table. Three RPC types added to io_struct.py and full plumbing wired through: SnapshotPrepareReceiveReqInput/Output P-side alloc + return layout SnapshotDumpReqInput/Output D-side read kv_pool + RDMA push SnapshotFinalizeIngestReqInput/Output P-side radix tree insert Files touched: managers/io_struct.py 3 new ReqInput/ReqOutput pairs managers/tokenizer_communicator_mixin.py 3 communicators, 3 awaitables managers/scheduler.py init controller + 3 handlers entrypoints/http_server.py 3 HTTP endpoints under /_snapshot Activation: set SGLANG_SNAPSHOT_LINK_ENABLE=1 (and SGLANG_SNAPSHOT_LINK_HOST / _PORT / _IB_DEVICE) per worker. Controller init is opt-in and defaults off, so production PD pipeline is untouched. Subsequent work (Phase 3): agentic-pd-hybrid orchestration in _invoke_kvcache_seeded_router to call prepare_receive on P, dump on D-old, finalize_ingest on P, then trigger the existing P→D' transfer which will now hit P's radix cache (skipping re-prefill).	2026-05-13 08:12:04 +08:00
tim	986f351365	feat(sglang): drop streaming-session reqs with fill_ids < prefix_indices Fix A from docs/E3_FINDINGS_ZH.md §3. The existing streaming-session correction at the top of ScheduleBatch.prepare_for_extend zeroes req.extend_input_len when len(fill_ids) <= len(prefix_indices), but the per-req invariant later in the same function (assert seq_len - pre_len == req.extend_input_len) is computed from raw fill_ids/prefix_indices lengths and has no path to be satisfied when fill_len < prefix_len. The result is an AssertionError that crashes the entire decode worker. Add a pre-filter pass at the start of prepare_for_extend that detects this state, marks the affected reqs with FINISH_ABORT (so the client gets an error response instead of the worker hanging), and drops them from the batch before the correction loop runs. If all reqs are filtered, populate empty tensor/list state and return early so downstream model.forward sees a valid no-op batch. This treats fill_ids < prefix_indices as upstream state inconsistency that should be reported to the client rather than silently miscomputed. The narrower invariant after this filter: prepare_for_extend's body only ever sees streaming-session reqs where actual_extend_len > 0, which is the regime the existing correction logic was designed for. Reproduced by E3 first run on 2026-05-12 02:51:21 UTC (rid 6f4318e93dd543a49dbf19248cfc1e6f, session 1000195, fill_len=6648, prefix_len=43459) — masked in E1/E2 because the cap-out failure cascade prevented sessions from accumulating deep enough committed prefix to trigger the inconsistency. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-12 12:12:14 +08:00
kzlin	ca4b64c79a	feat(sglang): expose backpressure pause hint in admit_direct_append Add `recommended_pause_ms` field to DirectAppendAdmissionReqOutput so D can advise callers when its transfer queue is heavy or KV pool is near capacity. The hint is computed from transfer_queue_depth, retracted_queue_depth, and post-trim token_usage; thresholds are simple heuristics (>0.90 usage, >=8 queue depth, retracted>0). Default behavior is unchanged for callers that ignore the field. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-06 21:29:30 +08:00
kzlin	4978c0d0cd	profile(kvc): rewrite v5+profile report after critic audit + P0/P1 instrument Hostile audit of the original report flagged three load-bearing errors: 1. held_tokens semantic was inverted. session_held_tokens() at session_aware_cache.py:278-282 sums (kv_allocated_len - cache_protected_len) per slot, i.e. slot-private (NOT in radix tree). So "other = cap - held - avail" actually CONTAINS the radix-tree protected prefix cache (likely the single biggest component for shared agentic prefixes), not just running batch + in-flight as the original report claimed. 2. Admission-race causal hypothesis for the 415 EXP2+profile errors is contradicted by the data: 414/415 errors have kv_transfer_blocks > 0; they passed admission and died downstream ("generate stream ended before producing any token", raised by the client when a 200 response had an empty stream). 3. Polling deconfound was too quickly dismissed. Mode counts shift ~1:1 (session-cap-fb -356 / kvcache-centric +406), and /server_info is not a passive read: it dispatches into the scheduler main loop and iterates every session slot. Plus: per-D error% confounded by sticky session affinity (only 18 unique sessions cause 415 errors, decode-3 had 0 errors only because no high-error session landed there); decile 10 "recovery" was an equal-time binning artifact (24.5% under equal-count); v5 vs v5+profile time gap was 21h not 6h; p50/p90 latency comparison is N=1. Rewritten report (docs/V5_PROFILE_INVESTIGATION_ZH.md) marks each correction with ⚠️ and demotes admission-race to one of four hypotheses (H1-H4). Action items split into P0 (verify, must do first) and P1 (instrument): P0: scripts/sweep_tp1_v5_baseline_rerun_exp2.sh runs 3x v5 baseline EXP2 (no polling, identical config to the original v5 run) to test whether the 9-error baseline result is reproducible. If 3 runs give ~9 errors and profile gives 415, polling is the leading suspect. Currently running in background. P1: scheduler.py:_compute_pool_breakdown_for_diagnostics adds a read-only "pool_breakdown" dict to /server_info covering: radix_evictable_tokens, radix_protected_tokens, slot_private_held_tokens, session_slot_count, running_batch_{reqs,kv_tokens}, transfer_queue_{reqs,tokens}, prealloc_queue_{reqs,tokens}, retracted_queue_{reqs,tokens}. With these, "unaccounted = cap - sum(known)" exposes true leakage. replay.py captures all fields into the per-tick row; analyzer prints the decomposition and gracefully handles old timeseries (prints "P1 instrument absent"). Mock-tested end-to-end. SGLang patch is read-only and does not affect admission/scheduling. Old v5+profile data still analyzes correctly. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-29 22:29:21 +08:00
kzlin	6e5ed8da80	feat(kvc): Option D - delegate seed/reseed admission to D worker v4 (cap=16) saw 35% session-cap fallback because the local soft_cap min(16, usable / target) evaluates to 1-2 for large agentic inputs. The cap was hit not because D was full but because replay's heuristic underestimated capacity. This change makes worker admission_mode authoritative for ALL paths: SGLang side: - io_struct.py: DirectAppendAdmissionReqInput gains a `mode` field ("direct_append" | "seed", default "direct_append" preserves prior behavior). - scheduler.py:admit_direct_append: when mode == "seed", skip the resident-on-D requirement and run the same capacity check + LRU eviction (maybe_trim_decode_session_cache) that direct_append uses. This lets D atomically decide if a new session can be admitted based on actual token_to_kv_pool_allocator state. Replay side (replay.py): - _query_decode_direct_admission gains a `mode` parameter. - _reserve_decode_session_capacity: in worker admission_mode, the seed/reseed branch now queries D with mode="seed" and trusts the result, instead of estimating capacity from the residency snapshot. - _should_admit_new_decode_session: in worker mode, skip the local soft_cap pre-check and let D decide. Same-D session fast-path is preserved. Effects: - Local hardcoded cap of 16 is bypassed under worker mode; D's real KV pool size is the only constraint. - LRU eviction runs in D's process atomically with admission, so starvation (the v3 bimodal "lucky vs starved sessions" pattern) should resolve. scripts/sweep_tp1_v5_optD.sh added to run the same 1P7D / 2P6D configs as v4 with the new admission path. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-28 23:40:03 +08:00
kzlin	c9d350b372	docs: KVC v1-v4 debug journey + raise session soft_cap to 16 Document the iterative debugging from v1 (broken KVC) through v4 (routing fixed + session cap raised), with code-level analysis of the two main bugs encountered: 1. v2 root cause (mis-diagnosed previously as `allow_local_prefill`): `--policy default` for KVC mechanism caused replay's round-robin policy and the PD router's round-robin to diverge, sending requests with `session_params` to a D worker that did not have the session open. Resulted in 56-61% truncation with finish_reason "session id X does not exist". Fix: use `--policy kv-aware` (sweep_tp1_v3_kvaware.sh) so replay emits `x-smg-target-worker` and PD router uses consistent_hashing. 2. v3 new bottleneck: `pd-router-fallback-large-append-session-cap` dominated 52-65% of requests. Root cause was hardcoded `min(4, ...)` in `_decode_session_soft_cap`. With 7 D workers x 4 sessions = 28 slots for 52 trace sessions, ~24 sessions starved permanently (bimodal direct-to-D rate of 0% or 99%). Fix: raise the cap to 16 (replay.py). Also includes the v3 finding that direct-to-d-session path P50=0.495s and TTFT P50=0.043s already beats the 8-way DP baseline (0.65s/0.093s) - the KVC core mechanism works when fallback paths are avoided. Files: - docs/KVC_DEBUG_JOURNEY_V1_TO_V4.md: full journey + code location index - docs/SWEBENCH_EXPERIMENT_{PROGRESS,RESULTS}.md: prior session notes - scripts/sweep_tp1_v{2,3,4}*.sh: experiment driver scripts - src/agentic_pd_hybrid/replay.py: cap 4 -> 16, audit fields - src/agentic_pd_hybrid/pd_router.py: strip session_params from prefill - src/agentic_pd_hybrid/metrics.py: truncated_request_count Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-28 21:10:41 +08:00
Gahow Wang	13bb31a446	Add kvcache-centric profiling and admission controls	2026-04-25 16:00:52 +00:00
Gahow Wang	b8e6f13c20	feat(sglang): support decode session cache admission	2026-04-24 12:30:41 +00:00
Gahow Wang	bded08301f	chore: vendor sglang v0.5.10 snapshot	2026-04-24 12:29:36 +00:00
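
Note on 2dfe22ab20: the commit describes SnapshotBufAllocator only as a "first-fit free-list with 4 KB alignment". A minimal sketch of that technique follows. The class name FirstFitAllocator, the release() method, and the offset-only bookkeeping are assumptions for illustration and are not taken from the repository; the real allocator carves ranges out of a GPU buffer, which this sketch does not touch.

```python
from typing import List, Optional, Tuple

ALIGN = 4096  # 4 KB alignment, the figure given in the commit message


def align_up(nbytes: int, align: int = ALIGN) -> int:
    """Round nbytes up to the next multiple of align."""
    return (nbytes + align - 1) // align * align


class FirstFitAllocator:
    """Hypothetical stand-in: tracks byte offsets only, no GPU memory."""

    def __init__(self, capacity: int) -> None:
        # Assumes capacity is itself a multiple of ALIGN.
        self.capacity = capacity
        # Free ranges as (offset, size), kept sorted by offset.
        self.free: List[Tuple[int, int]] = [(0, capacity)]

    def alloc(self, nbytes: int) -> Optional[int]:
        """Return the offset of the first free range that fits, else None."""
        size = align_up(nbytes)
        for i, (off, avail) in enumerate(self.free):
            if avail >= size:
                if avail == size:
                    del self.free[i]
                else:
                    self.free[i] = (off + size, avail - size)
                return off
        return None  # caller decides how to handle a full buffer

    def release(self, offset: int, nbytes: int) -> None:
        """Return a range to the free list and merge adjacent ranges."""
        self.free.append((offset, align_up(nbytes)))
        self.free.sort()
        merged: List[Tuple[int, int]] = []
        for off, size in self.free:
            if merged and merged[-1][0] + merged[-1][1] == off:
                merged[-1] = (merged[-1][0], merged[-1][1] + size)
            else:
                merged.append((off, size))
        self.free = merged


buf = FirstFitAllocator(4 * ALIGN)
first = buf.alloc(1)      # rounded up to one 4 KB block
second = buf.alloc(5000)  # rounded up to two 4 KB blocks
```

Returning None on exhaustion, rather than raising, keeps a failed snapshot reservation local to the caller; whether the real controller behaves this way is not stated in the commit.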

13 Commits
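
Note on a369722efe: the leak message quoted in that commit can be checked by hand. The pool categories sum to 288391 + 5 + 0 + 0 = 288396 against an expected 288460, a gap of 64 tokens, which is consistent with slots held by a pending prepare_receive. The sketch below reproduces that arithmetic. The function names are hypothetical and the real check lives in _check_radix_cache_memory; per 2dfe22ab20 this accommodation was later removed.

```python
def unaccounted_tokens(
    total: int,
    available: int,
    evictable: int,
    protected: int,
    session_held: int,
    snapshot_reserved: int = 0,
) -> int:
    """Tokens in the pool that no bookkeeping category explains."""
    known = available + evictable + protected + session_held + snapshot_reserved
    return total - known


def check_no_leak(total: int, **categories: int) -> None:
    """Raise if the categories do not add up to the pool size (hypothetical helper)."""
    gap = unaccounted_tokens(total, **categories)
    if gap != 0:
        raise ValueError(f"memory leak detected: {gap} tokens unaccounted, {categories}")


# Figures from the commit message, before the fix: 64 tokens look leaked.
gap_before = unaccounted_tokens(288460, 288391, 5, 0, 0)

# With the snapshot-reserved slots counted, the books balance.
gap_after = unaccounted_tokens(288460, 288391, 5, 0, 0, snapshot_reserved=64)
```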