agentic-pd-hybrid

Author	SHA1	Message	Date
kzlin	6e5ed8da80	feat(kvc): Option D - delegate seed/reseed admission to D worker. v4 (cap=16) saw 35% session-cap fallback because the local soft_cap min(16, usable / target) evaluates to 1-2 for large agentic inputs. The cap was hit not because D was full but because replay's heuristic underestimated capacity. This change makes worker admission_mode authoritative for ALL paths: SGLang side: - io_struct.py: DirectAppendAdmissionReqInput gains a `mode` field ("direct_append" | "seed", default "direct_append" preserves prior behavior). - scheduler.py:admit_direct_append: when mode == "seed", skip the resident-on-D requirement and run the same capacity check + LRU eviction (maybe_trim_decode_session_cache) that direct_append uses. This lets D atomically decide if a new session can be admitted based on actual token_to_kv_pool_allocator state. Replay side (replay.py): - _query_decode_direct_admission gains a `mode` parameter. - _reserve_decode_session_capacity: in worker admission_mode, the seed/reseed branch now queries D with mode="seed" and trusts the result, instead of estimating capacity from the residency snapshot. - _should_admit_new_decode_session: in worker mode, skip the local soft_cap pre-check and let D decide. Same-D session fast-path is preserved. Effects: - Local hardcoded cap of 16 is bypassed under worker mode; D's real KV pool size is the only constraint. - LRU eviction runs in D's process atomically with admission, so starvation (the v3 bimodal "lucky vs starved sessions" pattern) should resolve. scripts/sweep_tp1_v5_optD.sh added to run the same 1P7D / 2P6D configs as v4 with the new admission path. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-28 23:40:03 +08:00
kzlin	74194e660a	docs: v4 final results, error analysis, and updated journey. Add v4 sweep results and post-mortem analysis showing: - direct-to-D path: 54.3% (1P7D) / 58.0% (2P6D) of requests now use KVC cleanly. P50=0.5s and TTFT P50=0.043s; this path beats baseline 8DP across the board (P50 -24%, TTFT P50 -54%, TTFT P90 -79%). - Overall vs baseline (errors+truncated excluded): v4 2P6D P50=0.85s vs baseline 0.66s (28% slower). The reason is not errors: 35% of requests still hit fallback-large-append-session-cap, where capacity-based cap = usable_tokens / target_tokens evaluates to 1-2 (not 16) for large agentic inputs. - 9-10% errors on KVC variants are mooncake TCP transfer timeouts, not SGLang logic bugs. Prefill log shows "Failed to send kv chunk ... 32s timeout ... session not alive". Errors concentrate in turn>=31 (large inputs) after run >44.8%. Track: - docs/KVC_DEBUG_JOURNEY_V1_TO_V4.md: append v4 results table, per-mode breakdown, and error root cause. - scripts/analysis/{analyze_v3,analyze_v4,analyze_errors,compare_no_error}.py - outputs/qwen3-30b-tp1-v{3,4}/exp_summary.json (force-added, small JSON; metrics.jsonl excluded due to size). - outputs/qwen3-30b-tp1-v{3,4}*/sweep_results.txt Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-28 23:34:01 +08:00
kzlin	c9d350b372	docs: KVC v1-v4 debug journey + raise session soft_cap to 16. Document the iterative debugging from v1 (broken KVC) through v4 (routing fixed + session cap raised), with code-level analysis of the two main bugs encountered: 1. v2 root cause (mis-diagnosed previously as `allow_local_prefill`): `--policy default` for KVC mechanism caused replay's round-robin policy and the PD router's round-robin to diverge, sending requests with `session_params` to a D worker that did not have the session open. This resulted in 56-61% truncation with finish_reason "session id X does not exist". Fix: use `--policy kv-aware` (sweep_tp1_v3_kvaware.sh) so replay emits `x-smg-target-worker` and PD router uses consistent_hashing. 2. v3 new bottleneck: `pd-router-fallback-large-append-session-cap` dominated 52-65% of requests. Root cause was hardcoded `min(4, ...)` in `_decode_session_soft_cap`. With 7 D workers x 4 sessions = 28 slots for 52 trace sessions, ~24 sessions starved permanently (bimodal direct-to-D rate of 0% or 99%). Fix: raise the cap to 16 (replay.py). Also includes the v3 finding that direct-to-d-session path P50=0.495s and TTFT P50=0.043s already beats the 8-way DP baseline (0.65s/0.093s) - the KVC core mechanism works when fallback paths are avoided. Files: - docs/KVC_DEBUG_JOURNEY_V1_TO_V4.md: full journey + code location index - docs/SWEBENCH_EXPERIMENT_{PROGRESS,RESULTS}.md: prior session notes - scripts/sweep_tp1_v{2,3,4}*.sh: experiment driver scripts - src/agentic_pd_hybrid/replay.py: cap 4 -> 16, audit fields - src/agentic_pd_hybrid/pd_router.py: strip session_params from prefill - src/agentic_pd_hybrid/metrics.py: truncated_request_count Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-28 21:10:41 +08:00
Gahow Wang	e9062b1d6e	Document PD baseline comparison	2026-04-25 17:29:27 +00:00
Gahow Wang	c928c7db23	Add transfer queue admission knobs	2026-04-25 17:29:15 +00:00
Gahow Wang	fe583fb413	Document kvcache-centric experiment progress	2026-04-25 16:01:31 +00:00
Gahow Wang	13bb31a446	Add kvcache-centric profiling and admission controls	2026-04-25 16:00:52 +00:00
Gahow Wang	08b13d22bc	docs: rewrite project docs in concise chinese	2026-04-24 12:41:52 +00:00
Gahow Wang	5bdc0ed4f0	docs: document sglang maintenance workflow	2026-04-24 12:31:32 +00:00
Gahow Wang	b8e6f13c20	feat(sglang): support decode session cache admission	2026-04-24 12:30:41 +00:00
Gahow Wang	bded08301f	chore: vendor sglang v0.5.10 snapshot	2026-04-24 12:29:36 +00:00
Gahow Wang	78f0d15221	docs: document project design and status	2026-04-24 12:17:55 +00:00
Gahow Wang	4bca741f32	feat: add agentic pd hybrid benchmark prototype	2026-04-24 12:17:46 +00:00
Gahow Wang	d2fe014db7	chore: initialize repo hygiene	2026-04-24 12:17:40 +00:00

14 Commits
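
The session-cap arithmetic that the three most recent commit messages refer to can be illustrated with a minimal sketch. It is not the code in replay.py: the function names, the use of integer division, and the token counts in the usage lines are assumptions made for illustration. Only the formula min(cap, usable / target), the caps 4 and 16, and the 7 D workers / 52 trace sessions figures come from the commit messages.

```python
def decode_session_soft_cap(usable_tokens: int, target_tokens: int, hard_cap: int = 16) -> int:
    """Hypothetical stand-in for the local soft cap: min(hard_cap, usable / target).

    Integer division is an assumption; the commit messages only give the formula.
    """
    if target_tokens <= 0:
        raise ValueError("target_tokens must be positive")
    return min(hard_cap, usable_tokens // target_tokens)


def starved_sessions(num_decode_workers: int, per_worker_cap: int, num_sessions: int) -> int:
    """Sessions left without a slot when every D worker admits at most per_worker_cap."""
    slots = num_decode_workers * per_worker_cap
    return max(0, num_sessions - slots)


# Illustrative token counts (not measured values): a large agentic input makes
# the capacity term, not the hard cap of 16, the binding constraint.
small_cap = decode_session_soft_cap(usable_tokens=200_000, target_tokens=120_000)
# A small target lets the hard cap bind instead.
large_cap = decode_session_soft_cap(usable_tokens=10_000_000, target_tokens=120_000)

# v3 figures from the commit message: 7 D workers x cap 4 = 28 slots for 52 sessions.
v3_starved = starved_sessions(num_decode_workers=7, per_worker_cap=4, num_sessions=52)

print(small_cap, large_cap, v3_starved)
```

With these inputs the capacity term yields a cap of 1 for the large input, which is the kind of value the v4 post-mortem reports (1-2 rather than 16), and the v3 slot count leaves 24 of 52 sessions without a slot.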