agentic-pd-hybrid

Author	SHA1	Message	Date
tim	ef4dc81ea9	docs(experiments): forensic explanation for E2 80% failure rate Pulling admission-events.jsonl, prefill-0.log, and request-metrics sampling shows the 1054 failures are NOT timeouts as initially assumed. They are a 3-layer cascade: L1: 562 "no-space" + 43 "session-not-resident" worker admission rejects (51% of all admit attempts) because D0/D1 KV pools saturate while D2 stays empty. L2: rejects re-route to seed/reseed which need mooncake P→D KV transfer; the backlog drops mooncake heartbeats and prefill-0 logs "Decode instance could be dead, remote mooncake session ... is not alive". L3: SGLang aborts the request, SSE stream closes with 0 tokens, agentic-pd-hybrid raises "generate stream ended before producing any token" (the literal error string for all 1054). E1 didn't hit this because pd-disaggregation has no admission RPC — sessions just queue behind the running batch, paying TTFT instead of failing. KVC v2's worker admission is supposed to be a safety valve; on the cold-D pathology it becomes a failure amplifier. The real fix is upstream D rebalancing (cold-D bonus or pre-warm), not relaxing admission. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-12 10:38:49 +08:00
tim	3db2d84df8	docs(experiments): E2 complete -- qualified H1 with a surprise E2 finished 1h33min wall. Headline contrast on the matched Inferact 50-session subset: E1 (naive 1P3D + kv-aware + RDMA): 1200/1285 succ, lat p50=93s p99=219s, TTFT p50=89s p99=207s E2 (KVC v2 + RDMA): 231/1285 succ, lat p50=7.4s p99=65s, TTFT p50=0.43s p99=8.7s E2 is 12.4× worse on failure rate but roughly 200× better on TTFT p50 (89s vs 0.43s) among the requests that did complete. Both runs leave D2 entirely unused for the same structural reason: Inferact's shared "permissions instructions" boilerplate makes overlap dominate the kv-aware lex score, and v2's migration mechanism only fires on capacity rejects which never reach D2. The 1054 E2 timeouts are downstream of that imbalance, not a v2 bug per se. The doc closes with five concrete follow-ups for the next agent: cold-D bonus, router-mode admission, default-policy control arm, TCP-loopback comparison, failure mode forensics. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-12 03:23:33 +08:00
tim	e3e5c45ed4	docs(experiments): E2 mid-run finding — D2 stays cold in KVC v2 too Same pathological imbalance E1 showed reproduces in E2: D2 has zero bindings at 33% POSTs in. Root cause is structural, not a KVC v2 bug: all 50 Inferact sessions begin with identical "permissions instructions" boilerplate, so the converter assigns them identical first-block hash_ids. kv-aware policy's overlap term (lex-score position 0) makes any already-resident D dominate a fresh D unconditionally, and v2's migration only activates on admission rejects which never fire because D0/D1 KV pools have headroom. The H1 conclusion is qualified: KVC v2 helps per-request work (direct- to-D fast path) but does not rebalance D worker load on workloads with shared cross-session prefixes. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-12 02:08:00 +08:00
tim	631b2c8847	docs(experiments): E1 results — naive 1P3D + kv-aware confirms H1 baseline E1 finished 1h29min wall on the 50-session Inferact subset. Headline: 1200/1285 succeeded, latency p50=93s p99=219s, TTFT p50=89s p99=207s, 85 timeouts. Decode-2 was never bound to a single session — all 50 sessions stuck to decode-0/1 by kv-aware policy stickiness with no migration to rebalance, so effective topology was 1P2D, not 1P3D. This is exactly the failure mode H1 predicts naive pd-disaggregation should exhibit, giving E2 (full KVC v2 with migration) a concrete baseline to improve against. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-12 01:49:52 +08:00
tim	ad8aaa8c5a	feat(experiments): E2 sweep — KVC v2 + RDMA on the matched subset KVC v2 config from sweep_ts1_migration_v2.sh (reset-on-success + direct-append threshold 8192) layered on top of the RDMA-enabled mooncake stack, against the same outputs/inferact_50sess.jsonl subset that E1 uses. Pair-wise contrast tests H1 (KVC layer marginal contribution on top of 1P3D + kv-aware) and H2/H3 (RDMA reducing reseed slow-path tail). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-12 00:49:53 +08:00
tim	bb9cc249cd	feat(experiments): E1 sweep on 50-session deterministic subset scripts/sample_trace_subset.py: file-order head-cut that takes the first N sessions of a converted trace. No RNG, no hashing: same input yields byte-identical output (the included assertion compares md5 across two runs). scripts/sweep_e1_naive_1p3d.sh: E1 of ONBOARDING_NEXT_AGENT_ZH §3.1: mechanism=pd-disaggregation, policy=kv-aware, 1P3D, RDMA on (mlx5_60). Defaults to outputs/inferact_50sess.jsonl so E1 and E2 can share the exact same subset; override via TRACE= env var to run on the full 20,230-request trace. Reproducing the subset: uv run --no-sync python scripts/sample_trace_subset.py --input outputs/inferact_codex_swebenchpro.jsonl --output outputs/inferact_50sess.jsonl --sessions 50 # expected output_md5: 7bb263a32600ef5a6ef5099ba340a487 # 1285 requests, mean input_length 67631 tokens Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-12 00:21:36 +08:00
tim	b55371fe69	docs: H200 + driver 570 setup guide + 11 lessons learned Captures the full debugging journey of getting vendored SGLang 0.5.10 + mooncake RDMA running on a 4×H200 node with the older driver 570.86.15. Driver 570's actual API is cu12.8 — nvidia-smi's "CUDA Version: 13.0" header is a forward-compat ceiling, not the driver's own version — and that single misreading drove most of the detours. Lessons cover: pip vs vendor sglang divergence, why cu13 switching was a dead end (mooncake is cu12-only by wheel, driver 570 can't run cu13 anyway), why --disable-overlap-schedule alone isn't enough, why pip nvidia-cuda-nvcc-cu12 doesn't ship the nvcc binary, and how tvm_ffi's ninja-driven nvcc invocation makes CUDA_HOME the single hook point that fixes everything. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-12 00:10:14 +08:00
tim	d11a66d11b	feat(scripts): cu12.8 env wrapper + Inferact trace converter setup_env.sh: source-able shell snippet that points tvm_ffi (vendor sglang JIT compiler) at $HOME/cuda-12.8/bin/nvcc and exposes both libcudart.so.12 (for mooncake.engine, a cu12 wheel) and cu12.8 lib64 (for tvm_ffi compile-time linker) on LD_LIBRARY_PATH. Without this, JIT-compiled kernels NEEDED libcudart.so.13 and driver 570 rejected them at every JIT call. convert_inferact_to_trace.py: turns Inferact codex_swebenchpro_traces (ShareGPT {"from","value"} pairs) into the chat_id/parent_chat_id/ turn/hash_ids JSONL schema replay.py expects. Tokenizes with the model's own tokenizer, builds prefix-sharing 24-token block hashes, synthesizes timestamps. Output cross-checks 20,230 LLM calls, which exactly matches the Inferact README count for 610 successful trials. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-12 00:10:06 +08:00
tim	a418aafeed	feat(stack): pin PD workers to --disable-overlap-schedule On a node with driver 570.86.15 (cu12.8 driver API ceiling), SGLang's overlap event loop hits cudaErrorInsufficientDriver inside event_loop_overlap_disagg_prefill → resolve_future_token_ids JIT kernel. Switching to the normal event loop sidesteps this specific codepath. The flag is harmless on newer drivers and remains a useful default until overlap is independently re-validated on this hardware. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-12 00:09:56 +08:00
tim	e874b1f055	feat(env): install vendored SGLang via uv path source Replace pip-resolved sglang==0.5.10 with an editable install from third_party/sglang/python. The vendored fork carries patches the pip release does not (admit_direct_append RPC types, _should_allow_local_ prefill_on_decode, maybe_trim_decode_session_cache, backpressure pause hint) — KVC routing depends on them, so the vendored copy must be the import target, not just on PYTHONPATH at runtime. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-12 00:09:50 +08:00
kzlin	7590e55189	docs: archive deprecated docs to docs/archive/, drop E1 from onboarding Two cleanups: 1. Drop "E1: naive 1P3D default" experiment from the onboarding manual. GPU hours are precious; naive 1P3D + policy=default has near-certain loss on multi-turn cache hit (it's round-robin without prefix awareness), so the comparison doesn't add information vs E1=naive 1P3D kv-aware. The new manifest has only 2 runs: E1 (naive 1P3D kv-aware) + E2 (KVC v2 + RDMA). Run-time budget drops from 16.5h serial to 11h serial / 5.5h parallel. Updated: - §0 TL;DR ("3 groups" -> "2 groups") - §2 H1 hypothesis (drop "default and kv-aware each one" -> just kv-aware) - §3.1 experiment matrix (3 rows -> 2 rows + rationale for the drop) - §3.2 startup config (drop E1 default section, renumber E2/E3 -> E1/E2) - §6 decision table + expected-range table - §7 FAQ ("3 runs E1-E3" -> "2 runs E1-E2") - §9 deliverables 2. Move 8 deprecated docs to docs/archive/: AGENTIC_FIT_ANALYSIS_ZH.md (ts=10 era analysis; superseded) STRUCTURAL_VALIDATION_REPORT_ZH.md (ts=10 era validation; superseded) KVC_DEBUG_JOURNEY_V1_TO_V5.md (v1-v5 sweep process notes) V5_PROFILE_INVESTIGATION_ZH.md (v5 1Hz polling investigation) REFACTOR_PLAN_ZH.md (v0 plan; superseded by V1) KVCACHE_CENTRIC_PROGRESS_ZH.md (earliest 2026-04-27 progress) SWEBENCH_EXPERIMENT_PROGRESS.md (early SWE trace setup) SWEBENCH_EXPERIMENT_RESULTS.md (early SWE result snapshot) All cross-references in active docs (V2_DEEP_ANALYSIS / V2_RESULTS / REFACTOR_PLAN_V1 / TEAM_REPORT / ONBOARDING) rewritten from `docs/FOO.md` to `docs/archive/FOO.md` via sed pass. Added `docs/archive/README.md` explaining what each archived doc is and when (if ever) to reopen it. Designed so a new reader hitting the archive dir immediately knows it's not required reading. After this commit the active docs in docs/ are 9 files (down from 17), which should make the onboarding doc's "Level 1 / Level 2 / Level 3" classification self-evident.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-11 22:40:35 +08:00
kzlin	5a2fb8799c	docs(kvc): onboarding manual for the next SWE agent A single self-contained reading manual designed to bring a fresh agent (LLM or human) to current-state proficiency in 30 min of reading + 30 min of environment validation, then have them run the next round of ablation experiments without re-litigating questions already settled. Structure: §0 TL;DR -- what you are inheriting in 5 lines §1 Reading order, tiered into Must-Read / On-Demand / Archive, with reasons for each §2 Current-state snapshot: trace/hardware/branches + claims verified + hypotheses pending §3 The three ablation experiments (E1/E2/E3) with full CLI flag specifications and environment-validation checklist §4 Known gotchas (8 of them) with symptoms and fixes -- the most important section to skim before you start §5 CLI cheatsheet: run experiments / read data / plot / git §6 Result-analysis checklist: numbers to collect, expected ranges §7 FAQ for likely stuck-points §8 Anti-patterns: what NOT to do §9 Two specific deliverables the main agent expects back Appendix A: file location lookup table Appendix B: commit lookup table (by intent) Goals encoded into the doc: - Frame "your job is ablation, not new development" -- the new agent should not be tempted to start D->P sync work; that goes on the feat/d-to-p-sync branch in a separate phase. - Make abort-accounting / max-input-len / mooncake-TCP-default pitfalls extremely visible up front so they don't get repeated. - Provide expected-result ranges so a 2x deviation is treated as a config check, not a "finding". - Make the critic-vs-production framing explicit so the new agent knows when an audit-style "MAJOR" is actually a design intent. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-11 22:31:08 +08:00
kzlin	506d360160	fix(figures): GPU utilization figure annotation/headroom polish Bar-overlap fix: extend ylim by 35-45% above the tallest bar to give the "P GPU only sees 328 requests" and "P GPU does 1.07M tokens" annotations clean white-bbox space above the bars instead of crashing into the KVC D bars at x=1. Move both annotation xytext positions to x=2.4 (left panel) and x=5.5 (right panel) so the arrows pull away from the orange P bar toward the center of the panel. Group labels (KVC 1P3D / DP 4-way CA) kept in axes-fraction bboxes at y=1.02; subplot titles raised to pad=24 to leave room. Note: a small visual collision between the bboxed group labels and the subplot-title second line remains in the rendered output (acknowledged in the prior conversation). Acceptable for now; full layout rework is deferred. The annotation-vs-bar overlap (the original blocker) is fixed. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-11 22:28:39 +08:00
kzlin	c01d6101d6	docs(kvc): freeze reseed slow-path audit + three reviewer challenges Standalone reference document capturing the v2 reseed slow-path forensic audit before opening the feat/d-to-p-sync branch. Designed to be quoted directly by future paper drafts and to prevent the team from relitigating the same questions verbally. Contents: §1. The three team-member challenges that disproved "capacity-backup will save the slow path" (each with code citation and verdict): 1) P pool can't fit all backups -- replay.py:1618-1620 caps backup count at 1 for sessions with ~50K peak input. 2) P's backup is a stale snapshot -- 49K of direct-to-D append work never flows through P. _commit_prefill_backup_residency (replay.py:1483) is only called from seed/reseed paths; direct-to-D path (replay.py:2719) never touches P-side state. 3) When D evicts, old KV is freed directly (no D->P dump). session_aware_cache.release_session only calls kv_pool_allocator.free(). §2. End-to-end reseed timeline (t=0 to t=4550ms) with code citations showing exactly where each component sits. P-side re-prefill = 1.5-3s, mooncake transfer = 1.5-4s, both contributing 50/50 to total reseed cost. §3. Table of "looks like D->P but isn't" code locations -- every candidate found during forensic search ruled out with line citations. §4. Specification of what D->P incremental sync would require: mooncake bidirectional roles (~400 LOC), D-side append commit hook (easy), P-side radix tree multi-producer extension (the real blocker), agentic-pd-hybrid replay.py hooks. Estimated 1-2 weeks engineering. §5. Confirmation via `git ls-remote origin --refs` that the author has NOT secretly implemented D->P on another branch -- only main + this working branch exist on the server. §6. Roadmap for the upcoming feat/d-to-p-sync branch. Appendices: code position crosswalk, related commits, paper section suggestions. This document is referenced by V2_DEEP_ANALYSIS_ZH §4.2 and by KVC_ROUTER_ALGORITHM §9 Open Question 4.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-11 22:20:34 +08:00
kzlin	9ccd853066	docs(kvc): correct reseed cost decomposition + flag D->P sync gap After an independent Opus-agent forensic audit, the previous "(c) incremental fetch (large engineering effort, not implemented)" line in V2_DEEP_ANALYSIS §4.2 was understating the gap. The audit confirmed: - No D->P KV transfer code exists in the framework at any layer (agentic_pd_hybrid orchestration, vendored SGLang disaggregation, or mooncake transport). - Mooncake MooncakeKVManager has a hard role split: PREFILL = sender, DECODE = receiver-only loop. `add_transfer_request` asserts the disaggregation_mode is PREFILL. - The BaseKVSender / BaseKVReceiver abstraction has no bidirectional slot. - session_aware_cache.release_session only calls kv_pool_allocator.free() on eviction -- no serialization, no outbound network call. - _commit_prefill_backup_residency is only called from the seed/reseed path (_invoke_kvcache_seeded_router). direct-to-D path never updates P-side backup state. - "capacity-backup" policy semantics: it only skips the close on P after reseed -- the backup is the seed-time static snapshot, never refreshed by D-side append-prefill activity. V2_DEEP_ANALYSIS §4.2: - Decomposed the 3-7s reseed cost into the P-side re-prefill segment (1.5-3s, dominant) and the P->D mooncake transfer segment (1.5-4s). - Quantified the realistic effect of enabling RDMA: only the transfer segment shrinks, reseed reduces to 1.7-3.2s, TTFT p99 ~0.7s, still loses to DP's 0.43s. - Replaced the throwaway "(c) incremental fetch" line with a full paragraph explaining what D->P sync would require, why it's the largest engineering gap, and that the blocker is SGLang's radix-tree single-producer assumption, not the network layer. KVC_ROUTER_ALGORITHM §9: - Refined Open Question 3 (RDMA) to clarify it only helps the transfer segment, not the re-prefill segment. - Added Open Question 4: D->P incremental KV sync as the central future-work contribution gap, with cited evidence for why it doesn't currently exist.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-11 22:07:14 +08:00
kzlin	517677d7f2	docs(kvc): add GPU-utilization and cache-efficiency figures (rebut critic) Two figures inserted into V2_DEEP_ANALYSIS §4.5 and §4.4 respectively, to visually rebut the two critic-agent claims that we argued in prose were design intent, not deficiencies. (1) gpu_utilization.png -- §4.5 "P GPU is wasted 90% of the time" Two-panel side-by-side: Left (request count view, the naive reading): KVC P = 328 reqs (7.4%), KVC D = ~1450 each, DP = ~1100 each. P "looks idle." Right (compute work view, the honest reading): KVC P does 1.07M tokens of prefill, comparable to each KVC D worker's ~0.80M. P is a low-frequency high-cost safety net, not idle capacity. Bonus finding: KVC's total compute (3.47M tokens across 4 GPUs) is 33% LESS than DP's (5.17M). Same GPUs, less work done. That's the affinity win. (2) cache_efficiency.png -- §4.4 "Cache concentration is not policy win" Two-panel side-by-side. The setup: KVC has 21% LESS total KV pool (276K vs 351K tokens; DP's pool is 27% larger) yet caches MORE per request. Left (cache hit rate vs turn number): KVC's session-affinity lets hit rate accumulate with turns; DP's hash + radix-LRU causes a mid-turn drift around turns 8-25 where KVC = 97.0% vs DP = 95.8% (1.24pp gap). Shows mechanism, not just outcome. Right (ECDF of per-request uncached tokens, log x): KVC's distribution concentrates near zero (50% < 187 tokens), DP's is spread (50% < 781 tokens). At uncached = 500 tokens threshold, KVC has 74% of requests below, DP has 31%. → smaller pool, better retention, less per-request work. Direct empirical rebuttal to "fragmentation is architectural, not policy." Bundled scripts (rerunable): - scripts/analysis/plot_gpu_utilization.py - scripts/analysis/plot_cache_efficiency.py Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-11 18:04:49 +08:00
kzlin	c5519066de	docs(kvc): add TTFT probability density figure (KVC v2 vs 4DP) Adds a two-panel TTFT PDF comparison plot inserted as a new V2_DEEP_ANALYSIS §3.4 ("TTFT probability density comparison: bimodal vs unimodal"). Single-percentile numbers (p50 / p99) hide the qualitative difference between the two distributions; the figure makes it visible at a glance. Left panel (linear x in [0, 0.6]s, body): KVC has a sharp peak at ~40ms (the direct-to-D fast path). DP has a broad peak around 50-200ms (full prefill per request). Annotated with p50 and p90 markers for each side. Right panel (log x in [10ms, 10s], full range): KVC is visibly bimodal: a tall fast-path peak plus a small reseed tail around 1-5s. DP is unimodal: a single broad peak with shorter tail. Annotated with p99 callouts pointing to each tail. KDE: scipy.stats.gaussian_kde, bandwidth=0.15 for the body (Scott's rule oversmooths the sharp fast-path peak), log10-transformed for the full-range panel so the bimodal structure is visible. Bundled: - scripts/analysis/plot_ttft_pdf.py -- rerunable when v2 / DP data change. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-11 17:46:27 +08:00
kzlin	b5af19583b	docs(kvc): replace v2 path breakdown tables with generated figures V2_DEEP_ANALYSIS §3.1 (execution_mode distribution) and §3.2 (path-level latency vs DP) had hand-typed tables with approximate latencies (e.g. "~1.0s") and required readers to mentally compare 5+ rows × 5 columns. Both sections now reference generated PNG figures derived directly from the v2 + DP metrics.jsonl files. §3.1 figure (v2_execution_mode_distribution.png): Horizontal bar chart, log x-axis. 4076 direct-to-D fast-path requests (green) dwarf the rest by ~30x; the long tail of slow / fallback / failure modes is visible at one glance. Counts and percentages annotated on each bar. §3.2 figure (v2_path_level_latency.png): Grouped bar chart, log y-axis. Per-path TTFT p50 / TTFT p99 / Lat p50 with exact numeric labels (no more "~1.0s" approximations). Sample counts annotated below each path. Quick visual reads: - KVC fast path TTFT p50 41ms vs DP 92ms (2.2x faster) - KVC reseed TTFT p99 5.12s vs DP 0.43s (12x slower) -- the cost - KVC no-d-capacity TTFT p99 7.65s (worst case) Bundled: - scripts/analysis/plot_v2_path_breakdown.py -- the script that generates both figures; rerunable when v2 data changes. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-11 17:38:43 +08:00
kzlin	37e9caa431	docs(kvc): production-decision reframe + formal router algorithm spec After the critic-agent audit, V2_DEEP_ANALYSIS had drifted into an audit-grade "5 wins / 1 loss / 3 draws" framing that mistook KVC's deliberate design motifs (cache concentration via session affinity; prefill-GPU idle as TTFT-stability trade-off) for "comparison unfairness." This commit corrects the framing back to a production- decision lens and adds a paper-track formal specification of the router algorithm. V2_DEEP_ANALYSIS_ZH.md changes: - §0 TL;DR: lead with "online coding agent serving should pick KVC 1P3D"; the only real cost is TTFT p99 long-tail (3x DP) from the 8.3% mooncake reseed path, mitigable with real RDMA. - §4 restructured into three buckets: real costs (TTFT p99 tail, abort accounting now fixed), counter-arguments to the critic (cache concentration and idle prefill GPU are design intent, not deficits), methodology to-do (naive-1P3D control, v2 N>=2 determinism). - §6 replaces "5/1/3 rescoring" with production decision rationale: KVC wins on 6 latency/TTFT metrics + lower failure rate; pays TTFT p99 tail; lists workloads where DP would reverse the call. - §8 decision points: D1 recommends Yes (accept v2 as milestone); D8 added: paper motif "KVC trades P idle for TTFT stability." 
KVC_ROUTER_ALGORITHM.md (new, paper-track, Chinese narrative + English algorithm boxes / variable names / theorems for direct paper reuse): - Problem formulation, system model, full notation - Algorithm 1 Route: lexicographic-tuple scoring on (overlap+alpha*sticky, sticky, -inflight, -assigned) - Algorithm 2 Admit: D-worker autonomous admission deciding Direct / Seed / Reseed / reject (with reason) - Algorithm 3 Dispatch: end-to-end orchestration with reset-on-success (the v2-specific fix that eliminates v1's self-amplifying thrashing) - Theorem 1 (no permanent starvation) and Theorem 2 (fast-path determinism), each with a proof sketch - Comparison table vs vanilla pd-disagg / DP cache-aware - Anti-patterns ("what KVC explicitly is NOT") - Open questions for reviewers - Suggested paper citation phrasing - Appendix A: algorithm-step to source-file:line crosswalk Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-11 17:29:18 +08:00
kzlin	5eac9b4f6b	fix(metrics): exclude aborted requests from latency/ttft/tpot stats The old filter `if row.latency_s is not None` accepted SGLang's fast input-length-aborts (latency_s ~ 0.08s, finish_reason='abort/BadRequest') as if they were successful zero-cost requests. This deflated mean/p50 of any run where the model rejected oversized inputs. Impact on existing comparisons (ts=1 4-run validation + v2): KVC v2 has 40 aborts + 5 ReadTimeouts (was reported as just 5); DP 4w has 67 aborts (was reported as 5). Both runs have abort behavior; the asymmetry (40 vs 67) is purely from SGLang's mem-fraction-derived max-input-len: KVC decode-only worker gets ~10 GB free GPU mem -> max-input=92098, DP fused worker gets ~9 GB -> max-input=87811, because DP also needs chunked-prefill workspace. The KVC-vs-DP latency-win direction holds and widens slightly under the fixed filter (lat mean delta: -0.8% -> -1.4%); see V2_DEEP_ANALYSIS_ZH §4.3 for the recomputed table. Changes: - metrics.py: new _is_failed_request(row) helper; latency/ttft/tpot stats now exclude both errors and aborts. New summary fields abort_count and failure_count expose the counts directly. - scripts/analysis/recompute_summary.py: re-derives summary.json from existing metrics.jsonl using the fixed code, with optional --diff against the old buggy summary for inspection. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-11 17:29:18 +08:00
kzlin	0c25168cad	docs(kvc): v2 deep analysis vs TEAM_REPORT baseline Post-v2 audit consolidating ts=1 validation + v1 thrashing + v2 win, plus critic-agent adversarial review of the v2 vs 4DP comparison. Headline outcomes: - TEAM_REPORT §1 (session pin starvation) fully fixed by v2 migration + reset-on-success; direct-to-D 42.8% -> 91.6%. - TEAM_REPORT §2/§3/§5 (LRU, backpressure, admission RPC) are absorbed by ts=1 natural drain time, not mechanism-fixed -- will resurface under ts=10/longer traces/higher concurrency. - TEAM_REPORT §6 (ts=10 distortion) confirmed and locked as precondition; TEAM_REPORT §8 (N=1 unreliable) rewritten to "high-pressure N>=3, normal N=1". Three new problems exposed by adversarial review: - TTFT p99: KVC 1.285s vs DP 0.427s (KVC 3.0x worse) -- cherry-picked out of the V2_RESULTS_ZH.md headline table. Root cause: 8.3% non-direct path pays 3-7s mooncake reseed cost on 50-90K-token KV transfer. - Error accounting asymmetry: DP has 67 fast-aborts (not 5) at ~0.08s each counted in latency stats; KVC's 5 ReadTimeouts excluded entirely. Root cause: --max-input-len 87811 (DP) vs 92098 (KVC) + metrics.py:124 filter. - Topology mismatch: KVC 1P3D's prefill GPU is idle 91.7% of the time (only ~373/4449 requests use seed/P path); 4DP CA has all 4 GPUs at full utilization. Plus: no naive 1P3D control exists in the repo -- cannot isolate KVC-layer contribution from 1P3D-topology contribution. Re-scored headline: 5 KVC wins / 1 DP win / 3 draws -- still net positive but not the "7/8 wins" framing the V2_RESULTS_ZH.md claims. Recommended follow-ups (ROI order): 1. naive 1P3D ts=1 N=1 control (critic's only CRITICAL finding) 2. v2 N=2/N=3 to verify ts=1 determinism with new code paths 3. symmetric error accounting recompute + DP max-input-len = 92098 rerun Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-11 11:17:00 +08:00
kzlin	2ec0debef4	feat(kvc): session migration with reset-on-success + direct-append threshold tuning KVC v2 beats 4DP at ts=1 same-scale on 7/8 metrics: TTFT mean -24%, p50 -54%, p90 -64%; lat mean -0.8%, p50 -12.6%, p90 -0.7%. Direct-to-D rate jumped 42.8% -> 91.7%. REFACTOR_PLAN_V1 scenario C achieved. Two-knob fix: - reset-on-success blacklist decay: clear (sess, D) reject counter on successful direct-to-D path. Eliminates v1 thrashing where session 6880 was stable on decode-1 for 70 turns then collapsed to 75 D-changes after cumulative transient pressure tripped the permanent blacklist. - bump --kvcache-direct-max-uncached-tokens default 2048 -> 8192 via CLI flag. 41% of v1 fallbacks were 'real-large-append' (>2048 token append); raising the threshold lets these go through the direct-to-D fast path. Code: - policies.py: RoutingState.session_d_rejects counter + KvAwarePolicy migration_reject_threshold; degenerate fallback picks least-rejected D. - replay.py: record_admission_reject + reset-on-success in _run_request; _fallthrough_reason classifies turn-2+ fall-throughs as session-not-resident / real-large-append / etc, replacing misleading 'large-append' suffix (TEAM_REPORT §2.7). - cli.py + benchmark.py: --kvcache-migration-reject-threshold flag wiring. Docs: - REFACTOR_PLAN_V1_ZH.md: forward-looking plan after ts=1 validation. - MIGRATION_V1_FINDINGS_ZH.md: v1 thrashing root-cause analysis. - V2_RESULTS_ZH.md: v2 results, scenario C achievement, attribution. - TEAM_REPORT_AGENTIC_PD_HYBRID_ZH.md: comprehensive team report. Scripts: - sweep_ts1_kvc_n3_plus_dp.sh: ts=1 baseline (KVC 1P3D N=3 + 4DP CA). - sweep_ts1_migration_v1.sh / v2.sh: validation runs. - analyze_ts1_validation.py: 4-way comparison analyzer. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-09 01:18:13 +08:00
kzlin	1d51704dad	docs(kvc): agentic-fit analysis, refactor plan, validation report Three new docs covering the structural-fit investigation: - AGENTIC_FIT_ANALYSIS_ZH.md: §1-§7 of structural design issues that surface KVC vs vanilla DP gap on real agentic workloads (SWE 50sess). Quantifies session pinning, LRU shortfall, P-side imbalance, time-scale distortion, etc., with code citations and N=3 rerun data. - REFACTOR_PLAN_ZH.md: KISS-edition refactor plan. After verifying the original "estimate inflation" and "resident_blocks aging" claims were not real bugs, scope shrinks to one code change (backpressure) plus a 4-run smoke sweep within an 8h budget. - STRUCTURAL_VALIDATION_REPORT_ZH.md: validates §1-§7 claims using existing v5 baseline rerun data + 8DP CA baseline. Each claim labeled fully-supported / indirect / retracted with the data source. Notes that backpressure E2E validation is pending GPU smoke run. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-06 21:30:11 +08:00
kzlin	7affb565b2	feat(kvc): add backpressure smoke sweep + analyzer (and v6 p1 profile script) scripts/sweep_backpressure_smoke.sh: 4-run smoke matrix (KVC baseline / KVC + backpressure / KVC + backpressure @ time-scale=1 / DP @ time-scale=1) designed to fit ~3-4h GPU budget. Validates §3 backpressure implementation and partially probes §7 time-scale distortion. scripts/analysis/analyze_backpressure_smoke.py: consumes the new structural/* jsonl files plus request-metrics; emits headline metrics, backpressure histograms, admission probe stats, and per-session pinning distribution. scripts/sweep_tp1_v6_p1_profile.sh: pre-existing v6 P1 profile sweep script (was untracked; included for completeness). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-06 21:29:56 +08:00
kzlin	c47adaf8e3	feat(kvc): honor admission backpressure hints + structural event logging Replay-side changes paired with the SGLang admission hint: - DecodeResidencyState gains pause_until_s; admission probe parses recommended_pause_ms and updates the per-D pause window. - _wait_for_decode_pause is invoked at request entry points (_invoke_router, _invoke_session_direct) so requests stall before hitting a saturated D instead of timing out via mooncake. - New CLI flags: --enable-backpressure (default off, baseline preserved), --backpressure-max-pause-s (cap on per-request sleep, default 2s). Structural instrumentation written under <run_dir>/structural/: - admission-events.jsonl: every admission probe (RTT, queue_depth, pause_ms, available_tokens, evicted_count) - backpressure-events.jsonl: every actual pause sleep - session-d-binding.jsonl: per-request policy decision Used to validate the structural claims documented separately. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-06 21:29:46 +08:00
kzlin	ca4b64c79a	feat(sglang): expose backpressure pause hint in admit_direct_append Add `recommended_pause_ms` field to DirectAppendAdmissionReqOutput so D can advise callers when its transfer queue is heavy or KV pool is near capacity. The hint is computed from transfer_queue_depth, retracted_queue_depth, and post-trim token_usage; thresholds are simple heuristics (>0.90 usage, >=8 queue depth, retracted>0). Default behavior is unchanged for callers that ignore the field. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-06 21:29:30 +08:00
kzlin	4978c0d0cd	profile(kvc): rewrite v5+profile report after critic audit + P0/P1 instrument Hostile audit of the original report flagged three load-bearing errors: 1. held_tokens semantic was inverted. session_held_tokens() at session_aware_cache.py:278-282 sums (kv_allocated_len - cache_protected_len) per slot, i.e. slot-private (NOT in radix tree). So "other = cap - held - avail" actually CONTAINS the radix-tree protected prefix cache (likely the single biggest component for shared agentic prefixes), not just running batch + in-flight as the original report claimed. 2. Admission-race causal hypothesis for the 415 EXP2+profile errors is contradicted by the data: 414/415 errors have kv_transfer_blocks > 0; they passed admission and died downstream ("generate stream ended before producing any token", raised by the client when a 200 response had an empty stream). 3. Polling deconfound was too quickly dismissed. Mode counts shift ~1:1 (session-cap-fb -356 / kvcache-centric +406), and /server_info is not a passive read; it dispatches into the scheduler main loop and iterates every session slot. Plus: per-D error% confounded by sticky session affinity (only 18 unique sessions cause 415 errors, decode-3 had 0 errors only because no high-error session landed there); decile 10 "recovery" was an equal-time binning artifact (24.5% under equal-count); v5 vs v5+profile time gap was 21h not 6h; p50/p90 latency comparison is N=1. Rewritten report (docs/V5_PROFILE_INVESTIGATION_ZH.md) marks each correction with ⚠️ and demotes admission-race to one of four hypotheses (H1-H4). Action items split into P0 (verify, must do first) and P1 (instrument): P0: scripts/sweep_tp1_v5_baseline_rerun_exp2.sh runs 3x v5 baseline EXP2 (no polling, identical config to the original v5 run) to test whether the 9-error baseline result is reproducible. If 3 runs give ~9 errors and profile gives 415, polling is the leading suspect. Currently running in background. P1: scheduler.py:_compute_pool_breakdown_for_diagnostics adds a read-only "pool_breakdown" dict to /server_info covering: radix_evictable_tokens, radix_protected_tokens, slot_private_held_tokens, session_slot_count, running_batch_{reqs,kv_tokens}, transfer_queue_{reqs,tokens}, prealloc_queue_{reqs,tokens}, retracted_queue_{reqs,tokens}. With these, "unaccounted = cap - sum(known)" exposes true leakage. replay.py captures all fields into the per-tick row; analyzer prints the decomposition and gracefully handles old timeseries (prints "P1 instrument absent"). Mock-tested end-to-end. SGLang patch is read-only and does not affect admission/scheduling. Old v5+profile data still analyzes correctly. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-29 22:29:21 +08:00
kzlin	51f5386691	profile(kvc): add D KV pool timeseries poller + analyzer for v6 root-cause v5 dropped errors but pushed session-cap fallback to 46-51%. Before adding v6 mitigations we need to attribute that capacity loss to one of: (a) active sessions — real footprint (b) idle-evictable sessions — LRU not aggressive enough (c) prefill backup blocks / in-flight / fragmentation — release timing Without this it's all guessing. Plumb a 1Hz poller into replay that hits each P/D worker's /server_info, captures session_cache + memory_usage, and writes a per-worker time-series JSONL to <run_dir>/d-pool-timeseries.jsonl. Off by default (--pool-poll-interval-s 0); v5+profile sweep enables it at 1.0s. Per-tick HTTP cost is ~8 parallel /server_info calls — negligible relative to the 50min run. Analyzer (scripts/analysis/analyze_pool_timeseries.py) decomposes each D's capacity into active_held / idle_evictable / other (= cap-held-avail, the backup-blocks bucket) / free, and reports session residency churn across workers as a starvation/thrashing signal. Mock-tested poller end-to-end (cancellation clean, file flushed, sessions captured); analyzer validated against synthetic timeseries. Next: run scripts/sweep_tp1_v5_optD_profile.sh on hardware (~90min), then analyze results to pick a v6 direction. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-29 20:04:21 +08:00
kzlin	6572d7f3f4	docs: add v5 chapter (Option D worker-mode admission) and rename to V1_TO_V5 v5 sweep (sweep_tp1_v5_optD.sh) lands the previously-deferred Option D: worker admission_mode authoritative for direct_append + seed + reseed, bypassing replay's local _decode_session_soft_cap. Key findings now documented: - errors collapse from 9-10% to 0.2% (mooncake timeouts gone) - session-cap fallback rises 33-35% -> 46-51% — D's true KV pool is the binding constraint, not replay's estimator; v4's "low fallback" was hiding capacity overruns as transfer-timeout errors - direct-to-D subset latency unchanged from v4 (admission overhead negligible) - new bottleneck: D's physical KV pool — points v6 at prefill backup release timing, priority eviction tuning, chunked seed, cross-D session migration, and real RDMA Also adds a 5th lesson on errors-vs-fallback reciprocity and updates the code index with the v5 endpoint extension and new CLI knobs. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-29 16:13:25 +08:00
kzlin	6e5ed8da80	feat(kvc): Option D - delegate seed/reseed admission to D worker v4 (cap=16) saw 35% session-cap fallback because the local soft_cap min(16, usable / target) evaluates to 1-2 for large agentic inputs. The cap was hit not because D was full but because replay's heuristic underestimated capacity. This change makes worker admission_mode authoritative for ALL paths: SGLang side: - io_struct.py: DirectAppendAdmissionReqInput gains a `mode` field ("direct_append" | "seed", default "direct_append" preserves prior behavior). - scheduler.py:admit_direct_append: when mode == "seed", skip the resident-on-D requirement and run the same capacity check + LRU eviction (maybe_trim_decode_session_cache) that direct_append uses. This lets D atomically decide if a new session can be admitted based on actual token_to_kv_pool_allocator state. Replay side (replay.py): - _query_decode_direct_admission gains a `mode` parameter. - _reserve_decode_session_capacity: in worker admission_mode, the seed/reseed branch now queries D with mode="seed" and trusts the result, instead of estimating capacity from the residency snapshot. - _should_admit_new_decode_session: in worker mode, skip the local soft_cap pre-check and let D decide. Same-D session fast-path is preserved. Effects: - Local hardcoded cap of 16 is bypassed under worker mode; D's real KV pool size is the only constraint. - LRU eviction runs in D's process atomically with admission, so starvation (the v3 bimodal "lucky vs starved sessions" pattern) should resolve. scripts/sweep_tp1_v5_optD.sh added to run the same 1P7D / 2P6D configs as v4 with the new admission path. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-28 23:40:03 +08:00
kzlin	74194e660a	docs: v4 final results, error analysis, and updated journey Add v4 sweep results and post-mortem analysis showing: - direct-to-D path: 54.3% (1P7D) / 58.0% (2P6D) of requests now use KVC cleanly. P50=0.5s and TTFT P50=0.043s; this path beats baseline 8DP across the board (P50 -24%, TTFT P50 -54%, TTFT P90 -79%). - Overall vs baseline (errors+truncated excluded): v4 2P6D P50=0.85s vs baseline 0.66s (28% slower). Reason is not errors -- 35% of requests still hit fallback-large-append-session-cap, where capacity-based cap = usable_tokens / target_tokens evaluates to 1-2 (not 16) for large agentic inputs. - 9-10% errors on KVC variants are mooncake TCP transfer timeouts, not SGLang logic bugs. Prefill log shows "Failed to send kv chunk ... 32s timeout ... session not alive". Errors concentrate in turn>=31 (large inputs) after run >44.8%. Track: - docs/KVC_DEBUG_JOURNEY_V1_TO_V4.md: append v4 results table, per-mode breakdown, and error root cause. - scripts/analysis/{analyze_v3,analyze_v4,analyze_errors,compare_no_error}.py - outputs/qwen3-30b-tp1-v{3,4}/exp_summary.json (force-added, small JSON; metrics.jsonl excluded due to size). - outputs/qwen3-30b-tp1-v{3,4}*/sweep_results.txt Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-28 23:34:01 +08:00
kzlin	c9d350b372	docs: KVC v1-v4 debug journey + raise session soft_cap to 16 Document the iterative debugging from v1 (broken KVC) through v4 (routing fixed + session cap raised), with code-level analysis of the two main bugs encountered: 1. v2 root cause (mis-diagnosed previously as `allow_local_prefill`): `--policy default` for KVC mechanism caused replay's round-robin policy and the PD router's round-robin to diverge, sending requests with `session_params` to a D worker that did not have the session open. Resulted in 56-61% truncation with finish_reason "session id X does not exist". Fix: use `--policy kv-aware` (sweep_tp1_v3_kvaware.sh) so replay emits `x-smg-target-worker` and PD router uses consistent_hashing. 2. v3 new bottleneck: `pd-router-fallback-large-append-session-cap` dominated 52-65% of requests. Root cause was hardcoded `min(4, ...)` in `_decode_session_soft_cap`. With 7 D workers x 4 sessions = 28 slots for 52 trace sessions, ~24 sessions starved permanently (bimodal direct-to-D rate of 0% or 99%). Fix: raise the cap to 16 (replay.py). Also includes the v3 finding that direct-to-d-session path P50=0.495s and TTFT P50=0.043s already beats the 8-way DP baseline (0.65s/0.093s) - the KVC core mechanism works when fallback paths are avoided. Files: - docs/KVC_DEBUG_JOURNEY_V1_TO_V4.md: full journey + code location index - docs/SWEBENCH_EXPERIMENT_{PROGRESS,RESULTS}.md: prior session notes - scripts/sweep_tp1_v{2,3,4}*.sh: experiment driver scripts - src/agentic_pd_hybrid/replay.py: cap 4 -> 16, audit fields - src/agentic_pd_hybrid/pd_router.py: strip session_params from prefill - src/agentic_pd_hybrid/metrics.py: truncated_request_count Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-28 21:10:41 +08:00
Gahow Wang	e9062b1d6e	Document PD baseline comparison	2026-04-25 17:29:27 +00:00
Gahow Wang	c928c7db23	Add transfer queue admission knobs	2026-04-25 17:29:15 +00:00
Gahow Wang	fe583fb413	Document kvcache-centric experiment progress	2026-04-25 16:01:31 +00:00
Gahow Wang	13bb31a446	Add kvcache-centric profiling and admission controls	2026-04-25 16:00:52 +00:00
Gahow Wang	08b13d22bc	docs: rewrite project docs in concise chinese	2026-04-24 12:41:52 +00:00
Gahow Wang	5bdc0ed4f0	docs: document sglang maintenance workflow	2026-04-24 12:31:32 +00:00
Gahow Wang	b8e6f13c20	feat(sglang): support decode session cache admission	2026-04-24 12:30:41 +00:00
Gahow Wang	bded08301f	chore: vendor sglang v0.5.10 snapshot	2026-04-24 12:29:36 +00:00
Gahow Wang	78f0d15221	docs: document project design and status	2026-04-24 12:17:55 +00:00
Gahow Wang	4bca741f32	feat: add agentic pd hybrid benchmark prototype	2026-04-24 12:17:46 +00:00
Gahow Wang	d2fe014db7	chore: initialize repo hygiene	2026-04-24 12:17:40 +00:00

43 Commits