agentic-pd-hybrid

Author	SHA1	Message	Date
Claude Code Agent	9aac36fd89	docs: branch executive summary h200-cu130	2026-05-13 12:24:56 +08:00
Claude Code Agent	e9ad1c4bc7	feat(experiments): E4 vs E1 results + p99 attribution figures Headline: KVC v2 + load-floor + RDMA beats naive PD-disagg on mean/p50/p90 by 30-65% (TTFT p50 31s vs 88s, lat p50 37s vs 93s, wall-clock 64 min vs 88 min). Loses p99 by ~8% (TTFT 224 vs 207). Wrote 4 figures (docs/figures/): e1_vs_e4_ttft_pdf.png — bimodal E4 fast-path peak vs E1 single peak e1_vs_e4_latency_cdf.png — CDF + log-survival showing tail crossover e4_path_latency.png — per-execution-mode latency breakdown e1_vs_e4_p99_attribution.png — what makes up E4's p99 tail P99 tail attribution (this is the key finding): E4 p99 tail (n=65, TTFT ≥ 179.9s): fast-path direct-to-d 0 % (0/65) reseed paths 5 % (3/65) fallback paths 88 % (57/65) large-append-session-cap 43 % ← biggest culprit no-d-capacity 17 % large-append 14 % Implication: D→P snapshot (designed to optimize reseed slow path) even if fully working would touch ≤5% of the p99 tail. The real bottleneck is fallback chain (admission retry + seeded-router cold start), not reseed. Optimizing p99 needs work on fallback, not more D→P plumbing. Full analysis: docs/E4_VS_E1_RESULTS_ZH.md	2026-05-13 12:23:11 +08:00
Claude Code Agent	af966f2371	fix(cli): plumb --enable-d-to-p-sync through benchmark-live → ReplayConfig E4-v3 forensic: structural d-to-p-sync.jsonl is empty despite the sweep passing --enable-d-to-p-sync. Root cause: BenchmarkLiveConfig (benchmark.py) had no enable_d_to_p_sync field, and the benchmark-live cli builder (line ~821) never threaded args.enable_d_to_p_sync into the ReplayConfig that gets built inside replay_trace. So config.enable_d_to_p_sync was always False even though the CLI flag was set, and _attempt_d_to_p_sync was gated off → 0 calls → 0 RPCs → 0 structural log entries. The replay subcommand (cli.py:672) already plumbed it correctly; benchmark-live just got missed. Adding the field + the wire-up. This means E4-v3's headline numbers (KVC v2 + load-floor + RDMA beat naive PD on mean/p50/p90, lose by ~8% on p99) reflect only KVC's session-affinity gains, not D→P. A v4 with this fix should exercise D→P on reseed-after-eviction events and we'll see whether the p99 long tail also shrinks.	2026-05-13 12:17:28 +08:00
Claude Code Agent	f6d6dc01ea	feat(cli): per-role --mem-fraction-static + use in E4-pressured E4-v1 / v2 / pressured-v1 all failed to fire admission rejections in this workload because the default 0.6 mem-fraction-static gives 288K-token kv_pool per decoder, more than enough to absorb the 50-session trace even at concurrency=32. This commit adds: --decode-mem-fraction-static (overrides per-decode SGLang arg) --prefill-mem-fraction-static (symmetric for completeness) Plumbed via topology.{decode,prefill}_extra_server_args. The pressured sweep now uses --decode-mem-fraction-static 0.4 which shrinks decoder kv_pool to ~192K tokens — should force enough admission rejections to actually exercise the D→P snapshot path.	2026-05-13 10:43:26 +08:00
Claude Code Agent	fbeb968f2f	feat(experiments): E4-pressured sweep — force reseed via reject_threshold=1 E4-v1 produced 272 admission rejects (good) but zero /_snapshot HTTP calls (bad, entrance gate bug fixed in `e729d62`). E4-v2 went the other way: 0 rejects through 53% of trace, sync function never even called. E4-pressured locks in the fix-verified code path by lowering --kvcache-migration-reject-threshold from 3 to 1. After ONE rejection the policy forces session migration, which lands in _invoke_kvcache_seeded_router → _attempt_d_to_p_sync. With the `e729d62` fix in place, the d-to-p-sync.jsonl structural log should now capture every prepare/dump/finalize decision so we can forensically verify that the D→P fast path is actually delivering KV bytes to P's radix tree.	2026-05-13 10:22:58 +08:00
Claude Code Agent	e729d62ddf	fix(d2p): structural log + relax entrance condition for sync E4 forensic (docs/E4_RESULTS_ZH.md): 272 admission rejections triggered the fallback seeded_router path, but zero /_snapshot/* HTTP calls hit the workers. Two root causes: 1. _attempt_d_to_p_sync gated on agentic-side `decode_session.opened`. By the time fallback runs, agentic has already flipped that flag to False in response to admission rejection. But D-side SessionAwareCache may still hold the session (release_session is not called automatically on admission rejection). Removing the gate; let D respond authoritatively with "session-not-resident" if it has actually evicted. 2. _attempt_d_to_p_sync logged decisions via logger.info, but agentic has no root logger handler so those events silently sank. Switching every branch (entry skip, prepare fail/not-ok, dump fail/not-ok, finalize fail/not-ok, ok) to write a structural-log line at outputs/<run>/structural/d-to-p-sync.jsonl. Each line carries stage, reason, durations, bytes pushed. The result doc is updated to reflect the honest E4-1 outcome and the P1 fix list.	2026-05-13 09:34:09 +08:00
Claude Code Agent	1d68ad66a7	docs(experiments): E4 results — initial scaffold + mid-run observation Captures the mid-run state of the E4 sweep (35 min in, 41% of trace served, 0 admission rejections, 0 d_to_p_sync triggers) along with the interpretation of that observation: under load-floor K=200 + 3D topology, admission rarely rejects → reseed is rarely needed → D→P snapshot is a safety net that doesn't fire in the common case. Includes a fill-in-after-sweep matrix for H1/H2/H3 verdicts and a follow-up plan (high-pressure variant to force reseed, ablation to isolate D→P marginal benefit).	2026-05-13 09:10:02 +08:00
Claude Code Agent	9149b530c0	feat(experiments): E4 cross-comparison analysis helper scripts/analyze_e4_d_to_p.py loads E1 / E3 / E4 summary.json + E4's metrics.jsonl, prints latency / TTFT / per-decode-load side-by-side, breaks E4 down by execution_mode (so the reseed-mode improvement vs E3 can be isolated), and emits PASS/FAIL verdicts for H1 and H3 from the protocol.	2026-05-13 08:30:46 +08:00
Claude Code Agent	a4f30e6bd3	docs(d2p): implementation status snapshot — Phase 1-3 audit Captures the current state of the D→P RDMA snapshot push work for the next agent (or future me): which commits land which phase, which phases are verified vs in-flight, and the known unverified surfaces (byte-level KV layout, cross-node, multi-D contention, token_id consistency, D-side evict races, chunked-prefill interactions). Also maps the §2 design points to their implementation locations so the doc-to-code traceability is explicit.	2026-05-13 08:29:26 +08:00
Claude Code Agent	8a2f72f18e	feat(experiments): E4 protocol + sweep script — KVC + D→P vs naive PD Pre-registers the E4 experiment that tests whether KVC + D→P RDMA snapshot push beats the naive PD-disagg E1 baseline on the inferact_50sess subset. Compared to E3 the only changed flag is --enable-d-to-p-sync. Three hypotheses (see docs/E4_PROTOCOL_ZH.md §2.3): H1 (main): E4 TTFT p99 ≤ E1 TTFT p99 H2: E4 reseed-mode TTFT < E3 reseed-mode TTFT H3: E4 success count ≥ E3 success count The full reseed → snapshot-push orchestration is wired in `b9b0cf0` (_attempt_d_to_p_sync); the SGLang scheduler RPCs and the runtime mem-leak fix are in `86412bb` / `a369722`.	2026-05-13 08:27:40 +08:00
Claude Code Agent	a369722efe	fix(sglang): account snapshot-reserved slots in radix mem leak check Phase 2 prepare_receive allocates kv_pool slots that aren't visible to radix / session bookkeeping until finalize_ingest. Without this fix, the scheduler's idle self_check fires: ValueError: token_to_kv_pool_allocator memory leak detected! available=288391, evictable=5, protected=0, session_held=0 (expected sum == 288460) _check_radix_cache_memory now subtracts sum(len(rec.slot_indices) for rec in ctrl._ingest_records.values()) from the expected total before flagging a leak. Snapshot_reserved is also printed in the leak message for diagnostics. Smoke confirmed (scripts/smoke_snapshot_sglang_integration.py): [smoke] prepare_receive on P → 200: ok=true (96 layer bufs) [smoke] dump on D → 200: ok=false, reason=session-not-resident [smoke] finalize on P → 200: ok=true, inserted_prefix_len=0 [smoke] OVERALL: PASS End-to-end KV-correctness (snapshot ingest yields cache hit on next prefill) still requires the agentic+router stack — covered in the E4 sweep, not this smoke.	2026-05-13 08:26:16 +08:00
Claude Code Agent	b9b0cf0fac	feat(agentic): D→P snapshot orchestration in reseed path + CLI flag Phase 3 — wires the SGLang-side snapshot RPCs (committed in `86412bb`) into the agentic reseed slow-path. On _invoke_kvcache_seeded_router: 1. POST {prefill_url}/_snapshot/prepare_receive alloc P-side slots 2. POST {old_decode_url}/_snapshot/dump RDMA push session KV 3. POST {prefill_url}/_snapshot/finalize_ingest insert into P radix After step 3 P's radix tree has the session prefix cached; the subsequent SGLang router-driven prefill on P hits cache instead of re-computing. Any RPC failure short-circuits to the existing seeded_router fallback (re-prefill from scratch). All steps are best-effort and structurally logged for post-hoc analysis. Flag plumbing: cli.py --enable-d-to-p-sync (replay + benchmark) topology.py SingleNodeTopology.enable_d_to_p_sync stack.py SGLANG_SNAPSHOT_LINK_ENABLE=1 injection per worker replay.py ReplayConfig.enable_d_to_p_sync + _attempt_d_to_p_sync helper Snapshot port per worker derives from disaggregation_bootstrap_port + 1000 (set in third_party/.../snapshot/controller.py), so different workers get distinct mooncake snapshot engines on the same node. Smoke (next): scripts/smoke_snapshot_sglang_integration.py spawns one D + one P, exercises the 3 RPCs end-to-end, checks cache_tokens on a follow-up generate request. See docs/D_TO_P_SYNC_DESIGN_ZH.md for the full design.	2026-05-13 08:16:46 +08:00
Claude Code Agent	86412bb174	feat(sglang): D→P snapshot link integration — controller + RPC handlers Phase 2 of the D→P sync feature (Phase 1 in `dc4867c` verified the underlying RDMA link in isolation). This commit wires that link into each SGLang worker's scheduler so D and P can exchange session KV without going through the PD prefill pipeline. New module: third_party/sglang/python/sglang/srt/disaggregation/snapshot/ controller.py — SnapshotLinkController owns one mooncake transfer engine per worker, pre-registers all kv_pool layer buffers, and exposes prepare_receive() and push_session_kv() APIs. Receive bookkeeping via a session_id → SnapshotIngestRecord side-table. Three RPC types added to io_struct.py and full plumbing wired through: SnapshotPrepareReceiveReqInput/Output P-side alloc + return layout SnapshotDumpReqInput/Output D-side read kv_pool + RDMA push SnapshotFinalizeIngestReqInput/Output P-side radix tree insert Files touched: managers/io_struct.py 3 new ReqInput/ReqOutput pairs managers/tokenizer_communicator_mixin.py 3 communicators, 3 awaitables managers/scheduler.py init controller + 3 handlers entrypoints/http_server.py 3 HTTP endpoints under /_snapshot Activation: set SGLANG_SNAPSHOT_LINK_ENABLE=1 (and SGLANG_SNAPSHOT_LINK_HOST / _PORT / _IB_DEVICE) per worker. Controller init is opt-in and defaults off, so production PD pipeline is untouched. Subsequent work (Phase 3): agentic-pd-hybrid orchestration in _invoke_kvcache_seeded_router to call prepare_receive on P, dump on D-old, finalize_ingest on P, then trigger the existing P→D' transfer which will now hit P's radix cache (skipping re-prefill).	2026-05-13 08:12:04 +08:00
Claude Code Agent	7216507773	feat(snapshot): D→P RDMA Phase 1b — GPU pointer path verified Confirms snapshot_link works for cuda device pointers, not just host memory. Sender on cuda:0 pushes to receiver on cuda:1 via RDMA over mlx5_60. All 5 sizes (16K, 1M, 16M, 64M, 256M) pass SHA verification. 16 KB 8.3 ms 0.016 Gbps (cold openSegment) 1 MB 0.10 ms 87.6 Gbps 16 MB 0.84 ms 159 Gbps 64 MB 2.52 ms 213 Gbps 256 MB 8.54 ms 251 Gbps (~60% NDR400 line rate) For Inferact-scale sessions (~50K tokens × ~80 KB layer-per-token = ~4 GB), this projects D→P transfer time at ~130 ms — within the "reseed-savings" envelope sketched in design doc §3.2. Files: scripts/snapshot_link_receiver_gpu.py scripts/smoke_snapshot_link_gpu.py Next: SGLang scheduler integration for D-side dump + P-side ingest.	2026-05-13 00:59:43 +08:00
Claude Code Agent	dc4867c270	feat(snapshot): D→P RDMA link Phase 1 — minimal byte transport A thin wrapper around mooncake.engine.TransferEngine that does one-sided RDMA writes between two SnapshotPeer endpoints. Bypasses SGLang's MooncakeKVManager (which is hard-gated to PREFILL/DECODE roles via add_transfer_request assertion at conn.py:1563) so the D→P direction doesn't require invasive role-axis changes upstream. Smoke test (two subprocess.Popen processes, mlx5_60, 127.0.0.1): 1 KB 9.0 ms (one-time openSegment handshake) 16 KB 0.04 ms 3.5 Gbps 1 MB 0.10 ms 82 Gbps 16 MB 0.58 ms 232 Gbps 64 MB 1.70 ms 316 Gbps (~80% of NDR 400G line rate) All 5 sizes pass SHA256 verification end-to-end. Files: src/agentic_pd_hybrid/snapshot_link.py — SnapshotPeer, SnapshotEndpoint scripts/snapshot_link_receiver.py — child-process receiver scripts/smoke_snapshot_link.py — sender + verifier docs/D_TO_P_PHASE1_LINK_ZH.md — phase 1 acceptance doc Next: Phase 2 (D-side scheduler commit hook), Phase 3 (P-side prefill bypass with snapshot KV). See docs/D_TO_P_SYNC_DESIGN_ZH.md §5.	2026-05-13 00:55:55 +08:00
Claude Code Agent	9c35eddc79	docs(design): D→P RDMA snapshot push design Goal: skip P-side re-prefill on reseed path. Push session KV snapshot from D back to P after each direct-to-D append; reseed re-uses P's snapshot to fire only the P→D' transfer (no model.forward on P). Decision: Option C — D→P snapshot at append-commit, P-side PrefillSnapshotStore (side-table, not in radix tree), prefill bypass when snapshot is fresh. Rejects A (radix multi-producer), B (D→D' direct, fails for session-not-resident), D (eviction-only). Lays out 8-commit roadmap, wire protocol, failure modes, and the E4 experiment plan (KVC + D→P vs naive PD-disagg E1 baseline).	2026-05-13 00:44:03 +08:00
tim	6d1c9237fa	docs(architecture): KVC eviction granularity is the wrong abstraction After E3 exposed massive session-level eviction (90 trims × avg 67K tokens/evict = 6.1M tokens trashed in 1h12min), we have to acknowledge the local-patch sequence (E2→load-floor→Fix A → proposed disable-migration → proposed disable-admission) was a KVC-to-DP collapse trajectory, not a fix. The fundamental issue: SessionAwareCache merged two responsibilities that should be separate. 1. Session lifecycle tracking (legitimate — streaming sessions reuse KV across turns and need per-session metadata). 2. Eviction granularity decision (wrong — sessions should not be the eviction unit). `release_session` frees the session-exclusive range [cache_protected_len, kv_allocated_len), which is the post-radix- commit tail accumulated over decode/extend. On Inferact's 50-session workload this is 35-87K tokens per session. The radix tree never gets a chance to do block-level leaf-LRU on that range because it was never committed there. Effect: evict-revisit cycle forces full 50-90K re-prefill per session per evict — which is exactly the per-request cost of naive PD-disagg. KVC's direct-to-D fast-path advantage collapses. The right fix is structural (not a patch): progressively commit streaming-session decode output to the radix tree so SGLang's block-level LRU can shed only the deepest leaves, preserving the recent prefix that next-turn requests are most likely to match. SessionSlot becomes pure metadata. Scope is ~1-2 weeks of vendored SGLang refactor, orthogonal-and-complementary to the D→P sync work proposed in RESEED_SLOW_PATH_AND_D_TO_P_GAP §4. Doc lists five anti-patterns the next agent should avoid (tuning migration_reject_threshold, disabling migration/admission, etc) — all of those are local symptoms downstream of the eviction granularity choice. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-12 14:21:45 +08:00
tim	986f351365	feat(sglang): drop streaming-session reqs with fill_ids < prefix_indices Fix A from docs/E3_FINDINGS_ZH.md §3. The existing streaming-session correction at the top of ScheduleBatch.prepare_for_extend zeroes req.extend_input_len when len(fill_ids) <= len(prefix_indices), but the per-req invariant later in the same function (assert seq_len - pre_len == req.extend_input_len) is computed from raw fill_ids/prefix_indices lengths and has no path to be satisfied when fill_len < prefix_len. The result is an AssertionError that crashes the entire decode worker. Add a pre-filter pass at the start of prepare_for_extend that detects this state, marks the affected reqs with FINISH_ABORT (so the client gets an error response instead of the worker hanging), and drops them from the batch before the correction loop runs. If all reqs are filtered, populate empty tensor/list state and return early so downstream model.forward sees a valid no-op batch. This treats fill_ids < prefix_indices as upstream state inconsistency that should be reported to the client rather than silently miscomputed. The narrower invariant after this filter: prepare_for_extend's body only ever sees streaming-session reqs where actual_extend_len > 0, which is the regime the existing correction logic was designed for. Reproduced by E3 first run on 2026-05-12 02:51:21 UTC (rid 6f4318e93dd543a49dbf19248cfc1e6f, session 1000195, fill_len=6648, prefix_len=43459) — masked in E1/E2 because the cap-out failure cascade prevented sessions from accumulating deep enough committed prefix to trigger the inconsistency. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-12 12:12:14 +08:00
tim	d40db1f117	docs(experiments): E3 first run — load-floor bonus works, exposes SGLang bug H1 (load balance) confirmed at the 15-min checkpoint: D2 received 22.5% of bindings (225 out of 1001) covering 30 unique sessions, versus 0 in both E1 and E2. The graduated load-floor formula with K=200 produces the intended distribution: fresh sessions on under-loaded D, sticky sessions stay put. But decode-1 crashed at 11:51:21 (~5 min into benchmark) with an SGLang AssertionError in schedule_batch.py:1646. Root cause: the streaming-session correction at line 1572-1585 patches req.extend_input_len to 0 when len(fill_ids) < len(prefix_indices), but the downstream invariant uses raw fill_ids/prefix_indices lengths, so the arithmetic check fails. This is a pre-existing landmine in the `b8e6f13` SGLang vendor patch, not caused by the load-floor bonus. It just happened to be masked in E2 by the failure cascade preventing sessions from accumulating deep enough prefix to trigger the correction. Crash session 1000195 stayed on decode-1 the whole time (not a migration race). E3 exposes this faster because sessions actually run further with rebalanced load. 5 fix options evaluated. Recommended: Fix A — local patch at schedule_batch.py:1646 to skip zero-extend-len reqs before asserting. Less invasive than C (recomputing seq/prefix arrays); addresses the actual case (D and E are workarounds, not fixes). 4 decision points for review; no code changes in this commit. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-12 12:05:51 +08:00
tim	a1abdcd50c	feat(experiments): E3 sweep — KVC v2 + RDMA + load-floor bonus Same outputs/inferact_50sess.jsonl subset as E1/E2 (md5 7bb263a32600ef5a6ef5099ba340a487). Identical to E2 except adds --kvcache-load-floor-bonus 200. Tests three hypotheses: H1 (load balance): D2 receives non-trivial bindings (E1/E2: 0) H2 (failure rate): mooncake batch_transfer timeouts disappear because D0/D1 KV pool no longer saturates (E2 had 1054 fails; expect ≤ E1's 85) H3 (TTFT): E2's 0.43s p50 (over the 231 successes) generalizes to most reqs once cascade is gone K override via LOAD_FLOOR_BONUS env var (default 200). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-12 11:45:09 +08:00
tim	93fce42747	feat(policy): load-floor bonus for KvAwarePolicy (Q2.B) Implements the design proposed and approved in docs/E1_E2_FIX_DESIGN_ZH.md §Q2.B. KvAwarePolicy gains a `load_floor_bonus: int = 0` knob. When > 0: mean_assigned = sum(assigned[]) / len(D) for each D candidate: if not sticky and mean_assigned > 0: deficit = max(0, mean_assigned - assigned[D]) floor_bonus = K * deficit / mean_assigned else: floor_bonus = 0 score = (overlap + sticky*α + floor_bonus, sticky, -inflight, -assigned) Properties (verified by unit-style probe in commit message): - Default 0 = old behavior preserved - Sticky-gated: turn-1+ requests of an existing session keep going to their original D (cache locality preserved) - Graduated: bonus magnitude scales with the D's deficit ratio, approaches K as deficit/mean → 1, drops to 0 when balanced - Set above max expected boilerplate overlap (Inferact ~50 → 200) so cross-session shared-prefix overlap doesn't pin cold Ds idle, but real per-session prefix overlap (>K blocks) still wins Plumbed through ReplayConfig, BenchmarkConfig, and CLI flag --kvcache-load-floor-bonus on both `replay` and `benchmark-live`. Empirical verification on synthetic state (same conditions as the E2 cold-D pathology): - OFF (K=0): route fresh session → decode-0 (boilerplate winner) - ON (K=200): route fresh session → decode-1 (cold D rebalanced) Validation pass next: scripts/sweep_e3_kvc_v2_loadfloor_rdma.sh (committed separately). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-12 11:45:09 +08:00
tim	905d671135	feat(env): MC_TRANSFER_TIMEOUT=1800s default in setup_env + stack Mooncake C++ batch_transfer_sync defaults to 30s timeout; on saturated D scheduler threads doing LRU eviction, that fires as a false positive and the SGLang hair-trigger in conn.py:1270 permanently blacklists the D's mooncake_session_id (E2 forensic in docs/E1_E2_RESULTS_ZH.md §5c). Bump to 1800s in setup_env.sh and mirror to subprocess env in stack.py so SGLang workers get it too. 30-min envelope still detects genuinely broken peers eventually. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-12 11:45:09 +08:00
tim	9a166ac43b	docs(experiments): design space for Q1 (mooncake stall) + Q2 (cold-D) For Q1 (D scheduler LRU starves mooncake control plane → 30s batch_transfer_sync timeout → hair-trigger blacklist), six candidate fixes evaluated. Recommendation: do Q2 fix first since it removes the only condition under which we observe LRU thrash; bump mooncake timeout to 120s as cheap defense-in-depth; avoid invasive SGLang vendor changes (windowed hair-trigger, async eviction thread) until Q2 fix demonstrates they're insufficient. For Q2 (overlap-first lex score + shared boilerplate → permanent D2 cold), seven candidate fixes evaluated. Recommendation: load- floor bonus (graduated, decoupled from overlap, gated on not-sticky) as the primary mechanism — proactive on first-touch as user requested, avoiding the binary one-shot pitfall of the reverted cold-D bonus. Orthogonal cleanup: fix the substring filter in _is_admission_rejection_mode so the existing migration mechanism serves as a backstop when load balancing alone isn't enough. 7 decision points listed for review; no code merged until a shape is approved. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-12 11:20:00 +08:00
tim	976115ea5e	Revert "feat(policy): cold-D bonus to break overlap-pinning death spiral" Implementation jumped ahead of design. The cold-D bonus is one of several candidates for the overlap-pinning fix (others: load-floor bonus, idle-D bonus, capacity-aware overlap discount, pre-warming boilerplate). Need to evaluate the design space first, including whether a single bonus is even the right shape vs a separate term in the lex score, before committing to a specific knob. This reverts commit `786cbb8` cleanly (forensic docs in `bf4da28` and `7f2ebf3` are kept since they record observations, not designs). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-12 11:17:16 +08:00
tim	786cbb8d91	feat(policy): cold-D bonus to break overlap-pinning death spiral KvAwarePolicy now accepts an optional cold_d_bonus int. When > 0, fresh requests (sticky=0, i.e. no prior D for this session) receive the bonus added to lex-score position 0 (overlap+sticky_bonus) for any D worker that has never been assigned a session yet (decode_assignment_counts == 0). This breaks the pathology documented in docs/E1_E2_RESULTS_ZH.md §5d where workloads with shared cross-session prefix (e.g. Inferact's "permissions instructions" boilerplate) cause every D that has hosted any session to dominate the overlap term against any cold D, leaving the cold D permanently unused. Sticky behavior is preserved: turn 1+ requests of an existing session continue to stick to their original D because the bonus is gated on `not sticky`. Plumbed through ReplayConfig.kvcache_cold_d_bonus (default 0, keeping current behavior unchanged), BenchmarkConfig, and CLI flag --kvcache-cold-d-bonus on both `replay` and `benchmark-live` subcommands. Set above max expected boilerplate overlap (Inferact's ~50 24-token blocks → 1000 is safe). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-12 11:14:00 +08:00
tim	bf4da281c0	docs(experiments): mooncake "is not alive" deep-dives to LRU starvation The Q1 mystery resolves: P-side mooncake C++ logs show "Sync batch data transfer timeout after 37452515723ns" (37.45 s) at 01:56:42 — this is mooncake's batch_transfer_sync giving up after its internal timeout. The hair-trigger >=1 in conn.py:1270 is correct in the idle case (a 30-s RDMA stall genuinely means the peer is broken), but it fires here because of D-side congestion: decode-0.log shows two consecutive LRU evictions ("Trimmed decode session cache via LRU. evicted_sessions: 2, freed_tokens: 77675") firing at the exact same wall second the timeout triggers. The D scheduler thread is busy with multi-session GPU memory frees + session-aware-cache bookkeeping under lock; the mooncake C++ control plane on the receive side gets starved for >30 s; P times out and marks the whole D's mooncake_session_id failed. Two-layer fix listed in §5c: root-cause = spread load to D2 (cold-D bonus, next commit); defense-in-depth = windowed threshold + retry in vendored mooncake conn.py. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-12 11:14:00 +08:00
tim	7f2ebf3d87	docs(experiments): forensic on Q1 (mooncake death) and Q2 (no D2 migration) Q1: Mooncake "is not alive" is hair-trigger — a single send_kvcache_slice ret != 0 in third_party/sglang/python/sglang/srt/disaggregation/mooncake/conn.py :1270 permanently adds the D's mooncake_session_id to failed_sessions and blacklists it for the rest of the process lifetime. The D worker process is alive (D1 keeps serving admit_direct_append OK seconds after), but every subsequent P→D transfer for that session short-circuits at conn.py:1184. The "Failures should never happen if the session is not dead" comment encodes the wrong assumption for the saturation regime we hit. Q2: KVC v2's migration mechanism IS sound but its trigger is gated by replay.py:1379 _ADMISSION_REJECTION_SUBSTRINGS = ("session-cap", "no-d-capacity", "d-backpressure"). All 1054 failures have execution_mode="kvcache-centric" (generic fallback bucket) which contains none of those substrings, so session_d_rejects is never incremented. Empirically 46 of 49 (sess, D) pairs that the worker RPC rejected would have qualified for blacklist (most-rejected pair: 25 rejects), but policy never saw them. Result: D0 reject → next-bind D0 (253×), D1 reject → next-bind D1 (329×), D0/D1 reject → next-bind D2 (0×). Fix paths documented for both, shortest path is widening the substring filter to include the failure-fallback bucket, but the right fix is to call record_admission_reject directly from the actual rejection signal site instead of string-matching execution_mode. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-12 10:45:18 +08:00
tim	ef4dc81ea9	docs(experiments): forensic explanation for E2 80% failure rate Pulling admission-events.jsonl, prefill-0.log, and request-metrics sampling shows the 1054 failures are NOT timeouts as initially assumed. They are a 3-layer cascade: L1: 562 "no-space" + 43 "session-not-resident" worker admission rejects (51% of all admit attempts) because D0/D1 KV pools saturate while D2 stays empty. L2: rejects re-route to seed/reseed which need mooncake P→D KV transfer; the backlog drops mooncake heartbeats and prefill-0 logs "Decode instance could be dead, remote mooncake session ... is not alive". L3: SGLang aborts the request, SSE stream closes with 0 tokens, agentic-pd-hybrid raises "generate stream ended before producing any token" (the literal error string for all 1054). E1 didn't hit this because pd-disaggregation has no admission RPC — sessions just queue behind the running batch, paying TTFT instead of failing. KVC v2's worker admission is supposed to be a safety valve; on the cold-D pathology it becomes a failure amplifier. The real fix is upstream D rebalancing (cold-D bonus or pre-warm), not relaxing admission. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-12 10:38:49 +08:00
tim	3db2d84df8	docs(experiments): E2 complete — qualified H1 with a surprise E2 finished 1h33min wall. Headline contrast on the matched Inferact 50-session subset: E1 (naive 1P3D + kv-aware + RDMA): 1200/1285 succ, lat p50=93s p99=219s, TTFT p50=89s p99=207s E2 (KVC v2 + RDMA): 231/1285 succ, lat p50= 7.4s p99=65s, TTFT p50=0.43s p99=8.7s E2 is 12.4× worse on failure rate but ~200× better on TTFT p50 (0.43s vs 89s) among the requests that did complete. Both runs leave D2 entirely unused for the same structural reason: Inferact's shared "permissions instructions" boilerplate makes overlap dominate the kv-aware lex score, and v2's migration mechanism only fires on capacity rejects which never reach D2. The 1054 E2 timeouts are downstream of that imbalance, not a v2 bug per se. The doc closes with five concrete follow-ups for the next agent — cold-D bonus, router-mode admission, default-policy control arm, TCP-loopback comparison, failure mode forensics. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-12 03:23:33 +08:00
tim	e3e5c45ed4	docs(experiments): E2 mid-run finding — D2 stays cold in KVC v2 too Same pathological imbalance E1 showed reproduces in E2: D2 has zero bindings at 33% POSTs in. Root cause is structural, not a KVC v2 bug: all 50 Inferact sessions begin with identical "permissions instructions" boilerplate, so the converter assigns them identical first-block hash_ids. kv-aware policy's overlap term (lex-score position 0) makes any already-resident D dominate a fresh D unconditionally, and v2's migration only activates on admission rejects which never fire because D0/D1 KV pools have headroom. The H1 conclusion is qualified: KVC v2 helps per-request work (direct- to-D fast path) but does not rebalance D worker load on workloads with shared cross-session prefixes. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-12 02:08:00 +08:00
tim	631b2c8847	docs(experiments): E1 results — naive 1P3D + kv-aware confirms H1 baseline E1 finished 1h29min wall on the 50-session Inferact subset. Headline: 1200/1285 succeeded, latency p50=93s p99=219s, TTFT p50=89s p99=207s, 85 timeouts. Decode-2 was never bound to a single session — all 50 sessions stuck to decode-0/1 by kv-aware policy stickiness with no migration to rebalance, so effective topology was 1P2D, not 1P3D. This is exactly the failure mode H1 predicts naive pd-disaggregation should exhibit, giving E2 (full KVC v2 with migration) a concrete baseline to improve against. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-12 01:49:52 +08:00
tim	ad8aaa8c5a	feat(experiments): E2 sweep — KVC v2 + RDMA on the matched subset KVC v2 config from sweep_ts1_migration_v2.sh (reset-on-success + direct-append threshold 8192) layered on top of the RDMA-enabled mooncake stack, against the same outputs/inferact_50sess.jsonl subset that E1 uses. Pair-wise contrast tests H1 (KVC layer marginal contribution on top of 1P3D + kv-aware) and H2/H3 (RDMA reducing reseed slow-path tail). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-12 00:49:53 +08:00
tim	bb9cc249cd	feat(experiments): E1 sweep on 50-session deterministic subset scripts/sample_trace_subset.py — file-order head-cut that takes the first N sessions of a converted trace. No RNG, no hashing — same input yields byte-identical output (the included assertion compares md5 across two runs). scripts/sweep_e1_naive_1p3d.sh — E1 of ONBOARDING_NEXT_AGENT_ZH §3.1: mechanism=pd-disaggregation, policy=kv-aware, 1P3D, RDMA on (mlx5_60). Defaults to outputs/inferact_50sess.jsonl so E1 and E2 can share the exact same subset; override via TRACE= env var to run on the full 20,230-request trace. Reproducing the subset: uv run --no-sync python scripts/sample_trace_subset.py \\ --input outputs/inferact_codex_swebenchpro.jsonl \\ --output outputs/inferact_50sess.jsonl \\ --sessions 50 # expected output_md5: 7bb263a32600ef5a6ef5099ba340a487 # 1285 requests, mean input_length 67631 tokens Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-12 00:21:36 +08:00
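The file-order head-cut described in bb9cc249cd can be sketched as below. This is a minimal illustration, not the contents of scripts/sample_trace_subset.py: the `session_id` field name and the `head_cut` / `md5_of` helpers are hypothetical, and the real trace schema (chat_id / parent_chat_id) may identify sessions differently.

```python
import hashlib
import json


def head_cut(lines, n_sessions, session_key="session_id"):
    """Keep every request that belongs to the first n_sessions sessions, in file order.

    No RNG and no hashing are involved in the selection, so the same input
    always yields the same output. `session_key` is a hypothetical field name.
    """
    seen = set()
    kept = []
    for line in lines:
        sid = json.loads(line)[session_key]
        if sid not in seen:
            if len(seen) >= n_sessions:
                continue  # a new session past the cut-off: drop it
            seen.add(sid)
        kept.append(line)
    return kept


def md5_of(lines):
    """MD5 over the concatenated output lines, used only as a determinism check."""
    digest = hashlib.md5()
    for line in lines:
        digest.update(line.encode("utf-8"))
    return digest.hexdigest()


if __name__ == "__main__":
    demo = [json.dumps({"session_id": s, "turn": i}) + "\n"
            for i, s in enumerate(["a", "b", "a", "c", "b"])]
    first = head_cut(demo, 2)
    second = head_cut(demo, 2)
    # Two runs over the same input must be byte-identical.
    assert md5_of(first) == md5_of(second)
    print(len(first), "requests kept")
```

Comparing digests across two runs mirrors the assertion the commit message mentions; later turns of an already-selected session are kept even when they appear after the cut-off session in the file.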
tim	b55371fe69	docs: H200 + driver 570 setup guide + 11 lessons learned Captures the full debugging journey of getting vendored SGLang 0.5.10 + mooncake RDMA running on a 4×H200 node with the older driver 570.86.15. Driver 570's actual API is cu12.8 — nvidia-smi's "CUDA Version: 13.0" header is a forward-compat ceiling, not the driver's own version — and that single misreading drove most of the detours. Lessons cover: pip vs vendor sglang divergence, why cu13 switching was a dead end (mooncake is cu12-only by wheel, driver 570 can't run cu13 anyway), why --disable-overlap-schedule alone isn't enough, why pip nvidia-cuda-nvcc-cu12 doesn't ship the nvcc binary, and how tvm_ffi's ninja-driven nvcc invocation makes CUDA_HOME the single hook point that fixes everything. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-12 00:10:14 +08:00
tim	d11a66d11b	feat(scripts): cu12.8 env wrapper + Inferact trace converter setup_env.sh: source-able shell snippet that points tvm_ffi (vendor sglang JIT compiler) at \$HOME/cuda-12.8/bin/nvcc and exposes both libcudart.so.12 (for mooncake.engine, a cu12 wheel) and cu12.8 lib64 (for tvm_ffi compile-time linker) on LD_LIBRARY_PATH. Without this, JIT-compiled kernels NEEDED libcudart.so.13 and driver 570 rejected them at every JIT call. convert_inferact_to_trace.py: turns Inferact codex_swebenchpro_traces (ShareGPT {"from","value"} pairs) into the chat_id/parent_chat_id/ turn/hash_ids JSONL schema replay.py expects. Tokenizes with the model's own tokenizer, builds prefix-sharing 24-token block hashes, synthesizes timestamps. Output cross-checks 20,230 LLM calls — exactly matches the Inferact README count for 610 successful trials. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-12 00:10:06 +08:00
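A rough sketch of what "prefix-sharing 24-token block hashes" could look like follows. The chaining scheme, the use of SHA-256, the truncation to 64 bits, and the dropping of a trailing partial block are all assumptions made for illustration; convert_inferact_to_trace.py may differ on each point.

```python
import hashlib

BLOCK_TOKENS = 24  # block size stated in the commit message


def block_hash_ids(token_ids, block_tokens=BLOCK_TOKENS):
    """Return one integer id per full block, chained over the preceding blocks.

    Because each digest folds in the previous one, two requests get the same
    id at position i only if their first (i + 1) blocks are token-identical.
    A trailing partial block is ignored (an assumption of this sketch).
    """
    ids = []
    prev = b""
    for start in range(0, len(token_ids) - block_tokens + 1, block_tokens):
        block = token_ids[start:start + block_tokens]
        payload = prev + b"|" + ",".join(map(str, block)).encode("ascii")
        digest = hashlib.sha256(payload).digest()
        ids.append(int.from_bytes(digest[:8], "big"))
        prev = digest
    return ids


if __name__ == "__main__":
    shared = list(range(48))                # two identical leading blocks
    a = block_hash_ids(shared + [1] * 24)
    b = block_hash_ids(shared + [2] * 24)
    print(a[:2] == b[:2], a[2] != b[2])
```

Under a scheme like this, sessions that all begin with the same boilerplate receive identical leading hash ids, which is the property the later E1/E2 commits identify as the cause of the cold decode-2 worker.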
tim	a418aafeed	feat(stack): pin PD workers to --disable-overlap-schedule On a node with driver 570.86.15 (cu12.8 driver API ceiling), SGLang's overlap event loop hits cudaErrorInsufficientDriver inside event_loop_overlap_disagg_prefill → resolve_future_token_ids JIT kernel. Switching to the normal event loop sidesteps this specific codepath. The flag is harmless on newer drivers and remains a useful default until overlap is independently re-validated on this hardware. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-12 00:09:56 +08:00
tim	e874b1f055	feat(env): install vendored SGLang via uv path source Replace pip-resolved sglang==0.5.10 with an editable install from third_party/sglang/python. The vendored fork carries patches the pip release does not (admit_direct_append RPC types, _should_allow_local_ prefill_on_decode, maybe_trim_decode_session_cache, backpressure pause hint) — KVC routing depends on them, so the vendored copy must be the import target, not just on PYTHONPATH at runtime. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-12 00:09:50 +08:00
kzlin	7590e55189	docs: archive deprecated docs to docs/archive/, drop E1 from onboarding Two cleanups: 1. Drop "E1: naive 1P3D default" experiment from the onboarding manual. GPU hours are precious; naive 1P3D + policy=default has near-certain loss on multi-turn cache hit (it's round-robin without prefix awareness), so the comparison doesn't add information vs E1=naive 1P3D kv-aware. The new manifest has only 2 runs: E1 (naive 1P3D kv-aware) + E2 (KVC v2 + RDMA). Run-time budget drops from 16.5h serial to 11h serial / 5.5h parallel. Updated: - §0 TL;DR ("3 组" -> "2 组", i.e. "3 groups" -> "2 groups") - §2 H1 hypothesis (drop "default and kv-aware each one" -> just kv-aware) - §3.1 experiment matrix (3 rows -> 2 rows + rationale for the drop) - §3.2 startup config (drop E1 default section, renumber E2/E3 -> E1/E2) - §6 decision table + expected-range table - §7 FAQ ("3 个 E1-E3" -> "2 个 E1-E2", i.e. "the 3 runs E1-E3" -> "the 2 runs E1-E2") - §9 deliverables 2. Move 8 deprecated docs to docs/archive/: AGENTIC_FIT_ANALYSIS_ZH.md (ts=10 era analysis; superseded) STRUCTURAL_VALIDATION_REPORT_ZH.md (ts=10 era validation; superseded) KVC_DEBUG_JOURNEY_V1_TO_V5.md (v1-v5 sweep process notes) V5_PROFILE_INVESTIGATION_ZH.md (v5 1Hz polling investigation) REFACTOR_PLAN_ZH.md (v0 plan; superseded by V1) KVCACHE_CENTRIC_PROGRESS_ZH.md (earliest 2026-04-27 progress) SWEBENCH_EXPERIMENT_PROGRESS.md (early SWE trace setup) SWEBENCH_EXPERIMENT_RESULTS.md (early SWE result snapshot) All cross-references in active docs (V2_DEEP_ANALYSIS / V2_RESULTS / REFACTOR_PLAN_V1 / TEAM_REPORT / ONBOARDING) rewritten from `docs/FOO.md` to `docs/archive/FOO.md` via sed pass. Added `docs/archive/README.md` explaining what each archived doc is and when (if ever) to reopen it. Designed so a new reader hitting the archive dir immediately knows it's not required reading. After this commit the active docs in docs/ are 9 files (down from 17), which should make the onboarding doc's "Level 1 / Level 2 / Level 3" classification self-evident. 
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-11 22:40:35 +08:00
kzlin	5a2fb8799c	docs(kvc): onboarding manual for the next SWE agent A single self-contained reading manual designed to bring a fresh agent (LLM or human) to current-state proficiency in 30 min of reading + 30 min of environment validation, then have them run the next round of ablation experiments without re-litigating questions already settled. Structure: §0 TL;DR -- what you are inheriting in 5 lines §1 Reading order, tiered into Must-Read / On-Demand / Archive, with reasons for each §2 Current-state snapshot: trace/hardware/branches + claims verified + hypotheses pending §3 The three ablation experiments (E1/E2/E3) with full CLI flag specifications and environment-validation checklist §4 Known gotchas (8 of them) with symptoms and fixes -- the most important section to skim before you start §5 CLI cheatsheet: run experiments / read data / plot / git §6 Result-analysis checklist: numbers to collect, expected ranges §7 FAQ for likely stuck-points §8 Anti-patterns: what NOT to do §9 Two specific deliverables the main agent expects back Appendix A: file location lookup table Appendix B: commit lookup table (by intent) Goals encoded into the doc: - Frame "your job is ablation, not new development" -- the new agent should not be tempted to start D->P sync work; that goes on the feat/d-to-p-sync branch in a separate phase. - Make abort-accounting / max-input-len / mooncake-TCP-default pitfalls extremely visible up front so they don't get repeated. - Provide expected-result ranges so a 2x deviation is treated as a config check, not a "finding". - Make the critic-vs-production framing explicit so the new agent knows when an audit-style "MAJOR" is actually a design intent. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-11 22:31:08 +08:00
kzlin	506d360160	fix(figures): GPU utilization figure annotation/headroom polish Bar-overlap fix: extend ylim by 35-45% above the tallest bar to give the "P GPU only sees 328 requests" and "P GPU does 1.07M tokens" annotations clean white-bbox space above the bars instead of crashing into the KVC D bars at x=1. Move both annotation xytext positions to x=2.4 (left panel) and x=5.5 (right panel) so the arrows pull away from the orange P bar toward the center of the panel. Group labels (KVC 1P3D / DP 4-way CA) kept in axes-fraction bboxes at y=1.02; subplot titles raised to pad=24 to leave room. Note: a small visual collision between the bboxed group labels and the subplot-title second line remains in the rendered output (acknowledged in the prior conversation). Acceptable for now; full layout rework is deferred. The annotation-vs-bar overlap (the original blocker) is fixed. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-11 22:28:39 +08:00
kzlin	c01d6101d6	docs(kvc): freeze reseed slow-path audit + three reviewer challenges Standalone reference document capturing the v2 reseed slow-path forensic audit before opening the feat/d-to-p-sync branch. Designed to be quoted directly by future paper drafts and to prevent the team from re-litigating the same questions verbally. Contents: §1. The three team-member challenges that disproved "capacity-backup will save the slow path" (each with code citation and verdict): 1) P pool can't fit all backups -- replay.py:1618-1620 caps backup count at 1 for sessions with ~50K peak input. 2) P's backup is a stale snapshot -- 49K of direct-to-D append work never flows through P. _commit_prefill_backup_residency (replay.py:1483) is only called from seed/reseed paths; direct-to-D path (replay.py:2719) never touches P-side state. 3) When D evicts, old KV is freed directly (no D->P dump). session_aware_cache.release_session only calls kv_pool_allocator.free(). §2. End-to-end reseed timeline (t=0 to t=4550ms) with code citations showing exactly where each component sits. P-side re-prefill = 1.5-3s, mooncake transfer = 1.5-4s, both contributing 50/50 to total reseed cost. §3. Table of "looks like D->P but isn't" code locations -- every candidate found during forensic search ruled out with line citations. §4. Specification of what D->P incremental sync would require: mooncake bidirectional roles (~400 LOC), D-side append commit hook (easy), P-side radix tree multi-producer extension (the real blocker), agentic-pd-hybrid replay.py hooks. Estimated 1-2 weeks engineering. §5. Confirmation via `git ls-remote origin --refs` that the author has NOT secretly implemented D->P on another branch -- only main + this working branch exist on the server. §6. Roadmap for the upcoming feat/d-to-p-sync branch. Appendices: code position crosswalk, related commits, paper section suggestions. This document is referenced by V2_DEEP_ANALYSIS_ZH §4.2 and by KVC_ROUTER_ALGORITHM §9 Open Question 4. 
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-11 22:20:34 +08:00
kzlin	9ccd853066	docs(kvc): correct reseed cost decomposition + flag D->P sync gap After an independent Opus-agent forensic audit, the previous "(c) 增量 fetch (工程量较大，未实现)" line ("(c) incremental fetch (large engineering effort, not implemented)") in V2_DEEP_ANALYSIS §4.2 understated the gap. The audit confirmed: - No D->P KV transfer code exists in the framework at any layer (agentic_pd_hybrid orchestration, vendored SGLang disaggregation, or mooncake transport). - Mooncake MooncakeKVManager has a hard role split: PREFILL = sender, DECODE = receiver-only loop. `add_transfer_request` asserts the disaggregation_mode is PREFILL. - The BaseKVSender / BaseKVReceiver abstraction has no bidirectional slot. - session_aware_cache.release_session only calls kv_pool_allocator.free() on eviction -- no serialization, no outbound network call. - _commit_prefill_backup_residency is only called from the seed/reseed path (_invoke_kvcache_seeded_router). direct-to-D path never updates P-side backup state. - "capacity-backup" policy semantics: it only skips the close on P after reseed -- the backup is the seed-time static snapshot, never refreshed by D-side append-prefill activity. V2_DEEP_ANALYSIS §4.2: - Decomposed the 3-7s reseed cost into the P-side re-prefill segment (1.5-3s, dominant) and the P->D mooncake transfer segment (1.5-4s). - Quantified the realistic effect of enabling RDMA: only the transfer segment shrinks, reseed reduces to 1.7-3.2s, TTFT p99 ~0.7s, still loses to DP's 0.43s. - Replaced the throwaway "(c) incremental fetch" line with a full paragraph explaining what D->P sync would require, why it's the largest engineering gap, and that the blocker is SGLang's radix-tree single-producer assumption, not the network layer. KVC_ROUTER_ALGORITHM §9: - Refined Open Question 3 (RDMA) to clarify it only helps the transfer segment, not the re-prefill segment. - Added Open Question 4: D->P incremental KV sync as the central future-work contribution gap, with cited evidence for why it doesn't currently exist. 
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-11 22:07:14 +08:00
kzlin	517677d7f2	docs(kvc): add GPU-utilization and cache-efficiency figures (rebut critic) Two figures inserted into V2_DEEP_ANALYSIS §4.5 and §4.4 respectively, to visually rebut the two critic-agent claims that we argued in prose were design intent, not deficiencies. (1) gpu_utilization.png -- §4.5 "P GPU is wasted 90% of the time" Two-panel side-by-side: Left (request count view, the naive reading): KVC P = 328 reqs (7.4%), KVC D = ~1450 each, DP = ~1100 each. P "looks idle." Right (compute work view, the honest reading): KVC P does 1.07M tokens of prefill, comparable to each KVC D worker's ~0.80M. P is a low-frequency high-cost safety net, not idle capacity. Bonus finding: KVC's total compute (3.47M tokens across 4 GPUs) is 33% LESS than DP's (5.17M). Same GPUs, less work done. That's the affinity win. (2) cache_efficiency.png -- §4.4 "Cache concentration is not policy win" Two-panel side-by-side. The setup: KVC has 27% LESS total KV pool (276K vs 351K tokens) yet caches MORE per request. Left (cache hit rate vs turn number): KVC's session-affinity lets hit rate accumulate with turns; DP's hash + radix-LRU causes a mid-turn drift around turns 8-25 where KVC = 97.0% vs DP = 95.8% (1.24pp gap). Shows mechanism, not just outcome. Right (ECDF of per-request uncached tokens, log x): KVC's distribution concentrates near zero (50% < 187 tokens), DP's is spread (50% < 781 tokens). At uncached = 500 tokens threshold, KVC has 74% of requests below, DP has 31%. → smaller pool, better retention, less per-request work. Direct empirical rebuttal to "fragmentation is architectural, not policy." Bundled scripts (rerunable): - scripts/analysis/plot_gpu_utilization.py - scripts/analysis/plot_cache_efficiency.py Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-11 18:04:49 +08:00
kzlin	c5519066de	docs(kvc): add TTFT probability density figure (KVC v2 vs 4DP) Adds a two-panel TTFT PDF comparison plot inserted as a new V2_DEEP_ANALYSIS §3.4 ("TTFT 概率密度对比: bimodal vs unimodal", i.e. "TTFT probability density comparison: bimodal vs unimodal"). Single-percentile numbers (p50 / p99) hide the qualitative difference between the two distributions; the figure makes it visible at a glance. Left panel (linear x in [0, 0.6]s, body): KVC has a sharp peak at ~40ms (the direct-to-D fast path). DP has a broad peak around 50-200ms (full prefill per request). Annotated with p50 and p90 markers for each side. Right panel (log x in [10ms, 10s], full range): KVC is visibly bimodal: a tall fast-path peak plus a small reseed tail around 1-5s. DP is unimodal: a single broad peak with shorter tail. Annotated with p99 callouts pointing to each tail. KDE: scipy.stats.gaussian_kde, bandwidth=0.15 for the body (Scott's rule oversmooths the sharp fast-path peak), log10-transformed for the full-range panel so the bimodal structure is visible. Bundled: - scripts/analysis/plot_ttft_pdf.py -- re-runnable when v2 / DP data change. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-11 17:46:27 +08:00
kzlin	b5af19583b	docs(kvc): replace v2 path breakdown tables with generated figures V2_DEEP_ANALYSIS §3.1 (execution_mode distribution) and §3.2 (path-level latency vs DP) had hand-typed tables with approximate latencies (e.g. "~1.0s") and required readers to mentally compare 5+ rows × 5 columns. Both sections now reference generated PNG figures derived directly from the v2 + DP metrics.jsonl files. §3.1 figure (v2_execution_mode_distribution.png): Horizontal bar chart, log x-axis. 4076 direct-to-D fast-path requests (green) dwarf the rest by ~30x; the long tail of slow / fallback / failure modes is visible at one glance. Counts and percentages annotated on each bar. §3.2 figure (v2_path_level_latency.png): Grouped bar chart, log y-axis. Per-path TTFT p50 / TTFT p99 / Lat p50 with exact numeric labels (no more "~1.0s" approximations). Sample counts annotated below each path. Quick visual reads: - KVC fast path TTFT p50 41ms vs DP 92ms (2.2x faster) - KVC reseed TTFT p99 5.12s vs DP 0.43s (12x slower) -- the cost - KVC no-d-capacity TTFT p99 7.65s (worst case) Bundled: - scripts/analysis/plot_v2_path_breakdown.py -- the script that generates both figures; rerunable when v2 data changes. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-11 17:38:43 +08:00
kzlin	37e9caa431	docs(kvc): production-decision reframe + formal router algorithm spec After the critic-agent audit, V2_DEEP_ANALYSIS had drifted into an audit-grade "5 wins / 1 loss / 3 draws" framing that mistook KVC's deliberate design motifs (cache concentration via session affinity; prefill-GPU idle as TTFT-stability trade-off) for "comparison unfairness." This commit corrects the framing back to a production- decision lens and adds a paper-track formal specification of the router algorithm. V2_DEEP_ANALYSIS_ZH.md changes: - §0 TL;DR: lead with "online coding agent serving should pick KVC 1P3D"; the only real cost is TTFT p99 long-tail (3x DP) from the 8.3% mooncake reseed path, mitigable with real RDMA. - §4 restructured into three buckets: real costs (TTFT p99 tail, abort accounting now fixed), counter-arguments to the critic (cache concentration and idle prefill GPU are design intent, not deficits), methodology to-do (naive-1P3D control, v2 N>=2 determinism). - §6 replaces "5/1/3 rescoring" with production decision rationale: KVC wins on 6 latency/TTFT metrics + lower failure rate; pays TTFT p99 tail; lists workloads where DP would reverse the call. - §8 decision points: D1 recommends Yes (accept v2 as milestone); D8 added: paper motif "KVC trades P idle for TTFT stability." 
KVC_ROUTER_ALGORITHM.md (new, paper-track, Chinese narrative + English algorithm boxes / variable names / theorems for direct paper reuse): - Problem formulation, system model, full notation - Algorithm 1 Route: lexicographic-tuple scoring on (overlap+alpha*sticky, sticky, -inflight, -assigned) - Algorithm 2 Admit: D-worker autonomous admission deciding Direct / Seed / Reseed / reject (with reason) - Algorithm 3 Dispatch: end-to-end orchestration with reset-on-success (the v2-specific fix that eliminates v1's self-amplifying thrashing) - Theorem 1 (no permanent starvation) and Theorem 2 (fast-path determinism), each with a proof sketch - Comparison table vs vanilla pd-disagg / DP cache-aware - Anti-patterns ("what KVC explicitly is NOT") - Open questions for reviewers - Suggested paper citation phrasing - Appendix A: algorithm-step to source-file:line crosswalk Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-11 17:29:18 +08:00
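The lexicographic-tuple scoring named for Algorithm 1 Route can be illustrated with a short sketch. Only the tuple shape (overlap+alpha*sticky, sticky, -inflight, -assigned) comes from the commit message; the `Worker` dataclass, the units of `overlap`, and the default `alpha` are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Worker:
    name: str
    overlap: int    # cached-prefix overlap with the request (units assumed: blocks)
    sticky: int     # 1 if the session is already bound to this worker, else 0
    inflight: int   # requests currently running on the worker
    assigned: int   # sessions currently assigned to the worker


def route(workers, alpha=1.0):
    """Pick the worker with the lexicographically largest score tuple.

    Python compares tuples element by element, so load (inflight, assigned)
    only matters when the overlap and sticky terms tie exactly.
    """
    return max(
        workers,
        key=lambda w: (w.overlap + alpha * w.sticky, w.sticky, -w.inflight, -w.assigned),
    )


if __name__ == "__main__":
    busy = Worker("decode-0", overlap=1, sticky=0, inflight=10, assigned=25)
    cold = Worker("decode-2", overlap=0, sticky=0, inflight=0, assigned=0)
    print(route([busy, cold]).name)
```

With overlap in the first tuple position, a single shared block is enough for a heavily loaded worker to win over an idle one, which is consistent with the cold decode-2 behaviour reported in the E1 and E2 commits.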
kzlin	5eac9b4f6b	fix(metrics): exclude aborted requests from latency/ttft/tpot stats The old filter `if row.latency_s is not None` accepted SGLang's fast input-length-aborts (latency_s ~ 0.08s, finish_reason='abort/BadRequest') as if they were successful zero-cost requests. This deflated mean/p50 of any run where the model rejected oversized inputs. Impact on existing comparisons (ts=1 4-run validation + v2): KVC v2 has 40 aborts + 5 ReadTimeouts (was reported as just 5); DP 4w has 67 aborts (was reported as 5). Both runs have abort behavior; the asymmetry (40 vs 67) is purely from SGLang's mem-fraction-derived max-input-len: KVC decode-only worker gets ~10 GB free GPU mem -> max-input=92098, DP fused worker gets ~9 GB -> max-input=87811, because DP also needs chunked-prefill workspace. The KVC-vs-DP latency-win direction holds and widens slightly under the fixed filter (lat mean delta: -0.8% -> -1.4%); see V2_DEEP_ANALYSIS_ZH §4.3 for the recomputed table. Changes: - metrics.py: new _is_failed_request(row) helper; latency/ttft/tpot stats now exclude both errors and aborts. New summary fields abort_count and failure_count expose the counts directly. - scripts/analysis/recompute_summary.py: re-derives summary.json from existing metrics.jsonl using the fixed code, with optional --diff against the old buggy summary for inspection. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-11 17:29:18 +08:00
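The effect of the filter change in 5eac9b4f6b can be shown with a small sketch. The `Row` fields and the exact failure conditions are assumptions for illustration; only the old `latency_s is not None` filter, the `abort/BadRequest` finish reason, and the `abort_count` / `failure_count` summary fields are taken from the commit message.

```python
from dataclasses import dataclass
from statistics import fmean
from typing import Optional


@dataclass
class Row:
    latency_s: Optional[float]
    finish_reason: Optional[str] = None
    error: Optional[str] = None


def is_abort(row):
    """True for fast aborts such as finish_reason='abort/BadRequest'."""
    return (row.finish_reason or "").startswith("abort")


def is_failed_request(row):
    """Errors, missing latencies, and aborts are all excluded from latency stats."""
    return row.error is not None or row.latency_s is None or is_abort(row)


def summarize(rows):
    ok = [r.latency_s for r in rows if not is_failed_request(r)]
    return {
        "latency_mean_s": fmean(ok) if ok else None,
        "abort_count": sum(1 for r in rows if is_abort(r)),
        "failure_count": sum(1 for r in rows if is_failed_request(r)),
    }


if __name__ == "__main__":
    rows = [
        Row(10.0, "stop"),
        Row(12.0, "stop"),
        Row(0.08, "abort/BadRequest"),      # fast input-length abort
        Row(None, error="ReadTimeout"),
    ]
    old_mean = fmean(r.latency_s for r in rows if r.latency_s is not None)
    print(round(old_mean, 2), summarize(rows))
```

In this toy data the old filter lets the 0.08s abort pull the mean well below the two real latencies, while the fixed filter reports the mean of completed requests only and surfaces the abort and failure counts separately.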
kzlin	0c25168cad	docs(kvc): v2 deep analysis vs TEAM_REPORT baseline Post-v2 audit consolidating ts=1 validation + v1 thrashing + v2 win, plus critic-agent adversarial review of the v2 vs 4DP comparison. Headline outcomes: - TEAM_REPORT §1 (session pin starvation) fully fixed by v2 migration + reset-on-success; direct-to-D 42.8% -> 91.6%. - TEAM_REPORT §2/§3/§5 (LRU, backpressure, admission RPC) are absorbed by ts=1 natural drain time, not mechanism-fixed -- will resurface under ts=10/longer traces/higher concurrency. - TEAM_REPORT §6 (ts=10 distortion) confirmed and locked as precondition; TEAM_REPORT §8 (N=1 unreliable) rewritten to "high-pressure N>=3, normal N=1". Three new problems exposed by adversarial review: - TTFT p99: KVC 1.285s vs DP 0.427s (KVC 3.0x worse) -- cherry-picked out of the V2_RESULTS_ZH.md headline table. Root cause: 8.3% non-direct path pays 3-7s mooncake reseed cost on 50-90K-token KV transfer. - Error accounting asymmetry: DP has 67 fast-aborts (not 5) at ~0.08s each counted in latency stats; KVC's 5 ReadTimeouts excluded entirely. Root cause: --max-input-len 87811 (DP) vs 92098 (KVC) + metrics.py:124 filter. - Topology mismatch: KVC 1P3D's prefill GPU is idle 91.7% of the time (only ~373/4449 requests use seed/P path); 4DP CA has all 4 GPUs at full utilization. Plus: no naive 1P3D control exists in the repo -- cannot isolate KVC-layer contribution from 1P3D-topology contribution. Re-scored headline: 5 KVC wins / 1 DP win / 3 draws -- still net positive but not the "7/8 wins" framing the V2_RESULTS_ZH.md claims. Recommended follow-ups (ROI order): 1. naive 1P3D ts=1 N=1 control (critic's only CRITICAL finding) 2. v2 N=2/N=3 to verify ts=1 determinism with new code paths 3. symmetric error accounting recompute + DP max-input-len = 92098 rerun Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-11 11:17:00 +08:00
kzlin	2ec0debef4	feat(kvc): session migration with reset-on-success + direct-append threshold tuning KVC v2 beats 4DP at ts=1 same-scale on 7/8 metrics: TTFT mean -24%, p50 -54%, p90 -64%; lat mean -0.8%, p50 -12.6%, p90 -0.7%. Direct-to-D rate jumped 42.8% -> 91.7%. REFACTOR_PLAN_V1 scenario C achieved. Two-knob fix: - reset-on-success blacklist decay: clear (sess, D) reject counter on successful direct-to-D path. Eliminates v1 thrashing where session 6880 was stable on decode-1 for 70 turns then collapsed to 75 D-changes after cumulative transient pressure tripped the permanent blacklist. - bump --kvcache-direct-max-uncached-tokens default 2048 -> 8192 via CLI flag. 41% of v1 fallbacks were 'real-large-append' (>2048 token append); raising the threshold lets these go through the direct-to-D fast path. Code: - policies.py: RoutingState.session_d_rejects counter + KvAwarePolicy migration_reject_threshold; degenerate fallback picks least-rejected D. - replay.py: record_admission_reject + reset-on-success in _run_request; _fallthrough_reason classifies turn-2+ fall-throughs as session-not-resident / real-large-append / etc, replacing misleading 'large-append' suffix (TEAM_REPORT §2.7). - cli.py + benchmark.py: --kvcache-migration-reject-threshold flag wiring. Docs: - REFACTOR_PLAN_V1_ZH.md: forward-looking plan after ts=1 validation. - MIGRATION_V1_FINDINGS_ZH.md: v1 thrashing root-cause analysis. - V2_RESULTS_ZH.md: v2 results, scenario C achievement, attribution. - TEAM_REPORT_AGENTIC_PD_HYBRID_ZH.md: comprehensive team report. Scripts: - sweep_ts1_kvc_n3_plus_dp.sh: ts=1 baseline (KVC 1P3D N=3 + 4DP CA). - sweep_ts1_migration_v1.sh / v2.sh: validation runs. - analyze_ts1_validation.py: 4-way comparison analyzer. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-09 01:18:13 +08:00
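The reset-on-success blacklist decay described above can be sketched as a per-(session, D) reject counter. This is an illustrative sketch, not the code in policies.py or replay.py: the `RejectTracker` class, its method names, and the threshold value are hypothetical.

```python
class RejectTracker:
    """Counts admission rejects per (session, decode worker) pair.

    A pair is treated as blacklisted once its count reaches `threshold`;
    a successful direct-to-D request clears the count, so transient pressure
    cannot accumulate into a permanent blacklist.
    """

    def __init__(self, threshold=3):  # threshold value is an assumption
        self.threshold = threshold
        self._rejects = {}

    def record_reject(self, session, worker):
        key = (session, worker)
        self._rejects[key] = self._rejects.get(key, 0) + 1

    def record_success(self, session, worker):
        # Reset-on-success: forget earlier rejects for this pair.
        self._rejects.pop((session, worker), None)

    def is_blacklisted(self, session, worker):
        return self._rejects.get((session, worker), 0) >= self.threshold

    def least_rejected(self, session, workers):
        """Degenerate fallback: when every worker is blacklisted, pick the least-rejected one."""
        return min(workers, key=lambda w: self._rejects.get((session, w), 0))


if __name__ == "__main__":
    tracker = RejectTracker(threshold=2)
    tracker.record_reject("6880", "decode-1")
    tracker.record_reject("6880", "decode-1")
    print(tracker.is_blacklisted("6880", "decode-1"))
    tracker.record_success("6880", "decode-1")
    print(tracker.is_blacklisted("6880", "decode-1"))
```

Without the reset, a long-lived session that sees occasional rejects would eventually cross the threshold on its home worker, which matches the v1 thrashing pattern the commit describes for session 6880.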
kzlin	1d51704dad	docs(kvc): agentic-fit analysis, refactor plan, validation report Three new docs covering the structural-fit investigation: - AGENTIC_FIT_ANALYSIS_ZH.md: §1-§7 of structural design issues that surface KVC vs vanilla DP gap on real agentic workloads (SWE 50sess). Quantifies session pinning, LRU shortfall, P-side imbalance, time-scale distortion, etc., with code citations and N=3 rerun data. - REFACTOR_PLAN_ZH.md: KISS-edition refactor plan. After verifying the original "estimate inflation" and "resident_blocks aging" claims were not real bugs, scope shrinks to one code change (backpressure) plus a 4-run smoke sweep within an 8h budget. - STRUCTURAL_VALIDATION_REPORT_ZH.md: validates §1-§7 claims using existing v5 baseline rerun data + 8DP CA baseline. Each claim labeled fully-supported / indirect / retracted with the data source. Notes that backpressure E2E validation is pending GPU smoke run. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-06 21:30:11 +08:00


70 Commits