agentic-pd-hybrid

Author	SHA1	Message	Date
Claude Code Agent	a4f30e6bd3	docs(d2p): implementation status snapshot — Phase 1-3 audit Captures the current state of the D→P RDMA snapshot push work for the next agent (or future me): which commits land which phase, which phases are verified vs in-flight, and the known unverified surfaces (byte-level KV layout, cross-node, multi-D contention, token_id consistency, D-side evict races, chunked-prefill interactions). Also maps the §2 design points to their implementation locations so the doc-to-code traceability is explicit.	2026-05-13 08:29:26 +08:00
Claude Code Agent	8a2f72f18e	feat(experiments): E4 protocol + sweep script — KVC + D→P vs naive PD Pre-registers the E4 experiment that tests whether KVC + D→P RDMA snapshot push beats the naive PD-disagg E1 baseline on the inferact_50sess subset. Compared to E3 the only changed flag is --enable-d-to-p-sync. Three hypotheses (see docs/E4_PROTOCOL_ZH.md §2.3): H1 (main): E4 TTFT p99 ≤ E1 TTFT p99 H2: E4 reseed-mode TTFT < E3 reseed-mode TTFT H3: E4 success count ≥ E3 success count The full reseed → snapshot-push orchestration is wired in `b9b0cf0` (_attempt_d_to_p_sync); the SGLang scheduler RPCs and the runtime mem-leak fix are in `86412bb` / `a369722`.	2026-05-13 08:27:40 +08:00
Claude Code Agent	a369722efe	fix(sglang): account snapshot-reserved slots in radix mem leak check Phase 2 prepare_receive allocates kv_pool slots that aren't visible to radix / session bookkeeping until finalize_ingest. Without this fix, the scheduler's idle self_check fires: ValueError: token_to_kv_pool_allocator memory leak detected! available=288391, evictable=5, protected=0, session_held=0 (expected sum == 288460) _check_radix_cache_memory now subtracts sum(len(rec.slot_indices) for rec in ctrl._ingest_records.values()) from the expected total before flagging a leak. Snapshot_reserved is also printed in the leak message for diagnostics. Smoke confirmed (scripts/smoke_snapshot_sglang_integration.py): [smoke] prepare_receive on P → 200: ok=true (96 layer bufs) [smoke] dump on D → 200: ok=false, reason=session-not-resident [smoke] finalize on P → 200: ok=true, inserted_prefix_len=0 [smoke] OVERALL: PASS End-to-end KV-correctness (snapshot ingest yields cache hit on next prefill) still requires the agentic+router stack — covered in the E4 sweep, not this smoke.	2026-05-13 08:26:16 +08:00
Claude Code Agent	b9b0cf0fac	feat(agentic): D→P snapshot orchestration in reseed path + CLI flag Phase 3 — wires the SGLang-side snapshot RPCs (committed in `86412bb`) into the agentic reseed slow-path. On _invoke_kvcache_seeded_router: 1. POST {prefill_url}/_snapshot/prepare_receive alloc P-side slots 2. POST {old_decode_url}/_snapshot/dump RDMA push session KV 3. POST {prefill_url}/_snapshot/finalize_ingest insert into P radix After step 3 P's radix tree has the session prefix cached; the subsequent SGLang router-driven prefill on P hits cache instead of re-computing. Any RPC failure short-circuits to the existing seeded_router fallback (re-prefill from scratch). All steps are best-effort and structurally logged for post-hoc analysis. Flag plumbing: cli.py --enable-d-to-p-sync (replay + benchmark) topology.py SingleNodeTopology.enable_d_to_p_sync stack.py SGLANG_SNAPSHOT_LINK_ENABLE=1 injection per worker replay.py ReplayConfig.enable_d_to_p_sync + _attempt_d_to_p_sync helper Snapshot port per worker derives from disaggregation_bootstrap_port + 1000 (set in third_party/.../snapshot/controller.py), so different workers get distinct mooncake snapshot engines on the same node. Smoke (next): scripts/smoke_snapshot_sglang_integration.py spawns one D + one P, exercises the 3 RPCs end-to-end, checks cache_tokens on a follow-up generate request. See docs/D_TO_P_SYNC_DESIGN_ZH.md for the full design.	2026-05-13 08:16:46 +08:00
Claude Code Agent	86412bb174	feat(sglang): D→P snapshot link integration — controller + RPC handlers Phase 2 of the D→P sync feature (Phase 1 in `dc4867c` verified the underlying RDMA link in isolation). This commit wires that link into each SGLang worker's scheduler so D and P can exchange session KV without going through the PD prefill pipeline. New module: third_party/sglang/python/sglang/srt/disaggregation/snapshot/ controller.py — SnapshotLinkController owns one mooncake transfer engine per worker, pre-registers all kv_pool layer buffers, and exposes prepare_receive() and push_session_kv() APIs. Receive bookkeeping via a session_id → SnapshotIngestRecord side-table. Three RPC types added to io_struct.py and full plumbing wired through: SnapshotPrepareReceiveReqInput/Output P-side alloc + return layout SnapshotDumpReqInput/Output D-side read kv_pool + RDMA push SnapshotFinalizeIngestReqInput/Output P-side radix tree insert Files touched: managers/io_struct.py 3 new ReqInput/ReqOutput pairs managers/tokenizer_communicator_mixin.py 3 communicators, 3 awaitables managers/scheduler.py init controller + 3 handlers entrypoints/http_server.py 3 HTTP endpoints under /_snapshot Activation: set SGLANG_SNAPSHOT_LINK_ENABLE=1 (and SGLANG_SNAPSHOT_LINK_HOST / _PORT / _IB_DEVICE) per worker. Controller init is opt-in and defaults off, so production PD pipeline is untouched. Subsequent work (Phase 3): agentic-pd-hybrid orchestration in _invoke_kvcache_seeded_router to call prepare_receive on P, dump on D-old, finalize_ingest on P, then trigger the existing P→D' transfer which will now hit P's radix cache (skipping re-prefill).	2026-05-13 08:12:04 +08:00
Claude Code Agent	7216507773	feat(snapshot): D→P RDMA Phase 1b — GPU pointer path verified Confirms snapshot_link works for cuda device pointers, not just host memory. Sender on cuda:0 pushes to receiver on cuda:1 via RDMA over mlx5_60. All 5 sizes (16K, 1M, 16M, 64M, 256M) pass SHA verification. 16 KB 8.3 ms 0.016 Gbps (cold openSegment) 1 MB 0.10 ms 87.6 Gbps 16 MB 0.84 ms 159 Gbps 64 MB 2.52 ms 213 Gbps 256 MB 8.54 ms 251 Gbps (~60% NDR400 line rate) For Inferact-scale sessions (~50K tokens × ~80 KB layer-per-token = ~4 GB), this projects D→P transfer time at ~130 ms — within the "reseed-savings" envelope sketched in design doc §3.2. Files: scripts/snapshot_link_receiver_gpu.py scripts/smoke_snapshot_link_gpu.py Next: SGLang scheduler integration for D-side dump + P-side ingest.	2026-05-13 00:59:43 +08:00
Claude Code Agent	dc4867c270	feat(snapshot): D→P RDMA link Phase 1 — minimal byte transport A thin wrapper around mooncake.engine.TransferEngine that does one-sided RDMA writes between two SnapshotPeer endpoints. Bypasses SGLang's MooncakeKVManager (which is hard-gated to PREFILL/DECODE roles via add_transfer_request assertion at conn.py:1563) so the D→P direction doesn't require invasive role-axis changes upstream. Smoke test (two subprocess.Popen processes, mlx5_60, 127.0.0.1): 1 KB 9.0 ms (one-time openSegment handshake) 16 KB 0.04 ms 3.5 Gbps 1 MB 0.10 ms 82 Gbps 16 MB 0.58 ms 232 Gbps 64 MB 1.70 ms 316 Gbps (~80% of NDR 400G line rate) All 5 sizes pass SHA256 verification end-to-end. Files: src/agentic_pd_hybrid/snapshot_link.py — SnapshotPeer, SnapshotEndpoint scripts/snapshot_link_receiver.py — child-process receiver scripts/smoke_snapshot_link.py — sender + verifier docs/D_TO_P_PHASE1_LINK_ZH.md — phase 1 acceptance doc Next: Phase 2 (D-side scheduler commit hook), Phase 3 (P-side prefill bypass with snapshot KV). See docs/D_TO_P_SYNC_DESIGN_ZH.md §5.	2026-05-13 00:55:55 +08:00
Claude Code Agent	9c35eddc79	docs(design): D→P RDMA snapshot push design Goal: skip P-side re-prefill on reseed path. Push session KV snapshot from D back to P after each direct-to-D append; reseed re-uses P's snapshot to fire only the P→D' transfer (no model.forward on P). Decision: Option C — D→P snapshot at append-commit, P-side PrefillSnapshotStore (side-table, not in radix tree), prefill bypass when snapshot is fresh. Rejects A (radix multi-producer), B (D→D' direct, fails for session-not-resident), D (eviction-only). Lays out 8-commit roadmap, wire protocol, failure modes, and the E4 experiment plan (KVC + D→P vs naive PD-disagg E1 baseline).	2026-05-13 00:44:03 +08:00
tim	6d1c9237fa	docs(architecture): KVC eviction granularity is the wrong abstraction After E3 exposed massive session-level eviction (90 trims × avg 67K tokens/evict = 6.1M tokens trashed in 1h12min), we have to acknowledge the local-patch sequence (E2→load-floor→Fix A → proposed disable-migration → proposed disable-admission) was a KVC-to-DP collapse trajectory, not a fix. The fundamental issue: SessionAwareCache merged two responsibilities that should be separate. 1. Session lifecycle tracking (legitimate — streaming sessions reuse KV across turns and need per-session metadata). 2. Eviction granularity decision (wrong — sessions should not be the eviction unit). `release_session` frees the session-exclusive range [cache_protected_len, kv_allocated_len), which is the post-radix- commit tail accumulated over decode/extend. On Inferact's 50-session workload this is 35-87K tokens per session. The radix tree never gets a chance to do block-level leaf-LRU on that range because it was never committed there. Effect: evict-revisit cycle forces full 50-90K re-prefill per session per evict — which is exactly the per-request cost of naive PD-disagg. KVC's direct-to-D fast-path advantage collapses. The right fix is structural (not a patch): progressively commit streaming-session decode output to the radix tree so SGLang's block-level LRU can shed only the deepest leaves, preserving the recent prefix that next-turn requests are most likely to match. SessionSlot becomes pure metadata. Scope is ~1-2 weeks of vendored SGLang refactor, orthogonal-and-complementary to the D→P sync work proposed in RESEED_SLOW_PATH_AND_D_TO_P_GAP §4. Doc lists five anti-patterns the next agent should avoid (tuning migration_reject_threshold, disabling migration/admission, etc) — all of those are local symptoms downstream of the eviction granularity choice. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-12 14:21:45 +08:00
tim	986f351365	feat(sglang): drop streaming-session reqs with fill_ids < prefix_indices Fix A from docs/E3_FINDINGS_ZH.md §3. The existing streaming-session correction at the top of ScheduleBatch.prepare_for_extend zeroes req.extend_input_len when len(fill_ids) <= len(prefix_indices), but the per-req invariant later in the same function (assert seq_len - pre_len == req.extend_input_len) is computed from raw fill_ids/prefix_indices lengths and has no path to be satisfied when fill_len < prefix_len. The result is an AssertionError that crashes the entire decode worker. Add a pre-filter pass at the start of prepare_for_extend that detects this state, marks the affected reqs with FINISH_ABORT (so the client gets an error response instead of the worker hanging), and drops them from the batch before the correction loop runs. If all reqs are filtered, populate empty tensor/list state and return early so downstream model.forward sees a valid no-op batch. This treats fill_ids < prefix_indices as upstream state inconsistency that should be reported to the client rather than silently miscomputed. The narrower invariant after this filter: prepare_for_extend's body only ever sees streaming-session reqs where actual_extend_len > 0, which is the regime the existing correction logic was designed for. Reproduced by E3 first run on 2026-05-12 02:51:21 UTC (rid 6f4318e93dd543a49dbf19248cfc1e6f, session 1000195, fill_len=6648, prefix_len=43459) — masked in E1/E2 because the cap-out failure cascade prevented sessions from accumulating deep enough committed prefix to trigger the inconsistency. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-12 12:12:14 +08:00
tim	d40db1f117	docs(experiments): E3 first run — load-floor bonus works, exposes SGLang bug H1 (load balance) confirmed at the 15-min checkpoint: D2 received 22.5% of bindings (225 out of 1001) covering 30 unique sessions, versus 0 in both E1 and E2. The graduated load-floor formula with K=200 produces the intended distribution: fresh sessions on under-loaded D, sticky sessions stay put. But decode-1 crashed at 11:51:21 (~5 min into benchmark) with an SGLang AssertionError in schedule_batch.py:1646. Root cause: the streaming-session correction at line 1572-1585 patches req.extend_input_len to 0 when len(fill_ids) < len(prefix_indices), but the downstream invariant uses raw fill_ids/prefix_indices lengths, so the arithmetic check fails. This is a pre-existing landmine in the `b8e6f13` SGLang vendor patch, not caused by the load-floor bonus. It just happened to be masked in E2 by the failure cascade preventing sessions from accumulating deep enough prefix to trigger the correction. Crash session 1000195 stayed on decode-1 the whole time (not a migration race). E3 exposes this faster because sessions actually run further with rebalanced load. 5 fix options evaluated. Recommended: Fix A — local patch at schedule_batch.py:1646 to skip zero-extend-len reqs before asserting. Less invasive than C (recomputing seq/prefix arrays); addresses the actual case (D and E are workarounds, not fixes). 4 decision points for review; no code changes in this commit. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-12 12:05:51 +08:00
tim	a1abdcd50c	feat(experiments): E3 sweep — KVC v2 + RDMA + load-floor bonus Same outputs/inferact_50sess.jsonl subset as E1/E2 (md5 7bb263a32600ef5a6ef5099ba340a487). Identical to E2 except adds --kvcache-load-floor-bonus 200. Tests three hypotheses: H1 (load balance): D2 receives non-trivial bindings (E1/E2: 0) H2 (failure rate): mooncake batch_transfer timeouts disappear because D0/D1 KV pool no longer saturates (E2 had 1054 fails; expect ≤ E1's 85) H3 (TTFT): E2's 0.43s p50 (over the 231 successes) generalizes to most reqs once cascade is gone K override via LOAD_FLOOR_BONUS env var (default 200). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-12 11:45:09 +08:00
tim	93fce42747	feat(policy): load-floor bonus for KvAwarePolicy (Q2.B) Implements the design proposed and approved in docs/E1_E2_FIX_DESIGN_ZH.md §Q2.B. KvAwarePolicy gains a `load_floor_bonus: int = 0` knob. When > 0: mean_assigned = sum(assigned[]) / len(D) for each D candidate: if not sticky and mean_assigned > 0: deficit = max(0, mean_assigned - assigned[D]) floor_bonus = K * deficit / mean_assigned else: floor_bonus = 0 score = (overlap + sticky*α + floor_bonus, sticky, -inflight, -assigned) Properties (verified by unit-style probe in commit message): - Default 0 = old behavior preserved - Sticky-gated: turn-1+ requests of an existing session keep going to their original D (cache locality preserved) - Graduated: bonus magnitude scales with the D's deficit ratio, approaches K as deficit/mean → 1, drops to 0 when balanced - Set above max expected boilerplate overlap (Inferact ~50 → 200) so cross-session shared-prefix overlap doesn't pin cold Ds idle, but real per-session prefix overlap (>K blocks) still wins Plumbed through ReplayConfig, BenchmarkConfig, and CLI flag --kvcache-load-floor-bonus on both `replay` and `benchmark-live`. Empirical verification on synthetic state (same conditions as the E2 cold-D pathology): - OFF (K=0): route fresh session → decode-0 (boilerplate winner) - ON (K=200): route fresh session → decode-1 (cold D rebalanced) Validation pass next: scripts/sweep_e3_kvc_v2_loadfloor_rdma.sh (committed separately). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-12 11:45:09 +08:00
tim	905d671135	feat(env): MC_TRANSFER_TIMEOUT=1800s default in setup_env + stack Mooncake C++ batch_transfer_sync defaults to 30s timeout; on saturated D scheduler threads doing LRU eviction, that fires as a false positive and the SGLang hair-trigger in conn.py:1270 permanently blacklists the D's mooncake_session_id (E2 forensic in docs/E1_E2_RESULTS_ZH.md §5c). Bump to 1800s in setup_env.sh and mirror to subprocess env in stack.py so SGLang workers get it too. 30-min envelope still detects genuinely broken peers eventually. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-12 11:45:09 +08:00
tim	9a166ac43b	docs(experiments): design space for Q1 (mooncake stall) + Q2 (cold-D) For Q1 (D scheduler LRU starves mooncake control plane → 30s batch_transfer_sync timeout → hair-trigger blacklist), six candidate fixes evaluated. Recommendation: do Q2 fix first since it removes the only condition under which we observe LRU thrash; bump mooncake timeout to 120s as cheap defense-in-depth; avoid invasive SGLang vendor changes (windowed hair-trigger, async eviction thread) until Q2 fix demonstrates they're insufficient. For Q2 (overlap-first lex score + shared boilerplate → permanent D2 cold), seven candidate fixes evaluated. Recommendation: load- floor bonus (graduated, decoupled from overlap, gated on not-sticky) as the primary mechanism — proactive on first-touch as user requested, avoiding the binary one-shot pitfall of the reverted cold-D bonus. Orthogonal cleanup: fix the substring filter in _is_admission_rejection_mode so the existing migration mechanism serves as a backstop when load balancing alone isn't enough. 7 decision points listed for review; no code merged until a shape is approved. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-12 11:20:00 +08:00
tim	976115ea5e	Revert "feat(policy): cold-D bonus to break overlap-pinning death spiral" Implementation jumped ahead of design. The cold-D bonus is one of several candidates for the overlap-pinning fix (others: load-floor bonus, idle-D bonus, capacity-aware overlap discount, pre-warming boilerplate). Need to evaluate the design space first, including whether a single bonus is even the right shape vs a separate term in the lex score, before committing to a specific knob. This reverts commit `786cbb8` cleanly (forensic docs in `bf4da28` and `7f2ebf3` are kept since they record observations, not designs). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-12 11:17:16 +08:00
tim	786cbb8d91	feat(policy): cold-D bonus to break overlap-pinning death spiral KvAwarePolicy now accepts an optional cold_d_bonus int. When > 0, fresh requests (sticky=0, i.e. no prior D for this session) receive the bonus added to lex-score position 0 (overlap+sticky_bonus) for any D worker that has never been assigned a session yet (decode_assignment_counts == 0). This breaks the pathology documented in docs/E1_E2_RESULTS_ZH.md §5d where workloads with shared cross-session prefix (e.g. Inferact's "permissions instructions" boilerplate) cause every D that has hosted any session to dominate the overlap term against any cold D, leaving the cold D permanently unused. Sticky behavior is preserved: turn 1+ requests of an existing session continue to stick to their original D because the bonus is gated on `not sticky`. Plumbed through ReplayConfig.kvcache_cold_d_bonus (default 0, keeping current behavior unchanged), BenchmarkConfig, and CLI flag --kvcache-cold-d-bonus on both `replay` and `benchmark-live` subcommands. Set above max expected boilerplate overlap (Inferact's ~50 24-token blocks → 1000 is safe). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-12 11:14:00 +08:00
tim	bf4da281c0	docs(experiments): mooncake "is not alive" deep-dives to LRU starvation The Q1 mystery resolves: P-side mooncake C++ logs show "Sync batch data transfer timeout after 37452515723ns" (37.45 s) at 01:56:42 — this is mooncake's batch_transfer_sync giving up after its internal timeout. The hair-trigger >=1 in conn.py:1270 is correct in the idle case (a 30-s RDMA stall genuinely means the peer is broken), but it fires here because of D-side congestion: decode-0.log shows two consecutive LRU evictions ("Trimmed decode session cache via LRU. evicted_sessions: 2, freed_tokens: 77675") firing at the exact same wall second the timeout triggers. The D scheduler thread is busy with multi-session GPU memory frees + session-aware-cache bookkeeping under lock; the mooncake C++ control plane on the receive side gets starved for >30 s; P times out and marks the whole D's mooncake_session_id failed. Two-layer fix listed in §5c: root-cause = spread load to D2 (cold-D bonus, next commit); defense-in-depth = windowed threshold + retry in vendored mooncake conn.py. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-12 11:14:00 +08:00
tim	7f2ebf3d87	docs(experiments): forensic on Q1 (mooncake death) and Q2 (no D2 migration) Q1: Mooncake "is not alive" is hair-trigger — a single send_kvcache_slice ret != 0 in third_party/sglang/python/sglang/srt/disaggregation/mooncake/conn.py :1270 permanently adds the D's mooncake_session_id to failed_sessions and blacklists it for the rest of the process lifetime. The D worker process is alive (D1 keeps serving admit_direct_append OK seconds after), but every subsequent P→D transfer for that session short-circuits at conn.py:1184. The "Failures should never happen if the session is not dead" comment encodes the wrong assumption for the saturation regime we hit. Q2: KVC v2's migration mechanism IS sound but its trigger is gated by replay.py:1379 _ADMISSION_REJECTION_SUBSTRINGS = ("session-cap", "no-d-capacity", "d-backpressure"). All 1054 failures have execution_mode="kvcache-centric" (generic fallback bucket) which contains none of those substrings, so session_d_rejects is never incremented. Empirically 46 of 49 (sess, D) pairs that the worker RPC rejected would have qualified for blacklist (most-rejected pair: 25 rejects), but policy never saw them. Result: D0 reject → next-bind D0 (253×), D1 reject → next-bind D1 (329×), D0/D1 reject → next-bind D2 (0×). Fix paths documented for both, shortest path is widening the substring filter to include the failure-fallback bucket, but the right fix is to call record_admission_reject directly from the actual rejection signal site instead of string-matching execution_mode. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-12 10:45:18 +08:00
tim	ef4dc81ea9	docs(experiments): forensic explanation for E2 80% failure rate Pulling admission-events.jsonl, prefill-0.log, and request-metrics sampling shows the 1054 failures are NOT timeouts as initially assumed. They are a 3-layer cascade: L1: 562 "no-space" + 43 "session-not-resident" worker admission rejects (51% of all admit attempts) because D0/D1 KV pools saturate while D2 stays empty. L2: rejects re-route to seed/reseed which need mooncake P→D KV transfer; the backlog drops mooncake heartbeats and prefill-0 logs "Decode instance could be dead, remote mooncake session ... is not alive". L3: SGLang aborts the request, SSE stream closes with 0 tokens, agentic-pd-hybrid raises "generate stream ended before producing any token" (the literal error string for all 1054). E1 didn't hit this because pd-disaggregation has no admission RPC — sessions just queue behind the running batch, paying TTFT instead of failing. KVC v2's worker admission is supposed to be a safety valve; on the cold-D pathology it becomes a failure amplifier. The real fix is upstream D rebalancing (cold-D bonus or pre-warm), not relaxing admission. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-12 10:38:49 +08:00
tim	3db2d84df8	docs(experiments): E2 complete — qualified H1 with a surprise E2 finished 1h33min wall. Headline contrast on the matched Inferact 50-session subset: E1 (naive 1P3D + kv-aware + RDMA): 1200/1285 succ, lat p50=93s p99=219s, TTFT p50=89s p99=207s E2 (KVC v2 + RDMA): 231/1285 succ, lat p50= 7.4s p99=65s, TTFT p50=0.43s p99=8.7s E2 is 12.4× worse on failure rate but 20× better on TTFT p50 among the requests that did complete. Both runs leave D2 entirely unused for the same structural reason: Inferact's shared "permissions instructions" boilerplate makes overlap dominate the kv-aware lex score, and v2's migration mechanism only fires on capacity rejects which never reach D2. The 1054 E2 timeouts are downstream of that imbalance, not a v2 bug per se. The doc closes with five concrete follow-ups for the next agent — cold-D bonus, router-mode admission, default-policy control arm, TCP-loopback comparison, failure mode forensics. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-12 03:23:33 +08:00
tim	e3e5c45ed4	docs(experiments): E2 mid-run finding — D2 stays cold in KVC v2 too Same pathological imbalance E1 showed reproduces in E2: D2 has zero bindings at 33% POSTs in. Root cause is structural, not a KVC v2 bug: all 50 Inferact sessions begin with identical "permissions instructions" boilerplate, so the converter assigns them identical first-block hash_ids. kv-aware policy's overlap term (lex-score position 0) makes any already-resident D dominate a fresh D unconditionally, and v2's migration only activates on admission rejects which never fire because D0/D1 KV pools have headroom. The H1 conclusion is qualified: KVC v2 helps per-request work (direct- to-D fast path) but does not rebalance D worker load on workloads with shared cross-session prefixes. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-12 02:08:00 +08:00
tim	631b2c8847	docs(experiments): E1 results — naive 1P3D + kv-aware confirms H1 baseline E1 finished 1h29min wall on the 50-session Inferact subset. Headline: 1200/1285 succeeded, latency p50=93s p99=219s, TTFT p50=89s p99=207s, 85 timeouts. Decode-2 was never bound to a single session — all 50 sessions stuck to decode-0/1 by kv-aware policy stickiness with no migration to rebalance, so effective topology was 1P2D, not 1P3D. This is exactly the failure mode H1 predicts naive pd-disaggregation should exhibit, giving E2 (full KVC v2 with migration) a concrete baseline to improve against. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-12 01:49:52 +08:00
tim	ad8aaa8c5a	feat(experiments): E2 sweep — KVC v2 + RDMA on the matched subset KVC v2 config from sweep_ts1_migration_v2.sh (reset-on-success + direct-append threshold 8192) layered on top of the RDMA-enabled mooncake stack, against the same outputs/inferact_50sess.jsonl subset that E1 uses. Pair-wise contrast tests H1 (KVC layer marginal contribution on top of 1P3D + kv-aware) and H2/H3 (RDMA reducing reseed slow-path tail). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-12 00:49:53 +08:00
tim	bb9cc249cd	feat(experiments): E1 sweep on 50-session deterministic subset scripts/sample_trace_subset.py — file-order head-cut that takes the first N sessions of a converted trace. No RNG, no hashing — same input yields byte-identical output (the included assertion compares md5 across two runs). scripts/sweep_e1_naive_1p3d.sh — E1 of ONBOARDING_NEXT_AGENT_ZH §3.1: mechanism=pd-disaggregation, policy=kv-aware, 1P3D, RDMA on (mlx5_60). Defaults to outputs/inferact_50sess.jsonl so E1 and E2 can share the exact same subset; override via TRACE= env var to run on the full 20,230-request trace. Reproducing the subset: uv run --no-sync python scripts/sample_trace_subset.py --input outputs/inferact_codex_swebenchpro.jsonl --output outputs/inferact_50sess.jsonl --sessions 50 # expected output_md5: 7bb263a32600ef5a6ef5099ba340a487 # 1285 requests, mean input_length 67631 tokens Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-12 00:21:36 +08:00
tim	b55371fe69	docs: H200 + driver 570 setup guide + 11 lessons learned Captures the full debugging journey of getting vendored SGLang 0.5.10 + mooncake RDMA running on a 4×H200 node with the older driver 570.86.15. Driver 570's actual API is cu12.8 — nvidia-smi's "CUDA Version: 13.0" header is a forward-compat ceiling, not the driver's own version — and that single misreading drove most of the detours. Lessons cover: pip vs vendor sglang divergence, why cu13 switching was a dead end (mooncake is cu12-only by wheel, driver 570 can't run cu13 anyway), why --disable-overlap-schedule alone isn't enough, why pip nvidia-cuda-nvcc-cu12 doesn't ship the nvcc binary, and how tvm_ffi's ninja-driven nvcc invocation makes CUDA_HOME the single hook point that fixes everything. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-12 00:10:14 +08:00
tim	d11a66d11b	feat(scripts): cu12.8 env wrapper + Inferact trace converter setup_env.sh: source-able shell snippet that points tvm_ffi (vendor sglang JIT compiler) at $HOME/cuda-12.8/bin/nvcc and exposes both libcudart.so.12 (for mooncake.engine, a cu12 wheel) and cu12.8 lib64 (for tvm_ffi compile-time linker) on LD_LIBRARY_PATH. Without this, JIT-compiled kernels NEEDED libcudart.so.13 and driver 570 rejected them at every JIT call. convert_inferact_to_trace.py: turns Inferact codex_swebenchpro_traces (ShareGPT {"from","value"} pairs) into the chat_id/parent_chat_id/ turn/hash_ids JSONL schema replay.py expects. Tokenizes with the model's own tokenizer, builds prefix-sharing 24-token block hashes, synthesizes timestamps. Output cross-checks 20,230 LLM calls — exactly matches the Inferact README count for 610 successful trials. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-12 00:10:06 +08:00
tim	a418aafeed	feat(stack): pin PD workers to --disable-overlap-schedule On a node with driver 570.86.15 (cu12.8 driver API ceiling), SGLang's overlap event loop hits cudaErrorInsufficientDriver inside event_loop_overlap_disagg_prefill → resolve_future_token_ids JIT kernel. Switching to the normal event loop sidesteps this specific codepath. The flag is harmless on newer drivers and remains a useful default until overlap is independently re-validated on this hardware. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-12 00:09:56 +08:00
tim	e874b1f055	feat(env): install vendored SGLang via uv path source Replace pip-resolved sglang==0.5.10 with an editable install from third_party/sglang/python. The vendored fork carries patches the pip release does not (admit_direct_append RPC types, _should_allow_local_ prefill_on_decode, maybe_trim_decode_session_cache, backpressure pause hint) — KVC routing depends on them, so the vendored copy must be the import target, not just on PYTHONPATH at runtime. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-12 00:09:50 +08:00
kzlin	7590e55189	docs: archive deprecated docs to docs/archive/, drop E1 from onboarding Two cleanups: 1. Drop "E1: naive 1P3D default" experiment from the onboarding manual. GPU hours are precious; naive 1P3D + policy=default has near-certain loss on multi-turn cache hit (it's round-robin without prefix awareness), so the comparison doesn't add information vs E1=naive 1P3D kv-aware. The new manifest has only 2 runs: E1 (naive 1P3D kv-aware) + E2 (KVC v2 + RDMA). Run-time budget drops from 16.5h serial to 11h serial / 5.5h parallel. Updated: - §0 TL;DR ("3 组" -> "2 组") - §2 H1 hypothesis (drop "default and kv-aware each one" -> just kv-aware) - §3.1 experiment matrix (3 rows -> 2 rows + rationale for the drop) - §3.2 startup config (drop E1 default section, renumber E2/E3 -> E1/E2) - §6 decision table + expected-range table - §7 FAQ ("3 个 E1-E3" -> "2 个 E1-E2") - §9 deliverables 2. Move 8 deprecated docs to docs/archive/: AGENTIC_FIT_ANALYSIS_ZH.md (ts=10 era analysis; superseded) STRUCTURAL_VALIDATION_REPORT_ZH.md (ts=10 era validation; superseded) KVC_DEBUG_JOURNEY_V1_TO_V5.md (v1-v5 sweep process notes) V5_PROFILE_INVESTIGATION_ZH.md (v5 1Hz polling investigation) REFACTOR_PLAN_ZH.md (v0 plan; superseded by V1) KVCACHE_CENTRIC_PROGRESS_ZH.md (earliest 2026-04-27 progress) SWEBENCH_EXPERIMENT_PROGRESS.md (early SWE trace setup) SWEBENCH_EXPERIMENT_RESULTS.md (early SWE result snapshot) All cross-references in active docs (V2_DEEP_ANALYSIS / V2_RESULTS / REFACTOR_PLAN_V1 / TEAM_REPORT / ONBOARDING) rewritten from `docs/FOO.md` to `docs/archive/FOO.md` via sed pass. Added `docs/archive/README.md` explaining what each archived doc is and when (if ever) to reopen it. Designed so a new reader hitting the archive dir immediately knows it's not required reading. After this commit the active docs in docs/ are 9 files (down from 17), which should make the onboarding doc's "Level 1 / Level 2 / Level 3" classification self-evident. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-11 22:40:35 +08:00
kzlin	5a2fb8799c	docs(kvc): onboarding manual for the next SWE agent A single self-contained reading manual designed to bring a fresh agent (LLM or human) to current-state proficiency in 30 min of reading + 30 min of environment validation, then have them run the next round of ablation experiments without re-litigating questions already settled. Structure: §0 TL;DR -- what you are inheriting in 5 lines §1 Reading order, tiered into Must-Read / On-Demand / Archive, with reasons for each §2 Current-state snapshot: trace/hardware/branches + claims verified + hypotheses pending §3 The three ablation experiments (E1/E2/E3) with full CLI flag specifications and environment-validation checklist §4 Known gotchas (8 of them) with symptoms and fixes -- the most important section to skim before you start §5 CLI cheatsheet: run experiments / read data / plot / git §6 Result-analysis checklist: numbers to collect, expected ranges §7 FAQ for likely stuck-points §8 Anti-patterns: what NOT to do §9 Two specific deliverables the main agent expects back Appendix A: file location lookup table Appendix B: commit lookup table (by intent) Goals encoded into the doc: - Frame "your job is ablation, not new development" -- the new agent should not be tempted to start D->P sync work; that goes on the feat/d-to-p-sync branch in a separate phase. - Make abort-accounting / max-input-len / mooncake-TCP-default pitfalls extremely visible up front so they don't get repeated. - Provide expected-result ranges so a 2x deviation is treated as a config check, not a "finding". - Make the critic-vs-production framing explicit so the new agent knows when an audit-style "MAJOR" is actually a design intent. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-11 22:31:08 +08:00
kzlin	506d360160	fix(figures): GPU utilization figure annotation/headroom polish Bar-overlap fix: extend ylim by 35-45% above the tallest bar to give the "P GPU only sees 328 requests" and "P GPU does 1.07M tokens" annotations clean white-bbox space above the bars instead of crashing into the KVC D bars at x=1. Move both annotation xytext positions to x=2.4 (left panel) and x=5.5 (right panel) so the arrows pull away from the orange P bar toward the center of the panel. Group labels (KVC 1P3D / DP 4-way CA) kept in axes-fraction bboxes at y=1.02; subplot titles raised to pad=24 to leave room. Note: a small visual collision between the bboxed group labels and the subplot-title second line remains in the rendered output (acknowledged in the prior conversation). Acceptable for now; full layout rework is deferred. The annotation-vs-bar overlap (the original blocker) is fixed. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-11 22:28:39 +08:00
kzlin	c01d6101d6	docs(kvc): freeze reseed slow-path audit + three reviewer challenges Standalone reference document capturing the v2 reseed slow-path forensic audit before opening the feat/d-to-p-sync branch. Designed to be quoted directly by future paper drafts and to prevent the team from relitigating the same questions verbally. Contents: §1. The three team-member challenges that disproved "capacity-backup will save the slow path" (each with code citation and verdict): 1) P pool can't fit all backups -- replay.py:1618-1620 caps backup count at 1 for sessions with ~50K peak input. 2) P's backup is a stale snapshot -- 49K of direct-to-D append work never flows through P. _commit_prefill_backup_residency (replay.py:1483) is only called from seed/reseed paths; direct-to-D path (replay.py:2719) never touches P-side state. 3) When D evicts, old KV is freed directly (no D->P dump). session_aware_cache.release_session only calls kv_pool_allocator.free(). §2. End-to-end reseed timeline (t=0 to t=4550ms) with code citations showing exactly where each component sits. P-side re-prefill = 1.5-3s, mooncake transfer = 1.5-4s, both contributing 50/50 to total reseed cost. §3. Table of "looks like D->P but isn't" code locations -- every candidate found during forensic search ruled out with line citations. §4. Specification of what D->P incremental sync would require: mooncake bidirectional roles (~400 LOC), D-side append commit hook (easy), P-side radix tree multi-producer extension (the real blocker), agentic-pd-hybrid replay.py hooks. Estimated 1-2 weeks engineering. §5. Confirmation via `git ls-remote origin --refs` that the author has NOT secretly implemented D->P on another branch -- only main + this working branch exist on the server. §6. Roadmap for the upcoming feat/d-to-p-sync branch. Appendices: code position crosswalk, related commits, paper section suggestions. This document is referenced by V2_DEEP_ANALYSIS_ZH §4.2 and by KVC_ROUTER_ALGORITHM §9 Open Question 4. 
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-11 22:20:34 +08:00
kzlin	9ccd853066	docs(kvc): correct reseed cost decomposition + flag D->P sync gap After an independent Opus-agent forensic audit, the previous "(c) 增量 fetch (工程量较大，未实现)" line in V2_DEEP_ANALYSIS §4.2 was understating the gap. The audit confirmed: - No D->P KV transfer code exists in the framework at any layer (agentic_pd_hybrid orchestration, vendored SGLang disaggregation, or mooncake transport). - Mooncake MooncakeKVManager has a hard role split: PREFILL = sender, DECODE = receiver-only loop. `add_transfer_request` asserts the disaggregation_mode is PREFILL. - The BaseKVSender / BaseKVReceiver abstraction has no bidirectional slot. - session_aware_cache.release_session only calls kv_pool_allocator.free() on eviction -- no serialization, no outbound network call. - _commit_prefill_backup_residency is only called from the seed/reseed path (_invoke_kvcache_seeded_router). direct-to-D path never updates P-side backup state. - "capacity-backup" policy semantics: it only skips the close on P after reseed -- the backup is the seed-time static snapshot, never refreshed by D-side append-prefill activity. V2_DEEP_ANALYSIS §4.2: - Decomposed the 3-7s reseed cost into the P-side re-prefill segment (1.5-3s, dominant) and the P->D mooncake transfer segment (1.5-4s). - Quantified the realistic effect of enabling RDMA: only the transfer segment shrinks, reseed reduces to 1.7-3.2s, TTFT p99 ~0.7s, still loses to DP's 0.43s. - Replaced the throwaway "(c) incremental fetch" line with a full paragraph explaining what D->P sync would require, why it's the largest engineering gap, and that the blocker is SGLang's radix-tree single-producer assumption, not the network layer. KVC_ROUTER_ALGORITHM §9: - Refined Open Question 3 (RDMA) to clarify it only helps the transfer segment, not the re-prefill segment. - Added Open Question 4: D->P incremental KV sync as the central future-work contribution gap, with cited evidence for why it doesn't currently exist. 
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-11 22:07:14 +08:00
kzlin	517677d7f2	docs(kvc): add GPU-utilization and cache-efficiency figures (rebut critic) Two figures inserted into V2_DEEP_ANALYSIS §4.5 and §4.4 respectively, to visually rebut the two critic-agent claims that we argued in prose were design intent, not deficiencies. (1) gpu_utilization.png -- §4.5 "P GPU is wasted 90% of the time" Two-panel side-by-side: Left (request count view, the naive reading): KVC P = 328 reqs (7.4%), KVC D = ~1450 each, DP = ~1100 each. P "looks idle." Right (compute work view, the honest reading): KVC P does 1.07M tokens of prefill, comparable to each KVC D worker's ~0.80M. P is a low-frequency high-cost safety net, not idle capacity. Bonus finding: KVC's total compute (3.47M tokens across 4 GPUs) is 33% LESS than DP's (5.17M). Same GPUs, less work done. That's the affinity win. (2) cache_efficiency.png -- §4.4 "Cache concentration is not policy win" Two-panel side-by-side. The setup: KVC has 27% LESS total KV pool (276K vs 351K tokens) yet caches MORE per request. Left (cache hit rate vs turn number): KVC's session-affinity lets hit rate accumulate with turns; DP's hash + radix-LRU causes a mid-turn drift around turns 8-25 where KVC = 97.0% vs DP = 95.8% (1.24pp gap). Shows mechanism, not just outcome. Right (ECDF of per-request uncached tokens, log x): KVC's distribution concentrates near zero (50% < 187 tokens), DP's is spread (50% < 781 tokens). At uncached = 500 tokens threshold, KVC has 74% of requests below, DP has 31%. → smaller pool, better retention, less per-request work. Direct empirical rebuttal to "fragmentation is architectural, not policy." Bundled scripts (rerunable): - scripts/analysis/plot_gpu_utilization.py - scripts/analysis/plot_cache_efficiency.py Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-11 18:04:49 +08:00
kzlin	c5519066de	docs(kvc): add TTFT probability density figure (KVC v2 vs 4DP) Adds a two-panel TTFT PDF comparison plot inserted as a new V2_DEEP_ANALYSIS §3.4 ("TTFT 概率密度对比: bimodal vs unimodal"). Single-percentile numbers (p50 / p99) hide the qualitative difference between the two distributions; the figure makes it visible at a glance. Left panel (linear x in [0, 0.6]s, body): KVC has a sharp peak at ~40ms (the direct-to-D fast path). DP has a broad peak around 50-200ms (full prefill per request). Annotated with p50 and p90 markers for each side. Right panel (log x in [10ms, 10s], full range): KVC is visibly bimodal: a tall fast-path peak plus a small reseed tail around 1-5s. DP is unimodal: a single broad peak with shorter tail. Annotated with p99 callouts pointing to each tail. KDE: scipy.stats.gaussian_kde, bandwidth=0.15 for the body (Scott's rule oversmooths the sharp fast-path peak), log10-transformed for the full-range panel so the bimodal structure is visible. Bundled: - scripts/analysis/plot_ttft_pdf.py -- rerunable when v2 / DP data change. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-11 17:46:27 +08:00
kzlin	b5af19583b	docs(kvc): replace v2 path breakdown tables with generated figures V2_DEEP_ANALYSIS §3.1 (execution_mode distribution) and §3.2 (path-level latency vs DP) had hand-typed tables with approximate latencies (e.g. "~1.0s") and required readers to mentally compare 5+ rows × 5 columns. Both sections now reference generated PNG figures derived directly from the v2 + DP metrics.jsonl files. §3.1 figure (v2_execution_mode_distribution.png): Horizontal bar chart, log x-axis. 4076 direct-to-D fast-path requests (green) dwarf the rest by ~30x; the long tail of slow / fallback / failure modes is visible at one glance. Counts and percentages annotated on each bar. §3.2 figure (v2_path_level_latency.png): Grouped bar chart, log y-axis. Per-path TTFT p50 / TTFT p99 / Lat p50 with exact numeric labels (no more "~1.0s" approximations). Sample counts annotated below each path. Quick visual reads: - KVC fast path TTFT p50 41ms vs DP 92ms (2.2x faster) - KVC reseed TTFT p99 5.12s vs DP 0.43s (12x slower) -- the cost - KVC no-d-capacity TTFT p99 7.65s (worst case) Bundled: - scripts/analysis/plot_v2_path_breakdown.py -- the script that generates both figures; rerunable when v2 data changes. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-11 17:38:43 +08:00
kzlin	37e9caa431	docs(kvc): production-decision reframe + formal router algorithm spec After the critic-agent audit, V2_DEEP_ANALYSIS had drifted into an audit-grade "5 wins / 1 loss / 3 draws" framing that mistook KVC's deliberate design motifs (cache concentration via session affinity; prefill-GPU idle as TTFT-stability trade-off) for "comparison unfairness." This commit corrects the framing back to a production-decision lens and adds a paper-track formal specification of the router algorithm. V2_DEEP_ANALYSIS_ZH.md changes: - §0 TL;DR: lead with "online coding agent serving should pick KVC 1P3D"; the only real cost is TTFT p99 long-tail (3x DP) from the 8.3% mooncake reseed path, mitigable with real RDMA. - §4 restructured into three buckets: real costs (TTFT p99 tail, abort accounting now fixed), counter-arguments to the critic (cache concentration and idle prefill GPU are design intent, not deficits), methodology to-do (naive-1P3D control, v2 N>=2 determinism). - §6 replaces "5/1/3 rescoring" with production decision rationale: KVC wins on 6 latency/TTFT metrics + lower failure rate; pays TTFT p99 tail; lists workloads where DP would reverse the call. - §8 decision points: D1 recommends Yes (accept v2 as milestone); D8 added: paper motif "KVC trades P idle for TTFT stability." 
KVC_ROUTER_ALGORITHM.md (new, paper-track, Chinese narrative + English algorithm boxes / variable names / theorems for direct paper reuse): - Problem formulation, system model, full notation - Algorithm 1 Route: lexicographic-tuple scoring on (overlap+alpha*sticky, sticky, -inflight, -assigned) - Algorithm 2 Admit: D-worker autonomous admission deciding Direct / Seed / Reseed / reject (with reason) - Algorithm 3 Dispatch: end-to-end orchestration with reset-on-success (the v2-specific fix that eliminates v1's self-amplifying thrashing) - Theorem 1 (no permanent starvation) and Theorem 2 (fast-path determinism), each with a proof sketch - Comparison table vs vanilla pd-disagg / DP cache-aware - Anti-patterns ("what KVC explicitly is NOT") - Open questions for reviewers - Suggested paper citation phrasing - Appendix A: algorithm-step to source-file:line crosswalk Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-11 17:29:18 +08:00
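The lexicographic-tuple scoring named in Algorithm 1 can be illustrated with a minimal sketch. It assumes higher tuples win and that `sticky` is a 0/1 flag; the `DecodeCandidate` class, its field names, and the default `alpha` are hypothetical and not taken from policies.py.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DecodeCandidate:
    # Hypothetical container; field names are illustrative only.
    name: str
    overlap: float   # prefix overlap with the incoming request
    sticky: int      # 1 if the session is already bound to this worker, else 0
    inflight: int    # requests currently in flight on this worker
    assigned: int    # sessions assigned to this worker


def route_key(c: DecodeCandidate, alpha: float) -> tuple:
    # Tuples compare element by element, so later fields only break ties.
    return (c.overlap + alpha * c.sticky, c.sticky, -c.inflight, -c.assigned)


def route(candidates: list, alpha: float = 1.0) -> DecodeCandidate:
    # Pick the candidate with the largest key (assumed direction).
    return max(candidates, key=lambda c: route_key(c, alpha))
```

With a large enough `alpha`, a sticky worker beats a non-sticky worker that has slightly more overlap; when the first two fields tie, the worker with fewer in-flight requests wins.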
kzlin	5eac9b4f6b	fix(metrics): exclude aborted requests from latency/ttft/tpot stats The old filter `if row.latency_s is not None` accepted SGLang's fast input-length-aborts (latency_s ~ 0.08s, finish_reason='abort/BadRequest') as if they were successful zero-cost requests. This deflated mean/p50 of any run where the model rejected oversized inputs. Impact on existing comparisons (ts=1 4-run validation + v2): KVC v2 has 40 aborts + 5 ReadTimeouts (was reported as just 5); DP 4w has 67 aborts (was reported as 5). Both runs have abort behavior; the asymmetry (40 vs 67) is purely from SGLang's mem-fraction-derived max-input-len: KVC decode-only worker gets ~10 GB free GPU mem -> max-input=92098, DP fused worker gets ~9 GB -> max-input=87811, because DP also needs chunked-prefill workspace. The KVC-vs-DP latency-win direction holds and widens slightly under the fixed filter (lat mean delta: -0.8% -> -1.4%); see V2_DEEP_ANALYSIS_ZH §4.3 for the recomputed table. Changes: - metrics.py: new _is_failed_request(row) helper; latency/ttft/tpot stats now exclude both errors and aborts. New summary fields abort_count and failure_count expose the counts directly. - scripts/analysis/recompute_summary.py: re-derives summary.json from existing metrics.jsonl using the fixed code, with optional --diff against the old buggy summary for inspection. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-11 17:29:18 +08:00
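A minimal sketch of the filter change described in this commit, assuming a row carries `latency_s`, `finish_reason`, and an `error` field. The `Row` class, the `error` field, and `summarize` are hypothetical stand-ins, not the actual metrics.py code.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Row:
    # Hypothetical row shape; the real metrics rows may differ.
    latency_s: Optional[float]
    finish_reason: Optional[str] = None
    error: Optional[str] = None


def is_abort(row: Row) -> bool:
    return (row.finish_reason or "").startswith("abort")


def is_failed_request(row: Row) -> bool:
    # Errors and aborts are both failures and must not enter latency stats.
    return row.error is not None or is_abort(row)


def summarize(rows: list) -> dict:
    ok = [r.latency_s for r in rows
          if not is_failed_request(r) and r.latency_s is not None]
    return {
        "latency_mean_s": sum(ok) / len(ok) if ok else None,
        "abort_count": sum(1 for r in rows if is_abort(r)),
        "failure_count": sum(1 for r in rows if is_failed_request(r)),
    }
```

Under the old `latency_s is not None` filter, a 0.08 s abort would be averaged in with successful requests and pull the mean down; here it is counted in `abort_count` instead.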
kzlin	0c25168cad	docs(kvc): v2 deep analysis vs TEAM_REPORT baseline Post-v2 audit consolidating ts=1 validation + v1 thrashing + v2 win, plus critic-agent adversarial review of the v2 vs 4DP comparison. Headline outcomes: - TEAM_REPORT §1 (session pin starvation) fully fixed by v2 migration + reset-on-success; direct-to-D 42.8% -> 91.6%. - TEAM_REPORT §2/§3/§5 (LRU, backpressure, admission RPC) are absorbed by ts=1 natural drain time, not mechanism-fixed -- will resurface under ts=10/longer traces/higher concurrency. - TEAM_REPORT §6 (ts=10 distortion) confirmed and locked as precondition; TEAM_REPORT §8 (N=1 unreliable) rewritten to "high-pressure N>=3, normal N=1". Three new problems exposed by adversarial review: - TTFT p99: KVC 1.285s vs DP 0.427s (KVC 3.0x worse) -- cherry-picked out of the V2_RESULTS_ZH.md headline table. Root cause: 8.3% non-direct path pays 3-7s mooncake reseed cost on 50-90K-token KV transfer. - Error accounting asymmetry: DP has 67 fast-aborts (not 5) at ~0.08s each counted in latency stats; KVC's 5 ReadTimeouts excluded entirely. Root cause: --max-input-len 87811 (DP) vs 92098 (KVC) + metrics.py:124 filter. - Topology mismatch: KVC 1P3D's prefill GPU is idle 91.7% of the time (only ~373/4449 requests use seed/P path); 4DP CA has all 4 GPUs at full utilization. Plus: no naive 1P3D control exists in the repo -- cannot isolate KVC-layer contribution from 1P3D-topology contribution. Re-scored headline: 5 KVC wins / 1 DP win / 3 draws -- still net positive but not the "7/8 wins" framing the V2_RESULTS_ZH.md claims. Recommended follow-ups (ROI order): 1. naive 1P3D ts=1 N=1 control (critic's only CRITICAL finding) 2. v2 N=2/N=3 to verify ts=1 determinism with new code paths 3. symmetric error accounting recompute + DP max-input-len = 92098 rerun Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-11 11:17:00 +08:00
kzlin	2ec0debef4	feat(kvc): session migration with reset-on-success + direct-append threshold tuning KVC v2 beats 4DP at ts=1 same-scale on 7/8 metrics: TTFT mean -24%, p50 -54%, p90 -64%; lat mean -0.8%, p50 -12.6%, p90 -0.7%. Direct-to-D rate jumped 42.8% -> 91.7%. REFACTOR_PLAN_V1 scenario C achieved. Two-knob fix: - reset-on-success blacklist decay: clear (sess, D) reject counter on successful direct-to-D path. Eliminates v1 thrashing where session 6880 was stable on decode-1 for 70 turns then collapsed to 75 D-changes after cumulative transient pressure tripped the permanent blacklist. - bump --kvcache-direct-max-uncached-tokens default 2048 -> 8192 via CLI flag. 41% of v1 fallbacks were 'real-large-append' (>2048 token append); raising the threshold lets these go through the direct-to-D fast path. Code: - policies.py: RoutingState.session_d_rejects counter + KvAwarePolicy migration_reject_threshold; degenerate fallback picks least-rejected D. - replay.py: record_admission_reject + reset-on-success in _run_request; _fallthrough_reason classifies turn-2+ fall-throughs as session-not-resident / real-large-append / etc, replacing misleading 'large-append' suffix (TEAM_REPORT §2.7). - cli.py + benchmark.py: --kvcache-migration-reject-threshold flag wiring. Docs: - REFACTOR_PLAN_V1_ZH.md: forward-looking plan after ts=1 validation. - MIGRATION_V1_FINDINGS_ZH.md: v1 thrashing root-cause analysis. - V2_RESULTS_ZH.md: v2 results, scenario C achievement, attribution. - TEAM_REPORT_AGENTIC_PD_HYBRID_ZH.md: comprehensive team report. Scripts: - sweep_ts1_kvc_n3_plus_dp.sh: ts=1 baseline (KVC 1P3D N=3 + 4DP CA). - sweep_ts1_migration_v1.sh / v2.sh: validation runs. - analyze_ts1_validation.py: 4-way comparison analyzer. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-09 01:18:13 +08:00
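The reset-on-success decay can be sketched as a per-(session, D) reject counter that is cleared on any successful direct-to-D request. This is an illustration under assumptions: the `SessionRejectTracker` class, its method names other than `record_admission_reject`, and the "first eligible worker" choice are hypothetical and do not reproduce `RoutingState` or `KvAwarePolicy`.

```python
from collections import defaultdict


class SessionRejectTracker:
    """Hypothetical stand-in for the session_d_rejects bookkeeping."""

    def __init__(self, reject_threshold: int = 3):
        self.reject_threshold = reject_threshold
        self._rejects = defaultdict(int)  # (session_id, worker) -> count

    def record_admission_reject(self, session_id: str, worker: str) -> None:
        self._rejects[(session_id, worker)] += 1

    def record_success(self, session_id: str, worker: str) -> None:
        # Reset-on-success: transient pressure never accumulates into
        # a permanent blacklist entry.
        self._rejects.pop((session_id, worker), None)

    def reject_count(self, session_id: str, worker: str) -> int:
        return self._rejects.get((session_id, worker), 0)

    def pick_worker(self, session_id: str, workers: list) -> str:
        eligible = [w for w in workers
                    if self.reject_count(session_id, w) < self.reject_threshold]
        if eligible:
            return eligible[0]
        # Degenerate fallback: every D is over threshold, take the least-rejected.
        return min(workers, key=lambda w: self.reject_count(session_id, w))
```

Without `record_success`, a session that stayed on one worker for many turns could still be pushed off it by rejects accumulated far apart in time, which is the v1 thrashing pattern the commit describes.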
kzlin	1d51704dad	docs(kvc): agentic-fit analysis, refactor plan, validation report Three new docs covering the structural-fit investigation: - AGENTIC_FIT_ANALYSIS_ZH.md: §1-§7 of structural design issues that surface KVC vs vanilla DP gap on real agentic workloads (SWE 50sess). Quantifies session pinning, LRU shortfall, P-side imbalance, time-scale distortion, etc., with code citations and N=3 rerun data. - REFACTOR_PLAN_ZH.md: KISS-edition refactor plan. After verifying the original "estimate inflation" and "resident_blocks aging" claims were not real bugs, scope shrinks to one code change (backpressure) plus a 4-run smoke sweep within an 8h budget. - STRUCTURAL_VALIDATION_REPORT_ZH.md: validates §1-§7 claims using existing v5 baseline rerun data + 8DP CA baseline. Each claim labeled fully-supported / indirect / retracted with the data source. Notes that backpressure E2E validation is pending GPU smoke run. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-06 21:30:11 +08:00
kzlin	7affb565b2	feat(kvc): add backpressure smoke sweep + analyzer (and v6 p1 profile script) scripts/sweep_backpressure_smoke.sh: 4-run smoke matrix (KVC baseline / KVC + backpressure / KVC + backpressure @ time-scale=1 / DP @ time-scale=1) designed to fit ~3-4h GPU budget. Validates §3 backpressure implementation and partially probes §7 time-scale distortion. scripts/analysis/analyze_backpressure_smoke.py: consumes the new structural/* jsonl files plus request-metrics; emits headline metrics, backpressure histograms, admission probe stats, and per-session pinning distribution. scripts/sweep_tp1_v6_p1_profile.sh: pre-existing v6 P1 profile sweep script (was untracked; included for completeness). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-06 21:29:56 +08:00
kzlin	c47adaf8e3	feat(kvc): honor admission backpressure hints + structural event logging Replay-side changes paired with the SGLang admission hint: - DecodeResidencyState gains pause_until_s; admission probe parses recommended_pause_ms and updates the per-D pause window. - _wait_for_decode_pause is invoked at request entry points (_invoke_router, _invoke_session_direct) so requests stall before hitting a saturated D instead of timing out via mooncake. - New CLI flags: --enable-backpressure (default off, baseline preserved), --backpressure-max-pause-s (cap on per-request sleep, default 2s). Structural instrumentation written under <run_dir>/structural/: - admission-events.jsonl: every admission probe (RTT, queue_depth, pause_ms, available_tokens, evicted_count) - backpressure-events.jsonl: every actual pause sleep - session-d-binding.jsonl: per-request policy decision Used to validate the structural claims documented separately. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-06 21:29:46 +08:00
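A minimal sketch of how a replay-side pause window might be maintained from `recommended_pause_ms` and capped by `--backpressure-max-pause-s`. The function names and the "keep the later deadline" rule are assumptions for illustration, not the actual `_wait_for_decode_pause` logic.

```python
def update_pause_until(now_s: float, pause_until_s: float,
                       recommended_pause_ms: int) -> float:
    # Assumed rule: a new hint can only extend the per-D pause window.
    if recommended_pause_ms <= 0:
        return pause_until_s
    return max(pause_until_s, now_s + recommended_pause_ms / 1000.0)


def sleep_duration_s(now_s: float, pause_until_s: float,
                     max_pause_s: float = 2.0) -> float:
    # Never sleep a negative amount, and never longer than the per-request cap.
    return min(max(0.0, pause_until_s - now_s), max_pause_s)
```

A caller would compute `sleep_duration_s(...)` at request entry and sleep for that long before contacting the decode worker.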
kzlin	ca4b64c79a	feat(sglang): expose backpressure pause hint in admit_direct_append Add `recommended_pause_ms` field to DirectAppendAdmissionReqOutput so D can advise callers when its transfer queue is heavy or KV pool is near capacity. The hint is computed from transfer_queue_depth, retracted_queue_depth, and post-trim token_usage; thresholds are simple heuristics (>0.90 usage, >=8 queue depth, retracted>0). Default behavior is unchanged for callers that ignore the field. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-06 21:29:30 +08:00
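The three thresholds listed above can be sketched as a single predicate. The commit does not state how the pause length itself is chosen, so the fixed `base_pause_ms` below is a hypothetical placeholder, as is the function name.

```python
def recommended_pause_ms(token_usage: float,
                         transfer_queue_depth: int,
                         retracted_queue_depth: int,
                         base_pause_ms: int = 100) -> int:
    # Thresholds from the commit message: >0.90 usage, >=8 queue depth,
    # any retracted request. base_pause_ms is an assumed constant.
    under_pressure = (
        token_usage > 0.90
        or transfer_queue_depth >= 8
        or retracted_queue_depth > 0
    )
    return base_pause_ms if under_pressure else 0
```

Returning 0 when no threshold trips matches the stated property that callers ignoring the field see unchanged behavior.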
kzlin	4978c0d0cd	profile(kvc): rewrite v5+profile report after critic audit + P0/P1 instrument Hostile audit of the original report flagged three load-bearing errors: 1. held_tokens semantic was inverted. session_held_tokens() at session_aware_cache.py:278-282 sums (kv_allocated_len - cache_protected_len) per slot, i.e. slot-private (NOT in radix tree). So "other = cap - held - avail" actually CONTAINS the radix-tree protected prefix cache (likely the single biggest component for shared agentic prefixes), not just running batch + in-flight as the original report claimed. 2. Admission-race causal hypothesis for the 415 EXP2+profile errors is contradicted by the data: 414/415 errors have kv_transfer_blocks > 0 — they passed admission and died downstream ("generate stream ended before producing any token", raised by the client when a 200 response had an empty stream). 3. Polling deconfound was too quickly dismissed. Mode counts shift ~1:1 (session-cap-fb -356 / kvcache-centric +406), and /server_info is not a passive read — it dispatches into the scheduler main loop and iterates every session slot. Plus: per-D error% confounded by sticky session affinity (only 18 unique sessions cause 415 errors, decode-3 had 0 errors only because no high-error session landed there); decile 10 "recovery" was an equal-time binning artifact (24.5% under equal-count); v5 vs v5+profile time gap was 21h not 6h; p50/p90 latency comparison is N=1. Rewritten report (docs/V5_PROFILE_INVESTIGATION_ZH.md) marks each correction with ⚠️ and demotes admission-race to one of four hypotheses (H1-H4). Action items split into P0 (verify, must do first) and P1 (instrument): P0 — scripts/sweep_tp1_v5_baseline_rerun_exp2.sh runs 3x v5 baseline EXP2 (no polling, identical config to the original v5 run) to test whether the 9-error baseline result is reproducible. If 3 runs give ~9 errors and profile gives 415, polling is the leading suspect. Currently running in background. 
P1 — scheduler.py:_compute_pool_breakdown_for_diagnostics adds a read-only "pool_breakdown" dict to /server_info covering: radix_evictable_tokens, radix_protected_tokens, slot_private_held_tokens, session_slot_count, running_batch_{reqs,kv_tokens}, transfer_queue_{reqs,tokens}, prealloc_queue_{reqs,tokens}, retracted_queue_{reqs,tokens}. With these, "unaccounted = cap - sum(known)" exposes true leakage. replay.py captures all fields into the per-tick row; analyzer prints the decomposition and gracefully handles old timeseries (prints "P1 instrument absent"). Mock-tested end-to-end. SGLang patch is read-only and does not affect admission/scheduling. Old v5+profile data still analyzes correctly. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-29 22:29:21 +08:00
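The "unaccounted = cap - sum(known)" decomposition can be sketched as follows. It assumes the token buckets are disjoint and that each per-tick row also carries capacity and available-token counts; the keys `capacity_tokens` and `available_tokens` and the function name are hypothetical, while the `pool_breakdown` field names follow the commit message.

```python
from typing import Optional

# Token-valued fields of the pool_breakdown dict, per the commit message.
TOKEN_FIELDS = (
    "radix_evictable_tokens",
    "radix_protected_tokens",
    "slot_private_held_tokens",
    "running_batch_kv_tokens",
    "transfer_queue_tokens",
    "prealloc_queue_tokens",
    "retracted_queue_tokens",
)


def unaccounted_tokens(row: dict) -> Optional[int]:
    breakdown = row.get("pool_breakdown")
    if breakdown is None:
        # Old timeseries without the P1 instrument: nothing to decompose.
        return None
    known = sum(breakdown.get(field, 0) for field in TOKEN_FIELDS)
    # Assumed keys for total capacity and free tokens on the row.
    return row["capacity_tokens"] - row["available_tokens"] - known
```

A persistently positive result would point at tokens held somewhere none of the instrumented buckets cover; if the buckets overlap in practice, the result would be biased low.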
kzlin	51f5386691	profile(kvc): add D KV pool timeseries poller + analyzer for v6 root-cause v5 dropped errors but pushed session-cap fallback to 46-51%. Before adding v6 mitigations we need to attribute that capacity loss to one of: (a) active sessions — real footprint (b) idle-evictable sessions — LRU not aggressive enough (c) prefill backup blocks / in-flight / fragmentation — release timing Without this it's all guessing. Plumb a 1Hz poller into replay that hits each P/D worker's /server_info, captures session_cache + memory_usage, and writes a per-worker time-series JSONL to <run_dir>/d-pool-timeseries.jsonl. Off by default (--pool-poll-interval-s 0); v5+profile sweep enables it at 1.0s. Per-tick HTTP cost is ~8 parallel /server_info calls — negligible relative to the 50min run. Analyzer (scripts/analysis/analyze_pool_timeseries.py) decomposes each D's capacity into active_held / idle_evictable / other (= cap-held-avail, the backup-blocks bucket) / free, and reports session residency churn across workers as a starvation/thrashing signal. Mock-tested poller end-to-end (cancellation clean, file flushed, sessions captured); analyzer validated against synthetic timeseries. Next: run scripts/sweep_tp1_v5_optD_profile.sh on hardware (~90min), then analyze results to pick a v6 direction. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-29 20:04:21 +08:00
kzlin	6572d7f3f4	docs: add v5 chapter (Option D worker-mode admission) and rename to V1_TO_V5 v5 sweep (sweep_tp1_v5_optD.sh) lands the previously-deferred Option D: worker admission_mode authoritative for direct_append + seed + reseed, bypassing replay's local _decode_session_soft_cap. Key findings now documented: - errors collapse from 9-10% to 0.2% (mooncake timeouts gone) - session-cap fallback rises 33-35% -> 46-51% — D's true KV pool is the binding constraint, not replay's estimator; v4's "low fallback" was hiding capacity overruns as transfer-timeout errors - direct-to-D subset latency unchanged from v4 (admission overhead negligible) - new bottleneck: D's physical KV pool — points v6 at prefill backup release timing, priority eviction tuning, chunked seed, cross-D session migration, and real RDMA Also adds a 5th lesson on errors-vs-fallback reciprocity and updates the code index with the v5 endpoint extension and new CLI knobs. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-29 16:13:25 +08:00
kzlin	6e5ed8da80	feat(kvc): Option D - delegate seed/reseed admission to D worker v4 (cap=16) saw 35% session-cap fallback because the local soft_cap min(16, usable / target) evaluates to 1-2 for large agentic inputs. The cap was hit not because D was full but because replay's heuristic underestimated capacity. This change makes worker admission_mode authoritative for ALL paths: SGLang side: - io_struct.py: DirectAppendAdmissionReqInput gains a `mode` field ("direct_append" | "seed", default "direct_append" preserves prior behavior). - scheduler.py:admit_direct_append: when mode == "seed", skip the resident-on-D requirement and run the same capacity check + LRU eviction (maybe_trim_decode_session_cache) that direct_append uses. This lets D atomically decide if a new session can be admitted based on actual token_to_kv_pool_allocator state. Replay side (replay.py): - _query_decode_direct_admission gains a `mode` parameter. - _reserve_decode_session_capacity: in worker admission_mode, the seed/reseed branch now queries D with mode="seed" and trusts the result, instead of estimating capacity from the residency snapshot. - _should_admit_new_decode_session: in worker mode, skip the local soft_cap pre-check and let D decide. Same-D session fast-path is preserved. Effects: - Local hardcoded cap of 16 is bypassed under worker mode; D's real KV pool size is the only constraint. - LRU eviction runs in D's process atomically with admission, so starvation (the v3 bimodal "lucky vs starved sessions" pattern) should resolve. scripts/sweep_tp1_v5_optD.sh added to run the same 1P7D / 2P6D configs as v4 with the new admission path. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-28 23:40:03 +08:00
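A worked sketch of why the local soft cap collapses for large inputs. The formula min(16, usable / target) is from the commit message; integer division, the function name, and the token counts in the usage note are assumptions chosen for illustration.

```python
def local_soft_cap(usable_tokens: int, target_tokens: int,
                   hard_cap: int = 16) -> int:
    # Sessions the replay-side heuristic would allow on one D worker.
    # Integer division is assumed; max() guards against a zero target.
    return min(hard_cap, usable_tokens // max(target_tokens, 1))
```

With an illustrative 90,000 usable tokens, a 50,000-token per-session target yields a cap of 1, while a 2,000-token target would be limited only by the hard cap of 16. The heuristic therefore binds long before the nominal cap for large agentic inputs.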
kzlin	74194e660a	docs: v4 final results, error analysis, and updated journey Add v4 sweep results and post-mortem analysis showing: - direct-to-D path: 54.3% (1P7D) / 58.0% (2P6D) of requests now use KVC cleanly. P50=0.5s and TTFT P50=0.043s; this path beats baseline 8DP across the board (P50 -24%, TTFT P50 -54%, TTFT P90 -79%). - Overall vs baseline (errors+truncated excluded): v4 2P6D P50=0.85s vs baseline 0.66s (28% slower). Reason is not errors -- 35% of requests still hit fallback-large-append-session-cap, where capacity-based cap = usable_tokens / target_tokens evaluates to 1-2 (not 16) for large agentic inputs. - 9-10% errors on KVC variants are mooncake TCP transfer timeouts, not SGLang logic bugs. Prefill log shows "Failed to send kv chunk ... 32s timeout ... session not alive". Errors concentrate in turn>=31 (large inputs) after run >44.8%. Track: - docs/KVC_DEBUG_JOURNEY_V1_TO_V4.md: append v4 results table, per-mode breakdown, and error root cause. - scripts/analysis/{analyze_v3,analyze_v4,analyze_errors,compare_no_error}.py - outputs/qwen3-30b-tp1-v{3,4}/exp_summary.json (force-added, small JSON; metrics.jsonl excluded due to size). - outputs/qwen3-30b-tp1-v{3,4}*/sweep_results.txt Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-28 23:34:01 +08:00


62 Commits