agentic-pd-hybrid

Author	SHA1	Message	Date
tim	bf4da281c0	docs(experiments): mooncake "is not alive" deep-dives to LRU starvation The Q1 mystery resolves: P-side mooncake C++ logs show "Sync batch data transfer timeout after 37452515723ns" (37.45 s) at 01:56:42 — this is mooncake's batch_transfer_sync giving up after its internal timeout. The hair-trigger >=1 in conn.py:1270 is correct in the idle case (a 30-s RDMA stall genuinely means the peer is broken), but it fires here because of D-side congestion: decode-0.log shows two consecutive LRU evictions ("Trimmed decode session cache via LRU. evicted_sessions: 2, freed_tokens: 77675") firing at the exact same wall second the timeout triggers. The D scheduler thread is busy with multi-session GPU memory frees + session-aware-cache bookkeeping under lock; the mooncake C++ control plane on the receive side gets starved for >30 s; P times out and marks the whole D's mooncake_session_id failed. Two-layer fix listed in §5c: root-cause = spread load to D2 (cold-D bonus, next commit); defense-in-depth = windowed threshold + retry in vendored mooncake conn.py. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-12 11:14:00 +08:00
tim	7f2ebf3d87	docs(experiments): forensic on Q1 (mooncake death) and Q2 (no D2 migration) Q1: Mooncake "is not alive" is hair-trigger: a single send_kvcache_slice ret != 0 in third_party/sglang/python/sglang/srt/disaggregation/mooncake/conn.py:1270 permanently adds the D's mooncake_session_id to failed_sessions and blacklists it for the rest of the process lifetime. The D worker process is alive (D1 keeps serving admit_direct_append OK seconds after), but every subsequent P→D transfer for that session short-circuits at conn.py:1184. The "Failures should never happen if the session is not dead" comment encodes the wrong assumption for the saturation regime we hit. Q2: KVC v2's migration mechanism IS sound but its trigger is gated by replay.py:1379 _ADMISSION_REJECTION_SUBSTRINGS = ("session-cap", "no-d-capacity", "d-backpressure"). All 1054 failures have execution_mode="kvcache-centric" (generic fallback bucket) which contains none of those substrings, so session_d_rejects is never incremented. Empirically 46 of 49 (sess, D) pairs that the worker RPC rejected would have qualified for blacklist (most-rejected pair: 25 rejects), but policy never saw them. Result: D0 reject → next-bind D0 (253×), D1 reject → next-bind D1 (329×), D0/D1 reject → next-bind D2 (0×). Fix paths documented for both, shortest path is widening the substring filter to include the failure-fallback bucket, but the right fix is to call record_admission_reject directly from the actual rejection signal site instead of string-matching execution_mode. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-12 10:45:18 +08:00
tim	ef4dc81ea9	docs(experiments): forensic explanation for E2 80% failure rate Pulling admission-events.jsonl, prefill-0.log, and request-metrics sampling shows the 1054 failures are NOT timeouts as initially assumed. They are a 3-layer cascade: L1: 562 "no-space" + 43 "session-not-resident" worker admission rejects (51% of all admit attempts) because D0/D1 KV pools saturate while D2 stays empty. L2: rejects re-route to seed/reseed which need mooncake P→D KV transfer; the backlog drops mooncake heartbeats and prefill-0 logs "Decode instance could be dead, remote mooncake session ... is not alive". L3: SGLang aborts the request, SSE stream closes with 0 tokens, agentic-pd-hybrid raises "generate stream ended before producing any token" (the literal error string for all 1054). E1 didn't hit this because pd-disaggregation has no admission RPC — sessions just queue behind the running batch, paying TTFT instead of failing. KVC v2's worker admission is supposed to be a safety valve; on the cold-D pathology it becomes a failure amplifier. The real fix is upstream D rebalancing (cold-D bonus or pre-warm), not relaxing admission. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-12 10:38:49 +08:00
tim	3db2d84df8	docs(experiments): E2 complete — qualified H1 with a surprise E2 finished 1h33min wall. Headline contrast on the matched Inferact 50-session subset: E1 (naive 1P3D + kv-aware + RDMA): 1200/1285 succ, lat p50=93s p99=219s, TTFT p50=89s p99=207s E2 (KVC v2 + RDMA): 231/1285 succ, lat p50=7.4s p99=65s, TTFT p50=0.43s p99=8.7s E2 is 12.4× worse on failure count (1054 vs 85) but roughly 200× better on TTFT p50 (0.43s vs 89s) among the requests that did complete. Both runs leave D2 entirely unused for the same structural reason: Inferact's shared "permissions instructions" boilerplate makes overlap dominate the kv-aware lex score, and v2's migration mechanism only fires on capacity rejects which never reach D2. The 1054 E2 timeouts are downstream of that imbalance, not a v2 bug per se. The doc closes with five concrete follow-ups for the next agent: cold-D bonus, router-mode admission, default-policy control arm, TCP-loopback comparison, failure mode forensics. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-12 03:23:33 +08:00
tim	e3e5c45ed4	docs(experiments): E2 mid-run finding — D2 stays cold in KVC v2 too Same pathological imbalance E1 showed reproduces in E2: D2 has zero bindings at 33% POSTs in. Root cause is structural, not a KVC v2 bug: all 50 Inferact sessions begin with identical "permissions instructions" boilerplate, so the converter assigns them identical first-block hash_ids. kv-aware policy's overlap term (lex-score position 0) makes any already-resident D dominate a fresh D unconditionally, and v2's migration only activates on admission rejects which never fire because D0/D1 KV pools have headroom. The H1 conclusion is qualified: KVC v2 helps per-request work (direct-to-D fast path) but does not rebalance D worker load on workloads with shared cross-session prefixes. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-12 02:08:00 +08:00
tim	631b2c8847	docs(experiments): E1 results — naive 1P3D + kv-aware confirms H1 baseline E1 finished 1h29min wall on the 50-session Inferact subset. Headline: 1200/1285 succeeded, latency p50=93s p99=219s, TTFT p50=89s p99=207s, 85 timeouts. Decode-2 was never bound to a single session — all 50 sessions stuck to decode-0/1 by kv-aware policy stickiness with no migration to rebalance, so effective topology was 1P2D, not 1P3D. This is exactly the failure mode H1 predicts naive pd-disaggregation should exhibit, giving E2 (full KVC v2 with migration) a concrete baseline to improve against. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-12 01:49:52 +08:00
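
The Q2 finding in 7f2ebf3d87 (rejects never reach the migration policy because the trigger string-matches execution_mode) can be illustrated with a minimal sketch. The tuple _ADMISSION_REJECTION_SUBSTRINGS and the names session_d_rejects and record_admission_reject come from the commit message; every function body and signature below is an assumption made for illustration, not the code in replay.py.

```python
from collections import Counter

# Copied from the commit message (replay.py:1379).
_ADMISSION_REJECTION_SUBSTRINGS = ("session-cap", "no-d-capacity", "d-backpressure")

# Per-(session, D worker) reject counter; the Counter type is an assumption.
session_d_rejects = Counter()


def record_admission_reject(session_id: str, d_worker: str) -> None:
    """Hypothetical signature: count one reject for a (session, D) pair."""
    session_d_rejects[(session_id, d_worker)] += 1


def on_failure_string_match(session_id: str, d_worker: str, execution_mode: str) -> None:
    """String-matching trigger as described in the commit (assumed shape)."""
    if any(s in execution_mode for s in _ADMISSION_REJECTION_SUBSTRINGS):
        record_admission_reject(session_id, d_worker)


def on_worker_reject(session_id: str, d_worker: str, reason: str) -> None:
    """Suggested alternative: record at the rejection signal site itself."""
    record_admission_reject(session_id, d_worker)


# The fallback bucket contains none of the substrings, so nothing is counted.
on_failure_string_match("sess-1", "D0", "kvcache-centric")
missed = session_d_rejects[("sess-1", "D0")]

# Recording directly from the worker reject does increment the counter.
on_worker_reject("sess-1", "D0", "no-space")
seen = session_d_rejects[("sess-1", "D0")]
```

In this sketch `missed` stays at 0 while `seen` becomes 1, which is the gap the commit describes: the policy only learns about a reject when it is recorded where the reject actually happens.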

6 Commits
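
The cold-D2 behaviour reported in e3e5c45ed4 and 3db2d84df8 follows from lexicographic scoring with prefix overlap in position 0. The sketch below is a toy model under stated assumptions: the Candidate type, the free_tokens tie-breaker, and all numbers are hypothetical, and only the "overlap first" ordering is taken from the commit messages.

```python
from typing import NamedTuple


class Candidate(NamedTuple):
    name: str
    overlap_blocks: int  # prefix blocks already resident on this D worker
    free_tokens: int     # hypothetical secondary term (KV pool headroom)


def lex_score(c: Candidate) -> tuple:
    # Overlap sits in position 0, so it is compared before anything else.
    return (c.overlap_blocks, c.free_tokens)


def pick(candidates: list) -> Candidate:
    # Python compares tuples lexicographically, mirroring a lex-score policy.
    return max(candidates, key=lex_score)


# Every session shares the same first block, so any D that already hosts a
# session reports overlap >= 1, while a fresh D reports 0.
candidates = [
    Candidate("D0", overlap_blocks=1, free_tokens=5_000),
    Candidate("D1", overlap_blocks=1, free_tokens=20_000),
    Candidate("D2", overlap_blocks=0, free_tokens=400_000),
]

chosen = pick(candidates)
```

Here `chosen` is D1: the secondary term only breaks ties between D0 and D1, and D2's much larger headroom is never consulted because it loses on position 0. Any rebalancing term would have to influence the comparison before, or together with, the overlap term.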