docs(experiments): forensic on Q1 (mooncake death) and Q2 (no D2 migration)

Q1: Mooncake "is not alive" is hair-trigger: a single send_kvcache_slice ret != 0 in third_party/sglang/python/sglang/srt/disaggregation/mooncake/conn.py:1270 permanently adds the D's mooncake_session_id to failed_sessions and blacklists it for the rest of the process lifetime. The D worker process is alive (D1 keeps serving admit_direct_append OK seconds after), but every subsequent P→D transfer for that session short-circuits at conn.py:1184. The "Failures should never happen if the session is not dead" comment encodes the wrong assumption for the saturation regime we hit.

Q2: KVC v2's migration mechanism IS sound but its trigger is gated by replay.py:1379 _ADMISSION_REJECTION_SUBSTRINGS = ("session-cap", "no-d-capacity", "d-backpressure"). All 1054 failures have execution_mode="kvcache-centric" (generic fallback bucket) which contains none of those substrings, so session_d_rejects is never incremented. Empirically 46 of 49 (sess, D) pairs that the worker RPC rejected would have qualified for blacklist (most-rejected pair: 25 rejects), but policy never saw them. Result: D0 reject → next-bind D0 (253×), D1 reject → next-bind D1 (329×), D0/D1 reject → next-bind D2 (0×).

Fix paths documented for both, shortest path is widening the substring filter to include the failure-fallback bucket, but the right fix is to call record_admission_reject directly from the actual rejection signal site instead of string-matching execution_mode.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-12 10:45:18 +08:00
parent ef4dc81ea9
commit 7f2ebf3d87
1 changed files with 100 additions and 0 deletions
--- a/docs/E1_E2_RESULTS_ZH.md
+++ b/docs/E1_E2_RESULTS_ZH.md
@@ -213,6 +213,106 @@ Worker admission per se is not the bug — the bug is that there is no D-rebalan

 ---

+## 5c. Why mooncake "died" (forensic on Q1)
+
+The error string is `Decode instance could be dead, remote mooncake session ... is not alive`, which sounds like the D worker process crashed. **It did not.** Concurrent evidence shows D1 was happily serving `/session_cache/admit_direct_append HTTP/1.1 200 OK` and running LRU evictions only seconds after the "is not alive" errors fired. The real mechanism is hair-trigger.
+
+### What the SGLang mooncake conn.py actually does
+
+In `third_party/sglang/python/sglang/srt/disaggregation/mooncake/conn.py:1267-1276`:
+
+```python
+if ret != 0:                                    # one transfer slice failed
+    with self.session_lock:
+        self.session_failures[req.mooncake_session_id] += 1
+        # Failures should never happen if the session is not dead,
+        # if the session fails once, mark it as failed
+        if self.session_failures[req.mooncake_session_id] >= 1:
+            self.failed_sessions.add(req.mooncake_session_id)
+            logger.error(f"Session {req.mooncake_session_id} failed.")
+    ...
+```
+
+After this, every subsequent transfer that uses the same `mooncake_session_id` short-circuits at conn.py:1184:
+
+```python
+if req.mooncake_session_id in self.failed_sessions:
+    self.record_failure(kv_chunk.room,
+        f"Decode instance could be dead, remote mooncake session ... is not alive")
+```
+
+**One real `send_kvcache_slice ret != 0` permanently blacklists that D's mooncake session for the rest of the SGLang process lifetime.** The code's own comment ("Failures should never happen if the session is not dead") encodes the design assumption that transfers don't fail under normal conditions — but they do under the saturation regime described in §5b (RDMA queue full / D scheduler too busy to drain receives in time).
+
+### Connecting back to Q1 timeline
+
+In decode-1.log around 01:56:42-56, the worker is running heavy decode batches (#token = 627K, near the KV pool cap of 755K) while repeatedly evicting via LRU. Under that load, a single `send_kvcache_slice` call returning a transient nonzero code is enough to flip the switch. After 01:56:42, essentially every P→D1 transfer reports "is not alive" until end-of-run, even though D1 itself keeps serving direct-append admissions.
+
+### The fix would be in vendored SGLang
+
+The hair-trigger threshold (`>= 1`) is wrong for our regime. Options:
+1. Raise the threshold to N transient failures within a short window before declaring the session dead.
+2. Make the "failed" mark expire (e.g. retry the session after a backoff).
+3. Pair the hair-trigger with the existing heartbeat checker (conn.py:1497) — only blacklist if both a transfer failed AND the periodic heartbeat HTTP probe to the bootstrap address reports ≥ N failures.
+
+None of these are quick fixes; they require touching `MooncakeKVManager.start_prefill_thread` and the failed-session lifecycle.
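
As a rough illustration of options 1 and 2 above, the sketch below counts failures in a sliding window and lets the "failed" mark expire after a backoff. `SessionFailureTracker`, its parameters, and the default values are hypothetical and not part of SGLang; a real change would also need to hold `session_lock`, which is omitted here.

```python
import time
from collections import defaultdict, deque


class SessionFailureTracker:
    """Hypothetical helper: windowed failure threshold plus expiring blacklist."""

    def __init__(self, max_failures=3, window_s=5.0, backoff_s=30.0,
                 clock=time.monotonic):
        self.max_failures = max_failures
        self.window_s = window_s
        self.backoff_s = backoff_s
        self.clock = clock                    # injectable so tests can fake time
        self._failures = defaultdict(deque)   # session id -> recent failure times
        self._failed_until = {}               # session id -> time the mark expires

    def record_failure(self, session_id):
        now = self.clock()
        recent = self._failures[session_id]
        recent.append(now)
        # Drop failures that have fallen out of the sliding window.
        while recent and now - recent[0] > self.window_s:
            recent.popleft()
        if len(recent) >= self.max_failures:
            # Enough failures close together: mark the session failed for a while.
            self._failed_until[session_id] = now + self.backoff_s
            recent.clear()

    def is_failed(self, session_id):
        expiry = self._failed_until.get(session_id)
        if expiry is None:
            return False
        if self.clock() >= expiry:
            # Backoff elapsed: let the next transfer retry the session.
            del self._failed_until[session_id]
            return False
        return True
```

With these assumed defaults, one transient nonzero return no longer blacklists the session, and a session that was marked failed becomes eligible again once the backoff has elapsed.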
+
+---
+
+## 5d. Why no session ever migrated to D2 (forensic on Q2)
+
+KVC v2's design (KVC_ROUTER_ALGORITHM §3.3) increments `state.session_d_rejects[(session_id, D)]` after each rejection; `policy.select` then skips any D whose count has reached `migration_reject_threshold` (= 3). The mechanism is conceptually sound. The bug is in *which* failures count as rejections.
+
+### The substring filter is too narrow
+
+In `replay.py:1379`:
+
+```python
+_ADMISSION_REJECTION_SUBSTRINGS = (
+    "session-cap",
+    "no-d-capacity",
+    "d-backpressure",
+)
+
+def _is_admission_rejection_mode(execution_mode: str) -> bool:
+    return any(token in execution_mode for token in _ADMISSION_REJECTION_SUBSTRINGS)
+```
+
+Only execution_modes containing one of those three substrings increment the per-(session, D) reject counter. **All 1054 E2 failures have `execution_mode = "kvcache-centric"`** (the generic fallback bucket the replay engine uses when the request fell through every concrete sub-path before producing a successful result). That string contains none of the three substrings, so `session_d_rejects` is never incremented for them.
+
+### Empirical confirmation
+
+Counting from `structural/admission-events.jsonl` (worker-RPC level, independent of replay's classification):
+
+| Stat | Value |
+|---|---:|
+| Distinct `(session, D)` pairs ever rejected by worker RPC | 49 |
+| Pairs rejected ≥ 3 times (would qualify for blacklist) | **46** |
+| Most-rejected single pair | (1001172, D1) = **25 rejects** |
+
+So 46 of 49 `(session, D)` pairs *should have been blacklisted* under KVC v2's design. They never were, because the corresponding requests' execution_mode was `"kvcache-centric"` (failure path) and not `"…-session-cap"` / `"…-no-d-capacity"` / `"…-d-backpressure"`. Those labels are only assigned when the fallthrough path runs to a known-rejection sub-result, not when the upstream SSE stream errors out.
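
A per-pair tally of this kind could be produced along the following lines. This is a minimal sketch: the JSONL field names (`session_id`, `d`, `can_admit`) are assumptions, not the actual schema of `structural/admission-events.jsonl`, and the sample input is toy data rather than run output.

```python
import io
import json
from collections import Counter


def count_rejected_pairs(lines, threshold=3):
    """Return (rejected pairs, pairs at or above threshold, most-rejected pair)."""
    rejects = Counter()
    for line in lines:
        line = line.strip()
        if not line:
            continue
        event = json.loads(line)
        # Field names are assumed for illustration only.
        if event.get("can_admit") is False:
            rejects[(event["session_id"], event["d"])] += 1
    qualifying = {pair for pair, n in rejects.items() if n >= threshold}
    top = rejects.most_common(1)[0] if rejects else None
    return len(rejects), len(qualifying), top


# Toy input, not real run data.
sample = io.StringIO(
    '{"session_id": 1, "d": "D0", "can_admit": false}\n'
    '{"session_id": 1, "d": "D0", "can_admit": false}\n'
    '{"session_id": 1, "d": "D0", "can_admit": false}\n'
    '{"session_id": 2, "d": "D1", "can_admit": false}\n'
    '{"session_id": 2, "d": "D1", "can_admit": true}\n'
)
print(count_rejected_pairs(sample))  # (2, 1, ((1, 'D0'), 3))
```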
+
+Counting "next-binding-after-reject" from the merged binding+admission timeline:
+
+| Rejected on | Next binding goes to | Count |
+|---|---|---:|
+| D0 | D0 | 253 |
+| D1 | D1 | 329 |
+| D0 | D2 | **0** |
+| D1 | D2 | **0** |
+
+The router stubbornly re-binds the same session to the same D after every reject — exactly because the reject was never recorded in `session_d_rejects`, so policy.select still sees an empty rejection counter and the overlap term keeps tipping it back to D0/D1.
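
The "next-binding-after-reject" count can be sketched as a single pass over a time-ordered event list. The event shape (`session_id`, `kind`, `d`) is an assumption made for this example, and the timeline below is toy data.

```python
from collections import Counter


def next_binding_after_reject(timeline):
    """Count (rejected_on, next_bound_to) transitions per session."""
    pending = {}            # session id -> D of the latest unmatched reject
    transitions = Counter()
    for event in timeline:  # events are assumed to be sorted by time
        sid, d = event["session_id"], event["d"]
        if event["kind"] == "reject":
            pending[sid] = d
        elif event["kind"] == "bind" and sid in pending:
            # First binding after a reject closes the transition.
            transitions[(pending.pop(sid), d)] += 1
    return transitions


# Toy timeline, not real run data.
timeline = [
    {"session_id": 1, "kind": "reject", "d": "D0"},
    {"session_id": 1, "kind": "bind", "d": "D0"},
    {"session_id": 2, "kind": "reject", "d": "D1"},
    {"session_id": 1, "kind": "reject", "d": "D0"},
    {"session_id": 2, "kind": "bind", "d": "D1"},
    {"session_id": 1, "kind": "bind", "d": "D0"},
]
print(dict(next_binding_after_reject(timeline)))
# {('D0', 'D0'): 2, ('D1', 'D1'): 1}
```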
+
+### The fix
+
+Two paths, in increasing scope:
+
+1. **Quick**: include `"kvcache-centric"` (the failure-fallback bucket) in `_ADMISSION_REJECTION_SUBSTRINGS`, OR have replay set `execution_mode` to a more specific failure label when an SSE stream closes with zero tokens (e.g. `"upstream-aborted"`) and add that to the substring set.
+2. **Better**: don't rely on string-matching at all. Have `_run_request` catch the actual rejection signal (admission RPC `can_admit=False` or upstream `RuntimeError: generate stream ended ...`) and call `state.record_admission_reject(...)` directly at that point. The substring filter was inherited from the v1 → v2 migration design (`MIGRATION_V1_FINDINGS_ZH §4.1`) when only specific fallback paths set those names.
+
+Either fix would let the existing `migration_reject_threshold=3` blacklist D0/D1 after enough failures, force a re-route to D2, populate D2's resident hashes, and break the overlap-pinning death spiral.
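
To make the intended effect concrete, here is a toy model of the "better" path, with the reject recorded directly rather than inferred from `execution_mode`. `RouterState`, `select`, the argument list of `record_admission_reject`, and the overlap scores are all hypothetical stand-ins, not the replay engine's real code. One caveat for the "quick" path: because the filter uses substring matching, adding `"kvcache-centric"` would also match any other mode name that happens to contain that string.

```python
from collections import defaultdict


class RouterState:
    """Hypothetical stand-in for the replay state object."""

    def __init__(self, migration_reject_threshold=3):
        self.migration_reject_threshold = migration_reject_threshold
        self.session_d_rejects = defaultdict(int)

    def record_admission_reject(self, session_id, d):
        # Called at the rejection signal site, no string matching involved.
        self.session_d_rejects[(session_id, d)] += 1

    def eligible(self, session_id, candidates):
        # Skip any D that has reached the reject threshold for this session.
        return [
            d for d in candidates
            if self.session_d_rejects.get((session_id, d), 0)
            < self.migration_reject_threshold
        ]


def select(state, session_id, overlap):
    """Toy policy: pick the eligible D with the highest overlap score."""
    candidates = state.eligible(session_id, list(overlap))
    return max(candidates, key=lambda d: overlap[d]) if candidates else None


state = RouterState()
overlap = {"D0": 0.9, "D1": 0.5, "D2": 0.0}  # made-up scores
print(select(state, "s", overlap))           # D0
for _ in range(3):
    state.record_admission_reject("s", "D0")
print(select(state, "s", overlap))           # D1
for _ in range(3):
    state.record_admission_reject("s", "D1")
print(select(state, "s", overlap))           # D2
```

In this toy model the overlap term still prefers D0 and D1, but once their counters reach the threshold the session is forced onto D2.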
+
+---
+
 ## 6. What this experiment actually shows

 1. **The H200 + driver 570 + cu12.8 toolchain works for production-scale SGLang xPyD workloads.** Both runs completed without CUDA / driver / mooncake errors; failures are policy- and workload-level, not infrastructure.