docs(experiments): mooncake "is not alive" deep-dives to LRU starvation

The Q1 mystery resolves: P-side mooncake C++ logs show "Sync batch data transfer timeout after 37452515723ns" (37.45 s) at 01:56:42 — this is mooncake's batch_transfer_sync giving up after its internal timeout. The hair-trigger >=1 in conn.py:1270 is correct in the idle case (a 30-s RDMA stall genuinely means the peer is broken), but it fires here because of D-side congestion: decode-0.log shows two consecutive LRU evictions ("Trimmed decode session cache via LRU. evicted_sessions: 2, freed_tokens: 77675") firing at the exact same wall second the timeout triggers. The D scheduler thread is busy with multi-session GPU memory frees + session-aware-cache bookkeeping under lock; the mooncake C++ control plane on the receive side gets starved for >30 s; P times out and marks the whole D's mooncake_session_id failed. Two-layer fix listed in §5c: root-cause = spread load to D2 (cold-D bonus, next commit); defense-in-depth = windowed threshold + retry in vendored mooncake conn.py. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-12 11:14:00 +08:00
parent 7f2ebf3d87
commit bf4da281c0
1 changed files with 66 additions and 6 deletions
--- a/docs/E1_E2_RESULTS_ZH.md
+++ b/docs/E1_E2_RESULTS_ZH.md
@@ -247,14 +247,74 @@ if req.mooncake_session_id in self.failed_sessions:

 Looking at decode-1.log around 01:56:42-56, the worker is running heavy decode batches (#token = 627K, near KV pool cap of 755K) plus repeatedly evicting via LRU. Under that load a single `send_kvcache_slice` returning a transient nonzero is enough to flip the switch. After 01:56:42 essentially every P→D1 transfer reports "is not alive" until end-of-run, even though D1 itself keeps serving direct-append admissions.

-### The fix would be in vendored SGLang
+### What the hair-trigger is actually reacting to

-The hair-trigger threshold (`>= 1`) is wrong for our regime. Options:
-1. Raise the threshold to N transient failures within a short window before declaring the session dead.
-2. Make the "failed" mark expire (e.g. retry the session after a backoff).
-3. Pair the hair-trigger with the existing heartbeat checker (conn.py:1497) — only blacklist if both a transfer failed AND the periodic heartbeat HTTP probe to the bootstrap address reports ≥ N failures.
+Pulling the mooncake C++ logs (filter `^E0`/`^I0` lines from prefill-0.log) reveals the actual underlying error:

-None of these are quick fixes; they require touching `MooncakeKVManager.start_prefill_thread` and the failed-session lifecycle.
+```
+I0512 01:56:42.242062 transfer_engine_py.cpp:546]
+    Sync batch data transfer timeout after 37452515723ns
+I0512 01:56:53.335597 transfer_engine_py.cpp:546]
+    Sync batch data transfer timeout after 30892690400ns
+```
+
+**37.45 s** and **30.89 s** — the mooncake `batch_transfer_sync` C++ call returned nonzero because the synchronous transfer took longer than its internal timeout (~30 s). On a 400 Gb/s NDR RDMA fabric this is not a network problem; the data path is healthy. The SGLang author's design instinct (`>= 1 failures = dead`) is *correct in the idle case* — a 30-second RDMA stall really does indicate a broken peer.
+
+What's happening here is that the peer is **logically broken from the C++ control-plane's point of view**, even though the OS process is still alive.
+
+### Why does the D side stall the control plane for 30 s?
+
+Cross-referencing decode-0.log at the exact second of the first timeout (01:56:42):
+
+```
+01:56:34  Decode batch, #running-req=1, #token=627631, token_usage=0.83,
+          gen throughput=174.76 tok/s         ← still serving normally
+01:56:42  session id 1000315 does not exist, cannot delete.
+01:56:42  session id 1000360 does not exist, cannot delete.
+01:56:42  Trimmed decode session cache via LRU.
+            #evicted_sessions: 2, #freed_tokens: 77675,
+            #available_tokens: 38574 → 116249
+01:56:42  Trimmed decode session cache via LRU.
+            #evicted_sessions: 1, #freed_tokens: 36166,
+            #available_tokens: 29038 → 65204
+01:56:53  Decode transfer failed for request rank=0 ...
+            Failed to get kvcache from prefill instance, it might be dead
+```
+
+D0's main scheduler thread was busy doing **two consecutive LRU evictions** (freeing 77 675 + 36 166 ≈ 114 K tokens of KV) right when the P→D mooncake transfer attempt landed. Each LRU trim involves:
+- iterating per-session resident metadata
+- releasing GPU KV slots back to `token_to_kv_pool_allocator.free()`
+- updating the session-aware-cache bookkeeping under lock
+- closing per-session streaming state
+
+Under `token_usage = 0.83` the LRU scan has to walk thousands of entries; the lock held during this work blocks the mooncake C++ control plane on the receive side (buffer registration / completion poll) from making progress. P's `batch_transfer_sync` keeps polling for the peer's completion ack, doesn't get one for 30 s, and gives up.
+
+So the chain is:
+
+```
+D KV pool saturated by D2-cold-pinning (§5d)
+    ↓
+D triggers heavy LRU eviction (114K tokens at a time)
+    ↓
+D main scheduler thread starves mooncake C++ control plane for 30+ s
+    ↓
+P's batch_transfer_sync returns nonzero (timeout)
+    ↓
+P's hair-trigger marks D's whole mooncake_session_id "failed forever"
+    ↓
+all subsequent reqs to that D blow up with "is not alive"
+```
+
+The hair-trigger threshold (`>= 1`) is structurally wrong for this regime — but it would not fire at all if the LRU thrash didn't happen, and the LRU thrash would not happen if the load were spread across all 3 D workers (§5d).
+
+### Two layers of fix
+
+| Layer | What | Cost |
+|---|---|---|
+| Root cause | Spread load to D2 so D0/D1's KV never saturate, LRU never thrashes. See §5d and the cold-D bonus implementation in `policies.py` (next commit). | Low — pure policy change |
+| Defense in depth | In `mooncake/conn.py:1267-1276`, replace `>= 1` with a windowed threshold (e.g. ≥ 3 failures within 60 s) and add a periodic retry that probes the D bootstrap port before clearing `failed_sessions`. | Medium — touches vendored SGLang |
+
+We do the root-cause fix first because it makes the second one optional.

 ---