docs(experiments): mooncake "is not alive" deep-dives to LRU starvation

The Q1 mystery resolves: P-side mooncake C++ logs show
"Sync batch data transfer timeout after 37452515723ns" (37.45 s) at
01:56:42 — this is mooncake's batch_transfer_sync giving up after
its internal timeout. The hair-trigger >=1 in conn.py:1270 is
correct in the idle case (a 30-s RDMA stall genuinely means the
peer is broken), but it fires here because of D-side congestion:
decode-0.log shows two consecutive LRU evictions ("Trimmed decode
session cache via LRU. evicted_sessions: 2, freed_tokens: 77675")
firing at the exact same wall second the timeout triggers.

The D scheduler thread is busy with multi-session GPU memory frees
+ session-aware-cache bookkeeping under lock; the mooncake C++
control plane on the receive side gets starved for >30 s; P times
out and marks the whole D's mooncake_session_id failed.

Two-layer fix listed in §5c: root-cause = spread load to D2 (cold-D
bonus, next commit); defense-in-depth = windowed threshold + retry
in vendored mooncake conn.py.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
This commit is contained in:
tim
2026-05-12 11:14:00 +08:00
parent 7f2ebf3d87
commit bf4da281c0

View File

@@ -247,14 +247,74 @@ if req.mooncake_session_id in self.failed_sessions:
Looking at decode-1.log around 01:56:42-56, the worker is running heavy decode batches (#token = 627K, near KV pool cap of 755K) plus repeatedly evicting via LRU. Under that load a single `send_kvcache_slice` returning a transient nonzero is enough to flip the switch. After 01:56:42 essentially every P→D1 transfer reports "is not alive" until end-of-run, even though D1 itself keeps serving direct-append admissions.
### The fix would be in vendored SGLang
### What the hair-trigger is actually reacting to
The hair-trigger threshold (`>= 1`) is wrong for our regime. Options:
1. Raise the threshold to N transient failures within a short window before declaring the session dead.
2. Make the "failed" mark expire (e.g. retry the session after a backoff).
3. Pair the hair-trigger with the existing heartbeat checker (conn.py:1497) — only blacklist if both a transfer failed AND the periodic heartbeat HTTP probe to the bootstrap address reports ≥ N failures.
Pulling the mooncake C++ logs (filter `^E0`/`^I0` lines from prefill-0.log) reveals the actual underlying error:
None of these are quick fixes; they require touching `MooncakeKVManager.start_prefill_thread` and the failed-session lifecycle.
```
I0512 01:56:42.242062 transfer_engine_py.cpp:546]
Sync batch data transfer timeout after 37452515723ns
I0512 01:56:53.335597 transfer_engine_py.cpp:546]
Sync batch data transfer timeout after 30892690400ns
```
**37.45 s** and **30.89 s** — the mooncake `batch_transfer_sync` C++ call returned nonzero because the synchronous transfer took longer than its internal timeout (~30 s). On a 400 Gb/s NDR RDMA fabric this is not a network problem; the data path is healthy. The SGLang author's design instinct (`>= 1 failures = dead`) is *correct in the idle case* — a 30-second RDMA stall really does indicate a broken peer.
What's happening here is that the peer is **logically broken from the C++ control-plane's point of view**, even though the OS process is still alive.
### Why does the D side stall the control plane for 30 s?
Cross-referencing decode-0.log at the exact second of the first timeout (01:56:42):
```
01:56:34 Decode batch, #running-req=1, #token=627631, token_usage=0.83,
gen throughput=174.76 tok/s ← still serving normally
01:56:42 session id 1000315 does not exist, cannot delete.
01:56:42 session id 1000360 does not exist, cannot delete.
01:56:42 Trimmed decode session cache via LRU.
#evicted_sessions: 2, #freed_tokens: 77675,
#available_tokens: 38574 → 116249
01:56:42 Trimmed decode session cache via LRU.
#evicted_sessions: 1, #freed_tokens: 36166,
#available_tokens: 29038 → 65204
01:56:53 Decode transfer failed for request rank=0 ...
Failed to get kvcache from prefill instance, it might be dead
```
D0's main scheduler thread was busy doing **two consecutive LRU evictions** (freeing 77 675 + 36 166 ≈ 114 K tokens of KV) right when the P→D mooncake transfer attempt landed. Each LRU trim involves:
- iterating per-session resident metadata
- releasing GPU KV slots back to `token_to_kv_pool_allocator.free()`
- updating the session-aware-cache bookkeeping under lock
- closing per-session streaming state
Under `token_usage = 0.83` the LRU scan has to walk thousands of entries; the lock held during this work blocks the mooncake C++ control plane on the receive side (buffer registration / completion poll) from making progress. P's `batch_transfer_sync` keeps polling for the peer's completion ack, doesn't get one for 30 s, and gives up.
So the chain is:
```
D KV pool saturated by D2-cold-pinning (§5d)
D triggers heavy LRU eviction (114K tokens at a time)
D main scheduler thread starves mooncake C++ control plane for 30+ s
P's batch_transfer_sync returns nonzero (timeout)
P's hair-trigger marks D's whole mooncake_session_id "failed forever"
all subsequent reqs to that D blow up with "is not alive"
```
The hair-trigger threshold (`>= 1`) is structurally wrong for this regime — but it would not fire at all if the LRU thrash didn't happen, and the LRU thrash would not happen if the load were spread across all 3 D workers (§5d).
### Two layers of fix
| Layer | What | Cost |
|---|---|---|
| Root cause | Spread load to D2 so D0/D1's KV never saturate, LRU never thrashes. See §5d and the cold-D bonus implementation in `policies.py` (next commit). | Low — pure policy change |
| Defense in depth | In `mooncake/conn.py:1267-1276`, replace `>= 1` with a windowed threshold (e.g. ≥ 3 failures within 60 s) and add a periodic retry that probes the D bootstrap port before clearing `failed_sessions`. | Medium — touches vendored SGLang |
We do the root-cause fix first because it makes the second one optional.
---