docs(experiments): mooncake "is not alive" deep-dives to LRU starvation
The Q1 mystery resolves: P-side mooncake C++ logs show
"Sync batch data transfer timeout after 37452515723ns" (37.45 s) at
01:56:42 — this is mooncake's batch_transfer_sync giving up after
its internal timeout. The hair-trigger >=1 in conn.py:1270 is
correct in the idle case (a 30-s RDMA stall genuinely means the
peer is broken), but it fires here because of D-side congestion:
decode-0.log shows two consecutive LRU evictions ("Trimmed decode
session cache via LRU. evicted_sessions: 2, freed_tokens: 77675")
firing at the exact same wall second the timeout triggers.
The D scheduler thread is busy with multi-session GPU memory frees
+ session-aware-cache bookkeeping under lock; the mooncake C++
control plane on the receive side gets starved for >30 s; P times
out and marks the whole D's mooncake_session_id failed.
Two-layer fix listed in §5c: root-cause = spread load to D2 (cold-D
bonus, next commit); defense-in-depth = windowed threshold + retry
in vendored mooncake conn.py.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
This commit is contained in:
@@ -247,14 +247,74 @@ if req.mooncake_session_id in self.failed_sessions:
|
||||
|
||||
Looking at decode-1.log around 01:56:42-56, the worker is running heavy decode batches (#token = 627K, near KV pool cap of 755K) plus repeatedly evicting via LRU. Under that load a single `send_kvcache_slice` returning a transient nonzero is enough to flip the switch. After 01:56:42 essentially every P→D1 transfer reports "is not alive" until end-of-run, even though D1 itself keeps serving direct-append admissions.
|
||||
|
||||
### The fix would be in vendored SGLang
|
||||
### What the hair-trigger is actually reacting to
|
||||
|
||||
The hair-trigger threshold (`>= 1`) is wrong for our regime. Options:
|
||||
1. Raise the threshold to N transient failures within a short window before declaring the session dead.
|
||||
2. Make the "failed" mark expire (e.g. retry the session after a backoff).
|
||||
3. Pair the hair-trigger with the existing heartbeat checker (conn.py:1497) — only blacklist if both a transfer failed AND the periodic heartbeat HTTP probe to the bootstrap address reports ≥ N failures.
|
||||
Pulling the mooncake C++ logs (filter `^E0`/`^I0` lines from prefill-0.log) reveals the actual underlying error:
|
||||
|
||||
None of these are quick fixes; they require touching `MooncakeKVManager.start_prefill_thread` and the failed-session lifecycle.
|
||||
```
|
||||
I0512 01:56:42.242062 transfer_engine_py.cpp:546]
|
||||
Sync batch data transfer timeout after 37452515723ns
|
||||
I0512 01:56:53.335597 transfer_engine_py.cpp:546]
|
||||
Sync batch data transfer timeout after 30892690400ns
|
||||
```
|
||||
|
||||
**37.45 s** and **30.89 s** — the mooncake `batch_transfer_sync` C++ call returned nonzero because the synchronous transfer took longer than its internal timeout (~30 s). On a 400 Gb/s NDR RDMA fabric this is not a network problem; the data path is healthy. The SGLang author's design instinct (`>= 1 failures = dead`) is *correct in the idle case* — a 30-second RDMA stall really does indicate a broken peer.
|
||||
|
||||
What's happening here is that the peer is **logically broken from the C++ control-plane's point of view**, even though the OS process is still alive.
|
||||
|
||||
### Why does the D side stall the control plane for 30 s?
|
||||
|
||||
Cross-referencing decode-0.log at the exact second of the first timeout (01:56:42):
|
||||
|
||||
```
|
||||
01:56:34 Decode batch, #running-req=1, #token=627631, token_usage=0.83,
|
||||
gen throughput=174.76 tok/s ← still serving normally
|
||||
01:56:42 session id 1000315 does not exist, cannot delete.
|
||||
01:56:42 session id 1000360 does not exist, cannot delete.
|
||||
01:56:42 Trimmed decode session cache via LRU.
|
||||
#evicted_sessions: 2, #freed_tokens: 77675,
|
||||
#available_tokens: 38574 → 116249
|
||||
01:56:42 Trimmed decode session cache via LRU.
|
||||
#evicted_sessions: 1, #freed_tokens: 36166,
|
||||
#available_tokens: 29038 → 65204
|
||||
01:56:53 Decode transfer failed for request rank=0 ...
|
||||
Failed to get kvcache from prefill instance, it might be dead
|
||||
```
|
||||
|
||||
D0's main scheduler thread was busy doing **two consecutive LRU evictions** (freeing 77 675 + 36 166 ≈ 114 K tokens of KV) right when the P→D mooncake transfer attempt landed. Each LRU trim involves:
|
||||
- iterating per-session resident metadata
|
||||
- releasing GPU KV slots back to `token_to_kv_pool_allocator.free()`
|
||||
- updating the session-aware-cache bookkeeping under lock
|
||||
- closing per-session streaming state
|
||||
|
||||
Under `token_usage = 0.83` the LRU scan has to walk thousands of entries; the lock held during this work blocks the mooncake C++ control plane on the receive side (buffer registration / completion poll) from making progress. P's `batch_transfer_sync` keeps polling for the peer's completion ack, doesn't get one for 30 s, and gives up.
|
||||
|
||||
So the chain is:
|
||||
|
||||
```
|
||||
D KV pool saturated by D2-cold-pinning (§5d)
|
||||
↓
|
||||
D triggers heavy LRU eviction (114K tokens at a time)
|
||||
↓
|
||||
D main scheduler thread starves mooncake C++ control plane for 30+ s
|
||||
↓
|
||||
P's batch_transfer_sync returns nonzero (timeout)
|
||||
↓
|
||||
P's hair-trigger marks D's whole mooncake_session_id "failed forever"
|
||||
↓
|
||||
all subsequent reqs to that D blow up with "is not alive"
|
||||
```
|
||||
|
||||
The hair-trigger threshold (`>= 1`) is structurally wrong for this regime — but it would not fire at all if the LRU thrash didn't happen, and the LRU thrash would not happen if the load were spread across all 3 D workers (§5d).
|
||||
|
||||
### Two layers of fix
|
||||
|
||||
| Layer | What | Cost |
|
||||
|---|---|---|
|
||||
| Root cause | Spread load to D2 so D0/D1's KV never saturate, LRU never thrashes. See §5d and the cold-D bonus implementation in `policies.py` (next commit). | Low — pure policy change |
|
||||
| Defense in depth | In `mooncake/conn.py:1267-1276`, replace `>= 1` with a windowed threshold (e.g. ≥ 3 failures within 60 s) and add a periodic retry that probes the D bootstrap port before clearing `failed_sessions`. | Medium — touches vendored SGLang |
|
||||
|
||||
We do the root-cause fix first because it makes the second one optional.
|
||||
|
||||
---
|
||||
|
||||
|
||||
Reference in New Issue
Block a user