docs(experiments): forensic on Q1 (mooncake death) and Q2 (no D2 migration)
Q1: Mooncake "is not alive" is hair-trigger — a single
send_kvcache_slice ret != 0 in
third_party/sglang/python/sglang/srt/disaggregation/mooncake/conn.py
:1270 permanently adds the D's mooncake_session_id to failed_sessions
and blacklists it for the rest of the process lifetime. The D worker
process is alive (D1 keeps serving admit_direct_append OK seconds
after), but every subsequent P→D transfer for that session
short-circuits at conn.py:1184. The "Failures should never happen if
the session is not dead" comment encodes the wrong assumption for the
saturation regime we hit.
Q2: KVC v2's migration mechanism IS sound but its trigger is gated
by replay.py:1379 _ADMISSION_REJECTION_SUBSTRINGS = ("session-cap",
"no-d-capacity", "d-backpressure"). All 1054 failures have
execution_mode="kvcache-centric" (generic fallback bucket) which
contains none of those substrings, so session_d_rejects is never
incremented. Empirically 46 of 49 (sess, D) pairs that the worker
RPC rejected would have qualified for blacklist (most-rejected
pair: 25 rejects), but policy never saw them. Result: D0 reject
→ next-bind D0 (253×), D1 reject → next-bind D1 (329×), D0/D1 reject
→ next-bind D2 (0×).
Fix paths are documented for both. The shortest path is widening the
substring filter to include the failure-fallback bucket, but the
right fix is to call record_admission_reject directly from the
actual rejection signal site instead of string-matching execution_mode.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@@ -213,6 +213,106 @@ Worker admission per se is not the bug — the bug is that there is no D-rebalan

---

## 5c. Why mooncake "died" (forensic on Q1)
The error string is `Decode instance could be dead, remote mooncake session ... is not alive`, which sounds like the D worker process crashed. **It did not.** Concurrent evidence shows D1 was still serving `/session_cache/admit_direct_append HTTP/1.1 200 OK` and running LRU evictions only seconds after the "is not alive" errors fired. The real mechanism is a hair-trigger session blacklist.
### What the SGLang mooncake conn.py actually does

In `third_party/sglang/python/sglang/srt/disaggregation/mooncake/conn.py:1267-1276`:
```python
if ret != 0:  # one transfer slice failed
    with self.session_lock:
        self.session_failures[req.mooncake_session_id] += 1
        # Failures should never happen if the session is not dead,
        # if the session fails once, mark it as failed
        if self.session_failures[req.mooncake_session_id] >= 1:
            self.failed_sessions.add(req.mooncake_session_id)
            logger.error(f"Session {req.mooncake_session_id} failed.")
    ...
```
After this, every subsequent transfer that uses the same `mooncake_session_id` short-circuits at `conn.py:1184`:
```python
if req.mooncake_session_id in self.failed_sessions:
    self.record_failure(kv_chunk.room,
        f"Decode instance could be dead, remote mooncake session ... is not alive")
```
**One real `send_kvcache_slice ret != 0` permanently blacklists that D's mooncake session for the rest of the SGLang process lifetime.** The code's own comment ("Failures should never happen if the session is not dead") encodes the design assumption that transfers don't fail under normal conditions. They do fail under the saturation regime described in §5b (RDMA queue full / D scheduler too busy to drain receives in time).
### Connecting back to Q1 timeline

Looking at decode-1.log around 01:56:42-56, the worker is running heavy decode batches (#token = 627K, near the KV pool cap of 755K) while repeatedly evicting via LRU. Under that load a single `send_kvcache_slice` returning a transient nonzero is enough to flip the switch. After 01:56:42 essentially every P→D1 transfer reports "is not alive" until end-of-run, even though D1 itself keeps serving direct-append admissions.
### The fix would be in vendored SGLang

The hair-trigger threshold (`>= 1`) is wrong for our regime. Options:
1. Raise the threshold to N transient failures within a short window before declaring the session dead.
2. Make the "failed" mark expire (e.g. retry the session after a backoff).
3. Pair the hair-trigger with the existing heartbeat checker (conn.py:1497): only blacklist if both a transfer failed AND the periodic heartbeat HTTP probe to the bootstrap address reports ≥ N failures.

None of these are quick fixes; they require touching `MooncakeKVManager.start_prefill_thread` and the failed-session lifecycle.
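
Options 1 and 2 can be combined in a small tracker. The following is a minimal sketch, not the SGLang implementation: `SessionFailureTracker`, its parameters, and the default values are all hypothetical, and the locking that `conn.py` does via `session_lock` is omitted for brevity.

```python
import time
from collections import defaultdict, deque


class SessionFailureTracker:
    """Hypothetical stand-in for the failed-session set: windowed threshold plus expiry."""

    def __init__(self, threshold=3, window_s=5.0, backoff_s=30.0, clock=time.monotonic):
        self.threshold = threshold    # failures needed inside the window (option 1)
        self.window_s = window_s      # sliding-window length in seconds
        self.backoff_s = backoff_s    # how long a "failed" mark lasts (option 2)
        self.clock = clock            # injectable so the behaviour can be tested
        self._recent = defaultdict(deque)  # session_id -> recent failure timestamps
        self._failed_until = {}            # session_id -> time at which the mark expires

    def record_failure(self, session_id):
        now = self.clock()
        recent = self._recent[session_id]
        recent.append(now)
        # Forget failures that have fallen out of the sliding window.
        while recent and now - recent[0] > self.window_s:
            recent.popleft()
        if len(recent) >= self.threshold:
            self._failed_until[session_id] = now + self.backoff_s
            recent.clear()

    def is_failed(self, session_id):
        expiry = self._failed_until.get(session_id)
        if expiry is None:
            return False
        if self.clock() >= expiry:
            # The mark has expired, so the session gets another chance.
            del self._failed_until[session_id]
            return False
        return True


# Demo with a fake clock: one transient failure no longer blacklists the session.
fake_now = [0.0]
tracker = SessionFailureTracker(clock=lambda: fake_now[0])
tracker.record_failure("session-a")
print(tracker.is_failed("session-a"))
```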

---
## 5d. Why no session ever migrated to D2 (forensic on Q2)

KVC v2's design (KVC_ROUTER_ALGORITHM §3.3) increments `state.session_d_rejects[(session_id, D)]` after each rejection; `policy.select` then skips any D whose reject count has reached `migration_reject_threshold` (= 3). The mechanism is conceptually sound. The bug is in *which* failures count as rejections.
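
As a rough illustration of that mechanism, the sketch below models only the reject counter and the skip rule. `RouterState` and `eligible_decoders` are hypothetical names, the `record_admission_reject` signature is assumed, and the overlap scoring that the real `policy.select` performs is left out.

```python
from collections import defaultdict

MIGRATION_REJECT_THRESHOLD = 3  # value quoted in the text above


class RouterState:
    """Hypothetical stand-in for the router state; only the reject counter is modelled."""

    def __init__(self):
        self.session_d_rejects = defaultdict(int)

    def record_admission_reject(self, session_id, d):
        self.session_d_rejects[(session_id, d)] += 1


def eligible_decoders(state, session_id, decoders, threshold=MIGRATION_REJECT_THRESHOLD):
    """Return the decoders that selection may still consider for this session."""
    return [
        d for d in decoders
        if state.session_d_rejects.get((session_id, d), 0) < threshold
    ]


state = RouterState()
for _ in range(3):
    state.record_admission_reject(1001172, "D1")
# After three recorded rejects, D1 drops out and D2 becomes reachable.
print(eligible_decoders(state, 1001172, ["D0", "D1", "D2"]))
```

If the counter is never incremented, `eligible_decoders` returns all three decoders every time and nothing pushes the session away from its current D.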
### The substring filter is too narrow

In `replay.py:1379`:
```python
_ADMISSION_REJECTION_SUBSTRINGS = (
    "session-cap",
    "no-d-capacity",
    "d-backpressure",
)


def _is_admission_rejection_mode(execution_mode: str) -> bool:
    return any(token in execution_mode for token in _ADMISSION_REJECTION_SUBSTRINGS)
```
Only `execution_mode` values containing one of those three substrings increment the per-(session, D) reject counter. **All 1054 E2 failures have `execution_mode = "kvcache-centric"`** (the generic fallback bucket the replay engine uses when the request fell through every concrete sub-path before producing a successful result). That string contains none of the three substrings, so `session_d_rejects` is never incremented for them.
### Empirical confirmation

Counting from `structural/admission-events.jsonl` (worker-RPC level, independent of replay's classification):

| Stat | Value |
|---|---:|
| Distinct `(session, D)` pairs ever rejected by worker RPC | 49 |
| Pairs rejected ≥ 3 times (would qualify for blacklist) | **46** |
| Most-rejected single pair | (1001172, D1) = **25 rejects** |

So 46 of 49 (sess, D) pairs *should have been blacklisted* by KVC v2's design. They never were, because the corresponding requests' `execution_mode` was `"kvcache-centric"` (failure path) and not `"…-session-cap"` / `"…-no-d-capacity"` / `"…-d-backpressure"`. Those labels are only assigned when the fallthrough path runs to a known-rejection sub-result, not when the upstream SSE stream errors out.

Counting "next-binding-after-reject" from the merged binding+admission timeline:

| Rejected on | Next binding goes to | Count |
|---|---|---:|
| D0 | D0 | 253 |
| D1 | D1 | 329 |
| D0 | D2 | **0** |
| D1 | D2 | **0** |

The router stubbornly re-binds the same session to the same D after every reject. Because the reject was never recorded in `session_d_rejects`, `policy.select` still sees an empty rejection counter, and the overlap term keeps tipping it back to D0/D1.
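
One way such a "next-binding-after-reject" count could be produced is sketched below. The `(session_id, kind, d)` event tuples are an assumed schema for illustration, not the actual format of the merged timeline, and the toy events are invented.

```python
from collections import Counter


def next_binding_after_reject(events):
    """Count (rejected_on, next_bound_to) transitions per session.

    `events` must be time-ordered (session_id, kind, d) tuples, where kind is
    "reject" or "bind". This schema is an assumption made for the sketch.
    """
    pending = {}  # session_id -> D of the most recent reject not yet followed by a bind
    transitions = Counter()
    for session_id, kind, d in events:
        if kind == "reject":
            pending[session_id] = d
        elif kind == "bind" and session_id in pending:
            transitions[(pending.pop(session_id), d)] += 1
    return transitions


toy_events = [
    (1, "reject", "D0"), (1, "bind", "D0"),
    (2, "reject", "D1"), (2, "bind", "D1"),
    (1, "reject", "D0"), (1, "bind", "D0"),
]
counts = next_binding_after_reject(toy_events)
print(counts[("D0", "D0")], counts[("D1", "D1")], counts[("D0", "D2")])
```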
### The fix

Two paths, in increasing scope:

1. **Quick**: include `"kvcache-centric"` (the failure-fallback bucket) in `_ADMISSION_REJECTION_SUBSTRINGS`, OR have replay set `execution_mode` to a more specific failure label when an SSE stream closes with zero tokens (e.g. `"upstream-aborted"`) and add that to the substring set.
2. **Better**: don't rely on string-matching at all. Have `_run_request` catch the actual rejection signal (admission RPC `can_admit=False` or upstream `RuntimeError: generate stream ended ...`) and call `state.record_admission_reject(...)` directly at that point. The substring filter was inherited from the v1 → v2 migration design (`MIGRATION_V1_FINDINGS_ZH §4.1`) when only specific fallback paths set those names.

Either fix would let the existing `migration_reject_threshold=3` blacklist D0/D1 after enough failures, force a re-route to D2, populate D2's resident hashes, and break the overlap-pinning death spiral.
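
A minimal sketch of the second path follows. It assumes hypothetical `admit` and `generate` callables and an assumed `(session_id, d)` signature for `record_admission_reject`; the real `_run_request` is more involved, and the text after "generate stream ended" in the error message is invented for the demo.

```python
from collections import defaultdict


class RouterState:
    """Hypothetical stand-in; only the reject counter is modelled."""

    def __init__(self):
        self.session_d_rejects = defaultdict(int)

    def record_admission_reject(self, session_id, d):
        # Signature is an assumption; the source elides the arguments.
        self.session_d_rejects[(session_id, d)] += 1


def run_request(state, session_id, d, admit, generate):
    """Record rejections at the signal site instead of parsing execution_mode."""
    if not admit(session_id, d):  # admission RPC answered can_admit=False
        state.record_admission_reject(session_id, d)
        return None
    try:
        return generate(session_id, d)
    except RuntimeError as exc:
        # Only the upstream-abort case counts as a reject; other errors pass through uncounted.
        if "generate stream ended" in str(exc):
            state.record_admission_reject(session_id, d)
        raise


def _aborting_generate(session_id, d):
    raise RuntimeError("generate stream ended early")  # suffix is made up


state = RouterState()
run_request(state, 1001172, "D1", admit=lambda s, d: False, generate=_aborting_generate)
try:
    run_request(state, 1001172, "D1", admit=lambda s, d: True, generate=_aborting_generate)
except RuntimeError:
    pass
# Both the admission refusal and the aborted stream were counted.
print(state.session_d_rejects[(1001172, "D1")])
```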

---
## 6. What this experiment actually shows

1. **The H200 + driver 570 + cu12.8 toolchain works for production-scale SGLang xPyD workloads.** Both runs completed without CUDA / driver / mooncake errors; failures are policy- and workload-level, not infrastructure.