docs(experiments): forensic on Q1 (mooncake death) and Q2 (no D2 migration)

Q1: Mooncake "is not alive" is hair-trigger — a single
send_kvcache_slice ret != 0 in
third_party/sglang/python/sglang/srt/disaggregation/mooncake/conn.py
:1270 permanently adds the D's mooncake_session_id to failed_sessions
and blacklists it for the rest of the process lifetime. The D worker
process is alive (D1 keeps serving admit_direct_append OK seconds
after), but every subsequent P→D transfer for that session
short-circuits at conn.py:1184. The "Failures should never happen if
the session is not dead" comment encodes the wrong assumption for the
saturation regime we hit.

Q2: KVC v2's migration mechanism IS sound but its trigger is gated
by replay.py:1379 _ADMISSION_REJECTION_SUBSTRINGS = ("session-cap",
"no-d-capacity", "d-backpressure"). All 1054 failures have
execution_mode="kvcache-centric" (generic fallback bucket) which
contains none of those substrings, so session_d_rejects is never
incremented. Empirically 46 of 49 (sess, D) pairs that the worker
RPC rejected would have qualified for blacklist (most-rejected
pair: 25 rejects), but policy never saw them. Result: D0 reject
→ next-bind D0 (253×), D1 reject → next-bind D1 (329×), D0/D1 reject
→ next-bind D2 (0×).

Fix paths are documented for both. The shortest path is to widen the
substring filter to include the failure-fallback bucket, but the
right fix is to call record_admission_reject directly from the
actual rejection signal site instead of string-matching execution_mode.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
tim
2026-05-12 10:45:18 +08:00
parent ef4dc81ea9
commit 7f2ebf3d87


@@ -213,6 +213,106 @@ Worker admission per se is not the bug — the bug is that there is no D-rebalan
---
## 5c. Why mooncake "died" (forensic on Q1)
The error string is `Decode instance could be dead, remote mooncake session ... is not alive`, which sounds like the D worker process crashed. **It did not.** Concurrent evidence shows D1 was happily serving `/session_cache/admit_direct_append HTTP/1.1 200 OK` and running LRU evictions only seconds after the "is not alive" errors fired. The real mechanism is a hair-trigger session blacklist in SGLang's mooncake connector.
### What the SGLang mooncake conn.py actually does
In `third_party/sglang/python/sglang/srt/disaggregation/mooncake/conn.py:1267-1276`:
```python
if ret != 0:  # one transfer slice failed
    with self.session_lock:
        self.session_failures[req.mooncake_session_id] += 1
        # Failures should never happen if the session is not dead,
        # if the session fails once, mark it as failed
        if self.session_failures[req.mooncake_session_id] >= 1:
            self.failed_sessions.add(req.mooncake_session_id)
            logger.error(f"Session {req.mooncake_session_id} failed.")
    ...
```
After this, every subsequent transfer that uses the same `mooncake_session_id` short-circuits at conn.py:1184:
```python
if req.mooncake_session_id in self.failed_sessions:
    self.record_failure(kv_chunk.room,
        f"Decode instance could be dead, remote mooncake session ... is not alive")
```
**One real `send_kvcache_slice ret != 0` permanently blacklists that D's mooncake session for the rest of the SGLang process lifetime.** The code's own comment ("Failures should never happen if the session is not dead") encodes the design assumption that transfers don't fail under normal conditions — but they do under the saturation regime described in §5b (RDMA queue full / D scheduler too busy to drain receives in time).
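The latch can be reproduced in isolation. The sketch below is a toy model, not the SGLang code: `ToyTransferManager`, `send`, and `transfer_ok` are hypothetical names introduced for illustration, and only `session_failures`, `failed_sessions`, and the `>= 1` threshold mirror the excerpt above.

```python
from collections import defaultdict


class ToyTransferManager:
    """Toy model of the failed-session latch (hypothetical, not SGLang code)."""

    def __init__(self, threshold: int = 1):
        self.threshold = threshold
        self.session_failures = defaultdict(int)
        self.failed_sessions = set()

    def send(self, session_id: str, transfer_ok: bool) -> str:
        # A latched session never reaches the transport again.
        if session_id in self.failed_sessions:
            return "short-circuit"
        if not transfer_ok:
            self.session_failures[session_id] += 1
            if self.session_failures[session_id] >= self.threshold:
                self.failed_sessions.add(session_id)
            return "failed"
        return "ok"


mgr = ToyTransferManager()
# One transient failure in the middle of otherwise healthy transfers.
results = [mgr.send("d1-session", ok) for ok in (True, False, True, True)]
print(results)  # ['ok', 'failed', 'short-circuit', 'short-circuit']
```

Nothing in the model ever removes an entry from `failed_sessions`, which is why a single failure is sticky for the life of the process.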
### Connecting back to Q1 timeline
In decode-1.log around 01:56:42-56, the worker is running heavy decode batches (#token = 627K, near the KV pool cap of 755K) while repeatedly evicting via LRU. Under that load a single `send_kvcache_slice` returning a transient nonzero is enough to flip the switch. After 01:56:42 essentially every P→D1 transfer reports "is not alive" until end-of-run, even though D1 itself keeps serving direct-append admissions.
### The fix would be in vendored SGLang
The hair-trigger threshold (`>= 1`) is wrong for our regime. Options:
1. Raise the threshold to N transient failures within a short window before declaring the session dead.
2. Make the "failed" mark expire (e.g. retry the session after a backoff).
3. Pair the hair-trigger with the existing heartbeat checker (conn.py:1497) — only blacklist if both a transfer failed AND the periodic heartbeat HTTP probe to the bootstrap address reports ≥ N failures.
None of these are quick fixes; they require touching `MooncakeKVManager.start_prefill_thread` and the failed-session lifecycle.
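As a rough illustration of options 1 and 2 combined, the following sketch counts failures inside a sliding window and lets the "failed" mark expire after a cooldown. Everything here is an assumption made for illustration: the class name `SessionFailureTracker`, its methods, and the default values (3 failures, 5 s window, 30 s cooldown) are not taken from SGLang and would need tuning against the real workload.

```python
import time
from collections import defaultdict, deque


class SessionFailureTracker:
    """Hypothetical replacement for the one-strike latch (not SGLang code)."""

    def __init__(self, max_failures=3, window_s=5.0, cooldown_s=30.0,
                 clock=time.monotonic):
        self.max_failures = max_failures
        self.window_s = window_s
        self.cooldown_s = cooldown_s
        self.clock = clock
        self._failures = defaultdict(deque)  # session_id -> recent failure times
        self._failed_until = {}              # session_id -> time the mark expires

    def record_failure(self, session_id):
        now = self.clock()
        recent = self._failures[session_id]
        recent.append(now)
        # Forget failures that have aged out of the window.
        while recent and now - recent[0] > self.window_s:
            recent.popleft()
        if len(recent) >= self.max_failures:
            self._failed_until[session_id] = now + self.cooldown_s
            recent.clear()

    def is_failed(self, session_id):
        until = self._failed_until.get(session_id)
        if until is None:
            return False
        if self.clock() >= until:
            # Cooldown elapsed: give the session another chance.
            del self._failed_until[session_id]
            return False
        return True


# Usage with a fake clock so the behavior is deterministic.
now = [0.0]
tracker = SessionFailureTracker(clock=lambda: now[0])
tracker.record_failure("s")            # one transient failure at t=0
after_one = tracker.is_failed("s")     # still usable
for t in (1.0, 2.0):                   # two more failures inside the window
    now[0] = t
    tracker.record_failure("s")
after_three = tracker.is_failed("s")   # now marked failed
now[0] = 40.0
after_cooldown = tracker.is_failed("s")  # mark has expired
print(after_one, after_three, after_cooldown)  # False True False
```

A real change would still have to decide what happens to transfers attempted during the cooldown, which this sketch does not address.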
---
## 5d. Why no session ever migrated to D2 (forensic on Q2)
KVC v2's design (KVC_ROUTER_ALGORITHM §3.3) increments `state.session_d_rejects[(session_id, D)]` after each rejection; `policy.select` then skips any D whose count for that session has reached `migration_reject_threshold` (3). The mechanism is conceptually sound. The bug is in *which* failures count as rejections.
### The substring filter is too narrow
In `replay.py:1379`:
```python
_ADMISSION_REJECTION_SUBSTRINGS = (
    "session-cap",
    "no-d-capacity",
    "d-backpressure",
)


def _is_admission_rejection_mode(execution_mode: str) -> bool:
    return any(token in execution_mode for token in _ADMISSION_REJECTION_SUBSTRINGS)
```
Only `execution_mode` values containing one of those three substrings increment the per-(session, D) reject counter. **All 1054 E2 failures have `execution_mode = "kvcache-centric"`** (the generic fallback bucket the replay engine uses when a request falls through every concrete sub-path without producing a successful result). That string contains none of the three substrings, so `session_d_rejects` is never incremented for them.
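The mismatch is easy to check by running the filter on its own. The snippet below copies the filter from the excerpt above; `"example-session-cap"` is a made-up label used only to show a positive match and is not a real `execution_mode` value from the run.

```python
_ADMISSION_REJECTION_SUBSTRINGS = (
    "session-cap",
    "no-d-capacity",
    "d-backpressure",
)


def _is_admission_rejection_mode(execution_mode: str) -> bool:
    return any(token in execution_mode for token in _ADMISSION_REJECTION_SUBSTRINGS)


# The label carried by every failed request in this run.
fallback_matches = _is_admission_rejection_mode("kvcache-centric")
# A hypothetical label that does contain one of the substrings.
specific_matches = _is_admission_rejection_mode("example-session-cap")
print(fallback_matches, specific_matches)  # False True
```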
### Empirical confirmation
Counting from `structural/admission-events.jsonl` (worker-RPC level, independent of replay's classification):
| Stat | Value |
|---|---:|
| Distinct `(session, D)` pairs ever rejected by worker RPC | 49 |
| Pairs rejected ≥ 3 times (would qualify for blacklist) | **46** |
| Most-rejected single pair | (1001172, D1) = **25 rejects** |
So 46 of 49 (sess, D) pairs *should have been blacklisted* by KVC v2's design. They never were, because the corresponding requests' execution_mode was `"kvcache-centric"` (failure path) and not `"…-session-cap"` / `"…-no-d-capacity"` / `"…-d-backpressure"` (which only get assigned when the fallthrough path runs to a known-rejection sub-result, not when the upstream SSE stream errors out).
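A count of this kind can be sketched as follows. The event schema is an assumption: the field names `session_id`, `d_instance`, and `admitted` are hypothetical stand-ins for whatever `admission-events.jsonl` actually records, and the sample lines are synthetic, not data from the run.

```python
import json
from collections import Counter


def count_rejected_pairs(lines, threshold=3):
    """Count rejects per (session, D) pair and return the pairs at or above threshold."""
    rejects = Counter()
    for line in lines:
        line = line.strip()
        if not line:
            continue
        event = json.loads(line)
        if not event["admitted"]:  # hypothetical field name
            rejects[(event["session_id"], event["d_instance"])] += 1
    qualifying = {pair for pair, n in rejects.items() if n >= threshold}
    return rejects, qualifying


# Synthetic events: session 1 is rejected three times on D0, session 2 once on D1.
sample = [
    '{"session_id": 1, "d_instance": "D0", "admitted": false}',
    '{"session_id": 1, "d_instance": "D0", "admitted": false}',
    '{"session_id": 2, "d_instance": "D1", "admitted": false}',
    '{"session_id": 1, "d_instance": "D0", "admitted": false}',
    '{"session_id": 2, "d_instance": "D1", "admitted": true}',
]
rejects, qualifying = count_rejected_pairs(sample)
print(len(rejects), sorted(qualifying))  # 2 [(1, 'D0')]
```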
Counting "next-binding-after-reject" from the merged binding+admission timeline:
| Rejected on | Next binding goes to | Count |
|---|---|---:|
| D0 | D0 | 253 |
| D1 | D1 | 329 |
| D0 | D2 | **0** |
| D1 | D2 | **0** |
The router stubbornly re-binds the same session to the same D after every reject. Because the reject was never recorded in `session_d_rejects`, `policy.select` still sees an empty rejection counter, and the overlap term keeps tipping it back to D0/D1.
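One way to derive a "next-binding-after-reject" table from a merged timeline is sketched below. The timeline representation, time-ordered `(kind, session_id, d)` tuples with `kind` being `"reject"` or `"bind"`, is an assumption made for illustration, and the sample timeline is synthetic.

```python
from collections import Counter


def next_binding_after_reject(timeline):
    """Map (rejected_on, next_bound_to) -> count over a time-ordered timeline."""
    pending = {}  # session_id -> D of the most recent unmatched reject
    transitions = Counter()
    for kind, session_id, d in timeline:
        if kind == "reject":
            pending[session_id] = d
        elif kind == "bind" and session_id in pending:
            transitions[(pending.pop(session_id), d)] += 1
    return transitions


# Synthetic timeline: session 1 re-binds to D0 once and then moves to D2;
# session 2 re-binds to D1.
timeline = [
    ("bind", 1, "D0"),
    ("reject", 1, "D0"),
    ("bind", 1, "D0"),
    ("reject", 1, "D0"),
    ("bind", 1, "D2"),
    ("reject", 2, "D1"),
    ("bind", 2, "D1"),
]
transitions = next_binding_after_reject(timeline)
print(sorted(transitions.items()))
# [(('D0', 'D0'), 1), (('D0', 'D2'), 1), (('D1', 'D1'), 1)]
```

In the measured run the `(D0, D2)` and `(D1, D2)` cells are zero, which is the signature of the missing reject bookkeeping.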
### The fix
Two paths, in increasing scope:
1. **Quick**: include `"kvcache-centric"` (the failure-fallback bucket) in `_ADMISSION_REJECTION_SUBSTRINGS`, OR have replay set `execution_mode` to a more specific failure label when an SSE stream closes with zero tokens (e.g. `"upstream-aborted"`) and add that to the substring set.
2. **Better**: don't rely on string-matching at all. Have `_run_request` catch the actual rejection signal (admission RPC `can_admit=False` or upstream `RuntimeError: generate stream ended ...`) and call `state.record_admission_reject(...)` directly at that point. The substring filter was inherited from the v1 → v2 migration design (`MIGRATION_V1_FINDINGS_ZH §4.1`) when only specific fallback paths set those names.
Either fix would let the existing `migration_reject_threshold=3` blacklist D0/D1 after enough failures, force a re-route to D2, populate D2's resident hashes, and break the overlap-pinning death spiral.
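The intended effect of the second path can be sketched with a toy router. This is a minimal model under stated assumptions, not the replay engine: `RouterState`, `select_d`, `run_request`, the `send` callback, and the overlap scores are hypothetical; only `session_d_rejects`, `record_admission_reject`, and `migration_reject_threshold=3` come from the text above.

```python
from collections import defaultdict


class RouterState:
    """Toy per-(session, D) reject bookkeeping (hypothetical)."""

    def __init__(self, migration_reject_threshold=3):
        self.migration_reject_threshold = migration_reject_threshold
        self.session_d_rejects = defaultdict(int)

    def record_admission_reject(self, session_id, d):
        self.session_d_rejects[(session_id, d)] += 1

    def is_blacklisted(self, session_id, d):
        count = self.session_d_rejects.get((session_id, d), 0)
        return count >= self.migration_reject_threshold


def select_d(state, session_id, overlap_by_d):
    """Pick the highest-overlap D that is not blacklisted for this session."""
    candidates = [d for d in overlap_by_d if not state.is_blacklisted(session_id, d)]
    if not candidates:
        candidates = list(overlap_by_d)  # assumption: fall back to every D
    return max(candidates, key=lambda d: overlap_by_d[d])


def run_request(state, session_id, overlap_by_d, send):
    d = select_d(state, session_id, overlap_by_d)
    try:
        send(d)
    except RuntimeError:
        # Record the reject where the signal is observed, with no string matching.
        state.record_admission_reject(session_id, d)
        return d, False
    return d, True


def send(d):
    # Pretend D0 and D1 are saturated and always fail; D2 accepts.
    if d in ("D0", "D1"):
        raise RuntimeError("generate stream ended")


state = RouterState()
overlap = {"D0": 0.9, "D1": 0.4, "D2": 0.0}  # made-up overlap scores
picks = [run_request(state, "s1", overlap, send)[0] for _ in range(7)]
print(picks)  # ['D0', 'D0', 'D0', 'D1', 'D1', 'D1', 'D2']
```

With the rejects recorded, the overlap term stops winning once a D reaches the threshold, and the session reaches D2 on the seventh attempt in this toy setup.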
---
## 6. What this experiment actually shows
1. **The H200 + driver 570 + cu12.8 toolchain works for production-scale SGLang xPyD workloads.** Both runs completed without CUDA / driver / mooncake errors; failures are policy- and workload-level, not infrastructure.