For Q1 (D scheduler LRU starves mooncake control plane → 30s batch_transfer_sync timeout → hair-trigger blacklist), six candidate fixes evaluated. Recommendation: do Q2 fix first since it removes the only condition under which we observe LRU thrash; bump mooncake timeout to 120s as cheap defense-in-depth; avoid invasive SGLang vendor changes (windowed hair-trigger, async eviction thread) until Q2 fix demonstrates they're insufficient. For Q2 (overlap-first lex score + shared boilerplate → permanent D2 cold), seven candidate fixes evaluated. Recommendation: load-floor bonus (graduated, decoupled from overlap, gated on not-sticky) as the primary mechanism: proactive on first-touch as user requested, avoiding the binary one-shot pitfall of the reverted cold-D bonus. Orthogonal cleanup: fix the substring filter in _is_admission_rejection_mode so the existing migration mechanism serves as a backstop when load balancing alone isn't enough. 7 decision points listed for review; no code merged until a shape is approved. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
E1 / E2 Failure Modes — Fix Design Space (no code changes)
Status: design proposal for review.
Branch: h200-cu130.
Companion: docs/E1_E2_RESULTS_ZH.md §5b–§5d for the forensic findings this design responds to.
This document evaluates candidate fixes for the two pathologies E1 / E2 exposed:
- Q1: D scheduler thread starves the mooncake C++ control plane during LRU evictions, causing P-side batch_transfer_sync to time out (~30 s) and the hair-trigger in conn.py:1270 to permanently blacklist the D's mooncake_session_id.
- Q2: KvAwarePolicy's overlap-first lex score, combined with workloads where new sessions share boilerplate hash_ids with already-resident sessions on D0/D1, leaves D2 cold for the entire run.
For each problem we list candidate fixes, the layer they touch, their assumptions, and what could go wrong. No code is committed until a path is chosen.
Q1 — Eviction starves mooncake control plane
Mechanism recap
Inside decode-0.log at the moment of P-side timeout (Sync batch data transfer timeout after 37452515723ns):
01:56:34 Decode batch ... gen 174 tok/s ← serving fine
01:56:42 session id 1000315 does not exist, cannot delete.
01:56:42 Trimmed decode session cache via LRU. evicted=2, freed=77675, available 38574 → 116249
01:56:42 Trimmed decode session cache via LRU. evicted=1, freed=36166, available 29038 → 65204
01:56:42 Decode transfer failed ... ← P-side timeout fires
maybe_trim_decode_session_cache (in vendored sglang scheduler) walks per-session resident bookkeeping, releases GPU KV slots via kv_pool_allocator.free(), and updates session_aware_cache under lock. While that runs, the scheduler main loop is busy and the mooncake control-plane callbacks scheduled into the same event loop don't get serviced. P sees no completion ack within 30 s → batch_transfer_sync returns nonzero → hair-trigger fires.
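The starvation pattern described above can be reproduced in miniature. The following is only an analogy, assuming an asyncio-style loop: the `heartbeat` coroutine stands in for the mooncake control-plane callbacks and `time.sleep` stands in for the synchronous trim. None of these names come from SGLang or mooncake.

```python
import asyncio
import time


async def heartbeat(stamps, period=0.01):
    # Hypothetical stand-in for control-plane callbacks that must run regularly.
    while True:
        stamps.append(time.monotonic())
        await asyncio.sleep(period)


async def run_demo(block_s=0.2):
    stamps = []
    task = asyncio.create_task(heartbeat(stamps))
    await asyncio.sleep(0.05)   # loop is free: heartbeats tick normally
    time.sleep(block_s)         # synchronous "trim": nothing else can run
    await asyncio.sleep(0.05)   # loop is free again: heartbeats resume
    task.cancel()
    try:
        await task
    except asyncio.CancelledError:
        pass
    # Largest interval between two consecutive heartbeats.
    return max(b - a for a, b in zip(stamps, stamps[1:]))


max_gap = asyncio.run(run_demo())
print(f"largest heartbeat gap: {max_gap:.3f}s")
```

The largest gap is at least as long as the blocking call, which is the same shape as P waiting for an ack that D cannot send while it is trimming.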
Design space
| # | Fix | Layer | Mechanism | Assumes | Risks |
|---|---|---|---|---|---|
| Q1.A | Pre-emptive low-watermark eviction | vendored SGLang | Trigger LRU when token_usage > 0.7 in idle scheduler ticks, so admission rarely needs to evict inline. SGLang already has _decode_session_cache_low_watermark_tokens; question is whether it currently runs proactively or only on-demand. | Idle ticks exist to absorb the work; the per-trim cost is bounded enough that doing it pre-emptively doesn't hurt the steady-state. | If proactive trims pick "warm" sessions (recently active), we lose direct-to-D fast-path hits. Need careful watermark + LRU-priority tuning. |
| Q1.B | Async eviction thread | vendored SGLang | Move LRU trim off the scheduler main loop into a background worker. Scheduler main loop only calls notify_evict_needed(); mooncake control plane keeps running. | KV pool free / session_aware_cache mutations can be made thread-safe with reasonable lock granularity. | Largest blast radius. Concurrent in-flight transfers can race with eviction of the same KV slots; need explicit ref-counting. Harder to reason about correctness. |
| Q1.C | Bump mooncake transfer timeout | mooncake env / wheel patch | Set MC_TRANSFER_TIMEOUT_NS (or equivalent) from 30 s default → 120 s+, giving D's eviction more headroom before P gives up. | A real broken link won't go unnoticed for ≥120 s. | Pure defense-in-depth. Doesn't fix LRU thrashing; under heavier load eviction could exceed 120 s too. Slows real-failure detection. |
| Q1.D | Windowed hair-trigger | vendored SGLang conn.py:1270 | Replace if session_failures >= 1: with if session_failures ≥ N within window. Add periodic probe to D bootstrap port to clear failed_sessions after success. | Transient stalls are recoverable; real deaths are not. | Changes core failure semantics. We may keep dispatching to a D that is actually slow-dying. Adds windowed-state bookkeeping to a stable codepath. |
| Q1.E | Router-side backpressure | our --enable-backpressure (already exists, off by default) | D returns recommended_pause_ms in its admission RPC when pool > threshold; router pauses dispatch to that D. Already implemented. | Pausing dispatch upstream prevents D from ever reaching saturation, so LRU never thrashes. | Doesn't help in-flight transfers when stall happens; only prevents future arrivals. Won't rescue requests already mid-mooncake when LRU fires. |
| Q1.F | Upstream load balance (= Q2 fix) | our policies.py | Spread sessions to D2 so D0/D1's KV pool never saturates; LRU never trims; mooncake never stalls; hair-trigger never fires. | Q2 fix is sound and the workload's KV demand fits into 3 D's evenly. | The LRU+mooncake interaction stays latent. A different workload that still imbalances (e.g. a few sessions much larger than others) could re-trigger. |
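To make Q1.D concrete, here is a minimal sketch of a windowed failure counter. It is not the code in conn.py; the class name, method names, and the default values of `max_failures` and `window_s` are all hypothetical, chosen only to show the bookkeeping the option would add.

```python
from collections import deque


class WindowedFailureTracker:
    """Blacklist a session only after N failures inside a sliding window."""

    def __init__(self, max_failures=3, window_s=60.0):
        self.max_failures = max_failures   # hypothetical default
        self.window_s = window_s           # hypothetical default
        self._failures = {}                # session_id -> deque of timestamps

    def _expire(self, stamps, now):
        # Drop failures that have aged out of the window.
        while stamps and now - stamps[0] > self.window_s:
            stamps.popleft()

    def record_failure(self, session_id, now):
        stamps = self._failures.setdefault(session_id, deque())
        stamps.append(now)
        self._expire(stamps, now)

    def record_success(self, session_id):
        # A successful probe or transfer clears the session's history.
        self._failures.pop(session_id, None)

    def is_blacklisted(self, session_id, now):
        stamps = self._failures.get(session_id)
        if not stamps:
            return False
        self._expire(stamps, now)
        return len(stamps) >= self.max_failures


tracker = WindowedFailureTracker(max_failures=3, window_s=60.0)
tracker.record_failure("d0-session", now=0.0)
print(tracker.is_blacklisted("d0-session", now=0.0))
```

A single transient stall no longer blacklists the session, which is the intended benefit; the risk noted in the table is visible too, since a slowly dying D can stay under the threshold indefinitely.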
Recommendation for Q1
Primary: Q1.F (do Q2 fix first). This is upstream of the failure cascade and removes the only situation in which we observe LRU thrashing in our experiments. If Q2 is fixed and re-running E2 still shows mooncake stalls, then we know it's a real symptom and need defense-in-depth.
Defense-in-depth (cheap): Q1.C (bump mooncake timeout). Single env-var change that gives a 4× safety margin (30 s → 120 s); the only cost is slower detection of a genuinely broken link. Safe to do regardless.
Avoid for now: Q1.B and Q1.D. Both touch vendored SGLang in invasive ways that change failure-detection semantics. Hold until Q1.F + Q1.C demonstrate they aren't enough.
Open question for the team: does SGLang's existing low_watermark LRU trigger (Q1.A) already run proactively? If we read the scheduler loop and find it only trims on demand, Q1.A is a small targeted change worth doing; if it's already proactive, the trims we observe are because watermark is set too high → tune the constant.
Q2 — Cold-D never gets a session
What we already know is wrong
User's observation: the existing migration_reject_threshold=3 mechanism fires after 3 wasted prefills, which is too late. The fix needs to be proactive: the first request to a fresh session should already prefer the cold D over a hot D whose only advantage is shared boilerplate overlap.
Design space
Let assigned[D] = state.decode_assignment_counts[D] and inflight[D] = state.inflight_decode[D]. Lex score is currently:
score(D) = (overlap + α·sticky, sticky, -inflight, -assigned)
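Because the score is a tuple, Python compares it lexicographically: later positions only matter when every earlier position ties. The sketch below uses made-up numbers (the `alpha` value and the per-D counts are illustrative assumptions, not measurements) to show why a cold D2 loses a fresh session to a loaded D that holds a small boilerplate overlap.

```python
def lex_score(overlap, sticky, inflight, assigned, alpha=1000):
    # Same shape as the current score; alpha is a hypothetical value.
    return (overlap + alpha * sticky, sticky, -inflight, -assigned)


# A fresh (non-sticky) session whose boilerplate blocks are resident on D0/D1.
candidates = {
    "D0": lex_score(overlap=50, sticky=0, inflight=12, assigned=20),
    "D1": lex_score(overlap=50, sticky=0, inflight=10, assigned=18),
    "D2": lex_score(overlap=0, sticky=0, inflight=0, assigned=0),
}

winner = max(candidates, key=candidates.get)
print(winner)  # D1: overlap ties with D0, then fewer inflight wins
```

D2's zero inflight and zero assigned counts never get a chance to influence the outcome, because position 0 already separates it from D0 and D1.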
| # | Fix | Mechanism | Assumes | Risks |
|---|---|---|---|---|
| Q2.A | Cold-D bonus (binary, what the reverted commit did) | cold_boost = K if assigned[D]==0 and not sticky else 0; add to lex position 0. | Each D needs to be "popped" from cold once, after that the bonus disappears. | One-shot: only protects the first session per D. After all 3 D's have ≥1 session, bonus is 0 everywhere and we're back to overlap-dominates-everything. If new session pressure remains skewed (e.g. boilerplate keeps growing on D0/D1), we re-imbalance silently. |
| Q2.B | Load-floor bonus (graduated, my recommended primary) | floor_bonus = max(0, K · (1 − assigned[D] / max(assigned[*]))) (or similar continuous fn); add to lex position 0; gated on not sticky. | "Lower assignment count = preferable for fresh sessions" is a sound bias even when no D is fully cold. | Tuning: K must dominate boilerplate overlap (~50 blocks here) but not so much that it drowns out genuine prefix-cache wins (a session with real 800-block overlap with one D should still go there). Suggest K ≈ 100×median(overlap_for_fresh_sessions). |
| Q2.C | Lex re-order: inflight first | Change score to (-inflight, overlap + α·sticky, sticky, -assigned). | Idle D always wins ties → idle D2 wins fresh sessions immediately. | Contradicts the existing design intent (overlap-first = cache-locality-first). Hurts cache reuse when load is balanced. Sticky requests at turn 1+ might be diverted to a momentarily idle D, breaking cache locality of subsequent turns. |
| Q2.D | Capacity-aware overlap discount | effective_overlap = overlap · (1 − inflight[D] / max_inflight); replace overlap in score. | Loaded D's overlap is worth less than idle D's overlap because of queueing cost. Matches what theory says about cache-vs-load tradeoff. | More complex than Q2.B; needs max_inflight estimate (per-D? global?). Harder to reason about and tune. Saves only marginal modeling correctness over Q2.B. |
| Q2.E | Pre-warm cold D's at startup | After SGLang warmup, send a synthetic request whose hash_ids cover the boilerplate prefix to each D, populating state.resident[D] evenly. | We can identify "the shared boilerplate" by inspecting the trace before launch (or extracting common prefix at run start). | Trace-aware / requires upstream knowledge. Doesn't help workloads with multiple distinct shared prefixes. Workload-coupled; feels brittle. |
| Q2.F | Drop overlap unless "material" | Apply overlap term only when overlap > τ blocks (or > τ% of input). | Tiny overlap doesn't actually save meaningful prefill work. | Hides imbalance instead of solving it. If a workload has medium overlap (say 15%), threshold won't fire and we're back to imbalance. Doesn't address the bigger issue. |
| Q2.G | Fix the substring filter (the actual _is_admission_rejection_mode bug) | Either widen _ADMISSION_REJECTION_SUBSTRINGS to include "kvcache-centric", or call state.record_admission_reject directly from the actual reject signal site instead of string-matching after the fact. | Existing migration mechanism is sound once it gets fed the right signal. | User has explicitly said 3-reject threshold is too late. So Q2.G alone isn't enough. But it's still a real bug; fixing it is orthogonal cleanup. |
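The difference between Q2.A and Q2.B is easiest to see side by side. The sketch below implements the two formulas from the table as standalone functions; the assignment counts and K = 200 are illustrative, and the guard for an all-zero `assigned` map is an added assumption, since the table formula would otherwise divide by zero.

```python
def cold_bonus(assigned, worker, k, sticky=False):
    # Q2.A: binary, only fires while the worker has never been assigned.
    return k if assigned[worker] == 0 and not sticky else 0


def floor_bonus(assigned, worker, k, sticky=False):
    # Q2.B: graduated, scales with how far the worker trails the busiest one.
    peak = max(assigned.values())
    if sticky or peak == 0:  # peak == 0 guard is an assumption of this sketch
        return 0
    return max(0.0, k * (1 - assigned[worker] / peak))


before = {"D0": 20, "D1": 18, "D2": 0}  # D2 still cold
after = {"D0": 20, "D1": 18, "D2": 1}   # D2 has received one session

for label, assigned in (("before", before), ("after", after)):
    print(
        label,
        cold_bonus(assigned, "D2", k=200),
        round(floor_bonus(assigned, "D2", k=200), 1),
    )
```

Once D2 holds a single session the binary bonus drops to zero, while the graduated bonus stays close to K until D2 actually catches up with D0 and D1.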
Recommendation for Q2
Primary: Q2.B (load-floor bonus, graduated).
- Continuous, not binary one-shot like Q2.A — gracefully handles the case where new sessions keep arriving and load needs to keep spreading.
- Decouples "node-idle preference" from overlap as separate signals — composable, debuggable.
- Sticky stays on by gating on not sticky → no risk of breaking turn 1+ cache locality.
- Single knob (K) to tune.
Orthogonal cleanup: Q2.G (fix the reject-substring filter). Independent of Q2.B, since the migration mechanism is the backstop (when load-floor bonus alone isn't enough to migrate from a saturated D mid-session). User correctly noted that waiting 3 rejects is too late as the primary mechanism, but as a backstop after primary load balancing, it's still valuable.
Avoid: Q2.C (lex re-order destroys overlap-first design). Avoid: Q2.E (workload-coupled, brittle). Q2.D / Q2.F are reasonable but more complex than Q2.B with marginal gain.
Concrete shape of Q2.B (for review, not for merge)
    # In KvAwarePolicy.select, replacing the current score line:
    total_assigned = sum(state.decode_assignment_counts.values())
    n_decoders = max(1, len(topology.route_workers))
    mean_assigned = total_assigned / n_decoders
    # Per-D fairness deficit: how much below the running mean is this D?
    deficit = max(0, mean_assigned - state.decode_assignment_counts.get(worker.worker_id, 0))
    floor_bonus = int(self.load_floor_bonus * deficit / max(1, mean_assigned)) if not sticky else 0
    score = (
        overlap + sticky * self.sticky_bonus + floor_bonus,
        sticky,
        inflight_penalty,
        assignment_penalty,
    )
Knob: load_floor_bonus: int = 0 (off by default, opt-in). When set to e.g. 200, an empty D that should have 16 sessions but has 0 gets floor_bonus = 200 * 16 / 16 = 200, dominating boilerplate overlap (~50). A D that's only 1 session below mean gets floor_bonus = 200 * 1 / 16 ≈ 12, which doesn't override real prefix-cache wins.
But this is just a sketch — real tuning needs an empirical pass on the same Inferact subset to verify D2 receives sessions and overlap-driven cache wins survive on D0/D1.
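The arithmetic in the knob description can be checked with a standalone version of the deficit-vs-mean formula. This is a sketch for review only: it takes a plain dict instead of the real state and topology objects, and it assumes every decoder appears in that dict so that `len(counts)` matches the number of decoders.

```python
def floor_bonus(counts, worker_id, load_floor_bonus, sticky=False):
    # Standalone copy of the proposed formula; not wired into KvAwarePolicy.
    if sticky:
        return 0
    mean_assigned = sum(counts.values()) / max(1, len(counts))
    deficit = max(0, mean_assigned - counts.get(worker_id, 0))
    return int(load_floor_bonus * deficit / max(1, mean_assigned))


empty_case = {"D0": 24, "D1": 24, "D2": 0}   # mean 16, D2 fully cold
near_case = {"D0": 17, "D1": 16, "D2": 15}   # mean 16, D2 one below mean

cold = floor_bonus(empty_case, "D2", load_floor_bonus=200)
near = floor_bonus(near_case, "D2", load_floor_bonus=200)
print(cold, near)  # 200 12

# Position 0 of the score for a fresh session (sticky = 0):
boilerplate_overlap, real_overlap = 50, 800
print(0 + cold > boilerplate_overlap)  # True: cold D2 beats boilerplate-only overlap
print(0 + cold > real_overlap)         # False: a genuine 800-block overlap still wins
```

With these numbers a bonus of 200 is enough to pull the first sessions onto D2 without overriding a large real prefix match, which is the property the empirical pass would need to confirm on actual traces.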
Validation plan if we go with Q2.B
- Implement Q2.B + flag, default off.
- Re-run E2 on the same outputs/inferact_50sess.jsonl subset with --kvcache-load-floor-bonus 200.
- Check structural log: do D0/D1/D2 each get a non-trivial share of session-d-binding.jsonl rows?
- Check failure rate: drop from 1054 → < 100? (Hypothesis: yes, because the LRU thrash that triggered the mooncake hair-trigger was downstream of D0/D1 saturation.)
- Check direct-to-D rate: should stay similar or improve (load-balancing should not destroy cache reuse, since sticky still wins for known sessions).
- Re-evaluate H1 with E1 vs the new E2.
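The per-D share check in the plan above could be scripted roughly as follows. The schema of session-d-binding.jsonl is not shown in this document, so the `decode_worker` field name and the sample rows are hypothetical placeholders; substitute whatever key the structural log actually uses.

```python
import io
import json
from collections import Counter


def binding_shares(lines, key="decode_worker"):
    # `key` is a hypothetical field name; adjust to the real log schema.
    counts = Counter()
    for line in lines:
        line = line.strip()
        if not line:
            continue
        counts[json.loads(line)[key]] += 1
    total = sum(counts.values())
    return {worker: n / total for worker, n in counts.items()} if total else {}


# Made-up rows standing in for the real log file.
sample = io.StringIO(
    '{"session": "s1", "decode_worker": "D0"}\n'
    '{"session": "s2", "decode_worker": "D1"}\n'
    '{"session": "s3", "decode_worker": "D2"}\n'
    '{"session": "s4", "decode_worker": "D0"}\n'
)
shares = binding_shares(sample)
print(shares)
```

For a real run, pass an open file handle instead of the `StringIO` sample and compare each D's share against whatever floor counts as "non-trivial" for the review.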
Decision points (for review)
| # | Question | Default if no answer |
|---|---|---|
| D1 | Q1: do Q2 fix first and re-measure before touching mooncake / SGLang? | Yes (recommended) |
| D2 | Q1: bump mooncake MC_TRANSFER_TIMEOUT_NS to 120 s as cheap defense-in-depth? | Yes |
| D3 | Q2: is Q2.B (load-floor bonus, graduated) the right shape, or should we pick a different option from the table? | Q2.B |
| D4 | Q2: also do Q2.G (fix the reject-substring filter) as orthogonal cleanup? | Yes |
| D5 | Q2.B: is the proposed deficit-vs-mean formula OK, or do you prefer a simpler "bonus = K · (max - mine) / max" form? | Defer |
| D6 | Q2.B: bonus magnitude K = 200 reasonable, or want to grid-search a few values? | Try 200 first |
| D7 | Validation: re-run E2 on same 50-session subset, or expand to 100 sessions for more headroom? | Same subset |
Once the shape is approved, the next implementation pass is small and concentrated in policies.py + replay.py + cli.py (no SGLang vendor changes needed for the primary fix).