tim 9a166ac43b docs(experiments): design space for Q1 (mooncake stall) + Q2 (cold-D)

For Q1 (D scheduler LRU starves mooncake control plane → 30s
batch_transfer_sync timeout → hair-trigger blacklist), six candidate
fixes evaluated. Recommendation: do Q2 fix first since it removes
the only condition under which we observe LRU thrash; bump mooncake
timeout to 120s as cheap defense-in-depth; avoid invasive SGLang
vendor changes (windowed hair-trigger, async eviction thread) until
Q2 fix demonstrates they're insufficient.

For Q2 (overlap-first lex score + shared boilerplate → permanent
D2 cold), seven candidate fixes evaluated. Recommendation: load-
floor bonus (graduated, decoupled from overlap, gated on
not-sticky) as the primary mechanism — proactive on first-touch as
user requested, avoiding the binary one-shot pitfall of the
reverted cold-D bonus. Orthogonal cleanup: fix the substring filter
in _is_admission_rejection_mode so the existing migration mechanism
serves as a backstop when load balancing alone isn't enough.

7 decision points listed for review; no code merged until a shape
is approved.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

2026-05-12 11:20:00 +08:00


E1 / E2 Failure Modes — Fix Design Space (no code changes)

Status: design proposal for review. Branch: h200-cu130. Companion: docs/E1_E2_RESULTS_ZH.md §5b–§5d for the forensic findings this design responds to.

This document evaluates candidate fixes for the two pathologies E1 / E2 exposed:

Q1: D scheduler thread starves the mooncake C++ control plane during LRU evictions, causing P-side batch_transfer_sync to time out (~30 s) and the hair-trigger in conn.py:1270 to permanently blacklist the D's mooncake_session_id.
Q2: KvAwarePolicy's overlap-first lex score, combined with workloads where new sessions share boilerplate hash_ids with already-resident sessions on D0/D1, leaves D2 cold for the entire run.

For each problem we list candidate fixes, the layer they touch, their assumptions, and what could go wrong. No code is committed until a path is chosen.

Q1 — Eviction starves mooncake control plane

Mechanism recap

Inside decode-0.log at the moment of the P-side timeout (Sync batch data transfer timeout after 37452515723ns, i.e. about 37.5 s):

01:56:34  Decode batch ... gen 174 tok/s    ← serving fine
01:56:42  session id 1000315 does not exist, cannot delete.
01:56:42  Trimmed decode session cache via LRU. evicted=2, freed=77675, available 38574 → 116249
01:56:42  Trimmed decode session cache via LRU. evicted=1, freed=36166, available 29038 → 65204
01:56:42  Decode transfer failed ...        ← P-side timeout fires

maybe_trim_decode_session_cache (in vendored sglang scheduler) walks per-session resident bookkeeping, releases GPU KV slots via kv_pool_allocator.free(), and updates session_aware_cache under lock. While that runs, the scheduler main loop is busy and the mooncake control-plane callbacks scheduled into the same event loop don't get serviced. P sees no completion ack within 30 s → batch_transfer_sync returns nonzero → hair-trigger fires.
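The starvation effect can be reproduced in miniature. The following is a toy model, not SGLang or mooncake code: it assumes the control-plane acks are callbacks on a single-threaded asyncio loop and that the trim is a synchronous call on that same loop. `blocking_trim`, `control_plane_acks` and `max_ack_gap` are hypothetical names introduced for illustration.

```python
import asyncio
import time


def blocking_trim(seconds: float) -> None:
    """Stand-in for a synchronous LRU trim that never yields to the loop."""
    time.sleep(seconds)


async def control_plane_acks(gaps: list, stop: asyncio.Event) -> None:
    """Stand-in for control-plane callbacks; records the gap between wakeups."""
    last = time.monotonic()
    while not stop.is_set():
        await asyncio.sleep(0.01)
        now = time.monotonic()
        gaps.append(now - last)
        last = now


async def _run(trim_seconds: float) -> float:
    gaps: list = []
    stop = asyncio.Event()
    acks = asyncio.create_task(control_plane_acks(gaps, stop))
    await asyncio.sleep(0.05)    # loop is healthy: acks fire roughly every 10 ms
    blocking_trim(trim_seconds)  # loop is blocked: no ack can be serviced
    await asyncio.sleep(0.05)
    stop.set()
    await acks
    return max(gaps)


def max_ack_gap(trim_seconds: float) -> float:
    """Longest interval during which the ack task was not serviced."""
    return asyncio.run(_run(trim_seconds))
```

In this model the longest ack gap is at least as long as the trim itself, so a trim that outlasts the peer's timeout looks to the peer exactly like a dead link.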

Design space

| # | Fix | Layer | Mechanism | Assumes | Risks |
| --- | --- | --- | --- | --- | --- |
| Q1.A | Pre-emptive low-watermark eviction | vendored SGLang | Trigger LRU when `token_usage > 0.7` in idle scheduler ticks, so admission rarely needs to evict inline. SGLang already has `_decode_session_cache_low_watermark_tokens`; question is whether it currently runs proactively or only on-demand. | Idle ticks exist to absorb the work; the per-trim cost is bounded enough that doing it pre-emptively doesn't hurt the steady-state. | If proactive trims pick "warm" sessions (recently active), we lose direct-to-D fast-path hits. Need careful watermark + LRU-priority tuning. |
| Q1.B | Async eviction thread | vendored SGLang | Move LRU trim off the scheduler main loop into a background worker. Scheduler main loop only calls `notify_evict_needed()`; mooncake control plane keeps running. | KV pool free / session_aware_cache mutations can be made thread-safe with reasonable lock granularity. | Largest blast radius. Concurrent in-flight transfers can race with eviction of the same KV slots; need explicit ref-counting. Harder to reason about correctness. |
| Q1.C | Bump mooncake transfer timeout | mooncake env / wheel patch | Set `MC_TRANSFER_TIMEOUT_NS` (or equivalent) from 30 s default → 120 s+, giving D's eviction more headroom before P gives up. | A real broken link won't go unnoticed for ≥120 s. | Pure defense-in-depth. Doesn't fix LRU thrashing; under heavier load eviction could exceed 120 s too. Slows real-failure detection. |
| Q1.D | Windowed hair-trigger | vendored SGLang `conn.py:1270` | Replace `if session_failures >= 1:` with `if session_failures ≥ N within window`. Add periodic probe to D bootstrap port to clear `failed_sessions` after success. | Transient stalls are recoverable; real deaths are not. | Changes core failure semantics. We may keep dispatching to a D that is actually slow-dying. Adds windowed-state bookkeeping to a stable codepath. |
| Q1.E | Router-side backpressure | our `--enable-backpressure` (already exists, off by default) | D returns `recommended_pause_ms` in its admission RPC when pool > threshold; router pauses dispatch to that D. Already implemented. | Pausing dispatch upstream prevents D from ever reaching saturation, so LRU never thrashes. | Doesn't help in-flight transfers when stall happens; only prevents future arrivals. Won't rescue requests already mid-mooncake when LRU fires. |
| Q1.F | Upstream load balance (= Q2 fix) | our `policies.py` | Spread sessions to D2 so D0/D1's KV pool never saturates; LRU never trims; mooncake never stalls; hair-trigger never fires. | Q2 fix is sound and the workload's KV demand fits into 3 D's evenly. | The LRU+mooncake interaction stays latent. A different workload that still imbalances (e.g. a few sessions much larger than others) could re-trigger. |
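To make Q1.D concrete, here is a minimal sketch of a sliding-window failure counter. It is not the code at `conn.py:1270`; the class name `WindowedFailureTracker`, its methods, and the choice to pass timestamps in explicitly are all assumptions made for illustration.

```python
from collections import defaultdict, deque


class WindowedFailureTracker:
    """Blacklist a session only after `threshold` failures within `window_s` seconds."""

    def __init__(self, threshold: int, window_s: float) -> None:
        self.threshold = threshold
        self.window_s = window_s
        self._failures = defaultdict(deque)  # session_id -> failure timestamps

    def record_failure(self, session_id: str, now: float) -> bool:
        """Record a failure; return True if the session should be blacklisted."""
        stamps = self._failures[session_id]
        stamps.append(now)
        # Drop failures that have fallen out of the sliding window.
        while stamps and now - stamps[0] > self.window_s:
            stamps.popleft()
        return len(stamps) >= self.threshold

    def record_success(self, session_id: str) -> None:
        """A successful transfer or probe clears the session's failure history."""
        self._failures.pop(session_id, None)
```

With `threshold=1` this reduces to the current hair-trigger behaviour, which would make such a change easy to roll out behind a default that preserves today's semantics.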

Recommendation for Q1

Primary: Q1.F (do Q2 fix first). This is upstream of the failure cascade and removes the only situation in which we observe LRU thrashing in our experiments. If E2 still shows mooncake stalls after the Q2 fix, the stall is an independent problem rather than a side effect of D0/D1 saturation, and we need the defense-in-depth options below.

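Defense-in-depth (cheap): Q1.C (bump mooncake timeout). A single env-var change that raises the timeout from 30 s to 120 s, a 4× margin, at no implementation cost. The trade-off is slower detection of real link failures (see the Q1.C risks in the table above). Safe to do regardless of the other choices.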

Avoid for now: Q1.B and Q1.D. Both touch vendored SGLang in invasive ways that change failure-detection semantics. Hold until Q1.F + Q1.C demonstrate they aren't enough.

Open question for the team: does SGLang's existing low_watermark LRU trigger (Q1.A) already run proactively? If we read the scheduler loop and find that it only trims on demand, Q1.A is a small, targeted change worth doing. If it is already proactive, the trims we observe happen because the watermark is set too high, and the fix is to tune the constant.
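For reference while reading the scheduler loop, this is roughly what a proactive Q1.A trim could look like. It is a toy cache, not the vendored SGLang implementation: `SessionLru`, `trim_on_idle`, the 0.5 low watermark used in the usage note, and the per-tick eviction cap are assumptions. Only the 0.7 trigger comes from the table above.

```python
from collections import OrderedDict


class SessionLru:
    """Toy decode session cache: session_id -> resident tokens, LRU entry first."""

    def __init__(self, capacity_tokens: int) -> None:
        self.capacity_tokens = capacity_tokens
        self.sessions = OrderedDict()

    @property
    def used_tokens(self) -> int:
        return sum(self.sessions.values())

    def touch(self, session_id: str, tokens: int) -> None:
        """Insert or refresh a session, marking it most recently used."""
        self.sessions[session_id] = tokens
        self.sessions.move_to_end(session_id)

    def trim_on_idle(self, high: float, low: float, max_evictions: int) -> list:
        """If usage exceeds `high`, evict LRU sessions until usage <= `low`.

        `max_evictions` bounds the work done in a single idle tick.
        """
        evicted = []
        if self.used_tokens <= high * self.capacity_tokens:
            return evicted
        while (self.sessions and len(evicted) < max_evictions
               and self.used_tokens > low * self.capacity_tokens):
            session_id, _ = self.sessions.popitem(last=False)
            evicted.append(session_id)
        return evicted
```

Calling `trim_on_idle(0.7, 0.5, max_evictions=8)` from an idle tick keeps each trim short. The per-tick cap is the knob that trades eviction latency against time spent away from the main loop.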

Q2 — Cold-D never gets a session

What we already know is wrong

User's observation: the existing migration_reject_threshold=3 mechanism fires after 3 wasted prefills, which is too late. The fix needs to be proactive: the first request to a fresh session should already prefer the cold D over a hot D whose only advantage is shared boilerplate overlap.

Design space

Let assigned[D] = state.decode_assignment_counts[D] and inflight[D] = state.inflight_decode[D]. Lex score is currently:

score(D) = (overlap + α·sticky, sticky, -inflight, -assigned)
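A short sketch of why this ordering leaves D2 cold. Python compares tuples position by position, so any nonzero overlap in position 0 wins before inflight or assigned are consulted. The `Candidate` type, the value of `alpha`, and the load numbers are illustrative assumptions; only the tuple layout and the ~50-block boilerplate overlap come from this document.

```python
from typing import NamedTuple


class Candidate(NamedTuple):
    name: str
    overlap: int   # shared prefix blocks already resident on this D
    sticky: int    # 1 if the session is already bound to this D, else 0
    inflight: int  # in-flight decode requests on this D
    assigned: int  # sessions assigned to this D so far


def lex_score(c: Candidate, alpha: int = 1000) -> tuple:
    """score(D) = (overlap + alpha*sticky, sticky, -inflight, -assigned)."""
    return (c.overlap + alpha * c.sticky, c.sticky, -c.inflight, -c.assigned)


def pick(candidates: list) -> str:
    """Return the name of the candidate with the highest lexicographic score."""
    return max(candidates, key=lex_score).name


# A fresh session (sticky = 0 everywhere) that shares ~50 boilerplate blocks
# with sessions already resident on D0 and D1. D2 is idle and empty.
fresh = [
    Candidate("D0", overlap=50, sticky=0, inflight=12, assigned=25),
    Candidate("D1", overlap=50, sticky=0, inflight=9, assigned=24),
    Candidate("D2", overlap=0, sticky=0, inflight=0, assigned=0),
]
winner = pick(fresh)  # "D1": load only breaks the tie between D0 and D1
```

D2 loses on position 0 alone, so its idle state never enters the comparison.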

| # | Fix | Mechanism | Assumes | Risks |
| --- | --- | --- | --- | --- |
| Q2.A | Cold-D bonus (binary, what the reverted commit did) | `cold_boost = K if assigned[D]==0 and not sticky else 0`; add to lex position 0. | Each D needs to be "popped" from cold once; after that the bonus disappears. | One-shot: only protects the first session per D. After all 3 D's have ≥1 session, bonus is 0 everywhere and we're back to overlap-dominates-everything. If new session pressure remains skewed (e.g. boilerplate keeps growing on D0/D1), we re-imbalance silently. |
| Q2.B | Load-floor bonus (graduated, recommended primary) | `floor_bonus = max(0, K · (1 − assigned[D] / max(assigned[*])))` (or similar continuous fn); add to lex position 0; gated on `not sticky`. | "Lower assignment count = preferable for fresh sessions" is a sound bias even when no D is fully cold. | Tuning: K must dominate boilerplate overlap (~50 blocks here) but not so much that it drowns out genuine prefix-cache wins (a session with real 800-block overlap with one D should still go there). Suggest K ≈ 4 × median(overlap_for_fresh_sessions), i.e. ≈ 200 here, which stays well below an 800-block overlap. |
| Q2.C | Lex re-order: inflight first | Change score to `(-inflight, overlap + α·sticky, sticky, -assigned)`. | Idle D always wins ties → idle D2 wins fresh sessions immediately. | Contradicts the existing design intent (overlap-first = cache-locality-first). Hurts cache reuse when load is balanced. Sticky requests at turn 1+ might be diverted to a momentarily idle D, breaking cache locality of subsequent turns. |
| Q2.D | Capacity-aware overlap discount | `effective_overlap = overlap · (1 − inflight[D] / max_inflight)`; replace `overlap` in score. | Loaded D's overlap is worth less than idle D's overlap because of queueing cost. Matches what theory says about cache-vs-load tradeoff. | More complex than Q2.B; needs `max_inflight` estimate (per-D? global?). Harder to reason about and tune. Saves only marginal modeling correctness over Q2.B. |
| Q2.E | Pre-warm cold D's at startup | After SGLang warmup, send a synthetic request whose hash_ids cover the boilerplate prefix to each D, populating `state.resident[D]` evenly. | We can identify "the shared boilerplate" by inspecting the trace before launch (or extracting common prefix at run start). | Trace-aware; requires upstream knowledge. Doesn't help workloads with multiple distinct shared prefixes. Workload-coupled and therefore brittle. |
| Q2.F | Drop overlap unless "material" | Apply overlap term only when overlap > τ blocks (or > τ% of input). | Tiny overlap doesn't actually save meaningful prefill work. | Hides imbalance instead of solving it. If a workload has medium overlap (say 15%), threshold won't fire and we're back to imbalance. Doesn't address the bigger issue. |
| Q2.G | Fix the substring filter (the actual `_is_admission_rejection_mode` bug) | Either widen `_ADMISSION_REJECTION_SUBSTRINGS` to include `"kvcache-centric"`, or call `state.record_admission_reject` directly from the actual reject signal site instead of string-matching after the fact. | Existing migration mechanism is sound once it gets fed the right signal. | User has explicitly said 3-reject threshold is too late, so Q2.G alone isn't enough. It is still a real bug, and fixing it is orthogonal cleanup. |
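The difference between Q2.A and Q2.B shows up as soon as every D has at least one session. The sketch below implements both table formulas as plain functions; the function names, the dict-based `assigned` state, and the sample counts are assumptions for illustration.

```python
def cold_bonus(assigned: dict, worker: str, k: int, sticky: bool = False) -> int:
    """Q2.A: binary bonus, paid only while the worker has zero sessions."""
    return k if assigned[worker] == 0 and not sticky else 0


def floor_bonus(assigned: dict, worker: str, k: int, sticky: bool = False) -> float:
    """Q2.B: graduated bonus, K * (1 - assigned[D] / max(assigned[*]))."""
    if sticky:
        return 0.0
    busiest = max(assigned.values())
    if busiest == 0:
        return 0.0  # nothing assigned anywhere, so no D is behind
    return max(0.0, k * (1 - assigned[worker] / busiest))


# Illustrative state: D2 has received a single session, D0/D1 have many.
assigned = {"D0": 20, "D1": 18, "D2": 1}
binary = {d: cold_bonus(assigned, d, k=200) for d in assigned}
graduated = {d: floor_bonus(assigned, d, k=200) for d in assigned}
```

Here the binary bonus is 0 for all three workers, so a 50-block boilerplate overlap decides again. The graduated bonus still gives D2 roughly 190, which keeps steering fresh sessions toward it until the counts converge.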

Recommendation for Q2

Primary: Q2.B (load-floor bonus, graduated).

Continuous, not binary one-shot like Q2.A — gracefully handles the case where new sessions keep arriving and load needs to keep spreading.
Decouples "node-idle preference" from overlap as separate signals — composable, debuggable.
The bonus is gated on not sticky, so sticky routing is unaffected and turn 1+ cache locality is not at risk.
Single knob (K) to tune.

Orthogonal cleanup: Q2.G (fix the reject-substring filter). It is independent of Q2.B: the migration mechanism is the backstop for cases where the load-floor bonus alone cannot move a session off a saturated D mid-session. The user correctly noted that waiting for 3 rejects is too late for a primary mechanism, but as a backstop behind primary load balancing it is still valuable.

Avoid: Q2.C (lex re-order destroys overlap-first design). Avoid: Q2.E (workload-coupled, brittle). Q2.D / Q2.F are reasonable but more complex than Q2.B with marginal gain.

Concrete shape of Q2.B (for review, not for merge)

# In KvAwarePolicy.select, replacing the current score line:
total_assigned = sum(state.decode_assignment_counts.values())
n_decoders = max(1, len(topology.route_workers))
mean_assigned = total_assigned / n_decoders

# Per-D fairness deficit: how much below the running mean is this D?
deficit = max(0, mean_assigned - state.decode_assignment_counts.get(worker.worker_id, 0))
floor_bonus = int(self.load_floor_bonus * deficit / max(1, mean_assigned)) if not sticky else 0

score = (
    overlap + sticky * self.sticky_bonus + floor_bonus,
    sticky,
    inflight_penalty,
    assignment_penalty,
)

Knob: load_floor_bonus: int = 0 (off by default, opt-in). When set to e.g. 200, an empty D at mean_assigned = 16 (it should have 16 sessions but has 0) gets floor_bonus = int(200 * 16 / 16) = 200, dominating boilerplate overlap (~50). A D that is only 1 session below the mean gets floor_bonus = int(200 * 1 / 16) = 12, which doesn't override real prefix-cache wins.

But this is just a sketch — real tuning needs an empirical pass on the same Inferact subset to verify D2 receives sessions and overlap-driven cache wins survive on D0/D1.
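The two worked values above can be checked with a standalone version of the deficit formula. This is a sketch for review, not the proposed patch: it takes the assignment counts as a plain dict and assumes every decoder has an entry in it, whereas the snippet above derives the decoder count from `topology.route_workers`. The function name and parameters are hypothetical.

```python
def load_floor_bonus(assigned: dict, worker_id: str, k: int, sticky: bool) -> int:
    """Deficit-vs-mean bonus, mirroring the sketch above as a pure function."""
    if sticky or not assigned:
        return 0
    mean_assigned = sum(assigned.values()) / len(assigned)
    # How far below the running mean is this worker? Never negative.
    deficit = max(0, mean_assigned - assigned.get(worker_id, 0))
    return int(k * deficit / max(1, mean_assigned))


# Empty D2 while the mean is 16: full bonus.
empty_case = load_floor_bonus({"D0": 24, "D1": 24, "D2": 0}, "D2", k=200, sticky=False)

# D2 one session below a mean of 16: small nudge only.
near_case = load_floor_bonus({"D0": 17, "D1": 16, "D2": 15}, "D2", k=200, sticky=False)

# Very first assignments: the mean is below 1, so max(1, mean) damps the bonus.
early_case = load_floor_bonus({"D0": 1, "D1": 0, "D2": 0}, "D2", k=200, sticky=False)
```

One property worth noting for review: while the mean is below 1, the `max(1, mean_assigned)` guard caps the bonus below K, so in this example the second session sees a bonus of 66 rather than 200. That still exceeds a ~50-block boilerplate overlap here, but the margin is narrow.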

Validation plan if we go with Q2.B

Implement Q2.B + flag, default off.
Re-run E2 on the same outputs/inferact_50sess.jsonl subset with --kvcache-load-floor-bonus 200.
Check structural log: do D0/D1/D2 each get a non-trivial share of session-d-binding.jsonl rows?
Check failure rate: drop from 1054 → < 100? (Hypothesis: yes, because the LRU thrash that triggered the mooncake hair-trigger was downstream of D0/D1 saturation.)
Check direct-to-D rate: should stay similar or improve (load-balancing should not destroy cache reuse, since sticky still wins for known sessions).
Re-evaluate H1 with E1 vs the new E2.
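The per-D share check in the plan above could be scripted along these lines. The schema of session-d-binding.jsonl is not specified in this document, so the `decode_worker` field name is an assumption, and the rows below are synthetic placeholders rather than measured results.

```python
import io
import json
from collections import Counter


def decode_share(lines, worker_field: str = "decode_worker") -> dict:
    """Fraction of binding rows per decode worker in a JSONL stream."""
    counts = Counter()
    for line in lines:
        line = line.strip()
        if line:  # skip blank lines
            counts[json.loads(line)[worker_field]] += 1
    total = sum(counts.values())
    return {worker: n / total for worker, n in counts.items()} if total else {}


# Synthetic rows for illustration only; these are not measured results.
sample = io.StringIO(
    '{"session_id": "s1", "decode_worker": "D0"}\n'
    '{"session_id": "s2", "decode_worker": "D1"}\n'
    '{"session_id": "s3", "decode_worker": "D2"}\n'
    '{"session_id": "s4", "decode_worker": "D0"}\n'
)
shares = decode_share(sample)
```

Against a real run, pass an open file handle instead of the `StringIO` and adjust `worker_field` to whatever the log actually records.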

Decision points (for review)

| # | Question | Default if no answer |
| --- | --- | --- |
| D1 | Q1: do Q2 fix first and re-measure before touching mooncake / SGLang? | Yes (recommended) |
| D2 | Q1: bump mooncake `MC_TRANSFER_TIMEOUT_NS` to 120 s as cheap defense-in-depth? | Yes |
| D3 | Q2: is Q2.B (load-floor bonus, graduated) the right shape, or should we pick a different option from the table? | Q2.B |
| D4 | Q2: also do Q2.G (fix the reject-substring filter) as orthogonal cleanup? | Yes |
| D5 | Q2.B: is the proposed deficit-vs-mean formula OK, or do you prefer a simpler "bonus = K · (max - mine) / max" form? | Defer |
| D6 | Q2.B: bonus magnitude K = 200 reasonable, or want to grid-search a few values? | Try 200 first |
| D7 | Validation: re-run E2 on same 50-session subset, or expand to 100 sessions for more headroom? | Same subset |

Once the shape is approved, the next implementation pass is small and concentrated in policies.py + replay.py + cli.py (no SGLang vendor changes needed for the primary fix).
