agentic-pd-hybrid/docs/E1_E2_FIX_DESIGN_ZH.md
tim 9a166ac43b docs(experiments): design space for Q1 (mooncake stall) + Q2 (cold-D)
For Q1 (D scheduler LRU starves mooncake control plane → 30s
batch_transfer_sync timeout → hair-trigger blacklist), six candidate
fixes evaluated. Recommendation: do Q2 fix first since it removes
the only condition under which we observe LRU thrash; bump mooncake
timeout to 120s as cheap defense-in-depth; avoid invasive SGLang
vendor changes (windowed hair-trigger, async eviction thread) until
Q2 fix demonstrates they're insufficient.

For Q2 (overlap-first lex score + shared boilerplate → permanent
D2 cold), seven candidate fixes evaluated. Recommendation: load-
floor bonus (graduated, decoupled from overlap, gated on
not-sticky) as the primary mechanism — proactive on first-touch as
user requested, avoiding the binary one-shot pitfall of the
reverted cold-D bonus. Orthogonal cleanup: fix the substring filter
in _is_admission_rejection_mode so the existing migration mechanism
serves as a backstop when load balancing alone isn't enough.

7 decision points listed for review; no code merged until a shape
is approved.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-12 11:20:00 +08:00


E1 / E2 Failure Modes — Fix Design Space (no code changes)

Status: design proposal for review. Branch: h200-cu130. Companion: docs/E1_E2_RESULTS_ZH.md §5b–§5d for the forensic findings this design responds to.

This document evaluates candidate fixes for the two pathologies E1 / E2 exposed:

  • Q1: D scheduler thread starves the mooncake C++ control plane during LRU evictions, causing P-side batch_transfer_sync to time out (~30 s) and the hair-trigger in conn.py:1270 to permanently blacklist the D's mooncake_session_id.
  • Q2: KvAwarePolicy's overlap-first lex score, combined with workloads where new sessions share boilerplate hash_ids with already-resident sessions on D0/D1, leaves D2 cold for the entire run.

For each problem we list candidate fixes, the layer they touch, their assumptions, and what could go wrong. No code is committed until a path is chosen.


Q1 — Eviction starves mooncake control plane

Mechanism recap

Inside decode-0.log at the moment of the P-side timeout (Sync batch data transfer timeout after 37452515723ns, about 37.5 s):

01:56:34  Decode batch ... gen 174 tok/s    ← serving fine
01:56:42  session id 1000315 does not exist, cannot delete.
01:56:42  Trimmed decode session cache via LRU. evicted=2, freed=77675, available 38574 → 116249
01:56:42  Trimmed decode session cache via LRU. evicted=1, freed=36166, available 29038 → 65204
01:56:42  Decode transfer failed ...        ← P-side timeout fires

maybe_trim_decode_session_cache (in vendored sglang scheduler) walks per-session resident bookkeeping, releases GPU KV slots via kv_pool_allocator.free(), and updates session_aware_cache under lock. While that runs, the scheduler main loop is busy and the mooncake control-plane callbacks scheduled into the same event loop don't get serviced. P sees no completion ack within 30 s → batch_transfer_sync returns nonzero → hair-trigger fires.
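
The starvation mechanism can be reproduced in miniature. The sketch below is a toy asyncio model, not SGLang or mooncake code: control_plane_heartbeat and scheduler_with_inline_trim are hypothetical names, and time.sleep stands in for the synchronous LRU trim. It assumes, as the recap does, that the trim and the control-plane callbacks share one event loop.

```python
import asyncio
import time


async def control_plane_heartbeat(gaps: list[float], stop: asyncio.Event) -> None:
    """Stand-in for control-plane callbacks: records the gap between wake-ups."""
    last = time.monotonic()
    while not stop.is_set():
        await asyncio.sleep(0.01)  # wants to run every 10 ms
        now = time.monotonic()
        gaps.append(now - last)
        last = now


async def scheduler_with_inline_trim(trim_seconds: float) -> float:
    """Run the heartbeat next to a blocking 'trim' and return the worst gap."""
    gaps: list[float] = []
    stop = asyncio.Event()
    heartbeat = asyncio.create_task(control_plane_heartbeat(gaps, stop))
    await asyncio.sleep(0.05)  # heartbeat ticks normally
    time.sleep(trim_seconds)   # synchronous work: the loop cannot switch tasks
    await asyncio.sleep(0.05)  # loop is free again
    stop.set()
    await heartbeat
    return max(gaps)


worst_gap = asyncio.run(scheduler_with_inline_trim(trim_seconds=0.3))
print(f"worst heartbeat gap: {worst_gap:.3f} s")
```

The worst gap is at least as long as the blocking call, which is the same shape as P waiting for an ack that D cannot send until its trim returns.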

Design space

| # | Fix | Layer | Mechanism | Assumes | Risks |
|---|---|---|---|---|---|
| Q1.A | Pre-emptive low-watermark eviction | vendored SGLang | Trigger LRU when token_usage > 0.7 in idle scheduler ticks, so admission rarely needs to evict inline. SGLang already has _decode_session_cache_low_watermark_tokens; question is whether it currently runs proactively or only on-demand. | Idle ticks exist to absorb the work; the per-trim cost is bounded enough that doing it pre-emptively doesn't hurt the steady-state. | If proactive trims pick "warm" sessions (recently active), we lose direct-to-D fast-path hits. Need careful watermark + LRU-priority tuning. |
| Q1.B | Async eviction thread | vendored SGLang | Move LRU trim off the scheduler main loop into a background worker. Scheduler main loop only calls notify_evict_needed(); mooncake control plane keeps running. | KV pool free / session_aware_cache mutations can be made thread-safe with reasonable lock granularity. | Largest blast radius. Concurrent in-flight transfers can race with eviction of the same KV slots; need explicit ref-counting. Harder to reason about correctness. |
| Q1.C | Bump mooncake transfer timeout | mooncake env / wheel patch | Set MC_TRANSFER_TIMEOUT_NS (or equivalent) from 30 s default → 120 s+, giving D's eviction more headroom before P gives up. | A real broken link won't go unnoticed for ≥120 s. | Pure defense-in-depth. Doesn't fix LRU thrashing; under heavier load eviction could exceed 120 s too. Slows real-failure detection. |
| Q1.D | Windowed hair-trigger | vendored SGLang conn.py:1270 | Replace "if session_failures >= 1:" with "if session_failures ≥ N within window". Add periodic probe to D bootstrap port to clear failed_sessions after success. | Transient stalls are recoverable; real deaths are not. | Changes core failure semantics. We may keep dispatching to a D that is actually slow-dying. Adds windowed-state bookkeeping to a stable codepath. |
| Q1.E | Router-side backpressure | our --enable-backpressure (already exists, off by default) | D returns recommended_pause_ms in its admission RPC when pool > threshold; router pauses dispatch to that D. Already implemented. | Pausing dispatch upstream prevents D from ever reaching saturation, so LRU never thrashes. | Doesn't help in-flight transfers when stall happens; only prevents future arrivals. Won't rescue requests already mid-mooncake when LRU fires. |
| Q1.F | Upstream load balance (= Q2 fix) | our policies.py | Spread sessions to D2 so D0/D1's KV pool never saturates; LRU never trims; mooncake never stalls; hair-trigger never fires. | Q2 fix is sound and the workload's KV demand fits into 3 D's evenly. | The LRU+mooncake interaction stays latent. A different workload that still imbalances (e.g. a few sessions much larger than others) could re-trigger. |
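
To make the Q1.D row concrete, here is a minimal sketch of what "N failures within a window" bookkeeping could look like. It is not the code in conn.py: WindowedFailureTracker, its threshold of 3, and its 60 s window are all hypothetical, and timestamps are passed in explicitly so the behavior is easy to check.

```python
from collections import defaultdict, deque


class WindowedFailureTracker:
    """Blacklist a session only after `threshold` failures within `window_s` seconds."""

    def __init__(self, threshold: int = 3, window_s: float = 60.0) -> None:
        self.threshold = threshold
        self.window_s = window_s
        self._failures: dict[str, deque[float]] = defaultdict(deque)

    def record_failure(self, session_id: str, now: float) -> bool:
        """Record one failure; return True if the session should be blacklisted."""
        events = self._failures[session_id]
        events.append(now)
        # Drop failures that have aged out of the window.
        while events and now - events[0] > self.window_s:
            events.popleft()
        return len(events) >= self.threshold

    def record_success(self, session_id: str) -> None:
        """A successful transfer or probe clears the session's failure history."""
        self._failures.pop(session_id, None)


tracker = WindowedFailureTracker(threshold=3, window_s=60.0)
# One transient stall does not blacklist (a threshold of 1 would).
first = tracker.record_failure("d0", now=0.0)
# Failures spread wider than the window never accumulate.
second = tracker.record_failure("d0", now=100.0)
# Three failures inside one window do trip the trigger.
burst = [tracker.record_failure("d1", now=t) for t in (0.0, 10.0, 20.0)]
print(first, second, burst)
```

Even in this small form the extra state is visible: a per-session deque plus a clearing path on success, which is the bookkeeping the Risks column warns about.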

Recommendation for Q1

Primary: Q1.F (do Q2 fix first). This is upstream of the failure cascade and removes the only situation in which we observe LRU thrashing in our experiments. If Q2 is fixed and re-running E2 still shows mooncake stalls, then the stall is a problem in its own right rather than a side effect of imbalance, and we need defense-in-depth.

Defense-in-depth (cheap): Q1.C (bump mooncake timeout). Single env-var change that raises the timeout to 4× the 30 s default. The only cost is slower detection of a genuinely dead link, so it is safe to do regardless.
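
As a quick sanity check on the numbers, the arithmetic below converts the proposed timeout to nanoseconds and compares it with the duration in the P-side log line. It treats the logged 37452515723 ns as a lower bound on how long D was unresponsive, which is an assumption; the variable names are illustrative only.

```python
NS_PER_S = 1_000_000_000

default_timeout_s = 30              # default quoted in this document
proposed_timeout_s = 120            # Q1.C proposal
observed_stall_ns = 37_452_515_723  # from the P-side timeout log line

proposed_timeout_ns = proposed_timeout_s * NS_PER_S
observed_stall_s = observed_stall_ns / NS_PER_S

margin_vs_default = proposed_timeout_s / default_timeout_s
margin_vs_observed = proposed_timeout_s / observed_stall_s

print(f"proposed timeout:   {proposed_timeout_ns} ns")
print(f"observed stall:     {observed_stall_s:.2f} s")
print(f"margin vs default:  {margin_vs_default:.1f}x")
print(f"margin vs observed: {margin_vs_observed:.1f}x")
```

Measured against the one stall we have on record, the headroom is closer to 3.2× than 4×, which is still comfortable but worth keeping in mind if heavier loads lengthen the trim.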

Avoid for now: Q1.B and Q1.D. Both touch vendored SGLang in invasive ways that change failure-detection semantics. Hold until Q1.F + Q1.C demonstrate they aren't enough.

Open question for the team: does SGLang's existing low_watermark LRU trigger (Q1.A) already run proactively? If we read the scheduler loop and find it only trims on demand, Q1.A is a small targeted change worth doing. If it's already proactive, the inline trims we observe mean the watermark is set too high, and the fix is to tune the constant.


Q2 — Cold-D never gets a session

What we already know is wrong

User's observation: the existing migration_reject_threshold=3 mechanism fires after 3 wasted prefills, which is too late. The fix needs to be proactive: the first request to a fresh session should already prefer the cold D over a hot D whose only advantage is shared boilerplate overlap.

Design space

Let assigned[D] = state.decode_assignment_counts[D] and inflight[D] = state.inflight_decode[D]. Lex score is currently:

score(D) = (overlap + α·sticky, sticky, -inflight, -assigned)

| # | Fix | Mechanism | Assumes | Risks |
|---|---|---|---|---|
| Q2.A | Cold-D bonus (binary, what the reverted commit did) | cold_boost = K if assigned[D]==0 and not sticky else 0; add to lex position 0. | Each D needs to be "popped" from cold once; after that the bonus disappears. | One-shot: only protects the first session per D. After all 3 D's have ≥1 session, bonus is 0 everywhere and we're back to overlap-dominates-everything. If new session pressure remains skewed (e.g. boilerplate keeps growing on D0/D1), we re-imbalance silently. |
| Q2.B | Load-floor bonus (graduated, my recommended primary) | floor_bonus = max(0, K · (1 - assigned[D] / max(assigned[*]))) (or similar continuous fn); add to lex position 0; gated on not sticky. | "Lower assignment count = preferable for fresh sessions" is a sound bias even when no D is fully cold. | Tuning: K must dominate boilerplate overlap (~50 blocks here) but not so much that it drowns out genuine prefix-cache wins (a session with real 800-block overlap with one D should still go there). Suggest K ≈ 4×median(overlap_for_fresh_sessions), i.e. about 200 at ~50 blocks. |
| Q2.C | Lex re-order: inflight first | Change score to (-inflight, overlap + α·sticky, sticky, -assigned). | Idle D always wins → idle D2 wins fresh sessions immediately. | Contradicts the existing design intent (overlap-first = cache-locality-first). Hurts cache reuse when load is balanced. Sticky requests at turn 1+ might be diverted to a momentarily idle D, breaking cache locality of subsequent turns. |
| Q2.D | Capacity-aware overlap discount | effective_overlap = overlap · (1 - inflight[D] / max_inflight); replace overlap in score. | Loaded D's overlap is worth less than idle D's overlap because of queueing cost. Matches what theory says about cache-vs-load tradeoff. | More complex than Q2.B; needs max_inflight estimate (per-D? global?). Harder to reason about and tune. Gains only marginal modeling correctness over Q2.B. |
| Q2.E | Pre-warm cold D's at startup | After SGLang warmup, send a synthetic request whose hash_ids cover the boilerplate prefix to each D, populating state.resident[D] evenly. | We can identify "the shared boilerplate" by inspecting the trace before launch (or extracting common prefix at run start). | Trace-aware / requires upstream knowledge. Doesn't help workloads with multiple distinct shared prefixes. Workload-coupled; feels brittle. |
| Q2.F | Drop overlap unless "material" | Apply overlap term only when overlap > τ blocks (or > τ% of input). | Tiny overlap doesn't actually save meaningful prefill work. | Hides imbalance instead of solving it. If a workload has medium overlap (say 15%), the overlap clears the threshold, the term still applies, and we're back to imbalance. Doesn't address the bigger issue. |
| Q2.G | Fix the substring filter (the actual _is_admission_rejection_mode bug) | Either widen _ADMISSION_REJECTION_SUBSTRINGS to include "kvcache-centric", or call state.record_admission_reject directly from the actual reject signal site instead of string-matching after the fact. | Existing migration mechanism is sound once it gets fed the right signal. | User has explicitly said 3-reject threshold is too late. So Q2.G alone isn't enough. But it's still a real bug, and fixing it is orthogonal cleanup. |
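
The reason D2 stays cold under the current score follows directly from how tuples compare. The sketch below rebuilds score(D) as a plain Python tuple; lex_score, the α value of 1000, and the inflight / assigned numbers are made up for illustration, while the ~50-block boilerplate overlap is the figure quoted in this document.

```python
def lex_score(overlap: int, sticky: bool, inflight: int, assigned: int,
              alpha: int = 1000) -> tuple[int, int, int, int]:
    """score(D) = (overlap + alpha*sticky, sticky, -inflight, -assigned)."""
    return (overlap + alpha * int(sticky), int(sticky), -inflight, -assigned)


# Fresh session (not sticky anywhere). D0/D1 hold the shared boilerplate
# (~50 blocks of overlap); D2 is idle and has never been assigned a session.
candidates = {
    "D0": lex_score(overlap=50, sticky=False, inflight=12, assigned=25),
    "D1": lex_score(overlap=50, sticky=False, inflight=9, assigned=24),
    "D2": lex_score(overlap=0, sticky=False, inflight=0, assigned=0),
}

# Python compares tuples element by element, so position 0 decides whenever
# it differs; inflight and assigned only ever break ties.
winner = max(candidates, key=candidates.get)
print(winner, candidates[winner])
```

D2 loses to both hot D's no matter how idle it is, because its position-0 value is lower. Any fix that wants load to matter for fresh sessions has to change position 0 (Q2.A, Q2.B, Q2.D, Q2.F) or the order of positions (Q2.C).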

Recommendation for Q2

Primary: Q2.B (load-floor bonus, graduated).

  • Continuous, not binary one-shot like Q2.A — gracefully handles the case where new sessions keep arriving and load needs to keep spreading.
  • Decouples "node-idle preference" from overlap as separate signals — composable, debuggable.
  • Sticky routing is untouched because the bonus is gated on not sticky → no risk of breaking turn 1+ cache locality.
  • Single knob (K) to tune.

Orthogonal cleanup: Q2.G (fix the reject-substring filter). This is independent of Q2.B: the migration mechanism is the backstop for cases where the load-floor bonus alone can't move a session off a saturated D mid-session. User correctly noted that waiting for 3 rejects is too late as the primary mechanism, but as a backstop behind primary load balancing it's still valuable.
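
The two Q2.G options can be sketched side by side. Everything here is hypothetical except the "kvcache-centric" substring and the threshold of 3, which come from this document: the starting contents of the filter tuple, the mode label, and the RejectState class are invented for illustration and do not reproduce the real router code.

```python
from collections import Counter

# Hypothetical filter contents; the real tuple in the router may differ.
ADMISSION_REJECTION_SUBSTRINGS = ("admission-reject",)


def is_admission_rejection_mode(mode: str, substrings: tuple[str, ...]) -> bool:
    """After-the-fact classification by substring match on a mode label."""
    return any(s in mode for s in substrings)


class RejectState:
    """Counts rejects per session and flags migration at a threshold."""

    def __init__(self, migration_reject_threshold: int = 3) -> None:
        self.threshold = migration_reject_threshold
        self.rejects: Counter[str] = Counter()

    def record_admission_reject(self, session_id: str) -> bool:
        self.rejects[session_id] += 1
        return self.rejects[session_id] >= self.threshold


mode = "kvcache-centric-fallback"  # made-up label containing "kvcache-centric"

# The narrow filter misses the label, so no reject would be counted.
missed = not is_admission_rejection_mode(mode, ADMISSION_REJECTION_SUBSTRINGS)
# Option 1: widen the filter.
widened = ADMISSION_REJECTION_SUBSTRINGS + ("kvcache-centric",)
caught = is_admission_rejection_mode(mode, widened)

# Option 2: record at the reject site, with no string matching at all.
state = RejectState()
flags = [state.record_admission_reject("session-42") for _ in range(3)]
print(missed, caught, flags)
```

Option 1 is the smaller diff but keeps the coupling to label text; option 2 removes the class of bug at the cost of touching the reject site.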

Avoid: Q2.C (lex re-order destroys overlap-first design). Avoid: Q2.E (workload-coupled, brittle). Q2.D / Q2.F are reasonable but more complex than Q2.B with marginal gain.

Concrete shape of Q2.B (for review, not for merge)

# In KvAwarePolicy.select, replacing the current score line:
total_assigned = sum(state.decode_assignment_counts.values())
n_decoders = max(1, len(topology.route_workers))
mean_assigned = total_assigned / n_decoders

# Per-D fairness deficit: how much below the running mean is this D?
deficit = max(0, mean_assigned - state.decode_assignment_counts.get(worker.worker_id, 0))
floor_bonus = int(self.load_floor_bonus * deficit / max(1, mean_assigned)) if not sticky else 0

score = (
    overlap + sticky * self.sticky_bonus + floor_bonus,
    sticky,
    inflight_penalty,
    assignment_penalty,
)

Knob: load_floor_bonus: int = 0 (off by default, opt-in). When set to e.g. 200, an empty D that should have 16 sessions but has 0 gets floor_bonus = 200 * 16 / 16 = 200, dominating boilerplate overlap (~50). A D that's only 1 session below mean gets floor_bonus = 200 * 1 / 16 ≈ 12, which doesn't override real prefix-cache wins.
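
The worked numbers above can be checked with a standalone version of the formula. This is a sketch under assumptions: the free function floor_bonus and the example count dictionaries are illustrative, and the number of decoders is taken from the dictionary itself rather than from topology.route_workers, so every D must have an entry.

```python
def floor_bonus(load_floor_bonus: int, counts: dict[str, int],
                worker_id: str, sticky: bool) -> int:
    """Deficit-vs-mean bonus from the sketch above; 0 for sticky requests."""
    if sticky:
        return 0
    # Assumes `counts` has one entry per decoder, including empty ones.
    mean_assigned = sum(counts.values()) / max(1, len(counts))
    deficit = max(0, mean_assigned - counts.get(worker_id, 0))
    return int(load_floor_bonus * deficit / max(1, mean_assigned))


K = 200
cold = {"D0": 24, "D1": 24, "D2": 0}   # D2 empty, mean is 16
near = {"D0": 17, "D1": 16, "D2": 15}  # D2 one below a mean of 16

cold_bonus = floor_bonus(K, cold, "D2", sticky=False)
near_bonus = floor_bonus(K, near, "D2", sticky=False)
hot_bonus = floor_bonus(K, cold, "D0", sticky=False)
sticky_bonus = floor_bonus(K, cold, "D2", sticky=True)
print(cold_bonus, near_bonus, hot_bonus, sticky_bonus)

# Position 0 of the score for a fresh session:
boilerplate_on_d0 = 50 + hot_bonus  # D0 has only boilerplate overlap
real_win_on_d0 = 800 + hot_bonus    # D0 has a genuine 800-block prefix hit
cold_d2 = 0 + cold_bonus
print(cold_d2 > boilerplate_on_d0, real_win_on_d0 > cold_d2)
```

The 12 comes from int() truncating 12.5. With K = 200 the empty D2 beats boilerplate-only overlap on D0 but still loses to a real 800-block hit, which is the ordering the tuning note asks for.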

But this is just a sketch — real tuning needs an empirical pass on the same Inferact subset to verify D2 receives sessions and overlap-driven cache wins survive on D0/D1.

Validation plan if we go with Q2.B

  1. Implement Q2.B + flag, default off.
  2. Re-run E2 on the same outputs/inferact_50sess.jsonl subset with --kvcache-load-floor-bonus 200.
  3. Check structural log: do D0/D1/D2 each get a non-trivial share of session-d-binding.jsonl rows?
  4. Check failure rate: drop from 1054 → < 100? (Hypothesis: yes, because the LRU thrash that triggered the mooncake hair-trigger was downstream of D0/D1 saturation.)
  5. Check direct-to-D rate: should stay similar or improve (load-balancing should not destroy cache reuse, since sticky still wins for known sessions).
  6. Re-evaluate H1 with E1 vs the new E2.
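
Step 3 could be checked with a few lines like the following. The schema of session-d-binding.jsonl is not described in this document, so the decode_worker field name is an assumption (exposed as a parameter), and the sample rows are synthetic.

```python
import json
from collections import Counter


def decode_shares(lines, worker_key: str = "decode_worker") -> dict[str, float]:
    """Fraction of binding rows per decode worker; `worker_key` is an assumed field name."""
    counts = Counter()
    for line in lines:
        line = line.strip()
        if line:
            counts[json.loads(line)[worker_key]] += 1
    total = sum(counts.values())
    if not total:
        return {}
    return {worker: n / total for worker, n in sorted(counts.items())}


# Synthetic sample rows; a real run would pass an open file handle for
# session-d-binding.jsonl instead.
sample = [
    '{"session_id": "s1", "decode_worker": "D0"}',
    '{"session_id": "s2", "decode_worker": "D1"}',
    '{"session_id": "s3", "decode_worker": "D2"}',
    '{"session_id": "s4", "decode_worker": "D0"}',
]
shares = decode_shares(sample)
print(shares)
```

A D that never appears in the output received no bindings at all, which is the cold-D2 signature this plan is meant to rule out.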

Decision points (for review)

| # | Question | Default if no answer |
|---|---|---|
| D1 | Q1: do Q2 fix first and re-measure before touching mooncake / SGLang? | Yes (recommended) |
| D2 | Q1: bump mooncake MC_TRANSFER_TIMEOUT_NS to 120 s as cheap defense-in-depth? | Yes |
| D3 | Q2: is Q2.B (load-floor bonus, graduated) the right shape, or should we pick a different option from the table? | Q2.B |
| D4 | Q2: also do Q2.G (fix the reject-substring filter) as orthogonal cleanup? | Yes |
| D5 | Q2.B: is the proposed deficit-vs-mean formula OK, or do you prefer a simpler "bonus = K · (max - mine) / max" form? | Defer |
| D6 | Q2.B: bonus magnitude K = 200 reasonable, or want to grid-search a few values? | Try 200 first |
| D7 | Validation: re-run E2 on same 50-session subset, or expand to 100 sessions for more headroom? | Same subset |
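
For D5, it may help to see the two candidate formulas on the same input. The sketch below is illustrative only: bonus_vs_mean and bonus_vs_max are hypothetical helper names, the assignment counts are invented, and K = 200 is the value proposed in D6.

```python
def bonus_vs_mean(k: int, counts: dict[str, int], worker: str) -> int:
    """Deficit-vs-mean form: only D's below the average get a bonus."""
    mean = sum(counts.values()) / max(1, len(counts))
    deficit = max(0, mean - counts[worker])
    return int(k * deficit / max(1, mean))


def bonus_vs_max(k: int, counts: dict[str, int], worker: str) -> int:
    """Simpler form: bonus = K * (max - mine) / max."""
    top = max(counts.values())
    return int(k * (top - counts[worker]) / top) if top else 0


counts = {"D0": 24, "D1": 18, "D2": 6}
comparison = {
    worker: (bonus_vs_mean(200, counts, worker), bonus_vs_max(200, counts, worker))
    for worker in counts
}
for worker, (vs_mean, vs_max) in comparison.items():
    print(f"{worker}: vs-mean={vs_mean:3d}  vs-max={vs_max:3d}")
```

On this input the max-based form also gives D1, which is already above the mean, a bonus of 50, about the size of the boilerplate overlap. The mean-based form gives a bonus only to D's below average, so the choice is partly about whether mid-loaded D's should be pulled toward fresh sessions at all.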

Once the shape is approved, the next implementation pass is small and concentrated in policies.py + replay.py + cli.py (no SGLang vendor changes needed for the primary fix).