docs(experiments): design space for Q1 (mooncake stall) + Q2 (cold-D)
For Q1 (D scheduler LRU starves mooncake control plane → 30s batch_transfer_sync timeout → hair-trigger blacklist), six candidate fixes were evaluated. Recommendation: do the Q2 fix first, since it removes the only condition under which we observe LRU thrash; bump the mooncake timeout to 120s as cheap defense-in-depth; avoid invasive SGLang vendor changes (windowed hair-trigger, async eviction thread) until the Q2 fix plus the timeout bump prove insufficient.

For Q2 (overlap-first lex score + shared boilerplate → permanent D2 cold), seven candidate fixes were evaluated. Recommendation: load-floor bonus (graduated, decoupled from overlap, gated on not-sticky) as the primary mechanism. It is proactive on first touch, as the user requested, and avoids the binary one-shot pitfall of the reverted cold-D bonus. Orthogonal cleanup: fix the substring filter in _is_admission_rejection_mode so the existing migration mechanism serves as a backstop when load balancing alone isn't enough.

7 decision points listed for review; no code merged until a shape is approved.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
docs/E1_E2_FIX_DESIGN_ZH.md
@@ -0,0 +1,137 @@
# E1 / E2 Failure Modes: Fix Design Space (no code changes)

**Status**: design proposal for review.
**Branch**: `h200-cu130`.
**Companion**: `docs/E1_E2_RESULTS_ZH.md` §5b–§5d for the forensic findings this design responds to.

This document evaluates candidate fixes for the two pathologies E1 / E2 exposed:

- **Q1**: D scheduler thread starves the mooncake C++ control plane during LRU evictions, causing P-side `batch_transfer_sync` to time out (~30 s) and the hair-trigger in `conn.py:1270` to permanently blacklist the D's `mooncake_session_id`.
- **Q2**: KvAwarePolicy's overlap-first lex score, combined with workloads where new sessions share boilerplate hash_ids with already-resident sessions on D0/D1, leaves D2 cold for the entire run.

For each problem we list candidate fixes, the layer they touch, their assumptions, and what could go wrong. **No code is committed** until a path is chosen.

---
## Q1: Eviction starves mooncake control plane

### Mechanism recap

Inside `decode-0.log` at the moment of the P-side timeout (`Sync batch data transfer timeout after 37452515723ns`, i.e. about 37.5 s elapsed):

```
01:56:34 Decode batch ... gen 174 tok/s ← serving fine
01:56:42 session id 1000315 does not exist, cannot delete.
01:56:42 Trimmed decode session cache via LRU. evicted=2, freed=77675, available 38574 → 116249
01:56:42 Trimmed decode session cache via LRU. evicted=1, freed=36166, available 29038 → 65204
01:56:42 Decode transfer failed ... ← P-side timeout fires
```

`maybe_trim_decode_session_cache` (in the vendored sglang scheduler) walks per-session resident bookkeeping, releases GPU KV slots via `kv_pool_allocator.free()`, and updates `session_aware_cache` under lock. While that runs, the scheduler main loop is busy and the mooncake control-plane callbacks scheduled into the same event loop are not serviced. P sees no completion ack within the 30 s timeout, `batch_transfer_sync` returns nonzero, and the hair-trigger fires.
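The starvation described above can be reproduced in miniature. The following is a toy model only: it assumes a single-threaded `asyncio` loop stands in for the scheduler loop, and `blocking_trim` and `measure_ack_delay` are hypothetical names, not SGLang or mooncake code. It shows that a callback queued before a synchronous trim cannot run until the trim returns, so the observed ack delay is at least the trim duration.

```python
import asyncio
import time


def blocking_trim(duration_s: float) -> None:
    """Stand-in for an inline LRU trim that holds the loop thread."""
    time.sleep(duration_s)


async def measure_ack_delay(trim_s: float) -> float:
    """Return how long a queued callback waited behind a blocking trim."""
    loop = asyncio.get_running_loop()
    fired_at: list[float] = []

    scheduled_at = time.monotonic()
    # Stand-in for a control-plane completion callback on the same loop.
    loop.call_soon(lambda: fired_at.append(time.monotonic()))

    # The trim runs synchronously, so the loop cannot service the callback yet.
    blocking_trim(trim_s)

    # Yield once; the callback was queued first, so it runs before we resume.
    await asyncio.sleep(0)
    return fired_at[0] - scheduled_at


if __name__ == "__main__":
    delay = asyncio.run(measure_ack_delay(0.2))
    print(f"callback waited {delay:.3f}s behind the trim")
```

In this model, moving `blocking_trim` off the loop thread (the idea behind Q1.B) or making it run when nothing is waiting (Q1.A) are the two ways to shrink the delay.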
### Design space

| # | Fix | Layer | Mechanism | Assumes | Risks |
|---|---|---|---|---|---|
| **Q1.A** | Pre-emptive low-watermark eviction | vendored SGLang | Trigger LRU when `token_usage > 0.7` in idle scheduler ticks, so admission rarely needs to evict inline. SGLang already has `_decode_session_cache_low_watermark_tokens`; the question is whether it currently runs proactively or only on demand. | Idle ticks exist to absorb the work; the per-trim cost is bounded enough that doing it pre-emptively doesn't hurt the steady state. | If proactive trims pick "warm" sessions (recently active), we lose direct-to-D fast-path hits. Needs careful watermark + LRU-priority tuning. |
| **Q1.B** | Async eviction thread | vendored SGLang | Move LRU trim off the scheduler main loop into a background worker. Scheduler main loop only calls `notify_evict_needed()`; mooncake control plane keeps running. | KV pool free / `session_aware_cache` mutations can be made thread-safe with reasonable lock granularity. | Largest blast radius. Concurrent in-flight transfers can race with eviction of the same KV slots; needs explicit ref-counting. Harder to reason about correctness. |
| **Q1.C** | Bump mooncake transfer timeout | mooncake env / wheel patch | Set `MC_TRANSFER_TIMEOUT_NS` (or equivalent) from the 30 s default → 120 s+, giving D's eviction more headroom before P gives up. | A real broken link won't go unnoticed for ≥120 s. | Pure defense-in-depth. Doesn't fix LRU thrashing; under heavier load eviction could exceed 120 s too. Slows real-failure detection. |
| **Q1.D** | Windowed hair-trigger | vendored SGLang `conn.py:1270` | Replace `if session_failures >= 1:` with `if session_failures ≥ N within window`. Add a periodic probe to the D bootstrap port to clear `failed_sessions` after success. | Transient stalls are recoverable; real deaths are not. | Changes core failure semantics. We may keep dispatching to a D that is actually slow-dying. Adds windowed-state bookkeeping to a stable codepath. |
| **Q1.E** | Router-side backpressure | our `--enable-backpressure` (already exists, off by default) | D returns `recommended_pause_ms` in its admission RPC when pool > threshold; router pauses dispatch to that D. Already implemented. | Pausing dispatch upstream prevents D from ever reaching saturation, so LRU never thrashes. | Doesn't help in-flight transfers when the stall happens; only prevents future arrivals. Won't rescue requests already mid-mooncake when LRU fires. |
| **Q1.F** | Upstream load balance (= Q2 fix) | our `policies.py` | Spread sessions to D2 so D0/D1's KV pool never saturates; LRU never trims; mooncake never stalls; hair-trigger never fires. | Q2 fix is sound and the workload's KV demand fits into 3 D's evenly. | The LRU+mooncake interaction stays latent. A different workload that still imbalances (e.g. a few sessions much larger than others) could re-trigger it. |
### Recommendation for Q1

**Primary: Q1.F (do Q2 fix first).** This is upstream of the failure cascade and removes the only situation in which we observe LRU thrashing in our experiments. If Q2 is fixed and re-running E2 still shows mooncake stalls, then we *know* the stall is a real problem in its own right, not just a downstream symptom of imbalance, and we need defense-in-depth.

**Defense-in-depth (cheap): Q1.C (bump mooncake timeout).** Single env-var change; 30 s → 120 s gives a 4× safety margin. The only cost is slower detection of genuinely broken links (see the Q1.C risks in the table), so it is safe to do regardless.

**Avoid for now: Q1.B and Q1.D.** Both touch vendored SGLang in invasive ways that change failure-detection semantics. Hold until Q1.F + Q1.C prove insufficient.
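For reference while Q1.D stays on hold, the windowed rule it describes could look roughly like the sketch below. `WindowedFailureTrigger`, its parameters, and the caller-supplied timestamps are all hypothetical; this is not the logic at `conn.py:1270`, only an illustration of "N failures within a window" with a reset on a successful probe. Even this small version carries per-session state that the current `>= 1` check does not need, which is part of why the change is deferred.

```python
from collections import deque


class WindowedFailureTrigger:
    """Blacklist only after `max_failures` failures inside `window_s` seconds."""

    def __init__(self, max_failures: int, window_s: float) -> None:
        self.max_failures = max_failures
        self.window_s = window_s
        self._failures: deque = deque()  # timestamps of recent failures

    def record_failure(self, now_s: float) -> bool:
        """Record one failure; return True if the session should be blacklisted."""
        self._failures.append(now_s)
        # Drop failures that have aged out of the window.
        while self._failures and now_s - self._failures[0] > self.window_s:
            self._failures.popleft()
        return len(self._failures) >= self.max_failures

    def record_success(self) -> None:
        """A successful transfer or probe clears the failure history."""
        self._failures.clear()
```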

**Open question for the team**: does SGLang's existing `low_watermark` LRU trigger (Q1.A) already run proactively? If we read the scheduler loop and find it only trims on demand, Q1.A is a small targeted change worth doing. If it is already proactive, the inline trims we observe mean the watermark is set too high, and the fix is to tune the constant.

---
## Q2: Cold-D never gets a session

### What we already know is wrong

User's observation: the existing `migration_reject_threshold=3` mechanism fires *after 3 wasted prefills*, which is too late. The fix needs to be *proactive*: the first request of a fresh session should already prefer the cold D over a hot D whose only advantage is shared boilerplate overlap.
### Design space

Let `assigned[D] = state.decode_assignment_counts[D]` and `inflight[D] = state.inflight_decode[D]`. The lex score is currently:

```
score(D) = (overlap + α·sticky, sticky, -inflight, -assigned)
```
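To see why this ordering keeps D2 cold, recall that Python compares tuples position by position, so later positions only matter on an exact tie in position 0. The sketch below is illustrative: the ~50-block boilerplate overlap comes from this document, while `ALPHA` and the inflight/assigned numbers are made-up placeholders.

```python
from typing import NamedTuple


class Candidate(NamedTuple):
    name: str
    overlap: int   # blocks shared with what is resident on this D
    sticky: int    # 1 if the session is already bound to this D, else 0
    inflight: int
    assigned: int


ALPHA = 1000  # placeholder for the sticky weight α; the real value is not given here


def lex_score(c: Candidate) -> tuple:
    """Current overlap-first score, compared lexicographically."""
    return (c.overlap + ALPHA * c.sticky, c.sticky, -c.inflight, -c.assigned)


# A fresh session (sticky = 0 everywhere) that shares boilerplate with D0/D1 only.
candidates = [
    Candidate("D0", overlap=50, sticky=0, inflight=8, assigned=24),
    Candidate("D1", overlap=50, sticky=0, inflight=7, assigned=24),
    Candidate("D2", overlap=0, sticky=0, inflight=0, assigned=0),
]

# Position 0 decides: 50 > 0, so the idle D2 loses before inflight is consulted.
best = max(candidates, key=lex_score)
```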

| # | Fix | Mechanism | Assumes | Risks |
|---|---|---|---|---|
| **Q2.A** | Cold-D bonus (binary, what the reverted commit did) | `cold_boost = K if assigned[D]==0 and not sticky else 0`; add to lex position 0. | Each D needs to be "popped" from cold once; after that the bonus disappears. | One-shot: only protects the first session per D. After all 3 D's have ≥1 session, the bonus is 0 everywhere and we're back to overlap-dominates-everything. If new session pressure remains skewed (e.g. boilerplate keeps growing on D0/D1), we re-imbalance silently. |
| **Q2.B** | Load-floor bonus (graduated, my recommended primary) | `floor_bonus = max(0, K · (1 − assigned[D] / max(assigned[*])))` (or a similar continuous function); add to lex position 0; gated on `not sticky`. | "Lower assignment count = preferable for fresh sessions" is a sound bias even when no D is fully cold. | Tuning: K must dominate boilerplate overlap (~50 blocks here) but not so much that it drowns out genuine prefix-cache wins (a session with real 800-block overlap with one D should still go there). That bounds K to roughly 50 < K < 800 for this workload; start from a small multiple of median(overlap_for_fresh_sessions), e.g. K = 200 (4×). |
| **Q2.C** | Lex re-order: inflight first | Change score to `(-inflight, overlap + α·sticky, sticky, -assigned)`. | Idle D always wins ties → idle D2 wins fresh sessions immediately. | Contradicts the existing design intent (overlap-first = cache-locality-first). Hurts cache reuse when load *is* balanced. Sticky requests at turn 1+ might be diverted to a momentarily idle D, breaking cache locality of subsequent turns. |
| **Q2.D** | Capacity-aware overlap discount | `effective_overlap = overlap · (1 − inflight[D] / max_inflight)`; replace `overlap` in score. | A loaded D's overlap is worth less than an idle D's overlap because of queueing cost. Matches what theory says about the cache-vs-load tradeoff. | More complex than Q2.B; needs a `max_inflight` estimate (per-D? global?). Harder to reason about and tune. Saves only marginal modeling correctness over Q2.B. |
| **Q2.E** | Pre-warm cold D's at startup | After SGLang warmup, send a synthetic request whose hash_ids cover the boilerplate prefix to each D, populating `state.resident[D]` evenly. | We can identify "the shared boilerplate" by inspecting the trace before launch (or extracting the common prefix at run start). | Trace-aware; requires upstream knowledge. Doesn't help workloads with multiple distinct shared prefixes. Workload-coupled, so it feels brittle. |
| **Q2.F** | Drop overlap unless "material" | Apply the overlap term only when overlap > τ blocks (or > τ% of input). | Tiny overlap doesn't actually save meaningful prefill work. | Hides imbalance instead of solving it. If a workload has medium overlap (say 15%), the threshold won't fire and we're back to imbalance. Doesn't address the bigger issue. |
| **Q2.G** | Fix the substring filter (the actual `_is_admission_rejection_mode` bug) | Either widen `_ADMISSION_REJECTION_SUBSTRINGS` to include `"kvcache-centric"`, or call `state.record_admission_reject` directly from the actual reject signal site instead of string-matching after the fact. | Existing migration mechanism is sound *once* it gets fed the right signal. | User has explicitly said the 3-reject threshold is too late, so Q2.G alone isn't enough. It is still a real bug; fixing it is orthogonal cleanup. |

### Recommendation for Q2

**Primary: Q2.B (load-floor bonus, graduated).**

- Continuous, not binary one-shot like Q2.A, so it gracefully handles the case where new sessions keep arriving and load needs to keep spreading.
- Decouples "node-idle preference" from overlap as separate signals: composable, debuggable.
- Sticky routing is untouched because the bonus is gated on `not sticky`, so there is no risk of breaking turn 1+ cache locality.
- Single knob (`K`) to tune.

**Orthogonal cleanup: Q2.G (fix the reject-substring filter).** Independent of Q2.B, since the migration mechanism is the *backstop* (when the load-floor bonus alone isn't enough to migrate from a saturated D mid-session). User correctly noted that waiting for 3 rejects is too late as the *primary* mechanism, but as a *backstop after* primary load balancing, it's still valuable.

**Avoid: Q2.C** (lex re-order destroys overlap-first design). **Avoid: Q2.E** (workload-coupled, brittle). **Q2.D / Q2.F** are reasonable but more complex than Q2.B with marginal gain.
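The difference between the binary Q2.A bonus and the graduated Q2.B bonus shows up as soon as the cold D has received a single session. The sketch below uses the two table formulas directly; the assignment counts are invented for illustration, and the `max_assigned == 0` guard is an added assumption, since the table formula would otherwise divide by zero before any assignment exists.

```python
def cold_boost(k: int, assigned: int, sticky: bool) -> int:
    """Q2.A: binary bonus, granted only to a D with zero assignments."""
    return k if assigned == 0 and not sticky else 0


def floor_bonus(k: int, assigned: int, max_assigned: int, sticky: bool) -> float:
    """Q2.B (table form): K * (1 - assigned / max(assigned[*])), floored at 0."""
    if sticky or max_assigned == 0:  # guard is an assumption, not in the table
        return 0.0
    return max(0.0, k * (1 - assigned / max_assigned))


K = 200
# D2 has just received its first session; D0/D1 are still heavily loaded.
counts = {"D0": 24, "D1": 24, "D2": 1}
busiest = max(counts.values())

q2a = {d: cold_boost(K, n, sticky=False) for d, n in counts.items()}
q2b = {d: floor_bonus(K, n, busiest, sticky=False) for d, n in counts.items()}

# q2a is 0 for every D: the one-shot bonus is already spent.
# q2b still gives D2 most of K (about 191.7 here) and gives D0/D1 nothing.
```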
### Concrete shape of Q2.B (for review, not for merge)

```python
# In KvAwarePolicy.select, replacing the current score line.
# `state`, `topology`, `worker`, `overlap`, `sticky`, `inflight_penalty`, and
# `assignment_penalty` are assumed to be in scope from the surrounding method.
total_assigned = sum(state.decode_assignment_counts.values())
n_decoders = max(1, len(topology.route_workers))
mean_assigned = total_assigned / n_decoders

# Per-D fairness deficit: how far below the running mean is this D?
assigned_here = state.decode_assignment_counts.get(worker.worker_id, 0)
deficit = max(0, mean_assigned - assigned_here)
# Graduated bonus, truncated to int; sticky requests never receive it.
floor_bonus = 0 if sticky else int(self.load_floor_bonus * deficit / max(1, mean_assigned))

score = (
    overlap + sticky * self.sticky_bonus + floor_bonus,
    sticky,
    inflight_penalty,
    assignment_penalty,
)
```

Knob: `load_floor_bonus: int = 0` (off by default, opt-in). When set to e.g. 200, an empty D where the mean is 16 sessions gets `floor_bonus = 200 * 16 / 16 = 200`, dominating boilerplate overlap (~50). A D that is only 1 session below the mean gets `floor_bonus = int(200 * 1 / 16) = 12` (12.5 truncated), which doesn't override real prefix-cache wins.

This is only a *sketch*. Real tuning needs an empirical pass on the same Inferact subset to verify that D2 receives sessions and that overlap-driven cache wins survive on D0/D1.
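The worked numbers above can be checked with a standalone version of the same deficit-vs-mean formula. This is a simplified sketch, not the policy code: `load_floor_bonus` and `pick_decoder` are hypothetical free functions, the score keeps only the overlap and assignment terms, and the sticky and inflight positions are omitted.

```python
def load_floor_bonus(k: int, assigned: int, counts: dict, sticky: bool) -> int:
    """Deficit-vs-mean bonus: K * deficit / mean, truncated to int."""
    if sticky:
        return 0
    n_decoders = max(1, len(counts))
    mean_assigned = sum(counts.values()) / n_decoders
    deficit = max(0.0, mean_assigned - assigned)
    return int(k * deficit / max(1.0, mean_assigned))


def pick_decoder(overlaps: dict, counts: dict, k: int) -> str:
    """Pick the D with the best (overlap + bonus, -assigned) for a fresh session."""
    def score(d: str) -> tuple:
        bonus = load_floor_bonus(k, counts[d], counts, sticky=False)
        return (overlaps[d] + bonus, -counts[d])
    return max(counts, key=score)


# Mean is 16 in both cases: an empty D gets the full K, a D one below gets 12.
cold = {"D0": 24, "D1": 24, "D2": 0}
near = {"D0": 17, "D1": 16, "D2": 15}
full_bonus = load_floor_bonus(200, cold["D2"], cold, sticky=False)
small_bonus = load_floor_bonus(200, near["D2"], near, sticky=False)

# Boilerplate overlap (50) loses to the cold D; a real 800-block overlap does not.
boilerplate_pick = pick_decoder({"D0": 50, "D1": 50, "D2": 0}, cold, k=200)
real_overlap_pick = pick_decoder({"D0": 800, "D1": 50, "D2": 0}, cold, k=200)
```

With `k=0` the bonus vanishes and the choice falls back to overlap-first, which matches the "off by default" behaviour of the knob.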
### Validation plan if we go with Q2.B

1. Implement Q2.B + flag, default off.
2. Re-run E2 on the same `outputs/inferact_50sess.jsonl` subset with `--kvcache-load-floor-bonus 200`.
3. Check the structural log: do D0/D1/D2 each get a non-trivial share of `session-d-binding.jsonl` rows?
4. Check the failure rate: does it drop from 1054 to < 100? (Hypothesis: yes, because the LRU thrash that triggered the mooncake hair-trigger was downstream of D0/D1 saturation.)
5. Check the direct-to-D rate: it should stay similar or improve (load balancing should not destroy cache reuse, since sticky still wins for known sessions).
6. Re-evaluate H1 with E1 vs the new E2.
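Step 3 can be checked with a few lines of Python. The sketch below assumes each row of `session-d-binding.jsonl` is a JSON object with a field naming the bound decoder; the field name `decode_worker` is a guess, since the file's schema is not specified here, so pass the real key via `key`. A typical call would be `decoder_shares(open(path))`, followed by a check that no D's share is near zero.

```python
import json
from collections import Counter
from typing import Iterable


def decoder_shares(lines: Iterable[str], key: str = "decode_worker") -> dict:
    """Return the fraction of binding rows assigned to each decoder."""
    counts = Counter()
    for line in lines:
        line = line.strip()
        if not line:
            continue  # tolerate blank lines in the log
        counts[json.loads(line)[key]] += 1
    total = sum(counts.values())
    return {d: n / total for d, n in counts.items()} if total else {}


# Illustrative rows only; real rows will carry more fields.
sample_rows = [
    '{"session": "s1", "decode_worker": "D0"}',
    '{"session": "s2", "decode_worker": "D0"}',
    '{"session": "s3", "decode_worker": "D1"}',
    '{"session": "s4", "decode_worker": "D2"}',
]
sample_shares = decoder_shares(sample_rows)
```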

---

## Decision points (for review)

| # | Question | Default if no answer |
|---|---|---|
| D1 | Q1: do Q2 fix first and re-measure before touching mooncake / SGLang? | **Yes** (recommended) |
| D2 | Q1: bump mooncake `MC_TRANSFER_TIMEOUT_NS` to 120 s as cheap defense-in-depth? | Yes |
| D3 | Q2: is Q2.B (load-floor bonus, graduated) the right shape, or should we pick a different option from the table? | Q2.B |
| D4 | Q2: also do Q2.G (fix the reject-substring filter) as orthogonal cleanup? | Yes |
| D5 | Q2.B: is the proposed deficit-vs-mean formula OK, or do you prefer a simpler "bonus = K · (max - mine) / max" form? | Defer |
| D6 | Q2.B: bonus magnitude K = 200 reasonable, or want to grid-search a few values? | Try 200 first |
| D7 | Validation: re-run E2 on same 50-session subset, or expand to 100 sessions for more headroom? | Same subset |

Once the shape is approved, the next implementation pass is small and concentrated in `policies.py` + `replay.py` + `cli.py` (no SGLang vendor changes needed for the primary fix).