docs(failures): consolidated 5-mode failure taxonomy
Consolidates failure modes scattered across V2_DEEP_ANALYSIS,
E1_E2_RESULTS, E3_FINDINGS, KVC_EVICTION_GRANULARITY,
REAL_ALI_KVC_EXPERIMENT into a single lookup table with
five fields per mode: symptom → root cause → trigger →
current mitigation → real fix.
Five modes covered:
A. Mooncake "instance not alive" cascade
— E2 80%-failure pathology; admission no-space →
seed burst → heartbeat drop → batch abort
B. Cold-D / overlap-pinning
— shared boilerplate hash pins all sessions to a
subset of D's; load_floor_bonus is a patch, the
real fix is exclusive_overlap redefinition
C. Evict storm (session-level eviction)
— release_session frees 38–88K tokens in one shot;
fix is BLOCK_LEVEL_EVICTION_DESIGN
C'. Reseed storm (turn-1 concurrent seeds)
— startup-phase mooncake burst; fix is per-D
pending-seed budget, frequency drops after C
D. Streaming-session correction invariant crash (E3)
— schedule_batch.py:1646 landmine, hotfixed by
986f351, root-fix is removing the correction
path entirely (BLOCK_LEVEL_EVICTION §3.7)
Each mode has a forensic link back to the original
experiment doc that surfaced it.
§6 adds a diagnostic cheat sheet: "if you see X, look at Y."
§7 wires every mode to a roadmap item — Milestone 1 should
graduate §1–§4 to "mitigated" and eliminate §5.
INDEX_ZH gets a new §1.6 section linking this and the
SGLang patch inventory.
No code change. Reading dependency for anyone debugging
a sweep or writing paper §Limitations.
docs/FAILURE_MODES_ZH.md
@@ -0,0 +1,222 @@
# Failure-mode Taxonomy

**Date**: 2026-05-13
**Type**: consolidated checklist + diagnostic handbook
**Audience**: collaborators who hit a failure while running experiments and need an immediate lookup; anyone who needs something to cite when writing paper §Limitations; and the answer for a reviewer who asks "why do you think this run will be more stable?"

This document organizes the failure modes identified so far in the current system into a single table of "symptom → root cause → trigger → current mitigation → real fix". Every entry has a forensic link back to the original experiment doc.

---

## 0. TL;DR

Five failure modes have been identified, grouped by whether they block a paper claim:

| Category | Name | Blocks paper | Real fix |
|---|---|:---:|---|
| **A. Control-plane cascade** | Mooncake "instance not alive" cascade | ✅ | admission backoff + per-D pending-seed budget |
| **B. Routing bias** | Cold-D / overlap-pinning | ✅ | first-principles overlap term redefinition |
| **C. KV thrashing** | Evict storm (session-level evict) | ✅ | [BLOCK_LEVEL_EVICTION_DESIGN_ZH.md](BLOCK_LEVEL_EVICTION_DESIGN_ZH.md) |
| **C'. KV thrashing** | Reseed storm (concurrent large seeds at turn 1) | ✅ | per-D pending-seed budget + (frequency drops on its own once C is mitigated) |
| **D. Vendor invariant** | streaming-session correction invariant crash (E3) | ❌ (hotfix already landed) | delete the correction path (after block-level evict is done) |

A, B, and C must be solved in Milestone 1; C' is a secondary cause of A; D has been temporarily contained, but its root fix is tied to C.

---

## 1. A: Mooncake "instance not alive" cascade

### 1.1 Symptoms

- Client side: `RuntimeError: generate stream ended before producing any token`
- D scheduler log: `[mooncake] Decode instance could be dead, dropping ...`
- The whole batch of requests is aborted; a single sweep goes from healthy to roughly 80% failure within minutes ([E1_E2_RESULTS_ZH.md](E1_E2_RESULTS_ZH.md), E2: 1054 / 1285 failed, about 82%)

### 1.2 Root cause (forensic chain)

```
admission no-space (D KV pool full)
  → router immediately falls back to the seed/reseed path
  → multiple concurrent seeds hit mooncake P→D at once
  → P→D egress queues up; the handshake phase times out
  → mooncake marks the peer dead
  → SGLang aborts every in-flight req on the dead link
  → the client sees a batch of generate-stream interruptions
```

### 1.3 Triggers

- D KV pool close to full (≥ ρ·K_d; default 0.95)
- The router fallback chain launches multiple reseeds within a millisecond-scale window
- mooncake heartbeat times out (the default window is short)

### 1.4 Current mitigations

- `--kvcache-seed-min-turn-id=2` skips the large turn-1 seed and reduces the initial burst (stable config on the main branch)
- `--mc-transfer-timeout=1800s` as the default (commit 905d671) reduces false "dead" verdicts
- `--request-timeout-s=180/300` keeps the client from seeing an hour-long hang, but does not stop the cascade itself

→ These all treat symptoms, not the cause. E2 still fails at about 80% on real 4×H200 NDR hardware ([E1_E2_RESULTS_ZH.md](E1_E2_RESULTS_ZH.md)).

### 1.5 Real fix (roadmap §S3)

1. **admission RPC backoff + jitter**: on rejection, do not fall back immediately; give the D scheduler room to breathe.
2. **per-D pending-seed budget**: at most K seeds sit in the transfer queue at any moment; the excess queues up instead of bursting.
3. **Decouple mooncake heartbeat from admission**: the admission path no longer implies "peer is alive".
4. **Close the loop on the backpressure pause hint** ([SGLANG_PATCH_INVENTORY_ZH.md](SGLANG_PATCH_INVENTORY_ZH.md) §2.3, currently EXPERIMENTAL).
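
Item 1 of the list above (admission backoff + jitter) could look roughly like the sketch below. It is only an illustration of the idea: `backoff_delay`, `admit_with_backoff`, the `try_admit` callback, and the default timings are hypothetical names and values, not the router's actual API. It uses the common "full jitter" scheme, where each wait is drawn uniformly from zero up to an exponentially growing cap, so that rejected requests do not all retry (or fall back to seed/reseed) in the same millisecond window.

```python
import random
import time
from typing import Callable, Optional


def backoff_delay(attempt: int, base_s: float = 0.05, cap_s: float = 2.0,
                  rng: Optional[random.Random] = None) -> float:
    """Full-jitter backoff: uniform in [0, min(cap_s, base_s * 2**attempt)]."""
    rng = rng or random
    return rng.uniform(0.0, min(cap_s, base_s * (2 ** attempt)))


def admit_with_backoff(try_admit: Callable[[], bool], max_attempts: int = 4,
                       sleep: Callable[[float], None] = time.sleep,
                       rng: Optional[random.Random] = None) -> bool:
    """Retry a rejected admission a few times before giving up.

    `try_admit` is a hypothetical callback standing in for the admission RPC.
    Returns True once admitted; returns False after `max_attempts` rejections,
    at which point the caller may fall back to the seed/reseed path.
    """
    for attempt in range(max_attempts):
        if try_admit():
            return True
        if attempt + 1 < max_attempts:  # no point sleeping after the last try
            sleep(backoff_delay(attempt, rng=rng))
    return False
```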

---

## 2. B: Cold-D / overlap-pinning

### 2.1 Symptoms

- N=k decode workers, but only ~k-1 actually carry traffic; some D has 0 bindings
- The per-D load histogram is heavily skewed (E2: D0:600 / D1:685 / **D2:0**)
- Overall throughput is bounded by the busiest D; raw latency is not necessarily worse, but capacity utilization is 33%+ worse

### 2.2 Root cause

Inferact / Ali coding agent traces open every session with ~12K tokens of "system prompt + tool schema", and these 24-token blocks share hashes across all sessions. The kv-aware policy's `overlap` term reads them as "this D already holds these hashes" → every new session is scored toward D0/D1 (the first two to warm up) → D2 always has 0 overlap → it is never selected → it stays cold forever.

### 2.3 Triggers

- Multi-session workload + shared boilerplate prefix
- `migration_reject_threshold > 0` and reject never fires (because D0/D1 are not yet full)

### 2.4 Current mitigation

`KvAwarePolicy.load_floor_bonus` (commit 93fce42):

```
floor_bonus = K * max(0, mean - assigned) / max(1, mean)
```

In E3, D2's share of bindings rose from 0 to 22.5% ([E3_FINDINGS_ZH.md](E3_FINDINGS_ZH.md) §1).

→ This is a patch, not a fix. `K` is a magic number; a D still stays cold once the number of boilerplate hashes exceeds `K / sticky_bonus`.
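
To make the formula concrete, here is a small sketch that evaluates it on the E2 per-D counts quoted in §2.1. It assumes that `mean` is the mean of `assigned` across all D's and that `assigned` is the per-D binding count; the source does not spell either out. The function name and `K = 1.0` are illustrative only.

```python
from statistics import fmean
from typing import Sequence


def load_floor_bonus(assigned: int, all_assigned: Sequence[int], k: float) -> float:
    """floor_bonus = K * max(0, mean - assigned) / max(1, mean).

    Assumption: `mean` is the average assignment count over all decode workers.
    """
    mean = fmean(all_assigned)
    return k * max(0.0, mean - assigned) / max(1.0, mean)


# E2 per-D load from section 2.1: D0:600 / D1:685 / D2:0
loads = [600, 685, 0]
K = 1.0  # hypothetical value; the text calls K a magic number
bonuses = [load_floor_bonus(a, loads, K) for a in loads]
# Workers at or above the mean get no bonus; the fully cold worker gets the full K.
```

Under these assumptions the bonus is bounded by `K`, which is why it stops helping once the overlap score contributed by shared boilerplate outweighs it.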

### 2.5 Real fix (roadmap §S5)

Redefine `overlap` as **"the number of prefix hashes this session holds exclusively on this D"**:

```
exclusive_overlap(s, d) := |prefix_hashes(s) ∩ resident[d] ∩ session_owned[s]|
```

Here `session_owned[s]` excludes hashes that other sessions also hold. Shared boilerplate hashes never enter `exclusive_overlap`, so scores spread out naturally. This requires the D side to return, in the `admit_direct_append` response, a sketch (Bloom filter / minhash) of the per-session resident hash set.
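
The definition above can be sketched with exact Python sets. This is a toy model, not the proposed implementation: the text calls for an approximate sketch (Bloom filter / minhash) returned by D, whereas the code below uses exact sets, and the data layout (`sessions`, `resident`) and the small integer hashes are made up for illustration. Hashes 1 to 3 play the role of the shared boilerplate prefix. In this toy data a brand-new session `s3` that carries only boilerplate has a plain overlap of 3 with D0 and D1 but an exclusive overlap of 0 on every D, so the overlap term no longer pulls it toward the warm workers.

```python
from typing import Dict, Set


def session_owned(sessions: Dict[str, Set[int]], s: str) -> Set[int]:
    """Hashes held by session `s` and by no other session."""
    others: Set[int] = set()
    for sid, hashes in sessions.items():
        if sid != s:
            others |= hashes
    return sessions[s] - others


def plain_overlap(sessions: Dict[str, Set[int]], resident: Dict[str, Set[int]],
                  s: str, d: str) -> int:
    """Current-style overlap: every resident prefix hash counts."""
    return len(sessions[s] & resident[d])


def exclusive_overlap(sessions: Dict[str, Set[int]], resident: Dict[str, Set[int]],
                      s: str, d: str) -> int:
    """|prefix_hashes(s) ∩ resident[d] ∩ session_owned[s]|"""
    return len(sessions[s] & resident[d] & session_owned(sessions, s))


# Toy data (hypothetical): hashes 1, 2, 3 are the shared boilerplate.
sessions = {"s1": {1, 2, 3, 10, 11}, "s2": {1, 2, 3, 20}, "s3": {1, 2, 3}}
resident = {"D0": {1, 2, 3, 10, 11}, "D1": {1, 2, 3, 20}, "D2": set()}
```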

---

## 3. C: Evict storm (session-level eviction)

### 3.1 Symptoms

- Under workloads that put D memory under pressure, a KV pool release spike of 30–90K tokens appears every 1–2 minutes
- The next request from the same session then triggers a `Reseed`: P re-prefills 50K + mooncake transfers 50K (3–7s)
- The TTFT long tail is dominated entirely by these reseeds ([V2_DEEP_ANALYSIS_ZH.md](V2_DEEP_ANALYSIS_ZH.md) §3.2)

### 3.2 Root cause

`SessionAwareCache.release_session` calls `free([cache_protected_len, kv_allocated_len))` in one shot, i.e. it frees the entire session-exclusive tail. Measured in E3: 90 evicts, 67,726 tokens freed per evict on average, 25/50 sessions affected ([KVC_EVICTION_GRANULARITY_DESIGN_ZH.md](KVC_EVICTION_GRANULARITY_DESIGN_ZH.md) §0).

→ This contrasts sharply with the gradual, leaf-by-leaf eviction of SGLang's standard radix cache. This part of the KV never enters the radix tree, so it gets none of LRU's fine-grained nibbling.

### 3.3 Triggers

- D KV pool close to full
- `maybe_trim_decode_session_cache` is triggered by the scheduler (when `DecodePreallocQueue` detects `available_size() <= 0`)

### 3.4 Current mitigations

- `--kvcache-session-soft-cap=N` (main branch): caps the number of sessions resident on a D → trims early instead of running into the ceiling
- `--kvcache-direct-max-uncached-tokens=8192` (v2): slows the rate at which the direct path consumes KV

→ Both only slow things down; neither fixes the underlying problem that a single free is too large.

### 3.5 Real fix (roadmap §S1)

[BLOCK_LEVEL_EVICTION_DESIGN_ZH.md](BLOCK_LEVEL_EVICTION_DESIGN_ZH.md): at every turn finish, streaming-session decode output goes into the radix tree via `inner.cache_finished_req` → `release_session` degenerates into `dec_lock_ref` + slot deletion → radix LRU nibbles away 24-token leaves.

Expected: a single evict drops from 67K to ≤ 500 tokens; reseed frequency drops by an order of magnitude.
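
The difference in granularity can be illustrated with a toy model. This is a back-of-the-envelope sketch, not the cache implementation: both functions are hypothetical, the 24-token leaf size and the 67,726-token average tail come from the text, and the 400-token shortfall is an invented example. The session-level path frees the whole exclusive tail no matter how little space is needed; the leaf-level path frees only as many whole leaves as the shortfall requires.

```python
import math

LEAF_TOKENS = 24  # leaf size quoted in the text


def freed_session_level(tail_tokens: int, needed: int) -> int:
    """release_session-style: the whole session-exclusive tail goes at once."""
    return tail_tokens if needed > 0 else 0


def freed_block_level(tail_tokens: int, needed: int, leaf: int = LEAF_TOKENS) -> int:
    """Leaf-by-leaf: free just enough whole leaves to cover `needed`."""
    if needed <= 0:
        return 0
    leaves = math.ceil(needed / leaf)
    return min(tail_tokens, leaves * leaf)


tail = 67_726     # E3 average tokens freed per evict (from section 3.2)
shortfall = 400   # hypothetical number of tokens the scheduler is short by
session_freed = freed_session_level(tail, shortfall)  # entire tail
block_freed = freed_block_level(tail, shortfall)      # 17 leaves of 24 tokens
```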

---

## 4. C': Reseed storm (concurrent large seeds at turn 1)

### 4.1 Symptoms

- During workload startup (the first 30–60s), all sessions hit turn 1 at the same time
- Multiple concurrent `Seed`s (each ~50–90K tokens) hit mooncake → overlaps with the §1 cascade

### 4.2 Root cause

During startup, `resident[d]` in `KvAwarePolicy` is empty for every D, so all D's score the same, but ε retries + per-trial admit do nothing to prevent concurrency.

### 4.3 Triggers

- Under trace replay at `time_scale=1`, sessions start simultaneously at the original arrival density
- No per-D pending-seed rate limit

### 4.4 Current mitigation

- `--kvcache-seed-min-turn-id=2`: skips the turn-1 seed entirely (stable config on the main branch)
- Side effect: the turn-1 KV injection is lost, so turn 2 must go through reseed (which is actually more stable, because reseeds are spread out over time)

### 4.5 Real fix

- per-D pending-seed budget (same as item 2 of §1.5)
- Once §3.5 is done, evict frequency drops on its own, which indirectly lowers reseed frequency
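
A per-D pending-seed budget might be sketched as below. It is a minimal, single-threaded illustration under stated assumptions: `PendingSeedBudget`, `submit`, and `complete` are hypothetical names, and a real router would also need locking or an async equivalent. The point is only the admission rule: at most K seeds per D are in the transfer queue, and anything beyond that waits in FIFO order instead of hitting mooncake at once.

```python
from collections import defaultdict, deque
from typing import Deque, Dict, Optional


class PendingSeedBudget:
    """At most `max_pending_per_d` seeds in transfer per decode worker."""

    def __init__(self, max_pending_per_d: int) -> None:
        if max_pending_per_d < 1:
            raise ValueError("budget must be at least 1")
        self.k = max_pending_per_d
        self.in_flight: Dict[str, int] = defaultdict(int)
        self.waiting: Dict[str, Deque[str]] = defaultdict(deque)

    def submit(self, d: str, seed_id: str) -> bool:
        """Return True if the seed may start now; otherwise queue it."""
        if self.in_flight[d] < self.k:
            self.in_flight[d] += 1
            return True
        self.waiting[d].append(seed_id)
        return False

    def complete(self, d: str) -> Optional[str]:
        """One transfer on `d` finished; return the next seed to start, if any."""
        if self.in_flight[d] <= 0:
            raise RuntimeError(f"no in-flight seed on {d}")
        if self.waiting[d]:
            # The freed slot is handed straight to the oldest queued seed,
            # so the in-flight count stays the same.
            return self.waiting[d].popleft()
        self.in_flight[d] -= 1
        return None
```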

---

## 5. D: Streaming-session correction invariant crash (E3 landmine)

### 5.1 Symptoms

- The D scheduler raises `AssertionError` at `schedule_batch.py:1646`: `seq_len - pre_len == req.extend_input_len`
- The whole D worker process exits → the router sees a dead peer → §1 cascade

### 5.2 Root cause

[E3_FINDINGS_ZH.md](E3_FINDINGS_ZH.md) §2: the streaming-session correction (commit b8e6f13) rewrites `extend_input_len` to `max(0, fill_len - prefix_len)`, but the downstream invariant is still computed from the original fill_ids/prefix_indices. When `fill_len < prefix_len` (the prefix accumulated over multiple turns exceeds the current turn's increment), the invariant is mathematically impossible to satisfy.
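
The arithmetic behind the crash can be shown in a few lines. This is a sketch under one explicit assumption: that the downstream check derives `seq_len` and `pre_len` directly from the lengths of the original fill_ids and prefix_indices, i.e. `seq_len = fill_len` and `pre_len = prefix_len`. The function names are illustrative and are not taken from `schedule_batch.py`.

```python
def corrected_extend_input_len(fill_len: int, prefix_len: int) -> int:
    """The rewrite described above: clamp the extend length at zero."""
    return max(0, fill_len - prefix_len)


def invariant_holds(fill_len: int, prefix_len: int) -> bool:
    """Check `seq_len - pre_len == extend_input_len`.

    Assumption: seq_len and pre_len come straight from the original
    fill_ids / prefix_indices lengths.
    """
    seq_len, pre_len = fill_len, prefix_len
    return seq_len - pre_len == corrected_extend_input_len(fill_len, prefix_len)


normal_turn = invariant_holds(fill_len=100, prefix_len=40)    # 60 == 60
landmine_turn = invariant_holds(fill_len=40, prefix_len=100)  # -60 vs clamped 0
```

In the second case the left-hand side is negative while the clamped value is 0, so no choice of inputs with `fill_len < prefix_len` can pass the check.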

### 5.3 Triggers

- In a streaming session, the prefix already committed across turns is longer than this turn's newly added fill_ids
- E2 never reached this state because its pipeline was blocked; E3 fixed the cold-D bottleneck → faster pipeline → landmine exposed

### 5.4 Current mitigation

The pre-filter pass from commit 986f351: drop such reqs at the entry of `prepare_for_extend` (so the client sees an error response instead of the worker crashing). This only stops the bleeding.

### 5.5 Real fix

The entire correction path at `schedule_batch.py:1572–1646` is **structurally no longer needed** once the block-level eviction refactor is complete: [BLOCK_LEVEL_EVICTION_DESIGN_ZH.md](BLOCK_LEVEL_EVICTION_DESIGN_ZH.md) §3.7 explains that, after the refactor, fill_ids / prefix_indices consistency is guaranteed automatically by radix `match_prefix`.

→ Do not add more correction clauses; delete the whole segment.

---

## 6. Failure diagnosis cheat sheet

When running a sweep, look things up in the table below:

| You see | Most likely | Check first |
|---|---|---|
| Client `RuntimeError: generate stream ended before...` | §1 cascade | search the D scheduler log for `instance could be dead` |
| Some D has `binding=0` while other D's are busy | §2 cold-D | `per_decode_load` histogram |
| TTFT p99 suddenly climbs to the 5–8s range | §3 evict storm | `release_session` call frequency + average tokens freed |
| High failure rate during sweep startup, low in steady state | §4 reseed storm | peak of the mooncake transfer queue in the first 30s |
| D worker process exits abnormally | §5 invariant crash | search the scheduler log for `AssertionError`, `extend_input_len` |
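
The log-searchable rows of the table can be turned into a first-pass triage helper. The sketch below is hypothetical tooling, not something that exists in the repo: `triage` and its arguments are invented names, and only the search strings come from the table. It covers the §1, §2, and §5 rows; the §3 and §4 rows depend on metrics (TTFT p99, transfer-queue peaks) and are left out. A §5 hit is reported first because, per §5.1, a crashed D worker itself sets off the §1 cascade.

```python
from typing import Iterable, List, Mapping, Optional


def triage(log_lines: Iterable[str],
           per_decode_load: Optional[Mapping[str, int]] = None) -> List[str]:
    """Return suspected failure modes, most upstream cause first."""
    lines = list(log_lines)
    suspects: List[str] = []

    # Section 5: assertion on extend_input_len in the D scheduler log.
    if any("AssertionError" in l or "extend_input_len" in l for l in lines):
        suspects.append("§5 invariant crash")

    # Section 1: client-side stream abort or mooncake marking the peer dead.
    if any("generate stream ended before" in l or "instance could be dead" in l
           for l in lines):
        suspects.append("§1 cascade")

    # Section 2: some D has zero bindings while at least one other D is busy.
    if per_decode_load:
        loads = list(per_decode_load.values())
        if min(loads) == 0 and max(loads) > 0:
            suspects.append("§2 cold-D")

    return suspects
```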

---

## 7. Link to the roadmap

- Items 1/3/4 of Milestone 1 in [AUDIT_AND_ROADMAP_ZH.md](AUDIT_AND_ROADMAP_ZH.md) correspond to the real fixes for C / A / B in this table. After Milestone 1, §1–§4 of this table should all move from "unfixed" to "mitigated", and §5 should disappear outright.
- The paper's §Limitations must state the current situation honestly: "we identify five failure modes; A/C are addressed by this work, B/C' are partially addressed, D is a transient artifact of the in-progress refactor."

---

**Key takeaway**: manage failure modes as first-class artifacts. Giving every failure the five fields "symptom → root cause → trigger → mitigation → real fix" is a key tool for pushing a prototype to production grade. A reviewer who sees that you can enumerate your failures is far more convinced than one who only sees you beat the baseline.
@@ -58,7 +58,14 @@
| [REFACTOR_PLAN_V1_ZH.md](REFACTOR_PLAN_V1_ZH.md) | Why we switched from ts=10 to ts=1 |
| [TEAM_REPORT_AGENTIC_PD_HYBRID_ZH.md](TEAM_REPORT_AGENTIC_PD_HYBRID_ZH.md) | List of structural problems from the ts=10 era (mostly superseded) |

### 1.6 Environment
### 1.6 Engineering debt / failure modes

| Document | Contents |
|---|---|
| [SGLANG_PATCH_INVENTORY_ZH.md](SGLANG_PATCH_INVENTORY_ZH.md) | Categorized inventory of the 785 lines of vendored SGLang patches (MUST-HAVE / WORKAROUND / EXPERIMENTAL / INSTRUMENTATION); new in this branch |
| [FAILURE_MODES_ZH.md](FAILURE_MODES_ZH.md) | Diagnosis + mitigation + real fix for the 5 failure modes (mooncake cascade / cold-D / evict storm / reseed storm / E3 invariant); new in this branch |

### 1.7 Environment

| Document | Contents |
|---|---|